
Memory capacity of neural networks with threshold and ReLU activations

Roman Vershynin Department of Mathematics, University of California, Irvine rvershyn@uci.edu
Abstract.

Overwhelming theoretical and empirical evidence shows that mildly overparametrized neural networks – those with more connections than the size of the training data – are often able to memorize the training data with 100% accuracy. This was rigorously proved for networks with sigmoid activation functions [23, 13] and, very recently, for ReLU activations [24]. Addressing a 1988 open question of Baum [6], we prove that this phenomenon holds for general multilayered perceptrons, i.e. neural networks with threshold activation functions, or with any mix of threshold and ReLU activations. Our construction is probabilistic and exploits sparsity.

Work in part supported by U.S. Air Force grant FA9550-18-1-0031

1. Introduction

This paper continues the long study of the memory capacity of neural architectures. How much information can a human brain learn? What are fundamental memory limitations, and how should the “optimal brain” be organized to achieve maximal capacity? These questions are complicated by the fact that we do not sufficiently understand the architecture of the human brain. But suppose that a neural architecture is known to us. Consider, for example, a given artificial neural network. Is there a general formula that expresses the memory capacity in terms of the network’s architecture?

1.1. Neural architectures

In this paper we study a general layered, feedforward, fully connected neural architecture with arbitrarily many layers, arbitrarily many nodes in each layer, with either threshold or ReLU activation functions between all layers, and with the threshold activation function at the output node.

Readers unfamiliar with this terminology may think of a neural architecture as a computational device that can compute certain compositions of linear and nonlinear maps. Let us describe precisely the functions computable by a neural architecture. Some of the best studied and most popular nonlinear functions $\phi : \mathbb{R}\to\mathbb{R}$, or “activation functions”, include the threshold function and the rectified linear unit (ReLU), defined by

\phi(t)=\mathbf{1}_{\{t>0\}}\quad\text{and}\quad\phi(t)=\max(0,t)=t_{+}, \qquad (1.1)

respectively. (It should be possible to extend our results to other activation functions; to keep the argument simple, we shall focus on the threshold and ReLU nonlinearities in this paper.) We call a map pseudolinear if it is a composition of an affine transformation and a nonlinear transformation $\phi$ applied coordinate-wise. Thus, $\Phi : \mathbb{R}^{n}\to\mathbb{R}^{m}$ is a pseudolinear map if it can be expressed as

\Phi(x)=\phi(Vx-b),\quad x\in\mathbb{R}^{n},

where $V$ is an $m\times n$ matrix of “weights”, $b\in\mathbb{R}^{m}$ is a vector of “biases”, and $\phi$ is either the threshold or ReLU function (1.1), applied to each coordinate of the vector $Vx-b$.

A neural architecture computes compositions of pseudolinear maps, i.e. functions $F : \mathbb{R}^{n_{1}}\to\mathbb{R}$ of the type

F=\Phi_{L}\circ\cdots\circ\Phi_{2}\circ\Phi_{1}

where $\Phi_{1} : \mathbb{R}^{n_{1}}\to\mathbb{R}^{n_{2}}$, $\Phi_{2} : \mathbb{R}^{n_{2}}\to\mathbb{R}^{n_{3}}$, …, $\Phi_{L-1} : \mathbb{R}^{n_{L-1}}\to\mathbb{R}^{n_{L}}$, $\Phi_{L} : \mathbb{R}^{n_{L}}\to\mathbb{R}$ are pseudolinear maps. Each of the maps $\Phi_{i}$ may be defined using either the threshold or the ReLU function, and mixing the two is allowed. However, for the purpose of this paper, we require the output map $\Phi_{L} : \mathbb{R}^{n_{L}}\to\mathbb{R}$ to have the threshold activation. (General neural architectures used by practitioners and considered in the literature may have more than one output node and other activation functions at the output node.)

We regard the weight matrices $V$ and bias vectors $b$ in the definition of each pseudolinear map $\Phi_{i}$ as free parameters of the given neural architecture. Varying these free parameters, one can make a given neural architecture compute different functions $F : \mathbb{R}^{n_{1}}\to\{0,1\}$. Let us denote the class of functions computable by a given architecture by

\mathcal{F}(n_{1},\ldots,n_{L},1).
Figure 1. A neural architecture with an input layer, two hidden layers, and an output node. The class of functions $F : \mathbb{R}^{3}\to\mathbb{R}$ this architecture can compute is denoted $\mathcal{F}(3,4,2,1)$.

A neural architecture can be visualized as a directed graph, which consists of $L$ layers each having $n_{i}$ nodes (or neurons), and one output node. Successive layers are connected by bipartite graphs, each of which represents a pseudolinear map $\Phi_{i}$. Each neuron is a little computational device. It sums all inputs from the neurons in the previous layer with certain weights, applies the activation function $\phi$ to the sum, and passes the output to the neurons in the next layer. More specifically, the neuron determines whether the sum of incoming signals from the previous layer exceeds a certain firing threshold $b$. If so, the neuron fires with either a signal of strength $1$ (if $\phi$ is the threshold activation function) or with strength proportional to the incoming signal (if $\phi$ is the ReLU activation).
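The following minimal NumPy sketch (ours, not from the paper) builds one function of the class $\mathcal{F}(3,4,2,1)$ described above: a composition of pseudolinear maps with a mix of ReLU and threshold activations and a threshold output. The random weights are arbitrary placeholder choices.

```python
import numpy as np

def threshold(t):
    return (t > 0).astype(float)

def relu(t):
    return np.maximum(t, 0.0)

def pseudolinear(V, b, phi):
    """Return the map x -> phi(Vx - b), with phi applied coordinate-wise."""
    return lambda x: phi(V @ x - b)

rng = np.random.default_rng(0)
# Architecture (3, 4, 2, 1): each V_i maps layer i (dim n_i) to layer i+1.
Phi1 = pseudolinear(rng.standard_normal((4, 3)), rng.standard_normal(4), relu)
Phi2 = pseudolinear(rng.standard_normal((2, 4)), rng.standard_normal(2), threshold)
Phi3 = pseudolinear(rng.standard_normal((1, 2)), rng.standard_normal(1), threshold)

def F(x):
    return Phi3(Phi2(Phi1(x)))[0]   # output is 0.0 or 1.0

print(F(np.array([0.2, -1.0, 0.5])))
```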

1.2. Memory capacity

When can a given neural architecture remember given data? Suppose, for example, that we have $K$ digital pictures of cats and dogs, encoded as vectors $x_{1},\ldots,x_{K}\in\mathbb{R}^{n_{1}}$, and labels $y_{1},\ldots,y_{K}\in\{0,1\}$, where $0$ stands for a cat and $1$ for a dog. Can we train a given neural architecture to memorize which images are cats and which are dogs? Equivalently, does there exist a function $F\in\mathcal{F}(n_{1},\ldots,n_{L},1)$ such that

F(x_{k})=y_{k}\quad\text{for all}\quad k=1,\ldots,K? \qquad (1.2)

A common belief is that this should happen for any sufficiently overparametrized network – an architecture that has significantly more free parameters than the size of the training data. The free parameters of our neural architecture are the $n_{i-1}\times n_{i}$ weight matrices $V_{i}$ and the bias vectors $b_{i}\in\mathbb{R}^{n_{i}}$. The number of biases is negligible compared to the number of weights, and the number of free parameters is approximately the same as the number of connections (we drop the number of output connections, which is negligible compared to $\overline{W}$):

\overline{W}=n_{1}n_{2}+\cdots+n_{L-1}n_{L}.

Thus, one can wonder whether a general neural architecture is able to memorize the data as long as the number of connections is bigger than the size of the data, i.e. as long as

\overline{W}\gtrsim K. \qquad (1.3)

Motivated by this question, one can define the memory capacity of a given architecture as the largest size of general data the architecture is able to memorize. In other words, the memory capacity is the largest $K$ such that for a general set of points $x_{1},\ldots,x_{K}\in\mathbb{R}^{n_{1}}$ and for any labels $y_{1},\ldots,y_{K}\in\{0,1\}$ there exists a function $F\in\mathcal{F}(n_{1},\ldots,n_{L},1)$ that satisfies (1.2). (One may sometimes wish to exclude some kinds of pathological data, so natural assumptions can be placed on the set of data points $x_{k}$; in this paper, for example, we consider unit and separated points $x_{k}$.)

The memory capacity is clearly bounded above by the VC-dimension, which is $O(\overline{W}\log\overline{W})$ for neural architectures with threshold activation functions [7] and $O(L\overline{W}\log\overline{W})$ for neural architectures with ReLU activation functions [5]. (This is because the VC-dimension is the maximal $K$ for which there exist points $x_{1},\ldots,x_{K}\in\mathbb{R}^{n_{1}}$ such that for any labels $y_{1},\ldots,y_{K}\in\{0,1\}$ there exists a function $F\in\mathcal{F}(n_{1},\ldots,n_{L},1)$ that satisfies (1.2); the memory capacity requires any general set of points $x_{1},\ldots,x_{K}\in\mathbb{R}^{n_{1}}$ to succeed as above.) Thus, our question is whether these bounds are tight – is the memory capacity (approximately) proportional to $\overline{W}$?

1.3. The history of the problem

A version of this question was raised by Baum [6] in 1988. Building on the earlier work of Cover [8], Baum studied the memory capacity of multilayer perceptrons, i.e. feedforward neural architectures with threshold activation functions. He first looked at the architecture $[n,m,1]$ with one hidden layer consisting of $m$ nodes (and, as the notation suggests, $n$ nodes in the input layer and one output node). Baum noticed that for data points $x_{k}$ in general position in $\mathbb{R}^{n}$, the memory capacity of the architecture $[n,m,1]$ is about $nm$, i.e. it is proportional to the number of connections. This is not difficult: general position guarantees that the hyperplane spanned by any subset of $n$ data points misses all other data points; this allows one to train each of the $m$ neurons in the hidden layer on its own batch of $n$ data points.

Baum then asked whether the same phenomenon persists for deeper neural networks: does there exist, for large $K$, a deep neural architecture with a total of $O(\sqrt{K})$ neurons in the hidden layers and with memory capacity at least $K$? Such a result would demonstrate the benefit of depth. Indeed, we just saw that the shallow architecture $[n,O(\sqrt{K}),1]$ has capacity just about $n\sqrt{K}$, which would be smaller than the hypothetical capacity $K$ of deeper architectures for $n\ll\sqrt{K}$.

There was no significant progress on Baum’s problem. As Mitchison and Durbin noted in 1989, “one significant difference between a single threshold unit and a multilayer network is that, in the latter case, the capacity can vary between sets of input vectors, even when the vectors are in general position” [17]. Attempting to count the different functions $F$ that a deep network can realize on a given data set $(x_{k})$, Kowalczyk wrote in 1997: “One of the complications arising here is that in contrast to the single neuron case even for perceptrons with two hidden units, the number of implementable dichotomies may be different for various $n$-tuples in general position… Extension of this result to the multilayer case is still an open problem” [15].

The memory capacity problem is more tractable for neural architectures in which the threshold activation is replaced by one of its continuous proxies such as ReLU, sigmoid, tanh, or polynomial activation functions. Such activations allow neurons to fire with variable, controllable amplitudes. Heuristically, this ability makes it possible to encode the training data very compactly into the firing amplitudes.

Yamasaki claimed without proof in 1993 that for the sigmoid activation $\phi(t)=1/(1+e^{-t})$ and for data in general position, the memory capacity of a general deep neural architecture is lower bounded by $\overline{W}$, the number of connections [23]. A version of Yamasaki’s claim was proved in 2003 by Huang for arbitrary data and neural architectures with two hidden layers [13].

In 2016, Zhang et al. [25] gave a construction of an arbitrarily large (but not fully connected) neural architecture with ReLU activations whose memory capacity is proportional to both the number of connections and the number of nodes. Hardt and Ma [12] gave a different construction of a residual network with similar properties.

Very recently, Yun et al. [24] removed the requirement that there be more nodes than data, showing that the memory capacity of networks with ReLU and tanh activation functions is proportional to the number of connections. Ge et al. [11] proved a similar result for polynomial activations.

Significant effort was made in the last two years to explain why, for overparametrized networks, gradient descent and its variants can achieve 100% accuracy on the training data [10, 9, 16, 26, 1, 14, 27, 18, 20, 2, panigrahi2019effect]; see [21] for a survey of related developments.

1.4. Main result

Meanwhile, the original problem of Baum [6] – determining the memory capacity of networks with threshold activations – has remained open. In contrast to neurons with continuous activation functions, neurons with threshold activations either do not fire at all or fire with the same unit amplitude. The strength of the incoming signal is lost when transmitted through such neurons, and it is not clear how the data can be encoded.

This is what makes Baum’s question hard. In this paper, we (almost) give a positive answer to this question.

Why “almost”? First, the size of the input layer $n_{1}$ should not affect the capacity bound and should be excluded from the count of the free parameters $\overline{W}$. To see this, consider, for example, data points $x_{k}\in\mathbb{R}^{n_{1}}$ all lying on one line; with respect to such data, the network is equivalent to one with $n_{1}=1$. Next, ultra-narrow bottlenecks should be excluded, at least for the threshold nonlinearity: for example, any layer with just $n_{i}=1$ node makes the number of connections in the subsequent layers irrelevant as free parameters.

In our actual result, we make somewhat stronger assumptions: in counting connections, we exclude not only the first layer but also the second; we rule out all exponentially narrow bottlenecks (not just those of size one); we assume that the data points $x_{k}$ are unit and separated; finally, we allow logarithmic factors.

Theorem 1.1.

Let $n_{1},\ldots,n_{L}$ be positive integers, and set $n_{0}:=\min(n_{2},\ldots,n_{L})$ and $n_{\infty}:=\max(n_{2},\ldots,n_{L})$. Consider unit vectors $x_{1},\ldots,x_{K}\in\mathbb{R}^{n_{1}}$ that satisfy

\|x_{i}-x_{j}\|_{2}\geq C\sqrt{\frac{\log\log n_{\infty}}{\log n_{0}}}. \qquad (1.4)

Consider any labels $y_{1},\ldots,y_{K}\in\{0,1\}$. Assume that the number of deep connections $W:=n_{3}n_{4}+\cdots+n_{L-1}n_{L}$ satisfies

W\geq CK\log^{5}K, \qquad (1.5)

as well as $K\leq\exp(cn_{0}^{1/5})$ and $n_{\infty}\leq\exp(cn_{0}^{1/5})$. Then the network can memorize the label assignment $x_{k}\to y_{k}$ exactly, i.e. there exists a map $F\in\mathcal{F}(n_{1},\ldots,n_{L},1)$ such that

F(x_{k})=y_{k}\quad\text{for all}\quad k=1,\ldots,K. \qquad (1.6)

Here $C$ and $c$ denote certain positive absolute constants.

In short, Theorem 1.1 states that the memory capacity of a general neural architecture with threshold or ReLU activations (or a mix thereof) is lower bounded by the number of deep connections. This bound is independent of the depth, bottlenecks (up to exponentially narrow ones), or any other architectural details.

1.5. Should the data be separated?

One can wonder about the necessity of the separation assumption (1.4). Can we just assume that the $x_{k}$ are distinct? While this is true for ReLU and tanh activations [24], it is false for threshold activations. A moment’s thought reveals that any pseudolinear map from $\mathbb{R}$ to $\mathbb{R}^{m}$ transforms any line into a finite set of cardinality $O(m)$. Thus, by the pigeonhole principle, any map from layer $1$ to layer $2$ is non-injective on the set of $K$ data points $x_{k}$ – which makes it impossible to memorize some label assignments – unless $K=O(n_{2})$. In other words, if we just assume that the data points $x_{k}$ are distinct, the network must have at least as many nodes in the second layer as there are data points. Still, the separation assumption (1.4) does not look tight and might be weakened.

1.6. Related notions of capacity

Instead of requiring the network to memorize the training data with 100% accuracy as in Theorem 1.1, one can ask it to memorize just a $1-\varepsilon$ fraction, or just half, of the training data correctly. This corresponds to a relaxed, or fractional, memory capacity of neural architectures, which was introduced by Cover in 1965 [8] and studied extensively afterwards.

To estimate the fractional capacity of a given architecture, one needs to count all functions $F$ this architecture can realize on a given finite set of points $x_{k}$. When this set is the Boolean cube $\{0,1\}^{n}$, this amounts to counting all Boolean functions $F : \{0,1\}^{n}\to\{0,1\}$ the architecture can realize. The binary logarithm of the number of all such Boolean functions was called the (expressive) capacity by Baldi and the author [3, 4]. For a neural architecture with all threshold activations and $L$ layers, the expressive capacity is equivalent to the cubic polynomial in the sizes of the layers $n_{i}$:

\sum_{i=1}^{L-1}\min(n_{1},\ldots,n_{i})\,n_{i}n_{i+1},

up to an absolute constant factor [4]. The factor $\min(n_{1},\ldots,n_{i})$ quantifies the effect of any bottlenecks that occur before layer $i$.
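For instance, a quick back-of-the-envelope illustration of our own (ignoring the absolute constant factor and treating the output node as the last layer): for the architecture of Figure 1 with layer sizes $(3,4,2,1)$,

\min(3)\cdot 3\cdot 4+\min(3,4)\cdot 4\cdot 2+\min(3,4,2)\cdot 2\cdot 1=36+24+4=64,

so this small network can realize on the order of $2^{64}$ Boolean functions in the sense of [3, 4].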

Similar results can be proved for the restricted expressive capacity, where we count the functions $F$ the architecture can realize on a given finite set of $K$ points $x_{k}$ [4, Section 10.5]. Ideally, one might hope to find that all $2^{K}$ functions can be realized on a general $K$-element set, which would imply that the memory capacity is at least $K$. However, the current results on restricted expressive capacity are not tight enough to reach such conclusions.

2. The method

Our construction of the function $F$ in Theorem 1.1 is probabilistic. Let us first illustrate our approach for the architecture $[n,n,n,1]$ with two hidden layers, and with threshold activations throughout. We would like to find a composition of pseudolinear functions

F : \mathbb{R}^{n}\xrightarrow{\Phi_{1}}\mathbb{R}^{n}\xrightarrow{\Phi_{2}}\mathbb{R}^{n}\xrightarrow{\Psi}\{0,1\}

that fits the given data $(x_{k},y_{k})$ as in (1.6).

The first two maps $\Phi_{1}$ and $\Phi_{2}$ are enrichment maps whose only purpose is to spread the data $x_{k}$ in the space, transforming it into an almost orthogonal set. Specifically, $\Phi_{1}$ will transform the separated points $x_{k}$ into $o(1)$-orthogonal points $u_{k}$ (Theorem 5.2), $\Phi_{2}$ will transform the $o(1)$-orthogonal points $u_{k}$ into $O(1/\sqrt{n})$-orthogonal points $v_{k}$ (Theorem 5.4), and, finally, the perception map $\Psi$ will fit the data: $\Psi(v_{k})=y_{k}$.

2.1. Enrichment

Our construction of the enrichment maps $\Phi_{1}$ and $\Phi_{2}$ exploits sparsity. Both maps will have the form

\Phi(x)=\phi(Gx-\bar{b})=\big(\mathbf{1}_{\{\langle g_{i},x\rangle>b\}}\big)_{i=1}^{n}

where $G$ is an $n\times n$ Gaussian random matrix with i.i.d. $N(0,1)$ entries, $g_{i}\sim N(0,I_{n})$ are its rows (independent standard normal random vectors), and $\bar{b}$ is the vector all of whose coordinates equal some value $b>0$.

If $b$ is large, $\Phi(x)$ is a sparse random vector with i.i.d. Bernoulli coordinates. A key heuristic is that independent sparse random vectors are almost orthogonal. Indeed, if $u$ and $u^{\prime}$ are independent random vectors in $\mathbb{R}^{n}$ all of whose coordinates are Bernoulli($p$), then $\operatorname{\mathbb{E}}\langle u,u^{\prime}\rangle=np^{2}$ while $\operatorname{\mathbb{E}}\|u\|_{2}^{2}=\operatorname{\mathbb{E}}\|u^{\prime}\|_{2}^{2}=np$, so we should expect

\frac{\langle u,u^{\prime}\rangle}{\|u\|_{2}\|u^{\prime}\|_{2}}\sim p,

making $u$ and $u^{\prime}$ almost orthogonal for small $p$.
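This heuristic is easy to check numerically; the following short sketch (ours, with ad hoc parameters) draws two independent Bernoulli($p$) vectors and reports their cosine similarity, which comes out close to $p$.

```python
import numpy as np

# Two independent Bernoulli(p) vectors in R^n have cosine similarity ~ p.
rng = np.random.default_rng(1)
n, p = 100_000, 0.01
u = (rng.random(n) < p).astype(float)
v = (rng.random(n) < p).astype(float)
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"cosine similarity ~ {cosine:.4f}, p = {p}")
```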

Unfortunately, the sparse random vectors $\Phi(x)$ and $\Phi(x^{\prime})$ are not independent unless $x$ and $x^{\prime}$ are exactly orthogonal. Nevertheless, our heuristic that sparsity induces orthogonality still works in this setting. To see this, let us check that the correlation of the coordinates of $\Phi(x)$ and $\Phi(x^{\prime})$ is small even if $x$ and $x^{\prime}$ are far from being orthogonal. A standard asymptotic analysis of the tails of the normal distribution implies that

\operatorname{\mathbb{E}}\Phi(x)_{i}\Phi(x^{\prime})_{i}=\mathbb{P}\big\{\langle g,x\rangle>b,\,\langle g,x^{\prime}\rangle>b\big\}\leq 2\exp(-b^{2}\delta^{2}/8)\,\mathbb{P}\big\{\langle g,x\rangle>b\big\} \qquad (2.1)

if $x$ and $x^{\prime}$ are unit and $\delta$-separated (Proposition 3.1), and

\operatorname{\mathbb{E}}\Phi(x)_{i}\Phi(x^{\prime})_{i}=\mathbb{P}\big\{\langle g,x\rangle>b,\,\langle g,x^{\prime}\rangle>b\big\}\leq 2\exp(2b^{2}\varepsilon)\,\big(\mathbb{P}\big\{\langle g,x\rangle>b\big\}\big)^{2} \qquad (2.2)

if $x$ and $x^{\prime}$ are unit and $\varepsilon$-orthogonal (Proposition 3.2).

Now we choose $b$ so that the coordinates of $\Phi(x)$ and $\Phi(x^{\prime})$ are sparse enough, i.e.

\operatorname{\mathbb{E}}\Phi(x)_{i}=\mathbb{P}\big\{\langle g,x\rangle>b\big\}=\frac{1}{\sqrt{n}}=:p;

thus $b\sim\sqrt{\log n}$. Choose $\varepsilon$ sufficiently small to make the factor $\exp(2b^{2}\varepsilon)$ in (2.2) nearly constant, i.e.

\varepsilon\sim\frac{1}{b^{2}}\sim\frac{1}{\log n}.

Finally, we choose the separation threshold $\delta$ sufficiently large to make the factor $2\exp(-b^{2}\delta^{2}/8)$ in (2.1) less than $\varepsilon$, i.e.

\delta\sim\sqrt{\frac{\log(1/\varepsilon)}{\log n}}\sim\sqrt{\frac{\log\log n}{\log n}};

this explains the form of the separation condition in Theorem 1.1.
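To make these scales concrete, here is a rough numerical illustration of our own (ignoring absolute constants): for $n=10^{6}$,

p=\frac{1}{\sqrt{n}}=10^{-3},\qquad b\approx\sqrt{2\log(1/p)}\approx 3.7,\qquad\varepsilon\sim\frac{1}{b^{2}}\approx 0.07,\qquad\delta\sim\sqrt{\frac{\log(1/\varepsilon)}{\log n}}\approx 0.4.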

With these choices, (2.1) gives

\operatorname{\mathbb{E}}\Phi(x)_{i}\Phi(x^{\prime})_{i}\leq\varepsilon p

confirming our claim that $\Phi(x)$ and $\Phi(x^{\prime})$ tend to be $\varepsilon$-orthogonal provided that $x$ and $x^{\prime}$ are $\delta$-separated. Similarly, (2.2) gives

\operatorname{\mathbb{E}}\Phi(x)_{i}\Phi(x^{\prime})_{i}\lesssim p^{2}

confirming our claim that $\Phi(x)$ and $\Phi(x^{\prime})$ tend to be $(p=1/\sqrt{n})$-orthogonal provided that $x$ and $x^{\prime}$ are $\varepsilon$-orthogonal.
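The following sketch (ours; the dimensions and biases are ad hoc demo choices, not the parameters used in the proofs) simulates two rounds of the random threshold map $\Phi(x)=\mathbf{1}_{\{Gx>b\}}$ on a few separated unit vectors and reports the largest pairwise cosine at each stage; typically it decreases after each enrichment step.

```python
import numpy as np

rng = np.random.default_rng(2)

def enrich(X, m, b):
    """Random threshold layer x -> 1{Gx > b}, applied to the columns of X."""
    G = rng.standard_normal((m, X.shape[0]))
    return (G @ X > b).astype(float)

def max_offdiag_cosine(X):
    Xn = X / np.linalg.norm(X, axis=0)
    C = np.abs(Xn.T @ Xn)
    np.fill_diagonal(C, 0.0)
    return C.max()

# A few separated (but far from orthogonal) unit vectors in R^3, as columns.
X = rng.standard_normal((3, 5))
X /= np.linalg.norm(X, axis=0)

m, b = 3000, 2.0                         # ad hoc demo parameters
U = enrich(X, m, b)                      # first enrichment: sparse 0/1 vectors
U /= np.linalg.norm(U, axis=0)           # normalize, as with u_k = Phi(x_k)/sqrt(mp)
V = enrich(U, m, b)                      # second enrichment

print("max |cos|, input           :", round(max_offdiag_cosine(X), 2))
print("max |cos|, after 1st layer :", round(max_offdiag_cosine(U), 2))
print("max |cos|, after 2nd layer :", round(max_offdiag_cosine(V), 2))
```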

2.2. Perception

As we just saw, the enrichment process transforms our data points $x_{k}$ into $O(1/\sqrt{n})$-orthogonal vectors $v_{k}$. Let us now find a perception map $\Psi$ that can fit the labels to the data: $\Psi(v_{k})=y_{k}$.

Consider the random vector

w:=\sum_{i=1}^{K}\pm y_{i}v_{i}

where the signs are chosen independently with probability $1/2$ each. Then, separating the $k$-th term from the sum defining $w$ and assuming for simplicity that the $v_{k}$ are unit vectors, we get

\langle w,v_{k}\rangle=\pm y_{k}+\sum_{i:\,i\neq k}\pm y_{i}\langle v_{i},v_{k}\rangle=:\pm y_{k}+\textrm{noise}.

Taking the expectation over independent signs, we see that

\operatorname{\mathbb{E}}(\textrm{noise})^{2}=\sum_{i:\,i\neq k}y_{i}^{2}\langle v_{i},v_{k}\rangle^{2}

where $y_{i}^{2}\in\{0,1\}$ and $\langle v_{i},v_{k}\rangle^{2}=O(1/n)$ by almost orthogonality. Hence

\operatorname{\mathbb{E}}(\textrm{noise})^{2}\lesssim K/n=o(1)

if $K\ll n$. This yields $\langle w,v_{k}\rangle=\pm y_{k}+o(1)$, or

|\langle w,v_{k}\rangle|=y_{k}+o(1).

Since $y_{k}\in\{0,1\}$, the “mirror perceptron” (which requires not one but two neurons to implement, which is not a problem for us)

\Psi(v):=\mathbf{1}_{\{\langle w,v\rangle>1/2\}}+\mathbf{1}_{\{-\langle w,v\rangle>1/2\}}

fits the data exactly: $\Psi(v_{k})=y_{k}$.
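The following sketch (ours, with ad hoc parameters) checks this construction numerically: for random, approximately orthogonal unit vectors $v_k$ and random labels, the mirror perceptron built from $w=\sum_k\pm y_k v_k$ memorizes all labels.

```python
import numpy as np

# Illustrative check of the "mirror perceptron" construction above (a sketch,
# not the paper's proof): K << n makes the v_k roughly O(1/sqrt(n))-orthogonal.
rng = np.random.default_rng(3)
n, K = 10_000, 200
V = rng.standard_normal((K, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # rows v_k, nearly orthogonal
y = rng.integers(0, 2, size=K)
signs = rng.choice([-1.0, 1.0], size=K)

w = (signs * y) @ V                             # w = sum_k (+/-) y_k v_k

def Psi(v):
    s = w @ v
    return int(s > 0.5) + int(-s > 0.5)         # two threshold units

predictions = np.array([Psi(v) for v in V])
print("all labels memorized:", bool(np.all(predictions == y)))
```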

2.3. Deeper networks

The same argument can be repeated for networks with variable sizes of layers, i.e. $[n,m,d,1]$. Interestingly, the enrichment works fine even if $n\ll m\ll d$, making the lower-dimensional data almost orthogonal even in very high dimensions. This explains why (moderate) bottlenecks – narrow layers – do not restrict memory capacity.

The argument we outlined allows the network $[n,m,d,1]$ to fit around $d$ data points, which is not very surprising, since we expect the memory capacity to be proportional to the number of connections and not the number of nodes. However, the power of enrichment allows us to boost the capacity using the standard method of batch learning (or distributed learning).

Let us show, for example, how the network $[n,m,d,r,1]$ with three hidden layers can fit $K\sim dr$ data points $(x_{k},y_{k})$. Partition the data into $r$ batches, each having $O(d)$ data points. Use our previous result to train each of the $r$ neurons in the fourth layer on its own batch of $O(d)$ points, while zeroing out all labels outside that batch. Then simply sum up the results, as in the sketch below. (The details are found in Theorem 7.1.)
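A minimal simulation of this batch construction (ours; it uses nearly orthogonal random vectors in place of enriched data, and ad hoc parameters): one mirror-perceptron pair per batch, labels zeroed out outside the batch, outputs summed.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, r = 20_000, 100, 5
K = d * r
V = rng.standard_normal((K, n)); V /= np.linalg.norm(V, axis=1, keepdims=True)
y = rng.integers(0, 2, size=K)

W = []                                            # one mirror-perceptron vector per batch
for j in range(r):
    mask = np.zeros(K); mask[j*d:(j+1)*d] = 1.0   # zero out labels outside batch j
    signs = rng.choice([-1.0, 1.0], size=K)
    W.append((signs * y * mask) @ V)

def F(v):
    return sum(int(w @ v > 0.5) + int(-(w @ v) > 0.5) for w in W)

print("all K labels memorized:", all(F(v) == y[k] for k, v in enumerate(V)))
```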

This result can be extended to deeper networks using stacking, or unrolling a shallow architecture into a deep architecture, thereby trading width for depth. Figure 2 gives an illustration of stacking, and Theorem 7.2 provides the details. A similar stacking construction was employed in [4].

The success of stacking indicates that depth has no benefit for memorization purposes: a shallow architecture $[n,m,d,r,1]$ can memorize roughly as much data as any deep architecture with the same number of connections. It should be noted, however, that training algorithms commonly used by practitioners, i.e. variants of stochastic gradient descent, do not seem to lead to anything similar to stacking; this leaves the question of the benefit of depth open.

2.4. Neural networks as preconditioners

As we explained in Section 2.1, the first two hidden layers of the network act as preconditioners: they transform the input vectors $x_{i}$ into vectors $v_{i}$ that are almost orthogonal. Almost orthogonality facilitates the memorization process in the deeper layers, as we saw in Section 2.2.

The idea of keeping the data well-conditioned as it passes through the network is not new. The learning rate of stochastic gradient descent (which we are not dealing with here) is related to how well-conditioned the so-called gradient Gram matrix $H$ is. In the simplest scenario, where the activation is ReLU and the network has one hidden layer of infinite size, $H$ is the $K\times K$ matrix with entries

H_{ij}=\operatorname{\mathbb{E}}\langle x_{i},x_{j}\rangle\mathbf{1}_{\{\langle g,x_{i}\rangle>0,\,\langle g,x_{j}\rangle>0\}},\quad\text{where}\quad g\sim N(0,I_{n_{1}}).

Much effort was made recently to prove that $H$ is well-conditioned, i.e. that its smallest singular value is bounded away from zero, since this can be used to establish a good convergence rate for stochastic gradient descent; see the papers cited in Section 1.3 and especially [1, 10, 9, panigrahi2019effect]. However, existing results that prove that $H$ is well-conditioned only hold for very overparametrized networks, requiring at least $n_{1}^{4}$ nodes in the hidden layer [panigrahi2019effect]. This is a much stronger overparametrization requirement than in our Theorem 1.1.
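For readers who want to experiment, an entry of $H$ is easy to estimate by Monte Carlo; the following sketch (ours, with ad hoc sample sizes) does this directly from the definition above.

```python
import numpy as np

# Monte Carlo estimate of H_ij = E <x_i, x_j> 1{<g,x_i> > 0, <g,x_j> > 0}.
rng = np.random.default_rng(5)
n1, samples = 50, 200_000
xi, xj = rng.standard_normal(n1), rng.standard_normal(n1)
xi /= np.linalg.norm(xi); xj /= np.linalg.norm(xj)

G = rng.standard_normal((samples, n1))            # rows play the role of g
indicator = ((G @ xi > 0) & (G @ xj > 0)).mean()
H_ij = (xi @ xj) * indicator
print("estimated H_ij:", round(H_ij, 4))
```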

On the other hand, as opposed to many of the results quoted above, Theorem 1.1 does not shed any light on the behavior of stochastic gradient descent, the most popular method for training deep networks. Instead of training the weights, we explicitly compute them from the data. This allows us to avoid dealing with the gradient Gram matrix: our enrichment method provides an explicit way to make the data well conditioned. This is achieved by setting the biases high enough to enforce sparsity. It would be interesting to see if similar preconditioning guarantees can be achieved with small (or even zero) biases and thus without exploiting sparsity.

A different form of enrichment was developed recently in the paper [4], which showed that a neural network can compute a lot of different Boolean functions. Toward this goal, an enrichment map was implemented in the first hidden layer. The objective of this map is to transform the input set (the Boolean cube $\{0,1\}^{n}$) into a set $S\subset\{0,1\}^{m}$ on which there are lots of different threshold functions – so that the next layers can automatically compute lots of different Boolean functions. While the general goal of this enrichment map in [4] is the same as in the present paper – to achieve a more robust data representation that is passed to deeper layers – the constructions of these two enrichment maps are quite different.

2.5. Further observations and questions

As we saw in the previous section, we utilized the first two hidden layers of the network to preprocess, or enrich, the data vectors $x_{k}$. This made us skip the sizes of the first two layers when we counted the number of connections $W$. If these vectors are already nice, no enrichment may be necessary, and we have a higher memory capacity.

Suppose, for example, that the data vectors $x_{k}$ in Theorem 1.1 are $O(1/\sqrt{n_{\infty}})$-orthogonal. Then, since no enrichment is needed in this case, the conclusion of the theorem holds with

W=n_{1}n_{2}+\cdots+n_{L-1}n_{L},

which is the sum of all connections in the network.

If, on the other hand, the data vectors $x_{k}$ in Theorem 1.1 are only $O(1/\sqrt{\log n_{\infty}})$-orthogonal, just the second enrichment is needed, and so the conclusion of the theorem holds with

W=n_{2}n_{3}+\cdots+n_{L-1}n_{L},

which is the sum of all connections between the non-input layers.

This makes us wonder: can enrichment always be achieved in one step instead of two? Can one find a pseudolinear map $\Phi : \mathbb{R}^{n}\to\mathbb{R}^{n}$ that transforms a given set of $\delta$-separated vectors $x_{k}$ (say, for $\delta=0.01$) into a set of $O(1/\sqrt{n})$-orthogonal vectors? If this is possible, we would not need to exclude the second layer from the parameter count, and Theorem 1.1 would hold for $W:=n_{2}n_{3}+\cdots+n_{L-1}n_{L}$.

A related question for further study is to find an optimal separation threshold $\delta$ in the assumption $\|x_{i}-x_{j}\|_{2}\geq\delta$ in Theorem 1.1, and to remove the assumption that the $x_{k}$ be unit vectors. Both the logarithmic separation level of $\delta$ and the normalization requirement could be artifacts of the enrichment scheme we used.

There are several ways Theorem 1.1 could be extended. It should not be too difficult, for example, to allow the output layer to have more than one node; such multi-output networks are used in classification problems with multiple classes.

Finally, it should be possible to extend the analysis to completely general activation functions. The threshold activations we treated are conceptually the hardest case, since they act as extreme quantizers that restrict the flow of information through the network in the most dramatic way.

2.6. The rest of the paper

In Section 3 we prove the bounds (2.1) and (2.2), which control $\operatorname{\mathbb{E}}\Phi(x)_{i}\Phi(x^{\prime})_{i}$, the correlation of the coordinates of $\Phi(x)$ and $\Phi(x^{\prime})$. This immediately controls the expected inner product $\operatorname{\mathbb{E}}\langle\Phi(x),\Phi(x^{\prime})\rangle=\sum_{i}\operatorname{\mathbb{E}}\Phi(x)_{i}\Phi(x^{\prime})_{i}$. In Section 4 we develop a deviation inequality to make sure that the inner product $\langle\Phi(x),\Phi(x^{\prime})\rangle$ is close to its expectation with high probability. In Section 5, we take a union bound over all data points $x_{k}$ and thus control all inner products $\langle\Phi(x_{i}),\Phi(x_{j})\rangle$ simultaneously. This demonstrates how enrichment maps $\Phi$ make the data almost orthogonal – the property we outlined in Section 2.1. In Section 6, we construct a random perception map $\Psi$ as outlined in Section 2.2. We combine enrichment and perception in Section 7 as outlined in Section 2.3. We first prove a version of our main result for networks with three hidden layers (Theorem 7.1); then we stack shallow networks into an arbitrarily deep architecture, proving the full version of our main result in Theorem 7.2.

In the rest of the paper, positive absolute constants will be denoted $C,c,C_{1},c_{1}$, etc. The notation $f(x)\lesssim g(x)$ means that $f(x)\leq Cg(x)$ for some absolute constant $C$ and for all values of the parameter $x$. Similarly, $f(x)\asymp g(x)$ means that $cg(x)\leq f(x)\leq Cg(x)$, where $c$ is another positive absolute constant.

We call a map $E$ almost pseudolinear if $E(x)=\lambda\Phi(x)$ for some nonnegative constant $\lambda$ and some pseudolinear map $\Phi$. For the ReLU nonlinearity, almost pseudolinear maps are automatically pseudolinear, but for the threshold nonlinearity this is not necessarily the case.

3. Correlation decay

Let $g\sim N(0,I_{n})$ and consider the random process

Z_{x}:=\phi(\langle g,x\rangle-b) \qquad (3.1)

which is indexed by points $x$ on the unit Euclidean sphere in $\mathbb{R}^{n}$. Here $\phi$ can be either the threshold or the ReLU nonlinearity as in (1.1), and $b\in\mathbb{R}$ is a fixed value. Due to rotation invariance, the correlation of $Z_{x}$ and $Z_{x^{\prime}}$ only depends on the distance between $x$ and $x^{\prime}$. Although it seems difficult to compute this dependence exactly, we will demonstrate that the correlation of $Z_{x}$ and $Z_{x^{\prime}}$ decays rapidly in $b$. We will prove this in two extreme regimes – where $x$ and $x^{\prime}$ are just a little separated, and where $x$ and $x^{\prime}$ are almost orthogonal.

3.1. Correlation for separated vectors

The Cauchy–Schwarz inequality gives the trivial bound

\operatorname{\mathbb{E}}Z_{x}Z_{x^{\prime}}\leq\operatorname{\mathbb{E}}Z_{x}^{2}

with equality when $x=x^{\prime}$. Our first result shows that if the vectors $x$ and $x^{\prime}$ are $\delta$-separated, this bound can be dramatically improved, and we have

\operatorname{\mathbb{E}}Z_{x}Z_{x^{\prime}}\leq 2\exp(-b^{2}\delta^{2}/8)\,\operatorname{\mathbb{E}}Z_{x}^{2}.
Proposition 3.1 (Correlation for separated vectors).

Consider a pair of unit vectors $x,x^{\prime}\in\mathbb{R}^{n}$, and let $b\in\mathbb{R}$ be a number that is larger than a certain absolute constant. Then

\operatorname{\mathbb{E}}\phi\big(\langle g,x\rangle-b\big)\,\phi\big(\langle g,x^{\prime}\rangle-b\big)\leq 2\exp\bigg(-\frac{b^{2}\|x-x^{\prime}\|_{2}^{2}}{8}\bigg)\,\operatorname{\mathbb{E}}\phi(\gamma-b)^{2}

where $g\sim N(0,I_{n})$ and $\gamma\sim N(0,1)$.

Proof.

Step 1. Orthogonal decomposition. Consider the vectors

u:=\frac{x+x^{\prime}}{2},\quad v:=\frac{x-x^{\prime}}{2}.

Then $u$ and $v$ are orthogonal and $x=u+v$, $x^{\prime}=u-v$. We claim that

\phi\big(\langle z,x\rangle-b\big)\,\phi\big(\langle z,x^{\prime}\rangle-b\big)\leq\big(\phi\big(\langle z,u\rangle-b\big)\big)^{2}\quad\text{for any }z\in\mathbb{R}^{n}. \qquad (3.2)

To check this claim, note that if both $\langle z,x\rangle$ and $\langle z,x^{\prime}\rangle$ are greater than $b$, then so is $\langle z,u\rangle$. Expressing this implication as

\mathbf{1}_{\langle z,x\rangle>b}\,\mathbf{1}_{\langle z,x^{\prime}\rangle>b}\leq\mathbf{1}_{\langle z,u\rangle>b}, \qquad (3.3)

we conclude that (3.2) holds for the threshold nonlinearity $\phi(t)=\mathbf{1}_{\{t>0\}}$.

To prove (3.2) for the ReLU nonlinearity, note that

\big(\langle z,x\rangle-b\big)\,\big(\langle z,x^{\prime}\rangle-b\big)=\big(\langle z,u+v\rangle-b\big)\,\big(\langle z,u-v\rangle-b\big)=\big(\langle z,u\rangle-b\big)^{2}-\langle z,v\rangle^{2}\leq\big(\langle z,u\rangle-b\big)^{2}.

Combine this bound with (3.3) to get

\big(\langle z,x\rangle-b\big)\,\big(\langle z,x^{\prime}\rangle-b\big)\,\mathbf{1}_{\langle z,x\rangle>b}\,\mathbf{1}_{\langle z,x^{\prime}\rangle>b}\leq\big(\langle z,u\rangle-b\big)^{2}\,\mathbf{1}_{\langle z,u\rangle>b}.

This yields (3.2) for the ReLU nonlinearity $\phi(t)=t\,\mathbf{1}_{\{t>0\}}$.

Step 2. Taking expectation. Substitute $z=g\sim N(0,I_{n})$ into the bound (3.2) and take expectation on both sides. We get

\operatorname{\mathbb{E}}\phi\big(\langle g,x\rangle-b\big)\,\phi\big(\langle g,x^{\prime}\rangle-b\big)\leq\operatorname{\mathbb{E}}\phi\big(\langle g,u\rangle-b\big)^{2}. \qquad (3.4)

Denote

\delta:=\|v\|_{2}=\frac{\|x-x^{\prime}\|_{2}}{2}.

Since $x=u+v$ is a unit vector and $u,v$ are orthogonal, we have $1=\|u\|_{2}^{2}+\|v\|_{2}^{2}$ and thus $\|u\|_{2}=\sqrt{1-\delta^{2}}$. Therefore, the random variable $\langle g,u\rangle$ on the right side of (3.4) is distributed identically with $\gamma\sqrt{1-\delta^{2}}$ where $\gamma\sim N(0,1)$, and we obtain

\operatorname{\mathbb{E}}\phi\big(\langle g,x\rangle-b\big)\,\phi\big(\langle g,x^{\prime}\rangle-b\big)\leq\operatorname{\mathbb{E}}\phi\big(\gamma\sqrt{1-\delta^{2}}-b\big)^{2}. \qquad (3.5)

Step 3. Stability. Now use the stability property of the normal distribution, which we state in Lemma A.2. For $a=b$ larger than a suitable absolute constant, $z=-\delta^{2}$, and for either the threshold or ReLU nonlinearity $\phi$, we see that

\frac{\operatorname{\mathbb{E}}\phi(\gamma\sqrt{1-\delta^{2}}-b)^{2}}{\operatorname{\mathbb{E}}\phi(\gamma-b)^{2}}\leq 2\exp\bigg(-\frac{b^{2}\delta^{2}}{2(1-\delta^{2})}\bigg)(1-\delta^{2})^{3/2}\leq 2\exp\bigg(-\frac{b^{2}\delta^{2}}{2}\bigg).

Combine this with (3.5) to complete the proof. ∎
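As a sanity check, Proposition 3.1 is easy to test by simulation. The sketch below (ours; the dimension, bias and sample size are ad hoc, and $b$ is only moderately large) estimates the left-hand side for the threshold nonlinearity and compares it to the bound.

```python
import numpy as np
from math import erfc, exp, sqrt

# Monte Carlo check of Proposition 3.1 for the threshold nonlinearity.
rng = np.random.default_rng(6)
n, b, samples = 50, 2.0, 200_000
x = rng.standard_normal(n);  x /= np.linalg.norm(x)
xp = rng.standard_normal(n); xp /= np.linalg.norm(xp)
delta = np.linalg.norm(x - xp)

G = rng.standard_normal((samples, n))                 # rows play the role of g
lhs = ((G @ x > b) & (G @ xp > b)).mean()             # E phi(<g,x>-b) phi(<g,x'>-b)
tail = 0.5 * erfc(b / sqrt(2.0))                      # E phi(gamma-b)^2 = P{gamma > b}
rhs = 2 * exp(-b**2 * delta**2 / 8) * tail
print(f"lhs = {lhs:.2e}  <=  rhs = {rhs:.2e}")
```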

3.2. Correlation for almost orthogonal vectors

We continue to study the covariance structure of the random process $Z_{x}$. If $x$ and $x^{\prime}$ are orthogonal, $Z_{x}$ and $Z_{x^{\prime}}$ are independent and we have

\operatorname{\mathbb{E}}Z_{x}Z_{x^{\prime}}=\big[\operatorname{\mathbb{E}}Z_{x}\big]^{2}.

In this subsection, we show the stability of this equality. The result below implies that if $x$ and $x^{\prime}$ are almost orthogonal, namely $|\langle x,x^{\prime}\rangle|\ll b^{-2}$, then

\operatorname{\mathbb{E}}Z_{x}Z_{x^{\prime}}\lesssim\big[\operatorname{\mathbb{E}}Z_{x}\big]^{2}.

Proposition 3.2 (Correlation for $\varepsilon$-orthogonal vectors).

Consider a pair of vectors $u,u^{\prime}\in\mathbb{R}^{m}$ satisfying

\big|\|u\|_{2}^{2}-1\big|\leq\varepsilon,\quad\big|\|u^{\prime}\|_{2}^{2}-1\big|\leq\varepsilon,\quad\big|\langle u,u^{\prime}\rangle\big|\leq\varepsilon

for some $\varepsilon\in(0,1/8)$. Let $b\in\mathbb{R}$ be a number that is larger than a certain absolute constant. Then

\operatorname{\mathbb{E}}\phi\big(\langle g,u\rangle-b\big)^{2}\geq\frac{1}{2}\exp(-b^{2}\varepsilon)\,\operatorname{\mathbb{E}}\phi(\gamma-b)^{2};
\operatorname{\mathbb{E}}\phi\big(\langle g,u\rangle-b\big)\,\phi\big(\langle g,u^{\prime}\rangle-b\big)\leq 2\exp(2b^{2}\varepsilon)\,\big[\operatorname{\mathbb{E}}\phi(\gamma-b)\big]^{2}

where $g\sim N(0,I_{m})$ and $\gamma\sim N(0,1)$.

In order to prove this proposition, we first establish a more general stability property:

Lemma 3.3.

Let $\varepsilon\in(0,1/2)$ and let $u$, $u^{\prime}$, $g$, and $\gamma$ be as in Proposition 3.2. Then, for any measurable function $\psi : \mathbb{R}\to[0,\infty)$ we have

\operatorname{\mathbb{E}}\psi(\langle g,u\rangle)\,\psi(\langle g,u^{\prime}\rangle)\leq\sqrt{\frac{1+2\varepsilon}{1-2\varepsilon}}\;\Big[\operatorname{\mathbb{E}}\psi\big(\gamma\sqrt{1+2\varepsilon}\big)\Big]^{2}.
Proof.

Consider the $2\times m$ matrix $A$ whose rows are $u$ and $u^{\prime}$, and define the function

\Psi : \mathbb{R}^{2}\to\mathbb{R},\quad\Psi(x):=\psi(x_{1})\,\psi(x_{2}).

Since the vector $Ag$ has coordinates $\langle g,u\rangle$ and $\langle g,u^{\prime}\rangle$, we have $\Psi(Ag)=\psi(\langle g,u\rangle)\,\psi(\langle g,u^{\prime}\rangle)$. Thus

\operatorname{\mathbb{E}}\psi(\langle g,u\rangle)\,\psi(\langle g,u^{\prime}\rangle)=\operatorname{\mathbb{E}}\Psi(Ag)=\frac{1}{2\pi\sqrt{\det(\Sigma)}}\int_{\mathbb{R}^{2}}\Psi(x)\exp\bigg(-\frac{x^{\mathsf{T}}\Sigma^{-1}x}{2}\bigg)\,dx \qquad (3.6)

where

\Sigma=\operatorname{Cov}(Ag)=AA^{\mathsf{T}}=\begin{bmatrix}\|u\|_{2}^{2}&\langle u,u^{\prime}\rangle\\ \langle u,u^{\prime}\rangle&\|u^{\prime}\|_{2}^{2}\end{bmatrix}.

The assumptions on $u,u^{\prime}$ then give

\det(\Sigma)\geq 1-2\varepsilon\quad\text{and}\quad x^{\mathsf{T}}\Sigma^{-1}x\geq\frac{\|x\|_{2}^{2}}{1+2\varepsilon}\quad\text{for all }x\in\mathbb{R}^{2}. \qquad (3.7)

Indeed, the first bound is straightforward. To verify the second bound, note that each entry of the matrix $\Sigma-I_{2}$ is bounded in absolute value by $\varepsilon$. Thus, the operator norm of $\Sigma-I_{2}$ is bounded by $2\varepsilon$, which we can write as $\Sigma-I_{2}\preceq 2\varepsilon I_{2}$ in the positive-semidefinite order. This implies that $\Sigma^{-1}\succeq(1+2\varepsilon)^{-1}I_{2}$, and multiplying both sides of this relation by $x^{\mathsf{T}}$ on the left and $x$ on the right, we get the second bound in (3.7).

Substitute (3.7) into (3.6) to obtain

\operatorname{\mathbb{E}}\psi(\langle g,u\rangle)\,\psi(\langle g,u^{\prime}\rangle)\leq\frac{1}{2\pi\sqrt{1-2\varepsilon}}\int_{\mathbb{R}^{2}}\Psi(x)\exp\bigg(-\frac{\|x\|_{2}^{2}}{2(1+2\varepsilon)}\bigg)\,dx=\sqrt{\frac{1+2\varepsilon}{1-2\varepsilon}}\,\operatorname{\mathbb{E}}\Psi\big(h\sqrt{1+2\varepsilon}\big)

where $h=(h_{1},h_{2})\sim N(0,I_{2})$.

It remains to recall that $\Psi(x)=\psi(x_{1})\,\psi(x_{2})$, so

\operatorname{\mathbb{E}}\Psi\big(h\sqrt{1+2\varepsilon}\big)=\operatorname{\mathbb{E}}\psi\big(h_{1}\sqrt{1+2\varepsilon}\big)\,\psi\big(h_{2}\sqrt{1+2\varepsilon}\big)=\Big[\operatorname{\mathbb{E}}\psi\big(\gamma\sqrt{1+2\varepsilon}\big)\Big]^{2}

by independence. Lemma 3.3 is proved. ∎

Proof of Proposition 3.2.

By assumption, $\|u\|_{2}\geq\sqrt{1-\varepsilon}$, so

\operatorname{\mathbb{E}}\phi\big(\langle g,u\rangle-b\big)^{2}=\operatorname{\mathbb{E}}\phi\big(\gamma\|u\|_{2}-b\big)^{2}\geq\operatorname{\mathbb{E}}\phi\big(\gamma\sqrt{1-\varepsilon}-b\big)^{2}, \qquad (3.8)

where the last inequality follows by monotonicity; see Lemma A.3 for justification. Now we use the stability property of the normal distribution that we state in Lemma A.2. For $a=b$ larger than a suitable absolute constant and $z=-\varepsilon$, it gives, for both the threshold and ReLU nonlinearities, the following:

\frac{\operatorname{\mathbb{E}}\phi(\gamma\sqrt{1-\varepsilon}-b)^{2}}{\operatorname{\mathbb{E}}\phi(\gamma-b)^{2}}\geq 0.9\exp\bigg(-\frac{b^{2}\varepsilon}{2(1-\varepsilon)}\bigg)(1-\varepsilon)^{3/2}\geq\frac{1}{2}\exp(-b^{2}\varepsilon),

where the last bound holds since $\varepsilon\leq 1/8$. Combining this with (3.8), we obtain the first conclusion of the proposition.

Next, Lemma 3.3 gives

\operatorname{\mathbb{E}}\phi\big(\langle g,u\rangle-b\big)\,\phi\big(\langle g,u^{\prime}\rangle-b\big)\leq\sqrt{\frac{1+2\varepsilon}{1-2\varepsilon}}\;\Big[\operatorname{\mathbb{E}}\phi\big(\gamma\sqrt{1+2\varepsilon}-b\big)\Big]^{2}. \qquad (3.9)

Now we again use the stability property of the normal distribution, Lemma A.2, this time for $z=2\varepsilon$. It gives, for both the threshold and ReLU nonlinearities, the following:

\frac{\operatorname{\mathbb{E}}\phi(\gamma\sqrt{1+2\varepsilon}-b)}{\operatorname{\mathbb{E}}\phi(\gamma-b)}\leq 1.01\exp\bigg(\frac{2b^{2}\varepsilon}{2(1+2\varepsilon)}\bigg)(1+2\varepsilon)^{3/2}\leq 1.01(1+2\varepsilon)^{3/2}\exp(b^{2}\varepsilon).

Combining this with (3.9) gives

\frac{\operatorname{\mathbb{E}}\phi\big(\langle g,u\rangle-b\big)\,\phi\big(\langle g,u^{\prime}\rangle-b\big)}{\big[\operatorname{\mathbb{E}}\phi(\gamma-b)\big]^{2}}\leq\sqrt{\frac{1+2\varepsilon}{1-2\varepsilon}}\Big(1.01(1+2\varepsilon)^{3/2}\exp(b^{2}\varepsilon)\Big)^{2}\leq 2\exp(2b^{2}\varepsilon),

where the last step follows since $\varepsilon\leq 1/8$. This completes the proof of Proposition 3.2. ∎

4. Deviation

In the previous section, we studied the covariance of the random process

Z_{x}:=\phi(\langle g,x\rangle-b),\quad x\in\mathbb{R}^{n},

where $\phi$ is either the threshold or the ReLU nonlinearity as in (1.1), $g\sim N(0,I_{n})$ is a standard normal random vector, and $b\in\mathbb{R}$ is a fixed value. Consider a multivariate version of this process, a random pseudolinear map $\Phi : \mathbb{R}^{n}\to\mathbb{R}^{m}$ whose $m$ components are independent copies of $Z_{x}$. In other words, define

\Phi(x):=\Big(\phi\big(\langle g_{i},x\rangle-b\big)\Big)_{i=1}^{m}\quad\text{for }x\in\mathbb{R}^{n},

where $g_{i}\sim N(0,I_{n})$ are independent standard normal random vectors.

We are interested in how the map $\Phi$ transforms the distances between different points. Since

\operatorname{\mathbb{E}}\langle\Phi(x),\Phi(x^{\prime})\rangle=m\operatorname{\mathbb{E}}Z_{x}Z_{x^{\prime}},

the bounds on $\operatorname{\mathbb{E}}Z_{x}Z_{x^{\prime}}$ we proved in the previous section describe the behavior of $\Phi$ in expectation. In this section, we use standard concentration inequalities to ensure similar behavior with high probability.

Lemma 4.1 (Deviation).

Consider a pair of vectors $x,x^{\prime}\in\mathbb{R}^{n}$ such that $\|x\|_{2}\leq 2$, $\|x^{\prime}\|_{2}\leq 2$, and let $b\in\mathbb{R}$ be a number that is larger than a certain absolute constant. Define

p:=\operatorname{\mathbb{E}}\phi\big(\langle g,x\rangle-b\big)\,\phi\big(\langle g,x^{\prime}\rangle-b\big),\quad\text{where }g\sim N(0,I_{n}). \qquad (4.1)

Then for every $N\geq 2$, with probability at least $1-2mN^{-5}$ we have

\big|\langle\Phi(x),\Phi(x^{\prime})\rangle-mp\big|\leq C_{1}\big(\sqrt{mp}\,\log N+\log^{2}N\big).
Proof.

Step 1. Decomposition and truncation. By construction, $\operatorname{\mathbb{E}}\langle\Phi(x),\Phi(x^{\prime})\rangle=mp$. The deviation from the mean is

\langle\Phi(x),\Phi(x^{\prime})\rangle-mp=\sum_{i=1}^{m}\phi\big(\gamma_{i}-b\big)\,\phi\big(\gamma^{\prime}_{i}-b\big)-mp \qquad (4.2)

where $\gamma_{i}:=\langle g_{i},x\rangle$ and $\gamma^{\prime}_{i}:=\langle g_{i},x^{\prime}\rangle$. These two normal random variables are possibly correlated, and each has zero mean and variance bounded by $4$.

We will control the sum of i.i.d. random variables in (4.2) using Bernstein’s concentration inequality. In order to apply it, we first perform a standard truncation of the terms of the sum. The level of truncation will be

M:=C_{2}\sqrt{\log N} \qquad (4.3)

where $C_{2}$ is a sufficiently large absolute constant. Consider the random variables

Z_{i}:=\phi(\gamma_{i}-b)\,\phi(\gamma^{\prime}_{i}-b)\,\mathbf{1}_{\{\gamma_{i}\leq M\text{ and }\gamma^{\prime}_{i}\leq M\}},
R_{i}:=\phi(\gamma_{i}-b)\,\phi(\gamma^{\prime}_{i}-b)\,\mathbf{1}_{\{\gamma_{i}>M\text{ or }\gamma^{\prime}_{i}>M\}}.

Then we can decompose the sum in (4.2) as follows:

\langle\Phi(x),\Phi(x^{\prime})\rangle-mp=\sum_{i=1}^{m}(Z_{i}-\operatorname{\mathbb{E}}Z_{i})+\sum_{i=1}^{m}(R_{i}-\operatorname{\mathbb{E}}R_{i}). \qquad (4.4)

Step 2. The residual is small. Let us first control the residual, i.e. the second sum on the right side of (4.4). For a fixed $i$, the probability that $R_{i}$ is nonzero can be bounded by

\mathbb{P}\big\{\gamma_{i}>M\text{ or }\gamma^{\prime}_{i}>M\big\}\leq\mathbb{P}\big\{\gamma_{i}>M\big\}+\mathbb{P}\big\{\gamma^{\prime}_{i}>M\big\}\leq 2\mathbb{P}\big\{\gamma>M/2\big\}\leq N^{-10} \qquad (4.5)

where $\gamma\sim N(0,1)$. In the second inequality, we used that $\gamma_{i}$ and $\gamma^{\prime}_{i}$ are normal with mean zero and variance at most $4$. The third inequality follows from the asymptotic (A.2) on the tail of the normal distribution and our choice (4.3) of the truncation level $M$ with a sufficiently large constant $C_{2}$.

Taking the union bound, we see that all $R_{i}$ vanish simultaneously with probability at least $1-mN^{-10}$. Furthermore, by monotonicity,

\operatorname{\mathbb{E}}R_{i}\leq\operatorname{\mathbb{E}}\phi(\gamma_{i})\,\phi(\gamma^{\prime}_{i})\,\mathbf{1}_{\{\gamma_{i}>M\text{ or }\gamma^{\prime}_{i}>M\}}\leq\big(\operatorname{\mathbb{E}}\phi(\gamma_{i})^{4}\big)^{1/4}\,\big(\operatorname{\mathbb{E}}\phi(\gamma^{\prime}_{i})^{4}\big)^{1/4}\,\big(\mathbb{P}\big\{\gamma_{i}>M\text{ or }\gamma^{\prime}_{i}>M\big\}\big)^{1/2} \qquad (4.6)

where in the last step we used the generalized Hölder inequality. Now, for the threshold nonlinearity $\phi(t)=\mathbf{1}_{\{t>0\}}$, the terms $\operatorname{\mathbb{E}}\phi(\gamma_{i})^{4}$ and $\operatorname{\mathbb{E}}\phi(\gamma^{\prime}_{i})^{4}$ obviously equal $1/2$, and for the ReLU nonlinearity $\phi(t)=t_{+}$ these terms are bounded by the fourth moment of the standard normal distribution, which equals $3$. Combining this with (4.5) gives

0\leq\operatorname{\mathbb{E}}R_{i}\leq 2N^{-5}.

Summarizing, with probability at least $1-mN^{-10}$, we have

\bigg|\sum_{i=1}^{m}(R_{i}-\operatorname{\mathbb{E}}R_{i})\bigg|=\bigg|\sum_{i=1}^{m}\operatorname{\mathbb{E}}R_{i}\bigg|\leq 2mN^{-5}\leq 1. \qquad (4.7)

The last bound holds because otherwise $1-2mN^{-5}<0$ and the statement of the lemma holds trivially.

Step 3. The main sum is concentrated. To bound the first sum in (4.4), we can use Bernstein’s inequality [22], which we can state as follows. If $Z_{1},\ldots,Z_{m}$ are independent random variables and $s\geq 0$, then with probability at least $1-2e^{-s}$ we have

\bigg|\sum_{i=1}^{m}(Z_{i}-\operatorname{\mathbb{E}}Z_{i})\bigg|\lesssim\sigma\sqrt{s}+Ks, \qquad (4.8)

where $\sigma^{2}=\sum_{i=1}^{m}\operatorname{Var}(Z_{i})$ and $K=\max_{i}\|Z_{i}\|_{\infty}$. In our case, it is easy to check that for both the threshold and ReLU nonlinearities $\phi$, we have

K=\|Z_{1}\|_{\infty}\leq\phi(M-b)^{2}\leq M^{2},

and

\sigma^{2}=m\operatorname{Var}(Z_{1})\leq m\operatorname{\mathbb{E}}Z_{1}^{2}\leq M^{2}m\operatorname{\mathbb{E}}Z_{1}\leq M^{2}m\operatorname{\mathbb{E}}\phi(\gamma_{1}-b)\,\phi(\gamma^{\prime}_{1}-b)=M^{2}mp

by the definition of $p$ in (4.1). Apply Bernstein’s inequality (4.8) for

s=C_{3}\log N \qquad (4.9)

where $C_{3}$ is a suitably large absolute constant. We obtain that with probability at least $1-2e^{-s}\geq 1-N^{-10}$,

\bigg|\sum_{i=1}^{m}(Z_{i}-\operatorname{\mathbb{E}}Z_{i})\bigg|\lesssim\sqrt{M^{2}mps}+M^{2}s\lesssim\sqrt{mp}\,\log N+\log^{2}N. \qquad (4.10)

Here we used the choice of $M$ made in (4.3) and of $s$ in (4.9).

Combining the bounds on the residual (4.7) and on the main sum (4.10) and putting them into the decomposition (4.4), we conclude that with probability at least $1-2mN^{-10}$,

\big|\langle\Phi(x),\Phi(x^{\prime})\rangle-mp\big|\lesssim\sqrt{mp}\,\log N+\log^{2}N+1\lesssim\sqrt{mp}\,\log N+\log^{2}N.

The proof is complete. ∎
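The concentration described by Lemma 4.1 is easy to observe numerically; the following sketch (ours, with ad hoc parameters and a Monte Carlo estimate of $p$) draws the random map $\Phi$ many times and compares $\langle\Phi(x),\Phi(x^{\prime})\rangle$ with $mp$.

```python
import numpy as np

# Over independent draws of Phi, the inner product <Phi(x), Phi(x')> concentrates
# around m*p, with fluctuations roughly on the scale of sqrt(m*p) (threshold case).
rng = np.random.default_rng(7)
n, m, b, trials = 50, 5_000, 2.0, 200
x = rng.standard_normal(n); x /= np.linalg.norm(x)
xp = x + rng.standard_normal(n) / np.sqrt(n); xp /= np.linalg.norm(xp)  # correlated with x

Gp = rng.standard_normal((200_000, n))                 # estimate p once by Monte Carlo
p = float(((Gp @ x > b) & (Gp @ xp > b)).mean())

inner = np.empty(trials)
for t in range(trials):
    G = rng.standard_normal((m, n))
    Phi_x = (G @ x > b).astype(float)
    Phi_xp = (G @ xp > b).astype(float)
    inner[t] = Phi_x @ Phi_xp
print(f"m*p = {m*p:.1f}   empirical mean = {inner.mean():.1f}   std = {inner.std():.1f}")
```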

5. Enrichment

In the previous section, we defined a random pseudolinear map

\Phi : \mathbb{R}^{n}\to\mathbb{R}^{m},\quad\Phi(x):=\Big(\phi\big(\langle g_{i},x\rangle-b\big)\Big)_{i=1}^{m}, \qquad (5.1)

where $\phi$ is either the threshold or the ReLU nonlinearity as in (1.1), $g_{i}\sim N(0,I_{n})$ are independent standard normal random vectors, and $b$ is a fixed value.

We will now demonstrate the ability of $\Phi$ to “enrich” the data, i.e. to move different points away from each other. To see why this could be the case, choose the value of $b$ to be moderately large, say $b=100\sqrt{\log m}$. Then with high probability, most of the random variables $\langle g_{i},x\rangle$ will fall below $b$, making most of the coordinates of $\Phi(x)$ equal to zero, thus making $\Phi(x)$ a random sparse vector. Sparsity will tend to make $\Phi(x)$ and $\Phi(x^{\prime})$ almost orthogonal even when $x$ and $x^{\prime}$ are just a little separated from each other.

To make this rigorous, we can use the results of Section 3 to check that for such $b$, the coordinates of $\Phi(x)$ and $\Phi(x^{\prime})$ are almost uncorrelated. This immediately implies that $\Phi(x)$ and $\Phi(x^{\prime})$ are almost orthogonal in expectation, and the deviation inequality from Section 4 then implies that the same holds with high probability. This allows us to take a union bound over all data points $x_{i}$ and conclude that $\Phi(x_{i})$ and $\Phi(x_{j})$ are almost orthogonal for all distinct data points.

As in Section 3, we will prove this in two regimes, first for the data points that are just a little separated, and then for the data points that are almost orthogonal.

5.1. From separated to $\varepsilon$-orthogonal

In this part we show that the random pseudolinear map $\Phi$ transforms separated data points into almost orthogonal points.

Lemma 5.1 (Enrichment I: from separated to $\varepsilon$-orthogonal).

Consider a pair of unit vectors $x,x^{\prime}\in\mathbb{R}^{n}$ satisfying

\|x-x^{\prime}\|_{2}\geq C_{2}\sqrt{\frac{\log(1/\varepsilon)}{\log m}} \qquad (5.2)

for some ε[m1/5,1/8]\varepsilon\in[m^{-1/5},1/8]. Let 2Nexp(m1/5)2\leq N\leq\exp(m^{1/5}), and let pp and bb be numbers such that

p=C2log2Nε2m=𝔼ϕ(γb)2.p=\frac{C_{2}\log^{2}N}{\varepsilon^{2}m}=\operatorname{\mathbb{E}}\phi(\gamma-b)^{2}.

Consider the random pseudolinear map Φ:nm\Phi\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{n}\to\mathbb{R}^{m} defined in (5.1). Then with probability at least 14mN51-4mN^{-5}, the vectors

u:=Φ(x)mp,u:=Φ(x)mpu\mathrel{\mathop{\mathchar 58\relax}}=\frac{\Phi(x)}{\sqrt{mp}},\quad u^{\prime}\mathrel{\mathop{\mathchar 58\relax}}=\frac{\Phi(x^{\prime})}{\sqrt{mp}}

satisfy

|u221|ε,|u,u|ε.\mathinner{\!\bigl{\lvert}\|u\|_{2}^{2}-1\bigr{\rvert}}\leq\varepsilon,\quad\mathinner{\!\bigl{\lvert}\langle u,u^{\prime}\rangle\bigr{\rvert}}\leq\varepsilon.
Proof.

Step 1. Bounding the bias bb. We begin with some easy observations. Note that xx2\|x-x^{\prime}\|_{2} is bounded above by 22 and below by C2/logmC_{2}/\sqrt{\log m}. Thus, by setting the value of C2C_{2} sufficiently large, we can assume that mm is arbitrarily large, i.e. larger than any given absolute constant. Furthermore, the restrictions on ε\varepsilon and NN yield

m1pm1/10,m^{-1}\leq p\lesssim m^{-1/10}, (5.3)

so pp is arbitrarily small, smaller than any given absolute constant. The function t𝔼ϕ(γt)2t\mapsto\operatorname{\mathbb{E}}\phi(\gamma-t)^{2} is continuous and decreasing, takes an absolute constant value at t=0t=0, and tends to zero as tt\to\infty. Thus the equation 𝔼ϕ(γt)2=p\operatorname{\mathbb{E}}\phi(\gamma-t)^{2}=p has a solution, so bb is well defined; moreover, since pp is smaller than the absolute constant 𝔼ϕ(γ1)2\operatorname{\mathbb{E}}\phi(\gamma-1)^{2}, we have b1b\geq 1.

To get a better bound on bb, one can use (A.2) for the threshold nonlinearity and Lemma A.1 for ReLU, which give

log𝔼ϕ(γb)2b2.\log\operatorname{\mathbb{E}}\phi(\gamma-b)^{2}\asymp-b^{2}.

Since p=𝔼ϕ(γb)2p=\operatorname{\mathbb{E}}\phi(\gamma-b)^{2}, this and (5.3) yield

blogm.b\asymp\sqrt{\log m}. (5.4)

Step 2. Controlling the norm. Applying Lemma 4.1 for x=xx=x^{\prime}, we obtain with probability at least 12mN51-2mN^{-5} that

|Φ(x)22mp|C1(mplogN+log2N).\big{|}\|\Phi(x)\|_{2}^{2}-mp\big{|}\leq C_{1}\big{(}\sqrt{mp}\,\log N+\log^{2}N\big{)}.

Divide both sides by mpmp to get

|u221|C1(logNmp+log2Nmp)ε,\mathinner{\!\bigl{\lvert}\|u\|_{2}^{2}-1\bigr{\rvert}}\leq C_{1}\bigg{(}\frac{\log N}{\sqrt{mp}}+\frac{\log^{2}N}{mp}\bigg{)}\leq\varepsilon,

where the second inequality follows from our choice of pp with large C2C_{2}. This proves the first conclusion of the lemma.

Step 3. Controlling the inner product. Proposition 3.1 gives

q:=𝔼ϕ(g,xb)ϕ(g,xb)2exp(b2xx228)pε10p,q\mathrel{\mathop{\mathchar 58\relax}}=\operatorname{\mathbb{E}}\phi\big{(}\langle g,x\rangle-b\big{)}\,\phi\big{(}\langle g,x^{\prime}\rangle-b\big{)}\leq 2\exp\bigg{(}-\frac{b^{2}\mathinner{\lVert x-x^{\prime}\rVert}_{2}^{2}}{8}\bigg{)}p\leq\varepsilon^{10}p, (5.5)

where in the last step we used the bounds (5.4) on bb and the separation assumption (5.2) with a sufficiently large constant C2C_{2}. Now, applying Lemma 4.1, we obtain with probability at least 12mN51-2mN^{-5} that

|Φ(x),Φ(x)|mq+C1(mqlogN+log2N).\mathinner{\!\left\lvert\langle\Phi(x),\Phi(x^{\prime})\rangle\right\rvert}\leq mq+C_{1}\big{(}\sqrt{mq}\,\log N+\log^{2}N\big{)}.

Divide both sides by mpmp to obtain

|u,u|qp+C1(qlogNmp+log2Nmp)ε2,\mathinner{\!\left\lvert\langle u,u^{\prime}\rangle\right\rvert}\leq\frac{q}{p}+C_{1}\bigg{(}\frac{\sqrt{q}\log N}{\sqrt{m}p}+\frac{\log^{2}N}{mp}\bigg{)}\leq\varepsilon^{2},

where the last step follows from the bound (5.5) on qq and our choice of pp with a sufficiently large C2C_{2}. This is an even stronger bound than we claimed. ∎
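Step 1 of the proof only needs the existence of the bias bb solving E φ(γ−b)² = p, but this equation is also easy to solve numerically; the following sketch (illustrative, assuming SciPy is available) does so by bisection. For the threshold nonlinearity the second moment is simply P(γ > b); for ReLU the closed form (1+b²)P(γ>b) − b·pdf(b) is exactly what the integration by parts in the proof of Lemma A.1 produces.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def second_moment(b, phi):
    """E phi(gamma - b)^2 for gamma ~ N(0,1)."""
    if phi == "threshold":
        return norm.sf(b)                                   # P(gamma > b)
    # ReLU: exact value, cf. the integration by parts in the proof of Lemma A.1
    return (1.0 + b ** 2) * norm.sf(b) - b * norm.pdf(b)

def solve_bias(p, phi):
    """Solve E phi(gamma - b)^2 = p for b (assumes p is below the value at b = 0)."""
    return brentq(lambda b: second_moment(b, phi) - p, 0.0, 40.0)

for p in (1e-2, 1e-4, 1e-6):
    for phi in ("threshold", "relu"):
        b = solve_bias(p, phi)
        print(p, phi, round(b, 3), round(b / np.sqrt(2 * np.log(1 / p)), 3))
```

The last column, b/√(2 log(1/p)), slowly increases toward 1, reflecting the √(log m) scaling of (5.4) when p is a negative power of m.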

Theorem 5.2 (Enrichment I: from separated to ε\varepsilon-orthogonal).

Consider unit vectors x1,,xKnx_{1},\ldots,x_{K}\in\mathbb{R}^{n} that satisfy

xixj2C2log(1/ε)logm\|x_{i}-x_{j}\|_{2}\geq C_{2}\sqrt{\frac{\log(1/\varepsilon)}{\log m}}

for all distinct i,ji,j, where ε[m1/5,1/8]\varepsilon\in[m^{-1/5},1/8]. Assume that Kexp(c2m1/5)K\leq\exp(c_{2}m^{1/5}). Then there exists an almost777Recall from Section 2.6 that an almost pseudolinear map EE is, by definition, a pseudolinear map multiplied by a nonnegative constant. In our case, E=(mp)1/2ΦE=(mp)^{-1/2}\Phi. pseudolinear map E:nmE\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{n}\to\mathbb{R}^{m} such that the vectors uk:=E(xk)u_{k}\mathrel{\mathop{\mathchar 58\relax}}=E(x_{k}) satisfy

|ui221|ε,|ui,uj|ε\mathinner{\!\bigl{\lvert}\|u_{i}\|_{2}^{2}-1\bigr{\rvert}}\leq\varepsilon,\quad\mathinner{\!\bigl{\lvert}\langle u_{i},u_{j}\rangle\bigr{\rvert}}\leq\varepsilon

for all distinct indices i,j=1,,Ki,j=1,\ldots,K.

Proof.

Apply Lemma 5.1 followed by a union bound over all pairs of distinct vectors xkx_{k}. If we choose N=2mKN=2mK, then the probability of success is at least 1K24m(2mK)5>01-K^{2}\cdot 4m(2mK)^{-5}>0. The proof is complete. ∎

5.2. From ε\varepsilon-orthogonal to 1d\frac{1}{\sqrt{d}}-orthogonal

In this part we show that a random pseudolinear map Φ:md\Phi\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{m}\to\mathbb{R}^{d} makes almost orthogonal data points even closer to being orthogonal: Φ\Phi reduces the inner products from a small constant ε\varepsilon to O(1/d)O(1/\sqrt{d}).

The pseudolinear map Φ\Phi considered in this part will have the same form as in (5.1), but for different dimensions:

Φ:md,Φ(u):=(ϕ(gi,ub))i=1d,\Phi\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{m}\to\mathbb{R}^{d},\quad\Phi(u)\mathrel{\mathop{\mathchar 58\relax}}=\Big{(}\phi\big{(}\langle g_{i},u\rangle-b\big{)}\Big{)}_{i=1}^{d}, (5.6)

where ϕ\phi is either the threshold or ReLU nonlinearity as in (1.1), giN(0,Im)g_{i}\sim N(0,I_{m}) are independent standard normal random vectors, and bb is a fixed value.

Lemma 5.3 (Enrichment II: from ε\varepsilon-orthogonal to 1d\frac{1}{\sqrt{d}}-orthogonal).

Consider a pair of vectors u,umu,u^{\prime}\in\mathbb{R}^{m} satisfying

|u221|ε,|u221|ε,|u,u|ε\mathinner{\!\bigl{\lvert}\|u\|_{2}^{2}-1\bigr{\rvert}}\leq\varepsilon,\quad\mathinner{\!\bigl{\lvert}\|u^{\prime}\|_{2}^{2}-1\bigr{\rvert}}\leq\varepsilon,\quad\mathinner{\!\bigl{\lvert}\langle u,u^{\prime}\rangle\bigr{\rvert}}\leq\varepsilon (5.7)

for some 0<εc3/logd0<\varepsilon\leq c_{3}/\log d. Let 2Nexp(c3d1/5)2\leq N\leq\exp(c_{3}d^{1/5}), and let pp and bb be numbers such that

p:=1d,𝔼ϕ(γb)2=p.p\mathrel{\mathop{\mathchar 58\relax}}=\frac{1}{\sqrt{d}},\qquad\operatorname{\mathbb{E}}\phi(\gamma-b)^{2}=p.

Then with probability at least 14dN51-4dN^{-5}, the vectors

v:=Φ(u)dp,v:=Φ(u)dpv\mathrel{\mathop{\mathchar 58\relax}}=\frac{\Phi(u)}{\sqrt{dp}},\quad v^{\prime}\mathrel{\mathop{\mathchar 58\relax}}=\frac{\Phi(u^{\prime})}{\sqrt{dp}}

satisfy

v212,|v,v|C3(logd+log2N)d.\|v\|_{2}\geq\frac{1}{2},\quad\mathinner{\!\bigl{\lvert}\langle v,v^{\prime}\rangle\bigr{\rvert}}\leq\frac{C_{3}(\log d+\log^{2}N)}{\sqrt{d}}.
Proof.

Step 1. Bounding the bias bb. Following the beginning of the proof of Lemma 5.1, we can check that bb exists and

blogd.b\asymp\sqrt{\log d}. (5.8)

Step 2. Controlling the norm. Applying Proposition 3.2, we see that

p0:=𝔼ϕ(g,ub)212exp(b2ε)pp3,p_{0}\mathrel{\mathop{\mathchar 58\relax}}=\operatorname{\mathbb{E}}\phi\big{(}\langle g,u\rangle-b\big{)}^{2}\geq\frac{1}{2}\exp(-b^{2}\varepsilon)p\geq\frac{p}{3},

where in the last step we used the bound (5.8) on bb and the assumption on ε\varepsilon with a sufficiently small constant c3c_{3}. Then, applying Lemma 4.1 for x=x=ux=x^{\prime}=u, we obtain with probability at least 12dN51-2dN^{-5} that

Φ(u)22dp0C1(dp0logN+log2N)34dp014dp,\|\Phi(u)\|_{2}^{2}\geq dp_{0}-C_{1}\big{(}\sqrt{dp_{0}}\,\log N+\log^{2}N\big{)}\geq\frac{3}{4}dp_{0}\geq\frac{1}{4}dp,

where we used our choice of pp and the restriction on NN with sufficiently small constant c3c_{3}. Divide both sides by dpdp to get

v212,\|v\|_{2}\geq\frac{1}{2},

which is the first conclusion of the lemma.

Step 3. Controlling the inner product. Proposition 3.2 gives

q:=𝔼ϕ(g,ub)ϕ(g,ub)2exp(2b2ε)[𝔼ϕ(γb)]2[𝔼ϕ(γb)]2,q\mathrel{\mathop{\mathchar 58\relax}}=\operatorname{\mathbb{E}}\phi\big{(}\langle g,u\rangle-b\big{)}\,\phi\big{(}\langle g,u^{\prime}\rangle-b\big{)}\leq 2\exp(2b^{2}\varepsilon)\,\big{[}\operatorname{\mathbb{E}}\phi(\gamma-b)\big{]}^{2}\lesssim\big{[}\operatorname{\mathbb{E}}\phi(\gamma-b)\big{]}^{2}, (5.9)

where the last inequality follows as before from bound (5.8) on bb and the assumption on ε\varepsilon with sufficiently small c3c_{3}.

Next, we will use the following inequality that holds for all sufficiently large a>0a>0:

𝔼ϕ(γa)a𝔼ϕ(γa)2.\operatorname{\mathbb{E}}\phi(\gamma-a)\leq a\cdot\operatorname{\mathbb{E}}\phi(\gamma-a)^{2}.

For the threshold nonlinearity ϕ\phi, this bound is trivial even without the factor aa in the right side. For the ReLU nonlinearity, it follows from Lemma A.1 in the Appendix. Therefore, we have

𝔼ϕ(γb)bpplogd\operatorname{\mathbb{E}}\phi(\gamma-b)\leq bp\lesssim p\sqrt{\log d}

where we used (5.8) in the last step. Substituting this into (5.9), we conclude that

qp2logd.q\lesssim p^{2}\log d. (5.10)

Now, applying Lemma 4.1, we obtain with probability at least 12dN51-2dN^{-5} that

|Φ(u),Φ(u)|dq+dqlogN+log2N.\mathinner{\!\left\lvert\langle\Phi(u),\Phi(u^{\prime})\rangle\right\rvert}\lesssim dq+\sqrt{dq}\,\log N+\log^{2}N.

Divide both sides by dpdp to obtain

|v,v|qp+qlogNdp+log2Ndplogd+log2Nd.\mathinner{\!\left\lvert\langle v,v^{\prime}\rangle\right\rvert}\lesssim\frac{q}{p}+\frac{\sqrt{q}\log N}{\sqrt{d}p}+\frac{\log^{2}N}{dp}\lesssim\frac{\log d+\log^{2}N}{\sqrt{d}}.

In the last step we used (5.10) and our choice of pp. ∎

Theorem 5.4 (Enrichment II: from ε\varepsilon-orthogonal to 1d\frac{1}{\sqrt{d}}-orthogonal).

Consider vectors u1,,uKmu_{1},\ldots,u_{K}\in\mathbb{R}^{m} that satisfy

|ui221|ε,|ui,uj|ε\mathinner{\!\bigl{\lvert}\|u_{i}\|_{2}^{2}-1\bigr{\rvert}}\leq\varepsilon,\quad\mathinner{\!\bigl{\lvert}\langle u_{i},u_{j}\rangle\bigr{\rvert}}\leq\varepsilon

for all distinct i,ji,j, where 0<εc3/logd0<\varepsilon\leq c_{3}/\log d. Assume that Kexp(c3d1/5)K\leq\exp(c_{3}d^{1/5}). Then there exists an almost pseudolinear map R:mdR\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{m}\to\mathbb{R}^{d} such that the vectors vk:=R(uk)v_{k}\mathrel{\mathop{\mathchar 58\relax}}=R(u_{k}) satisfy

vi212,|vi,vj|C4log2(dK)d\|v_{i}\|_{2}\geq\frac{1}{2},\quad\mathinner{\!\bigl{\lvert}\langle v_{i},v_{j}\rangle\bigr{\rvert}}\leq\frac{C_{4}\log^{2}(dK)}{\sqrt{d}}

for all distinct indices i,j=1,,Ki,j=1,\ldots,K.

Proof.

Apply Lemma 5.3 followed by a union bound over all pairs of distinct vectors uku_{k}. If we choose N=2dKN=2dK, then the probability of success is at least 1K24d(2dK)5>01-K^{2}\cdot 4d(2dK)^{-5}>0. The proof is complete. ∎

6. Perception

The previous sections were concerned with preprocessing, or “enrichment”, of the data. We demonstrated how a pseudolinear map can transform the original data points xkx_{k}, which can be just a little separated, into ε\varepsilon-orthogonal points uku_{k} with ε=o(1)\varepsilon=o(1), and further into η\eta-orthogonal points vkv_{k} with η=O(1/d)\eta=O(1/\sqrt{d}). In this section we train a pseudolinear map that can memorize any label assignment yky_{k} for the η\eta-orthogonal points vkv_{k}.

We will first try to train a single neuron to perform this task assuming that the number KK of the data points vkv_{k} is smaller than the dimension dd, up to a logarithmic factor. Specifically, we construct a vector wdw\in\mathbb{R}^{d} so that the values |w,vk|\mathinner{\!\left\lvert\langle w,v_{k}\rangle\right\rvert} are small whenever yk=0y_{k}=0 and large whenever yk=1y_{k}=1. Our construction is probabilistic: we choose w=k=1K±ykvkw=\sum_{k=1}^{K}\pm y_{k}v_{k} with random independent signs, and show that ww succeeds with high probability.
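Before stating the lemma, here is a small numerical illustration of this random-sign construction (illustrative only: the vectors are taken to be independent random unit vectors, whose pairwise inner products are of order 1/√d, rather than the exact setting below, and the parameters are not from the paper).

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, K1 = 5000, 200, 20                        # K1 log K is well below eta^{-2} ~ d here

V = rng.standard_normal((K, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # rows v_k: nearly orthogonal unit vectors

y = np.zeros(K, dtype=int)
y[rng.choice(K, size=K1, replace=False)] = 1    # K1 of the labels equal 1

xi = rng.choice([-1.0, 1.0], size=K)            # independent Rademacher signs
w = (xi * y) @ V                                # w = sum_k xi_k y_k v_k

margins = np.abs(V @ w)
print("largest |<w,v_k>| among y_k = 0:", margins[y == 0].max())    # noise only: small
print("smallest |<w,v_k>| among y_k = 1:", margins[y == 1].min())   # signal ~ ||v_k||^2 = 1
```

The gap between the two printed values is exactly the signal-versus-noise separation that the proof below quantifies.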

Lemma 6.1 (Perception).

Let η(0,1)\eta\in(0,1) and consider vectors v1,,vKdv_{1},\ldots,v_{K}\in\mathbb{R}^{d} satisfying

vi212,|vi,vj|η\|v_{i}\|_{2}\geq\frac{1}{2},\quad\mathinner{\!\left\lvert\langle v_{i},v_{j}\rangle\right\rvert}\leq\eta (6.1)

for all distinct i,ji,j. Consider any labels y1,,yK{0,1}y_{1},\ldots,y_{K}\in\{0,1\}, at most K1K_{1} of which equal 11. Assume that K1logKc4η2K_{1}\log K\leq c_{4}\eta^{-2}. Then there exists a vector wdw\in\mathbb{R}^{d} such that the following holds for every k=1,,Kk=1,\ldots,K:

|w,vk|116 if yk=0;|w,vk|316 if yk=1.\mathinner{\!\left\lvert\langle w,v_{k}\rangle\right\rvert}\leq\frac{1}{16}\text{ if $y_{k}=0$};\qquad\mathinner{\!\left\lvert\langle w,v_{k}\rangle\right\rvert}\geq\frac{3}{16}\text{ if $y_{k}=1$}. (6.2)
Proof.

Let ξ1,,ξK\xi_{1},\ldots,\xi_{K} be independent Rademacher random variables and define

w:=k=1Kξkykvk.w\mathrel{\mathop{\mathchar 58\relax}}=\sum_{k=1}^{K}\xi_{k}y_{k}v_{k}.

We are going to show that the random vector ww satisfies the conclusion of the lemma with positive probability.

Let us first check the conclusion (6.2) for k=1k=1. To this end, we decompose w,v1\langle w,v_{1}\rangle as follows:

w,v1=ξ1y1v122+k=2Kξkykvk,v1=:signal+noise.\langle w,v_{1}\rangle=\xi_{1}y_{1}\|v_{1}\|_{2}^{2}+\sum_{k=2}^{K}\xi_{k}y_{k}\langle v_{k},v_{1}\rangle=\mathrel{\mathop{\mathchar 58\relax}}\textrm{signal}+\textrm{noise}.

To bound the noise, we shall use Hoeffding’s inequality (see e.g. [22, Theorem 2.2.2]), which can be stated as follows. If a1,,aNa_{1},\ldots,a_{N} are any fixed numbers and s0s\geq 0, then with probability at least 12es2/21-2e^{-s^{2}/2} we have

|k=1Nξkak|s(k=1Nak2)1/2.\mathinner{\!\biggl{\lvert}\sum_{k=1}^{N}\xi_{k}a_{k}\biggr{\rvert}}\leq s\bigg{(}\sum_{k=1}^{N}a_{k}^{2}\bigg{)}^{1/2}.

Using this for s=4logKs=4\sqrt{\log K}, we conclude that with probability at least 12K81-2K^{-8}, we have

|noise|4logK(k=2Kyk2vk,v12)1/24logKK1η116,\mathinner{\!\left\lvert\textrm{noise}\right\rvert}\leq 4\sqrt{\log K}\bigg{(}\sum_{k=2}^{K}y_{k}^{2}\langle v_{k},v_{1}\rangle^{2}\bigg{)}^{1/2}\leq 4\sqrt{\log K}\,\sqrt{K_{1}}\eta\leq\frac{1}{16},

where we used (6.1) and the assumption on K,K1K,K_{1} with a sufficiently small constant c4c_{4}.

If y1=0y_{1}=0, the signal is zero and so |w,v1|=|noise|1/16\mathinner{\!\left\lvert\langle w,v_{1}\rangle\right\rvert}=\mathinner{\!\left\lvert\textrm{noise}\right\rvert}\leq 1/16, as claimed. If y1=1y_{1}=1 then |signal|=v1221/4\mathinner{\!\left\lvert\textrm{signal}\right\rvert}=\|v_{1}\|_{2}^{2}\geq 1/4 and thus

|w,v1||signal||noise|14116=316,\mathinner{\!\left\lvert\langle w,v_{1}\rangle\right\rvert}\geq\mathinner{\!\left\lvert\textrm{signal}\right\rvert}-\mathinner{\!\left\lvert\textrm{noise}\right\rvert}\geq\frac{1}{4}-\frac{1}{16}=\frac{3}{16},

as claimed.

Repeating this argument for general kk, we can obtain the same bounds for |w,vk|\mathinner{\!\left\lvert\langle w,v_{k}\rangle\right\rvert}. Finally, take the union bound over the KK choices of kk. The random vector satisfies the conclusion with probability at least 12K7>01-2K^{-7}>0. The proof is complete. ∎

Lemma 6.1 essentially says that one neuron can memorize the labels of O(d)O(d) data points in d\mathbb{R}^{d}. Thus, rr neurons should be able to memorize the labels of O(dr)O(dr) data points in d\mathbb{R}^{d}. To make this happen, we can partition the data into rr batches of size O(d)O(d) each, and train each neuron on a different batch. The following theorem makes this formal; to see the connection, apply it for η1/d\eta\asymp 1/\sqrt{d}.

Theorem 6.2 (One layer).

Consider a number η(0,1)\eta\in(0,1), vectors v1,,vKdv_{1},\ldots,v_{K}\in\mathbb{R}^{d} and labels y1,,yK{0,1}y_{1},\ldots,y_{K}\in\{0,1\} as in Lemma 6.1. Assume that (2K1+r)logKc4rη2(2K_{1}+r)\log K\leq c_{4}r\eta^{-2} where rr is a positive integer. Then there exists a pseudolinear map P:drP\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{d}\to\mathbb{R}^{r} such that for all k=1,,Kk=1,\ldots,K we have:

P(vk)=0 iff yk=0.P(v_{k})=0\text{ iff }y_{k}=0.
Proof.

Without loss of generality, assume that the first K1K_{1} of the labels yky_{k} equal 11 and the rest equal zero, i.e. yk=𝟏{kK1}y_{k}={\mathbf{1}}_{\{k\leq K_{1}\}}. Partition the indices of the nonzero labels {1,,K1}\{1,\ldots,K_{1}\} into r/2r/2 subsets IiI_{i} (“batches”), each of cardinality at most 2K1/r+12K_{1}/r+1. For each batch ii, define a new set of labels

yki=𝟏{kIi},k=1,,K.y_{ki}={\mathbf{1}}_{\{k\in I_{i}\}},\quad k=1,\ldots,K.

In other words, the labels ykiy_{ki} are obtained from the original labels yky_{k} by zeroing out the labels outside batch ii.

For each ii, apply Lemma 6.1 for the labels ykiy_{ki}. The number of nonzero labels is |Ii|2K1/r+1\mathinner{\!\left\lvert I_{i}\right\rvert}\leq 2K_{1}/r+1, so we can use this number instead of K1K_{1}, noting that the condition (2K1/r+1)logKc4η2(2K_{1}/r+1)\log K\leq c_{4}\eta^{-2} required in Lemma 6.1 does hold by our assumption. We obtain a vector widw_{i}\in\mathbb{R}^{d} such that the following holds for every k=1,,Kk=1,\ldots,K:

|wi,vk|116 if kIi;|wi,vk|316 if kIi.\mathinner{\!\left\lvert\langle w_{i},v_{k}\rangle\right\rvert}\leq\frac{1}{16}\text{ if $k\not\in I_{i}$};\qquad\mathinner{\!\left\lvert\langle w_{i},v_{k}\rangle\right\rvert}\geq\frac{3}{16}\text{ if $k\in I_{i}$}. (6.3)

Define the pseudolinear map P(v)=(P(v)1,,P(v)r)P(v)=\big{(}P(v)_{1},\ldots,P(v)_{r}\big{)} as follows:

P(v)i:=ϕ(wi,v18),P(v)r/2+i:=ϕ(wi,v18),i=1,,r/2.P(v)_{i}\mathrel{\mathop{\mathchar 58\relax}}=\phi\Big{(}\langle w_{i},v\rangle-\frac{1}{8}\Big{)},\quad P(v)_{r/2+i}\mathrel{\mathop{\mathchar 58\relax}}=\phi\Big{(}-\langle w_{i},v\rangle-\frac{1}{8}\Big{)},\quad i=1,\ldots,r/2.

If yk=0y_{k}=0, then k>K1k>K_{1}. Thus kk does not belong to any batch IiI_{i}, and (6.3) implies that |wi,vk|1/16\mathinner{\!\left\lvert\langle w_{i},v_{k}\rangle\right\rvert}\leq 1/16 for all ii. Then both wi,vk1/8\langle w_{i},v_{k}\rangle-1/8 and wi,vk1/8-\langle w_{i},v_{k}\rangle-1/8 are negative, and since ϕ(t)=0\phi(t)=0 for negative tt, all coordinates of P(vk)P(v_{k}) are zero, i.e. P(vk)=0P(v_{k})=0.

Conversely, if P(vk)=0P(v_{k})=0 then, by construction, for each ii both wi,vk1/8\langle w_{i},v_{k}\rangle-1/8 and wi,vk1/8-\langle w_{i},v_{k}\rangle-1/8 must be nonpositive, which yields |wi,vk|1/8<3/16\mathinner{\!\left\lvert\langle w_{i},v_{k}\rangle\right\rvert}\leq 1/8<3/16. Thus, by (6.3), kk cannot belong to any batch IiI_{i}, which means that k>K1k>K_{1}, and this implies yk=0y_{k}=0. ∎
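The construction in this proof is short enough to run directly; here is an illustrative sketch (threshold activation, random nearly orthogonal unit vectors standing in for the vectors vkv_{k}, parameters chosen only for the demonstration).

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, K1, r = 20_000, 200, 40, 8                # r/2 = 4 batches of positive labels

V = rng.standard_normal((K, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # nearly orthogonal unit vectors
y = np.zeros(K, dtype=int)
y[rng.choice(K, size=K1, replace=False)] = 1

# one vector w_i per batch of positive labels, built as in Lemma 6.1,
# then paired with its negation as in the construction of P above
batches = np.array_split(np.flatnonzero(y == 1), r // 2)
W = np.zeros((r, d))
for i, batch in enumerate(batches):
    xi = rng.choice([-1.0, 1.0], size=len(batch))
    W[i] = xi @ V[batch]                        # w_i = sum_{k in I_i} xi_k v_k
    W[r // 2 + i] = -W[i]

P = (V @ W.T - 1.0 / 8.0 > 0).astype(float)     # threshold activation with bias 1/8
print("P(v_k) = 0 exactly when y_k = 0:",
      np.array_equal((P.sum(axis=1) > 0).astype(int), y))
```

With these (generous) parameters the printed check succeeds with overwhelming probability, mirroring the conclusion of the theorem.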

7. Assembly

In this section we prove a general version of our main result. Let us first show how to train a network with four layers. To this end, choose an enrichment map from layer 11 to layer 22 to transform the data from merely separated to ε\varepsilon-orthogonal, choose a map from layer 22 to layer 33 to further enrich the data by making it O(1/d)O(1/\sqrt{d})-orthogonal, and finally make a map from layer 33 to layer 44 memorize the labels. This yields the following result:

Theorem 7.1 (Shallow networks).

Consider unit vectors x1,,xKnx_{1},\ldots,x_{K}\in\mathbb{R}^{n} that satisfy

xixj2C2loglogdlogm.\|x_{i}-x_{j}\|_{2}\geq C_{2}\sqrt{\frac{\log\log d}{\log m}}.

for all distinct i,ji,j. Consider any labels y1,,yK{0,1}y_{1},\ldots,y_{K}\in\{0,1\}, at most K1K_{1} of which equal 11. Assume that

K1log5(dK)c5drK_{1}\log^{5}(dK)\leq c_{5}dr

as well as Kexp(c5m1/5)K\leq\exp(c_{5}m^{1/5}), Kexp(c5d1/5)K\leq\exp(c_{5}d^{1/5}), and dexp(c5m1/5)d\leq\exp(c_{5}m^{1/5}). Then there exists a map F(n,m,d,r)F\in\mathcal{F}(n,m,d,r) such that for all k=1,,Kk=1,\ldots,K we have:

F(xk)=0 iff yk=0.F(x_{k})=0\text{ iff }y_{k}=0.
Proof.

Step 1. From separated to ε\varepsilon-orthogonal. Apply Theorem 5.2 with ε=c5/logd\varepsilon=c_{5}/\log d. (Note that the required constraints in that result hold by our assumptions with small c5c_{5}.) We obtain an almost pseudolinear map E:nmE\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{n}\to\mathbb{R}^{m} such that the vectors uk:=E(xk)u_{k}\mathrel{\mathop{\mathchar 58\relax}}=E(x_{k}) satisfy

|ui221|ε,|ui,uj|ε\mathinner{\!\bigl{\lvert}\|u_{i}\|_{2}^{2}-1\bigr{\rvert}}\leq\varepsilon,\quad\mathinner{\!\bigl{\lvert}\langle u_{i},u_{j}\rangle\bigr{\rvert}}\leq\varepsilon

for all distinct i,ji,j.

Step 2. From ε\varepsilon-orthogonal to 1d\frac{1}{\sqrt{d}}-orthogonal. Apply Theorem 5.4. We obtain an almost pseudolinear map R:mdR\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{m}\to\mathbb{R}^{d} such that the vectors vk:=R(uk)v_{k}\mathrel{\mathop{\mathchar 58\relax}}=R(u_{k}) satisfy

vi212,|vi,vj|C4log2(dK)d=:η\|v_{i}\|_{2}\geq\frac{1}{2},\quad\mathinner{\!\bigl{\lvert}\langle v_{i},v_{j}\rangle\bigr{\rvert}}\leq\frac{C_{4}\log^{2}(dK)}{\sqrt{d}}=\mathrel{\mathop{\mathchar 58\relax}}\eta

for all distinct indices i,ji,j.

Step 3. Perception. Apply Theorem 6.2. (Note that our assumptions with small enough c5c_{5} ensure that the required constraint (2K1+r)logKc4rη2(2K_{1}+r)\log K\leq c_{4}r\eta^{-2} does hold.) We obtain a pseudolinear map P:drP\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{d}\to\mathbb{R}^{r} such that the vectors zk:=P(vk)z_{k}\mathrel{\mathop{\mathchar 58\relax}}=P(v_{k}) satisfy:

zk=0 iff yk=0.z_{k}=0\text{ iff }y_{k}=0.

Step 4. Assembly. Define

F:=PRE.F\mathrel{\mathop{\mathchar 58\relax}}=P\circ R\circ E.

Since EE and RR are almost pseudolinear and PP is pseudolinear, FF can be represented as a composition of three pseudolinear maps (by absorbing the scalar factors into the weight matrices of the subsequent layers), i.e. F(n,m,d,r)F\in\mathcal{F}(n,m,d,r). Also, F(xk)=zkF(x_{k})=z_{k} by construction, so the proof is complete. ∎
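The absorption of the scalar factors mentioned at the end of the proof is simply a rescaling of the next layer's weight matrix. A minimal sketch (toy dimensions, ReLU for concreteness):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, d = 5, 7, 3                                   # toy dimensions
V1, V2 = rng.standard_normal((m, n)), rng.standard_normal((d, m))
b1, b2, c = 0.5, 0.5, 0.1                           # c is the scalar of the almost pseudolinear map

relu = lambda t: np.maximum(t, 0.0)
x = rng.standard_normal(n)

E = lambda x: c * relu(V1 @ x - b1)                 # almost pseudolinear map
composed = relu(V2 @ E(x) - b2)                     # next pseudolinear layer applied to E(x)
absorbed = relu((c * V2) @ relu(V1 @ x - b1) - b2)  # the scalar absorbed into the next weights
print(np.allclose(composed, absorbed))              # True
```

The same rescaling absorbs the scalar of RR into the weights of PP, which is why FF indeed lies in the class ℱ(n,m,d,r).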

Finally, we can extend Theorem 7.1 to arbitrarily deep networks by distributing the memorization tasks among all layers evenly. Indeed, consider a network with LL layers and with nin_{i} nodes in layer ii. As in Theorem 7.1, first we enrich the data, thereby making the input to layer 33 almost orthogonal. Train the map from layer 33 to layer 44 to memorize the labels of the first O(n3n4)O(n_{3}n_{4}) data points using Theorem 6.2 (for d=n3d=n_{3}, r=n4r=n_{4}, and η1/d\eta\asymp 1/\sqrt{d}). Similarly, train the map from layer 44 to layer 55 to memorize the labels of the next O(n4n5)O(n_{4}n_{5}) data points, and so on. This allows us to train the network on a total of O(n3n4+n4n5++nL1nL)=O(W)O(n_{3}n_{4}+n_{4}n_{5}+\cdots+n_{L-1}n_{L})=O(W) data points, where WW is the number of “deep connections” in the network, i.e. connections that occur from layer 33 and up. This leads us to the main result of this paper.
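The even distribution of memorization tasks described above, in proportion to the numbers of connections n_i n_{i+1}, can be made concrete by a tiny sketch (hypothetical layer widths; this mirrors Step 2 of the proof below, where batch i receives at most (n_i n_{i+1}/W) K_1 + 1 of the positive labels).

```python
import numpy as np

def distribute(positive_indices, widths):
    """Split the positive labels into one batch per pair of adjacent deep layers,
    batch i having size at most (n_i * n_{i+1} / W) * K1 + 1."""
    conns = np.array([widths[i] * widths[i + 1] for i in range(len(widths) - 1)])
    W = conns.sum()                                    # total number of deep connections
    caps = (conns * len(positive_indices)) // W + 1    # allowed batch sizes
    batches, start = [], 0
    for cap in caps:
        batches.append(positive_indices[start:start + cap])
        start += cap
    return batches

K1 = 1000
batches = distribute(np.arange(K1), widths=[300, 200, 400, 100])   # hypothetical n_3, ..., n_L
print([len(b) for b in batches], sum(len(b) for b in batches) == K1)
```

Since the numbers n_i n_{i+1}/W sum to one, the batch-size caps add up to at least K_1, so every positive label lands in some batch.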

Theorem 7.2 (Deep networks).

Let n1,,nLn_{1},\ldots,n_{L} be positive integers, and set n0:=min(n2,,nL)n_{0}\mathrel{\mathop{\mathchar 58\relax}}=\min(n_{2},\ldots,n_{L}) and n:=max(n2,,nL)n_{\infty}\mathrel{\mathop{\mathchar 58\relax}}=\max(n_{2},\ldots,n_{L}). Consider unit vectors x1,,xKn1x_{1},\ldots,x_{K}\in\mathbb{R}^{n_{1}} that satisfy

xixj2Cloglognlogn0.\|x_{i}-x_{j}\|_{2}\geq C\sqrt{\frac{\log\log n_{\infty}}{\log n_{0}}}.

for all distinct i,ji,j. Consider any labels y1,,yK{0,1}y_{1},\ldots,y_{K}\in\{0,1\}, at most K1K_{1} of which equal 11. Assume that the number of deep connections W:=n3n4++nL1nLW\mathrel{\mathop{\mathchar 58\relax}}=n_{3}n_{4}+\cdots+n_{L-1}n_{L} satisfies

WCK1log5K,W\geq CK_{1}\log^{5}K, (7.1)

as well as Kexp(cn01/5)K\leq\exp(cn_{0}^{1/5}) and nexp(cn01/5)n_{\infty}\leq\exp(cn_{0}^{1/5}). Then there exists a map F(n1,,nL)F\in\mathcal{F}(n_{1},\ldots,n_{L}) such that for all k=1,,Kk=1,\ldots,K we have:

F(xk)=0 iff yk=0.F(x_{k})=0\text{ iff }y_{k}=0.

We stated a simplified version of this result in Theorem 1.1. To see the connection, just take the ‘OR’ of the outputs of all nLn_{L} nodes of the last layer.

Proof.

Step 1. Initial reductions. For networks with L=4L=4 layers, we already proved the result in Theorem 7.1, so we can assume that L5L\geq 5. Moreover, for K=1K=1 the result is trivial, so we can assume that K2K\geq 2. In this case, if we make the constant cc in our assumption 2exp(cn01/5)2\leq\exp(cn_{0}^{1/5}) sufficiently small, we can assume that n0n_{0} (and thus also all nin_{i} and WW) are arbitrarily large, i.e. larger than any given absolute constant.

Step 2. Distributing data to layers. Without loss of generality, assume that the first K1K_{1} of the labels yky_{k} equal 11 and the rest equal zero, i.e. yk=𝟏{kK1}y_{k}={\mathbf{1}}_{\{k\leq K_{1}\}}. Partition the indices of the nonzero labels {1,,K1}\{1,\ldots,K_{1}\} into subsets I3,,IL1I_{3},\ldots,I_{L-1} (“batches”) so that

|Ii|nini+1WK1+1.\mathinner{\!\left\lvert I_{i}\right\rvert}\leq\frac{n_{i}n_{i+1}}{W}K_{1}+1.

(This is possible since the numbers nini+1/Wn_{i}n_{i+1}/W sum to one.) For each batch ii, define a new set of labels

yki=𝟏{kIi},k=1,,K.y_{ki}={\mathbf{1}}_{\{k\in I_{i}\}},\quad k=1,\ldots,K.

In other words, ykiy_{ki} are obtained from the original labels yky_{k} by zeroing out the labels outside batch ii.

Step 3. Memorization at each layer. For each ii, apply Theorem 7.1 for the labels ykiy_{ki}, for the number of nonzero labels |Ii|\mathinner{\!\left\lvert I_{i}\right\rvert} instead of K1K_{1}, and for n=n1n=n_{1}, m=n0/3m=n_{0}/3, d=ni/3d=n_{i}/3, r=ni+1/3r=n_{i+1}/3. Thus, if

(nini+1WK1+1)log5(niK3)c5nini+19\bigg{(}\frac{n_{i}n_{i+1}}{W}K_{1}+1\bigg{)}\log^{5}\bigg{(}\frac{n_{i}K}{3}\bigg{)}\leq\frac{c_{5}n_{i}n_{i+1}}{9} (7.2)

as well as

Kexp(c5n01/5/3),Kexp(c5ni1/5/3),niexp(c5n01/5/3),K\leq\exp(c_{5}n_{0}^{1/5}/3),\qquad K\leq\exp(c_{5}n_{i}^{1/5}/3),\qquad n_{i}\leq\exp(c_{5}n_{0}^{1/5}/3), (7.3)

then there exists a map

Fi(n1,n0/3,ni/3,ni+1/3)F_{i}\in\mathcal{F}(n_{1},n_{0}/3,n_{i}/3,n_{i+1}/3)

satisfying the following for all ii and kk:

Fi(xk)=0 iff yki=0.F_{i}(x_{k})=0\text{ iff }y_{ki}=0. (7.4)

Moreover, when we factorize Fi=PiRiEiF_{i}=P_{i}\circ R_{i}\circ E_{i} into three almost pseudolinear maps, the enrichment factor Ei:n1n0/3E_{i}\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{n_{1}}\to\mathbb{R}^{n_{0}/3} does not depend on the labels, so it can be chosen to be the same map EE for all ii, and thus

Fi=PiRiE.F_{i}=P_{i}\circ R_{i}\circ E. (7.5)

Our assumptions with sufficiently small cc guarantee that the required conditions (7.3) do hold. In order to check (7.2), we first note that a somewhat stronger bound than (7.1) holds, namely

35WCK1log5(niK),i=3,,L1.3^{5}W\geq CK_{1}\log^{5}(n_{i}K),\quad i=3,\ldots,L-1. (7.6)

Indeed, if WK2W\geq K^{2} then using that K1KW1/2K_{1}\leq K\leq W^{1/2} and niWn_{i}\leq W we get

K1log5(niK)W1/2log5(W3/2)=3525W1/2log5W35CWK_{1}\log^{5}(n_{i}K)\leq W^{1/2}\log^{5}(W^{3/2})=\frac{3^{5}}{2^{5}}W^{1/2}\log^{5}W\leq\frac{3^{5}}{C}W

when WW is sufficiently large. If WK2W\leq K^{2} then using that niWK2n_{i}\leq W\leq K^{2} we get

K1log5(niK)K1log5(K3)=35K1log5K35CWK_{1}\log^{5}(n_{i}K)\leq K_{1}\log^{5}(K^{3})=3^{5}K_{1}\log^{5}K\leq\frac{3^{5}}{C}W

where the last step follows from (7.1). Hence, we verified (7.6) for the entire range of WW.

Now, to check (7.2), note that

log5(niK)25(log5ni+log5K)c5ni20c5nini+120\log^{5}(n_{i}K)\leq 2^{5}\big{(}\log^{5}n_{i}+\log^{5}K\big{)}\leq\frac{c_{5}n_{i}}{20}\leq\frac{c_{5}n_{i}n_{i+1}}{20}

where we used that nin_{i} is arbitrarily large and the assumption on KK with a sufficiently small constant cc. Combining this bound with (7.6), we obtain

(nini+1WK1+1)log5(niK3)(35C+c520)nini+1c5nini+19\bigg{(}\frac{n_{i}n_{i+1}}{W}K_{1}+1\bigg{)}\log^{5}\bigg{(}\frac{n_{i}K}{3}\bigg{)}\leq\bigg{(}\frac{3^{5}}{C}+\frac{c_{5}}{20}\bigg{)}n_{i}n_{i+1}\leq\frac{c_{5}n_{i}n_{i+1}}{9}

if CC is sufficiently large. We have checked (7.2).

Step 4. Stacking. To complete the proof, it suffices to construct map F(n1,,nL)F\in\mathcal{F}(n_{1},\ldots,n_{L}) with the following property:

F(x)=0iffFi(x)=0i=3,,L1.F(x)=0\quad\text{iff}\quad F_{i}(x)=0\;\forall i=3,\ldots,L-1. (7.7)

Indeed, this would imply that F(xk)=0F(x_{k})=0 happens iff Fi(xk)=0F_{i}(x_{k})=0 for all ii, which, according to (7.4), is equivalent to yki=0y_{ki}=0 for all ii. By definition of ykiy_{ki}, this is further equivalent to kIik\not\in I_{i} for all ii, which by construction of IiI_{i} is equivalent to k>K1k>K_{1}, which is finally equivalent to yk=0y_{k}=0, as claimed.

Figure 2. Trading width for depth: stacking shallow networks into a deep network.

We construct FF by “stacking” the maps Fi=PiRiEF_{i}=P_{i}\circ R_{i}\circ E for i=3,,L1i=3,\ldots,L-1 as illustrated in Figure 2. To help us stack these maps, we drop some nodes from the original network and first construct

F(n1,,nL)F\in\mathcal{F}(n^{\prime}_{1},\ldots,n^{\prime}_{L})

with some ninin^{\prime}_{i}\leq n_{i}; we can then extend FF trivially to (n1,,nL)\mathcal{F}(n_{1},\ldots,n_{L}). As Figure 2 suggests, we choose n1=n1n^{\prime}_{1}=n_{1}, n2=n0/3n^{\prime}_{2}=n_{0}/3, n3=n0/3+n3/3n^{\prime}_{3}=n_{0}/3+n_{3}/3, ni=n0/3+2ni/3n^{\prime}_{i}=n_{0}/3+2n_{i}/3 for i=4,,L2i=4,\ldots,L-2 (skip these layers if L=5L=5), nL1=2nL1/3n^{\prime}_{L-1}=2n_{L-1}/3, and nL=nL/3n^{\prime}_{L}=n_{L}/3. Note that by definition of n0n_{0}, we indeed have ninin^{\prime}_{i}\leq n_{i} for all ii.

We made this choice so that the network can realize the maps FiF_{i}. As Figure 2 illustrates, the map F3=P3R3E(n1,n0/3,n3/3,n4/3)F_{3}=P_{3}\circ R_{3}\circ E\in\mathcal{F}(n_{1},n_{0}/3,n_{3}/3,n_{4}/3) is realized by setting the factor E:n1n0/3E\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{n_{1}}\to\mathbb{R}^{n_{0}/3} to map the first layer to the second, the factor R3:n0/3n3/3R_{3}\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{n_{0}/3}\to\mathbb{R}^{n_{3}/3} to map the second layer to the last n3/3n_{3}/3 nodes of the third layer, and the factor P3:n3/3n4/3P_{3}\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{n_{3}/3}\to\mathbb{R}^{n_{4}/3} to map further to the last n4/3n_{4}/3 nodes of the fourth layer. Moreover, the output of the second layer is transferred to the first n0/3n_{0}/3 nodes of the third layer by the identity map888Note that the identity map restricted to the image of EE can be realized as an almost pseudolinear map for both ReLU and threshold nonlinearities. For ReLU this is obvious by setting the bias large enough; for threshold nonlinearity note that the image of the almost pseudolinear map EE consists of vectors whose coordinates are either zero or take the same value λ\lambda. Thus, the Heaviside function multiplied by λ\lambda is the identity on the image of EE. II, so we can realize the next map F4F_{4}, and so on.
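For the threshold nonlinearity, the identity map invoked in the footnote can be checked in a couple of lines (lam is a hypothetical stand-in for the common nonzero value λ of the coordinates of vectors in the image of E):

```python
import numpy as np

step = lambda t: (t > 0).astype(float)        # threshold nonlinearity

lam = 0.05                                     # hypothetical value, e.g. (mp)^{-1/2}
rng = np.random.default_rng(5)
u = lam * (rng.random(10) > 0.6)               # a vector with coordinates in {0, lam}
print(np.allclose(lam * step(u), u))           # True: lam * Heaviside is the identity on such vectors
```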

The outputs of all maps FiF_{i} are added together as the signs “+” in Figure 2 indicate. Namely, the components of the output of F3F_{3}, i.e. the last n4/3n_{4}/3 nodes of the fourth layer, are summed together and added to any node (say, the last node) of the fifth layer; the components of the output of F4F_{4}, i.e. the last n5/3n_{5}/3 nodes of the fifth layer, are summed together and added to the last node of the sixth layer, and so on. For ReLU nonlinearity, the ++ refers to addition of real numbers; for threshold nonlinearity, we replace adding by taking the maximum (i.e. the ‘OR’ operation), which is clearly realizable.

Step 5. Conclusion. Due to our construction, the sum of all nLn^{\prime}_{L} components of the function F(x)F(x) computed by the network equals the sum (or maximum, for threshold nonlinearity) of all components of all functions Fi(x)F_{i}(x). Since the components are always nonnegative, F(x)F(x) is zero iff all components of all functions Fi(x)F_{i}(x) are zero. In other words, our claim (7.7) holds. ∎

Appendix A Asymptotical expressions for Gaussian integrals

The asymptotical expansion of the Mills ratio for the normal distribution is well known; see [19]. For our purposes, the first three terms of the expansion will be sufficient:

Ψ(a)=aex2/2𝑑xea2/2=a1a3+3a5+O(a7).\Psi(a)=\frac{\int_{a}^{\infty}e^{-x^{2}/2}\;dx}{e^{-a^{2}/2}}=a^{-1}-a^{-3}+3a^{-5}+O(a^{-7}). (A.1)

In particular, the tail probability of the standard normal random variable γN(0,1)\gamma\sim N(0,1) satisfies

{γ>a}=12πea2/2(a1+O(a3)).\mathbb{P}\left\{\gamma>a\rule{0.0pt}{8.53581pt}\right\}=\frac{1}{\sqrt{2\pi}}e^{-a^{2}/2}\Big{(}a^{-1}+O(a^{-3})\Big{)}. (A.2)

The following two lemmas give asymptotical expressions for the first two moments of the random variable (γa)+=max(γa,0)(\gamma-a)_{+}=\max(\gamma-a,0) where, as before, γN(0,1)\gamma\sim N(0,1) is standard normal.

Lemma A.1 (ReLU of the normal distribution).

Let γN(0,1)\gamma\sim N(0,1). Then, as aa\to\infty, we have

𝔼(γa)+\displaystyle\operatorname{\mathbb{E}}(\gamma-a)_{+} =12πea2/2(a2+O(a4)),\displaystyle=\frac{1}{\sqrt{2\pi}}e^{-a^{2}/2}\Big{(}a^{-2}+O(a^{-4})\Big{)},
𝔼((γa)+)2\displaystyle\operatorname{\mathbb{E}}((\gamma-a)_{+})^{2} =12πea2/2(2a3+O(a5)).\displaystyle=\frac{1}{\sqrt{2\pi}}e^{-a^{2}/2}\Big{(}2a^{-3}+O(a^{-5})\Big{)}.
Proof.

Expressing expectation as the integral of the tail (see e.g. [22, Lemma 1.2.1]), we have

2π𝔼(γa)+\displaystyle\sqrt{2\pi}\,\operatorname{\mathbb{E}}(\gamma-a)_{+} =0(xa)+ex2/2𝑑x=a(xa)ex2/2𝑑x\displaystyle=\int_{0}^{\infty}(x-a)_{+}\,e^{-x^{2}/2}\;dx=\int_{a}^{\infty}(x-a)e^{-x^{2}/2}\;dx
=axex2/2𝑑xaaex2/2𝑑x.\displaystyle=\int_{a}^{\infty}xe^{-x^{2}/2}\;dx-a\int_{a}^{\infty}e^{-x^{2}/2}\;dx.

Using substitution y=x2/2y=x^{2}/2, we see that the value of the first integral on the right hand side is ea2/2e^{-a^{2}/2}. Using the Mills ratio asymptotics (A.1) for the second integral, we get

2π𝔼(γa)+=ea2/2aea2/2(a1a3+O(a5))=ea2/2(a2+O(a4)).\sqrt{2\pi}\,\operatorname{\mathbb{E}}(\gamma-a)_{+}=e^{-a^{2}/2}-a\cdot e^{-a^{2}/2}\Big{(}a^{-1}-a^{-3}+O(a^{-5})\Big{)}=e^{-a^{2}/2}\Big{(}a^{-2}+O(a^{-4})\Big{)}.

This finishes the first part of the lemma.

To prove the second part, we start similarly:

2π𝔼((γa)+)2\displaystyle\sqrt{2\pi}\,\operatorname{\mathbb{E}}((\gamma-a)_{+})^{2} =a(xa)2ex2/2𝑑x\displaystyle=\int_{a}^{\infty}(x-a)^{2}e^{-x^{2}/2}\;dx
=ax2ex2/2𝑑x2aaxex2/2𝑑x+a2aex2/2𝑑x.\displaystyle=\int_{a}^{\infty}x^{2}e^{-x^{2}/2}\;dx-2a\int_{a}^{\infty}xe^{-x^{2}/2}\;dx+a^{2}\int_{a}^{\infty}e^{-x^{2}/2}\;dx.

Integrating by parts, we find that the first integral on the right side equals

aea2/2+aex2/2𝑑x=aea2/2+Ψ(a)ea2/2;ae^{-a^{2}/2}+\int_{a}^{\infty}e^{-x^{2}/2}\;dx=ae^{-a^{2}/2}+\Psi(a)e^{-a^{2}/2};

the second integral equals ea2/2e^{-a^{2}/2} as before, and the third integral equals Ψ(a)ea2/2\Psi(a)e^{-a^{2}/2}. Combining these and using the asymptotical expansion (A.1) for Ψ(a)\Psi(a), we conclude that

2π𝔼((γa)+)2\displaystyle\sqrt{2\pi}\,\operatorname{\mathbb{E}}((\gamma-a)_{+})^{2} =aea2/2+Ψ(a)ea2/22aea2/2+a2Ψ(a)ea2/2\displaystyle=ae^{-a^{2}/2}+\Psi(a)e^{-a^{2}/2}-2ae^{-a^{2}/2}+a^{2}\Psi(a)e^{-a^{2}/2}
=ea2/2((a2+1)Ψ(a)a)=ea2/2(2a3+O(a5)).\displaystyle=e^{-a^{2}/2}\Big{(}(a^{2}+1)\Psi(a)-a\Big{)}=e^{-a^{2}/2}\Big{(}2a^{-3}+O(a^{-5})\Big{)}.

This completes the proof of the second part of the lemma. ∎
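A quick numerical check of Lemma A.1 (illustrative, assuming SciPy is available): the integration by parts in the proof gives the exact values E(γ−a)_+ = pdf(a) − a·Q(a) and E((γ−a)_+)² = (1+a²)Q(a) − a·pdf(a), where Q(a) = P(γ > a), and the ratios of these exact values to the leading asymptotic terms of the lemma tend to 1 as a grows.

```python
import numpy as np
from scipy.stats import norm

for a in (4.0, 8.0, 16.0):
    m1 = norm.pdf(a) - a * norm.sf(a)                          # exact E (gamma - a)_+
    m2 = (1 + a ** 2) * norm.sf(a) - a * norm.pdf(a)           # exact E ((gamma - a)_+)^2
    lead1 = np.exp(-a ** 2 / 2) / (np.sqrt(2 * np.pi) * a ** 2)
    lead2 = 2 * np.exp(-a ** 2 / 2) / (np.sqrt(2 * np.pi) * a ** 3)
    print(a, m1 / lead1, m2 / lead2)                           # both ratios approach 1
```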

Lemma A.2 (Stability).

Fix any z>1z>-1. Let γN(0,1)\gamma\sim N(0,1). Then, as aa\to\infty, we have

{γ1+z>a}{γ>a}\displaystyle\frac{\mathbb{P}\{\gamma\sqrt{1+z}>a\}}{\mathbb{P}\left\{\gamma>a\rule{0.0pt}{8.53581pt}\right\}} =exp(a2z2(1+z))(1+z)1/2(1+O(a2));\displaystyle=\exp\bigg{(}\frac{a^{2}z}{2(1+z)}\bigg{)}\,(1+z)^{1/2}\,\Big{(}1+O(a^{-2})\Big{)}; (A.3)
𝔼(γ1+za)+𝔼(γa)+\displaystyle\frac{\operatorname{\mathbb{E}}(\gamma\sqrt{1+z}-a)_{+}}{\operatorname{\mathbb{E}}(\gamma-a)_{+}} =exp(a2z2(1+z))(1+z)(1+O(a2));\displaystyle=\exp\bigg{(}\frac{a^{2}z}{2(1+z)}\bigg{)}\,(1+z)\,\Big{(}1+O(a^{-2})\Big{)}; (A.4)
𝔼((γ1+za)+)2𝔼((γa)+)2\displaystyle\frac{\operatorname{\mathbb{E}}\big{(}(\gamma\sqrt{1+z}-a)_{+}\big{)}^{2}}{\operatorname{\mathbb{E}}((\gamma-a)_{+})^{2}} =exp(a2z2(1+z))(1+z)3/2(1+O(a2)).\displaystyle=\exp\bigg{(}\frac{a^{2}z}{2(1+z)}\bigg{)}\,(1+z)^{3/2}\,\Big{(}1+O(a^{-2})\Big{)}. (A.5)
Proof.

Use the asymptotics in (A.2) and Lemma A.1 for aa and a/1+za/\sqrt{1+z} and simplify. ∎

We complete this paper by proving an elementary monotonicity property for Gaussian integrals, which we used in the proof of Proposition 3.2.

Lemma A.3 (Monotonicity).

Let ψ:[0,)\psi\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}\to[0,\infty) be a nondecreasing function satisfying ψ(t)=0\psi(t)=0 for all t<0t<0, and let γN(0,1)\gamma\sim N(0,1). Then σ𝔼ψ(σγ)\sigma\mapsto\operatorname{\mathbb{E}}\psi(\sigma\gamma) is a nondecreasing function on [0,)[0,\infty).

Proof.

Denoting by f(x)f(x) the probability density function of N(0,1)N(0,1), we have

𝔼ψ(σγ)=ψ(σx)f(x)𝑑x=0ψ(σx)f(x)𝑑x.\operatorname{\mathbb{E}}\psi(\sigma\gamma)=\int_{-\infty}^{\infty}\psi(\sigma x)f(x)\,dx=\int_{0}^{\infty}\psi(\sigma x)f(x)\,dx.

The last step follows since, by assumption, ψ(σx)=0\psi(\sigma x)=0 for all x<0x<0. To complete the proof, it remains to note that for every fixed x0x\geq 0, the function σψ(σx)\sigma\mapsto\psi(\sigma x) is nondecreasing. ∎

References

  • [1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
  • [2] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
  • [3] Pierre Baldi and Roman Vershynin. On neuronal capacity. In Advances in Neural Information Processing Systems, pages 7729–7738, 2018.
  • [4] Pierre Baldi and Roman Vershynin. The capacity of feedforward neural networks. Neural Networks, 116:288–311, 2019.
  • [5] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.
  • [6] Eric B Baum. On the capabilities of multilayer perceptrons. Journal of complexity, 4(3):193–215, 1988.
  • [7] Eric B Baum and David Haussler. What size net gives valid generalization? In Advances in neural information processing systems, pages 81–90, 1989.
  • [8] Thomas M Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers, (3):326–334, 1965.
  • [9] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
  • [10] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • [11] Rong Ge, Runzhe Wang, and Haoyu Zhao. Mildly overparametrized neural nets can memorize training data efficiently. arXiv preprint arXiv:1909.11837, 2019.
  • [12] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
  • [13] Guang-Bin Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274–281, 2003.
  • [14] Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. arXiv preprint arXiv:1909.12292, 2019.
  • [15] Adam Kowalczyk. Estimates of storage capacity of multilayer perceptron with threshold logic hidden units. Neural networks, 10(8):1417–1433, 1997.
  • [16] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.
  • [17] GJ Mitchison and RM Durbin. Bounds on the learning capacity of some multi-layer networks. Biological Cybernetics, 60(5):345–365, 1989.
  • [18] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.
  • [19] Iosif Pinelis. Monotonicity properties of the relative error of a Padé approximation for Mills’ ratio. J. Inequal. Pure Appl. Math., 3(2):1–8, 2002.
  • [20] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix Chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
  • [21] Ruoyu Sun. Optimization for deep learning: theory and algorithms. arXiv preprint arXiv:1912.08957, 2019.
  • [22] Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  • [23] Masami Yamasaki. The lower bound of the capacity for a neural network with multiple hidden layers. In International Conference on Artificial Neural Networks, pages 546–549. Springer, 1993.
  • [24] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity. In Advances in Neural Information Processing Systems, pages 15532–15543, 2019.
  • [25] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.
  • [26] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.
  • [27] Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. arXiv preprint arXiv:1906.04688, 2019.