
Mildly Overparameterized ReLU Networks Have a Favorable Loss Landscape

Kedar Karhadkar kedar@math.ucla.edu
University of California, Los Angeles
Michael Murray mmurray@math.ucla.edu
University of California, Los Angeles
Hanna Tseran hanna.tseran@gmail.com
University of Tokyo
Guido Montúfar montufar@math.ucla.edu
University of California, Los Angeles
Max Planck Institute for Mathematics in the Sciences
Abstract

We study the loss landscape of both shallow and deep, mildly overparameterized ReLU neural networks on a generic finite input dataset for the squared error loss. We show both by count and volume that most activation patterns correspond to parameter regions with no bad local minima. Furthermore, for one-dimensional input data, we show most activation regions realizable by the network contain a high dimensional set of global minima and no bad local minima. We experimentally confirm these results by finding a phase transition from most regions having full rank Jacobian to many regions having deficient rank depending on the amount of overparameterization.

1 Introduction

The optimization landscape of neural networks has been a topic of enormous interest over the years. A particularly puzzling question is why bad local minima do not seem to be a problem for training. In this context, an important observation is that overparameterization can yield a more benevolent optimization landscape. While this type of observation can be traced back at least to the works of Poston et al. (1991); Gori & Tesi (1992), it has increasingly become a focus of investigation in recent years. In this article we follow this line of work and describe the optimization landscape of a moderately overparameterized ReLU network with a focus on the different activation regions of the parameter space.

Before going into the details of our results we first provide some context. The existence of non-global local minima has been documented in numerous works. This is the case even for networks without hidden units (Sontag & Sussmann, 1989) and single units (Auer et al., 1995). For shallow networks, Fukumizu & Amari (2000) showed how local minima and plateaus can arise from the hierarchical structure of the model for general types of targets, loss functions and activation functions. Other works have also constructed concrete examples of non-global local minima for several architectures (Swirszcz et al., 2017). While these works considered finite datasets, Safran & Shamir (2018) observed that spurious minima are also common for the student-teacher population loss of a two-layer ReLU network with unit output weights. In fact, for ReLU networks Yun et al. (2019b); Goldblum et al. (2020) show that if linear models cannot perfectly fit the data, one can construct local minima that are not global. He et al. (2020) show the existence of spurious minima for arbitrary piecewise linear (non-linear) activations and arbitrary continuously differentiable loss functions.

A number of works suggest overparameterized networks have a more benevolent optimization landscape than underparameterized ones. For instance, Soudry & Carmon (2016) show for mildly overparameterized networks with leaky ReLUs that for almost every dataset every differentiable local minimum of the squared error training loss is a zero-loss global minimum, provided a certain type of dropout noise is used. In addition, for the case of one hidden layer they show this is the case whenever the number of weights in the first layer is at least the number of training samples, $d_{0}d_{1}\geq n$. In another work, Safran & Shamir (2016) used a computer aided approach to study the existence of spurious local minima, arriving at the key finding that when the student and teacher networks are equal in width spurious local minima are commonly encountered. However, under only mild overparameterization, no spurious local minima were identified, implying again that overparameterization leads to a more benign loss landscape. In the work of Tian (2017), the student and teacher networks are assumed to be equal in width and it is shown that if the dimension of the data is sufficiently large relative to the width, then the critical points lying outside the span of the ground truth weight vectors of the teacher network form manifolds. Another related line of work studies the connectivity of sublevel sets (Nguyen, 2019) and bounds on the overparameterization that ensures existence of descent paths (Sharifnassab et al., 2020).

The key objects in our investigation will be the rank of the Jacobian of the parametrization map over a finite input dataset, and the combinatorics of the corresponding subdivision of parameter space into pieces where the loss is differentiable. The Jacobian map captures the local dependency of the functions represented over the dataset on their parameters. The notion of connecting parameters and functions is prevalent in old and new studies of neural networks: for instance, in discussions of parameter symmetries (Chen et al., 1993), functional equivalence (Phuong & Lampert, 2020), functional dimension (Grigsby et al., 2022), the question of when a ReLU network is a universal approximator over a finite input dataset (Yun et al., 2019a), as well as in studies involving the neural tangent kernel (Jacot et al., 2018).

We highlight a few works that take a geometric or combinatorial perspective to discuss the optimization landscape. Using dimension counting arguments, Cooper (2021) showed that, under suitable overparameterization and smoothness assumptions, the set of zero-training loss parameters has the expected codimension, equal to the number of training data points. In the case of ReLU networks, Laurent & von Brecht (2018) used the piecewise multilinear structure of the parameterization map to describe the location of the minimizers of the hinge loss. Further, for piecewise linear activation functions Zhou & Liang (2018) partition the parameter space into cones corresponding to different activation patterns of the network to show that, while linear networks have no spurious minima, shallow ReLU networks do. In a similar vein, Liu (2021) studies one hidden layer ReLU networks and shows for a convex loss that differentiable local minima in an activation region are global minima within the region, as well as providing conditions for the existence of differentiable local minima, saddles, and non-differentiable local minima. Considering the parameter symmetries, Simsek et al. (2021) show that the level of overparameterization changes the relative number of subspaces that make up the set of global minima. As a result, overparameterization implies a larger set of global minima relative to the set of critical points, and underparameterization the reverse. Wang et al. (2022) studied the optimization landscape in two-layer ReLU networks and showed that all optimal solutions of the non-convex loss can be found via the optimal solutions of a convex program.

1.1 Contributions

For both shallow and deep neural networks we show most linear regions of parameter space have no bad local minima and often contain a high-dimensional space of global minima. We examine the loss landscape for various scalings of the input dimension $d_{0}$, hidden dimensions $d_{l}$, and number of data points $n$.

  • Theorem 5 shows for two-layer networks that if $d_{0}d_{1}\geq n$ and $d_{1}=\Omega(\log(\frac{n}{\epsilon d_{0}}))$, then all activation regions, except for a fraction of size at most $\epsilon$, have no bad local minima. We establish this by studying the rank of the Jacobian with respect to the parameters. By appealing to results from random matrix theory on binary matrices, we show that for most choices of activation pattern the Jacobian has full rank. Given this, all local minima within such a region are zero-loss global minima. For generic high-dimensional input data ($d_{0}\geq n$), all activation regions are non-empty, so this implies that most non-empty activation regions have no bad local minima. We extend these results to the deep case in Theorem 14.

  • In Theorem 10, we specialize to the case of one-dimensional input data, $d_{0}=1$, and consider two-layer networks with a bias term. We show that if $d_{1}=\Omega(n\log(\frac{n}{\epsilon}))$, all but at most a fraction $\epsilon$ of non-empty linear regions in parameter space will have no bad local minima. We remark that this includes the non-differentiable local minima on the boundary between activation regions, as per Theorem 13. Further, in contrast to Theorem 5, which looks at all binary matrices of potential activation patterns, in the one-dimensional case we are able to explicitly enumerate the binary matrices which correspond to non-empty activation regions.

  • Theorem 12 continues our investigation of one-dimensional input data, this time concerning the existence of global minima within a region. Suppose that the output head $v$ has $d_{+}$ positive weights and $d_{-}$ negative weights, with $d_{+}+d_{-}=d_{1}$. We show that if $d_{+},d_{-}=\Omega(n\log(\frac{n}{\epsilon}))$, then all but at most a fraction $\epsilon$ of non-empty linear regions in parameter space will have global minima. Moreover, the regions with global minima will contain an affine set of global minima of codimension $n$.

  • In addition to counting the number of activation regions with bad local minima, Proposition 15 and Theorem 17 provide bounds on the fraction of regions with bad local minima by volume, under additional assumptions on the data, notably anti-concentration. These results imply that mild overparameterization again suffices to ensure that the ‘size’ of activation regions with bad local minima is small not only as measured by number but also by volume.

1.2 Relation to prior works

As indicated above, several works have studied sets of local and global critical points of two-layer ReLU networks, so a comparison is in order. We take a different approach by identifying the number of regions which have a favorable optimization landscape, and as a result are able to avoid having to make certain assumptions about the dataset or network. Since we are able to exclude pathological regions with many dead neurons, we can formulate our results for ReLU activations rather than leaky ReLU or other smooth activation functions. In contrast to Soudry & Carmon (2016), we do not assume dropout noise on the outputs of the network, and as a result global minima in our setting typically attain zero loss. Unlike Safran & Shamir (2018), we do not assume any particular distribution on our datasets; our results hold for almost all datasets (a set of full Lebesgue measure). Extremely overparameterized networks (with $d_{1}=\Omega(n^{2})$) are known to follow lazy training (see Chizat et al., 2019); our theorems hold under more realistic assumptions of mild overparameterization, $d_{1}=\Omega(n\log n)$, or even $d_{1}=\Omega(1)$ for high-dimensional inputs. We are able to avoid excessive overparameterization by emphasizing qualitative aspects of the loss landscape, using only the rank of the Jacobian rather than, say, the smallest eigenvalue of the neural tangent kernel.

2 Preliminaries

Before specializing to specific neural network architectures, we introduce general definitions which encompass all of the models we will study. For any dd\in\mathbb{N} we will write [d]:={1,,d}[d]:=\{1,\ldots,d\}. We will write 𝟏d\bm{1}_{d} for a vector of dd ones, and drop the subscript when the dimension is clear from the context. The Hadamard (entry-wise) product of two matrices AA and BB of the same dimension is defined as AB:=(AijBij)A\odot B:=(A_{ij}\cdot B_{ij}). The Kronecker product of two vectors unu\in\mathbb{R}^{n} and vmv\in\mathbb{R}^{m} is defined as uv:=(uivj)nmu\otimes v:=(u_{i}v_{j})\in\mathbb{R}^{nm}.

Let [x1,,xn]\mathbb{R}[x_{1},\ldots,x_{n}] denote the set of polynomials in variables x1,,xnx_{1},\ldots,x_{n} with real coefficients. We say that a set 𝒱n\mathcal{V}\subseteq\mathbb{R}^{n} is an algebraic set if there exist f1,,fm[x1,,xn]f_{1},\ldots,f_{m}\in\mathbb{R}[x_{1},\ldots,x_{n}] such that 𝒱\mathcal{V} is the zero locus of the polynomials fif_{i}; that is,

𝒱={xn:fi(x)=0 for all i[m]}.\displaystyle\mathcal{V}=\{x\in\mathbb{R}^{n}:f_{i}(x)=0\text{ for all }i\in[m]\}.

Clearly, \emptyset and n\mathbb{R}^{n} are algebraic, being zero sets of f=1f=1 and f=0f=0 respectively. A finite union of algebraic subsets is algebraic, as well as an arbitrary intersection of algebraic subsets. In other words, algebraic sets form a topological space. Its topology is known as the Zariski topology. The following lemma is a basic fact of algebraic geometry, and follows from subdividing algebraic sets into submanifolds of n\mathbb{R}^{n}.

Lemma 1.

Let 𝒱\mathcal{V} be a proper algebraic subset of n\mathbb{R}^{n}. Then 𝒱\mathcal{V} has Lebesgue measure 0.

For more details on the above facts, we refer the reader to Harris (2013). Justified by Lemma 1, we say that a property PP depending on xnx\in\mathbb{R}^{n} holds for generic xx if there exists a proper algebraic set 𝒱\mathcal{V} such that PP holds whenever x𝒱x\notin\mathcal{V}. So if PP is a property that holds for generic xx, then in particular it holds for a set of full Lebesgue measure.

We consider input data

X=(x(1),,x(n))d×nX=(x^{(1)},\ldots,x^{(n)})\in\mathbb{R}^{d\times n}

and output data

y=(y(1),,y(n))1×n.y=(y^{(1)},\ldots,y^{(n)})\in\mathbb{R}^{1\times n}.

A parameterized model with parameter space m\mathbb{R}^{m} is a mapping

F:m×d.F:\mathbb{R}^{m}\times\mathbb{R}^{d}\to\mathbb{R}.

We overload notation and also define FF as a map from m×d×n\mathbb{R}^{m}\times\mathbb{R}^{d\times n} to 1×n\mathbb{R}^{1\times n} by

F(θ,(x(1),,x(n))):=(F(θ,x(1)),F(θ,x(2)),,F(θ,x(n))).F(\theta,(x^{(1)},\ldots,x^{(n)})):=(F(\theta,x^{(1)}),F(\theta,x^{(2)}),\ldots,F(\theta,x^{(n)})).

Whether we are thinking of FF as a mapping on individual data points or on datasets will be clear from the context. We define the mean squared error loss L:m×d×n×1×n1L\colon\mathbb{R}^{m}\times\mathbb{R}^{d\times n}\times\mathbb{R}^{1\times n}\to\mathbb{R}^{1} by

L(θ,X,y):=\displaystyle L(\theta,X,y):= 12i=1n(F(θ,x(i))y(i))2\displaystyle\frac{1}{2}\sum_{i=1}^{n}(F(\theta,x^{(i)})-y^{(i)})^{2}
=\displaystyle= 12F(θ,X)y2.\displaystyle\frac{1}{2}\|F(\theta,X)-y\|^{2}. (1)

For a fixed dataset (X,y)(X,y), let 𝒢X,ym\mathcal{G}_{X,y}\subseteq\mathbb{R}^{m} denote the set of global minima of the loss LL; that is,

𝒢X,y={θm:L(θ,X,y)=infϕmL(ϕ,X,y)}.\mathcal{G}_{X,y}=\left\{\theta\in\mathbb{R}^{m}:L(\theta,X,y)=\inf_{\phi\in\mathbb{R}^{m}}L(\phi,X,y)\right\}.

If there exists θm\theta^{*}\in\mathbb{R}^{m} such that F(θ,X)=yF(\theta^{*},X)=y, then L(θ,X,y)=0L(\theta^{*},X,y)=0, so θ\theta^{*} is a global minimum. In such a case,

𝒢X,y\displaystyle\mathcal{G}_{X,y} ={θm:F(θ,X)=y}.\displaystyle=\left\{\theta\in\mathbb{R}^{m}:F(\theta,X)=y\right\}.

For a dataset (X,y)(X,y), we say that θm\theta\in\mathbb{R}^{m} is a local minimum if there exists an open set 𝒰m\mathcal{U}\subseteq\mathbb{R}^{m} containing θ\theta such that L(θ,X,y)L(ϕ,X,y)L(\theta,X,y)\leq L(\phi,X,y) for all ϕ𝒰\phi\in\mathcal{U}. We say that θ\theta is a bad local minimum if it is a local minimum but not a global minimum.

A key indicator of the local optimization landscape of a neural network is the rank of the Jacobian of the map $F$ with respect to the parameters $\theta$. We will use the following observation.

Lemma 2.

Fix a dataset $(X,y)\in\mathbb{R}^{d\times n}\times\mathbb{R}^{1\times n}$. Let $F$ be a parameterized model and let $\theta\in\mathbb{R}^{m}$ be a differentiable critical point of the squared error loss (1). If $\operatorname{rank}(\nabla_{\theta}F(\theta,X))=n$, then $\theta$ is a global minimizer.

Proof.

Suppose that $\theta\in\mathbb{R}^{m}$ is a differentiable critical point of $L$. Then

0=\nabla_{\theta}L(\theta,X,y)=\nabla_{\theta}F(\theta,X)\cdot(F(\theta,X)-y).

Since $\operatorname{rank}(\nabla_{\theta}F(\theta,X))=n$, this implies that $F(\theta,X)-y=0$. In other words, $\theta$ is a global minimizer. ∎
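The rank criterion of Lemma 2 is easy to probe numerically. The following minimal sketch (not part of the paper's analysis; the toy model, dimensions, and finite-difference step are our own choices) estimates $\nabla_{\theta}F(\theta,X)$ for a small two-layer ReLU model by central differences and reports its rank:

```python
import numpy as np

def jacobian_fd(F, theta, X, eps=1e-6):
    """Central-difference estimate of the Jacobian of theta -> F(theta, X), shape (m, n)."""
    m, n_out = theta.size, F(theta, X).size
    J = np.zeros((m, n_out))
    for i in range(m):
        e = np.zeros(m)
        e[i] = eps
        J[i] = (F(theta + e, X) - F(theta - e, X)) / (2 * eps)
    return J

# Toy two-layer ReLU model with theta = (W, v) flattened; dimensions are arbitrary.
d0, d1, n = 4, 8, 6

def F(theta, X):
    W = theta[: d1 * d0].reshape(d1, d0)
    v = theta[d1 * d0 :]
    return v @ np.maximum(W @ X, 0.0)  # outputs on all n data points, shape (n,)

rng = np.random.default_rng(0)
theta = rng.normal(size=d1 * d0 + d1)
X = rng.normal(size=(d0, n))

J = jacobian_fd(F, theta, X)
print("rank of the Jacobian:", np.linalg.matrix_rank(J), "   n =", n)
# If the rank is n, Lemma 2 says any differentiable critical point in this region is a global minimum.
```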

Finally, in what follows given data XX we study activation regions in parameter space. Informally, these are sets of parameters which give rise to particular activation patterns of the neurons of a network over the dataset. We will define this notion formally for each setting we study in the subsequent sections.

3 Shallow ReLU networks: counting activation regions with bad local minima

Here we focus on a two-layer network with $d_{0}$ inputs, one hidden layer of $d_{1}$ ReLUs, and an output layer. The key takeaway of this section is that moderate overparameterization is sufficient to ensure that bad local minima are scarce. With regard to setup, our parameter space is the input weight matrix in $\mathbb{R}^{d_{1}\times d_{0}}$. Our input dataset will be an element $X\in\mathbb{R}^{d_{0}\times n}$ and our output dataset an element $y\in\mathbb{R}^{1\times n}$, where $n$ is the cardinality of the dataset. The model $F:(\mathbb{R}^{d_{1}\times d_{0}}\times\mathbb{R}^{d_{1}})\times\mathbb{R}^{d_{0}}\to\mathbb{R}$ is defined by

F(W,v,x)=v^{T}\sigma(Wx),

where $\sigma$ is the ReLU activation function $s\mapsto\max\{0,s\}$ applied componentwise. We write $W=(w^{(1)},\ldots,w^{(d_{1})})^{T}$, where $w^{(i)}\in\mathbb{R}^{d_{0}}$ is the $i$th row of $W$. Since $\sigma$ is piecewise linear, for any finite input dataset $X$ we may split the parameter space into a finite number of regions on which $F$ is linear in $W$ (and linear in $v$). For any binary matrix $A\in\{0,1\}^{d_{1}\times n}$ and input dataset $X\in\mathbb{R}^{d_{0}\times n}$, we define a corresponding activation region in parameter space by

\mathcal{S}^{A}_{X}:=\left\{W\in\mathbb{R}^{d_{1}\times d_{0}}:(2A_{ij}-1)\langle w^{(i)},x^{(j)}\rangle>0\text{ for all }i\in[d_{1}],j\in[n]\right\}.

This is a polyhedral cone defined by linear inequalities for each $w^{(i)}$. For each $A\in\{0,1\}^{d_{1}\times n}$ and $X\in\mathbb{R}^{d_{0}\times n}$, we have $F(W,v,X)=v^{T}(A\odot(WX))$ for all $W\in\mathcal{S}^{A}_{X}$, which is linear in $W$. The Jacobian with respect to $v$ is $A\odot(WX)$, and with respect to $W$ it is

\nabla_{\theta}F(W,X)=[v_{i}A_{ij}x^{(j)}]_{ij}=[(v\odot a^{(j)})\otimes x^{(j)}]_{j},\quad\text{for all }W\in\mathcal{S}^{A}_{X},\ A\in\{0,1\}^{d_{1}\times n},

where $a^{(j)}$ denotes the $j$th column of $A$. To show that the Jacobian $\nabla_{\theta}F$ has rank $n$, we need to ensure that the activation matrix $A$ does not have too much linear dependence between its rows. The following result, due to Bourgain et al. (2010), establishes this for most choices of $A$.

Theorem 3.

Let $A$ be a $d\times d$ matrix whose entries are iid random variables sampled uniformly from $\{0,1\}$. Then $A$ is singular with probability at most

\left(\frac{1}{\sqrt{2}}+o(1)\right)^{d}.
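To get a feel for the scale of this bound, one can estimate the singularity probability of random $\{0,1\}$ matrices by direct sampling. The sketch below is only illustrative; the trial counts and matrix sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 20000
for d in (4, 8, 12, 16):
    A = rng.integers(0, 2, size=(trials, d, d)).astype(float)
    # A 0/1 matrix has an integer determinant, so |det| < 0.5 means the matrix is singular.
    dets = np.linalg.det(A)
    print(f"d = {d:2d}   P(singular) approx {np.mean(np.abs(dets) < 0.5):.4f}"
          f"   (2**(-d/2) = {2.0 ** (-d / 2):.4f})")
```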

The next lemma shows that the specific values of vv are not relevant to the rank of the Jacobian.

Lemma 4.

Let $a^{(j)}\in\mathbb{R}^{d_{1}}$, $x^{(j)}\in\mathbb{R}^{d_{0}}$ for $j\in[n]$ and $v\in\mathbb{R}^{d_{1}}$ be vectors, with $v_{i}\neq 0$ for all $i\in[d_{1}]$. Then

\operatorname{rank}(\{(v\odot a^{(j)})\otimes x^{(j)}:j\in[n]\})=\operatorname{rank}(\{a^{(j)}\otimes x^{(j)}:j\in[n]\}).
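The following sketch (our own illustration, with arbitrary dimensions) builds the activation pattern $A$ of a random parameter $W$, stacks the Jacobian rows $(v\odot a^{(j)})\otimes x^{(j)}$ from the formula above, and compares the rank with and without the scaling by $v$, as Lemma 4 predicts:

```python
import numpy as np

rng = np.random.default_rng(1)
d0, d1, n = 3, 10, 20                      # mildly overparameterized: d0 * d1 >= n
X = rng.normal(size=(d0, n))               # generic input data
W = rng.normal(size=(d1, d0))
v = rng.normal(size=d1)                    # generically all entries are nonzero

A = (W @ X > 0).astype(float)              # activation pattern of the region containing W

# Jacobian rows with respect to W: (v ⊙ a^(j)) ⊗ x^(j), one row per data point,
# and the same rows with v replaced by the all-ones vector.
rows_v = np.stack([np.kron(v * A[:, j], X[:, j]) for j in range(n)])
rows_1 = np.stack([np.kron(A[:, j], X[:, j]) for j in range(n)])

# Lemma 4: rescaling by the nonzero entries of v leaves the rank unchanged.
# In mildly overparameterized settings the rank is typically n.
print(np.linalg.matrix_rank(rows_v), np.linalg.matrix_rank(rows_1), "   n =", n)
```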

Using algebraic techniques, we show that for generic $X$, the rank of the Jacobian is determined by $A$. Combining this with the results above, Lemma 2 then implies that for most activation regions the smooth critical points are global minima (see full details in Appendix A):

Theorem 5.

Let $\epsilon>0$. If

d_{1}\geq\max\left(\frac{n}{d_{0}},\Omega\left(\log\left(\frac{n}{\epsilon d_{0}}\right)\right)\right),

then for generic datasets $(X,y)$ the following holds. In all but at most $\lceil\epsilon 2^{nd_{1}}\rceil$ activation regions (i.e., an $\epsilon$ fraction of all regions), every differentiable critical point of $L$ with nonzero entries for $v$ is a global minimum.

The takeaway of Theorem 5 is that most activation regions do not contain bad local minima. However, for a given dataset $X$, not all activation regions, each encoded by a binary matrix $A$, may be realizable by the network. Equivalently, $A$ is not realizable if and only if $\mathcal{S}^{A}_{X}=\emptyset$, and we call such activation regions empty. We therefore turn our attention to evaluating the number of non-empty activation regions. Different activation regions in parameter space are separated by the hyperplanes $\{W\colon\langle w^{(i)},x^{(j)}\rangle=0\}$, $i\in[d_{1}]$, $j\in[n]$. We say that a set of vectors in a $d$-dimensional space is in general position if any $k\leq d$ of them are linearly independent, which is a generic property. Standard results on hyperplane arrangements give the following proposition.

Proposition 6 (Number of non-empty regions).

Consider a network with one layer of d1d_{1} ReLUs. If the columns of XX are in general position in a dd-dimensional linear space, then the number of non-empty activation regions in the parameter space is (2k=0d1(n1k))d1(2\sum_{k=0}^{d-1}{n-1\choose k})^{d_{1}}.
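As an illustration (not from the paper), the per-neuron count in Proposition 6 can be compared against the number of distinct patterns $\operatorname{sgn}(X^{T}w)$ observed over many random directions $w$. The data and sample sizes below are arbitrary choices, and the empirical count is only a lower bound that approaches the prediction as more samples are drawn:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
d0, n = 3, 6                               # generic data spans a d = d0 dimensional space
X = rng.normal(size=(d0, n))

# Distinct per-neuron activation patterns sgn(X^T w) over many random directions w.
W = rng.normal(size=(200_000, d0))
patterns = set(map(tuple, (W @ X > 0).astype(np.int8)))

predicted = 2 * sum(comb(n - 1, k) for k in range(d0))   # per-neuron count in Proposition 6
print("empirical (lower bound):", len(patterns), "   predicted:", predicted)
# The count for a whole layer of d1 neurons is predicted ** d1.
```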

The formula provided in Proposition 6 is equal to $2^{nd_{1}}$ if and only if $n\leq d$, and is $O(n^{dd_{1}})$ if $n>d$. We therefore observe that if $d$ is large in relation to $n$ and the data is in general position, then by Proposition 6 most activation regions are non-empty. Thus we obtain the following corollary of Theorem 5.

Corollary 7.

Under the same assumptions as Theorem 5, if dnd\geq n, then for XX in general position and arbitrary yy, the following holds. In all but at most an ϵ\epsilon fraction of all non-empty activation regions, every differentiable critical point of LL with nonzero entries for vv is a zero loss global minimum.

More generally, for an arbitrary dataset that is not necessarily in general position the regions can be enumerated using a celebrated formula by Zaslavsky (1975), in terms of the intersection poset of the hyperplanes. Moreover, importantly one can show that the maximal number of non-empty regions is attained when the dataset is in general position.

From the above discussion we conclude we have a relatively good understanding of how many activation regions are non-empty. Notably, the number is the same for any dataset that is in general position. However, it is worth reflecting on the fact that the identity of the non-empty regions depends more closely on the specific dataset and is harder to catalogue. For a given dataset XX the non-empty regions correspond to the vertices of the Minkowski sum of the line segments with end points 0,x(j)0,x^{(j)}, for j[n]j\in[n], as can be inferred from results in tropical geometry (see Joswig, 2021).

Proposition 8 (Identity of non-empty regions).

Let A{0,1}d1×nA\in\{0,1\}^{d_{1}\times n}. The corresponding activation region is non-empty if and only if j:Aij=1x(j)\sum_{j:A_{ij}=1}x^{(j)} is a vertex of j[n]conv{0,x(j)}\sum_{j\in[n]}\operatorname{conv}\{0,x^{(j)}\} for all i[d1]i\in[d_{1}].

This provides a sense of which activation regions are non-empty, depending on XX. The explicit list of non-empty regions is known in the literature as the list of maximal covectors of an oriented matroid (see Björner et al., 1999), which can be interpreted as a combinatorial type of the dataset.

4 Shallow univariate ReLU networks: activation regions with global vs local minima

In Section 3 we showed that mild overparameterization suffices to ensure that most activation regions do not contain bad local minima. This result however does not discuss the relative scarcity of activation regions with global versus local minima. To this end in this section we take a closer look at the case of a single input dimension, d0=1d_{0}=1. Importantly, in the univariate setting we are able to entirely characterize the realizable (non-empty) activation regions.

Consider a two-layer ReLU network with input dimension d0=1d_{0}=1, hidden dimension d1d_{1}, and a dataset consisting of nn data points. We suppose the network has a bias term bd1b\in\mathbb{R}^{d_{1}}. The model F:(d1×d1×d1)×F:(\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{1}})\times\mathbb{R}\to\mathbb{R} is given by

F(w,b,v,x)=vTσ(wx+b).F(w,b,v,x)=v^{T}\sigma(wx+b).

Since we include a bias term here, we define the activation region 𝒮XA\mathcal{S}^{A}_{X} by

𝒮XA:={(w,b)d1×d1:(2Aij1)(w(i)x(j)+b(i))>0 for all i[d1],j[n]}.\mathcal{S}^{A}_{X}:=\left\{(w,b)\in\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{1}}:(2A_{ij}-1)(w^{(i)}x^{(j)}+b^{(i)})>0\text{ for all }i\in[d_{1}],j\in[n]\right\}.

In this one-dimensional case, we first obtain new bounds on the fraction of favorable activation regions to show that most non-empty activation regions have no bad differentiable local minima. We begin with a characterization of which activation regions are non-empty. For k[n+1]k\in[n+1], we introduce the step vectors ξ(k,0),ξ(k,1)n\xi^{(k,0)},\xi^{(k,1)}\in\mathbb{R}^{n}, defined by

(ξ(k,0))i={1 if i<k,0 if ik,and(ξ(k,1))i={0 if i<k,1 if ik.\displaystyle(\xi^{(k,0)})_{i}=\begin{cases}1&\text{ if }i<k,\\ 0&\text{ if }i\geq k\end{cases},\quad\text{and}\quad(\xi^{(k,1)})_{i}=\begin{cases}0&\text{ if }i<k,\\ 1&\text{ if }i\geq k\end{cases}.

Note that ξ(1,0)=ξ(n+1,1)=0\xi^{(1,0)}=\xi^{(n+1,1)}=0 and ξ(n+1,0)=ξ(1,1)=1\xi^{(n+1,0)}=\xi^{(1,1)}=1. There are 2n2n step vectors in total. Intuitively, step vectors describe activation regions because all data points on one side of a threshold value activate the neuron. The following lemma makes this notion precise.

Lemma 9.

Fix a dataset (X,y)(X,y) with x(1)<x(2)<<x(n)x^{(1)}<x^{(2)}<\cdots<x^{(n)}. Let A{0,1}d1×nA\in\{0,1\}^{d_{1}\times n} be a binary matrix. Then 𝒮XA\mathcal{S}^{A}_{X} is non-empty if and only if the rows of AA are step vectors. In particular, there are exactly (2n)d1(2n)^{d_{1}} non-empty activation regions.
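A small empirical check of Lemma 9 (our own sketch, with arbitrary sizes): enumerate the $2n$ step vectors and verify that every activation pattern realized by randomly sampled $(w,b)$ has step-vector rows once the data is sorted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1 = 5, 4
x = np.sort(rng.normal(size=n))                            # sorted one-dimensional data

# Step vectors xi^(k,0) and xi^(k,1) for k = 1, ..., n+1 (2n distinct vectors in total).
steps = {tuple(1 if i < k else 0 for i in range(1, n + 1)) for k in range(1, n + 2)}
steps |= {tuple(1 if i >= k else 0 for i in range(1, n + 1)) for k in range(1, n + 2)}
print("number of distinct step vectors:", len(steps))      # expect 2n

# Every activation pattern realized by a random (w, b) should have step-vector rows.
for _ in range(1000):
    w, b = rng.normal(size=d1), rng.normal(size=d1)
    A = (np.outer(w, x) + b[:, None] > 0).astype(int)       # shape (d1, n)
    assert all(tuple(row) in steps for row in A)
print("all sampled activation patterns have step-vector rows")
```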

Using this characterization of the non-empty activation regions, we show that most activation patterns corresponding to these regions yield full-rank Jacobian matrices, and hence the regions have no bad local minima.

Theorem 10.

Let ϵ(0,1)\epsilon\in(0,1). Suppose that XX consists of distinct data points, and

d12nlog(nϵ).d_{1}\geq 2n\log\left(\frac{n}{\epsilon}\right).

Then in all but at most a fraction ϵ\epsilon of non-empty activation regions, θF\nabla_{\theta}F is full rank and every differentiable critical point of LL where vv has nonzero entries is a global minimum.

Our strategy for proving Theorem 10 hinges on the following observation. For the sake of example, consider the step vectors ξ(1,1),ξ(2,1),,ξ(n,1)\xi^{(1,1)},\xi^{(2,1)},\ldots,\xi^{(n,1)}. This set of vectors forms a basis of n\mathbb{R}^{n}, so if each of these vectors was a row of the activation matrix AA, it would have full rank. This observation generalizes to cases where some of the step vectors are taken to be ξ(k,0)\xi^{(k,0)} instead of ξ(k,1)\xi^{(k,1)}. If enough step vectors are “collected” by rows of the activation matrix, it will be of full rank. We can interpret this condition in a probabilistic way. Suppose that the rows of AA are sampled randomly from the set of step vectors. We wish to determine the probability that after d1d_{1} samples, enough step vectors have been sampled to cover a certain set. We use the following lemma.

Lemma 11 (Coupon collector’s problem).

Let ϵ(0,1)\epsilon\in(0,1), and let nmn\leq m be positive integers. Let C1,C2,,Cd[m]C_{1},C_{2},\ldots,C_{d}\in[m] be iid random variables such that for all j[n]j\in[n] one has (C1=j)δ\mathbb{P}(C_{1}=j)\geq\delta. If

d1δlog(nϵ),d\geq\frac{1}{\delta}\log\left(\frac{n}{\epsilon}\right),

then [n]{C1,,Cd}[n]\subseteq\{C_{1},\ldots,C_{d}\} with probability at least 1ϵ1-\epsilon.

This gives us a bound for the probability that a randomly sampled region is of full rank. We finally convert this into a combinatorial statement to obtain Theorem 10. For the details and a complete proof, see Appendix C.
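For intuition, the coupon-collector bound can also be simulated directly. The bound of Theorem 10 is consistent with taking $\delta=\frac{1}{2n}$, i.e., roughly $2n$ equally likely step vectors of which $n$ specific ones must be collected; the sketch below (our own, with arbitrary $n$ and $\epsilon$) estimates the failure probability at the suggested threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 8, 0.1
m = 2 * n                                      # 2n coupon types (step vectors), each w.p. 1/(2n)
d = int(np.ceil(2 * n * np.log(n / eps)))      # the width suggested by the lemma with delta = 1/(2n)

trials = 20000
draws = rng.integers(0, m, size=(trials, d))
# Failure: some coupon in {0, ..., n-1} was never drawn within d samples.
covered = np.array([set(row) >= set(range(n)) for row in draws])
print(f"d = {d},  empirical failure probability = {1 - covered.mean():.4f}  (target eps = {eps})")
```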

In one-dimensional input space, the existence of global minima within a region requires similar conditions to the region having a full rank Jacobian. Both of them depend on having many different step vectors represented among the rows of the activation matrix. The condition we need to check for the existence of global minima is slightly more stringent, and depends on there being enough step vectors for both the positive and negative entries of vv.

Theorem 12 (Fraction of regions with global minima).

Let $\epsilon\in(0,1)$ and let $v\in\mathbb{R}^{d_{1}}$ have nonzero entries. Suppose that $X$ consists of distinct data points,

|{i[d1]:v(i)>0}|2nlog(2nϵ),|\{i\in[d_{1}]:v^{(i)}>0\}|\geq 2n\log\left(\frac{2n}{\epsilon}\right),

and

|{i[d1]:v(i)<0}|2nlog(2nϵ).|\{i\in[d_{1}]:v^{(i)}<0\}|\geq 2n\log\left(\frac{2n}{\epsilon}\right).

Then in all but at most an ϵ\epsilon fraction of non-empty activation regions 𝒮XA\mathcal{S}^{A}_{X}, the subset of global minimizers in (w,b)(w,b), 𝒢X,y𝒮XA\mathcal{G}_{X,y}\cap\mathcal{S}^{A}_{X}, is a non-empty affine set of codimension nn. Moreover, all global minima of LL have zero loss.

We provide the proof of this statement in Appendix C. To conclude, for shallow ReLU networks and univariate data we have shown that mild overparameterization suffices to ensure that the loss landscape is favorable, in the sense that most activation regions contain a high-dimensional set of global minima and no bad local minima. Furthermore, considering both Sections 3 and 4, we have shown in both the one-dimensional and higher-dimensional cases that most activation regions have a full rank Jacobian and contain no bad local minima. This suggests that a large fraction of parameter space by volume should also have a full rank Jacobian. This indeed turns out to be the case and is discussed in Section 6. Finally, we point the reader to Appendix D for a discussion of the function space perspective on this setting.

4.1 Nonsmooth critical points

Our results in this section considered the local minima in the interior of a given activation region. Here we extend our analysis to handle points on the boundaries between regions where the loss is non-differentiable. We consider a network on univariate data F(w,b,v,x)F(w,b,v,x) as in Section 4. Let sgn:{0,1}\operatorname{sgn}:\mathbb{R}\to\{0,1\} be the unit step function:

sgn(z):={0 if z01 if z>0.\displaystyle\operatorname{sgn}(z):=\begin{cases}0&\text{ if }z\leq 0\\ 1&\text{ if }z>0\end{cases}.

Here we define the activation regions to include the boundaries:

𝒮XA:={(w,b)d1×d1:sgn(w(i)x(j)+b(i))=Aij for all i[d1],j[n].}\mathcal{S}^{A}_{X}:=\left\{(w,b)\in\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{1}}:\operatorname{sgn}(w^{(i)}x^{(j)}+b^{(i)})=A_{ij}\text{ for all }i\in[d_{1}],j\in[n].\right\}

We assume that the input dataset XX consists of distinct data points x(1)<x(2)<<x(n)x^{(1)}<x^{(2)}<\cdots<x^{(n)}. By the same argument as Lemma 9, an activation region 𝒮XA\mathcal{S}^{A}_{X} is non-empty if and only if the rows of AA are step vectors and we derive the following result.

Theorem 13.

Let ϵ(0,1)\epsilon\in(0,1). If

d12nlog(nϵ),d_{1}\geq 2n\log\left(\frac{n}{\epsilon}\right),

then in all but at most a fraction ϵ\epsilon of non-empty activation regions AA, every local minimum of LL in 𝒮XA×d1\mathcal{S}^{A}_{X}\times\mathbb{R}^{d_{1}} is a global minimum.

5 Extension to deep networks

While the results presented so far focused on shallow networks, they admit generalizations to deep networks. These generalizations are achieved by considering the Jacobian with respect to parameters of individual layers of the network. In this section, we demonstrate this technique and prove that in deeper networks, most activation regions have no spurious critical points. We consider fully connected deep networks with LL layers, where layer ll maps from dl1\mathbb{R}^{d_{l-1}} to dl\mathbb{R}^{d_{l}}, and dL=1d_{L}=1. The parameter space consists of tuples of matrices

W=(W1,W2,,WL),W=(W_{1},W_{2},\ldots,W_{L}),

where $W_{l}\in\mathbb{R}^{d_{l}\times d_{l-1}}$. We identify the vector space of all such tuples with $\mathbb{R}^{m}$, where $m=\sum_{l=1}^{L}d_{l}d_{l-1}$. For $l\in\{0,\ldots,L\}$, we define the $l$-th layer $f_{l}\colon\mathbb{R}^{m}\times\mathbb{R}^{d_{0}}\to\mathbb{R}^{d_{l}}$ recursively by

f_{0}(W,x):=x,
f_{l}(W,x):=\sigma(W_{l}f_{l-1}(W,x))

if $l\in[L-1]$, and

f_{L}(W,x):=v^{T}f_{L-1}(W,x),

where $v\in\mathbb{R}^{d_{L-1}}$ is a fixed vector whose entries are nonzero. Then $f_{L}$ is the final layer output of the network, so the model is given by $F:=f_{L}$. We denote the $j$-th row of $W_{l}$ by $w_{l}^{(j)}$. The activation patterns of a deep network are given by tuples

A=(A1,A2,,AL1),A=(A_{1},A_{2},\cdots,A_{L-1}),

where for each l[L1]l\in[L-1], Al{0,1}dl×nA_{l}\in\{0,1\}^{d_{l}\times n}. For an activation pattern AA, let 𝒮XA\mathcal{S}^{A}_{X} denote the subset of parameter space corresponding to AA. More precisely,

𝒮XA={Wm:(2[Al]ij1)wl(i),fl1(W,x(j))>0 for all l[L1],i[dl],j[n]}.\displaystyle\mathcal{S}^{A}_{X}=\{W\in\mathbb{R}^{m}:(2[A_{l}]_{ij}-1)\langle w_{l}^{(i)},f_{l-1}(W,x^{(j)})\rangle>0\text{ for all }l\in[L-1],i\in[d_{l}],j\in[n]\}.

Assuming linear scaling of the second-to-last layer of the network in the number of data points, along with mild restrictions on the other layers, we prove that most activation regions for deep networks have a full rank Jacobian.

Theorem 14.

Let Xd0×nX\in\mathbb{R}^{d_{0}\times n} be an input dataset with distinct points. Suppose that for all l[L2]l\in[L-2],

dl=Ω(lognϵL),d_{l}=\Omega\left(\log\frac{n}{\epsilon L}\right),

and that

dL1=n+Ω(log1ϵ).d_{L-1}=n+\Omega\left(\log\frac{1}{\epsilon}\right).

Then for at least a (1ϵ)(1-\epsilon) fraction of all activation patterns AA, the following holds. For all W𝒮XAW\in\mathcal{S}^{A}_{X}, WF(W,X)\nabla_{W}F(W,X) has rank nn.

We provide a proof of this statement in Appendix E.
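The proof device of fixing attention on one layer can also be checked numerically. The sketch below (our own illustration; the architecture, dimensions, and random parameters are arbitrary choices) computes only the block of the Jacobian corresponding to the last hidden weight matrix $W_{L-1}$, whose rows take the form $(v\odot a^{(j)})\otimes f_{L-2}(x^{(j)})$ in analogy with the shallow case; if this block already has rank $n$, then so does the full Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
dims = [4, 12, 16, 1]                    # d0, d1, d_{L-1}, d_L = 1, with d_{L-1} > n
L = len(dims) - 1

X = rng.normal(size=(dims[0], n))
Ws = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(L - 1)]
v = rng.normal(size=dims[-2])            # fixed output vector, generically nonzero entries

# Forward pass through the hidden layers, keeping the input and pre-activation of the last one.
h = X
for W in Ws:
    f_prev = h                           # input to the current layer
    pre = W @ h
    h = np.maximum(pre, 0.0)

A_last = (pre > 0).astype(float)         # activation pattern of layer L-1, shape (d_{L-1}, n)
# Block of the Jacobian w.r.t. W_{L-1}: one row (v ⊙ a^(j)) ⊗ f_{L-2}(x^(j)) per data point.
J_block = np.stack([np.kron(v * A_last[:, j], f_prev[:, j]) for j in range(n)])
print("rank of the last-hidden-layer block:", np.linalg.matrix_rank(J_block), "   n =", n)
```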

6 Volumes of activation regions

Our main results so far have concerned the number of activation regions containing bad local versus global minima. Here we compute bounds for the volume of the union of these bad versus good activation regions, which recall are subsets of parameter space. Proofs of all results in this section are provided in Appendix F.

6.1 One-dimensional input data

Consider the setting of Section 4, where we have a network F(w,b,v,x)F(w,b,v,x) on one-dimensional input data. Recall that for A{0,1}d1×nA\in\{0,1\}^{d_{1}\times n}, we defined the activation region 𝒮XA\mathcal{S}^{A}_{X} by

𝒮XA:={(w,b)d1×d1:(2Aij1)(w(i)x(j)+b(i))>0 for all i[d1],j[n]}.\mathcal{S}^{A}_{X}:=\{(w,b)\in\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{1}}:(2A_{ij}-1)(w^{(i)}x^{(j)}+b^{(i)})>0\text{ for all }i\in[d_{1}],j\in[n]\}.

For k[n+1]k\in[n+1], β{0,1}\beta\in\{0,1\} and corresponding step vectors ξ(k,β)\xi^{(k,\beta)}, we also define the individual neuron activation regions 𝒩k,β\mathcal{N}_{k,\beta} by

𝒩k,β:={(w,b)×:(2ξj(k,β)1)(wx(j)+b)>0 for all j[n]}.\mathcal{N}_{k,\beta}:=\{(w,b)\in\mathbb{R}\times\mathbb{R}:(2\xi^{(k,\beta)}_{j}-1)(wx^{(j)}+b)>0\text{ for all }j\in[n]\}.

Let μ\mu denote the Lebesgue measure. First, we compute an exact formula for the volume of the activation regions intersected with the unit interval. That is, we compute the Lebesgue measure of the set 𝒩k,β([1,1]×[1,1])\mathcal{N}_{k,\beta}\cap([-1,1]\times[-1,1]).

Proposition 15.

Let ψ:[,][0,2]\psi:[-\infty,\infty]\to[0,2] be defined by

ψ(x):={12xif x11+x2if 1<x1212xif x>1.\psi(x):=\begin{cases}-\frac{1}{2x}&\text{if }x\leq-1\\ 1+\frac{x}{2}&\text{if }-1<x\leq 1\\ 2-\frac{1}{2x}&\text{if }x>1\end{cases}.

Consider data x(1)x(n)x^{(1)}\leq\cdots\leq x^{(n)}. Then for k[n+1]k\in[n+1] and β{0,1}\beta\in\{0,1\},

μ(𝒩k,β([1,1]×[1,1]))=ψ(x(k))ψ(x(k1)).\mu(\mathcal{N}_{k,\beta}\cap([-1,1]\times[-1,1]))=\psi(x^{(k)})-\psi(x^{(k-1)}).

Here we define x(0):=x^{(0)}:=-\infty, x(n+1):=x^{(n+1)}:=\infty.
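As a quick illustration (our own check, with arbitrary data, index, and sample sizes), the formula of Proposition 15 can be compared against a Monte Carlo estimate for an interior breakpoint index $2\leq k\leq n$, where, for data in $[-1,1]$, it reduces to $(x^{(k)}-x^{(k-1)})/2$:

```python
import numpy as np

def psi(x):
    if x <= -1:
        return -1.0 / (2.0 * x)
    if x <= 1:
        return 1.0 + x / 2.0
    return 2.0 - 1.0 / (2.0 * x)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=5))        # sorted data points in [-1, 1]
n, k, beta = len(x), 3, 1                      # an interior index 2 <= k <= n and a choice of beta
xi = np.array([1 if (j >= k) == bool(beta) else 0 for j in range(1, n + 1)])

# Monte Carlo estimate of mu(N_{k,beta} intersected with [-1,1]^2).
w, b = rng.uniform(-1, 1, size=(2, 1_000_000))
inside = np.all((2 * xi[:, None] - 1) * (np.outer(x, w) + b) > 0, axis=0)
print("Monte Carlo estimate:", 4 * inside.mean())
print("Proposition 15:      ", psi(x[k - 1]) - psi(x[k - 2]))
```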

We apply this result to bound the volume of activation regions with full rank Jacobian in terms of the amount of separation between the data points. For overparameterized networks with sufficient separation, the activation regions with full rank Jacobian fill up most of the parameter space by volume.

Proposition 16.

Let n2n\geq 2. Suppose that the entries of vv are nonzero. Suppose that for all j,k[n]j,k\in[n] with jkj\neq k, we have |x(j)|1|x^{(j)}|\leq 1 and |x(j)x(k)|ϕ|x^{(j)}-x^{(k)}|\geq\phi. If

d14ϕlog(nϵ),d_{1}\geq\frac{4}{\phi}\log\left(\frac{n}{\epsilon}\right),

then

μ({𝒮XA[1,1]d1×[1,1]d1:w,bF has full rank on 𝒮XA})(1ϵ)22d1.\displaystyle\mu(\cup\{\mathcal{S}^{A}_{X}\cap[-1,1]^{d_{1}}\times[-1,1]^{d_{1}}:\nabla_{w,b}F\textnormal{ has full rank on }\mathcal{S}^{A}_{X}\})\geq(1-\epsilon)2^{2d_{1}}.

6.2 Arbitrary dimension input data

We now consider the setting of Section 3. Our model $F\colon(\mathbb{R}^{d_{1}\times d_{0}}\times\mathbb{R}^{d_{1}})\times\mathbb{R}^{d_{0}}\to\mathbb{R}$ is defined by

F(W,v,x)=vTσ(Wx).F(W,v,x)=v^{T}\sigma(Wx).

We consider the volume of the set of points (W,v)(W,v) such that (W,v)F(W,v,X)\nabla_{(W,v)}F(W,v,X) has rank nn. We can formulate this problem probabilistically. Suppose that the entries of WW and vv are sampled from the standard normal distribution 𝒩(0,1)\mathcal{N}(0,1). We wish to compute the probability that the Jacobian has full rank.

Let sgn:{0,1}\operatorname{sgn}:\mathbb{R}\to\{0,1\} denote the step function:

sgn(z)={0 if z01 if z>0.\operatorname{sgn}(z)=\begin{cases}0&\text{ if }z\leq 0\\ 1&\text{ if }z>0\end{cases}.

For an input dataset Xd0×nX\in\mathbb{R}^{d_{0}\times n}, consider the random variable

sgn(XTw),\operatorname{sgn}(X^{T}w),

where w𝒩(0,I)w\sim\mathcal{N}(0,I), and sgn\operatorname{sgn} is defined entrywise. This defines the distribution of activation patterns on the dataset, which we denote by 𝒟X\mathcal{D}_{X}. We say that an input dataset Xd0×nX\in\mathbb{R}^{d_{0}\times n} is γ\gamma-anticoncentrated if for all nonzero unu\in\mathbb{R}^{n},

a𝒟X(uTa=0)1γ.\displaystyle\mathbb{P}_{a\sim\mathcal{D}_{X}}(u^{T}a=0)\leq 1-\gamma.

We can interpret this as a condition on the amount of separation between data points. For example, suppose that two data points $x^{(j)}$ and $x^{(k)}$ are highly correlated: $\|x^{(j)}\|=\|x^{(k)}\|=1$ and $\langle x^{(j)},x^{(k)}\rangle\geq\rho$. Let us take $u\in\mathbb{R}^{n}$ to be defined by $u_{j}=1$, $u_{k}=-1$, and $u_{l}=0$ for $l\neq j,k$. Then

\mathbb{P}_{a\sim\mathcal{D}_{X}}(u^{T}a=0) =\mathbb{P}_{a\sim\mathcal{D}_{X}}(a_{j}=a_{k})
=\mathbb{P}_{w\sim\mathcal{N}(0,I)}(\operatorname{sgn}(w^{T}x^{(j)})=\operatorname{sgn}(w^{T}x^{(k)}))
=1-\frac{1}{\pi}\arccos(\langle x^{(j)},x^{(k)}\rangle)
\geq 1-\frac{\arccos(\rho)}{\pi}.

So in this case, the dataset is not $\gamma$-anticoncentrated for any $\gamma>\frac{\arccos(\rho)}{\pi}$. At the other extreme, suppose that the dataset is uncorrelated: $\langle x^{(j)},x^{(k)}\rangle=0$ for all $j,k\in[n]$ with $j\neq k$. Then

\mathbb{P}_{a\sim\mathcal{D}_{X}}(a_{n}=0\mid a_{1},\ldots,a_{n-1}) =\mathbb{P}_{w\sim\mathcal{N}(0,I)}(\langle w,x^{(n)}\rangle\leq 0\mid\langle w,x^{(1)}\rangle,\ldots,\langle w,x^{(n-1)}\rangle)
=\mathbb{P}_{w\sim\mathcal{N}(0,I)}(\langle w,x^{(n)}\rangle\leq 0)
=\frac{1}{2}.

For unu\in\mathbb{R}^{n} with un0u_{n}\neq 0,

a𝒟X(uTa=0)\displaystyle\mathbb{P}_{a\sim\mathcal{D}_{X}}(u^{T}a=0) =𝔼[(uTa=0a1,,an1)]\displaystyle=\mathbb{E}[\mathbb{P}(u^{T}a=0\mid a_{1},\cdots,a_{n-1})]
=𝔼[(unan=j=1n1ujaj|a1,,an1)]\displaystyle=\mathbb{E}\left[\mathbb{P}\left(u_{n}a_{n}=-\sum_{j=1}^{n-1}u_{j}a_{j}\middle|a_{1},\cdots,a_{n-1}\right)\right]
𝔼[1/2]\displaystyle\leq\mathbb{E}[1/2]
=1/2.\displaystyle=1/2.

So in this case, the dataset is $\gamma$-anticoncentrated with $\gamma=1/2$. In order to prove that the Jacobian has full rank with high probability, we must impose a separation condition such as this one: as data points get closer together, it becomes harder for the network to distinguish between them and the Jacobian drops rank. Once we impose $\gamma$-anticoncentration, a mildly overparameterized network will have a full rank Jacobian at randomly selected parameters with high probability.
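The anticoncentration quantity itself can be estimated by sampling activation patterns, which also reproduces the arccosine identity used in the correlated-pair example above. The helper below is our own sketch; the function name, dataset, and sample sizes are arbitrary choices:

```python
import numpy as np

def anticoncentration_probability(X, u, samples=200_000, seed=0):
    """Estimate P_{a ~ D_X}(u^T a = 0) by sampling w ~ N(0, I) and a = sgn(X^T w)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(samples, X.shape[0]))
    A = (W @ X > 0).astype(float)              # rows are sampled activation patterns a
    return np.mean(A @ u == 0)

rng = np.random.default_rng(1)
d0, n = 8, 6
X = rng.normal(size=(d0, n))
X /= np.linalg.norm(X, axis=0)                 # normalize the data points

u = np.zeros(n)
u[0], u[1] = 1.0, -1.0                         # the test vector from the correlated-pair example
print("estimated P(u^T a = 0):   ", anticoncentration_probability(X, u))
print("1 - arccos(<x1, x2>) / pi:", 1 - np.arccos(X[:, 0] @ X[:, 1]) / np.pi)
```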

Theorem 17.

Let ϵ,γ(0,1)\epsilon,\gamma\in(0,1). Suppose that Xd0×nX\in\mathbb{R}^{d_{0}\times n} is generic and γ\gamma-anticoncentrated. If

d18γ2log(d0ϵ)+2γ(nd0+1),d_{1}\geq\frac{8}{\gamma^{2}}\log\left(\frac{d_{0}}{\epsilon}\right)+\frac{2}{\gamma}\left(\frac{n}{d_{0}}+1\right),

then with probability at least 1ϵ1-\epsilon, (W,v)F(W,v,X)\nabla_{(W,v)}F(W,v,X) has rank nn.

7 Experiments

Here we empirically demonstrate that most regions of parameter space have a good optimization landscape by computing the rank of the Jacobian for two-layer neural networks. We initialize our network with random weights and biases sampled iid uniformly on $\left[-\frac{1}{\sqrt{d_{1}}},\frac{1}{\sqrt{d_{1}}}\right]$. We evaluate the network on a random dataset $X\in\mathbb{R}^{d_{0}\times n}$ whose entries are sampled iid Gaussian with mean 0 and variance 1. This determines an activation region of the network at $X$, and we record the rank of the Jacobian of $F$ at the sampled parameters. For each choice of $d_{0},d_{1}$, and $n$, we run 100 trials and record the fraction of them which resulted in a Jacobian of full rank. The results are shown in Figures 1 and 2.

For different scalings of $n$ and $d_{0}$, we observe different minimal widths $d_{1}$ needed for the Jacobian to achieve full rank with high probability. Figure 1 suggests that the minimum value of $d_{1}$ needed to achieve full rank increases linearly in the dataset size $n$, and that the slope of this linear dependence decreases as the input dimension $d_{0}$ increases. This is exactly the behavior predicted by Theorem 5, which finds full rank regions for $d_{1}\gtrsim\frac{n}{d_{0}}$. Figure 2 operates in the regime $d_{0}\sim n$, and shows that the necessary hidden dimension $d_{1}$ remains constant in the dataset size $n$. This is again consistent with Theorem 5, whose bounds depend only on the ratio $\frac{n}{d_{0}}$. Further supporting experiments, including those involving real-world data, are provided in Appendix G.
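A minimal sketch of this experiment is given below (our own re-implementation in NumPy, not the released code; dimensions and trial counts are arbitrary choices). It samples data and parameters as described above, assembles the Jacobian blocks with respect to $W$, $b$, and $v$ analytically, and reports the fraction of trials with rank $n$:

```python
import numpy as np

def full_rank_fraction(d0, d1, n, trials=100, seed=0):
    """Fraction of random initializations at which the Jacobian has rank n."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.normal(size=(d0, n))                         # iid standard Gaussian data
        W = rng.uniform(-1, 1, size=(d1, d0)) / np.sqrt(d1)  # uniform on [-1/sqrt(d1), 1/sqrt(d1)]
        b = rng.uniform(-1, 1, size=d1) / np.sqrt(d1)
        v = rng.uniform(-1, 1, size=d1) / np.sqrt(d1)
        pre = W @ X + b[:, None]
        A = (pre > 0).astype(float)
        # Jacobian blocks of F = v^T relu(W x + b) w.r.t. W, b, and v; one row per data point.
        JW = np.stack([np.kron(v * A[:, j], X[:, j]) for j in range(n)])
        Jb = (v[:, None] * A).T
        Jv = (A * pre).T
        hits += int(np.linalg.matrix_rank(np.hstack([JW, Jb, Jv])) == n)
    return hits / trials

for d1 in (5, 10, 20, 40):
    print("d1 =", d1, "  fraction full rank:", full_rank_fraction(d0=2, d1=d1, n=20))
```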

Figure 1: The probability of the Jacobian being of full rank from a random initialization for various values of $d_{1}$ and $n$, where the input dimension $d_{0}$ is left fixed. Panels: (a) $d_{0}=1$, (b) $d_{0}=2$, (c) $d_{0}=3$, (d) $d_{0}=5$.

Figure 2: The probability of the Jacobian being of full rank for various values of $d_{1}$ and $n$, where the input dimension $d_{0}$ scales linearly in the number of samples $n$. Panels: (a) $d_{0}=\lceil\frac{n}{4}\rceil$, (b) $d_{0}=\lceil\frac{n}{2}\rceil$, (c) $d_{0}=n$, (d) $d_{0}=2n$.

8 Conclusion

In this work we studied the loss landscape of both shallow and deep ReLU networks. The key takeaway is that mild overparameterization in terms of network width suffices to ensure a favorable loss landscape, in the sense that bad local minima exist in only a small fraction of parameter space. In particular, using random matrix theory and combinatorial techniques we showed that most activation regions have no bad differentiable local minima by determining which regions have a full rank Jacobian. For univariate data we further proved that most regions contain a high-dimensional set of global minimizers, and showed that the same takeaway holds when also considering potential bad non-differentiable local minima on the boundaries between regions. Finally, the combinatorial approach we adopted allowed us to prove results that do not depend on the specific choice of parameter initialization or on the distribution of the dataset. There are a number of directions for improvement. Notably, we obtain our strongest results for the shallow network with one-dimensional inputs, where we have a concrete grasp of the possible activation patterns; the case $1<d_{0}<n$ remains open. We leave strengthening our results concerning deep networks and high-dimensional data to future work, and hope that our contributions inspire further work to address these questions.

Reproducibility statement

The computer implementation of the scripts needed to reproduce our experiments can be found at https://anonymous.4open.science/r/loss-landscape-4271.

Acknowledgments

This project has been supported by NSF CAREER 2145630, NSF 2212520, DFG SPP 2298 grant 464109215, ERC Starting Grant 757983, and BMBF in DAAD project 57616814.

References

  • Anthony & Bartlett (1999) Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. URL https://doi.org/10.1017/CBO9780511624216.
  • Auer et al. (1995) Peter Auer, Mark Herbster, and Manfred K. K Warmuth. Exponentially many local minima for single neurons. In D. Touretzky, M.C. Mozer, and M. Hasselmo (eds.), Advances in Neural Information Processing Systems, volume 8. MIT Press, 1995. URL https://proceedings.neurips.cc/paper_files/paper/1995/file/3806734b256c27e41ec2c6bffa26d9e7-Paper.pdf.
  • Björner et al. (1999) Anders Björner, Michel Las Vergnas, Bernd Sturmfels, Neil White, and Gunter M. Ziegler. Oriented Matroids. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2 edition, 1999. URL https://www.cambridge.org/core/books/oriented-matroids/A34966F40E168883C68362886EF5D334.
  • Bourgain et al. (2010) Jean Bourgain, Van H Vu, and Philip Matchett Wood. On the singularity probability of discrete random matrices. Journal of Functional Analysis, 258(2):559–603, 2010. URL https://www.sciencedirect.com/science/article/pii/S0022123609001955.
  • Chen et al. (1993) An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Computation, 5(6):910–927, 1993. URL https://ieeexplore.ieee.org/document/6796044.
  • Chizat et al. (2019) Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/ae614c557843b1df326cb29c57225459-Paper.pdf.
  • Cooper (2021) Yaim Cooper. Global minima of overparameterized neural networks. SIAM Journal on Mathematics of Data Science, 3(2):676–691, 2021. URL https://doi.org/10.1137/19M1308943.
  • Cover (1964) Thomas M. Cover. Geometrical and Statistical Properties of Linear Threshold Devices. PhD thesis, Stanford Electronics Laboratories Technical Report #6107-1, May 1964. URL https://isl.stanford.edu/~cover/papers/paper1.pdf.
  • Cover (1965) Thomas M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, EC-14(3):326–334, 1965.
  • Dalcin et al. (2011) Lisandro D Dalcin, Rodrigo R Paz, Pablo A Kler, and Alejandro Cosimo. Parallel distributed computing using python. Advances in Water Resources, 34(9):1124–1139, 2011.
  • Fukumizu & Amari (2000) Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3):317–327, 2000. URL https://www.sciencedirect.com/science/article/pii/S0893608000000095.
  • Goldblum et al. (2020) Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, and Tom Goldstein. Truth or backpropaganda? an empirical investigation of deep learning theory. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HyxyIgHFvr.
  • Gori & Tesi (1992) Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76–86, 1992. URL https://ieeexplore.ieee.org/document/107014.
  • Grigsby et al. (2022) J. Elisenda Grigsby, Kathryn Lindsey, Robert Meyerhoff, and Chenxi Wu. Functional dimension of feedforward ReLU neural networks. arXiv preprint arXiv:2209.04036, 2022.
  • Harris et al. (2020) Charles R Harris, K Jarrod Millman, Stéfan J Van Der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with numpy. Nature, 585(7825):357–362, 2020.
  • Harris (2013) Joe Harris. Algebraic geometry: A first course, volume 133. Springer Science & Business Media, 2013. URL https://link.springer.com/book/10.1007/978-1-4757-2189-8.
  • He et al. (2020) Fengxiang He, Bohan Wang, and Dacheng Tao. Piecewise linear activations substantially shape the loss surfaces of neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1x6BTEKwr.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp.  1026–1034, 2015. URL https://openaccess.thecvf.com/content_iccv_2015/html/He_Delving_Deep_into_ICCV_2015_paper.html.
  • Hunter (2007) John D Hunter. Matplotlib: A 2D graphics environment. Computing in science & engineering, 9(03):90–95, 2007.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
  • Joswig (2021) Michael Joswig. Essentials of tropical combinatorics, volume 219 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2021. URL https://page.math.tu-berlin.de/~joswig/etc/index.html.
  • Laurent & von Brecht (2018) Thomas Laurent and James von Brecht. The multilinear structure of ReLU networks. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  2908–2916. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/laurent18b.html.
  • Liu (2021) Bo Liu. Understanding the loss landscape of one-hidden-layer relu networks. Knowledge-Based Systems, 220:106923, 2021. URL https://www.sciencedirect.com/science/article/pii/S0950705121001866.
  • Montúfar et al. (2014) Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/109d2dd3608f669ca17920c511c2a41e-Paper.pdf.
  • Montúfar et al. (2022) Guido Montúfar, Yue Ren, and Leon Zhang. Sharp bounds for the number of regions of maxout networks and vertices of Minkowski sums. SIAM Journal on Applied Algebra and Geometry, 6(4):618–649, 2022. URL https://doi.org/10.1137/21M1413699.
  • Nguyen (2019) Quynh Nguyen. On connected sublevel sets in deep learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp.  4790–4799. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/nguyen19a.html.
  • Oymak & Soltanolkotabi (2019) Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1:84–105, 2019.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Phuong & Lampert (2020) Mary Phuong and Christoph H. Lampert. Functional vs. parametric equivalence of ReLU networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Bylx-TNKvH.
  • Poston et al. (1991) T. Poston, C.-N. Lee, Y. Choie, and Y. Kwon. Local minima and back propagation. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, pp.  173–176 vol.2, 1991. URL https://ieeexplore.ieee.org/document/155333.
  • Safran & Shamir (2016) Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp.  774–782, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/safran16.html.
  • Safran & Shamir (2018) Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  4433–4441. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/safran18a.html.
  • Sakurai (1998) Akito Sakurai. Tight bounds for the VC-dimension of piecewise polynomial networks. In M. Kearns, S. Solla, and D. Cohn (eds.), Advances in Neural Information Processing Systems, volume 11. MIT Press, 1998. URL https://proceedings.neurips.cc/paper_files/paper/1998/file/f18a6d1cde4b205199de8729a6637b42-Paper.pdf.
  • Schläfli (1950) Ludwig Schläfli. Theorie der vielfachen Kontinuität, pp.  167–387. Springer Basel, Basel, 1950. URL https://doi.org/10.1007/978-3-0348-4118-4_13.
  • Sharifnassab et al. (2020) Arsalan Sharifnassab, Saber Salehkaleybar, and S. Jamaloddin Golestani. Bounds on over-parameterization for guaranteed existence of descent paths in shallow ReLU networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BkgXHTNtvS.
  • Simsek et al. (2021) Berfin Simsek, François Ged, Arthur Jacot, Francesco Spadaro, Clement Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  9722–9732. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/simsek21a.html.
  • Sontag & Sussmann (1989) Eduardo Sontag and Héctor J. Sussmann. Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Syst., 3, 1989. URL https://www.complex-systems.com/abstracts/v03_i01_a07/.
  • Soudry & Carmon (2016) Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
  • Swirszcz et al. (2017) Grzegorz Swirszcz, Wojciech Marian Czarnecki, and Razvan Pascanu. Local minima in training of deep networks, 2017. URL https://openreview.net/forum?id=Syoiqwcxx.
  • Tian (2017) Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.  3404–3413. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/tian17a.html.
  • Wang et al. (2022) Yifei Wang, Jonathan Lacotte, and Mert Pilanci. The hidden convex optimization landscape of regularized two-layer reLU networks: an exact characterization of optimal solutions. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Z7Lk2cQEG8a.
  • Yun et al. (2019a) Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019a. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/dbea3d0e2a17c170c412c74273778159-Paper.pdf.
  • Yun et al. (2019b) Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=rke_YiRct7.
  • Zaslavsky (1975) Thomas Zaslavsky. Facing up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes. American Mathematical Society: Memoirs of the American Mathematical Society. American Mathematical Society, 1975. URL https://books.google.com/books?id=K-nTCQAAQBAJ.
  • Zhang et al. (2018) Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. Tropical geometry of deep neural networks. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  5824–5832, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/zhang18i.html.
  • Zhou & Liang (2018) Yi Zhou and Yingbin Liang. Critical points of linear neural networks: Analytical forms and landscape properties. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SysEexbRb.

Appendix A Details on counting activation regions with no bad local minima

We provide the proofs of the results presented in Section 3.

Proof of Lemma 4.

Suppose that λ(1),,λ(n)\lambda^{(1)},\ldots,\lambda^{(n)}\in\mathbb{R}. Since the entries of vv are nonzero, the following are equivalent:

i=1nλ(i)(va(i))x(i)\displaystyle\sum_{i=1}^{n}\lambda^{(i)}(v\odot a^{(i)})\otimes x^{(i)} =0\displaystyle=0
i=1nλ(i)(va(i))kxj(i)\displaystyle\sum_{i=1}^{n}\lambda^{(i)}(v\odot a^{(i)})_{k}x^{(i)}_{j} =0j[d0],k[d1]\displaystyle=0\quad\forall j\in[d_{0}],k\in[d_{1}]
i=1nλ(i)vkak(i)xj(i)\displaystyle\sum_{i=1}^{n}\lambda^{(i)}v_{k}a^{(i)}_{k}x_{j}^{(i)} =0j[d0],k[d1]\displaystyle=0\quad\forall j\in[d_{0}],k\in[d_{1}]
i=1nλ(i)ak(i)xj(i)\displaystyle\sum_{i=1}^{n}\lambda^{(i)}a^{(i)}_{k}x_{j}^{(i)} =0j[d0],k[d1]\displaystyle=0\quad\forall j\in[d_{0}],k\in[d_{1}]
i=1nλ(i)a(i)x(i)\displaystyle\sum_{i=1}^{n}\lambda^{(i)}a^{(i)}\otimes x^{(i)} =0.\displaystyle=0.

So the kernel of the matrix whose ii-th row is (va(i))x(i)(v\odot a^{(i)})\otimes x^{(i)} is equal to the kernel of the matrix whose ii-th row is a(i)x(i)a^{(i)}\otimes x^{(i)}. It follows that

\displaystyle\operatorname{rank}(\{(v\odot a^{(i)})\otimes x^{(i)}:i\in[n]\})=\operatorname{rank}(\{a^{(i)}\otimes x^{(i)}:i\in[n]\}). ∎

Proof of Theorem 5.

Suppose that $d_{0}d_{1}\geq n$. Consider a matrix $A\in\{0,1\}^{d_{1}\times n}$, which corresponds to one of the theoretically possible activation patterns of all $d_{1}$ units across all $n$ input examples, and denote its columns by $a^{(j)}\in\{0,1\}^{d_{1}}$, $j=1,\ldots,n$. For any given input dataset $X$, the function $F$ is piecewise linear in $W$. More specifically, on each activation region $\mathcal{S}^{A}_{X}$, $F(\cdot,v,X)$ is a linear map in $W$ of the form

F(W,v,x(j))\displaystyle F(W,v,x^{(j)}) =i=1d1k=1d0viAijWikxk(j)\displaystyle=\sum_{i=1}^{d_{1}}\sum_{k=1}^{d_{0}}v_{i}A_{ij}W_{ik}x_{k}^{(j)} (j[n]).\displaystyle(j\in[n]).

So on 𝒮XA\mathcal{S}^{A}_{X} the Jacobian of F(,X)F(\cdot,X) is given by the Khatri-Rao (columnwise Kronecker) product

WF(W,v,X)\displaystyle\nabla_{W}F(W,v,X) =((va(1))x(1),,(va(n))x(n)).\displaystyle=((v\odot a^{(1)})\otimes x^{(1)},\ldots,(v\odot a^{(n)})\otimes x^{(n)}). (2)

In particular, WF\nabla_{W}F is of full rank on 𝒮XA\mathcal{S}^{A}_{X} exactly when the set

{(va(i))x(i):i[n]}\{(v\odot a^{(i)})\otimes x^{(i)}:i\in[n]\}

consists of linearly independent elements of d1×d0\mathbb{R}^{d_{1}\times d_{0}}, since d0d1nd_{0}d_{1}\geq n. By Lemma 4, this is equivalent to the set

{a(i)x(i):i[n]}\{a^{(i)}\otimes x^{(i)}:i\in[n]\}

consisting of linearly independent vectors, or in other words, the Khatri-Rao product $A\ast X$ having full rank.

For a given A{0,1}d1×nA\in\{0,1\}^{d_{1}\times n} consider the set

𝒥A:={Xd0×n:AX is of full rank}.\displaystyle\mathcal{J}^{A}:=\{X\in\mathbb{R}^{d_{0}\times n}:A\ast X\text{ is of full rank}\}.

The expression AXA*X corresponds to WF(W,v,X)\nabla_{W}F(W,v,X) in the case that W𝒮XAW\in\mathcal{S}^{A}_{X}. Suppose that d1d_{1} is large enough that d1nd0d_{1}\geq\frac{n}{d_{0}}. We will show that for most A{0,1}d1×nA\in\{0,1\}^{d_{1}\times n}, 𝒥A\mathcal{J}^{A} is non-empty and in fact contains almost every XX. We may partition [n][n] into r:=nd1r:=\lceil\frac{n}{d_{1}}\rceil subsets S1,S2,,SrS_{1},S_{2},\ldots,S_{r} such that |Sk|d1|S_{k}|\leq d_{1} for all k[r]k\in[r] and partition the set of columns of AA accordingly into blocks (a(s))sSk(a^{(s)})_{s\in S_{k}} for all k[r]k\in[r]. Let us form a d1×|Sk|d_{1}\times|S_{k}| matrix MM whose columns are the a(s),sSka^{(s)},s\in S_{k}. We will use a probabilistic argument. For this, consider the entries of MM as being iid Bernoulli random variables with parameter 1/21/2. We may extend MM to a d1×d1d_{1}\times d_{1} random matrix M~\tilde{M} whose entries are iid Bernoulli random variables with parameter 1/21/2. By Theorem 3, M~\tilde{M} will be singular with probability at most

(12+o(1))d1C10.72d1,\left(\frac{1}{\sqrt{2}}+o(1)\right)^{d_{1}}\leq C_{1}\cdot 0.72^{d_{1}},

where C1C_{1} is a universal constant. Whenever M~\tilde{M} is nonsingular, the vectors a(s),sSka^{(s)},s\in S_{k} are linearly independent. Using a simple union bound, we have

((a(s))sSk are linearly independent for all k[r])\displaystyle\mathbb{P}\left((a^{(s)})_{s\in S_{k}}\text{ are linearly independent for all }k\in[r]\right) 1rC1(0.72)d1.\displaystyle\geq 1-rC_{1}(0.72)^{d_{1}}.

Now suppose that (a(s))sSk(a^{(s)})_{s\in S_{k}} are linearly independent for all k[r]k\in[r]. Let e1,,ed0e_{1},\ldots,e_{d_{0}} be the standard basis of d0\mathbb{R}^{d_{0}}. Since rd0r\leq d_{0}, there exists Xd0×nX\in\mathbb{R}^{d_{0}\times n} such that x(i)=ekx^{(i)}=e_{k} whenever iSki\in S_{k}. For such an XX, we claim that the set

{a(i)x(i):i[n]}\{a^{(i)}\otimes x^{(i)}:i\in[n]\}

consists of linearly independent elements of d1×d0\mathbb{R}^{d_{1}\times d_{0}}. To see this, suppose that there exist α1,,αn\alpha_{1},\ldots,\alpha_{n} such that

i=1nαi(a(i)x(i))\displaystyle\sum_{i=1}^{n}\alpha_{i}(a^{(i)}\otimes x^{(i)}) =0.\displaystyle=0.

Then

0\displaystyle 0 =k=1riSkαi(a(i)x(i))\displaystyle=\sum_{k=1}^{r}\sum_{i\in S_{k}}\alpha_{i}(a^{(i)}\otimes x^{(i)})
=k=1riSkαi(a(i)ek)\displaystyle=\sum_{k=1}^{r}\sum_{i\in S_{k}}\alpha_{i}(a^{(i)}\otimes e_{k})
=k=1r(iSkαia(i))ek.\displaystyle=\sum_{k=1}^{r}\left(\sum_{i\in S_{k}}\alpha_{i}a^{(i)}\right)\otimes e_{k}.

The above equation can only hold if

iSkαia(i)=0,for all k[r].\sum_{i\in S_{k}}\alpha_{i}a^{(i)}=0,\quad\text{for all $k\in[r]$.}

By the linear independence of $(a^{(i)})_{i\in S_{k}}$, this implies that $\alpha_{i}=0$ for all $i\in[n]$. This shows that the elements $a^{(i)}\otimes x^{(i)}$ are linearly independent for a particular $X$ whenever the $(a^{(i)})_{i\in S_{k}}$ are all linearly independent. In other words, $\mathcal{J}^{A}$ is non-empty with probability at least $1-C_{1}r(0.72^{d_{1}})$ when the activation pattern $A$ is chosen uniformly at random.

Let us define

𝒜:={A{0,1}d1×n:𝒥A}.\mathcal{A}:=\{A\in\{0,1\}^{d_{1}\times n}:\mathcal{J}^{A}\neq\emptyset\}.

We have shown that if AA is chosen uniformly at random from {0,1}d1×n\{0,1\}^{d_{1}\times n}, then A𝒜A\in\mathcal{A} with high probability. Note that 𝒥A\mathcal{J}^{A} is defined in terms of polynomials of XX not vanishing, so 𝒥A\mathcal{J}^{A} is the complement of a Zariski-closed subset of d0×n\mathbb{R}^{d_{0}\times n}. Let

\displaystyle\mathcal{J}:=\bigcap_{A\in\mathcal{A}}\mathcal{J}^{A}.

Then 𝒥\mathcal{J} is a Zariski-open set of full measure, since it is a finite intersection of non-empty Zariski-open sets (which are themselves of full measure by Lemma 1). If X𝒥X\in\mathcal{J} and A𝒜A\in\mathcal{A}, then WF(W,X)\nabla_{W}F(W,X) is of full rank for all W𝒮XAW\in\mathcal{S}_{X}^{A}, and therefore all local minima in 𝒮XA\mathcal{S}_{X}^{A} will be global minima by Lemma 2. So if we take X𝒥X\in\mathcal{J} and d1d_{1} such that

d1log(C1(n+1)d0ϵ)log(10.72)d_{1}\geq\frac{\log\left(\frac{C_{1}(n+1)}{d_{0}\epsilon}\right)}{\log\left(\frac{1}{0.72}\right)}

and choose A{0,1}d1×nA\in\{0,1\}^{d_{1}\times n} uniformly at random, then with probability at least

1C1r(0.72d1)\displaystyle 1-C_{1}r(0.72^{d_{1}}) 1C1r(d0ϵC1(n+1))\displaystyle\geq 1-C_{1}r\left(\frac{d_{0}\epsilon}{C_{1}(n+1)}\right)
1ϵ,\displaystyle\geq 1-\epsilon,

$\mathcal{S}^{A}_{X}$ will have no bad local minima. We rephrase this as a combinatorial result: if $d_{1}$ satisfies the bound above, then for generic datasets $X$ we have the following: for all but at most $\lceil\epsilon 2^{d_{1}n}\rceil$ activation regions $\mathcal{S}^{A}_{X}$, $\mathcal{S}^{A}_{X}$ has no bad local minima. The theorem follows. ∎
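The rank criterion used above is easy to probe numerically. The following sketch is our own illustration, not part of the original development; it assumes NumPy is available and uses arbitrary small sizes $d_{0},d_{1},n$. It samples a random activation pattern $A$, generic data $X$ and generic output weights $v$, builds the Khatri-Rao matrices appearing in Lemma 4 and in equation 2, and checks that their ranks agree and typically equal $n$ once $d_{0}d_{1}\geq n$.

    import numpy as np

    rng = np.random.default_rng(0)
    d0, d1, n = 3, 10, 20                     # mild overparameterization: d0 * d1 >= n
    assert d0 * d1 >= n

    X = rng.standard_normal((d0, n))          # generic input data
    A = rng.integers(0, 2, size=(d1, n))      # a random activation pattern
    v = rng.standard_normal(d1)               # generic (nonzero) output weights

    # columns (v ⊙ a^(j)) ⊗ x^(j) and a^(j) ⊗ x^(j)
    J_v = np.column_stack([np.kron(v * A[:, j], X[:, j]) for j in range(n)])
    J_a = np.column_stack([np.kron(A[:, j].astype(float), X[:, j]) for j in range(n)])

    # Lemma 4: the two ranks coincide; for most patterns A they equal n (Theorem 5)
    print(np.linalg.matrix_rank(J_v), np.linalg.matrix_rank(J_a))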

The argument used to prove that the Jacobian has full rank for generic data applies to arbitrary binary matrices; we record it as the following lemma, whose proof is exactly the sub-argument given in the proof of Theorem 5.

Lemma 18.

For generic datasets Xd0×nX\in\mathbb{R}^{d_{0}\times n}, the following holds. Let a(i){0,1}d1a^{(i)}\in\{0,1\}^{d_{1}} for i[n]i\in[n]. Suppose that there exists a partition of [n][n] into rd0r\leq d_{0} subsets S1,,SrS_{1},\cdots,S_{r} such that for all k[r]k\in[r],

rank({a(i):iSk})=|Sk|.\operatorname{rank}(\{a^{(i)}:i\in S_{k}\})=|S_{k}|.

Then

rank({a(i)x(i):i[n]})=n.\operatorname{rank}(\{a^{(i)}\otimes x^{(i)}:i\in[n]\})=n.

Appendix B Details on counting non-empty activation regions

We provide the proofs of the results presented in Section 3. The subdivision of parameter space according to the activation properties of the neurons is classically studied in VC-dimension computations (Cover, 1965; Sakurai, 1998; Anthony & Bartlett, 1999). It has similarities to the analysis of linear regions in input space for neural networks with piecewise linear activation functions (Montúfar et al., 2014).

Proof of Proposition 6.

Consider the map d0n;w[wTX]+\mathbb{R}^{d_{0}}\to\mathbb{R}^{n};w\mapsto[w^{T}X]_{+} that takes the input weights ww of a single ReLU to its activation values over nn input data points given by the columns of XX. This can equivalently be interpreted as a map taking an input vector ww to the activation values of nn ReLUs with input weights given by the columns of XX. The linear regions of the latter correspond to the (full-dimensional) regions of a central arrangement of nn hyperplanes in d0\mathbb{R}^{d_{0}} with normals x(1),,x(n)x^{(1)},\ldots,x^{(n)}. Denote the number of such regions by NXN_{X}. If the columns of XX are in general position in a dd-dimensional linear space, meaning that they are contained in a dd-dimensional linear space and any dd columns are linearly independent, then

NX=Nd,n:=2k=0d1(n1k).N_{X}=N_{d,n}:=2\sum_{k=0}^{d-1}{n-1\choose k}. (3)

This is a classic result that can be traced back to the work of Schläfli (1950), and which also appeared in the discussion of linear threshold devices by Cover (1964).

Now for a layer of d1d_{1} ReLUs, each unit has its parameter space d0\mathbb{R}^{d_{0}} subdivided by an equivalent hyperplane arrangement that is determined solely by XX. Since all units have individual parameters, the arrangement of each unit essentially lives in a different set of coordinates. In turn, the overall arrangement in the parameter space d0×d1\mathbb{R}^{d_{0}\times d_{1}} of all units is a so-called product arrangement, and the number of regions is (NX)d1(N_{X})^{d_{1}}. This conclusion is sufficiently intuitive, but it can also be derived from Zaslavsky (1975, Lemma 4A3). If the input data XX is in general position in a dd-dimensional linear subspace of the input space, then we can substitute equation 3 into (NX)d1(N_{X})^{d_{1}} and obtain the number of regions stated in the proposition. ∎
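As a quick numerical sanity check of equation 3, one can count the distinct sign patterns $\operatorname{sgn}(w^{T}X)$ realized by randomly drawn weight vectors $w$ and compare with $N_{d,n}$; a layer of $d_{1}$ units then has $(N_{X})^{d_{1}}$ regions. The sketch below is our own illustration (assuming NumPy; the sizes are arbitrary, and random sampling can in principle miss very thin regions).

    import numpy as np
    from math import comb

    rng = np.random.default_rng(1)
    d, n = 3, 6
    X = rng.standard_normal((d, n))                    # generic columns, in general position in R^d

    N_dn = 2 * sum(comb(n - 1, k) for k in range(d))   # predicted number of regions, equation 3

    patterns = set()
    for _ in range(200_000):                           # random sampling of weight vectors
        w = rng.standard_normal(d)
        patterns.add(tuple(np.sign(w @ X).astype(int)))

    print(len(patterns), N_dn)                         # with enough samples these typically agree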

We are also interested in the specific identity of the non-empty regions; that is, the sign patterns that are associated with them. As we have seen above, the set of sign patterns of a layer of units is the d1d_{1}-Cartesian power of the set of sign patterns of an individual unit. Therefore, it is sufficient to understand the set of sign patterns of an individual unit; that is, the set of dichotomies that can be computed over the columns of XX by a bias-free simple perceptron xsgn(wTx)x\mapsto\operatorname{sgn}(w^{T}x). Note that this subsumes networks with biases as the special case where the first row of XX consists of ones, in which case the first component of the weight vector can be regarded as the bias. Let LXL_{X} be the NX×nN_{X}\times n matrix whose NXN_{X} rows are the different possible dichotomies (sgnwTx(1),,sgnwTx(n)){1,+1}n(\operatorname{sgn}w^{T}x^{(1)},\ldots,\operatorname{sgn}w^{T}x^{(n)})\in\{-1,+1\}^{n}. If we extend the definition of dichotomies to allow not only +1+1 and 1-1 but also zeros for the case that data points fall on the decision boundary, then we obtain a matrix LXL_{X} that is referred to as the oriented matroid of the vector configuration XX, and whose rows are referred to as the covectors of XX (Björner et al., 1999). This is also known as the list of sign sequences of the faces of the hyperplane arrangement.

To provide more intuition for Proposition 8, we give a self-contained proof below.

Proof of Proposition 8.

For each unit, the parameter space d0\mathbb{R}^{d_{0}} is subdivided by an arrangement of nn hyperplanes with normals given by x(j)x^{(j)}, j[n]j\in[n]. A weight vector ww is in the interior of the activation region with pattern a{0,1}na\in\{0,1\}^{n} if and only if (2aj1)wTx(j)>0(2a_{j}-1)w^{T}x^{(j)}>0 for all j[n]j\in[n]. Equivalently,

\displaystyle w^{T}x^{(j)}>w^{T}0\quad\text{for all $j$ with $a_{j}=1$,}
\displaystyle w^{T}0>w^{T}x^{(j^{\prime})}\quad\text{for all $j^{\prime}$ with $a_{j^{\prime}}=0$.}

This means that $w$ is a point where the function $w\mapsto w^{T}\sum_{j\colon a_{j}=1}x^{(j)}$ attains the maximum value among all linear functions with gradients given by sums of the $x^{(j)}$, meaning that at $w$ this function attains the same value as

ψ:wj[n]max{0,wTx(j)}=maxS[n]wTjSx(j).\psi\colon w\mapsto\sum_{j\in[n]}\max\{0,w^{T}x^{(j)}\}=\max_{S\subseteq[n]}w^{T}\sum_{j\in S}x^{(j)}. (4)

Dually, the linear function xwTxx\mapsto w^{T}x attains its maximum over the polytope

P=conv{jSx(j):S[n]}P=\operatorname{conv}\{\sum_{j\in S}x^{(j)}\colon S\subseteq[n]\} (5)

precisely at j:aj=1x(j)\sum_{j\colon a_{j}=1}x^{(j)}. In other words, j:aj=1x(j)\sum_{j\colon a_{j}=1}x^{(j)} is an extreme point or vertex of PP with a supporting hyperplane with normal ww. For a polytope Pd0P\subseteq\mathbb{R}^{d_{0}}, the normal cone of PP at a point xPx\in P is defined as the set of wd0w\in\mathbb{R}^{d_{0}} such that wTxwTxw^{T}x\geq w^{T}x^{\prime} for all xPx^{\prime}\in P. For any S[n]S\subseteq[n] let us denote by 𝟏S{0,1}[n]\mathbf{1}_{S}\in\{0,1\}^{[n]} the vector with ones at components in SS and zeros otherwise. Then the above discussion shows that the activation region 𝒮Xa\mathcal{S}_{X}^{a} with a=𝟏Sa=\mathbf{1}_{S} is the interior of the normal cone of PP at jSx(j)\sum_{j\in S}x^{(j)}. In particular, the activation region is non empty if and only if jSx(j)\sum_{j\in S}x^{(j)} is a vertex of PP.

To conclude, we show that PP is a Minkowski sum of line segments,

P=\sum_{j\in[n]}P_{j},\quad P_{j}=\operatorname{conv}\{0,x^{(j)}\}.

To see this, note that $x\in P$ if and only if there exist $\alpha_{S}\geq 0$ with $\sum_{S\subseteq[n]}\alpha_{S}=1$ such that

x=\displaystyle x= S[n]αSjSx(j)\displaystyle\sum_{S\subseteq[n]}\alpha_{S}\sum_{j\in S}x^{(j)}
=\displaystyle= j[n]S:jSαSx(j)\displaystyle\sum_{j\in[n]}\sum_{S\colon j\in S}\alpha_{S}x^{(j)}
=\displaystyle= j[n][(S:jSαS)0+(S:jSαS)x(j)]\displaystyle\sum_{j\in[n]}[(\sum_{S\colon j\not\in S}\alpha_{S})0+(\sum_{S\colon j\in S}\alpha_{S})x^{(j)}]
=\displaystyle= j[n][(1βj)0+βjx(j)]=j[n]z(j),\displaystyle\sum_{j\in[n]}[(1-\beta_{j})0+\beta_{j}x^{(j)}]=\sum_{j\in[n]}z^{(j)},

where $\beta_{j}=\sum_{S\colon j\in S}\alpha_{S}\in[0,1]$ and $z^{(j)}=[(1-\beta_{j})0+\beta_{j}x^{(j)}]$. Conversely, any values $\beta_{j}\in[0,1]$, $j\in[n]$, arise in this way, for instance from the product weights $\alpha_{S}=\prod_{j\in S}\beta_{j}\prod_{j\notin S}(1-\beta_{j})$. Thus, $x\in P$ if and only if $x$ is a sum of points $z^{(j)}\in P_{j}=\operatorname{conv}\{0,x^{(j)}\}$, meaning that $P=\sum_{j}P_{j}$ as was claimed. ∎

The polytope equation 5 may be regarded as the Newton polytope of the piecewise linear convex function equation 4 in the context of tropical geometry (Joswig, 2021). This perspective has been used to study the linear regions in input space for ReLU (Zhang et al., 2018) and maxout networks (Montúfar et al., 2022).

We note that each vertex of a Minkowski sum of polytopes P=P1++PnP=P_{1}+\cdots+P_{n} is a sum of vertices of the summands PjP_{j}, but not every sum of vertices of the PjP_{j} results in a vertex of PP. Our polytope PP is a sum of line segments, which is a type of polytope known as a zonotope. Each vertex of PP takes the form v=j[n]vjv=\sum_{j\in[n]}v_{j}, where each vjv_{j} is a vertex of PjP_{j}, either the zero vertex 0 or the nonzero vertex x(j)x^{(j)}, and is naturally labeled by a vector 𝟏S{0,1}n2[n]\mathbf{1}_{S}\in\{0,1\}^{n}\cong 2^{[n]} that indicates the jjs for which vj=x(j)v_{j}=x^{(j)}. A zonotope can be interpreted as the image of a cube by a linear map; in our case it is the projection of the nn-cube conv{a{0,1}n}n\operatorname{conv}\{a\in\{0,1\}^{n}\}\subseteq\mathbb{R}^{n} into d0\mathbb{R}^{d_{0}} by the matrix Xd0×nX\in\mathbb{R}^{d_{0}\times n}. The situation is illustrated in Figure 3.

Example 19 (Non-empty activation regions for 1-dimensional inputs).

In the case of one-dimensional inputs and units with biases, the parameter space of each unit is $\mathbb{R}^{2}$. We treat the data as points $x^{(1)},\ldots,x^{(n)}\in 1\times\mathbb{R}$, where the first coordinate is to accommodate the bias. For generic data (i.e., no two data points on top of each other), the polytope $P=\sum_{j\in[n]}\operatorname{conv}\{0,x^{(j)}\}$ is a polygon with $2n$ vertices. The vertices have labels $\mathbf{1}_{S}$ indicating, for each $k=0,\ldots,n$, the subsets $S\subseteq[n]$ containing the largest respectively the smallest $k$ elements of the dataset with respect to the non-bias coordinate.

Figure 3: Illustration of Proposition 8. The polytope PP for a ReLU on three data points x(1),x(2),x(3)x^{(1)},x^{(2)},x^{(3)} is the Minkowski sum of the line segments Pi=conv{0,x(i)}P_{i}=\operatorname{conv}\{0,x^{(i)}\} highlighted in red. The activation regions in parameter space are the normal cones of PP at its different vertices. Hence the vertices correspond to the non-empty activation regions. These are naturally labeled by vectors 𝟏S\mathbf{1}_{S} that indicate which x(i)x^{(i)} are added to produce the vertex and record the activation patterns.
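For the one-dimensional setting of Example 19 and Figure 3, the zonotope $P$ and its vertex count $2n$ can be checked directly. The sketch below is our own illustration; it assumes NumPy and SciPy are available and enumerates all $2^{n}$ subset sums, so it is only meant for small $n$.

    import numpy as np
    from itertools import product
    from scipy.spatial import ConvexHull

    rng = np.random.default_rng(2)
    n = 5
    t = np.sort(rng.standard_normal(n))            # distinct one-dimensional inputs
    X = np.vstack([np.ones(n), t])                 # x^(j) = (1, t_j); the first row accommodates the bias

    # all subset sums sum_{j in S} x^(j); P is their convex hull, a zonotope in the plane
    sums = np.array([X @ np.array(s, dtype=float) for s in product([0, 1], repeat=n)])

    hull = ConvexHull(sums)
    print(len(hull.vertices), 2 * n)               # for generic data the polygon P has exactly 2n vertices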

Appendix C Details on counting non-empty activation regions with no bad minima

We provide the proofs of the statements in Section 4.

Proof of Lemma 9.

First we establish that the rows of a matrix $A$ corresponding to a non-empty activation region must be step vectors. To this end, assume $\mathcal{S}^{A}_{X}$ is non-empty, fix $(w,b)\in\mathcal{S}^{A}_{X}$, and for an arbitrary row $i\in[d_{1}]$ let $A_{i1}=\alpha\in\{0,1\}$. If $A_{ij}=\alpha$ for all $j\in[n]$, then the $i$-th row of $A$ is a step vector equal to either $\xi^{(n+1,0)}$ or $\xi^{(n+1,1)}$. Otherwise, there exists a minimal $k\in\{2,\ldots,n\}$ such that $A_{ik}=1-\alpha$. We proceed by contradiction to prove in this setting that $A_{ij}=1-\alpha$ for all $j\geq k$. Suppose there exists a $j>k$ such that $A_{ij}=\alpha$. Then, as $(w,b)\in\mathcal{S}^{A}_{X}$, the preactivations at $x^{(1)}$, $x^{(k)}$ and $x^{(j)}$ satisfy

(2\alpha-1)(w^{(i)}x^{(1)}+b^{(i)})>0,\qquad(1-2\alpha)(w^{(i)}x^{(k)}+b^{(i)})>0,\qquad(2\alpha-1)(w^{(i)}x^{(j)}+b^{(i)})>0.

However, since $x^{(1)}<x^{(k)}<x^{(j)}$, we may write $x^{(k)}=tx^{(1)}+(1-t)x^{(j)}$ for some $t\in(0,1)$, so that

(2\alpha-1)(w^{(i)}x^{(k)}+b^{(i)})=t(2\alpha-1)(w^{(i)}x^{(1)}+b^{(i)})+(1-t)(2\alpha-1)(w^{(i)}x^{(j)}+b^{(i)})>0,

which contradicts the middle inequality above. Therefore we conclude $A_{i,\cdot}$ is a step vector equal to either $\xi^{(k,0)}$ or $\xi^{(k,1)}$.

Conversely, assume the rows of $A$ are all step vectors. We proceed to prove $\mathcal{S}^{A}_{X}$ is non-empty under this assumption by constructing $(w,b)$ such that $\operatorname{sgn}(w^{(i)}x^{(j)}+b^{(i)})=A_{ij}$ for all $j\in[n]$ and any row $i\in[d_{1}]$. To this end, let $A_{i1}=\alpha\in\{0,1\}$. First, consider the case where $A_{ij}=\alpha$ for all $j\in[n]$. If $\alpha=0$ then with $w^{(i)}=1$ and $b^{(i)}=-2|x^{(n)}|-1$ it follows that

w^{(i)}x^{(j)}+b^{(i)}\leq x^{(n)}+b^{(i)}\leq-|x^{(n)}|-1<0

for all $j\in[n]$. If $\alpha=1$, then with $w^{(i)}=1$ and $b^{(i)}=2|x^{(1)}|+1$ we have

w^{(i)}x^{(j)}+b^{(i)}\geq x^{(1)}+b^{(i)}\geq|x^{(1)}|+1>0

for all $j\in[n]$. Otherwise, suppose $A_{i,\cdot}$ is not constant: then by construction there exists a $k\in\{2,\ldots,n\}$ such that $A_{ij}=\alpha$ for all $j\in[1,k-1]$ and $A_{ij}=1-\alpha$ for all $j\in[k,n]$. Letting

w^{(i)}=(1-2\alpha),\qquad b^{(i)}=-(1-2\alpha)\left(\frac{x^{(k-1)}+x^{(k)}}{2}\right),

then for any $j\in[n]$

\operatorname{sgn}(w^{(i)}x^{(j)}+b^{(i)})=\operatorname{sgn}\left((1-2\alpha)\left(x^{(j)}-\frac{x^{(k-1)}+x^{(k)}}{2}\right)\right)=A_{ij}.

Thus, for any $A$ with step vector rows, given a dataset consisting of distinct one-dimensional points, we can construct a network whose preactivations correspond to the activation pattern encoded by $A$.

In summary, given a fixed one-dimensional dataset of distinct points, we have established a one-to-one correspondence between the non-empty activation regions and the set of binary matrices whose rows are step vectors. For convenience we refer to these as row-step matrices. As there are $2n$ step vectors of length $n$ and $A$ has $d_{1}$ rows, there are $(2n)^{d_{1}}$ binary row-step matrices and hence $(2n)^{d_{1}}$ non-empty activation regions. ∎
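The construction used in the second half of the proof can be run directly. The sketch below is our own illustration (assuming NumPy); for a non-constant step row it uses the explicit choice $w^{(i)}=1-2\alpha$, $b^{(i)}=-(1-2\alpha)\frac{x^{(k-1)}+x^{(k)}}{2}$ from above and confirms that a sampled row-step matrix is realized by the constructed $(w,b)$.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d1 = 6, 4
    x = np.sort(rng.standard_normal(n))                 # distinct sorted one-dimensional data

    def random_step_row(rng, n):
        """One of the 2n step vectors: constant equal to alpha before index k, flipped from k on."""
        alpha, k = rng.integers(0, 2), rng.integers(1, n + 1)
        row = np.full(n, alpha)
        row[k:] = 1 - alpha                              # k = n leaves the row constant
        return row

    A = np.array([random_step_row(rng, n) for _ in range(d1)])

    w, b = np.empty(d1), np.empty(d1)
    for i in range(d1):
        alpha = A[i, 0]
        if np.all(A[i] == alpha):                        # constant row: push all preactivations to one side
            w[i] = 1.0
            b[i] = (2 * np.abs(x).max() + 1) * (1 if alpha == 1 else -1)
        else:
            k = int(np.argmax(A[i] != alpha))            # first index at which the row flips
            w[i] = 1 - 2 * alpha
            b[i] = -(1 - 2 * alpha) * (x[k - 1] + x[k]) / 2

    realized = (w[:, None] * x[None, :] + b[:, None] > 0).astype(int)
    print(np.array_equal(realized, A))                   # True: the step pattern A is realized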

Proof of Lemma 11.

By the union bound,

([n]{C1,,Cd})\displaystyle\mathbb{P}([n]\not\subseteq\{C_{1},\ldots,C_{d}\}) j=1n(j{C1,,Cd})\displaystyle\leq\sum_{j=1}^{n}\mathbb{P}(j\notin\{C_{1},\cdots,C_{d}\})
\displaystyle\leq\sum_{j=1}^{n}\left(1-\delta\right)^{d}
j=1neδd\displaystyle\leq\sum_{j=1}^{n}e^{-\delta d}
=neδd.\displaystyle=ne^{-\delta d}.

So if d>1δlog(nϵ)d>\frac{1}{\delta}\log(\frac{n}{\epsilon}),

([n]{C1,,Cd})\displaystyle\mathbb{P}([n]\subseteq\{C_{1},\ldots,C_{d}\}) 1neδd\displaystyle\geq 1-ne^{-\delta d}
1ϵ.\displaystyle\geq 1-\epsilon.

This concludes the proof. ∎
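Lemma 11 is a coupon-collector type bound. A quick simulation (our own sketch, assuming NumPy; the parameters are arbitrary) compares the empirical probability that some element of $[n]$ is missed with the bound $ne^{-\delta d}$ used above.

    import numpy as np

    rng = np.random.default_rng(4)
    n, d, trials = 5, 60, 20_000
    delta = 1.0 / (2 * n)                     # P(C_i = k) >= delta for each k in [n]

    # C_i uniform on {1, ..., 2n}; we only need to collect the values 1, ..., n
    C = rng.integers(1, 2 * n + 1, size=(trials, d))
    covered = np.array([set(range(1, n + 1)).issubset(row) for row in C])

    print(1 - covered.mean(), n * np.exp(-delta * d))   # empirical failure probability vs. the bound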

Proof of Theorem 10.

Consider a dataset (X,y)(X,y) consisting of distinct data points. Without loss of generality we may index these points such that

x(1)<x(2)<<x(n).x^{(1)}<x^{(2)}<\cdots<x^{(n)}.

Now consider a matrix A{0,1}d1×nA\in\{0,1\}^{d_{1}\times n} whose rows are step vectors and which therefore corresponds to a non-empty activation region by Lemma 9. As in the proof of Theorem 5, denote its columns by a(j){0,1}d1a^{(j)}\in\{0,1\}^{d_{1}} for j[n]j\in[n]. On 𝒮XA\mathcal{S}^{A}_{X}, the Jacobian of FF with respect to bb is given by

bF(w,b,v,X)\displaystyle\nabla_{b}F(w,b,v,X) =(va(1),,va(n)).\displaystyle=(v\odot a^{(1)},\ldots,v\odot a^{(n)}).

We claim that if AA has full rank, then bF(w,b,v,X)\nabla_{b}F(w,b,v,X) has full rank for all w,b,v,Xw,b,v,X with the entries of vv nonzero. To see this, suppose that AA is of full rank, and that

j=1nαj(va(j))\displaystyle\sum_{j=1}^{n}\alpha_{j}(v\odot a^{(j)}) =0\displaystyle=0

for some α1,,αn\alpha_{1},\ldots,\alpha_{n}\in\mathbb{R}. Then for all i[d1]i\in[d_{1}],

j=1nαjviAij\displaystyle\sum_{j=1}^{n}\alpha_{j}v_{i}A_{ij} =0,\displaystyle=0,
\displaystyle\sum_{j=1}^{n}\alpha_{j}A_{ij}=0.

But $A$ is of full rank, so this implies that $\alpha_{j}=0$ for all $j\in[n]$. As a result, if $A$ is full rank then $\nabla_{b}F(w,b,v,X)$ is full rank, and in particular $\nabla_{(w,b,v)}F(w,b,v,X)$ is of full rank. Therefore, to show that most non-empty activation regions have no bad local minima, it suffices to show that most non-empty activation regions are defined by a full rank binary matrix $A$.

To this end, if AA is a binary matrix with step vector rows, we say that AA is diverse if it satisfies the following properties:

  1. For all $k\in[n]$, there exists $i\in[d_{1}]$ such that $A_{i,\cdot}\in\{\xi^{(k,0)},\xi^{(k,1)}\}$.

  2. There exists $i\in[d_{1}]$ such that $A_{i,\cdot}=\xi^{(1,1)}$.

We proceed by i) showing that all diverse matrices are of full rank and then ii) showing that most non-empty activation regions are defined by a diverse matrix. Suppose $A$ is diverse and denote the span of the rows of $A$ by $\operatorname{row}(A)$. Then $\xi^{(1,1)}=(1,\ldots,1)\in\operatorname{row}(A)$ and for each $k\in[n]$ either $\xi^{(k,0)}$ or $\xi^{(k,1)}$ is in $\operatorname{row}(A)$. Observe that if $\xi^{(k,0)}\in\operatorname{row}(A)$ then $\xi^{(1,1)}-\xi^{(k,0)}=\xi^{(k,1)}\in\operatorname{row}(A)$; therefore $\xi^{(k,1)}\in\operatorname{row}(A)$ for all $k\in[n]$. As the set of vectors

{ξ(k,1):k[n]}\{\xi^{(k,1)}:k\in[n]\}

forms a basis of $\mathbb{R}^{n}$, we conclude that all diverse matrices are full rank.

Now we show most binary matrices with step vector rows are diverse. Let $A$ be a random binary matrix whose rows are selected mutually iid uniformly at random from the set of all step vectors. For $i\in[d_{1}]$, let $C_{i}\in[n+1]$ be defined as follows: if $A_{i,\cdot}\in\{\xi^{(k,0)},\xi^{(k,1)}\}$ for some $k\in\{2,3,\ldots,n\}$, we define $C_{i}=k$; if $A_{i,\cdot}=\xi^{(1,1)}$, then we define $C_{i}=1$; otherwise, we define $C_{i}=n+1$. By definition $A$ is diverse if and only if

[n]{C1,,Cd1}.[n]\subseteq\{C_{1},\ldots,C_{d_{1}}\}.

As the rows of AA are chosen uniformly at random from the set of all step vectors, the CiC_{i} are iid and

(C1=k)12n\mathbb{P}(C_{1}=k)\geq\frac{1}{2n}

for all $k\in[n]$. Hence, by Lemma 11, if $d_{1}\geq 2n\log(\frac{n}{\epsilon})$, then

(A is diverse)\displaystyle\mathbb{P}(A\text{ is diverse}) =([n]{C1,,Cd1})\displaystyle=\mathbb{P}([n]\subseteq\{C_{1},\ldots,C_{d_{1}}\})
1ϵ.\displaystyle\geq 1-\epsilon.

This holds for a randomly selected matrix with step vector rows. Translating this into a combinatorial statement, we see that all but at most a fraction $\epsilon$ of matrices with step vector rows are diverse. Furthermore, in each activation region $\mathcal{S}^{A}_{X}$ corresponding to a diverse $A$, the Jacobian is of full rank and every differentiable critical point of $L$ is a global minimum. ∎
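One can also sample row-step matrices at the width prescribed by the theorem and check how often they fail to be of full rank. This is our own sketch (assuming NumPy); with $d_{1}\geq 2n\log(n/\epsilon)$ the failing fraction should be below $\epsilon$, and empirically it is typically much smaller.

    import numpy as np

    rng = np.random.default_rng(5)
    n, eps = 5, 0.1
    d1 = int(np.ceil(2 * n * np.log(n / eps)))          # d1 = 40 for these values

    def step_vector(rng, n):
        alpha, k = rng.integers(0, 2), rng.integers(1, n + 1)
        row = np.full(n, alpha)
        row[k:] = 1 - alpha
        return row

    trials, full_rank = 2_000, 0
    for _ in range(trials):
        A = np.array([step_vector(rng, n) for _ in range(d1)])
        full_rank += int(np.linalg.matrix_rank(A) == n)

    print(1 - full_rank / trials, eps)                  # observed failing fraction vs. epsilon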

Proof of Theorem 12.

Let ϵ>0\epsilon>0. We define the following two sets of neurons based on the sign of their output weight,

𝒟1\displaystyle\mathcal{D}_{1} ={i[d1]:v(i)>0},\displaystyle=\{i\in[d_{1}]:v^{(i)}>0\},
𝒟0\displaystyle\mathcal{D}_{0} ={i[d1]:v(i)<0}.\displaystyle=\{i\in[d_{1}]:v^{(i)}<0\}.

Suppose that |𝒟1|,|𝒟0|2nlog(2nϵ)|\mathcal{D}_{1}|,|\mathcal{D}_{0}|\geq 2n\log(\frac{2n}{\epsilon}). Furthermore, without loss of generality, we index the points in the dataset such that

x(1)<x(2)<<x(n).x^{(1)}<x^{(2)}<\cdots<x^{(n)}.

Now consider a matrix $A\in\{0,1\}^{d_{1}\times n}$ whose rows are step vectors; by Lemma 9, $A$ then corresponds to a non-empty activation region. We say that $A$ is complete if for all $k\in\{0,\ldots,n-1\}$ and $\beta\in\{0,1\}$ there exists $i\in[d_{1}]$ such that $A_{i,\cdot}=\xi^{(k,1)}$ and $\operatorname{sgn}(v^{(i)})=\beta$. We first show that if $A$ is complete, then there exists a global minimizer in $\mathcal{S}^{A}_{X}$. Consider the linear map $\varphi:\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{1}}\to\mathbb{R}^{1\times n}$ defined by

φ(w,b)=[i=1d1v(i)(2Aij1)(w(i)x(j)+b(i))]j=1,,n.\varphi(w,b)=\left[\sum_{i=1}^{d_{1}}v^{(i)}(2A_{ij}-1)(w^{(i)}x^{(j)}+b^{(i)})\right]_{j=1,\ldots,n}.

Note that for $(w,b)\in\mathcal{S}^{A}_{X}$ we have $\varphi(w,b)=F(w,b,v,X)$. As proved in Lemma 20, for every $z\in\mathbb{R}^{1\times n}$ there exists $(w,b)\in\mathcal{S}^{A}_{X}$ such that $F(w,b,v,X)=z$. In particular, this means that $\varphi$ is surjective and therefore $\mathcal{S}^{A}_{X}$ contains zero-loss global minimizers. Define

𝒱y={(w,b)d1×d1:φ(w,b)=y},\mathcal{V}_{y}=\{(w,b)\in\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{1}}:\varphi(w,b)=y\},

then 𝒢X,y𝒮XA=𝒱y𝒮XA\mathcal{G}_{X,y}\cap\mathcal{S}^{A}_{X}=\mathcal{V}_{y}\cap\mathcal{S}^{A}_{X} and by the rank-nullity theorem

dim(𝒱y)\displaystyle\dim(\mathcal{V}_{y}) =dim(φ1({y}))\displaystyle=\dim(\varphi^{-1}(\{y\}))
=2d1n.\displaystyle=2d_{1}-n.

We therefore conclude that if $A$ is complete, then $\mathcal{G}_{X,y}\cap\mathcal{S}^{A}_{X}$ is the restriction of a $(2d_{1}-n)$-dimensional affine subspace to the open set $\mathcal{S}^{A}_{X}$. This is equivalent to $\mathcal{G}_{X,y}\cap\mathcal{S}^{A}_{X}$ being an affine set of codimension $n$ as claimed.

To prove the desired result it therefore suffices to show that most binary matrices with step vector rows are complete. Let $A$ be a random binary matrix whose rows are selected mutually iid uniformly at random from the set of all step vectors. For $\beta\in\{0,1\}$ and $i\in\mathcal{D}_{\beta}$ let $C_{\beta,i}\in[n+1]$ be defined as follows: if $A_{i,\cdot}=\xi^{(k-1,1)}$ for some $k\in[n]$ then $C_{\beta,i}=k$, otherwise $C_{\beta,i}=n+1$. Observe that $A$ is complete if and only if

[n]{C0,i:i𝒟0} and [n]{C1,i:i𝒟1}.[n]\subseteq\{C_{0,i}:i\in\mathcal{D}_{0}\}\text{ and }[n]\subseteq\{C_{1,i}:i\in\mathcal{D}_{1}\}.

Since there are 2n2n step vectors,

(Cβ,i=k)=12n\mathbb{P}(C_{\beta,i}=k)=\frac{1}{2n}

for all k[n]k\in[n]. So by the union bound and Lemma 11,

\displaystyle\mathbb{P}(A\text{ is complete})=\mathbb{P}\left([n]\subseteq\{C_{0,i}:i\in\mathcal{D}_{0}\}\text{ and }[n]\subseteq\{C_{1,i}:i\in\mathcal{D}_{1}\}\right)
1([n]{C0,i:i𝒟0})([n]{C1,i:i𝒟1})\displaystyle\geq 1-\mathbb{P}([n]\not\subseteq\{C_{0,i}:i\in\mathcal{D}_{0}\})-\mathbb{P}([n]\not\subseteq\{C_{1,i}:i\in\mathcal{D}_{1}\})
1ϵ2ϵ2\displaystyle\geq 1-\frac{\epsilon}{2}-\frac{\epsilon}{2}
=1ϵ.\displaystyle=1-\epsilon.

As this holds for a matrix with step vector rows chosen uniformly at random, it follows that all but at most a fraction $\epsilon$ of such matrices are complete. This concludes the proof. ∎

Lemma 20.

Let A{0,1}d1×nA\in\{0,1\}^{d_{1}\times n} be complete, and let Xd0×nX\in\mathbb{R}^{d_{0}\times n} be an input dataset with distinct points. Then for any y1×ny\in\mathbb{R}^{1\times n}, there exists (w,b)𝒮XA(w,b)\in\mathcal{S}^{A}_{X} such that F(w,b,v,X)=yF(w,b,v,X)=y.

Proof.

If $A$ is complete then there exist row indices $i_{0,0},i_{1,0},\ldots,i_{n-1,0}\in[d_{1}]$ and $i_{0,1},i_{1,1},\ldots,i_{n-1,1}\in[d_{1}]$ such that for all $k\in\{0,\ldots,n-1\}$ and $\beta\in\{0,1\}$ the $i_{k,\beta}$-th row of $A$ is equal to $\xi^{(k,1)}$ and $\operatorname{sgn}(v^{(i_{k,\beta})})=\beta$. For $\beta\in\{0,1\}$, we define

β={i0,β,,in1,β}.\mathcal{I}_{\beta}=\{i_{0,\beta},\ldots,i_{n-1,\beta}\}.

We will construct weights w(1),,w(d1)w^{(1)},\ldots,w^{(d_{1})} and biases b(1),,b(d1)b^{(1)},\ldots,b^{(d_{1})} such that (w,b)𝒮XA(w,b)\in\mathcal{S}^{A}_{X} and F(w,b,v,X)=yF(w,b,v,X)=y. For i01i\notin\mathcal{I}_{0}\cup\mathcal{I}_{1}, we define w(i)w^{(i)} and b(i)b^{(i)} arbitrarily such that

sgn(w(i)x(j)+b(i))=Aij\operatorname{sgn}(w^{(i)}x^{(j)}+b^{(i)})=A_{ij}

for all j[n]j\in[n]. Note by Lemma 9 that this is possible as each Ai,A_{i,\cdot} is a step vector. Therefore, in order to show the desired result it suffices only to appropriately construct w(i),b(i)w^{(i)},b^{(i)} for i01i\in\mathcal{I}_{0}\cup\mathcal{I}_{1}. To this end we proceed using the following sequence of steps.

  1. First we separately consider the contributions to the output of the network coming from the slack neurons, those with index $i\notin\mathcal{I}_{0}\cup\mathcal{I}_{1}$, and the key neurons, those with index $i\in\mathcal{I}_{0}\cup\mathcal{I}_{1}$. For $j\in[n]$, we denote by $z^{(j)}$ the residual of the target $y^{(j)}$ left over after removing the corresponding output of the slack part of the network. We then recursively build a sequence of functions $(g^{(l)})_{l=0}^{n-1}$ such that $g^{(n-1)}(x^{(j)})=z^{(j)}$ for all $j\in[n]$.

  2. Based on this construction, we select the parameters of the key neurons, i.e., $(w^{(i)},b^{(i)})$ for $i\in\mathcal{I}_{0}\cup\mathcal{I}_{1}$, and prove $(w,b)\in\mathcal{S}_{X}^{A}$.

  3. Finally, using these parameters and the function $g^{(n-1)}$ we show $F(w,b,v,X)=y$.

Step 1: for j[n]j\in[n] we define the residual

z(j)=y(j)i01v(i)σ(w(i)x(j)+b(i)).\displaystyle z^{(j)}=y^{(j)}-\sum_{i\notin\mathcal{I}_{0}\cup\mathcal{I}_{1}}v^{(i)}\sigma(w^{(i)}x^{(j)}+b^{(i)}). (6)

For $j\in\{0,\ldots,n-1\}$, consider the values $\beta^{(j)},\tilde{w}^{(j)},\tilde{b}^{(j)}\in\mathbb{R}$ and the functions $g^{(j)}:\mathbb{R}\to\mathbb{R}$, defined recursively across the dataset as

\beta^{(0)}=\operatorname{sgn}(z^{(1)}),\qquad\tilde{w}^{(0)}=1,\qquad\tilde{b}^{(0)}=|z^{(1)}|-x^{(1)},\qquad g^{(0)}(x)=(2\beta^{(0)}-1)\sigma(\tilde{w}^{(0)}x+\tilde{b}^{(0)}),

and, for $1\leq j\leq n-1$,

\beta^{(j)}=\operatorname{sgn}(z^{(j+1)}-g^{(j-1)}(x^{(j+1)})),\qquad\tilde{w}^{(j)}=\frac{2|z^{(j+1)}-g^{(j-1)}(x^{(j+1)})|}{x^{(j+1)}-x^{(j)}},

\tilde{b}^{(j)}=-\frac{|z^{(j+1)}-g^{(j-1)}(x^{(j+1)})|(x^{(j+1)}+x^{(j)})}{x^{(j+1)}-x^{(j)}},\qquad g^{(j)}(x)=\sum_{\ell=0}^{j}(2\beta^{(\ell)}-1)\sigma(\tilde{w}^{(\ell)}x+\tilde{b}^{(\ell)}).

Observe for all $j\in\{1,\ldots,n-1\}$ that $\tilde{w}^{(j)}\geq 0$ and

\tilde{w}^{(j)}\left(\frac{x^{(j)}+x^{(j+1)}}{2}\right)+\tilde{b}^{(j)}=0.

In particular, $\tilde{w}^{(j)}x^{(j^{\prime})}+\tilde{b}^{(j)}<0$ if $j^{\prime}\leq j$ and $\tilde{w}^{(j)}x^{(j^{\prime})}+\tilde{b}^{(j)}>0$ otherwise. Moreover, for all $j^{\prime}\in[n]$,

\tilde{w}^{(0)}x^{(j^{\prime})}+\tilde{b}^{(0)}=x^{(j^{\prime})}+|z^{(1)}|-x^{(1)}>0.

We claim that for all j[n]j\in[n], g(n1)(x(j))=z(j)g^{(n-1)}(x^{(j)})=z^{(j)}. For the case j=1j=1, we compute

g(n1)(x(1))\displaystyle g^{(n-1)}(x^{(1)}) ==0n1(2β()1)σ(w~()x(1)+b~())\displaystyle=\sum_{\ell=0}^{n-1}(2\beta^{(\ell)}-1)\sigma(\tilde{w}^{(\ell)}x^{(1)}+\tilde{b}^{(\ell)})
=(2β(0)1)σ(w~(0)x(1)+b~(0))\displaystyle=(2\beta^{(0)}-1)\sigma(\tilde{w}^{(0)}x^{(1)}+\tilde{b}^{(0)})
\displaystyle=(2\operatorname{sgn}(z^{(1)})-1)\sigma(x^{(1)}+|z^{(1)}|-x^{(1)})
\displaystyle=z^{(1)}.

For $j\in\{2,\ldots,n\}$,

g(n1)(x(j))\displaystyle g^{(n-1)}(x^{(j)})
==0n1(2β()1)σ(w~()x(j)+b~())\displaystyle=\sum_{\ell=0}^{n-1}(2\beta^{(\ell)}-1)\sigma(\tilde{w}^{(\ell)}x^{(j)}+\tilde{b}^{(\ell)})
==0j1(2β()1)σ(w~()x(j)+b~())\displaystyle=\sum_{\ell=0}^{j-1}(2\beta^{(\ell)}-1)\sigma(\tilde{w}^{(\ell)}x^{(j)}+\tilde{b}^{(\ell)})
=(2β(j1)1)σ(w~(j1)x(j)+b~(j1))+g(j2)(x(j))\displaystyle=(2\beta^{(j-1)}-1)\sigma(\tilde{w}^{(j-1)}x^{(j)}+\tilde{b}^{(j-1)})+g^{(j-2)}(x^{(j)})
=(2β(j1)1)(w~(j1)x(j)+b~(j1))+g(j2)(x(j))\displaystyle=(2\beta^{(j-1)}-1)(\tilde{w}^{(j-1)}x^{(j)}+\tilde{b}^{(j-1)})+g^{(j-2)}(x^{(j)})
=(2β(j1)1)(2|z(j)g(j2)(x(j))|x(j)x(j1)x(j)|z(j)g(j2)(x(j))|(x(j)+x(j1))x(j)x(j1))\displaystyle=(2\beta^{(j-1)}-1)\left(\frac{2|z^{(j)}-g^{(j-2)}(x^{(j)})|}{x^{(j)}-x^{(j-1)}}x^{(j)}-\frac{|z^{(j)}-g^{(j-2)}(x^{(j)})|(x^{(j)}+x^{(j-1)})}{x^{(j)}-x^{(j-1)}}\right)
+g(j2)(x(j))\displaystyle\phantom{=}+g^{(j-2)}(x^{(j)})
=(2β(j1)1)|z(j)g(j2)(x(j))|+g(j2)(x(j))\displaystyle=(2\beta^{(j-1)}-1)|z^{(j)}-g^{(j-2)}(x^{(j)})|+g^{(j-2)}(x^{(j)})
=(z(j)g(j2)(x(j)))+g(j2)(x(j))\displaystyle=(z^{(j)}-g^{(j-2)}(x^{(j)}))+g^{(j-2)}(x^{(j)})
=z(j).\displaystyle=z^{(j)}.

Hence, g(n1)(x(j))=z(j)g^{(n-1)}(x^{(j)})=z^{(j)} for all j[n]j\in[n].

Step 2: based on the construction above we define (w(i),b(i))(w^{(i)},b^{(i)}) for i01i\in\mathcal{I}_{0}\cup\mathcal{I}_{1}. For k{0,,n1}k\in\{0,\ldots,n-1\} and β{0,1}\beta\in\{0,1\}, we define

w(ik,β)\displaystyle w^{(i_{k,\beta})} =3+(2β(k)1)(2β1)2w~(k)|v(ik,β)|\displaystyle=\frac{3+(2\beta^{(k)}-1)(2\beta-1)}{2}\frac{\tilde{w}^{(k)}}{|v^{(i_{k,\beta})}|}
b(ik,β)\displaystyle b^{(i_{k,\beta})} =3+(2β(k)1)(2β1)2b~(k)|v(ik,β)|.\displaystyle=\frac{3+(2\beta^{(k)}-1)(2\beta-1)}{2}\frac{\tilde{b}^{(k)}}{|v^{(i_{k,\beta})}|}.

Now we show that this pair (w,b)(w,b) satisfies the desired properties. By construction, we have

sgn(w(i)x(j)+b(i))=Aij\operatorname{sgn}(w^{(i)}x^{(j)}+b^{(i)})=A_{ij}

for i01i\notin\mathcal{I}_{0}\cup\mathcal{I}_{1}, j[n]j\in[n]. If i01,j[n]i\in\mathcal{I}_{0}\cup\mathcal{I}_{1},j\in[n], then we can write i=ik,βi=i_{k,\beta} for some k{0,,n1}k\in\{0,\ldots,n-1\}, β{0,1}\beta\in\{0,1\}. Then

sgn(w(i)x(j)+b(i))\displaystyle\operatorname{sgn}(w^{(i)}x^{(j)}+b^{(i)}) =sgn(3+(2β(k)1)(2β1)2|v(ik,β)|(w~(k)x(j)+b~(k)))\displaystyle=\operatorname{sgn}\left(\frac{3+(2\beta^{(k)}-1)(2\beta-1)}{2|v^{(i_{k,\beta})}|}(\tilde{w}^{(k)}x^{(j)}+\tilde{b}^{(k)})\right)
=sgn(w~(k)x(j)+b~(k))\displaystyle=\operatorname{sgn}(\tilde{w}^{(k)}x^{(j)}+\tilde{b}^{(k)})
=1k<j\displaystyle=1_{k<j}
=Aij,\displaystyle=A_{ij},

where the second-to-last line follows from the construction of w~\tilde{w} and b~\tilde{b} and the last line follows from the fact that Ai,=ξ(k,1)A_{i,\cdot}=\xi^{(k,1)}.

Step 3: finally, we show that F(w,b,v,X)=yF(w,b,v,X)=y. For j[n]j\in[n],

F(w,b,x(j))\displaystyle F(w,b,x^{(j)}) =i=1d1v(i)σ(w(i)x(j)+b(i))\displaystyle=\sum_{i=1}^{d_{1}}v^{(i)}\sigma(w^{(i)}x^{(j)}+b^{(i)})
=i01v(i)σ(w(i)x(j)+b(i))+i01v(i)σ(w(i)x(j)+b(i))\displaystyle=\sum_{i\notin\mathcal{I}_{0}\cup\mathcal{I}_{1}}v^{(i)}\sigma(w^{(i)}x^{(j)}+b^{(i)})+\sum_{i\in\mathcal{I}_{0}\cup\mathcal{I}_{1}}v^{(i)}\sigma(w^{(i)}x^{(j)}+b^{(i)})
\displaystyle=y^{(j)}-z^{(j)}+\sum_{k=0}^{n-1}\sum_{\beta=0}^{1}v^{(i_{k,\beta})}\sigma(w^{(i_{k,\beta})}x^{(j)}+b^{(i_{k,\beta})}),

where the last line follows from (6). By the definition of w(ik,β)w^{(i_{k,\beta})} and b(ik,β)b^{(i_{k,\beta})}, this is equal to

\displaystyle y^{(j)}-z^{(j)}+\sum_{k=0}^{n-1}\sum_{\beta=0}^{1}\frac{v^{(i_{k,\beta})}}{|v^{(i_{k,\beta})}|}\frac{3+(2\beta^{(k)}-1)(2\beta-1)}{2}\sigma(\tilde{w}^{(k)}x^{(j)}+\tilde{b}^{(k)}).

By construction $\operatorname{sgn}(v^{(i_{k,\beta})})=\beta$, so the above is equal to

y(j)z(j)+k=0n1β=01(2β1)3+(2β(k)1)(2β1)2σ(w~(k)x(j)+b~(k))\displaystyle y^{(j)}-z^{(j)}+\sum_{k=0}^{n-1}\sum_{\beta=0}^{1}(2\beta-1)\frac{3+(2\beta^{(k)}-1)(2\beta-1)}{2}\sigma(\tilde{w}^{(k)}x^{(j)}+\tilde{b}^{(k)})
=y(j)z(j)+k=0n1(2β(k)1)σ(w~(k)x(j)+b~(k))\displaystyle=y^{(j)}-z^{(j)}+\sum_{k=0}^{n-1}(2\beta^{(k)}-1)\sigma(\tilde{w}^{(k)}x^{(j)}+\tilde{b}^{(k)})
=y(j)z(j)+g(n1)(x(j))\displaystyle=y^{(j)}-z^{(j)}+g^{(n-1)}(x^{(j)})
=y(j).\displaystyle=y^{(j)}.

In conclusion, we have therefore successfully identified weights and biases (w,b)𝒮XA(w,b)\in\mathcal{S}^{A}_{X} such that F(w,b,v,X)=yF(w,b,v,X)=y. ∎
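The interpolation step of the proof is fully constructive, so it can be executed. The sketch below is our own illustration (assuming NumPy); it implements only the recursion for $\beta^{(j)},\tilde{w}^{(j)},\tilde{b}^{(j)}$ and the function $g^{(n-1)}$, in zero-based indexing, rather than the assembly of the full parameter vector $(w,b)$, and checks that $g^{(n-1)}$ interpolates arbitrary residuals $z^{(j)}$ on sorted distinct inputs.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 7
    x = np.sort(rng.standard_normal(n))          # distinct sorted inputs x^(1) < ... < x^(n)
    z = rng.standard_normal(n)                   # arbitrary residual targets z^(j)

    relu = lambda t: max(t, 0.0)
    beta, w, b = np.empty(n), np.empty(n), np.empty(n)

    def g(j, t):
        """g^(j)(t) = sum_{l <= j} (2 beta_l - 1) relu(w_l t + b_l)."""
        return sum((2 * beta[l] - 1) * relu(w[l] * t + b[l]) for l in range(j + 1))

    beta[0], w[0], b[0] = (1.0 if z[0] > 0 else 0.0), 1.0, abs(z[0]) - x[0]
    for j in range(1, n):
        r = z[j] - g(j - 1, x[j])                # residual still to be matched at the j-th point
        beta[j] = 1.0 if r > 0 else 0.0
        w[j] = 2 * abs(r) / (x[j] - x[j - 1])
        b[j] = -abs(r) * (x[j] + x[j - 1]) / (x[j] - x[j - 1])

    print(np.allclose([g(n - 1, xj) for xj in x], z))    # True: g^(n-1)(x^(j)) = z^(j) for all j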

Proof of Theorem 13.

We say that a binary matrix AA with step vector rows is diverse if for all k[n]k\in[n], there exists i[d1]i\in[d_{1}] such that Ai,=ξ(k,1)A_{i,\cdot}=\xi^{(k,1)}. Suppose that AA is a binary matrix uniformly selected among all binary matrices with step vector rows. Define the random variables C1,,Cd1[n+1]C_{1},\cdots,C_{d_{1}}\in[n+1] by Ci=kC_{i}=k if Ai,=ξ(k,1)A_{i,\cdot}=\xi^{(k,1)} for some k[n]k\in[n], and Ci=n+1C_{i}=n+1 otherwise. Since the rows of AA are iid, the CiC_{i} are iid. Moreover, for each k[n]k\in[n], we have (Ci=k)12n\mathbb{P}(C_{i}=k)\geq\frac{1}{2n}. So by Lemma 11, if

d12nlog(nϵ),d_{1}\geq 2n\log\left(\frac{n}{\epsilon}\right),

then with probability at least 1ϵ1-\epsilon,

[n]{C1,,Cd1}.[n]\subseteq\{C_{1},\cdots,C_{d_{1}}\}.

This means that for each $k\in[n]$, there exists $i$ such that $C_{i}=k$, that is, $A_{i,\cdot}=\xi^{(k,1)}$. In other words, $A$ is diverse. Since we chose $A$ uniformly at random among all binary matrices with step vector rows, it follows that all but a fraction $\epsilon$ of such matrices are diverse.

Now it suffices to show that when AA is diverse, every local minimum of LL in 𝒮XA×d1\mathcal{S}^{A}_{X}\times\mathbb{R}^{d_{1}} is a global minimum. Note that FF is continuously differentiable with respect to vv everywhere. We will show that vF(w,b,v,X)\nabla_{v}F(w,b,v,X) has rank nn for all (w,b,v)𝒮XA×d1(w,b,v)\in\mathcal{S}^{A}_{X}\times\mathbb{R}^{d_{1}}. Since AA is diverse, there exist i1,,in[d1]i_{1},\cdots,i_{n}\in[d_{1}] such that for all k[n]k\in[n], Aik,=ξ(k,1)A_{i_{k},\cdot}=\xi^{(k,1)}. Consider the n×nn\times n submatrix MM of vF(w,b,v,X)\nabla_{v}F(w,b,v,X) generated by the rows i1,,ini_{1},\cdots,i_{n}. That is,

Mpq\displaystyle M_{pq} =Fv(ip)(w,b,v,x(q)).\displaystyle=\frac{\partial F}{\partial v^{(i_{p})}}(w,b,v,x^{(q)}).

Then

Mpq\displaystyle M_{pq} =σ(w(ip)x(q)+b(ip)),\displaystyle=\sigma(w^{(i_{p})}x^{(q)}+b^{(i_{p})}),

so the entries of MM are non-negative, and

sgn(Mpq)\displaystyle\operatorname{sgn}(M_{pq}) =sgn(w(ip)x(q)+b(ip))\displaystyle=\operatorname{sgn}(w^{(i_{p})}x^{(q)}+b^{(i_{p})})
=Aip,q\displaystyle=A_{i_{p},q}
=(ξ(p,1))q\displaystyle=(\xi^{(p,1)})_{q}
=1qp.\displaystyle=1_{q\geq p}.

Hence, $M$ is upper triangular with positive entries on its diagonal, implying that $\operatorname{rank}(M)=n$. Since $M$ is a submatrix of $\nabla_{v}F(w,b,v,X)$, we have $\operatorname{rank}(\nabla_{v}F(w,b,v,X))=n$ as well. Now suppose that $(w,b,v)\in\mathcal{S}^{A}_{X}\times\mathbb{R}^{d_{1}}$ is a local minimum of $L$. Then

vL(w,b,v,X)=vF(w,b,v,X)(F(w,b,v,X)y)=0.\nabla_{v}L(w,b,v,X)=\nabla_{v}F(w,b,v,X)\cdot(F(w,b,v,X)-y)=0.

Since $\nabla_{v}F(w,b,v,X)$ has rank $n$, this implies that $F(w,b,v,X)=y$, and so $(w,b,v)$ is a global minimizer of $L$. So whenever $A$ is diverse, the region $\mathcal{S}^{A}_{X}\times\mathbb{R}^{d_{1}}$ has no bad local minima. This concludes the proof. ∎

Appendix D Function space on one-dimensional data

We have studied activation regions in parameter space over which the Jacobian has full rank. We can give a picture of what the function space looks like as follows. We describe the function space of a ReLU over the data and based on this the function space of a network with a hidden layer of ReLUs.

Consider a single ReLU with one input and a bias on $n$ input data points in $\mathbb{R}$. Equivalently, this can be viewed as a ReLU with two input dimensions and no bias on $n$ input data points in $1\times\mathbb{R}$. For fixed $X$, denote by $F_{X}(\theta)=[f(x^{(1)}),\ldots,f(x^{(n)})]$ the vector of outputs on all input data points, and by $\mathcal{F}_{X}=\{F_{X}(\theta)\colon\theta\}$ the set of all such vectors over all choices of the parameters $\theta$. For a network with a single output coordinate, this is a subset of $\mathbb{R}^{X}$. As before, without loss of generality we sort the input data as $x^{(1)}<\cdots<x^{(n)}$ (according to the non-constant component). We will use notation of the form $X_{\geq i}=[0,\ldots,0,x^{(i)},\ldots,x^{(n)}]$. Further, we write $\bar{x}^{(i)}=[x^{(i)}_{2},-1]$, which is a solution of the linear equation $\langle w,x^{(i)}\rangle=0$. Recall that a polyline is a list of points with line segments drawn between consecutive points.

Since the parametrization map is piecewise linear, the function space is the union of the images of the Jacobian over the parameter regions where it is constant. In the case of a single ReLU one quickly verifies that, for n=1n=1, X=0\mathcal{F}_{X}=\mathbb{R}_{\geq 0}, and for n=2n=2, X=02\mathcal{F}_{X}=\mathbb{R}_{\geq 0}^{2}. For general nn, there will be equality and inequality constraints, coming from the bounded rank of the Jacobian and the boundaries of the linear regions in parameter space.

Figure 4: Function space of a ReLU on nn data points in \mathbb{R}, for n=3,4n=3,4. The function space is a polyhedral cone in the non-negative orthant of n\mathbb{R}^{n}. We can represent this, up to scaling by non-negative factors, by functions f=(f1,,fn)f=(f_{1},\ldots,f_{n}) with f1++fn=1f_{1}+\cdots+f_{n}=1. These form a polyline, shown in red, inside the (n1)(n-1)-simplex. The sum of mm ReLUs corresponds to non-negative multiples of convex combinations of any mm points in the polyline, and arbitrary linear combinations of mm ReLUs correspond to arbitrary scalar multiples of affine combinations of any mm points in this polyline.

The function space of a single ReLU on 33 and 44 data points is illustrated in Figure 4.

For a single ReLU, there are $2n$ non-empty activation regions in parameter space. One of them has Jacobian rank 0 and is mapped to the 0 function, two others have Jacobian rank 1 and are mapped to non-negative scalar multiples of the coordinate vectors $e_{1}$ and $e_{n}$, and the other $2n-3$ regions have Jacobian rank 2 and are each mapped to the set of non-negative scalar multiples of a line segment in the polyline. Vertices in the list equation 7 correspond to the extreme rays of the linear pieces of the function space of the ReLU. They are the extreme rays of the cone of non-negative convex functions on $X$. Here a function on $X$ is convex if $\frac{f(x^{(i+1)})-f(x^{(i)})}{x^{(i+1)}-x^{(i)}}\leq\frac{f(x^{(i+2)})-f(x^{(i+1)})}{x^{(i+2)}-x^{(i+1)}}$ for all $i=1,\ldots,n-2$. A non-negative sum of ReLUs is always contained in the cone of non-negative convex functions, which over $n$ data points is an $n$-dimensional convex polyhedral cone. For an overparameterized network, if an activation region in parameter space has a full rank Jacobian, then that region maps to an $n$-dimensional polyhedral cone. It contains a zero-loss minimizer if and only if the corresponding cone contains the output data vector $y\in\mathbb{R}^{X}$.

Proposition 21 (Function space on one-dimensional data).

Let XX be a list of nn distinct points in 1×1\times\mathbb{R} sorted in increasing order with respect to the second coordinate.

  • Then the set of functions a ReLU represents on XX is a polyhedral cone consisting of functions αf\alpha f, where α0\alpha\geq 0 and ff is an element of the polyline with vertices

    x¯(i)Xi,i=1,,nandx¯(i)Xi,i=1,,n.\displaystyle\bar{x}^{(i)}X_{\leq i},\;i=1,\ldots,n\quad\text{and}\quad-\bar{x}^{(i)}X_{\geq i},\;i=1,\ldots,n. (7)
  • The set of functions represented by a sum of mm ReLUs consists of non-negative scalar multiples of convex combinations of any mm points on this polyline.

  • The set of functions represented by arbitrary linear combinations of mm ReLUs consists of arbitrary scalar multiples of affine combinations of any mm points on this polyline.

Proof of Proposition 21.

Consider first the case of a single ReLU. We write x(i)x^{(i)} for the input data points in 1×1\times\mathbb{R}. The activation regions in parameter space are determined by the arrangement of hyperplanes Hi={w:w,x(i)=0}H_{i}=\{w\colon\langle w,x^{(i)}\rangle=0\}. Namely, the unit is active on the input data point x(i)x^{(i)} if and only if the parameter is contained in the half-space Hi+={w:w,x(i)>0}H_{i}^{+}=\{w\colon\langle w,x^{(i)}\rangle>0\} and it is inactive otherwise. We write x¯(i)=[x2(i),1]\bar{x}^{(i)}=[x^{(i)}_{2},-1], which is a row vector that satisfies x¯(i),x(i)=0\langle\bar{x}^{(i)},x^{(i)}\rangle=0. We write 𝟏S\mathbf{1}_{S} for a vector in 1×n\mathbb{R}^{1\times n} with ones at the coordinates SS and zeros elsewhere, and write XS=𝟏SXX_{S}=\mathbf{1}_{S}\ast X for the matrix in d0×n\mathbb{R}^{d_{0}\times n} where we substitute columns of XX whose index is not in SS by zero columns.

With these notations, in the following table we list, for each of the non-empty activation regions, the rank of the Jacobian, the activation pattern, the description of the activation region as an intersection of half-spaces, the extreme rays of the activation region, and the extreme rays of the function space represented by the activation region, which is simply the image of the Jacobian over the activation region.

rank | $A$ | $\mathcal{S}^{A}_{X}$ | extreme rays of $\mathcal{S}^{A}_{X}$ | extreme rays of $\mathcal{F}^{A}_{X}$
0 | $\mathbf{1}_{\emptyset}$ | $H_{1}^{-}\cap H_{n}^{-}$ | $\bar{x}^{(1)},\;-\bar{x}^{(n)}$ | $0$
1 | $\mathbf{1}_{1}$ | $H_{1}^{+}\cap H_{2}^{-}$ | $\bar{x}^{(1)},\;\bar{x}^{(2)}$ | $e_{1}$
2 | $\mathbf{1}_{\leq i}$ ($i=2,\ldots,n-1$) | $H_{i}^{+}\cap H_{i+1}^{-}$ | $\bar{x}^{(i)},\;\bar{x}^{(i+1)}$ | $\bar{x}^{(i)}X_{\leq i-1},\;\bar{x}^{(i+1)}X_{\leq i}$
2 | $\mathbf{1}_{[n]}$ | $H_{n}^{+}\cap H_{1}^{+}$ | $\bar{x}^{(n)},\;-\bar{x}^{(1)}$ | $\bar{x}^{(n)}X_{\leq n-1},\;-\bar{x}^{(1)}X_{\geq 2}$
2 | $\mathbf{1}_{\geq i}$ ($i=2,\ldots,n-1$) | $H_{i-1}^{-}\cap H_{i}^{+}$ | $-\bar{x}^{(i)},\;-\bar{x}^{(i-1)}$ | $-\bar{x}^{(i)}X_{\geq i+1},\;-\bar{x}^{(i-1)}X_{\geq i}$
1 | $\mathbf{1}_{n}$ | $H_{n-1}^{-}\cap H_{n}^{+}$ | $-\bar{x}^{(n)},\;-\bar{x}^{(n-1)}$ | $e_{n}$

The situation is illustrated in Figure 5 (see also Figure 4). On one of the parameter regions the unit is inactive on all data points so that the Jacobian has rank 0 and maps to 0. There are precisely two parameter regions where the unit is active on just one data point, x(1)x^{(1)} or x(n)x^{(n)}, so that the Jacobian has rank 1 and maps to non-negative multiples of e1e_{1} and ene_{n}, respectively. On all the other parameter regions the unit is active at least on two data points. On those data points where the unit is active it can adjust the slope and intercept by local changes of the bias and weight and these are all the available degrees of freedom, so that the Jacobian has rank 2. These regions map to two-dimensional cones in function space. To obtain the extreme rays of these cones, we just evaluate the Jacobian on the two extreme rays of the activation region. This gives us item 1 in the proposition.

Figure 5: Subdivision of the parameter space of a single ReLU on two data points x(1),x(2)x^{(1)},x^{(2)} in 1×11\times\mathbb{R}^{1} by values of the Jacobian (left) and corresponding pieces of the function space in 2\mathbb{R}^{2} (right). The activation regions are intersections of half-spaces with activation patterns indicating the positive ones or, equivalently, the indices of data points where the unit is active.

Consider now $d_{1}$ ReLUs. Recall that the Minkowski sum of two sets $M,N$ in a vector space is defined as $M+N=\{f+g\colon f\in M,g\in N\}$. An activation region $\mathcal{S}_{X}^{A}$ with activation pattern $A$ for all units corresponds to picking one region with pattern $a^{(i)}$ for each of the units, $i=1,\ldots,d_{1}$. Since the parametrization map is linear on the activation region, the overall computed function is simply the sum of the functions computed by each of the units,

F(W,v,X)=\sum_{i\in[d_{1}]}v_{i}F(w^{(i)},X).

Here $F(W,v,X)=\sum_{i}v_{i}\sigma(w^{(i)}X)$ is the overall function and $F(w^{(i)},X)=\sigma(w^{(i)}X)$ is the function computed by the $i$-th unit. The parameters and activation regions of all units are independent of each other. Thus

XA=i[d1]viXa(i).\mathcal{F}_{X}^{A}=\sum_{i\in[d_{1}]}v_{i}\mathcal{F}_{X}^{a^{(i)}}.

Here we write $\mathcal{F}_{X}^{a^{(i)}}=\{(a^{(i)}_{j}w^{(i)}x^{(j)})_{j}\in\mathbb{R}^{X}\colon w^{(i)}\in\mathcal{S}_{X}^{a^{(i)}}\}$ for the function space of the $i$-th unit over its activation region $\mathcal{S}_{X}^{a^{(i)}}$. This is a cone and thus it is closed under non-negative scaling,

\mathcal{F}_{X}^{a^{(i)}}=\alpha_{i}\mathcal{F}_{X}^{a^{(i)}}\quad\text{for all }\alpha_{i}>0.

Thus, for an arbitrary linear combination of d1d_{1} ReLUs we have

f=i[d1]vif(i)=i[d1]:f(i)0vif(i)1f(i)f(i)1.f=\sum_{i\in[d_{1}]}v_{i}f^{(i)}=\sum_{i\in[d_{1}]\colon f^{(i)}\neq 0}v_{i}\|f^{(i)}\|_{1}\frac{f^{(i)}}{\|f^{(i)}\|_{1}}.

Here $f^{(i)}$ is an arbitrary function represented by the $i$-th unit. We have $\sum_{j}f^{(i)}_{j}=\|f^{(i)}\|_{1}$ and $\sum_{j}f_{j}=\sum_{i}v_{i}\|f^{(i)}\|_{1}$. Thus if $f$ satisfies $f_{1}+\cdots+f_{n}=1$, then $\sum_{i}v_{i}\|f^{(i)}\|_{1}=1$, and hence $f$ is an affine combination of the functions $\frac{f^{(i)}}{\|f^{(i)}\|_{1}}$. If all $v_{i}$ are non-negative, then $v_{i}\|f^{(i)}\|_{1}\geq 0$ and the affine combination is a convex combination. Each of the summands is an element of the function space of a single ReLU with entries adding to one.

In conclusion, the function space of a network with one hidden layer of d1d_{1} ReLUs with non-negative output weights is the set of non-negative scalar multiples of functions in the convex hull of any d1d_{1} functions in the normalized function space of a single ReLU. For a network with arbitrary output weights we obtain arbitrary scalar multiples of the affine hulls of any d1d_{1} functions in the normalized function space of a single ReLU. This is what was claimed in items 2 and 3, respectively. ∎
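The first item of Proposition 21 can be spot-checked numerically: evaluating a ReLU at the boundary rays $w=\bar{x}^{(i)}$ and $w=-\bar{x}^{(i)}$ of its activation regions reproduces the polyline vertices $\bar{x}^{(i)}X_{\leq i}$ and $-\bar{x}^{(i)}X_{\geq i}$ of equation 7. The sketch below is our own illustration and assumes NumPy is available.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 6
    t = np.sort(rng.standard_normal(n))
    X = np.vstack([np.ones(n), t])                  # x^(j) = (1, t_j) in 1 x R

    relu = lambda u: np.maximum(u, 0.0)

    ok = True
    for i in range(n):
        xbar = np.array([t[i], -1.0])               # xbar^(i) = [x_2^(i), -1], orthogonal to x^(i)
        X_le = X.copy(); X_le[:, i + 1:] = 0.0      # X_{<= i}: columns after i replaced by zero
        X_ge = X.copy(); X_ge[:, :i] = 0.0          # X_{>= i}: columns before i replaced by zero
        ok = ok and np.allclose(relu(xbar @ X), xbar @ X_le)        # vertex  xbar^(i) X_{<= i}
        ok = ok and np.allclose(relu(-xbar @ X), -(xbar @ X_ge))    # vertex -xbar^(i) X_{>= i}
    print(ok)                                       # True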

Appendix E Details on loss landscapes of deep networks

We provide details and proofs of the results in Section 5. We say that a binary matrix BB is non-repeating if its columns are distinct.

Proposition 22.

Let Xd0×nX\in\mathbb{R}^{d_{0}\times n} be an input dataset with distinct points. Suppose that AA is an activation pattern such that AL1A_{L-1} has rank nn, and such that AlA_{l} is non-repeating for all l[L2]l\in[L-2]. Then for all W𝒮XAW\in\mathcal{S}^{A}_{X}, WF(W,X)\nabla_{W}F(W,X) has rank nn.

Proof.

Suppose that AA is an activation pattern satisfying the stated properties. We claim that for all W𝒮XAW\in\mathcal{S}^{A}_{X}, l{0,,L2}l\in\{0,\cdots,L-2\}, and j,k[n]j,k\in[n] with jkj\neq k, we have fl(W,x(j))fl(W,x(k))f_{l}(W,x^{(j)})\neq f_{l}(W,x^{(k)}). We prove this claim by induction on ll. The base case l=0l=0 holds by the assumption that the data points are distinct. Suppose that the claim holds for some l{0,,L3}l\in\{0,\cdots,L-3\}. By assumption, Al+1A_{l+1} is non-repeating, so the columns (Al+1),j(A_{l+1})_{\cdot,j} and (Al+1),k(A_{l+1})_{\cdot,k} are not equal. Let i[dl+1]i\in[d_{l+1}] be such that (Al+1)ij(Al+1)ik(A_{l+1})_{ij}\neq(A_{l+1})_{ik}. Then, since W𝒮XAW\in\mathcal{S}^{A}_{X},

sgn(wl+1(i),fl(W,x(j)))sgn(wl+1(i),fl(W,x(k))).\displaystyle\operatorname{sgn}(\langle w_{l+1}^{(i)},f_{l}(W,x^{(j)})\rangle)\neq\operatorname{sgn}(\langle w_{l+1}^{(i)},f_{l}(W,x^{(k)})\rangle).

This implies that

σ(wl+1(i),fl(W,x(j)))σ(wl+1(i),fl(W,x(k))),\displaystyle\sigma(\langle w_{l+1}^{(i)},f_{l}(W,x^{(j)})\rangle)\neq\sigma(\langle w_{l+1}^{(i)},f_{l}(W,x^{(k)})\rangle),

or in other words

(fl+1(W,x(j)))i(fl+1(W,x(k)))i.\displaystyle(f_{l+1}(W,x^{(j)}))_{i}\neq(f_{l+1}(W,x^{(k)}))_{i}.

So fl+1(W,x(j))fl+1(W,x(k))f_{l+1}(W,x^{(j)})\neq f_{l+1}(W,x^{(k)}), proving the claim by induction.

Now we consider the gradient of FF with respect to the (L1)(L-1)-th layer. Let X~dL2×n\tilde{X}\in\mathbb{R}^{d_{L-2}\times n} be defined by X~:=fL2(W,X)\tilde{X}:=f_{L-2}(W,X), and for j[n]j\in[n] let x~(j)\tilde{x}^{(j)} denote the jj-th column of X~\tilde{X}. Let a(1),,a(n)a^{(1)},\cdots,a^{(n)} denote the columns of AL1A_{L-1}. Then for all W𝒮XAW\in\mathcal{S}^{A}_{X},

WL1F(W,X)=((va(1))x~(1),,(va(n))x~(n)).\nabla_{W_{L-1}}F(W,X)=((v\odot a^{(1)})\otimes\tilde{x}^{(1)},\cdots,(v\odot a^{(n)})\otimes\tilde{x}^{(n)}).

By Lemma 4, the rank of this matrix is equal to the rank of the matrix

(a(1)x~(1),,a(n)x~(n)).(a^{(1)}\otimes\tilde{x}^{(1)},\cdots,a^{(n)}\otimes\tilde{x}^{(n)}).

But AL1A_{L-1} has rank nn by assumption, so the set a(1),,a(n)a^{(1)},\cdots,a^{(n)} is linearly independent, implying that the above matrix is full rank. Hence, WL1F(W,X)\nabla_{W_{L-1}}F(W,X) has full rank, and so WF(W,X)\nabla_{W}F(W,X) has full rank. ∎
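Proposition 22 can also be probed numerically. The sketch below is our own illustration (not code from the paper): it builds a small deep ReLU network in PyTorch with placeholder widths and no biases, assembles the Jacobian of the network outputs on a random dataset with respect to all weights, and reports its rank. At a generic parameter the reported rank is the rank associated with the activation region containing that parameter.

```python
import torch

torch.manual_seed(0)

n, d0, widths = 6, 3, [8, 8, 8]        # data points, input dimension, hidden widths
X = torch.randn(d0, n)

# One weight matrix per hidden layer (biases omitted for brevity) plus output weights.
dims = [d0] + widths
Ws = [torch.randn(dims[l + 1], dims[l], requires_grad=True) for l in range(len(widths))]
v = torch.randn(widths[-1], requires_grad=True)

def forward(X):
    h = X
    for W in Ws:
        h = torch.relu(W @ h)
    return v @ h                        # one scalar output per data point

y = forward(X)                          # shape (n,)
params = Ws + [v]                       # output weights included; extra columns cannot lower the rank

# Assemble the n x (number of parameters) Jacobian row by row.
rows = []
for j in range(n):
    grads = torch.autograd.grad(y[j], params, retain_graph=True)
    rows.append(torch.cat([g.reshape(-1) for g in grads]))
J = torch.stack(rows)

print("Jacobian shape:", tuple(J.shape), "rank:", torch.linalg.matrix_rank(J).item())
```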

Now we count the number of activation patterns which satisfy the assumptions of Proposition 22 and hence correspond to regions with full rank Jacobian.

Lemma 23.

Suppose that Bd×nB\in\mathbb{R}^{d\times n} has entries chosen iid uniformly from {0,1}\{0,1\}. If

d=Ω(lognδ),d=\Omega\left(\log\frac{n}{\delta}\right),

then with probability at least 1δ1-\delta, BB is non-repeating.

Proof.

For any j,k[n]j,k\in[n] with jkj\neq k,

(B,j=B,k)\displaystyle\mathbb{P}(B_{\cdot,j}=B_{\cdot,k}) =(Bij=Bik for all i[d])\displaystyle=\mathbb{P}(B_{ij}=B_{ik}\text{ for all }i\in[d])
=2d.\displaystyle=2^{-d}.

So

(B is non-repeating)\displaystyle\mathbb{P}(B\text{ is non-repeating}) =(B,jB,k for all j,k[n] with jk)\displaystyle=\mathbb{P}(B_{\cdot,j}\neq B_{\cdot,k}\text{ for all $j,k\in[n]$ with $j\neq k$})
1j,k[n]jk(B,j=B,k)\displaystyle\geq 1-\sum_{\begin{subarray}{c}j,k\in[n]\\ j\neq k\end{subarray}}\mathbb{P}(B_{\cdot,j}=B_{\cdot,k})
1n22d.\displaystyle\geq 1-n^{2}2^{-d}.

If

d2logn+log1δlog2,d\geq\frac{2\log n+\log\frac{1}{\delta}}{\log 2},

then the above expression is at least 1δ1-\delta. ∎

Lemma 24.

Suppose that Bd×nB\in\mathbb{R}^{d\times n} has entries chosen iid uniformly from {0,1}\{0,1\}. If

d=n+Ω(log1δ),d=n+\Omega\left(\log\frac{1}{\delta}\right),

then with probability at least 1δ1-\delta, BB has rank nn.

Proof.

Suppose that dnd\geq n. Let BB^{\prime} be a d×dd\times d matrix selected uniformly at random from {0,1}d×d\{0,1\}^{d\times d}, and let BB consist of the first nn columns of BB^{\prime}. Note that BB has entries chosen iid uniformly from {0,1}\{0,1\}, and that BB has rank nn whenever BB^{\prime} is invertible. Moreover, by Theorem 3, BB^{\prime} is singular with probability at most C(0.72)dC(0.72)^{d}, where C1C\geq 1 is a universal constant. Then

(rank(B)=n)\displaystyle\mathbb{P}(\operatorname{rank}(B)=n) (B is invertible)\displaystyle\geq\mathbb{P}(B^{\prime}\text{ is invertible})
1C(0.72)d.\displaystyle\geq 1-C(0.72)^{d}.

Setting

dn+logCδlog10.72,d\geq n+\frac{\log\frac{C}{\delta}}{\log\frac{1}{0.72}},

we get that the above expression is at least

1C(0.72)log(C/δ)/log(1/0.72)\displaystyle 1-C(0.72)^{\log(C/\delta)/\log(1/0.72)} =1δ.\displaystyle=1-\delta.

Hence, if d=n+Ω(log1δ)d=n+\Omega(\log\frac{1}{\delta}), then BB has rank nn with probability at least 1δ1-\delta. ∎
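Lemmas 23 and 24 are easy to check empirically. The following numpy sketch (ours, with arbitrary choices of n, d, and trial count) estimates the probability that a random binary matrix is non-repeating and the probability that it has rank n as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 2000

for d in (10, 20, 25, 45):
    non_repeating = 0
    full_rank = 0
    for _ in range(trials):
        B = rng.integers(0, 2, size=(d, n))
        # Non-repeating: all n columns are distinct (Lemma 23).
        non_repeating += len({tuple(col) for col in B.T}) == n
        # Rank over the reals (Lemma 24).
        full_rank += np.linalg.matrix_rank(B) == n
    print(f"d={d:3d}  P(non-repeating) ~ {non_repeating / trials:.3f}  "
          f"P(rank n) ~ {full_rank / trials:.3f}")
```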

Proof of Theorem 14.

By Proposition 22, it suffices to count the fraction of activation patterns AA such that AlA_{l} is non-repeating for l[L2]l\in[L-2] and AL1A_{L-1} has rank nn. Let AA be an activation pattern whose entries are chosen iid uniformly from {0,1}\{0,1\}. Fix l[L2]l\in[L-2]. Since dl=Ω(log(nϵL))d_{l}=\Omega(\log(\frac{n}{\epsilon L})), by Lemma 23 we have with probability at least 1ϵ2L1-\frac{\epsilon}{2L} that AlA_{l} is non-repeating. Hence, by a union bound, with probability at least 1ϵ21-\frac{\epsilon}{2}, all of the AlA_{l} for l[L2]l\in[L-2] are non-repeating. Since dL1=n+Ω(log1ϵ)d_{L-1}=n+\Omega(\log\frac{1}{\epsilon}), by Lemma 24 we have with probability at least 1ϵ21-\frac{\epsilon}{2} that AL1A_{L-1} has rank nn. Putting everything together, we have with probability at least 1ϵ1-\epsilon that AlA_{l} is non-repeating for l[L2]l\in[L-2] and AL1A_{L-1} has rank nn. So with this probability, WF(W,X)\nabla_{W}F(W,X) has rank nn. Since the activation pattern was generated uniformly at random from all activation patterns, the fraction of patterns AA with full rank Jacobian is at least 1ϵ1-\epsilon. ∎

Appendix F Details on the volume of activation regions

We provide details and proofs of the statements in Section 6.

F.1 One-dimensional input data

Proof of Proposition 15.

Let us define ψ~:[,][0,2]\tilde{\psi}:[-\infty,\infty]\to[0,2] by

ψ~(x):=μ({(w,b)(0,1]×[1,1]:bwx}).\tilde{\psi}(x):=\mu\left(\left\{(w,b)\in(0,1]\times[-1,1]:\frac{b}{w}\leq x\right\}\right).

For xx\in\mathbb{R},

ψ~(x)\displaystyle\tilde{\psi}(x) =01111bwx𝑑b𝑑w\displaystyle=\int_{0}^{1}\int_{-1}^{1}1_{b\leq wx}dbdw
=01(1w1/|x|(1+wx)+1wx1(2))𝑑w\displaystyle=\int_{0}^{1}(1_{w\leq 1/|x|}(1+wx)+1_{wx\geq 1}(2))dw
={12xif x11+x2if 1<x1212xif x1\displaystyle=\begin{cases}-\frac{1}{2x}&\text{if }x\leq-1\\ 1+\frac{x}{2}&\text{if }-1<x\leq 1\\ 2-\frac{1}{2x}&\text{if }x\geq 1\end{cases}
=ψ(x).\displaystyle=\psi(x).

Moreover, ψ~()=2=ψ()\tilde{\psi}(\infty)=2=\psi(\infty) and ψ~()=0=ψ()\tilde{\psi}(-\infty)=0=\psi(-\infty). So ψ=ψ~\psi=\tilde{\psi}. For all k[n+1]k\in[n+1], a neuron (w,b)(w,b) has activation pattern ξ(k,1)\xi^{(k,1)} if and only if wx(k1)+b<0wx^{(k-1)}+b<0 and wx(k)+b>0wx^{(k)}+b>0. So

μ(𝒩k,1([1,1]×[1,1]))\displaystyle\mu(\mathcal{N}_{k,1}\cap([-1,1]\times[-1,1])) =μ({(w,b)[1,1]×[1,1]:wx(k1)+b<0,wx(k)+b>0})\displaystyle=\mu\left(\left\{(w,b)\in[-1,1]\times[-1,1]:wx^{(k-1)}+b<0,wx^{(k)}+b>0\right\}\right)
=μ({(w,b)[0,1]×[1,1]:wx(k1)+b<0,wx(k)+b>0})\displaystyle=\mu\left(\left\{(w,b)\in[0,1]\times[-1,1]:wx^{(k-1)}+b<0,wx^{(k)}+b>0\right\}\right)
=μ({(w,b)[0,1]×[1,1]:wx(k1)b<0,wx(k)b>0})\displaystyle=\mu\left(\left\{(w,b)\in[0,1]\times[-1,1]:wx^{(k-1)}-b<0,wx^{(k)}-b>0\right\}\right)
=μ({(w,b)[0,1]×[1,1]:x(k1)<bw<x(k)})\displaystyle=\mu\left(\left\{(w,b)\in[0,1]\times[-1,1]:x^{(k-1)}<\frac{b}{w}<x^{(k)}\right\}\right)
=ψ(x(k))ψ(x(k1)).\displaystyle=\psi(x^{(k)})-\psi(x^{(k-1)}).

Similarly, a neuron (w,b)(w,b) has activation pattern ξ(k,0)\xi^{(k,0)} if and only if wx(k1)+b>0wx^{(k-1)}+b>0 and wx(k)+b<0wx^{(k)}+b<0. So

μ(𝒩k,0([1,1]×[1,1]))\displaystyle\mu(\mathcal{N}_{k,0}\cap([-1,1]\times[-1,1])) =μ({(w,b)[1,1]×[1,1]:wx(k1)+b>0,wx(k)+b<0})\displaystyle=\mu\left(\left\{(w,b)\in[-1,1]\times[-1,1]:wx^{(k-1)}+b>0,wx^{(k)}+b<0\right\}\right)
=μ({(w,b)[1,0]×[1,1]:wx(k1)+b>0,wx(k)+b<0})\displaystyle=\mu\left(\left\{(w,b)\in[-1,0]\times[-1,1]:wx^{(k-1)}+b>0,wx^{(k)}+b<0\right\}\right)
=μ({(w,b)[0,1]×[1,1]:wx(k1)+b>0,wx(k)+b<0})\displaystyle=\mu\left(\left\{(w,b)\in[0,1]\times[-1,1]:-wx^{(k-1)}+b>0,-wx^{(k)}+b<0\right\}\right)
=μ({(w,b)[0,1]×[1,1]:x(k1)<bw<x(k)})\displaystyle=\mu\left(\left\{(w,b)\in[0,1]\times[-1,1]:x^{(k-1)}<\frac{b}{w}<x^{(k)}\right\}\right)
=ψ(x(k))ψ(x(k1)).\displaystyle=\psi(x^{(k)})-\psi(x^{(k-1)}).

This establishes the result. ∎
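As a sanity check on the closed form of ψ, one can compare it with a Monte Carlo estimate of the defining measure. The sketch below is our own illustration; the sample size and test points are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(x):
    # Closed form of psi derived in the proof of Proposition 15.
    if x <= -1:
        return -1.0 / (2.0 * x)
    if x <= 1:
        return 1.0 + x / 2.0
    return 2.0 - 1.0 / (2.0 * x)

def psi_mc(x, m=200_000):
    # Monte Carlo estimate of mu({(w, b) in (0,1] x [-1,1] : b/w <= x}).
    w = rng.uniform(0.0, 1.0, m)
    b = rng.uniform(-1.0, 1.0, m)
    return 2.0 * np.mean(b <= w * x)    # 2 = area of (0,1] x [-1,1]; b/w <= x iff b <= w*x for w > 0

for x in (-3.0, -0.5, 0.7, 4.0):
    print(f"x={x:5.1f}  psi={psi(x):.4f}  Monte Carlo={psi_mc(x):.4f}")
```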

If the amount of separation between data points is large enough, the volume of the activation regions separating them should be large. The following proposition formalizes this.

Proposition 25.

Let n2n\geq 2. Suppose that for all j,k[n]j,k\in[n] with jkj\neq k, we have |x(j)|1|x^{(j)}|\leq 1 and |x(j)x(k)|ϕ|x^{(j)}-x^{(k)}|\geq\phi. Then for all k[n+1]k\in[n+1] and β{0,1}\beta\in\{0,1\},

μ(𝒩k,β([1,1]×[1,1]))ϕ4.\mu(\mathcal{N}_{k,\beta}\cap([-1,1]\times[-1,1]))\geq\frac{\phi}{4}.

Moreover, for all A{0,1}d1×nA\in\{0,1\}^{d_{1}\times n} whose rows are step vectors,

μ(𝒮XA([1,1]d1×[1,1]d1))(ϕ4)d1.\mu(\mathcal{S}^{A}_{X}\cap([-1,1]^{d_{1}}\times[-1,1]^{d_{1}}))\geq\left(\frac{\phi}{4}\right)^{d_{1}}.
Proof.

Since n2n\geq 2, |x(1)|,|x(2)|1|x^{(1)}|,|x^{(2)}|\leq 1, and |x(1)x(2)|ϕ|x^{(1)}-x^{(2)}|\geq\phi, we have

\displaystyle\phi \displaystyle\leq|x^{(1)}-x^{(2)}|
\displaystyle\leq|x^{(1)}|+|x^{(2)}|
\displaystyle\leq 2.

Let ψ\psi, x(0),x(n+1)x^{(0)},x^{(n+1)} be defined as in Proposition 15. Fix β{0,1}\beta\in\{0,1\}. If k{2,3,,n}k\in\{2,3,\cdots,n\}, then by Proposition 15 and the assumption that |x(j)|1|x^{(j)}|\leq 1 for j[n]j\in[n],

μ(𝒩k,β([1,1]×[1,1]))\displaystyle\mu(\mathcal{N}_{k,\beta}\cap([-1,1]\times[-1,1])) =ψ(x(k))ψ(x(k1))\displaystyle=\psi(x^{(k)})-\psi(x^{(k-1)})
=x(k)x(k1)2\displaystyle=\frac{x^{(k)}-x^{(k-1)}}{2}
ϕ2\displaystyle\geq\frac{\phi}{2}
ϕ4.\displaystyle\geq\frac{\phi}{4}.

If k=1k=1, then

μ(𝒩k,β([1,1]×[1,1]))\displaystyle\mu(\mathcal{N}_{k,\beta}\cap([-1,1]\times[-1,1])) =ψ(x(1))ψ(x(0))\displaystyle=\psi(x^{(1)})-\psi(x^{(0)})
=1+x(1)2\displaystyle=1+\frac{x^{(1)}}{2}
12\displaystyle\geq\frac{1}{2}
ϕ4.\displaystyle\geq\frac{\phi}{4}.

If k=n+1k=n+1, then

μ(𝒩k,β([1,1]×[1,1]))\displaystyle\mu(\mathcal{N}_{k,\beta}\cap([-1,1]\times[-1,1])) =ψ(x(n+1))ψ(x(n))\displaystyle=\psi(x^{(n+1)})-\psi(x^{(n)})
=1x(n)2\displaystyle=1-\frac{x^{(n)}}{2}
12\displaystyle\geq\frac{1}{2}
ϕ4.\displaystyle\geq\frac{\phi}{4}.

Hence, for all k[n+1]k\in[n+1] and β{0,1}\beta\in\{0,1\}, μ(𝒩k,β([1,1]×[1,1]))ϕ4\mu(\mathcal{N}_{k,\beta}\cap([-1,1]\times[-1,1]))\geq\frac{\phi}{4}.

The rows of AA are step vectors, so for each i[d1]i\in[d_{1}], there exists ki[n+1]k_{i}\in[n+1] and βi{0,1}\beta_{i}\in\{0,1\} such that

Ai,=ξ(ki,βi).A_{i,\cdot}=\xi^{(k_{i},\beta_{i})}.

Then

μ(𝒮XA([1,1]d1×[1,1]d1))\displaystyle\mu(\mathcal{S}^{A}_{X}\cap([-1,1]^{d_{1}}\times[-1,1]^{d_{1}}))
=μ({(w,b)[1,1]d1×[1,1]d1:(2Aij1)(w(i)x(j)+b(i))>0 for all i[d1],j[n]})\displaystyle=\mu(\{(w,b)\in[-1,1]^{d_{1}}\times[-1,1]^{d_{1}}:(2A_{ij}-1)(w^{(i)}x^{(j)}+b^{(i)})>0\text{ for all }i\in[d_{1}],j\in[n]\})
=i=1d1μ({(w,b)[1,1]×[1,1]:(2Aij1)(wx(j)+b)>0 for all j[n]})\displaystyle=\prod_{i=1}^{d_{1}}\mu(\{(w,b)\in[-1,1]\times[-1,1]:(2A_{ij}-1)(wx^{(j)}+b)>0\text{ for all }j\in[n]\})
=i=1d1μ({(w,b)[1,1]×[1,1]:(2ξj(ki,βi)1)(wx(j)+b)>0 for all j[n]})\displaystyle=\prod_{i=1}^{d_{1}}\mu(\{(w,b)\in[-1,1]\times[-1,1]:(2\xi^{(k_{i},\beta_{i})}_{j}-1)(wx^{(j)}+b)>0\text{ for all }j\in[n]\})
=i=1d1μ(𝒩ki,βi([1,1]×[1,1]))\displaystyle=\prod_{i=1}^{d_{1}}\mu(\mathcal{N}_{k_{i},\beta_{i}}\cap([-1,1]\times[-1,1]))
i=1d1ϕ4\displaystyle\geq\prod_{i=1}^{d_{1}}\frac{\phi}{4}
=(ϕ4)d1.\displaystyle=\left(\frac{\phi}{4}\right)^{d_{1}}. ∎

Proof of Proposition 16.

We use a probabilistic argument similar to the proof of Theorem 10. Let us choose a parameter initialization (w,b)[1,1]d1×[1,1]d1(w,b)\in[-1,1]^{d_{1}}\times[-1,1]^{d_{1}} uniformly at random. Then the ii-th row Ai,A_{i,\cdot} of the activation matrix is a random step vector. By Proposition 25, for all k[n+1]k\in[n+1] and β{0,1}\beta\in\{0,1\},

(Ai,=ξ(k,β))ϕ4.\displaystyle\mathbb{P}(A_{i,\cdot}=\xi^{(k,\beta)})\geq\frac{\phi}{4}. (8)

By the proof of Theorem 10, the Jacobian of FF is full rank on the activation region 𝒮XA\mathcal{S}^{A}_{X} when AA is a diverse matrix. That is, for all k[n]k\in[n], there exists i[d1]i\in[d_{1}] such that Ai,{ξ(k,0),ξ(k,1)}A_{i,\cdot}\in\{\xi^{(k,0)},\xi^{(k,1)}\}, and there exists i[d1]i\in[d_{1}] such that Ai,=ξ(1,1)A_{i,\cdot}=\xi^{(1,1)}. For i[d1]i\in[d_{1}], let Ci[n+1]C_{i}\in[n+1] be defined as follows. If Ai,{ξ(k,0),ξ(k,1)}A_{i,\cdot}\in\{\xi^{(k,0)},\xi^{(k,1)}\} for some k{2,3,,n}k\in\{2,3,\cdots,n\}, then we define Ci=kC_{i}=k. If Ai,=ξ(1,1)A_{i,\cdot}=\xi^{(1,1)}, then we define Ci=1C_{i}=1. Otherwise, we define Ci=n+1C_{i}=n+1. By definition, AA is diverse if and only if

[n]{Ci:i[d1]}.[n]\subseteq\{C_{i}:i\in[d_{1}]\}.

By (8), for all i[d1]i\in[d_{1}] and j[n]j\in[n],

(Ci=j)ϕ4.\displaystyle\mathbb{P}(C_{i}=j)\geq\frac{\phi}{4}.

Thus, since d14ϕlog(nϵ)d_{1}\geq\frac{4}{\phi}\log(\frac{n}{\epsilon}), by Lemma 11,

(A is diverse)\displaystyle\mathbb{P}(A\text{ is diverse}) =([n]{Ci:i[d1]})\displaystyle=\mathbb{P}([n]\subseteq\{C_{i}:i\in[d_{1}]\})
1ϵ.\displaystyle\geq 1-\epsilon.

This holds for (w,b)(w,b) selected uniformly from [1,1]d1×[1,1]d1[-1,1]^{d_{1}}\times[-1,1]^{d_{1}}. Hence, the volume of the union of regions with full rank Jacobian is at least

(1ϵ)μ([1,1]d1×[1,1]d1)\displaystyle(1-\epsilon)\mu([-1,1]^{d_{1}}\times[-1,1]^{d_{1}}) =(1ϵ)22d1.\displaystyle=(1-\epsilon)2^{2d_{1}}. ∎
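The volume bound can also be estimated by direct simulation: sample (w, b) uniformly from the parameter cube and record how often the Jacobian on the corresponding activation region has rank n. Since the Jacobian is constant on each region, this fraction is exactly the volume fraction of regions with full rank Jacobian. The numpy sketch below is our own illustration with an ad hoc one-dimensional dataset; the output weights are omitted since, being nonzero almost surely, they only rescale columns and do not affect the rank.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d1, trials = 8, 60, 2000
x = np.sort(rng.uniform(-1.0, 1.0, n))     # ad hoc one-dimensional dataset

full_rank = 0
for _ in range(trials):
    w = rng.uniform(-1.0, 1.0, d1)
    b = rng.uniform(-1.0, 1.0, d1)
    A = (np.outer(w, x) + b[:, None] > 0)          # activation matrix, shape (d1, n)
    # Row j of the Jacobian with respect to (W, b) stacks A_ij * x_j and A_ij
    # over the units i (nonzero output weights would only rescale columns).
    J = np.concatenate([A * x[None, :], A], axis=0).T.astype(float)   # (n, 2*d1)
    full_rank += int(np.linalg.matrix_rank(J) == n)

print("estimated fraction of parameter volume with full rank Jacobian:",
      full_rank / trials)
```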

F.2 Arbitrary dimension input data

Suppose that the entries of WW and vv are sampled iid from the standard normal distribution 𝒩(0,1)\mathcal{N}(0,1). We wish to show that the Jacobian of the network will be full rank with high probability. Our strategy is to consider a random process which successively adds neurons to the network, and to bound the number of neurons needed until the Jacobian of the network has full rank with high probability.

Definition 26.

For γ(0,1)\gamma\in(0,1), we say that a distribution 𝒟\mathcal{D} on {0,1}n\{0,1\}^{n} is γ\gamma-anticoncentrated if for all nonzero unu\in\mathbb{R}^{n},

a𝒟(uTa=0)1γ.\mathbb{P}_{a\sim\mathcal{D}}(u^{T}a=0)\leq 1-\gamma.
Lemma 27.

Let γ,ϵ(0,1)\gamma,\epsilon\in(0,1). Suppose that A{0,1}d×nA\in\{0,1\}^{d\times n} is a random matrix whose rows are selected iid from a distribution 𝒟\mathcal{D} on {0,1}n\{0,1\}^{n} which is γ\gamma-anticoncentrated. If

d8log(ϵ1)γ2+2nγ,d\geq\frac{8\log(\epsilon^{-1})}{\gamma^{2}}+\frac{2n}{\gamma},

then AA has rank nn with probability at least 1ϵ1-\epsilon.

Proof.

Suppose that a(1),a(2),{0,1}na^{(1)},a^{(2)},\cdots\in\{0,1\}^{n} are selected iid from 𝒟\mathcal{D}. Define the filtration (t)t(\mathcal{F}_{t})_{t\in\mathbb{N}} by letting t\mathcal{F}_{t} be the σ\sigma-algebra generated by a(1),,a(t)a^{(1)},\cdots,a^{(t)}. For tt\in\mathbb{N}, let DtD_{t} denote the dimension of the vector space spanned by a(1),,a(t)a^{(1)},\cdots,a^{(t)}, and let

Rt:=Dtγt.R_{t}:=D_{t}-\gamma t.

Let ω\omega be an outcome on which a(1),,a(t)a^{(1)},\cdots,a^{(t)} do not span n\mathbb{R}^{n}; this is an t\mathcal{F}_{t}-measurable event. On such an outcome there exists a nonzero u(ω)nu(\omega)\in\mathbb{R}^{n} such that uTa(s)=0u^{T}a^{(s)}=0 for all sts\leq t. Then

(Dt+1Dt=1t)(ω)\displaystyle\mathbb{P}(D_{t+1}-D_{t}=1\mid\mathcal{F}_{t})(\omega) =(a(t+1)span(a(1),,a(t))t)(ω)\displaystyle=\mathbb{P}(a^{(t+1)}\notin\text{span}(a^{(1)},\cdots,a^{(t)})\mid\mathcal{F}_{t})(\omega)
(uTa(t+1)0t)(ω)\displaystyle\geq\mathbb{P}(u^{T}a^{(t+1)}\neq 0\mid\mathcal{F}_{t})(\omega)
=(uTa(t+1)0)\displaystyle=\mathbb{P}(u^{T}a^{(t+1)}\neq 0)
γ,\displaystyle\geq\gamma,

where the third line follows from the independence of the a(s)a^{(s)}. Hence, for all tt\in\mathbb{N},

𝔼[1Dtn(Dt+1Dt)t]\displaystyle\mathbb{E}[1_{D_{t}\neq n}(D_{t+1}-D_{t})\mid\mathcal{F}_{t}] =1Dtn𝔼[Dt+1Dtt]\displaystyle=1_{D_{t}\neq n}\mathbb{E}[D_{t+1}-D_{t}\mid\mathcal{F}_{t}]
=1Dtn(Dt+1Dt=1t)\displaystyle=1_{D_{t}\neq n}\mathbb{P}(D_{t+1}-D_{t}=1\mid\mathcal{F}_{t}) (9)
1Dtnγ\displaystyle\geq 1_{D_{t}\neq n}\gamma

Let τ\tau be the stopping time with respect to (t)t(\mathcal{F}_{t})_{t\in\mathbb{N}} defined by

τ:=min({}{t:Dt=n}).\tau:=\min(\{\infty\}\cup\{t\in\mathbb{N}:D_{t}=n\}).

We also define the sequence (Mt)t(M_{t})_{t\in\mathbb{N}} by Mt:=Rmin(t,τ)M_{t}:=R_{\min(t,\tau)}. Then for all tt\in\mathbb{N},

𝔼[Mt+1t]\displaystyle\mathbb{E}[M_{t+1}\mid\mathcal{F}_{t}] =𝔼[Rmin(t+1,τ)t]\displaystyle=\mathbb{E}[R_{\min(t+1,\tau)}\mid\mathcal{F}_{t}]
=𝔼[1τtRτ+1τ>tRt+1t]\displaystyle=\mathbb{E}[1_{\tau\leq t}R_{\tau}+1_{\tau>t}R_{t+1}\mid\mathcal{F}_{t}]
=1τtRτ+𝔼[1τ>tRt+1t]\displaystyle=1_{\tau\leq t}R_{\tau}+\mathbb{E}[1_{\tau>t}R_{t+1}\mid\mathcal{F}_{t}]
=1τtRτ+𝔼[1Dtn(Dt+1γ(t+1))|t]\displaystyle=1_{\tau\leq t}R_{\tau}+\mathbb{E}\left[1_{D_{t}\neq n}\left(D_{t+1}-\gamma(t+1)\right)\middle|\mathcal{F}_{t}\right]
=1τtRτ1Dtnγ(t+1)+𝔼[1DtnDt+1t]\displaystyle=1_{\tau\leq t}R_{\tau}-1_{D_{t}\neq n}\gamma(t+1)+\mathbb{E}[1_{D_{t}\neq n}D_{t+1}\mid\mathcal{F}_{t}]
1τtRτ1Dtnγ(t+1)+1Dtnγ+𝔼[1DtnDtt]\displaystyle\geq 1_{\tau\leq t}R_{\tau}-1_{D_{t}\neq n}\gamma(t+1)+1_{D_{t}\neq n}\gamma+\mathbb{E}[1_{D_{t}\neq n}D_{t}\mid\mathcal{F}_{t}]
=1τtRτ1Dtnγt+1DtnDt\displaystyle=1_{\tau\leq t}R_{\tau}-1_{D_{t}\neq n}\gamma t+1_{D_{t}\neq n}D_{t}
=1τtRτ+1DtnRt\displaystyle=1_{\tau\leq t}R_{\tau}+1_{D_{t}\neq n}R_{t}
=1τtRτ+1τ>tRt\displaystyle=1_{\tau\leq t}R_{\tau}+1_{\tau>t}R_{t}
=Rmin(t,τ)\displaystyle=R_{\min(t,\tau)}
=Mt,\displaystyle=M_{t},

where we used (9) in the sixth line together with properties of the conditional expectation. Moreover, we have 𝔼|Mt|<\mathbb{E}|M_{t}|<\infty for all tt\in\mathbb{N}. Hence, the sequence (Mt)t(M_{t})_{t\in\mathbb{N}} is a submartingale with respect to the filtration (t)t(\mathcal{F}_{t})_{t\in\mathbb{N}}. We also have |Mt+1Mt|1|M_{t+1}-M_{t}|\leq 1 for all tt\in\mathbb{N}. By Azuma-Hoeffding, for all δ>0\delta>0 and tt\in\mathbb{N},

(Mtδ)\displaystyle\mathbb{P}(M_{t}\leq-\delta) (MtM1δ)\displaystyle\leq\mathbb{P}(M_{t}-M_{1}\leq-\delta)
exp(δ22(t1))\displaystyle\leq\exp\left(-\frac{\delta^{2}}{2(t-1)}\right)
exp(δ22t).\displaystyle\leq\exp\left(-\frac{\delta^{2}}{2t}\right).

So for any ϵ(0,1)\epsilon\in(0,1),

(Mt2tlog(ϵ1))ϵ.\mathbb{P}(M_{t}\leq-\sqrt{2t\log(\epsilon^{-1})})\leq\epsilon.

We also have

(Dtn1)\displaystyle\mathbb{P}(D_{t}\leq n-1) =(Dtn1,τ>t)\displaystyle=\mathbb{P}(D_{t}\leq n-1,\tau>t)
=(Rtn1γt,τ>t)\displaystyle=\mathbb{P}(R_{t}\leq n-1-\gamma t,\tau>t)
=(Mtn1γt,τ>t)\displaystyle=\mathbb{P}(M_{t}\leq n-1-\gamma t,\tau>t)
(Mtn1γt)\displaystyle\leq\mathbb{P}(M_{t}\leq n-1-\gamma t)
(Mtnγt)\displaystyle\leq\mathbb{P}(M_{t}\leq n-\gamma t)

So we have Dtn1D_{t}\leq n-1 with probability at most ϵ\epsilon when

nγt2tlog(ϵ1).\displaystyle n-\gamma t\leq-\sqrt{2t\log(\epsilon^{-1})}.

If

t8γ2log(ϵ1)+2nγ,t\geq\frac{8}{\gamma^{2}}\log(\epsilon^{-1})+\frac{2n}{\gamma},

then

γt2tlog(ϵ1)\displaystyle\gamma t-\sqrt{2t\log(\epsilon^{-1})} =γ2t+γt2t2tlog(ϵ1)\displaystyle=\frac{\gamma}{2}t+\frac{\gamma\sqrt{t}}{2}\sqrt{t}-\sqrt{2t\log(\epsilon^{-1})}
γ2(2nγ)+γt28γ2log(ϵ1)2tlog(ϵ1)\displaystyle\geq\frac{\gamma}{2}\left(\frac{2n}{\gamma}\right)+\frac{\gamma\sqrt{t}}{2}\sqrt{\frac{8}{\gamma^{2}}\log(\epsilon^{-1})}-\sqrt{2t\log(\epsilon^{-1})}
=n.\displaystyle=n.

Hence, for such values of tt, we have DtnD_{t}\geq n with probability at least 1ϵ1-\epsilon. In other words, with probability at least 1ϵ1-\epsilon, the vectors a(1),,a(t)a^{(1)},\cdots,a^{(t)} span n\mathbb{R}^{n}. We can rephrase this as follows. If A{0,1}d×nA\in\{0,1\}^{d\times n} is a random matrix whose rows are selected iid from 𝒟\mathcal{D} and

d8γ2log(ϵ1)+2nγ,d\geq\frac{8}{\gamma^{2}}\log(\epsilon^{-1})+\frac{2n}{\gamma},

then with probability at least 1ϵ1-\epsilon, AA has full rank. This proves the result. ∎
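The quantities in Lemma 27 can be illustrated by simulation. For concreteness we use the uniform distribution on {0,1}^n, which is 1/2-anticoncentrated: for any nonzero u, fixing all coordinates except one where u is nonzero leaves at most one value of the remaining coordinate with u^T a = 0. The sketch below (ours; the parameters are arbitrary) estimates the time at which iid uniform rows first span ℝ^n and compares it with the bound of the lemma, which is valid but not tight.

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials, eps = 15, 500, 0.1
gamma = 0.5     # the uniform distribution on {0,1}^n is 1/2-anticoncentrated

taus = []
for _ in range(trials):
    rows, rank, t = [], 0, 0
    while rank < n:
        t += 1
        rows.append(rng.integers(0, 2, n))
        rank = np.linalg.matrix_rank(np.array(rows))
    taus.append(t)

bound = 8.0 * np.log(1.0 / eps) / gamma**2 + 2.0 * n / gamma
print("empirical (1 - eps)-quantile of tau:", np.quantile(taus, 1 - eps))
print("bound from Lemma 27:", bound)
```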

Now we apply this lemma to study the rank of the Jacobian of the network. We rely on the observation that an input dataset XX is γ\gamma-anticoncentrated exactly when 𝒟X\mathcal{D}_{X} is γ\gamma-anticoncentrated.

Proof of Theorem 17.

We employ the strategy used in the proof of Theorem 5. The Jacobian of FF with respect to WW is given by

WF(W,v,X)=((va(1))x(1),,(va(n))x(n)),\nabla_{W}F(W,v,X)=((v\odot a^{(1)})\otimes x^{(1)},\cdots,(v\odot a^{(n)})\otimes x^{(n)}),

where a(j)a^{(j)} is the activation pattern of the jj-th data point. With probability 1, none of the entries of vv are 0. So by Lemma 4, this Jacobian is of full rank when the set

{a(i)x(i):i[n]}\{a^{(i)}\otimes x^{(i)}:i\in[n]\}

consists of linearly independent elements of d1×d0\mathbb{R}^{d_{1}\times d_{0}}. We may partition [n][n] into d0d_{0} subsets S1,,Sd0S_{1},\cdots,S_{d_{0}} such that |Sk|nd0nd0+1|S_{k}|\leq\left\lceil\frac{n}{d_{0}}\right\rceil\leq\frac{n}{d_{0}}+1 for all k[d0]k\in[d_{0}] and partition the columns of AA accordingly into blocks (a(s))sSk(a^{(s)})_{s\in S_{k}} for each k[d0]k\in[d_{0}]. Let MkM_{k} denote the d1×|Sk|d_{1}\times|S_{k}| matrix whose columns are the a(s)a^{(s)}, sSks\in S_{k}. Then the rows of MkM_{k} are the activation patterns of individual neurons of the network restricted to the data points in SkS_{k}, so they are iid and distributed according to the corresponding marginal of 𝒟X\mathcal{D}_{X}. This marginal is again γ\gamma-anticoncentrated, since any nonzero u|Sk|u\in\mathbb{R}^{|S_{k}|} extends by zeros to a nonzero vector in n\mathbb{R}^{n}. Since the rows of MkM_{k} follow a γ\gamma-anticoncentrated distribution and

d1\displaystyle d_{1} 8γ2log(d0ϵ)+2γ(nd0+1)\displaystyle\geq\frac{8}{\gamma^{2}}\log\left(\frac{d_{0}}{\epsilon}\right)+\frac{2}{\gamma}\left(\frac{n}{d_{0}}+1\right)
8γ2log(d0ϵ)+2|Sk|γ,\displaystyle\geq\frac{8}{\gamma^{2}}\log\left(\frac{d_{0}}{\epsilon}\right)+\frac{2|S_{k}|}{\gamma},

we have by Lemma 27 that with probability at least 1ϵd01-\frac{\epsilon}{d_{0}}, MkM_{k} has rank |Sk||S_{k}|. So by a union bound, with probability at least 1ϵ1-\epsilon, all of the MkM_{k} have full rank. This means that for each kk,

rank({a(i):iSk})=|Sk|.\operatorname{rank}(\{a^{(i)}:i\in S_{k}\})=|S_{k}|.

Then by Lemma 18,

rank({a(i)x(i):i[n]})=n,\operatorname{rank}(\{a^{(i)}\otimes x^{(i)}:i\in[n]\})=n,

so the Jacobian has rank nn. ∎

Appendix G Details on the experiments

We provide details on the experiments presented in Section 7. In addition, we provide experiments evaluating the number of activation regions that contain global optima, illustrating Theorem 12. Experiments were implemented in Python using PyTorch (Paszke et al., 2019), numpy (Harris et al., 2020), and mpi4py (Dalcin et al., 2011). The plots were created using Matplotlib (Hunter, 2007). The experiments in Section  G.1 were run on the CPU of a MacBook Pro with an M2 chip and 8GB RAM. The experiments in Section G.2 were run on a CPU cluster that uses Intel Xeon IceLake-SP processors (Platinum 8360Y) with 72 cores per node and 256 GB RAM. The computer implementation of the scripts needed to reproduce our experiments can be found at https://anonymous.4open.science/r/loss-landscape-4271.

G.1 Non-empty activation regions with full rank Jacobian

We sample a dataset Xd0×nX\in\mathbb{R}^{d_{0}\times n} whose entries are sampled iid Gaussian with mean 0 and variance 1. We use a two-layer network with weights Wd1×d0W\in\mathbb{R}^{d_{1}\times d_{0}} and biases bd1b\in\mathbb{R}^{d_{1}} initialized iid uniformly on [1d1,1d1]\left[-\frac{1}{\sqrt{d_{1}}},\frac{1}{\sqrt{d_{1}}}\right]. We choose a random activation region by computing the activation pattern of the network with parameters (W,b)(W,b) on the dataset XX. Then we compute the Jacobian of FF on this activation region and record whether or not it is of full rank. We repeat this 100 times to estimate the probability of the Jacobian being of full rank for a randomly sampled activation region. We calculate this probability for various values of nn, d0d_{0}, and d1d_{1}. The results are reported in Figure 1 and Figure 2.
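A stripped-down version of this experiment might look as follows. This is our own sketch, not the released code linked above; the output weights v are a placeholder choice (only their nonzero-ness matters for the rank), and the Jacobian is assembled explicitly from the activation pattern rather than via automatic differentiation.

```python
import torch

def full_rank_fraction(n, d0, d1, trials=100, seed=0):
    torch.manual_seed(seed)
    count = 0
    for _ in range(trials):
        X = torch.randn(d0, n)                          # iid standard Gaussian data
        W = (torch.rand(d1, d0) * 2 - 1) / d1**0.5      # uniform on [-1/sqrt(d1), 1/sqrt(d1)]
        b = (torch.rand(d1) * 2 - 1) / d1**0.5
        v = torch.randn(d1)                             # placeholder output weights (nonzero a.s.)
        A = (W @ X + b[:, None] > 0).float()            # activation pattern, shape (d1, n)
        # Row j of the Jacobian with respect to (W, b): ((v * a^(j)) outer x^(j), v * a^(j)).
        rows = []
        for j in range(n):
            g_W = torch.outer(v * A[:, j], X[:, j]).reshape(-1)
            g_b = v * A[:, j]
            rows.append(torch.cat([g_W, g_b]))
        J = torch.stack(rows)                           # shape (n, d1 * (d0 + 1))
        count += int(torch.linalg.matrix_rank(J) == n)
    return count / trials

print(full_rank_fraction(n=20, d0=2, d1=15))
```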

G.2 Non-empty activation regions with global minimizers

We sample data X,yX,y as follows. The input data XX is generated as independent samples from a uniform distribution on the cube [1,1]d0[-1,1]^{d_{0}}. We consider three types of samples for the labels yy (a code sketch of the three generators is given after the list):

  • Polynomial: We construct a polynomial of degree 2 with the coefficients sampled from a uniform distribution on the interval [1,1][-1,1]. The labels are then the evaluations of this polynomial at the points from XX.

  • Teacher: We construct a teacher network with an identical architecture to the main network used in the experiment. Then we initialize it with the same procedure as described below for the main network, which acts as a student network in this setup, and we take the outputs of the teacher network on XX as the labels.

  • Random output: We sample labels from a uniform distribution on the interval [1,1][-1,1].
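The following numpy sketch (ours, with placeholder sizes; the released code linked above is the reference implementation) illustrates the three label generators for a small input dimension, using the uniform-He initialization and alternating output weights described in the next paragraph for the teacher network.

```python
from itertools import combinations_with_replacement
import numpy as np

rng = np.random.default_rng(0)
n, d0, d1 = 50, 2, 10
X = rng.uniform(-1.0, 1.0, (d0, n))                 # inputs, columns are data points

# 1) Polynomial labels: a degree-2 polynomial with coefficients uniform on [-1, 1].
monomials = [np.ones(n)] + [X[i] for i in range(d0)] + \
            [X[i] * X[j] for i, j in combinations_with_replacement(range(d0), 2)]
coeffs = rng.uniform(-1.0, 1.0, len(monomials))
y_poly = coeffs @ np.array(monomials)

# 2) Teacher labels: a random ReLU teacher with the same architecture,
#    uniform-He hidden weights and biases, alternating +1/-1 output weights.
def uniform_he(fan_in, shape):
    lim = np.sqrt(6.0 / fan_in)
    return rng.uniform(-lim, lim, shape)

W_t, b_t = uniform_he(d0, (d1, d0)), uniform_he(d0, d1)
v_t = np.array([(-1.0) ** i for i in range(d1)])    # [1, -1, 1, -1, ...]
y_teacher = v_t @ np.maximum(W_t @ X + b_t[:, None], 0.0)

# 3) Random labels: uniform on [-1, 1].
y_random = rng.uniform(-1.0, 1.0, n)
```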

We sample activation regions as follows. For each experiment trial, we construct a ReLU fully-connected network with d0d_{0} inputs, d1d_{1} hidden units, and 1 output. Weights and biases of the hidden units are sampled iid from the uniform distribution on the interval [6/fan-in,6/fan-in][-\sqrt{6/\text{fan-in}},\sqrt{6/\text{fan-in}}] according to the uniform-He initialization (He et al., 2015). The weights of the output layer are initialized with alternating values 1 and -1, that is, [1,1,1,1,][1,-1,1,-1,\ldots]. Additionally, we generate a new dataset X,yX,y for each experiment trial. Afterward, we consider the activation pattern AA of this network on the input data XX.

For a given dataset (X,y)(X,y) and activation pattern AA, we look for a zero-loss global minimizer of the constrained linear regression problem minθ(d0+1)d112X~θyT22\min_{\theta\in\mathbb{R}^{(d_{0}+1)d_{1}}}\frac{1}{2}\|\tilde{X}\theta-y^{T}\|_{2}^{2} subject to θ𝒮XA\theta\in\mathcal{S}_{X}^{A}, where θ(d0+1)d1\theta\in\mathbb{R}^{(d_{0}+1)d_{1}} is the flattened matrix of the first layer weights and X~n×(d0+1)d1\tilde{X}\in\mathbb{R}^{n\times(d_{0}+1)d_{1}} is a matrix with entries X~jk=vkmodd1Aj(kmodd1)Xjk/d1\tilde{X}_{jk}=v_{k\ {\operatorname{mod}}\ d_{1}}A_{j{(k\ {\operatorname{mod}}\ d_{1})}}X_{j{\lfloor k/d_{1}\rfloor}}, y1×ny\in\mathbb{R}^{1\times n}. Here we appended the first layer biases to the weight rows and appended 1 to each network input. Following the descriptions given in Section 3, the constraint θ𝒮XA\theta\in\mathcal{S}_{X}^{A} is a system of linear inequalities. Thus, this linear regression problem corresponds to the following quadratic program:

minθ(d0+1)d112θTPθ+qTθ,where P=X~TX~,q=X~TyT\displaystyle\min_{\theta\in\mathbb{R}^{(d_{0}+1)d_{1}}}\frac{1}{2}\theta^{T}P\theta+q^{T}\theta,\quad\text{where }P=\tilde{X}^{T}\tilde{X},\;q=-\tilde{X}^{T}y^{T},
subject to(2Aij1)w(i),x(j)>0for all i[d1],j[n].\displaystyle\text{subject to}\quad(2A_{ij}-1)\langle w^{(i)},x^{(j)}\rangle>0\quad\text{for all }i\in[d_{1}],j\in[n].
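This program can be set up with an off-the-shelf solver. The cvxpy sketch below (our own illustration, not the released code) states an equivalent formulation directly in terms of the first-layer weight matrix rather than the flattened θ; the strict inequalities are closed up with a small ad hoc margin, and the data, width, output weights, and tolerances are placeholder choices.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
d0, d1, n = 1, 40, 10                                  # placeholder sizes
X = rng.uniform(-1.0, 1.0, (d0, n))
y = rng.uniform(-1.0, 1.0, n)                          # random labels, as one example
v = np.array([(-1.0) ** i for i in range(d1)])         # alternating output weights
Xaug = np.vstack([X, np.ones((1, n))])                 # append 1 to each input (bias)

W0 = rng.uniform(-1.0, 1.0, (d1, d0 + 1))              # sample a region from a random init
A = (W0 @ Xaug > 0).astype(float)                      # activation pattern of that region

Wvar = cp.Variable((d1, d0 + 1))                       # first-layer weights and biases
pred = cp.sum(cp.multiply(v[:, None] * A, Wvar @ Xaug), axis=0)
margin = 1e-6                                          # strict inequalities closed up
constraints = [cp.multiply(2 * A - 1, Wvar @ Xaug) >= margin]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(pred - y)), constraints)
prob.solve()

print(prob.status, prob.value)
print("region contains a (numerically) zero-loss minimizer:",
      prob.status == cp.OPTIMAL and prob.value is not None and prob.value < 1e-8)
```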

We record the percentage of sampled regions with a zero-loss minimizer. This is reported in Figures 6, 7, 8 for the three different types of data and d0=1,2,5d_{0}=1,2,5. The result is consistent with Theorem 12 in that, for d0=1d_{0}=1, the minimal width d1d_{1} needed for most regions to have a global minimum is close to the scaling d1nlognd_{1}\sim n\log n predicted by the theorem. We believe that this dependence will be relatively precise for most datasets. However, for specific data distributions, the probability of being in a region with a global minimum might be larger. In Figure 6, we see that, for instance, for data on a parabola, there are more interpolating activation regions than for data from a teacher network or random data. For higher input dimensions, we observe in Figures 7 and 8 that most of the regions contain a global minimum, which is consistent with Theorem 5 and Corollary 7, by which any differentiable critical point in most non-empty activation regions is a global minimum.

Our figures are reminiscent of figures in the work of Oymak & Soltanolkotabi (2019). They showed for shallow ReLU networks that if the number of hidden units is large enough, d1d0Cn2/d0\sqrt{d_{1}d_{0}}\geq Cn^{2}/d_{0}, then gradient descent from random initialization converges to a global optimum. Empirically they observed a phase transition at roughly (but not exactly) d0d1=nd_{0}d_{1}=n for the probability that gradient descent from a random initialization successfully fits nn random labels. Note that, in contrast, we are recording the number of regions that contain global optima.

Figure 6 panels: (a) Polynomial data, degree 1. (b) Polynomial data, degree 2. (c) Polynomial data, degree 10. (d) Polynomial data, degree 100. (e) Teacher-student setup. (f) Random output.
Figure 6: Percentage of randomly sampled activation regions that contain a global minimum of the loss for the networks with input dimension d_{0}=1, depending on the size n of the dataset and the width d_{1} of the hidden layer. The results are based on 140 random samples of the activation region for each fixed value of n, d_{1}. The target data is the same for each random network initialization for the same combination of n and d_{1}. The black dashed line corresponds to the lower bound on d_{1} estimated for a given n and \epsilon=0.1 based on the condition on the number of the negative and positive weights in the last network layer from Theorem 12. Precisely, it represents the function d_{1}=4n\log(2n/\epsilon).
Figure 7 panels: (a) Polynomial data, degree 2. (b) Teacher-student setup. (c) Random output.
Figure 7: Percentage of randomly sampled activation regions that contain a global minimum of the loss for the networks with input dimension d_{0}=2, depending on the size n of the dataset and the width d_{1} of the hidden layer. The results are based on 100 random samples of the activation region for each fixed value of n, d_{1}. The target data is the same for each random network initialization for the same combination of n and d_{1}.
Figure 8 panels: (a) Polynomial data, degree 2. (b) Teacher-student setup. (c) Random output.
Figure 8: Percentage of randomly sampled activation regions that contain a global minimum of the loss for the networks with input dimension d_{0}=5, depending on the size n of the dataset and the width d_{1} of the hidden layer. The results are based on 100 random samples of the activation region for each fixed value of n, d_{1}. The target data is the same for each random network initialization for the same combination of n and d_{1}.
Figure 9 panels: (a) (%) Full rank Jacobian at initialization. (b) (%) Convergence of GD to global minimizer.
Figure 9: Classification task on MNIST to predict one hot binary class vectors. These heatmaps show percentages out of 100 trials of networks trained with GD from a random Gaussian initialization which have a full rank Jacobian at initialization (a) and achieve a cross-entropy loss of at most 10^{-2} after 1000 epochs (b). The number of network parameters matches the training set size n when the width satisfies d_{1}=n/d_{0}, where for MNIST the input dimension is d_{0}=784. As d_{0} is large our results predict that there should exist some linear scaling of the network width d_{1} and the data size n such that in all but a very small fraction of regions every critical point is a global minimum.