
Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent

Surbhi Goel Department of Computer Science, University of Texas at Austin Aravind Gollakota Department of Computer Science, University of Texas at Austin Zhihan Jin Department of Computer Science, Shanghai Jiao Tong University Sushrut Karmalkar Department of Computer Science, University of Texas at Austin Adam Klivans Department of Computer Science, University of Texas at Austin
(June 22, 2020)
Abstract

We prove the first superpolynomial lower bounds for learning one-layer neural networks with respect to the Gaussian distribution using gradient descent. We show that any classifier trained using gradient descent with respect to square-loss will fail to achieve small test error in polynomial time given access to samples labeled by a one-layer neural network. For classification, we give a stronger result, namely that any statistical query (SQ) algorithm (including gradient descent) will fail to achieve small test error in polynomial time. Prior work held only for gradient descent run with small batch sizes, required sharp activations, and applied to specific classes of queries. Our lower bounds hold for broad classes of activations including ReLU and sigmoid. The core of our result relies on a novel construction of a simple family of neural networks that are exactly orthogonal with respect to all spherically symmetric distributions.

1 Introduction

A major challenge in the theory of deep learning is to understand when gradient descent can efficiently learn simple families of neural networks. The associated optimization problem is nonconvex and well known to be computationally intractable in the worst case. For example, ciphertexts from public-key cryptosystems can be encoded into a training set labeled by simple neural networks [KS09], implying that the corresponding learning problem is as hard as breaking cryptographic primitives. These hardness results, however, rely on discrete representations and produce relatively unrealistic joint distributions.

Our Results.

In this paper we give the first superpolynomial lower bounds for learning neural networks using gradient descent in arguably the simplest possible setting: we assume the marginal distribution is a spherical Gaussian, the labels are noiseless and are exactly equal to the output of a one-layer neural network (a linear combination of say ReLU or sigmoid activations), and the goal is to output a classifier whose test error (measured by square-loss) is small. We prove—unconditionally—that gradient descent fails to produce a classifier with small square-loss if it is required to run in polynomial time in the dimension. Our lower bound depends only on the algorithm used (gradient descent) and not on the architecture of the underlying classifier. That is, our results imply that current popular heuristics such as running gradient descent on an overparameterized network (for example, working in the NTK regime [JHG18]) will require superpolynomial time to achieve small test error.

Statistical Queries.

We prove our lower bounds in the now well-studied statistical query (SQ) model of [Kea98] that captures most learning algorithms used in practice. For a loss function \ell and a hypothesis hθh_{\theta} parameterized by θ\theta, the true population loss with respect to joint distribution DD on X×YX\times Y is given by 𝔼(x,y)D[(hθ(x),y)]\mathbb{E}_{(x,y)\sim D}[\ell(h_{\theta}(x),y)], and the gradient with respect to θ\theta is given by 𝔼(x,y)D[(hθ(x),y)θhθ(x)]\mathbb{E}_{(x,y)\sim D}[\ell^{\prime}(h_{\theta}(x),y)\nabla_{\theta}h_{\theta}(x)]. In the SQ model, we specify a query function ϕ(x,y)\phi(x,y) and receive an estimate of |𝔼(x,y)D[ϕ(x,y)]||\mathbb{E}_{(x,y)\sim D}[\phi(x,y)]| to within some tolerance parameter τ\tau. An important special class of queries are correlational or inner-product queries, where the query function ϕ\phi is defined only on XX and we receive an estimate of |𝔼(x,y)D[ϕ(x)y]||\mathbb{E}_{(x,y)\sim D}[\phi(x)\cdot y]| within some tolerance τ\tau. It is not difficult to see that (1) the gradient of a population loss can be approximated to within τ\tau using statistical queries of tolerance τ\tau and (2) for square-loss only inner-product queries are required.

Since the convergence analysis of gradient descent holds given sufficiently strong approximations of the gradient, lower bounds for learning in the SQ model [Kea98, BFJ+94, Szö09, Fel12, Fel17] directly imply unconditional lower bounds on the running time for gradient descent to achieve small error. We give the first superpolynomial lower bounds for learning one-layer networks with respect to any Gaussian distribution for any SQ algorithm that uses inner product queries:

Theorem 1.1 (informal).

Let 𝒞{\mathcal{C}} be a class of real-valued concepts defined by one-layer single-output neural networks with input dimension nn and mm hidden units (ReLU or sigmoid); i.e., functions of the form f(x)=i=1maiσ(wix)f(x)=\sum_{i=1}^{m}a_{i}\sigma(w_{i}\cdot x). Then learning 𝒞{\mathcal{C}} under the standard Gaussian 𝒩(0,In)\mathcal{N}(0,I_{n}) in the SQ model with inner-product queries requires nΩ(logm)n^{\Omega(\log m)} queries for any tolerance τ=nΩ(logm)\tau=n^{-\Omega(\log m)}.

In particular, this rules out any approach for learning one-layer neural networks in polynomial time that performs gradient descent on any polynomial-size classifier with respect to square-loss or logistic loss. For classification, we obtain significantly stronger results and rule out general SQ algorithms that run in polynomial time (e.g., gradient descent with respect to any polynomial-size classifier and any polynomial-time computable loss). In this setting, our labels are {±1}\{\pm 1\} and correspond to the softmax of an unknown one-layer neural network. We prove the following:

Theorem 1.2 (informal).

Let 𝒞{\mathcal{C}} be a class of real-valued concepts defined by a one-layer neural network in nn dimensions with mm hidden units (ReLU or sigmoid) feeding into any odd, real-valued output node with range [1,1][-1,1]. Let DD^{\prime} be a distribution on n×{±1}{\mathbb{R}}^{n}\times\{\pm 1\} such that the marginal on n{\mathbb{R}}^{n} is the standard Gaussian 𝒩(0,In)\mathcal{N}(0,I_{n}), and 𝔼[Y|X]=c(X)\mathbb{E}[Y|X]=c(X) for some c𝒞c\in{\mathcal{C}}. For some b,C>0b,C>0 and ϵ=Cmb\epsilon=Cm^{-b}, outputting a classifier f:n{±1}f:{\mathbb{R}}^{n}\to\{\pm 1\} with (X,Y)D[f(X)Y]1/2ϵ\mathbb{P}_{(X,Y)\sim D^{\prime}}[f(X)\neq Y]\leq 1/2-\epsilon requires nΩ(logm)n^{\Omega(\log m)} statistical queries of tolerance nΩ(logm).n^{-\Omega(\log m)}.

The above lower bound for classification rules out the commonly used approach of training a polynomial-size, real-valued neural network using gradient descent (with respect to any polynomial-time computable loss) and then taking the sign of the output of the resulting network.

Our techniques.

At the core of all SQ lower bounds is the construction of a family of functions that are pairwise approximately orthogonal with respect to the underlying marginal distribution. Typically, these constructions embed 2n2^{n} parity functions over the discrete hypercube {1,1}n\{-1,1\}^{n}. Since parity functions are perfectly orthogonal, the resulting lower bound can be quite strong. Here we wish to give lower bounds for more natural families of distributions, namely Gaussians, and it is unclear how to embed parity.

Instead, we use an alternate construction. For activation functions ϕ,ψ:\phi,\psi:{\mathbb{R}}\rightarrow{\mathbb{R}}, define

fS(x)=ψ(w{1,1}kχ(w)ϕ(wxSk)).\displaystyle f_{S}(x)=\psi\left(\sum_{w\in\{-1,1\}^{k}}\chi(w)\phi\left(\frac{w~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\right).

Enumerating over every S[n]S\subseteq[n] of size kk gives a family of functions of size nO(k)n^{O(k)}. Here xSx_{S} denotes the vector of xix_{i} for iSi\in S (typically we choose k=logmk=\log m to produce a family of one-layer neural networks with mm hidden units). Each of the 2k=m2^{k}=m inner weight vectors has unit norm, and each of the mm outer weights has absolute value one. Note also that our construction uses activations with zero bias term.

We give a complete characterization of the class of nonlinear activations for which these functions are orthogonal. In particular, the family is orthogonal for any activation with a nonzero Hermite coefficient of degree kk or higher.

Apart from showing orthogonality, we must also prove that functions in these classes are nontrivial (i.e., are not exponentially close to the constant zero function). This reduces to proving certain lower bounds on the norms of one-layer neural networks. The analysis requires tools from Hermite and complex analysis.

SQ Lower Bounds for Real-Valued Functions.

Another major challenge is that our function family is real-valued as opposed to boolean. Given an orthogonal family of (deterministic) boolean functions, it is straightforward to apply known results and obtain general SQ lower bounds for learning with respect to 0/10/1 loss. For the case of real-valued functions, the situation is considerably more complicated. For example, the class of orthogonal Hermite polynomials on nn variables of degree dd has size nO(d)n^{O(d)}, yet there is an SQ algorithm due to [APVZ14] that learns this class with respect to the Gaussian distribution in time 2O(d)2^{O(d)}. More recent work due to [ADHV19] shows that Hermite polynomials can be learned by an SQ algorithm in time polynomial in nn and logd\log d.

As such, it is impossible to rule out general polynomial-time SQ algorithms for learning real-valued functions based solely on orthogonal function families. Fortunately, it is not difficult to see that the SQ reductions due to [Szö09] hold in the real-valued setting as long as the learning algorithm uses only inner-product queries (and the norms of the functions are sufficiently large). Since performing gradient descent with respect to square-loss or logistic loss can be implemented using inner-product queries, we obtain our first set of desired results. (The algorithms of [APVZ14] and [ADHV19] do not use inner-product queries.)

Still, we would like to rule out general SQ algorithms for learning simple classes of neural networks. To that end, we consider the classification problem for one-layer neural networks and output labels after performing a softmax on a one-layer network. Concretely, consider a distribution on n×{1,1}{\mathbb{R}}^{n}\times\{-1,1\} where 𝔼[Y|X]=σ(c(X))\mathbb{E}[Y|X]=\sigma(c(X)) for some c𝒞c\in{\mathcal{C}} and σ:[1,1]\sigma:{\mathbb{R}}\to[-1,1] (for example, σ\sigma could be tanh). We describe two goals. The first is to estimate the conditional mean function, i.e., output a classifier hh such that 𝔼[(h(x)c(x))2]ϵ\mathbb{E}[(h(x)-c(x))^{2}]\leq\epsilon. The second is to directly minimize classification loss, i.e., output a boolean classifier hh such that X,YD[h(X)Y]1/2ϵ.\mathbb{P}_{X,Y\sim D}[h(X)\neq Y]\leq 1/2-\epsilon.

We give superpolynomial lower bounds for both of these problems in the general SQ model by making a new connection to probabilistic concepts, a learning model due to [KS94]. Our key theorem gives a superpolynomial SQ lower bound for the problem of distinguishing probabilistic concepts induced by our one-layer neural networks from truly random labels. A final complication we overcome is that we must prove orthogonality and norm bounds on one-layer neural networks that have been composed with a nonlinear activation (e.g., tanh).

SGD and Gradient Descent Plus Noise.

It is easy to see that our results also imply lower bounds for algorithms where the learner adds noise to the estimate of the gradient (e.g., Langevin dynamics). On the other hand, for technical reasons, it is known that SGD is not a statistical query algorithm (because it examines training points individually) and does not fall into our framework. That said, recent work by [AS20] shows that SGD is universal in the sense that it can encode all polynomial-time learners. This implies that proving unconditional lower bounds for SGD would give a proof that P ≠ NP. Thus, we cannot hope to prove unconditional lower bounds on SGD (unless we can prove P ≠ NP).

Independent Work.

Independently, Diakonikolas et al. [DKKZ20] have given stronger correlational SQ lower bounds for the same class of functions with respect to the Gaussian distribution. Their bounds are exponential in the number of hidden units while ours are quasipolynomial. We can plug in their result and obtain exponential general SQ lower bounds for the associated probabilistic concept using our framework.

Related Work.

There is a large literature of results proving hardness results (or unconditional lower bounds in some cases) for learning various classes of neural networks [BR89, Vu98, KS09, LSSS14, GKKT17].

The most relevant prior work is due to [SVWX17], who addressed learning one-layer neural networks under logconcave distributions using Lipschitz queries. Specifically, let nn be the input dimension, and let mm be the number of hidden ss-Lipschitz sigmoid units. For m=O~(sn)m=\tilde{O}(s\sqrt{n}), they construct a family of neural networks such that any learner using λ\lambda-Lipschitz queries with tolerance greater than Ω(1/(s2n))\Omega(1/(s^{2}n)) needs at least 2Ω(n)/(λs2)2^{\Omega(n)}/(\lambda s^{2}) queries.

Roughly speaking, their lower bounds hold for λ\lambda-Lipschitz queries due to the composition of their one-layer neural networks with a δ\delta-function in order to make the family more “boolean.” Because of their restriction on the tolerance parameter, they cannot rule out gradient descent with large batch sizes. Further, the slope of the activations they require in their constructions scales inversely with the Lipschitz and tolerance parameters.

To contrast with [SVWX17], note that our lower bounds hold for any inverse-polynomial tolerance parameter (i.e., will hold for polynomially-large batch sizes), do not require a Lipschitz constraint on the queries, and use only standard 11-Lipschitz ReLU and/or sigmoid activations (with zero bias) for the construction of the hard family. Our lower bounds are typically quasipolynomial in the number of hidden units; improving this to an exponential lower bound is an interesting open question. Both of our models capture square-loss and logistic loss.

In terms of techniques, [SVWX17] build an orthogonal function family using univariate, periodic “wave” functions. Our construction takes a different approach, adding and subtracting activation functions with respect to overlapping “masks.” Finally, aside from the (black-box) use of a theorem from complex analysis, our construction and analysis are considerably simpler than the proof in [SVWX17].

A follow-up work [VW19] gave SQ lower bounds for learning classes of degree dd orthogonal polynomials in nn variables with respect to the uniform distribution on the unit sphere (as opposed to Gaussians) using inner product queries of bounded tolerance (roughly 1/nd1/n^{d}). To obtain superpolynomial lower bounds, each function in the family requires superpolynomial description length (their polynomials also take on very small values, 1/nd1/n^{d}, with high probability).

Shamir [Sha18] (see also the related work of [SSSS17]) proves hardness results (and lower bounds) for learning neural networks using gradient descent with respect to square-loss. His results are separated into two categories: (1) hardness for learning “natural” target families (one-layer ReLU networks) and (2) lower bounds for “natural” input distributions (Gaussians). We achieve lower bounds for learning problems with both natural target families and natural input distributions. Additionally, our lower bounds hold for any nonlinear activations (as opposed to just ReLUs) and for broader classes of algorithms (SQ).

Recent work due to [GKK19] gives hardness results for learning a ReLU with respect to Gaussian distributions. Their results require the learner to output a single ReLU as its output hypothesis and require the learner to succeed in the agnostic model of learning. [KK14] prove hardness results for learning a threshold function with respect to Gaussian distributions, but they also require the learner to succeed in the agnostic model. Very recent work due to Daniely and Vardi [DV20] gives hardness results for learning randomly chosen two-layer networks. The hard distributions in their case are not Gaussians, and they require a nonlinear clipping output activation.

Positive Results. Many recent works give algorithms for learning one-layer ReLU networks using gradient descent with respect to Gaussians under various assumptions [ZSJ+17, ZPS17, BG17, ZYWG19] or use tensor methods [JSA15, GLM18]. These results depend on the hidden weight vectors being sufficiently orthogonal, or the coefficients in the second layer being positive, or both. Our lower bounds explain why these types of assumptions are necessary.

2 Preliminaries

We use [n][n] to denote the set {1,,n}\{1,\dots,n\}, and SkTS\subseteq_{k}T to indicate that SS is a kk-element subset of TT. We denote Euclidean inner products between vectors uu and vv by uvu~{}{\cdot}~{}v. We denote the element-wise product of vectors uu and vv by uvu\circ v, that is, uvu\circ v is the vector (u1v1,,unvn)(u_{1}v_{1},\dots,u_{n}v_{n}).

Let XX be an arbitrary domain, and let DD be a distribution on XX. Given two functions f,g:Xf,g:X\to{\mathbb{R}}, we define their L2L_{2} inner product with respect to DD to be f,gD=𝔼D[fg]\langle f,g\rangle_{D}=\mathbb{E}_{D}[fg]. The corresponding L2L_{2} norm is given by fD=f,fD=𝔼D[f2]\|f\|_{D}=\sqrt{\langle f,f\rangle_{D}}=\sqrt{\mathbb{E}_{D}[f^{2}]}.

A real-valued concept on n{\mathbb{R}}^{n} is a function c:nc:{\mathbb{R}}^{n}\to{\mathbb{R}}. We denote the induced labeled distribution on n×{\mathbb{R}}^{n}\times{\mathbb{R}}, i.e. the distribution of (x,c(x))(x,c(x)) for xDx\sim D, by DcD_{c}. A probabilistic concept, or pp-concept, on XX is a concept that maps each point xx to a random {±1}\{\pm 1\}-valued label in such a way that 𝔼[Y|X]=c(X)\mathbb{E}[Y|X]=c(X) for a fixed function c:n[1,1]c:{\mathbb{R}}^{n}\to[-1,1], known as the conditional mean function. Given a distribution DD on the domain, we abuse DcD_{c} to denote the induced labeled distribution on X×{±1}X\times\{\pm 1\} such that the marginal distribution on n{\mathbb{R}}^{n} is DD and 𝔼[Y|X]=c(X)\mathbb{E}[Y|X]=c(X) (equivalently the label is +1+1 with probability 1+c(x)2\frac{1+c(x)}{2} and 1-1 otherwise).
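For concreteness, the following is a small illustrative sketch (not from the paper) of how labeled examples from D_c can be generated when c is a p-concept; the example target c(x) = tanh(x_1) and the helper names are assumptions made purely for illustration.

```python
import numpy as np

def sample_Dc(c, sample_D, num_samples, rng=np.random.default_rng(0)):
    """Draw (x, y) pairs from D_c for a p-concept with conditional mean c:
    x ~ D, then y = +1 with probability (1 + c(x)) / 2 and y = -1 otherwise,
    so that E[y | x] = c(x)."""
    X = sample_D(num_samples)          # assumed to return an array of shape (num_samples, n)
    y = np.where(rng.random(num_samples) < (1.0 + c(X)) / 2.0, 1.0, -1.0)
    return X, y

# Illustrative example: D = N(0, I_3) and c(x) = tanh(x_1).
X, y = sample_Dc(lambda X: np.tanh(X[:, 0]),
                 lambda m: np.random.default_rng(1).standard_normal((m, 3)),
                 num_samples=5)
```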

The SQ model

A statistical query is specified by a query function h:d×Yh:{\mathbb{R}}^{d}\times Y\to{\mathbb{R}}. The SQ model allows access to an SQ oracle that accepts a query hh of specified tolerance τ\tau, and responds with a value in [𝔼x,y[h(x,y)]τ,𝔼x,y[h(x,y)]+τ][\mathbb{E}_{x,y}[h(x,y)]-\tau,\mathbb{E}_{x,y}[h(x,y)]+\tau]. (In the SQ literature, this is referred to as the STAT oracle. A variant called VSTAT is also sometimes used and is known to be equivalent up to small polynomial factors [Fel17]; while it makes no substantive difference to our superpolynomial lower bounds, our arguments can be extended to VSTAT as well.) To disallow arbitrary scaling, we will require that for each yy, the function xh(x,y)x\mapsto h(x,y) has norm at most 1. In the real-valued setting, a query hh is called a correlational or inner product query if it is of the form h(x,y)=g(x)yh(x,y)=g(x)\cdot y for some function gg, so that 𝔼Dc[h]=𝔼D[gc]=g,cD\mathbb{E}_{D_{c}}[h]=\mathbb{E}_{D}[gc]=\langle g,c\rangle_{D}. Here we assume g1\|g\|\leq 1 when stating lower bounds, again to disallow arbitrary scaling.

Gradient descent with respect to squared loss is captured by inner product queries, since the gradient is given by

𝔼x,y[θ(hθ(x)y)2]\displaystyle\mathbb{E}_{x,y}[\nabla_{\theta}(h_{\theta}(x)-y)^{2}] =𝔼x,y[2(hθ(x)y)θhθ(x)]\displaystyle=\mathbb{E}_{x,y}[2(h_{\theta}(x)-y)\nabla_{\theta}h_{\theta}(x)]
=2𝔼x[hθ(x)θhθ(x)]\displaystyle=2\mathbb{E}_{x}[h_{\theta}(x)\nabla_{\theta}h_{\theta}(x)]
2𝔼x,y[yθhθ(x)].\displaystyle\quad-2\mathbb{E}_{x,y}[y\nabla_{\theta}h_{\theta}(x)].

Here the first term can be estimated directly using knowledge of the distribution, while the latter is a vector each of whose elements is an inner product query.
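As a concrete illustration of this decomposition, here is a minimal, self-contained sketch (not the paper's code): it simulates an inner-product SQ oracle of tolerance τ by Monte Carlo plus adversarial noise, and assembles the population-loss gradient for a linear hypothesis h_θ(x) = θ·x trained against a single-ReLU target; the target, the hypothesis class, and all parameter choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 10, 1e-3

# Illustrative target concept c and marginal D = N(0, I_n); labels are noiseless, y = c(x).
w_star = rng.standard_normal(n) / np.sqrt(n)
def c(X):
    return np.maximum(X @ w_star, 0.0)              # a single ReLU, for illustration only

def inner_product_query(g, num_mc=50_000):
    """Simulated SQ oracle for inner-product queries: any value within +/- tau of
    E_{x~D}[g(x) y] is a legal response; here the expectation is estimated by Monte Carlo."""
    X = rng.standard_normal((num_mc, n))
    return np.mean(g(X) * c(X)) + rng.uniform(-tau, tau)

def population_grad(theta, num_mc=50_000):
    """Gradient of E[(h_theta(x) - y)^2] for h_theta(x) = theta . x, i.e.
    2 E[h_theta(x) x] - 2 E[y x]: the first term uses only knowledge of D, and the
    second is one inner-product query per coordinate (each query x -> x_j has norm 1)."""
    X = rng.standard_normal((num_mc, n))
    first = 2.0 * np.mean((X @ theta)[:, None] * X, axis=0)
    second = np.array([inner_product_query(lambda Z, j=j: Z[:, j]) for j in range(n)])
    return first - 2.0 * second

theta = np.zeros(n)
for _ in range(20):                                 # plain gradient descent driven by SQ answers
    theta -= 0.1 * population_grad(theta)
```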

We now formally define the learning problems we consider.

Definition 2.1 (SQ learning of real-valued concepts using inner product queries).

Let 𝒞{\mathcal{C}} be a class of real-valued concepts over a domain XX, and let DD be a distribution on XX. We say that a learner learns 𝒞{\mathcal{C}} with respect to DD up to L2L_{2} error ϵ\epsilon (equivalently, squared loss ϵ2\epsilon^{2}) using inner product queries if, given only SQ oracle access to DcD_{c} for some unknown c𝒞c\in{\mathcal{C}}, and using only inner product queries, it is able to output c~:X[1,1]\tilde{c}:X\to[-1,1] such that cc~Dϵ\|c-\tilde{c}\|_{D}\leq\epsilon.

For the classification setting, we consider two different notions of learning pp-concepts. One is learning the target up to small L2L_{2} error, to be thought of as a strong form of learning. The other, weaker form, is achieving a nontrivial inner product (i.e. unnormalized correlation) with the target. We prove lower bounds on both in order to capture different learning goals.

Definition 2.2 (SQ learning of pp-concepts).

Let 𝒞{\mathcal{C}} be a class of pp-concepts over a domain XX, and let DD be a distribution on XX. We say that a learner learns 𝒞{\mathcal{C}} with respect to DD up to L2L_{2} error ϵ\epsilon if, given only SQ oracle access to DcD_{c} for some unknown c𝒞c\in{\mathcal{C}}, and using arbitrary queries, it is able to output c~:X[1,1]\tilde{c}:X\to[-1,1] such that cc~Dϵ\|c-\tilde{c}\|_{D}\leq\epsilon. We say that a learner weakly learns 𝒞{\mathcal{C}} with respect to DD with advantage ϵ\epsilon if it is able to output c~:X[1,1]\tilde{c}:X\to[-1,1] such that c~,cDϵ\langle\tilde{c},c\rangle_{D}\geq\epsilon.

Note that the best achievable advantage is 𝔼xD[|c(x)|]\mathbb{E}_{x\sim D}[|c(x)|], achieved by c~(x)=sign(c(x))\tilde{c}(x)=\operatorname{sign}(c(x)). Note also that cD2𝔼D[|c|]cD\|c\|_{D}^{2}\leq\mathbb{E}_{D}[|c|]\leq\|c\|_{D} (the first inequality uses |c(x)|1|c(x)|\leq 1, the second is Cauchy–Schwarz), and therefore a lower bound on the norms of functions in 𝒞{\mathcal{C}} yields a lower bound on the best achievable advantage.

Remark 2.3 (Learning with L2L_{2} error implies weak learning).

If the functions in our class satisfy a norm lower bound, say cD2(1+α)ϵ2\|c\|_{D}^{2}\geq(1+\alpha)\epsilon^{2}, then a simple calculation shows that learning with L2L_{2} error ϵ\epsilon implies weak learning with advantage αϵ2/2\alpha\epsilon^{2}/2.
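To spell out the calculation behind this remark: writing 2\langle\tilde{c},c\rangle_{D}=\|c\|_{D}^{2}+\|\tilde{c}\|_{D}^{2}-\|c-\tilde{c}\|_{D}^{2} and using \|\tilde{c}\|_{D}^{2}\geq 0, any \tilde{c} with \|c-\tilde{c}\|_{D}\leq\epsilon satisfies

\langle\tilde{c},c\rangle_{D}\geq\frac{\|c\|_{D}^{2}-\|c-\tilde{c}\|_{D}^{2}}{2}\geq\frac{(1+\alpha)\epsilon^{2}-\epsilon^{2}}{2}=\frac{\alpha\epsilon^{2}}{2}.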

Our definition of weak learning also captures the standard boolean sense of weak learning, in which the learner is required to output a boolean hypothesis with 0/1 loss bounded away from 1/21/2. Indeed, by an easy calculation, the 0/1 loss of a function f:X{±1}f:X\to\{\pm 1\} satisfies

(x,y)Dc[f(x)y]=12c,fD2.\displaystyle\mathbb{P}_{(x,y)\sim D_{c}}[f(x)\neq y]=\frac{1}{2}-\frac{\langle c,f\rangle_{D}}{2}.
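To verify this identity, condition on x: the label is +1 with probability (1+c(x))/2, so for f(x)\in\{\pm 1\} we have \mathbb{P}[f(x)\neq y\mid x]=\frac{1-c(x)f(x)}{2}; taking the expectation over x\sim D gives \frac{1}{2}-\frac{\langle c,f\rangle_{D}}{2}.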

The difficulty of learning a concept class in the SQ model is captured by a parameter known as the statistical dimension of the class.

Definition 2.4 (Statistical dimension).

Let 𝒞{\mathcal{C}} be a concept class of either real-valued concepts or pp-concepts (i.e. their corresponding conditional mean functions) on a domain XX, and let DD be a distribution on XX. The (un-normalized) correlation of two concepts c,c𝒞c,c^{\prime}\in{\mathcal{C}} under DD is |c,cD||\langle c,c^{\prime}\rangle_{D}|. (In the pp-concept setting, it is instructive to note that in the notation of [FGR+17], this correlation is precisely the distributional correlation χD0(Dc,Dc)\chi_{D_{0}}(D_{c},D_{c^{\prime}}) of the induced labeled distributions DcD_{c} and DcD_{c^{\prime}} under the reference distribution D0=D×Unif{±1}D_{0}=D\times\operatorname{Unif}\{\pm 1\}.) The average correlation of 𝒞{\mathcal{C}} is defined to be

ρD(𝒞)=1|𝒞|2c,c𝒞|c,cD|.\displaystyle\rho_{D}({\mathcal{C}})=\frac{1}{|{\mathcal{C}}|^{2}}\sum_{c,c^{\prime}\in{\mathcal{C}}}|\langle c,c^{\prime}\rangle_{D}|.

The statistical dimension on average at threshold γ\gamma, SDAD(𝒞,γ)\operatorname{SDA}_{D}({\mathcal{C}},\gamma), is the largest dd such that for all 𝒞𝒞{\mathcal{C}}^{\prime}\subseteq{\mathcal{C}} with |𝒞||𝒞|/d|{\mathcal{C}}^{\prime}|\geq|{\mathcal{C}}|/d, ρD(𝒞)γ\rho_{D}({\mathcal{C}}^{\prime})\leq\gamma.

Remark 2.5.

For any general and large concept class 𝒞{\mathcal{C}}^{*} (such as all one-layer neural nets), we may consider a specific subclass 𝒞𝒞{\mathcal{C}}\subseteq{\mathcal{C}}^{*} and prove lower bounds on learning 𝒞{\mathcal{C}} in terms of the SDA of 𝒞{\mathcal{C}}. These lower bounds extend to 𝒞{\mathcal{C}}^{*} because if it is hard to learn a subset, then it is hard to learn the whole class.

We will mainly be interested in the statistical dimension in a setting where bounds on pairwise correlations are known. In that case the following lemma holds.

Lemma 2.6 (adapted from [FGR+17], Lemma 3.10).

Suppose a concept class 𝒞{\mathcal{C}} has pairwise correlation γ\gamma, i.e. |c,cD|γ|\langle c,c^{\prime}\rangle_{D}|\leq\gamma for cc𝒞c\neq c^{\prime}\in{\mathcal{C}}, and squared norm at most β\beta, i.e. cD2β\|c\|_{D}^{2}\leq\beta for all c𝒞c\in{\mathcal{C}}. Then for any γ>0\gamma^{\prime}>0, SDAD(𝒞,γ+γ)|𝒞|γβγ\operatorname{SDA}_{D}({\mathcal{C}},\gamma+\gamma^{\prime})\geq|{\mathcal{C}}|\frac{\gamma^{\prime}}{\beta-\gamma}. In particular, if 𝒞{\mathcal{C}} is a class of orthogonal concepts (i.e. γ=0\gamma=0) with squared norm bounded by β\beta, then SDA(𝒞,γ)|𝒞|γβ\operatorname{SDA}({\mathcal{C}},\gamma^{\prime})\geq|{\mathcal{C}}|\frac{\gamma^{\prime}}{\beta}.

Proof.

Let d=|𝒞|γβγd=|{\mathcal{C}}|\frac{\gamma^{\prime}}{\beta-\gamma}, and observe that for any subset 𝒞𝒞{\mathcal{C}}^{\prime}\subseteq{\mathcal{C}} satisfying |𝒞||𝒞|/d=βγγ|{\mathcal{C}}^{\prime}|\geq|{\mathcal{C}}|/d=\frac{\beta-\gamma}{\gamma^{\prime}},

ρD(𝒞)\displaystyle\rho_{D}({\mathcal{C}}^{\prime}) =1|𝒞|2c,c𝒞|c,cD|\displaystyle=\frac{1}{|{\mathcal{C}}^{\prime}|^{2}}\sum_{c,c^{\prime}\in{\mathcal{C}}^{\prime}}|\langle c,c^{\prime}\rangle_{D}|
1|𝒞|2(|𝒞|β+(|𝒞|2|𝒞|)γ)\displaystyle\leq\frac{1}{|{\mathcal{C}}^{\prime}|^{2}}(|{\mathcal{C}}^{\prime}|\beta+(|{\mathcal{C}}^{\prime}|^{2}-|{\mathcal{C}}^{\prime}|)\gamma)
=γ+βγ|𝒞|\displaystyle=\gamma+\frac{\beta-\gamma}{|{\mathcal{C}}^{\prime}|}
≤γ+γ′.\displaystyle\leq\gamma+\gamma^{\prime}. ∎

3 Orthogonal Family of Neural Networks

We consider neural networks with one hidden layer with activation function ϕ:\phi:{\mathbb{R}}\to{\mathbb{R}}, and with one output node that has some activation function ψ:\psi:{\mathbb{R}}\to{\mathbb{R}}. If we take the input dimension to be nn and the number of hidden nodes to be mm, then such a neural network is a function f:nf:{\mathbb{R}}^{n}\to{\mathbb{R}} given by

f(x)=ψ(i=1maiϕ(wix)),\displaystyle f(x)=\psi\left(\sum_{i=1}^{m}a_{i}\phi(w_{i}~{}{\cdot}~{}x)\right),

where winw_{i}\in{\mathbb{R}}^{n} are the weights feeding into the ithi^{\text{th}} hidden node, and aia_{i}\in{\mathbb{R}} are the weights feeding into the output node. If ψ\psi takes values in [1,1][-1,1], we may also view ff as defining a pp-concept in terms of its conditional mean function.

For our construction, we need our functions to be orthogonal, and we need a lower bound on their norms. For the first property we only need the distribution on the domain to satisfy a relaxed kind of spherical symmetry that we term sign-symmetry, which says that the distribution must look identical on all orthants. To lower bound the norms, we need to assume that the distribution is Gaussian 𝒩(0,I){\cal N}(0,I).

Assumption 3.1 (Sign-symmetry).

For any z{±1}nz\in\{\pm 1\}^{n} and xnx\in{\mathbb{R}}^{n}, let xzx\circ z denote (x1z1,,xnzn)(x_{1}z_{1},\dots,x_{n}z_{n}). A distribution DD on n{\mathbb{R}}^{n} is sign-symmetric if for any z{±1}nz\in\{\pm 1\}^{n} and xx drawn from DD, xx and xzx\circ z have the same distribution DD.

Assumption 3.2 (Odd outer activation).

The outer activation ψ\psi is an odd, increasing function, i.e. ψ(x)=ψ(x)\psi(-x)=-\psi(x).

Note that ψ\psi could be the identity function.

Assumption 3.3 (Inner activation).

The inner activation ϕL2(𝒩(0,I))\phi\in L_{2}({\cal N}(0,I)).

The construction of our orthogonal family of neural networks is simple and exploits sign-symmetry.

Definition 3.4 (Family of Orthogonal Neural Networks).

Let the domain be n{\mathbb{R}}^{n}, let ϕ:\phi:{\mathbb{R}}\to{\mathbb{R}} be any activation function satisfying 3.3, and let ψ:\psi:{\mathbb{R}}\to{\mathbb{R}} be any odd function. For an index set S[n]S\subseteq[n], let xS|S|x_{S}\in{\mathbb{R}}^{|S|} denote the vector of xix_{i} for iSi\in S. Fix any k>0k>0. For any sign-pattern z{±1}kz\in\{\pm 1\}^{k}, let χ(z)\chi(z) denote the parity izi\prod_{i}z_{i}. For any index set Sk[n]S\subseteq_{k}[n], define a one-layer neural network with m=2km=2^{k} hidden nodes,

gS(x)\displaystyle g_{S}(x) =w{1,1}kχ(w)ϕ(wxSk)\displaystyle=\sum_{w\in\{-1,1\}^{k}}\chi(w)\phi\left(\frac{w~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)
fS(x)\displaystyle f_{S}(x) =ψ(gS(x)).\displaystyle=\psi\left(g_{S}(x)\right).

Our orthogonal family is

𝒞orth(n,k)={fSSk[n]}.{\mathcal{C}_{\text{orth}}}(n,k)=\{f_{S}\mid S\subseteq_{k}[n]\}.

Notice that the size of this family is (nk)=nΘ(k)\binom{n}{k}=n^{\Theta(k)} (for appropriate kk), which is nΘ(logm)n^{\Theta(\log m)} in terms of mm. We will take k=Θ(logn)k=\Theta(\log n), so that m=poly(n)m=\operatorname{poly}(n) and thus the neural networks are poly(n)\operatorname{poly}(n)-sized, and the size of the family is nΘ(logn)n^{\Theta(\log n)}, i.e. quasipolynomial in nn.
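For concreteness, here is a minimal NumPy sketch of evaluating fSf_{S} (illustrative only; the activations ϕ\phi and ψ\psi are passed in by the caller, and the example choices at the end are assumptions, not the only ones covered by our results).

```python
import itertools
import numpy as np

def f_S(X, S, phi, psi):
    """Evaluate f_S(x) = psi( sum_{w in {-1,1}^k} chi(w) * phi(w . x_S / sqrt(k)) )
    on a batch X of shape (num_samples, n), where S is an index set of size k."""
    k = len(S)
    XS = X[:, S]                                       # restrict inputs to the coordinates in S
    out = np.zeros(X.shape[0])
    for w in itertools.product([-1.0, 1.0], repeat=k):
        w = np.asarray(w)
        out += np.prod(w) * phi(XS @ w / np.sqrt(k))   # np.prod(w) is the parity chi(w)
    return psi(out)

# Example usage with phi = ReLU, psi = tanh, and an index set of size k = 4 (illustrative):
X = np.random.default_rng(0).standard_normal((5, 10))
vals = f_S(X, [0, 2, 4, 7], phi=lambda t: np.maximum(t, 0.0), psi=np.tanh)
```

Each hidden unit has weight vector w/\sqrt{k} of unit norm, outer weight \chi(w)\in\{\pm 1\}, and zero bias, matching the description in the introduction.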

We now prove that our functions are orthogonal under any sign-symmetric distribution.

Theorem 3.5.

Let the domain be n{\mathbb{R}}^{n}, and let DD be a sign-symmetric distribution on n{\mathbb{R}}^{n}. Fix any k>0k>0. Then fS,fTD=0\langle f_{S},f_{T}\rangle_{D}=0 for any two distinct fS,fT𝒞orth(n,k)f_{S},f_{T}\in{\mathcal{C}_{\text{orth}}}(n,k).

Proof.

For the proof, the key property of our construction that we will use is the following: for any sign-pattern z{±1}nz\in\{\pm 1\}^{n} and any xnx\in{\mathbb{R}}^{n},

fS(xz)=χS(z)fS(x),f_{S}(x\circ z)=\chi_{S}(z)f_{S}(x), (1)

where χS(z)=iSzi=χ(zS)\chi_{S}(z)=\prod_{i\in S}z_{i}=\chi(z_{S}) is the parity on SS of zz. Indeed, observe first that

gS(xz)\displaystyle g_{S}(x\circ z) =w{1,1}kχ(w)ϕ(w(xz)Sk)\displaystyle=\sum_{w\in\{-1,1\}^{k}}\chi(w)\phi\left(\frac{w~{}{\cdot}~{}(x\circ z)_{S}}{\sqrt{k}}\right)
=w{1,1}kχ(w)ϕ((wzS)xSk)\displaystyle=\sum_{w\in\{-1,1\}^{k}}\chi(w)\phi\left(\frac{(w\circ z_{S})~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)
=w{1,1}kχ(wzS)χ(zS)ϕ((wzS)xSk)\displaystyle=\sum_{w\in\{-1,1\}^{k}}\chi(w\circ z_{S})\chi(z_{S})\phi\left(\frac{(w\circ z_{S})~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)
=χ(zS)w{1,1}kχ(w)ϕ(wxSk)\displaystyle=\chi(z_{S})\sum_{w\in\{-1,1\}^{k}}\chi(w)\phi\left(\frac{w~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right) (replacing wzSw\circ z_{S} with ww)
=χ(zS)gS(x).\displaystyle=\chi(z_{S})g_{S}(x).

The property then follows since ψ\psi is odd and ψ(av)=aψ(v)\psi(av)=a\psi(v) for any a{±1}a\in\{\pm 1\} and vv\in{\mathbb{R}}.

Consider fSf_{S} and fTf_{T} for any two distinct S,Tk[n]S,T\subseteq_{k}[n]. Recall that by the definition of sign-symmetry, for any z{±1}nz\in\{\pm 1\}^{n} and xx drawn from DD, xx and xzx\circ z have the same distribution. Using this and Eq. 1, we have

fS,fTD\displaystyle\langle f_{S},f_{T}\rangle_{D} =𝔼xD[fS(x)fT(x)]\displaystyle=\mathbb{E}_{x\sim D}[f_{S}(x)f_{T}(x)]
=𝔼z{±1}n𝔼xD[fS(xz)fT(xz)]\displaystyle=\mathbb{E}_{z\sim\{\pm 1\}^{n}}\ \mathbb{E}_{x\sim D}[f_{S}(x\circ z)f_{T}(x\circ z)] (sign-symmetry)
=𝔼z{±1}n𝔼xD[χS(z)fS(x)χT(z)fT(x)]\displaystyle=\mathbb{E}_{z\sim\{\pm 1\}^{n}}\ \mathbb{E}_{x\sim D}[\chi_{S}(z)f_{S}(x)\chi_{T}(z)f_{T}(x)] (Eq. 1)
=𝔼xD[fS(x)fT(x)𝔼z{±1}nχS(z)χT(z)]\displaystyle=\mathbb{E}_{x\sim D}\left[f_{S}(x)f_{T}(x)\ \mathbb{E}_{z\sim\{\pm 1\}^{n}}\chi_{S}(z)\chi_{T}(z)\right]
=0,\displaystyle=0,

since 𝔼z{±1}nχS(z)χT(z)=0\mathbb{E}_{z\sim\{\pm 1\}^{n}}\chi_{S}(z)\chi_{T}(z)=0 for any two distinct parities χS,χT\chi_{S},\chi_{T}. ∎

Remark 3.6.

Our proof actually shows that any family of functions satisfying Eq. 1 is an orthogonal family under any sign-symmetric distribution.
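As a quick numerical illustration of Eq. 1 and Theorem 3.5 (a standalone sketch; the choices ϕ = ReLU, ψ = tanh, n = 8, k = 4, the two index sets, and the Monte Carlo sample size are all illustrative assumptions):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, k, num_mc = 8, 4, 400_000
relu = lambda t: np.maximum(t, 0.0)

def f_S(X, S):
    """f_S from Definition 3.4 with phi = ReLU and psi = tanh (illustrative choices)."""
    XS = X[:, S]
    out = sum(np.prod(w) * relu(XS @ np.asarray(w) / np.sqrt(k))
              for w in itertools.product([-1.0, 1.0], repeat=k))
    return np.tanh(out)

X = rng.standard_normal((num_mc, n))        # x ~ N(0, I_n), a sign-symmetric distribution
S, T = [0, 1, 2, 3], [0, 1, 2, 4]           # two distinct index sets of size k
fS, fT = f_S(X, S), f_S(X, T)

# Eq. (1): flipping input signs multiplies f_S by the parity of the flipped coordinates in S.
z = rng.choice([-1.0, 1.0], size=n)
assert np.allclose(f_S(X * z, S), np.prod(z[S]) * fS)

# Theorem 3.5: distinct members of the family are orthogonal, so the Monte Carlo estimate of
# <f_S, f_T>_D is near 0, while the squared norm E[f_S^2] is not (cf. Theorem 3.8 below).
print(np.mean(fS * fT), np.mean(fS ** 2))
```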

We still need to establish that our functions are nonzero. For this we need to specialize to the Gaussian distribution, as well as consider specific activation functions (a similar analysis can in principle be carried out for other sign-symmetric distributions). For any nn and kk, it follows from Lemma A.1 that if the inner activation ϕ\phi has a nonzero Hermite coefficient of degree kk or higher, then the functions in 𝒞orth(n,k){\mathcal{C}_{\text{orth}}}(n,k) are nonzero. The sigmoid, ReLU and sign functions all satisfy this property.

Corollary 3.7.

Let the domain be n{\mathbb{R}}^{n}, and let DD be any sign-symmetric distribution on n{\mathbb{R}}^{n}. For any γ>0\gamma>0,

SDAD(𝒞orth(n,k),γ)|𝒞orth(n,k)|γ=(nk)γ.\operatorname{SDA}_{D}({\mathcal{C}_{\text{orth}}}(n,k),\gamma)\geq|{\mathcal{C}_{\text{orth}}}(n,k)|\gamma=\binom{n}{k}\gamma.

Here we also assume that all c𝒞orth(n,k)c\in{\mathcal{C}_{\text{orth}}}(n,k) are nonzero for our distribution DD.

Proof.

Follows from Theorem 3.5 and Lemma 2.6, using a loose upper bound of 1 on the squared norm. ∎

We also need to prove norm lower bounds on our functions for our notions of learning to be meaningful. In Appendix A, we prove the following.

Theorem 3.8.

Let the inner activation function ϕ\phi be ReLU\operatorname{ReLU} or sigmoid, and let the outer activation function ψ\psi be any odd, increasing, continuous function. Let the underlying distribution DD be 𝒩(0,In)\mathcal{N}(0,I_{n}). Then fS=Ω(eΘ(k))\|f_{S}\|=\Omega(e^{-\Theta(k)}), where the hidden constants depend on ψ\psi and ϕ\phi, for any fS𝒞orth(n,k)f_{S}\in{\mathcal{C}_{\text{orth}}}(n,k).

With this in hand, we now state our main SQ lower bounds.

Theorem 3.9.

Let the input dimension be nn, and let the underlying distribution be 𝒩(0,In)\mathcal{N}(0,I_{n}). Consider 𝒞orth(n,k){\mathcal{C}_{\text{orth}}}(n,k) instantiated with ϕ=ReLU\phi=\operatorname{ReLU} or sigmoid and ψ\psi any odd, increasing function (including the identity function), and let m=2km=2^{k} be the hidden layer size of each neural net. Let AA be an SQ learner using only inner product queries of tolerance τ\tau. For any kk\in{\mathbb{N}}, there exists τ=1/nΘ(k)\tau=1/n^{\Theta(k)} such that AA requires at least nΩ(k)n^{\Omega(k)} queries of tolerance τ\tau to learn 𝒞orth(n,k){\mathcal{C}_{\text{orth}}}(n,k) with advantage 1/exp(k)1/\exp(k).

In particular, there exist k=Θ(logn)k=\Theta(\log n) and τ=1/nΘ(logn)\tau=1/n^{\Theta(\log n)} such that AA requires at least nΩ(logn)n^{\Omega(\log n)} queries of tolerance τ\tau to learn 𝒞orth(n,k){\mathcal{C}_{\text{orth}}}(n,k) with advantage 1/poly(n)1/\operatorname{poly}(n). In this case m=poly(n)m=\operatorname{poly}(n), so that each function in the family has polynomial size. This is our main superpolynomial lower bound.

Proof.

The proof amounts to careful choices of the parameters ϵ,γ\epsilon,\gamma and τ\tau in Corollary 3.7 and Corollary 4.6. Recall that SDA(𝒞orth(n,k),γ)nΘ(k)γ\operatorname{SDA}({\mathcal{C}_{\text{orth}}}(n,k),\gamma)\geq n^{\Theta(k)}\gamma. We pick γ=nΘ(k)\gamma=n^{-\Theta(k)} appropriately such that d=SDA(𝒞orth(n,k),γ)d=\operatorname{SDA}({\mathcal{C}_{\text{orth}}}(n,k),\gamma) is still nΘ(k)n^{\Theta(k)}. Theorem 3.8 gives us a norm lower bound of exp(Θ(k))\exp(-\Theta(k)), allowing us to take ϵ=exp(Θ(k))\epsilon=\exp(-\Theta(k)) and τ=γ=nΘ(k)\tau=\sqrt{\gamma}=n^{-\Theta(k)} in Corollary 4.6. ∎

4 SQ Lower Bounds

SQ Lower Bounds for Real-valued Functions

Prior work [Szö09, Fel12] has already established the following fundamental result, which we phrase in terms of our definition of statistical dimension. For the reader’s convenience, we include a proof in Appendix B.

Theorem 4.1.

Let DD be a distribution on XX, and let 𝒞{\mathcal{C}} be a real-valued concept class over a domain XX such that cD>ϵ\|c\|_{D}>\epsilon for all c𝒞c\in{\mathcal{C}}. Consider any SQ learner that is allowed to make only inner product queries to an SQ oracle for the labeled distribution DcD_{c} for some unknown c𝒞c\in{\mathcal{C}}. Let d=SDAD(𝒞,γ)d=\operatorname{SDA}_{D}({\mathcal{C}},\gamma). Then any such SQ learner needs at least Ω(d)\Omega(d) queries of tolerance γ\sqrt{\gamma} to learn 𝒞{\mathcal{C}} up to L2L_{2} error ϵ\epsilon.

SQ Lower Bounds for p-concepts

It turns out to be fruitful to view our learning problem in terms of a decision problem over distributions. We define the problem of distinguishing a valid labeled distribution from a randomly labeled one, and show a lower bound for this problem. We then show that learning is at least as hard as distinguishing, thereby extending the lower bound to learning as well. Our analysis closely follows that of [FGR+17].

Definition 4.2 (Distinguishing between labeled and uniformly random distributions).

Let 𝒞{\mathcal{C}} be a class of pp-concepts over a domain XX, and let DD be a distribution on XX. Let D0=Dc0D_{0}=D_{c_{0}} be the randomly labeled distribution D×Unif{±1}D\times\operatorname{Unif}\{\pm 1\}. Suppose we are given SQ access either to a labeled distribution DcD_{c} for some c𝒞c\in{\mathcal{C}} such that cc0c\neq c_{0} or to D0D_{0}. The problem of distinguishing between labeled and uniformly random distributions is to decide which.

Remark 4.3.

Given access to DcD_{c} for some truly boolean concept c:X{±1}c:X\to\{\pm 1\}, it is easy to distinguish any other boolean function cc^{\prime} from cc since ccD2=22c,cD\|c-c^{\prime}\|_{D}^{2}=2-2\langle c,c^{\prime}\rangle_{D} (which is information-theoretically optimal as a distinguishing criterion) can be computed using a single inner product query. However, if cc and cc^{\prime} are pp-concepts, cD\|c\|_{D} and cD\|c^{\prime}\|_{D} are not 1 in general and may be difficult to estimate. It is not obvious how best to distinguish the two, short of directly learning the target.

Considering the distinguishing problem is useful because if we can show that distinguishing itself is hard, then any reasonable notion of learning will be hard as well, including weak learning. We give simple reductions for both our notions of learning.

Lemma 4.4 (Learning is as hard as distinguishing).

Let DD be a distribution over the domain XX, and let 𝒞{\mathcal{C}} be a pp-concept class over XX. Suppose there exists either

(a) a weak SQ learner capable of learning 𝒞{\mathcal{C}} up to advantage ϵ\epsilon using qq queries of tolerance τ\tau, where τϵ/2\tau\leq\epsilon/2; or,

(b) an SQ learner capable of learning 𝒞{\mathcal{C}} (assume cD3ϵ\|c\|_{D}\geq 3\epsilon for all c𝒞c\in{\mathcal{C}}) up to L2L_{2} error ϵ\epsilon using qq queries of tolerance τ\tau, where τϵ2\tau\leq\epsilon^{2}. Then there exists a distinguisher that is able to distinguish between an unknown DcD_{c} and D0D_{0} using at most q+1q+1 queries of tolerance τ\tau.

Proof.

(a) Run the weak learner to obtain c~\tilde{c}. If cc0c\neq c_{0}, we know that c~,cDϵ\langle\tilde{c},c\rangle_{D}\geq\epsilon, whereas if c=c0c=c_{0}, then c~,cD=0\langle\tilde{c},c\rangle_{D}=0 no matter what c~\tilde{c} is. A single additional query (h(x,y)=c~(x)yh(x,y)=\tilde{c}(x)y) of tolerance ϵ/2\epsilon/2 distinguishes between the two cases.

(b) Run the learner to obtain c~\tilde{c}. If cc0c\neq c_{0}, i.e. cD3ϵ\|c\|_{D}\geq 3\epsilon, we know that c~cDϵ\|\tilde{c}-c\|_{D}\leq\epsilon, so that by the triangle inequality, c~DcDc~cD2ϵ\|\tilde{c}\|_{D}\geq\|c\|_{D}-\|\tilde{c}-c\|_{D}\geq 2\epsilon. But if c=c0c=c_{0}, then c~Dϵ\|\tilde{c}\|_{D}\leq\epsilon. An additional query (h(x,y)=c~(x)2h(x,y)=\tilde{c}(x)^{2}) of tolerance ϵ2\epsilon^{2} suffices to distinguish the two cases. ∎

We now prove the main lower bound on distinguishing.

Theorem 4.5.

Let DD be a distribution over the domain XX, and let 𝒞{\mathcal{C}} be a pp-concept class over XX. Then any SQ algorithm needs at least d=SDA(𝒞,γ)d=\operatorname{SDA}({\mathcal{C}},\gamma) queries of tolerance γ\sqrt{\gamma} to distinguish between DcD_{c} and D0D_{0} for an unknown c𝒞c\in{\mathcal{C}}. (We will consider deterministic SQ algorithms that always succeed, for simplicity.)

Proof.

Consider any successful SQ algorithm AA. Consider the adversarial strategy where to every query h:X×{1,1}[1,1]h:X\times\{-1,1\}\to[-1,1] of AA (with tolerance τ=γ\tau=\sqrt{\gamma}), we respond with 𝔼D0[h]\mathbb{E}_{D_{0}}[h]. We can pretend that this is a valid answer with respect to any c𝒞c\in{\mathcal{C}} such that |𝔼Dc[h]𝔼D0[h]|τ|\mathbb{E}_{D_{c}}[h]-\mathbb{E}_{D_{0}}[h]|\leq\tau. Our argument will be based on showing that each such query rules out fairly few distributions, so that the number of queries required in total is large.

Since we assumed that AA is a deterministic algorithm that always succeeds, it eventually correctly guesses that it is D0D_{0} that it is getting answers from. Say it takes qq queries to do so. For the kthk^{\text{th}} query hkh_{k}, let SkS_{k} be the set of concepts in 𝒞{\mathcal{C}} that are ruled out by our response 𝔼D0[hk]\mathbb{E}_{D_{0}}[h_{k}]:

Sk={c𝒞|𝔼Dc[hk]𝔼D0[hk]|>τ}.S_{k}=\{c\in{\mathcal{C}}\mid|\mathbb{E}_{D_{c}}[h_{k}]-\mathbb{E}_{D_{0}}[h_{k}]|\ >\tau\}.

We’ll show that

(a) on the one hand, k=1qSk=𝒞\cup_{k=1}^{q}S_{k}={\mathcal{C}}, so that k=1q|Sk||𝒞|\sum_{k=1}^{q}|S_{k}|\geq|{\mathcal{C}}|,

(b) while on the other, |Sk||𝒞|/d|S_{k}|\leq|{\mathcal{C}}|/d for every kk. Together, this will mean that qdq\geq d.

For the first claim, suppose k=1qSk\cup_{k=1}^{q}S_{k} were not all of 𝒞{\mathcal{C}}, and indeed say c𝒞(k=1qSk)c\in{\mathcal{C}}\setminus(\cup_{k=1}^{q}S_{k}). Then DcD_{c} is a distribution with which all of our answers were consistent throughout, yet for which AA’s final answer (D0D_{0}) is incorrect. Since AA always succeeds, this is impossible.

For the second claim, suppose for the sake of contradiction that for some kk, |Sk|>|𝒞|/d|S_{k}|>|{\mathcal{C}}|/d. By Definition 2.4, this means we know that ρD(Sk)γ\rho_{D}(S_{k})\leq\gamma. One of the key insights in the proof of [Szö09] is that by expressing query expectations entirely in terms of inner products, we gain the ability to apply simple algebraic techniques. To this end, for any query function hh, let h^(x)=(h(x,1)h(x,1))/2\widehat{h}(x)=(h(x,1)-h(x,-1))/2. Observe that for any pp-concept cc,

h^,cD\displaystyle\langle\widehat{h},c\rangle_{D} =𝔼xD[h(x,1)c(x)2]𝔼xD[h(x,1)c(x)2]\displaystyle=\mathbb{E}_{x\sim D}\left[h(x,1)\frac{c(x)}{2}\right]-\mathbb{E}_{x\sim D}\left[h(x,-1)\frac{c(x)}{2}\right]
=𝔼xD[h(x,1)1+c(x)2]\displaystyle=\mathbb{E}_{x\sim D}\left[h(x,1)\frac{1+c(x)}{2}\right]
+𝔼xD[h(x,1)1c(x)2]\displaystyle\quad+\mathbb{E}_{x\sim D}\left[h(x,-1)\frac{1-c(x)}{2}\right]
𝔼xD[h(x,1)12]𝔼xD[h(x,1)12]\displaystyle\quad-\mathbb{E}_{x\sim D}\left[h(x,1)\frac{1}{2}\right]-\mathbb{E}_{x\sim D}\left[h(x,-1)\frac{1}{2}\right]
=𝔼Dc[h]𝔼D0[h],\displaystyle=\mathbb{E}_{D_{c}}[h]-\mathbb{E}_{D_{0}}[h],

the difference between the query expectations wrt DcD_{c} and D0D_{0}. Here we have expanded each 𝔼Dc[h]\mathbb{E}_{D_{c}}[h] using the fact that the label for xx is 11 with probability (1+c(x))/2(1+c(x))/2 and 1-1 otherwise. Thus |hk^,cD||\langle\widehat{h_{k}},c\rangle_{D}|, where hkh_{k} is the kthk^{\text{th}} query, is greater than τ\tau for any cSkc\in S_{k}, since SkS_{k} are precisely those concepts ruled out by our response. We will show contradictory upper and lower bounds on the following quantity:

Φ=hk^,cSkcsign(hk^,cD)D.\displaystyle\Phi=\left\langle\widehat{h_{k}},\sum_{c\in S_{k}}c\cdot\operatorname{sign}(\langle\widehat{h_{k}},c\rangle_{D})\right\rangle_{D}.

Note that since every query hh satisfies h(,y)D1\|h(\cdot,y)\|_{D}\leq 1 for all yy, it follows by the triangle inequality that h^D1\|\widehat{h}\|_{D}\leq 1. So by Cauchy-Schwarz and our observation that ρD(Sk)γ\rho_{D}(S_{k})\leq\gamma,

Φ2\displaystyle\Phi^{2} hk^D2cSkcsign(hk^,c)D2\displaystyle\leq\|\widehat{h_{k}}\|_{D}^{2}\cdot\left\|\sum_{c\in S_{k}}c\cdot\operatorname{sign}(\langle\widehat{h_{k}},c\rangle)\right\|_{D}^{2}
c,cSk|c,cD|=|Sk|2ρD(Sk)|Sk|2γ.\displaystyle\leq\sum_{c,c^{\prime}\in S_{k}}|\langle c,c^{\prime}\rangle_{D}|=|S_{k}|^{2}\rho_{D}(S_{k})\leq|S_{k}|^{2}\gamma.

However since |hk^,cD|>τ|\langle\widehat{h_{k}},c\rangle_{D}|\ >\tau, we also have that Φ=cSk|hk^,cD|>|Sk|τ.\Phi=\sum_{c\in S_{k}}|\langle\widehat{h_{k}},c\rangle_{D}|\ >|S_{k}|\tau. Since τ=γ\tau=\sqrt{\gamma}, this contradicts our upper bound and in turn completes the proof of our second claim. And as noted earlier, the two claims together imply that qdq\geq d. ∎

The final lower bounds on learning thus obtained are stated as a corollary for convenience. The proof follows directly from Lemma 4.4 and Theorem 4.5.

Corollary 4.6.

Let DD be a distribution over the domain XX, and let 𝒞{\mathcal{C}} be a pp-concept class over XX. Let γ,τ\gamma,\tau be such that γτ\sqrt{\gamma}\leq\tau. Let d=SDA(𝒞,γ)d=\operatorname{SDA}({\mathcal{C}},\gamma).

(a) Let ϵ\epsilon be such that τϵ2\tau\leq\epsilon^{2}, and assume cD3ϵ\|c\|_{D}\geq 3\epsilon for all c𝒞c\in{\mathcal{C}}. Then any SQ learner learning 𝒞{\mathcal{C}} up to L2L_{2} error ϵ\epsilon requires at least d1d-1 queries of tolerance τ\tau.

(b) Let ϵ\epsilon be such that τϵ/2\tau\leq\epsilon/2. Then any weak SQ learner learning 𝒞{\mathcal{C}} up to advantage ϵ\epsilon requires at least d1d-1 queries of tolerance τ\tau.

5 Experiments

We include experiments for both regression and classification. We train an overparameterized neural network on data from our function class, using gradient descent. We find that we are able to achieve close to zero training error, while test error remains high. This is consistent with our lower bound for these classes of functions.

(a) Learning a softmax of a one-layer tanh network
(b) Learning a linear combination of tanhs
Figure 1: In (a) the target function is a softmax (±1\pm 1 labels) of a sum of 272^{7} tanh activations with n=14n=14; in (b) the labels are obtained similarly but without the softmax. In both cases, we train a 1-layer neural network with 527=6405\cdot 2^{7}=640 tanh units (hence 10241 parameters) using a training set of size 60006000 and a test set of size 10001000, with the learning rate set to 0.010.01. For (a) we take the sign of this trained network and measure its training and testing 0/1 loss; for (b) we measure the train and test square-loss of the learned network directly. In (a) we also plot the test error of the Bayes optimal network (sign of the target function).

For classification, we use a training set of size TT of data corresponding to f𝒞orth(n,k)f\in{\mathcal{C}_{\text{orth}}}(n,k) instantiated with ϕ=tanh\phi=\tanh and ψ=tanh\psi=\tanh. We draw x𝒩(0,In)x\sim\mathcal{N}(0,I_{n}). For each xx, yy is picked randomly from {±1}\{\pm 1\} in such a way that 𝔼[y|x]=f(x)\mathbb{E}[y|x]=f(x). Since the outer activation ψ\psi is tanh\tanh, this can be thought of as applying a softmax to the network’s output, or as the Boolean label corresponding to a logit output. We train a sum of tanh network (i.e. a network in which the inner activation is tanh\tanh and no outer activation is applied) on this data using gradient descent on squared loss, threshold the output, and plot the resulting 0/1 loss. See Fig. 1(a). This setup models a common way in which neural networks are trained for classification problems in practice.
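The following is a minimal sketch of this data-generation step (the values of n, k, and T, and the random choice of the index set S, are illustrative; ϕ = ψ = tanh as described above).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k, T = 14, 7, 6000                        # sizes roughly matching Fig. 1(a)
S = rng.choice(n, size=k, replace=False)     # the hidden index set defining the target f_S

def f_S(X):
    """f_S from Definition 3.4 with phi = psi = tanh."""
    XS = X[:, S]
    out = sum(np.prod(w) * np.tanh(XS @ np.asarray(w) / np.sqrt(k))
              for w in itertools.product([-1.0, 1.0], repeat=k))
    return np.tanh(out)

X = rng.standard_normal((T, n))              # x ~ N(0, I_n)
p_plus = (1.0 + f_S(X)) / 2.0                # P[y = +1 | x], so that E[y | x] = f_S(x)
y = np.where(rng.random(T) < p_plus, 1.0, -1.0)
```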

For regression, we use a training set of size TT of data corresponding to f𝒞orth(n,k)f\in{\mathcal{C}_{\text{orth}}}(n,k) instantiated with ϕ=tanh\phi=\tanh and ψ\psi being the identity. We draw x𝒩(0,In)x\sim\mathcal{N}(0,I_{n}), and y=f(x)y=f(x). We train a sum of tanh network on this data using gradient descent on squared loss, which we plot in Fig. 1(b). This setup models the natural way of using neural networks for regression problems.
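A minimal end-to-end sketch of this regression experiment follows (a toy version rather than the code behind Fig. 1(b); the network width, learning rate, number of steps, and sample sizes are illustrative and smaller than in the figure).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k, T = 14, 7, 2000
S = rng.choice(n, size=k, replace=False)

def g_S(X):                                  # target: a linear combination of tanh units
    XS = X[:, S]
    return sum(np.prod(w) * np.tanh(XS @ np.asarray(w) / np.sqrt(k))
               for w in itertools.product([-1.0, 1.0], repeat=k))

X, X_test = rng.standard_normal((T, n)), rng.standard_normal((1000, n))
y, y_test = g_S(X), g_S(X_test)

M, lr, steps = 256, 0.01, 300                # overparameterized sum-of-tanh student network
U = rng.standard_normal((M, n)) / np.sqrt(n) # hidden-layer weights
a = rng.standard_normal(M) / np.sqrt(M)      # output-layer weights

for _ in range(steps):                       # full-batch gradient descent on square loss
    Z = np.tanh(X @ U.T)                     # (T, M) hidden activations
    r = Z @ a - y                            # residuals
    grad_a = (2.0 / T) * (Z.T @ r)
    grad_U = (2.0 / T) * (((r[:, None] * (1.0 - Z ** 2)) * a).T @ X)
    a -= lr * grad_a
    U -= lr * grad_U

print(np.mean((np.tanh(X @ U.T) @ a - y) ** 2))            # train square loss
print(np.mean((np.tanh(X_test @ U.T) @ a - y_test) ** 2))  # test square loss
```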

In both cases, we train neural networks whose number of parameters considerably exceeds the amount of training data. In all our experiments, we plot the median over 10 trials and shade the inter-quartile range of the data.

Similar results hold with the inner activation ϕ\phi being ReLU\operatorname{ReLU} instead of tanh\tanh, and are shown in Fig. 2.

(a) Learning a softmax of a one-layer ReLU network
(b) Learning a linear combination of ReLUs
Figure 2: In (a) the target function is a softmax (±1\pm 1 labels) of a sum of 282^{8} ReLU activations with n=14n=14; in (b) the labels are obtained similarly but without the softmax. In both cases, we train a 1-layer neural network with 528=12805\cdot 2^{8}=1280 ReLU units (hence 20481 parameters) using a training set of size 60006000 and a test set of size 10001000, with the learning rate set to 0.0050.005 for classification and 0.0020.002 for regression. For (a) we take the sign of this trained network and measure its training and testing 0/1 loss; for (b) we measure the train and test square loss of the learned network directly. In (a) we also plot the test error of the Bayes optimal network (sign of the target function).

References

  • [ADHV19] Alexandr Andoni, Rishabh Dudeja, Daniel Hsu, and Kiran Vodrahalli. Attribute-efficient learning of monomials over highly-correlated variables. In Thirtieth International Conference on Algorithmic Learning Theory, 2019.
  • [APVZ14] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning sparse polynomial functions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 500–510. SIAM, 2014.
  • [AS20] Emmanuel Abbe and Colin Sandon. Poly-time universality and limitations of deep learning. arXiv preprint arXiv:2001.02992, 2020.
  • [BFJ+94] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 253–262, 1994.
  • [BG17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. CoRR, abs/1702.07966, 2017.
  • [Boy84] John P Boyd. Asymptotic coefficients of hermite function series. Journal of Computational Physics, 54(3):382–410, 1984.
  • [BR89] Avrim Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. In Advances in neural information processing systems, pages 494–501, 1989.
  • [DKKZ20] Ilias Diakonikolas, Daniel Kane, Vasilis Kontonis, and Nikos Zarifis. Algorithms and SQ Lower Bounds for PAC Learning One-Hidden-Layer ReLU Networks. In Conference on Learning Theory, 2020. To appear.
  • [DV20] Amit Daniely and Gal Vardi. Hardness of learning neural networks with natural weights. arXiv preprint arXiv:2006.03177, 2020.
  • [Fel12] Vitaly Feldman. A complete characterization of statistical query learning with applications to evolvability. Journal of Computer and System Sciences, 78(5):1444–1459, 2012.
  • [Fel17] Vitaly Feldman. A general characterization of the statistical query complexity. In Conference on Learning Theory, pages 785–830, 2017.
  • [FGR+17] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh S Vempala, and Ying Xiao. Statistical algorithms and a lower bound for detecting planted cliques. Journal of the ACM (JACM), 64(2):8, 2017.
  • [GKK19] Surbhi Goel, Sushrut Karmalkar, and Adam Klivans. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. In Advances in Neural Information Processing Systems, pages 8582–8591, 2019.
  • [GKKT17] Surbhi Goel, Varun Kanade, Adam R. Klivans, and Justin Thaler. Reliably learning the ReLU in polynomial time. In COLT, pages 1004–1042, 2017.
  • [GLM18] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In ICLR. OpenReview.net, 2018.
  • [JHG18] Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, NeurIPS, pages 8580–8589, 2018.
  • [JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
  • [Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
  • [KK14] Adam R. Klivans and Pravesh Kothari. Embedding hard learning problems into Gaussian space. In APPROX-RANDOM, volume 28 of LIPIcs, pages 793–809. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2014.
  • [KS94] Michael J Kearns and Robert E Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464–497, 1994.
  • [KS09] Adam R Klivans and Alexander A Sherstov. Cryptographic hardness for learning intersections of halfspaces. Journal of Computer and System Sciences, 75(1):2–12, 2009.
  • [LSSS14] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
  • [Sha18] Ohad Shamir. Distribution-specific hardness of learning neural networks. J. Mach. Learn. Res, 19:32:1–32:29, 2018.
  • [SSSS17] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3067–3075, 2017.
  • [SVWX17] Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. In Advances in Neural Information Processing Systems, pages 5514–5522, 2017.
  • [Szö09] Balázs Szörényi. Characterizing statistical query learning: simplified notions and proofs. In International Conference on Algorithmic Learning Theory, pages 186–200. Springer, 2009.
  • [Vu98] Van H Vu. On the infeasibility of training neural networks with small mean-squared error. IEEE Transactions on Information Theory, 44(7):2892–2900, 1998.
  • [VW19] Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and sq lower bounds. In Conference on Learning Theory, pages 3115–3117, 2019.
  • [ZPS17] Qiuyi Zhang, Rina Panigrahy, and Sushant Sachdeva. Electron-proton dynamics in deep learning. CoRR, abs/1702.00458, 2017.
  • [ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In ICML, volume 70, pages 4140–4149. JMLR.org, 2017.
  • [ZYWG19] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer ReLU networks via gradient descent. In Kamalika Chaudhuri and Masashi Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 1524–1534. PMLR, 2019.

Appendix A Bounding the function norms under the Gaussian

Our goal in this section will be to give lower bounds on the norms of the functions in 𝒞orth(n,k){\mathcal{C}_{\text{orth}}}(n,k), which is a technical requirement for our results to hold (see Lemma 4.4 and Corollary 4.6). Note that when learning with respect to L2L_{2} error, such a lower bound is necessary if we wish to state SQ lower bounds, since if the target had small norm, say fDϵ\|f\|_{D}\leq\epsilon, then the zero function trivially achieves L2L_{2} error ϵ\epsilon.

All inner products and norms in this section will be with respect to the standard Gaussian, 𝒩(0,I)\mathcal{N}(0,I). Since we will fix SS throughout, for our purposes the only relevant part of the input is xSx_{S} and so we drop the subscripts and let g=gS,f=fSg=g_{S},f=f_{S} and x=xSx=x_{S}, so that gg and ff are functions of xkx\in{\mathbb{R}}^{k}. Our approach will be as follows. In order to prove a norm lower bound on ff, we will prove an anticoncentration result for gg. To this end we first calculate the second moment of gg in terms of the Hermite coefficients of ϕ\phi.

Lemma A.1.

Under the distribution 𝒩(0,In)\mathcal{N}(0,I_{n}), let the Hermite representation of ϕ\phi be ϕ(x)=i=0ϕi^H~i(x)\phi(x)=\sum_{i=0}^{\infty}\widehat{\phi_{i}}\tilde{H}_{i}(x), where H~i(x)\tilde{H}_{i}(x) is the ithi^{\text{th}} normalized probabilists’ Hermite polynomial. Then

𝔼[g(x)2]=4ki0ϕi^2kii1++ik=ii1,,ik are odd(ii1,,ik).\displaystyle\mathbb{E}\left[g(x)^{2}\right]=4^{k}\sum_{i\geq 0}\frac{\widehat{\phi_{i}}^{2}}{k^{i}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=i\\ i_{1},\dots,i_{k}\text{ are odd}\end{subarray}}\binom{i}{i_{1},\dots,i_{k}}.
Proof.

We use 𝔼\mathbb{E} in this proof instead of 𝔼x𝒩(0,In)\mathbb{E}_{x\sim\mathcal{N}(0,I_{n})} for simplicity. Then we have

𝔼[g(x)2]\displaystyle\mathbb{E}\!\left[g(x)^{2}\right]
=\displaystyle= 𝔼[α{±1}kχ(α)ϕ(αxSk)][β{±1}kχ(β)ϕ(βxSk)]\displaystyle\,\mathbb{E}\!\left[\sum_{\alpha\in\{\pm 1\}^{k}}\chi(\alpha)\phi\left(\frac{\alpha~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\right]\!\!\!\left[\sum_{\beta\in\{\pm 1\}^{k}}\chi(\beta)\phi\left(\frac{\beta~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\right]
=\displaystyle= α,β{±1}kl=1kαlβl𝔼[ϕ(αxSk)ϕ(βxSk)]\displaystyle\,\sum_{\alpha,\beta\in\{\pm 1\}^{k}}\prod_{l=1}^{k}\alpha_{l}\beta_{l}\,\mathbb{E}\!\left[\phi\left(\frac{\alpha~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\phi\left(\frac{\beta~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\right]
=\displaystyle= α,β{±1}kl=1kαlβl𝔼[i,j0ϕi^ϕj^H~i(αxSk)H~j(βxSk)]\displaystyle\,\sum_{\alpha,\beta\in\{\pm 1\}^{k}}\prod_{l=1}^{k}\alpha_{l}\beta_{l}\,\mathbb{E}\!\!\left[\sum_{i,j\geq 0}\widehat{\phi_{i}}\widehat{\phi_{j}}\tilde{H}_{i}\!\left(\frac{\alpha~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\!\tilde{H}_{j}\!\left(\frac{\beta~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\!\right]
=\displaystyle= α,β{±1}kl=1kαlβli,j0ϕi^ϕj^𝔼[H~i(αxSk)H~j(βxSk)].\displaystyle\,\sum_{\alpha,\beta\in\{\pm 1\}^{k}}\prod_{l=1}^{k}\alpha_{l}\beta_{l}\!\sum_{i,j\geq 0}\widehat{\phi_{i}}\widehat{\phi_{j}}\,\mathbb{E}\!\left[\tilde{H}_{i}\!\left(\frac{\alpha~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\!\tilde{H}_{j}\!\left(\frac{\beta~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\right].

Since x∼𝒩(0,I_k), the random variables ⟨α, x_S⟩/√k and ⟨β, x_S⟩/√k are each standard Gaussian with correlation ⟨α, β⟩/k, so we may apply the following well-known property of Hermite polynomials.

𝔼(a,b)T𝒩(0,(1ρρ1))H~i(a)H~j(b)=δi,jρi,\mathbb{E}_{(a,b)^{T}\sim\mathcal{N}\left(0,\bigl{(}\begin{smallmatrix}1&\rho\\ \rho&1\end{smallmatrix}\bigr{)}\right)}\tilde{H}_{i}(a)\tilde{H}_{j}(b)=\delta_{i,j}\rho^{i},

where δ_{i,j} is the Kronecker delta.

𝔼[g(x)2]=\displaystyle\mathbb{E}\left[g(x)^{2}\right]= α,β{±1}kl=1kαlβli0ϕi^2(αβk)i\displaystyle\,\sum_{\alpha,\beta\in\{\pm 1\}^{k}}\prod_{l=1}^{k}\alpha_{l}\beta_{l}\sum_{i\geq 0}\widehat{\phi_{i}}^{2}\left(\frac{\alpha~{}{\cdot}~{}\beta}{k}\right)^{i}
=\displaystyle= w,θ{±1}kl=1kwli0ϕi^2(l=1kwlk)i\displaystyle\,\sum_{w,\theta\in\{\pm 1\}^{k}}\prod_{l=1}^{k}w_{l}\sum_{i\geq 0}\widehat{\phi_{i}}^{2}\left(\frac{\sum_{l=1}^{k}w_{l}}{k}\right)^{i}
=\displaystyle=  2kw{±1}kl=1kwli0ϕi^2(l=1kwlk)i,\displaystyle\,2^{k}\sum_{w\in\{\pm 1\}^{k}}\prod_{l=1}^{k}w_{l}\sum_{i\geq 0}\widehat{\phi_{i}}^{2}\left(\frac{\sum_{l=1}^{k}w_{l}}{k}\right)^{i},

where w_l = α_l β_l and θ_l = w_l α_l. Note that 3.3 implies that ∑_{i=0}^{∞} ϕ̂_i² < ∞, so the series above is absolutely convergent. Then,

𝔼[g(x)2]\displaystyle\,\mathbb{E}\left[g(x)^{2}\right]
=\displaystyle=  2ki0ϕi^2w{±1}kl=1kwl(l=1kwlk)i\displaystyle\,2^{k}\sum_{i\geq 0}\widehat{\phi_{i}}^{2}\sum_{w\in\{\pm 1\}^{k}}\prod_{l=1}^{k}w_{l}\left(\frac{\sum_{l=1}^{k}w_{l}}{k}\right)^{i}
=\displaystyle=  2ki0ϕi^2kiw{±1}kl=1kwli1++ik=il=1kwlil(ii1,,ik)\displaystyle\,2^{k}\sum_{i\geq 0}\frac{\widehat{\phi_{i}}^{2}}{k^{i}}\sum_{w\in\{\pm 1\}^{k}}\prod_{l=1}^{k}w_{l}\sum_{i_{1}+\cdots+i_{k}=i}\prod_{l=1}^{k}w_{l}^{i_{l}}\binom{i}{i_{1},\dots,i_{k}}
=\displaystyle=  2ki0ϕi^2kii1++ik=i(ii1,,ik)w{±1}kl=1kwlil+1\displaystyle=\,2^{k}\sum_{i\geq 0}\frac{\widehat{\phi_{i}}^{2}}{k^{i}}\sum_{i_{1}+\cdots+i_{k}=i}\binom{i}{i_{1},\dots,i_{k}}\sum_{w\in\{\pm 1\}^{k}}\prod_{l=1}^{k}w_{l}^{i_{l}+1}
=\displaystyle=  2ki0ϕi^2kii1++ik=i(ii1,,ik)l=1k[1il+1+(1)il+1]\displaystyle\,2^{k}\sum_{i\geq 0}\frac{\widehat{\phi_{i}}^{2}}{k^{i}}\sum_{i_{1}+\cdots+i_{k}=i}\binom{i}{i_{1},\dots,i_{k}}\prod_{l=1}^{k}\left[1^{i_{l}+1}+(-1)^{i_{l}+1}\right]
=\displaystyle=  4ki0ϕi^2kii1++ik=ii1,,ik are odd(ii1,,ik)\displaystyle\,4^{k}\sum_{i\geq 0}\frac{\widehat{\phi_{i}}^{2}}{k^{i}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=i\\ i_{1},\dots,i_{k}\text{ are odd}\end{subarray}}\binom{i}{i_{1},\dots,i_{k}}

where we expanded (∑_{l=1}^{k} w_l)^i into its distinct monomials via the multinomial theorem. Note that ∑_{i_1+⋯+i_k=i, all i_l odd} \binom{i}{i_1,…,i_k} is always non-negative, and it is positive iff i ≥ k and i ≡ k (mod 2). ∎
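
As a quick numerical sanity check of Lemma A.1 (an illustrative sketch only, not used anywhere in the arguments; the choice ϕ = ReLU, k = 2 and all helper names below are ours), the following Python snippet compares a Monte Carlo estimate of 𝔼[g(x)²] against a truncation of the series, with the Hermite coefficients ϕ̂_i computed by Gauss-Hermite quadrature.

import itertools, math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

rng = np.random.default_rng(0)
k = 2

def phi(t):                                   # the activation; here phi = ReLU
    return np.maximum(t, 0.0)

# All sign patterns w in {+-1}^k and their characters chi(w) = prod_l w_l.
W = np.array(list(itertools.product((-1.0, 1.0), repeat=k)))
chi = W.prod(axis=1)

# Monte Carlo estimate of E[g(x)^2] for x ~ N(0, I_k),
# where g(x) = sum_w chi(w) * phi(<w, x> / sqrt(k)).
X = rng.standard_normal((500_000, k))
G = (chi * phi(X @ W.T / math.sqrt(k))).sum(axis=1)
print("Monte Carlo estimate of E[g^2]:", np.mean(G ** 2))

# Truncation of the series in Lemma A.1. The Hermite coefficients
# phi_hat_i = E_{z ~ N(0,1)}[phi(z) He_i(z) / sqrt(i!)] are computed by
# Gauss-HermiteE quadrature (weight function e^{-z^2/2}).
nodes, weights = hermegauss(200)
weights = weights / math.sqrt(2.0 * math.pi)  # so that sum(weights * f(nodes)) ~ E_{N(0,1)}[f]

def phi_hat(i):
    coeffs = np.zeros(i + 1); coeffs[i] = 1.0  # selects the i-th probabilists' Hermite polynomial
    return float(np.sum(weights * phi(nodes) * hermeval(nodes, coeffs))) / math.sqrt(math.factorial(i))

series = 0.0
for i in range(41):
    # sum of multinomial coefficients over compositions of i into k odd parts
    inner = sum(math.factorial(i) // math.prod(math.factorial(a) for a in parts)
                for parts in itertools.product(range(1, i + 1, 2), repeat=k)
                if sum(parts) == i)
    series += 4 ** k * phi_hat(i) ** 2 / k ** i * inner
print("Truncated series from Lemma A.1:", series)
# The two printed values should agree up to Monte Carlo noise and series truncation
# (within roughly a percent for this choice of phi and k).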

A.1 ReLU Activation

The goal of this section is to give a lower bound on ‖f‖ for ϕ = ReLU under the standard Gaussian distribution 𝒩(0,I). To this end, we prove an anticoncentration result for g. We first give a lower bound on ‖g‖ based on the Hermite coefficients of ϕ. If g were bounded, this alone would imply anticoncentration, as in Section A.2. Since it is not, we introduce g^T, in which every activation is truncated at some threshold T. We pick T large enough that g and g^T behave almost identically over 𝒩(0,I). We then show a lower bound on ‖g^T‖, translate it into an anticoncentration result for g^T, and finally into one for g.

Let T>0T>0 be some constant to be determined later. Let

ReLUT(x)=min(ReLU(x),T)\operatorname{ReLU}^{T}(x)=\min(\operatorname{ReLU}(x),T)

and

gT(x)=w{±1}kχ(w)ReLUT(xwk).g^{T}(x)=\sum\limits_{w\in\{\pm 1\}^{k}}\chi(w)\operatorname{ReLU}^{T}\left(\frac{x~{}{\cdot}~{}w}{\sqrt{k}}\right).

The following lemma from [GKK19] describes the Hermite coefficients of ReLU.

Lemma A.2.
ReLU(x)=i=0ciH~i(x)\displaystyle\operatorname{ReLU}(x)=\sum_{i=0}^{\infty}c_{i}\tilde{H}_{i}(x)

where

c0=12π,\displaystyle c_{0}=\sqrt{\frac{1}{2\pi}},\quad c1=12,\displaystyle c_{1}=\frac{1}{2},
c2i1=0,\displaystyle c_{2i-1}=0,\quad c2i=H2i(0)+2iH2i2(0)2π(2i)!for i2.\displaystyle c_{2i}=\frac{H_{2i}(0)+2iH_{2i-2}(0)}{\sqrt{2\pi(2i)!}}\quad\text{for }i\geq 2.

In particular, c2i2=Θ(i2.5)c_{2i}^{2}=\Theta(i^{-2.5}).
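
As a quick check of this closed form (an illustrative sketch only; we take H_j to be the unnormalized probabilists' Hermite polynomial, which matches the normalization H̃_i = H_i/√(i!) used in Lemma A.1), the following tabulates c_{2i} and the ratio i^{2.5}·c_{2i}².

import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def He0(n):
    # the (unnormalized) probabilists' Hermite polynomial He_n evaluated at 0
    coeffs = np.zeros(n + 1); coeffs[n] = 1.0
    return float(hermeval(0.0, coeffs))

def c_2i(i):
    # the closed form of Lemma A.2 (stated for i >= 2)
    return (He0(2 * i) + 2 * i * He0(2 * i - 2)) / math.sqrt(2 * math.pi * math.factorial(2 * i))

for i in range(2, 13):
    print(f"i = {i:2d}   c_2i = {c_2i(i):+.5f}   i^2.5 * c_2i^2 = {i ** 2.5 * c_2i(i) ** 2:.4f}")
# The last column stays bounded between absolute constants (drifting slowly downward from
# about 0.038 at i = 2 to about 0.024 at i = 12), consistent with c_{2i}^2 = Theta(i^{-2.5}).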

We can now derive a lower bound on the norm of gg.

Lemma A.3.

When kk is even,

g=Ω((4e)(12+o(1))k).\displaystyle\left\lVert g\right\rVert=\Omega\left(\left(\frac{4}{e}\right)^{(\frac{1}{2}+o(1))k}\right).
Proof.

Due to Lemma A.1,

𝔼[g(x)2]\displaystyle\mathbb{E}\left[g(x)^{2}\right] =4ki0ci2kii1++ik=ii1,,ik, are odd(ii1,,ik)\displaystyle=4^{k}\sum_{i\geq 0}\frac{c_{i}^{2}}{k^{i}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=i\\ i_{1},\dots,i_{k},\text{ are odd}\end{subarray}}\binom{i}{i_{1},\dots,i_{k}}
4kck2kki1++ik=ki1,,ik, are odd(ki1,,ik)\displaystyle\geq\frac{4^{k}c_{k}^{2}}{k^{k}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=k\\ i_{1},\dots,i_{k},\text{ are odd}\end{subarray}}\binom{k}{i_{1},\dots,i_{k}}
4kck2k!kk.\displaystyle\geq\frac{4^{k}c_{k}^{2}k!}{k^{k}}.

The lemma then follows from Stirling's approximation,

n!2πn(ne)n.\displaystyle n!\geq\sqrt{2\pi n}\left(\frac{n}{e}\right)^{n}.

and the bound on the Hermite coefficients,

ck2=Θ(k2.5).\displaystyle c_{k}^{2}=\Theta(k^{-2.5}).
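
Explicitly, combining these two facts (nothing beyond what was just cited is used): for even k,

𝔼[g(x)²] ≥ (4^k c_k² k!)/k^k ≥ 4^k · Θ(k^{−2.5}) · √(2πk) (k/e)^k / k^k = (4/e)^k · Θ(k^{−2}) = (4/e)^{(1+o(1))k},

and taking square roots gives ‖g‖ = Ω((4/e)^{(1/2+o(1))k}). ∎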

For the difference between g(x) and g^T(x), we have the following bound.

Lemma A.4.
ggT2keT24T2+1T2π\displaystyle\left\lVert g-g^{T}\right\rVert\leq 2^{k}\,e^{-\frac{T^{2}}{4}}\sqrt{T^{2}+1-\frac{T}{\sqrt{2\pi}}}
Proof.

Let ReLUw(x)\operatorname{ReLU}_{w}(x) be shorthand for ReLU(xwk)\operatorname{ReLU}(\frac{x~{}{\cdot}~{}w}{\sqrt{k}}), and similarly ReLUwT\operatorname{ReLU}_{w}^{T}. Observe that by the triangle inequality,

ggT\displaystyle\left\lVert g-g^{T}\right\rVert =w{±1}kχ(w)(ReLUwReLUwT)\displaystyle=\left\lVert\sum_{w\in\{\pm 1\}^{k}}\chi(w)\left(\operatorname{ReLU}_{w}-\operatorname{ReLU}_{w}^{T}\right)\right\rVert
w{±1}kReLUwReLUwT\displaystyle\leq\sum_{w\in\{\pm 1\}^{k}}\left\lVert\operatorname{ReLU}_{w}-\operatorname{ReLU}_{w}^{T}\right\rVert
=2kReLUReLUT𝒩(0,1),\displaystyle=2^{k}\left\lVert\operatorname{ReLU}-\operatorname{ReLU}^{T}\right\rVert_{\mathcal{N}(0,1)},

where the last equality holds because for any unit vector vv and x𝒩(0,I)x\sim\mathcal{N}(0,I), xvx~{}{\cdot}~{}v has the distribution 𝒩(0,1)\mathcal{N}(0,1). Now,

ReLUReLUT𝒩(0,1)2=T(xT)2p(x)𝑑x,\left\lVert\operatorname{ReLU}-\operatorname{ReLU}^{T}\right\rVert_{\mathcal{N}(0,1)}^{2}=\int_{T}^{\infty}(x-T)^{2}\,p(x)\,dx,

where p(x)p(x) is the probability density function of 𝒩(0,1)\mathcal{N}(0,1). Note that p(x)=xp(x)p^{\prime}(x)=-xp(x). We have

Tx2p(x)𝑑x=Txd(p(x))\displaystyle\int_{T}^{\infty}x^{2}p(x)dx=\int_{T}^{\infty}-x\,d(p(x))
=xp(x)|T+Tp(x)𝑑x\displaystyle\qquad=-x\,p(x)\bigg{|}_{T}^{\infty}+\int_{T}^{\infty}p(x)dx (integration by parts)
=Tp(T)+x𝒩(0,1)(x>T),\displaystyle\qquad=T\,p(T)+\mathbb{P}_{x\sim_{\mathcal{N}(0,1)}}(x>T),
Txp(x)𝑑x=p(x)|T=p(T),\displaystyle\int_{T}^{\infty}x\,p(x)dx=-p(x)\bigg{|}_{T}^{\infty}=p(T),
Tp(x)𝑑x=x𝒩(0,1)(x>T)eT22.\displaystyle\int_{T}^{\infty}p(x)dx=\mathbb{P}_{x\sim_{\mathcal{N}(0,1)}}(x>T)\leq e^{-\frac{T^{2}}{2}}.

Thus,

𝔼[(g(x)−g^T(x))²]
4k[(T2+1)x𝒩(0,1)(x>T)Tp(T)]\displaystyle\leq 4^{k}\,\left[(T^{2}+1)\mathbb{P}_{x\sim\mathcal{N}(0,1)}(x>T)-T\,p(T)\right]
4keT22(T2+1T2π).\displaystyle\leq 4^{k}\,e^{-\frac{T^{2}}{2}}\left(T^{2}+1-\frac{T}{\sqrt{2\pi}}\right).

Lemma A.5.
[g(x)gT(x)]2keT22.\displaystyle\mathbb{P}[g(x)\neq g^{T}(x)]\leq 2^{k}\,e^{-\frac{T^{2}}{2}}.
Proof.

For any w{±1}kw\in\{\pm 1\}^{k},

x𝒩(0,I)[ReLU(xwk)ReLUT(xwk)]\displaystyle\mathbb{P}_{x\sim\mathcal{N}(0,I)}\left[\operatorname{ReLU}(\frac{x~{}{\cdot}~{}w}{\sqrt{k}})\neq\operatorname{ReLU}^{T}(\frac{x~{}{\cdot}~{}w}{\sqrt{k}})\right]
=t𝒩(0,1)[t>T]\displaystyle=\mathbb{P}_{t\sim\mathcal{N}(0,1)}[t>T]
eT22.\displaystyle\leq e^{-\frac{T^{2}}{2}}.

The lemma follows by a union bound. ∎
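
For concreteness (a worked instantiation; the proofs below only use that T = Ω(k) with a sufficiently large constant): taking T = k, Lemma A.4 gives

‖g − g^T‖ ≤ 2^k e^{−k²/4} √(k² + 1 − k/√(2π)) = e^{−Ω(k²)},

which is negligible compared to the bound ‖g‖ = Ω((4/e)^{(1/2+o(1))k}) of Lemma A.3, and Lemma A.5 gives ℙ[g(x) ≠ g^T(x)] ≤ 2^k e^{−k²/2} = e^{−Ω(k²)}.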

Lemma A.6.
[|g(x)|1]=Ω(exp(Θ(k))).\displaystyle\mathbb{P}\left[\left|{g(x)}\right|\geq 1\right]=\Omega(\exp(-\Theta(k))).
Proof.

For large enough T = Ω(k), Lemmas A.3 and A.4 together with the triangle inequality ‖g^T‖ ≥ ‖g‖ − ‖g − g^T‖ imply that

gT=Ω((4e)(12+o(1))k).\displaystyle\left\lVert g^{T}\right\rVert=\Omega\left(\left(\frac{4}{e}\right)^{(\frac{1}{2}+o(1))k}\right).

Since |gT(x)|T 2k\left|{g^{T}(x)}\right|\leq T\,2^{k},

gT2=𝔼[gT(x)2]1[|gT(x)|1]+(T2k)2[|gT(x)|1],\left\lVert g^{T}\right\rVert^{2}=\mathbb{E}[g^{T}(x)^{2}]\leq 1\cdot\mathbb{P}[|g^{T}(x)|\leq 1]+(T2^{k})^{2}\cdot\mathbb{P}[|g^{T}(x)|\geq 1],

so that

[|gT(x)|1]=Ω((4e)(1+o(1))k)1(T 2k)2=Ω(exp(Θ(k)))\mathbb{P}\left[\left|{g^{T}(x)}\right|\geq 1\right]=\dfrac{\Omega\Big{(}\left(\frac{4}{e}\right)^{(1+o(1))k}\Big{)}-1}{(T\,2^{k})^{2}}=\Omega(\exp(-\Theta(k))) (2)

Using Eq. 2 with Lemma A.5,

[|g(x)|1]=Ω(exp(Θ(k)))\displaystyle\mathbb{P}\left[\left|{g(x)}\right|\geq 1\right]=\Omega(\exp(-\Theta(k)))

for large enough T=Ω(k)T=\Omega(k). ∎

The lower bound on f\left\lVert f\right\rVert now follows easily.

Corollary A.7.
f=Ω(exp(Θ(k))).\displaystyle\|f\|=\Omega(\exp(-\Theta(k))).
Proof.

Since f = ψ∘g and ψ is odd and increasing, we have, using Lemma A.6,

‖f‖ ≥ 𝔼[|f(x)|] ≥ |ψ(1)| ℙ[g(x) ≥ 1] + |ψ(−1)| ℙ[g(x) ≤ −1] = ψ(1) ℙ[|g(x)| ≥ 1] = Ω(exp(−Θ(k))),

where the first inequality holds because the L2 norm dominates the L1 norm under a probability measure. ∎

A.2 Sigmoid Activation

Here we consider g and f with ϕ(x) = σ(x) = 1/(1+e^{−x}). To bound the asymptotics of the Hermite coefficients of σ, we need the following theorem from [Boy84].

Theorem A.8.

For a function f(z)f(z) whose convergence is limited by simple poles at the roots of z2=γ2z^{2}=-\gamma^{2} with residue RR, the non-zero expansion coefficients {an}\{a_{n}\} of f(z)f(z) as a series of normalized Hermite functions have magnitudes asymptotically given by

|an|254π12Rn14eγ(2n+1)12,\left|{a_{n}}\right|\sim 2^{\frac{5}{4}}\,\pi^{\frac{1}{2}}\,R\,n^{-\frac{1}{4}}\,e^{-\gamma(2n+1)^{\frac{1}{2}}},

Here the normalized Hermite functions {ψ_n(x)}_{n∈ℕ} are defined by

ψn(z)=ez22π14H~n(2z).\psi_{n}(z)=e^{-\frac{z^{2}}{2}}\pi^{-\frac{1}{4}}\tilde{H}_{n}(\sqrt{2}z).

Applying this theorem to f(x) = e^{−x²/2} σ(√2 x), and translating the coefficients of the resulting Hermite-function series into coefficients of the Hermite-polynomial series, we obtain the following.

Lemma A.9.
σ(x)=i=0ciH~i(x),\sigma(x)=\sum_{i=0}^{\infty}c_{i}\tilde{H}_{i}(x),

where c_0 = 0.5, c_{2i} = 0 for i ≥ 1, and every non-zero odd coefficient satisfies

c2i1=eΘ(i).c_{2i-1}=e^{-\Theta(\sqrt{i})}.
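
As a numerical illustration of Lemma A.9 (a sketch only; the quadrature approach and parameter choices are ours, and nothing below is used in the proofs), the following computes the Hermite coefficients of the sigmoid and displays the behaviour asserted in the lemma.

import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

nodes, weights = hermegauss(200)              # integrates f(z) e^{-z^2/2} dz
weights = weights / math.sqrt(2.0 * math.pi)  # so that sum(weights * f(nodes)) ~ E_{N(0,1)}[f]
sig = 1.0 / (1.0 + np.exp(-nodes))            # sigmoid evaluated at the quadrature nodes

def c(j):
    # j-th normalized (probabilists') Hermite coefficient of the sigmoid
    coeffs = np.zeros(j + 1); coeffs[j] = 1.0
    return float(np.sum(weights * sig * hermeval(nodes, coeffs))) / math.sqrt(math.factorial(j))

print("c_0 =", c(0), "  c_2, c_4, c_6 =", c(2), c(4), c(6))   # 0.5 and numerical zeros
for i in range(1, 13):
    j = 2 * i - 1
    print(f"i = {i:2d}   sqrt(i) = {math.sqrt(i):.3f}   -ln|c_{j}| = {-math.log(abs(c(j))):.3f}")
# c_0 = 1/2 and the even coefficients beyond c_0 vanish (sigma(z) - 1/2 is odd); the odd
# coefficients decay subexponentially, consistent with c_{2i-1} = e^{-Theta(sqrt(i))}.
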
Corollary A.10.

There is an infinite increasing sequence {ki}i\{k_{i}\}_{i\in{\mathbb{N}}} such that kik_{i}’s are all odd and

cki=eΘ(ki).c_{k_{i}}=e^{-\Theta(\sqrt{k_{i}})}.
Proof.

This follows from Lemma A.9 together with the fact that σ is not a polynomial, so that infinitely many of the c_k must be non-zero. ∎

Remark A.11.

Experimental evidence strongly indicates that in fact all odd Hermite coefficients of sigmoid are nonzero and decay as above, but this is laborious to formally establish. So we state our norm lower bound only for k{ki}ik\in\{k_{i}\}_{i\in{\mathbb{N}}} (and the associated n{2ki}in\in\{2^{k_{i}}\}_{i\in{\mathbb{N}}}, since we end up taking k=lognk=\log n). Since this is nevertheless an infinite sequence, it still establishes that no better asymptotic bound holds.

Similarly to Lemma A.3, we can derive a lower bound on ‖g‖ for such k.

Lemma A.12.

For k{ki}ik\in\{k_{i}\}_{i\in{\mathbb{N}}},

g(x)=Ω((4e)(12+o(1))k).\left\lVert g(x)\right\rVert=\Omega\left(\left(\frac{4}{e}\right)^{(\frac{1}{2}+o(1))k}\right).
Proof.

Due to Lemma A.1,

𝔼[g(x)2]\displaystyle\mathbb{E}\left[g(x)^{2}\right] =4ki0ci2kii1++ik=ii1,,ik, are odd(ii1ik)\displaystyle=4^{k}\sum_{i\geq 0}\frac{c_{i}^{2}}{k^{i}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=i\\ i_{1},\cdots,i_{k},\text{ are odd}\end{subarray}}\binom{i}{i_{1}\cdots i_{k}}
4kck2kki1++ik=ki1,,ik, are odd(ki1ik)\displaystyle\geq\frac{4^{k}c_{k}^{2}}{k^{k}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=k\\ i_{1},\cdots,i_{k},\text{ are odd}\end{subarray}}\binom{k}{i_{1}\cdots i_{k}}
4kck2k!kk.\displaystyle\geq\frac{4^{k}c_{k}^{2}k!}{k^{k}}.

Using Stirling’s approximation,

n!2πn(ne)n,n!\geq\sqrt{2\pi n}\left(\frac{n}{e}\right)^{n},

and Corollary A.10,

ck=eΘ(k),c_{k}=e^{-\Theta(\sqrt{k})},

we obtain

𝔼[g(x)2]=Ω(4k2πkkk(ke)keΘ(k))\mathbb{E}\left[g(x)^{2}\right]=\Omega\left(\frac{4^{k}\sqrt{2\pi k}}{k^{k}}\left(\frac{k}{e}\right)^{k}e^{-\Theta(\sqrt{k})}\right)

and hence

𝔼[g(x)2]=Ω((4e)(1+o(1))k).\mathbb{E}\left[g(x)^{2}\right]=\Omega\left(\left(\frac{4}{e}\right)^{(1+o(1))k}\right).

Lemma A.13.

For k{ki}ik\in\{k_{i}\}_{i\in{\mathbb{N}}},

(|g(x)|1)=Ω(exp(Θ(k))).\mathbb{P}\left(\left|{g(x)}\right|\geq 1\right)=\Omega(\exp(-\Theta(k))).
Proof.

Since |g(x)|2k\left|{g(x)}\right|\leq 2^{k},

g2=𝔼[g(x)2]1[|g(x)|1]+(2k)2[|g(x)|1],\left\lVert g\right\rVert^{2}=\mathbb{E}[g(x)^{2}]\leq 1\cdot\mathbb{P}[|g(x)|\leq 1]+(2^{k})^{2}\cdot\mathbb{P}[|g(x)|\geq 1],

and so

(|g(x)|1)=Ω((4e)(1+o(1))k)1(2k)2.\mathbb{P}\left(\left|{g(x)}\right|\geq 1\right)=\dfrac{\Omega\Big{(}\big{(}\frac{4}{e}\big{)}^{(1+o(1))k}\Big{)}-1}{(2^{k})^{2}}.

The lemma then follows. ∎

Using the same argument as Corollary A.7, we have the following bound.

Corollary A.14.
f=Ω(exp(Θ(k))).\|f\|=\Omega(\exp(-\Theta(k))).

A.3 General activations

It is not hard to see that the norm analysis for ReLU and sigmoid extends to any activation function for which a suitable lower bound on the Hermite coefficients holds, and which is either bounded or grows at most polynomially, so that under the standard Gaussian it behaves essentially identically to its truncated form. In particular, a lower bound of α^{−j} on the square of the j-th Hermite coefficient, for any constant α < 4/e, suffices to give ‖g‖ ≥ exp(Θ(k)) by the same argument as in Lemmas A.3 and A.12 (spelled out below). This in turn gives ‖f‖ ≥ exp(−Θ(k)), as above.
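
In more detail: if the k-th Hermite coefficient ĉ_k of the activation (for k of the appropriate parity) satisfies ĉ_k² ≥ α^{−k}, then exactly as in the proof of Lemma A.3,

𝔼[g(x)²] ≥ 4^k ĉ_k² k!/k^k ≥ (4/(eα))^k √(2πk) = exp(Θ(k))

whenever α < 4/e, and the truncation and anticoncentration steps go through as before.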

In fact, even a very weak lower bound on f\|f\| yields some superpolynomial bound on learning. Suppose we only had f1/exp(exp(Θ(k)))\|f\|\geq 1/\exp(\exp(\Theta(k))), for instance. Then we can take k=loglognk=\log\log n and have f1/poly(n)\|f\|\geq 1/\operatorname{poly}(n) and still obtain a lower bound of nloglogn=nω(1)n^{\log\log n}=n^{\omega(1)} (see Theorem 3.9). Any lower bound on f\|f\| will be a function only of kk, so a similar argument applies.
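
To make this concrete (a worked instance, writing c > 0 for the constant implicit in the Θ(·) and using natural logarithms): if ‖f‖ ≥ exp(−e^{ck}), then choosing k = (1/c) ln ln n gives

‖f‖ ≥ exp(−e^{ck}) = exp(−ln n) = 1/n ≥ 1/poly(n),

while the query lower bound of Theorem 3.9 remains of order n^{Θ(ln ln n)} = n^{ω(1)}.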

Appendix B Proof of the SQ lower bound for real-valued functions

We give a self-contained variant of the elegant proof of [Szö09] for the reader’s convenience. For simplicity, we include the 0 function in our class 𝒞{\mathcal{C}} — this can only negligibly change the SDA, and it makes the core argument cleaner.

Theorem B.1.

Let 𝒞 be a real-valued concept class over a domain X and let D be a distribution on X, such that 0 ∈ 𝒞 and ‖c‖_D > ϵ for every nonzero c ∈ 𝒞. Consider any SQ learner that is allowed to make only inner-product queries to an SQ oracle for the labeled distribution D_c, for some unknown c ∈ 𝒞. Let d = SDA_D(𝒞, γ). Then any such learner needs at least d/2 queries of tolerance √γ to learn 𝒞 up to L2 error ϵ.

Proof.

Consider the adversarial strategy where we respond to every query h:Xh:X\to{\mathbb{R}} (hD1\|h\|_{D}\leq 1) with 0. This corresponds to the true expectation if the target were the 0 function. By the norm lower bound, outputting any other cc would then mean L2L_{2} error greater than ϵ\epsilon. Thus we must rule out all other c𝒞c\in{\mathcal{C}}.

Let τ=γ\tau=\sqrt{\gamma}. If hkh_{k} is the kthk^{\text{th}} query, let Sk={c𝒞c,hkD>τ}S_{k}=\{c\in{\mathcal{C}}\mid\langle c,h_{k}\rangle_{D}>\tau\} be the functions ruled out by our response of 0. (A similar argument will hold for Sk={c𝒞c,hkD<τ}S_{k}^{\prime}=\{c\in{\mathcal{C}}\mid\langle c,h_{k}\rangle_{D}<-\tau\}.) Let Φ=hk,cSkcD\Phi=\langle h_{k},\sum_{c\in S_{k}}c\rangle_{D}. We claim that |Sk||𝒞|/d\left|{S_{k}}\right|\leq\left|{{\mathcal{C}}}\right|/d. Suppose not. Then ρD(Sk)γ\rho_{D}(S_{k})\leq\gamma by Definition 2.4, and

Φ\displaystyle\Phi hkDcSkcD\displaystyle\leq\|h_{k}\|_{D}\left\lVert\sum_{c\in S_{k}}c\right\rVert_{D}
c,cSkc,cD\displaystyle\leq\sqrt{\sum_{c,c^{\prime}\in S_{k}}\langle c,c^{\prime}\rangle_{D}}
=|Sk|2ρD(Sk)\displaystyle=\sqrt{\left|{S_{k}}\right|^{2}\rho_{D}(S_{k})}
γ|Sk|,\displaystyle\leq\sqrt{\gamma}|S_{k}|,

contradicting the fact that Φ>|Sk|τ\Phi>|S_{k}|\tau by definition of SkS_{k}.

Similarly |Sk|=|{c𝒞c,hkD<τ}||𝒞|/d|S_{k}^{\prime}|=|\{c\in{\mathcal{C}}\mid\langle c,h_{k}\rangle_{D}<-\tau\}|\leq|{\mathcal{C}}|/d. Thus we rule out at most a 2/d2/d fraction of functions with each query, and hence need at least d/2d/2 queries to rule out all other possibilities. ∎
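
The counting at the heart of this proof can be seen in a toy numerical example (an illustration only, not the SDA machinery of Definition 2.4: we use the standard basis of ℝ^dim, an exactly orthogonal family, as a stand-in for a class with small pairwise correlation, and all names and parameters below are ours). Since ∑_c ⟨h, c⟩² ≤ ‖h‖² for an orthonormal family, a single unit-norm query can exceed a threshold τ on at most 1/τ² concepts.

import numpy as np

rng = np.random.default_rng(0)
dim, tau = 4_000, 0.1
# Toy "concept class": the standard basis vectors e_1, ..., e_dim of R^dim, an exactly
# orthogonal stand-in for a class with small pairwise correlation.

def hits(h):
    # number of concepts c = e_i with <h, c> > tau; here <h, e_i> is simply h[i]
    return int(np.sum(h > tau))

# Query 1: a random unit-norm query, which correlates with essentially no concept.
h1 = rng.standard_normal(dim); h1 /= np.linalg.norm(h1)
# Query 2: a unit-norm query deliberately spread over the first m concepts.
m = 80
h2 = np.zeros(dim); h2[:m] = 1.0 / np.sqrt(m)   # <h2, e_i> ~ 0.112 > tau for i < m

for name, h in [("random query", h1), ("spread query", h2)]:
    print(f"{name}: correlates above tau with {hits(h)} of {dim} concepts "
          f"(Bessel bound 1/tau^2 = {round(1 / tau ** 2)})")
# Since sum_i <h, e_i>^2 = ||h||^2 = 1, no unit-norm query can exceed tau on more than
# 1/tau^2 = 100 concepts; each query therefore rules out only a vanishing fraction of a
# large class, which is the engine of the d/2 lower bound above.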