
Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent

Surbhi Goel Department of Computer Science, University of Texas at Austin Aravind Gollakota Department of Computer Science, University of Texas at Austin Zhihan Jin Department of Computer Science, Shanghai Jiao Tong University Sushrut Karmalkar Department of Computer Science, University of Texas at Austin Adam Klivans Department of Computer Science, University of Texas at Austin
(June 22, 2020)
Abstract

We prove the first superpolynomial lower bounds for learning one-layer neural networks with respect to the Gaussian distribution using gradient descent. We show that any classifier trained using gradient descent with respect to square-loss will fail to achieve small test error in polynomial time given access to samples labeled by a one-layer neural network. For classification, we give a stronger result, namely that any statistical query (SQ) algorithm (including gradient descent) will fail to achieve small test error in polynomial time. Prior work held only for gradient descent run with small batch sizes, required sharp activations, and applied to specific classes of queries. Our lower bounds hold for broad classes of activations including ReLU and sigmoid. The core of our result relies on a novel construction of a simple family of neural networks that are exactly orthogonal with respect to all spherically symmetric distributions.

1 Introduction

A major challenge in the theory of deep learning is to understand when gradient descent can efficiently learn simple families of neural networks. The associated optimization problem is nonconvex and well known to be computationally intractable in the worst case. For example, ciphertexts from public-key cryptosystems can be encoded into a training set labeled by simple neural networks [KS09], implying that the corresponding learning problem is as hard as breaking cryptographic primitives. These hardness results, however, rely on discrete representations and produce relatively unrealistic joint distributions.

Our Results.

In this paper we give the first superpolynomial lower bounds for learning neural networks using gradient descent in arguably the simplest possible setting: we assume the marginal distribution is a spherical Gaussian, the labels are noiseless and are exactly equal to the output of a one-layer neural network (a linear combination of say ReLU or sigmoid activations), and the goal is to output a classifier whose test error (measured by square-loss) is small. We prove—unconditionally—that gradient descent fails to produce a classifier with small square-loss if it is required to run in polynomial time in the dimension. Our lower bound depends only on the algorithm used (gradient descent) and not on the architecture of the underlying classifier. That is, our results imply that current popular heuristics such as running gradient descent on an overparameterized network (for example, working in the NTK regime [JHG18]) will require superpolynomial time to achieve small test error.

Statistical Queries.

We prove our lower bounds in the now well-studied statistical query (SQ) model of [Kea98] that captures most learning algorithms used in practice. For a loss function \ell and a hypothesis hθh_{\theta} parameterized by θ\theta, the true population loss with respect to joint distribution DD on X×YX\times Y is given by 𝔼(x,y)D[(hθ(x),y)]\mathbb{E}_{(x,y)\sim D}[\ell(h_{\theta}(x),y)], and the gradient with respect to θ\theta is given by 𝔼(x,y)D[(hθ(x),y)θhθ(x)]\mathbb{E}_{(x,y)\sim D}[\ell^{\prime}(h_{\theta}(x),y)\nabla_{\theta}h_{\theta}(x)]. In the SQ model, we specify a query function ϕ(x,y)\phi(x,y) and receive an estimate of |𝔼(x,y)D[ϕ(x,y)]||\mathbb{E}_{(x,y)\sim D}[\phi(x,y)]| to within some tolerance parameter τ\tau. An important special class of queries are correlational or inner-product queries, where the query function ϕ\phi is defined only on XX and we receive an estimate of |𝔼(x,y)D[ϕ(x)y]||\mathbb{E}_{(x,y)\sim D}[\phi(x)\cdot y]| within some tolerance τ\tau. It is not difficult to see that (1) the gradient of a population loss can be approximated to within τ\tau using statistical queries of tolerance τ\tau and (2) for square-loss only inner-product queries are required.

Since the convergence analysis of gradient descent holds given sufficiently strong approximations of the gradient, lower bounds for learning in the SQ model [Kea98, BFJ+94, Szö09, Fel12, Fel17] directly imply unconditional lower bounds on the running time for gradient descent to achieve small error. We give the first superpolynomial lower bounds for learning one-layer networks with respect to any Gaussian distribution for any SQ algorithm that uses inner product queries:

Theorem 1.1 (informal).

Let 𝒞{\mathcal{C}} be a class of real-valued concepts defined by one-layer single-output neural networks with input dimension nn and mm hidden units (ReLU or sigmoid); i.e., functions of the form f(x)=i=1maiσ(wix)f(x)=\sum_{i=1}^{m}a_{i}\sigma(w_{i}\cdot x). Then learning 𝒞{\mathcal{C}} under the standard Gaussian 𝒩(0,In)\mathcal{N}(0,I_{n}) in the SQ model with inner-product queries requires nΩ(logm)n^{\Omega(\log m)} queries for any tolerance τ=nΩ(logm)\tau=n^{-\Omega(\log m)}.

In particular, this rules out any approach for learning one-layer neural networks in polynomial time that performs gradient descent on any polynomial-size classifier with respect to square-loss or logistic loss. For classification, we obtain significantly stronger results and rule out general SQ algorithms that run in polynomial time (e.g., gradient descent with respect to any polynomial-size classifier and any polynomial-time computable loss). In this setting, our labels are {±1}\{\pm 1\} and correspond to the softmax of an unknown one-layer neural network. We prove the following:

Theorem 1.2 (informal).

Let 𝒞{\mathcal{C}} be a class of real-valued concepts defined by a one-layer neural network in nn dimensions with mm hidden units (ReLU or sigmoid) feeding into any odd, real-valued output node with range [1,1][-1,1]. Let DD^{\prime} be a distribution on n×{±1}{\mathbb{R}}^{n}\times\{\pm 1\} such that the marginal on n{\mathbb{R}}^{n} is the standard Gaussian 𝒩(0,In)\mathcal{N}(0,I_{n}), and 𝔼[Y|X]=c(X)\mathbb{E}[Y|X]=c(X) for some c𝒞c\in{\mathcal{C}}. For some b,C>0b,C>0 and ϵ=Cmb\epsilon=Cm^{-b}, outputting a classifier f:n{±1}f:{\mathbb{R}}^{n}\to\{\pm 1\} with (X,Y)D[f(X)Y]1/2ϵ\mathbb{P}_{(X,Y)\sim D^{\prime}}[f(X)\neq Y]\leq 1/2-\epsilon requires nΩ(logm)n^{\Omega(\log m)} statistical queries of tolerance nΩ(logm).n^{-\Omega(\log m)}.

The above lower bound for classification rules out the commonly used approach of training a polynomial-size, real-valued neural network using gradient descent (with respect to any polynomial-time computable loss) and then taking the sign of the output of the resulting network.

Our techniques.

At the core of all SQ lower bounds is the construction of a family of functions that are pairwise approximately orthogonal with respect to the underlying marginal distribution. Typically, these constructions embed 2n2^{n} parity functions over the discrete hypercube {1,1}n\{-1,1\}^{n}. Since parity functions are perfectly orthogonal, the resulting lower bound can be quite strong. Here we wish to give lower bounds for more natural families of distributions, namely Gaussians, and it is unclear how to embed parity.

Instead, we use an alternate construction. For activation functions ϕ,ψ:\phi,\psi:{\mathbb{R}}\rightarrow{\mathbb{R}}, define

fS(x)=ψ(w{1,1}kχ(w)ϕ(wxSk)).\displaystyle f_{S}(x)=\psi\left(\sum_{w\in\{-1,1\}^{k}}\chi(w)\phi\left(\frac{w~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\right).

Enumerating over every S[n]S\subseteq[n] of size kk gives a family of functions of size nO(k)n^{O(k)}. Here xSx_{S} denotes the vector of xix_{i} for iSi\in S (typically we choose k=logmk=\log m to produce a family of one-layer neural networks with mm hidden units). Each of the 2k=m2^{k}=m inner weight vectors has unit norm, and each of the mm outer weights has absolute value one. Note also that our construction uses activations with zero bias term.

We give a complete characterization of the class of nonlinear activations for which these functions are orthogonal. In particular, the family is orthogonal for any activation with a nonzero Hermite coefficient of degree kk or higher.

Apart from showing orthogonality, we must also prove that functions in these classes are nontrivial (i.e., are not exponentially close to the constant zero function). This reduces to proving certain lower bounds on the norms of one-layer neural networks. The analysis requires tools from Hermite and complex analysis.

SQ Lower Bounds for Real-Valued Functions.

Another major challenge is that our function family is real-valued as opposed to boolean. Given an orthogonal family of (deterministic) boolean functions, it is straightforward to apply known results and obtain general SQ lower bounds for learning with respect to 0/10/1 loss. For the case of real-valued functions, the situation is considerably more complicated. For example, the class of orthogonal Hermite polynomials on nn variables of degree dd has size nO(d)n^{O(d)}, yet there is an SQ algorithm due to [APVZ14] that learns this class with respect to the Gaussian distribution in time 2O(d)2^{O(d)}. More recent work due to [ADHV19] shows that Hermite polynomials can be learned by an SQ algorithm in time polynomial in nn and logd\log d.

As such, it is impossible to rule out general polynomial-time SQ algorithms for learning real-valued functions based solely on orthogonal function families. Fortunately, it is not difficult to see that the SQ reductions due to [Szö09] hold in the real-valued setting as long as the learning algorithm uses only inner-product queries (and the norms of the functions are sufficiently large). Since performing gradient descent with respect to square-loss or logistic loss can be implemented using inner-product queries, we obtain our first set of desired results. (The algorithms of [APVZ14] and [ADHV19] do not use inner-product queries.)

Still, we would like to rule out general SQ algorithms for learning simple classes of neural networks. To that end, we consider the classification problem for one-layer neural networks and output labels after performing a softmax on a one-layer network. Concretely, consider a distribution on n×{1,1}{\mathbb{R}}^{n}\times\{-1,1\} where 𝔼[Y|X]=σ(c(X))\mathbb{E}[Y|X]=\sigma(c(X)) for some c𝒞c\in{\mathcal{C}} and σ:[1,1]\sigma:{\mathbb{R}}\to[-1,1] (for example, σ\sigma could be tanh). We describe two goals. The first is to estimate the conditional mean function, i.e., output a classifier hh such that 𝔼[(h(x)c(x))2]ϵ\mathbb{E}[(h(x)-c(x))^{2}]\leq\epsilon. The second is to directly minimize classification loss, i.e., output a boolean classifier hh such that X,YD[h(X)Y]1/2ϵ.\mathbb{P}_{X,Y\sim D}[h(X)\neq Y]\leq 1/2-\epsilon.

We give superpolynomial lower bounds for both of these problems in the general SQ model by making a new connection to probabilistic concepts, a learning model due to [KS94]. Our key theorem gives a superpolynomial SQ lower bound for the problem of distinguishing probabilistic concepts induced by our one-layer neural networks from truly random labels. A final complication we overcome is that we must prove orthogonality and norm bounds on one-layer neural networks that have been composed with a nonlinear activation (e.g., tanh).

SGD and Gradient Descent Plus Noise.

It is easy to see that our results also imply lower bounds for algorithms where the learner adds noise to the estimate of the gradient (e.g., Langevin dynamics). On the other hand, for technical reasons, it is known that SGD is not a statistical query algorithm (because it examines training points individually) and does not fall into our framework. That said, recent work by [AS20] shows that SGD is universal in the sense that it can encode all polynomial-time learners. This implies that proving unconditional lower bounds for SGD would give a proof that P ≠ NP. Thus, we cannot hope to prove unconditional lower bounds on SGD (unless we can prove P ≠ NP).

Independent Work.

Independently, Diakonikolas et al. [DKKZ20] have given stronger correlational SQ lower bounds for the same class of functions with respect to the Gaussian distribution. Their bounds are exponential in the number of hidden units while ours are quasipolynomial. We can plug in their result and obtain exponential general SQ lower bounds for the associated probabilistic concept using our framework.

Related Work.

There is a large literature of results proving hardness results (or unconditional lower bounds in some cases) for learning various classes of neural networks [BR89, Vu98, KS09, LSSS14, GKKT17].

The most relevant prior work is due to [SVWX17], who addressed learning one-layer neural networks under logconcave distributions using Lipschitz queries. Specifically, let nn be the input dimension, and let mm be the number of hidden ss-Lipschitz sigmoid units. For m=O~(sn)m=\tilde{O}(s\sqrt{n}), they construct a family of neural networks such that any learner using λ\lambda-Lipschitz queries with tolerance greater than Ω(1/(s2n))\Omega(1/(s^{2}n)) needs at least 2Ω(n)/(λs2)2^{\Omega(n)}/(\lambda s^{2}) queries.

Roughly speaking, their lower bounds hold for λ\lambda-Lipschitz queries due to the composition of their one-layer neural networks with a δ\delta-function in order to make the family more “boolean.” Because of their restriction on the tolerance parameter, they cannot rule out gradient descent with large batch sizes. Further, the slope of the activations they require in their constructions scales inversely with the Lipschitz and tolerance parameters.

To contrast with [SVWX17], note that our lower bounds hold for any inverse-polynomial tolerance parameter (i.e., will hold for polynomially-large batch sizes), do not require a Lipschitz constraint on the queries, and use only standard 11-Lipschitz ReLU and/or sigmoid activations (with zero bias) for the construction of the hard family. Our lower bounds are typically quasipolynomial in the number of hidden units; improving this to an exponential lower bound is an interesting open question. Both of our models capture square-loss and logistic loss.

In terms of techniques, [SVWX17] build an orthogonal function family using univariate, periodic “wave” functions. Our construction takes a different approach, adding and subtracting activation functions with respect to overlapping “masks.” Finally, aside from the (black-box) use of a theorem from complex analysis, our construction and analysis are considerably simpler than the proof in [SVWX17].

A follow-up work [VW19] gave SQ lower bounds for learning classes of degree dd orthogonal polynomials in nn variables with respect to the uniform distribution on the unit sphere (as opposed to Gaussians) using inner product queries of bounded tolerance (roughly 1/nd1/n^{d}). To obtain superpolynomial lower bounds, each function in the family requires superpolynomial description length (their polynomials also take on very small values, 1/nd1/n^{d}, with high probability).

Shamir [Sha18] (see also the related work of [SSSS17]) proves hardness results (and lower bounds) for learning neural networks using gradient descent with respect to square-loss. His results are separated into two categories: (1) hardness for learning “natural” target families (one-layer ReLU networks) and (2) lower bounds for “natural” input distributions (Gaussians). We achieve lower bounds for learning problems with both natural target families and natural input distributions. Additionally, our lower bounds hold for any nonlinear activations (as opposed to just ReLUs) and for broader classes of algorithms (SQ).

Recent work due to [GKK19] gives hardness results for learning a ReLU with respect to Gaussian distributions. Their results require the learner to output a single ReLU as its output hypothesis and require the learner to succeed in the agnostic model of learning. [KK14] prove hardness results for learning a threshold function with respect to Gaussian distributions, but they also require the learner to succeed in the agnostic model. Very recent work due to Daniely and Vardi [DV20] gives hardness results for learning randomly chosen two-layer networks. The hard distributions in their case are not Gaussians, and they require a nonlinear clipping output activation.

Positive Results. Many recent works give algorithms for learning one-layer ReLU networks using gradient descent with respect to Gaussians under various assumptions [ZSJ+17, ZPS17, BG17, ZYWG19] or use tensor methods [JSA15, GLM18]. These results depend on the hidden weight vectors being sufficiently orthogonal, or the coefficients in the second layer being positive, or both. Our lower bounds explain why these types of assumptions are necessary.

2 Preliminaries

We use [n][n] to denote the set {1,,n}\{1,\dots,n\}, and SkTS\subseteq_{k}T to indicate that SS is a kk-element subset of TT. We denote Euclidean inner products between vectors uu and vv by uvu~{}{\cdot}~{}v. We denote the element-wise product of vectors uu and vv by uvu\circ v, that is, uvu\circ v is the vector (u1v1,,unvn)(u_{1}v_{1},\dots,u_{n}v_{n}).

Let XX be an arbitrary domain, and let DD be a distribution on XX. Given two functions f,g:Xf,g:X\to{\mathbb{R}}, we define their L2L_{2} inner product with respect to DD to be f,gD=𝔼D[fg]\langle f,g\rangle_{D}=\mathbb{E}_{D}[fg]. The corresponding L2L_{2} norm is given by fD=f,fD=𝔼D[f2]\|f\|_{D}=\sqrt{\langle f,f\rangle_{D}}=\sqrt{\mathbb{E}_{D}[f^{2}]}.

A real-valued concept on n{\mathbb{R}}^{n} is a function c:nc:{\mathbb{R}}^{n}\to{\mathbb{R}}. We denote the induced labeled distribution on n×{\mathbb{R}}^{n}\times{\mathbb{R}}, i.e. the distribution of (x,c(x))(x,c(x)) for xDx\sim D, by DcD_{c}. A probabilistic concept, or pp-concept, on XX is a concept that maps each point xx to a random {±1}\{\pm 1\}-valued label in such a way that 𝔼[Y|X]=c(X)\mathbb{E}[Y|X]=c(X) for a fixed function c:n[1,1]c:{\mathbb{R}}^{n}\to[-1,1], known as the conditional mean function. Given a distribution DD on the domain, we abuse DcD_{c} to denote the induced labeled distribution on X×{±1}X\times\{\pm 1\} such that the marginal distribution on n{\mathbb{R}}^{n} is DD and 𝔼[Y|X]=c(X)\mathbb{E}[Y|X]=c(X) (equivalently the label is +1+1 with probability 1+c(x)2\frac{1+c(x)}{2} and 1-1 otherwise).
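For concreteness, the following is a small illustrative sketch (not from the paper) of how labeled examples from D_c can be generated when c is a p-concept; the example target c(x) = tanh(x_1) and the helper names are assumptions made purely for illustration.

```python
import numpy as np

def sample_Dc(c, sample_D, num_samples, rng=np.random.default_rng(0)):
    """Draw (x, y) pairs from D_c for a p-concept with conditional mean c:
    x ~ D, then y = +1 with probability (1 + c(x)) / 2 and y = -1 otherwise,
    so that E[y | x] = c(x)."""
    X = sample_D(num_samples)          # assumed to return an array of shape (num_samples, n)
    y = np.where(rng.random(num_samples) < (1.0 + c(X)) / 2.0, 1.0, -1.0)
    return X, y

# Illustrative example: D = N(0, I_3) and c(x) = tanh(x_1).
X, y = sample_Dc(lambda X: np.tanh(X[:, 0]),
                 lambda m: np.random.default_rng(1).standard_normal((m, 3)),
                 num_samples=5)
```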

The SQ model

A statistical query is specified by a query function h:d×Yh:{\mathbb{R}}^{d}\times Y\to{\mathbb{R}}. The SQ model allows access to an SQ oracle that accepts a query hh of specified tolerance τ\tau, and responds with a value in [𝔼x,y[h(x,y)]τ,𝔼x,y[h(x,y)]+τ][\mathbb{E}_{x,y}[h(x,y)]-\tau,\mathbb{E}_{x,y}[h(x,y)]+\tau]. (In the SQ literature, this is referred to as the STAT oracle. A variant called VSTAT is also sometimes used and is known to be equivalent up to small polynomial factors [Fel17]; while it makes no substantive difference to our superpolynomial lower bounds, our arguments can be extended to VSTAT as well.) To disallow arbitrary scaling, we will require that for each yy, the function xh(x,y)x\mapsto h(x,y) has norm at most 1. In the real-valued setting, a query hh is called a correlational or inner product query if it is of the form h(x,y)=g(x)yh(x,y)=g(x)\cdot y for some function gg, so that 𝔼Dc[h]=𝔼D[gc]=g,cD\mathbb{E}_{D_{c}}[h]=\mathbb{E}_{D}[gc]=\langle g,c\rangle_{D}. Here we assume g1\|g\|\leq 1 when stating lower bounds, again to disallow arbitrary scaling.

Gradient descent with respect to squared loss is captured by inner product queries, since the gradient is given by

𝔼x,y[θ(hθ(x)y)2]\displaystyle\mathbb{E}_{x,y}[\nabla_{\theta}(h_{\theta}(x)-y)^{2}] =𝔼x,y[2(hθ(x)y)θhθ(x)]\displaystyle=\mathbb{E}_{x,y}[2(h_{\theta}(x)-y)\nabla_{\theta}h_{\theta}(x)]
=2𝔼x[hθ(x)θhθ(x)]\displaystyle=2\mathbb{E}_{x}[h_{\theta}(x)\nabla_{\theta}h_{\theta}(x)]
2𝔼x,y[yθhθ(x)].\displaystyle\quad-2\mathbb{E}_{x,y}[y\nabla_{\theta}h_{\theta}(x)].

Here the first term can be estimated directly using knowledge of the distribution, while the latter is a vector each of whose elements is an inner product query.
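As a concrete illustration of this decomposition, here is a minimal, self-contained sketch (not the paper's code): it simulates an inner-product SQ oracle of tolerance τ by Monte Carlo plus adversarial noise, and assembles the population-loss gradient for a linear hypothesis h_θ(x) = θ·x trained against a single-ReLU target; the target, the hypothesis class, and all parameter choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 10, 1e-3

# Illustrative target concept c and marginal D = N(0, I_n); labels are noiseless, y = c(x).
w_star = rng.standard_normal(n) / np.sqrt(n)
def c(X):
    return np.maximum(X @ w_star, 0.0)              # a single ReLU, for illustration only

def inner_product_query(g, num_mc=50_000):
    """Simulated SQ oracle for inner-product queries: any value within +/- tau of
    E_{x~D}[g(x) y] is a legal response; here the expectation is estimated by Monte Carlo."""
    X = rng.standard_normal((num_mc, n))
    return np.mean(g(X) * c(X)) + rng.uniform(-tau, tau)

def population_grad(theta, num_mc=50_000):
    """Gradient of E[(h_theta(x) - y)^2] for h_theta(x) = theta . x, i.e.
    2 E[h_theta(x) x] - 2 E[y x]: the first term uses only knowledge of D, and the
    second is one inner-product query per coordinate (each query x -> x_j has norm 1)."""
    X = rng.standard_normal((num_mc, n))
    first = 2.0 * np.mean((X @ theta)[:, None] * X, axis=0)
    second = np.array([inner_product_query(lambda Z, j=j: Z[:, j]) for j in range(n)])
    return first - 2.0 * second

theta = np.zeros(n)
for _ in range(20):                                 # plain gradient descent driven by SQ answers
    theta -= 0.1 * population_grad(theta)
```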

We now formally define the learning problems we consider.

Definition 2.1 (SQ learning of real-valued concepts using inner product queries).

Let 𝒞{\mathcal{C}} be a class of real-valued concepts over a domain XX, and let DD be a distribution on XX. We say that a learner learns 𝒞{\mathcal{C}} with respect to DD up to L2L_{2} error ϵ\epsilon (equivalently, squared loss ϵ2\epsilon^{2}) using inner product queries if, given only SQ oracle access to DcD_{c} for some unknown c𝒞c\in{\mathcal{C}}, and using only inner product queries, it is able to output c~:X[1,1]\tilde{c}:X\to[-1,1] such that cc~Dϵ\|c-\tilde{c}\|_{D}\leq\epsilon.

For the classification setting, we consider two different notions of learning pp-concepts. One is learning the target up to small L2L_{2} error, to be thought of as a strong form of learning. The other, weaker form, is achieving a nontrivial inner product (i.e. unnormalized correlation) with the target. We prove lower bounds on both in order to capture different learning goals.

Definition 2.2 (SQ learning of pp-concepts).

Let 𝒞{\mathcal{C}} be a class of pp-concepts over a domain XX, and let DD be a distribution on XX. We say that a learner learns 𝒞{\mathcal{C}} with respect to DD up to L2L_{2} error ϵ\epsilon if, given only SQ oracle access to DcD_{c} for some unknown c𝒞c\in{\mathcal{C}}, and using arbitrary queries, it is able to output c~:X[1,1]\tilde{c}:X\to[-1,1] such that cc~Dϵ\|c-\tilde{c}\|_{D}\leq\epsilon. We say that a learner weakly learns 𝒞{\mathcal{C}} with respect to DD with advantage ϵ\epsilon if it is able to output c~:X[1,1]\tilde{c}:X\to[-1,1] such that c~,cDϵ\langle\tilde{c},c\rangle_{D}\geq\epsilon.

Note that the best achievable advantage is 𝔼xD[|c(x)|]\mathbb{E}_{x\sim D}[|c(x)|], achieved by c~(x)=sign(c(x))\tilde{c}(x)=\operatorname{sign}(c(x)). Note also that cD2𝔼D[|c|]cD\|c\|_{D}^{2}\leq\mathbb{E}_{D}[|c|]\leq\|c\|_{D} (the first inequality uses |c(x)|1|c(x)|\leq 1, the second is Cauchy–Schwarz), and therefore a lower bound on the norms of functions in 𝒞{\mathcal{C}} yields a lower bound on the best achievable advantage.

Remark 2.3 (Learning with L2L_{2} error implies weak learning).

If the functions in our class satisfy a norm lower bound, say cD2(1+α)ϵ2\|c\|_{D}^{2}\geq(1+\alpha)\epsilon^{2}, then a simple calculation shows that learning with L2L_{2} error ϵ\epsilon implies weak learning with advantage αϵ2/2\alpha\epsilon^{2}/2.
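To spell out the calculation behind this remark: writing 2\langle\tilde{c},c\rangle_{D}=\|c\|_{D}^{2}+\|\tilde{c}\|_{D}^{2}-\|c-\tilde{c}\|_{D}^{2} and using \|\tilde{c}\|_{D}^{2}\geq 0, any \tilde{c} with \|c-\tilde{c}\|_{D}\leq\epsilon satisfies

\langle\tilde{c},c\rangle_{D}\geq\frac{\|c\|_{D}^{2}-\|c-\tilde{c}\|_{D}^{2}}{2}\geq\frac{(1+\alpha)\epsilon^{2}-\epsilon^{2}}{2}=\frac{\alpha\epsilon^{2}}{2}.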

Our definition of weak learning also captures the standard boolean sense of weak learning, in which the learner is required to output a boolean hypothesis with 0/1 loss bounded away from 1/21/2. Indeed, by an easy calculation, the 0/1 loss of a function f:X{±1}f:X\to\{\pm 1\} satisfies

(x,y)Dc[f(x)y]=12c,fD2.\displaystyle\mathbb{P}_{(x,y)\sim D_{c}}[f(x)\neq y]=\frac{1}{2}-\frac{\langle c,f\rangle_{D}}{2}.
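To verify this identity, condition on x: the label is +1 with probability (1+c(x))/2, so for f(x)\in\{\pm 1\} we have \mathbb{P}[f(x)\neq y\mid x]=\frac{1-c(x)f(x)}{2}; taking the expectation over x\sim D gives \frac{1}{2}-\frac{\langle c,f\rangle_{D}}{2}.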

The difficulty of learning a concept class in the SQ model is captured by a parameter known as the statistical dimension of the class.

Definition 2.4 (Statistical dimension).

Let 𝒞{\mathcal{C}} be a concept class of either real-valued concepts or pp-concepts (i.e. their corresponding conditional mean functions) on a domain XX, and let DD be a distribution on XX. The (un-normalized) correlation of two concepts c,c𝒞c,c^{\prime}\in{\mathcal{C}} under DD is |c,cD||\langle c,c^{\prime}\rangle_{D}|. (In the pp-concept setting, it is instructive to note that in the notation of [FGR+17], this correlation is precisely the distributional correlation χD0(Dc,Dc)\chi_{D_{0}}(D_{c},D_{c^{\prime}}) of the induced labeled distributions DcD_{c} and DcD_{c^{\prime}} under the reference distribution D0=D×Unif{±1}D_{0}=D\times\operatorname{Unif}\{\pm 1\}.) The average correlation of 𝒞{\mathcal{C}} is defined to be

ρD(𝒞)=1|𝒞|2c,c𝒞|c,cD|.\displaystyle\rho_{D}({\mathcal{C}})=\frac{1}{|{\mathcal{C}}|^{2}}\sum_{c,c^{\prime}\in{\mathcal{C}}}|\langle c,c^{\prime}\rangle_{D}|.

The statistical dimension on average at threshold γ\gamma, SDAD(𝒞,γ)\operatorname{SDA}_{D}({\mathcal{C}},\gamma), is the largest dd such that for all 𝒞𝒞{\mathcal{C}}^{\prime}\subseteq{\mathcal{C}} with |𝒞||𝒞|/d|{\mathcal{C}}^{\prime}|\geq|{\mathcal{C}}|/d, ρD(𝒞)γ\rho_{D}({\mathcal{C}}^{\prime})\leq\gamma.

Remark 2.5.

For any general and large concept class 𝒞{\mathcal{C}}^{*} (such as all one-layer neural nets), we may consider a specific subclass 𝒞𝒞{\mathcal{C}}\subseteq{\mathcal{C}}^{*} and prove lower bounds on learning 𝒞{\mathcal{C}} in terms of the SDA of 𝒞{\mathcal{C}}. These lower bounds extend to 𝒞{\mathcal{C}}^{*} because if it is hard to learn a subset, then it is hard to learn the whole class.

We will mainly be interested in the statistical dimension in a setting where bounds on pairwise correlations are known. In that case the following lemma holds.

Lemma 2.6 (adapted from [FGR+17], Lemma 3.10).

Suppose a concept class 𝒞{\mathcal{C}} has pairwise correlation γ\gamma, i.e. |c,cD|γ|\langle c,c^{\prime}\rangle_{D}|\leq\gamma for cc𝒞c\neq c^{\prime}\in{\mathcal{C}}, and squared norm at most β\beta, i.e. cD2β\|c\|_{D}^{2}\leq\beta for all c𝒞c\in{\mathcal{C}}. Then for any γ>0\gamma^{\prime}>0, SDAD(𝒞,γ+γ)|𝒞|γβγ\operatorname{SDA}_{D}({\mathcal{C}},\gamma+\gamma^{\prime})\geq|{\mathcal{C}}|\frac{\gamma^{\prime}}{\beta-\gamma}. In particular, if 𝒞{\mathcal{C}} is a class of orthogonal concepts (i.e. γ=0\gamma=0) with squared norm bounded by β\beta, then SDA(𝒞,γ)|𝒞|γβ\operatorname{SDA}({\mathcal{C}},\gamma^{\prime})\geq|{\mathcal{C}}|\frac{\gamma^{\prime}}{\beta}.

Proof.

Let d=|𝒞|γβγd=|{\mathcal{C}}|\frac{\gamma^{\prime}}{\beta-\gamma}, and observe that for any subset 𝒞𝒞{\mathcal{C}}^{\prime}\subseteq{\mathcal{C}} satisfying |𝒞||𝒞|/d=βγγ|{\mathcal{C}}^{\prime}|\geq|{\mathcal{C}}|/d=\frac{\beta-\gamma}{\gamma^{\prime}},

ρD(𝒞)\displaystyle\rho_{D}({\mathcal{C}}^{\prime}) =1|𝒞|2c,c𝒞|c,cD|\displaystyle=\frac{1}{|{\mathcal{C}}^{\prime}|^{2}}\sum_{c,c^{\prime}\in{\mathcal{C}}^{\prime}}|\langle c,c^{\prime}\rangle_{D}|
1|𝒞|2(|𝒞|β+(|𝒞|2|𝒞|)γ)\displaystyle\leq\frac{1}{|{\mathcal{C}}^{\prime}|^{2}}(|{\mathcal{C}}^{\prime}|\beta+(|{\mathcal{C}}^{\prime}|^{2}-|{\mathcal{C}}^{\prime}|)\gamma)
=γ+βγ|𝒞|\displaystyle=\gamma+\frac{\beta-\gamma}{|{\mathcal{C}}^{\prime}|}
≤γ+γ′.\displaystyle\leq\gamma+\gamma^{\prime}. ∎

3 Orthogonal Family of Neural Networks

We consider neural networks with one hidden layer with activation function ϕ:\phi:{\mathbb{R}}\to{\mathbb{R}}, and with one output node that has some activation function ψ:\psi:{\mathbb{R}}\to{\mathbb{R}}. If we take the input dimension to be nn and the number of hidden nodes to be mm, then such a neural network is a function f:nf:{\mathbb{R}}^{n}\to{\mathbb{R}} given by

f(x)=ψ(i=1maiϕ(wix)),\displaystyle f(x)=\psi\left(\sum_{i=1}^{m}a_{i}\phi(w_{i}~{}{\cdot}~{}x)\right),

where winw_{i}\in{\mathbb{R}}^{n} are the weights feeding into the ithi^{\text{th}} hidden node, and aia_{i}\in{\mathbb{R}} are the weights feeding into the output node. If ψ\psi takes values in [1,1][-1,1], we may also view ff as defining a pp-concept in terms of its conditional mean function.

For our construction, we need our functions to be orthogonal, and we need a lower bound on their norms. For the first property we only need the distribution on the domain to satisfy a relaxed kind of spherical symmetry that we term sign-symmetry, which says that the distribution must look identical on all orthants. To lower bound the norms, we need to assume that the distribution is Gaussian 𝒩(0,I){\cal N}(0,I).

Assumption 3.1 (Sign-symmetry).

For any z{±1}nz\in\{\pm 1\}^{n} and xnx\in{\mathbb{R}}^{n}, let xzx\circ z denote (x1z1,,xnzn)(x_{1}z_{1},\dots,x_{n}z_{n}). A distribution DD on n{\mathbb{R}}^{n} is sign-symmetric if for any z{±1}nz\in\{\pm 1\}^{n} and xx drawn from DD, xx and xzx\circ z have the same distribution DD.

Assumption 3.2 (Odd outer activation).

The outer activation ψ\psi is an odd, increasing function, i.e. ψ(x)=ψ(x)\psi(-x)=-\psi(x).

Note that ψ\psi could be the identity function.

Assumption 3.3 (Inner activation).

The inner activation ϕL2(𝒩(0,I))\phi\in L_{2}({\cal N}(0,I)).

The construction of our orthogonal family of neural networks is simple and exploits sign-symmetry.

Definition 3.4 (Family of Orthogonal Neural Networks).

Let the domain be n{\mathbb{R}}^{n}, let ϕ:\phi:{\mathbb{R}}\to{\mathbb{R}} be any activation function satisfying 3.3, and let ψ:\psi:{\mathbb{R}}\to{\mathbb{R}} be any odd function. For an index set S[n]S\subseteq[n], let xS|S|x_{S}\in{\mathbb{R}}^{|S|} denote the vector of xix_{i} for iSi\in S. Fix any k>0k>0. For any sign-pattern z{±1}kz\in\{\pm 1\}^{k}, let χ(z)\chi(z) denote the parity izi\prod_{i}z_{i}. For any index set Sk[n]S\subseteq_{k}[n], define a one-layer neural network with m=2km=2^{k} hidden nodes,

gS(x)\displaystyle g_{S}(x) =w{1,1}kχ(w)ϕ(wxSk)\displaystyle=\sum_{w\in\{-1,1\}^{k}}\chi(w)\phi\left(\frac{w~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)
fS(x)\displaystyle f_{S}(x) =ψ(gS(x)).\displaystyle=\psi\left(g_{S}(x)\right).

Our orthogonal family is

𝒞orth(n,k)={fSSk[n]}.{\mathcal{C}_{\text{orth}}}(n,k)=\{f_{S}\mid S\subseteq_{k}[n]\}.

Notice that the size of this family is (nk)=nΘ(k)\binom{n}{k}=n^{\Theta(k)} (for appropriate kk), which is nΘ(logm)n^{\Theta(\log m)} in terms of mm. We will take k=Θ(logn)k=\Theta(\log n), so that m=poly(n)m=\operatorname{poly}(n) and thus the neural networks are poly(n)\operatorname{poly}(n)-sized, and the size of the family is nΘ(logn)n^{\Theta(\log n)}, i.e. quasipolynomial in nn.
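For concreteness, here is a minimal NumPy sketch of evaluating fSf_{S} (illustrative only; the activations ϕ\phi and ψ\psi are passed in by the caller, and the example choices at the end are assumptions, not the only ones covered by our results).

```python
import itertools
import numpy as np

def f_S(X, S, phi, psi):
    """Evaluate f_S(x) = psi( sum_{w in {-1,1}^k} chi(w) * phi(w . x_S / sqrt(k)) )
    on a batch X of shape (num_samples, n), where S is an index set of size k."""
    k = len(S)
    XS = X[:, S]                                       # restrict inputs to the coordinates in S
    out = np.zeros(X.shape[0])
    for w in itertools.product([-1.0, 1.0], repeat=k):
        w = np.asarray(w)
        out += np.prod(w) * phi(XS @ w / np.sqrt(k))   # np.prod(w) is the parity chi(w)
    return psi(out)

# Example usage with phi = ReLU, psi = tanh, and an index set of size k = 4 (illustrative):
X = np.random.default_rng(0).standard_normal((5, 10))
vals = f_S(X, [0, 2, 4, 7], phi=lambda t: np.maximum(t, 0.0), psi=np.tanh)
```

Each hidden unit has weight vector w/\sqrt{k} of unit norm, outer weight \chi(w)\in\{\pm 1\}, and zero bias, matching the description in the introduction.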

We now prove that our functions are orthogonal under any sign-symmetric distribution.

Theorem 3.5.

Let the domain be n{\mathbb{R}}^{n}, and let DD be a sign-symmetric distribution on n{\mathbb{R}}^{n}. Fix any k>0k>0. Then fS,fTD=0\langle f_{S},f_{T}\rangle_{D}=0 for any two distinct fS,fT𝒞orth(n,k)f_{S},f_{T}\in{\mathcal{C}_{\text{orth}}}(n,k).

Proof.

For the proof, the key property of our construction that we will use is the following: for any sign-pattern z{±1}nz\in\{\pm 1\}^{n} and any xnx\in{\mathbb{R}}^{n},

fS(xz)=χS(z)fS(x),f_{S}(x\circ z)=\chi_{S}(z)f_{S}(x), (1)

where χS(z)=iSzi=χ(zS)\chi_{S}(z)=\prod_{i\in S}z_{i}=\chi(z_{S}) is the parity on SS of zz. Indeed, observe first that

gS(xz)\displaystyle g_{S}(x\circ z) =w{1,1}kχ(w)ϕ(w(xz)Sk)\displaystyle=\sum_{w\in\{-1,1\}^{k}}\chi(w)\phi\left(\frac{w~{}{\cdot}~{}(x\circ z)_{S}}{\sqrt{k}}\right)
=w{1,1}kχ(w)ϕ((wzS)xSk)\displaystyle=\sum_{w\in\{-1,1\}^{k}}\chi(w)\phi\left(\frac{(w\circ z_{S})~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)
=w{1,1}kχ(wzS)χ(zS)ϕ((wzS)xSk)\displaystyle=\sum_{w\in\{-1,1\}^{k}}\chi(w\circ z_{S})\chi(z_{S})\phi\left(\frac{(w\circ z_{S})~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)
=χ(zS)w{1,1}kχ(w)ϕ(wxSk)\displaystyle=\chi(z_{S})\sum_{w\in\{-1,1\}^{k}}\chi(w)\phi\left(\frac{w~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right) (replacing wzSw\circ z_{S} with ww)
=χ(zS)gS(x).\displaystyle=\chi(z_{S})g_{S}(x).

The property then follows since ψ\psi is odd and ψ(av)=aψ(v)\psi(av)=a\psi(v) for any a{±1}a\in\{\pm 1\} and vv\in{\mathbb{R}}.

Consider fSf_{S} and fTf_{T} for any two distinct S,Tk[n]S,T\subseteq_{k}[n]. Recall that by the definition of sign-symmetry, for any z{±1}nz\in\{\pm 1\}^{n} and xx drawn from DD, xx and xzx\circ z have the same distribution. Using this and Eq. 1, we have

fS,fTD\displaystyle\langle f_{S},f_{T}\rangle_{D} =𝔼xD[fS(x)fT(x)]\displaystyle=\mathbb{E}_{x\sim D}[f_{S}(x)f_{T}(x)]
=𝔼z{±1}n𝔼xD[fS(xz)fT(xz)]\displaystyle=\mathbb{E}_{z\sim\{\pm 1\}^{n}}\ \mathbb{E}_{x\sim D}[f_{S}(x\circ z)f_{T}(x\circ z)] (sign-symmetry)
=𝔼z{±1}n𝔼xD[χS(z)fS(x)χT(z)fT(x)]\displaystyle=\mathbb{E}_{z\sim\{\pm 1\}^{n}}\ \mathbb{E}_{x\sim D}[\chi_{S}(z)f_{S}(x)\chi_{T}(z)f_{T}(x)] (Eq. 1)
=𝔼xD[fS(x)fT(x)𝔼z{±1}nχS(z)χT(z)]\displaystyle=\mathbb{E}_{x\sim D}\left[f_{S}(x)f_{T}(x)\ \mathbb{E}_{z\sim\{\pm 1\}^{n}}\chi_{S}(z)\chi_{T}(z)\right]
=0,\displaystyle=0,

since 𝔼z{±1}nχS(z)χT(z)=0\mathbb{E}_{z\sim\{\pm 1\}^{n}}\chi_{S}(z)\chi_{T}(z)=0 for any two distinct parities χS,χT\chi_{S},\chi_{T}. ∎

Remark 3.6.

Our proof actually shows that any family of functions satisfying Eq. 1 is an orthogonal family under any sign-symmetric distribution.
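As a quick numerical illustration of Eq. 1 and Theorem 3.5 (a standalone sketch; the choices ϕ = ReLU, ψ = tanh, n = 8, k = 4, the two index sets, and the Monte Carlo sample size are all illustrative assumptions):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, k, num_mc = 8, 4, 400_000
relu = lambda t: np.maximum(t, 0.0)

def f_S(X, S):
    """f_S from Definition 3.4 with phi = ReLU and psi = tanh (illustrative choices)."""
    XS = X[:, S]
    out = sum(np.prod(w) * relu(XS @ np.asarray(w) / np.sqrt(k))
              for w in itertools.product([-1.0, 1.0], repeat=k))
    return np.tanh(out)

X = rng.standard_normal((num_mc, n))        # x ~ N(0, I_n), a sign-symmetric distribution
S, T = [0, 1, 2, 3], [0, 1, 2, 4]           # two distinct index sets of size k
fS, fT = f_S(X, S), f_S(X, T)

# Eq. (1): flipping input signs multiplies f_S by the parity of the flipped coordinates in S.
z = rng.choice([-1.0, 1.0], size=n)
assert np.allclose(f_S(X * z, S), np.prod(z[S]) * fS)

# Theorem 3.5: distinct members of the family are orthogonal, so the Monte Carlo estimate of
# <f_S, f_T>_D is near 0, while the squared norm E[f_S^2] is not (cf. Theorem 3.8 below).
print(np.mean(fS * fT), np.mean(fS ** 2))
```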

We still need to establish that our functions are nonzero. For this we need to specialize to the Gaussian distribution, as well as consider specific activation functions (a similar analysis can in principle be carried out for other sign-symmetric distributions). For any nn and kk, it follows from Lemma A.1 that if the inner activation ϕ\phi has a nonzero Hermite coefficient of degree kk or higher, then the functions in 𝒞orth(n,k){\mathcal{C}_{\text{orth}}}(n,k) are nonzero. The sigmoid, ReLU and sign functions all satisfy this property.

Corollary 3.7.

Let the domain be n{\mathbb{R}}^{n}, and let DD be any sign-symmetric distribution on n{\mathbb{R}}^{n}. For any γ>0\gamma>0,

SDAD(𝒞orth(n,k),γ)|𝒞orth(n,k)|γ=(nk)γ.\operatorname{SDA}_{D}({\mathcal{C}_{\text{orth}}}(n,k),\gamma)\geq|{\mathcal{C}_{\text{orth}}}(n,k)|\gamma=\binom{n}{k}\gamma.

Here we also assume that all c𝒞orth(n,k)c\in{\mathcal{C}_{\text{orth}}}(n,k) are nonzero for our distribution DD.

Proof.

Follows from Theorem 3.5 and Lemma 2.6, using a loose upper bound of 1 on the squared norm. ∎

We also need to prove norm lower bounds on our functions for our notions of learning to be meaningful. In Appendix A, we prove the following.

Theorem 3.8.

Let the inner activation function ϕ\phi be ReLU\operatorname{ReLU} or sigmoid, and let the outer activation function ψ\psi be any odd, increasing, continuous function. Let the underlying distribution DD be 𝒩(0,In)\mathcal{N}(0,I_{n}). Then fS=Ω(eΘ(k))\|f_{S}\|=\Omega(e^{-\Theta(k)}), where the hidden constants depend on ψ\psi and ϕ\phi, for any fS𝒞orth(n,k)f_{S}\in{\mathcal{C}_{\text{orth}}}(n,k).

With this in hand, we now state our main SQ lower bounds.

Theorem 3.9.

Let the input dimension be nn, and let the underlying distribution be 𝒩(0,In)\mathcal{N}(0,I_{n}). Consider 𝒞orth(n,k){\mathcal{C}_{\text{orth}}}(n,k) instantiated with ϕ=ReLU\phi=\operatorname{ReLU} or sigmoid and ψ\psi any odd, increasing function (including the identity function), and let m=2km=2^{k} be the hidden layer size of each neural net. Let AA be an SQ learner using only inner product queries of tolerance τ\tau. For any kk\in{\mathbb{N}}, there exists τ=1/nΘ(k)\tau=1/n^{\Theta(k)} such that AA requires at least nΩ(k)n^{\Omega(k)} queries of tolerance τ\tau to learn 𝒞orth(n,k){\mathcal{C}_{\text{orth}}}(n,k) with advantage 1/exp(k)1/\exp(k).

In particular, there exist k=Θ(logn)k=\Theta(\log n) and τ=1/nΘ(logn)\tau=1/n^{\Theta(\log n)} such that AA requires at least nΩ(logn)n^{\Omega(\log n)} queries of tolerance τ\tau to learn 𝒞orth(n,k){\mathcal{C}_{\text{orth}}}(n,k) with advantage 1/poly(n)1/\operatorname{poly}(n). In this case m=poly(n)m=\operatorname{poly}(n), so that each function in the family has polynomial size. This is our main superpolynomial lower bound.

Proof.

The proof amounts to careful choices of the parameters ϵ,γ\epsilon,\gamma and τ\tau in Corollary 3.7 and Corollary 4.6. Recall that SDA(𝒞orth(n,k),γ)nΘ(k)γ\operatorname{SDA}({\mathcal{C}_{\text{orth}}}(n,k),\gamma)\geq n^{\Theta(k)}\gamma. We pick γ=nΘ(k)\gamma=n^{-\Theta(k)} appropriately such that d=SDA(𝒞orth(n,k),γ)d=\operatorname{SDA}({\mathcal{C}_{\text{orth}}}(n,k),\gamma) is still nΘ(k)n^{\Theta(k)}. Theorem 3.8 gives us a norm lower bound of exp(Θ(k))\exp(-\Theta(k)), allowing us to take ϵ=exp(Θ(k))\epsilon=\exp(-\Theta(k)) and τ=γ=nΘ(k)\tau=\sqrt{\gamma}=n^{-\Theta(k)} in Corollary 4.6. ∎

4 SQ Lower Bounds

SQ Lower Bounds for Real-valued Functions

Prior work [Szö09, Fel12] has already established the following fundamental result, which we phrase in terms of our definition of statistical dimension. For the reader’s convenience, we include a proof in Appendix B.

Theorem 4.1.

Let DD be a distribution on XX, and let 𝒞{\mathcal{C}} be a real-valued concept class over a domain XX such that cD>ϵ\|c\|_{D}>\epsilon for all c𝒞c\in{\mathcal{C}}. Consider any SQ learner that is allowed to make only inner product queries to an SQ oracle for the labeled distribution DcD_{c} for some unknown c𝒞c\in{\mathcal{C}}. Let d=SDAD(𝒞,γ)d=\operatorname{SDA}_{D}({\mathcal{C}},\gamma). Then any such SQ learner needs at least Ω(d)\Omega(d) queries of tolerance γ\sqrt{\gamma} to learn 𝒞{\mathcal{C}} up to L2L_{2} error ϵ\epsilon.

SQ Lower Bounds for p-concepts

It turns out to be fruitful to view our learning problem in terms of a decision problem over distributions. We define the problem of distinguishing a valid labeled distribution from a randomly labeled one, and show a lower bound for this problem. We then show that learning is at least as hard as distinguishing, thereby extending the lower bound to learning as well. Our analysis closely follows that of [FGR+17].

Definition 4.2 (Distinguishing between labeled and uniformly random distributions).

Let 𝒞{\mathcal{C}} be a class of pp-concepts over a domain XX, and let DD be a distribution on XX. Let D0=Dc0D_{0}=D_{c_{0}} be the randomly labeled distribution D×Unif{±1}D\times\operatorname{Unif}\{\pm 1\}. Suppose we are given SQ access either to a labeled distribution DcD_{c} for some c𝒞c\in{\mathcal{C}} such that cc0c\neq c_{0} or to D0D_{0}. The problem of distinguishing between labeled and uniformly random distributions is to decide which.

Remark 4.3.

Given access to DcD_{c} for some truly boolean concept c:X{±1}c:X\to\{\pm 1\}, it is easy to distinguish any other boolean function cc^{\prime} from cc since ccD2=22c,cD\|c-c^{\prime}\|_{D}^{2}=2-2\langle c,c^{\prime}\rangle_{D} (which is information-theoretically optimal as a distinguishing criterion) can be computed using a single inner product query. However, if cc and cc^{\prime} are pp-concepts, cD\|c\|_{D} and cD\|c^{\prime}\|_{D} are not 1 in general and may be difficult to estimate. It is not obvious how best to distinguish the two, short of directly learning the target.

Considering the distinguishing problem is useful because if we can show that distinguishing itself is hard, then any reasonable notion of learning will be hard as well, including weak learning. We give simple reductions for both our notions of learning.

Lemma 4.4 (Learning is as hard as distinguishing).

Let DD be a distribution over the domain XX, and let 𝒞{\mathcal{C}} be a pp-concept class over XX. Suppose there exists either

(a) a weak SQ learner capable of learning 𝒞{\mathcal{C}} up to advantage ϵ\epsilon using qq queries of tolerance τ\tau, where τϵ/2\tau\leq\epsilon/2; or,

(b) an SQ learner capable of learning 𝒞{\mathcal{C}} (assume cD3ϵ\|c\|_{D}\geq 3\epsilon for all c𝒞c\in{\mathcal{C}}) up to L2L_{2} error ϵ\epsilon using qq queries of tolerance τ\tau, where τϵ2\tau\leq\epsilon^{2}. Then there exists a distinguisher that is able to distinguish between an unknown DcD_{c} and D0D_{0} using at most q+1q+1 queries of tolerance τ\tau.

Proof.

(a) Run the weak learner to obtain c~\tilde{c}. If cc0c\neq c_{0}, we know that c~,cDϵ\langle\tilde{c},c\rangle_{D}\geq\epsilon, whereas if c=c0c=c_{0}, then c~,cD=0\langle\tilde{c},c\rangle_{D}=0 no matter what c~\tilde{c} is. A single additional query (h(x,y)=c~(x)yh(x,y)=\tilde{c}(x)y) of tolerance ϵ/2\epsilon/2 distinguishes between the two cases.

(b) Run the learner to obtain c~\tilde{c}. If cc0c\neq c_{0}, i.e. cD3ϵ\|c\|_{D}\geq 3\epsilon, we know that c~cDϵ\|\tilde{c}-c\|_{D}\leq\epsilon, so that by the triangle inequality, c~DcDc~cD2ϵ\|\tilde{c}\|_{D}\geq\|c\|_{D}-\|\tilde{c}-c\|_{D}\geq 2\epsilon. But if c=c0c=c_{0}, then c~Dϵ\|\tilde{c}\|_{D}\leq\epsilon. An additional query (h(x,y)=c~(x)2h(x,y)=\tilde{c}(x)^{2}) of tolerance ϵ2\epsilon^{2} suffices to distinguish the two cases. ∎

We now prove the main lower bound on distinguishing.

Theorem 4.5.

Let DD be a distribution over the domain XX, and let 𝒞{\mathcal{C}} be a pp-concept class over XX. Then any SQ algorithm needs at least d=SDA(𝒞,γ)d=\operatorname{SDA}({\mathcal{C}},\gamma) queries of tolerance γ\sqrt{\gamma} to distinguish between DcD_{c} and D0D_{0} for an unknown c𝒞c\in{\mathcal{C}}. (We will consider deterministic SQ algorithms that always succeed, for simplicity.)

Proof.

Consider any successful SQ algorithm AA. Consider the adversarial strategy where to every query h:X×{1,1}[1,1]h:X\times\{-1,1\}\to[-1,1] of AA (with tolerance τ=γ\tau=\sqrt{\gamma}), we respond with 𝔼D0[h]\mathbb{E}_{D_{0}}[h]. We can pretend that this is a valid answer with respect to any c𝒞c\in{\mathcal{C}} such that |𝔼Dc[h]𝔼D0[h]|τ|\mathbb{E}_{D_{c}}[h]-\mathbb{E}_{D_{0}}[h]|\leq\tau. Our argument will be based on showing that each such query rules out fairly few distributions, so that the number of queries required in total is large.

Since we assumed that AA is a deterministic algorithm that always succeeds, it eventually correctly guesses that it is D0D_{0} that it is getting answers from. Say it takes qq queries to do so. For the kthk^{\text{th}} query hkh_{k}, let SkS_{k} be the set of concepts in 𝒞{\mathcal{C}} that are ruled out by our response 𝔼D0[hk]\mathbb{E}_{D_{0}}[h_{k}]:

Sk={c𝒞|𝔼Dc[hk]𝔼D0[hk]|>τ}.S_{k}=\{c\in{\mathcal{C}}\mid|\mathbb{E}_{D_{c}}[h_{k}]-\mathbb{E}_{D_{0}}[h_{k}]|\ >\tau\}.

We’ll show that

(a) on the one hand, k=1qSk=𝒞\cup_{k=1}^{q}S_{k}={\mathcal{C}}, so that k=1q|Sk||𝒞|\sum_{k=1}^{q}|S_{k}|\geq|{\mathcal{C}}|,

(b) while on the other, |Sk||𝒞|/d|S_{k}|\leq|{\mathcal{C}}|/d for every kk. Together, this will mean that qdq\geq d.

For the first claim, suppose k=1qSk\cup_{k=1}^{q}S_{k} were not all of 𝒞{\mathcal{C}}, and indeed say c𝒞(k=1qSk)c\in{\mathcal{C}}\setminus(\cup_{k=1}^{q}S_{k}). Then DcD_{c} is a distribution with which all of our answers were consistent throughout, yet for which AA’s final answer (D0D_{0}) is incorrect. Since AA always succeeds, this is impossible.

For the second claim, suppose for the sake of contradiction that for some kk, |Sk|>|𝒞|/d|S_{k}|>|{\mathcal{C}}|/d. By Definition 2.4, this means we know that ρD(Sk)γ\rho_{D}(S_{k})\leq\gamma. One of the key insights in the proof of [Szö09] is that by expressing query expectations entirely in terms of inner products, we gain the ability to apply simple algebraic techniques. To this end, for any query function hh, let h^(x)=(h(x,1)h(x,1))/2\widehat{h}(x)=(h(x,1)-h(x,-1))/2. Observe that for any pp-concept cc,

h^,cD\displaystyle\langle\widehat{h},c\rangle_{D} =𝔼xD[h(x,1)c(x)2]𝔼xD[h(x,1)c(x)2]\displaystyle=\mathbb{E}_{x\sim D}\left[h(x,1)\frac{c(x)}{2}\right]-\mathbb{E}_{x\sim D}\left[h(x,-1)\frac{c(x)}{2}\right]
=𝔼xD[h(x,1)1+c(x)2]\displaystyle=\mathbb{E}_{x\sim D}\left[h(x,1)\frac{1+c(x)}{2}\right]
+𝔼xD[h(x,1)1c(x)2]\displaystyle\quad+\mathbb{E}_{x\sim D}\left[h(x,-1)\frac{1-c(x)}{2}\right]
𝔼xD[h(x,1)12]𝔼xD[h(x,1)12]\displaystyle\quad-\mathbb{E}_{x\sim D}\left[h(x,1)\frac{1}{2}\right]-\mathbb{E}_{x\sim D}\left[h(x,-1)\frac{1}{2}\right]
=𝔼Dc[h]𝔼D0[h],\displaystyle=\mathbb{E}_{D_{c}}[h]-\mathbb{E}_{D_{0}}[h],

the difference between the query expectations wrt DcD_{c} and D0D_{0}. Here we have expanded each 𝔼Dc[h]\mathbb{E}_{D_{c}}[h] using the fact that the label for xx is 11 with probability (1+c(x))/2(1+c(x))/2 and 1-1 otherwise. Thus |hk^,cD||\langle\widehat{h_{k}},c\rangle_{D}|, where hkh_{k} is the kthk^{\text{th}} query, is greater than τ\tau for any cSkc\in S_{k}, since SkS_{k} are precisely those concepts ruled out by our response. We will show contradictory upper and lower bounds on the following quantity:

Φ=hk^,cSkcsign(hk^,cD)D.\displaystyle\Phi=\left\langle\widehat{h_{k}},\sum_{c\in S_{k}}c\cdot\operatorname{sign}(\langle\widehat{h_{k}},c\rangle_{D})\right\rangle_{D}.

Note that since every query hh satisfies h(,y)D1\|h(\cdot,y)\|_{D}\leq 1 for all yy, it follows by the triangle inequality that h^D1\|\widehat{h}\|_{D}\leq 1. So by Cauchy-Schwarz and our observation that ρD(Sk)γ\rho_{D}(S_{k})\leq\gamma,

Φ2\displaystyle\Phi^{2} hk^D2cSkcsign(hk^,c)D2\displaystyle\leq\|\widehat{h_{k}}\|_{D}^{2}\cdot\left\|\sum_{c\in S_{k}}c\cdot\operatorname{sign}(\langle\widehat{h_{k}},c\rangle)\right\|_{D}^{2}
c,cSk|c,cD|=|Sk|2ρD(Sk)|Sk|2γ.\displaystyle\leq\sum_{c,c^{\prime}\in S_{k}}|\langle c,c^{\prime}\rangle_{D}|=|S_{k}|^{2}\rho_{D}(S_{k})\leq|S_{k}|^{2}\gamma.

However since |hk^,cD|>τ|\langle\widehat{h_{k}},c\rangle_{D}|\ >\tau, we also have that Φ=cSk|hk^,cD|>|Sk|τ.\Phi=\sum_{c\in S_{k}}|\langle\widehat{h_{k}},c\rangle_{D}|\ >|S_{k}|\tau. Since τ=γ\tau=\sqrt{\gamma}, this contradicts our upper bound and in turn completes the proof of our second claim. And as noted earlier, the two claims together imply that qdq\geq d. ∎

The final lower bounds on learning thus obtained are stated as a corollary for convenience. The proof follows directly from Lemma 4.4 and Theorem 4.5.

Corollary 4.6.

Let DD be a distribution over the domain XX, and let 𝒞{\mathcal{C}} be a pp-concept class over XX. Let γ,τ\gamma,\tau be such that γτ\sqrt{\gamma}\leq\tau. Let d=SDA(𝒞,γ)d=\operatorname{SDA}({\mathcal{C}},\gamma).

(a) Let ϵ\epsilon be such that τϵ2\tau\leq\epsilon^{2}, and assume cD3ϵ\|c\|_{D}\geq 3\epsilon for all c𝒞c\in{\mathcal{C}}. Then any SQ learner learning 𝒞{\mathcal{C}} up to L2L_{2} error ϵ\epsilon requires at least d1d-1 queries of tolerance τ\tau.

(b) Let ϵ\epsilon be such that τϵ/2\tau\leq\epsilon/2. Then any weak SQ learner learning 𝒞{\mathcal{C}} up to advantage ϵ\epsilon requires at least d1d-1 queries of tolerance τ\tau.

5 Experiments

We include experiments for both regression and classification. We train an overparameterized neural network on data from our function class, using gradient descent. We find that we are able to achieve close to zero training error, while test error remains high. This is consistent with our lower bound for these classes of functions.

(a) Learning a softmax of a one-layer tanh network
(b) Learning a linear combination of tanhs
Figure 1: In (a) the target function is a softmax (±1\pm 1 labels) of a sum of 272^{7} tanh activations with n=14n=14; in (b) the labels are obtained similarly but without the softmax. In both cases, we train a 1-layer neural network with 527=6405\cdot 2^{7}=640 tanh units (hence 10241 parameters) using a training set of size 60006000 and a test set of size 10001000, with the learning rate set to 0.010.01. For (a) we take the sign of this trained network and measure its training and testing 0/1 loss; for (b) we measure the train and test square-loss of the learned network directly. In (a) we also plot the test error of the Bayes optimal network (sign of the target function).

For classification, we use a training set of size TT of data corresponding to f𝒞orth(n,k)f\in{\mathcal{C}_{\text{orth}}}(n,k) instantiated with ϕ=tanh\phi=\tanh and ψ=tanh\psi=\tanh. We draw x𝒩(0,In)x\sim\mathcal{N}(0,I_{n}). For each xx, yy is picked randomly from {±1}\{\pm 1\} in such a way that 𝔼[y|x]=f(x)\mathbb{E}[y|x]=f(x). Since the outer activation ψ\psi is tanh\tanh, this can be thought of as applying a softmax to the network’s output, or as the Boolean label corresponding to a logit output. We train a sum of tanh network (i.e. a network in which the inner activation is tanh\tanh and no outer activation is applied) on this data using gradient descent on squared loss, threshold the output, and plot the resulting 0/1 loss. See Fig. 1(a). This setup models a common way in which neural networks are trained for classification problems in practice.
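The following is a minimal sketch of this data-generation step (the values of n, k, and T, and the random choice of the index set S, are illustrative; ϕ = ψ = tanh as described above).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k, T = 14, 7, 6000                        # sizes roughly matching Fig. 1(a)
S = rng.choice(n, size=k, replace=False)     # the hidden index set defining the target f_S

def f_S(X):
    """f_S from Definition 3.4 with phi = psi = tanh."""
    XS = X[:, S]
    out = sum(np.prod(w) * np.tanh(XS @ np.asarray(w) / np.sqrt(k))
              for w in itertools.product([-1.0, 1.0], repeat=k))
    return np.tanh(out)

X = rng.standard_normal((T, n))              # x ~ N(0, I_n)
p_plus = (1.0 + f_S(X)) / 2.0                # P[y = +1 | x], so that E[y | x] = f_S(x)
y = np.where(rng.random(T) < p_plus, 1.0, -1.0)
```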

For regression, we use a training set of size TT of data corresponding to f𝒞orth(n,k)f\in{\mathcal{C}_{\text{orth}}}(n,k) instantiated with ϕ=tanh\phi=\tanh and ψ\psi being the identity. We draw x𝒩(0,In)x\sim\mathcal{N}(0,I_{n}), and y=f(x)y=f(x). We train a sum of tanh network on this data using gradient descent on squared loss, which we plot in Fig. 1(b). This setup models the natural way of using neural networks for regression problems.
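A minimal end-to-end sketch of this regression experiment follows (a toy version rather than the code behind Fig. 1(b); the network width, learning rate, number of steps, and sample sizes are illustrative and smaller than in the figure).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k, T = 14, 7, 2000
S = rng.choice(n, size=k, replace=False)

def g_S(X):                                  # target: a linear combination of tanh units
    XS = X[:, S]
    return sum(np.prod(w) * np.tanh(XS @ np.asarray(w) / np.sqrt(k))
               for w in itertools.product([-1.0, 1.0], repeat=k))

X, X_test = rng.standard_normal((T, n)), rng.standard_normal((1000, n))
y, y_test = g_S(X), g_S(X_test)

M, lr, steps = 256, 0.01, 300                # overparameterized sum-of-tanh student network
U = rng.standard_normal((M, n)) / np.sqrt(n) # hidden-layer weights
a = rng.standard_normal(M) / np.sqrt(M)      # output-layer weights

for _ in range(steps):                       # full-batch gradient descent on square loss
    Z = np.tanh(X @ U.T)                     # (T, M) hidden activations
    r = Z @ a - y                            # residuals
    grad_a = (2.0 / T) * (Z.T @ r)
    grad_U = (2.0 / T) * (((r[:, None] * (1.0 - Z ** 2)) * a).T @ X)
    a -= lr * grad_a
    U -= lr * grad_U

print(np.mean((np.tanh(X @ U.T) @ a - y) ** 2))            # train square loss
print(np.mean((np.tanh(X_test @ U.T) @ a - y_test) ** 2))  # test square loss
```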

In both cases, we train neural networks whose number of parameters considerably exceeds the amount of training data. In all our experiments, we plot the median over 10 trials and shade the inter-quartile range of the data.

Similar results hold with the inner activation ϕ\phi being ReLU\operatorname{ReLU} instead of tanh\tanh, and are shown in Fig. 2.

(a) Learning a softmax of a one-layer ReLU network
(b) Learning a linear combination of ReLUs
Figure 2: In (a) the target function is a softmax (±1\pm 1 labels) of a sum of 282^{8} ReLU activations with n=14n=14; in (b) the labels are obtained similarly but without the softmax. In both cases, we train a 1-layer neural network with 528=12805\cdot 2^{8}=1280 ReLU units (hence 20481 parameters) using a training set of size 60006000 and a test set of size 10001000, with the learning rate set to 0.0050.005 for classification and 0.0020.002 for regression. For (a) we take the sign of this trained network and measure its training and testing 0/1 loss; for (b) we measure the train and test square loss of the learned network directly. In (a) we also plot the test error of the Bayes optimal network (sign of the target function).

References

  • [ADHV19] Alexandr Andoni, Rishabh Dudeja, Daniel Hsu, and Kiran Vodrahalli. Attribute-efficient learning of monomials over highly-correlated variables. In Thirtieth International Conference on Algorithmic Learning Theory, 2019.
  • [APVZ14] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning sparse polynomial functions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 500–510. SIAM, 2014.
  • [AS20] Emmanuel Abbe and Colin Sandon. Poly-time universality and limitations of deep learning. arXiv preprint arXiv:2001.02992, 2020.
  • [BFJ+94] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 253–262, 1994.
  • [BG17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. CoRR, abs/1702.07966, 2017.
  • [Boy84] John P Boyd. Asymptotic coefficients of hermite function series. Journal of Computational Physics, 54(3):382–410, 1984.
  • [BR89] Avrim Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. In Advances in neural information processing systems, pages 494–501, 1989.
  • [DKKZ20] Ilias Diakonikolas, Daniel Kane, Vasilis Kontonis, and Nikos Zarifis. Algorithms and SQ Lower Bounds for PAC Learning One-Hidden-Layer ReLU Networks. In Conference on Learning Theory, 2020. To appear.
  • [DV20] Amit Daniely and Gal Vardi. Hardness of learning neural networks with natural weights. arXiv preprint arXiv:2006.03177, 2020.
  • [Fel12] Vitaly Feldman. A complete characterization of statistical query learning with applications to evolvability. Journal of Computer and System Sciences, 78(5):1444–1459, 2012.
  • [Fel17] Vitaly Feldman. A general characterization of the statistical query complexity. In Conference on Learning Theory, pages 785–830, 2017.
  • [FGR+17] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh S Vempala, and Ying Xiao. Statistical algorithms and a lower bound for detecting planted cliques. Journal of the ACM (JACM), 64(2):8, 2017.
  • [GKK19] Surbhi Goel, Sushrut Karmalkar, and Adam Klivans. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. In Advances in Neural Information Processing Systems, pages 8582–8591, 2019.
  • [GKKT17] Surbhi Goel, Varun Kanade, Adam R. Klivans, and Justin Thaler. Reliably learning the ReLU in polynomial time. In COLT, pages 1004–1042, 2017.
  • [GLM18] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In ICLR. OpenReview.net, 2018.
  • [JHG18] Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, NeurIPS, pages 8580–8589, 2018.
  • [JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
  • [Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
  • [KK14] Adam R. Klivans and Pravesh Kothari. Embedding hard learning problems into Gaussian space. In APPROX-RANDOM, volume 28 of LIPIcs, pages 793–809. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2014.
  • [KS94] Michael J Kearns and Robert E Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464–497, 1994.
  • [KS09] Adam R Klivans and Alexander A Sherstov. Cryptographic hardness for learning intersections of halfspaces. Journal of Computer and System Sciences, 75(1):2–12, 2009.
  • [LSSS14] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
  • [Sha18] Ohad Shamir. Distribution-specific hardness of learning neural networks. J. Mach. Learn. Res, 19:32:1–32:29, 2018.
  • [SSSS17] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3067–3075, 2017.
  • [SVWX17] Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. In Advances in Neural Information Processing Systems, pages 5514–5522, 2017.
  • [Szö09] Balázs Szörényi. Characterizing statistical query learning: simplified notions and proofs. In International Conference on Algorithmic Learning Theory, pages 186–200. Springer, 2009.
  • [Vu98] Van H Vu. On the infeasibility of training neural networks with small mean-squared error. IEEE Transactions on Information Theory, 44(7):2892–2900, 1998.
  • [VW19] Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and sq lower bounds. In Conference on Learning Theory, pages 3115–3117, 2019.
  • [ZPS17] Qiuyi Zhang, Rina Panigrahy, and Sushant Sachdeva. Electron-proton dynamics in deep learning. CoRR, abs/1702.00458, 2017.
  • [ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In ICML, volume 70, pages 4140–4149. JMLR.org, 2017.
  • [ZYWG19] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer ReLU networks via gradient descent. In Kamalika Chaudhuri and Masashi Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 1524–1534. PMLR, 2019.

Appendix A Bounding the function norms under the Gaussian

Our goal in this section will be to give lower bounds on the norms of the functions in 𝒞orth(n,k){\mathcal{C}_{\text{orth}}}(n,k), which is a technical requirement for our results to hold (see Lemma 4.4 and Corollary 4.6). Note that when learning with respect to L2L_{2} error, such a lower bound is necessary if we wish to state SQ lower bounds, since if the target had small norm, say fDϵ\|f\|_{D}\leq\epsilon, then the zero function trivially achieves L2L_{2} error ϵ\epsilon.

All inner products and norms in this section will be with respect to the standard Gaussian, 𝒩(0,I)\mathcal{N}(0,I). Since we will fix SS throughout, for our purposes the only relevant part of the input is xSx_{S} and so we drop the subscripts and let g=gS,f=fSg=g_{S},f=f_{S} and x=xSx=x_{S}, so that gg and ff are functions of xkx\in{\mathbb{R}}^{k}. Our approach will be as follows. In order to prove a norm lower bound on ff, we will prove an anticoncentration result for gg. To this end we first calculate the second moment of gg in terms of the Hermite coefficients of ϕ\phi.

Lemma A.1.

Under the distribution 𝒩(0,In)\mathcal{N}(0,I_{n}), let the Hermite representation of ϕ\phi be ϕ(x)=i=0ϕi^H~i(x)\phi(x)=\sum_{i=0}^{\infty}\widehat{\phi_{i}}\tilde{H}_{i}(x), where H~i(x)\tilde{H}_{i}(x) is the ithi^{\text{th}} normalized probabilists’ Hermite polynomial. Then

𝔼[g(x)2]=4ki0ϕi^2kii1++ik=ii1,,ik are odd(ii1,,ik).\displaystyle\mathbb{E}\left[g(x)^{2}\right]=4^{k}\sum_{i\geq 0}\frac{\widehat{\phi_{i}}^{2}}{k^{i}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=i\\ i_{1},\dots,i_{k}\text{ are odd}\end{subarray}}\binom{i}{i_{1},\dots,i_{k}}.
Proof.

We use 𝔼\mathbb{E} in this proof instead of 𝔼x𝒩(0,In)\mathbb{E}_{x\sim\mathcal{N}(0,I_{n})} for simplicity. Then we have

𝔼[g(x)2]\displaystyle\mathbb{E}\!\left[g(x)^{2}\right]
=\displaystyle= 𝔼[α{±1}kχ(α)ϕ(αxSk)][β{±1}kχ(β)ϕ(βxSk)]\displaystyle\,\mathbb{E}\!\left[\sum_{\alpha\in\{\pm 1\}^{k}}\chi(\alpha)\phi\left(\frac{\alpha~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\right]\!\!\!\left[\sum_{\beta\in\{\pm 1\}^{k}}\chi(\beta)\phi\left(\frac{\beta~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\right]
=\displaystyle= α,β{±1}kl=1kαlβl𝔼[ϕ(αxSk)ϕ(βxSk)]\displaystyle\,\sum_{\alpha,\beta\in\{\pm 1\}^{k}}\prod_{l=1}^{k}\alpha_{l}\beta_{l}\,\mathbb{E}\!\left[\phi\left(\frac{\alpha~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\phi\left(\frac{\beta~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\right]
=\displaystyle= α,β{±1}kl=1kαlβl𝔼[i,j0ϕi^ϕj^H~i(αxSk)H~j(βxSk)]\displaystyle\,\sum_{\alpha,\beta\in\{\pm 1\}^{k}}\prod_{l=1}^{k}\alpha_{l}\beta_{l}\,\mathbb{E}\!\!\left[\sum_{i,j\geq 0}\widehat{\phi_{i}}\widehat{\phi_{j}}\tilde{H}_{i}\!\left(\frac{\alpha~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\!\tilde{H}_{j}\!\left(\frac{\beta~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\!\right]
=\displaystyle= α,β{±1}kl=1kαlβli,j0ϕi^ϕj^𝔼[H~i(αxSk)H~j(βxSk)].\displaystyle\,\sum_{\alpha,\beta\in\{\pm 1\}^{k}}\prod_{l=1}^{k}\alpha_{l}\beta_{l}\!\sum_{i,j\geq 0}\widehat{\phi_{i}}\widehat{\phi_{j}}\,\mathbb{E}\!\left[\tilde{H}_{i}\!\left(\frac{\alpha~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\!\tilde{H}_{j}\!\left(\frac{\beta~{}{\cdot}~{}x_{S}}{\sqrt{k}}\right)\right].

Since x∼𝒩(0,I_k), the random variables ⟨α, x_S⟩/√k and ⟨β, x_S⟩/√k are each standard Gaussian with correlation ⟨α, β⟩/k, so we may apply the following well-known property of Hermite polynomials.

𝔼(a,b)T𝒩(0,(1ρρ1))H~i(a)H~j(b)=δi,jρi,\mathbb{E}_{(a,b)^{T}\sim\mathcal{N}\left(0,\bigl{(}\begin{smallmatrix}1&\rho\\ \rho&1\end{smallmatrix}\bigr{)}\right)}\tilde{H}_{i}(a)\tilde{H}_{j}(b)=\delta_{i,j}\rho^{i},

where δ_{i,j} is the Kronecker delta.

𝔼[g(x)2]=\displaystyle\mathbb{E}\left[g(x)^{2}\right]= α,β{±1}kl=1kαlβli0ϕi^2(αβk)i\displaystyle\,\sum_{\alpha,\beta\in\{\pm 1\}^{k}}\prod_{l=1}^{k}\alpha_{l}\beta_{l}\sum_{i\geq 0}\widehat{\phi_{i}}^{2}\left(\frac{\alpha~{}{\cdot}~{}\beta}{k}\right)^{i}
=\displaystyle= w,θ{±1}kl=1kwli0ϕi^2(l=1kwlk)i\displaystyle\,\sum_{w,\theta\in\{\pm 1\}^{k}}\prod_{l=1}^{k}w_{l}\sum_{i\geq 0}\widehat{\phi_{i}}^{2}\left(\frac{\sum_{l=1}^{k}w_{l}}{k}\right)^{i}
=\displaystyle=  2kw{±1}kl=1kwli0ϕi^2(l=1kwlk)i,\displaystyle\,2^{k}\sum_{w\in\{\pm 1\}^{k}}\prod_{l=1}^{k}w_{l}\sum_{i\geq 0}\widehat{\phi_{i}}^{2}\left(\frac{\sum_{l=1}^{k}w_{l}}{k}\right)^{i},

where w_l = α_l β_l and θ_l = w_l α_l. Note that 3.3 implies that ∑_{i=0}^{∞} ϕ̂_i² < ∞, so the series above is absolutely convergent. Then,

𝔼[g(x)2]\displaystyle\,\mathbb{E}\left[g(x)^{2}\right]
=\displaystyle=  2ki0ϕi^2w{±1}kl=1kwl(l=1kwlk)i\displaystyle\,2^{k}\sum_{i\geq 0}\widehat{\phi_{i}}^{2}\sum_{w\in\{\pm 1\}^{k}}\prod_{l=1}^{k}w_{l}\left(\frac{\sum_{l=1}^{k}w_{l}}{k}\right)^{i}
=\displaystyle=  2ki0ϕi^2kiw{±1}kl=1kwli1++ik=il=1kwlil(ii1,,ik)\displaystyle\,2^{k}\sum_{i\geq 0}\frac{\widehat{\phi_{i}}^{2}}{k^{i}}\sum_{w\in\{\pm 1\}^{k}}\prod_{l=1}^{k}w_{l}\sum_{i_{1}+\cdots+i_{k}=i}\prod_{l=1}^{k}w_{l}^{i_{l}}\binom{i}{i_{1},\dots,i_{k}}
=\displaystyle=  2ki0ϕi^2kii1++ik=i(ii1,,ik)w{±1}kl=1kwlil+1\displaystyle=\,2^{k}\sum_{i\geq 0}\frac{\widehat{\phi_{i}}^{2}}{k^{i}}\sum_{i_{1}+\cdots+i_{k}=i}\binom{i}{i_{1},\dots,i_{k}}\sum_{w\in\{\pm 1\}^{k}}\prod_{l=1}^{k}w_{l}^{i_{l}+1}
=\displaystyle=  2ki0ϕi^2kii1++ik=i(ii1,,ik)l=1k[1il+1+(1)il+1]\displaystyle\,2^{k}\sum_{i\geq 0}\frac{\widehat{\phi_{i}}^{2}}{k^{i}}\sum_{i_{1}+\cdots+i_{k}=i}\binom{i}{i_{1},\dots,i_{k}}\prod_{l=1}^{k}\left[1^{i_{l}+1}+(-1)^{i_{l}+1}\right]
=\displaystyle=  4ki0ϕi^2kii1++ik=ii1,,ik are odd(ii1,,ik)\displaystyle\,4^{k}\sum_{i\geq 0}\frac{\widehat{\phi_{i}}^{2}}{k^{i}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=i\\ i_{1},\dots,i_{k}\text{ are odd}\end{subarray}}\binom{i}{i_{1},\dots,i_{k}}

where we expanded (∑_{l=1}^{k} w_l)^i into its distinct monomials via the multinomial theorem. Note that ∑_{i_1+⋯+i_k=i, all i_l odd} \binom{i}{i_1,…,i_k} is always non-negative, and it is positive iff i ≥ k and i ≡ k (mod 2). ∎
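
As a quick numerical sanity check of Lemma A.1 (an illustrative sketch only, not used anywhere in the arguments; the choice ϕ = ReLU, k = 2 and all helper names below are ours), the following Python snippet compares a Monte Carlo estimate of 𝔼[g(x)²] against a truncation of the series, with the Hermite coefficients ϕ̂_i computed by Gauss-Hermite quadrature.

import itertools, math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

rng = np.random.default_rng(0)
k = 2

def phi(t):                                   # the activation; here phi = ReLU
    return np.maximum(t, 0.0)

# All sign patterns w in {+-1}^k and their characters chi(w) = prod_l w_l.
W = np.array(list(itertools.product((-1.0, 1.0), repeat=k)))
chi = W.prod(axis=1)

# Monte Carlo estimate of E[g(x)^2] for x ~ N(0, I_k),
# where g(x) = sum_w chi(w) * phi(<w, x> / sqrt(k)).
X = rng.standard_normal((500_000, k))
G = (chi * phi(X @ W.T / math.sqrt(k))).sum(axis=1)
print("Monte Carlo estimate of E[g^2]:", np.mean(G ** 2))

# Truncation of the series in Lemma A.1. The Hermite coefficients
# phi_hat_i = E_{z ~ N(0,1)}[phi(z) He_i(z) / sqrt(i!)] are computed by
# Gauss-HermiteE quadrature (weight function e^{-z^2/2}).
nodes, weights = hermegauss(200)
weights = weights / math.sqrt(2.0 * math.pi)  # so that sum(weights * f(nodes)) ~ E_{N(0,1)}[f]

def phi_hat(i):
    coeffs = np.zeros(i + 1); coeffs[i] = 1.0  # selects the i-th probabilists' Hermite polynomial
    return float(np.sum(weights * phi(nodes) * hermeval(nodes, coeffs))) / math.sqrt(math.factorial(i))

series = 0.0
for i in range(41):
    # sum of multinomial coefficients over compositions of i into k odd parts
    inner = sum(math.factorial(i) // math.prod(math.factorial(a) for a in parts)
                for parts in itertools.product(range(1, i + 1, 2), repeat=k)
                if sum(parts) == i)
    series += 4 ** k * phi_hat(i) ** 2 / k ** i * inner
print("Truncated series from Lemma A.1:", series)
# The two printed values should agree up to Monte Carlo noise and series truncation
# (within roughly a percent for this choice of phi and k).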

A.1 ReLU Activation

The goal of this section is to give a lower bound on ‖f‖ for ϕ = ReLU under the standard Gaussian distribution 𝒩(0,I). To this end, we prove an anticoncentration result for g. We first give a lower bound on ‖g‖ based on the Hermite coefficients of ϕ. If g were bounded, this alone would imply anticoncentration, as in Section A.2. Since it is not, we introduce g^T, in which every activation is truncated at some threshold T. We pick T large enough that g and g^T behave almost identically over 𝒩(0,I). We then show a lower bound on ‖g^T‖, translate it into an anticoncentration result for g^T, and finally into one for g.

Let T>0T>0 be some constant to be determined later. Let

ReLUT(x)=min(ReLU(x),T)\operatorname{ReLU}^{T}(x)=\min(\operatorname{ReLU}(x),T)

and

gT(x)=w{±1}kχ(w)ReLUT(xwk).g^{T}(x)=\sum\limits_{w\in\{\pm 1\}^{k}}\chi(w)\operatorname{ReLU}^{T}\left(\frac{x~{}{\cdot}~{}w}{\sqrt{k}}\right).

The following lemma from [GKK19] describes the Hermite coefficients of ReLU.

Lemma A.2.
ReLU(x)=i=0ciH~i(x)\displaystyle\operatorname{ReLU}(x)=\sum_{i=0}^{\infty}c_{i}\tilde{H}_{i}(x)

where

c0=12π,\displaystyle c_{0}=\sqrt{\frac{1}{2\pi}},\quad c1=12,\displaystyle c_{1}=\frac{1}{2},
c2i1=0,\displaystyle c_{2i-1}=0,\quad c2i=H2i(0)+2iH2i2(0)2π(2i)!for i2.\displaystyle c_{2i}=\frac{H_{2i}(0)+2iH_{2i-2}(0)}{\sqrt{2\pi(2i)!}}\quad\text{for }i\geq 2.

In particular, c2i2=Θ(i2.5)c_{2i}^{2}=\Theta(i^{-2.5}).
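
As a quick check of this closed form (an illustrative sketch only; we take H_j to be the unnormalized probabilists' Hermite polynomial, which matches the normalization H̃_i = H_i/√(i!) used in Lemma A.1), the following tabulates c_{2i} and the ratio i^{2.5}·c_{2i}².

import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def He0(n):
    # the (unnormalized) probabilists' Hermite polynomial He_n evaluated at 0
    coeffs = np.zeros(n + 1); coeffs[n] = 1.0
    return float(hermeval(0.0, coeffs))

def c_2i(i):
    # the closed form of Lemma A.2 (stated for i >= 2)
    return (He0(2 * i) + 2 * i * He0(2 * i - 2)) / math.sqrt(2 * math.pi * math.factorial(2 * i))

for i in range(2, 13):
    print(f"i = {i:2d}   c_2i = {c_2i(i):+.5f}   i^2.5 * c_2i^2 = {i ** 2.5 * c_2i(i) ** 2:.4f}")
# The last column stays bounded between absolute constants (drifting slowly downward from
# about 0.038 at i = 2 to about 0.024 at i = 12), consistent with c_{2i}^2 = Theta(i^{-2.5}).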

We can now derive a lower bound on the norm of gg.

Lemma A.3.

When kk is even,

g=Ω((4e)(12+o(1))k).\displaystyle\left\lVert g\right\rVert=\Omega\left(\left(\frac{4}{e}\right)^{(\frac{1}{2}+o(1))k}\right).
Proof.

Due to Lemma A.1,

𝔼[g(x)2]\displaystyle\mathbb{E}\left[g(x)^{2}\right] =4ki0ci2kii1++ik=ii1,,ik, are odd(ii1,,ik)\displaystyle=4^{k}\sum_{i\geq 0}\frac{c_{i}^{2}}{k^{i}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=i\\ i_{1},\dots,i_{k},\text{ are odd}\end{subarray}}\binom{i}{i_{1},\dots,i_{k}}
4kck2kki1++ik=ki1,,ik, are odd(ki1,,ik)\displaystyle\geq\frac{4^{k}c_{k}^{2}}{k^{k}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=k\\ i_{1},\dots,i_{k},\text{ are odd}\end{subarray}}\binom{k}{i_{1},\dots,i_{k}}
4kck2k!kk.\displaystyle\geq\frac{4^{k}c_{k}^{2}k!}{k^{k}}.

The lemma then follows from Stirling's approximation,

n!2πn(ne)n.\displaystyle n!\geq\sqrt{2\pi n}\left(\frac{n}{e}\right)^{n}.

and the bound on the Hermite coefficients,

ck2=Θ(k2.5).\displaystyle c_{k}^{2}=\Theta(k^{-2.5}).
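
Explicitly, combining these two facts (nothing beyond what was just cited is used): for even k,

𝔼[g(x)²] ≥ (4^k c_k² k!)/k^k ≥ 4^k · Θ(k^{−2.5}) · √(2πk) (k/e)^k / k^k = (4/e)^k · Θ(k^{−2}) = (4/e)^{(1+o(1))k},

and taking square roots gives ‖g‖ = Ω((4/e)^{(1/2+o(1))k}). ∎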

For the difference between g(x) and g^T(x), we have the following bound.

Lemma A.4.
ggT2keT24T2+1T2π\displaystyle\left\lVert g-g^{T}\right\rVert\leq 2^{k}\,e^{-\frac{T^{2}}{4}}\sqrt{T^{2}+1-\frac{T}{\sqrt{2\pi}}}
Proof.

Let ReLUw(x)\operatorname{ReLU}_{w}(x) be shorthand for ReLU(xwk)\operatorname{ReLU}(\frac{x~{}{\cdot}~{}w}{\sqrt{k}}), and similarly ReLUwT\operatorname{ReLU}_{w}^{T}. Observe that by the triangle inequality,

ggT\displaystyle\left\lVert g-g^{T}\right\rVert =w{±1}kχ(w)(ReLUwReLUwT)\displaystyle=\left\lVert\sum_{w\in\{\pm 1\}^{k}}\chi(w)\left(\operatorname{ReLU}_{w}-\operatorname{ReLU}_{w}^{T}\right)\right\rVert
w{±1}kReLUwReLUwT\displaystyle\leq\sum_{w\in\{\pm 1\}^{k}}\left\lVert\operatorname{ReLU}_{w}-\operatorname{ReLU}_{w}^{T}\right\rVert
=2kReLUReLUT𝒩(0,1),\displaystyle=2^{k}\left\lVert\operatorname{ReLU}-\operatorname{ReLU}^{T}\right\rVert_{\mathcal{N}(0,1)},

where the last equality holds because for any unit vector vv and x𝒩(0,I)x\sim\mathcal{N}(0,I), xvx~{}{\cdot}~{}v has the distribution 𝒩(0,1)\mathcal{N}(0,1). Now,

ReLUReLUT𝒩(0,1)2=T(xT)2p(x)𝑑x,\left\lVert\operatorname{ReLU}-\operatorname{ReLU}^{T}\right\rVert_{\mathcal{N}(0,1)}^{2}=\int_{T}^{\infty}(x-T)^{2}\,p(x)\,dx,

where p(x)p(x) is the probability density function of 𝒩(0,1)\mathcal{N}(0,1). Note that p(x)=xp(x)p^{\prime}(x)=-xp(x). We have

Tx2p(x)𝑑x=Txd(p(x))\displaystyle\int_{T}^{\infty}x^{2}p(x)dx=\int_{T}^{\infty}-x\,d(p(x))
=xp(x)|T+Tp(x)𝑑x\displaystyle\qquad=-x\,p(x)\bigg{|}_{T}^{\infty}+\int_{T}^{\infty}p(x)dx (integration by parts)
=Tp(T)+x𝒩(0,1)(x>T),\displaystyle\qquad=T\,p(T)+\mathbb{P}_{x\sim_{\mathcal{N}(0,1)}}(x>T),
Txp(x)𝑑x=p(x)|T=p(T),\displaystyle\int_{T}^{\infty}x\,p(x)dx=-p(x)\bigg{|}_{T}^{\infty}=p(T),
Tp(x)𝑑x=x𝒩(0,1)(x>T)eT22.\displaystyle\int_{T}^{\infty}p(x)dx=\mathbb{P}_{x\sim_{\mathcal{N}(0,1)}}(x>T)\leq e^{-\frac{T^{2}}{2}}.

Thus,

𝔼[(g(x)−g^T(x))²]
4k[(T2+1)x𝒩(0,1)(x>T)Tp(T)]\displaystyle\leq 4^{k}\,\left[(T^{2}+1)\mathbb{P}_{x\sim\mathcal{N}(0,1)}(x>T)-T\,p(T)\right]
4keT22(T2+1T2π).\displaystyle\leq 4^{k}\,e^{-\frac{T^{2}}{2}}\left(T^{2}+1-\frac{T}{\sqrt{2\pi}}\right).

Lemma A.5.
[g(x)gT(x)]2keT22.\displaystyle\mathbb{P}[g(x)\neq g^{T}(x)]\leq 2^{k}\,e^{-\frac{T^{2}}{2}}.
Proof.

For any w{±1}kw\in\{\pm 1\}^{k},

x𝒩(0,I)[ReLU(xwk)ReLUT(xwk)]\displaystyle\mathbb{P}_{x\sim\mathcal{N}(0,I)}\left[\operatorname{ReLU}(\frac{x~{}{\cdot}~{}w}{\sqrt{k}})\neq\operatorname{ReLU}^{T}(\frac{x~{}{\cdot}~{}w}{\sqrt{k}})\right]
=t𝒩(0,1)[t>T]\displaystyle=\mathbb{P}_{t\sim\mathcal{N}(0,1)}[t>T]
eT22.\displaystyle\leq e^{-\frac{T^{2}}{2}}.

The lemma follows by a union bound. ∎
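
For concreteness (a worked instantiation; the proofs below only use that T = Ω(k) with a sufficiently large constant): taking T = k, Lemma A.4 gives

‖g − g^T‖ ≤ 2^k e^{−k²/4} √(k² + 1 − k/√(2π)) = e^{−Ω(k²)},

which is negligible compared to the bound ‖g‖ = Ω((4/e)^{(1/2+o(1))k}) of Lemma A.3, and Lemma A.5 gives ℙ[g(x) ≠ g^T(x)] ≤ 2^k e^{−k²/2} = e^{−Ω(k²)}.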

Lemma A.6.
[|g(x)|1]=Ω(exp(Θ(k))).\displaystyle\mathbb{P}\left[\left|{g(x)}\right|\geq 1\right]=\Omega(\exp(-\Theta(k))).
Proof.

For large enough T = Ω(k), Lemmas A.3 and A.4 together with the triangle inequality ‖g^T‖ ≥ ‖g‖ − ‖g − g^T‖ imply that

gT=Ω((4e)(12+o(1))k).\displaystyle\left\lVert g^{T}\right\rVert=\Omega\left(\left(\frac{4}{e}\right)^{(\frac{1}{2}+o(1))k}\right).

Since |gT(x)|T 2k\left|{g^{T}(x)}\right|\leq T\,2^{k},

gT2=𝔼[gT(x)2]1[|gT(x)|1]+(T2k)2[|gT(x)|1],\left\lVert g^{T}\right\rVert^{2}=\mathbb{E}[g^{T}(x)^{2}]\leq 1\cdot\mathbb{P}[|g^{T}(x)|\leq 1]+(T2^{k})^{2}\cdot\mathbb{P}[|g^{T}(x)|\geq 1],

so that

[|gT(x)|1]=Ω((4e)(1+o(1))k)1(T 2k)2=Ω(exp(Θ(k)))\mathbb{P}\left[\left|{g^{T}(x)}\right|\geq 1\right]=\dfrac{\Omega\Big{(}\left(\frac{4}{e}\right)^{(1+o(1))k}\Big{)}-1}{(T\,2^{k})^{2}}=\Omega(\exp(-\Theta(k))) (2)

Using Eq. 2 with Lemma A.5,

[|g(x)|1]=Ω(exp(Θ(k)))\displaystyle\mathbb{P}\left[\left|{g(x)}\right|\geq 1\right]=\Omega(\exp(-\Theta(k)))

for large enough T=Ω(k)T=\Omega(k). ∎

The lower bound on f\left\lVert f\right\rVert now follows easily.

Corollary A.7.
f=Ω(exp(Θ(k))).\displaystyle\|f\|=\Omega(\exp(-\Theta(k))).
Proof.

Since f = ψ∘g and ψ is odd and increasing, we have, using Lemma A.6,

‖f‖ ≥ 𝔼[|f(x)|] ≥ |ψ(1)| ℙ[g(x) ≥ 1] + |ψ(−1)| ℙ[g(x) ≤ −1] = ψ(1) ℙ[|g(x)| ≥ 1] = Ω(exp(−Θ(k))),

where the first inequality holds because the L2 norm dominates the L1 norm under a probability measure. ∎

A.2 Sigmoid Activation

Here we consider g and f with ϕ(x) = σ(x) = 1/(1+e^{−x}). To bound the asymptotics of the Hermite coefficients of σ, we need the following theorem from [Boy84].

Theorem A.8.

For a function f(z)f(z) whose convergence is limited by simple poles at the roots of z2=γ2z^{2}=-\gamma^{2} with residue RR, the non-zero expansion coefficients {an}\{a_{n}\} of f(z)f(z) as a series of normalized Hermite functions have magnitudes asymptotically given by

|an|254π12Rn14eγ(2n+1)12,\left|{a_{n}}\right|\sim 2^{\frac{5}{4}}\,\pi^{\frac{1}{2}}\,R\,n^{-\frac{1}{4}}\,e^{-\gamma(2n+1)^{\frac{1}{2}}},

Here the normalized Hermite functions {ψ_n(x)}_{n∈ℕ} are defined by

ψn(z)=ez22π14H~n(2z).\psi_{n}(z)=e^{-\frac{z^{2}}{2}}\pi^{-\frac{1}{4}}\tilde{H}_{n}(\sqrt{2}z).

Applying this theorem to f(x) = e^{−x²/2} σ(√2 x), and translating the coefficients of the resulting Hermite-function series into coefficients of the Hermite-polynomial series, we obtain the following.

Lemma A.9.
σ(x)=i=0ciH~i(x),\sigma(x)=\sum_{i=0}^{\infty}c_{i}\tilde{H}_{i}(x),

where c_0 = 0.5, c_{2i} = 0 for i ≥ 1, and every non-zero odd coefficient satisfies

c2i1=eΘ(i).c_{2i-1}=e^{-\Theta(\sqrt{i})}.
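
As a numerical illustration of Lemma A.9 (a sketch only; the quadrature approach and parameter choices are ours, and nothing below is used in the proofs), the following computes the Hermite coefficients of the sigmoid and displays the behaviour asserted in the lemma.

import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

nodes, weights = hermegauss(200)              # integrates f(z) e^{-z^2/2} dz
weights = weights / math.sqrt(2.0 * math.pi)  # so that sum(weights * f(nodes)) ~ E_{N(0,1)}[f]
sig = 1.0 / (1.0 + np.exp(-nodes))            # sigmoid evaluated at the quadrature nodes

def c(j):
    # j-th normalized (probabilists') Hermite coefficient of the sigmoid
    coeffs = np.zeros(j + 1); coeffs[j] = 1.0
    return float(np.sum(weights * sig * hermeval(nodes, coeffs))) / math.sqrt(math.factorial(j))

print("c_0 =", c(0), "  c_2, c_4, c_6 =", c(2), c(4), c(6))   # 0.5 and numerical zeros
for i in range(1, 13):
    j = 2 * i - 1
    print(f"i = {i:2d}   sqrt(i) = {math.sqrt(i):.3f}   -ln|c_{j}| = {-math.log(abs(c(j))):.3f}")
# c_0 = 1/2 and the even coefficients beyond c_0 vanish (sigma(z) - 1/2 is odd); the odd
# coefficients decay subexponentially, consistent with c_{2i-1} = e^{-Theta(sqrt(i))}.
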
Corollary A.10.

There is an infinite increasing sequence {ki}i\{k_{i}\}_{i\in{\mathbb{N}}} such that kik_{i}’s are all odd and

cki=eΘ(ki).c_{k_{i}}=e^{-\Theta(\sqrt{k_{i}})}.
Proof.

This follows from Lemma A.9 together with the fact that σ is not a polynomial, so that infinitely many of the c_k must be non-zero. ∎

Remark A.11.

Experimental evidence strongly indicates that in fact all odd Hermite coefficients of sigmoid are nonzero and decay as above, but this is laborious to formally establish. So we state our norm lower bound only for k{ki}ik\in\{k_{i}\}_{i\in{\mathbb{N}}} (and the associated n{2ki}in\in\{2^{k_{i}}\}_{i\in{\mathbb{N}}}, since we end up taking k=lognk=\log n). Since this is nevertheless an infinite sequence, it still establishes that no better asymptotic bound holds.

Similarly to Lemma A.3, we can derive a lower bound on ‖g‖ for such k.

Lemma A.12.

For k{ki}ik\in\{k_{i}\}_{i\in{\mathbb{N}}},

g(x)=Ω((4e)(12+o(1))k).\left\lVert g(x)\right\rVert=\Omega\left(\left(\frac{4}{e}\right)^{(\frac{1}{2}+o(1))k}\right).
Proof.

Due to Lemma A.1,

𝔼[g(x)2]\displaystyle\mathbb{E}\left[g(x)^{2}\right] =4ki0ci2kii1++ik=ii1,,ik, are odd(ii1ik)\displaystyle=4^{k}\sum_{i\geq 0}\frac{c_{i}^{2}}{k^{i}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=i\\ i_{1},\cdots,i_{k},\text{ are odd}\end{subarray}}\binom{i}{i_{1}\cdots i_{k}}
4kck2kki1++ik=ki1,,ik, are odd(ki1ik)\displaystyle\geq\frac{4^{k}c_{k}^{2}}{k^{k}}\sum_{\begin{subarray}{c}i_{1}+\cdots+i_{k}=k\\ i_{1},\cdots,i_{k},\text{ are odd}\end{subarray}}\binom{k}{i_{1}\cdots i_{k}}
4kck2k!kk.\displaystyle\geq\frac{4^{k}c_{k}^{2}k!}{k^{k}}.

Using Stirling’s approximation,

n!2πn(ne)n,n!\geq\sqrt{2\pi n}\left(\frac{n}{e}\right)^{n},

and Corollary A.10,

ck=eΘ(k),c_{k}=e^{-\Theta(\sqrt{k})},

we obtain

𝔼[g(x)2]=Ω(4k2πkkk(ke)keΘ(k))\mathbb{E}\left[g(x)^{2}\right]=\Omega\left(\frac{4^{k}\sqrt{2\pi k}}{k^{k}}\left(\frac{k}{e}\right)^{k}e^{-\Theta(\sqrt{k})}\right)

and hence

𝔼[g(x)2]=Ω((4e)(1+o(1))k).\mathbb{E}\left[g(x)^{2}\right]=\Omega\left(\left(\frac{4}{e}\right)^{(1+o(1))k}\right).

Lemma A.13.

For k{ki}ik\in\{k_{i}\}_{i\in{\mathbb{N}}},

(|g(x)|1)=Ω(exp(Θ(k))).\mathbb{P}\left(\left|{g(x)}\right|\geq 1\right)=\Omega(\exp(-\Theta(k))).
Proof.

Since |g(x)|2k\left|{g(x)}\right|\leq 2^{k},

g2=𝔼[g(x)2]1[|g(x)|1]+(2k)2[|g(x)|1],\left\lVert g\right\rVert^{2}=\mathbb{E}[g(x)^{2}]\leq 1\cdot\mathbb{P}[|g(x)|\leq 1]+(2^{k})^{2}\cdot\mathbb{P}[|g(x)|\geq 1],

and so

(|g(x)|1)=Ω((4e)(1+o(1))k)1(2k)2.\mathbb{P}\left(\left|{g(x)}\right|\geq 1\right)=\dfrac{\Omega\Big{(}\big{(}\frac{4}{e}\big{)}^{(1+o(1))k}\Big{)}-1}{(2^{k})^{2}}.

The lemma then follows. ∎

Using the same argument as Corollary A.7, we have the following bound.

Corollary A.14.
f=Ω(exp(Θ(k))).\|f\|=\Omega(\exp(-\Theta(k))).

A.3 General activations

It is not hard to see that the norm analysis for ReLU and sigmoid extends to any activation function for which a suitable lower bound on the Hermite coefficients holds, and which is either bounded or grows at most polynomially, so that under the standard Gaussian it behaves essentially identically to its truncated form. In particular, a lower bound of α^{−j} on the square of the j-th Hermite coefficient, for any constant α < 4/e, suffices to give ‖g‖ ≥ exp(Θ(k)) by the same argument as in Lemmas A.3 and A.12 (spelled out below). This in turn gives ‖f‖ ≥ exp(−Θ(k)), as above.
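
In more detail: if the k-th Hermite coefficient ĉ_k of the activation (for k of the appropriate parity) satisfies ĉ_k² ≥ α^{−k}, then exactly as in the proof of Lemma A.3,

𝔼[g(x)²] ≥ 4^k ĉ_k² k!/k^k ≥ (4/(eα))^k √(2πk) = exp(Θ(k))

whenever α < 4/e, and the truncation and anticoncentration steps go through as before.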

In fact, even a very weak lower bound on f\|f\| yields some superpolynomial bound on learning. Suppose we only had f1/exp(exp(Θ(k)))\|f\|\geq 1/\exp(\exp(\Theta(k))), for instance. Then we can take k=loglognk=\log\log n and have f1/poly(n)\|f\|\geq 1/\operatorname{poly}(n) and still obtain a lower bound of nloglogn=nω(1)n^{\log\log n}=n^{\omega(1)} (see Theorem 3.9). Any lower bound on f\|f\| will be a function only of kk, so a similar argument applies.
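
To make this concrete (a worked instance, writing c > 0 for the constant implicit in the Θ(·) and using natural logarithms): if ‖f‖ ≥ exp(−e^{ck}), then choosing k = (1/c) ln ln n gives

‖f‖ ≥ exp(−e^{ck}) = exp(−ln n) = 1/n ≥ 1/poly(n),

while the query lower bound of Theorem 3.9 remains of order n^{Θ(ln ln n)} = n^{ω(1)}.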

Appendix B Proof of the SQ lower bound for real-valued functions

We give a self-contained variant of the elegant proof of [Szö09] for the reader’s convenience. For simplicity, we include the 0 function in our class 𝒞{\mathcal{C}} — this can only negligibly change the SDA, and it makes the core argument cleaner.

Theorem B.1.

Let 𝒞 be a real-valued concept class over a domain X and let D be a distribution on X, such that 0 ∈ 𝒞 and ‖c‖_D > ϵ for every nonzero c ∈ 𝒞. Consider any SQ learner that is allowed to make only inner-product queries to an SQ oracle for the labeled distribution D_c, for some unknown c ∈ 𝒞. Let d = SDA_D(𝒞, γ). Then any such learner needs at least d/2 queries of tolerance √γ to learn 𝒞 up to L2 error ϵ.

Proof.

Consider the adversarial strategy where we respond to every query h:Xh:X\to{\mathbb{R}} (hD1\|h\|_{D}\leq 1) with 0. This corresponds to the true expectation if the target were the 0 function. By the norm lower bound, outputting any other cc would then mean L2L_{2} error greater than ϵ\epsilon. Thus we must rule out all other c𝒞c\in{\mathcal{C}}.

Let τ=γ\tau=\sqrt{\gamma}. If hkh_{k} is the kthk^{\text{th}} query, let Sk={c𝒞c,hkD>τ}S_{k}=\{c\in{\mathcal{C}}\mid\langle c,h_{k}\rangle_{D}>\tau\} be the functions ruled out by our response of 0. (A similar argument will hold for Sk={c𝒞c,hkD<τ}S_{k}^{\prime}=\{c\in{\mathcal{C}}\mid\langle c,h_{k}\rangle_{D}<-\tau\}.) Let Φ=hk,cSkcD\Phi=\langle h_{k},\sum_{c\in S_{k}}c\rangle_{D}. We claim that |Sk||𝒞|/d\left|{S_{k}}\right|\leq\left|{{\mathcal{C}}}\right|/d. Suppose not. Then ρD(Sk)γ\rho_{D}(S_{k})\leq\gamma by Definition 2.4, and

Φ\displaystyle\Phi hkDcSkcD\displaystyle\leq\|h_{k}\|_{D}\left\lVert\sum_{c\in S_{k}}c\right\rVert_{D}
c,cSkc,cD\displaystyle\leq\sqrt{\sum_{c,c^{\prime}\in S_{k}}\langle c,c^{\prime}\rangle_{D}}
=|Sk|2ρD(Sk)\displaystyle=\sqrt{\left|{S_{k}}\right|^{2}\rho_{D}(S_{k})}
γ|Sk|,\displaystyle\leq\sqrt{\gamma}|S_{k}|,

contradicting the fact that Φ>|Sk|τ\Phi>|S_{k}|\tau by definition of SkS_{k}.

Similarly |Sk|=|{c𝒞c,hkD<τ}||𝒞|/d|S_{k}^{\prime}|=|\{c\in{\mathcal{C}}\mid\langle c,h_{k}\rangle_{D}<-\tau\}|\leq|{\mathcal{C}}|/d. Thus we rule out at most a 2/d2/d fraction of functions with each query, and hence need at least d/2d/2 queries to rule out all other possibilities. ∎
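
The counting at the heart of this proof can be seen in a toy numerical example (an illustration only, not the SDA machinery of Definition 2.4: we use the standard basis of ℝ^dim, an exactly orthogonal family, as a stand-in for a class with small pairwise correlation, and all names and parameters below are ours). Since ∑_c ⟨h, c⟩² ≤ ‖h‖² for an orthonormal family, a single unit-norm query can exceed a threshold τ on at most 1/τ² concepts.

import numpy as np

rng = np.random.default_rng(0)
dim, tau = 4_000, 0.1
# Toy "concept class": the standard basis vectors e_1, ..., e_dim of R^dim, an exactly
# orthogonal stand-in for a class with small pairwise correlation.

def hits(h):
    # number of concepts c = e_i with <h, c> > tau; here <h, e_i> is simply h[i]
    return int(np.sum(h > tau))

# Query 1: a random unit-norm query, which correlates with essentially no concept.
h1 = rng.standard_normal(dim); h1 /= np.linalg.norm(h1)
# Query 2: a unit-norm query deliberately spread over the first m concepts.
m = 80
h2 = np.zeros(dim); h2[:m] = 1.0 / np.sqrt(m)   # <h2, e_i> ~ 0.112 > tau for i < m

for name, h in [("random query", h1), ("spread query", h2)]:
    print(f"{name}: correlates above tau with {hits(h)} of {dim} concepts "
          f"(Bessel bound 1/tau^2 = {round(1 / tau ** 2)})")
# Since sum_i <h, e_i>^2 = ||h||^2 = 1, no unit-norm query can exceed tau on more than
# 1/tau^2 = 100 concepts; each query therefore rules out only a vanishing fraction of a
# large class, which is the engine of the d/2 lower bound above.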