
Minimum complexity interpolation in random features models

Michael Celentano, Theodor Misiakiewicz, Andrea Montanari
Department of Statistics, University of California, Berkeley; Department of Electrical Engineering, Stanford University
Abstract

Despite their many appealing properties, kernel methods are heavily affected by the curse of dimensionality. For instance, in the case of inner product kernels in ${\mathbb{R}}^{d}$, the Reproducing Kernel Hilbert Space (RKHS) norm is often very large for functions that depend strongly on a small subset of directions (ridge functions). As a consequence, such functions are difficult to learn using kernel methods. This observation has motivated the study of generalizations of kernel methods, whereby the RKHS norm —which is equivalent to a weighted $\ell_{2}$ norm— is replaced by a weighted functional $\ell_{p}$ norm, which we refer to as the ${\mathcal{F}}_{p}$ norm. Unfortunately, tractability of these approaches is unclear. The kernel trick is not available and minimizing these norms requires solving an infinite-dimensional convex problem.

We study random features approximations to these norms and show that, for $p>1$, the number of random features required to approximate the original learning problem is upper bounded by a polynomial in the sample size. Hence, learning with ${\mathcal{F}}_{p}$ norms is tractable in these cases. We introduce a proof technique based on uniform concentration in the dual, which can be of broader interest in the study of overparametrized models. For $p=1$, our guarantees for the random features approximation break down. We prove instead that learning with the ${\mathcal{F}}_{1}$ norm is $\mathsf{NP}$-hard under a randomized reduction based on the problem of learning halfspaces with noise.

1 Introduction

1.1 Background: Kernel methods and the curse of dimensionality

Kernel methods are among the most fundamental tools in machine learning. Over the last two years they have attracted renewed interest because of their connection with neural networks in the linear regime (a.k.a. neural tangent kernel or lazy regime) [JGH18, LR20, LRZ19, GMMM19].

Consider a general covariates space, namely a probability space $(\mathcal{X},\mathbb{P})$ (our results will concern the case $\mathcal{X}={\mathbb{R}}^{d}$, but it is useful to start from a more general viewpoint). A reproducing kernel Hilbert space (RKHS) is usually constructed starting from a positive semidefinite kernel $K:\mathcal{X}\times\mathcal{X}\to{\mathbb{R}}$. Here it will be more convenient to start from a weight space, i.e. a probability space $({\mathcal{V}},\mu)$, and a featurization map $\phi(\cdot;{\bm{w}}):\mathcal{X}\to{\mathbb{R}}$ parametrized by the weight vector ${\bm{w}}\in{\mathcal{V}}$. (A sufficient condition for this construction to be well defined is $\phi\in L^{2}(\mathbb{P}\otimes\mu)$.) The RKHS is then defined as the space of functions of the form

\hat{f}({\bm{x}};a)=\int_{{\mathcal{V}}}\phi({\bm{x}};{\bm{w}})\,a({\bm{w}})\,\mu({\rm d}{\bm{w}})\,,   (1)

with $a\in L^{2}({\mathcal{V}};\mu)$. To give a concrete example, we might consider ${\mathcal{V}}={\mathbb{R}}^{d}$ and $\phi({\bm{x}};{\bm{w}})=\sigma(\langle{\bm{w}},{\bm{x}}\rangle)$: this is the featurization map arising from two-layer neural networks with random first-layer weights.

The radius-$R$ ball in this space is defined as

{\mathcal{F}}_{2}(R)=\Big\{\hat{f}({\bm{x}};a)=\int_{{\mathcal{V}}}\phi({\bm{x}};{\bm{w}})\,a({\bm{w}})\,\mu({\rm d}{\bm{w}}):\;\;\int_{{\mathcal{V}}}|a({\bm{w}})|^{2}\mu({\rm d}{\bm{w}})\leq R^{2}\Big\}\,.   (2)

This construction is equivalent to the more standard one, with associated kernel $K({\bm{x}}_{1},{\bm{x}}_{2}):=\int_{{\mathcal{V}}}\phi({\bm{x}}_{1};{\bm{w}})\phi({\bm{x}}_{2};{\bm{w}})\,\mu({\rm d}{\bm{w}})$. Vice versa, for any positive semidefinite kernel $K$, we can construct a corresponding featurization map $\phi$ and proceed as above.

It is a basic fact that, although ${\mathcal{F}}_{2}(R)$ is infinite-dimensional, learning in ${\mathcal{F}}_{2}(R)$ can be performed efficiently [SSBD14]: it is useful to recall the reason here. Consider a supervised learning setting in which we are given $n$ samples $(y_{i},{\bm{x}}_{i})$, $i\leq n$, with $y_{i}\in\mathbb{R}$ and ${\bm{x}}_{i}\in\mathcal{X}$. Given a convex loss $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$, the RKHS-norm regularized empirical risk minimization problem reads:

\mbox{minimize}\;\;\;\sum_{i=1}^{n}\ell\big(y_{i},\hat{f}({\bm{x}}_{i};a)\big)+\lambda\int_{{\mathcal{V}}}|a({\bm{w}})|^{2}\mu({\rm d}{\bm{w}})\,.   (3)

Conceptually, this problem can be solved in two steps: $(1)$ find the minimum RKHS norm subject to the interpolation constraints $\hat{f}({\bm{x}}_{i};a)=\hat{y}_{i}$, and $(2)$ minimize the sum of this quantity and the empirical loss $\sum_{i=1}^{n}\ell\big(y_{i},\hat{y}_{i}\big)$. Since step $(2)$ is convex and finite-dimensional, the critical problem is the interpolation problem:

\mbox{minimize} \;\;\; \int_{{\mathcal{V}}}|a({\bm{w}})|^{2}\mu({\rm d}{\bm{w}})\,,
\mbox{subj. to} \;\;\; \hat{f}({\bm{x}}_{i};a)=\hat{y}_{i},\;\;\;\forall i\leq n\,.

While this is an infinite-dimensional problem, convex duality (the 'representer theorem') guarantees that the solution belongs to a fixed $n$-dimensional subspace, $a_{*}({\bm{w}})=\sum_{i=1}^{n}c_{i}\phi({\bm{x}}_{i};{\bm{w}})$.
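To make the reduction concrete, here is a minimal numerical sketch of the representer-theorem argument: the minimum RKHS norm interpolator is obtained by solving an $n\times n$ linear system in the coefficients $c_{i}$. The Gaussian kernel, the synthetic target, and the small jitter term below are illustrative assumptions, not part of the setting analyzed in this paper.

```python
import numpy as np

def gaussian_kernel(X1, X2, bandwidth=1.0):
    # K(x1, x2) = exp(-||x1 - x2||^2 / (2 * bandwidth^2)); a stand-in for K.
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def min_rkhs_norm_interpolator(X, y, bandwidth=1.0, jitter=1e-10):
    # Representer theorem: the minimum RKHS norm interpolator has the form
    # f(x) = sum_i c_i K(x, x_i), with coefficients solving K c = y.
    K = gaussian_kernel(X, X, bandwidth)
    c = np.linalg.solve(K + jitter * np.eye(len(y)), y)
    return lambda X_new: gaussian_kernel(X_new, X, bandwidth) @ c

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = np.tanh(X @ rng.normal(size=d))           # an arbitrary smooth target
f_hat = min_rkhs_norm_interpolator(X, y)
print("max training residual:", np.abs(f_hat(X) - y).max())
```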

Unfortunately, kernel methods suffer from the curse of dimensionality. To give a simple example, consider ${\bm{x}}\sim{\rm Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, ${\bm{w}}\sim{\rm Unif}(\mathbb{S}^{d-1}(1))$, and $\phi({\bm{x}};{\bm{w}})=\sigma(\langle{\bm{w}},{\bm{x}}\rangle)$ with $\sigma:{\mathbb{R}}\to{\mathbb{R}}$ a non-polynomial activation function. Consider noiseless data $y_{i}=f_{*}({\bm{x}}_{i})$, with $f_{*}({\bm{x}})=\sigma(\langle{\bm{w}}_{*},{\bm{x}}\rangle)$ for a fixed ${\bm{w}}_{*}\in\mathbb{S}^{d-1}(1)$. Then $\|f_{*}\|_{K}=\infty$ (with $\|\,\cdot\,\|_{K}$ denoting the RKHS norm). Correspondingly, [YS19, GMMM19] show that for any fixed $k$, if $n\leq d^{k}$, then any kernel method of the form (3) returns $\hat{f}$ with $\|\hat{f}-f_{*}\|_{L^{2}(\mathbb{P})}$ bounded away from zero.

The curse of dimensionality suggests seeking functions of the form (1) where $a$ is sparse, in a suitable sense [Bac17]. In this paper we consider a generalization in which the RKHS ball of Eq. (2) is replaced by

{\mathcal{F}}_{p}(R)=\Big\{\hat{f}({\bm{x}};a)=\int_{{\mathcal{V}}}\phi({\bm{x}};{\bm{w}})\,a({\bm{w}})\,\mu({\rm d}{\bm{w}}):\;\;\Big(\int_{{\mathcal{V}}}|a({\bm{w}})|^{p}\mu({\rm d}{\bm{w}})\Big)^{1/p}\leq R\Big\}\,.   (4)

For $p\in[1,2)$ this comprises a richer function class than the original ${\mathcal{F}}_{2}(R)$, since ${\mathcal{F}}_{2}(R)\subset{\mathcal{F}}_{p}(R)\subset{\mathcal{F}}_{1}(R)$, and it is easy to see that the inclusion is strict. The case $p=1$ (i.e. $\rho(x)=|x|$ below) is also known as a 'convex neural network' [Bac17].

Although the penalty $\int_{{\mathcal{V}}}|a({\bm{w}})|^{p}\mu({\rm d}{\bm{w}})$ is convex, it is far from clear that the learning problem is tractable. Indeed, for $p\neq 2$ the classical representer theorem no longer holds, and we cannot reduce the infinite-dimensional problem to solving a finite-dimensional quadratic program (see Appendix E.2 for a representer-type theorem for $p>1$).

By the same reduction discussed above, it is sufficient to consider the following minimum-complexity interpolation problem:

\mbox{minimize} \;\;\; \int_{{\mathcal{V}}}\rho(a({\bm{w}}))\mu({\rm d}{\bm{w}})\,,   (5)
\mbox{subj. to} \;\;\; \hat{f}({\bm{x}}_{i};a)=y_{i},\;\;\;\forall i\leq n\,.

(We denote the values to be interpolated by $y_{i}$ instead of $\hat{y}_{i}$ because we will focus on the interpolation problem hereafter.) We will take $\rho(x)$ to be a convex function minimized at $x=0$. We establish two main results. First, we establish tractability for a subset of strictly convex penalties $\rho$ which include, as special cases, $\rho(x)=|x|^{p}/p$, $p>1$. Our approach is based on a random features approximation which we discuss next. Second, we establish $\mathsf{NP}$-hardness under randomized reduction for the case $\rho(x)=|x|$.

1.2 Random features approximation

We sample independently ${\bm{w}}_{j}\sim\mu$, $j\leq N$, and fit a model

\hat{f}_{N}({\bm{x}};{\bm{a}})=\frac{1}{N}\sum_{j=1}^{N}a_{j}\phi({\bm{x}};{\bm{w}}_{j})\,.   (6)

We determine the coefficients ${\bm{a}}=(a_{j})_{j\leq N}$ by solving the interpolation problem

\mbox{minimize} \;\;\; \sum_{j=1}^{N}\rho(a_{j})\,,   (7)
\mbox{subj. to} \;\;\; \hat{f}_{N}({\bm{x}}_{i};{\bm{a}})=y_{i},\;\;\;\forall i\leq n\,.

Notice that this is equivalent to replacing the measure $\mu$ in Eq. (5) by its empirical version $\hat{\mu}_{N}=N^{-1}\sum_{i=1}^{N}\delta_{{\bm{w}}_{i}}$. Borrowing the terminology from neural network theory, we will refer to (7) as the "finite width" problem, and to the original problem (5) as the "infinite width" problem. We will denote by $\hat{\bm{a}}_{N}$ the solution of the finite width problem (7) and by $\hat{a}$ the solution of the infinite width problem (5).
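For concreteness, here is a minimal sketch of the finite width problem (7) as a convex program, assuming ReLU features, Gaussian weights, and the penalty $\rho(x)=|x|^{p}/p$; the problem sizes and the use of cvxpy are illustrative choices, not prescribed by the paper. Since $t\mapsto t^{p}/p$ is increasing, minimizing $\|{\bm{a}}\|_{p}$ yields the same minimizer as minimizing $\sum_{j}\rho(a_{j})$.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, N, p = 40, 10, 400, 1.5

# Synthetic data: x_i uniform on the sphere of radius sqrt(d), y_i = ReLU(<w_*, x_i>).
X = rng.normal(size=(n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
y = np.maximum(X @ w_star, 0.0)

# Random features phi(x; w_j) = ReLU(<w_j, x>), with w_j drawn i.i.d. from N(0, I_d / d).
W = rng.normal(size=(N, d)) / np.sqrt(d)
Phi = np.maximum(X @ W.T, 0.0)                # n x N matrix, Phi[i, j] = phi(x_i; w_j)

# Finite width problem (7): minimize sum_j rho(a_j) subject to f_N(x_i; a) = y_i.
a = cp.Variable(N)
problem = cp.Problem(cp.Minimize(cp.pnorm(a, p)), [Phi @ a / N == y])
problem.solve()
print("interpolation residual:", np.abs(Phi @ a.value / N - y).max())
```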

Our main result establishes that, under suitable assumptions on the penalty $\rho$ and the featurization map $\phi$, the random features approach provides a good approximation of the minimum complexity interpolation problem (5). Crucially, this is achieved with a number of features that is polynomial in the sample size $n$. In particular, we show that for $\rho(x)=|x|^{p}/p$, $1<p\leq 2$, setting $Q=p/(p-1)$ (and assuming $\|{\bm{y}}\|_{2}\leq C\sqrt{n}$),

\|\hat{f}_{N}(\cdot;\hat{\bm{a}}_{N})-\hat{f}(\cdot;\hat{a})\|_{L^{2}}\leq C\Big(\sqrt{\frac{n^{2}\log(N)}{N}}\vee\frac{(n\log N)^{(Q+1)/2}}{N}\Big)\,.   (8)

Hence, for $N\gg(n\log n)^{2}\vee(n\log n)^{(Q+1)/2}$, the random features approach yields a function that is a good approximation of the original problem (5).

Let us emphasize that the scaling of the number of random features $N$ in our bound is not optimal. In particular, for the RKHS case $p=2$, the above result requires $N\gg n^{2}$, while we know from [MMM21] that $N\gg n\log n$ features are often sufficient. We also note that the exponent $Q$ in the polynomial diverges as $p\to 1$. Hence, our results do not guarantee tractability for the case $p=1$. Indeed this is a fundamental limitation: as discussed in Section 4, a bound such as Eq. (8) with finite $Q$ cannot hold for $p=1$, under standard hardness assumptions. We show instead that no polynomial-time algorithm is guaranteed to achieve accuracy $n^{-C}$ in Eq. (8), for some absolute constant $C$. This hardness result is based on a randomized reduction from the problem of learning halfspaces with noise, which was proved to be $\mathsf{NP}$-hard in [FGKP06, GR09].

1.3 Dual problem and its concentration properties

Our proof that the random features model $\hat{f}_{N}(\cdot;\hat{\bm{a}}_{N})$ is a good approximation of the infinite width model $\hat{f}(\cdot;\hat{a})$ (cf. Eq. (8)) is based on a simple approach which is potentially of independent interest. We notice that, while the optimization problems (5) and (7) are overparametrized and hence difficult to control, their duals are underparametrized and can be studied using uniform convergence arguments.

Let $\rho^{*}$ be the convex conjugate of $\rho$:

\rho^{*}(x)=\sup_{y\in\mathbb{R}}\{xy-\rho(y)\}\,.

Then, the dual problems of (5) and (7) are given —respectively— by the following optimization problems over ${\bm{\lambda}}\in\mathbb{R}^{n}$:

\mbox{maximize}\;\;\;\;\langle{\bm{\lambda}},{\bm{y}}\rangle-\int_{\mathcal{V}}\rho^{*}(\langle\bm{\phi}_{n}({\bm{w}}),{\bm{\lambda}}\rangle)\mu({\rm d}{\bm{w}})\,,   (9)
\mbox{maximize}\;\;\;\;\langle{\bm{\lambda}},{\bm{y}}\rangle-\frac{1}{N}\sum_{j=1}^{N}\rho^{*}(\langle\bm{\phi}_{n}({\bm{w}}_{j}),{\bm{\lambda}}\rangle)\,.   (10)

Here we denote by ${\bm{y}}=(y_{1},\ldots,y_{n})\in\mathbb{R}^{n}$ the vector of responses and by $\bm{\phi}_{n}:{\mathcal{V}}\to\mathbb{R}^{n}$ the evaluation of the feature map at the $n$ data points, $\bm{\phi}_{n}({\bm{w}})=(\phi({\bm{x}}_{1};{\bm{w}}),\ldots,\phi({\bm{x}}_{n};{\bm{w}}))\in\mathbb{R}^{n}$.
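To illustrate how the (underparametrized) dual can be handled in practice, here is a minimal sketch that maximizes the finite width dual (10) for $\rho(x)=|x|^{p}/p$, in which case $\rho^{*}(t)=|t|^{Q}/Q$ with $Q=p/(p-1)$ and $(\rho^{*})^{\prime}(t)={\rm sign}(t)|t|^{Q-1}$. The ReLU feature model, the planted target, and the use of L-BFGS are our own illustrative assumptions; at the maximizer, the vanishing gradient recovers the interpolation constraints of (7) for the primal coefficients $a_{j}=(\rho^{*})^{\prime}(\langle\bm{\phi}_{n}({\bm{w}}_{j}),{\bm{\lambda}}\rangle)$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, N, p = 40, 10, 500, 1.5
Q = p / (p - 1)                                # conjugate exponent

def s(t):
    # (rho^*)'(t) for rho(x) = |x|^p / p, i.e. rho^*(t) = |t|^Q / Q.
    return np.sign(t) * np.abs(t) ** (Q - 1)

# Illustrative ReLU random features and a planted ridge-function target.
X = rng.normal(size=(n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(N, d)) / np.sqrt(d)
Phi = np.maximum(X @ W.T, 0.0)                 # Phi[i, j] = phi(x_i; w_j)
y = np.maximum(X @ (W[0] / np.linalg.norm(W[0])), 0.0)

def neg_FN(lam):
    # F_N(lam) = <lam, y> - (1/N) sum_j rho^*(<phi_{n,j}, lam>), cf. Eq. (10).
    margins = Phi.T @ lam                      # <phi_{n,j}, lam>, j = 1..N
    value = lam @ y - np.mean(np.abs(margins) ** Q) / Q
    grad = y - Phi @ s(margins) / N
    return -value, -grad

res = minimize(neg_FN, np.zeros(n), jac=True, method="L-BFGS-B")
a_hat = s(Phi.T @ res.x)                       # primal coefficients a_j = s(<phi_{n,j}, lam>)
print("constraint residual:", np.abs(Phi @ a_hat / N - y).max())
```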

We will prove that, under suitable assumptions on the penalty $\rho$, the optimizer of the finite-width dual (10) concentrates around the optimizer of the infinite-width dual (9). Our results hold conditionally on the realization of the data $\{({\bm{x}}_{i},y_{i})\}_{i\leq n}$ and instead exploit the randomness of the weights $\{{\bm{w}}_{i}\}_{i\leq N}$.

The rest of the paper is organized as follows. After briefly discussing related work in Section 2, we state our assumptions and results for strictly convex penalties in Section 3. In Section 4, we show that the problem with the ${\mathcal{F}}_{1}$ norm is $\mathsf{NP}$-hard under a randomized reduction. In Section 5, we describe a few examples in which we can apply our general results. The proof of our main result is outlined in Section 6, with most technical work deferred to the appendices.

2 Related work

Random features methods were first introduced as a randomized approximation to RKHS methods [RR08, BBV06]. Given a kernel $K:\mathcal{X}\times\mathcal{X}\to{\mathbb{R}}$, the idea of [RR08] was to replace it by a low rank approximation

K^{(N)}({\bm{x}}_{1},{\bm{x}}_{2}):=\frac{1}{N}\sum_{i=1}^{N}\phi({\bm{x}}_{1};{\bm{w}}_{i})\phi({\bm{x}}_{2};{\bm{w}}_{i})\,,   (11)

where the random features are such that $\mathbb{E}\{\phi({\bm{x}}_{1};{\bm{w}})\phi({\bm{x}}_{2};{\bm{w}})\}=K({\bm{x}}_{1},{\bm{x}}_{2})$. Several papers prove bounds on the test error of random features methods and compare them with the corresponding kernel approach [RR09, RR17, MWW20, GMMM19, MM19, GMMM20, MMM21]. In particular, [MMM21] proves that —under certain concentration assumptions on the random features— if $N\geq n^{1+\varepsilon}$ for some $\varepsilon>0$, then the random features approach has nearly the same test error as the corresponding RKHS method.
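As a quick numerical illustration of the approximation (11), the sketch below compares $K^{(N)}$ with the exact kernel for ReLU features with Gaussian weights, for which $K({\bm{x}}_{1},{\bm{x}}_{2})=\mathbb{E}_{{\bm{w}}}\{\phi({\bm{x}}_{1};{\bm{w}})\phi({\bm{x}}_{2};{\bm{w}})\}$ has a closed form (the arc-cosine kernel). The dimensions and the feature model are illustrative choices, not taken from the papers cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 20, 50, 20000

X = rng.normal(size=(n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)   # ||x_i||_2 = sqrt(d)
W = rng.normal(size=(N, d)) / np.sqrt(d)                     # w_j ~ N(0, I_d / d)

Phi = np.maximum(X @ W.T, 0.0)        # phi(x_i; w_j) = ReLU(<w_j, x_i>)
K_N = Phi @ Phi.T / N                 # rank-N approximation (11)

# Exact kernel K(x1, x2) = E_w[ReLU(<w, x1>) ReLU(<w, x2>)] (degree-1 arc-cosine kernel):
cos_theta = np.clip(X @ X.T / d, -1.0, 1.0)
theta = np.arccos(cos_theta)
K = (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)

print("relative operator-norm error:",
      np.linalg.norm(K_N - K, 2) / np.linalg.norm(K, 2))
```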

Notice that the idea of approximating the kernel via random features implicitly selects a specific regularization, in our notation $\rho(x)=x^{2}$. Indeed, for any other regularization, the predictor $\hat{f}({\bm{x}};\hat{a})$ does not depend uniquely on the kernel. An alternative viewpoint regards the random features model (6) as a two-layer neural network with random first-layer weights. It was shown in [Bar98] that the generalization properties of a two-layer neural network can be controlled in terms of the sum of the norms of the second-layer weights. This opens the way to considering infinitely wide two-layer networks as per Eq. (1), with $\int|a({\bm{w}})|\,\mu({\rm d}{\bm{w}})<\infty$. These networks represent functions in the class ${\mathcal{F}}_{1}$.

The infinite-width limit ${\mathcal{F}}_{1}$ was considered in [BRV+06], which proposes an incremental algorithm to fit functions with an increasing number of units. Their approach, however, has no global optimality guarantees. Generalization error bounds within ${\mathcal{F}}_{1}$ were proved in [Bac17], which demonstrates the superiority of this approach over RKHS methods, in particular to fit ridge functions or other classes of functions that depend strongly on a low-dimensional subspace of ${\mathbb{R}}^{d}$. The same paper also develops an optimization algorithm based on a conditional-gradient approach. However, in high dimension each iteration of this algorithm requires solving a potentially hard problem.

A line of recent work shows that —in certain overparametrized training regimes— neural networks are indeed well approximated by random features models obtained by linearizing the network around its (random) initialization. A subset of references includes [JGH18, LL18, DZPS18, OS19, COB19]. The implicit bias induced by gradient descent on these models is a ridge (or RKHS) regularization. However, the bias can change if either the algorithm or the model parametrization is changed [NTSS17, GLSS18].

We finally notice that several recent works study the impact of changing regularization in high-dimensional linear regression, focusing on the interpolation limit [LS20, MKL+20, CLvdG20]. None of these works addresses the main problem studied here, that is approximating the fully nonparametric model (1).

Our work is a first step towards understanding the effect of regularization in random features models. It implies that, for certain penalties $\rho$, as soon as $N$ is polynomially large in $n$, studying the random features model reduces to studying the corresponding infinite width model.

3 Convergence to the infinite width limit

3.1 Setting and dual problem

We consider a slightly more general setting than the one discussed in the introduction, whereby we allow the featurization map to be randomized. More precisely, conditional on the data $({\bm{x}}_{i})_{i\leq n}$ and weight vectors $({\bm{w}}_{j})_{j\leq N}$, the feature values $\{\phi({\bm{x}}_{i};{\bm{w}}_{j})\}_{i\leq n,j\leq N}$ are independent random variables. Explicitly, such randomized features can be constructed by letting $\phi({\bm{x}};{\bm{w}},z)$ depend on an additional variable $z\in\mathcal{Z}$, and setting (with an abuse of notation) $\phi({\bm{x}}_{i};{\bm{w}}_{j})=\phi({\bm{x}}_{i};{\bm{w}}_{j},z_{ij})$ for $\{z_{ij}\}_{i\leq n,j\leq N}$ a collection of independent random variables with common law $z_{ij}\sim\nu$ (with $\nu$ a probability distribution over $\mathcal{Z}$). Without loss of generality we can assume $z_{ij}\sim{\rm Unif}([0,1])$. (For instance, this is the case as long as the weights take values in a Polish space, so that the conditional probabilities $\mathbb{P}(\phi({\bm{x}};{\bm{w}})\in S|{\bm{w}})=P(S;{\bm{w}})$ exist.)

We denote by $\overline{\phi}({\bm{x}};{\bm{w}})=\mathbb{E}_{\phi}[\phi({\bm{x}};{\bm{w}})]=\mathbb{E}_{z}[\phi({\bm{x}};{\bm{w}},z)]$ the expectation of the features with respect to this additional randomness. In what follows, we will omit the dependence on $z$ unless required for clarity. The additional freedom afforded by randomized features is useful in multiple scenarios:

  • We only have access to noisy measurements $\phi({\bm{x}}_{i};{\bm{w}}_{j})$ of the true features $\overline{\phi}({\bm{x}}_{i};{\bm{w}}_{j})$.

  • We deliberately introduce noise in the featurization mechanism. This is known to have a regularizing effect [Bis95].

  • We do not explicitly introduce noise in the featurization mechanism. However, the noiseless featurization mechanism is asymptotically equivalent to one in which noise has been added. Examples of this type are given in [MM19, GLK+20, HL20].

At prediction time we use the average features (our results do not change significantly if we use randomized features also at prediction time)

\hat{f}({\bm{x}};a)=\int_{{\mathcal{V}}}\overline{\phi}({\bm{x}};{\bm{w}})a({\bm{w}})\,\mu({\rm d}{\bm{w}})\,,\qquad\hat{f}_{N}({\bm{x}};{\bm{a}})=\frac{1}{N}\sum_{j=1}^{N}a_{j}\overline{\phi}({\bm{x}};{\bm{w}}_{j})\,.   (12)

The dual of the finite-width problem (7) is given by the problem (10), for which we introduce the following notations:

F_{N}({\bm{\lambda}}):=\langle{\bm{\lambda}},{\bm{y}}\rangle-\frac{1}{N}\sum_{j=1}^{N}\rho^{*}(\langle\bm{\phi}_{n,j},{\bm{\lambda}}\rangle)\,,\qquad \hat{\bm{\lambda}}_{N}=\arg\max_{{\bm{\lambda}}\in\mathbb{R}^{n}}F_{N}({\bm{\lambda}})\,.   (13)

Notice that $\bm{\phi}_{n,j}=\bm{\phi}_{n}({\bm{w}}_{j})=(\phi({\bm{x}}_{1};{\bm{w}}_{j}),\ldots,\phi({\bm{x}}_{n};{\bm{w}}_{j}))^{{\mathsf{T}}}$ is now a random vector. In the case of randomized features, the dual of the infinite width problem (5) has to be modified with respect to (9), and takes instead the form

F({\bm{\lambda}}):=\langle{\bm{\lambda}},{\bm{y}}\rangle-\int_{\mathcal{V}}\mathbb{E}_{\phi}[\rho^{*}(\langle\bm{\phi}_{n}({\bm{w}}),{\bm{\lambda}}\rangle)]\mu({\rm d}{\bm{w}})\,,\qquad \hat{\bm{\lambda}}=\arg\max_{{\bm{\lambda}}\in\mathbb{R}^{n}}F({\bm{\lambda}})\,.   (14)

Note the expectation $\mathbb{E}_{\phi}$ with respect to the randomness in the features (equivalently, this is the expectation with respect to the independent randomness $z$, which is not written explicitly). The most direct way to see that this is the correct infinite width dual is to notice that it is obtained from Eq. (13) by taking the expectation with respect to the weights $\{{\bm{w}}_{j}\}_{j\leq N}$ and the feature randomness. We will further discuss the connection between (14) and the infinite width primal problem in Section E.1 below.

In order to discuss the dual optimality conditions, let $s:\mathbb{R}\to\mathbb{R}$ be the derivative of the convex conjugate of $\rho$, i.e., $s(x)=(\rho^{*})^{\prime}(x)=(\rho^{\prime})^{-1}(x)$. Since $\rho$ is assumed to be strictly convex, $s(x)$ exists and is continuous and non-decreasing. With these definitions, the dual optimality condition reads

{\bm{y}}=\frac{1}{N}\sum_{j=1}^{N}\bm{\phi}_{n,j}\,s(\langle\bm{\phi}_{n,j},\hat{\bm{\lambda}}_{N}\rangle)\,.   (15)

The primal solution is then given by $(\hat{\bm{a}}_{N})_{j}=s(\langle\bm{\phi}_{n,j},\hat{\bm{\lambda}}_{N}\rangle)$ and the resulting predictor is

\hat{f}_{N}({\bm{x}};\hat{\bm{a}}_{N})=\frac{1}{N}\sum_{j=1}^{N}\overline{\phi}({\bm{x}};{\bm{w}}_{j})\,s(\langle\bm{\phi}_{n,j},\hat{\bm{\lambda}}_{N}\rangle)\,.

In the following, with an abuse of notation, we will write $\hat{f}_{N}({\bm{x}};{\bm{\lambda}})=N^{-1}\sum_{j=1}^{N}\overline{\phi}({\bm{x}};{\bm{w}}_{j})\,s(\langle\bm{\phi}_{n,j},{\bm{\lambda}}\rangle)$ for the model at dual parameter ${\bm{\lambda}}$. The corresponding infinite width predictor reads

\hat{f}({\bm{x}};{\bm{\lambda}}):=\int_{{\mathcal{V}}}\overline{\phi}({\bm{x}};{\bm{w}})\,\mathbb{E}_{\phi}[s(\langle\bm{\phi}_{n}({\bm{w}}),{\bm{\lambda}}\rangle)]\,\mu({\rm d}{\bm{w}})\,.   (16)

3.2 General theorem

We will show that, conditional on the realization of ${\bm{y}},{\bm{X}}$, the distance between the infinite width interpolating model $\hat{f}(\,\cdot\,;\hat{\bm{\lambda}})$ and the finite width one $\hat{f}_{N}(\,\cdot\,;\hat{\bm{\lambda}}_{N})$ is small with high probability as soon as $N$ is large enough. Throughout this section, ${\bm{y}},{\bm{X}}$ are viewed as fixed, and we assume certain conditions to hold on the distribution of the features $\bm{\phi}_{n}({\bm{w}})$. In Section 5 we will check that these conditions hold for a few models of interest, for typical realizations of ${\bm{y}},{\bm{X}}$.

We first state our assumptions on the feature distribution. We define the whitened features

{\bm{\psi}}_{n,j}:={\bm{K}}_{n}^{-1/2}\bm{\phi}_{n,j},\;\;\;\;\;\;\;{\bm{K}}_{n}:=\int\mathbb{E}_{\phi}[\bm{\phi}_{n}({\bm{w}})\bm{\phi}_{n}({\bm{w}})^{\top}]\,\mu({\rm d}{\bm{w}})\,.   (17)

Here the expectation is over the randomization in the features, and ${\bm{K}}_{n}\in{\mathbb{R}}^{n\times n}$ is the empirical kernel matrix. By construction, the ${\bm{\psi}}_{n,j}$ are isotropic: $\mathbb{E}_{{\bm{w}},\phi}[{\bm{\psi}}_{n,j}{\bm{\psi}}_{n,j}^{\top}]={\mathbf{I}}_{n}$.

We will assume the following conditions to hold for the featurization map $({\bm{x}},{\bm{w}})\mapsto\phi({\bm{x}};{\bm{w}})$ and the penalty $x\mapsto\rho(x)$. Throughout we follow the convention of denoting by Greek letters constants that we keep track of in our calculations, and by $C,c,r_{0}$ constants that we do not track (so that our statements depend in an unspecified way on the latter). We also recall that a random vector ${\bm{Z}}$ is $\gamma^{2}$-sub-Gaussian if $\mathbb{E}[\exp(\langle{\bm{v}},{\bm{Z}}\rangle)]\leq\exp(\gamma^{2}\|{\bm{v}}\|^{2}/2)$ for all vectors ${\bm{v}}$.

FEAT1

(Sub-Gaussianity) For some $0<\tau\leq n^{C}$ and any fixed $\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}$, $\overline{\phi}({\bm{x}};{\bm{w}})$ is $\tau^{2}$-sub-Gaussian when ${\bm{w}}\sim\mu$. Further, the feature vector ${\bm{\psi}}_{n}:={\bm{K}}_{n}^{-1/2}[\phi({\bm{x}}_{1};{\bm{w}},z_{1}),\dots,\phi({\bm{x}}_{n};{\bm{w}},z_{n})]^{{\mathsf{T}}}$ is $\tau^{2}$-sub-Gaussian when ${\bm{w}}\sim\mu$, $(z_{i})_{i\leq n}\sim_{iid}\nu$. Without loss of generality we assume $\tau\geq 1$.

FEAT2

(Lipschitz continuity) For any ${\bm{w}}\in{\mathcal{V}}$, assume that $\overline{\phi}({\bm{x}};{\bm{w}})$ is $L({\bm{w}})$-Lipschitz with respect to ${\bm{x}}$ and $\overline{\phi}({\bm{0}};{\bm{w}})\leq L({\bm{w}})$, where $\mathbb{P}(L({\bm{w}})\geq t)\leq Ce^{-t^{2}/(2\tau^{2})}$, $\tau\geq 1$.

FEAT3

(Small ball) There exists $\eta\geq n^{-C}$ such that

\inf_{\|{\bm{u}}\|_{2}=1,\|{\bm{v}}\|_{2}=1}\mathbb{P}\big(|\langle{\bm{u}},{\bm{\psi}}_{n}\rangle|\geq\eta,\,|\langle{\bm{v}},{\bm{\psi}}_{n}\rangle|\geq\eta\big)\geq c\,,   (18)

for some strictly positive constants $C,c$. By the union bound, this is implied by the stronger condition

\sup_{\|{\bm{v}}\|_{2}=1}\mathbb{P}\big(|\langle{\bm{v}},{\bm{\psi}}_{n}\rangle|\leq\eta\big)\leq\frac{1}{2}(1-c)\,.

Without loss of generality, we will assume $\eta\leq 1$.

PEN

(Polynomial growth) We assume that $\rho$ is strictly convex and minimized at $0$, so that $s(x)$ is continuous and $s(0)=0$. Because $s$ is non-decreasing, $s(x)$ has a derivative almost everywhere. We assume there exist $Q_{1},Q_{2},q_{1},q_{2}>1$ and $C,c>0$ such that for $|x_{1}|,|x_{2}|>0$,

c|s(x_{2})/x_{2}|\,(|x_{1}/x_{2}|^{q_{1}-2}\wedge|x_{1}/x_{2}|^{q_{2}-2})\leq s^{\prime}(x_{1})\leq C|s(x_{2})/x_{2}|\,(|x_{1}/x_{2}|^{Q_{1}-2}\vee|x_{1}/x_{2}|^{Q_{2}-2}),

and

c|s(x_{2})|\,(|x_{1}/x_{2}|^{q_{1}-1}\wedge|x_{1}/x_{2}|^{q_{2}-1})\leq|s(x_{1})|\leq C|s(x_{2})|\,(|x_{1}/x_{2}|^{Q_{1}-1}\vee|x_{1}/x_{2}|^{Q_{2}-1}).   (19)

If $s^{\prime}(x)=(\rho^{*})^{\prime\prime}(x)$ is unbounded as $x\to 0$, we will require a stronger assumption FEAT3', which ensures that $\nabla^{2}F({\bm{\lambda}})$ exists and is finite.

FEAT3’

(Small ball) For every $r\in(-1,0)$ and every $\|{\bm{v}}\|_{2}=1$, $\mathbb{E}[|\langle{\bm{\psi}}_{n},{\bm{v}}\rangle|^{r}]\leq C_{r}\eta^{r}<\infty$, $\eta\leq 1$.

Remark 1.

Assumption PEN implies that $s(x)$ is locally Lipschitz away from $0$. It further implies that $s(x)$ and its derivatives are upper and lower bounded by polynomials in $x$. The most important example is provided by $p$-norms with $1<p<\infty$, in which case we set $\rho(x)=\frac{1}{p}|x|^{p}$. It is easy to check that PEN is satisfied with $Q_{1}=Q_{2}=q_{1}=q_{2}=p/(p-1)$ and appropriate choices of $C,c$.
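To make this concrete for the $p$-norm example, a short computation (a sketch of the verification, using only the definitions above) gives, for $\rho(x)=\frac{1}{p}|x|^{p}$ with $1<p<\infty$ and $Q=p/(p-1)$,

\rho^{*}(t)=\frac{1}{Q}|t|^{Q}\,,\qquad s(t)=(\rho^{*})^{\prime}(t)={\rm sign}(t)\,|t|^{Q-1}\,,\qquad s^{\prime}(t)=(Q-1)\,|t|^{Q-2}\,,

so that for all $x_{1},x_{2}\neq 0$ one has $s^{\prime}(x_{1})=(Q-1)\,|s(x_{2})/x_{2}|\,|x_{1}/x_{2}|^{Q-2}$ and $|s(x_{1})|=|s(x_{2})|\,|x_{1}/x_{2}|^{Q-1}$, which is exactly the form required by PEN with $Q_{1}=Q_{2}=q_{1}=q_{2}=Q$ and constants $c,C$ depending only on $p$.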

Our main theorem establishes that the infinite-dimensional problem of Eq. (5) (and its generalization to randomized featurization maps) is well-approximated by its finite random features counterpart. For this approximation to be good, it is sufficient that the number of random features scales polynomially with the sample size $n$.

Theorem 1.

Assume $\|{\bm{x}}_{i}\|_{2}\leq r_{0}\sqrt{d}$ for all $i\leq n$, and let $\mathbb{P}$ be a probability distribution supported on ${\sf B}^{d}_{2}(r_{0}\sqrt{d}):=\{{\bm{x}}\in{\mathbb{R}}^{d}:\,\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}\}$. Assume that conditions FEAT1, FEAT2, FEAT3, and PEN are satisfied. If $Q_{1}\wedge Q_{2}<2$, further assume that FEAT3' holds. Then for any $\delta>0$, there exist constants $C^{\prime},c^{\prime}$ depending on the constants $C,c,r_{0}$ in those assumptions, but not on $\tau$ and $\eta$, and $C^{\prime\prime}(\delta)$ additionally dependent on $\delta>0$, such that the following holds. If $N\geq N_{1}\vee N_{2}$, where $N_{1}:=C^{\prime}(\tau^{Q\vee 2}/\eta^{q\vee 3})^{2}n\log n$ and $N_{2}:=C^{\prime}(\tau^{Q\vee 2}/\eta^{q\vee 3})(n\log n)^{Q/2\vee 1}$, and $n\geq c^{\prime}d$, then with probability at least $1-C^{\prime}N^{-c^{\prime}n}$,

\|\hat{f}_{N}(\,\cdot\,;\hat{\bm{\lambda}}_{N})-\hat{f}(\,\cdot\,;\hat{\bm{\lambda}})\|_{L_{2}(\mathbb{P})}\leq M(\delta,\tau,\eta)\Big(\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{Q/2}}{N}\Big)\|{\bm{K}}_{n}^{-1/2}{\bm{y}}\|_{2}\,,   (20)

where

M(\delta,\tau,\eta)=C^{\prime\prime}(\delta)\frac{\tau^{Q+2+(2-q^{\prime}+\delta)_{+}}}{\eta^{q\vee 3}}(\tau^{Q-2}\vee\eta^{Q^{\prime}-2})\,,

with $Q=Q_{1}\vee Q_{2}$, $Q^{\prime}=Q_{1}\wedge Q_{2}$, $q=q_{1}\vee q_{2}$, $q^{\prime}=q_{1}\wedge q_{2}$. Further, the bound holds with $\delta=0$, $C^{\prime\prime}(0)<\infty$ when $q_{1}=q_{2}\geq 2$.

In order to interpret Theorem 1, we remark that we typically expect $\|{\bm{K}}_{n}^{-1/2}{\bm{y}}\|_{2}$ to be of order $\sqrt{n}$. In this case $\hat{f}_{N}(\,\cdot\,;\hat{\bm{\lambda}}_{N})$ differs negligibly from $\hat{f}(\,\cdot\,;\hat{\bm{\lambda}})$ when $N\gg(n\log n)^{2\vee(Q+1)/2}$.

Remark 2.

The most restrictive among our assumptions are FEAT3 and FEAT3’. Both conditions imply that the infinite-width dual problem of Eq. (14) is well behaved.

In particular, condition FEAT3 ensures that the minimum eigenvalue of the rescaled Hessian ${\bm{H}}_{n}({\bm{\lambda}}):={\bm{K}}_{n}^{-1/2}\nabla^{2}F({\bm{\lambda}}){\bm{K}}_{n}^{-1/2}$ is bounded away from $0$, as long as ${\bm{\lambda}}$ is bounded away from $0$ and $\infty$. Notice indeed that the Hessian is given by

\nabla^{2}F({\bm{\lambda}})=-\mathbb{E}_{{\bm{w}},\phi}[s^{\prime}(\langle{\bm{\lambda}},\bm{\phi}_{n}\rangle)\,\bm{\phi}_{n}\bm{\phi}_{n}^{{\mathsf{T}}}]\,.   (21)

For any ${\bm{v}}$ with $\|{\bm{v}}\|_{2}=1$, we have $\mathbb{E}\{\langle{\bm{v}},{\bm{\psi}}_{n}\rangle^{2}\}=1$, whence, using assumption FEAT3, Markov's inequality, and the union bound, for any two unit vectors ${\bm{u}},{\bm{v}}$, $\mathbb{P}(|\langle{\bm{u}},{\bm{\psi}}_{n}\rangle|\in[\eta,C],\,|\langle{\bm{v}},{\bm{\psi}}_{n}\rangle|\in[\eta,C])\geq c/2$ for some constant $C$. We then have

\inf_{\|{\bm{v}}\|_{2}=1}\lambda_{\min}(-{\bm{H}}_{n}({\bm{K}}_{n}^{-1/2}{\bm{v}}))=\inf_{\|{\bm{u}}\|_{2}=1,\|{\bm{v}}\|_{2}=1}\mathbb{E}_{{\bm{w}},\phi}[\langle{\bm{u}},{\bm{\psi}}_{n}\rangle^{2}s^{\prime}(\langle{\bm{v}},{\bm{\psi}}_{n}\rangle)]\geq c^{\prime\prime}\eta^{2}\inf_{x\in[\eta,4]}s^{\prime}(x)\,.

Condition FEAT3' ensures that the largest eigenvalue of the Hessian $\nabla^{2}F({\bm{\lambda}})$ is bounded for $\|{\bm{\lambda}}\|_{2}$ bounded below and above. Notice that, from Eq. (21), no such assumption is required when $s^{\prime}(x)$ is bounded as $x\to 0$.
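As a rough numerical sanity check of FEAT3 in a concrete model (noise-perturbed ReLU features, anticipating Proposition 3 below), the sketch below estimates the small-ball probability in (18) by Monte Carlo over a handful of random directions; all sizes, the noise level, and the use of a sampled kernel matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, M, eta, gamma = 10, 30, 100000, 0.05, 0.5
relu = lambda t: np.maximum(t, 0.0)

# Data on the sphere of radius sqrt(d); noisy features phi(x; w, z) = ReLU(<w, x>) + z.
X = rng.normal(size=(n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(M, d)) / np.sqrt(d)
Phi = relu(X @ W.T) + gamma * rng.normal(size=(n, M))   # columns are samples of phi_n(w)

# Whitened features psi_n = K_n^{-1/2} phi_n, with K_n estimated from the same samples.
K_n = Phi @ Phi.T / M
evals, evecs = np.linalg.eigh(K_n)
Psi = (evecs @ np.diag(evals ** -0.5) @ evecs.T) @ Phi

# Monte Carlo estimate of P(|<u, psi_n>| >= eta, |<v, psi_n>| >= eta) over random directions.
probs = []
for _ in range(20):
    u, v = rng.normal(size=n), rng.normal(size=n)
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    probs.append(np.mean((np.abs(u @ Psi) >= eta) & (np.abs(v @ Psi) >= eta)))
print("smallest estimated small-ball probability:", min(probs))
```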

Remark 3.

As mentioned in the introduction, the bound in Eq. (20) is not optimal. For instance, for the case of a penalty $\rho$ that is strongly convex and smooth (covered in Theorem 1 by taking $Q=2$), this bound implies that $N\gg(n\log n)^{2}$ random features are sufficient to approximate the infinite width problem. However, a more careful analysis shows that —in this case— $N\gg n\log n$ random features are sufficient. This is also what is established in [MMM21] for the case of kernel ridge regression (corresponding to $\rho(x)=x^{2}$).

It is instructive to specialize Theorem 1 to the case of $p$-norms, which is covered by taking $\rho(x)=|x|^{p}/p$, $p\in(1,2]$. In this case $q_{1}=q_{2}=Q_{1}=Q_{2}=Q$, with $Q=p/(p-1)\geq 2$, and hence the formulas are simpler.

Corollary 1.

Assume $\|{\bm{x}}_{i}\|_{2}\leq r_{0}\sqrt{d}$ for all $i\leq n$, and let $\mathbb{P}$ be a probability distribution supported on ${\sf B}^{d}_{2}(r_{0}\sqrt{d}):=\{{\bm{x}}\in{\mathbb{R}}^{d}:\,\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}\}$. Assume that conditions FEAT1, FEAT2, FEAT3 are satisfied, and take the penalty $\rho(x)=|x|^{p}/p$, $p\in(1,2]$. Then there exist constants $C^{\prime},c^{\prime}$ depending on the constants $C,c,r_{0}$ in those assumptions, such that the following holds. If $N\geq C^{\prime}[(\tau^{Q}/\eta^{Q\vee 3})^{2}(n\log n)\vee(\tau^{Q}/\eta^{Q\vee 3})(n\log n)^{Q/2}]$ and $n\geq c^{\prime}d$, then with probability at least $1-C^{\prime}N^{-c^{\prime}n}$,

\|\hat{f}_{N}(\,\cdot\,;\hat{\bm{\lambda}}_{N})-\hat{f}(\,\cdot\,;\hat{\bm{\lambda}})\|_{L_{2}(\mathbb{P})}\leq C^{\prime}\frac{\tau^{2Q}}{\eta^{Q\vee 3}}\Big(\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{Q/2}}{N}\Big)\|{\bm{K}}_{n}^{-1/2}{\bm{y}}\|_{2}\,,   (22)

where $Q=p/(p-1)$.

Note that the exponent $Q$ on the right-hand side of Eq. (22) diverges as $p\to 1$. Far from being an artifact of our proof technique, we show in the next section (Section 4) that this is unavoidable under standard hardness assumptions.

Remark 4.

In the case of randomized features, we wrote the infinite-width dual problem (Eq. (14)), but we did not write the infinite-width primal problem to which it corresponds. Indeed, the dual problem entirely defines the predictor via Eq. (16). For the sake of completeness, we present the infinite-width primal problem in Appendix E.1.

4 $\mathsf{NP}$-hardness of learning with ${\mathcal{F}}_{1}$ norm

The fact that, when $\rho(x)=|x|$, we are unable to efficiently solve the infinite width problem (5) via the random features approach is not surprising. Indeed, consider the featurization map $\phi({\bm{x}};{\bm{w}})=(\langle{\bm{w}},{\bm{x}}\rangle+c)_{+}$ (ReLU activation), and random weights $({\bm{w}}_{i})_{i\leq N}\sim_{iid}\mu={\rm Unif}(\mathbb{S}^{d-1}(1))$. Consider solving either the infinite-dimensional problem (5) or its random features approximation (7) with data $({\bm{x}}_{i})_{i\leq n}\sim{\rm Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, $y_{i}=f_{*}({\bm{x}}_{i})=(\langle{\bm{w}}_{\star},{\bm{x}}_{i}\rangle+c)_{+}$ (where ${\bm{w}}_{\star}\in\mathbb{S}^{d-1}(1)$ and $c\neq 0$ are fixed).

In [GMMM19], it was shown that for any $k\in\mathbb{N}$, if $N\leq d^{k}$, then $\hat{f}_{N}$ has test error bounded away from zero, namely $\|\hat{f}_{N}-f_{*}\|_{L^{2}}\geq c(k)-o_{N}(1)$, for any sample size $n$. Indeed, this lower bound holds for any function that can be written as in Eq. (6), for some coefficients $(a_{i})_{i\leq N}$. On the other hand, [Bac17] proves that minimizing the empirical risk subject to $\hat{f}\in{\mathcal{F}}_{1}(R)$, cf. Eq. (4), for a suitable choice of $R$, achieves test error $\|\hat{f}-f_{*}\|_{L^{2}}\leq d^{O(1)}/\sqrt{n}$. This last result does not apply directly to the minimum ${\mathcal{F}}_{1}$-norm interpolator. However, if an analogous result were established for interpolators, it would imply that $\|\hat{f}_{N}-\hat{f}\|_{L^{2}}$ remains bounded away from zero in this case as long as $N=d^{O(1)}$.

In order to provide stronger evidence towards hardness, we consider the computational complexity of the interpolation problem

\mbox{minimize} \;\;\; \int_{{\mathcal{V}}}|a({\bm{w}})|\mu({\rm d}{\bm{w}})\,,   (23)
\mbox{subj. to} \;\;\; \hat{f}({\bm{x}}_{i};a)=y_{i},\;\;\;\forall i\leq n\,.

We show that it is $\mathsf{NP}$-hard under a randomized reduction to solve (23) to within accuracy $n^{-C}$, with $C$ a fixed absolute constant. On the contrary, for $p>1$, Corollary 1 and its proof show that one can obtain accuracy $n^{-K}$, for $K$ fixed but arbitrary, in polynomial time using a random features approximation.

It will be convenient to consider a relaxation of problem (23) by minimizing over the set of signed measures on ${\mathcal{V}}$, which we will denote ${\mathcal{M}}({\mathcal{V}})$. For any $\tau\in{\mathcal{M}}({\mathcal{V}})$, let $\hat{f}({\bm{x}};\tau)=\int_{{\mathcal{V}}}\phi({\bm{x}};{\bm{w}})\tau({\rm d}{\bm{w}})$. The relaxed problem reads

\mbox{minimize} \;\;\; |\tau|({\mathcal{V}})\,,   (24)
\mbox{subj. to} \;\;\; \hat{f}({\bm{x}}_{i};\tau)=y_{i},\;\;\;\forall i\leq n\,,

where the minimization is now over all $\tau\in{\mathcal{M}}({\mathcal{V}})$ and $|\tau|({\mathcal{V}})$ denotes the total variation of $\tau$. (Recall that, by the Hahn decomposition, there exist two non-negative measures $\tau_{+}$ and $\tau_{-}$ such that $\tau=\tau_{+}-\tau_{-}$; the total variation is equal to $|\tau|({\mathcal{V}})=\tau_{+}({\mathcal{V}})+\tau_{-}({\mathcal{V}})$.) If we assume that the signed measure $\tau$ has a density with respect to the fixed probability measure $\mu$, i.e., $\tau({\rm d}{\bm{w}})=a({\bm{w}})\mu({\rm d}{\bm{w}})$, then the total variation is equal to $|\tau|({\mathcal{V}})=\int_{{\mathcal{V}}}|a({\bm{w}})|\mu({\rm d}{\bm{w}})$ and we recover the original problem (23). Note that if $\mu$ has full support on ${\mathcal{V}}$, then the infima of the two problems coincide (all measures can be written as limits of measures with densities). However, the infimum in problem (23) is not attained in general, while the infimum in (24) is always achieved for ${\mathcal{V}}$ compact (barring degenerate cases in which it is not feasible). Technically this happens because the space of integrable functions $a({\bm{w}})$ with $\int|a({\bm{w}})|\mu({\rm d}{\bm{w}})\leq C$ is not compact in the weak-* topology, while the set of signed measures with $|\tau|({\mathcal{V}})\leq C$ is. (The optimum $\tau$ can be singular with respect to $\mu$, e.g. a sum of Dirac delta functions.)

Remark 5.

For any $\tau\in{\mathcal{M}}({\mathcal{V}})$ that satisfies the equality constraints of problem (24), the vector ${\bm{y}}/|\tau|({\mathcal{V}})$ is in the convex hull of $\{\bm{\phi}_{n}({\bm{w}}),-\bm{\phi}_{n}({\bm{w}}):{\bm{w}}\in{\mathcal{V}}\}$. Hence, by Caratheodory's theorem, there exist at most $(n+1)$ weights $\{{\bm{w}}_{j}\}_{j\in[n+1]}$ such that ${\bm{y}}=\sum_{j\in[n+1]}a_{j}\bm{\phi}_{n}({\bm{w}}_{j})$ and $\sum_{j\in[n+1]}|a_{j}|=|\tau|({\mathcal{V}})$. In particular, if ${\mathcal{V}}$ is compact, the minimum in problem (24) is always attained by a measure that is supported on at most $n+1$ points, i.e., there exists $\{(a_{j}^{*},{\bm{w}}_{j}^{*})\}_{j\in[n+1]}$ such that $\tau^{*}=\sum_{j\in[n+1]}a_{j}^{*}\delta_{{\bm{w}}^{*}_{j}}$ minimizes (24).

Let us call problem (24) the ${\mathcal{F}}_{1}$-problem. In order to study the computational complexity of solving this problem, we introduce a weak version of the ${\mathcal{F}}_{1}$-problem, where we allow an error $\varepsilon>0$ on the equality constraint:

\mbox{minimize} \;\;\; |\tau|({\mathcal{V}})\,,   (25)
\mbox{subj. to} \;\;\; \hat{f}({\bm{x}}_{i};\tau)=\hat{y}_{i},\;\;\;\forall i\leq n\,,
\;\;\;\;\;\;\;\;\;\;\;\;\;\; \|{\bm{y}}-\hat{\bm{y}}\|_{2}\leq\varepsilon\,,

where the minimization is now over $\hat{\bm{y}}\in\mathbb{R}^{n}$ and $\tau\in{\mathcal{M}}({\mathcal{V}})$. For concreteness we will consider $\phi({\bm{x}};{\bm{w}})=\min(\max(\langle{\bm{w}},{\bm{x}}\rangle,0),1)$ (truncated ReLU), but we believe it is possible to generalize our proofs to other activations at the cost of additional technical work. We further restrict ${\mathcal{V}}$ to be a rectangle in $\mathbb{R}^{d}$, possibly with some infinite sides.

We denote by $\mathbb{Q}$ the set of rational numbers. We consider the following problem W-F1-PB, which depends on a rational number $\varepsilon>0$:

W-F1-PB($\varepsilon$): Given ${\bm{y}}\in\mathbb{Q}^{n}$ and $\gamma\in\mathbb{Q}$, denote by $L^{*}$ the value of the weak ${\mathcal{F}}_{1}$-problem (25) with error $\varepsilon$ on the constraints. Either

  • (1) Assert that $L^{*}\leq\gamma+\varepsilon$; or,

  • (2) Assert that $L^{*}\geq\gamma-\varepsilon$.

We can think of W-F1-PB as the weak validity problem associated to the ${\mathcal{F}}_{1}$-problem (24). In particular, if we are able to solve the ${\mathcal{F}}_{1}$-problem within an additive error $\varepsilon$ of the optimum and with at most $\varepsilon$ $\ell_{2}$-error on the equality constraints, then we can solve W-F1-PB.

We show in the following theorem that W-F1-PB is hard to solve under the standard assumption that $\mathsf{BPP}$ (the bounded-error probabilistic polynomial time class) does not contain $\mathsf{NP}$:

Theorem 2.

Let the activation function be the truncated ReLU $\phi({\bm{x}};{\bm{w}})=\min(\max(\langle{\bm{w}},{\bm{x}}\rangle,0),1)$, and let ${\mathcal{V}}$ be a rectangle in $\mathbb{R}^{d}$ (possibly with some infinite sides). Assuming $\mathsf{NP}\not\subset\mathsf{BPP}$, there exists an absolute constant $C>0$ such that the problem W-F1-PB($n^{-C}$) is $\mathsf{NP}$-hard.

By equality of the infima of problems (23) and (24), Theorem 2 also implies the hardness of the original problem. The proof of Theorem 2 relies on a polynomial time randomized reduction from an $\mathsf{NP}$-hard problem, the Maximum Agreement for Halfspaces problem. If we only assume $\mathsf{NP}\neq\mathsf{P}$, our reductions can be made deterministic using results from [GLS12]. However, this deterministic reduction only rules out precisions $\varepsilon$ that are exponentially small in the number of bits; in particular, it does not rule out precision $\varepsilon=e^{-n}$.

We denote below by HS-MA the Maximum Agreement for Halfspaces problem. It was shown to be $\mathsf{NP}$-hard in [FGKP06] and [GR09]; we will follow the notations of [GR09]. Consider $(n_{+},n_{-},d)\in\mathbb{N}^{3}$ and $\{{\bm{x}}_{1},\ldots,{\bm{x}}_{n_{+}},{\bm{z}}_{1},\ldots,{\bm{z}}_{n_{-}}\}\subset\{-1,1\}^{d}$. Denote

M({\bm{w}},a)=\sum_{i=1}^{n_{+}}\mathbbm{1}[\langle{\bm{w}},{\bm{x}}_{i}\rangle>a]+\sum_{i=1}^{n_{-}}\mathbbm{1}[\langle{\bm{w}},{\bm{z}}_{i}\rangle<a]\,.   (26)

The HS-MA problem depends on a rational number $\varepsilon>0$ (we slightly simplify the statement in [GR09]):

HS-MA($\varepsilon$): Distinguish the following two cases:

  • (1) There exists a half space $({\bm{w}},a)$ such that $M({\bm{w}},a)\geq(n_{+}+n_{-})(1-\varepsilon)$; or,

  • (2) For any half space $({\bm{w}},a)$, we have $M({\bm{w}},a)\leq(n_{+}+n_{-})(1/2+\varepsilon)$.

[GR09] showed that for all $0<\varepsilon<1/4$, the problem HS-MA($\varepsilon$) is $\mathsf{NP}$-hard.
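For intuition on the quantity being optimized, the following minimal sketch evaluates the agreement count $M({\bm{w}},a)$ of Eq. (26) for a candidate half space on random $\pm 1$ points; the data and the candidate below are placeholders, not part of the reduction.

```python
import numpy as np

def agreement(w, a, X_pos, X_neg):
    # M(w, a) = #{i : <w, x_i> > a} + #{i : <w, z_i> < a}, cf. Eq. (26).
    return int(np.sum(X_pos @ w > a) + np.sum(X_neg @ w < a))

rng = np.random.default_rng(0)
d, n_pos, n_neg = 10, 30, 30
X_pos = rng.choice([-1, 1], size=(n_pos, d))   # points to be classified as positive
X_neg = rng.choice([-1, 1], size=(n_neg, d))   # points to be classified as negative

w, a = rng.normal(size=d), 0.0                 # an arbitrary candidate half space
print("agreement:", agreement(w, a, X_pos, X_neg), "out of", n_pos + n_neg)
```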

Below we briefly describe the main ideas of the proof of Theorem 2, deferring the details to Appendix D. In order to reduce HS-MA to W-F1-PB, we use the following intermediate problem:

\mbox{maximize} \;\;\; \langle{\bm{y}},\hat{\bm{y}}\rangle\,,   (27)
\mbox{subj. to} \;\;\; \hat{f}({\bm{x}}_{i};\tau)=\hat{y}_{i},\;\;\;\forall i\leq n\,,
\;\;\;\;\;\;\;\;\;\;\;\;\;\; |\tau|({\mathcal{V}})\leq 1\,.

We denote by W-VAL($\varepsilon$) the weak validity problem associated to problem (27). First, notice that we can rewrite the constraint set equivalently as $\hat{\bm{y}}\in K$, where $K:=\{{\bm{z}}\in\mathbb{R}^{n}:\exists\tau\in{\mathcal{M}}({\mathcal{V}}),\,|\tau|({\mathcal{V}})\leq 1,\,\hat{f}({\bm{x}}_{i};\tau)=z_{i},\,i\leq n\}\subseteq\mathbb{R}^{n}$. It is easy to see that W-F1-PB can be used as a weak membership oracle for $K$. [LSV18] shows that there exists a polynomial-time randomized algorithm that solves W-VAL($\varepsilon$) from a weak membership oracle W-F1-PB($(\varepsilon/n)^{C}$) for some constant $C>0$. Hence there is a randomized reduction from W-VAL to W-F1-PB. Secondly, the problem (27) for ${\bm{y}}={\bm{1}}$ (the vector of ones) has the same value at optimum as the problem

\mbox{maximize} \;\;\; \langle{\bm{1}},\bm{\phi}_{n}({\bm{w}})\rangle\,,   (28)
\mbox{subj. to} \;\;\; {\bm{w}}\in{\mathcal{V}}\,.

It is easy to see that we can construct data points and weights such that the problem (28) coincides at the optimum with $M({\bm{w}},a)$ defined in Eq. (26). Hence, there is an easy deterministic reduction from HS-MA($\varepsilon$) to W-VAL($\varepsilon^{\prime}$).

In summary, we used the following two reductions

\texttt{W-F1-PB}\;\xLongrightarrow{\text{R}}\;\texttt{W-VAL}\;\xLongrightarrow{\text{D}}\;\texttt{HS-MA}   (29)

where $A\xLongrightarrow{\text{R}}B$ (resp. $A\xLongrightarrow{\text{D}}B$) means that there exists a polynomial time randomized (resp. deterministic) reduction from $B$ to $A$.

5 Examples

5.1 A numerical illustration

In our first example, ${\bm{w}}\in{\mathcal{V}}=\mathbb{R}^{d}$ and $\phi({\bm{x}};{\bm{w}})=\sigma(\langle{\bm{w}},{\bm{x}}\rangle)$, where $\sigma:\mathbb{R}\to\mathbb{R}$ is an activation function with $\sigma(0)=0$ (this assumption is only to simplify some of the formulas below and can be removed). In this case the featurization map is deterministic, i.e., $\phi({\bm{x}};{\bm{w}})=\overline{\phi}({\bm{x}};{\bm{w}})$.

Figure 1: Minimum complexity random features interpolation for penalty $\rho(x)=|x|^{p}/p$ and synthetic data generated according to the model $y=\sigma(\langle{\bm{w}}_{*},{\bm{x}}\rangle)$, ${\bm{x}}\in{\mathbb{R}}^{d}$ (see text). We report the average test error over $20$ realizations of this experiment and the $95\%$ confidence intervals. Left frame: behavior as a function of the number of features $N$ for fixed sample size $n$. Right frame: behavior as a function of the sample size $n$.

Before checking the assumptions of our main theorem in this context, we present a numerical illustration in Figure 1. We generate synthetic data with $({\bm{x}}_{i})_{i\leq n}\sim_{iid}{\rm Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $y_{i}=\sigma(\langle{\bm{w}}_{*},{\bm{x}}_{i}\rangle)$, where ${\bm{w}}_{*}$ is a fixed unit vector, $\|{\bm{w}}_{*}\|_{2}=1$, and $\sigma(t)=\max(t,0)$ is the ReLU activation. As mentioned above, we use the featurization map $\phi({\bm{x}};{\bm{w}})=\sigma(\langle{\bm{w}},{\bm{x}}\rangle)$ with weights $({\bm{w}}_{j})_{j\leq N}\sim{\sf N}(0,{\mathbf{I}}_{d}/d)$.

We fix $d=30$ and solve the minimum complexity interpolation problem (5), using $\rho(x)=|x|^{p}/p$. We first fix the sample size $n=150$ and report the average test error as a function of the number of features $N$ (left plot) for several values of $p$. Notice that in the case $p=2$, the $N=\infty$ limit corresponds to kernel ridge regression, and hence is directly accessible. We then consider, for each $p$, a value of $N$ that is large enough to obtain a rough approximation of the infinite width limit, and plot the test error as a function of the sample size $n$ (right plot).
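A condensed sketch of the experiment in the left frame of Figure 1, reusing the convex-programming formulation sketched in Section 1.2; the value of $p$, the grid of $N$, and the use of cvxpy are our own illustrative choices and will not reproduce the figure exactly.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d, n, n_test, p = 30, 150, 2000, 1.5
relu = lambda t: np.maximum(t, 0.0)

def sphere(m, d):
    Z = rng.normal(size=(m, d))
    return Z * np.sqrt(d) / np.linalg.norm(Z, axis=1, keepdims=True)

w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X, X_test = sphere(n, d), sphere(n_test, d)
y, y_test = relu(X @ w_star), relu(X_test @ w_star)

for N in [200, 400, 800, 1600]:
    W = rng.normal(size=(N, d)) / np.sqrt(d)            # w_j ~ N(0, I_d / d)
    Phi, Phi_test = relu(X @ W.T), relu(X_test @ W.T)
    a = cp.Variable(N)
    cp.Problem(cp.Minimize(cp.pnorm(a, p)),             # same minimizer as sum_j |a_j|^p / p
               [Phi @ a / N == y]).solve()
    test_mse = np.mean((Phi_test @ a.value / N - y_test) ** 2)
    print(f"N = {N:5d}   test MSE = {test_mse:.4f}")
```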

A few remarks are in order:

  • $(i)$ For $p>1$, the test error appears to settle on a limiting value once $N$ becomes large enough, $N\gtrsim N_{*}(n;p)$.

  • $(ii)$ The required number of random features $N_{*}(n;p)$ appears to increase as $p$ decreases. For $p=1$, we are not able to reach the $N=\infty$ limit with practical values of $N$.

  • $(iii)$ As $p$ decreases, the test error achieved by minimum complexity interpolation decreases.

Notice that points $(i)$ and $(ii)$ are consistent with our main result, Theorem 1. Point $(iii)$ is consistent with the notion that the class ${\mathcal{F}}_{p}$ better captures functions that depend strongly on low-dimensional projections of the covariate vectors.

5.2 Non-linear random features model

We next check the assumptions of Theorem 1 for the case of the non-linear random features model $\phi({\bm{x}};{\bm{w}})=\sigma(\langle{\bm{w}},{\bm{x}}\rangle)$.

Proposition 1.

Assume that $\sigma$ is $L$-Lipschitz and $\sigma(0)=0$, $\|{\bm{x}}_{i}\|_{2}\leq r_{0}\sqrt{d}$ for all $i\leq n$, and that the random weights $({\bm{w}}_{i})_{i\leq n}\sim\mu$ are mean $0$ and satisfy the transportation cost inequality

W_{1}(\nu,\mu)\leq\sqrt{2(\kappa^{2}/d)D(\nu||\mu)}\;\;\text{ for all probability measures $\nu$ on $\mathbb{R}^{d}$,}   (30)

where $W_{1}$ is the Wasserstein distance and $D$ is the relative entropy (Kullback-Leibler divergence). Then FEAT1 and FEAT2 are satisfied with constants $L({\bm{w}})=L\|{\bm{w}}\|_{2}$ and $\tau=\tau_{1}\vee\tau_{2}$, where

\tau_{1}=C\kappa Lr_{0},\qquad\tau_{2}=C\kappa L\|{\bm{K}}_{n}^{-1/2}\|_{{\rm op}}\frac{\|{\bm{X}}\|_{{\rm op}}}{\sqrt{d}}\,.   (31)
Proof of Proposition 1.

Let us begin with condition FEAT1. Notice that for any $\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}$,

|\sigma(\langle{\bm{w}}_{1},{\bm{x}}\rangle)-\sigma(\langle{\bm{w}}_{2},{\bm{x}}\rangle)|\leq L\|{\bm{x}}\|_{2}\|{\bm{w}}_{1}-{\bm{w}}_{2}\|_{2}\leq Lr_{0}\sqrt{d}\|{\bm{w}}_{1}-{\bm{w}}_{2}\|_{2}\,.

Hence $\sigma(\langle{\bm{w}},{\bm{x}}\rangle)$ is $Lr_{0}\sqrt{d}$-Lipschitz as a function of ${\bm{w}}$. By assumption (30) and the Bobkov-Götze theorem [BG99], $\sigma(\langle{\bm{w}},{\bm{x}}\rangle)-\mathbb{E}_{{\bm{w}}}[\sigma(\langle{\bm{w}},{\bm{x}}\rangle)]$ is $(\kappa Lr_{0})^{2}$-sub-Gaussian with respect to ${\bm{w}}$, for any fixed ${\bm{x}}$. Further, $|\mathbb{E}_{{\bm{w}}}[\sigma(\langle{\bm{w}},{\bm{x}}\rangle)]|\leq L\mathbb{E}_{{\bm{w}}}[|\langle{\bm{w}},{\bm{x}}\rangle|]\leq C\kappa Lr_{0}$.

Similarly, for any ${\bm{v}}\in\mathbb{R}^{n}$ with $\|{\bm{v}}\|_{2}=1$,

|\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}}_{1})\rangle-\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}}_{2})\rangle|\leq L\|{\bm{K}}_{n}^{-1/2}\|_{{\rm op}}\|{\bm{X}}({\bm{w}}_{1}-{\bm{w}}_{2})\|_{2}\leq L\|{\bm{K}}_{n}^{-1/2}\|_{{\rm op}}\|{\bm{X}}\|_{{\rm op}}\|{\bm{w}}_{1}-{\bm{w}}_{2}\|_{2}\,.

We deduce that ${\bm{\psi}}_{n}({\bm{w}})-\mathbb{E}[{\bm{\psi}}_{n}({\bm{w}})]$ is $(\kappa L\|{\bm{K}}_{n}^{-1/2}\|_{{\rm op}}\|{\bm{X}}\|_{{\rm op}}/\sqrt{d})^{2}$-sub-Gaussian with respect to ${\bm{w}}$, and $\|\mathbb{E}_{{\bm{w}}}[{\bm{\psi}}_{n}({\bm{w}})]\|_{2}\leq C\kappa L\|{\bm{K}}_{n}^{-1/2}\|_{{\rm op}}\|{\bm{X}}\|_{{\rm op}}/\sqrt{d}$.

Next, consider condition FEAT2. The function $\sigma(\langle{\bm{x}},{\bm{w}}\rangle)$ is Lipschitz with respect to ${\bm{x}}$ with Lipschitz constant $L({\bm{w}}):=L\|{\bm{w}}\|_{2}$. Using that $L({\bm{w}})$ is $L$-Lipschitz with respect to ${\bm{w}}$, we have

\mathbb{P}(L({\bm{w}})\geq 4L\kappa+t)\leq\exp\{-dt^{2}/(2L^{2}\kappa^{2})\}\,,

and therefore there exists a constant $C>0$ depending only on $L$ and $\kappa$ such that

\mathbb{P}(L({\bm{w}})\geq t)\leq C\exp(-t^{2}/(2L^{2}\kappa^{2}))\,.

To make Proposition 1 more concrete, let us make the following remarks:

  1. (a) The transportation cost inequality (30) is a necessary and sufficient condition for any $L$-Lipschitz function of ${\bm{w}}$ to be $(L^{2}\kappa^{2}/d)$-sub-Gaussian. For example, ${\bm{w}}\sim{\sf N}(0,(\kappa^{2}/d)\cdot{\mathbf{I}}_{d})$ satisfies this inequality [vH14, Chapter 4].

  2. (b) If ${\bm{x}}$ is a $\kappa^{2}$-sub-Gaussian random vector, then by Lemma 9 there exist constants $C,c>0$, depending only on $\kappa$, such that $\|{\bm{X}}\|_{{\rm op}}\leq C\sqrt{n}$ with probability at least $1-\exp(-cn)$.

Proposition 1 only focused on FEAT1 and FEAT2. The last two assumptions FEAT3 and FEAT3’ are more difficult to check. The next proposition provides a class of activation functions for which FEAT3 is satisfied.

Proposition 2.

Assume $\phi({\bm{x}};{\bm{w}})=\sigma(\langle{\bm{w}},{\bm{x}}\rangle)$ with ${\bm{w}}\sim{\sf N}(0,{\mathbf{I}}_{d}/d)$. Let $\mu_{k}(\sigma):=\mathbb{E}\{\sigma(G){\rm He}_{k}(G)\}$ be the $k$-th Hermite coefficient of the activation function (where $G\sim{\sf N}(0,1)$, and the normalization $\mathbb{E}\{{\rm He}_{k}(G){\rm He}_{j}(G)\}=k!\,\delta_{jk}$ is assumed).

Consider $({\bm{x}}_{i})_{i\leq n}\sim_{iid}{\rm Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $d^{\ell+\delta}\leq n\leq d^{\ell+1-\delta}$ for some constant $\delta>0$. If $\mu_{\ell}(\sigma)$ is bounded away from zero and $|\mu_{k}(\sigma)|\leq C/k^{k+1/2}$ for all $k$ large enough, then FEAT3 is satisfied with high probability with $c=1/4$ and nonrandom $\eta$ independent of $d,n$.

A more general version of this result for FEAT3 is proved in Appendix C. While we expect FEAT3 and FEAT3' to hold more generally, we do not have a more general proof at the moment. We can bypass this difficulty below by considering noisy features.

Proposition 3.

Consider the same setting as in Proposition 1. Define $\overline{\phi}({\bm{x}};{\bm{w}})=\sigma(\langle{\bm{x}},{\bm{w}}\rangle)$ and $\phi({\bm{x}};{\bm{w}}):=\phi({\bm{x}};{\bm{w}},z)=\sigma(\langle{\bm{x}},{\bm{w}}\rangle)+z$, where $z\sim{\sf N}(0,\gamma^{2})$. Then FEAT1, FEAT2, FEAT3 and FEAT3' are satisfied with constants $L({\bm{w}})=L\|{\bm{w}}\|_{2}$, $C_{r}=\Gamma((r+1)/2)$ and

\eta=\frac{\gamma}{10\lambda_{\max}({\bm{K}}_{n})^{1/2}}\,,   (32)
\tau=\tau_{1}\vee\tau_{2}\,,\;\;\;\;\tau_{1}=\kappa Lr_{0},\qquad\tau_{2}=\kappa^{2}L^{2}\frac{\|{\bm{X}}\|^{2}_{{\rm op}}}{\gamma^{2}d}+1\,.   (33)
Proof of Proposition 3.

First notice that ${\bm{K}}_{n}=\overline{\bm{K}}_{n}+\gamma^{2}{\mathbf{I}}_{n}$, where we denoted $(\overline{\bm{K}}_{n})_{i,j}=\mathbb{E}_{{\bm{w}}}[\sigma(\langle{\bm{x}}_{i},{\bm{w}}\rangle)\sigma(\langle{\bm{x}}_{j},{\bm{w}}\rangle)]$. FEAT1 and FEAT2 are verified as in Proposition 1, with the difference that for ${\bm{v}}\in\mathbb{R}^{n}$, $\|{\bm{v}}\|_{2}=1$,

\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle=\langle{\bm{v}},{\bm{K}}_{n}^{-1/2}\overline{\bm{\phi}}_{n}({\bm{w}})\rangle+\tilde{z}\,,

where $\tilde{z}\sim{\sf N}(0,\gamma^{2}\langle{\bm{v}},{\bm{K}}_{n}^{-1}{\bm{v}}\rangle)$, and therefore ${\bm{\psi}}_{n}({\bm{w}})$ is a $\tau_{2}^{2}$-sub-Gaussian random vector with

\tau_{2}^{2}=\kappa^{2}L^{2}\frac{\|{\bm{X}}\|_{{\rm op}}^{2}}{\gamma^{2}d}+1\,.

Let us now check FEAT3. Define $\Delta=\gamma^{2}\langle{\bm{v}},{\bm{K}}_{n}^{-1}{\bm{v}}\rangle$. Recalling that $\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle=\langle{\bm{v}},{\bm{K}}_{n}^{-1/2}\overline{\bm{\phi}}_{n}({\bm{w}})\rangle+\tilde{z}$ with $\tilde{z}\sim{\sf N}(0,\Delta)$ independent of $\overline{\bm{\phi}}_{n}({\bm{w}})$, we have

(|𝒗,𝝍n(𝒘)|η)\displaystyle\mathbb{P}(|\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle|\leq\eta) supx(xηz~x+η)\displaystyle\leq\sup_{x\in\mathbb{R}}\mathbb{P}(x-\eta\leq\tilde{z}\leq x+\eta)
2η2πΔ=(50πλmax(𝑲n)𝒗,𝑲n1𝒗)1/2110.\displaystyle\leq\frac{2\eta}{\sqrt{2\pi\Delta}}=\big{(}50\pi\lambda_{\max}({\bm{K}}_{n})\langle{\bm{v}},{\bm{K}}^{-1}_{n}{\bm{v}}\rangle\big{)}^{-1/2}\leq\frac{1}{10}\,.

where the last inequality holds since \langle{\bm{v}},{\bm{K}}^{-1}_{n}{\bm{v}}\rangle\geq\lambda_{\max}({\bm{K}}_{n})^{-1} for any unit vector {\bm{v}}. This shows that FEAT3 is satisfied with the stated value of η\eta.

Finally, let us check FEAT3’. Consider r(1,0)r\in(-1,0). Letting G𝖭(0,1)G\sim{\sf N}(0,1), we have

𝔼[|𝒗,𝝍n(𝒘)|r]\displaystyle\mathbb{E}[|\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle|^{r}]\leq supx𝔼z~[|z~+x|r]𝔼z~[|z~|r]\displaystyle~{}\sup_{x\in\mathbb{R}}\mathbb{E}_{\tilde{z}}[|\tilde{z}+x|^{r}]\leq\mathbb{E}_{\tilde{z}}[|\tilde{z}|^{r}]
=\displaystyle= Δr/2𝔼[|G|r]ηr𝔼[|G|r],\displaystyle~{}\Delta^{r/2}\mathbb{E}[|G|^{r}]\leq\eta^{r}\mathbb{E}[|G|^{r}]\,,

where in the last inequality we used that r<0r<0 and Δ=γ2𝒗,𝑲n1𝒗γ2λmax(𝑲n)1=100η2η2\Delta=\gamma^{2}\langle{\bm{v}},{\bm{K}}_{n}^{-1}{\bm{v}}\rangle\geq\gamma^{2}\lambda_{\max}({\bm{K}}_{n})^{-1}=100\eta^{2}\geq\eta^{2}. FEAT3’ is satisfied with Cr=𝔼[|G|r]Γ((r+1)/2)<C_{r}=\mathbb{E}[|G|^{r}]\leq\Gamma((r+1)/2)<\infty. ∎
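As a numerical sanity check on the anti-concentration bound just established (condition FEAT3), the following Monte Carlo sketch estimates (|𝒗,𝝍n(𝒘)|η)\mathbb{P}(|\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle|\leq\eta) for noisy ReLU features, with η\eta as in Eq. (32). The activation, the sizes n,dn,d, the noise level γ\gamma and the sample sizes are illustrative choices, and 𝑲n{\bm{K}}_{n} is itself approximated by Monte Carlo.

import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 20, 30, 0.5

def sigma(t):
    return np.maximum(t, 0.0)                         # ReLU activation (illustrative)

# Covariates on the sphere of radius sqrt(d).
X = rng.standard_normal((n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)

# Approximate K_n = E_w[sigma(Xw) sigma(Xw)^T] + gamma^2 I_n by Monte Carlo.
W = rng.standard_normal((200_000, d)) / np.sqrt(d)    # w ~ N(0, I_d / d)
Phi = sigma(X @ W.T)                                  # n x M feature evaluations
K = Phi @ Phi.T / W.shape[0] + gamma**2 * np.eye(n)

eta = gamma / (10.0 * np.sqrt(np.linalg.eigvalsh(K).max()))
K_inv_half = np.linalg.inv(np.linalg.cholesky(K))     # a square root of K^{-1}

# Fresh noisy features psi_n(w) = K^{-1/2}(sigma(Xw) + z), z ~ N(0, gamma^2).
W_new = rng.standard_normal((50_000, d)) / np.sqrt(d)
z_new = gamma * rng.standard_normal(W_new.shape[0])
Psi = K_inv_half @ (sigma(X @ W_new.T) + z_new)       # n x M'

for _ in range(3):
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    print("P(|<v, psi_n(w)>| <= eta) ~", np.mean(np.abs(v @ Psi) <= eta))

The estimated probabilities should come out well below the target value 1/101/10 of FEAT3, in line with the computation above.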

Corollary 2.

Consider the same setting as in Proposition 1, and Proposition 3, namely ϕ¯(𝐱;𝐰)=σ(𝐱,𝐰)\overline{\phi}({\bm{x}};{\bm{w}})=\sigma(\langle{\bm{x}},{\bm{w}}\rangle) and ϕ(𝐱;𝐰):=ϕ(𝐱;𝐰,z)=σ(𝐱,𝐰)+z\phi({\bm{x}};{\bm{w}}):=\phi({\bm{x}};{\bm{w}},z)=\sigma(\langle{\bm{x}},{\bm{w}}\rangle)+z where z𝖭(0,γ2)z\sim{\sf N}(0,\gamma^{2}), γ1\gamma\leq 1.

Further assume (𝐱i)in({\bm{x}}_{i})_{i\leq n} to be independent, centered and isotropic, with 𝐱i2r0d\|{\bm{x}}_{i}\|_{2}\leq r_{0}\sqrt{d} almost surely. Then there exist constants C,cC,c depending only on L,r0,κL,r_{0},\kappa (but not on n,d,σn,d,\sigma or the distribution of the data 𝐱i{\bm{x}}_{i}), such that conditions FEAT1, FEAT2, FEAT3 and FEAT3’ are satisfied with probability at least 1Cecn1-Ce^{-cn} for the constants

η=cγn,τ\displaystyle\eta=\frac{c\gamma}{\sqrt{n}}\,,\;\;\;\;\;\;\;\;\tau =Cγ(ndlogd).\displaystyle=\frac{C}{\gamma}\Big{(}\sqrt{\frac{n}{d}}\vee\sqrt{\log d}\Big{)}\,. (34)

In particular, consider the case ρ(x)=|x|p/p\rho(x)=|x|^{p}/p, p(1,2]p\in(1,2]. If (yi)in(y_{i})_{i\leq n} are independent with 𝔼{yi4}C\mathbb{E}\{y_{i}^{4}\}\leq C, then with probability at least 1Cn11-Cn^{-1}, we have for NCγ2Q2(Q3)(ndlogd)QnQ3+Q/2(logn)Q/2N\geq C^{\prime}\gamma^{-2Q-2(Q\vee 3)}(\frac{n}{d}\vee\log d)^{Q}n^{Q\vee 3+Q/2}(\log n)^{Q/2},

f^N(;𝝀^N)f^(;𝝀^)L2()Cn(Q3+1)/2γQ3+2Q+1(ndlogd)Q(nlogNN(nlogN)Q/2N).\|\hat{f}_{N}(\,\cdot\,;\hat{\bm{\lambda}}_{N})-\hat{f}(\,\cdot\,;\hat{\bm{\lambda}})\|_{L_{2}(\mathbb{P})}\leq C\frac{n^{(Q\vee 3+1)/2}}{\gamma^{Q\vee 3+2Q+1}}\Big{(}\frac{n}{d}\vee\log d\Big{)}^{Q}\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{Q/2}}{N}\Big{)}\,. (35)
Proof of Corollary 2.

We have

λmax(𝑲n)\displaystyle\lambda_{\max}({\bm{K}}_{n}) =sup𝒖2=1𝔼𝒘,z[ϕn(𝒘,z),𝒖2]\displaystyle=\sup_{\|{\bm{u}}\|_{2}=1}\mathbb{E}_{{\bm{w}},z}[\langle\bm{\phi}_{n}({\bm{w}},z),{\bm{u}}\rangle^{2}]
𝔼𝒘,z[ϕn(𝒘,z)2]nsup𝒙2r0d𝔼𝒘,z[ϕ(𝒙;𝒘,z)2]\displaystyle\leq\mathbb{E}_{{\bm{w}},z}[\|\bm{\phi}_{n}({\bm{w}},z)\|^{2}]\leq n\sup_{\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}}\mathbb{E}_{{\bm{w}},z}[\phi({\bm{x}};{\bm{w}},z)^{2}]
nsup𝒙2r0d{L2𝔼[𝒘,𝒙2]+γ2}Cn.\displaystyle\leq n\sup_{\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}}\big{\{}L^{2}\mathbb{E}[\langle{\bm{w}},{\bm{x}}\rangle^{2}]+\gamma^{2}\big{\}}\leq Cn\,.

Substituting into Eq. (32), we get the desired bound on η\eta.

Considering the estimate on τ\tau, we note that 𝑿opC(n+dlogd)\|{\bm{X}}\|_{{\rm op}}\leq C(\sqrt{n}+\sqrt{d\log d}) with the stated probability by [Ver18, Theorem 5.6.1]. Substituting in Eq. (33), we get the desired estimate of τ\tau.

Finally, to prove (35), recall that 𝑲n=𝑲¯n+γ2𝐈n{\bm{K}}_{n}=\overline{\bm{K}}_{n}+\gamma^{2}{\mathbf{I}}_{n}, where 𝑲¯n𝟎\overline{\bm{K}}_{n}\succeq{\bm{0}}, whence 𝑲n1/2𝒚2γ1𝒚2Cγ1n\|{\bm{K}}_{n}^{-1/2}{\bm{y}}\|_{2}\leq\gamma^{-1}\|{\bm{y}}\|_{2}\leq C\gamma^{-1}\sqrt{n}, where the second step follows from Markov’s inequality. Substituting in Eq. (22) yields the desired bound (35). ∎

Remark 6.

The estimates of τ\tau and η\eta in Proposition 3 and Corollary 2 are not optimal. First, we expect the correct order of the second expression in Eq. (31) to often be 𝐊n1/2𝐗op/d\|{\bm{K}}_{n}^{-1/2}{\bm{X}}\|_{{\rm op}}/\sqrt{d} instead of 𝐊n1/2op𝐗op/d\|{\bm{K}}_{n}^{-1/2}\|_{{\rm op}}\|{\bm{X}}\|_{{\rm op}}/\sqrt{d}. For many cases of interest the former is of order one, while the latter is of order n/d\sqrt{n/d} as we saw.

Second, we expect the dependence of η\eta on λmax(𝐊n)\lambda_{\max}({\bm{K}}_{n}) to be often milder (see Appendix C). Third, we know that in many interesting cases, λmax(𝐊n)\lambda_{\max}({\bm{K}}_{n}) is of order n/dn/d instead of order nn. This is for instance the case if 𝐰Unif(𝕊d1(1)){\bm{w}}\sim{\rm Unif}(\mathbb{S}^{d-1}(1)), 𝐱iUnif(𝕊d1(d)){\bm{x}}_{i}\sim{\rm Unif}(\mathbb{S}^{d-1}(\sqrt{d})) [MMM21] (under the assumption 𝔼[ϕn(𝐰)]=𝟎\mathbb{E}[\bm{\phi}_{n}({\bm{w}})]={\bm{0}}).

5.3 The latent linear model

This section provides a simple example in which the estimates derived above can be strengthened. We consider the following model

ϕ(𝒙;𝒘,z)=𝒙,𝒘+z,\phi({\bm{x}};{\bm{w}},z)=\langle{\bm{x}},{\bm{w}}\rangle+z\,,

which we will refer to as the ‘latent linear model.’

While the latent linear model is extremely simple, it was shown in some settings to have the same asymptotic behavior as the noiseless nonlinear random features model ϕ(𝒙;𝒘)=σ(𝒘,𝒙)\phi({\bm{x}};{\bm{w}})=\sigma(\langle{\bm{w}},{\bm{x}}\rangle) [MM19, HL20]. For instance, consider the case 𝒘Unif(𝕊d1(1)){\bm{w}}\sim{\rm Unif}(\mathbb{S}^{d-1}(1)), 𝒙Unif(𝕊d1(d)){\bm{x}}\sim{\rm Unif}(\mathbb{S}^{d-1}(\sqrt{d})). Then 𝒘,𝒙\langle{\bm{w}},{\bm{x}}\rangle is approximately 𝖭(0,1){\sf N}(0,1). Decompose σ(t)=μ0+μ1t+σ(t)\sigma(t)=\mu_{0}+\mu_{1}t+\sigma_{\perp}(t), where σ\sigma_{\perp} is orthogonal to linear and constant functions in L2(;γ)L^{2}({\mathbb{R}};\gamma), with γ\gamma the standard normal measure. Then [MM19] shows that in the proportional asymptotics NndN\asymp n\asymp d, ridge regression in the nonlinear random features model behaves as ridge regression with the latent linear model ϕ(𝒙;𝒘,z)=μ0+μ1𝒘,𝒙+μz\phi({\bm{x}};{\bm{w}},z)=\mu_{0}+\mu_{1}\langle{\bm{w}},{\bm{x}}\rangle+\mu_{\star}z, with z𝖭(0,1)z\sim{\sf N}(0,1) independent of 𝒘,𝒙{\bm{w}},{\bm{x}}. Here we study the latent linear model from a different perspective and in a broader context than [MM19, GLK+20, HL20].
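To make the decomposition σ(t)=μ0+μ1t+σ(t)\sigma(t)=\mu_{0}+\mu_{1}t+\sigma_{\perp}(t) concrete, the following minimal Monte Carlo sketch estimates the coefficients μ0,μ1\mu_{0},\mu_{1} and μ2=𝔼{σ(G)2}\mu_{\star}^{2}=\mathbb{E}\{\sigma_{\perp}(G)^{2}\} entering the Gaussian-equivalent latent linear model; the ReLU activation and the sample size are illustrative choices, and the closed-form values in the last line are specific to ReLU.

import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal(2_000_000)
sig = np.maximum(G, 0.0)                        # ReLU evaluated at G ~ N(0,1)

mu0 = sig.mean()                                # mu_0 = E[sigma(G)]
mu1 = (sig * G).mean()                          # mu_1 = E[sigma(G) G]
mu_star = np.sqrt(max((sig**2).mean() - mu0**2 - mu1**2, 0.0))

print("Monte Carlo :", mu0, mu1, mu_star)
# Closed-form values for ReLU: 1/sqrt(2 pi), 1/2, sqrt(1/4 - 1/(2 pi)).
print("closed form :", 1/np.sqrt(2*np.pi), 0.5, np.sqrt(0.25 - 1/(2*np.pi)))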

We make the following assumptions:

A1

(Covariates distribution) The covariates 𝒙{\bm{x}} are κ2\kappa^{2}-subgaussian random vectors with zero mean and second moment 𝔼[𝒙𝒙𝖳]=𝚺x\mathbb{E}[{\bm{x}}{\bm{x}}^{\mathsf{T}}]=\bm{\Sigma}_{x} and bounded support 𝒙2r0d\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}.

A2

(Features distribution) The features 𝒘{\bm{w}} follow a multivariate Gaussian distribution with zero mean and covariance 𝚺w/d\bm{\Sigma}_{w}/d. Furthermore, assume that λmax(𝚺w)κ\lambda_{\max}(\bm{\Sigma}_{w})\leq\kappa.

A3

(Features noise distribution) We assume that z𝖭(0,γ2)z\sim{\sf N}(0,\gamma^{2}).

Proposition 4.

Assume that conditions A1, A2 and A3 hold. Then there exist constants C,c>0C,c>0 such that for nCdn\geq Cd, the conditions FEAT1, FEAT2, FEAT3 and FEAT3’ hold with probability at least 1ecn1-e^{-cn}, with constants τ,η\tau,\eta depending only on the constants in A1, A2 and A3. In particular, τ,η\tau,\eta can be taken to be independent of d,nd,n.

Proof of Proposition 4.

We begin by noticing that 𝑲n=𝔼𝒘,𝒛[ϕn(𝒘,𝒛)ϕn(𝒘,𝒛)𝖳]=𝑿(𝚺w/d)𝑿𝖳+γ2𝐈n{\bm{K}}_{n}=\mathbb{E}_{{\bm{w}},{\bm{z}}}[\bm{\phi}_{n}({\bm{w}},{\bm{z}})\bm{\phi}_{n}({\bm{w}},{\bm{z}})^{\mathsf{T}}]={\bm{X}}(\bm{\Sigma}_{w}/d){\bm{X}}^{\mathsf{T}}+\gamma^{2}{\mathbf{I}}_{n}.

Consider condition FEAT1. We have ϕ¯(𝒙;𝒘)=𝒙,𝒘𝖭(0,𝒙𝖳𝚺w𝒙/d)\overline{\phi}({\bm{x}};{\bm{w}})=\langle{\bm{x}},{\bm{w}}\rangle\sim{\sf N}(0,{\bm{x}}^{\mathsf{T}}\bm{\Sigma}_{w}{\bm{x}}/d) is mean 0. By A1 and A2, ϕ¯(𝒙;𝒘)\overline{\phi}({\bm{x}};{\bm{w}}) is (r0κ)2(r_{0}\kappa)^{2}-subgaussian. Furthermore, for any fixed 𝒙1,,𝒙nd{\bm{x}}_{1},\dots,{\bm{x}}_{n}\in{\mathbb{R}}^{d}, 𝝍n(𝒘){\bm{\psi}}_{n}({\bm{w}}) is by construction isotropic, and Gaussian (since 𝒘{\bm{w}} and zz are). Therefore, it is 11-subgaussian and mean 0.

FEAT2 is easily verified with L(𝒘)=𝒘2L({\bm{w}})=\|{\bm{w}}\|_{2} and (L(𝒘)t)Cexp(t2/(2κ2))\mathbb{P}(L({\bm{w}})\geq t)\leq C\exp(-t^{2}/(2\kappa^{2})).

For condition FEAT3, note that 𝒗,𝝍n(𝒘)𝖭(0,1)\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle\sim{\sf N}(0,1) for any unit vector 𝒗{\bm{v}}. Therefore we have (|𝒗,𝝍n(𝒘)|η)(2/π)1/2η\mathbb{P}(|\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle|\leq\eta)\leq(2/\pi)^{1/2}\eta, whence we can take η=1/10\eta=1/10.

Finally, for FEAT3’, 𝔼[|𝒗,𝝍n(𝒘)|r]=𝔼[|G|r]=:Cr\mathbb{E}[|\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle|^{r}]=\mathbb{E}[|G|^{r}]=:C_{r} for G𝖭(0,1)G\sim{\sf N}(0,1). ∎
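The proof above hinges on the observation that, for the latent linear model, 𝝍n(𝒘)=𝑲n1/2ϕn(𝒘,z){\bm{\psi}}_{n}({\bm{w}})={\bm{K}}_{n}^{-1/2}\bm{\phi}_{n}({\bm{w}},z) is exactly an isotropic Gaussian vector. The following minimal numerical sketch checks this; the sizes, the design 𝑿{\bm{X}} and the covariance 𝚺w\bm{\Sigma}_{w} are illustrative choices, not prescribed by the paper.

import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 15, 40, 0.3
X = rng.standard_normal((n, d))                       # any fixed design
Sigma_w = np.diag(rng.uniform(0.5, 1.5, size=d))      # features covariance (illustrative)

K = X @ (Sigma_w / d) @ X.T + gamma**2 * np.eye(n)    # K_n for the latent linear model
K_inv_half = np.linalg.inv(np.linalg.cholesky(K))     # a square root of K^{-1}

M = 200_000
W = rng.multivariate_normal(np.zeros(d), Sigma_w / d, size=M)   # w ~ N(0, Sigma_w / d)
Z = gamma * rng.standard_normal(M)
Psi = K_inv_half @ (X @ W.T + Z)                      # n x M, columns are psi_n(w)

# The empirical covariance of psi_n(w) should be close to the identity.
emp_cov = Psi @ Psi.T / M
print("max deviation from I_n:", np.abs(emp_cov - np.eye(n)).max())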

We finally notice that, for the latent linear model, we can improve Theorem 1: using the constraint that the (randomized) predictor interpolates the data, the bound (20) can be improved by a factor n\sqrt{n}. We expect this insight to generalize to the non-linear setting.

Proposition 5.

Assume conditions A1, A2 and A3 hold. There exist constants C,cC^{\prime},c^{\prime} depending only on the constants in those assumptions such that if NC(nlog(N))Q/2N\geq C^{\prime}(n\log(N))^{Q/2} and nCdn\geq C^{\prime}d, then with probability at least 1CecnCNcn1-C^{\prime}e^{-c^{\prime}n}-C^{\prime}N^{-c^{\prime}n},

f^N(;𝝀^N)f^0(;𝝀^0)L2C(nlogNN(nlogN)Q/2N)𝒚2n,\|\hat{f}_{N}(\cdot;\hat{\bm{\lambda}}_{N})-\hat{f}_{0}(\cdot;\hat{\bm{\lambda}}_{0})\|_{L_{2}}\leq C^{\prime}\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{Q/2}}{N}\Big{)}\frac{\|{\bm{y}}\|_{2}}{\sqrt{n}}\,, (36)

where we denoted Q=Q1Q2Q=Q_{1}\vee Q_{2}.

In other words, the random features predictor f^N(;𝝀^N)\hat{f}_{N}(\,\cdot\,;{\hat{\bm{\lambda}}}_{N}) differs negligibly from the infinite-width predictor f^(;𝝀^)\hat{f}(\cdot;{\hat{\bm{\lambda}}}) when N(nlog(n))1(Q/2)N\gg(n\log(n))^{1\vee(Q/2)}. In particular, for Q=2Q=2, we obtain that Nnlog(n)N\gg n\log(n) features are sufficient, which matches the results for kernel methods [MM19, MMM21]. The proof of Proposition 5 is deferred to Appendix B.

6 Proof of Theorem 1: Convergence to the population predictor

The proof of Theorem 1 is structured as follows. We first define three events on which the finite-width dual objective FNF_{N} and the random features predictor f^N(;𝝀)\hat{f}_{N}(\,\cdot\,;{\bm{\lambda}}) satisfy certain concentration properties. Lemma 1 shows that the simultaneous occurrence of these events implies a bound on f^N(;𝝀^N)f^(;𝝀^)L2()\|\hat{f}_{N}(\,\cdot\,;\hat{\bm{\lambda}}_{N})-\hat{f}(\,\cdot\,;\hat{\bm{\lambda}})\|_{L_{2}(\mathbb{P})}. We then verify that these events occur with high probability.

Throughout the proofs, we will use the notation 𝒖𝑨=𝑨1/2𝒖2\|{\bm{u}}\|_{{\bm{A}}}=\|{\bm{A}}^{1/2}{\bm{u}}\|_{2} for 𝒖n{\bm{u}}\in\mathbb{R}^{n} and 𝑨n×n{\bm{A}}\in\mathbb{R}^{n\times n} a positive semidefinite matrix. We also use the standard big-Oh and little-o notations whereby the subscript is used to indicate the asymptotic variable. For instance, we write fN=oN(gN)f_{N}=o_{N}(g_{N}) if fN/gN0f_{N}/g_{N}\to 0 as NN\to\infty.

The three events mentioned above are defined as follows:

  1. 1.

    (Uniform concentration of the predictor) Event 1{\mathcal{E}}_{1} is the event that for all 𝝀𝑲n/𝝀^𝑲n[1/2,2]\|{\bm{\lambda}}\|_{{\bm{K}}_{n}}/\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\in[1/2,2]

    f^N(;𝝀)f^(;𝝀)L2s(𝝀^𝑲n)=1Nj=1Nϕ¯𝒘j()s(ϕn,j,𝝀)𝔼𝒘,ϕ[ϕ¯𝒘()s(ϕn(𝒘),𝝀)]L2s(𝝀^𝑲n)ε1.\frac{\|\hat{f}_{N}(\,\cdot\,;{\bm{\lambda}})-\hat{f}(\,\cdot\,;{\bm{\lambda}})\|_{L_{2}}}{s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})}=\frac{\Big{\|}\frac{1}{N}\sum_{j=1}^{N}\overline{\phi}_{{\bm{w}}_{j}}(\,\cdot\,)s(\langle\bm{\phi}_{n,j},{\bm{\lambda}}\rangle)-\mathbb{E}_{{\bm{w}},\phi}\Big{[}\overline{\phi}_{{\bm{w}}}(\,\cdot\,)\,s(\langle\bm{\phi}_{n}({\bm{w}}),{\bm{\lambda}}\rangle)\Big{]}\Big{\|}_{L_{2}}}{s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})}\leq\varepsilon_{1}.
  2. 2.

    (Concentration of dual gradient at 𝛌^{\hat{\bm{\lambda}}}) Event 2{\mathcal{E}}_{2} is the event that

    FN(𝝀^)𝑲n1s(𝝀^𝑲n)=1Nj=1N𝝍n,js(ϕn,j,𝝀^)𝔼𝒘,ϕ[𝝍n(𝒘)s(ϕn(𝒘),𝝀^)]2s(𝝀^𝑲n)ε2.\frac{\|\nabla F_{N}({\hat{\bm{\lambda}}})\|_{{\bm{K}}_{n}^{-1}}}{s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})}=\frac{\Big{\|}\frac{1}{N}\sum_{j=1}^{N}{\bm{\psi}}_{n,j}s(\langle\bm{\phi}_{n,j},{\hat{\bm{\lambda}}}\rangle)-\mathbb{E}_{{\bm{w}},\phi}\Big{[}{\bm{\psi}}_{n}({\bm{w}})s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)\Big{]}\Big{\|}_{2}}{s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})}\leq\varepsilon_{2}.
  3. 3.

    (Uniform lower bound on dual curvature) Event 3{\mathcal{E}}_{3} is the event that for all 𝝀𝑲n/𝝀^𝑲n[1/2,2]\|{\bm{\lambda}}\|_{{\bm{K}}_{n}}/\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}\in[1/2,2]

    𝑲n1/22FN(𝝀)𝑲n1/2s(𝝀^𝑲n)/𝝀^𝑲n=1Nj=1N𝝍n,j𝝍n,js(ϕn,j,𝝀)s(𝝀^𝑲n)/𝝀^𝑲nβ𝐈n.\frac{{\bm{K}}_{n}^{-1/2}\nabla^{2}F_{N}({\bm{\lambda}}){\bm{K}}_{n}^{-1/2}}{s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})/\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}}=-\frac{\frac{1}{N}\sum_{j=1}^{N}{\bm{\psi}}_{n,j}{\bm{\psi}}_{n,j}^{\top}s^{\prime}(\langle\bm{\phi}_{n,j},{\bm{\lambda}}\rangle)}{s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})/\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}}\preceq-\beta{\mathbf{I}}_{n}.

The first event 1{\mathcal{E}}_{1} corresponds to the random features predictor f^N(;𝝀)\hat{f}_{N}(\,\cdot\,;{\bm{\lambda}}) approximating the infinite-width predictor f^(;𝝀)\hat{f}(\,\cdot\,;{\bm{\lambda}}) uniformly well over 𝝀{\bm{\lambda}} in a region around 𝝀^{\hat{\bm{\lambda}}}. Events 2{\mathcal{E}}_{2} and 3{\mathcal{E}}_{3} relate to local properties of the finite-width dual objective FNF_{N} around 𝝀^{\hat{\bm{\lambda}}}: on 2{\mathcal{E}}_{2}, the gradient of the dual objective FN(𝝀^)\nabla F_{N}({\hat{\bm{\lambda}}}) concentrates around the gradient of the infinite width dual objective F(𝝀^)=𝟎\nabla F({\hat{\bm{\lambda}}})={\bm{0}}; on 3{\mathcal{E}}_{3}, the Hessian of the dual objective 2FN(𝝀)\nabla^{2}F_{N}({\bm{\lambda}}) has maximum eigenvalue uniformly bounded away from 0 for 𝝀{\bm{\lambda}} in a region around 𝝀^{\hat{\bm{\lambda}}}.

Because these three events involve concentration or bounds on empirical means over a sample of NN features, it is perhaps not surprising that the preceding bounds can be established with high probability for ε1,ε2=o(1)\varepsilon_{1},\varepsilon_{2}=o(1) and β=Θ(1)\beta=\Theta(1) appropriately chosen when NN is sufficiently large compared to nn.

If the infinite-width predictor f^(;𝝀)\hat{f}(\,\cdot\,;{\bm{\lambda}}) satisfies a certain continuity property in 𝝀{\bm{\lambda}}, then the events 1{\mathcal{E}}_{1}, 2{\mathcal{E}}_{2}, 3{\mathcal{E}}_{3} imply a bound on f^N(;𝝀^N)f^(;𝝀^)L2\|\hat{f}_{N}(\,\cdot\,;\hat{\bm{\lambda}}_{N})-\hat{f}(\,\cdot\,;{\hat{\bm{\lambda}}})\|_{L_{2}}.

Lemma 1.

Assume that for all 𝛌𝛌^𝐊n𝛌^𝐊n/2\|{\bm{\lambda}}-{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}\leq\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}/2,

f^(;𝝀)f^(;𝝀^)L2s(𝝀^𝑲n)K𝝀𝝀^𝑲n𝝀^𝑲n,\frac{\|\hat{f}(\,\cdot\,;{\bm{\lambda}})-\hat{f}(\,\cdot\,;{\hat{\bm{\lambda}}})\|_{L_{2}}}{s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})}\leq K\frac{\|{\bm{\lambda}}-{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}}{\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}}\,, (37)

for some K>0K>0. If ε2/β1/4\varepsilon_{2}/\beta\leq 1/4, then on events 1,2,3{\mathcal{E}}_{1},{\mathcal{E}}_{2},{\mathcal{E}}_{3}, we have

f^N(;𝝀^N)f^(;𝝀^)L2(ε1+2Kε2β)s(𝝀^𝑲n).\|\hat{f}_{N}(\,\cdot\,;\hat{\bm{\lambda}}_{N})-\hat{f}(\,\cdot\,;{\hat{\bm{\lambda}}})\|_{L_{2}}\leq\Big{(}\varepsilon_{1}+2\frac{K\varepsilon_{2}}{\beta}\Big{)}s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\,.

The continuity property for f^\hat{f} used in Lemma 1 is much easier to establish than the corresponding continuity property of f^N\hat{f}_{N}. In the setting of Theorem 1, we will show that we can choose K,β=ΘN(1)K,\beta=\Theta_{N}(1) and ε1,ε2=oN(1)\varepsilon_{1},\varepsilon_{2}=o_{N}(1) such that the above events hold with high probability. Then, we will have ε2/β1/4\varepsilon_{2}/\beta\leq 1/4 eventually, and Lemma 1 can be applied.

Proof of Lemma 1.

Consider 𝝀{\bm{\lambda}} with 𝝀𝝀^𝑲n=(2ε2/β)𝝀^𝑲n𝝀^𝑲n/2\|{\bm{\lambda}}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}=(2\varepsilon_{2}/\beta)\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\leq\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}/2. By Taylor’s theorem, there exists 𝝀{\bm{\lambda}}_{\star} on the line segment between 𝝀{\bm{\lambda}} and 𝝀^\hat{\bm{\lambda}}, such that

FN(𝝀)\displaystyle F_{N}({\bm{\lambda}}) =FN(𝝀^)+(𝝀𝝀^)FN(𝝀^)+12(𝝀𝝀^)2FN(𝝀)(𝝀𝝀^)\displaystyle=F_{N}(\hat{\bm{\lambda}})+({\bm{\lambda}}-\hat{\bm{\lambda}})^{\top}\nabla F_{N}(\hat{\bm{\lambda}})+\frac{1}{2}({\bm{\lambda}}-\hat{\bm{\lambda}})^{\top}\nabla^{2}F_{N}({\bm{\lambda}}_{\star})({\bm{\lambda}}-\hat{\bm{\lambda}})
FN(𝝀^)+𝝀𝝀^𝑲nFN(𝝀^)𝑲n112λmin(𝑲n1/22FN(𝝀)𝑲n1/2)𝝀𝝀^𝑲n2.\displaystyle\leq F_{N}(\hat{\bm{\lambda}})+\|{\bm{\lambda}}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\|\nabla F_{N}(\hat{\bm{\lambda}})\|_{{\bm{K}}_{n}^{-1}}-\frac{1}{2}\lambda_{\min}\big{(}-{\bm{K}}_{n}^{-1/2}\nabla^{2}F_{N}({\bm{\lambda}}_{\star}){\bm{K}}_{n}^{-1/2}\big{)}\|{\bm{\lambda}}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}^{2}.

When both 2{\mathcal{E}}_{2} and 3{\mathcal{E}}_{3} occur and using 𝝀𝝀^𝑲n𝝀𝝀^𝑲n𝝀^𝑲n/2\|{\bm{\lambda}}_{\star}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\leq\|{\bm{\lambda}}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\leq\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}/2, we get

FN(𝝀)\displaystyle F_{N}({\bm{\lambda}}) FN(𝝀^)+2ε2𝝀^𝑲nβε2s(𝝀^𝑲n)12βs(𝝀^𝑲n)𝝀^𝑲n4ε22𝝀^𝑲n2β2FN(𝝀^).\displaystyle\leq F_{N}(\hat{\bm{\lambda}})+\frac{2\varepsilon_{2}\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}}{\beta}\varepsilon_{2}s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})-\frac{1}{2}\frac{\beta s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})}{\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}}\frac{4\varepsilon_{2}^{2}\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}^{2}}{\beta^{2}}\leq F_{N}(\hat{\bm{\lambda}})\,.

Because FN(𝝀)FN(𝝀^)F_{N}({\bm{\lambda}})\leq F_{N}(\hat{\bm{\lambda}}) for all 𝝀{\bm{\lambda}} with 𝝀𝝀^𝑲n=(2ε2/β)𝝀^𝑲n\|{\bm{\lambda}}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}=(2\varepsilon_{2}/\beta)\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}, and FNF_{N} is strictly concave on this region by 3{\mathcal{E}}_{3}, we conclude

𝝀^N𝝀^𝑲n(2ε2/β)𝝀^𝑲n.\displaystyle\|\hat{\bm{\lambda}}_{N}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\leq(2\varepsilon_{2}/\beta)\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\,.

In this case, by Eq. (37),

f^(;𝝀^N)f^(;𝝀^)L2K𝝀^N𝝀^𝑲n𝝀^𝑲ns(𝝀^𝑲n)2Kε2βs(𝝀^𝑲n).\|\hat{f}(\,\cdot\,;\hat{\bm{\lambda}}_{N})-\hat{f}(\,\cdot\,;\hat{\bm{\lambda}})\|_{L_{2}}\leq K\frac{\|\hat{\bm{\lambda}}_{N}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}}{\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}}s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})\leq\frac{2K\varepsilon_{2}}{\beta}s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}).

If event 1{\mathcal{E}}_{1} occurs, then

f^N(;𝝀^N)f^(;𝝀^N)L2ε1s(𝝀^𝑲n).\|\hat{f}_{N}(\,\cdot\,;\hat{\bm{\lambda}}_{N})-\hat{f}(\,\cdot\,;\hat{\bm{\lambda}}_{N})\|_{L_{2}}\leq\varepsilon_{1}s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}).

Combining the previous displays with the triangle inequality completes the proof. ∎
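The mechanism in the proof above is a standard localization argument: if a strictly concave function has a small gradient at 𝝀^{\hat{\bm{\lambda}}} and curvature bounded away from zero, its maximizer lies in a ball of radius 2FN(𝝀^)/β2\|\nabla F_{N}({\hat{\bm{\lambda}}})\|/\beta around 𝝀^{\hat{\bm{\lambda}}}. The following toy Python sketch (a generic concave quadratic, not the paper's dual objective FNF_{N}) illustrates this bound numerically; all sizes and constants are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, beta = 5, 0.8
A = rng.standard_normal((n, n))
H = A @ A.T + beta * np.eye(n)            # curvature matrix, eigenvalues >= beta
lam_hat = rng.standard_normal(n)
g = 1e-2 * rng.standard_normal(n)         # small gradient at lam_hat (plays the role of eps_2)

def grad_F(lam):
    # gradient of F(lam) = <g, lam - lam_hat> - 0.5 (lam - lam_hat)' H (lam - lam_hat)
    return g - H @ (lam - lam_hat)

# Maximize the concave F by gradient ascent from an arbitrary starting point.
lam = np.zeros(n)
for _ in range(5000):
    lam = lam + 1e-2 * grad_F(lam)

print("||lam_N - lam_hat||          =", np.linalg.norm(lam - lam_hat))
print("2 ||grad F(lam_hat)|| / beta =", 2 * np.linalg.norm(g) / beta)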

We next state three lemmas implying that events 1{\mathcal{E}}_{1}, 2{\mathcal{E}}_{2}, 3{\mathcal{E}}_{3} hold with high probability, as well as the continuity of f^(;𝝀)\hat{f}(\,\cdot\,;{\bm{\lambda}}) in the last lemma. Proofs of these lemmas are deferred to the appendices. We begin with the continuity property of the infinite width predictor.

Lemma 2.

If either (i)(i) assumptions FEAT1 and PEN hold with Q1Q22Q_{1}\wedge Q_{2}\geq 2, or (ii)(ii) assumptions FEAT1, FEAT3’ and PEN hold with Q1Q2<2Q_{1}\wedge Q_{2}<2, then there exists CC^{\prime} depending only on the constants c,C,r0c,C,r_{0} in FEAT3’ and PEN, but not on τ,η\tau,\eta, such that for all 𝛌𝛌^𝐊n𝛌^𝐊n/2\|{\bm{\lambda}}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\leq\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}/2,

f^(;𝝀)f^(;𝝀^)L2s(𝝀^𝑲n)C(τQτ2η(Q2))𝝀𝝀^𝑲n𝝀^𝑲n,\frac{\|\hat{f}(\,\cdot\,;{\bm{\lambda}})-\hat{f}(\,\cdot\,;{\hat{\bm{\lambda}}})\|_{L_{2}}}{s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})}\leq C^{\prime}(\tau^{Q}\vee\tau^{2}\eta^{(Q^{\prime}-2)})\frac{\|{\bm{\lambda}}-{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}}{\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}}\,, (38)

where Q=Q1Q2Q=Q_{1}\vee Q_{2}, Q=Q1Q2Q^{\prime}=Q_{1}\wedge Q_{2}.

Next we state a lemma to check condition 1{\mathcal{E}}_{1}.

Lemma 3.

Assume FEAT1, FEAT2 and PEN hold. Then there exist C,c>0C^{\prime},c^{\prime}>0 depending only on the constants c,C,r0c,C,r_{0} in those assumptions, but not on τ,η\tau,\eta, such that for NncdN\geq n\geq c^{\prime}d, we have with probability at least 1CNcn1-C^{\prime}N^{-c^{\prime}n},

S¯1\displaystyle\bar{S}_{1} :=sup𝒙2r0dsup𝒗22𝝀^𝑲n|1Nj=1Nϕ¯(𝒙;𝒘j)s(𝝍n,j,𝒗)𝔼𝒘,ϕ[ϕ¯(𝒙;𝒘)s(𝝍n(𝒘),𝒗)]|\displaystyle:=\sup_{\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}}\;\sup_{\|{\bm{v}}\|_{2}\leq 2\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}}\Big{|}\frac{1}{N}\sum_{j=1}^{N}\bar{\phi}({\bm{x}};{\bm{w}}_{j})s(\langle{\bm{\psi}}_{n,j},{\bm{v}}\rangle)-\mathbb{E}_{{\bm{w}},\phi}[\bar{\phi}({\bm{x}};{\bm{w}})s(\langle{\bm{\psi}}_{n}({\bm{w}}),{\bm{v}}\rangle)]\Big{|} (39)
CτQs(𝝀^𝑲n)(nlogNN(nlogN)(Q/2)1N),\displaystyle\leq C^{\prime}\tau^{Q}s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{(Q/2)\vee 1}}{N}\Big{)}\,,

where 𝛙n(𝐰)=𝐊n1/2ϕn(𝐰){\bm{\psi}}_{n}({\bm{w}})={\bm{K}}_{n}^{-1/2}\bm{\phi}_{n}({\bm{w}}) and Q=Q1Q2Q=Q_{1}\vee Q_{2}.

The next lemma allows us to check event 2{\mathcal{E}}_{2}.

Lemma 4.

Assume FEAT1 and PEN hold. There exist C,c>0C^{\prime},c^{\prime}>0 depending only on the constants c,C,r0c,C,r_{0} in those assumptions, but not on τ,η\tau,\eta, such that for NncdN\geq n\geq c^{\prime}d, we have with probability at least 1CNcn1-C^{\prime}N^{-c^{\prime}n},

S¯2\displaystyle\bar{S}_{2} :=1Nj=1N𝝍n,js(ϕn,j,𝝀^)𝔼𝒘,ϕ[𝝍n(𝒘)s(ϕn,𝝀^)]2\displaystyle:=\Big{\|}\frac{1}{N}\sum_{j=1}^{N}{\bm{\psi}}_{n,j}s(\langle\bm{\phi}_{n,j},{\hat{\bm{\lambda}}}\rangle)-\mathbb{E}_{{\bm{w}},\phi}[{\bm{\psi}}_{n}({\bm{w}})s(\langle\bm{\phi}_{n},{\hat{\bm{\lambda}}}\rangle)]\Big{\|}_{2} (40)
CτQs(𝝀^𝑲n)(nlogNN(nlogN)(Q/2)1N),\displaystyle\leq C^{\prime}\tau^{Q}s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{(Q/2)\vee 1}}{N}\Big{)},

where 𝛙n(𝐰)=𝐊n1/2ϕn(𝐰){\bm{\psi}}_{n}({\bm{w}})={\bm{K}}_{n}^{-1/2}\bm{\phi}_{n}({\bm{w}}) and Q=Q1Q2Q=Q_{1}\vee Q_{2}.

We finally state a lemma to check event 3{\mathcal{E}}_{3}.

Lemma 5.

Assume that FEAT1, FEAT3 and PEN hold, and further assume that τ,η1nC\tau,\eta^{-1}\leq n^{C} for some absolute constant CC. Define δ0(η):=η3q1q2\delta_{0}(\eta):=\eta^{3\vee q_{1}\vee q_{2}}. There exist C,c>0C^{\prime},c^{\prime}>0 depending only on the constants c,C,r0c,C,r_{0} in the assumptions, but not on τ,η\tau,\eta, such that for NC(τ4/δ0(η)2)nlog(N)N\geq C^{\prime}(\tau^{4}/\delta_{0}(\eta)^{2})n\log(N), we have with probability at least 1CNcn1-C^{\prime}N^{-c^{\prime}n},

sup1/2𝝀𝑲n/𝝀^𝑲n2𝑲n1/22FN(𝝀)𝑲n1/2cδ0(η)s(𝝀^𝑲n)𝝀^𝑲n𝐈n.\displaystyle\sup_{1/2\leq\|{\bm{\lambda}}\|_{{\bm{K}}_{n}}/\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\leq 2}{\bm{K}}_{n}^{-1/2}\nabla^{2}F_{N}({\bm{\lambda}}){\bm{K}}_{n}^{-1/2}\preceq-c^{\prime}\delta_{0}(\eta)\frac{s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})}{\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}}{\mathbf{I}}_{n}\,. (41)

Using the above lemmas, we are now in position to prove our main result, Theorem 1.

Proof of Theorem 1.

Recall we define q=q1q2q=q_{1}\vee q_{2}, Q=Q1Q2Q=Q_{1}\vee Q_{2}, Q=Q1Q2Q^{\prime}=Q_{1}\wedge Q_{2}. Assume NN1N2N\geq N_{1}\vee N_{2} as in the statement. By Lemmas 3, 4, 5 events 1{\mathcal{E}}_{1}, 2{\mathcal{E}}_{2}, 3{\mathcal{E}}_{3} hold with probability

(123)1CNcn,\displaystyle\mathbb{P}({\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}\cap{\mathcal{E}}_{3})\geq 1-C^{\prime}N^{-c^{\prime}n}\,,

with constants

ε1=ε2\displaystyle\varepsilon_{1}=\varepsilon_{2} =CτQ(nlogNN(nlogN)(Q/2)1N),\displaystyle=C^{\prime}\tau^{Q}\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{(Q/2)\vee 1}}{N}\Big{)}\,,
β\displaystyle\beta =cη3q.\displaystyle=c^{\prime}\eta^{3\vee q}\,.

Further, by Lemma 2, the continuity property of Eq. (37) holds with

K=C(τQτ2η(Q2)).\displaystyle K=C^{\prime}(\tau^{Q}\vee\tau^{2}\eta^{(Q^{\prime}-2)})\,.

We can now apply Lemma 1. Since ε1=ε2\varepsilon_{1}=\varepsilon_{2} and, without loss of generality, K/β1K/\beta\geq 1, we obtain that, with probability at least 1CNcn1-C^{\prime}N^{-c^{\prime}n},

f^N(;𝝀^N)f^(;𝝀^)L2Δs(𝝀^𝑲n),\displaystyle\|\hat{f}_{N}(\,\cdot\,;\hat{\bm{\lambda}}_{N})-\hat{f}(\,\cdot\,;\hat{\bm{\lambda}})\|_{L_{2}}\leq\Delta\cdot s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\,, (42)
Δ=3Kε1β=CτQ+2(τQ2ηQ2)η3q(nlogNN(nlogN)(Q/2)1N)\displaystyle\Delta=\frac{3K\varepsilon_{1}}{\beta}=C^{\prime}\tau^{Q+2}\frac{(\tau^{Q-2}\vee\eta^{Q^{\prime}-2})}{\eta^{3\vee q}}\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{(Q/2)\vee 1}}{N}\Big{)}\,\, (43)

where we verify that ε1/β1/4\varepsilon_{1}/\beta\leq 1/4 by the assumption that NN1N2N\geq N_{1}\vee N_{2}.

Let us bound s(𝝀^𝑲n)s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}). We have by the optimality condition 𝑲n1/2𝒚=𝔼[𝝍n(𝒘)s(𝝍n(𝒘),𝑲n1/2𝝀^)]{\bm{K}}_{n}^{-1/2}{\bm{y}}=\mathbb{E}[{\bm{\psi}}_{n}({\bm{w}})s(\langle{\bm{\psi}}_{n}({\bm{w}}),{\bm{K}}_{n}^{1/2}{\hat{\bm{\lambda}}}\rangle)]. Therefore, denoting 𝒗=𝑲n1/2𝝀^/𝝀^𝑲n{\bm{v}}={\bm{K}}_{n}^{1/2}{\hat{\bm{\lambda}}}/\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}} and using PEN,

𝑲n1/2𝒚2=\displaystyle\|{\bm{K}}_{n}^{-1/2}{\bm{y}}\|_{2}= sup𝒖2=1𝔼[𝒖,𝝍n(𝒘)s(𝝍n(𝒘),𝑲n1/2𝝀^)]\displaystyle\sup_{\|{\bm{u}}\|_{2}=1}\mathbb{E}[\langle{\bm{u}},{\bm{\psi}}_{n}({\bm{w}})\rangle s(\langle{\bm{\psi}}_{n}({\bm{w}}),{\bm{K}}_{n}^{1/2}{\hat{\bm{\lambda}}}\rangle)]
\displaystyle\geq 𝔼[𝒗,𝝍n(𝒘)s(𝝀^𝑲n𝝍n(𝒘),𝒗)]\displaystyle~{}\mathbb{E}[\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}\langle{\bm{\psi}}_{n}({\bm{w}}),{\bm{v}}\rangle)]
\displaystyle\geq cs(𝝀^𝑲n)𝔼[|𝒗,𝝍n(𝒘)|q1|𝒗,𝝍n(𝒘)|q2].\displaystyle~{}cs(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\mathbb{E}[|\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle|^{q_{1}}\wedge|\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle|^{q_{2}}]\,.

We next notice that, for any δ>0\delta>0, and any Cτ2C\tau^{2}-sub-Gaussian random variable XX with E[X2]=1E[X^{2}]=1, 𝔼[|X|q1|X|q2]C(δ)τ(2q1q2+δ)+\mathbb{E}[|X|^{q_{1}}\wedge|X|^{q_{2}}]\geq C(\delta)\tau^{-(2-q_{1}\wedge q_{2}+\delta)_{+}}. This basic fact is proved in Appendix F. We apply this inequality to X=𝒗,𝝍n(𝒘)X=\langle{\bm{v}},{\bm{\psi}}_{n}({\bm{w}})\rangle to get

𝑲n1/2𝒚2\displaystyle\|{\bm{K}}_{n}^{-1/2}{\bm{y}}\|_{2}\geq c(δ)s(𝝀^𝑲n)τ(2q1q2+δ)+.\displaystyle c(\delta)s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\,\tau^{-(2-q_{1}\wedge q_{2}+\delta)_{+}}\,.

Using this bound together with Eqs. (42), (43) yields the claim of the theorem. ∎

Acknowledgements

T.M. thanks Enric Boix-Adsera for helpful discussions about hardness results. This work was supported by the NSF through award DMS-2031883 and by the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning. We also acknowledge NSF grants CCF-2006489, IIS-1741162 and the ONR grant N00014-18-1-2729. M.C. was supported by the National Science Foundation Graduate Research Fellowship under grant DGE-1656518.

References

  • [Bac17] Francis Bach, Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research 18 (2017), no. 1, 629–681.
  • [Bar98] Peter L Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE transactions on Information Theory 44 (1998), no. 2, 525–536.
  • [BBV06] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala, Kernels as features: On kernels, margins, and low-dimensional mappings, Machine Learning 65 (2006), no. 1, 79–94.
  • [BG99] Sergej G Bobkov and Friedrich Götze, Exponential integrability and transportation cost related to logarithmic sobolev inequalities, Journal of Functional Analysis 163 (1999), no. 1, 1–28.
  • [Bis95] Chris M Bishop, Training with noise is equivalent to Tikhonov regularization, Neural computation 7 (1995), no. 1, 108–116.
  • [BLM13] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence, OUP Oxford, 2013.
  • [BRV+06] Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte, Convex neural networks, Advances in neural information processing systems, 2006, pp. 123–130.
  • [CLvdG20] Geoffrey Chinot, Matthias Löffler, and Sara van de Geer, Minimum 1\ell_{1} norm interpolation via basis pursuit is robust to errors, arXiv:2012.00807 (2020).
  • [COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach, On lazy training in differentiable programming, Advances in Neural Information Processing Systems, 2019, pp. 2937–2947.
  • [CW01] Anthony Carbery and James Wright, Distributional and lql^{q} norm inequalities for polynomials over convex bodies in n\mathbbm{R}^{n}, Mathematical research letters 8 (2001), no. 3, 233–248.
  • [DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, arXiv:1810.02054 (2018).
  • [FGKP06] Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and Ashok Kumar Ponnuswami, New results for learning noisy parities and halfspaces, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), IEEE, 2006, pp. 563–574.
  • [GLK+20] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová, Generalisation error in learning with random features and the hidden manifold model, arXiv:2002.09339 (2020).
  • [GLS12] Martin Grötschel, László Lovász, and Alexander Schrijver, Geometric algorithms and combinatorial optimization, vol. 2, Springer Science & Business Media, 2012.
  • [GLSS18] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro, Characterizing implicit bias in terms of optimization geometry, Proceedings of Machine Learning Research, vol. 80, PMLR, 2018, pp. 1832–1841.
  • [GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Linearized two-layers neural networks in high dimension, Annals of Statistics (2019), arXiv:1904.12191.
  • [GMMM20]  , When do neural networks outperform kernel methods?, Advances in Neural Information Processing Systems 33 (2020).
  • [GR09] Venkatesan Guruswami and Prasad Raghavendra, Hardness of learning halfspaces with noise, SIAM Journal on Computing 39 (2009), no. 2, 742–765.
  • [HKZ12] Daniel Hsu, Sham Kakade, and Tong Zhang, A tail inequality for quadratic forms of subgaussian random vectors, Electron. Commun. Probab. 17 (2012), 6 pp.
  • [HL20] Hong Hu and Yue M Lu, Universality laws for high-dimensional learning with random features, arXiv:2009.07669 (2020).
  • [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Advances in neural information processing systems, 2018, pp. 8571–8580.
  • [LL18] Yuanzhi Li and Yingyu Liang, Learning overparameterized neural networks via stochastic gradient descent on structured data, Advances in Neural Information Processing Systems, 2018, pp. 8157–8166.
  • [LR20] Tengyuan Liang and Alexander Rakhlin, Just interpolate: Kernel “ridgeless” regression can generalize, Annals of Statistics 48 (2020), no. 3, 1329–1347.
  • [LRZ19] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai, On the risk of minimum-norm interpolants and restricted lower isometry of kernels, arXiv:1908.10292 (2019).
  • [LS20] Tengyuan Liang and Pragya Sur, A precise high-dimensional asymptotic theory for boosting and min-l1-norm interpolated classifiers, arXiv:2002.01586 (2020).
  • [LSV18] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala, Efficient convex optimization with membership oracles, Conference On Learning Theory, PMLR, 2018, pp. 1292–1294.
  • [MKL+20] Francesca Mignacco, Florent Krzakala, Yue Lu, Pierfrancesco Urbani, and Lenka Zdeborova, The role of regularization in classification of high-dimensional noisy gaussian mixture, International Conference on Machine Learning, PMLR, 2020, pp. 6874–6883.
  • [MM19] Song Mei and Andrea Montanari, The generalization error of random features regression: Precise asymptotics and double descent curve, arXiv:1908.05355 (2019).
  • [MMM21] Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration, arXiv:2101.10588 (2021).
  • [MWW20] Chao Ma, Stephan Wojtowytsch, and Lei Wu, Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t, arXiv:2009.10713 (2020).
  • [NTSS17] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro, Geometry of optimization and implicit regularization in deep learning, arXiv:1705.03071 (2017).
  • [OS19] Samet Oymak and Mahdi Soltanolkotabi, Towards moderate overparameterization: global convergence guarantees for training shallow neural networks, arXiv:1902.04674 (2019).
  • [RR08] Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines, Advances in neural information processing systems, 2008, pp. 1177–1184.
  • [RR09]  , Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning, Advances in neural information processing systems, 2009, pp. 1313–1320.
  • [RR17] Alessandro Rudi and Lorenzo Rosasco, Generalization properties of learning with random features, Advances in Neural Information Processing Systems, 2017, pp. 3215–3225.
  • [SSBD14] Shai Shalev-Shwartz and Shai Ben-David, Understanding machine learning: From theory to algorithms, Cambridge university press, 2014.
  • [Ver10] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv:1011.3027 (2010).
  • [Ver18]  , High-dimensional probability: An introduction with applications in data science, vol. 47, Cambridge university press, 2018.
  • [vH14] Ramon van Handel, Probability in high dimension, Tech. report, Princeton University, 2014.
  • [YS19] Gilad Yehudai and Ohad Shamir, On the power and limitations of random features for understanding neural networks, arXiv:1904.00687 (2019).

Appendix A Verifying the continuity and concentration conditions

A.1 Continuity property of the infinite width problem: Proof of Lemma 2

By the fundamental theorem of calculus,

f^(𝒙;𝝀)f^(𝒙;𝝀^)\displaystyle\hat{f}({\bm{x}};{\bm{\lambda}})-\hat{f}({\bm{x}};\hat{\bm{\lambda}}) =01𝝀f^(𝒙;𝝀t)(𝝀𝝀^)dt,\displaystyle=\int_{0}^{1}\nabla_{{\bm{\lambda}}}\hat{f}({\bm{x}};{\bm{\lambda}}_{t})^{\top}({\bm{\lambda}}-\hat{\bm{\lambda}}){\rm d}t, (44)

where 𝝀t:=t𝝀+(1t)𝝀^{\bm{\lambda}}_{t}:=t{\bm{\lambda}}+(1-t)\hat{\bm{\lambda}}. We will show that

𝝀f^(𝒙;𝝀t)(𝝀𝝀^)=𝔼𝒘,ϕ[ϕ¯(𝒙;𝒘)s(ϕn(𝒘),𝝀t)ϕn(𝒘),𝝀𝝀^],\nabla_{{\bm{\lambda}}}\hat{f}({\bm{x}};{\bm{\lambda}}_{t})^{\top}({\bm{\lambda}}-\hat{\bm{\lambda}})=\mathbb{E}_{{\bm{w}},\phi}\Big{[}\overline{\phi}({\bm{x}};{\bm{w}})s^{\prime}(\langle\bm{\phi}_{n}({\bm{w}}),{\bm{\lambda}}_{t}\rangle)\langle\bm{\phi}_{n}({\bm{w}}),{\bm{\lambda}}-\hat{\bm{\lambda}}\rangle\Big{]}\,, (45)

(i.e., we may exchange integration and differentiation), and we will bound the right-hand side. In what follows, we will use the shorthand ϕn=ϕn(𝒘)=(ϕ(𝒙1;𝒘),,ϕ(𝒙n;𝒘))𝖳\bm{\phi}_{n}=\bm{\phi}_{n}({\bm{w}})=(\phi({\bm{x}}_{1};{\bm{w}}),\dots,\phi({\bm{x}}_{n};{\bm{w}}))^{{\mathsf{T}}} for the evaluation of the featurization map at the nn datapoints.

For e1e\geq 1 and r1,r2,r3>1r_{1},r_{2},r_{3}>1 such that 1r1+1r2+1r3=1\frac{1}{r_{1}}+\frac{1}{r_{2}}+\frac{1}{r_{3}}=1, we have by Hölder’s inequality,

𝔼𝒘,ϕ[|ϕ¯(𝒙;𝒘)s(ϕn,𝝀t)ϕn,𝝀𝝀^|e]\displaystyle\mathbb{E}_{{\bm{w}},\phi}\big{[}\big{|}\overline{\phi}({\bm{x}};{\bm{w}})s^{\prime}(\langle\bm{\phi}_{n},{\bm{\lambda}}_{t}\rangle)\langle\bm{\phi}_{n},{\bm{\lambda}}-\hat{\bm{\lambda}}\rangle\big{|}^{e}\big{]} (46)
𝔼[|ϕ¯(𝒙;𝒘)|r1e]1/r1𝔼[|s(ϕn,𝝀t)|r2e]1/r2𝔼[|ϕn,𝝀𝝀^|r3e]1/r3.\displaystyle\qquad\qquad\leq\mathbb{E}[|\overline{\phi}({\bm{x}};{\bm{w}})|^{r_{1}e}]^{1/r_{1}}\,\mathbb{E}[|s^{\prime}(\langle\bm{\phi}_{n},{\bm{\lambda}}_{t}\rangle)|^{r_{2}e}]^{1/r_{2}}\,\mathbb{E}[|\langle\bm{\phi}_{n},{\bm{\lambda}}-\hat{\bm{\lambda}}\rangle|^{r_{3}e}]^{1/r_{3}}\,.

Denote Q=Q1Q2Q^{\prime}=Q_{1}\wedge Q_{2}. If Q<2Q^{\prime}<2, we set r1=r3=2(3Q)Q1r_{1}=r_{3}=\frac{2(3-Q^{\prime})}{Q^{\prime}-1}, r2=(3Q)2(2Q)r_{2}=\frac{(3-Q^{\prime})}{2(2-Q^{\prime})}, and u=5Q2(3Q)u=\frac{5-Q^{\prime}}{2(3-Q^{\prime})}. In this case, one can check that u>1u>1 and r2(Q2)u=5Q4>1r_{2}(Q^{\prime}-2)u=-\frac{5-Q^{\prime}}{4}>-1. Otherwise, we set r1=r2=r3=3r_{1}=r_{2}=r_{3}=3 and u=2u=2. Applying Eq. (46) with e=ue=u, we have by FEAT1,

𝔼[|ϕ¯(𝒙;𝒘)|r1u]1/r1\displaystyle\mathbb{E}[|\overline{\phi}({\bm{x}};{\bm{w}})|^{r_{1}u}]^{1/r_{1}} Cτu,\displaystyle\leq C\tau^{u}\,,
𝔼[|ϕn,𝝀𝝀^|r3u]1/r3\displaystyle\mathbb{E}[|\langle\bm{\phi}_{n},{\bm{\lambda}}-\hat{\bm{\lambda}}\rangle|^{r_{3}u}]^{1/r_{3}} Cτu𝝀𝝀^𝑲nu,\displaystyle\leq C\tau^{u}\|{\bm{\lambda}}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}^{u}\,,

where CC depends only on QQ^{\prime}. Furthermore, by either (i) PEN and FEAT1 in the case Q2Q^{\prime}\geq 2 or (ii) PEN, FEAT 1, and FEAT3’ in the case Q<2Q^{\prime}<2, we have

𝔼[|s(ϕn,𝝀t)|r2u]1/r2\displaystyle\mathbb{E}[|s^{\prime}(\langle\bm{\phi}_{n},{\bm{\lambda}}_{t}\rangle)|^{r_{2}u}]^{1/r_{2}} Cs(𝝀^𝑲n)u𝝀^𝑲nu𝔼[|ϕn,𝝀t/𝝀^𝑲n|r2u(Q12)|ϕn,𝝀t/𝝀^𝑲n|r2u(Q22)]1/r2\displaystyle\leq C\frac{s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})^{u}}{\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}^{u}}\mathbb{E}\Big{[}\big{|}\langle\bm{\phi}_{n},{\bm{\lambda}}_{t}\rangle/\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\big{|}^{r_{2}u(Q_{1}-2)}\vee\big{|}\langle\bm{\phi}_{n},{\bm{\lambda}}_{t}\rangle/\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\big{|}^{r_{2}u(Q_{2}-2)}\Big{]}^{1/r_{2}}
C[τu(Q2)ηu(Q2)]s(𝝀^𝑲n)u𝝀^𝑲nu,\displaystyle\leq C[\tau^{u(Q-2)}\vee\eta^{u(Q^{\prime}-2)}]\frac{s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})^{u}}{\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}^{u}}\,,

where we denoted Q=Q1Q2Q=Q_{1}\vee Q_{2}, the constant changes in the last line, and we used that 𝝀^𝑲n/2𝝀t𝑲n2𝝀^𝑲n\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}/2\leq\|{\bm{\lambda}}_{t}\|_{{\bm{K}}_{n}}\leq 2\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}} and that either (i) FEAT1 and r2u(Q2)0r_{2}u(Q^{\prime}-2)\geq 0 in the case that Q2Q^{\prime}\geq 2 or (ii) FEAT1, FEAT3’, and r2u(Q2)>1r_{2}u(Q^{\prime}-2)>-1 in the case that Q<2Q^{\prime}<2. We see that the expectation 𝔼𝒘,ϕ[|ϕ¯(𝒙;𝒘)s(ϕn,𝝀t)ϕn,𝝀𝝀^|u]\mathbb{E}_{{\bm{w}},\phi}\big{[}\big{|}\overline{\phi}({\bm{x}};{\bm{w}})s^{\prime}(\langle\bm{\phi}_{n},{\bm{\lambda}}_{t}\rangle)\langle\bm{\phi}_{n},{\bm{\lambda}}-\hat{\bm{\lambda}}\rangle\big{|}^{u}\big{]} is bounded for some u>1u>1 and all t[0,1]t\in[0,1], whence ϕ¯(𝒙;𝒘)s(ϕn(𝒘),𝝀t)ϕn(𝒘)\overline{\phi}({\bm{x}};{\bm{w}})s^{\prime}(\langle\bm{\phi}_{n}({\bm{w}}),{\bm{\lambda}}_{t}\rangle)\bm{\phi}_{n}({\bm{w}}) is uniformly integrable. We conclude that we may exchange differentiation and expectation, justifying Eq. (45).

We may now replace uu by 1 in Eq. (46) to get

|𝝀f^(𝒙;𝝀t)(𝝀𝝀^)|\displaystyle|\nabla_{{\bm{\lambda}}}\hat{f}({\bm{x}};{\bm{\lambda}}_{t})^{\top}({\bm{\lambda}}-\hat{\bm{\lambda}})| 𝔼[|ϕ¯(𝒙;𝒘)|r1]1/r1𝔼[|s(ϕn,𝝀t)|r2]1/r2𝔼[|ϕn,𝝀𝝀^|r3]1/r3\displaystyle\leq\mathbb{E}[|\overline{\phi}({\bm{x}};{\bm{w}})|^{r_{1}}]^{1/r_{1}}\,\mathbb{E}[|s^{\prime}(\langle\bm{\phi}_{n},{\bm{\lambda}}_{t}\rangle)|^{r_{2}}]^{1/r_{2}}\,\mathbb{E}[|\langle\bm{\phi}_{n},{\bm{\lambda}}-\hat{\bm{\lambda}}\rangle|^{r_{3}}]^{1/r_{3}} (47)
C(τQτ2ηQ2)s(𝝀^𝑲n)𝝀^𝑲n𝝀𝝀^𝑲n,\displaystyle\leq C(\tau^{Q}\vee\tau^{2}\eta^{Q^{\prime}-2})\frac{s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})}{\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}}\|{\bm{\lambda}}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\,,

which is independent of 𝒙{\bm{x}} and tt. Combining this bound with Eq. (44), we conclude that

𝔼𝒙[(f^(𝒙;𝝀)f^(𝒙;𝝀^))2]C(τ2Qτ4η2(Q2))s(𝝀^𝑲n)2𝝀^𝑲n2𝝀𝝀^𝑲n2.\mathbb{E}_{{\bm{x}}}[(\hat{f}({\bm{x}};{\bm{\lambda}})-\hat{f}({\bm{x}};\hat{\bm{\lambda}}))^{2}]\leq C(\tau^{2Q}\vee\tau^{4}\eta^{2(Q^{\prime}-2)})\frac{s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})^{2}}{\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}^{2}}\|{\bm{\lambda}}-\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}^{2}\,. (48)

Thus, we have established Eq. (38).

A.2 Uniform concentration of the predictor: Proof of Lemma 3

Throughout the proof, we will denote by c,c,C,Cc,c^{\prime},C,C^{\prime} constants that depend on the constants in assumptions FEAT1, FEAT2 and PEN, but not on d,n,Nd,n,N and τ\tau. The value of these constants is allowed to change from line to line.

Step 1. Decoupling.

Let G:0G:{\mathbb{R}}\rightarrow{\mathbb{R}}_{\geq 0} be convex with G(x)=0G(x)=0 for x0x\leq 0 and satisfy for some α1,α2>0\alpha_{1},\alpha_{2}>0,

G(x+y)α1(G(α2x)+G(α2y)).G(x+y)\leq\alpha_{1}(G(\alpha_{2}x)+G(\alpha_{2}y)). (49)

Below denote by 𝔼\mathbb{E} the expectation with respect to the 𝒘j{\bm{w}}_{j}’s and the randomization in ϕn(𝒘)\bm{\phi}_{n}({\bm{w}}) (for notational simplicity, we will omit below the subscripts from the expectations). Further, we will denote ϕn,j=ϕn(𝒘j)=(ϕ(𝒙1;𝒘j),,ϕ(𝒙n;𝒘j))𝖳\bm{\phi}_{n,j}=\bm{\phi}_{n}({\bm{w}}_{j})=(\phi({\bm{x}}_{1};{\bm{w}}_{j}),\dots,\phi({\bm{x}}_{n};{\bm{w}}_{j}))^{{\mathsf{T}}} for the evaluation of the feature map at the nn data points, and 𝝍n,j=𝝍n(𝒘j)=𝑲n1/2ϕn(𝒘j){\bm{\psi}}_{n,j}={\bm{\psi}}_{n}({\bm{w}}_{j})={\bm{K}}^{-1/2}_{n}\bm{\phi}_{n}({\bm{w}}_{j}) for the corresponding isotropic random vector. Then

𝔼G(S¯1)\displaystyle\mathbb{E}G(\bar{S}_{1}) =(1)𝔼sup𝒙2r0d𝒗22𝝀^𝑲nG(|1Nj=1Nϕ¯(𝒙;𝒘j)s(𝝍n,j,𝒗)𝔼[ϕ¯(𝒙;𝒘)s(𝝍n(𝒘),𝒗)]|)\displaystyle\stackrel{{\scriptstyle(1)}}{{=}}\mathbb{E}\sup_{\begin{subarray}{c}\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}\\ \|{\bm{v}}\|_{2}\leq 2\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}\end{subarray}}G\Big{(}\Big{|}\frac{1}{N}\sum_{j=1}^{N}\bar{\phi}({\bm{x}};{\bm{w}}_{j})s(\langle{\bm{\psi}}_{n,j},{\bm{v}}\rangle)-\mathbb{E}[\bar{\phi}({\bm{x}};{\bm{w}})s(\langle{\bm{\psi}}_{n}({\bm{w}}),{\bm{v}}\rangle)]\Big{|}\Big{)}
(2)𝔼sup𝒙2r0d𝒗22𝝀^𝑲nG(|1Nj=1N[ϕ¯(𝒙;𝒘j)s(𝝍n,j,𝒗)ϕ¯(𝒙;𝒘j)s(𝝍n,j,𝒗)]]|)\displaystyle\stackrel{{\scriptstyle(2)}}{{\leq}}\mathbb{E}\sup_{\begin{subarray}{c}\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}\\ \|{\bm{v}}\|_{2}\leq 2\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}\end{subarray}}G\Big{(}\Big{|}\frac{1}{N}\sum_{j=1}^{N}\Big{[}\bar{\phi}({\bm{x}};{\bm{w}}_{j})s(\langle{\bm{\psi}}_{n,j},{\bm{v}}\rangle)-\bar{\phi}({\bm{x}};{\bm{w}}_{j}^{\prime})s(\langle{\bm{\psi}}_{n,j}^{\prime},{\bm{v}}\rangle)]\Big{]}\Big{|}\Big{)}
=(3)2𝔼sup𝒙2r0d𝒗22𝝀^𝑲nG(1Nj=1Nσj[ϕ¯(𝒙;𝒘j)s(𝝍n,j,𝒗)ϕ¯(𝒙;𝒘j)s(𝝍n,j,𝒗)]])\displaystyle\stackrel{{\scriptstyle(3)}}{{=}}2\mathbb{E}\sup_{\begin{subarray}{c}\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}\\ \|{\bm{v}}\|_{2}\leq 2\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}\end{subarray}}G\Big{(}\frac{1}{N}\sum_{j=1}^{N}\sigma_{j}\Big{[}\bar{\phi}({\bm{x}};{\bm{w}}_{j})s(\langle{\bm{\psi}}_{n,j},{\bm{v}}\rangle)-\bar{\phi}({\bm{x}};{\bm{w}}_{j}^{\prime})s(\langle{\bm{\psi}}_{n,j}^{\prime},{\bm{v}}\rangle)]\Big{]}\Big{)}
(4)4α1𝔼G(α2sup𝒙2r0d𝒗22𝝀^𝑲n1Nj=1Nσjϕ¯(𝒙;𝒘j)s(𝝍n,j,𝒗)=:T(𝒙,𝒗)),\displaystyle\stackrel{{\scriptstyle(4)}}{{\leq}}4\alpha_{1}\mathbb{E}G\Big{(}\alpha_{2}\sup_{\begin{subarray}{c}\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}\\ \|{\bm{v}}\|_{2}\leq 2\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}\end{subarray}}\underbrace{\frac{1}{N}\sum_{j=1}^{N}\sigma_{j}\bar{\phi}({\bm{x}};{\bm{w}}_{j})s(\langle{\bm{\psi}}_{n,j},{\bm{v}}\rangle)}_{=:T({\bm{x}},{\bm{v}})}\Big{)},

where (1) uses that GG is nondecreasing; (2) introduces an independent copy ({\bm{w}}_{j}^{\prime},{\bm{\psi}}_{n,j}^{\prime})_{j\leq N} of ({\bm{w}}_{j},{\bm{\psi}}_{n,j})_{j\leq N} and uses that G(||)G(|\,\cdot\,|) is convex together with Jensen’s inequality; in (3), we denoted by σj,j=1,,N\sigma_{j},j=1,\ldots,N, independent Rademacher random variables and used that the distribution is symmetric and G(x)=0G(x)=0 for x0x\leq 0; (4) uses Eq. (49) and that T(𝒙,𝒗)T({\bm{x}},{\bm{v}}) is a symmetric random variable and GG is a nondecreasing function.

Step 2. Concentration of T(x,v)T({\bm{x}},{\bm{v}}).

We bound the tail of TT for fixed 𝒙,𝒗{\bm{x}},{\bm{v}} with 𝒙2r0d\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}, 𝒗2R\|{\bm{v}}\|_{2}\leq R. Denote by XjX_{j} the terms in the sum in T(𝒙,𝒗)T({\bm{x}},{\bm{v}}), i.e., Xj=σjϕ¯(𝒙;𝒘j)s(𝝍n,j,𝒗)X_{j}=\sigma_{j}\overline{\phi}({\bm{x}};{\bm{w}}_{j})s(\langle{\bm{\psi}}_{n,j},{\bm{v}}\rangle). By Eq. (19), there exists CC such that |s(x1)|C|s(x2)|(1+|x1/x2|)Q1|s(x_{1})|\leq C|s(x_{2})|(1+|x_{1}/x_{2}|)^{Q-1}, where Q=Q1Q2Q=Q_{1}\vee Q_{2} (for simplicity we have loosened the bound for when |x1/x2|<1|x_{1}/x_{2}|<1). By Hölder inequality, we have for any k1k\geq 1 (potentially non-integer)

𝔼[|Xi|k]\displaystyle\mathbb{E}[|X_{i}|^{k}] C𝔼[|ϕ¯(𝒙;𝒘j)|k|s(τR)|k(1+|𝒗,𝝍n,j|Q1/(τR)Q1)k]\displaystyle\leq C\mathbb{E}[|\bar{\phi}({\bm{x}};{\bm{w}}_{j})|^{k}\cdot|s(\tau R)|^{k}\cdot(1+|\langle{\bm{v}},{\bm{\psi}}_{n,j}\rangle|^{{Q}-1}/(\tau R)^{{Q}-1})^{k}] (50)
Cτk|s(τR)|k𝔼[|ϕ¯(𝒙;𝒘j)/τ|Qk]1/Q𝔼[(1+|𝒗,𝝍n,j|Q1/(τR)Q1)Qk/(Q1)](Q1)/Q\displaystyle\leq C\tau^{k}|s(\tau R)|^{k}\mathbb{E}\Big{[}|\bar{\phi}({\bm{x}};{\bm{w}}_{j})/\tau|^{Qk}\Big{]}^{1/Q}\mathbb{E}\Big{[}(1+|\langle{\bm{v}},{\bm{\psi}}_{n,j}\rangle|^{{Q}-1}/(\tau R)^{{Q}-1})^{Qk/(Q-1)}\Big{]}^{(Q-1)/Q}
τks(τR)k(Ck)Qk/2,\displaystyle\leq\tau^{k}s(\tau R)^{k}(C^{\prime}k)^{{Q}k/2},

where CC^{\prime} depends only on Q{Q}, and we used FEAT1, i.e., ϕ¯(𝒙;𝒘j)\bar{\phi}({\bm{x}};{\bm{w}}_{j}) and 𝝍n,j{\bm{\psi}}_{n,j} are τ2\tau^{2}-sub-Gaussian.

For Q2Q\leq 2, we have directly by Eq. (50) and [BLM13, Theorem 2.10],

(T(𝒙,𝒗)t)exp{cN(t2τ2s(τR)2tτs(τR))}.\mathbb{P}\Big{(}T({\bm{x}},{\bm{v}})\geq t\Big{)}\leq\exp\Big{\{}-c^{\prime}N\Big{(}\frac{t^{2}}{\tau^{2}s(\tau R)^{2}}\wedge\frac{t}{\tau s(\tau R)}\Big{)}\Big{\}}\,. (51)

For Q>2Q>2, define for M>0M>0 the truncated random variable XiM=sign(Xi)(|Xi|τs(τR)MQ)X_{i}^{M}=\text{sign}(X_{i})(|X_{i}|\wedge\tau s(\tau R)M^{Q}). Then setting =2k+2Q4\ell=2k+2{Q}-4, we have

𝔼[|XiM/N|k]\displaystyle\mathbb{E}[|X_{i}^{M}/N|^{k}] Nk𝔼[|Xi|/Q(τs(τR)MQ)k/Q]Nkτks(τR)k(C/Q)/2M(Q2)(k2)\displaystyle\leq N^{-k}\mathbb{E}[|X_{i}|^{\ell/{Q}}(\tau s(\tau R)M^{Q})^{k-\ell/{Q}}]\leq N^{-k}\tau^{k}s(\tau R)^{k}(C^{\prime}\ell/{Q})^{\ell/2}M^{({Q}-2)(k-2)} (52)
(Ck)kτ2s(τR)2N2(τs(τR)MQ2N)k2.\displaystyle\leq(C^{\prime}k)^{k}\frac{\tau^{2}s(\tau R)^{2}}{N^{2}}\Big{(}\frac{\tau s(\tau R)M^{{Q}-2}}{N}\Big{)}^{k-2}\,.

We deduce again by [BLM13, Theorem 2.10] with Eq. (52), that

(1Ni=1NXiMt)exp{cN(t2τ2s(τR)2tτs(τR)MQ2)}.\mathbb{P}\Big{(}\frac{1}{N}\sum_{i=1}^{N}X_{i}^{M}\geq t\Big{)}\leq\exp\Big{\{}-c^{\prime}N\Big{(}\frac{t^{2}}{\tau^{2}s(\tau R)^{2}}\wedge\frac{t}{\tau s(\tau R)M^{{Q}-2}}\Big{)}\Big{\}}\,.

Furthermore, using the bound (50), we have

(|X1|τs(τR)MQ)infk1MkQ𝔼[(|X1|/τs(τR))k]exp{cM2}.\mathbb{P}\Big{(}|X_{1}|\geq\tau s(\tau R)M^{Q}\Big{)}\leq\inf_{k\geq 1}M^{-kQ}\mathbb{E}\Big{[}(|X_{1}|/\tau s(\tau R))^{k}\Big{]}\leq\exp\Big{\{}-c^{\prime}M^{2}\Big{\}}\,.

Combining the above two displays and taking M=(Nt)1/Q/(τs(τR))1/QM=(Nt)^{1/{Q}}/(\tau s(\tau R))^{1/{Q}}, we conclude that for Q>2Q>2 and fixed 𝒙2r0d\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}, 𝒗2R\|{\bm{v}}\|_{2}\leq R, we have

(T(𝒙,𝒗)t)\displaystyle\mathbb{P}\Big{(}T({\bm{x}},{\bm{v}})\geq t\Big{)}\leq (1Ni=1NXiMt)+N(|X1|τs(τR)MQ)\displaystyle~{}\mathbb{P}\Big{(}\frac{1}{N}\sum_{i=1}^{N}X_{i}^{M}\geq t\Big{)}+N\mathbb{P}\Big{(}|X_{1}|\geq\tau s(\tau R)M^{Q}\Big{)}\, (53)
\displaystyle\leq Nexp{c(Nt2τ2s(τR)2(Nt)2/Qτ2/Qs(τR)2/Q)}.\displaystyle~{}N\exp\Big{\{}-c^{\prime}\Big{(}\frac{Nt^{2}}{\tau^{2}s(\tau R)^{2}}\wedge\frac{(Nt)^{2/{Q}}}{\tau^{2/{Q}}s(\tau R)^{2/{Q}}}\Big{)}\Big{\}}\,.

Step 3. Uniform concentration of T(x,v)T({\bm{x}},{\bm{v}}) on x2r0d\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}, v2R\|{\bm{v}}\|_{2}\leq R.

We now evaluate the concentration of TT uniformly over 𝒙2r0d\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}, 𝒗2R\|{\bm{v}}\|_{2}\leq R. By the tail bound on L(𝒘)L({\bm{w}}) in FEAT2 and that 𝝍n,j{\bm{\psi}}_{n,j} is a τ2\tau^{2}-sub-Gaussian random vector by FEAT1, there exist constants C,c>0C,c>0 independent of τ\tau, such that for all Δ1\Delta\geq 1 [HKZ12],

(maxj[N]𝝍n,j2L(𝒘j)CτnΔ)1CNexp(cnΔ2).\mathbb{P}\Big{(}\max_{j\in[N]}\|{\bm{\psi}}_{n,j}\|_{2}\vee L({\bm{w}}_{j})\leq C\tau\sqrt{n}\Delta\Big{)}\geq 1-CN\exp(-cn\Delta^{2}). (54)

Let 0<ζ21ζ10<\zeta_{2}\leq 1\leq\zeta_{1} and q=q1q2q^{\prime}=q_{1}\wedge q_{2}. Then, by assumption PEN, for any |x1|,|x2|τRΔζ1|x_{1}|,|x_{2}|\leq\tau R\Delta\zeta_{1}, we have

|s(x1)s(x2)|\displaystyle|s(x_{1})-s(x_{2})| (|s(ζ2τRΔ)|+|s(ζ2τRΔ)|)+(sup|x|[ζ2τRΔ,ζ1τRΔ]s(x))|x1x2|\displaystyle\leq(|s(\zeta_{2}\tau R\Delta)|+|s(-\zeta_{2}\tau R\Delta)|)+\Big{(}\sup_{|x|\in[\zeta_{2}\tau R\Delta,\zeta_{1}\tau R\Delta]}s^{\prime}(x)\Big{)}|x_{1}-x_{2}|
Cs(τRΔ)ζ2q1+Cs(τRΔ)τRΔ(ζ1(Q2)0ζ2(q2)0)|x1x2|\displaystyle\leq Cs(\tau R\Delta)\zeta_{2}^{q^{\prime}-1}+C\frac{s(\tau R\Delta)}{\tau R\Delta}\Big{(}\zeta_{1}^{(Q-2)\vee 0}\vee\zeta_{2}^{(q^{\prime}-2)\wedge 0}\Big{)}|x_{1}-x_{2}|
Cs(τRΔ)(ζ2q1+(ζ1(Q2)0ζ2(q2)0)|x1x2|/(τRΔ)).\displaystyle\leq Cs(\tau R\Delta)\Big{(}\zeta_{2}^{q^{\prime}-1}+(\zeta_{1}^{(Q-2)\vee 0}\vee\zeta_{2}^{(q^{\prime}-2)\wedge 0})|x_{1}-x_{2}|/(\tau R\Delta)\Big{)}.

Let ζ1=Cn\zeta_{1}=C\sqrt{n}. On the event (54) and for 𝒗2R\|{\bm{v}}\|_{2}\leq R, we have |𝝍n,j,𝒗|CτnRΔ=τRΔζ1|\langle{\bm{\psi}}_{n,j},{\bm{v}}\rangle|\leq C\tau\sqrt{n}R\Delta=\tau R\Delta\zeta_{1}. Furthermore, by FEAT2, we have |ϕ¯(𝒙;𝒘)|L(𝒘)(1+𝒙2)|\overline{\phi}({\bm{x}};{\bm{w}})|\leq L({\bm{w}})(1+\|{\bm{x}}\|_{2}). For convenience, let us denote 𝒙~=𝒙/(r0d)\tilde{\bm{x}}={\bm{x}}/(r_{0}\sqrt{d}) so that 𝒙~21\|\tilde{\bm{x}}\|_{2}\leq 1. On the event (54), whenever 𝒙~12,𝒙~22,𝒗~12,𝒗~221\|\tilde{\bm{x}}_{1}\|_{2},\|\tilde{\bm{x}}_{2}\|_{2},\|\tilde{\bm{v}}_{1}\|_{2},\|\tilde{\bm{v}}_{2}\|_{2}\leq 1, we have (the constants CC below may depend on r0r_{0} but not on τ\tau, η\eta)

|T(𝒙1,R𝒗~1)T(𝒙2,R𝒗~2)|\displaystyle\,\,\,\,\,\,\,\,|T({\bm{x}}_{1},R\tilde{\bm{v}}_{1})-T({\bm{x}}_{2},R\tilde{\bm{v}}_{2})|
1Nj=1NL(𝒘j)|s(𝝍n,j,R𝒗~1)|𝒙1𝒙22+CNj=1NL(𝒘j)(1+𝒙22)|s(𝝍n,j,R𝒗~1)s(𝝍n,j,R𝒗~2)|\displaystyle\leq\frac{1}{N}\sum_{j=1}^{N}L({\bm{w}}_{j})|s(\langle{\bm{\psi}}_{n,j},R\tilde{\bm{v}}_{1}\rangle)|\|{\bm{x}}_{1}-{\bm{x}}_{2}\|_{2}+\frac{C}{N}\sum_{j=1}^{N}L({\bm{w}}_{j})(1+\|{\bm{x}}_{2}\|_{2})|s(\langle{\bm{\psi}}_{n,j},R\tilde{\bm{v}}_{1}\rangle)-s(\langle{\bm{\psi}}_{n,j},R\tilde{\bm{v}}_{2}\rangle)|
CNj=1NτndΔs(τRΔ)(1+(Cn)Q1)𝒙~1𝒙~22\displaystyle\leq\frac{C}{N}\sum_{j=1}^{N}\tau\sqrt{nd}\Delta s(\tau R\Delta)(1+(C\sqrt{n})^{Q-1})\|\tilde{\bm{x}}_{1}-\tilde{\bm{x}}_{2}\|_{2}
+CNj=1NτndΔs(τRΔ)(ζ2q1+(ζ1(Q2)0ζ2(q2)0)n𝒗~1𝒗~22)\displaystyle\qquad\qquad+\frac{C}{N}\sum_{j=1}^{N}\tau\sqrt{nd}\Delta s(\tau R\Delta)\big{(}\zeta_{2}^{q^{\prime}-1}+(\zeta_{1}^{(Q-2)\vee 0}\vee\zeta_{2}^{(q^{\prime}-2)\wedge 0})\sqrt{n}\|\tilde{\bm{v}}_{1}-\tilde{\bm{v}}_{2}\|_{2}\big{)}
CτΔs(τRΔ)(nQ/2d𝒙~1𝒙~22+nd(ζ1(Q2)0ζ2(q2)0)𝒗~1𝒗~22+ndζ2q1).\displaystyle\leq C\tau\Delta s(\tau R\Delta)\Big{(}n^{Q/2}\sqrt{d}\|\tilde{\bm{x}}_{1}-\tilde{\bm{x}}_{2}\|_{2}+n\sqrt{d}(\zeta_{1}^{(Q-2)\vee 0}\vee\zeta_{2}^{(q^{\prime}-2)\wedge 0})\|\tilde{\bm{v}}_{1}-\tilde{\bm{v}}_{2}\|_{2}+\sqrt{nd}\zeta_{2}^{q^{\prime}-1}\Big{)}.

Fix ε>0\varepsilon>0. Let ζ2=(ε/(3Cnd))1/(q1)\zeta_{2}=(\varepsilon/(3C\sqrt{nd}))^{1/(q^{\prime}-1)}, and define

ε=ε3CnQ/2d+3Cnd(ζ1(Q2)0ζ2(q2)0),\varepsilon^{\prime}=\frac{\varepsilon}{3Cn^{Q/2}\sqrt{d}+3Cn\sqrt{d}(\zeta_{1}^{(Q-2)\vee 0}\vee\zeta_{2}^{(q^{\prime}-2)\wedge 0})}\,,

so that |T(𝒙1,R𝒗~1)T(𝒙2,R𝒗~2)|τΔs(τRΔ)ε|T({\bm{x}}_{1},R\tilde{\bm{v}}_{1})-T({\bm{x}}_{2},R\tilde{\bm{v}}_{2})|\leq\tau\Delta s(\tau R\Delta)\varepsilon as soon as 𝒙~1𝒙~22ε\|\tilde{\bm{x}}_{1}-\tilde{\bm{x}}_{2}\|_{2}\leq\varepsilon^{\prime} and 𝒗~1𝒗~22ε\|\tilde{\bm{v}}_{1}-\tilde{\bm{v}}_{2}\|_{2}\leq\varepsilon^{\prime}. Let 𝒩1\mathcal{N}_{1} and 𝒩2\mathcal{N}_{2} be ε\varepsilon^{\prime}-net of 𝖡2d(1){\sf B}_{2}^{d}(1) and 𝖡2n(1){\sf B}_{2}^{n}(1) respectively, where we define 𝖡2k(r):={𝒖k:𝒖2r}{\sf B}_{2}^{k}(r):=\{{\bm{u}}\in\mathbb{R}^{k}:\|{\bm{u}}\|_{2}\leq r\}. Note that log(|𝒩1||𝒩2|)(d+n)log(3ε)\log(|\mathcal{N}_{1}||\mathcal{N}_{2}|)\leq(d+n)\log\Big{(}\frac{3}{\varepsilon^{\prime}}\Big{)}. Then taking ε=ts(τR)/(2Δs(τRΔ))CtΔQ\varepsilon=ts(\tau R)/(2\Delta s(\tau R\Delta))\geq C^{\prime}t\Delta^{-Q} and recalling Eqs. (51) and (53), we have

(sup𝒙~21𝒗2RT(𝒙~,𝒗)tτs(τR))\displaystyle\mathbb{P}\Big{(}\sup_{\begin{subarray}{c}\|\tilde{\bm{x}}\|_{2}\leq 1\\ \|{\bm{v}}\|_{2}\leq R\end{subarray}}T(\tilde{\bm{x}},{\bm{v}})\geq t\tau s(\tau R)\Big{)}
(sup𝒙~𝒩1,𝒗~𝒩2T(𝒙~,R𝒗~)tτs(τR)/2)+(maxj[N]𝝍n,j2L(𝒘j)CτnΔ)\displaystyle\qquad\qquad\leq\mathbb{P}\Big{(}\sup_{\tilde{\bm{x}}\in\mathcal{N}_{1},\tilde{\bm{v}}\in\mathcal{N}_{2}}T(\tilde{\bm{x}},R\tilde{\bm{v}})\geq t\tau s(\tau R)/2\Big{)}+\mathbb{P}\Big{(}\max_{j\in[N]}\|{\bm{\psi}}_{n,j}\|_{2}\vee L({\bm{w}}_{j})\geq C\tau\sqrt{n}\Delta\Big{)}
exp{logN+2nlog(3ε)c(Nt2(Nt)(2/Q)1)}+CNecnΔ2\displaystyle\qquad\qquad\leq\exp\Big{\{}\log N+2n\log\Big{(}\frac{3}{\varepsilon^{\prime}}\Big{)}-c^{\prime}\Big{(}Nt^{2}\wedge(Nt)^{(2/Q)\wedge 1}\Big{)}\Big{\}}+CNe^{-cn\Delta^{2}}
exp{logN+Cnlog(n)+Cnlog(1/ε)c(Nt2(Nt)(2/Q)1)}+NecnΔ2\displaystyle\qquad\qquad\leq\exp\Big{\{}\log N+Cn\log(n)+Cn\log(1/\varepsilon)-c^{\prime}\Big{(}Nt^{2}\wedge(Nt)^{(2/Q)\wedge 1}\Big{)}\Big{\}}+Ne^{-cn\Delta^{2}}
exp{logN+Cnlog(n)+Cnlog(ΔQ/t)c(Nt2(Nt)(2/Q)1)}+NecnΔ2,\displaystyle\qquad\qquad\leq\exp\Big{\{}\log N+Cn\log(n)+Cn\log(\Delta^{Q}/t)-c^{\prime}\Big{(}Nt^{2}\wedge(Nt)^{(2/Q)\wedge 1}\Big{)}\Big{\}}+Ne^{-cn\Delta^{2}},

where in the second to last inequality we have used the definition of ε\varepsilon^{\prime} and adjusted constants appropriately, and in the last inequality we have used the definition of ε\varepsilon. Consider t1/Nt\geq 1/N and set Δ=(tN)1/Q1\Delta=(tN)^{1/Q}\geq 1. Then

(sup𝒙~21𝒗2RT(𝒙~,𝒗)tτs(τR))\displaystyle~{}\mathbb{P}\Big{(}\sup_{\begin{subarray}{c}\|\tilde{\bm{x}}\|_{2}\leq 1\\ \|{\bm{v}}\|_{2}\leq R\end{subarray}}T(\tilde{\bm{x}},{\bm{v}})\geq t\tau s(\tau R)\Big{)} (55)
\displaystyle\leq exp{Cnlog(N)c(Nt2(Nt)(2/Q)1)}+Nexp{cnN2/Qt2/Q}.\displaystyle~{}\exp\Big{\{}Cn\log(N)-c^{\prime}\Big{(}Nt^{2}\wedge(Nt)^{(2/{Q})\wedge 1}\Big{)}\Big{\}}+N\exp\Big{\{}-cnN^{2/Q}t^{2/Q}\Big{\}}.

Step 4. Concluding the proof.

For a constant C~>0\tilde{C}>0 that will be set sufficiently large, define

ε1=C~τs(τR)(nlogNN(nlogN)(Q/2)1N).\displaystyle\varepsilon_{1}=\tilde{C}\tau s(\tau R)\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{({Q}/2)\vee 1}}{N}\Big{)}\,.

Consider G(x)=(xε1/2)+G(x)=(x-\varepsilon_{1}/2)_{+} which is convex and verifies Eq. (49) with α1=1\alpha_{1}=1 and α2=2\alpha_{2}=2. Then we have

(S¯1ε1)2ε1𝔼[G(S¯1)]\displaystyle\mathbb{P}(\bar{S}_{1}\geq\varepsilon_{1})\leq\frac{2}{\varepsilon_{1}}\mathbb{E}[G(\bar{S}_{1})]\leq 8ε1𝔼G(2sup𝒙2r0d𝒗2RT(𝒙,𝒗))\displaystyle~{}\frac{8}{\varepsilon_{1}}\mathbb{E}G\Big{(}2\sup_{\begin{subarray}{c}\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}\\ \|{\bm{v}}\|_{2}\leq R\end{subarray}}T({\bm{x}},{\bm{v}})\Big{)}
\displaystyle\leq 16τs(τR)ε10(sup𝒙2r0d𝒗2RT(𝒙,𝒗)tτs(τR)+ε1/4)dt.\displaystyle~{}\frac{16\tau s(\tau R)}{\varepsilon_{1}}\int_{0}^{\infty}\mathbb{P}\Big{(}\sup_{\begin{subarray}{c}\|{\bm{x}}\|_{2}\leq r_{0}\sqrt{d}\\ \|{\bm{v}}\|_{2}\leq R\end{subarray}}T({\bm{x}},{\bm{v}})\geq t\tau s(\tau R)+\varepsilon_{1}/4\Big{)}{\rm d}t\,.

Using Eq. (55) and the inequalities N(ε1/τs(τR))2C~2nlogNN(\varepsilon_{1}/\tau s(\tau R))^{2}\geq\tilde{C}^{2}n\log N and (Nε1/τs(τR))(2/Q)1C~(2/Q)1nlogN(N\varepsilon_{1}/\tau s(\tau R))^{(2/Q)\wedge 1}\geq\tilde{C}^{(2/Q)\wedge 1}n\log N, we get

(S¯1ε1)Cexp{Cnlog(N)cC~(2/Q)1nlog(N)}+CNexp{cC~2/Qn(nlog(N))1(2/Q)}.\mathbb{P}(\bar{S}_{1}\geq\varepsilon_{1})\leq C\exp\Big{\{}Cn\log(N)-c\tilde{C}^{(2/Q)\wedge 1}n\log(N)\Big{\}}+CN\exp\Big{\{}-c\tilde{C}^{2/Q}n(n\log(N))^{1\vee(2/Q)}\Big{\}}.

Taking C~\tilde{C} sufficiently large, R=2𝝀^𝑲nR=2\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}} and using that s(2τ𝝀^𝑲n)Cs(𝝀^𝑲n)τQ1s(2\tau\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\leq Cs(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\tau^{Q-1} (where we recall that we assumed τ1)\tau\geq 1), we deduce that there exist constants c,C>0c^{\prime},C^{\prime}>0 that depend only on the constants of FEAT1, FEAT2 and PEN (except τ\tau) such that

S¯1CτQs(𝝀^𝑲n)(nlogNN(nlogN)(Q/2)1N),\bar{S}_{1}\leq C^{\prime}\tau^{Q}s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{({Q}/2)\vee 1}}{N}\Big{)}\,,

with probability at least 1CNcn1-C^{\prime}N^{-c^{\prime}n}. The proof is complete.

A.3 Point-wise concentration of dual gradient: Proof of Lemma 4

The proof follows from the same argument as in the proof of Lemma 3 and we will only highlight the differences.

Step 1. Decoupling.

First notice that we can rewrite

S¯2=sup𝒃21{1Nj=1N𝒃,𝝍n,js(ϕn,j,𝝀^)𝔼[𝒃,𝝍n(𝒘)s(ϕn,𝝀^)]}.\bar{S}_{2}=\sup_{\|{\bm{b}}\|_{2}\leq 1}\Big{\{}\frac{1}{N}\sum_{j=1}^{N}\langle{\bm{b}},{\bm{\psi}}_{n,j}\rangle s(\langle\bm{\phi}_{n,j},{\hat{\bm{\lambda}}}\rangle)-\mathbb{E}[\langle{\bm{b}},{\bm{\psi}}_{n}({\bm{w}})\rangle s(\langle\bm{\phi}_{n},{\hat{\bm{\lambda}}}\rangle)]\Big{\}}\,.

Denote

T(𝒃)=1Nj=1Nσj𝒃,𝝍n,js(ϕn,j,𝝀^).T({\bm{b}})=\frac{1}{N}\sum_{j=1}^{N}\sigma_{j}\langle{\bm{b}},{\bm{\psi}}_{n,j}\rangle s(\langle\bm{\phi}_{n,j},{\hat{\bm{\lambda}}}\rangle)\,.

For GG with the properties listed in step 1 of the proof of Lemma 3, we have

𝔼G(S¯2)4α1𝔼G(α2sup𝒃21T(𝒃)).\mathbb{E}G(\bar{S}_{2})\leq 4\alpha_{1}\mathbb{E}G\Big{(}\alpha_{2}\sup_{\|{\bm{b}}\|_{2}\leq 1}T({\bm{b}})\Big{)}.

Step 2. Concentration of T(b)T({\bm{b}}).

Denote XjX_{j} the terms in the sum in T(𝒃)T({\bm{b}}), i.e., Xj=σj𝒃,𝝍n,js(ϕn,j,𝝀^)X_{j}=\sigma_{j}\langle{\bm{b}},{\bm{\psi}}_{n,j}\rangle s(\langle\bm{\phi}_{n,j},{\hat{\bm{\lambda}}}\rangle). By FEAT1 and PEN, and denoting Q=Q1Q2Q=Q_{1}\vee Q_{2}, R=𝝀^𝑲nR=\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}, we have by Hölder inequality, for any k1k\geq 1 and 𝒃21\|{\bm{b}}\|_{2}\leq 1,

𝔼[|Xj|k]\displaystyle\mathbb{E}[|X_{j}|^{k}] Cτks(τR)k𝔼[|𝒃,𝝍n,j/τ|Qk]1/Q𝔼[(1+|𝝀,ϕn,j|Q1/(τR)Q1)Qk/(Q1)](Q1)/Q\displaystyle\leq C\tau^{k}s(\tau R)^{k}\mathbb{E}\Big{[}|\langle{\bm{b}},{\bm{\psi}}_{n,j}\rangle/\tau|^{Qk}\Big{]}^{1/Q}\mathbb{E}\Big{[}(1+|\langle{\bm{\lambda}},\bm{\phi}_{n,j}\rangle|^{Q-1}/(\tau R)^{{Q}-1})^{Qk/(Q-1)}\Big{]}^{(Q-1)/Q}
τks(τR)k(Ck)Qk/2.\displaystyle\leq\tau^{k}s(\tau R)^{k}(C^{\prime}k)^{Qk/2}\,.
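In particular, when Q\leq 2 the rescaled variables X_{j}/(\tau s(\tau R)) have sub-exponential (or lighter) tails, and a Bernstein-type inequality directly yields the tail bound (56) below; the truncation introduced next is only needed when Q>2.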

For Q>2Q>2 and M>0M>0, define XjM=sign(Xj)(|Xj|τs(τR)MQ)X_{j}^{M}=\text{sign}(X_{j})(|X_{j}|\wedge\tau s(\tau R)M^{Q}). Then setting =2k+2Q4\ell=2k+2{Q}-4, we have

𝔼[|XjM/N|k]\displaystyle\mathbb{E}[|X_{j}^{M}/N|^{k}] Nk𝔼[|Xj|/Q(τs(τR)MQ)k/Q](Ck)kτ2s(τR)2N2(τs(τR)MQ2N)k2.\displaystyle\leq N^{-k}\mathbb{E}[|X_{j}|^{\ell/{Q}}(\tau s(\tau R)M^{Q})^{k-\ell/{Q}}]\leq(C^{\prime}k)^{k}\frac{\tau^{2}s(\tau R)^{2}}{N^{2}}\Big{(}\frac{\tau s(\tau R)M^{{Q}-2}}{N}\Big{)}^{k-2}\,.

Following step 2 in the proof of Lemma 3, in particular taking M=(Nt)1/QM=(Nt)^{1/Q} in the case of Q>2Q>2, we get

(T(𝒃)tτs(τR))Nexp{c(Nt2(Nt)(2/Q)1)}.\mathbb{P}\Big{(}T({\bm{b}})\geq t\tau s(\tau R)\Big{)}\leq N\exp\Big{\{}-c^{\prime}\Big{(}Nt^{2}\wedge(Nt)^{(2/Q)\wedge 1}\Big{)}\Big{\}}\,. (56)

Step 3. Uniform concentration of T(b)T({\bm{b}}) on b21\|{\bm{b}}\|_{2}\leq 1.

For Δ1\Delta\geq 1, we have

(maxj[N]𝝍n,j2CτnΔ)1CNexp(cnΔ2),\mathbb{P}\Big{(}\max_{j\in[N]}\|{\bm{\psi}}_{n,j}\|_{2}\leq C\tau\sqrt{n}\Delta\Big{)}\geq 1-CN\exp(-cn\Delta^{2})\,, (57)

where we used that the 𝝍n,j{\bm{\psi}}_{n,j} are τ2\tau^{2}-sub-Gaussian by FEAT1. On this event, we have |ϕn,j,𝝀^|𝝍n,j2𝝀^𝑲nCRτnΔ|\langle\bm{\phi}_{n,j},{\hat{\bm{\lambda}}}\rangle|\leq\|{\bm{\psi}}_{n,j}\|_{2}\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}\leq CR\tau\sqrt{n}\Delta and by PEN, we get

|T(𝒃1)T(𝒃2)|\displaystyle|T({\bm{b}}_{1})-T({\bm{b}}_{2})|\leq 1Nj=1N𝝍n,j2|s(ϕn,j,𝝀)|𝒃1𝒃22CτΔs(τRΔ)nQ/2𝒃1𝒃22.\displaystyle~{}\frac{1}{N}\sum_{j=1}^{N}\|{\bm{\psi}}_{n,j}\|_{2}|s(\langle\bm{\phi}_{n,j},{\bm{\lambda}}\rangle)|\|{\bm{b}}_{1}-{\bm{b}}_{2}\|_{2}\leq C\tau\Delta s(\tau R\Delta)n^{Q/2}\|{\bm{b}}_{1}-{\bm{b}}_{2}\|_{2}\,.

Fix ε>0\varepsilon>0 and ε=ε/(CnQ/2)\varepsilon^{\prime}=\varepsilon/(Cn^{Q/2}), so that |T(𝒃1)T(𝒃2)|τΔs(τRΔ)ε|T({\bm{b}}_{1})-T({\bm{b}}_{2})|\leq\tau\Delta s(\tau R\Delta)\varepsilon as soon as 𝒃1𝒃22ε\|{\bm{b}}_{1}-{\bm{b}}_{2}\|_{2}\leq\varepsilon^{\prime}. Taking ε=ts(τR)/(2Δs(τRΔ))CtΔQ\varepsilon=ts(\tau R)/(2\Delta s(\tau R\Delta))\geq C^{\prime}t\Delta^{-Q} and Δ=(tN)1/Q\Delta=(tN)^{1/Q}, we have for t1/Nt\geq 1/N,

(sup𝒃21T(𝒃)tτs(τR))\displaystyle~{}\mathbb{P}\Big{(}\sup_{\|{\bm{b}}\|_{2}\leq 1}T({\bm{b}})\geq t\tau s(\tau R)\Big{)} (58)
\displaystyle\leq exp{Cnlog(N)c(Nt2(Nt)(2/Q)1)}+Nexp{cnN2/Qt2/Q}.\displaystyle~{}\exp\Big{\{}Cn\log(N)-c^{\prime}\Big{(}Nt^{2}\wedge(Nt)^{(2/{Q})\wedge 1}\Big{)}\Big{\}}+N\exp\Big{\{}-cnN^{2/Q}t^{2/Q}\Big{\}}.

Step 4. Concluding the proof.

Following the same argument as in Lemma 3, there exist constants c,C>0c^{\prime},C^{\prime}>0 that depend only on the constants of PEN such that

S¯2CτQs(𝝀𝑲n)(nlogNN(nlogN)Q/21N),\bar{S}_{2}\leq C^{\prime}\tau^{Q}s(\|{\bm{\lambda}}\|_{{\bm{K}}_{n}})\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{Q/2\vee 1}}{N}\Big{)},

with probability at least 1Cecn1-C^{\prime}e^{-c^{\prime}n}.

A.4 Uniform lower bound on the dual Hessian: Proof of Lemma 5

Define q=3q1q2q=3\vee q_{1}\vee q_{2}, where q1q_{1} and q2q_{2} are as in assumption PEN, and define s0s_{0} as

s0(t)=1ttq12tq22.s_{0}(t)=1\wedge t\wedge t^{q_{1}-2}\wedge t^{q_{2}-2}. (59)

The function s0(t)s_{0}(t) is (1|q12||q22|)(1\vee|q_{1}-2|\vee|q_{2}-2|)-Lipschitz. By assumption PEN, there exists c>0c>0 depending only on the constants in that assumption such that for all tt

s(𝝀^𝑲nt)cs(𝝀^𝑲n)𝝀^𝑲ns0(t).s^{\prime}\big{(}\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}t\big{)}\geq c\frac{s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})}{\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}}s_{0}(t). (60)

Thus, for 𝝀n{\bm{\lambda}}\in\mathbb{R}^{n} such that 1/2𝝀𝑲n/𝝀^𝑲n21/2\leq\|{\bm{\lambda}}\|_{{\bm{K}}_{n}}/\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}\leq 2 and denoting 𝒗=𝑲n1/2𝝀/𝝀^𝑲n{\bm{v}}={\bm{K}}_{n}^{1/2}{\bm{\lambda}}/\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}, we have

𝑲n1/22FN(𝝀)𝑲n1/2=\displaystyle-{\bm{K}}_{n}^{-1/2}\nabla^{2}F_{N}({\bm{\lambda}}){\bm{K}}_{n}^{-1/2}= 1Nj=1Ns(𝑲n1/2𝝀,𝝍n,j)𝝍n,j𝝍n,j𝖳cs(𝝀^𝑲n)𝝀^𝑲n𝑯N(𝒗),\displaystyle~{}\frac{1}{N}\sum_{j=1}^{N}s^{\prime}(\langle{\bm{K}}_{n}^{1/2}{\bm{\lambda}},{\bm{\psi}}_{n,j}\rangle){\bm{\psi}}_{n,j}{\bm{\psi}}_{n,j}^{\mathsf{T}}\succeq c\frac{s(\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}})}{\|\hat{\bm{\lambda}}\|_{{\bm{K}}_{n}}}{\bm{H}}_{N}({\bm{v}})\,,

where

𝑯N(𝒗)=1Nj=1Ns0(𝒗,𝝍n,j)𝝍n,j𝝍n,j𝖳.{\bm{H}}_{N}({\bm{v}})=\frac{1}{N}\sum_{j=1}^{N}s_{0}(\langle{\bm{v}},{\bm{\psi}}_{n,j}\rangle){\bm{\psi}}_{n,j}{\bm{\psi}}_{n,j}^{\mathsf{T}}\,.

Define 𝑯¯(𝒗):=𝔼𝒘,ϕ[s0(𝒗,𝝍n)𝝍n𝝍n𝖳]\overline{\bm{H}}({\bm{v}}):=\mathbb{E}_{{\bm{w}},\phi}[s_{0}(\langle{\bm{v}},{\bm{\psi}}_{n}\rangle){\bm{\psi}}_{n}{\bm{\psi}}_{n}^{\mathsf{T}}]. As already explained in Remark 2, the fact that 𝝍n{\bm{\psi}}_{n} is isotropic, together with assumption FEAT3 implies that, for any two unit vectors 𝒖,𝒖{\bm{u}},{\bm{u}}^{\prime}, we have (|𝒖,𝝍n|[η,C];|𝒖,𝝍n|[η,C])c/2\mathbb{P}\big{(}|\langle{\bm{u}},{\bm{\psi}}_{n}\rangle|\in[\eta,C];\;|\langle{\bm{u}}^{\prime},{\bm{\psi}}_{n}\rangle|\in[\eta,C]\big{)}\geq c/2. Using the fact that 𝒗2[1/2,2]\|{\bm{v}}\|_{2}\in[1/2,2], we deduce that

(|𝒖,𝝍n|[η,C];|𝒗,𝝍n|[η/2,2C])c2.\displaystyle\mathbb{P}\big{(}|\langle{\bm{u}},{\bm{\psi}}_{n}\rangle|\in[\eta,C];\;|\langle{\bm{v}},{\bm{\psi}}_{n}\rangle|\in[\eta/2,2C]\big{)}\geq\frac{c}{2}\,.

Thus,

𝒖,𝑯¯(𝒗)𝒖=𝔼𝒘,ϕ[s0(𝒗,𝝍n)𝒖,𝝍n2]Cη2min{η,ηq12,ηq22}.\displaystyle\langle{\bm{u}},\overline{\bm{H}}({\bm{v}}){\bm{u}}\rangle=\mathbb{E}_{{\bm{w}},\phi}[s_{0}(\langle{\bm{v}},{\bm{\psi}}_{n}\rangle)\langle{\bm{u}},{\bm{\psi}}_{n}\rangle^{2}]\geq C\eta^{2}\min\{\eta,\eta^{q_{1}-2},\eta^{q_{2}-2}\}.

Thus, λmin(𝑯¯(𝒗))δ0(η)\lambda_{\min}(\overline{\bm{H}}({\bm{v}}))\geq\delta_{0}(\eta) for δ0(η):=Cη3q1q2\delta_{0}(\eta):=C\eta^{3\vee q_{1}\vee q_{2}}. Furthermore, by FEAT1 and the fact that 0s0(x)10\leq s_{0}(x)\leq 1, the vectors s0(𝒗,𝝍n,j)1/2𝝍n,js_{0}(\langle{\bm{v}},{\bm{\psi}}_{n,j}\rangle)^{1/2}{\bm{\psi}}_{n,j} are Cτ2C\tau^{2}-sub-Gaussian. Applying Lemma 9 to the vectors s0(𝒗,𝝍n,j)1/2𝝍n,j/(Cτ)s_{0}(\langle{\bm{v}},{\bm{\psi}}_{n,j}\rangle)^{1/2}{\bm{\psi}}_{n,j}/(C\tau) with δ=δ0(η)/(2C2τ2)\delta=\delta_{0}(\eta)/(2C^{2}\tau^{2}) and t=N/nδ0(η)/(2CC2τ2)1t=\sqrt{N/n}\,\delta_{0}(\eta)/(2C^{\prime}C^{2}\tau^{2})-1, and noting that δ0(η)/(2C2τ2)1\delta_{0}(\eta)/(2C^{2}\tau^{2})\leq 1, we conclude that there exists a constant c>0c>0 such that for N16C2C4τ4n/δ0(η)2N\geq 16C^{\prime 2}C^{4}\tau^{4}n/\delta_{0}(\eta)^{2}

(λmin(𝑯N(𝒗))δ0(η)/2)(𝑯N(𝒗)𝑯¯(𝒗)opδ0(η)/2)2exp(cNδ0(η)2/τ4).\displaystyle\mathbb{P}(\lambda_{\min}({\bm{H}}_{N}({\bm{v}}))\leq\delta_{0}(\eta)/2)\leq\mathbb{P}(\|{\bm{H}}_{N}({\bm{v}})-\overline{\bm{H}}({\bm{v}})\|_{{\rm op}}\geq\delta_{0}(\eta)/2)\leq 2\exp\big{(}-cN\delta_{0}(\eta)^{2}/\tau^{4}\big{)}.

Denote 𝚿n=[𝝍n,1,,𝝍n,N]n×N{\bm{\Psi}}_{n}=[{\bm{\psi}}_{n,1},\ldots,{\bm{\psi}}_{n,N}]\in\mathbb{R}^{n\times N}. For constant C>0C>0 sufficiently large, consider the event

0={{maxj[N]𝝍n,j2Cnτ}{𝚿nop2CNτ2}}.{\mathcal{E}}_{0}=\Big{\{}\{\max_{j\in[N]}\|{\bm{\psi}}_{n,j}\|_{2}\leq C\sqrt{n}\,\tau\}\cap\{\|{\bm{\Psi}}_{n}\|_{\rm op}^{2}\leq CN\tau^{2}\}\Big{\}}\,.

We have for NnN\geq n,

(0c)\displaystyle\mathbb{P}({\mathcal{E}}_{0}^{c})\leq (maxj[N]𝝍n,j2Cnτ)+(𝚿nop2CNτ2)exp(cN),\displaystyle~{}\mathbb{P}\big{(}\max_{j\in[N]}\|{\bm{\psi}}_{n,j}\|_{2}\geq C\sqrt{n}\,\tau\big{)}+\mathbb{P}\big{(}\|{\bm{\Psi}}_{n}\|_{\rm op}^{2}\geq CN\tau^{2}\big{)}\leq\exp(-cN)\,,

where we used that 𝝍n,j{\bm{\psi}}_{n,j} are τ2\tau^{2}-subgaussian by FEAT1 to bound the first term, and that 𝔼𝒘,ϕ[𝝍n𝝍n𝖳]=𝐈n\mathbb{E}_{{\bm{w}},\phi}[{\bm{\psi}}_{n}{\bm{\psi}}_{n}^{\mathsf{T}}]={\mathbf{I}}_{n} and Lemma 9 to bound the second term.

On the event 0{\mathcal{E}}_{0}, we have

𝑯N(𝒗1)𝑯N(𝒗2)op\displaystyle\|{\bm{H}}_{N}({\bm{v}}_{1})-{\bm{H}}_{N}({\bm{v}}_{2})\|_{{\rm op}}\leq 1N𝚿nop2maxj[N]|s0(𝝍n,j,𝒗1)s0(𝝍n,j,𝒗2)|\displaystyle~{}\frac{1}{N}\|{\bm{\Psi}}_{n}\|_{{\rm op}}^{2}\max_{j\in[N]}|s_{0}(\langle{\bm{\psi}}_{n,j},{\bm{v}}_{1}\rangle)-s_{0}(\langle{\bm{\psi}}_{n,j},{\bm{v}}_{2}\rangle)|
\displaystyle\leq Cτ2maxj[N]|𝝍n,j,𝒗1𝒗2|Cnτ3𝒗1𝒗22,\displaystyle~{}C\tau^{2}\max_{j\in[N]}|\langle{\bm{\psi}}_{n,j},{\bm{v}}_{1}-{\bm{v}}_{2}\rangle|\leq C\sqrt{n}\,\tau^{3}\|{\bm{v}}_{1}-{\bm{v}}_{2}\|_{2}\,,

where we used that s0s_{0} is Lipschitz. Let 𝒩n\mathcal{N}_{n} be a minimal ε\varepsilon-net of {𝒗n:𝒗2[1/2,2]}\{{\bm{v}}\in\mathbb{R}^{n}:\|{\bm{v}}\|_{2}\in[1/2,2]\}, with ε=δ0(η)/(4Cnτ3)\varepsilon=\delta_{0}(\eta)/(4C\sqrt{n}\tau^{3}), so that 𝑯N(𝒗1)𝑯N(𝒗2)opδ0(η)/4\|{\bm{H}}_{N}({\bm{v}}_{1})-{\bm{H}}_{N}({\bm{v}}_{2})\|_{{\rm op}}\leq\delta_{0}(\eta)/4 as soon as 𝒗1𝒗22ε\|{\bm{v}}_{1}-{\bm{v}}_{2}\|_{2}\leq\varepsilon. We have

(min𝒗2[1/2,2]λmin(𝑯N(𝒗))δ0/4)\displaystyle\mathbb{P}(\min_{\|{\bm{v}}\|_{2}\in[1/2,2]}\lambda_{\min}({\bm{H}}_{N}({\bm{v}}))\leq\delta_{0}/4)\leq |𝒩n|max𝒗𝒩n(λmin(𝑯N(𝒗))δ0/2)+(0c)\displaystyle~{}|\mathcal{N}_{n}|\max_{{\bm{v}}\in\mathcal{N}_{n}}\mathbb{P}(\lambda_{\min}({\bm{H}}_{N}({\bm{v}}))\leq\delta_{0}/2)+\mathbb{P}({\mathcal{E}}_{0}^{c})
Cexp{Cnlog(1/ε)cNδ0(η)2/τ4}\displaystyle\leq C\exp\big{\{}Cn\log(1/\varepsilon)-cN\delta_{0}(\eta)^{2}/\tau^{4}\big{\}}
Cexp{Cnlog(n)+Cnlogτ+Cnlog(1/η)cNδ0(η)2/τ4}\displaystyle\leq C\exp\big{\{}Cn\log(n)+Cn\log\tau+Cn\log(1/\eta)-cN\delta_{0}(\eta)^{2}/\tau^{4}\big{\}}
Cexp{Cnlog(n)cNδ0(η)2/τ4},\displaystyle\leq C\exp\big{\{}Cn\log(n)-cN\delta_{0}(\eta)^{2}/\tau^{4}\big{\}}\,,

where the last inequality follows since by assumption τnC\tau\leq n^{C} and ηnC\eta\geq n^{-C}. By taking NC(τ4/δ0(η)2)nlog(N)N\geq C^{\prime}(\tau^{4}/\delta_{0}(\eta)^{2})n\log(N) for a sufficiently large constant CC^{\prime} we obtain min𝒗2[1/2,2]λmin(𝑯N(𝒗))δ0/4\min_{\|{\bm{v}}\|_{2}\in[1/2,2]}\lambda_{\min}({\bm{H}}_{N}({\bm{v}}))\geq\delta_{0}/4 with probability at least 1CNcn1-C^{\prime}N^{-c^{\prime}n} as claimed.

Appendix B The latent linear model: Proof of Proposition 5

We consider the latent linear model presented in the main text (Section 5.3). We have ϕn,j=ϕn(𝒘j)=𝑿𝒘j+𝒛j\bm{\phi}_{n,j}=\bm{\phi}_{n}({\bm{w}}_{j})={\bm{X}}{\bm{w}}_{j}+{\bm{z}}_{j}, where 𝒛j=(z1j,,znj)n{\bm{z}}_{j}=(z_{1j},\ldots,z_{nj})\in\mathbb{R}^{n} are the i.i.d. feature noise vectors. Denote 𝑾=[𝒘1,,𝒘N]𝖳N×d{\bm{W}}=[{\bm{w}}_{1},\ldots,{\bm{w}}_{N}]^{\mathsf{T}}\in\mathbb{R}^{N\times d} and 𝒁=[𝒛1,,𝒛N]n×N{\bm{Z}}=[{\bm{z}}_{1},\ldots,{\bm{z}}_{N}]\in\mathbb{R}^{n\times N}. The features matrix is given by

𝚽n=[ϕn,1,,ϕn,N]=𝑿𝑾𝖳+𝒁n×N.\bm{\Phi}_{n}=[\bm{\phi}_{n,1},\ldots,\bm{\phi}_{n,N}]={\bm{X}}{\bm{W}}^{\mathsf{T}}+{\bm{Z}}\in\mathbb{R}^{n\times N}\,. (61)

The random features and infinite-width predictors are given by

f^(𝒙;𝝀^)=𝔼𝒘,ϕ[𝒙,𝒘s(ϕn(𝒘),𝝀^)],f^N(𝒙;𝝀^N)=1N𝒙,𝑾𝖳s(𝚽N𝖳𝝀^N),\hat{f}({\bm{x}};{\hat{\bm{\lambda}}})=\mathbb{E}_{{\bm{w}},\phi}[\langle{\bm{x}},{\bm{w}}\rangle s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)],\qquad\hat{f}_{N}({\bm{x}};{\hat{\bm{\lambda}}}_{N})=\frac{1}{N}\langle{\bm{x}},{\bm{W}}^{\mathsf{T}}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})\rangle\,,

where ss is applied component-wise. The distance between f^\hat{f} and f^N\hat{f}_{N} is therefore given explicitly by

f^N(;𝝀^N)f^(;𝝀^)L22=\displaystyle\|\hat{f}_{N}(\cdot;{\hat{\bm{\lambda}}}_{N})-\hat{f}(\cdot;{\hat{\bm{\lambda}}})\|_{L^{2}}^{2}= 𝔼𝒙[(N1𝒙,𝑾𝖳s(𝚽N𝖳𝝀^N)𝔼𝒘,ϕ[𝒙,𝒘s(ϕn(𝒘),𝝀^)])2]\displaystyle~{}\mathbb{E}_{{\bm{x}}}\Big{[}\Big{(}N^{-1}\langle{\bm{x}},{\bm{W}}^{\mathsf{T}}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})-\mathbb{E}_{{\bm{w}},\phi}[\langle{\bm{x}},{\bm{w}}\rangle s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)]\Big{)}^{2}\Big{]} (62)
=\displaystyle= N1𝑾𝖳s(𝚽N𝖳𝝀^N)𝔼𝒘,ϕ[𝒘s(ϕn(𝒘),𝝀^)]𝚺x2.\displaystyle~{}\|N^{-1}{\bm{W}}^{\mathsf{T}}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})-\mathbb{E}_{{\bm{w}},\phi}[{\bm{w}}s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)]\|_{\bm{\Sigma}_{x}}^{2}\,.
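As a purely illustrative sanity check (not used in any proof), the following Python sketch simulates this model with small arbitrary sizes, a quadratic penalty so that s(x)=x, and a generic vector standing in for the dual solution {\hat{\bm{\lambda}}}_{N}; it builds the feature matrix of Eq. (61) and evaluates the random features predictor.

import numpy as np

rng = np.random.default_rng(0)
n, d, N = 50, 10, 2000                        # illustrative sizes
X = rng.standard_normal((n, d))               # covariates x_1, ..., x_n (rows)
W = rng.standard_normal((N, d)) / np.sqrt(d)  # latent weights w_1, ..., w_N (rows)
Z = rng.standard_normal((n, N))               # feature noise z_1, ..., z_N (columns)
Phi = X @ W.T + Z                             # feature matrix of Eq. (61), shape (n, N)

def s(t):
    return t                                  # s = (rho')^{-1} for rho(x) = x^2 / 2

lam = rng.standard_normal(n) / np.sqrt(n)     # stand-in for the dual solution lambda_hat_N

def f_hat_N(x):
    # random features predictor: <x, W^T s(Phi^T lam)> / N
    return x @ (W.T @ s(Phi.T @ lam)) / N

x_test = rng.standard_normal(d)
print(f_hat_N(x_test))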

The following lemma is the key result that allows us to improve on Theorem 1 and gain a factor 1/n1/\sqrt{n}.

Lemma 6.

Assume λmax(𝚺x)κ2\lambda_{\max}(\bm{\Sigma}_{x})\leq\kappa^{2} and σmin(𝐗)c0n\sigma_{\min}({\bm{X}})\geq c_{0}\sqrt{n} (the minimum singular value of 𝐗{\bm{X}}). Then if n2d/c02n\geq 2d/c_{0}^{2}, we have

f^N(;𝝀^N)f^(;𝝀^)L22κc0nN1𝒁s(𝚽n𝝀^N)𝔼𝒘,𝒛[𝒛s(ϕn(𝒘,𝒛),𝝀^)]2.\|\hat{f}_{N}(\cdot;{\hat{\bm{\lambda}}}_{N})-\hat{f}(\cdot;{\hat{\bm{\lambda}}})\|_{L^{2}}\leq\frac{2\kappa}{c_{0}\sqrt{n}}\|N^{-1}{\bm{Z}}s(\bm{\Phi}_{n}^{\top}\hat{\bm{\lambda}}_{N})-\mathbb{E}_{{\bm{w}},{\bm{z}}}[{\bm{z}}s(\langle\bm{\phi}_{n}({\bm{w}},{\bm{z}}),{\hat{\bm{\lambda}}}\rangle)]\|_{2}. (63)

We remark that Lemma 6 is entirely deterministic. This may seem surprising because 𝑾{\bm{W}}^{\top} and 𝒁{\bm{Z}} are random. In fact, the proof of Lemma 6 relies on a deterministic argument which uses the fact that both the infinite-width and random features predictors interpolate the training data, i.e., for ini\leq n,

𝔼𝒘,ϕ[ϕ(𝒙i;𝒘)s(ϕn(𝒘),𝝀^)]=1Nj=1Nϕ(𝒙i;𝒘j)s(ϕn,j,𝝀^N)=yi,\mathbb{E}_{{\bm{w}},\phi}[\phi({\bm{x}}_{i};{\bm{w}})s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)]=\frac{1}{N}\sum_{j=1}^{N}\phi({\bm{x}}_{i};{\bm{w}}_{j})s(\langle\bm{\phi}_{n,j},{\hat{\bm{\lambda}}}_{N}\rangle)=y_{i}\,,

or equivalently,

𝔼𝒘,𝒛[(𝑿𝒘+𝒛)s(ϕn(𝒘),𝝀^)]=1N(𝑿𝑾+𝒁)s(𝚽N𝝀^N)=𝒚\mathbb{E}_{{\bm{w}},{\bm{z}}}[({\bm{X}}{\bm{w}}+{\bm{z}})s(\langle\bm{\phi}_{n}({\bm{w}}),\hat{\bm{\lambda}}\rangle)]=\frac{1}{N}({\bm{X}}{\bm{W}}^{\top}+{\bm{Z}})s(\bm{\Phi}_{N}^{\top}\hat{\bm{\lambda}}_{N})={\bm{y}} (64)

Note that here we have used the form of the infinite-width primal problem (see Section E.1).

Proof of Lemma 6.

By Eq. (62) and the bound λmax(𝚺x)κ2\lambda_{\max}(\bm{\Sigma}_{x})\leq\kappa^{2}, we have

f^N(;𝝀^N)f^(;𝝀^)L2κN1𝑾𝖳s(𝚽N𝖳𝝀^N)𝔼𝒘,ϕ[𝒘s(ϕn(𝒘),𝝀^)]2.\|\hat{f}_{N}(\cdot;{\hat{\bm{\lambda}}}_{N})-\hat{f}(\cdot;{\hat{\bm{\lambda}}})\|_{L^{2}}\leq\kappa\Big{\|}N^{-1}{\bm{W}}^{\mathsf{T}}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})-\mathbb{E}_{{\bm{w}},\phi}[{\bm{w}}s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)]\Big{\|}_{2}\,.

Denote 𝑴=𝐈n+1d𝑿𝑿{\bm{M}}={\mathbf{I}}_{n}+\frac{1}{d}{\bm{X}}{\bm{X}}^{\top}. We have

𝑾𝖳s(𝚽N𝖳𝝀^N)=1d𝑿𝑴1𝚽Ns(𝚽N𝖳𝝀^N)+(𝑾1d𝑿𝑴1𝚽N)s(𝚽N𝖳𝝀^N),{\bm{W}}^{\mathsf{T}}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})=\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}\bm{\Phi}_{N}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})+\Big{(}{\bm{W}}^{\top}-\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}\bm{\Phi}_{N}\Big{)}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})\,,

and

𝔼𝒘,ϕ[𝒘s(ϕn(𝒘),𝝀^)]=\displaystyle\mathbb{E}_{{\bm{w}},\phi}[{\bm{w}}s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)]= 𝔼𝒘,ϕ[1d𝑿𝑴1ϕn(𝒘)s(ϕn(𝒘),𝝀^)]\displaystyle~{}\mathbb{E}_{{\bm{w}},\phi}\Big{[}\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}\bm{\phi}_{n}({\bm{w}})s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)\Big{]}
+𝔼𝒘,ϕ[(𝒘1d𝑿𝑴1ϕn(𝒘))s(ϕn(𝒘),𝝀^)].\displaystyle~{}+\mathbb{E}_{{\bm{w}},\phi}\Big{[}\Big{(}{\bm{w}}-\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}\bm{\phi}_{n}({\bm{w}})\Big{)}s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)\Big{]}.

By the interpolation constraints (64) and recalling the expression of the features matrix in Eq. (61), we write

𝑾s(𝚽N𝖳𝝀^N)=Nd𝑿𝑴1𝒚+(𝐈d1d𝑿𝑴1𝑿)𝑾s(𝚽N𝖳𝝀^N)1d𝑿𝑴1𝒁s(𝚽N𝖳𝝀^N),{\bm{W}}^{\top}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})=\frac{N}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}{\bm{y}}+\Big{(}{\mathbf{I}}_{d}-\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}{\bm{X}}\Big{)}{\bm{W}}^{\top}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})-\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}{\bm{Z}}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})\,,

and

𝔼𝒘,ϕ[𝒘s(ϕn(𝒘),𝝀^)]=\displaystyle\mathbb{E}_{{\bm{w}},\phi}[{\bm{w}}s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)]= 1d𝑿𝑴1𝒚+(𝐈d1d𝑿𝑴1𝑿)𝔼𝒘,ϕ[𝒘s(ϕn(𝒘),𝝀^)]\displaystyle~{}\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}{\bm{y}}+\Big{(}{\mathbf{I}}_{d}-\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}{\bm{X}}\Big{)}\mathbb{E}_{{\bm{w}},\phi}[{\bm{w}}s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)]
1d𝑿𝑴1𝔼𝒘,𝒛[𝒛s(ϕn(𝒘,𝒛),𝝀^)].\displaystyle~{}-\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}\mathbb{E}_{{\bm{w}},{\bm{z}}}[{\bm{z}}s(\langle\bm{\phi}_{n}({\bm{w}},{\bm{z}}),{\hat{\bm{\lambda}}}\rangle)]\,.

The first terms on the right-hand sides of the preceding two expressions are the same (up to a factor NN). We deduce that

N1𝑾s(𝚽N𝖳𝝀^N)𝔼𝒘,ϕ[𝒘s(ϕn(𝒘),𝝀^)]2\displaystyle~{}\Big{\|}N^{-1}{\bm{W}}^{\top}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})-\mathbb{E}_{{\bm{w}},\phi}[{\bm{w}}s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)]\Big{\|}_{2} (65)
\displaystyle\leq 𝐈d1d𝑿𝑴1𝑿opN1𝑾s(𝚽N𝖳𝝀^N)𝔼𝒘,ϕ[𝒘s(ϕn(𝒘),𝝀^)]2\displaystyle~{}\Big{\|}{\mathbf{I}}_{d}-\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}{\bm{X}}\Big{\|}_{{\rm op}}\Big{\|}N^{-1}{\bm{W}}^{\top}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})-\mathbb{E}_{{\bm{w}},\phi}[{\bm{w}}s(\langle\bm{\phi}_{n}({\bm{w}}),{\hat{\bm{\lambda}}}\rangle)]\Big{\|}_{2}
+1d𝑿𝑴1opN1𝒁s(𝚽N𝖳𝝀^N)𝔼𝒘,𝒛[𝒛s(ϕn(𝒘,𝒛),𝝀^)]2.\displaystyle~{}+\Big{\|}\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}\Big{\|}_{{\rm op}}\Big{\|}N^{-1}{\bm{Z}}s(\bm{\Phi}_{N}^{\mathsf{T}}{\hat{\bm{\lambda}}}_{N})-\mathbb{E}_{{\bm{w}},{\bm{z}}}[{\bm{z}}s(\langle\bm{\phi}_{n}({\bm{w}},{\bm{z}}),{\hat{\bm{\lambda}}}\rangle)]\Big{\|}_{2}.

From the expression of 𝑴{\bm{M}}, we have 𝐈d1d𝑿𝑴1𝑿=(𝐈d+1d𝑿𝖳𝑿)1{\mathbf{I}}_{d}-\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}{\bm{X}}=({\mathbf{I}}_{d}+\frac{1}{d}{\bm{X}}^{\mathsf{T}}{\bm{X}})^{-1}. By the assumption that σmin(𝑿)nc0\sigma_{\min}({\bm{X}})\geq\sqrt{n}c_{0} and n2d/c02n\geq 2d/c_{0}^{2}, we have 𝐈d1d𝑿𝑴1𝑿opdnc021/2\|{\mathbf{I}}_{d}-\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}{\bm{X}}\|_{{\rm op}}\leq\frac{d}{n}c_{0}^{-2}\leq 1/2. Similarly, we have 1d𝑿𝑴1op2/(3c0n)\|\frac{1}{d}{\bm{X}}^{\top}{\bm{M}}^{-1}\|_{{\rm op}}\leq 2/(3c_{0}\sqrt{n}). Rearranging the terms in Eq. (65) implies Eq. (63). ∎

Proof of Proposition 5.

By Proposition 4, FEAT1, FEAT2, FEAT3 and FEAT3’ are satisfied with probability 1ecn1-e^{-c^{\prime}n}. Hence, we can use the results proved in the proof of Theorem 1. Replace the event 1{\mathcal{E}}_{1} by the event 1{\mathcal{E}}_{1}^{\prime} that for all 𝝀𝑲n/𝝀^𝑲n[1/2,2]\|{\bm{\lambda}}\|_{{\bm{K}}_{n}}/\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}\in[1/2,2],

1Nj=1N𝒛js(ϕn,j,𝝀)𝔼𝒘,𝒛[𝒛s(ϕn(𝒘,𝒛),𝝀)]2s(𝝀^𝑲n)ε1.\frac{\Big{\|}\frac{1}{N}\sum_{j=1}^{N}{\bm{z}}_{j}s(\langle\bm{\phi}_{n,j},{\bm{\lambda}}\rangle)-\mathbb{E}_{{\bm{w}},{\bm{z}}}[{\bm{z}}s(\langle\bm{\phi}_{n}({\bm{w}},{\bm{z}}),{\bm{\lambda}}\rangle)]\Big{\|}_{2}}{s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})}\leq\varepsilon_{1}\,.

Using the same proof as in Lemma 4, where the vector 𝒛{\bm{z}} is a κ2\kappa^{2}-sub-Gaussian random vector by A3, there exist constants C,c>0C^{\prime},c^{\prime}>0 such that for NncdN\geq n\geq c^{\prime}d, taking ε1=C(nlogNN(nlogN)(Q/2)1N)\varepsilon_{1}=C^{\prime}\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{(Q/2)\vee 1}}{N}\Big{)}, we have (1)1CNcn\mathbb{P}({\mathcal{E}}_{1}^{\prime})\geq 1-C^{\prime}N^{-c^{\prime}n}. Consider the same events 2{\mathcal{E}}_{2} (with same ε2\varepsilon_{2}) and 3{\mathcal{E}}_{3} as in the proof of Theorem 1, except with τ,η\tau,\eta (which do not depend on d,nd,n) absorbed into the constant CC. By the same proof as in Lemma 2, we can show that there exists a constant K>0K>0 such that for all 𝝀𝝀^𝑲n𝝀^𝑲n/2\|{\bm{\lambda}}-{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}\leq\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}/2,

𝔼𝒘,𝒛[𝒛s(ϕn(𝒘,𝒛),𝝀)]𝔼𝒘,𝒛[𝒛s(ϕn(𝒘,𝒛),𝝀^)]2s(𝝀^𝑲n)K𝝀𝝀^𝑲n𝝀^𝑲n.\frac{\|\mathbb{E}_{{\bm{w}},{\bm{z}}}[{\bm{z}}s(\langle\bm{\phi}_{n}({\bm{w}},{\bm{z}}),{\bm{\lambda}}\rangle)]-\mathbb{E}_{{\bm{w}},{\bm{z}}}[{\bm{z}}s(\langle\bm{\phi}_{n}({\bm{w}},{\bm{z}}),{\hat{\bm{\lambda}}}\rangle)]\|_{2}}{s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})}\leq K\frac{\|{\bm{\lambda}}-{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}}{\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}}}. (66)

The same argument as in Lemma 1 implies that on events 1,2,3{\mathcal{E}}_{1}^{\prime},{\mathcal{E}}_{2},{\mathcal{E}}_{3}, we have

N1𝒁s(𝚽n𝝀^N)𝔼𝒘,𝒛[𝒛s(ϕn(𝒘,𝒛),𝝀^)]2(ε1+Kε2β)s(𝝀^𝑲n).\|N^{-1}{\bm{Z}}s(\bm{\Phi}_{n}^{\top}\hat{\bm{\lambda}}_{N})-\mathbb{E}_{{\bm{w}},{\bm{z}}}[{\bm{z}}s(\langle\bm{\phi}_{n}({\bm{w}},{\bm{z}}),{\hat{\bm{\lambda}}}\rangle)]\|_{2}\leq\Big{(}\varepsilon_{1}+\frac{K\varepsilon_{2}}{\beta}\Big{)}s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\,. (67)

Recalling the argument in the proof of Theorem 1, we have s(𝝀^𝑲n)C𝑲n1/2𝒚2C𝒚2s(\|{\hat{\bm{\lambda}}}\|_{{\bm{K}}_{n}})\leq C\|{\bm{K}}_{n}^{-1/2}{\bm{y}}\|_{2}\leq C^{\prime}\|{\bm{y}}\|_{2}, where we used that by A3, 𝑲nγ2𝐈n{\bm{K}}_{n}\succeq\gamma^{2}{\mathbf{I}}_{n}. Hence with probability at least 1CNcn1-C^{\prime}N^{-c^{\prime}n}, we have

N1𝒁s(𝚽n𝝀^N)𝔼𝒘,𝒛[𝒛s(ϕn(𝒘,𝒛),𝝀^)]2C(nlogNN(nlogN)Q/2N)𝒚2.\|N^{-1}{\bm{Z}}s(\bm{\Phi}_{n}^{\top}\hat{\bm{\lambda}}_{N})-\mathbb{E}_{{\bm{w}},{\bm{z}}}[{\bm{z}}s(\langle\bm{\phi}_{n}({\bm{w}},{\bm{z}}),{\hat{\bm{\lambda}}}\rangle)]\|_{2}\leq C^{\prime}\Big{(}\sqrt{\frac{n\log N}{N}}\vee\frac{(n\log N)^{Q/2}}{N}\Big{)}\|{\bm{y}}\|_{2}\,. (68)

Furthermore, from assumption A1 and Lemma 9, there exists c>0c>0 such that σmin(𝑿)cn\sigma_{\min}({\bm{X}})\geq c\sqrt{n} with probability at least 1ecn1-e^{-cn}. Hence, we can use Lemma 6, which concludes the proof. ∎

Appendix C Small ball property under fast decay of Hermite coefficients

In this section, we show FEAT3 for a deterministic feature map ϕ(𝒙;𝒘)=σ(𝒙,𝒘)\phi({\bm{x}};{\bm{w}})=\sigma(\langle{\bm{x}},{\bm{w}}\rangle) with the activation function σ:\sigma:\mathbb{R}\to\mathbb{R} verifying a decay condition on its Hermite coefficients.

Recall that for any function gL2(,γ)g\in L^{2}(\mathbb{R},\gamma) with γ(dx)=ex2/2dx/2π\gamma({\rm d}x)=e^{-x^{2}/2}{\rm d}x/\sqrt{2\pi} the standard Gaussian measure, we have the decomposition

g(x)=k=0μk(g)k!Hek(x),μk(g)𝔼Gγ{g(G)Hek(G)},g(x)=\sum_{k=0}^{\infty}\frac{\mu_{k}(g)}{k!}{\rm He}_{k}(x),\qquad\mu_{k}(g)\equiv\mathbb{E}_{G\sim\gamma}\left\{g(G){\rm He}_{k}(G)\right\}\,,

where {Hek}k0\{{\rm He}_{k}\}_{k\geq 0} are the Hermite polynomials with standard normalization 𝔼{Hek(G)Hej(G)}=k!δjk\mathbb{E}\{{\rm He}_{k}(G){\rm He}_{j}(G)\}=k!\delta_{jk} and we call μk(g)\mu_{k}(g) the kk-th Hermite coefficient of gg.
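For example, {\rm He}_{0}(x)=1, {\rm He}_{1}(x)=x, {\rm He}_{2}(x)=x^{2}-1 and {\rm He}_{3}(x)=x^{3}-3x; for the quadratic function g(x)=x^{2} one has \mu_{0}(g)=1, \mu_{2}(g)=2 and \mu_{k}(g)=0 for all other k, i.e., x^{2}={\rm He}_{0}(x)+{\rm He}_{2}(x).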

Throughout this section, we will consider 𝒘𝖭(0,𝐈d/d){\bm{w}}\sim{\sf N}(0,{\mathbf{I}}_{d}/d) and 𝒙i2=d\|{\bm{x}}_{i}\|_{2}=\sqrt{d} for all ini\leq n. For an integer m0m\geq 0, consider the decomposition of the activation σ=σm+σ>m\sigma=\sigma_{\leq m}+\sigma_{>m} into a low-degree part and a high-degree part

σm(x)=k=0mμk(σ)k!Hek(x),σ>m(x)=k=m+1μk(σ)k!Hek(x).\displaystyle\sigma_{\leq m}(x)=\sum_{k=0}^{m}\frac{\mu_{k}(\sigma)}{k!}{\rm He}_{k}(x),\qquad\sigma_{>m}(x)=\sum_{k=m+1}^{\infty}\frac{\mu_{k}(\sigma)}{k!}{\rm He}_{k}(x)\,. (69)

For 𝒘𝖭(0,𝐈d/d){\bm{w}}\sim{\sf N}(0,{\mathbf{I}}_{d}/d) and taking 𝒙2=𝒙2=d\|{\bm{x}}\|_{2}=\|{\bm{x}}^{\prime}\|_{2}=\sqrt{d}, we can decompose the kernel function into

K(𝒙,𝒙)=𝔼𝒘{σ(𝒘,𝒙)σ(𝒘,𝒙)}=Km(𝒙,𝒙)+K>m(𝒙,𝒙),\displaystyle K({\bm{x}},{\bm{x}}^{\prime})=\mathbb{E}_{{\bm{w}}}\{\sigma(\langle{\bm{w}},{\bm{x}}\rangle)\sigma(\langle{\bm{w}},{\bm{x}}^{\prime}\rangle)\}=K^{\leq m}({\bm{x}},{\bm{x}}^{\prime})+K^{>m}({\bm{x}},{\bm{x}}^{\prime})\,, (70)

where

Km(𝒙,𝒙)=\displaystyle K^{\leq m}({\bm{x}},{\bm{x}}^{\prime})= 𝔼𝒘{σm(𝒘,𝒙)σm(𝒘,𝒙)},\displaystyle~{}\mathbb{E}_{{\bm{w}}}\{\sigma_{\leq m}(\langle{\bm{w}},{\bm{x}}\rangle)\sigma_{\leq m}(\langle{\bm{w}},{\bm{x}}^{\prime}\rangle)\}\,, (71)
K>m(𝒙,𝒙)=\displaystyle K^{>m}({\bm{x}},{\bm{x}}^{\prime})= 𝔼𝒘{σ>m(𝒘,𝒙)σ>m(𝒘,𝒙)}.\displaystyle~{}\mathbb{E}_{{\bm{w}}}\{\sigma_{>m}(\langle{\bm{w}},{\bm{x}}\rangle)\sigma_{>m}(\langle{\bm{w}},{\bm{x}}^{\prime}\rangle)\}\,. (72)

Therefore the empirical kernel matrix can be decomposed into 𝑲n=𝑲nm+𝑲n>m{\bm{K}}_{n}={\bm{K}}_{n}^{\leq m}+{\bm{K}}_{n}^{>m} where 𝑲nm=(Km(𝒙i,𝒙j))ij[n]{\bm{K}}_{n}^{\leq m}=(K^{\leq m}({\bm{x}}_{i},{\bm{x}}_{j}))_{ij\in[n]} and 𝑲n>m=(K>m(𝒙i,𝒙j))ij[n]{\bm{K}}_{n}^{>m}=(K^{>m}({\bm{x}}_{i},{\bm{x}}_{j}))_{ij\in[n]}.

Similarly, we introduce

ϕnm(𝒘)=\displaystyle\bm{\phi}_{n}^{\leq m}({\bm{w}})= (σm(𝒙1,𝒘),,σm(𝒙n,𝒘)),\displaystyle~{}\big{(}\sigma_{\leq m}(\langle{\bm{x}}_{1},{\bm{w}}\rangle),\ldots,\sigma_{\leq m}(\langle{\bm{x}}_{n},{\bm{w}}\rangle)\big{)}\,, (73)
ϕn>m(𝒘)=\displaystyle\bm{\phi}_{n}^{>m}({\bm{w}})= (σ>m(𝒙1,𝒘),,σ>m(𝒙n,𝒘)).\displaystyle~{}\big{(}\sigma_{>m}(\langle{\bm{x}}_{1},{\bm{w}}\rangle),\ldots,\sigma_{>m}(\langle{\bm{x}}_{n},{\bm{w}}\rangle)\big{)}\,. (74)

Notice that 𝔼𝒘{ϕnm(𝒘)ϕnm(𝒘)𝖳}=𝑲nm\mathbb{E}_{\bm{w}}\big{\{}\bm{\phi}_{n}^{\leq m}({\bm{w}})\bm{\phi}_{n}^{\leq m}({\bm{w}})^{\mathsf{T}}\big{\}}={\bm{K}}_{n}^{\leq m} and 𝔼𝒘{ϕn>m(𝒘)ϕn>m(𝒘)𝖳}=𝑲n>m\mathbb{E}_{\bm{w}}\big{\{}\bm{\phi}_{n}^{>m}({\bm{w}})\bm{\phi}_{n}^{>m}({\bm{w}})^{\mathsf{T}}\big{\}}={\bm{K}}_{n}^{>m}. Denote 𝝍nm=𝑲n1/2ϕnm{\bm{\psi}}_{n}^{\leq m}={\bm{K}}_{n}^{-1/2}\bm{\phi}_{n}^{\leq m} and 𝝍n>m=𝑲n1/2ϕn>m{\bm{\psi}}_{n}^{>m}={\bm{K}}_{n}^{-1/2}\bm{\phi}_{n}^{>m}.
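These kernels can be estimated by Monte Carlo over {\bm{w}}. The following Python sketch is only an illustration, under choices of ours (activation \sigma=\tanh, m=1, and small arbitrary sizes); it estimates {\bm{K}}_{n} and {\bm{K}}_{n}^{>1}, whose extreme eigenvalues enter condition (75) below.

import numpy as np

rng = np.random.default_rng(2)
n, d, n_mc = 20, 40, 100_000
X = rng.standard_normal((n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)  # ||x_i||_2 = sqrt(d)
W = rng.standard_normal((n_mc, d)) / np.sqrt(d)             # w ~ N(0, I_d / d)
G = W @ X.T                                                 # <w, x_i> ~ N(0, 1), shape (n_mc, n)

g = rng.standard_normal(10**6)                              # scalar Gaussian sample
mu0, mu1 = np.tanh(g).mean(), (np.tanh(g) * g).mean()       # Hermite coefficients mu_0, mu_1
S_high = np.tanh(G) - (mu0 + mu1 * G)                       # sigma_{>1} applied entrywise

K_full = np.tanh(G).T @ np.tanh(G) / n_mc                   # Monte Carlo estimate of K_n
K_high = S_high.T @ S_high / n_mc                           # Monte Carlo estimate of K_n^{>1}
print(np.linalg.eigvalsh(K_high).max(), np.linalg.eigvalsh(K_full).min())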

With these notations, we can now introduce our result.

Proposition 6.

Assume that 𝐱i2=d\|{\bm{x}}_{i}\|_{2}=\sqrt{d} for all ini\leq n and 𝐰𝖭(0,𝐈d/d){\bm{w}}\sim{\sf N}(0,{\mathbf{I}}_{d}/d), and that σL2(,γ)\sigma\in L^{2}(\mathbb{R},\gamma). There exists an absolute constant C0>0C_{0}>0 such that the following holds. If for an integer m1m\geq 1,

λmax(𝑲n>m)λmin(𝑲n){(C0m)(2m+1)(1/4)},\displaystyle\lambda_{\max}\left({\bm{K}}_{n}^{>m}\right)\leq\lambda_{\min}\left({\bm{K}}_{n}\right)\cdot\Big{\{}(C_{0}m)^{-(2m+1)}\wedge(1/4)\Big{\}}\,, (75)

then we have

sup𝒗2=1(|𝒗,𝝍n|η)14,\displaystyle\sup_{\|{\bm{v}}\|_{2}=1}\mathbb{P}\big{(}|\langle{\bm{v}},{\bm{\psi}}_{n}\rangle|\leq\eta^{*}\big{)}\leq\frac{1}{4}\,, (76)

where η=4λmax(𝐊n>m)/λmin(𝐊n)\eta^{*}=4\lambda_{\max}\big{(}{\bm{K}}_{n}^{>m}\big{)}/\lambda_{\min}\big{(}{\bm{K}}_{n}\big{)}.

Proof of Proposition 6.

Throughout this proof, we will denote C>0C>0 a generic absolute constant. In particular, CC is allowed to change from line to line.

Consider η>0\eta>0, 𝒗2=1\|{\bm{v}}\|_{2}=1, and an integer mm such that condition (75) is satisfied with a constant C0C_{0} that will be fixed later, and decompose

(|𝒗,𝝍n|η)(|𝒗,𝝍nm|2η)+(|𝒗,𝝍n>m|η).\mathbb{P}\big{(}|\langle{\bm{v}},{\bm{\psi}}_{n}\rangle|\leq\eta\big{)}\leq\mathbb{P}\big{(}|\langle{\bm{v}},{\bm{\psi}}_{n}^{\leq m}\rangle|\leq 2\eta\big{)}+\mathbb{P}\big{(}|\langle{\bm{v}},{\bm{\psi}}_{n}^{>m}\rangle|\geq\eta\big{)}\,. (77)

The first term 𝒗,𝝍nm(𝒘)\langle{\bm{v}},{\bm{\psi}}_{n}^{\leq m}({\bm{w}})\rangle is a polynomial of degree mm in 𝒘𝖭(0,𝐈d/d){\bm{w}}\sim{\sf N}(0,{\mathbf{I}}_{d}/d). From Carbery-Wright inequality [CW01], we have the following anti-concentration bound

(|𝒗,𝝍nm|2η)Cm(2η𝔼{𝒗,𝝍nm2})1/m.\mathbb{P}\big{(}|\langle{\bm{v}},{\bm{\psi}}_{n}^{\leq m}\rangle|\leq 2\eta\big{)}\leq Cm\left(\frac{2\eta}{\mathbb{E}\{\langle{\bm{v}},{\bm{\psi}}_{n}^{\leq m}\rangle^{2}\}}\right)^{1/m}\,.

Note that

𝔼{𝒗,𝝍nm2}=\displaystyle\mathbb{E}\big{\{}\langle{\bm{v}},{\bm{\psi}}_{n}^{\leq m}\rangle^{2}\big{\}}= 𝒗,𝑲n1/2𝑲nm𝑲n1/2𝒗\displaystyle~{}\langle{\bm{v}},{\bm{K}}_{n}^{-1/2}{\bm{K}}_{n}^{\leq m}{\bm{K}}_{n}^{-1/2}{\bm{v}}\rangle
=\displaystyle= 1𝒗,𝑲n1/2𝑲n>m𝑲n1/2𝒗1λmax(𝑲n>m)λmin(𝑲n)3/4,\displaystyle~{}1-\langle{\bm{v}},{\bm{K}}_{n}^{-1/2}{\bm{K}}_{n}^{>m}{\bm{K}}_{n}^{-1/2}{\bm{v}}\rangle\geq 1-\frac{\lambda_{\max}\big{(}{\bm{K}}_{n}^{>m}\big{)}}{\lambda_{\min}\big{(}{\bm{K}}_{n}\big{)}}\geq 3/4\,,

where in the last inequality, we used λmax(𝑲n>m)λmin(𝑲n)/4\lambda_{\max}\big{(}{\bm{K}}_{n}^{>m}\big{)}\leq\lambda_{\min}\big{(}{\bm{K}}_{n}\big{)}/4 from condition (75). Hence

(|𝒗,𝝍nm|2η)Cmη1/m.\mathbb{P}\big{(}|\langle{\bm{v}},{\bm{\psi}}_{n}^{\leq m}\rangle|\leq 2\eta\big{)}\leq Cm\cdot\eta^{1/m}\,. (78)

By Markov’s inequality, the second term in Eq. (77) is bounded by

(|𝒗,𝝍n>m|η)η2𝔼{𝒗,𝝍n>m2}η2λmax(𝑲n>m)λmin(𝑲n).\mathbb{P}\big{(}|\langle{\bm{v}},{\bm{\psi}}_{n}^{>m}\rangle|\geq\eta\big{)}\leq\eta^{-2}\mathbb{E}\big{\{}\langle{\bm{v}},{\bm{\psi}}_{n}^{>m}\rangle^{2}\big{\}}\leq\eta^{-2}\cdot\frac{\lambda_{\max}\big{(}{\bm{K}}_{n}^{>m}\big{)}}{\lambda_{\min}\big{(}{\bm{K}}_{n}\big{)}}\,. (79)

Hence combining bounds (78) and (79) in Eq. (77) yields

(|𝒗,𝝍n|η)Cmη1/m+η2λmax(𝑲n>m)λmin(𝑲n).\mathbb{P}\big{(}|\langle{\bm{v}},{\bm{\psi}}_{n}\rangle|\leq\eta\big{)}\leq Cm\cdot\eta^{1/m}+\eta^{-2}\cdot\frac{\lambda_{\max}\big{(}{\bm{K}}_{n}^{>m}\big{)}}{\lambda_{\min}\big{(}{\bm{K}}_{n}\big{)}}\,.

Setting η=(λmax(𝑲n>m)/(Cmλmin(𝑲n)))m/(2m+1)\eta^{*}=\left(\lambda_{\max}({\bm{K}}_{n}^{>m})/(Cm\lambda_{\min}({\bm{K}}_{n}))\right)^{m/(2m+1)}, we get

(|𝒗,𝝍n|η)Cm(λmax(𝑲n>m)λmin(𝑲n))1/(2m+1)14,\mathbb{P}\big{(}|\langle{\bm{v}},{\bm{\psi}}_{n}\rangle|\leq\eta^{*}\big{)}\leq Cm\left(\frac{\lambda_{\max}\big{(}{\bm{K}}_{n}^{>m}\big{)}}{\lambda_{\min}\big{(}{\bm{K}}_{n}\big{)}}\right)^{1/(2m+1)}\leq\frac{1}{4}\,,

where we set C0=4CC_{0}=4C to obtain the last inequality. Noticing that condition (75) implies

η=\displaystyle\eta^{*}= (λmax(𝑲n>m)/(Cmλmin(𝑲n)))m/(2m+1)\displaystyle~{}\left(\lambda_{\max}({\bm{K}}_{n}^{>m})/(Cm\lambda_{\min}({\bm{K}}_{n}))\right)^{m/(2m+1)}
\displaystyle\geq (4λmax(𝑲n>m)/λmin(𝑲n))m(2m+2)/(2m+1)24λmax(𝑲n>m)/λmin(𝑲n),\displaystyle~{}\left(4\lambda_{\max}({\bm{K}}_{n}^{>m})/\lambda_{\min}({\bm{K}}_{n})\right)^{m(2m+2)/(2m+1)^{2}}\geq 4\lambda_{\max}({\bm{K}}_{n}^{>m})/\lambda_{\min}({\bm{K}}_{n})\,,

concludes the proof. ∎

Note that for 𝒙iUnif(𝕊d1(d)){\bm{x}}_{i}\sim{\rm Unif}(\mathbb{S}^{d-1}(\sqrt{d})), we have 𝒙i,𝒙j/d=Od,(d1/2)\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle/d=O_{d,\mathbb{P}}(d^{-1/2}) for iji\neq j and therefore we expect the off-diagonal elements of 𝑲n>m{\bm{K}}_{n}^{>m} to have negligible operator norm for mm sufficiently large. In fact, it was shown in [GMMM19] (see [MMM21] for a generalization) that for any constant δ>0\delta>0 and integer \ell, if d+δnd+1δd^{\ell+\delta}\leq n\leq d^{\ell+1-\delta}, then for any mm\geq\ell, we have 𝑲n>mκ>m𝐈nop=κ>mod,(1)\|{\bm{K}}_{n}^{>m}-\kappa_{>m}{\mathbf{I}}_{n}\|_{{\rm op}}=\kappa_{>m}\cdot o_{d,\mathbb{P}}(1), where

κ>m=k=m+1μk(σ)2k!.\kappa_{>m}=\sum_{k=m+1}^{\infty}\frac{\mu_{k}(\sigma)^{2}}{k!}\,.

Hence for d+δnd+1δd^{\ell+\delta}\leq n\leq d^{\ell+1-\delta} and dd sufficiently large, condition (75) is verified with high probability over the data 𝑿{\bm{X}}, if there exists m>m>\ell such that

κ>mκ>{(C0m)(2m+1)(1/4)}.\kappa_{>m}\leq\kappa_{>\ell}\cdot\Big{\{}(C_{0}m)^{-(2m+1)}\wedge(1/4)\Big{\}}\,.
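Note that, by orthogonality of the Hermite polynomials, \kappa_{>m}=\|\sigma\|_{L^{2}(\gamma)}^{2}-\sum_{k=0}^{m}\mu_{k}(\sigma)^{2}/k!, so this condition can be checked from the first m+1 Hermite coefficients of \sigma together with its L^{2}(\gamma) norm.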

Appendix D Proof of Theorem 2: 𝖭𝖯\mathsf{NP}-hardness of learning with 1{\mathcal{F}}_{1} norm

We borrow some notations and terminology from [GLS12]. We consider the convex set KnK\subset\mathbb{R}^{n} defined by

K=\displaystyle K= {𝒛n:τ(𝒱),|τ|(𝒱)1 and 𝒛=𝒱ϕn(𝒘)τ(d𝒘)}\displaystyle~{}\Big{\{}{\bm{z}}\in\mathbb{R}^{n}\,:\,\exists\tau\in{\mathcal{M}}({\mathcal{V}}),|\tau|({\mathcal{V}})\leq 1\text{ and }{\bm{z}}=\int_{{\mathcal{V}}}\bm{\phi}_{n}({\bm{w}})\tau({\rm d}{\bm{w}})\Big{\}} (80)
=\displaystyle= 𝖢𝗈𝗇𝗏𝖾𝗑𝖧𝗎𝗅𝗅{ϕn(𝒘),ϕn(𝒘):𝒘𝒱}.\displaystyle~{}\mathsf{ConvexHull}\Big{\{}\bm{\phi}_{n}({\bm{w}}),-\bm{\phi}_{n}({\bm{w}}):{\bm{w}}\in{\mathcal{V}}\Big{\}}\,.

From our choice of the truncated ReLu activation, we have ϕn(𝒘)2n\|\bm{\phi}_{n}({\bm{w}})\|_{2}\leq\sqrt{n} and K𝖡(𝟎,n)K\subseteq\mathsf{B}({\bm{0}},\sqrt{n}), where we denoted 𝖡(𝒂,R)\mathsf{B}({\bm{a}},R) the ball of center 𝒂{\bm{a}} and radius RR, i.e., 𝖡(𝒂,R)={𝒛n:𝒛𝒂2R}\mathsf{B}({\bm{a}},R)=\{{\bm{z}}\in\mathbb{R}^{n}:\|{\bm{z}}-{\bm{a}}\|_{2}\leq R\}. In our reductions, we will further need to assume that there exists r>0r>0 such that 𝖡(𝟎,r)K\mathsf{B}({\bm{0}},r)\subseteq K. We will check that indeed we can choose r>0r>0 during our reduction. We will denote (K,r)(K,r) the set KK such that 𝖡(𝟎,r)K\mathsf{B}({\bm{0}},r)\subseteq K. For δ>0\delta>0, we let S(K,δ)S(K,\delta) denote the δ\delta-neighborhood of KK, i.e.,

S(K,δ):={𝒛n:𝒖𝒛2δ for some 𝒖K}.S(K,\delta):=\big{\{}{\bm{z}}\in\mathbb{R}^{n}:\|{\bm{u}}-{\bm{z}}\|_{2}\leq\delta\text{ for some }{\bm{u}}\in K\big{\}}\,.

Similarly, we will denote S(K,δ)S(K,-\delta) the interior δ\delta-ball of KK defined by

S(K,δ):={𝒛K:𝖡(𝒛,δ)K}.S(K,-\delta):=\big{\{}{\bm{z}}\in K:\mathsf{B}({\bm{z}},\delta)\subseteq K\big{\}}\,.

For convenience, we recall here the different problems of interest. The weak version of the 1{\mathcal{F}}_{1}-problem is given by

minimize |τ|(𝒱),\displaystyle|\tau|({\mathcal{V}})\,, (81)
subj. to f^(𝒙i;τ)=y^i,in,\displaystyle\hat{f}({\bm{x}}_{i};\tau)=\hat{y}_{i},\;\;\;\forall i\leq n\,,
𝒚𝒚^2ε,\displaystyle\|{\bm{y}}-\hat{\bm{y}}\|_{2}\leq\varepsilon\,,

while the intermediary optimization problem reads with the new notations

maximize 𝒚,𝒛,\displaystyle\langle{\bm{y}},{\bm{z}}\rangle\,, (82)
subj. to 𝒛K.\displaystyle{\bm{z}}\in K\,.
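Note that, since K is the convex hull of \{\pm\bm{\phi}_{n}({\bm{w}}):{\bm{w}}\in{\mathcal{V}}\}, the value of problem (82) is the support function of K at {\bm{y}}, namely \sup_{{\bm{z}}\in K}\langle{\bm{y}},{\bm{z}}\rangle=\sup_{{\bm{w}}\in{\mathcal{V}}}|\langle{\bm{y}},\bm{\phi}_{n}({\bm{w}})\rangle|.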

We will consider the following problems:

W-F1-PB(ε)\boxed{\texttt{W-F1-PB}(\varepsilon)}

: given 𝒚n{\bm{y}}\in\mathbb{Q}^{n} and γ\gamma\in\mathbb{Q}. Denote LL^{*} the value of the weak 1{\mathcal{F}}_{1}-problem (81). Either

  • (1)

    assert that Lγ+εL^{*}\leq\gamma+\varepsilon; or,

  • (2)

    assert that LγεL^{*}\geq\gamma-\varepsilon.

W-MEM(δ,K,r)\boxed{\texttt{W-MEM}(\delta,K,r)}

: given 𝝀n{\bm{\lambda}}\in\mathbb{Q}^{n}. Either

  • (1)

    assert that 𝝀S(K,δ){\bm{\lambda}}\in S(K,\delta); or,

  • (2)

    assert that 𝝀S(K,δ){\bm{\lambda}}\not\in S(K,-\delta).

W-VAL(δ,K,r)\boxed{\texttt{W-VAL}(\delta,K,r)}

: given 𝒛n,γ{\bm{z}}\in\mathbb{Q}^{n},\gamma\in\mathbb{Q}. Either

  • (1)

    assert that 𝒛,𝝀γ+δ\langle{\bm{z}},{\bm{\lambda}}\rangle\leq\gamma+\delta for all 𝝀S(K,δ){\bm{\lambda}}\in S(K,-\delta); or,

  • (2)

    assert that 𝒛,𝝀γδ\langle{\bm{z}},{\bm{\lambda}}\rangle\geq\gamma-\delta for some 𝝀S(K,δ){\bm{\lambda}}\in S(K,\delta).

HS-MA(ε)\boxed{\texttt{HS-MA}(\varepsilon)}

: distinguish the following two cases

  • (1)

    there exists a half space 𝒘,𝒙>a\langle{\bm{w}},{\bm{x}}\rangle>a such that M(𝒘,a)(n++n)(1ε)M({\bm{w}},a)\geq(n_{+}+n_{-})(1-\varepsilon); or,

  • (2)

    for any half space 𝒘,𝒙>a\langle{\bm{w}},{\bm{x}}\rangle>a, we have M(𝒘,a)(n++n)(1/2+ε)M({\bm{w}},a)\leq(n_{+}+n_{-})(1/2+\varepsilon).

W-F1-PB corresponds to a weak validity problem associated to the weak 1{\mathcal{F}}_{1}-problem (81); W-MEM is the weak membership problem associated to the convex set KK; W-VAL is the weak validity problem associated to the intermediary optimization problem (82); and HS-MA is the Maximum Agreement for Halfspaces problem.

We will use the following hardness result on HS-MA:

Theorem 3 (Theorem 8.1 in [GR09]).

For all 0<ε<1/40<\varepsilon<1/4, the problem HS-MA(ε)(\varepsilon) is 𝖭𝖯\mathsf{NP}-hard.

Let us first prove that there exists a polynomial time randomized reduction from W-VAL to W-F1-PB.

Lemma 7.

There exists an absolute constant C>0C>0 and an oracle-polynomial time randomized algorithm that solves the weak validity problem W-VAL(δ,K,r)(\delta,K,r) given a W-F1-PB(ε)(\varepsilon) oracle, where εΩ((δr/n)C)\varepsilon\geq\Omega((\delta r/n)^{C}).

Proof of Lemma 7.

Let us first show that one can use W-F1-PB(ε)(\varepsilon) to solve W-MEM(δ,K,r)\texttt{W-MEM}(\delta,K,r).

Consider 𝝀n{\bm{\lambda}}\in\mathbb{Q}^{n} and call W-F1-PB with 𝒚:=𝝀{\bm{y}}:={\bm{\lambda}}, γ:=1\gamma:=1 and ε:=δ1+2δ+4n\varepsilon:=\frac{\delta}{1+2\delta+4\sqrt{n}}. First, we know that K𝖡(𝟎,n)K\subseteq\mathsf{B}({\bm{0}},\sqrt{n}), hence if 𝝀22n\|{\bm{\lambda}}\|_{2}\geq 2\sqrt{n}, we can directly assert that 𝝀S(K,δ){\bm{\lambda}}\not\in S(K,-\delta). We assume from now on that 𝝀22n\|{\bm{\lambda}}\|_{2}\leq 2\sqrt{n}. If W-F1-PB asserts that L1+εL^{*}\leq 1+\varepsilon, then it means there exists 𝒛n{\bm{z}}\in\mathbb{R}^{n} with 𝒛𝝀2ε\|{\bm{z}}-{\bm{\lambda}}\|_{2}\leq\varepsilon and associated measure |τ|(𝒱)1+ε|\tau|({\mathcal{V}})\leq 1+\varepsilon. Consider 𝒛=𝒛/(1+ε){\bm{z}}^{\prime}={\bm{z}}/(1+\varepsilon), then 𝒛{\bm{z}}^{\prime} has associated measure |τ|(𝒱)1|\tau^{\prime}|({\mathcal{V}})\leq 1 and 𝒛𝝀2ε1+ε(𝝀2+ε)+εδ\|{\bm{z}}^{\prime}-{\bm{\lambda}}\|_{2}\leq\frac{\varepsilon}{1+\varepsilon}(\|{\bm{\lambda}}\|_{2}+\varepsilon)+\varepsilon\leq\delta with our choice of ε\varepsilon. Hence we can assert that 𝝀S(K,δ){\bm{\lambda}}\in S(K,\delta). If W-F1-PB asserts that L1εL^{*}\geq 1-\varepsilon, then it means in particular that any measure representing 𝝀{\bm{\lambda}} has total mass |τ|(𝒱)1ε|\tau|({\mathcal{V}})\geq 1-\varepsilon. Consider 𝒛=𝝀/(12ε){\bm{z}}={\bm{\lambda}}/(1-2\varepsilon), then any measure representing 𝒛{\bm{z}} has total mass |τ|(𝒱)1ε12ε>1|\tau^{\prime}|({\mathcal{V}})\geq\frac{1-\varepsilon}{1-2\varepsilon}>1, so that 𝒛K{\bm{z}}\not\in K, and 𝒛𝝀22ε12ε𝝀2δ\|{\bm{z}}-{\bm{\lambda}}\|_{2}\leq\frac{2\varepsilon}{1-2\varepsilon}\|{\bm{\lambda}}\|_{2}\leq\delta with our choice of ε\varepsilon. Hence we can assert that 𝝀S(K,δ){\bm{\lambda}}\not\in S(K,-\delta).

Now that we saw we can implement a weak membership oracle W-MEM(δ,K,r)\texttt{W-MEM}(\delta,K,r) using W-F1-PB(ε)(\varepsilon) with ε=Ω(δ/n)\varepsilon=\Omega(\delta/\sqrt{n}), we can use the results in [LSV18], for example their Theorem 21 (using the sequence of reductions MEM to SEP to OPT to VAL), which shows that there exists an absolute constant C>0C>0 and a randomized reduction from the weak validity problem W-VAL(δ,K,r)(\delta,K,r) to the weak membership problem W-MEM(δ,K,r)\texttt{W-MEM}(\delta^{\prime},K,r) with δ=Ω((δr/n)C)\delta^{\prime}=\Omega((\delta r/n)^{C}). ∎

Lemma 8.

There exists an oracle-polynomial time algorithm that solves the problem HS-MA(ε)(\varepsilon) given a weak validity W-VAL(δ,K,r)(\delta,K,r) oracle, where δ=Ω((14ε)2/n)\delta=\Omega((1-4\varepsilon)^{2}/\sqrt{n}) and r=Ω((14ε)/n)r=\Omega((1-4\varepsilon)/\sqrt{n}).

Proof of Lemma 8.

First let us show that

sup𝝀K𝟏,𝝀=sup𝒘𝒱𝟏,ϕn(𝒘).\sup_{{\bm{\lambda}}\in K}\langle{\bm{1}},{\bm{\lambda}}\rangle=\sup_{{\bm{w}}\in{\mathcal{V}}}\langle{\bm{1}},\bm{\phi}_{n}({\bm{w}})\rangle\,. (83)

Notice that with our truncated ReLu activation, ϕn(𝒘)\bm{\phi}_{n}({\bm{w}}) has all its coordinates non-negative for any 𝒘𝒱{\bm{w}}\in{\mathcal{V}}. Hence,

supτ=τ+τ,τ+(𝒱)+τ(𝒱)1𝒱𝟏,ϕn(𝒘)τ+(d𝒘)𝒱𝟏,ϕn(𝒘)τ(d𝒘)=supτ=τ+,τ+(𝒱)1𝒱𝟏,ϕn(𝒘)τ+(d𝒘),\sup_{\begin{subarray}{c}\tau=\tau_{+}-\tau_{-},\\ \tau_{+}({\mathcal{V}})+\tau_{-}({\mathcal{V}})\leq 1\end{subarray}}\int_{{\mathcal{V}}}\langle{\bm{1}},\bm{\phi}_{n}({\bm{w}})\rangle\tau_{+}({\rm d}{\bm{w}})-\int_{{\mathcal{V}}}\langle{\bm{1}},\bm{\phi}_{n}({\bm{w}})\rangle\tau_{-}({\rm d}{\bm{w}})=\sup_{\begin{subarray}{c}\tau=\tau_{+},\\ \tau_{+}({\mathcal{V}})\leq 1\end{subarray}}\int_{{\mathcal{V}}}\langle{\bm{1}},\bm{\phi}_{n}({\bm{w}})\rangle\tau_{+}({\rm d}{\bm{w}})\,,

where we denoted τ+\tau_{+} and τ\tau_{-} non-negative measures on 𝒱{\mathcal{V}}. Hence, we directly have sup𝝀K𝟏,𝝀sup𝒘𝒱𝟏,ϕn(𝒘)\sup_{{\bm{\lambda}}\in K}\langle{\bm{1}},{\bm{\lambda}}\rangle\leq\sup_{{\bm{w}}\in{\mathcal{V}}}\langle{\bm{1}},\bm{\phi}_{n}({\bm{w}})\rangle and the converse inequality is obtained by taking the supremum over τ=δ𝒘\tau=\delta_{{\bm{w}}} with 𝒘𝒱{\bm{w}}\in{\mathcal{V}}.

Let us now prove the reduction from HS-MA to W-VAL. Consider (n+,n,d)3(n_{+},n_{-},d)\in\mathbb{N}^{3} and vectors {𝒙1,,𝒙n+,𝒛1,,𝒛n}{1,1}d\{{\bm{x}}_{1},\ldots,{\bm{x}}_{n_{+}},{\bm{z}}_{1},\ldots,{\bm{z}}_{n_{-}}\}\subset\{-1,1\}^{d}. Denote n=n++nn=n_{+}+n_{-}, and 𝒙~i=(𝒙i,1)\tilde{\bm{x}}_{i}=({\bm{x}}_{i},-1) for i[n+]i\in[n_{+}] and 𝒙~n++i=(𝒛i,1)\tilde{\bm{x}}_{n_{+}+i}=(-{\bm{z}}_{i},1) for i[n]i\in[n_{-}]. Writing now (𝒘,a)({\bm{w}},a) as 𝒘d+1{\bm{w}}\in\mathbb{R}^{d+1}, we see that

M~(𝒘)=in𝟙[𝒘,𝒙~i>0].\tilde{M}({\bm{w}})=\sum_{i\leq n}\mathbbm{1}[\langle{\bm{w}},\tilde{\bm{x}}_{i}\rangle>0]\,.

Denote now ϕn(𝒘)=(ϕ(𝒙~1;𝒘),,ϕ(𝒙~n;𝒘))\bm{\phi}_{n}({\bm{w}})=(\bm{\phi}(\tilde{\bm{x}}_{1};{\bm{w}}),\ldots,\bm{\phi}(\tilde{\bm{x}}_{n};{\bm{w}})). Recall that ϕ(𝒙;𝒘)=min(max(𝒘,𝒙,0),1)\bm{\phi}({\bm{x}};{\bm{w}})=\min(\max(\langle{\bm{w}},{\bm{x}}\rangle,0),1): if we take 𝒱=d+1{\mathcal{V}}=\mathbb{R}^{d+1}, we can always rescale 𝒘{\bm{w}} by a constant large enough such that ϕn(𝒘)\bm{\phi}_{n}({\bm{w}}) only takes 0 or 11 values, and

sup𝒘d+1𝟏,ϕn(𝒘)=sup𝒘d+1M~(𝒘).\sup_{{\bm{w}}\in\mathbb{R}^{d+1}}\langle{\bm{1}},\bm{\phi}_{n}({\bm{w}})\rangle=\sup_{{\bm{w}}\in\mathbb{R}^{d+1}}\tilde{M}({\bm{w}})\,.

Using Eq. (83), we can find easily a reduction from HS-MA to the strong version of W-VAL (with δ=0\delta=0). However, we will do a slightly more complicated reduction, in order to insure that there exists r>0r>0 such that 𝖡(𝟎,r)K\mathsf{B}({\bm{0}},r)\subseteq K and we can take δ>0\delta>0.

Consider 𝒙¯i=(𝒙~i,𝒆i)d+1+n\underline{{\bm{x}}}_{i}=(\tilde{\bm{x}}_{i},{\bm{e}}_{i})\in\mathbb{R}^{d+1+n} where 𝒆i{\bm{e}}_{i} is the ii-th vector of the canonical basis of n\mathbb{R}^{n} (vector with 11 at the ii’th coordinate and 0 otherwise). Consider 𝒱=d+1×[η,η]n{\mathcal{V}}=\mathbb{R}^{d+1}\times[-\eta,\eta]^{n}. In this case, by taking 𝒘=η(𝟎,𝒆i){\bm{w}}=\eta({\bm{0}},{\bm{e}}_{i}), we have ϕn(𝒘)=η𝒆i\bm{\phi}_{n}({\bm{w}})=\eta{\bm{e}}_{i}. Hence KK contains all ±η𝒆i\pm\eta{\bm{e}}_{i} and therefore contains 𝖡(𝟎,η/n)\mathsf{B}({\bm{0}},\eta/\sqrt{n}). Denote M¯\underline{M} the MM function associated to the 𝒙¯i\underline{{\bm{x}}}_{i}. By 11-Lipschitzness of truncated ReLu, we have

sup𝒘d+1M~(𝒘)sup𝒘d+1+nM¯(𝒘)sup𝒘d+1M~(𝒘)+nη.\sup_{{\bm{w}}\in\mathbb{R}^{d+1}}\tilde{M}({\bm{w}})\leq\sup_{{\bm{w}}\in\mathbb{R}^{d+1+n}}\underline{M}({\bm{w}})\leq\sup_{{\bm{w}}\in\mathbb{R}^{d+1}}\tilde{M}({\bm{w}})+n\eta\,. (84)

Let us call W-VAL(δ,K,η/n)(\delta,K,\eta/\sqrt{n}) with γ:=34n\gamma:=\frac{3}{4}n, 𝒛:=𝟏{\bm{z}}:={\bm{1}} and δ\delta and η\eta that will be fixed sufficiently small later. Consider the case sup𝒘d+1M~(𝒘)n(1ε)\sup_{{\bm{w}}\in\mathbb{R}^{d+1}}\tilde{M}({\bm{w}})\geq n(1-\varepsilon), i.e., there exists a halfspace that makes at most nεn\varepsilon errors. In that case, we show that (for δ,η>0\delta,\eta>0 sufficiently small), the oracle must assert that 𝟏,𝝀34nδ\langle{\bm{1}},{\bm{\lambda}}\rangle\geq\frac{3}{4}n-\delta for some 𝝀S(K,δ){\bm{\lambda}}\in S(K,\delta). Let us construct 𝝀S(K,δ){\bm{\lambda}}\in S(K,-\delta) such that 𝒛,𝝀>γ+δ\langle{\bm{z}},{\bm{\lambda}}\rangle>\gamma+\delta. Denote 𝝀K{\bm{\lambda}}^{*}\in K such that 𝟏,𝝀n(1ε)\langle{\bm{1}},{\bm{\lambda}}^{*}\rangle\geq n(1-\varepsilon), and consider 𝝀s=(1s)𝝀{\bm{\lambda}}_{s}=(1-s){\bm{\lambda}}^{*}. Because KK is convex and contains 𝝀{\bm{\lambda}}^{*} and 𝖡(𝟎,η/n)\mathsf{B}({\bm{0}},\eta/\sqrt{n}), it must contain the cone with apex 𝝀{\bm{\lambda}}^{*} and base the section of 𝖡(𝟎,η/n)\mathsf{B}({\bm{0}},\eta/\sqrt{n}) perpendicular to 𝝀{\bm{\lambda}}^{*}. In particular, 𝖡(𝝀s,ηs/n)K\mathsf{B}({\bm{\lambda}}_{s},\eta s/\sqrt{n})\subseteq K (for nn sufficiently large). Furthermore, 𝝀s,𝟏=(1s)𝝀,𝟏(1s)n(1ε)\langle{\bm{\lambda}}_{s},{\bm{1}}\rangle=(1-s)\langle{\bm{\lambda}}^{*},{\bm{1}}\rangle\geq(1-s)n(1-\varepsilon). Setting s=1/4ε2(1ε)s=\frac{1/4-\varepsilon}{2(1-\varepsilon)} and δηs/n\delta\leq\eta s/\sqrt{n}, we have 𝝀sS(K,δ){\bm{\lambda}}_{s}\in S(K,-\delta) such that 𝝀s,𝟏(3/4+κ)n>3n/4+δ\langle{\bm{\lambda}}_{s},{\bm{1}}\rangle\geq(3/4+\kappa)n>3n/4+\delta for some κ>0\kappa>0.

Consider now the case sup𝒘d+1M~(𝒘)n(1/2+ε)\sup_{{\bm{w}}\in\mathbb{R}^{d+1}}\tilde{M}({\bm{w}})\leq n(1/2+\varepsilon), i.e., halfspaces classify at most n(1/2+ε)n(1/2+\varepsilon) vectors correctly. In that case, we show that (for δ,η>0\delta,\eta>0 sufficiently small), the oracle must assert that 𝟏,𝝀34n+δ\langle{\bm{1}},{\bm{\lambda}}\rangle\leq\frac{3}{4}n+\delta for all 𝝀S(K,δ){\bm{\lambda}}\in S(K,-\delta). To do so, we show that for any 𝝀S(K,δ){\bm{\lambda}}\in S(K,\delta), we must have 𝟏,𝝀<3n/4δ\langle{\bm{1}},{\bm{\lambda}}\rangle<3n/4-\delta. Consider an arbitrary 𝝀S(K,δ){\bm{\lambda}}\in S(K,\delta): there exists 𝒛K{\bm{z}}\in K such that 𝝀𝒛2δ\|{\bm{\lambda}}-{\bm{z}}\|_{2}\leq\delta. Using Eqs. (83) and (84), we have 𝟏,𝒛n(1/2+ε+η)\langle{\bm{1}},{\bm{z}}\rangle\leq n(1/2+\varepsilon+\eta). Hence 𝟏,𝝀n(1/2+ε+η)+δn\langle{\bm{1}},{\bm{\lambda}}\rangle\leq n(1/2+\varepsilon+\eta)+\delta\sqrt{n}. Taking η=14ε8>0\eta=\frac{1-4\varepsilon}{8}>0 and δ(14ε)n/16\delta\leq(1-4\varepsilon)\sqrt{n}/16, we have 𝟏,𝝀<3n/4δ\langle{\bm{1}},{\bm{\lambda}}\rangle<3n/4-\delta.

Combining the above conditions, we see that there exists an absolute constant c>0c>0 such that the W-VAL(δ,K,η/n)(\delta,K,\eta/\sqrt{n}) oracle with δ=c(14ε)2/n\delta=c(1-4\varepsilon)^{2}/\sqrt{n} and η=c(14ε)\eta=c(1-4\varepsilon) allows one to distinguish the two cases of HS-MA(ε)(\varepsilon). ∎

We are now ready to prove Theorem 2.

Proof of Theorem 2.

From Lemma 8, there is a polynomial-time reduction from HS-MA(1/10)(1/10) to W-VAL(c/n,K,c/n)(c/\sqrt{n},K,c/\sqrt{n}) for some absolute constant c>0c>0. From Lemma 7, there is a polynomial-time randomized reduction from W-VAL(c/n,K,c/n)(c/\sqrt{n},K,c/\sqrt{n}) to W-F1-PB(nC)(n^{-C}) for some absolute constant C>0C>0. Using Theorem 3 concludes the proof. ∎

Appendix E Additional properties of the minimum interpolation problem

E.1 The infinite-width primal problem for randomized features

In the case of randomized features, we wrote the infinite-width dual problem (Eq. (9)), but we did not write the infinite-width primal problem to which it corresponds. Indeed the dual problem entirely defines the predictor via Eq. (16). For the sake of completeness, we present the infinite-width primal problem here.

It is convenient to make the randomization mechanism explicit by drawing zνz\sim\nu (with (𝒵,ν)(\mathcal{Z},\nu) a probability space) independent of 𝒘{\bm{w}}, and writing ϕ(𝒙;𝒘)=ϕ(𝒙;𝒘,z)\phi({\bm{x}};{\bm{w}})=\phi({\bm{x}};{\bm{w}},z) (without loss of generality, the reader can assume zν=Unif([0,1])z\sim\nu={\rm Unif}([0,1])). Therefore, we can rewrite Eq. (14) as

F(𝝀)\displaystyle F({\bm{\lambda}}) =𝝀,𝒚𝒱×𝒵ρ(ϕn(𝒘,z),𝝀)μ(d𝒘)ν(dz)\displaystyle=~{}\langle{\bm{\lambda}},{\bm{y}}\rangle-\int_{{\mathcal{V}}\times\mathcal{Z}}\rho^{*}(\langle\bm{\phi}_{n}({\bm{w}},z),{\bm{\lambda}}\rangle)\mu({\rm d}{\bm{w}})\nu({\rm d}z)
=𝝀,𝒚supa𝒱×𝒵[a(𝒘,z)ϕn(𝒘,z),𝝀ρ(a(𝒘,z))]μ(d𝒘)ν(dz)\displaystyle=~{}\langle{\bm{\lambda}},{\bm{y}}\rangle-\sup_{a}\int_{{\mathcal{V}}\times\mathcal{Z}}\big{[}a({\bm{w}},z)\langle\bm{\phi}_{n}({\bm{w}},z),{\bm{\lambda}}\rangle-\rho(a({\bm{w}},z))\big{]}\mu({\rm d}{\bm{w}})\nu({\rm d}z)
=infa(𝝀,a),\displaystyle=~{}\inf_{a}\mathcal{L}({\bm{\lambda}},a)\,,

where a:𝒱×𝒵a:{\mathcal{V}}\times\mathcal{Z}\to{\mathbb{R}} is a general measurable function. This expression shows immediately that the primal variable is a function of both 𝒘{\bm{w}} and zz. Further, maximizing the Lagrangian (𝝀,a)\mathcal{L}({\bm{\lambda}},a) over 𝝀{\bm{\lambda}}, we obtain the following primal problem that generalizes (5) to the case of randomized features:

minimize 𝒱×𝒵ρ(a(𝒘,z))μ(d𝒘)ν(dz),\displaystyle\int_{{\mathcal{V}}\times\mathcal{Z}}\rho(a({\bm{w}},z))\mu({\rm d}{\bm{w}})\nu({\rm d}z)\,, (85)
subj. to 𝒱×𝒵a(𝒘,z)ϕ(𝒙i;𝒘,z)μ(d𝒘)ν(dz)=yi,in.\displaystyle\int_{{\mathcal{V}}\times\mathcal{Z}}a({\bm{w}},z)\phi({\bm{x}}_{i};{\bm{w}},z)\mu({\rm d}{\bm{w}})\nu({\rm d}z)=y_{i},\;\;\;\forall i\leq n\,. (86)

This is a natural limit of the finite-width primal problem (7), whereby we replace the weights aia_{i} by evaluations of the function aa, ai=a(𝒘i,zi)a_{i}=a({\bm{w}}_{i},z_{i}).
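For instance, for the quadratic penalty \rho(x)=x^{2}/2 one has \rho^{*}(y)=y^{2}/2, so that F({\bm{\lambda}})=\langle{\bm{\lambda}},{\bm{y}}\rangle-\frac{1}{2}\langle{\bm{\lambda}},{\bm{K}}_{n}{\bm{\lambda}}\rangle is maximized at {\hat{\bm{\lambda}}}={\bm{K}}_{n}^{-1}{\bm{y}}; the corresponding primal solution is a({\bm{w}},z)=\langle\bm{\phi}_{n}({\bm{w}},z),{\bm{K}}_{n}^{-1}{\bm{y}}\rangle, and we recover minimum RKHS-norm interpolation.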

Remark 7.

Note that, under assumptions FEAT1, FEAT3 (or FEAT3’ for Q1Q2<2Q_{1}\wedge Q_{2}<2) and PEN, the minimizer to problem (5) (and its generalization (85)) exists and is unique. First of all notice that problem (85) is feasible. Indeed, choosing a(𝐰,z)=ϕn(𝐰,z),𝛏a({\bm{w}},z)=\langle\bm{\phi}_{n}({\bm{w}},z),{\bm{\xi}}\rangle for 𝛏n{\bm{\xi}}\in{\mathbb{R}}^{n}, the interpolation constraint takes the form

𝑲n𝝃=𝒚,\displaystyle{\bm{K}}_{n}{\bm{\xi}}={\bm{y}}\,, (87)

which has solutions since 𝐊n=𝔼𝐰,z{ϕn(𝐰,z)ϕn(𝐰,z)𝖳}𝟎{\bm{K}}_{n}=\mathbb{E}_{{\bm{w}},z}\{\bm{\phi}_{n}({\bm{w}},z)\bm{\phi}_{n}({\bm{w}},z)^{{\mathsf{T}}}\}\succ{\bm{0}} is strictly positive definite by condition FEAT3. Let a0a_{0} be such a feasible point.

Denote by U(a):=𝒱×𝒵ρ(a(𝐰,z))μ(d𝐰)ν(dz)U(a):=\int_{{\mathcal{V}}\times\mathcal{Z}}\rho(a({\bm{w}},z))\,\mu({\rm d}{\bm{w}})\nu({\rm d}z) the cost function. Notice that, by assumption PEN, ρ(x)c|x|1+δC\rho(x)\geq c|x|^{1+\delta}-C for some constants c,δ>0c,\delta>0, and CC\in{\mathbb{R}}, and therefore U(a)U(a0)U(a)\leq U(a_{0}) only if aL1+δ(μν)M<\|a\|_{L^{1+\delta}(\mu\otimes\nu)}\leq M<\infty. It is therefore sufficient to focus on these functions aa. Further notice that the map af(𝐱i;a)a\mapsto f({\bm{x}}_{i};a) is continuous under weak convergence in L1+δ(μν)L^{1+\delta}(\mu\otimes\nu), since ϕ(𝐱i;)Lk(μν)\phi({\bm{x}}_{i};\cdot)\in L^{k}(\mu\otimes\nu) for every kk by assumption FEAT1. It follows that the set of feasible solutions of (85) satisfying aL1+δ(μν)C\|a\|_{L^{1+\delta}(\mu\otimes\nu)}\leq C is closed and bounded and hence weakly sequentially compact by Banach-Alaoglu. Finally, aU(a)a\mapsto U(a) is weakly lower semicontinuous by Fatou’s lemma, and this implies existence of minimizers.

Uniqueness follows from the fact that ρ\rho is strictly convex, which implies that aU(a)a\mapsto U(a) is also strictly convex.

E.2 Representer theorem for strictly convex and differentiable penalty

In the case of deterministic features (i.e. a(𝒘,z)=a(𝒘)a({\bm{w}},z)=a({\bm{w}}) and ϕ(𝒙;𝒘,z)=ϕ(𝒙;𝒘)\phi({\bm{x}};{\bm{w}},z)=\phi({\bm{x}};{\bm{w}})), we present here a generalization of the representer theorem to a broad class of penalties ρ\rho. Recall that in the case of ρ(x)=12|x|2\rho(x)=\frac{1}{2}|x|^{2}, the representer theorem states that the solution a:da_{*}:\mathbb{R}^{d}\to\mathbb{R} belongs to the class of functions

{𝒘𝝀,ϕn(𝒘):𝝀n}.\Big{\{}{\bm{w}}\mapsto\langle{\bm{\lambda}},\bm{\phi}_{n}({\bm{w}})\rangle:{\bm{\lambda}}\in\mathbb{R}^{n}\Big{\}}\,. (88)

In words, the solution aa_{*} is contained in the (at most) nn-dimensional linear subspace spanned by {𝒘ϕ(𝒙i;𝒘):in}\{{\bm{w}}\mapsto\phi({\bm{x}}_{i};{\bm{w}}):i\leq n\}.

The following proposition generalizes this result to a penalty ρ\rho that is strictly convex.

Proposition 7 (Representer theorem for penalty ρ\rho).

Let ρ:\rho:\mathbb{R}\to\mathbb{R} be strictly convex and differentiable, and consider the minimum complexity interpolation problem (85) in the case of deterministic features. The solution a:da_{*}:\mathbb{R}^{d}\to\mathbb{R} of Problem (85) belongs to the class of functions (in the almost sure sense)

{𝒘s(𝝀,ϕn(𝒘)):𝝀n},\Big{\{}{\bm{w}}\mapsto s(\langle{\bm{\lambda}},\bm{\phi}_{n}({\bm{w}})\rangle):{\bm{\lambda}}\in\mathbb{R}^{n}\Big{\}}\,, (89)

where s(x)=(ρ)1(x)s(x)=(\rho^{\prime})^{-1}(x).

Note that we recover the representer theorem for RKHS when ρ(x)=12|x|2\rho(x)=\frac{1}{2}|x|^{2} and s(x)=xs(x)=x. However, for general ρ\rho, the optimization cannot be reduced to a problem that depends on the data only through the kernel matrix (K(𝒙i,𝒙j))i,jn(K({\bm{x}}_{i},{\bm{x}}_{j}))_{i,j\leq n}.
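For instance, for \rho(x)=|x|^{p}/p with p>1 one has \rho^{\prime}(x)={\rm sign}(x)|x|^{p-1} and hence s(x)={\rm sign}(x)|x|^{1/(p-1)}, so that the minimizer takes the form a_{*}({\bm{w}})={\rm sign}(\langle{\bm{\lambda}},\bm{\phi}_{n}({\bm{w}})\rangle)\,|\langle{\bm{\lambda}},\bm{\phi}_{n}({\bm{w}})\rangle|^{1/(p-1)} for some {\bm{\lambda}}\in\mathbb{R}^{n}.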

Proof of Proposition 7.

We have

infa:d{𝒱ρ(a(𝒘))μ(d𝒘):𝒱a(𝒘)ϕ(𝒙i;𝒘)μ(d𝒘)=yi,in}\displaystyle~{}\inf_{a:\mathbb{R}^{d}\to\mathbb{R}}\Big{\{}\int_{{\mathcal{V}}}\rho(a({\bm{w}}))\mu({\rm d}{\bm{w}}):\int_{{\mathcal{V}}}a({\bm{w}})\phi({\bm{x}}_{i};{\bm{w}})\mu({\rm d}{\bm{w}})=y_{i},\;\;\;\forall i\leq n\Big{\}}
=(1)\displaystyle\stackrel{{\scriptstyle(1)}}{{=}} infa:dsup𝝀n{𝝀,𝒚+𝒱ρ(a(𝒘))μ(d𝒘)𝒱a(𝒘)𝝀,ϕn(𝒘)μ(d𝒘)}\displaystyle~{}\inf_{a:\mathbb{R}^{d}\to\mathbb{R}}\sup_{{\bm{\lambda}}\in\mathbb{R}^{n}}\Big{\{}\langle{\bm{\lambda}},{\bm{y}}\rangle+\int_{{\mathcal{V}}}\rho(a({\bm{w}}))\mu({\rm d}{\bm{w}})-\int_{{\mathcal{V}}}a({\bm{w}})\langle{\bm{\lambda}},\bm{\phi}_{n}({\bm{w}})\rangle\mu({\rm d}{\bm{w}})\Big{\}}
=(2)\displaystyle\stackrel{{\scriptstyle(2)}}{{=}} sup𝝀n[𝝀,𝒚+infa:d𝒱{ρ(a(𝒘))a(𝒘)𝝀,ϕn(𝒘)}μ(d𝒘)]\sup_{{\bm{\lambda}}\in\mathbb{R}^{n}}\Big{[}\langle{\bm{\lambda}},{\bm{y}}\rangle+\inf_{a:\mathbb{R}^{d}\to\mathbb{R}}\int_{{\mathcal{V}}}\big{\{}\rho(a({\bm{w}}))-a({\bm{w}})\langle{\bm{\lambda}},\bm{\phi}_{n}({\bm{w}})\rangle\big{\}}\mu({\rm d}{\bm{w}})\Big{]}
=(3)\displaystyle\stackrel{{\scriptstyle(3)}}{{=}} sup𝝀n[𝝀,𝒚𝒱ρ(𝝀,ϕn(𝒘))μ(d𝒘)],\displaystyle~{}\sup_{{\bm{\lambda}}\in\mathbb{R}^{n}}\Big{[}\langle{\bm{\lambda}},{\bm{y}}\rangle-\int_{{\mathcal{V}}}\rho^{*}(\langle{\bm{\lambda}},\bm{\phi}_{n}({\bm{w}})\rangle)\mu({\rm d}{\bm{w}})\Big{]}\,,

where we introduced the Lagrange multipliers 𝝀{\bm{\lambda}} in (1), we used strong duality in (2), and we used the definition of the convex conjugate in (3). At the optimum, we must have a(𝒘)=(ρ)(𝝀,ϕn(𝒘))=(ρ)1(𝝀,ϕn(𝒘))=s(𝝀,ϕn(𝒘))a_{*}({\bm{w}})=(\rho^{*})^{\prime}(\langle{\bm{\lambda}}_{*},\bm{\phi}_{n}({\bm{w}})\rangle)=(\rho^{\prime})^{-1}(\langle{\bm{\lambda}}_{*},\bm{\phi}_{n}({\bm{w}})\rangle)=s(\langle{\bm{\lambda}}_{*},\bm{\phi}_{n}({\bm{w}})\rangle) a.s. with respect to μ\mu. ∎

Appendix F Useful technical facts

We recall the following basic result on concentration of the empirical covariance of independent sub-Gaussian random vectors. This can be found, for instance, in [Ver18, Exercise 4.7.3] or [Ver10] (Theorem 5.39 and Remark 5.40(1)).

Lemma 9.

Let (𝐚i)iN({\bm{a}}_{i})_{i\leq N} be independent τ2\tau^{2}-subgaussian random vectors, with common covariance 𝔼{𝐚i𝐚i𝖳}=𝚺\mathbb{E}\{{\bm{a}}_{i}{\bm{a}}_{i}^{{\mathsf{T}}}\}=\bm{\Sigma}. Denote by 𝚺^:=N1i=1N𝐚i𝐚i𝖳\bm{\hat{\Sigma}}:=N^{-1}\sum_{i=1}^{N}{\bm{a}}_{i}{\bm{a}}_{i}^{{\mathsf{T}}} the empirical covariance. Then there exist absolute constants C,c>0C,c>0 such that, for all sC(n/N(n/N))s\geq C(\sqrt{n/N}\vee(n/N)), we have

(𝚺^𝚺opsτ2)exp(cN(ss2)).\displaystyle\mathbb{P}\big{(}\big{\|}\bm{\hat{\Sigma}}-\bm{\Sigma}\big{\|}_{{\rm op}}\geq s\tau^{2}\big{)}\leq\exp\big{(}-cN(s\wedge s^{2})\big{)}\,. (90)
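As a numerical illustration of this scaling (not part of any proof), the following Python sketch estimates the operator-norm error of the empirical covariance of standard Gaussian vectors (so that \tau is of order one) and compares it to \sqrt{n/N}.

import numpy as np

rng = np.random.default_rng(1)
n = 50
for N in [200, 800, 3200]:
    A = rng.standard_normal((N, n))          # rows a_i ~ N(0, I_n)
    Sigma_hat = A.T @ A / N                  # empirical covariance
    err = np.linalg.norm(Sigma_hat - np.eye(n), ord=2)
    print(N, round(err, 3), round(np.sqrt(n / N), 3))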

We also state and prove two simple lemmas about moments of sub-Gaussian random variables.

Lemma 10.

For any a>0a>0 and any δ>0\delta>0 there exists C(a,δ)C(a,\delta) such that the following holds. For any random variable XX that is τ2\tau^{2}-sub-Gaussian with 𝔼{X2}=1\mathbb{E}\{X^{2}\}=1, we have

𝔼{|X|a}C(a,δ)τ(a2+δ)+.\displaystyle\mathbb{E}\{|X|^{a}\}\leq C(a,\delta)\,\tau^{(a-2+\delta)_{+}}\,. (91)

This inequality holds for a2a\leq 2, with δ=0\delta=0 and C=1C=1.

Proof.

Without loss of generality, we can assume X0X\geq 0. For a2a\leq 2, this is just Jensen’s inequality.

For a>2a>2, by Hölder for any q[0,a]q\in[0,a] and any r>1r>1, we have

𝔼[Xa]𝔼[Xrq]1/r𝔼[Xr(aq)/(r1)](r1)/r.\displaystyle\mathbb{E}[X^{a}]\leq\mathbb{E}[X^{rq}]^{1/r}\mathbb{E}[X^{r(a-q)/(r-1)}]^{(r-1)/r}\,.

Setting q=2/rq=2/r, r2/ar\geq 2/a and using the fact that 𝔼[X2]=1\mathbb{E}[X^{2}]=1 and XX is sub-Gaussian, we get that there exist constants C0(a,r)C_{0}(a,r) finite as long as r>1r>1, such that

𝔼[Xa]𝔼[X(ra2)/(r1)](r1)/rC0(a,r)τa2/r.\displaystyle\mathbb{E}[X^{a}]\leq\mathbb{E}[X^{(ra-2)/(r-1)}]^{(r-1)/r}\leq C_{0}(a,r)\tau^{a-2/r}\,.

The claim follows by setting 2/r=2δ2/r=2-\delta or r=(1δ/2)1>1r=(1-\delta/2)^{-1}>1. ∎

Lemma 11.

For any q1,q2>0q_{1},q_{2}>0 and any δ>0\delta>0 there exists C(q1,q2,δ)C(q_{1},q_{2},\delta) such that the following holds. For any random variable XX that is τ2\tau^{2}-sub-Gaussian with 𝔼{X2}=1\mathbb{E}\{X^{2}\}=1, we have

𝔼{|X|q1|X|q2}C(q1,q2,δ)τ(2q1q2+δ)+.\displaystyle\mathbb{E}\{|X|^{q_{1}}\wedge|X|^{q_{2}}\}\geq C(q_{1},q_{2},\delta)\,\tau^{-(2-q_{1}\wedge q_{2}+\delta)_{+}}\,. (92)

This inequality holds for q1=q22q_{1}=q_{2}\geq 2, with δ=0\delta=0 and C=1C=1.

Proof.

Without loss of generality, we can assume X0X\geq 0. For q1=q22q_{1}=q_{2}\geq 2, this is just Jensen’s inequality.

For the other cases, notice that, for a1,a2[0,2]a_{1},a_{2}\in[0,2], (xa1xa2)(x2a1x2a2)=x2(x^{a_{1}}\wedge x^{a_{2}})(x^{2-a_{1}}\vee x^{2-a_{2}})=x^{2}. Hence, by Hölder inequality, for all r>1r>1:

1=𝔼[X2]𝔼[Xa1rXa2r]1/r𝔼[X(2a1)r/(r1)X(2a2)r/(r1)](r1)/r,\displaystyle 1=\mathbb{E}[X^{2}]\leq\mathbb{E}[X^{a_{1}r}\wedge X^{a_{2}r}]^{1/r}\mathbb{E}[X^{(2-a_{1})r/(r-1)}\vee X^{(2-a_{2})r/(r-1)}]^{(r-1)/r}\,,

We then set ai=qi/ra_{i}=q_{i}/r, and invert this relationship to get

𝔼[Xq1Xq2]\displaystyle\mathbb{E}[X^{q_{1}}\wedge X^{q_{2}}] 𝔼[X(2rq1)/(r1)X(2rq2)/(r1)](r1)\displaystyle\geq\mathbb{E}[X^{(2r-q_{1})/(r-1)}\vee X^{(2r-q_{2})/(r-1)}]^{-(r-1)}
(𝔼[X(2rq1)/(r1)]+𝔼[X(2rq2)/(r1)])(r1).\displaystyle\geq\Big{(}\mathbb{E}[X^{(2r-q_{1})/(r-1)}]+\mathbb{E}[X^{(2r-q_{2})/(r-1)}]\Big{)}^{-(r-1)}\,.

Taking r1(q1/2)(q2/2)r\geq 1\vee(q_{1}/2)\vee(q_{2}/2), we can apply Lemma 10, to get

𝔼[Xq1Xq2]\displaystyle\mathbb{E}[X^{q_{1}}\wedge X^{q_{2}}] C(τ(2rq1r12+δ)+(r1)+τ(2rq2r12+δ)+(r1))1\displaystyle\geq C\Big{(}\tau^{(\frac{2r-q_{1}}{r-1}-2+\delta^{\prime})_{+}(r-1)}+\tau^{(\frac{2r-q_{2}}{r-1}-2+\delta^{\prime})_{+}(r-1)}\Big{)}^{-1}
C(τ(2q1+δ)++τ(2q2+δ)+)1\displaystyle\geq C\Big{(}\tau^{(2-q_{1}+\delta)_{+}}+\tau^{(2-q_{2}+\delta)_{+}}\Big{)}^{-1}
Cτ(2q1q2+δ)+.\displaystyle\geq C\tau^{-(2-q_{1}\wedge q_{2}+\delta)_{+}}\,.

This proves the claim. ∎