Minimum complexity interpolation in random features models
Abstract
Despite their many appealing properties, kernel methods are heavily affected by the curse of dimensionality. For instance, in the case of inner product kernels in $\mathbb{R}^d$, the Reproducing Kernel Hilbert Space (RKHS) norm is often very large for functions that depend strongly on a small subset of directions (ridge functions). As a consequence, such functions are difficult to learn using kernel methods. This observation has motivated the study of generalizations of kernel methods, whereby the RKHS norm —which is equivalent to a weighted $\ell_2$ norm— is replaced by a weighted functional $\ell_p$ norm, which we refer to as the $\mathcal{F}_p$ norm. Unfortunately, tractability of these approaches is unclear. The kernel trick is not available and minimizing these norms requires solving an infinite-dimensional convex problem.
We study random features approximations to these norms and show that, for $p>1$, the number of random features required to approximate the original learning problem is upper bounded by a polynomial in the sample size. Hence, learning with $\mathcal{F}_p$ norms is tractable in these cases. We introduce a proof technique based on uniform concentration in the dual, which can be of broader interest in the study of overparametrized models. For $p=1$, our guarantees for the random features approximation break down. We prove instead that learning with the $\mathcal{F}_1$ norm is NP-hard under a randomized reduction based on the problem of learning halfspaces with noise.
1 Introduction
1.1 Background: Kernel methods and the curse of dimensionality
Kernel methods are among the most fundamental tools in machine learning. Over the last two years they attracted renewed interest because of their connection with neural networks in the linear regime (a.k.a. neural tangent kernel or lazy regime) [JGH18, LR20, LRZ19, GMMM19].
Consider a general covariates space, namely a probability space $\mathcal{X}$ (our results will concern the case $\mathcal{X}\subseteq\mathbb{R}^d$, but it is useful to start from a more general viewpoint). A reproducing kernel Hilbert space (RKHS) is usually constructed starting from a positive semidefinite kernel. Here it will be more convenient to start from a weight space, i.e. a probability space $(\mathcal{W},\mu)$, and a featurization map $\phi:\mathcal{X}\times\mathcal{W}\to\mathbb{R}$ parametrized by the weight vector $w$. (A sufficient condition for this construction to be well defined is that $\phi(x;\,\cdot\,)$ is square integrable for each $x$.) The RKHS is then defined as the space of functions of the form
$$f_a(x) \;=\; \int_{\mathcal{W}} a(w)\,\phi(x;w)\,\mu(\mathrm{d} w), \qquad (1)$$
with $a\in L^2(\mathcal{W},\mu)$. To give a concrete example, we might consider $\mathcal{X},\mathcal{W}\subseteq\mathbb{R}^d$ and $\phi(x;w)=\sigma(\langle w,x\rangle)$ for an activation function $\sigma$: this is the featurization map arising from two-layer neural networks with random first-layer weights.
The radius-$R$ ball in this space is defined as
$$\mathcal{F}_{2,R} \;:=\; \Big\{ f_a \ \text{as in Eq. (1)} \;:\; \|a\|_{L^2(\mu)} \le R \Big\}. \qquad (2)$$
This construction is equivalent to the more standard one, with associated kernel $K(x_1,x_2)=\mathbb{E}_{w\sim\mu}\big[\phi(x_1;w)\,\phi(x_2;w)\big]$. Vice versa, for any positive semidefinite kernel $K$, we can construct a corresponding featurization map $\phi$, and proceed as above.
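As a sanity check of this correspondence, the following sketch compares a Monte Carlo estimate of $\mathbb{E}_w[\phi(x_1;w)\,\phi(x_2;w)]$ with its closed form for one illustrative choice, ReLU features with standard Gaussian weights; this specific featurization and weight distribution are assumptions made only for the example.

```python
import numpy as np

def arccos_expectation(x1, x2):
    # Closed form of E_w[ReLU(<w, x1>) ReLU(<w, x2>)] for w ~ N(0, I_d):
    # ||x1|| ||x2|| / (2 pi) * (sin(theta) + (pi - theta) cos(theta)).
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    theta = np.arccos(np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0))
    return n1 * n2 / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def monte_carlo_kernel(x1, x2, num_w=200_000, seed=0):
    # Monte Carlo estimate of K(x1, x2) = E_w[phi(x1; w) phi(x2; w)].
    W = np.random.default_rng(seed).normal(size=(num_w, x1.shape[0]))
    return np.mean(np.maximum(W @ x1, 0.0) * np.maximum(W @ x2, 0.0))

x1, x2 = np.array([1.0, 0.0, 0.5]), np.array([0.3, -1.0, 0.2])
print(arccos_expectation(x1, x2), monte_carlo_kernel(x1, x2))  # the two values agree
```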
It is a basic fact that, although is infinite-dimensional, learning in can be performed efficiently [SSBD14]: it is useful to recall the reason here. Consider a supervised learning setting in which we are given samples , , with and . Given a convex loss , the RKHS-norm regularized empirical risk minimization problem reads:
(3) |
Conceptually, this problem can be solved in two steps: (i) find the minimum RKHS norm subject to the interpolation constraints $f(x_i)=v_i$, $i\le n$, and (ii) minimize, over the values $v\in\mathbb{R}^n$, the sum of this quantity and the empirical loss. Since step (ii) is convex and finite-dimensional, the critical problem is the interpolation problem:
$$\text{minimize}\quad \|f\|_{\mathcal{H}} \qquad\quad \text{subj. to}\quad f(x_i)=v_i,\;\; i\le n.$$
While this is an infinite-dimensional problem, convex duality (the ‘representer theorem’) guarantees that the solution belongs to a fixed $n$-dimensional subspace, namely $\mathrm{span}\{K(\,\cdot\,,x_1),\dots,K(\,\cdot\,,x_n)\}$.
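To make this two-step reduction concrete, here is a minimal sketch of minimum-RKHS-norm interpolation; the Gaussian kernel and the tiny ridge added for numerical stability are illustrative choices, not part of the setup above.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gaussian kernel K(x, x') = exp(-gamma * ||x - x'||^2), computed pairwise.
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def min_rkhs_norm_interpolator(X, y, gamma=1.0, ridge=1e-10):
    # By the representer theorem, the minimum-RKHS-norm interpolant lies in
    # span{K(., x_1), ..., K(., x_n)}, so it suffices to solve the n x n
    # linear system K c = y (a tiny ridge is added for numerical stability).
    K = rbf_kernel(X, X, gamma)
    c = np.linalg.solve(K + ridge * np.eye(len(y)), y)
    return lambda X_new: rbf_kernel(X_new, X, gamma) @ c

# Example: interpolate a few points and evaluate the predictor elsewhere.
X = np.random.default_rng(0).normal(size=(20, 3))
y = np.sin(X[:, 0])
f = min_rkhs_norm_interpolator(X, y)
print(np.max(np.abs(f(X) - y)))  # ~0: the interpolation constraints hold
```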
Unfortunately, kernel methods suffer from the curse of dimensionality. To give a simple example, consider , , and with a non-polynomial activation function. Consider noiseless data , with for a fixed . Then, (with denoting the RKHS norm). Correspondingly, [YS19, GMMM19] show that for any fixed , if , then any kernel method of the form (3) returns with bounded away from zero.
The curse of dimensionality suggests seeking functions of the form (1) where $a$ is sparse, in a suitable sense [Bac17]. In this paper we consider a generalization in which the RKHS ball of Eq. (2) is replaced by
$$\mathcal{F}_{p,R} \;:=\; \Big\{ f_a \ \text{as in Eq. (1)} \;:\; \|a\|_{L^p(\mu)} \le R \Big\}. \qquad (4)$$
For $p<2$ this comprises a richer function class than the original $\mathcal{F}_{2,R}$, since $\|a\|_{L^p(\mu)}\le\|a\|_{L^2(\mu)}$, and it is easy to see that the inclusion is strict. The case $p=1$ is also known as a ‘convex neural network’ [Bac17].
Although the penalty is convex, it is far from clear that the learning problem is tractable. Indeed, for $p\neq 2$ the classical representer theorem does not hold anymore and we cannot reduce the infinite-dimensional problem to solving a finite-dimensional quadratic program (see Appendix E.2 for a representer-type theorem in this setting).
By the same reduction discussed above, it is sufficient to consider the following minimum-complexity interpolation problem:
$$\text{minimize}\quad \int_{\mathcal{W}} r\big(a(w)\big)\,\mu(\mathrm{d} w) \qquad (5)$$
$$\text{subj. to}\quad \int_{\mathcal{W}} \phi(x_i;w)\,a(w)\,\mu(\mathrm{d} w) \;=\; \hat y_i,\;\; i\le n.$$
(We denote the values to be interpolated by $\hat y_i$ instead of $y_i$ because we will focus on the interpolation problem hereafter.) We will take $r$ to be a convex function minimized at $0$. We establish two main results. First, we establish tractability for a subset of strictly convex penalties which include, as special cases, the power penalties $r(t)=|t|^p/p$, $p>1$. Our approach is based on a random features approximation which we discuss next. Second, we establish NP-hardness under randomized reduction for the case $p=1$.
1.2 Random features approximation
We sample independently , , and fit a model
(6) |
We determine the coefficients by solving the interpolation problem
minimize | (7) | |||
subj. to |
Notice that this is equivalent to replacing the measure in Eq. (5) by its empirical version . Borrowing the terminology from neural network theory, we will refer to (7) as the “finite width” problem, and to the original problem (5) as the “infinite width” problem. We will denote by the solution of the finite width problem (7) and by the solution of the infinite width problem (5).
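For illustration, the finite width problem can be handed to an off-the-shelf convex solver. The sketch below assumes an $\ell_p$-type penalty on the coefficients, in which case minimizing the $\ell_p$ norm itself under the interpolation constraints yields the same minimizer; the function name, the exponent, and the scaling of the coefficients are assumptions made for the example.

```python
import numpy as np
import cvxpy as cp

def random_features_interpolator(Phi, y, p=1.5):
    """Finite-width minimum-complexity interpolation (a sketch for l_p penalties).

    Phi : (n, N) matrix of feature evaluations, Phi[i, j] = phi(x_i; w_j)
    y   : (n,)  values to interpolate
    """
    N = Phi.shape[1]
    a = cp.Variable(N)
    # Minimizing ||a||_p is equivalent to minimizing sum_j |a_j|^p (they differ
    # by the monotone map t -> t^p), so the interpolating minimizer is the same.
    problem = cp.Problem(cp.Minimize(cp.pnorm(a, p)), [Phi @ a == y])
    problem.solve()
    return a.value
```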
Our main result establishes that, under suitable assumptions on the penalty and the featurization map , the random features approach provides a good approximation of the minimum complexity interpolation problem (5). Crucially, this is achieved with a number of features that is polynomial in the sample size . In particular we show that for , , setting (and assuming )
(8) |
Hence, for , the random features approach yields a function that is a good approximation of the original problem (5).
Let us emphasize that the scaling of the number of random features in our bound is not optimal. In particular, for the RKHS case $p=2$, the above result requires many more features than the guarantee of [MMM21], which shows that far fewer features are often sufficient. We also note that the exponent in the polynomial diverges as $p\to 1$. Hence, our results do not guarantee tractability for the case $p=1$. Indeed this is a fundamental limitation: as discussed in Section 4, a bound such as Eq. (8) with a finite exponent cannot hold for $p=1$, under some standard hardness assumptions. We show instead that no polynomial time algorithm is guaranteed to achieve accuracy better than some absolute constant in Eq. (8). This hardness result is based on a randomized reduction from the problem of learning halfspaces with noise, which was proved to be NP-hard in [FGKP06, GR09].
1.3 Dual problem and its concentration properties
Our proof that the random features model is a good approximation of the infinite width model (cf. Eq. (8)) is based on a simple approach which is potentially of independent interest. We notice that, while the optimization problems (5) and (7) are overparametrized and hence difficult to control, their duals are underparametrized and can be studied using uniform convergence arguments.
Let $r^*(s) := \sup_{t\in\mathbb{R}}\{s\,t - r(t)\}$ be the convex conjugate of $r$.
Then, the dual problems of (5) and (7) are given —respectively— by the following optimization problems over $\lambda\in\mathbb{R}^n$:
$$\text{minimize}_{\lambda\in\mathbb{R}^n}\quad G(\lambda) \;:=\; \mathbb{E}_{w\sim\mu}\big[r^*\big(\langle\lambda,\phi(w)\rangle\big)\big] \;-\; \langle\lambda,\hat y\rangle, \qquad (9)$$
$$\text{minimize}_{\lambda\in\mathbb{R}^n}\quad G_N(\lambda) \;:=\; \frac{1}{N}\sum_{j=1}^N r^*\big(\langle\lambda,\phi(w_j)\rangle\big) \;-\; \langle\lambda,\hat y\rangle. \qquad (10)$$
Here we denoted by $\hat y = (\hat y_1,\dots,\hat y_n)$ the vector of responses and by $\phi(w) := (\phi(x_1;w),\dots,\phi(x_n;w))\in\mathbb{R}^n$ the evaluation of the feature map at the data points.
We will prove that, under suitable assumptions on the penalty , the optimizer of the finite-width dual (10) concentrates around the optimizer of the infinite-width dual (9). Our results hold conditionally on the realization of the data and instead exploit the randomness of the weights .
The rest of the paper is organized as follows. After briefly discussing related work in Section 2, we state our assumptions and results for strictly convex penalties in Section 3. In Section 4, we show that the problem with norm is -hard under randomized reduction. In Section 5, we describe a few examples in which we can apply our general results. The proof of our main result is outlined in Section 6, with most technical work deferred to the appendices.
2 Related work
Random features methods were first introduced as a randomized approximation to RKHS methods [RR08, BBV06]. Given a kernel , the idea of [RR08] was to replace it by a low rank approximation
$$K_N(x_1,x_2) \;:=\; \frac{1}{N}\sum_{j=1}^N \phi(x_1;w_j)\,\phi(x_2;w_j), \qquad (11)$$
where the random features $\phi(\,\cdot\,;w_j)$, with $w_j\sim_{\mathrm{iid}}\mu$, are such that $\mathbb{E}_{w\sim\mu}[\phi(x_1;w)\,\phi(x_2;w)] = K(x_1,x_2)$. Several papers prove bounds on the test error of random features methods and compare them with the corresponding kernel approach [RR09, RR17, MWW20, GMMM19, MM19, GMMM20, MMM21]. In particular, [MMM21] proves that —under certain concentration assumptions on the random features— if the number of features is sufficiently large compared to the sample size, then the random features approach has nearly the same test error as the corresponding RKHS method.
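As a concrete instance of the construction in Eq. (11), the random Fourier features of [RR08] approximate the Gaussian kernel; the sketch below is a standard implementation for that particular kernel (the kernel choice and bandwidth are illustrative, not assumptions of this paper).

```python
import numpy as np

def random_fourier_features(X, N=500, gamma=1.0, seed=0):
    # Random Fourier features for the Gaussian kernel exp(-gamma ||x - x'||^2):
    # phi(x; w, b) = sqrt(2) cos(<w, x> + b), with w ~ N(0, 2 gamma I) and
    # b ~ Unif[0, 2 pi].  Then (1/N) Phi Phi^T approximates the kernel matrix.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, N))
    b = rng.uniform(0.0, 2.0 * np.pi, size=N)
    return np.sqrt(2.0) * np.cos(X @ W + b)
```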
Notice that the idea of approximating the kernel via random features implicitly selects a specific regularization, in our notation $r(t)=t^2$. Indeed, for any other regularization, the predictor does not depend uniquely on the kernel. An alternative viewpoint regards the random features model (6) as a two-layer neural network with random first-layer weights. It was shown in [Bar98] that the generalization properties of a two-layer neural network can be controlled in terms of the sum of the absolute values of the second-layer weights. This opens the way to considering infinitely wide two-layer networks as per Eq. (1), with $a\in L^1(\mu)$. These networks represent functions in the class $\mathcal{F}_1$.
The infinite-width limit was considered in [BRV+06], which propose an incremental algorithm to fit functions with an increasing number of units. Their approach however has no global optimality guarantees. Generalization error bounds within were proved in [Bac17], which demonstrates the superiority of this approach over RKHS methods, in particular to fit ridge functions or other classes of functions that depend strongly on a low-dimensional subspace of . The same paper also develops an optimization algorithm based on a conditional-gradient approach. However, in high dimension each iteration of this algorithm requires solving a potentially hard problem.
A line of recent work shows that —in certain overparametrized training regimes— neural networks are indeed well approximated by random features models obtained by linearizing the network around its (random) initialization. A subset of references include [JGH18, LL18, DZPS18, OS19, COB19]. The implicit bias induced by gradient descent on these models is a ridge (or RKHS) regularization. However, the bias can change if either the algorithm or the model parametrization is changed [NTSS17, GLSS18].
We finally notice that several recent works study the impact of changing regularization in high-dimensional linear regression, focusing on the interpolation limit [LS20, MKL+20, CLvdG20]. None of these works addresses the main problem studied here, that is approximating the fully nonparametric model (1).
Our work is a first step towards understanding the effect of regularization in random features models. It implies that, for certain penalties , as soon as is polynomially large in , studying the random features model reduces to studying the corresponding infinite width model.
3 Convergence to the infinite width limit
3.1 Setting and dual problem
We consider a slightly more general setting than the one discussed in the introduction, whereby we allow the featurization map functions to be randomized. More precisely, conditional on the data and weight vectors, the feature values are independent random variables. Explicitly, such randomized features can be constructed by letting the featurization map depend on an additional variable, and setting (with an abuse of notation) the features to be functions of a collection of independent random variables with a common law (a probability distribution over the additional variable). Without loss of generality we can assume that the randomized features take this form. (For instance, this is the case as long as the weights take values in a Polish space, in which case the conditional probabilities exist.)
We denote by the expectation of the features with respect to this additional randomness. In what follows, we will omit to write the dependency on explicitly, unless required for clarity. The additional freedom afforded by randomized features is useful in multiple scenarios:
- We only have access to noisy measurements of the true features.
- We deliberately introduce noise in the featurization mechanism. This is known to have a regularizing effect [Bis95].
At prediction time we use the average features (our results do not change significantly if we use randomized features also at prediction time):
(12) |
The dual of the finite-width problem (7) is given by the problem (10), for which we introduce the following notations:
(13) | ||||
Notice that is now a random vector. In the case of random features, the dual of the infinite width problem (5) has to be modified with respect to (9), and takes instead the form
(14) | ||||
Note the expectation with respect to the randomness in the features (equivalently, this is the expectation with respect to the independent randomness , which is not noted explicitly). The most direct way to see that this is the correct infinite width dual is to notice that this is obtained from Eq. (13) by taking expectation with respect to the weights and the features randomness. We will further discuss the connection between (14) and the infinite width primal problem in Section E.1 below.
In order to discuss the dual optimality conditions, let $s$ be the derivative of the convex conjugate of $r$, i.e., $s:=(r^*)'$. Since $r$ is assumed to be strictly convex, $s$ exists and is continuous and non-decreasing. With these definitions, the dual optimality condition reads
(15) |
The primal solution is then given by and the resulting predictor is
In the following, with an abuse of notation, we will write for the model at dual parameter . The corresponding infinite width predictor reads
(16) |
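To illustrate why the dual viewpoint is convenient (the dual variable lives in $\mathbb{R}^n$, regardless of the number of features $N$), the following sketch minimizes a finite-width dual of the form described above for the $\ell_p$ example and recovers the primal coefficients from the optimality condition. The specific objective, the conjugate exponent, and the $1/N$ normalization are assumptions consistent with the $\ell_p$ case, not a statement of the general setting.

```python
import numpy as np
from scipy.optimize import minimize

def fit_via_dual(Phi, y, q=3.0):
    """Solve a finite-width dual of the assumed form for the l_p example.

    Phi : (n, N) feature matrix, Phi[i, j] = phi(x_i; w_j)
    q   : conjugate exponent, 1/p + 1/q = 1 (q = 3 corresponds to p = 1.5)
    """
    n, N = Phi.shape

    def dual_objective(lam):
        # Assumed dual: (1/N) sum_j r*(<lam, phi_j>) - <lam, y>, with r*(s) = |s|^q / q.
        u = Phi.T @ lam
        return np.sum(np.abs(u) ** q) / (q * N) - lam @ y

    res = minimize(dual_objective, np.zeros(n), method="L-BFGS-B")
    lam = res.x
    # Primal recovery from the stationarity condition: a_j = s(<lam, phi_j>) / N,
    # with s(u) = sign(u) |u|^(q-1); at the optimum, sum_j a_j phi(x_i; w_j) = y_i.
    u = Phi.T @ lam
    a = np.sign(u) * np.abs(u) ** (q - 1.0) / N
    return lam, a
```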
3.2 General theorem
We will show that conditional on the realization of , the distance between the infinite width interpolating model and the finite width one is small with high probability as soon as is large enough. Throughout this section, are viewed as fixed, and we assume certain conditions to hold on the distribution of the features . In Section 5 we will check that these conditions hold for a few models of interest, for typical realizations of .
We first state our assumptions on the feature distribution. We define the whitened features
(17) |
Here expectation is over the randomization in the features, and is the empirical kernel matrix. By construction, are isotropic: .
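In practice, whitened features of this kind can be computed as in the sketch below. Note that the kernel matrix is estimated here from the sampled features themselves, which only approximates the expectation used in the definition above; the variable names and the regularization are illustrative.

```python
import numpy as np

def whiten_features(Phi, eps=1e-12):
    # Phi is the (n, N) matrix of feature evaluations at the data points.
    # Estimate the kernel matrix by the empirical average over the sampled
    # features, then whiten each column: z_j = K^{-1/2} phi(w_j).
    n, N = Phi.shape
    K_hat = Phi @ Phi.T / N
    evals, evecs = np.linalg.eigh(K_hat)
    K_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, eps))) @ evecs.T
    Z = K_inv_sqrt @ Phi
    # By construction, Z @ Z.T / N is (approximately) the n x n identity,
    # i.e. the whitened features are isotropic with respect to the estimate.
    return Z
```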
We will assume the following conditions to hold for the featurization map and the penalty. Throughout we will follow the convention of denoting by Greek letters the constants that we keep track of in our calculations, and by generic constants those that we do not track (and therefore our statements depend in an unspecified way on the latter). We also recall that a random vector $z$ is $\tau$-subgaussian if $\mathbb{E}\exp(\langle v, z-\mathbb{E}z\rangle)\le \exp(\tau^2\|v\|_2^2/2)$ for all vectors $v$.
- FEAT1 (Sub-gaussianity) For some and any fixed , is -sub-Gaussian when . Further, the feature vector is -sub-Gaussian when , . Without loss of generality we assume .
- FEAT2 (Lipschitz continuity) For any , assume that is -Lipschitz with respect to and , where , .
- FEAT3 (Small ball) There exists such that
(18) for some strictly positive constants . By union bound, this is implied by the stronger condition
Without loss of generality, we will assume .
- PEN (Polynomial growth) We assume that is strictly convex and minimized at 0, so that is continuous and . Because is non-decreasing, we have that has a derivative almost everywhere. We assume there exists and such that for ,
and
(19)
If is unbounded as , we will require a stronger assumption FEAT3’ which will ensure that exists and is finite.
- FEAT3’ (Small ball) For every and every , , .
Remark 1.
Assumption PEN implies that $s$ is locally Lipschitz away from $0$. It further implies that $s$ and its derivatives are upper and lower bounded by polynomials. The most important example is provided by $\ell_p$-norms with $p>1$, in which case we set $r(t)=|t|^p/p$. It is easy to check that PEN is satisfied in this case with appropriate choices of the constants.
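For concreteness, the conjugate pair for this power penalty can be recorded explicitly (a standard computation; the normalization $r(t)=|t|^p/p$ is the one used in this remark):

```latex
\begin{align*}
  r(t) &= \frac{|t|^p}{p}, \qquad p>1, \qquad \frac{1}{p}+\frac{1}{q}=1,\\
  r^*(s) &= \sup_{t\in\mathbb{R}}\big\{\,s\,t - r(t)\,\big\} = \frac{|s|^q}{q},
  \qquad
  s(u) := (r^*)'(u) = \operatorname{sign}(u)\,|u|^{\,q-1},
\end{align*}
```

so that $s$ is continuous, non-decreasing, and of polynomial growth with exponent $q-1 = 1/(p-1)$.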
Our main theorem establishes that the infinite-dimensional problem of Eq. (5) (and its generalization to randomized featurization maps) is well-approximated by its finite random features counterpart. For this approximation to be good, it is sufficient that the number of random features scales polynomially with the sample size .
Theorem 1.
Assume for all , and let be a probability distribution supported on . Assume that conditions FEAT1, FEAT2, FEAT3, and PEN are satisfied. If , further assume that FEAT3’ holds. Then for any , there exist constants depending on the constants in those assumptions, but not on and , and additionally dependent on such that the following holds. If , , and , then with probability at least ,
(20) |
where
with , , , . Further, the bound holds with , when .
In order to interpret Theorem 1, we remark that we expect typically to be of order . In this case differs negligibly from when .
Remark 2.
The most restrictive among our assumptions are FEAT3 and FEAT3’. Both conditions imply that the infinite-width dual problem of Eq. (14) is well behaved.
In particular, condition FEAT3 ensures that the minimum eigenvalue of the rescaled Hessian is bounded away from , as long as is bounded away from and . Notice indeed that the Hessian is given by
(21) |
For any , with , we have , whence, using assumption FEAT3 and Markov’s inequality and union bound, for any two unit vectors , for some constant . We then have,
Condition FEAT3’ ensures that the largest eigenvalue of the Hessian is bounded for bounded below and above. Notice that, from Eq. (21), no such assumption is required when is bounded as .
Remark 3.
As mentioned in the introduction, the bound in Eq. (20) is not optimal. For instance, for the case of a penalty that is strongly convex and smooth (covered in Theorem 1 by taking ) this bound implies that random features are sufficient to approximate the infinite width problem. However, a more careful analysis yields that —in this case— random features are sufficient. This is also what is established in [MMM21] for the case of kernel ridge regression (corresponding to ).
It is instructive to specialize Theorem 1 to the case of $\ell_p$-norms, i.e., $r(t)=|t|^p/p$, which is covered by an appropriate choice of the constants in assumption PEN. In this case $s(u)=\mathrm{sign}(u)\,|u|^{q-1}$, with $q=p/(p-1)$, and hence the formulas are simpler.
Corollary 1.
Assume for all , and let be a probability distribution supported on . Assume that conditions FEAT1, FEAT2, FEAT3 are satisfied, and penalty , . Then there exist constants depending on the constants in those assumptions, such that the following holds. If and , then with probability at least ,
(22) |
where .
Note that the exponent on the right hand side of Eq. (22) diverges as $p\to 1$. Rather than being an artifact of our proof technique, this is unavoidable: we show in the next section (Section 4) that no such bound can hold for $p=1$ under standard hardness assumptions.
Remark 4.
In the case of randomized features, we wrote the infinite-width dual problem (Eq. (9)), but we did not write the infinite-width primal problem to which it corresponds. Indeed the dual problem entirely defines the predictor via Eq. (16). For the sake of completeness, we present the infinite-width primal problem in Appendix E.1.
4 NP-hardness of learning with the $\mathcal{F}_1$ norm
The fact that, when , we are unable to solve efficiently the infinite width problem (5) via the random features approach is not surprising. Indeed, consider the featurization map (ReLu activation), and random weights . Consider solving either the infinite-dimensional problem (5) or its random features approximation (7) with data , (where and are fixed).
In [GMMM19], it was shown that for any , if , then has test error bounded away from zero, namely for any sample size . Indeed this lower bound holds for any function that can be written as in Eq. (6), for some coefficients . On the other hand, [Bac17] proves that minimizing the empirical risk subject to , cf. Eq. (4), for a suitable choice of , achieves test error . The last result does not apply directly to min -norm interpolator. However, if an analogous result was established for interpolators, it would imply that remains bounded away from zero in this case as long as .
In order to provide stronger evidence towards hardness, we consider the computational complexity of the interpolation problem
$$\text{minimize}\quad \int_{\mathcal{W}} |a(w)|\,\mu(\mathrm{d} w) \qquad (23)$$
$$\text{subj. to}\quad \int_{\mathcal{W}} \phi(x_i;w)\,a(w)\,\mu(\mathrm{d} w) \;=\; \hat y_i,\;\; i\le n.$$
We show that it is NP-hard under a randomized reduction to solve (23) within an accuracy given by a fixed absolute constant. On the contrary, for $p>1$, Corollary 1 and its proof show that one can obtain any fixed accuracy in polynomial time using a random features approximation.
It will be convenient to consider a relaxation of problem (23) by minimizing over the set of signed measures on $\mathcal{W}$, which we will denote $\mathcal{M}(\mathcal{W})$. For any $\rho\in\mathcal{M}(\mathcal{W})$, let $f_\rho(x) := \int_{\mathcal{W}} \phi(x;w)\,\rho(\mathrm{d} w)$. We have
$$\text{minimize}\quad \|\rho\|_{\mathrm{TV}} \qquad (24)$$
$$\text{subj. to}\quad f_\rho(x_i) \;=\; \hat y_i,\;\; i\le n,$$
where the minimization is now over all $\rho\in\mathcal{M}(\mathcal{W})$ and $\|\rho\|_{\mathrm{TV}}$ denotes the total variation of $\rho$. (Recall that by the Hahn decomposition, there exist two non-negative measures $\rho_+$ and $\rho_-$ such that $\rho=\rho_+-\rho_-$. The total variation is equal to $\|\rho\|_{\mathrm{TV}}=\rho_+(\mathcal{W})+\rho_-(\mathcal{W})$.) If we assume that the signed measure has a density with respect to the fixed probability measure $\mu$, i.e., $\rho(\mathrm{d}w)=a(w)\,\mu(\mathrm{d}w)$, then the total variation is equal to $\int|a(w)|\,\mu(\mathrm{d}w)$ and we recover the original problem (23). Note that if $\mu$ has full support on $\mathcal{W}$, then the infima in the two problems are the same (all measures can be written as limits of measures with densities). However the infimum in problem (23) is not attained in general, while the infimum in (24) is always achieved for $\mathcal{W}$ compact (barring degenerate cases in which it is not feasible). Technically this happens because the set of integrable functions with bounded $L^1(\mu)$ norm is not compact in the weak-* topology, while the set of signed measures with bounded total variation is. (The optimum can be singular with respect to $\mu$, e.g. a sum of Dirac delta functions.)
Remark 5.
For any $\rho$ that satisfies the equality constraints of problem (24), the vector $(f_\rho(x_1),\dots,f_\rho(x_n))=\hat y$ lies in the convex hull of a suitable set of rescaled feature vectors. Hence, by Carathéodory's theorem, there exist at most $n+1$ weights that suffice to represent it. In particular, if $\mathcal{W}$ is compact, the minimum in problem (24) is always attained by a measure that is supported on at most $n+1$ points.
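To see where the computational difficulty lies, note that once a finite set of candidate weights is fixed, minimizing the total variation of an atomic measure under the interpolation constraints is basis pursuit, a linear program; the sketch below (with assumed inputs: the feature matrix evaluated at the candidate weights and the target values) makes this explicit. The hard part of the $\mathcal{F}_1$-problem is the search over the atoms themselves, not this finite-dimensional step.

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_interpolation(Phi, y):
    """Basis pursuit over a fixed finite set of candidate weights (a sketch).

    Solves  min ||a||_1  s.t.  Phi a = y  by splitting a = a_plus - a_minus
    with a_plus, a_minus >= 0, which turns the problem into a standard LP.
    """
    n, N = Phi.shape
    c = np.ones(2 * N)                    # objective: sum(a_plus) + sum(a_minus)
    A_eq = np.hstack([Phi, -Phi])         # Phi (a_plus - a_minus) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:N] - res.x[N:]
```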
Let us call problem (24) the $\mathcal{F}_1$-problem. In order to study the computational complexity of solving this problem, we introduce a weak version of the $\mathcal{F}_1$-problem, where we allow an error on the equality constraint:
minimize | (25) | |||
subj. to | ||||
where the minimization is now over $\rho\in\mathcal{M}(\mathcal{W})$, and an error is allowed on the equality constraints. For concreteness we will consider the truncated ReLU activation $\sigma(t)=\min(\max(t,0),1)$, but we believe it is possible to generalize our proofs to other activations at the cost of additional technical work. We further restrict $\mathcal{W}$ to be a rectangle in $\mathbb{R}^d$, possibly with some infinite sides.
Denote the set of rational numbers. We consider the following problem W-F1-PB which depends on a rational number :
-
: Given and . Denote the value of the weak -problem (25) with error on the constraints. Either
-
(1)
Assert that ; or,
-
(2)
Assert that .
We can think about W-F1-PB as the weak validity problem associated to the -problem (24). In particular, if we are able to solve the -problem within an additive error of the optimum and with at most -error on the equality constraints, we can solve W-F1-PB.
We show in the following theorem that W-F1-PB is hard to solve under the standard assumption that BPP (the bounded-error probabilistic polynomial time class) does not contain NP:
Theorem 2.
Let the activation function be the truncated ReLU $\sigma(t)=\min(\max(t,0),1)$, and let $\mathcal{W}$ be a rectangle in $\mathbb{R}^d$ (possibly with some infinite sides). Assuming $\mathrm{NP}\not\subseteq\mathrm{BPP}$, there exists an absolute constant such that the problem W-F1-PB is NP-hard.
By equality of the infimum of problems (23) and (24), Theorem 2 also implies the hardness of the original problem. The proof of Theorem 2 relies on a polynomial time randomized reduction from an -hard problem, the Maximum Agreement for Halfspaces problem. If we only assume , our reductions can be made deterministic using results from [GLS12]. However this deterministic reduction only rules out precision that are exponential in the number of bits, in particular it does not rule out precision .
We denote below by HS-MA the Maximum Agreement for Halfspaces problem. It was shown to be NP-hard in [FGKP06] and [GR09]. We will follow the notation of [GR09]. Consider a set of data points with $\pm 1$ labels. Denote
(26) |
The HS-MA problem depends on a rational number (we slightly simplified the statement in [GR09]):
-
: Distinguish the following two cases
-
(1)
There exists a half space such that ; or,
-
(2)
For any half space , we have .
[GR09] showed that, for all values of this parameter, the problem HS-MA is NP-hard.
Below we briefly describe the main ideas of the proof of Theorem 2 and we defer the details to Appendix D. In order to reduce HS-MA to W-F1-PB, we use the following intermediate problem:
maximize | (27) | |||
subj. to | ||||
We denote by W-VAL the weak validity problem associated to problem (27). First notice that we can equivalently rewrite the constraint set as a convex set, and it is easy to see that W-F1-PB can be used as a weak membership oracle for this set. [LSV18] shows that there exists a polynomial-time randomized algorithm that solves W-VAL given a weak membership oracle such as W-F1-PB, for some constant accuracy parameter. Hence there is a randomized reduction from W-VAL to W-F1-PB. Secondly, the problem (27) for the vector of all ones has the same value at optimum as the problem
maximize | (28) | |||
subj. to |
It is easy to see that we can construct data points and weights such that the problem (28) coincides with the quantity defined in Eq. (26) at the optimum. Hence, there is an easy deterministic reduction from HS-MA to W-VAL.
In summary, we used the following two reductions
$$\textsf{HS-MA} \;\le_D\; \textsf{W-VAL} \;\le_R\; \textsf{W-F1-PB}, \qquad (29)$$
where $A \le_R B$ (resp. $A \le_D B$) means that there exists a polynomial time randomized (resp. deterministic) reduction from $A$ to $B$.
5 Examples
5.1 A numerical illustration
In our first example and where is an activation function with (this assumption is only to simplify some of the formulas below and can be removed). In this case the featurization map is deterministic, i.e., .
Figure 1: Average test error of minimum complexity interpolation on synthetic data, as a function of the number of random features N for fixed sample size (left), and as a function of the sample size n (right).
Before checking the assumptions of our main theorem in this context, we present a numerical illustration in Figure 1. We generate synthetic data with , and where is a fixed unit vector, , and is the ReLU activation. As mentioned above, we use the featurization map with weights .
We fix and solve the minimum complexity interpolation problem (5), using . We first fix the sample size and report the average test error as a function of the number of features (left plot) for several values of . Notice that in the case , the limit corresponds to kernel ridge regression, and hence is directly accessible. We then consider for each a value of that is large enough to obtain a rough approximation of the infinite width limit, and plot the test error as a function of the sample size (right plot).
A few remarks are in order:
1. For $p>1$, the test error appears to settle on a limiting value after $N$ becomes large enough.
2. The required number of random features appears to increase as $p$ decreases. For $p=1$, we are not able to reach the limit with practical values of $N$.
3. As $p$ decreases, the test error achieved by minimum complexity interpolation decreases.
Notice that points 1 and 2 are consistent with our main result, Theorem 1. Point 3 is consistent with the notion that the class $\mathcal{F}_1$ better captures functions that depend strongly on low-dimensional projections of the covariate vectors.
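For reference, a simplified version of this experiment can be reproduced as follows. The data-generating process (Gaussian covariates, a ReLU ridge-function target, ReLU features with Gaussian first-layer weights) and all parameter values are assumptions chosen for illustration and need not match the figures exactly.

```python
import numpy as np
import cvxpy as cp

def min_lp_interpolation(Phi, y, p):
    # Minimum l_p-norm interpolation over the random features coefficients.
    a = cp.Variable(Phi.shape[1])
    cp.Problem(cp.Minimize(cp.pnorm(a, p)), [Phi @ a == y]).solve()
    return a.value

def test_error(n=100, d=20, N=400, p=1.5, seed=0):
    rng = np.random.default_rng(seed)
    relu = lambda t: np.maximum(t, 0.0)
    v = np.zeros(d); v[0] = 1.0                     # fixed unit vector
    X = rng.normal(size=(n, d))                     # training covariates
    y = relu(X @ v)                                 # noiseless ridge-function target
    W = rng.normal(size=(N, d)) / np.sqrt(d)        # random first-layer weights
    a = min_lp_interpolation(relu(X @ W.T), y, p)   # fit the interpolator
    Xt = rng.normal(size=(5000, d))                 # fresh test covariates
    return np.mean((relu(Xt @ W.T) @ a - relu(Xt @ v)) ** 2)

if __name__ == "__main__":
    for p in (2.0, 1.5, 1.2):
        print(p, test_error(p=p))
```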
5.2 Non-linear random features model
We next check the assumptions of Theorem 1 for the case of non-linear random features model .
Proposition 1.
Assume that is -Lipschitz and , for all , and that the random weights are mean 0 and satisfy the transportation cost inequality
(30) |
where $W_1$ denotes the Wasserstein distance and $D_{\mathrm{KL}}$ the relative entropy (Kullback-Leibler divergence). Then, FEAT1 and FEAT2 are satisfied with constants and , where
(31) |
Proof of Proposition 1.
Let us begin with condition FEAT1. Notice that for any ,
Hence is -Lipschitz. By assumption (30) and Bobkov-Götze theorem [BG99], is -sub-Gaussian with respect to , for any fixed . Further .
Similarly, for any , ,
We deduce that is -sub-Gaussian with respect to , and .
Next consider condition FEAT2. We have that is Lipschitz with respect to with Lipschitz constant . Using that is -Lipschitz with respect to , we have
and therefore there exists an absolute constant that only depend on and such that
∎
To make Proposition 1 more concrete, let us make the following remarks:
- (a)
-
(b)
If is a -sub-Gaussian random vector, then by Lemma 9, there exists constants depending only on , such that with probability at least .
Proposition 1 only focused on FEAT1 and FEAT2. The last two assumptions FEAT3 and FEAT3’ are more difficult to check. The next proposition provides a class of activation functions for which FEAT3 is satisfied.
Proposition 2.
Assume with . Let be the -th Hermite coefficient of the activation function (where , and the normalization is assumed).
Consider and for some constant . If is bounded away from zero and for all large enough, then FEAT 3 is satisfied with high probability with and nonrandom independent of .
A more general version of this result is proved in Appendix C for FEAT3. While we expect FEAT3 and FEAT3’ to hold more generally, we do not have a more general proof at the moment. We can bypass this difficulty below by considering noisy features.
Proposition 3.
Consider the same setting as in Proposition 1. Define and where . Then FEAT1, FEAT2, FEAT3 and FEAT3’ are satisfied with constants , and
(32) | ||||
(33) |
Proof of Proposition 3.
First notice that , where we denoted the matrix . FEAT1 and FEAT2 are verified in Proposition 1 with the difference that for , ,
where , and therefore is a -sub-Gaussian random vector with
Let us now check FEAT3. Define . Recalling that we have with independent of , we have
This shows that FEAT3 is satisfied with the stated value of .
Finally, let us check FEAT3’. Consider . Letting , we have
where in the last inequality we used the fact that . FEAT3’ is satisfied with . ∎
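A minimal sketch of this noisy featurization, with an assumed noise level `tau`: independent Gaussian noise is added to every feature evaluation, which is the mechanism used above to enforce the small-ball conditions.

```python
import numpy as np

def noisy_features(Phi, tau, seed=0):
    # Randomized featurization: phi(x; w, u) = phi(x; w) + tau * u with u ~ N(0, 1),
    # drawn independently for every (data point, feature) pair. The added noise
    # makes the small-ball conditions FEAT3 / FEAT3' easy to verify (cf. Prop. 3).
    rng = np.random.default_rng(seed)
    return Phi + tau * rng.normal(size=Phi.shape)
```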
Corollary 2.
Further assume to be independent centered isotropic, with almost surely. Then there exist constants depending uniquely on (but not on or the distribution of the data ), such that conditions FEAT1, FEAT2, FEAT3 and FEAT3’ are satisfied with probability at least for the constants
(34) |
In particular, consider the case , . If are independent , then with probability at least , we have for ,
(35) |
Proof of Corollary 2.
5.3 The latent linear model
This section provides a simple example in which the estimates derived above can be strengthened. We consider the following model
which we will refer to as the ‘latent linear model.’
While the latent linear model is extremely simple, it was shown in some settings to have the same asymptotic behavior as the noiseless nonlinear random features model [MM19, HL20]. For instance, consider the case , . Then is approximately . Decompose , where is orthogonal to linear and constant functions in , with the standard normal measure. Then [MM19] shows that in the proportional asymptotics , ridge regression in the nonlinear random features model behaves as ridge regression with the latent linear model , with independent of . Here we study the latent linear model from a different perspective and in a broader context than [MM19, GLK+20, HL20].
We make the following assumptions:
- A1 (Covariates distribution) The covariates are -subgaussian random vectors with zero mean and second moment and bounded support .
- A2 (Features distribution) The features follow a multivariate Gaussian distribution with zero mean and covariance . Furthermore, assume that .
- A3 (Features noise distribution) We assume that .
Proposition 4.
Assume that conditions A1, A2 and A3 hold. Then there exists such that for , the conditions FEAT1, FEAT2, FEAT3 and FEAT3’ hold with probability at least with constants depending only on the constants in A1, A2 and A3. In particular, can be taken to be independent of .
Proof of Proposition 4.
We begin by noticing that .
Consider condition FEAT1. We have is mean 0. By A1 and A2, is -subgaussian. Furthermore, for any fixed , is by construction isotropic, and Gaussian (since and are). Therefore, it is -subgaussian and mean 0.
FEAT2 is easily verified with and .
For condition FEAT3, note that for any unit vector . Therefore we have , whence we can take .
Finally, for FEAT3’, for . ∎
We finally notice that, for the latent linear model, we can improve Theorem 1. For the latent linear model, we can use the constraint that the (randomized) predictor interpolates the data, to improve the bound (20) by a factor . We expect this insight to generalize to the non-linear setting.
Proposition 5.
Assume conditions A1, A2 and A3 hold. There exists constants depending only on the constants in those assumptions such that if and , then with probability at least ,
(36) |
where we denoted .
6 Proof of Theorem 1: Convergence to the population predictor
The proof of Theorem 1 is structured as follows. We first define three events on which the finite-width dual objective and the random features predictor satisfy certain concentration properties. Lemma 1 shows that the simultaneous occurrence of these events implies a bound on . We then verify that these events occur with high-probability.
Throughout the proofs, we will use the notation for and a positive semidefinite matrix. We also use the standard big-Oh and little-o notations whereby the subscript is used to indicate the asymptotic variable. For instance, we write if as .
The three events mentioned above are defined as follows:
1. (Uniform concentration of the predictor) Event is the event that for all
2. (Concentration of dual gradient at ) Event is the event that
3. (Uniform lower bound on dual curvature) Event is the event that for all
The first event corresponds to the random features predictor approximating the infinite-width predictor uniformly well over in a region around . Events and relate to local properties of the finite-width dual objective around : on , the gradient of the dual objective concentrates around the gradient of the infinite width dual objective ; on , the Hessian of the dual objective has maximum eigenvalue uniformly bounded away from for in a region around .
Because these three events involve concentration of, or bounds on, empirical means over a sample of features, it is perhaps not surprising that the preceding bounds can be established with high probability, for appropriately chosen parameters, when $N$ is sufficiently large compared to $n$.
If the infinite-width predictor satisfies a certain continuity property in , then the events , , imply a bound on .
Lemma 1.
Assume that for all ,
(37) |
for some . If , then on events , we have
The continuity property for used in Lemma 1 is much easier to establish than the corresponding continuity property of . In the setting of Theorem 1, we will show that we can choose and such that the above events hold with high probability. Then, we will have eventually, and Lemma 1 can be applied.
Proof of Lemma 1.
Consider with . By Taylor’s theorem, there exists on the line segment between and , such that
When both and occur and using , we get
Because for all with , and is strictly convex at by we conclude
In this case, by Eq. (37),
If event occurs, then
Combining the previous displays with the triangle inequality completes the proof. ∎
We next state three lemmas implying that events , , hold with high probability, as well as the continuity of in the last lemma. Proofs of these lemmas are deferred to the appendices. We begin with the continuity property of the infinite width predictor.
Lemma 2.
If either assumptions FEAT1 and PEN hold with , or assumptions FEAT1, FEAT3’ and PEN hold with , then there exists depending only on the constants in FEAT3’ and PEN, but not on , such that for all ,
(38) |
where , .
Next we state a lemma to check condition .
Lemma 3.
Assume FEAT1, FEAT2 and PEN hold. Then there exist depending only on the constants in those assumptions, but not on , such that for , we have with probability at least ,
(39) | ||||
where and .
The next lemma allows us to check event .
Lemma 4.
Assume FEAT1 and PEN hold. There exist depending only on the constants in those assumptions, but not on , such that for , we have with probability at least ,
(40) | ||||
where and .
We finally state a lemma to check event .
Lemma 5.
Assume that FEAT1, FEAT3 and PEN hold, and further assume that for some absolute constant . Define . There exist depending only on the constants in the assumptions, but not on , such that for , we have with probability at least ,
(41) |
Using the above lemmas, we are now in position to prove our main result, Theorem 1.
Proof of Theorem 1.
Recall we define , , . Assume as in the statement. By Lemmas 3, 4, 5 events , , hold with probability
with constants
Further, by Lemma 2, the continuity property of Eq. (37) holds with
We can now apply Lemma 1. Since and, without loss of generality, , we obtain that, with probability at least ,
(42) | |||
(43) |
where we verify that by the assumption that .
Let us bound . We have by the optimality condition . Therefore, denoting and using PEN,
We next notice that, for any , and any - sub-Gaussian random variable with , . This basic fact is proved in Appendix F. We apply this inequality to to get
Using this bound together with Eqs. (42), (43) yields the claim of the theorem. ∎
Acknowledgements
T.M. thanks Enric Boix-Adsera for helpful discussions about hardness results. This work was supported by NSF through award DMS-2031883 and from the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning. We also acknowledge NSF grants CCF-2006489, IIS-1741162 and the ONR grant N00014-18-1-2729. M.C. was supported by the National Science Foundation Graduate Research Fellowship under grant DGE-1656518.
References
- [Bac17] Francis Bach, Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research 18 (2017), no. 1, 629–681.
- [Bar98] Peter L Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE transactions on Information Theory 44 (1998), no. 2, 525–536.
- [BBV06] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala, Kernels as features: On kernels, margins, and low-dimensional mappings, Machine Learning 65 (2006), no. 1, 79–94.
- [BG99] Sergej G Bobkov and Friedrich Götze, Exponential integrability and transportation cost related to logarithmic sobolev inequalities, Journal of Functional Analysis 163 (1999), no. 1, 1–28.
- [Bis95] Chris M Bishop, Training with noise is equivalent to Tikhonov regularization, Neural computation 7 (1995), no. 1, 108–116.
- [BLM13] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence, OUP Oxford, 2013.
- [BRV+06] Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte, Convex neural networks, Advances in neural information processing systems, 2006, pp. 123–130.
- [CLvdG20] Geoffrey Chinot, Matthias Löffler, and Sara van de Geer, Minimum norm interpolation via basis pursuit is robust to errors, arXiv:2012.00807 (2020).
- [COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach, On lazy training in differentiable programming, Advances in Neural Information Processing Systems, 2019, pp. 2937–2947.
- [CW01] Anthony Carbery and James Wright, Distributional and norm inequalities for polynomials over convex bodies in , Mathematical research letters 8 (2001), no. 3, 233–248.
- [DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, arXiv:1810.02054 (2018).
- [FGKP06] Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and Ashok Kumar Ponnuswami, New results for learning noisy parities and halfspaces, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), IEEE, 2006, pp. 563–574.
- [GLK+20] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová, Generalisation error in learning with random features and the hidden manifold model, arXiv:2002.09339 (2020).
- [GLS12] Martin Grötschel, László Lovász, and Alexander Schrijver, Geometric algorithms and combinatorial optimization, vol. 2, Springer Science & Business Media, 2012.
- [GLSS18] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro, Characterizing implicit bias in terms of optimization geometry, Proceedings of Machine Learning Research, vol. 80, PMLR, 2018, pp. 1832–1841.
- [GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Linearized two-layers neural networks in high dimension, Annals of Statistics (2019), arXiv:1904.12191.
- [GMMM20] , When do neural networks outperform kernel methods?, Advances in Neural Information Processing Systems 33 (2020).
- [GR09] Venkatesan Guruswami and Prasad Raghavendra, Hardness of learning halfspaces with noise, SIAM Journal on Computing 39 (2009), no. 2, 742–765.
- [HKZ12] Daniel Hsu, Sham Kakade, and Tong Zhang, A tail inequality for quadratic forms of subgaussian random vectors, Electron. Commun. Probab. 17 (2012), 6 pp.
- [HL20] Hong Hu and Yue M Lu, Universality laws for high-dimensional learning with random features, arXiv:2009.07669 (2020).
- [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Advances in neural information processing systems, 2018, pp. 8571–8580.
- [LL18] Yuanzhi Li and Yingyu Liang, Learning overparameterized neural networks via stochastic gradient descent on structured data, Advances in Neural Information Processing Systems, 2018, pp. 8157–8166.
- [LR20] Tengyuan Liang and Alexander Rakhlin, Just interpolate: Kernel “ridgeless” regression can generalize, Annals of Statistics 48 (2020), no. 3, 1329–1347.
- [LRZ19] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai, On the risk of minimum-norm interpolants and restricted lower isometry of kernels, arXiv:1908.10292 (2019).
- [LS20] Tengyuan Liang and Pragya Sur, A precise high-dimensional asymptotic theory for boosting and min-l1-norm interpolated classifiers, arXiv:2002.01586 (2020).
- [LSV18] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala, Efficient convex optimization with membership oracles, Conference On Learning Theory, PMLR, 2018, pp. 1292–1294.
- [MKL+20] Francesca Mignacco, Florent Krzakala, Yue Lu, Pierfrancesco Urbani, and Lenka Zdeborova, The role of regularization in classification of high-dimensional noisy gaussian mixture, International Conference on Machine Learning, PMLR, 2020, pp. 6874–6883.
- [MM19] Song Mei and Andrea Montanari, The generalization error of random features regression: Precise asymptotics and double descent curve, arXiv:1908.05355 (2019).
- [MMM21] Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration, arXiv:2101.10588 (2021).
- [MWW20] Chao Ma, Stephan Wojtowytsch, and Lei Wu, Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t, arXiv:2009.10713 (2020).
- [NTSS17] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro, Geometry of optimization and implicit regularization in deep learning, arXiv:1705.03071 (2017).
- [OS19] Samet Oymak and Mahdi Soltanolkotabi, Towards moderate overparameterization: global convergence guarantees for training shallow neural networks, arXiv:1902.04674 (2019).
- [RR08] Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines, Advances in neural information processing systems, 2008, pp. 1177–1184.
- [RR09] , Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning, Advances in neural information processing systems, 2009, pp. 1313–1320.
- [RR17] Alessandro Rudi and Lorenzo Rosasco, Generalization properties of learning with random features, Advances in Neural Information Processing Systems, 2017, pp. 3215–3225.
- [SSBD14] Shai Shalev-Shwartz and Shai Ben-David, Understanding machine learning: From theory to algorithms, Cambridge university press, 2014.
- [Ver10] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv:1011.3027 (2010).
- [Ver18] , High-dimensional probability: An introduction with applications in data science, vol. 47, Cambridge university press, 2018.
- [vH14] Ramon van Handel, Probability in high dimension, Tech. report, Princeton University, 2014.
- [YS19] Gilad Yehudai and Ohad Shamir, On the power and limitations of random features for understanding neural networks, arXiv:1904.00687 (2019).
Appendix A Verifying the continuity and concentration conditions
A.1 Continuity property of the infinite width problem: Proof of Lemma 2
By the fundamental theorem of calculus,
(44) |
where . We will show that
(45) |
(i.e., we may exchange integration and differentiation), and we will bound the right-hand side. In what follows, we will use the shorthand for the evaluation of the featurization map at the datapoints.
For and such that , we have by Hölder’s inequality,
(46) | ||||
Denote . If , we set , , and . In this case, one can check that and . Otherwise, we set and . By FEAT1,
where depends only on . Furthermore, by either (i) PEN and FEAT1 in the case or (ii) PEN, FEAT 1, and FEAT3’ in the case , we have
where we denoted , the constant changes in the last line, and we used that and that either (i) FEAT1 and in the case that or (ii) FEAT1, FEAT3’, and in the case that . We see that the expectation is bounded for some and all , whence is uniformly integrable. We conclude that we may exchange differentiation and expectation, justifying Eq. (45).
A.2 Uniform concentration of the predictor: Proof of Lemma 3
Throughout the proof, we will denote by constants that depend on the constants in assumptions FEAT1, FEAT2 and PEN, but not on and . The value of these constants is allowed to change from line to line.
Step 1. Decoupling.
Let be convex with for and satisfy for some ,
(49) |
Below denote by the expectation with respect to the ’s and the randomization in (for notational simplicity, we will omit below the subscripts from the expectations). Further, we will denote for the evaluation of the feature map at the data points, and for the corresponding isotropic random vector. Then
where (1) uses that is nondecreasing; (2) uses that is convex and Jensen’s inequality; in (3), we denoted independent Rademacher random variables and used that the distribution is symmetric and for ; (4) uses Eq. (49) and that is a symmetric random variable and is a nondecreasing function.
Step 2. Concentration of .
We bound the tail of for fixed with , . Denote by the terms in the sum in , i.e., . By Eq. (19), there exists such that , where (for simplicity we have loosened the bound for when ). By Hölder inequality, we have for any (potentially non-integer)
(50) | ||||
where depends only on , and we used FEAT1, i.e., and are -sub-Gaussian.
For , we have directly by Eq. (50) and [BLM13, Theorem 2.10],
(51) |
For , define for the truncated random variable . Then setting , we have
(52) | ||||
We deduce again by [BLM13, Theorem 2.10] with Eq. (52), that
Furthermore, using the bound (50), we have
Combining the above two displays and taking , we conclude that for and fixed , , we have
(53) | ||||
Step 3. Uniform concentration of on , .
We now evaluate the concentration of uniformly over , . By the tail bound on in FEAT2 and that is a -sub-Gaussian random vector by FEAT1, there exists constants independent of , such that for all [HKZ12],
(54) |
Let and . Then, by assumption PEN, for any , we have
Let . On the event (54) and for , we have . Furthermore, by FEAT2, we have . For convenience, let us denote so that . On the event (54), whenever , we have (the constants below may depend on but not on , )
Fix . Let , and define
so that as soon as and . Let and be -net of and respectively, where we define . Note that . Then taking and recalling Eqs. (51) and (53), we have
where in the second to last inequality we have used the definition of and adjusted constants appropriately, and in the last inequality we have used the definition of . Consider and set . Then
(55) | ||||
Step 4. Concluding the proof.
For a constant that will be set sufficiently large, define
Consider which is convex and verifies Eq. (49) with and . Then we have
Using Eq. (55) and the inequalities and , we get
Taking sufficiently large, and using that (where we recall that we assumed , we deduce that there exists constants that depend only on the constants of FEAT1, FEAT2 and PEN (except ) such that
with probability at least . The proof is complete.
A.3 Point-wise concentration of dual gradient: Proof of Lemma 4
The proof follows from the same argument as in the proof of Lemma 3 and we will only highlight the differences.
Step 1. Decoupling.
First notice that we can rewrite
Denote
For with the properties listed in step in the proof of Lemma 3, we have
Step 2. Concentration of .
Denote the terms in the sum in , i.e., . By FEAT1 and PEN, and denoting , , we have by Hölder inequality, for any and ,
For and , define . Then setting , we have
Following step 2 in the proof of Lemma 3, in particular taking in the case of , we get
(56) |
Step 3. Uniform concentration of on .
We consider the event for ,
(57) |
where we used that the are -sub-Gaussian by FEAT1. On this event, we have and by PEN, we get
Fix and , so that as soon as . Taking and , we have for ,
(58) | ||||
Step 4. Concluding the proof.
Following the same argument as in Lemma 3, there exists constants that depend only on the constants of PEN such that
with probability at least .
A.4 Uniform lower bound on the dual Hessian: Proof of Lemma 5
Define , where and are as in assumption PEN, and define as
(59) |
The function is -Lipschitz. By assumption PEN, there exists depending only on the constants in that assumption such that for all
(60) |
Thus, for such that and denoting , we have
where
Define . As already explained in Remark 2, the fact that is isotropic, together with assumption FEAT3 imply that, for any two unit vectors , we have . Using the fact that , we deduce that
Thus,
Thus, for . Furthermore, by FEAT1 and that , we have are -sub-Gaussian. Applying Lemma 9 to the vectors with and , and noting that , we conclude that there exists a constant such that for
Denote . For constant sufficiently large, consider the event
We have for ,
where we used that are -subgaussian by FEAT1 to bound the first term, and that and Lemma 9 to bound the second term.
On the event , we have
where we used that is Lipschitz. Let be a minimal -net of , with , so that as soon as . We have
where the last inequality follows since by assumption , . By taking for a sufficiently large constant we obtain with probability at least as claimed.
Appendix B The latent linear model: Proof of Proposition 5
We consider the latent linear model presented in the main text (Section 5.3). We have , where are the iid features noise. Denote and . The features matrix is given by
(61) |
The random features and infinite-width predictors are given by
where is applied component-wise. The distance between and is therefore given explicitly by
(62) | ||||
The following lemma is the key result that allows us to improve on Theorem 1 and gain a factor .
Lemma 6.
Assume and (the minimum singular value of ). Then if , we have
(63) |
We remark that Lemma 6 is entirely deterministic. This may seem surprising because and are random. In fact, the proof of Lemma 6 relies on a deterministic argument which uses the fact that both the infinite-width and random features predictors interpolate the training data, i.e., for ,
or equivalently,
(64) |
Note that here we have used the form of the infinite-width primal problem (see Section E.1).
Proof of Lemma 6.
By Eq. (62) and the bound , we have
Denote . We have
and
By the interpolation constraints (64) and recalling the expression of the features matrix in Eq. (61), we write
and
The first terms on the right-hand sides of the preceding two expressions are the same (up to a factor ). We deduce that
(65) | ||||
From the expression of , we have . By the assumption that and , we have . Similarly, we have . Rearranging the terms in Eq. (65) implies Eq. (63). ∎
Proof of Proposition 5.
By Proposition 4, FEAT1, FEAT2, FEAT3 and FEAT3’ are satisfied with probability . Hence, we can use the results proved in the proof of Theorem 1. Replace the event by the event that for all ,
Using the same proof as in Lemma 4, where the vector is a -sub-Gaussian random vector by A3, there exists such that for , taking , we have . Consider the same events (with same ) and as in the proof of Theorem 1, except with —which do not depend on —absorbed into the constants . By the same proof as in Lemma 2, we can show that there exists a constant such that for all ,
(66) |
The same argument as in Lemma 1 implies that on events , we have
(67) |
Recalling the argument in the proof of Theorem 1, we have , where we used that by A3, . Hence with probability at least , we have
(68) |
Furthermore, from assumption A1 and Lemma 9, there exists such that with probability at least . Hence, we can use Lemma 6, which concludes the proof. ∎
Appendix C Small ball property under fast decay of Hermite coefficients
In this section, we show FEAT3 for a deterministic feature map with the activation function verifying a decay condition on its Hermite coefficients.
Recall that for any function with the standard Gaussian measure, we have the decomposition
where are the Hermite polynomials with standard normalization and we call the -th Hermite coefficient of .
Throughout this section, we will consider and for all . For an integer , consider the decomposition of the activation into a low-degree and a non-polynomial parts
(69) |
For and taking , we can decompose the kernel function into
(70) |
where
(71) | ||||
(72) |
Therefore the empirical kernel matrix can be decomposed into where and .
Similarly, we introduce
(73) | ||||
(74) |
Notice that and . Denote and .
With these notations, we can now introduce our result.
Proposition 6.
Assume that for all and , and that . There exists an absolute constant such that the following holds. If for an integer ,
(75) |
then we have
(76) |
where .
Proof of Proposition 6.
Throughout this proof, we will denote a generic absolute constant. In particular, is allowed to change from line to line.
Consider , , and an integer such that condition (75) is satisfied with a constant that will be fixed later, and decompose
(77) |
The first term is a polynomial of degree in . From Carbery-Wright inequality [CW01], we have the following anti-concentration bound
Note that
where in the last inequality, we used from condition (75). Hence
(78) |
Note that for , we have and therefore we expect the off-diagonal elements of to have negligible operator norm for sufficiently large. In fact, it was shown in [GMMM19] (see [MMM21] for a generalization), that for any constant and integer , if , then for any , we have , where
Hence for and sufficiently large, condition (75) is verified with high probability over the data , if there exists such that
Appendix D Proof of Theorem 2: -hardness of learning with norm
We borrow some notation and terminology from [GLS12]. We consider the convex set defined by
(80) | ||||
From our choice of the truncated ReLU activation, we have and , where we denoted the ball of center and radius , i.e., . In our reductions, we will further need to assume that there exists such that . We will check that indeed we can choose during our reduction. We will denote the set such that . For , we let denote the -neighborhood of , i.e.,
Similarly, we will denote the interior -ball of defined by
For convenience, we recall here the different problems of interest. The weak version of the -problem is given by
minimize | (81) | |||
subj. to | ||||
while the intermediary optimization problem reads with the new notations
maximize | (82) | |||
subj. to |
We will consider the following problems:
-
: given and . Denote the value of the weak -problem (81). Either
-
(1)
assert that ; or,
-
(2)
assert that .
-
: given . Either
-
(1)
assert that ; or,
-
(2)
assert that .
-
(1)
-
: given . Either
-
(1)
assert that for all ; or,
-
(2)
assert that for some .
-
(1)
-
: distinguish the following two cases
-
(1)
there exists a half space such that ; or,
-
(1)
for any half space , we have .
-
(1)
W-F1-PB corresponds to a weak validity problem associated to the weak -problem (81); W-MEM is the weak membership problem associated to the convex set ; W-VAL is the weak validity problem associated to the intermediary optimization problem (82); and HS-MA is the Maximum Agreement for Halfspaces problem.
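Schematically, the chain of reductions established below can be summarized as follows, where each arrow reads "reduces in (randomized) oracle-polynomial time to"; combined with Theorem 3, the hardness of HS-MA then transfers along the chain to W-F1-PB:
% HS-MA -> W-VAL by Lemma 8; W-VAL -> W-MEM -> W-F1-PB by Lemma 7 (via [LSV18]).
\mathrm{HS\text{-}MA}
  \;\longrightarrow\; \mathrm{W\text{-}VAL}
  \;\longrightarrow\; \mathrm{W\text{-}MEM}
  \;\longrightarrow\; \mathrm{W\text{-}F1\text{-}PB}.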
We will use the following hardness result on HS-MA:
Theorem 3 (Theorem 8.1 in [GR09]).
For all , the problem HS-MA is -hard.
Let us first prove that there exists a polynomial time randomized reduction from W-VAL to W-F1-PB.
Lemma 7.
There exists an absolute constant and an oracle-polynomial time randomized algorithm that solves the weak validity problem W-VAL given a W-F1-PB oracle, where .
Proof of Lemma 7.
Let us first show that one can use W-F1-PB to solve .
Consider and call W-F1-PB with , and . First, we know that , hence if , we can directly assert that . We assume from now on that . If W-F1-PB asserts that , then it means there exists with and associated measure . Consider , then has associated measure and with our choice of . Hence we can assert that . If W-F1-PB asserts that , then it means in particular that has associated measure . Consider , then has associated measure and with our choice of . Hence we can assert that .
Having shown that we can implement a weak membership oracle using W-F1-PB with , we can use the results in [LSV18], for example their Theorem 21 (using the sequence of reductions MEM to SEP to OPT to VAL), which shows that there exists an absolute constant and a randomized reduction from the weak validity problem W-VAL to the weak membership problem with . ∎
Lemma 8.
There exists an oracle-polynomial time algorithm that solves the problem HS-MA given a weak validity W-VAL oracle, where and .
Proof of Lemma 8.
First let us show that
(83)
Notice that, with our truncated ReLU activation, has all its coordinates non-negative for any . Hence,
where we denoted and non-negative measures on . Hence, we directly have and the converse inequality is obtained by taking the supremum over with .
Let us now prove the reduction from HS-MA to W-VAL. Consider and vectors . Denote , and for and for . Writing now as , we see that
Denote now . Recall that : if we take , we can always rescale by a constant large enough such that has only or values, and
Using Eq. (83), we can easily find a reduction from HS-MA to the strong version of W-VAL (with ). However, we will do a slightly more complicated reduction, in order to ensure that there exists such that and we can take .
Consider where is the canonical basis in (the vector with at the ’th coordinate and otherwise). Consider . In this case, by taking , we have . Hence contains all and therefore contains . Denote the function associated to the . By -Lipschitzness of the truncated ReLU, we have
(84)
Let us call W-VAL with , and and that will be fixed sufficiently small later. Consider the case , i.e., there exists a halfspace that makes fewer than errors. In that case, we show that, for sufficiently small, the oracle must assert that for some . Let us construct such that . Denote such that , and consider . Because is convex and contains and , it must contain the cone with apex and base the section of perpendicular to . In particular, (for sufficiently large). Furthermore, . Setting and , we have such that for some .
Consider now the case , i.e., halfspaces classify at most vectors correctly. In that case, we show that, for sufficiently small, the oracle must assert that for all . To do so, we show that for any , we must have . Consider an arbitrary ; there exists such that . Using Eqs. (83) and (84), we have . Hence . Taking and , we have .
Combining the above conditions, we see that there exists an absolute constant such that the W-VAL oracle with and allows us to distinguish the two cases of HS-MA. ∎
We are now ready to prove Theorem 2.
Appendix E Additional properties of the minimum interpolation problem
E.1 The infinite-width primal problem for randomized features
In the case of randomized features, we wrote the infinite-width dual problem (Eq. (9)), but we did not write the infinite-width primal problem to which it corresponds. Indeed the dual problem entirely defines the predictor via Eq. (16). For the sake of completeness, we present the infinite-width primal problem here.
It is convenient to make the randomization mechanism explicit by drawing (with a probability space) independent of , and writing (without loss of generality, the reader can assume ). Therefore, we can rewrite Eq. (14) as
where is a general measurable function. This expression shows immediately that the primal variable is a function of both and . Further, maximizing the Lagrangian over , we obtain the following primal problem that generalizes (5) to the case of randomized features:
minimize    (85)
subj. to    (86)
This is a natural limit of the finite-width primal problem (7), whereby we replace the weights by evaluations of the function , .
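A plausible way to write the pair (85)-(86) explicitly is the following sketch, where the symbols are ours ($\bar\lambda$ the penalty of problem (5), $\varphi(x;w)$ the featurization map, $\nu$ the weight distribution, $\tau$ the randomization distribution on $\mathcal{U}$) and may differ from the notation used above:
% Infinite-width primal with randomized features: optimize over measurable a: U x W -> R.
\text{minimize}\quad
  \int_{\mathcal{U}\times\mathcal{W}} \bar\lambda\big(a(u, w)\big)\,\tau(\mathrm{d}u)\,\nu(\mathrm{d}w)
\text{subject to}\quad
  \int_{\mathcal{U}\times\mathcal{W}} a(u, w)\,\varphi(x_i; w)\,\tau(\mathrm{d}u)\,\nu(\mathrm{d}w) \;=\; y_i,
  \qquad i = 1, \dots, n.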
Remark 7.
Note that, under assumptions FEAT1, FEAT3 (or FEAT3’ for ) and PEN, the minimizer to problem (5) (and its generalization (85)) exists and is unique. First of all notice that problem (85) is feasible. Indeed, choosing for , the interpolation constraint takes the form
(87)
which has solutions since is strictly positive definite by condition FEAT3. Let be such a feasible point.
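To make the feasibility step concrete, here is the finite linear system that the constraint (87) reduces to under our (illustrative) notation, writing $K_{ij} = \int \varphi(x_i; w)\,\varphi(x_j; w)\,\nu(\mathrm{d}w)$:
% Take a(w) = sum_j c_j phi(x_j; w) for coefficients c in R^n; the interpolation
% constraint becomes
\sum_{j=1}^{n} c_j \int \varphi(x_i; w)\,\varphi(x_j; w)\,\nu(\mathrm{d}w)
  \;=\; \sum_{j=1}^{n} K_{ij}\, c_j \;=\; y_i,
  \qquad i = 1, \dots, n,
% i.e. K c = y, which is solvable because K is strictly positive definite (FEAT3).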
Denote by the cost function. Notice that, by assumption PEN, for some constants , and , and therefore only if . It is therefore sufficient to focus on these functions . Further notice that the map is continuous under weak convergence in , since for every by assumption FEAT1. It follows that the set of feasible solutions of (85) satisfying is closed and bounded and hence weakly sequentially compact by Banach-Alaoglu. Finally, is weakly lower semicontinuous by Fatou’s lemma, and this implies existence of minimizers.
Uniqueness follows from the fact that is strictly convex, which implies that is also strictly convex.
E.2 Representer theorem for strictly convex and differentiable penalty
In the case of deterministic features (i.e. and ), we present here a generalization of the representer theorem to a broad class of penalties . Recall that in the case of , the representer theorem states that the solution belongs to the class of functions
(88)
In words, the solution is contained in the (at most) -dimensional linear subspace spanned by .
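For comparison, in the RKHS case the classical representer theorem gives the following explicit form (a standard fact, written in our own notation for the kernel $K$ and featurization $\varphi$):
% p = 2 (RKHS): the minimum-norm interpolant lies in span{K(., x_1), ..., K(., x_n)}.
\hat f(x) \;=\; \sum_{i=1}^{n} c_i\, K(x, x_i)
  \;=\; \int \Big(\sum_{i=1}^{n} c_i\,\varphi(x_i; w)\Big)\,\varphi(x; w)\,\nu(\mathrm{d}w),
% equivalently, the optimal weight function is a(w) = sum_i c_i phi(x_i; w).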
The following proposition generalizes this result to a penalty that is strictly convex.
Proposition 7 (Representer theorem for penalty ).
Note that we recover the representer theorem for RKHS when and . However for general , the loss function cannot be simplified by evaluating once.
Proof of Proposition 7.
We have
where we introduced the Lagrange multipliers in (1), we used strong duality in (2), and we used the definition of the convex conjugate in (3). At the optimum, we must have a.s. with respect to . ∎
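The duality chain can be sketched as follows, in our own notation (penalty $h$ applied pointwise, convex conjugate $h^*$, multipliers $\lambda \in \mathbb{R}^n$); this is an illustration of the standard computation rather than a verbatim reconstruction of the display above:
% Lagrangian for  min_a  int h(a(w)) nu(dw)  s.t.  int a(w) phi(x_i; w) nu(dw) = y_i:
\min_{a}\,\max_{\lambda}\;
  \int h\big(a(w)\big)\,\nu(\mathrm{d}w)
  \;+\; \sum_{i=1}^{n}\lambda_i\Big(y_i - \int a(w)\,\varphi(x_i; w)\,\nu(\mathrm{d}w)\Big)
% (2) strong duality, (3) definition of the convex conjugate:
  \;=\; \max_{\lambda}\;\langle\lambda, y\rangle
  \;-\; \int h^{*}\Big(\sum_{i=1}^{n}\lambda_i\,\varphi(x_i; w)\Big)\,\nu(\mathrm{d}w),
% and at the optimum a(w) = (h^*)'( sum_i lambda_i phi(x_i; w) )  nu-a.s.,
% so a depends on w only through phi(x_1; w), ..., phi(x_n; w).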
Appendix F Useful technical facts
We recall the following basic result on concentration of the empirical covariance of independent sub-Gaussian random vectors. This can be found, for instance, in [Ver18, Exercise 4.7.3] or [Ver10] (Theorem 5.39 and Remark 5.40(1)).
Lemma 9.
Let be independent -sub-Gaussian random vectors, with common covariance . Denote by the empirical covariance. Then there exist absolute constants such that, for all , we have
(90)
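One standard formulation of this bound (e.g. as in [Ver18]; the constants and the parametrization of the failure probability may differ from Eq. (90)) reads: for independent zero-mean $K$-sub-Gaussian vectors $x_1, \dots, x_n$ in $\mathbb{R}^d$ with covariance $\Sigma$ and empirical covariance $\hat\Sigma = n^{-1}\sum_i x_i x_i^{\mathsf{T}}$,
% For every u >= 0, with probability at least 1 - 2 e^{-u},
\big\|\hat\Sigma - \Sigma\big\|_{\mathrm{op}}
  \;\le\; C\,K^{2}\Big(\sqrt{\tfrac{d + u}{n}} + \tfrac{d + u}{n}\Big)\,\|\Sigma\|_{\mathrm{op}},
% where C is an absolute constant and K is the sub-Gaussian norm of the x_i
% (measured relative to their L^2 norm).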
We also state and prove two simple lemmas about moments of sub-Gaussian random variables.
Lemma 10.
For any and any there exists such that the following holds. For any random variable that is -sub-Gaussian with , we have
(91)
This inequality holds for , with and .
Proof.
Without loss of generality, we can assume . For , this is just Jensen’s inequality.
For , by Hölder's inequality, for any and any , we have
Setting , and using the fact that and is sub-Gaussian, we get that there exist constants finite as long as , such that
The claim follows by setting or . ∎
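As a self-contained illustration of the Hölder-plus-sub-Gaussian step used in this proof (generic exponents and absolute constants, not the exact choices made above):
% For a K-sub-Gaussian X, conjugate exponents 1/p + 1/q = 1 (p, q > 1), and a > 0:
\mathbb{E}\big[|X|^{m}\,\mathbf{1}\{|X| \ge a\}\big]
  \;\le\; \big(\mathbb{E}|X|^{mp}\big)^{1/p}\,\mathbb{P}\big(|X| \ge a\big)^{1/q}
  \;\le\; 2\,\big(C\sqrt{mp}\,K\big)^{m}\,\exp\!\Big(-\frac{c\,a^{2}}{q\,K^{2}}\Big),
% using the moment bound (E|X|^r)^{1/r} <= C sqrt(r) K and the tail bound
% P(|X| >= a) <= 2 exp(-c a^2 / K^2), with absolute constants C, c > 0.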
Lemma 11.
For any and any there exists such that the following holds. For any random variable that is -sub-Gaussian with , we have
(92)
This inequality holds for , with and .
Proof.
Without loss of generality, we can assume . For , this is just Jensen’s inequality.
For the other cases, notice that, for , . Hence, by Hölder's inequality, for all :
We then set , and invert this relationship to get
Taking , we can apply Lemma 10, to get
This proves the claim. ∎