Minimum complexity interpolation in random features models
Abstract
Despite their many appealing properties, kernel methods are heavily affected by the curse of dimensionality. For instance, in the case of inner product kernels in $\mathbb{R}^d$, the Reproducing Kernel Hilbert Space (RKHS) norm is often very large for functions that depend strongly on a small subset of directions (ridge functions). As a consequence, such functions are difficult to learn using kernel methods. This observation has motivated the study of generalizations of kernel methods, whereby the RKHS norm —which is equivalent to a weighted $\ell_2$ norm— is replaced by a weighted functional $\ell_p$ norm, which we refer to as the $\mathcal{F}_p$ norm. Unfortunately, tractability of these approaches is unclear. The kernel trick is not available and minimizing these norms requires solving an infinite-dimensional convex problem.
We study random features approximations to these norms and show that, for $p>1$, the number of random features required to approximate the original learning problem is upper bounded by a polynomial in the sample size. Hence, learning with $\mathcal{F}_p$ norms is tractable in these cases. We introduce a proof technique based on uniform concentration in the dual, which can be of broader interest in the study of overparametrized models. For $p=1$, our guarantees for the random features approximation break down. We prove instead that learning with the $\mathcal{F}_1$ norm is NP-hard under a randomized reduction based on the problem of learning halfspaces with noise.
1 Introduction
1.1 Background: Kernel methods and the curse of dimensionality
Kernel methods are among the most fundamental tools in machine learning. Over the last two years they attracted renewed interest because of their connection with neural networks in the linear regime (a.k.a. neural tangent kernel or lazy regime) [JGH18, LR20, LRZ19, GMMM19].
Consider a general covariates space, namely a probability space $\mathcal{X}$ (our results will concern the case $\mathcal{X}\subseteq\mathbb{R}^d$, but it is useful to start from a more general viewpoint). A reproducing kernel Hilbert space (RKHS) is usually constructed starting from a positive semidefinite kernel. Here it will be more convenient to start from a weight space, i.e. a probability space $(\mathcal{W},\mu)$, and a featurization map $\phi:\mathcal{X}\times\mathcal{W}\to\mathbb{R}$ parametrized by the weight vector $w$. (A sufficient condition for this construction to be well defined is that $\phi(x;\,\cdot\,)$ is square integrable for each $x$.) The RKHS is then defined as the space of functions of the form
$$f_a(x) \;=\; \int_{\mathcal{W}} a(w)\,\phi(x;w)\,\mu(\mathrm{d} w), \qquad (1)$$
with $a\in L^2(\mathcal{W},\mu)$. To give a concrete example, we might consider $\mathcal{X},\mathcal{W}\subseteq\mathbb{R}^d$ and $\phi(x;w)=\sigma(\langle w,x\rangle)$ for an activation function $\sigma$: this is the featurization map arising from two-layer neural networks with random first-layer weights.
The radius-$R$ ball in this space is defined as
$$\mathcal{F}_{2,R} \;:=\; \Big\{ f_a \ \text{as in Eq. (1)} \;:\; \|a\|_{L^2(\mu)} \le R \Big\}. \qquad (2)$$
This construction is equivalent to the more standard one, with associated kernel $K(x_1,x_2)=\mathbb{E}_{w\sim\mu}\big[\phi(x_1;w)\,\phi(x_2;w)\big]$. Vice versa, for any positive semidefinite kernel $K$, we can construct a corresponding featurization map $\phi$, and proceed as above.
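As a sanity check of this correspondence, the following sketch compares a Monte Carlo estimate of $\mathbb{E}_w[\phi(x_1;w)\,\phi(x_2;w)]$ with its closed form for one illustrative choice, ReLU features with standard Gaussian weights; this specific featurization and weight distribution are assumptions made only for the example.

```python
import numpy as np

def arccos_expectation(x1, x2):
    # Closed form of E_w[ReLU(<w, x1>) ReLU(<w, x2>)] for w ~ N(0, I_d):
    # ||x1|| ||x2|| / (2 pi) * (sin(theta) + (pi - theta) cos(theta)).
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    theta = np.arccos(np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0))
    return n1 * n2 / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def monte_carlo_kernel(x1, x2, num_w=200_000, seed=0):
    # Monte Carlo estimate of K(x1, x2) = E_w[phi(x1; w) phi(x2; w)].
    W = np.random.default_rng(seed).normal(size=(num_w, x1.shape[0]))
    return np.mean(np.maximum(W @ x1, 0.0) * np.maximum(W @ x2, 0.0))

x1, x2 = np.array([1.0, 0.0, 0.5]), np.array([0.3, -1.0, 0.2])
print(arccos_expectation(x1, x2), monte_carlo_kernel(x1, x2))  # the two values agree
```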
It is a basic fact that, although is infinite-dimensional, learning in can be performed efficiently [SSBD14]: it is useful to recall the reason here. Consider a supervised learning setting in which we are given samples , , with and . Given a convex loss , the RKHS-norm regularized empirical risk minimization problem reads:
(3) |
Conceptually, this problem can be solved in two steps: (i) find the minimum RKHS norm subject to the interpolation constraints $f(x_i)=v_i$, $i\le n$, and (ii) minimize, over the values $v\in\mathbb{R}^n$, the sum of this quantity and the empirical loss. Since step (ii) is convex and finite-dimensional, the critical problem is the interpolation problem:
$$\text{minimize}\quad \|f\|_{\mathcal{H}} \qquad\quad \text{subj. to}\quad f(x_i)=v_i,\;\; i\le n.$$
While this is an infinite-dimensional problem, convex duality (the ‘representer theorem’) guarantees that the solution belongs to a fixed $n$-dimensional subspace, namely $\mathrm{span}\{K(\,\cdot\,,x_1),\dots,K(\,\cdot\,,x_n)\}$.
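To make this two-step reduction concrete, here is a minimal sketch of minimum-RKHS-norm interpolation; the Gaussian kernel and the tiny ridge added for numerical stability are illustrative choices, not part of the setup above.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gaussian kernel K(x, x') = exp(-gamma * ||x - x'||^2), computed pairwise.
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def min_rkhs_norm_interpolator(X, y, gamma=1.0, ridge=1e-10):
    # By the representer theorem, the minimum-RKHS-norm interpolant lies in
    # span{K(., x_1), ..., K(., x_n)}, so it suffices to solve the n x n
    # linear system K c = y (a tiny ridge is added for numerical stability).
    K = rbf_kernel(X, X, gamma)
    c = np.linalg.solve(K + ridge * np.eye(len(y)), y)
    return lambda X_new: rbf_kernel(X_new, X, gamma) @ c

# Example: interpolate a few points and evaluate the predictor elsewhere.
X = np.random.default_rng(0).normal(size=(20, 3))
y = np.sin(X[:, 0])
f = min_rkhs_norm_interpolator(X, y)
print(np.max(np.abs(f(X) - y)))  # ~0: the interpolation constraints hold
```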
Unfortunately, kernel methods suffer from the curse of dimensionality. To give a simple example, consider , , and with a non-polynomial activation function. Consider noiseless data , with for a fixed . Then, (with denoting the RKHS norm). Correspondingly, [YS19, GMMM19] show that for any fixed , if , then any kernel method of the form (3) returns with bounded away from zero.
The curse of dimensionality suggests seeking functions of the form (1) where $a$ is sparse, in a suitable sense [Bac17]. In this paper we consider a generalization in which the RKHS ball of Eq. (2) is replaced by
$$\mathcal{F}_{p,R} \;:=\; \Big\{ f_a \ \text{as in Eq. (1)} \;:\; \|a\|_{L^p(\mu)} \le R \Big\}. \qquad (4)$$
For $p<2$ this comprises a richer function class than the original $\mathcal{F}_{2,R}$, since $\|a\|_{L^p(\mu)}\le\|a\|_{L^2(\mu)}$, and it is easy to see that the inclusion is strict. The case $p=1$ is also known as a ‘convex neural network’ [Bac17].
Although the penalty is convex, it is far from clear that the learning problem is tractable. Indeed, for $p\neq 2$ the classical representer theorem does not hold anymore and we cannot reduce the infinite-dimensional problem to solving a finite-dimensional quadratic program (see Appendix E.2 for a representer-type theorem in this setting).
By the same reduction discussed above, it is sufficient to consider the following minimum-complexity interpolation problem:
$$\text{minimize}\quad \int_{\mathcal{W}} r\big(a(w)\big)\,\mu(\mathrm{d} w) \qquad (5)$$
$$\text{subj. to}\quad \int_{\mathcal{W}} \phi(x_i;w)\,a(w)\,\mu(\mathrm{d} w) \;=\; \hat y_i,\;\; i\le n.$$
(We denote the values to be interpolated by $\hat y_i$ instead of $y_i$ because we will focus on the interpolation problem hereafter.) We will take $r$ to be a convex function minimized at $0$. We establish two main results. First, we establish tractability for a subset of strictly convex penalties which include, as special cases, the power penalties $r(t)=|t|^p/p$, $p>1$. Our approach is based on a random features approximation which we discuss next. Second, we establish NP-hardness under randomized reduction for the case $p=1$.
1.2 Random features approximation
We sample independently , , and fit a model
(6) |
We determine the coefficients by solving the interpolation problem
minimize | (7) | |||
subj. to |
Notice that this is equivalent to replacing the measure in Eq. (5) by its empirical version . Borrowing the terminology from neural network theory, we will refer to (7) as the “finite width” problem, and to the original problem (5) as the “infinite width” problem. We will denote by the solution of the finite width problem (7) and by the solution of the infinite width problem (5).
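For illustration, the finite width problem can be handed to an off-the-shelf convex solver. The sketch below assumes an $\ell_p$-type penalty on the coefficients, in which case minimizing the $\ell_p$ norm itself under the interpolation constraints yields the same minimizer; the function name, the exponent, and the scaling of the coefficients are assumptions made for the example.

```python
import numpy as np
import cvxpy as cp

def random_features_interpolator(Phi, y, p=1.5):
    """Finite-width minimum-complexity interpolation (a sketch for l_p penalties).

    Phi : (n, N) matrix of feature evaluations, Phi[i, j] = phi(x_i; w_j)
    y   : (n,)  values to interpolate
    """
    N = Phi.shape[1]
    a = cp.Variable(N)
    # Minimizing ||a||_p is equivalent to minimizing sum_j |a_j|^p (they differ
    # by the monotone map t -> t^p), so the interpolating minimizer is the same.
    problem = cp.Problem(cp.Minimize(cp.pnorm(a, p)), [Phi @ a == y])
    problem.solve()
    return a.value
```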
Our main result establishes that, under suitable assumptions on the penalty and the featurization map , the random features approach provides a good approximation of the minimum complexity interpolation problem (5). Crucially, this is achieved with a number of features that is polynomial in the sample size . In particular we show that for , , setting (and assuming )
(8) |
Hence, for , the random features approach yields a function that is a good approximation of the original problem (5).
Let us emphasize that the scaling of the number of random features in our bound is not optimal. In particular, for the RKHS case $p=2$, the above result requires many more features than the guarantee of [MMM21], which shows that far fewer features are often sufficient. We also note that the exponent in the polynomial diverges as $p\to 1$. Hence, our results do not guarantee tractability for the case $p=1$. Indeed this is a fundamental limitation: as discussed in Section 4, a bound such as Eq. (8) with a finite exponent cannot hold for $p=1$, under some standard hardness assumptions. We show instead that no polynomial time algorithm is guaranteed to achieve accuracy better than some absolute constant in Eq. (8). This hardness result is based on a randomized reduction from the problem of learning halfspaces with noise, which was proved to be NP-hard in [FGKP06, GR09].
1.3 Dual problem and its concentration properties
Our proof that the random features model is a good approximation of the infinite width model (cf. Eq. (8)) is based on a simple approach which is potentially of independent interest. We notice that, while the optimization problems (5) and (7) are overparametrized and hence difficult to control, their duals are underparametrized and can be studied using uniform convergence arguments.
Let $r^*(s) := \sup_{t\in\mathbb{R}}\{s\,t - r(t)\}$ be the convex conjugate of $r$.
Then, the dual problems of (5) and (7) are given —respectively— by the following optimization problems over $\lambda\in\mathbb{R}^n$:
$$\text{minimize}_{\lambda\in\mathbb{R}^n}\quad G(\lambda) \;:=\; \mathbb{E}_{w\sim\mu}\big[r^*\big(\langle\lambda,\phi(w)\rangle\big)\big] \;-\; \langle\lambda,\hat y\rangle, \qquad (9)$$
$$\text{minimize}_{\lambda\in\mathbb{R}^n}\quad G_N(\lambda) \;:=\; \frac{1}{N}\sum_{j=1}^N r^*\big(\langle\lambda,\phi(w_j)\rangle\big) \;-\; \langle\lambda,\hat y\rangle. \qquad (10)$$
Here we denoted by $\hat y = (\hat y_1,\dots,\hat y_n)$ the vector of responses and by $\phi(w) := (\phi(x_1;w),\dots,\phi(x_n;w))\in\mathbb{R}^n$ the evaluation of the feature map at the data points.
We will prove that, under suitable assumptions on the penalty , the optimizer of the finite-width dual (10) concentrates around the optimizer of the infinite-width dual (9). Our results hold conditionally on the realization of the data and instead exploit the randomness of the weights .
The rest of the paper is organized as follows. After briefly discussing related work in Section 2, we state our assumptions and results for strictly convex penalties in Section 3. In Section 4, we show that the problem with norm is -hard under randomized reduction. In Section 5, we describe a few examples in which we can apply our general results. The proof of our main result is outlined in Section 6, with most technical work deferred to the appendices.
2 Related work
Random features methods were first introduced as a randomized approximation to RKHS methods [RR08, BBV06]. Given a kernel , the idea of [RR08] was to replace it by a low rank approximation
$$K_N(x_1,x_2) \;:=\; \frac{1}{N}\sum_{j=1}^N \phi(x_1;w_j)\,\phi(x_2;w_j), \qquad (11)$$
where the random features $\phi(\,\cdot\,;w_j)$, with $w_j\sim_{\mathrm{iid}}\mu$, are such that $\mathbb{E}_{w\sim\mu}[\phi(x_1;w)\,\phi(x_2;w)] = K(x_1,x_2)$. Several papers prove bounds on the test error of random features methods and compare them with the corresponding kernel approach [RR09, RR17, MWW20, GMMM19, MM19, GMMM20, MMM21]. In particular, [MMM21] proves that —under certain concentration assumptions on the random features— if the number of features is sufficiently large compared to the sample size, then the random features approach has nearly the same test error as the corresponding RKHS method.
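As a concrete instance of the construction in Eq. (11), the random Fourier features of [RR08] approximate the Gaussian kernel; the sketch below is a standard implementation for that particular kernel (the kernel choice and bandwidth are illustrative, not assumptions of this paper).

```python
import numpy as np

def random_fourier_features(X, N=500, gamma=1.0, seed=0):
    # Random Fourier features for the Gaussian kernel exp(-gamma ||x - x'||^2):
    # phi(x; w, b) = sqrt(2) cos(<w, x> + b), with w ~ N(0, 2 gamma I) and
    # b ~ Unif[0, 2 pi].  Then (1/N) Phi Phi^T approximates the kernel matrix.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, N))
    b = rng.uniform(0.0, 2.0 * np.pi, size=N)
    return np.sqrt(2.0) * np.cos(X @ W + b)
```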
Notice that the idea of approximating the kernel via random features implicitly selects a specific regularization, in our notation $r(t)=t^2$. Indeed, for any other regularization, the predictor does not depend uniquely on the kernel. An alternative viewpoint regards the random features model (6) as a two-layer neural network with random first-layer weights. It was shown in [Bar98] that the generalization properties of a two-layer neural network can be controlled in terms of the sum of the absolute values of the second-layer weights. This opens the way to considering infinitely wide two-layer networks as per Eq. (1), with $a\in L^1(\mu)$. These networks represent functions in the class $\mathcal{F}_1$.
The infinite-width limit was considered in [BRV+06], which propose an incremental algorithm to fit functions with an increasing number of units. Their approach however has no global optimality guarantees. Generalization error bounds within were proved in [Bac17], which demonstrates the superiority of this approach over RKHS methods, in particular to fit ridge functions or other classes of functions that depend strongly on a low-dimensional subspace of . The same paper also develops an optimization algorithm based on a conditional-gradient approach. However, in high dimension each iteration of this algorithm requires solving a potentially hard problem.
A line of recent work shows that —in certain overparametrized training regimes— neural networks are indeed well approximated by random features models obtained by linearizing the network around its (random) initialization. A subset of references include [JGH18, LL18, DZPS18, OS19, COB19]. The implicit bias induced by gradient descent on these models is a ridge (or RKHS) regularization. However, the bias can change if either the algorithm or the model parametrization is changed [NTSS17, GLSS18].
We finally notice that several recent works study the impact of changing regularization in high-dimensional linear regression, focusing on the interpolation limit [LS20, MKL+20, CLvdG20]. None of these works addresses the main problem studied here, that is approximating the fully nonparametric model (1).
Our work is a first step towards understanding the effect of regularization in random features models. It implies that, for certain penalties , as soon as is polynomially large in , studying the random features model reduces to studying the corresponding infinite width model.
3 Convergence to the infinite width limit
3.1 Setting and dual problem
We consider a slightly more general setting than the one discussed in the introduction, whereby we allow the featurization map functions to be randomized. More precisely, conditional on the data and weight vectors, the feature values are independent random variables. Explicitly, such randomized features can be constructed by letting the featurization map depend on an additional variable, and setting (with an abuse of notation) the features to be functions of a collection of independent random variables with a common law (a probability distribution over the additional variable). Without loss of generality we can assume that the randomized features take this form. (For instance, this is the case as long as the weights take values in a Polish space, in which case the conditional probabilities exist.)
We denote by the expectation of the features with respect to this additional randomness. In what follows, we will omit to write the dependency on explicitly, unless required for clarity. The additional freedom afforded by randomized features is useful in multiple scenarios:
- We only have access to noisy measurements of the true features.
- We deliberately introduce noise in the featurization mechanism. This is known to have a regularizing effect [Bis95].
At prediction time we use the average features (our results do not change significantly if we use randomized features also at prediction time):
(12) |
The dual of the finite-width problem (7) is given by the problem (10), for which we introduce the following notations:
(13) | ||||
Notice that is now a random vector. In the case of random features, the dual of the infinite width problem (5) has to be modified with respect to (9), and takes instead the form
(14) | ||||
Note the expectation with respect to the randomness in the features (equivalently, this is the expectation with respect to the independent randomness , which is not noted explicitly). The most direct way to see that this is the correct infinite width dual is to notice that this is obtained from Eq. (13) by taking expectation with respect to the weights and the features randomness. We will further discuss the connection between (14) and the infinite width primal problem in Section E.1 below.
In order to discuss the dual optimality conditions, let $s$ be the derivative of the convex conjugate of $r$, i.e., $s:=(r^*)'$. Since $r$ is assumed to be strictly convex, $s$ exists and is continuous and non-decreasing. With these definitions, the dual optimality condition reads
(15) |
The primal solution is then given by and the resulting predictor is
In the following, with an abuse of notation, we will write for the model at dual parameter . The corresponding infinite width predictor reads
(16) |
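To illustrate why the dual viewpoint is convenient (the dual variable lives in $\mathbb{R}^n$, regardless of the number of features $N$), the following sketch minimizes a finite-width dual of the form described above for the $\ell_p$ example and recovers the primal coefficients from the optimality condition. The specific objective, the conjugate exponent, and the $1/N$ normalization are assumptions consistent with the $\ell_p$ case, not a statement of the general setting.

```python
import numpy as np
from scipy.optimize import minimize

def fit_via_dual(Phi, y, q=3.0):
    """Solve a finite-width dual of the assumed form for the l_p example.

    Phi : (n, N) feature matrix, Phi[i, j] = phi(x_i; w_j)
    q   : conjugate exponent, 1/p + 1/q = 1 (q = 3 corresponds to p = 1.5)
    """
    n, N = Phi.shape

    def dual_objective(lam):
        # Assumed dual: (1/N) sum_j r*(<lam, phi_j>) - <lam, y>, with r*(s) = |s|^q / q.
        u = Phi.T @ lam
        return np.sum(np.abs(u) ** q) / (q * N) - lam @ y

    res = minimize(dual_objective, np.zeros(n), method="L-BFGS-B")
    lam = res.x
    # Primal recovery from the stationarity condition: a_j = s(<lam, phi_j>) / N,
    # with s(u) = sign(u) |u|^(q-1); at the optimum, sum_j a_j phi(x_i; w_j) = y_i.
    u = Phi.T @ lam
    a = np.sign(u) * np.abs(u) ** (q - 1.0) / N
    return lam, a
```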
3.2 General theorem
We will show that conditional on the realization of , the distance between the infinite width interpolating model and the finite width one is small with high probability as soon as is large enough. Throughout this section, are viewed as fixed, and we assume certain conditions to hold on the distribution of the features . In Section 5 we will check that these conditions hold for a few models of interest, for typical realizations of .
We first state our assumptions on the feature distribution. We define the whitened features
(17) |
Here expectation is over the randomization in the features, and is the empirical kernel matrix. By construction, are isotropic: .
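In practice, whitened features of this kind can be computed as in the sketch below. Note that the kernel matrix is estimated here from the sampled features themselves, which only approximates the expectation used in the definition above; the variable names and the regularization are illustrative.

```python
import numpy as np

def whiten_features(Phi, eps=1e-12):
    # Phi is the (n, N) matrix of feature evaluations at the data points.
    # Estimate the kernel matrix by the empirical average over the sampled
    # features, then whiten each column: z_j = K^{-1/2} phi(w_j).
    n, N = Phi.shape
    K_hat = Phi @ Phi.T / N
    evals, evecs = np.linalg.eigh(K_hat)
    K_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, eps))) @ evecs.T
    Z = K_inv_sqrt @ Phi
    # By construction, Z @ Z.T / N is (approximately) the n x n identity,
    # i.e. the whitened features are isotropic with respect to the estimate.
    return Z
```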
We will assume the following conditions to hold for the featurization map and the penalty. Throughout we will follow the convention of denoting by Greek letters the constants that we keep track of in our calculations, and by generic constants those that we do not track (and therefore our statements depend in an unspecified way on the latter). We also recall that a random vector $z$ is $\tau$-subgaussian if $\mathbb{E}\exp(\langle v, z-\mathbb{E}z\rangle)\le \exp(\tau^2\|v\|_2^2/2)$ for all vectors $v$.
- FEAT1 (Sub-gaussianity) For some and any fixed , is -sub-Gaussian when . Further, the feature vector is -sub-Gaussian when , . Without loss of generality we assume .
- FEAT2 (Lipschitz continuity) For any , assume that is -Lipschitz with respect to and , where , .
- FEAT3 (Small ball) There exists such that
(18) for some strictly positive constants . By union bound, this is implied by the stronger condition
Without loss of generality, we will assume .
- PEN (Polynomial growth) We assume that is strictly convex and minimized at 0, so that is continuous and . Because is non-decreasing, we have that has a derivative almost everywhere. We assume there exists and such that for ,
and
(19)
If is unbounded as , we will require a stronger assumption FEAT3’ which will ensure that exists and is finite.
- FEAT3’ (Small ball) For every and every , , .
Remark 1.
Assumption PEN implies that $s$ is locally Lipschitz away from $0$. It further implies that $s$ and its derivatives are upper and lower bounded by polynomials. The most important example is provided by $\ell_p$-norms with $p>1$, in which case we set $r(t)=|t|^p/p$. It is easy to check that PEN is satisfied in this case with appropriate choices of the constants.
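For concreteness, the conjugate pair for this power penalty can be recorded explicitly (a standard computation; the normalization $r(t)=|t|^p/p$ is the one used in this remark):

```latex
\begin{align*}
  r(t) &= \frac{|t|^p}{p}, \qquad p>1, \qquad \frac{1}{p}+\frac{1}{q}=1,\\
  r^*(s) &= \sup_{t\in\mathbb{R}}\big\{\,s\,t - r(t)\,\big\} = \frac{|s|^q}{q},
  \qquad
  s(u) := (r^*)'(u) = \operatorname{sign}(u)\,|u|^{\,q-1},
\end{align*}
```

so that $s$ is continuous, non-decreasing, and of polynomial growth with exponent $q-1 = 1/(p-1)$.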
Our main theorem establishes that the infinite-dimensional problem of Eq. (5) (and its generalization to randomized featurization maps) is well-approximated by its finite random features counterpart. For this approximation to be good, it is sufficient that the number of random features scales polynomially with the sample size .
Theorem 1.
Assume for all , and let be a probability distribution supported on . Assume that conditions FEAT1, FEAT2, FEAT3, and PEN are satisfied. If , further assume that FEAT3’ holds. Then for any , there exist constants depending on the constants in those assumptions, but not on and , and additionally dependent on such that the following holds. If , , and , then with probability at least ,
(20) |
where
with , , , . Further, the bound holds with , when .
In order to interpret Theorem 1, we remark that we expect typically to be of order . In this case differs negligibly from when .
Remark 2.
The most restrictive among our assumptions are FEAT3 and FEAT3’. Both conditions imply that the infinite-width dual problem of Eq. (14) is well behaved.
In particular, condition FEAT3 ensures that the minimum eigenvalue of the rescaled Hessian is bounded away from , as long as is bounded away from and . Notice indeed that the Hessian is given by
(21) |
For any , with , we have , whence, using assumption FEAT3 and Markov’s inequality and union bound, for any two unit vectors , for some constant . We then have,
Condition FEAT3’ ensures that the largest eigenvalue of the Hessian is bounded for bounded below and above. Notice that, from Eq. (21), no such assumption is required when is bounded as .
Remark 3.
As mentioned in the introduction, the bound in Eq. (20) is not optimal. For instance, for the case of a penalty that is strongly convex and smooth (covered in Theorem 1 by taking ) this bound implies that random features are sufficient to approximate the infinite width problem. However, a more careful analysis yields that —in this case— random features are sufficient. This is also what is established in [MMM21] for the case of kernel ridge regression (corresponding to ).
It is instructive to specialize Theorem 1 to the case of $\ell_p$-norms, i.e., $r(t)=|t|^p/p$, which is covered by an appropriate choice of the constants in assumption PEN. In this case $s(u)=\mathrm{sign}(u)\,|u|^{q-1}$, with $q=p/(p-1)$, and hence the formulas are simpler.
Corollary 1.
Assume for all , and let be a probability distribution supported on . Assume that conditions FEAT1, FEAT2, FEAT3 are satisfied, and penalty , . Then there exist constants depending on the constants in those assumptions, such that the following holds. If and , then with probability at least ,
(22) |
where .
Note that the exponent on the right hand side of Eq. (22) diverges as $p\to 1$. Rather than being an artifact of our proof technique, this is unavoidable: we show in the next section (Section 4) that no such bound can hold for $p=1$ under standard hardness assumptions.
Remark 4.
In the case of randomized features, we wrote the infinite-width dual problem (Eq. (9)), but we did not write the infinite-width primal problem to which it corresponds. Indeed the dual problem entirely defines the predictor via Eq. (16). For the sake of completeness, we present the infinite-width primal problem in Appendix E.1.
4 NP-hardness of learning with the $\mathcal{F}_1$ norm
The fact that, when , we are unable to solve efficiently the infinite width problem (5) via the random features approach is not surprising. Indeed, consider the featurization map (ReLu activation), and random weights . Consider solving either the infinite-dimensional problem (5) or its random features approximation (7) with data , (where and are fixed).
In [GMMM19], it was shown that for any , if , then has test error bounded away from zero, namely for any sample size . Indeed this lower bound holds for any function that can be written as in Eq. (6), for some coefficients . On the other hand, [Bac17] proves that minimizing the empirical risk subject to , cf. Eq. (4), for a suitable choice of , achieves test error . The last result does not apply directly to min -norm interpolator. However, if an analogous result was established for interpolators, it would imply that remains bounded away from zero in this case as long as .
In order to provide stronger evidence towards hardness, we consider the computational complexity of the interpolation problem
$$\text{minimize}\quad \int_{\mathcal{W}} |a(w)|\,\mu(\mathrm{d} w) \qquad (23)$$
$$\text{subj. to}\quad \int_{\mathcal{W}} \phi(x_i;w)\,a(w)\,\mu(\mathrm{d} w) \;=\; \hat y_i,\;\; i\le n.$$
We show that it is NP-hard under a randomized reduction to solve (23) within an accuracy given by a fixed absolute constant. On the contrary, for $p>1$, Corollary 1 and its proof show that one can obtain any fixed accuracy in polynomial time using a random features approximation.
It will be convenient to consider a relaxation of problem (23) by minimizing over the set of signed measures on $\mathcal{W}$, which we will denote $\mathcal{M}(\mathcal{W})$. For any $\rho\in\mathcal{M}(\mathcal{W})$, let $f_\rho(x) := \int_{\mathcal{W}} \phi(x;w)\,\rho(\mathrm{d} w)$. We have
$$\text{minimize}\quad \|\rho\|_{\mathrm{TV}} \qquad (24)$$
$$\text{subj. to}\quad f_\rho(x_i) \;=\; \hat y_i,\;\; i\le n,$$
where the minimization is now over all $\rho\in\mathcal{M}(\mathcal{W})$ and $\|\rho\|_{\mathrm{TV}}$ denotes the total variation of $\rho$. (Recall that by the Hahn decomposition, there exist two non-negative measures $\rho_+$ and $\rho_-$ such that $\rho=\rho_+-\rho_-$. The total variation is equal to $\|\rho\|_{\mathrm{TV}}=\rho_+(\mathcal{W})+\rho_-(\mathcal{W})$.) If we assume that the signed measure has a density with respect to the fixed probability measure $\mu$, i.e., $\rho(\mathrm{d}w)=a(w)\,\mu(\mathrm{d}w)$, then the total variation is equal to $\int|a(w)|\,\mu(\mathrm{d}w)$ and we recover the original problem (23). Note that if $\mu$ has full support on $\mathcal{W}$, then the infima in the two problems are the same (all measures can be written as limits of measures with densities). However the infimum in problem (23) is not attained in general, while the infimum in (24) is always achieved for $\mathcal{W}$ compact (barring degenerate cases in which it is not feasible). Technically this happens because the set of integrable functions with bounded $L^1(\mu)$ norm is not compact in the weak-* topology, while the set of signed measures with bounded total variation is. (The optimum can be singular with respect to $\mu$, e.g. a sum of Dirac delta functions.)
Remark 5.
For any $\rho$ that satisfies the equality constraints of problem (24), the vector $(f_\rho(x_1),\dots,f_\rho(x_n))=\hat y$ lies in the convex hull of a suitable set of rescaled feature vectors. Hence, by Carathéodory's theorem, there exist at most $n+1$ weights that suffice to represent it. In particular, if $\mathcal{W}$ is compact, the minimum in problem (24) is always attained by a measure that is supported on at most $n+1$ points.
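To see where the computational difficulty lies, note that once a finite set of candidate weights is fixed, minimizing the total variation of an atomic measure under the interpolation constraints is basis pursuit, a linear program; the sketch below (with assumed inputs: the feature matrix evaluated at the candidate weights and the target values) makes this explicit. The hard part of the $\mathcal{F}_1$-problem is the search over the atoms themselves, not this finite-dimensional step.

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_interpolation(Phi, y):
    """Basis pursuit over a fixed finite set of candidate weights (a sketch).

    Solves  min ||a||_1  s.t.  Phi a = y  by splitting a = a_plus - a_minus
    with a_plus, a_minus >= 0, which turns the problem into a standard LP.
    """
    n, N = Phi.shape
    c = np.ones(2 * N)                    # objective: sum(a_plus) + sum(a_minus)
    A_eq = np.hstack([Phi, -Phi])         # Phi (a_plus - a_minus) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:N] - res.x[N:]
```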
Let us call problem (24) the $\mathcal{F}_1$-problem. In order to study the computational complexity of solving this problem, we introduce a weak version of the $\mathcal{F}_1$-problem, where we allow an error on the equality constraint:
minimize | (25) | |||
subj. to | ||||
where the minimization is now over $\rho\in\mathcal{M}(\mathcal{W})$, and an error is allowed on the equality constraints. For concreteness we will consider the truncated ReLU activation $\sigma(t)=\min(\max(t,0),1)$, but we believe it is possible to generalize our proofs to other activations at the cost of additional technical work. We further restrict $\mathcal{W}$ to be a rectangle in $\mathbb{R}^d$, possibly with some infinite sides.
Denote the set of rational numbers. We consider the following problem W-F1-PB which depends on a rational number :
-
: Given and . Denote the value of the weak -problem (25) with error on the constraints. Either
-
(1)
Assert that ; or,
-
(2)
Assert that .
We can think about W-F1-PB as the weak validity problem associated to the -problem (24). In particular, if we are able to solve the -problem within an additive error of the optimum and with at most -error on the equality constraints, we can solve W-F1-PB.
We show in the following theorem that W-F1-PB is hard to solve under the standard assumption that BPP (the bounded-error probabilistic polynomial time class) does not contain NP:
Theorem 2.
Let the activation function be the truncated ReLU $\sigma(t)=\min(\max(t,0),1)$, and let $\mathcal{W}$ be a rectangle in $\mathbb{R}^d$ (possibly with some infinite sides). Assuming $\mathrm{NP}\not\subseteq\mathrm{BPP}$, there exists an absolute constant such that the problem W-F1-PB is NP-hard.
By equality of the infimum of problems (23) and (24), Theorem 2 also implies the hardness of the original problem. The proof of Theorem 2 relies on a polynomial time randomized reduction from an -hard problem, the Maximum Agreement for Halfspaces problem. If we only assume , our reductions can be made deterministic using results from [GLS12]. However this deterministic reduction only rules out precision that are exponential in the number of bits, in particular it does not rule out precision .
We denote below by HS-MA the Maximum Agreement for Halfspaces problem. It was shown to be NP-hard in [FGKP06] and [GR09]. We will follow the notation of [GR09]. Consider a set of data points with $\pm 1$ labels. Denote
(26) |
The HS-MA problem depends on a rational number (we slightly simplified the statement in [GR09]):
-
: Distinguish the following two cases
-
(1)
There exists a half space such that ; or,
-
(2)
For any half space , we have .
[GR09] showed that, for all values of this parameter, the problem HS-MA is NP-hard.
Below we briefly describe the main ideas of the proof of Theorem 2 and we defer the details to Appendix D. In order to reduce HS-MA to W-F1-PB, we use the following intermediate problem:
maximize | (27) | |||
subj. to | ||||
We denote by W-VAL the weak validity problem associated to problem (27). First notice that we can equivalently rewrite the constraint set as a convex set, and it is easy to see that W-F1-PB can be used as a weak membership oracle for this set. [LSV18] shows that there exists a polynomial-time randomized algorithm that solves W-VAL given a weak membership oracle such as W-F1-PB, for some constant accuracy parameter. Hence there is a randomized reduction from W-VAL to W-F1-PB. Secondly, the problem (27) for the vector of all ones has the same value at optimum as the problem
maximize | (28) | |||
subj. to |
It is easy to see that we can construct data points and weights such that the problem (28) coincides with the quantity defined in Eq. (26) at the optimum. Hence, there is an easy deterministic reduction from HS-MA to W-VAL.
In summary, we used the following two reductions
$$\textsf{HS-MA} \;\le_D\; \textsf{W-VAL} \;\le_R\; \textsf{W-F1-PB}, \qquad (29)$$
where $A \le_R B$ (resp. $A \le_D B$) means that there exists a polynomial time randomized (resp. deterministic) reduction from $A$ to $B$.
5 Examples
5.1 A numerical illustration
In our first example and where is an activation function with (this assumption is only to simplify some of the formulas below and can be removed). In this case the featurization map is deterministic, i.e., .
Figure 1: Average test error of minimum complexity interpolation on synthetic data, as a function of the number of random features N for fixed sample size (left), and as a function of the sample size n (right).
Before checking the assumptions of our main theorem in this context, we present a numerical illustration in Figure 1. We generate synthetic data with , and where is a fixed unit vector, , and is the ReLU activation. As mentioned above, we use the featurization map with weights .
We fix and solve the minimum complexity interpolation problem (5), using . We first fix the sample size and report the average test error as a function of the number of features (left plot) for several values of . Notice that in the case , the limit corresponds to kernel ridge regression, and hence is directly accessible. We then consider for each a value of that is large enough to obtain a rough approximation of the infinite width limit, and plot the test error as a function of the sample size (right plot).
A few remarks are in order:
1. For $p>1$, the test error appears to settle on a limiting value after $N$ becomes large enough.
2. The required number of random features appears to increase as $p$ decreases. For $p=1$, we are not able to reach the limit with practical values of $N$.
3. As $p$ decreases, the test error achieved by minimum complexity interpolation decreases.
Notice that points 1 and 2 are consistent with our main result, Theorem 1. Point 3 is consistent with the notion that the class $\mathcal{F}_1$ better captures functions that depend strongly on low-dimensional projections of the covariate vectors.
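For reference, a simplified version of this experiment can be reproduced as follows. The data-generating process (Gaussian covariates, a ReLU ridge-function target, ReLU features with Gaussian first-layer weights) and all parameter values are assumptions chosen for illustration and need not match the figures exactly.

```python
import numpy as np
import cvxpy as cp

def min_lp_interpolation(Phi, y, p):
    # Minimum l_p-norm interpolation over the random features coefficients.
    a = cp.Variable(Phi.shape[1])
    cp.Problem(cp.Minimize(cp.pnorm(a, p)), [Phi @ a == y]).solve()
    return a.value

def test_error(n=100, d=20, N=400, p=1.5, seed=0):
    rng = np.random.default_rng(seed)
    relu = lambda t: np.maximum(t, 0.0)
    v = np.zeros(d); v[0] = 1.0                     # fixed unit vector
    X = rng.normal(size=(n, d))                     # training covariates
    y = relu(X @ v)                                 # noiseless ridge-function target
    W = rng.normal(size=(N, d)) / np.sqrt(d)        # random first-layer weights
    a = min_lp_interpolation(relu(X @ W.T), y, p)   # fit the interpolator
    Xt = rng.normal(size=(5000, d))                 # fresh test covariates
    return np.mean((relu(Xt @ W.T) @ a - relu(Xt @ v)) ** 2)

if __name__ == "__main__":
    for p in (2.0, 1.5, 1.2):
        print(p, test_error(p=p))
```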
5.2 Non-linear random features model
We next check the assumptions of Theorem 1 for the case of non-linear random features model .
Proposition 1.
Assume that is -Lipschitz and , for all , and that the random weights are mean 0 and satisfy the transportation cost inequality
(30) |
where $W_1$ denotes the Wasserstein distance and $D_{\mathrm{KL}}$ the relative entropy (Kullback-Leibler divergence). Then, FEAT1 and FEAT2 are satisfied with constants and , where
(31) |
Proof of Proposition 1.
Let us begin with condition FEAT1. Notice that for any ,
Hence is -Lipschitz. By assumption (30) and Bobkov-Götze theorem [BG99], is -sub-Gaussian with respect to , for any fixed . Further .
Similarly, for any , ,
We deduce that is -sub-Gaussian with respect to , and .
Next consider condition FEAT2. We have that is Lipschitz with respect to with Lipschitz constant . Using that is -Lipschitz with respect to , we have
and therefore there exists an absolute constant that only depend on and such that
∎
To make Proposition 1 more concrete, let us make the following remarks:
- (a)
-
(b)
If is a -sub-Gaussian random vector, then by Lemma 9, there exists constants depending only on , such that with probability at least .
Proposition 1 only focused on FEAT1 and FEAT2. The last two assumptions FEAT3 and FEAT3’ are more difficult to check. The next proposition provides a class of activation functions for which FEAT3 is satisfied.
Proposition 2.
Assume with . Let be the -th Hermite coefficient of the activation function (where , and the normalization is assumed).
Consider and for some constant . If is bounded away from zero and for all large enough, then FEAT 3 is satisfied with high probability with and nonrandom independent of .
A more general version of this result is proved in Appendix C for FEAT3. While we expect FEAT3 and FEAT3’ to hold more generally, we do not have a more general proof at the moment. We can bypass this difficulty below by considering noisy features.
Proposition 3.
Consider the same setting as in Proposition 1. Define and where . Then FEAT1, FEAT2, FEAT3 and FEAT3’ are satisfied with constants , and
(32) | ||||
(33) |
Proof of Proposition 3.
First notice that , where we denoted the matrix . FEAT1 and FEAT2 are verified in Proposition 1 with the difference that for , ,
where , and therefore is a -sub-Gaussian random vector with
Let us now check FEAT3. Define . Recalling that we have with independent of , we have
This shows that FEAT3 is satisfied with the stated value of .
Finally, let us check FEAT3’. Consider . Letting , we have
where in the last inequality we used the fact that . FEAT3’ is satisfied with . ∎
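A minimal sketch of this noisy featurization, with an assumed noise level `tau`: independent Gaussian noise is added to every feature evaluation, which is the mechanism used above to enforce the small-ball conditions.

```python
import numpy as np

def noisy_features(Phi, tau, seed=0):
    # Randomized featurization: phi(x; w, u) = phi(x; w) + tau * u with u ~ N(0, 1),
    # drawn independently for every (data point, feature) pair. The added noise
    # makes the small-ball conditions FEAT3 / FEAT3' easy to verify (cf. Prop. 3).
    rng = np.random.default_rng(seed)
    return Phi + tau * rng.normal(size=Phi.shape)
```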
Corollary 2.
Further assume to be independent centered isotropic, with almost surely. Then there exist constants depending uniquely on (but not on or the distribution of the data ), such that conditions FEAT1, FEAT2, FEAT3 and FEAT3’ are satisfied with probability at least for the constants
(34) |
In particular, consider the case , . If are independent , then with probability at least , we have for ,
(35) |
Proof of Corollary 2.
5.3 The latent linear model
This section provides a simple example in which the estimates derived above can be strengthened. We consider the following model
which we will refer to as the ‘latent linear model.’
While the latent linear model is extremely simple, it was shown in some settings to have the same asymptotic behavior as the noiseless nonlinear random features model [MM19, HL20]. For instance, consider the case , . Then is approximately . Decompose , where is orthogonal to linear and constant functions in , with the standard normal measure. Then [MM19] shows that in the proportional asymptotics , ridge regression in the nonlinear random features model behaves as ridge regression with the latent linear model , with independent of . Here we study the latent linear model from a different perspective and in a broader context than [MM19, GLK+20, HL20].
We make the following assumptions:
- A1 (Covariates distribution) The covariates are -subgaussian random vectors with zero mean and second moment and bounded support .
- A2 (Features distribution) The features follow a multivariate Gaussian distribution with zero mean and covariance . Furthermore, assume that .
- A3 (Features noise distribution) We assume that .
Proposition 4.
Assume that conditions A1, A2 and A3 hold. Then there exists such that for , the conditions FEAT1, FEAT2, FEAT3 and FEAT3’ hold with probability at least with constants depending only on the constants in A1, A2 and A3. In particular, can be taken to be independent of .
Proof of Proposition 4.
We begin by noticing that .
Consider condition FEAT1. We have is mean 0. By A1 and A2, is -subgaussian. Furthermore, for any fixed , is by construction isotropic, and Gaussian (since and are). Therefore, it is -subgaussian and mean 0.
FEAT2 is easily verified with and .
For condition FEAT3, note that for any unit vector . Therefore we have , whence we can take .
Finally, for FEAT3’, for . ∎
We finally notice that, for the latent linear model, we can improve Theorem 1. For the latent linear model, we can use the constraint that the (randomized) predictor interpolates the data, to improve the bound (20) by a factor . We expect this insight to generalize to the non-linear setting.
Proposition 5.
Assume conditions A1, A2 and A3 hold. There exists constants depending only on the constants in those assumptions such that if and , then with probability at least ,
(36) |
where we denoted .
6 Proof of Theorem 1: Convergence to the population predictor
The proof of Theorem 1 is structured as follows. We first define three events on which the finite-width dual objective and the random features predictor satisfy certain concentration properties. Lemma 1 shows that the simultaneous occurrence of these events implies a bound on . We then verify that these events occur with high-probability.
Throughout the proofs, we will use the notation for and a positive semidefinite matrix. We also use the standard big-Oh and little-o notations whereby the subscript is used to indicate the asymptotic variable. For instance, we write if as .
The three events mentioned above are defined as follows:
1. (Uniform concentration of the predictor) Event is the event that for all
2. (Concentration of dual gradient at ) Event is the event that
3. (Uniform lower bound on dual curvature) Event is the event that for all
The first event corresponds to the random features predictor approximating the infinite-width predictor uniformly well over in a region around . Events and relate to local properties of the finite-width dual objective around : on , the gradient of the dual objective concentrates around the gradient of the infinite width dual objective ; on , the Hessian of the dual objective has maximum eigenvalue uniformly bounded away from for in a region around .
Because these three events involve concentration of, or bounds on, empirical means over a sample of features, it is perhaps not surprising that the preceding bounds can be established with high probability, for appropriately chosen parameters, when $N$ is sufficiently large compared to $n$.
If the infinite-width predictor satisfies a certain continuity property in , then the events , , imply a bound on .
Lemma 1.
Assume that for all ,
(37) |
for some . If , then on events , we have
The continuity property for used in Lemma 1 is much easier to establish than the corresponding continuity property of . In the setting of Theorem 1, we will show that we can choose and such that the above events hold with high probability. Then, we will have eventually, and Lemma 1 can be applied.
Proof of Lemma 1.
Consider with . By Taylor’s theorem, there exists on the line segment between and , such that
When both and occur and using , we get
Because for all with , and is strictly convex at by we conclude
In this case, by Eq. (37),
If event occurs, then
Combining the previous displays with the triangle inequality completes the proof. ∎
We next state three lemmas implying that events , , hold with high probability, as well as the continuity of in the last lemma. Proofs of these lemmas are deferred to the appendices. We begin with the continuity property of the infinite width predictor.
Lemma 2.
If either assumptions FEAT1 and PEN hold with , or assumptions FEAT1, FEAT3’ and PEN hold with , then there exists depending only on the constants in FEAT3’ and PEN, but not on , such that for all ,
(38) |
where , .
Next we state a lemma to check condition .
Lemma 3.
Assume FEAT1, FEAT2 and PEN hold. Then there exist depending only on the constants in those assumptions, but not on , such that for , we have with probability at least ,
(39) | ||||
where and .
The next lemma allows us to check event .
Lemma 4.
Assume FEAT1 and PEN hold. There exist depending only on the constants in those assumptions, but not on , such that for , we have with probability at least ,
(40) | ||||
where and .
We finally state a lemma to check event .
Lemma 5.
Assume that FEAT1, FEAT3 and PEN hold, and further assume that for some absolute constant . Define . There exist depending only on the constants in the assumptions, but not on , such that for , we have with probability at least ,
(41) |
Using the above lemmas, we are now in position to prove our main result, Theorem 1.
Proof of Theorem 1.
Recall we define , , . Assume as in the statement. By Lemmas 3, 4, 5 events , , hold with probability
with constants
Further, by Lemma 2, the continuity property of Eq. (37) holds with
We can now apply Lemma 1. Since and, without loss of generality, , we obtain that, with probability at least ,
(42) | |||
(43) |
where we verify that by the assumption that .
Let us bound . We have by the optimality condition . Therefore, denoting and using PEN,
We next notice that, for any , and any - sub-Gaussian random variable with , . This basic fact is proved in Appendix F. We apply this inequality to to get
Using this bound together with Eqs. (42), (43) yields the claim of the theorem. ∎
Acknowledgements
T.M. thanks Enric Boix-Adsera for helpful discussions about hardness results. This work was supported by NSF through award DMS-2031883 and from the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning. We also acknowledge NSF grants CCF-2006489, IIS-1741162 and the ONR grant N00014-18-1-2729. M.C. was supported by the National Science Foundation Graduate Research Fellowship under grant DGE-1656518.
References
- [Bac17] Francis Bach, Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research 18 (2017), no. 1, 629–681.
- [Bar98] Peter L Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE transactions on Information Theory 44 (1998), no. 2, 525–536.
- [BBV06] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala, Kernels as features: On kernels, margins, and low-dimensional mappings, Machine Learning 65 (2006), no. 1, 79–94.
- [BG99] Sergej G Bobkov and Friedrich Götze, Exponential integrability and transportation cost related to logarithmic sobolev inequalities, Journal of Functional Analysis 163 (1999), no. 1, 1–28.
- [Bis95] Chris M Bishop, Training with noise is equivalent to Tikhonov regularization, Neural computation 7 (1995), no. 1, 108–116.
- [BLM13] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence, OUP Oxford, 2013.
- [BRV+06] Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte, Convex neural networks, Advances in neural information processing systems, 2006, pp. 123–130.
- [CLvdG20] Geoffrey Chinot, Matthias Löffler, and Sara van de Geer, Minimum norm interpolation via basis pursuit is robust to errors, arXiv:2012.00807 (2020).
- [COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach, On lazy training in differentiable programming, Advances in Neural Information Processing Systems, 2019, pp. 2937–2947.
- [CW01] Anthony Carbery and James Wright, Distributional and norm inequalities for polynomials over convex bodies in , Mathematical research letters 8 (2001), no. 3, 233–248.
- [DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, arXiv:1810.02054 (2018).
- [FGKP06] Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and Ashok Kumar Ponnuswami, New results for learning noisy parities and halfspaces, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), IEEE, 2006, pp. 563–574.
- [GLK+20] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová, Generalisation error in learning with random features and the hidden manifold model, arXiv:2002.09339 (2020).
- [GLS12] Martin Grötschel, László Lovász, and Alexander Schrijver, Geometric algorithms and combinatorial optimization, vol. 2, Springer Science & Business Media, 2012.
- [GLSS18] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro, Characterizing implicit bias in terms of optimization geometry, Proceedings of Machine Learning Research, vol. 80, PMLR, 2018, pp. 1832–1841.
- [GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Linearized two-layers neural networks in high dimension, Annals of Statistics (2019), arXiv:1904.12191.
- [GMMM20] , When do neural networks outperform kernel methods?, Advances in Neural Information Processing Systems 33 (2020).
- [GR09] Venkatesan Guruswami and Prasad Raghavendra, Hardness of learning halfspaces with noise, SIAM Journal on Computing 39 (2009), no. 2, 742–765.
- [HKZ12] Daniel Hsu, Sham Kakade, and Tong Zhang, A tail inequality for quadratic forms of subgaussian random vectors, Electron. Commun. Probab. 17 (2012), 6 pp.
- [HL20] Hong Hu and Yue M Lu, Universality laws for high-dimensional learning with random features, arXiv:2009.07669 (2020).
- [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Advances in neural information processing systems, 2018, pp. 8571–8580.
- [LL18] Yuanzhi Li and Yingyu Liang, Learning overparameterized neural networks via stochastic gradient descent on structured data, Advances in Neural Information Processing Systems, 2018, pp. 8157–8166.
- [LR20] Tengyuan Liang and Alexander Rakhlin, Just interpolate: Kernel “ridgeless” regression can generalize, Annals of Statistics 48 (2020), no. 3, 1329–1347.
- [LRZ19] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai, On the risk of minimum-norm interpolants and restricted lower isometry of kernels, arXiv:1908.10292 (2019).
- [LS20] Tengyuan Liang and Pragya Sur, A precise high-dimensional asymptotic theory for boosting and min-l1-norm interpolated classifiers, arXiv:2002.01586 (2020).
- [LSV18] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala, Efficient convex optimization with membership oracles, Conference On Learning Theory, PMLR, 2018, pp. 1292–1294.
- [MKL+20] Francesca Mignacco, Florent Krzakala, Yue Lu, Pierfrancesco Urbani, and Lenka Zdeborova, The role of regularization in classification of high-dimensional noisy gaussian mixture, International Conference on Machine Learning, PMLR, 2020, pp. 6874–6883.
- [MM19] Song Mei and Andrea Montanari, The generalization error of random features regression: Precise asymptotics and double descent curve, arXiv:1908.05355 (2019).
- [MMM21] Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration, arXiv:2101.10588 (2021).
- [MWW20] Chao Ma, Stephan Wojtowytsch, and Lei Wu, Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t, arXiv:2009.10713 (2020).
- [NTSS17] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro, Geometry of optimization and implicit regularization in deep learning, arXiv:1705.03071 (2017).
- [OS19] Samet Oymak and Mahdi Soltanolkotabi, Towards moderate overparameterization: global convergence guarantees for training shallow neural networks, arXiv:1902.04674 (2019).
- [RR08] Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines, Advances in neural information processing systems, 2008, pp. 1177–1184.
- [RR09] , Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning, Advances in neural information processing systems, 2009, pp. 1313–1320.
- [RR17] Alessandro Rudi and Lorenzo Rosasco, Generalization properties of learning with random features, Advances in Neural Information Processing Systems, 2017, pp. 3215–3225.
- [SSBD14] Shai Shalev-Shwartz and Shai Ben-David, Understanding machine learning: From theory to algorithms, Cambridge university press, 2014.
- [Ver10] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv:1011.3027 (2010).
- [Ver18] , High-dimensional probability: An introduction with applications in data science, vol. 47, Cambridge university press, 2018.
- [vH14] Ramon van Handel, Probability in high dimension, Tech. report, Princeton University, 2014.
- [YS19] Gilad Yehudai and Ohad Shamir, On the power and limitations of random features for understanding neural networks, arXiv:1904.00687 (2019).
Appendix A Verifying the continuity and concentration conditions
A.1 Continuity property of the infinite width problem: Proof of Lemma 2
By the fundamental theorem of calculus,
(44) |
where . We will show that
(45) |
(i.e., we may exchange integration and differentiation), and we will bound the right-hand side. In what follows, we will use the shorthand for the evaluation of the featurization map at the datapoints.
For and such that , we have by Hölder’s inequality,
(46) | ||||
Denote . If , we set , , and . In this case, one can check that and . Otherwise, we set and . By FEAT1,
where depends only on . Furthermore, by either (i) PEN and FEAT1 in the case or (ii) PEN, FEAT 1, and FEAT3’ in the case , we have
where we denoted , the constant changes in the last line, and we used that and that either (i) FEAT1 and in the case that or (ii) FEAT1, FEAT3’, and in the case that . We see that the expectation is bounded for some and all , whence is uniformly integrable. We conclude that we may exchange differentiation and expectation, justifying Eq. (45).
A.2 Uniform concentration of the predictor: Proof of Lemma 3
Throughout the proof, we will denote by constants that depend on the constants in assumptions FEAT1, FEAT2 and PEN, but not on and . The value of these constants is allowed to change from line to line.
Step 1. Decoupling.
Let be convex with for and satisfy for some ,
(49) |
Below denote by the expectation with respect to the ’s and the randomization in (for notational simplicity, we will omit below the subscripts from the expectations). Further, we will denote for the evaluation of the feature map at the data points, and for the corresponding isotropic random vector. Then
where (1) uses that is nondecreasing; (2) uses that is convex and Jensen’s inequality; in (3), we denoted independent Rademacher random variables and used that the distribution is symmetric and for ; (4) uses Eq. (49) and that is a symmetric random variable and is a nondecreasing function.
Step 2. Concentration of .
We bound the tail of for fixed with , . Denote by the terms in the sum in , i.e., . By Eq. (19), there exists such that , where (for simplicity we have loosened the bound for when ). By Hölder inequality, we have for any (potentially non-integer)
(50) | ||||
where depends only on , and we used FEAT1, i.e., and are -sub-Gaussian.
For , we have directly by Eq. (50) and [BLM13, Theorem 2.10],
(51) |
For , define for the truncated random variable . Then setting , we have
(52) | ||||
We deduce again by [BLM13, Theorem 2.10] with Eq. (52), that
Furthermore, using the bound (50), we have
Combining the above two displays and taking , we conclude that for and fixed , , we have
(53) | ||||
Step 3. Uniform concentration of on , .
We now evaluate the concentration of uniformly over , . By the tail bound on in FEAT2 and that is a -sub-Gaussian random vector by FEAT1, there exists constants independent of , such that for all [HKZ12],
(54) |
Let and . Then, by assumption PEN, for any , we have
Let . On the event (54) and for , we have . Furthermore, by FEAT2, we have . For convenience, let us denote so that . On the event (54), whenever , we have (the constants below may depend on but not on , )
Fix . Let , and define
so that as soon as and . Let and be -net of and respectively, where we define . Note that . Then taking and recalling Eqs. (51) and (53), we have
where in the second to last inequality we have used the definition of and adjusted constants appropriately, and in the last inequality we have used the definition of . Consider and set . Then
(55) | ||||
Step 4. Concluding the proof.
For a constant that will be set sufficiently large, define
Consider which is convex and verifies Eq. (49) with and . Then we have
Using Eq. (55) and the inequalities and , we get
Taking sufficiently large, and using that (where we recall that we assumed , we deduce that there exists constants that depend only on the constants of FEAT1, FEAT2 and PEN (except ) such that
with probability at least . The proof is complete.
A.3 Point-wise concentration of dual gradient: Proof of Lemma 4
The proof follows from the same argument as in the proof of Lemma 3 and we will only highlight the differences.
Step 1. Decoupling.
First notice that we can rewrite
Denote
For with the properties listed in step in the proof of Lemma 3, we have
Step 2. Concentration of .
Denote the terms in the sum in , i.e., . By FEAT1 and PEN, and denoting , , we have by Hölder inequality, for any and ,
For and , define . Then setting , we have
Following step 2 in the proof of Lemma 3, in particular taking in the case of , we get
(56) |
Step 3. Uniform concentration of on .
We consider the event for ,
(57) |
where we used that the are -sub-Gaussian by FEAT1. On this event, we have and by PEN, we get
Fix and , so that as soon as . Taking and , we have for ,
(58) | ||||
Step 4. Concluding the proof.
Following the same argument as in Lemma 3, there exists constants that depend only on the constants of PEN such that
with probability at least .
A.4 Uniform lower bound on the dual Hessian: Proof of Lemma 5
Define , where and are as in assumption PEN, and define as
(59) |
The function is -Lipschitz. By assumption PEN, there exists depending only on the constants in that assumption such that for all
(60) |
Thus, for such that and denoting , we have
where
Define . As already explained in Remark 2, the fact that is isotropic, together with assumption FEAT3 imply that, for any two unit vectors , we have . Using the fact that , we deduce that
Thus,
Thus, for . Furthermore, by FEAT1 and that , we have are -sub-Gaussian. Applying Lemma 9 to the vectors with and , and noting that , we conclude that there exists a constant such that for
Denote . For constant sufficiently large, consider the event
We have for ,
where we used that are -subgaussian by FEAT1 to bound the first term, and that and Lemma 9 to bound the second term.
On the event , we have
where we used that is Lipschitz. Let be a minimal -net of , with , so that as soon as . We have
where the last inequality follows since by assumption , . By taking for a sufficiently large constant we obtain with probability at least as claimed.
Appendix B The latent linear model: Proof of Proposition 5
We consider the latent linear model presented in the main text (Section 5.3). We have , where are the iid features noise. Denote and . The features matrix is given by
(61) |
The random features and infinite-width predictors are given by
where is applied component-wise. The distance between and is therefore given explicitly by
(62) | ||||
The following lemma is the key result that allows us to improve on Theorem 1 and gain a factor .
Lemma 6.
Assume and (the minimum singular value of ). Then if , we have
(63) |
We remark that Lemma 6 is entirely deterministic. This may seem surprising because and are random. In fact, the proof of Lemma 6 relies on a deterministic argument which uses the fact that both the infinite-width and random features predictors interpolate the training data, i.e., for ,
or equivalently,
(64) |
Note that here we have used the form of the infinite-width primal problem (see Section E.1).
Proof of Lemma 6.
By Eq. (62) and the bound , we have
Denote . We have
and
By the interpolation constraints (64) and recalling the expression of the features matrix in Eq. (61), we write
and
The first terms on the right-hand sides of the preceding two expressions are the same (up to a factor ). We deduce that
(65) | ||||
From the expression of , we have . By the assumption that and , we have . Similarly, we have . Rearranging the terms in Eq. (65) implies Eq. (63). ∎
Proof of Proposition 5.
By Proposition 4, FEAT1, FEAT2, FEAT3 and FEAT3’ are satisfied with probability . Hence, we can use the results proved in the proof of Theorem 1. Replace the event by the event that for all ,
Using the same proof as in Lemma 4, where the vector is a -sub-Gaussian random vector by A3, there exists such that for , taking , we have . Consider the same events (with same ) and as in the proof of Theorem 1, except with —which do not depend on —absorbed into the constants . By the same proof as in Lemma 2, we can show that there exists a constant such that for all ,
(66) |
The same argument as in Lemma 1 implies that on events , we have
(67) |
Recalling the argument in the proof of Theorem 1, we have , where we used that by A3, . Hence with probability at least , we have
(68) |
Furthermore, from assumption A1 and Lemma 9, there exists such that with probability at least . Hence, we can use Lemma 6, which concludes the proof. ∎
Appendix C Small ball property under fast decay of Hermite coefficients
In this section, we show FEAT3 for a deterministic feature map with the activation function verifying a decay condition on its Hermite coefficients.
Recall that for any function with the standard Gaussian measure, we have the decomposition
where are the Hermite polynomials with standard normalization and we call the -th Hermite coefficient of .
Throughout this section, we will consider and for all . For an integer , consider the decomposition of the activation into a low-degree and a non-polynomial parts
(69) |
For and taking , we can decompose the kernel function into
(70) |
where
(71) | ||||
(72) |
Therefore the empirical kernel matrix can be decomposed into where and .
Similarly, we introduce
(73) | ||||
(74) |
Notice that and . Denote and .
With these notations, we can now introduce our result.
Proposition 6.
Assume that for all and , and that . There exists an absolute constant such that the following holds. If for an integer ,
(75) |
then we have
(76) |
where .
Proof of Proposition 6.
Throughout this proof, we will denote a generic absolute constant. In particular, is allowed to change from line to line.
Consider , , and an integer such that condition (75) is satisfied with a constant that will be fixed later, and decompose
(77) |
The first term is a polynomial of degree in . From Carbery-Wright inequality [CW01], we have the following anti-concentration bound
Note that
where in the last inequality, we used from condition (75). Hence
(78) |
Note that for , we have and therefore we expect the off-diagonal elements of to have negligible operator norm for sufficiently large. In fact, it was shown in [GMMM19] (see [MMM21] for a generalization), that for any constant and integer , if , then for any , we have , where
Hence for and sufficiently large, condition (75) is verified with high probability over the data , if there exists such that
Appendix D Proof of Theorem 2: -hardness of learning with norm
We borrow some notation and terminology from [GLS12]. We consider the convex set defined by
(80) | ||||
From our choice of the truncated ReLU activation, we have and , where we denoted the ball of center and radius , i.e., . In our reductions, we will further need to assume that there exists such that . We will check that indeed we can choose during our reduction. We will denote the set such that . For , we let denote the -neighborhood of , i.e.,
Similarly, we will denote the interior -ball of defined by
For convenience, we recall here the different problems of interest. The weak version of the -problem is given by
minimize | (81) | |||
subj. to | ||||
while the intermediary optimization problem reads with the new notations
maximize | (82) | |||
subj. to |
We will consider the following problems:
-
: given and . Denote the value of the weak -problem (81). Either
-
(1)
assert that ; or,
-
(2)
assert that .
-
: given . Either
-
(1)
assert that ; or,
-
(2)
assert that .
-
(1)
-
: given . Either
-
(1)
assert that for all ; or,
-
(2)
assert that for some .
-
(1)
-
: distinguish the following two cases
-
(1)
there exists a half space such that ; or,
-
(1)
for any half space , we have .
-
(1)
W-F1-PB corresponds to a weak validity problem associated to the weak -problem (81); W-MEM is the weak membership problem associated to the convex set ; W-VAL is the weak validity problem associated to the intermediary optimization problem (82); and HS-MA is the Maximum Agreement for Halfspaces problem.
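Schematically, the chain of reductions established below can be summarized as follows, where each arrow reads "reduces in (randomized) oracle-polynomial time to"; combined with Theorem 3, the hardness of HS-MA then transfers along the chain to W-F1-PB:
% HS-MA -> W-VAL by Lemma 8; W-VAL -> W-MEM -> W-F1-PB by Lemma 7 (via [LSV18]).
\mathrm{HS\text{-}MA}
  \;\longrightarrow\; \mathrm{W\text{-}VAL}
  \;\longrightarrow\; \mathrm{W\text{-}MEM}
  \;\longrightarrow\; \mathrm{W\text{-}F1\text{-}PB}.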
We will use the following hardness result on HS-MA:
Theorem 3 (Theorem 8.1 in [GR09]).
For all , the problem HS-MA is -hard.
Let us first prove that there exists a polynomial time randomized reduction from W-VAL to W-F1-PB.
Lemma 7.
There exists an absolute constant and an oracle-polynomial time randomized algorithm that solves the weak validity problem W-VAL given a W-F1-PB oracle, where .
Proof of Lemma 7.
Let us first show that one can use W-F1-PB to solve .
Consider and call W-F1-PB with , and . First, we know that , hence if , we can directly assert that . We assume from now on that . If W-F1-PB asserts that , then it means there exists with and associated measure . Consider , then has associated measure and with our choice of . Hence we can assert that . If W-F1-PB asserts that , then it means in particular that has associated measure . Consider , then has associated measure and with our choice of . Hence we can assert that .
Having shown that we can implement a weak membership oracle using W-F1-PB with , we can use the results in [LSV18], for example their Theorem 21 (using the sequence of reductions MEM to SEP to OPT to VAL), which shows that there exists an absolute constant and a randomized reduction from the weak validity problem W-VAL to the weak membership problem with . ∎
Lemma 8.
There exists an oracle-polynomial time algorithm that solves the problem HS-MA given a weak validity W-VAL oracle, where and .
Proof of Lemma 8.
First let us show that
(83)
Notice that, with our truncated ReLU activation, has all its coordinates non-negative for any . Hence,
where we denoted and non-negative measures on . Hence, we directly have and the converse inequality is obtained by taking the supremum over with .
Let us now prove the reduction from HS-MA to W-VAL. Consider and vectors . Denote , and for and for . Writing now as , we see that
Denote now . Recall that : if we take , we can always rescale by a constant large enough such that has only or values, and
Using Eq. (83), we can easily find a reduction from HS-MA to the strong version of W-VAL (with ). However, we will do a slightly more complicated reduction, in order to ensure that there exists such that and we can take .
Consider where is the canonical basis in (the vector with at the ’th coordinate and otherwise). Consider . In this case, by taking , we have . Hence contains all and therefore contains . Denote the function associated to the . By -Lipschitzness of the truncated ReLU, we have
(84)
Let us call W-VAL with , and and that will be fixed sufficiently small later. Consider the case , i.e., there exists a halfspace that makes fewer than errors. In that case, we show that, for sufficiently small, the oracle must assert that for some . Let us construct such that . Denote such that , and consider . Because is convex and contains and , it must contain the cone with apex and base the section of perpendicular to . In particular, (for sufficiently large). Furthermore, . Setting and , we have such that for some .
Consider now the case , i.e., halfspaces classify at most vectors correctly. In that case, we show that, for sufficiently small, the oracle must assert that for all . To do so, we show that for any , we must have . Consider an arbitrary ; there exists such that . Using Eqs. (83) and (84), we have . Hence . Taking and , we have .
Combining the above conditions, we see that there exists an absolute constant such that the W-VAL oracle with and allows us to distinguish the two cases of HS-MA. ∎
We are now ready to prove Theorem 2.
Appendix E Additional properties of the minimum interpolation problem
E.1 The infinite-width primal problem for randomized features
In the case of randomized features, we wrote the infinite-width dual problem (Eq. (9)), but we did not write the infinite-width primal problem to which it corresponds. Indeed the dual problem entirely defines the predictor via Eq. (16). For the sake of completeness, we present the infinite-width primal problem here.
It is convenient to make the randomization mechanism explicit by drawing (with a probability space) independent of , and writing (without loss of generality, the reader can assume ). Therefore, we can rewrite Eq. (14) as
where is a general measurable function. This expression shows immediately that the primal variable is a function of both and . Further, maximizing the Lagrangian over , we obtain the following primal problem that generalizes (5) to the case of randomized features:
minimize    (85)
subj. to    (86)
This is a natural limit of the finite-width primal problem (7), whereby we replace the weights by evaluations of the function , .
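A plausible way to write the pair (85)-(86) explicitly is the following sketch, where the symbols are ours ($\bar\lambda$ the penalty of problem (5), $\varphi(x;w)$ the featurization map, $\nu$ the weight distribution, $\tau$ the randomization distribution on $\mathcal{U}$) and may differ from the notation used above:
% Infinite-width primal with randomized features: optimize over measurable a: U x W -> R.
\text{minimize}\quad
  \int_{\mathcal{U}\times\mathcal{W}} \bar\lambda\big(a(u, w)\big)\,\tau(\mathrm{d}u)\,\nu(\mathrm{d}w)
\text{subject to}\quad
  \int_{\mathcal{U}\times\mathcal{W}} a(u, w)\,\varphi(x_i; w)\,\tau(\mathrm{d}u)\,\nu(\mathrm{d}w) \;=\; y_i,
  \qquad i = 1, \dots, n.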
Remark 7.
Note that, under assumptions FEAT1, FEAT3 (or FEAT3’ for ) and PEN, the minimizer to problem (5) (and its generalization (85)) exists and is unique. First of all notice that problem (85) is feasible. Indeed, choosing for , the interpolation constraint takes the form
(87)
which has solutions since is strictly positive definite by condition FEAT3. Let be such a feasible point.
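To make the feasibility step concrete, here is the finite linear system that the constraint (87) reduces to under our (illustrative) notation, writing $K_{ij} = \int \varphi(x_i; w)\,\varphi(x_j; w)\,\nu(\mathrm{d}w)$:
% Take a(w) = sum_j c_j phi(x_j; w) for coefficients c in R^n; the interpolation
% constraint becomes
\sum_{j=1}^{n} c_j \int \varphi(x_i; w)\,\varphi(x_j; w)\,\nu(\mathrm{d}w)
  \;=\; \sum_{j=1}^{n} K_{ij}\, c_j \;=\; y_i,
  \qquad i = 1, \dots, n,
% i.e. K c = y, which is solvable because K is strictly positive definite (FEAT3).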
Denote by the cost function. Notice that, by assumption PEN, for some constants , and , and therefore only if . It is therefore sufficient to focus on these functions . Further notice that the map is continuous under weak convergence in , since for every by assumption FEAT1. It follows that the set of feasible solutions of (85) satisfying is closed and bounded and hence weakly sequentially compact by Banach-Alaoglu. Finally, is weakly lower semicontinuous by Fatou’s lemma, and this implies existence of minimizers.
Uniqueness follows from the fact that is strictly convex, which implies that is also strictly convex.
E.2 Representer theorem for strictly convex and differentiable penalty
In the case of deterministic features (i.e. and ), we present here a generalization of the representer theorem to a broad class of penalties . Recall that in the case of , the representer theorem states that the solution belongs to the class of functions
(88)
In words, the solution is contained in the (at most) -dimensional linear subspace spanned by .
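For comparison, in the RKHS case the classical representer theorem gives the following explicit form (a standard fact, written in our own notation for the kernel $K$ and featurization $\varphi$):
% p = 2 (RKHS): the minimum-norm interpolant lies in span{K(., x_1), ..., K(., x_n)}.
\hat f(x) \;=\; \sum_{i=1}^{n} c_i\, K(x, x_i)
  \;=\; \int \Big(\sum_{i=1}^{n} c_i\,\varphi(x_i; w)\Big)\,\varphi(x; w)\,\nu(\mathrm{d}w),
% equivalently, the optimal weight function is a(w) = sum_i c_i phi(x_i; w).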
The following proposition generalizes this result to a penalty that is strictly convex.
Proposition 7 (Representer theorem for penalty ).
Note that we recover the representer theorem for RKHS when and . However for general , the loss function cannot be simplified by evaluating once.
Proof of Proposition 7.
We have
where we introduced the Lagrange multipliers in (1), we used strong duality in (2), and we used the definition of the convex conjugate in (3). At the optimum, we must have a.s. with respect to . ∎
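The duality chain can be sketched as follows, in our own notation (penalty $h$ applied pointwise, convex conjugate $h^*$, multipliers $\lambda \in \mathbb{R}^n$); this is an illustration of the standard computation rather than a verbatim reconstruction of the display above:
% Lagrangian for  min_a  int h(a(w)) nu(dw)  s.t.  int a(w) phi(x_i; w) nu(dw) = y_i:
\min_{a}\,\max_{\lambda}\;
  \int h\big(a(w)\big)\,\nu(\mathrm{d}w)
  \;+\; \sum_{i=1}^{n}\lambda_i\Big(y_i - \int a(w)\,\varphi(x_i; w)\,\nu(\mathrm{d}w)\Big)
% (2) strong duality, (3) definition of the convex conjugate:
  \;=\; \max_{\lambda}\;\langle\lambda, y\rangle
  \;-\; \int h^{*}\Big(\sum_{i=1}^{n}\lambda_i\,\varphi(x_i; w)\Big)\,\nu(\mathrm{d}w),
% and at the optimum a(w) = (h^*)'( sum_i lambda_i phi(x_i; w) )  nu-a.s.,
% so a depends on w only through phi(x_1; w), ..., phi(x_n; w).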
Appendix F Useful technical facts
We recall the following basic result on concentration of the empirical covariance of independent sub-Gaussian random vectors. This can be found, for instance, in [Ver18, Exercise 4.7.3] or [Ver10] (Theorem 5.39 and Remark 5.40(1)).
Lemma 9.
Let be independent -sub-Gaussian random vectors, with common covariance . Denote by the empirical covariance. Then there exist absolute constants such that, for all , we have
(90)
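One standard formulation of this bound (e.g. as in [Ver18]; the constants and the parametrization of the failure probability may differ from Eq. (90)) reads: for independent zero-mean $K$-sub-Gaussian vectors $x_1, \dots, x_n$ in $\mathbb{R}^d$ with covariance $\Sigma$ and empirical covariance $\hat\Sigma = n^{-1}\sum_i x_i x_i^{\mathsf{T}}$,
% For every u >= 0, with probability at least 1 - 2 e^{-u},
\big\|\hat\Sigma - \Sigma\big\|_{\mathrm{op}}
  \;\le\; C\,K^{2}\Big(\sqrt{\tfrac{d + u}{n}} + \tfrac{d + u}{n}\Big)\,\|\Sigma\|_{\mathrm{op}},
% where C is an absolute constant and K is the sub-Gaussian norm of the x_i
% (measured relative to their L^2 norm).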
We also state and prove two simple lemmas about moments of sub-Gaussian random variables.
Lemma 10.
For any and any there exists such that the following holds. For any random variable that is -sub-Gaussian with , we have
(91)
This inequality holds for , with and .
Proof.
Without loss of generality, we can assume . For , this is just Jensen’s inequality.
For , by Hölder's inequality, for any and any , we have
Setting , and using the fact that and is sub-Gaussian, we get that there exist constants finite as long as , such that
The claim follows by setting or . ∎
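As a self-contained illustration of the Hölder-plus-sub-Gaussian step used in this proof (generic exponents and absolute constants, not the exact choices made above):
% For a K-sub-Gaussian X, conjugate exponents 1/p + 1/q = 1 (p, q > 1), and a > 0:
\mathbb{E}\big[|X|^{m}\,\mathbf{1}\{|X| \ge a\}\big]
  \;\le\; \big(\mathbb{E}|X|^{mp}\big)^{1/p}\,\mathbb{P}\big(|X| \ge a\big)^{1/q}
  \;\le\; 2\,\big(C\sqrt{mp}\,K\big)^{m}\,\exp\!\Big(-\frac{c\,a^{2}}{q\,K^{2}}\Big),
% using the moment bound (E|X|^r)^{1/r} <= C sqrt(r) K and the tail bound
% P(|X| >= a) <= 2 exp(-c a^2 / K^2), with absolute constants C, c > 0.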
Lemma 11.
For any and any there exists such that the following holds. For any random variable that is -sub-Gaussian with , we have
(92)
This inequality holds for , with and .
Proof.
Without loss of generality, we can assume . For , this is just Jensen’s inequality.
For the other cases, notice that, for , . Hence, by Hölder's inequality, for all :
We then set , and invert this relationship to get
Taking , we can apply Lemma 10, to get
This proves the claim. ∎