Convex Regression in Multidimensions: Suboptimality of Least Squares Estimators
Abstract
Under the usual nonparametric regression model with Gaussian errors, Least Squares Estimators (LSEs) over natural subclasses of convex functions are shown to be suboptimal for estimating a $d$-dimensional convex function in squared error loss when the dimension $d$ is 5 or larger. The specific function classes considered include: (i) bounded convex functions supported on a polytope (in random design), (ii) Lipschitz convex functions supported on any convex domain (in random design), (iii) convex functions supported on a polytope (in fixed design). For each of these classes, the risk of the LSE is proved to be of the order $n^{-2/d}$ (up to logarithmic factors) while the minimax risk is $n^{-4/(d+4)}$, when $d \ge 5$. In addition, the first rate of convergence results (worst case and adaptive) for the unrestricted convex LSE are established in fixed design for polytopal domains for all $d \ge 1$. Some new metric entropy results for convex functions are also proved which are of independent interest.
1 Introduction
The main goal of this paper is to show that nonparametric Least Squares Estimators (LSEs) associated with the constraint of convexity can be minimax suboptimal when the underlying dimension is 5 or larger. Specifically, we consider regression over a variety of different classes of convex functions (and covariate designs), derive lower bounds on the rates of convergence of the corresponding least squares estimator over each class, and show that, when the dimension is 5 or larger, these rates of convergence do not match the minimax rate for the given class.
We work in the standard nonparametric regression setting for estimating an unknown convex function defined on a known full-dimensional compact convex domain in $\mathbb{R}^d$ from observations generated via the model:
(1) |
where the errors are i.i.d. mean-zero Gaussian random variables and the design points may be fixed or random. This problem is known as convex regression and has a long history in statistics and related fields. Standard references are [29, 28, 22, 21, 44, 31, 35, 3], and applications of convex regression can be found in [49, 50, 2, 38, 1, 30, 46].
The Least Squares Estimator (LSE) over a class of functions on the domain is defined as any minimizer of the least squares criterion over that class:
(2) |
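For readers who want a concrete picture of (2), here is a minimal computational sketch (not from the paper) of the convex LSE, assuming the `cvxpy` package is available: the infinite-dimensional least squares problem reduces to a finite quadratic program in the fitted values and subgradients at the design points, which is the standard finite-dimensional characterization of convex LSEs. The function name `convex_lse` and the optional argument `L` (one common way to impose a uniform Lipschitz constraint via subgradient norms) are illustrative choices, not the paper's notation.

```python
import numpy as np
import cvxpy as cp

def convex_lse(X, y, L=None):
    """Convex least squares fit at the design points X (n x d), responses y (n,).

    Variables: g_i = f(X_i) and s_i = a subgradient of f at X_i.
    Convexity is encoded by g_j >= g_i + <s_i, X_j - X_i> for all i, j.
    If L is given, ||s_i||_2 <= L imposes an L-Lipschitz constraint.
    """
    n, d = X.shape
    g = cp.Variable(n)
    S = cp.Variable((n, d))
    constraints = []
    for i in range(n):
        # tangent-plane (subgradient) inequalities at X_i, applied to all X_j at once
        constraints.append(g >= g[i] + (X - X[i]) @ S[i])
        if L is not None:
            constraints.append(cp.norm(S[i], 2) <= L)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(y - g)), constraints)
    problem.solve()
    return g.value, S.value

# tiny usage example on synthetic data from model (1) with a quadratic truth
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 2))
y = np.sum(X**2, axis=1) + 0.1 * rng.standard_normal(60)
fitted_values, _ = convex_lse(X, y)
```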
In order to introduce the specific LSEs studied in this paper, consider the following four function classes:
1. the class of all convex functions on the domain which are uniformly Lipschitz with a given Lipschitz constant and uniformly bounded by a given constant,
2. the class of all convex functions on the domain that are uniformly Lipschitz with a given Lipschitz constant (there is no uniform boundedness assumption on functions in this class),
3. the class of all convex functions on the domain that are uniformly bounded by a given constant (there is no Lipschitz assumption on functions in this class),
4. the class of all convex functions on the domain.
Throughout the paper, we assume that the uniform bound and the Lipschitz constant are positive constants not depending on the sample size. We study LSEs over each of the above function classes and we use the following terminology for these estimators:
1. Bounded Lipschitz Convex LSE,
2. Lipschitz Convex LSE,
3. Bounded Convex LSE,
4. Unrestricted Convex LSE.
Strictly speaking, the first of these should be called the “Uniformly Bounded Uniformly Lipschitz Convex LSE” but we omit the word “uniformly” throughout for brevity. Table 1 gives references to some previous papers that studied each of these estimators. For the unrestricted convex LSE, Table 1 lists only the papers focusing on the multivariate case (univariate investigations are in [28, 17, 22, 14, 20, 23, 12, 6, 11]).
| Function class | Name of LSE | References |
|---|---|---|
| bounded Lipschitz convex | Bounded Lipschitz Convex LSE | [41, 42, 36, 4] |
| Lipschitz convex | Lipschitz Convex LSE | [34, 39] |
| bounded convex | Bounded Convex LSE | [26] |
| all convex | Unrestricted Convex LSE | [44, 31, 35, 39, 13] |
This paper studies the performance of these LSEs and compares their rates of convergence with the corresponding minimax rates. For our results, we assume throughout the paper that the convex body (which is the domain of the unknown function ) is translated and scaled so that
(3) |
where is the unit ball in and is a positive constant depending on alone. In particular, this assumption implies that has diameter at most 2 and also volume at most 1. Also the diameter and volume of are bounded from below by a constant depending on alone. The classical John’s theorem (see e.g., [5, Lecture 3]) shows that for every convex body , there exists an affine transformation such that (3) holds for the transformed body with . In the rest of the paper, whenever we say that a result holds for “any convex body ” (such as in Tables 2, 3, 4, 5 and 6), we mean “any convex body satisfying (3)”.
We focus first on the random design setting where the design points are assumed to be independent having the uniform distribution on (see Subsection 5.3 for a discussion of more general design assumptions in random design), and work with the loss function
(4) |
The minimax rate for a function class in the above setting is:
where denotes expectation with respect to the joint distribution of all the observations when is the true regression function (see (1)), and the infimum is over all estimators .
Minimax rates for the above function classes 1–4 are presented in Table 2. These results are known except perhaps for which we state and prove in this paper as Proposition 2.1. These minimax rates can also be derived from the corresponding metric entropy rates which are given in Table 3. Recall that the -metric entropy of a function class with respect to a metric is defined as the logarithm of the smallest number of closed balls of radius whose union contains . Classical results from Yang and Barron [51] imply that the minimax rate (for the loss ) for a nonparametric function class satisfying some natural assumptions (see [51, Subsection 3.2] for a listing of these assumptions which are satisfied for and ) equals where solves the equation:
(5) |
| Class | Minimax rate | Assumption on domain | Reference |
|---|---|---|---|
| bounded Lipschitz convex |  | any convex body | [42], [4, Theorem 4.1] |
| Lipschitz convex |  | any convex body | this paper (Proposition 2.1) |
| bounded convex |  | polytope | [26, Theorems 2.3, 2.4] |
| bounded convex |  | ball | [26, Theorems 2.3, 2.4] |
| Class | Metric entropy | Bracketing entropy | Assumption on domain | Reference |
|---|---|---|---|---|
| bounded Lipschitz convex |  |  | any convex body | [9] |
| Lipschitz convex | infinite | infinite | any convex body |  |
| bounded convex |  |  | polytope | [19, 15] |
| bounded convex |  |  | ball | [19] |
For the bounded Lipschitz class, the minimax rate is $n^{-4/(d+4)}$, which is the square of the solution for $\epsilon$ in (5). For the Lipschitz class (without the uniform bound), the metric entropy is infinite as this class contains every constant function and hence is not totally bounded; the minimax rate is nevertheless still $n^{-4/(d+4)}$ and can be easily derived as a consequence of the minimax rate result for the bounded Lipschitz class (the argument is given in the proof of Proposition 2.1).
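For intuition, here is the calculation behind (5) written out under the assumption that the relevant metric entropy scales as $\epsilon^{-d/2}$ up to logarithmic factors (the classical rate for bounded Lipschitz convex functions; this specific exponent is our assumption here, not the paper's display):

```latex
\[
  \epsilon_n^{-d/2} \;\asymp\; n\,\epsilon_n^{2}
  \quad\Longrightarrow\quad
  \epsilon_n \;\asymp\; n^{-2/(d+4)}
  \quad\Longrightarrow\quad
  \epsilon_n^{2} \;\asymp\; n^{-4/(d+4)} .
\]
```

This matches the minimax rate quoted above for the Lipschitz classes.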
For the bounded convex class, as shown by Han and Wellner [26] and Gao and Wellner [19] respectively, the minimax rate and the metric entropy rate change depending on whether the domain is polytopal or has a smooth boundary. Here (and in the rest of the paper), when we say that a specific set is a “polytope”, we assume that its number of facets or extreme points is bounded by a constant not depending on the sample size. If the number of facets is allowed to grow with the sample size, then polytopes can approximate general convex bodies arbitrarily well, in which case it is not meaningful to give separate rates for polytopes and other convex bodies.
When is a polytope, the minimax rate for equals corresponding to the -entropy of . In contrast, when is the unit ball, the minimax rate is corresponding to the -entropy of . The larger metric entropy for the ball (compared to polytopes) is driven by the curvature of the boundary and the lower bound is proved by considering indicator-like functions of spherical caps (see [19, Subsection 2.10] for the metric entropy proofs). More high level intuition on the differences between the polytopal case and the ball case is provided in Subsection 5.1.
We now ask whether the LSEs (as defined via (2)) achieve the corresponding minimax rates. The supremum risk of an estimator on a function class is defined as
Existing results for the convex regression LSEs are described in Table 4 where the rate is given by
(6) |
| LSE | Supremum risk (up to logs) | Assumption on domain | Reference |
|---|---|---|---|
| Bounded Lipschitz Convex LSE |  | any convex body | [4] |
| Lipschitz Convex LSE |  | any convex body | [34, 39] |
| Bounded Convex LSE |  | polytope | [26] |
| Bounded Convex LSE |  | ball | [32] |
The upper bound of in the first three rows of Table 4 is derived from standard upper bounds on the performance of LSEs [7, 47] which state that the LSE risk is bounded from above by where solves the equation:
(7) |
Here denotes the -bracketing number of under the metric which is defined as the smallest number of -brackets:
needed to cover . The logarithm of denotes the -bracketing entropy of . Gao and Wellner [19] (see also Doss [15]) proved bracketing entropy numbers for the convex function classes; these are given in Table 3 and they coincide with the metric entropy rates. Plugging in for the bracketing entropy in (7) and solving the resulting equation in leads to the rate (although has infinite entropy, results for this class can be derived by working instead with which has entropy ). The split in for and occurs because depends differently on for compared to . The same dimension-dependent split in the upper bounds for the rate can be seen in [7] and [47, Chapter 9] for certain other LSEs on function classes with smoothness restrictions.
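As a hedged illustration of where the dimension-dependent split comes from, suppose for the moment that the bracketing entropy scales as $\epsilon^{-d/2}$ (so that its square root is $\epsilon^{-d/4}$) and write (7) in the schematic Dudley-type form below; the exact limits and constants in the paper's equation (7) may differ.

```latex
\[
  \int_{\delta^{2}}^{\delta} \epsilon^{-d/4}\, d\epsilon \;\asymp\; \sqrt{n}\,\delta^{2}.
\]
% For d < 4 the integral converges at zero and is of order \delta^{1-d/4},
% giving \delta^{2} \asymp n^{-4/(d+4)} (the minimax rate).
% For d > 4 the integral is dominated by its lower limit and is of order
% (\delta^{2})^{1-d/4} = \delta^{2-d/2}, giving \delta^{2} \asymp n^{-2/d}.
```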
It is interesting to note that the last row in Table 4 (corresponding to the case when is the unit ball) does not have any dimension-dependent split in the rate (it equals for every ). This result is due to Kur et al. [32]. More details on this case are given in Subsection 5.1.
We are now ready to describe the main results of this paper. Tables 2 and 4 together indicate that, for the first three rows, the minimax rate coincides with the LSE rate when the dimension is at most 4. This immediately implies minimax optimality (up to log factors) of the LSE over each function class in the first three rows of Table 2 when $d \le 4$. However, for $d \ge 5$, there is a significant gap between the minimax rate and the LSE upper bound. The question of whether this gap is real (implying suboptimality of the LSEs) or merely an artifact of a loose argument in the derived upper bounds was previously open. We settle this question in this paper by proving that the LSE for each function class in the first three rows of Table 2 is minimax suboptimal for $d \ge 5$. More precisely, in Theorem 3.1, we prove that $n^{-2/d}$ is a lower bound (up to a multiplicative factor independent of $n$) on each supremum risk in the first three rows of Table 4 for $d \ge 5$.
Table 5 has a summary of the minimax rates and rates exhibited by the LSE (along with minimax optimality and suboptimality of the LSE) in our random design setting.
| Class | Assumption on domain | Minimax rate | Supremum LSE risk (up to logs) | Minimax optimality of the LSE (up to logs) |
|---|---|---|---|---|
| bounded Lipschitz convex | any convex body |  |  | optimal for $d \le 4$; suboptimal for $d \ge 5$ |
| Lipschitz convex | any convex body |  |  | optimal for $d \le 4$; suboptimal for $d \ge 5$ |
| bounded convex | polytope |  |  | optimal for $d \le 4$; suboptimal for $d \ge 5$ |
| bounded convex | ball |  |  | optimal for all $d$ |
Our proof of minimax suboptimality for the LSE is based on the following idea. Suppose and let be a piecewise affine convex approximation to with affine pieces as described in the statement of Lemma 3.2. It is then well-known (see e.g., [4, Lemma 4.1] or [8]) that the distance between and is of order . If we now set to be of order , then , or equivalently . Note that is much larger than the minimax rate for . We study the behavior of each convex LSE (in the first three rows of Table 2) when the true function is for , and show (in Theorem 3.1) that is bounded below by a constant multiple of . In other words, we prove that the LSEs are at a squared distance from the true function that is much larger than the minimax rate when . Our proof techniques are outlined in Section 4. We are unable to say definitively why the LSEs behave this way when the true function is piecewise affine with this many pieces. We believe this is due to overfitting: the LSEs likely have many more affine pieces than the true function in this case and are actually closer to the quadratic function.
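As a small numerical illustration of the approximation fact invoked above (this is not the paper's construction, which uses the simplices of Lemma 3.2), the following sketch approximates the quadratic $\|x\|^2$ by the maximum of tangent planes at grid points; the resulting function is convex and piecewise affine, and its sup-error decays roughly like $k^{-2/d}$ in the number of pieces $k$. The grid placement and the function names are illustrative assumptions.

```python
import numpy as np

def max_affine_approx(grid_pts):
    """Max-affine (hence convex, piecewise affine) approximation of f(x) = ||x||^2,
    built from tangent planes at the given grid points:
        g(x) = max_j [ 2 <z_j, x> - ||z_j||^2 ].
    """
    def g(x):
        return np.max(2.0 * grid_pts @ x - np.sum(grid_pts**2, axis=1))
    return g

# illustration in d = 2 on [-1, 1]^2: sup-error decays roughly like k^{-2/d}
rng = np.random.default_rng(0)
d = 2
for m in [2, 4, 8, 16]:
    axis = np.linspace(-1, 1, m)
    grid = np.array(np.meshgrid(*([axis] * d))).reshape(d, -1).T  # k = m^d tangent points
    g = max_affine_approx(grid)
    test = rng.uniform(-1, 1, size=(2000, d))
    err = max(abs(np.sum(x**2) - g(x)) for x in test)
    print(f"k = {len(grid):4d}   sup-error ~ {err:.4f}")
```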
It has previously been observed that LSEs and related Empirical Risk Minimization procedures can be suboptimal for function classes with large entropy integrals. For example, [7, Section 4] designed certain pathological large function classes where the LSE is provably minimax suboptimal. The fact that minimax suboptimality also holds for natural function classes such as and (for ) is noteworthy.
In contrast to our suboptimality results, the last row of Table 4 and Table 5 correspond to the class (here is the unit ball) for which the LSE is minimax optimal for all (this result is due to Kur et al. [32]). As already mentioned, has much larger metric entropy (compared to for polytopal ) that is driven by the curvature of ; one can understand this from the proofs of the metric entropy lower bounds on and for polytopal in [19] (specifically see [19, proof of Theorem 1(i) and proof of Theorem 4]). High-level intuition on the differences between the ball and polytopal cases is given in Subsection 5.1. Isotonic regression is another shape constrained regression problem where the LSE is minimax optimal for all (see Han et al. [25]). The class of coordinatewise monotone functions on is similar to in that its metric entropy is driven by well-separated subsets of the domain (see [18, Proof of Proposition 2.1]). Other examples of such classes where the LSE is optimal for all dimensions can be found in Han [24].
1.1 Fixed-design results for the unrestricted convex LSE
So far we have not discussed any rates of convergence for the unrestricted Convex LSE . No such results exist in the literature; it appears difficult to prove them in the random design setting in part because the general random-design theory of Empirical Risk Minimization is largely restricted to uniformly bounded function classes. In this paper, we prove rates for for a specific fixed-design setting. We now assume that form a fixed regular rectangular grid intersected with , and the loss function is
(8) |
with being the (non-random) empirical distribution of . This setting is also quite standard in nonparametric function estimation (see e.g., [40]). We only work with polytopal . Fixed-design rates for the unrestricted convex LSE when is nonpolytopal are likely more complicated (as indicated in Subsection 5.2) and are not addressed in this paper.
In this setting, we are able to prove uniform rates of convergence for over a function class that is larger than the function classes considered so far. This function class is given by:
(9) |
where the first quantity in (9) denotes the class of all affine functions on the domain and the second is a fixed positive constant. It can be easily shown (see Section 3.2 for details) that, as a function class, this class is larger than both the bounded convex class and the Lipschitz convex class, which means that membership in it is a weaker assumption than uniform boundedness and also than uniform Lipschitzness. Further, this class satisfies a natural invariance property: a function belongs to it if and only if its sum with any affine function does. This invariance is obviously also satisfied by the unrestricted class of all convex functions but not by the other classes in Table 1.
In the aforementioned fixed-design setting, we prove, in Theorem 3.3, that
(10) |
is bounded by the rate (see (6)) up to logarithmic factors. Thus the unrestricted convex LSE achieves the rate under the assumption that the true function . That this holds under the weaker assumption can be considered an advantage of the unrestricted convex LSE over the Bounded or Lipschitz convex LSE which require the stronger assumptions or .
The minimax rate for in the fixed-design setting is defined by
(11) |
where is the loss function (8). In Proposition 2.2, we prove that equals (up to log factors) for all . This implies that the unrestricted convex LSE is, up to log factors, minimax optimal for for the class (and consequently also for the smaller classes and ).
For $d \ge 5$, we prove, in Theorem 3.4, that there exists a true function at which the risk of the unrestricted convex LSE is bounded from below by a rate that exceeds the minimax rate. This proves minimax suboptimality of the unrestricted convex LSE for $d \ge 5$ over this class, and consequently also over the larger function classes (it is helpful to note here that all these function classes have, up to a logarithmic factor, the same minimax rate because of Proposition 2.2 and Proposition 2.3).
In addition to these results, we also prove that the rate of convergence of the unrestricted convex LSE can be faster than when is piecewise affine. Specifically, we prove that when is a piecewise affine convex function on with affine pieces, the risk of is bounded from above by
(12) |
up to logarithmic factors. When is not too large, is smaller than (which bounds (10)). Specifically, when , we have for , and when , we have for . For such , we say that the unrestricted convex LSE adapts to the piecewise affine convex function . In the random design setting, adaptive rates for the bounded convex LSE can be found in [26] with a suboptimal dependence on .
It is especially interesting that the adaptive rate switches from being parametric for to a slower nonparametric rate for . For , we prove a lower bound showing that the adaptive rate cannot be improved for . Specifically, in Theorem 3.6, we prove, assuming , that for the piecewise affine convex function given by Lemma 3.2, the rate of the unrestricted convex LSE is bounded from below by for every . This lower bound is related to the minimax suboptimality of because when , the rate becomes .
1.2 Paper Outline
The rest of this paper is organized as follows. Minimax rate results for convex regression are summarized in Section 2. These minimax rates are useful for gauging minimax optimality and suboptimality of LSEs in convex regression. Rates of convergence of the LSEs are given in Section 3. Subsection 3.1 contains the main minimax suboptimality result for the LSEs over the three classes , and in the random design setting. Section 3.2 contains results for the unrestricted convex LSE in fixed design. Section 4 provides a sketch of the main ideas and ingredients behind the proofs of the LSE rate results in Section 3. Proofs of all results can be found in the Appendices: Appendix A contains proofs of all results in Section 2, Appendix B contains proofs of all results in Section 3.1 (and the auxiliary results stated in Subsection 4.3) and Appendix C contains proofs of all results in Section 3.2 (and the auxiliary results stated in Subsection 4.4). The paper ends with a discussion section (see Section 5) where some issues naturally connected to our results are discussed.
1.3 A note on constants underlying our rate results
Our rate results often involve various constants which are denoted generically. These constants never depend on the sample size, but they often depend on other parameters of the problem (such as the dimension, the Lipschitz constant, the uniform bound, the number of facets of the polytopal domain, etc.). We have tried to indicate which parameters each constant depends on explicitly through subscript notation, but we have not attempted to specify the exact form of this dependence. Please note that the value of these constants may change from occurrence to occurrence.
2 Minimax Rates for Convex Regression
In this section, we provide more details for the minimax rate results mentioned in the Introduction.
2.1 Random Design
Minimax rates for convex regression in random design are given in Table 2. The first result is for the bounded Lipschitz class and, as indicated in Table 2, the corresponding minimax rate was stated and proved in [4, Theorem 4.1]. However, that paper additionally assumed that the domain is a rectangle, which is unnecessary: the same proof applies to every convex body satisfying (3). This is because the relevant metric entropy bound holds for every convex body satisfying (3), as can be readily seen from the proof of [9, Theorem 6].
The next minimax result is for . Here also the rate is and it holds for every convex body satisfying (3). The upper bound here is slightly nontrivial because the class has infinite metric entropy as it contains all constant functions. We could not find a reference for this result in the literature so we state it below and include a proof in Appendix A.
Proposition 2.1.
Next are the results for where the minimax rates change depending on whether is polytopal or not. When is a polytope, the minimax rate is for all and when is a ball, the rate is for . These results (whose proofs can be found in [26]) can be derived as a consequence of the metric entropy rates in Table 3 ( and respectively) by solving the equation (5).
2.2 Fixed Design
For fixed design, we are for the most part only able to prove results when the domain is a polytope (see Subsection 5.2 for some remarks on fixed-design results for non-polytopal domains). Specifically, we assume that
(14) |
for , unit vectors and real numbers . The number is assumed to be bounded from above by a constant depending on alone. As in the rest of the paper, we also assume (3).
The design points are assumed to form a fixed regular rectangular grid in and are generated according to (1). Specifically, for , let
(15) |
denote the regular -dimensional -grid in . We assume that are an enumeration of the points in with denoting the cardinality of . By the usual volumetric argument and assumption (3) , there exists a small enough constant such that whenever , we have
(16) |
for dimensional constants and . We have included a proof of the above claim in Lemma C.2. Throughout, we assume so that the above inequality holds. The following result (proved in Appendix A) gives an upper bound on the minimax risk (11) over the class defined in (9).
The class is larger than both and which means that the bound in (17) also holds for for or or . To see why is larger than , just note that (by taking )
To see why is also larger than , note that (taking for the origin ; the origin belongs to because of (3))
where the last inequality is true because each as .
The following minimax lower bound (proved in Appendix A) complements the upper bound in Proposition 2.2 and it is applicable to the smaller function class . need not be polytopal for this result.
Proposition 2.3.
Combining the above two results, in fixed-design for of the form (14) and satisfying (3), the minimax rate for each class equals up to logarithmic multiplicative factors (this is because is the smallest of these classes for which the lower bound (18) holds and is the largest of these classes for which the upper bound (17) holds).
3 LSE Rates for Convex Regression
3.1 Random Design
In this section, we state our main result that the supremum risk appearing in the first three rows of Table 4 is bounded from below by $n^{-2/d}$ (up to logarithmic factors) for $d \ge 5$. As $n^{-2/d}$ is strictly larger than the minimax rate $n^{-4/(d+4)}$ when $d \ge 5$, these results prove minimax suboptimality, for $d \ge 5$, of the LSE over the bounded Lipschitz and Lipschitz convex classes for any convex body, and also over the bounded convex class for polytopal domains.
Theorem 3.1.
Fix and let be a convex body satisfying (3). The following inequality holds when is either for , or :
(19) |
provided . Additionally when is a polytope whose number of facets is bounded by a constant depending on alone, the same inequality holds for i.e.,
(20) |
for .
We prove the above theorem by taking the true function to be a specific piecewise affine convex function with a given number of affine pieces. This function will be a piecewise affine approximation to the quadratic function. The properties of the approximation that we need, along with the existence of a function satisfying those properties, are given in the next result. While this result is stated for arbitrary , we only need it for for proving Theorem 3.1. Recall that a $d$-simplex in $\mathbb{R}^d$ is the convex hull of $d+1$ affinely independent points. It is well-known that $d$-simplices can be represented as the intersection of $d+1$ halfspaces.
Lemma 3.2.
Suppose is a convex body satisfying (3). Let . There exists a positive constant (depending on alone) such that the following is true. For every , there exist -simplices and a convex function on such that
1. ,
2. is contained in a facet of and a facet of for each ,
3. is affine on each ,
4. ,
5. ,
If, in addition, is a polytope whose number of facets is bounded by a constant depending on alone, then the first condition above can be strengthened to .
3.2 Results for the unrestricted convex LSE in fixed design
Here we work with the same setting as in Subsection 2.2 (in particular, note that is polytopal of the form (14)). The following result shows that the unrestricted convex LSE achieves the rate (defined in (6)) under the assumption for a positive constant . The number appearing in the bound (21) below is the number of parallel halfspaces or slabs defining (see (14)) and this number is assumed to be bounded by a constant depending on alone.
Theorem 3.3.
For every and , there exist constants (depending on alone) and (depending only on and ) such that
(21) |
for .
The rates appearing on the right hand side of (21) coincide with if we ignore multiplicative factors that are at most logarithmic in . Theorem 3.3 and the minimax lower bound given in Proposition 2.3 together imply that the unrestricted convex LSE is minimax optimal (up to log factors) over each class when the dimension . However for , there is a gap between the rate given by (21) and the minimax upper bound in Proposition 2.2. The following result shows that, for , the unrestricted convex LSE is indeed minimax suboptimal over (or over the larger classes , , ).
Theorem 3.4.
Fix , and . There exist constants and such that
(22) |
for .
The main idea for the proof of Theorem 3.4 comes from analyzing the rates of convergence of the unrestricted convex LSE when the true convex function is piecewise affine. Indeed, Theorem 3.4 is derived from Theorem 3.6 (stated later in this section) which provides a lower bound on the risk of for certain piecewise affine convex functions. We state Theorem 3.6 after the next result which provides upper bounds on the rate of convergence of for piecewise affine convex functions. This result shows that the unrestricted convex LSE exhibits adaptive behaviour to piecewise affine convex functions in the sense that the risk is much smaller than the right hand side of (21) for such . For and , let denote all functions for which there exist convex subsets satisfying the following properties:
Theorem 3.5.
For every and , we have
(23) |
for a constant depending on alone.
When is a constant (not depending on ) and is not too large, the rates in (23) are of strictly smaller order compared to (21). More precisely, ignoring log factors, (23) is strictly smaller than (21) as long as is of smaller order than for , and as long as is of smaller order than for . The next result gives a lower bound which proves that the upper bound in (23) cannot be improved (up to log factors) for for all up to . More specifically, we prove the lower bound for the piecewise affine convex function described in Lemma 3.2.
Theorem 3.6.
Fix . There exist positive constants and such that for and
(24) |
we have
(25) |
where is the function from Lemma 3.2.
The lower bound given by (25) for is of the same order as that given by Theorem 3.4. In other words, Theorem 3.4 is a corollary of Theorem 3.6.
We reiterate that the results of this subsection (fixed-design risk bounds for the unrestricted convex LSE) hold when is a polytope. A natural question is to extend these to the case where is a smooth convex body such as the unit ball. Based on the results of Kur et al. [32] who analyzed the LSE over under random-design, it is reasonable to conjecture that the unrestricted convex LSE will be minimax optimal in fixed design when the domain is the unit ball. However it appears nontrivial to prove this as the main ideas from [32] such as the reduction to expected suprema bounds over level sets cannot be used in the absence of uniform boundedness.
4 Proof Sketches for Results in Section 3
In this section, we provide the main ideas and ingredients behind the proofs of the LSE rate results in Section 3. Further details and full proofs of the auxiliary results in this section can be found in Appendix B.
4.1 General LSE Accuracy result
The first step to prove all our LSE results is the following theorem due to Chatterjee [10] which provides a course of action for bounding (from above as well as below) the accuracy of abstract LSEs on convex classes of functions. The original result applies to the fixed design case for every set of design points . In the random design case, we apply this result by conditioning on (see (30) and (31) below).
Theorem 4.1 (Chatterjee).
Consider data generated according to the model:
where are fixed deterministic design points in a convex body , belongs to a convex class of functions and . Consider the LSE defined in (2). Define
where
Then is a concave function on , is unique and the following pair of inequalities hold for positive constants and :
(26) |
and
(27) |
Upper bounds for can be obtained via:
(28) |
and lower bounds for can be obtained via:
(29) |
Intuitively, Theorem 4.1 states that the loss of the LSE is controlled by for which bounds can be obtained using (28) and (29).
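For orientation, here is a hedged sketch of the shape of these objects, following [10]; the symbols, normalization, and the exact form of (26)-(29) here are our assumptions rather than the paper's display.

```latex
% Hedged sketch of the key functional in Theorem 4.1 (notation assumed, not the paper's):
% \xi = (\xi_1,\dots,\xi_n) is the Gaussian noise vector, \|\cdot\|_n the empirical L_2 norm.
\[
  H_{f^*}(t) \;=\; \mathbb{E}\Bigl[\,\sup_{f \in \mathcal{F}:\ \|f - f^*\|_n \le t}\
    \frac{1}{n}\sum_{i=1}^{n} \xi_i \bigl(f(X_i) - f^*(X_i)\bigr)\Bigr] \;-\; \frac{t^2}{2}.
\]
% t \mapsto H_{f^*}(t) is concave with H_{f^*}(0) = 0 and a unique maximizer t^*,
% and \|\hat{f} - f^*\|_n concentrates around t^*.  Hence any t with H_{f^*}(t) < 0
% gives an upper bound t^* < t, while finding s > t with H_{f^*}(s) \ge H_{f^*}(t)
% gives a lower bound t^* \ge t.
```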
Theorem 4.1 holds for the fixed design setting with no restriction on the design points which means that it also applies to the random design setting provided we condition on the design points . In particular, for random design, inequality (26) becomes:
(30) |
where
(31) |
with
(32) |
Here is random as it depends on the random design points . Note that (30) applies to the loss and not to .
4.2 High-level intuition for the proof of Theorem 3.1
Here we provide the main ideas behind the proof of Theorem 3.1 in a somewhat non-rigorous fashion. More details are provided in the remainder of this section. According to Theorem 4.1, the rate of convergence of the LSE over (when the true function is ) is controlled by where is the maximizer of defined in (32). Note that is given by an expected supremum term minus . The expected supremum term can be seen as a measure of the complexity of the local region around the true function .
For the proof of Theorem 3.1, we shall take the true function to be (given by Lemma 3.2) with . The challenge is to show that the maximizer of is at least for each function class in the statement of Theorem 3.1. For this, the first step is to prove that
(33) |
for a large enough range of (see Lemma 4.2 and the discussion following it). This inequality, which grows linearly in , can be seen as a bound on the complexity of the local region in around , and is proved by bounding the bracketing entropy numbers of local regions around (Theorem 4.5).
The second step is to prove the following lower bound on :
(34) |
This is proved by lower bounding the metric entropy of a local region around (Lemma 4.6) and the fact that and are within of each other (this is guaranteed by Lemma 3.2).
It might be interesting to note that this proof will not work if we take the true function to be the smooth function instead of the piecewise affine approximation . For , the upper bound (33) is no longer true; in fact will be as large as for of the order which clearly violates the upper bound (33). Intuitively, this happens because local regions around have more complexity compared to local regions around . This is because the piecewise affine convex function has zero Hessian on each of its affine pieces, so it is not easy to find many perturbations of it which keep the resulting function convex; such perturbations are much easier to construct around the smooth function. We are unable to determine the rate of convergence of the LSEs when the true function is . It can be anywhere between and , a range which includes the minimax rate .
4.3 Proof Sketch for Theorem 3.1
It is enough to prove (19) and (20) when and are fixed constants depending on the dimension alone, respectively. From here, these inequalities for arbitrary and can be deduced by elementary scaling arguments. Let and let be as given by Lemma 3.2 for a fixed . Below we assume that is a large enough constant depending on alone so that and also that . We assume that the true function is and show that, when , the risk of at is bounded from below by for each of the choices of the function class in the statement of Theorem 3.1.
Our strategy is to bound from below and then use Theorem 4.1 to obtain the LSE risk lower bound. Recall, from Theorem 4.1, that maximizes
(35) |
over all where
(36) |
The main ingredients necessary to bound are the next pair of results, which provide upper and lower bounds for respectively.
Lemma 4.2.
Fix and let be a convex body satisfying (3). There exists such that the following holds
(37) |
for every fixed satisfying , and for equal to either of the two classes and .
Additionally, when is a polytope whose number of facets is bounded by a constant depending on alone, we have the following inequality for :
(38) |
for every fixed satisfying .
Lemma 4.3.
Fix and let be a convex body satisfying (3). There exists such that the following holds for every fixed and :
for equal to , and also for the larger classes and .
Before providing details behind the proofs of Lemma 4.2 and Lemma 4.3, let us outline how they lead to the proof of Theorem 3.1 (full details are in Appendices B.3 and B.4). The leading term in the upper bound on given by Lemma 4.2 is
(39) |
This upper bound also applies to because as is clear from the definition (35). This bound was previously stated as (33) in Subsection 4.2.
On the other hand, the lower bound on from Lemma 4.3 leads to the following lower bound for :
The special choice in the above bound leads to
(40) |
This lower bound is the basis for the inequality (34) in the intuition subsection 4.2. The requirement in Lemma 4.3 would hold if satisfies for some .
We now compare the upper bound (39) with the lower bound (40). It can be seen that (39) is less than or equal to the right hand side of (40) when is of the order . Let us denote by this particular choice of for . We will argue, in Lemma 4.4 below, that and that
This allows application of inequality (29) to yield the following lower bound on (this lower bound gives the necessary risk lower bounds for via Theorem 4.1).
Lemma 4.4.
There exist constants such that
(41) |
for and . Here can be taken to be any of the three choices in Theorem 3.1.
4.3.1 Proof ideas for Lemma 4.2 and Lemma 4.3
We now explain the main proof ideas behind Lemma 4.2 and Lemma 4.3. For Lemma 4.2, we use available bounds on suprema of empirical processes via bracketing numbers. The key bracketing result that is needed for Lemma 4.2 is given below. It provides an upper bound on the bracketing entropy of bounded convex functions with an additional norm constraint. The metric employed is the metric on a union of simplices. This result is stated for arbitrary although we only use it for . Lemma 4.2 is proved by applying this theorem to the simplices given in Lemma 3.2.
Theorem 4.5.
Suppose is a convex body contained in the unit ball. Let be a convex function on that is bounded by . For a fixed and , let
(42) |
Suppose are -simplices with disjoint interiors such that is affine on each . Then for every and , we have
(43) |
for a constant that depends on and alone. The left hand side above denotes bracketing entropy with respect to metric on .
The above bracketing entropy result (whose proof is given in Appendix B.5) is nontrivial and novel. The function class considered in the above theorem has a uniform boundedness constraint as well as an additional norm constraint. If the additional norm constraint is dropped, then the bracketing entropy is of the order proved by Gao and Wellner [19] (see also Doss [15]). In contrast, (43) has only a logarithmic dependence on and is much smaller when is small. Also, Theorem 4.5 is comparable to but stronger than [26, Lemma 3.3], which gives a weaker bound for the left hand side of (43) containing additional multiplicative factors involving (these factors cannot be neglected since we care about the regime ).
For Lemma 4.3, the key is the following result which proves a lower bound on the metric entropy of balls of convex functions around the quadratic function (recall, from Lemma 3.2, that are chosen to be piecewise affine approximations of ).
Lemma 4.6.
Let be a convex body satisfying (3). Let . Then there exist positive constants depending on alone such that
for each fixed satisfying and . The probability above is with respect to the randomness in the design points .
The main consequence of the above lemma is that when and , the metric entropy of is of order . This also holds for for as . Because the distance between and is bounded by , the same order- lower bound also holds for the metric entropy of . Sudakov minoration can then be used to prove Lemma 4.3. See Appendix B for full details.
4.4 Proof sketches for fixed-design results (Subsection 3.2)
Here, we provide sketches for the proofs of the fixed-design results stated in Subsection 3.2. Further details and full proofs can be found in Appendix C.
The starting point for these proofs is Theorem 4.1. For the risk upper bound results (Theorem 3.3 and Theorem 3.5), we need to bound the quantity (appearing in the statement of Theorem 4.1) from above. Let
(44) |
so that and upper bounds on imply upper bounds on . In contrast to the random design setting of the previous subsection, we do not need to explicitly indicate conditioning on in this fixed design setting.
Theorem 3.3 is a consequence of the following upper bound on :
Lemma 4.7.
Fix and suppose . There exists such that for every ,
(45) |
Lemma 4.7 is a consequence of the following bound which holds for functions satisfying :
(46) |
for every and . This metric entropy bound, which is novel and nontrivial, will be derived based on a more fundamental metric entropy result that is stated later in this section. Dudley’s entropy bound [16] will be applied in conjunction with (46) to derive Lemma 4.7. Dudley’s bound involves an integral of the square root of the metric entropy which will lead to an integral of . Since this integral converges for , and blows up at zero for (logarithmically) and for (exponentially), the upper bound for behaves differently for the three regimes , and .
Theorem 3.5 is a consequence of the following upper bound on :
Lemma 4.8.
Fix . There exists such that for every ,
(47) |
Lemma 4.8 is a consequence of the following bound which holds for functions :
(48) |
The bound (48), which also seems novel and nontrivial, is an improvement over (46) provided and is not too large. Both the entropy bounds (46) and (48) will be derived based on a more general bound that is stated later in this section.
We now move to the proof sketches for the lower bound results, Theorem 3.4 and Theorem 3.6, which apply to the case . Because Theorem 3.4 is a special case of Theorem 3.6 corresponding to (see Appendix C.3), we focus on the proof sketch of Theorem 3.6. This proof is similar to that of Theorem 3.1 but simpler; for example, we can simply work here with the single loss (and not worry about its discrepancy with any continuous loss as in the proof of Theorem 3.1).
As in the proof of Theorem 3.1, we need to prove upper and lower bounds for . The upper bound follows from Lemma 4.8 because, as guaranteed by Lemma 3.2, the function belongs to for some (we are using here because every -simplex can be written as the intersection of at most slabs). This gives the following bound which coincides with the leading term in the random design bound given by (37):
(49) |
The lower bound on is given below.
Lemma 4.9.
There exist a positive constant such that
(50) |
provided .
The lower bound (50) and the upper bound (49) differ only by the logarithmic factor . Further, (50) coincides with the corresponding lower bound in Lemma 4.3 for the random design setting when and is of order . Lemma 4.9 is proved via the following metric entropy lower bound which applies to the discrete metric and is analogous to Lemma 4.6.
Lemma 4.10.
Let be a convex body satisfying (3). Let . There exist two positive constants and depending on alone such that
The metric entropy upper bounds (46) and (48) play a very important role in these proofs. Both these bounds will be derived as a consequence of the following entropy result. This result involves the resolution of the grid (defined in (15)) which, by (16), is of order . Even though we only need the entropy bound for , we state the next result for the discrete metric for every where
(51) |
The -covering number of a space of functions on under the metric will be denoted by .
Theorem 4.11.
Theorem 4.11 is the first entropy result for convex functions that deals with the discrete metric (all previous results hold for continuous metrics). Its proof is similar to but longer than the proof of the related result Theorem 4.5. The two bounds (46) and (48) follow easily from Theorem 4.11 as proved in Appendix C.6.
5 Discussion
We conclude by addressing a few natural questions and extensions that arise from our main results.
5.1 Optimality vs Suboptimality
As already indicated, the statement “the LSE in convex regression is rate optimal” can be both true and false in random-design depending on details and context. In this paper, we proved that, if the domain is polytopal and the dimension , this statement is false meaning that the LSE is minimax suboptimal. On the other hand, if the domain is the unit ball, then the statement is true for all meaning that the LSE is minimax optimal (this optimality has been proved in [32, 24]).
Here we briefly outline the two key differences (relevant to LSE suboptimality/optimality) in the structures of corresponding to the two cases where is a polytope and is the unit ball. We assume below.
1. The main difference is that the metric entropy of is in the polytopal case and in the ball case (these entropy results are due to [19]). The larger metric entropy directly leads to the slower minimax rate in the ball case compared to the polytopal case. In the polytopal case, constructions for the proof of the metric entropy lower bound are based on perturbations of a smooth convex function. This construction also leads to the same lower bound in the ball case. However, the improved lower bound in the ball case is obtained by taking indicator-like convex functions which are zero in the interior of the ball but rapidly rise to 1 near a subset of the boundary. This construction is feasible because of the high complexity of the boundary of the ball, as well as because of the nature of the metric. If the metric is changed to , the metric entropy remains the same for both the ball and the polytopal case. It may be helpful to note here that if a function is the indicator of a set, then its $L_2$ norm can be much larger than its $L_1$ norm (see the display at the end of this subsection).
2. Now let us discuss the behavior of the LSE. In the polytopal case, when , the supremum risk of the LSE over is (up to log factors). This exceeds the minimax rate, implying that the LSE is minimax suboptimal. On the other hand, note that this rate is actually smaller than the minimax rate in the ball case. This might suggest that the LSE is minimax optimal in the ball case. Kur et al. [32] proved that the LSE indeed achieves the minimax rate in the ball case, implying optimality. At a very high level, their argument is as follows. From Theorem 4.1, it is clear that the risk of the LSE can be upper bounded via upper bounds on the expected supremum
where is the true function. Kur et al. [32] upper bound the above by first removing the localization constraint leading to
The right hand side above is the Gaussian complexity of the class . Kur et al. [32] then show that the Gaussian complexity is upper bounded by the discrepancy between and over all compact convex subsets of :
(53)
They then use chaining with respect to the norm to prove that (53) is bounded by . This argument reveals that, like the minimax rate, which is driven by indicator-like functions over convex sets, the LSE accuracy is also driven by the discrepancy over convex sets. Han [24] gives examples of other problems where the LSE accuracy is driven by such a discrepancy over sets, and the LSE turns out to be optimal in those problems as well.
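For the norm comparison invoked in item 1 above, here is the one-line check (assuming $P$ denotes the underlying probability measure on the domain and $A$ is a measurable set):

```latex
\[
  \|\mathbf{1}_A\|_{L_2(P)} \;=\; \sqrt{P(A)} \;=\; \|\mathbf{1}_A\|_{L_1(P)}^{1/2},
\]
```

so the $L_2$ norm of an indicator function is much larger than its $L_1$ norm when $P(A)$ is small.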
5.2 Additional remarks on the fixed-design results
In the fixed-design setting (where the design points are given by grid points (15) intersected with and the loss function is (8)), we proved that when is a polytope of the form (14), the minimax rate over each class , , , equals (up to logarithmic factors). We also proved (Theorem 3.4) that the supremum risk of the unrestricted LSE over each of these classes is at least for . The proof of Theorem 3.4 can be easily modified to prove that the restricted LSE over also has supremum risk of at least and hence is minimax suboptimal over , where is any of the four classes , , , .
Now, let us briefly comment on the case when is not a polytope. Under the Lipschitz constraint, the story is the same as in the case of random design. Specifically, the minimax rate for and equals for every satisfying (3) regardless of whether is polytopal or not (this is proved by essentially the same argument as in the proof of Proposition 2.1; the main idea is that the metric entropy of is not influenced by the boundary of ). Furthermore, the restricted LSE over each of these classes achieves supremum risk of at least . The proof of these results can be obtained by suitably modifying the proof of Theorem 3.4 following ideas in the proof of Theorem 3.1.
Without the Lipschitz constraint, things are more complicated. In Section 1, we discussed that when is the unit ball for and we are in the random-design setting, the minimax rate over equals as proved by Han and Wellner [26]. Additionally, we noted the minimax optimality of the LSE over as proved by Kur et al. [32]. The analogues of these results for the fixed design setting are unknown.
We summarize this discussion on the fixed-design setting in Table 6.
| Class | Assumption on domain | Minimax rate (up to logs) | Supremum LSE risk (up to logs) | Minimax optimality of the LSE (up to logs) |
|---|---|---|---|---|
|  | polytope |  |  | optimal for $d \le 4$; suboptimal for $d \ge 5$ |
|  | any convex body |  |  | optimal for $d \le 4$; suboptimal for $d \ge 5$ |
|  | any convex body |  |  | optimal for $d \le 4$; suboptimal for $d \ge 5$ |
|  | polytope |  |  | optimal for $d \le 4$; suboptimal for $d \ge 5$ |
|  | ball | open question | open question | open question |
The larger minimax rate of (compared to for polytopal ) in the random design case can be attributed to the increased metric entropy (in the metric with respect to the continuous uniform probability measure on ) of due to the curvature of the boundary of (see [19, Subsection 2.10]). In the fixed design case, the boundary of is also expected to elevate the minimax rate above . However, the precise minimax rate in this context seems hard to determine (note that there are no existing metric entropy results for in the discrete metric with respect to the grid points of the fixed design) and it could range between and . Furthermore, the question of whether the LSE over and the unrestricted convex LSE are optimal or suboptimal in the fixed design setting for non-polytopal also remains open.
5.3 Additional discussion on the random-design setting
In our random design setting with , we assumed that is the uniform distribution on the convex body . From an examination of our proofs, it should be clear that all our random design results would continue to work under the more general assumption that has a density on that is bounded from above and below by positive constants. With this assumption, our bounds will involve additional multiplicative constants depending on the density constants. We worked with the simpler uniform distribution assumption because, essentially, the proof in the more general case is reduced to the uniform case.
One can also consider the case where has full support over such as the case when is the standard multivariate normal distribution on (here ). In this case, boundedness would not make sense as non-constant convex functions cannot be bounded over the whole of . As the Lipschitz assumption is still reasonable, one may study the minimax rate over , and the optimality of the Lipschitz convex LSE over (note that the loss function is still given by (4) but now has full support over ). When is the standard multivariate normal distribution, we believe that the minimax rate over should be and that the supremum risk of over is bounded below by (up to logarithmic factors) for . The intuition is that decreases exponentially in , while the metric entropy (of bounded Lipschitz convex functions) over only grows polynomially in . We leave a principled study of such unbounded design distributions to future work.
5.4 Beyond Gaussian noise
Throughout, we assumed that the errors in the regression model (1) are Gaussian, i.e. . It is natural to ask if the results continue to hold if the errors have mean zero and variance but a non-Gaussian distribution. For sub-Gaussian errors, we believe that certain variants of our results can be proved with additional work. One important bottleneck in the extension of our main results (Theorem 3.1 and Theorem 3.4) to sub-Gaussian errors is the result of [10] stated as Theorem 4.1. This result was originally stated and proved in [10] for Gaussian errors and we do not know if an extension to sub-Gaussian errors exists in the literature. The proof is mainly based on the concentration of the random variable (for fixed ):
which is proved in [10], by using the fact that the above is a -Lipschitz function of i.i.d Gaussian errors .
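For reference, the Gaussian concentration inequality being invoked can be stated as follows (a standard form; the constants here are not taken from the paper):

```latex
\[
  \mathbb{P}\bigl(\,|F(\xi) - \mathbb{E}F(\xi)| \ge u\,\bigr)
  \;\le\; 2\exp\!\Bigl(-\frac{u^{2}}{2\sigma^{2}L^{2}}\Bigr), \qquad u \ge 0,
\]
```

valid whenever $F:\mathbb{R}^n \to \mathbb{R}$ is $L$-Lipschitz with respect to the Euclidean norm and $\xi \sim N(0, \sigma^{2} I_n)$.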
For sub-Gaussian , one can first argue boundedness of by with high probability (say probability of ), and then invoke concentration of convex Lipschitz functions of bounded random variables (see e.g., [45, Theorem 6.6]); note that is convex in . This should lead to a version of Theorem 4.1, for sub-Gaussian errors (although with weaker control on the probabilities involved).
Note that the proofs of Theorems 3.1 and 3.4 also use certain other tools, such as the Sudakov minoration inequality, which are only valid for Gaussian errors. Their use would need to be replaced by appropriate multiplier inequalities for suprema of empirical processes (such results can be found in e.g., [48, Chapter 2.9] and [27]). We leave, to future work, a rigorous extension of our results to sub-Gaussian errors.
5.5 Rates of convergence in the interior of the domain
It is natural to wonder if our suboptimality results are caused by boundary effects, and if the LSEs are optimal or suboptimal in the interior of the domain. Concretely, suppose that we are given a compact convex region contained in the interior of the domain, and that the loss function is given by
(54) |
The question then is whether the LSEs considered in Theorem 3.1 are still suboptimal under the above modified loss function. This is a difficult question that is hard to resolve with the methods used for the proof of Theorem 3.1. However, we believe that the result should be true because of the following heuristic argument. Consider, for concreteness, the bounded convex LSE. Every function in the bounded convex class is automatically both uniformly bounded and uniformly Lipschitz inside the smaller region (with constants depending on the region), regardless of the shape of the domain. Because the entropy of bounded Lipschitz convex functions does not depend on the shape of the domain, this gives us reason to believe that the bounded convex LSE would be minimax suboptimal (for $d \ge 5$) with respect to the modified loss (54). Proving this is challenging, mainly because the behavior of the LSE inside the smaller region is also influenced (in a complicated way) by the observations lying outside it. Therefore, we would like to highlight this question as an open problem, as we do not know how to “decouple” the performance of the LSE on the inner region from its behavior on the rest of the domain.
5.6 Results for LSEs with smoothness constraints (without convexity)
All the results in this paper apply to LSEs over classes of convex functions (with additional constraints such as boundedness and/or being Lipschitz). It is natural to ask if similar suboptimality results hold for LSEs under purely smoothness constraints (without any shape constraints such as convexity). The methods of this paper are closely tied to convexity, and we leave a study of purely smoothness-constrained LSEs for future work. For a concrete problem in this direction, consider the class of -Lipschitz functions on :
For a natural choice (e.g., ), the question is whether the LSE over this class is minimax optimal for all . One can also ask the same question for function classes with constraints on second (and higher) order derivatives. We would like to highlight these as open problems, as our techniques do not apply to these classes.
We are truly thankful to the Associate Editor and three anonymous referees for their comprehensive reviews of our earlier manuscript. Their insightful feedback significantly enhanced both the content and organization of the paper.
The first author was funded by the Center for Minds, Brains and Machines, funded by NSF award CCF-1231216. The second author was funded by NSF Grant OCA-1940270. The third author was funded by NSF CAREER Grant DMS-1654589. The fourth author was funded by NSF Grant DMS-1712822.
The rest of this paper consists of three sections: Appendix A, Appendix B and Appendix C. Appendix A contains proofs of all results in Section 2. Appendix B contains proofs of all results in Section 3.1. The proof sketch given in Subsection 4.3 is followed and results quoted in that subsection are also proved in Appendix B. Appendix C contains proofs of all results in Section 3.2. The proof sketch given in Subsection 4.4 is followed and results quoted in that subsection are also proved in Appendix C.
Appendix A Proofs of Minimax Rates for Convex Regression
This section contains the proofs of Proposition 2.1, Proposition 2.2 and Proposition 2.3 (these results were stated in Section 2).
For the proof of Proposition 2.1 below, we need Lemma B.3 which allows switching between the two loss functions and . Lemma B.3 is stated in Section B because it is also crucial for the proof of Theorem 3.1.
Proof of Proposition 2.1.
The lower bound for the minimax rate follows from the corresponding result for the smaller class . For the upper bound, let us first describe the estimator that we work with. Let (where ) for each . For a finite subset of the function class , let denote any least squares estimator over for the data . In other words,
Our estimator is given by the convex function:
(55) |
for an appropriately chosen covering subset of . The proof below shows that this estimator achieves the rate uniformly over . Fix a “true” function . We first bound the risk of . For any arbitrary nonnegative valued functional , the following inequality holds:
because of the presence of the term corresponding to on the right hand side. Because minimizes sum of squares over , we can replace by any other element leading to
(56) |
for every . We now take the expectation on both sides of the above inequality conditioned on . The following function will play a key role in the sequel
is Lipschitz with Lipschitz constant and, moreover, is uniformly bounded by because:
In other words, . In order to take the expectation of both sides of (56), note that the conditional distribution of given is multivariate normal with mean vector and covariance matrix . Here is the identity matrix and is the vector of ones. By a straightforward calculation, we obtain
where we used the standard fact for the penultimate inequality. As is arbitrary, we can take an infimum over to obtain
Because and ,
We have thus proved
The choice
leads to
where denotes the cardinality of the finite set . Jensen’s inequality on the left hand side gives
Taking expectations with respect to (the right hand side above is actually nonrandom), we get the same bound unconditionally:
We now take to be an -covering subset of for an appropriate . By the classical metric entropy result of [9], for each , we can find such that
Taking to be of order and to the corresponding -covering subset of , we deduce
We convert this bound to via Lemma B.3:
By the elementary inequality , we get
where . Thus
Finally
This completes the proof of Proposition 2.1. ∎
We next turn to the proof of Proposition 2.2 which gives an upper bound on the minimax rate for the class in the fixed design setting. This proof is similar to that of Proposition 2.1 but somewhat simpler as, in the fixed design setting, we do not need to switch between the two losses and . The metric entropy result stated in Theorem 4.11 is an important ingredient of the following proof.
Proof of Proposition 2.2.
Let us describe the estimator that we work with. Let the linear regression solution be given by the affine function :
Let for be the residuals after linear regression. For a finite class of functions on , let denote any least squares estimator over for the data :
We work with the estimator for an appropriate choice of , and bound its supremum risk over the class . Fix a “true” function . Let be the closest affine function to in the metric:
A key role in the sequel will be played by the convex function (note that is convex because is convex and is affine). Fix a nonnegative valued functional and start with the inequality (56):
for every . Now take expectations on both sides above (note that we are working in fixed design so are fixed grid points in ). To calculate the expectation of the right hand side, observe that is multivariate normal with mean vector and a covariance matrix which is dominated by in the positive semi-definite ordering. As in the proof of Proposition 2.1, we deduce
Taking
and using Jensen’s inequality (as in the proof of Proposition 2.1), we obtain
We now specify the finite function class . For this note that because (i.e., ), the function belongs to the class
Theorem 4.11 (with and ) gives an upper bound for the -metric entropy of the above function class under the metric. This result implies the existence of a finite set satisfying the twin properties:
As a result
The choice gives
Finally
completing the proof of Proposition 2.2. ∎
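The fixed-design estimator used above differs from the one in Proposition 2.1 only in that an affine function is first fitted by ordinary least squares and the finite-class fit is then applied to the residuals. A minimal sketch of the affine pre-fit step is given below (illustrative only; it assumes the design points are stored as the rows of a matrix).

```python
import numpy as np

def affine_prefit(X, y):
    """Fit the best affine function x -> a + <b, x> by ordinary least squares.

    X: (n, d) array whose rows are the design points; y: (n,) responses.
    Returns (coef, residuals) where coef = (a, b_1, ..., b_d).
    """
    A = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
    residuals = y - A @ coef                       # fed to the finite-class fit
    return coef, residuals

# Toy usage on simulated data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(100)
coef, res = affine_prefit(X, y)
print("fitted affine coefficients:", coef)
```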
Next we give the proof of Proposition 2.3. This proof is based on the use of Assouad’s lemma with a standard construction.
Proof of Proposition 2.3.
Define the smooth function:
Note that for and
which implies that the Hessian of is dominated by times the identity matrix. It is also easy to check that the Hessian of equals zero on the boundary of .
Fix and let be the -grid consisting of all points for . Let
Assume that is small enough so that the cardinality of is at least . For every point , let
Clearly is supported on the cube and these cubes for different points have disjoint interiors.
For each binary vector , consider the function
(57) |
where . It can be verified that is convex because has constant Hessian equal to times the identity, the Hessian of is bounded by and the supports of have disjoint interiors. For a sufficiently small constant and a sufficiently large constant , it is also clear that . We use Assouad’s lemma with this collection :
where is the Hamming distance, and is the multivariate normal distribution with mean vector and covariance matrix . We bound from below as
To bound from below, note that is supported on the cube which has volume , and also that the “magnitude” of on this cube is of order (equivalently, the squared magnitude equals ). Thus when for some , it is straightforward to check
This gives
For bounding , we use Pinsker’s inequality to get
Assouad’s lemma (with ) thus gives
The choice
leads to
∎
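To make the Assouad construction above concrete, the following sketch builds a family of perturbed convex functions indexed by sign vectors: a convex base function plus small signed bumps supported on the disjoint cubes of a grid. The particular bump, base function, and constants below are illustrative stand-ins; the exact smooth perturbation and constants used in the proof are not reproduced here.

```python
import numpy as np
from itertools import product

def bump(u):
    """Smooth bump on the unit cube [0, 1]^d: vanishes, together with its
    gradient and Hessian, on the cube boundary (a stand-in for the smooth
    perturbation function used in the proof)."""
    u = np.atleast_2d(u)
    return np.prod((u * (1.0 - u)) ** 3, axis=1)

def perturbed_function(theta, x, delta, kappa):
    """Evaluate one member of the Assouad-type family: a convex base function
    plus signed bumps of magnitude ~ kappa * delta^2 on the disjoint cubes of
    a delta-grid.  theta maps cube corners (tuples) to signs in {-1, +1};
    kappa should be small enough that the bump Hessians cannot destroy the
    convexity of the base (not checked here)."""
    x = np.atleast_2d(x)
    values = 0.5 * (x ** 2).sum(axis=1)            # convex base: a quadratic
    for corner, sign in theta.items():
        u = (x - np.asarray(corner)) / delta       # local coordinates in the cube
        inside = np.all((u >= 0.0) & (u <= 1.0), axis=1)
        values[inside] += sign * kappa * delta ** 2 * bump(u[inside])
    return values

# Toy illustration in dimension d = 2 with one particular sign vector.
d, delta, kappa = 2, 0.25, 0.05
corners = list(product(np.arange(0.0, 1.0, delta), repeat=d))
theta = {corner: (-1) ** i for i, corner in enumerate(corners)}
points = np.random.default_rng(1).uniform(size=(5, d))
print(perturbed_function(theta, points, delta, kappa))
```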
Appendix B Proofs of the random design LSE rate lower bounds
This section contains the proof of Theorem 3.1. We follow the sketch given in Subsection 4.3 and results quoted in that subsection are also proved here. The first three subsections below contain the proofs of Lemma 4.2, Lemma 4.3 and Lemma 4.4 respectively. Using these three lemmas, the proof of Theorem 3.1 is completed in Subsection B.4. An important role is played by the main bracketing entropy bound stated in Theorem 4.5 and this theorem is proved in Subsection B.5. Finally, Subsection B.6 contains the proofs of Lemma 3.2 and Lemma 4.6.
B.1 Proof of Lemma 4.2
The most important ingredient for the proof of Lemma 4.2 is the bracketing entropy bound given by Theorem 4.5 which is proved in Subsection B.5. We also use the following two standard results from the theory of empirical processes and Gaussian processes respectively.
Lemma B.1.
[47, Theorem 5.11] Let be a probability measure on a set and let be the empirical distribution of . Let be a class of real-valued functions on and assume that each function in is uniformly bounded by . Then
(58) |
The following result is Dudley’s entropy integral bound [16].
Theorem B.2 (Dudley).
Let . Then for every deterministic , every class of functions , and :
In the course of proving Lemma 4.2, we will need to work with both the loss functions and . The next result states that these loss functions are, up to the factor 2, sufficiently close (of order which is much smaller than for ) allowing us to switch between them.
Lemma B.3.
Suppose is a convex body satisfying (3). Then there exist positive constants and such that
and
and
and
Additionally, when is a polytope whose number of facets is bounded by a constant depending on alone, all these inequalities continue to hold if is replaced by the larger class with replaced by .
Proof of Lemma B.3.
We use the following result which can be found in [47, Proof of Lemma 5.16]: Suppose is a class of functions on that are uniformly bounded by . Then there exists a positive constant such that
(59) |
and
(60) |
provided
(61) |
We first apply this result to with . From [9], we get
(62) |
The above bound implies that (61) is satisfied when is multiplied by a large enough dimensional constant. Inequalities (59) and (60) then lead to the two stated inequalities in Lemma B.3. The expectation bounds are obtained by integrating the probability inequalities. ∎
We are now ready to give the proof of Lemma 4.2.
Proof of Lemma 4.2.
Throughout, we take and to be constants depending on the dimension alone (so we can absorb them into a generic ). Lemma 4.2 has two parts: (37) and (38). We prove (38) first and then indicate the changes necessary for (37). So assume first that is a polytope whose number of facets is bounded by a constant depending on alone, and take .
With and , Lemma B.3 gives
with probability at least
(64) |
Thus for
(65) |
we get
and consequently where
holds with probability at least (64). By concentration of measure, the above conditional expectation will be close to the corresponding unconditional expectation because
satisfies the bounded differences condition:
and the bounded differences concentration inequality consequently gives
(66) |
for every . We next control
where the expectation on the left hand side is with respect to while the expectation on the right hand side is with respect to all variables . Clearly
(67) |
where consists of all functions of the form as varies over , is the empirical measure corresponding to , and is the distribution of where and are independent with and .
We now use the bound (58) which requires us to control . This is done by Theorem 4.5 which states that
(68) |
Theorem 4.5 is stated under the unnormalized integral constraint and for bracketing numbers under the unnormalized Lebesgue measure but this implies (68) as the volume of is assumed to be bounded on both sides by dimensional constants. We now claim that
(69) |
Inequality (69) is true because of the following. Let be a set of covering brackets for the set . For each bracket , we associate a corresponding bracket for as follows:
and
It is now easy to check that whenever , we have where . Further, and thus which proves (69). Inequality (68) then gives that for every , we have
where, in the last inequality, we used . The inequality
will therefore be satisfied for
for an appropriate constant . The bound (58) then gives
Combining the above steps, we deduce that for every , the inequality
holds with probability at least
for every fixed satisfying (65). Inequality (38) is now deduced by taking
Let us now get to (37). Here is not necessarily a polytope and is either or . Let us first argue that it is enough to prove (37) when . This will obviously imply the same inequality for the smaller class . It also implies the same inequality for the larger class because of the following claim:
(70) |
The above claim immediately implies that
which leads to
To see (70), note that assumptions and together imply that for some . By the Lipschitz property of , the fact that has diameter and the fact that is bounded by , we have
for every . This clearly implies that both and are larger than which proves (70).
In the rest of the proof, we therefore assume that . We write
where
and
Here are the -simplices given by Lemma 3.2. We now bound and separately. The first term can be shown to satisfy the same bound as in (37) using almost the same argument as that used for (38). The only difference is that, instead of (68), we use:
the above bound also following from Theorem 4.5. For , let and use Dudley’s bound (Theorem B.2) and (63) to write
The choice leads to
(71) |
Note that is binomially distributed with parameters and . Because (from Lemma 3.2), we have
where we used . Hoeffding’s inequality:
gives (below is such that )
Combining the above with (71), we obtain
with probability at least . The proof of (37) is now completed by combining the obtained bounds for and . ∎
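For reference, the version of Hoeffding's inequality invoked above for the binomial count is the standard two-sided tail bound, stated here in our own notation for a Binomial(n, p) random variable N:

```latex
% Hoeffding's inequality for N ~ Binomial(n, p): for every t > 0,
\[
  \mathbb{P}\bigl( |N - np| \ge t \bigr) \;\le\; 2 \exp\!\left( -\frac{2 t^{2}}{n} \right).
\]
```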
B.2 Proof of Lemma 4.3
Lemma 4.3 is proved below using Lemma 4.6 (proved in Subsection B.6) and the following standard result (Sudakov Minoration) from the theory of Gaussian processes (see e.g., [33, Theorem 3.18]):
Lemma B.4 (Sudakov minoration).
In our random design setting (with ), the following is true for every class and :
(72) |
where is a universal positive constant.
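For the reader's convenience, the general Gaussian-process statement behind Lemma B.4 is the standard form of Sudakov minoration (see [33, Theorem 3.18]); inequality (72) is an instance of it for the Gaussian process indexed by the function class. In our notation, the general bound reads:

```latex
% Sudakov minoration: if (X_t), t in T, is a centered Gaussian process with
% canonical metric d(s, t) = (E (X_s - X_t)^2)^{1/2}, then for every eps > 0,
\[
  \mathbb{E}\, \sup_{t \in T} X_t \;\ge\; c \, \epsilon \, \sqrt{\log N(\epsilon, T, d)},
\]
% where N(eps, T, d) denotes the eps-covering number of T under d and
% c > 0 is a universal constant.
```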
Proof of Lemma 4.3.
As in the proof of Lemma 4.2, we use the notation: . By triangle inequality,
From Lemma 3.2, , and so will be satisfied whenever . For such , by Sudakov minoration (Lemma B.4):
We now use Lemma 4.6 which applies to the function class and, consequently, also to the larger classes , and for . Applying Lemma 4.6 with , we get that for each fixed , the inequality
holds with probability at least . This completes the proof of Lemma 4.3 (note that, if the constant is enlarged suitably, the earlier condition implies because ). ∎
B.3 Proof of Lemma 4.4
Proof of Lemma 4.4.
Taking in Lemma 4.3, we obtain
(73) |
The condition required for the application of Lemma 4.3 places the following restriction on :
(74) |
Now consider the upper bound on given in Lemma 4.2. The leading term in this upper bound is the first term:
(75) |
as long as
(76) |
We now choose so that (75) matches the lower bound on from (73):
leading to
Check that this choice of satisfies (76) provided
(77) |
and
(78) |
We have thus proved . Check that provided
(79) |
As can now be directly checked, the choice for an appropriate and the condition for an appropriate satisfy all the four conditions (74), (78), (79) and (77). Further the probability with which all the above bounds hold is bounded from below by . Inequality (29) in Theorem 4.1 now implies that . The logarithmic term in can be further simplified to because of . This completes the proof of Lemma 4.4. ∎
B.4 Completion of the proof of Theorem 3.1
Proof of Theorem 3.1.
We shall lower bound . Let
be the lower bound on given by Lemma 4.4. As a result,
where we used (30) in the penultimate inequality. Lemma 4.4 now gives
Clearly if is chosen appropriately, then, for ,
will be larger than any constant multiple of which gives
(80) |
We now argue that a similar inequality also holds for . Here it is easiest to break the argument into the different choices of . First assume . Combining the above inequality with Lemma B.3, we obtain
Because is of a smaller order than and is of a larger order than (and is a dimensional constant), we obtain
provided where is a constant depending on and alone. Using this, we can further adjust so that
which immediately gives that
(81) |
completing the proof of inequality (20). The same argument also yields (19) for . Now take . Let us take large enough so that for . The fact (70) then implies that both and trivially exceed if does not belong to . Therefore while lower bounding these losses, we can assume that . This allows application of Lemma B.3 again and yields (19) for . The proof of Theorem 3.1 is complete. ∎
B.5 Proof of Theorem 4.5
It is enough to prove Theorem 4.5 when is exactly equal to . So in the proof of Theorem 4.5, we shall assume that . The main step in the proof is to prove it in the special case when is itself a -simplex and is identically zero on . This special case is stated in the next result (note that simplices can be written in the form (82) so the next result is applicable when is a simplex).
Lemma B.5.
Let be a convex body contained in the unit ball, and of the form
(82) |
where are fixed unit vectors. For a fixed and , let
Then for every , we have
Below, we first provide the proof of Theorem 4.5 using Lemma B.5, and then provide the proof of Lemma B.5.
Proof of Theorem 4.5.
Without loss of generality, we assume . For each and , we define as the smallest positive integer such that
(83) |
where denotes the volume of . Let denote the collection of all sequences as ranges over . Because for all , we have (here is the smallest positive integer larger than or equal to ). Thus, the cardinality of is at most . Further, as is the smallest positive integer satisfying (83), the inequality will be violated for so that
for each . Summing these for , we get
which is equivalent to . As a result
(84) |
For each sequence , let be the collection of all functions satisfying
for each . For and , the restriction of to belongs to (since is linear on each ). Applying Lemma B.5, we deduce, for every , the existence of a set consisting of no more than
brackets, such that for each , there exists a bracket such that for all , and
We now define a bracket on via and for , . Then, we clearly have for all , and
We choose
so that
where we used (84). Thus, is an -bracket.
Note that for each fixed , the total number of brackets is at most
Combining with the number of choices of the sequences , the number of realizations of the brackets is at most
which can be shown to yield inequality (43). ∎
We now prove Lemma B.5. The main ingredient in this proof is the result below whose proof directly follows from arguments in [19].
Lemma B.6.
Suppose has volume at most 1 and is of the form:
where are fixed unit vectors. For a fixed , let
For and , let
Then for every , we have
The main idea behind Lemma B.6 is that on , each function in is uniformly bounded by a constant multiple of . The proof of Lemma B.6 then follows from the bracketing entropy result for uniformly bounded convex functions (details can be found in [19]).
In the proof of Lemma B.5, we use the following notation. For , where are fixed unit vectors and , let
Observe that for , the set coincides with in Lemma B.6. Further .
Proof of Lemma B.5.
Fix . We shall prove the following by induction on : There exist two constants and such that
(85) |
for every , and for every of the form (82). As mentioned above, for , the set equals the set in Lemma B.6 and, as a result, (85) for follows directly from Lemma B.6. Let us now assume that (85) is true for and proceed to prove it for . Define
For a positive integer (to be determined later) and for , define
Furthermore, define
Then, the sets together with form a partition of . We now aim to use the induction step. For this purpose we denote the inflated sets
The key observation now is that and . To prove , observe that equals
As a result, if and only if
which is equivalent to . The claim follows similarly.
For every and , let be the smallest positive integer satisfying
(86) |
where denotes the volume of . Because for all , we have . Let denote the collection of all sequences as ranges over . Because each , the cardinality of is at most . Further, because is the smallest positive integer satisfying (86), the inequality will be violated for so that
for each . Summing these over , we obtain
We now observe that every point in is contained in for at most three different so that . This gives
In other words, we have proved that every sequence satisfies:
(87) |
One consequence of the above is that
(88) |
where we used convexity ( with and ) and the fact that .
For each , let be the class of all functions additionally satisfying
(89) |
for every . It is then clear that which implies (below denotes the cardinality of the finite set )
(90) |
We shall now fix and bound . Because , , , , , form a partition of , we have
(91) |
provided satisfy
(92) |
To bound for a fixed , note first that so that
The induction hypothesis will be used to control the right hand side above. Because of the right side inequality in (89), the restriction of each to the set belongs to and so, by the induction hypothesis, we have
(93) |
For , we use the induction hypothesis again to obtain
(94) |
On the sets and , we set and to be equal to zero, by taking the trivial bracket . For this to be valid, and have to satisfy
(95) |
We now choose , so as to satisfy (92). First of all, we take
(96) |
To satisfy the conditions (95), note that and are bounded by (note also that is a constant). Thus (95) can be ensured for for some . ∎
B.6 Proofs of Lemma 3.2 and Lemma 4.6
Proof of Lemma 3.2.
Let us first assume that is a polytope (satisfying (3)) whose number of facets is bounded by a constant depending on alone. For a fixed , let be the collection of all cubes of the form
(98) |
for which intersect . Because is contained in the unit ball, there exists a dimensional constant such that the cardinality of is at most for .
For each , the set is a polytope whose number of facets is bounded from above by a constant depending on alone. This polytope can therefore be triangulated into at most -simplices. Let be the collection obtained by taking all of the aforementioned simplices as varies over . These simplices clearly satisfy and the second requirement of Lemma 3.2. Moreover
and the diameter of each simplex is at most . Now define to be a piecewise affine convex function that agrees with at each vertex of each simplex and is defined by linearity everywhere else on . This function is clearly affine on each , belongs to for a sufficiently large , and satisfies
Now given , let for a sufficiently small dimensional constant and let be the function for this . The number of simplices is now and
which completes the proof of Lemma 3.2 when is a polytope.
Now assume that is a generic convex body (not necessarily a polytope) satisfying (3). Here the only difference in the proof is that we take to be the collection of all cubes of the form (98) for which are contained in the interior of . Because each of these cubes has diameter , it follows that
The rest of the argument is the same as in the polytopal case. ∎
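The approximating function constructed above is the piecewise affine interpolant of the target on a triangulation whose vertices form a fine grid. The sketch below illustrates this interpolation step using a Delaunay triangulation as a stand-in for the specific triangulation of the proof (note that the interpolant in the proof is additionally convex by construction, a property the generic sketch does not enforce).

```python
import numpy as np
from itertools import product
from scipy.interpolate import LinearNDInterpolator

def piecewise_affine_approx(f, delta, d=2):
    """Piecewise affine interpolant of f on [0, 1]^d agreeing with f at the
    vertices of a delta-grid (a stand-in for the simplicial interpolant of
    Lemma 3.2; the triangulation here is the Delaunay one built by scipy)."""
    grid_1d = np.arange(0.0, 1.0 + 1e-12, delta)
    vertices = np.array(list(product(grid_1d, repeat=d)))
    values = np.array([f(v) for v in vertices])
    return LinearNDInterpolator(vertices, values)

# Toy illustration: interpolate a convex quadratic and check the sup error.
f = lambda x: float(np.sum(x ** 2))
interpolant = piecewise_affine_approx(f, delta=0.1, d=2)
test = np.random.default_rng(2).uniform(size=(1000, 2))
errors = np.abs(interpolant(test) - np.array([f(t) for t in test]))
print("maximum interpolation error:", errors.max())
```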
Proof of Lemma 4.6.
By a standard argument involving local perturbations to the quadratic function , we can prove the following for three constants depending on alone: for every and , there exists an integer with
and functions such that
One explicit construction of such perturbed functions is given in the proof of Lemma 4.10. Lemma 4.6 follows from the above claim and the Hoeffding inequality. Indeed, Hoeffding’s inequality applied to the random variables (which are bounded by ) followed by a union bound allows us to deduce that, for every ,
for a universal constant . Taking , we get
Assuming now that , we get
As each satisfies , the above completes the proof of Lemma 4.6. ∎
Appendix C Proofs of fixed-design results for the unrestricted convex LSE
This section contains the proofs of Theorem 3.3, Theorem 3.4, Theorem 3.5 and Theorem 3.6. We follow the sketch given in Subsection 4.4 and results quoted in that subsection are also proved here. Subsection C.1 below contains the proof of Theorem 3.3 and that of the main ingredient in its proof, Lemma 4.7. Subsection C.2 contains the proof of Theorem 3.5 and that of the main ingredient in its proof, Lemma 4.8. Subsection C.3 contains the proof of Theorem 3.4. Subsection C.4 contains the proof of Theorem 3.6. Lemma 4.9 and Lemma 4.10, which are both crucial for the proof of Theorem 3.6, are proved in Subsection C.5. The metric entropy results are proved in Subsections C.6 and C.7. Specifically, the main metric entropy result (Theorem 4.11) is proved in Subsection C.7. The two important corollaries of Theorem 4.11: inequality (46) and inequality (48), are proved in Subsection C.6.
C.1 Proof of Lemma 4.7 and Theorem 3.3
Proof of Lemma 4.7.
The metric entropy bound (46), together with Dudley’s bound (Theorem B.2), gives
(99) |
for every . We replace by just (in general, the value of can change from place to place). We now split into the three cases and . For , we take to get
For , (99) leads to
Choosing , we obtain
Finally, for , (99) leads to the bound
for every . The choice
gives
∎
Proof of Theorem 3.3.
By Theorem 4.1 (specifically the upper bound in inequality (27), and (28)), the risk is bounded from above by for every satisfying or, equivalently, . We shall use the bound on given in Lemma 4.7 to get such that . The risk will be dominated by such because this will be of larger order than . For , Lemma 4.7 gives
Because
and
we deduce that for
This proves, for ,
The leading term in the right hand side above is the first term inside the maximum which proves Theorem 3.3 for .
C.2 Proof of Lemma 4.8 and Theorem 3.5
Proof of Lemma 4.8.
We next provide the proof of Theorem 3.5.
C.3 Proof of Theorem 3.4
This basically follows from Theorem 3.6. Let and be as given by Theorem 3.6. Letting and assuming that , we obtain from Theorem 3.6 that
where is such that (existence of such a is guaranteed by Lemma 3.2). The required lower bound (22) on the class for an arbitrary can now be obtained by an elementary scaling argument.
C.4 Proof of Theorem 3.6
Proof of Theorem 3.6.
Let and for notational convenience. By the lower bound given by Lemma 4.9,
Taking and noting that
we get that
The above inequality, combined with the upper bound (49) and the fact that maximizes over all , yields
This implies
Theorem 4.1 then gives
The first term on the right hand side above dominates the second term when is larger than a constant depending on alone. This completes the proof of Theorem 3.6. ∎
C.5 Proofs of Lemma 4.9 and Lemma 4.10
We first prove Lemma 4.9 below assuming the validity of Lemma 4.10. Lemma 4.10 is proved later in this subsection.
Proof of Lemma 4.9.
By Lemma 3.2, satisfies
(102) |
where . Assume first that where is the constant from (102). For this choice of , it follows from the triangle inequality (and (102)) that
This immediately implies
Let and in the rest of this proof (this is for notational convenience). We just proved for . By Sudakov minoration (Lemma B.4), is bounded from below by
Lemma 4.10 gives a lower bound on the metric entropy appearing above. Applying this for , we get
provided
(103) |
The condition above is necessary for the inequality which is required for the application of Lemma 4.10. This gives
Now for , we use the fact that is concave on (and that ) to deduce that
This proves (50) for
completing the proof of Lemma 4.9. ∎
We next prove Lemma 4.10.
Proof of Lemma 4.10.
We shall use a perturbation result that is similar to the one mentioned at the beginning of the proof of Lemma 4.6. Because we are dealing with the discrete metric here, this argument might be nonstandard so we provide an explicit construction (some aspects of this construction were also used in the proof of Proposition 2.3). Let
Note that is smooth, for and
which implies that the Hessian of is dominated by times the identity matrix. It is also easy to check that the Hessian of equals zero on the boundary of .
Now for every grid point in , consider the function
Clearly is supported on the cube
and these cubes for different grid points have disjoint interiors.
We now consider binary vectors in . We shall index each by . For every , consider the function
(104) |
It can be verified that is convex because has constant Hessian equal to times the identity, the Hessian of is bounded by and the supports of have disjoint interiors. Note further that for and ,
This implies that
where is the Hamming distance between and . The Varshamov-Gilbert lemma (see e.g., [37, Lemma 4.7]) asserts the existence of a subset of with cardinality such that for all with . We then have
Inequality (16) then gives
for a dimensional constant . One can also check that
for another dimensional constant completing the proof of Lemma 4.10. ∎
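The Varshamov-Gilbert lemma invoked above guarantees an exponentially large subset of the binary hypercube with pairwise Hamming distance bounded below. The greedy construction sketched below produces such a well-separated set; it is only an illustration and does not by itself certify the cardinality bound asserted by the lemma.

```python
import numpy as np
from itertools import product

def greedy_hamming_packing(m, min_dist, max_points=None):
    """Greedily collect binary vectors of length m whose pairwise Hamming
    distance is at least min_dist (a constructive stand-in for the
    Varshamov-Gilbert lemma, which is purely existential)."""
    chosen = []
    for v in product((0, 1), repeat=m):
        v = np.array(v)
        # keep v only if it is far (in Hamming distance) from all chosen vectors
        if all(np.sum(v != w) >= min_dist for w in chosen):
            chosen.append(v)
        if max_points is not None and len(chosen) >= max_points:
            break
    return chosen

# Toy illustration: m = 8, pairwise Hamming distance at least m / 4 = 2.
pack = greedy_hamming_packing(8, 2)
print("packing size:", len(pack))
```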
C.6 Proofs of inequality (46) and inequality (48)
In this subsection, we provide proofs of the discrete metric entropy results given by inequalities (46) and (48). We shall assume and use Theorem 4.11 for these proofs. The proof of Theorem 4.11 is given in the next subsection (Subsection C.7).
Proof of Inequality (46).
By specializing to the case and writing for , we can rephrase the conclusion of Theorem 4.11 as:
Here the in refers to the function that is identically equal to zero. Because in the representation (14), the number is assumed to be bounded from above by a constant depending on alone, we have . We then use (16) to bound by a constant multiple of . These give
Because if and only if for every affine function on (i.e., ), we deduce
for every . By triangle inequality,
for every . Thus
Because is arbitrary in the right hand side above, we can take the infimum over . This allows us to replace on the right hand side above by leading to the required inequality (46). ∎
Proof of Inequality (48).
Fix . By definition of , there exist subsets satisfying the following properties:
- 1. is affine on each ,
- 2. each can be written as an intersection of at most slabs (i.e., as in (14) with ), and
- 3. are disjoint with
Note that is the cardinality of . Let denote the cardinality of for each . We can assume that each for otherwise we can simply drop that . For each such that and , let be the smallest positive integer for which
(105) |
Because , we have for each . Also because is the smallest integer satisfying (105), we have
which implies that
leading to
Let
and note that the cardinality of is at most as for each . For each , let
By construction, we have
so that
(106) |
where denotes the cardinality of the finite set , and we used . We shall now upper-bound for a fixed . The idea is that this will follow from Theorem 4.11 applied to each . First observe that Theorem 4.11 applied to and replaced by gives
(107) |
for every . Here it is crucial that is affine on (which ensures ); also note that in (52) is replaced by here because it is assumed that each is of the form (14) with (this is part of the assumption ). Observe now that if
for each , then
As a result
Inequality (107) then gives
The proof of (48) is now completed by combining the above inequality with (106) (note that because of (16)). ∎
C.7 Proof of Theorem 4.11
This proof has two main steps. In the first step (which is the focus of the next subsection), we stay away from the boundary and prove the entropy bound with a modified metric that is defined only in the interior of the domain . In the second step (which is the focus of Subsection C.7.2), we extend this result to reach the boundary of .
Throughout this proof, we use the setting described in Subsection 2.2. In particular, is the regular -dimensional -grid (15) and is of the form (14) and satisfies (3). are an enumeration of the points in with denoting the cardinality of . Also is defined as in (51).
C.7.1 Away from the boundary
The goal of this subsection is to prove the following proposition, which is the analogue of Lemma B.6 in this discrete setting. Let denote the center of the John ellipsoid of (recall that the John ellipsoid of is the unique ellipsoid of maximum volume contained in ). For , let
(108) |
and note that , where and denote volumes of and .
Proposition C.1.
Suppose . Fix and . Then there exists a set consisting of functions such that for every with , there exists satisfying for all , where is as defined in (108) and is a constant depending only on .
Proposition C.1 states that, under the supremum metric on the interior set , the metric entropy of is at most . To prove Proposition C.1, we need some preliminary results. The next result shows that the number of grid points contained in is of order (this provides a proof of the claim (16)).
Lemma C.2.
Proof.
First observe that
Because of (3), contains the ball of radius centered at the origin which implies
Volume comparison with the last pair of displayed relations gives us
On the other hand, let be the union of the cubes . The volume of is . Since the union of covers , we have . In particular, contains the set
Since contains the ball of radius centered at the origin, if we let
then the distance between any and is at least . Hence . Consequently
Using and , the above implies that , completing the proof of Lemma C.2. ∎
Lemma C.3.
Suppose . Then for every with .
Proof.
Let be the minimizer of on . If , then there is nothing to prove; otherwise, the set is a closed convex set containing . Denote , and let , where . We show that for all , . Indeed, define a function on so that , for all , and is linear on . Then, by the convexity of on each , we have on . Thus, for all ,
Next, we show that most of the grid points in are outside . Indeed, if is a grid point in , then and at least one half of the cube lies outside . Thus, the number of grid points in is bounded by
By the Minkowski-Steiner formula (see e.g., [43]), can be expressed as a sum of mixed volumes of and ; because the mixed volumes are monotone, we have
for all convex sets and . This gives
Also so that
Thus, the number of grid points in is bounded by
Hence,
which implies that by using and for . ∎
Lemma C.4.
Suppose . Then, at any point on the boundary of , any hyperplane passing through cuts into two parts. The part that does not contain the center of the John ellipsoid of as an interior point contains at least grid points.
Proof.
Since is on the boundary of , any hyperplane passing through cuts into two parts. Suppose is the part that does not contain the center of the John ellipsoid of as an interior point. We prove that . Because the ratio is invariant under affine transformations, we estimate , where is an affine transformation such that the John ellipsoid of is the unit ball . Then, it is known that is contained in a ball of radius (see e.g., [5, Lecture 3]). Because the distance from to the boundary of is at least , contains half of the ball with center at and radius . Thus, has volume at least . Since is contained in the ball of radius , we have . This implies that . Hence . ∎
Lemma C.5.
Suppose . Then for every with .
Proof.
Let be the maximizer of on . By convexity of , the point must be on the boundary of . If , there is nothing to prove. So we assume . The convexity of implies that lies on the boundary of the convex set . There exists a hyperplane such that the convex set lies entirely on one side of it. Let be the portion of that lies on the other side of the hyperplane that supports at . This hyperplane cuts into two parts. Let be the part that does not contain . Then, for all . Let denote the cardinality of . By Lemma C.4, we have
Since for all , we have
We obtain by combining the above two displayed inequalities, and this proves Lemma C.5. ∎
We are now ready to prove Proposition C.1.
Proof of Proposition C.1.
Fix with . By Lemma C.3 and Lemma C.5, we have for all . Let be an affine transformation so that the John ellipsoid of equals the unit ball . Because , by the proof of Lemma C.4, the distance between the boundary of and the boundary of is at least . Define the convex function on by . Then, for all .
For , assume without loss of generality that and consider the half-line starting from and passing through . Suppose the half-line intersects the boundary of at the point and the boundary of at the point . By the convexity of on this half-line, we have
This implies that is a convex function on with Lipschitz constant . Of course, is also bounded by on . Thus, by the classical result of [9] on the metric entropy (in the supremum metric) of bounded Lipschitz convex functions, there exists a finite set of functions, consisting of functions, such that for every with , there exists such that . This implies . Thus, by setting , the proposition follows with (note that is a multiple of so that is a constant depending on alone). ∎
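For reference, the classical result of [9] used in the last step can be stated in the following normalized form (this is the standard statement, written in our own notation; the constant depends only on the dimension):

```latex
% Bronshtein's theorem (normalized form): for the class C of convex functions
% on [0, 1]^d taking values in [-1, 1] and with Lipschitz constant at most 1,
% there is a constant c_d depending only on d such that, for all small eps > 0,
\[
  \log N\bigl(\epsilon, \, \mathcal{C}, \, \|\cdot\|_{\infty}\bigr) \;\le\; c_d \, \epsilon^{-d/2}.
\]
```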
C.7.2 Reaching the Boundary
We now move closer to the boundary of . More precisely, we will extend Proposition C.1 from to the set defined below. For this subsection, it will be convenient to rewrite as:
where is the center of the John ellipsoid of , and . As in (14), are unit vectors.
Let and be the smallest integers such that and . Let
Note that is quite close to because the Hausdorff distance between them is at most .
The following proposition suggests that to achieve our goal of bounding the metric entropy of on , we only need to properly decompose .
Proposition C.6.
Suppose , is a sequence of convex subsets of such that no point in is contained in more than subsets in the sequence. Further suppose that . Then
Proof.
Let be the set of grid points in , and be the grid points in . For every such that , define to be the smallest positive integer, such that
Because is the smallest positive integer satisfying the above, the inequality will be reversed for allowing us to deduce
which is equivalent to . The number of possible values for each is clearly at most which implies that there are no more than possible values of .
Let
For each , define
Then for each . Applying Proposition C.1 for , there exists a set consisting of functions, such that for every , there exists satisfying
If we define for , then we have
where the last inequality holds if we let
The total number of possible choices for is therefore
Using the inequalities
and
we can bound the total number of realizations of by
Consequently, we have
∎
In the next result, we decompose according to the requirement of Proposition C.6. Recall that and are of order and were defined at the beginning of this subsection (Subsection C.7.2).
Lemma C.7.
Let . There exist convex sets , contained in , such that no point in is contained in more than of these sets, and
Proof.
Let
There are elements in . For each , define
where
is a convex set. The union of all , is the set
Similarly, we define
where
Let be the center of John ellipsoid of . Because equals
we have
where in the second to the last equality we used the fact that and .
It can be checked that when the integer , the intervals and intersect only when or when or when . Hence where are at most four possibilities which implies that no point can be contained in more than different sets . Lemma C.7 follows by renaming these sets as , . ∎
We are finally ready to complete the proof of Theorem 4.11.
Proof of Theorem 4.11.
Because the distance between the boundary of and the boundary of is no larger than , the set can be decomposed into at most pieces of width . By Khinchine’s flatness theorem, the grid points in are contained in hyperplanes for some constant . The intersection of and each of these hyperplanes is a dimensional convex polytope. This enables us to obtain covering number estimates on using lower dimensional estimates. Because the desired covering number estimate is known to be true for , the result follows from mathematical induction on dimension. This concludes the proof of Theorem 4.11. ∎
References
- [1] Aït-Sahalia, Y. and Duarte, J. (2003). Nonparametric option pricing under shape restrictions. J. Econometrics 116, 9–47.
- [2] Allon, G., Beenstock, M., Hackman, S., Passy, U. and Shapiro, A. (2007). Nonparametric estimation of concave production technologies by entropic methods. J. Appl. Econometrics 22, 795–816.
- [3] Balázs, G. (2016). Convex regression: theory, practice, and applications. PhD thesis, University of Alberta.
- [4] Balázs, G., György, A. and Szepesvári, C. (2015). Near-optimal max-affine estimators for convex regression. In AISTATS.
- [5] Ball, K. (1997). An elementary introduction to modern convex geometry. Flavors of Geometry 31, 26.
- [6] Bellec, P. C. (2018). Sharp oracle inequalities for least squares estimators in shape restricted regression. Ann. Statist. 46, 745–780.
- [7] Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97, 113–150.
- [8] Bronshteyn, E. M. and Ivanov, L. D. (1975). The approximation of convex sets by polyhedra. Siberian Mathematical Journal 16, 852–853.
- [9] Bronšteĭn, E. M. (1976). ε-entropy of convex sets and functions. Sibirsk. Mat. Ž. 17, 508–514, 715.
- [10] Chatterjee, S. (2014). A new perspective on least squares under convex constraint. Ann. Statist. 42, 2340–2381.
- [11] Chatterjee, S. (2016). An improved global risk bound in concave regression. Electron. J. Stat. 10, 1608–1629.
- [12] Chatterjee, S., Guntuboyina, A. and Sen, B. (2015). On risk bounds in isotonic and other shape restricted regression problems. Ann. Statist. 43, 1774–1800.
- [13] Chen, W. and Mazumder, R. (2020). Multivariate convex regression at scale. Unpublished manuscript.
- [14] Chen, Y. and Wellner, J. A. (2016). On convex least squares estimation when the truth is linear. Electron. J. Stat. 10, 171–209.
- [15] Doss, C. R. (2020). Bracketing numbers of convex and m-monotone functions on polytopes. J. Approx. Theory 256, 105425.
- [16] Dudley, R. M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis 1, 290–330.
- [17] Dümbgen, L., Freitag, S. and Jongbloed, G. (2004). Consistency of concave regression with an application to current-status data. Math. Methods Statist. 13, 69–81.
- [18] Gao, F. and Wellner, J. A. (2007). Entropy estimate for high-dimensional monotonic functions. J. Multivariate Anal. 98, 1751–1764.
- [19] Gao, F. and Wellner, J. A. (2017). Entropy of convex functions on R^d. Constr. Approx. 46, 565–592.
- [20] Ghosal, P. and Sen, B. (2017). On univariate convex regression. Sankhya A 79, 215–253.
- [21] Groeneboom, P. and Jongbloed, G. (2014). Nonparametric Estimation under Shape Constraints, Vol. 38. Cambridge University Press.
- [22] Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001). Estimation of a convex function: characterizations and asymptotic theory. Ann. Statist. 29, 1653–1698.
- [23] Guntuboyina, A. and Sen, B. (2015). Global risk bounds and adaptation in univariate convex regression. Probab. Theory Related Fields 163, 379–411.
- [24] Han, Q. (2019). Global empirical risk minimizers with "shape constraints" are rate optimal in general dimensions. arXiv preprint arXiv:1905.12823.
- [25] Han, Q., Wang, T., Chatterjee, S. and Samworth, R. J. (2019). Isotonic regression in general dimensions. Ann. Statist. 47, 2440–2471.
- [26] Han, Q. and Wellner, J. A. (2016). Multivariate convex regression: global risk bounds and adaptation. arXiv preprint arXiv:1601.06844.
- [27] Han, Q. and Wellner, J. A. (2019). Convergence rates of least squares regression estimators with heavy-tailed errors. Ann. Statist. 47, 2286–2319.
- [28] Hanson, D. L. and Pledger, G. (1976). Consistency in concave regression. Ann. Statist. 4, 1038–1050.
- [29] Hildreth, C. (1954). Point estimates of ordinates of concave functions. J. Amer. Statist. Assoc. 49, 598–619.
- [30] Keshavarz, A., Wang, Y. and Boyd, S. (2011). Imputing a convex objective function. In 2011 IEEE International Symposium on Intelligent Control (ISIC), 613–619. IEEE.
- [31] Kuosmanen, T. (2008). Representation theorem for convex nonparametric least squares. The Econometrics Journal 11, 308–325.
- [32] Kur, G., Dagan, Y. and Rakhlin, A. (2019). Optimality of maximum likelihood for log-concave density estimation and bounded convex regression. arXiv preprint arXiv:1903.05315.
- [33] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York.
- [34] Lim, E. (2014). On convergence rates of convex regression in multiple dimensions. INFORMS J. Comput. 26, 616–628.
- [35] Lim, E. and Glynn, P. W. (2012). Consistency of multidimensional convex regression. Oper. Res. 60, 196–208.
- [36] Mammen, E. (1991). Nonparametric regression under qualitative smoothness assumptions. Ann. Statist. 19, 741–759.
- [37] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Mathematics 1896. Springer, Berlin.
- [38] Matzkin, R. L. (1991). Semiparametric estimation of monotone and concave utility functions for polychotomous choice models. Econometrica 59, 1315–1327.
- [39] Mazumder, R., Choudhury, A., Iyengar, G. and Sen, B. (2019). A computational framework for multivariate convex regression and its variants. J. Amer. Statist. Assoc. 114, 318–331.
- [40] Nemirovski, A. S. (2000). Topics in nonparametric statistics. In Lectures on Probability Theory and Statistics, École d'Été de Probabilités de Saint-Flour XXVIII-1998. Lecture Notes in Mathematics 1738. Springer-Verlag, Berlin.
- [41] Nemirovskij, A. S., Polyak, B. and Tsybakov, A. B. (1984). Signal processing by the nonparametric maximum likelihood method. Problems of Information Transmission 20, 177–192.
- [42] Nemirovskij, A. S., Polyak, B. and Tsybakov, A. B. (1985). Rate of convergence of nonparametric estimates of maximum-likelihood type. Problems of Information Transmission 21, 258–272.
- [43] Schneider, R. (1993). Convex Bodies: The Brunn-Minkowski Theory. Cambridge Univ. Press, Cambridge.
- [44] Seijo, E. and Sen, B. (2011). Nonparametric least squares estimation of a multivariate convex regression function. Ann. Statist. 39, 1633–1657.
- [45] Talagrand, M. (1996). A new look at independence. Ann. Probab. 24, 1–34.
- [46] Toriello, A., Nemhauser, G. and Savelsbergh, M. (2010). Decomposing inventory routing problems with approximate value functions. Naval Res. Logist. 57, 718–727.
- [47] van de Geer, S. A. (2000). Applications of Empirical Process Theory. Cambridge Series in Statistical and Probabilistic Mathematics 6. Cambridge University Press, Cambridge.
- [48] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer-Verlag, New York.
- [49] Varian, H. R. (1982). The nonparametric approach to demand analysis. Econometrica 50, 945–973.
- [50] Varian, H. R. (1984). The nonparametric approach to production analysis. Econometrica 52, 579–597.
- [51] Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27, 1564–1599.