
Convex Regression in Multidimensions: Suboptimality of Least Squares Estimators

Gil Kur (Institute for Machine Learning, ETH Zürich; gil.kur@inf.ethz.ch), Fuchang Gao (Department of Mathematics, University of Idaho; fuchang@uidaho.edu), Adityanand Guntuboyina (Department of Statistics, University of California, Berkeley; aditya@stat.berkeley.edu) and Bodhisattva Sen (Department of Statistics, Columbia University; bodhi@stat.columbia.edu)
Abstract

Under the usual nonparametric regression model with Gaussian errors, Least Squares Estimators (LSEs) over natural subclasses of convex functions are shown to be suboptimal for estimating a d-dimensional convex function in squared error loss when the dimension d is 5 or larger. The specific function classes considered include: (i) bounded convex functions supported on a polytope (in random design), (ii) Lipschitz convex functions supported on any convex domain (in random design), (iii) convex functions supported on a polytope (in fixed design). For each of these classes, the risk of the LSE is proved to be of the order n^{-2/d} (up to logarithmic factors) while the minimax risk is n^{-4/(d+4)}, when d ≥ 5. In addition, the first rate of convergence results (worst case and adaptive) for the unrestricted convex LSE are established in fixed design for polytopal domains for all d ≥ 1. Some new metric entropy results for convex functions are also proved which are of independent interest.

MSC classification: 62G08.

Keywords: Adaptive risk bounds, bounded convex regression, Dudley's entropy bound, Lipschitz convex regression, lower bounds on the risk of least squares estimators, metric entropy, nonparametric maximum likelihood estimation, Sudakov minoration.

1 Introduction

The main goal of this paper is to show that nonparametric Least Squares Estimators (LSEs) associated with the constraint of convexity can be minimax suboptimal when the underlying dimension is 5 or larger. Specifically, we consider regression over a variety of classes of convex functions (and covariate designs), find lower bounds on the rates of convergence of the corresponding LSE over each class, and show that, when d ≥ 5, these rates of convergence do not match the minimax rate for the given class.

We work in the standard nonparametric regression setting for estimating an unknown convex function f_0 : Ω → ℝ defined on a known full-dimensional compact convex domain Ω ⊆ ℝ^d (d ≥ 1) from observations (X_1, Y_1), …, (X_n, Y_n) generated via the model:

Y_i = f_0(X_i) + ξ_i,  for i = 1, …, n,  (1)

where ξ_1, …, ξ_n are i.i.d. errors having the N(0, σ²) distribution, and the design points X_1, …, X_n ∈ Ω may be fixed or random. This problem is known as convex regression and has a long history in statistics and related fields. Standard references are [29, 28, 22, 21, 44, 31, 35, 3], and applications of convex regression can be found in [49, 50, 2, 38, 1, 30, 46].

The Least Squares Estimator (LSE) over a class ℱ of functions on Ω is defined as any minimizer of the least squares criterion over ℱ:

f̂_n(ℱ) ∈ argmin_{f ∈ ℱ} Σ_{i=1}^n (Y_i − f(X_i))².  (2)
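As a concrete illustration (ours, not part of the paper's development), the minimization in (2) over a convex class only involves the fitted values at the design points and is a quadratic program. The sketch below, assuming SciPy is available, computes a one-dimensional convex LSE with a generic solver rather than a specialized QP routine; in d = 1, convexity of the fit amounts to nondecreasing slopes between consecutive design points.

```python
# Minimal sketch of the convex LSE (2) for d = 1, using a generic SLSQP
# solver.  The choice of f0 and the noise level are illustrative only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 30
x = np.sort(rng.uniform(-1, 1, n))        # design points in Omega = [-1, 1]
y = x**2 + rng.normal(0, 0.1, n)          # Y_i = f0(X_i) + xi_i, f0(x) = x^2

def convexity(g):
    # slopes of the piecewise-linear interpolant must be nondecreasing
    return np.diff(np.diff(g) / np.diff(x))

res = minimize(lambda g: np.sum((y - g) ** 2), x0=y, method="SLSQP",
               constraints=[{"type": "ineq", "fun": convexity}],
               options={"maxiter": 1000})
fhat = res.x                              # LSE fitted values at the design points
assert np.all(convexity(fhat) >= -1e-5)   # fitted values form a convex sequence
```

For larger n or d ≥ 2 one would instead solve the QP with subgradient constraints g_j ≥ g_i + s_i^T (X_j − X_i) using a dedicated solver; this toy version only illustrates the structure of the optimization problem.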

In order to introduce the specific LSEs studied in this paper, consider the following four function classes:

  1. 𝒞_L^B(Ω): the class of all convex functions on Ω that are uniformly Lipschitz with Lipschitz constant L and uniformly bounded by B.

  2. 𝒞_L(Ω): the class of all convex functions on Ω that are uniformly Lipschitz with Lipschitz constant L. There is no uniform boundedness assumption on functions in this class.

  3. 𝒞^B(Ω): the class of all convex functions on Ω that are uniformly bounded by B. There is no Lipschitz assumption on functions in this class.

  4. 𝒞(Ω): the class of all convex functions on Ω.

Throughout the paper, we assume that L and B are positive constants not depending on the sample size n. We study LSEs over each of the above function classes and use the following terminology for these estimators:

  1. f̂_n(𝒞_L^B(Ω)): Bounded Lipschitz Convex LSE,

  2. f̂_n(𝒞_L(Ω)): Lipschitz Convex LSE,

  3. f̂_n(𝒞^B(Ω)): Bounded Convex LSE,

  4. f̂_n(𝒞(Ω)): Unrestricted Convex LSE.

Strictly speaking, f̂_n(𝒞_L^B(Ω)) should be called the “Uniformly Bounded Uniformly Lipschitz Convex LSE” but we omit the word “uniformly” throughout for brevity. Table 1 gives references to some previous papers that studied each of these estimators. For the unrestricted convex LSE, Table 1 lists only the papers focusing on the multivariate case (univariate investigations are in [28, 17, 22, 14, 20, 23, 12, 6, 11]).

Function class | LSE | Name | References
𝒞_L^B(Ω) | f̂_n(𝒞_L^B(Ω)) | Bounded Lipschitz Convex LSE | [41, 42, 36, 4]
𝒞_L(Ω) | f̂_n(𝒞_L(Ω)) | Lipschitz Convex LSE | [34, 39]
𝒞^B(Ω) | f̂_n(𝒞^B(Ω)) | Bounded Convex LSE | [26]
𝒞(Ω) | f̂_n(𝒞(Ω)) | Unrestricted Convex LSE | [44, 31, 35, 39, 13]

Table 1: A listing of the LSEs studied in this paper

This paper studies the performance of these LSEs and compares their rates of convergence with the corresponding minimax rates. For our results, we assume throughout the paper that the convex body Ω (the domain of the unknown function f_0) is translated and scaled so that

r_d 𝔅_d ⊆ Ω ⊆ 𝔅_d  (3)

where 𝔅_d is the unit ball in ℝ^d and r_d is a positive constant depending on d alone. In particular, this assumption implies that Ω has diameter at most 2 and volume at most 1, and that the diameter and volume of Ω are bounded from below by constants depending on d alone. The classical John's theorem (see e.g., [5, Lecture 3]) shows that for every convex body Ω, there exists an affine transformation such that (3) holds for the transformed body with r_d = 1/d. In the rest of the paper, whenever we say that a result holds for “any convex body Ω” (such as in Tables 2, 3, 4, 5 and 6), we mean “any convex body Ω satisfying (3)”.

We focus first on the random design setting where the design points X_1, …, X_n are assumed to be independent with the uniform distribution ℙ on Ω (see Subsection 5.3 for a discussion of more general design assumptions in random design), and work with the loss function

ℓ_ℙ²(f, g) := ∫_Ω (f − g)² dℙ.  (4)

The minimax rate for a function class ℱ in the above setting is:

ℜ_n^random(ℱ) := inf_{f̆_n} sup_{f_0 ∈ ℱ} 𝔼_{f_0} ℓ_ℙ²(f̆_n, f_0)

where 𝔼_{f_0} denotes expectation with respect to the joint distribution of all the observations (X_1, Y_1), …, (X_n, Y_n) when f_0 is the true regression function (see (1)), and the infimum is over all estimators f̆_n.

Minimax rates for the above function classes 1–4 are presented in Table 2. These results are known except perhaps for 𝒞_L(Ω), which we state and prove in this paper as Proposition 2.1. These minimax rates can also be derived from the corresponding metric entropy rates, which are given in Table 3. Recall that the ε-metric entropy of a function class ℱ with respect to a metric ℓ is defined as the logarithm of the smallest number N(ε, ℱ, ℓ) of closed balls of radius ε whose union contains ℱ. Classical results from Yang and Barron [51] imply that the minimax rate (for the loss ℓ_ℙ²) for a nonparametric function class ℱ satisfying some natural assumptions (see [51, Subsection 3.2] for a listing of these assumptions, which are satisfied for 𝒞_L^B(Ω) and 𝒞^B(Ω)) equals ε_n² where ε_n solves the equation:

n ε² ≍ log N(ε, ℱ, ℓ_ℙ).  (5)
Class | Minimax Rate | Assumption on Ω | Reference
𝒞_L^B(Ω) | n^{-4/(d+4)} | any convex body | [42], [4, Theorem 4.1]
𝒞_L(Ω) | n^{-4/(d+4)} | any convex body | this paper (Proposition 2.1)
𝒞^B(Ω) | n^{-4/(d+4)} | polytope | [26, Theorems 2.3, 2.4]
𝒞^B(Ω) | n^{-2/(d+1)} | ball | [26, Theorems 2.3, 2.4]

Table 2: Minimax rates in the loss (4) under the random design setting where X_1, …, X_n are i.i.d. from the uniform distribution on Ω. For the rate in the last row, we assume d ≥ 2.
Class | ε-entropy | ε-bracketing entropy | Assumption on Ω | Reference
𝒞_L^B(Ω) | ε^{-d/2} | ε^{-d/2} | any convex body | [9]
𝒞_L(Ω) | infinite | infinite | any convex body | —
𝒞^B(Ω) | ε^{-d/2} | ε^{-d/2} | polytope | [19, 15]
𝒞^B(Ω) | ε^{-(d-1)} | ε^{-(d-1)} | ball | [19]

Table 3: ε-entropy and ε-bracketing entropy rates in the metric ℓ_ℙ for each of the convex function classes. For the last row, we assume d ≥ 2.

For 𝒞_L^B(Ω), the minimax rate is n^{-4/(d+4)}, which is the square of the solution for ε in (5): n ε² = ε^{-d/2}. For 𝒞_L(Ω), the metric entropy is infinite, as this class contains every constant function and hence is not totally bounded. But the minimax rate is still n^{-4/(d+4)} and can be easily derived as a consequence of the minimax rate result for 𝒞_L^L(Ω) (the argument is given in the proof of Proposition 2.1).
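The algebra behind this balance equation is simple enough to verify numerically: the solution of n ε² = ε^{-d/2} is ε = n^{-2/(d+4)}, whose square is the minimax rate n^{-4/(d+4)}. A quick illustrative check (ours, not from the paper):

```python
# Verify that eps = n^(-2/(d+4)) solves the balance equation n * eps^2 =
# eps^(-d/2) from (5) with entropy eps^(-d/2), and that its square is the
# minimax rate n^(-4/(d+4)).
for d in (1, 2, 3, 5, 8):
    for n in (10**4, 10**8):
        eps = n ** (-2.0 / (d + 4))
        lhs, rhs = n * eps**2, eps ** (-d / 2.0)
        assert abs(lhs - rhs) / rhs < 1e-9      # both sides equal n^(d/(d+4))
        assert abs(eps**2 - n ** (-4.0 / (d + 4))) < 1e-12
```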

For 𝒞^B(Ω) with d ≥ 2, as shown by Han and Wellner [26] and Gao and Wellner [19] respectively, the minimax rate and the metric entropy rate change depending on whether Ω is polytopal or has a smooth boundary. Here (and in the rest of the paper), when we say that a specific set S is a “polytope”, we assume that the number of facets or extreme points of S is bounded by a constant not depending on the sample size n. If the number of facets is allowed to grow with n, then polytopes can approximate general convex bodies arbitrarily well, in which case it is not meaningful to give separate rates for polytopes and other convex bodies.

When Ω is a polytope, the minimax rate for 𝒞^B(Ω) equals n^{-4/(d+4)}, corresponding to the ε-entropy of ε^{-d/2}. In contrast, when Ω is the unit ball, the minimax rate is n^{-2/(d+1)}, corresponding to the ε-entropy of ε^{-(d-1)}. The larger metric entropy for the ball (compared to polytopes) is driven by the curvature of the boundary, and the lower bound is proved by considering indicator-like functions of spherical caps (see [19, Subsection 2.10] for the metric entropy proofs). More high-level intuition on the differences between the polytopal case and the ball case is provided in Subsection 5.1.

We now ask whether the LSEs (as defined via (2)) achieve the corresponding minimax rates. The supremum risk of an estimator f̆_n on a function class ℱ is defined as

R_n(f̆_n, ℱ) := sup_{f ∈ ℱ} 𝔼_f ℓ_ℙ²(f̆_n, f).

Existing results for the convex regression LSEs are described in Table 4, where the rate r_{n,d} is given by

r_{n,d} := n^{-4/(d+4)} for d ≤ 4,  and  r_{n,d} := n^{-2/d} for d ≥ 5.  (6)
Supremum Risk | Result | Assumption on Ω | Reference
R_n(f̂_n(𝒞_L^B(Ω)), 𝒞_L^B(Ω)) | ≲* r_{n,d} | any convex body | [4]
R_n(f̂_n(𝒞_L(Ω)), 𝒞_L(Ω)) | ≲* r_{n,d} | any convex body | [34, 39]
R_n(f̂_n(𝒞^B(Ω)), 𝒞^B(Ω)) | ≲* r_{n,d} | polytope | [26]
R_n(f̂_n(𝒞^B(Ω)), 𝒞^B(Ω)) | n^{-2/(d+1)} | ball | [32]

Table 4: Existing LSE rates in the loss (4) for random design (for the last row, we take d ≥ 2). Here ≲* refers to an upper bound with multiplicative factors that are logarithmic in n. We prove, in this paper, that the rate r_{n,d} is actually tight for each of the first three rows.

The upper bound of r_{n,d} in the first three rows of Table 4 is derived from standard upper bounds on the performance of LSEs [7, 47], which state that the LSE risk is bounded from above by δ_n² where δ_n solves the equation:

√n δ² ≍ ∫_{δ²}^{δ} √(log N_{[ ]}(ε, ℱ, ℓ_ℙ)) dε.  (7)

Here N_{[ ]}(ε, ℱ, ℓ_ℙ) denotes the ε-bracketing number of ℱ under the ℓ_ℙ metric, which is defined as the smallest number of ε-brackets:

[f̲, f̄] := {g : f̲(x) ≤ g(x) ≤ f̄(x) for all x}  with ℓ_ℙ(f̲, f̄) ≤ ε

needed to cover ℱ. The logarithm of N_{[ ]}(ε, ℱ, ℓ_ℙ) is the ε-bracketing entropy of ℱ. Gao and Wellner [19] (see also Doss [15]) proved bracketing entropy bounds for the convex function classes; these are given in Table 3 and coincide with the metric entropy rates. Plugging in ε^{-d/2} for the bracketing entropy in (7) and solving the resulting equation in δ leads to the rate r_{n,d} (although 𝒞_L(Ω) has infinite entropy, results for this class can be derived by working instead with 𝒞_L^L(Ω), which has entropy ε^{-d/2}). The split in r_{n,d} between d ≤ 4 and d ≥ 5 occurs because ∫_{δ²}^{δ} ε^{-d/4} dε depends differently on δ for d ≤ 4 compared to d ≥ 5. The same dimension-dependent split in the upper bounds for the rate can be seen in [7] and [47, Chapter 9] for certain other LSEs on function classes with smoothness restrictions.
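The dimension split can be seen numerically: solving (7) with bracketing entropy ε^{-d/2} by bisection, the fitted exponent of δ² in n matches −4/(d+4) for d < 4 (integral dominated by its upper limit) and −2/d for d > 4 (dominated by its lower limit). An illustrative check (ours, using the closed form of the integral for d ≠ 4):

```python
# Solve sqrt(n) * delta^2 = \int_{delta^2}^{delta} eps^(-d/4) d eps by
# bisection and estimate the exponent of delta^2 in n from two large
# sample sizes; compare with -4/(d+4) (d < 4) and -2/d (d > 4).
import math

def J(delta, d):
    p = 1.0 - d / 4.0                      # closed form, valid for d != 4
    return (delta**p - (delta**2) ** p) / p

def solve_delta(n, d):
    lo, hi = 1e-9, 0.999                   # LHS < RHS at lo, LHS > RHS at hi
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if math.sqrt(n) * mid**2 > J(mid, d):
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

slopes = {}
for d in (2, 6):
    n1, n2 = 1e10, 1e14
    slopes[d] = math.log(solve_delta(n2, d) ** 2 / solve_delta(n1, d) ** 2) \
        / math.log(n2 / n1)

assert abs(slopes[2] - (-4.0 / 6.0)) < 0.02   # d = 2: rate n^(-4/(d+4))
assert abs(slopes[6] - (-2.0 / 6.0)) < 0.02   # d = 6: rate n^(-2/d)
```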

It is interesting to note that the last row in Table 4 (corresponding to the case when Ω is the unit ball) does not have any dimension-dependent split in the rate (it equals n^{-2/(d+1)} for every d ≥ 2). This result is due to Kur et al. [32]. More details on this case are given in Subsection 5.1.

We are now ready to describe the main results of this paper. Tables 2 and 4 together indicate that the minimax rate for the first three rows equals n^{-4/(d+4)}, which also coincides with r_{n,d} for d ≤ 4. This immediately implies minimax optimality (up to log factors) of the LSE over each function class in the first three rows of Table 2 when d ≤ 4. However, for d ≥ 5, there is a significant gap between the minimax rate n^{-4/(d+4)} and r_{n,d} = n^{-2/d}. The question of whether this gap is real (implying suboptimality of the LSEs) or merely an artifact of some loose argument in the derived upper bounds was previously open. We settle this question in this paper by proving that the LSE for each function class in the first three rows of Table 2 is minimax suboptimal for d ≥ 5. More precisely, in Theorem 3.1, we prove that n^{-2/d} (log n)^{-4(d+1)/d} is a lower bound (up to a multiplicative factor independent of n) on each supremum risk in the first three rows of Table 4 for d ≥ 5.

Table 5 summarizes the minimax rates and the rates exhibited by the LSEs (along with minimax optimality and suboptimality of the LSE) in our random design setting.

Class ℱ | Assumption on Ω | Minimax Rate | Supremum LSE risk R_n(f̂_n(ℱ), ℱ) (up to logs) | Minimax optimality of f̂_n(ℱ) (up to logs)
𝒞_L^B(Ω) | any convex body | n^{-4/(d+4)} | n^{-4/(d+4)} for d ≤ 4; n^{-2/d} for d ≥ 5 | optimal for d ≤ 4; suboptimal for d ≥ 5
𝒞_L(Ω) | any convex body | n^{-4/(d+4)} | n^{-4/(d+4)} for d ≤ 4; n^{-2/d} for d ≥ 5 | optimal for d ≤ 4; suboptimal for d ≥ 5
𝒞^B(Ω) | polytope | n^{-4/(d+4)} | n^{-4/(d+4)} for d ≤ 4; n^{-2/d} for d ≥ 5 | optimal for d ≤ 4; suboptimal for d ≥ 5
𝒞^B(Ω) | ball (d ≥ 2) | n^{-2/(d+1)} | n^{-2/(d+1)} | optimal for d ≥ 2

Table 5: Summary of the minimax and LSE rates, and optimality/suboptimality of the LSE, in our random design setting

Our proof of minimax suboptimality for the LSE is based on the following idea. Suppose f_0(x) := ‖x‖² and let f̃_k be a piecewise affine convex approximation to f_0 with k affine pieces, as described in the statement of Lemma 3.2. It is then well known (see e.g., [4, Lemma 4.1] or [8]) that the ℓ_ℙ distance between f_0 and f̃_k is of order k^{-2/d}. If we now set k to be of order √n, then ℓ_ℙ(f̃_k, f_0) ∼ n^{-1/d}, or equivalently ℓ_ℙ²(f̃_k, f_0) ∼ n^{-2/d}. Note that n^{-2/d} is much larger than the minimax rate n^{-4/(d+4)} for d ≥ 5. We study the behavior of each convex LSE f̂_n (in the first three rows of Table 2) when the true function is f̃_k for k ∼ √n, and show (in Theorem 3.1) that 𝔼 ℓ_ℙ²(f̂_n, f̃_k) is bounded below by a constant multiple of n^{-2/d} (log n)^{-4(d+1)/d}. In other words, we prove that the LSEs are at a squared distance from the true function f̃_k that is much larger than the minimax rate when d ≥ 5. Our proof techniques are outlined in Section 4. We are unable to say definitively why the LSEs behave this way when the true function is piecewise affine with √n pieces. We believe it is due to overfitting: the LSEs probably have many more affine pieces than k ∼ √n in this case and are actually closer to the quadratic function f_0.
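The k^{-2/d} approximation rate is easy to visualize in d = 1, where it becomes k^{-2}: the maximum of k tangent lines to x ↦ x² is a convex piecewise affine function with k pieces, and doubling k roughly quarters the L² error. A small illustrative check (our construction, not the one in Lemma 3.2):

```python
# Approximate f0(x) = x^2 on [-1, 1] by the maximum of k tangent lines
# (a convex piecewise-affine function with k pieces) and check that the
# L2 error decays like k^(-2), i.e. k^(-2/d) with d = 1.
import numpy as np

def l2_error(k, m=50_000):
    t = np.linspace(-1, 1, k)                            # tangency points
    x = np.linspace(-1, 1, m)                            # evaluation grid
    tangents = 2 * t[:, None] * x[None, :] - (t**2)[:, None]  # x -> 2tx - t^2
    f_tilde = tangents.max(axis=0)                       # convex, k affine pieces
    return np.sqrt(np.mean((x**2 - f_tilde) ** 2))

e20, e40 = l2_error(20), l2_error(40)
assert e40 < e20
assert 3.5 < e20 / e40 < 4.5     # halving the spacing quarters the error
```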

It has previously been observed that LSEs and related Empirical Risk Minimization procedures can be suboptimal for function classes with large entropy integrals. For example, [7, Section 4] designed certain pathological large function classes where the LSE is provably minimax suboptimal. The fact that minimax suboptimality also holds for natural function classes such as 𝒞_L^B(Ω) and 𝒞_L(Ω) (for d ≥ 5) is noteworthy.

In contrast to our suboptimality results, the last rows of Tables 4 and 5 correspond to the class 𝒞^B(𝔅_d) (here 𝔅_d is the unit ball), for which the LSE is minimax optimal for all d ≥ 2 (this result is due to Kur et al. [32]). As already mentioned, 𝒞^B(𝔅_d) has much larger metric entropy (compared to 𝒞^B(Ω) for polytopal Ω) that is driven by the curvature of 𝔅_d; one can understand this from the proofs of the metric entropy lower bounds on 𝒞^B(𝔅_d) and 𝒞^B(Ω) for polytopal Ω in [19] (specifically see [19, proof of Theorem 1(i) and proof of Theorem 4]). High-level intuition on the differences between the ball and polytopal cases is given in Subsection 5.1. Isotonic regression is another shape-constrained regression problem where the LSE is minimax optimal for all d (see Han et al. [25]). The class of coordinatewise monotone functions on [0,1]^d is similar to 𝒞^B(𝔅_d) in that its metric entropy is driven by well-separated subsets of the domain [0,1]^d (see [18, Proof of Proposition 2.1]). Other examples of such classes where the LSE is optimal for all dimensions can be found in Han [24].

1.1 Fixed-design results for the unrestricted convex LSE

So far we have not discussed any rates of convergence for the Unrestricted Convex LSE f̂_n(𝒞(Ω)). No such results exist in the literature; it appears difficult to prove them in the random design setting, in part because the general random-design theory of Empirical Risk Minimization is largely restricted to uniformly bounded function classes. In this paper, we prove rates for f̂_n(𝒞(Ω)) in a specific fixed-design setting. We now assume that X_1, …, X_n form a fixed regular rectangular grid intersected with Ω, and the loss function is

ℓ_{ℙ_n}²(f, g) := ∫ (f − g)² dℙ_n = (1/n) Σ_{i=1}^n (f(X_i) − g(X_i))²  (8)

with ℙ_n being the (non-random) empirical distribution of X_1, …, X_n. This setting is also quite standard in nonparametric function estimation (see e.g., [40]). We only work with polytopal Ω. Fixed-design rates for the unrestricted convex LSE when Ω is non-polytopal are likely more complicated (as indicated in Subsection 5.2) and are not addressed in this paper.

In this setting, we are able to prove uniform rates of convergence for f̂_n(𝒞(Ω)) over a function class that is larger than the function classes considered so far. This function class is given by:

ℱ^𝔏(Ω) := {f_0 convex on Ω : inf_{g ∈ 𝒜(Ω)} ℓ_{ℙ_n}(f_0, g) ≤ 𝔏}  (9)

where 𝒜(Ω) denotes the class of all affine functions on Ω, and 𝔏 is a fixed positive constant. It can be easily shown (see Section 3.2 for details) that, as a function class, ℱ^𝔏(Ω) is larger than both 𝒞^𝔏(Ω) and 𝒞_𝔏(Ω), which means that f_0 ∈ ℱ^𝔏(Ω) is a weaker assumption than uniform boundedness (f_0 ∈ 𝒞^𝔏(Ω)) and also than uniform Lipschitzness (f_0 ∈ 𝒞_𝔏(Ω)). Further, ℱ^𝔏(Ω) satisfies the natural invariance property: f_0 ∈ ℱ^𝔏(Ω) if and only if f_0 − g_0 ∈ ℱ^𝔏(Ω) for every affine function g_0. This invariance is obviously also satisfied by 𝒞(Ω) but not by the other classes in Table 1.
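One half of this inclusion is immediate to check numerically: if f_0 is uniformly bounded by 𝔏, then the affine function g ≡ 0 already achieves ℓ_{ℙ_n}(f_0, g) ≤ 𝔏, so the best affine fit in (9) can only do better. An illustrative sanity check (ours) in d = 1 with f_0(x) = |x|, which is convex and bounded by 1:

```python
# Check that a convex f0 bounded by L = 1 belongs to F^L(Omega) as in (9):
# the least squares affine fit on a fixed design grid has l_{P_n} error at
# most that of the particular affine function g = 0, hence at most L.
import numpy as np

L_const = 1.0
x = np.linspace(-1, 1, 500)                  # fixed design grid on Omega = [-1, 1]
f0 = np.abs(x)                               # convex, |f0| <= 1 = L_const
A = np.column_stack([np.ones_like(x), x])    # affine functions a + b*x
coef, *_ = np.linalg.lstsq(A, f0, rcond=None)
resid = np.sqrt(np.mean((f0 - A @ coef) ** 2))   # inf over affine g of l_{P_n}(f0, g)
assert resid <= np.sqrt(np.mean(f0**2)) + 1e-12  # no worse than g = 0
assert resid <= L_const                          # hence f0 is in F^L(Omega)
```

The reverse direction fails: ℱ^𝔏(Ω) also contains unbounded, non-Lipschitz convex functions that happen to be close to an affine function in ℓ_{ℙ_n}, which is why it is strictly larger.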

In the aforementioned fixed-design setting, we prove, in Theorem 3.3, that

sup_{f ∈ ℱ^𝔏(Ω)} 𝔼_f ℓ_{ℙ_n}²(f̂_n(𝒞(Ω)), f)  (10)

is bounded by the rate r_{n,d} (see (6)) up to logarithmic factors. Thus the unrestricted convex LSE achieves the rate r_{n,d} under the assumption that the true function f_0 ∈ ℱ^𝔏(Ω). That this holds under the weaker assumption f_0 ∈ ℱ^𝔏(Ω) can be considered an advantage of the unrestricted convex LSE over the Bounded or Lipschitz Convex LSEs, which require the stronger assumptions f_0 ∈ 𝒞^𝔏(Ω) or f_0 ∈ 𝒞_𝔏(Ω).

The minimax rate for ℱ^𝔏(Ω) in the fixed-design setting is defined by

ℜ_n^fixed(ℱ^𝔏(Ω)) := inf_{f̆_n} sup_{f ∈ ℱ^𝔏(Ω)} 𝔼_f ℓ_{ℙ_n}²(f̆_n, f)  (11)

where ℓ_{ℙ_n} is the loss function (8). In Proposition 2.2, we prove that ℜ_n^fixed(ℱ^𝔏(Ω)) equals n^{-4/(d+4)} (up to log factors) for all d ≥ 1. This implies that the unrestricted convex LSE is, up to log factors, minimax optimal for d ≤ 4 for the class ℱ^𝔏(Ω) (and consequently also for the smaller classes 𝒞_𝔏(Ω) and 𝒞^𝔏(Ω)).

For d ≥ 5, we prove, in Theorem 3.4, that there exists f_0 ∈ 𝒞_L^L(Ω) for which the risk of f̂_n is bounded from below by n^{-2/d} (log n)^{-4(d+1)/d}. This proves minimax suboptimality of the unrestricted convex LSE for d ≥ 5 over the class 𝒞_L^L(Ω), and consequently also over the larger function classes 𝒞_L(Ω), 𝒞^L(Ω) and ℱ^L(Ω) (it is helpful to note here that all these function classes have, up to a logarithmic factor, the same minimax rate because of Proposition 2.2 and Proposition 2.3).

In addition to these results, we also prove that the rate of convergence of the unrestricted convex LSE can be faster than r_{n,d} when f_0 is piecewise affine. Specifically, we prove that when f_0 is a piecewise affine convex function on Ω with k affine pieces, the risk of f̂_n is bounded from above by

a_{k,n,d} := k/n for d ≤ 4,  and  a_{k,n,d} := (k/n)^{4/d} for d ≥ 5  (12)

up to logarithmic factors. When k is not too large, a_{k,n,d} is smaller than r_{n,d} (which bounds (10)). Specifically, when d ≤ 4, we have a_{k,n,d} = k/n ≤ r_{n,d} = n^{-4/(d+4)} for k ≤ n^{d/(d+4)}, and when d ≥ 5, we have a_{k,n,d} = (k/n)^{4/d} ≤ r_{n,d} = n^{-2/d} for k ≤ √n. For such k, we say that the unrestricted convex LSE adapts to the piecewise affine convex function f_0. In the random design setting, adaptive rates for the bounded convex LSE can be found in [26], with a suboptimal dependence on k.
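The thresholds on k in this comparison are a matter of exponent arithmetic and can be checked directly. An illustrative verification (ours) of both regimes of (12) against (6):

```python
# Check the thresholds below which the adaptive rate a_{k,n,d} of (12)
# improves on the worst-case rate r_{n,d} of (6): k <= n^(d/(d+4)) for
# d <= 4 and k <= sqrt(n) for d >= 5.
def r_rate(n, d):
    return n ** (-4.0 / (d + 4)) if d <= 4 else n ** (-2.0 / d)

def a_rate(k, n, d):
    return k / n if d <= 4 else (k / n) ** (4.0 / d)

n = 10**6
for d in (2, 3, 4):
    k_star = int(n ** (d / (d + 4.0)))     # boundary of the parametric regime
    assert a_rate(k_star, n, d) <= r_rate(n, d) * (1 + 1e-9)
for d in (5, 8):
    k_star = int(n**0.5)                   # boundary k = sqrt(n)
    assert a_rate(k_star, n, d) <= r_rate(n, d) * (1 + 1e-9)
```

At the boundary k ∼ √n with d ≥ 5, the two rates coincide at n^{-2/d}, which is exactly the connection to the suboptimality results drawn in the next paragraph of the text.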

It is especially interesting that the adaptive rate a_{k,n,d} switches from being parametric for d ≤ 4 to a slower nonparametric rate for d ≥ 5. For d ≥ 5, we prove a lower bound showing that the adaptive rate (k/n)^{4/d} cannot be improved for k ≲ √n. Specifically, in Theorem 3.6, we prove, assuming d ≥ 5, that for the piecewise affine convex function f̃_k given by Lemma 3.2, the rate of the unrestricted convex LSE is bounded from below by a_{k,n,d} (log n)^{-4(d+1)/d} for every k ≲ √n. This lower bound is related to the minimax suboptimality of f̂_n(𝒞(Ω)) because when k ∼ √n, the rate (k/n)^{4/d} becomes n^{-2/d}.

1.2 Paper Outline

The rest of this paper is organized as follows. Minimax rate results for convex regression are summarized in Section 2. These minimax rates are useful for gauging minimax optimality and suboptimality of LSEs in convex regression. Rates of convergence of the LSEs are given in Section 3. Subsection 3.1 contains the main minimax suboptimality result for the LSEs over the three classes 𝒞_L^B(Ω), 𝒞^B(Ω) and 𝒞_L(Ω) in the random design setting. Subsection 3.2 contains results for the unrestricted convex LSE in fixed design. Section 4 provides a sketch of the main ideas and ingredients behind the proofs of the LSE rate results in Section 3. Proofs of all results can be found in the Appendices: Appendix A contains proofs of all results in Section 2, Appendix B contains proofs of all results in Subsection 3.1 (and the auxiliary results stated in Subsection 4.3), and Appendix C contains proofs of all results in Subsection 3.2 (and the auxiliary results stated in Subsection 4.4). The paper ends with a discussion section (Section 5) where some issues naturally connected to our results are discussed.

1.3 A note on constants underlying our rate results

Our rate results often involve various constants which are generically denoted by c or C. These constants never depend on the sample size n, but they often depend on various other parameters of the problem (such as the dimension d, the Lipschitz constant L, the uniform bound B, the number of facets of the polytopal domain Ω, etc.). We have tried to indicate the parameters that each constant depends on explicitly using notation such as C_{d,B,L}, but we have not attempted to specify the precise form of this dependence. Please note that the value of these constants may change from occurrence to occurrence.

2 Minimax Rates for Convex Regression

In this section, we provide more details for the minimax rate results mentioned in the Introduction.

2.1 Random Design

Minimax rates for convex regression in random design are given in Table 2. The first result is for the class 𝒞_L^B(Ω) and, as indicated in Table 2, the minimax rate of n^{-4/(d+4)} for 𝒞_L^B(Ω) was stated and proved in [4, Theorem 4.1]. However, they additionally assumed that Ω is a rectangle, which is unnecessary: the same proof applies to every convex body Ω satisfying (3). This is because the metric entropy bound of ε^{-d/2} holds for 𝒞_L^B(Ω) for every convex body Ω satisfying (3), as can be readily seen from the proof of [9, Theorem 6].

The next minimax result is for 𝒞_L(Ω). Here also the rate is n^{-4/(d+4)} and it holds for every convex body Ω satisfying (3). The upper bound here is slightly nontrivial because the class 𝒞_L(Ω) has infinite metric entropy, as it contains all constant functions. We could not find a reference for this result in the literature, so we state it below and include a proof in Appendix A.

Proposition 2.1.

For every Ω satisfying (3), we have

c_{d,L,σ} n^{-4/(d+4)} ≤ ℜ_n^random(𝒞_L(Ω)) ≤ C_{d,L,σ} n^{-4/(d+4)}  (13)

where c_{d,L,σ} and C_{d,L,σ} are positive constants depending only on d, L and σ.

Next are the results for 𝒞^B(Ω), where the minimax rates change depending on whether Ω is polytopal or not. When Ω is a polytope, the minimax rate is n^{-4/(d+4)} for all d ≥ 1, and when Ω is a ball, the rate is n^{-2/(d+1)} for d ≥ 2. These results (whose proofs can be found in [26]) can be derived as a consequence of the metric entropy rates in Table 3 (ε^{-d/2} and ε^{-(d-1)} respectively) by solving the equation (5).

2.2 Fixed Design

For fixed design, we are able to prove results mainly when Ω\Omega is a polytope (see Subsection 5.2 for some remarks on fixed-design results for non-polytopal Ω\Omega). Specifically, we assume that

Ω:={xd:aiviTxbi for i=1,,F}\Omega:=\left\{x\in{\mathbb{R}}^{d}:a_{i}\leq v_{i}^{T}x\leq b_{i}\text{ for }i=1,\dots,F\right\} (14)

for F1F\geq 1, unit vectors v1,,vFv_{1},\dots,v_{F} and real numbers a1,,aF,b1,,bFa_{1},\dots,a_{F},b_{1},\dots,b_{F}. The number FF is assumed to be bounded from above by a constant depending on dd alone. As in the rest of the paper, we also assume (3).

The design points X1,,XnX_{1},\dots,X_{n} are assumed to form a fixed regular rectangular grid in Ω\Omega and Y1,,YnY_{1},\dots,Y_{n} are generated according to (1). Specifically, for δ>0\delta>0, let

𝒮:={(k1δ,,kdδ):ki,1id}\mathcal{S}:=\left\{(k_{1}\delta,\dots,k_{d}\delta):k_{i}\in\mathbb{Z},1\leq i\leq d\right\} (15)

denote the regular dd-dimensional δ\delta-grid in d{\mathbb{R}}^{d}. We assume that X1,,XnX_{1},\dots,X_{n} are an enumeration of the points in 𝒮Ω\mathcal{S}\cap\Omega with nn denoting the cardinality of 𝒮Ω\mathcal{S}\cap\Omega. By the usual volumetric argument and assumption (3), there exists a small enough constant κd>0\kappa_{d}>0 such that whenever 0<δκd0<\delta\leq\kappa_{d}, we have

2cdδdnCdδd2\leq c_{d}\delta^{-d}\leq n\leq C_{d}\delta^{-d} (16)

for dimensional constants cdc_{d} and CdC_{d}. We have included a proof of the above claim in Lemma C.2. Throughout, we assume δκd\delta\leq\kappa_{d} so that the above inequality holds. The following result (proved in Appendix A) gives an upper bound on the minimax risk (11) over the class 𝔏(Ω){\cal F}^{{\mathfrak{L}}}(\Omega) defined in (9).
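Before stating it, we note that the cardinality claim (16) is easy to verify numerically. The following minimal sketch (illustrative only, and not part of the formal development) takes Ω\Omega to be the unit ball in dimension d=2d=2 and counts the points of the δ\delta-grid (15) falling in Ω\Omega; the helper name `grid_points_in_ball` is ours, not the paper's.

```python
import itertools
import math

def grid_points_in_ball(delta, d):
    """Count points of the regular delta-grid (15) inside the unit ball in R^d."""
    kmax = int(math.floor(1.0 / delta))
    count = 0
    for ks in itertools.product(range(-kmax, kmax + 1), repeat=d):
        # grid point (k_1*delta, ..., k_d*delta); keep it if it lies in the ball
        if sum((k * delta) ** 2 for k in ks) <= 1.0:
            count += 1
    return count

d, delta = 2, 0.05
n = grid_points_in_ball(delta, d)
# n should be comparable to vol(ball) * delta^{-d} = pi * 400 ~ 1257,
# consistent with c_d * delta^{-d} <= n <= C_d * delta^{-d} in (16)
print(n, delta ** (-d))
```

Here the constants cd,Cdc_{d},C_{d} absorb the volume of Ω\Omega and the boundary effect, which is of lower order δ(d1)\delta^{-(d-1)}.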

Proposition 2.2.

For every Ω\Omega of the form (14) and satisfying (3), we have

nfixed(𝔏(Ω))C𝔏,σn4/(d+4)(cdlogn)4F/(d+4).{\mathfrak{R}}_{n}^{\mathrm{fixed}}({\cal F}^{{\mathfrak{L}}}(\Omega))\leq C_{{\mathfrak{L}},\sigma}n^{-4/(d+4)}(c_{d}\log n)^{4F/(d+4)}. (17)

The class 𝔏(Ω){\cal F}^{{\mathfrak{L}}}(\Omega) is larger than both 𝒞𝔏(Ω){\mathcal{C}}^{{\mathfrak{L}}}(\Omega) and 𝒞𝔏(Ω){\mathcal{C}}_{{\mathfrak{L}}}(\Omega) which means that the bound in (17) also holds for nfixed(){\mathfrak{R}}_{n}^{\mathrm{fixed}}({\cal F}) for =𝒞𝔏(Ω){\cal F}={\mathcal{C}}^{{\mathfrak{L}}}(\Omega) or 𝒞𝔏(Ω){\mathcal{C}}_{{\mathfrak{L}}}(\Omega) or 𝒞𝔏𝔏(Ω){\mathcal{C}}^{{\mathfrak{L}}}_{{\mathfrak{L}}}(\Omega). To see why 𝔏(Ω){\cal F}^{{\mathfrak{L}}}(\Omega) is larger than 𝒞𝔏(Ω){\mathcal{C}}^{{\mathfrak{L}}}(\Omega), just note that (by taking g0g\equiv 0)

infg𝒜(Ω)n(f0,g)supxΩ|f0(x)|.\inf_{g\in{\mathcal{A}}(\Omega)}\ell_{{\mathbb{P}}_{n}}(f_{0},g)\leq\sup_{x\in\Omega}|f_{0}(x)|.

To see why 𝔏(Ω){\cal F}^{{\mathfrak{L}}}(\Omega) is also larger than 𝒞𝔏(Ω){\mathcal{C}}_{{\mathfrak{L}}}(\Omega), note that (taking g(x):=f0(0)g(x):=f_{0}(0) for the origin 0; the origin belongs to Ω\Omega because of (3))

infg𝒜(Ω)n2(f0,g)\displaystyle\inf_{g\in{\mathcal{A}}(\Omega)}\ell^{2}_{{\mathbb{P}}_{n}}(f_{0},g) 1ni=1n(f0(Xi)f0(0))2\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left(f_{0}(X_{i})-f_{0}(0)\right)^{2}
[supxy|f0(x)f0(y)|xy]21ni=1nXi2[supxy|f0(x)f0(y)|xy]2\displaystyle\leq\left[\sup_{x\neq y}\frac{|f_{0}(x)-f_{0}(y)|}{\|x-y\|}\right]^{2}\frac{1}{n}\sum_{i=1}^{n}\|X_{i}\|^{2}\leq\left[\sup_{x\neq y}\frac{|f_{0}(x)-f_{0}(y)|}{\|x-y\|}\right]^{2}

where the last inequality is true because each Xi1\|X_{i}\|\leq 1 as Ω𝔅d\Omega\subseteq{\mathfrak{B}_{d}}.

The following minimax lower bound (proved in Appendix A) complements the upper bound in Proposition 2.2; it applies to the smaller function class 𝒞LL(Ω){\mathcal{C}}_{L}^{L}(\Omega), and the domain Ω\Omega need not be polytopal for this result.

Proposition 2.3.

For every Ω\Omega satisfying (3), we have

nfixed(𝒞LL(Ω))cd,σn4/(d+4).{\mathfrak{R}}_{n}^{\mathrm{fixed}}({\mathcal{C}}^{L}_{L}(\Omega))\geq c_{d,\sigma}n^{-4/(d+4)}. (18)

provided LCdL\geq C_{d}.

Combining the above two results, we conclude that in fixed design, for Ω\Omega of the form (14) satisfying (3), the minimax rate for each class 𝒞LL(Ω),𝒞L(Ω),𝒞L(Ω),L(Ω){\mathcal{C}}_{L}^{L}(\Omega),{\mathcal{C}}_{L}(\Omega),{\mathcal{C}}^{L}(\Omega),{\cal F}^{L}(\Omega) equals n4/(d+4)n^{-4/(d+4)} up to logarithmic multiplicative factors (this is because 𝒞LL(Ω){\mathcal{C}}_{L}^{L}(\Omega) is the smallest of these classes for which the lower bound (18) holds and L(Ω){\cal F}^{L}(\Omega) is the largest of these classes for which the upper bound (17) holds).

3 LSE Rates for Convex Regression

3.1 Random Design

In this section, we state our main result that the supremum risk appearing in the first three rows of Table 4 is bounded from below by n2/dn^{-2/d} (up to logarithmic factors) for d5d\geq 5. As n2/dn^{-2/d} is strictly larger than the minimax rate of n4/(d+4)n^{-4/(d+4)} for d5d\geq 5, these results prove minimax suboptimality, for d5d\geq 5, of the LSE f^n()\hat{f}_{n}({\cal F}) over the class {\cal F}, for {\cal F} equaling 𝒞LB(Ω){\mathcal{C}}_{L}^{B}(\Omega) or 𝒞L(Ω){\mathcal{C}}_{L}(\Omega) for any Ω\Omega, and also for {\cal F} equaling 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) for polytopal Ω\Omega.

Theorem 3.1.

Fix d5d\geq 5 and let Ω\Omega be a convex body satisfying (3). The following inequality holds when {\cal F} is either 𝒞LB(Ω){\mathcal{C}}_{L}^{B}(\Omega) for BLB\geq L, or 𝒞L(Ω){\mathcal{C}}_{L}(\Omega):

supf𝔼f2(f^n(),f)cdσLn2/d(logn)4(d+1)/d\sup_{f\in{\cal F}}{\mathbb{E}}_{f}\ell^{2}_{{\mathbb{P}}}(\hat{f}_{n}({\cal F}),f)\geq c_{d}\sigma Ln^{-2/d}(\log n)^{-4(d+1)/d} (19)

provided nNd,σ/Ln\geq N_{d,\sigma/L}. Additionally, when Ω\Omega is a polytope whose number of facets is bounded by a constant depending on dd alone, the same inequality holds for =𝒞B(Ω){\cal F}={\mathcal{C}}^{B}(\Omega), i.e.,

supf𝒞B(Ω)𝔼f2(f^n(𝒞B(Ω)),f)cdσBn2/d(logn)4(d+1)/d\sup_{f\in{\mathcal{C}}^{B}(\Omega)}{\mathbb{E}}_{f}\ell^{2}_{{\mathbb{P}}}(\hat{f}_{n}({\mathcal{C}}^{B}(\Omega)),f)\geq c_{d}\sigma Bn^{-2/d}(\log n)^{-4(d+1)/d} (20)

for nNd,σ/Bn\geq N_{d,\sigma/B}.

We prove the above theorem via supf𝔼f2(f^n(),f)𝔼f~2(f^n(),f~)\sup_{f\in{\cal F}}{\mathbb{E}}_{f}\ell^{2}_{{\mathbb{P}}}(\hat{f}_{n}({\cal F}),f)\geq{\mathbb{E}}_{\tilde{f}}\ell^{2}_{{\mathbb{P}}}(\hat{f}_{n}({\cal F}),\tilde{f}) for a specific piecewise affine convex function f~\tilde{f} with n\sqrt{n} affine pieces. This function f~\tilde{f} will be a piecewise affine approximation to the quadratic f0(x):=x2f_{0}(x):=\|x\|^{2}. The properties of f~\tilde{f} that we need along with the existence of f~\tilde{f} satisfying those properties are given in the next result. While this result is stated for arbitrary kk, we only need it for knk\sim\sqrt{n} for proving Theorem 3.1. Recall that a dd-simplex Δ\Delta in d{\mathbb{R}}^{d} is the convex hull of d+1d+1 affinely independent points. It is well-known that dd-simplices can be represented as the intersection of d+1d+1 halfspaces.

Lemma 3.2.

Suppose Ω\Omega is a convex body satisfying (3). Let f0(x):=x2f_{0}(x):=\|x\|^{2}. There exists a positive constant CdC_{d} (depending on dd alone) such that the following is true. For every k1k\geq 1, there exist mCdkm\leq C_{d}k dd-simplices Δ1,,ΔmΩ\Delta_{1},\dots,\Delta_{m}\subseteq\Omega and a convex function f~k\tilde{f}_{k} on Ω\Omega such that

  1.

    (1Cdk1/d)Ωi=1mΔiΩ(1-C_{d}k^{-1/d})\Omega\subseteq\cup_{i=1}^{m}\Delta_{i}\subseteq\Omega,

  2.

    ΔiΔj\Delta_{i}\cap\Delta_{j} is contained in a facet of Δi\Delta_{i} and a facet of Δj\Delta_{j} for each iji\neq j,

  3.

    f~k\tilde{f}_{k} is affine on each Δi,i=1,,m\Delta_{i},i=1,\dots,m,

  4.

    supxΩ|f0(x)f~k(x)|Cdk2/d\sup_{x\in\Omega}|f_{0}(x)-\tilde{f}_{k}(x)|\leq C_{d}k^{-2/d},

  5.

    f~k𝒞CdCd(Ω)\tilde{f}_{k}\in{\mathcal{C}}_{C_{d}}^{C_{d}}(\Omega).

If, in addition, Ω\Omega is a polytope whose number of facets is bounded by a constant depending on dd alone, then the first condition above can be strengthened to Ω=i=1mΔi\Omega=\cup_{i=1}^{m}\Delta_{i}.
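The approximation rate k2/dk^{-2/d} in item 4 can be sanity-checked in dimension d=1d=1, where partitioning [1,1][-1,1] into kk intervals (11-simplices) and linearly interpolating f0(x)=x2f_{0}(x)=x^{2} incurs maximal error exactly h2/4=k2h^{2}/4=k^{-2} for mesh width h=2/kh=2/k. A minimal numerical sketch (the helper name `piecewise_affine_error` is ours):

```python
def piecewise_affine_error(k):
    """Max error of linear interpolation of f0(x)=x^2 on k equal pieces of [-1,1]."""
    h = 2.0 / k
    knots = [-1.0 + i * h for i in range(k + 1)]

    def interp(x):
        i = min(int((x + 1.0) / h), k - 1)
        a, b = knots[i], knots[i + 1]
        # line through (a, a^2) and (b, b^2) has slope a + b
        return a * a + (a + b) * (x - a)

    # for a quadratic, the worst error occurs at the midpoints and equals h^2/4,
    # so evaluating on a half-mesh grid captures the maximum
    xs = [-1.0 + j * (h / 2.0) for j in range(2 * k + 1)]
    return max(abs(interp(x) - x * x) for x in xs)

print(piecewise_affine_error(10))  # prints approximately 0.01 (= h^2/4 with h = 0.2)
```

Doubling kk quarters the error, consistent with the k2/dk^{-2/d} rate at d=1d=1.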

3.2 Results for the unrestricted convex LSE in fixed design

Here we work with the same setting as in Subsection 2.2 (in particular, note that Ω\Omega is polytopal of the form (14)). The following result shows that the unrestricted convex LSE f^n(𝒞(Ω))\hat{f}_{n}({\mathcal{C}}(\Omega)) achieves the rate rn,dr_{n,d} (defined in (6)) under the assumption f0𝔏(Ω)f_{0}\in{\cal F}^{{\mathfrak{L}}}(\Omega) for a positive constant 𝔏{\mathfrak{L}}. The number FF appearing in the bound (21) below is the number of parallel halfspaces or slabs defining Ω\Omega (see (14)) and this number is assumed to be bounded by a constant depending on dd alone.

Theorem 3.3.

For every 𝔏>0{\mathfrak{L}}>0 and σ>0\sigma>0, there exist constants CdC_{d} (depending on dd alone) and Nd,σ/𝔏N_{d,\sigma/{\mathfrak{L}}} (depending only on dd and σ/𝔏\sigma/{\mathfrak{L}}) such that

supf𝔏(Ω)𝔼fn2(f^n(𝒞(Ω)),f){Cd𝔏2d4+d(σ2n(logn)F)4d+4for d3C4σ𝔏n(logn)1+F2for d=4Cdσ𝔏((logn)Fn)2dfor d5\sup_{f\in{\cal F}^{{\mathfrak{L}}}(\Omega)}{\mathbb{E}}_{f}\ell_{{\mathbb{P}_{n}}}^{2}(\hat{f}_{n}({\mathcal{C}}(\Omega)),f)\leq\begin{cases}C_{d}{\mathfrak{L}}^{\frac{2d}{4+d}}\left(\frac{\sigma^{2}}{n}(\log n)^{F}\right)^{\frac{4}{d+4}}\text{for }d\leq 3\\ C_{4}\frac{\sigma{\mathfrak{L}}}{\sqrt{n}}(\log n)^{1+\frac{F}{2}}~{}~{}~{}\text{for }d=4\\ C_{d}\sigma{\mathfrak{L}}\left(\frac{(\log n)^{F}}{n}\right)^{\frac{2}{d}}~{}\text{for }d\geq 5\end{cases} (21)

for nNd,σ/𝔏n\geq N_{d,\sigma/{\mathfrak{L}}}.

The rates appearing on the right hand side of (21) coincide with rn,dr_{n,d} if we ignore multiplicative factors that are at most logarithmic in nn. Theorem 3.3 and the minimax lower bound given in Proposition 2.3 together imply that the unrestricted convex LSE is minimax optimal (up to log factors) over each class L(Ω),𝒞LL(Ω),𝒞L(Ω),𝒞L(Ω){\cal F}^{L}(\Omega),{\mathcal{C}}_{L}^{L}(\Omega),{\mathcal{C}}_{L}(\Omega),{\mathcal{C}}^{L}(\Omega) when the dimension d4d\leq 4. However for d5d\geq 5, there is a gap between the rate given by (21) and the minimax upper bound in Proposition 2.2. The following result shows that, for d5d\geq 5, the unrestricted convex LSE is indeed minimax suboptimal over 𝒞LL(Ω){\mathcal{C}}_{L}^{L}(\Omega) (or over the larger classes 𝒞L(Ω){\mathcal{C}}_{L}(\Omega), 𝒞L(Ω){\mathcal{C}}^{L}(\Omega), L(Ω){\cal F}^{L}(\Omega)).

Theorem 3.4.

Fix d5d\geq 5, L>0L>0 and σ>0\sigma>0. There exist constants cdc_{d} and Cd,σ/LC_{d,\sigma/L} such that

supf𝒞LL(Ω)𝔼fn2(f^n(𝒞(Ω)),f)cdσLn2d(logn)4(d+1)d\sup_{f\in{\mathcal{C}}^{L}_{L}(\Omega)}{\mathbb{E}}_{f}\ell_{{\mathbb{P}_{n}}}^{2}(\hat{f}_{n}({\mathcal{C}}(\Omega)),f)\geq c_{d}\sigma Ln^{-\frac{2}{d}}(\log n)^{-\frac{4(d+1)}{d}} (22)

for nCd,σ/Ln\geq C_{d,\sigma/L}.

The main idea for the proof of Theorem 3.4 comes from analyzing the rates of convergence of the unrestricted convex LSE when the true convex function is piecewise affine. Indeed, Theorem 3.4 is derived from Theorem 3.6 (stated later in this section) which provides a lower bound on the risk of f^n(𝒞(Ω))\hat{f}_{n}({\mathcal{C}}(\Omega)) for certain piecewise affine convex functions. We state Theorem 3.6 after the next result which provides upper bounds on the rate of convergence of f^n(𝒞(Ω))\hat{f}_{n}({\mathcal{C}}(\Omega)) for piecewise affine convex functions. This result shows that the unrestricted convex LSE exhibits adaptive behaviour to piecewise affine convex functions ff in the sense that the risk 𝔼fn2(f^n(𝒞(Ω)),f){\mathbb{E}}_{f}\ell_{{\mathbb{P}}_{n}}^{2}(\hat{f}_{n}({\mathcal{C}}(\Omega)),f) is much smaller than the right hand side of (21) for such ff. For k1k\geq 1 and h1h\geq 1, let k,h(Ω){\mathfrak{C}}_{k,h}(\Omega) denote all functions f𝒞(Ω)f\in{\mathcal{C}}(\Omega) for which there exist kk convex subsets Ω1,,Ωk\Omega_{1},\dots,\Omega_{k} satisfying the following properties:

  1.

    ff is affine on each Ωi\Omega_{i},

  2.

    each Ωi\Omega_{i} can be written as an intersection of at most hh slabs (i.e., as in (14) with F=hF=h), and

  3.

    Ω1𝒮,,Ωk𝒮\Omega_{1}\cap\mathcal{S},\dots,\Omega_{k}\cap\mathcal{S} are disjoint with i=1k(Ωi𝒮)=Ω𝒮\cup_{i=1}^{k}(\Omega_{i}\cap\mathcal{S})=\Omega\cap\mathcal{S} (recall that 𝒮\mathcal{S} is the regular rectangular grid (15)).

Theorem 3.5.

For every k1k\geq 1 and h1h\geq 1, we have

supfk,h(Ω)𝔼fn2(f^n(𝒞(Ω)),f){Cdσ2(kn)(logn)hfor d=1,2,3Cdσ2(kn)(logn)h+2for d=4Cdσ2(k(logn)hn)4/dfor d5\sup_{f\in{\mathfrak{C}}_{k,h}(\Omega)}{\mathbb{E}}_{f}\ell_{{\mathbb{P}_{n}}}^{2}(\hat{f}_{n}({\mathcal{C}}(\Omega)),f)\leq\begin{cases}C_{d}\sigma^{2}\left(\frac{k}{n}\right)(\log n)^{h}&\text{for }d=1,2,3\\ C_{d}\sigma^{2}\left(\frac{k}{n}\right)(\log n)^{h+2}&\text{for }d=4\\ C_{d}\sigma^{2}\left(\frac{k(\log n)^{h}}{n}\right)^{4/d}&\text{for }d\geq 5\end{cases} (23)

for a constant CdC_{d} depending on dd alone.

When hh is a constant (not depending on nn) and kk is not too large, the rates in (23) are of strictly smaller order than those in (21). More precisely, ignoring log factors, (23) is strictly smaller than (21) as long as kk is of smaller order than nd/(d+4)n^{d/(d+4)} for d4d\leq 4, and as long as kk is of smaller order than n\sqrt{n} for d5d\geq 5. The next result gives a lower bound which proves that the upper bound in (23) cannot be improved (up to log factors) for d5d\geq 5 for all kk up to n\sqrt{n}. More specifically, we prove the (k/n)4/d(k/n)^{4/d} lower bound for the piecewise affine convex function f~k\tilde{f}_{k} described in Lemma 3.2.

Theorem 3.6.

Fix d5d\geq 5. There exist positive constants cdc_{d} and NdN_{d} such that for nNdn\geq N_{d} and

1kmin(nσd/4,cdn),1\leq k\leq\min\left(\sqrt{n}\sigma^{-d/4},c_{d}n\right), (24)

we have

𝔼f~kn2(f^n(𝒞(Ω)),f~k)cdσ2(kn)4/d(logn)4(d+1)/d{\mathbb{E}}_{\tilde{f}_{k}}\ell_{{\mathbb{P}_{n}}}^{2}(\hat{f}_{n}({\mathcal{C}}(\Omega)),\tilde{f}_{k})\geq c_{d}\sigma^{2}\left(\frac{k}{n}\right)^{4/d}(\log n)^{-4(d+1)/d} (25)

where f~k\tilde{f}_{k} is the function from Lemma 3.2.

The lower bound given by (25) for k=nσd/4k=\sqrt{n}\sigma^{-d/4} is of the same order as that given by Theorem 3.4. In other words, Theorem 3.4 is a corollary of Theorem 3.6.

We reiterate that the results of this subsection (fixed-design risk bounds for the unrestricted convex LSE) hold when Ω\Omega is a polytope. A natural question is to extend these to the case where Ω\Omega is a smooth convex body such as the unit ball. Based on the results of Kur et al. [32], who analyzed the LSE over 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) under random design, it is reasonable to conjecture that the unrestricted convex LSE will be minimax optimal in fixed design when the domain is the unit ball. However, it appears nontrivial to prove this, as the main ideas from [32], such as the reduction to expected suprema bounds over level sets, cannot be used in the absence of uniform boundedness.

4 Proof Sketches for Results in Section 3

In this section, we provide the main ideas and ingredients behind the proofs of the LSE rate results in Section 3. Further details and full proofs of the auxiliary results in this section can be found in Appendix B.

4.1 General LSE Accuracy result

The first step in proving all our LSE results is the following theorem due to Chatterjee [10], which provides a general strategy for bounding (from above as well as below) the accuracy of abstract LSEs over convex classes of functions. The original result applies to the fixed design case for every set of design points X1,,XnX_{1},\dots,X_{n}. In the random design case, we apply this result by conditioning on X1,,XnX_{1},\dots,X_{n} (see (30) and (31) below).

Theorem 4.1 (Chatterjee).

Consider data generated according to the model:

Yi=f(Xi)+ξifor i=1,,nY_{i}=f(X_{i})+\xi_{i}\qquad\text{for $i=1,\dots,n$}

where X1,,XnX_{1},\dots,X_{n} are fixed deterministic design points in a convex body 𝒳d{\mathcal{X}}\subseteq{\mathbb{R}}^{d}, ff belongs to a convex class of functions {\cal F} and ξ1,,ξni.i.dN(0,σ2)\xi_{1},\dots,\xi_{n}\overset{\text{i.i.d}}{\sim}N(0,\sigma^{2}). Consider the LSE f^n()\hat{f}_{n}({\cal F}) defined in (2). Define

tf():=argmaxt0Hf(t,)t_{f}({\cal F}):=\mathop{\rm argmax}_{t\geq 0}H_{f}(t,{\cal F})

where

Hf(t,):=𝔼supg:n(f,g)t1ni=1nξi(g(Xi)f(Xi))t22.H_{f}(t,{\cal F}):={\mathbb{E}}\sup_{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(f,g)\leq t}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-f(X_{i})\right)-\frac{t^{2}}{2}.

Then Hf(,)H_{f}(\cdot,{\cal F}) is a concave function on [0,)[0,\infty), tf()t_{f}({\cal F}) is unique, and the following pair of inequalities holds for positive constants cc and CC:

{0.5tf2()n2(f^n(),f)2tf2()}16exp(cntf2()σ2){\mathbb{P}}\left\{0.5t^{2}_{f}({\cal F})\leq\ell_{{\mathbb{P}_{n}}}^{2}(\hat{f}_{n}({\cal F}),f)\leq 2t^{2}_{f}({\cal F})\right\}\geq 1-6\exp\left(-\frac{cnt^{2}_{f}({\cal F})}{\sigma^{2}}\right) (26)

and

0.5tf2()Cσ2n𝔼n2(f^n(),f)2tf2()+Cσ2n.0.5t^{2}_{f}({\cal F})-\frac{C\sigma^{2}}{n}\leq{\mathbb{E}}\ell_{{\mathbb{P}_{n}}}^{2}(\hat{f}_{n}({\cal F}),f)\leq 2t^{2}_{f}({\cal F})+\frac{C\sigma^{2}}{n}. (27)

Upper bounds for tf()t_{f}({\cal F}) can be obtained via:

tf()inf{t>0:Hf(t,)0}t_{f}({\cal F})\leq\inf\left\{t>0:H_{f}(t,{\cal F})\leq 0\right\} (28)

and lower bounds for tf()t_{f}({\cal F}) can be obtained via:

tf()t1if 0t1<t0 are such that Hf(t1,)Hf(t0,).t_{f}({\cal F})\geq t_{1}\qquad\text{if $0\leq t_{1}<t_{0}$ are such that $H_{f}(t_{1},{\cal F})\leq H_{f}(t_{0},{\cal F})$}. (29)

Intuitively, Theorem 4.1 states that the loss n(f^n(),f)\ell_{{\mathbb{P}}_{n}}(\hat{f}_{n}({\cal F}),f) of the LSE is controlled by tf()t_{f}({\cal F}) for which bounds can be obtained using (28) and (29).
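To make this concrete, suppose the expected-supremum term in the definition of Hf(t,)H_{f}(t,{\cal F}) admits an upper bound φ(t)\varphi(t) with φ(t)/t\varphi(t)/t nonincreasing (here φ\varphi is our generic notation, not defined in the paper). A standard balancing computation, sketched below, then bounds the LSE risk:

```latex
% Let t^* solve \varphi(t^*) = (t^*)^2/2. Since \varphi(t)/t is
% nonincreasing, for every t \geq t^* we have
\[
H_{f}(t,{\cal F})
\leq \varphi(t)-\frac{t^{2}}{2}
\leq \frac{t\,\varphi(t^{*})}{t^{*}}-\frac{t^{2}}{2}
= \frac{t\,t^{*}}{2}-\frac{t^{2}}{2}
\leq 0,
\]
% so (28) gives t_f({\cal F}) \leq t^*, and (27) yields the risk bound
\[
{\mathbb{E}}\,\ell_{{\mathbb{P}}_{n}}^{2}\bigl(\hat{f}_{n}({\cal F}),f\bigr)
\leq 2(t^{*})^{2}+\frac{C\sigma^{2}}{n}.
\]
```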

Theorem 4.1 holds for the fixed design setting with no restriction on the design points which means that it also applies to the random design setting provided we condition on the design points X1,,XnX_{1},\dots,X_{n}. In particular, for random design, inequality (26) becomes:

{0.5tf2()n2(f^n(),f)2tf2()|X1,,Xn}16exp(cntf2()σ2)\begin{split}&{\mathbb{P}}\left\{0.5t^{2}_{f}({\cal F})\leq\ell_{{\mathbb{P}}_{n}}^{2}(\hat{f}_{n}({\cal F}),f)\leq 2t^{2}_{f}({\cal F})\bigg{|}X_{1},\dots,X_{n}\right\}\\ &\geq 1-6\exp\left(-\frac{cnt^{2}_{f}({\cal F})}{\sigma^{2}}\right)\end{split} (30)

where

tf():=argmaxt0Hf(t,)t_{f}({\cal F}):=\mathop{\rm argmax}_{t\geq 0}H_{f}(t,{\cal F}) (31)

with

Hf(t,):=𝔼[supg:n(f,g)t1ni=1nξi(g(Xi)f(Xi))|X1,,Xn]t22.H_{f}(t,{\cal F}):={\mathbb{E}}\left[\sup_{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(f,g)\leq t}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-f(X_{i})\right)\bigg{|}X_{1},\dots,X_{n}\right]-\frac{t^{2}}{2}. (32)

Here tf()t_{f}({\cal F}) is random as it depends on the random design points X1,,XnX_{1},\dots,X_{n}. Note that (30) applies to the loss n\ell_{{\mathbb{P}}_{n}} and not to \ell_{{\mathbb{P}}}.

4.2 High-level intuition for the proof of Theorem 3.1

Here we provide the main ideas behind the proof of Theorem 3.1 in a somewhat non-rigorous fashion. More details are provided in the remainder of this section. According to Theorem 4.1, the rate of convergence of the LSE over {\cal F} (when the true function is ff) is controlled by tf2()t_{f}^{2}({\cal F}) where tf()t_{f}({\cal F}) is the maximizer of Hf(t,)H_{f}(t,{\cal F}) defined in (32). Note that Hf(t,)H_{f}(t,{\cal F}) is given by an expected supremum term minus t2/2t^{2}/2. The expected supremum term can be seen as a measure of the complexity of the local region {g:n(f,g)t}\{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(f,g)\leq t\} around the true function ff.

For the proof of Theorem 3.1, we shall take the true function to be f~k\tilde{f}_{k} (given by Lemma 3.2) with knk\sim\sqrt{n}. The challenge is to show that the maximizer of Hf~k(t,)H_{\tilde{f}_{k}}(t,{\cal F}) is at least n1/d(logn)2(d+1)/dn^{-1/d}(\log n)^{-2(d+1)/d} for each function class {\cal F} in the statement of Theorem 3.1. For this, the first step is to prove that

Hf~k(t,)Cdtn1/d(logn)2(d+1)/d.\displaystyle H_{\tilde{f}_{k}}(t,{\cal F})\leq C_{d}tn^{-1/d}(\log n)^{2(d+1)/d}. (33)

for a large enough range of tt (see Lemma 4.2 and the discussion following it). This bound, which is linear in tt, can be seen as a bound on the complexity of the local region in {\cal F} around f~k\tilde{f}_{k}, and is proved by bounding the bracketing entropy numbers of local regions around f~k\tilde{f}_{k} (Theorem 4.5).

The second step is to prove the following lower bound on Hf~k(t,)H_{\tilde{f}_{k}}(t,{\cal F}):

suptHf~k(t,)cdn2/d\sup_{t}H_{\tilde{f}_{k}}(t,{\cal F})\geq c_{d}n^{-2/d} (34)

This is proved by lower bounding the metric entropy of a local region around f0(x):=x2f_{0}(x):=\|x\|^{2} (Lemma 4.6) and the fact that f~k\tilde{f}_{k} and f0f_{0} are within Cdk2/d=n1/dC_{d}k^{-2/d}=n^{-1/d} of each other (this is guaranteed by Lemma 3.2).

Combining (33) and (34), it is straightforward to see that

Hf~k(t,)Cdtn1/d(logn)2(d+1)/d<cdn2/dsuptHf~k(t,)\displaystyle H_{\tilde{f}_{k}}(t,{\cal F})\leq C_{d}tn^{-1/d}(\log n)^{2(d+1)/d}<c_{d}n^{-2/d}\leq\sup_{t}H_{\tilde{f}_{k}}(t,{\cal F})

for

t<(cd/Cd)n1/d(logn)2(d+1)/d,\displaystyle t<(c_{d}/C_{d})n^{-1/d}(\log n)^{-2(d+1)/d},

so that the maximizer of Hf~k(t,)H_{\tilde{f}_{k}}(t,{\cal F}) is (cd/Cd)n1/d(logn)2(d+1)/d\geq(c_{d}/C_{d})n^{-1/d}(\log n)^{-2(d+1)/d}.

It might be interesting to note that this proof will not work if we take the true function to be the smooth function f0(x):=x2f_{0}(x):=\|x\|^{2} instead of the piecewise affine approximation f~k\tilde{f}_{k}. For f0f_{0}, the upper bound (33) is no longer true; in fact Hf0(t,)H_{f_{0}}(t,{\cal F}) will be as large as cdn2/dc_{d}n^{-2/d} for tt of the order n2/dn^{-2/d}, which clearly violates the upper bound (33). Intuitively, this happens because local regions around f0f_{0} have more complexity than local regions around f~k\tilde{f}_{k}: the piecewise affine convex function f~k\tilde{f}_{k} has zero Hessian on each affine piece, so it is not easy to find many perturbations that keep the resulting function convex, whereas this is much easier to do with the smooth function f0f_{0}. We are unable to determine the rate of convergence of the LSEs when the true function is f0f_{0}. It can be anywhere between n4/dn^{-4/d} and n2/dn^{-2/d}, a range which includes the minimax rate n4/(d+4)n^{-4/(d+4)}.

4.3 Proof Sketch for Theorem 3.1

It is enough to prove (19) and (20) when LL and BB, respectively, are fixed constants depending on the dimension dd alone. From here, these inequalities for arbitrary LL and BB can be deduced by elementary scaling arguments. Let f0(x):=x2f_{0}(x):=\|x\|^{2} and let f~k\tilde{f}_{k} be as given by Lemma 3.2 for a fixed 1kn1\leq k\leq n. Below we assume that LL is a large enough constant depending on dd alone so that f~k𝒞LL(Ω)\tilde{f}_{k}\in{\mathcal{C}}^{L}_{L}(\Omega) and also that BLB\geq L. We assume that the true function is f~k\tilde{f}_{k} and show that, when knk\sim\sqrt{n}, the risk of f^n()\hat{f}_{n}({\cal F}) at f~k\tilde{f}_{k} is bounded from below by cdσn2/d(logn)4(d+1)/dc_{d}\sigma n^{-2/d}(\log n)^{-4(d+1)/d} for each of the choices of the function class {\cal F} in the statement of Theorem 3.1.

Our strategy is to bound tf~k()t_{\tilde{f}_{k}}({\cal F}) from below and then use Theorem 4.1 to obtain the LSE risk lower bound. Recall, from Theorem 4.1, that tf~k()t_{\tilde{f}_{k}}({\cal F}) maximizes

Hf~k(t,):=Gf~k(t,)t22H_{\tilde{f}_{k}}(t,{\cal F}):=G_{\tilde{f}_{k}}(t,{\cal F})-\frac{t^{2}}{2} (35)

over all t0t\geq 0 where

Gf~k(t,):=𝔼[supg:n(g,f~k)t1ni=1nξi(g(Xi)f~k(Xi))|X1,,Xn].G_{\tilde{f}_{k}}(t,{\cal F}):={\mathbb{E}}\left[\sup_{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(g,\tilde{f}_{k})\leq t}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-\tilde{f}_{k}(X_{i})\right)\bigg{|}X_{1},\dots,X_{n}\right]. (36)

The main ingredients necessary to bound tf~k()t_{\tilde{f}_{k}}({\cal F}) are the next pair of results, which provide upper and lower bounds for Gf~k(t,)G_{\tilde{f}_{k}}(t,{\cal F}) respectively.

Lemma 4.2.

Fix d5d\geq 5 and let Ω\Omega be a convex body satisfying (3). There exists Cd>0C_{d}>0 such that the following holds

{Gf~k(t,)>Cdtσ(kn)2/d(log(Cdσn))2(d+1)/d+Cdn+Cdσk2/d2k1/dn2/d}Cdexp(nd/(d+4)Cd)+Cdexp(n(d4)/dCd2t2k4/d)+exp(Cdnk2/d)\begin{split}&{\mathbb{P}}\left\{G_{\tilde{f}_{k}}(t,{\cal F})>C_{d}t\sigma\left(\frac{k}{n}\right)^{2/d}(\log(C_{d}\sigma\sqrt{n}))^{2(d+1)/d}+\frac{C_{d}}{\sqrt{n}}+\frac{C_{d}\sigma k^{2/d^{2}}}{k^{1/d}n^{2/d}}\right\}\\ &\leq C_{d}\exp\left(-\frac{n^{d/(d+4)}}{C_{d}}\right)+C_{d}\exp\left(-\frac{n^{(d-4)/d}}{C_{d}^{2}}t^{2}k^{4/d}\right)+\exp\left(-\frac{C_{d}n}{k^{2/d}}\right)\end{split} (37)

for every fixed tt satisfying Cdn2/(d+4)tLC_{d}n^{-2/(d+4)}\leq t\leq L, and for {\cal F} equal to either of the two classes 𝒞LL(Ω){\mathcal{C}}_{L}^{L}(\Omega) and 𝒞L(Ω){\mathcal{C}}_{L}(\Omega).

Additionally, when Ω\Omega is a polytope whose number of facets is bounded by a constant depending on dd alone, we have the following inequality for =𝒞B(Ω){\cal F}={\mathcal{C}}^{B}(\Omega):

{Gf~k(t,𝒞B(Ω))>Cdtσ(kn)2/d(log(Cdσn))2(d+1)/d+Cdn}Cdexp(nd/(d+4)Cd)+Cdexp(n(d4)/dCd2t2k4/d)\begin{split}&{\mathbb{P}}\left\{G_{\tilde{f}_{k}}(t,{\mathcal{C}}^{B}(\Omega))>C_{d}t\sigma\left(\frac{k}{n}\right)^{2/d}(\log(C_{d}\sigma\sqrt{n}))^{2(d+1)/d}+\frac{C_{d}}{\sqrt{n}}\right\}\\ &\leq C_{d}\exp\left(-\frac{n^{d/(d+4)}}{C_{d}}\right)+C_{d}\exp\left(-\frac{n^{(d-4)/d}}{C_{d}^{2}}t^{2}k^{4/d}\right)\end{split} (38)

for every fixed tt satisfying tCdn2/(d+4)t\geq C_{d}n^{-2/(d+4)}.

Lemma 4.3.

Fix d1d\geq 1 and let Ω\Omega be a convex body satisfying (3). There exist constants cd,Cd>0c_{d},C_{d}>0 such that the following holds for every fixed 1kn1\leq k\leq n and tCdk2/dt\geq C_{d}k^{-2/d}:

{Gf~k(t,)cdσn2/d}1exp(cdn){\mathbb{P}}\left\{G_{\tilde{f}_{k}}(t,{\cal F})\geq c_{d}\sigma n^{-2/d}\right\}\geq 1-\exp\left(-c_{d}n\right)

for {\cal F} equal to 𝒞LB(Ω){\mathcal{C}}^{B}_{L}(\Omega), and also for the larger classes 𝒞L(Ω){\mathcal{C}}_{L}(\Omega) and 𝒞B(Ω){\mathcal{C}}^{B}(\Omega).

Before providing details behind the proofs of Lemma 4.2 and Lemma 4.3, let us outline how they lead to the proof of Theorem 3.1 (full details are in Appendices B.3 and B.4). The leading term in the upper bound on Gf~k(t,)G_{\tilde{f}_{k}}(t,{\cal F}) given by Lemma 4.2 is

Cdtσ(kn)2/d(log(Cdσn))2(d+1)/d.C_{d}t\sigma\left(\frac{k}{n}\right)^{2/d}(\log(C_{d}\sigma\sqrt{n}))^{2(d+1)/d}. (39)

This upper bound also applies to Hf~k(t,)H_{\tilde{f}_{k}}(t,{\cal F}) because Hf~k(t,)Gf~k(t,)H_{\tilde{f}_{k}}(t,{\cal F})\leq G_{\tilde{f}_{k}}(t,{\cal F}) as is clear from the definition (35). This bound was previously stated as (33) in Subsection 4.2.

On the other hand, the lower bound on Gf~k(t,)G_{\tilde{f}_{k}}(t,{\cal F}) from Lemma 4.3 leads to the following lower bound for Hf~k(t,)H_{\tilde{f}_{k}}(t,{\cal F}):

Hf~k(t,)cdσn2/dt22.H_{\tilde{f}_{k}}(t,{\cal F})\geq c_{d}\sigma n^{-2/d}-\frac{t^{2}}{2}.

The special choice t0:=cdn1/dσt_{0}:=\sqrt{c_{d}}n^{-1/d}\sqrt{\sigma} in the above bound leads to

Hf~k(t0,)cd2σn2/d.H_{\tilde{f}_{k}}(t_{0},{\cal F})\geq\frac{c_{d}}{2}\sigma n^{-2/d}. (40)

This lower bound is the basis for the inequality (34) in the intuition subsection 4.2. The requirement t0=cdn1/dσCdk2/dt_{0}=\sqrt{c_{d}}n^{-1/d}\sqrt{\sigma}\geq C_{d}k^{-2/d} in Lemma 4.3 would hold if kk satisfies kγdnσd/4k\geq\gamma_{d}\sqrt{n}\sigma^{-d/4} for some γd\gamma_{d}.

We now compare the upper bound (39) with the lower bound (40). It can be seen that (39) is less than or equal to the right hand side of (40) when tt is of the order k2/d(log(Cdσn))2(d+1)/dk^{-2/d}(\log(C_{d}\sigma\sqrt{n}))^{-2(d+1)/d}. Let us denote by t1t_{1} this particular choice of tt for k=γdnσd/4k=\gamma_{d}\sqrt{n}\sigma^{-d/4}. We will argue, in Lemma 4.4 below, that t1<t0t_{1}<t_{0} and that

Hf~k(t1,)Hf~k(t0,).H_{\tilde{f}_{k}}(t_{1},{\cal F})\leq H_{\tilde{f}_{k}}(t_{0},{\cal F}).

This allows application of inequality (29) to yield the following lower bound on tf~k()t_{\tilde{f}_{k}}({\cal F}) (this lower bound gives the necessary risk lower bounds for f^n()\hat{f}_{n}({\cal F}) via Theorem 4.1).

Lemma 4.4.

There exist constants γd,cd,Cd,Nd,σ\gamma_{d},c_{d},C_{d},N_{d,\sigma} such that

{tf~k()cdn1/dσ(logn)2(d+1)/d}1Cdexp(n(d4)/dCd2){\mathbb{P}}\left\{t_{\tilde{f}_{k}}({\cal F})\geq c_{d}n^{-1/d}\sqrt{\sigma}(\log n)^{-2(d+1)/d}\right\}\geq 1-C_{d}\exp\left(\frac{-n^{(d-4)/d}}{C_{d}^{2}}\right) (41)

for k=γdnσd/4k=\gamma_{d}\sqrt{n}\sigma^{-d/4} and nNd,σn\geq N_{d,\sigma}. Here {\cal F} can be taken to be any of the three choices in Theorem 3.1.

4.3.1 Proof ideas for Lemma 4.2 and Lemma 4.3

We now explain the main proof ideas behind Lemma 4.2 and Lemma 4.3. For Lemma 4.2, we use available bounds on suprema of empirical processes via bracketing numbers. The key bracketing result that is needed for Lemma 4.2 is given below. It provides an upper bound on the bracketing entropy of bounded convex functions with an additional LpL_{p} norm constraint. The metric employed is the LpL_{p} metric on a union of simplices. This result is stated for arbitrary p[1,)p\in[1,\infty), although we only use it for p=2p=2. Lemma 4.2 is proved by applying this theorem to the simplices given in Lemma 3.2.

Theorem 4.5.

Suppose Ω\Omega is a convex body contained in the unit ball. Let f~\tilde{f} be a convex function on Ω\Omega that is bounded by Γ\Gamma. For a fixed 1p<1\leq p<\infty and t>0t>0, let

BpΓ(f~,t,Ω)={f𝒞Γ(Ω):Ω|f(x)f~(x)|p𝑑xtp}.B_{p}^{\Gamma}(\tilde{f},t,\Omega)=\left\{f\in{\mathcal{C}}^{\Gamma}(\Omega):\int_{\Omega}|f(x)-\tilde{f}(x)|^{p}dx\leq t^{p}\right\}. (42)

Suppose Δ1,,ΔkΩ\Delta_{1},\dots,\Delta_{k}\subseteq\Omega are dd-simplices with disjoint interiors such that f~\tilde{f} is affine on each Δi\Delta_{i}. Then for every 0<ϵ<Γ0<\epsilon<\Gamma and t>0t>0, we have

logN[](ε,BpΓ(f~,t,Ω),p,i=1kΔi)Cd,pk(logΓϵ)d+1(tϵ)d/2\log N_{[\,]}({\varepsilon},B_{p}^{\Gamma}(\tilde{f},t,\Omega),\|\cdot\|_{p,\cup_{i=1}^{k}\Delta_{i}})\leq C_{d,p}k\left(\log\frac{\Gamma}{\epsilon}\right)^{d+1}\left(\frac{t}{\epsilon}\right)^{d/2} (43)

for a constant Cd,pC_{d,p} that depends on pp and dd alone. The left hand side above denotes the bracketing entropy with respect to the LpL_{p} metric on Δ1Δk\Delta_{1}\cup\dots\cup\Delta_{k}.

The above bracketing entropy result (whose proof is given in Appendix B.5) is nontrivial and novel. The function class considered in the above theorem has both an LL_{\infty} constraint (uniform boundedness) as well as an LpL_{p} constraint. If the LpL_{p} constraint is dropped, then the bracketing entropy is of the order (Γ/ϵ)d/2(\Gamma/\epsilon)^{d/2} as proved by Gao and Wellner [19] (see also Doss [15]). In contrast to (Γ/ϵ)d/2(\Gamma/\epsilon)^{d/2}, (43) only has a logarithmic dependence on Γ\Gamma and is much smaller when tt is small. Also, Theorem 4.5 is comparable to but stronger than [26, Lemma 3.3] which gives a weaker bound for the left hand side of (43) having additional multiplicative factors involving kk (these factors cannot be neglected since we care about the regime knk\sim\sqrt{n}).

For Lemma 4.3, the key is the following result which proves a lower bound on the metric entropy of balls of convex functions around the quadratic function f0(x):=x2f_{0}(x):=\|x\|^{2} (recall, from Lemma 3.2, that f~k\tilde{f}_{k} are chosen to be piecewise affine approximations of f0(x)=x2f_{0}(x)=\|x\|^{2}).

Lemma 4.6.

Let Ω\Omega be a convex body satisfying (3). Let f0(x):=x2f_{0}(x):=\|x\|^{2}. Then there exist positive constants c1,c2,c3,c4,Cc_{1},c_{2},c_{3},c_{4},C depending on dd alone such that

{logN(ϵ,{f𝒞LL(Ω):n(f,f0)t},n)c1ϵd/2}1exp(c2n){\mathbb{P}}\left\{\log N(\epsilon,\{f\in{\mathcal{C}}_{L}^{L}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq t\},{\ell_{{\mathbb{P}_{n}}}})\geq c_{1}\epsilon^{-d/2}\right\}\geq 1-\exp(-c_{2}n)

for each fixed ϵ,t,L\epsilon,t,L satisfying LCL\geq C and c3n2/dϵmin(c4,t/4)c_{3}n^{-2/d}\leq\epsilon\leq\min(c_{4},t/4). The probability above is with respect to the randomness in the design points X1,,Xni.i.dX_{1},\dots,X_{n}\overset{\text{i.i.d}}{\sim}{\mathbb{P}}.

The main consequence of the above lemma is that when ϵn2/d\epsilon\sim n^{-2/d} and tn2/dt\gtrsim n^{-2/d}, the metric entropy of {f𝒞LL(Ω):n(f,f0)t}\{f\in{\mathcal{C}}_{L}^{L}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq t\} is of order nn. This also holds for tk2/dt\sim k^{-2/d} for knk\leq n as k2/dn2/dk^{-2/d}\geq n^{-2/d}. Because the distance between f~k\tilde{f}_{k} and f0f_{0} is bounded by k2/dk^{-2/d}, the same order-nn lower bound also holds for the metric entropy of {f𝒞LL(Ω):n(f,f~k)t}\{f\in{\mathcal{C}}_{L}^{L}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,\tilde{f}_{k})\leq t\}. Sudakov minoration can then be used to prove Lemma 4.3. See Appendix B for full details.
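As a quick standalone check on this exponent arithmetic (not from the paper): substituting ϵ of order n^{-2/d} into the lower bound c1 ϵ^{-d/2} of Lemma 4.6 indeed gives order n, for every dimension d.

```python
from fractions import Fraction

# Lemma 4.6 lower-bounds the entropy by (a constant times) eps^{-d/2}.
# Take eps = n^{-2/d} and track the resulting exponent of n exactly.
for d in range(1, 20):
    eps_exponent = Fraction(-2, d)                      # eps = n^{-2/d}
    entropy_exponent = -Fraction(d, 2) * eps_exponent   # exponent of n in eps^{-d/2}
    assert entropy_exponent == 1                        # entropy is of order n for every d
```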

4.4 Proof sketches for fixed-design results (Subsection 3.2)

Here, we provide sketches for the proofs of the fixed-design results stated in Subsection 3.2. Further details and full proofs can be found in Appendix C.

The starting point for these proofs is Theorem 4.1. For the risk upper bound results (Theorem 3.3 and Theorem 3.5), we need to bound the quantity Hf0(t,𝒞(Ω))H_{f_{0}}(t,{\mathcal{C}}(\Omega)) (appearing in the statement of Theorem 4.1) from above. Let

Gf0(t,𝒞(Ω)):=𝔼supg𝒞(Ω):n(f0,g)t1ni=1nξi(g(Xi)f0(Xi))G_{f_{0}}(t,{\mathcal{C}}(\Omega)):={\mathbb{E}}\sup_{g\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}_{n}}}(f_{0},g)\leq t}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-f_{0}(X_{i})\right) (44)

so that Hf0(t,𝒞(Ω))=Gf0(t,𝒞(Ω))t2/2H_{f_{0}}(t,{\mathcal{C}}(\Omega))=G_{f_{0}}(t,{\mathcal{C}}(\Omega))-t^{2}/2 and upper bounds on Gf0(t,𝒞(Ω))G_{f_{0}}(t,{\mathcal{C}}(\Omega)) imply upper bounds on Hf0(t,𝒞(Ω))H_{f_{0}}(t,{\mathcal{C}}(\Omega)). In contrast to the random design setting of the previous subsection, we do not need to explicitly indicate conditioning on X1,,XnX_{1},\dots,X_{n} in this fixed design setting.

Theorem 3.3 is a consequence of the following upper bound on Gf0(t,𝒞(Ω))G_{f_{0}}(t,{\mathcal{C}}(\Omega)):

Lemma 4.7.

Fix f0𝒞(Ω)f_{0}\in{\mathcal{C}}(\Omega) and suppose infg𝒜(Ω)n(f0,g)𝔏\inf_{g\in{\mathcal{A}}(\Omega)}\ell_{{\mathbb{P}}_{n}}(f_{0},g)\leq{\mathfrak{L}}. There exists CdC_{d} such that for every t>0t>0,

Gf0(t,𝒞(Ω)){Cdσn(logn)F/2(t+𝔏d/4t1d/4)for d3Cdσn(t+𝔏)(logn)1+(F/2)for d=4Cdσ((logn)F/2n)4/d(t+𝔏)for d5.G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\begin{cases}\frac{C_{d}\sigma}{\sqrt{n}}(\log n)^{F/2}\left(t+{\mathfrak{L}}^{d/4}t^{1-d/4}\right)~{}~{}\text{for }d\leq 3\\ \frac{C_{d}\sigma}{\sqrt{n}}(t+{\mathfrak{L}})(\log n)^{1+(F/2)}~{}~{}~{}~{}~{}~{}~{}~{}~{}\text{for }d=4\\ C_{d}\sigma\left(\frac{(\log n)^{F/2}}{\sqrt{n}}\right)^{4/d}(t+{\mathfrak{L}})~{}~{}~{}~{}~{}~{}\text{for }d\geq 5.\end{cases} (45)

Lemma 4.7 is a consequence of the following bound which holds for functions f0𝒞(Ω)f_{0}\in{\mathcal{C}}(\Omega) satisfying inff𝒜(Ω)n(f0,f)𝔏\inf_{f\in{\mathcal{A}}(\Omega)}\ell_{{\mathbb{P}}_{n}}(f_{0},f)\leq{\mathfrak{L}}:

logN(ϵ,{f𝒞(Ω):n(f0,f)t},n)Cd(logn)F(t+𝔏ϵ)d/2\log N(\epsilon,\left\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f_{0},f)\leq t\right\},\ell_{{\mathbb{P}_{n}}})\leq C_{d}(\log n)^{F}\left(\frac{t+{\mathfrak{L}}}{\epsilon}\right)^{d/2} (46)

for every t>0t>0 and ϵ>0\epsilon>0. This metric entropy bound, which is novel and nontrivial, will be derived based on a more fundamental metric entropy result that is stated later in this section. Dudley’s entropy bound [16] will be applied in conjunction with (46) to derive Lemma 4.7. Dudley’s bound involves an integral of the square root of the metric entropy, which here leads to an integral of ϵd/4\epsilon^{-d/4}. Since this integral converges for d3d\leq 3, and blows up at zero for d=4d=4 (logarithmically) and for d5d\geq 5 (polynomially), the upper bound for Gf0(t,𝒞(Ω))G_{f_{0}}(t,{\mathcal{C}}(\Omega)) behaves differently in the three regimes d3d\leq 3, d=4d=4 and d5d\geq 5.
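The regime split can be read off from the antiderivative of the integrand ϵ^{-d/4}: its exponent 1 − d/4 is positive for d ≤ 3, zero for d = 4, and negative for d ≥ 5. A small standalone sketch of this trichotomy (not from the paper):

```python
from fractions import Fraction

def dudley_regime(d):
    """Behavior near zero of the Dudley entropy integral of eps^{-d/4}."""
    p = 1 - Fraction(d, 4)  # exponent of the antiderivative eps^{1 - d/4}
    if p > 0:
        return "converges"                 # d <= 3
    if p == 0:
        return "diverges logarithmically"  # d = 4
    return "diverges polynomially"         # d >= 5

assert dudley_regime(3) == "converges"
assert dudley_regime(4) == "diverges logarithmically"
assert dudley_regime(5) == "diverges polynomially"
```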

Theorem 3.5 is a consequence of the following upper bound on Gf0(t,𝒞(Ω))G_{f_{0}}(t,{\mathcal{C}}(\Omega)):

Lemma 4.8.

Fix f0k,h(Ω)f_{0}\in{\mathfrak{C}}_{k,h}(\Omega). There exists cdc_{d} such that for every t>0t>0,

Gf0(t,𝒞(Ω)){tσkn(cdlogn)h/2for d3tσkn(cdlogn)1+(h/2)for d=4tσ((cdlogn)h/2kn)4/dfor d5.G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\begin{cases}t\sigma\sqrt{\frac{k}{n}}(c_{d}\log n)^{h/2}~{}~{}~{}~{}~{}~{}~{}\text{for }d\leq 3\\ t\sigma\sqrt{\frac{k}{n}}(c_{d}\log n)^{1+(h/2)}~{}~{}~{}~{}~{}~{}~{}~{}~{}\text{for }d=4\\ t\sigma\left((c_{d}\log n)^{h/2}\sqrt{\frac{k}{n}}\right)^{4/d}~{}~{}~{}~{}~{}~{}\text{for }d\geq 5.\end{cases} (47)

Lemma 4.8 is a consequence of the following bound which holds for functions f0k,h(Ω)f_{0}\in\mathfrak{C}_{k,h}(\Omega):

logN(ϵ,{f𝒞(Ω):n(f0,f)t},n)k(tϵ)d/2(cdlogn)h.\log N(\epsilon,\left\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f_{0},f)\leq t\right\},\ell_{{\mathbb{P}_{n}}})\leq k\left(\frac{t}{\epsilon}\right)^{d/2}\left(c_{d}\log n\right)^{h}. (48)

The bound (48), which also seems novel and nontrivial, is an improvement over (46) provided t𝔏t\lesssim{\mathfrak{L}} and kk is not too large. Both the entropy bounds (46) and (48) will be derived based on a more general bound that is stated later in this section.

We now move to the proof sketches for the lower bound results, Theorem 3.4 and Theorem 3.6, which apply to the case d5d\geq 5. Because Theorem 3.4 is a special case of Theorem 3.6 corresponding to k=nσd/4k=\sqrt{n}\sigma^{-d/4} (see Appendix C.3), we focus on the proof sketch of Theorem 3.6. This proof is similar to that of Theorem 3.1 but simpler; for example, we can work here with the single loss n\ell_{{\mathbb{P}}_{n}} (without worrying about its discrepancy with any continuous loss \ell_{{\mathbb{P}}} as in the proof of Theorem 3.1).

As in the proof of Theorem 3.1, we need to prove upper and lower bounds for Gf~k(t,𝒞(Ω))G_{\tilde{f}_{k}}(t,{\mathcal{C}}(\Omega)). The upper bound follows from Lemma 4.8 because, as guaranteed by Lemma 3.2, the function f~k\tilde{f}_{k} belongs to m,d+1(Ω)\mathfrak{C}_{m,d+1}(\Omega) for some mCdkm\leq C_{d}k (we are using h=d+1h=d+1 here because every dd-simplex can be written as the intersection of at most d+1d+1 slabs). This gives the following bound, which coincides with the leading term in the random-design bound (37):

Gf~k(t,𝒞(Ω))Cdtσ(logn)2(d+1)/d(kn)2/dfor every t>0G_{\tilde{f}_{k}}(t,{\mathcal{C}}(\Omega))\leq C_{d}t\sigma(\log n)^{2(d+1)/d}\left(\frac{k}{n}\right)^{2/d}\qquad\text{for every $t>0$} (49)
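As a sanity check on the exponents in (49) (a standalone computation, not from the paper): substituting h = d+1 into the d ≥ 5 case of Lemma 4.8 raises (log n)^{h/2} and √(k/n) to the power 4/d, recovering exactly the logarithmic and k/n exponents displayed above.

```python
from fractions import Fraction

for d in range(5, 16):
    h = d + 1                                        # every d-simplex: intersection of d+1 slabs
    log_exponent = Fraction(h, 2) * Fraction(4, d)   # from ((log n)^{h/2})^{4/d}
    kn_exponent = Fraction(1, 2) * Fraction(4, d)    # from (sqrt(k/n))^{4/d}
    assert log_exponent == Fraction(2 * (d + 1), d)  # matches (log n)^{2(d+1)/d} in (49)
    assert kn_exponent == Fraction(2, d)             # matches (k/n)^{2/d} in (49)
```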

The lower bound on Gf~k(t,𝒞(Ω))G_{\tilde{f}_{k}}(t,{\mathcal{C}}(\Omega)) is given below.

Lemma 4.9.

There exists a positive constant cdc_{d} such that

Gf~k(t,𝒞(Ω))cdσt(kn)2/dfor all 0<tcdk2/dG_{\tilde{f}_{k}}(t,{\mathcal{C}}(\Omega))\geq c_{d}\sigma t\left(\frac{k}{n}\right)^{2/d}\qquad\text{for all $0<t\leq c_{d}k^{-2/d}$} (50)

provided kcdnk\leq c_{d}n.

The lower bound (50) and the upper bound (49) differ only by the logarithmic factor (logn)2(d+1)/d(\log n)^{2(d+1)/d}. Further, (50) coincides with the corresponding lower bound in Lemma 4.3 for the random design setting when k=nσd/4k=\sqrt{n}\sigma^{-d/4} and tt is of order k2/dk^{-2/d}. Lemma 4.9 is proved via the following metric entropy lower bound, which applies to the discrete metric n\ell_{{\mathbb{P}}_{n}} and is analogous to Lemma 4.6.

Lemma 4.10.

Let Ω\Omega be a convex body satisfying (3). Let f0(x):=x2f_{0}(x):=\|x\|^{2}. There exist two positive constants c1c_{1} and c2c_{2} depending on dd alone such that

logN(c1n2/d,{g𝒞(Ω):n(f0,g)t},n)n8for tc2n2/d.\log N(c_{1}n^{-2/d},\{g\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f_{0},g)\leq t\},\ell_{{\mathbb{P}_{n}}})\geq\frac{n}{8}\qquad\text{for $t\geq c_{2}n^{-2/d}$}.
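To indicate how Lemma 4.10 yields Lemma 4.9, here is a schematic (heuristic, constants suppressed) application of Sudakov minoration in the boundary case where k is of order n; this is our sketch, not a displayed computation from the paper:

```latex
G_{f_0}(t, \mathcal{C}(\Omega))
  \;\gtrsim\; \sigma\,\epsilon\,
  \sqrt{\frac{\log N\bigl(\epsilon,\ \{g \in \mathcal{C}(\Omega) : \ell_{\mathbb{P}_n}(f_0, g) \le t\},\ \ell_{\mathbb{P}_n}\bigr)}{n}}
  \;\gtrsim\; \sigma\, n^{-2/d} \sqrt{\frac{n/8}{n}}
  \;\asymp\; \sigma\, n^{-2/d},
```

taking ϵ = c1 n^{-2/d} and t ≥ c2 n^{-2/d}; this matches the right hand side of (50) when k is of order n and t of order k^{-2/d}. The general k case follows by rescaling the same computation.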

Theorem 3.6 is proved by combining (49) and (50) via Theorem 4.1.

The metric entropy upper bounds (46) and (48) play a very important role in these proofs. Both these bounds will be derived as a consequence of the following entropy result. This result involves the resolution δ\delta of the grid 𝒮\mathcal{S} (defined in (15)) which, by (16), is of order n1/dn^{-1/d}. Even though we only need the entropy bound for p=2p=2, we state the next result for the discrete LpL_{p} metric (f,g)𝒮(fg,Ω,p)(f,g)\mapsto\ell_{\mathcal{S}}(f-g,\Omega,p) for every 1p<1\leq p<\infty where

𝒮(f,Ω,p)=(1nsΩ𝒮|f(s)|p)1/p.\ell_{\mathcal{S}}(f,\Omega,p)=\left(\frac{1}{n}\sum_{s\in\Omega\cap\mathcal{S}}|f(s)|^{p}\right)^{1/p}. (51)

The ϵ\epsilon-covering number of a space {\cal F} of functions on Ω\Omega under the metric (f,g)𝒮(fg,Ω,p)(f,g)\mapsto\ell_{\mathcal{S}}(f-g,\Omega,p) will be denoted by N(ϵ,,𝒮(,Ω,p))N(\epsilon,{\cal F},\ell_{\mathcal{S}}(\cdot,\Omega,p)).
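The discrete seminorm in (51) is elementary to compute; here is a standalone sketch (our notation, with n taken to be the number of grid points inside Ω, matching the fixed-design setup):

```python
def discrete_lp(f, grid_points, p=2):
    """Discrete L_p seminorm ((1/n) * sum over grid points of |f(s)|^p)^(1/p), as in (51)."""
    n = len(grid_points)
    return (sum(abs(f(s)) ** p for s in grid_points) / n) ** (1.0 / p)

# Example: a 2-d grid of resolution delta = 0.5 inside [0, 1]^2.
delta = 0.5
grid = [(i * delta, j * delta) for i in range(3) for j in range(3)]
assert discrete_lp(lambda s: 1.0, grid) == 1.0   # constant functions have norm 1
assert discrete_lp(lambda s: s[0] + s[1], grid, p=1) > 0
```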

Theorem 4.11.

Suppose Ω\Omega is of the form (14) and satisfies (3). There exists cd,pc_{d,p} depending only on dd and pp such that for every ϵ>0\epsilon>0 and t>0t>0,

logN(ϵ,{f𝒞(Ω):𝒮(f,Ω,p)t},𝒮(,Ω,p))[cd,plog(1/δ)]F(tϵ)d/2.\log N(\epsilon,\{f\in{\mathcal{C}}(\Omega):\ell_{\mathcal{S}}(f,\Omega,p)\leq t\},\ell_{\mathcal{S}}(\cdot,\Omega,p))\leq[c_{d,p}\log(1/\delta)]^{F}\left(\frac{t}{\epsilon}\right)^{d/2}. (52)

Theorem 4.11 is the first entropy result for convex functions that deals with the discrete metric 𝒮(,Ω,p)\ell_{\mathcal{S}}(\cdot,\Omega,p) (all previous results hold for continuous LpL_{p} metrics). Its proof is similar to but longer than the proof of the related result Theorem 4.5. The two bounds (46) and (48) follow easily from Theorem 4.11 as proved in Appendix C.6.

5 Discussion

We conclude by addressing a few natural questions and extensions that arise from our main results.

5.1 Optimality vs Suboptimality

As already indicated, the statement “the LSE in convex regression is rate optimal” can be either true or false in the random-design setting, depending on the shape of the domain. In this paper, we proved that, if the domain is polytopal and the dimension d5d\geq 5, this statement is false, meaning that the LSE is minimax suboptimal. On the other hand, if the domain is the unit ball, then the statement is true for all d2d\geq 2, meaning that the LSE is minimax optimal (this optimality has been proved in [32, 24]).

Here we briefly outline the two key differences (relevant to LSE suboptimality/optimality) in the structures of 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) corresponding to the two cases where Ω\Omega is a polytope and Ω\Omega is the unit ball. We assume d2d\geq 2 below.

  1. 1.

    The main difference is that the L2()L_{2}({\mathbb{P}}) metric entropy of 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) is ϵd/2\epsilon^{-d/2} in the polytopal case and ϵ(d1)\epsilon^{-(d-1)} in the ball case (these entropy results are due to [19]). The larger metric entropy directly leads to the slower minimax rate n2/(d+1)n^{-2/(d+1)} in the ball case compared to n4/(d+4)n^{-4/(d+4)} in the polytopal case. In the polytopal case, constructions for the proof of the metric entropy lower bound are based on perturbations of a smooth convex function. This construction also leads to the same ϵd/2\epsilon^{-d/2} lower bound in the ball case. However the improved lower bound of ϵ(d1)\epsilon^{-(d-1)} in the ball case is obtained by taking indicator-like convex functions which are zero in the interior of the ball but rapidly rise to 1 near a subset of the boundary. This construction is feasible because of the high complexity of the boundary of the ball, as well as because of the nature of the L2()L_{2}({\mathbb{P}}) metric. If the metric is changed to L1()L_{1}({\mathbb{P}}), the metric entropy remains ϵd/2\epsilon^{-d/2} for both the ball and the polytopal case. It may be helpful to note here that if f=IAf=I_{A} is the indicator function of a set AA, then fL2()=(A)\|f\|_{L_{2}({\mathbb{P}})}=\sqrt{{\mathbb{P}}(A)} which can be much larger than fL1()=(A)\|f\|_{L_{1}({\mathbb{P}})}={\mathbb{P}}(A).

  2. 2.

    Now let us discuss the behavior of the LSE. In the polytopal case, when d5d\geq 5, the supremum risk of the LSE over 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) is n2/dn^{-2/d} (up to log factors). This exceeds the minimax rate n4/(d+4)n^{-4/(d+4)}, implying that the LSE is minimax suboptimal. On the other hand, note that n2/dn^{-2/d} is actually smaller than the minimax rate of n2/(d+1)n^{-2/(d+1)} in the ball case. This suggests that the LSE might be minimax optimal in the ball case. Kur et al. [32] proved that the LSE achieves the rate n2/(d+1)n^{-2/(d+1)} in the ball case, implying optimality. At a very high level, their argument is as follows. From Theorem 4.1, it is clear that the risk of the LSE can be upper bounded via upper bounds on the expected supremum

    𝔼supf𝒞B(Ω):n(f,f0)t1ni=1nξi(f(Xi)f0(Xi)){\mathbb{E}}\sup_{f\in{\mathcal{C}}^{B}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq t}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(f(X_{i})-f_{0}(X_{i})\right)

    where f0𝒞B(Ω)f_{0}\in{\mathcal{C}}^{B}(\Omega) is the true function. Kur et al. [32] upper bound the above by first removing the localization constraint n(f,f0)t\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq t leading to

    𝔼supf𝒞B(Ω)1ni=1nξi(f(Xi)f0(Xi))𝔼supf𝒞2B(Ω)1ni=1nξif(Xi).{\mathbb{E}}\sup_{f\in{\mathcal{C}}^{B}(\Omega)}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(f(X_{i})-f_{0}(X_{i})\right)\leq{\mathbb{E}}\sup_{f\in{\mathcal{C}}^{2B}(\Omega)}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}f(X_{i}).

    The right hand side above is the Gaussian complexity of the class 𝒞2B(Ω){\mathcal{C}}^{2B}(\Omega). Kur et al. [32] then show that the Gaussian complexity is upper bounded by the discrepancy between n{\mathbb{P}}_{n} and {\mathbb{P}} over all compact convex subsets of Ω\Omega:

    𝔼supCΩ:C is compact, convex|n(C)(C)|.{\mathbb{E}}\sup_{C\subseteq\Omega:C\text{ is compact, convex}}|{\mathbb{P}}_{n}(C)-{\mathbb{P}}(C)|. (53)

    They then use chaining with respect to the L1L_{1} norm to prove that (53) is bounded by n2/(d+1)n^{-2/(d+1)}. This argument reveals that, like the minimax rate which is driven by indicator-like functions over convex sets, the LSE accuracy is also driven by the discrepancy over convex sets. Han [24] gives examples of other problems where LSE accuracy is also driven by discrepancy over sets and the LSE turns out to be optimal in these problems as well.
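The exponent comparisons in the two items above can be checked mechanically. Below is a standalone sketch (not from the paper) using the standard entropy-to-rate correspondence: metric entropy of order ϵ^{-α} (for α > 2) yields the minimax rate n^{-2/(2+α)}.

```python
from fractions import Fraction

def minimax_exponent(alpha):
    """Minimax risk exponent in n^{-2/(2+alpha)} for entropy log N(eps) ~ eps^{-alpha}."""
    return Fraction(2) / (2 + alpha)

for d in range(5, 16):
    polytope = minimax_exponent(Fraction(d, 2))   # entropy eps^{-d/2} (polytopal domain)
    ball = minimax_exponent(Fraction(d - 1))      # entropy eps^{-(d-1)} (ball domain)
    lse = Fraction(2, d)                          # worst-case LSE exponent (up to logs)
    assert polytope == Fraction(4, d + 4)
    assert ball == Fraction(2, d + 1)
    assert lse < polytope   # n^{-2/d} decays more slowly: LSE suboptimal on polytopes
    assert lse > ball       # yet n^{-2/d} is still faster than the ball minimax rate
```

With α = d/2 and α = d−1, the formula reproduces the two minimax rates quoted above, and the comparison with 2/d confirms suboptimality on polytopes alongside consistency with optimality on the ball.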

5.2 Additional remarks on the fixed-design results

In the fixed-design setting (where the design points are given by grid points (15) intersected with Ω\Omega and the loss function is (8)), we proved that when Ω\Omega is a polytope of the form (14), the minimax rate over each class 𝒞LB(Ω){\mathcal{C}}_{L}^{B}(\Omega), 𝒞L(Ω){\mathcal{C}}_{L}(\Omega), 𝒞B(Ω){\mathcal{C}}^{B}(\Omega), L(Ω){\cal F}^{L}(\Omega) equals n4/(d+4)n^{-4/(d+4)} (up to logarithmic factors). We also proved (Theorem 3.4) that the supremum risk of the unrestricted LSE f^n(𝒞(Ω))\hat{f}_{n}({\mathcal{C}}(\Omega)) over each of these classes is at least Ω~(n2/d)\tilde{\Omega}(n^{-2/d}) for d5d\geq 5. The proof of Theorem 3.4 can be easily modified to prove that the restricted LSE over {\cal F} also has supremum risk of at least Ω~(n2/d)\tilde{\Omega}(n^{-2/d}) and hence is minimax suboptimal over {\cal F}, where {\cal F} is any of the four classes 𝒞LB(Ω){\mathcal{C}}_{L}^{B}(\Omega), 𝒞L(Ω){\mathcal{C}}_{L}(\Omega), 𝒞B(Ω){\mathcal{C}}^{B}(\Omega), L(Ω){\cal F}^{L}(\Omega).

Now, let us briefly comment on the case when Ω\Omega is not a polytope. Under the Lipschitz constraint, the story is the same as in the case of random design. Specifically, the minimax rate for 𝒞LB(Ω){\mathcal{C}}_{L}^{B}(\Omega) and 𝒞L(Ω){\mathcal{C}}_{L}(\Omega) equals n4/(d+4)n^{-4/(d+4)} for every Ω\Omega satisfying (3) regardless of whether Ω\Omega is polytopal or not (this is proved by essentially the same argument as in the proof of Proposition 2.1; the main idea is that the metric entropy of 𝒞LB(Ω){\mathcal{C}}_{L}^{B}(\Omega) is not influenced by the boundary of Ω\Omega). Furthermore, the restricted LSE over each of these classes achieves supremum risk of at least Ω~(n2/d)\tilde{\Omega}(n^{-2/d}). The proof of these results can be obtained by suitably modifying the proof of Theorem 3.4 following ideas in the proof of Theorem 3.1.

Without the Lipschitz constraint, things are more complicated. In Section 1, we discussed that when Ω\Omega is the unit ball for d2d\geq 2 and we are in the random-design setting, the minimax rate over 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) equals n2/(d+1)n^{-2/(d+1)} as proved by Han and Wellner [26]. Additionally, we noted the minimax optimality of the LSE over 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) as proved by Kur et al. [32]. The analogues of these results for the fixed design setting are unknown.

We summarize this discussion on the fixed-design setting in Table 6.

Class \mathcal{F} | Assumption on Ω\Omega | Minimax Rate (up to logs) | supf𝔼fn2(f^n(),f)\underset{f\in{\cal F}}{\sup}{\mathbb{E}}_{f}\ell_{{\mathbb{P}}_{n}}^{2}(\hat{f}_{n}({\cal F}),f) (up to logs) | Minimax Optimality of f^n()\hat{f}_{n}({\cal F}) (up to logs)
L(Ω){\cal F}^{L}(\Omega) | polytope | n4/(d+4)n^{-4/(d+4)} | n4/(d+4)n^{-4/(d+4)} for d4d\leq 4; n2/dn^{-2/d} for d5d\geq 5 | Optimal for d4d\leq 4; Suboptimal for d5d\geq 5
𝒞LB(Ω)\mathcal{C}_{L}^{B}(\Omega) | any convex body | n4/(d+4)n^{-4/(d+4)} | n4/(d+4)n^{-4/(d+4)} for d4d\leq 4; n2/dn^{-2/d} for d5d\geq 5 | Optimal for d4d\leq 4; Suboptimal for d5d\geq 5
𝒞L(Ω)\mathcal{C}_{L}(\Omega) | any convex body | n4/(d+4)n^{-4/(d+4)} | n4/(d+4)n^{-4/(d+4)} for d4d\leq 4; n2/dn^{-2/d} for d5d\geq 5 | Optimal for d4d\leq 4; Suboptimal for d5d\geq 5
𝒞B(Ω)\mathcal{C}^{B}(\Omega) | polytope | n4/(d+4)n^{-4/(d+4)} | n4/(d+4)n^{-4/(d+4)} for d4d\leq 4; n2/dn^{-2/d} for d5d\geq 5 | Optimal for d4d\leq 4; Suboptimal for d5d\geq 5
𝒞B(Ω)\mathcal{C}^{B}(\Omega) | ball, d2d\geq 2 | open question | open question | open question
Table 6: Analogue of Table 5 for the fixed-design setting

The larger minimax rate of n2/(d+1)n^{-2/(d+1)} (compared to n4/(d+4)n^{-4/(d+4)} for polytopal Ω\Omega) in the random design case can be attributed to the increased metric entropy (in the L2L_{2} metric with respect to the continuous uniform probability measure on Ω\Omega) of 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) due to the curvature of the boundary of Ω\Omega (see [19, Subsection 2.10]). In the fixed design case, the boundary of Ω\Omega is also expected to elevate the minimax rate above n4/(d+4)n^{-4/(d+4)}. However, the precise minimax rate in this context seems hard to determine (note that there are no existing metric entropy results for 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) in the discrete L2L_{2} metric with respect to the grid points of the fixed design) and it could range between n4/(d+4)n^{-4/(d+4)} and n2/(d+1)n^{-2/(d+1)}. Furthermore, the question of whether the LSE over 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) and the unrestricted convex LSE are optimal or suboptimal in the fixed design setting for non-polytopal Ω\Omega also remains open.

5.3 Additional discussion on the random-design setting

In our random design setting with X1,,Xni.i.dX_{1},\dots,X_{n}\overset{\text{i.i.d}}{\sim}{\mathbb{P}}, we assumed that {\mathbb{P}} is the uniform distribution on the convex body Ω\Omega. From an examination of our proofs, it should be clear that all our random design results would continue to hold under the more general assumption that {\mathbb{P}} has a density on Ω\Omega that is bounded from above and below by positive constants. Under this more general assumption, our bounds will involve additional multiplicative constants depending on the density bounds. We worked with the simpler uniform distribution assumption because the proof in the more general case essentially reduces to the uniform case.

One can also consider the case where {\mathbb{P}} has full support over d{\mathbb{R}}^{d}, such as when {\mathbb{P}} is the standard multivariate normal distribution on d{\mathbb{R}}^{d} (here Ω=d\Omega={\mathbb{R}}^{d}). In this case, boundedness would not make sense as non-constant convex functions cannot be bounded over the whole of d{\mathbb{R}}^{d}. As the Lipschitz assumption is still reasonable, one may study the minimax rate over 𝒞L(d){\mathcal{C}}_{L}({\mathbb{R}}^{d}), and the optimality of the Lipschitz convex LSE f^n(𝒞L(d))\hat{f}_{n}({\mathcal{C}}_{L}({\mathbb{R}}^{d})) over 𝒞L(d){\mathcal{C}}_{L}({\mathbb{R}}^{d}) (note that the loss function is still given by (4) but now {\mathbb{P}} has full support over d{\mathbb{R}}^{d}). When {\mathbb{P}} is the standard multivariate normal distribution, we believe that the minimax rate over 𝒞L(d){\mathcal{C}}_{L}({\mathbb{R}}^{d}) should be n4/(d+4)n^{-4/(d+4)} and that the supremum risk of f^n(𝒞L(d))\hat{f}_{n}({\mathcal{C}}_{L}({\mathbb{R}}^{d})) over 𝒞L(d){\mathcal{C}}_{L}({\mathbb{R}}^{d}) is bounded below by n2/dn^{-2/d} (up to logarithmic factors) for d5d\geq 5. The intuition is that {xd:x>r}{\mathbb{P}}\{x\in{\mathbb{R}}^{d}:\|x\|>r\} decreases exponentially in rr, while the metric entropy (of bounded Lipschitz convex functions) over {xd:xr}\{x\in{\mathbb{R}}^{d}:\|x\|\leq r\} only grows polynomially in rr. We leave a principled study of such unbounded design distributions to future work.
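One heuristic way to make this intuition quantitative is the following truncation sketch; the constants and the exact form of the entropy bound below are our assumptions, not results from the paper. For the standard Gaussian design,

```latex
\mathbb{P}\{x \in \mathbb{R}^d : \|x\| > r\} \;\lesssim\; e^{-c r^{2}},
\qquad
\log N\Bigl(\epsilon,\ \mathcal{C}_{L}\bigl(\{\|x\| \le r\}\bigr),\ L_{2}\Bigr)
\;\lesssim\; \Bigl(\frac{L r}{\epsilon}\Bigr)^{d/2},
```

so that, choosing r of order √(log n), the tail contributes only an O(n^{-c'}) term while the entropy, and hence the resulting rates, are inflated by at most polylogarithmic factors.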

5.4 Beyond Gaussian noise

Throughout, we assumed that the errors in the regression model (1) are Gaussian, i.e. ξ1,,ξni.i.dN(0,σ2)\xi_{1},\dots,\xi_{n}\overset{\text{i.i.d}}{\sim}N(0,\sigma^{2}). It is natural to ask if the results continue to hold if the errors have mean zero and variance σ2\sigma^{2} but a non-Gaussian distribution. For sub-Gaussian errors, we believe that certain variants of our results can be proved with additional work. One important bottleneck in the extension of our main results (Theorem 3.1 and Theorem 3.4) to sub-Gaussian errors is the result of [10] stated as Theorem 4.1. This result was originally stated and proved in [10] for Gaussian errors and we do not know if an extension to sub-Gaussian errors exists in the literature. The proof is mainly based on the concentration of the random variable (for fixed t>0t>0):

F(ξ,t):=supg:n(f,g)t1ni=1nξi(f(Xi)g(Xi))F(\vec{\xi},t):=\sup_{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(f,g)\leq t}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(f(X_{i})-g(X_{i})\right)

which is proved in [10] by using the fact that ξF(ξ,t)\vec{\xi}\mapsto F(\vec{\xi},t) is a tt-Lipschitz function of the i.i.d Gaussian error vector ξ:=(ξ1,,ξn)\vec{\xi}:=(\xi_{1},\dots,\xi_{n}).
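For completeness, here is a sketch of that Lipschitz property via Cauchy–Schwarz (our reconstruction; with the 1/n normalization in the display above, the Euclidean Lipschitz constant comes out as t/√n):

```latex
F(\vec{\xi}, t) - F(\vec{\xi}\,', t)
\;\le\; \sup_{g \in \mathcal{F}:\, \ell_{\mathbb{P}_n}(f, g) \le t}
   \frac{1}{n} \sum_{i=1}^{n} (\xi_i - \xi_i') \bigl(f(X_i) - g(X_i)\bigr)
\;\le\; \frac{\|\vec{\xi} - \vec{\xi}\,'\|}{n} \cdot \sqrt{n}
   \sup_{g:\, \ell_{\mathbb{P}_n}(f, g) \le t} \ell_{\mathbb{P}_n}(f, g)
\;\le\; \frac{t}{\sqrt{n}}\, \|\vec{\xi} - \vec{\xi}\,'\|,
```

and the Gaussian concentration inequality for Lipschitz functions then yields concentration of F(ξ, t) around its mean at scale t/√n.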

For sub-Gaussian ξ1,,ξn\xi_{1},\dots,\xi_{n}, one can first argue that ξ1,,ξn\xi_{1},\dots,\xi_{n} are bounded by ClognC\sqrt{\log n} with high probability (say, with probability 1O(n1)1-O(n^{-1})), and then invoke the concentration of convex Lipschitz functions of bounded random variables (see, e.g., [45, Theorem 6.6]); note that ξF(ξ,t)\vec{\xi}\mapsto F(\vec{\xi},t) is convex in ξ\vec{\xi}. This should lead to a version of Theorem 4.1 for sub-Gaussian errors (although with weaker control on the probabilities involved).

Note that the proofs of Theorems 3.1 and 3.4 also use certain other tools, such as the Sudakov minoration inequality, that are valid only for Gaussian errors. These would need to be replaced by appropriate multiplier inequalities for suprema of empirical processes (such results can be found in, e.g., [48, Chapter 2.9] and [27]). We leave a rigorous extension of our results to sub-Gaussian errors to future work.

5.5 Rates of convergence in the interior of the domain Ω\Omega

It is natural to wonder if our suboptimality results are caused by boundary effects, and if the LSEs are optimal or suboptimal in the interior of the domain Ω\Omega. Concretely, let Ω0\Omega_{0} denote a compact, convex region that is contained in the interior of Ω\Omega, and suppose that the loss function is given by

Ω0(f(x)g(x))2𝑑(x).\int_{\Omega_{0}}\left(f(x)-g(x)\right)^{2}d{\mathbb{P}}(x). (54)

The question then is whether the LSEs considered in Theorem 3.1 are still suboptimal under the above modified loss function. This is a difficult question that is hard to resolve with the methods used for the proof of Theorem 3.1. However, we believe that the suboptimality should persist because of the following heuristic argument. Consider, for concreteness, the bounded convex LSE f^n(𝒞B(Ω))\hat{f}_{n}({\mathcal{C}}^{B}(\Omega)). Every function in 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) is actually both BB-bounded and O(B)O(B)-Lipschitz inside the smaller domain Ω0\Omega_{0} (regardless of the shape of Ω0\Omega_{0}). Because the entropy of O(B)O(B)-bounded and Lipschitz convex functions on Ω0\Omega_{0} equals C(B)ϵd/2C(B)\cdot\epsilon^{-d/2} regardless of the shape of Ω0\Omega_{0}, we have reason to believe that f^n(𝒞B(Ω))\hat{f}_{n}({\mathcal{C}}^{B}(\Omega)) would be minimax suboptimal (for d5d\geq 5) with respect to the modified loss (54). Proving this is challenging, mainly because the behavior of the LSE f^n(𝒞B(Ω))\hat{f}_{n}({\mathcal{C}}^{B}(\Omega)) inside Ω0\Omega_{0} is also influenced (in a complicated way) by the observations lying outside Ω0\Omega_{0}. We would therefore like to highlight this question as an open problem, as we do not know how to “decouple” the performance of the LSE over 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) on Ω0\Omega_{0} from its behavior on ΩΩ0\Omega\setminus\Omega_{0}.

5.6 Results for LSEs with smoothness constraints (without convexity)

All the results in this paper apply to LSEs over classes of convex functions (with additional constraints such as boundedness and/or being Lipschitz). It is natural to ask whether similar suboptimality results hold for LSEs under purely smoothness constraints (without any shape constraint such as convexity). The methods of this paper are closely tied to convexity, and we leave a study of purely smoothness-constrained LSEs for future work. For a concrete problem in this direction, consider the class of LL-Lipschitz functions on Ω\Omega:

:={f:Ω such that f is L-Lipschitz on Ω}.\mathcal{L}:=\left\{f:\Omega\rightarrow{\mathbb{R}}\text{ such that }f\text{ is }L\text{-Lipschitz on }\Omega\right\}.

For a natural Ω\Omega (e.g., Ω=[1,1]d\Omega=[-1,1]^{d}), the question is whether the LSE over \mathcal{L} is minimax optimal over \mathcal{L} for all d1d\geq 1. One can also ask the same question for function classes with constraints on second (and higher) order derivatives. We would like to highlight these as open problems, as our techniques do not apply to these classes.

{acks}

We are truly thankful to the Associate Editor and three anonymous referees for their comprehensive reviews of our earlier manuscript. Their insightful feedback significantly enhanced both the content and organization of the paper.

{funding}

The first author was funded by the Center for Minds, Brains and Machines, funded by NSF award CCF-1231216. The second author was funded by NSF Grant OCA-1940270. The third author was funded by NSF CAREER Grant DMS-1654589. The fourth author was funded by NSF Grant DMS-1712822.

The rest of this paper consists of three appendices: Appendix A, Appendix B and Appendix C. Appendix A contains the proofs of all results in Section 2. Appendix B contains the proofs of all results in Section 3.1; it follows the proof sketch given in Subsection 4.3 and also proves the results quoted in that subsection. Appendix C contains the proofs of all results in Section 3.2; it follows the proof sketch given in Subsection 4.4 and also proves the results quoted in that subsection.

Appendix A Proofs of Minimax Rates for Convex Regression

This section contains the proofs of Proposition 2.1, Proposition 2.2 and Proposition 2.3 (these results were stated in Section 2).

For the proof of Proposition 2.1 below, we need Lemma B.3, which allows switching between the two loss functions n2\ell_{{\mathbb{P}}_{n}}^{2} and 2\ell_{{\mathbb{P}}}^{2}. Lemma B.3 is stated in Appendix B because it is also crucial for the proof of Theorem 3.1.

Proof of Proposition 2.1.

The lower bound for the minimax rate follows from the corresponding result for the smaller class 𝒞LL(Ω){\mathcal{C}}_{L}^{L}(\Omega). For the upper bound, let us first describe the estimator that we work with. Let Ti:=YiY¯T_{i}:=Y_{i}-\bar{Y} (where Y¯:=(Y1++Yn)/n\bar{Y}:=(Y_{1}+\dots+Y_{n})/n) for each i=1,,ni=1,\dots,n. For a finite subset 𝐕\mathbf{V} of the function class 𝒞L2L(Ω){\mathcal{C}}_{L}^{2L}(\Omega), let h^n𝐕\hat{h}_{n}^{\mathbf{V}} denote any least squares estimator over 𝐕\mathbf{V} for the data (X1,T1),,(Xn,Tn)(X_{1},T_{1}),\dots,(X_{n},T_{n}). In other words,

h^n𝐕argminh𝐕i=1n(Tih(Xi))2.\hat{h}_{n}^{\mathbf{V}}\in\mathop{\rm argmin}_{h\in\mathbf{V}}\sum_{i=1}^{n}\left(T_{i}-h(X_{i})\right)^{2}.

Our estimator is given by the convex function:

xY¯+h^n𝐕(x)x\mapsto\bar{Y}+\hat{h}_{n}^{\mathbf{V}}(x) (55)

for an appropriately chosen covering subset 𝐕\mathbf{V} of 𝒞L2L(Ω){\mathcal{C}}_{L}^{2L}(\Omega). The proof below shows that this estimator achieves the rate n4/(d+4)n^{-4/(d+4)} uniformly over 𝒞L(Ω){\mathcal{C}}_{L}(\Omega). Fix a “true” function f0𝒞L(Ω)f_{0}\in{\mathcal{C}}_{L}(\Omega). We first bound the risk of h^n𝐕\hat{h}_{n}^{\mathbf{V}}. For an arbitrary nonnegative-valued functional HH, the following inequality holds:

H(h^n𝐕)h𝐕H(h)exp(18σ2i=1n(Tih^n𝐕(Xi))218σ2i=1n(Tih(Xi))2)H(\hat{h}_{n}^{\mathbf{V}})\leq\sum_{h\in\mathbf{V}}H(h)\exp\left(\frac{1}{8\sigma^{2}}\sum_{i=1}^{n}(T_{i}-\hat{h}_{n}^{\mathbf{V}}(X_{i}))^{2}-\frac{1}{8\sigma^{2}}\sum_{i=1}^{n}(T_{i}-h(X_{i}))^{2}\right)

because of the presence of the term corresponding to h=h^n𝐕𝐕h=\hat{h}_{n}^{\mathbf{V}}\in\mathbf{V} on the right hand side. Because h^n𝐕\hat{h}_{n}^{\mathbf{V}} minimizes sum of squares over 𝐕\mathbf{V}, we can replace h^n𝐕\hat{h}_{n}^{\mathbf{V}} by any other element h𝐕h^{\prime}\in\mathbf{V} leading to

H(h^n𝐕)h𝐕H(h)exp(18σ2i=1n(Tih(Xi))218σ2i=1n(Tih(Xi))2)H(\hat{h}_{n}^{\mathbf{V}})\leq\sum_{h\in\mathbf{V}}H(h)\exp\left(\frac{1}{8\sigma^{2}}\sum_{i=1}^{n}(T_{i}-h^{\prime}(X_{i}))^{2}-\frac{1}{8\sigma^{2}}\sum_{i=1}^{n}(T_{i}-h(X_{i}))^{2}\right) (56)

for every h𝐕h^{\prime}\in\mathbf{V}. We now take the expectation on both sides of the above inequality conditioned on X1,,XnX_{1},\dots,X_{n}. The following function h0:Ωh_{0}:\Omega\rightarrow{\mathbb{R}} will play a key role in the sequel

h0(x):=f0(x)f0(X1)++f0(Xn)n.h_{0}(x):=f_{0}(x)-\frac{f_{0}(X_{1})+\dots+f_{0}(X_{n})}{n}.

The function h0h_{0} is Lipschitz with Lipschitz constant LL and, moreover, uniformly bounded by 2L2L because:

|h_{0}(x)| \leq\frac{1}{n}\sum_{i=1}^{n}|f_{0}(x)-f_{0}(X_{i})|
\leq\frac{L}{n}\sum_{i=1}^{n}\|x-X_{i}\|\leq L\sup_{x,x^{\prime}\in\Omega}\|x-x^{\prime}\|\leq 2L.

In other words, $h_{0}\in{\mathcal{C}}_{L}^{2L}(\Omega)$. In order to take the expectation of both sides of (56), note that the conditional distribution of $T_{1},\dots,T_{n}$ given $X_{1},\dots,X_{n}$ is multivariate normal with mean vector $(h_{0}(X_{1}),\dots,h_{0}(X_{n}))$ and covariance matrix $\sigma^{2}\left(I-\frac{{\mathbf{1}}{\mathbf{1}}^{T}}{n}\right)\preceq\sigma^{2}I$. Here $I$ is the identity matrix and ${\mathbf{1}}$ is the $n\times 1$ vector of ones. By a straightforward calculation, we obtain

{\mathbb{E}}H\left(\hat{h}_{n}^{\mathbf{V}}\mid X_{1},\dots,X_{n}\right)
\leq\sum_{h\in\mathbf{V}}H(h)\exp\left(\frac{n}{8\sigma^{2}}\ell^{2}_{{\mathbb{P}}_{n}}\left(h_{0},h^{\prime}\right)-\frac{n}{8\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},h\right)+\frac{n}{32\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h,h^{\prime}\right)\right)
\leq\sum_{h\in\mathbf{V}}H(h)\exp\left(\frac{3n}{16\sigma^{2}}\ell^{2}_{{\mathbb{P}}_{n}}\left(h_{0},h^{\prime}\right)-\frac{n}{16\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},h\right)\right)
=\exp\left(\frac{3n}{16\sigma^{2}}\ell^{2}_{{\mathbb{P}}_{n}}\left(h_{0},h^{\prime}\right)\right)\sum_{h\in\mathbf{V}}H(h)\exp\left(-\frac{n}{16\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},h\right)\right)

where we used the standard fact $\ell_{{\mathbb{P}}_{n}}^{2}(h,h^{\prime})\leq 2\ell_{{\mathbb{P}}_{n}}^{2}(h_{0},h)+2\ell_{{\mathbb{P}}_{n}}^{2}(h_{0},h^{\prime})$ for the penultimate inequality. As $h^{\prime}\in\mathbf{V}$ is arbitrary, we can take an infimum over $h^{\prime}\in\mathbf{V}$ to obtain

{\mathbb{E}}H\left(\hat{h}_{n}^{\mathbf{V}}\mid X_{1},\dots,X_{n}\right)
\leq\exp\left(\frac{3n}{16\sigma^{2}}\inf_{h^{\prime}\in\mathbf{V}}\ell^{2}_{{\mathbb{P}}_{n}}\left(h_{0},h^{\prime}\right)\right)\sum_{h\in\mathbf{V}}H(h)\exp\left(-\frac{n}{16\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},h\right)\right).

Because $\ell_{{\mathbb{P}}_{n}}^{2}(h_{0},h^{\prime})\leq\|h_{0}-h^{\prime}\|_{\infty}^{2}$ and $h_{0}\in{\mathcal{C}}_{L}^{2L}(\Omega)$,

\inf_{h^{\prime}\in\mathbf{V}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},h^{\prime}\right)\leq\inf_{h^{\prime}\in\mathbf{V}}\|h_{0}-h^{\prime}\|_{\infty}^{2}\leq\sup_{g\in{\mathcal{C}}_{L}^{2L}(\Omega)}\inf_{h^{\prime}\in\mathbf{V}}\|g-h^{\prime}\|_{\infty}^{2}.

We have thus proved

{\mathbb{E}}H\left(\hat{h}_{n}^{\mathbf{V}}\mid X_{1},\dots,X_{n}\right)
\leq\exp\left(\frac{3n}{16\sigma^{2}}\sup_{g\in{\mathcal{C}}_{L}^{2L}(\Omega)}\inf_{h^{\prime}\in\mathbf{V}}\|g-h^{\prime}\|_{\infty}^{2}\right)\sum_{h\in\mathbf{V}}H(h)\exp\left(-\frac{n}{16\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},h\right)\right).

The choice

H(h):=\exp\left(\frac{n}{16\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},h\right)\right)

leads to

{\mathbb{E}}\left[\exp\left(\frac{n}{16\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)\right)\bigg{|}X_{1},\dots,X_{n}\right]
\leq\exp\left(\frac{3n}{16\sigma^{2}}\sup_{g\in{\mathcal{C}}_{L}^{2L}(\Omega)}\inf_{h^{\prime}\in\mathbf{V}}\|g-h^{\prime}\|_{\infty}^{2}+\log|\mathbf{V}|\right)

where $|\mathbf{V}|$ denotes the cardinality of the finite set $\mathbf{V}$. Jensen’s inequality on the left hand side gives

{\mathbb{E}}\left(\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)\mid X_{1},\dots,X_{n}\right)\leq 3\sup_{g\in{\mathcal{C}}_{L}^{2L}(\Omega)}\inf_{h^{\prime}\in\mathbf{V}}\|g-h^{\prime}\|_{\infty}^{2}+\frac{16\sigma^{2}}{n}\log|\mathbf{V}|.
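Spelling out the Jensen step: since $\exp$ is convex,

```latex
\exp\left(\frac{n}{16\sigma^{2}}\,{\mathbb{E}}\left(\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)\mid X_{1},\dots,X_{n}\right)\right)
\leq{\mathbb{E}}\left[\exp\left(\frac{n}{16\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)\right)\bigg{|}X_{1},\dots,X_{n}\right],
```

and taking logarithms of both sides and multiplying by $16\sigma^{2}/n$ yields the displayed bound.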

Taking expectations with respect to $X_{1},\dots,X_{n}$ (the right hand side above is actually nonrandom), we get the same bound unconditionally:

{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)\leq 3\sup_{g\in{\mathcal{C}}_{L}^{2L}(\Omega)}\inf_{h^{\prime}\in\mathbf{V}}\|g-h^{\prime}\|_{\infty}^{2}+\frac{16\sigma^{2}}{n}\log|\mathbf{V}|.

We now take $\mathbf{V}$ to be an $\epsilon$-covering subset of ${\mathcal{C}}_{L}^{2L}(\Omega)$ for an appropriate $\epsilon$. By the classical metric entropy result of [9], for each $\epsilon>0$, we can find $\mathbf{V}$ such that

\sup_{g\in{\mathcal{C}}_{L}^{2L}(\Omega)}\inf_{h^{\prime}\in\mathbf{V}}\|g-h^{\prime}\|_{\infty}^{2}\leq\epsilon^{2}\quad\text{ and }\quad\log|\mathbf{V}|\leq C_{d}\left(\frac{L}{\epsilon}\right)^{d/2}.

Taking $\epsilon$ to be of order $n^{-2/(d+4)}$ and $\mathbf{V}$ to be the corresponding $\epsilon$-covering subset of ${\mathcal{C}}_{L}^{2L}(\Omega)$, we deduce

{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)\leq C_{d,L,\sigma}n^{-4/(d+4)}.
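The order of $\epsilon$ is obtained by balancing the approximation term against the entropy term:

```latex
\epsilon^{2}\asymp\frac{\sigma^{2}}{n}\left(\frac{L}{\epsilon}\right)^{d/2}
\iff\epsilon^{2+(d/2)}\asymp\frac{\sigma^{2}L^{d/2}}{n}
\iff\epsilon\asymp n^{-2/(d+4)},
```

at which point both terms are of order $n^{-4/(d+4)}$.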

We convert this bound to $\ell_{{\mathbb{P}}}^{2}$ via Lemma B.3:

{\mathbb{E}}\ell_{{\mathbb{P}}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right) \leq 8{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)+2{\mathbb{E}}\left(\ell_{{\mathbb{P}}}(h_{0},\hat{h}_{n}^{\mathbf{V}})-2\ell_{{\mathbb{P}}_{n}}(h_{0},\hat{h}_{n}^{\mathbf{V}})\right)^{2}
\leq 8{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)+2{\mathbb{E}}\sup_{f,g\in{\mathcal{C}}_{L}^{2L}(\Omega)}\left(\ell_{{\mathbb{P}}}(f,g)-2\ell_{{\mathbb{P}}_{n}}(f,g)\right)^{2}
\leq 8{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)+C_{d}n^{-4/(d+4)}L^{2d/(d+4)}+C\frac{L^{2}}{n}
\leq C_{d,L,\sigma}n^{-4/(d+4)}.

By the elementary inequality $(a+b)^{2}\geq a^{2}/2-b^{2}$, we get

\ell_{{\mathbb{P}}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)\geq\frac{1}{2}\ell_{{\mathbb{P}}}^{2}\left(f_{0},\mu+\hat{h}_{n}^{\mathbf{V}}\right)-\left(\frac{f_{0}(X_{1})+\dots+f_{0}(X_{n})}{n}-\mu\right)^{2}

where $\mu:={\mathbb{E}}f_{0}(X_{1})$. Thus

{\mathbb{E}}\ell_{{\mathbb{P}}}^{2}\left(f_{0},\mu+\hat{h}_{n}^{\mathbf{V}}\right) \leq 2{\mathbb{E}}\ell_{{\mathbb{P}}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)+2{\mathbb{E}}\left(\frac{f_{0}(X_{1})+\dots+f_{0}(X_{n})}{n}-\mu\right)^{2}
\leq C_{d,L,\sigma}n^{-4/(d+4)}+\frac{1}{n}{\mathbb{E}}\left(f_{0}(X_{1})-f_{0}(X_{2})\right)^{2}
\leq C_{d,L,\sigma}n^{-4/(d+4)}+\frac{4L^{2}}{n}\leq C_{d,L,\sigma}n^{-4/(d+4)}.
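The elementary inequality $(a+b)^{2}\geq a^{2}/2-b^{2}$ used above is just a completed square: for all real $a,b$,

```latex
(a+b)^{2}-\frac{a^{2}}{2}+b^{2}=\frac{a^{2}}{2}+2ab+2b^{2}=\left(\frac{a}{\sqrt{2}}+\sqrt{2}\,b\right)^{2}\geq 0.
```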

Finally

{\mathbb{E}}\ell_{{\mathbb{P}}}^{2}\left(f_{0},\bar{Y}+\hat{h}_{n}^{\mathbf{V}}\right) \leq 2{\mathbb{E}}\ell_{{\mathbb{P}}}^{2}\left(f_{0},\mu+\hat{h}_{n}^{\mathbf{V}}\right)+2{\mathbb{E}}\left(\bar{Y}-\mu\right)^{2}
\leq C_{d,L,\sigma}n^{-4/(d+4)}+\frac{2\sigma^{2}}{n}\leq C_{d,L,\sigma}n^{-4/(d+4)}.

This completes the proof of Proposition 2.1. ∎

We next turn to the proof of Proposition 2.2, which gives an upper bound on the minimax rate for the class ${\cal F}^{{\mathfrak{L}}}(\Omega)$ in the fixed design setting. This proof is similar to that of Proposition 2.1 but somewhat simpler as, in the fixed design setting, we do not need to switch between the two losses $\ell^{2}_{{\mathbb{P}}_{n}}$ and $\ell^{2}_{{\mathbb{P}}}$. The metric entropy result stated in Theorem 4.11 is an important ingredient of the following proof.

Proof of Proposition 2.2.

Let us describe the estimator that we work with. Let the linear regression solution be given by the affine function $\hat{g}_{n}$:

\hat{g}_{n}\in\mathop{\rm argmin}_{g\in{\mathcal{A}}(\Omega)}\sum_{i=1}^{n}\left(Y_{i}-g(X_{i})\right)^{2}.

Let $T_{i}:=Y_{i}-\hat{g}_{n}(X_{i})$ for $i=1,\dots,n$ be the residuals after linear regression. For a finite class $\mathbf{V}$ of functions on $\Omega$, let $\hat{h}_{n}^{\mathbf{V}}$ denote any least squares estimator over $\mathbf{V}$ for the data $(X_{1},T_{1}),\dots,(X_{n},T_{n})$:

\hat{h}_{n}^{\mathbf{V}}\in\mathop{\rm argmin}_{h\in\mathbf{V}}\sum_{i=1}^{n}\left(T_{i}-h(X_{i})\right)^{2}.

We work with the estimator $\hat{g}_{n}+\hat{h}_{n}^{\mathbf{V}}$ for an appropriate choice of $\mathbf{V}$, and bound its supremum risk over the class ${\cal F}^{{\mathfrak{L}}}(\Omega)$. Fix a “true” function $f_{0}\in{\cal F}^{{\mathfrak{L}}}(\Omega)$. Let $g_{0}$ be the closest affine function to $f_{0}$ in the $\ell_{{\mathbb{P}}_{n}}$ metric:

g_{0}=\mathop{\rm argmin}_{g\in{\mathcal{A}}(\Omega)}\sum_{i=1}^{n}\left(f_{0}(X_{i})-g(X_{i})\right)^{2}.
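For concreteness, the two-stage procedure just described (affine least squares followed by least squares over a finite class $\mathbf{V}$) can be sketched as follows. This is only an illustration: the name `two_stage_estimator` and the toy class `V` in the usage example are our own, and in the proof $\mathbf{V}$ is an $\epsilon$-net of a class of convex functions rather than an arbitrary list.

```python
import numpy as np

def two_stage_estimator(X, Y, V):
    """Sketch of g_hat_n + h_hat_n^V: affine least squares on (X, Y),
    then least squares over the finite class V on the residuals."""
    n = len(X)
    A = np.hstack([np.ones((n, 1)), X])            # affine design matrix [1, X]
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)   # coefficients of g_hat_n
    def g_hat(Z):
        return np.hstack([np.ones((len(Z), 1)), Z]) @ coef
    T = Y - g_hat(X)                               # residuals after linear regression
    # least squares estimator over the finite class V for the data (X, T)
    h_hat = min(V, key=lambda h: np.sum((T - h(X)) ** 2))
    return lambda Z: g_hat(Z) + h_hat(Z)
```

On exactly affine data the residuals vanish, so the procedure should select the member of `V` closest to the zero function.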

A key role in the sequel will be played by the convex function $h_{0}:=f_{0}-g_{0}$ (note that $h_{0}$ is convex because $f_{0}$ is convex and $g_{0}$ is affine). Fix a nonnegative valued functional $H$ and start with the inequality (56):

H(\hat{h}_{n}^{\mathbf{V}})\leq\sum_{h\in\mathbf{V}}H(h)\exp\left(\frac{1}{8\sigma^{2}}\sum_{i=1}^{n}(T_{i}-h^{\prime}(X_{i}))^{2}-\frac{1}{8\sigma^{2}}\sum_{i=1}^{n}(T_{i}-h(X_{i}))^{2}\right)

for every $h^{\prime}\in\mathbf{V}$. Now take expectations on both sides above (note that we are working in fixed design so $X_{1},\dots,X_{n}$ are fixed grid points in $\Omega$). To calculate the expectation of the right hand side, observe that $(T_{1},\dots,T_{n})$ is multivariate normal with mean vector $(h_{0}(X_{1}),\dots,h_{0}(X_{n}))$ and a covariance matrix $\Sigma$ which is dominated by $\sigma^{2}I_{n}$ in the positive semi-definite ordering. As in the proof of Proposition 2.1, we deduce

{\mathbb{E}}H(\hat{h}_{n}^{\mathbf{V}})\leq\exp\left(\frac{3n}{16\sigma^{2}}\inf_{h^{\prime}\in\mathbf{V}}\ell^{2}_{{\mathbb{P}}_{n}}\left(h_{0},h^{\prime}\right)\right)\sum_{h\in\mathbf{V}}H(h)\exp\left(-\frac{n}{16\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},h\right)\right).

Taking

H(h):=\exp\left(\frac{n}{16\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}(h_{0},h)\right)

and using Jensen’s inequality (as in the proof of Proposition 2.1), we obtain

{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)\leq 3\inf_{h^{\prime}\in\mathbf{V}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},h^{\prime}\right)+\frac{16\sigma^{2}}{n}\log|\mathbf{V}|.

We now specify the finite function class $\mathbf{V}$. For this, note that because $f_{0}\in{\cal F}^{{\mathfrak{L}}}(\Omega)$ (i.e., $\inf_{g\in{\mathcal{A}}(\Omega)}\ell_{{\mathbb{P}}_{n}}(f_{0},g)\leq{\mathfrak{L}}$), the function $h_{0}=f_{0}-g_{0}$ belongs to the class

\mathcal{H}:=\left\{h\in{\mathcal{C}}(\Omega):\frac{1}{n}\sum_{i=1}^{n}h^{2}(X_{i})\leq{\mathfrak{L}}^{2}\right\}.

Theorem 4.11 (with $p=2$ and $t={\mathfrak{L}}$) gives an upper bound for the $\epsilon$-metric entropy of the above function class under the $\ell_{{\mathbb{P}}_{n}}$ metric. This result implies the existence of a finite set $\mathbf{V}$ satisfying the twin properties:

\sup_{h_{0}\in\mathcal{H}}\inf_{h^{\prime}\in\mathbf{V}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},h^{\prime}\right)\leq\epsilon^{2}\quad\text{ and }\quad\log|\mathbf{V}|\leq\left(c_{d}\log n\right)^{F}\left(\frac{{\mathfrak{L}}}{\epsilon}\right)^{d/2}.

As a result,

{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)\leq 3\epsilon^{2}+\frac{16\sigma^{2}}{n}\left(c_{d}\log n\right)^{F}\left(\frac{{\mathfrak{L}}}{\epsilon}\right)^{d/2}.

The choice $\epsilon=n^{-2/(d+4)}(c_{d}\log n)^{2F/(d+4)}$ gives

{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}_{n}^{\mathbf{V}}\right)\leq C_{{\mathfrak{L}},\sigma}n^{-4/(d+4)}(c_{d}\log n)^{4F/(d+4)}.

Finally,

{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(f_{0},\hat{g}_{n}+\hat{h}_{n}^{\mathbf{V}}\right) ={\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(g_{0}+h_{0},\hat{g}_{n}+\hat{h}_{n}^{\mathbf{V}}\right)
\leq 2{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(g_{0},\hat{g}_{n}\right)+2{\mathbb{E}}\ell_{{\mathbb{P}}_{n}}^{2}\left(h_{0},\hat{h}^{\mathbf{V}}_{n}\right)
\leq\frac{C_{d}\sigma^{2}}{n}+C_{{\mathfrak{L}},\sigma}n^{-4/(d+4)}(c_{d}\log n)^{4F/(d+4)}
\leq C_{{\mathfrak{L}},\sigma}n^{-4/(d+4)}(c_{d}\log n)^{4F/(d+4)}

completing the proof of Proposition 2.2. ∎

Next we give the proof of Proposition 2.3. This proof is based on the use of Assouad’s lemma with a standard construction.

Proof of Proposition 2.3.

Define the smooth function:

g(x_{1},x_{2},\ldots,x_{d})=\left\{\begin{array}[]{ll}{\sum_{i=1}^{d}\cos^{3}(\pi x_{i})}&{(x_{1},x_{2},\ldots,x_{d})\in[-1/2,1/2]^{d}}\\ {0}&{(x_{1},x_{2},\ldots,x_{d})\notin[-1/2,1/2]^{d}.}\end{array}\right.

Note that $\frac{\partial^{2}g}{\partial x_{i}\partial x_{j}}=0$ for $i\neq j$ and

\left|\frac{\partial^{2}g}{\partial x_{i}^{2}}(x_{1},\dots,x_{d})\right|\leq\frac{4\sqrt{2}}{3}\pi^{2}

which implies that the Hessian of $g$ is dominated by $(4\sqrt{2}\pi^{2}/3)$ times the identity matrix. It is also easy to check that the Hessian of $g$ equals zero on the boundary of $[-0.5,0.5]^{d}$.

Fix $\eta>0$ and let $\mathcal{S}^{\eta}$ be the $\eta$-grid consisting of all points $(k_{1}\eta,\dots,k_{d}\eta)$ for $k_{1},\dots,k_{d}\in\mathbb{Z}$. Let

T^{\eta}:=\left\{(u_{1},\dots,u_{d})\in\mathcal{S}^{\eta}:\prod_{j=1}^{d}\left[u_{j}-0.5\eta,u_{j}+0.5\eta\right]\subseteq\Omega\right\}.

Assume that $\eta$ is small enough so that the cardinality $m:=|T^{\eta}|$ of $T^{\eta}$ is at least $c_{d}\eta^{-d}$. For every point $u\in T^{\eta}$, let

g_{u}(x_{1},\dots,x_{d}):=\eta^{2}g\left(\frac{x_{1}-u_{1}}{\eta},\dots,\frac{x_{d}-u_{d}}{\eta}\right).

Clearly $g_{u}$ is supported on the cube $\prod_{j=1}^{d}[u_{j}-0.5\eta,u_{j}+0.5\eta]$ and these cubes for different points $u\in T^{\eta}$ have disjoint interiors.

For each binary vector $\xi=(\xi_{u},u\in T^{\eta})\in\{0,1\}^{m}$, consider the function

G_{\xi}(x)=f_{0}(x)+\frac{3}{4\sqrt{2}\pi^{2}}\sum_{u\in T^{\eta}}\xi_{u}g_{u}(x) (57)

where $f_{0}(x):=\|x\|^{2}$. It can be verified that $G_{\xi}$ is convex because $f_{0}$ has constant Hessian equal to $2$ times the identity, the Hessian of each $g_{u}$ is bounded by $(4\sqrt{2}\pi^{2}/3)$ times the identity, and the supports of $g_{u}$, $u\in T^{\eta}$, have disjoint interiors. For a sufficiently small constant $\eta$ and a sufficiently large constant $L$, it is also clear that $G_{\xi}\in{\mathcal{C}}_{L}^{L}(\Omega)$. We use Assouad’s lemma with this collection $\{G_{\xi},\xi\in\{0,1\}^{m}\}$:

\mathfrak{R}_{n}^{\mathrm{fixed}}({\mathcal{C}}^{L}_{L}(\Omega))\geq\frac{m}{8}\min_{\xi\neq\xi^{\prime}}\frac{\ell_{{\mathbb{P}}_{n}}^{2}(G_{\xi},G_{\xi^{\prime}})}{\Upsilon(\xi,\xi^{\prime})}\min_{\Upsilon(\xi,\xi^{\prime})=1}\left(1-\|{\mathbb{P}}_{G_{\xi}}-{\mathbb{P}}_{G_{\xi^{\prime}}}\|_{\text{TV}}\right)

where $\Upsilon(\xi,\xi^{\prime}):=\sum_{u\in T^{\eta}}I\{\xi_{u}\neq\xi^{\prime}_{u}\}$ is the Hamming distance, and ${\mathbb{P}}_{f}$ is the multivariate normal distribution with mean vector $(f(X_{1}),\dots,f(X_{n}))$ and covariance matrix $\sigma^{2}I_{n}$. We bound $\ell_{{\mathbb{P}}_{n}}^{2}(G_{\xi},G_{\xi^{\prime}})$ from below as

\ell_{{\mathbb{P}}_{n}}^{2}(G_{\xi},G_{\xi^{\prime}}) =\frac{1}{n}\sum_{i=1}^{n}\left(G_{\xi}(X_{i})-G_{\xi^{\prime}}(X_{i})\right)^{2}
=\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{3}{4\sqrt{2}\pi^{2}}\sum_{u\in T^{\eta}}\left(\xi_{u}-\xi^{\prime}_{u}\right)g_{u}(X_{i})\right\}^{2}
=\frac{9}{32\pi^{4}n}\sum_{i=1}^{n}\sum_{u\in T^{\eta}}I\{\xi_{u}\neq\xi^{\prime}_{u}\}g_{u}^{2}(X_{i})
=\frac{9}{32\pi^{4}}\sum_{u\in T^{\eta}}I\{\xi_{u}\neq\xi^{\prime}_{u}\}\frac{1}{n}\sum_{i=1}^{n}g_{u}^{2}(X_{i}).

To bound $\frac{1}{n}\sum_{i=1}^{n}g_{u}^{2}(X_{i})$ from below, note that $g_{u}$ is supported on the cube $\prod_{j=1}^{d}[u_{j}-0.5\eta,u_{j}+0.5\eta]$ which has volume $\eta^{d}$, and also that the “magnitude” of $g_{u}$ on this cube is of order $\eta^{2}$ (equivalently, the squared magnitude is of order $\eta^{4}$). Thus when $\eta\geq c\delta$ for some constant $c$, it is straightforward to check that

\frac{1}{n}\sum_{i=1}^{n}g_{u}^{2}(X_{i})\geq C_{d}\eta^{d}\times\eta^{4}=C_{d}\eta^{d+4}.
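As a numerical sanity check of this $\eta^{d+4}$ scaling, one can take $d=1$, an equally spaced grid design on $[0,1]$, and a bump centered at $u=1/2$; the grid size, center, and tolerance below are our own choices. The limiting constant is $\int_{-1/2}^{1/2}\cos^{6}(\pi t)\,dt=5/16$.

```python
import numpy as np

def g(t):
    """One-dimensional bump: cos^3(pi t) on [-1/2, 1/2], zero outside."""
    return np.where(np.abs(t) <= 0.5, np.cos(np.pi * t) ** 3, 0.0)

def mean_gu_sq(eta, n=100_000, u=0.5):
    """(1/n) * sum_i g_u(X_i)^2 over an equally spaced grid in [0, 1] (d = 1)."""
    X = (np.arange(n) + 0.5) / n          # fixed grid design, spacing delta = 1/n
    gu = eta ** 2 * g((X - u) / eta)      # scaled bump g_u centered at u
    return np.mean(gu ** 2)

# The average scales like eta^(d+4) = eta^5, with limiting constant 5/16 = 0.3125.
for eta in [0.2, 0.1, 0.05]:
    assert abs(mean_gu_sq(eta) / eta ** 5 - 5 / 16) < 0.01
```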

This gives

\min_{\xi\neq\xi^{\prime}}\frac{\ell_{{\mathbb{P}}_{n}}^{2}(G_{\xi},G_{\xi^{\prime}})}{\Upsilon(\xi,\xi^{\prime})}\geq C_{d}\eta^{d+4}.

For bounding $\|{\mathbb{P}}_{G_{\xi}}-{\mathbb{P}}_{G_{\xi^{\prime}}}\|_{\text{TV}}$, we use Pinsker’s inequality to get

\|{\mathbb{P}}_{G_{\xi}}-{\mathbb{P}}_{G_{\xi^{\prime}}}\|^{2}_{\text{TV}} \leq\frac{1}{2}D\left({\mathbb{P}}_{G_{\xi}}\mathrel{\|}{\mathbb{P}}_{G_{\xi^{\prime}}}\right)
=\frac{n}{4\sigma^{2}}\ell_{{\mathbb{P}}_{n}}^{2}\left(G_{\xi},G_{\xi^{\prime}}\right)
=\frac{9n}{128\sigma^{2}\pi^{4}}\sum_{u\in T^{\eta}}I\{\xi_{u}\neq\xi^{\prime}_{u}\}\frac{1}{n}\sum_{i=1}^{n}g_{u}^{2}(X_{i})
\leq\tilde{C}_{d}\frac{n\eta^{d+4}}{\sigma^{2}}\Upsilon(\xi,\xi^{\prime}).

Assouad’s lemma (with $m\geq c_{d}\eta^{-d}$) thus gives

\mathfrak{R}_{n}^{\mathrm{fixed}}({\mathcal{C}}^{L}_{L}(\Omega))\geq c_{d}\eta^{-d}C_{d}\eta^{d+4}\left(1-\sqrt{\tilde{C}_{d}\frac{n\eta^{d+4}}{\sigma^{2}}}\right).

The choice

\eta=\left(\frac{\sigma^{2}}{4n\tilde{C}_{d}}\right)^{1/(d+4)}

leads to

\mathfrak{R}_{n}^{\mathrm{fixed}}({\mathcal{C}}^{L}_{L}(\Omega))\geq c_{d,\sigma}n^{-4/(d+4)}.
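Indeed, with this choice of $\eta$,

```latex
\sqrt{\tilde{C}_{d}\frac{n\eta^{d+4}}{\sigma^{2}}}=\sqrt{\frac{1}{4}}=\frac{1}{2}
\qquad\text{and}\qquad
\eta^{-d}\eta^{d+4}=\eta^{4}=\left(\frac{\sigma^{2}}{4\tilde{C}_{d}}\right)^{4/(d+4)}n^{-4/(d+4)},
```

so the right hand side of the penultimate display equals $\frac{1}{2}c_{d}C_{d}\eta^{4}\geq c_{d,\sigma}n^{-4/(d+4)}$.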

Appendix B Proofs of the random design LSE rate lower bounds

This section contains the proof of Theorem 3.1. We follow the sketch given in Subsection 4.3 and results quoted in that subsection are also proved here. The first three subsections below contain the proofs of Lemma 4.2, Lemma 4.3 and Lemma 4.4 respectively. Using these three lemmas, the proof of Theorem 3.1 is completed in Subsection B.4. An important role is played by the main bracketing entropy bound stated in Theorem 4.5 and this theorem is proved in Subsection B.5. Finally, Subsection B.6 contains the proofs of Lemma 3.2 and Lemma 4.6.

B.1 Proof of Lemma 4.2

The most important ingredient for the proof of Lemma 4.2 is the bracketing entropy bound given by Theorem 4.5 which is proved in Subsection B.5. We also use the following two standard results from the theory of empirical processes and Gaussian processes respectively.

Lemma B.1.

[47, Theorem 5.11] Let ${\mathbb{Q}}$ be a probability measure on a set $\mathcal{Z}$ and let ${\mathbb{Q}}_{n}$ be the empirical distribution of $Z_{1},\dots,Z_{n}\overset{\text{i.i.d.}}{\sim}{\mathbb{Q}}$. Let $\mathcal{H}$ be a class of real-valued functions on $\mathcal{Z}$ and assume that each function in $\mathcal{H}$ is uniformly bounded by $\Gamma>0$. Then

{\mathbb{E}}\sup_{h\in\mathcal{H}}|{\mathbb{Q}}_{n}h-{\mathbb{Q}}h| \leq C\inf\left\{a\geq\frac{\Gamma}{\sqrt{n}}:a\geq\frac{C}{\sqrt{n}}\int_{a}^{\Gamma}\sqrt{\log N_{[\,]}(u,\mathcal{H},L_{2}({\mathbb{Q}}))}\,du\right\}. (58)

The following result is Dudley’s entropy integral bound [16].

Theorem B.2 (Dudley).

Let $\xi_{1},\dots,\xi_{n}\overset{\text{i.i.d.}}{\sim}N(0,\sigma^{2})$. Then for every deterministic $X_{1},\dots,X_{n}$, every class of functions ${\cal F}$, every $f\in{\cal F}$ and every $t\geq 0$:

{\mathbb{E}}\sup_{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(f,g)\leq t}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-f(X_{i})\right) \leq\sigma\inf_{0<\theta\leq t/2}\left(\frac{12}{\sqrt{n}}\int_{\theta}^{t/2}\sqrt{\log N(\epsilon,\left\{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(f,g)\leq t\right\},\ell_{{\mathbb{P}}_{n}})}\,d\epsilon+2\theta\right).

In the course of proving Lemma 4.2, we will need to work with both the loss functions $\ell_{{\mathbb{P}}}$ and $\ell_{{\mathbb{P}}_{n}}$. The next result states that these loss functions are, up to a factor of $2$, sufficiently close (within an error of order $n^{-2/(d+4)}$, which is much smaller than $n^{-1/d}$ for $d\geq 5$), allowing us to switch between them.

Lemma B.3.

Suppose $\Omega$ is a convex body satisfying (3). Then there exist positive constants $C$ and $C_{d}$ such that

{\mathbb{P}}\left\{\sup_{f,g\in{\mathcal{C}}^{B}_{L}(\Omega)}\left(\ell_{{\mathbb{P}}_{n}}(f,g)-2\ell_{{\mathbb{P}}}(f,g)\right)\leq C_{d}n^{-2/(d+4)}(B+L)^{d/(d+4)}\right\} \geq 1-C\exp\left(-\frac{n^{d/(d+4)}}{C_{d}B^{8/(d+4)}}\right),

and

{\mathbb{P}}\left\{\sup_{f,g\in{\mathcal{C}}^{B}_{L}(\Omega)}\left(\ell_{{\mathbb{P}}}(f,g)-2\ell_{{\mathbb{P}}_{n}}(f,g)\right)\leq C_{d}n^{-2/(d+4)}(B+L)^{d/(d+4)}\right\} \geq 1-C\exp\left(-\frac{n^{d/(d+4)}}{C_{d}B^{8/(d+4)}}\right),

and

{\mathbb{E}}\sup_{f,g\in{\mathcal{C}}_{L}^{B}(\Omega)}\left(\ell_{{\mathbb{P}}}(f,g)-2\ell_{{\mathbb{P}}_{n}}(f,g)\right)^{2}\leq C_{d}n^{-4/(d+4)}L^{2d/(d+4)}+\frac{CL^{2}}{n},

and

{\mathbb{E}}\sup_{f,g\in{\mathcal{C}}_{L}^{B}(\Omega)}\left(\ell_{{\mathbb{P}}_{n}}(f,g)-2\ell_{{\mathbb{P}}}(f,g)\right)^{2}\leq C_{d}n^{-4/(d+4)}L^{2d/(d+4)}+\frac{CL^{2}}{n}.

Additionally, when $\Omega$ is a polytope whose number of facets is bounded by a constant depending on $d$ alone, all these inequalities continue to hold if ${\mathcal{C}}^{B}_{L}(\Omega)$ is replaced by the larger class ${\mathcal{C}}^{B}(\Omega)$, with $(B+L)$ replaced by $B$.

Proof of Lemma B.3.

We use the following result, which can be found in [47, Proof of Lemma 5.16]: Suppose ${\cal F}$ is a class of functions on $\Omega$ that are uniformly bounded by $\Gamma>0$. Then there exists a positive constant $C$ such that

{\mathbb{P}}\left\{\sup_{f,g\in{\cal F}}\left(\ell_{{\mathbb{P}}_{n}}(f,g)-2\ell_{{\mathbb{P}}}(f,g)\right)>Ca\right\}\leq C\exp\left(-\frac{na^{2}}{C\Gamma^{2}}\right) (59)

and

{\mathbb{P}}\left\{\sup_{f,g\in{\cal F}}\left(\ell_{{\mathbb{P}}}(f,g)-2\ell_{{\mathbb{P}}_{n}}(f,g)\right)>Ca\right\}\leq C\exp\left(-\frac{na^{2}}{C\Gamma^{2}}\right) (60)

provided

na^{2}\geq C\log N_{[\,]}(a,{\cal F},L_{2}({\mathbb{P}})). (61)

We first apply this result to ${\cal F}={\mathcal{C}}_{L}^{B}(\Omega)$ with $\Gamma=B$. From [9], we get

\log N_{[\,]}(\epsilon,{\mathcal{C}}_{L}^{B}(\Omega),\ell_{{\mathbb{P}}})\leq C_{d}\left(\frac{B+L}{\epsilon}\right)^{d/2}. (62)

The above bound implies that (61) is satisfied when $a$ equals $n^{-2/(d+4)}(B+L)^{d/(d+4)}$ multiplied by a large enough dimensional constant. Inequalities (59) and (60) then lead to the two stated probability inequalities in Lemma B.3. The expectation bounds are obtained by integrating the probability inequalities.
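For instance, writing $Z:=\sup_{f,g}\left(\ell_{{\mathbb{P}}}(f,g)-2\ell_{{\mathbb{P}}_{n}}(f,g)\right)$ and $a_{n}:=n^{-2/(d+4)}(B+L)^{d/(d+4)}$, integrating the tail bound (60) gives (this is a sketch, tracking only the form of the bound and not the exact constants):

```latex
{\mathbb{E}}Z_{+}^{2}=\int_{0}^{\infty}{\mathbb{P}}\{Z>\sqrt{t}\}\,dt
\leq C^{2}a_{n}^{2}+\int_{C^{2}a_{n}^{2}}^{\infty}C\exp\left(-\frac{nt}{C^{3}\Gamma^{2}}\right)dt
\leq C^{2}a_{n}^{2}+\frac{C^{4}\Gamma^{2}}{n},
```

which matches the stated expectation bounds up to the values of the constants.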

To prove the bounds for ${\mathcal{C}}^{B}(\Omega)$ for polytopal $\Omega$, just use, instead of the result from [9], the following bound due to [19, Theorem 1.5]:

\log N_{[\,]}(\epsilon,{\mathcal{C}}^{B}(\Omega),\ell_{{\mathbb{P}}})\leq C_{d}\left(\frac{B}{\epsilon}\right)^{d/2}. (63)

We are now ready to give the proof of Lemma 4.2.

Proof of Lemma 4.2.

Throughout we take $B$ and $L$ to be constants depending on the dimension alone (so we can absorb them into a generic $C_{d}$). Lemma 4.2 has two parts: (37) and (38). We prove (38) first and then indicate the changes necessary for (37). So assume first that $\Omega$ is a polytope whose number of facets is bounded by a constant depending on $d$ alone, and take ${\cal F}={\mathcal{C}}^{B}(\Omega)$.

With ${\mathfrak{B}}_{{\mathbb{P}}}^{{\cal F}}(f,t):=\left\{g\in{\cal F}:\ell_{{\mathbb{P}}}(f,g)\leq t\right\}$ and ${\mathfrak{B}}_{{\mathbb{P}}_{n}}^{{\cal F}}(f,t):=\left\{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(f,g)\leq t\right\}$, Lemma B.3 gives

{\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}_{n}}(\tilde{f}_{k},t)\subseteq{\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}}(\tilde{f}_{k},2t+C_{d}n^{-2/(d+4)})

with probability at least

1-C\exp\left(-\frac{n^{d/(d+4)}}{C_{d}}\right). (64)

Thus for

t\geq C_{d}n^{-2/(d+4)}, (65)

we get

{\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}_{n}}(\tilde{f}_{k},t)\subseteq{\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}}(\tilde{f}_{k},3t),

and consequently $G_{\tilde{f}_{k}}(t,{\mathcal{C}}^{B}(\Omega))\leq\mathfrak{G}_{\tilde{f}_{k}}(3t)$, where

\mathfrak{G}_{\tilde{f}_{k}}(t):={\mathbb{E}}\left[\sup_{g\in{\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}}(\tilde{f}_{k},t)}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-\tilde{f}_{k}(X_{i})\right)\bigg{|}X_{1},\dots,X_{n}\right]

holds with probability at least (64). By concentration of measure, the above conditional expectation will be close to the corresponding unconditional expectation because

T(x_{1},\dots,x_{n}) :={\mathbb{E}}\left[\sup_{g\in{\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}}(\tilde{f}_{k},3t)}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-\tilde{f}_{k}(X_{i})\right)\bigg{|}X_{1}=x_{1},\dots,X_{n}=x_{n}\right]

satisfies the bounded differences condition:

\left|T(x_{1},\dots,x_{n})-T(x_{1}^{\prime},\dots,x_{n}^{\prime})\right|\leq\frac{2B\sigma}{n}\sum_{i=1}^{n}I\{x_{i}\neq x_{i}^{\prime}\}

and the bounded differences concentration inequality consequently gives

{\mathbb{P}}\left\{\mathfrak{G}_{\tilde{f}_{k}}(3t)\leq{\mathbb{E}}\mathfrak{G}_{\tilde{f}_{k}}(3t)+x\right\}\geq 1-\exp\left(\frac{-nx^{2}}{2B^{2}\sigma^{2}}\right) (66)

for every $x>0$. We next control

{\mathbb{E}}\mathfrak{G}_{\tilde{f}_{k}}(3t)={\mathbb{E}}\left[\sup_{g\in{\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}}(\tilde{f}_{k},3t)}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-\tilde{f}_{k}(X_{i})\right)\right]

where the expectation on the left hand side is with respect to $X_{1},\dots,X_{n}$ while the expectation on the right hand side is with respect to all variables $\xi_{1},\dots,\xi_{n},X_{1},\dots,X_{n}$. Clearly

{\mathbb{E}}\mathfrak{G}_{\tilde{f}_{k}}(3t)={\mathbb{E}}\sup_{h\in\mathcal{H}}\left(\mathbb{Q}_{n}h-\mathbb{Q}h\right) (67)

where $\mathcal{H}$ consists of all functions of the form $(\xi,x)\mapsto\xi\left(g(x)-\tilde{f}_{k}(x)\right)$ as $g$ varies over ${\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}}(\tilde{f}_{k},3t)$, $\mathbb{Q}_{n}$ is the empirical measure corresponding to $(\xi_{i},X_{i}),i=1,\dots,n$, and $\mathbb{Q}$ is the distribution of $(\xi,X)$ where $\xi$ and $X$ are independent with $\xi\sim N(0,\sigma^{2})$ and $X\sim{\mathbb{P}}$.

We now use the bound (58), which requires us to control $N_{[\,]}(\epsilon,\mathcal{H},L_{2}({\mathbb{Q}}))$. This is done by Theorem 4.5 which states that

\log N_{[\,]}(\epsilon,{\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}}(\tilde{f}_{k},t),L_{2}({\mathbb{P}}))\leq C_{d}k\left(\log\frac{C_{d}B}{\epsilon}\right)^{d+1}\left(\frac{t}{\epsilon}\right)^{d/2}. (68)

Theorem 4.5 is stated under the unnormalized integral constraint $\int_{\Omega}(f-\tilde{f}_{k})^{2}\leq t^{2}$ and for bracketing numbers under the unnormalized Lebesgue measure, but this implies (68) as the volume of $\Omega$ is assumed to be bounded on both sides by dimensional constants. We now claim that

N_{[\,]}(\epsilon,\mathcal{H},L_{2}({\mathbb{Q}}))\leq N_{[\,]}(\epsilon\sigma^{-1},{\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}}(\tilde{f}_{k},3t),L_{2}({\mathbb{P}})). (69)

Inequality (69) is true because of the following. Let $\{[g_{L},g_{U}],g\in G\}$ be a set of covering brackets for the set ${\mathfrak{B}}^{{\mathcal{C}}^{B}(\Omega)}_{{\mathbb{P}}}(\tilde{f}_{k},3t)$. For each bracket $[g_{L},g_{U}]$, we associate a corresponding bracket $[h_{L},h_{U}]$ for $\mathcal{H}$ as follows:

h_{L}(\xi,x):=\xi\left(g_{L}(x)-\tilde{f}_{k}(x)\right)I\{\xi\geq 0\}+\xi\left(g_{U}(x)-\tilde{f}_{k}(x)\right)I\{\xi<0\}

and

h_{U}(\xi,x):=\xi\left(g_{U}(x)-\tilde{f}_{k}(x)\right)I\{\xi\geq 0\}+\xi\left(g_{L}(x)-\tilde{f}_{k}(x)\right)I\{\xi<0\}.
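The bracket construction above can be checked numerically; the sketch below uses arbitrary stand-in functions for $g_{L}\leq g\leq g_{U}$ and $\tilde{f}_{k}$ (all names are our own).

```python
import numpy as np

def bracket(xi, x, g_lower, g_upper, f_tilde):
    """Bracket [h_L, h_U] for h_g(xi, x) = xi * (g(x) - f_tilde(x)), built from
    a bracket [g_lower, g_upper] for g: the roles of g_lower and g_upper swap
    according to the sign of xi, exactly as in the displayed definitions."""
    lo = xi * (g_lower(x) - f_tilde(x))
    hi = xi * (g_upper(x) - f_tilde(x))
    h_L = np.where(xi >= 0, lo, hi)
    h_U = np.where(xi >= 0, hi, lo)
    return h_L, h_U

rng = np.random.default_rng(0)
xi = rng.normal(size=1000)                 # Gaussian errors
x = rng.uniform(-1.0, 1.0, size=1000)      # design points
g_lower = lambda z: z ** 2 - 0.5           # arbitrary stand-ins with
g = lambda z: z ** 2                       # g_lower <= g <= g_upper pointwise
g_upper = lambda z: z ** 2 + 0.5
f_tilde = lambda z: np.abs(z)

h_L, h_U = bracket(xi, x, g_lower, g_upper, f_tilde)
h_g = xi * (g(x) - f_tilde(x))
assert np.all(h_L <= h_g) and np.all(h_g <= h_U)
# width identity: h_U - h_L = |xi| * (g_upper - g_lower)
assert np.allclose(h_U - h_L, np.abs(xi) * (g_upper(x) - g_lower(x)))
```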

It is now easy to check that whenever $g_{L}\leq g\leq g_{U}$, we have $h_{L}\leq h_{g}\leq h_{U}$ where $h_{g}(\xi,x)=\xi\left(g(x)-\tilde{f}_{k}(x)\right)$. Further, $h_{U}-h_{L}=|\xi|\left(g_{U}-g_{L}\right)$ and thus ${\mathbb{Q}}\left(h_{U}-h_{L}\right)^{2}=\sigma^{2}{\mathbb{P}}\left(g_{U}-g_{L}\right)^{2}$, which proves (69). Inequality (68) then gives that for every $a\geq B/\sqrt{n}$, we have

\int_{a}^{B}\sqrt{\log N_{[\,]}(u,\mathcal{H},L_{2}({\mathbb{Q}}))}\,du
\leq C_{d}\sqrt{k}\int_{a}^{B}\left(\log\frac{C_{d}B\sigma}{u}\right)^{(d+1)/2}\left(\frac{t\sigma}{u}\right)^{d/4}du
\leq C_{d}\sqrt{k}(t\sigma)^{d/4}\left(\log\frac{C_{d}B\sigma}{a}\right)^{(d+1)/2}\int_{a}^{\infty}u^{-d/4}\,du
\leq C_{d}\sqrt{k}(t\sigma)^{d/4}\left(\log\frac{C_{d}B\sigma}{a}\right)^{(d+1)/2}a^{1-(d/4)}
\leq C_{d}\sqrt{k}(t\sigma)^{d/4}\left(\log(C_{d}\sigma\sqrt{n})\right)^{(d+1)/2}a^{1-(d/4)}

where, in the last inequality, we used $a\geq B/\sqrt{n}$. The inequality

a\geq Cn^{-1/2}\int_{a}^{B}\sqrt{\log N_{[\,]}(u,\mathcal{H},L_{2}({\mathbb{Q}}))}\,du

will therefore be satisfied for

a\geq C_{d}t\sigma\left(\frac{k}{n}\right)^{2/d}\left(\log(C_{d}\sigma\sqrt{n})\right)^{2(d+1)/d}

for an appropriate constant $C_{d}$. The bound (58) then gives

{\mathbb{E}}\mathfrak{G}_{\tilde{f}_{k}}(3t)\leq C_{d}t\sigma\left(\frac{k}{n}\right)^{2/d}\left(\log(C_{d}\sigma\sqrt{n})\right)^{2(d+1)/d}+\frac{C_{d}}{\sqrt{n}}.
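The entropy integral computation above relies on $d\geq 5$, so that $d/4>1$ and the integral $\int_{a}^{\infty}u^{-d/4}\,du$ converges:

```latex
\int_{a}^{\infty}u^{-d/4}\,du=\frac{a^{1-(d/4)}}{(d/4)-1}\leq C_{d}\,a^{1-(d/4)}.
```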

Combining the above steps, we deduce that for every x>0x>0, the inequality

Gf~k(t,𝒞B(Ω))Cdtσ(kn)2/d(log(Cdσn))2(d+1)/d+Cdn+x\displaystyle G_{\tilde{f}_{k}}(t,{\mathcal{C}}^{B}(\Omega))\leq C_{d}t\sigma\left(\frac{k}{n}\right)^{2/d}\left(\log(C_{d}\sigma\sqrt{n})\right)^{2(d+1)/d}+\frac{C_{d}}{\sqrt{n}}+x

holds with probability at least

1Cexp(nd/(d+4)Cd)exp(nx2Cdσ2)\displaystyle 1-C\exp\left(-\frac{n^{d/(d+4)}}{C_{d}}\right)-\exp\left(\frac{-nx^{2}}{C_{d}\sigma^{2}}\right)

for every fixed tt satisfying (65). Inequality (38) is now deduced by taking

x=Cdtσ(kn)2/d(log(Cdσn))2(d+1)/dx=C_{d}t\sigma\left(\frac{k}{n}\right)^{2/d}\left(\log(C_{d}\sigma\sqrt{n})\right)^{2(d+1)/d}

Let us now get to (37). Here Ω\Omega is not necessarily a polytope and {\cal F} is either 𝒞LL(Ω){\mathcal{C}}_{L}^{L}(\Omega) or 𝒞L(Ω){\mathcal{C}}_{L}(\Omega). Let us first argue that it is enough to prove (37) when =𝒞L4L(Ω){\cal F}={\mathcal{C}}_{L}^{4L}(\Omega). This will obviously imply the same inequality for the smaller class 𝒞LL(Ω){\mathcal{C}}_{L}^{L}(\Omega). It also implies the same inequality for the larger class 𝒞L(Ω){\mathcal{C}}_{L}(\Omega) because of the following claim:

f𝒞L(Ω),f𝒞L4L(Ω)min(n(f,f~k),(f,f~k))>L.\displaystyle f\in{\mathcal{C}}_{L}(\Omega),f\notin{\mathcal{C}}_{L}^{4L}(\Omega)\implies\min\left(\ell_{{\mathbb{P}}_{n}}(f,\tilde{f}_{k}),\ell_{{\mathbb{P}}}(f,\tilde{f}_{k})\right)>L. (70)

The above claim immediately implies that

{\mathfrak{B}}_{{\mathbb{P}}_{n}}^{{\mathcal{C}}_{L}(\Omega)}(\tilde{f}_{k},t)={\mathfrak{B}}_{{\mathbb{P}}_{n}}^{{\mathcal{C}}^{4L}_{L}(\Omega)}(\tilde{f}_{k},t)\qquad\text{for all $t\leq L$},

which leads to

Gf~k(t,𝒞L4L(Ω))=Gf~k(t,𝒞L(Ω))for all tL.G_{\tilde{f}_{k}}(t,{\mathcal{C}}_{L}^{4L}(\Omega))=G_{\tilde{f}_{k}}(t,{\mathcal{C}}_{L}(\Omega))\qquad\text{for all $t\leq L$}.

To see (70), note that the assumptions $f\in{\mathcal{C}}_{L}(\Omega)$ and $f\notin{\mathcal{C}}_{L}^{4L}(\Omega)$ together imply that $|f(x)|>4L$ for some $x\in\Omega$. By the Lipschitz property of $f$, the fact that $\Omega$ has diameter at most 2, and the fact that $\tilde{f}_{k}$ is bounded by $L$, we have

|f(y)f~k(y)|\displaystyle|f(y)-\tilde{f}_{k}(y)| =|f(y)f(x)+f(x)f~k(y)|\displaystyle=|f(y)-f(x)+f(x)-\tilde{f}_{k}(y)|
|f(x)||f(y)f(x)||f~k(y)|\displaystyle\geq|f(x)|-|f(y)-f(x)|-|\tilde{f}_{k}(y)|
\displaystyle>4L-L\|x-y\|-L\geq 4L-2L-L=L

for every yΩy\in\Omega. This clearly implies that both n(f,f~k)\ell_{{\mathbb{P}}_{n}}(f,\tilde{f}_{k}) and (f,f~k)\ell_{{\mathbb{P}}}(f,\tilde{f}_{k}) are larger than LL which proves (70).

In the rest of the proof, we therefore assume that =𝒞L4L(Ω){\cal F}={\mathcal{C}}_{L}^{4L}(\Omega). We write

Gf~k(t,)Gf~kI(t,)+Gf~kII(t,)G_{\tilde{f}_{k}}(t,{\cal F})\leq G^{I}_{\tilde{f}_{k}}(t,{\cal F})+G^{II}_{\tilde{f}_{k}}(t,{\cal F})

where

G^{I}_{\tilde{f}_{k}}(t,{\cal F}):={\mathbb{E}}\left[\sup_{g\in{\mathfrak{B}}^{{\cal F}}_{{\mathbb{P}_{n}}}(\tilde{f}_{k},t)}\frac{1}{n}\sum_{i:X_{i}\in\cup_{j=1}^{m}\Delta_{j}}\xi_{i}\left(g(X_{i})-\tilde{f}_{k}(X_{i})\right)\bigg{|}X_{1},\dots,X_{n}\right]

and

G^{II}_{\tilde{f}_{k}}(t,{\cal F}):={\mathbb{E}}\left[\sup_{g\in{\mathfrak{B}}^{{\cal F}}_{{\mathbb{P}_{n}}}(\tilde{f}_{k},t)}\frac{1}{n}\sum_{i:X_{i}\notin\cup_{j=1}^{m}\Delta_{j}}\xi_{i}\left(g(X_{i})-\tilde{f}_{k}(X_{i})\right)\bigg{|}X_{1},\dots,X_{n}\right].

Here $\Delta_{1},\dots,\Delta_{m}$ are the $d$-simplices given by Lemma 3.2. We now bound $G^{I}_{\tilde{f}_{k}}(t,{\cal F})$ and $G^{II}_{\tilde{f}_{k}}(t,{\cal F})$ separately. The first term $G^{I}_{\tilde{f}_{k}}(t,{\cal F})$ can be shown to satisfy the bound in (37) by almost the same argument as the one used for (38). The only difference is that, instead of (68), we use:

N[](ϵ,{xg(x)I{xi=1mΔi}:g𝔅𝒞L4L(Ω)(f~k,3t)},L2())\displaystyle N_{[\,]}(\epsilon,\left\{x\mapsto g(x)I\{x\in\cup_{i=1}^{m}\Delta_{i}\}:g\in{\mathfrak{B}}^{{\mathcal{C}}_{L}^{4L}(\Omega)}_{{\mathbb{P}}}(\tilde{f}_{k},3t)\right\},L_{2}({\mathbb{P}}))
Cdk(logCdLϵ)d+1(tϵ)d/2,\displaystyle\leq C_{d}k\left(\log\frac{C_{d}L}{\epsilon}\right)^{d+1}\left(\frac{t}{\epsilon}\right)^{d/2},

The above bound also follows from Theorem 4.5. For $G^{II}_{\tilde{f}_{k}}(t,{\cal F})$, let $\tilde{n}:=\sum_{i=1}^{n}I\{X_{i}\notin\cup_{j=1}^{m}\Delta_{j}\}$ and use Dudley's bound (Theorem B.2) and (63) to write

\displaystyle G^{II}_{\tilde{f}_{k}}(t,{\cal F})=\frac{\tilde{n}}{n}{\mathbb{E}}\left[\sup_{g\in{\mathfrak{B}}^{{\cal F}}_{{\mathbb{P}_{n}}}(\tilde{f}_{k},t)}\frac{1}{\tilde{n}}\sum_{i:X_{i}\notin\cup_{j=1}^{m}\Delta_{j}}\xi_{i}\left(g(X_{i})-\tilde{f}_{k}(X_{i})\right)\bigg{|}X_{1},\dots,X_{n}\right]
σn~ninfδ>0(12n~δlogN(ϵ,𝒞L4L(Ω),)𝑑ϵ+2δ)\displaystyle\leq\frac{\sigma\tilde{n}}{n}\inf_{\delta>0}\left(\frac{12}{\sqrt{\tilde{n}}}\int_{\delta}^{\infty}\sqrt{\log N(\epsilon,{\mathcal{C}}_{L}^{4L}(\Omega),\ell_{\infty})}d\epsilon+2\delta\right)
Cdσn~ninfδ>0(1n~δ(Lϵ)d/4𝑑ϵ+δ).\displaystyle\leq C_{d}\frac{\sigma\tilde{n}}{n}\inf_{\delta>0}\left(\frac{1}{\sqrt{\tilde{n}}}\int_{\delta}^{\infty}\left(\frac{L}{\epsilon}\right)^{d/4}d\epsilon+\delta\right).

The choice δ=L(n~)2/d\delta=L(\tilde{n})^{-2/d} leads to

Gf~kII(t,)CdLσn(n~)1(2/d).\displaystyle G^{II}_{\tilde{f}_{k}}(t,{\cal F})\leq C_{d}L\frac{\sigma}{n}(\tilde{n})^{1-(2/d)}. (71)
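The computation behind this choice of $\delta$ is the standard one (the convergence of the integral uses $d\geq 5$, so that $d/4>1$):

```latex
\int_{\delta}^{\infty}\left(\frac{L}{\epsilon}\right)^{d/4}d\epsilon
=\frac{L^{d/4}\,\delta^{1-(d/4)}}{(d/4)-1},
\qquad\text{so, with }\delta=L(\tilde{n})^{-2/d},
\qquad
\frac{1}{\sqrt{\tilde{n}}}\int_{\delta}^{\infty}\left(\frac{L}{\epsilon}\right)^{d/4}d\epsilon+\delta
\leq C_{d}L\,(\tilde{n})^{-1/2}(\tilde{n})^{(-2/d)(1-(d/4))}+L(\tilde{n})^{-2/d}
=C_{d}'\,L\,(\tilde{n})^{-2/d},
```

because $(-2/d)(1-(d/4))=\tfrac{1}{2}-\tfrac{2}{d}$, so the factor $(\tilde{n})^{1/2}$ cancels; multiplying by $\sigma\tilde{n}/n$ then gives (71).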

Note that n~\tilde{n} is binomially distributed with parameters nn and p~:=Vol(Ω(i=1mΔi))/Vol(Ω)\tilde{p}:=\text{Vol}(\Omega\setminus(\cup_{i=1}^{m}\Delta_{i}))/\text{Vol}(\Omega). Because (1Cdk2/d)Ωi=1mΔiΩ(1-C_{d}k^{-2/d})\Omega\subseteq\cup_{i=1}^{m}\Delta_{i}\subseteq\Omega (from Lemma 3.2), we have

\displaystyle\tilde{p}\leq 1-(1-C_{d}k^{-2/d})^{d}\leq dC_{d}k^{-2/d}\leq dC_{d}k^{-1/d}

where we used (1u)d1du(1-u)^{d}\geq 1-du. Hoeffding’s inequality:

{Bin(n,p)np+u}1exp(u22n)for every u0\displaystyle{\mathbb{P}}\left\{\text{Bin}(n,p)\leq np+u\right\}\geq 1-\exp\left(\frac{-u^{2}}{2n}\right)\qquad\text{for every $u\geq 0$}

gives (below CdC_{d} is such that p~Cdk1/d\tilde{p}\leq C_{d}k^{-1/d})

{n~2Cdnk1/d}{n~np~Cdnk1/d}1exp(Cd22nk2/d).\displaystyle{\mathbb{P}}\left\{\tilde{n}\leq 2C_{d}nk^{-1/d}\right\}\geq{\mathbb{P}}\left\{\tilde{n}-n\tilde{p}\leq C_{d}nk^{-1/d}\right\}\geq 1-\exp\left(-\frac{C_{d}^{2}}{2}nk^{-2/d}\right).

Combining the above with (71), we obtain

Gf~kII(t,)CdLσn2/dk1/dk2/d2\displaystyle G^{II}_{\tilde{f}_{k}}(t,{\cal F})\leq C_{d}L\sigma n^{-2/d}k^{-1/d}k^{2/d^{2}}

with probability at least $1-\exp(-C_{d}nk^{-2/d})$. The proof of (37) is now completed by combining the obtained bounds for $G^{I}_{\tilde{f}_{k}}(t,{\cal F})$ and $G^{II}_{\tilde{f}_{k}}(t,{\cal F})$.

B.2 Proof of Lemma 4.3

Lemma 4.3 is proved below using Lemma 4.6 (proved in Subsection B.6) and the following standard result (Sudakov Minoration) from the theory of Gaussian processes (see e.g., [33, Theorem 3.18]):

Lemma B.4 (Sudakov minoration).

In our random design setting (with ξ1,,ξni.i.dN(0,σ2)\xi_{1},\dots,\xi_{n}\overset{\text{i.i.d}}{\sim}N(0,\sigma^{2})), the following is true for every class {\cal F} and t0t\geq 0:

𝔼[sup{g:n(g,f)t}1ni=1nξi(g(Xi)f(Xi))|X1,,Xn]βσnsupϵ>0{ϵlogN(ϵ,{g:n(g,f)t},n)},\begin{split}&{\mathbb{E}}\left[\sup_{\{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(g,f)\leq t\}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-f(X_{i})\right)\bigg{|}X_{1},\dots,X_{n}\right]\\ &\geq\frac{\beta\sigma}{\sqrt{n}}\sup_{\epsilon>0}\left\{\epsilon\sqrt{\log N(\epsilon,\{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(g,f)\leq t\},\ell_{{\mathbb{P}}_{n}})}\right\},\end{split} (72)

where β\beta is a universal positive constant.

Proof of Lemma 4.3.

As in the proof of Lemma 4.2, we use the notation ${\mathfrak{B}}_{{\mathbb{P}}_{n}}^{{\cal F}}(f,t)=\{g\in{\cal F}:\ell_{{\mathbb{P}}_{n}}(g,f)\leq t\}$. By the triangle inequality,

𝔅n(f~k,t)𝔅n(f0,t/2)whenever n(f0,f~k)t/2.{\mathfrak{B}}^{{\cal F}}_{{\mathbb{P}_{n}}}(\tilde{f}_{k},t)\supseteq{\mathfrak{B}}^{{\cal F}}_{{\mathbb{P}_{n}}}(f_{0},t/2)\qquad\text{whenever ${\ell_{{\mathbb{P}_{n}}}}(f_{0},\tilde{f}_{k})\leq t/2$}.

From Lemma 3.2, n(f0,f~k)supxΩ|f0(x)f~k(x)|Cdk2/d{\ell_{{\mathbb{P}_{n}}}}(f_{0},\tilde{f}_{k})\leq\sup_{x\in\Omega}|f_{0}(x)-\tilde{f}_{k}(x)|\leq C_{d}k^{-2/d}, and so n(f0,f~k)t/2{\ell_{{\mathbb{P}_{n}}}}(f_{0},\tilde{f}_{k})\leq t/2 will be satisfied whenever t2Cdk2/dt\geq 2C_{d}k^{-2/d}. For such tt, by Sudakov minoration (Lemma B.4):

Gf~k(t,)\displaystyle G_{\tilde{f}_{k}}(t,{\cal F}) 𝔼[supg𝔅n(f0,t/2)1ni=1nξi(g(Xi)f~k(Xi))|X1,,Xn]\displaystyle\geq{\mathbb{E}}\left[\sup_{g\in{\mathfrak{B}}^{{\cal F}}_{{\mathbb{P}_{n}}}(f_{0},t/2)}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-\tilde{f}_{k}(X_{i})\right)\bigg{|}X_{1},\dots,X_{n}\right]
=𝔼[supg𝔅n(f0,t/2)1ni=1nξi(g(Xi)f0(Xi))|X1,,Xn]\displaystyle={\mathbb{E}}\left[\sup_{g\in{\mathfrak{B}}^{{\cal F}}_{{\mathbb{P}_{n}}}(f_{0},t/2)}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left(g(X_{i})-f_{0}(X_{i})\right)\bigg{|}X_{1},\dots,X_{n}\right]
\displaystyle\geq\frac{\beta\sigma}{\sqrt{n}}\sup_{\epsilon>0}\left\{\epsilon\sqrt{\log N(\epsilon,{\mathfrak{B}}^{{\cal F}}_{{\mathbb{P}_{n}}}(f_{0},t/2),\ell_{{\mathbb{P}}_{n}})}\right\}.

We now use Lemma 4.6 which applies to the function class 𝒞LL(Ω){\mathcal{C}}_{L}^{L}(\Omega) and, consequently, also to the larger classes 𝒞L(Ω){\mathcal{C}}_{L}(\Omega), 𝒞LB(Ω){\mathcal{C}}^{B}_{L}(\Omega) and 𝒞B(Ω){\mathcal{C}}^{B}(\Omega) for BLB\geq L. Applying Lemma 4.6 with ϵ=c3n2/d\epsilon=c_{3}n^{-2/d}, we get that for each fixed t16c3n2/dt\geq 16c_{3}n^{-2/d}, the inequality

Gf~k(t,)βc1σc31(d/4)n2/dG_{\tilde{f}_{k}}(t,{\cal F})\geq\beta\sqrt{c_{1}}\sigma c_{3}^{1-(d/4)}n^{-2/d}

holds with probability at least 1exp(c2n)1-\exp(-c_{2}n). This completes the proof of Lemma 4.3 (note that, if the constant CdC_{d} is enlarged suitably, the earlier condition t2Cdk2/dt\geq 2C_{d}k^{-2/d} implies t16c3n2/dt\geq 16c_{3}n^{-2/d} because knk\leq n). ∎
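For the record, the exponent arithmetic in the last display is as follows. As we read Lemma 4.6, it supplies (on an event of probability at least $1-\exp(-c_{2}n)$) a metric entropy lower bound of the form $\log N(\epsilon,\cdot,\ell_{{\mathbb{P}}_{n}})\geq c_{1}\epsilon^{-d/2}$; plugging $\epsilon=c_{3}n^{-2/d}$ into the Sudakov bound gives:

```latex
\frac{\beta\sigma}{\sqrt{n}}\,\epsilon\sqrt{c_{1}\,\epsilon^{-d/2}}
=\beta\sqrt{c_{1}}\,\sigma\,n^{-1/2}\,\epsilon^{1-(d/4)}
=\beta\sqrt{c_{1}}\,\sigma\,c_{3}^{1-(d/4)}\,n^{-1/2}\,n^{(-2/d)(1-(d/4))}
=\beta\sqrt{c_{1}}\,c_{3}^{1-(d/4)}\,\sigma\,n^{-2/d},
```

using $(-2/d)(1-(d/4))=\tfrac{1}{2}-\tfrac{2}{d}$.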

B.3 Proof of Lemma 4.4

Lemma 4.4 is proved below using Lemmas 4.2 and 4.3.

Proof of Lemma 4.4.

Taking t=t0:=cdn1/dσt=t_{0}:=\sqrt{c_{d}}n^{-1/d}\sqrt{\sigma} in Lemma 4.3, we obtain

{Hf~k(t0,)cd2σn2/d}1exp(cdn).{\mathbb{P}}\left\{H_{\tilde{f}_{k}}(t_{0},{\cal F})\geq\frac{c_{d}}{2}\sigma n^{-2/d}\right\}\geq 1-\exp(-c_{d}n). (73)

The condition t0Cdk2/dt_{0}\geq C_{d}k^{-2/d} required for the application of Lemma 4.3 places the following restriction on kk:

k(Cdcd)d/2σd/4n.k\geq\left(\frac{C_{d}}{\sqrt{c_{d}}}\right)^{d/2}\sigma^{-d/4}\sqrt{n}. (74)
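For completeness, (74) is obtained by rearranging the condition $t_{0}\geq C_{d}k^{-2/d}$:

```latex
\sqrt{c_{d}}\,n^{-1/d}\sqrt{\sigma}\;\geq\;C_{d}k^{-2/d}
\;\Longleftrightarrow\;
k^{2/d}\;\geq\;\frac{C_{d}}{\sqrt{c_{d}}}\,\sigma^{-1/2}\,n^{1/d}
\;\Longleftrightarrow\;
k\;\geq\;\left(\frac{C_{d}}{\sqrt{c_{d}}}\right)^{d/2}\sigma^{-d/4}\sqrt{n}.
```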

Now consider the upper bound on Hf~k(t,)Gf~k(t,)H_{\tilde{f}_{k}}(t,{\cal F})\leq G_{\tilde{f}_{k}}(t,{\cal F}) given in Lemma 4.2. The leading term in this upper bound is the first term:

Cdtσ(kn)2/d(log(Cdσn))2(d+1)/dC_{d}t\sigma\left(\frac{k}{n}\right)^{2/d}(\log(C_{d}\sigma\sqrt{n}))^{2(d+1)/d} (75)

as long as

tmax(n(4d)/(2d)k2/dσ1,k(23d)/d2).t\geq\max\left(n^{(4-d)/(2d)}k^{-2/d}\sigma^{-1},k^{(2-3d)/d^{2}}\right). (76)

We now choose tt so that (75) matches the lower bound on Hf~k(t0,)H_{\tilde{f}_{k}}(t_{0},{\cal F}) from (73):

Cdtσ(kn)2/d(log(Cdσn))2(d+1)/d=cd2σn2/dC_{d}t\sigma\left(\frac{k}{n}\right)^{2/d}(\log(C_{d}\sigma\sqrt{n}))^{2(d+1)/d}=\frac{c_{d}}{2}\sigma n^{-2/d}

leading to

t=t1:=cd2Cdk2/d(log(Cdσn))2(d+1)/d.t=t_{1}:=\frac{c_{d}}{2C_{d}}k^{-2/d}(\log(C_{d}\sigma\sqrt{n}))^{-2(d+1)/d}.

Check that this choice of t=t1t=t_{1} satisfies (76) provided

n(cdσ2Cd)2d/(d4)(log(Cdσn))4(d+1)/(d4)n\geq\left(\frac{c_{d}\sigma}{2C_{d}}\right)^{2d/(d-4)}(\log(C_{d}\sigma\sqrt{n}))^{4(d+1)/(d-4)} (77)

and

k(2Cdcd)d2/(d2)(log(Cdσn))2d(d+1)/(d2).k\geq\left(\frac{2C_{d}}{c_{d}}\right)^{d^{2}/(d-2)}(\log(C_{d}\sigma\sqrt{n}))^{2d(d+1)/(d-2)}. (78)

We have thus proved Hf~k(t1,)Hf~k(t0,)H_{\tilde{f}_{k}}(t_{1},{\cal F})\leq H_{\tilde{f}_{k}}(t_{0},{\cal F}). Check that t1<t0t_{1}<t_{0} provided

k>(2cdCd)d/2(log(Cdσn))(d+1)σd/4n.k>(2\sqrt{c_{d}}C_{d})^{d/2}(\log(C_{d}\sigma\sqrt{n}))^{-(d+1)}\sigma^{-d/4}\sqrt{n}. (79)
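As a sanity check on (79) (up to the precise value of the dimensional constant), rearranging $t_{1}<t_{0}$ gives:

```latex
\frac{c_{d}}{2C_{d}}\,k^{-2/d}(\log(C_{d}\sigma\sqrt{n}))^{-2(d+1)/d}<\sqrt{c_{d}}\,n^{-1/d}\sqrt{\sigma}
\;\Longleftrightarrow\;
k^{2/d}>\frac{\sqrt{c_{d}}}{2C_{d}}\,\sigma^{-1/2}\,n^{1/d}\,(\log(C_{d}\sigma\sqrt{n}))^{-2(d+1)/d}
\;\Longleftrightarrow\;
k>\left(\frac{\sqrt{c_{d}}}{2C_{d}}\right)^{d/2}(\log(C_{d}\sigma\sqrt{n}))^{-(d+1)}\,\sigma^{-d/4}\sqrt{n},
```

which has exactly the form of (79): a dimensional constant times $(\log(C_{d}\sigma\sqrt{n}))^{-(d+1)}\sigma^{-d/4}\sqrt{n}$.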

As can now be directly checked, the choice $k=\gamma_{d}\sqrt{n}\sigma^{-d/4}$ for an appropriate $\gamma_{d}$ and the condition $n\geq N_{d,\sigma}$ for an appropriate $N_{d,\sigma}$ together satisfy all four conditions (74), (77), (78) and (79). Further, the probability with which all the above bounds hold is bounded from below by $1-C_{d}\exp(-n^{(d-4)/d}/C_{d}^{2})$. Inequality (29) in Theorem 4.1 now implies that $t_{\tilde{f}_{k}}({\cal F})\geq t_{1}$. The logarithmic term $\log(C_{d}\sigma\sqrt{n})$ in $t_{1}$ can be further simplified to $\log n$ because $n\geq N_{d,\sigma}$. This completes the proof of Lemma 4.4. ∎

B.4 Completion of the proof of Theorem 3.1

We now complete the proof of Theorem 3.1 from Lemma 4.4.

Proof of Theorem 3.1.

We shall lower bound 𝔼f~k2(f^n(),f~k){\mathbb{E}}_{\tilde{f}_{k}}\ell^{2}_{{\mathbb{P}}}(\hat{f}_{n}({\cal F}),\tilde{f}_{k}). Let

ρn:=cdn1/dσ(logn)2(d+1)/d\rho_{n}:=c_{d}n^{-1/d}\sqrt{\sigma}(\log n)^{-2(d+1)/d}

be the lower bound on tf~k()t_{\tilde{f}_{k}}({\cal F}) given by Lemma 4.4. As a result,

{n2(f^n(),f~k)12ρn2}\displaystyle{\mathbb{P}}\left\{\ell^{2}_{{\mathbb{P}}_{n}}(\hat{f}_{n}({\cal F}),\tilde{f}_{k})\geq\frac{1}{2}\rho_{n}^{2}\right\}
{n2(f^n(),f~k)12tf~k2(),tf~k()ρn}\displaystyle\geq{\mathbb{P}}\left\{\ell^{2}_{{\mathbb{P}}_{n}}(\hat{f}_{n}({\cal F}),\tilde{f}_{k})\geq\frac{1}{2}t^{2}_{\tilde{f}_{k}}({\cal F}),t_{\tilde{f}_{k}}({\cal F})\geq\rho_{n}\right\}
=𝔼[I{tf~k()ρn}{n2(f^n(),f~k)12tf~k2()|X1,,Xn}]\displaystyle={\mathbb{E}}\left[I\left\{t_{\tilde{f}_{k}}({\cal F})\geq\rho_{n}\right\}{\mathbb{P}}\left\{\ell^{2}_{{\mathbb{P}}_{n}}(\hat{f}_{n}({\cal F}),\tilde{f}_{k})\geq\frac{1}{2}t^{2}_{\tilde{f}_{k}}({\cal F})\bigg{|}X_{1},\dots,X_{n}\right\}\right]
𝔼[I{tf~k()ρn}(16exp(cntf~k2()σ2))]\displaystyle\geq{\mathbb{E}}\left[I\left\{t_{\tilde{f}_{k}}({\cal F})\geq\rho_{n}\right\}\left(1-6\exp\left(\frac{-cnt^{2}_{\tilde{f}_{k}}({\cal F})}{\sigma^{2}}\right)\right)\right]
\displaystyle\geq\left(1-6\exp\left(\frac{-cn\rho_{n}^{2}}{\sigma^{2}}\right)\right){\mathbb{P}}\left\{t_{\tilde{f}_{k}}({\cal F})\geq\rho_{n}\right\},

where we used (30) in the penultimate inequality. Lemma 4.4 now gives

{n2(f^n(),f~k)12ρn2}\displaystyle{\mathbb{P}}\left\{\ell^{2}_{{\mathbb{P}}_{n}}(\hat{f}_{n}({\cal F}),\tilde{f}_{k})\geq\frac{1}{2}\rho_{n}^{2}\right\}
(16exp(cnρn2σ2))(1Cdexp(n(d4)/dCd2))\displaystyle\geq\left(1-6\exp\left(\frac{-cn\rho_{n}^{2}}{\sigma^{2}}\right)\right)\left(1-C_{d}\exp\left(\frac{-n^{(d-4)/d}}{C_{d}^{2}}\right)\right)
16exp(cnρn2σ2)Cdexp(n(d4)/dCd2).\displaystyle\geq 1-6\exp\left(\frac{-cn\rho_{n}^{2}}{\sigma^{2}}\right)-C_{d}\exp\left(\frac{-n^{(d-4)/d}}{C_{d}^{2}}\right).

Clearly if Nd,σN_{d,\sigma} is chosen appropriately, then, for nNd,σn\geq N_{d,\sigma},

nρn2σ2=cd2σn(d2)/d(logn)4(d+1)/d\frac{n\rho_{n}^{2}}{\sigma^{2}}=\frac{c_{d}^{2}}{\sigma}n^{(d-2)/d}(\log n)^{-4(d+1)/d}

will be larger than any constant multiple of n(d4)/dn^{(d-4)/d} which gives

{n2(f^n(),f~k)12ρn2}1Cdexp(n(d4)/dCd2).{\mathbb{P}}\left\{\ell^{2}_{{\mathbb{P}}_{n}}(\hat{f}_{n}({\cal F}),\tilde{f}_{k})\geq\frac{1}{2}\rho_{n}^{2}\right\}\geq 1-C_{d}\exp\left(\frac{-n^{(d-4)/d}}{C_{d}^{2}}\right). (80)
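For reference, the specific form of $\rho_{n}$ arises from substituting the choice $k=\gamma_{d}\sqrt{n}\,\sigma^{-d/4}$ from the proof of Lemma 4.4 into $t_{1}$:

```latex
k^{-2/d}=\gamma_{d}^{-2/d}\,n^{-1/d}\,\sigma^{1/2},
\qquad\text{so}\qquad
t_{1}=\frac{c_{d}}{2C_{d}}\,\gamma_{d}^{-2/d}\,\sqrt{\sigma}\,n^{-1/d}(\log n)^{-2(d+1)/d},
```

which equals $\rho_{n}$ after renaming the constant $c_{d}$ (and with $\log(C_{d}\sigma\sqrt{n})$ simplified to $\log n$ for $n\geq N_{d,\sigma}$).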

We now argue that a similar inequality also holds for 2(f^n(),f~k)\ell^{2}_{{\mathbb{P}}}(\hat{f}_{n}({\cal F}),\tilde{f}_{k}). Here it is easiest to break the argument into the different choices of {\cal F}. First assume =𝒞B(Ω){\cal F}={\mathcal{C}}^{B}(\Omega). Combining the above inequality with Lemma B.3, we obtain

{(f^n(𝒞B(Ω)),f~k)12ρn212Cdn2/(d+4)Bd/(d+4)}\displaystyle{\mathbb{P}}\left\{\ell_{{\mathbb{P}}}(\hat{f}_{n}({\mathcal{C}^{B}(\Omega)}),\tilde{f}_{k})\geq\frac{1}{2}\frac{\rho_{n}}{\sqrt{2}}-\frac{1}{2}C_{d}n^{-2/(d+4)}B^{d/(d+4)}\right\}
\displaystyle\geq 1-C_{d}\exp\left(\frac{-n^{(d-4)/d}}{C_{d}^{2}}\right)-C\exp\left(-\frac{n^{d/(d+4)}}{C_{d}B^{8/(d+4)}}\right).

Because n2/(d+4)n^{-2/(d+4)} is of a smaller order than n1/dn^{-1/d} and nd/(d+4)n^{d/(d+4)} is of a larger order than n(d4)/dn^{(d-4)/d} (and BB is a dimensional constant), we obtain

{(f^n(𝒞B(Ω)),f~k)cdn1/dσ(logn)2(d+1)/d}1Cdexp(n(d4)/dCd2).\begin{split}&{\mathbb{P}}\left\{\ell_{{\mathbb{P}}}(\hat{f}_{n}({\mathcal{C}^{B}(\Omega)}),\tilde{f}_{k})\geq c_{d}n^{-1/d}\sqrt{\sigma}(\log n)^{-2(d+1)/d}\right\}\\ &\geq 1-C_{d}\exp\left(\frac{-n^{(d-4)/d}}{C_{d}^{2}}\right).\end{split}

provided nNd,σn\geq N_{d,\sigma} where Nd,σN_{d,\sigma} is a constant depending on dd and σ\sigma alone. Using this, we can further adjust Nd,σN_{d,\sigma} so that

{(f^n(𝒞B(Ω)),f~k)cdn1/dσ(logn)2(d+1)/d}12\displaystyle{\mathbb{P}}\left\{\ell_{{\mathbb{P}}}(\hat{f}_{n}({\mathcal{C}^{B}(\Omega)}),\tilde{f}_{k})\geq c_{d}n^{-1/d}\sqrt{\sigma}(\log n)^{-2(d+1)/d}\right\}\geq\frac{1}{2}

which immediately gives that

𝔼f~k2(f^n(𝒞B(Ω)),f~k)cd24σn2/d(logn)4(d+1)/d\displaystyle{\mathbb{E}}_{\tilde{f}_{k}}\ell^{2}_{{\mathbb{P}}}(\hat{f}_{n}({\mathcal{C}^{B}(\Omega)}),\tilde{f}_{k})\geq\frac{c_{d}^{2}}{4}\sigma n^{-2/d}(\log n)^{-4(d+1)/d} (81)

completing the proof of inequality (20). The same argument also yields (19) for ${\cal F}={\mathcal{C}}_{L}^{B}(\Omega)$. Now take ${\cal F}={\mathcal{C}}_{L}(\Omega)$. Take $N_{d,\sigma}$ large enough so that $\rho_{n}\leq L$ for $n\geq N_{d,\sigma}$. Claim (70) then implies that both $\ell_{{\mathbb{P}}_{n}}(\hat{f}_{n}({\mathcal{C}}_{L}(\Omega)),\tilde{f}_{k})$ and $\ell_{{\mathbb{P}}}(\hat{f}_{n}({\mathcal{C}}_{L}(\Omega)),\tilde{f}_{k})$ trivially exceed $\rho_{n}$ whenever $\hat{f}_{n}({\mathcal{C}}_{L}(\Omega))$ does not belong to ${\mathcal{C}}_{L}^{4L}(\Omega)$. Therefore, while lower bounding these losses, we may assume that $\hat{f}_{n}({\mathcal{C}}_{L}(\Omega))\in{\mathcal{C}}_{L}^{4L}(\Omega)$. This allows another application of Lemma B.3 and yields (19) for ${\cal F}={\mathcal{C}}_{L}(\Omega)$. The proof of Theorem 3.1 is complete. ∎
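For the record, the two order comparisons invoked in this proof reduce to elementary inequalities:

```latex
\frac{2}{d+4}>\frac{1}{d}\iff 2d>d+4\iff d>4,
\qquad\qquad
\frac{d}{d+4}>\frac{d-4}{d}\iff d^{2}>(d-4)(d+4)=d^{2}-16,
```

so the first comparison ($n^{-2/(d+4)}=o(n^{-1/d})$) uses $d\geq 5$, while the second ($n^{(d-4)/d}=o(n^{d/(d+4)})$) holds for every $d$.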

B.5 Proof of Theorem 4.5

It is enough to prove Theorem 4.5 when $\Omega$ is exactly equal to $\cup_{i=1}^{k}\Delta_{i}$, so in the proof we shall assume that $\Omega=\cup_{i=1}^{k}\Delta_{i}$. The main step is to establish the special case where $\Omega$ is itself a $d$-simplex and $\tilde{f}$ is identically zero on $\Omega$. This special case is stated in the next result (note that simplices can be written in the form (82), so the next result is applicable when $\Omega$ is a simplex).

Lemma B.5.

Let Ω\Omega be a convex body contained in the unit ball, and of the form

Ω={xd:aiviTxbi,1id+1}\Omega=\{x\in{\mathbb{R}}^{d}:a_{i}\leq v_{i}^{T}x\leq b_{i},1\leq i\leq d+1\} (82)

where viv_{i} are fixed unit vectors. For a fixed 1p<1\leq p<\infty and t>0t>0, let

BpΓ(0,t,Ω):={f𝒞Γ(Ω):Ω|f(x)|p𝑑xtp}.B_{p}^{\Gamma}(0,t,\Omega):=\left\{f\in{\mathcal{C}}^{\Gamma}(\Omega):\int_{\Omega}|f(x)|^{p}dx\leq t^{p}\right\}.

Then for every 0<ϵ<Γ0<\epsilon<\Gamma, we have

\log N_{[\,]}(\epsilon,B_{p}^{\Gamma}(0,t,\Omega),\|\cdot\|_{p,\Omega})\leq C_{d,p}\left(\log\frac{\Gamma}{\epsilon}\right)^{d+1}\left(\frac{t}{\epsilon}\right)^{d/2}.

Below, we first provide the proof of Theorem 4.5 using Lemma B.5, and then provide the proof of Lemma B.5.

Proof of Theorem 4.5.

Without loss of generality, we assume Ω=i=1kΔi\Omega=\cup_{i=1}^{k}\Delta_{i}. For each fBpΓ(f~,t,Ω)f\in B_{p}^{\Gamma}(\tilde{f},t,\Omega) and 1ik1\leq i\leq k, we define ti(f)t_{i}(f) as the smallest positive integer tit_{i} such that

Δi|f(x)f~(x)|p𝑑xtiptp|Δi|\int_{\Delta_{i}}|f(x)-\tilde{f}(x)|^{p}dx\leq t_{i}^{p}t^{p}|\Delta_{i}| (83)

where $|\Delta_{i}|$ denotes the volume of $\Delta_{i}$. Let $\mathfrak{T}$ denote the collection of all sequences $T:=(t_{1}(f),\dots,t_{k}(f))$ as $f$ ranges over $B_{p}^{\Gamma}(\tilde{f},t,\Omega)$. Because $|f(x)-\tilde{f}(x)|\leq 2\Gamma$ for all $x\in\Omega$, we have $t_{i}(f)\leq\lceil 2\Gamma/t\rceil$ (here $\lceil x\rceil$ is the smallest positive integer larger than or equal to $x$). Thus, the cardinality of $\mathfrak{T}$ is at most $\lceil 2\Gamma/t\rceil^{k}$. Further, as $t_{i}(f)$ is the smallest positive integer $t_{i}$ satisfying (83), the inequality is violated for $t_{i}-1$, so that

(ti1)ptp|Δi|Δi|f(x)f~(x)|p𝑑x(t_{i}-1)^{p}t^{p}|\Delta_{i}|\leq\int_{\Delta_{i}}|f(x)-\tilde{f}(x)|^{p}dx

for each i=1,,ki=1,\dots,k. Summing these for i=1,,ki=1,\dots,k, we get

\displaystyle\sum_{i=1}^{k}(t_{i}-1)^{p}t^{p}|\Delta_{i}|\leq\sum_{i=1}^{k}\int_{\Delta_{i}}|f(x)-\tilde{f}(x)|^{p}dx
\displaystyle\leq\int_{\Omega}|f(x)-\tilde{f}(x)|^{p}dx\leq t^{p},

which is equivalent to i=1k(ti1)p|Δi|1\sum_{i=1}^{k}(t_{i}-1)^{p}|\Delta_{i}|\leq 1. As a result

\sum_{i=1}^{k}t_{i}^{p}|\Delta_{i}|\leq 2^{p-1}\sum_{i=1}^{k}[(t_{i}-1)^{p}+1]|\Delta_{i}|\leq 2^{p}. (84)

For each sequence T=(t1,t2,,tk)𝔗T=(t_{1},t_{2},\ldots,t_{k})\in\mathfrak{T}, let T{\cal F}_{T} be the collection of all functions fBpΓ(f~,t,Ω)f\in B_{p}^{\Gamma}(\tilde{f},t,\Omega) satisfying

(t_{i}-1)^{p}t^{p}|\Delta_{i}|\leq\int_{\Delta_{i}}|f(x)-\tilde{f}(x)|^{p}dx\leq t_{i}^{p}t^{p}|\Delta_{i}|

for each i=1,,ki=1,\dots,k. For fTf\in{\cal F}_{T} and 1ik1\leq i\leq k, the restriction of ff~f-\tilde{f} to Δi\Delta_{i} belongs to Bp2Γ(0,tit|Δi|1/p,Δi)B_{p}^{2\Gamma}(0,t_{i}t|\Delta_{i}|^{1/p},\Delta_{i}) (since f~\tilde{f} is linear on each Δi\Delta_{i}). Applying Lemma B.5, we deduce, for every 0<ϵi<2Γ0<\epsilon_{i}<2\Gamma, the existence of a set 𝒢i{\mathcal{G}}_{i} consisting of no more than

exp(Cd,p[log(2Γ/ϵi)]d+1tid/2|Δi|d/2ptd/2ϵid/2)\exp(C_{d,p}[\log(2\Gamma/\epsilon_{i})]^{d+1}t_{i}^{d/2}|\Delta_{i}|^{d/2p}t^{d/2}\epsilon_{i}^{-d/2})

brackets, such that for each $f\in{\cal F}_{T}$, there exists a bracket $[g_{i},h_{i}]\in{\mathcal{G}}_{i}$ with $g_{i}(x)+\tilde{f}(x)\leq f(x)\leq h_{i}(x)+\tilde{f}(x)$ for all $x\in\Delta_{i}$, and

Δi|hi(x)gi(x)|p𝑑xϵip.\int_{\Delta_{i}}|h_{i}(x)-g_{i}(x)|^{p}dx\leq\epsilon_{i}^{p}.

We now define a bracket $[g,h]$ on $\Omega$ via $g(x)=g_{i}(x)+\tilde{f}(x)$ and $h(x)=h_{i}(x)+\tilde{f}(x)$ for $x\in\Delta_{i}$, $1\leq i\leq k$. Then we clearly have $g(x)\leq f(x)\leq h(x)$ for all $x\in\Omega$, and

Ω|h(x)g(x)|p𝑑x=i=1kΔi|hi(x)gi(x)|p𝑑xi=1kϵip.\int_{\Omega}|h(x)-g(x)|^{p}dx=\sum_{i=1}^{k}\int_{\Delta_{i}}|h_{i}(x)-g_{i}(x)|^{p}dx\leq\sum_{i=1}^{k}\epsilon_{i}^{p}.

We choose

ϵi=max(212/pti|Δi|1/pϵ,(4k)1/pϵ),\epsilon_{i}=\max\left(2^{-1-2/p}t_{i}|\Delta_{i}|^{1/p}\epsilon,(4k)^{-1/p}\epsilon\right),

so that

\sum_{i=1}^{k}\epsilon_{i}^{p}\leq\frac{\epsilon^{p}}{2^{p+2}}\sum_{i=1}^{k}t_{i}^{p}|\Delta_{i}|+k\cdot\frac{\epsilon^{p}}{4k}\leq\frac{\epsilon^{p}}{4}+\frac{\epsilon^{p}}{4}=\frac{\epsilon^{p}}{2},

where we used $\max(a,b)^{p}\leq a^{p}+b^{p}$ together with (84). Thus, $[g,h]$ is an $\epsilon$-bracket.

Note that for each fixed TT, the total number of brackets [g,h][g,h] is at most

\displaystyle N:=\prod_{i=1}^{k}\exp\left(C_{d,p}[\log(2\Gamma/\epsilon_{i})]^{d+1}t_{i}^{d/2}|\Delta_{i}|^{d/2p}t^{d/2}\epsilon_{i}^{-d/2}\right)
\displaystyle\leq\prod_{i=1}^{k}\exp\left(C_{d,p}\left[\frac{1}{p}\log(4k)+\log(2\Gamma)+\log(1/\epsilon)\right]^{d+1}\left(\frac{2^{1+2/p}t}{\epsilon}\right)^{d/2}\right)
\displaystyle\leq\exp\left(C^{\prime}_{d,p}k[\log k+\log\Gamma+\log(1/\epsilon)]^{d+1}\left(\frac{t}{\epsilon}\right)^{d/2}\right).

Combining with the number of choices 2Γ/tk\leq\lceil 2\Gamma/t\rceil^{k} of the sequences T=(t1,,tk)T=(t_{1},\dots,t_{k}), the number of realizations of the brackets [g,h][g,h] is at most

2Γ/tkNexp(Cd,p′′k[logk+logΓ+log(1/ϵ)]d+1(t/ϵ)d/2),\lceil 2\Gamma/t\rceil^{k}\cdot N\leq\exp\left(C^{\prime\prime}_{d,p}k[\log k+\log\Gamma+\log(1/\epsilon)]^{d+1}(t/\epsilon)^{d/2}\right),

which can be shown to yield inequality (43). ∎

We now prove Lemma B.5. The main ingredient in this proof is the result below, whose proof follows directly from arguments in [19].

Lemma B.6.

Suppose Ω\Omega has volume at most 1 and is of the form:

Ω={xd:aiviTxbi,1id+1}\Omega=\{x\in{\mathbb{R}}^{d}:a_{i}\leq v_{i}^{T}x\leq b_{i},1\leq i\leq d+1\}

where viv_{i} are fixed unit vectors. For a fixed 0<η<1/50<\eta<1/5, let

Ω0:={xd:ai+η(biai)viTxbiη(biai),1id+1}.\Omega_{0}:=\{x\in{\mathbb{R}}^{d}:a_{i}+\eta(b_{i}-a_{i})\leq v_{i}^{T}x\leq b_{i}-\eta(b_{i}-a_{i}),1\leq i\leq d+1\}.

For 1p<1\leq p<\infty and t>0t>0, let

Cp(Ω,t):={f𝒞(Ω):Ω|f(x)|p𝑑xtp}.C_{p}(\Omega,t):=\left\{f\in{\mathcal{C}}(\Omega):\int_{\Omega}|f(x)|^{p}dx\leq t^{p}\right\}.

Then for every ϵ>0\epsilon>0, we have

logN[](ϵ,Cp(Ω,t),p,Ω0)Cd,p,η(tϵ)d/2.\log N_{[\,]}(\epsilon,C_{p}(\Omega,t),\|\cdot\|_{p,\Omega_{0}})\leq C_{d,p,\eta}\left(\frac{t}{\epsilon}\right)^{d/2}.

The main idea behind Lemma B.6 is that, on $\Omega_{0}$, each function in $C_{p}(\Omega,t)$ is uniformly bounded by a constant multiple of $t$. The proof of Lemma B.6 then follows from the bracketing entropy result for uniformly bounded convex functions (details can be found in [19]).

In the proof of Lemma B.5, we use the following notation. For Ω:={xd:aiviTxbi,1id+1}\Omega:=\{x\in{\mathbb{R}}^{d}:a_{i}\leq v_{i}^{T}x\leq b_{i},1\leq i\leq d+1\}, where viv_{i} are fixed unit vectors and 0rd+10\leq r\leq d+1, let

Tr(Ω)={xΩ:aiviTxbifor 1ir,andaj+η(bjaj)vjTxbjη(bjaj)forr<jd+1}.\begin{split}T_{r}(\Omega)&=\left\{x\in\Omega:a_{i}\leq v_{i}^{T}x\leq b_{i}\ {\rm for}\ 1\leq i\leq r,{\rm and}\right.\\ &\left.a_{j}+\eta(b_{j}-a_{j})\leq v_{j}^{T}x\leq b_{j}-\eta(b_{j}-a_{j})\ {\rm for}\ r<j\leq d+1\right\}.\end{split}

Observe that for r=0r=0, the set T0(Ω)T_{0}(\Omega) coincides with Ω0\Omega_{0} in Lemma B.6. Further Td+1(Ω)=ΩT_{d+1}(\Omega)=\Omega.

Proof of Lemma B.5.

Fix 0<η<1/50<\eta<1/5. We shall prove the following by induction on rr: There exist two constants C1(d,p)C_{1}(d,p) and C2(d,p)C_{2}(d,p) such that

\log N_{[\,]}(\epsilon,B^{\Gamma}_{p}(0,t,\Omega),\|\cdot\|_{p,T_{r}(\Omega)})\leq C_{1}[C_{2}\log(\Gamma/\epsilon)]^{r}t^{d/2}\epsilon^{-d/2} (85)

for every r=0,1,,d+1r=0,1,\ldots,d+1, and for every Ω\Omega of the form (82). As mentioned above, for r=0r=0, the set T0(Ω)T_{0}(\Omega) equals the set Ω0\Omega_{0} in Lemma B.6 and, as a result, (85) for r=0r=0 follows directly from Lemma B.6. Let us now assume that (85) is true for r=k1r=k-1 and proceed to prove it for r=kr=k. Define

K0=Tk1(Ω)={xTk(Ω):ak+η(bkak)vkTxbkη(bkak)}.K_{0}=T_{k-1}(\Omega)=\{x\in T_{k}(\Omega):a_{k}+\eta(b_{k}-a_{k})\leq v_{k}^{T}x\leq b_{k}-\eta(b_{k}-a_{k})\}.

For a positive integer mm (to be determined later) and for s=0,1,2,,ms=0,1,2,\ldots,m, define

K2s+1={xTk(Ω):ak+2s1η(bkak)vkTx<ak+2sη(bkak)},K_{2s+1}=\{x\in T_{k}(\Omega):a_{k}+2^{-s-1}\eta(b_{k}-a_{k})\leq v_{k}^{T}x<a_{k}+2^{-s}\eta(b_{k}-a_{k})\},
K_{2s+2}=\{x\in T_{k}(\Omega):b_{k}-2^{-s}\eta(b_{k}-a_{k})<v_{k}^{T}x\leq b_{k}-2^{-s-1}\eta(b_{k}-a_{k})\}.

Furthermore, define

KL={xTk(Ω):akvkTx<ak+2m1η(bkak)},K_{L}=\{x\in T_{k}(\Omega):a_{k}\leq v_{k}^{T}x<a_{k}+2^{-m-1}\eta(b_{k}-a_{k})\},
KR={xTk(Ω):bk2m1η(bkak)<vkTxbk}.K_{R}=\{x\in T_{k}(\Omega):b_{k}-2^{-m-1}\eta(b_{k}-a_{k})<v_{k}^{T}x\leq b_{k}\}.

Then the sets $K_{1},K_{2},\dots,K_{2m+1},K_{2m+2}$ together with $K_{0},K_{L},K_{R}$ form a partition of $T_{k}(\Omega)$. We now aim to apply the induction hypothesis. For this purpose, we define the inflated sets

K^2s+1={xΩ:ak+2s2η(bkak)vkTx<ak+32s1η(bkak)},\widehat{K}_{2s+1}=\{x\in\Omega:a_{k}+2^{-s-2}\eta(b_{k}-a_{k})\leq v_{k}^{T}x<a_{k}+3\cdot 2^{-s-1}\eta(b_{k}-a_{k})\},
K^2s+2={xΩ:bk32s1η(bkak)<vkTxbk2s2η(bkak)}.\widehat{K}_{2s+2}=\{x\in\Omega:b_{k}-3\cdot 2^{-s-1}\eta(b_{k}-a_{k})<v_{k}^{T}x\leq b_{k}-2^{-s-2}\eta(b_{k}-a_{k})\}.

The key observation now is that Tk1(K^2s+1)K2s+1T_{k-1}(\widehat{K}_{2s+1})\supset K_{2s+1} and Tk1(K^2s+2)K2s+2T_{k-1}(\widehat{K}_{2s+2})\supset K_{2s+2}. To prove Tk1(K^2s+1)K2s+1T_{k-1}(\widehat{K}_{2s+1})\supset K_{2s+1}, observe that Tk1(K^2s+1)T_{k-1}(\widehat{K}_{2s+1}) equals

{xK^2s+1:aiviTxbiforik1,\displaystyle\left\{x\in\widehat{K}_{2s+1}:a_{i}\leq v_{i}^{T}x\leq b_{i}\ {\rm for}\ i\leq k-1,\right.
aj+η(bjaj)vjTxbjη(bjaj)fork+1jd+1,\displaystyle\ \ \ \ \ \ a_{j}+\eta(b_{j}-a_{j})\leq v_{j}^{T}x\leq b_{j}-\eta(b_{j}-a_{j})\ {\rm for}\ k+1\leq j\leq d+1,
ak+2s2η(bkak)+η2(52s2(bkak))vkTx\displaystyle\ \ \ \ \ \ a_{k}+2^{-s-2}\eta(b_{k}-a_{k})+\eta^{2}(5\cdot 2^{-s-2}(b_{k}-a_{k}))\leq v_{k}^{T}x
<ak+32s1η(bkak)η2(52s2(bkak))}.\displaystyle\left.\ \ \ \ \ \ \ \ \ \ \ \ \ \ <a_{k}+3\cdot 2^{-s-1}\eta(b_{k}-a_{k})-\eta^{2}(5\cdot 2^{-s-2}(b_{k}-a_{k}))\right\}.

As a result, Tk1(K^2s+1)K2s+1T_{k-1}(\widehat{K}_{2s+1})\supset K_{2s+1} if and only if

2s2η(bkak)+η252s2(bkak)2s1η(bkak)2^{-s-2}\eta(b_{k}-a_{k})+\eta^{2}5\cdot 2^{-s-2}(b_{k}-a_{k})\leq 2^{-s-1}\eta(b_{k}-a_{k})

which, upon dividing both sides by $2^{-s-2}\eta(b_{k}-a_{k})$, reduces to $1+5\eta\leq 2$, i.e., $\eta\leq 1/5$; this holds because $\eta<1/5$. The claim $T_{k-1}(\widehat{K}_{2s+2})\supset K_{2s+2}$ follows similarly.

For every fBpΓ(0,t,Ω)f\in B^{\Gamma}_{p}(0,t,\Omega) and 1i2m+21\leq i\leq 2m+2, let ti:=ti(f)t_{i}:=t_{i}(f) be the smallest positive integer satisfying

K^i|f(x)|p𝑑x|K^i|tiptp\int_{\widehat{K}_{i}}|f(x)|^{p}dx\leq|\widehat{K}_{i}|t_{i}^{p}t^{p} (86)

where |K^i||\widehat{K}_{i}| denotes the volume of K^i\widehat{K}_{i}. Because |f(x)|Γ|f(x)|\leq\Gamma for all xΩx\in\Omega, we have ti(f)Γ/tt_{i}(f)\leq\lceil\Gamma/t\rceil. Let 𝔗\mathfrak{T} denote the collection of all sequences T:=(t1(f),,t2m+2(f))T:=(t_{1}(f),\dots,t_{2m+2}(f)) as ff ranges over BpΓ(0,t,Ω)B^{\Gamma}_{p}(0,t,\Omega). Because each ti(f)Γ/tt_{i}(f)\leq\lceil\Gamma/t\rceil, the cardinality of 𝔗\mathfrak{T} is at most Γ/t2m+2\lceil\Gamma/t\rceil^{2m+2}. Further, because ti(f)t_{i}(f) is the smallest positive integer tit_{i} satisfying (86), the inequality will be violated for ti1t_{i}-1 so that

|K^i|(ti1)ptpK^i|f(x)|p𝑑x|\widehat{K}_{i}|(t_{i}-1)^{p}t^{p}\leq\int_{\widehat{K}_{i}}|f(x)|^{p}dx

for each i=1,,2m+2i=1,\dots,2m+2. Summing these over ii, we obtain

i=12m+2|K^i|(ti1)ptpi=12m+2K^i|f(x)|p𝑑x\displaystyle\sum_{i=1}^{2m+2}|\widehat{K}_{i}|(t_{i}-1)^{p}t^{p}\leq\sum_{i=1}^{2m+2}\int_{\widehat{K}_{i}}|f(x)|^{p}dx

We now observe that every point in Ω\Omega is contained in K^i\widehat{K}_{i} for at most three different ii so that i=12m+2I{xK^i}3I{xΩ}\sum_{i=1}^{2m+2}I\{x\in\widehat{K}_{i}\}\leq 3I\{x\in\Omega\}. This gives

i=12m+2|K^i|(ti1)ptpi=12m+2K^i|f(x)|p𝑑x3Ω|f(x)|p𝑑x3tp.\displaystyle\sum_{i=1}^{2m+2}|\widehat{K}_{i}|(t_{i}-1)^{p}t^{p}\leq\sum_{i=1}^{2m+2}\int_{\widehat{K}_{i}}|f(x)|^{p}dx\leq 3\int_{\Omega}|f(x)|^{p}dx\leq 3t^{p}.

In other words, we have proved that every sequence (t1,,t2m+2)𝔗(t_{1},\dots,t_{2m+2})\in\mathfrak{T} satisfies:

i=12m+2|K^i|(ti1)p3.\displaystyle\sum_{i=1}^{2m+2}|\widehat{K}_{i}|(t_{i}-1)^{p}\leq 3. (87)

One consequence of the above is that

\displaystyle\sum_{i=1}^{2m+2}t_{i}^{p}|\widehat{K}_{i}|\leq 2^{p-1}\sum_{i=1}^{2m+2}[(t_{i}-1)^{p}+1]|\widehat{K}_{i}|\leq 3\cdot 2^{p}, (88)

where we used convexity ($(a_{i}+b_{i})^{p}\leq 2^{p-1}(a_{i}^{p}+b_{i}^{p})$ with $a_{i}=t_{i}-1$ and $b_{i}=1$) and the fact that $\sum_{i=1}^{2m+2}|\widehat{K}_{i}|\leq 3|\Omega|\leq 3$.

For each (t1,t2,,t2m+2)𝔗(t_{1},t_{2},\ldots,t_{2m+2})\in\mathfrak{T}, let T{\cal F}_{T} be the class of all functions fBpΓ(0,t,Ω)f\in B_{p}^{\Gamma}(0,t,\Omega) additionally satisfying

(ti1)ptp|K^i|K^i|f(x)|p𝑑xtiptp|K^i|\displaystyle(t_{i}-1)^{p}t^{p}|\widehat{K}_{i}|\leq\int_{\widehat{K}_{i}}|f(x)|^{p}dx\leq t_{i}^{p}t^{p}|\widehat{K}_{i}| (89)

for every i=1,,2m+2i=1,\dots,2m+2. It is then clear that BpΓ(0,t,Ω)T𝔗TB_{p}^{\Gamma}(0,t,\Omega)\subseteq\cup_{T\in\mathfrak{T}}{\cal F}_{T} which implies (below |𝔗||\mathfrak{T}| denotes the cardinality of the finite set 𝔗\mathfrak{T})

logN[](ϵ,BpΓ(0,t,Ω),p,Tk(Ω))\displaystyle\log N_{[\,]}(\epsilon,B_{p}^{\Gamma}(0,t,\Omega),\|\cdot\|_{p,T_{k}(\Omega)})
logT𝔗N[](ϵ,T,p,Tk(Ω))\displaystyle\leq\log\sum_{T\in\mathfrak{T}}N_{[\,]}(\epsilon,{\cal F}_{T},\|\cdot\|_{p,T_{k}(\Omega)})
log|𝔗|+maxT𝔗logN[](ϵ,T,p,Tk(Ω))\displaystyle\leq\log|\mathfrak{T}|+\max_{T\in\mathfrak{T}}\log N_{[\,]}(\epsilon,{\cal F}_{T},\|\cdot\|_{p,T_{k}(\Omega)})
(2m+2)logΓ/t+maxT𝔗logN[](ϵ,T,p,Tk(Ω)).\displaystyle\leq(2m+2)\log\lceil\Gamma/t\rceil+\max_{T\in\mathfrak{T}}\log N_{[\,]}(\epsilon,{\cal F}_{T},\|\cdot\|_{p,T_{k}(\Omega)}). (90)

We shall now fix T𝔗T\in\mathfrak{T} and bound logN[](ϵ,T,p,Tk(Ω))\log N_{[\,]}(\epsilon,{\cal F}_{T},\|\cdot\|_{p,T_{k}(\Omega)}). Because K0K_{0}, K1K_{1}, \dots, K2m+2K_{2m+2}, KLK_{L}, KRK_{R} form a partition of Tk(Ω)T_{k}(\Omega), we have

\begin{split}&\log N_{[\,]}(\epsilon,{\cal F}_{T},\|\cdot\|_{p,T_{k}(\Omega)})\\ &\leq\log N_{[\,]}(\epsilon_{0},{\cal F}_{T},\|\cdot\|_{p,K_{0}})+\sum_{i=1}^{2m+2}\log N_{[\,]}(\epsilon_{i},{\cal F}_{T},\|\cdot\|_{p,K_{i}})\\ &\quad+\log N_{[\,]}(\epsilon_{L},{\cal F}_{T},\|\cdot\|_{p,K_{L}})+\log N_{[\,]}(\epsilon_{R},{\cal F}_{T},\|\cdot\|_{p,K_{R}})\end{split} (91)

provided ϵ0,ϵ1,,ϵ2m+2,ϵL,ϵR>0\epsilon_{0},\epsilon_{1},\dots,\epsilon_{2m+2},\epsilon_{L},\epsilon_{R}>0 satisfy

ϵ0p+i=12m+2ϵip+ϵLp+ϵRpϵp.\displaystyle\epsilon_{0}^{p}+\sum_{i=1}^{2m+2}\epsilon_{i}^{p}+\epsilon_{L}^{p}+\epsilon_{R}^{p}\leq\epsilon^{p}. (92)

To bound logN[](ϵi,T,p,Ki)\log N_{[\,]}(\epsilon_{i},{\cal F}_{T},\|\cdot\|_{p,K_{i}}) for a fixed 1i2m+21\leq i\leq 2m+2, note first that KiTk1(K^i)K_{i}\subseteq T_{k-1}(\widehat{K}_{i}) so that

logN[](ϵi,T,p,Ki)logN[](ϵi,T,p,Tk1(K^i)).\displaystyle\log N_{[\,]}(\epsilon_{i},{\cal F}_{T},\|\cdot\|_{p,K_{i}})\leq\log N_{[\,]}(\epsilon_{i},{\cal F}_{T},\|\cdot\|_{p,T_{k-1}(\widehat{K}_{i})}).

The induction hypothesis will be used to control the right hand side above. Because of the right side inequality in (89), the restriction of each fTf\in{\cal F}_{T} to the set K^i\widehat{K}_{i} belongs to BpΓ(0,tit|K^i|1/p,K^i)B_{p}^{\Gamma}(0,t_{i}t|\widehat{K}_{i}|^{1/p},\widehat{K}_{i}) and so, by the induction hypothesis, we have

logN[](ϵi,T,p,Tk1(K^i))\displaystyle\log N_{[\,]}(\epsilon_{i},{\cal F}_{T},\|\cdot\|_{p,T_{k-1}(\widehat{K}_{i})})
logN[](ϵi,BpΓ(0,tit|K^i|1/p,K^i),p,Tk1(K^i))\displaystyle\leq\log N_{[\,]}(\epsilon_{i},B_{p}^{\Gamma}(0,t_{i}t|\widehat{K}_{i}|^{1/p},\widehat{K}_{i}),\|\cdot\|_{p,T_{k-1}(\widehat{K}_{i})})
C1(C2logΓϵi)k1|K^i|d/(2p)td/2(tiϵi)d/2.\displaystyle\leq C_{1}\left(C_{2}\log\frac{\Gamma}{\epsilon_{i}}\right)^{k-1}|\widehat{K}_{i}|^{d/(2p)}t^{d/2}\left(\frac{t_{i}}{\epsilon_{i}}\right)^{d/2}. (93)

For K0=Tk1(Ω)K_{0}=T_{k-1}(\Omega), we use the induction hypothesis again to obtain

logN[](ϵ0,T,p,K0)\displaystyle\log N_{[\,]}(\epsilon_{0},{\cal F}_{T},\|\cdot\|_{p,K_{0}})
logN[](ϵ0,BpΓ(0,t,Ω),p,Tk1(Ω))\displaystyle\leq\log N_{[\,]}(\epsilon_{0},B_{p}^{\Gamma}(0,t,\Omega),\|\cdot\|_{p,T_{k-1}(\Omega)})
C1[C2logΓϵ0]k1(tϵ0)d/2.\displaystyle\leq C_{1}\left[C_{2}\log\frac{\Gamma}{\epsilon_{0}}\right]^{k-1}\left(\frac{t}{\epsilon_{0}}\right)^{d/2}. (94)

On the sets KLK_{L} and KRK_{R}, we set logN[](ϵL,T,p,KL)\log N_{[\,]}(\epsilon_{L},{\cal F}_{T},\|\cdot\|_{p,K_{L}}) and logN[](ϵR,T,p,KR)\log N_{[\,]}(\epsilon_{R},{\cal F}_{T},\|\cdot\|_{p,K_{R}}) to be equal to zero, by taking the trivial bracket [Γ,Γ][-\Gamma,\Gamma]. For this to be valid, ϵL\epsilon_{L} and ϵR\epsilon_{R} have to satisfy

Γ|KL|1/pϵL and Γ|KR|1/pϵR.\displaystyle\Gamma|K_{L}|^{1/p}\leq\epsilon_{L}~{}~{}\text{ and }~{}~{}\Gamma|K_{R}|^{1/p}\leq\epsilon_{R}. (95)

We now choose mm, ϵi,0i2m+2,ϵL,ϵR\epsilon_{i},0\leq i\leq 2m+2,\epsilon_{L},\epsilon_{R} so as to satisfy (92). First of all, we take

ϵ0=ϵL=ϵR=ϵ61/p.\displaystyle\epsilon_{0}=\epsilon_{L}=\epsilon_{R}=\frac{\epsilon}{6^{1/p}}. (96)

To satisfy the conditions (95), note that $|K_{L}|$ and $|K_{R}|$ are bounded by $C_{d}2^{-m}\eta$ (note also that $\eta$ is a constant). Thus (95) holds with $\epsilon_{L}=\epsilon_{R}=\epsilon/6^{1/p}$ provided $m\geq C_{d,p}\log(\Gamma/\epsilon)$ for some constant $C_{d,p}$; we take $m$ to be the smallest such integer.

We also take

ϵi=max(12121/pti|K^i|1/pϵ,ϵ41/p(2m+2)1/p).\displaystyle\epsilon_{i}=\max\left(\frac{1}{2\cdot 12^{1/p}}t_{i}|\widehat{K}_{i}|^{1/p}\epsilon,\frac{\epsilon}{4^{1/p}(2m+2)^{1/p}}\right). (97)

It is then easy to check (using (88)) that

i=12m+2ϵip\displaystyle\sum_{i=1}^{2m+2}\epsilon_{i}^{p} i=12m+2(12121/pti|K^i|1/pϵ)p+i=12m+2(ϵ41/p(2m+2)1/p)p\displaystyle\leq\sum_{i=1}^{2m+2}\left(\frac{1}{2\cdot 12^{1/p}}t_{i}|\widehat{K}_{i}|^{1/p}\epsilon\right)^{p}+\sum_{i=1}^{2m+2}\left(\frac{\epsilon}{4^{1/p}(2m+2)^{1/p}}\right)^{p}
ϵp4+ϵp4=ϵp2.\displaystyle\leq\frac{\epsilon^{p}}{4}+\frac{\epsilon^{p}}{4}=\frac{\epsilon^{p}}{2}.

Thus our choices satisfy (92). The proof of Lemma B.5 is now completed by plugging these choices in (93) and (94), and combining the resulting bounds with (90) and (91). ∎
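The budget arithmetic behind the choices (96) and (97) can be checked numerically. The sketch below (illustrative values only) generates an extremal configuration saturating the bound (88) and verifies that the resulting $\epsilon_{0},\epsilon_{i},\epsilon_{L},\epsilon_{R}$ satisfy (92):

```python
import random

# Sketch check of the epsilon-budget (92) under the choices (96)-(97),
# in the extremal case sum_i t_i^p |K_i| = 3 * 2^p allowed by (88).
random.seed(1)
p, eps, m = 2.5, 0.3, 7
n_pieces = 2 * m + 2
t = [1 + random.random() for _ in range(n_pieces)]   # t_i >= 1
w = [random.random() for _ in range(n_pieces)]       # stand-ins for |K_i|
scale = 3 * 2 ** p / sum(ti ** p * wi for ti, wi in zip(t, w))
w = [wi * scale for wi in w]                         # enforce (88) with equality

# epsilon_i as in (97); epsilon_0 = epsilon_L = epsilon_R as in (96)
eps_i = [max(t[i] * w[i] ** (1 / p) * eps / (2 * 12 ** (1 / p)),
             eps / (4 ** (1 / p) * n_pieces ** (1 / p))) for i in range(n_pieces)]
budget = 3 * (eps / 6 ** (1 / p)) ** p + sum(e ** p for e in eps_i)
within_budget = budget <= eps ** p + 1e-9            # this is (92)
```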

B.6 Proofs of Lemma 3.2 and Lemma 4.6

Proof of Lemma 3.2.

Let us first assume that Ω\Omega is a polytope (satisfying (3)) whose number of facets is bounded by a constant depending on dd alone. For a fixed η>0\eta>0, let η\mathfrak{C}_{\eta} be the collection of all cubes of the form

[k1η,(k1+1)η]××[kdη,(kd+1)η][k_{1}\eta,(k_{1}+1)\eta]\times\dots\times[k_{d}\eta,(k_{d}+1)\eta] (98)

for $(k_{1},\dots,k_{d})\in\mathbb{Z}^{d}$ which intersect $\Omega$. Because $\Omega$ is contained in the unit ball, there exist dimensional constants $c_{d}$ and $C_{d}$ such that the cardinality of $\mathfrak{C}_{\eta}$ is at most $C_{d}\eta^{-d}$ whenever $\eta\leq c_{d}$.

For each $B\in\mathfrak{C}_{\eta}$, the set $B\cap\Omega$ is a polytope whose number of facets is bounded from above by a constant depending on $d$ alone. This polytope can therefore be triangulated into at most $C_{d}$ $d$-simplices. Let $\Delta_{1},\dots,\Delta_{m}$ be the collection obtained by taking all of the aforementioned simplices as $B$ varies over $\mathfrak{C}_{\eta}$. These simplices clearly satisfy $\Omega=\cup_{i=1}^{m}\Delta_{i}$ and the second requirement of Lemma 3.2. Moreover

mCdηdm\leq C_{d}\eta^{-d}

and the diameter of each simplex $\Delta_{i}$ is at most $C_{d}\eta$. Now define $\tilde{f}_{\eta}$ to be the piecewise affine convex function that agrees with $f_{0}(x)=\|x\|^{2}$ at each vertex of each simplex $\Delta_{i}$ and is extended by affine interpolation everywhere else on $\Omega$. This function is clearly affine on each $\Delta_{i}$, belongs to ${\mathcal{C}}_{C_{d}}^{C_{d}}(\Omega)$ for a sufficiently large $C_{d}$, and satisfies

supxΔi|f0(x)f~η(x)|Cd(diameter(Δi))2Cdη2.\sup_{x\in\Delta_{i}}|f_{0}(x)-\tilde{f}_{\eta}(x)|\leq C_{d}\left(\text{diameter}(\Delta_{i})\right)^{2}\leq C_{d}\eta^{2}.

Now given $k\geq 1$, let $\eta=c_{d}k^{-1/d}$ for a sufficiently small dimensional constant $c_{d}$ and let $\tilde{f}_{k}$ be the function $\tilde{f}_{\eta}$ for this $\eta$. The number of simplices is now $m\leq C_{d}k$ and

supxΔi|f0(x)f~η(x)|Cdk2/d\sup_{x\in\Delta_{i}}|f_{0}(x)-\tilde{f}_{\eta}(x)|\leq C_{d}k^{-2/d}

which completes the proof of Lemma 3.2 when Ω\Omega is a polytope.
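The quadratic interpolation error bound used above can be illustrated numerically in the one-dimensional case, where the piecewise affine interpolant of $f_{0}(x)=x^{2}$ on a grid of spacing $\eta$ has sup-error exactly $\eta^{2}/4$ on each cell (a sketch with illustrative values; the constant $C_{d}\eta^{2}$ in the display above plays the same role in general dimension):

```python
# Illustration (d = 1): the piecewise affine interpolant of f0(x) = x^2 on a
# grid of spacing eta satisfies sup |f0 - f_eta| = eta^2 / 4 on each cell.
eta = 0.05
num_cells = 40                      # cells covering [-1, 1]
sup_err = 0.0
for j in range(num_cells):
    a = -1 + j * eta                # left endpoint of the cell [a, a + eta]
    for i in range(1001):           # sample the cell finely (midpoint included)
        x = a + eta * i / 1000
        chord = a ** 2 + (x - a) * (2 * a + eta)   # line through the endpoints
        sup_err = max(sup_err, abs(x ** 2 - chord))
close_to_theory = abs(sup_err - eta ** 2 / 4) < 1e-9
```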

Now assume that $\Omega$ is a generic convex body (not necessarily a polytope) satisfying (3). Here the only difference in the proof is that we take $\mathfrak{D}_{\eta}$ to be the collection of all cubes of the form (98) for $(k_{1},\dots,k_{d})\in\mathbb{Z}^{d}$ which are contained in the interior of $\Omega$. Because each of these cubes has diameter $\eta\sqrt{d}$, it follows that

(1Cdη)ΩB𝔇ηBΩ.\left(1-C_{d}\eta\right)\Omega\subseteq\cup_{B\in\mathfrak{D}_{\eta}}B\subseteq\Omega.

The rest of the argument is the same as in the polytopal case. ∎

Proof of Lemma 4.6.

By a standard argument involving local perturbations of the quadratic function $f_{0}(x)=\|x\|^{2}$, we can prove the following for three constants $c_{1},c_{2},C$ depending on $d$ alone: for every $\epsilon\leq c_{1}$ and $L\geq C$, there exists an integer $N$ with

c2ϵd/2logN2c2ϵd/2c_{2}\epsilon^{-d/2}\leq\log N\leq 2c_{2}\epsilon^{-d/2}

and functions f1,,fN𝒞LL(Ω)f_{1},\dots,f_{N}\in{\mathcal{C}}_{L}^{L}(\Omega) such that

min1ijN(fi,fj)2ϵ and max1iNsupxΩ|fi(x)f0(x)|4ϵ.\min_{1\leq i\neq j\leq N}\ell_{{\mathbb{P}}}(f_{i},f_{j})\geq\sqrt{2}\epsilon~{}~{}\text{ and }~{}~{}\max_{1\leq i\leq N}\sup_{x\in\Omega}|f_{i}(x)-f_{0}(x)|\leq 4\epsilon.

One explicit construction of such perturbed functions is given in the proof of Lemma 4.10. Lemma 4.6 follows from the above claim and the Hoeffding inequality. Indeed, Hoeffding’s inequality applied to the random variables (fj(Xi)fk(Xi))2(f_{j}(X_{i})-f_{k}(X_{i}))^{2} (which are bounded by 64ϵ264\epsilon^{2}) followed by a union bound allows us to deduce that, for every t>0t>0,

{\mathbb{P}}\left\{\ell_{{\mathbb{P}}_{n}}^{2}(f_{j},f_{k})-\ell_{{\mathbb{P}}}^{2}(f_{j},f_{k})\geq-tn^{-1/2}\text{ for all }j,k\right\}\geq 1-N^{2}\exp\left(\frac{-t^{2}}{\Gamma\epsilon^{4}}\right)

for a universal constant Γ\Gamma. Taking t=ϵ2nt=\epsilon^{2}\sqrt{n}, we get

{n(fj,fk)ϵ for all j,k}\displaystyle{\mathbb{P}}\left\{\ell_{{\mathbb{P}}_{n}}(f_{j},f_{k})\geq\epsilon\text{ for all }j,k\right\} 1N2exp(nΓ)\displaystyle\geq 1-N^{2}\exp\left(\frac{-n}{\Gamma}\right)
1exp(4c2ϵd/2nΓ).\displaystyle\geq 1-\exp\left(4c_{2}\epsilon^{-d/2}-\frac{n}{\Gamma}\right).

Assuming now that ϵn2/d(8c2Γ)2/d\epsilon\geq n^{-2/d}(8c_{2}\Gamma)^{2/d}, we get

{n(fj,fk)ϵ for all j,k}1exp(n2Γ).{\mathbb{P}}\left\{\ell_{{\mathbb{P}}_{n}}(f_{j},f_{k})\geq\epsilon\text{ for all }j,k\right\}\geq 1-\exp\left(-\frac{n}{2\Gamma}\right).

As each fjf_{j} satisfies n(fj,f0)supx|fj(x)f0(x)|4ϵ\ell_{{\mathbb{P}}_{n}}(f_{j},f_{0})\leq\sup_{x}|f_{j}(x)-f_{0}(x)|\leq 4\epsilon, the above completes the proof of Lemma 4.6. ∎
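The exponent algebra in the last two displays can be verified directly: at the threshold $\epsilon=n^{-2/d}(8c_{2}\Gamma)^{2/d}$ one has $4c_{2}\epsilon^{-d/2}=n/(2\Gamma)$, so the exponent $4c_{2}\epsilon^{-d/2}-n/\Gamma$ collapses to $-n/(2\Gamma)$. A sketch with arbitrary illustrative constants:

```python
# Check of the exponent algebra: at eps = n^(-2/d) * (8 c2 Gamma)^(2/d),
# eps^(-d/2) = n / (8 c2 Gamma), hence 4 c2 eps^(-d/2) - n/Gamma = -n/(2 Gamma).
c2, Gamma, n, d = 0.7, 3.0, 10 ** 6, 5    # illustrative values only
eps = n ** (-2 / d) * (8 * c2 * Gamma) ** (2 / d)
exponent = 4 * c2 * eps ** (-d / 2) - n / Gamma
matches = abs(exponent - (-n / (2 * Gamma))) < 1e-6 * n
```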

Appendix C Proofs of fixed-design results for the unrestricted convex LSE

This section contains the proofs of Theorem 3.3, Theorem 3.4, Theorem 3.5 and Theorem 3.6. We follow the sketch given in Subsection 4.4 and results quoted in that subsection are also proved here. Subsection C.1 below contains the proof of Theorem 3.3 and that of the main ingredient in its proof, Lemma 4.7. Subsection C.2 contains the proof of Theorem 3.5 and that of the main ingredient in its proof, Lemma 4.8. Subsection C.3 contains the proof of Theorem 3.4. Subsection C.4 contains the proof of Theorem 3.6. Lemma 4.9 and Lemma 4.10, which are both crucial for the proof of Theorem 3.6, are proved in Subsection C.5. The metric entropy results are proved in Subsections C.6 and C.7. Specifically, the main metric entropy result (Theorem 4.11) is proved in Subsection C.7. The two important corollaries of Theorem 4.11, inequality (46) and inequality (48), are proved in Subsection C.6.

C.1 Proof of Lemma 4.7 and Theorem 3.3

The proof of Lemma 4.7 makes crucial use of the metric entropy bound (46).

Proof of Lemma 4.7.

The metric entropy bound (46), together with Dudley's bound (Theorem B.2), gives

Gf0(t,𝒞(Ω))σ\displaystyle\frac{G_{f_{0}}(t,{\mathcal{C}}(\Omega))}{\sigma}
Cd(logn)F/2nθt/2(t+𝔏ϵ)d/4𝑑ϵ+2θ\displaystyle\leq\frac{C_{d}(\log n)^{F/2}}{\sqrt{n}}\int_{\theta}^{t/2}\left(\frac{t+{\mathfrak{L}}}{\epsilon}\right)^{d/4}d\epsilon+2\theta
(Cd2d/4)(logn)F/2n{θt/2(tϵ)d/4𝑑ϵ+θt/2(𝔏ϵ)d/4𝑑ϵ}+2θ\displaystyle\leq\frac{(C_{d}2^{d/4})(\log n)^{F/2}}{\sqrt{n}}\left\{\int_{\theta}^{t/2}\left(\frac{t}{\epsilon}\right)^{d/4}d\epsilon+\int_{\theta}^{t/2}\left(\frac{{\mathfrak{L}}}{\epsilon}\right)^{d/4}d\epsilon\right\}+2\theta (99)

for every 0<θt/20<\theta\leq t/2. We replace Cd2d/4C_{d}2^{d/4} by just CdC_{d} (in general, the value of CdC_{d} can change from place to place). We now split into the three cases d3,d=4d\leq 3,d=4 and d5d\geq 5. For d3d\leq 3, we take θ=0\theta=0 to get

Gf0(t,𝒞(Ω))Cdσn(logn)F/2(t+𝔏d/4t1d/4).G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\frac{C_{d}\sigma}{\sqrt{n}}(\log n)^{F/2}\left(t+{\mathfrak{L}}^{d/4}t^{1-d/4}\right).

For d=4d=4, (99) leads to

Gf0(t,𝒞(Ω))Cdσn(logn)F/2(t+𝔏)logt2θ+2σθ.G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\frac{C_{d}\sigma}{\sqrt{n}}(\log n)^{F/2}(t+{\mathfrak{L}})\log\frac{t}{2\theta}+2\sigma\theta.

Choosing θ=t/(2n)\theta=t/(2\sqrt{n}), we obtain

Gf0(t,𝒞(Ω))Cdσn(t+𝔏)(logn)1+(F/2)G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\frac{C_{d}\sigma}{\sqrt{n}}\left(t+{\mathfrak{L}}\right)(\log n)^{1+(F/2)}

Finally, for d5d\geq 5, (99) leads to the bound

Gf0(t,𝒞(Ω))\displaystyle G_{f_{0}}(t,{\mathcal{C}}(\Omega)) Cdσ(logn)F/2n{θ(tϵ)d/4𝑑ϵ+θ(𝔏ϵ)d/4𝑑ϵ}+2σθ\displaystyle\leq\frac{C_{d}\sigma(\log n)^{F/2}}{\sqrt{n}}\left\{\int_{\theta}^{\infty}\left(\frac{t}{\epsilon}\right)^{d/4}d\epsilon+\int_{\theta}^{\infty}\left(\frac{{\mathfrak{L}}}{\epsilon}\right)^{d/4}d\epsilon\right\}+2\sigma\theta
Cdσ(logn)F/2n(t+𝔏)d/4θ1(d/4)+2σθ\displaystyle\leq C_{d}\frac{\sigma(\log n)^{F/2}}{\sqrt{n}}(t+{\mathfrak{L}})^{d/4}\theta^{1-(d/4)}+2\sigma\theta

for every θ>0\theta>0. The choice

θ=(Cd(logn)F/2n)4/d(t+𝔏)\theta=\left(\frac{C_{d}(\log n)^{F/2}}{\sqrt{n}}\right)^{4/d}(t+{\mathfrak{L}})

gives

G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq C_{d}\sigma\left(\frac{(\log n)^{F/2}}{\sqrt{n}}\right)^{4/d}\left(t+{\mathfrak{L}}\right). ∎
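The choice of $\theta$ above balances the two terms of the bound: writing $a=C_{d}(\log n)^{F/2}/\sqrt{n}$ and $s=t+{\mathfrak{L}}$, the first term $a\,s^{d/4}\theta^{1-(d/4)}$ equals $\theta$ exactly at $\theta=a^{4/d}s$. A quick numerical check with illustrative values:

```python
# Balance check for the theta chosen in the d >= 5 case: with
# a = C_d (log n)^(F/2) / sqrt(n) and s = t + L, the term
# a * s^(d/4) * theta^(1 - d/4) equals theta when theta = a^(4/d) * s.
a, s, d = 0.013, 2.6, 7            # arbitrary illustrative values, d >= 5
theta = a ** (4 / d) * s
term = a * s ** (d / 4) * theta ** (1 - d / 4)
balanced = abs(term - theta) < 1e-12 * theta
```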

Theorem 3.3, which follows from Lemma 4.7 and Theorem 4.1, is proved next.

Proof of Theorem 3.3.

By Theorem 4.1 (specifically the upper bound in inequality (27), and (28)), the risk is bounded from above by 2t2+Cσ2/n2t^{2}+C\sigma^{2}/n for every tt satisfying Hf0(t,𝒞(Ω))0H_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq 0 or, equivalently, Gf0(t,𝒞(Ω))t2/2G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq t^{2}/2. We shall use the bound on Gf0(t,𝒞(Ω))G_{f_{0}}(t,{\mathcal{C}}(\Omega)) given in Lemma 4.7 to get tt such that Gf0(t,𝒞(Ω))t2/2G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq t^{2}/2. The risk will be dominated by such tt because this tt will be of larger order than σ2/n\sigma^{2}/n. For d3d\leq 3, Lemma 4.7 gives

Gf0(t,𝒞(Ω))Cdσn(logn)F/2(t+𝔏d/4t1d/4).G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\frac{C_{d}\sigma}{\sqrt{n}}(\log n)^{F/2}\left(t+{\mathfrak{L}}^{d/4}t^{1-d/4}\right).

Because

Cdσn(logn)F/2tt24 if and only if t4Cdσn(logn)F/2\frac{C_{d}\sigma}{\sqrt{n}}(\log n)^{F/2}t\leq\frac{t^{2}}{4}~{}~{}\text{ if and only if }~{}~{}t\geq\frac{4C_{d}\sigma}{\sqrt{n}}(\log n)^{F/2}

and

Cdσn(logn)F/2𝔏d/4t1d/4t24 iff t(4Cd)4d+4(σ(logn)F/2n)4d+4𝔏dd+4,\frac{C_{d}\sigma}{\sqrt{n}}(\log n)^{F/2}{\mathfrak{L}}^{d/4}t^{1-d/4}\leq\frac{t^{2}}{4}~{}~{}\text{ iff }~{}~{}t\geq(4C_{d})^{\frac{4}{d+4}}\left(\frac{\sigma(\log n)^{F/2}}{\sqrt{n}}\right)^{\frac{4}{d+4}}{\mathfrak{L}}^{\frac{d}{d+4}},

we deduce that $G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq t^{2}/2$ for

tCdmax((σ(logn)F/2n)4d+4𝔏dd+4,σn(logn)F/2).t\geq C_{d}\max\left(\left(\frac{\sigma(\log n)^{F/2}}{\sqrt{n}}\right)^{\frac{4}{d+4}}{\mathfrak{L}}^{\frac{d}{d+4}},\frac{\sigma}{\sqrt{n}}(\log n)^{F/2}\right).

This proves, for d3d\leq 3,

𝔼f0n2(f^n,f0)Cdmax{𝔏2d4+d(σ2n(logn)F)4d+4,σ2n(logn)F}{\mathbb{E}}_{f_{0}}\ell_{{\mathbb{P}_{n}}}^{2}(\hat{f}_{n},f_{0})\leq C_{d}\max\left\{{\mathfrak{L}}^{\frac{2d}{4+d}}\left(\frac{\sigma^{2}}{n}(\log n)^{F}\right)^{\frac{4}{d+4}},\frac{\sigma^{2}}{n}(\log n)^{F}\right\}

The leading term on the right hand side above is the first term inside the maximum, which proves Theorem 3.3 for $d\leq 3$.

For d=4d=4, Lemma 4.7 gives

Gf0(t,𝒞(Ω))Cdσn(t+𝔏)(logn)1+(F/2)G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\frac{C_{d}\sigma}{\sqrt{n}}\left(t+{\mathfrak{L}}\right)(\log n)^{1+(F/2)}

from which we can deduce (as in the case d3d\leq 3) that G(t)t2/2G(t)\leq t^{2}/2 for

t\geq C_{d}\max\left(\frac{\sqrt{\sigma{\mathfrak{L}}}(\log n)^{(F/4)+(1/2)}}{n^{1/4}},\frac{\sigma(\log n)^{1+(F/2)}}{\sqrt{n}}\right).

This proves, for d=4d=4,

𝔼f0n2(f^n,f0)C4max{σ𝔏n(logn)1+F2,σ2n(logn)2+F}.{\mathbb{E}}_{f_{0}}\ell_{{\mathbb{P}_{n}}}^{2}(\hat{f}_{n},f_{0})\leq C_{4}\max\left\{\frac{\sigma{\mathfrak{L}}}{\sqrt{n}}(\log n)^{1+\frac{F}{2}},\frac{\sigma^{2}}{n}(\log n)^{2+F}\right\}.

The leading term on the right hand side above is the first term inside the maximum, which proves Theorem 3.3 for $d=4$.

Finally, for d5d\geq 5, Lemma 4.7 gives

Gf0(t,𝒞(Ω))2σ(Cd(logn)F/2n)4/d(t+𝔏)G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq 2\sigma\left(\frac{C_{d}(\log n)^{F/2}}{\sqrt{n}}\right)^{4/d}\left(t+{\mathfrak{L}}\right)

from which it follows that G(t)t2/2G(t)\leq t^{2}/2 for

tCdmax(σ𝔏((logn)F/2n)2/d,σ((logn)F/2n)4/d).t\geq C_{d}\max\left(\sqrt{\sigma{\mathfrak{L}}}\left(\frac{(\log n)^{F/2}}{\sqrt{n}}\right)^{2/d},\sigma\left(\frac{(\log n)^{F/2}}{\sqrt{n}}\right)^{4/d}\right).

This proves, for d5d\geq 5,

𝔼f0n2(f^n,f0)Cdmax{σ𝔏((logn)Fn)2d,σ2((logn)Fn)4d}.{\mathbb{E}}_{f_{0}}\ell_{{\mathbb{P}_{n}}}^{2}(\hat{f}_{n},f_{0})\leq C_{d}\max\left\{\sigma{\mathfrak{L}}\left(\frac{(\log n)^{F}}{n}\right)^{\frac{2}{d}},\sigma^{2}\left(\frac{(\log n)^{F}}{n}\right)^{\frac{4}{d}}\right\}.

The leading term on the right hand side above is the first term inside the maximum, which proves Theorem 3.3 for $d\geq 5$. ∎
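For context, the leading rate $n^{-2/d}$ obtained here (up to logarithmic factors) can be compared with the minimax rate $n^{-4/(d+4)}$: one has $2/d<4/(d+4)$ exactly when $d>4$, which is the source of the suboptimality phenomenon discussed in the introduction. A quick check of the exponents:

```python
# Exponent comparison: the LSE risk exponent 2/d is strictly smaller than the
# minimax exponent 4/(d+4) for every d >= 5, and the two coincide at d = 4.
lse_slower = all(2 / d < 4 / (d + 4) for d in range(5, 50))
boundary = abs(2 / 4 - 4 / (4 + 4)) < 1e-15   # d = 4: exponents coincide
```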

C.2 Proof of Lemma 4.8 and Theorem 3.5

First we prove Lemma 4.8 which crucially uses the metric entropy bound (48).

Proof of Lemma 4.8.

Combining the entropy bound (48) and Dudley’s result (Theorem B.2), we get

Gf0(t,𝒞(Ω))σkn(cdlogn)h/2θt/2(tϵ)d/4𝑑ϵ+2σθG_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\sigma\sqrt{\frac{k}{n}}(c_{d}\log n)^{h/2}\int_{\theta}^{t/2}\left(\frac{t}{\epsilon}\right)^{d/4}d\epsilon+2\sigma\theta (100)

for every 0<θt/20<\theta\leq t/2. We now split into the three cases d3d\leq 3, d=4d=4 and d5d\geq 5. When d3d\leq 3, we can take θ=0\theta=0 to obtain

Gf0(t,𝒞(Ω))tσkn(cdlogn)h/2.G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq t\sigma\sqrt{\frac{k}{n}}(c_{d}\log n)^{h/2}.

For d=4d=4, we get

Gf0(t,𝒞(Ω))σkn(cdlogn)h/2tlogt2θ+2σθ.G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\sigma\sqrt{\frac{k}{n}}(c_{d}\log n)^{h/2}t\log\frac{t}{2\theta}+2\sigma\theta.

Choosing θ:=t/(2n)\theta:=t/(2\sqrt{n}), we get

Gf0(t,𝒞(Ω))σtkn(cdlogn)1+h/2.G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\sigma t\sqrt{\frac{k}{n}}(c_{d}\log n)^{1+h/2}.

For d5d\geq 5, we get

Gf0(t,𝒞(Ω))\displaystyle G_{f_{0}}(t,{\mathcal{C}}(\Omega)) σkn(cdlogn)h/2θ(tϵ)d/4𝑑ϵ+2σθ\displaystyle\leq\sigma\sqrt{\frac{k}{n}}(c_{d}\log n)^{h/2}\int_{\theta}^{\infty}\left(\frac{t}{\epsilon}\right)^{d/4}d\epsilon+2\sigma\theta
Cdσkn(cdlogn)h/2td/4θ1(d/4)+2σθ.\displaystyle\leq C_{d}\sigma\sqrt{\frac{k}{n}}(c_{d}\log n)^{h/2}t^{d/4}\theta^{1-(d/4)}+2\sigma\theta.

Take θ=t((cdlogn)h/2kn)4/d\theta=t\left((c_{d}\log n)^{h/2}\sqrt{\frac{k}{n}}\right)^{4/d} to get

G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq\sigma t\left((c_{d}\log n)^{h/2}\sqrt{\frac{k}{n}}\right)^{4/d} (101)

which completes the proof of Lemma 4.8. ∎

We next provide the proof of Theorem 3.5.

Proof of Theorem 3.5.

From Lemma 4.8, it is straightforward to see that Gf0(t,𝒞(Ω))t2/2G_{f_{0}}(t,{\mathcal{C}}(\Omega))\leq t^{2}/2 provided

t{2σkn(cdlogn)h/2for d32σkn(cdlogn)1+(h/2)for d=42σ((cdlogn)h/2kn)4/dfor d5.t\geq\begin{cases}2\sigma\sqrt{\frac{k}{n}}(c_{d}\log n)^{h/2}~{}~{}~{}~{}~{}~{}~{}\text{for }d\leq 3\\ 2\sigma\sqrt{\frac{k}{n}}(c_{d}\log n)^{1+(h/2)}~{}~{}~{}~{}~{}~{}~{}~{}~{}\text{for }d=4\\ 2\sigma\left((c_{d}\log n)^{h/2}\sqrt{\frac{k}{n}}\right)^{4/d}~{}~{}~{}~{}~{}~{}\text{for }d\geq 5.\end{cases}

Theorem 3.5 then follows by an application of Theorem 4.1. ∎

C.3 Proof of Theorem 3.4

The proof essentially follows from Theorem 3.6. Let $c_{d}$ and $N_{d}$ be as given by Theorem 3.6. Taking $k=\sqrt{n}\sigma^{-d/4}$ and assuming that $n\geq\max(N_{d},c_{d}^{-2}\sigma^{-d/2})$, we obtain from Theorem 3.6 that

supf0𝒞CdCd(Ω)𝔼f0n2(f^n,f0)cdσn2/d(logn)4(d+1)/d.\sup_{f_{0}\in{\mathcal{C}}^{C_{d}}_{C_{d}}(\Omega)}{\mathbb{E}}_{f_{0}}\ell_{{\mathbb{P}_{n}}}^{2}(\hat{f}_{n},f_{0})\geq c_{d}\sigma n^{-2/d}(\log n)^{-4(d+1)/d}.

where $C_{d}$ is such that $\tilde{f}_{k}\in{\mathcal{C}}^{C_{d}}_{C_{d}}(\Omega)$ (the existence of such a $C_{d}$ is guaranteed by Lemma 3.2). The required lower bound (22) on the class ${\mathcal{C}}_{L}^{L}(\Omega)$ for an arbitrary $L>0$ can now be obtained by an elementary scaling argument. ∎

C.4 Proof of Theorem 3.6

We prove Theorem 3.6 below using Lemma 4.9. The proof of Lemma 4.9 is given in Subsection C.5.

Proof of Theorem 3.6.

Let G~(t):=Gf~k(t,𝒞(Ω))\tilde{G}(t):=G_{\tilde{f}_{k}}(t,{\mathcal{C}}(\Omega)) and tf~k:=tf~k(𝒞(Ω))t_{\tilde{f}_{k}}:=t_{\tilde{f}_{k}}({\mathcal{C}}(\Omega)) for notational convenience. By the lower bound given by Lemma 4.9,

supt>0(G~(t)t22)suptcdk2/d(cdσt(kn)2/dt22).\sup_{t>0}\left(\tilde{G}(t)-\frac{t^{2}}{2}\right)\geq\sup_{t\leq c_{d}k^{-2/d}}\left(c_{d}\sigma t\left(\frac{k}{n}\right)^{2/d}-\frac{t^{2}}{2}\right).

Taking t=cdσ(k/n)2/dt=c_{d}\sigma(k/n)^{2/d} and noting that

t=cdσ(kn)2/dcdk2/dif and only if knσd/4,t=c_{d}\sigma\left(\frac{k}{n}\right)^{2/d}\leq c_{d}k^{-2/d}\qquad\text{if and only if $k\leq\sqrt{n}\sigma^{-d/4}$},

we get that

supt>0(G~(t)t22)cd2σ22(kn)4/d.\sup_{t>0}\left(\tilde{G}(t)-\frac{t^{2}}{2}\right)\geq\frac{c_{d}^{2}\sigma^{2}}{2}\left(\frac{k}{n}\right)^{4/d}.

The above inequality, combined with the upper bound (49) and the fact that $t_{\tilde{f}_{k}}$ maximizes $\tilde{G}(t)-t^{2}/2$ over all $t>0$, yields

cd2σ22(kn)4/d\displaystyle\frac{c_{d}^{2}\sigma^{2}}{2}\left(\frac{k}{n}\right)^{4/d} supt>0(G~(t)t22)\displaystyle\leq\sup_{t>0}\left(\tilde{G}(t)-\frac{t^{2}}{2}\right)
=G~(tf~k)tf~k22\displaystyle=\tilde{G}(t_{\tilde{f}_{k}})-\frac{t^{2}_{\tilde{f}_{k}}}{2}
G~(tf~k)Cdσtf~k(logn)2(d+1)/d(kn)2/d.\displaystyle\leq\tilde{G}(t_{\tilde{f}_{k}})\leq C_{d}\sigma t_{\tilde{f}_{k}}(\log n)^{2(d+1)/d}\left(\frac{k}{n}\right)^{2/d}.

This implies

tf~kcd22Cdσ(kn)2/d(logn)2(d+1)/d.t_{\tilde{f}_{k}}\geq\frac{c_{d}^{2}}{2C_{d}}\sigma\left(\frac{k}{n}\right)^{2/d}(\log n)^{-2(d+1)/d}.

Theorem 4.1 then gives

𝔼f~kn2(f^n(𝒞(Ω)),f~k)cd48Cd2σ2(kn)4/d(logn)4(d+1)/dCσ2n.{\mathbb{E}}_{\tilde{f}_{k}}\ell_{{\mathbb{P}}_{n}}^{2}\left(\hat{f}_{n}({\mathcal{C}}(\Omega)),\tilde{f}_{k}\right)\geq\frac{c_{d}^{4}}{8C_{d}^{2}}\sigma^{2}\left(\frac{k}{n}\right)^{4/d}(\log n)^{-4(d+1)/d}-\frac{C\sigma^{2}}{n}.

The first term on the right hand side above dominates the second term when nn is larger than a constant depending on dd alone. This completes the proof of Theorem 3.6. ∎

C.5 Proofs of Lemma 4.9 and Lemma 4.10

We first prove Lemma 4.9 below assuming the validity of Lemma 4.10. Lemma 4.10 is proved later in this subsection.

Proof of Lemma 4.9.

By Lemma 3.2, f~k\tilde{f}_{k} satisfies

n(f0,f~k)supxΩ|f0(x)f~k(x)|Cdk2/d.\ell_{{\mathbb{P}_{n}}}(f_{0},\tilde{f}_{k})\leq\sup_{x\in\Omega}\left|f_{0}(x)-\tilde{f}_{k}(x)\right|\leq C_{d}k^{-2/d}. (102)

where f0(x):=x2f_{0}(x):=\|x\|^{2}. Assume first that t=2Cdk2/dt=2C_{d}k^{-2/d} where CdC_{d} is the constant from (102). For this choice of tt, it follows from the triangle inequality (and (102)) that

{f𝒞(Ω):n(f,f~k)t}{f𝒞(Ω):n(f,f0)Cdk2/d}.\left\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,\tilde{f}_{k})\leq t\right\}\supseteq\left\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq C_{d}k^{-2/d}\right\}.

This immediately implies

Gf~k(t,𝒞(Ω))Gf0(Cdk2/d,𝒞(Ω)).G_{\tilde{f}_{k}}(t,{\mathcal{C}}(\Omega))\geq G_{f_{0}}(C_{d}k^{-2/d},{\mathcal{C}}(\Omega)).

Let G~(t):=Gf~k(t,𝒞(Ω))\tilde{G}(t):=G_{\tilde{f}_{k}}(t,{\mathcal{C}}(\Omega)) and G0(t):=Gf0(t,𝒞(Ω))G_{0}(t):=G_{f_{0}}(t,{\mathcal{C}}(\Omega)) in the rest of this proof (this is for notational convenience). We just proved G~(t)G0(Cdk2/d)\tilde{G}(t)\geq G_{0}(C_{d}k^{-2/d}) for t=2Cdk2/dt=2C_{d}k^{-2/d}. By Sudakov minoration (Lemma B.4), G0(Cdk2/d)G_{0}(C_{d}k^{-2/d}) is bounded from below by

βσnsupϵ>0{ϵlogN(ϵ,{g𝒞(Ω):n(f0,g)Cdk2/d},n)}.\displaystyle\frac{\beta\sigma}{\sqrt{n}}\sup_{\epsilon>0}\left\{\epsilon\sqrt{\log N(\epsilon,\left\{g\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f_{0},g)\leq C_{d}k^{-2/d}\right\},\ell_{{\mathbb{P}_{n}}})}\right\}.

Lemma 4.10 gives a lower bound on the metric entropy appearing above. Applying this for ϵ=c1n2/d\epsilon=c_{1}n^{-2/d}, we get

G~(t)βσn(c1n2/d)n8=βc18σn2/d\tilde{G}(t)\geq\frac{\beta\sigma}{\sqrt{n}}(c_{1}n^{-2/d})\sqrt{\frac{n}{8}}=\frac{\beta c_{1}}{\sqrt{8}}\sigma n^{-2/d}

provided

k(Cdc2)d/2n.k\leq\left(\frac{C_{d}}{c_{2}}\right)^{d/2}n. (103)

The condition above is equivalent to the inequality $c_{2}n^{-2/d}\leq C_{d}k^{-2/d}$, which is required for the application of Lemma 4.10. This gives

G~(t)βc18σn2/dfor t=2Cdk2/d.\tilde{G}(t)\geq\frac{\beta c_{1}}{\sqrt{8}}\sigma n^{-2/d}\qquad\text{for $t=2C_{d}k^{-2/d}$}.

Now for t2Cdk2/dt\leq 2C_{d}k^{-2/d}, we use the fact that xG~(x)x\mapsto\tilde{G}(x) is concave on [0,)[0,\infty) (and that G~(0)=0\tilde{G}(0)=0) to deduce that

G~(t)tG~(2Cdk2/d)2Cdk2/dσ(kn)2/dβc128Cdfor all t2Cdk2/d.\frac{\tilde{G}(t)}{t}\geq\frac{\tilde{G}(2C_{d}k^{-2/d})}{2C_{d}k^{-2/d}}\geq\sigma\left(\frac{k}{n}\right)^{2/d}\frac{\beta c_{1}}{2\sqrt{8}C_{d}}\qquad\text{for all $t\leq 2C_{d}k^{-2/d}$}.

This proves (50) for

cd=min(βc128Cd,2Cd,(Cdc2)d/2),c_{d}=\min\left(\frac{\beta c_{1}}{2\sqrt{8}C_{d}},2C_{d},\left(\frac{C_{d}}{c_{2}}\right)^{d/2}\right),

completing the proof of Lemma 4.9. ∎
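The concavity step used above, namely that $t\mapsto\tilde{G}(t)/t$ is nonincreasing whenever $\tilde{G}$ is concave with $\tilde{G}(0)=0$, can be illustrated numerically (with $\sqrt{t}$ standing in as an arbitrary concave function vanishing at zero):

```python
# If G is concave with G(0) = 0, then t -> G(t)/t is nonincreasing, so
# G(t)/t >= G(t0)/t0 for every 0 < t <= t0, as used in the proof above.
G = lambda u: u ** 0.5              # stand-in concave function with G(0) = 0
t0 = 2.0
pts = [t0 * i / 100 for i in range(1, 101)]
ratios = [G(u) / u for u in pts]    # here ratios are u^(-1/2), decreasing
monotone = all(ratios[i] >= ratios[i + 1] - 1e-12
               for i in range(len(ratios) - 1))
```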

We next prove Lemma 4.10.

Proof of Lemma 4.10.

We shall use a perturbation result that is similar to the one mentioned at the beginning of the proof of Lemma 4.6. Because we are dealing with the discrete metric n\ell_{{\mathbb{P}}_{n}} here, this argument might be nonstandard so we provide an explicit construction (some aspects of this construction were also used in the proof of Proposition 2.3). Let

g(x1,x2,,xd)={i=1dcos3(πxi)(x1,x2,,xd)[1/2,1/2]d0(x1,x2,,xd)[1/2,1/2]d.g(x_{1},x_{2},\ldots,x_{d})=\left\{\begin{array}[]{ll}{\sum_{i=1}^{d}\cos^{3}(\pi x_{i})}&{(x_{1},x_{2},\ldots,x_{d})\in[-1/2,1/2]^{d}}\\ {0}&{(x_{1},x_{2},\ldots,x_{d})\notin[-1/2,1/2]^{d}}\end{array}\right..

Note that gg is smooth, 2gxixj=0\frac{\partial^{2}g}{\partial x_{i}\partial x_{j}}=0 for iji\neq j and

|2gxi2(x1,,xd)|423π2\left|\frac{\partial^{2}g}{\partial x_{i}^{2}}(x_{1},\dots,x_{d})\right|\leq\frac{4\sqrt{2}}{3}\pi^{2}

which implies that the Hessian of $g$ is dominated by $(4\sqrt{2}\pi^{2}/3)$ times the identity matrix. It is also easy to check that the Hessian of $g$ equals zero on the boundary of $[-1/2,1/2]^{d}$.

Now for every grid point s:=(k1δ,,kdδ)s:=(k_{1}\delta,\dots,k_{d}\delta) in 𝒮Ω\mathcal{S}\cap\Omega, consider the function

gs(x1,,xd):=δ2g(x1k1δδ,,xdkdδδ).g_{s}(x_{1},\dots,x_{d}):=\delta^{2}g\left(\frac{x_{1}-k_{1}\delta}{\delta},\dots,\frac{x_{d}-k_{d}\delta}{\delta}\right).

Clearly gsg_{s} is supported on the cube

j=1d[(kj1/2)δ,(kj+1/2)δ]\prod_{j=1}^{d}[(k_{j}-1/2)\delta,(k_{j}+1/2)\delta]

and these cubes for different grid points have disjoint interiors.

We now consider binary vectors in {0,1}n\{0,1\}^{n}. We shall index each ξ{0,1}n\xi\in\{0,1\}^{n} by ξs,s𝒮Ω\xi_{s},s\in\mathcal{S}\cap\Omega. For every ξ=(ξs,s𝒮Ω){0,1}n\xi=(\xi_{s},s\in\mathcal{S}\cap\Omega)\in\{0,1\}^{n}, consider the function

Gξ(x)=f0(x)+342π2s𝒮Ωξsgs(x).G_{\xi}(x)=f_{0}(x)+\frac{3}{4\sqrt{2}\pi^{2}}\sum_{s\in\mathcal{S}\cap\Omega}\xi_{s}g_{s}(x). (104)

It can be verified that GξG_{\xi} is convex because f0f_{0} has constant Hessian equal to 22 times the identity, the Hessian of gsg_{s} is bounded by (42π2/3)(4\sqrt{2}\pi^{2}/3) and the supports of gs,s𝒮Ωg_{s},s\in\mathcal{S}\cap\Omega have disjoint interiors. Note further that for ξ,ξ{0,1}n\xi,\xi^{\prime}\in\{0,1\}^{n} and s𝒮Ωs\in\mathcal{S}\cap\Omega,

Gξ(s)Gξ(s)=3dδ242π2(ξsξs).G_{\xi}(s)-G_{\xi^{\prime}}(s)=\frac{3d\delta^{2}}{4\sqrt{2}\pi^{2}}\left(\xi_{s}-\xi^{\prime}_{s}\right).

This implies that

n(Gξ,Gξ)=3dδ242π2Υ(ξ,ξ)n\ell_{{\mathbb{P}_{n}}}(G_{\xi},G_{\xi^{\prime}})=\frac{3d\delta^{2}}{4\sqrt{2}\pi^{2}}\sqrt{\frac{\Upsilon(\xi,\xi^{\prime})}{n}}

where Υ(ξ,ξ):=s𝒮ΩI{ξsξs}\Upsilon(\xi,\xi^{\prime}):=\sum_{s\in\mathcal{S}\cap\Omega}I\{\xi_{s}\neq\xi^{\prime}_{s}\} is the Hamming distance between ξ\xi and ξ\xi^{\prime}. The Varshamov-Gilbert lemma (see e.g., [37, Lemma 4.7]) asserts the existence of a subset WW of {0,1}n\{0,1\}^{n} with cardinality |W|exp(n/8)|W|\geq\exp(n/8) such that Υ(ξ,ξ)n/4\Upsilon(\xi,\xi^{\prime})\geq n/4 for all ξ,ξW\xi,\xi^{\prime}\in W with ξξ\xi\neq\xi^{\prime}. We then have

n(Gξ,Gξ)3dδ282π2for all ξ,ξW with ξξ.\ell_{{\mathbb{P}_{n}}}(G_{\xi},G_{\xi^{\prime}})\geq\frac{3d\delta^{2}}{8\sqrt{2}\pi^{2}}\qquad\text{for all $\xi,\xi^{\prime}\in W$ with $\xi\neq\xi^{\prime}$}.

Inequality (16) then gives

n(Gξ,Gξ)c1n2/dfor all ξ,ξW with ξξ.\ell_{{\mathbb{P}_{n}}}(G_{\xi},G_{\xi^{\prime}})\geq c_{1}n^{-2/d}\qquad\text{for all $\xi,\xi^{\prime}\in W$ with $\xi\neq\xi^{\prime}$}.

for a dimensional constant c1c_{1}. One can also check that

n(Gξ,f0)3dδ242π2c2n2/d\ell_{{\mathbb{P}_{n}}}(G_{\xi},f_{0})\leq\frac{3d\delta^{2}}{4\sqrt{2}\pi^{2}}\leq c_{2}n^{-2/d}

for another dimensional constant c2c_{2} completing the proof of Lemma 4.10. ∎
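The identity relating $\ell_{{\mathbb{P}_n}}(G_{\xi},G_{\xi^{\prime}})$ to the Hamming distance $\Upsilon(\xi,\xi^{\prime})$ can be sanity-checked numerically; in the sketch below, $c$ stands in for the factor $3d\delta^{2}/(4\sqrt{2}\pi^{2})$:

```python
import random

# Check of the displayed identity: the function values differ by
# c * (xi_s - xi'_s) at each of the n design points, so the empirical L2
# distance is exactly c * sqrt(Upsilon / n), Upsilon the Hamming distance.
random.seed(2)
n, c = 500, 0.37                        # illustrative values
xi = [random.randint(0, 1) for _ in range(n)]
xi2 = [random.randint(0, 1) for _ in range(n)]
upsilon = sum(a != b for a, b in zip(xi, xi2))
emp_l2 = (sum((c * (a - b)) ** 2 for a, b in zip(xi, xi2)) / n) ** 0.5
identity_holds = abs(emp_l2 - c * (upsilon / n) ** 0.5) < 1e-12
```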

C.6 Proofs of inequality (46) and inequality (48)

In this subsection, we provide proofs of the discrete metric entropy results given by inequalities (46) and (48). We shall assume and use Theorem 4.11 for these proofs. The proof of Theorem 4.11 is given in the next subsection (Subsection C.7).

Proof of Inequality (46).

By specializing to the case p=2p=2 and writing n\ell_{{\mathbb{P}}_{n}} for 𝒮(,Ω,2)\ell_{\mathcal{S}}(\cdot,\Omega,2), we can rephrase the conclusion of Theorem 4.11 as:

logN(ϵ,{g𝒞(Ω):n(g,0)t},n)[cdlog(1/δ)]F(tϵ)d/2.\log N\left(\epsilon,\{g\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(g,0)\leq t\},\ell_{{\mathbb{P}}_{n}}\right)\leq\left[c_{d}\log(1/\delta)\right]^{F}\left(\frac{t}{\epsilon}\right)^{d/2}.

Here the $0$ in $\ell_{{\mathbb{P}}_{n}}(g,0)\leq t$ refers to the function that is identically equal to zero. Because in the representation (14) the number $F$ is assumed to be bounded from above by a constant depending on $d$ alone, we have $[c_{d}\log(1/\delta)]^{F}\leq C_{d}(\log(1/\delta))^{F}$. We then use (16) to bound $\log(1/\delta)$ by a constant multiple of $\log n$. These give

logN(ϵ,{f𝒞(Ω):n(f,0)t},n)Cd(logn)F(tϵ)d/2.\log N\left(\epsilon,\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,0)\leq t\},\ell_{{\mathbb{P}}_{n}}\right)\leq C_{d}(\log n)^{F}\left(\frac{t}{\epsilon}\right)^{d/2}.

Because f𝒞(Ω)f\in{\mathcal{C}}(\Omega) if and only if fg0𝒞(Ω)f-g_{0}\in{\mathcal{C}}(\Omega) for every affine function g0g_{0} on Ω\Omega (i.e., g0𝒜(Ω)g_{0}\in{\mathcal{A}}(\Omega)), we deduce

logN(ϵ,{f𝒞(Ω):n(f,g0)t},n)Cd(logn)F(tϵ)d/2\log N\left(\epsilon,\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,g_{0})\leq t\},\ell_{{\mathbb{P}}_{n}}\right)\leq C_{d}(\log n)^{F}\left(\frac{t}{\epsilon}\right)^{d/2}

for every g0𝒜(Ω)g_{0}\in{\mathcal{A}}(\Omega). By triangle inequality,

{f𝒞(Ω):n(f,f0)t}{f𝒞(Ω):n(f,g0)t+n(f0,g0)}\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq t\}\subseteq\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,g_{0})\leq t+\ell_{{\mathbb{P}}_{n}}(f_{0},g_{0})\}

for every g0𝒜(Ω)g_{0}\in{\mathcal{A}}(\Omega). Thus

logN(ϵ,{f𝒞(Ω):n(f,f0)t},n)Cd(logn)F(t+n(f0,g0)ϵ)d/2\log N\left(\epsilon,\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq t\},\ell_{{\mathbb{P}}_{n}}\right)\leq C_{d}(\log n)^{F}\left(\frac{t+\ell_{{\mathbb{P}}_{n}}(f_{0},g_{0})}{\epsilon}\right)^{d/2}

Because g0𝒜(Ω)g_{0}\in{\mathcal{A}}(\Omega) is arbitrary in the right hand side above, we can take the infimum over g0𝒜(Ω)g_{0}\in{\mathcal{A}}(\Omega). This allows us to replace n(f0,g0)\ell_{{\mathbb{P}}_{n}}(f_{0},g_{0}) on the right hand side above by infg0𝒜(Ω)n(f0,g0)𝔏\inf_{g_{0}\in{\mathcal{A}}(\Omega)}\ell_{{\mathbb{P}}_{n}}(f_{0},g_{0})\leq{\mathfrak{L}} leading to the required inequality (46). ∎

Proof of Inequality (48).

Fix $f_{0}\in\mathfrak{C}_{k,h}(\Omega)$. By the definition of $\mathfrak{C}_{k,h}(\Omega)$, there exist $k$ subsets $\Omega_{1},\dots,\Omega_{k}$ satisfying the following properties:

1. $f_{0}$ is affine on each $\Omega_{i}$,

2. each $\Omega_{i}$ can be written as an intersection of at most $h$ slabs (i.e., as in (14) with $F=h$), and

3. $\Omega_{1}\cap\mathcal{S},\dots,\Omega_{k}\cap\mathcal{S}$ are disjoint with $\cup_{i=1}^{k}(\Omega_{i}\cap\mathcal{S})=\Omega\cap\mathcal{S}$.

Note that nn is the cardinality of Ω𝒮\Omega\cap\mathcal{S}. Let nin_{i} denote the cardinality of Ωi𝒮\Omega_{i}\cap\mathcal{S} for each i=1,,ki=1,\dots,k. We can assume that each ni>0n_{i}>0 for otherwise we can simply drop that Ωi\Omega_{i}. For each f𝒞(Ω)f\in{\mathcal{C}}(\Omega) such that n(f0,f)t\ell_{{\mathbb{P}}_{n}}(f_{0},f)\leq t and 1ik1\leq i\leq k, let σi(f)\sigma_{i}(f) be the smallest positive integer for which

sΩi𝒮(f(s)f0(s))2niσi(f)t2.\sum_{s\in\Omega_{i}\cap\mathcal{S}}\left(f(s)-f_{0}(s)\right)^{2}\leq n_{i}\sigma_{i}(f)t^{2}. (105)

Because n(f0,f)t\ell_{{\mathbb{P}}_{n}}(f_{0},f)\leq t, we have 1σi(f)n1\leq\sigma_{i}(f)\leq n for each ii. Also because σi(f)\sigma_{i}(f) is the smallest integer satisfying (105), we have

sΩi𝒮(f(s)f0(s))2ni(σi(f)1)t2\sum_{s\in\Omega_{i}\cap\mathcal{S}}\left(f(s)-f_{0}(s)\right)^{2}\geq n_{i}\left(\sigma_{i}(f)-1\right)t^{2}

which implies that

i=1kni(σi(f)1)t2\displaystyle\sum_{i=1}^{k}n_{i}\left(\sigma_{i}(f)-1\right)t^{2} i=1ksΩi𝒮(f(s)f0(s))2\displaystyle\leq\sum_{i=1}^{k}\sum_{s\in\Omega_{i}\cap\mathcal{S}}\left(f(s)-f_{0}(s)\right)^{2}
=sΩ𝒮(f(s)f0(s))2nt2,\displaystyle=\sum_{s\in\Omega\cap\mathcal{S}}\left(f(s)-f_{0}(s)\right)^{2}\leq nt^{2},

leading to

i=1kniσi(f)n+i=1kni=2n.\sum_{i=1}^{k}n_{i}\sigma_{i}(f)\leq n+\sum_{i=1}^{k}n_{i}=2n.
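This pigeonhole bound is easy to verify numerically. The sketch below is our own illustration (not code from the paper; the helper `sigma_weights` is hypothetical): it samples residuals f(s)-f0(s) grouped into cells, computes each σi as the smallest positive integer satisfying (105), and confirms the bound above.

```python
import math
import random

def sigma_weights(cells, t):
    """cells[i] lists the residuals f(s) - f0(s) over Omega_i ∩ S.
    Returns (n, sum_i n_i * sigma_i), where sigma_i is the smallest
    positive integer with sum of squared residuals <= n_i * sigma_i * t^2."""
    n = sum(len(c) for c in cells)
    weighted = 0
    for c in cells:
        n_i, ss = len(c), sum(x * x for x in c)
        weighted += n_i * max(1, math.ceil(ss / (n_i * t * t)))
    return n, weighted

random.seed(0)
t = 1.0
cells = [[random.gauss(0, 1) for _ in range(random.randint(1, 50))]
         for _ in range(10)]
# rescale so that the hypothesis (1/n) * sum of squared residuals <= t^2 holds
n = sum(len(c) for c in cells)
total = sum(x * x for c in cells for x in c)
scale = math.sqrt(0.99 * n * t * t / total)
cells = [[scale * x for x in c] for c in cells]
n, weighted = sigma_weights(cells, t)
print(weighted <= 2 * n)  # True
```

The bound holds for any grouping of the residuals, since each ceiling adds at most one to the exact ratio.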

Let

Σ:={(σ1(f),,σk(f)):f𝒞(Ω),n(f,f0)t},\Sigma:=\left\{(\sigma_{1}(f),\dots,\sigma_{k}(f)):f\in{\mathcal{C}}(\Omega),\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq t\right\},

and note that the cardinality of Σ\Sigma is at most nkn^{k} as 1σi(f)n1\leq\sigma_{i}(f)\leq n for each ii. For each (σ1,,σk)Σ(\sigma_{1},\dots,\sigma_{k})\in\Sigma, let

σ1,,σk={f𝒞(Ω):n(f,f0)t and σi(f)=σi for each i=1,,k}.{\cal F}_{\sigma_{1},\dots,\sigma_{k}}=\left\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq t\text{ and }\sigma_{i}(f)=\sigma_{i}\text{ for each }i=1,\dots,k\right\}.

By construction, we have

{f𝒞(Ω):n(f,f0)t}=(σ1,,σk)Σσ1,,σk\left\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq t\right\}=\bigcup_{(\sigma_{1},\dots,\sigma_{k})\in\Sigma}{\cal F}_{\sigma_{1},\dots,\sigma_{k}}

so that

logN(ϵ,{f𝒞(Ω):n(f,f0)t},n)\displaystyle\log N\left(\epsilon,\{f\in{\mathcal{C}}(\Omega):\ell_{{\mathbb{P}}_{n}}(f,f_{0})\leq t\},\ell_{{\mathbb{P}}_{n}}\right)
log((σ1,,σk)ΣN(ϵ,σ1,,σk,n))\displaystyle\leq\log\left(\sum_{(\sigma_{1},\dots,\sigma_{k})\in\Sigma}N\left(\epsilon,{\cal F}_{\sigma_{1},\dots,\sigma_{k}},\ell_{{\mathbb{P}}_{n}}\right)\right)
max(σ1,,σk)ΣlogN(ϵ,σ1,,σk,n)+log|Σ|\displaystyle\leq\max_{(\sigma_{1},\dots,\sigma_{k})\in\Sigma}\log N(\epsilon,{\cal F}_{\sigma_{1},\dots,\sigma_{k}},\ell_{{\mathbb{P}}_{n}})+\log|\Sigma|
max(σ1,,σk)ΣlogN(ϵ,σ1,,σk,n)+klogn\displaystyle\leq\max_{(\sigma_{1},\dots,\sigma_{k})\in\Sigma}\log N(\epsilon,{\cal F}_{\sigma_{1},\dots,\sigma_{k}},\ell_{{\mathbb{P}}_{n}})+k\log n (106)

where |Σ||\Sigma| denotes the cardinality of the finite set Σ\Sigma, and we used |Σ|nk|\Sigma|\leq n^{k}. We shall now upper-bound logN(ϵ,σ1,,σk,n)\log N(\epsilon,{\cal F}_{\sigma_{1},\dots,\sigma_{k}},\ell_{{\mathbb{P}}_{n}}) for a fixed (σ1,,σk)Σ(\sigma_{1},\dots,\sigma_{k})\in\Sigma. The idea is that this will follow from Theorem 4.11 applied to each Ωi,i=1,,k\Omega_{i},i=1,\dots,k. First observe that Theorem 4.11 applied to Ωi\Omega_{i} and ff replaced by ff0f-f_{0} gives

logN(ϵ,{f𝒞(Ωi):𝒮(ff0,Ωi,2)t},𝒮(,Ωi,2))(cdlog(1/δ))h(tϵ)d/2\begin{split}&\log N\left(\epsilon,\{f\in{\mathcal{C}}(\Omega_{i}):\ell_{\mathcal{S}}(f-f_{0},\Omega_{i},2)\leq t\},\ell_{\mathcal{S}}(\cdot,\Omega_{i},2)\right)\\ &\leq\left(c_{d}\log(1/\delta)\right)^{h}\left(\frac{t}{\epsilon}\right)^{d/2}\end{split} (107)

for every ϵ,t>0\epsilon,t>0. Here it is crucial that f0f_{0} is affine on Ωi\Omega_{i} (which ensures f𝒞(Ωi)ff0𝒞(Ωi)f\in{\mathcal{C}}(\Omega_{i})\iff f-f_{0}\in{\mathcal{C}}(\Omega_{i})); also note that FF in (52) is replaced by hh here because it is assumed that each Ωi\Omega_{i} is of the form (14) with F=hF=h (this is part of the assumption f0k,h(Ω)f_{0}\in\mathfrak{C}_{k,h}(\Omega)).

1nisΩi𝒮(f(s)f0(s))2ϵ22σi\displaystyle\frac{1}{n_{i}}\sum_{s\in\Omega_{i}\cap\mathcal{S}}\left(f(s)-f_{0}(s)\right)^{2}\leq\frac{\epsilon^{2}}{2}\sigma_{i}

for each i=1,,ki=1,\dots,k, then

1nsΩ𝒮(f(s)f0(s))2ϵ22ni=1kniσi(f)ϵ2.\displaystyle\frac{1}{n}\sum_{s\in\Omega\cap\mathcal{S}}\left(f(s)-f_{0}(s)\right)^{2}\leq\frac{\epsilon^{2}}{2n}\sum_{i=1}^{k}n_{i}\sigma_{i}(f)\leq\epsilon^{2}.

As a result

logN(ϵ,σ1,,σk,n)\displaystyle\log N\left(\epsilon,{\cal F}_{\sigma_{1},\dots,\sigma_{k}},\ell_{{\mathbb{P}}_{n}}\right)
i=1klogN(ϵσi/2,{f𝒞(Ωi):𝒮(ff0,Ωi,2)tσi},𝒮(,Ωi,2))\displaystyle\leq\sum_{i=1}^{k}\log N\left(\epsilon\sqrt{\sigma_{i}/2},\{f\in{\mathcal{C}}(\Omega_{i}):\ell_{\mathcal{S}}(f-f_{0},\Omega_{i},2)\leq t\sqrt{\sigma_{i}}\},\ell_{\mathcal{S}}(\cdot,\Omega_{i},2)\right)

Inequality (107) then gives

logN(ϵ,σ1,,σk,n)k(cdlog(1/δ))h(tϵ)d/2\displaystyle\log N\left(\epsilon,{\cal F}_{\sigma_{1},\dots,\sigma_{k}},\ell_{{\mathbb{P}}_{n}}\right)\leq k\left(c_{d}\log(1/\delta)\right)^{h}\left(\frac{t}{\epsilon}\right)^{d/2}

The proof of (48) is now completed by combining the above inequality with (106) (note that lognCdlog(1/δ)\log n\leq C_{d}\log(1/\delta) because of (16)). ∎

C.7 Proof of Theorem 4.11

This proof has two main steps. In the first step (which is the focus of the next subsection), we stay away from the boundary and prove the entropy bound with a modified metric that is defined only in the interior of the domain Ω\Omega. In the second step (which is the focus of Subsection C.7.2), we extend this result to reach the boundary of Ω\Omega.

Throughout this proof, we use the setting described in Subsection 2.2. In particular, 𝒮\mathcal{S} is the regular dd-dimensional δ\delta-grid (15) and Ω\Omega is of the form (14) and satisfies (3). X1,,XnX_{1},\dots,X_{n} are an enumeration of the points in Ω𝒮\Omega\cap\mathcal{S} with nn denoting the cardinality of Ω𝒮\Omega\cap\mathcal{S}. Also 𝒮(f,Ω,p)\ell_{\mathcal{S}}(f,\Omega,p) is defined as in (51).

C.7.1 Away from the boundary

The goal of this subsection is to prove the following proposition, which is the analogue of Lemma B.6 in this discrete setting. Let ω0\omega_{0} denote the center of the John ellipsoid of Ω\Omega (recall that the John ellipsoid of Ω\Omega is the unique ellipsoid of maximum volume contained in Ω\Omega). For λ>0\lambda>0, let

Ωλ=ω0+λ(Ωω0)\displaystyle\Omega_{\lambda}=\omega_{0}+\lambda(\Omega-\omega_{0}) (108)

and note that |Ωλ|=λd|Ω||\Omega_{\lambda}|=\lambda^{d}|\Omega|, where |Ωλ||\Omega_{\lambda}| and |Ω||\Omega| denote volumes of Ωλ\Omega_{\lambda} and Ω\Omega.

Proposition C.1.

Suppose δrd/(400d3/2)\delta\leq r_{d}/(400d^{3/2}). Fix t>0t>0 and 0<ϵ<10<\epsilon<1. Then there exists a set 𝒩{\mathcal{N}} consisting of at most exp(γd(t/ϵ)d/2)\exp(\gamma_{d}\cdot(t/\epsilon)^{d/2}) functions such that for every f𝒞(Ω)f\in{\mathcal{C}}(\Omega) with 𝒮(f,Ω,p)t\ell_{\mathcal{S}}(f,\Omega,p)\leq t, there exists g𝒩g\in{\mathcal{N}} satisfying |f(x)g(x)|<ϵ|f(x)-g(x)|<\epsilon for all xΩ0.9x\in\Omega_{0.9}, where Ω0.9\Omega_{0.9} is as defined in (108) and γd\gamma_{d} is a constant depending only on dd.

Proposition C.1 states that, under the supremum metric on the interior set Ω0.9\Omega_{0.9}, the metric entropy of {f𝒞(Ω):𝒮(f,Ω,p)t}\left\{f\in{\mathcal{C}}(\Omega):\ell_{\mathcal{S}}(f,\Omega,p)\leq t\right\} is at most γd(t/ϵ)d/2\gamma_{d}(t/\epsilon)^{d/2}. To prove Proposition C.1, we need some preliminary results. The next result shows that the number of grid points contained in Ω\Omega is of order δd\delta^{-d} (this provides a proof of the claim (16)).

Lemma C.2.

Suppose δrd/(10d3/2)\delta\leq r_{d}/(10d^{3/2}) where rdr_{d} is the quantity from (3). Then

910|Ω|δdn1110|Ω|δd\displaystyle\frac{9}{10}|\Omega|\delta^{-d}\leq n\leq\frac{11}{10}|\Omega|\delta^{-d}

where |Ω||\Omega| is the volume of Ω\Omega.

Proof.

First observe that

i=1n(Xi+[δ/2,δ/2]d)Ω+[δ/2,δ/2]dΩ+dδ2Bd.\bigcup_{i=1}^{n}(X_{i}+[-\delta/2,\delta/2]^{d})\subseteq\Omega+[-\delta/2,\delta/2]^{d}\subseteq\Omega+\frac{\sqrt{d}\delta}{2}B_{d}.

Because of (3), Ω\Omega contains the ball of radius rdr_{d} centered at the origin which implies

|Ω+dδ2Bd|(1+dδ2rd)d|Ω|(1+120d)d|Ω|1110|Ω|.|\Omega+\frac{\sqrt{d}\delta}{2}B_{d}|\leq\left(1+\frac{\sqrt{d}\delta}{2r_{d}}\right)^{d}|\Omega|\leq\left(1+\frac{1}{20d}\right)^{d}|\Omega|\leq\frac{11}{10}|\Omega|.

Volume comparison with the last pair of displayed relations gives us

n1110|Ω|δd.n\leq\frac{11}{10}|\Omega|\delta^{-d}.

On the other hand, let UU be the union of the cubes Xi+[δ/2,δ/2]dX_{i}+[-\delta/2,\delta/2]^{d}. The volume of UU is nδdn\delta^{d}. Since the union of Xi+[δ,δ]dX_{i}+[-\delta,\delta]^{d} covers Ω\Omega, we have U+[δ/2,δ/2]dΩU+[-\delta/2,\delta/2]^{d}\supseteq\Omega. In particular, UU contains the set

{xΩ:dist(x,Ω)dδ/2}.\{x\in\Omega:{\rm dist}(x,\partial\Omega)\geq\sqrt{d}\delta/2\}.

Since Ω\Omega contains the ball of radius rdr_{d} centered at the origin, if we let

Ω^=(1dδ2rd)Ω,\widehat{\Omega}=\left(1-\frac{\sqrt{d}\delta}{2r_{d}}\right)\Omega,

then the distance between any xΩ^x\in\widehat{\Omega} and Ω\partial\Omega is at least dδ/2\sqrt{d}\delta/2. Hence UΩ^U\supset\widehat{\Omega}. Consequently

n=|U|δd|Ω^|δd=(1dδ2rd)d|Ω|δd.n=|U|\delta^{-d}\geq|\widehat{\Omega}|\delta^{-d}=\left(1-\frac{\sqrt{d}\delta}{2r_{d}}\right)^{d}|\Omega|\delta^{-d}.

Using δrd/(10d3/2)\delta\leq r_{d}/(10d^{3/2}) and (11/(20d))d9/10(1-1/(20d))^{d}\geq 9/10, the above implies that n(9/10)|Ω|δdn\geq(9/10)|\Omega|\delta^{-d} completing the proof of Lemma C.2. ∎
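For a concrete instance of this count, take d=2 and Ω=[-1,1]², which contains the unit ball (so r_d=1). The sketch below is our own illustration (not from the paper) and assumes the grid is δℤ^d with no offset; it counts grid points inside Ω and checks the two-sided bound of Lemma C.2.

```python
# Count points of the grid delta * Z^d inside Omega = [-1,1]^2 and compare
# with |Omega| * delta^{-d}; delta is chosen within the range of Lemma C.2.
d, r_d = 2, 1.0
delta = r_d / (20 * d ** 1.5)          # satisfies delta <= r_d / (10 d^{3/2})
k_max = int(1.0 / delta)               # grid indices k with |k * delta| <= 1
n = (2 * k_max + 1) ** d               # grid points inside the square
volume = 2.0 ** d                      # |Omega| = 4
lower = 0.9 * volume * delta ** (-d)
upper = 1.1 * volume * delta ** (-d)
print(lower <= n <= upper)  # True
```

The slack between the two sides shrinks as δ decreases, matching the boundary-layer argument in the proof.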

Lemma C.3.

Suppose δrd/(10d3/2)\delta\leq r_{d}/(10d^{3/2}). Then infxΩf(x)20dt\inf_{x\in\Omega}f(x)\geq-20dt for every f𝒞(Ω)f\in{\mathcal{C}}(\Omega) with 𝒮(f,Ω,p)t\ell_{\mathcal{S}}(f,\Omega,p)\leq t.

Proof.

Let x0x_{0} be the minimizer of ff on Ω\Omega. If f(x0)0f(x_{0})\geq 0, then there is nothing to prove; otherwise, the set K:={xΩ|f(x)0}K:=\left\{x\in\Omega\;\middle|\;f(x)\leq 0\right\} is a closed convex set containing x0x_{0}. Denote K(t)=x0+t(Kx0)K(t)=x_{0}+t(K-x_{0}), and let K^=K(1+ζ)K(1ζ)\widehat{K}=K({1+\zeta})\setminus K({1-\zeta}), where ζ:=(10d)1\zeta:=(10d)^{-1}. We show that for all xΩK^x\in\Omega\setminus\widehat{K}, |f(x)|ζ|f(x0)||f(x)|\geq\zeta|f(x_{0})|. Indeed, define a function gg on Ω\Omega so that g(x0)=f(x0)g(x_{0})=f(x_{0}), g(γ)=f(γ)g(\gamma)=f(\gamma) for all γK\gamma\in\partial K, and gg is linear on each Lγ:={x=x0+t(γx0)Ω|t0}L_{\gamma}:=\left\{x=x_{0}+t(\gamma-x_{0})\in\Omega\;\middle|\;t\geq 0\right\}. Then, by the convexity of ff on each LγL_{\gamma}, we have |f(x)||g(x)||f(x)|\geq|g(x)| on Ω\Omega. Thus, for all xΩK^x\in\Omega\setminus\widehat{K},

|f(x)||g(x)|=|g(γ)|+xγx0γ|f(x0)|ζ|f(x0)|.|f(x)|\geq|g(x)|=|g(\gamma)|+\frac{\|x-\gamma\|}{\|x_{0}-\gamma\|}|f(x_{0})|\geq\zeta|f(x_{0})|.

Next, we show that most of the grid points in Ω\Omega are outside K^\widehat{K}. Indeed, if ss is a grid point in K^\widehat{K}, then s+[δ/2,δ/2]ds+[-\delta/2,\delta/2]^{d}\subset K(1+ζ)Ω+[δ/2,δ/2]dK({1+\zeta})\cap\Omega+[-\delta/2,\delta/2]^{d} and at least one half of the cube s+[δ/2,δ/2]ds+[-\delta/2,\delta/2]^{d} lies outside K(1ζ)K({1-\zeta}). Thus, the number of grid points in K^\widehat{K} is bounded by

2|(K(1+ζ)Ω+[δ/2,δ/2]d)K(1ζ)|δd.2|(K({1+\zeta})\cap\Omega+[-\delta/2,\delta/2]^{d})\setminus K({1-\zeta})|\delta^{-d}.

Since by the Minkowski-Steiner formula (see e.g., [43]), |(A+B)A||(A+B)\setminus A| can be expressed as a sum of mixed volumes of AA and BB, and because the mixed volumes are monotone, we have

|(A+B)A||(C+D)C||(A+B)\setminus A|\leq|(C+D)\setminus C|

for all convex sets CAC\supset A and DBD\supset B. This gives

|(K(1+ζ)Ω+[δ/2,δ/2]d)K(1+ζ)Ω|\displaystyle|(K({1+\zeta})\cap\Omega+[-\delta/2,\delta/2]^{d})\setminus K({1+\zeta})\cap\Omega|
|([1,1]d+[δ/2,δ/2]d)[1,1]d|=(2+δ)d2d.\displaystyle\leq|([-1,1]^{d}+[-\delta/2,\delta/2]^{d})\setminus[-1,1]^{d}|=(2+\delta)^{d}-2^{d}.

Also |K(1+ζ)ΩK(1ζ)|[1(1ζ1+ζ)d]|K(1+ζ)Ω||K({1+\zeta})\cap\Omega\setminus K({1-\zeta})|\leq\left[1-\left(\frac{1-\zeta}{1+\zeta}\right)^{d}\right]|K({1+\zeta})\cap\Omega| so that

|(K(1+ζ)Ω+[δ/2,δ/2]d)K(1ζ)|\displaystyle|(K({1+\zeta})\cap\Omega+[-\delta/2,\delta/2]^{d})\setminus K({1-\zeta})|
[1(1ζ1+ζ)d]|K(1+ζ)Ω|+(2+δ)d2d3dζ|Ω|.\displaystyle\leq\left[1-\left(\frac{1-\zeta}{1+\zeta}\right)^{d}\right]|K({1+\zeta})\cap\Omega|+(2+\delta)^{d}-2^{d}\leq 3d\zeta|\Omega|.

Thus, the number of grid points in K^\widehat{K} is bounded by

6dζ|Ω|δd6dζ109n7dζn.6d\zeta|\Omega|\delta^{-d}\leq 6d\zeta\cdot\frac{10}{9}n\leq 7d\zeta n.

Hence,

ntps𝒮(ΩK^)|f(s)|p(17dζ)n(ζ|f(x0)|)p,nt^{p}\geq\sum_{s\in\mathcal{S}\cap(\Omega\setminus\widehat{K})}|f(s)|^{p}\geq(1-7d\zeta)n\cdot(\zeta|f(x_{0})|)^{p},

which implies that f(x0)21/pζ1t20dtf(x_{0})\geq-2^{1/p}\zeta^{-1}t\geq-20dt by using ζ=(10d)1\zeta=(10d)^{-1} and 21/p22^{1/p}\leq 2 for p[1,)p\in[1,\infty). ∎

Lemma C.4.

Suppose δrd/(400d3/2)\delta\leq r_{d}/(400d^{3/2}). Then, at any point PP on the boundary of Ω0.95\Omega_{0.95}, any hyperplane passing through PP cuts Ω\Omega into two parts. The part that does not contain the center of John ellipsoid of Ω\Omega as its interior point contains at least (20d)d1n(20d)^{-d-1}\cdot n grid points.

Proof.

Since PP is on the boundary of Ω0.95\Omega_{0.95}, any hyperplane passing through PP cuts Ω\Omega into two parts. Suppose LL is a part that does not contain the center of the John ellipsoid of Ω\Omega as an interior point. We prove that |L|12(20d)d|Ω||L|\geq\frac{1}{2}(20d)^{-d}|\Omega|. Because the ratio |L|/|Ω||L|/|\Omega| is invariant under affine transformations, we estimate |TL|/|TΩ||TL|/|T\Omega|, where TT is an affine transform so that the John ellipsoid of TΩT\Omega is the unit ball BdB_{d}. Then, it is known that TΩT\Omega is contained in a ball of radius dd (see e.g., [5, Lecture 3]). Because the distance from (TΩ)0.95(T\Omega)_{0.95} to the boundary of TΩT\Omega is at least 120\frac{1}{20}, TLTL contains half of the ball with center at TPTP and radius 120\frac{1}{20}. Thus, TLTL has volume at least 1220d|Bd|\frac{1}{2}20^{-d}|B_{d}|. Since TΩT\Omega is contained in the ball of radius dd, we have |TΩ|dd|Bd||T\Omega|\leq d^{d}|B_{d}|. This implies that |TL|12(20d)d|TΩ||TL|\geq\frac{1}{2}(20d)^{-d}|T\Omega|. Hence |L|12(20d)d|Ω||L|\geq\frac{1}{2}(20d)^{-d}|\Omega|.

Because the John ellipsoid of Ω\Omega contains a ball of radius at least 400d3/2δ400d^{3/2}\delta, the distance from Ω0.95\Omega_{0.95} to the boundary of Ω\Omega is at least 20d3/2δ20d^{3/2}\delta. Thus, LL contains a ball of radius at least 10d3/2δ10d^{3/2}\delta. By Lemma C.2, the number of grid points in LL is at least 910|L|δd920(20d)d|Ω|δd\frac{9}{10}|L|\delta^{-d}\geq\frac{9}{20}(20d)^{-d}|\Omega|\delta^{-d}. The statement of Lemma C.4 then follows by using Lemma C.2 one more time. ∎

Lemma C.5.

Suppose δrd/(400d3/2)\delta\leq r_{d}/(400d^{3/2}). Then supxΩ0.95f(x)(20d)d+1pt\sup_{x\in\Omega_{0.95}}f(x)\leq(20d)^{\frac{d+1}{p}}t for every f𝒞(Ω)f\in{\mathcal{C}}(\Omega) with 𝒮(f,Ω,p)t\ell_{\mathcal{S}}(f,\Omega,p)\leq t.

Proof.

Let zz be the maximizer of ff on Ω0.95\Omega_{0.95}. By convexity of ff, the point zz must be on the boundary of Ω0.95\Omega_{0.95}. If f(z)0f(z)\leq 0, there is nothing to prove. So we assume f(z)>0f(z)>0. The convexity of ff implies that zz lies on the boundary of the convex set K:={xΩ:f(x)f(z)}Ω0.95K:=\{x\in\Omega:f(x)\leq f(z)\}\supseteq\Omega_{0.95}. Take a hyperplane that supports KK at zz; it cuts Ω\Omega into two parts. Let LL be the part that does not contain KK. Then, f(x)f(z)f(x)\geq f(z) for all xLx\in L. Let mm denote the cardinality of L𝒮L\cap\mathcal{S}. By Lemma C.4, we have

m(20d)d1n.\displaystyle m\geq(20d)^{-d-1}n.

Since f(x)f(z)>0f(x)\geq f(z)>0 for all xLx\in L, we have

mf(z)psL𝒮|f(s)|psΩ𝒮|f(s)|pntp.mf(z)^{p}\leq\sum_{s\in L\cap\mathcal{S}}|f(s)|^{p}\leq\sum_{s\in\Omega\cap\mathcal{S}}|f(s)|^{p}\leq nt^{p}.

We obtain f(z)(20d)d+1ptf(z)\leq(20d)^{\frac{d+1}{p}}t by combining the above two displayed inequalities, and this proves Lemma C.5. ∎

We are now ready to prove Proposition C.1.

Proof of Proposition C.1.

Fix f𝒞(Ω)f\in{\mathcal{C}}(\Omega) with 𝒮(f,Ω,p)t\ell_{\mathcal{S}}(f,\Omega,p)\leq t. By Lemma C.3 and Lemma C.5, we have 20dtf(x)(20d)d+1t-20dt\leq f(x)\leq(20d)^{d+1}t for all xΩ0.95x\in\Omega_{0.95}. Let TT be an affine transformation so that the John ellipsoid of TΩ0.95T\Omega_{0.95} equals the unit ball 𝔅d{\mathfrak{B}_{d}}. Because Ω0.9(Ω0.95)0.95\Omega_{0.9}\subseteq(\Omega_{0.95})_{0.95}, by the proof of Lemma C.4, the distance between the boundary of T(Ω0.95)T(\Omega_{0.95}) and the boundary of T(Ω0.9)T(\Omega_{0.9}) is at least 120\frac{1}{20}. Define the convex function f~\widetilde{f} on T(Ω)T(\Omega) by f~(y)=f(T1(y))\widetilde{f}(y)=f(T^{-1}(y)). Then, 20dtf~(y)(20d)d+1t-20dt\leq\widetilde{f}(y)\leq(20d)^{d+1}t for all yT(Ω0.95)y\in T(\Omega_{0.95}).

For u,vT(Ω0.9)u,v\in T(\Omega_{0.9}), assume without loss of generality that f~(u)f~(v)\widetilde{f}(u)\leq\widetilde{f}(v) and consider the half-line starting from uu and passing through vv. Suppose the half-line intersects the boundary of T(Ω0.9)T(\Omega_{0.9}) at the point aa and the boundary of T(Ω0.95)T(\Omega_{0.95}) at the point bb. By the convexity of f~\widetilde{f} on this half-line, we have

0f~(v)f~(u)vu|f~(b)f~(a)|ba20[(20d)d+1t+20dt]:=M.0\leq\frac{\widetilde{f}(v)-\widetilde{f}(u)}{\|v-u\|}\leq\frac{|\widetilde{f}(b)-\widetilde{f}(a)|}{\|b-a\|}\leq 20[(20d)^{d+1}t+20dt]:=M.

This implies that f~\widetilde{f} is a convex function on T(Ω0.9)T(\Omega_{0.9}) with Lipschitz constant MM. Of course f~\widetilde{f} is also bounded in absolute value by MM on T(Ω0.9)T(\Omega_{0.9}). Thus by the classical result of [9] on the metric entropy (in the supremum metric) of bounded Lipschitz convex functions, there exists a finite set 𝒢{\mathcal{G}} consisting of at most exp(β(M/ϵ)d/2)\exp(\beta\cdot(M/\epsilon)^{d/2}) functions such that for every f𝒞(Ω)f\in{\mathcal{C}}(\Omega) with 𝒮(f,Ω,p)t\ell_{\mathcal{S}}(f,\Omega,p)\leq t, there exists g𝒢g\in{\mathcal{G}} such that supyT(Ω0.9)|f~(y)g(y)|<ϵ\sup_{y\in T(\Omega_{0.9})}|\widetilde{f}(y)-g(y)|<\epsilon. This implies supxΩ0.9|f(x)g(T(x))|<ϵ\sup_{x\in\Omega_{0.9}}|f(x)-g(T(x))|<\epsilon. Thus, by setting 𝒩={gT|g𝒢}{\mathcal{N}}=\left\{g\circ T\;\middle|\;g\in{\mathcal{G}}\right\}, Proposition C.1 follows with γd=β(M/t)d/2\gamma_{d}=\beta(M/t)^{d/2} (note that MM is a multiple of tt so that M/tM/t is a constant depending on dd alone). ∎

C.7.2 Reaching the Boundary

Now, we try to reach closer to the boundary of Ω\Omega. More precisely, we will extend Proposition C.1 from Ω0.9\Omega_{0.9} to the set Ω0\Omega_{0} defined below. For this section, it will be convenient to rewrite Ω\Omega as:

Ω:={xd:𝔞iviT(xω0)𝔟i,1iF}\displaystyle\Omega:=\{x\in{\mathbb{R}}^{d}:-\mathfrak{a}_{i}\leq v_{i}^{T}(x-\omega_{0})\leq\mathfrak{b}_{i},1\leq i\leq F\}

where ω0\omega_{0} is the center of the John ellipsoid of Ω\Omega, and 𝔞i,𝔟i>0\mathfrak{a}_{i},\mathfrak{b}_{i}>0. As in (14), v1,,vFv_{1},\dots,v_{F} are unit vectors.

Let mim_{i} and nin_{i} be the smallest integers such that 2mi𝔞iδ2^{-m_{i}}\mathfrak{a}_{i}\leq\delta and 2ni𝔟iδ2^{-n_{i}}\mathfrak{b}_{i}\leq\delta. Let

Ω0={xd:(12mi)𝔞iviT(xω0)(12ni)𝔟i,1iF}.\Omega_{0}=\{x\in{\mathbb{R}}^{d}:-(1-2^{-m_{i}})\mathfrak{a}_{i}\leq v_{i}^{T}(x-\omega_{0})\leq(1-2^{-n_{i}})\mathfrak{b}_{i},1\leq i\leq F\}.

Note that Ω0\Omega_{0} is quite close to Ω\Omega because the Hausdorff distance between them is at most δ\delta.

The following proposition shows that, in order to bound the metric entropy of {f𝒞(Ω):𝒮(f,Ω,p)t}\{f\in{\mathcal{C}}(\Omega):\ell_{\mathcal{S}}(f,\Omega,p)\leq t\} on Ω0\Omega_{0}, it suffices to suitably decompose Ω0\Omega_{0}.

Proposition C.6.

Suppose DiD_{i}, 1im1\leq i\leq m is a sequence of convex subsets of Ω\Omega such that no point in Ω\Omega is contained in more than MM subsets in the sequence. Further suppose that Ω0i=1m(Di)0.9\Omega_{0}\subset\cup_{i=1}^{m}(D_{i})_{0.9}. Then

logN(ϵ,{f𝒞(Ω):𝒮(f,Ω,p)t},𝒮(,Ω0,p))cmMd2p(tϵ)d/2.\log N(\epsilon,\{f\in{\mathcal{C}}(\Omega):\ell_{\mathcal{S}}(f,\Omega,p)\leq t\},\ell_{\mathcal{S}}(\cdot,\Omega_{0},p))\leq cmM^{\frac{d}{2p}}\left(\frac{t}{\epsilon}\right)^{d/2}.
Proof.

Let GiG_{i} be the set of grid points in DiD_{i}, and 𝒮i\mathcal{S}_{i} be the grid points in (Di)0.9j<i(Dj)0.9(D_{i})_{0.9}\setminus\cup_{j<i}(D_{j})_{0.9}. For every f𝒞(Ω)f\in{\mathcal{C}}(\Omega) such that 𝒮(f,Ω,p)t\ell_{\mathcal{S}}(f,\Omega,p)\leq t, define ti=ti(f)t_{i}=t_{i}(f) to be the smallest positive integer, such that

xGi|f(x)|p|Gi|titp.\sum_{x\in G_{i}}|f(x)|^{p}\leq|G_{i}|t_{i}t^{p}.

Because tit_{i} is the smallest positive integer satisfying the above, the inequality will be reversed for ti1t_{i}-1 allowing us to deduce

i=1m|Gi|(ti1)tpi=1mxGi|f(x)|pMntp,\sum_{i=1}^{m}|G_{i}|(t_{i}-1)t^{p}\leq\sum_{i=1}^{m}\sum_{x\in G_{i}}|f(x)|^{p}\leq Mnt^{p},

which is equivalent to i=1m|Gi|(ti1)Mn\sum_{i=1}^{m}|G_{i}|(t_{i}-1)\leq Mn. Since each |Gi|1|G_{i}|\geq 1, this gives i=1m(ti1)Mn\sum_{i=1}^{m}(t_{i}-1)\leq Mn, which implies that there are no more than (Mn+mm){{Mn+m}\choose{m}} possible values of (t1,t2,,tm)(t_{1},t_{2},\ldots,t_{m}).

Let

𝒦={(k1,k2,,km)m:i=1m|Gi|(ki1)Mn}.{\mathcal{K}}=\{(k_{1},k_{2},\ldots,k_{m})\in\mathbb{N}^{m}:\sum_{i=1}^{m}|G_{i}|(k_{i}-1)\leq Mn\}.

For each K=(k1,k2,,km)𝒦K=(k_{1},k_{2},\ldots,k_{m})\in{\mathcal{K}}, define

K={f𝒞(Ω):𝒮(f,Ω,p)t and ti(f)=ki,1im}.\displaystyle{\cal F}_{K}=\{f\in{\mathcal{C}}(\Omega):\ell_{\mathcal{S}}(f,\Omega,p)\leq t\text{ and }t_{i}(f)=k_{i},1\leq i\leq m\}.

Then K{f𝒞(Di):𝒮(f,Di,p)ki1/pt}{\cal F}_{K}\subset\{f\in{\mathcal{C}}(D_{i}):\ell_{\mathcal{S}}(f,D_{i},p)\leq k_{i}^{1/p}t\} for each ii (restricting each ff to DiD_{i}). Applying Proposition C.1 to DiD_{i}, there exists a set 𝒢i{\mathcal{G}}_{i} consisting of at most exp(γ[ki1/pt]d/2ϵid/2)\exp(\gamma[k_{i}^{1/p}t]^{d/2}\epsilon_{i}^{-d/2}) functions, such that for every fKf\in{\cal F}_{K}, there exists gi𝒢ig_{i}\in{\mathcal{G}}_{i} satisfying

x𝒮i|f(x)gi(x)|p|Gi|ϵip.\sum_{x\in\mathcal{S}_{i}}|f(x)-g_{i}(x)|^{p}\leq|G_{i}|\epsilon_{i}^{p}.

If we define g(x)=gi(x)g(x)=g_{i}(x) for x𝒮ix\in\mathcal{S}_{i}, then we have

x𝒮Ω0|f(x)g(x)|pi=1m|Gi|ϵip=nϵp,\sum_{x\in\mathcal{S}\cap\Omega_{0}}|f(x)-g(x)|^{p}\leq\sum_{i=1}^{m}|G_{i}|\epsilon_{i}^{p}=n\epsilon^{p},

where the final equality holds if we let

ϵi=kidp(d+2p)|Gi|2d+2p(j=1m(|Gj|kj)dd+2p)1/pn1/pϵ.\epsilon_{i}=\frac{k_{i}^{\frac{d}{p(d+2p)}}|G_{i}|^{-\frac{2}{d+2p}}}{\left(\sum_{j=1}^{m}(|G_{j}|k_{j})^{\frac{d}{d+2p}}\right)^{1/p}}n^{1/p}\epsilon.

The total number of possible choices for gg is therefore

exp(γi=1m[ki1/pt]d/2ϵid/2)\exp\left(\gamma\sum_{i=1}^{m}[k_{i}^{1/p}t]^{d/2}\epsilon_{i}^{-d/2}\right)
=exp{γ(i=1m(|Gi|ki)dd+2p)d+2p2pnd2p(t/ϵ)d/2}.=\exp\left\{\gamma\left(\sum_{i=1}^{m}(|G_{i}|k_{i})^{\frac{d}{d+2p}}\right)^{\frac{d+2p}{2p}}\cdot n^{-\frac{d}{2p}}(t/\epsilon)^{d/2}\right\}.

Using the inequalities

i=1m(|Gi|ki)dd+2p(i=1m|Gi|ki)dd+2pm2pd+2p\sum_{i=1}^{m}(|G_{i}|k_{i})^{\frac{d}{d+2p}}\leq\left(\sum_{i=1}^{m}|G_{i}|k_{i}\right)^{\frac{d}{d+2p}}m^{\frac{2p}{d+2p}}

and

i=1m|Gi|ki=i=1m|Gi|(ki1)+i=1m|Gi|Mn+Mn=2Mn,\sum_{i=1}^{m}|G_{i}|k_{i}=\sum_{i=1}^{m}|G_{i}|(k_{i}-1)+\sum_{i=1}^{m}|G_{i}|\leq Mn+Mn=2Mn,

we can bound the total number of realizations of gg by

exp(γ(2M)d2pm(t/ϵ)d/2).\exp\left(\gamma(2M)^{\frac{d}{2p}}m(t/\epsilon)^{d/2}\right).

Consequently, we have

logN(ϵ,{f𝒞(Ω):𝒮(f,Ω,p)t},𝒮(,Ω0,p))\displaystyle\log N(\epsilon,\{f\in{\mathcal{C}}(\Omega):\ell_{\mathcal{S}}(f,\Omega,p)\leq t\},\ell_{\mathcal{S}}(\cdot,\Omega_{0},p))
log(Mn+mm)+γ(2M)d2pm(t/ϵ)d/2cmMd2p(t/ϵ)d/2.\leq\log{{Mn+m}\choose{m}}+\gamma(2M)^{\frac{d}{2p}}m(t/\epsilon)^{d/2}\leq cmM^{\frac{d}{2p}}(t/\epsilon)^{d/2}. ∎
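As a sanity check on the exponent bookkeeping in the proof above, one can verify numerically that the Lagrange-optimal choice of the ϵi (with exponent d+2p throughout, which is our reconstruction of the displayed formula rather than verbatim code from the paper) balances the budget Σi|Gi|ϵi^p = nϵ^p exactly and yields the claimed entropy total.

```python
import random

random.seed(1)
d, p = 5, 2
G = [random.randint(1, 100) for _ in range(8)]   # |G_i|: grid points per piece
k = [random.randint(1, 30) for _ in range(8)]    # k_i = t_i(f)
n = sum(G)                                       # total grid count (taking M = 1)
t, eps = 1.7, 0.3
a = d / (d + 2 * p)
S = sum((g * ki) ** a for g, ki in zip(G, k))
eps_i = [ki ** (d / (p * (d + 2 * p))) * g ** (-2 / (d + 2 * p))
         * S ** (-1 / p) * n ** (1 / p) * eps
         for g, ki in zip(G, k)]
# budget identity: sum_i |G_i| eps_i^p == n * eps^p
budget = sum(g * e ** p for g, e in zip(G, eps_i))
# resulting entropy total: sum_i (k_i^{1/p} t)^{d/2} eps_i^{-d/2}
ent = sum((ki ** (1 / p) * t) ** (d / 2) * e ** (-d / 2)
          for ki, e in zip(k, eps_i))
target = S ** ((d + 2 * p) / (2 * p)) * n ** (-d / (2 * p)) * (t / eps) ** (d / 2)
ok_budget = abs(budget - n * eps ** p) / (n * eps ** p) < 1e-9
ok_entropy = abs(ent - target) / target < 1e-9
print(ok_budget, ok_entropy)  # True True
```

Both identities hold exactly in the algebra; the tolerances only absorb floating-point rounding.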

In the next result, we decompose Ω0\Omega_{0} according to the requirement of Proposition C.6. Recall that mim_{i} and nin_{i} are of order log(1/δ)\log(1/\delta) and were defined at the beginning of this subsection (Subsection C.7.2).

Lemma C.7.

Let N:=i=1F(mi+ni)N:=\prod_{i=1}^{F}(m_{i}+n_{i}). There exist convex sets D^i\widehat{D}_{i}, 1iN1\leq i\leq N, contained in Ω\Omega, such that no point in Ω\Omega is contained in more than 4F4^{F} of these sets, and

Ω0i=1N(D^i)0.8.\Omega_{0}\subset\cup_{i=1}^{N}(\widehat{D}_{i})_{0.8}.
Proof.

Let

𝒦={(k1,k2,,kF):mikini1,1iF}.{\mathcal{K}}=\{(k_{1},k_{2},\ldots,k_{F}):-m_{i}\leq k_{i}\leq n_{i}-1,1\leq i\leq F\}.

There are i=1F(mi+ni)\prod_{i=1}^{F}(m_{i}+n_{i}) elements in 𝒦{\mathcal{K}}. For each K=(k1,k2,,kF)𝒦K=(k_{1},k_{2},\ldots,k_{F})\in{\mathcal{K}}, define

DK={xd:αi(ki)viT(xω0)αi(ki+1)},D_{K}=\{x\in{\mathbb{R}}^{d}:\alpha_{i}(k_{i})\leq v_{i}^{T}(x-\omega_{0})\leq\alpha_{i}(k_{i}+1)\},

where

αi(t)={(12t)𝔞it0(12t)𝔟it>0.\alpha_{i}(t)=\left\{\begin{array}[]{ll}{-(1-2^{t})\mathfrak{a}_{i}}&{t\leq 0}\\ {(1-2^{-t})\mathfrak{b}_{i}}&{t>0}\end{array}\right..

DKD_{K} is a convex set. The union of all DKD_{K}, K𝒦K\in{\mathcal{K}} is the set

{xd:(12mi)𝔞iviT(xω0)(12ni)𝔟i}.\{x\in{\mathbb{R}}^{d}:-(1-2^{-m_{i}})\mathfrak{a}_{i}\leq v_{i}^{T}(x-\omega_{0})\leq(1-2^{-n_{i}})\mathfrak{b}_{i}\}.

Similarly, we define

D^K={xd:βi(ki)viT(xω0)γi(ki)},\widehat{D}_{K}=\{x\in{\mathbb{R}}^{d}:\beta_{i}(k_{i})\leq v_{i}^{T}(x-\omega_{0})\leq\gamma_{i}(k_{i})\},

where

βi(ki)=αi(ki)14[αi(ki+1)αi(ki)], and\displaystyle\beta_{i}(k_{i})=\alpha_{i}(k_{i})-\frac{1}{4}[\alpha_{i}(k_{i}+1)-\alpha_{i}(k_{i})],\text{ and }
γi(ki)=αi(ki+1)+14[αi(ki+1)αi(ki)].\displaystyle\gamma_{i}(k_{i})=\alpha_{i}(k_{i}+1)+\frac{1}{4}[\alpha_{i}(k_{i}+1)-\alpha_{i}(k_{i})].

Let ω0(K)\omega_{0}(K) be the center of John ellipsoid of D^K\widehat{D}_{K}. Because D^K\widehat{D}_{K} equals

{xd:βi(ki)viT(ω0(K)ω0)viT(xω0(K))\displaystyle\left\{x\in{\mathbb{R}}^{d}:\beta_{i}(k_{i})-v_{i}^{T}(\omega_{0}(K)-\omega_{0})\leq v_{i}^{T}(x-\omega_{0}(K))\right.
γi(ki)viT(ω0(K)ω0)},\displaystyle\left.\leq\gamma_{i}(k_{i})-v_{i}^{T}(\omega_{0}(K)-\omega_{0})\right\},

we have

(D^K)0.8\displaystyle(\widehat{D}_{K})_{0.8}
=\displaystyle= {x:0.8[βi(ki)viT(ω0(K)ω0)]viT(xω0(K))\displaystyle\left\{x:0.8[\beta_{i}(k_{i})-v_{i}^{T}(\omega_{0}(K)-\omega_{0})]\leq v_{i}^{T}(x-\omega_{0}(K))\right.
0.8[γi(ki)viT(ω0(K)ω0)]}\displaystyle\leq\left.0.8[\gamma_{i}(k_{i})-v_{i}^{T}(\omega_{0}(K)-\omega_{0})]\right\}
=\displaystyle= {x:0.8βi(ki)+0.2viT(ω0(K)ω0)viT(xω0)\displaystyle\left\{x:0.8\beta_{i}(k_{i})+0.2v_{i}^{T}(\omega_{0}(K)-\omega_{0})\leq v_{i}^{T}(x-\omega_{0})\right.
0.8γi(ki)+0.2viT(ω0(K)ω0)}\displaystyle\left.\leq 0.8\gamma_{i}(k_{i})+0.2v_{i}^{T}(\omega_{0}(K)-\omega_{0})\right\}
\displaystyle\supseteq {x:0.8βi(ki)+0.2αi(ki+1)viT(xω0)0.8γi(ki)+0.2αi(ki)}\{x:0.8\beta_{i}(k_{i})+0.2\alpha_{i}(k_{i}+1)\leq v_{i}^{T}(x-\omega_{0})\leq 0.8\gamma_{i}(k_{i})+0.2\alpha_{i}(k_{i})\}
=\displaystyle= {x:αi(ki)viT(xω0)αi(ki+1)}=DK,\{x:\alpha_{i}(k_{i})\leq v_{i}^{T}(x-\omega_{0})\leq\alpha_{i}(k_{i}+1)\}=D_{K},

where in the last equality we used the fact that 0.8βi(ki)+0.2αi(ki+1)=αi(ki)0.8\beta_{i}(k_{i})+0.2\alpha_{i}(k_{i}+1)=\alpha_{i}(k_{i}) and 0.8γi(ki)+0.2αi(ki)=αi(ki+1)0.8\gamma_{i}(k_{i})+0.2\alpha_{i}(k_{i})=\alpha_{i}(k_{i}+1), and the inclusion uses αi(ki)viT(ω0(K)ω0)αi(ki+1)\alpha_{i}(k_{i})\leq v_{i}^{T}(\omega_{0}(K)-\omega_{0})\leq\alpha_{i}(k_{i}+1).

It can be checked that when the integer ki{1,0}k_{i}\notin\{-1,0\}, the intervals (βi(ki),γi(ki))(\beta_{i}(k_{i}),\gamma_{i}(k_{i})) and (βi(ji),γi(ji))(\beta_{i}(j_{i}),\gamma_{i}(j_{i})) intersect only when |kiji|1|k_{i}-j_{i}|\leq 1 or when ji=0j_{i}=0 or when ji=1j_{i}=-1. Hence there are at most four possibilities in each coordinate, which implies that no point can be contained in more than 4F4^{F} different sets D^K\widehat{D}_{K}. Lemma C.7 follows by renaming these sets as D^i\widehat{D}_{i}, 1iN1\leq i\leq N. ∎
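The overlap bound in Lemma C.7 is easy to check numerically along one coordinate. The sketch below is our own illustration (not the paper's code): it builds the dyadic breakpoints αi(·), inflates each cell by a quarter of its width on both sides as in the definition of D̂K, and verifies that no sampled point lies in more than four of the resulting intervals.

```python
# One-coordinate check of the overlap bound: dyadic cells [alpha(k), alpha(k+1)]
# inflated by a quarter width on each side; no point should be in more than 4.
a, b = 1.0, 1.0            # the widths frak{a}_i and frak{b}_i
m, n = 12, 12              # m_i and n_i, of order log(1/delta)

def alpha(t):
    return -(1 - 2.0 ** t) * a if t <= 0 else (1 - 2.0 ** (-t)) * b

intervals = []
for k in range(-m, n):
    w = alpha(k + 1) - alpha(k)
    intervals.append((alpha(k) - w / 4, alpha(k + 1) + w / 4))

lo, hi = alpha(-m), alpha(n)
points = [lo + (hi - lo) * j / 20000 for j in range(20001)]
max_cover = max(sum(s <= x <= e for s, e in intervals) for x in points)
print(max_cover <= 4)  # True
```

Because the cell widths halve toward each endpoint, a quarter-width inflation only ever reaches into the two adjacent cells (plus the special cells indexed 0 and -1 near the center), which is where the bound of four comes from.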

We are finally ready to complete the proof of Theorem 4.11.

Proof of Theorem 4.11.

By Proposition C.6 and Lemma C.7, we have

logN(ϵ,{f𝒞(Ω):𝒮(f,Ω,p)t},𝒮(,Ω0,p))2dFp(clog1δ)F(tϵ)d/2.\log N(\epsilon,\{f\in{\mathcal{C}}(\Omega):\ell_{\mathcal{S}}(f,\Omega,p)\leq t\},\ell_{\mathcal{S}}(\cdot,\Omega_{0},p))\leq 2^{\frac{dF}{p}}\left(c\log\frac{1}{\delta}\right)^{F}\left(\frac{t}{\epsilon}\right)^{d/2}.

Because the distance between the boundary of Ω\Omega and the boundary of Ω0\Omega_{0} is no larger than δ\delta, the set ΩΩ0\Omega\setminus\Omega_{0} can be decomposed into at most 2F2F pieces of width δ\delta. By Khinchine’s flatness theorem, the grid points in ΩΩ0\Omega\setminus\Omega_{0} are contained in cFcF hyperplanes for some constant cc. The intersection of Ω\Omega and each of these hyperplanes is a (d1)(d-1) dimensional convex polytope. This enables us to obtain covering number estimates on ΩΩ0\Omega\setminus\Omega_{0} using lower dimensional estimates. Because the desired covering number estimate is known to be true for d=1d=1, the result follows from mathematical induction on dimension. This concludes the proof of Theorem 4.11. ∎

References

  • [1] Aït-Sahalia, Y. and Duarte, J. (2003). Nonparametric option pricing under shape restrictions. J. Econometrics 116, 9–47.
  • [2] Allon, G., Beenstock, M., Hackman, S., Passy, U. and Shapiro, A. (2007). Nonparametric estimation of concave production technologies by entropic methods. J. Appl. Econometrics 22, 795–816.
  • [3] Balázs, G. (2016). Convex regression: theory, practice, and applications. PhD thesis, University of Alberta.
  • [4] Balázs, G., György, A. and Szepesvári, C. (2015). Near-optimal max-affine estimators for convex regression. In AISTATS.
  • [5] Ball, K. (1997). An elementary introduction to modern convex geometry. Flavors of Geometry 31.
  • [6] Bellec, P. C. (2018). Sharp oracle inequalities for least squares estimators in shape restricted regression. Ann. Statist. 46, 745–780.
  • [7] Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97, 113–150.
  • [8] Bronshteyn, E. M. and Ivanov, L. D. (1975). The approximation of convex sets by polyhedra. Siberian Mathematical Journal 16, 852–853.
  • [9] Bronšteĭn, E. M. (1976). ε-entropy of convex sets and functions. Sibirsk. Mat. Ž. 17, 508–514, 715.
  • [10] Chatterjee, Sourav (2014). A new perspective on least squares under convex constraint. Ann. Statist. 42, 2340–2381.
  • [11] Chatterjee, Sabyasachi (2016). An improved global risk bound in concave regression. Electron. J. Stat. 10, 1608–1629.
  • [12] Chatterjee, S., Guntuboyina, A. and Sen, B. (2015). On risk bounds in isotonic and other shape restricted regression problems. Ann. Statist. 43, 1774–1800.
  • [13] Chen, W. and Mazumder, R. (2020). Multivariate convex regression at scale. Unpublished manuscript.
  • [14] Chen, Y. and Wellner, J. A. (2016). On convex least squares estimation when the truth is linear. Electron. J. Stat. 10, 171–209.
  • [15] Doss, C. R. (2020). Bracketing numbers of convex and m-monotone functions on polytopes. J. Approx. Theory 256, 105425.
  • [16] Dudley, R. M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis 1, 290–330.
  • [17] Dümbgen, L., Freitag, S. and Jongbloed, G. (2004). Consistency of concave regression with an application to current-status data. Math. Methods Statist. 13, 69–81.
  • [18] Gao, F. and Wellner, J. A. (2007). Entropy estimate for high-dimensional monotonic functions. J. Multivariate Anal. 98, 1751–1764.
  • [19] Gao, F. and Wellner, J. A. (2017). Entropy of convex functions on ℝ^d. Constr. Approx. 46, 565–592.
  • [20] Ghosal, P. and Sen, B. (2017). On univariate convex regression. Sankhya A 79, 215–253.
  • [21] {bbook}[author] \bauthor\bsnmGroeneboom, \bfnmPiet\binitsP. and \bauthor\bsnmJongbloed, \bfnmGeurt\binitsG. (\byear2014). \btitleNonparametric estimation under shape constraints \bvolume38. \bpublisherCambridge University Press. \endbibitem
  • [22] {barticle}[author] \bauthor\bsnmGroeneboom, \bfnmPiet\binitsP., \bauthor\bsnmJongbloed, \bfnmGeurt\binitsG. and \bauthor\bsnmWellner, \bfnmJon A.\binitsJ. A. (\byear2001). \btitleEstimation of a convex function: characterizations and asymptotic theory. \bjournalAnn. Statist. \bvolume29 \bpages1653–1698. \bdoi10.1214/aos/1015345958 \bmrnumber1891742 \endbibitem
  • [23] {barticle}[author] \bauthor\bsnmGuntuboyina, \bfnmAdityanand\binitsA. and \bauthor\bsnmSen, \bfnmBodhisattva\binitsB. (\byear2015). \btitleGlobal risk bounds and adaptation in univariate convex regression. \bjournalProbab. Theory Related Fields \bvolume163 \bpages379–411. \bdoi10.1007/s00440-014-0595-3 \bmrnumber3405621 \endbibitem
  • [24] {barticle}[author] \bauthor\bsnmHan, \bfnmQiyang\binitsQ. (\byear2019). \btitleGlobal empirical risk minimizers with ”shape constraints” are rate optimal in general dimensions. \bjournalarXiv preprint arXiv:1905.12823. \endbibitem
  • [25] {barticle}[author] \bauthor\bsnmHan, \bfnmQiyang\binitsQ., \bauthor\bsnmWang, \bfnmTengyao\binitsT., \bauthor\bsnmChatterjee, \bfnmSabyasachi\binitsS. and \bauthor\bsnmSamworth, \bfnmRichard J.\binitsR. J. (\byear2019). \btitleIsotonic regression in general dimensions. \bjournalAnn. Statist. \bvolume47 \bpages2440–2471. \bdoi10.1214/18-AOS1753 \bmrnumber3988762 \endbibitem
  • [26] {barticle}[author] \bauthor\bsnmHan, \bfnmQiyang\binitsQ. and \bauthor\bsnmWellner, \bfnmJon A\binitsJ. A. (\byear2016). \btitleMultivariate convex regression: global risk bounds and adaptation. \bjournalarXiv preprint arXiv:1601.06844. \endbibitem
  • [27] Han, Qiyang and Wellner, Jon A. (2019). Convergence rates of least squares regression estimators with heavy-tailed errors. Ann. Statist. 47, 2286–2319.
  • [28] Hanson, D. L. and Pledger, Gordon (1976). Consistency in concave regression. Ann. Statist. 4, 1038–1050.
  • [29] Hildreth, Clifford (1954). Point estimates of ordinates of concave functions. J. Amer. Statist. Assoc. 49, 598–619.
  • [30] Keshavarz, Arezou, Wang, Yang and Boyd, Stephen (2011). Imputing a convex objective function. In 2011 IEEE International Symposium on Intelligent Control (ISIC), 613–619. IEEE.
  • [31] Kuosmanen, Timo (2008). Representation theorem for convex nonparametric least squares. The Econometrics Journal 11, 308–325.
  • [32] Kur, Gil, Dagan, Yuval and Rakhlin, Alexander (2019). Optimality of maximum likelihood for log-concave density estimation and bounded convex regression. arXiv preprint arXiv:1903.05315.
  • [33] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York.
  • [34] Lim, Eunji (2014). On convergence rates of convex regression in multiple dimensions. INFORMS J. Comput. 26, 616–628.
  • [35] Lim, Eunji and Glynn, Peter W. (2012). Consistency of multidimensional convex regression. Oper. Res. 60, 196–208.
  • [36] Mammen, Enno (1991). Nonparametric regression under qualitative smoothness assumptions. Ann. Statist. 19, 741–759.
  • [37] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Mathematics 1896. Springer, Berlin.
  • [38] Matzkin, Rosa L. (1991). Semiparametric estimation of monotone and concave utility functions for polychotomous choice models. Econometrica 59, 1315–1327.
  • [39] Mazumder, Rahul, Choudhury, Arkopal, Iyengar, Garud and Sen, Bodhisattva (2019). A computational framework for multivariate convex regression and its variants. J. Amer. Statist. Assoc. 114, 318–331.
  • [40] Nemirovski, A. S. (2000). Topics in nonparametric statistics. In Lectures on Probability Theory and Statistics, École d'Été de Probabilités de Saint-Flour XXVIII–1998. Lecture Notes in Mathematics 1738. Springer-Verlag, Berlin.
  • [41] Nemirovskij, A. S., Polyak, Boris and Tsybakov, A. B. (1984). Signal processing by the nonparametric maximum likelihood method. Problems of Information Transmission 20, 177–192.
  • [42] Nemirovskij, A. S., Polyak, Boris and Tsybakov, A. B. (1985). Rate of convergence of nonparametric estimates of maximum-likelihood type. Problems of Information Transmission 21, 258–272.
  • [43] Schneider, Rolf (1993). Convex Bodies: The Brunn–Minkowski Theory. Cambridge Univ. Press, Cambridge.
  • [44] Seijo, Emilio and Sen, Bodhisattva (2011). Nonparametric least squares estimation of a multivariate convex regression function. Ann. Statist. 39, 1633–1657.
  • [45] Talagrand, Michel (1996). A new look at independence. Annals of Probability 24, 1–34.
  • [46] Toriello, Alejandro, Nemhauser, George and Savelsbergh, Martin (2010). Decomposing inventory routing problems with approximate value functions. Naval Res. Logist. 57, 718–727.
  • [47] van de Geer, Sara A. (2000). Applications of Empirical Process Theory. Cambridge Series in Statistical and Probabilistic Mathematics 6. Cambridge University Press, Cambridge.
  • [48] van der Vaart, Aad W. and Wellner, Jon A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer-Verlag, New York.
  • [49] Varian, Hal R. (1982). The nonparametric approach to demand analysis. Econometrica 50, 945–973.
  • [50] Varian, Hal R. (1984). The nonparametric approach to production analysis. Econometrica 52, 579–597.
  • [51] Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Annals of Statistics 27, 1564–1599.