
TURF: A Two-factor, Universal, Robust, Fast
Distribution Learning Algorithm

Yi Hao    Ayush Jain    Alon Orlitsky    Vaishakh Ravindrakumar
Abstract

Approximating distributions from their samples is a canonical statistical-learning problem. One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial, where $t\geq 1$ and $d\geq 0$. Letting $c_{t,d}$ denote the smallest such factor, clearly $c_{1,0}=1$, and it can be shown that $c_{t,d}\geq 2$ for all other $t$ and $d$. Yet current computationally efficient algorithms show only $c_{t,1}\leq 2.25$, and the bound rises quickly to $c_{t,d}\leq 3$ for $d\geq 9$. We derive a near-linear-time and essentially sample-optimal estimator that establishes $c_{t,d}=2$ for all $(t,d)\neq(1,0)$. Additionally, for many practical distributions, the lowest approximation distance is achieved by polynomials with a vastly varying number of pieces. We provide a method that estimates this number near-optimally, and hence helps approach the best possible approximation. Experiments combining the two techniques confirm improved performance over existing methodologies.

Information Theory, Computational Learning Theory, Statistics, Probability Estimation

1 Introduction

Learning distributions from samples is one of the oldest (Pearson, 1895), most natural (Silverman, 1986), and important statistical-learning paradigms (Givens & Hoeting, 2012). Its numerous applications include epidemiology (Bithell, 1990), economics (Zambom & Ronaldo, 2013), anomaly detection (Pimentel et al., 2014), language based prediction (Gerber, 2014), GANs (Goodfellow et al., 2014), and many more, as outlined in several books and surveys e.g., (Tukey, 1977; Scott, 2012; Diakonikolas, 2016).

Consider estimating an unknown real distribution $f$, which may be discrete, continuous, or mixed, from $n$ independent samples $X^n := X_1,\ldots,X_n$ it generates. A distribution estimator maps $X^n$ to a distribution $f^{\text{est}}$ meant to approximate $f$. We evaluate its performance via the expected $\ell_1$ distance $\mathbb{E}\|f^{\text{est}}-f\|_1$.

The $\ell_1$ distance between two functions $f_1$ and $f_2$, $\|f_1-f_2\|_1 := \int_{\mathbb{R}} |f_1-f_2|$, is one of density estimation's most common distance measures (Devroye & Lugosi, 2012). Among its several desirable properties, its value remains unchanged under linear transformations of the underlying domain, and the absolute difference between the expected values of any bounded function of the observations under $f_1$ and $f_2$ is at most a constant factor larger than $\|f_1-f_2\|_1$: for any bounded $g:\mathbb{R}\to\mathbb{R}$, $\big|\operatorname{\mathbb{E}}_{f_1}[g(X)]-\operatorname{\mathbb{E}}_{f_2}[g(X)]\big| \leq \max_{x\in\mathbb{R}}|g(x)|\cdot\|f_1-f_2\|_1$. Further, a small $\ell_1$ distance between two distributions implies a small difference between any given Lipschitz functional of the two distributions. Therefore, learning in $\ell_1$ distance implies a bound on the error of the plug-in estimator for Lipschitz functionals of the underlying distribution (Hao & Orlitsky, 2019).

Ideally, we would like to learn any distribution to a small $\ell_1$ distance. However, arbitrary distributions cannot be learned in $\ell_1$ distance with any number of samples (Devroye & Gyorfi, 1990), as the following example shows.

Example 1.

Let $u$ be the continuous uniform distribution over $[0,1]$. For any number $n$ of samples, construct a discrete distribution $p$ by assigning probability $1/n^3$ to each of $n^3$ random points in $[0,1]$. By the birthday paradox, $n$ samples from $p$ will be all distinct with high probability and follow the same uniform law as $n$ samples from $u$, and hence $u$ and $p$ will be indistinguishable. As $\|u-p\|_1=2$, the triangle inequality implies that for any estimator $f^{\text{est}}$, $\max_{f\in\{u,p\}}\mathbb{E}\|f^{\text{est}}-f\|_1 \gtrsim 1$.
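For intuition, the indistinguishability in this example is easy to simulate. A minimal sketch, not from the paper, assuming Python with NumPy and the hypothetical parameters below:

```python
import numpy as np

# Simulate Example 1: p puts mass 1/n^3 on each of n^3 uniform atoms, so by
# the birthday paradox, n draws from p collide with probability ~ 1/(2n)
# and otherwise look exactly like n draws from the uniform distribution u.
rng = np.random.default_rng(0)
n = 100
atoms = rng.uniform(0.0, 1.0, size=n**3)        # the n^3 atoms of p

trials = 2000
collisions = sum(
    len(np.unique(rng.choice(atoms, size=n))) < n for _ in range(trials)
)
print(f"fraction of sample sets with a repeat: {collisions / trials:.4f}")
# Typically ~ 1/(2n) = 0.005: almost always all-distinct, hence uniform-looking.
```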

A common remedy to this shortcoming assumes that the distribution $f$ belongs to a structured approximation class $\mathcal{C}$, for example unimodal (Birgé, 1987), log-concave (Devroye & Lugosi, 2012), or Gaussian (Acharya et al., 2014; Ashtiani et al., 2018) distributions.

The min-max learning rate of $\mathcal{C}$ is the lowest worst-case expected distance achieved by any estimator,

$$\mathcal{R}_n(\mathcal{C}) \stackrel{\mathrm{def}}{=} \min_{f^{\text{est}}} \max_{f\in\mathcal{C}} \operatorname{\mathbb{E}}_{X^n\sim f} \|f^{\text{est}}_{X^n}-f\|_1.$$

The study of $\mathcal{R}_n(\mathcal{C})$ for various classes such as Gaussians, exponentials, and discrete distributions has been the focus of many works, e.g., (Vapnik, 1999; Kamath et al., 2015; Han et al., 2015; Cohen et al., 2020).

Considering all pairs of distributions in $\mathcal{C}$, (Yatracos, 1985) defined a collection of subsets with VC dimension (Vapnik, 1999) $\mathrm{VC}(\mathcal{C})$ and, applying the minimum-distance estimation method (Wolfowitz, 1957), showed that

$$\mathcal{R}_n(\mathcal{C}) = \mathcal{O}\big(\sqrt{\mathrm{VC}(\mathcal{C})/n}\big).$$

However, real underlying distributions $f$ are unlikely to fall exactly in any predetermined class. Hence (Yatracos, 1985) also considered approximating $f$ nearly as well as its best approximation in $\mathcal{C}$. Letting

$$\|f-\mathcal{C}\|_1 \stackrel{\mathrm{def}}{=} \inf_{g\in\mathcal{C}} \|f-g\|_1$$

be the lowest $\ell_1$ distance between $f$ and any distribution in $\mathcal{C}$, he designed an estimator $f^{\text{Yat}}$, possibly outside $\mathcal{C}$, whose $\ell_1$ distance from $f$ is close to $\|f-\mathcal{C}\|_1$. For all distributions $f$,

$$\operatorname{\mathbb{E}}\|f^{\text{Yat}}-f\|_1 \leq 3\cdot\|f-\mathcal{C}\|_1 + \mathcal{O}\big(\sqrt{\mathrm{VC}(\mathcal{C})/n}\big).$$

For many natural classes, $\mathcal{R}_n(\mathcal{C}) = \Theta(\sqrt{\mathrm{VC}(\mathcal{C})/n})$. Hence, an estimator $f^{\text{est}}$ is called a $c$-factor approximation for $\mathcal{C}$ if for any distribution $f$,

$$\operatorname{\mathbb{E}}\|f^{\text{est}}-f\|_1 \leq c\cdot\|f-\mathcal{C}\|_1 + \mathcal{O}(\mathcal{R}_n(\mathcal{C})).$$

$\|f-\mathcal{C}\|_1$ and $\mathcal{O}(\mathcal{R}_n(\mathcal{C}))$ may be thought of as the bias and variance components of the error.

A small $c$ is desirable as it upper-bounds the asymptotic error as $n\nearrow\infty$ for $f\notin\mathcal{C}$, hence providing robustness guarantees when the underlying distribution does not quite follow the assumed model. It also ensures robust estimation under the Huber contamination model (Huber, 1992), where with probability $0\leq\mu\leq 1$, $f$ is perturbed by arbitrary noise, and the error incurred by a $c$-factor approximation is upper-bounded by $c\cdot\mu$.

One of the more important distribution classes is the collection $\mathcal{P}_{t,d}$ of $t$-piecewise degree-$d$ polynomials. For simplicity, we assume that all polynomials in $\mathcal{P}_{t,d}$ are defined over a known interval $I\subseteq\mathbb{R}$; hence any $p\in\mathcal{P}_{t,d}$ consists of degree-$d$ polynomials $p_1,\ldots,p_t$, each defined over one part of a partition $I_1,\ldots,I_t$ of $I$.

The significance of $\mathcal{P}_{t,d}$ stems partly from the fact that it approximates numerous important distributions even with small $t$ and $d$. For example, for every distribution $f$ in the class $\mathcal{L}$ of log-concave distributions, $\|f-\mathcal{P}_{t,1}\|_1 = \mathcal{O}(t^{-2})$ (Chan et al., 2014). Also, $\mathrm{VC}(\mathcal{P}_{t,d}) = t(d+1)$, e.g., (Acharya et al., 2017).

It follows that if $f^{\text{est}}$ is a $c$-factor estimator for $\mathcal{P}_{t,1}$, then for all $f\in\mathcal{L}$, $\operatorname{\mathbb{E}}\|f^{\text{est}}-f\|_1 \leq c\cdot\|f-\mathcal{P}_{t,1}\|_1 + \mathcal{O}(\sqrt{t/n}) = \mathcal{O}(t^{-2}) + \mathcal{O}(\sqrt{t/n})$. Choosing $t=n^{1/5}$ to equate the bias and variance terms, $f^{\text{est}}$ achieves an expected $\ell_1$ error of $\mathcal{O}(n^{-2/5})$, the optimal min-max learning rate of $\mathcal{L}$ (Chan et al., 2014).

Lemma 17 in Appendix A.2 shows a stronger result: if $f^{\text{est}}$ is a $c$-factor approximation for $\mathcal{P}_{t,d}$ for some $t$ and $d$ and achieves the min-max rate of a distribution class $\mathcal{C}$, then $f^{\text{est}}$ is also a $c$-factor approximation for $\mathcal{C}$. In addition to the log-concave class, this result also holds for Gaussian and unimodal distributions, and for their mixtures.

2 Contributions

Lower Bounds: As noted above, it is beneficial to find the smallest approximation factor for $\mathcal{P}_{t,d}$. The following simple example shows that if we allow sub-distributions, even simple collections may have an approximation factor of at least 2.

Example 2.

Let the class $\mathcal{C}$ consist of the uniform distribution $u(x)=1$ and the sub-distribution $z(x)=0$, both over $[0,1]$. Consider any estimator $f^{\text{est}}$. Let $f_u = f^{\text{est}}$ when $X^n\sim u$ as $n\nearrow\infty$. Since $\|u-\mathcal{C}\|_1=0$, for $f^{\text{est}}$ to achieve a finite approximation factor we must have $\|f_u-u\|_1=0$. Now consider the discrete distribution $p$ in Example 1. Since its samples are indistinguishable from those of $u$, $f^{\text{est}}_{X^n}=f_u$ also for $X^n\sim p$. But then $\|f_u-p\|_1 \geq \|u-p\|_1 - \|f_u-u\|_1 = 2 = 2\cdot\|p-\mathcal{C}\|_1$, so $f^{\text{est}}$ has approximation factor $\geq 2$.

Our definition, however, considers only strict distributions, complicating lower-bound proofs. Let $c_{t,d}$ be the lowest approximation factor for $\mathcal{P}_{t,d}$. $\mathcal{P}_{1,0}$ consists of a single distribution over a known interval, hence $c_{1,0}=1$. (Chan et al., 2014) showed that for all $t\geq 2$ and $d\geq 0$, $c_{t,d}\geq 2$. The following lemma, proved in Appendix A.1, shows that $c_{1,d}\geq 2$ for all $d\geq 1$, and, as we shall see later, establishes a precise lower bound for all $t$ and $d$.

Lemma 3.

For all $(t,d)$ except $(1,0)$, $c_{t,d}\geq 2$.

Upper Bounds: As discussed earlier, $f^{\text{Yat}}$ is a 3-factor approximation for $\mathcal{P}_{t,d}$. However, its runtime is $n^{\mathcal{O}(t(d+1))}$. For many applications, $t$ or $d$ may be large, and may even increase with $n$; for example, in learning unimodal distributions we select $t=\mathcal{O}(n^{1/3})$ (Birgé, 1987), resulting in exponential time complexity. (Chan et al., 2014) improved the runtime to polynomial in $n$, independent of $t$ and $d$, and (Acharya et al., 2017) further reduced it to a near-linear $\mathcal{O}(n\log n)$. (Hao et al., 2020) derived an $\mathcal{O}(n\log n)$-time algorithm, SURF, achieving $c_{t,1}\leq 2.25$ and $c_{t,d}<3$ for $d\leq 8$. They also showed that this estimator can be parallelized to run in time $\mathcal{O}(n\log n/t)$. The estimator of (Bousquet et al., 2019; 2021) for the improper-learning setting (wherein $f^{\text{est}}$ can be any distribution, as we consider in this paper) achieves a bias nearly within a factor of 2, but its variance term exceeds $\mathcal{O}(\mathcal{R}_n(\mathcal{P}_{t,d}))$, hence it does not satisfy the constant-factor approximation definition. Moreover, like Yatracos's, it suffers a prohibitive $n^{\mathcal{O}(t(d+1))}$ runtime, which could be exponential for some applications.

Our main contribution is an estimator, $\mathrm{TURF}$: a two-factor, universal, robust, and fast estimator that achieves an approximation factor arbitrarily close to the optimal $c_{t,d}=2$ in near-linear $\mathcal{O}(n\log n)$ time. $\mathrm{TURF}$ is also simple to implement as a step on top of the existing merge routine in (Acharya et al., 2017). The construction of our estimate relies on upper-bounding the maximum absolute value of a polynomial based on its $\ell_1$ norm (see Lemma 7), similar to the Bernstein (Rahman et al., 2002) and Markov Brothers' (Achieser, 1992) inequalities. We show that for any $p\in\mathcal{P}_{1,d}$ and $a\in[0,1)$,

$$\|p\|_{\infty,[-a,a]} \leq \frac{28(d+1)\|p\|_{1,[-1,1]}}{\sqrt{1-a^2}},$$

where $\|\cdot\|_{\cdot,I}$ denotes the respective norm over an interval $I\subseteq\mathbb{R}$. This point-wise inequality reveals a novel connection between the $\ell_\infty$ and $\ell_1$ norms of a polynomial, which may be interesting in its own right.

Practical Estimation: For many practical distributions, the optimal parameter values of $t$ and $d$ for approximating with $\mathcal{P}_{t,d}$ may be unknown. While for common structured classes such as Gaussian, log-concave, and unimodal distributions, and their mixtures, it suffices to choose $d$ to be any small value, the optimal choice of $t$ can vary significantly. For example, for any constant $d$, the optimal number of pieces for a unimodal $f$ is $t=\mathcal{O}(n^{1/3})$, whereas for a smoother log-concave $f$, significantly lower errors are obtained with a much smaller $t=\mathcal{O}(n^{1/5})$. Given a family $f^{\text{est}}_{t,d}$ of $c_{t,d}$-factor approximate estimators for $\mathcal{P}_{t,d}$, a suitable objective is to select a number of pieces $t^{\text{est}}_d$ achieving, for any given degree $d$,

$$\operatorname{\mathbb{E}}\|f^{\text{est}}_{t^{\text{est}}_d}-f\|_1 \leq \min_{t\geq 1}\Big(c_{t,d}\|f-\mathcal{P}_{t,d}\|_1 + \mathcal{O}\big(\sqrt{t(d+1)/n}\big)\Big). \tag{1}$$

Simple modifications to existing cross-validation approaches (Yatracos, 1985) partly achieve Equation (1), but with the larger factor $c=3c_{t,d}$ and an additive $\mathcal{O}(\log n/\sqrt{n})$. Via a novel cross-validation technique, we obtain a $t^{\text{est}}_d$ that satisfies Equation (1) with a factor $c$ arbitrarily close to the optimal $c_{t,d}$ and an additive $\mathcal{O}(\sqrt{\log n/n})$. In fact, this technique removes the need to know parameters beforehand in other related settings as well, such as the corruption level in robust estimation, which all existing works assume is known. We elaborate on this in (Jain et al., 2022).

Our experiments reflect the improved errors of $\mathrm{TURF}$ over existing algorithms in regimes where the bias dominates.

3 Setup

3.1 Notation and Definitions

Henceforth, for brevity, we drop the $X^n$ subscript when referring to estimators. Given samples $X^n\sim f$, the empirical distribution is defined via the Dirac delta function $\delta(x)$ as

$$f^{\text{emp}}(x) \stackrel{\mathrm{def}}{=} \sum_{i=1}^n \frac{\delta(x-X_i)}{n},$$

allotting a mass of $1/n$ at each sample location.

Note that if an estimator $g$ is partly negative but integrates to 1, then $g' \stackrel{\mathrm{def}}{=} \max\{g,0\}/\int_{\mathbb{R}}\max\{g,0\}$ satisfies $d_{\mathrm{TV}}(g',f)\leq d_{\mathrm{TV}}(g,f)$ for any distribution $f$, e.g., (Devroye & Lugosi, 2012). This allows us to use any normalized real function as our estimator.
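As a quick illustration of this normalization on a grid, a minimal sketch (the helper name is our own, assuming the estimate is tabulated on a uniform grid):

```python
import numpy as np

def clip_and_normalize(g_vals, dx):
    """Clip a possibly negative density estimate at zero and renormalize.

    By the inequality above, this never increases the TV distance to the
    true distribution f.  g_vals are values on a uniform grid of spacing dx.
    """
    pos = np.maximum(g_vals, 0.0)
    return pos / (pos.sum() * dx)   # grid integral of the output is 1
```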

For any interval $I\subseteq\mathbb{R}$ and integrable functions $g_1,g_2:\mathbb{R}\to\mathbb{R}$, let $\|g_1-g_2\|_{1,I}$ denote the $\ell_1$ distance evaluated over $I$. Similarly, for any class $\mathcal{C}$ of real functions, let $\|g-\mathcal{C}\|_{1,I}$ denote the least $\ell_1$ distance over $I$ between $g$ and members of $\mathcal{C}$.

The $\ell_1$ distance between $f$ and $f^{\text{est}}$ is closely related to their $\mathrm{TV}$ or statistical distance:

$$1/2\cdot\|f^{\text{est}}-f\|_1 = d_{\mathrm{TV}}(f^{\text{est}},f) \stackrel{\mathrm{def}}{=} \sup_{S\subseteq\mathbb{R}}\Big|\int_S f^{\text{est}}-f\Big|,$$

the greatest absolute difference between the areas of $f^{\text{est}}$ and $f$ over all subsets of $\mathbb{R}$. As we argued in the introduction, directly estimating $f$ over all possible subsets of $\mathbb{R}$ is not feasible with finitely many samples. Instead, for a given $k\geq 1$, the $\mathcal{A}_k$ distance (Devroye & Lugosi, 2012) considers the largest difference between $f$ and $f^{\text{est}}$ over real subsets consisting of at most $k$ intervals. As we show in Lemma 4, it is possible to learn any $f$ in $\mathcal{A}_k$ distance simply by using the empirical distribution $f^{\text{emp}}$.

We formally define the $\mathcal{A}_k$ distance as follows. For any given $k\geq 1$ and interval $I\subseteq\mathbb{R}$, let $\mathcal{I}_k(I)$ be the set of all unions of at most $k$ intervals contained in $I$. Define the $\mathcal{A}_k$ distance between $g_1$ and $g_2$ as

$$\|g_1-g_2\|_{\mathcal{A}_k,I} \stackrel{\mathrm{def}}{=} \sup_{S\in\mathcal{I}_k(I)} |g_1(S)-g_2(S)|,$$

where $g(S)$ denotes the area of the function $g$ over the set $S$. For example, if $I=[0,1]$ and $g_1(x)=x$, $g_2(x)=2/3$, the $\mathcal{A}_1$ distance is $\|g_1-g_2\|_{\mathcal{A}_1,I} = \int_0^{2/3}|z-2/3|\,dz = 2/9$. Supposing $I$ is the support of $f$, we define $\|g_1-g_2\|_{\mathcal{A}_k} \stackrel{\mathrm{def}}{=} \|g_1-g_2\|_{\mathcal{A}_k,I}$.
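The $\mathcal{A}_1$ value above is easy to verify numerically: for a single interval $S=[a,b]$, $|g_1(S)-g_2(S)|=|F(b)-F(a)|$, where $F$ is the running integral of $g_1-g_2$, so the supremum over single intervals equals $\max F - \min F$ on a grid. A minimal sketch, not from the paper:

```python
import numpy as np

# Check ||g1 - g2||_{A_1, [0,1]} = 2/9 for g1(x) = x, g2(x) = 2/3.
x = np.linspace(0.0, 1.0, 100001)
dx = x[1] - x[0]
diff = x - 2.0 / 3.0
# Running integral F of g1 - g2 via the trapezoidal rule.
F = np.concatenate([[0.0], np.cumsum((diff[1:] + diff[:-1]) / 2.0 * dx)])
print(F.max() - F.min())   # ~0.22222 = 2/9
```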

For two distributions $g_1$ and $g_2$, the $\mathcal{A}_k$ distance is at most half the $\ell_1$ distance, with equality achieved as $k\nearrow\infty$, since $\mathcal{I}_k(I)$ approximates all subsets of $I$ for large $k$:

$$\|g_1-g_2\|_{\mathcal{A}_k,I} \leq 1/2\cdot\|g_1-g_2\|_1.$$

The reverse is not true: the $\mathcal{A}_k$ distance between two functions may be made arbitrarily small even at a constant $\ell_1$ distance. For example, the $\ell_1$ distance between any continuous distribution $f$ and its empirical distribution $f^{\text{emp}}$ is 2 for any $n\geq 2$. However, the $\mathcal{A}_k$ distance between $f$ and $f^{\text{emp}}$ goes to zero. The next lemma, a consequence of the VC inequality (Devroye & Lugosi, 2012), gives the rate at which $\|f^{\text{emp}}-f\|_{\mathcal{A}_k}$ goes to zero.

Lemma 4.

(Devroye & Lugosi, 2012) Given $X^n\sim f$ for any real distribution $f$,

$$\operatorname{\mathbb{E}}\|f^{\text{emp}}-f\|_{\mathcal{A}_k} = \mathcal{O}\big(\sqrt{k/n}\big).$$

Note that if $f$ is a discrete distribution with support size $k$, Lemma 4 implies $\operatorname{\mathbb{E}}\|f^{\text{emp}}-f\|_1 = \operatorname{\mathbb{E}}\|f^{\text{emp}}-f\|_{\mathcal{A}_k} \leq \mathcal{O}(\sqrt{k/n})$, matching the rate of learning discrete distributions. Since arbitrary continuous distributions can be thought of as infinite-dimensional discrete distributions with $k\to\infty$, the lemma does not bound their error.

3.2 Preliminaries

The following $\mathcal{A}_k$-distance properties are helpful.

Property 5.

Given a partition $I_1, I_2$ of any interval $I\subseteq\mathbb{R}$, integrable functions $g_1,g_2$ on $I$, and integers $k_1,k_2$,

$$\|g_1-g_2\|_{\mathcal{A}_{k_1},I_1} + \|g_1-g_2\|_{\mathcal{A}_{k_2},I_2} \leq \|g_1-g_2\|_{\mathcal{A}_{k_1+k_2},I}.$$

Property 5 follows since the subsets with $k_1$ and $k_2$ intervals that achieve the suprema of $\|g_1-g_2\|_{\mathcal{A}_{k_1},I_1}$ and $\|g_1-g_2\|_{\mathcal{A}_{k_2},I_2}$, respectively, combine into one of the $(k_1+k_2)$-interval subsets considered on the RHS.

Property 6.

Given any interval $I\subseteq\mathbb{R}$, integrable functions $g_1,g_2$ on $I$, and integers $k_1\geq k_2>0$,

$$\|g_1-g_2\|_{\mathcal{A}_{k_1},I} \leq \frac{k_1}{k_2}\cdot\|g_1-g_2\|_{\mathcal{A}_{k_2},I}.$$

Property 6 follows by selecting, from the $k_1$ intervals that attain $\|g_1-g_2\|_{\mathcal{A}_{k_1},I}$ on the LHS, the $k_2$ intervals with the largest contribution and using them in the RHS expression. In Sections 4 and 5 that follow, we derive the optimal rates of learning with piecewise polynomials.

4 A 2-Factor Estimator for $\mathcal{P}_{1,d}$

Our objective is to obtain a 2-factor approximation for the piecewise class $\mathcal{P}_{t,d}$. To achieve this, we first consider the single-piece class $\mathcal{P}_{1,d}$, which for simplicity we denote by $\mathcal{P}_d = \mathcal{P}_{1,d}$, and then use the resulting estimator as a subroutine for the multi-piece class.

4.1 Intuition and Results

The triangle inequality readily shows that if an estimator is as close to every degree-$d$ polynomial as that polynomial's $\ell_1$ distance to $f$, then the estimator achieves an $\ell_1$ distance to $f$ that is nearly twice that of the best degree-$d$ polynomial.

Let $|I|$ denote the length of an interval $I$. The histogram of an integrable function $g$ over $I$ is $\bar{g}_I \stackrel{\mathrm{def}}{=} |\int_I g|/|I|$, where we assign zero to division by zero.

Let $f^{\text{poly}}\in\mathcal{P}_d$ be a polynomial estimator of $f$ over $I$, and let $f^{\text{adj}}$ be the function obtained by adding to $f^{\text{poly}}$ a constant that matches its mass to that of $f$ over $I$. For any $p\in\mathcal{P}_d$,

$$\|f^{\text{adj}}-p\|_{1,I} \overset{(a)}{\leq} \|f^{\text{adj}}-p-(\overline{f^{\text{adj}}}-\bar{p})\|_{1,I} + \|\overline{f^{\text{adj}}}-\bar{p}\|_{1,I} \overset{(b)}{\leq} \|f^{\text{adj}}-p-(\overline{f^{\text{adj}}}-\bar{p})\|_{1,I} + \|f-p\|_{1,I},$$

where $(a)$ follows by the triangle inequality, and $(b)$ follows since $f^{\text{adj}}$ has the same mass as $f$ by construction, implying $\|\overline{f^{\text{adj}}}-\bar{p}\|_{1,I} = \|\bar{f}-\bar{p}\|_{1,I} \leq \|f-p\|_{\mathcal{A}_1,I} \leq \|f-p\|_{1,I}$.

Since $f^{\text{adj}}-p\in\mathcal{P}_d$, if $\|q-\bar{q}\|_{1,I}$ is small for all $q\in\mathcal{P}_d$, then $f^{\text{adj}}$ approximates $f$ nearly as well as any degree-$d$ polynomial. Let

$$\Delta_I(q) \stackrel{\mathrm{def}}{=} \max_{x\in I} q(x) - \min_{x\in I} q(x)$$

be the difference between $q$'s largest and smallest values on $I$. Note that $q-\bar{q}_I$ has zero mean over $I$, hence must vanish at some point in $I$, implying

$$\|q-\bar{q}\|_{1,I} \leq \Delta_I(q)\cdot|I|. \tag{2}$$

Thus we would like $\Delta_I(q)\cdot|I|$ to be small for all $q\in\mathcal{P}_d$, which may not hold for the given $I$. By additivity, we may partition $I$ and perform this adjustment over each sub-interval. A partition $\overline{I}$ of $I$ is a collection of disjoint intervals whose union is $I$. Let the histogram of $g$ over $\overline{I}$ be

$$\bar{g}_{\overline{I}}(x) \stackrel{\mathrm{def}}{=} \frac{|\int_J g|}{|J|} = \bar{g}_J(x), \qquad x\in J\in\overline{I}. \tag{3}$$

We will construct a partition for which $\sum_{J\in\overline{I}} \Delta_J(q)\cdot|J|$ is small for all $q\in\mathcal{P}_d$. Further, as we do not know $f$, we will use the empirical distribution $f^{\text{emp}}$, which approximates the mass of $f$ over each interval of $\overline{I}$. By Lemma 4, for any $\overline{I}$ with $k$ intervals, the expected extra error is $\operatorname{\mathbb{E}}\|\bar{f}_{\overline{I}} - \overline{f^{\text{emp}}}_{\overline{I}}\|_1 = \mathcal{O}(\sqrt{k/n})$. For this error to be within a constant factor of the $\mathcal{O}(\sqrt{(d+1)/n})$ min-max rate of $\mathcal{P}_d$, we need $k=\mathcal{O}(d+1)$.

We use the bound in Lemma 7 to construct a partition $\overline{I}$ whose interval widths decrease towards the extremes, while ensuring $k=\mathcal{O}(d+1)$. In Section 4.2 we show that for this universal partition, $\sum_{J\in\overline{I}}\Delta_J(q)\cdot|J|$ decreases at the rate $\mathcal{O}((d+1)/k)$ for all $q\in\mathcal{P}_d$, which we conjecture is optimal in $d$ and $k$.

In Section 4.3 we formally define the construction of $f^{\text{adj}}$ by modifying $f^{\text{poly}}$ over $\overline{I}$ using $f^{\text{emp}}$. We show in Lemma 11 that it suffices to take $f^{\text{poly}}$ to be the polynomial estimator $f^{\text{adls}}$ of (Acharya et al., 2017) or $f^{\text{surf}}$ of (Hao et al., 2020) to obtain Theorem 12, which shows that $f^{\text{adj}}$ is a 2-factor approximation for $\mathcal{P}_d$.

4.2 Polynomial Histogram Approximation

We would first like to bound $\Delta_I(q)$ for any $q\in\mathcal{P}_d$ in terms of its $\ell_1$ norm. From the Markov Brothers' inequality (Achieser, 1992), for any $q\in\mathcal{P}_d$,

$$\Delta_I(q) = \mathcal{O}\big((d+1)^2\big)\cdot\|q\|_{1,I},$$

and the bound is attained by the degree-$d$ Chebyshev polynomial. The next lemma shows that the bound can be improved in the interior of $I$. Its proof, in Appendix B.1, carefully applies the Markov Brothers' inequality over select sub-intervals of $I$ chosen based on Bernstein's inequality. For simplicity, consider $I=[-1,1]$.

Lemma 7.

For any $a\in[0,1)$ and $q\in\mathcal{P}_d$,

$$\Delta_{[-a,a]}(q) \leq \int_{-a}^a |q'(x)|\,dx \leq \frac{28(d+1)}{\sqrt{1-a^2}}\|q\|_{1,[-1,1]}.$$
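One can sanity-check the lemma numerically on random polynomials. The sketch below is our own test harness with arbitrary parameters, comparing $\Delta_{[-a,a]}(q)$ against the stated bound:

```python
import numpy as np

rng = np.random.default_rng(0)
d, a = 5, 0.9
x = np.linspace(-1.0, 1.0, 20001)
dx = x[1] - x[0]
inner = np.abs(x) <= a

worst = 0.0
for _ in range(1000):
    # Random degree-d polynomial with standard normal coefficients.
    q = np.polynomial.Polynomial(rng.standard_normal(d + 1))
    vals = q(x)
    delta = vals[inner].max() - vals[inner].min()   # Delta_{[-a,a]}(q)
    l1 = np.abs(vals).sum() * dx                    # ||q||_{1,[-1,1]}
    worst = max(worst, delta * np.sqrt(1 - a**2) / (28 * (d + 1) * l1))
print(f"largest ratio of Delta to its bound (should be <= 1): {worst:.4f}")
```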

We use the lemma and Equation (2) to construct a partition $\overline{[-1,1]}^{d,k}$ of $[-1,1]$ such that $\Delta_J(p)$ is bounded by a small value for every $J\in\overline{[-1,1]}^{d,k}$. Note that the lemma's bound is weaker when $a$ is close to the boundary of $(-1,1)$; hence the parts of $\overline{[-1,1]}^{d,k}$ shrink roughly geometrically towards the boundary, ensuring that $\Delta_J(p)$ is small over each. The geometric partition ensures that the number of intervals is still upper-bounded by $k$, as we show in Lemma 8.

Consider the positive half $[0,1]$ of $[-1,1]$. Given $\ell\geq 1$, let $m=\lceil\log_2(\ell(d+1)^2)\rceil$. For $1\leq i\leq m$ define the intervals $I_i^+ = [1-1/2^{i-1}, 1-1/2^i)$, which together span $[0, 1-1/2^m)$, and let $E_m^+ \stackrel{\mathrm{def}}{=} [1-1/2^m, 1]$ complete the partition of $[0,1]$. Note that $|E_m^+| = 1/2^m \leq 1/(\ell(d+1)^2)$. For each $1\leq i\leq m$, further partition $I_i^+$ into $\lceil\ell(d+1)/2^{i/4}\rceil$ intervals of equal width, and denote this partition by $\bar{I}_i^+$. Clearly $\bar{I}^+ \stackrel{\mathrm{def}}{=} (\bar{I}_1^+,\ldots,\bar{I}_m^+, E_m^+)$ partitions $[0,1]$.

Define the mirror-image partition $\bar{I}^-$ of $[-1,0]$, where, for example, the interval $[c,d)$ in $\bar{I}^+$ mirrors to $(-d,-c]$. The following lemma, proved in Appendix B.2, upper-bounds the number of intervals in the combination of $\bar{I}^-$ and $\bar{I}^+$.

Lemma 8.

For any degree $d\geq 0$ and $\ell>0$, the number of intervals in $(\bar{I}^-,\bar{I}^+)$ is at most $4\ell(d+1)/(2^{1/4}-1)$.

The lemma ensures that we obtain the desired partition $\overline{[-1,1]}^{d,k}$ with $k$ intervals by setting

$$\ell \stackrel{\mathrm{def}}{=} k(2^{1/4}-1)/(4(d+1)). \tag{4}$$

For any interval $I=[a,b]$, we obtain $\overline{I}^{d,k}$ by a linear translation of $\overline{[-1,1]}^{d,k}$. For example, $[c,d)\in\overline{[-1,1]}^{d,k}$ translates to $[a+(b-a)(c+1)/2,\; a+(b-a)(d+1)/2)\in\overline{I}^{d,k}$.
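A minimal sketch of this partition construction, under the stated setting (the helper names `partition_unit` and `translate` are ours):

```python
import math

def partition_unit(d: int, k: int):
    """Cut points of the partition of [-1, 1] from Section 4.2: interval
    widths shrink roughly geometrically toward the endpoints, per Lemma 7.
    Requires k >= 4(d+1)/(2^{1/4}-1) so that ell >= 1."""
    ell = k * (2 ** 0.25 - 1) / (4 * (d + 1))            # Equation (4)
    m = math.ceil(math.log2(ell * (d + 1) ** 2))
    cuts = [0.0]
    for i in range(1, m + 1):
        lo, hi = 1 - 1 / 2 ** (i - 1), 1 - 1 / 2 ** i    # I_i^+
        pieces = math.ceil(ell * (d + 1) / 2 ** (i / 4)) # equal-width split
        cuts += [lo + (hi - lo) * j / pieces for j in range(1, pieces + 1)]
    cuts.append(1.0)                                     # E_m^+ closes [0, 1]
    pos = sorted(set(cuts))
    return sorted({-c for c in pos} | set(pos))          # mirror onto [-1, 0]

def translate(cuts, a: float, b: float):
    """Linearly map cut points of [-1, 1] onto the interval [a, b]."""
    return [a + (b - a) * (c + 1) / 2 for c in cuts]

print(len(partition_unit(d=2, k=80)) - 1, "intervals (at most k = 80)")
```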

Recall that $\bar{p}_{\overline{I}^{d,k}}$ denotes the histogram of $p$ over $\overline{I}^{d,k}$. The following lemma, proved in Appendix B.3 using Equations (2) and (4) and Lemma 7, shows that the $\ell_1$ distance of any degree-$d$ polynomial to its histogram over $\overline{I}^{d,k}$ is at most a factor $\mathcal{O}((d+1)/k)$ times the $\ell_1$ norm of the polynomial.

Lemma 9.

Given an interval $I$, for some universal constant $c_1>1$, for all $p\in\mathcal{P}_d$ and integers $k\geq 4(d+1)/(2^{1/4}-1)$,

$$\|p-\bar{p}_{\overline{I}^{d,k}}\|_{1,I} \leq c_1\cdot(d+1)\cdot\|p\|_{1,I}/k.$$

We obtain our split estimator $f^{\text{adj}}$ for a given polynomial estimator $f^{\text{poly}}\in\mathcal{P}_d$ as

$$f^{\text{adj}} \stackrel{\mathrm{def}}{=} f^{\text{adj}}_{I,f^{\text{poly}},d,k} \stackrel{\mathrm{def}}{=} f^{\text{poly}} + \overline{f^{\text{emp}}}_{\overline{I}^{d,k}} - \overline{f^{\text{poly}}}_{\overline{I}^{d,k}},$$

which, over each subinterval $J\in\overline{I}^{d,k}$, adds to $f^{\text{poly}}$ a constant so that its mass over $J$ equals that of $f^{\text{emp}}$. Since $\overline{I}^{d,k}$ has $k$ intervals, it follows that $f^{\text{adj}}\in\mathcal{P}_{k,d}$.
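A hedged sketch of this adjustment, continuing the partition sketch above (`adjust` and its interface are our own; the polynomial is represented by a callable):

```python
import numpy as np

def adjust(poly, samples, cuts, n_total):
    """Shift `poly` by a constant on each cell [cuts[j], cuts[j+1]) so that
    its mass there matches the empirical mass (count / n_total)."""
    samples = np.asarray(samples)
    shifts = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        emp = np.count_nonzero((samples >= lo) & (samples < hi)) / n_total
        xs = np.linspace(lo, hi, 201)
        vals = poly(xs)
        mass = np.sum((vals[1:] + vals[:-1]) / 2) * (xs[1] - xs[0])
        shifts.append((emp - mass) / (hi - lo))   # constant added on this cell
    shifts = np.asarray(shifts)

    def f_adj(x):
        j = np.clip(np.searchsorted(cuts, x, side="right") - 1,
                    0, len(shifts) - 1)
        return poly(x) + shifts[j]
    return f_adj
```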

The next lemma (essentially) upper-bounds the $\ell_1$ distance of $f^{\text{adj}}$ from any $p\in\mathcal{P}_d$ that is close to $f^{\text{poly}}$ in $\mathcal{A}_k$ distance, in terms of the $\ell_1$ distance of $p$ from any function $f$ that is close to $f^{\text{emp}}$ in $\mathcal{A}_k$ distance.

Lemma 10.

For any interval $I$, function $f$, polynomials $p, f^{\text{poly}}\in\mathcal{P}_d$ with $d\geq 0$, and $k\geq 4(d+1)/(2^{1/4}-1)$, $f^{\text{adj}} = f^{\text{adj}}_{I,f^{\text{poly}},d,k}$ satisfies

$$\|f^{\text{adj}}-p\|_{1,I} \leq \frac{c_1(d+1)}{k}\|f^{\text{poly}}-p\|_{1,I} + \|f-p\|_{1,I} + \|f^{\text{emp}}-f\|_{\mathcal{A}_k,I},$$

where $c_1$ is the universal constant in Lemma 9.

Proof  Consider an interval $J\in\overline{I}^{d,k}$ and let $\overline{f^{\text{adj}}}, \overline{f^{\text{emp}}}, \overline{f^{\text{poly}}}, \bar{p}$ denote the histograms of $f^{\text{adj}}, f^{\text{emp}}, f^{\text{poly}}, p$, respectively, over $J$.

$$\begin{aligned}
\|f^{\text{adj}}-p\|_{1,J}
&\overset{(a)}{\leq} \|f^{\text{adj}}-p-(\overline{f^{\text{adj}}}-\bar{p})\|_{1,J} + \|\overline{f^{\text{adj}}}-\bar{p}\|_{1,J} \\
&\overset{(b)}{=} \|f^{\text{adj}}-p-(\overline{f^{\text{adj}}}-\bar{p})\|_{1,J} + \|\overline{f^{\text{emp}}}-\bar{p}\|_{1,J} \\
&\overset{(c)}{=} \|f^{\text{poly}}-p-(\overline{f^{\text{poly}}}-\bar{p})\|_{1,J} + \|\overline{f^{\text{emp}}}-\bar{p}\|_{1,J} \\
&\overset{(d)}{\leq} \|f^{\text{poly}}-p-(\overline{f^{\text{poly}}}-\bar{p})\|_{1,J} + \|\bar{f}-\bar{p}\|_{1,J} + \|\overline{f^{\text{emp}}}-\bar{f}\|_{1,J},
\end{aligned}$$

where $(a)$ and $(d)$ follow from the triangle inequality; $(b)$ follows since $f^{\text{adj}}$ has the same mass as $f^{\text{emp}}$ by construction, implying $\|\overline{f^{\text{adj}}}-\bar{p}\|_{1,J} = \|\overline{f^{\text{emp}}}-\bar{p}\|_{1,J}$; and $(c)$ follows because $f^{\text{adj}}-\overline{f^{\text{adj}}} = f^{\text{poly}}-\overline{f^{\text{poly}}}$, since $f^{\text{adj}}$ and $f^{\text{poly}}$ differ by a constant on each $J$.

The proof is completed by summing over $J\in\overline{I}^{d,k}$, using the fact that $\overline{I}^{d,k}$ has at most $k$ intervals and that, since $f^{\text{poly}}-p\in\mathcal{P}_d$ over $I$, Lemma 9 bounds the sum

$$\sum_{J\in\overline{I}^{d,k}} \|f^{\text{poly}}-p-(\overline{f^{\text{poly}}_J}-\bar{p}_J)\|_{1,J} \leq \frac{c_1(d+1)}{k}\|f^{\text{poly}}-p\|_{1,I}.$$

4.3 Applying the Estimator

In this more technical section, we show how to use existing estimators in place of $f^{\text{poly}}$ to achieve Theorem 12. The next lemma follows from a straightforward application of the triangle inequality to Lemma 10, as shown in Appendix B.4. It shows that given an estimate $f^{\text{poly}}$ whose distance to $f$ is a constant multiple of $\|f-\mathcal{P}_d\|_1$ plus $\|f^{\text{emp}}-f\|_{\mathcal{A}_k,I}$, then, depending on the value of $k$, $f^{\text{adj}}$ attains nearly the optimal approximation factor of 2 at the expense of the larger $\|f^{\text{emp}}-f\|_{\mathcal{A}_k,I}$ term.

Lemma 11.

Given an interval $I$ and $f^{\text{poly}}\in\mathcal{P}_d$ such that for some constants $c', c'', \eta>0$,

$$\|f^{\text{poly}}-f\|_{1,I} \leq c'\|f-\mathcal{P}_d\|_{1,I} + c''\|f^{\text{emp}}-f\|_{\mathcal{A}_k,I} + \eta,$$

and parameter $k\geq 4(d+1)/(2^{1/4}-1)$, the estimator $f^{\text{adj}} = f^{\text{adj}}_{I,f^{\text{poly}},d,k}$ satisfies

$$\|f^{\text{adj}}-f\|_{1,I} \leq \Big(2+\frac{c_2(c'+1)}{k}\Big)\|f-\mathcal{P}_d\|_{1,I} + \frac{c_2\eta}{k} + \Big(1+\frac{c_2 c''}{k}\Big)\|f^{\text{emp}}-f\|_{\mathcal{A}_k,I},$$

where $c_2 = c_1(d+1)$ and $c_1$ is the constant from Lemma 9.

Prior works (Acharya et al., 2017; Hao et al., 2020) derive polynomial estimators that achieve a constant-factor approximation for $\mathcal{P}_{t,d}$. We may thus use them as $f^{\text{poly}}$ in the above lemma. In particular, the estimator $f^{\text{adls}}$ in (Acharya et al., 2017) achieves $c'=3$ and $c''=2$, and $f^{\text{surf}}$ in (Hao et al., 2020) achieves $c'=c_d\geq 2$ and $c''=c_d$, where $c_d$ increases with the degree $d$ (e.g., $c'<3$ for all $d\leq 8$). Define

$$\eta_d \stackrel{\mathrm{def}}{=} \sqrt{(d+1)/n}, \tag{5}$$

and for any $0<\gamma<1$, let

$$k(\gamma) \stackrel{\mathrm{def}}{=} \big\lceil 8c_1(d+1)/\gamma\big\rceil, \tag{6}$$

where $c_1$ is the constant from Lemma 9. We obtain the following theorem for $0<\gamma<1$ by using $f^{\text{poly}} = f^{\text{adls}}$ with $\eta=\eta_d$ in Lemma 11, and then applying Lemma 4, all over $I=[X_{(0)},X_{(n)}]$, i.e., the interval between the smallest and the largest samples.

Theorem 12.

Given $X^n\sim f$, for any $0<\gamma<1$, the estimator $f^{\text{adj}} = f^{\text{adj}}_{I,f^{\text{adls}}(\eta_d),d,k(\gamma)}$ with $I=[X_{(0)},X_{(n)}]$ achieves

$$\operatorname{\mathbb{E}}\|f^{\text{adj}}-f\|_1 \leq (2+\gamma)\|f-\mathcal{P}_d\|_1 + \mathcal{O}\Big(\sqrt{\frac{d+1}{\gamma\cdot n}}\Big).$$

We prove the above theorem in Appendix B.5, showing that $f^{\text{adj}}$ is a 2-factor approximation for $\mathcal{P}_d$. Notice that when $\|f-\mathcal{P}_d\|_1 \gg \mathcal{O}\big(\sqrt{(d+1)/n}\big)$, as is the case when $n\nearrow\infty$, Theorem 12 gives $f^{\text{adj}}$ a lower $\ell_1$-distance bound to $f$ than $f^{\text{adls}}$'s. We use the above procedure in the main $\mathrm{TURF}$ routine described in the next section.

5 A 2-Factor Estimator for $\mathcal{P}_{t,d}$

In the previous section, we described an estimator that approximates a distribution to a distance only slightly larger than twice $\|f-\mathcal{P}_{1,d}\|_1$. We now extend this result to $\mathcal{P}_{t,d}$.

Consider a $p^*\in\mathcal{P}_{t,d}$ that achieves $\|f-p^*\|_1 = \|f-\mathcal{P}_{t,d}\|_1$. If the $t$ intervals corresponding to the different polynomial pieces of $p^*$ were known, we could apply the routine in Section 4 to each interval and combine the estimates to obtain a 2-factor approximation for $\mathcal{P}_{t,d}$.

However, as these intervals are unknown, we instead use the partition returned by the $\mathrm{ADLS}$ routine in (Acharya et al., 2017). $\mathrm{ADLS}$ returns a partition with $\beta t$ intervals, where the parameter $\beta>1$ is chosen by the user. Among these $\beta t$ intervals, there are at most $t$ on which $p^*$ is not a single degree-$d$ polynomial. Let $I$ be an interval in this partition where $p^*$ has more than one piece. The $\mathrm{ADLS}$ routine has the property that there are at least $(\beta-1)t$ other intervals in the partition on which $p^*$ is a single-piece polynomial with a worse $\mathcal{A}_{d+1}$ distance to $f$. That is, for any interval $J$ in this $(\beta-1)t$-interval collection,

$$\|f-p^*\|_{\mathcal{A}_{d+1},I} \leq \|f-p^*\|_{\mathcal{A}_{d+1},J} + \|f^{\text{emp}}-f\|_{\mathcal{A}_{d+1},J\cup I} + \eta.$$

This is used to bound the $\ell_1$ distance on these intervals.

Our main routine, $\mathrm{TURF}$, consists of simply applying the transformation discussed in Section 4 to the partition returned by the $\mathrm{ADLS}$ routine in (Acharya et al., 2017). Given samples $X^n$, a number of pieces $t\geq 1$, a degree $d\geq 0$, and any $0<\alpha<1$, we first run the $\mathrm{ADLS}$ routine with inputs $X^n$, $f^{\text{emp}}$, and parameters $t$, $d$,

$$\beta = \beta(\alpha) \stackrel{\mathrm{def}}{=} 1 + \frac{4k(\alpha)}{\alpha(d+1)}, \tag{7}$$

where $k(\alpha)$ is as defined in Equation (6), and $\eta_d = \sqrt{(d+1)/n}$. $\mathrm{ADLS}$ returns a partition $\bar{I}_{\mathrm{ADLS}}$ of $\mathbb{R}$ with $2\beta t$ intervals and a degree-$d$, $2\beta t$-piecewise polynomial defined over the partition. For any interval $I\in\bar{I}_{\mathrm{ADLS}}$, let $f^{\text{adls}}_I$ denote the degree-$d$ estimate output by $\mathrm{ADLS}$ over this interval. We obtain our output estimate $f^{\text{out}}_{t,d,\alpha}$ by applying the routine in Section 4 to $f^{\text{adls}}_I$ for each $I\in\bar{I}_{\mathrm{ADLS}}$ with $k=k(\alpha)$ (cf. Equation (6)). This is summarized in Algorithm 1.

Algorithm 1 $\mathrm{TURF}$
  Input: $X^n$, $t$, $d$, $\alpha$
  $k \leftarrow \lceil 8c_1(d+1)/\alpha\rceil$  {$c_1$ is the constant in Lemma 9}
  $\beta \leftarrow 1 + 4k/(\alpha(d+1))$
  $\eta_d \leftarrow \sqrt{(d+1)/n}$
  $\bar{I}_{\mathrm{ADLS}},\, (f^{\text{adls}}_I,\, I\in\bar{I}_{\mathrm{ADLS}}) \leftarrow \mathrm{ADLS}(X^n, t, d, \beta, \eta_d)$
  Output: $f^{\text{out}}_{t,d,\alpha} \leftarrow (f^{\text{adj}}_{I,f^{\text{adls}}_I,d,k},\, I\in\bar{I}_{\mathrm{ADLS}})$
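A hedged end-to-end sketch of Algorithm 1, reusing `partition_unit`, `translate`, and `adjust` from the Section 4 sketches and assuming a hypothetical `adls` routine with the interface below (the real implementation is in (Acharya et al., 2017)); the constant `C1` stands in for the unspecified $c_1$:

```python
import math

C1 = 8.0  # placeholder for the universal constant c_1 of Lemma 9

def turf(samples, t, d, alpha, adls):
    """Sketch of TURF: run ADLS, then apply the Section 4 adjustment on each
    returned interval.  `adls` is assumed to return a list of
    ((a, b), polynomial-callable) pairs."""
    n = len(samples)
    k = math.ceil(8 * C1 * (d + 1) / alpha)      # Equation (6)
    beta = 1 + 4 * k / (alpha * (d + 1))         # Equation (7)
    eta_d = math.sqrt((d + 1) / n)               # Equation (5)
    out = []
    unit_cuts = partition_unit(d, k)
    for (a, b), poly in adls(samples, t, d, beta, eta_d):
        cuts = translate(unit_cuts, a, b)
        inside = [x for x in samples if a <= x < b]
        out.append(((a, b), adjust(poly, inside, cuts, n_total=n)))
    return out
```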

Theorem 13 shows that $f^{\text{out}}_{t,d,\alpha}$ is a min-max 2-factor approximation for $\mathcal{P}_{t,d}$. The $\mathcal{O}(1/\alpha^{3/2})$ factor in the variance term of Theorem 13 reflects the $\mathcal{O}(t/\alpha^{3/2})$ pieces in the output estimate. A small $\alpha$ corresponds to a low-bias, high-variance estimator with many pieces, and vice versa. Note that the $3/2$ exponent here is larger than the corresponding $1/2$ in the result for $\mathcal{P}_d$ in Section 4 (Theorem 12). The increased exponent is due to the unknown locations of the polynomial pieces of $p^*\in\mathcal{P}_{t,d}$. Obtaining the exact exponent for 2-factor approximation of various classes may be an interesting question, but it is beyond the scope of this paper. Let $\omega<3$ be the matrix-multiplication exponent. As our transformation of $f^{\text{adls}}$ takes $\mathcal{O}(n)$ time, the overall time complexity is the same as $\mathrm{ADLS}$'s near-linear $\tilde{\mathcal{O}}(nd^{3+\omega})$.

Theorem 13.

Given $X^n\sim f$, an integer number of pieces $t\geq 1$, a degree $d\geq 0$, and a parameter $\alpha>0$, $f^{\text{out}}_{t,d,\alpha}$ is returned by $\mathrm{TURF}$ in $\tilde{\mathcal{O}}(nd^{3+\omega})$ time and satisfies

$$\operatorname{\mathbb{E}}\|f^{\text{out}}_{t,d,\alpha}-f\|_1 \leq (2+\alpha)\|f-\mathcal{P}_{t,d}\|_1 + \mathcal{O}\Big(\sqrt{\frac{t(d+1)}{\alpha^3 n}}\Big).$$

Theorem 13 is proven in Appendix C.1 and follows from the next lemma via a simple application of the $\mathrm{VC}$ inequality in Lemma 4, together with Properties 5 and 6. We prove the lemma in Appendix C.2.

Lemma 14.

Given samples $X^n\sim f$ for some $n\geq 1$, parameters $t\geq 1$, $d\geq 0$, and $0<\alpha<1$, the estimate $f^{\text{out}}_{t,d,\alpha}$ returned by $\mathrm{TURF}$ satisfies

$$\begin{aligned}
\|f^{\text{out}}_{t,d,\alpha}-f\|_1 \leq{}& \Big(3+2c_1+\frac{2}{\beta-1}\Big)\|f^{\text{emp}}-f\|_{\mathcal{A}_{2\beta tk}} \\
&+ \Big(2+\frac{4c_1(d+1)}{k}+\frac{1+k/(d+1)}{\beta-1}\Big)\|f-\mathcal{P}_{t,d}\|_1 \\
&+ \Big(\frac{c_1(d+1)}{k}+\frac{k}{(\beta-1)(d+1)}\Big)\eta_d,
\end{aligned}$$

where $c_1$, $k=k(\alpha)$, and $\beta=\beta(\alpha)$ are the constants in Lemma 9 and Equations (6) and (7), respectively, and $\eta_d \stackrel{\mathrm{def}}{=} \sqrt{(d+1)/n}$.

6 Optimal Parameter Selection

Like many other statistical-learning problems, learning distributions exhibits a fundamental trade-off between bias and variance. In Equation (1), increasing the parameters $t$ and $d$ enlarges the polynomial class $\mathcal{P}_{t,d}$, hence decreases the bias term $\|f-\mathcal{P}_{t,d}\|_1$ while increasing the variance term $\mathcal{O}(\sqrt{t(d+1)/n})$. As the number of samples $n$ increases, asymptotically it is always better to opt for larger $t$ and $d$. Yet for any given $n$, some parameters $t$ and $d$ yield the smallest error. We consider the parameters minimizing the upper bound in Theorem 13.

6.1 Context and Results

For several popular structured distributions, such as unimodal, log-concave, and Gaussian distributions and their mixtures, low-degree polynomials, e.g., $d\leq 8$, are essentially optimal (Birgé, 1987; Chan et al., 2014; Hao et al., 2020). Yet for the same classes, the range of the optimal $t$ is large, between $\Theta(1)$ and $\Theta(n^{1/3})$. Therefore, for a given $d$, we seek the $t$ minimizing the error upper bound in Equation (1).

In the next subsection, we describe a parameter-selection algorithm that achieves this for the estimators considered in the previous section. Following nearly identical steps as in the derivation of Theorem 13 from Lemma 14, and using the probabilistic version of the VC inequality in Lemma 4 (see (Devroye & Lugosi, 2012)), it may be shown that $f^{\text{out}}_{t,d,\alpha}$ is a $c$-factor approximation for $\mathcal{P}_{t,d}$ with high probability. Namely, for any $\delta\geq 0$, with probability at least $1-\delta$,

$$\|f^{\text{out}}_{t,d,\alpha}-f\|_1 \leq c\cdot\|f-\mathcal{P}_{t,d}\|_1 + \mathcal{O}\Big(\sqrt{(t(d+1)+\log 1/\delta)/n}\Big), \tag{8}$$

where $c=c(\alpha)$ is a function of the chosen $\alpha$. We use the estimates $f^{\text{out}}_{t,d,\alpha}$ to find an estimate $t^{\text{est}}$ such that $f^{\text{out}}_{t^{\text{est}},d,\alpha}$ has an error comparable to that of the $c$-factor approximation for $\mathcal{P}_{t,d}$ with the best $t$.

Theorem 15.

Given $n\in 2^{\mathbb{N}}$, $d\geq 0$, $0<\alpha<1$, and high-probability $c$-factor estimates for $\mathcal{P}_{t,d}$ (see Equation (8)), $\{f^{\text{out}}_{t,d,\alpha} : 1\leq t\leq n\}$, for any $0<\beta<1$ we find an estimate $t^{\text{est}}$ such that with probability $\geq 1-\delta\cdot\log n$,

$$\|f^{\text{out}}_{t^{\text{est}},d,\alpha}-f\|_1 \leq \min_{t\geq 1}\Big((1+\beta)\cdot c\cdot\|f-\mathcal{P}_{t,d}\|_1 + \mathcal{O}\big(\sqrt{(t(d+1)+\log 1/\delta)/(\beta^2 n)}\big)\Big).$$

The proof, provided in Appendix D.1, exploits the fact that the bias term of $f^{\text{out}}_{t,d,\alpha}$ is at most $c\cdot\|f-\mathcal{P}_{t,d}\|_1$, which decreases with $t$, while the variance term is upper-bounded by $\mathcal{O}(\sqrt{(t(d+1)+\log 1/\delta)/n})$, which increases with $t$.

6.2 Construction

We use the following algorithm, derived in (Jain et al., 2022). Consider a set $\mathcal{V}=\{v_1,v_2,\ldots,v_k\}\subseteq\mathcal{M}$, an unknown target $v\in\mathcal{M}$, an unknown non-increasing sequence $b_i$, and a known non-decreasing sequence $c_i$ such that $d(v_i,v)\leq b_i+c_i$ for all $i$.

First consider selecting, for a given pair $1\leq i<j\leq k$, the point among $v_i,v_j$ that is closer to $v$. Suppose for some constant $\gamma>0$, $d(v_i,v_j)\leq\gamma c_j$. Then from the triangle inequality, $d(v_i,v)\leq d(v_j,v)+d(v_i,v_j)\leq b_j+c_j+\gamma c_j = b_j+(1+\gamma)c_j$. On the other hand, if $d(v_i,v_j)>\gamma c_j$, then since $b_j\leq b_i$ (as $j>i$), $d(v_j,v)\leq b_j+c_j\leq b_i+c_j\leq b_i+d(v_i,v_j)/\gamma$.

Therefore, if we set $\gamma$ sufficiently large and select $v'_\gamma=v_i$ when $d(v_i,v_j)\leq\gamma c_j$ and $v'_\gamma=v_j$ otherwise, we roughly obtain $d(v'_\gamma,v)\lesssim b_i+(1+\gamma)c_j$. We now generalize this approach to selecting among all points in $\mathcal{V}$. Let $i_\gamma$ be the smallest index in $\{1,\ldots,k\}$ such that for all $i_\gamma<i\leq k$, $d(v_{i_\gamma},v_i)\leq\gamma c_i$. Lemma 16 shows the favorable properties of $v_{i_\gamma}$; for example, for a sufficiently large $\gamma$, $d(v_{i_\gamma},v)$ is comparable to $\min_{i\in\{1,\ldots,k\}}(b_i+\gamma c_i)$ when $b_i\gg c_i$. The proof may be found in Appendix D.2.

Lemma 16.

Given a set $\mathcal{V}=\{v_1,v_2,\ldots,v_k\}$ in a metric space $(\mathcal{M},d)$, a sequence $0\leq c_1\leq c_2\leq\ldots\leq c_k$, and $\gamma>2$, let $1\leq i_\gamma\leq k$ be the smallest index such that for all $i_\gamma<i\leq k$, $d(v_i,v_{i_\gamma})\leq\gamma c_i$. Then for all sequences $b_1\geq b_2\geq\ldots\geq b_k\geq 0$ such that for all $i$, $d(v_i,v)\leq b_i+c_i$,

$$d(v_{i_\gamma},v) \leq \min_{j\in\{1,\ldots,k\}}\Big(\Big(1+\frac{2}{\gamma-2}\Big)\cdot b_j + (\gamma+1)c_j\Big).$$

The set of real integrable functions with the $\mathrm{TV}$ distance forms a metric space. For simplicity, for the given $d\geq 0$, $0<\alpha<1$, and $n$, denote $f^{\text{out}}_t \stackrel{\mathrm{def}}{=} f^{\text{out}}_{t,d,\alpha}$. Assume $n$ is a power of 2 and let $\mathcal{V}=\{f^{\text{out}}_1, f^{\text{out}}_2, f^{\text{out}}_4,\ldots,f^{\text{out}}_n\}$. Suppose for constants $c',c''>0$ and a chosen $0<\delta<1$, $\sqrt{c't+c''\log 1/\delta}$ is $f^{\text{out}}_t$'s variance term. For any chosen $0<\beta<1$, we obtain $t^{\text{est}}_\beta$ by applying the above method with $\mathcal{V}$ and the $c_i$ corresponding to $f^{\text{out}}_t$ set to $\sqrt{c't+c''\log 1/\delta}$. That is, $t^{\text{est}}_\beta$ is the smallest $t\in\mathcal{I}=\{1,2,4,\ldots,n\}$ such that for all $j\in\mathcal{I}$ with $j\geq t$, $d(f^{\text{out}}_t,f^{\text{out}}_j)\leq\gamma\sqrt{c'j+c''\log 1/\delta}$, where we select $\gamma=\gamma(\beta)=2+2/\beta$. In Section 7, we experimentally evaluate the $\mathrm{TURF}$ estimator and the cross-validation technique.
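A minimal sketch of this selection rule (names are ours; `dist` computes the TV distance between two estimates, e.g., numerically):

```python
import math

def select_t(estimates, dist, c_p, c_pp, delta, beta):
    """Pick t-est: the smallest t in {1, 2, 4, ..., n} whose estimate is
    gamma * c_j - close to the estimate for every larger j, with
    gamma = 2 + 2/beta and c_j = sqrt(c' j + c'' log(1/delta))."""
    gamma = 2 + 2 / beta
    ts = sorted(estimates)                      # powers of two up to n
    for t in ts:
        if all(
            dist(estimates[t], estimates[j])
            <= gamma * math.sqrt(c_p * j + c_pp * math.log(1 / delta))
            for j in ts if j > t
        ):
            return t
    return ts[-1]
```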

7 Experiments

Figure 1: The Beta, Gamma, and Gaussian mixtures, respectively. The smooth and coarse plots in each sub-figure correspond to the noise-free and noisy cases, respectively.

Figure 2: $\ell_1$ error versus number of samples on the Beta, Gamma, and Gaussian mixtures of Figure 1, respectively, for $d=1$.

Figure 3: $\ell_1$ error versus number of samples on the Beta, Gamma, and Gaussian mixtures of Figure 1, respectively, for $d=2$.

Direct comparison of $\mathrm{TURF}$ and $\mathrm{ADLS}$ for a given $t$, $d$ is not straightforward, as $\mathrm{TURF}$ outputs polynomials consisting of more pieces. To compare the algorithms more equitably, we apply the cross-validation technique in Section 6 to select the best $t$ for each. The cross-validation parameter $\delta$ is chosen to reflect the actual number of pieces output by $\mathrm{ADLS}$ and $\mathrm{TURF}$. Note that while $\mathrm{SURF}$ (Hao et al., 2020) is another piecewise-polynomial estimation method, it has an implicit method to cross-validate $t$, unlike $\mathrm{ADLS}$ and $\mathrm{TURF}$. As comparisons against $\mathrm{SURF}$ may only reflect the relative strengths of the cross-validation methods and not those of the underlying estimation procedures, we defer them to Appendix 5. All experiments compare the $\ell_1$ error, are run for $n$ between 1,000 and 80,000, and are averaged over 50 runs. For $\mathrm{ADLS}$ we use the code provided in (Acharya et al., 2017), and for $\mathrm{TURF}$ we use the algorithm in Section 5.

The experiments consider the structured distributions addressed in (Acharya et al., 2017), namely mixtures of Betas, $.4\,\mathrm{B}(.8,4)+.6\,\mathrm{B}(2,2)$; Gammas, $.7\,\Gamma(2,2)+.3\,\Gamma(7.5,1)$; and Gaussians, $.65\,\mathcal{N}(-.45,.15^2)+.35\,\mathcal{N}(.3,.2^2)$, as shown in Figure 1. Figure 2 considers approximation relative to $\mathcal{P}_{t,1}$. The blue-dashed and black-dot-dash plots show that $\mathrm{TURF}$ modestly outperforms $\mathrm{ADLS}$. The improvement is especially significant for the Beta mixture, as $\mathrm{B}(.8,4)$ has a large second derivative near 0, and approximating it may require many degree-1 pieces localized to that region. Over this low-width region, the $\mathcal{A}_1$ distance may be too small to warrant many pieces in $\mathrm{ADLS}$, unlike in $\mathrm{TURF}$, which forms intervals guided by shape constraints, e.g., based on Lemma 7.

We perturb these distribution mixtures to increase their bias. For a given $k>0$, select $\bar{\mu}_k \stackrel{\mathrm{def}}{=} (\mu_1,\ldots,\mu_k)$ by independently choosing each $\mu_i$, $i\in\{1,\ldots,k\}$, uniformly from the effective support of $f$ (we remove $5\%$ tail mass on either side). At each of these locations, apply Gaussian noise of magnitude $0.25/k$ with standard deviation $\sigma=c_2/k$, for some constant $c_2>0$ chosen to scale with the effective support width of $f$. That is,

$$f_{\bar{\mu}} \stackrel{\mathrm{def}}{=} \frac{3}{4}\cdot f + \frac{1}{4}\cdot\sum_{i=1}^k \frac{1}{k}\cdot\mathcal{N}\Big(\mu_i,\frac{c_2^2}{k^2}\Big).$$
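A hedged sampling sketch of this perturbation (helper names and interface are ours):

```python
import numpy as np

def sample_perturbed(sample_f, support, k, c2, n, rng):
    """Draw n samples from f_mu: w.p. 3/4 from the base mixture f, w.p. 1/4
    from one of k Gaussian spikes N(mu_i, (c2/k)^2) placed uniformly on the
    effective support (lo, hi).  `sample_f(m)` draws m samples from f."""
    lo, hi = support
    mus = rng.uniform(lo, hi, size=k)            # spike locations
    out = np.empty(n)
    base = rng.random(n) < 0.75                  # mixture weight 3/4 on f
    out[base] = sample_f(int(base.sum()))
    idx = rng.integers(0, k, size=int((~base).sum()))
    out[~base] = rng.normal(mus[idx], c2 / k)    # each spike has mass 0.25/k
    return out
```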

We choose $k=100$ and $c_2=0.05, 1, 0.1$ for the Beta, Gamma, and Gaussian mixtures, respectively, yielding the distributions shown in Figure 1. The red-dotted and olive-solid plots in Figure 2 compare $\mathrm{ADLS}$ and $\mathrm{TURF}$ on these distributions. While the overall errors are larger due to the added noise, $\mathrm{TURF}$ outperforms $\mathrm{ADLS}$ on nearly all distributions. A consistent trend across our experiments is that for large $n$, the performance gap between $\mathrm{ADLS}$ and $\mathrm{TURF}$ decreases. This may be explained by the fact that as $n$ increases, the value of $t$ output by the cross-validation method also increases, reducing the bias under both $\mathrm{ADLS}$ and $\mathrm{TURF}$. However, the reduction in $\mathrm{ADLS}$'s bias is more significant due to its larger approximation factor, resulting in the smaller gap.

Figure 3 repeats the same experiments for $d=2$. Increasing the degree leads to lower errors for both $\mathrm{ADLS}$ and $\mathrm{TURF}$ in the noise-free case. However, the larger bias in the noisy case reveals the improved performance of $\mathrm{TURF}$.

References

  • Acharya et al. (2014) Acharya, J., Jafarpour, A., Orlitsky, A., and Suresh, A. T. Near-optimal-sample estimators for spherical gaussian mixtures. arXiv preprint arXiv:1402.4746, 2014.
  • Acharya et al. (2017) Acharya, J., Diakonikolas, I., Li, J., and Schmidt, L. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp.  1278–1289. SIAM, 2017.
  • Achieser (1992) Achieser, N. Theory of Approximation. Dover books on advanced mathematics. Dover Publications, 1992. ISBN 9780486671291.
  • Ashtiani et al. (2018) Ashtiani, H., Ben-David, S., Harvey, N. J., Liaw, C., Mehrabian, A., and Plan, Y. Nearly tight sample complexity bounds for learning mixtures of gaussians via sample compression schemes. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp.  3416–3425, 2018.
  • Birgé (1987) Birgé, L. Estimating a density under order restrictions: Nonasymptotic minimax risk. The Annals of Statistics, pp.  995–1012, 1987.
  • Bithell (1990) Bithell, J. F. An application of density estimation to geographical epidemiology. Statistics in medicine, 9(6):691–701, 1990.
  • Bousquet et al. (2019) Bousquet, O., Kane, D., and Moran, S. The optimal approximation factor in density estimation. arXiv preprint arXiv:1902.05876, 2019.
  • Bousquet et al. (2021) Bousquet, O., Braverman, M., Efremenko, K., Kol, G., and Moran, S. Statistically near-optimal hypothesis selection. arXiv preprint arXiv:2108.07880, 2021.
  • Chan et al. (2014) Chan, S.-O., Diakonikolas, I., Servedio, R. A., and Sun, X. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pp.  604–613. ACM, 2014.
  • Cohen et al. (2020) Cohen, D., Kontorovich, A., and Wolfer, G. Learning discrete distributions with infinite support. Advances in Neural Information Processing Systems, 33:3942–3951, 2020.
  • Devroye & Gyorfi (1990) Devroye, L. and Gyorfi, L. No empirical probability measure can converge in the total variation sense for all distributions. The Annals of Statistics, pp.  1496–1499, 1990.
  • Devroye & Lugosi (2012) Devroye, L. and Lugosi, G. Combinatorial methods in density estimation. Springer Science & Business Media, 2012.
  • Diakonikolas (2016) Diakonikolas, I. Learning structured distributions. Handbook of Big Data, pp.  267, 2016.
  • Gerber (2014) Gerber, M. S. Predicting crime using twitter and kernel density estimation. Decision Support Systems, 61:115–125, 2014.
  • Givens & Hoeting (2012) Givens, G. H. and Hoeting, J. A. Computational statistics, volume 703. John Wiley & Sons, 2012.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Han et al. (2015) Han, Y., Jiao, J., and Weissman, T. Minimax estimation of discrete distributions under 1\ell_{1} loss. IEEE Transactions on Information Theory, 61(11):6343–6354, 2015.
  • Hao & Orlitsky (2019) Hao, Y. and Orlitsky, A. Unified sample-optimal property estimation in near-linear time. Advances in Neural Information Processing Systems, 32, 2019.
  • Hao et al. (2020) Hao, Y., Jain, A., Orlitsky, A., and Ravindrakumar, V. Surf: A simple, universal, robust, fast distribution learning algorithm. Advances in Neural Information Processing Systems, 33:10881–10890, 2020.
  • Huber (1992) Huber, P. J. Robust estimation of a location parameter. In Breakthroughs in statistics, pp.  492–518. Springer, 1992.
  • Jain et al. (2022) Jain, A., Orlitsky, A., and Ravindrakumar, V. Robust estimation algorithms don’t need to know the corruption level. arXiv preprint arXiv:2202.05453, 2022.
  • Kamath et al. (2015) Kamath, S., Orlitsky, A., Pichapati, D., and Suresh, A. T. On learning distributions from their samples. In Conference on Learning Theory, pp.  1066–1100. PMLR, 2015.
  • Pearson (1895) Pearson, K. X. contributions to the mathematical theory of evolution.—ii. skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London.(A.), (186):343–414, 1895.
  • Pimentel et al. (2014) Pimentel, M. A., Clifton, D. A., Clifton, L., and Tarassenko, L. A review of novelty detection. Signal Processing, 99:215–249, 2014.
  • Rahman et al. (2002) Rahman, Q. I., Schmeisser, G., et al. Analytic theory of polynomials. Number 26 in London Mathematical Society monographs. Clarendon Press, 2002.
  • Scott (2012) Scott, D. W. Multivariate density estimation and visualization. In Handbook of computational statistics, pp.  549–569. Springer, 2012.
  • Silverman (1986) Silverman, B. W. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.
  • Tukey (1977) Tukey, J. W. Exploratory data analysis. Addison-Wesley Series in Behavioral Science: Quantitative Methods, 1977.
  • Vapnik (1999) Vapnik, V. N. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999, 1999.
  • Wolfowitz (1957) Wolfowitz, J. The minimum distance method. The Annals of Mathematical Statistics, pp.  75–88, 1957.
  • Yatracos (1985) Yatracos, Y. G. Rates of convergence of minimum distance estimators and kolmogorov’s entropy. The Annals of Statistics, pp.  768–774, 1985.
  • Zambom & Ronaldo (2013) Zambom, A. Z. and Ronaldo, D. A review of kernel density estimation with applications to econometrics. International Econometric Review, 5(1):20–42, 2013.

Appendix A Proofs for Section 1

A.1 Proof of Lemma 3

Proof

Figure 4: f that is indistinguishable from u in the proof of Lemma 3

Note that for any t_{1}\leq t and d_{1}\leq d, {\cal P}_{t_{1},d_{1}}\subseteq{\cal P}_{t,d}. Therefore c_{t,d} increases with t and d. Chan et al. (2014) showed that c_{2,0}\geq 2. We show below that c_{1,1}\geq 2. Together these imply c_{t,d}\geq 2 whenever t\geq 2 or d\geq 1.

Let u be the uniform distribution on [0,1]. Fix an \epsilon>0 and consider the distribution g(x)=1-\epsilon+2\epsilon x. Note that g\in{\cal P}_{1,1}. For a fixed k\geq 1, we construct two random distributions f_{k} and f_{k}^{\prime} that, as k grows, are essentially indistinguishable using n samples, and such that any single estimate must be at an \ell_{1} distance from f_{k} or f_{k}^{\prime} at least twice the corresponding best \ell_{1} approximation error over {\cal P}_{1,1}.

To construct f_{k} we perturb g separately on the left, [0,1/2], and right, (1/2,1], halves of [0,1].

For the left half, we use the discrete sub-distribution h_{k} that assigns a mass \epsilon/4\cdot 1/k to each of k values drawn according to the distribution h(x)=4-8x over [0,1/2]. Then let

f_{k}(x)=g(x)+h_{k}(x),\quad\text{for }x\in[0,1/2].

Thus f_{k} consists of discrete atoms added to g on [0,1/2].

For the right half, assuming w.l.o.g. that k is even, first partition (1/2,1] into k/2 intervals of width 1/k by letting I_{i}\stackrel{\mathrm{def}}{=}(1/2+(i-1)/k,1/2+i/k] for i\in\{1,\ldots,k/2\}. Let |I| denote the width of interval I. For each i\in\{1,\ldots,k/2\}, select a random circular sub-interval J_{i}\subseteq I_{i} of width

w_{i}=\frac{1}{1+2\epsilon i/k}\cdot|I_{i}|

as follows: suppose I_{i}=[a_{i},b_{i}] for simplicity. Choose a point x_{i} uniformly at random in I_{i} and define

J_{i}\stackrel{\mathrm{def}}{=}[a_{i},a_{i}+\max\{0,x_{i}+w_{i}-b_{i}\}]\cup(x_{i},\min\{x_{i}+w_{i},b_{i}\}].

Let f_{k} be g over J_{i} and 0 over I_{i}\setminus J_{i}; hence, as illustrated in Figure 4, for x\in(1/2,1],

f_{k}(x)\stackrel{\mathrm{def}}{=}\begin{cases}g(x)&x\in J_{i},\quad i\in\{1,\ldots,k/2\},\\ 0&x\in I_{i}\setminus J_{i},\quad i\in\{1,\ldots,k/2\}.\end{cases}

It is easy to show that on any sub-interval of I_{i}, the area of f_{k} is within that of u up to an additive {\cal O}(\epsilon/k): the width w_{i} is chosen so that \int_{J_{i}}g matches |I_{i}| up to {\cal O}(\epsilon/k), since g varies by at most 2\epsilon/k over I_{i}.

Construct f_{k}^{\prime} via the same method as f_{k} but mirrored about 1/2, with g^{\prime}=1+\epsilon-2\epsilon x, adding atoms to (1/2,1] and alternating between g^{\prime} and 0 on [0,1/2] as described in the construction of f_{k}.

By a birthday-paradox type argument, for any \delta>0 and any fixed number of samples n, choosing k=k(\delta,n) sufficiently large makes f_{k} and f_{k}^{\prime} indistinguishable from u with probability \geq 1-\delta. Thus w.p. \geq 1-2\delta, the estimate f^{\textit{est}}=f^{\textit{est}}_{X^{n}} is identical under both f_{k} and f_{k}^{\prime}. Therefore any estimator f^{\textit{est}} suffers a factor

c\geq(1-2\delta)\cdot\min_{f^{\textit{est}}}\max\Bigg\{\frac{\|f^{\textit{est}}-f_{k}\|_{1}}{\|f_{k}-{\cal P}_{1,1}\|_{1}},\frac{\|f^{\textit{est}}-f_{k}^{\prime}\|_{1}}{\|f_{k}^{\prime}-{\cal P}_{1,1}\|_{1}}\Bigg\}.

By the mirror-image symmetry between f_{k} and f_{k}^{\prime} about 1/2, f^{\textit{est}}=u is the optimal estimate to within an additive {\cal O}(\epsilon/k). This lower bounds c as

\frac{c}{1-2\delta} \geq\frac{\|f_{k}-u\|_{1}-{\cal O}(\epsilon/k)}{\|f_{k}-{\cal P}_{1,1}\|_{1}} \overset{(a)}{\geq}\frac{\|f_{k}-u\|_{1}-{\cal O}(\epsilon/k)}{\|f_{k}-g\|_{1}}
=\frac{\|f_{k}-u\|_{1,[0,1/2]}+\|f_{k}-u\|_{1,(1/2,1]}-{\cal O}(\epsilon/k)}{\|f_{k}-g\|_{1,[0,1/2]}+\|f_{k}-g\|_{1,(1/2,1]}}
\overset{(b)}{=}\frac{\|h_{k}\|_{1,[0,1/2]}+\|g-u\|_{1,[0,1/2]}+\|f_{k}-u\|_{1,(1/2,1]}-{\cal O}(\epsilon/k)}{\|h_{k}\|_{1,[0,1/2]}+\|f_{k}-g\|_{1,(1/2,1]}}
\overset{(c)}{=}\frac{\epsilon/4+\epsilon/4+\|f_{k}-u\|_{1,(1/2,1]}-{\cal O}(\epsilon/k)}{\epsilon/4+\|f_{k}-g\|_{1,(1/2,1]}}
\overset{(d)}{=}\frac{\epsilon/4+\epsilon/4+\|f_{k}-u\|_{1,(1/2,1]}-{\cal O}(\epsilon/k)}{\epsilon/4+\int_{1/2}^{1}g(x)\,dx-\int_{1/2}^{1}f_{k}(x)\,dx}
=\frac{\epsilon/4+\epsilon/4+\|f_{k}-u\|_{1,(1/2,1]}-{\cal O}(\epsilon/k)}{\epsilon/4+1/2+\epsilon/4-\int_{1/2}^{1}f_{k}(x)\,dx}
\overset{(e)}{=}\frac{\epsilon/4+\epsilon/4+\|f_{k}-u\|_{1,(1/2,1]}-{\cal O}(\epsilon/k)}{\epsilon/2+{\cal O}(\epsilon/k)}
\overset{(f)}{\geq}\frac{\epsilon/4+\epsilon/4+\epsilon/2-{\cal O}(\epsilon/k)}{\epsilon/2+{\cal O}(\epsilon/k)}=2-{\cal O}{\left({\frac{1}{k}}\right)},

where (a) follows since g\in{\cal P}_{1,1}, (b) follows since h_{k} is a discrete distribution, (c) follows since h_{k} has total mass \epsilon/4 and \|g-u\|_{1,[0,1/2]}=\epsilon/4 by a straightforward calculation, (d) follows since g\geq f_{k} on (1/2,1], (e) follows since the areas of f_{k} and u on (1/2,1] are equal to within an additive {\cal O}(\epsilon/k), and (f) follows since \|f_{k}-u\|_{1,(1/2,1]}\geq 2\|f_{k}-g\|_{1,(1/2,1]}-{\cal O}(\epsilon/k). Letting \delta\searrow 0 and k\nearrow\infty completes the proof.
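The chain of (in)equalities above can be checked numerically for finite k and \epsilon. In the sketch below (an illustration, not part of the proof), the left-half contributions \|h_{k}\|_{1,[0,1/2]}=\epsilon/4 and \|g-u\|_{1,[0,1/2]}=\epsilon/4 are added analytically, and each J_{i} is placed at the left end of I_{i}, which leaves the \ell_{1} norms involved unchanged.

```python
# Numeric sanity check of the Lemma 3 lower bound: the printed ratio
# approaches 2 as eps -> 0 (with k large), matching 2 - O(1/k) - O(eps).
import numpy as np

def right_half_norms(eps, k, pts=64):
    du = dg = 0.0                      # ||f_k - u|| and ||f_k - g|| on (1/2, 1]
    for i in range(1, k // 2 + 1):
        a, b = 0.5 + (i - 1) / k, 0.5 + i / k
        w = (b - a) / (1 + 2 * eps * i / k)          # width of J_i
        for lo, hi, on in [(a, a + w, True), (a + w, b, False)]:
            x = np.linspace(lo, hi, pts)
            g = 1 - eps + 2 * eps * x
            fk = g if on else np.zeros_like(x)
            du += np.trapz(np.abs(fk - 1), x)
            dg += np.trapz(np.abs(fk - g), x)
    return du, dg

k = 1000
for eps in [0.4, 0.1, 0.02]:
    du, dg = right_half_norms(eps, k)
    # add the analytic left-half contributions: eps/4 (atoms) + eps/4 (g vs u)
    print(eps, (eps / 4 + eps / 4 + du) / (eps / 4 + dg))
```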

A.2 Description and proof of Lemma 17

The following lemma shows that if f^{\textit{est}} is a c-factor approximation for {\cal P}_{t,d} for some t and d and achieves the minimax rate of a distribution class {\cal C}, then f^{\textit{est}} is also a c-factor approximation for {\cal C}.

Lemma 17.

If f^{\textit{est}} is a c-factor approximation for {\cal P}_{t,d} and for all f in a class {\cal C}, c\cdot\|f-{\cal P}_{t,d}\|_{1}+{\cal O}({\cal R}_{n}({\cal P}_{t,d}))\leq{\cal O}({\cal R}_{n}({\cal C})), then for any f, not necessarily in {\cal C},

\|f^{\textit{est}}-f\|_{1}\leq c\cdot\|f-{\cal C}\|_{1}+{\cal O}({\cal R}_{n}({\cal C})).

Proof  For a distribution g and class {\cal D}, let g_{{\cal D}}\in{\cal D} be the closest approximation to g from {\cal D}, namely achieving \|g-g_{{\cal D}}\|_{1}=\|g-{\cal D}\|_{1}. Then for any distribution f,

\|f^{\textit{est}}-f\|_{1} \leq c\cdot\|f-{\cal P}_{t,d}\|_{1}+{\cal O}{\left({{\cal R}_{n}({\cal P}_{t,d})}\right)}
\overset{(a)}{\leq}c\cdot\|f-f_{{\cal C}_{{\cal P}_{t,d}}}\|_{1}+{\cal O}{\left({{\cal R}_{n}({\cal P}_{t,d})}\right)}
\leq c\cdot\|f-f_{{\cal C}}\|_{1}+c\cdot\|f_{{\cal C}_{{\cal P}_{t,d}}}-f_{{\cal C}}\|_{1}+{\cal O}{\left({{\cal R}_{n}({\cal P}_{t,d})}\right)}
\overset{(b)}{=}c\cdot\|f-{\cal C}\|_{1}+c\cdot\|f_{{\cal C}}-{\cal P}_{t,d}\|_{1}+{\cal O}{\left({{\cal R}_{n}({\cal P}_{t,d})}\right)}
\overset{(c)}{=}c\cdot\|f-{\cal C}\|_{1}+{\cal O}{\left({{\cal R}_{n}({{\cal C}})}\right)},

where in (a), just as f_{{\cal C}} is the {\cal C} distribution closest to f, f_{{\cal C}_{{\cal P}_{t,d}}} is the {\cal P}_{t,d} distribution closest to f_{{\cal C}}; the inequality follows since f_{{\cal C}_{{\cal P}_{t,d}}}\in{\cal P}_{t,d} and, by definition, \|f-{\cal P}_{t,d}\|_{1} is the least distance from f to any q\in{\cal P}_{t,d}. Step (b) follows since, by definition, f_{{\cal C}_{{\cal P}_{t,d}}} is the best approximation to f_{{\cal C}} from {\cal P}_{t,d}, and (c) follows from the property of {\cal C} assumed in the lemma, as f_{{\cal C}}\in{\cal C}.

Appendix B Proofs for Section 4

B.1 Proof of Lemma 7

Proof  Observe that for any q\in{\cal P}_{d} and interval J,

\Delta_{J}(q)\stackrel{\mathrm{def}}{=}\max_{x\in J}q(x)-\min_{x\in J}q(x)\leq\int_{J}|q^{\prime}(t)|\,dt,

where q^{\prime} denotes the first derivative of q. The case d=0 is trivial; we prove the case d\geq 1. Consider

f_{d}(x)\stackrel{\mathrm{def}}{=}\frac{1}{x}\sin(d\arcsin(x)).

The following two claims (indicated by the overset (*)) may be verified, e.g., via Wolfram Mathematica (version 12.3).
Claim 1: \lVert f_{d}\rVert_{2}\leq\sqrt{\pi d}, because

\lVert f_{d}\rVert_{2}^{2}\overset{(*)}{=}2\int_{0}^{1}\frac{\sin^{2}(d\arcsin(x))}{x^{2}}\,dx\leq\int_{0}^{\pi/2}\frac{1-\cos(2d\theta)}{\sin^{2}\theta}\,d\theta=\pi d.

Claim 2: f_{d}(0)\overset{(*)}{=}d and f^{\prime}_{d}(0)\overset{(*)}{=}0.
Let f(t)\stackrel{\mathrm{def}}{=}q(t)\sqrt{1-t^{2}} and note that for any x\in[0,1],

\sqrt{1-x^{2}}\int_{-x}^{x}|q^{\prime}(t)|\,dt\leq\int_{-1}^{1}|q^{\prime}(t)|\sqrt{1-t^{2}}\,dt\leq\int_{-1}^{1}|f^{\prime}(t)|\,dt+\int_{-1}^{1}\frac{|t\,q(t)|}{\sqrt{1-t^{2}}}\,dt,

where the final inequality follows from the product rule, f^{\prime}(t)=q^{\prime}(t)\sqrt{1-t^{2}}-t\,q(t)/\sqrt{1-t^{2}}, and the triangle inequality.

For simplicity let g_{d}(x)\stackrel{\mathrm{def}}{=}\sqrt{\frac{2}{\pi d}}\cdot f_{\frac{d}{2}}(x); then g_{d}(0)=\sqrt{\frac{d}{2\pi}} and g^{\prime}_{d}(0)=0. Hence,

\frac{1}{2\pi}|f^{\prime}(0)|=\frac{1}{d}|f^{\prime}(0)|\,g^{2}_{d}(0)\overset{(a)}{\leq}2\max_{t}|f(t)|\,g^{2}_{d}(t)\overset{(b)}{\leq}4d\int_{-1}^{1}|q(t)|\,g^{2}_{d}(t)\,dt,

where (b) follows from Bernstein's inequality, |q(t)|\sqrt{1-t^{2}}\leq d\int_{-1}^{1}|q(z)|\,dz for a degree-d polynomial q and any t\in[0,1], applied to the polynomial q(t)\,g_{d}^{2}(t) of degree 2d-2. For (a) we apply Bernstein's inequality to q(z)=\frac{d}{dz}{\left({f(z)g^{2}_{d}(z)}\right)}, set t=0, and note that g_{d}^{\prime}(0)=0.
Equivalently, by a change of variables, for c\stackrel{\mathrm{def}}{=}8\pi and f_{\text{\tiny T}}(\theta)\stackrel{\mathrm{def}}{=}f(\sin\theta),

|f_{\text{\tiny T}}^{\prime}(0)|=|f^{\prime}(0)|\leq c\,d\int_{-1}^{1}|q(t)|\,g^{2}_{d}(t)\,dt\leq c\,d\int_{-\pi}^{\pi}|f_{\text{\tiny T}}(\theta)|\,g^{2}_{d}(\sin\theta)\,d\theta.

Using the same reasoning for f_{\text{\tiny T}}(\theta+\alpha) instead of f_{\text{\tiny T}}(\theta), we obtain

|f_{\text{\tiny T}}^{\prime}(\alpha)|\leq c\,d\int_{-\pi}^{\pi}|f_{\text{\tiny T}}(\theta+\alpha)|\,g^{2}_{d}(\sin\theta)\,d\theta=c\,d\int_{-\pi}^{\pi}|f_{\text{\tiny T}}(\theta)|\,g^{2}_{d}(\sin(\theta-\alpha))\,d\theta,

where the equality follows since |f_{\text{\tiny T}}(\theta+\alpha)|\,g^{2}_{d}(\sin(\theta)) is a periodic function with period 2\pi. Integrating both sides over \alpha from -\pi to \pi yields

\int_{-1}^{1}|f^{\prime}(t)|\,dt=\int_{-\pi}^{\pi}|f_{\text{\tiny T}}^{\prime}(\alpha)|\,d\alpha\overset{(a)}{\leq}c\,d\int_{-\pi}^{\pi}|f_{\text{\tiny T}}(\theta)|\,d\theta=c\,d\int_{-1}^{1}|q(t)|\,dt,

where (a) follows since \lVert g_{d}\rVert_{2}\leq 1 from Claim 1 and the definition of g_{d}. Finally, for a suitable \tau\in(0,1) of order 1/d^{2},

\int_{-1}^{1}\frac{|t\,q(t)|}{\sqrt{1-t^{2}}}\,dt \leq\int_{-1+\tau}^{1-\tau}\frac{|q(t)|}{\sqrt{1-t^{2}}}\,dt+2\max_{t}|q(t)|\int_{1-\tau}^{1}\frac{t}{\sqrt{1-t^{2}}}\,dt
\overset{(a)}{\leq}\frac{\int_{-1}^{1}|q(t)|\,dt}{\sqrt{2\tau-\tau^{2}}}+2\max_{t}|q(t)|\int_{1-\tau}^{1}\frac{t}{\sqrt{1-t^{2}}}\,dt
=\frac{\int_{-1}^{1}|q(t)|\,dt}{\sqrt{2\tau-\tau^{2}}}+2\max_{t}|q(t)|\cdot\sqrt{2\tau-\tau^{2}}
\overset{(b)}{\leq}\frac{\int_{-1}^{1}|q(t)|\,dt}{\sqrt{2\tau-\tau^{2}}}+2\sqrt{2\tau-\tau^{2}}\cdot(d+1)^{2}\int_{-1}^{1}|q(t)|\,dt
\overset{(c)}{=}2\sqrt{2}\,(d+1)\int_{-1}^{1}|q(t)|\,dt,

where (a) follows since 1/\sqrt{1-t^{2}}\leq 1/\sqrt{2\tau-\tau^{2}} for t\in[-1+\tau,1-\tau], (b) follows from the Markov Brothers' inequality, and (c) holds for some \tau={\cal O}(1/d^{2}). The proof is completed by noting that 8\pi+2\sqrt{2}<28.
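As a quick numerical spot-check of the bound just proved, \sqrt{1-x^{2}}\int_{-x}^{x}|q^{\prime}(t)|\,dt\leq 28(d+1)\int_{-1}^{1}|q(t)|\,dt, the sketch below evaluates both sides on random polynomials; it is a sanity check, not a proof.

```python
# Numeric spot-check of the Lemma 7 bound
#   sqrt(1 - a^2) * int_{-a}^{a} |q'(t)| dt  <=  28 (d+1) * int_{-1}^{1} |q(t)| dt
# on random polynomials.
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 100001)

for d in [1, 3, 8, 15]:
    worst = 0.0
    for _ in range(20):
        q = rng.standard_normal(d + 1)               # random degree-d coefficients
        qx, dqx = P.polyval(x, q), P.polyval(x, P.polyder(q))
        rhs = 28 * (d + 1) * np.trapz(np.abs(qx), x)
        for a in [0.5, 0.9, 0.99]:
            m = np.abs(x) <= a
            lhs = np.sqrt(1 - a * a) * np.trapz(np.abs(dqx[m]), x[m])
            worst = max(worst, lhs / rhs)
    print(d, worst)   # the ratio stays below 1 in all trials
```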

B.2 Proof of Lemma 8

Since for any i\in\{1,\ldots,m\}, both \bar{I}_{i}^{+}\in\bar{I}^{+} and \bar{I}_{i}^{-}\in\bar{I}^{-} have \lceil\ell(d+1)/2^{i/4}\rceil intervals, the total number of intervals in (\bar{I}^{-},\bar{I}^{+}) is upper bounded as

2{\left({\sum_{i=1}^{m}|\bar{I}_{i}|+1}\right)} =2{\left({\sum_{i=1}^{m}\Bigg\lceil\frac{\ell(d+1)}{2^{i/4}}\Bigg\rceil+1}\right)}
\leq 2{\left({\sum_{i=1}^{m}{\left({\frac{\ell(d+1)}{2^{i/4}}+1}\right)}+1}\right)}
\leq 2{\left({\sum_{i=1}^{\infty}\frac{\ell(d+1)}{2^{i/4}}+m+1}\right)}
=2{\left({\frac{\ell(d+1)}{2^{1/4}}\cdot\frac{1}{1-2^{-1/4}}+\log_{2}(\ell(d+1)^{2})+1}\right)}
\overset{(a)}{\leq}2{\left({\frac{\ell(d+1)}{2^{1/4}}\cdot\frac{1}{1-2^{-1/4}}+2\log_{2}(\ell(d+1))+1}\right)}
\overset{(b)}{\leq}2{\left({\frac{\ell(d+1)}{2^{1/4}}\cdot\frac{1}{1-2^{-1/4}}+\frac{2\ell(d+1)}{\log 2}}\right)}
=2\ell(d+1){\left({\frac{1}{2^{1/4}-1}+\frac{2}{\log 2}}\right)}
\leq\frac{4\ell(d+1)}{2^{1/4}-1},

where (a) follows since \ell\geq 1, and (b) follows from the identity \log(x)\leq x-1 for any x\geq 1.
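The count above can also be verified numerically; the sketch below uses m=\lceil\log_{2}(\ell(d+1)^{2})\rceil, our reading of the proof's substitution for m.

```python
# Quick check that the interval count in Lemma 8 respects the final bound
#   2*(sum_i ceil(l(d+1)/2^(i/4)) + 1) <= 4 l (d+1) / (2^(1/4) - 1).
import math

for l in [1, 2, 8, 64]:
    for d in [0, 1, 4, 9]:
        m = max(1, math.ceil(math.log2(l * (d + 1) ** 2)))
        count = 2 * (sum(math.ceil(l * (d + 1) / 2 ** (i / 4))
                         for i in range(1, m + 1)) + 1)
        bound = 4 * l * (d + 1) / (2 ** 0.25 - 1)
        assert count <= bound, (l, d)
print("all cases within the bound")
```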

B.3 Proof of Lemma 9

Proof  We provide a proof for I=[-1,1] by considering \bar{I}=\overline{[-1,1]}^{d,k}. An identical proof follows for any other interval I, as its partition \overline{I}^{d,k} is obtained by a linear transformation of \overline{[-1,1]}^{d,k}. For i\in\{1,\ldots,m\}, let

I_{i}\stackrel{\mathrm{def}}{=}I_{i}^{+}\cup I_{i}^{-}.

Similarly, let {E}_{m}\stackrel{\mathrm{def}}{=}{E}_{m}^{+}\cup{E}_{m}^{-}. Applying Lemma 7 with a=1-1/2^{i}, we obtain

\int_{I_{i}}|p^{\prime}(x)|\,dx\overset{(a)}{\leq}\int_{-{\left({1-1/2^{i}}\right)}}^{{\left({1-1/2^{i}}\right)}}|p^{\prime}(x)|\,dx \leq\frac{28(d+1){\|p\|}_{1,I}}{(1-{\left({1-1/2^{i}}\right)}^{2})^{1/2}}
\leq\frac{28(d+1){\|p\|}_{1,I}}{(1-{\left({1-1/2^{i}}\right)})^{1/2}}
\leq 28\cdot 2^{i/2}\cdot(d+1){\|p\|}_{1,I},

where (a) follows since I_{i}\subseteq[-{\left({1-1/2^{i}}\right)},{\left({1-1/2^{i}}\right)}]. As \bar{I}_{i}^{+} consists of \lceil\ell(d+1)/2^{i/4}\rceil equal-width intervals from Equation (4), and since \bar{I}_{i}^{+} is of width |\bar{I}_{i}^{+}|=1/2^{i}, each interval in \bar{I}_{i}^{+} (and similarly in \bar{I}^{-}) is of width \leq 1/(2^{3i/4}\cdot\ell(d+1)). Thus from Equation (2), the \ell_{1} difference between p and \bar{p}_{\bar{I}} over I_{i} is bounded as

\|p-\bar{p}_{\bar{I}}\|_{1,I_{i}} \leq\sum_{J\in\bar{I}_{i}}\Delta_{J}(p)\cdot|J|
\leq\sum_{J\in\bar{I}_{i}}\Delta_{J}(p)\cdot\frac{1}{2^{3i/4}\cdot\ell(d+1)}
\leq{\left({\sum_{J\in\bar{I}_{i}}\int_{J}|p^{\prime}(x)|\,dx}\right)}\cdot\frac{1}{2^{3i/4}\cdot\ell(d+1)}
\leq\int_{I_{i}}|p^{\prime}(x)|\,dx\cdot\frac{1}{2^{3i/4}\cdot\ell(d+1)}
\leq 28\cdot 2^{i/2}\cdot(d+1){\|p\|}_{1,I}\cdot\frac{1}{2^{3i/4}\cdot\ell(d+1)}=\frac{28\cdot 2^{-i/4}{\|p\|}_{1,I}}{\ell}.

Therefore

\|p-\bar{p}_{\bar{I}}\|_{1,I} =\sum_{i=1}^{m}\|p-\bar{p}_{\bar{I}}\|_{1,I_{i}}+\|p-\bar{p}_{\bar{I}}\|_{1,{E}_{m}}
\leq\sum_{i=1}^{m}\frac{28\cdot 2^{-i/4}{\|p\|}_{1,I}}{\ell}+\max_{x\in{E}_{m}}{p(x)}\cdot\frac{2}{\ell(d+1)^{2}}
\overset{(a)}{\leq}\sum_{i=1}^{m}\frac{28\cdot 2^{-i/4}{\|p\|}_{1,I}}{\ell}+(d+1)^{2}{\|p\|}_{1,I}\cdot\frac{2}{\ell(d+1)^{2}}
\overset{(b)}{\leq}\frac{{\|p\|}_{1,I}}{\ell}{\left({\frac{28}{1-2^{-1/4}}+2}\right)}
\overset{(c)}{\leq}\frac{4(d+1){\|p\|}_{1,I}}{k(2^{1/4}-1)}{\left({\frac{28}{1-2^{-1/4}}+2}\right)}
\leq\frac{3764(d+1){\|p\|}_{1,I}}{k},

where (a) follows since I is a symmetric interval and, for all x\in I, p(x)\leq(d+1)^{2}{\|p\|}_{1,I} by the Markov Brothers' inequality, (b) follows by summing the infinite geometric series, and (c) follows since \ell\stackrel{\mathrm{def}}{=}k(2^{1/4}-1)/(4(d+1)) as defined in Equation (4).
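The bound can be exercised numerically under a reconstruction of the partition: we take I_{i}^{+}=(1-1/2^{i-1},1-1/2^{i}], split into \lceil\ell(d+1)/2^{i/4}\rceil equal intervals (mirrored for I_{i}^{-}), let E_{m} be the leftover region near \pm 1, and let \bar{p}_{\bar{I}} flatten p to its average on each interval. This layout is our reading of Equations (2) and (4), so the check is illustrative only, and the constant 3764 is loose by design.

```python
# Hedged numeric check of Lemma 9 on [-1, 1], under our reconstruction
# of the partition described in the lead-in above.
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(2)
d, k = 2, 100
ell = k * (2 ** 0.25 - 1) / (4 * (d + 1))
m = int(np.ceil(np.log2(ell * (d + 1) ** 2)))

bps = {-1.0, 1.0}                                  # partition breakpoints
for i in range(1, m + 1):
    lo, hi = 1 - 2 ** -(i - 1), 1 - 2 ** -i        # right slab I_i^+
    n_i = int(np.ceil(ell * (d + 1) / 2 ** (i / 4)))
    for b in np.linspace(lo, hi, n_i + 1):
        bps.update((b, -b))                        # mirror for I_i^-
bps = np.array(sorted(bps))

c = rng.standard_normal(d // 2 + 1)
p = P.polymul(c, c)                                # nonnegative, degree d

x = np.linspace(-1, 1, 400001)
px = P.polyval(x, p)
err = 0.0
for lo, hi in zip(bps[:-1], bps[1:]):
    mask = (x >= lo) & (x <= hi)
    if mask.sum() > 1:
        seg = px[mask]
        err += np.trapz(np.abs(seg - seg.mean()), x[mask])
norm1 = np.trapz(np.abs(px), x)
print(err, 3764 * (d + 1) * norm1 / k)             # err is far below the bound
```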

B.4 Proof of Lemma 11

Proof  Select a p^{*}\in{\cal P}_{d} that achieves \|f-p^{*}\|_{1,I}=\|f-{\cal P}_{d}\|_{1,I}. Then

\|f^{\textit{adj}}-f\|_{1,I} \leq\|f^{\textit{adj}}-p^{*}\|_{1,I}+\|p^{*}-f\|_{1,I}
\overset{(a)}{\leq}2\cdot\|f-p^{*}\|_{1,I}+\frac{c_{1}(d+1)\|p^{*}-f^{\textit{poly}}\|_{1,I}}{k}+\|f^{\textit{emp}}-f\|_{{\cal A}_{k},I}
\leq 2\cdot\|f-p^{*}\|_{1,I}+\frac{c_{1}(d+1){\left({\|p^{*}-f\|_{1,I}+\|f-f^{\textit{poly}}\|_{1,I}}\right)}}{k}+\|f^{\textit{emp}}-f\|_{{\cal A}_{k},I}
\overset{(b)}{\leq}{\left({2+\frac{c_{1}(d+1)(1+c^{\prime})}{k}}\right)}\|f-p^{*}\|_{1,I}+{\left({\frac{c_{1}(d+1)c^{\prime\prime}}{k}+1}\right)}\|f^{\textit{emp}}-f\|_{{\cal A}_{k},I}+\frac{c_{1}(d+1)}{k}\cdot\eta,

where (a) follows from setting p=p^{*} in Lemma 10, and (b) follows from using \|f^{\textit{poly}}-f\|_{1,I}\leq c^{\prime}\|f-p^{*}\|_{1,I}+c^{\prime\prime}\|f^{\textit{emp}}-f\|_{{\cal A}_{d+1},I}+\eta along with the fact that k\geq d+1 (so that \|f^{\textit{emp}}-f\|_{{\cal A}_{d+1},I}\leq\|f^{\textit{emp}}-f\|_{{\cal A}_{k},I}).

B.5 Proof of Theorem 12

Proof  Since c=3 and c^{\prime}=2 for the f^{\textit{adls}} estimate, it follows for I=[X_{(0)},X_{(n)}] from Lemma 11 that

\|f^{\textit{adj}}-f\|_{1,I} \leq{\left({2+\frac{3c_{1}(d+1)}{k}}\right)}\|f-p^{*}\|_{1}+{\left({\frac{2c_{1}(d+1)}{k}+1}\right)}\|f^{\textit{emp}}-f\|_{{\cal A}_{d+1}}+\frac{c_{1}(d+1)}{k}\cdot\eta_{d}
\leq{\left({2+\gamma}\right)}\|f-{\cal P}_{d}\|_{1}+{\left({\frac{\gamma}{4}+1}\right)}\|f^{\textit{emp}}-f\|_{{\cal A}_{d+1}}+\frac{\gamma}{8}\cdot\eta_{d},

where the last inequality follows since we choose k=k(\gamma)\geq 8c_{1}(d+1)/\gamma. Let J=\mathbb{R}\setminus I. Using Lemma 4,

\operatorname*{\mathbb{E}}\|f^{\textit{adj}}-f\|_{1} \leq{\left({2+\gamma}\right)}\|f-{\cal P}_{d}\|_{1,I}+{\left({\frac{\gamma}{4}+1}\right)}\operatorname*{\mathbb{E}}\|f^{\textit{emp}}-f\|_{{\cal A}_{d+1}}+\frac{\gamma}{8}\cdot\eta_{d}+{\left({2+\gamma}\right)}{\|f\|}_{1,J}
={\left({2+\gamma}\right)}\|f-{\cal P}_{d}\|_{1,I}+{\left({\frac{\gamma}{4}+1}\right)}\operatorname*{\mathbb{E}}\|f^{\textit{emp}}-f\|_{{\cal A}_{d+1}}+\frac{\gamma}{8}\cdot\eta_{d}+{\left({2+\gamma}\right)}\operatorname*{\mathbb{E}}\|f-f^{\textit{emp}}\|_{{\cal A}_{2},J}
\overset{(a)}{\leq}{\left({2+\gamma}\right)}\|f-{\cal P}_{d}\|_{1,I}+{\left({\frac{\gamma}{4}+1}\right)}{\cal O}{\left({\sqrt{\frac{k}{n}}}\right)}+\frac{\gamma}{8}\cdot\eta_{d}+3\cdot{\cal O}{\left({\sqrt{\frac{2}{n}}}\right)}
\overset{(b)}{\leq}{\left({2+\gamma}\right)}\|f-{\cal P}_{d}\|_{1}+{\cal O}{\left({\sqrt{\frac{d+1}{\gamma\cdot n}}}\right)},

where (a) follows since \gamma<1 and from Lemma 4, and (b) follows as \eta_{d}=\sqrt{(d+1)/n}, k={\cal O}(d+1), and 0<\gamma<1.

Appendix C Proofs for Section 5

C.1 Proof of Theorem 13

From Lemma 14,

\|f^{\textit{out}}_{t,d,\alpha}-f\|_{1} \leq{\left({2+\frac{4c_{1}(d+1)}{k}+\frac{1+k/(d+1)}{\beta-1}}\right)}\|f-{\cal P}_{t,d}\|_{1}+{\left({3+2c_{1}+\frac{2}{\beta-1}}\right)}\|f^{\textit{emp}}-f\|_{{\cal A}_{2\beta t\cdot k}}+{\left({\frac{c_{1}(d+1)}{k}+\frac{k}{(\beta-1)(d+1)}}\right)}\eta_{d}
\overset{(a)}{\leq}{\left({2+\frac{4c_{1}(d+1)}{k}+\frac{\alpha(d+1)}{4k}\cdot{\left({1+\frac{k}{d+1}}\right)}}\right)}\|f-{\cal P}_{t,d}\|_{1}+{\left({3+2c_{1}+\frac{\alpha(d+1)}{2k}}\right)}\|f^{\textit{emp}}-f\|_{{\cal A}_{2\beta t\cdot k}}+{\left({\frac{c_{1}(d+1)}{k}+\frac{\alpha(d+1)}{4k}\cdot\frac{k}{d+1}}\right)}\eta_{d}
\overset{(b)}{\leq}{\left({2+\frac{\alpha}{2}+\frac{\alpha}{2}}\right)}\|f-{\cal P}_{t,d}\|_{1}+{\left({3+2c_{1}+\frac{\alpha}{2}}\right)}\|f^{\textit{emp}}-f\|_{{\cal A}_{2\beta t\cdot k}}+{\left({\frac{\alpha}{8}+\frac{\alpha}{4}}\right)}\eta_{d},

where (a) follows since by definition \beta-1=4k/(\alpha(d+1)), and (b) follows since k\stackrel{\mathrm{def}}{=}\lceil 8c_{1}(d+1)/\alpha\rceil and since 0<\alpha<1 and c_{1}>1 imply k\geq d+1. From Lemma 4,

\operatorname*{\mathbb{E}}\|f^{\textit{out}}_{t,d,\alpha}-f\|_{1} \leq{\left({2+\alpha}\right)}\|f-{\cal P}_{t,d}\|_{1}+{\left({3+2c_{1}+\frac{\alpha}{2}}\right)}\operatorname*{\mathbb{E}}\|f^{\textit{emp}}-f\|_{{\cal A}_{2\beta t\cdot k}}+\frac{3\alpha\eta_{d}}{8}
\leq{\left({2+\alpha}\right)}\|f-{\cal P}_{t,d}\|_{1}+{\left({3+2c_{1}+\frac{\alpha}{2}}\right)}{\cal O}{\left({\sqrt{\frac{2\beta tk}{n}}}\right)}+\frac{3\alpha\eta_{d}}{8}
\overset{(a)}{\leq}{\left({2+\alpha}\right)}\|f-{\cal P}_{t,d}\|_{1}+{\left({3+2c_{1}+\frac{\alpha}{2}}\right)}{\cal O}{\left({\sqrt{\frac{k^{2}t}{\alpha(d+1)n}}}\right)}+\frac{3\alpha\eta_{d}}{8}
\overset{(b)}{\leq}{\left({2+\alpha}\right)}\|f-{\cal P}_{t,d}\|_{1}+{\left({3+2c_{1}+\frac{\alpha}{2}}\right)}{\cal O}{\left({\sqrt{\frac{c_{1}^{2}(d+1)^{2}t}{\alpha^{3}(d+1)n}}}\right)}+\frac{3\alpha\eta_{d}}{8}
\overset{(c)}{\leq}{\left({2+\alpha}\right)}\|f-{\cal P}_{t,d}\|_{1}+{\cal O}{\left({\sqrt{\frac{t(d+1)}{\alpha^{3}n}}}\right)},

where (a) and (b) both follow from the definitions of \beta and k in Equations (7) and (6), and (c) follows since \eta_{d}\stackrel{\mathrm{def}}{=}\sqrt{(d+1)/n} and 0<\alpha<1.

C.2 Proof of Lemma 14

For simplicity, denote f^{\textit{out}}\stackrel{\mathrm{def}}{=}f^{\textit{out}}_{t,d,\alpha} and consider a particular p^{*}\in{\cal P}_{t,d} that achieves \|f-{\cal P}_{t,d}\|_{1}. Let \bar{F} denote the set of intervals in \bar{I}_{\mathrm{ADLS}} on which p^{*} is a single polynomial piece. Let \bar{J}\stackrel{\mathrm{def}}{=}\bar{I}_{\mathrm{ADLS}}\setminus\bar{F} be the remaining intervals, where p^{*} has more than one polynomial piece. Since p^{*}\in{\cal P}_{t,d} has t polynomial pieces, the number of intervals in \bar{J} is \leq t.

Recall that for any subset S\subseteq\mathbb{R}, integrable functions g_{1}, g_{2}, and integer m\geq 1, \|g_{1}-g_{2}\|_{1,S} and \|g_{1}-g_{2}\|_{{\cal A}_{m},S} denote the \ell_{1} and {\cal A}_{m} distances over S, respectively. Equation (14) in (Acharya et al., 2017) shows that over \bar{F},

\sum_{I\in\bar{F}}\|f^{\textit{adls}}_{I}-f\|_{1}\leq 3\|f-p^{*}\|_{1,\bar{F}}+2\|f^{\textit{emp}}-f\|_{{\cal A}_{|\bar{F}|\cdot(d+1)},\bar{F}}+\eta_{d}. (9)

We bound the error in \bar{F} by setting p=p^{*},\ f^{\textit{poly}}=f^{\textit{adls}}_{I} in Lemma 10, using Equation (9), and noting that k\geq d+1 as c_{1}\geq 1 from Equation (6):

\sum_{I\in\bar{F}}\|f^{\textit{out}}-f\|_{1,I} \leq 2\sum_{I\in\bar{F}}\|f-p^{*}\|_{1,I}+\|f^{\textit{emp}}-f\|_{{\cal A}_{k\cdot|\bar{F}|}}+\frac{c_{1}(d+1)}{k}(\|p^{*}-f\|_{1}+\|f-f^{\textit{adls}}_{I}\|_{1})
\leq 2\|f-p^{*}\|_{1,\bar{F}}+(1+2c_{1})\|f^{\textit{emp}}-f\|_{{\cal A}_{k\cdot|\bar{F}|}}+\frac{c_{1}(d+1)}{k}{\left({4\|f-p^{*}\|_{1}+\eta_{d}}\right)}. (10)

From Lemma 49 of (Acharya et al., 2017), Equation (11) below holds for all intervals I\in\bar{J}; it is used there to derive Equation (12).

\|f^{\textit{adls}}_{I}-f^{\textit{emp}}\|_{{\cal A}_{d+1},I} \leq\frac{\|f-p^{*}\|_{1}+\|f^{\textit{emp}}-f\|_{{\cal A}_{2\beta t\cdot(d+1)}}+\eta_{d}}{(\beta-1)t}. (11)
\sum_{I\in\bar{J}}\|f^{\textit{adls}}_{I}-f\|_{1,I} \leq\frac{\|f-p^{*}\|_{1}+\|f^{\textit{emp}}-f\|_{{\cal A}_{2\beta t\cdot(d+1)}}}{\beta-1}+2\|f^{\textit{emp}}-f\|_{{\cal A}_{2\beta t\cdot(d+1)}}+2\|f-p^{*}\|_{1,\bar{J}}+\frac{\eta_{d}}{2(\beta-1)}. (12)

Recall that we obtain f^{\textit{out}} by adding a constant to f^{\textit{adls}}_{I} along each interval I\in\overline{I}^{d,k} to match its area to f^{\textit{emp}} on that interval. Since \overline{I}^{d,k} has \leq k intervals (Lemma 8), \forall I\in\bar{J},

\|f^{\textit{out}}-f^{\textit{adls}}_{I}\|_{1,I} \leq\|f^{\textit{adls}}_{I}-f^{\textit{emp}}\|_{{\cal A}_{k},I}\leq\frac{k}{d+1}\|f^{\textit{adls}}_{I}-f^{\textit{emp}}\|_{{\cal A}_{d+1},I}, (13)

where the last inequality follows from Property 6.

Adding Equations (12) and (13) over the intervals in \bar{J}, and noting that \bar{J} has \leq t intervals and k\geq d+1, implies

\sum_{I\in\bar{J}}\|f^{\textit{out}}-f\|_{1,I}\leq\frac{1+k/(d+1)}{\beta-1}\cdot\|f-p^{*}\|_{1}+2\|f-p^{*}\|_{1,\bar{J}}+2{\left({1+\frac{1}{\beta-1}}\right)}\|f^{\textit{emp}}-f\|_{{\cal A}_{2\beta t\cdot k}}+\frac{k\cdot\eta_{d}}{(d+1)(\beta-1)}. (14)

Adding Equations (10) and (14) proves Lemma 14 (since p^{*} satisfies \|f-p^{*}\|_{1}=\|f-{\cal P}_{t,d}\|_{1}).

Appendix D Proofs for Section 6

D.1 Proof of Theorem 15

Proof  Applying the probabilistic version of the VC inequality, i.e., Lemma 4 (see (Devroye & Lugosi, 2012)), to Lemma 14, we have with probability \geq 1-\delta,

\|f^{\textit{out}}_{t,d,\alpha}-f\|_{1}\leq c\cdot\|f-{\cal P}_{t,d}\|_{1}+{\cal O}{\left({\sqrt{\frac{t(d+1)+\log 1/\delta}{n}}}\right)}.

From the union bound, the above condition holds for the \log n-sized collection of estimates \{f^{\textit{out}}_{t,d,\alpha}:t\in\{1,2,4,\ldots,n\}\} with probability \geq 1-\delta\cdot\log n. Applying the method discussed in Section 6.2 with \gamma=2+2/\beta yields t^{\textit{est}}=t^{\textit{est}}_{\beta}. Lemma 16 then implies that w.p. \geq 1-\delta\cdot\log n,

\|f^{\textit{out}}_{t^{\textit{est}},d,\alpha}-f\|_{1} \leq\min_{t\in\{1,2,4,\ldots,n\},d}{\left({{\left({1+\frac{2}{\gamma-2}}\right)}\cdot c\cdot\|f-{\cal P}_{t,d}\|_{1}+(\gamma+1)\chi\sqrt{\frac{t(d+1)}{n}}}\right)}
\overset{(a)}{\leq}\min_{0\leq t\leq n,d}{\left({{\left({1+\frac{2}{\gamma-2}}\right)}\cdot c\cdot\|f-{\cal P}_{t,d}\|_{1}+\sqrt{2}(\gamma+1)\chi\sqrt{\frac{t(d+1)+\log 1/\delta}{n}}}\right)}
\overset{(b)}{\leq}\min_{0\leq t\leq n,d}{\left({{\left({1+\beta}\right)}\cdot c\cdot\|f-{\cal P}_{t,d}\|_{1}+{\cal O}{\left({\sqrt{\frac{t(d+1)+\log 1/\delta}{\beta^{2}n}}}\right)}}\right)},

where (a) follows from the fact that for any 1\leq t\leq n, \exists t^{\prime}\in\{1,2,4,\ldots,n\}:t^{\prime}\in[t,2t] (so that \|f-{\cal P}_{t^{\prime},d}\|_{1}\leq\|f-{\cal P}_{t,d}\|_{1}), and (b) follows since \gamma=2+2/\beta.

D.2 Proof of Lemma 16

Proof  For i\geq i_{\gamma}, from the triangle inequality, and since by definition d{\left({{v_{i_{\gamma}}},{v_{i}}}\right)}\leq\gamma c_{i} for all i\geq i_{\gamma},

d{\left({{v_{i_{\gamma}}},{v}}\right)}\leq d{\left({{v_{i}},{v}}\right)}+d{\left({{v_{i_{\gamma}}},{v_{i}}}\right)}\leq b_{i}+c_{i}+\gamma c_{i}=b_{i}+(1+\gamma)c_{i}.

For i<i_{\gamma}, if

b_{i_{\gamma}-1}\geq\frac{\gamma-2}{2}c_{i_{\gamma}},

the proof follows since for any 1\leq j^{\prime}\leq i_{\gamma}-1,

d{\left({{v_{i_{\gamma}}},{v}}\right)}\leq b_{i_{\gamma}}+c_{i_{\gamma}}\leq b_{i_{\gamma}}+\frac{2}{\gamma-2}b_{j}\overset{(a)}{\leq}{\left({1+\frac{2}{\gamma-2}}\right)}b_{j^{\prime}},

where (a) follows since j^{\prime}\leq j<i_{\gamma}. On the other hand, if

b_{i_{\gamma}-1}<\frac{\gamma-2}{2}c_{i_{\gamma}},

then \forall j^{\prime\prime}\geq j+1,

d{\left({{v_{j}},{v_{j^{\prime\prime}}}}\right)}\leq b_{j}+b_{j^{\prime\prime}}+c_{j}+c_{j^{\prime\prime}}\overset{(a)}{\leq}2b_{j}+2c_{j^{\prime\prime}}\leq 2\cdot\frac{\gamma-2}{2}c_{i_{\gamma}}+2c_{j^{\prime\prime}}\overset{(b)}{\leq}\gamma c_{j^{\prime\prime}},

where (a) follows since j^{\prime\prime}\geq j, and (b) follows since j^{\prime\prime}\geq j=i_{\gamma}, contradicting the definition of i_{\gamma}.
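For concreteness, the following sketch implements one reading of the selection rule analyzed above: i_{\gamma} is taken to be the smallest index whose estimate is \gamma c_{j}-close to every later estimate v_{j}, in the spirit of Lepski-style model selection. The function name, the toy numbers, and the tie-breaking are our reconstruction, not verbatim from the paper.

```python
# Hypothetical reconstruction of the Section 6.2 selection rule.
def select_index(dist, c, gamma):
    """Smallest i whose estimate is gamma*c_j-close to every later v_j."""
    m = len(c)
    for i in range(m):
        if all(dist[i][j] <= gamma * c[j] for j in range(i + 1, m)):
            return i
    return m - 1

# Toy usage: v_i are scalar estimates of a quantity near 0.3; the bias
# shrinks with i while the deviation bound c_i grows.
c = [0.1, 0.2, 0.4, 0.8]
v = [1.00, 0.40, 0.32, 0.30]
dist = [[abs(a - b) for b in v] for a in v]
print(select_index(dist, c, gamma=2.5))   # prints 1
```

In the toy run, the first estimate is heavily biased and the later ones carry larger deviation bounds, so the rule settles on an intermediate index, mirroring the bias and deviation tradeoff in the lemma.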

Appendix E Additional Experiments

Figure 5: \ell_{1} error versus number of samples on the Beta, Gamma, and Gaussian mixtures respectively in Figure 1 for d=1.

We compare \mathrm{SURF} (Hao et al., 2020) against \mathrm{TURF} and \mathrm{ADLS} (Acharya et al., 2017) on the non-noisy distributions considered in Section 7, namely the mixtures of Beta: .4B(.8,4)+.6B(2,2), Gamma: .7\Gamma(2,2)+.3\Gamma(7.5,1), and Gaussians: .65{\cal N}(-.45,.15^{2})+.35{\cal N}(.3,.2^{2}). The results are shown in Figure 5. While \mathrm{SURF} achieves a lower error, this may be due to its implicit cross-validation method, unlike \mathrm{ADLS} and \mathrm{TURF}, which rely on our independent cross-validation procedure of Section 6. While the primary focus of our work was determining the optimal approximation constant, evaluating the experimental performance of the various piecewise polynomial estimators may be an interesting topic for future research.