
U-statistics of growing order and sub-Gaussian mean estimators with sharp constants

Stanislav Minsker Department of Mathematics, University of Southern California
email: minsker@usc.edu
Abstract

This paper addresses the following question: given a sample of i.i.d. random variables with finite variance, can one construct an estimator of the unknown mean that performs nearly as well as if the data were normally distributed? One of the most popular examples achieving this goal is the median of means estimator. However, it is inefficient in the sense that the constants in the resulting bounds are suboptimal. We show that a permutation-invariant modification of the median of means estimator admits deviation guarantees that are sharp up to a 1+o(1) factor if the underlying distribution possesses more than \frac{3+\sqrt{5}}{2}\approx 2.62 moments and is absolutely continuous with respect to the Lebesgue measure. This result yields potential improvements for a variety of algorithms that rely on the median of means estimator as a building block. At the core of our argument are new deviation inequalities for U-statistics whose order is allowed to grow with the sample size, a result that could be of independent interest.

keywords: [class=MSC] 62G35, 60E15, 62E20
keywords: U-statistics, median-of-means estimator, heavy tails

The author acknowledges support by the National Science Foundation grants CIF-1908905 and DMS CAREER-2045068.

1 Introduction.

Let X_{1},\ldots,X_{N} be i.i.d. random variables with distribution P having mean \mu and finite variance \sigma^{2}. At the core of this paper is the following question: given 1\leq t\leq t_{\max}(N), construct an estimator \widetilde{\mu}_{N}=\widetilde{\mu}_{N}(X_{1},\ldots,X_{N}) such that

\mathbb{P}\left(\left|\widetilde{\mu}_{N}-\mu\right|\geq\sigma\sqrt{\frac{t}{N}}\right)\leq 2e^{-\frac{t}{L}} (1)

for some absolute positive constant L. Estimators that satisfy this deviation property are called sub-Gaussian. For example, the sample mean \bar{X}_{N}=\frac{1}{N}\sum_{j=1}^{N}X_{j} is sub-Gaussian for t_{\max}\asymp q(N,P) where q(N,P)\to\infty as N\to\infty and the constant L equals 2: this immediately follows from the fact that convergence of the distribution functions in the central limit theorem is uniform. However, q(N,P) can grow arbitrarily slowly in general, and it grows as \log^{1/2}(N) if \mathbb{E}|X|^{2+\varepsilon}<\infty for some \varepsilon>0 in view of the Berry-Esseen theorem (for instance, see the book by Petrov, 1975). At the same time, the so-called median of means (MOM) estimator, originally introduced by Nemirovski and Yudin (1983); Alon et al. (1996); Jerrum et al. (1986) and studied recently in relation to the problem at hand, satisfies inequality (1) with t_{\max} of order N and L=24e (Lerasle and Oliveira, 2011), although the latter constant can be improved. A large body of existing work used the MOM estimator as a core subroutine to relax the underlying assumptions for a variety of statistical problems, in particular methods based on empirical risk minimization; we refer the reader to the excellent survey paper by Lugosi and Mendelson (2019) for a detailed overview of the recent advances.

The exact value of the constant L in inequality (1) is less important in problems where only the minimax rates are of interest, but it becomes crucial for the practical value and sample efficiency of the algorithms. The benchmark here is the situation when the observations are normally distributed: Catoni (2012) showed that no estimator can outperform the sample mean in this situation. The latter satisfies the relation

\mathbb{P}\left(\left|\bar{X}_{N}-\mu\right|\geq\sigma\frac{\Phi^{-1}(1-e^{-t/2})}{\sqrt{N}}\right)=2e^{-\frac{t}{2}}

where \Phi^{-1}(\cdot) denotes the quantile function of the standard normal law. As \Phi^{-1}(1-e^{-t/2})=(1+o(1))\sqrt{t} as t\to\infty, the best guarantee of the form (1) one can hope for is attained for L=2. It is therefore natural to ask whether there exist sharp sub-Gaussian estimators of the mean, that is, estimators satisfying (1) with L=2(1+o(1)) where o(1) is a sequence that converges to 0 as N\to\infty, under minimal assumptions on the underlying distribution. This question was previously posed by Devroye et al. (2016) as an open problem, and several results have appeared since then that give partial answers. We proceed with a brief review of the state of the art.

1.1 Overview of the existing results.

Catoni (2012) presented the first known example of a sharp sub-Gaussian estimator with t_{\max}=o(N/\kappa) for distributions with finite fourth moment and a known upper bound on the kurtosis \kappa (or, alternatively, for distributions with finite but known variance). Devroye et al. (2016) introduced an alternative estimator that also required a finite fourth moment but did not explicitly depend on the value of the kurtosis as an input, while satisfying the required guarantees for t_{\max}=o\left((N/\kappa)^{2/3}\right). Minsker and Ndaoud (2021) designed an asymptotically efficient sub-Gaussian estimator \widetilde{\mu}_{N} that satisfies \sqrt{N}\left(\widetilde{\mu}_{N}-\mu\right)\xrightarrow{d}N(0,\sigma^{2}) assuming only a finite second moment plus a mild, "small-ball" type condition. However, the constants in the non-asymptotic version of their bounds were not sharp. Finally, Lee and Valiant (2020) constructed an estimator with the required properties assuming just a finite second moment; however, their guarantees hold with optimal constants only for t_{\min}\leq t\leq t_{\max} where t_{\max}=o(N) and t_{\min}\to\infty as N\to\infty. In particular, this range excludes t in the neighborhood of 0, which is often the region of most practical interest.

1.2 Summary of the main contributions.

The reasons for the popularity of the MOM estimator are plentiful: it is simple to define and to compute, it admits strong theoretical guarantees, and it is scale-invariant and therefore essentially tuning-free. Thus, we believe that any quantifiable improvements to its performance are worth investigating.

We start by showing that the standard MOM estimator achieves bound (1) with L=\pi(1+o(1)) where o(1)\to 0 as N\to\infty; this fact is formally stated in Theorem 2.1. We then define a permutation-invariant version of MOM, denoted \widehat{\mu}_{N}, and show in Corollary 3.1 that, surprisingly, it is asymptotically optimal in the sense that \sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\xrightarrow{d}N(0,\sigma^{2}) under minimal assumptions; compare this to the standard MOM estimator, which has limiting variance \frac{\pi}{2}\sigma^{2}. The main result of the paper, Theorem 5.1, demonstrates that the optimality of \widehat{\mu}_{N} holds in a stronger sense, namely, that inequality (1) is valid for a wide range of confidence parameters assuming that the distribution of X_{1} possesses q moments for some possibly unknown q>\frac{3+\sqrt{5}}{2}\approx 2.62 and that its characteristic function satisfies a mild decay bound.

Analysis of the estimator \widehat{\mu}_{N} requires new inequalities for U-statistics of order that grows with the sample size. A detailed discussion and comparison with existing bounds is given in section 4. In particular, we prove novel bounds for large deviations of the degenerate, higher-order terms of the Hoeffding decomposition (Theorem 4.1), and deduce sub-Gaussian deviation guarantees for non-degenerate U-statistics (Corollary 4.1) with the "correct" sub-Gaussian parameter. These bounds could be of independent interest.

1.3 Notation.

Unspecified absolute constants will be denoted C,c,C_{1},c^{\prime}, etc., and may take different values in different parts of the paper. Given a,b\in\mathbb{R}, we will write a\wedge b for \min(a,b) and a\vee b for \max(a,b). For a positive integer M, [M] denotes the set \{1,\ldots,M\}.

We will frequently use the standard big-O and small-o notation for asymptotic relations between functions and sequences. Moreover, given two sequences \{a_{n}\}_{n\geq 1} and \{b_{n}\}_{n\geq 1} where b_{n}\neq 0 for all n, we will write a_{n}\ll b_{n} if \frac{a_{n}}{b_{n}}=o(1) as n\to\infty. Note that o(1) may denote different functions/sequences from line to line.

For a function f:\mathbb{R}\mapsto\mathbb{R}, f^{(m)} will denote its m-th derivative whenever it exists. Similarly, given g:\mathbb{R}^{d}\mapsto\mathbb{R}, \partial_{x_{j}}g(x_{1},\ldots,x_{d}) will stand for the partial derivative of g with respect to the j-th variable. Finally, the sup-norm of g is defined via \|g\|_{\infty}:=\mathrm{ess\,sup}\{|g(y)|:\,y\in\mathbb{R}^{d}\}, and the convolution of f and g is denoted f\ast g.

Given i.i.d. random variables X_{1},\ldots,X_{N} distributed according to P, P_{N}:=\frac{1}{N}\sum_{j=1}^{N}\delta_{X_{j}} will stand for the associated empirical measure, where \delta_{X}(f):=f(X). For a real-valued function f and a signed measure Q, we will write Qf for \int f\,dQ, assuming that the last integral is well-defined. Additional notation and auxiliary results will be introduced on demand.

2 Optimal constants for the median of means estimator.

Recall that we are given an i.i.d. sample X_{1},\ldots,X_{N} from a distribution P with mean \mu and variance \sigma^{2}. The median of means estimator of \mu is constructed as follows: let G_{1}\cup\ldots\cup G_{k}\subseteq[N] be an arbitrary (possibly random but independent of the data) collection of k\leq N/2 disjoint subsets ("blocks") of cardinality \lfloor N/k\rfloor each, \bar{X}_{j}:=\frac{1}{|G_{j}|}\sum_{i\in G_{j}}X_{i}, and

\widehat{\mu}_{\mathrm{MOM}}=\mathrm{med}\left(\bar{X}_{1},\ldots,\bar{X}_{k}\right).
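
For concreteness, the construction translates into a few lines of code. The following is a minimal sketch (Python with NumPy; the function name and the randomized block assignment are our own illustrative choices, not part of the paper):

    import numpy as np

    def median_of_means(x, k, seed=None):
        # Split the sample into k disjoint blocks of size floor(N/k), average
        # within each block, and return the median of the block means. The
        # (random) block assignment is independent of the data, as required.
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        m = len(x) // k                         # block size floor(N/k)
        idx = rng.permutation(len(x))[:k * m]   # discard the remainder
        return np.median(x[idx].reshape(k, m).mean(axis=1))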

It is known (e.g. Lerasle and Oliveira, 2011; Devroye et al., 2016) that \widehat{\mu}_{\mathrm{MOM}} satisfies inequality (1) for t=k and L=8e^{2}. This value of L appears to be overly pessimistic, however: it follows from Theorem 5 in (Minsker, 2019) that if k\to\infty sufficiently slowly so that the bias of \widehat{\mu}_{\mathrm{MOM}} is of order o(N^{-1/2}), then

\sqrt{N}(\widehat{\mu}_{\mathrm{MOM}}-\mu)\xrightarrow{d}N\left(0,\frac{\pi}{2}\sigma^{2}\right) (2)

as k,N/k\to\infty. In particular, if \mathbb{E}|X|^{2+\delta}<\infty for some 0<\delta\leq 1, then k=o\left(N^{\delta/(1+\delta)}\right) suffices for the asymptotic unbiasedness and asymptotic normality to hold. The asymptotic relation (2) suggests that the best value of the constant L in the deviation inequality (1) for the estimator \widehat{\mu}_{\mathrm{MOM}} is \pi+o(1). We will demonstrate that this is indeed the case. Denote

g(m):=\frac{1}{\sqrt{m}}\mathbb{E}\left[\left(\frac{X_{1}-\mu}{\sigma}\right)^{2}\min\left(\left|\frac{X_{1}-\mu}{\sigma}\right|,\sqrt{m}\right)\right]. (3)

Clearly, g(m)\to 0 as m\to\infty for distributions with finite variance. Feller (1968) proved that \sup_{t\in\mathbb{R}}\left|\Phi_{m}(t)-\Phi(t)\right|\leq 6g(m) where \Phi_{m} and \Phi are the distribution functions of \frac{\sum_{j=1}^{m}(X_{j}-\mu)}{\sigma\sqrt{m}} and the standard normal law respectively. It is well known that g(m)\leq C\,\mathbb{E}\left|\frac{X_{1}-\mu}{\sigma}\right|^{q}m^{-(q-2)/2} whenever \mathbb{E}|X_{1}-\mu|^{q}<\infty for some q\in(2,3]. The next result can be viewed as a non-asymptotic analogue of relation (2).
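
Although g(m) rarely admits a closed form, it is easy to approximate numerically, which is useful for gauging the range of t in the results below. The following Monte Carlo sketch (our own illustration; the sampler and function names are not from the paper) estimates g(m) for a standardized Student's t distribution:

    import numpy as np

    def g_monte_carlo(standardized_sampler, m, n_sims=10**6, seed=0):
        # Monte Carlo estimate of g(m) from (3); standardized_sampler(rng, size)
        # must return draws of (X - mu)/sigma.
        rng = np.random.default_rng(seed)
        z = standardized_sampler(rng, n_sims)
        return np.mean(z**2 * np.minimum(np.abs(z), np.sqrt(m))) / np.sqrt(m)

    # Student's t with 3 degrees of freedom, standardized (its variance is 3)
    t3 = lambda rng, size: rng.standard_t(3, size) / np.sqrt(3.0)
    print([round(g_monte_carlo(t3, m), 4) for m in (10, 100, 1000)])  # decays to 0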

Theorem 2.1.

The following bound holds:

\mathbb{P}\left(\left|\sqrt{N}(\widehat{\mu}_{\mathrm{MOM}}-\mu)\right|\geq\sigma\sqrt{t}\right)\leq 2\operatorname{exp}\left(-\frac{t}{\pi}(1+o(1))\right). (4)

Here, o(1) is a function that goes to 0 as k,N/k\to\infty, uniformly over t\in\left[l_{k,N},u_{k,N}\right] for any sequences l_{k,N}\gg k\,g^{2}(N/k) and u_{k,N}\ll k.

Remark 1.
  1.

    Note that the bound of the theorem holds in some range of the confidence parameter (such estimators are often called "multiple-\delta" in the literature, e.g., see Devroye et al. (2016)); however, this range is distribution-dependent. In particular, if \sqrt{k}\,g(N/k)\to 0 as k,N\to\infty, the previous bound holds in the range 1\leq t\ll k, but the function g(\cdot) depends on P and may converge to 0 arbitrarily slowly. Under additional assumptions, more concrete bounds can be deduced: for instance, if \mathbb{E}|X/\sigma|^{2+\varepsilon}<\infty for some 0<\varepsilon\leq 1, the condition \sqrt{k}\,g(N/k)\to 0 is satisfied if k=o\left(N^{\frac{\varepsilon}{1+\varepsilon}}\right) as N\to\infty. In general, by choosing k appropriately, we can construct a version of the median of means estimator that satisfies the required guarantees for any 1\leq t\ll N.

  2.

    The exact expression for the function o(1) appearing in the statement of Theorem 2.1, as well as in other results of the paper (e.g. Theorem 5.1), is not made explicit. We remark that it depends on the distribution of X_{1} through the function g(\cdot) defined in (3), and on the ratios \frac{kg^{2}(N/k)}{l_{k,N}} and \frac{u_{k,N}}{k}.

Proof of Theorem 2.1.

As \widehat{\mu}_{\mathrm{MOM}} is scale-invariant, we can assume without loss of generality that \sigma^{2}=1. Denote m=\lfloor N/k\rfloor for brevity, let \rho(x)=|x|, and note that an equivalent characterization of \widehat{\mu}_{\mathrm{MOM}} is

\widehat{\mu}_{\mathrm{MOM}}\in\operatorname{argmin}_{z\in\mathbb{R}}\sum_{j=1}^{k}\rho\left(\sqrt{m}\left(\bar{X}_{j}-z\right)\right).

The necessary conditions for the minimum of F(z):=\sum_{j=1}^{k}\rho\left(\sqrt{m}\left(\bar{X}_{j}-z\right)\right) imply that 0\in\partial F(\widehat{\mu}_{\mathrm{MOM}}) – the subgradient of F – hence the left derivative F^{\prime}_{-}(\widehat{\mu}_{\mathrm{MOM}})\leq 0. Therefore, if \sqrt{N}\left(\widehat{\mu}_{\mathrm{MOM}}-\mu\right)\geq\sqrt{t} for some t>0, then \widehat{\mu}_{\mathrm{MOM}}\geq\mu+\sqrt{t/N} and, due to F^{\prime}_{-} being nondecreasing, F^{\prime}_{-}\left(\mu+\sqrt{t/N}\right)\leq 0. This implies that

\mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{\mathrm{MOM}}-\mu)\geq\sqrt{t}\right)\leq\mathbb{P}\left(\sum_{j=1}^{k}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right)\geq 0\right)\\ =\mathbb{P}\left(\frac{1}{\sqrt{k}}\sum_{j=1}^{k}\left(\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right)-\mathbb{E}\rho^{\prime}_{-}\right)\geq-\sqrt{k}\,\mathbb{E}\rho^{\prime}_{-}\right) (5)

where we used the shortcut \mathbb{E}\rho^{\prime}_{-} in place of \mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right). Note that

-\sqrt{k}\,\mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right)=-\sqrt{k}\left(1-2\mathbb{P}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\leq 0\right)\right)\\ =2\sqrt{k}\left(\Phi\left(\frac{\sqrt{t}}{\sqrt{k}}\right)-\Phi(0)\right)-2\sqrt{k}\left(\Phi\left(\frac{\sqrt{t}}{\sqrt{k}}\right)-\mathbb{P}\left(\sqrt{m}\left(\bar{X}_{j}-\mu\right)\leq\frac{\sqrt{t}}{\sqrt{k}}\right)\right)\\ \leq 2\sqrt{k}\cdot g(m)+2\sqrt{t}\,\frac{1}{\sqrt{t}/\sqrt{N/m}}\left(\Phi\left(\frac{\sqrt{t}}{\sqrt{N/m}}\right)-\Phi(0)\right). (6)

Since

2\sqrt{t}\,\frac{1}{\sqrt{t}/\sqrt{N/m}}\left(\Phi\left(\frac{\sqrt{t}}{\sqrt{N/m}}\right)-\Phi(0)\right)=2\sqrt{t}\left(\phi(0)+O(t/\sqrt{N/m})\right)\\ =\sqrt{t}\left(\sqrt{\frac{2}{\pi}}+O(t/\sqrt{N/m})\right) (7)

where \phi(t)=\Phi^{\prime}(t), we see that

-\sqrt{k}\,\mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right)\leq 2\sqrt{k}\cdot g(m)+\sqrt{t}\left(\sqrt{\frac{2}{\pi}}+O(\sqrt{t/k})\right)

which is \sqrt{t}\sqrt{\frac{2}{\pi}}\left(1+o(1)\right) whenever t\ll k and t\gg k\,g^{2}(m). It remains to apply Bernstein's inequality to the right-hand side of (5). Observe that

\mathrm{Var}\left(\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right)\right)=4\mathrm{Var}\left(I\left\{\sqrt{m}\left(\bar{X}_{j}-\mu\right)\leq\sqrt{t/k}\right\}\right)\\ =4\mathbb{P}\left(\sqrt{m}\left(\bar{X}_{j}-\mu\right)\leq\sqrt{t/k}\right)\left(1-\mathbb{P}\left(\sqrt{m}\left(\bar{X}_{j}-\mu\right)\leq\sqrt{t/k}\right)\right)\leq 1,

therefore

\mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{\mathrm{MOM}}-\mu)\geq\sqrt{t}\right)\leq\operatorname{exp}\left(-\frac{t}{\pi(1+o(1))+\frac{2\sqrt{t}\sqrt{2\pi}}{3}\frac{1}{\sqrt{k}}\left(1+o(1)\right)}\right)\\ =\operatorname{exp}\left(-\frac{t}{\pi}(1+o(1))\right)

whenever \sqrt{k}\,g(m)\ll\sqrt{t}\ll\sqrt{k}. Similar reasoning gives a matching bound for \mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{\mathrm{MOM}}-\mu)\leq-\sqrt{t}\right), and the result follows. ∎

One may ask whether the median of means estimator admits a more sample-efficient modification, one that would satisfy inequality (1) with a constant L smaller than \pi. A natural idea is to require that the estimator be invariant with respect to permutations of the data or, equivalently, be a function of the order statistics only. Such an extension of the MOM estimator was proposed by Minsker (2019); however, no provable improvements over the performance of the standard MOM estimator were established rigorously. The question of such improvements, especially of guarantees expressed in the form (1), is addressed next. Let us recall the proposed construction. Assume that 2\leq m<N and, given J\subseteq[N] of cardinality |J|=m, set \bar{X}_{J}:=\frac{1}{m}\sum_{j\in J}X_{j}. Define \mathcal{A}_{N}^{(m)}=\left\{J\subset[N]:\ |J|=m\right\} and

\widehat{\mu}_{N}:=\mathrm{med}\left(\bar{X}_{J},\ J\in\mathcal{A}_{N}^{(m)}\right), (8)

where \left\{\bar{X}_{J},\ J\in\mathcal{A}_{N}^{(m)}\right\} denotes the set of sample averages computed over all possible subsets of [N] of cardinality m; in particular, unlike the standard median-of-means estimator, \widehat{\mu}_{N} is uniquely defined. Note that for m=2, \widehat{\mu}_{N} coincides with the well known Hodges-Lehmann estimator of location (Hodges and Lehmann, 1963). When m is a fixed integer greater than 2, \widehat{\mu}_{N} is known as the generalized Hodges-Lehmann estimator. Its asymptotic properties are well understood and can be deduced from results by Serfling (1984), among other works. For example, its breakdown point is 1-(1/2)^{1/m} and, in the case of normally distributed data, the asymptotic distribution of \sqrt{N}(\widehat{\mu}_{N}-\mu) is centered normal with variance \Delta_{m}^{2}=m\sigma^{2}\arctan\left(\frac{1}{\sqrt{m^{2}-1}}\right). In particular, \Delta_{m}^{2}=\sigma^{2}(1+o(1)) as m\to\infty. When the underlying distribution is not symmetric, however, \widehat{\mu}_{N} is biased for the mean, and the properties of this estimator in the regime m\to\infty have not been investigated in the robust statistics literature (to the best of our knowledge). Only very recently, DiCiccio and Romano (2022) proved that whenever m\to\infty, m=o(\sqrt{N}) and the sample is normally distributed, \sqrt{N}(\widehat{\mu}_{N}-\mu)\to N(0,\sigma^{2}). We will extend this result in several directions: first, by allowing a much wider class of underlying distributions; second, by including the case \sqrt{N}\ll m\ll N, which is interesting as \mathrm{bias}\left(\widehat{\mu}_{N}\right) is o\left(N^{-1/2}\right) in this regime; and finally, by presenting sharp sub-Gaussian deviation inequalities for \widehat{\mu}_{N} that hold for heavy-tailed data.
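
A minimal computational sketch of the estimator (8) follows (Python; both function names are our own). The exact version enumerates all \binom{N}{m} subsets and is feasible only for small N, while the subsampled variant — an incomplete U-quantile, which is our own illustrative shortcut and is not analyzed in the paper — scales to realistic sample sizes:

    import itertools
    import numpy as np

    def mu_hat_exact(x, m):
        # Median of the averages over ALL subsets of cardinality m; the number
        # of subsets is binom(N, m), so this is only usable for small N.
        x = np.asarray(x, dtype=float)
        return np.median([np.mean(c) for c in itertools.combinations(x, m)])

    def mu_hat_sampled(x, m, n_subsets=10**5, seed=0):
        # Approximation via randomly drawn m-subsets (Monte Carlo shortcut).
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        means = [x[rng.choice(len(x), size=m, replace=False)].mean()
                 for _ in range(n_subsets)]
        return np.median(means)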

Let us remark that the argument behind Theorem 2.1, combined with a version of Bernstein's inequality for U-statistics due to Hoeffding (1963), immediately implies that \widehat{\mu}_{N} satisfies relation (4). Similar reasoning applies to other deviation guarantees for the classical median of means estimator that exist in the literature, so in this sense \widehat{\mu}_{N} always performs at least as well as \widehat{\mu}_{\mathrm{MOM}}.

Analysis of the estimator \widehat{\mu}_{N} is most naturally carried out using the language of U-statistics. The following section introduces the necessary background, while additional useful facts are summarized in section 7.1.

3 Asymptotic normality of U-statistics and the implications for \widehat{\mu}_{N}.

Let Y_{1},\ldots,Y_{N} be i.i.d. random variables with distribution P_{Y} and assume that h_{m}:\mathbb{R}^{m}\mapsto\mathbb{R},\ m\geq 1, are square-integrable with respect to P_{Y}^{m} and permutation-symmetric functions, meaning that \mathbb{E}h_{m}^{2}(Y_{1},\ldots,Y_{m})<\infty and h_{m}(x_{\pi(1)},\ldots,x_{\pi(m)})=h_{m}(x_{1},\ldots,x_{m}) for any x_{1},\ldots,x_{m}\in\mathbb{R} and any permutation \pi:[m]\mapsto[m]. Without loss of generality, we will also assume that \mathbb{E}h_{m}:=\mathbb{E}h_{m}(Y_{1},\ldots,Y_{m})=0. Recall that \mathcal{A}_{N}^{(m)}=\left\{J\subseteq[N]:\ |J|=m\right\}. The U-statistic with kernel h_{m} is defined as

U_{N,m}=\frac{1}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}h_{m}(Y_{i},\ i\in J). (9)
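
Definition (9) translates directly into code; the following brute-force sketch (our own, useful only for sanity checks with small N) makes the averaging over subsets explicit:

    import itertools
    import numpy as np

    def u_statistic(y, h, m):
        # Average a symmetric kernel h over all m-subsets of the sample;
        # costs binom(N, m) kernel evaluations, so use small N only.
        y = np.asarray(y, dtype=float)
        return np.mean([h(np.array(c)) for c in itertools.combinations(y, m)])

    # Sanity check: with h = mean, the U-statistic reduces to the sample mean.
    y = np.random.default_rng(0).normal(size=12)
    print(u_statistic(y, np.mean, 3), y.mean())  # agree up to float error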

For i\in[N], let

h_{m}^{(1)}(Y_{i})=\mathbb{E}\left[h_{m}(Y_{1},\ldots,Y_{m})\,|\,Y_{i}\right]. (10)

We will assume that \mathbb{P}\left(h_{m}^{(1)}(Y_{1})\neq 0\right)>0 for all m, meaning that the kernels h_{m} are non-degenerate. The random variable

S_{N,m}:=\sum_{j=1}^{N}\mathbb{E}\left[U_{N,m}\,|\,Y_{j}\right]=\frac{m}{N}\sum_{j=1}^{N}h_{m}^{(1)}(Y_{j}),

known as the Hájek projection of U_{N,m}, is essentially the best approximation of U_{N,m} by a sum of i.i.d. random variables of the form f(Y_{1})+\ldots+f(Y_{N}). We are interested in sufficient conditions guaranteeing that \frac{U_{N,m}-S_{N,m}}{\sqrt{\mathrm{Var}(S_{N,m})}}=o_{P}(1) as N,m\to\infty. Such an asymptotic relation immediately implies that the limiting behavior of U_{N,m} is determined by the Hájek projection S_{N,m}. Results of this type for U-statistics of fixed order m are standard and well known (Hoeffding, 1948; Serfling, 2009; Lee, 2019). However, we are interested in the situation when m is allowed to grow with N, possibly up to the order m=o(N). U-statistics of growing order were studied, for example, by Frees (1989); however, existing results are not readily applicable in our framework. Very recently, such U-statistics have been investigated in relation to the performance of Breiman's random forests algorithm (e.g. see the papers by Song et al. (2019) and Peng et al. (2022)). The following theorem is essentially due to Peng et al. (2022); we give a different proof of this fact in Appendix 7.2, as we rely on parts of the argument elsewhere in the paper.

Theorem 3.1.

Assume that \frac{\mathrm{Var}\left(h_{m}(Y_{1},\ldots,Y_{m})\right)}{\mathrm{Var}\left(h_{m}^{(1)}(Y_{1})\right)}=o(N) as N,m\to\infty. (It is well known (Hoeffding, 1948) that \mathrm{Var}\left(h_{m}^{(1)}(Y_{1})\right)\leq\frac{\mathrm{Var}(h_{m})}{m}, therefore the condition imposed on the ratio of variances implies that m=o(N).) Then \frac{U_{N,m}-S_{N,m}}{\sqrt{\mathrm{Var}(S_{N,m})}}=o_{P}(1) as N,m\to\infty.

It is easy to see that the asymptotic normality of \frac{U_{N,m}}{\sqrt{\mathrm{Var}(S_{N,m})}} immediately follows from the previous theorem whenever its assumptions are satisfied. Next, we will apply this result to establish the asymptotic normality of the estimator \widehat{\mu}_{N} defined via (8).

Corollary 3.1.

Let X_{1},\ldots,X_{N} be i.i.d. with finite variance \sigma^{2}. Moreover, assume that \sqrt{\frac{N}{m}}\,g(m)\to 0 as N/m and m\to\infty. Then

\sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\xrightarrow{d}\mathcal{N}(0,\sigma^{2})

as N/m and m\to\infty.

Remark 2.

The requirement \sqrt{\frac{N}{m}}\,g(m)\to 0 guarantees that \mathrm{bias}(\widehat{\mu}_{N})=o(N^{-1/2}). Without this requirement, asymptotic normality can be established for the debiased estimator \widehat{\mu}_{N}-\mathbb{E}\widehat{\mu}_{N}.

Proof.

Let \rho(x)=|x| and note that an equivalent characterization of \widehat{\mu}_{N} is

\widehat{\mu}_{N}\in\operatorname{argmin}_{z\in\mathbb{R}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\rho\left(\sqrt{m}\left(\bar{X}_{J}-z\right)\right).

The necessary conditions for the minimum of this problem imply, as in (5), that for any fixed t\geq 0,

\mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{N}-\mu)\geq t\right)\leq\mathbb{P}\left(\sum_{J\in\mathcal{A}_{N}^{(m)}}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)\geq 0\right). (11)

Therefore, it suffices to show that the upper and lower bounds for \mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{N}-\mu)\geq t\right) converge to the same limit. To this end, we see that

\mathbb{P}\left(\sum_{J\in\mathcal{A}_{N}^{(m)}}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)\geq 0\right)\\ =\mathbb{P}\left(\frac{\sqrt{N/m}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)-\mathbb{E}\rho^{\prime}_{-}\right)\geq-\sqrt{N/m}\,\mathbb{E}\rho^{\prime}_{-}\right), (12)

where \mathbb{E}\rho^{\prime}_{-} stands for \mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right). As in the proof of Theorem 2.1, we deduce that -\sqrt{N/m}\,\mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)\to\frac{t}{\sigma}\sqrt{\frac{2}{\pi}} whenever \sqrt{N/m}\,g(m)\to 0 and N/m\to\infty. It remains to analyze the U-statistic

\sqrt{\frac{N}{m}}\,U_{N,m}=\frac{\sqrt{N/m}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)-\mathbb{E}\rho^{\prime}_{-}\right).

As the expression above is invariant with respect to the shift X_{j}\mapsto X_{j}-\mu, we can assume that \mu=0. To complete the proof, we will verify the conditions of Theorem 3.1, allowing one to reduce the asymptotic behavior of U_{N,m} to the analysis of sums of i.i.d. random variables. For i\in[N], let

h^{(1)}(X_{i})=\sqrt{\frac{N}{m}}\,\mathbb{E}\left[\rho^{\prime}_{-}\left(\frac{1}{\sqrt{m}}\sum_{j=1}^{m-1}\tilde{X}_{j}+\frac{X_{i}}{\sqrt{m}}-t/\sqrt{N/m}\right)\,\big|\,X_{i}\right]-\sqrt{\frac{N}{m}}\,\mathbb{E}\rho^{\prime}_{-},

where (\tilde{X}_{1},\ldots,\tilde{X}_{m}) is an independent copy of (X_{1},\ldots,X_{m}). Our goal is to understand the size of \mathrm{Var}(h^{(1)}(X_{1})): specifically, we will show that \mathrm{Var}\left(\frac{m}{\sqrt{N}}h^{(1)}(X_{1})\right)\to\frac{2}{\pi} as both m and N/m\to\infty. Given an integer l\geq 1, let \widetilde{\Phi}_{l}(t) be the cumulative distribution function of \sum_{j=1}^{l}X_{j}. Then

\frac{m}{\sqrt{N}}h^{(1)}(X_{1})=\sqrt{m}\left(2\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-X_{1}\right)-1\right)-\sqrt{m}\,\mathbb{E}\rho^{\prime}_{-}\\ =2\sqrt{m}\left(\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-X_{1}\right)-\mathbb{E}\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-X_{1}\right)\right)\\ =2\sqrt{m}\int_{\mathbb{R}}\left(\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-X_{1}\right)-\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x\right)\right)dP(x).

We will apply the dominated convergence theorem to analyze this expression. Consider first the situation when the distribution of X_{1} is non-lattice. (We say that X_{1} has a lattice distribution if \mathbb{P}\left(X_{1}\in\alpha+k\beta,\ k\in\mathbb{Z}\right)=1 and there is no arithmetic progression A\subset\mathbb{Z} such that \mathbb{P}\left(X_{1}\in\alpha+k\beta,\ k\in A\right)=1.) Then the local limit theorem for non-lattice distributions (Shepp, 1964, Theorem 2) implies that

\widetilde{\Phi}_{m-1}\left(a+h\right)-\widetilde{\Phi}_{m-1}\left(a\right)=\frac{h}{\sqrt{2\pi(m-1)}\,\sigma}\operatorname{exp}\left(-\frac{a^{2}}{2(m-1)\sigma^{2}}\right)+o(m^{-1/2}),

where \sqrt{m}\cdot o(m^{-1/2}) converges to 0 as m\to\infty for every h and uniformly in a. Therefore, we see that conditionally on X_{1} and for every x,

\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x+(x-X_{1})\right)-\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x\right)\\ =\frac{x-X_{1}}{\sqrt{2\pi(m-1)}\,\sigma}\operatorname{exp}\left(-(tm/\sqrt{N}-x)^{2}/2(m-1)\sigma^{2}\right)+o(m^{-1/2}) (13)

uniformly in m. Since m=o(N) by assumption, \operatorname{exp}\left(-(tm/\sqrt{N}-x)^{2}/2(m-1)\sigma^{2}\right)=1+o(1) as m,N\to\infty, hence

2\sqrt{m}\left(\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x+(x-X_{1})\right)-\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x\right)\right)=2\,\frac{x-X_{1}}{\sqrt{2\pi}\,\sigma}+o(1)

P-almost everywhere. Next, we will show that q_{m}(x,X_{1}):=\sqrt{m}\left(\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-X_{1}\right)-\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x\right)\right) admits an integrable majorant that does not depend on m. Note that

|q_{m}(x,X_{1})|\leq\sup_{z}\sqrt{m}\,\mathbb{P}\left(\sum_{j=1}^{m-1}X_{j}\in\big{(}z,z+|x-X_{1}|\big{]}\right)\leq C|x-X_{1}|,

where the last inequality follows from the well known bound for the concentration function (Theorem 2.20 in the book by Petrov, 1995); here, C=C(P)>0 is a constant that may depend on the distribution of X_{1}. We conclude that by the dominated convergence theorem,

\frac{m}{\sqrt{N}}h^{(1)}(X_{1})\to\sqrt{\frac{2}{\pi}}\frac{X_{1}}{\sigma}

as m,N/m\to\infty, P-almost everywhere. As

\left|\frac{m}{\sqrt{N}}h^{(1)}(X_{1})\right|\leq 2\left|\int_{\mathbb{R}}q_{m}(x,X_{1})\,dP(x)\right|\leq C\int_{\mathbb{R}}|x-X_{1}|\,dP(x)

and \mathbb{E}\left(\int_{\mathbb{R}}|x-X_{1}|\,dP(x)\right)^{2}<\infty, a second application of the dominated convergence theorem yields that \mathrm{Var}\left(\frac{m}{\sqrt{N}}h^{(1)}(X_{1})\right)\to\mathrm{Var}\left(\sqrt{\frac{2}{\pi}}\frac{X_{1}}{\sigma}\right)=\frac{2}{\pi} as N/m and m\to\infty.

It remains to consider the case when X_{1} has a lattice distribution. In this case, a version of the local limit theorem (Petrov, 1995) states that

\mathbb{P}\left(\sum_{j=1}^{m-1}X_{j}=(m-1)\alpha+q\beta\right)=\frac{\beta}{\sqrt{2\pi(m-1)}\,\sigma}e^{-\frac{((m-1)\alpha+q\beta)^{2}}{2\sigma^{2}(m-1)}}+o(m^{-1/2})

where the o(m^{-1/2}) term is uniform in q\in\mathbb{Z}. For any y in the interval \big{(}\frac{tm}{\sqrt{N}}-x,\frac{tm}{\sqrt{N}}-x+(x-X_{1})\big{]} of the form y=(m-1)\alpha+q\beta, we have that e^{-\frac{y^{2}}{2\sigma^{2}(m-1)}}=1+o(1) as \frac{m}{N}\to 0. Therefore, similarly to (13), in this case

2\sqrt{m}\left(\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x+(x-X_{1})\right)-\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x\right)\right)=2\,\frac{x-X_{1}}{\sqrt{2\pi}\,\sigma}+o(1)

P-almost everywhere, where we also used the fact that the number of points of the form (m-1)\alpha+q\beta in the interval of interest equals \frac{x-X_{1}}{\beta}. The rest of the proof proceeds exactly as in the case of non-lattice distributions, and this concludes the part of the argument related to \mathrm{Var}\left(\frac{m}{\sqrt{N}}h^{(1)}(X_{1})\right).

To finish the proof, note that, since \|\rho^{\prime}_{-}\|_{\infty}=1, \mathrm{Var}\left(\sqrt{N/m}\,\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)\right)\leq\frac{N}{m}, hence

\frac{\mathrm{Var}\left(\sqrt{N/m}\,\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)\right)}{\mathrm{Var}\left(h^{(1)}(X_{1})\right)}\leq\frac{N/m}{\frac{2}{\pi}(1+o(1))N/m^{2}}=\frac{m}{\frac{2}{\pi}(1+o(1))}=o(N)

as m\to\infty and N/m\to\infty. Therefore, Theorem 3.1 applies and yields that

\frac{\sqrt{\frac{N}{m}}U_{N,m}-\frac{m}{N}\sum_{j=1}^{N}h^{(1)}(X_{j})}{\sqrt{\frac{m^{2}}{N}\mathrm{Var}\left(h^{(1)}(X_{1})\right)}}=o_{P}(1),

where \frac{m^{2}}{N}\mathrm{Var}\left(h^{(1)}(X_{1})\right)=\frac{2}{\pi}(1+o(1)). In view of the central limit theorem, \frac{m}{N}\sum_{j=1}^{N}h^{(1)}(X_{j})\xrightarrow{d}N\left(0,\frac{2}{\pi}\right), and we conclude that \sqrt{\frac{N}{m}}U_{N,m}\xrightarrow{d}N\left(0,\frac{2}{\pi}\right). Recalling (12), we see that

\mathbb{P}\left(\sqrt{\frac{N}{m}}U_{N,m}\geq-\sqrt{\frac{N}{m}}\,\mathbb{E}\rho^{\prime}_{-}\right)\to 1-\Phi\left(\frac{t}{\sigma}\right),

or \limsup\limits_{m,N/m\to\infty}\mathbb{P}\left(\sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\geq t\right)\leq 1-\Phi\left(\frac{t}{\sigma}\right). Repeating the preceding argument for the lower bound for \mathbb{P}\left(\sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\geq t\right), we get that \liminf\limits_{m,N/m\to\infty}\mathbb{P}\left(\sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\geq t\right)\geq 1-\Phi\left(\frac{t}{\sigma}\right), whence the claim of the corollary follows. ∎

Corollary 3.1 implies that, asymptotically, the estimator \widehat{\mu}_{N} improves upon \widehat{\mu}_{\mathrm{MOM}}. The more interesting, and more difficult, question is whether non-asymptotic sub-Gaussian deviation bounds for \widehat{\mu}_{N} with an improved constant can be established, and what the range of the deviation parameter is in which such bounds are valid.
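
A quick Monte Carlo experiment makes the comparison concrete; this is entirely our own illustration, with arbitrary sample size, block size and distribution. By (2), N\cdot\mathrm{Var}(\widehat{\mu}_{\mathrm{MOM}}) should be close to \frac{\pi}{2}\sigma^{2}, while by Corollary 3.1, N\cdot\mathrm{Var}(\widehat{\mu}_{N}) should be close to \sigma^{2}. Reusing the sketches median_of_means and mu_hat_sampled from above:

    import numpy as np

    rng = np.random.default_rng(1)
    N, k, m, reps = 2000, 40, 50, 500       # block size N/k = 50 matches m
    mom, pim = [], []
    for _ in range(reps):
        x = rng.exponential(size=N)         # asymmetric, mean 1, variance 1
        mom.append(median_of_means(x, k, seed=rng))
        pim.append(mu_hat_sampled(x, m, n_subsets=2000, seed=rng))
    print(N * np.var(mom))  # roughly pi/2 ~ 1.57
    print(N * np.var(pim))  # roughly 1 (slightly inflated by subset sampling)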

4 Deviation inequalities for U-statistics of growing order.

The ultimate goal of this section is to establish a non-asymptotic analogue of Corollary 3.1. Recall that its proof relied on the classical strategy of showing that the higher-order terms in the Hoeffding decomposition of certain U-statistics are asymptotically negligible. To prove the desired non-asymptotic extension, one has to be able to show that these higher-order terms are sufficiently small with exponentially high probability. However, the classical tools used to prove such bounds rely on decoupling inequalities due to de la Pena and Montgomery-Smith (1995). Unfortunately, the constants appearing in decoupling inequalities grow very fast with respect to the order m of the U-statistic, at least like m^{m}. As m is allowed to grow with the sample size N in our examples, such tools become insufficient to get the desired bounds in our framework. Arcones (1995) derived an improved version of Bernstein's inequality for non-degenerate U-statistics where the sub-Gaussian deviation regime is controlled by m\,\mathrm{Var}(h^{(1)}_{m}(X)) defined in equation (10), rather than the larger quantity \mathrm{Var}(h_{m}) appearing in the inequality due to Hoeffding (1963); however, this result is only useful when m is essentially fixed. Maurer (2019) used different techniques that yield improvements over Arcones' result, in particular with respect to the order m; the bounds obtained in this work are non-trivial for m up to the order of N^{1/3}, which does not suffice for the applications required in the present paper. Moreover, unlike Theorem 4.1 below, the results in Maurer (2019) do not capture the correct behavior of degenerate U-statistics. Recently, Song et al. (2019) made significant progress in studying U-statistics of growing order and developed tools that avoid using decoupling inequalities; however, their techniques apply when m=o\left(\sqrt{N}\right), while we only require that m=o(N).

We will be interested in U-statistics with kernels of a special structure that assumes "weak" dependence on each of the individual variables. Let the kernel be centered and written in the form h_{m}\left(\frac{x_{1}}{\sqrt{m}},\ldots,\frac{x_{m}}{\sqrt{m}}\right), whence the corresponding U-statistic is

U_{N,m}=\frac{1}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}h_{m}\left(\frac{X_{i}}{\sqrt{m}},\ i\in J\right). (14)

The Hoeffding decomposition of U_{N,m} is defined as the sum

U_{N,m}=\frac{m}{N}\sum_{j=1}^{N}h_{m}^{(1)}(X_{j})+\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J), (15)

where h^{(j)}_{m}(x_{1},\ldots,x_{j})=(\delta_{x_{1}}-P)\times\ldots\times(\delta_{x_{j}}-P)\times P^{m-j}h_{m}. We refer the reader to section 7.1, where the Hoeffding decomposition and related background material are reviewed in more detail.
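
For intuition, the identity (15) is easy to verify numerically in the simplest non-trivial case m=2 with a finitely supported P, where the conditional expectations defining h^{(1)} and h^{(2)} can be computed exactly. The following self-contained check (our own illustration, with an arbitrary product kernel and support) confirms that the two sides of (15) agree:

    import itertools
    import numpy as np

    support = np.array([0.0, 1.0, 3.0])                  # X uniform on three points
    Eh = np.mean([a * b for a in support for b in support])
    h = lambda a, b: a * b - Eh                          # centered symmetric kernel
    h1 = lambda a: np.mean([h(a, b) for b in support])   # h^(1)(a) = E[h(a, X)]
    h2 = lambda a, b: h(a, b) - h1(a) - h1(b)            # degenerate second-order part

    x = np.random.default_rng(0).choice(support, size=8)
    pairs = list(itertools.combinations(range(len(x)), 2))
    U = np.mean([h(x[i], x[j]) for i, j in pairs])       # U-statistic, m = 2
    hajek = (2.0 / len(x)) * sum(h1(v) for v in x)       # first term of (15)
    deg = np.mean([h2(x[i], x[j]) for i, j in pairs])    # second term of (15)
    print(U, hajek + deg)                                # identical up to float error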

We will assume that U_{N,m} is non-degenerate; in particular, one can expect that the behavior of U_{N,m} is determined by the first term \frac{m}{N}\sum_{j=1}^{N}h_{m}^{(1)}(X_{j}) in the decomposition. In order to make this intuition rigorous, we need to prove that the higher-order terms are of smaller order with exponentially high probability. It is shown in the course of the proof of Theorem 3.1 that \mathrm{Var}\left(\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right)\leq\mathrm{Var}(h_{m})\left(\frac{m}{N}\right)^{j}. However, to achieve our current goal, bounds for the moments of higher order are required. More specifically, the key technical difficulty lies in establishing the correct rate of decay of the higher moments with respect to the order m of the U-statistic. We will show that under suitable assumptions, \mathbb{E}^{1/q}\left|\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|^{q}=O\left(j^{\eta_{1}}q^{\eta_{2}}\left(\frac{m}{N}\right)^{j/2}\right) for some \eta_{1}>0, \eta_{2}>0 and for all q\geq 2, 2\leq j\leq j_{\mathrm{max}} for a sufficiently large j_{\mathrm{max}}. The crucial observation is that the upper bound for the higher-order L_{q} norms is still proportional to \left(\frac{m}{N}\right)^{j/2}, the same as the L_{2} norm. The following result, essentially implied by moment inequalities of this form, is the main technical novelty and a key ingredient needed to control large deviations of the higher-order terms in the Hoeffding decomposition.

Theorem 4.1.

Let

V_{N,j}=\frac{{m\choose j}^{1/2}}{{N\choose j}^{1/2}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}\left(\frac{X_{i}}{\sqrt{m}},\,i\in J\right),\qquad f_{j}(x_{1},\ldots,x_{j})=\mathbb{E}h_{m}\left(\frac{x_{1}}{\sqrt{m}},\ldots,\frac{x_{j}}{\sqrt{m}},\frac{X_{j+1}}{\sqrt{m}},\ldots,\frac{X_{m}}{\sqrt{m}}\right)

and \nu_{k}=\mathbb{E}^{1/k}|X_{1}-\mathbb{E}X_{1}|^{k}. If the kernel h_{m} is uniformly bounded, then there exists an absolute constant c>0 such that

\mathbb{P}\left(|V_{N,j}|\geq t\right)\leq\operatorname{exp}\left(-\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\frac{\left(\frac{t}{\|h_{m}\|_{\infty}}\sqrt{\frac{N}{j}}\right)^{\frac{2}{j+1}}}{c\left(m/j\right)^{\frac{j}{j+1}}}\right)\right)

whenever \min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\frac{\left(\frac{t}{\|h_{m}\|_{\infty}}\sqrt{\frac{N}{j}}\right)^{\frac{2}{j+1}}}{c\left(m/j\right)^{\frac{j}{j+1}}}\right)\geq 2. Alternatively, suppose that

  (i)

    \left\|\partial_{x_{1}}\ldots\partial_{x_{j}}f_{j}\right\|_{\infty}\leq\left(\frac{C_{1}(P)}{m}\right)^{j/2}j^{\gamma_{1}j} for some \gamma_{1}\geq\frac{1}{2};

  (ii)

    \nu_{k}\leq k^{\gamma_{2}}M for all integers k\geq 2 and some \gamma_{2}\geq 0, M>0.

Then there exist constants c_{1}(P),c_{2}(P)>0 that depend on \gamma_{1} and \gamma_{2} only such that

\mathbb{P}\left(|V_{N,j}|\geq t\right)\leq\operatorname{exp}\left(-\min\left(\frac{1}{c_{1}}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(c_{2}Mj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\right) (16)

whenever \min\left(\frac{1}{c_{1}}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(c_{2}Mj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\geq\max\left(2,\frac{\log(N/j)}{\gamma_{2}j}\right). (In the course of the proof, we show that whenever \gamma_{2}=0, corresponding to the case of a.s. bounded X_{1}, inequality (16) is valid for all t>0.)

The proof of the theorem is given in section 7.3. Let us briefly discuss the imposed conditions. The first inequality requires only boundedness of the kernel and follows from a standard argument; it is mostly useful for degenerate kernels of higher order j, for instance when j\geq Cm/\log(m). The main result is the second inequality of the theorem, which provides a much better dependence of the tails on m for small and moderate values of j. Assumption (ii) is a standard one: for instance, it holds with \gamma_{2}=0 for bounded random variables, with \gamma_{2}=1/2 for sub-Gaussian, and with \gamma_{2}=1 for sub-exponential random variables. As for assumption (i), suppose that the kernel h_{m} is sufficiently smooth. In this case,

\partial_{x_{1}}\ldots\partial_{x_{j}}f_{j}(x_{1},\ldots,x_{j})=m^{-j/2}\,\mathbb{E}\left[\left(\partial_{x_{1}}\ldots\partial_{x_{j}}h_{m}\right)\left(\frac{x_{1}}{\sqrt{m}},\ldots,\frac{x_{j}}{\sqrt{m}},\frac{X_{j+1}}{\sqrt{m}},\ldots,\frac{X_{m}}{\sqrt{m}}\right)\right],

which is indeed of order m^{-j/2} with respect to m. However, the functions f_{j} are often smooth even if the kernel h_{m} is not, as we will show later for the case of an indicator function (specifically, we will prove that the required inequalities hold with \gamma_{1}=\frac{1}{2} for all j\ll m/\log(m) under mild assumptions on the distribution of X_{1}). Next, we state a corollary – a deviation inequality that takes a particularly simple form and suffices for most of the applications discussed later. It can be viewed as an extension of Arcones' (1995) version of Bernstein's inequality to the case of U-statistics of growing order.

Corollary 4.1.

Suppose that

  (i)

    the assumptions of Theorem 4.1 hold for all 2\leq j\leq j_{\mathrm{max}} with \gamma_{1}=\frac{1}{2};

  (ii)

    the kernel h_{m} is uniformly bounded;

  (iii)

    \liminf_{m\to\infty}\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)>0;

  (iv)

    mM^{2}=o\left(N^{1-\delta}\right) for some \delta>0.

Moreover, let q(N,m) be an increasing function such that

q(N,m)=o\left(\min\left(\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}},j_{\mathrm{max}}\log(N/m)\right)\right)\text{ as }N/m\to\infty.

Then for all 2\leq t\leq q(N,m),

\mathbb{P}\left(\left|U_{N,m}\right|\geq\sqrt{\frac{tm}{N}}\right)\leq 2\operatorname{exp}\left(-\frac{t}{2(1+o(1))\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)}\right),

where o(1)\to 0 as N/m\to\infty uniformly over 2\leq t\leq q(N,m). If m=o\left(\frac{N^{1/2}}{\log(N)}\right), we can instead choose q(N,m) such that q(N,m)=o\left(\min\left(\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}},\frac{Nj_{\mathrm{max}}}{m^{2}}\right)\right).

Remark 3.

The key point of the inequality is that the sub-Gaussian deviations are controlled by \mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right) rather than the suboptimal quantity \mathrm{Var}(h_{m}) appearing in Hoeffding's version of Bernstein's inequality for U-statistics. Moreover, the range in which U_{N,m} admits sub-Gaussian deviations is much wider compared to the implications of Arcones' inequality when m is allowed to grow with N. Several comments regarding the additional assumptions are in order:

  1.

    The assumption of uniform boundedness of the kernel h_{m} is needed to ensure that we can apply Bernstein's concentration inequality to the first term of the Hoeffding decomposition. This suffices for our purposes, but in general this condition can be relaxed.

  2.

    The assumption on the asymptotic behavior of the variance is made to simplify the statement and the proof; if it does not hold, the result is still valid once the definition of q(N,m) is modified to reflect the different behavior of this quantity. We include the following heuristic argument, which shows that \lim_{m\to\infty}\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right) often admits a simple closed-form expression. Indeed, note that \sqrt{m}\left(h_{m}^{(1)}(X_{1})-h_{m}^{(1)}(0)\right)=\int_{0}^{X_{1}}\sqrt{m}\,\partial_{u}h_{m}^{(1)}(u)\,du. If \left\|\partial^{2}_{u}h_{m}^{(1)}\right\|_{\infty}=o(m^{-1/2}), then

    \sqrt{m}\left|\partial_{u}h_{m}^{(1)}(u)-\partial_{u}h_{m}^{(1)}(0)\right|\leq\sqrt{m}\left\|\partial^{2}_{u}h_{m}^{(1)}\right\|_{\infty}u\to 0

    pointwise as m\to\infty. If the limit \lim_{m\to\infty}\sqrt{m}\,\partial_{u}h_{m}^{(1)}(0) exists, then \sqrt{m}\left(h_{m}^{(1)}(X_{1})-h_{m}^{(1)}(0)\right)\to\lim_{m\to\infty}\sqrt{m}\,\partial_{u}h_{m}^{(1)}(0)\cdot X_{1}, P-almost everywhere. Moreover, as \sqrt{m}\|\partial_{u}h_{m}^{(1)}\|_{\infty} admits an upper bound independent of m by assumption (i) of Theorem 4.1 and X_{1} is sufficiently integrable, Lebesgue's dominated convergence theorem applies and yields that \mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)\to\left(\lim_{m\to\infty}\sqrt{m}\,\partial_{u}h_{m}^{(1)}(0)\right)^{2}\mathrm{Var}(X_{1}). For instance, this heuristic argument can often be made precise for kernels of the form h\left(\sum_{j=1}^{m}\frac{x_{j}}{\sqrt{m}}\right).

  3.

    Finally, the condition requiring that mM^{2}=o\left(N^{1-\delta}\right) is used to ensure that \left(\frac{N}{mM^{2}}\right)^{\tau}\gg\log(m) for any fixed \tau>0, which simplifies the statement and the proof.

Proof.

The union bound together with Hoeffding's decomposition entails that for any t>0 and 0<\varepsilon<1 (to be chosen later),

\mathbb{P}\left(\left|U_{N,m}\right|\geq\sqrt{\frac{tm}{N}}\right)\\ \leq\mathbb{P}\left(\left|\frac{m}{N}\sum_{j=1}^{N}h_{m}^{(1)}(X_{j})\right|\geq(1-\varepsilon)\sqrt{t}\sqrt{\frac{m}{N}}\right)+\mathbb{P}\left(\left|\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\varepsilon\sqrt{t}\sqrt{\frac{m}{N}}\right).

Bernstein’s inequality yields that

\mathbb{P}\left(\left|\frac{m}{N}\sum_{j=1}^{N}h_{m}^{(1)}(X_{j})\right|\geq(1-\varepsilon)\sqrt{t}\sqrt{\frac{m}{N}}\right)\\ \leq 2\operatorname{exp}\left(-\frac{(1-\varepsilon)^{2}\,t/2}{\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)+(1-\varepsilon)\frac{1}{3}\sqrt{\frac{m}{N}}\|h_{m}\|_{\infty}t^{1/2}}\right)\\ =2\operatorname{exp}\left(-\frac{(1-\varepsilon)^{2}\,t}{2\,\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)(1+o(1))}\right)

where o(1)\to 0 as N/m\to\infty uniformly over 2\leq t\leq q(N,m). It remains to control the expression involving the higher-order Hoeffding decomposition terms: specifically, we will show that under our assumptions, it is bounded from above by \operatorname{exp}\left(-\frac{t}{2\,\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)}\right)\cdot o(1) where o(1)\to 0 uniformly over the range of t. To this end, denote t_{\varepsilon}:=\varepsilon^{2}t and j_{\ast}:=\min\left(j_{\mathrm{max}},\lfloor\log(N/m)\rfloor+1\right). Observe that

\mathbb{P}\left(\left|\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\sqrt{t_{\varepsilon}}\sqrt{\frac{m}{N}}\right)\\ \leq\mathbb{P}\left(\left|\sum_{j=2}^{j_{\ast}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\\ +\mathbb{P}\left(\left|\sum_{j=j_{\ast}+1}^{j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\\ +\mathbb{P}\left(\left|\sum_{j>j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right), (17)

where the second sum may be empty depending on the value of j_{\ast}. First, we estimate the last term using Chebyshev's inequality: repeating the reasoning leading to equation (38) in the proof of Theorem 3.1, we see that \mathrm{Var}\left(\sum_{j>j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right)\leq\mathrm{Var}(h_{m})\left(\frac{m}{N}\right)^{j_{\mathrm{max}}+1}\left(1-m/N\right)^{-1}, hence

\mathbb{P}\left(\left|\sum_{j>j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\leq\frac{18\,\mathrm{Var}(h_{m})}{t_{\varepsilon}}\left(\frac{m}{N}\right)^{j_{\mathrm{max}}}\\ =18\,\mathrm{Var}(h_{m})\operatorname{exp}\left(-j_{\mathrm{max}}\log(N/m)+\log(t_{\varepsilon})\right)

whenever N/m\geq 2. Alternatively, we can apply the first inequality of Theorem 4.1 instead of Chebyshev's inequality to each term corresponding to j>j_{\mathrm{max}} individually, with t=t_{j,\varepsilon}:=\frac{\sqrt{t_{\varepsilon}}}{3j^{2}}\left(\frac{N}{m}\right)^{\frac{j-1}{2}}. This implies that

\mathbb{P}\left(\left|\sum_{j>j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\\ \leq\sum_{j>j_{\mathrm{max}}}\mathbb{P}\left(\left|\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{j,\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\\ \leq m\max_{j>j_{\mathrm{max}}}\operatorname{exp}\left(-c\min\left(t_{\varepsilon}^{1/j}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\left(\frac{t_{\varepsilon}}{\|h\|_{\infty}^{2}}\right)^{\frac{1}{j+1}}\left(\frac{Nj}{m^{2}}\right)^{\frac{j}{j+1}}\right)\right).

This bound is useful when \left(\frac{Nj_{\mathrm{max}}}{m^{2}}\right)^{\frac{j_{\mathrm{max}}}{j_{\mathrm{max}}+1}}\gg j_{\mathrm{max}}\log(N/m), which is true whenever m^{2}\ll\frac{N}{\log^{2}(N)}. If moreover \varepsilon\gg\frac{1}{\sqrt{\log(N)}}, then the last probability is bounded from above by

\max_{j>j_{\mathrm{max}}}\operatorname{exp}\left(-c^{\prime}\min\left(t_{\varepsilon}^{1/j}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\left(\frac{t_{\varepsilon}}{\|h\|_{\infty}^{2}}\right)^{\frac{1}{j+1}}\left(\frac{Nj}{m^{2}}\right)^{\frac{j}{j+1}}\right)\right).

To estimate the middle term (the probability involving the terms indexed by j_{\ast}+1\leq j\leq j_{\mathrm{max}}), we apply Theorem 4.1 to each term individually with t=t_{j,\varepsilon}:=\frac{\sqrt{t_{\varepsilon}}}{3j^{2}}\left(\frac{N}{m}\right)^{\frac{j-1}{2}}, keeping in mind that \sum_{j\geq j_{\ast}+1}t_{j,\varepsilon}\leq\frac{\pi^{2}}{18}\left(\frac{N}{m}\right)^{\frac{j-1}{2}}\sqrt{t_{\varepsilon}}. Note that for any 2\leq t\leq\frac{N}{m}, \varepsilon>\frac{m}{N} and j\geq\lfloor\log(N/m)\rfloor+1,

\min\left(\frac{t_{j,\varepsilon}^{\frac{2}{j}}}{c},\left(\frac{t_{j,\varepsilon}\sqrt{N/j}}{\left(cMj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\geq\frac{c_{1}}{M^{\frac{2}{1+2\gamma_{2}}}}\left(\frac{N}{m}\right)^{\frac{1}{1+2\gamma_{2}}},

whence

\mathbb{P}\left(\left|\sum_{j=j_{\ast}+1}^{j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\\ \leq j_{\mathrm{max}}\operatorname{exp}\left(-\frac{c_{1}}{M^{\frac{2}{1+2\gamma_{2}}}}\left(\frac{N}{m}\right)^{\frac{1}{1+2\gamma_{2}}}\right)\leq\operatorname{exp}\left(-\frac{c_{2}}{M^{\frac{2}{1+2\gamma_{2}}}}\left(\frac{N}{m}\right)^{\frac{1}{1+2\gamma_{2}}}\right).

Finally, to estimate the first term in the right side of inequality (17), we again apply Theorem 4.1. With tj,εt_{j,\varepsilon} defined as above,

(|j=2j(mj)(Nj)J𝒜N(j)hm(j)(Xi,iJ)|tε3mN)j=2j(|(mj)(Nj)J𝒜N(j)hm(j)(Xi,iJ)|6π2tj,ε3mN)j=2jexp(cmin(tε1/j(Nm)j1j,tε11+j(1+2γ2)M2j1+j(1+2γ2)(Nm)j1+j(1+2γ2)))jmax2jjexp(cmin(tε1/j(Nm)j1j,tε11+j(1+2γ2)M2j1+j(1+2γ2)(Nm)j1+j(1+2γ2))).{\mathbb{P}{\left(\left|\sum_{j=2}^{j_{\ast}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)}\\ \leq\sum_{j=2}^{j_{\ast}}\mathbb{P}{\left(\left|\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{6}{\pi^{2}}\frac{\sqrt{t_{j,\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)}\\ \leq\sum_{j=2}^{j_{\ast}}\operatorname{exp}\left(-c\min\left(t^{1/j}_{\varepsilon}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\frac{t_{\varepsilon}^{\frac{1}{1+j(1+2\gamma_{2})}}}{M^{\frac{2j}{1+j(1+2\gamma_{2})}}}\left(\frac{N}{m}\right)^{\frac{j}{1+j(1+2\gamma_{2})}}\right)\right)\\ \leq j_{\ast}\max_{2\leq j\leq j_{\ast}}\operatorname{exp}\left(-c\min\left(t^{1/j}_{\varepsilon}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\frac{t_{\varepsilon}^{\frac{1}{1+j(1+2\gamma_{2})}}}{M^{\frac{2j}{1+j(1+2\gamma_{2})}}}\left(\frac{N}{m}\right)^{\frac{j}{1+j(1+2\gamma_{2})}}\right)\right).}

Whenever \varepsilon\geq\frac{1}{\sqrt{N/m}}, the last expression is upper bounded by

max2jjexp(c3min(tε1/j(Nm)j1j,tε11+j(1+2γ2)M2j1+j(1+2γ2)(Nm)j1+j(1+2γ2)))\max_{2\leq j\leq j_{\ast}}\operatorname{exp}\left(-c_{3}\min\left(t^{1/j}_{\varepsilon}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\frac{t_{\varepsilon}^{\frac{1}{1+j(1+2\gamma_{2})}}}{M^{\frac{2j}{1+j(1+2\gamma_{2})}}}\left(\frac{N}{m}\right)^{\frac{j}{1+j(1+2\gamma_{2})}}\right)\right)

for c3c_{3} small enough. Combining all the estimates, we obtain the inequality

(|j=2m(mj)(Nj)J𝒜N(j)hm(j)(Xi,iJ)|tεmN)max2jjexp(c3min(tε1/j(Nm)j1j,tε11+j(1+2γ2)(NmM2)j1+j(1+2γ2)))+exp(c2(NmM2)11+2γ2)+c4Var(hm)exp(jmaxlog(N/m)+log(tε)){\mathbb{P}{\left(\left|\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\sqrt{t_{\varepsilon}}\sqrt{\frac{m}{N}}\right)}\leq\\ \max_{2\leq j\leq j_{\ast}}\operatorname{exp}\left(-c_{3}\min\left(t^{1/j}_{\varepsilon}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},t_{\varepsilon}^{\frac{1}{1+j(1+2\gamma_{2})}}\left(\frac{N}{mM^{2}}\right)^{\frac{j}{1+j(1+2\gamma_{2})}}\right)\right)\\ +\operatorname{exp}\left(-c_{2}\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}}\right)+c_{4}\mathrm{Var}(h_{m})\operatorname{exp}\left(-j_{\mathrm{max}}\log(N/m)+\log(t_{\varepsilon})\right)} (18)

that holds if ε1N/m\varepsilon\geq\frac{1}{\sqrt{N/m}} and 2tNm2\leq t\leq\frac{N}{m}. If t<(NmM2)11+2γ2ε4t<\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}}\varepsilon^{4}, then the first two terms on the right-hand side of the previous display are bounded by ectεε3=ectεe^{-\frac{ct_{\varepsilon}}{\varepsilon^{3}}}=e^{-\frac{ct}{\varepsilon}} each, and if t<ε(jmax1)log(N/m)t<\varepsilon(j_{\mathrm{max}}-1)\log(N/m), the same is true for the last term. Therefore, if

t<ε4min((NmM2)11+2γ2,(jmax1)log(N/m)),t<\varepsilon^{4}\min\left(\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}}\,,(j_{\mathrm{max}}-1)\log(N/m)\right),

then

(|j=2m(mj)(Nj)J𝒜N(j)hm(j)(Xi,iJ)|tεmN)3exp(ctε)=exp(t2Var(mhm(1)(X1)))o(1){\mathbb{P}{\left(\left|\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\sqrt{t_{\varepsilon}}\sqrt{\frac{m}{N}}\right)}\\ \leq 3\operatorname{exp}\left(-\frac{ct}{\varepsilon}\right)=\operatorname{exp}\left(-\frac{t}{2\,\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)}\right)\cdot o(1)}

where the last equality holds whenever we choose \varepsilon:=\varepsilon(N,m) such that \varepsilon(N,m)\to 0 as N/m\to\infty. Specifically, take \varepsilon=\left(\frac{q(N,m)}{\min\left(\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}}\,,j_{\mathrm{max}}\log(N/m)\right)}\right)^{1/4} where the function q(N,m) was defined in the statement of the corollary, and the conclusion follows immediately. If m^{2}\ll\frac{N}{\log^{2}(N)}, we can replace the last term in equation (18) by

maxj>jmaxexp(cmin(tε1/j(Nm)j1j,(tεh2)1j+1(Njm2)jj+1)),\max_{j>j_{\mathrm{max}}}\operatorname{exp}\left(-c^{\prime}\min\left(t_{\varepsilon}^{1/j}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\left(\frac{t_{\varepsilon}}{\|h\|_{\infty}^{2}}\right)^{\frac{1}{j+1}}\left(\frac{Nj}{m^{2}}\right)^{\frac{j}{j+1}}\right)\right),

which is bounded by e^{-\frac{ct}{\varepsilon}} whenever t<\frac{Nj_{\mathrm{max}}}{m^{2}}\varepsilon^{4}. The final result in this case follows similarly. ∎

5 Implications for the median of means estimator.

We are going to apply results of the previous section to deduce non-asymptotic bounds for the permutation-invariant version of the median of means estimator. Recall that it was defined as

μ^N:=med(X¯J,J𝒜N(m)).\widehat{\mu}_{N}:=\mbox{med}\left(\bar{X}_{J},\ J\in\mathcal{A}_{N}^{(m)}\right).
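For concreteness, the following minimal sketch (Python; the tiny sample size, the heavy-tailed Student-t data and all parameter choices are our own assumptions made purely for illustration) computes \widehat{\mu}_{N} by brute-force enumeration of all \binom{N}{m} block means:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, m = 12, 3                       # tiny N: brute force needs all C(N, m) block means
X = rng.standard_t(df=3, size=N)   # heavy-tailed sample, purely for illustration

# permutation-invariant median of means: median over ALL m-element subsets
block_means = [X[list(J)].mean() for J in combinations(range(N), m)]
mu_hat = np.median(block_means)
print(mu_hat)
```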
Theorem 5.1.

Assume that X1,,XNX_{1},\ldots,X_{N} are i.i.d. copies of a random variable XX with mean μ\mu and variance σ2\sigma^{2}. Moreover, suppose that

(i)

    the distribution of X1X_{1} is absolutely continuous with respect to the Lebesgue measure on \mathbb{R} with density ϕ1\phi_{1};

(ii)

    the Fourier transform ϕ^1\widehat{\phi}_{1} of the density satisfies the inequality |ϕ^1(x)|C1(1+|x|)δ\left|\widehat{\phi}_{1}(x)\right|\leq\frac{C_{1}}{(1+|x|)^{\delta}} for some positive constants C1C_{1} and δ\delta;

(iii)

    𝔼|(X1μ)/σ|q<\mathbb{E}\left|(X_{1}-\mu)/\sigma\right|^{q}<\infty for some 3+52<q3\frac{3+\sqrt{5}}{2}<q\leq 3;

Then the estimator μ^N\widehat{\mu}_{N} satisfies

(|N(μ^μ)|σt)2exp(t2(1+o(1)))\mathbb{P}{\left(\left|\sqrt{N}(\widehat{\mu}-\mu)\right|\geq\sigma\sqrt{t}\right)}\leq 2\operatorname{exp}\left(-\frac{t}{2(1+o(1))}\right)

where o(1)0o(1)\to 0 as m,N/mm,\,N/m\to\infty uniformly for all t[lN,m,uN,m]t\in\left[l_{N,m},u_{N,m}\right] for any sequences {lN,m},{uN,m}\{l_{N,m}\}\,,\{u_{N,m}\} such that lN,mNmq1l_{N,m}\gg\frac{N}{m^{q-1}} and uN,mNmqq1mlog2(N)u_{N,m}\ll\frac{N}{m^{\frac{q}{q-1}}\vee m\log^{2}(N)}.

Remark 4.
1.

    Let us recall the Riemann-Lebesgue lemma stating that |ϕ^1(x)|0|\widehat{\phi}_{1}(x)|\to 0 as |x||x|\to\infty for any absolutely continuous distribution, so assumption (ii) is rather mild;

2.

    The inequality q>3+52q>\frac{3+\sqrt{5}}{2} assures that lN,ml_{N,m} and uN,mu_{N,m} can be chosen such that lN,muN,ml_{N,m}\ll u_{N,m}.

Proof.

Throughout the course of the proof, we will assume without loss of generality that \sigma^{2}=1; the general case follows by rescaling. Let us also recall that all asymptotic relations are defined in the limit as both m and N/m\to\infty. Note that a direct application of Corollary 4.1 requires the existence of all moments of X_{1}, which is too prohibitive. Therefore, we will first show how to reduce the problem to the case of bounded random variables. Specifically, we want to truncate X_{j}-\mu,\ j=1,\ldots,N in a way that preserves the decay rate of the characteristic function. To this end, let R be a large constant (that will later be specified as an increasing function of m), and define the standard mollifier \kappa(x) via \kappa(x)=\begin{cases}C_{1}\operatorname{exp}\left(-\frac{1}{1-x^{2}}\right),&|x|<1,\\ 0,&|x|\geq 1\end{cases} where C_{1} is chosen so that \int_{\mathbb{R}}\kappa(x)\,dx=1. Moreover, let \chi_{R}(x)=\left(I_{2R}\ast\kappa_{R}\right)(x) be the smooth approximation of the indicator function of the interval [-2R,2R], where I_{2R}(x)=I\{|x|\leq 2R\} and \kappa_{R}(x)=\frac{1}{R}\kappa(x/R); in particular, \chi_{R}(x)=1 for |x|\leq R and \chi_{R}(x)=0 for |x|\geq 3R. Set

ψ(x)=C2ϕ1(x+μ)χR(x)\psi(x)=C_{2}\phi_{1}(x+\mu)\chi_{R}(x)

where C_{2}>0 is such that \int_{\mathbb{R}}\psi(x)\,dx=1. Suppose that Y^{(R)} has the distribution with density \psi, and note that by construction the laws of X_{1}-\mu and Y^{(R)}, conditionally on the events \{|X_{1}-\mu|\leq R\} and \{|Y^{(R)}|\leq R\} respectively, coincide. Therefore, there exists a random variable Z independent of X_{1} such that

Y1(R):={X1μ,|X1μ|R,Z,|X1μ|>RY_{1}^{(R)}:=\begin{cases}X_{1}-\mu,&|X_{1}-\mu|\leq R,\\ Z,&|X_{1}-\mu|>R\end{cases} (19)

also has density ψ\psi. Observe the following properties of Y1(R)Y_{1}^{(R)}: (a) |Y1(R)|3R|Y_{1}^{(R)}|\leq 3R almost surely; (b) 𝔼h(Y1(R))C2𝔼h(X1μ)\mathbb{E}h\left(Y_{1}^{(R)}\right)\leq C_{2}\mathbb{E}h\left(X_{1}-\mu\right) for any nonnegative function hh – indeed, this follows from the inequality ψ(x)C2ϕ1(x+μ)\psi(x)\leq C_{2}\phi_{1}(x+\mu); (c) |𝔼Y1(R)|(1+C2)𝔼|X1μ|qI{|X1μ|>R}Rq1\left|\mathbb{E}Y_{1}^{(R)}\right|\leq(1+C_{2})\frac{\mathbb{E}|X_{1}-\mu|^{q}I\{|X_{1}-\mu|>R\}}{R^{q-1}}. Indeed,

|𝔼Y1(R)|=|𝔼Y1(R)I{|X1μ|R}+𝔼Y1(R)I{|X1μ|>R}|=|𝔼(μX1)I{|X1μ|>R}+𝔼Y1(R)I{|Y1(R)|>R}|𝔼|X1μ|I{|X1μ|>R}+C2𝔼|X1μ|I{|X1μ|>R}{\left|\mathbb{E}Y_{1}^{(R)}\right|=\left|\mathbb{E}Y_{1}^{(R)}I\{|X_{1}-\mu|\leq R\}+\mathbb{E}Y_{1}^{(R)}I\{|X_{1}-\mu|>R\}\right|\\ =\left|\mathbb{E}(\mu-X_{1})I\{|X_{1}-\mu|>R\}+\mathbb{E}Y_{1}^{(R)}I\{|Y_{1}^{(R)}|>R\}\right|\\ \leq\mathbb{E}\left|X_{1}-\mu\right|I\{|X_{1}-\mu|>R\}+C_{2}\mathbb{E}\left|X_{1}-\mu\right|I\{|X_{1}-\mu|>R\}}

where the last bound follows from property (b) for h(x)=|x|I{|x|>R}h(x)=|x|I\{|x|>R\}. It remains to apply Hölder’s and Markov’s inequalities. The final property of Y1(R)Y_{1}^{(R)} is stated in a lemma below and is proven in the appendix.

Lemma 1.

The characteristic function ψ^(x)\widehat{\psi}(x) of Y1(R)Y_{1}^{(R)} satisfies

\left|\widehat{\psi}(x)\right|\leq\frac{C}{(1+|x|)^{\delta}}

for all xx\in\mathbb{R} and a sufficiently large constant CC.
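As an aside, the truncation construction above is straightforward to check numerically. The sketch below (Python; the grid resolution and the choice R=1 are assumptions made purely for illustration) evaluates \chi_{R}=I_{2R}\ast\kappa_{R} by direct Riemann summation and confirms that it equals 1 on [-R,R] and vanishes outside [-3R,3R]:

```python
import numpy as np

# standard mollifier kappa on (-1, 1), normalized numerically so it integrates to 1
y = np.linspace(-1, 1, 4001)
dy = y[1] - y[0]
bump = np.where(np.abs(y) < 1, np.exp(-1.0 / np.clip(1.0 - y**2, 1e-12, None)), 0.0)
C1 = 1.0 / (bump.sum() * dy)

def chi_R(x, R=1.0):
    # chi_R = I_{2R} * kappa_R: smooth cutoff, 1 on [-R, R], 0 outside [-3R, 3R]
    u = R * y                        # kappa_R is supported on [-R, R]
    w = (C1 / R) * bump              # kappa_R(u) = kappa(u / R) / R
    ind = np.abs(x[:, None] - u[None, :]) <= 2 * R
    return (ind * w[None, :]).sum(axis=1) * (R * dy)

xs = np.array([-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0])   # R = 1, illustration only
print(np.round(chi_R(xs), 3))        # approx. [0, 0.5, 1, 1, 1, 0.5, 0]
```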

Define ρ(x)=|x|\rho(x)=|x|. Proceeding as in the proof of Theorem 2.1, we observe that

(N(μ^μ)t)(N/m(Nm)J𝒜N(m)ρ(m(X¯Jμt/N))0).\mathbb{P}{\left(\sqrt{N}(\widehat{\mu}-\mu)\geq\sqrt{t}\right)}\leq\mathbb{P}{\left(\frac{\sqrt{N/m}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-\sqrt{t/N}\right)\right)\geq 0\right)}. (23)

Our next goal is to show that for sufficiently large RR, the U-statistic with kernel ρ\rho^{\prime}_{-} appearing in (23) and evaluated at X1,,XNX_{1},\ldots,X_{N} can be replaced by the U-statistic evaluated over an i.i.d. sample Y1(R),,YN(R)Y_{1}^{(R)},\ldots,Y_{N}^{(R)} where Yj(R)Y_{j}^{(R)} is related to XjX_{j} according to (19). To this end, recall that 𝔼|X1μ|q<\mathbb{E}|X_{1}-\mu|^{q}<\infty, and choose RR as R=cm12(q1)R=cm^{\frac{1}{2(q-1)}} for some c>0c>0. Next, observe that

J𝒜N(m)ρ(m(X¯Jμt/N))=J𝒜N(m)(ρ(m(Y¯J(R)t/N))𝔼ρ,R+𝔼ρ)+J𝒜N(m)(ρ(m(X¯Jμt/N))ρ(m(Y¯J(R)t/N))𝔼ρ+𝔼ρ,R),{\sum_{J\in\mathcal{A}_{N}^{(m)}}\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-\sqrt{t/N}\right)\right)\\ =\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\sqrt{t/N}\right)\right)-\mathbb{E}\rho_{-,R}^{\prime}+\mathbb{E}\rho_{-}^{\prime}\right)\\ +\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-\sqrt{t/N}\right)\right)-\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\sqrt{t/N}\right)\right)-\mathbb{E}\rho_{-}^{\prime}+\mathbb{E}\rho_{-,R}^{\prime}\right),} (27)

where \mathbb{E}\rho_{-}^{\prime}=\mathbb{E}\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-\sqrt{t/N}\right)\right) and \mathbb{E}\rho_{-,R}^{\prime}=\mathbb{E}\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\sqrt{t/N}\right)\right); neither expectation depends on J. It was shown in the proof of Theorem 2.1 that

N/m𝔼ρCkg(m)t(2π+O(tk))=t2π(1+o(1))\sqrt{N/m}\,\mathbb{E}\rho_{-}^{\prime}\leq C\sqrt{k}\cdot g(m)-\sqrt{t}\left(\sqrt{\frac{2}{\pi}}+O\left(\sqrt{\frac{t}{k}}\right)\right)=-\sqrt{t}\sqrt{\frac{2}{\pi}}\left(1+o(1)\right)

whenever t\ll N/m and t\gg\frac{N}{m}\,g^{2}(m). Let us remark that in view of the imposed moment assumptions, g(m)=O\left(m^{-(q-2)/2}\right). Moreover, it follows from Hoeffding's version of Bernstein's inequality for U-statistics (Hoeffding, 1963) that

Nm(Nm)J𝒜N(m)(ρ(m(X¯Jμt/N))ρ(m(Y¯J(R)t/N))𝔼ρ+𝔼ρ,R)2𝔼1/2(ρ(m(X¯[m]μt/N))ρ(m(Y¯[m](R)t/N)))2s16s3mN{\frac{\sqrt{\frac{N}{m}}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-\sqrt{t/N}\right)\right)-\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\sqrt{t/N}\right)\right)-\mathbb{E}\rho_{-}^{\prime}+\mathbb{E}\rho_{-,R}^{\prime}\right)\\ \leq 2\mathbb{E}^{1/2}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right)-\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right)\right)^{2}\sqrt{s}\bigvee\frac{16s}{3}\sqrt{\frac{m}{N}}}

with probability at least 1es1-e^{-s}. We want to choose s>0s>0 such that t=o(s)t=o(s) and

α(s,R):=2𝔼1/2(ρ(m(X¯[m]μt/N))ρ(m(Y¯[m](R)t/N)))2s16s3mN=o(t){\alpha(s,R):=2\mathbb{E}^{1/2}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right)-\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right)\right)^{2}\sqrt{s}\bigvee\frac{16s}{3}\sqrt{\frac{m}{N}}\\ =o\left(\sqrt{t}\right)} (28)

as m,N/mm,N/m\to\infty. To estimate

Σm2:=𝔼(ρ(m(X¯[m]μt/N))ρ(m(Y¯[m](R)t/N)))2,\Sigma_{m}^{2}:=\mathbb{E}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right)-\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right)\right)^{2},

note that for any a>0a>0, ρ(m(X¯[m]μt/N))=ρ(m(Y¯[m](R)t/N))\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right)=\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right) whenever |m(Y¯[m](R)t/N)|>a/2\left|\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right|>a/2, |m(X¯[m]μt/N)|>a/2\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right|>a/2 and |m(X¯[m]μY¯[m](R))|a\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\bar{Y}^{(R)}_{[m]}\right)\right|\leq a, hence

Σm24((|m(Y¯[m](R)t/N)|a)+(|m(X¯[m]μt/N)|a))+4(|m(X¯[m]μY¯[m](R))|>a).{\Sigma_{m}^{2}\leq 4\left(\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right|\leq a\right)}+\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right|\leq a\right)}\right)\\ +4\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\bar{Y}^{(R)}_{[m]}\right)\right|>a\right)}.}

Up to the additive error term Cg(m)=O(m(q2)/2)Cg(m)=O\left(m^{-(q-2)/2}\right), the distributions of mX¯[m]\sqrt{m}\bar{X}_{[m]} and mY¯[m](R)\sqrt{m}\bar{Y}^{(R)}_{[m]} can be approximated by the normal distribution, hence

(|m(Y¯[m](R)t/N)|a)+(|m(X¯[m]μt/N)|a)C(a+g(m)).\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right|\leq a\right)}+\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right|\leq a\right)}\leq C(a+g(m)).

Moreover,

(|m(X¯[m]μY¯[m](R))|>a)=(|1mj=1mYj(R)I{|Yj(R)|>R}|a)(|1mj=1mYj(R)I{|Yj(R)|>R}𝔼(Yj(R)I{|Yj(R)|>R})|am|𝔼Yj(R)I{|Yj(R)|>R}|)C2𝔼|X1μ|2I{|X1μ|>R}(aC2m|𝔼(X1μ)I{|X1μ|>R}|)2𝔼|X1μ|qI{|X1μ|>R}Rq2(aC2m|𝔼(X1μ)I{|X1μ|>R}|)2{\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\bar{Y}^{(R)}_{[m]}\right)\right|>a\right)}=\mathbb{P}{\left(\left|\frac{1}{\sqrt{m}}\sum_{j=1}^{m}Y_{j}^{(R)}I\{|Y_{j}^{(R)}|>R\}\right|\geq a\right)}\\ \leq\mathbb{P}{\left(\left|\frac{1}{\sqrt{m}}\sum_{j=1}^{m}Y_{j}^{(R)}I\{|Y_{j}^{(R)}|>R\}-\mathbb{E}\left(Y_{j}^{(R)}I\{|Y_{j}^{(R)}|>R\}\right)\right|\geq a-\sqrt{m}\left|\mathbb{E}Y_{j}^{(R)}I\{|Y_{j}^{(R)}|>R\}\right|\right)}\\ \leq C_{2}\frac{\mathbb{E}|X_{1}-\mu|^{2}I\{|X_{1}-\mu|>R\}}{\left(a-C_{2}\sqrt{m}\left|\mathbb{E}(X_{1}-\mu)I\{|X_{1}-\mu|>R\}\right|\right)^{2}}\\ \leq\frac{\mathbb{E}|X_{1}-\mu|^{q}I\{|X_{1}-\mu|>R\}}{R^{q-2}\left(a-C_{2}\sqrt{m}\left|\mathbb{E}(X_{1}-\mu)I\{|X_{1}-\mu|>R\}\right|\right)^{2}}} (29)

where we used property (b) of Y1(R)Y_{1}^{(R)} along with Hölder’s and Markov’s inequalities. It is also clear that

m|𝔼(X1μ)I{|X1μ|>R}|m𝔼|X1μ|qI{|X1μ|>R}Rq1,\sqrt{m}\left|\mathbb{E}(X_{1}-\mu)I\{|X_{1}-\mu|>R\}\right|\leq\frac{\sqrt{m}\mathbb{E}|X_{1}-\mu|^{q}I\{|X_{1}-\mu|>R\}}{R^{q-1}},

therefore, for R=cm^{\frac{1}{2(q-1)}} specified before, \sqrt{m}\left|\mathbb{E}(X_{1}-\mu)I\{|X_{1}-\mu|>R\}\right|=o(1). Setting a=2C_{2}\frac{\sqrt{m}\,\mathbb{E}^{1/2}|X_{1}-\mu|^{q}I\{|X_{1}-\mu|>R\}}{R^{q-1}}, one easily checks that the right-hand side in (29) is at most CR^{-(q-2)}=C^{\prime}m^{-\frac{q-2}{2(q-1)}}, whence \Sigma^{2}_{m}=o(1). Therefore, there exists a function o(1) such that setting s=t/o(1) yields the stated goal, namely, that t=o(s) and \alpha(s,R)=o(\sqrt{t}) where \alpha(s,R) was defined in (28). Combined with (27), it implies that

(N(μ^μ)t)o(1)et+(Nm(Nm)J𝒜N(m)(ρ(m(Y¯J(R)t/N))𝔼ρ,R)t2π(1+o(1))).{\mathbb{P}{\left(\sqrt{N}(\widehat{\mu}-\mu)\geq\sqrt{t}\right)}\leq o(1)\cdot e^{-t}\\ +\mathbb{P}{\left(\frac{\sqrt{\frac{N}{m}}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\sqrt{t/N}\right)\right)-\mathbb{E}\rho^{\prime}_{-,R}\right)\geq\sqrt{t}\sqrt{\frac{2}{\pi}}(1+o(1))\right)}.} (30)

Note that the U-statistic in the display above is now a function of bounded random variables, hence we can apply Corollary 4.1 with \gamma_{2}=0. As \|\rho^{\prime}_{-}\|_{\infty}=1, condition (ii) of the corollary holds. Let \sqrt{\frac{m}{N}}\sum_{j=1}^{N}h^{(1)}(Y^{(R)}_{j}) be the first term in the Hoeffding decomposition of the U-statistic

N/m(Nm)J𝒜N(m)(ρ(m(Y¯J(R)𝔼Y1(R)t/N+𝔼Y1(R)))𝔼ρ,R).\frac{\sqrt{N/m}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\mathbb{E}Y_{1}^{(R)}-\sqrt{t/N}+\mathbb{E}Y_{1}^{(R)}\right)\right)-\mathbb{E}\rho^{\prime}_{-,R}\right).

Following the lines of the proof of Theorem 3.1 and recalling that m|𝔼Y1(R)|=o(1)\sqrt{m}\left|\mathbb{E}Y_{1}^{(R)}\right|=o(1) in view of property (c) of Y1(R)Y_{1}^{(R)} and the choice of RR, we deduce that

Var(mh(1)(Y1(R)))=2π(1+o(1))\mathrm{Var}\left(\sqrt{m}h^{(1)}(Y^{(R)}_{1})\right)=\frac{2}{\pi}(1+o(1))

where o(1)0o(1)\to 0 as m,N/mm,N/m\to\infty, validating assumption (iii) of the corollary. It remains to verify assumption (i) and specify the value of jmaxj_{\max}. Recall that ρ(x)=I{x0}I{x<0}\rho^{\prime}_{-}(x)=I\{x\geq 0\}-I\{x<0\} and let Y~j(R)\widetilde{Y}_{j}^{(R)} stand for Yj(R)𝔼Yj(R)Y_{j}^{(R)}-\mathbb{E}Y_{j}^{(R)}. The function fj(u1,,uj)f_{j}(u_{1},\ldots,u_{j}) appearing in the statement of Theorem 4.1 can therefore be expressed as

f_{j}(u_{1},\ldots,u_{j})=\mathbb{E}\rho^{\prime}_{-}\left(\frac{1}{\sqrt{m}}\sum_{i=1}^{j}u_{i}+\sqrt{\frac{m-j}{m}}\frac{\sum_{i=j+1}^{m}\widetilde{Y}_{i}^{(R)}}{\sqrt{m-j}}-\sqrt{\frac{tm}{N}}+\sqrt{m}\mathbb{E}Y_{1}^{(R)}\right)\\ =2\Phi_{m-j}\left(\frac{1}{\sqrt{m-j}}\sum_{i=1}^{j}u_{i}-\sqrt{\frac{m}{m-j}}\left(\sqrt{\frac{tm}{N}}+\sqrt{m}\mathbb{E}Y_{1}^{(R)}\right)\right)-1

where for any integer k1k\geq 1, Φk\Phi_{k} stands for the cumulative distribution function of 1kj=1kY~j(R)\frac{1}{\sqrt{k}}\sum_{j=1}^{k}\widetilde{Y}_{j}^{(R)} and ϕk\phi_{k} is the corresponding density function that exists by assumption. Consequently,

\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}(u_{1},\ldots,u_{j})=\frac{2}{(m-j)^{j/2}}\phi^{(j-1)}_{m-j}\left(\frac{1}{\sqrt{m-j}}\sum_{i=1}^{j}u_{i}-\sqrt{\frac{m}{m-j}}\left(\sqrt{\frac{tm}{N}}+\sqrt{m}\mathbb{E}Y_{1}^{(R)}\right)\right).

The following lemma demonstrates that Theorem 4.1 applies with γ1=1/2\gamma_{1}=1/2 and that jmax=mlog(m)o(1)j_{\max}=\frac{m}{\log(m)}\,o(1) in the statement of Corollary 4.1.

Lemma 2.

Let assumptions of Theorem 5.1 hold. Then for mm large enough and j=o(m/logm)j=o(m/\log m),

ϕmj(j1)C(2je)j/2\left\|\phi_{m-j}^{(j-1)}\right\|_{\infty}\leq C\left(\frac{2j}{e}\right)^{j/2}

for a sufficiently large constant C=C(P)C=C(P).

We postpone the proof of this lemma until section 7.6. As all the necessary conditions have been verified, the bound of Corollary 4.1 applies. Recalling that t\gg\frac{N}{m}\,g^{2}(m) and that g(m)\leq C\frac{\mathbb{E}|X_{1}-\mu|^{q}}{m^{(q-2)/2}}, we conclude that the probability on the right-hand side of inequality (30) can be bounded from above by \operatorname{exp}\left(-\frac{t}{2\sigma^{2}(1+o(1))}\right) for all

Nmq1tq(N,m)\frac{N}{m^{q-1}}\ll t\leq q(N,m) (31)

whenever

q(N,m)=min(NmR2,Nmlog2(N))o(1) as N/m.q(N,m)=\min\left(\frac{N}{mR^{2}},\frac{N}{m\log^{2}(N)}\right)\cdot o\left(1\right)\text{ as }N/m\to\infty.

To get the expression for the second term in the minimum above from the bound of the corollary, it suffices to consider the cases m\geq\frac{\sqrt{N}}{\log(N)}o(1) and m\leq\frac{\sqrt{N}}{\log(N)}o(1) separately; we omit the simple algebra. Since R=cm^{\frac{1}{2(q-1)}}, (31) is only possible when q-2>\frac{1}{q-1}, implying the requirement q>\frac{3+\sqrt{5}}{2}. We thus arrive at the final form of the bound,

(N(μ^μ)σt)exp(t2(1+o(1)))\mathbb{P}{\left(\sqrt{N}(\widehat{\mu}-\mu)\geq\sigma\sqrt{t}\right)}\leq\operatorname{exp}\left(-\frac{t}{2(1+o(1))}\right)

which holds uniformly for all \frac{N}{m^{q-1}}\ll t\ll\frac{N}{m^{\frac{q}{q-1}}\vee m\log^{2}(N)}. The argument needed to estimate \mathbb{P}{\left(\sqrt{N}(\widehat{\mu}-\mu)\leq-\sigma\sqrt{t}\right)} is identical. ∎
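To illustrate the constant \frac{2}{\pi}=\mathrm{Var}\left(\sqrt{m}\,h^{(1)}(Y_{1})\right)(1+o(1)) appearing throughout the proof, note that in the Gaussian case the first Hoeffding projection of the sign kernel admits the closed form derived in the display preceding Lemma 2. The following sketch (Python with scipy; standard normal data, t=0 and \mathbb{E}Y_{1}^{(R)}=0 are simplifying assumptions made for illustration) confirms numerically that \mathrm{Var}\left(\sqrt{m}\,h^{(1)}(Y_{1})\right)\to 2/\pi:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
Y = rng.standard_normal(10**6)          # Gaussian model, illustration only
for m in (10, 100, 1000):
    # first Hoeffding projection of the sign kernel at the Gaussian model:
    # h1(y) = E[sign((y + S_{m-1}) / sqrt(m))] = 2 * Phi(y / sqrt(m - 1)) - 1
    h1 = 2 * norm.cdf(Y / np.sqrt(m - 1)) - 1
    print(m, m * h1.var())              # approaches 2 / pi ~ 0.6366
```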

6 Open questions.

Several potentially interesting questions and directions have not been addressed in this paper. We summarize a few of them below.

  • (i)

The first question is related to the assumptions of Theorem 5.1: does the result still hold for distributions possessing only 2+\varepsilon moments? And can the assumptions requiring absolute continuity and a bound on the rate of decay of the characteristic function be dropped? For example, Corollary 3.1 holds for lattice distributions as well.

  • (ii)

    It is known that (Hanson and Wright,, 1971) the sample mean based on i.i.d. observations from the multivariate normal distribution N(μ,Σ)N(\mu,\Sigma) satisfies the inequality

    X¯Nμ2trace(Σ)N+2tΣN\left\|\bar{X}_{N}-\mu\right\|_{2}\leq\sqrt{\frac{\mathrm{trace}(\Sigma)}{N}}+\sqrt{\frac{2t\|\Sigma\|}{N}}

    with probability at least 1et1-e^{-t}. Does there exist an estimator of the mean that achieves this bound (up to o(1)o(1) factors) for the heavy-tailed distributions? Partial results in this direction have been recently obtained by Lee and Valiant, (2022).

  • (iii)

Exact computation of the estimator \widehat{\mu}_{N} is infeasible, as it requires evaluation and sorting of \asymp\left(\frac{N}{m}\right)^{m} sample means. Therefore, it is interesting to understand whether it can be replaced by \mbox{med}\left(\bar{X}_{J},\ J\in\mathcal{B}\right) where \mathcal{B} is a (deterministic or random) subset of \mathcal{A}_{N}^{(m)} of much smaller cardinality, while preserving the deviation guarantees. For instance, it is easy to deduce from results on incomplete U-statistics in section 4.3 of the book by Lee, (2019) combined with the proof of Corollary 3.1 that if \mathcal{B} consists of M subsets selected at random with replacement from \mathcal{A}_{N}^{(m)}, then the asymptotic distribution of \sqrt{N}\left(\mbox{med}\left(\bar{X}_{J},\ J\in\mathcal{B}\right)-\mu\right) is still N(0,\sigma^{2}) as long as M\gg N; a minimal simulation sketch of this randomized variant is given after this list. However, establishing results in the spirit of Theorem 5.1 in this framework appears to be more difficult.
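The promised sketch of the randomized variant follows (Python; the function name incomplete_mom, the Student-t data, the block size m and the number of blocks M are all our own illustrative choices; this is a heuristic illustration rather than the estimator analyzed in Theorem 5.1):

```python
import numpy as np

def incomplete_mom(X, m, M, rng):
    """Median of M block means over blocks drawn at random (with replacement) from A_N^(m)."""
    N = len(X)
    means = np.empty(M)
    for i in range(M):
        J = rng.choice(N, size=m, replace=False)   # one random m-element subset
        means[i] = X[J].mean()
    return np.median(means)

rng = np.random.default_rng(2)
X = rng.standard_t(df=3, size=2_000)               # heavy-tailed sample, illustration only
print(incomplete_mom(X, m=20, M=50_000, rng=rng))  # M >> N, per the discussion above
```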

7 Remaining proofs.

The proofs omitted in the main text are presented in this section.

7.1 Technical tools.

Let us recall the definition of Hoeffding’s decomposition (Hoeffding,, 1948) and closely related concepts that are at the core of many arguments related to U-statistics. Assume that Y1,,YNY_{1},\ldots,Y_{N} are i.i.d. random variables with distribution PYP_{Y}. Recall that 𝒜N(m)={J[N]:|J|=m}\mathcal{A}_{N}^{(m)}=\left\{J\subseteq[N]:\ |J|=m\right\} and that the U-statistic with permutation-symmetric kernel hmh_{m} is defined as

UN,m=1(Nm)J𝒜N(m)hm(Yi,iJ),U_{N,m}=\frac{1}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}h_{m}(Y_{i},\ i\in J),

where we assume that 𝔼hm=0\mathbb{E}h_{m}=0. Moreover, for j=1,,m,j=1,\ldots,m, define the projections

(πjhm)(y1,,yj):=(δy1PY)××(δyjPY)×PYmjhm.(\pi_{j}h_{m})(y_{1},\ldots,y_{j}):=(\delta_{y_{1}}-P_{Y})\times\ldots\times(\delta_{y_{j}}-P_{Y})\times P_{Y}^{m-j}h_{m}. (32)

For brevity and to ease notation, we will often write hm(j)h_{m}^{(j)} in place of πjhm\pi_{j}h_{m}. The variances of these projections will be denoted by

δj2:=Var(hm(j)(Y1,,Yj)).\delta_{j}^{2}:=\mathrm{Var}\left(h^{(j)}_{m}(Y_{1},\ldots,Y_{j})\right).

In particular, \delta_{m}^{2}=\mathrm{Var}(h_{m}). It is well known (Lee, 2019) that h^{(j)}_{m} can be viewed geometrically as the orthogonal projection of h_{m} onto a particular subspace of L_{2}(P_{Y}^{m}). The kernels h^{(j)}_{m} have the property of complete degeneracy, meaning that \mathbb{E}h^{(j)}_{m}(y_{1},\ldots,y_{j-1},Y_{j})=0 for P_{Y}-almost all y_{1},\ldots,y_{j-1} while h^{(j)}_{m}(Y_{1},\ldots,Y_{j}) is non-zero with positive probability. One can easily check that h_{m}(y_{1},\ldots,y_{m})=\sum_{j=1}^{m}\sum_{J\subseteq[m]:|J|=j}h^{(j)}_{m}(y_{i},\,i\in J); in particular, the partial sum \sum_{j=1}^{k}\sum_{J\subseteq[m]:|J|=j}h^{(j)}_{m}(y_{i},\,i\in J) is the best approximation of h_{m}, in the mean-squared sense, by sums of functions of at most k variables. The Hoeffding decomposition states that (see Hoeffding, (1948), as well as the book by Lee, (2019))

UN,m=j=1m(mj)UN,m(j),U_{N,m}=\sum_{j=1}^{m}{m\choose j}U_{N,m}^{(j)}, (33)

where UN,m(j)U_{N,m}^{(j)} are U-statistics with kernels hm(j)h^{(j)}_{m}, namely UN,m(j):=1(Nj)J𝒜N(j)hm(j)(Yi,iJ)U_{N,m}^{(j)}:=\frac{1}{{N\choose j}}\sum\limits_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(Y_{i},\ i\in J). Moreover, all terms in representation (33) are uncorrelated.
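The decomposition is easy to verify numerically for a small kernel. The sketch below (Python; the kernel h(y_{1},y_{2})=y_{1}y_{2}+y_{1}+y_{2} with standard normal data is our own toy example, for which \pi_{1}h(y)=y and \pi_{2}h(y_{1},y_{2})=y_{1}y_{2}) checks the orthogonality of the projections and the variance identity \mathrm{Var}(h_{m})=\sum_{j}{m\choose j}\delta_{j}^{2} for m=2:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10**6
Y1, Y2 = rng.standard_normal(n), rng.standard_normal(n)

h = Y1 * Y2 + Y1 + Y2            # centered kernel of order m = 2, toy example
h1_a, h1_b = Y1, Y2              # pi_1 h (y) = E[h(y, Y)] = y for this kernel
h2 = Y1 * Y2                     # pi_2 h: completely degenerate part

print(np.allclose(h, h1_a + h1_b + h2))     # h equals the sum of its projections
print(round(np.cov(h1_a, h2)[0, 1], 4))     # projections are uncorrelated (~0)
# Var(h) = C(2,1) * delta_1^2 + delta_2^2 = 2 * 1 + 1 = 3
print(round(h.var(), 3), round(2 * h1_a.var() + h2.var(), 3))
```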

Next, we recall some useful moment bounds, found for instance in the book by de la Pena and Gine, (1999), for the Rademacher chaos variables. Let ε1,,εN\varepsilon_{1},\ldots,\varepsilon_{N} be i.i.d. Rademacher random variables (random signs), {aJ,J𝒜N(l)}\{a_{J},\ J\in\mathcal{A}_{N}^{(l)}\}\subset\mathbb{R}, and Z=J𝒜N(l)aJiJεiZ=\sum_{J\in\mathcal{A}_{N}^{(l)}}a_{J}\prod_{i\in J}\varepsilon_{i}. Here, iJεi=εi1εil\prod_{i\in J}\varepsilon_{i}=\varepsilon_{i_{1}}\cdot\ldots\cdot\varepsilon_{i_{l}} for J={i1,,il}J=\{i_{1},\ldots,i_{l}\}.

Fact 1 (Bonami inequality).

Let σ2(Z)=Var(Z)=J𝒜N(l)aJ2\sigma^{2}(Z)=\mathrm{Var}(Z)=\sum_{J\in\mathcal{A}_{N}^{(l)}}a_{J}^{2}. Then for any q>2q>2,

𝔼|Z|q(q1)ql/2(σ2(Z))q/2.\mathbb{E}|Z|^{q}\leq\left(q-1\right)^{ql/2}\left(\sigma^{2}(Z)\right)^{q/2}.
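A quick Monte Carlo check of this inequality for a chaos of order l=2 (a sketch with randomly drawn coefficients a_{J}; all sizes are assumptions chosen for speed):

```python
import numpy as np

rng = np.random.default_rng(4)
N, l, q, reps = 10, 2, 4.0, 10**5               # illustrative sizes only
pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]   # A_N^(2)
a = rng.standard_normal(len(pairs))
sigma2 = np.sum(a**2)                            # sigma^2(Z) = Var(Z)

eps = rng.choice([-1.0, 1.0], size=(reps, N))    # Rademacher signs
Z = sum(c * eps[:, i] * eps[:, j] for c, (i, j) in zip(a, pairs))
lhs = np.mean(np.abs(Z) ** q)
rhs = (q - 1) ** (q * l / 2) * sigma2 ** (q / 2) # Bonami bound
print(lhs <= rhs, lhs, rhs)
```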

Now we state a version of the symmetrization inequality for completely degenerate U-statistics due to Sherman, (1994); also see the paper by Song et al., (2019) for a modern exposition of the proof. The main feature of this inequality, put forward by Song et al., (2019), is the fact that its proof does not rely on decoupling, and yields constants that do not grow too fast with the order of the U-statistics.

Fact 2.

Let h be a completely degenerate kernel of order l, and let \Phi be a convex, nonnegative, non-decreasing function. Moreover, assume that \varepsilon_{1},\ldots,\varepsilon_{N} are i.i.d. Rademacher random variables. Then

𝔼Φ(1j1<<jlNh(Yj1,,Yjl))𝔼Φ(2l1j1<<jlNεj1εjlh(Yj1,,Yjl)).\mathbb{E}\Phi\left(\sum_{1\leq j_{1}<\ldots<j_{l}\leq N}h(Y_{j_{1}},\ldots,Y_{j_{l}})\right)\leq\mathbb{E}\Phi\left(2^{l}\sum_{1\leq j_{1}<\ldots<j_{l}\leq N}\varepsilon_{j_{1}}\ldots\varepsilon_{j_{l}}h(Y_{j_{1}},\ldots,Y_{j_{l}})\right).

Next is the well-known identity, due to Hoeffding, (1963), that allows one to reduce many problems for non-degenerate U-statistics to the corresponding problems for sums of i.i.d. random variables.

Fact 3.

The following representation holds:

UN,m=1N!πWπ,U_{N,m}=\frac{1}{N!}\sum_{\pi}W_{\pi},

where the sum is over all permutations π:[N][N]\pi:[N]\mapsto[N], and

Wπ=1k(hm(Yπ(1),Yπ(2),,Yπ(m))++hm(Yπ((k1)m+1),Yπ((k1)m+2),,Yπ(km)))W_{\pi}=\frac{1}{k}\left(h_{m}\left(Y_{\pi(1)},Y_{\pi(2)},\ldots,Y_{\pi(m)}\right)+\ldots+h_{m}\left(Y_{\pi((k-1)m+1)},Y_{\pi((k-1)m+2)},\ldots,Y_{\pi(km)}\right)\right)

for k=N/mk=\lfloor N/m\rfloor.
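For small N and m, the representation can be verified exactly by enumerating all permutations; a minimal sketch (Python; the product kernel and the sizes are our own illustrative choices):

```python
import numpy as np
from itertools import combinations, permutations

rng = np.random.default_rng(5)
N, m = 6, 2                                # small enough to enumerate all N! permutations
Y = rng.standard_normal(N)
h = lambda a, b: a * b                     # symmetric kernel, illustration only
k = N // m

U = np.mean([h(Y[i], Y[j]) for i, j in combinations(range(N), 2)])
W = [np.mean([h(Y[p[2 * r]], Y[p[2 * r + 1]]) for r in range(k)])
     for p in permutations(range(N))]      # W_pi over all N! permutations
print(np.isclose(U, np.mean(W)))           # the identity holds exactly
```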

Finally, we state a version of Rosenthal’s inequality for the moments of sums of independent, nonnegative random variables with explicit constants, see (Boucheron et al.,, 2013; Chen et al.,, 2012).

Fact 4.

Let Y1,,YNY_{1},\ldots,Y_{N} be independent random variables such that Yj0Y_{j}\geq 0 with probability 11 for all j[N]j\in[N]. Then for any q1q\geq 1,

(𝔼|j=1NYj|q)1/q((j=1N𝔼Yj)1/2+2eq(𝔼maxj=1,,NYjq)1/2q)2.\left(\mathbb{E}\left|\sum_{j=1}^{N}Y_{j}\right|^{q}\right)^{1/q}\leq\left(\left(\sum_{j=1}^{N}\mathbb{E}Y_{j}\right)^{1/2}+2\sqrt{eq}\left(\mathbb{E}\max_{j=1,\ldots,N}Y_{j}^{q}\right)^{1/2q}\right)^{2}.
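A direct numerical check of Fact 4 (a sketch assuming exponential variables, chosen only because they are nonnegative with easily simulated maxima; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, q, reps = 50, 3.0, 10**5
Y = rng.exponential(size=(reps, N))        # independent, nonnegative; illustration only

lhs = np.mean(Y.sum(axis=1) ** q) ** (1 / q)
max_term = np.mean(Y.max(axis=1) ** q) ** (1 / (2 * q))
rhs = (np.sqrt(N * 1.0) + 2 * np.sqrt(np.e * q) * max_term) ** 2   # sum of E Y_j = N
print(lhs <= rhs, round(lhs, 2), round(rhs, 2))
```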

7.2 Proof of Theorem 3.1.

Recall that S_{N,m}:=\frac{m}{N}\sum_{i=1}^{N}h_{m}^{(1)}(Y_{i}) denotes the first-order term (the Hájek projection) in the Hoeffding decomposition (33) of U_{N,m}. It is easy to verify that

hm(Y1,,Ym)=(δY1PY+PY)××(δYmPY+PY)hm=j=1mJ[m]:|J|=jhm(j)(Yi,iJ){h_{m}(Y_{1},\ldots,Y_{m})=(\delta_{Y_{1}}-P_{Y}+P_{Y})\times\ldots\times(\delta_{Y_{m}}-P_{Y}+P_{Y})h_{m}=\sum_{j=1}^{m}\sum_{J\subseteq[m]:|J|=j}h^{(j)}_{m}(Y_{i},\,i\in J)}

and that the terms in the sum above are mutually orthogonal, yielding that

Var(hm(Y1,,Ym))=j=1m(mj)δj2.\mathrm{Var}\left(h_{m}(Y_{1},\ldots,Y_{m})\right)=\sum_{j=1}^{m}{m\choose j}\delta_{j}^{2}. (34)

Moreover, as a corollary of Hoeffding’s decomposition, one can get the well known identities See Chapters 1.6 and 1.7 in the book by Lee, (2019) for detailed derivations of these facts. The simple but key observation following from equation (34) is that for any j[m]j\in[m], Var(hm)(mj)δj2\mathrm{Var}(h_{m})\geq{m\choose j}\delta_{j}^{2}, or

δj2Var(hm)(mj).\delta_{j}^{2}\leq\frac{\mathrm{Var}(h_{m})}{{m\choose j}}. (35)

Therefore,

\mathrm{Var}(U_{N,m}-S_{N,m})=\sum_{j=2}^{m}\frac{{m\choose j}^{2}}{{N\choose j}}\delta_{j}^{2}\leq\mathrm{Var}(h_{m})\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\leq\mathrm{Var}(h_{m})\sum_{j\geq 2}\left(\frac{m}{N}\right)^{j}\\ =\mathrm{Var}(h_{m})\left(\frac{m}{N}\right)^{2}\left(1-m/N\right)^{-1}, (38)

where we used the fact that (mj)(Nj)(mN)j\frac{{m\choose j}}{{N\choose j}}\leq\left(\frac{m}{N}\right)^{j} for mNm\leq N: indeed, the latter easily follows from the identity (mj)(Nj)=m(m1)(mj+1)N(N1)(Nj+1)\frac{{m\choose j}}{{N\choose j}}=\frac{m(m-1)\ldots(m-j+1)}{N(N-1)\ldots(N-j+1)}. It is well known (Hoeffding,, 1948) that Var(h(1)(Y1))Var(hm)m\mathrm{Var}\left(h^{(1)}(Y_{1})\right)\leq\frac{\mathrm{Var}(h_{m})}{m}, therefore the condition Var(hm(Y1,,Ym))Var(hm(1)(Y1))=o(N)\frac{\mathrm{Var}\left(h_{m}(Y_{1},\ldots,Y_{m})\right)}{\mathrm{Var}\left(h_{m}^{(1)}(Y_{1})\right)}=o(N) imposed on the ratio of variances implies that m=o(N)m=o(N). Therefore, for m,Nm,N large enough (so that m/N1/2m/N\leq 1/2),

Var(UN,mSN,m)Var(SN,m)2Var(hm)(mN)2δ12m2/N=2Var(hm)Nδ12=o(1)\frac{\mathrm{Var}(U_{N,m}-S_{N,m})}{\mathrm{Var}(S_{N,m})}\leq 2\frac{\mathrm{Var}(h_{m})\left(\frac{m}{N}\right)^{2}}{\delta_{1}^{2}m^{2}/N}=2\frac{\mathrm{Var}(h_{m})}{N\delta_{1}^{2}}=o(1)

by assumption, yielding that \frac{U_{N,m}-S_{N,m}}{\mathrm{Var}^{1/2}(S_{N,m})}=o_{P}(1) as N,m\to\infty.

7.3 Proof of Theorem 4.1.

We are going to estimate 𝔼|VN,j|q\mathbb{E}|V_{N,j}|^{q} for an arbitrary q>2q>2. It follows from the symmetrization inequality (Fact 2) followed by the moment bound stated in Fact 1 that

𝔼|VN,j|q2jq𝔼X𝔼ε|(mj)1/2(Nj)1/2(i1,,ij)𝒜N(j)εi1εijhm(j)(Xi1,,Xij)|q2jq(q1)jq/2𝔼|(mj)(Nj)(i1,,ij)𝒜N(j)(hm(j)(Xi1,,Xij))2|q/2.{\mathbb{E}|V_{N,j}|^{q}\leq 2^{jq}\,\mathbb{E}_{X}\mathbb{E}_{\varepsilon}\left|\frac{{m\choose j}^{1/2}}{{N\choose j}^{1/2}}\sum_{(i_{1},\ldots,i_{j})\in\mathcal{A}_{N}^{(j)}}\varepsilon_{i_{1}}\ldots\varepsilon_{i_{j}}h^{(j)}_{m}(X_{i_{1}},\ldots,X_{i_{j}})\right|^{q}\\ \leq 2^{jq}(q-1)^{jq/2}\mathbb{E}\left|\frac{{m\choose j}}{{N\choose j}}\sum_{(i_{1},\ldots,i_{j})\in\mathcal{A}_{N}^{(j)}}\left(h^{(j)}_{m}(X_{i_{1}},\ldots,X_{i_{j}})\right)^{2}\right|^{q/2}.}

Next, Hoeffding’s representation of the U-statistic (Fact 3) together with Jensen’s inequality yields that

𝔼|(mj)(Nj)(i1,,ij)𝒜N(j)(hm(j)(Xi1,,Xij))2|q/2𝔼|(mj)N/ji=1N/jWi|q/2,\mathbb{E}\left|\frac{{m\choose j}}{{N\choose j}}\sum_{(i_{1},\ldots,i_{j})\in\mathcal{A}_{N}^{(j)}}\left(h^{(j)}_{m}(X_{i_{1}},\ldots,X_{i_{j}})\right)^{2}\right|^{q/2}\leq\mathbb{E}\left|\frac{{m\choose j}}{\lfloor N/j\rfloor}\sum_{i=1}^{\lfloor N/j\rfloor}W_{i}\right|^{q/2},

where W_{i}:=\left(h^{(j)}_{m}(X_{(i-1)j+1},\ldots,X_{ij})\right)^{2}. We are going to estimate \mathbb{E}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}^{p} in two different ways. First, recall that

hm(j)(x1,,xj):=(πjhm)(x1,,xj)=(δx1PX)××(δxjPX)×PXmjhm.h^{(j)}_{m}(x_{1},\ldots,x_{j}):=(\pi_{j}h_{m})(x_{1},\ldots,x_{j})=(\delta_{x_{1}}-P_{X})\times\ldots\times(\delta_{x_{j}}-P_{X})\times P_{X}^{m-j}h_{m}.

Therefore, (\pi_{j}h_{m})(x_{1},\ldots,x_{j}) is a linear combination of 2^{j} terms of the form \prod_{i\in I}\delta_{x_{i}}\,P_{X}^{m-|I|}\,h_{m}, for all choices of I\subseteq[j]. Consequently, \left|(\pi_{j}h_{m})(x_{1},\ldots,x_{j})\right|^{2}\leq 2^{2j}\|h_{m}\|^{2}_{\infty}, and the same bound also holds (almost surely) for the maximum of the W_{i}'s. Therefore, \mathbb{E}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}^{p}\leq 2^{2jp}\|h_{m}\|^{2p}_{\infty} and \mathbb{E}\left({m\choose j}W_{1}\right)^{p}\leq(2e)^{2jp}\left(\frac{m}{j}\right)^{jp}\|h_{m}\|_{\infty}^{2p}. Moreover, equation (35) in the proof of Theorem 3.1 implies that \mathbb{E}W_{1}\leq\frac{\mathrm{Var}(h_{m})}{{m\choose j}}. Therefore, Rosenthal's inequality for nonnegative random variables (Fact 4) entails that for q\geq 2,

\mathbb{E}\left|\frac{{m\choose j}}{\lfloor N/j\rfloor}\sum_{i=1}^{\lfloor N/j\rfloor}W_{i}\right|^{q/2}\leq C^{q/2}\left(\mathrm{Var}^{q/2}(h_{m})+\left(\frac{q}{2}\right)^{q/2}\left(\frac{j}{N}\right)^{q/2}\mathbb{E}\left({m\choose j}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}\right)^{q/2}\right)\\ \leq C^{q/2}\left(\mathrm{Var}^{q/2}(h_{m})+\left(\frac{q}{2}\right)^{q/2}\left(\frac{j}{N}\right)^{q/2}(2e)^{jq}\left(\frac{m}{j}\right)^{jq/2}\|h_{m}\|_{\infty}^{q}\right)

and

𝔼|VN,j|q(Cq1/2)qj(Varq/2(hm)((qjN)1/2(mj)j/2hm)q).\mathbb{E}|V_{N,j}|^{q}\leq(Cq^{1/2})^{qj}\left(\mathrm{Var}^{q/2}(h_{m})\vee\left(\left(\frac{qj}{N}\right)^{1/2}\left(\frac{m}{j}\right)^{j/2}\|h_{m}\|_{\infty}\right)^{q}\right).

Markov’s inequality therefore yields that

(|VN,j|(C1q)j/2(Var1/2(hm)(qjN)1/2(mj)j/2hm))eq.\mathbb{P}{\left(|V_{N,j}|\geq(C_{1}q)^{j/2}\left(\mathrm{Var}^{1/2}(h_{m})\vee\left(\frac{qj}{N}\right)^{1/2}\left(\frac{m}{j}\right)^{j/2}\|h_{m}\|_{\infty}\right)\right)}\leq e^{-q}.

Let A(q)=(C1q)j/2Var1/2(hm)A(q)=(C_{1}q)^{j/2}\mathrm{Var}^{1/2}(h_{m}) and B(q)=hm(qjN)1/2(C1q1/2(mj)1/2)jB(q)=\|h_{m}\|_{\infty}\left(\frac{qj}{N}\right)^{1/2}\left(C_{1}q^{1/2}\left(\frac{m}{j}\right)^{1/2}\right)^{j}. If t=A(q)B(q)t=A(q)\vee B(q), then q=A1(t)B1(t)q=A^{-1}(t)\wedge B^{-1}(t). We can solve the inequalities explicitly to get, after some algebra, that

\mathbb{P}{\left(|V_{N,j}|\geq t\right)}\leq\operatorname{exp}\left(-\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\frac{\left(\frac{t}{\|h_{m}\|_{\infty}}\sqrt{\frac{N}{j}}\right)^{\frac{2}{j+1}}}{\left(\frac{cm}{j}\right)^{\frac{j}{j+1}}}\right)\right). (39)
Remark 5.

Whenever |X1𝔼X1|M|X_{1}-\mathbb{E}X_{1}|\leq M almost surely, the inequality |(πjhm)(x1,,xj)|2jhm\left|(\pi_{j}h_{m})(x_{1},\ldots,x_{j})\right|\leq 2^{j}\|h_{m}\|_{\infty} can be replaced by the bound |(πjhm)(x1,,xj)|Cuju1fj(2M)j\left|(\pi_{j}h_{m})(x_{1},\ldots,x_{j})\right|\leq C\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}(2M)^{j} which follows from Lemma 3 below. Combined with the assumption stating that uju1fj(C1(P)m)j/2jγ1j\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\leq\left(\frac{C_{1}(P)}{m}\right)^{j/2}j^{\gamma_{1}j}, one easily finds that the resulting concentration inequality reads as follows:

\mathbb{P}{\left(|V_{N,j}|\geq t\right)}\leq\operatorname{exp}\left(-\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(cMj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{j+1}}\right)\right). (40)

This bound holds for all t>0t>0 and is usually sharper than (39).

The bound (39) is useful mainly when \frac{m}{j} is not too large. Now we will present a second way to estimate \mathbb{E}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}^{p} that will yield much better inequalities for small values of j and is valid when X_{1} is not necessarily supported on a bounded interval. The key technical element that we rely on is the following lemma that allows one to control the growth of the moments of W_{1} with respect to m. Define

fj(x1,,xj):=𝔼hm(x1,,xj,Xj+1,,Xm).f_{j}(x_{1},\ldots,x_{j}):=\mathbb{E}h_{m}(x_{1},\ldots,x_{j},X_{j+1},\ldots,X_{m}).
Lemma 3.

Let conditions of the theorem hold and let σ2=Var(X1)\sigma^{2}=\mathrm{Var}(X_{1}). Then there exists C=C(P)>0C=C(P)>0 such that

|(πjhm)(X1,,Xj)|Cuju1fji=1j(|Xi𝔼Xi|+σ)\left|(\pi_{j}h_{m})(X_{1},\ldots,X_{j})\right|\leq C\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\prod_{i=1}^{j}\left(|X_{i}-\mathbb{E}X_{i}|+\sigma\right)

with probability 11. Moreover, for any p>2p>2,

𝔼|(πjhm)(X1,,Xj)|pCpjuju1fjp(𝔼|X1𝔼X1|p)j.\mathbb{E}\left|(\pi_{j}h_{m})(X_{1},\ldots,X_{j})\right|^{p}\leq C^{pj}\,\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|^{p}_{\infty}\left(\mathbb{E}\left|X_{1}-\mathbb{E}X_{1}\right|^{p}\right)^{j}.

The proof of the lemma is outlined in section 7.4. As uju1fj(C1(P)m)j/2jγ1j\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\leq\left(\frac{C_{1}(P)}{m}\right)^{j/2}j^{\gamma_{1}j} by assumption, the second bound of the lemma can be written as

𝔼|(πjhm)(X1,,Xj)|pC2pjmjp/2jγ1pj(𝔼|X1𝔼X1|p)j.\mathbb{E}\left|(\pi_{j}h_{m})(X_{1},\ldots,X_{j})\right|^{p}\leq C_{2}^{pj}m^{-jp/2}j^{\gamma_{1}pj}\left(\mathbb{E}\left|X_{1}-\mathbb{E}X_{1}\right|^{p}\right)^{j}.

Recall that νk=𝔼1/k|X1𝔼X1|k\nu_{k}=\mathbb{E}^{1/k}|X_{1}-\mathbb{E}X_{1}|^{k} and that under the stated assumptions, νkkγ2M\nu_{k}\leq k^{\gamma_{2}}M for all integers k2k\geq 2 and some γ2,M>0\gamma_{2},M>0. Therefore,

𝔼W1pC2pjj2γ1pjmpjν2p2pj(CMjγ1m1/2pγ2)2pj,\mathbb{E}W_{1}^{p}\leq C^{2pj}j^{2\gamma_{1}p\,j}m^{-pj}\nu_{2p}^{2pj}\leq\left(C^{\prime}Mj^{\gamma_{1}}m^{-1/2}p^{\gamma_{2}}\right)^{2pj}, (41)

and consequently 𝔼((mj)W1)p(CMjγ11/2pγ2)2pj\mathbb{E}\left({m\choose j}W_{1}\right)^{p}\leq\left(C^{\prime}Mj^{\gamma_{1}-1/2}p^{\gamma_{2}}\right)^{2pj}. The rest of the argument proceeds in a similar way as before. Recall again that 𝔼W1Var(hm)(mj)\mathbb{E}W_{1}\leq\frac{\mathrm{Var}(h_{m})}{{m\choose j}}. Rosenthal’s inequality for nonnegative random variables (Fact 4) implies that for q2q\geq 2,

\mathbb{E}\left|\frac{{m\choose j}}{\lfloor N/j\rfloor}\sum_{i=1}^{\lfloor N/j\rfloor}W_{i}\right|^{q/2}\leq C^{q/2}\left(\mathrm{Var}^{q/2}(h_{m})+\left(\frac{q}{2}\right)^{q/2}\left(\frac{j}{N}\right)^{q/2}\mathbb{E}\left({m\choose j}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}\right)^{q/2}\right).

With the inequality for \mathbb{E}W_{1}^{p} in hand, the expectation \mathbb{E}\left({m\choose j}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}\right)^{q/2} can be upper bounded in two ways: first, trivially,

\mathbb{E}\left({m\choose j}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}\right)^{q/2}\leq\lfloor N/j\rfloor\,\mathbb{E}\left({m\choose j}W_{1}\right)^{q/2}\leq\lfloor N/j\rfloor\left(C_{1}Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{qj}.

On the other hand, for any identically distributed \xi_{1},\ldots,\xi_{k} and any p>1, \mathbb{E}\max_{i=1,\ldots,k}|\xi_{i}|\leq k^{1/p}\max_{i=1,\ldots,k}\mathbb{E}^{1/p}|\xi_{i}|^{p}. Choosing \xi_{i}={m\choose j}W_{i} and p=\lfloor\log(N/j)\rfloor+1, we obtain the inequality

\mathbb{E}\left({m\choose j}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}\right)^{q/2}\leq\left(\log(N/j)\right)^{\gamma_{2}qj}\left(C_{1}Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{qj}.

The second bound is better for qlog(N/j)γ2jloglog(N/j)q\leq\frac{\log(N/j)}{\gamma_{2}j\log\log(N/j)}, therefore we get an estimate

𝔼|(mj)N/ji=1N/jWi|q/2Cq/2(Varq/2(hm)+(C3j(qjN)1/2(logγ2(N/j)Mjγ11/2qγ2)j)q)\mathbb{E}\left|\frac{{m\choose j}}{\lfloor N/j\rfloor}\sum_{i=1}^{\lfloor N/j\rfloor}W_{i}\right|^{q/2}\leq C^{q/2}\left(\mathrm{Var}^{q/2}(h_{m})+\left(C_{3}^{j}\left(\frac{qj}{N}\right)^{1/2}\left(\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}\right)^{q}\right)

and

𝔼|VN,j|q(Cq1/2)qj(Varq/2(hm)((qjN)1/2(logγ2(N/j)Mjγ11/2qγ2)j)q)\mathbb{E}|V_{N,j}|^{q}\leq(Cq^{1/2})^{qj}\left(\mathrm{Var}^{q/2}(h_{m})\vee\left(\left(\frac{qj}{N}\right)^{1/2}\left(\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}\right)^{q}\right)

that we will use for 2qlog(N/j)γ2j2\leq q\leq\frac{\log(N/j)}{\gamma_{2}j}, while for larger values of qq, (N/j)1/qeγ2j(N/j)^{1/q}\leq e^{\gamma_{2}j} and

𝔼|VN,j|q(Cq1/2)qj(Varq/2(hm)((qjN)1/2(Mjγ11/2qγ2)j)q).\mathbb{E}|V_{N,j}|^{q}\leq(Cq^{1/2})^{qj}\left(\mathrm{Var}^{q/2}(h_{m})\vee\left(\left(\frac{qj}{N}\right)^{1/2}\left(Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}\right)^{q}\right).

Markov’s inequality therefore yields that for small values of qq (that is, whenever 2qlog(N/j)γ2j2\leq q\leq\frac{\log(N/j)}{\gamma_{2}j}),

(|VN,j|(Cq)j/2(Var1/2(hm)(qjN)1/2(logγ2(N/j)Mjγ11/2qγ2)j))eq.\mathbb{P}{\left(|V_{N,j}|\geq(Cq)^{j/2}\left(\mathrm{Var}^{1/2}(h_{m})\vee\left(\frac{qj}{N}\right)^{1/2}\left(\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}\right)\right)}\leq e^{-q}.

Let A(q)=(Cq)^{j/2}\mathrm{Var}^{1/2}(h_{m}) and B(q)=\left(\frac{qj}{N}\right)^{1/2}\left(Cq^{1/2}\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}. If t=A(q)\vee B(q), then q=A^{-1}(t)\wedge B^{-1}(t). Solving these inequalities explicitly, we get, after some algebra, that

\mathbb{P}{\left(|V_{N,j}|\geq t\right)}\leq\operatorname{exp}\left(-\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(c\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\right)

for values of tt satisfying 2min(1c(t2Var(hm))1j,(tN/j(clogγ2(N/j)Mjγ11/2)j)21+j(2γ2+1))log(N/j)γ2j2\leq\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{{\frac{1}{j}}},\left(\frac{t\sqrt{N/j}}{\left(c\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\leq\frac{\log(N/j)}{\gamma_{2}j}. Similarly, for qmax(2,log(N/j)γ2j)q\geq\max\left(2,\frac{\log(N/j)}{\gamma_{2}j}\right), the previously established bounds yield that

\mathbb{P}{\left(|V_{N,j}|\geq(Cq)^{j/2}\left(\mathrm{Var}^{1/2}(h_{m})\vee\left(\frac{qj}{N}\right)^{1/2}\left(Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}\right)\right)}\leq e^{-q},

or equivalently

\mathbb{P}{\left(|V_{N,j}|\geq t\right)}\leq\operatorname{exp}\left(-\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(cMj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\right) (42)

whenever \min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(cMj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\geq\max\left(2,\frac{\log(N/j)}{\gamma_{2}j}\right). The combination of inequalities (39) and (42) yields the final result.

7.4 Proof of Lemma 3.

Recall that fj(x1,,xj)=𝔼hm(x1m,,xjm,Xj+1m,,Xmm)f_{j}(x_{1},\ldots,x_{j})=\mathbb{E}h_{m}\left(\frac{x_{1}}{\sqrt{m}},\ldots,\frac{x_{j}}{\sqrt{m}},\frac{X_{j+1}}{\sqrt{m}},\ldots,\frac{X_{m}}{\sqrt{m}}\right) where j<mj<m. It is easy to see from the definition of πj\pi_{j} that (πjh)(x1,,xj)=(πjfj)(x1,,xj)(\pi_{j}h)(x_{1},\ldots,x_{j})=(\pi_{j}f_{j})(x_{1},\ldots,x_{j}). Next, observe that for any function g:j1g:\mathbb{R}^{j-1}\mapsto\mathbb{R} of j1j-1 variables such that 𝔼g2(X1,,Xj1)<,\mathbb{E}g^{2}(X_{1},\ldots,X_{j-1})<\infty, πjg=0\pi_{j}g=0 Pj1P^{j-1}-almost everywhere. Indeed, this follows immediately from the definition (32) of the operator πj\pi_{j} since gg is a constant when viewed as a function of yjy_{j}. Based on this fact, it is easy to see that for any constant aa\in\mathbb{R}, fj(x1,,xj)f_{j}(x_{1},\ldots,x_{j}) and fj(x1,,xj)fj|x1=a(x2,,xj)f_{j}(x_{1},\ldots,x_{j})-f_{j}|_{x_{1}=a}(x_{2},\ldots,x_{j}), where fj|x1=a(x2,,xj):=fj(a,x2,,xj)f_{j}|_{x_{1}=a}(x_{2},\ldots,x_{j}):=f_{j}(a,x_{2},\ldots,x_{j}), are mapped to the same function by πj\pi_{j}. In particular, (πjh)(x1,,xj)=(πj(fjfj|x1=a))(x1,,xj)(\pi_{j}h)(x_{1},\ldots,x_{j})=\left(\pi_{j}(f_{j}-f_{j}|_{x_{1}=a})\right)(x_{1},\ldots,x_{j}). Moreover,

fj(x1,,xj)fj|x1=a(x2,,xj)=ax1u1fj(u1,x2,,xj)du1f_{j}(x_{1},\ldots,x_{j})-f_{j}|_{x_{1}=a}(x_{2},\ldots,x_{j})=\int_{a}^{x_{1}}\partial_{u_{1}}f_{j}(u_{1},x_{2},\ldots,x_{j})du_{1}

Next, we repeat the same argument with fjf_{j} replaced by

fj,2(x2,,xj;u1):=u1fj(u1,x2,,xj)f_{j,2}(x_{2},\ldots,x_{j};u_{1}):=\partial_{u_{1}}f_{j}(u_{1},x_{2},\ldots,x_{j})

and noting that

f_{j,2}(x_{2},\ldots,x_{j};u_{1})-f_{j,2}|_{x_{2}=a}(x_{3},\ldots,x_{j};u_{1})=\int_{a}^{x_{2}}\partial_{u_{2}}f_{j,2}(u_{2},x_{3},\ldots,x_{j};u_{1})du_{2}.

The expression ax1fj,2|x2=a(x3,,xj;u1)du1\int_{a}^{x_{1}}f_{j,2}|_{x_{2}=a}(x_{3},\ldots,x_{j};u_{1})du_{1} is a function of j1j-1 variables, hence πj\pi_{j} maps it to 0 so that

(πjhm)(x1,,xj)=πj(ax1ax2u2fj,2(u2,x3,,xj;u1)du2du1).(\pi_{j}h_{m})(x_{1},\ldots,x_{j})=\pi_{j}\left(\int_{a}^{x_{1}}\int_{a}^{x_{2}}\partial_{u_{2}}f_{j,2}(u_{2},x_{3},\ldots,x_{j};u_{1})du_{2}du_{1}\right).

Iterating this process, we arrive at the expression

(πjhm)(x1,,xj)=πj(ax1axjuju1fj(u1,,uj)dujdu1).(\pi_{j}h_{m})(x_{1},\ldots,x_{j})=\pi_{j}\left(\int_{a}^{x_{1}}\ldots\int_{a}^{x_{j}}\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}(u_{1},\ldots,u_{j})du_{j}\ldots du_{1}\right). (43)

Next, observe that for any function gg of jj variables,

(πjg)(x1,,xj)=(δx1PX)××(δxjPX)g=𝔼X~[(δx1δX~1)××(δxjδX~j)g],(\pi_{j}g)(x_{1},\ldots,x_{j})=(\delta_{x_{1}}-P_{X})\times\ldots\times(\delta_{x_{j}}-P_{X})g=\mathbb{E}_{\tilde{X}}\left[(\delta_{x_{1}}-\delta_{\tilde{X}_{1}})\times\ldots\times(\delta_{x_{j}}-\delta_{\tilde{X}_{j}})g\right],

where X~1,,X~j\tilde{X}_{1},\ldots,\tilde{X}_{j} are i.i.d. with the same law as XX, and independent from X1,,XNX_{1},\ldots,X_{N}. Therefore, (πjhm)(x1,,xj)(\pi_{j}h_{m})(x_{1},\ldots,x_{j}) is a linear combination of 2j2^{j} terms of the form 𝔼X~(iIδxijIcδX~jg)\mathbb{E}_{\tilde{X}}\left(\prod_{i\in I}\delta_{x_{i}}\prod_{j\in I^{c}}\delta_{\tilde{X}_{j}}\,g\right), for all choices of I[j]I\subseteq[j] and

g(x1,,xj)=ax1axjuju1fj(u1,,uj)dujdu1.g(x_{1},\ldots,x_{j})=\int_{a}^{x_{1}}\ldots\int_{a}^{x_{j}}\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}(u_{1},\ldots,u_{j})du_{j}\ldots du_{1}.

Take a:=𝔼X1a:=\mathbb{E}X_{1}, and note that

\left|(\pi_{j}h_{m})(x_{1},\ldots,x_{j})\right|\leq\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\sum_{I\subseteq[j]}\prod_{i\in I}|x_{i}-a|\prod_{i\in I^{c}}\mathbb{E}|\tilde{X}_{i}-a|\\ \leq\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\sum_{I\subseteq[j]}\prod_{i\in I}|x_{i}-a|\cdot\sigma^{|I^{c}|}=\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\prod_{i=1}^{j}\left(|x_{i}-\mathbb{E}X_{1}|+\sigma\right).

The first claim of the lemma follows. To deduce the moment bound, observe that since X1,,Xj,X~1,,X~jX_{1},\ldots,X_{j},\tilde{X}_{1},\ldots,\tilde{X}_{j} are i.i.d. and in view of convexity of the function x|x|px\mapsto|x|^{p} for p1p\geq 1,

𝔼|(πjhm)(X1,,Xj)|p2(p1)j𝔼|aX1aXjuju1fj(u1,,uj)dujdu1|p2(p1)juju1fjp𝔼|(X1𝔼X1)(Xj𝔼Xj)|p.{\mathbb{E}\left|(\pi_{j}h_{m})(X_{1},\ldots,X_{j})\right|^{p}\leq 2^{(p-1)j}\mathbb{E}\left|\int_{a}^{X_{1}}\ldots\int_{a}^{X_{j}}\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}(u_{1},\ldots,u_{j})\,du_{j}\ldots du_{1}\right|^{p}\\ \leq 2^{(p-1)j}\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|^{p}_{\infty}\mathbb{E}\left|(X_{1}-\mathbb{E}X_{1})\ldots(X_{j}-\mathbb{E}X_{j})\right|^{p}.}

Here, as before, a=\mathbb{E}X_{1}; the second claim of the lemma follows.

7.5 Proof of Lemma 1.

As ψ(x)\psi(x) is integrable, its Fourier transform equals C2ϕ^1χ^RC_{2}\widehat{\phi}_{1}\ast\widehat{\chi}_{R}, while χ^R=κ^RI^2R\widehat{\chi}_{R}=\widehat{\kappa}_{R}\cdot\widehat{I}_{2R}. It is well known (e.g. Johnson,, 2015) that κ^(x)C3e|x|\widehat{\kappa}(x)\leq C_{3}e^{-\sqrt{|x|}}, hence κ^R(x)=κ^(Rx)C3eR|x|\widehat{\kappa}_{R}(x)=\widehat{\kappa}(Rx)\leq C_{3}e^{-\sqrt{R|x|}}. Moreover, I^2R(x)=sin(2Rx)x\widehat{I}_{2R}(x)=\frac{\sin(2Rx)}{x}. Therefore, for |x||x| large enough,

\left|\widehat{\psi}(x)\right|=C_{2}\left|\int_{\mathbb{R}}\widehat{\phi}_{1}(x-y)\widehat{\chi}_{R}(y)dy\right|\\ \leq C_{2}\left(\Bigg{|}\int\limits_{y:|y-x|\geq|x|/2}\widehat{\phi}_{1}(x-y)\widehat{\chi}_{R}(y)dy\Bigg{|}+\Bigg{|}\int\limits_{y:|y-x|<|x|/2}\widehat{\phi}_{1}(x-y)\widehat{\chi}_{R}(y)dy\Bigg{|}\right).

To estimate the first integral, note that \left|\widehat{\phi}_{1}(x-y)\right|\leq\frac{C_{1}}{(1+|x|/2)^{\delta}}\leq\frac{C_{1}2^{\delta}}{(1+|x|)^{\delta}} whenever |y-x|\geq|x|/2 and that \left|\widehat{I}_{2R}(x)\right|\leq 2R, implying that

\Bigg{|}\int\limits_{y:|y-x|\geq|x|/2}\widehat{\phi}_{1}(x-y)\widehat{\chi}_{R}(y)dy\Bigg{|}\leq\frac{C_{4}}{(1+|x|)^{\delta}}\int_{\mathbb{R}}e^{-\sqrt{R|y|}}d(Ry)=\frac{C_{5}}{(1+|x|)^{\delta}}.

On the other hand,

\Bigg{|}\int\limits_{y:|y-x|<|x|/2}\widehat{\phi}_{1}(x-y)\widehat{\chi}_{R}(y)dy\Bigg{|}\leq C_{6}\left|\int_{x-|x|/2}^{x+|x|/2}e^{-\sqrt{R|y|}}\frac{\sin(2Ry)}{y}dy\right|\\ \leq C_{7}\int_{R|x|/2}^{3R|x|/2}e^{-\sqrt{z}}dz\leq C_{8}e^{-\sqrt{R|x|/2}}\sqrt{R|x|}.

Clearly, the last expression is smaller than C9(1+|x|)δ\frac{C_{9}}{(1+|x|)^{\delta}}, implying the desired result.

7.6 Proof of Lemma 2.

The proof proceeds using the standard Fourier-analytic tools. Let ϕ^1:=[ϕ1]\widehat{\phi}_{1}:=\mathcal{F}[\phi_{1}] be the Fourier transform of ϕ1\phi_{1}, whence [ϕmj](t)=(ϕ^1(tmj))mj\mathcal{F}\left[\phi_{m-j}\right](t)=\left(\widehat{\phi}_{1}\left(\frac{t}{\sqrt{m-j}}\right)\right)^{m-j}. Therefore,

ϕmj(j1)(t)=12πexp(itx)(ix)j1(ϕ^1(xmj))mj𝑑x\phi_{m-j}^{(j-1)}(t)=\frac{1}{2\pi}\int_{\mathbb{R}}\operatorname{exp}\left(-itx\right)(ix)^{j-1}\left(\widehat{\phi}_{1}\left(\frac{x}{\sqrt{m-j}}\right)\right)^{m-j}dx

and \left\|\phi_{m-j}^{(j-1)}\right\|_{\infty}\leq\frac{1}{2\pi}\int_{\mathbb{R}}|x|^{j-1}\left|\widehat{\phi}_{1}\left(\frac{x}{\sqrt{m-j}}\right)\right|^{m-j}dx=\frac{(m-j)^{j/2}}{2\pi}\int_{\mathbb{R}}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx. As \left|\widehat{\phi}_{1}(x)\right|\leq\frac{C_{1}}{(1+|x|)^{\delta}} by assumption, the integral is finite when \delta(m-j)>j (in particular, this inequality holds when m is large enough and j=o(m) as m\to\infty). To get an explicit bound, we will estimate the integral over [-\eta,\eta] and \mathbb{R}\setminus[-\eta,\eta] separately, for a specific choice of \eta>0. To this end, observe that \widehat{\phi}_{1}(x)=\psi_{\sigma}(x)+o(x^{2}) as x\to 0, where \psi_{\sigma}(x)=\operatorname{exp}\left(-\frac{\sigma^{2}x^{2}}{2}\right) is the characteristic function of the normal law N(0,\sigma^{2}). Therefore, there exists \eta>0 such that for all |x|\leq\eta, \left|\widehat{\phi}_{1}(x)\right|\leq\operatorname{exp}\left(-\frac{\sigma^{2}x^{2}}{4}\right), and

(mj)j/2ηη|x|j1|ϕ^1(x)|mj𝑑x(mj)j/2|x|j1exp(σ2x2(mj)4)𝑑x=|y|j1exp(σ2y24)𝑑y=2jσjΓ(j2){(m-j)^{j/2}\int_{-\eta}^{\eta}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx\leq(m-j)^{j/2}\int_{\mathbb{R}}|x|^{j-1}\operatorname{exp}\left(-\frac{\sigma^{2}x^{2}(m-j)}{4}\right)dx\\ =\int_{\mathbb{R}}|y|^{j-1}\operatorname{exp}\left(-\frac{\sigma^{2}y^{2}}{4}\right)dy=\frac{2^{j}}{\sigma^{j}}\Gamma\left(\frac{j}{2}\right)}

where we used the exact expression for the absolute moments of the normal distribution. As Γ(x+1)C22πx(xe)x\Gamma(x+1)\leq C_{2}\sqrt{2\pi x}\left(\frac{x}{e}\right)^{x} for all x1x\geq 1 and an absolute constant C2C_{2} large enough, 2jσjΓ(j2)C2σj(2je)j/2\frac{2^{j}}{\sigma^{j}}\Gamma\left(\frac{j}{2}\right)\leq\frac{C_{2}}{\sigma^{j}}\left(\frac{2j}{e}\right)^{j/2}. At the same time,

(mj)j/2[η,η]|x|j1|ϕ^1(x)|mj𝑑x=(mj)j/2[(2C1)2/δ,(2C1)2/δ]|x|j1|ϕ^1(x)|mj𝑑x+(mj)j/2[(2C1)2/δ,(2C1)2/δ][η,η]|x|j1|ϕ^1(x)|mj𝑑x{(m-j)^{j/2}\int_{\mathbb{R}\setminus[-\eta,\eta]}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx=(m-j)^{j/2}\int_{\mathbb{R}\setminus[-(2C_{1})^{2/\delta},(2C_{1})^{2/\delta}]}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx\\ +(m-j)^{j/2}\int_{[-(2C_{1})^{2/\delta},(2C_{1})^{2/\delta}]\setminus[-\eta,\eta]}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx}

where C11C_{1}\geq 1 is a constant such that |ϕ^1(x)|C1(1+|x|)δ\left|\widehat{\phi}_{1}(x)\right|\leq\frac{C_{1}}{(1+|x|)^{\delta}}. The first term can be estimated via

(mj)j/2[(2C1)2/δ,(2C1)2/δ]|x|j1|ϕ^1(x)|mj𝑑xC1mj(mj)j/2[(2C1)2/δ,(2C1)2/δ]|x|j1(1+|x|)δ(mj)𝑑x2C1mj(mj)j/2δ(mj)j1(2C1)2(mj)2j/δ.{(m-j)^{j/2}\int_{\mathbb{R}\setminus[-(2C_{1})^{2/\delta},(2C_{1})^{2/\delta}]}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx\\ \leq C_{1}^{m-j}(m-j)^{j/2}\int_{\mathbb{R}\setminus[-(2C_{1})^{2/\delta},(2C_{1})^{2/\delta}]}\frac{|x|^{j-1}}{(1+|x|)^{\delta(m-j)}}dx\\ \leq\frac{2C_{1}^{m-j}(m-j)^{j/2}}{\delta(m-j)-j}\frac{1}{(2C_{1})^{2(m-j)-2j/\delta}}.}

Whenever m>2j+2j/δm>2j+2j/\delta, we can bound the last expression from above by C3mj/22mC_{3}m^{j/2}2^{-m}. Finally, as sup|x|>η|ϕ^1(x)|1γ\sup_{|x|>\eta}|\widehat{\phi}_{1}(x)|\leq 1-\gamma for some 0<γ<10<\gamma<1,

(mj)j/2[(2C1)2/δ,(2C1)2/δ][η,η]|x|j1|ϕ^1(x)|mj𝑑x2(mj)j/2(1γ)mj(2C1)2j/δj.(m-j)^{j/2}\int_{[-(2C_{1})^{2/\delta},(2C_{1})^{2/\delta}]\setminus[-\eta,\eta]}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx\leq 2(m-j)^{j/2}(1-\gamma)^{m-j}\frac{(2C_{1})^{2j/\delta}}{j}.

Putting the estimates together, we deduce that

ϕmj(j1)(mj)j/22π|x|j1|ϕ^1(x)|mj𝑑xC2σj(2je)j/2+C3mj/22m+C4((2C1)4/δm)j/2(1γ)mj.{\left\|\phi_{m-j}^{(j-1)}\right\|_{\infty}\leq\frac{(m-j)^{j/2}}{2\pi}\int_{\mathbb{R}}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx\\ \leq\frac{C_{2}}{\sigma^{j}}\left(\frac{2j}{e}\right)^{j/2}+C_{3}m^{j/2}2^{-m}+C_{4}\left((2C_{1})^{4/\delta}m\right)^{j/2}(1-\gamma)^{m-j}.}

Whenever j=o(m/logm)j=o(m/\log m), the last two terms in the sum above are negligible so that for mm large enough,

ϕmj(j1)C5σj(2je)j/2,\left\|\phi_{m-j}^{(j-1)}\right\|_{\infty}\leq\frac{C_{5}}{\sigma^{j}}\left(\frac{2j}{e}\right)^{j/2},

as claimed.
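As a sanity check, in the Gaussian case \widehat{\phi}_{1}(x)=e^{-x^{2}/2} the Fourier bound above can be evaluated numerically; the sketch below (Python with scipy; the choices of m and j are assumptions made for illustration) confirms that it stays below a constant multiple of \left(\frac{2j}{e}\right)^{j/2}:

```python
import numpy as np
from scipy.integrate import quad

m = 200                                   # illustrative value of m
for j in (2, 4, 8, 16):
    # (m-j)^{j/2} / (2 pi) * int |x|^{j-1} |phihat(x)|^{m-j} dx,
    # with phihat(x) = exp(-x^2 / 2) (standard normal, illustration only)
    val, _ = quad(lambda x: np.abs(x) ** (j - 1) * np.exp(-(m - j) * x**2 / 2),
                  -np.inf, np.inf)
    bound = (m - j) ** (j / 2) / (2 * np.pi) * val
    print(j, round(bound, 4), round((2 * j / np.e) ** (j / 2), 4))
```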

References

  • Alon et al., (1996) Alon, N., Matias, Y., and Szegedy, M. (1996). The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20–29. ACM.
• Arcones, (1995) Arcones, M. A. (1995). A Bernstein-type inequality for U-statistics and U-processes. Statistics & Probability Letters, 22(3):239–247.
  • Boucheron et al., (2013) Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.
  • Catoni, (2012) Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 48, pages 1148–1185. Institut Henri Poincaré.
  • Chen et al., (2012) Chen, R. Y., Gittens, A., and Tropp, J. A. (2012). The masked sample covariance estimator: an analysis using matrix concentration inequalities. Information and Inference, page ias001.
  • de la Pena and Gine, (1999) de la Pena, V. and Gine, E. (1999). Decoupling: From dependence to independence. Springer-Verlag, New York.
  • de la Pena and Montgomery-Smith, (1995) de la Pena, V. and Montgomery-Smith, S. J. (1995). Decoupling inequalities for the tail probabilities of multivariate U-statistics. Annals of Probability, 23(2):806–816.
  • Devroye et al., (2016) Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725.
  • DiCiccio and Romano, (2022) DiCiccio, C. and Romano, J. (2022). CLT for U-statistics with growing dimension. Statistica Sinica, 32:1–22.
  • Feller, (1968) Feller, W. (1968). On the Berry-Esseen theorem. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 10(3):261–268.
  • Frees, (1989) Frees, E. W. (1989). Infinite order U-statistics. Scandinavian Journal of Statistics, pages 29–45.
  • Hanson and Wright, (1971) Hanson, D. L. and Wright, F. T. (1971). A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083.
  • Hodges and Lehmann, (1963) Hodges, J. L. and Lehmann, E. L. (1963). Estimates of location based on rank tests. The Annals of Mathematical Statistics, pages 598–611.
  • Hoeffding, (1948) Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, pages 293–325.
  • Hoeffding, (1963) Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.
  • Jerrum et al., (1986) Jerrum, M. R., Valiant, L. G., and Vazirani, V. V. (1986). Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188.
  • Johnson, (2015) Johnson, S. G. (2015). Saddle-point integration of $C_{\infty}$ “bump” functions. arXiv preprint arXiv:1508.04376.
  • Lee, (2019) Lee, A. J. (2019). U-statistics: Theory and Practice. Routledge.
  • Lee and Valiant, (2020) Lee, J. C. and Valiant, P. (2020). Optimal sub-Gaussian mean estimation in $\mathbf{R}$. arXiv preprint arXiv:2011.08384.
  • Lee and Valiant, (2022) Lee, J. C. and Valiant, P. (2022). Optimal sub-Gaussian mean estimation in very high dimensions. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
  • Lerasle and Oliveira, (2011) Lerasle, M. and Oliveira, R. I. (2011). Robust empirical mean estimators. arXiv preprint arXiv:1112.3914.
  • Lugosi and Mendelson, (2019) Lugosi, G. and Mendelson, S. (2019). Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190.
  • Maurer, (2019) Maurer, A. (2019). A Bernstein-type inequality for functions of bounded interaction. Bernoulli, 25(2):1451–1471.
  • Minsker, (2019) Minsker, S. (2019). Distributed statistical estimation and rates of convergence in normal approximation. Electronic Journal of Statistics, 13(2):5213–5252.
  • Minsker and Ndaoud, (2021) Minsker, S. and Ndaoud, M. (2021). Robust and efficient mean estimation: an approach based on the properties of self-normalized sums. Electronic Journal of Statistics, 15(2):6036–6070.
  • Nemirovski and Yudin, (1983) Nemirovski, A. and Yudin, D. (1983). Problem complexity and method efficiency in optimization. John Wiley & Sons Inc.
  • Peng et al., (2022) Peng, W., Coleman, T., and Mentch, L. (2022). Rates of convergence for random forests via generalized U-statistics. Electronic Journal of Statistics, 16(1):232–292.
  • Petrov, (1975) Petrov, V. V. (1975). Sums of Independent Random Variables. Springer Berlin Heidelberg.
  • Petrov, (1995) Petrov, V. V. (1995). Limit theorems of probability theory: sequences of independent random variables. Oxford University Press, New York.
  • Serfling, (1984) Serfling, R. J. (1984). Generalized L-, M-, and R-statistics. The Annals of Statistics, pages 76–86.
  • Serfling, (2009) Serfling, R. J. (2009). Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons.
  • Shepp, (1964) Shepp, L. A. (1964). A local limit theorem. The Annals of Mathematical Statistics, 35(1):419–423.
  • Sherman, (1994) Sherman, R. P. (1994). Maximal inequalities for degenerate U-processes with applications to optimization estimators. The Annals of Statistics, pages 439–459.
  • Song et al., (2019) Song, Y., Chen, X., and Kato, K. (2019). Approximating high-dimensional infinite-order U-statistics: Statistical and computational guarantees. Electronic Journal of Statistics, 13(2):4794–4848.