
U-statistics of growing order and sub-Gaussian mean estimators with sharp constants

Stanislav Minsker Department of Mathematics, University of Southern California
email: minsker@usc.edu
Abstract

This paper addresses the following question: given a sample of i.i.d. random variables with finite variance, can one construct an estimator of the unknown mean that performs nearly as well as if the data were normally distributed? One of the most popular examples achieving this goal is the median of means estimator. However, it is inefficient in the sense that the constants in the resulting bounds are suboptimal. We show that a permutation-invariant modification of the median of means estimator admits deviation guarantees that are sharp up to a 1+o(1) factor if the underlying distribution possesses more than \frac{3+\sqrt{5}}{2}\approx 2.62 moments and is absolutely continuous with respect to the Lebesgue measure. This result yields potential improvements for a variety of algorithms that rely on the median of means estimator as a building block. At the core of our argument are new deviation inequalities for U-statistics whose order is allowed to grow with the sample size, a result that could be of independent interest.

keywords: [class=MSC] 62G35, 60E15, 62E20
keywords: U-statistics, median-of-means estimator, heavy tails

The author acknowledges support by the National Science Foundation grants CIF-1908905 and DMS CAREER-2045068.

1 Introduction.

Let X_{1},\ldots,X_{N} be i.i.d. random variables with distribution P having mean \mu and finite variance \sigma^{2}. At the core of this paper is the following question: given 1\leq t\leq t_{\max}(N), construct an estimator \widetilde{\mu}_{N}=\widetilde{\mu}_{N}(X_{1},\ldots,X_{N}) such that

\mathbb{P}\left(\left|\widetilde{\mu}_{N}-\mu\right|\geq\sigma\sqrt{\frac{t}{N}}\right)\leq 2e^{-\frac{t}{L}} (1)

for some absolute positive constant L. Estimators that satisfy this deviation property are called sub-Gaussian. For example, the sample mean \bar{X}_{N}=\frac{1}{N}\sum_{j=1}^{N}X_{j} is sub-Gaussian for t_{\max}\asymp q(N,P) where q(N,P)\to\infty as N\to\infty and the constant L equals 2: this immediately follows from the fact that convergence of the distribution functions in the central limit theorem is uniform. However, q(N,P) can grow arbitrarily slowly in general, and it grows as \log^{1/2}(N) if \mathbb{E}|X|^{2+\varepsilon}<\infty for some \varepsilon>0 in view of the Berry-Esseen theorem (for instance, see the book by Petrov, 1975). At the same time, the so-called median of means (MOM) estimator, originally introduced by Nemirovski and Yudin (1983); Alon et al. (1996); Jerrum et al. (1986) and studied recently in relation to the problem at hand, satisfies inequality (1) with t_{\max} of order N and L=24e (Lerasle and Oliveira, 2011), although the latter constant can be improved. A large body of existing work used the MOM estimator as a core subroutine to relax the underlying assumptions for a variety of statistical problems, in particular methods based on empirical risk minimization; we refer the reader to the excellent survey paper by Lugosi and Mendelson (2019) for a detailed overview of the recent advances.

The exact value of the constant L in inequality (1) is less important in problems where only the minimax rates are of interest, but it becomes crucial for the practical value and sample efficiency of the algorithms. The benchmark here is the situation when the observations are normally distributed: Catoni (2012) showed that no estimator can outperform the sample mean in this situation. The latter satisfies the relation

\mathbb{P}\left(\left|\bar{X}_{N}-\mu\right|\geq\sigma\frac{\Phi^{-1}(1-e^{-t/2})}{\sqrt{N}}\right)=2e^{-\frac{t}{2}}

where \Phi^{-1}(\cdot) denotes the quantile function of the standard normal law. As \Phi^{-1}(1-e^{-t/2})=(1+o(1))\sqrt{t} as t\to\infty, the best guarantee of the form (1) one can hope for is attained for L=2. It is therefore natural to ask whether there exist sharp sub-Gaussian estimators of the mean, that is, estimators satisfying (1) with L=2(1+o(1)) where o(1) is a sequence that converges to 0 as N\to\infty, under minimal assumptions on the underlying distribution. This question was previously posed by Devroye et al. (2016) as an open problem, and several results have appeared since then that give partial answers. We proceed with a brief review of the state of the art.

1.1 Overview of the existing results.

Catoni (2012) presented the first known example of a sharp sub-Gaussian estimator with t_{\max}=o(N/\kappa) for distributions with finite fourth moment and a known upper bound on the kurtosis \kappa (or, alternatively, for distributions with finite but known variance). Devroye et al. (2016) introduced an alternative estimator that also required a finite fourth moment but did not explicitly depend on the value of the kurtosis as an input, while satisfying the required guarantees for t_{\max}=o\left((N/\kappa)^{2/3}\right). Minsker and Ndaoud (2021) designed an asymptotically efficient sub-Gaussian estimator \widetilde{\mu}_{N} that satisfies \sqrt{N}\left(\widetilde{\mu}_{N}-\mu\right)\xrightarrow{d}N(0,\sigma^{2}) assuming only a finite second moment plus a mild, "small-ball" type condition. However, the constants in the non-asymptotic version of their bounds were not sharp. Finally, Lee and Valiant (2020) constructed an estimator with the required properties assuming just a finite second moment; however, their guarantees hold with optimal constants only for t_{\min}\leq t\leq t_{\max} where t_{\max}=o(N) and t_{\min}\to\infty as N\to\infty. In particular, this range excludes t in the neighborhood of 0, which is often the region of most practical interest.

1.2 Summary of the main contributions.

The reasons for the popularity of the MOM estimator are plentiful: it is simple to define and to compute, it admits strong theoretical guarantees, and it is scale-invariant and therefore essentially tuning-free. Thus, we believe that any quantifiable improvements to its performance are worth investigating.

We start by showing that the standard MOM estimator achieves bound (1) with L=\pi(1+o(1)) where o(1)\to 0 as N\to\infty; this fact is formally stated in Theorem 2.1. We then define a permutation-invariant version of MOM, denoted \widehat{\mu}_{N}, and show in Corollary 3.1 that, surprisingly, it is asymptotically optimal in the sense that \sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\xrightarrow{d}N(0,\sigma^{2}) under minimal assumptions; compare this to the standard MOM estimator, which has limiting variance \frac{\pi}{2}\sigma^{2}. The main result of the paper, Theorem 5.1, demonstrates that the optimality of \widehat{\mu}_{N} holds in a stronger sense, namely, that inequality (1) is valid for a wide range of confidence parameters assuming that the distribution of X_{1} possesses q moments for some possibly unknown q>\frac{3+\sqrt{5}}{2}\approx 2.62 and that its characteristic function satisfies a mild decay bound.

Analysis of the estimator \widehat{\mu}_{N} requires new inequalities for U-statistics of order that grows with the sample size. A detailed discussion and comparison with existing bounds is given in section 4. In particular, we prove novel bounds for large deviations of the degenerate, higher-order terms of the Hoeffding decomposition (Theorem 4.1), and deduce sub-Gaussian deviation guarantees for non-degenerate U-statistics (Corollary 4.1) with the "correct" sub-Gaussian parameter. These bounds could be of independent interest.

1.3 Notation.

Unspecified absolute constants will be denoted C,c,C_{1},c^{\prime}, etc., and may take different values in different parts of the paper. Given a,b\in\mathbb{R}, we will write a\wedge b for \min(a,b) and a\vee b for \max(a,b). For a positive integer M, [M] denotes the set \{1,\ldots,M\}.

We will frequently use the standard big-O and small-o notation for asymptotic relations between functions and sequences. Moreover, given two sequences \{a_{n}\}_{n\geq 1} and \{b_{n}\}_{n\geq 1} where b_{n}\neq 0 for all n, we will write a_{n}\ll b_{n} if \frac{a_{n}}{b_{n}}=o(1) as n\to\infty. Note that o(1) may denote different functions/sequences from line to line.

For a function f:\mathbb{R}\mapsto\mathbb{R}, f^{(m)} will denote its m-th derivative whenever it exists. Similarly, given g:\mathbb{R}^{d}\mapsto\mathbb{R}, \partial_{x_{j}}g(x_{1},\ldots,x_{d}) will stand for the partial derivative of g with respect to the j-th variable. Finally, the sup-norm of g is defined via \|g\|_{\infty}:=\mathrm{ess\,sup}\{|g(y)|:\,y\in\mathbb{R}^{d}\}, and the convolution of f and g is denoted f\ast g.

Given i.i.d. random variables X_{1},\ldots,X_{N} distributed according to P, P_{N}:=\frac{1}{N}\sum_{j=1}^{N}\delta_{X_{j}} will stand for the associated empirical measure, where \delta_{X}(f):=f(X). For a real-valued function f and a signed measure Q, we will write Qf for \int f\,dQ, assuming that the last integral is well-defined. Additional notation and auxiliary results will be introduced on demand.

2 Optimal constants for the median of means estimator.

Recall that we are given an i.i.d. sample X_{1},\ldots,X_{N} from a distribution P with mean \mu and variance \sigma^{2}. The median of means estimator of \mu is constructed as follows: let G_{1}\cup\ldots\cup G_{k}\subseteq[N] be an arbitrary (possibly random but independent of the data) collection of k\leq N/2 disjoint subsets ("blocks") of cardinality \lfloor N/k\rfloor each, \bar{X}_{j}:=\frac{1}{|G_{j}|}\sum_{i\in G_{j}}X_{i}, and

\widehat{\mu}_{\mathrm{MOM}}=\mathrm{med}\left(\bar{X}_{1},\ldots,\bar{X}_{k}\right).
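
For concreteness, the construction translates into a few lines of code. The following is a minimal sketch (Python with NumPy; the function name and the randomized block assignment are our own illustrative choices, not part of the paper):

    import numpy as np

    def median_of_means(x, k, seed=None):
        # Split the sample into k disjoint blocks of size floor(N/k), average
        # within each block, and return the median of the block means. The
        # (random) block assignment is independent of the data, as required.
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        m = len(x) // k                         # block size floor(N/k)
        idx = rng.permutation(len(x))[:k * m]   # discard the remainder
        return np.median(x[idx].reshape(k, m).mean(axis=1))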

It is known (e.g. Lerasle and Oliveira, 2011; Devroye et al., 2016) that \widehat{\mu}_{\mathrm{MOM}} satisfies inequality (1) for t=k and L=8e^{2}. This value of L appears to be overly pessimistic, however: it follows from Theorem 5 in (Minsker, 2019) that if k\to\infty sufficiently slowly so that the bias of \widehat{\mu}_{\mathrm{MOM}} is of order o(N^{-1/2}), then

\sqrt{N}(\widehat{\mu}_{\mathrm{MOM}}-\mu)\xrightarrow{d}N\left(0,\frac{\pi}{2}\sigma^{2}\right) (2)

as k,N/k\to\infty. In particular, if \mathbb{E}|X|^{2+\delta}<\infty for some 0<\delta\leq 1, then k=o\left(N^{\delta/(1+\delta)}\right) suffices for the asymptotic unbiasedness and asymptotic normality to hold. The asymptotic relation (2) suggests that the best value of the constant L in the deviation inequality (1) for the estimator \widehat{\mu}_{\mathrm{MOM}} is \pi+o(1). We will demonstrate that this is indeed the case. Denote

g(m):=\frac{1}{\sqrt{m}}\mathbb{E}\left[\left(\frac{X_{1}-\mu}{\sigma}\right)^{2}\min\left(\left|\frac{X_{1}-\mu}{\sigma}\right|,\sqrt{m}\right)\right]. (3)

Clearly, g(m)\to 0 as m\to\infty for distributions with finite variance. Feller (1968) proved that \sup_{t\in\mathbb{R}}\left|\Phi_{m}(t)-\Phi(t)\right|\leq 6g(m) where \Phi_{m} and \Phi are the distribution functions of \frac{\sum_{j=1}^{m}(X_{j}-\mu)}{\sigma\sqrt{m}} and the standard normal law respectively. It is well known that g(m)\leq C\,\mathbb{E}\left|\frac{X_{1}-\mu}{\sigma}\right|^{q}m^{-(q-2)/2} whenever \mathbb{E}|X_{1}-\mu|^{q}<\infty for some q\in(2,3]. The next result can be viewed as a non-asymptotic analogue of relation (2).
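
Although g(m) rarely admits a closed form, it is easy to approximate numerically, which is useful for gauging the range of t in the results below. The following Monte Carlo sketch (our own illustration; the sampler and function names are not from the paper) estimates g(m) for a standardized Student's t distribution:

    import numpy as np

    def g_monte_carlo(standardized_sampler, m, n_sims=10**6, seed=0):
        # Monte Carlo estimate of g(m) from (3); standardized_sampler(rng, size)
        # must return draws of (X - mu)/sigma.
        rng = np.random.default_rng(seed)
        z = standardized_sampler(rng, n_sims)
        return np.mean(z**2 * np.minimum(np.abs(z), np.sqrt(m))) / np.sqrt(m)

    # Student's t with 3 degrees of freedom, standardized (its variance is 3)
    t3 = lambda rng, size: rng.standard_t(3, size) / np.sqrt(3.0)
    print([round(g_monte_carlo(t3, m), 4) for m in (10, 100, 1000)])  # decays to 0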

Theorem 2.1.

The following bound holds:

\mathbb{P}\left(\left|\sqrt{N}(\widehat{\mu}_{\mathrm{MOM}}-\mu)\right|\geq\sigma\sqrt{t}\right)\leq 2\operatorname{exp}\left(-\frac{t}{\pi}(1+o(1))\right). (4)

Here, o(1) is a function that goes to 0 as k,N/k\to\infty, uniformly over t\in\left[l_{k,N},u_{k,N}\right] for any sequences l_{k,N}\gg k\,g^{2}(N/k) and u_{k,N}\ll k.

Remark 1.
  1.

    Note that the bound of the theorem holds in some range of the confidence parameter (such estimators are often called "multiple-\delta" in the literature, e.g., see Devroye et al. (2016)); however, this range is distribution-dependent. In particular, if \sqrt{k}\,g(N/k)\to 0 as k,N\to\infty, the previous bound holds in the range 1\leq t\ll k, but the function g(\cdot) depends on P and may converge to 0 arbitrarily slowly. Under additional assumptions, more concrete bounds can be deduced: for instance, if \mathbb{E}|X/\sigma|^{2+\varepsilon}<\infty for some 0<\varepsilon\leq 1, the condition \sqrt{k}\,g(N/k)\to 0 is satisfied if k=o\left(N^{\frac{\varepsilon}{1+\varepsilon}}\right) as N\to\infty. In general, by choosing k appropriately, we can construct a version of the median of means estimator that satisfies the required guarantees for any 1\leq t\ll N.

  2.

    The exact expression for the function o(1) appearing in the statement of Theorem 2.1, as well as in other results of the paper (e.g. Theorem 5.1), is not made explicit. We remark that it depends on the distribution of X_{1} through the function g(\cdot) defined in (3), and on the ratios \frac{kg^{2}(N/k)}{l_{k,N}} and \frac{u_{k,N}}{k}.

Proof of Theorem 2.1.

As \widehat{\mu}_{\mathrm{MOM}} is scale-invariant, we can assume without loss of generality that \sigma^{2}=1. Denote m=\lfloor N/k\rfloor for brevity, let \rho(x)=|x|, and note that an equivalent characterization of \widehat{\mu}_{\mathrm{MOM}} is

\widehat{\mu}_{\mathrm{MOM}}\in\operatorname{argmin}_{z\in\mathbb{R}}\sum_{j=1}^{k}\rho\left(\sqrt{m}\left(\bar{X}_{j}-z\right)\right).

The necessary conditions for the minimum of F(z):=\sum_{j=1}^{k}\rho\left(\sqrt{m}\left(\bar{X}_{j}-z\right)\right) imply that 0\in\partial F(\widehat{\mu}_{\mathrm{MOM}}) – the subgradient of F – hence the left derivative F^{\prime}_{-}(\widehat{\mu}_{\mathrm{MOM}})\leq 0. Therefore, if \sqrt{N}\left(\widehat{\mu}_{\mathrm{MOM}}-\mu\right)\geq\sqrt{t} for some t>0, then \widehat{\mu}_{\mathrm{MOM}}\geq\mu+\sqrt{t/N} and, due to F^{\prime}_{-} being nondecreasing, F^{\prime}_{-}\left(\mu+\sqrt{t/N}\right)\leq 0. This implies that

\mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{\mathrm{MOM}}-\mu)\geq\sqrt{t}\right)\leq\mathbb{P}\left(\sum_{j=1}^{k}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right)\geq 0\right)\\ =\mathbb{P}\left(\frac{1}{\sqrt{k}}\sum_{j=1}^{k}\left(\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right)-\mathbb{E}\rho^{\prime}_{-}\right)\geq-\sqrt{k}\,\mathbb{E}\rho^{\prime}_{-}\right) (5)

where we used the shortcut \mathbb{E}\rho^{\prime}_{-} in place of \mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right). Note that

-\sqrt{k}\,\mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right)=-\sqrt{k}\left(1-2\mathbb{P}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\leq 0\right)\right)\\ =2\sqrt{k}\left(\Phi\left(\frac{\sqrt{t}}{\sqrt{k}}\right)-\Phi(0)\right)-2\sqrt{k}\left(\Phi\left(\frac{\sqrt{t}}{\sqrt{k}}\right)-\mathbb{P}\left(\sqrt{m}\left(\bar{X}_{j}-\mu\right)\leq\frac{\sqrt{t}}{\sqrt{k}}\right)\right)\\ \leq 2\sqrt{k}\cdot g(m)+2\sqrt{t}\,\frac{1}{\sqrt{t}/\sqrt{N/m}}\left(\Phi\left(\frac{\sqrt{t}}{\sqrt{N/m}}\right)-\Phi(0)\right). (6)

Since

2\sqrt{t}\,\frac{1}{\sqrt{t}/\sqrt{N/m}}\left(\Phi\left(\frac{\sqrt{t}}{\sqrt{N/m}}\right)-\Phi(0)\right)=2\sqrt{t}\left(\phi(0)+O(t/\sqrt{N/m})\right)\\ =\sqrt{t}\left(\sqrt{\frac{2}{\pi}}+O(t/\sqrt{N/m})\right) (7)

where \phi(t)=\Phi^{\prime}(t), we see that

-\sqrt{k}\,\mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right)\leq 2\sqrt{k}\cdot g(m)+\sqrt{t}\left(\sqrt{\frac{2}{\pi}}+O(\sqrt{t/k})\right)

which is \sqrt{t}\sqrt{\frac{2}{\pi}}\left(1+o(1)\right) whenever t\ll k and t\gg k\,g^{2}(m). It remains to apply Bernstein's inequality to the right-hand side of (5). Observe that

\mathrm{Var}\left(\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{j}-\mu-\sqrt{t/N}\right)\right)\right)=4\mathrm{Var}\left(I\left\{\sqrt{m}\left(\bar{X}_{j}-\mu\right)\leq\sqrt{t/k}\right\}\right)\\ =4\mathbb{P}\left(\sqrt{m}\left(\bar{X}_{j}-\mu\right)\leq\sqrt{t/k}\right)\left(1-\mathbb{P}\left(\sqrt{m}\left(\bar{X}_{j}-\mu\right)\leq\sqrt{t/k}\right)\right)\leq 1,

therefore

\mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{\mathrm{MOM}}-\mu)\geq\sqrt{t}\right)\leq\operatorname{exp}\left(-\frac{t}{\pi(1+o(1))+\frac{2\sqrt{t}\sqrt{2\pi}}{3}\frac{1}{\sqrt{k}}\left(1+o(1)\right)}\right)\\ =\operatorname{exp}\left(-\frac{t}{\pi}(1+o(1))\right)

whenever \sqrt{k}\,g(m)\ll\sqrt{t}\ll\sqrt{k}. Similar reasoning gives a matching bound for \mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{\mathrm{MOM}}-\mu)\leq-\sqrt{t}\right), and the result follows. ∎

One may ask whether the median of means estimator admits a more sample-efficient modification, one that would satisfy inequality (1) with a constant L smaller than \pi. A natural idea is to require that the estimator be invariant with respect to permutations of the data or, equivalently, be a function of the order statistics only. Such an extension of the MOM estimator was proposed by Minsker (2019); however, no provable improvements over the performance of the standard MOM estimator were established rigorously. The question of such improvements, especially of guarantees expressed in the form (1), is addressed next. Let us recall the proposed construction. Assume that 2\leq m<N and, given J\subseteq[N] of cardinality |J|=m, set \bar{X}_{J}:=\frac{1}{m}\sum_{j\in J}X_{j}. Define \mathcal{A}_{N}^{(m)}=\left\{J\subset[N]:\ |J|=m\right\} and

\widehat{\mu}_{N}:=\mathrm{med}\left(\bar{X}_{J},\ J\in\mathcal{A}_{N}^{(m)}\right), (8)

where \left\{\bar{X}_{J},\ J\in\mathcal{A}_{N}^{(m)}\right\} denotes the set of sample averages computed over all possible subsets of [N] of cardinality m; in particular, unlike the standard median-of-means estimator, \widehat{\mu}_{N} is uniquely defined. Note that for m=2, \widehat{\mu}_{N} coincides with the well known Hodges-Lehmann estimator of location (Hodges and Lehmann, 1963). When m is a fixed integer greater than 2, \widehat{\mu}_{N} is known as the generalized Hodges-Lehmann estimator. Its asymptotic properties are well understood and can be deduced from results by Serfling (1984), among other works. For example, its breakdown point is 1-(1/2)^{1/m} and, in the case of normally distributed data, the asymptotic distribution of \sqrt{N}(\widehat{\mu}_{N}-\mu) is centered normal with variance \Delta_{m}^{2}=m\sigma^{2}\arctan\left(\frac{1}{\sqrt{m^{2}-1}}\right). In particular, \Delta_{m}^{2}=\sigma^{2}(1+o(1)) as m\to\infty. When the underlying distribution is not symmetric, however, \widehat{\mu}_{N} is biased for the mean, and the properties of this estimator in the regime m\to\infty have not been investigated in the robust statistics literature (to the best of our knowledge). Only very recently, DiCiccio and Romano (2022) proved that whenever m\to\infty, m=o(\sqrt{N}) and the sample is normally distributed, \sqrt{N}(\widehat{\mu}_{N}-\mu)\to N(0,\sigma^{2}). We will extend this result in several directions: first, by allowing a much wider class of underlying distributions; second, by including the case \sqrt{N}\ll m\ll N, which is interesting as \mathrm{bias}\left(\widehat{\mu}_{N}\right) is o\left(N^{-1/2}\right) in this regime; and finally, by presenting sharp sub-Gaussian deviation inequalities for \widehat{\mu}_{N} that hold for heavy-tailed data.
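
A minimal computational sketch of the estimator (8) follows (Python; both function names are our own). The exact version enumerates all \binom{N}{m} subsets and is feasible only for small N, while the subsampled variant — an incomplete U-quantile, which is our own illustrative shortcut and is not analyzed in the paper — scales to realistic sample sizes:

    import itertools
    import numpy as np

    def mu_hat_exact(x, m):
        # Median of the averages over ALL subsets of cardinality m; the number
        # of subsets is binom(N, m), so this is only usable for small N.
        x = np.asarray(x, dtype=float)
        return np.median([np.mean(c) for c in itertools.combinations(x, m)])

    def mu_hat_sampled(x, m, n_subsets=10**5, seed=0):
        # Approximation via randomly drawn m-subsets (Monte Carlo shortcut).
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        means = [x[rng.choice(len(x), size=m, replace=False)].mean()
                 for _ in range(n_subsets)]
        return np.median(means)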

Let us remark that the argument behind Theorem 2.1, combined with a version of Bernstein's inequality for U-statistics due to Hoeffding (1963), immediately implies that \widehat{\mu}_{N} satisfies relation (4). Similar reasoning applies to other deviation guarantees for the classical median of means estimator that exist in the literature, so in this sense \widehat{\mu}_{N} always performs at least as well as \widehat{\mu}_{\mathrm{MOM}}.

Analysis of the estimator \widehat{\mu}_{N} is most naturally carried out using the language of U-statistics. The following section introduces the necessary background, while additional useful facts are summarized in section 7.1.

3 Asymptotic normality of U-statistics and the implications for \widehat{\mu}_{N}.

Let Y_{1},\ldots,Y_{N} be i.i.d. random variables with distribution P_{Y} and assume that h_{m}:\mathbb{R}^{m}\mapsto\mathbb{R},\ m\geq 1, are square-integrable with respect to P_{Y}^{m} and permutation-symmetric functions, meaning that \mathbb{E}h_{m}^{2}(Y_{1},\ldots,Y_{m})<\infty and h_{m}(x_{\pi(1)},\ldots,x_{\pi(m)})=h_{m}(x_{1},\ldots,x_{m}) for any x_{1},\ldots,x_{m}\in\mathbb{R} and any permutation \pi:[m]\mapsto[m]. Without loss of generality, we will also assume that \mathbb{E}h_{m}:=\mathbb{E}h_{m}(Y_{1},\ldots,Y_{m})=0. Recall that \mathcal{A}_{N}^{(m)}=\left\{J\subseteq[N]:\ |J|=m\right\}. The U-statistic with kernel h_{m} is defined as

U_{N,m}=\frac{1}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}h_{m}(Y_{i},\ i\in J). (9)
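
Definition (9) translates directly into code; the following brute-force sketch (our own, useful only for sanity checks with small N) makes the averaging over subsets explicit:

    import itertools
    import numpy as np

    def u_statistic(y, h, m):
        # Average a symmetric kernel h over all m-subsets of the sample;
        # costs binom(N, m) kernel evaluations, so use small N only.
        y = np.asarray(y, dtype=float)
        return np.mean([h(np.array(c)) for c in itertools.combinations(y, m)])

    # Sanity check: with h = mean, the U-statistic reduces to the sample mean.
    y = np.random.default_rng(0).normal(size=12)
    print(u_statistic(y, np.mean, 3), y.mean())  # agree up to float error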

For i\in[N], let

h_{m}^{(1)}(Y_{i})=\mathbb{E}\left[h_{m}(Y_{1},\ldots,Y_{m})\,|\,Y_{i}\right]. (10)

We will assume that \mathbb{P}\left(h_{m}^{(1)}(Y_{1})\neq 0\right)>0 for all m, meaning that the kernels h_{m} are non-degenerate. The random variable

S_{N,m}:=\sum_{j=1}^{N}\mathbb{E}\left[U_{N,m}\,|\,Y_{j}\right]=\frac{m}{N}\sum_{j=1}^{N}h_{m}^{(1)}(Y_{j}),

known as the Hájek projection of U_{N,m}, is essentially the best approximation of U_{N,m} by a sum of i.i.d. random variables of the form f(Y_{1})+\ldots+f(Y_{N}). We are interested in sufficient conditions guaranteeing that \frac{U_{N,m}-S_{N,m}}{\sqrt{\mathrm{Var}(S_{N,m})}}=o_{P}(1) as N,m\to\infty. Such an asymptotic relation immediately implies that the limiting behavior of U_{N,m} is determined by the Hájek projection S_{N,m}. Results of this type for U-statistics of fixed order m are standard and well known (Hoeffding, 1948; Serfling, 2009; Lee, 2019). However, we are interested in the situation when m is allowed to grow with N, possibly up to the order m=o(N). U-statistics of growing order were studied, for example, by Frees (1989); however, existing results are not readily applicable in our framework. Very recently, such U-statistics have been investigated in relation to the performance of Breiman's random forests algorithm (e.g. see the papers by Song et al. (2019) and Peng et al. (2022)). The following theorem is essentially due to Peng et al. (2022); we give a different proof of this fact in Appendix 7.2, as we rely on parts of the argument elsewhere in the paper.

Theorem 3.1.

Assume that \frac{\mathrm{Var}\left(h_{m}(Y_{1},\ldots,Y_{m})\right)}{\mathrm{Var}\left(h_{m}^{(1)}(Y_{1})\right)}=o(N) as N,m\to\infty. (It is well known (Hoeffding, 1948) that \mathrm{Var}\left(h_{m}^{(1)}(Y_{1})\right)\leq\frac{\mathrm{Var}(h_{m})}{m}, therefore the condition imposed on the ratio of variances implies that m=o(N).) Then \frac{U_{N,m}-S_{N,m}}{\sqrt{\mathrm{Var}(S_{N,m})}}=o_{P}(1) as N,m\to\infty.

It is easy to see that the asymptotic normality of \frac{U_{N,m}}{\sqrt{\mathrm{Var}(S_{N,m})}} immediately follows from the previous theorem whenever its assumptions are satisfied. Next, we will apply this result to establish the asymptotic normality of the estimator \widehat{\mu}_{N} defined via (8).

Corollary 3.1.

Let X_{1},\ldots,X_{N} be i.i.d. with finite variance \sigma^{2}. Moreover, assume that \sqrt{\frac{N}{m}}\,g(m)\to 0 as N/m and m\to\infty. Then

\sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\xrightarrow{d}\mathcal{N}(0,\sigma^{2})

as N/m and m\to\infty.

Remark 2.

The requirement \sqrt{\frac{N}{m}}\,g(m)\to 0 guarantees that \mathrm{bias}(\widehat{\mu}_{N})=o(N^{-1/2}). Without this requirement, asymptotic normality can be established for the debiased estimator \widehat{\mu}_{N}-\mathbb{E}\widehat{\mu}_{N}.

Proof.

Let \rho(x)=|x| and note that an equivalent characterization of \widehat{\mu}_{N} is

\widehat{\mu}_{N}\in\operatorname{argmin}_{z\in\mathbb{R}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\rho\left(\sqrt{m}\left(\bar{X}_{J}-z\right)\right).

The necessary conditions for the minimum of this problem imply, as in (5), that for any fixed t\geq 0,

\mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{N}-\mu)\geq t\right)\leq\mathbb{P}\left(\sum_{J\in\mathcal{A}_{N}^{(m)}}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)\geq 0\right). (11)

Therefore, it suffices to show that the upper and lower bounds for \mathbb{P}\left(\sqrt{N}(\widehat{\mu}_{N}-\mu)\geq t\right) converge to the same limit. To this end, we see that

\mathbb{P}\left(\sum_{J\in\mathcal{A}_{N}^{(m)}}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)\geq 0\right)\\ =\mathbb{P}\left(\frac{\sqrt{N/m}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)-\mathbb{E}\rho^{\prime}_{-}\right)\geq-\sqrt{N/m}\,\mathbb{E}\rho^{\prime}_{-}\right), (12)

where \mathbb{E}\rho^{\prime}_{-} stands for \mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right). As in the proof of Theorem 2.1, we deduce that -\sqrt{N/m}\,\mathbb{E}\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)\to\frac{t}{\sigma}\sqrt{\frac{2}{\pi}} whenever \sqrt{N/m}\,g(m)\to 0 and N/m\to\infty. It remains to analyze the U-statistic

\sqrt{\frac{N}{m}}\,U_{N,m}=\frac{\sqrt{N/m}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)-\mathbb{E}\rho^{\prime}_{-}\right).

As the expression above is invariant with respect to the shift X_{j}\mapsto X_{j}-\mu, we can assume that \mu=0. To complete the proof, we will verify the conditions of Theorem 3.1, allowing one to reduce the asymptotic behavior of U_{N,m} to the analysis of sums of i.i.d. random variables. For i\in[N], let

h^{(1)}(X_{i})=\sqrt{\frac{N}{m}}\,\mathbb{E}\left[\rho^{\prime}_{-}\left(\frac{1}{\sqrt{m}}\sum_{j=1}^{m-1}\tilde{X}_{j}+\frac{X_{i}}{\sqrt{m}}-t/\sqrt{N/m}\right)\,\big|\,X_{i}\right]-\sqrt{\frac{N}{m}}\,\mathbb{E}\rho^{\prime}_{-},

where (\tilde{X}_{1},\ldots,\tilde{X}_{m}) is an independent copy of (X_{1},\ldots,X_{m}). Our goal is to understand the size of \mathrm{Var}(h^{(1)}(X_{1})): specifically, we will show that \mathrm{Var}\left(\frac{m}{\sqrt{N}}h^{(1)}(X_{1})\right)\to\frac{2}{\pi} as both m and N/m\to\infty. Given an integer l\geq 1, let \widetilde{\Phi}_{l}(t) be the cumulative distribution function of \sum_{j=1}^{l}X_{j}. Then

\frac{m}{\sqrt{N}}h^{(1)}(X_{1})=\sqrt{m}\left(2\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-X_{1}\right)-1\right)-\sqrt{m}\,\mathbb{E}\rho^{\prime}_{-}\\ =2\sqrt{m}\left(\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-X_{1}\right)-\mathbb{E}\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-X_{1}\right)\right)\\ =2\sqrt{m}\int_{\mathbb{R}}\left(\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-X_{1}\right)-\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x\right)\right)dP(x).

We will apply the dominated convergence theorem to analyze this expression. Consider first the situation when the distribution of X_{1} is non-lattice. (We say that X_{1} has a lattice distribution if \mathbb{P}\left(X_{1}\in\alpha+k\beta,\ k\in\mathbb{Z}\right)=1 and there is no arithmetic progression A\subset\mathbb{Z} such that \mathbb{P}\left(X_{1}\in\alpha+k\beta,\ k\in A\right)=1.) Then the local limit theorem for non-lattice distributions (Shepp, 1964, Theorem 2) implies that

\widetilde{\Phi}_{m-1}\left(a+h\right)-\widetilde{\Phi}_{m-1}\left(a\right)=\frac{h}{\sqrt{2\pi(m-1)}\,\sigma}\operatorname{exp}\left(-\frac{a^{2}}{2(m-1)\sigma^{2}}\right)+o(m^{-1/2}),

where \sqrt{m}\cdot o(m^{-1/2}) converges to 0 as m\to\infty for every h and uniformly in a. Therefore, we see that conditionally on X_{1} and for every x,

\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x+(x-X_{1})\right)-\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x\right)\\ =\frac{x-X_{1}}{\sqrt{2\pi(m-1)}\,\sigma}\operatorname{exp}\left(-(tm/\sqrt{N}-x)^{2}/2(m-1)\sigma^{2}\right)+o(m^{-1/2}) (13)

uniformly in m. Since m=o(N) by assumption, \operatorname{exp}\left(-(tm/\sqrt{N}-x)^{2}/2(m-1)\sigma^{2}\right)=1+o(1) as m,N\to\infty, hence

2\sqrt{m}\left(\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x+(x-X_{1})\right)-\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x\right)\right)=2\,\frac{x-X_{1}}{\sqrt{2\pi}\,\sigma}+o(1)

P-almost everywhere. Next, we will show that q_{m}(x,X_{1}):=\sqrt{m}\left(\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-X_{1}\right)-\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x\right)\right) admits an integrable majorant that does not depend on m. Note that

|q_{m}(x,X_{1})|\leq\sup_{z}\sqrt{m}\,\mathbb{P}\left(\sum_{j=1}^{m-1}X_{j}\in\big{(}z,z+|x-X_{1}|\big{]}\right)\leq C|x-X_{1}|,

where the last inequality follows from the well known bound for the concentration function (Theorem 2.20 in the book by Petrov, 1995); here, C=C(P)>0 is a constant that may depend on the distribution of X_{1}. We conclude that by the dominated convergence theorem,

\frac{m}{\sqrt{N}}h^{(1)}(X_{1})\to\sqrt{\frac{2}{\pi}}\frac{X_{1}}{\sigma}

as m,N/m\to\infty, P-almost everywhere. As

\left|\frac{m}{\sqrt{N}}h^{(1)}(X_{1})\right|\leq 2\left|\int_{\mathbb{R}}q_{m}(x,X_{1})\,dP(x)\right|\leq C\int_{\mathbb{R}}|x-X_{1}|\,dP(x)

and \mathbb{E}\left(\int_{\mathbb{R}}|x-X_{1}|\,dP(x)\right)^{2}<\infty, a second application of the dominated convergence theorem yields that \mathrm{Var}\left(\frac{m}{\sqrt{N}}h^{(1)}(X_{1})\right)\to\mathrm{Var}\left(\sqrt{\frac{2}{\pi}}\frac{X_{1}}{\sigma}\right)=\frac{2}{\pi} as N/m and m\to\infty.

It remains to consider the case when X_{1} has a lattice distribution. In this case, a version of the local limit theorem (Petrov, 1995) states that

\mathbb{P}\left(\sum_{j=1}^{m-1}X_{j}=(m-1)\alpha+q\beta\right)=\frac{\beta}{\sqrt{2\pi(m-1)}\,\sigma}e^{-\frac{((m-1)\alpha+q\beta)^{2}}{2\sigma^{2}(m-1)}}+o(m^{-1/2})

where the o(m^{-1/2}) term is uniform in q\in\mathbb{Z}. For any y in the interval \big{(}\frac{tm}{\sqrt{N}}-x,\frac{tm}{\sqrt{N}}-x+(x-X_{1})\big{]} of the form y=(m-1)\alpha+q\beta, we have that e^{-\frac{y^{2}}{2\sigma^{2}(m-1)}}=1+o(1) as \frac{m}{N}\to 0. Therefore, similarly to (13), in this case

2\sqrt{m}\left(\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x+(x-X_{1})\right)-\widetilde{\Phi}_{m-1}\left(\frac{tm}{\sqrt{N}}-x\right)\right)=2\,\frac{x-X_{1}}{\sqrt{2\pi}\,\sigma}+o(1)

P-almost everywhere, where we also used the fact that the number of points of the form (m-1)\alpha+q\beta in the interval of interest equals \frac{x-X_{1}}{\beta}. The rest of the proof proceeds exactly as in the case of non-lattice distributions, and this concludes the part of the argument related to \mathrm{Var}\left(\frac{m}{\sqrt{N}}h^{(1)}(X_{1})\right).

To finish the proof, note that, since \|\rho^{\prime}_{-}\|_{\infty}=1, \mathrm{Var}\left(\sqrt{N/m}\,\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)\right)\leq\frac{N}{m}, hence

\frac{\mathrm{Var}\left(\sqrt{N/m}\,\rho^{\prime}_{-}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-tN^{-1/2}\right)\right)\right)}{\mathrm{Var}\left(h^{(1)}(X_{1})\right)}\leq\frac{N/m}{\frac{2}{\pi}(1+o(1))N/m^{2}}=\frac{m}{\frac{2}{\pi}(1+o(1))}=o(N)

as m\to\infty and N/m\to\infty. Therefore, Theorem 3.1 applies and yields that

\frac{\sqrt{\frac{N}{m}}U_{N,m}-\frac{m}{N}\sum_{j=1}^{N}h^{(1)}(X_{j})}{\sqrt{\frac{m^{2}}{N}\mathrm{Var}\left(h^{(1)}(X_{1})\right)}}=o_{P}(1),

where \frac{m^{2}}{N}\mathrm{Var}\left(h^{(1)}(X_{1})\right)=\frac{2}{\pi}(1+o(1)). In view of the central limit theorem, \frac{m}{N}\sum_{j=1}^{N}h^{(1)}(X_{j})\xrightarrow{d}N\left(0,\frac{2}{\pi}\right), and we conclude that \sqrt{\frac{N}{m}}U_{N,m}\xrightarrow{d}N\left(0,\frac{2}{\pi}\right). Recalling (12), we see that

\mathbb{P}\left(\sqrt{\frac{N}{m}}U_{N,m}\geq-\sqrt{\frac{N}{m}}\,\mathbb{E}\rho^{\prime}_{-}\right)\to 1-\Phi\left(\frac{t}{\sigma}\right),

or \limsup\limits_{m,N/m\to\infty}\mathbb{P}\left(\sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\geq t\right)\leq 1-\Phi\left(\frac{t}{\sigma}\right). Repeating the preceding argument for the lower bound for \mathbb{P}\left(\sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\geq t\right), we get that \liminf\limits_{m,N/m\to\infty}\mathbb{P}\left(\sqrt{N}\left(\widehat{\mu}_{N}-\mu\right)\geq t\right)\geq 1-\Phi\left(\frac{t}{\sigma}\right), whence the claim of the corollary follows. ∎

Corollary 3.1 implies that, asymptotically, the estimator \widehat{\mu}_{N} improves upon \widehat{\mu}_{\mathrm{MOM}}. The more interesting, and more difficult, question is whether non-asymptotic sub-Gaussian deviation bounds for \widehat{\mu}_{N} with an improved constant can be established, and what the range of the deviation parameter is in which such bounds are valid.
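
A quick Monte Carlo experiment makes the comparison concrete; this is entirely our own illustration, with arbitrary sample size, block size and distribution. By (2), N\cdot\mathrm{Var}(\widehat{\mu}_{\mathrm{MOM}}) should be close to \frac{\pi}{2}\sigma^{2}, while by Corollary 3.1, N\cdot\mathrm{Var}(\widehat{\mu}_{N}) should be close to \sigma^{2}. Reusing the sketches median_of_means and mu_hat_sampled from above:

    import numpy as np

    rng = np.random.default_rng(1)
    N, k, m, reps = 2000, 40, 50, 500       # block size N/k = 50 matches m
    mom, pim = [], []
    for _ in range(reps):
        x = rng.exponential(size=N)         # asymmetric, mean 1, variance 1
        mom.append(median_of_means(x, k, seed=rng))
        pim.append(mu_hat_sampled(x, m, n_subsets=2000, seed=rng))
    print(N * np.var(mom))  # roughly pi/2 ~ 1.57
    print(N * np.var(pim))  # roughly 1 (slightly inflated by subset sampling)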

4 Deviation inequalities for U-statistics of growing order.

The ultimate goal of this section is to establish a non-asymptotic analogue of Corollary 3.1. Recall that its proof relied on the classical strategy of showing that the higher-order terms in the Hoeffding decomposition of certain U-statistics are asymptotically negligible. To prove the desired non-asymptotic extension, one has to be able to show that these higher-order terms are sufficiently small with exponentially high probability. However, the classical tools used to prove such bounds rely on decoupling inequalities due to de la Pena and Montgomery-Smith (1995). Unfortunately, the constants appearing in decoupling inequalities grow very fast with respect to the order m of the U-statistic, at least like m^{m}. As m is allowed to grow with the sample size N in our examples, such tools become insufficient to get the desired bounds in our framework. Arcones (1995) derived an improved version of Bernstein's inequality for non-degenerate U-statistics where the sub-Gaussian deviation regime is controlled by m\,\mathrm{Var}(h^{(1)}_{m}(X)) defined in equation (10), rather than the larger quantity \mathrm{Var}(h_{m}) appearing in the inequality due to Hoeffding (1963); however, this result is only useful when m is essentially fixed. Maurer (2019) used different techniques that yield improvements over Arcones' result, in particular with respect to the order m; the bounds obtained in this work are non-trivial for m up to the order of N^{1/3}, which does not suffice for the applications required in the present paper. Moreover, unlike Theorem 4.1 below, the results in Maurer (2019) do not capture the correct behavior of degenerate U-statistics. Recently, Song et al. (2019) made significant progress in studying U-statistics of growing order and developed tools that avoid using decoupling inequalities; however, their techniques apply when m=o\left(\sqrt{N}\right), while we only require that m=o(N).

We will be interested in U-statistics with kernels of a special structure that assumes "weak" dependence on each of the individual variables. Let the kernel be centered and written in the form h_{m}\left(\frac{x_{1}}{\sqrt{m}},\ldots,\frac{x_{m}}{\sqrt{m}}\right), whence the corresponding U-statistic is

U_{N,m}=\frac{1}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}h_{m}\left(\frac{X_{i}}{\sqrt{m}},\ i\in J\right). (14)

The Hoeffding decomposition of U_{N,m} is defined as the sum

U_{N,m}=\frac{m}{N}\sum_{j=1}^{N}h_{m}^{(1)}(X_{j})+\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J), (15)

where h^{(j)}_{m}(x_{1},\ldots,x_{j})=(\delta_{x_{1}}-P)\times\ldots\times(\delta_{x_{j}}-P)\times P^{m-j}h_{m}. We refer the reader to section 7.1, where the Hoeffding decomposition and related background material are reviewed in more detail.
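
For intuition, the identity (15) is easy to verify numerically in the simplest non-trivial case m=2 with a finitely supported P, where the conditional expectations defining h^{(1)} and h^{(2)} can be computed exactly. The following self-contained check (our own illustration, with an arbitrary product kernel and support) confirms that the two sides of (15) agree:

    import itertools
    import numpy as np

    support = np.array([0.0, 1.0, 3.0])                  # X uniform on three points
    Eh = np.mean([a * b for a in support for b in support])
    h = lambda a, b: a * b - Eh                          # centered symmetric kernel
    h1 = lambda a: np.mean([h(a, b) for b in support])   # h^(1)(a) = E[h(a, X)]
    h2 = lambda a, b: h(a, b) - h1(a) - h1(b)            # degenerate second-order part

    x = np.random.default_rng(0).choice(support, size=8)
    pairs = list(itertools.combinations(range(len(x)), 2))
    U = np.mean([h(x[i], x[j]) for i, j in pairs])       # U-statistic, m = 2
    hajek = (2.0 / len(x)) * sum(h1(v) for v in x)       # first term of (15)
    deg = np.mean([h2(x[i], x[j]) for i, j in pairs])    # second term of (15)
    print(U, hajek + deg)                                # identical up to float error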

We will assume that U_{N,m} is non-degenerate; in particular, one can expect that the behavior of U_{N,m} is determined by the first term \frac{m}{N}\sum_{j=1}^{N}h_{m}^{(1)}(X_{j}) in the decomposition. In order to make this intuition rigorous, we need to prove that the higher-order terms are of smaller order with exponentially high probability. It is shown in the course of the proof of Theorem 3.1 that \mathrm{Var}\left(\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right)\leq\mathrm{Var}(h_{m})\left(\frac{m}{N}\right)^{j}. However, to achieve our current goal, bounds for the moments of higher order are required. More specifically, the key technical difficulty lies in establishing the correct rate of decay of the higher moments with respect to the order m of the U-statistic. We will show that under suitable assumptions, \mathbb{E}^{1/q}\left|\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|^{q}=O\left(j^{\eta_{1}}q^{\eta_{2}}\left(\frac{m}{N}\right)^{j/2}\right) for some \eta_{1}>0, \eta_{2}>0 and for all q\geq 2, 2\leq j\leq j_{\mathrm{max}} for a sufficiently large j_{\mathrm{max}}. The crucial observation is that the upper bound for the higher-order L_{q} norms is still proportional to \left(\frac{m}{N}\right)^{j/2}, the same as the L_{2} norm. The following result, essentially implied by moment inequalities of this form, is the main technical novelty and a key ingredient needed to control large deviations of the higher-order terms in the Hoeffding decomposition.

Theorem 4.1.

Let

V_{N,j}=\frac{{m\choose j}^{1/2}}{{N\choose j}^{1/2}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}\left(\frac{X_{i}}{\sqrt{m}},\,i\in J\right),\qquad f_{j}(x_{1},\ldots,x_{j})=\mathbb{E}h_{m}\left(\frac{x_{1}}{\sqrt{m}},\ldots,\frac{x_{j}}{\sqrt{m}},\frac{X_{j+1}}{\sqrt{m}},\ldots,\frac{X_{m}}{\sqrt{m}}\right)

and \nu_{k}=\mathbb{E}^{1/k}|X_{1}-\mathbb{E}X_{1}|^{k}. If the kernel h_{m} is uniformly bounded, then there exists an absolute constant c>0 such that

\mathbb{P}\left(|V_{N,j}|\geq t\right)\leq\operatorname{exp}\left(-\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\frac{\left(\frac{t}{\|h_{m}\|_{\infty}}\sqrt{\frac{N}{j}}\right)^{\frac{2}{j+1}}}{c\left(m/j\right)^{\frac{j}{j+1}}}\right)\right)

whenever \min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\frac{\left(\frac{t}{\|h_{m}\|_{\infty}}\sqrt{\frac{N}{j}}\right)^{\frac{2}{j+1}}}{c\left(m/j\right)^{\frac{j}{j+1}}}\right)\geq 2. Alternatively, suppose that

  (i)

    \left\|\partial_{x_{1}}\ldots\partial_{x_{j}}f_{j}\right\|_{\infty}\leq\left(\frac{C_{1}(P)}{m}\right)^{j/2}j^{\gamma_{1}j} for some \gamma_{1}\geq\frac{1}{2};

  (ii)

    \nu_{k}\leq k^{\gamma_{2}}M for all integers k\geq 2 and some \gamma_{2}\geq 0, M>0.

Then there exist constants c_{1}(P),c_{2}(P)>0 that depend on \gamma_{1} and \gamma_{2} only such that

\mathbb{P}\left(|V_{N,j}|\geq t\right)\leq\operatorname{exp}\left(-\min\left(\frac{1}{c_{1}}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(c_{2}Mj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\right) (16)

whenever \min\left(\frac{1}{c_{1}}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(c_{2}Mj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\geq\max\left(2,\frac{\log(N/j)}{\gamma_{2}j}\right). (In the course of the proof, we show that whenever \gamma_{2}=0, corresponding to the case of a.s. bounded X_{1}, inequality (16) is valid for all t>0.)

The proof of the theorem is given in section 7.3. Let us briefly discuss the imposed conditions. The first inequality requires only boundedness of the kernel and follows from a standard argument; it is mostly useful for degenerate kernels of higher order j, for instance when j\geq Cm/\log(m). The main result is the second inequality of the theorem, which provides a much better dependence of the tails on m for small and moderate values of j. Assumption (ii) is a standard one: for instance, it holds with \gamma_{2}=0 for bounded random variables, with \gamma_{2}=1/2 for sub-Gaussian, and with \gamma_{2}=1 for sub-exponential random variables. As for assumption (i), suppose that the kernel h_{m} is sufficiently smooth. In this case,

\partial_{x_{1}}\ldots\partial_{x_{j}}f_{j}(x_{1},\ldots,x_{j})=m^{-j/2}\,\mathbb{E}\left[\left(\partial_{x_{1}}\ldots\partial_{x_{j}}h_{m}\right)\left(\frac{x_{1}}{\sqrt{m}},\ldots,\frac{x_{j}}{\sqrt{m}},\frac{X_{j+1}}{\sqrt{m}},\ldots,\frac{X_{m}}{\sqrt{m}}\right)\right],

which is indeed of order m^{-j/2} with respect to m. However, the functions f_{j} are often smooth even if the kernel h_{m} is not, as we will show later for the case of an indicator function (specifically, we will prove that the required inequalities hold with \gamma_{1}=\frac{1}{2} for all j\ll m/\log(m) under mild assumptions on the distribution of X_{1}). Next, we state a corollary – a deviation inequality that takes a particularly simple form and suffices for most of the applications discussed later. It can be viewed as an extension of Arcones' (1995) version of Bernstein's inequality to the case of U-statistics of growing order.

Corollary 4.1.

Suppose that

  (i)

    the assumptions of Theorem 4.1 hold for all 2\leq j\leq j_{\mathrm{max}} with \gamma_{1}=\frac{1}{2};

  (ii)

    the kernel h_{m} is uniformly bounded;

  (iii)

    \liminf_{m\to\infty}\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)>0;

  (iv)

    mM^{2}=o\left(N^{1-\delta}\right) for some \delta>0.

Moreover, let q(N,m) be an increasing function such that

q(N,m)=o\left(\min\left(\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}},j_{\mathrm{max}}\log(N/m)\right)\right)\text{ as }N/m\to\infty.

Then for all 2\leq t\leq q(N,m),

\mathbb{P}\left(\left|U_{N,m}\right|\geq\sqrt{\frac{tm}{N}}\right)\leq 2\operatorname{exp}\left(-\frac{t}{2(1+o(1))\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)}\right),

where o(1)\to 0 as N/m\to\infty uniformly over 2\leq t\leq q(N,m). If m=o\left(\frac{N^{1/2}}{\log(N)}\right), we can instead choose q(N,m) such that q(N,m)=o\left(\min\left(\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}},\frac{Nj_{\mathrm{max}}}{m^{2}}\right)\right).

Remark 3.

The key point of the inequality is that the sub-Gaussian deviations are controlled by \mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right) rather than the suboptimal quantity \mathrm{Var}(h_{m}) appearing in Hoeffding's version of Bernstein's inequality for U-statistics. Moreover, the range in which U_{N,m} admits sub-Gaussian deviations is much wider compared to the implications of Arcones' inequality when m is allowed to grow with N. Several comments regarding the additional assumptions are in order:

  1.

    The assumption of uniform boundedness of the kernel h_{m} is needed to ensure that we can apply Bernstein's concentration inequality to the first term of the Hoeffding decomposition. This suffices for our purposes, but in general this condition can be relaxed.

  2.

    The assumption on the asymptotic behavior of the variance is made to simplify the statement and the proof; if it does not hold, the result is still valid once the definition of q(N,m) is modified to reflect the different behavior of this quantity. We include the following heuristic argument, which shows that \lim_{m\to\infty}\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right) often admits a simple closed-form expression. Indeed, note that \sqrt{m}\left(h_{m}^{(1)}(X_{1})-h_{m}^{(1)}(0)\right)=\int_{0}^{X_{1}}\sqrt{m}\,\partial_{u}h_{m}^{(1)}(u)\,du. If \left\|\partial^{2}_{u}h_{m}^{(1)}\right\|_{\infty}=o(m^{-1/2}), then

    \sqrt{m}\left|\partial_{u}h_{m}^{(1)}(u)-\partial_{u}h_{m}^{(1)}(0)\right|\leq\sqrt{m}\left\|\partial^{2}_{u}h_{m}^{(1)}\right\|_{\infty}u\to 0

    pointwise as m\to\infty. If the limit \lim_{m\to\infty}\sqrt{m}\,\partial_{u}h_{m}^{(1)}(0) exists, then \sqrt{m}\left(h_{m}^{(1)}(X_{1})-h_{m}^{(1)}(0)\right)\to\lim_{m\to\infty}\sqrt{m}\,\partial_{u}h_{m}^{(1)}(0)\cdot X_{1}, P-almost everywhere. Moreover, as \sqrt{m}\|\partial_{u}h_{m}^{(1)}\|_{\infty} admits an upper bound independent of m by assumption (i) of Theorem 4.1 and X_{1} is sufficiently integrable, Lebesgue's dominated convergence theorem applies and yields that \mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)\to\left(\lim_{m\to\infty}\sqrt{m}\,\partial_{u}h_{m}^{(1)}(0)\right)^{2}\mathrm{Var}(X_{1}). For instance, this heuristic argument can often be made precise for kernels of the form h\left(\sum_{j=1}^{m}\frac{x_{j}}{\sqrt{m}}\right).

  3.

    Finally, the condition requiring that mM^{2}=o\left(N^{1-\delta}\right) is used to ensure that \left(\frac{N}{mM^{2}}\right)^{\tau}\gg\log(m) for any fixed \tau>0, which simplifies the statement and the proof.

Proof.

The union bound together with Hoeffding's decomposition entails that for any t>0 and 0<\varepsilon<1 (to be chosen later),

\mathbb{P}\left(\left|U_{N,m}\right|\geq\sqrt{\frac{tm}{N}}\right)\\ \leq\mathbb{P}\left(\left|\frac{m}{N}\sum_{j=1}^{N}h_{m}^{(1)}(X_{j})\right|\geq(1-\varepsilon)\sqrt{t}\sqrt{\frac{m}{N}}\right)+\mathbb{P}\left(\left|\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\varepsilon\sqrt{t}\sqrt{\frac{m}{N}}\right).

Bernstein’s inequality yields that

\mathbb{P}\left(\left|\frac{m}{N}\sum_{j=1}^{N}h_{m}^{(1)}(X_{j})\right|\geq(1-\varepsilon)\sqrt{t}\sqrt{\frac{m}{N}}\right)\\ \leq 2\operatorname{exp}\left(-\frac{(1-\varepsilon)^{2}\,t/2}{\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)+(1-\varepsilon)\frac{1}{3}\sqrt{\frac{m}{N}}\|h_{m}\|_{\infty}t^{1/2}}\right)\\ =2\operatorname{exp}\left(-\frac{(1-\varepsilon)^{2}\,t}{2\,\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)(1+o(1))}\right)

where o(1)\to 0 as N/m\to\infty uniformly over 2\leq t\leq q(N,m). It remains to control the expression involving the higher-order Hoeffding decomposition terms: specifically, we will show that under our assumptions, it is bounded from above by \operatorname{exp}\left(-\frac{t}{2\,\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)}\right)\cdot o(1) where o(1)\to 0 uniformly over the range of t. To this end, denote t_{\varepsilon}:=\varepsilon^{2}t and j_{\ast}:=\min\left(j_{\mathrm{max}},\lfloor\log(N/m)\rfloor+1\right). Observe that

\mathbb{P}\left(\left|\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\sqrt{t_{\varepsilon}}\sqrt{\frac{m}{N}}\right)\\ \leq\mathbb{P}\left(\left|\sum_{j=2}^{j_{\ast}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\\ +\mathbb{P}\left(\left|\sum_{j=j_{\ast}+1}^{j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\\ +\mathbb{P}\left(\left|\sum_{j>j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right), (17)

where the second sum may be empty depending on the value of j_{\ast}. First, we estimate the last term using Chebyshev's inequality: repeating the reasoning leading to equation (38) in the proof of Theorem 3.1, we see that \mathrm{Var}\left(\sum_{j>j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right)\leq\mathrm{Var}(h_{m})\left(\frac{m}{N}\right)^{j_{\mathrm{max}}+1}\left(1-m/N\right)^{-1}, hence

\mathbb{P}\left(\left|\sum_{j>j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\leq\frac{18\,\mathrm{Var}(h_{m})}{t_{\varepsilon}}\left(\frac{m}{N}\right)^{j_{\mathrm{max}}}\\ =18\,\mathrm{Var}(h_{m})\operatorname{exp}\left(-j_{\mathrm{max}}\log(N/m)+\log(t_{\varepsilon})\right)

whenever N/m\geq 2. Alternatively, we can apply the first inequality of Theorem 4.1 instead of Chebyshev's inequality to each term corresponding to j>j_{\mathrm{max}} individually, with t=t_{j,\varepsilon}:=\frac{\sqrt{t_{\varepsilon}}}{3j^{2}}\left(\frac{N}{m}\right)^{\frac{j-1}{2}}. This implies that

\mathbb{P}\left(\left|\sum_{j>j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\\ \leq\sum_{j>j_{\mathrm{max}}}\mathbb{P}\left(\left|\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{j,\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\\ \leq m\max_{j>j_{\mathrm{max}}}\operatorname{exp}\left(-c\min\left(t_{\varepsilon}^{1/j}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\left(\frac{t_{\varepsilon}}{\|h\|_{\infty}^{2}}\right)^{\frac{1}{j+1}}\left(\frac{Nj}{m^{2}}\right)^{\frac{j}{j+1}}\right)\right).

This bound is useful when \left(\frac{Nj_{\mathrm{max}}}{m^{2}}\right)^{\frac{j_{\mathrm{max}}}{j_{\mathrm{max}}+1}}\gg j_{\mathrm{max}}\log(N/m), which is true whenever m^{2}\ll\frac{N}{\log^{2}(N)}. If moreover \varepsilon\gg\frac{1}{\sqrt{\log(N)}}, then the last probability is bounded from above by

\max_{j>j_{\mathrm{max}}}\operatorname{exp}\left(-c^{\prime}\min\left(t_{\varepsilon}^{1/j}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\left(\frac{t_{\varepsilon}}{\|h\|_{\infty}^{2}}\right)^{\frac{1}{j+1}}\left(\frac{Nj}{m^{2}}\right)^{\frac{j}{j+1}}\right)\right).

To estimate the middle term (the probability involving the terms indexed by j_{\ast}+1\leq j\leq j_{\mathrm{max}}), we apply Theorem 4.1 to each term individually with t=t_{j,\varepsilon}:=\frac{\sqrt{t_{\varepsilon}}}{3j^{2}}\left(\frac{N}{m}\right)^{\frac{j-1}{2}}, keeping in mind that \sum_{j\geq j_{\ast}+1}t_{j,\varepsilon}\leq\frac{\pi^{2}}{18}\left(\frac{N}{m}\right)^{\frac{j-1}{2}}\sqrt{t_{\varepsilon}}. Note that for any 2\leq t\leq\frac{N}{m}, \varepsilon>\frac{m}{N} and j\geq\lfloor\log(N/m)\rfloor+1,

\min\left(\frac{t_{j,\varepsilon}^{\frac{2}{j}}}{c},\left(\frac{t_{j,\varepsilon}\sqrt{N/j}}{\left(cMj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\geq\frac{c_{1}}{M^{\frac{2}{1+2\gamma_{2}}}}\left(\frac{N}{m}\right)^{\frac{1}{1+2\gamma_{2}}},

whence

\mathbb{P}\left(\left|\sum_{j=j_{\ast}+1}^{j_{\mathrm{max}}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)\\ \leq j_{\mathrm{max}}\operatorname{exp}\left(-\frac{c_{1}}{M^{\frac{2}{1+2\gamma_{2}}}}\left(\frac{N}{m}\right)^{\frac{1}{1+2\gamma_{2}}}\right)\leq\operatorname{exp}\left(-\frac{c_{2}}{M^{\frac{2}{1+2\gamma_{2}}}}\left(\frac{N}{m}\right)^{\frac{1}{1+2\gamma_{2}}}\right).

Finally, to estimate the first term in the right side of inequality (17), we again apply Theorem 4.1. With tj,εt_{j,\varepsilon} defined as above,

(|j=2j(mj)(Nj)J𝒜N(j)hm(j)(Xi,iJ)|tε3mN)j=2j(|(mj)(Nj)J𝒜N(j)hm(j)(Xi,iJ)|6π2tj,ε3mN)j=2jexp(cmin(tε1/j(Nm)j1j,tε11+j(1+2γ2)M2j1+j(1+2γ2)(Nm)j1+j(1+2γ2)))jmax2jjexp(cmin(tε1/j(Nm)j1j,tε11+j(1+2γ2)M2j1+j(1+2γ2)(Nm)j1+j(1+2γ2))).{\mathbb{P}{\left(\left|\sum_{j=2}^{j_{\ast}}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{\sqrt{t_{\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)}\\ \leq\sum_{j=2}^{j_{\ast}}\mathbb{P}{\left(\left|\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\frac{6}{\pi^{2}}\frac{\sqrt{t_{j,\varepsilon}}}{3}\sqrt{\frac{m}{N}}\right)}\\ \leq\sum_{j=2}^{j_{\ast}}\operatorname{exp}\left(-c\min\left(t^{1/j}_{\varepsilon}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\frac{t_{\varepsilon}^{\frac{1}{1+j(1+2\gamma_{2})}}}{M^{\frac{2j}{1+j(1+2\gamma_{2})}}}\left(\frac{N}{m}\right)^{\frac{j}{1+j(1+2\gamma_{2})}}\right)\right)\\ \leq j_{\ast}\max_{2\leq j\leq j_{\ast}}\operatorname{exp}\left(-c\min\left(t^{1/j}_{\varepsilon}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\frac{t_{\varepsilon}^{\frac{1}{1+j(1+2\gamma_{2})}}}{M^{\frac{2j}{1+j(1+2\gamma_{2})}}}\left(\frac{N}{m}\right)^{\frac{j}{1+j(1+2\gamma_{2})}}\right)\right).}

Whenever \varepsilon\geq\frac{1}{\sqrt{N/m}}, the last expression is upper bounded by

max2jjexp(c3min(tε1/j(Nm)j1j,tε11+j(1+2γ2)M2j1+j(1+2γ2)(Nm)j1+j(1+2γ2)))\max_{2\leq j\leq j_{\ast}}\operatorname{exp}\left(-c_{3}\min\left(t^{1/j}_{\varepsilon}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\frac{t_{\varepsilon}^{\frac{1}{1+j(1+2\gamma_{2})}}}{M^{\frac{2j}{1+j(1+2\gamma_{2})}}}\left(\frac{N}{m}\right)^{\frac{j}{1+j(1+2\gamma_{2})}}\right)\right)

for c3c_{3} small enough. Combining all the estimates, we obtain the inequality

(|j=2m(mj)(Nj)J𝒜N(j)hm(j)(Xi,iJ)|tεmN)max2jjexp(c3min(tε1/j(Nm)j1j,tε11+j(1+2γ2)(NmM2)j1+j(1+2γ2)))+exp(c2(NmM2)11+2γ2)+c4Var(hm)exp(jmaxlog(N/m)+log(tε)){\mathbb{P}{\left(\left|\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\sqrt{t_{\varepsilon}}\sqrt{\frac{m}{N}}\right)}\leq\\ \max_{2\leq j\leq j_{\ast}}\operatorname{exp}\left(-c_{3}\min\left(t^{1/j}_{\varepsilon}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},t_{\varepsilon}^{\frac{1}{1+j(1+2\gamma_{2})}}\left(\frac{N}{mM^{2}}\right)^{\frac{j}{1+j(1+2\gamma_{2})}}\right)\right)\\ +\operatorname{exp}\left(-c_{2}\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}}\right)+c_{4}\mathrm{Var}(h_{m})\operatorname{exp}\left(-j_{\mathrm{max}}\log(N/m)+\log(t_{\varepsilon})\right)} (18)

that holds if ε1N/m\varepsilon\geq\frac{1}{\sqrt{N/m}} and 2tNm2\leq t\leq\frac{N}{m}. If t<(NmM2)11+2γ2ε4t<\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}}\varepsilon^{4}, then the first two terms on the right-hand side of the previous display are bounded by ectεε3=ectεe^{-\frac{ct_{\varepsilon}}{\varepsilon^{3}}}=e^{-\frac{ct}{\varepsilon}} each, and if t<ε(jmax1)log(N/m)t<\varepsilon(j_{\mathrm{max}}-1)\log(N/m), the same is true for the last term. Therefore, if

t<ε4min((NmM2)11+2γ2,(jmax1)log(N/m)),t<\varepsilon^{4}\min\left(\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}}\,,(j_{\mathrm{max}}-1)\log(N/m)\right),

then

(|j=2m(mj)(Nj)J𝒜N(j)hm(j)(Xi,iJ)|tεmN)3exp(ctε)=exp(t2Var(mhm(1)(X1)))o(1){\mathbb{P}{\left(\left|\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\sum_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(X_{i},\,i\in J)\right|\geq\sqrt{t_{\varepsilon}}\sqrt{\frac{m}{N}}\right)}\\ \leq 3\operatorname{exp}\left(-\frac{ct}{\varepsilon}\right)=\operatorname{exp}\left(-\frac{t}{2\,\mathrm{Var}\left(\sqrt{m}\,h_{m}^{(1)}(X_{1})\right)}\right)\cdot o(1)}

where the last equality holds whenever we choose \varepsilon:=\varepsilon(N,m) such that \varepsilon(N,m)\to 0 as N/m\to\infty. Specifically, take \varepsilon=\left(\frac{q(N,m)}{\min\left(\left(\frac{N}{mM^{2}}\right)^{\frac{1}{1+2\gamma_{2}}}\,,j_{\mathrm{max}}\log(N/m)\right)}\right)^{1/4} where the function q(N,m) was defined in the statement of the corollary, and the conclusion follows immediately. If m^{2}\ll\frac{N}{\log^{2}(N)}, we can replace the last term in equation (18) by

maxj>jmaxexp(cmin(tε1/j(Nm)j1j,(tεh2)1j+1(Njm2)jj+1)),\max_{j>j_{\mathrm{max}}}\operatorname{exp}\left(-c^{\prime}\min\left(t_{\varepsilon}^{1/j}\left(\frac{N}{m}\right)^{\frac{j-1}{j}},\left(\frac{t_{\varepsilon}}{\|h\|_{\infty}^{2}}\right)^{\frac{1}{j+1}}\left(\frac{Nj}{m^{2}}\right)^{\frac{j}{j+1}}\right)\right),

which is bounded by e^{-\frac{ct}{\varepsilon}} whenever t<\frac{Nj_{\mathrm{max}}}{m^{2}}\varepsilon^{4}. The final result in this case follows similarly. ∎

5 Implications for the median of means estimator.

We are going to apply results of the previous section to deduce non-asymptotic bounds for the permutation-invariant version of the median of means estimator. Recall that it was defined as

μ^N:=med(X¯J,J𝒜N(m)).\widehat{\mu}_{N}:=\mbox{med}\left(\bar{X}_{J},\ J\in\mathcal{A}_{N}^{(m)}\right).
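For concreteness, the following minimal sketch (Python; the tiny sample size, the heavy-tailed Student-t data and all parameter choices are our own assumptions made purely for illustration) computes \widehat{\mu}_{N} by brute-force enumeration of all \binom{N}{m} block means:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, m = 12, 3                       # tiny N: brute force needs all C(N, m) block means
X = rng.standard_t(df=3, size=N)   # heavy-tailed sample, purely for illustration

# permutation-invariant median of means: median over ALL m-element subsets
block_means = [X[list(J)].mean() for J in combinations(range(N), m)]
mu_hat = np.median(block_means)
print(mu_hat)
```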
Theorem 5.1.

Assume that X1,,XNX_{1},\ldots,X_{N} are i.i.d. copies of a random variable XX with mean μ\mu and variance σ2\sigma^{2}. Moreover, suppose that

(i)

    the distribution of X1X_{1} is absolutely continuous with respect to the Lebesgue measure on \mathbb{R} with density ϕ1\phi_{1};

(ii)

    the Fourier transform ϕ^1\widehat{\phi}_{1} of the density satisfies the inequality |ϕ^1(x)|C1(1+|x|)δ\left|\widehat{\phi}_{1}(x)\right|\leq\frac{C_{1}}{(1+|x|)^{\delta}} for some positive constants C1C_{1} and δ\delta;

(iii)

    𝔼|(X1μ)/σ|q<\mathbb{E}\left|(X_{1}-\mu)/\sigma\right|^{q}<\infty for some 3+52<q3\frac{3+\sqrt{5}}{2}<q\leq 3;

Then the estimator μ^N\widehat{\mu}_{N} satisfies

(|N(μ^μ)|σt)2exp(t2(1+o(1)))\mathbb{P}{\left(\left|\sqrt{N}(\widehat{\mu}-\mu)\right|\geq\sigma\sqrt{t}\right)}\leq 2\operatorname{exp}\left(-\frac{t}{2(1+o(1))}\right)

where o(1)0o(1)\to 0 as m,N/mm,\,N/m\to\infty uniformly for all t[lN,m,uN,m]t\in\left[l_{N,m},u_{N,m}\right] for any sequences {lN,m},{uN,m}\{l_{N,m}\}\,,\{u_{N,m}\} such that lN,mNmq1l_{N,m}\gg\frac{N}{m^{q-1}} and uN,mNmqq1mlog2(N)u_{N,m}\ll\frac{N}{m^{\frac{q}{q-1}}\vee m\log^{2}(N)}.

Remark 4.
1.

    Let us recall the Riemann-Lebesgue lemma stating that |ϕ^1(x)|0|\widehat{\phi}_{1}(x)|\to 0 as |x||x|\to\infty for any absolutely continuous distribution, so assumption (ii) is rather mild;

2.

    The inequality q>3+52q>\frac{3+\sqrt{5}}{2} assures that lN,ml_{N,m} and uN,mu_{N,m} can be chosen such that lN,muN,ml_{N,m}\ll u_{N,m}.

Proof.

Throughout the course of the proof, we will assume without loss of generality that \sigma^{2}=1; the general case follows by rescaling. Let us also recall that all asymptotic relations are defined in the limit as both m and N/m\to\infty. Note that a direct application of Corollary 4.1 requires the existence of all moments of X_{1}, which is too prohibitive. Therefore, we will first show how to reduce the problem to the case of bounded random variables. Specifically, we want to truncate X_{j}-\mu,\ j=1,\ldots,N in a way that preserves the decay rate of the characteristic function. To this end, let R be a large constant (that will later be specified as an increasing function of m), and define the standard mollifier \kappa(x) via \kappa(x)=\begin{cases}C_{1}\operatorname{exp}\left(-\frac{1}{1-x^{2}}\right),&|x|<1,\\ 0,&|x|\geq 1\end{cases} where C_{1} is chosen so that \int_{\mathbb{R}}\kappa(x)\,dx=1. Moreover, let \chi_{R}(x)=\left(I_{2R}\ast\kappa_{R}\right)(x) be the smooth approximation of the indicator function of the interval [-2R,2R], where I_{2R}(x)=I\{|x|\leq 2R\} and \kappa_{R}(x)=\frac{1}{R}\kappa(x/R); in particular, \chi_{R}(x)=1 for |x|\leq R and \chi_{R}(x)=0 for |x|\geq 3R. Set

ψ(x)=C2ϕ1(x+μ)χR(x)\psi(x)=C_{2}\phi_{1}(x+\mu)\chi_{R}(x)

where C_{2}>0 is such that \int_{\mathbb{R}}\psi(x)\,dx=1. Suppose that Y^{(R)} has the distribution with density \psi, and note that by construction the laws of X_{1}-\mu and Y^{(R)}, conditionally on the events \{|X_{1}-\mu|\leq R\} and \{|Y^{(R)}|\leq R\} respectively, coincide. Therefore, there exists a random variable Z independent of X_{1} such that

Y1(R):={X1μ,|X1μ|R,Z,|X1μ|>RY_{1}^{(R)}:=\begin{cases}X_{1}-\mu,&|X_{1}-\mu|\leq R,\\ Z,&|X_{1}-\mu|>R\end{cases} (19)

also has density ψ\psi. Observe the following properties of Y1(R)Y_{1}^{(R)}: (a) |Y1(R)|3R|Y_{1}^{(R)}|\leq 3R almost surely; (b) 𝔼h(Y1(R))C2𝔼h(X1μ)\mathbb{E}h\left(Y_{1}^{(R)}\right)\leq C_{2}\mathbb{E}h\left(X_{1}-\mu\right) for any nonnegative function hh – indeed, this follows from the inequality ψ(x)C2ϕ1(x+μ)\psi(x)\leq C_{2}\phi_{1}(x+\mu); (c) |𝔼Y1(R)|(1+C2)𝔼|X1μ|qI{|X1μ|>R}Rq1\left|\mathbb{E}Y_{1}^{(R)}\right|\leq(1+C_{2})\frac{\mathbb{E}|X_{1}-\mu|^{q}I\{|X_{1}-\mu|>R\}}{R^{q-1}}. Indeed,

|𝔼Y1(R)|=|𝔼Y1(R)I{|X1μ|R}+𝔼Y1(R)I{|X1μ|>R}|=|𝔼(μX1)I{|X1μ|>R}+𝔼Y1(R)I{|Y1(R)|>R}|𝔼|X1μ|I{|X1μ|>R}+C2𝔼|X1μ|I{|X1μ|>R}{\left|\mathbb{E}Y_{1}^{(R)}\right|=\left|\mathbb{E}Y_{1}^{(R)}I\{|X_{1}-\mu|\leq R\}+\mathbb{E}Y_{1}^{(R)}I\{|X_{1}-\mu|>R\}\right|\\ =\left|\mathbb{E}(\mu-X_{1})I\{|X_{1}-\mu|>R\}+\mathbb{E}Y_{1}^{(R)}I\{|Y_{1}^{(R)}|>R\}\right|\\ \leq\mathbb{E}\left|X_{1}-\mu\right|I\{|X_{1}-\mu|>R\}+C_{2}\mathbb{E}\left|X_{1}-\mu\right|I\{|X_{1}-\mu|>R\}}

where the last bound follows from property (b) for h(x)=|x|I{|x|>R}h(x)=|x|I\{|x|>R\}. It remains to apply Hölder’s and Markov’s inequalities. The final property of Y1(R)Y_{1}^{(R)} is stated in a lemma below and is proven in the appendix.

Lemma 1.

The characteristic function ψ^(x)\widehat{\psi}(x) of Y1(R)Y_{1}^{(R)} satisfies

\left|\widehat{\psi}(x)\right|\leq\frac{C}{(1+|x|)^{\delta}}

for all xx\in\mathbb{R} and a sufficiently large constant CC.
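As an aside, the truncation construction above is straightforward to check numerically. The sketch below (Python; the grid resolution and the choice R=1 are assumptions made purely for illustration) evaluates \chi_{R}=I_{2R}\ast\kappa_{R} by direct Riemann summation and confirms that it equals 1 on [-R,R] and vanishes outside [-3R,3R]:

```python
import numpy as np

# standard mollifier kappa on (-1, 1), normalized numerically so it integrates to 1
y = np.linspace(-1, 1, 4001)
dy = y[1] - y[0]
bump = np.where(np.abs(y) < 1, np.exp(-1.0 / np.clip(1.0 - y**2, 1e-12, None)), 0.0)
C1 = 1.0 / (bump.sum() * dy)

def chi_R(x, R=1.0):
    # chi_R = I_{2R} * kappa_R: smooth cutoff, 1 on [-R, R], 0 outside [-3R, 3R]
    u = R * y                        # kappa_R is supported on [-R, R]
    w = (C1 / R) * bump              # kappa_R(u) = kappa(u / R) / R
    ind = np.abs(x[:, None] - u[None, :]) <= 2 * R
    return (ind * w[None, :]).sum(axis=1) * (R * dy)

xs = np.array([-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0])   # R = 1, illustration only
print(np.round(chi_R(xs), 3))        # approx. [0, 0.5, 1, 1, 1, 0.5, 0]
```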

Define ρ(x)=|x|\rho(x)=|x|. Proceeding as in the proof of Theorem 2.1, we observe that

(N(μ^μ)t)(N/m(Nm)J𝒜N(m)ρ(m(X¯Jμt/N))0).\mathbb{P}{\left(\sqrt{N}(\widehat{\mu}-\mu)\geq\sqrt{t}\right)}\leq\mathbb{P}{\left(\frac{\sqrt{N/m}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-\sqrt{t/N}\right)\right)\geq 0\right)}. (23)

Our next goal is to show that for sufficiently large RR, the U-statistic with kernel ρ\rho^{\prime}_{-} appearing in (23) and evaluated at X1,,XNX_{1},\ldots,X_{N} can be replaced by the U-statistic evaluated over an i.i.d. sample Y1(R),,YN(R)Y_{1}^{(R)},\ldots,Y_{N}^{(R)} where Yj(R)Y_{j}^{(R)} is related to XjX_{j} according to (19). To this end, recall that 𝔼|X1μ|q<\mathbb{E}|X_{1}-\mu|^{q}<\infty, and choose RR as R=cm12(q1)R=cm^{\frac{1}{2(q-1)}} for some c>0c>0. Next, observe that

J𝒜N(m)ρ(m(X¯Jμt/N))=J𝒜N(m)(ρ(m(Y¯J(R)t/N))𝔼ρ,R+𝔼ρ)+J𝒜N(m)(ρ(m(X¯Jμt/N))ρ(m(Y¯J(R)t/N))𝔼ρ+𝔼ρ,R),{\sum_{J\in\mathcal{A}_{N}^{(m)}}\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-\sqrt{t/N}\right)\right)\\ =\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\sqrt{t/N}\right)\right)-\mathbb{E}\rho_{-,R}^{\prime}+\mathbb{E}\rho_{-}^{\prime}\right)\\ +\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-\sqrt{t/N}\right)\right)-\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\sqrt{t/N}\right)\right)-\mathbb{E}\rho_{-}^{\prime}+\mathbb{E}\rho_{-,R}^{\prime}\right),} (27)

where \mathbb{E}\rho_{-}^{\prime}=\mathbb{E}\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-\sqrt{t/N}\right)\right) and \mathbb{E}\rho_{-,R}^{\prime}=\mathbb{E}\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\sqrt{t/N}\right)\right); neither expectation depends on J. It was shown in the proof of Theorem 2.1 that

N/m𝔼ρCkg(m)t(2π+O(tk))=t2π(1+o(1))\sqrt{N/m}\,\mathbb{E}\rho_{-}^{\prime}\leq C\sqrt{k}\cdot g(m)-\sqrt{t}\left(\sqrt{\frac{2}{\pi}}+O\left(\sqrt{\frac{t}{k}}\right)\right)=-\sqrt{t}\sqrt{\frac{2}{\pi}}\left(1+o(1)\right)

whenever t\ll N/m and t\gg\frac{N}{m}\,g^{2}(m). Let us remark that in view of the imposed moment assumptions, g(m)=O\left(m^{-(q-2)/2}\right). Moreover, it follows from Hoeffding's version of Bernstein's inequality for U-statistics (Hoeffding, 1963) that

Nm(Nm)J𝒜N(m)(ρ(m(X¯Jμt/N))ρ(m(Y¯J(R)t/N))𝔼ρ+𝔼ρ,R)2𝔼1/2(ρ(m(X¯[m]μt/N))ρ(m(Y¯[m](R)t/N)))2s16s3mN{\frac{\sqrt{\frac{N}{m}}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{J}-\mu-\sqrt{t/N}\right)\right)-\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\sqrt{t/N}\right)\right)-\mathbb{E}\rho_{-}^{\prime}+\mathbb{E}\rho_{-,R}^{\prime}\right)\\ \leq 2\mathbb{E}^{1/2}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right)-\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right)\right)^{2}\sqrt{s}\bigvee\frac{16s}{3}\sqrt{\frac{m}{N}}}

with probability at least 1es1-e^{-s}. We want to choose s>0s>0 such that t=o(s)t=o(s) and

α(s,R):=2𝔼1/2(ρ(m(X¯[m]μt/N))ρ(m(Y¯[m](R)t/N)))2s16s3mN=o(t){\alpha(s,R):=2\mathbb{E}^{1/2}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right)-\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right)\right)^{2}\sqrt{s}\bigvee\frac{16s}{3}\sqrt{\frac{m}{N}}\\ =o\left(\sqrt{t}\right)} (28)

as m,N/mm,N/m\to\infty. To estimate

Σm2:=𝔼(ρ(m(X¯[m]μt/N))ρ(m(Y¯[m](R)t/N)))2,\Sigma_{m}^{2}:=\mathbb{E}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right)-\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right)\right)^{2},

note that for any a>0a>0, ρ(m(X¯[m]μt/N))=ρ(m(Y¯[m](R)t/N))\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right)=\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right) whenever |m(Y¯[m](R)t/N)|>a/2\left|\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right|>a/2, |m(X¯[m]μt/N)|>a/2\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right|>a/2 and |m(X¯[m]μY¯[m](R))|a\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\bar{Y}^{(R)}_{[m]}\right)\right|\leq a, hence

Σm24((|m(Y¯[m](R)t/N)|a)+(|m(X¯[m]μt/N)|a))+4(|m(X¯[m]μY¯[m](R))|>a).{\Sigma_{m}^{2}\leq 4\left(\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right|\leq a\right)}+\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right|\leq a\right)}\right)\\ +4\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\bar{Y}^{(R)}_{[m]}\right)\right|>a\right)}.}

Up to the additive error term Cg(m)=O(m(q2)/2)Cg(m)=O\left(m^{-(q-2)/2}\right), the distributions of mX¯[m]\sqrt{m}\bar{X}_{[m]} and mY¯[m](R)\sqrt{m}\bar{Y}^{(R)}_{[m]} can be approximated by the normal distribution, hence

(|m(Y¯[m](R)t/N)|a)+(|m(X¯[m]μt/N)|a)C(a+g(m)).\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{Y}^{(R)}_{[m]}-\sqrt{t/N}\right)\right|\leq a\right)}+\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\sqrt{t/N}\right)\right|\leq a\right)}\leq C(a+g(m)).

Moreover,

(|m(X¯[m]μY¯[m](R))|>a)=(|1mj=1mYj(R)I{|Yj(R)|>R}|a)(|1mj=1mYj(R)I{|Yj(R)|>R}𝔼(Yj(R)I{|Yj(R)|>R})|am|𝔼Yj(R)I{|Yj(R)|>R}|)C2𝔼|X1μ|2I{|X1μ|>R}(aC2m|𝔼(X1μ)I{|X1μ|>R}|)2𝔼|X1μ|qI{|X1μ|>R}Rq2(aC2m|𝔼(X1μ)I{|X1μ|>R}|)2{\mathbb{P}{\left(\left|\sqrt{m}\left(\bar{X}_{[m]}-\mu-\bar{Y}^{(R)}_{[m]}\right)\right|>a\right)}=\mathbb{P}{\left(\left|\frac{1}{\sqrt{m}}\sum_{j=1}^{m}Y_{j}^{(R)}I\{|Y_{j}^{(R)}|>R\}\right|\geq a\right)}\\ \leq\mathbb{P}{\left(\left|\frac{1}{\sqrt{m}}\sum_{j=1}^{m}Y_{j}^{(R)}I\{|Y_{j}^{(R)}|>R\}-\mathbb{E}\left(Y_{j}^{(R)}I\{|Y_{j}^{(R)}|>R\}\right)\right|\geq a-\sqrt{m}\left|\mathbb{E}Y_{j}^{(R)}I\{|Y_{j}^{(R)}|>R\}\right|\right)}\\ \leq C_{2}\frac{\mathbb{E}|X_{1}-\mu|^{2}I\{|X_{1}-\mu|>R\}}{\left(a-C_{2}\sqrt{m}\left|\mathbb{E}(X_{1}-\mu)I\{|X_{1}-\mu|>R\}\right|\right)^{2}}\\ \leq\frac{\mathbb{E}|X_{1}-\mu|^{q}I\{|X_{1}-\mu|>R\}}{R^{q-2}\left(a-C_{2}\sqrt{m}\left|\mathbb{E}(X_{1}-\mu)I\{|X_{1}-\mu|>R\}\right|\right)^{2}}} (29)

where we used property (b) of Y1(R)Y_{1}^{(R)} along with Hölder’s and Markov’s inequalities. It is also clear that

m|𝔼(X1μ)I{|X1μ|>R}|m𝔼|X1μ|qI{|X1μ|>R}Rq1,\sqrt{m}\left|\mathbb{E}(X_{1}-\mu)I\{|X_{1}-\mu|>R\}\right|\leq\frac{\sqrt{m}\mathbb{E}|X_{1}-\mu|^{q}I\{|X_{1}-\mu|>R\}}{R^{q-1}},

therefore, for R=cm^{\frac{1}{2(q-1)}} specified before, \sqrt{m}\left|\mathbb{E}(X_{1}-\mu)I\{|X_{1}-\mu|>R\}\right|=o(1). Setting a=2C_{2}\frac{\sqrt{m}\,\mathbb{E}^{1/2}|X_{1}-\mu|^{q}I\{|X_{1}-\mu|>R\}}{R^{q-1}}, one easily checks that the right-hand side in (29) is at most CR^{-(q-2)}=C^{\prime}m^{-\frac{q-2}{2(q-1)}}, whence \Sigma^{2}_{m}=o(1). Therefore, there exists a function o(1) such that setting s=t/o(1) yields the stated goal, namely, that t=o(s) and \alpha(s,R)=o(\sqrt{t}) where \alpha(s,R) was defined in (28). Combined with (27), it implies that

(N(μ^μ)t)o(1)et+(Nm(Nm)J𝒜N(m)(ρ(m(Y¯J(R)t/N))𝔼ρ,R)t2π(1+o(1))).{\mathbb{P}{\left(\sqrt{N}(\widehat{\mu}-\mu)\geq\sqrt{t}\right)}\leq o(1)\cdot e^{-t}\\ +\mathbb{P}{\left(\frac{\sqrt{\frac{N}{m}}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\sqrt{t/N}\right)\right)-\mathbb{E}\rho^{\prime}_{-,R}\right)\geq\sqrt{t}\sqrt{\frac{2}{\pi}}(1+o(1))\right)}.} (30)

Note that the U-statistic in the display above is now a function of bounded random variables, hence we can apply Corollary 4.1 with \gamma_{2}=0. As \|\rho^{\prime}_{-}\|_{\infty}=1, condition (ii) of the corollary holds. Let \sqrt{\frac{m}{N}}\sum_{j=1}^{N}h^{(1)}(Y^{(R)}_{j}) be the first term in the Hoeffding decomposition of the U-statistic

N/m(Nm)J𝒜N(m)(ρ(m(Y¯J(R)𝔼Y1(R)t/N+𝔼Y1(R)))𝔼ρ,R).\frac{\sqrt{N/m}}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}\left(\rho_{-}^{\prime}\left(\sqrt{m}\left(\bar{Y}^{(R)}_{J}-\mathbb{E}Y_{1}^{(R)}-\sqrt{t/N}+\mathbb{E}Y_{1}^{(R)}\right)\right)-\mathbb{E}\rho^{\prime}_{-,R}\right).

Following the lines of the proof of Theorem 3.1 and recalling that m|𝔼Y1(R)|=o(1)\sqrt{m}\left|\mathbb{E}Y_{1}^{(R)}\right|=o(1) in view of property (c) of Y1(R)Y_{1}^{(R)} and the choice of RR, we deduce that

Var(mh(1)(Y1(R)))=2π(1+o(1))\mathrm{Var}\left(\sqrt{m}h^{(1)}(Y^{(R)}_{1})\right)=\frac{2}{\pi}(1+o(1))

where o(1)0o(1)\to 0 as m,N/mm,N/m\to\infty, validating assumption (iii) of the corollary. It remains to verify assumption (i) and specify the value of jmaxj_{\max}. Recall that ρ(x)=I{x0}I{x<0}\rho^{\prime}_{-}(x)=I\{x\geq 0\}-I\{x<0\} and let Y~j(R)\widetilde{Y}_{j}^{(R)} stand for Yj(R)𝔼Yj(R)Y_{j}^{(R)}-\mathbb{E}Y_{j}^{(R)}. The function fj(u1,,uj)f_{j}(u_{1},\ldots,u_{j}) appearing in the statement of Theorem 4.1 can therefore be expressed as

f_{j}(u_{1},\ldots,u_{j})=\mathbb{E}\rho^{\prime}_{-}\left(\frac{1}{\sqrt{m}}\sum_{i=1}^{j}u_{i}+\sqrt{\frac{m-j}{m}}\frac{\sum_{i=j+1}^{m}\widetilde{Y}_{i}^{(R)}}{\sqrt{m-j}}-\sqrt{\frac{tm}{N}}+\sqrt{m}\mathbb{E}Y_{1}^{(R)}\right)\\ =2\Phi_{m-j}\left(\frac{1}{\sqrt{m-j}}\sum_{i=1}^{j}u_{i}-\sqrt{\frac{m}{m-j}}\left(\sqrt{\frac{tm}{N}}+\sqrt{m}\mathbb{E}Y_{1}^{(R)}\right)\right)-1

where for any integer k1k\geq 1, Φk\Phi_{k} stands for the cumulative distribution function of 1kj=1kY~j(R)\frac{1}{\sqrt{k}}\sum_{j=1}^{k}\widetilde{Y}_{j}^{(R)} and ϕk\phi_{k} is the corresponding density function that exists by assumption. Consequently,

\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}(u_{1},\ldots,u_{j})=\frac{2}{(m-j)^{j/2}}\phi^{(j-1)}_{m-j}\left(\frac{1}{\sqrt{m-j}}\sum_{i=1}^{j}u_{i}-\sqrt{\frac{m}{m-j}}\left(\sqrt{\frac{tm}{N}}+\sqrt{m}\mathbb{E}Y_{1}^{(R)}\right)\right).

The following lemma demonstrates that Theorem 4.1 applies with γ1=1/2\gamma_{1}=1/2 and that jmax=mlog(m)o(1)j_{\max}=\frac{m}{\log(m)}\,o(1) in the statement of Corollary 4.1.

Lemma 2.

Let assumptions of Theorem 5.1 hold. Then for mm large enough and j=o(m/logm)j=o(m/\log m),

ϕmj(j1)C(2je)j/2\left\|\phi_{m-j}^{(j-1)}\right\|_{\infty}\leq C\left(\frac{2j}{e}\right)^{j/2}

for a sufficiently large constant C=C(P)C=C(P).

We postpone the proof of this lemma until section 7.6. As all the necessary conditions have been verified, the bound of Corollary 4.1 applies. Recalling that t\gg\frac{N}{m}\,g^{2}(m) and that g(m)\leq C\frac{\mathbb{E}|X_{1}-\mu|^{q}}{m^{(q-2)/2}}, we conclude that the probability on the right-hand side of inequality (30) can be bounded from above by \operatorname{exp}\left(-\frac{t}{2\sigma^{2}(1+o(1))}\right) for all

Nmq1tq(N,m)\frac{N}{m^{q-1}}\ll t\leq q(N,m) (31)

whenever

q(N,m)=min(NmR2,Nmlog2(N))o(1) as N/m.q(N,m)=\min\left(\frac{N}{mR^{2}},\frac{N}{m\log^{2}(N)}\right)\cdot o\left(1\right)\text{ as }N/m\to\infty.

To get the expression for the second term in the minimum above from the bound of the corollary, it suffices to consider the cases m\geq\frac{\sqrt{N}}{\log(N)}o(1) and m\leq\frac{\sqrt{N}}{\log(N)}o(1) separately; we omit the simple algebra. Since R=cm^{\frac{1}{2(q-1)}}, (31) is only possible when q-2>\frac{1}{q-1}, implying the requirement q>\frac{3+\sqrt{5}}{2}. We thus arrive at the final form of the bound,

(N(μ^μ)σt)exp(t2(1+o(1)))\mathbb{P}{\left(\sqrt{N}(\widehat{\mu}-\mu)\geq\sigma\sqrt{t}\right)}\leq\operatorname{exp}\left(-\frac{t}{2(1+o(1))}\right)

which holds uniformly for all \frac{N}{m^{q-1}}\ll t\ll\frac{N}{m^{\frac{q}{q-1}}\vee m\log^{2}(N)}. The argument needed to estimate \mathbb{P}{\left(\sqrt{N}(\widehat{\mu}-\mu)\leq-\sigma\sqrt{t}\right)} is identical. ∎
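To illustrate the constant \frac{2}{\pi}=\mathrm{Var}\left(\sqrt{m}\,h^{(1)}(Y_{1})\right)(1+o(1)) appearing throughout the proof, note that in the Gaussian case the first Hoeffding projection of the sign kernel admits the closed form derived in the display preceding Lemma 2. The following sketch (Python with scipy; standard normal data, t=0 and \mathbb{E}Y_{1}^{(R)}=0 are simplifying assumptions made for illustration) confirms numerically that \mathrm{Var}\left(\sqrt{m}\,h^{(1)}(Y_{1})\right)\to 2/\pi:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
Y = rng.standard_normal(10**6)          # Gaussian model, illustration only
for m in (10, 100, 1000):
    # first Hoeffding projection of the sign kernel at the Gaussian model:
    # h1(y) = E[sign((y + S_{m-1}) / sqrt(m))] = 2 * Phi(y / sqrt(m - 1)) - 1
    h1 = 2 * norm.cdf(Y / np.sqrt(m - 1)) - 1
    print(m, m * h1.var())              # approaches 2 / pi ~ 0.6366
```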

6 Open questions.

Several potentially interesting questions and directions have not been addressed in this paper. We summarize a few of them below.

  • (i)

The first question is related to the assumptions of Theorem 5.1: does the result still hold for distributions possessing only 2+\varepsilon moments? And can the assumptions requiring absolute continuity and a bound on the rate of decay of the characteristic function be dropped? For example, Corollary 3.1 holds for lattice distributions as well.

  • (ii)

    It is known that (Hanson and Wright,, 1971) the sample mean based on i.i.d. observations from the multivariate normal distribution N(μ,Σ)N(\mu,\Sigma) satisfies the inequality

    X¯Nμ2trace(Σ)N+2tΣN\left\|\bar{X}_{N}-\mu\right\|_{2}\leq\sqrt{\frac{\mathrm{trace}(\Sigma)}{N}}+\sqrt{\frac{2t\|\Sigma\|}{N}}

    with probability at least 1et1-e^{-t}. Does there exist an estimator of the mean that achieves this bound (up to o(1)o(1) factors) for the heavy-tailed distributions? Partial results in this direction have been recently obtained by Lee and Valiant, (2022).

  • (iii)

Exact computation of the estimator \widehat{\mu}_{N} is infeasible, as it requires evaluation and sorting of \asymp\left(\frac{N}{m}\right)^{m} sample means. Therefore, it is interesting to understand whether it can be replaced by \mbox{med}\left(\bar{X}_{J},\ J\in\mathcal{B}\right) where \mathcal{B} is a (deterministic or random) subset of \mathcal{A}_{N}^{(m)} of much smaller cardinality, while preserving the deviation guarantees. For instance, it is easy to deduce from results on incomplete U-statistics in section 4.3 of the book by Lee, (2019) combined with the proof of Corollary 3.1 that if \mathcal{B} consists of M subsets selected at random with replacement from \mathcal{A}_{N}^{(m)}, then the asymptotic distribution of \sqrt{N}\left(\mbox{med}\left(\bar{X}_{J},\ J\in\mathcal{B}\right)-\mu\right) is still N(0,\sigma^{2}) as long as M\gg N; a minimal simulation sketch of this randomized variant is given after this list. However, establishing results in the spirit of Theorem 5.1 in this framework appears to be more difficult.
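The promised sketch of the randomized variant follows (Python; the function name incomplete_mom, the Student-t data, the block size m and the number of blocks M are all our own illustrative choices; this is a heuristic illustration rather than the estimator analyzed in Theorem 5.1):

```python
import numpy as np

def incomplete_mom(X, m, M, rng):
    """Median of M block means over blocks drawn at random (with replacement) from A_N^(m)."""
    N = len(X)
    means = np.empty(M)
    for i in range(M):
        J = rng.choice(N, size=m, replace=False)   # one random m-element subset
        means[i] = X[J].mean()
    return np.median(means)

rng = np.random.default_rng(2)
X = rng.standard_t(df=3, size=2_000)               # heavy-tailed sample, illustration only
print(incomplete_mom(X, m=20, M=50_000, rng=rng))  # M >> N, per the discussion above
```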

7 Remaining proofs.

The proofs omitted in the main text are presented in this section.

7.1 Technical tools.

Let us recall the definition of Hoeffding’s decomposition (Hoeffding,, 1948) and closely related concepts that are at the core of many arguments related to U-statistics. Assume that Y1,,YNY_{1},\ldots,Y_{N} are i.i.d. random variables with distribution PYP_{Y}. Recall that 𝒜N(m)={J[N]:|J|=m}\mathcal{A}_{N}^{(m)}=\left\{J\subseteq[N]:\ |J|=m\right\} and that the U-statistic with permutation-symmetric kernel hmh_{m} is defined as

UN,m=1(Nm)J𝒜N(m)hm(Yi,iJ),U_{N,m}=\frac{1}{{N\choose m}}\sum_{J\in\mathcal{A}_{N}^{(m)}}h_{m}(Y_{i},\ i\in J),

where we assume that 𝔼hm=0\mathbb{E}h_{m}=0. Moreover, for j=1,,m,j=1,\ldots,m, define the projections

(πjhm)(y1,,yj):=(δy1PY)××(δyjPY)×PYmjhm.(\pi_{j}h_{m})(y_{1},\ldots,y_{j}):=(\delta_{y_{1}}-P_{Y})\times\ldots\times(\delta_{y_{j}}-P_{Y})\times P_{Y}^{m-j}h_{m}. (32)

For brevity and to ease notation, we will often write hm(j)h_{m}^{(j)} in place of πjhm\pi_{j}h_{m}. The variances of these projections will be denoted by

δj2:=Var(hm(j)(Y1,,Yj)).\delta_{j}^{2}:=\mathrm{Var}\left(h^{(j)}_{m}(Y_{1},\ldots,Y_{j})\right).

In particular, \delta_{m}^{2}=\mathrm{Var}(h_{m}). It is well known (Lee, 2019) that h^{(j)}_{m} can be viewed geometrically as the orthogonal projection of h_{m} onto a particular subspace of L_{2}(P_{Y}^{m}). The kernels h^{(j)}_{m} have the property of complete degeneracy, meaning that \mathbb{E}h^{(j)}_{m}(y_{1},\ldots,y_{j-1},Y_{j})=0 for P_{Y}-almost all y_{1},\ldots,y_{j-1} while h^{(j)}_{m}(Y_{1},\ldots,Y_{j}) is non-zero with positive probability. One can easily check that h_{m}(y_{1},\ldots,y_{m})=\sum_{j=1}^{m}\sum_{J\subseteq[m]:|J|=j}h^{(j)}_{m}(y_{i},\,i\in J); in particular, the partial sum \sum_{j=1}^{k}\sum_{J\subseteq[m]:|J|=j}h^{(j)}_{m}(y_{i},\,i\in J) is the best approximation of h_{m}, in the mean-squared sense, by sums of functions of at most k variables. The Hoeffding decomposition states that (see Hoeffding, (1948), as well as the book by Lee, (2019))

UN,m=j=1m(mj)UN,m(j),U_{N,m}=\sum_{j=1}^{m}{m\choose j}U_{N,m}^{(j)}, (33)

where UN,m(j)U_{N,m}^{(j)} are U-statistics with kernels hm(j)h^{(j)}_{m}, namely UN,m(j):=1(Nj)J𝒜N(j)hm(j)(Yi,iJ)U_{N,m}^{(j)}:=\frac{1}{{N\choose j}}\sum\limits_{J\in\mathcal{A}_{N}^{(j)}}h^{(j)}_{m}(Y_{i},\ i\in J). Moreover, all terms in representation (33) are uncorrelated.
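The decomposition is easy to verify numerically for a small kernel. The sketch below (Python; the kernel h(y_{1},y_{2})=y_{1}y_{2}+y_{1}+y_{2} with standard normal data is our own toy example, for which \pi_{1}h(y)=y and \pi_{2}h(y_{1},y_{2})=y_{1}y_{2}) checks the orthogonality of the projections and the variance identity \mathrm{Var}(h_{m})=\sum_{j}{m\choose j}\delta_{j}^{2} for m=2:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10**6
Y1, Y2 = rng.standard_normal(n), rng.standard_normal(n)

h = Y1 * Y2 + Y1 + Y2            # centered kernel of order m = 2, toy example
h1_a, h1_b = Y1, Y2              # pi_1 h (y) = E[h(y, Y)] = y for this kernel
h2 = Y1 * Y2                     # pi_2 h: completely degenerate part

print(np.allclose(h, h1_a + h1_b + h2))     # h equals the sum of its projections
print(round(np.cov(h1_a, h2)[0, 1], 4))     # projections are uncorrelated (~0)
# Var(h) = C(2,1) * delta_1^2 + delta_2^2 = 2 * 1 + 1 = 3
print(round(h.var(), 3), round(2 * h1_a.var() + h2.var(), 3))
```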

Next, we recall some useful moment bounds, found for instance in the book by de la Pena and Gine, (1999), for the Rademacher chaos variables. Let ε1,,εN\varepsilon_{1},\ldots,\varepsilon_{N} be i.i.d. Rademacher random variables (random signs), {aJ,J𝒜N(l)}\{a_{J},\ J\in\mathcal{A}_{N}^{(l)}\}\subset\mathbb{R}, and Z=J𝒜N(l)aJiJεiZ=\sum_{J\in\mathcal{A}_{N}^{(l)}}a_{J}\prod_{i\in J}\varepsilon_{i}. Here, iJεi=εi1εil\prod_{i\in J}\varepsilon_{i}=\varepsilon_{i_{1}}\cdot\ldots\cdot\varepsilon_{i_{l}} for J={i1,,il}J=\{i_{1},\ldots,i_{l}\}.

Fact 1 (Bonami inequality).

Let σ2(Z)=Var(Z)=J𝒜N(l)aJ2\sigma^{2}(Z)=\mathrm{Var}(Z)=\sum_{J\in\mathcal{A}_{N}^{(l)}}a_{J}^{2}. Then for any q>2q>2,

𝔼|Z|q(q1)ql/2(σ2(Z))q/2.\mathbb{E}|Z|^{q}\leq\left(q-1\right)^{ql/2}\left(\sigma^{2}(Z)\right)^{q/2}.
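A quick Monte Carlo check of this inequality for a chaos of order l=2 (a sketch with randomly drawn coefficients a_{J}; all sizes are assumptions chosen for speed):

```python
import numpy as np

rng = np.random.default_rng(4)
N, l, q, reps = 10, 2, 4.0, 10**5               # illustrative sizes only
pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]   # A_N^(2)
a = rng.standard_normal(len(pairs))
sigma2 = np.sum(a**2)                            # sigma^2(Z) = Var(Z)

eps = rng.choice([-1.0, 1.0], size=(reps, N))    # Rademacher signs
Z = sum(c * eps[:, i] * eps[:, j] for c, (i, j) in zip(a, pairs))
lhs = np.mean(np.abs(Z) ** q)
rhs = (q - 1) ** (q * l / 2) * sigma2 ** (q / 2) # Bonami bound
print(lhs <= rhs, lhs, rhs)
```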

Now we state a version of the symmetrization inequality for completely degenerate U-statistics due to Sherman, (1994); also see the paper by Song et al., (2019) for a modern exposition of the proof. The main feature of this inequality, put forward by Song et al., (2019), is the fact that its proof does not rely on decoupling, and yields constants that do not grow too fast with the order of the U-statistics.

Fact 2.

Let h be a completely degenerate kernel of order l, and let \Phi be a convex, nonnegative, non-decreasing function. Moreover, assume that \varepsilon_{1},\ldots,\varepsilon_{N} are i.i.d. Rademacher random variables. Then

𝔼Φ(1j1<<jlNh(Yj1,,Yjl))𝔼Φ(2l1j1<<jlNεj1εjlh(Yj1,,Yjl)).\mathbb{E}\Phi\left(\sum_{1\leq j_{1}<\ldots<j_{l}\leq N}h(Y_{j_{1}},\ldots,Y_{j_{l}})\right)\leq\mathbb{E}\Phi\left(2^{l}\sum_{1\leq j_{1}<\ldots<j_{l}\leq N}\varepsilon_{j_{1}}\ldots\varepsilon_{j_{l}}h(Y_{j_{1}},\ldots,Y_{j_{l}})\right).

Next is the well-known identity, due to Hoeffding, (1963), that allows one to reduce many problems for non-degenerate U-statistics to the corresponding problems for sums of i.i.d. random variables.

Fact 3.

The following representation holds:

UN,m=1N!πWπ,U_{N,m}=\frac{1}{N!}\sum_{\pi}W_{\pi},

where the sum is over all permutations π:[N][N]\pi:[N]\mapsto[N], and

Wπ=1k(hm(Yπ(1),Yπ(2),,Yπ(m))++hm(Yπ((k1)m+1),Yπ((k1)m+2),,Yπ(km)))W_{\pi}=\frac{1}{k}\left(h_{m}\left(Y_{\pi(1)},Y_{\pi(2)},\ldots,Y_{\pi(m)}\right)+\ldots+h_{m}\left(Y_{\pi((k-1)m+1)},Y_{\pi((k-1)m+2)},\ldots,Y_{\pi(km)}\right)\right)

for k=N/mk=\lfloor N/m\rfloor.
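For small N and m, the representation can be verified exactly by enumerating all permutations; a minimal sketch (Python; the product kernel and the sizes are our own illustrative choices):

```python
import numpy as np
from itertools import combinations, permutations

rng = np.random.default_rng(5)
N, m = 6, 2                                # small enough to enumerate all N! permutations
Y = rng.standard_normal(N)
h = lambda a, b: a * b                     # symmetric kernel, illustration only
k = N // m

U = np.mean([h(Y[i], Y[j]) for i, j in combinations(range(N), 2)])
W = [np.mean([h(Y[p[2 * r]], Y[p[2 * r + 1]]) for r in range(k)])
     for p in permutations(range(N))]      # W_pi over all N! permutations
print(np.isclose(U, np.mean(W)))           # the identity holds exactly
```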

Finally, we state a version of Rosenthal’s inequality for the moments of sums of independent, nonnegative random variables with explicit constants, see (Boucheron et al.,, 2013; Chen et al.,, 2012).

Fact 4.

Let Y1,,YNY_{1},\ldots,Y_{N} be independent random variables such that Yj0Y_{j}\geq 0 with probability 11 for all j[N]j\in[N]. Then for any q1q\geq 1,

(𝔼|j=1NYj|q)1/q((j=1N𝔼Yj)1/2+2eq(𝔼maxj=1,,NYjq)1/2q)2.\left(\mathbb{E}\left|\sum_{j=1}^{N}Y_{j}\right|^{q}\right)^{1/q}\leq\left(\left(\sum_{j=1}^{N}\mathbb{E}Y_{j}\right)^{1/2}+2\sqrt{eq}\left(\mathbb{E}\max_{j=1,\ldots,N}Y_{j}^{q}\right)^{1/2q}\right)^{2}.
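A direct numerical check of Fact 4 (a sketch assuming exponential variables, chosen only because they are nonnegative with easily simulated maxima; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, q, reps = 50, 3.0, 10**5
Y = rng.exponential(size=(reps, N))        # independent, nonnegative; illustration only

lhs = np.mean(Y.sum(axis=1) ** q) ** (1 / q)
max_term = np.mean(Y.max(axis=1) ** q) ** (1 / (2 * q))
rhs = (np.sqrt(N * 1.0) + 2 * np.sqrt(np.e * q) * max_term) ** 2   # sum of E Y_j = N
print(lhs <= rhs, round(lhs, 2), round(rhs, 2))
```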

7.2 Proof of Theorem 3.1.

Recall that S_{N,m}:=\frac{m}{N}\sum_{i=1}^{N}h_{m}^{(1)}(Y_{i}) denotes the first-order term (the Hájek projection) in the Hoeffding decomposition (33) of U_{N,m}. It is easy to verify that

hm(Y1,,Ym)=(δY1PY+PY)××(δYmPY+PY)hm=j=1mJ[m]:|J|=jhm(j)(Yi,iJ){h_{m}(Y_{1},\ldots,Y_{m})=(\delta_{Y_{1}}-P_{Y}+P_{Y})\times\ldots\times(\delta_{Y_{m}}-P_{Y}+P_{Y})h_{m}=\sum_{j=1}^{m}\sum_{J\subseteq[m]:|J|=j}h^{(j)}_{m}(Y_{i},\,i\in J)}

and that the terms in the sum above are mutually orthogonal, yielding that

Var(hm(Y1,,Ym))=j=1m(mj)δj2.\mathrm{Var}\left(h_{m}(Y_{1},\ldots,Y_{m})\right)=\sum_{j=1}^{m}{m\choose j}\delta_{j}^{2}. (34)

Moreover, as a corollary of Hoeffding’s decomposition, one can get the well known identities See Chapters 1.6 and 1.7 in the book by Lee, (2019) for detailed derivations of these facts. The simple but key observation following from equation (34) is that for any j[m]j\in[m], Var(hm)(mj)δj2\mathrm{Var}(h_{m})\geq{m\choose j}\delta_{j}^{2}, or

δj2Var(hm)(mj).\delta_{j}^{2}\leq\frac{\mathrm{Var}(h_{m})}{{m\choose j}}. (35)

Therefore,

\mathrm{Var}(U_{N,m}-S_{N,m})=\sum_{j=2}^{m}\frac{{m\choose j}^{2}}{{N\choose j}}\delta_{j}^{2}\leq\mathrm{Var}(h_{m})\sum_{j=2}^{m}\frac{{m\choose j}}{{N\choose j}}\leq\mathrm{Var}(h_{m})\sum_{j\geq 2}\left(\frac{m}{N}\right)^{j}\\ =\mathrm{Var}(h_{m})\left(\frac{m}{N}\right)^{2}\left(1-m/N\right)^{-1}, (38)

where we used the fact that (mj)(Nj)(mN)j\frac{{m\choose j}}{{N\choose j}}\leq\left(\frac{m}{N}\right)^{j} for mNm\leq N: indeed, the latter easily follows from the identity (mj)(Nj)=m(m1)(mj+1)N(N1)(Nj+1)\frac{{m\choose j}}{{N\choose j}}=\frac{m(m-1)\ldots(m-j+1)}{N(N-1)\ldots(N-j+1)}. It is well known (Hoeffding,, 1948) that Var(h(1)(Y1))Var(hm)m\mathrm{Var}\left(h^{(1)}(Y_{1})\right)\leq\frac{\mathrm{Var}(h_{m})}{m}, therefore the condition Var(hm(Y1,,Ym))Var(hm(1)(Y1))=o(N)\frac{\mathrm{Var}\left(h_{m}(Y_{1},\ldots,Y_{m})\right)}{\mathrm{Var}\left(h_{m}^{(1)}(Y_{1})\right)}=o(N) imposed on the ratio of variances implies that m=o(N)m=o(N). Therefore, for m,Nm,N large enough (so that m/N1/2m/N\leq 1/2),

Var(UN,mSN,m)Var(SN,m)2Var(hm)(mN)2δ12m2/N=2Var(hm)Nδ12=o(1)\frac{\mathrm{Var}(U_{N,m}-S_{N,m})}{\mathrm{Var}(S_{N,m})}\leq 2\frac{\mathrm{Var}(h_{m})\left(\frac{m}{N}\right)^{2}}{\delta_{1}^{2}m^{2}/N}=2\frac{\mathrm{Var}(h_{m})}{N\delta_{1}^{2}}=o(1)

by assumption, yielding that \frac{U_{N,m}-S_{N,m}}{\mathrm{Var}^{1/2}(S_{N,m})}=o_{P}(1) as N,m\to\infty.

7.3 Proof of Theorem 4.1.

We are going to estimate 𝔼|VN,j|q\mathbb{E}|V_{N,j}|^{q} for an arbitrary q>2q>2. It follows from the symmetrization inequality (Fact 2) followed by the moment bound stated in Fact 1 that

𝔼|VN,j|q2jq𝔼X𝔼ε|(mj)1/2(Nj)1/2(i1,,ij)𝒜N(j)εi1εijhm(j)(Xi1,,Xij)|q2jq(q1)jq/2𝔼|(mj)(Nj)(i1,,ij)𝒜N(j)(hm(j)(Xi1,,Xij))2|q/2.{\mathbb{E}|V_{N,j}|^{q}\leq 2^{jq}\,\mathbb{E}_{X}\mathbb{E}_{\varepsilon}\left|\frac{{m\choose j}^{1/2}}{{N\choose j}^{1/2}}\sum_{(i_{1},\ldots,i_{j})\in\mathcal{A}_{N}^{(j)}}\varepsilon_{i_{1}}\ldots\varepsilon_{i_{j}}h^{(j)}_{m}(X_{i_{1}},\ldots,X_{i_{j}})\right|^{q}\\ \leq 2^{jq}(q-1)^{jq/2}\mathbb{E}\left|\frac{{m\choose j}}{{N\choose j}}\sum_{(i_{1},\ldots,i_{j})\in\mathcal{A}_{N}^{(j)}}\left(h^{(j)}_{m}(X_{i_{1}},\ldots,X_{i_{j}})\right)^{2}\right|^{q/2}.}

Next, Hoeffding’s representation of the U-statistic (Fact 3) together with Jensen’s inequality yields that

𝔼|(mj)(Nj)(i1,,ij)𝒜N(j)(hm(j)(Xi1,,Xij))2|q/2𝔼|(mj)N/ji=1N/jWi|q/2,\mathbb{E}\left|\frac{{m\choose j}}{{N\choose j}}\sum_{(i_{1},\ldots,i_{j})\in\mathcal{A}_{N}^{(j)}}\left(h^{(j)}_{m}(X_{i_{1}},\ldots,X_{i_{j}})\right)^{2}\right|^{q/2}\leq\mathbb{E}\left|\frac{{m\choose j}}{\lfloor N/j\rfloor}\sum_{i=1}^{\lfloor N/j\rfloor}W_{i}\right|^{q/2},

where W_{i}:=\left(h^{(j)}_{m}(X_{(i-1)j+1},\ldots,X_{ij})\right)^{2}. We are going to estimate \mathbb{E}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}^{p} in two different ways. First, recall that

hm(j)(x1,,xj):=(πjhm)(x1,,xj)=(δx1PX)××(δxjPX)×PXmjhm.h^{(j)}_{m}(x_{1},\ldots,x_{j}):=(\pi_{j}h_{m})(x_{1},\ldots,x_{j})=(\delta_{x_{1}}-P_{X})\times\ldots\times(\delta_{x_{j}}-P_{X})\times P_{X}^{m-j}h_{m}.

Therefore, (\pi_{j}h_{m})(x_{1},\ldots,x_{j}) is a linear combination of 2^{j} terms of the form \prod_{i\in I}\delta_{x_{i}}\,P_{X}^{m-|I|}\,h_{m}, for all choices of I\subseteq[j]. Consequently, \left|(\pi_{j}h_{m})(x_{1},\ldots,x_{j})\right|^{2}\leq 2^{2j}\|h_{m}\|^{2}_{\infty}, and the same bound also holds (almost surely) for the maximum of the W_{i}'s. Therefore, \mathbb{E}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}^{p}\leq 2^{2jp}\|h_{m}\|^{2p}_{\infty} and \mathbb{E}\left({m\choose j}W_{1}\right)^{p}\leq(2e)^{2jp}\left(\frac{m}{j}\right)^{jp}\|h_{m}\|_{\infty}^{2p}. Moreover, equation (35) in the proof of Theorem 3.1 implies that \mathbb{E}W_{1}\leq\frac{\mathrm{Var}(h_{m})}{{m\choose j}}. Therefore, Rosenthal's inequality for nonnegative random variables (Fact 4) entails that for q\geq 2,

\mathbb{E}\left|\frac{{m\choose j}}{\lfloor N/j\rfloor}\sum_{i=1}^{\lfloor N/j\rfloor}W_{i}\right|^{q/2}\leq C^{q/2}\left(\mathrm{Var}^{q/2}(h_{m})+\left(\frac{q}{2}\right)^{q/2}\left(\frac{j}{N}\right)^{q/2}\mathbb{E}\left({m\choose j}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}\right)^{q/2}\right)\\ \leq C^{q/2}\left(\mathrm{Var}^{q/2}(h_{m})+\left(\frac{q}{2}\right)^{q/2}\left(\frac{j}{N}\right)^{q/2}(2e)^{jq}\left(\frac{m}{j}\right)^{jq/2}\|h_{m}\|_{\infty}^{q}\right)

and

𝔼|VN,j|q(Cq1/2)qj(Varq/2(hm)((qjN)1/2(mj)j/2hm)q).\mathbb{E}|V_{N,j}|^{q}\leq(Cq^{1/2})^{qj}\left(\mathrm{Var}^{q/2}(h_{m})\vee\left(\left(\frac{qj}{N}\right)^{1/2}\left(\frac{m}{j}\right)^{j/2}\|h_{m}\|_{\infty}\right)^{q}\right).

Markov’s inequality therefore yields that

(|VN,j|(C1q)j/2(Var1/2(hm)(qjN)1/2(mj)j/2hm))eq.\mathbb{P}{\left(|V_{N,j}|\geq(C_{1}q)^{j/2}\left(\mathrm{Var}^{1/2}(h_{m})\vee\left(\frac{qj}{N}\right)^{1/2}\left(\frac{m}{j}\right)^{j/2}\|h_{m}\|_{\infty}\right)\right)}\leq e^{-q}.

Let A(q)=(C1q)j/2Var1/2(hm)A(q)=(C_{1}q)^{j/2}\mathrm{Var}^{1/2}(h_{m}) and B(q)=hm(qjN)1/2(C1q1/2(mj)1/2)jB(q)=\|h_{m}\|_{\infty}\left(\frac{qj}{N}\right)^{1/2}\left(C_{1}q^{1/2}\left(\frac{m}{j}\right)^{1/2}\right)^{j}. If t=A(q)B(q)t=A(q)\vee B(q), then q=A1(t)B1(t)q=A^{-1}(t)\wedge B^{-1}(t). We can solve the inequalities explicitly to get, after some algebra, that

\mathbb{P}{\left(|V_{N,j}|\geq t\right)}\leq\operatorname{exp}\left(-\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\frac{\left(\frac{t}{\|h_{m}\|_{\infty}}\sqrt{\frac{N}{j}}\right)^{\frac{2}{j+1}}}{\left(\frac{cm}{j}\right)^{\frac{j}{j+1}}}\right)\right). (39)
Remark 5.

Whenever |X1𝔼X1|M|X_{1}-\mathbb{E}X_{1}|\leq M almost surely, the inequality |(πjhm)(x1,,xj)|2jhm\left|(\pi_{j}h_{m})(x_{1},\ldots,x_{j})\right|\leq 2^{j}\|h_{m}\|_{\infty} can be replaced by the bound |(πjhm)(x1,,xj)|Cuju1fj(2M)j\left|(\pi_{j}h_{m})(x_{1},\ldots,x_{j})\right|\leq C\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}(2M)^{j} which follows from Lemma 3 below. Combined with the assumption stating that uju1fj(C1(P)m)j/2jγ1j\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\leq\left(\frac{C_{1}(P)}{m}\right)^{j/2}j^{\gamma_{1}j}, one easily finds that the resulting concentration inequality reads as follows:

\mathbb{P}{\left(|V_{N,j}|\geq t\right)}\leq\operatorname{exp}\left(-\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(cMj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{j+1}}\right)\right). (40)

This bound holds for all t>0t>0 and is usually sharper than (39).

The bound (39) is useful mainly when \frac{m}{j} is not too large. Now we will present a second way to estimate \mathbb{E}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}^{p} that will yield much better inequalities for small values of j and is valid when X_{1} is not necessarily supported on a bounded interval. The key technical element that we rely on is the following lemma that allows one to control the growth of the moments of W_{1} with respect to m. Define

fj(x1,,xj):=𝔼hm(x1,,xj,Xj+1,,Xm).f_{j}(x_{1},\ldots,x_{j}):=\mathbb{E}h_{m}(x_{1},\ldots,x_{j},X_{j+1},\ldots,X_{m}).
Lemma 3.

Let conditions of the theorem hold and let σ2=Var(X1)\sigma^{2}=\mathrm{Var}(X_{1}). Then there exists C=C(P)>0C=C(P)>0 such that

|(πjhm)(X1,,Xj)|Cuju1fji=1j(|Xi𝔼Xi|+σ)\left|(\pi_{j}h_{m})(X_{1},\ldots,X_{j})\right|\leq C\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\prod_{i=1}^{j}\left(|X_{i}-\mathbb{E}X_{i}|+\sigma\right)

with probability 11. Moreover, for any p>2p>2,

𝔼|(πjhm)(X1,,Xj)|pCpjuju1fjp(𝔼|X1𝔼X1|p)j.\mathbb{E}\left|(\pi_{j}h_{m})(X_{1},\ldots,X_{j})\right|^{p}\leq C^{pj}\,\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|^{p}_{\infty}\left(\mathbb{E}\left|X_{1}-\mathbb{E}X_{1}\right|^{p}\right)^{j}.

The proof of the lemma is outlined in section 7.4. As uju1fj(C1(P)m)j/2jγ1j\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\leq\left(\frac{C_{1}(P)}{m}\right)^{j/2}j^{\gamma_{1}j} by assumption, the second bound of the lemma can be written as

𝔼|(πjhm)(X1,,Xj)|pC2pjmjp/2jγ1pj(𝔼|X1𝔼X1|p)j.\mathbb{E}\left|(\pi_{j}h_{m})(X_{1},\ldots,X_{j})\right|^{p}\leq C_{2}^{pj}m^{-jp/2}j^{\gamma_{1}pj}\left(\mathbb{E}\left|X_{1}-\mathbb{E}X_{1}\right|^{p}\right)^{j}.

Recall that νk=𝔼1/k|X1𝔼X1|k\nu_{k}=\mathbb{E}^{1/k}|X_{1}-\mathbb{E}X_{1}|^{k} and that under the stated assumptions, νkkγ2M\nu_{k}\leq k^{\gamma_{2}}M for all integers k2k\geq 2 and some γ2,M>0\gamma_{2},M>0. Therefore,

𝔼W1pC2pjj2γ1pjmpjν2p2pj(CMjγ1m1/2pγ2)2pj,\mathbb{E}W_{1}^{p}\leq C^{2pj}j^{2\gamma_{1}p\,j}m^{-pj}\nu_{2p}^{2pj}\leq\left(C^{\prime}Mj^{\gamma_{1}}m^{-1/2}p^{\gamma_{2}}\right)^{2pj}, (41)

and consequently 𝔼((mj)W1)p(CMjγ11/2pγ2)2pj\mathbb{E}\left({m\choose j}W_{1}\right)^{p}\leq\left(C^{\prime}Mj^{\gamma_{1}-1/2}p^{\gamma_{2}}\right)^{2pj}. The rest of the argument proceeds in a similar way as before. Recall again that 𝔼W1Var(hm)(mj)\mathbb{E}W_{1}\leq\frac{\mathrm{Var}(h_{m})}{{m\choose j}}. Rosenthal’s inequality for nonnegative random variables (Fact 4) implies that for q2q\geq 2,

\mathbb{E}\left|\frac{{m\choose j}}{\lfloor N/j\rfloor}\sum_{i=1}^{\lfloor N/j\rfloor}W_{i}\right|^{q/2}\leq C^{q/2}\left(\mathrm{Var}^{q/2}(h_{m})+\left(\frac{q}{2}\right)^{q/2}\left(\frac{j}{N}\right)^{q/2}\mathbb{E}\left({m\choose j}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}\right)^{q/2}\right).

With the inequality for \mathbb{E}W_{1}^{p} in hand, the expectation \mathbb{E}\left({m\choose j}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}\right)^{q/2} can be upper bounded in two ways: first, trivially,

\mathbb{E}\left({m\choose j}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}\right)^{q/2}\leq\lfloor N/j\rfloor\,\mathbb{E}\left({m\choose j}W_{1}\right)^{q/2}\leq\lfloor N/j\rfloor\left(C_{1}Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{qj}.

On the other hand, for any identically distributed \xi_{1},\ldots,\xi_{k} and any p>1, \mathbb{E}\max_{i=1,\ldots,k}|\xi_{i}|\leq k^{1/p}\max_{i=1,\ldots,k}\mathbb{E}^{1/p}|\xi_{i}|^{p}. Choosing \xi_{i}={m\choose j}W_{i} and p=\lfloor\log(N/j)\rfloor+1, we obtain the inequality

\mathbb{E}\left({m\choose j}\max_{i=1,\ldots,\lfloor N/j\rfloor}W_{i}\right)^{q/2}\leq\left(\log(N/j)\right)^{\gamma_{2}qj}\left(C_{1}Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{qj}.

The second bound is better for qlog(N/j)γ2jloglog(N/j)q\leq\frac{\log(N/j)}{\gamma_{2}j\log\log(N/j)}, therefore we get an estimate

𝔼|(mj)N/ji=1N/jWi|q/2Cq/2(Varq/2(hm)+(C3j(qjN)1/2(logγ2(N/j)Mjγ11/2qγ2)j)q)\mathbb{E}\left|\frac{{m\choose j}}{\lfloor N/j\rfloor}\sum_{i=1}^{\lfloor N/j\rfloor}W_{i}\right|^{q/2}\leq C^{q/2}\left(\mathrm{Var}^{q/2}(h_{m})+\left(C_{3}^{j}\left(\frac{qj}{N}\right)^{1/2}\left(\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}\right)^{q}\right)

and

𝔼|VN,j|q(Cq1/2)qj(Varq/2(hm)((qjN)1/2(logγ2(N/j)Mjγ11/2qγ2)j)q)\mathbb{E}|V_{N,j}|^{q}\leq(Cq^{1/2})^{qj}\left(\mathrm{Var}^{q/2}(h_{m})\vee\left(\left(\frac{qj}{N}\right)^{1/2}\left(\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}\right)^{q}\right)

that we will use for 2qlog(N/j)γ2j2\leq q\leq\frac{\log(N/j)}{\gamma_{2}j}, while for larger values of qq, (N/j)1/qeγ2j(N/j)^{1/q}\leq e^{\gamma_{2}j} and

𝔼|VN,j|q(Cq1/2)qj(Varq/2(hm)((qjN)1/2(Mjγ11/2qγ2)j)q).\mathbb{E}|V_{N,j}|^{q}\leq(Cq^{1/2})^{qj}\left(\mathrm{Var}^{q/2}(h_{m})\vee\left(\left(\frac{qj}{N}\right)^{1/2}\left(Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}\right)^{q}\right).

Markov’s inequality therefore yields that for small values of qq (that is, whenever 2qlog(N/j)γ2j2\leq q\leq\frac{\log(N/j)}{\gamma_{2}j}),

(|VN,j|(Cq)j/2(Var1/2(hm)(qjN)1/2(logγ2(N/j)Mjγ11/2qγ2)j))eq.\mathbb{P}{\left(|V_{N,j}|\geq(Cq)^{j/2}\left(\mathrm{Var}^{1/2}(h_{m})\vee\left(\frac{qj}{N}\right)^{1/2}\left(\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}\right)\right)}\leq e^{-q}.

Let A(q)=(Cq)^{j/2}\mathrm{Var}^{1/2}(h_{m}) and B(q)=\left(\frac{qj}{N}\right)^{1/2}\left(Cq^{1/2}\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}. If t=A(q)\vee B(q), then q=A^{-1}(t)\wedge B^{-1}(t). Solving these inequalities explicitly, we get, after some algebra, that

\mathbb{P}{\left(|V_{N,j}|\geq t\right)}\leq\operatorname{exp}\left(-\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(c\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\right)

for values of tt satisfying 2min(1c(t2Var(hm))1j,(tN/j(clogγ2(N/j)Mjγ11/2)j)21+j(2γ2+1))log(N/j)γ2j2\leq\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{{\frac{1}{j}}},\left(\frac{t\sqrt{N/j}}{\left(c\log^{\gamma_{2}}(N/j)Mj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\leq\frac{\log(N/j)}{\gamma_{2}j}. Similarly, for qmax(2,log(N/j)γ2j)q\geq\max\left(2,\frac{\log(N/j)}{\gamma_{2}j}\right), the previously established bounds yield that

\mathbb{P}{\left(|V_{N,j}|\geq(Cq)^{j/2}\left(\mathrm{Var}^{1/2}(h_{m})\vee\left(\frac{qj}{N}\right)^{1/2}\left(Mj^{\gamma_{1}-1/2}q^{\gamma_{2}}\right)^{j}\right)\right)}\leq e^{-q},

or equivalently

\mathbb{P}{\left(|V_{N,j}|\geq t\right)}\leq\operatorname{exp}\left(-\min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(cMj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\right) (42)

whenever \min\left(\frac{1}{c}\left(\frac{t^{2}}{\mathrm{Var}(h_{m})}\right)^{\frac{1}{j}},\left(\frac{t\sqrt{N/j}}{\left(cMj^{\gamma_{1}-1/2}\right)^{j}}\right)^{\frac{2}{1+j(2\gamma_{2}+1)}}\right)\geq\max\left(2,\frac{\log(N/j)}{\gamma_{2}j}\right). The combination of inequalities (39) and (42) yields the final result.

7.4 Proof of Lemma 3.

Recall that fj(x1,,xj)=𝔼hm(x1m,,xjm,Xj+1m,,Xmm)f_{j}(x_{1},\ldots,x_{j})=\mathbb{E}h_{m}\left(\frac{x_{1}}{\sqrt{m}},\ldots,\frac{x_{j}}{\sqrt{m}},\frac{X_{j+1}}{\sqrt{m}},\ldots,\frac{X_{m}}{\sqrt{m}}\right) where j<mj<m. It is easy to see from the definition of πj\pi_{j} that (πjh)(x1,,xj)=(πjfj)(x1,,xj)(\pi_{j}h)(x_{1},\ldots,x_{j})=(\pi_{j}f_{j})(x_{1},\ldots,x_{j}). Next, observe that for any function g:j1g:\mathbb{R}^{j-1}\mapsto\mathbb{R} of j1j-1 variables such that 𝔼g2(X1,,Xj1)<,\mathbb{E}g^{2}(X_{1},\ldots,X_{j-1})<\infty, πjg=0\pi_{j}g=0 Pj1P^{j-1}-almost everywhere. Indeed, this follows immediately from the definition (32) of the operator πj\pi_{j} since gg is a constant when viewed as a function of yjy_{j}. Based on this fact, it is easy to see that for any constant aa\in\mathbb{R}, fj(x1,,xj)f_{j}(x_{1},\ldots,x_{j}) and fj(x1,,xj)fj|x1=a(x2,,xj)f_{j}(x_{1},\ldots,x_{j})-f_{j}|_{x_{1}=a}(x_{2},\ldots,x_{j}), where fj|x1=a(x2,,xj):=fj(a,x2,,xj)f_{j}|_{x_{1}=a}(x_{2},\ldots,x_{j}):=f_{j}(a,x_{2},\ldots,x_{j}), are mapped to the same function by πj\pi_{j}. In particular, (πjh)(x1,,xj)=(πj(fjfj|x1=a))(x1,,xj)(\pi_{j}h)(x_{1},\ldots,x_{j})=\left(\pi_{j}(f_{j}-f_{j}|_{x_{1}=a})\right)(x_{1},\ldots,x_{j}). Moreover,

fj(x1,,xj)fj|x1=a(x2,,xj)=ax1u1fj(u1,x2,,xj)du1f_{j}(x_{1},\ldots,x_{j})-f_{j}|_{x_{1}=a}(x_{2},\ldots,x_{j})=\int_{a}^{x_{1}}\partial_{u_{1}}f_{j}(u_{1},x_{2},\ldots,x_{j})du_{1}

Next, we repeat the same argument with fjf_{j} replaced by

fj,2(x2,,xj;u1):=u1fj(u1,x2,,xj)f_{j,2}(x_{2},\ldots,x_{j};u_{1}):=\partial_{u_{1}}f_{j}(u_{1},x_{2},\ldots,x_{j})

and noting that

f_{j,2}(x_{2},\ldots,x_{j};u_{1})-f_{j,2}|_{x_{2}=a}(x_{3},\ldots,x_{j};u_{1})=\int_{a}^{x_{2}}\partial_{u_{2}}f_{j,2}(u_{2},x_{3},\ldots,x_{j};u_{1})du_{2}.

The expression ax1fj,2|x2=a(x3,,xj;u1)du1\int_{a}^{x_{1}}f_{j,2}|_{x_{2}=a}(x_{3},\ldots,x_{j};u_{1})du_{1} is a function of j1j-1 variables, hence πj\pi_{j} maps it to 0 so that

(πjhm)(x1,,xj)=πj(ax1ax2u2fj,2(u2,x3,,xj;u1)du2du1).(\pi_{j}h_{m})(x_{1},\ldots,x_{j})=\pi_{j}\left(\int_{a}^{x_{1}}\int_{a}^{x_{2}}\partial_{u_{2}}f_{j,2}(u_{2},x_{3},\ldots,x_{j};u_{1})du_{2}du_{1}\right).

Iterating this process, we arrive at the expression

(πjhm)(x1,,xj)=πj(ax1axjuju1fj(u1,,uj)dujdu1).(\pi_{j}h_{m})(x_{1},\ldots,x_{j})=\pi_{j}\left(\int_{a}^{x_{1}}\ldots\int_{a}^{x_{j}}\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}(u_{1},\ldots,u_{j})du_{j}\ldots du_{1}\right). (43)

Next, observe that for any function gg of jj variables,

(πjg)(x1,,xj)=(δx1PX)××(δxjPX)g=𝔼X~[(δx1δX~1)××(δxjδX~j)g],(\pi_{j}g)(x_{1},\ldots,x_{j})=(\delta_{x_{1}}-P_{X})\times\ldots\times(\delta_{x_{j}}-P_{X})g=\mathbb{E}_{\tilde{X}}\left[(\delta_{x_{1}}-\delta_{\tilde{X}_{1}})\times\ldots\times(\delta_{x_{j}}-\delta_{\tilde{X}_{j}})g\right],

where X~1,,X~j\tilde{X}_{1},\ldots,\tilde{X}_{j} are i.i.d. with the same law as XX, and independent from X1,,XNX_{1},\ldots,X_{N}. Therefore, (πjhm)(x1,,xj)(\pi_{j}h_{m})(x_{1},\ldots,x_{j}) is a linear combination of 2j2^{j} terms of the form 𝔼X~(iIδxijIcδX~jg)\mathbb{E}_{\tilde{X}}\left(\prod_{i\in I}\delta_{x_{i}}\prod_{j\in I^{c}}\delta_{\tilde{X}_{j}}\,g\right), for all choices of I[j]I\subseteq[j] and

g(x1,,xj)=ax1axjuju1fj(u1,,uj)dujdu1.g(x_{1},\ldots,x_{j})=\int_{a}^{x_{1}}\ldots\int_{a}^{x_{j}}\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}(u_{1},\ldots,u_{j})du_{j}\ldots du_{1}.

Take a:=𝔼X1a:=\mathbb{E}X_{1}, and note that

\left|(\pi_{j}h_{m})(x_{1},\ldots,x_{j})\right|\leq\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\sum_{I\subseteq[j]}\prod_{i\in I}|x_{i}-a|\prod_{i\in I^{c}}\mathbb{E}|\tilde{X}_{i}-a|\\ \leq\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\sum_{I\subseteq[j]}\prod_{i\in I}|x_{i}-a|\cdot\sigma^{|I^{c}|}=\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|_{\infty}\prod_{i=1}^{j}\left(|x_{i}-\mathbb{E}X_{1}|+\sigma\right).

The first claim of the lemma follows. To deduce the moment bound, observe that since X1,,Xj,X~1,,X~jX_{1},\ldots,X_{j},\tilde{X}_{1},\ldots,\tilde{X}_{j} are i.i.d. and in view of convexity of the function x|x|px\mapsto|x|^{p} for p1p\geq 1,

𝔼|(πjhm)(X1,,Xj)|p2(p1)j𝔼|aX1aXjuju1fj(u1,,uj)dujdu1|p2(p1)juju1fjp𝔼|(X1𝔼X1)(Xj𝔼Xj)|p.{\mathbb{E}\left|(\pi_{j}h_{m})(X_{1},\ldots,X_{j})\right|^{p}\leq 2^{(p-1)j}\mathbb{E}\left|\int_{a}^{X_{1}}\ldots\int_{a}^{X_{j}}\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}(u_{1},\ldots,u_{j})\,du_{j}\ldots du_{1}\right|^{p}\\ \leq 2^{(p-1)j}\left\|\partial_{u_{j}}\ldots\partial_{u_{1}}f_{j}\right\|^{p}_{\infty}\mathbb{E}\left|(X_{1}-\mathbb{E}X_{1})\ldots(X_{j}-\mathbb{E}X_{j})\right|^{p}.}

Here, as before, a=\mathbb{E}X_{1}; the second claim of the lemma follows.

7.5 Proof of Lemma 1.

As ψ(x)\psi(x) is integrable, its Fourier transform equals C2ϕ^1χ^RC_{2}\widehat{\phi}_{1}\ast\widehat{\chi}_{R}, while χ^R=κ^RI^2R\widehat{\chi}_{R}=\widehat{\kappa}_{R}\cdot\widehat{I}_{2R}. It is well known (e.g. Johnson,, 2015) that κ^(x)C3e|x|\widehat{\kappa}(x)\leq C_{3}e^{-\sqrt{|x|}}, hence κ^R(x)=κ^(Rx)C3eR|x|\widehat{\kappa}_{R}(x)=\widehat{\kappa}(Rx)\leq C_{3}e^{-\sqrt{R|x|}}. Moreover, I^2R(x)=sin(2Rx)x\widehat{I}_{2R}(x)=\frac{\sin(2Rx)}{x}. Therefore, for |x||x| large enough,

\left|\widehat{\psi}(x)\right|=C_{2}\left|\int_{\mathbb{R}}\widehat{\phi}_{1}(x-y)\widehat{\chi}_{R}(y)dy\right|\\ \leq C_{2}\left(\Bigg{|}\int\limits_{y:|y-x|\geq|x|/2}\widehat{\phi}_{1}(x-y)\widehat{\chi}_{R}(y)dy\Bigg{|}+\Bigg{|}\int\limits_{y:|y-x|<|x|/2}\widehat{\phi}_{1}(x-y)\widehat{\chi}_{R}(y)dy\Bigg{|}\right).

To estimate the first integral, note that \left|\widehat{\phi}_{1}(x-y)\right|\leq\frac{C_{1}}{(1+|x|/2)^{\delta}}\leq\frac{C_{1}2^{\delta}}{(1+|x|)^{\delta}} whenever |y-x|\geq|x|/2 and that \left|\widehat{I}_{2R}(x)\right|\leq 2R, implying that

\Bigg{|}\int\limits_{y:|y-x|\geq|x|/2}\widehat{\phi}_{1}(x-y)\widehat{\chi}_{R}(y)dy\Bigg{|}\leq\frac{C_{4}}{(1+|x|)^{\delta}}\int_{\mathbb{R}}e^{-\sqrt{R|y|}}d(Ry)=\frac{C_{5}}{(1+|x|)^{\delta}}.

On the other hand,

\Bigg{|}\int\limits_{y:|y-x|<|x|/2}\widehat{\phi}_{1}(x-y)\widehat{\chi}_{R}(y)dy\Bigg{|}\leq C_{6}\left|\int_{x-|x|/2}^{x+|x|/2}e^{-\sqrt{R|y|}}\frac{\sin(2Ry)}{y}dy\right|\\ \leq C_{7}\int_{R|x|/2}^{3R|x|/2}e^{-\sqrt{z}}dz\leq C_{8}e^{-\sqrt{R|x|/2}}\sqrt{R|x|}.

Clearly, the last expression is smaller than C9(1+|x|)δ\frac{C_{9}}{(1+|x|)^{\delta}}, implying the desired result.

7.6 Proof of Lemma 2.

The proof proceeds using the standard Fourier-analytic tools. Let ϕ^1:=[ϕ1]\widehat{\phi}_{1}:=\mathcal{F}[\phi_{1}] be the Fourier transform of ϕ1\phi_{1}, whence [ϕmj](t)=(ϕ^1(tmj))mj\mathcal{F}\left[\phi_{m-j}\right](t)=\left(\widehat{\phi}_{1}\left(\frac{t}{\sqrt{m-j}}\right)\right)^{m-j}. Therefore,

ϕmj(j1)(t)=12πexp(itx)(ix)j1(ϕ^1(xmj))mj𝑑x\phi_{m-j}^{(j-1)}(t)=\frac{1}{2\pi}\int_{\mathbb{R}}\operatorname{exp}\left(-itx\right)(ix)^{j-1}\left(\widehat{\phi}_{1}\left(\frac{x}{\sqrt{m-j}}\right)\right)^{m-j}dx

and \left\|\phi_{m-j}^{(j-1)}\right\|_{\infty}\leq\frac{1}{2\pi}\int_{\mathbb{R}}|x|^{j-1}\left|\widehat{\phi}_{1}\left(\frac{x}{\sqrt{m-j}}\right)\right|^{m-j}dx=\frac{(m-j)^{j/2}}{2\pi}\int_{\mathbb{R}}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx. As \left|\widehat{\phi}_{1}(x)\right|\leq\frac{C_{1}}{(1+|x|)^{\delta}} by assumption, the integral is finite when \delta(m-j)>j (in particular, this inequality holds when m is large enough and j=o(m) as m\to\infty). To get an explicit bound, we will estimate the integral over [-\eta,\eta] and \mathbb{R}\setminus[-\eta,\eta] separately, for a specific choice of \eta>0. To this end, observe that \widehat{\phi}_{1}(x)=\psi_{\sigma}(x)+o(x^{2}) as x\to 0, where \psi_{\sigma}(x)=\operatorname{exp}\left(-\frac{\sigma^{2}x^{2}}{2}\right) is the characteristic function of the normal law N(0,\sigma^{2}). Therefore, there exists \eta>0 such that for all |x|\leq\eta, \left|\widehat{\phi}_{1}(x)\right|\leq\operatorname{exp}\left(-\frac{\sigma^{2}x^{2}}{4}\right), and

(mj)j/2ηη|x|j1|ϕ^1(x)|mj𝑑x(mj)j/2|x|j1exp(σ2x2(mj)4)𝑑x=|y|j1exp(σ2y24)𝑑y=2jσjΓ(j2){(m-j)^{j/2}\int_{-\eta}^{\eta}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx\leq(m-j)^{j/2}\int_{\mathbb{R}}|x|^{j-1}\operatorname{exp}\left(-\frac{\sigma^{2}x^{2}(m-j)}{4}\right)dx\\ =\int_{\mathbb{R}}|y|^{j-1}\operatorname{exp}\left(-\frac{\sigma^{2}y^{2}}{4}\right)dy=\frac{2^{j}}{\sigma^{j}}\Gamma\left(\frac{j}{2}\right)}

where we used the exact expression for the absolute moments of the normal distribution. As Γ(x+1)C22πx(xe)x\Gamma(x+1)\leq C_{2}\sqrt{2\pi x}\left(\frac{x}{e}\right)^{x} for all x1x\geq 1 and an absolute constant C2C_{2} large enough, 2jσjΓ(j2)C2σj(2je)j/2\frac{2^{j}}{\sigma^{j}}\Gamma\left(\frac{j}{2}\right)\leq\frac{C_{2}}{\sigma^{j}}\left(\frac{2j}{e}\right)^{j/2}. At the same time,

(mj)j/2[η,η]|x|j1|ϕ^1(x)|mj𝑑x=(mj)j/2[(2C1)2/δ,(2C1)2/δ]|x|j1|ϕ^1(x)|mj𝑑x+(mj)j/2[(2C1)2/δ,(2C1)2/δ][η,η]|x|j1|ϕ^1(x)|mj𝑑x{(m-j)^{j/2}\int_{\mathbb{R}\setminus[-\eta,\eta]}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx=(m-j)^{j/2}\int_{\mathbb{R}\setminus[-(2C_{1})^{2/\delta},(2C_{1})^{2/\delta}]}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx\\ +(m-j)^{j/2}\int_{[-(2C_{1})^{2/\delta},(2C_{1})^{2/\delta}]\setminus[-\eta,\eta]}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx}

where C11C_{1}\geq 1 is a constant such that |ϕ^1(x)|C1(1+|x|)δ\left|\widehat{\phi}_{1}(x)\right|\leq\frac{C_{1}}{(1+|x|)^{\delta}}. The first term can be estimated via

(mj)j/2[(2C1)2/δ,(2C1)2/δ]|x|j1|ϕ^1(x)|mj𝑑xC1mj(mj)j/2[(2C1)2/δ,(2C1)2/δ]|x|j1(1+|x|)δ(mj)𝑑x2C1mj(mj)j/2δ(mj)j1(2C1)2(mj)2j/δ.{(m-j)^{j/2}\int_{\mathbb{R}\setminus[-(2C_{1})^{2/\delta},(2C_{1})^{2/\delta}]}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx\\ \leq C_{1}^{m-j}(m-j)^{j/2}\int_{\mathbb{R}\setminus[-(2C_{1})^{2/\delta},(2C_{1})^{2/\delta}]}\frac{|x|^{j-1}}{(1+|x|)^{\delta(m-j)}}dx\\ \leq\frac{2C_{1}^{m-j}(m-j)^{j/2}}{\delta(m-j)-j}\frac{1}{(2C_{1})^{2(m-j)-2j/\delta}}.}

Whenever m>2j+2j/δm>2j+2j/\delta, we can bound the last expression from above by C3mj/22mC_{3}m^{j/2}2^{-m}. Finally, as sup|x|>η|ϕ^1(x)|1γ\sup_{|x|>\eta}|\widehat{\phi}_{1}(x)|\leq 1-\gamma for some 0<γ<10<\gamma<1,

(mj)j/2[(2C1)2/δ,(2C1)2/δ][η,η]|x|j1|ϕ^1(x)|mj𝑑x2(mj)j/2(1γ)mj(2C1)2j/δj.(m-j)^{j/2}\int_{[-(2C_{1})^{2/\delta},(2C_{1})^{2/\delta}]\setminus[-\eta,\eta]}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx\leq 2(m-j)^{j/2}(1-\gamma)^{m-j}\frac{(2C_{1})^{2j/\delta}}{j}.

Putting the estimates together, we deduce that

ϕmj(j1)(mj)j/22π|x|j1|ϕ^1(x)|mj𝑑xC2σj(2je)j/2+C3mj/22m+C4((2C1)4/δm)j/2(1γ)mj.{\left\|\phi_{m-j}^{(j-1)}\right\|_{\infty}\leq\frac{(m-j)^{j/2}}{2\pi}\int_{\mathbb{R}}|x|^{j-1}\left|\widehat{\phi}_{1}(x)\right|^{m-j}dx\\ \leq\frac{C_{2}}{\sigma^{j}}\left(\frac{2j}{e}\right)^{j/2}+C_{3}m^{j/2}2^{-m}+C_{4}\left((2C_{1})^{4/\delta}m\right)^{j/2}(1-\gamma)^{m-j}.}

Whenever j=o(m/logm)j=o(m/\log m), the last two terms in the sum above are negligible so that for mm large enough,

ϕmj(j1)C5σj(2je)j/2,\left\|\phi_{m-j}^{(j-1)}\right\|_{\infty}\leq\frac{C_{5}}{\sigma^{j}}\left(\frac{2j}{e}\right)^{j/2},

as claimed.
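As a sanity check, in the Gaussian case \widehat{\phi}_{1}(x)=e^{-x^{2}/2} the Fourier bound above can be evaluated numerically; the sketch below (Python with scipy; the choices of m and j are assumptions made for illustration) confirms that it stays below a constant multiple of \left(\frac{2j}{e}\right)^{j/2}:

```python
import numpy as np
from scipy.integrate import quad

m = 200                                   # illustrative value of m
for j in (2, 4, 8, 16):
    # (m-j)^{j/2} / (2 pi) * int |x|^{j-1} |phihat(x)|^{m-j} dx,
    # with phihat(x) = exp(-x^2 / 2) (standard normal, illustration only)
    val, _ = quad(lambda x: np.abs(x) ** (j - 1) * np.exp(-(m - j) * x**2 / 2),
                  -np.inf, np.inf)
    bound = (m - j) ** (j / 2) / (2 * np.pi) * val
    print(j, round(bound, 4), round((2 * j / np.e) ** (j / 2), 4))
```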

References

  • Alon et al., (1996) Alon, N., Matias, Y., and Szegedy, M. (1996). The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20–29. ACM.
• Arcones, (1995) Arcones, M. A. (1995). A Bernstein-type inequality for U-statistics and U-processes. Statistics & Probability Letters, 22(3):239–247.
  • Boucheron et al., (2013) Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.
  • Catoni, (2012) Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 48, pages 1148–1185. Institut Henri Poincaré.
  • Chen et al., (2012) Chen, R. Y., Gittens, A., and Tropp, J. A. (2012). The masked sample covariance estimator: an analysis using matrix concentration inequalities. Information and Inference, page ias001.
  • de la Pena and Gine, (1999) de la Pena, V. and Gine, E. (1999). Decoupling: From dependence to independence. Springer-Verlag, New York.
  • de la Pena and Montgomery-Smith, (1995) de la Pena, V. and Montgomery-Smith, S. J. (1995). Decoupling inequalities for the tail probabilities of multivariate U-statistics. Annals of Probability, 23(2):806–816.
  • Devroye et al., (2016) Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725.
  • DiCiccio and Romano, (2022) DiCiccio, C. and Romano, J. (2022). CLT for U-statistics with growing dimension. Statistica Sinica, 32:1–22.
  • Feller, (1968) Feller, W. (1968). On the Berry-Esseen theorem. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 10(3):261–268.
  • Frees, (1989) Frees, E. W. (1989). Infinite order U-statistics. Scandinavian Journal of Statistics, pages 29–45.
  • Hanson and Wright, (1971) Hanson, D. L. and Wright, F. T. (1971). A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083.
  • Hodges and Lehmann, (1963) Hodges, J. L. and Lehmann, E. L. (1963). Estimates of location based on rank tests. The Annals of Mathematical Statistics, pages 598–611.
  • Hoeffding, (1948) Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, pages 293–325.
  • Hoeffding, (1963) Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.
  • Jerrum et al., (1986) Jerrum, M. R., Valiant, L. G., and Vazirani, V. V. (1986). Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188.
  • Johnson, (2015) Johnson, S. G. (2015). Saddle-point integration of $C_{\infty}$ “bump” functions. arXiv preprint arXiv:1508.04376.
  • Lee, (2019) Lee, A. J. (2019). U-statistics: Theory and Practice. Routledge.
  • Lee and Valiant, (2020) Lee, J. C. and Valiant, P. (2020). Optimal sub-Gaussian mean estimation in $\mathbf{R}$. arXiv preprint arXiv:2011.08384.
  • Lee and Valiant, (2022) Lee, J. C. and Valiant, P. (2022). Optimal sub-Gaussian mean estimation in very high dimensions. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
  • Lerasle and Oliveira, (2011) Lerasle, M. and Oliveira, R. I. (2011). Robust empirical mean estimators. arXiv preprint arXiv:1112.3914.
  • Lugosi and Mendelson, (2019) Lugosi, G. and Mendelson, S. (2019). Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190.
  • Maurer, (2019) Maurer, A. (2019). A Bernstein-type inequality for functions of bounded interaction. Bernoulli, 25(2):1451–1471.
  • Minsker, (2019) Minsker, S. (2019). Distributed statistical estimation and rates of convergence in normal approximation. Electronic Journal of Statistics, 13(2):5213–5252.
  • Minsker and Ndaoud, (2021) Minsker, S. and Ndaoud, M. (2021). Robust and efficient mean estimation: an approach based on the properties of self-normalized sums. Electronic Journal of Statistics, 15(2):6036–6070.
  • Nemirovski and Yudin, (1983) Nemirovski, A. and Yudin, D. (1983). Problem complexity and method efficiency in optimization. John Wiley & Sons Inc.
  • Peng et al., (2022) Peng, W., Coleman, T., and Mentch, L. (2022). Rates of convergence for random forests via generalized U-statistics. Electronic Journal of Statistics, 16(1):232–292.
  • Petrov, (1975) Petrov, V. V. (1975). Sums of Independent Random Variables. Springer Berlin Heidelberg.
  • Petrov, (1995) Petrov, V. V. (1995). Limit theorems of probability theory: sequences of independent random variables. Oxford University Press, New York.
  • Serfling, (1984) Serfling, R. J. (1984). Generalized L-, M-, and R-statistics. The Annals of Statistics, pages 76–86.
  • Serfling, (2009) Serfling, R. J. (2009). Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons.
  • Shepp, (1964) Shepp, L. A. (1964). A local limit theorem. The Annals of Mathematical Statistics, 35(1):419–423.
  • Sherman, (1994) Sherman, R. P. (1994). Maximal inequalities for degenerate U-processes with applications to optimization estimators. The Annals of Statistics, pages 439–459.
  • Song et al., (2019) Song, Y., Chen, X., and Kato, K. (2019). Approximating high-dimensional infinite-order U-statistics: Statistical and computational guarantees. Electronic Journal of Statistics, 13(2):4794–4848.