
Binomial Mixture Model With U-shape Constraint

Yuting Ye (yeyt@berkeley.edu), Biostatistics Division, UC Berkeley; Peter J. Bickel (bickelp@berkeley.edu), Department of Statistics, UC Berkeley
Abstract

In this article, we study the binomial mixture model under the regime where the binomial size $m$ can be relatively large compared to the sample size $n$. This project is motivated by the GeneFishing method [18], whose output is a combination of the parameter of interest and the subsampling noise. To tackle the noise in the output, we utilize the observation that the density of the output has a U shape and model the output with the binomial mixture model under a U-shape constraint. We first analyze the estimation of the underlying distribution $F$ in the binomial mixture model under various conditions for $F$. Equipped with these theoretical understandings, we propose a simple method, Ucut, to identify the cutoffs of the U shape and recover the underlying distribution based on the Grenander estimator [8]. It is shown that when $m=\Omega(n^{\frac{2}{3}})$, the identified cutoffs converge at the rate $\mathcal{O}(n^{-\frac{1}{3}})$. The $L_1$ distance between the recovered distribution and the true one decreases at the same rate. To demonstrate the performance, we apply our method to a variety of simulation studies, a GTEX dataset used in [18] and a single-cell dataset from Tabula Muris.

1 Introduction

The binomial mixture model has received much attention since the late 1960s. In the field of performance evaluation, [19, 20, 32] utilized this model to address the problem of psychological testing. [38] used a two-component binomial mixture distribution to model the individual differences in children’s performance on a classification task. [9] employed a binomial finite mixture to model the number of credits gained by freshmen during the first year at the School of Economics of the University of Florence. In addition, the binomial mixture model is commonly applied to analyze population survey data in ecology. [30, 16, 27, 44] estimate absolute abundance while accounting for imperfect detection using binomial detection models. The binomial mixture model was also used to estimate bird and bat fatalities at wind power facilities [25].

Formally, we say $X$ is a random variable that follows a binomial mixture model if

X\sim\int\text{Binomial}(m,s)\,dQ(m,s) \qquad (1.1)

where $Q(\cdot,\cdot)$ is a bivariate measure of the binomial size $m$ and the success probability $s$ on $\mathbb{N}\times[0,1]$. In the field of population survey in ecology, $m$ can be modeled as a Poisson or negative binomial random variable while $s$ can be modeled as a beta random variable, linked to a linear combination of additional covariates by a logistic function [30, 16, 27, 44, 25]. Such models are always identifiable thanks to the parametric assumptions. In the field of performance evaluation, $m$ is always known and usually small (restricted by the intrinsic nature of the problem), thus (1.1) is reduced to

\left\{\begin{array}{l}s\sim F,\\ X\,|\,s\sim\text{Binomial}(m,s),\end{array}\right. \qquad (1.2)

where $F$ is a probability distribution on $[0,1]$. For instance, $m$ refers to the number of questions of a psychological test in [19, 20, 32]. When the univariate probability distribution $F$ corresponds to a finite point mass function (pmf), (1.2) is a binomial finite mixture model as in [38, 9]; when $F$ corresponds to a density (either parametric or nonparametric), (1.2) is a hierarchical binomial mixture model as in [19, 20, 32]. Such models suffer from an unidentifiability issue: only the first $m$ moments of $F$ can be identified even with unlimited sample size [36]; more identifiability details are discussed in Appendix A.

In this study, we are interested in a regime unlike that for performance evaluation, where $m$ is known and small, or that for population surveys in ecology, where $m$ is unknown. We study (1.2) with a known $m$ that can be as large as the order of $n$, a regime which has not been investigated before in the literature. This study is motivated by the GeneFishing method [18], which is proposed to take care of extreme inhomogeneity and low signal-to-noise ratio in high-dimensional data. This method employs subsampling and aggregation techniques to evaluate the functional relevance of a candidate gene to a set of bait genes (the bait genes are pre-selected by experts concerning certain biological processes). Briefly speaking, it clusters a subsample of candidate genes together with the bait genes repetitively. One candidate gene is thought of as functioning similarly to the bait genes if the former and the latter are always grouped together during the clustering procedure. More details about GeneFishing can be found in Appendix B.

Mathematically, suppose there are $n$ objects (e.g., genes) and $m$ binomial trials (e.g., the number of clustering repetitions, which can be manually tuned). For each object, we assume it has a success probability $s_i$ (e.g., a rate reflecting how relevant one candidate gene is to the bait genes) that is i.i.d. sampled from an underlying distribution $F$. Then, an observation $X_i$ (e.g., the number of times one candidate gene and the bait genes are grouped together) or $\hat{s}_i:=X_i/m$ (denoted as the capture frequency ratio, or CFR, in GeneFishing) is independently generated from the binomial mixture model (1.2). A discussion of the independence assumption is given in Appendix B.
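For concreteness, the following minimal Python sketch simulates data from Model (1.2); the Beta/uniform mixture used for $F$ and the values of $n$ and $m$ are purely assumptions of the sketch, not quantities from the paper.

import numpy as np

rng = np.random.default_rng(0)
n, m = 5000, 300          # sample size and binomial size (illustrative values)

def sample_s(size):
    """Draw s_i from an illustrative U-shaped F."""
    u = rng.uniform(size=size)
    return np.where(u < 0.6, rng.beta(0.5, 8, size),       # mass piled near 0
           np.where(u < 0.8, rng.uniform(0.2, 0.8, size),  # flat middle
                    rng.beta(8, 0.5, size)))               # mass piled near 1

s = sample_s(n)                  # latent success probabilities s_i ~ F
X = rng.binomial(m, s)           # X_i | s_i ~ Binomial(m, s_i)
s_hat = X / m                    # capture frequency ratios (CFRs)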

Our goal is to estimate and infer the underlying distribution $F$. When $m$ is sufficiently large, the binomial randomness plays little role, and there is no difference between $\hat{s}_i$ and $s_i$ for estimation or inference purposes. In other words, allowing a large $m$ alleviates the identifiability issue of the binomial mixture model. Therefore, a question naturally arises as follows.

  • Q1

    For the binomial mixture model (1.2), what is the minimal binomial size $m$ such that distribution estimators can work as if there were no binomial randomness?

We can reformulate Q1 in a more explicit way:

  • Q1’

    Considering that the binomial mixture model (1.2) is trivially identifiable when $m=\infty$, is there a minimal $m$ such that we can have an “acceptable” estimator of $F$ under various conditions:

    • General $F$ without additional conditions.

    • $F$ with a density.

    • $F$ with a continuously differentiable density.

    • $F$ with a monotone density.

If the above question can be answered satisfactorily, we can cast our attention back to the GeneFishing example. In [18], there remains an open question of how large $\hat{s}_i$ should be so that the associated gene is labeled as a “discovery”. We have an important observation that can be leveraged towards this end: the histograms of the $\hat{s}_i$ appear in a U shape; see Figure 1. Hence, the second question we want to answer is

  • Q2

    Suppose the underlying distribution $F$ has a U shape; how do we make decisions based on the data generated from the binomial mixture model?

Figure 1: Histograms of the CFRs obtained by applying GeneFishing to different tissues: (a) Liver, (b) Colon Transverse, (c) Artery Coronary, (d) Testis.

1.1 Main contributions

Our contributions are twofold, corresponding to the two questions raised above. One tackles the identifiability issue for the binomial mixture model when $m$ can be relatively large compared to $n$. The other answers the motivating question: how to select the cutoff for the output of GeneFishing.

1.1.1 New results for large $m$ in the binomial mixture model

Based on the results of [36] and [42], the only hope of solving, or at least alleviating, the identifiability issue for arbitrary mixtures of binomial distributions is to use a large $m$. When $n$ is sufficiently large, we show that regardless of the identifiability of the model (1.2), we can find an estimator of $F$, according to the conditions on $F$, such that the estimator lies in a ball of radius $r(m)$ around $F$ in terms of some metric such as the $L_1$ distance or the Kolmogorov distance, where $r(m)$ is a decreasing function of $m$. Specifically,

  • [Corollary of Theorem 1] For general $F$, if the $L_p$ distance is used, then $r(m)=\frac{C_{1}}{m^{\frac{1}{2p}}}$ for the empirical cumulative distribution function, where $C_1$ is a positive constant that depends on $p$.

  • [Corollary of Theorem 5] If $F$ has a bounded density and the Kolmogorov distance is used, then $r(m)=\frac{C_{2}}{\sqrt{m}}$ for the empirical cumulative distribution function, where $C_2$ only depends on the maximal value of the density of $F$.

  • [Corollary of Theorem 8] If $F$ has a density whose derivative is absolutely continuous, and the truncated integrated $L_2$ distance is used, then $r(m)=\frac{C_{3}}{\sqrt{m}}$ for the histogram estimator, where $C_3$ only depends on the density of $F$.

  • [Corollary of Theorem 10] If $F$ has a bounded monotone density and the $L_1$ distance is used, then $r(m)=\frac{C_{4}}{\sqrt{m}}$ for the Grenander estimator, where $C_4$ only depends on the density of $F$.

1.1.2 The cutoff selection for GeneFishing

To model the CFRs generated by GeneFishing, we use the binomial mixture model with the U-shape constraint, under the regime where the binomial size $m$ can be relatively large compared to the sample size $n$. With the theoretical understandings mentioned in Section 1.1.1, we propose a simple method, Ucut, to identify the cutoffs of the U shape and recover the underlying distribution based on the Grenander estimator. It is shown that when $m=\Omega(n^{\frac{2}{3}})$, the identified cutoffs converge at the rate $\mathcal{O}(n^{-\frac{1}{3}})$. The $L_1$ distance between the recovered distribution and the true one decreases at the same rate. We also show that the estimated cutoff is larger than the true cutoff with high probability. The performance of our method is demonstrated with a variety of simulation studies, a GTEX dataset used in [18] and a single-cell dataset from Tabula Muris.

1.2 Outline

The rest of the paper is organized as follows. Section 2 introduces the notation used throughout the paper. To answer Q1 (or more specifically, Q1’), Section 3 analyzes the estimation and inference of the underlying distribution $F$ in the binomial mixture model (1.2), under various conditions imposed on $F$. Equipped with these analysis tools, we cast our attention back to the GeneFishing method to answer Q2. Section 4 first introduces a model with a U-shape constraint, called BMU, to model the output of GeneFishing. The cutoffs of our interest are also introduced in this model. Then, in the same section, we propose a non-parametric maximum likelihood estimation (NPMLE) method, Ucut, based on the Grenander estimator to identify the cutoffs and estimate the underlying U-shape density. We also provide a theoretical analysis of the estimator. Next, we apply Ucut to several synthetic datasets and two real datasets in Section 5. All the detailed proofs of the theorems mentioned in previous sections are in the Appendix.

2 Notation

Denote by $F$ a probability distribution. Let $s\sim F$ and $m\cdot\hat{s}\,|\,s\sim\text{Binomial}(m,s)$, where $m$ is a positive integer; see Model (1.2) for details. If there is no confusion, we also use $F$ to represent the associated cumulative distribution function (CDF), i.e., $F(x)=\mathbb{P}[s\leq x]$. Denote by $F^{(m)}$ the binomial mixture CDF of $\hat{s}$, i.e., $F^{(m)}(x)=\mathbb{P}[\hat{s}\leq x]$.

Suppose there are $n$ samples independently generated from $F$, i.e., $s_1,\ldots,s_n$. Correspondingly, we have $m\cdot\hat{s}_i\,|\,s_i\sim\text{Binomial}(m,s_i)$ independently. Denote by $F_n$ and $F_{n,m}$ the empirical CDFs of the $s_i$'s and the $\hat{s}_i$'s, respectively. Specifically,

F_{n}(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}[s_{i}\leq x]\quad\text{and}\quad F_{n,m}(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}[\hat{s}_{i}\leq x].
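In code, these two empirical CDFs can be computed with a small helper (a sketch; `s` and `s_hat` denote arrays of the $s_i$'s and $\hat{s}_i$'s, e.g., as generated in the sketch of Section 1):

import numpy as np

def ecdf(samples):
    """Return the empirical CDF x -> (1/n) * #{samples <= x}."""
    sorted_samples = np.sort(np.asarray(samples, dtype=float))
    return lambda x: np.searchsorted(sorted_samples, x, side="right") / len(sorted_samples)

F_n = ecdf(s)        # empirical CDF of the latent s_i
F_nm = ecdf(s_hat)   # empirical CDF of the observed CFRs s_hat_i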

If $F$ has a density, define the Grenander estimator $\tilde{f}_{n}(x)$ (resp. $\tilde{f}_{n,m}(x)$) for the $s_i$'s (resp. the $\hat{s}_i$'s) as the left derivative of the least concave majorant of $F_n$ (resp. $F_{n,m}$) evaluated at $x$ [8].
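Since the Grenander estimator is used repeatedly below, we include a minimal sketch of one standard construction (a stack-based least concave majorant of the empirical CDF); it is illustrative only and assumes the samples lie in [lower, upper]:

import numpy as np

def grenander_decreasing(samples, lower=0.0, upper=1.0):
    """Grenander estimator of a decreasing density on [lower, upper].

    Returns (knots, slopes): the left derivative of the least concave majorant
    (LCM) of the empirical CDF equals slopes[j] on (knots[j], knots[j+1]].
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    pts_x = np.concatenate(([lower], x, [upper]))
    pts_y = np.concatenate(([0.0], np.arange(1, n + 1) / n, [1.0]))
    hull = [(pts_x[0], pts_y[0])]
    for px, py in zip(pts_x[1:], pts_y[1:]):
        # Pop the last hull point while it lies (weakly) below the chord to (px, py).
        while len(hull) >= 2:
            (ax, ay), (bx, by) = hull[-2], hull[-1]
            if (py - ay) * (bx - ax) >= (by - ay) * (px - ax):
                hull.pop()
            else:
                break
        hull.append((px, py))
    hx = np.array([p[0] for p in hull])
    hy = np.array([p[1] for p in hull])
    slopes = np.diff(hy) / np.maximum(np.diff(hx), 1e-12)  # guard against ties at the boundary
    return hx, slopes

def grenander_eval(x, knots, slopes):
    """Evaluate the Grenander density estimate (left derivative of the LCM) at x."""
    idx = np.clip(np.searchsorted(knots, x, side="left") - 1, 0, len(slopes) - 1)
    return slopes[idx]

An increasing density (needed later for the right arm of the U shape) can be estimated by reflection, i.e., applying `grenander_decreasing` to `upper + lower - samples` and evaluating at the reflected point.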

For a density $f$, we use $f_{\max}$ and $f_{\min}$ to denote its maximal and minimal function values on the domain of $f$. We use $\mathbb{I}$ to denote the indicator function, and use $x^{+}$, $x^{-}$ to denote the right and left limits at $x$, respectively.

3 Estimation and Inference of $F$ under Various Conditions

When $m$ is sufficiently large, $\hat{s}$ behaves like $s$, which implies that we can directly estimate the underlying true $F$ of the binomial mixture model (1.2). A natural question arises whether there exists a minimal binomial size $m$ so that the identifiability issue mentioned in Appendix A is not a concern. We investigate the estimation of general $F$, $F$ with a density, $F$ whose density has an absolutely continuous derivative, and $F$ with a monotone density. We consider the empirical CDF estimator for the first two conditions, the histogram estimator for the third condition, and the Grenander estimator for the last condition. The investigations into the estimation of $F$ under various conditions provide us with the analysis tools to design and analyze the cutoff method for GeneFishing.

3.1 General $F$

We begin with the estimation of $F$, based on $\{\hat{s}_{i}\}_{i=1}^{n}$, without additional conditions except that $F$ is defined on $[0,1]$. Towards this end, the empirical CDF estimator might be the first method that comes to mind. It is easy to interpret using a diagram, and it exists no matter whether $F$ corresponds to a density, a point mass function, or a mixture of both.

To measure the deviation of the empirical CDF from $F$, we use the $L_p$ distance with $p\geq 1$. The $L_p$ distance between two distributions $F_1$ and $F_2$ is defined as

\mathcal{L}_{p}(F_{1},F_{2}):=\left(\int_{\mathbb{R}}|F_{1}(x)-F_{2}(x)|^{p}\,dx\right)^{\frac{1}{p}}.

The $L_p$ distance has two special cases that are easily interpretable from a geometric perspective. First, when $p=1$, it looks at the cumulative differences between the two CDFs. The $L_1$ distance is known to be equivalent to the 1-Wasserstein ($\mathcal{W}_1$) distance on $\mathbb{R}$, which is also known as the Kantorovich-Monge-Rubinstein metric. Second, when $p=\infty$, it corresponds to the Kolmogorov-Smirnov (K-S) distance:

d_{KS}(F_{1},F_{2}):=\sup_{x}|F_{1}(x)-F_{2}(x)|,

which measures the largest deviation between two CDFs. The K-S distance is a weaker notion than the total variation distance on $\mathbb{R}$ (total variation is often too strong to be useful).
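Both metrics are straightforward to approximate numerically. The helpers below (a sketch reusing the `ecdf` function from Section 2; the evaluation grid is an approximation, since the exact K-S supremum is attained at the sample jump points) compute grid-based versions of the $L_p$ and K-S distances between two empirical CDFs:

import numpy as np

def lp_distance(a, b, p=1, grid=None):
    """Grid approximation of the L_p distance between the empirical CDFs of a and b on [0, 1]."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 2001)
    Fa, Fb = ecdf(a), ecdf(b)
    # Uniform grid on [0, 1], so the mean approximates the integral.
    return np.mean(np.abs(Fa(grid) - Fb(grid)) ** p) ** (1.0 / p)

def ks_distance(a, b, grid=None):
    """Grid approximation of the Kolmogorov-Smirnov distance."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 2001)
    Fa, Fb = ecdf(a), ecdf(b)
    return np.max(np.abs(Fa(grid) - Fb(grid)))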

In the sequel, we study $\mathcal{L}_{p}(F^{(m)},F)$ for $F$ supported on $[0,1]$. Theorem 1 states that without any conditions imposed on $F$, the $L_p$ distance between $F^{(m)}$ and $F$ is bounded by $\mathcal{O}(m^{-\frac{1}{2p}})$. One key to this theorem lies in the finite support of $F$, which enables the usage of Fubini's theorem. Along with the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [24], it implies that the $L_p$ distance between $F_{n,m}$ and $F$ is bounded by $\mathcal{O}(m^{-\frac{1}{2p}})+\mathcal{O}(n^{-\frac{1}{2}})$. In particular, we have $\mathcal{L}_{1}(F^{(m)},F)=\mathcal{O}(m^{-\frac{1}{2}})$ and $\mathcal{L}_{1}(F_{n,m},F)=\mathcal{O}(m^{-\frac{1}{2}})+\mathcal{O}(n^{-\frac{1}{2}})$.
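A quick Monte Carlo sanity check of this $m^{-\frac{1}{2}}$ behaviour can be run with the helpers above (illustrative only; the Beta(2,2) choice of $F$ and the use of the latent samples as a stand-in for $F$ are assumptions of the sketch):

import numpy as np

rng = np.random.default_rng(1)
n = 50000
s = rng.beta(2, 2, n)                       # latent s_i ~ F (illustrative F)
for m in (25, 100, 400, 1600):
    s_hat = rng.binomial(m, s) / m
    print(m, round(lp_distance(s_hat, s, p=1), 4))
# The printed L1 distances shrink roughly like m^{-1/2},
# down to the O(n^{-1/2}) floor coming from the finite sample size.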

Theorem 1 (The $L_p$ distance between $F^{(m)}$/$F_{n,m}$ and $F$)

For a general $F$ on $[0,1]$, it follows that

\left(\int_{0}^{1}|F^{(m)}(x)-F(x)|^{p}\,dx\right)^{\frac{1}{p}}\leq\frac{C(p)}{m^{\frac{1}{2p}}},

where $C(p)$ is a positive constant that depends on $p$. It indicates that

{\mathbb{E}}\left(\int_{0}^{1}|F_{n,m}(x)-F(x)|^{p}\,dx\right)^{\frac{1}{p}}\leq\frac{C(p)}{m^{\frac{1}{2p}}}+\frac{K}{\sqrt{n}},

where $K$ is another universal positive constant.

Proof By definition, it follows that

\begin{aligned}
|F^{(m)}(x)-F(x)| &= |{\mathbb{E}}[\mathbb{I}(\hat{s}\leq x)]-{\mathbb{E}}[\mathbb{I}(s\leq x)]| \\
&= |{\mathbb{E}}[\mathbb{I}(\hat{s}\leq x<s)]-{\mathbb{E}}[\mathbb{I}(s\leq x<\hat{s})]| \\
&\leq \max\{|{\mathbb{E}}[\mathbb{I}(\hat{s}\leq x<s)]|,\,|{\mathbb{E}}[\mathbb{I}(s\leq x<\hat{s})]|\}.
\end{aligned}

Note that

\begin{aligned}
{\mathbb{E}}[\mathbb{I}(\hat{s}\leq x<s)] &= {\mathbb{E}}[{\mathbb{E}}[\mathbb{I}(\hat{s}\leq x<s)\,|\,s]] \\
&= {\mathbb{E}}[{\mathbb{E}}[\mathbb{I}(\hat{s}-s\leq x-s<0)\,|\,s]] \\
&\leq {\mathbb{E}}[\exp\{-m(x-s)^{2}/2\}],
\end{aligned}

where, conditionally on $s$, $\hat{s}-s$ is an average of $m$ independent terms bounded in $[-1,1]$, each sub-Gaussian with variance proxy at most $1$, so that $\hat{s}-s$ is sub-Gaussian with variance proxy at most $1/m$ [10]. The same argument applies to ${\mathbb{E}}[\mathbb{I}(s\leq x<\hat{s})]$. Therefore, we have

\begin{aligned}
\left(\int_{0}^{1}|F^{(m)}(x)-F(x)|^{p}\,dx\right)^{\frac{1}{p}}
&\leq \left(\int_{0}^{1}{\mathbb{E}}[\exp\{-mp(x-s)^{2}/2\}]\,dx\right)^{\frac{1}{p}} \\
&= \left(\int_{0}^{1}\int_{0}^{1}\exp\{-mp(x-s)^{2}/2\}\,dF(s)\,dx\right)^{\frac{1}{p}} \\
&\overset{(i)}{=} \left(\int_{0}^{1}\int_{0}^{1}\exp\{-mp(x-s)^{2}/2\}\,dx\,dF(s)\right)^{\frac{1}{p}} \\
&\leq \left(\int_{0}^{1}\frac{\sqrt{2\pi}}{\sqrt{mp}}\,dF(s)\right)^{\frac{1}{p}} \\
&= \frac{(2\pi)^{\frac{1}{2p}}}{(mp)^{\frac{1}{2p}}},
\end{aligned}

where equality $(i)$ holds by Fubini's theorem. Further, by noting that $F_{n,m}(x)-F(x)=F_{n,m}(x)-F^{(m)}(x)+F^{(m)}(x)-F(x)$ and using the DKW inequality, it follows that

{\mathbb{E}}\left(\int_{0}^{1}|F_{n,m}(x)-F(x)|^{p}\,dx\right)^{\frac{1}{p}}\leq\frac{(2\pi)^{\frac{1}{2p}}}{(mp)^{\frac{1}{2p}}}+\frac{\sqrt{2\pi}}{\sqrt{n}}.

 

Notwithstanding, Theorem 1 does not establish a useful bound for the K-S distance, which corresponds to the $L_\infty$ distance: there remains a non-negligible constant $\underset{p\rightarrow\infty}{\lim}\frac{(2\pi)^{\frac{1}{2p}}}{(mp)^{\frac{1}{2p}}}=1$ which does not depend on $m$. In fact, the K-S distance might evaluate the estimate of $F$ from too stringent a perspective. Proposition 2 shows that no matter how large $m$ is, there is an $F$ with a point mass function and a point $x_0$ such that $|F^{(m)}(x_{0})-F(x_{0})|$ is larger than a constant that is independent of $m$. This result is attributed to the “bad” points with non-trivial masses like $x_0$. Such a “bad” point gives rise to a sharp jump in $F$, which $F^{(m)}$ cannot immediately catch up with due to the discrete nature of the binomial randomness. It leads to difficulty in recovering the underlying distribution $F$ of the binomial mixture model.

On the other hand, the $L_p$ distance with $p<\infty$ does not suffer from the issue of the K-S distance: it can be regarded as looking at an average of the absolute distance between $F^{(m)}$ and $F$ when the support of $F$ is finite. To be specific, even if there are “bad” points $x_1,x_2,\ldots$ such that $|F^{(m)}(x_{i})-F(x_{i})|$ has a non-trivial difference, $i=1,2,\ldots$, this difference diminishes outside a small neighborhood of $x_i$ of width $\mathcal{O}(\frac{1}{m})$. Therefore, when taking the integral, the average distance decreases as $m$ grows. Furthermore, if $F$ has a density, the issue of “bad” points no longer exists for the K-S distance either. In this case, the K-S distance is an appropriate choice to measure the difference between $F^{(m)}$ and $F$; see Section 3.2 for details.

Proposition 2 ($F^{(m)}$ can deviate far from $F$ in K-S distance when $F$ has a pmf)

There exists an $F$ with a pmf such that $\sup_{x}|F^{(m)}(x)-F(x)|\geq c>0$, where $c$ is a constant that does not depend on $m$.

Proof Let $F$ be the distribution placing all its mass at $\frac{1}{2}+\kappa$, where $\kappa$ is an extremely small positive irrational number. Then, by the CLT, with probability about $1/2$, $\hat{s}$ is no larger than $\frac{\tilde{m}}{m}$, where $\tilde{m}$ is the largest integer such that $\frac{\tilde{m}}{m}<\frac{1}{2}+\kappa$. Take any $x_0$ in $(\frac{\tilde{m}}{m},\frac{1}{2}+\kappa)$. It follows that $F(x_{0})=0$ but $F^{(m)}(x_{0})\approx\frac{1}{2}$. This implies that $F^{(m)}(x_{0})-F(x_{0})$ is larger than a constant that is independent of $m$.

3.2 $F$ with a density

In this section, we focus on $F$ with a density so that the K-S distance can be employed. In addition, we stick to this metric partly because it is related to the Grenander estimator for monotone density estimation [8, 4], which constitutes our method for the GeneFishing cutoff selection; see Section 3.4 and Section 4.2.

To bound the K-S distance between $F_{n,m}$ and $F$, i.e., $\sup_{x}|F_{n,m}(x)-F(x)|$, we just need to bound $F_{n,m}-F^{(m)}$ and $F^{(m)}-F$ by noticing that $F_{n,m}-F=F_{n,m}-F^{(m)}+F^{(m)}-F$. By the DKW inequality, we have a tight bound

\mathbb{P}[\sup_{x}|F_{n,m}(x)-F^{(m)}(x)|>t]\leq 2\exp\{-2nt^{2}\},\quad\forall t>0.

So it only remains to study the deviation of $F^{(m)}$ from $F$. In Proposition 3, we show that when $F$ has a density bounded from both below and above, the K-S distance between $F^{(m)}$ and $F$ is bounded by $\mathcal{O}(\frac{1}{m})$ from below and by $\mathcal{O}(\frac{1}{\sqrt{m}})$ from above.

Proposition 3 (Deviation of $F^{(m)}$ from $F$ with a density)

Suppose $f$ is a density function on $[0,1]$ with $0<f_{\min}\leq f_{\max}<\infty$. It follows that

\frac{f_{\min}}{m+1}\leq\sup_{x}|F^{(m)}(x)-F(x)|\leq f_{\max}\cdot\frac{\sqrt{2\pi}}{\sqrt{m}}.

Proof For the lower bound, we have

\mathbb{P}(\hat{s}\leq 0)-\mathbb{P}(s\leq 0)=\mathbb{P}(\hat{s}\leq 0)=\int_{0}^{1}(1-u)^{m}f(u)\,du\geq f_{\min}\int_{0}^{1}(1-u)^{m}\,du=\frac{f_{\min}}{m+1}.

For the upper bound, note that

F^{(m)}(x)-F(x)=\mathbb{P}(\hat{s}\leq x)-\mathbb{P}(s\leq x)={\mathbb{E}}(\mathbb{I}[\hat{s}\leq x]-\mathbb{I}[s\leq x])={\mathbb{E}}(\mathbb{I}[\hat{s}\leq x<s]-\mathbb{I}[s\leq x<\hat{s}]),

thus

|F^{(m)}(x)-F(x)|\leq\max\{{\mathbb{E}}\,\mathbb{I}[\hat{s}\leq x<s],\,{\mathbb{E}}\,\mathbb{I}[s\leq x<\hat{s}]\}.

We have

\begin{aligned}
{\mathbb{E}}\,\mathbb{I}[\hat{s}\leq x<s] &= \mathbb{P}(\hat{s}\leq x<s) \\
&= \mathbb{P}(\hat{s}-s\leq x-s<0) \\
&= {\mathbb{E}}[\mathbb{P}(\hat{s}-s\leq x-s<0\,|\,s)] \\
&\leq {\mathbb{E}}[\exp(-m(x-s)^{2}/2)] \\
&= \int_{0}^{1}\exp(-m(x-u)^{2}/2)f(u)\,du \\
&\leq f_{\max}\cdot\frac{\sqrt{2\pi}}{\sqrt{m}}.
\end{aligned}

Similarly, we have ${\mathbb{E}}\,\mathbb{I}[s\leq x<\hat{s}]\leq f_{\max}\cdot\frac{\sqrt{2\pi}}{\sqrt{m}}$. So it follows that

\sup_{x}|\mathbb{P}(\hat{s}\leq x)-\mathbb{P}(s\leq x)|\leq\max\{\sup_{x}{\mathbb{E}}\,\mathbb{I}[\hat{s}\leq x<s],\,\sup_{x}{\mathbb{E}}\,\mathbb{I}[s\leq x<\hat{s}]\}\leq f_{\max}\cdot\frac{\sqrt{2\pi}}{\sqrt{m}}.

 

Proposition 3 shows that the largest deviation of the binomial mixture CDF from the underlying CDF is at least of order $\mathcal{O}(m^{-1})$ and at most of order $\mathcal{O}(m^{-\frac{1}{2}})$. In fact, the condition that $f$ is bounded can be relaxed to $f$ being $L_p$-integrable with $p>1$ using the Hölder inequality, but the rate will correspondingly be $\mathcal{O}(m^{-\frac{p-1}{2p}})$. Our result is the special case with $p=\infty$. Moreover, $F$ having a density is a necessary condition for Proposition 3: we have seen in Proposition 2 that if $F$ has a point mass function, the deviation of $F^{(m)}$ from $F$ cannot be controlled w.r.t. $m$.

Proposition 4 shows that there exist two simple distributions that respectively attain the lower bound and the upper bound. On the other hand, Proposition 3 can be further improved: if the underlying density is assumed to be smooth, the upper bound is lowered to $\mathcal{O}(m^{-1})$; see Proposition 7 in Section 3.3.

Proposition 4 (Tightness of Proposition 3)

The upper bound and the lower bound in Proposition 3 are tight. In other words, there exist $F_1$ and $F_2$ such that $\sup_{x}|F_{1}^{(m)}(x)-F_{1}(x)|\leq C_{1}\cdot\frac{1}{m+1}$ and $\sup_{x}|F_{2}^{(m)}(x)-F_{2}(x)|\geq C_{2}\cdot\frac{1}{\sqrt{m}}$, where $C_1$ and $C_2$ are two positive constants.

Proof The lower bound of Proposition 3 can be attained by the uniform distribution. Specifically, if $f\equiv 1$, then $\mathbb{P}(\hat{s}=k/m)=\frac{1}{m+1}$ for each $k=0,\ldots,m$. So $|\mathbb{P}(\hat{s}\leq 0)-\mathbb{P}(s\leq 0)|=\frac{1}{m+1}$.

On the other hand, the upper bound can be attained by another simple distribution. Let

f(x)=1.8\cdot\mathbb{I}(x\in[0,1/2])+0.2\cdot\mathbb{I}(x\in(1/2,1]). \qquad (3.1)

We can show that this density $f$ leads to $|\mathbb{P}(\hat{s}\leq 1/2)-\mathbb{P}(s\leq 1/2)|\geq\frac{C}{\sqrt{m}}$, where $C$ is a positive constant. It is a consequence of the central limit theorem for the binomial distribution. The detailed proof is delegated to Appendix D.
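A small numerical illustration (not part of the proof; the Monte Carlo sample size is an arbitrary choice of the sketch) of how the density (3.1) produces a K-S deviation of order $m^{-\frac{1}{2}}$ at $x=1/2$:

import numpy as np

rng = np.random.default_rng(2)

def ks_at_half(m, n=200000):
    # Sample from the density (3.1): mass 0.9 on [0, 1/2] and 0.1 on (1/2, 1].
    left = rng.uniform(0.0, 0.5, size=n)
    right = rng.uniform(0.5, 1.0, size=n)
    s = np.where(rng.uniform(size=n) < 0.9, left, right)
    s_hat = rng.binomial(m, s) / m
    # |P(s_hat <= 1/2) - P(s <= 1/2)|, with P(s <= 1/2) = 0.9.
    return abs(np.mean(s_hat <= 0.5) - 0.9)

for m in (50, 200, 800, 3200):
    print(m, round(ks_at_half(m), 4))   # decreases roughly like m^{-1/2}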

Given Proposition 3, we can get the rate of $\sup_{x}|F_{n,m}(x)-F(x)|$ along with the DKW inequality, which is $\mathcal{O}(n^{-\frac{1}{2}})+\mathcal{O}(m^{-\frac{1}{2}})$, as shown in Theorem 5. By taking the integral of $\mathbb{P}(\sup_{x}[F_{n,m}(x)-F(x)]>t)$ w.r.t. $t$, we immediately have Corollary 6.

Theorem 5 (The rate of the K-S distance between $F_{n,m}$ and $F$ with a density)

Suppose $f$ is a density function on $[0,1]$ with $f_{\max}<\infty$. The data are generated as in Model (1.2). It follows that

\mathbb{P}(\sup_{x}[F_{n,m}(x)-F(x)]>t)\leq\exp(-nt^{2}/2)+\mathbb{I}\left(f_{\max}\cdot\frac{2\sqrt{2\pi}}{\sqrt{m}}>t\right),

where $t\geq\sqrt{\frac{\log 2}{2n}}$. The two-sided tail bound also holds as follows:

\mathbb{P}(\sup_{x}|F_{n,m}(x)-F(x)|>t)\leq 2\exp(-nt^{2}/2)+\mathbb{I}\left(f_{\max}\cdot\frac{2\sqrt{2\pi}}{\sqrt{m}}>t\right),

where $t>0$.

Proof Note that

\sup_{x}|F_{n,m}(x)-F(x)|\leq\sup_{x}|F_{n,m}(x)-F^{(m)}(x)|+\sup_{x}|F^{(m)}(x)-F(x)|.

The first term can be bounded by the original DKW inequality, and the second term can be bounded using the result of Proposition 3. This gives the second result; the first result can be obtained in the same fashion.

Corollary 6

Under the same setup as Theorem 5, we have ${\mathbb{E}}\sup_{x}[F_{n,m}(x)-F(x)]\leq\frac{\sqrt{2\pi}}{\sqrt{n}}+\min\{1,\frac{2\sqrt{2\pi}\,f_{\max}}{\sqrt{m}}\}$ and ${\mathbb{E}}\sup_{x}|F_{n,m}(x)-F(x)|\leq\frac{2\sqrt{2\pi}}{\sqrt{n}}+\min\{1,\frac{2\sqrt{2\pi}\,f_{\max}}{\sqrt{m}}\}$.

Proof By Theorem 5, it follows that

\begin{aligned}
{\mathbb{E}}[\sup_{x}[F_{n,m}(x)-F(x)]] &= \int_{0}^{1}\mathbb{P}(\sup_{x}[F_{n,m}(x)-F(x)]>t)\,dt \\
&\leq \int_{0}^{1}\left[\exp(-nt^{2}/2)+\mathbb{I}\left(f_{\max}\cdot\frac{2\sqrt{2\pi}}{\sqrt{m}}>t\right)\right]dt \\
&\leq \frac{\sqrt{2\pi}}{\sqrt{n}}+\min\left\{1,\frac{2\sqrt{2\pi}\,f_{\max}}{\sqrt{m}}\right\}.
\end{aligned}

The two-sided expectation bound can be proved in a similar manner.

3.3 $F$ with a Smooth Density

In this section, we investigate the estimation of $F$ with a smooth density. Under this condition, we first obtain a stronger result than Proposition 3 for $F^{(m)}$, based on a truncated K-S distance on the interval $[a,1-a]$, where $0<a<1/2$. Proposition 7 shows that if the density of $F$ is smooth, the truncated K-S distance decreases at $\mathcal{O}(\frac{1}{m})$. The proof is based on the fact that the binomial random variable $m\cdot\hat{s}_i$ behaves like a Gaussian random variable when the binomial probability $s_i$ is bounded away from $0$ and $1$. When $s_i$ is close to $0$ or $1$, the Gaussian approximation cannot be used since it has an unbounded variance $\frac{1}{ms_{i}(1-s_{i})}$. The proof is deferred to Appendix E.

Proposition 7 (Deviation of $F^{(m)}$ from $F$ with a smooth density)

Suppose $f$ is a density function on $[0,1]$ with $f^{\prime}$ being absolutely continuous. Let $s\sim f$ and $m\cdot\hat{s}\,|\,s\sim\text{Binomial}(m,s)$. It follows for any $0<a<1/2$ that

\sup_{x\in[a,1-a]}|F^{(m)}(x)-F(x)|\leq C\cdot\frac{1}{m},

where $C$ is some constant that only depends on $f$ and $a$.

Then, we investigate the histogram estimator, since it is one of the simplest nonparametric density estimators and it has a theoretical guarantee when the density is smooth. Let $L$ be an integer and define bins

B_{1}=[0,\tfrac{1}{L}),\ B_{2}=[\tfrac{1}{L},\tfrac{2}{L}),\ \ldots,\ B_{L}=[\tfrac{L-1}{L},1].

Let $Y_l$ be the number of observations in $B_l$, $\hat{p}_{l}=Y_{l}/n$ and $p_{l}=\mathbb{P}(s_{1}\in B_{l})$. It is known that under certain smoothness conditions, the histogram converges at a cubic-root rate in root mean squared error (MSE) [41, Chapter 6]. Next we study how the histogram behaves under the binomial mixture model. Denote

\left\{\begin{array}{ll}\hat{f}_{n,m}(x)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}\mathbb{I}(\hat{s}_{i}\in B(x))\\ \hat{f}_{n}(x)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}\mathbb{I}(s_{i}\in B(x)),\end{array}\right.

where $h$ is the bandwidth and $B(x)$ denotes the bin that $x$ falls in. Theorem 8 shows that the histogram estimator based on the $\hat{s}_i$'s has the same convergence rate in terms of the MSE metric as the histogram estimator based on the $s_i$'s if $m=\Omega(n^{\frac{2}{3}})$ and $h=\mathcal{O}(n^{-\frac{1}{3}})$. This rate might not be improved even if $f$ has higher-order continuous derivatives, since ${\mathbb{E}}|\hat{f}_{n,m}(x)-{\mathbb{E}}\hat{f}_{n}(x)|$ is bounded by $\mathcal{O}(\frac{1}{\sqrt{m}}+h+\frac{1}{mh})$, which dominates ${\mathbb{E}}|\hat{f}_{n}(x)-f(x)|=\mathcal{O}(h^{\nu})$ when $f$ has a $\nu$-th order continuous derivative. The proof of Theorem 8 is delegated to Appendix F.
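A minimal sketch of the binned estimator $\hat{f}_{n,m}$ (with $L$ equal-width bins, so bandwidth $h=1/L$; the $n^{-1/3}$ bandwidth choice below merely mirrors the rate appearing in Theorem 8 and is not prescribed by the paper):

import numpy as np

def histogram_density(samples, L):
    """Histogram density estimate with L equal-width bins on [0, 1]."""
    h = 1.0 / L
    counts, edges = np.histogram(samples, bins=L, range=(0.0, 1.0))
    heights = counts / (len(samples) * h)          # estimated density on each bin
    return edges, heights

def histogram_eval(x, edges, heights):
    """Evaluate the histogram estimate at points x."""
    idx = np.clip(np.digitize(x, edges) - 1, 0, len(heights) - 1)
    return heights[idx]

# Illustrative bandwidth of order n^{-1/3}, e.g. for the CFRs s_hat:
# L = int(np.ceil(len(s_hat) ** (1 / 3))); edges, heights = histogram_density(s_hat, L)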

Theorem 8 (Upper bound of the histogram risk for the binomial mixture model)

Let $R(a,b)=\int_{a}^{b}{\mathbb{E}}(f(x)-\hat{f}_{n,m}(x))^{2}dx$ be the integrated risk on the interval $[a,b]$. Assume that $f^{\prime}$ is absolutely continuous. It follows that

R(a,1-a)\leq C_{1}\cdot\left(h^{2}+\frac{1}{m}+\frac{1}{m^{2}h^{2}}+\frac{1}{nh}\right),\quad\forall\, 0<a<\frac{1}{2}.

Furthermore, if $m\geq C_{2}\cdot n^{\frac{2}{3}}$ and $h=C_{3}\cdot n^{-\frac{1}{3}}$, we have

R(a,1-a)\leq C_{4}\cdot n^{-\frac{2}{3}},\quad\forall\, 0<a<\frac{1}{2}.

Here $C_1$, $C_2$ and $C_4$ are positive constants that only depend on $a$ and $f$; $C_{3}>0$ only depends on $f$.

Finally, we conclude this section with a study of $F$ whose density is discretized into a point mass function with $K$ non-zero masses. In contrast to the existing results in [36], we allow $K$ to be as large as $\sqrt{n}$, and study the finite-sample rate of the histogram estimator. Let $p(x)$ be a point mass function such that

p(x)=\sum_{k=1}^{K}\alpha(k)\,\mathbb{I}(x=x_{k}),

where $\alpha(k)\geq 0$, $\sum_{k=1}^{K}\alpha(k)=1$, and $x_{k}=\frac{(k-1)+1/2}{K}$. Denote by $I_k$ the interval centered at $x_k$ and of length $1/K$. We can make $(\alpha(1),\ldots,\alpha(K))$ “smooth” by letting $\alpha(k)=\int_{I_{k}}f(t)dt$, where $f$ is a smooth function. Denote

\left\{\begin{array}{ll}\hat{\alpha}_{n,m}(k)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(\hat{s}_{i}\in I_{k})\\ \hat{\alpha}_{n}(k)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(s_{i}\in I_{k}).\end{array}\right.

Then the MSE can be defined as $R(\hat{\alpha},\alpha)=\frac{1}{K}\sum_{k=1}^{K}(\hat{\alpha}(k)-\alpha(k))^{2}$. It is known that $R(\hat{\alpha}_{n},\alpha)=\mathcal{O}(\frac{1}{n})$. Theorem 9 shows that the same rate can be achieved by $\hat{\alpha}_{n,m}$ when $m=\Omega(\sqrt{n}\max\{\sqrt{n},K\})$. The proof is deferred to Appendix G.

Theorem 9 (Upper bound of the histogram risk for the finite binomial mixture)

Let $\alpha(k)=\int_{I_{k}}f(t)dt$, where $f^{\prime}$ is absolutely continuous and $\int(f^{\prime})^{2}dx<\infty$. Let $R(a,b)=\frac{\sum_{aK\leq k\leq bK}(\hat{\alpha}_{n,m}(k)-\alpha(k))^{2}}{\sum_{k=1}^{K}\mathbb{I}(aK\leq k\leq bK)}$ be the risk on the interval $[a,b]$. It follows that

R(a,1-a)\leq C_{1}\cdot\left(\frac{1}{n}+\frac{1}{m}+\frac{K^{2}}{m^{2}}\right),\quad\forall\, 0<a<\frac{1}{2}.

Furthermore, if $m\geq C_{2}\cdot\sqrt{n}\max\{\sqrt{n},K\}$, then

R(a,1-a)\leq C_{3}\cdot\frac{1}{n},\quad\forall\, 0<a<\frac{1}{2}.

Here $C_{1},C_{2},C_{3}$ are positive constants that only depend on $a$ and $f$.

3.4 $F$ with a Monotone Density

Next, we shift our attention to $F$ with a monotone density $f$. It is motivated by the U shape in the histograms of the GeneFishing output, where we can decompose the U shape into a decreasing part, a flat part, and an increasing part; see Section 4.1. To estimate $f$, a natural solution is the Grenander estimator [8, 14]. Specifically, we first construct the least concave majorant of the empirical CDF of $F$; then its left derivative is the desired estimator. A detailed review of the Grenander estimator is deferred to Appendix C.

In the sequel, we establish convergence and inference results for $\tilde{f}_{n,m}$, the Grenander estimator based on the $\hat{s}_i$'s. Theorem 10 states that $\tilde{f}_{n,m}$ achieves the convergence rate $\mathcal{O}(n^{-\frac{1}{3}})$ w.r.t. the $L_1$ distance if $m=\Omega(n^{\frac{2}{3}})$.

Theorem 10 ($L_1$ convergence of $\tilde{f}_{n,m}$)

Suppose $f$ is a decreasing density on $[0,1]$ with $f_{\max}<\infty$. It follows that

{\mathbb{E}}_{f}\int_{0}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx\leq C_{1}\cdot n^{-\frac{1}{3}}+C_{2}\cdot m^{-\frac{1}{2}}.

Furthermore, if $m\geq C_{3}\cdot n^{\frac{2}{3}}$, we have

{\mathbb{E}}_{f}\int_{0}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx\leq C_{4}\cdot n^{-\frac{1}{3}}.

Here $C_{1},C_{2},C_{3},C_{4}$ are positive constants that only depend on $f$.

Theorem 10 follows from Corollary 6 and [4]. The details can be found in Appendix H. Furthermore, we discuss the minimal $m$ we need to ensure the $\mathcal{O}(n^{-\frac{1}{3}})$ convergence rate in terms of the $L_1$ distance. We have seen that $m=\Omega(n^{\frac{2}{3}})$ is a sufficient condition. The question arises naturally whether it is necessary. Based on the density (3.1) used in Proposition 4, we can show that $m=\Omega(n^{\frac{2}{3}})$ is also the minimal binomial size; see Proposition 11.

Proposition 11

When $f$ takes the form (3.1), i.e.,

f(x)=1.8\cdot\mathbb{I}(x\in[0,1/2])+0.2\cdot\mathbb{I}(x\in(1/2,1]),

one needs $m\geq C_{1}\cdot n^{\frac{2}{3}}$ to ensure

{\mathbb{E}}_{f}\int_{0}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx\leq C_{2}\cdot n^{-\frac{1}{3}},

where $C_{1},C_{2}>0$ only depend on $f$.

Proof Note that

\begin{aligned}
{\mathbb{E}}_{f}\int_{0}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx
&\overset{(i)}{\geq} {\mathbb{E}}_{f}\int_{0}^{x^{*}}|\tilde{f}_{n,m}(x)-f(x)|dx \\
&\geq {\mathbb{E}}_{f}\left|\int_{0}^{x^{*}}[\tilde{f}_{n,m}(x)-f(x)]dx\right| \\
&= {\mathbb{E}}_{f}|\tilde{F}_{n,m}(x^{*})-F(x^{*})| \\
&\geq {\mathbb{E}}_{f}|F^{(m)}(x^{*})-F(x^{*})|-{\mathbb{E}}_{f}|\tilde{F}_{n,m}(x^{*})-F^{(m)}(x^{*})| \\
&\overset{(ii)}{\geq} {\mathbb{E}}_{f}|F^{(m)}(x^{*})-F(x^{*})|-\sup_{x}{\mathbb{E}}_{f}|F_{n,m}(x)-F^{(m)}(x)| \\
&\overset{(iii)}{\geq} \frac{K_{1}}{\sqrt{m}}-\frac{K_{2}}{\sqrt{n}},
\end{aligned}

where $x^{*}:=\arg\max_{x}|F^{(m)}(x)-F(x)|$ in Inequality (i); Inequality (ii) holds because $\sup_{x}|\tilde{F}_{n,m}(x)-F^{(m)}(x)|\leq\sup_{x}|F_{n,m}(x)-F^{(m)}(x)|$, which can be easily derived from Marshall's lemma [22]; Inequality (iii) holds because of Proposition 4 and the DKW inequality, where $K_{1},K_{2}>0$ are two positive constants that only depend on $f$. To ensure the expected $L_1$ distance between $\tilde{f}_{n,m}$ and $f$ is bounded by $C_{2}\cdot n^{-\frac{1}{3}}$, it is necessary to have

\frac{C_{2}}{\sqrt[3]{n}}\geq\frac{K_{1}}{\sqrt{m}}-\frac{K_{2}}{\sqrt{n}},

which implies that $m\geq C_{1}\cdot n^{\frac{2}{3}}$ for some positive constant $C_1$ that depends on $f$.

Next, we study the local asymptotics of $\tilde{f}_{n,m}$. For the binomial mixture model, we obtain a result for $\tilde{f}_{n,m}$ similar to the local asymptotics of $\tilde{f}_{n}$ when $m$ grows faster than $n$, as shown in Theorem 12.

Theorem 12 (Local Asymptotics of $\tilde{f}_{n,m}$)

Suppose $f$ is a decreasing density on $[0,1]$ and is smooth at $t_{0}\in(0,1)$. Then, when $\frac{m}{n}\rightarrow\infty$ as $n\rightarrow\infty$, if $f_{\max}<\infty$, we have

  • (A)

    If $f$ is flat in a neighborhood of $t_{0}$, and $[a,b]$ is the flat part containing $t_{0}$, it follows that

    \sqrt{n}(\tilde{f}_{n,m}(t_{0})-f(t_{0}))\overset{d}{\rightarrow}\hat{S}_{a,b}(t_{0}),

    where $\hat{S}_{a,b}(t)$ is the slope at $F(t)$ of the least concave majorant in $[F(a),F(b)]$ of a standard Brownian bridge in $[0,1]$.

  • (B)

    If $f(t)-f(t_{0})\sim f^{(k)}(t_{0})(t-t_{0})^{k}$ near $t_{0}$ for some $k$ and $f^{(k)}(t_{0})<0$, it follows that

    n^{\frac{k}{2k+1}}\left[\frac{f^{k}(t_{0})|f^{(k)}(t_{0})|}{(k+1)!}\right]^{-\frac{1}{2k+1}}(\tilde{f}_{n,m}(t_{0})-f(t_{0}))\overset{d}{\rightarrow}V_{k}(0),

    where $V_{k}(t)$ is the slope at $t$ of the least concave majorant of the process $\{B(t)-|t|^{k+1},t\in(-\infty,\infty)\}$, and $B(t)$ is a standard two-sided Brownian motion on ${\mathbb{R}}$ with $B(0)=0$.

The proof of Theorem 12 relies on Proposition 3 and the Komlós-Major-Tusnády (KMT) approximation [17]. Given these two results, we can show that if $f$ is upper bounded, there exists a sequence of Brownian bridges $\{B_{n}(x),0\leq x\leq 1\}$ such that

\mathbb{P}\left\{\sup_{0\leq x\leq 1}|\sqrt{n}(F_{n,m}(x)-F(x))-B_{n}(F(x))|>\frac{\tilde{a}\sqrt{n}}{\sqrt{m}}+\frac{a\log n}{\sqrt{n}}+t\right\}\leq b\left(e^{-c\sqrt{n}t}+e^{-dmt^{2}}\right),

where $\tilde{a}>0$ only depends on $f$ and $a,b,c,d$ are universal positive constants. Then, following [40], we can prove Theorem 12. The details are deferred to Appendix I.

Finally, we conclude this section by comparing the histogram estimator and the Grenander estimator (for a density). Both of them are bin estimators but differ in the choice of the bin width. One can pick the bin width for the histogram to attain optimal convergence rates [41]. On the other hand, the bin widths of the Grenander estimator are chosen completely automatically by the estimator and are naturally locally adaptive [4]. The consequence is that the Grenander estimator can guarantee monotonicity, but the histogram estimator cannot. If the underlying model is monotone, the Grenander estimator has a better convergence rate than the histogram estimator. Notably, the convergence theory of the histogram estimator cannot be established unless the density is smooth, while that of the Grenander estimator only requires that the density be monotone and $L_p$-integrable ($p>2$) [4].

In our setup, we show that when $m=\Omega(n^{\frac{2}{3}})$, both the Grenander estimator and the histogram estimator, based on $\{\hat{s}_{i}\}_{i=1}^{n}$, have the same rate $\mathcal{O}(n^{-\frac{1}{3}})$ in $L_1$ distance (the $L_2$ convergence of the histogram implies its $L_1$ convergence). It seems that both methods are comparable. Nonetheless, we mention that the conditions for the convergence of the two estimators are different. The Grenander estimator requires a bounded monotone density, while the histogram requires a smooth density.

4 Estimation and Inference of $F$ with U-shape Constraint

Now we have sufficient insight into the estimation of $F$ under various conditions in the binomial mixture model (1.2). We are ready to cast our attention back to the cutoff selection problem for the GeneFishing method, i.e., distinguishing the relevant genes (to the bait genes or the associated biological process) from the irrelevant ones. To answer this question, we leverage the observation that the histogram of $\{\hat{s}_{i}\}_{i=1}^{n}$ appears to have a U shape, as shown in Figure 1.

4.1 Model

We decompose the density or the pmf of $F$ into three parts: the first part decreases, the second part remains flat, and the last part increases. The first part is assumed to be purely related to the irrelevant genes; the second part is associated with a mixture of the irrelevant and the relevant genes; the last part corresponds purely to the relevant genes. Denote by $c_l$ and $c_r$ the transition points from the first part to the second part and from the second part to the third part, respectively. Then the question is reduced to identifying $c_r$ and getting an upper confidence bound on $c_r$. In the sequel, we formally write this assumption as Assumption 1 when $F$ is associated with a continuous random variable. The corresponding mathematical formulations for the pmf are similar, so we omit them here.

Assumption 1

Let $f$ be the derivative of $F$, i.e., the probability density function. We assume $f$ consists of three parts:

f(x)=\left\{\begin{array}{ll}f_{l}(x)=\alpha_{l}\cdot g_{l}(x),&\text{ if }x\in[0,c_{l}]\\ \frac{\alpha_{mid}}{c_{r}-c_{l}},&\text{ if }x\in(c_{l},c_{r}]\\ f_{r}(x)=\alpha_{r}\cdot g_{r}(x),&\text{ if }x\in(c_{r},1],\end{array}\right. \qquad (4.1)

where $0<c_{l}<c_{r}<1$, $g_l$ is a decreasing function, $g_r$ is an increasing function such that $\int_{0}^{c_{l}}g_{l}(x)dx=1$ and $\int_{c_{r}}^{1}g_{r}(x)dx=1$, and $\alpha_{l}+\alpha_{r}+\alpha_{mid}=1$ with $\alpha_{l},\alpha_{r},\alpha_{mid}>0$.

For the U-shaped constraint, we also need

\min\{f_{l}(c_{l}^{-}),f_{r}(c_{r}^{+})\}\geq\frac{\alpha_{mid}}{c_{r}-c_{l}}.

The shape constraint (4.1) is determined by six parameters $\{\alpha_{l},\alpha_{r},c_{l},c_{r},g_{l},g_{r}\}$, but they are not identifiable. Below is an example of such unidentifiability.

Example 1 (Identifiability Issue for (4.1))
\begin{array}{ll}\tilde{\alpha}_{l}=\alpha_{l}+\frac{\alpha_{mid}}{c_{r}-c_{l}}\cdot\tau; & \tilde{\alpha}_{r}=\alpha_{r}\\ \tilde{c}_{l}=c_{l}+\tau; & \tilde{c}_{r}=c_{r}\\ \tilde{g}_{l}(x)=\left\{\begin{array}{ll}g_{l}(x)\cdot\alpha_{l}/\tilde{\alpha}_{l},&\text{ if }x\in[0,c_{l}]\\ \frac{\alpha_{mid}}{(c_{r}-c_{l})\cdot\tilde{\alpha}_{l}},&\text{ if }x\in(c_{l},c_{l}+\tau]\end{array}\right.; & \tilde{g}_{r}=g_{r}\end{array}

The parameters $\{\tilde{\alpha}_{l},\tilde{\alpha}_{r},\tilde{c}_{l},\tilde{c}_{r},\tilde{g}_{l},\tilde{g}_{r}\}$ yield the same model as $\{\alpha_{l},\alpha_{r},c_{l},c_{r},g_{l},g_{r}\}$ if $\tau<c_{r}-c_{l}$.

The identifiability issue results from the vague transitions from one part to the next adjacent part in Model (4.1). To tackle it, we need to introduce some assumptions to sharpen the transitions. For example, if $f$ is smooth, then a sharp transition means that the first derivative of $f$, i.e., the slope, changes significantly at this point. In general, we do not impose smoothness on $f$ and use the finite difference as a surrogate of the slope. To be specific, suppose there exist $\delta_{l},\delta_{r}>0$ and neighborhoods of $c_l$ and $c_r$, whose sizes are $\tau_l$ and $\tau_r$ respectively, such that

\begin{array}{ll}f_{l}(x)\geq\frac{\alpha_{mid}}{c_{r}-c_{l}}+\delta_{l}\cdot(c_{l}-x),&\text{ if }x\in[c_{l}-\tau_{l},c_{l})\\ f_{r}(x)\geq\frac{\alpha_{mid}}{c_{r}-c_{l}}+\delta_{r}\cdot(x-c_{r}),&\text{ if }x\in(c_{r},c_{r}+\tau_{r}].\end{array}

For the sake of convenience, we consider a stronger condition that drops the factors $c_{l}-x$ and $x-c_{r}$, which we call Assumption 2. It indicates that the density jumps at the transition points $c_l$ and $c_r$.

Assumption 2

There are two positive parameters $\delta_l$ and $\delta_r$ such that

\begin{array}{l}f_{l}(c_{l}^{-})\geq\frac{\alpha_{mid}}{c_{r}-c_{l}}+\delta_{l},\\ f_{r}(c_{r}^{+})\geq\frac{\alpha_{mid}}{c_{r}-c_{l}}+\delta_{r}.\end{array}

Together, we refer to the binomial mixture model satisfying Assumption 1 and Assumption 2 as the BMU model.
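For illustration, the sketch below draws CFRs from a BMU-type model; the specific cutoffs, weights, and choices of $g_l$, $g_r$ are assumptions of the sketch (picked so that Assumptions 1 and 2 hold), not values used in the paper.

import numpy as np

rng = np.random.default_rng(3)

# Illustrative BMU parameters (assumed for this sketch).
c_l, c_r = 0.2, 0.8
alpha_l, alpha_mid, alpha_r = 0.6, 0.1, 0.3

def sample_decreasing(n, lo, hi):
    """Decreasing density on [lo, hi]: mixture of a steep Beta(1, 5) part and a uniform part,
    so the density stays strictly above the flat middle level at the cutoff."""
    steep = rng.uniform(size=n) < 0.7
    return np.where(steep, lo + (hi - lo) * rng.beta(1, 5, size=n),
                    rng.uniform(lo, hi, size=n))

def sample_increasing(n, lo, hi):
    """Increasing density on [lo, hi], by reflecting the decreasing sampler."""
    return lo + hi - sample_decreasing(n, lo, hi)

def sample_bmu(n, m):
    """Draw latent s_i from the U-shaped f of (4.1) and CFRs s_hat_i from Model (1.2)."""
    u = rng.uniform(size=n)
    s = np.where(u < alpha_l, sample_decreasing(n, 0.0, c_l),
        np.where(u < alpha_l + alpha_mid, rng.uniform(c_l, c_r, size=n),
                 sample_increasing(n, c_r, 1.0)))
    s_hat = rng.binomial(m, s) / m
    return s, s_hat

s, s_hat = sample_bmu(n=5000, m=500)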

4.2 Method

Let $c_{l}^{(0)}$ and $c_{r}^{(0)}$ be the underlying ground truth of the two cutoffs in BMU. Our goal is to identify the cutoff that separates the relevant genes and the irrelevant genes in the GeneFishing method. Specifically, we want to find an estimator $\hat{c}_r$ for $c_{r}^{(0)}$ and study the behavior of $\mathbb{P}[\hat{c}_{r}\geq c_{r}^{(0)}]$.

Define $\alpha_{l}(x;v)=\mathbb{P}[v\leq x]$, $\alpha_{mid}(x,y;v)=\mathbb{P}[x<v<y]$, and $\alpha_{r}(y;v)=\mathbb{P}[v\geq y]$, where $v$ can be $s$ or $\hat{s}$. Define $N_{l}(x;\{v_{i}\}_{i=1}^{n}):=\#\{v_{i}\leq x,i=1,\ldots,n\}$, $N_{mid}(x,y;\{v_{i}\}_{i=1}^{n}):=\#\{x<v_{i}\leq y,i=1,\ldots,n\}$, $N_{r}(y;\{v_{i}\}_{i=1}^{n}):=\#\{v_{i}>y,i=1,\ldots,n\}$, and $N(x;\{v_{i}\}_{i=1}^{n}):=\#\{v_{i}=x,i=1,\ldots,n\}$, where $\{v_{i}\}_{i=1}^{n}$ can be $\{s_{i}\}_{i=1}^{n}$ or $\{\hat{s}_{i}\}_{i=1}^{n}$.

Since we are working with the $\hat{s}_i$'s rather than the $s_i$'s, we write $\alpha_{l}(x)=\mathbb{P}[\hat{s}\leq x]$ and $N_{l}(x)=N_{l}(x;\{\hat{s}_{i}\}_{i=1}^{n})$ for simplicity. Similarly, we get the simplified notation $\alpha_{mid}(x,y)$, $\alpha_{r}(y)$, $N_{mid}(x,y)$, $N_{r}(y)$, $N(x)$. In the rest of the paper, we sometimes use $\alpha_l$, $\alpha_r$, $\alpha_{mid}$ for $\alpha_{l}(c_{l})$, $\alpha_{r}(c_{r})$ and $\alpha_{mid}(c_{l},c_{r})$ respectively if no confusion arises.

4.2.1 The Non-parametric Maximum Likelihood Estimation

To estimate the parameters in BMU, we consider non-parametric maximum likelihood estimation (NPMLE). We first solve the problem given $c_l$ and $c_r$, then search for the optimal $c_l$ and $c_r$ by grid search. The NPMLE problem is:

\begin{aligned}
H_{full}(c_{l},c_{r}):=\max\quad & \sum_{\hat{s}_{i}\leq c_{l}}\log g_{l}(\hat{s}_{i})+\sum_{\hat{s}_{i}>c_{r}}\log g_{r}(\hat{s}_{i})+N_{l}(c_{l})\log\alpha_{l}+N_{mid}(c_{l},c_{r})\log\frac{\alpha_{mid}}{c_{r}-c_{l}}+N_{r}(c_{r})\log\alpha_{r}\\
\text{s.t.}\quad & \int_{0}^{c_{l}}g_{l}=1,\ \int_{c_{r}}^{1}g_{r}=1,\ g_{l}\text{ decreasing},\ g_{r}\text{ increasing}\\
& \alpha_{l},\alpha_{r},\alpha_{mid}>0,\ \alpha_{l}+\alpha_{r}+\alpha_{mid}=1\\
& \left.\begin{array}{l}\alpha_{l}g_{l}(c_{l}^{-})\geq\frac{\alpha_{mid}}{c_{r}-c_{l}}+d_{l}\\ \alpha_{r}g_{r}(c_{r}^{+})\geq\frac{\alpha_{mid}}{c_{r}-c_{l}}+d_{r}\end{array}\right\}\qquad (4.5)
\end{aligned}

Here $d_l$ and $d_r$ are two parameters to tune, and we call the inequalities (4.5) the change-point-gap constraint, which corresponds to Assumption 1 and Assumption 2. Given $c_l$ and $c_r$, the variables to optimize over are

S:=\{\alpha_{l},\alpha_{r},g_{l}(\hat{s}_{1}),\ldots,g_{l}(\hat{s}_{i_{l}}),g_{r}(\hat{s}_{i_{r}}),\ldots,g_{r}(\hat{s}_{n})\},

where, assuming the $\hat{s}_i$'s are sorted in increasing order, $i_{l}:=\max\{i:\hat{s}_{i}<c_{l}\}$ and $i_{r}:=\min\{i:\hat{s}_{i}\geq c_{r}\}$. Since $\log x$ is continuous and concave w.r.t. $x$, and the feasible set is convex (it is easy to check that $\{(x,y,z):xy\geq z;\,x,y,z\geq 0\}$ is a convex set), the problem (4.2.1) is a convex optimization with a unique optimizer.

There are two main difficulties with the optimization problem (4.2.1). First, the change-point-gap constraint (4.5) complicates the monotone density estimation. Second, it is not easy to optimize over $\alpha_l$, $\alpha_r$ and $g_l$, $g_r$ simultaneously.

4.2.2 Ucut: A Simplified Estimator

Fortunately, the following observations suggest a simplified optimization problem.

  • Note that $\alpha_l$ and $\alpha_r$ are the population masses for $x\leq c_{l}$ and $x\geq c_{r}$, which can be well estimated by the empirical masses $\hat{\alpha}_{l}(c_{l})=N_{l}(c_{l})/n$, $\hat{\alpha}_{mid}(c_{l},c_{r})=N_{mid}(c_{l},c_{r})/n$ and $\hat{\alpha}_{r}(c_{r})=N_{r}(c_{r})/n$.

  • If BMU is true with $\delta_{l}\geq d_{l}$ and $\delta_{r}\geq d_{r}$, and the solution to the optimization (4.2.1) without (4.5) at $c_{l}=c_{l}^{(0)}$, $c_{r}=c_{r}^{(0)}$ is good enough, then the change-point-gap constraint (4.5) is satisfied with high probability.

  • From Figure 1, we can see that the flat region is wide. We can easily pick an interior point within the flat region.

Inspired by these observations, we replace the population masses with the empirical masses and drop the change-point-gap constraint. We obtain the simplified objective function as follows:

\begin{aligned}
H_{simplified}(c_{l},c_{r}):=\max\quad & \sum_{\hat{s}_{i}\leq c_{l}}\log g_{l}(\hat{s}_{i})+\sum_{\hat{s}_{i}>c_{r}}\log g_{r}(\hat{s}_{i})+N_{l}(c_{l})\log\hat{\alpha}_{l}(c_{l})+N_{mid}(c_{l},c_{r})\log\frac{\hat{\alpha}_{mid}(c_{l},c_{r})}{c_{r}-c_{l}}+N_{r}(c_{r})\log\hat{\alpha}_{r}(c_{r})\\
\text{s.t.}\quad & \int_{0}^{c_{l}}g_{l}=1,\ \int_{c_{r}}^{1}g_{r}=1,\ g_{l}\text{ decreasing},\ g_{r}\text{ increasing},
\end{aligned}

where

\hat{\alpha}_{l}(c_{l})=N_{l}(c_{l})/n=\frac{\#\{i\,|\,\hat{s}_{i}\leq c_{l}\}}{n},\qquad \hat{\alpha}_{mid}(c_{l},c_{r})=N_{mid}(c_{l},c_{r})/n=\frac{\#\{i\,|\,c_{l}<\hat{s}_{i}\leq c_{r}\}}{n},\qquad \hat{\alpha}_{r}(c_{r})=N_{r}(c_{r})/n=\frac{\#\{i\,|\,\hat{s}_{i}>c_{r}\}}{n}.

The problem (4.2.2) is reduced to two monotone density estimations, which the Grenander estimator can solve. As we point out in the above observations, we can easily identify an interior point $\mu$ in the flat region. We fit a Grenander estimator for the decreasing $g_l$ on $[0,\mu]$ and a Grenander estimator for the increasing $g_r$ on $(\mu,1]$. There are three advantages of using the interior point $\mu$. First, it significantly reduces the computational cost, since the two Grenander estimators are computed just once, regardless of the choices of $c_l$ and $c_r$. Second, it bypasses the boundary issue of the Grenander estimators, since we are mainly concerned with the behaviors of the estimators at the points $c_{l}<\mu$ and $c_{r}>\mu$. Moreover, the usage of $\mu$ disentangles the mutual influences of the left decreasing part and the right increasing part; thus, it makes the analysis of the estimators simple.

Once we fit the Grenander estimators, we check whether the change-point-gap constraint (4.5) holds for different pairs of $c_l$ and $c_r$. Finally, we pick the feasible pair with the maximal likelihood. We call this algorithm Ucut (U-shape cutoff), which is summarized in Algorithm 1; an illustrative code sketch is given after the algorithm.

Algorithm 1 Ucut: estimation of the BMU model by grid-searching the optimal cutoff pair.
0:  Data: $\{\hat{s}_{1},\ldots,\hat{s}_{n}\}$; the density gaps $d_{l}$, $d_{r}$; the interior point $\mu$ of the flat region; the search intervals $[0,c_{l}^{(\max)}]$ and $(c_{r}^{(\min)},1]$, where $c_{l}^{(\max)}<\mu$ and $c_{r}^{(\min)}>\mu$; the unit for grid search: $\gamma$.
1:  Initialize $c_{l}^{*}=\text{NULL}$, $c_{r}^{*}=\text{NULL}$; $\ell(c_{l}^{*},c_{r}^{*})=-\infty$.
2:  Estimate $\hat{\alpha}_{l}(\mu)=N_{l}(\mu)/n$.
3:  Fit the Grenander estimator on $[0,\mu]$ to get $\tilde{g}_{l}$ and on $(\mu,1]$ to get $\tilde{g}_{r}$.
4:  for $c_{l}\in\{0,\gamma,2\gamma,\ldots,c_{l}^{(\max)}\}$ do
5:     for $c_{r}\in\{c_{r}^{(\min)},c_{r}^{(\min)}+\gamma,c_{r}^{(\min)}+2\gamma,\ldots,1\}$ do
6:        Estimate $\hat{\alpha}_{mid}(c_{l},\mu)$ and $\hat{\alpha}_{mid}(\mu,c_{r})$.
7:        Let $\tilde{d}_{l}=\frac{\hat{\alpha}_{mid}(c_{l},\mu)}{\hat{\alpha}_{l}(\mu)\cdot(\mu-c_{l})}+\frac{d_{l}}{\hat{\alpha}_{l}(\mu)}$ and $\tilde{d}_{r}=\frac{\hat{\alpha}_{mid}(\mu,c_{r})}{(1-\hat{\alpha}_{l}(\mu))\cdot(c_{r}-\mu)}+\frac{d_{r}}{1-\hat{\alpha}_{l}(\mu)}$.
8:        Let $\ell(c_{l},c_{r})$ be the corresponding $H_{simplified}(c_{l},c_{r})$ defined in the problem (4.2.2).
9:        Let $flag=\mathbb{I}[\tilde{g}_{l}(c_{l})\geq\tilde{d}_{l}\text{ and }\tilde{g}_{r}(c_{r})\geq\tilde{d}_{r}]$.
10:        if $flag$ and $\ell(c_{l},c_{r})>\ell(c_{l}^{*},c_{r}^{*})$ then
11:           $(c_{l}^{*},c_{r}^{*})\leftarrow(c_{l},c_{r})$.
12:        end if
13:     end for
14:  end for
15:  if $\ell(c_{l}^{*},c_{r}^{*})>-\infty$ then
16:     Return: $c_{l}^{*}$, $c_{r}^{*}$, $\tilde{g}_{l},\tilde{g}_{r}$, $\hat{\alpha}_{l}(\mu)$, $\ell(c_{l}^{*},c_{r}^{*})$.
17:  else
18:     Return: False.
19:  end if
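The following condensed Python sketch mirrors Algorithm 1 (illustrative only, not the reference implementation; it reuses `grenander_decreasing` and `grenander_eval` from the sketch in Section 2, assumes the user supplies the interior point `mu` of the flat region, and renormalizes the fitted densities empirically when evaluating the likelihood):

import numpy as np

def h_simplified(s_hat, c_l, c_r, g_l, g_r, mu):
    """Plug-in evaluation of H_simplified(c_l, c_r); the Grenander fits (conditional
    on lying left/right of mu) are renormalized empirically to [0, c_l] and (c_r, 1]."""
    n = len(s_hat)
    xl, xr = s_hat[s_hat <= c_l], s_hat[s_hat > c_r]
    nl, nr = len(xl), len(xr)
    nmid = n - nl - nr
    if min(nl, nmid, nr) == 0:
        return -np.inf
    ml = nl / np.sum(s_hat <= mu)        # mass of the left fit below c_l
    mr = nr / np.sum(s_hat > mu)         # mass of the right fit above c_r
    ll = np.sum(np.log(g_l(xl) / ml)) + np.sum(np.log(g_r(xr) / mr))
    ll += nl * np.log(nl / n) + nmid * np.log((nmid / n) / (c_r - c_l)) + nr * np.log(nr / n)
    return ll

def ucut(s_hat, d_l, d_r, mu, cl_max, cr_min, gamma=0.01):
    s_hat = np.asarray(s_hat, dtype=float)
    n = len(s_hat)
    left, right = s_hat[s_hat <= mu], s_hat[s_hat > mu]
    alpha_l_mu = len(left) / n
    # Decreasing Grenander fit on [0, mu]; increasing fit on (mu, 1] via reflection.
    knots_l, slopes_l = grenander_decreasing(left, lower=0.0, upper=mu)
    knots_r, slopes_r = grenander_decreasing(mu + 1.0 - right, lower=mu, upper=1.0)
    g_l = lambda x: grenander_eval(x, knots_l, slopes_l)
    g_r = lambda x: grenander_eval(mu + 1.0 - np.asarray(x), knots_r, slopes_r)

    best, best_ll = None, -np.inf
    for c_l in np.arange(0.0, cl_max + 1e-12, gamma):
        for c_r in np.arange(cr_min, 1.0 + 1e-12, gamma):
            a_mid_l = np.mean((s_hat > c_l) & (s_hat <= mu))
            a_mid_r = np.mean((s_hat > mu) & (s_hat <= c_r))
            d_l_tilde = a_mid_l / (alpha_l_mu * (mu - c_l)) + d_l / alpha_l_mu
            d_r_tilde = a_mid_r / ((1 - alpha_l_mu) * (c_r - mu)) + d_r / (1 - alpha_l_mu)
            if not (g_l(c_l) >= d_l_tilde and g_r(c_r) >= d_r_tilde):
                continue   # change-point-gap constraint violated
            ll = h_simplified(s_hat, c_l, c_r, g_l, g_r, mu)
            if ll > best_ll:
                best, best_ll = (c_l, c_r), ll
    return best, best_ll

# e.g., using the BMU sample drawn in Section 4.1 (all tuning values are illustrative):
# (c_l_hat, c_r_hat), _ = ucut(s_hat, d_l=0.5, d_r=0.5, mu=0.5, cl_max=0.4, cr_min=0.6)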

4.3 Analysis

For Algorithm 1, the question arises whether $(c_{l}^{(0)},c_{r}^{(0)})$ is a feasible pair for the change-point-gap constraint (4.5). Theorem 13 answers this question by claiming that there exist $c_l$ in a small neighborhood of $c_{l}^{(0)}$ and $c_r$ in a small neighborhood of $c_{r}^{(0)}$ such that $\tilde{g}_{l}(c_{l})>\tilde{d}_{l}$ and $\tilde{g}_{r}(c_{r})>\tilde{d}_{r}$ for appropriate choices of $d_l$ and $d_r$. This implies that we can safely set aside the constraint (4.5) when solving the problem (4.2.2). The proof of Theorem 13 is deferred to Appendix J.

Theorem 13 (Feasibility of Gap Constraint for (cl(0),cr(0))(c_{l}^{(0)},c_{r}^{(0)}))

Suppose ff is a distribution satisfying Assumption 1 and Assumption 2, with αmid(cl(0),cr(0))>0\alpha_{mid}(c_{l}^{(0)},c_{r}^{(0)})>0, fmax<f_{\max}<\infty. If mC1(max{Nl(cl(0)),Nr(cr(0))})23m\geq C_{1}\cdot\left(\max\{N_{l}(c_{l}^{(0)}),N_{r}(c_{r}^{(0)})\}\right)^{\frac{2}{3}}, and dl<δld_{l}<\delta_{l}, dr<δrd_{r}<\delta_{r}, there exist clc_{l} and crc_{r} such that

clcl(0) with |clcl(0)|C2Nl(cl(0))13c_{l}\leq c_{l}^{(0)}\text{ with }|c_{l}-c_{l}^{(0)}|\leq C_{2}\cdot N_{l}(c_{l}^{(0)})^{-\frac{1}{3}}

and

crcr(0) with |crcr(0)|C3Nr(cr(0))13c_{r}\geq c_{r}^{(0)}\text{ with }|c_{r}-c_{r}^{(0)}|\leq C_{3}\cdot N_{r}(c_{r}^{(0)})^{-\frac{1}{3}}

for which g~l\tilde{g}_{l}, g~r\tilde{g}_{r}, d~l\tilde{d}_{l} and d~r\tilde{d}_{r} produced by Algorithm 1 satisfy g~l(cl)>d~l\tilde{g}_{l}(c_{l})>\tilde{d}_{l} and g~r(cr)>d~r\tilde{g}_{r}(c_{r})>\tilde{d}_{r}, provided the input μ(cl(0),cr(0))\mu\in(c_{l}^{(0)},c_{r}^{(0)}). Furthermore, the resulting density estimator f~m,n\tilde{f}_{m,n} satisfies

01|f~m,nf(x)|𝑑xC4{Nl(cl(0))13+Nr(cr(0))13}.\int_{0}^{1}|\tilde{f}_{m,n}-f(x)|dx\leq C_{4}\cdot\left\{N_{l}(c_{l}^{(0)})^{-\frac{1}{3}}+N_{r}(c_{r}^{(0)})^{-\frac{1}{3}}\right\}.

Here C1,C2,C3,C4C_{1},C_{2},C_{3},C_{4} are positive constants that only depend on ff.

Besides knowing there are some feasible points near cl(0)c_{l}^{(0)} and cr(0)c_{r}^{(0)}, we want to have a clear sense of the optimal cutoff pair produced by Algorithm 1. Theorem 14 says that the optimal cutoff for the left (right) part is smaller (larger) than cl(0)c_{l}^{(0)} (cr(0)c_{r}^{(0)}) with high probability.

Theorem 14 (Tail Bounds of Identified Cutoffs)

Suppose (c^l,c^r)(\hat{c}_{l},\hat{c}_{r}) is the identified optimal cutoff pair produced by Algorithm 1, provided an input μ(cl(0),cr(0))\mu\in(c_{l}^{(0)},c_{r}^{(0)}). Under the same assumptions as Theorem 13, particularly nn\rightarrow\infty, m/max{Nl(cl(0)),Nr(cr(0))}m/\max\{N_{l}(c_{l}^{(0)}),N_{r}(c_{r}^{(0)})\}\rightarrow\infty,

[c^l>cl(0)][S^cl(0),μ(cl(0))Nl(cl(0))dlαl(μ)C1],\mathbb{P}[\hat{c}_{l}>c_{l}^{(0)}]\leq\mathbb{P}[\hat{S}_{c_{l}^{(0)},\mu}(c_{l}^{(0)})\geq\sqrt{N_{l}(c_{l}^{(0)})}\cdot\frac{d_{l}}{\alpha_{l}(\mu)}-C_{1}],

and

[c^r<cr(0)][S^μ,cr(0)(cr(0))Nr(cr(0))dr1αl(μ)C2],\mathbb{P}[\hat{c}_{r}<c_{r}^{(0)}]\leq\mathbb{P}[\hat{S}_{\mu,c_{r}^{(0)}}(c_{r}^{(0)})\geq\sqrt{N_{r}(c_{r}^{(0)})}\cdot\frac{d_{r}}{1-\alpha_{l}(\mu)}-C_{2}],

where C1C_{1}, C2C_{2} are positive constants, and C1C_{1} only depends on αl(μ)\alpha_{l}(\mu), dld_{l}, C2C_{2} only depends on αr(μ)\alpha_{r}(\mu), drd_{r}; S^a,b(t)\hat{S}_{a,b}(t) is the slope at F(t)F(t) of the least concave majorant in [F(a),F(b)][F(a),F(b)] of a standard Brownian Bridge in [0,1][0,1].

The proof of Theorem 14 in fact reduces to showing that any cutoff pair (cl,cr)(c_{l},c_{r}) with cl>cl(0)c_{l}>c_{l}^{(0)} or cr<cr(0)c_{r}<c_{r}^{(0)} fails the change-point-gap constraint with high probability. Since g~l\tilde{g}_{l} and g~r\tilde{g}_{r} estimated in Algorithm 1 are decreasing and increasing respectively, if cl>cl(0)c_{l}>c_{l}^{(0)} (or cr<cr(0)c_{r}<c_{r}^{(0)}) violates the constraint, then any cl>clc_{l}^{\prime}>c_{l} (or cr<crc_{r}^{\prime}<c_{r}) violates it with high probability as well. So it suffices to consider the smallest cl>cl(0)c_{l}>c_{l}^{(0)} and the largest cr<cr(0)c_{r}<c_{r}^{(0)} in the grid-searching space of Algorithm 1. The result then follows from Theorem 12. The details are deferred to Appendix K.

Finally, we show in Theorem 15 that the identified c^r\hat{c}_{r} converges to cr(0)c_{r}^{(0)} at the rate of 𝒪([Nr(cr(0))]13)\mathcal{O}([N_{r}(c_{r}^{(0)})]^{-\frac{1}{3}}) if m=Ω(n23)m=\Omega(n^{\frac{2}{3}}). And the estimated density also converges to the true one at the rate of 𝒪(max{[Nr(cr(0))]13,[Nl(cl(0))]13})\mathcal{O}(\max\{[N_{r}(c_{r}^{(0)})]^{-\frac{1}{3}},[N_{l}(c_{l}^{(0)})]^{-\frac{1}{3}}\}). The proof can be found in Appendix L.

Theorem 15 (L1L_{1} Convergence of Identified Cutoff)

Suppose ff is a distribution satisfying Assumption 1 and Assumption 2, with αmid(cl(0),cr(0))>0\alpha_{mid}(c_{l}^{(0)},c_{r}^{(0)})>0, fmax<f_{\max}<\infty. Let Δl=clcl(0)\Delta_{l}=c_{l}-c_{l}^{(0)}, Δr=crcr(0)\Delta_{r}=c_{r}-c_{r}^{(0)}. If we have

mC1(max{Nl(cl(0)),Nr(cr(0))})23,m\geq C_{1}\cdot\left(\max\{N_{l}(c_{l}^{(0)}),N_{r}(c_{r}^{(0)})\}\right)^{\frac{2}{3}},

then

|Δ^l|C2Nl(cl(0))13,|Δ^r|C3Nr(cr(0))13,|\widehat{\Delta}_{l}|\leq C_{2}\cdot N_{l}(c_{l}^{(0)})^{-\frac{1}{3}},~{}|\widehat{\Delta}_{r}|\leq C_{3}N_{r}(c_{r}^{(0)})^{-\frac{1}{3}},

where Δ^l\widehat{\Delta}_{l} and Δ^r\widehat{\Delta}_{r} are associated with c^l\hat{c}_{l} and c^r\hat{c}_{r} output by Algorithm 1. Furthermore, the resulting f~n,m\tilde{f}_{n,m} satisfies

𝔼f01|f~n,m(x)f(x)|𝑑xC4{Nl(cl(0))13+Nr(cr(0))13}.{\mathbb{E}}_{f}\int_{0}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx\leq C_{4}\cdot\left\{N_{l}(c_{l}^{(0)})^{-\frac{1}{3}}+N_{r}(c_{r}^{(0)})^{-\frac{1}{3}}\right\}.

Here C1,C2,C3,C4C_{1},C_{2},C_{3},C_{4} are positive constants that do not depend on nn, Nl(cl(0))N_{l}(c_{l}^{(0)}) and Nr(cr(0))N_{r}(c_{r}^{(0)}).

5 Experiments of Ucut

5.1 Numerical Experiments

5.1.1 Data generating process

To confirm and complement our theory, we use extensive numerical experiments to examine the finite-sample performance of Ucut on the estimation of crc_{r}. We study a BMU model composed of linear components. Specifically, the model consists of three parts with boundaries clc_{l} and crc_{r}. The middle part is a flat region of height δm\delta_{m}. The left part is a segment with its right end at (cl,δm+δl)(c_{l},\delta_{m}+\delta_{l}) and slope sl<0s_{l}<0, while the right part is a segment with its left end at (cr,δm+δr)(c_{r},\delta_{m}+\delta_{r}) and slope sr>0s_{r}>0; see Figure 2 (a) for an illustration. We call this the linear valley model and normalize it to produce the density of interest; the normalized gaps are denoted δ~l\tilde{\delta}_{l} and δ~r\tilde{\delta}_{r}.

Figure 2: The linear valley model. (a) The original model; (b) the two-group model.

The linear valley model depicts the mixture density of the null distribution and the alternative distribution. We only assume that the left part below clc_{l} belongs purely to the null distribution while the right part beyond crc_{r} belongs purely to the alternative distribution. The null and the alternative distributions can hardly be distinguished in the flat middle part. To understand the linear valley model from the perspective of the two-group model, we assume that τ0×100%\tau_{0}\times 100\% of the middle part belongs to the null distribution while the remainder belongs to the alternative distribution. In Figure 2 (b), the part with left slashes corresponds to the null density f0f_{0} while the part with right slashes corresponds to the alternative density f1f_{1}. Let π0\pi_{0} be the area with left slashes divided by the total area. Then the marginal density can be written as f=π0f0+(1π0)f1f=\pi_{0}f_{0}+(1-\pi_{0})f_{1}. Since any τ0[0,1]\tau_{0}\in[0,1] gives the same ff, the middle part is not identifiable. It is therefore necessary to estimate and make inference on the right cutoff crc_{r}, so that we can safely claim that all samples beyond crc_{r} come from the alternative distribution.
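Concretely, since the left part is purely null, the right part is purely alternative, and a τ0\tau_{0} fraction of the middle mass is assigned to the null, the null proportion can be written as

\pi_{0}=\int_{0}^{c_{l}}f(x)\,dx+\tau_{0}\int_{c_{l}}^{c_{r}}f(x)\,dx=\alpha_{l}(c_{l})+\tau_{0}\cdot\alpha_{mid}(c_{l},c_{r}),

where αl\alpha_{l} and αmid\alpha_{mid} denote the mass of ff below clc_{l} and between the two cutoffs, respectively (the total area is 11 because ff is normalized).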

By default, we set cl=0.3c_{l}=0.3, cr=0.9c_{r}=0.9, δm=1\delta_{m}=1, δl=0.5\delta_{l}=0.5, δr=0.5\delta_{r}=0.5, sl=3s_{l}=-3, sr=1s_{r}=1. We draw n=10,000n=10,000 samples {s1,,sn}\{s_{1},\ldots,s_{n}\} from the linear valley model. Then for each i{1,,n}i\in\{1,\ldots,n\}, we independently draw XiBinom(m,si)X_{i}\sim Binom(m,s_{i}) and observe s^i=Xi/m\hat{s}_{i}=X_{i}/m, where m=1,000m=1,000 unless specified otherwise. The value of τ0\tau_{0} does not affect the data generation, but it affects the FDR and the power of any method that yields discoveries.
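For reference, a minimal sketch of this data-generating process (our own implementation, which may differ in details from the released code): the linear valley density is sampled by inverse-CDF on a fine grid, and the binomial layer is added on top.

import numpy as np

rng = np.random.default_rng(0)

def linear_valley_density(x, c_l=0.3, c_r=0.9, delta_m=1.0,
                          delta_l=0.5, delta_r=0.5, s_l=-3.0, s_r=1.0):
    # Unnormalized linear valley model: a decreasing segment ending at (c_l, delta_m + delta_l),
    # a flat middle of height delta_m, and an increasing segment starting at (c_r, delta_m + delta_r).
    left = (delta_m + delta_l) + s_l * (x - c_l)
    right = (delta_m + delta_r) + s_r * (x - c_r)
    return np.where(x <= c_l, left, np.where(x <= c_r, delta_m, right))

# sample s_1, ..., s_n by inverse-CDF on a fine grid, then add the binomial layer
grid = np.linspace(0.0, 1.0, 100001)
cdf = np.cumsum(linear_valley_density(grid))
cdf /= cdf[-1]                                     # normalize into a CDF
n, m = 10_000, 1_000
s = np.interp(rng.uniform(size=n), cdf, grid)      # s_i ~ f (up to grid discretization)
x_counts = rng.binomial(m, s)                      # X_i | s_i ~ Binom(m, s_i)
s_hat = x_counts / m                               # observed CFRs \hat{s}_i = X_i / m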

Finally, the left and right parts need not be linear. To investigate general monotone cases and misspecified cases (e.g., unimodal densities), we replace the left part and the right part with other functions; see Appendix M. All code to replicate the results in this paper can be found at github.com/Elric2718/GFcutoff/.

5.1.2 Robustness to model parameters

When using Algorithm 1, we use the middle point μ=0.5\mu=0.5, the left gap parameter dl=0.8δ~ld_{l}=0.8\cdot\tilde{\delta}_{l}, the right gap parameter dr=0.8δ~rd_{r}=0.8\cdot\tilde{\delta}_{r}, and the searching unit γ=0.001\gamma=0.001. We first investigate how the binomial size mm affects the estimation of crc_{r}. Using the default setup as described in Section 5.1.1, we vary the binomial size m{102,103,2×103,5×103,104,Inf}m\in\{10^{2},10^{3},2\times 10^{3},5\times 10^{3},10^{4},Inf\}, where InfInf refers to the case that there is no binomial randomness and we observe sis_{i}’s directly. As shown in Figure 3 (a), c^r\hat{c}_{r} converges to the true cr(0)c_{r}^{(0)} as mm grows. When m=103m=10^{3}, the estimated crc_{r} is as good as that of using sis_{i}’s directly. This corroborates our theory that we only need mn23500m\sim n^{\frac{2}{3}}\approx 500 when the BMU model holds. Note that in the linear setup, even with m=102m=10^{2}, c^r\hat{c}_{r} is larger than the true cr(0)c_{r}^{(0)} with high probability. This implies that Ucut is safe in the sense that it makes few false discoveries when c^r\hat{c}_{r} is used as the cutoff.

In the sequel, we stick to m=103m=10^{3} since it works well enough for the linear valley model. We next investigate whether the width of the middle flat region affects the estimation of crc_{r}. We consider cl=0.5w/2c_{l}=0.5-w/2, cr=0.5+w/2c_{r}=0.5+w/2 with w{0.6,0.4,0.2,0.1,0}w\in\{0.6,0.4,0.2,0.1,0\} while other model parameters are set by default. In Figure 3 (b), the estimation of crc_{r} is quite satisfactory when the width is no smaller than 0.20.2. When the width drops to 0.10.1 or smaller, the estimation is not stable but still conservative in the sense that c^r>cr\hat{c}_{r}>c_{r} in most cases.

Finally, we examine how the gap size influences the estimation of crc_{r}. We take δl{0.5,0.3,0.2,0.1,0.01}\delta_{l}\in\{0.5,0.3,0.2,0.1,0.01\} and δr{0.5,0.3,0.2,0.1,0.01}\delta_{r}\in\{0.5,0.3,0.2,0.1,0.01\}. Figure 3 (c) shows that the estimation of crc_{r} is robust to the gap sizes as long as the input dld_{l} and drd_{r} are smaller than the true gaps. This gives us confidence in applying Ucut to identify the cutoff even when there is no gap, which is more realistic.

Figure 3: The estimation of c^r\hat{c}_{r} under the linear valley model. (a) with respect to mm; (b) with respect to the width of the middle flat region; (c) with respect to the gap sizes.

5.1.3 Sensitivity of the algorithm hyper-parameters

Algorithm 1 (Ucut) has three main tuning parameters: the middle point μ\mu, the left gap dld_{l} and the right gap drd_{r}. In practice, these tuning parameters may be misspecified: for example, the middle point may be hard to spot, or the left and right gaps may be set too small. We use the default model parameters as specified in Section 5.1.1. Let dld_{l} and drd_{r} be 0.80.8 times the true normalized gaps δ~l\tilde{\delta}_{l} and δ~r\tilde{\delta}_{r} respectively. We vary the choice of the middle point μ\mu. Figure 4 (a) shows that the estimation of crc_{r} is not sensitive to the choice of μ\mu as long as it is picked within the flat region [0.3,0.9][0.3,0.9]. If μ\mu is picked to the left of the flat region, c^r\hat{c}_{r} has a larger variance but is more conservative in the sense that c^r>cr(0)\hat{c}_{r}>c_{r}^{(0)} in most cases. If μ\mu is picked to the right of the flat region, c^r\hat{c}_{r} tends to be min{μ,cr(0)}\min\{\mu,c_{r}^{(0)}\}.

Next, we fix μ=0.5\mu=0.5 but consider dl=κ×δ~ld_{l}=\kappa\times\tilde{\delta}_{l} and dr=κ×δ~rd_{r}=\kappa\times\tilde{\delta}_{r}, where κ{1,0.9,0.8,0.5,0.2,0.1,0.01}\kappa\in\{1,0.9,0.8,0.5,0.2,0.1,0.01\}. We do not consider κ>1\kappa>1 because there might not exist a feasible (cl,cr)(c_{l},c_{r}) that satisfies the change-point-gap constraint. Figure 4 (b) shows that when κ\kappa is within [0.5,1][0.5,1] the estimation is satisfactory. The estimated crc_{r} can be slightly smaller than the true cr(0)c_{r}^{(0)} when κ<0.5\kappa<0.5, but within a tolerable range.

In a nutshell, the choices of μ\mu, dld_{l} and drd_{r} matter for Algorithm 1, but the sensitivity analysis indicates that excessive caution is unnecessary: in practice, picking these parameters by eye gives a safe estimate in most cases.

Figure 4: The estimation of c^r\hat{c}_{r} under the linear valley model. (a) with respect to the choice of the middle point μ\mu; (b) with respect to the choice of the input dld_{l} and drd_{r}. Here dl=κ×δ~ld_{l}=\kappa\times\tilde{\delta}_{l} and dr=κ×δ~rd_{r}=\kappa\times\tilde{\delta}_{r}, where κ\kappa is a ratio of the normalized δ\delta’s.

5.1.4 Comparison to other methods

To examine the power of Ucut and its capacity to control FDR, we consider the two-group model as specified in Figure 2 (b). The middle part is not identifiable, which means that samples from the alternative distribution cannot be distinguished from those of the null distribution there. To reflect this point, we arbitrarily set the proportion τ0\tau_{0} of the null distribution in the middle part. The goal is to identify the right cutoff crc_{r}, not τ0\tau_{0}, because τ0\tau_{0} cannot be inferred.
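To make the evaluation criterion explicit, the following sketch (ours) assigns null/alternative labels as described above (left part null, right part alternative, each middle sample null with probability τ0\tau_{0}) and then computes the empirical FDR and power of declaring samples above a cutoff as discoveries.

import numpy as np

def assign_null_labels(s, c_l=0.3, c_r=0.9, tau0=0.5, rng=None):
    # Two-group labels for the linear valley model: True marks a null sample.
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(s, dtype=float)
    is_null = np.zeros(s.shape, dtype=bool)
    is_null[s <= c_l] = True                               # left part: purely null
    mid = (s > c_l) & (s <= c_r)
    is_null[mid] = rng.uniform(size=mid.sum()) < tau0      # middle part: null w.p. tau0
    return is_null                                         # right part stays alternative

def fdr_and_power(s_hat, cutoff, is_null):
    # Declare samples with s_hat > cutoff as discoveries; compute empirical FDR and power.
    discovered = np.asarray(s_hat) > cutoff
    is_null = np.asarray(is_null, dtype=bool)
    n_disc = discovered.sum()
    fdr = (discovered & is_null).sum() / max(n_disc, 1)
    power = (discovered & ~is_null).sum() / max((~is_null).sum(), 1)
    return fdr, power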

We compare Ucut to four other methods studied in [7], i.e., ZIGP (Zero-Inflated Generalized Poisson), ZIP (Zero-Inflated Poisson), GP (Generalized Poisson) and P (Poisson). The data studied in that paper, and its primary purpose, are highly similar to ours. First, they study mutation counts for a specific protein domain, which have excess zeros and are assumed to follow a zero-inflated model. Moreover, these four methods are used to select significant mutation counts (i.e., a cutoff) while controlling Type I error at a given level via false discovery rate procedures. Therefore, they are good competitors for Ucut in view of both the data distribution and the intended usage.

Figure 5 shows that GP and P use rather small cutoffs and thus have excessively large FDRs. ZIGP and ZIP are over-conservative when the target FDR level is as low as 0.0050.005 or 0.010.01, and thus have quite low power. They perform better when the target FDR level is set to 0.050.05. On the other hand, Ucut can control FDR at 0.010.01 if we directly use c^r\hat{c}_{r} as the cutoff, and the associated power is better than that of ZIGP and ZIP. To loosen the FDR control and gain power, it is fine to use a slightly smaller cutoff than c^r\hat{c}_{r}. From this result, we confirm that Ucut is a better fit for the scenario where the middle part is not distinguishable.

Figure 5: FDR and power of Ucut and other competing methods.

5.2 Application to Real Data

We further demonstrate the performance of Ucut on the real datasets used in [18]. Specifically, we apply the GeneFishing method to four GTEx RNAseq datasets: Liver, Artery-Coronary, Colon-Transverse, and Testis; see Table 1 for details. We leverage the same set of 2121 bait genes used in [18]. The number of fishing rounds is set to m=10,000m=10,000.

Table 1: Details of GTEx RNAseq datasets.

Tissue             # samples    # genes
Liver                    119     18,845
Artery-Coronary          133     20,597
Colon-Transverse         196     21,695
Testis                   172     31,931

Once the CFRs are generated, we apply Algorithm 1 with the middle point μ=0.5\mu=0.5, dl=0.1d_{l}=0.1 and dr=0.01d_{r}=0.01. We take dld_{l} to be ten times drd_{r} because there are many more zeros than ones among the CFRs. As shown in Section 5.1, Ucut is not sensitive to the three hyper-parameters, and changing them has little influence on the results. Table 2 shows that for each tissue the estimate of crc_{r} given by Ucut yields 5050 to 8585 discoveries. We estimate the false discovery rate using the second approach in [18]. Note that for Artery-Coronary, the estimated cr=0.972c_{r}=0.972 by Ucut (which gives FDR^103\widehat{FDR}\approx 10^{-3}) is less extreme than simply using 0.9900.990 (which gives FDR^104\widehat{FDR}\approx 10^{-4}). This indicates that Ucut adapts to the tissue and picks a cutoff with a reasonable false discovery rate.

Table 2: Estimation of crc_{r} by Algorithm 1 on four tissues, where μ=0.5\mu=0.5, dl=0.1d_{l}=0.1 and dr=0.01d_{r}=0.01. The second column is the estimated crc_{r} using a bootstrap that samples 70%70\% of the CFRs (standard deviation in parentheses).

Tissue             \hat{c}_r    bootstrap estimate    # discoveries (using \hat{c}_r)    estimated FDR
Liver                  0.995    0.993 (0.005)         52                                 1.4 x 10^-3
Artery-Coronary        0.972    0.976 (0.009)         85                                 5.7 x 10^-3
Colon-Transverse       0.989    0.991 (0.049)         57                                 1.2 x 10^-4
Testis                 0.993    0.992 (0.001)         73                                 0.010
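The bootstrap column of Table 2 can be reproduced in spirit by rerunning Ucut on random 70% subsamples of the CFRs. A sketch building on the ucut function sketched earlier (the number of repetitions below is our own choice, since the paper does not report it):

import numpy as np

def bootstrap_cr(s_hat, ucut_fn, n_boot=100, frac=0.7, seed=0, **ucut_kwargs):
    # Stability check of the identified right cutoff by subsampling the CFRs without replacement.
    rng = np.random.default_rng(seed)
    s_hat = np.asarray(s_hat, dtype=float)
    crs = []
    for _ in range(n_boot):
        sub = rng.choice(s_hat, size=int(frac * s_hat.size), replace=False)
        best, *_ = ucut_fn(sub, **ucut_kwargs)
        if best is not None:                 # skip subsamples with no feasible cutoff pair
            crs.append(best[1])              # the identified c_r on this subsample
    return float(np.mean(crs)), float(np.std(crs))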

In addition, we apply the GeneFishing method to a single-cell dataset of pancreas cells from Tabula Muris (https://tabula-muris.ds.czbiohub.org). It contains 849849 cells from mice and 5,2205,220 genes (out of about 20,00020,000) that are expressed in enough cells. We select 99 bait genes based on the pancreas insulin secretion gene ontology (GO) term.

Unlike Figure 1, the CFRs of this dataset do not exhibit a U shape. Instead, we observe a unimodal pattern around zero and an increasing pattern around one (Figure 6). Nonetheless, this does not hinder us from using Ucut to determine the cutoff, since we are mainly concerned with the right cutoff and Section 5.1 demonstrates that Ucut is conservative even if the model is misspecified. Using μ=0.5\mu=0.5, dl=0.1d_{l}=0.1 and dr=0.01d_{r}=0.01, Ucut gives 0.9940.994 as the estimate of the right cutoff, which yields 7777 discovered genes. GO enrichment analysis shows that these identified genes are enriched for the GO term response to ethanol (p-value 0.00210.0021), the GO term positive regulation of fatty acid biosynthesis (p-value 0.00550.0055), and the GO term eating behavior (p-value 0.00790.0079). These GO terms have been related to insulin secretion in the literature [13, 26, 35], which indicates the effectiveness of Ucut.

Figure 6: The CFRs of the single cell data of the pancreas tissue.

6 Discussion

In this work, we analyze the binomial mixture model (1.2) under the U-shape constraint, which is motivated by the results of the GeneFishing method [18]. The contributions of this work are two-fold. First, to the best of our knowledge, this is the first paper to investigate the relationship between the binomial size mm and the sample size nn for the binomial mixture model under various conditions for FF. Second, we provide a convenient tool for the downstream decision-making of the GeneFishing method.

Despite the identifiability issue of the binomial mixture model, we show that the estimator of the underlying distribution deviates from the true distribution, in an appropriate distance, by at most a small quantity that depends on mm. The implication is that to attain the same convergence rate as if there were no binomial randomness, we need m=Ω(n23)m=\Omega(n^{\frac{2}{3}}) for the empirical CDF and the Grenander estimator when the underlying density is, respectively, bounded and monotone. It is also of great interest to further investigate how the minimal mm hinges on the smoothness of the underlying distribution, e.g., by studying the kernel density estimator and the smoothing spline estimator under the binomial mixture model.

To answer the motivating question of how large the CFR should be so that the associated sample can be regarded as a discovery in the GeneFishing method, we propose the BMU model to depict the underlying distribution and the NPMLE method Ucut to determine the cutoff. This estimator comprises two Grenander estimators and thus attains the same cube-root convergence rate as the Grenander estimator when m=Ω(n23)m=\Omega(n^{\frac{2}{3}}). We also show that the estimated cutoff is larger than the true cutoff with high probability. On the other hand, all these results are established for a bounded density. We leave it as future work to study the relationship between the convergence rate and the smoothness of the density.

Finally, we conclude the paper with a discussion of the empirical performance of Ucut. The simulation studies reassure practitioners that Ucut is not sensitive to the choice of its hyper-parameters; the default setup can handle most cases. In the real data example used in [18], Ucut gives a cutoff with a low estimated FDR using much less time than the procedure of [18]. In the other example, the pancreas single-cell dataset, we use GeneFishing with Ucut and find 7777 genes that GO enrichment analysis relates to pancreas insulin secretion. Therefore, we are optimistic that applying GeneFishing with Ucut to more real datasets will uncover interesting signals in biology and other fields.

References

  • [1] Raghu Raj Bahadur “Some approximations to the binomial distribution function” In The Annals of Mathematical Statistics JSTOR, 1960, pp. 43–54
  • [2] Fadoua Balabdaoui et al. “On the Grenander estimator at zero” In Statistica Sinica 21.2 NIH Public Access, 2011, pp. 873
  • [3] O Barndorff-Nielsen “Identifiability of mixtures of exponential families” In Journal of Mathematical Analysis and Applications 12.1 Elsevier, 1965, pp. 115–121
  • [4] Lucien Birge “The grenander estimator: A nonasymptotic approach” In The Annals of Statistics JSTOR, 1989, pp. 1532–1549
  • [5] Steven R Dunbar “The de Moivre-Laplace Central Limit Theorem” In Technical Report, 2011
  • [6] Martin D Fraser, Yu Sheng Hsu and J J Walker “Identifiability of finite mixtures of von Mises distributions” In The Annals of Statistics 9.5 JSTOR, 1981, pp. 1130–1134
  • [7] Iris Ivy M Gauran et al. “Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data” In Biometrics 74.2 Wiley Online Library, 2018, pp. 458–471
  • [8] Ulf Grenander “On the theory of mortality measurement: part ii” In Scandinavian Actuarial Journal 1956.2 Taylor & Francis, 1956, pp. 125–153
  • [9] Leonardo Grilli, Carla Rampichini and Roberta Varriale “Binomial mixture modeling of university credits” In Communications in Statistics-Theory and Methods 44.22 Taylor & Francis, 2015, pp. 4866–4879
  • [10] Wassily Hoeffding “Probability Inequalities for Sums of Bounded Random Variables” In Journal of the American Statistical Association 58.301 Taylor & Francis, 1963, pp. 13–30 DOI: 10.1080/01621459.1963.10500830
  • [11] Jian Huang and Jon A Wellner “Estimation of a monotone density or monotone hazard under random censoring” In Scandinavian Journal of Statistics JSTOR, 1995, pp. 3–33
  • [12] Youping Huang and Cun-Hui Zhang “Estimating a monotone density from censored observations” In The Annals of Statistics JSTOR, 1994, pp. 1256–1274
  • [13] Zhen Huang and Åke Sjöholm “Ethanol acutely stimulates islet blood flow, amplifies insulin secretion, and induces hypoglycemia via nitric oxide and vagally mediated mechanisms” In Endocrinology 149.1 Oxford University Press, 2008, pp. 232–236
  • [14] Hanna K Jankowski and Jon A Wellner “Estimation of a discrete monotone distribution” In Electronic journal of statistics 3 NIH Public Access, 2009, pp. 1567
  • [15] John T Kent “Identifiability of finite mixtures for directional data” In The Annals of Statistics JSTOR, 1983, pp. 984–988
  • [16] Marc Kéry “Estimating abundance from bird counts: binomial mixture models uncover complex covariate relationships” In The Auk 125.2 Oxford University Press, 2008, pp. 336–345
  • [17] János Komlós, Péter Major and Gábor Tusnády “An approximation of partial sums of independent RV’-s, and the sample DF. I” In Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 32.1-2 Springer, 1975, pp. 111–131
  • [18] Ke Liu et al. “GeneFishing to reconstruct context specific portraits of biological processes” In Proceedings of the National Academy of Sciences 116.38 National Acad Sciences, 2019, pp. 18943–18950
  • [19] Frederic M Lord “Estimating true-score distributions in psychological testing (an empirical Bayes estimation problem)” In Psychometrika 34.3 Springer, 1969, pp. 259–299
  • [20] Frederic M Lord and Noel Cressie “An empirical Bayes procedure for finding an interval estimate” In Sankhyā: The Indian Journal of Statistics, Series B JSTOR, 1975, pp. 1–9
  • [21] U Lüxmann-Ellinghaus “On the identifiability of mixtures of infinitely divisible power series distributions” In Statistics & probability letters 5.5 Elsevier, 1987, pp. 375–378
  • [22] AW Marshall “Discussion on Barlow and van Zwet’s paper” In Nonparametric Techniques in Statistical Inference 1969 Cambridge University Press New York, 1970, pp. 174–176
  • [23] Albert W Marshall and Frank Proschan “Maximum likelihood estimation for distributions with monotone failure rate” In The annals of mathematical statistics 36.1 JSTOR, 1965, pp. 69–77
  • [24] Pascal Massart “The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality” In The annals of Probability JSTOR, 1990, pp. 1269–1283
  • [25] Trent McDonald et al. “Evidence of Absence Regression: A Binomial N-Mixture Model for Estimating Bird and Bat Fatalities at Wind Power Facilities” In bioRxiv Cold Spring Harbor Laboratory, 2020 DOI: 10.1101/2020.01.21.914754
  • [26] Christopher J Nolan et al. “Fatty acid signaling in the β\beta-cell and insulin secretion” In Diabetes 55.Supplement 2 Am Diabetes Assoc, 2006, pp. S16–S23
  • [27] Katherine M O’Donnell, Frank R Thompson III and Raymond D Semlitsch “Partitioning detectability components in populations subject to within-season temporary emigration using binomial mixture models” In PLoS One 10.3 Public Library of Science, 2015, pp. e0117216
  • [28] GP Patil and Sheela Bildikar “Identifiability of countable mixtures of discrete probability distributions using methods of infinite matrices” In Mathematical Proceedings of the Cambridge Philosophical Society 62.3, 1966, pp. 485–494 Cambridge University Press
  • [29] BLS Prakasa Rao “Estimation for distributions with monotone failure rate” In The annals of mathematical statistics JSTOR, 1970, pp. 507–519
  • [30] J Andrew Royle “N-mixture models for estimating population size from spatially replicated counts” In Biometrics 60.1 Wiley Online Library, 2004, pp. 108–115
  • [31] Theofanis Sapatinas “Identifiability of mixtures of power-series distributions and related characterizations” In Annals of the Institute of Statistical Mathematics 47.3 Springer, 1995, pp. 447–459
  • [32] S Sivaganesan and James Berger “Robust Bayesian analysis of the binomial empirical Bayes problem” In Canadian Journal of Statistics 21.1 Wiley Online Library, 1993, pp. 107–119
  • [33] GM Tallis “The identifiability of mixtures of distributions” In Journal of Applied Probability 6.2 JSTOR, 1969, pp. 389–398
  • [34] GM Tallis and Peter Chesson “Identifiability of mixtures” In Journal of the Australian Mathematical Society 32.3 Cambridge University Press, 1982, pp. 339–348
  • [35] Muneki Tanaka et al. “Eating pattern and the effect of oral glucose on ghrelin and insulin secretion in patients with anorexia nervosa” In Clinical endocrinology 59.5 Wiley Online Library, 2003, pp. 574–579
  • [36] Henry Teicher “Identifiability of finite mixtures” In The annals of Mathematical statistics JSTOR, 1963, pp. 1265–1269
  • [37] Henry Teicher “Identifiability of mixtures” In The annals of Mathematical statistics 32.1 Institute of Mathematical Statistics, 1961, pp. 244–248
  • [38] Hoben Thomas “A binomial mixture model for classification performance: A commentary on Waxman, Chambers, Yntema, and Gelman (1989)” In Journal of Experimental Child Psychology 48.3 Elsevier, 1989, pp. 423–430
  • [39] Aad Van der Vaart and Mark Van der Laan “Smooth estimation of a monotone density” In Statistics: A Journal of Theoretical and Applied Statistics 37.3 Taylor & Francis, 2003, pp. 189–203
  • [40] Y. Wang “Nonparametric Estimation Subject to Shape Restrictions” University of California, Berkeley, 1992 URL: https://books.google.com/books?id=QmVMAQAAMAAJ
  • [41] Larry Wasserman “All of nonparametric statistics” Springer Science & Business Media, 2006
  • [42] GR Wood “Binomial mixtures: geometric estimation of the mixing distribution” In The Annals of Statistics 27.5 Institute of Mathematical Statistics, 1999, pp. 1706–1721
  • [43] Michael Woodroofe and Jiayang Sun “A penalized maximum likelihood estimate of f (0+) when f is non-increasing” In Statistica Sinica JSTOR, 1993, pp. 501–515
  • [44] Guohui Wu, Scott H Holan, Charles H Nilon and Christopher K Wikle “Bayesian binomial mixture models for estimating abundance in ecological monitoring studies” In Annals of Applied Statistics 9.1 Institute of Mathematical Statistics, 2015, pp. 1–26
  • [45] Sidney J Yakowitz and John D Spragins “On the identifiability of finite mixtures” In The Annals of Mathematical Statistics JSTOR, 1968, pp. 209–214

Appendix

Appendix A The Identifiability Issue of Binomial Mixture Model

We review existing results for mixtures of distributions and the binomial mixture model in the literature. A general mixture model is defined as

H(x)=Ωh(x|θ)𝑑G(θ),H(x)=\int_{\Omega}h(x|\theta)dG(\theta), (A.1)

where h(|θ)h(\cdot|\theta) is a density function for all θΩ\theta\in\Omega, h(x|)h(x|\cdot) is a Borel measurable function for each xx, and GG is a distribution function defined on Ω\Omega. The family h(x|θ)h(x|\theta), θΩ\theta\in\Omega is referred to as the kernel of the mixture and GG as the mixing distribution function. In order to devise statistical procedures for inferential purposes, an important requirement is the identifiability of the mixing distribution. Without this condition, it is not meaningful to estimate the distribution either non-parametrically or parametrically. The mixture HH defined by (A.1) is said to be identifiable if there exists a unique GG yielding HH, or equivalently, if the relationship

H(x)=Ωh(x|θ)𝑑G1(θ)=Ωh(x|θ)𝑑G2(θ)H(x)=\int_{\Omega}h(x|\theta)dG_{1}(\theta)=\int_{\Omega}h(x|\theta)dG_{2}(\theta)

implies G1(θ)=G2(θ)G_{1}(\theta)=G_{2}(\theta) for all θΩ\theta\in\Omega.

The identifiability problems concerning finite and countable mixtures (i.e., when the support of GG in (A.1) is finite or countable, respectively) have been investigated by [36, 28, 45, 33, 6, 34, 15]. Examples of identifiable finite mixtures include the family of one-dimensional Gaussian distributions {N(μ,σ2),<μ<,0<σ2<}\{N(\mu,\sigma^{2}),-\infty<\mu<\infty,0<\sigma^{2}<\infty\}, the family of one-dimensional gamma distributions, the family of multiple products of exponential distributions, the multivariate Gaussian family, the union of the last two families, the family of one-dimensional Cauchy distributions, etc.

Regarding sufficient conditions for the identifiability of arbitrary mixtures, [37] studied the identifiability of mixtures of additively closed families, while [3] discussed the identifiability of mixtures of some restricted multivariate exponential families. [21] gave a sufficient condition for the identifiability of mixtures of a large class of discrete distributions, namely the power-series distributions. Using topological arguments, he showed that if the family in question is infinitely divisible, then mixtures of this family are identifiable. For example, the Poisson, negative binomial, and logarithmic series distributions are infinitely divisible, so arbitrary mixtures of them are identifiable.

On the other hand, despite being a power-series distribution, the binomial distribution is not infinitely divisible, so the identifiability of its mixtures over the success parameter is not established by this route [31]. In fact, the binomial mixture has often been regarded as unidentifiable, as FF can be determined only up to its first mm moments even when HH is known exactly. To be more specific, the marginal probability H(x)H(x) is a linear function of the first mm moments μr=01sr𝑑F(s),r=1,2,,m\mu_{r}=\int_{0}^{1}s^{r}dF(s),r=1,2,\ldots,m of F(s)F(s), for every xx. Therefore, any other F(s)F^{\prime}(s) with the same first mm moments yields the same mixed distribution H(x)H(x). To ensure identifiability of the binomial mixture model, it is common practice to assume that FF corresponds to a finite discrete pmf or a parametric density (e.g., a beta distribution).
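To make this statement concrete, expanding (1s)mx(1-s)^{m-x} inside the binomial kernel expresses the marginal pmf as an explicit linear combination of the first mm moments:

\mathbb{P}(X=x)=\binom{m}{x}\int_{0}^{1}s^{x}(1-s)^{m-x}\,dF(s)=\binom{m}{x}\sum_{j=0}^{m-x}(-1)^{j}\binom{m-x}{j}\mu_{x+j},\qquad x=0,1,\ldots,m,

so any two mixing distributions that share μ1,,μm\mu_{1},\ldots,\mu_{m} induce exactly the same marginal distribution of XX.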

In particular, there are two results for the identifiability of the binomial mixture model [36]:

  1. 1.

    Suppose h(x|s)h(x|s) in (A.1) is a binomial distribution with a known binomial size mm, and the support of FF contains only KK points. A necessary and sufficient condition for identifiability is m2K1m\geq 2K-1.

  2. 2.

    Consider h(x|mj,sj)h(x|m_{j},s_{j}) as a binomial distribution with binomial size mjm_{j} and probability sjs_{j}, where 0<sj<10<s_{j}<1, j=1,2,j=1,2,\ldots and the support of FF is {s1,s2,}\{s_{1},s_{2},\ldots\}. If mjmjm_{j}\neq m_{j^{\prime}} for jjj\neq j^{\prime}, then (A.1) is identifiable.

In this study, we are interested in the regime where the support size of FF may not be finite, and thus identifiability may fail for the binomial mixture model. Some efforts have been made in the absence of identifiability. [20, 32] constructed credible intervals for the Bayes estimators of each point mass and of FF, which rely on the lower-order moments of the mixing distribution. [42] empirically showed that their proposed nonparametric maximum likelihood estimator of FF is unique with high probability when mm is large. However, these results are far from satisfactory for our ultimate goal of estimating or inferring the underlying distribution FF.

Appendix B A motivating Example: GeneFishing

The study of model (1.2) with a large mm is motivated by the GeneFishing method [18]. Provided with some knowledge about a biological process as “bait”, GeneFishing is designed to “fish out” (or identify) genes that are not yet known to be related to this process. In that work, the authors used a set of 2121 pre-identified “bait genes”, all of which have known roles in cholesterol metabolism, and applied GeneFishing to three independent RNAseq datasets of human lymphoblastoid cell lines. They found that this approach identified other genes with known roles in cholesterol metabolism, with high levels of consistency across the three datasets. They also applied GeneFishing to GTEx human liver RNAseq data and identified the gene glyoxalase I (GLO1); in a follow-up wet-lab experiment, GLO1 knockdown increased levels of cellular cholesterol esters.

The GeneFishing procedure is as follows, as shown in Figure 7:

  1. 1.

    Split the nn candidate genes randomly into many sub-search-spaces of LL genes per sub-group (e.g., L=100L=100), and then add the bait genes to each sub-group. This step is the key reduction of the search space, which helps the “signal” stand out from the “noise”.

  2. 2.

    Construct the Spearman co-expression matrices for gene pairs contained within each sub-search-space. Apply the spectral clustering algorithm (with the number of clusters equal to 22) to each matrix separately. In most cases, the bait genes cluster separately from the candidate genes. But in some instances, candidate gene(s) related to the bait genes will cluster with them. When this occurs, the candidate gene is regarded as being “fished out”.

  3. 3.

    Repeat steps 1 and 2 (which together define one round of GeneFishing) mm times (e.g., m=10,000m=10,000) to reduce the chance that a candidate gene is retained merely because it randomly co-clusters with the bait genes.

  4. 4.

    Aggregate the results from all rounds; the ii-th gene is fished out XiX_{i} times out of mm. The final output is a table that records the “capture frequency rate” (CFRi:=s^i=Xi/mCFR_{i}:=\hat{s}_{i}=X_{i}/m); a schematic sketch of this aggregation is given after the list. The “fished-out” genes with large CFR values are regarded as “discoveries”. Note, however, that we do not claim these “discoveries” perform the same or a similar function as the bait genes; we only believe they are likely to be functionally related to them. Figure 1 displays the distribution of the XiX_{i}’s with m=10,000m=10,000 and the total number of genes n=21,000n=21,000 on four tissues for the cholesterol-relevant genes.
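Schematically, and abstracting the co-clustering step of each round into a user-supplied function one_round (our placeholder, since the spectral-clustering details live in [18]), the aggregation into CFRs can be sketched as follows.

import numpy as np

def gene_fishing_cfr(candidate_ids, bait_ids, one_round, m_rounds=10_000, L=100, seed=0):
    # Aggregate GeneFishing rounds into capture frequency rates (CFRs).
    # one_round(bait_ids, subgroup_ids) should return the subset of subgroup_ids that
    # co-clusters with the bait genes in this round; it is assumed to access the
    # expression data through a closure.
    rng = np.random.default_rng(seed)
    candidate_ids = np.asarray(candidate_ids)
    counts = {g: 0 for g in candidate_ids}
    for _ in range(m_rounds):
        perm = rng.permutation(candidate_ids)                 # step 1: random split
        for start in range(0, perm.size, L):
            subgroup = perm[start:start + L]
            for g in one_round(bait_ids, subgroup):           # step 2: co-clustering
                counts[g] += 1                                 # gene g is "fished out"
    return {g: counts[g] / m_rounds for g in candidate_ids}   # step 4: CFR_i = X_i / m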

Figure 7: Workflow of GeneFishing (Fig 1 (e) of [18]). CFR stands for Capture Frequency Rate.

Compared to other works for “gene prioritization”, GeneFishing has three merits. First, it takes care of extreme inhomogeneity and low signal-to-noise ratio in high-throughput data using dimensionality reduction by subsampling. Second, it uses clustering to identify 2121 tightly clustered bait genes, which is a data-driven confirmation of the domain knowledge. Finally, GeneFishing leverages the technique of results aggregation (motivated by a bagging-like idea) in order to prioritize genes relevant to the biological process and reduce false discoveries.

However, there remains an open question on how large a CFR should be so that the associated gene is labeled as a “discovery”. [18] picked a large cutoff 0.990.99 by eye. It is acceptable when the histograms are sparse in the middle as in the liver or the transverse colon tissues (Figure 1 (a)(b)), since cutoff=0.25=0.25 and cutoff=0.75=0.75 make little difference in determining the discoveries. On the other hand, for the artery coronary and testis tissues (Figure 1 (c)(d)), the non-trivial middle part of the histogram is a mixture of the null (irrelevant to the biological process) distribution and the alternative (relevant to the biological process) distribution. The null and the alternative are hard to separate since the middle part is quite flat. Existing tools using parametric models to estimate local false discovery rates [7] are not able to account for the middle flatness well. They tend to select a smaller cutoff and produce excessive false discoveries. [18] provided a permutation-like procedure to compute approximate p-values and false discovery rates (FDRs). Nonetheless, there are two problems with this procedure. On the one hand, it is substantially computationally expensive, considering another round of permutation is added on top of numerous fishing rounds. On the other hand, the permutation idea is based on a strong null hypothesis that none of the candidate non-bait genes are relevant to the bait genes, thus producing an extreme p-value or FDR, which is unrealistic.

To identify a reasonable cutoff on CFRs, we assume there is an underlying fishing rate sis_{i}, reflecting the probability that the ii-th gene is fished out in each GeneFishing round. The fishing rates are assumed to be independently sampled from the same distribution FF. Thus, the observation CFRiCFR_{i} (or si^\hat{s_{i}}) can be modelled by (1.2). We mention that the independence assumption is raised mainly for convenience but not realistic de facto since genes are interactive and correlated in the same pathways or even remotely. Consequently, the effective sample size is smaller than expected. However, this assumption is still acceptable by assuming only a handful of candidate genes are related to the bait genes, which is reasonable from the biological perspective. Furthermore, we notice that in Figure 1 there exists a clear pattern that the histogram is decreasing on the left-hand side and increasing on the right-hand side for all four tissues. In the middle, liver and colon transverse display sparse densities while artery coronary and testis exhibit flat ones. Thus, we can impose a U shape constraint on the associated density of FF (see Section 4.1 for details). Then the original problem becomes finding out the cutoff where the flat middle part transits to the increasing part on the right-hand side.

Appendix C Review of Grenander Estimator

Monotone density models are often used in survival analysis and reliability analysis in economics—see [12, 11]. We can apply maximum likelihood for the monotone density estimation. Suppose that X1,,XnX_{1},\ldots,X_{n} is a random sample from a density ff on [0,)[0,\infty) that is known to be nonincreasing; the maximum likelihood estimator f~n\tilde{f}_{n} is defined as the nonincreasing density that maximizes the log-likelihood (f)=i=1nlogf(Xi).\ell(f)=\sum_{i=1}^{n}\log f(X_{i}). [8] first showed that this optimization problem has a unique solution under the monotone assumption — so the estimator is also called the Grenander estimator. The Grenander estimator is explicitly given as the left derivative of the least concave majorant of the empirical distribution function FnF_{n}. The least concave majorant of FnF_{n} is defined as the smallest concave function F~n\tilde{F}_{n} with F~nFn\tilde{F}_{n}\geq F_{n} for every xx. Because F~n\tilde{F}_{n} is concave, its derivative is nonincreasing.

[23] showed that the Grenander estimator is consistent when ff is decreasing, as stated in Theorem 16.

Theorem 16 ([23])

Suppose that X1,,XnX_{1},\ldots,X_{n} are i.i.d random variables with a decreasing density ff on [0,)[0,\infty). Then the Grenander estimator f~n\tilde{f}_{n} is uniformly consistent on closed intervals bounded away from zero: that is, for each c>0c>0, we have

supcx<|f~n(x)f(x)|0 a.s.\sup_{c\leq x<\infty}|\tilde{f}_{n}(x)-f(x)|\rightarrow 0\text{ a.s.}

The inconsistency of the Grenander estimator at 0, when f(0)f(0) is bounded, was first pointed out by [43]. [2] later extended this result to other situations, where they consider different behavior near zero of the true density. Theorem 17 explicitly characterizes the behavior of f~n\tilde{f}_{n} at zero.

Theorem 17 ([43])

Suppose that ff is a decreasing density on [0,)[0,\infty) with 0<f(0)<0<f(0)<\infty, and let N(t)N(t) denote a rate 11 Poisson process. Then

f~n(0)f(0)𝑑supt>0N(t)t=𝑑1U,\frac{\tilde{f}_{n}(0)}{f(0)}\overset{d}{\rightarrow}\sup_{t>0}\frac{N(t)}{t}\overset{d}{=}\frac{1}{U},

where UU is a uniform random variable on the unit interval.

[4] proved that the Grenander estimator has a cube-root convergence rate in the L1L_{1} norm, as in Theorem 18. [39] pointed out that the rate of convergence of the Grenander estimator is slower than that of the monotone kernel density estimator when the underlying function is smooth enough.

Theorem 18 ([4])

Suppose ff is a decreasing density on [0,)[0,\infty) with 0<f(0)<0<f(0)<\infty. It follows that

𝔼f01|f~n(x)f(x)|𝑑xCn13,{\mathbb{E}}_{f}\int_{0}^{1}|\tilde{f}_{n}(x)-f(x)|dx\leq C\cdot n^{-\frac{1}{3}},

where CC is a constant that only depends on ff.

[29] first obtained the limiting distribution of the Grenander estimator at a point: he proved that n3(f~n(t0)f(t0))\sqrt[3]{n}(\tilde{f}_{n}(t_{0})-f(t_{0})) converges to the location of the maximum of the process {B(x)x2,x}\{B(x)-x^{2},x\in{\mathbb{R}}\}, where f(t0)<0f^{\prime}(t_{0})<0 and B(x)B(x) is the standard two-sided Brownian motion on {\mathbb{R}} with B(0)=0B(0)=0. [40] extended this result to flat regions and higher-order derivatives, as stated in Theorem 19.

Theorem 19 ([40])

Suppose ff is a decreasing density on [0,1][0,1] and is smooth at t0(0,1)t_{0}\in(0,1). It follows that

  • (A)

    If ff is flat in a neighborhood of t0t_{0}, let [a,b][a,b] be the flat part containing t0t_{0}. Then,

    n(f~n(t0)f(t0))𝑑S^a,b(t0),\sqrt{n}(\tilde{f}_{n}(t_{0})-f(t_{0}))\overset{d}{\rightarrow}\hat{S}_{a,b}(t_{0}),

    where S^a,b(t)\hat{S}_{a,b}(t) is the slope at F(t)F(t) of the least concave majorant in [F(a),F(b)][F(a),F(b)] of a standard Brownian Bridge in [0,1][0,1].

  • (B)

    If f(t)f(t0)f(k)(t0)(tt0)kf(t)-f(t_{0})\sim f^{(k)}(t_{0})(t-t_{0})^{k} near t0t_{0} for some kk with f(k)(t0)<0f^{(k)}(t_{0})<0, then

    nk2k+1[fk(t0)|f(k)(t0)|(k+1)!]12k+1(f~n(t0)f(t0))𝑑Vk(0),n^{\frac{k}{2k+1}}[\frac{f^{k}(t_{0})|f^{(k)}(t_{0})|}{(k+1)!}]^{-\frac{1}{2k+1}}(\tilde{f}_{n}(t_{0})-f(t_{0}))\overset{d}{\rightarrow}V_{k}(0),

    where Vk(t)V_{k}(t) is the slope at tt of the least concave majorant of the process {B(t)|t|k+1,t(,)}\{B(t)-|t|^{k+1},t\in(-\infty,\infty)\}, and B(t)B(t) is a standard two-sided Brownian motion on {\mathbb{R}} with B(0)=0B(0)=0.

Appendix D Proof of Proposition 4

Proof Suppose

f(x)=1.8𝕀(x[0,1/2])+0.2𝕀(x(1/2,1]).f(x)=1.8\cdot\mathbb{I}(x\in[0,1/2])+0.2\cdot\mathbb{I}(x\in(1/2,1]).

Note that

(sk/m)(s^k/m)\displaystyle\mathbb{P}(s\leq k/m)-\mathbb{P}(\hat{s}\leq k/m) (D.1)
=\displaystyle= 0k/mf(u)𝑑urk01(mr)ur(1u)mrf(u)𝑑u\displaystyle\int_{0}^{k/m}f(u)du-\sum_{r\leq k}\int_{0}^{1}{m\choose r}u^{r}(1-u)^{m-r}f(u)du
=\displaystyle= 01(𝕀[uk/m]rk(mr)ur(1u)mr)f(u)𝑑u\displaystyle\int_{0}^{1}\left(\mathbb{I}[u\leq k/m]-\sum_{r\leq k}{m\choose r}u^{r}(1-u)^{m-r}\right)f(u)du

Decompose f(x)=f1(x)+f2(x)f(x)=f_{1}(x)+f_{2}(x), where f1(x)=1.6𝕀(x[0,1/2])f_{1}(x)=1.6\cdot\mathbb{I}(x\in[0,1/2]), f2(x)0.2f_{2}(x)\equiv 0.2. The previous example shows that the difference for the f2f_{2} part in Equation (D.1) is at most 0.2m+1.\frac{0.2}{m+1}. So we only need to take care of the f1f_{1} part in Equation (D.1), i.e.,

1.6×01/2(𝕀[uk/m]rk(mr)ur(1u)mr)𝑑u=1.6×01/2Bk(m,u)𝑑u,1.6\times\int_{0}^{1/2}\left(\mathbb{I}[u\leq k/m]-\sum_{r\leq k}{m\choose r}u^{r}(1-u)^{m-r}\right)du=1.6\times\int_{0}^{1/2}B_{k}(m,u)du,

provided k/m1/2k/m\leq 1/2. Here Bk(m,x)=r=km(mk)xr(1x)mrB_{k}(m,x)=\sum_{r=k}^{m}{m\choose k}x^{r}(1-x)^{m-r}. Define

Ak(m,x)=[(mk)xk(1x)mk+1][(k+1)/(k+1(m+1)x)].A_{k}(m,x)=[{m\choose k}x^{k}(1-x)^{m-k+1}]\cdot[(k+1)/(k+1-(m+1)x)].

Theorem 1 of [1] indicates that 1Ak(m,x)/Bk(m,x)1+z21\leq A_{k}(m,x)/B_{k}(m,x)\leq 1+z^{-2}, where z=(kmx)/(mx(1x))12z=(k-mx)/(mx(1-x))^{\frac{1}{2}}. Let x=1/2ϵx=1/2-\epsilon. By Stirling’s formula and a Taylor expansion of log(1+ϵ)\log(1+\epsilon), we obtain

Am/2(m,1/2ϵ)12πme2mϵ2,A_{m/2}(m,1/2-\epsilon)\sim\frac{1}{\sqrt{2\pi m}}e^{-2m\epsilon^{2}},

and we can rewrite

z=mϵ1/4ϵ2.z=\frac{\sqrt{m}\epsilon}{\sqrt{1/4-\epsilon^{2}}}.

So we have

Bm/2(m,1/2ϵ)Am/2(m,1/2ϵ)1+z22mϵe2mϵ24(m1)ϵ2+12mϵe2mϵ24mϵ2+1.\displaystyle B_{m/2}(m,1/2-\epsilon)\geq\frac{A_{m/2}(m,1/2-\epsilon)}{1+z^{-2}}\sim\frac{2\sqrt{m}\epsilon e^{-2m\epsilon^{2}}}{4(m-1)\epsilon^{2}+1}\geq\frac{2\sqrt{m}\epsilon e^{-2m\epsilon^{2}}}{4m\epsilon^{2}+1}.

Then it follows that

01/2Bm/2(m,1/2ϵ)𝑑ϵ01/m2mϵe2mϵ24mϵ2+1𝑑ϵ=1m012e2u24u2+1𝑑u0.81m.\int_{0}^{1/2}B_{m/2}(m,1/2-\epsilon)d\epsilon\geq\int_{0}^{1/\sqrt{m}}\frac{2\sqrt{m}\epsilon e^{-2m\epsilon^{2}}}{4m\epsilon^{2}+1}d\epsilon=\frac{1}{\sqrt{m}}\int_{0}^{1}\frac{2e^{-2u^{2}}}{4u^{2}+1}du\approx\frac{0.81}{\sqrt{m}}.

Together, we have (s1/2)(s^1/2)Cm+εm1\mathbb{P}(s\leq 1/2)-\mathbb{P}(\hat{s}\leq 1/2)\geq\frac{C}{\sqrt{m}}+\varepsilon\cdot m^{-1}, where ε\varepsilon is a residual term with |ε|K|\varepsilon|\leq K, CC and KK are positive constants.  

Appendix E Proof of Proposition 7

Proof The proof of this theorem follows from that of Theorem 8, so we defer most details to the proof of the latter.

From Equation (F.5), we know that

(s^1B(x))(s1B(x))\displaystyle\mathbb{P}(\hat{s}_{1}\in B(x))-\mathbb{P}(s_{1}\in B(x))
=\displaystyle= d=1D[(s^1B(x),s1B(x+dh))(s1B(x),s^1B(x+dh))]\displaystyle\sum_{d=1}^{D}[\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h))-\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h))]
+d=1D[(s^1B(x),s1B(xdh))(s1B(x),s^1B(xdh))]\displaystyle+\sum_{d=1}^{D}[\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x-d\cdot h))-\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x-d\cdot h))]
+(s^1B(x),s1B(x+dh):dD+1 or dD1)\displaystyle+\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h):d\geq D+1\text{ or }d\leq-D-1)
+(s1B(x),s^1B(x+dh):dD+1 or dD1),\displaystyle+\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h):d\geq D+1\text{ or }d\leq-D-1),

where B(x)B(x) is still defined as the bin that contains xx among [0,h],(h,2h],,(1h,1][0,h],(h,2h],\ldots,(1-h,1]. Note that

(s^1x)(s1x)\displaystyle\mathbb{P}(\hat{s}_{1}\leq x)-\mathbb{P}(s_{1}\leq x)
=\displaystyle= (s^1x,s1>x)(s^1>x,s1x)\displaystyle\mathbb{P}(\hat{s}_{1}\leq x,s_{1}>x)-\mathbb{P}(\hat{s}_{1}>x,s_{1}\leq x)
=\displaystyle= (s^1B(x),s1B(x),s^1x,s1>x)(s^1B(x),s1B(x),s^1>x,s1x)\displaystyle\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x),\hat{s}_{1}\leq x,s_{1}>x)-\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x),\hat{s}_{1}>x,s_{1}\leq x)
+d=1,2,d=1,2,[(s^1B(xdh),s1B(x+dh))(s^1B(x+dh),s1B(xdh))].\displaystyle+\sum_{{d=1,2,\ldots}\atop{d^{\prime}=1,2,\ldots}}[\mathbb{P}(\hat{s}_{1}\in B(x-dh),s_{1}\in B(x+d^{\prime}h))-\mathbb{P}(\hat{s}_{1}\in B(x+dh),s_{1}\in B(x-d^{\prime}h))].
=\displaystyle= (s^1B(x),s1B(x),s^1x,s1>x)(s^1B(x),s1B(x),s^1>x,s1x)\displaystyle\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x),\hat{s}_{1}\leq x,s_{1}>x)-\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x),\hat{s}_{1}>x,s_{1}\leq x)
+(|dd|D+|dd|>D)[(s^1B(xdh),s1B(x+dh))\displaystyle+(\sum_{|d-d^{\prime}|\leq D}+\sum_{|d-d^{\prime}|>D})[\mathbb{P}(\hat{s}_{1}\in B(x-dh),s_{1}\in B(x+d^{\prime}h))
(s^1B(x+dh),s1B(xdh))].\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\mathbb{P}(\hat{s}_{1}\in B(x+dh),s_{1}\in B(x-d^{\prime}h))].

Following the proof of bounding (s^1B(x),s1B(x+dh):dD+1 or dD1)\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h):d\geq D+1\text{ or }d\leq-D-1) in Theorem 8, we can bound

||dd|>D[(s^1B(xdh),s1B(x+dh))(s^1B(x+dh),s1B(xdh))|\displaystyle|\sum_{|d-d^{\prime}|>D}[\mathbb{P}(\hat{s}_{1}\in B(x-dh),s_{1}\in B(x+d^{\prime}h))-\mathbb{P}(\hat{s}_{1}\in B(x+dh),s_{1}\in B(x-d^{\prime}h))|
\displaystyle\leq 2fmaxexp(2mD2h2)=2fmaxm,\displaystyle 2f_{\max}\cdot\exp(-2mD^{2}h^{2})=\frac{2f_{\max}}{m},

where D=logm2mh2D=\lceil\sqrt{\frac{\log m}{2mh^{2}}}\rceil.
Following the proof of bounding |d=1D[(s^1B(x),s1B(x+dh))(s1B(x),s^1B(x+dh))]||\sum_{d=1}^{D}[\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h))-\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h))]| in Theorem 8, we can bound

|(s^1B(x),s1B(x),s^1x,s1>x)(s^1B(x),s1B(x),s^1>x,s1x)|\displaystyle|\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x),\hat{s}_{1}\leq x,s_{1}>x)-\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x),\hat{s}_{1}>x,s_{1}\leq x)|
\displaystyle\leq K1(fmaxhm+fmaxm+hfmaxm)+||,\displaystyle K_{1}\cdot(\frac{f_{\max}\cdot h}{\sqrt{m}}+\frac{f_{\max}}{m}+\frac{h\cdot f^{\prime}_{\max}}{\sqrt{m}})+|\mathcal{E}|,

where ||K2(hfmaxm+h2fmax+fmaxm)|\mathcal{E}|\leq K_{2}\cdot(\frac{h\cdot f_{\max}}{\sqrt{m}}+h^{2}f_{\max}+\frac{f_{\max}}{m}) for some constant K2K_{2} that only depends on aa.
To bound

||dd|D[(s^1B(xdh),s1B(x+dh))(s^1B(x+dh),s1B(xdh))|,|\sum_{|d-d^{\prime}|\leq D}[\mathbb{P}(\hat{s}_{1}\in B(x-dh),s_{1}\in B(x+d^{\prime}h))-\mathbb{P}(\hat{s}_{1}\in B(x+dh),s_{1}\in B(x-d^{\prime}h))|,

we still follow the proof of bounding |d=1D[(s^1B(x),s1B(x+dh))(s1B(x),s^1B(x+dh))]||\sum_{d=1}^{D}[\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h))-\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h))]| in Theorem 8. The problem is that there are D2D^{2} terms instead of DD terms. By carefully arranging the D2D^{2} terms, and using the fact that d=1exp(C0d2)<\sum_{d=1}^{\infty}\exp(-C_{0}d^{2})<\infty for any positive C0C_{0}, it is easy to see that we still get the same bound as above. In total, we have

|(s^1x)(s1x)|K3(fmax+fmax)(hm+h2+1m),|\mathbb{P}(\hat{s}_{1}\leq x)-\mathbb{P}(s_{1}\leq x)|\leq K_{3}\cdot(f_{\max}+f^{\prime}_{\max})\cdot(\frac{h}{\sqrt{m}}+h^{2}+\frac{1}{m}),

where K3K_{3} is some constant that only hinges on aa. To minimize the upper bound, take h=1mh=\frac{1}{\sqrt{m}}. Thus, it follows that

supx[a,1a]|F(m)(x)F(x)|Cm,\sup_{x\in[a,1-a]}|F^{(m)}(x)-F(x)|\leq\frac{C}{m},

where CC is some constant that only depends on ff and aa.  

Appendix F Proof of Theorem 8

Theorem 8 relies on the local limit theorem for the binomial distribution, as follows.

Lemma 20

Suppose XBinom(m,s)X\sim Binom(m,s), with 0<s<10<s<1. For any a<ba<b such that Δ:=max{|ams|,|bms|}0\Delta:=\max\{|\frac{a}{m}-s|,|\frac{b}{m}-s|\}\rightarrow 0 as mm\rightarrow\infty, we have

(aXb)=[amsms(1s)bmsms(1s)12πexp(t2/2)𝑑t](1+ε1)+ε2,\mathbb{P}(a\leq X\leq b)=\left[\int_{\frac{a-ms}{\sqrt{ms(1-s)}}}^{\frac{b-ms}{\sqrt{ms(1-s)}}}\frac{1}{\sqrt{2\pi}}\exp(-t^{2}/2)dt\right](1+\varepsilon_{1})+\varepsilon_{2},

where ε1\varepsilon_{1} is the error from the Gaussian approximation to Binomial point mass function, and ε2\varepsilon_{2} is the error from the summation series approximating the integral. Specifically, we have

|ε1|KΔ,|\varepsilon_{1}|\leq K\cdot\Delta,
|ε2|C[exp(mδ2)m+bamΔexp(mδ2)],|\varepsilon_{2}|\leq C\cdot[\frac{\exp(-m\delta^{2})}{\sqrt{m}}+\frac{b-a}{\sqrt{m}}\cdot\Delta\cdot\exp(-m\delta^{2})],

where δ:=minx[am,bm]|sx|\delta:=\min_{x\in[\frac{a}{m},\frac{b}{m}]}|s-x|, KK and CC are two positive constants that depend on ss via 1s(1s)\frac{1}{s(1-s)}.

The detailed proof of Lemma 20 can be easily obtained by adapting that of [5]. Now we prove Theorem 8:
Proof Denote by Rx=𝔼(f(x)f^n,m(x))2R_{x}={\mathbb{E}}(f(x)-\hat{f}_{n,m}(x))^{2} the risk at a point xx. We decompose the risk into the variance and the bias square as follows.

Rx=var(f^n,m(x))+(𝔼f^n,m(x)f(x))2.R_{x}=\textit{var}(\hat{f}_{n,m}(x))+({\mathbb{E}}\hat{f}_{n,m}(x)-f(x))^{2}. (F.1)

For the variance part, denote by pl=(ms^1Bl)p_{l}=\mathbb{P}(m\cdot\hat{s}_{1}\in B_{l}). We have

a1avar(f^n,m(x))𝑑x=aLl(1a)LBlpl(1pl)nh2K11nh,\displaystyle\int_{a}^{1-a}\textit{var}(\hat{f}_{n,m}(x))dx=\sum_{aL\leq l\leq(1-a)L}\int_{B_{l}}\frac{p_{l}(1-p_{l})}{nh^{2}}\leq K_{1}\cdot\frac{1}{nh}, (F.2)

where K1K_{1} is a positive constant that relies on a>0a>0. For the bias part, since

(𝔼f^n,m(x)f(x))22(𝔼f^n,m(x)𝔼f^n(x))2+2(𝔼f^n(x)f(x))2,({\mathbb{E}}\hat{f}_{n,m}(x)-f(x))^{2}\leq 2({\mathbb{E}}\hat{f}_{n,m}(x)-{\mathbb{E}}\hat{f}_{n}(x))^{2}+2({\mathbb{E}}\hat{f}_{n}(x)-f(x))^{2}, (F.3)

and it is well known that

01(𝔼f^n(x)f(x))2𝑑xK2h2,\int_{0}^{1}({\mathbb{E}}\hat{f}_{n}(x)-f(x))^{2}dx\leq K_{2}\cdot h^{2}, (F.4)

where K2K_{2} is positive constant that only relies on ff. We only need to consider 𝔼f^n,m(x)𝔼f^n(x){\mathbb{E}}\hat{f}_{n,m}(x)-{\mathbb{E}}\hat{f}_{n}(x). By definition,

𝔼f^n,m(x)𝔼f^n(x)\displaystyle{\mathbb{E}}\hat{f}_{n,m}(x)-{\mathbb{E}}\hat{f}_{n}(x) (F.5)
=\displaystyle= 1h[(s^1B(x))(s1B(x))]\displaystyle\frac{1}{h}[\mathbb{P}(\hat{s}_{1}\in B(x))-\mathbb{P}(s_{1}\in B(x))]
=\displaystyle= 1h[(s^1B(x),s1B(x))(s1B(x),s^1B(x))]\displaystyle\frac{1}{h}[\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\not\in B(x))-\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\not\in B(x))]
=\displaystyle= 1h{d=1D[(s^1B(x),s1B(x+dh))(s1B(x),s^1B(x+dh))]\displaystyle\frac{1}{h}\{\sum_{d=1}^{D}[\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h))-\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h))]
+d=1D[(s^1B(x),s1B(xdh))(s1B(x),s^1B(xdh))]\displaystyle+\sum_{d=1}^{D}[\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x-d\cdot h))-\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x-d\cdot h))]
+(s^1B(x),s1B(x+dh):dD+1 or dD1)\displaystyle+\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h):d\geq D+1\text{ or }d\leq-D-1)
+(s1B(x),s^1B(x+dh):dD+1 or dD1)}.\displaystyle+\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h):d\geq D+1\text{ or }d\leq-D-1)\}.

By McDiarmid’s inequality, it follows that

(s1B(x),s^1B(x+dh):dD+1 or dD1)\displaystyle\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h):d\geq D+1\text{ or }d\leq-D-1)
=\displaystyle= s1B(x)[𝔼𝕀(s^1B(x+dh):dD+1 or dD1|s1)]f(s1)ds1\displaystyle\int_{s_{1}\in B(x)}[{\mathbb{E}}\mathbb{I}(\hat{s}_{1}\in B(x+d\cdot h):d\geq D+1\text{ or }d\leq-D-1|s_{1})]f(s_{1})ds_{1}
\displaystyle\leq s1B(x)(|s^1s1|Dh|s1)f(s1)𝑑s1\displaystyle\int_{s_{1}\in B(x)}\mathbb{P}(|\hat{s}_{1}-s_{1}|\geq D\cdot h|s_{1})f(s_{1})ds_{1}
\displaystyle\leq s1B(x)exp(2mD2h2)f(s1)𝑑s1\displaystyle\int_{s_{1}\in B(x)}\exp(-2mD^{2}h^{2})f(s_{1})ds_{1}
\displaystyle\leq fmaxhexp(2mD2h2),\displaystyle f_{\max}\cdot h\cdot\exp(-2mD^{2}h^{2}),

where fmaxf_{\max} is the maximal value of ff in [0,1][0,1]. Take D=logm4mh2D=\lceil\sqrt{\frac{\log m}{4mh^{2}}}\rceil, i.e., the least integer that is not smaller than logm4mh2\sqrt{\frac{\log m}{4mh^{2}}}. Then we have

(s1B(x),s^1B(x+dh):dD+1 or dD1)K3fmaxhm,\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h):d\geq D+1\text{ or }d\leq-D-1)\leq K_{3}\cdot\frac{f_{\max}\cdot h}{\sqrt{m}}, (F.6)

where K3K_{3} is a universal positive constant. Similarly, we can show that (s^1B(x),s1B(x+dh):dD+1 or dD1)K4fmaxm\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h):d\geq D+1\text{ or }d\leq-D-1)\leq K_{4}\cdot\frac{f_{\max}}{m} for some positive constant K4K_{4}. Next, we investigate (s^1B(x),s1B(x+dh))(s1B(x),s^1B(x+dh))\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h))-\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h)). Denote by l(x)l(x) and r(x)r(x) the left boundary and the right boundary of the interval B(x)B(x). By Lemma 20, it follows that

(s1B(x),s^1B(x+dh))\displaystyle\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h))
=\displaystyle= s1B(x)[k:kmB(x+dh)(mk)(s1)k(1s1)mk]f(s1)𝑑s1\displaystyle\int_{s_{1}\in B(x)}\left[\sum_{k:\frac{k}{m}\in B(x+d\cdot h)}{m\choose k}(s_{1})^{k}(1-s_{1})^{m-k}\right]f(s_{1})ds_{1}
=\displaystyle= sB(x){[tB(x+dh)m2πs(1s)exp(m(ts)22s(1s))𝑑t](1+ε5(d+1)h)(I)\displaystyle\int_{s\in B(x)}\left\{\underbrace{\left[\int_{t\in B(x+d\cdot h)}\frac{\sqrt{m}}{\sqrt{2\pi s(1-s)}}\exp(-\frac{m(t-s)^{2}}{2s(1-s)})dt\right]\cdot(1+\varepsilon_{5}\cdot(d+1)h)}_{\text{(I)}}\right.
+ε6exp(m[r(x)+(d1)hs]2)m(II)\displaystyle\left.~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\underbrace{\varepsilon_{6}\cdot\frac{\exp(-m[r(x)+(d-1)h-s]^{2})}{\sqrt{m}}}_{\text{(II)}}\right.
\left.~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\underbrace{\varepsilon_{7}\cdot\sqrt{m}h\cdot[r(x)+dh-s]\cdot\exp(-m[r(x)+(d-1)h-s]^{2})}_{\text{(III)}}\right\}f(s)ds,

where |ε5|K5|\varepsilon_{5}|\leq K_{5}, |ε6|K6|\varepsilon_{6}|\leq K_{6}, |ε7|K7|\varepsilon_{7}|\leq K_{7} and K5,K6,K7K_{5},K_{6},K_{7} are positive constants that only depend on aa. We consider the summation of the DD error terms in (s1B(x),s^1B(x+dh))\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h)), d=1,,Dd=1,\ldots,D:

  1. (I)
    \displaystyle\sum_{d=1}^{D}\int_{s\in B(x)}\left[\int_{t\in B(x+d\cdot h)}\frac{\sqrt{m}}{\sqrt{2\pi s(1-s)}}e^{-\frac{m(t-s)^{2}}{2s(1-s)}}dt\right]K_{5}(d+1)hf(s)ds
    \displaystyle\leq K8h2fmax+d=3DK9sB(x)[tB(x+dh)m(d+1)hem(ts)22s(1s)𝑑t]f(s)𝑑s\displaystyle K_{8}h^{2}f_{\max}+\sum_{d=3}^{D}K_{9}\int_{s\in B(x)}\left[\int_{t\in B(x+d\cdot h)}\sqrt{m}(d+1)h\cdot e^{-\frac{m(t-s)^{2}}{2s(1-s)}}dt\right]f(s)ds
    \displaystyle\leq K8h2fmax+d=3DK9sB(x)[tB(x+dh)m(d+1)hem(ts)2𝑑t]f(s)𝑑s\displaystyle K_{8}h^{2}f_{\max}+\sum_{d=3}^{D}K_{9}\int_{s\in B(x)}\left[\int_{t\in B(x+d\cdot h)}\sqrt{m}(d+1)h\cdot e^{-m(t-s)^{2}}dt\right]f(s)ds
    \displaystyle\leq K8h2fmax+d=3DK9sB(x)[tB(x+dh)2m(d1)hem(ts)2𝑑t]f(s)𝑑s\displaystyle K_{8}h^{2}f_{\max}+\sum_{d=3}^{D}K_{9}\int_{s\in B(x)}\left[\int_{t\in B(x+d\cdot h)}2\sqrt{m}(d-1)h\cdot e^{-m(t-s)^{2}}dt\right]f(s)ds
    \displaystyle\leq K8h2fmax+d=3DK9sB(x)[tB(x+dh)2m(ts)em(ts)2𝑑t]f(s)𝑑s\displaystyle K_{8}h^{2}f_{\max}+\sum_{d=3}^{D}K_{9}\int_{s\in B(x)}\left[\int_{t\in B(x+d\cdot h)}2\sqrt{m}(t-s)\cdot e^{-m(t-s)^{2}}dt\right]f(s)ds
    =\displaystyle= K8h2fmax\displaystyle K_{8}h^{2}f_{\max}
    +d=3D2K9msB(x)[em(r(x)s+(d1)h)2em(r(x)s+dh)2]f(s)𝑑s\displaystyle+\sum_{d=3}^{D}\frac{2K_{9}}{\sqrt{m}}\int_{s\in B(x)}\left[e^{-m(r(x)-s+(d-1)h)^{2}}-e^{-m(r(x)-s+dh)^{2}}\right]f(s)ds
    =\displaystyle= K8h2fmax+K~9hfmaxm,\displaystyle K_{8}h^{2}f_{\max}+\tilde{K}_{9}\frac{hf_{\max}}{\sqrt{m}},

    where K8,K9,K~9K_{8},K_{9},\tilde{K}_{9} are positive constants that only depend on aa.

  2. (II)
    d=1DsB(x)K6exp(m[r(x)+(d1)hs]2)mf(s)𝑑s\displaystyle\sum_{d=1}^{D}\int_{s\in B(x)}K_{6}\cdot\frac{\exp(-m[r(x)+(d-1)h-s]^{2})}{\sqrt{m}}\cdot f(s)ds
    \displaystyle\leq K6fmaxm0(D1)hexp(mt2)𝑑t\displaystyle K_{6}\cdot\frac{f_{\max}}{\sqrt{m}}\cdot\int_{0}^{(D-1)h}\exp(-mt^{2})dt
    \displaystyle\leq K~6fmaxm,\displaystyle\tilde{K}_{6}\cdot\frac{f_{\max}}{m},

    where K~6\tilde{K}_{6} is a positive constant that only depends on aa.

  3. (III)
    d=1DsB(x)K7mh[r(x)+dhs]em[r(x)+(d1)hs]2f(s)𝑑s\displaystyle\sum_{d=1}^{D}\int_{s\in B(x)}K_{7}\cdot\sqrt{m}h\cdot[r(x)+dh-s]\cdot e^{-m[r(x)+(d-1)h-s]^{2}}\cdot f(s)ds
    =\displaystyle= d=1DsB(x)K7mh[r(x)+(d1)hs+h]em[r(x)+(d1)hs]2f(s)𝑑s\displaystyle\sum_{d=1}^{D}\int_{s\in B(x)}K_{7}\cdot\sqrt{m}h\cdot[r(x)+(d-1)h-s+h]\cdot e^{-m[r(x)+(d-1)h-s]^{2}}\cdot f(s)ds
    \displaystyle\leq K~7{h2fmax+hfmaxmd=1D[em(d1)2h2emd2h2]}\displaystyle\tilde{K}_{7}\cdot\left\{h^{2}\cdot f_{\max}+\frac{h\cdot f_{\max}}{\sqrt{m}}\cdot\sum_{d=1}^{D}[e^{-m(d-1)^{2}h^{2}}-e^{-md^{2}h^{2}}]\right\}
    =\displaystyle= K~7{h2fmax+hfmaxm[1emD2h2]}\displaystyle\tilde{K}_{7}\cdot\left\{h^{2}\cdot f_{\max}+\frac{h\cdot f_{\max}}{\sqrt{m}}\cdot[1-e^{-mD^{2}h^{2}}]\right\}
    \displaystyle\leq K~7(h2fmax+hfmaxm),\displaystyle\tilde{K}_{7}\cdot(h^{2}\cdot f_{\max}+\frac{h\cdot f_{\max}}{\sqrt{m}}),

    where K~7>0\tilde{K}_{7}>0 only depends on aa.

We denote the sum of the DD error terms by \mathcal{E}, which satisfies ||K10(hfmaxm+h2fmax+fmaxm)|\mathcal{E}|\leq K_{10}\cdot(\frac{h\cdot f_{\max}}{\sqrt{m}}+h^{2}f_{\max}+\frac{f_{\max}}{m}), where K10>0K_{10}>0 only depends on aa. The sum \tilde{\mathcal{E}} of the DD error terms in (s^1B(x),s1B(x+dh))\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h)), d=1,,Dd=1,\ldots,D, satisfies the same bound. Now we consider

|d=1D(s^1B(x),s1B(x+dh))(s1B(x),s^1B(x+dh))|\displaystyle|\sum_{d=1}^{D}\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x+d\cdot h))-\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x+d\cdot h))| (F.7)
\displaystyle\overset{(i)}{\leq}\sum_{d=1}^{D}\int_{s\in B(x)}\int_{t\in B(x+d\cdot h)}\left|\frac{\sqrt{m}}{\sqrt{2\pi s(1-s)}}e^{-\frac{m(t-s)^{2}}{2s(1-s)}}f(s)\right.
m2πt(1t)em(st)22t(1t)f(t)|dtds+||+|~|\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\left.-\frac{\sqrt{m}}{\sqrt{2\pi t(1-t)}}e^{-\frac{m(s-t)^{2}}{2t(1-t)}}f(t)\right|dtds+|\mathcal{E}|+|\tilde{\mathcal{E}}|
(ii)\displaystyle\overset{(ii)}{\leq} d=1DsB(x)tB(x+dh)||2s~1|(1s~)52s~52m32(ts)322πem(ts)22s~(1s~)f(s~)|𝑑t𝑑s\displaystyle\sum_{d=1}^{D}\int_{s\in B(x)}\int_{t\in B(x+d\cdot h)}\left|\frac{|2\tilde{s}-1|}{(1-\tilde{s})^{\frac{5}{2}}\tilde{s}^{\frac{5}{2}}}\cdot\frac{m^{\frac{3}{2}}(t-s)^{3}}{2\sqrt{2\pi}}e^{-\frac{m(t-s)^{2}}{2\tilde{s}(1-\tilde{s})}}f(\tilde{s})\right|dtds
+d=1DsB(x)tB(x+dh)||2s~1|(1s~)32s~32m12(ts)22πem(ts)22s~(1s~)f(s~)|𝑑t𝑑s\displaystyle+\sum_{d=1}^{D}\int_{s\in B(x)}\int_{t\in B(x+d\cdot h)}\left|\frac{|2\tilde{s}-1|}{(1-\tilde{s})^{\frac{3}{2}}\tilde{s}^{\frac{3}{2}}}\cdot\frac{m^{\frac{1}{2}}(t-s)}{2\sqrt{2\pi}}e^{-\frac{m(t-s)^{2}}{2\tilde{s}(1-\tilde{s})}}f(\tilde{s})\right|dtds
+d=1DsB(x)tB(x+dh)|1(1s~)12s~12m12(ts)2πem(ts)22s~(1s~)f(s~)|𝑑t𝑑s+||+|~|\displaystyle+\sum_{d=1}^{D}\int_{s\in B(x)}\int_{t\in B(x+d\cdot h)}\left|\frac{1}{(1-\tilde{s})^{\frac{1}{2}}\tilde{s}^{\frac{1}{2}}}\cdot\frac{m^{\frac{1}{2}}(t-s)}{\sqrt{2\pi}}e^{-\frac{m(t-s)^{2}}{2\tilde{s}(1-\tilde{s})}}f^{\prime}(\tilde{s})\right|dtds+|\mathcal{E}|+|\tilde{\mathcal{E}}|
(iii)\displaystyle\overset{(iii)}{\leq} d=1DsB(x)tB(x+dh)|K11fmaxm32(ts)3em(ts)2|𝑑t𝑑s\displaystyle\sum_{d=1}^{D}\int_{s\in B(x)}\int_{t\in B(x+d\cdot h)}\left|K_{11}\cdot f_{\max}\cdot m^{\frac{3}{2}}(t-s)^{3}e^{-m(t-s)^{2}}\right|dtds
+d=1DsB(x)tB(x+dh)|K12(fmax+fmax)m12(ts)em(ts)2|𝑑t𝑑s+||+|~|\displaystyle+\sum_{d=1}^{D}\int_{s\in B(x)}\int_{t\in B(x+d\cdot h)}\left|K_{12}\cdot(f_{\max}+f^{\prime}_{\max})\cdot m^{\frac{1}{2}}(t-s)e^{-m(t-s)^{2}}\right|dtds+|\mathcal{E}|+|\tilde{\mathcal{E}}|
\displaystyle\overset{(iv)}{\leq}K_{13}\cdot(\frac{f_{\max}\cdot h}{\sqrt{m}}+\frac{f_{\max}}{m}+\frac{h\cdot f^{\prime}_{\max}}{\sqrt{m}})+|\mathcal{E}|+|\tilde{\mathcal{E}}|,

where K_{11},K_{12},K_{13} are positive constants that only depend on a. Inequality (i) uses Fubini's theorem together with the triangle inequality; Inequality (ii) applies the mean value theorem to the function g(s)=\frac{1}{s(1-s)}\exp(-\frac{A}{2s(1-s)})f(s), where A is a constant; Inequality (iii) holds since 0<a\leq s\leq\tilde{s}\leq t\leq 1-a<1, so \frac{1}{(1-\tilde{s})\tilde{s}} is bounded and \exp(-\frac{m(t-s)^{2}}{2\tilde{s}(1-\tilde{s})}) attains its maximum at \tilde{s}=1/2; Inequality (iv) is obtained via integration by parts. Similarly, \sum_{d=1}^{D}[\mathbb{P}(\hat{s}_{1}\in B(x),s_{1}\in B(x-d\cdot h))-\mathbb{P}(s_{1}\in B(x),\hat{s}_{1}\in B(x-d\cdot h))] admits the same bound as (F.7). Putting (F.5), (F.6) and (F.7) together, we have

|𝔼f^n,m(x)𝔼f^n(x)|K14(fmax+fmax)(1m+h+1mh),|{\mathbb{E}}\hat{f}_{n,m}(x)-{\mathbb{E}}\hat{f}_{n}(x)|\leq K_{14}\cdot(f_{\max}+f^{\prime}_{\max})\cdot(\frac{1}{\sqrt{m}}+h+\frac{1}{mh}), (F.8)

where K_{14} is a constant that only depends on a. Combining Inequalities (F.1), (F.2), (F.3), (F.4) and (F.8), it follows that

R(a,1-a)\leq C_{1}\cdot(h^{2}+\frac{1}{m}+\frac{1}{m^{2}h^{2}}+\frac{1}{nh}).

The minimal risk is no larger than C4n23C_{4}\cdot n^{-\frac{2}{3}}, which is attained when h=C3n13h=C_{3}\cdot n^{-\frac{1}{3}}, mC2n23m\geq C_{2}\cdot n^{\frac{2}{3}}. Here C1,C2,C3,C4C_{1},C_{2},C_{3},C_{4} are positive constants that only depend on aa and ff.  
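As a quick arithmetic check of this bandwidth choice (a routine verification spelled out here for convenience, not an additional assumption), plugging h=C_{3}\cdot n^{-\frac{1}{3}} into the bound above gives

h^{2}\asymp n^{-\frac{2}{3}},\qquad\frac{1}{nh}\asymp n^{-\frac{2}{3}},\qquad\frac{1}{m^{2}h^{2}}\asymp\frac{n^{\frac{2}{3}}}{m^{2}},\qquad\frac{1}{m},

and the last two terms are of order n^{-\frac{2}{3}} or smaller exactly when m\geq C_{2}\cdot n^{\frac{2}{3}} (up to constants), so all four terms, and hence R(a,1-a), are \mathcal{O}(n^{-\frac{2}{3}}) under the stated condition on m.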

Appendix G Proof of Theorem 9

Proof  For the point mass function p(x)=\sum_{k=1}^{K}\alpha(k)\mathbb{I}(x=x_{k}), we have an additional piece of information that holds in the discrete case but not in the density case:

maxxIk+d|xxk|=(d+12)1K.\max_{x\in I_{k+d}}|x-x_{k}|=(d+\frac{1}{2})\cdot\frac{1}{K}.

We follow the proof of Theorem 8 and can show that

R(a,1a)C1(1n+1m+K2m2).R(a,1-a)\leq C_{1}\cdot(\frac{1}{n}+\frac{1}{m}+\frac{K^{2}}{m^{2}}).

When m\geq C_{2}\cdot\sqrt{n}\max\{K,\sqrt{n}\}, we have R(a,1-a)\leq C_{3}\cdot\frac{1}{n}. Here C_{1},C_{2},C_{3}>0 do not depend on n, m and K. 

Appendix H Proof of Theorem 10

We first define a few concepts. Let JJ denote the interval [a,b)[a,b), where a=0a=0 and b=1b=1 in our setup. We set

l(J)=ba,f(J)=Jf(t)𝑑t,Δf(J)=f(b)f(a),l(J)=b-a,f(J)=\int_{J}f(t)dt,\Delta f(J)=f(b)-f(a),
bf(J)=J|f(J)/l(J)f(t)|𝑑t.bf(J)=\int_{J}|f(J)/l(J)-f(t)|dt.

Any finite increasing sequence \{x_{i}\}_{0\leq i\leq q} with x_{0}=a, x_{q}=b generates a partition \mathcal{P} of J into intervals J_{i}=[x_{i-1},x_{i}), 1\leq i\leq q. When no confusion arises, we write f_{i} for f(J_{i}), \Delta f_{i} for \Delta f(J_{i}) and so on. Define a functional L(\mathcal{P},f,z) for positive z by

L(𝒫,f,z)=i=1q[bf(Ji)+z(f(Ji))1/2]=i=1q[bfi+zfi1/2].L(\mathcal{P},f,z)=\sum_{i=1}^{q}[bf(J_{i})+z(f(J_{i}))^{1/2}]=\sum_{i=1}^{q}[bf_{i}+zf_{i}^{1/2}].

Before proving Theorem 10, we state the needed lemma, which is adapted from Lemma 1 in [4]. The proof also follows that of Lemma 1 in [4].

Lemma 21

Let \mathcal{P}=\{J_{i}\}_{1\leq i\leq q} be some partition of J, F an absolutely continuous distribution function, F_{n,m} the corresponding empirical c.d.f. based on the \hat{s}_{i}, and \tilde{F}_{n,m}=\tilde{F}_{n,m}^{J}, \tilde{F}_{n,m}^{i}=\tilde{F}_{n,m}^{J_{i}} the related Grenander estimators defined on the associated intervals. Define

F¯n,m(x)=i=1qF~n,mi(x)𝕀xJi,\bar{F}_{n,m}(x)=\sum_{i=1}^{q}\tilde{F}_{n,m}^{i}(x)\mathbb{I}_{x\in J_{i}},

with ff and f¯n,m\bar{f}_{n,m} to be the respective derivatives of FF and F¯n,m\bar{F}_{n,m}. Then

𝔼[J|f~n,m(x)f(x)|𝑑x]𝔼[J|f¯n,m(x)f(x)|𝑑x].{\mathbb{E}}\left[\int_{J}|\tilde{f}_{n,m}(x)-f(x)|dx\right]\leq{\mathbb{E}}\left[\int_{J}|\bar{f}_{n,m}(x)-f(x)|dx\right].

In the proof, there are many similar pieces of notation that might be confusing. We list all of them below for clarity and ease of reference.

  • Let F^{(m)}(x):=\mathbb{P}(\hat{s}_{i}\leq x), i.e. the c.d.f. of \hat{s}_{i}. Let f^{(m)} be the derivative of F^{(m)}.

  • For any interval JJ, Fn,mJ(x)=1ni=1n𝕀(s^ix;s^iJ)F_{n,m}^{J}(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(\hat{s}_{i}\leq x;\hat{s}_{i}\in J), FnJ(x)=1ni=1n𝕀(six;siJ)F_{n}^{J}(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(s_{i}\leq x;s_{i}\in J). Fn,mF_{n,m} and FnF_{n} correspond to Fn,mJ(x)F_{n,m}^{J}(x) and FnJ(x)F_{n}^{J}(x) with J=[0,1]J=[0,1] in our setup.

  • \tilde{F}^{J}_{n,m} and \tilde{F}^{J}_{n} are the respective least concave majorants of F_{n,m}^{J} and F_{n}^{J} restricted to the interval J. Let \tilde{f}^{J}_{n,m} and \tilde{f}^{J}_{n} be the derivatives of \tilde{F}^{J}_{n,m} and \tilde{F}^{J}_{n}. \tilde{F}_{n,m}, \tilde{F}_{n} and \tilde{f}_{n,m}, \tilde{f}_{n} correspond to \tilde{F}_{n,m}^{J}(x), \tilde{F}_{n}^{J}(x) and \tilde{f}^{J}_{n,m}, \tilde{f}^{J}_{n} with J=[0,1] in our setup.

  • For any partition 𝒫={Ji}1iq\mathcal{P}=\{J_{i}\}_{1\leq i\leq q}, let f¯n,m\bar{f}_{n,m} be the derivative of F¯n,m(x)=i=1q𝕀(xJi)F~n,mJi(x).\bar{F}_{n,m}(x)=\sum_{i=1}^{q}\mathbb{I}(x\in J_{i})\tilde{F}_{n,m}^{J_{i}}(x).

  • l(J)=bal(J)=b-a, f(J)=Jf(t)𝑑tf(J)=\int_{J}f(t)dt, Δf(J)=f(b)f(a)\Delta f(J)=f(b)-f(a), bf(J)=J|f(J)/l(J)f(t)|𝑑tbf(J)=\int_{J}|f(J)/l(J)-f(t)|dt.

  • For any partition 𝒫={Ji}1iq\mathcal{P}=\{J_{i}\}_{1\leq i\leq q}, LJ(𝒫,f,z)=i=1q[bf(Ji)+z(f(Ji))1/2]=i=1q[bfi+zfi1/2]L^{J}(\mathcal{P},f,z)=\sum_{i=1}^{q}[bf(J_{i})+z(f(J_{i}))^{1/2}]=\sum_{i=1}^{q}[bf_{i}+zf_{i}^{1/2}]. LJ(f,z)=inf𝒫LJ(𝒫,f,z)L^{J}(f,z)=\inf_{\mathcal{P}}L^{J}(\mathcal{P},f,z).

  • M:=0bfp(t)𝑑tM:=\int_{0}^{b}f^{p}(t)dt for some p>2p>2; H:=limxbf(x)H:=\lim_{x\rightarrow b-}f(x).

  • For an interval I:=[aI,bI)I:=[a^{I},b^{I}),

    • Let NN and NmN_{m} be the number of sis_{i}’s and s^i\hat{s}_{i}’s that fall in II, respectively.

    • Define GG and G(m)G^{(m)} to be the respective conditional c.d.f’s of the sis_{i}’s and s^i\hat{s}_{i}’s that fall in II, gg and g(m)g^{(m)} their derivatives.

    • Define GN(x)=i=1n𝕀[six;siI]G_{N}(x)=\sum_{i=1}^{n}\mathbb{I}[s_{i}\leq x;s_{i}\in I], GNm(x):=i=1n𝕀[s^ix;s^iI]G_{N_{m}}(x):=\sum_{i=1}^{n}\mathbb{I}[\hat{s}_{i}\leq x;\hat{s}_{i}\in I].

    • Define \tilde{G}_{N_{m},m} and \tilde{G}_{N} to be the respective least concave majorants of G_{N_{m}} and G_{N} conditional on I. Let \tilde{g}_{N_{m},m} and \tilde{g}_{N} be the derivatives of \tilde{G}_{N_{m},m} and \tilde{G}_{N}, respectively.

Proof

We want to show that

𝔼f[J|f(x)f~n,m(x)|𝑑x]3LJ(f,K~n12)+C~f(J)m12,{\mathbb{E}}_{f}\left[\int_{J}|f(x)-\tilde{f}_{n,m}(x)|dx\right]\leq 3L^{J}(f,\tilde{K}\cdot n^{-\frac{1}{2}})+\tilde{C}\cdot f(J)\cdot m^{-\frac{1}{2}},

where K~,C~\tilde{K},\tilde{C} are universal constants. Then, the L1L_{1} convergence of f~n,m\tilde{f}_{n,m} only hinges on the characteristics of ff. For example, when ff is a decreasing function on J=[0,1]J=[0,1] such that M:=0bfp(t)𝑑t<+M:=\int_{0}^{b}f^{p}(t)dt<+\infty for some p>2p>2 and H=limxbf(x)>0H=\lim_{x\rightarrow b-}f(x)>0, then Proposition 4 of [4] shows that

z2/3LJ(f,z)3/2(H/(Hh))p1(bMH2p/(p2))1/3,z^{-2/3}L^{J}(f,z)\leq 3/2(H/(H-h))^{p-1}(bMH^{2-p}/(p-2))^{1/3}, (H.1)

where h^{3}=z^{2}b^{-2}MH^{2-p}/(p-2) and Mz^{2}<(p-2)b^{2}H^{p+1}. This implies that the L_{1} error of \tilde{f}_{n,m} is at most 3L^{J}(f,\tilde{K}\cdot n^{-\frac{1}{2}})+\tilde{C}\cdot f(J)\cdot m^{-\frac{1}{2}}\leq C_{1}\cdot n^{-\frac{1}{3}}+C_{2}\cdot m^{-\frac{1}{2}}, where C_{1}, C_{2} are positive constants.

By Lemma 21 it is sufficient to prove that for any partition 𝒫={Ji}1iq\mathcal{P}=\{J_{i}\}_{1\leq i\leq q} of JJ, we have

𝔼f[J|f(x)f¯n,m(x)|𝑑x]3i=1q[bfi+fiK~n12]+i=1qC~fim12,{\mathbb{E}}_{f}\left[\int_{J}|f(x)-\bar{f}_{n,m}(x)|dx\right]\leq 3\sum_{i=1}^{q}[bf_{i}+\sqrt{f_{i}}\cdot\tilde{K}\cdot n^{-\frac{1}{2}}]+\sum_{i=1}^{q}\tilde{C}\cdot f_{i}\cdot m^{-\frac{1}{2}},

where f¯n,m\bar{f}_{n,m} is the derivative of F¯n,m(x)=i=1qF~n,mJi(x)𝕀(xJi)\bar{F}_{n,m}(x)=\sum_{i=1}^{q}\tilde{F}_{n,m}^{J_{i}}(x)\mathbb{I}(x\in J_{i}). This is certainly true if for any arbitrary sub-interval I=[aI,bI)I=[a^{I},b^{I}) of JJ, the below inequality holds

𝔼f[I|f(x)f~n,mI|𝑑x]3[bf(I)+f(I)K~n12]+C~f(I)m12.{\mathbb{E}}_{f}\left[\int_{I}|f(x)-\tilde{f}_{n,m}^{I}|dx\right]\leq 3\left[bf(I)+\sqrt{f(I)}\cdot\tilde{K}\cdot n^{-\frac{1}{2}}\right]+\tilde{C}\cdot f(I)\cdot m^{-\frac{1}{2}}.

In order to prove this inequality, suppose that N of the s_{i}'s and N_{m} of the \hat{s}_{i}'s fall in the interval I. Here, N has a binomial distribution Binomial(n,f(I)) and N_{m} has a binomial distribution Binomial(n,f^{(m)}(I)), where f^{(m)} is the derivative of the c.d.f. F^{(m)}(x)=\mathbb{P}[\hat{s}_{i}\leq x]. Then with \tilde{f}_{n,m}=\tilde{f}_{n,m}^{I},

I|f(x)f~n,m(x)|𝑑xbf(I)+|f(I)N/n|+|N/nNm/n|+bf~n,m(I).\int_{I}|f(x)-\tilde{f}_{n,m}(x)|dx\leq bf(I)+|f(I)-N/n|+|N/n-N_{m}/n|+b\tilde{f}_{n,m}(I). (H.2)

The only difficulty comes from the last term. Define GG and G(m)G^{(m)} to be the respective conditional c.d.f’s of the sis_{i}’s and s^i\hat{s}_{i}’s that fall in II, gg and g(m)g^{(m)} their derivatives. Then

𝔼f(m)[bf~n,m(I)|Nm]=Nm/n𝔼g(m)[bg~Nm,m(I)|Nm]{\mathbb{E}}_{f^{(m)}}[b\tilde{f}_{n,m}(I)|N_{m}]=N_{m}/n{\mathbb{E}}_{g^{(m)}}[b\tilde{g}_{N_{m},m}(I)|N_{m}]

because the joint distribution of the NmN_{m} s^i\hat{s}_{i}’s falling in II given NmN_{m} is the same as the distribution of NmN_{m} i.i.d variables from G(m)G^{(m)}. If U(x)U(x) is the uniform c.d.f on II, then

1/2bg~Nm,m(I)\displaystyle 1/2b\tilde{g}_{N_{m},m}(I) =(a)\displaystyle\overset{(a)}{=} supxI[G~Nm,m(x)U(x)]\displaystyle\sup_{x\in I}[\tilde{G}_{N_{m},m}(x)-U(x)]
=(b)\displaystyle\overset{(b)}{=} supxI[GNm(x)U(x)]\displaystyle\sup_{x\in I}[G_{N_{m}}(x)-U(x)]
\displaystyle\leq supxI[GNm(x)G(x)]+supxI[G(x)U(x)]\displaystyle\sup_{x\in I}[G_{N_{m}}(x)-G(x)]+\sup_{x\in I}[G(x)-U(x)]
\displaystyle\leq supxI[GNm(x)G(x)]+1/2bg(I).\displaystyle\sup_{x\in I}[G_{N_{m}}(x)-G(x)]+1/2bg(I).

Here Equation (a)(a) holds because this is an equivalent expression of the total variation for G~Nm,m(x)\tilde{G}_{N_{m},m}(x) with a non-increasing derivative and U(x)U(x) with a flat density. Equation (b)(b) holds because

  • G~Nm,m(x)GNm(x)\tilde{G}_{N_{m},m}(x)\geq G_{N_{m}}(x) for any xx and the equality occurs when the derivative of G~Nm,m\tilde{G}_{N_{m},m} changes.

  • G~Nm,m(x)U(x)\tilde{G}_{N_{m},m}(x)-U(x) attains the maximum at a point which corresponds to a change of the derivative of G~Nm,m\tilde{G}_{N_{m},m}.

Since bf(I)=f(I)bg(I)bf(I)=f(I)bg(I), we get

{\mathbb{E}}_{f^{(m)}}[b\tilde{f}_{n,m}(I)|N_{m}]\leq N_{m}/n\left[2{\mathbb{E}}_{g^{(m)}}[\sup_{x\in I}(G_{N_{m}}(x)-G(x))|N_{m}]+bf(I)/f(I)\right],

and using Corollary 6,

𝔼f(m)[bf~n,m(I)|Nm]Nm/n[K(1Nm+1m)+bf(I)/f(I)],{\mathbb{E}}_{f^{(m)}}[b\tilde{f}_{n,m}(I)|N_{m}]\leq N_{m}/n\left[K\cdot(\frac{1}{\sqrt{N_{m}}}+\frac{1}{\sqrt{m}})+bf(I)/f(I)\right],

where K>1. Plugging this result into inequality (H.2) and applying the Cauchy-Schwarz inequality, we have

𝔼fI|f(x)f~n,m(x)|𝑑x\displaystyle{\mathbb{E}}_{f}\int_{I}|f(x)-\tilde{f}_{n,m}(x)|dx
\displaystyle\leq bf(I)+𝔼|f(I)N/n|\displaystyle bf(I)+{\mathbb{E}}|f(I)-N/n|
+Kf(m)(I)/n+Kf(m)(I)/m+bf(I)f(m)(I)/f(I)+𝔼|N/nNm/n|.\displaystyle+K\sqrt{f^{(m)}(I)/n}+K\cdot f^{(m)}(I)/\sqrt{m}+bf(I)\cdot f^{(m)}(I)/f(I)+{\mathbb{E}}|N/n-N_{m}/n|.

By the Cauchy-Schwarz inequality, it follows that

𝔼|f(I)N/n|f(I)(1f(I))n.{\mathbb{E}}|f(I)-N/n|\leq\sqrt{\frac{f(I)(1-f(I))}{n}}.

Note that

f(m)(I)\displaystyle f^{(m)}(I) =\displaystyle= 𝔼Nm/n\displaystyle{\mathbb{E}}N_{m}/n
=\displaystyle= 𝔼N/n+𝔼[NmN]/n\displaystyle{\mathbb{E}}N/n+{\mathbb{E}}[N_{m}-N]/n
\displaystyle\leq f(I)+𝔼|NmN|/n\displaystyle f(I)+{\mathbb{E}}|N_{m}-N|/n
=\displaystyle= f(I)+𝔼[𝔼[|NmN|/n|N]]\displaystyle f(I)+{\mathbb{E}}[{\mathbb{E}}[|N_{m}-N|/n|N]]
(c)\displaystyle\overset{(c)}{\leq} f(I)+𝔼[(Cm)N/n]\displaystyle f(I)+{\mathbb{E}}[(\frac{C}{\sqrt{m}})\cdot N/n]
=\displaystyle= f(I)(1+Cm12)\displaystyle f(I)(1+C\cdot m^{-\frac{1}{2}})

where Inequality (c) holds because of Proposition 3 with C>0C>0 (note that Nm/n=Fn,m(bI)Fn,m(aI)N_{m}/n=F_{n,m}(b^{I})-F_{n,m}(a^{I}), and N/n=Fn(bI)Fn(aI)N/n=F_{n}(b^{I})-F_{n}(a^{I})). Thus, for mC2/9m\geq C^{2}/9, it follows that

𝔼fI|f(x)f~n,m(x)|𝑑x3(bf(I)+Kf(I)/n)+2f(I)m(K+C).{\mathbb{E}}_{f}\int_{I}|f(x)-\tilde{f}_{n,m}(x)|dx\leq 3(bf(I)+K\sqrt{f(I)/n})+\frac{2f(I)}{\sqrt{m}}\cdot(K+C).

Then we obtain the initial claim by letting \tilde{K}=K and \tilde{C}=2(K+C). Finally, as in Proposition 4 of [4], we construct the partition \mathcal{P}=\{J_{i}\}_{j\leq i\leq q} of J, where j is the integer such that jh\leq H<(j+1)h, J_{q}=\{x|f(x)\geq q\}, J_{j}=\{x|f(x)<(j+1)h\}, and J_{i}=\{x|ih\leq f(x)<(i+1)h\} for q>i>j. Since f_{\max}<\infty, there are only finitely many intervals. It can be shown that when q is the smallest integer larger than f_{\max}, this partition yields inequality (H.1). 
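Although not needed for the proof, it may help to see the Grenander estimator analyzed above in computational form: it is the left derivative of the least concave majorant of the empirical c.d.f., i.e. a decreasing step density. The following Python sketch is our own illustration (the function name, the Beta(0.5, 1.5) choice of F and the sizes n, m are placeholders, not quantities from the paper):

```python
import numpy as np

def grenander(samples):
    """Minimal sketch of the Grenander estimator on [0, 1]: the left derivative
    of the least concave majorant (LCM) of the empirical c.d.f., i.e. a
    decreasing step density.  Returns (knots, heights) so that the estimate
    equals heights[j] on the interval (knots[j], knots[j + 1]]."""
    s = np.asarray(samples, dtype=float)
    n = s.size
    xs, counts = np.unique(s, return_counts=True)        # sorted unique values
    x = np.concatenate(([0.0], xs))                       # assumes no observation is exactly 0
    y = np.concatenate(([0.0], np.cumsum(counts) / n))    # empirical c.d.f. at those points
    hull = [0]                                            # vertex indices of the LCM
    for i in range(1, len(x)):
        # Drop the last vertex while it lies on or below the chord to point i,
        # so that the slopes along the hull are decreasing (concavity).
        while len(hull) >= 2:
            i1, i2 = hull[-2], hull[-1]
            if (y[i2] - y[i1]) * (x[i] - x[i1]) <= (y[i] - y[i1]) * (x[i2] - x[i1]):
                hull.pop()
            else:
                break
        hull.append(i)
    knots = x[hull]
    heights = np.diff(y[hull]) / np.diff(knots)           # decreasing step heights
    return knots, heights

# Toy usage: binomial proportions X_i / m from a decreasing density on [0, 1].
rng = np.random.default_rng(0)
n, m = 2000, 5000
s_true = rng.beta(0.5, 1.5, size=n)          # Beta(0.5, 1.5) has a decreasing density
s_hat = rng.binomial(m, s_true) / m
knots, heights = grenander(s_hat)
```

The returned heights form a non-increasing step function; its L_{1} distance to the true decreasing density is the quantity bounded in Theorem 10.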

Appendix I Proof of local asymptotics

I.1 Local Inference of Fn,mF_{n,m} when FF is absolutely continuous

Theorem 5 shows that the empirical CDF F_{n,m} is a consistent estimator of the population CDF. We also want to understand the uncertainty of the empirical CDF. The Komlós-Major-Tusnády (KMT) approximation shows that \sqrt{n}(F_{n}(x)-F(x)) can be approximated by a sequence of Brownian bridges \{B_{n}(x),0\leq x\leq 1\} [17]. This result can be extended to the empirical CDF based on the \hat{s}_{i}'s; see Theorem 22. The proof is similar to that of Theorem 5: split F_{n,m}(x)-F(x) into F_{n,m}(x)-F^{(m)}(x) and F^{(m)}(x)-F(x); the former can be bounded by the original KMT approximation and the latter by Proposition 3. One further step is to bound B_{n}(F(x))-B_{n}(F^{(m)}(x)), which is in fact the sum of two independent Gaussian random variables with zero mean and variance upper bounded by |F^{(m)}(x)-F(x)|.

Theorem 22 (Local inference of Fn,mF_{n,m})

Suppose FF corresponds to a density ff on [0,1][0,1] with fmax<f_{\max}<\infty. There exists a sequence of Brownian bridges {Bn(x),0x1}\{B_{n}(x),0\leq x\leq 1\} such that

{sup0x1|n(Fn,m(x)F(x))Bn(F(x))|>2πfmaxnm+alognn+t}b(ecnt+edmt2),\mathbb{P}\left\{\sup_{0\leq x\leq 1}|\sqrt{n}(F_{n,m}(x)-F(x))-B_{n}(F(x))|>\frac{\sqrt{2\pi}f_{\max}\cdot\sqrt{n}}{\sqrt{m}}+\frac{a\log n}{\sqrt{n}}+t\right\}\leq b(e^{-c\sqrt{n}t}+e^{-dmt^{2}}),

for all positive integers nn and all t>0t>0, where aa, bb, cc and dd are positive constants.
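As an informal numerical companion to Theorem 22 (a Monte Carlo sketch of ours, not part of the argument): when m is much larger than n, the theorem suggests that \sqrt{n}\sup_{x}|F_{n,m}(x)-F(x)| is approximately distributed as the supremum of a Brownian bridge in absolute value, i.e. follows the Kolmogorov law. The choice F=Beta(2,2) and the sizes below are placeholders.

```python
import numpy as np
from scipy import stats

# Monte Carlo sanity check: with m >> n, sqrt(n) * sup_x |F_{n,m}(x) - F(x)|
# should be approximately distributed as sup |Brownian bridge| (Kolmogorov law).
rng = np.random.default_rng(0)
n, m, reps = 500, 1_000_000, 200              # placeholder sizes
grid = np.linspace(0.0, 1.0, 1001)
F_grid = stats.beta.cdf(grid, 2, 2)           # F = Beta(2, 2), an arbitrary smooth choice

sup_dev = np.empty(reps)
for r in range(reps):
    s = rng.beta(2, 2, size=n)
    s_hat = rng.binomial(m, s) / m            # observed proportions
    F_nm = np.searchsorted(np.sort(s_hat), grid, side="right") / n
    sup_dev[r] = np.sqrt(n) * np.max(np.abs(F_nm - F_grid))

# The empirical 95% quantile should be close to the Kolmogorov 95% quantile.
print(np.quantile(sup_dev, 0.95), stats.kstwobign.ppf(0.95))
```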

I.2 Proof of Theorem 12

This proof is adapted from [40]. Define

Un,m(a)=sup{x:Fn,m(x)ax is maximal }.U_{n,m}(a)=\sup\{x:F_{n,m}(x)-ax\text{ is maximal }\}.

Then with probability one, we have the switching relation

f~n,m(t)aUn,m(a)t.\tilde{f}_{n,m}(t)\leq a\Leftrightarrow U_{n,m}(a)\leq t. (I.1)

By the relation (I.1), we have

(n(f~n,m(t0)f(t0))x)=(Un,m(f(t0)+n12x)t0).\mathbb{P}(\sqrt{n}(\tilde{f}_{n,m}(t_{0})-f(t_{0}))\leq x)=\mathbb{P}(U_{n,m}(f(t_{0})+n^{-\frac{1}{2}}x)\leq t_{0}).

From the definition of Un,mU_{n,m}, it follows that

Un,m(f(t0)+n12x)\displaystyle U_{n,m}(f(t_{0})+n^{-\frac{1}{2}}x) =\displaystyle= sup{s:Fn,m(s)(f(t0)+n12x)s is maximal }\displaystyle\sup\{s:F_{n,m}(s)-(f(t_{0})+n^{-\frac{1}{2}}x)s\text{ is maximal }\}
\displaystyle=\sup\{s:\sqrt{n}(F_{n,m}(s)-F(s))+\sqrt{n}(F(s)-f(t_{0})s)-xs\text{ is maximal }\}

By Theorem 22,

\sqrt{n}(F_{n,m}(s)-F(s))=B_{n}(F(s))+\mathcal{O}(\frac{\sqrt{n}}{\sqrt{m}})+\mathcal{O}_{p}(\frac{\log n}{\sqrt{n}}),

where \{B_{n},n\in\mathbb{N}\} is a sequence of Brownian bridges, constructed on the same space as F_{n,m}. So the limit distribution of U_{n,m}(f(t_{0})+n^{-\frac{1}{2}}x) is the same as that of the location of the maximum of the process \{B_{n}(F(s))+\sqrt{n}(F(s)-f(t_{0})s)-xs,\ s\geq 0\}. Note that F(s) is concave and linear on [a,b]; then

F(s)=F(a)+f(t0)(sa) for s[a,b],F(s)=F(a)+f(t_{0})(s-a)\text{ for }s\in[a,b],

and

F(s)f(t0)(sa)<F(a) for s[a,b].F(s)-f(t_{0})(s-a)<F(a)\text{ for }s\not\in[a,b].

Hence the location of the maximum of \{B_{n}(F(s))+\sqrt{n}(F(s)-f(t_{0})s)-xs,\ s\geq 0\} behaves asymptotically as that of

{B(F(s))xs,asb}={B(F(a)+f(t0)(sa))xs,asb},\{B(F(s))-xs,a\leq s\leq b\}=\{B(F(a)+f(t_{0})(s-a))-xs,a\leq s\leq b\},

where BB is a standard Brownian bridge in [0,1][0,1]. Thus,

(n(f~n,m(t0)f(t0))x)\displaystyle\mathbb{P}(\sqrt{n}(\tilde{f}_{n,m}(t_{0})-f(t_{0}))\leq x)\rightarrow
\displaystyle\mathbb{P}(\text{the location of the maximum of }\{B(F(s))-xs,\ a\leq s\leq b\}\text{ is at most }t_{0})
=\displaystyle= (S^a,b(t0)x),\displaystyle\mathbb{P}(\hat{S}_{a,b}(t_{0})\leq x),

by the definition of S^\hat{S}. That completes the proof of Part (A). The proof of Part (B) follows in a similar manner.

Appendix J Proof of Theorem 13

Proof  Let \hat{\alpha}_{l}(c_{l}), \hat{\alpha}_{r}(c_{r}), \hat{\alpha}_{mid}(c_{l},c_{r}), \tilde{g}_{l}(c_{l}), \tilde{g}_{r}(c_{r}) be the output of Algorithm 1 with the inputs c_{l}, c_{r} and d_{l}, d_{r}. The corresponding estimator of f is denoted by \tilde{f}_{n,m}. For simplicity, we consider c_{r}^{(0)}=1, i.e., the case where there are only the decreasing part and the flat part. For the general case where c_{r}^{(0)}<1, we just need to focus on [0,\mu].

By Theorem 10, we know that {\mathbb{E}}_{f}\int_{0}^{c_{l}^{(0)}}|\tilde{f}_{n,m}(x)-f(x)|dx\leq K_{1}\cdot N_{l}(c_{l}^{(0)})^{-\frac{1}{3}} when m\geq C_{1}\cdot[N_{l}(c_{l}^{(0)})]^{\frac{2}{3}} for some positive constants K_{1} and C_{1} that only depend on f. For c_{l}>c_{l}^{(0)}, it is easy to see that \lim_{x\rightarrow(c_{l})_{-}}\tilde{f}_{n,m}(x)-\lim_{x\rightarrow(c_{l})_{+}}\tilde{f}_{n,m}(x)=\varepsilon\cdot N_{l}(c_{l}^{(0)})^{-1/2}, where \varepsilon is a residual term with |\varepsilon|\leq K_{2} for some positive constant K_{2}, because the estimator of the flat region converges at a square-root rate [40]. In other words, it is unlikely to find a desired gap beyond c_{l}^{(0)}.

If limx(cl(0))f~n,m(x)α^mid(cl(0),1)1cl(0)+dl\lim_{x\rightarrow(c_{l}^{(0)})_{-}}\tilde{f}_{n,m}(x)\geq\frac{\hat{\alpha}_{mid}(c_{l}^{(0)},1)}{1-c_{l}^{(0)}}+d_{l}, we select the desired cl=cl(0)c_{l}=c_{l}^{(0)}. Otherwise, define

t be the maximal cl such that {cl<cl(0)f~n,m(cl)α^mid(cl(0),1)1cl(0)+dl.t\text{ be the maximal }c_{l}\text{ such that }\left\{\begin{array}[]{l}c_{l}<c_{l}^{(0)}\\ \tilde{f}_{n,m}(c_{l})\geq\frac{\hat{\alpha}_{mid}(c_{l}^{(0)},1)}{1-c_{l}^{(0)}}+d_{l}.\\ \end{array}\right.

Then t<clcl(0)\forall t<c_{l}\leq c_{l}^{(0)}, it follows that

f(cl)f~n,m(cl)\displaystyle f(c_{l})-\tilde{f}_{n,m}(c_{l}) =\displaystyle= f(cl)limx(cl(0))+f(x)(f~n,m(cl)α^mid(cl(0),1)1cl(0))\displaystyle f(c_{l})-\lim_{x\rightarrow(c_{l}^{(0)})_{+}}f(x)-(\tilde{f}_{n,m}(c_{l})-\frac{\hat{\alpha}_{mid}(c_{l}^{(0)},1)}{1-c_{l}^{(0)}})
+limx(cl(0))+f(x)(α^mid(cl(0),1)1cl(0))\displaystyle+\lim_{x\rightarrow(c_{l}^{(0)})_{+}}f(x)-(\frac{\hat{\alpha}_{mid}(c_{l}^{(0)},1)}{1-c_{l}^{(0)}})
\displaystyle\geq δldlK2(Nmid(cl(0),1)12),\displaystyle\delta_{l}-d_{l}-K_{2}\cdot(N_{mid}(c_{l}^{(0)},1)^{-\frac{1}{2}}),

When dl<δld_{l}<\delta_{l}, it implies that

𝔼ftcl(0)|f(x)f~n,m(x)|𝑑x(δldlK2Nmid(cl(0),1)12)(cl(0)t).{\mathbb{E}}_{f}\int_{t}^{c_{l}^{(0)}}|f(x)-\tilde{f}_{n,m}(x)|dx\geq(\delta_{l}-d_{l}-K_{2}\cdot N_{mid}(c_{l}^{(0)},1)^{-\frac{1}{2}})\cdot(c_{l}^{(0)}-t).

Since the L_{1} distance between f and \tilde{f}_{n,m} decreases at a cube-root rate, it follows that c_{l}^{(0)}-t\leq C_{2}\cdot N_{l}(c_{l}^{(0)})^{-1/3} for some positive constant C_{2}. So c_{l}=t is the desired cutoff. Finally, we have that

01|f~n,m(x)f(x)|𝑑x\displaystyle\int_{0}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx =\displaystyle= 0t|f~n,m(x)f(x)|𝑑x+tcl(0)|α^mid(t,1)1tf(x)|𝑑x\displaystyle\int_{0}^{t}|\tilde{f}_{n,m}(x)-f(x)|dx+\int_{t}^{c_{l}^{(0)}}|\frac{\hat{\alpha}_{mid}(t,1)}{1-t}-f(x)|dx
+cl(0)1|α^mid(t,1)1tf(x)|𝑑x\displaystyle+\int_{c_{l}^{(0)}}^{1}|\frac{\hat{\alpha}_{mid}(t,1)}{1-t}-f(x)|dx
\displaystyle\leq 0t|f~n,m(x)f(x)|𝑑x+(11t+fmax)|cl(0)t|\displaystyle\int_{0}^{t}|\tilde{f}_{n,m}(x)-f(x)|dx+(\frac{1}{1-t}+f_{\max})\cdot|c_{l}^{(0)}-t|
+|α^mid(t,1)1tαmid(cl(0),1)1cl(0)||1cl(0)|\displaystyle+|\frac{\hat{\alpha}_{mid}(t,1)}{1-t}-\frac{\alpha_{mid}(c_{l}^{(0)},1)}{1-c_{l}^{(0)}}|\cdot|1-c_{l}^{(0)}|
\displaystyle\leq C4Nl(cl(0))1/3,\displaystyle C_{4}\cdot N_{l}(c_{l}^{(0)})^{-1/3},

where the last inequality holds because |c_{l}^{(0)}-t|\leq C_{2}\cdot N_{l}(c_{l}^{(0)})^{-1/3} together with the boundedness of f implies |\frac{\hat{\alpha}_{mid}(t,1)}{1-t}-\frac{\alpha_{mid}(c_{l}^{(0)},1)}{1-c_{l}^{(0)}}|\leq K_{3}\cdot N_{l}(c_{l}^{(0)})^{-1/3}. Here C_{4} and K_{3} are two positive constants that only depend on f. 
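To make the change-point-gap scan used in this proof concrete, here is a hypothetical sketch (the function select_left_cutoff and its arguments are ours and are not Algorithm 1 itself): among candidate cutoffs c_{l}, keep the largest one at which the left decreasing estimate exceeds the estimated flat level on (c_{l},1] by at least the user-specified gap d_{l}.

```python
import numpy as np

def select_left_cutoff(candidates, step_value, samples, d_l):
    """Hypothetical sketch of the change-point-gap scan: among candidate
    cutoffs c_l (in increasing order), return the largest one at which the
    left decreasing estimate exceeds the estimated flat level on (c_l, 1]
    by at least the gap d_l.  `step_value(c)` evaluates the decreasing
    (Grenander-type) estimate just to the left of c; `samples` are the
    observed proportions."""
    samples = np.asarray(samples, dtype=float)
    best = None
    for c in candidates:
        alpha_mid = np.mean(samples > c)       # estimated mass to the right of c
        flat_level = alpha_mid / (1.0 - c)     # height of the flat part on (c, 1]
        if step_value(c) >= flat_level + d_l:  # change-point-gap constraint
            best = c                           # keep the largest feasible cutoff
    return best

# Toy usage: a decreasing left part of height 3 - 10c with a flat level of 0.8.
toy_step = lambda c: max(0.8, 3.0 - 10.0 * c)
toy_samples = np.random.default_rng(1).uniform(0.0, 1.0, size=1000)
print(select_left_cutoff(np.linspace(0.05, 0.6, 12), toy_step, toy_samples, d_l=0.2))
```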

Appendix K Proof of Theorem 14

Proof Given the interior point μ\mu in the flat region, the left decreasing part and the right increasing part are disentangled. Therefore, we only need to consider the left side, and the right side can be proven in the same way. A necessary condition for clc_{l} being identified as feasible for the change-point-gap constraint in Algorithm 1 is that

g~l(cl)α^mid(cl,μ)α^l(μ)(μcl)+dlα^l(μ),\tilde{g}_{l}(c_{l})\geq\frac{\hat{\alpha}_{mid}(c_{l},\mu)}{\hat{\alpha}_{l}(\mu)\cdot(\mu-c_{l})}+\frac{d_{l}}{\hat{\alpha}_{l}(\mu)},

where α^mid(cl,μ)=Nmid(cl,μ)/n\hat{\alpha}_{mid}(c_{l},\mu)=N_{mid}(c_{l},\mu)/n and α^l(μ)=Nl(μ)/n\hat{\alpha}_{l}(\mu)=N_{l}(\mu)/n. It is easy to see that

α^l(μ)=αl(μ)+ε1n12,\hat{\alpha}_{l}(\mu)=\alpha_{l}(\mu)+\varepsilon_{1}\cdot n^{-\frac{1}{2}},

and

α^mid(cl,μ)=αmid(cl,μ)+ε2n12,\hat{\alpha}_{mid}(c_{l},\mu)=\alpha_{mid}(c_{l},\mu)+\varepsilon_{2}\cdot n^{-\frac{1}{2}},

where \alpha_{l}(\mu)={\mathbb{E}}\hat{\alpha}_{l}(\mu) and \alpha_{mid}(c_{l},\mu)={\mathbb{E}}\hat{\alpha}_{mid}(c_{l},\mu); \varepsilon_{1} and \varepsilon_{2} are residual terms with \max(|\varepsilon_{1}|,|\varepsilon_{2}|)\leq K for some universal positive constant K, by McDiarmid's inequality. So the necessary condition is

g~l(cl)αmid(cl,μ)αl(μ)(μcl)+dlαl(μ)+ε3n12,\tilde{g}_{l}(c_{l})\geq\frac{\alpha_{mid}(c_{l},\mu)}{\alpha_{l}(\mu)\cdot(\mu-c_{l})}+\frac{d_{l}}{\alpha_{l}(\mu)}+\varepsilon_{3}\cdot n^{-\frac{1}{2}}, (K.1)

with |\varepsilon_{3}|\leq C for some constant C that only depends on \alpha_{l}(\mu), \mu-c_{l} and d_{l}. If c_{l}>c_{l}^{(0)} and the constraint is violated at c_{l}, then for any c_{l}^{\prime}>c_{l} the constraint is violated at c_{l}^{\prime} with high probability, since \tilde{g}_{l}(c_{l})\geq\tilde{g}_{l}(c_{l}^{\prime}) and \frac{\alpha_{mid}(c_{l},\mu)}{\alpha_{l}(\mu)\cdot(\mu-c_{l})}=\frac{\alpha_{mid}(c_{l}^{\prime},\mu)}{\alpha_{l}(\mu)\cdot(\mu-c_{l}^{\prime})} (c_{l} and c_{l}^{\prime} are both in the flat region). Therefore, to see whether \hat{c}_{l}>c_{l}^{(0)}, we only need to investigate the smallest c_{l} with c_{l}>c_{l}^{(0)} that is in the search space of Algorithm 1. By Theorem 12, when m/N_{l}(c_{l}^{(0)})\rightarrow\infty, it follows that

Nl(μ)(g~l(cl)αmid(cl,μ)αl(μ)(μcl))𝑑S^[cl(0),μ](cl).\sqrt{N_{l}(\mu)}(\tilde{g}_{l}(c_{l})-\frac{\alpha_{mid}(c_{l},\mu)}{\alpha_{l}(\mu)\cdot(\mu-c_{l})})\overset{d}{\rightarrow}\hat{S}_{[c_{l}^{(0)},\mu]}(c_{l}).

By the necessary condition (K.1), then asymptotically we have

[cl>cl(0)]\displaystyle\mathbb{P}[c_{l}>c_{l}^{(0)}] \displaystyle\leq (Nl(μ)(g~l(cl)αmid(cl,μ)αl(μ)(μcl))Nl(μ)dlαl(μ)+ε3Nl(μ)n)\displaystyle\mathbb{P}\left(\sqrt{N_{l}(\mu)}(\tilde{g}_{l}(c_{l})-\frac{\alpha_{mid}(c_{l},\mu)}{\alpha_{l}(\mu)\cdot(\mu-c_{l})})\geq\sqrt{N_{l}(\mu)}\cdot\frac{d_{l}}{\alpha_{l}(\mu)}+\varepsilon_{3}\cdot\sqrt{\frac{N_{l}(\mu)}{n}}\right)
\displaystyle\approx (S^cl(0),μ(cl)Nl(μ)dlαl(μ)+ε3Nl(μ)n)\displaystyle\mathbb{P}\left(\hat{S}_{c_{l}^{(0)},\mu}(c_{l})\geq\sqrt{N_{l}(\mu)}\cdot\frac{d_{l}}{\alpha_{l}(\mu)}+\varepsilon_{3}\cdot\sqrt{\frac{N_{l}(\mu)}{n}}\right)
\displaystyle\leq (S^cl(0),μ(cl)Nl(cl(0))dlαl(μ)CNl(μ)n).\displaystyle\mathbb{P}\left(\hat{S}_{c_{l}^{(0)},\mu}(c_{l})\geq\sqrt{N_{l}(c_{l}^{(0)})}\cdot\frac{d_{l}}{\alpha_{l}(\mu)}-C\cdot\sqrt{\frac{N_{l}(\mu)}{n}}\right).
\displaystyle\leq (S^cl(0),μ(cl)Nl(cl(0))dlαl(μ)C).\displaystyle\mathbb{P}\left(\hat{S}_{c_{l}^{(0)},\mu}(c_{l})\geq\sqrt{N_{l}(c_{l}^{(0)})}\cdot\frac{d_{l}}{\alpha_{l}(\mu)}-C\right).
\displaystyle\approx (S^cl(0),μ(cl(0))Nl(cl(0))dlαl(μ)C).\displaystyle\mathbb{P}\left(\hat{S}_{c_{l}^{(0)},\mu}(c_{l}^{(0)})\geq\sqrt{N_{l}(c_{l}^{(0)})}\cdot\frac{d_{l}}{\alpha_{l}(\mu)}-C\right).

where the last approximation holds because c_{l} is the smallest search candidate with c_{l}>c_{l}^{(0)}, so it is very close to c_{l}^{(0)}. 

Appendix L Proof of Theorem 15

Proof  Let \hat{\alpha}_{l}(c_{l}), \hat{\alpha}_{r}(c_{r}), \hat{\alpha}_{mid}(c_{l},c_{r}), \tilde{g}_{l}(c_{l}), \tilde{g}_{r}(c_{r}) be the output of Algorithm 1 with the inputs c_{l}, c_{r} and d_{l}, d_{r}. The corresponding estimator of f is denoted by \tilde{f}_{n,m}. Let H(c_{l},c_{r},\tilde{f}_{n,m}) be the log-likelihood function associated with the optimization problem (4.2.2), and N(x)=\#\{\hat{s}_{i}:\hat{s}_{i}=x\}. It follows that

1n𝔼fH(cl,cr,f~n,m)\displaystyle\frac{1}{n}{\mathbb{E}}_{f}H(c_{l},c_{r},\tilde{f}_{n,m}) =\displaystyle= 1n𝔼fs^ilogf~n,m(s^i)\displaystyle\frac{1}{n}{\mathbb{E}}_{f}\sum_{\hat{s}_{i}}\log\tilde{f}_{n,m}(\hat{s}_{i})
=\displaystyle= 𝔼flogf~n,m(s^1)\displaystyle{\mathbb{E}}_{f}\log\tilde{f}_{n,m}(\hat{s}_{1})
=\displaystyle= KL(f||f~n,m)+C,\displaystyle-KL(f||\tilde{f}_{n,m})+C,

where C=𝔼flogf(s^1)C={\mathbb{E}}_{f}\log f(\hat{s}_{1}). From the relations between total variation, Kullback-Leibler divergence and the χ2\chi^{2} distance,

TV(P,Q)KL(P||Q)χ2(P||Q),TV(P,Q)\leq\sqrt{KL(P||Q)}\leq\sqrt{\chi^{2}(P||Q)},

we have

1n𝔼fH(cl,cr,f~n,m)=𝔼fKL(f||f~n,m)+C\displaystyle\frac{1}{n}{\mathbb{E}}_{f}H(c_{l},c_{r},\tilde{f}_{n,m})=-{\mathbb{E}}_{f}KL(f||\tilde{f}_{n,m})+C \displaystyle\leq 𝔼f(01|f(x)f~n,m(x)|𝑑x)2+C\displaystyle-{\mathbb{E}}_{f}(\int_{0}^{1}|f(x)-\tilde{f}_{n,m}(x)|dx)^{2}+C
\displaystyle\leq (𝔼f01|f(x)f~n,m(x)|𝑑x)2+C,\displaystyle-({\mathbb{E}}_{f}\int_{0}^{1}|f(x)-\tilde{f}_{n,m}(x)|dx)^{2}+C,

where the last inequality uses Jensen's inequality. On the other hand, it follows that

1n𝔼fH(cl,cr,f~n,m)\displaystyle\frac{1}{n}{\mathbb{E}}_{f}H(c_{l},c_{r},\tilde{f}_{n,m}) =\displaystyle= 𝔼fKL(f||f~n,m)+C\displaystyle-{\mathbb{E}}_{f}KL(f||\tilde{f}_{n,m})+C
\displaystyle\geq 𝔼f01(f(x)f~n,m(x))2/f(x)𝑑x+C\displaystyle-{\mathbb{E}}_{f}\int_{0}^{1}(f(x)-\tilde{f}_{n,m}(x))^{2}/f(x)dx+C
\displaystyle\geq 𝔼f01(f(x)f~n,m(x))2/fmin𝑑x+C\displaystyle-{\mathbb{E}}_{f}\int_{0}^{1}(f(x)-\tilde{f}_{n,m}(x))^{2}/f_{min}dx+C
\displaystyle\geq 1fmin(𝔼f01|f(x)f~n,m(x)|𝑑x)2+C,\displaystyle-\frac{1}{f_{min}}\cdot({\mathbb{E}}_{f}\int_{0}^{1}|f(x)-\tilde{f}_{n,m}(x)|dx)^{2}+C,

Then the problem is reduced to bounding {\mathbb{E}}_{f}\int_{0}^{1}|f(x)-\tilde{f}_{n,m}(x)|dx. From Theorem 13, we know that when m\geq C_{1}\cdot\left(\max(N_{l}(c_{l}^{(0)}),N_{r}(c_{r}^{(0)}))\right)^{\frac{2}{3}}, there exist c_{l} and c_{r} in the neighborhoods of c_{l}^{(0)} and c_{r}^{(0)}, respectively, such that the resulting estimator \tilde{f}_{n,m} satisfies {\mathbb{E}}_{f}\int_{0}^{1}|f(x)-\tilde{f}_{n,m}(x)|dx\leq K_{1}\cdot(N_{l}(c_{l}^{(0)})^{-1/3}+N_{r}(c_{r}^{(0)})^{-1/3}) for some positive constants C_{1} and K_{1}. Along with Lemma 23, we conclude the desired result. 
 

Lemma 23

Let f~n,m\tilde{f}_{n,m} be the solution by Algorithm 1 with input clc_{l} and crc_{r} and the corresponding flag=Trueflag=\textit{True}. Assume fmax<f_{\max}<\infty and fmin>0f_{\min}>0. If mC1(max(Nl(cl(0)),Nr(cr(0))))23m\geq C_{1}\cdot\left(\max(N_{l}(c_{l}^{(0)}),N_{r}(c_{r}^{(0)}))\right)^{\frac{2}{3}}, then |Δl|C2Nl(cl(0))1/3,|Δr|C3Nr(cr(0))1/3|\Delta_{l}|\leq C_{2}\cdot N_{l}(c_{l}^{(0)})^{-1/3},|\Delta_{r}|\leq C_{3}\cdot N_{r}(c_{r}^{(0)})^{-1/3} is a necessary condition for

𝔼f01|f(x)f~n,m(x)|𝑑xC4(Nl(cl(0))1/3+Nr(cr(0))1/3),{\mathbb{E}}_{f}\int_{0}^{1}|f(x)-\tilde{f}_{n,m}(x)|dx\leq C_{4}\cdot(N_{l}(c_{l}^{(0)})^{-1/3}+N_{r}(c_{r}^{(0)})^{-1/3}),

where C1,C2,C3,C4C_{1},C_{2},C_{3},C_{4} are four constants depending on dld_{l}, drd_{r}, fmaxf_{\max} and fminf_{\min}; Δl=clcl(0)\Delta_{l}=c_{l}-c_{l}^{(0)}, Δr=crcr(0)\Delta_{r}=c_{r}-c_{r}^{(0)}.

Proof  For simplicity, we consider c_{r}^{(0)}=1, i.e., the case where there are only the decreasing part and the flat part. For the general case where c_{r}^{(0)}<1, we just need to focus on [0,\mu]. If c_{l}<c_{l}^{(0)}, the L_{1} distance between \tilde{f}_{n,m} and f is

𝔼f01|f~n,m(x)f(x)|𝑑x\displaystyle{\mathbb{E}}_{f}\int_{0}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx
=\displaystyle= 𝔼f0cl(0)|f~n,m(x)f(x)|𝑑x+𝔼fcl(0)1|f~n,m(x)f(x)|𝑑x\displaystyle{\mathbb{E}}_{f}\int_{0}^{c_{l}^{(0)}}|\tilde{f}_{n,m}(x)-f(x)|dx+{\mathbb{E}}_{f}\int_{c_{l}^{(0)}}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx
(a)\displaystyle\overset{(a)}{\geq} K1Nl(cl)1/3+𝔼fcl(0)1|f~n,m(x)f(x)|𝑑x\displaystyle-K_{1}\cdot N_{l}(c_{l})^{-1/3}+{\mathbb{E}}_{f}\int_{c_{l}^{(0)}}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx
=(b)\displaystyle\overset{(b)}{=} K1Nl(cl)1/3+𝔼f|10clf~n,m(x)𝑑x1cl10cl(0)f(x)𝑑x1cl(0)|(1cl(0))\displaystyle-K_{1}\cdot N_{l}(c_{l})^{-1/3}+{\mathbb{E}}_{f}|\frac{1-\int_{0}^{c_{l}}\tilde{f}_{n,m}(x)dx}{1-c_{l}}-\frac{1-\int_{0}^{c_{l}^{(0)}}f(x)dx}{1-c_{l}^{(0)}}|(1-c_{l}^{(0)})
(c)\displaystyle\overset{(c)}{\geq} K2Nl(cl(0))1/3+|10clf(x)𝑑x1cl10cl(0)f(x)𝑑x1cl(0)|(1cl(0))\displaystyle-K_{2}\cdot N_{l}(c_{l}^{(0)})^{-1/3}+|\frac{1-\int_{0}^{c_{l}}f(x)dx}{1-c_{l}}-\frac{1-\int_{0}^{c_{l}^{(0)}}f(x)dx}{1-c_{l}^{(0)}}|(1-c_{l}^{(0)})
=\displaystyle= K2Nl(cl(0))1/3+|Δl+(1cl)clcl(0)f(x)𝑑x+Δl0cl(0)f(x)𝑑x1cl|\displaystyle-K_{2}\cdot N_{l}(c_{l}^{(0)})^{-1/3}+|\frac{-\Delta_{l}+(1-c_{l})\int_{c_{l}}^{c_{l}^{(0)}}f(x)dx+\Delta_{l}\int_{0}^{c_{l}^{(0)}}f(x)dx}{1-c_{l}}|
=\displaystyle= K2Nl(cl(0))1/3+|Δl(1cl)Δlγ+Δl0cl(0)f(x)𝑑x1cl|\displaystyle-K_{2}\cdot N_{l}(c_{l}^{(0)})^{-1/3}+|\frac{-\Delta_{l}-(1-c_{l})\Delta_{l}\gamma+\Delta_{l}\int_{0}^{c_{l}^{(0)}}f(x)dx}{1-c_{l}}|
=\displaystyle= K2Nl(cl(0))1/3+κ|Δl|,\displaystyle-K_{2}\cdot N_{l}(c_{l}^{(0)})^{-1/3}+\kappa|\Delta_{l}|,

where K_{1} and K_{2} are two positive constants that only depend on f, \min_{x\in[c_{l},c_{l}^{(0)}]}f(x)\leq\gamma\leq\max_{x\in[c_{l},c_{l}^{(0)}]}f(x), and \kappa=|\frac{1+(1-c_{l})\gamma-\int_{0}^{c_{l}^{(0)}}f(x)dx}{1-c_{l}}|<\infty since f_{\max}<\infty. Inequality (a) and Inequality (c) are obtained using Theorem 10. Equality (b) makes use of the assumption that the region to the right of c_{l}^{(0)} is flat. Then, if there exists C_{4}>0 such that {\mathbb{E}}_{f}\int_{0}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx\leq C_{4}\cdot N_{l}(c_{l}^{(0)})^{-1/3}, we must have |\Delta_{l}|\leq C_{2}\cdot N_{l}(c_{l}^{(0)})^{-1/3} for some positive constant C_{2}.

Next, we investigate the case when c_{l}>c_{l}^{(0)}. Denote a:=\lim_{x\rightarrow(c_{l})_{-}}\tilde{f}_{n,m}(x), b:=\lim_{x\rightarrow(c_{l})_{+}}\tilde{f}_{n,m}(x), c:=\int_{c_{l}^{(0)}}^{c_{l}}\tilde{f}_{n,m}(x)dx, and \alpha\equiv f(x) for x>c_{l}^{(0)} (the height of the flat part). It is easy to check that

|cα||Δl|cl(0)cl|f~n,m(x)α|𝑑x.|c-\alpha|\cdot|\Delta_{l}|\leq\int_{c_{l}^{(0)}}^{c_{l}}|\tilde{f}_{n,m}(x)-\alpha|dx.

Using the L1L_{1} convergence of the Grenander estimator, it follows that

|Δl||cα|+(1cl)|bα|cl(0)cl|f~n,m(x)α|𝑑x+(1cl)|bα|01|f~n,m(x)f(x)|𝑑x.|\Delta_{l}||c-\alpha|+(1-c_{l})|b-\alpha|\leq\int_{c_{l}^{(0)}}^{c_{l}}|\tilde{f}_{n,m}(x)-\alpha|dx+(1-c_{l})|b-\alpha|\leq\int_{0}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx.

If there exists C4>0C_{4}>0 such that 01|f~n,m(x)f(x)|𝑑xC4Nl(cl(0))1/3\int_{0}^{1}|\tilde{f}_{n,m}(x)-f(x)|dx\leq C_{4}\cdot N_{l}(c_{l}^{(0)})^{-1/3}, we have

|Δl|1cl(0)|ab|\displaystyle\frac{|\Delta_{l}|}{1-c_{l}^{(0)}}|a-b| \displaystyle\leq |Δl|1cl(0)|cb|\displaystyle\frac{|\Delta_{l}|}{1-c_{l}^{(0)}}|c-b|
\displaystyle\leq |Δl|1cl(0)|cα|+|Δl|1cl(0)|αb|\displaystyle\frac{|\Delta_{l}|}{1-c_{l}^{(0)}}|c-\alpha|+\frac{|\Delta_{l}|}{1-c_{l}^{(0)}}|\alpha-b|
\displaystyle\leq |Δl|1cl|cα|+|αb|C41clNl(cl(0))1/3.\displaystyle\frac{|\Delta_{l}|}{1-c_{l}}|c-\alpha|+|\alpha-b|\leq\frac{C_{4}}{1-c_{l}}\cdot N_{l}(c_{l}^{(0)})^{-1/3}.

If the output flag of Algorithm 1 is True, then a-b\geq\frac{\alpha_{mid}(c_{l},\mu)}{\alpha_{l}(\mu)(1-c_{l})}+\frac{d_{l}}{\alpha_{l}(\mu)}+K_{3}\cdot n^{-1/2} for some positive constant K_{3}. Then it must follow that |\Delta_{l}|\leq C_{2}\cdot N_{l}(c_{l}^{(0)})^{-1/3}, where C_{2}>0. So far, we have proven the lemma for the left-hand side; the right-hand side can be proven similarly. 

Appendix M More simulation studies

Besides the linear valley model specified in Section 5.1.1, we also consider a non-linear model and a misspecified model.

For the non-linear model, we replace the left linear part in the linear valley model with the unnormalized decreasing function f_{l}=Beta(x/c_{l};0.5,1.5)/c_{l}\cdot 3/20, x\in[0,c_{l}], and the right linear part with the unnormalized increasing function f_{r}=Beta((x-c_{r})/(1-c_{r});2,0.8)/(1-c_{r})\cdot 1/20, x\in(c_{r},1]. Here Beta(x;\alpha,\beta) is the density of the Beta distribution with parameters \alpha and \beta. In this case, we use m=10^{4}. Applying Ucut to the synthetic data generated by this model, we observe results similar to those from the linear valley model; see Figures 8, 9 and 10. This indicates that Ucut can detect the cutoff within a satisfactory range as long as the underlying model satisfies Assumption 1 and Assumption 2. Ucut is particularly useful when the middle part is “uniform” and it is not easy to tell the samples of the alternative distribution apart from those of the null distribution.
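For readers who wish to regenerate data of this type, a minimal sampler is sketched below (our own illustration; the cutoffs c_{l}=0.15, c_{r}=0.85 and the sample size are placeholders rather than the values used in the experiments, and the middle flat part is assumed to carry the remaining mass 16/20 so that the density integrates to one). Each latent s_{i} is observed through X_{i}\sim Binomial(m,s_{i}) with m=10^{4}.

```python
import numpy as np

def sample_nonlinear_valley(n, m, c_l=0.15, c_r=0.85, seed=0):
    """Sketch of the non-linear decreasing-uniform-increasing model: mass 3/20
    on [0, c_l] with a rescaled Beta(0.5, 1.5) shape, mass 16/20 uniform on
    (c_l, c_r], and mass 1/20 on (c_r, 1] with a rescaled Beta(2, 0.8) shape.
    Returns the latent s_i and the observed proportions X_i / m."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(3, size=n, p=[3 / 20, 16 / 20, 1 / 20])
    s = np.empty(n)
    s[comp == 0] = c_l * rng.beta(0.5, 1.5, size=(comp == 0).sum())               # decreasing part
    s[comp == 1] = rng.uniform(c_l, c_r, size=(comp == 1).sum())                  # flat middle
    s[comp == 2] = c_r + (1 - c_r) * rng.beta(2.0, 0.8, size=(comp == 2).sum())   # increasing part
    x = rng.binomial(m, s)
    return s, x / m

s, s_hat = sample_nonlinear_valley(n=5000, m=10**4)
```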

For the misspecified model, we replace the left linear part with the unnormalized unimodal function f_{l}=Beta(x/c_{l};1.5,5)/c_{l}\cdot 3, x\in[0,c_{l}], and the right linear part with the unnormalized unimodal function f_{r}=Beta((x-c_{r})/(1-c_{r});2.5,1.5)/(1-c_{r}), x\in(c_{r},1]. The results can be found in Figures 11, 12 and 13. We observe that the estimate of c_{r} is not satisfactory until m reaches 10^{4}, so we use m=10^{4} in the other experiments. We find that, although the variance of \hat{c}_{r} is larger than under the correctly specified model, the detected cutoff tends to be larger than the truth. This implies that even if the model does not align with Assumption 1 and Assumption 2, Ucut is still useful because it is conservative.

Figure 8: The estimation of \hat{c}_{r} under the non-linear decreasing-uniform-increasing model: (a) with respect to m; (b) with respect to the width of the middle flat region; (c) with respect to the gap sizes.

Figure 9: The estimation of \hat{c}_{r} under the non-linear decreasing-uniform-increasing model: (a) with respect to the choice of the middle point \mu; (b) with respect to the choice of the inputs d_{l} and d_{r}. Here d_{l}=\kappa\times\tilde{\delta}_{l} and d_{r}=\kappa\times\tilde{\delta}_{r}, where \kappa is a ratio of the normalized \delta's.

Figure 10: FDR and power of Algorithm 1 and other competing methods on the non-linear decreasing-uniform-increasing model.

Figure 11: The estimation of \hat{c}_{r} under the unimodal model: (a) with respect to m; (b) with respect to the width of the middle flat region; (c) with respect to the gap sizes.

Figure 12: The estimation of \hat{c}_{r} under the unimodal model: (a) with respect to the choice of the middle point \mu; (b) with respect to the choice of the inputs d_{l} and d_{r}. Here d_{l}=\kappa\times\tilde{\delta}_{l} and d_{r}=\kappa\times\tilde{\delta}_{r}, where \kappa is a ratio of the normalized \delta's.

Figure 13: FDR and power of Algorithm 1 and other competing methods on the misspecified model.