Binomial Mixture Model With U-shape Constraint
Abstract
In this article, we study the binomial mixture model under the regime where the binomial size can be relatively large compared to the sample size . This project is motivated by the GeneFishing method [18], whose output is a combination of the parameter of interest and the subsampling noise. To tackle the noise in the output, we utilize the observation that the density of the output has a U shape and model the output with the binomial mixture model under a U-shape constraint. We first analyze the estimation of the underlying distribution in the binomial mixture model under various conditions for . Equipped with these theoretical understandings, we propose a simple method, Ucut, to identify the cutoffs of the U shape and recover the underlying distribution based on the Grenander estimator [8]. It has been shown that when , the identified cutoffs converge at the rate . The distance between the recovered distribution and the true one decreases at the same rate. To demonstrate the performance, we apply our method to a variety of simulation studies, a GTEx dataset used in [18] and a single-cell dataset from Tabula Muris.
1 Introduction
The binomial mixture model has received much attention since the late 1960s. In the field of performance evaluation, [19, 20, 32] utilized this model to address the problem of psychological testing. [38] used a two-component binomial mixture distribution to model the individual differences in children’s performance on a classification task. [9] employed a binomial finite mixture to model the number of credits gained by freshmen during the first year at the School of Economics of the University of Florence. In addition, the binomial mixture model is commonly applied to analyze population survey data in ecology. [30, 16, 27, 44] estimated absolute abundance while accounting for imperfect detection using binomial detection models. The binomial mixture model was also used to estimate bird and bat fatalities at wind power facilities [25].
Formally, we say that is a random variable following a binomial mixture model if
(1.1) |
where is a bivariate measure of the binomial size and the success probability on . In the field of population survey in ecology, can be modeled as a Poisson or negative binomial random variable, while can be modeled as a beta random variable linked to a linear combination of additional covariates by a logistic function [30, 16, 27, 44, 25]. Such models are always identifiable thanks to the parametric assumptions. In the field of performance evaluation, is always known and usually small (restricted by the intrinsic nature of the problem), thus (1.1) reduces to
(1.2) |
where is a probability distribution on . For instance, refers to the number of questions of a psychological test in [19, 20, 32]. When the univariate probability distribution corresponds to a finite point mass function (pmf), (1.2) is a binomial finite mixture model as in [38, 9]; when corresponds to a density (either parametric or nonparametric), (1.2) is a hierarchical binomial mixture model as in [19, 20, 32]. Such models suffer from an unidentifiability issue — only the first moments of can be identified even with unlimited sample size [36]; more identifiability details are discussed in Appendix A.
In this study, we are interested in a regime unlike that for performance evaluation where is known and small or that for population survey in ecology where is unknown. We study (1.2) with a known that can be as large as the magnitude of , which has not been investigated before in the literature. This study is motivated by the GeneFishing method [18], which was proposed to handle extreme inhomogeneity and low signal-to-noise ratio in high-dimensional data. This method employs subsampling and aggregation techniques to evaluate the functional relevance of a candidate gene to a set of bait genes (the bait genes are pre-selected by experts concerning certain biological processes). Briefly speaking, it clusters a subsample of candidate genes together with the bait genes repetitively. One candidate gene is thought of as functioning similarly to the bait genes if the former and the latter are always grouped together during the clustering procedure. More details about GeneFishing can be found in Appendix B.
Mathematically, suppose there are objects (e.g., genes) and binomial trials (e.g., the number of clustering times, which can be manually tuned). For each object, we assume it has a success probability (e.g., a rate reflecting how relevant one candidate gene is to the bait genes) that is i.i.d. sampled from an underlying distribution . Then, an observation (e.g., the number of times one candidate gene is grouped with the bait genes) or (denoted as capture frequency ratio, or CFR, in GeneFishing) is independently generated from the binomial mixture model (1.2). A discussion of the independence assumption is given in Appendix B.
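For concreteness, the data-generating process just described can be sketched as a short simulation. This is a minimal illustration only: the object count, trial count, and the Beta mixing density below are placeholder choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_binomial_mixture(n_objects, n_trials, sample_p):
    """Draw (p_i, Y_i) from the binomial mixture model (1.2).

    sample_p: callable returning i.i.d. draws from the mixing distribution on [0, 1].
    Returns the latent success probabilities and the observed counts; dividing the
    counts by n_trials gives the capture frequency ratios (CFRs) in GeneFishing terms.
    """
    p = sample_p(n_objects)           # p_i ~ F, i.i.d.
    y = rng.binomial(n_trials, p)     # Y_i | p_i ~ Binomial(n_trials, p_i)
    return p, y

# Placeholder mixing distribution: a U-shaped Beta(0.2, 0.2) density.
p, y = simulate_binomial_mixture(
    n_objects=5000, n_trials=1000,
    sample_p=lambda size: rng.beta(0.2, 0.2, size=size))
cfr = y / 1000
```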
Our goal is to estimate and infer the underlying distribution . When is sufficiently large, the binomial randomness plays little role, and there is no difference between and for estimation or inference purposes. In other words, permitting a large binomial size alleviates the identifiability issue of the binomial mixture model. A question therefore naturally arises.
- Q1: For the binomial mixture model (1.2), what is the minimal binomial size so that any distribution estimator can work as if there were no binomial randomness?
We can reformulate Q1 in a more explicit way:
- Q1’: Considering that the binomial mixture model (1.2) is trivially identifiable when , is there a minimal such that we can have an “acceptable” estimator of under various conditions:
  - General without additional conditions.
  - with a density.
  - with a continuously differentiable density.
  - with a monotone density.
If the above question can be answered satisfactorily, we can cast our attention back to the GeneFishing example. In [18], there remains an open question on how large should be so that the associated gene is labeled as a “discovery”. We have an important observation that can be leveraged towards this end: the histograms of appear in a U shape; see Figure 1. Hence, the second question we want to answer is
- Q2: Suppose the underlying distribution has a U shape; how should we make decisions based on data generated from the binomial mixture model?
1.1 Main contributions
Our contributions are twofold, corresponding to the two questions raised above. One tackles the identifiability issue for the binomial mixture model when can be relatively large compared to . The other answers the motivating question of how to select the cutoff for the output of GeneFishing.
1.1.1 New results for large in the binomial mixture model
Based on the results of [36] and [42], the only hope is to use a large if we want to solve, or at least alleviate, the identifiability issue for arbitrary mixtures of binomial distributions. When is sufficiently large, we show that regardless of the identifiability of the model (1.2), we can find an estimator of , according to the conditions on , such that the estimator lies in a ball of radius of in terms of some metrics such as the distance and the Kolmogorov distance, where is a decreasing function of . Specifically,
- [Corollary of Theorem 1] For general , if the distance is used, then for the empirical cumulative distribution function, where is a positive constant that depends on .
- [Corollary of Theorem 5] If has a bounded density and the Kolmogorov distance is used, then for the empirical cumulative distribution function, where only depends on the maximal value of the density of .
- [Corollary of Theorem 8] If has a density whose derivative is absolutely continuous, and the truncated integrated distance is used, then for the histogram estimator, where only depends on the density of .
- [Corollary of Theorem 10] If has a bounded monotone density and the distance is used, then for the Grenander estimator, where only depends on the density of .
1.1.2 The cutoff selection for GeneFishing
To model the CFRs generated by GeneFishing, we use the binomial mixture model with the U-shape constraint, under the regime where the binomial size can be relatively large compared to the sample size . With the theoretical understandings mentioned in Section 1.1.1, we propose a simple method Ucut to identify the cutoffs of the U shape and recover the underlying distribution based on the Grenander estimator. It has been shown that when , the identified cutoffs converge at the rate . The distance between the recovered distribution and the true one decreases at the same rate. We also show that the estimated cutoff is larger than the true cutoff with high probability. The performance of our method is demonstrated with a variety of simulation studies, a GTEx dataset used in [18] and a single-cell dataset from Tabula Muris.
1.2 Outline
The rest of the paper is organized as follows. Section 2 introduces the notation used throughout the paper. To answer Q1 (or more specifically, Q1’), Section 3 analyzes the estimation and inference of the underlying distribution in the binomial mixture model (1.2) under various conditions imposed on . Equipped with these analysis tools, we cast our attention back to the GeneFishing method to answer Q2. Section 4 first introduces a model with a U-shape constraint, called BMU, to model the output of GeneFishing. The cutoffs of our interest are also introduced in this model. Then, in the same section, we propose a non-parametric maximum likelihood estimation (NPMLE) method, Ucut, based on the Grenander estimator to identify the cutoffs and estimate the underlying U-shape density. We also provide a theoretical analysis of the estimator. Next, we apply Ucut to several synthetic datasets and two real datasets in Section 5. All detailed proofs of the theorems mentioned in the previous sections are in the Appendix.
2 Notation
Denote by a probability distribution. Let and given , where is a positive integer; see Model (1.2) for details. If there is no confusion, we also use to represent the associated cumulative distribution function (CDF), i.e., . Denote by the binomial mixture CDF for , i.e., .
Suppose there are samples independently generated from , i.e., . Correspondingly, we have independently. Denote by and the empirical CDFs of ’s and ’s, respectively. Specifically,
If has a density, define the Grenander estimator () for ’s (’s) as the left derivative of the least concave majorant of () evaluated at [8].
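For reference, the Grenander estimator admits a short direct implementation: compute the least concave majorant of the empirical CDF (the upper convex hull of its vertices) and read off its left derivative. The sketch below is one standard construction in plain NumPy and assumes the samples lie in (0, 1]; it is illustrative rather than the exact code used in this paper.

```python
import numpy as np

def grenander_decreasing(x):
    """Grenander estimator of a decreasing density on [0, 1].

    x : samples in (0, 1].  Returns (knots, heights): breakpoints of the least
    concave majorant of the empirical CDF and the estimated density (the left
    derivative of the majorant), constant on each interval between knots.
    """
    x = np.asarray(x, dtype=float)
    xs, counts = np.unique(x, return_counts=True)
    F = np.cumsum(counts) / x.size
    # Vertices of the empirical CDF, anchored at (0, 0).
    px = np.concatenate([[0.0], xs])
    py = np.concatenate([[0.0], F])
    # Upper convex hull of these vertices = least concave majorant.
    hull = [0]
    for i in range(1, px.size):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # Drop b if it lies on or below the chord from a to i.
            if (py[b] - py[a]) * (px[i] - px[a]) <= (py[i] - py[a]) * (px[b] - px[a]):
                hull.pop()
            else:
                break
        hull.append(i)
    knots = px[hull]
    heights = np.diff(py[hull]) / np.diff(knots)
    return knots, heights

# An increasing density on [m, 1] can be handled by reflection: fit
# grenander_decreasing(1.0 - x) and map the knots back via 1 - knots.
```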
For a density , we use and to denote its maximal and minimal function values on the domain of . We use to denote the indicator function, and use , to denote the right and left limit of respectively.
3 Estimation and Inference of under Various Conditions
When is sufficiently large, behaves like , which implies that we can directly estimate the underlying true of the binomial mixture model (1.2). A natural question is whether there exists a minimal binomial size so that the identifiability issue mentioned in Appendix A is not a concern. We investigate the estimation of under four conditions: general , with a density, with a density whose derivative is absolutely continuous, and with a monotone density. We consider the empirical CDF estimator for the first two conditions, the histogram estimator for the third condition, and the Grenander estimator for the last condition. These investigations provide us with the analysis tools to design and analyze the cutoff method for GeneFishing.
3.1 General
We begin with the estimation of , based on , without additional conditions except that is defined on . Towards this end, the empirical CDF estimator might be the first method that comes to mind. It is easy to interpret using a diagram, and it exists regardless of whether corresponds to a density, a point mass function, or a mixture of the two.
To measure the deviation of the empirical CDF from , we use the distance with . The distance between two distributions and is defined as
The distance has two special cases that are easily interpretable from a geometric perspective. First, when , it looks at the cumulative differences between the two CDFs. The distance is known to be equivalent to the 1-Wasserstein () distance on , which is also known as the Kantorovich-Monge-Rubinstein metric. Second, when , it corresponds to the Kolmogorov-Smirnov (K-S) distance:
which measures the largest deviation between two CDFs. The K-S distance is a weaker notion than the total variation distance on (which is often too strong to be useful).
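Both metrics are easy to compute between empirical CDFs on a grid; the sketch below (an assumption-laden illustration: it evaluates on [0, 1] with a fixed grid, so the K-S value is exact only up to the grid resolution) uses the fact that on [0, 1] the L1 distance between CDFs coincides with the 1-Wasserstein distance.

```python
import numpy as np

def ecdf(samples, grid):
    """Empirical CDF of `samples` evaluated at the points in `grid`."""
    samples = np.sort(np.asarray(samples, dtype=float))
    return np.searchsorted(samples, grid, side="right") / samples.size

def l1_and_ks(sample_a, sample_b, num_grid=10_000):
    """L1 (= 1-Wasserstein on [0, 1]) and Kolmogorov-Smirnov distances between
    the empirical CDFs of two samples supported on [0, 1]."""
    grid = np.linspace(0.0, 1.0, num_grid)
    diff = np.abs(ecdf(sample_a, grid) - ecdf(sample_b, grid))
    l1 = diff.mean()        # Riemann approximation of the integral of |F_a - F_b|
    ks = diff.max()         # sup_x |F_a(x) - F_b(x)|, up to grid resolution
    return l1, ks
```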
In the sequel, we study for supported on . Theorem 1 states that without any conditions imposed on , the distance between and is bounded by . One key to this theorem lies in the finite support of , which enables the use of Fubini’s theorem. Along with the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [24], it implies that the distance between and is bounded by . In particular, we have and .
Theorem 1 (The distance between / and )
For a general on , it follows that
where is a positive constant that depends on . It indicates that
where is another universal positive constant.
Proof By definition, it follows that
Note that
where is bounded in , and thus it is a sub-Gaussian random variable with variance less than or equal to [10]. The same argument applies to . Therefore, we have
where Equation holds by Fubini’s theorem. Further, by noting that and using the DKW inequality, it follows that
Notwithstanding, Theorem 1 does not establish a useful bound for the K-S distance that corresponds to the distance — there remains a non-negligible constant which does not depend on . In fact, the K-S distance might evaluate the estimate of too stringently. Proposition 2 shows that no matter how small is, there is an with a point mass function and a point such that is larger than a constant that is independent of . This result is attributed to “bad” points with non-trivial masses like . Such a “bad” point gives rise to a sharp jump in , which cannot immediately catch up with due to the discrete nature of the binomial randomness. This makes it difficult to recover the underlying distribution of the binomial mixture model.
On the other hand, the distance with does not suffer from the issue of the K-S distance — it can be regarded as looking at an average of the absolute distance between and when the support of is finite. To be specific, even if there are “bad” points where has a non-trivial difference , this difference diminishes outside a small neighborhood of of width . Therefore, when taking the integral, the average distance decreases as grows. Furthermore, if has a density, the issue of “bad” points no longer exists for the K-S distance either. In this case, the K-S distance is an appropriate choice to measure the difference between and ; see Section 3.2 for details.
Proposition 2 ( can deviate in K-S far from with a pmf)
There exists an with a pmf, such that , where is a constant.
Proof Let be the delta function taking the mass at , where is an extremely small positive irrational number. Then by CLT, with probability about , is no larger than , where is the largest integer such that . Take any in . It follows that but . It implies that is larger than a constant that is independent of .
3.2 with a density
In this section, we focus on with a density so that the K-S distance can be employed. In addition, we stick to this metric partly because it is related to the Grenander estimator for monotone density estimation [8, 4], which constitutes our method for the GeneFishing cutoff selection; see Section 3.4 and Section 4.2.
To bound the K-S distance between and , i.e., , we just need to bound and by noticing that . By the DKW inequality, we have a tight bound
So it only remains to study the deviation of from . In Proposition 3, we show that when has a derivative bounded from both below and above, the K-S distance between and is bounded by from below and by from above.
Proposition 3 (Deviation of from with a density)
Suppose is a density function on with . It follows that
Proof For the lower bound, we have
For the upper bound, note that
thus
We have
Similarly, we have . So it follows that
Proposition 3 shows that the largest deviation of the binomial mixture CDF from the underlying CDF is at least of the order and at most of the order . In fact, the condition that is bounded can be relaxed to being -integrable with , using the Hölder inequality, but the rate becomes correspondingly. Our result is the special case with . Moreover, having a density is a necessary condition for Proposition 3 — we have seen in Proposition 2 that if has a point mass function, the deviation of from cannot be controlled w.r.t. .
Proposition 4 shows that there exist two simple distributions that respectively attain the lower bound and the upper bound. On the other hand, Proposition 3 can be further improved: if the underlying density is assumed to be smooth, the upper bound is lowered to ; see Proposition 7 in Section 3.3.
Proposition 4 (Tightness of Proposition 3)
The upper bound and the lower bound in Proposition 3 are tight. In other words, there exist an and such that and , where and are two positive constants.
Proof The lower bound of Proposition 3 can be attained by the uniform distribution. Specifically, if , . So .
On the other hand, the upper bound can be attained by another simple distribution. Let
(3.1) |
We can show that this density leads to , where is a positive constant. It is a consequence of the central limit theorem for the binomial distribution. The detailed proof is delegated to Appendix D.
Given Proposition 3 together with the DKW inequality, we can get the rate of of , as shown in Theorem 5. By taking the integral of w.r.t. , we immediately obtain Corollary 6.
Theorem 5 (The rate of K-S distance between and with a density)
Suppose is a density function on with . The data is generated as Model (1.2). It follows that
where . The two-sided tail bound also holds as follows:
where .
Proof Note that
The first term can be bounded by the original DKW inequality and the second term can be bounded using the result of Proposition 3. Then we conclude the second result. The first result can be obtained in the same fashion.
Corollary 6
Under the same setup of Theorem 5, we have , and .
3.3 with a Smooth Density
In this section, we investigate the estimation of with a smooth density. Under this condition, we first obtain a stronger result than Proposition 3 for , based on a truncated K-S distance on the interval , where . Proposition 7 shows that if the density of is smooth, the truncated K-S distance decreases at . The proof is based on the fact that a binomial random variable behaves like a Gaussian random variable when the binomial probability is bounded away from and . When is close to or , the Gaussian approximation cannot be used since it has an unbounded variance . The proof is deferred to Appendix E.
Proposition 7 (Deviation of from with a smooth density)
Suppose is a density function on with being absolutely continuous. Let and . It follows for any that
where is some constant that only depends on and .
Then, we investigate the histogram estimator, since it is one of the simplest nonparametric density estimators and it has a theoretical guarantee when the density is smooth. Let be an integer and define bins
Let be the number of observations in , and . It is known that under certain smoothness conditions, the histogram converges at a cube-root rate in root mean squared error (MSE) [41, Chapter 6]. Next we study how the histogram behaves under the binomial mixture model. Denote
where is the bandwidth and denotes the bin that falls in. Theorem 8 shows that the histogram estimator based on ’s has the same convergence rate in terms of the MSE metric as the histogram estimator based on ’s if and . This rate might not improve even if has higher-order continuous derivatives, since is bounded by , which dominates when has a -order continuous derivative. The proof of Theorem 8 is delegated to Appendix F.
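The estimator itself is elementary; a minimal sketch (the bin count and trial count are placeholder arguments) builds the histogram density from the observed ratios:

```python
import numpy as np

def histogram_density(y, n_trials, num_bins):
    """Histogram density estimate on [0, 1] built from the ratios Y_i / N,
    with bandwidth h = 1 / num_bins.  Returns bin edges and the
    piecewise-constant density values."""
    ratios = np.asarray(y, dtype=float) / n_trials
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    counts, _ = np.histogram(ratios, bins=edges)
    density = counts * num_bins / ratios.size   # counts / (sample size * bin width)
    return edges, density
```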
Theorem 8 (Upper bound of the histogram risk for binomial mixture model)
Let be the integrated risk on the interval . Assume that is absolutely continuous. It follows that
Furthermore, if , , we have
Here , and are positive constants that only depend on and , only depends on .
Finally, we conclude this section with a study on whose density is discretized into a point mass function of non-zero masses. In contrast to the existing results in [36], we allow to be as large as , and study the finite-sample rate of the histogram estimator. Let be a point mass function such that
where and , . Denote by the interval centered at and of length . We can make “smooth” by letting , where is a smooth function. Denote
Then the MSE can be defined as . It is known that . Theorem 9 shows that the same rate can be achieved by when . The proof is deferred to Appendix G.
Theorem 9 (Upper bound of the histogram risk for finite binomial mixture)
Let , where is absolutely continuous and . Let be the risk on the interval . It follows that
Furthermore, if , then
Here are positive constants that only depend on and .
3.4 with a Monotone Density
Next, we shift our attention to with a monotone density . It is motivated by the U shape in the histograms of the GeneFishing output, where we can decompose the U shape into a decreasing part, a flat part, and an increasing part; see Section 4.1. To estimate , a natural solution is the Grenander estimator [8, 14]. Specifically, we first construct the least concave majorant of the empirical CDF of ; then its left derivative is the desired estimator. The detailed review of the Grenander estimator is deferred to Appendix C.
In the sequel, we establish convergence and inference results for , the Grenander estimator based on ’s. Theorem 10 states that achieves the convergence rate of w.r.t. the distance if .
Theorem 10 ( convergence of )
Suppose is a decreasing density on with . It follows that
Furthermore, if , we have
Here are positive constants that only depend on .
Theorem 10 follows from Corollary 6 and [4]. The details can be found in Appendix H. Furthermore, we discuss the minimal needed to ensure the convergence rate in terms of the distance. We have seen that is a sufficient condition. The question naturally arises whether it is also necessary. Based on the density (3.1) used in Proposition 4, we can show that is also the minimal binomial size; see Proposition 11.
Proposition 11
Proof Note that
where in Inequality (i); Inequality (ii) holds because , which can be easily derived from Marshall’s lemma [22]; Inequality (iii) holds because of Proposition 4 and the DKW inequality, where and are two positive constants that only depend on . To ensure the expected distance between and is bounded by , it is necessary to have
which implies that , for some positive constant that depends on .
Next, we study the local asymptotics of . For the binomial mixture model, we obtain a result for similar to the local asymptotics of when grows faster than , as shown in Theorem 12.
Theorem 12 (Local Asymptotics of )
Suppose is a decreasing density on and is smooth at . Then when as , if , we have
- (A) If is flat in a neighborhood of , and is the flat part containing , it follows that
  where is the slope at of the least concave majorant in of a standard Brownian bridge in .
- (B) If near for some and , it follows that
  where is the slope at of the least concave majorant of the process , and is a standard two-sided Brownian motion on with .
The proof of Theorem 12 relies on Proposition 3 and the Komlós-Major-Tusnády (KMT) approximation [17]. Given these two results, we can show that if is upper bounded, there exists a sequence of Brownian bridges such that
where only depends on and are universal positive constants. Then following [40], we can prove Theorem 12. The details are deferred to Appendix I.
Finally, we conclude this section by comparing the histogram estimator and the Grenander estimator (for a density). Both are bin estimators but differ in the choice of the bin width. One can pick the bin width for the histogram to attain optimal convergence rates [41]. In contrast, the bin widths of the Grenander estimator are chosen completely automatically by the estimator and are naturally locally adaptive [4]. As a consequence, the Grenander estimator guarantees monotonicity, while the histogram estimator does not. If the underlying model is monotone, the Grenander estimator has a better convergence rate than the histogram estimator. Notably, the convergence theory of the histogram estimator cannot be established unless the density is smooth, while that of the Grenander estimator only requires that the density be monotone and integrable () [4].
In our setup, we show that when , both the Grenander estimator and the histogram estimator, based on , have the same rate at in distance (the convergence of the histogram can imply the convergence). It seems that both methods are comparable. Nonetheless, we mention that the conditions for the convergence of the two estimators are different. The Grenander estimator requires a bounded monotone density, while the histogram requires a smooth density.
4 Estimation and Inference of with U-shape Constraint
Now we have sufficient insight into the estimation of under various conditions in the binomial mixture model (1.2). We are ready to cast our attention back to the cutoff selection problem for the GeneFishing method, i.e., distinguishing the genes relevant to the bait genes (or the associated biological process) from the irrelevant ones. To answer this question, we leverage the observation that the histogram of appears to have a U shape, as shown in Figure 1.
4.1 Model
We decompose the density or the pmf of into three parts: the first part decreases, the second part remains flat, and the last part increases. The first part is assumed to be purely related to the irrelevant genes; the second part is associated with a mixture of the irrelevant and the relevant genes; the last part corresponds purely to the relevant genes. Denote by and the transition points from the first part to the second part and from the second part to the third part, respectively. Then the question is reduced to identifying and obtaining an upper confidence bound on . In the sequel, we formally state this assumption as Assumption 1 when is associated with a continuous random variable. The corresponding mathematical formulations for the pmf are similar, so we omit them here.
Assumption 1
Let be the derivative of , i.e., the probability density function. We assume consists of three parts:
(4.1) |
where , is a decreasing function, is an increasing function such that , , and with .
For the U-shaped constraint, we also need
The shape constraint (4.1) is determined by six parameters , but they are not identifiable. Below is an example of such unidentifiability.
Example 1 (Identifiability Issue for (4.1))
The parameters yield the same model as if .
The identifiability issue results from the vague transitions from one part to the next adjacent part in Model (4.1). To tackle it, we need to introduce some assumptions that sharpen the transitions. For example, if is smooth, then a sharp transition means the first derivative of , i.e., the slope, changes significantly at this point. In general, we do not impose smoothness on and use the finite difference as a surrogate for the slope. To be specific, suppose there exist and neighborhoods of and , whose sizes are and respectively, such that
For the sake of convenience, we consider a stronger condition that drops the factors and , which we call Assumption 2. It indicates that the density jumps at the transition points and .
Assumption 2
There are two positive parameters and such that
4.2 Method
Let and be the underlying ground truth of the two cutoffs in BMU. Our goal is to identify the cutoff that separates the relevant genes and the irrelevant genes in the GeneFishing method. Specifically, we want to find an estimator for and study the behavior of .
Define , , , where can be or . Define , , , , where can be or .
Since we are working on ’s rather than on ’s, we denote and for simplicity. Similarly, we can get simplified notation , , , , . In the rest of the paper, we sometimes use , , for , and respectively if no confusion arises.
4.2.1 The Non-parametric Maximum Likelihood Estimation
To estimate the parameters in BMU, we consider non-parametric maximum likelihood estimation (NPMLE). We first solve the problem given and , and then search for the optimal and by grid search. The NPMLE problem is:
(4.5)
Here and are two parameters to tune, and we call the inequalities (4.5) the change-point-gap constraint, which corresponds to Assumption 1 and Assumption 2. Given and , the variables to optimize over are
where , . Since is continuous and concave w.r.t. , and the feasible set is convex (it is easy to check that is a convex set), the problem (4.2.1) is a convex optimization problem with a unique optimizer.
4.2.2 Ucut: A Simplified Estimator
Fortunately, the following observations suggest a simplified optimization problem.
- Note that and are the population masses for and , which can be well estimated by the empirical masses , and .
- 
- From Figure 1, we can see that the flat region is wide. We can easily pick an interior point within the flat region.
Inspired by these observations, we replace the population masses with the empirical masses and drop the change-point-gap constraint. We obtain the simplified objective function as follows:
where
The problem (4.2.2) reduces to two monotone density estimation problems, which the Grenander estimator can solve. As pointed out in the observations above, we can easily identify an interior point in the flat region. We fit a Grenander estimator for the decreasing on and a Grenander estimator for the increasing on . There are three advantages of using the interior point . First, it significantly reduces the computational cost, since the two Grenander estimators are fitted just once, regardless of the choices of and . Second, it bypasses the boundary issue of the Grenander estimators, since we are mainly concerned with the behaviors of the estimators at the points and . Moreover, the use of disentangles the mutual influence of the left decreasing part and the right increasing part, which makes the analysis of the estimators simple.
Once we fit the Grenander estimators, we check whether the change-point-gap constraint (4.5) holds for different pairs of and . Finally, we pick the feasible pair with the maximal likelihood. We call this algorithm Ucut (U-shape cutoff), which is summarized in Algorithm 1.
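A schematic reading of this workflow is sketched below. It reuses the `grenander_decreasing` sketch from Section 2, treats the flat-part height as the empirical mass between the candidate cutoffs divided by their distance, and simplifies the likelihood bookkeeping and constraint scaling; the grid step and tolerance values are placeholders, so this is an illustration of the idea rather than the paper's exact Algorithm 1.

```python
import numpy as np

def step_density(knots, heights, t):
    """Evaluate a piecewise-constant (Grenander-type) density at points t."""
    idx = np.clip(np.searchsorted(knots, t, side="right") - 1, 0, heights.size - 1)
    return heights[idx]

def ucut(x, mid, eps_left, eps_right, grid_step=0.01):
    """Schematic Ucut: Grenander fits on the two sides of an interior point `mid`
    of the flat region, then a grid search over cutoff pairs (a, b) that satisfy
    the change-point-gap constraint, keeping the pair with the largest
    (simplified) log-likelihood."""
    x = np.asarray(x, dtype=float)
    left, right = x[x <= mid], x[x > mid]
    w_left, w_right = left.size / x.size, right.size / x.size

    kl, hl = grenander_decreasing(left)            # decreasing fit on [0, mid]
    kr, hr = grenander_decreasing(1.0 - right)     # increasing fit via reflection
    f_left = lambda t: w_left * step_density(kl, hl, t)
    f_right = lambda t: w_right * step_density(kr, hr, 1.0 - t)

    best_pair, best_ll = None, -np.inf
    for a in np.arange(grid_step, mid, grid_step):
        for b in np.arange(mid + grid_step, 1.0, grid_step):
            flat = np.mean((x > a) & (x <= b)) / (b - a)     # flat-part height
            # Change-point-gap constraint: the density must drop by at least
            # eps_left at a and rise by at least eps_right at b.
            if f_left(a) - flat < eps_left or f_right(b) - flat < eps_right:
                continue
            dens = np.where(x <= a, f_left(x),
                            np.where(x <= b, flat, f_right(x)))
            ll = np.sum(np.log(np.maximum(dens, 1e-12)))
            if ll > best_ll:
                best_pair, best_ll = (a, b), ll
    return best_pair
```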
4.3 Analysis
For Algorithm 1, the question arises whether is a feasible pair for the change-point-gap constraint (4.5). Theorem 13 answers this question by claiming that there exist in a small neighborhood of and in a small neighborhood of such that and for appropriate choices of and . This implies that we can safely set aside the constraint (4.5) when solving the problem (4.2.2). The proof of Theorem 13 is deferred to Appendix J.
Theorem 13 (Feasibility of Gap Constraint for )
Suppose is a distribution satisfying Assumption 1 and Assumption 2, with , . If , and , , there exist and such that
and
such that , , and produced by Algorithm 1 satisfy and , provided the input . Furthermore, the resulting density estimator satisfies
Here are positive constants that only depend on .
Besides knowing there are some feasible points near and , we want to have a clear sense of the optimal cutoff pair produced by Algorithm 1. Theorem 14 says that the optimal cutoff for the left (right) part is smaller (larger) than () with high probability.
Theorem 14 (Tail Bounds of Identified Cutoffs)
Suppose is the identified optimal cutoff pair produced by Algorithm 1, provided an input . Under the same assumptions as Theorem 13, particularly , ,
and
where , are positive constants, and only depends on , , only depends on , ; is the slope at of the least concave majorant in of a standard Brownian Bridge in .
The proof of Theorem 14 reduces to proving that any cutoff pair with or fails the change-point-gap constraint with high probability. Since and estimated in Algorithm 1 are decreasing and increasing respectively, if (or ) violates the constraint, then (or ) will violate it with high probability if (or ). So it suffices to consider the smallest and the largest in the grid search space of Algorithm 1. The result then follows from Theorem 12. The details are deferred to Appendix K.
5 Experiments of Ucut
5.1 Numerical Experiments
5.1.1 Data generating process
To confirm and complement our theory, we use extensive numerical experiments to examine the finite-sample performance of Ucut in estimating . We study a BMU model composed of linear components. Specifically, the model consists of three parts with boundaries and . The middle part is a flat region of height . The left part is a segment with its right end at and slope , while the right part is a segment with its left end at and slope ; see Figure 2 (a) for an illustration. We call it the linear valley model. We normalize the linear valley model to produce the density of interest. We call the normalized gaps and .
The linear valley model depicts the mixture density of the null distribution and the alternative distribution. We only assume that the left part ahead of purely belongs to the null distribution while the right part ahead of purely belongs to the alternative distribution. The null and the alternative distributions can hardly be distinguished in the flat middle part. To understand the linear valley model from the perspective of the two-group model, we assume that of the middle part belongs to the null distribution while the remaining belongs to the alternative model. In Figure 2 (b), the part in left slash corresponds to the null density while the part in right slash corresponds to the alternative density . Let be the area in the left slash divided by the total area. Then the marginal density can be written as . Since any gives the same , the middle part is not identifiable. It is necessary to estimate and infer the right cutoff , so that we can safely claim all the samples beyond are from the alternative distribution.
By default, we set , , , , , , . We sample samples from the linear valley model. Then for each , we generate the observations independently, where unless otherwise specified. The value of does not affect the data generation, but it affects the FDR and the power of any method that yields discoveries.
Finally, the left part and the right part need not be linear. To investigate the effect of general monotone cases and misspecified cases (e.g., unimodal densities), we replace the left part and the right part with other functions; see Appendix M. All code to replicate the results in this paper can be found at github.com/Elric2718/GFcutoff/.
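A small simulation of this data-generating process is sketched below; the boundary, height, and slope values are illustrative placeholders (the paper's defaults are not reproduced here), and the mixing draws are obtained by rejection sampling from the normalized valley shape.

```python
import numpy as np

rng = np.random.default_rng(1)

def valley_pdf(t, t0=0.3, t1=0.7, c=0.2, s_left=8.0, s_right=8.0):
    """Unnormalized linear-valley shape: decreasing on [0, t0], flat at height c
    on [t0, t1], increasing on [t1, 1].  All numeric defaults are placeholders."""
    return np.where(t < t0, c + s_left * (t0 - t),
           np.where(t > t1, c + s_right * (t - t1), c))

def sample_valley(n, n_trials, **kw):
    """Draw p_i by rejection sampling from the normalized valley density and
    return the binomial observations Y_i ~ Binomial(n_trials, p_i)."""
    # The shape is largest at the boundaries, which gives the rejection bound.
    bound = max(valley_pdf(np.array([0.0]), **kw)[0],
                valley_pdf(np.array([1.0]), **kw)[0])
    p = np.empty(0)
    while p.size < n:
        cand = rng.uniform(0.0, 1.0, size=2 * n)
        keep = rng.uniform(0.0, bound, size=cand.size) < valley_pdf(cand, **kw)
        p = np.concatenate([p, cand[keep]])
    p = p[:n]
    return p, rng.binomial(n_trials, p)

p, y = sample_valley(n=5000, n_trials=1000)
cfr = y / 1000   # observed CFRs fed to Ucut
```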
5.1.2 Robustness to model parameters
When using Algorithm 1, we use the middle point , the left gap parameter , the right gap parameter , the searching unit . We first investigate how the binomial size affects the estimation of . Using the default setup as described in Section 5.1.1, we vary the binomial size , where refers to the case that there is no binomial randomness and we observe ’s directly. As shown in Figure 3 (a), converges to the true as grows. When , the estimated is as good as that of using ’s directly. This corroborates our theory that we only need when the BMU model holds. Note that in the linear setup, even with , is larger than true with large probability. It implies that Ucut is safe in the sense that it will make few false discoveries by using as the cutoff.
In the sequel, we stick to since it works well enough for the linear valley model. We investigate whether the width of the middle flat region affects the estimation of . We consider , with while other model parameters are set to their defaults. In Figure 3 (b), the estimation of is quite satisfactory when the width is no smaller than . When the width drops to or smaller, the estimation is less stable but still conservative in the sense that in most cases.
Finally, we examine how the gap size influences the estimation of . We take and . Figure 3 (c) shows that the estimation of is robust to the gap sizes as long as the input and are smaller than the true gaps. This gives us confidence in applying Ucut to identify the cutoff even when there is no gap, which is more realistic.
5.1.3 Sensitivity of the algorithm hyper-parameters
Algorithm 1 (Ucut) has three main tuning parameters: the middle point , the left gap and the right gap . In practice, these tuning parameters may be misspecified. For example, the middle point is not easy to spot, or the left gap and the right gap are too small. We use the default model parameters as specified in Section 5.1.1. Let and be times the true normalized gaps and respectively. We vary the choice of the middle point . Figure 4 (a) shows that the estimation of is not sensitive to the choice of as long as it is picked within the flat region . If is picked to the left of the flat region, the has a larger variance but is more conservative in the sense that in most cases. If is picked to the right of the flat region, the tends to be .
Next, we fix but consider and , where . We do not consider because there might not exist a feasible that satisfies the change-point-gap constraint. Figure 4 (b) shows that when is within , the estimation is satisfactory. The estimated can be slightly smaller than the true when , but within a tolerable range.
In a nutshell, the choices of , and are crucial to Algorithm 1, but the sensitivity analysis indicates that one need not be excessively cautious. In practice, picking these parameters by eye gives a safe estimate in most cases.
5.1.4 Comparison to other methods
To examine the power of Ucut and its capacity to control the FDR, we consider the two-group model as specified in Figure 2 (b). The middle part is not identifiable, which means that samples from the alternative distribution cannot be distinguished from those of the null distribution. To reflect this point, we arbitrarily set the proportion of the null distribution in the middle part to . The goal is to identify the right cutoff but not , because it is impossible to infer .
We compare Ucut to four other methods studied in [7], i.e., ZIGP (Zero-Inflated Generalized Poisson), ZIP (Zero-Inflated Poisson), GP (Generalized Poisson) and P (Poisson). The data studied in that paper and its primary purpose are highly similar to ours. First, they study mutation counts for a specific protein domain, which have excess zeros and are assumed to follow a zero-inflated model. Moreover, the four methods are used to select significant mutation counts (i.e., a cutoff) while controlling Type I error at a given level via false discovery rate procedures. Therefore, the four methods are good competitors for Ucut in view of the data distribution and the method usage.
Figure 5 shows that GP and P use rather small cutoffs and have too large FDRs. ZIGP and ZIP are over-conservative if the target FDR level is too low at or , thus having quite low power. They perform better when the target FDR level is set to be . On the other hand, Ucut can control FDR at if we directly use as the cutoff. The associated power is better than those of ZIGP and ZIP. In order to loosen the FDR control and get higher power, it is fine to use a slightly smaller cutoff than . From this result, we confirm that Ucut is a better fit for the scenario where the middle part is not distinguishable.
5.2 Application to Real Data
We further demonstrate the performance of Ucut on the real datasets used in [18]. Specifically, we apply the GeneFishing method to four GTEx RNAseq datasets: Liver, Artery Coronary, Transverse Colon, and Testis; see Table 1 for details. We leverage the same set of bait genes used in [18]. The number of fishing rounds is set to .
| | # samples | # genes |
|---|---|---|
| Liver | | |
| Artery-Coronary | | |
| Colon-Transverse | | |
| Testis | | |
Once the CFRs are generated, we apply Algorithm 1 with the middle point , and . We take to be ten times because there are many more zeros than ones among the CFRs. As shown in Section 5.1, Ucut is not sensitive to the three hyper-parameters, and changing them has little influence on the results. Table 2 shows that for each tissue Ucut gives the estimate of that yields to discoveries. We estimate the false discovery rate using the second approach in [18]. Note that for Artery-Coronary, the estimated by Ucut (which gives ) is less extreme than simply using (which gives ). This indicates that Ucut adapts to the tissue and can pick a cutoff with a reasonable false discovery rate.
| | | bootstrapping estimation | | # discovery (use ) |
|---|---|---|---|---|
| Liver | | | | |
| Artery-Coronary | | | | |
| Colon-Transverse | | | | |
| Testis | | | | |
In addition, we apply the GeneFishing method to a single-cell dataset of pancreas cells from Tabula Muris (https://tabula-muris.ds.czbiohub.org). It contains cells from mice and genes expressed in enough cells, out of about genes. We select bait genes based on the pancreas insulin secretion gene ontology (GO) term.
Unlike Figure 1, the CFRs of this dataset do not appear in a U shape. Instead, we observe a unimodal pattern around zero and an increasing pattern around one (Figure 6). Nonetheless, this does not prevent us from using Ucut to determine the cutoff, since we are mainly concerned with the right cutoff and Section 5.1 demonstrates that Ucut is conservative even when the model is misspecified. Using , and , Ucut gives as the estimate of the right cutoff, which discovers genes. GO enrichment analysis shows that these identified genes are enriched for the GO term response to ethanol with p-value , the GO term positive regulation of fatty acid biosynthesis with p-value , and the GO term eating behavior with p-value . These GO terms have been shown to relate to insulin secretion in the literature [13, 26, 35], which indicates the effectiveness of Ucut.
6 Discussion
In this work, we analyze the binomial mixture model (1.2) under the U-shape constraint, which is motivated by the results of the GeneFishing method [18]. The contributions of this work are two-fold. First, to the best of our knowledge, this is the first paper to investigate the relationship between the binomial size and the sample size for the binomial mixture model under various conditions for . Second, we provide a convenient tool for the downstream decision-making of the GeneFishing method.
Despite the identifiability issue of the binomial mixture model, we show that the estimator of the underlying distribution deviates from the true distribution, in some distance, by at most a small quantity that depends on . The implication is that to attain the same convergence rate as if there were no binomial randomness, we need for the empirical CDF and the Grenander estimator when the underlying density is, respectively, bounded and monotone. It is also of great interest to further investigate how the minimal hinges on the smoothness of the underlying distribution, e.g., by studying the kernel density estimator and the smoothing spline estimator under the binomial mixture model.
To answer the motivating question of how large the CFR should be so that the associated sample can be regarded as a discovery in the GeneFishing method, we propose the BMU model to describe the underlying distribution and the NPMLE method Ucut to determine the cutoff. This estimator comprises two Grenander estimators, and thus has the same cube-root convergence rate as the Grenander estimator when . We also show that the estimated cutoff is larger than the true cutoff with high probability. On the other hand, all these results are established for a bounded density. We leave it as future work to study the relationship between the convergence rate and the smoothness of the density.
Finally, we conclude the paper with a discussion of the empirical performance of Ucut. The simulation studies reassure practitioners that Ucut is not sensitive to the choice of its hyper-parameters — the default setup can handle most cases. In the real data example used in [18], Ucut gives a cutoff with a low estimated FDR using much less time than [18]. In the other example, the pancreas single-cell dataset, we use GeneFishing with Ucut and find genes that GO enrichment analysis shows to be related to pancreatic insulin secretion. Therefore, we are confident that GeneFishing with Ucut can uncover interesting signals in biology and other fields when applied to more real datasets.
References
- [1] Raghu Raj Bahadur “Some approximations to the binomial distribution function” In The Annals of Mathematical Statistics JSTOR, 1960, pp. 43–54
- [2] Fadoua Balabdaoui et al. “On the Grenander estimator at zero” In Statistica Sinica 21.2 NIH Public Access, 2011, pp. 873
- [3] O Barndorff-Nielsen “Identifiability of mixtures of exponential families” In Journal of Mathematical Analysis and Applications 12.1 Elsevier, 1965, pp. 115–121
- [4] Lucien Birgé “The Grenander estimator: A nonasymptotic approach” In The Annals of Statistics JSTOR, 1989, pp. 1532–1549
- [5] Steven R Dunbar “The de Moivre-Laplace Central Limit Theorem” In Technical Report, 2011
- [6] Martin D Fraser, Yu Sheng Hsu and J J Walker “Identifiability of finite mixtures of von Mises distributions” In The Annals of Statistics 9.5 JSTOR, 1981, pp. 1130–1134
- [7] Iris Ivy M Gauran et al. “Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data” In Biometrics 74.2 Wiley Online Library, 2018, pp. 458–471
- [8] Ulf Grenander “On the theory of mortality measurement: part ii” In Scandinavian Actuarial Journal 1956.2 Taylor & Francis, 1956, pp. 125–153
- [9] Leonardo Grilli, Carla Rampichini and Roberta Varriale “Binomial mixture modeling of university credits” In Communications in Statistics-Theory and Methods 44.22 Taylor & Francis, 2015, pp. 4866–4879
- [10] Wassily Hoeffding “Probability Inequalities for Sums of Bounded Random Variables” In Journal of the American Statistical Association 58.301 Taylor & Francis, 1963, pp. 13–30 DOI: 10.1080/01621459.1963.10500830
- [11] Jian Huang and Jon A Wellner “Estimation of a monotone density or monotone hazard under random censoring” In Scandinavian Journal of Statistics JSTOR, 1995, pp. 3–33
- [12] Youping Huang and Cun-Hui Zhang “Estimating a monotone density from censored observations” In The Annals of Statistics JSTOR, 1994, pp. 1256–1274
- [13] Zhen Huang and Åke Sjöholm “Ethanol acutely stimulates islet blood flow, amplifies insulin secretion, and induces hypoglycemia via nitric oxide and vagally mediated mechanisms” In Endocrinology 149.1 Oxford University Press, 2008, pp. 232–236
- [14] Hanna K Jankowski and Jon A Wellner “Estimation of a discrete monotone distribution” In Electronic journal of statistics 3 NIH Public Access, 2009, pp. 1567
- [15] John T Kent “Identifiability of finite mixtures for directional data” In The Annals of Statistics JSTOR, 1983, pp. 984–988
- [16] Marc Kéry “Estimating abundance from bird counts: binomial mixture models uncover complex covariate relationships” In The Auk 125.2 Oxford University Press, 2008, pp. 336–345
- [17] János Komlós, Péter Major and Gábor Tusnády “An approximation of partial sums of independent RV’-s, and the sample DF. I” In Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 32.1-2 Springer, 1975, pp. 111–131
- [18] Ke Liu et al. “GeneFishing to reconstruct context specific portraits of biological processes” In Proceedings of the National Academy of Sciences 116.38 National Acad Sciences, 2019, pp. 18943–18950
- [19] Frederic M Lord “Estimating true-score distributions in psychological testing (an empirical Bayes estimation problem)” In Psychometrika 34.3 Springer, 1969, pp. 259–299
- [20] Frederic M Lord and Noel Cressie “An empirical Bayes procedure for finding an interval estimate” In Sankhyā: The Indian Journal of Statistics, Series B JSTOR, 1975, pp. 1–9
- [21] U Lüxmann-Ellinghaus “On the identifiability of mixtures of infinitely divisible power series distributions” In Statistics & probability letters 5.5 Elsevier, 1987, pp. 375–378
- [22] AW Marshall “Discussion on Barlow and van Zwet’s paper” In Nonparametric Techniques in Statistical Inference 1969 Cambridge University Press New York, 1970, pp. 174–176
- [23] Albert W Marshall and Frank Proschan “Maximum likelihood estimation for distributions with monotone failure rate” In The annals of mathematical statistics 36.1 JSTOR, 1965, pp. 69–77
- [24] Pascal Massart “The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality” In The annals of Probability JSTOR, 1990, pp. 1269–1283
- [25] Trent McDonald et al. “Evidence of Absence Regression: A Binomial N-Mixture Model for Estimating Bird and Bat Fatalities at Wind Power Facilities” In bioRxiv Cold Spring Harbor Laboratory, 2020 DOI: 10.1101/2020.01.21.914754
- [26] Christopher J Nolan et al. “Fatty acid signaling in the β-cell and insulin secretion” In Diabetes 55.Supplement 2 Am Diabetes Assoc, 2006, pp. S16–S23
- [27] Katherine M O’Donnell, Frank R Thompson III and Raymond D Semlitsch “Partitioning detectability components in populations subject to within-season temporary emigration using binomial mixture models” In PLoS One 10.3 Public Library of Science, 2015, pp. e0117216
- [28] GP Patil and Sheela Bildikar “Identifiability of countable mixtures of discrete probability distributions using methods of infinite matrices” In Mathematical Proceedings of the Cambridge Philosophical Society 62.3, 1966, pp. 485–494 Cambridge University Press
- [29] BLS Prakasa Rao “Estimation for distributions with monotone failure rate” In The annals of mathematical statistics JSTOR, 1970, pp. 507–519
- [30] J Andrew Royle “N-mixture models for estimating population size from spatially replicated counts” In Biometrics 60.1 Wiley Online Library, 2004, pp. 108–115
- [31] Theofanis Sapatinas “Identifiability of mixtures of power-series distributions and related characterizations” In Annals of the Institute of Statistical Mathematics 47.3 Springer, 1995, pp. 447–459
- [32] S Sivaganesan and James Berger “Robust Bayesian analysis of the binomial empirical Bayes problem” In Canadian Journal of Statistics 21.1 Wiley Online Library, 1993, pp. 107–119
- [33] GM Tallis “The identifiability of mixtures of distributions” In Journal of Applied Probability 6.2 JSTOR, 1969, pp. 389–398
- [34] GM Tallis and Peter Chesson “Identifiability of mixtures” In Journal of the Australian Mathematical Society 32.3 Cambridge University Press, 1982, pp. 339–348
- [35] Muneki Tanaka et al. “Eating pattern and the effect of oral glucose on ghrelin and insulin secretion in patients with anorexia nervosa” In Clinical endocrinology 59.5 Wiley Online Library, 2003, pp. 574–579
- [36] Henry Teicher “Identifiability of finite mixtures” In The annals of Mathematical statistics JSTOR, 1963, pp. 1265–1269
- [37] Henry Teicher “Identifiability of mixtures” In The annals of Mathematical statistics 32.1 Institute of Mathematical Statistics, 1961, pp. 244–248
- [38] Hoben Thomas “A binomial mixture model for classification performance: A commentary on Waxman, Chambers, Yntema, and Gelman (1989)” In Journal of Experimental Child Psychology 48.3 Elsevier, 1989, pp. 423–430
- [39] Aad Van der Vaart and Mark Van der Laan “Smooth estimation of a monotone density” In Statistics: A Journal of Theoretical and Applied Statistics 37.3 Taylor & Francis, 2003, pp. 189–203
- [40] Y. Wang “Nonparametric Estimation Subject to Shape Restrictions” University of California, Berkeley, 1992 URL: https://books.google.com/books?id=QmVMAQAAMAAJ
- [41] Larry Wasserman “All of nonparametric statistics” Springer Science & Business Media, 2006
- [42] GR Wood “Binomial mixtures: geometric estimation of the mixing distribution” In The Annals of Statistics 27.5 Institute of Mathematical Statistics, 1999, pp. 1706–1721
- [43] Michael Woodroofe and Jiayang Sun “A penalized maximum likelihood estimate of f (0+) when f is non-increasing” In Statistica Sinica JSTOR, 1993, pp. 501–515
- [44] Guohui Wu, Scott H Holan, Charles H Nilon and Christopher K Wikle “Bayesian binomial mixture models for estimating abundance in ecological monitoring studies” In Annals of Applied Statistics 9.1 Institute of Mathematical Statistics, 2015, pp. 1–26
- [45] Sidney J Yakowitz and John D Spragins “On the identifiability of finite mixtures” In The Annals of Mathematical Statistics JSTOR, 1968, pp. 209–214
Appendix
Appendix A The Identifiability Issue of Binomial Mixture Model
We review existing results for mixtures of distributions and the binomial mixture model in the literature. A general mixture model is defined as
(A.1) |
where is a density function for all , is a Borel measurable function for each and is a distribution function defined on . The family , is referred to as the kernel of the mixture and as the mixing distribution function. In order to devise statistical procedures for inferential purposes, an important requirement is the identifiability of the mixing distribution. Without this condition, it is not meaningful to estimate the distribution either non-parametrically or parametrically. The mixture defined by (A.1) is said to be identifiable if there exists a unique yielding , or equivalently, if the relationship
implies for all .
The identifiability problems concerning finite and countable mixtures (i.e. when the support of in (A.1) is finite and countable respectively) have been investigated by [36, 28, 45, 33, 6, 34, 15]. Examples of identifiable finite mixtures include: the family of Gaussian distribution , the family of one-dimensional Gaussian distributions, the family of one-dimensional gamma distributions, the family of multiple products of exponential distributions, the multivariate Gaussian family, the union of the last two families, the family of one-dimensional Cauchy distributions, etc.
For sufficient conditions for the identifiability of arbitrary mixtures, [37] studied the identifiability of mixtures of additively closed families, while [3] discussed the identifiability of mixtures of some restricted multivariate exponential families. [21] gave a sufficient condition for the identifiability of a large class of discrete distributions, namely the power-series distributions. Using topological arguments, he showed that if the family in question is infinitely divisible, mixtures of this family are identifiable. For example, the Poisson, negative binomial, and logarithmic series distributions are infinitely divisible, so their arbitrary mixtures are identifiable.
On the other hand, despite being a power-series distribution, the binomial distribution is not infinitely divisible. So its identifiability is not established for the success parameter [31]. In fact, the binomial mixture has often been regarded as unidentifiable, since can be determined only up to its first moments when is known exactly. To be more specific, is a linear function of the first moments of , for every . Therefore, any other with the same first moments will yield the same mixed distribution . To ensure identifiability for the binomial mixture model, it is common practice to assume that corresponds to a finite discrete pmf or a parametric density (e.g., a beta distribution).
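This moment argument is easy to verify numerically: any two mixing distributions that agree on their first moments (up to the binomial size) produce exactly the same mixture pmf. The point masses below, chosen for a binomial size of 2, are an assumed example for illustration only.

```python
import numpy as np
from scipy.stats import binom

N = 2  # binomial size

# Two different mixing distributions on [0, 1] with identical first N moments.
support_1, weights_1 = np.array([0.3, 0.7]), np.array([0.5, 0.5])
support_2, weights_2 = np.array([0.1, 0.5, 0.9]), np.array([0.125, 0.75, 0.125])
for k in range(1, N + 1):          # both have E[p] = 0.5 and E[p^2] = 0.29
    assert np.isclose(weights_1 @ support_1**k, weights_2 @ support_2**k)

def mixture_pmf(support, weights):
    """pmf of Y, where Y | p ~ Binomial(N, p) and p follows the discrete mixing law."""
    return np.array([(weights * binom.pmf(k, N, support)).sum() for k in range(N + 1)])

# The two mixtures are indistinguishable from observations of Y alone.
print(mixture_pmf(support_1, weights_1))   # [0.29 0.42 0.29]
print(mixture_pmf(support_2, weights_2))   # [0.29 0.42 0.29]
```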
In particular, there are two results for the identifiability of the binomial mixture model [36]:
1. If in (A.1) is a binomial distribution with a known binomial size , and the support of only contains points, then a necessary and sufficient condition for identifiability is that .
2. Consider as a binomial distribution with binomial size and probability , where , and the support of is . If for , then (A.1) is identifiable.
In this study, we are interested in the regime where the support size of may not be finite, and thus identifiability may fail for the binomial mixture model. Some efforts have been made without identifiability. [20, 32] constructed credible intervals for the Bayes estimators of each point mass and that of , which rely on the lower-order moments of the mixing distribution. [42] empirically shows that their proposed nonparametric maximum likelihood estimator of is unique with high probability when is large. However, these results are far from satisfactory in terms of our ultimate goal: to estimate or infer the underlying distribution .
Appendix B A Motivating Example: GeneFishing
The study of model (1.2) with a large is motivated by the GeneFishing method [18]. Given some knowledge of a biological process as "bait", GeneFishing was designed to "fish out" (identify) genes related to this process that have not yet been identified. In that work, the authors used a set of pre-identified "bait genes", all of which have known roles in cholesterol metabolism, and applied GeneFishing to three independent RNAseq datasets of human lymphoblastoid cell lines. They found that this approach identified other genes with known roles in cholesterol metabolism, and did so with high consistency across the three datasets. They also applied GeneFishing to GTEx human liver RNAseq data and identified the gene glyoxalase I (GLO1). In a follow-up wet-lab experiment, GLO1 knockdown increased levels of cellular cholesterol esters.
The GeneFishing procedure is as follows, as shown in Figure 7 (a minimal code sketch of the whole loop is given after the list):
-
1.
Randomly split the candidate genes into many sub-search-spaces of genes per sub-group (e.g., ), and add the bait genes to each. This step is the key reduction of the search space, helping the "signal" stand out from the "noise".
-
2.
Construct the Spearman co-expression matrix for the gene pairs contained within each sub-search-space. Apply spectral clustering (with the number of clusters equal to ) to each matrix separately. In most cases, the bait genes cluster separately from the candidate genes, but in some instances a candidate gene related to the bait genes clusters with them. When this occurs, the candidate gene is regarded as "fished out".
-
3.
Repeat steps 1 and 2 (one round of GeneFishing) times (e.g., ) to reduce the chance that a candidate gene is retained merely because it randomly co-clusters with the bait genes.
-
4.
Aggregate the results from all rounds: suppose the -th gene is fished out times out of rounds. The final output is a table recording the "capture frequency rate" (). The "fished-out" genes with large CFR values are regarded as "discoveries". Note, however, that we do not claim these "discoveries" perform the same or a similar specific function as the bait genes; we only believe they are likely to be functionally related to them. Figure 1 displays the distribution of 's with and the number of total genes on four tissues for the cholesterol-relevant genes.
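To make the aggregation concrete, the following minimal sketch runs the subsample-cluster-aggregate loop on toy data and computes capture frequency rates. All names, parameters (sub-group size, number of rounds, number of clusters) and the co-clustering criterion are illustrative placeholders, not the authors' implementation.

# Minimal, illustrative sketch of the GeneFishing loop; not the authors' code.
import numpy as np
from scipy.stats import spearmanr
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
n_genes, n_samples, n_bait = 300, 50, 10
expr = rng.normal(size=(n_genes, n_samples))      # toy expression matrix (genes x samples)
bait = np.arange(n_bait)                          # indices of the bait genes
candidates = np.arange(n_bait, n_genes)

def one_round(sub_size=90, n_clusters=2):
    # One GeneFishing round: split the candidates, cluster each sub-space together with the baits.
    fished = np.zeros(n_genes, dtype=bool)
    perm = rng.permutation(candidates)
    for start in range(0, len(perm), sub_size):
        sub = perm[start:start + sub_size]
        genes = np.concatenate([bait, sub])
        corr, _ = spearmanr(expr[genes].T)        # Spearman co-expression matrix of the sub-space
        affinity = (corr + 1) / 2                 # map correlations from [-1, 1] to [0, 1]
        labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                    random_state=0).fit_predict(affinity)
        bait_cluster = np.bincount(labels[:n_bait]).argmax()
        fished[sub[labels[n_bait:] == bait_cluster]] = True   # candidates co-clustered with baits
    return fished

n_rounds = 20                                     # small for illustration; real runs use far more
counts = sum(one_round() for _ in range(n_rounds))
cfr = counts[candidates] / n_rounds               # capture frequency rate of each candidate gene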
Compared to other works for “gene prioritization”, GeneFishing has three merits. First, it takes care of extreme inhomogeneity and low signal-to-noise ratio in high-throughput data using dimensionality reduction by subsampling. Second, it uses clustering to identify tightly clustered bait genes, which is a data-driven confirmation of the domain knowledge. Finally, GeneFishing leverages the technique of results aggregation (motivated by a bagging-like idea) in order to prioritize genes relevant to the biological process and reduce false discoveries.
However, there remains an open question of how large a CFR should be for the associated gene to be labeled a "discovery". [18] picked a large cutoff by eye. This is acceptable when the histograms are sparse in the middle, as in the liver or transverse colon tissues (Figure 1 (a)(b)), since cutoff and cutoff make little difference in determining the discoveries. On the other hand, for the artery coronary and testis tissues (Figure 1 (c)(d)), the non-trivial middle part of the histogram is a mixture of the null (irrelevant to the biological process) distribution and the alternative (relevant to the biological process) distribution. The null and the alternative are hard to separate since the middle part is quite flat. Existing tools that use parametric models to estimate local false discovery rates [7] cannot account well for this middle flatness; they tend to select a smaller cutoff and produce excessive false discoveries. [18] provided a permutation-like procedure to compute approximate p-values and false discovery rates (FDRs). Nonetheless, there are two problems with this procedure. First, it is computationally expensive, since another round of permutation is added on top of numerous fishing rounds. Second, the permutation idea is based on a strong null hypothesis that none of the candidate non-bait genes are relevant to the bait genes, which is unrealistic and produces extreme p-values or FDRs.
To identify a reasonable cutoff on CFRs, we assume there is an underlying fishing rate reflecting the probability that the -th gene is fished out in each GeneFishing round. The fishing rates are assumed to be independently sampled from the same distribution . Thus, the observation (or ) can be modelled by (1.2). We note that the independence assumption is made mainly for convenience and is not realistic, since genes interact and are correlated within the same pathways or even remotely; consequently, the effective sample size is smaller than expected. However, the assumption is still acceptable if only a handful of candidate genes are related to the bait genes, which is reasonable from the biological perspective. Furthermore, we notice a clear pattern in Figure 1: the histogram is decreasing on the left-hand side and increasing on the right-hand side for all four tissues. In the middle, liver and colon transverse display sparse densities, while artery coronary and testis exhibit flat ones. Thus, we can impose a U shape constraint on the associated density of (see Section 4.1 for details). The original problem then becomes finding the cutoff where the flat middle part transitions to the increasing part on the right-hand side.
Appendix C Review of Grenander Estimator
Monotone density models are often used in survival analysis and in reliability analysis in economics; see [12, 11]. Maximum likelihood can be applied to monotone density estimation. Suppose that is a random sample from a density on that is known to be nonincreasing; the maximum likelihood estimator is defined as the nonincreasing density that maximizes the log-likelihood. [8] first showed that this optimization problem has a unique solution under the monotonicity assumption, so the estimator is also called the Grenander estimator. The Grenander estimator is explicitly given as the left derivative of the least concave majorant of the empirical distribution function . The least concave majorant of is defined as the smallest concave function with for every . Because is concave, its derivative is nonincreasing.
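As an illustration of this construction (not the exact code used elsewhere in the paper), the sketch below computes the Grenander estimator by taking left derivatives of the least concave majorant of the empirical CDF, using a simple stack-based majorant computation on the sorted sample; it assumes positive, distinct observations.

# Grenander estimator: left derivative of the least concave majorant (LCM) of the empirical CDF.
import numpy as np

def grenander(x):
    # Returns knots t_0 < ... < t_m and step heights; the density estimate equals
    # slopes[i] on the interval (knots[i], knots[i+1]].
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    t = np.concatenate(([0.0], x))            # jump points of the empirical CDF, plus 0
    F = np.arange(n + 1) / n                  # empirical CDF values at those points
    keep = [0]
    for i in range(1, n + 1):                 # build the LCM (upper concave hull) of (t, F)
        keep.append(i)
        while len(keep) >= 3:
            i0, i1, i2 = keep[-3], keep[-2], keep[-1]
            s1 = (F[i1] - F[i0]) / (t[i1] - t[i0])
            s2 = (F[i2] - F[i1]) / (t[i2] - t[i1])
            if s2 > s1:                       # concavity violated: drop the middle point
                keep.pop(-2)
            else:
                break
    knots = t[keep]
    slopes = np.diff(F[keep]) / np.diff(knots)   # nonincreasing density values
    return knots, slopes

# Example on a decreasing density (Exp(1) sample).
rng = np.random.default_rng(1)
knots, dens = grenander(rng.exponential(size=500))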
Theorem 16 ([23])
Suppose that are i.i.d random variables with a decreasing density on . Then the Grenander estimator is uniformly consistent on closed intervals bounded away from zero: that is, for each , we have
The inconsistency of the Grenander estimator at , when is bounded, was first pointed out by [43]. [2] later extended this result to other situations, considering different behaviors of the true density near zero. Theorem 17 explicitly characterizes the behavior of at zero.
Theorem 17 ([43])
Suppose that is a decreasing density on with , and let denote a rate Poisson process. Then
where is a uniform random variable on the unit interval.
[4] proved that the Grenander estimator has a cube-root convergence rate in the norm, as in Theorem 18. [39] pointed out that the rate of convergence of the Grenander estimator is slower than that of the monotone kernel density estimator when the underlying function is smooth enough.
Theorem 18 ([4])
Suppose is a decreasing density on with . It follows that
where is a constant that only depends on .
[29] first obtained the limiting distribution of the Grenander estimator at a point. He proved that converges to the location of the maximum of the process , where and is the standard two-sided Brownian motion on such that ; see [29]. [40] extended this result to flat regions and higher-order derivatives, as stated in Theorem 19.
Theorem 19 ([40])
Suppose is a decreasing density on and is smooth at . It follows that
-
(A)
If is flat in a neighborhood of , let be the flat part containing . Then,
where is the slope at of the least concave majorant in of a standard Brownian Bridge in .
-
(B)
If near for some and , then
where is the slope at of the least concave majorant of the process , and is a standard two-sided Brownian motion on with .
Appendix D Proof of Proposition 4
Proof Suppose
Note that
(D.1)
Decompose , where , . The previous example shows that the difference for the part in Equation (D.1) is at most . So we only need to take care of the part in Equation (D.1), i.e.,
provided . Here . Define
Theorem 1 of [1] indicates that , where . Let . By Stirling's formula and a Taylor expansion of , we can obtain
and we can rewrite
So we have
Then it follows that
Together, we have , where is a residual term with , and are positive constants.
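For reference, the standard form of Stirling's formula invoked in the expansion above is

\[
n! \;=\; \sqrt{2\pi n}\,\Bigl(\frac{n}{e}\Bigr)^{n}\bigl(1 + O(1/n)\bigr),
\qquad
\log n! \;=\; n\log n - n + \tfrac{1}{2}\log(2\pi n) + O(1/n).
\]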
Appendix E Proof of Proposition 7
Proof The proof of this proposition follows that of Theorem 8, so we defer most details to the proof of the latter.
From Equation (F.5), we know that
where is still defined as the bin that contains among . Note that
Following the proof of bounding in Theorem 8, we can bound
where .
Following the proof of bounding in Theorem 8, we can bound
where for some constant that only depends on .
To bound
we still follow the proof of bounding in Theorem 8. The difference is that there are terms instead of terms. By carefully arranging the terms and using the fact that for any positive , it is easily seen that we still get the same bound as above. In total, we have
where is some constant that only hinges on . To minimize the upper bound, take . Thus, it follows that
where is some constant that only depends on and .
Appendix F Proof of Theorem 8
Theorem 8 relies on the local limit theorem of binomial distribution, as follows.
Lemma 20
Suppose , with . For any such that as , we have
where is the error from the Gaussian approximation to Binomial point mass function, and is the error from the summation series approximating the integral. Specifically, we have
where , and are two positive constants that depend on via .
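As a quick numerical sanity check of the Gaussian approximation underlying Lemma 20, the snippet below compares the binomial pmf with the matching Gaussian density; the values of the binomial size and success probability are illustrative choices, not tied to the lemma's conditions.

# Local limit theorem check: Binomial(n, p) pmf at k versus the Gaussian density
# with mean n*p and variance n*p*(1-p). Values of n and p are illustrative.
import numpy as np
from scipy.stats import binom, norm

n, p = 2000, 0.3
sd = np.sqrt(n * p * (1 - p))
k = np.arange(int(n * p - 4 * sd), int(n * p + 4 * sd))
exact = binom.pmf(k, n, p)
approx = norm.pdf(k, loc=n * p, scale=sd)
print("max absolute error:", np.abs(exact - approx).max())     # shrinks as n grows
print("max relative error:", np.abs(exact / approx - 1).max())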
The detailed proof of Lemma 20 can be easily obtained by adapting that of [5]. Now we prove Theorem 8:
Proof
Denote by the risk at a point . We decompose the risk into the variance and the squared bias as follows.
(F.1)
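In generic notation (with for the estimator and for its target, symbols chosen here only for illustration), (F.1) is the standard bias-variance decomposition:

\[
\mathbb{E}\bigl[(\hat h(x) - h(x))^2\bigr]
\;=\; \operatorname{Var}\bigl(\hat h(x)\bigr)
\;+\; \bigl(\mathbb{E}[\hat h(x)] - h(x)\bigr)^2 .
\]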
For the variance part, denote by . We have
(F.2)
where is a positive constant that relies on . For the bias part, since
(F.3)
and it is well known that
(F.4)
where is a positive constant that relies only on . We only need to consider . By definition,
(F.5)
By McDiarmid’s inequality, it follows that
where is the maximal value of in . Take , i.e., the least integer that is not smaller than . Then we have
(F.6)
where is a universal positive constant. Similarly, we can show that for some positive constant . Next, we investigate . Denote by and the left boundary and the right boundary of the interval . By Lemma 20, it follows that
where , , and are positive constants that only depend on . We consider the summation of the error terms in , :
-
(I)
where are positive constants that only depend on .
-
(II)
where is a positive constant that only depends on .
-
(III)
where only depends on .
Denote the summation of the error terms by , which satisfies , where only depends on . Similarly, the summation of the error terms in , , has the same rate. Now we consider
(F.7)
where are positive constants that only depend on . Equation (i) uses Fubini's theorem; Inequality (ii) applies the mean value theorem to the function , where is a constant; Inequality (iii) holds since , thus is bounded and attains its maximum when ; Inequality (iv) is obtained via integration by parts. Similarly, has the same rate as (F.7). Putting (F.5), (F.6) and (F.7) together, we have
(F.8)
where is some constant that only depends on . Combining Inequalities (F.1), (F.2), (F.3), (F.4) and (F.8), it follows that
The minimal risk is no larger than , which is attained when , . Here are positive constants that only depend on and .
Appendix G Proof of Theorem 9
Proof Note that for the point mass function , we have additional information that holds only in the discrete case, not in the density case:
We follow the proof of Theorem 8 and can show that
When , we have . Here where do not depend on , and .
Appendix H Proof of Theorem 10
We first define a few concepts. Let denote the interval , where and in our setup. We set
Any finite increasing sequence with generates a partition of into intervals , . When no confusion arises, we write for , for , and so on. Define a functional , for positive , by
Before proving Theorem 10, we state the needed lemma, which is adapted from Lemma 1 in [4]. The proof also follows that of Lemma 1 in [4].
Lemma 21
Let be some partition of , an absolutely continuous distribution function, the corresponding empirical c.d.f based on and , the related Grenander estimators defined on the associated intervals. Define
with and to be the respective derivatives of and . Then
In the proof, there are many similar notations that might be confusing. We list all of them below for clarity and ease of reference.
-
•
Let , i.e. the c.d.f of . Let be the derivative of .
-
•
For any interval , , . and correspond to and with in our setup.
-
•
and are the respective least concave majorants of and conditional on the interval . Let and be the derivatives of and . , and , correspond to , and , .
-
•
For any partition , let be the derivative of
-
•
, , , .
-
•
For any partition , . .
-
•
for some ; .
-
•
For an interval ,
-
–
Let and be the number of ’s and ’s that fall in , respectively.
-
–
Define and to be the respective conditional c.d.f’s of the ’s and ’s that fall in , and their derivatives.
-
–
Define , .
-
–
Define and to be the respective least concave majorants of and conditional on . Let and be the derivatives of and , respectively.
-
–
Proof
We want to show that
where are universal constants. Then, the convergence of only hinges on the characteristics of . For example, when is a decreasing function on such that for some and , then Proposition 4 of [4] shows that
(H.1)
where and . It implies that has an convergence rate at , where , are some positive constants.
By Lemma 21 it is sufficient to prove that for any partition of , we have
where is the derivative of . This certainly holds if, for any sub-interval of , the following inequality holds:
In order to prove this inequality, we assume there are ’s and ’s falling in the interval respectively. Here, has a binomial distribution and has a binomial distribution , where is the derivative of the c.d.f . Then with ,
(H.2)
The only difficulty comes from the last term. Define and to be the respective conditional c.d.f’s of the ’s and ’s that fall in , and their derivatives. Then
because the joint distribution of the ’s falling in given is the same as the distribution of i.i.d variables from . If is the uniform c.d.f on , then
Here Equation holds because this is an equivalent expression of the total variation for with a non-increasing derivative and with a flat density. Equation holds because
-
•
for any and the equality occurs when the derivative of changes.
-
•
attains the maximum at a point which corresponds to a change of the derivative of .
where . Plugging this result into inequality (H.2) and applying the Cauchy-Schwarz inequality, we have
By the Cauchy-Schwarz inequality, it follows that
Note that
where Inequality (c) holds because of Proposition 3 with (note that , and ). Thus, for , it follows that
Then the initial claim is completed by letting and . Finally, as in Proposition 4 of [4], we construct the partition of , where is the integer such that , , , and for . Since , there is only a finite number of intervals. It can be shown that when is the smallest integer larger than , this partition yields inequality (H.1).
Appendix I Proof of local asymptotics
I.1 Local Inference of when is absolutely continuous
Theorem 5 shows that the empirical CDF is a consistent estimator of the population CDF. We also want to understand the uncertainty of the empirical CDF. The Komlós-Major-Tusnády (KMT) approximation shows that can be approximated by a sequence of Brownian bridges [17]. This result can be extended to the empirical CDF based on 's; see Theorem 22. The proof is similar to that of Theorem 5, splitting into and ; the former can be bounded by the original KMT approximation and the latter by Proposition 3. Another quantity that needs care is , which is in fact the sum of two independent Gaussian random variables with zero mean and variance bounded by .
Theorem 22 (Local inference of )
Suppose corresponds to a density on with . There exists a sequence of Brownian bridges such that
for all positive integers and all , where , , and are positive constants.
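A small simulation illustrates the kind of approximation asserted by Theorem 22 in the classical setting: the scaled empirical process evaluated on a grid behaves like a Brownian bridge. The sample size, grid, and sampling distribution below are illustrative choices, not those of the theorem.

# Illustration of the Brownian-bridge approximation to the empirical process
# (classical KMT-type behaviour); sample size and grid are arbitrary choices.
import numpy as np

rng = np.random.default_rng(2)
N = 5000
x = rng.uniform(size=N)                      # F is the uniform CDF, so F(t) = t on [0, 1]
grid = np.linspace(0, 1, 201)
emp = np.searchsorted(np.sort(x), grid, side="right") / N
emp_process = np.sqrt(N) * (emp - grid)      # sqrt(N) * (F_N - F) on the grid

# One realization of a Brownian bridge on the same grid, for comparison.
increments = rng.normal(scale=np.sqrt(np.diff(grid)))
W = np.concatenate(([0.0], np.cumsum(increments)))   # Brownian motion
bridge = W - grid * W[-1]                            # bridge: B(t) = W(t) - t * W(1)

print("sup |empirical process| :", np.abs(emp_process).max())
print("sup |Brownian bridge|   :", np.abs(bridge).max())   # same order of magnitude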
I.2 Proof of Theorem 12
This proof is adapted from [40]. Define
Then with probability one, we have the switching relation
(I.1)
By the relation (I.1), we have
From the definition of , it follows that
By Theorem 22,
where is a sequence of Brownian bridges constructed on the same space as the . So the limiting distribution of is the same as that of the location of the maximum of the process . Note that is concave and linear in ; then
and
Hence the location of the maximum of behaves asymptotically as that of
where is a standard Brownian bridge in . Thus,
by the definition of . That completes the proof of Part (A). The proof of Part (B) follows in a similar manner.
Appendix J Proof of Theorem 13
Proof Let , , , , be the output of Algorithm 1 with the input , and , . The corresponding estimator of is denoted by . For simplicity, we consider , i.e., the case where there is only the decreasing part and the flat part. For the general case where , we just need to focus on .
By Theorem 10, we know that when for some positive constants and that only depend on . For , it is easy to see that , where is a residual term with for some positive constant , because the estimator of the flat region converges at a square-root rate [40]. In other words, it is unlikely to find a desired gap beyond .
If , we select the desired . Otherwise, define
Then , it follows that
When , it implies that
Since the distance between and reduces at a cubic-root rate, it follows that for some positive constant . So is the desired cutoff. Finally, we have that
where the last inequality holds because and the boundedness of together imply . Here , are two positive constants that only depend on .
Appendix K Proof of Theorem 14
Proof Given the interior point in the flat region, the left decreasing part and the right increasing part are disentangled. Therefore, we only need to consider the left side, and the right side can be proven in the same way. A necessary condition for being identified as feasible for the change-point-gap constraint in Algorithm 1 is that
where and . It is easy to see that
and
where and ; and are residual terms with for some universal positive constant , by McDiarmid's inequality. So the necessary condition becomes
(K.1)
with for some constant that only depends on , and . If and the constraint is violated at , then for any the constraint is violated at with high probability, since and ( and are both in the flat region). Therefore, to see whether , we only need to investigate the smallest with that is in the search space of Algorithm 1. By Theorem 12, when , it follows that
By the necessary condition (K.1), then asymptotically we have
where the last approximation holds because is the smallest search candidate with , so it is very close to .
Appendix L Proof of Theorem 15
Proof Let , , , , be the output of Algorithm 1 with the input , and , . The corresponding estimator of is denoted by . Let be the log-likelihood function associated with the optimization problem (4.2.2), and . It follows that
where . From the relations between total variation, Kullback-Leibler divergence and the distance,
we have
where the last inequality uses Jensen's inequality. On the other hand, it follows that
Then the problem is reduced to bounding . From Theorem 13, we know that when , there exist and in the neighborhoods of and , respectively, such that the resulting estimator satisfies for some positive constants and . Together with Lemma 23, we conclude the desired result.
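For reference, a standard relation of the kind invoked above between total variation and Kullback-Leibler divergence is Pinsker's inequality, stated here in generic notation (the paper may use an equivalent form):

\[
\mathrm{TV}(P, Q) \;:=\; \tfrac{1}{2}\int |p - q|\, d\mu
\;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)} .
\]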
Lemma 23
Let be the solution by Algorithm 1 with input and and the corresponding . Assume and . If , then is a necessary condition for
where are four constants depending on , , and ; , .
Proof For simplicity, we consider , i.e., the case where there is only the decreasing part and the flat part. For a general case where , we just need to focus on . If , the distance between and is
where and are two positive constants that only depend on , , since . Inequality (a) and equation (c) are obtained using Theorem 10. Equation (b) makes use of the assumption that the right-hand side is a flat region. Then, if there exists such that , we have for some positive constant .
Next, we investigate the case when . Denote , , , and . It is easy to check that
Using the convergence of the Grenander estimator, it follows that
If there exists such that , we have
If the output of Algorithm 1 is True, then for some positive constant . It must then follow that , where . So far, we have proven the lemma for the left-hand side; the right-hand side can be proven similarly.
Appendix M More simulation studies
Besides the linear valley model specified in Section 5.1.1, we also consider a non-linear model and a misspecified model.
For the non-linear model, we replace the left linear part in the linear valley model with an unnormalized decreasing function , , and the right linear part with an unnormalized increasing function , . Here is the density of a Beta distribution with parameters and . In this case, we use . Applying Ucut to the synthetic data generated by this model, we observe results similar to those from the linear valley model; see Figures 8, 9 and 10. This indicates that Ucut can detect the cutoff within a satisfactory range as long as the underlying model satisfies Assumption 1 and Assumption 2. Ucut is particularly useful when the middle part is "uniform" and it is not easy to tell apart samples from the alternative distribution and samples from the null distribution.
For the misspecified model, we replace the left linear part with an unnormalized unimodal function , , and the right linear part with an unnormalized unimodal function , . The results can be found in Figures 11, 12 and 13. We observe that the estimated is not satisfactory until reaches . So in the other experiments, we use . We find that although the variance of is larger than that for the correctly specified model, the detected cutoff tends to be larger than the truth. This implies that even if the model does not align with Assumption 1 and Assumption 2, Ucut is still useful because it is conservative.
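To make the simulation setup concrete, the sketch below generates data from a toy "linear valley" U-shaped mixing density and then draws binomial observations from it. The valley boundaries, mixture weights, sample size and binomial size are illustrative placeholders, not the exact settings of Section 5.1.1 or of this appendix.

# Toy generator: U-shaped (linear valley) mixing density followed by binomial sampling.
import numpy as np

rng = np.random.default_rng(3)

def sample_linear_valley(size, a=0.2, b=0.8, weights=(0.2, 0.4, 0.4)):
    # Mixture: Uniform(0,1) flat floor + decreasing triangle on [0, a] + increasing triangle on [b, 1].
    w_flat, w_left, w_right = weights
    comp = rng.choice(3, size=size, p=[w_flat, w_left, w_right])
    p = np.empty(size)
    p[comp == 0] = rng.uniform(0, 1, (comp == 0).sum())
    p[comp == 1] = rng.triangular(0.0, 0.0, a, (comp == 1).sum())   # decreasing on [0, a]
    p[comp == 2] = rng.triangular(b, 1.0, 1.0, (comp == 2).sum())   # increasing on [b, 1]
    return p

n_genes, n_rounds = 2000, 1000              # sample size and binomial size (illustrative)
p = sample_linear_valley(n_genes)
cfr = rng.binomial(n_rounds, p) / n_rounds  # simulated capture frequency rates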
[Figures 8-13: simulation results of Ucut under the non-linear valley model (Figures 8-10) and the misspecified model (Figures 11-13).]