
Bernstein Polynomial Model for Grouped Continuous Data

Zhong Guan
Department of Mathematical Sciences
Indiana University South Bend
South Bend, IN 46634, USA
email: zguan@iusb.edu

Abstract

Grouped data are commonly encountered in applications. All data from a continuous population are grouped due to rounding of the individual observations. In this paper the Bernstein polynomial model is proposed as an approximate model for estimating a univariate density function based on grouped data. The coefficients of the Bernstein polynomial, as the mixture proportions of beta distributions, can be estimated using an EM algorithm. The optimal degree of the Bernstein polynomial can be determined using a change-point estimation method. The rate of convergence of the proposed density estimate to the true density is proved to be almost parametric by an acceptance-rejection argument used in the Monte Carlo method. The proposed method is compared with some existing methods in a simulation study and is applied to the Chicken Embryo Data.

Keywords: Acceptance-rejection method; Approximate model; Bernstein type polynomials; Beta mixture; Change-point; Density estimation; Grouped data; Model selection; Nonparametric model; Parametrization; Smoothing.

1 Introduction

In real-world applications of statistics, many data are provided in the form of frequencies of observations in some fixed, mutually exclusive intervals; such data are called grouped data. Strictly speaking, all data from a population with a continuous distribution are grouped due to rounding of the individual observations (Hall, 1982). The EM algorithm has been used to deal with grouped data (Dempster et al., 1977). McLachlan & Jones (1988) introduced the EM algorithm for fitting mixture models to grouped data (see also Jones & McLachlan, 1990). Under a parametric model, let $f(x;\theta)$ be the probability density function (PDF) of the underlying distribution with an unknown parameter $\theta$. The maximum likelihood estimate (MLE) of $\theta$ can be obtained from grouped data and is shown to be consistent and asymptotically normal (see, for example, Lindley, 1950; Tallis, 1967). The parametric MLE is sensitive to model misspecification and outliers. The minimum Hellinger distance estimate (MHDE) of the parameter using grouped continuous data is both robust against data contamination and asymptotically efficient (Beran, 1977a, b). Parametric methods for grouped data require evaluating integrals, which makes the computation expensive. To lower the computational cost, Lin & He (2006) proposed the approximate minimum Hellinger distance estimate (AMHDE) for grouped data by truncating the data and replacing the probabilities of the class intervals with first-order Taylor expansions. Clearly their idea also works for the MLE based on grouped data.

Under the nonparametric setting, the underlying PDF $f$ is unspecified. Based on grouped data, $f$ can be estimated by the empirical density, the relative frequency distribution, which is actually a discrete probability mass function. The kernel density estimate (Rosenblatt, 1956, 1971) can be applied to grouped data (see, for example, Linton & Whang, 2002; Jang & Loh, 2010; Minoiu & Reddy, 2014). The effects of rounding, truncating, and grouping of the data on the kernel density estimate have been studied, among others, by Hall (1982), Scott & Sheather (1985), and Titterington (1983). However, the expectation of the kernel density estimate is the convolution of $f$ and the kernel scaled by the bandwidth. It is crucial and difficult to select an appropriate bandwidth to balance the bias and the variance. Many authors have proposed methods for data-based bandwidth selection over the years; the reader is referred to the survey by Jones et al. (1996) for details and more references. Another drawback of the kernel density is its boundary effect. Methods of boundary-effect correction have been studied, among others, by Rice (1984) and Jones (1993).

“All models are wrong” (Box, 1976), so all parametric models are subject to model misspecification. The normal model, for instance, is approximate, justified by the central limit theorem. Goodness-of-fit tests and other methods for selecting a parametric model introduce additional errors into the statistical inference.

Any continuous function can be approximated by polynomials. Vitale (1975) proposed to estimate the PDF $f$ by estimating the coefficients $f(i/m)$ of the Bernstein polynomial (Bernstein, 1912) $\mathbb{B}f(t)=\sum_{i=0}^{m}f(i/m)\binom{m}{i}t^{i}(1-t)^{m-i}$ by $\hat{f}(i/m)=(m+1)\{F_{n}[(i+1)/(m+1)]-F_{n}[i/(m+1)]\}$, $i=0,1,\ldots,m$, where $F_{n}$ is the empirical distribution function of $x_{1},\ldots,x_{n}$. Since then, many authors have applied the Bernstein polynomial in statistics in similar ways (see Guan, 2014, for more references). These and the kernel methods are neither model-based nor maximum likelihood methods, so they are not efficient. The estimated Bernstein polynomial $\widehat{\mathbb{B}f}(t)=\sum_{i=0}^{m}\hat{f}(i/m)\binom{m}{i}t^{i}(1-t)^{m-i}$ targets $\mathbb{B}f(t)$. It is known that the best rate of convergence of $\mathbb{B}f(t)$ to $f(t)$ is at most $\mathcal{O}(m^{-1})$ even if $f$ has continuous second or higher order derivatives on $[0,1]$. Buckland (1992) proposed a density estimation with polynomials using grouped and ungrouped data with the help of some specified parametric models.

Thanks to a result of Lorentz (1963), there exists a Bernstein type polynomial $f_{m}(t;\bm{p})\equiv\sum_{i=0}^{m}p_{mi}\beta_{mi}(t)$, where $p_{mi}\geqslant 0$ and $\beta_{mi}(t)=(m+1)\binom{m}{i}t^{i}(1-t)^{m-i}$, $i=0,\ldots,m$, whose rate of convergence to $f(t)$ is at least $\mathcal{O}(m^{-r/2})$ if $f$ has a continuous $r$-th derivative on $[0,1]$ and $r\geqslant 2$. Such a polynomial is called a polynomial with “positive coefficients” in the literature of polynomial approximation. Guan (2014) introduced the Bernstein polynomial model $f_{m}(t;\bm{p})$ as a globally valid approximate parametric model for any underlying continuous density function with support $[0,1]$ and proposed a change-point method for selecting an optimal degree $m$. It has been shown that the rate of convergence to zero of the mean integrated squared error (MISE) of the maximum likelihood estimate of the density can be nearly parametric, $\mathcal{O}(n^{-1+\epsilon})$ for all $\epsilon>0$. This method does not suffer from the boundary effect.

If the support of $f$ is different from $[0,1]$ or even infinite, then we can choose an appropriate (truncation) interval $[a,b]$ so that $\int_{a}^{b}f(x)\,dx\approx 1$ (see Guan, 2014). We then treat $[a,b]$ as the support of $f$, use the linearly transformed data $y_{i}=(x_{i}-a)/(b-a)$ in $[0,1]$ to obtain an estimate $\hat{g}$ of the PDF $g$ of the $y_{i}$'s, and estimate $f$ by $\hat{f}(x)=\hat{g}\{(x-a)/(b-a)\}/(b-a)$. In the rest of this paper, we assume that the density $f$ has support $[0,1]$.
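
In code, this back-transformation is a one-liner. The sketch below (the names `g_hat`, `a`, and `b` are ours, purely for illustration) wraps a density estimate $\hat{g}$ on $[0,1]$ into an estimate of $f$ on $[a,b]$:

```python
def back_transform_density(g_hat, a, b):
    """Turn a density estimate g_hat on [0, 1] into a density estimate
    on [a, b]: f_hat(x) = g_hat((x - a) / (b - a)) / (b - a)."""
    return lambda x: g_hat((x - a) / (b - a)) / (b - a)
```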

The Bernstein polynomial model $f_{m}(t;\bm{p})$ is a finite mixture of the beta densities $\beta_{mi}(t)$ of beta$(i+1,m-i+1)$, $i=0,\ldots,m$, with mixture proportions $\bm{p}=(p_{m0},\ldots,p_{mm})$. It has been shown that the Bernstein polynomial model can be used to fit an ungrouped dataset and has the advantages of smoothness, robustness, and efficiency over traditional methods such as the empirical distribution and the kernel density estimate (Guan, 2014). Because these beta densities and their integrals are fully specified and free of unknown parameters, this structure of $f_{m}(t;\bm{p})$ is convenient: it allows the grouped data to be approximately modeled by a mixture of $m+1$ specific discrete distributions. So the infinite dimensional “parameter” $f$ is approximately described by a finite dimensional parameter $\bm{p}$. This is similar to the nonparametric likelihood in the sense that there the underlying distribution function is approximated by a step function whose jumps at the observations are the parameters.
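
As a concrete illustration of this parametrization, the following minimal Python sketch (our own helper, assuming numpy and scipy; not code from the paper) evaluates $f_{m}(t;\bm{p})$ as a mixture of beta densities:

```python
import numpy as np
from scipy.stats import beta


def bernstein_density(t, p):
    """Evaluate f_m(t; p) = sum_j p_{mj} beta_{mj}(t), where beta_{mj} is
    the beta(j + 1, m - j + 1) density and m = len(p) - 1."""
    p = np.asarray(p, dtype=float)
    m = len(p) - 1
    j = np.arange(m + 1)
    # Rows: evaluation points; columns: the m + 1 beta components.
    B = beta.pdf(np.atleast_1d(t)[:, None], j + 1, m - j + 1)
    return B @ p
```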

Due to the closeness of $f_{m}(t;\bm{p})$ to $f(t)$, by the acceptance-rejection argument for generating pseudorandom numbers, almost all the observations in a sample from $f(t)$ can be used as if they were from $f_{m}(t;\bm{p})$. It will be shown in this paper that the maximizer of the likelihood based on the approximate model $f_{m}(t;\bm{p})$ targets the $\bm{p}_{0}$ that makes $f_{m}(t;\bm{p}_{0})$ the unique best approximation of $f$. This acceptance-rejection argument can be used to prove other asymptotic results under an approximate model assumption.

In this paper we study the asymptotic properties of the Bernstein polynomial density estimate based on grouped data, with ungrouped raw data as a special case of grouping. A stronger result than that of Guan (2014) about the rate of convergence of the proposed density estimate based on ungrouped raw data is proved using a different argument. We also compare the proposed estimate with existing methods such as the kernel density, the parametric MLE, and the MHDE via a simulation study.

The paper is organized as follows. The Bernstein polynomial model for grouped data is introduced and proved to be nested in Section 2; the EM algorithm for finding the approximate maximum likelihood estimates of the mixture proportions is derived in the same section. Asymptotic results on the convergence rate of the proposed density estimate are given in Section 3. The methods for determining a lower bound for the model degree $m$ based on the estimated mean and variance, and for choosing the optimal degree $m$, are described in Section 4. In Section 5, the proposed methods are compared with some existing competitors through Monte Carlo experiments and illustrated by the Chicken Embryo Data. The proofs of the theorems are relegated to the Appendix.

2 Likelihood for grouped data and EM algorithm

2.1 The Bernstein polynomial model

Let $C^{(r)}[0,1]$ be the class of functions that have a continuous $r$-th derivative $f^{(r)}$ on $[0,1]$. Like the normal model being backed up by the central limit theorem, the Bernstein polynomial model is supported by the following mathematical result, which is a consequence of Theorem 1 of Lorentz (1963). We denote the $m$-simplex by

$$\mathbb{S}_{m}=\Big\{\bm{p}=(p_{m0},\ldots,p_{mm})^{\mathrm{T}}:p_{mj}\geqslant 0,\ \sum_{j=0}^{m}p_{mj}=1\Big\}.$$
Theorem 1.

If $f\in C^{(r)}[0,1]$, $\int_{0}^{1}f(t)\,dt=1$, and $f(t)\geqslant\delta>0$, then there exists a sequence of Bernstein type polynomials $f_{m}(t;\bm{p})=\sum_{i=0}^{m}p_{mi}\beta_{mi}(t)$ with $\bm{p}\in\mathbb{S}_{m}$ such that

$$|f(t)-f_{m}(t)|\leqslant C(r,\delta,f)\Delta_{m}^{r}(t),\quad 0\leqslant t\leqslant 1, \qquad (1)$$

where $\Delta_{m}(t)=\max\{m^{-1},\sqrt{t(1-t)/m}\}$ and the constant $C(r,\delta,f)$ depends only on $r$, $\delta$, $\max_{t}|f(t)|$, and $\max_{t}|f^{(i)}(t)|$, $i=2,\ldots,r$.

The uniqueness of the best approximation was proved by Passow (1977). Let $f$ be the density of the underlying distribution with support $[0,1]$. We approximate $f$ using the Bernstein polynomial $f_{m}(t;\bm{p})=\sum_{j=0}^{m}p_{mj}\beta_{mj}(t)$, where $\bm{p}\in\mathbb{S}_{m}$.

Define $\mathcal{D}_{m}=\big\{f_{m}(t;\bm{p})=\sum_{j=0}^{m}p_{mj}\beta_{mj}(t):\bm{p}\in\mathbb{S}_{m}\big\}$. Guan (2014) showed that $\mathcal{D}_{m}\subset\mathcal{D}_{m+r}$ for all $r\geqslant 1$. So the Bernstein polynomial model $f_{m}(t;\bm{p})$ of degree $m$ is nested in all Bernstein polynomial models of larger degrees.

Let $[0,1]$ be partitioned into $N$ class intervals $\{(t_{i-1},t_{i}]:i=1,\ldots,N\}$, where $0=t_{0}<t_{1}<\cdots<t_{N}=1$. The probability that a random observation falls in the $i$-th interval is approximately

$$\theta_{mi}(\bm{p})=\int_{t_{i-1}}^{t_{i}}f_{m}(t;\bm{p})\,dt=\sum_{j=0}^{m}a_{ij}p_{mj}, \qquad (2)$$

where $a_{ij}=\mathcal{B}_{mj}(t_{i})-\mathcal{B}_{mj}(t_{i-1})$, $i=1,\ldots,N$, $\mathcal{B}_{mj}(t)$ is the cumulative distribution function (CDF) of beta$(j+1,m-j+1)$, $j=0,1,\ldots,m$, and

$$\sum_{i=1}^{N}\theta_{mi}(\bm{p})=\sum_{j=0}^{m}p_{mj}=1.$$

So the probability $\theta_{mi}(\bm{p})$ is a mixture of the specific components $\{a_{i0},\ldots,a_{im}\}$ with unknown proportions $\bm{p}=(p_{m0},\ldots,p_{mm})$.

By Theorem 2.1 of Guan (2014), the above Bernstein polynomial model (2) of degree $m$ for grouped data is nested in the model of degree $m+r$, i.e., for all $r\geqslant 1$,

$$\theta_{mi}=\int_{t_{i-1}}^{t_{i}}\sum_{j=0}^{m}p_{mj}\beta_{mj}(t)\,dt=\int_{t_{i-1}}^{t_{i}}\sum_{j=0}^{m+r}p_{m+r,j}\beta_{m+r,j}(t)\,dt=\theta_{m+r,i},\quad i=1,\ldots,N.$$

2.2 The Bernstein likelihood for grouped data

In many applications, only the grouped data $\{n_{i},(t_{i-1},t_{i}]:i=1,\ldots,N\}$ are available, where $0=t_{0}<t_{1}<\cdots<t_{N}=1$, $n_{i}=\#\{j\in\{1,\ldots,n\}:x_{j}\in(t_{i-1},t_{i}]\}$, $i=1,\ldots,N$, and $x_{1},\ldots,x_{n}$ is a random sample from a population having a continuous density $f(x)$ on $[0,1]$. Our goal is to estimate the unknown PDF $f$. The loglikelihood of $(n_{1},\ldots,n_{N})$ is approximately

$$\ell_{G}(\bm{p})=\sum_{i=1}^{N}n_{i}\log\Big[\sum_{j=0}^{m}p_{mj}\{\mathcal{B}_{mj}(t_{i})-\mathcal{B}_{mj}(t_{i-1})\}\Big], \qquad (3)$$

where the mixture proportions $\bm{p}=(p_{m0},\ldots,p_{mm})$ are subject to the feasibility constraint $\bm{p}\in\mathbb{S}_{m}$. For the ungrouped raw data $x_{1},\ldots,x_{n}$, the loglikelihood is

$$\ell_{R}(\bm{p})=\sum_{i=1}^{n}\log\Big\{\sum_{j=0}^{m}p_{mj}\beta_{mj}(x_{i})\Big\}. \qquad (4)$$

If we take the rounding error into account when the observations are rounded to the nearest value using the round half up tie-breaking rule, then

$$\ell_{G}(\bm{p})=\sum_{i=-\infty}^{\infty}n_{i}\log\Big[\sum_{j=0}^{m}p_{mj}\{\mathcal{B}_{mj}(t_{i})-\mathcal{B}_{mj}(t_{i-1})\}\Big], \qquad (5)$$

where $t_{i}=(i+1/2)/K$, $i=0,\pm 1,\pm 2,\ldots$, and $K$ is a positive integer such that each observation is rounded to $i/K$ for some integer $i$.

We shall call the maximizers $\tilde{\bm{p}}_{G}$ and $\hat{\bm{p}}_{R}$ of $\ell_{G}(\bm{p})$ and $\ell_{R}(\bm{p})$ the maximum Bernstein likelihood estimates (MBLE's) of $\bm{p}$ based on grouped and raw data, respectively, and call $\tilde{f}_{B}(t)=f_{m}(t;\tilde{\bm{p}}_{G})$ and $\hat{f}_{B}(t)=f_{m}(t;\hat{\bm{p}}_{R})$ the corresponding MBLE's of $f$.

It should also be noted that as $N\to\infty$ and $\max\{\Delta t_{i}\equiv t_{i}-t_{i-1}:i=1,\ldots,N\}\to 0$, the loglikelihood (3) reduces to the loglikelihood (4) for ungrouped raw data. Specifically, $\lim_{\max\Delta t_{i}\to 0}\{\ell_{G}(\bm{p})-\sum_{i=1}^{N}n_{i}\log\Delta t_{i}\}=\ell_{R}(\bm{p})$.

If the underlying PDF $f$ is approximately $f_{m}(t;\bm{p})=\sum_{i=0}^{m}p_{mi}\beta_{mi}(t)$ for some $m\geqslant 0$, then the distribution of the grouped data $(n_{1},\ldots,n_{N})$ is approximately multinomial with probability mass function

$$P(W_{1}=n_{1},\ldots,W_{N}=n_{N})=\binom{n}{n_{1},\ldots,n_{N}}\prod_{i=1}^{N}\theta_{mi}^{n_{i}}(\bm{p}).$$

The MLE’s of θmi\theta_{mi}’s are θ^mi=nin\hat{\theta}_{mi}=\frac{n_{i}}{n}, i=1,,N.i=1,\ldots,N. So the MLE’s p^mj\hat{p}_{mj} of pmjp_{mj} satisfy the equations j=0maijp^mj=nin\sum_{j=0}^{m}a_{ij}\hat{p}_{mj}=\frac{n_{i}}{n}, i=1,,Ni=1,\ldots,N, and (p^m0,,p^mm)𝕊m(\hat{p}_{m0},\ldots,\hat{p}_{mm})\in\mathbb{S}_{m}. Because p^m0=1j=1mp^mj\hat{p}_{m0}=1-\sum_{j=1}^{m}\hat{p}_{mj}, p^mj\hat{p}_{mj} satisfy equatins

j=1m(aijai0)p^mj=ninai0,i=1,,N,\sum_{j=1}^{m}(a_{ij}-a_{i0})\hat{p}_{mj}=\frac{n_{i}}{n}-a_{i0},\quad i=1,\ldots,N,

and inequality constraints p^mj0\hat{p}_{mj}\geqslant 0, j=1,,mj=1,\ldots,m, and j=1mp^mj1\sum_{j=1}^{m}\hat{p}_{mj}\leqslant 1. It seems not easy to algebraically solve the above system of equations with inequality constraints. In the next section, we shall use an EM-algorithm to find the MLE of 𝒑\bm{p}.

2.3 The EM Algorithm

Let $\delta_{ij}=1$ or $0$ according to whether or not $x_{i}$ came from beta$(j+1,m-j+1)$, $i=1,\ldots,n$, $j=0,\ldots,m$. We denote by $\bm{z}_{i}=(z_{i1},\ldots,z_{iN})^{\mathrm{T}}$ the vector of indicators $z_{ij}=I\{x_{i}\in(t_{j-1},t_{j}]\}$, $i=1,\ldots,n$, $j=1,\ldots,N$. Then the conditional expectation of $\delta_{ij}$ given $\bm{z}_{i}$ is

$$r_{j}(\bm{p},\bm{z}_{i})\equiv\mathrm{E}_{\bm{p}}(\delta_{ij}\,|\,\bm{z}_{i})=\frac{p_{mj}\prod_{l=1}^{N}\{\mathcal{B}_{mj}(t_{l})-\mathcal{B}_{mj}(t_{l-1})\}^{z_{il}}}{\sum_{h=0}^{m}p_{mh}\prod_{l=1}^{N}\{\mathcal{B}_{mh}(t_{l})-\mathcal{B}_{mh}(t_{l-1})\}^{z_{il}}}.$$

Note that $\sum_{j=0}^{m}\delta_{ij}=1$ and that the observed counts are $n_{l}=\sum_{i=1}^{n}\sum_{j=0}^{m}\delta_{ij}z_{il}=\sum_{i=1}^{n}z_{il}$, $l=1,\ldots,N$. The complete-data likelihood of the $\delta_{ij}$ and $\bm{z}_{i}$ is

$$\mathscr{L}_{c}(\bm{p})=\prod_{i=1}^{n}\prod_{j=0}^{m}\Big[p_{mj}\prod_{l=1}^{N}\{\mathcal{B}_{mj}(t_{l})-\mathcal{B}_{mj}(t_{l-1})\}^{z_{il}}\Big]^{\delta_{ij}}.$$

The loglikelihood is then

$$\ell_{c}(\bm{p})=\sum_{i=1}^{n}\sum_{j=0}^{m}\delta_{ij}\Big[\log p_{mj}+\sum_{l=1}^{N}z_{il}\log\{\mathcal{B}_{mj}(t_{l})-\mathcal{B}_{mj}(t_{l-1})\}\Big].$$

E-Step. Given $\tilde{\bm{p}}^{(s)}$, we have

$$Q(\bm{p},\tilde{\bm{p}}^{(s)})=\mathrm{E}_{\tilde{\bm{p}}^{(s)}}\{\ell_{c}(\bm{p})\,|\,\bm{z}\}=\sum_{i=1}^{n}\sum_{j=0}^{m}r_{j}(\tilde{\bm{p}}^{(s)},\bm{z}_{i})\Big[\log p_{mj}+\sum_{l=1}^{N}z_{il}\log\{\mathcal{B}_{mj}(t_{l})-\mathcal{B}_{mj}(t_{l-1})\}\Big].$$

M-Step. Maximizing $Q(\bm{p},\tilde{\bm{p}}^{(s)})$ with respect to $\bm{p}$ subject to the constraint $\bm{p}\in\mathbb{S}_{m}$ yields, for $s=0,1,\ldots$,

$$\tilde{p}_{mj}^{(s+1)}=\frac{1}{n}\sum_{i=1}^{n}r_{j}(\tilde{\bm{p}}^{(s)},\bm{z}_{i})=\frac{1}{n}\sum_{l=1}^{N}\frac{n_{l}\tilde{p}_{mj}^{(s)}\{\mathcal{B}_{mj}(t_{l})-\mathcal{B}_{mj}(t_{l-1})\}}{\sum_{h=0}^{m}\tilde{p}_{mh}^{(s)}\{\mathcal{B}_{mh}(t_{l})-\mathcal{B}_{mh}(t_{l-1})\}}. \qquad (6)$$

Starting with initial values $\tilde{p}_{mj}^{(0)}$, $j=0,\ldots,m$, we can use this iterative formula to obtain the maximum Bernstein likelihood estimate $\tilde{\bm{p}}_{G}$. If the ungrouped raw data $x_{1},\ldots,x_{n}$ are available, the iteration (Guan, 2014) reduces to

$$\hat{p}_{mj}^{(s+1)}=\frac{1}{n}\sum_{i=1}^{n}\frac{\hat{p}_{mj}^{(s)}\beta_{mj}(x_{i})}{\sum_{h=0}^{m}\hat{p}_{mh}^{(s)}\beta_{mh}(x_{i})},\quad j=0,\ldots,m;\quad s=0,1,\ldots. \qquad (7)$$
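
For concreteness, the following Python sketch implements iteration (6) for grouped data (a minimal implementation of ours, assuming numpy and scipy are available; the uniform starting value $\tilde{p}_{mj}^{(0)}=1/(m+1)$ is one convenient choice):

```python
import numpy as np
from scipy.stats import beta


def em_grouped(counts, t, m, max_iter=5000, tol=1e-10):
    """Maximum Bernstein likelihood estimate of p = (p_{m0},...,p_{mm})
    from grouped data via iteration (6).

    counts -- class frequencies n_1, ..., n_N
    t      -- class boundaries 0 = t_0 < t_1 < ... < t_N = 1
    m      -- model degree
    """
    counts = np.asarray(counts, dtype=float)
    t = np.asarray(t, dtype=float)
    n = counts.sum()
    j = np.arange(m + 1)
    # A[i, j] = B_{mj}(t_{i+1}) - B_{mj}(t_i): the beta(j+1, m-j+1)
    # probability of the i-th class interval.
    A = np.diff(beta.cdf(t[:, None], j + 1, m - j + 1), axis=0)
    p = np.full(m + 1, 1.0 / (m + 1))              # uniform starting value
    for _ in range(max_iter):
        theta = A @ p                              # theta_{mi}(p) of (2)
        p_new = p * (A.T @ (counts / theta)) / n   # iteration (6)
        if np.abs(p_new - p).max() < tol:
            p = p_new
            break
        p = p_new
    return p, counts @ np.log(A @ p)               # MBLE and loglikelihood (3)
```

The raw-data iteration (7) is the same computation with the interval probabilities $A$ replaced by the matrix of densities $\beta_{mj}(x_{i})$ and the counts replaced by ones.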

The following theorem shows the convergence of the EM algorithm and is proved in the Appendix.

Theorem 2.

(i) Assume $\hat{p}_{mj}^{(0)}>0$, $j=0,1,\ldots,m$, and $\sum_{j=0}^{m}\hat{p}_{mj}^{(0)}=1$. Then as $s\to\infty$, $\hat{\bm{p}}^{(s)}$ converges to the unique maximizer $\hat{\bm{p}}_{R}$ of $\ell_{R}(\bm{p})$. (ii) Assume $\tilde{p}_{mj}^{(0)}>0$, $j=0,1,\ldots,m$, and $\sum_{j=0}^{m}\tilde{p}_{mj}^{(0)}=1$. Then as $s\to\infty$, $\tilde{\bm{p}}^{(s)}$ converges to the unique maximizer $\tilde{\bm{p}}_{G}$ of $\ell_{G}(\bm{p})$.

3 Rate of Convergence of the Density Estimate

In this section we state results about the convergence rate of the density estimates; the proofs are given in the Appendix. Unlike most asymptotic results about the maximum likelihood method, which assume exact parametric models, we show our results under the approximate model $f_{m}(t;\bm{p})=\sum_{j=0}^{m}p_{mj}\beta_{mj}(t)$. For a given $\bm{p}_{0}$, we define the norm

$$\|\bm{p}\|_{B}^{2}=\int\frac{\{f_{m}(t;\bm{p})\}^{2}}{f_{m}(t;\bm{p}_{0})}\,dt.$$

The squared distance between $\bm{p}$ and $\bm{p}_{0}$ with respect to the norm $\|\cdot\|_{B}$ is

$$\|\bm{p}-\bm{p}_{0}\|_{B}^{2}=\int\frac{\{f_{m}(t;\bm{p})-f_{m}(t;\bm{p}_{0})\}^{2}}{f_{m}(t;\bm{p}_{0})}\,dt.$$

With the aid of the acceptance-rejection argument used for generating pseudorandom numbers in the Monte Carlo method, we have the following lemma, which may be of independent interest.

Lemma 3.

Let $f\in C^{(2k)}[0,1]$ for some positive integer $k$, $f(t)\geqslant\delta>0$, and let $f_{m}(t)=f_{m}(t;\bm{p}_{0})$ be the unique best approximation of degree $m$ for $f$. Then a sample $x_{1},\ldots,x_{n}$ from $f$ can be arranged so that the first $\nu_{m}$ observations can be treated as if they were from $f_{m}$. Moreover, for all $\bm{p}$ such that $f_{m}(x_{j};\bm{p})\geqslant\delta'>0$, $j=1,\ldots,n$,

$$\ell_{R}(\bm{p})=\sum_{i=1}^{n}\log f_{m}(x_{i};\bm{p})=\tilde{\ell}_{R}(\bm{p})+R_{mn}, \qquad (8)$$

where $\tilde{\ell}_{R}(\bm{p})=\sum_{i=1}^{\nu_{m}}\log f_{m}(x_{i};\bm{p})$, and

$$\nu_{m}=n-\mathcal{O}(nm^{-k})-\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s., \qquad (9)$$

$$|R_{mn}|=\mathcal{O}(nm^{-k})+\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s. \qquad (10)$$
Remark 3.1.

So $\tilde{\ell}_{R}(\bm{p})$ is an “exact” likelihood of $x_{1},\ldots,x_{\nu_{m}}$, while $\ell_{R}(\bm{p})=\sum_{i=1}^{n}\log f_{m}(x_{i};\bm{p})$ is an approximate likelihood of the complete data $x_{1},\ldots,x_{n}$, which can be viewed as a slightly contaminated sample from $f_{m}$. The maximizer $\hat{\bm{p}}$ of $\ell_{R}(\bm{p})$ approximately maximizes $\tilde{\ell}_{R}(\bm{p})$. Hence $f_{m}(t;\hat{\bm{p}})$ targets $f_{m}(t;\bm{p}_{0})$, the best approximation of $f$.
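
The relabeling in Lemma 3 is constructive and easy to simulate. The following sketch (our illustration, not part of the proof) marks which observations of a sample from $f$ can be treated as a sample from $f_{m}$; the constant $c_{m}=\sup_{t}f_{m}(t)/f(t)$ is approximated on a grid:

```python
import numpy as np


def thin_to_fm(x, f, fm, seed=0):
    """Acceptance-rejection relabeling: given a sample x from density f,
    return a boolean mask of the observations that can be treated as if
    they were drawn from the approximating density fm."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(1e-6, 1 - 1e-6, 1001)
    cm = np.max(fm(grid) / f(grid))        # c_m = sup fm / f, approximated
    keep = rng.uniform(size=len(x)) <= fm(x) / (cm * f(x))
    return keep                            # nu_m = keep.sum() observations
```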

For density estimation based on the raw data we have the following result.

Theorem 4.

Suppose that the PDF $f\in C^{(2k)}[0,1]$ for some positive integer $k$, $f(t)\geqslant\delta>0$, and $m=\mathcal{O}(n^{1/k})$. As $n\to\infty$, with probability one the maximum value of $\ell_{R}(\bm{p})$ is attained by some $\hat{\bm{p}}_{R}$ in the interior of $\mathbb{B}_{m}(r_{n})=\{\bm{p}\in\mathbb{S}_{m}:\|\bm{p}-\bm{p}_{0}\|_{B}^{2}\leqslant r_{n}^{2}\}$, where $r_{n}=\log n/\sqrt{n}$ and $\bm{p}_{0}$ makes $f_{m}(\cdot;\bm{p}_{0})$ the unique best approximation of degree $m$.

Theorem 5.

Suppose that the PDF $f\in C^{(2k)}[0,1]$ for some positive integer $k$, $f(t)\geqslant\delta>0$, and $0<c_{0}n^{1/k}\leqslant m\leqslant c_{1}n^{1/k}<\infty$. Then there is a positive constant $C$ such that

$$\mathrm{E}\int\frac{\{f_{m}(t;\hat{\bm{p}}_{R})-f(t)\}^{2}}{f(t)}\,dt\leqslant C\frac{(\log n)^{2}}{n}. \qquad (11)$$

Because $f$ is bounded, there is a positive constant $C$ such that

$$\mathrm{MISE}(\hat{f}_{B})=\mathrm{E}\int\{f_{m}(t;\hat{\bm{p}}_{R})-f(t)\}^{2}\,dt\leqslant C\frac{(\log n)^{2}}{n}. \qquad (12)$$

Note that (11) is a stronger result than (12), which gives an almost parametric rate of convergence for the MISE. Guan (2014) showed a similar result under a different set of conditions. The best parametric rate is $\mathcal{O}(n^{-1})$, which can be attained by parametric density estimates under some regularity conditions.

For $\theta_{mi}(\bm{p})=\sum_{j=0}^{m}p_{mj}\{\mathcal{B}_{mj}(t_{i})-\mathcal{B}_{mj}(t_{i-1})\}$, we define the norm

$$\|\bm{p}\|_{G}^{2}=\sum_{i=1}^{N}\frac{\theta_{mi}^{2}(\bm{p})}{\theta_{mi}(\bm{p}_{0})}.$$

The squared distance between $\bm{p}$ and $\bm{p}_{0}$ with respect to the norm $\|\cdot\|_{G}$ is

$$\|\bm{p}-\bm{p}_{0}\|_{G}^{2}=\sum_{i=1}^{N}\frac{\{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})\}^{2}}{\theta_{mi}(\bm{p}_{0})}.$$

By the mean value theorem, we have

$$\mathcal{B}_{mj}(t_{i})-\mathcal{B}_{mj}(t_{i-1})=\int_{t_{i-1}}^{t_{i}}\beta_{mj}(t)\,dt=\beta_{mj}(t_{mij}^{*})\Delta t_{i},\quad i=1,\ldots,N;\ j=0,\ldots,m,$$

where $\Delta t_{i}=t_{i}-t_{i-1}$ and $t_{mij}^{*}\in[t_{i-1},t_{i}]$. Thus $\|\bm{p}-\bm{p}_{0}\|_{G}^{2}$ is a Riemann sum:

$$\|\bm{p}-\bm{p}_{0}\|_{G}^{2}=\sum_{i=1}^{N}\psi_{m}(t_{mij}^{*})\Delta t_{i}\approx\int_{0}^{1}\psi_{m}(t)\,dt=\|\bm{p}-\bm{p}_{0}\|_{B}^{2}, \qquad (13)$$

where

$$\psi_{m}(t)=\frac{\{\sum_{j=0}^{m}(p_{mj}-p_{mj}^{(0)})\beta_{mj}(t)\}^{2}}{\sum_{j=0}^{m}p_{mj}^{(0)}\beta_{mj}(t)}.$$

For grouped data we have the following.

Theorem 6.

Suppose that the PDF $f\in C^{(2k)}[0,1]$ for some positive integer $k$, $f(t)\geqslant\delta>0$, and $m=\mathcal{O}(n^{1/k})$. As $n\to\infty$, with probability one the maximum value of $\ell_{G}(\bm{p})$ is attained at some $\tilde{\bm{p}}_{G}$ in the interior of $\mathbb{B}_{m}(r_{n})=\{\bm{p}\in\mathbb{S}_{m}:\|\bm{p}-\bm{p}_{0}\|_{G}^{2}\leqslant r_{n}^{2}\}$, where $r_{n}=\log n/\sqrt{n}$ and $\bm{p}_{0}$ makes $f_{m}(\cdot;\bm{p}_{0})$ the unique best approximation.

For the relationship between the squared distances $\|\bm{p}-\bm{p}_{0}\|_{B}^{2}$ and $\|\bm{p}-\bm{p}_{0}\|_{G}^{2}$, we have the following result.

Theorem 7.

Suppose that the PDF $f\in C^{(2k)}[0,1]$ for some positive integer $k$, and $f(t)\geqslant\delta>0$. Let $\bm{p}_{0}\in\mathbb{S}_{m}$ be the one that makes $f_{m}(\cdot;\bm{p}_{0})$ the unique best approximation of $f$. Then for all $\bm{p}\in\mathbb{S}_{m}$, we have

$$\|\bm{p}-\bm{p}_{0}\|_{B}^{2}=\|\bm{p}-\bm{p}_{0}\|_{G}^{2}+\mathcal{O}(m^{4}\max_{i}\Delta t_{i}^{2}).$$

For a grouped-data-based estimate $\tilde{\bm{p}}_{G}$, the rate of convergence of $\|\tilde{\bm{p}}_{G}-\bm{p}_{0}\|_{G}^{2}$ to zero is $\mathcal{O}((\log n)^{2}/n)$. However, the rate of convergence of $\|\tilde{\bm{p}}_{G}-\bm{p}_{0}\|_{B}^{2}$ to zero also depends on that of $\max_{i}\Delta t_{i}$. For equal-width classes, $\Delta t_{i}=1/N$, and $N=n^{1/2+2/k}$, we have $\|\tilde{\bm{p}}_{G}-\bm{p}_{0}\|_{B}^{2}=\mathcal{O}((\log n)^{2}/n)+\mathcal{O}(m^{4}/n^{1+4/k})$. Thus $\|\tilde{\bm{p}}_{G}-\bm{p}_{0}\|_{B}^{2}=\mathcal{O}((\log n)^{2}/n)$ if $m=\mathcal{O}(n^{1/k})$. If $k$ is large, then $N\approx\sqrt{n}$.

Theorem 8.

Suppose that the PDF $f\in C^{(2k)}[0,1]$ for some positive integer $k$, $f(t)\geqslant\delta>0$, and $0<c_{0}n^{1/k}\leqslant m\leqslant c_{1}n^{1/k}<\infty$. Then we have

$$\mathrm{E}\int\frac{\{f_{m}(t;\tilde{\bm{p}}_{G})-f(t)\}^{2}}{f(t)}\,dt=\mathcal{O}((\log n)^{2}/n)+\mathcal{O}(m^{4}\max_{i}\Delta t_{i}^{2}). \qquad (14)$$

Also, because $f$ is bounded,

$$\mathrm{MISE}(\tilde{f}_{B})=\mathrm{E}\int\{f_{m}(t;\tilde{\bm{p}}_{G})-f(t)\}^{2}\,dt=\mathcal{O}((\log n)^{2}/n)+\mathcal{O}(m^{4}\max_{i}\Delta t_{i}^{2}). \qquad (15)$$

4 Model Degree Selection

Guan (2014) showed that the model degree $m$ is bounded below approximately by $m_{b}=\max\{1,\lceil\mu(1-\mu)/\sigma^{2}-3\rceil\}$. Based on the grouped data, the lower bound $m_{b}$ can be estimated by $\tilde{m}_{b}=\max\{1,\lceil\tilde{\mu}(1-\tilde{\mu})/\tilde{\sigma}^{2}-3\rceil\}$, where

$$\tilde{\mu}=\frac{1}{n}\sum_{i=1}^{N}n_{i}t_{i}^{*},\quad\tilde{\sigma}^{2}=\frac{1}{n-1}\sum_{i=1}^{N}n_{i}(t_{i}^{*}-\tilde{\mu})^{2}=\frac{1}{n-1}\Big(\sum_{i=1}^{N}n_{i}t_{i}^{*2}-n\tilde{\mu}^{2}\Big),\quad t_{i}^{*}=\frac{t_{i-1}+t_{i}}{2},\quad i=1,\ldots,N.$$
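
A minimal Python sketch of this estimate (our own helper, assuming numpy; `counts` and `t` are the class frequencies and boundaries) is:

```python
import numpy as np


def degree_lower_bound(counts, t):
    """Estimate the lower bound m_b for the model degree from grouped
    data, using the class midpoints t_i* = (t_{i-1} + t_i) / 2."""
    counts = np.asarray(counts, dtype=float)
    t = np.asarray(t, dtype=float)
    n = counts.sum()
    mid = (t[:-1] + t[1:]) / 2                         # midpoints t_i*
    mu = (counts * mid).sum() / n                      # mu-tilde
    var = (counts * (mid - mu) ** 2).sum() / (n - 1)   # sigma-tilde squared
    return max(1, int(np.ceil(mu * (1 - mu) / var - 3)))
```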

Due to overfitting, the model degree $m$ cannot be arbitrarily large. With the estimated $\tilde{m}_{b}$, we choose a proper set of nonnegative consecutive integers $M=\{m_{0},m_{0}+1,\ldots,m_{0}+k\}$ with $m_{0}<\tilde{m}_{b}$. Then we can estimate an optimal degree $m$ using the change-point method proposed by Guan (2014). For each $m_{i}=m_{0}+i$ we use the EM algorithm to find the MBLE $\tilde{\bm{p}}_{m_{i}}$ and calculate $\ell_{i}=\ell(\tilde{\bm{p}}_{m_{i}})$. Let $y_{i}=\ell_{i}-\ell_{i-1}$, $i=1,\ldots,k$. The $y_{i}$'s are nonnegative because the Bernstein polynomial models are nested. Guan (2014) suggested treating $y_{1},\ldots,y_{\tau}$ as exponentials with mean $\mu_{1}$ and $y_{\tau+1},\ldots,y_{k}$ as exponentials with mean $\mu_{0}$, where $\mu_{1}>\mu_{0}$, so that $\tau$ is a change point and $m_{\tau}$ is the optimal degree, and using the change-point detection method for the exponential model (see Section 1.4 of Csörgő & Horváth, 1997) to find a change-point estimate $\hat{\tau}$. Then we estimate the optimal $m$ by $\hat{m}=m_{\hat{\tau}}$. Specifically, $\hat{\tau}=\arg\max_{1\leqslant\tau\leqslant k}R(\tau)$, where the likelihood ratio at $\tau$ is

$$R(\tau)=k\log\Big(\frac{\ell_{k}-\ell_{0}}{k}\Big)-\tau\log\Big(\frac{\ell_{\tau}-\ell_{0}}{\tau}\Big)-(k-\tau)\log\Big(\frac{\ell_{k}-\ell_{\tau}}{k-\tau}\Big),\quad\tau=1,\ldots,k.$$

If $R(\tau)$ has multiple maximizers, we choose the smallest one as $\hat{\tau}$.
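
The degree selection can be coded directly from the profile loglikelihoods $\ell_{0},\ldots,\ell_{k}$. The sketch below (our implementation of the $R(\tau)$ rule above; it assumes strictly increasing loglikelihoods, as the nesting gives in practice) returns $\hat{m}=m_{\hat{\tau}}$:

```python
import numpy as np


def select_degree(ell, degrees):
    """Change-point estimate of the optimal degree: ell[i] is the
    loglikelihood at degree degrees[i] = m_0 + i, i = 0, ..., k."""
    ell = np.asarray(ell, dtype=float)
    k = len(ell) - 1
    R = np.empty(k)
    for tau in range(1, k + 1):
        last = (0.0 if tau == k
                else (k - tau) * np.log((ell[k] - ell[tau]) / (k - tau)))
        R[tau - 1] = (k * np.log((ell[k] - ell[0]) / k)
                      - tau * np.log((ell[tau] - ell[0]) / tau)
                      - last)
    tau_hat = int(np.argmax(R)) + 1    # argmax returns the smallest maximizer
    return degrees[tau_hat]
```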

5 Simulation Study and Example

5.1 Simulation

The distributions used for generating pseudorandom numbers and the parametric models used for density estimation are as follows.

  • (i) Uniform(0,1): the uniform distribution with $\mu=1/2$ and $\sigma^{2}=1/12$, a special beta distribution beta(1,1). The parametric model is the beta distribution beta($\alpha$, 1).

  • (ii) Exp(1): the exponential distribution with mean $\mu=1$ and variance $\sigma^{2}=1$, truncated to the interval $[a,b]=[0,4]$. The parametric model is the exponential distribution with mean $\mu=\theta$.

  • (iii) Pareto(4, 0.5): the Pareto distribution with shape parameter $\alpha=4$ and scale parameter $x_{0}=0.5$, which is treated as a known parameter. The mean and variance are, respectively, $\mu=\alpha x_{0}/(\alpha-1)=2/3$ and $\sigma^{2}=x_{0}^{2}\alpha/[(\alpha-1)^{2}(\alpha-2)]=1/18$. We truncate this distribution to the interval $[a,b]=[x_{0},\mu+4\sigma]\approx[0.5,1.6095]$. The parametric model is Pareto($\alpha$, 0.5).

  • (iv) NN($k$): the nearly normal distribution of $\bar{u}_{k}=(u_{1}+\cdots+u_{k})/k$ with $u_{1},\ldots,u_{k}$ independent uniform(0,1) random variables. The lower bound is $m_{b}=3(k-1)$. We used the normal distribution N($\mu$, $\sigma^{2}$) as the parametric model.

  • (v) N(0,1): the standard normal distribution truncated to the interval $[a,b]=[-4,4]$. The parametric model is N($\mu$, $\sigma^{2}$).

  • (vi) Logistic(0, 0.5): the logistic distribution with location $\mu=0$ and scale $s=0.5$, so that $\sigma^{2}=(s\pi)^{2}/3=\pi^{2}/12$. We truncate this distribution to the interval $[a,b]=[\mu-4\sigma,\mu+4\sigma]\approx[-2.9619,2.9619]$. The parametric model is Logistic($\mu$, $s$).

Except for the normal distribution, the above parametric models were chosen for the simulation because their CDF's have closed-form expressions, so the expensive numerical integrations can be avoided for the MHDE and the MLE.

From each distribution we generated 500 samples of size $n=50,100,200$, and $500$, and the corresponding grouped data using $N=5,10,10$, and $20$ equal-width class intervals, respectively. The model degree $m$ was selected using the change-point method from $\{1,2,\ldots,40\}$.
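
As an example of how one replicate of such grouped data can be produced (a sketch under our own choices of generator and seed; the truncation is done by rejection), consider case (ii):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n, N = 0.0, 4.0, 100, 10                # Exp(1) truncated to [0, 4]
x = rng.exponential(1.0, size=10 * n)
x = x[(x >= a) & (x <= b)][:n]                # rejection-truncate, keep n
y = (x - a) / (b - a)                         # transform to [0, 1]
counts, t = np.histogram(y, bins=N, range=(0.0, 1.0))  # N equal-width classes
```

The resulting `counts` and boundaries `t` can be fed directly to the EM sketch of Section 2.3.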

From the results of Guan (2014) we see that the Bernstein polynomial method is much better than the kernel density for ungrouped data, and the AMHDE is an approximation of the MHDE. So we only compare the kernel density, the MLE, the MHDE, and the proposed MBLE. For the kernel density estimate $\hat{f}_{K}(x)=\frac{1}{nh}\sum_{i=1}^{n}K\big(\frac{x-x_{i}}{h}\big)$, we used the normal kernel $K(x)=e^{-x^{2}/2}/\sqrt{2\pi}$ and the commonly recommended method of Sheather & Jones (1991) to choose the bandwidth $h$. Note that $\mathrm{E}[\hat{f}_{K}(x)]=\frac{1}{h}\int_{-\infty}^{\infty}K\big(\frac{x-y}{h}\big)f(y)\,dy=(K_{h}*f)(x)$, the convolution of $f$ and the scaled kernel $K_{h}(\cdot)=K(\cdot/h)/h$. So no matter how the bandwidth $h$ is chosen, there is always a trade-off between the bias and the variance.

Table 1: Estimated mean and variance of $\hat{m}$, and mean integrated squared errors (MISE's) of the kernel density $\hat{f}_{K}$, the MLE $\hat{f}_{\mathrm{ML}}$, the MHDE $\hat{f}_{\mathrm{MHD}}$, and the proposed maximum Bernstein likelihood estimate (MBLE) $\hat{f}_{B}$, based on 500 simulated samples of size $n$ grouped by $N$ equal-width class intervals.

Distribution   E($\hat{m}$)  var($\hat{m}$)  MISE($\hat{f}_{B}$)  MISE($\hat{f}_{K}$)  MISE($\hat{f}_{\mathrm{ML}}$)  MISE($\hat{f}_{\mathrm{MHD}}$)

$n=50$, $N=5$
Beta(1,1)      14.91   3.95  0.3898   2.5722  0.0193  0.0222
Exp(1)         14.56   9.14  0.0447   0.7502  0.0018  0.0034
Pareto         12.29  29.01  0.7855  19.6962  0.0392  0.0793
NN(4)          12.04  15.42  0.0556   8.5549  0.0653  0.1030
N(0,1)         14.25  11.10  0.0007   0.1603  0.0008  0.0012
Logistic       13.79  19.27  0.0022   0.2689  0.0014  0.0024

$n=100$, $N=10$
Beta(1,1)      12.96  47.08  0.0972   0.0558  0.0096  0.0118
Exp(1)          9.42  37.24  0.0091   0.2377  0.0011  0.0027
Pareto          8.51   9.78  0.1009  18.6903  0.0222  0.0613
NN(4)          10.24   6.72  0.0217   1.4357  0.0232  0.0411
N(0,1)         13.77   6.58  0.0004   0.0357  0.0003  0.0005
Logistic       12.02  13.42  0.0012   0.0992  0.0007  0.0012

$n=200$, $N=10$
Beta(1,1)      15.88  48.93  0.0741   1.2338  0.0045  0.0051
Exp(1)          9.24  41.36  0.0068   0.4547  0.0007  0.0017
Pareto          8.63   9.85  0.0661  34.3823  0.0123  0.0323
NN(4)          10.11   3.92  0.0128   4.0956  0.0125  0.0213
N(0,1)         14.08   5.74  0.0003   0.0907  0.0002  0.0003
Logistic       13.15  14.07  0.0007   0.1936  0.0004  0.0006

$n=500$, $N=20$
Beta(1,1)      10.18  49.84  0.0192   0.0226  0.0021  0.0024
Exp(1)          4.40   3.55  0.0006   0.2331  0.0005  0.0015
Pareto          8.67   1.94  0.0181  16.0924  0.0083  0.0253
NN(4)           9.97   2.75  0.0059   0.5994  0.0058  0.0110
N(0,1)         14.41   2.94  0.0001   0.0329  0.0001  0.0001
Logistic       13.26   5.15  0.0003   0.0905  0.0001  0.0003

Table 1 presents the simulation results for the density estimates. As expected, the proposed Bernstein polynomial method performs much better than the kernel density method and is comparable to the two parametric methods. Table 1 also shows the estimated mean and variance of the optimal model degree selected by the change-point method. The performance of the estimated optimal model degree $\hat{m}$ appears satisfactory.

It should be noted that the density $\psi_{k}(t)$ of NN($k$) satisfies $\psi_{k}\in C^{(k-2)}[0,1]$ but $\psi_{k}\notin C^{(k-1)}[0,1]$ for $k\geqslant 2$. In fact, for $k\geqslant 2$, $\psi_{k}(t)$ is a piecewise polynomial of degree $k-1$ defined on the pieces $[i/k,(i+1)/k)$, $i=0,1,\ldots,k-1$. Except for NN($k$), all the other population densities have continuous derivatives of all orders on their supports. In the simulation, we used the normal distribution as the parametric model for NN(4). Here both the normal and the Bernstein polynomial are approximate models; in fact, in most applications the normal distribution is an approximate model justified by the central limit theorem. We also did a simulation on the goodness of fit of the normal distribution to samples from NN(4): we generated 5,000 samples of size $n$ from NN(4) and ran the Kolmogorov-Smirnov test on each sample. For $n=50,100,200$, and $500$, the averages of the $p$-values are, respectively, 0.7884, 0.7875, and 0.7470; the numbers of $p$-values among the 5,000 that are smaller than 0.05 are, respectively, 3, 2, 0, and 2. So the normal distribution would be accepted as the parametric model for NN(4) almost all the time. The performance of the proposed MBLE for samples from NN(4) is even better than that of the MLE when the sample size $n$ is small.

5.2 The Chicken Embryo Data

The chicken embryo data contain the numbers of eggs hatched on each day during the 21 days of the incubation period. The hatching times ($n=43$) are treated as grouped by intervals of equal width of one day. The data were first studied by Jassim et al. (1996). Kuurman et al. (2003) and Lin & He (2006) also analyzed the data using the MHDE, in addition to other methods assuming some parametric mixture models, including the Weibull model; the latter authors used the AMHDE to fit a Weibull mixture model to the data. As shown below, the density estimated by the proposed method is close to the parametric MLE.

Applying the proposed method of this paper, we truncated the distribution using $[a,b]=[0,21]$ and selected the optimal model degree $\hat{m}=13$ from $\{2,3,\ldots,50\}$ using the change-point method.

Figure 1: Upper left panel: the loglikelihood $\ell(m)$; upper right panel: the likelihood ratio $R(\tau)$ for a change point in the increments of the loglikelihood $\ell(m)$; lower panel: the density estimates for the chicken embryo data.

Figure 1 displays the loglikelihood $\ell(m)$, the likelihood ratio $R(\tau)$ for change points, and the histogram of the grouped data together with the kernel density $\hat{f}_{K}$, the MLE $\hat{f}_{\mathrm{ML}}$, the MHDE $\hat{f}_{\mathrm{MHD}}$, the AMHDE $\hat{f}_{\mathrm{AMHD}}$, and the proposed maximum Bernstein likelihood estimate (MBLE) $\hat{f}_{B}$. From this figure we see that the proposed MBLE $\hat{f}_{B}$ and the parametric MLE $\hat{f}_{\mathrm{ML}}$ are similar and fit the data reasonably well. The kernel density is clearly not a good estimate. The AMHDE $\hat{f}_{\mathrm{AMHD}}$ seems to overestimate $f$ near 0.

6 Concluding Remarks

The proposed density estimate $f_{m}(t;\hat{\bm{p}})$ has considerable advantages over the kernel density: (i) it is more efficient than the kernel density because it is an approximate maximum likelihood estimate; (ii) it is easier to select an optimal model degree $m$ than to select an optimal bandwidth $h$ for the kernel density; (iii) the proposed density estimate $f_{m}(t;\hat{\bm{p}})$ targets $f_{m}(t;\bm{p}_{0})$, the best approximation of $f$ for each $m$, while the kernel density $\hat{f}_{K}$ targets $f*K_{h}$, the convolution of $f$ and $K_{h}$.

Another contribution of this paper is the introduction of the acceptance-rejection argument for proving asymptotic results under an approximate model assumption, which, to the knowledge of the author, is new.

Appendix A Proofs

A.1 Proof of Theorem 1

Proof.

We define $\Lambda_{r}=\Lambda_{r}(\delta,M_{0},M_{2},\ldots,M_{r})$ as the class of functions $\phi(t)$ on $[0,1]$ whose first $r$ derivatives $\phi^{(i)}$, $i=1,\ldots,r$, exist and are continuous, with the properties

$$\delta\leqslant\phi(t)\leqslant M_{0},\quad|\phi^{(i)}(t)|\leqslant M_{i},\quad 2\leqslant i\leqslant r,\quad 0\leqslant t\leqslant 1, \qquad (16)$$

for some $\delta>0$ and $M_{i}>0$, $i=0,2,\ldots,r$. A polynomial of degree $m$ with “positive coefficients” is defined by Lorentz (1963) as $\phi_{m}(t)=\sum_{i=0}^{m}c_{i}\binom{m}{i}t^{i}(1-t)^{m-i}$, where $c_{i}\geqslant 0$, $i=0,\ldots,m$. Theorem 1 of Lorentz (1963) shows that, for a given integer $r\geqslant 0$, $\delta>0$, and positive constants $M_{i}$, $i=0,2,3,\ldots,r$, there exists a constant $C_{r}=C_{r}(\delta,M_{0},M_{2},\ldots,M_{r})$ such that for each function $\phi\in\Lambda_{r}(\delta,M_{0},M_{2},\ldots,M_{r})$ one can find a sequence $\phi_{m}$, $m=1,2,\ldots$, of polynomials with positive coefficients of degree $m$ such that

$$|\phi(t)-\phi_{m}(t)|\leqslant C_{r}\Delta_{m}^{r}(t)\,\omega(\Delta_{m}(t),\phi^{(r)}),\quad 0\leqslant t\leqslant 1, \qquad (17)$$

where $\omega(\delta,\phi)=\sup_{|x-y|\leqslant\delta}|\phi(x)-\phi(y)|$.

Under the conditions of Theorem 1, $M_{0}=\max_{t}f(t)$ and $M_{i}=\max_{t}|f^{(i)}(t)|$, $i=2,\ldots,r$, are finite, and $\omega(\Delta_{m}(t),f^{(r)})\leqslant 2M_{r}$. So by the above result of Lorentz (1963) there is a sequence $\phi_{m}(t)=\sum_{i=0}^{m}c_{i}\binom{m}{i}t^{i}(1-t)^{m-i}$, $m=0,1,\ldots$, of polynomials with positive coefficients of degree $m$ such that

$$|f(t)-\phi_{m}(t)|\leqslant 2C_{r}M_{r}\Delta_{m}^{r}(t),\quad 0\leqslant t\leqslant 1. \qquad (18)$$

It is clear that

$$\Delta_{m}(t)\leqslant m^{-1/2}. \qquad (19)$$

Since $\int_{0}^{1}f(t)\,dt=1$, we have, by (18) and (19),

$$\Big|1-\sum_{i=0}^{m}c_{i}/(m+1)\Big|\leqslant 2C_{r}M_{r}m^{-r/2}. \qquad (20)$$

Let $p_{mi}=c_{i}/\sum_{j=0}^{m}c_{j}$, $i=0,\ldots,m$, and $f_{m}(t)=\sum_{i=0}^{m}p_{mi}\beta_{mi}(t)$. It then follows easily from (18) and (20) that (1) is true. ∎

A.2 Proof of Theorem 2

We will prove the assertion (i) only. The assertion (ii) can be proved similarly.

Proof.

The matrix of second derivatives of $\ell_{R}(\bm{p})$ is

$$H(\bm{p})=\frac{\partial^{2}\ell_{R}(\bm{p})}{\partial\bm{p}\partial\bm{p}^{\mathrm{T}}}=-\sum_{i=1}^{n}\frac{\bm{\beta}_{m}(x_{i})\bm{\beta}_{m}^{\mathrm{T}}(x_{i})}{\{\sum_{j=0}^{m}p_{mj}\beta_{mj}(x_{i})\}^{2}},$$

where $\bm{\beta}_{m}(x)=(\beta_{m0}(x),\ldots,\beta_{mm}(x))^{\mathrm{T}}$.

For any $\bm{u}=(u_{0},\ldots,u_{m})^{\mathrm{T}}\in\mathbb{R}^{m+1}$, as $n\to\infty$,

$$\frac{1}{n}\bm{u}^{\mathrm{T}}\frac{\partial^{2}\ell_{R}(\bm{p})}{\partial\bm{p}\partial\bm{p}^{\mathrm{T}}}\bm{u}\to-\mathrm{E}\left\{\frac{\sum_{j=0}^{m}u_{j}\beta_{mj}(X)}{\sum_{j=0}^{m}p_{mj}\beta_{mj}(X)}\right\}^{2}.$$

Clearly, $\beta_{m0}(t),\ldots,\beta_{mm}(t)$ are linearly independent, nonvanishing functions on $[0,1]$. So, with probability one, $H(\bm{p})$ is negative definite for all $\bm{p}$ and all sufficiently large $n$. By Theorem 4.2 of Redner & Walker (1984), as $s\to\infty$, $\hat{\bm{p}}^{(s)}$ converges to the unique maximizer of $\ell_{R}(\bm{p})$. ∎

A.3 Proof of Lemma 3

Proof.

By (1) and (19) we know that, under the conditions of the lemma, $f_{m}(t)=f_{m}(t;\bm{p}_{0})$ converges to $f(t)$ at a rate of at least $\mathcal{O}(m^{-k})$, i.e.,

$$f(t)=f_{m}(t)+\mathcal{O}(m^{-k}), \qquad (21)$$

and, furthermore, since $f(t)\geqslant\delta$,

$$c_{m}=\sup_{t}\frac{f_{m}(t)}{f(t)}=1+\mathcal{O}(m^{-k}), \qquad (22)$$

uniformly in $m$.

Let $u_{1},\ldots,u_{n}$ be a sample from the uniform(0,1) distribution. By the acceptance-rejection method in simulation (Ross, 2013), for each $i$, if $u_{i}\leqslant f_{m}(x_{i})/\{c_{m}f(x_{i})\}$, then $x_{i}$ can be treated as if it were from $f_{m}$. Assume that the data $x_{1},\ldots,x_{n}$ have been rearranged so that the first $\nu_{m}$ observations can be treated as if they were from $f_{m}$. By the law of the iterated logarithm we have

$$\nu_{m}=\sum_{i=1}^{n}I\left(u_{i}\leqslant\frac{f_{m}(x_{i})}{c_{m}f(x_{i})}\right)=n-\mathcal{O}(nm^{-k})-\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s.$$

So we have

$$\ell_{R}(\bm{p})=\sum_{i=1}^{n}\log f_{m}(x_{i};\bm{p})=\tilde{\ell}_{R}(\bm{p})+R_{mn},$$

where $\tilde{\ell}_{R}(\bm{p})=\sum_{i=1}^{\nu_{m}}\log f_{m}(x_{i};\bm{p})$ is an “almost complete” likelihood and

$$R_{mn}=\sum_{u_{i}>f_{m}(x_{i})/\{c_{m}f(x_{i})\}}\log f_{m}(x_{i};\bm{p})=\sum_{i=\nu_{m}+1}^{n}\log f_{m}(x_{i};\bm{p}).$$

Because $0<\delta\leqslant f(t)\leqslant M_{0}$, we have $0<\delta'\leqslant f_{m}(x_{i};\bm{p})\leqslant M_{0}'$ for some constants $\delta'$ and $M_{0}'$. By the law of the iterated logarithm,

$$|R_{mn}|\leqslant\max\{|\log\delta'|,|\log M_{0}'|\}\sum_{i=1}^{n}I\left(u_{i}>\frac{f_{m}(x_{i})}{c_{m}f(x_{i})}\right)=\max\{|\log\delta'|,|\log M_{0}'|\}(n-\nu_{m})=\mathcal{O}(nm^{-k})+\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s. \qquad (23)$$

The proportion of the observations that can be treated as if they were from $f_{m}$ is

$$\frac{\nu_{m}}{n}=1-\mathcal{O}(m^{-k})-\mathcal{O}\big(\sqrt{m^{-k}\log\log n/n}\big),\quad a.s.$$

So the complete data $x_{1},\ldots,x_{n}$ can be viewed as a slightly contaminated sample from $f_{m}$. ∎

A.4 Proof of Theorem 4

Proof.

The Taylor expansion of $\log f_{m}(x_{j};\bm{p})$ around $\log f_{m}(x_{j};\bm{p}_{0})$ yields that, for $\bm{p}\in\mathbb{B}_{m}(r_{n})$,

$$\tilde{\ell}_{R}(\bm{p})=\sum_{j=1}^{\nu_{m}}\log f_{m}(x_{j};\bm{p})=\tilde{\ell}_{R}(\bm{p}_{0})+\sum_{j=1}^{\nu_{m}}\left[\frac{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})}{f_{m}(x_{j};\bm{p}_{0})}-\frac{1}{2}\frac{\{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})\}^{2}}{\{f_{m}(x_{j};\bm{p}_{0})\}^{2}}\right]+\tilde{R}_{mn},$$

where $\tilde{R}_{mn}=o(nr_{n}^{2})$, a.s.

Let $\bm{p}$ be a point on the boundary of $\mathbb{B}_{m}(r_{n})$, i.e., $\|\bm{p}-\bm{p}_{0}\|_{B}^{2}=r_{n}^{2}$. By the law of the iterated logarithm we have

$$\sum_{j=1}^{\nu_{m}}\frac{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})}{f_{m}(x_{j};\bm{p}_{0})}=\mathcal{O}(r_{n}\sqrt{n\log\log n}),\quad a.s.,$$

and that there exists $\eta>0$ such that

$$\sum_{j=1}^{\nu_{m}}\frac{\{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})\}^{2}}{\{f_{m}(x_{j};\bm{p}_{0})\}^{2}}=\eta nr_{n}^{2}+\mathcal{O}(r_{n}^{2}\sqrt{n\log\log n}).$$

Therefore we have

$$\tilde{\ell}_{R}(\bm{p})=\tilde{\ell}_{R}(\bm{p}_{0})+\sum_{j=1}^{\nu_{m}}\left[\frac{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})}{f_{m}(x_{j};\bm{p}_{0})}-\frac{1}{2}\frac{\{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})\}^{2}}{\{f_{m}(x_{j};\bm{p}_{0})\}^{2}}\right]+o(nr_{n}^{2})=\tilde{\ell}_{R}(\bm{p}_{0})-\frac{1}{2}\eta nr_{n}^{2}+\mathcal{O}(r_{n}^{2}\sqrt{n\log\log n})+\mathcal{O}(r_{n}\sqrt{n\log\log n})+o(nr_{n}^{2}),\quad a.s.$$

Since $m=\mathcal{O}(n^{1/k})$, $nm^{-k}=o(nr_{n}^{2})$. So there exists $\eta'>0$ such that $\ell_{R}(\bm{p})\leqslant\ell_{R}(\bm{p}_{0})-\eta'nr_{n}^{2}=\ell_{R}(\bm{p}_{0})-\eta'(\log n)^{2}$. Since $\partial^{2}\ell_{R}(\bm{p})/\partial\bm{p}\partial\bm{p}^{\mathrm{T}}<0$, the maximum value of $\ell_{R}(\bm{p})$ is attained by some $\hat{\bm{p}}_{R}$ in the interior of $\mathbb{B}_{m}(r_{n})$. ∎

A.5 Proof of Theorem 5

Proof.

It is easy to see that (11) and (12) follow from Theorem 4, (22), the boundedness of $f$, and the triangle inequality. ∎

A.6 Proof of Theorem 6

Proof.

By (21) we have

$$\theta_{i}=\theta_{mi}(\bm{p}_{0})+\mathcal{O}(m^{-k}\Delta t_{i}), \qquad (24)$$

where

$$\theta_{i}=\int_{t_{i-1}}^{t_{i}}f(x)\,dx,\quad\theta_{mi}(\bm{p}_{0})=\int_{t_{i-1}}^{t_{i}}f_{m}(x;\bm{p}_{0})\,dx,\quad i=1,\ldots,N.$$

Because $0<\delta\leqslant f(t)\leqslant M_{0}$, we have

$$c=\max_{1\leqslant i\leqslant N}\frac{\theta_{mi}(\bm{p}_{0})}{\theta_{i}}=1+\mathcal{O}(m^{-k}), \qquad (25)$$

uniformly in $m$. Assume that $y_{1},\ldots,y_{n}$ is a random sample from the discrete distribution with probability mass function $\theta_{i}=P(Y=i)$, $i=1,\ldots,N$.

Let $u_{1},\ldots,u_{n}$ be a sample from the uniform(0,1) distribution. For $y_{i}\sim\bm{\theta}\equiv(\theta_{1},\ldots,\theta_{N})$, $i=1,\ldots,n$, if $u_{i}\leqslant c^{-1}\theta_{my_{i}}(\bm{p}_{0})/\theta_{y_{i}}$, then $y_{i}$ can be treated as if it were from $\bm{\theta}_{m}(\bm{p}_{0})\equiv(\theta_{m1}(\bm{p}_{0}),\ldots,\theta_{mN}(\bm{p}_{0}))$. Assume that the data $y_{1},\ldots,y_{n}$ have been rearranged so that the first $\nu_{m}$ observations can be treated as if they were from $\bm{\theta}_{m}(\bm{p}_{0})$. By the law of the iterated logarithm we have

$$\nu_{m}=\sum_{i=1}^{n}I\left(u_{i}\leqslant\frac{\theta_{my_{i}}(\bm{p}_{0})}{c\theta_{y_{i}}}\right)=n-\mathcal{O}(nm^{-k})-\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s.$$

So we have

$$\ell_{G}(\bm{p}_{0})=\sum_{i=1}^{N}n_{i}\log\theta_{mi}(\bm{p}_{0})=\tilde{\ell}_{G}(\bm{p}_{0})+R_{mn},$$

where

$$\tilde{\ell}_{G}(\bm{p}_{0})=\sum_{j=1}^{\nu_{m}}\sum_{i=1}^{N}I(y_{j}=i)\log\theta_{mi}(\bm{p}_{0})=\sum_{i=1}^{N}\tilde{n}_{i}\log\theta_{mi}(\bm{p}_{0}),$$

$n_{i}=\#\{j:y_{j}=i\}$, $\tilde{n}_{i}=\sum_{j=1}^{\nu_{m}}I(y_{j}=i)$, $i=1,\ldots,N$, and

$$R_{mn}=\sum_{u_{j}>\theta_{my_{j}}(\bm{p}_{0})/(c\theta_{y_{j}})}\sum_{i=1}^{N}I(y_{j}=i)\log\theta_{mi}(\bm{p}_{0})=\sum_{i=1}^{N}(n_{i}-\tilde{n}_{i})\log\theta_{mi}(\bm{p}_{0}).$$

It is clear that there exist $\delta'>0$ and $M_{0}'>0$ such that $\delta'\Delta t_{i}\leqslant\theta_{mi}(\bm{p}_{0})\leqslant M_{0}'\Delta t_{i}$. By the law of the iterated logarithm,

$$|R_{mn}|\leqslant\max_{i}\{|\log(\delta'\Delta t_{i})|,|\log(M_{0}'\Delta t_{i})|\}(n-\nu_{m})=\mathcal{O}(nm^{-k})+\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s.$$

The Taylor expansion of $\log\theta_{mi}(\bm{p})$ around $\log\theta_{mi}(\bm{p}_{0})$ yields that, for $\bm{p}\in\mathbb{B}_{m}(r_{n})$,

$$\tilde{\ell}_{G}(\bm{p})=\sum_{i=1}^{N}\tilde{n}_{i}\log\theta_{mi}(\bm{p})=\tilde{\ell}_{G}(\bm{p}_{0})+\sum_{i=1}^{N}\tilde{n}_{i}\left[\frac{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})}{\theta_{mi}(\bm{p}_{0})}-\frac{1}{2}\frac{\{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})\}^{2}}{\{\theta_{mi}(\bm{p}_{0})\}^{2}}\right]+\tilde{R}_{mn},$$

where $\tilde{R}_{mn}=o(nr_{n}^{2})$, a.s.

Let $\bm{p}$ be a point on the boundary of $\mathbb{B}_{m}(r_{n})$, i.e., $\|\bm{p}-\bm{p}_{0}\|_{G}^{2}=r_{n}^{2}$. It follows from the law of the iterated logarithm that there exists $\eta>0$ such that

$$\sum_{i=1}^{N}\tilde{n}_{i}\frac{\{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})\}^{2}}{\{\theta_{mi}(\bm{p}_{0})\}^{2}}=\eta nr_{n}^{2}+\mathcal{O}(r_{n}^{2}\sqrt{n\log\log n}).$$

Therefore

$$\tilde{\ell}_{G}(\bm{p})=\tilde{\ell}_{G}(\bm{p}_{0})+\sum_{i=1}^{N}\tilde{n}_{i}\left[\frac{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})}{\theta_{mi}(\bm{p}_{0})}-\frac{1}{2}\frac{\{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})\}^{2}}{\{\theta_{mi}(\bm{p}_{0})\}^{2}}\right]+o(nr_{n}^{2})=\tilde{\ell}_{G}(\bm{p}_{0})-\frac{1}{2}\eta nr_{n}^{2}+\mathcal{O}(r_{n}^{2}\sqrt{n\log\log n})+\mathcal{O}(r_{n}\sqrt{n\log\log n})+o(nr_{n}^{2}),\quad a.s.$$

Since $m=\mathcal{O}(n^{1/k})$, $nm^{-k}=o(nr_{n}^{2})$. So there exists $\eta'>0$ such that $\ell_{G}(\bm{p})\leqslant\ell_{G}(\bm{p}_{0})-\eta'nr_{n}^{2}=\ell_{G}(\bm{p}_{0})-\eta'(\log n)^{2}$. Since $\partial^{2}\ell_{G}(\bm{p})/\partial\bm{p}\partial\bm{p}^{\mathrm{T}}<0$, the maximum value of $\ell_{G}(\bm{p})$ is attained by some $\tilde{\bm{p}}_{G}$ in the interior of $\mathbb{B}_{m}(r_{n})$. ∎

A.7 Proof of Theorem 7

Proof.

By (13) and the Taylor expansion we have

$$\|\bm{p}-\bm{p}_{0}\|_{B}^{2}=\|\bm{p}-\bm{p}_{0}\|_{G}^{2}+\mathcal{O}(\max_{i}\Delta t_{i})\int_{0}^{1}|\psi_{m}'(t)|\,dt+\max_{0\leqslant t\leqslant 1}|\psi_{m}''(t)|\,\mathcal{O}(\max_{i}\Delta t_{i}^{2}). \qquad (26)$$

By $\beta_{mj}'(t)=(m+1)\{\beta_{m-1,j-1}(t)-\beta_{m-1,j}(t)\}$, we have

$$|\psi_{m}'(t)|\leqslant m^{2}\Big\{C_{1}\sqrt{\psi_{m}(t)}+C_{2}\psi_{m}(t)\Big\}.$$

It follows easily from $\beta_{mj}''(t)=m(m+1)\{\beta_{m-2,j-2}(t)-2\beta_{m-2,j-1}(t)+\beta_{m-2,j}(t)\}$ that $|\psi_{m}''(t)|\leqslant C_{3}m^{4}$. Thus by (26) we obtain

$$\|\bm{p}-\bm{p}_{0}\|_{B}^{2}-\|\bm{p}-\bm{p}_{0}\|_{G}^{2}=\mathcal{O}(\max_{i}\Delta t_{i})\mathcal{O}(m^{2})\|\bm{p}-\bm{p}_{0}\|_{B}+\mathcal{O}(m^{4})\mathcal{O}(\max_{i}\Delta t_{i}^{2}).$$

Therefore we have $\|\bm{p}-\bm{p}_{0}\|_{B}^{2}=\|\bm{p}-\bm{p}_{0}\|_{G}^{2}+\mathcal{O}(m^{4}\max_{i}\Delta t_{i}^{2})$. The proof is complete. ∎

A.8 Proof of Theorem 8

Proof.

Similar to the proof of Theorem 5, (14) and (15) follow easily from Theorems 6 and 7, (22), the boundedness of $f$, and the triangle inequality. ∎

References

  • Beran (1977a) Beran, R. (1977a). Minimum Hellinger distance estimates for parametric models. Ann. Statist. 5, 445–463.
  • Beran (1977b) Beran, R. (1977b). Robust location estimates. Ann. Statist. 5, 431–444.
  • Bernstein (1912) Bernstein, S. N. (1912). Démonstration du théorème de Weierstrass fondée sur le calcul des probabilitiés. Comm. Soc. Math. Kharkov 13, 1–2.
  • Box (1976) Box, G. E. P. (1976). Science and statistics. J. Amer. Statist. Assoc. 71, 791–799.
  • Buckland (1992) Buckland, S. T. (1992). Fitting density functions with polynomials. J. Roy. Statist. Soc. Ser. C 41, 63–76.
  • Csörgő & Horváth (1997) Csörgő, M. & Horváth, L. (1997). Limit Theorems in Change-Point Analysis. New York: John Wiley & Sons Inc., 1st ed.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1–38.
  • Guan (2014) Guan, Z. (2014). Efficient and robust density estimation using Bernstein type polynomials. ArXiv e-prints.
  • Hall (1982) Hall, P. (1982). The influence of rounding errors on some nonparametric estimators of a density and its derivatives. SIAM J. Appl. Math. 42, 390–399.
  • Jang & Loh (2010) Jang, W. & Loh, J. M. (2010). Density estimation for grouped data with application to line transect sampling. Ann. Appl. Stat. 4, 893–915.
  • Jassim et al. (1996) Jassim, E. W., Grossman, M., Koops, W. J. & Luykx, R. A. J. (1996). Multiphasic analysis of embryonic mortality in chickens. Poultry Sci 75, 464–471.
  • Jones (1993) Jones, M. C. (1993). Simple boundary correction for kernel density estimation. Statistics and Computing 3, 135–146.
  • Jones et al. (1996) Jones, M. C., Marron, J. S. & Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91, 401–407.
  • Jones & McLachlan (1990) Jones, P. N. & McLachlan, G. J. (1990). Algorithm AS 254: Maximum likelihood estimation from grouped and truncated data with finite normal mixture models. Journal of the Royal Statistical Society. Series C (Applied Statistics) 39, 273–282.
  • Kuurman et al. (2003) Kuurman, W. W., Bailey, B. A., Koops, W. J. & Grossman, M. (2003). A model for failure of a chicken embryo to survive incubation. Poultry Sci. 82, 214–222.
  • Lin & He (2006) Lin, N. & He, X. (2006). Robust and efficient estimation under data grouping. Biometrika 93, 99–112.
  • Lindley (1950) Lindley, D. V. (1950). Grouping corrections and maximum likelihood equations. Mathematical Proceedings of the Cambridge Philosophical Society 46, 106–110.
  • Linton & Whang (2002) Linton, O. & Whang, Y.-J. (2002). Nonparametric estimation with aggregated data. Econometric Theory 18, 420–468.
  • Lorentz (1963) Lorentz, G. G. (1963). The degree of approximation by polynomials with positive coefficients. Math. Ann. 151, 239–251.
  • McLachlan & Jones (1988) McLachlan, G. J. & Jones, P. N. (1988). Fitting mixture models to grouped and truncated data via the EM algorithm. Biometrics 44, 571–578.
  • Minoiu & Reddy (2014) Minoiu, C. & Reddy, S. (2014). Kernel density estimation on grouped data: the case of poverty assessment. The Journal of Economic Inequality 12, 163–189.
  • Passow (1977) Passow, E. (1977). Polynomials with positive coefficients: uniqueness of best approximation. J. Approximation Theory 21, 352–355.
  • Redner & Walker (1984) Redner, R. A. & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26, 195–239.
  • Rice (1984) Rice, J. (1984). Boundary modification for kernel regression. Comm. Statist. A—Theory Methods 13, 893–900.
  • Rosenblatt (1956) Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27, 832–837.
  • Rosenblatt (1971) Rosenblatt, M. (1971). Curve estimates. Ann. Math. Statist. 42, 1815–1842.
  • Ross (2013) Ross, S. M. (2013). Simulation. New York: Academic Press, 5th ed.
  • Scott & Sheather (1985) Scott, D. & Sheather, S. (1985). Kernel density estimation with binned data. Communications in Statistics - Theory and Methods 14, 1353–1359.
  • Sheather & Jones (1991) Sheather, S. J. & Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B 53, 683–690.
  • Tallis (1967) Tallis, G. M. (1967). Approximate maximum likelihood estimates from grouped data. Technometrics 9, 599–606.
  • Titterington (1983) Titterington, D. M. (1983). Kernel-based density estimation using censored, truncated or grouped data. Comm. Statist. A—Theory Methods 12, 2151–2167.
  • Vitale (1975) Vitale, R. A. (1975). Bernstein polynomial approach to density function estimation. In Statistical Inference and Related Topics (Proc. Summer Res. Inst. Statist. Inference for Stochastic Processes, Indiana Univ., Bloomington, Ind., 1974, Vol. 2; dedicated to Z. W. Birnbaum). New York: Academic Press, pp. 87–99.