
Bernstein Polynomial Model for Grouped Continuous Data

Zhong Guan
Department of Mathematical Sciences
Indiana University South Bend
South Bend, IN 46634, USA
email: zguan@iusb.edu

Abstract

Grouped data are commonly encountered in applications. All data from a continuous population are grouped due to rounding of the individual observations. In this paper the Bernstein polynomial model is proposed as an approximate model for estimating a univariate density function based on grouped data. The coefficients of the Bernstein polynomial, as the mixture proportions of beta distributions, can be estimated using an EM algorithm. The optimal degree of the Bernstein polynomial can be determined using a change-point estimation method. The rate of convergence of the proposed density estimate to the true density is proved to be almost parametric by an acceptance-rejection argument used in the Monte Carlo method. The proposed method is compared with some existing methods in a simulation study and is applied to the Chicken Embryo Data.

Keywords: Acceptance-rejection method; Approximate model; Bernstein type polynomials; Beta mixture; Change-point; Density estimation; Grouped data; Model selection; Nonparametric model; Parametrization; Smoothing.

1 Introduction

In real-world applications of statistics, many data are provided in the form of frequencies of observations in some fixed, mutually exclusive intervals; such data are called grouped data. Strictly speaking, all data from a population with a continuous distribution are grouped due to rounding of the individual observations (Hall, 1982). The EM algorithm has been used to deal with grouped data (Dempster et al., 1977). McLachlan & Jones (1988) introduced the EM algorithm for fitting mixture models to grouped data (see also Jones & McLachlan, 1990). Under a parametric model, let $f(x;\theta)$ be the probability density function (PDF) of the underlying distribution with an unknown parameter $\theta$. The maximum likelihood estimate (MLE) of $\theta$ can be obtained from grouped data and is shown to be consistent and asymptotically normal (see, for example, Lindley, 1950; Tallis, 1967). The parametric MLE is sensitive to model misspecification and outliers. The minimum Hellinger distance estimate (MHDE) of the parameter using grouped continuous data is both robust against data contamination and asymptotically efficient (Beran, 1977a, b). Parametric methods for grouped data require evaluating integrals, which makes the computation expensive. To lower the computational cost, Lin & He (2006) proposed the approximate minimum Hellinger distance estimate (AMHDE) for grouped data by truncating the data and replacing the probabilities of the class intervals with first-order Taylor expansions. Clearly their idea also works for the MLE based on grouped data.

Under the nonparametric setting, the underlying PDF $f$ is unspecified. Based on grouped data, $f$ can be estimated by the empirical density, the relative frequency distribution, which is actually a discrete probability mass function. The kernel density estimate (Rosenblatt, 1956, 1971) can be applied to grouped data (see, for example, Linton & Whang, 2002; Jang & Loh, 2010; Minoiu & Reddy, 2014). The effects of rounding, truncating, and grouping of the data on the kernel density estimate have been studied, among others, by Hall (1982), Scott & Sheather (1985), and Titterington (1983). However, the expectation of the kernel density estimate is the convolution of $f$ and the kernel scaled by the bandwidth. It is crucial and difficult to select an appropriate bandwidth to balance the bias and the variance. Many authors have proposed methods for data-based bandwidth selection over the years; the reader is referred to the survey by Jones et al. (1996) for details and more references. Another drawback of the kernel density is its boundary effect. Methods of boundary-effect correction have been studied, among others, by Rice (1984) and Jones (1993).

“All models are wrong” (Box, 1976), so all parametric models are subject to model misspecification. The normal model, for instance, is approximate, justified by the central limit theorem. Goodness-of-fit tests and other methods for selecting a parametric model introduce additional errors into the statistical inference.

Any continuous function can be approximated by polynomials. Vitale (1975) proposed to estimate the PDF $f$ by estimating the coefficients $f(i/m)$ of the Bernstein polynomial (Bernstein, 1912) $\mathbb{B}f(t)=\sum_{i=0}^{m}f(i/m)\binom{m}{i}t^{i}(1-t)^{m-i}$ by $\hat{f}(i/m)=(m+1)\{F_{n}[(i+1)/(m+1)]-F_{n}[i/(m+1)]\}$, $i=0,1,\ldots,m$, where $F_{n}$ is the empirical distribution function of $x_{1},\ldots,x_{n}$. Since then, many authors have applied the Bernstein polynomial in statistics in similar ways (see Guan, 2014, for more references). These and the kernel methods are neither model-based nor maximum likelihood methods, so they are not efficient. The estimated Bernstein polynomial $\widehat{\mathbb{B}f}(t)=\sum_{i=0}^{m}\hat{f}(i/m)\binom{m}{i}t^{i}(1-t)^{m-i}$ targets $\mathbb{B}f(t)$. It is known that the best rate of convergence of $\mathbb{B}f(t)$ to $f(t)$ is at most $\mathcal{O}(m^{-1})$ even if $f$ has continuous second or higher order derivatives on $[0,1]$. Buckland (1992) proposed a density estimation with polynomials using grouped and ungrouped data with the help of some specified parametric models.

Thanks to a result of Lorentz (1963), there exists a Bernstein type polynomial $f_{m}(t;\bm{p})\equiv\sum_{i=0}^{m}p_{mi}\beta_{mi}(t)$, where $p_{mi}\geqslant 0$ and $\beta_{mi}(t)=(m+1)\binom{m}{i}t^{i}(1-t)^{m-i}$, $i=0,\ldots,m$, whose rate of convergence to $f(t)$ is at least $\mathcal{O}(m^{-r/2})$ if $f$ has a continuous $r$-th derivative on $[0,1]$ and $r\geqslant 2$. Such a polynomial is called a polynomial with “positive coefficients” in the literature of polynomial approximation. Guan (2014) introduced the Bernstein polynomial model $f_{m}(t;\bm{p})$ as a globally valid approximate parametric model for any underlying continuous density function with support $[0,1]$ and proposed a change-point method for selecting an optimal degree $m$. It has been shown that the rate of convergence to zero of the mean integrated squared error (MISE) of the maximum likelihood estimate of the density can be nearly parametric, $\mathcal{O}(n^{-1+\epsilon})$ for all $\epsilon>0$. This method does not suffer from the boundary effect.

If the support of $f$ is different from $[0,1]$ or even infinite, then we can choose an appropriate (truncation) interval $[a,b]$ so that $\int_{a}^{b}f(x)\,dx\approx 1$ (see Guan, 2014). We then treat $[a,b]$ as the support of $f$, use the linearly transformed data $y_{i}=(x_{i}-a)/(b-a)$ in $[0,1]$ to obtain an estimate $\hat{g}$ of the PDF $g$ of the $y_{i}$'s, and estimate $f$ by $\hat{f}(x)=\hat{g}\{(x-a)/(b-a)\}/(b-a)$. In the rest of this paper, we assume that the density $f$ has support $[0,1]$.
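
In code, this back-transformation is a one-liner. The sketch below (the names `g_hat`, `a`, and `b` are ours, purely for illustration) wraps a density estimate $\hat{g}$ on $[0,1]$ into an estimate of $f$ on $[a,b]$:

```python
def back_transform_density(g_hat, a, b):
    """Turn a density estimate g_hat on [0, 1] into a density estimate
    on [a, b]: f_hat(x) = g_hat((x - a) / (b - a)) / (b - a)."""
    return lambda x: g_hat((x - a) / (b - a)) / (b - a)
```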

The Bernstein polynomial model $f_{m}(t;\bm{p})$ is a finite mixture of the beta densities $\beta_{mi}(t)$ of beta$(i+1,m-i+1)$, $i=0,\ldots,m$, with mixture proportions $\bm{p}=(p_{m0},\ldots,p_{mm})$. It has been shown that the Bernstein polynomial model can be used to fit an ungrouped dataset and has the advantages of smoothness, robustness, and efficiency over traditional methods such as the empirical distribution and the kernel density estimate (Guan, 2014). Because these beta densities and their integrals are fully specified and free of unknown parameters, this structure of $f_{m}(t;\bm{p})$ is convenient: it allows the grouped data to be approximately modeled by a mixture of $m+1$ specific discrete distributions. So the infinite dimensional “parameter” $f$ is approximately described by a finite dimensional parameter $\bm{p}$. This is similar to the nonparametric likelihood in the sense that there the underlying distribution function is approximated by a step function whose jumps at the observations are the parameters.
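
As a concrete illustration of this parametrization, the following minimal Python sketch (our own helper, assuming numpy and scipy; not code from the paper) evaluates $f_{m}(t;\bm{p})$ as a mixture of beta densities:

```python
import numpy as np
from scipy.stats import beta


def bernstein_density(t, p):
    """Evaluate f_m(t; p) = sum_j p_{mj} beta_{mj}(t), where beta_{mj} is
    the beta(j + 1, m - j + 1) density and m = len(p) - 1."""
    p = np.asarray(p, dtype=float)
    m = len(p) - 1
    j = np.arange(m + 1)
    # Rows: evaluation points; columns: the m + 1 beta components.
    B = beta.pdf(np.atleast_1d(t)[:, None], j + 1, m - j + 1)
    return B @ p
```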

Due to the closeness of $f_{m}(t;\bm{p})$ to $f(t)$, by the acceptance-rejection argument for generating pseudorandom numbers, almost all the observations in a sample from $f(t)$ can be used as if they were from $f_{m}(t;\bm{p})$. It will be shown in this paper that the maximizer of the likelihood based on the approximate model $f_{m}(t;\bm{p})$ targets the $\bm{p}_{0}$ that makes $f_{m}(t;\bm{p}_{0})$ the unique best approximation of $f$. This acceptance-rejection argument can be used to prove other asymptotic results under an approximate model assumption.

In this paper we study the asymptotic properties of the Bernstein polynomial density estimate based on grouped data, with ungrouped raw data as a special case of grouping. A stronger result than that of Guan (2014) about the rate of convergence of the proposed density estimate based on ungrouped raw data is proved using a different argument. We also compare the proposed estimate with existing methods such as the kernel density, the parametric MLE, and the MHDE via a simulation study.

The paper is organized as follows. The Bernstein polynomial model for grouped data is introduced and proved to be nested in Section 2; the EM algorithm for finding the approximate maximum likelihood estimates of the mixture proportions is derived in the same section. Asymptotic results on the convergence rate of the proposed density estimate are given in Section 3. The methods for determining a lower bound for the model degree $m$ based on the estimated mean and variance, and for choosing the optimal degree $m$, are described in Section 4. In Section 5, the proposed methods are compared with some existing competitors through Monte Carlo experiments and illustrated by the Chicken Embryo Data. The proofs of the theorems are relegated to the Appendix.

2 Likelihood for grouped data and EM algorithm

2.1 The Bernstein polynomial model

Let $C^{(r)}[0,1]$ be the class of functions that have a continuous $r$-th derivative $f^{(r)}$ on $[0,1]$. Like the normal model being backed up by the central limit theorem, the Bernstein polynomial model is supported by the following mathematical result, which is a consequence of Theorem 1 of Lorentz (1963). We denote the $m$-simplex by

$$\mathbb{S}_{m}=\Big\{\bm{p}=(p_{m0},\ldots,p_{mm})^{\mathrm{T}}:p_{mj}\geqslant 0,\ \sum_{j=0}^{m}p_{mj}=1\Big\}.$$
Theorem 1.

If $f\in C^{(r)}[0,1]$, $\int_{0}^{1}f(t)\,dt=1$, and $f(t)\geqslant\delta>0$, then there exists a sequence of Bernstein type polynomials $f_{m}(t;\bm{p})=\sum_{i=0}^{m}p_{mi}\beta_{mi}(t)$ with $\bm{p}\in\mathbb{S}_{m}$ such that

$$|f(t)-f_{m}(t)|\leqslant C(r,\delta,f)\Delta_{m}^{r}(t),\quad 0\leqslant t\leqslant 1, \qquad (1)$$

where $\Delta_{m}(t)=\max\{m^{-1},\sqrt{t(1-t)/m}\}$ and the constant $C(r,\delta,f)$ depends only on $r$, $\delta$, $\max_{t}|f(t)|$, and $\max_{t}|f^{(i)}(t)|$, $i=2,\ldots,r$.

The uniqueness of the best approximation was proved by Passow (1977). Let $f$ be the density of the underlying distribution with support $[0,1]$. We approximate $f$ using the Bernstein polynomial $f_{m}(t;\bm{p})=\sum_{j=0}^{m}p_{mj}\beta_{mj}(t)$, where $\bm{p}\in\mathbb{S}_{m}$.

Define $\mathcal{D}_{m}=\big\{f_{m}(t;\bm{p})=\sum_{j=0}^{m}p_{mj}\beta_{mj}(t):\bm{p}\in\mathbb{S}_{m}\big\}$. Guan (2014) showed that $\mathcal{D}_{m}\subset\mathcal{D}_{m+r}$ for all $r\geqslant 1$. So the Bernstein polynomial model $f_{m}(t;\bm{p})$ of degree $m$ is nested in all Bernstein polynomial models of larger degrees.

Let $[0,1]$ be partitioned into $N$ class intervals $\{(t_{i-1},t_{i}]:i=1,\ldots,N\}$, where $0=t_{0}<t_{1}<\cdots<t_{N}=1$. The probability that a random observation falls in the $i$-th interval is approximately

$$\theta_{mi}(\bm{p})=\int_{t_{i-1}}^{t_{i}}f_{m}(t;\bm{p})\,dt=\sum_{j=0}^{m}a_{ij}p_{mj}, \qquad (2)$$

where $a_{ij}=\mathcal{B}_{mj}(t_{i})-\mathcal{B}_{mj}(t_{i-1})$, $i=1,\ldots,N$, $\mathcal{B}_{mj}(t)$ is the cumulative distribution function (CDF) of beta$(j+1,m-j+1)$, $j=0,1,\ldots,m$, and

$$\sum_{i=1}^{N}\theta_{mi}(\bm{p})=\sum_{j=0}^{m}p_{mj}=1.$$

So the probability $\theta_{mi}(\bm{p})$ is a mixture of the specific components $\{a_{i0},\ldots,a_{im}\}$ with unknown proportions $\bm{p}=(p_{m0},\ldots,p_{mm})$.

By Theorem 2.1 of Guan (2014), the above Bernstein polynomial model (2) of degree $m$ for grouped data is nested in the model of degree $m+r$, i.e., for all $r\geqslant 1$,

$$\theta_{mi}=\int_{t_{i-1}}^{t_{i}}\sum_{j=0}^{m}p_{mj}\beta_{mj}(t)\,dt=\int_{t_{i-1}}^{t_{i}}\sum_{j=0}^{m+r}p_{m+r,j}\beta_{m+r,j}(t)\,dt=\theta_{m+r,i},\quad i=1,\ldots,N.$$

2.2 The Bernstein likelihood for grouped data

In many applications, only the grouped data $\{n_{i},(t_{i-1},t_{i}]:i=1,\ldots,N\}$ are available, where $0=t_{0}<t_{1}<\cdots<t_{N}=1$, $n_{i}=\#\{j\in\{1,\ldots,n\}:x_{j}\in(t_{i-1},t_{i}]\}$, $i=1,\ldots,N$, and $x_{1},\ldots,x_{n}$ is a random sample from a population having a continuous density $f(x)$ on $[0,1]$. Our goal is to estimate the unknown PDF $f$. The loglikelihood of $(n_{1},\ldots,n_{N})$ is approximately

$$\ell_{G}(\bm{p})=\sum_{i=1}^{N}n_{i}\log\Big[\sum_{j=0}^{m}p_{mj}\{\mathcal{B}_{mj}(t_{i})-\mathcal{B}_{mj}(t_{i-1})\}\Big], \qquad (3)$$

where the mixture proportions $\bm{p}=(p_{m0},\ldots,p_{mm})$ are subject to the feasibility constraint $\bm{p}\in\mathbb{S}_{m}$. For the ungrouped raw data $x_{1},\ldots,x_{n}$, the loglikelihood is

$$\ell_{R}(\bm{p})=\sum_{i=1}^{n}\log\Big\{\sum_{j=0}^{m}p_{mj}\beta_{mj}(x_{i})\Big\}. \qquad (4)$$

If we take the rounding error into account when the observations are rounded to the nearest value using the round half up tie-breaking rule, then

$$\ell_{G}(\bm{p})=\sum_{i=-\infty}^{\infty}n_{i}\log\Big[\sum_{j=0}^{m}p_{mj}\{\mathcal{B}_{mj}(t_{i})-\mathcal{B}_{mj}(t_{i-1})\}\Big], \qquad (5)$$

where $t_{i}=(i+1/2)/K$, $i=0,\pm 1,\pm 2,\ldots$, and $K$ is a positive integer such that each observation is rounded to $i/K$ for some integer $i$.

We shall call the maximizers $\tilde{\bm{p}}_{G}$ and $\hat{\bm{p}}_{R}$ of $\ell_{G}(\bm{p})$ and $\ell_{R}(\bm{p})$ the maximum Bernstein likelihood estimates (MBLE's) of $\bm{p}$ based on grouped and raw data, respectively, and call $\tilde{f}_{B}(t)=f_{m}(t;\tilde{\bm{p}}_{G})$ and $\hat{f}_{B}(t)=f_{m}(t;\hat{\bm{p}}_{R})$ the corresponding MBLE's of $f$.

It should also be noted that as $N\to\infty$ and $\max\{\Delta t_{i}\equiv t_{i}-t_{i-1}:i=1,\ldots,N\}\to 0$, the loglikelihood (3) reduces to the loglikelihood (4) for ungrouped raw data. Specifically, $\lim_{\max\Delta t_{i}\to 0}\{\ell_{G}(\bm{p})-\sum_{i=1}^{N}n_{i}\log\Delta t_{i}\}=\ell_{R}(\bm{p})$.

If the underlying PDF $f$ is approximately $f_{m}(t;\bm{p})=\sum_{i=0}^{m}p_{mi}\beta_{mi}(t)$ for some $m\geqslant 0$, then the distribution of the grouped data $(n_{1},\ldots,n_{N})$ is approximately multinomial with probability mass function

$$P(W_{1}=n_{1},\ldots,W_{N}=n_{N})=\binom{n}{n_{1},\ldots,n_{N}}\prod_{i=1}^{N}\theta_{mi}^{n_{i}}(\bm{p}).$$

The MLE’s of θmi\theta_{mi}’s are θ^mi=nin\hat{\theta}_{mi}=\frac{n_{i}}{n}, i=1,,N.i=1,\ldots,N. So the MLE’s p^mj\hat{p}_{mj} of pmjp_{mj} satisfy the equations j=0maijp^mj=nin\sum_{j=0}^{m}a_{ij}\hat{p}_{mj}=\frac{n_{i}}{n}, i=1,,Ni=1,\ldots,N, and (p^m0,,p^mm)𝕊m(\hat{p}_{m0},\ldots,\hat{p}_{mm})\in\mathbb{S}_{m}. Because p^m0=1j=1mp^mj\hat{p}_{m0}=1-\sum_{j=1}^{m}\hat{p}_{mj}, p^mj\hat{p}_{mj} satisfy equatins

j=1m(aijai0)p^mj=ninai0,i=1,,N,\sum_{j=1}^{m}(a_{ij}-a_{i0})\hat{p}_{mj}=\frac{n_{i}}{n}-a_{i0},\quad i=1,\ldots,N,

and inequality constraints p^mj0\hat{p}_{mj}\geqslant 0, j=1,,mj=1,\ldots,m, and j=1mp^mj1\sum_{j=1}^{m}\hat{p}_{mj}\leqslant 1. It seems not easy to algebraically solve the above system of equations with inequality constraints. In the next section, we shall use an EM-algorithm to find the MLE of 𝒑\bm{p}.

2.3 The EM Algorithm

Let $\delta_{ij}=1$ or $0$ according to whether or not $x_{i}$ came from beta$(j+1,m-j+1)$, $i=1,\ldots,n$, $j=0,\ldots,m$. We denote by $\bm{z}_{i}=(z_{i1},\ldots,z_{iN})^{\mathrm{T}}$ the vector of indicators $z_{ij}=I\{x_{i}\in(t_{j-1},t_{j}]\}$, $i=1,\ldots,n$, $j=1,\ldots,N$. Then the conditional expectation of $\delta_{ij}$ given $\bm{z}_{i}$ is

$$r_{j}(\bm{p},\bm{z}_{i})\equiv\mathrm{E}_{\bm{p}}(\delta_{ij}\,|\,\bm{z}_{i})=\frac{p_{mj}\prod_{l=1}^{N}\{\mathcal{B}_{mj}(t_{l})-\mathcal{B}_{mj}(t_{l-1})\}^{z_{il}}}{\sum_{h=0}^{m}p_{mh}\prod_{l=1}^{N}\{\mathcal{B}_{mh}(t_{l})-\mathcal{B}_{mh}(t_{l-1})\}^{z_{il}}}.$$

Note that $\sum_{j=0}^{m}\delta_{ij}=1$ and that the observed counts are $n_{l}=\sum_{i=1}^{n}\sum_{j=0}^{m}\delta_{ij}z_{il}=\sum_{i=1}^{n}z_{il}$, $l=1,\ldots,N$. The complete-data likelihood of the $\delta_{ij}$ and $\bm{z}_{i}$ is

$$\mathscr{L}_{c}(\bm{p})=\prod_{i=1}^{n}\prod_{j=0}^{m}\Big[p_{mj}\prod_{l=1}^{N}\{\mathcal{B}_{mj}(t_{l})-\mathcal{B}_{mj}(t_{l-1})\}^{z_{il}}\Big]^{\delta_{ij}}.$$

The loglikelihood is then

$$\ell_{c}(\bm{p})=\sum_{i=1}^{n}\sum_{j=0}^{m}\delta_{ij}\Big[\log p_{mj}+\sum_{l=1}^{N}z_{il}\log\{\mathcal{B}_{mj}(t_{l})-\mathcal{B}_{mj}(t_{l-1})\}\Big].$$

E-Step. Given $\tilde{\bm{p}}^{(s)}$, we have

$$Q(\bm{p},\tilde{\bm{p}}^{(s)})=\mathrm{E}_{\tilde{\bm{p}}^{(s)}}\{\ell_{c}(\bm{p})\,|\,\bm{z}\}=\sum_{i=1}^{n}\sum_{j=0}^{m}r_{j}(\tilde{\bm{p}}^{(s)},\bm{z}_{i})\Big[\log p_{mj}+\sum_{l=1}^{N}z_{il}\log\{\mathcal{B}_{mj}(t_{l})-\mathcal{B}_{mj}(t_{l-1})\}\Big].$$

M-Step. Maximizing $Q(\bm{p},\tilde{\bm{p}}^{(s)})$ with respect to $\bm{p}$ subject to the constraint $\bm{p}\in\mathbb{S}_{m}$ yields, for $s=0,1,\ldots$,

$$\tilde{p}_{mj}^{(s+1)}=\frac{1}{n}\sum_{i=1}^{n}r_{j}(\tilde{\bm{p}}^{(s)},\bm{z}_{i})=\frac{1}{n}\sum_{l=1}^{N}\frac{n_{l}\tilde{p}_{mj}^{(s)}\{\mathcal{B}_{mj}(t_{l})-\mathcal{B}_{mj}(t_{l-1})\}}{\sum_{h=0}^{m}\tilde{p}_{mh}^{(s)}\{\mathcal{B}_{mh}(t_{l})-\mathcal{B}_{mh}(t_{l-1})\}}. \qquad (6)$$

Starting with initial values $\tilde{p}_{mj}^{(0)}$, $j=0,\ldots,m$, we can use this iterative formula to obtain the maximum Bernstein likelihood estimate $\tilde{\bm{p}}_{G}$. If the ungrouped raw data $x_{1},\ldots,x_{n}$ are available, the iteration (Guan, 2014) reduces to

$$\hat{p}_{mj}^{(s+1)}=\frac{1}{n}\sum_{i=1}^{n}\frac{\hat{p}_{mj}^{(s)}\beta_{mj}(x_{i})}{\sum_{h=0}^{m}\hat{p}_{mh}^{(s)}\beta_{mh}(x_{i})},\quad j=0,\ldots,m;\quad s=0,1,\ldots. \qquad (7)$$
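
For concreteness, the following Python sketch implements iteration (6) for grouped data (a minimal implementation of ours, assuming numpy and scipy are available; the uniform starting value $\tilde{p}_{mj}^{(0)}=1/(m+1)$ is one convenient choice):

```python
import numpy as np
from scipy.stats import beta


def em_grouped(counts, t, m, max_iter=5000, tol=1e-10):
    """Maximum Bernstein likelihood estimate of p = (p_{m0},...,p_{mm})
    from grouped data via iteration (6).

    counts -- class frequencies n_1, ..., n_N
    t      -- class boundaries 0 = t_0 < t_1 < ... < t_N = 1
    m      -- model degree
    """
    counts = np.asarray(counts, dtype=float)
    t = np.asarray(t, dtype=float)
    n = counts.sum()
    j = np.arange(m + 1)
    # A[i, j] = B_{mj}(t_{i+1}) - B_{mj}(t_i): the beta(j+1, m-j+1)
    # probability of the i-th class interval.
    A = np.diff(beta.cdf(t[:, None], j + 1, m - j + 1), axis=0)
    p = np.full(m + 1, 1.0 / (m + 1))              # uniform starting value
    for _ in range(max_iter):
        theta = A @ p                              # theta_{mi}(p) of (2)
        p_new = p * (A.T @ (counts / theta)) / n   # iteration (6)
        if np.abs(p_new - p).max() < tol:
            p = p_new
            break
        p = p_new
    return p, counts @ np.log(A @ p)               # MBLE and loglikelihood (3)
```

The raw-data iteration (7) is the same computation with the interval probabilities $A$ replaced by the matrix of densities $\beta_{mj}(x_{i})$ and the counts replaced by ones.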

The following theorem shows the convergence of the EM algorithm and is proved in the Appendix.

Theorem 2.

(i) Assume $\hat{p}_{mj}^{(0)}>0$, $j=0,1,\ldots,m$, and $\sum_{j=0}^{m}\hat{p}_{mj}^{(0)}=1$. Then as $s\to\infty$, $\hat{\bm{p}}^{(s)}$ converges to the unique maximizer $\hat{\bm{p}}_{R}$ of $\ell_{R}(\bm{p})$. (ii) Assume $\tilde{p}_{mj}^{(0)}>0$, $j=0,1,\ldots,m$, and $\sum_{j=0}^{m}\tilde{p}_{mj}^{(0)}=1$. Then as $s\to\infty$, $\tilde{\bm{p}}^{(s)}$ converges to the unique maximizer $\tilde{\bm{p}}_{G}$ of $\ell_{G}(\bm{p})$.

3 Rate of Convergence of the Density Estimate

In this section we state results about the convergence rate of the density estimates; the proofs are given in the Appendix. Unlike most asymptotic results about the maximum likelihood method, which assume exact parametric models, we show our results under the approximate model $f_{m}(t;\bm{p})=\sum_{j=0}^{m}p_{mj}\beta_{mj}(t)$. For a given $\bm{p}_{0}$, we define the norm

$$\|\bm{p}\|_{B}^{2}=\int\frac{\{f_{m}(t;\bm{p})\}^{2}}{f_{m}(t;\bm{p}_{0})}\,dt.$$

The squared distance between $\bm{p}$ and $\bm{p}_{0}$ with respect to the norm $\|\cdot\|_{B}$ is

$$\|\bm{p}-\bm{p}_{0}\|_{B}^{2}=\int\frac{\{f_{m}(t;\bm{p})-f_{m}(t;\bm{p}_{0})\}^{2}}{f_{m}(t;\bm{p}_{0})}\,dt.$$

With the aid of the acceptance-rejection argument used for generating pseudorandom numbers in the Monte Carlo method, we have the following lemma, which may be of independent interest.

Lemma 3.

Let $f\in C^{(2k)}[0,1]$ for some positive integer $k$, $f(t)\geqslant\delta>0$, and let $f_{m}(t)=f_{m}(t;\bm{p}_{0})$ be the unique best approximation of degree $m$ for $f$. Then a sample $x_{1},\ldots,x_{n}$ from $f$ can be arranged so that the first $\nu_{m}$ observations can be treated as if they were from $f_{m}$. Moreover, for all $\bm{p}$ such that $f_{m}(x_{j};\bm{p})\geqslant\delta'>0$, $j=1,\ldots,n$,

$$\ell_{R}(\bm{p})=\sum_{i=1}^{n}\log f_{m}(x_{i};\bm{p})=\tilde{\ell}_{R}(\bm{p})+R_{mn}, \qquad (8)$$

where $\tilde{\ell}_{R}(\bm{p})=\sum_{i=1}^{\nu_{m}}\log f_{m}(x_{i};\bm{p})$, and

$$\nu_{m}=n-\mathcal{O}(nm^{-k})-\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s., \qquad (9)$$

$$|R_{mn}|=\mathcal{O}(nm^{-k})+\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s. \qquad (10)$$
Remark 3.1.

So $\tilde{\ell}_{R}(\bm{p})$ is an “exact” likelihood of $x_{1},\ldots,x_{\nu_{m}}$, while $\ell_{R}(\bm{p})=\sum_{i=1}^{n}\log f_{m}(x_{i};\bm{p})$ is an approximate likelihood of the complete data $x_{1},\ldots,x_{n}$, which can be viewed as a slightly contaminated sample from $f_{m}$. The maximizer $\hat{\bm{p}}$ of $\ell_{R}(\bm{p})$ approximately maximizes $\tilde{\ell}_{R}(\bm{p})$. Hence $f_{m}(t;\hat{\bm{p}})$ targets $f_{m}(t;\bm{p}_{0})$, the best approximation of $f$.
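
The relabeling in Lemma 3 is constructive and easy to simulate. The following sketch (our illustration, not part of the proof) marks which observations of a sample from $f$ can be treated as a sample from $f_{m}$; the constant $c_{m}=\sup_{t}f_{m}(t)/f(t)$ is approximated on a grid:

```python
import numpy as np


def thin_to_fm(x, f, fm, seed=0):
    """Acceptance-rejection relabeling: given a sample x from density f,
    return a boolean mask of the observations that can be treated as if
    they were drawn from the approximating density fm."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(1e-6, 1 - 1e-6, 1001)
    cm = np.max(fm(grid) / f(grid))        # c_m = sup fm / f, approximated
    keep = rng.uniform(size=len(x)) <= fm(x) / (cm * f(x))
    return keep                            # nu_m = keep.sum() observations
```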

For density estimation based on the raw data we have the following result.

Theorem 4.

Suppose that the PDF $f\in C^{(2k)}[0,1]$ for some positive integer $k$, $f(t)\geqslant\delta>0$, and $m=\mathcal{O}(n^{1/k})$. As $n\to\infty$, with probability one the maximum value of $\ell_{R}(\bm{p})$ is attained by some $\hat{\bm{p}}_{R}$ in the interior of $\mathbb{B}_{m}(r_{n})=\{\bm{p}\in\mathbb{S}_{m}:\|\bm{p}-\bm{p}_{0}\|_{B}^{2}\leqslant r_{n}^{2}\}$, where $r_{n}=\log n/\sqrt{n}$ and $\bm{p}_{0}$ makes $f_{m}(\cdot;\bm{p}_{0})$ the unique best approximation of degree $m$.

Theorem 5.

Suppose that the PDF $f\in C^{(2k)}[0,1]$ for some positive integer $k$, $f(t)\geqslant\delta>0$, and $0<c_{0}n^{1/k}\leqslant m\leqslant c_{1}n^{1/k}<\infty$. Then there is a positive constant $C$ such that

$$\mathrm{E}\int\frac{\{f_{m}(t;\hat{\bm{p}}_{R})-f(t)\}^{2}}{f(t)}\,dt\leqslant C\frac{(\log n)^{2}}{n}. \qquad (11)$$

Because $f$ is bounded, there is a positive constant $C$ such that

$$\mathrm{MISE}(\hat{f}_{B})=\mathrm{E}\int\{f_{m}(t;\hat{\bm{p}}_{R})-f(t)\}^{2}\,dt\leqslant C\frac{(\log n)^{2}}{n}. \qquad (12)$$

Note that (11) is a stronger result than (12), which gives an almost parametric rate of convergence for the MISE. Guan (2014) showed a similar result under a different set of conditions. The best parametric rate is $\mathcal{O}(n^{-1})$, which can be attained by parametric density estimates under some regularity conditions.

For $\theta_{mi}(\bm{p})=\sum_{j=0}^{m}p_{mj}\{\mathcal{B}_{mj}(t_{i})-\mathcal{B}_{mj}(t_{i-1})\}$, we define the norm

$$\|\bm{p}\|_{G}^{2}=\sum_{i=1}^{N}\frac{\theta_{mi}^{2}(\bm{p})}{\theta_{mi}(\bm{p}_{0})}.$$

The squared distance between $\bm{p}$ and $\bm{p}_{0}$ with respect to the norm $\|\cdot\|_{G}$ is

$$\|\bm{p}-\bm{p}_{0}\|_{G}^{2}=\sum_{i=1}^{N}\frac{\{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})\}^{2}}{\theta_{mi}(\bm{p}_{0})}.$$

By the mean value theorem, we have

$$\mathcal{B}_{mj}(t_{i})-\mathcal{B}_{mj}(t_{i-1})=\int_{t_{i-1}}^{t_{i}}\beta_{mj}(t)\,dt=\beta_{mj}(t_{mij}^{*})\Delta t_{i},\quad i=1,\ldots,N;\ j=0,\ldots,m,$$

where $\Delta t_{i}=t_{i}-t_{i-1}$ and $t_{mij}^{*}\in[t_{i-1},t_{i}]$. Thus $\|\bm{p}-\bm{p}_{0}\|_{G}^{2}$ is a Riemann sum:

$$\|\bm{p}-\bm{p}_{0}\|_{G}^{2}=\sum_{i=1}^{N}\psi_{m}(t_{mij}^{*})\Delta t_{i}\approx\int_{0}^{1}\psi_{m}(t)\,dt=\|\bm{p}-\bm{p}_{0}\|_{B}^{2}, \qquad (13)$$

where

$$\psi_{m}(t)=\frac{\{\sum_{j=0}^{m}(p_{mj}-p_{mj}^{(0)})\beta_{mj}(t)\}^{2}}{\sum_{j=0}^{m}p_{mj}^{(0)}\beta_{mj}(t)}.$$

For grouped data we have the following.

Theorem 6.

Suppose that the PDF $f\in C^{(2k)}[0,1]$ for some positive integer $k$, $f(t)\geqslant\delta>0$, and $m=\mathcal{O}(n^{1/k})$. As $n\to\infty$, with probability one the maximum value of $\ell_{G}(\bm{p})$ is attained at some $\tilde{\bm{p}}_{G}$ in the interior of $\mathbb{B}_{m}(r_{n})=\{\bm{p}\in\mathbb{S}_{m}:\|\bm{p}-\bm{p}_{0}\|_{G}^{2}\leqslant r_{n}^{2}\}$, where $r_{n}=\log n/\sqrt{n}$ and $\bm{p}_{0}$ makes $f_{m}(\cdot;\bm{p}_{0})$ the unique best approximation.

For the relationship between the squared distances $\|\bm{p}-\bm{p}_{0}\|_{B}^{2}$ and $\|\bm{p}-\bm{p}_{0}\|_{G}^{2}$, we have the following result.

Theorem 7.

Suppose that the PDF $f\in C^{(2k)}[0,1]$ for some positive integer $k$, and $f(t)\geqslant\delta>0$. Let $\bm{p}_{0}\in\mathbb{S}_{m}$ be the one that makes $f_{m}(\cdot;\bm{p}_{0})$ the unique best approximation of $f$. Then for all $\bm{p}\in\mathbb{S}_{m}$, we have

$$\|\bm{p}-\bm{p}_{0}\|_{B}^{2}=\|\bm{p}-\bm{p}_{0}\|_{G}^{2}+\mathcal{O}(m^{4}\max_{i}\Delta t_{i}^{2}).$$

For a grouped-data-based estimate $\tilde{\bm{p}}_{G}$, the rate of convergence of $\|\tilde{\bm{p}}_{G}-\bm{p}_{0}\|_{G}^{2}$ to zero is $\mathcal{O}((\log n)^{2}/n)$. However, the rate of convergence of $\|\tilde{\bm{p}}_{G}-\bm{p}_{0}\|_{B}^{2}$ to zero also depends on that of $\max_{i}\Delta t_{i}$. For equal-width classes, $\Delta t_{i}=1/N$, and $N=n^{1/2+2/k}$, we have $\|\tilde{\bm{p}}_{G}-\bm{p}_{0}\|_{B}^{2}=\mathcal{O}((\log n)^{2}/n)+\mathcal{O}(m^{4}/n^{1+4/k})$. Thus $\|\tilde{\bm{p}}_{G}-\bm{p}_{0}\|_{B}^{2}=\mathcal{O}((\log n)^{2}/n)$ if $m=\mathcal{O}(n^{1/k})$. If $k$ is large, then $N\approx\sqrt{n}$.

Theorem 8.

Suppose that the PDF $f\in C^{(2k)}[0,1]$ for some positive integer $k$, $f(t)\geqslant\delta>0$, and $0<c_{0}n^{1/k}\leqslant m\leqslant c_{1}n^{1/k}<\infty$. Then we have

$$\mathrm{E}\int\frac{\{f_{m}(t;\tilde{\bm{p}}_{G})-f(t)\}^{2}}{f(t)}\,dt=\mathcal{O}((\log n)^{2}/n)+\mathcal{O}(m^{4}\max_{i}\Delta t_{i}^{2}). \qquad (14)$$

Also, because $f$ is bounded,

$$\mathrm{MISE}(\tilde{f}_{B})=\mathrm{E}\int\{f_{m}(t;\tilde{\bm{p}}_{G})-f(t)\}^{2}\,dt=\mathcal{O}((\log n)^{2}/n)+\mathcal{O}(m^{4}\max_{i}\Delta t_{i}^{2}). \qquad (15)$$

4 Model Degree Selection

Guan (2014) showed that the model degree $m$ is bounded below approximately by $m_{b}=\max\{1,\lceil\mu(1-\mu)/\sigma^{2}-3\rceil\}$. Based on the grouped data, the lower bound $m_{b}$ can be estimated by $\tilde{m}_{b}=\max\{1,\lceil\tilde{\mu}(1-\tilde{\mu})/\tilde{\sigma}^{2}-3\rceil\}$, where

$$\tilde{\mu}=\frac{1}{n}\sum_{i=1}^{N}n_{i}t_{i}^{*},\quad\tilde{\sigma}^{2}=\frac{1}{n-1}\sum_{i=1}^{N}n_{i}(t_{i}^{*}-\tilde{\mu})^{2}=\frac{1}{n-1}\Big(\sum_{i=1}^{N}n_{i}t_{i}^{*2}-n\tilde{\mu}^{2}\Big),\quad t_{i}^{*}=\frac{t_{i-1}+t_{i}}{2},\quad i=1,\ldots,N.$$
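
A minimal Python sketch of this estimate (our own helper, assuming numpy; `counts` and `t` are the class frequencies and boundaries) is:

```python
import numpy as np


def degree_lower_bound(counts, t):
    """Estimate the lower bound m_b for the model degree from grouped
    data, using the class midpoints t_i* = (t_{i-1} + t_i) / 2."""
    counts = np.asarray(counts, dtype=float)
    t = np.asarray(t, dtype=float)
    n = counts.sum()
    mid = (t[:-1] + t[1:]) / 2                         # midpoints t_i*
    mu = (counts * mid).sum() / n                      # mu-tilde
    var = (counts * (mid - mu) ** 2).sum() / (n - 1)   # sigma-tilde squared
    return max(1, int(np.ceil(mu * (1 - mu) / var - 3)))
```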

Due to overfitting, the model degree $m$ cannot be arbitrarily large. With the estimated $\tilde{m}_{b}$, we choose a proper set of nonnegative consecutive integers $M=\{m_{0},m_{0}+1,\ldots,m_{0}+k\}$ with $m_{0}<\tilde{m}_{b}$. Then we can estimate an optimal degree $m$ using the change-point method proposed by Guan (2014). For each $m_{i}=m_{0}+i$ we use the EM algorithm to find the MBLE $\tilde{\bm{p}}_{m_{i}}$ and calculate $\ell_{i}=\ell(\tilde{\bm{p}}_{m_{i}})$. Let $y_{i}=\ell_{i}-\ell_{i-1}$, $i=1,\ldots,k$. The $y_{i}$'s are nonnegative because the Bernstein polynomial models are nested. Guan (2014) suggested treating $y_{1},\ldots,y_{\tau}$ as exponentials with mean $\mu_{1}$ and $y_{\tau+1},\ldots,y_{k}$ as exponentials with mean $\mu_{0}$, where $\mu_{1}>\mu_{0}$, so that $\tau$ is a change point and $m_{\tau}$ is the optimal degree, and using the change-point detection method for the exponential model (see Section 1.4 of Csörgő & Horváth, 1997) to find a change-point estimate $\hat{\tau}$. Then we estimate the optimal $m$ by $\hat{m}=m_{\hat{\tau}}$. Specifically, $\hat{\tau}=\arg\max_{1\leqslant\tau\leqslant k}R(\tau)$, where the likelihood ratio at $\tau$ is

$$R(\tau)=k\log\Big(\frac{\ell_{k}-\ell_{0}}{k}\Big)-\tau\log\Big(\frac{\ell_{\tau}-\ell_{0}}{\tau}\Big)-(k-\tau)\log\Big(\frac{\ell_{k}-\ell_{\tau}}{k-\tau}\Big),\quad\tau=1,\ldots,k.$$

If $R(\tau)$ has multiple maximizers, we choose the smallest one as $\hat{\tau}$.
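
The degree selection can be coded directly from the profile loglikelihoods $\ell_{0},\ldots,\ell_{k}$. The sketch below (our implementation of the $R(\tau)$ rule above; it assumes strictly increasing loglikelihoods, as the nesting gives in practice) returns $\hat{m}=m_{\hat{\tau}}$:

```python
import numpy as np


def select_degree(ell, degrees):
    """Change-point estimate of the optimal degree: ell[i] is the
    loglikelihood at degree degrees[i] = m_0 + i, i = 0, ..., k."""
    ell = np.asarray(ell, dtype=float)
    k = len(ell) - 1
    R = np.empty(k)
    for tau in range(1, k + 1):
        last = (0.0 if tau == k
                else (k - tau) * np.log((ell[k] - ell[tau]) / (k - tau)))
        R[tau - 1] = (k * np.log((ell[k] - ell[0]) / k)
                      - tau * np.log((ell[tau] - ell[0]) / tau)
                      - last)
    tau_hat = int(np.argmax(R)) + 1    # argmax returns the smallest maximizer
    return degrees[tau_hat]
```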

5 Simulation Study and Example

5.1 Simulation

The distributions used for generating pseudorandom numbers and the parametric models used for density estimation are as follows.

  • (i) Uniform(0,1): the uniform distribution with $\mu=1/2$ and $\sigma^{2}=1/12$, a special beta distribution beta(1,1). The parametric model is the beta distribution beta($\alpha$, 1).

  • (ii) Exp(1): the exponential distribution with mean $\mu=1$ and variance $\sigma^{2}=1$, truncated to the interval $[a,b]=[0,4]$. The parametric model is the exponential distribution with mean $\mu=\theta$.

  • (iii) Pareto(4, 0.5): the Pareto distribution with shape parameter $\alpha=4$ and scale parameter $x_{0}=0.5$, which is treated as a known parameter. The mean and variance are, respectively, $\mu=\alpha x_{0}/(\alpha-1)=2/3$ and $\sigma^{2}=x_{0}^{2}\alpha/[(\alpha-1)^{2}(\alpha-2)]=1/18$. We truncate this distribution to the interval $[a,b]=[x_{0},\mu+4\sigma]\approx[0.5,1.6095]$. The parametric model is Pareto($\alpha$, 0.5).

  • (iv) NN($k$): the nearly normal distribution of $\bar{u}_{k}=(u_{1}+\cdots+u_{k})/k$ with $u_{1},\ldots,u_{k}$ independent uniform(0,1) random variables. The lower bound is $m_{b}=3(k-1)$. We used the normal distribution N($\mu$, $\sigma^{2}$) as the parametric model.

  • (v) N(0,1): the standard normal distribution truncated to the interval $[a,b]=[-4,4]$. The parametric model is N($\mu$, $\sigma^{2}$).

  • (vi) Logistic(0, 0.5): the logistic distribution with location $\mu=0$ and scale $s=0.5$, so that $\sigma^{2}=(s\pi)^{2}/3=\pi^{2}/12$. We truncate this distribution to the interval $[a,b]=[\mu-4\sigma,\mu+4\sigma]\approx[-2.9619,2.9619]$. The parametric model is Logistic($\mu$, $s$).

Except for the normal distribution, the above parametric models were chosen for the simulation because their CDF's have closed-form expressions, so the expensive numerical integrations can be avoided for the MHDE and the MLE.

From each distribution we generated 500 samples of size $n=50,100,200$, and $500$, and the corresponding grouped data using $N=5,10,10$, and $20$ equal-width class intervals, respectively. The model degree $m$ was selected using the change-point method from $\{1,2,\ldots,40\}$.
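
As an example of how one replicate of such grouped data can be produced (a sketch under our own choices of generator and seed; the truncation is done by rejection), consider case (ii):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n, N = 0.0, 4.0, 100, 10                # Exp(1) truncated to [0, 4]
x = rng.exponential(1.0, size=10 * n)
x = x[(x >= a) & (x <= b)][:n]                # rejection-truncate, keep n
y = (x - a) / (b - a)                         # transform to [0, 1]
counts, t = np.histogram(y, bins=N, range=(0.0, 1.0))  # N equal-width classes
```

The resulting `counts` and boundaries `t` can be fed directly to the EM sketch of Section 2.3.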

From the results of Guan (2014) we see that the Bernstein polynomial method is much better than the kernel density for ungrouped data, and the AMHDE is an approximation of the MHDE. So we only compare the kernel density, the MLE, the MHDE, and the proposed MBLE. For the kernel density estimate $\hat{f}_{K}(x)=\frac{1}{nh}\sum_{i=1}^{n}K\big(\frac{x-x_{i}}{h}\big)$, we used the normal kernel $K(x)=e^{-x^{2}/2}/\sqrt{2\pi}$ and the commonly recommended method of Sheather & Jones (1991) to choose the bandwidth $h$. Note that $\mathrm{E}[\hat{f}_{K}(x)]=\frac{1}{h}\int_{-\infty}^{\infty}K\big(\frac{x-y}{h}\big)f(y)\,dy=(K_{h}*f)(x)$, the convolution of $f$ and the scaled kernel $K_{h}(\cdot)=K(\cdot/h)/h$. So no matter how the bandwidth $h$ is chosen, there is always a trade-off between the bias and the variance.

Table 1: Estimated mean and variance of $\hat{m}$, and mean integrated squared errors (MISE's) of the kernel density $\hat{f}_{K}$, the MLE $\hat{f}_{\mathrm{ML}}$, the MHDE $\hat{f}_{\mathrm{MHD}}$, and the proposed maximum Bernstein likelihood estimate (MBLE) $\hat{f}_{B}$, based on 500 simulated samples of size $n$ grouped by $N$ equal-width class intervals.

Distribution   E($\hat{m}$)  var($\hat{m}$)  MISE($\hat{f}_{B}$)  MISE($\hat{f}_{K}$)  MISE($\hat{f}_{\mathrm{ML}}$)  MISE($\hat{f}_{\mathrm{MHD}}$)

$n=50$, $N=5$
Beta(1,1)      14.91   3.95  0.3898   2.5722  0.0193  0.0222
Exp(1)         14.56   9.14  0.0447   0.7502  0.0018  0.0034
Pareto         12.29  29.01  0.7855  19.6962  0.0392  0.0793
NN(4)          12.04  15.42  0.0556   8.5549  0.0653  0.1030
N(0,1)         14.25  11.10  0.0007   0.1603  0.0008  0.0012
Logistic       13.79  19.27  0.0022   0.2689  0.0014  0.0024

$n=100$, $N=10$
Beta(1,1)      12.96  47.08  0.0972   0.0558  0.0096  0.0118
Exp(1)          9.42  37.24  0.0091   0.2377  0.0011  0.0027
Pareto          8.51   9.78  0.1009  18.6903  0.0222  0.0613
NN(4)          10.24   6.72  0.0217   1.4357  0.0232  0.0411
N(0,1)         13.77   6.58  0.0004   0.0357  0.0003  0.0005
Logistic       12.02  13.42  0.0012   0.0992  0.0007  0.0012

$n=200$, $N=10$
Beta(1,1)      15.88  48.93  0.0741   1.2338  0.0045  0.0051
Exp(1)          9.24  41.36  0.0068   0.4547  0.0007  0.0017
Pareto          8.63   9.85  0.0661  34.3823  0.0123  0.0323
NN(4)          10.11   3.92  0.0128   4.0956  0.0125  0.0213
N(0,1)         14.08   5.74  0.0003   0.0907  0.0002  0.0003
Logistic       13.15  14.07  0.0007   0.1936  0.0004  0.0006

$n=500$, $N=20$
Beta(1,1)      10.18  49.84  0.0192   0.0226  0.0021  0.0024
Exp(1)          4.40   3.55  0.0006   0.2331  0.0005  0.0015
Pareto          8.67   1.94  0.0181  16.0924  0.0083  0.0253
NN(4)           9.97   2.75  0.0059   0.5994  0.0058  0.0110
N(0,1)         14.41   2.94  0.0001   0.0329  0.0001  0.0001
Logistic       13.26   5.15  0.0003   0.0905  0.0001  0.0003

Table 1 presents the simulation results for the density estimates. As expected, the proposed Bernstein polynomial method performs much better than the kernel density method and is comparable to the two parametric methods. Table 1 also shows the estimated mean and variance of the optimal model degree selected by the change-point method. The performance of the estimated optimal model degree $\hat{m}$ appears satisfactory.

It should be noted that the density $\psi_{k}(t)$ of NN($k$) satisfies $\psi_{k}\in C^{(k-2)}[0,1]$ but $\psi_{k}\notin C^{(k-1)}[0,1]$ for $k\geqslant 2$. In fact, for $k\geqslant 2$, $\psi_{k}(t)$ is a piecewise polynomial of degree $k-1$ defined on the pieces $[i/k,(i+1)/k)$, $i=0,1,\ldots,k-1$. Except for NN($k$), all the other population densities have continuous derivatives of all orders on their supports. In the simulation, we used the normal distribution as the parametric model for NN(4). Here both the normal and the Bernstein polynomial are approximate models; in fact, in most applications the normal distribution is an approximate model justified by the central limit theorem. We also did a simulation on the goodness of fit of the normal distribution to samples from NN(4): we generated 5,000 samples of size $n$ from NN(4) and ran the Kolmogorov-Smirnov test on each sample. For $n=50,100,200$, and $500$, the averages of the $p$-values are, respectively, 0.7884, 0.7875, and 0.7470; the numbers of $p$-values among the 5,000 that are smaller than 0.05 are, respectively, 3, 2, 0, and 2. So the normal distribution would be accepted as the parametric model for NN(4) almost all the time. The performance of the proposed MBLE for samples from NN(4) is even better than that of the MLE when the sample size $n$ is small.

5.2 The Chicken Embryo Data

The chicken embryo data contain the numbers of eggs hatched on each day during the 21 days of the incubation period. The hatching times ($n=43$) are treated as grouped by intervals of equal width of one day. The data were first studied by Jassim et al. (1996). Kuurman et al. (2003) and Lin & He (2006) also analyzed the data using the MHDE, in addition to other methods assuming some parametric mixture models, including the Weibull model; the latter authors used the AMHDE to fit a Weibull mixture model to the data. As shown below, the density estimated by the proposed method is close to the parametric MLE.

Applying the proposed method of this paper, we truncated the distribution using $[a,b]=[0,21]$ and selected the optimal model degree $\hat{m}=13$ from $\{2,3,\ldots,50\}$ using the change-point method.

Figure 1: Upper left panel: the loglikelihood $\ell(m)$; upper right panel: the likelihood ratio $R(\tau)$ for a change point in the increments of the loglikelihood $\ell(m)$; lower panel: the density estimates for the chicken embryo data.

Figure 1 displays the loglikelihood $\ell(m)$, the likelihood ratio $R(\tau)$ for change points, and the histogram of the grouped data together with the kernel density $\hat{f}_{K}$, the MLE $\hat{f}_{\mathrm{ML}}$, the MHDE $\hat{f}_{\mathrm{MHD}}$, the AMHDE $\hat{f}_{\mathrm{AMHD}}$, and the proposed maximum Bernstein likelihood estimate (MBLE) $\hat{f}_{B}$. From this figure we see that the proposed MBLE $\hat{f}_{B}$ and the parametric MLE $\hat{f}_{\mathrm{ML}}$ are similar and fit the data reasonably well. The kernel density is clearly not a good estimate. The AMHDE $\hat{f}_{\mathrm{AMHD}}$ seems to overestimate $f$ near 0.

6 Concluding Remarks

The proposed density estimate $f_{m}(t;\hat{\bm{p}})$ has considerable advantages over the kernel density: (i) it is more efficient than the kernel density because it is an approximate maximum likelihood estimate; (ii) it is easier to select an optimal model degree $m$ than to select an optimal bandwidth $h$ for the kernel density; (iii) the proposed density estimate $f_{m}(t;\hat{\bm{p}})$ targets $f_{m}(t;\bm{p}_{0})$, the best approximation of $f$ for each $m$, while the kernel density $\hat{f}_{K}$ targets $f*K_{h}$, the convolution of $f$ and $K_{h}$.

Another contribution of this paper is the introduction of the acceptance-rejection argument for proving asymptotic results under an approximate model assumption, which, to the knowledge of the author, is new.

Appendix A Proofs

A.1 Proof of Theorem 1

Proof.

We define $\Lambda_{r}=\Lambda_{r}(\delta,M_{0},M_{2},\ldots,M_{r})$ as the class of functions $\phi(t)$ on $[0,1]$ whose first $r$ derivatives $\phi^{(i)}$, $i=1,\ldots,r$, exist and are continuous, with the properties

$$\delta\leqslant\phi(t)\leqslant M_{0},\quad|\phi^{(i)}(t)|\leqslant M_{i},\quad 2\leqslant i\leqslant r,\quad 0\leqslant t\leqslant 1, \qquad (16)$$

for some $\delta>0$ and $M_{i}>0$, $i=0,2,\ldots,r$. A polynomial of degree $m$ with “positive coefficients” is defined by Lorentz (1963) as $\phi_{m}(t)=\sum_{i=0}^{m}c_{i}\binom{m}{i}t^{i}(1-t)^{m-i}$, where $c_{i}\geqslant 0$, $i=0,\ldots,m$. Theorem 1 of Lorentz (1963) shows that, for a given integer $r\geqslant 0$, $\delta>0$, and positive constants $M_{i}$, $i=0,2,3,\ldots,r$, there exists a constant $C_{r}=C_{r}(\delta,M_{0},M_{2},\ldots,M_{r})$ such that for each function $\phi\in\Lambda_{r}(\delta,M_{0},M_{2},\ldots,M_{r})$ one can find a sequence $\phi_{m}$, $m=1,2,\ldots$, of polynomials with positive coefficients of degree $m$ such that

$$|\phi(t)-\phi_{m}(t)|\leqslant C_{r}\Delta_{m}^{r}(t)\,\omega(\Delta_{m}(t),\phi^{(r)}),\quad 0\leqslant t\leqslant 1, \qquad (17)$$

where $\omega(\delta,\phi)=\sup_{|x-y|\leqslant\delta}|\phi(x)-\phi(y)|$.

Under the conditions of Theorem 1, $M_{0}=\max_{t}f(t)$ and $M_{i}=\max_{t}|f^{(i)}(t)|$, $i=2,\ldots,r$, are finite, and $\omega(\Delta_{m}(t),f^{(r)})\leqslant 2M_{r}$. So by the above result of Lorentz (1963) there is a sequence $\phi_{m}(t)=\sum_{i=0}^{m}c_{i}\binom{m}{i}t^{i}(1-t)^{m-i}$, $m=0,1,\ldots$, of polynomials with positive coefficients of degree $m$ such that

$$|f(t)-\phi_{m}(t)|\leqslant 2C_{r}M_{r}\Delta_{m}^{r}(t),\quad 0\leqslant t\leqslant 1. \qquad (18)$$

It is clear that

$$\Delta_{m}(t)\leqslant m^{-1/2}. \qquad (19)$$

Since $\int_{0}^{1}f(t)\,dt=1$, we have, by (18) and (19),

$$\Big|1-\sum_{i=0}^{m}c_{i}/(m+1)\Big|\leqslant 2C_{r}M_{r}m^{-r/2}. \qquad (20)$$

Let $p_{mi}=c_{i}/\sum_{j=0}^{m}c_{j}$, $i=0,\ldots,m$, and $f_{m}(t)=\sum_{i=0}^{m}p_{mi}\beta_{mi}(t)$. It then follows easily from (18) and (20) that (1) is true. ∎

A.2 Proof of Theorem 2

We will prove the assertion (i) only. The assertion (ii) can be proved similarly.

Proof.

The matrix of second derivatives of $\ell_{R}(\bm{p})$ is

$$H(\bm{p})=\frac{\partial^{2}\ell_{R}(\bm{p})}{\partial\bm{p}\partial\bm{p}^{\mathrm{T}}}=-\sum_{i=1}^{n}\frac{\bm{\beta}_{m}(x_{i})\bm{\beta}_{m}^{\mathrm{T}}(x_{i})}{\{\sum_{j=0}^{m}p_{mj}\beta_{mj}(x_{i})\}^{2}},$$

where $\bm{\beta}_{m}(x)=(\beta_{m0}(x),\ldots,\beta_{mm}(x))^{\mathrm{T}}$.

For any $\bm{u}=(u_{0},\ldots,u_{m})^{\mathrm{T}}\in\mathbb{R}^{m+1}$, as $n\to\infty$,

$$\frac{1}{n}\bm{u}^{\mathrm{T}}\frac{\partial^{2}\ell_{R}(\bm{p})}{\partial\bm{p}\partial\bm{p}^{\mathrm{T}}}\bm{u}\to-\mathrm{E}\left\{\frac{\sum_{j=0}^{m}u_{j}\beta_{mj}(X)}{\sum_{j=0}^{m}p_{mj}\beta_{mj}(X)}\right\}^{2}.$$

Clearly, $\beta_{m0}(t),\ldots,\beta_{mm}(t)$ are linearly independent, nonvanishing functions on $[0,1]$. So, with probability one, $H(\bm{p})$ is negative definite for all $\bm{p}$ and all sufficiently large $n$. By Theorem 4.2 of Redner & Walker (1984), as $s\to\infty$, $\hat{\bm{p}}^{(s)}$ converges to the unique maximizer of $\ell_{R}(\bm{p})$. ∎

A.3 Proof of Lemma 3

Proof.

By (1) and (19) we know that, under the conditions of the lemma, $f_{m}(t)=f_{m}(t;\bm{p}_{0})$ converges to $f(t)$ at a rate of at least $\mathcal{O}(m^{-k})$, i.e.,

$$f(t)=f_{m}(t)+\mathcal{O}(m^{-k}), \qquad (21)$$

and, furthermore, since $f(t)\geqslant\delta$,

$$c_{m}=\sup_{t}\frac{f_{m}(t)}{f(t)}=1+\mathcal{O}(m^{-k}), \qquad (22)$$

uniformly in $m$.

Let $u_{1},\ldots,u_{n}$ be a sample from the uniform(0,1) distribution. By the acceptance-rejection method in simulation (Ross, 2013), for each $i$, if $u_{i}\leqslant f_{m}(x_{i})/\{c_{m}f(x_{i})\}$, then $x_{i}$ can be treated as if it were from $f_{m}$. Assume that the data $x_{1},\ldots,x_{n}$ have been rearranged so that the first $\nu_{m}$ observations can be treated as if they were from $f_{m}$. By the law of the iterated logarithm we have

$$\nu_{m}=\sum_{i=1}^{n}I\left(u_{i}\leqslant\frac{f_{m}(x_{i})}{c_{m}f(x_{i})}\right)=n-\mathcal{O}(nm^{-k})-\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s.$$

So we have

$$\ell_{R}(\bm{p})=\sum_{i=1}^{n}\log f_{m}(x_{i};\bm{p})=\tilde{\ell}_{R}(\bm{p})+R_{mn},$$

where $\tilde{\ell}_{R}(\bm{p})=\sum_{i=1}^{\nu_{m}}\log f_{m}(x_{i};\bm{p})$ is an “almost complete” likelihood and

$$R_{mn}=\sum_{u_{i}>f_{m}(x_{i})/\{c_{m}f(x_{i})\}}\log f_{m}(x_{i};\bm{p})=\sum_{i=\nu_{m}+1}^{n}\log f_{m}(x_{i};\bm{p}).$$

Because $0<\delta\leqslant f(t)\leqslant M_{0}$, we have $0<\delta'\leqslant f_{m}(x_{i};\bm{p})\leqslant M_{0}'$ for some constants $\delta'$ and $M_{0}'$. By the law of the iterated logarithm,

$$|R_{mn}|\leqslant\max\{|\log\delta'|,|\log M_{0}'|\}\sum_{i=1}^{n}I\left(u_{i}>\frac{f_{m}(x_{i})}{c_{m}f(x_{i})}\right)=\max\{|\log\delta'|,|\log M_{0}'|\}(n-\nu_{m})=\mathcal{O}(nm^{-k})+\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s. \qquad (23)$$

The proportion of the observations that can be treated as if they were from $f_{m}$ is

$$\frac{\nu_{m}}{n}=1-\mathcal{O}(m^{-k})-\mathcal{O}\big(\sqrt{m^{-k}\log\log n/n}\big),\quad a.s.$$

So the complete data $x_{1},\ldots,x_{n}$ can be viewed as a slightly contaminated sample from $f_{m}$. ∎

A.4 Proof of Theorem 4

Proof.

The Taylor expansion of $\log f_{m}(x_{j};\bm{p})$ around $\log f_{m}(x_{j};\bm{p}_{0})$ yields that, for $\bm{p}\in\mathbb{B}_{m}(r_{n})$,

$$\tilde{\ell}_{R}(\bm{p})=\sum_{j=1}^{\nu_{m}}\log f_{m}(x_{j};\bm{p})=\tilde{\ell}_{R}(\bm{p}_{0})+\sum_{j=1}^{\nu_{m}}\left[\frac{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})}{f_{m}(x_{j};\bm{p}_{0})}-\frac{1}{2}\frac{\{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})\}^{2}}{\{f_{m}(x_{j};\bm{p}_{0})\}^{2}}\right]+\tilde{R}_{mn},$$

where $\tilde{R}_{mn}=o(nr_{n}^{2})$, a.s.

Let $\bm{p}$ be a point on the boundary of $\mathbb{B}_{m}(r_{n})$, i.e., $\|\bm{p}-\bm{p}_{0}\|_{B}^{2}=r_{n}^{2}$. By the law of the iterated logarithm we have

$$\sum_{j=1}^{\nu_{m}}\frac{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})}{f_{m}(x_{j};\bm{p}_{0})}=\mathcal{O}(r_{n}\sqrt{n\log\log n}),\quad a.s.,$$

and that there exists $\eta>0$ such that

$$\sum_{j=1}^{\nu_{m}}\frac{\{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})\}^{2}}{\{f_{m}(x_{j};\bm{p}_{0})\}^{2}}=\eta nr_{n}^{2}+\mathcal{O}(r_{n}^{2}\sqrt{n\log\log n}).$$

Therefore we have

$$\tilde{\ell}_{R}(\bm{p})=\tilde{\ell}_{R}(\bm{p}_{0})+\sum_{j=1}^{\nu_{m}}\left[\frac{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})}{f_{m}(x_{j};\bm{p}_{0})}-\frac{1}{2}\frac{\{f_{m}(x_{j};\bm{p})-f_{m}(x_{j};\bm{p}_{0})\}^{2}}{\{f_{m}(x_{j};\bm{p}_{0})\}^{2}}\right]+o(nr_{n}^{2})=\tilde{\ell}_{R}(\bm{p}_{0})-\frac{1}{2}\eta nr_{n}^{2}+\mathcal{O}(r_{n}^{2}\sqrt{n\log\log n})+\mathcal{O}(r_{n}\sqrt{n\log\log n})+o(nr_{n}^{2}),\quad a.s.$$

Since $m=\mathcal{O}(n^{1/k})$, $nm^{-k}=o(nr_{n}^{2})$. So there exists $\eta'>0$ such that $\ell_{R}(\bm{p})\leqslant\ell_{R}(\bm{p}_{0})-\eta'nr_{n}^{2}=\ell_{R}(\bm{p}_{0})-\eta'(\log n)^{2}$. Since $\partial^{2}\ell_{R}(\bm{p})/\partial\bm{p}\partial\bm{p}^{\mathrm{T}}<0$, the maximum value of $\ell_{R}(\bm{p})$ is attained by some $\hat{\bm{p}}_{R}$ in the interior of $\mathbb{B}_{m}(r_{n})$. ∎

A.5 Proof of Theorem 5

Proof.

It is easy to see that (11) and (12) follow from Theorem 4, (22), the boundedness of $f$, and the triangle inequality. ∎

A.6 Proof of Theorem 6

Proof.

By (21) we have

$$\theta_{i}=\theta_{mi}(\bm{p}_{0})+\mathcal{O}(m^{-k}\Delta t_{i}), \qquad (24)$$

where

$$\theta_{i}=\int_{t_{i-1}}^{t_{i}}f(x)\,dx,\quad\theta_{mi}(\bm{p}_{0})=\int_{t_{i-1}}^{t_{i}}f_{m}(x;\bm{p}_{0})\,dx,\quad i=1,\ldots,N.$$

Because $0<\delta\leqslant f(t)\leqslant M_{0}$, we have

$$c=\max_{1\leqslant i\leqslant N}\frac{\theta_{mi}(\bm{p}_{0})}{\theta_{i}}=1+\mathcal{O}(m^{-k}), \qquad (25)$$

uniformly in $m$. Assume that $y_{1},\ldots,y_{n}$ is a random sample from the discrete distribution with probability mass function $\theta_{i}=P(Y=i)$, $i=1,\ldots,N$.

Let $u_{1},\ldots,u_{n}$ be a sample from the uniform(0,1) distribution. For $y_{i}\sim\bm{\theta}\equiv(\theta_{1},\ldots,\theta_{N})$, $i=1,\ldots,n$, if $u_{i}\leqslant c^{-1}\theta_{my_{i}}(\bm{p}_{0})/\theta_{y_{i}}$, then $y_{i}$ can be treated as if it were from $\bm{\theta}_{m}(\bm{p}_{0})\equiv(\theta_{m1}(\bm{p}_{0}),\ldots,\theta_{mN}(\bm{p}_{0}))$. Assume that the data $y_{1},\ldots,y_{n}$ have been rearranged so that the first $\nu_{m}$ observations can be treated as if they were from $\bm{\theta}_{m}(\bm{p}_{0})$. By the law of the iterated logarithm we have

$$\nu_{m}=\sum_{i=1}^{n}I\left(u_{i}\leqslant\frac{\theta_{my_{i}}(\bm{p}_{0})}{c\theta_{y_{i}}}\right)=n-\mathcal{O}(nm^{-k})-\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s.$$

So we have

$$\ell_{G}(\bm{p}_{0})=\sum_{i=1}^{N}n_{i}\log\theta_{mi}(\bm{p}_{0})=\tilde{\ell}_{G}(\bm{p}_{0})+R_{mn},$$

where

$$\tilde{\ell}_{G}(\bm{p}_{0})=\sum_{j=1}^{\nu_{m}}\sum_{i=1}^{N}I(y_{j}=i)\log\theta_{mi}(\bm{p}_{0})=\sum_{i=1}^{N}\tilde{n}_{i}\log\theta_{mi}(\bm{p}_{0}),$$

$n_{i}=\#\{j:y_{j}=i\}$, $\tilde{n}_{i}=\sum_{j=1}^{\nu_{m}}I(y_{j}=i)$, $i=1,\ldots,N$, and

$$R_{mn}=\sum_{u_{j}>\theta_{my_{j}}(\bm{p}_{0})/(c\theta_{y_{j}})}\sum_{i=1}^{N}I(y_{j}=i)\log\theta_{mi}(\bm{p}_{0})=\sum_{i=1}^{N}(n_{i}-\tilde{n}_{i})\log\theta_{mi}(\bm{p}_{0}).$$

It is clear that there exist $\delta'>0$ and $M_{0}'>0$ such that $\delta'\Delta t_{i}\leqslant\theta_{mi}(\bm{p}_{0})\leqslant M_{0}'\Delta t_{i}$. By the law of the iterated logarithm,

$$|R_{mn}|\leqslant\max_{i}\{|\log(\delta'\Delta t_{i})|,|\log(M_{0}'\Delta t_{i})|\}(n-\nu_{m})=\mathcal{O}(nm^{-k})+\mathcal{O}\big(\sqrt{nm^{-k}\log\log n}\big),\quad a.s.$$

The Taylor expansion of $\log\theta_{mi}(\bm{p})$ around $\log\theta_{mi}(\bm{p}_{0})$ yields that, for $\bm{p}\in\mathbb{B}_{m}(r_{n})$,

$$\tilde{\ell}_{G}(\bm{p})=\sum_{i=1}^{N}\tilde{n}_{i}\log\theta_{mi}(\bm{p})=\tilde{\ell}_{G}(\bm{p}_{0})+\sum_{i=1}^{N}\tilde{n}_{i}\left[\frac{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})}{\theta_{mi}(\bm{p}_{0})}-\frac{1}{2}\frac{\{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})\}^{2}}{\{\theta_{mi}(\bm{p}_{0})\}^{2}}\right]+\tilde{R}_{mn},$$

where $\tilde{R}_{mn}=o(nr_{n}^{2})$, a.s.

Let $\bm{p}$ be a point on the boundary of $\mathbb{B}_{m}(r_{n})$, i.e., $\|\bm{p}-\bm{p}_{0}\|_{G}^{2}=r_{n}^{2}$. It follows from the law of the iterated logarithm that there exists $\eta>0$ such that

$$\sum_{i=1}^{N}\tilde{n}_{i}\frac{\{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})\}^{2}}{\{\theta_{mi}(\bm{p}_{0})\}^{2}}=\eta nr_{n}^{2}+\mathcal{O}(r_{n}^{2}\sqrt{n\log\log n}).$$

Therefore

$$\tilde{\ell}_{G}(\bm{p})=\tilde{\ell}_{G}(\bm{p}_{0})+\sum_{i=1}^{N}\tilde{n}_{i}\left[\frac{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})}{\theta_{mi}(\bm{p}_{0})}-\frac{1}{2}\frac{\{\theta_{mi}(\bm{p})-\theta_{mi}(\bm{p}_{0})\}^{2}}{\{\theta_{mi}(\bm{p}_{0})\}^{2}}\right]+o(nr_{n}^{2})=\tilde{\ell}_{G}(\bm{p}_{0})-\frac{1}{2}\eta nr_{n}^{2}+\mathcal{O}(r_{n}^{2}\sqrt{n\log\log n})+\mathcal{O}(r_{n}\sqrt{n\log\log n})+o(nr_{n}^{2}),\quad a.s.$$

Since $m=\mathcal{O}(n^{1/k})$, $nm^{-k}=o(nr_{n}^{2})$. So there exists $\eta'>0$ such that $\ell_{G}(\bm{p})\leqslant\ell_{G}(\bm{p}_{0})-\eta'nr_{n}^{2}=\ell_{G}(\bm{p}_{0})-\eta'(\log n)^{2}$. Since $\partial^{2}\ell_{G}(\bm{p})/\partial\bm{p}\partial\bm{p}^{\mathrm{T}}<0$, the maximum value of $\ell_{G}(\bm{p})$ is attained by some $\tilde{\bm{p}}_{G}$ in the interior of $\mathbb{B}_{m}(r_{n})$. ∎

A.7 Proof of Theorem 7

Proof.

By (13) and the Taylor expansion we have

$$\|\bm{p}-\bm{p}_{0}\|_{B}^{2}=\|\bm{p}-\bm{p}_{0}\|_{G}^{2}+\mathcal{O}(\max_{i}\Delta t_{i})\int_{0}^{1}|\psi_{m}'(t)|\,dt+\max_{0\leqslant t\leqslant 1}|\psi_{m}''(t)|\,\mathcal{O}(\max_{i}\Delta t_{i}^{2}). \qquad (26)$$

By $\beta_{mj}'(t)=(m+1)\{\beta_{m-1,j-1}(t)-\beta_{m-1,j}(t)\}$, we have

$$|\psi_{m}'(t)|\leqslant m^{2}\Big\{C_{1}\sqrt{\psi_{m}(t)}+C_{2}\psi_{m}(t)\Big\}.$$

It follows easily from $\beta_{mj}''(t)=m(m+1)\{\beta_{m-2,j-2}(t)-2\beta_{m-2,j-1}(t)+\beta_{m-2,j}(t)\}$ that $|\psi_{m}''(t)|\leqslant C_{3}m^{4}$. Thus by (26) we obtain

$$\|\bm{p}-\bm{p}_{0}\|_{B}^{2}-\|\bm{p}-\bm{p}_{0}\|_{G}^{2}=\mathcal{O}(\max_{i}\Delta t_{i})\mathcal{O}(m^{2})\|\bm{p}-\bm{p}_{0}\|_{B}+\mathcal{O}(m^{4})\mathcal{O}(\max_{i}\Delta t_{i}^{2}).$$

Therefore we have $\|\bm{p}-\bm{p}_{0}\|_{B}^{2}=\|\bm{p}-\bm{p}_{0}\|_{G}^{2}+\mathcal{O}(m^{4}\max_{i}\Delta t_{i}^{2})$. The proof is complete. ∎

A.8 Proof of Theorem 8

Proof.

Similar to the proof of Theorem 5, (14) and (15) follow easily from Theorems 6 and 7, (22), the boundedness of $f$, and the triangle inequality. ∎

References

  • Beran (1977a) Beran, R. (1977a). Minimum Hellinger distance estimates for parametric models. Ann. Statist. 5, 445–463.
  • Beran (1977b) Beran, R. (1977b). Robust location estimates. Ann. Statist. 5, 431–444.
  • Bernstein (1912) Bernstein, S. N. (1912). Démonstration du théorème de Weierstrass fondée sur le calcul des probabilitiés. Comm. Soc. Math. Kharkov 13, 1–2.
  • Box (1976) Box, G. E. P. (1976). Science and statistics. J. Amer. Statist. Assoc. 71, 791–799.
  • Buckland (1992) Buckland, S. T. (1992). Fitting density functions with polynomials. J. Roy. Statist. Soc. Ser. C 41, 63–76.
  • Csörgő & Horváth (1997) Csörgő, M. & Horváth, L. (1997). Limit Theorems in Change-Point Analysis. New York: John Wiley & Sons Inc., 1st ed.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1–38.
  • Guan (2014) Guan, Z. (2014). Efficient and robust density estimation using Bernstein type polynomials. ArXiv e-prints.
  • Hall (1982) Hall, P. (1982). The influence of rounding errors on some nonparametric estimators of a density and its derivatives. SIAM J. Appl. Math. 42, 390–399.
  • Jang & Loh (2010) Jang, W. & Loh, J. M. (2010). Density estimation for grouped data with application to line transect sampling. Ann. Appl. Stat. 4, 893–915.
  • Jassim et al. (1996) Jassim, E. W., Grossman, M., Koops, W. J. & Luykx, R. A. J. (1996). Multiphasic analysis of embryonic mortality in chickens. Poultry Sci 75, 464–471.
  • Jones (1993) Jones, M. C. (1993). Simple boundary correction for kernel density estimation. Statistics and Computing 3, 135–146.
  • Jones et al. (1996) Jones, M. C., Marron, J. S. & Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91, 401–407.
  • Jones & McLachlan (1990) Jones, P. N. & McLachlan, G. J. (1990). Algorithm AS 254: Maximum likelihood estimation from grouped and truncated data with finite normal mixture models. Journal of the Royal Statistical Society. Series C (Applied Statistics) 39, 273–282.
  • Kuurman et al. (2003) Kuurman, W. W., Bailey, B. A., Koops, W. J. & Grossman, M. (2003). A model for failure of a chicken embryo to survive incubation. Poultry Sci. 82, 214–222.
  • Lin & He (2006) Lin, N. & He, X. (2006). Robust and efficient estimation under data grouping. Biometrika 93, 99–112.
  • Lindley (1950) Lindley, D. V. (1950). Grouping corrections and maximum likelihood equations. Mathematical Proceedings of the Cambridge Philosophical Society 46, 106–110.
  • Linton & Whang (2002) Linton, O. & Whang, Y.-J. (2002). Nonparametric estimation with aggregated data. Econometric Theory 18, 420–468.
  • Lorentz (1963) Lorentz, G. G. (1963). The degree of approximation by polynomials with positive coefficients. Math. Ann. 151, 239–251.
  • McLachlan & Jones (1988) McLachlan, G. J. & Jones, P. N. (1988). Fitting mixture models to grouped and truncated data via the EM algorithm. Biometrics 44, 571–578.
  • Minoiu & Reddy (2014) Minoiu, C. & Reddy, S. (2014). Kernel density estimation on grouped data: the case of poverty assessment. The Journal of Economic Inequality 12, 163–189.
  • Passow (1977) Passow, E. (1977). Polynomials with positive coefficients: uniqueness of best approximation. J. Approximation Theory 21, 352–355.
  • Redner & Walker (1984) Redner, R. A. & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26, 195–239.
  • Rice (1984) Rice, J. (1984). Boundary modification for kernel regression. Comm. Statist. A—Theory Methods 13, 893–900.
  • Rosenblatt (1956) Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27, 832–837.
  • Rosenblatt (1971) Rosenblatt, M. (1971). Curve estimates. Ann. Math. Statist. 42, 1815–1842.
  • Ross (2013) Ross, S. M. (2013). Simulation. New York: Academic Press, 5th ed.
  • Scott & Sheather (1985) Scott, D. & Sheather, S. (1985). Kernel density estimation with binned data. Communications in Statistics - Theory and Methods 14, 1353–1359.
  • Sheather & Jones (1991) Sheather, S. J. & Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B 53, 683–690.
  • Tallis (1967) Tallis, G. M. (1967). Approximate maximum likelihood estimates from grouped data. Technometrics 9, 599–606.
  • Titterington (1983) Titterington, D. M. (1983). Kernel-based density estimation using censored, truncated or grouped data. Comm. Statist. A—Theory Methods 12, 2151–2167.
  • Vitale (1975) Vitale, R. A. (1975). Bernstein polynomial approach to density function estimation. In Statistical Inference and Related Topics (Proc. Summer Res. Inst. Statist. Inference for Stochastic Processes, Indiana Univ., Bloomington, Ind., 1974, Vol. 2; dedicated to Z. W. Birnbaum). New York: Academic Press, pp. 87–99.