
Risk Bounds for Mixture Density Estimation on Compact Domains via the $h$-Lifted Kullback–Leibler Divergence

Mark Chiu Chong mark.chiuchong@gmail.com
School of Mathematics and Physics
The University of Queensland
St Lucia, QLD 4072, Australia
Hien Duy Nguyen hien@imi.kyushu-u.ac.jp
Institute of Mathematics for Industry
Kyushu University
Nishi Ward, Fukuoka 819-0395, Japan;
School of Computing, Engineering, and Mathematical Sciences
La Trobe University
Bundoora, VIC 3086, Australia;
TrungTin Nguyen trungtin.nguyen@uq.edu.au
School of Mathematics and Physics
The University of Queensland
St Lucia, QLD 4072, Australia
Abstract

We consider the problem of estimating probability density functions based on sample data, using a finite mixture of densities from some component class. To this end, we introduce the $h$-lifted Kullback–Leibler (KL) divergence as a generalization of the standard KL divergence and a criterion for conducting risk minimization. Under a compact support assumption, we prove an $\mathcal{O}(1/\sqrt{n})$ bound on the expected estimation error when using the $h$-lifted KL divergence, which extends the results of Rakhlin et al. (2005, ESAIM: Probability and Statistics, Vol. 9) and Li & Barron (1999, Advances in Neural Information Processing Systems, Vol. 12) to permit the risk bounding of density functions that are not strictly positive. We develop a procedure for the computation of the corresponding maximum $h$-lifted likelihood estimators ($h$-MLLEs) using the minorization–maximization (MM) framework and provide experimental results in support of our theoretical bounds.

1 Introduction

Let $(\Omega,\mathfrak{A},\mathbf{P})$ be an abstract probability space and let $X:\Omega\to\mathcal{X}$ be a random variable taking values in the measurable space $(\mathcal{X},\mathfrak{F})$, where $\mathcal{X}$ is a compact metric space equipped with its Borel $\sigma$-algebra $\mathfrak{F}$. Suppose that we observe an independent and identically distributed (i.i.d.) sample of random variables $\mathbf{X}_{n}=(X_{i})_{i\in[n]}$, where $[n]=\{1,\ldots,n\}$, and that each $X_{i}$ arises from the same data generating process as $X$, characterized by the probability measure $F\ll\mu$ on $(\mathcal{X},\mathfrak{F})$, with density function $f=\mathrm{d}F/\mathrm{d}\mu$, for some $\sigma$-finite $\mu$. In this work, we are concerned with estimating $f$ via a data dependent double-index sequence of estimators $(f_{k,n})_{k,n\in\mathbb{N}}$, where

f_{k,n}\in\mathcal{C}_{k}=\mathrm{co}_{k}\left(\mathcal{P}\right)=\left\{f_{k}\left(\cdot;\psi_{k}\right)=\sum_{j=1}^{k}\pi_{j}\varphi\left(\cdot;\theta_{j}\right)\mid\varphi\left(\cdot;\theta_{j}\right)\in\mathcal{P},\,\pi_{j}\geq 0,\,j\in\left[k\right],\,\sum_{j=1}^{k}\pi_{j}=1\right\},

for each $k,n\in\mathbb{N}$, and where

\mathcal{P}=\left\{\varphi\left(\cdot;\theta\right):\mathcal{X}\rightarrow\mathbb{R}_{\geq 0}\mid\theta\in\Theta\subset\mathbb{R}^{d}\right\}, (1)

$\psi_{k}=(\pi_{1},\dots,\pi_{k},\theta_{1},\dots,\theta_{k})$, and $d\in\mathbb{N}$. To ensure the measurability and existence of various optima, we shall assume that $\varphi$ is Carathéodory in the sense that $\varphi(\cdot;\theta)$ is $(\mathcal{X},\mathfrak{F})$-measurable for each $\theta\in\Theta$, and $\varphi(x;\cdot)$ is continuous for each $x\in\mathcal{X}$.

In the definition above, we can identify the set $\mathcal{C}_{k}=\mathrm{co}_{k}(\mathcal{P})$ as the set of density functions that can be written as a convex combination of $k$ elements of $\mathcal{P}$, where $\mathcal{P}$ is often called the space of component density functions. We then interpret $\mathcal{C}_{k}$ as the class of $k$-component finite mixtures of densities of class $\mathcal{P}$, as studied, for example, by McLachlan & Peel (2004); Nguyen et al. (2020; 2022b).
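For concreteness, the following minimal Python sketch evaluates a $k$-component mixture density $f_{k}(\cdot;\psi_{k})$ with a beta component class; the particular class, parameter values, and function names are illustrative assumptions and are not code from our repository.

import numpy as np
from scipy.stats import beta

def mixture_density(x, weights, thetas):
    # Evaluate f_k(x; psi_k) = sum_j pi_j * Beta(x; a_j, b_j) pointwise.
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    dens = np.zeros_like(x)
    for w, (a, b) in zip(weights, thetas):
        dens += w * beta.pdf(x, a, b)
    return dens

# Example: a 2-component beta mixture on [0, 1].
x_grid = np.linspace(0.0, 1.0, 5)
print(mixture_density(x_grid, [0.3, 0.7], [(2.0, 5.0), (5.0, 2.0)]))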

1.1 Risk bounds for mixture density estimation

We are particularly interested in oracle bounds of the form

\mathbf{E}\left\{\ell\left(f,f_{k,n}\right)\right\}-\ell\left(f,\mathcal{C}\right)\leq\rho\left(k,n\right), (2)

where $(p,q)\mapsto\ell(p,q)\in\mathbb{R}_{\geq 0}$ is a loss function on pairs of density functions. We define the density-to-class loss

\ell\left(f,\mathcal{C}\right)=\inf_{q\in\mathcal{C}}\ell\left(f,q\right),\quad\mathcal{C}=\mathrm{cl}\left(\bigcup_{k\in\mathbb{N}}\mathrm{co}_{k}\left(\mathcal{P}\right)\right),

where $\mathrm{cl}(\cdot)$ is the closure. Here, we identify $(k,n)\mapsto\rho(k,n)$ as a characterization of the rate at which the left-hand side of (2) converges to zero as $k$ and $n$ increase. Our present work follows the research of Li & Barron (1999), Rakhlin et al. (2005) and Klemelä (2007) (see also Klemelä 2009, Ch. 19). In Li & Barron (1999) and Rakhlin et al. (2005), the authors consider the case where $\ell(p,q)$ is taken to be the Kullback–Leibler (KL) divergence

\mathrm{KL}\left(p\,||\,q\right)=\int p\log\frac{p}{q}\,\mathrm{d}\mu

and $f_{k,n}=f_{k}(\cdot;\psi_{k,n})$ is a maximum likelihood estimator (MLE), where

\psi_{k,n}\in\underset{\psi_{k}\in\mathcal{S}_{k}\times\Theta^{k}}{\operatorname*{arg\,max}}\,\frac{1}{n}\sum_{i=1}^{n}\log f_{k}\left(X_{i};\psi_{k}\right),

is a function of $\mathbf{X}_{n}$, with $\mathcal{S}_{k}$ denoting the probability simplex in $\mathbb{R}^{k}$.

Under the assumption that $f,f_{k}\geq a$, for some $a>0$ and each $k\in[n]$ (i.e., strict positivity), Li & Barron (1999) obtained the bound

\mathbf{E}\left\{\mathrm{KL}\left(f\,||\,f_{k,n}\right)\right\}-\mathrm{KL}\left(f\,||\,\mathcal{C}\right)\leq c_{1}\frac{1}{k}+c_{2}\frac{k\log\left(c_{3}n\right)}{n},

for constants $c_{1},c_{2},c_{3}>0$, which was then improved by Rakhlin et al. (2005), who obtained the bound

\mathbf{E}\left\{\mathrm{KL}\left(f\,||\,f_{k,n}\right)\right\}-\mathrm{KL}\left(f\,||\,\mathcal{C}\right)\leq c_{1}\frac{1}{k}+c_{2}\frac{1}{\sqrt{n}},

for constants $c_{1},c_{2}>0$ (the constants $(c_{j})_{j\in\mathbb{N}}$ typically differ between expressions).

Alternatively, Klemelä (2007) takes $\ell(p,q)$ to be the squared $L_{2}(\mu)$ norm distance (i.e., the least-squares loss):

\ell\left(p,q\right)=\lVert p-q\rVert_{2,\mu}^{2},

where $\lVert p\rVert_{2,\mu}^{2}=\int_{\mathcal{X}}\lvert p\rvert^{2}\,\mathrm{d}\mu$, for each $p\in L_{2}(\mu)$, and chooses $f_{k,n}$ as a minimizer of the $L_{2}(\mu)$ empirical risk, i.e., $f_{k,n}=f_{k}(\cdot;\psi_{k,n})$, where

\psi_{k,n}\in\underset{\psi_{k}\in\mathcal{S}_{k}\times\Theta^{k}}{\operatorname*{arg\,min}}\,-\frac{2}{n}\sum_{i=1}^{n}f_{k}\left(X_{i};\psi_{k}\right)+\left\|f_{k}\left(\cdot;\psi_{k}\right)\right\|_{2,\mu}^{2}. (3)

Here, Klemelä (2007) establishes the bound

\mathbf{E}\left\|f-f_{k,n}\right\|_{2,\mu}^{2}-\inf_{q\in\mathcal{C}}\left\|f-q\right\|_{2,\mu}^{2}\leq c_{1}\frac{1}{k}+c_{2}\frac{1}{\sqrt{n}},

for $c_{1},c_{2}>0$, without the lower bound assumption on $f,f_{k}$ above, even permitting $\mathcal{X}$ to be unbounded. Via the main results of Naito & Eguchi (2013), the bound above can be generalized to the $U$-divergences, which include the squared $L_{2}(\mu)$ norm distance as a special case.

On the one hand, the sequence of MLEs required for the results of Li & Barron (1999) and Rakhlin et al. (2005) is typically computable, for example, via the usual expectation–maximization approach (cf. McLachlan & Peel 2004, Ch. 2). This contrasts with the computation of least-squares density estimators of the form (3), which requires evaluations of the typically intractable integral expressions $\left\|f_{k}\left(\cdot;\psi_{k}\right)\right\|_{2,\mu}^{2}$. However, the least-squares approach of Klemelä (2007) permits analysis using families $\mathcal{P}$ of usual interest, such as normal distributions and beta distributions, the latter of which are compactly supported but have densities that cannot be bounded away from zero without restrictions, and thus do not satisfy the regularity conditions of Li & Barron (1999) and Rakhlin et al. (2005).

1.2 Main contributions

We propose the following $h$-lifted KL divergence as a generalization of the standard KL divergence, to address the computationally tractable estimation of density functions that do not satisfy the regularity conditions of Li & Barron (1999) and Rakhlin et al. (2005). The $h$-lifted KL divergence also has the potential to extend theory in statistical machine learning that is currently based on the standard KL divergence. To this end, let $h:\mathcal{X}\rightarrow\mathbb{R}_{\geq 0}$ be a function in $L_{1}(\mu)$, and define the $h$-lifted KL divergence by:

\mathrm{KL}_{h}\left(p\,||\,q\right)=\int_{\mathcal{X}}\left\{p+h\right\}\log\frac{p+h}{q+h}\,\mathrm{d}\mu. (4)

In the sequel, we shall show that $\mathrm{KL}_{h}$ is a Bregman divergence on the space of probability density functions, as per Csiszár (1995).

Assume that $h$ is a probability density function, and let $\mathbf{Y}_{n}=(Y_{i})_{i\in[n]}$ be an i.i.d. sample, independent of $\mathbf{X}_{n}$, where each $Y_{i}:\Omega\rightarrow\mathcal{X}$ is a random variable with probability measure on $(\mathcal{X},\mathfrak{F})$ characterized by the density $h$ with respect to $\mu$. Then, for each $k$ and $n$, let $f_{k,n}$ be defined via the maximum $h$-lifted likelihood estimator ($h$-MLLE; see Appendix B for further discussion) $f_{k,n}=f_{k}(\cdot;\psi_{k,n})$, where

\psi_{k,n}\in\underset{\psi_{k}\in\mathcal{S}_{k}\times\Theta^{k}}{\operatorname*{arg\,max}}\,\frac{1}{n}\sum_{i=1}^{n}\left(\log\left\{f_{k}\left(X_{i};\psi_{k}\right)+h\left(X_{i}\right)\right\}+\log\left\{f_{k}\left(Y_{i};\psi_{k}\right)+h\left(Y_{i}\right)\right\}\right). (5)

The primary aim of this work is to show that

\mathbf{E}\left\{\mathrm{KL}_{h}\left(f\,||\,f_{k,n}\right)\right\}-\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right)\leq c_{1}\frac{1}{k}+c_{2}\frac{1}{\sqrt{n}} (6)

for some constants $c_{1},c_{2}>0$, without requiring the strict positivity assumption that $f,f_{k}\geq a>0$.

This result is a compromise between the works of Li & Barron (1999) and Rakhlin et al. (2005), and Klemelä (2007), as it applies to a broader space of component densities $\mathcal{P}$, and because the required $h$-MLLEs (5) can be efficiently computed via minorization–maximization (MM) algorithms (see, e.g., Lange 2016). We shall discuss this assertion in Section 4.

1.3 Relevant literature

Our work largely follows the approach of Li & Barron (1999), which was extended by Rakhlin et al. (2005) and Klemelä (2007). All three texts use approaches based on the availability of greedy algorithms for maximizing convex functions with convex functional domains. In this work, we shall make use of the proof techniques of Zhang (2003). Related results in this direction can be found in DeVore & Temlyakov (2016) and Temlyakov (2016). Making the same boundedness assumption as Rakhlin et al. (2005), Dalalyan & Sebbar (2018) obtain refined oracle inequalities under the additional assumption that the class $\mathcal{P}$ is finite. Numerical implementations of greedy algorithms for estimating finite mixtures of Gaussian densities were studied by Vlassis & Likas (2002) and Verbeek et al. (2003).

The $h$-MLLE as an optimization objective can be compared to other similar modified likelihood estimators, such as the $L_{q}$ likelihood of Ferrari & Yang (2010) and Qin & Priebe (2013), the $\beta$-likelihood of Basu et al. (1998) and Fujisawa & Eguchi (2006), penalized likelihood estimators, such as maximum a posteriori estimators of Bayesian models, or $f$-separable Bregman distortion measures of Kobayashi & Watanabe (2024; 2021).

The practical computation of the $h$-MLLEs (5) is made possible via the MM algorithm framework of Lange (2016); see also Hunter & Lange (2004), Wu & Lange (2010), and Nguyen (2017) for further details. Such algorithms have well-studied global convergence properties and can be modified for mini-batch and stochastic settings (see, e.g., Razaviyayn et al., 2013 and Nguyen et al., 2022a).

A related and popular setting of investigation is that of model selection, where the objects of interest are single-index sequences $(f_{k_{n},n})_{n\in\mathbb{N}}$, and where the aim is to obtain finite-sample bounds for losses of the form $\ell(f_{k_{n},n},f)$, where each $k_{n}\in\mathbb{N}$ is a data dependent function, often obtained by optimizing some penalized loss criterion, as described in Massart (2007), Koltchinskii (2011, Ch. 6), and Giraud (2021, Ch. 2). In the context of finite mixtures, examples of such analyses can be found in the works of Maugis & Michel (2011) and Maugis-Rabusseau & Michel (2013). A comprehensive bibliography of model selection results for finite mixtures and related statistical models can be found in Nguyen et al. (2022c).

1.4 Organization of paper

The manuscript is organized as follows. In Section 2, we formally define the $h$-lifted KL divergence as a Bregman divergence and establish several of its properties. In Section 3, we present new risk bounds for the $h$-lifted KL divergence of the form (2). In Section 4, we discuss the computation of the $h$-lifted likelihood estimator of the form (5), followed by empirical results illustrating the convergence of (2) with respect to both $k$ and $n$. Additional insights and technical results are provided in the Appendices at the end of the manuscript.

2 The $h$-lifted KL divergence and its properties

In this section we formally define the hh-lifted KL divergence on the space of density functions and establish some of its properties.

Definition 1 ($h$-lifted KL divergence).

Let $f,g$, and $h$ be probability density functions on the space $\mathcal{X}$, where $h>0$. The $h$-lifted KL divergence from $g$ to $f$ is defined as follows:

\mathrm{KL}_{h}\left(f\,||\,g\right)=\int_{\mathcal{X}}\left\{f+h\right\}\log\frac{f+h}{g+h}\,\mathrm{d}\mu=\mathbf{E}_{f}\left\{\log\frac{f+h}{g+h}\right\}+\mathbf{E}_{h}\left\{\log\frac{f+h}{g+h}\right\}.

2.1 $\mathrm{KL}_{h}$ as a Bregman divergence

Let $\phi:\mathcal{I}\to\mathbb{R}$, $\mathcal{I}=(0,\infty)$, be a strictly convex function that is continuously differentiable. The Bregman divergence between scalars $d_{\phi}:\mathcal{I}\times\mathcal{I}\to\mathbb{R}_{\geq 0}$ generated by the function $\phi$ is given by:

d_{\phi}(p,q)=\phi(p)-\phi(q)-\phi^{\prime}(q)(p-q),

where $\phi^{\prime}(q)$ denotes the derivative of $\phi$ at $q$.

Bregman divergences possess several useful properties, including the following:

  1. Non-negativity: $d_{\phi}(p,q)\geq 0$ for all $p,q\in\mathcal{I}$, with equality if and only if $p=q$;

  2. Asymmetry: $d_{\phi}(p,q)\neq d_{\phi}(q,p)$ in general;

  3. Convexity: $d_{\phi}(p,q)$ is convex in $p$ for every fixed $q\in\mathcal{I}$;

  4. Linearity: $d_{c_{1}\phi_{1}+c_{2}\phi_{2}}(p,q)=c_{1}d_{\phi_{1}}(p,q)+c_{2}d_{\phi_{2}}(p,q)$ for $c_{1},c_{2}\geq 0$.

The properties for Bregman divergences between scalars can be extended to density functions and other functional spaces, as established in Frigyik et al. (2008) and Stummer & Vajda (2012), for example. We also direct the interested reader to the works of Pardo (2006), Basu et al. (2011), and Amari (2016).

The class of $h$-lifted KL divergences constitutes a generalization of the usual KL divergence and is a subset of the Bregman divergences over the space of density functions that are considered by Csiszár (1995). Namely, let $\mathcal{P}$ be a convex set of probability densities with respect to the measure $\mu$ on $\mathcal{X}$. The Bregman divergence $D_{\phi}:\mathcal{P}\times\mathcal{P}\to[0,\infty)$ between densities $p,q\in\mathcal{P}$ can be constructed as follows:

D_{\phi}(p\,||\,q)=\int_{\mathcal{X}}d_{\phi}\left(p(x),q(x)\right)\mathrm{d}\mu(x).

The $h$-lifted KL divergence $\mathrm{KL}_{h}$ as a Bregman divergence is generated by the function $\phi(u)=(u+h)\log(u+h)-(u+h)+1$. This assertion is demonstrated in Appendix C.1.
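As a numerical sanity check of this assertion (an illustration only; the particular densities, the uniform lifting density, and the use of scipy quadrature are assumptions made for this example), one can verify that integrating the pointwise Bregman divergence generated by $\phi$ recovers $\mathrm{KL}_{h}$, since the linear correction term integrates to zero for probability densities:

import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

p = lambda x: beta.pdf(x, 2.0, 5.0)
q = lambda x: beta.pdf(x, 3.0, 3.0)
hx = 1.0  # uniform lifting density h(x) = 1 on [0, 1]

def d_phi(u, v):
    # Pointwise Bregman divergence generated by phi(u) = (u+h)log(u+h) - (u+h) + 1.
    phi = lambda t: (t + hx) * np.log(t + hx) - (t + hx) + 1.0
    dphi = lambda t: np.log(t + hx)   # phi'(t)
    return phi(u) - phi(v) - dphi(v) * (u - v)

bregman = quad(lambda x: d_phi(p(x), q(x)), 0.0, 1.0)[0]
kl_h = quad(lambda x: (p(x) + hx) * np.log((p(x) + hx) / (q(x) + hx)), 0.0, 1.0)[0]
print(bregman, kl_h)   # the two values agree up to quadrature error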

2.2 Advantages of the $h$-lifted KL divergence

When the standard KL divergence is employed in the density estimation problem, it is common to restrict consideration of density functions to those bounded away from zero by some positive constant. That is, one typically considers the smaller class of so-called admissible target densities $\mathcal{P}_{\alpha}\subset\mathcal{P}$ (cf. Meir & Zeevi, 1997), where

\mathcal{P}_{\alpha}=\left\{\varphi(\cdot;\theta)\in\mathcal{P}\mid\varphi(\cdot;\theta)\geq\alpha>0\right\}.

Without this restriction, the standard KL divergence can be unbounded, even for functions with bounded $L_{1}$ norms. For example, let $p$ and $q$ be densities of beta distributions on the support $\mathcal{X}=[0,1]$. That is, suppose that $p,q\in\mathcal{P}_{\mathrm{beta}}$, respectively characterized by parameters $\theta_{p}=(a_{p},b_{p})$ and $\theta_{q}=(a_{q},b_{q})$, where

\mathcal{P}_{\mathrm{beta}}=\left\{x\mapsto\beta\left(x;\theta\right)=\frac{\Gamma\left(a+b\right)}{\Gamma\left(a\right)\Gamma\left(b\right)}x^{a-1}\left(1-x\right)^{b-1},\,\theta=\left(a,b\right)\in\mathbb{R}_{>0}^{2}\right\}. (7)

Then, from Gil et al. (2013), the KL divergence between $p$ and $q$ is given by:

\mathrm{KL}\left(p\,||\,q\right)=\log\left\{\frac{\Gamma\left(a_{q}\right)\Gamma\left(b_{q}\right)}{\Gamma\left(a_{q}+b_{q}\right)}\right\}-\log\left\{\frac{\Gamma\left(a_{p}\right)\Gamma\left(b_{p}\right)}{\Gamma\left(a_{p}+b_{p}\right)}\right\}+\left(a_{p}-a_{q}\right)\left\{\psi\left(a_{p}\right)-\psi\left(a_{p}+b_{p}\right)\right\}+\left(b_{p}-b_{q}\right)\left\{\psi\left(b_{p}\right)-\psi\left(a_{p}+b_{p}\right)\right\},

where $\psi:\mathbb{R}_{>0}\rightarrow\mathbb{R}$ is the digamma function. Next, suppose that $a_{p}=b_{q}$ and $a_{q}=b_{p}=1$, which leads to the simplification

\mathrm{KL}\left(p\,||\,q\right)=\left(a_{p}-1\right)\left\{\psi\left(a_{p}\right)-\psi(1)\right\}.

Since $\psi$ is strictly increasing, we observe that the right-hand side diverges as $a_{p}\to\infty$. Thus, the KL divergence between beta distributions is unbounded. The $h$-lifted KL divergence, in contrast, does not suffer from this problem and does not require the restriction to $\mathcal{P}_{\alpha}$. This allows us to consider cases where $p,q\in\mathcal{P}$ are not bounded away from $0$, as per the following result.

Proposition 2.

Let $\mathcal{P}$ be defined as in (1). $\mathrm{KL}_{h}(f\,||\,g)$ is bounded for all continuous densities $f,g\in\mathcal{P}$.

Proof.

See Appendix C.2. ∎
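The contrast with the standard KL divergence can be illustrated numerically. In the following sketch (an illustration with an assumed uniform lift $h$, not code from our repository), the closed-form KL divergence between $\mathrm{Beta}(a_{p},1)$ and $\mathrm{Beta}(1,a_{p})$ grows rapidly with $a_{p}$, whereas the corresponding $h$-lifted divergence remains finite and comparatively small:

import numpy as np
from scipy.stats import beta
from scipy.special import digamma
from scipy.integrate import quad

def kl_h_uniform(p_params, q_params):
    # KL_h(p || q) with the uniform lifting density h(x) = 1, by adaptive quadrature.
    p = lambda x: beta.pdf(x, *p_params)
    q = lambda x: beta.pdf(x, *q_params)
    integrand = lambda x: (p(x) + 1.0) * np.log((p(x) + 1.0) / (q(x) + 1.0))
    return quad(integrand, 0.0, 1.0, limit=200)[0]

for a_p in [2.0, 10.0, 50.0]:
    kl = (a_p - 1.0) * (digamma(a_p) - digamma(1.0))   # closed form from Gil et al. (2013)
    klh = kl_h_uniform((a_p, 1.0), (1.0, a_p))
    print(f"a_p = {a_p:5.1f}:  KL = {kl:8.2f},  KL_h = {klh:6.3f}")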

Let $L_{p}(f,g)$ denote the standard $L_{p}$-norm distance, $L_{p}(f,g)=\left\{\int_{\mathcal{X}}\left\lvert f(x)-g(x)\right\rvert^{p}\mathrm{d}\mu(x)\right\}^{1/p}$. As remarked previously, Klemelä (2007) established empirical risk bounds in terms of the $L_{2}$-norm distance. Following results from Meir & Zeevi (1997) characterizing the relationship between the KL divergence and the $L_{2}$-norm distance, in Proposition 3 we establish the corresponding relationship between the $h$-lifted KL divergence and the $L_{2}$-norm distance, along with a relationship between the $h$-lifted KL divergence and the $L_{1}$-norm distance.

Proposition 3.

For probability density functions $f$, $g$, and $h$, where $h$ is such that $h(x)\geq\gamma>0$ for all $x\in\mathcal{X}$, the following inequalities hold:

\frac{1}{4}L_{1}^{2}\left(f,g\right)\leq\mathrm{KL}_{h}\left(f\,||\,g\right)\leq\gamma^{-1}L_{2}^{2}\left(f,g\right).
Proof.

See Appendix C.3. ∎

Remark 4.

Proposition 2 highlights the benefit of the $h$-lifted KL divergence being bounded for all continuous densities, unlike the standard KL divergence, while satisfying a relationship similar to that between the KL divergence and the $L_{2}$-norm distance. Moreover, the first inequality of Proposition 3 is a Pinsker-like relationship between the $h$-lifted KL divergence and the total variation distance $\mathrm{TV}(f,g)=\frac{1}{2}L_{1}(f,g)$.

3 Main results

Here we provide explicit statements regarding the convergence rates claimed in (6) via Theorem 5 and Corollary 6, which are proved in Appendix A.2. We assume that $f$ is bounded above by some constant $c$ and that the lifting function $h$ is bounded below and above by constants $a$ and $b$, respectively.

Theorem 5.

Let $h$ be a positive density satisfying $0<a\leq h(x)\leq b$, for all $x\in\mathcal{X}$. For any target density $f$ satisfying $0\leq f(x)\leq c$, for all $x\in\mathcal{X}$, and where $f_{k,n}$ is the minimizer of $\mathrm{KL}_{h}$ over $k$-component mixtures, the following inequality holds:

\mathbf{E}\left\{\mathrm{KL}_{h}\left(f\,||\,f_{k,n}\right)\right\}-\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right)\leq\frac{u_{1}}{k+2}+\frac{u_{2}}{\sqrt{n}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon/2,\lVert\,\cdot\,\rVert_{\infty})\,\mathrm{d}\varepsilon+\frac{u_{3}}{\sqrt{n}},

where $u_{1}$, $u_{2}$, and $u_{3}$ are positive constants that depend on some or all of $a$, $b$, and $c$.

Corollary 6.

Let $\mathcal{X}$ and $\Theta$ be compact and assume the following Lipschitz condition holds: for each $x\in\mathcal{X}$, and for each $\theta,\tau\in\Theta$,

\left|\varphi\left(x;\theta\right)-\varphi\left(x;\tau\right)\right|\leq\varPhi\left(x\right)\left\|\theta-\tau\right\|_{1}, (8)

for some function $\varPhi:\mathcal{X}\rightarrow\mathbb{R}_{\geq 0}$, where $\lVert\varPhi\rVert_{\infty}=\sup_{x\in\mathcal{X}}\lvert\varPhi(x)\rvert<\infty$. Then the bound in Theorem 5 becomes

\mathbf{E}\left\{\mathrm{KL}_{h}\left(f\,||\,f_{k,n}\right)\right\}-\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right)\leq\frac{c_{1}}{k+2}+\frac{c_{2}}{\sqrt{n}},

where $c_{1}$ and $c_{2}$ are positive constants.

Remark 7.

Our results are applicable to any compact metric space $\mathcal{X}$, with $[0,1]$ used in the experimental setup in Section 4.2 as a simple and tractable example to illustrate key aspects of our theory. There is no issue in generalizing to $\mathcal{X}=[-m,m]^{d}$ for $m>0$ and $d\in\mathbb{N}$, or more abstractly, to any compact subset $\mathcal{X}\subset\mathbb{R}^{d}$. Additionally, $\mathcal{X}$ could even be taken as a functional compact space, though establishing compactness and constructing appropriate component classes $\mathcal{P}$ over such spaces to achieve small approximation errors $\mathrm{KL}_{h}(f\,||\,\mathcal{C})$ is an approximation theoretic task that falls outside the scope of our work.

From the proof of Theorem 5 in Appendix A.2, it is clear that the dimensionality of $\mathcal{X}$ only influences our bound through the complexity of the class $\mathcal{P}$, specifically, the constant $\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon/2,\lVert\,\cdot\,\rVert_{\infty})\,\mathrm{d}\varepsilon$, which remains independent of both $n$ and $k$. Here, $N(\mathcal{P},\varepsilon,\lVert\,\cdot\,\rVert)$ is the $\varepsilon$-covering number of $\mathcal{P}$. In fact, the constant with respect to $k$ ($u_{1}$ in Theorem 5) is entirely unaffected by the dimensionality of $\mathcal{X}$. Thus, the rates of our bound on the expected $h$-lifted KL divergence are dimension-independent and hold even when $\mathcal{X}$ is infinite-dimensional, as long as there exists a class $\mathcal{P}$ such that $\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon/2,\lVert\,\cdot\,\rVert_{\infty})\,\mathrm{d}\varepsilon$ is finite.

Corollary 6 provides a method for obtaining such a bound when the elements of $\mathcal{P}$ satisfy a Lipschitz condition.
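To sketch why the entropy integral is finite under the conditions of Corollary 6 (a standard covering argument, stated here informally): an $(\varepsilon/\lVert\varPhi\rVert_{\infty})$-cover of $\Theta$ in $\lVert\cdot\rVert_{1}$ induces, via (8), an $\varepsilon$-cover of $\mathcal{P}$ in $\lVert\cdot\rVert_{\infty}$, so that

\log N(\mathcal{P},\varepsilon,\lVert\,\cdot\,\rVert_{\infty})\leq\log N\left(\Theta,\varepsilon/\lVert\varPhi\rVert_{\infty},\lVert\,\cdot\,\rVert_{1}\right)\leq d\log\left(1+\frac{C_{\Theta}\lVert\varPhi\rVert_{\infty}}{\varepsilon}\right),

for a constant $C_{\Theta}$ depending only on the diameter of the compact set $\Theta$ (by the usual volumetric bound). Since $\int_{0}^{c}\log^{1/2}(1+C/\varepsilon)\,\mathrm{d}\varepsilon<\infty$ for any $C>0$, the entropy integral in Theorem 5 is finite, which yields the $c_{2}/\sqrt{n}$ term in Corollary 6.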

4 Numerical experiments

Here we discuss the computability and computation of $\mathrm{KL}_{h}$ estimation problems and provide empirical evidence towards the rates obtained in Theorem 5. Specifically, we seek to develop a methodology for computing $h$-MLLEs, and to use numerical experiments to demonstrate that the sequence of expected $h$-lifted KL divergences between some density $f$ and a sequence of $k$-component mixture densities from a suitable class $\mathcal{P}$, estimated using $n$ observations, does indeed decrease at rates proportional to $1/k$ and $1/\sqrt{n}$ as $k$ and $n$ increase.

The code for all simulations and analyses in Experiments 1 and 2 is available in both the R and Python programming languages. The code repository is available here: https://github.com/hiendn/LiftedLikelihood.

4.1 Minorization–Maximization algorithm

One solution for computing (5) is to employ an MM algorithm. To do so, we first write the objective of (5) as

L_{h,n}\left(\psi_{k}\right)=\frac{1}{n}\sum_{i=1}^{n}\left(\log\left\{\sum_{j=1}^{k}\pi_{j}\varphi\left(X_{i};\theta_{j}\right)+h\left(X_{i}\right)\right\}+\log\left\{\sum_{j=1}^{k}\pi_{j}\varphi\left(Y_{i};\theta_{j}\right)+h\left(Y_{i}\right)\right\}\right),

where $\psi_{k}\in\Psi_{k}=\mathcal{S}_{k}\times\Theta^{k}$. We then require the definition of a minorizer $Q_{n}$ for $L_{h,n}$ on the space $\Psi_{k}$, where $Q_{n}:\Psi_{k}\times\Psi_{k}\rightarrow\mathbb{R}$ is a function with the properties:

  (i) $Q_{n}\left(\psi_{k},\psi_{k}\right)=L_{h,n}\left(\psi_{k}\right)$, and

  (ii) $Q_{n}\left(\psi_{k},\chi_{k}\right)\leq L_{h,n}\left(\psi_{k}\right)$,

for each $\psi_{k},\chi_{k}\in\Psi_{k}$. In this context, given a fixed $\chi_{k}$, the minorizer $Q_{n}(\cdot,\chi_{k})$ should possess properties that simplify it compared to the original objective $L_{h,n}$. These properties should make the minorizer more tractable and might include features such as parametric separability, differentiability, and convexity, among others.

In order to build an appropriate minorizer for $L_{h,n}$, we make use of the so-called Jensen's inequality minorizer, as detailed in Lange (2016, Sec. 4.3), applied to the logarithm function. This construction results in a minorizer of the form

Q_{n}\left(\psi_{k},\chi_{k}\right)=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k}\left\{\tau_{j}\left(X_{i};\chi_{k}\right)\log\pi_{j}+\tau_{j}\left(X_{i};\chi_{k}\right)\log\varphi\left(X_{i};\theta_{j}\right)\right\}
+\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k}\left\{\tau_{j}\left(Y_{i};\chi_{k}\right)\log\pi_{j}+\tau_{j}\left(Y_{i};\chi_{k}\right)\log\varphi\left(Y_{i};\theta_{j}\right)\right\}
+\frac{1}{n}\sum_{i=1}^{n}\left\{\gamma\left(X_{i};\chi_{k}\right)\log h\left(X_{i}\right)+\gamma\left(Y_{i};\chi_{k}\right)\log h\left(Y_{i}\right)\right\}
-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k}\left\{\tau_{j}\left(X_{i};\chi_{k}\right)\log\tau_{j}\left(X_{i};\chi_{k}\right)+\tau_{j}\left(Y_{i};\chi_{k}\right)\log\tau_{j}\left(Y_{i};\chi_{k}\right)\right\}
-\frac{1}{n}\sum_{i=1}^{n}\left\{\gamma\left(X_{i};\chi_{k}\right)\log\gamma\left(X_{i};\chi_{k}\right)+\gamma\left(Y_{i};\chi_{k}\right)\log\gamma\left(Y_{i};\chi_{k}\right)\right\},

where

\gamma\left(X_{i};\psi_{k}\right)=h\left(X_{i}\right)/\left\{\sum_{j=1}^{k}\pi_{j}\varphi\left(X_{i};\theta_{j}\right)+h\left(X_{i}\right)\right\}

and

\tau_{j}\left(X_{i};\psi_{k}\right)=\pi_{j}\varphi\left(X_{i};\theta_{j}\right)/\left\{\sum_{l=1}^{k}\pi_{l}\varphi\left(X_{i};\theta_{l}\right)+h\left(X_{i}\right)\right\}.

Observe that $Q_{n}(\cdot,\chi_{k})$ now takes the form of a sum-of-logarithms, as opposed to the more challenging log-of-sum form of $L_{h,n}$. This change produces a functional separation of the elements of $\psi_{k}$.

Using $Q_{n}$, we then define the MM algorithm via the parameter sequence $(\psi_{k}^{(s)})_{s\in\mathbb{N}}$, where

\psi_{k}^{\left(s\right)}=\underset{\psi_{k}\in\Psi_{k}}{\operatorname*{arg\,max}}\,Q_{n}\left(\psi_{k},\psi_{k}^{\left(s-1\right)}\right), (9)

for each $s>0$, and where $\psi_{k}^{(0)}$ is chosen by the user and is typically referred to as the initialization of the algorithm. Notice that for each $s$, (9) is a simpler optimization problem than (5). Writing $\psi_{k}^{(s)}=(\pi_{1}^{(s)},\dots,\pi_{k}^{(s)},\theta_{1}^{(s)},\dots,\theta_{k}^{(s)})$, we observe that (9) simplifies to the separated expressions:

\pi_{j}^{\left(s\right)}=\frac{\sum_{i=1}^{n}\left\{\tau_{j}\left(X_{i};\psi_{k}^{\left(s-1\right)}\right)+\tau_{j}\left(Y_{i};\psi_{k}^{\left(s-1\right)}\right)\right\}}{\sum_{i=1}^{n}\sum_{l=1}^{k}\left\{\tau_{l}\left(X_{i};\psi_{k}^{\left(s-1\right)}\right)+\tau_{l}\left(Y_{i};\psi_{k}^{\left(s-1\right)}\right)\right\}}

and

\theta_{j}^{\left(s\right)}=\underset{\theta_{j}\in\Theta}{\operatorname*{arg\,max}}\,\frac{1}{n}\sum_{i=1}^{n}\left\{\tau_{j}\left(X_{i};\psi_{k}^{\left(s-1\right)}\right)\log\varphi\left(X_{i};\theta_{j}\right)+\tau_{j}\left(Y_{i};\psi_{k}^{\left(s-1\right)}\right)\log\varphi\left(Y_{i};\theta_{j}\right)\right\},

for each $j\in[k]$.
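A minimal Python sketch of these updates for beta components is given below; the component update (which has no closed form for beta densities) is carried out numerically, and all function names, initial values, and the use of scipy are illustrative assumptions rather than code from our repository.

import numpy as np
from scipy.stats import beta
from scipy.optimize import minimize

def responsibilities(z, pi, theta, h):
    # tau_j(z; psi_k) and gamma(z; psi_k) for all observations z.
    comp = np.array([p * beta.pdf(z, a, b) for p, (a, b) in zip(pi, theta)])  # (k, n)
    denom = comp.sum(axis=0) + h(z)
    return comp / denom, h(z) / denom

def mm_step(x, y, pi, theta, h):
    tau_x, _ = responsibilities(x, pi, theta, h)
    tau_y, _ = responsibilities(y, pi, theta, h)
    tau = tau_x + tau_y
    pi_new = tau.sum(axis=1) / tau.sum()                  # weight update
    theta_new = []
    for j, (a0, b0) in enumerate(theta):                  # component updates
        neg_q = lambda t: -np.sum(tau_x[j] * beta.logpdf(x, t[0], t[1])
                                  + tau_y[j] * beta.logpdf(y, t[0], t[1]))
        res = minimize(neg_q, x0=[a0, b0], bounds=[(1e-3, 50.0)] * 2)
        theta_new.append(tuple(res.x))
    return pi_new, theta_new

# Toy run with placeholder data for X_n and Y_n and a uniform lifting density.
rng = np.random.default_rng(0)
n, k = 500, 3
x = rng.uniform(0, 1, n); y = rng.uniform(0, 1, n)
h = lambda z: np.ones_like(z)
pi, theta = np.full(k, 1.0 / k), [(1.0 + j, k - j + 1.0) for j in range(k)]
for _ in range(20):
    pi, theta = mm_step(x, y, pi, theta, h)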

A noteworthy property of the MM sequence $(\psi_{k}^{(s)})_{s\in\mathbb{N}}$ is that it generates an increasing sequence of objective values, due to the chain of inequalities

L_{h,n}\left(\psi_{k}^{\left(s-1\right)}\right)=Q_{n}\left(\psi_{k}^{\left(s-1\right)},\psi_{k}^{\left(s-1\right)}\right)\leq Q_{n}\left(\psi_{k}^{\left(s\right)},\psi_{k}^{\left(s-1\right)}\right)\leq L_{h,n}\left(\psi_{k}^{\left(s\right)}\right),

where the equality is due to property (i) of $Q_{n}$, the first inequality is due to the definition of $\psi_{k}^{(s)}$, and the second inequality is due to property (ii) of $Q_{n}$. This provides a kind of stability and regularity to the sequence $(L_{h,n}(\psi_{k}^{(s)}))_{s\in\mathbb{N}}$.

Of course, we can provide stronger guarantees under additional assumptions. Namely, assume that (iii) $\Psi_{k}\subset\varPsi_{k}$, where $\varPsi_{k}$ is an open set in a finite dimensional Euclidean space on which $L_{h,n}$ and $Q_{n}(\cdot,\chi_{k})$ are differentiable, for each $\chi_{k}\in\Psi_{k}$. Then, under assumptions (i)–(iii) regarding $L_{h,n}$ and $Q_{n}$, and due to the compactness of $\Psi_{k}$ and the continuity of $Q_{n}$ on $\Psi_{k}\times\Psi_{k}$, Razaviyayn et al. (2013, Cor. 1) implies that $(\psi_{k}^{(s)})_{s\in\mathbb{N}}$ converges to the set of stationary points of $L_{h,n}$ in the sense that

\lim_{s\rightarrow\infty}\inf_{\psi_{k}^{*}\in\Psi_{k}^{*}}\left\|\psi_{k}^{\left(s\right)}-\psi_{k}^{*}\right\|_{2}=0,\text{ where }\Psi_{k}^{*}=\left\{\psi_{k}^{*}\in\Psi_{k}:\left.\frac{\partial L_{h,n}}{\partial\psi_{k}}\right|_{\psi_{k}=\psi_{k}^{*}}=0\right\}.

More concisely, we say that the sequence $(\psi_{k}^{(s)})_{s\in\mathbb{N}}$ globally converges to the set of stationary points $\Psi_{k}^{*}$.

4.2 Experimental setup

Towards the task of demonstrating empirical evidence of the rates in Theorem 5, we consider the family of beta distributions on the unit interval $\mathcal{X}=[0,1]$ as our base class (i.e., (7)) to estimate a pair of target densities

f_{1}\left(x\right)=\frac{1}{2}\chi_{\left[0,2/5\right]}\left(x\right)+\frac{1}{2}\chi_{\left[3/5,1\right]}\left(x\right),

and

f_{2}\left(x\right)=\chi_{\left[0,1\right]}\left(x\right)\begin{cases}2-4x&\text{if }x\leq 1/2,\\ -2+4x&\text{if }x>1/2,\end{cases}

where $\chi_{\mathcal{A}}$ is the characteristic function that takes value $1$ if $x\in\mathcal{A}$ and $0$ otherwise. Note that neither $f_{1}$ nor $f_{2}$ is in $\mathcal{C}$. In particular, $f_{1}(x)=0$ when $x\in(\frac{2}{5},\frac{3}{5})$ and $f_{2}(x)=0$ when $x=1/2$, and hence neither density is bounded away from $0$ on $\mathcal{X}$. Thus, the theory of Rakhlin et al. (2005) cannot be applied to provide bounds for the expected KL divergence between MLEs of beta mixtures and the pair of targets. We visualize $f_{1}$ and $f_{2}$ in Figure 1.

Figure 1: Simulation target densities $f_{1}$ (solid line) and $f_{2}$ (dashed line).

To observe the rate of decrease of the $h$-lifted KL divergence between the targets and respective sequences of $h$-MLLEs, we conduct two experiments, E1 and E2. In E1, our target density is set to $f_{1}$ and $h_{1}=\beta(\cdot;1/2,1/2)$. For each $n\in\{2^{10},\dots,2^{15}\}$ and $k\in\{2,\dots,8\}$, we independently simulate $\mathbf{X}_{n}$ and $\mathbf{Y}_{n}$ with each $X_{i}$ and $Y_{i}$ ($i\in[n]$), i.i.d., from the distributions characterized by $f_{1}$ and $h_{1}$, respectively. In E2, we target $f_{2}$ with $h$-MLLEs over the same ranges of $k$ and $n$, but with $h_{2}=\beta(\cdot;1,1)$, the density of the uniform distribution. For each $k$ and $n$, we simulate $\mathbf{X}_{n}$ and $\mathbf{Y}_{n}$, respectively, from distributions characterized by $f_{2}$ and $h_{2}$.

In both experiments, we simulate $r=50$ replicates of each $(k,n)$-scenario and compute the corresponding $h$-MLLEs, $(f_{k,n,l})_{l\in[r]}$, using the previously described MM algorithm. For each $l\in[r]$, we compute the corresponding negative log $h$-lifted likelihood between the target $f$ and $f_{k,n,l}$:

K_{k,n,l}=-\int_{\mathcal{X}}\left(f+h\right)\log\left(f_{k,n,l}+h\right)\mathrm{d}\mu

to assess the rates, and note that

\mathrm{KL}_{h}\left(f\,||\,f_{k,n,l}\right)=\int_{\mathcal{X}}\left(f+h\right)\log\left(f+h\right)\mathrm{d}\mu+K_{k,n,l},

where the first term on the right-hand side is a constant with respect to $k$ and $n$.
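The following sketch shows how $K_{k,n,l}$ can be evaluated by adaptive quadrature in the E2 setting (target $f_{2}$, uniform lift $h_{2}$); the fitted mixture used here is a placeholder, and the implementation details are assumptions rather than code from our repository.

import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

def f2(x):
    # Target density f_2 on [0, 1].
    return np.where(x <= 0.5, 2.0 - 4.0 * x, -2.0 + 4.0 * x)

h2 = lambda x: beta.pdf(x, 1.0, 1.0)          # uniform density on [0, 1]

def K_value(f_hat, f=f2, h=h2):
    # K = -int (f + h) log(f_hat + h) dmu, estimated by adaptive quadrature.
    integrand = lambda x: -(f(x) + h(x)) * np.log(f_hat(x) + h(x))
    return quad(integrand, 0.0, 1.0, limit=200)[0]

# Placeholder fitted mixture: a single Beta(2, 2) component (illustration only).
f_hat = lambda x: beta.pdf(x, 2.0, 2.0)
print(K_value(f_hat))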

To analyze the sample of $7\times 6\times 50=2100$ observations of the relationship between the values $(K_{k,n,l})_{l\in[r]}$ and the corresponding values of $k$ and $n$, we use non-linear least squares (Amemiya, 1985, Sec. 4.3) to fit the regression relationship

\mathbf{E}\left[K_{k,n,l}\right]=a_{0}+\frac{a_{1}}{\left(k+2\right)^{b_{1}}}+\frac{a_{2}}{n^{b_{2}}}. (10)

We obtain $95\%$ asymptotic confidence intervals for the estimates of the regression parameters $a_{0}$, $a_{1}$, $a_{2}$, $b_{1}$, $b_{2}\in\mathbb{R}$, under the assumption of potential mis-specification of (10), by using the sandwich estimator for the asymptotic covariance matrix (cf. White 1982).
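A sketch of this fitting procedure is given below; it uses scipy's curve_fit for the point estimates and a numerically differentiated Jacobian for the sandwich covariance, with synthetic placeholder data standing in for the simulated values $(K_{k,n,l})$. All implementation details are assumptions rather than code from our repository.

import numpy as np
from scipy.optimize import curve_fit

def model(kn, a0, a1, a2, b1, b2):
    k, n = kn
    return a0 + a1 / (k + 2.0) ** b1 + a2 / n ** b2

# Placeholder data over the 7 x 6 x 50 = 2100 simulation scenarios.
rng = np.random.default_rng(1)
k_vals = np.repeat(np.arange(2, 9), 6 * 50).astype(float)
n_vals = np.tile(np.repeat(2.0 ** np.arange(10, 16), 50), 7)
K_vals = model((k_vals, n_vals), -1.7, 0.7, 7.0, 1.9, 1.0) + 0.01 * rng.standard_normal(k_vals.size)

theta, _ = curve_fit(model, (k_vals, n_vals), K_vals, p0=[-1.0, 1.0, 1.0, 1.0, 0.5])

# Numerical Jacobian of the mean function at the estimates, then the sandwich
# covariance (J'J)^{-1} (J' diag(e^2) J) (J'J)^{-1}.
eps = 1e-6
J = np.column_stack([
    (model((k_vals, n_vals), *(theta + eps * np.eye(5)[j])) -
     model((k_vals, n_vals), *(theta - eps * np.eye(5)[j]))) / (2 * eps)
    for j in range(5)
])
e = K_vals - model((k_vals, n_vals), *theta)
bread = np.linalg.inv(J.T @ J)
cov = bread @ (J.T * e ** 2) @ J @ bread
se = np.sqrt(np.diag(cov))
for name, est, s in zip(["a0", "a1", "a2", "b1", "b2"], theta, se):
    print(f"{name}: {est:7.3f}  95% CI ({est - 1.96 * s:7.3f}, {est + 1.96 * s:7.3f})")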

4.3 Results

We report the estimates along with $95\%$ asymptotic confidence intervals for the parameters of (10) for E1 and E2 in Table 1. Plots of the average negative log $h$-lifted likelihood values by sample sizes $n$ and numbers of components $k$ are provided in Figure 2.

Table 1: Estimates of parameters for fitted relationships (with $95\%$ confidence intervals) between negative log $h$-lifted likelihood values, sample size and number of mixture components for experiments E1 and E2.

E1       $a_{0}$            $a_{1}$          $a_{2}$           $b_{1}$          $b_{2}$
Est.     $-1.68$            $0.73$           $6.80$            $1.87$           $0.99$
95% CI   $(-1.68,-1.67)$    $(0.68,0.78)$    $(1.24,12.36)$    $(1.81,1.93)$    $(0.87,1.11)$

E2       $a_{0}$            $a_{1}$          $a_{2}$           $b_{1}$          $b_{2}$
Est.     $-1.47$            $1.49$           $6.75$            $4.36$           $1.07$
95% CI   $(-1.48,-1.47)$    $(0.58,2.41)$    $(2.17,11.32)$    $(3.91,4.81)$    $(0.97,1.16)$
Figure 2: Average negative log $h$-lifted likelihood values by sample sizes $n$ and numbers of components $k$ for experiments E1 and E2.

From Table 1, we observe that $\mathbf{E}[K_{k,n,l}]$ decreases with both $n$ and $k$ in both simulations, and that the rates at which the averages decrease are faster than anticipated by Theorem 5, with respect to both $n$ and $k$. We can visually confirm the decreases in the estimate of $\mathbf{E}[K_{k,n,l}]$ via Figure 2. In both E1 and E2, the rate of decrease over the assessed range of $n$ is approximately proportional to $1/n$, as opposed to the anticipated rate of $1/\sqrt{n}$, whereas the rate of decrease in $k$ is far faster, at approximately $1/k^{1.87}$ for E1 and $1/k^{4.36}$ for E2.

These observations provide empirical evidence that the rate of decrease of $\mathbf{E}[K_{k,n,l}]$ is at least $1/k$ in $k$ and $1/\sqrt{n}$ in $n$, at least over the simulation scenarios. These fast rates of fit over small values of $n$ and $k$ may be indicative of a diminishing returns of fit phenomenon, as discussed in Cadez & Smyth (2000), or the so-called elbow phenomenon (see, e.g., Ritter 2014, Sec. 4.2), whereupon the rate of decrease in average loss for small values of $k$ is fast and becomes slower as $k$ increases, converging to some asymptotic rate. This is also the reason why we do not include the outcomes when $k=1$: the drop in $\mathbf{E}[K_{k,n,l}]$ between $k=1$ and $k=2$ is so dramatic that it makes our simulated data ill-fitted by any model of form (10). As such, we do not view Theorem 5 as being pessimistic in light of these phenomena, as it applies uniformly over all values of $k$ and $n$.

5 Conclusion

The estimation of probability densities using finite mixtures from some base class $\mathcal{P}$ appears often in machine learning and statistical inference as a natural method for modelling underlying data generating processes. In this work, we pursue novel generalization bounds for such mixture estimators. To this end, we introduce the family of $h$-lifted KL divergences for densities on compact supports, within the family of Bregman divergences, which correspond to risk functions that can be bounded, even when densities in the class $\mathcal{P}$ are not bounded away from zero, unlike the standard KL divergence.

Unlike the least-squares loss, the corresponding maximum $h$-lifted likelihood estimation problem can be solved via an MM algorithm, mirroring the availability of EM algorithms for the maximum likelihood problem corresponding to the KL divergence. Along with our derivations of generalization bounds that achieve the same rates as the best-known bounds for the KL divergence and the least-squares loss, we also provide numerical evidence towards the correctness of these bounds in the case where $\mathcal{P}$ corresponds to beta densities.

Aside from beta distributions, mixture densities on compact supports that can be analysed under our framework appear frequently in the literature. For supports on compact Euclidean subsets, examples include mixtures of Dirichlet distributions (Fan et al., 2012) and bivariate binomial distributions (Papageorgiou & David, 1994). Alternatively, one can consider distributions on compact Euclidean manifolds, such as mixtures of Kent distributions (Peel et al., 2001) and von Mises–Fisher distributions (Banerjee et al., 2005, Ng & Kwong, 2022). We defer investigating the practical performance of maximum $h$-lifted likelihood estimators and the accompanying theory for such models to future work.

Acknowledgments

We express sincere gratitude to the Reviewers and Action Editor for their valuable feedback, which has helped to improve the quality of this paper. Hien Duy Nguyen and TrungTin Nguyen acknowledge funding from the Australian Research Council grant DP230100905.

References

  • Amari (2016) Shun-ichi Amari. Information Geometry and Its Applications, volume 194. Springer, New York, 2016.
  • Amemiya (1985) Takeshi Amemiya. Advanced econometrics. Harvard University Press, 1985.
  • Banerjee et al. (2005) Arindam Banerjee, Inderjit S Dhillon, Joydeep Ghosh, Suvrit Sra, and Greg Ridgeway. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(9), 2005.
  • Bartlett & Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Basu et al. (1998) Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
  • Basu et al. (2011) Ayanendranath Basu, Hiroyuki Shioya, and Chanseok Park. Statistical Inference: The Minimum Distance Approach. CRC press, Boca Raton, 2011.
  • Cadez & Smyth (2000) Igor Cadez and Padhraic Smyth. Model complexity, goodness of fit and diminishing returns. Advances in Neural Information Processing Systems, 13, 2000.
  • Csiszár (1995) Imre Csiszár. Generalized projections for non-negative functions. In Proceedings of 1995 IEEE International Symposium on Information Theory, pp.  6. IEEE, 1995.
  • Dalalyan & Sebbar (2018) Arnak S Dalalyan and Mehdi Sebbar. Optimal Kullback–Leibler aggregation in mixture density estimation by maximum likelihood. Mathematical Statistics and Learning, 1(1):1–35, 2018.
  • DeVore & Temlyakov (2016) Ronald A DeVore and Vladimir N Temlyakov. Convex optimization on Banach spaces. Foundations of Computational Mathematics, 16(2):369–394, 2016.
  • Devroye & Györfi (1985) Luc Devroye and László Györfi. Nonparametric Density Estimation: The L1 View. Wiley Interscience Series in Discrete Mathematics. Wiley, 1985.
  • Devroye & Lugosi (2001) Luc Devroye and Gabor Lugosi. Combinatorial Methods in Density Estimation. Springer Science & Business Media, 2001.
  • Fan et al. (2012) Wentao Fan, Nizar Bouguila, and Djemel Ziou. Variational learning for finite Dirichlet mixture models and applications. IEEE Transactions on Neural Networks and Learning Systems, 23:762–774, 2012.
  • Ferrari & Yang (2010) Davide Ferrari and Yuhong Yang. Maximum Lq-likelihood method. Annals of Statistics, 38:573–583, 2010.
  • Frigyik et al. (2008) Béla A Frigyik, Santosh Srivastava, and Maya R Gupta. Functional Bregman divergence and Bayesian estimation of distributions. IEEE Transactions on Information Theory, 54(11):5130–5139, 2008.
  • Fujisawa & Eguchi (2006) Hironori Fujisawa and Shinto Eguchi. Robust estimation in the normal mixture model. Journal of Statistical Planning and Inference, 136(11):3989–4011, 2006.
  • Ghosal (2001) Subhashis Ghosal. Convergence rates for density estimation with Bernstein polynomials. The Annals of Statistics, 29(5):1264–1280, 2001.
  • Gil et al. (2013) Manuel Gil, Fady Alajaji, and Tamas Linder. Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 249:124–131, 2013.
  • Giraud (2021) Christophe Giraud. Introduction to high-dimensional statistics. CRC Press, 2021.
  • Haagerup (1981) Uffe Haagerup. The best constants in the Khintchine inequality. Studia Mathematica, 70(3):231–283, 1981.
  • Hunter & Lange (2004) David R Hunter and Kenneth Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.
  • Klemelä (2007) Jussi S Klemelä. Density estimation with stagewise optimization of the empirical risk. Machine Learning, 67:169–195, 2007.
  • Klemelä (2009) Jussi S Klemelä. Smoothing of multivariate data: density estimation and visualization. John Wiley & Sons, New York, 2009.
  • Kobayashi & Watanabe (2021) Masahiro Kobayashi and Kazuho Watanabe. Generalized Dirichlet-process-means for f-separable distortion measures. Neurocomputing, 458:667–689, 2021.
  • Kobayashi & Watanabe (2024) Masahiro Kobayashi and Kazuho Watanabe. Unbiased Estimating Equation and Latent Bias under f-Separable Bregman Distortion Measures. IEEE Transactions on Information Theory, 2024.
  • Koltchinskii (2011) Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École D’Été de Probabilités de Saint-Flour XXXVIII-2008. Lecture Notes in Mathematics. Springer, 2011.
  • Koltchinskii & Panchenko (2004) Vladimir Koltchinskii and Dmitry Panchenko. Rademacher processes and bounding the risk of function learning. arXiv: Probability, pp.  443–457, 2004.
  • Kosorok (2007) Michael R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics. Springer New York, 2007.
  • Lange (2013) Kenneth Lange. Optimization, volume 95. Springer Science & Business Media, New York, 2013.
  • Lange (2016) Kenneth Lange. MM optimization algorithms. SIAM, Philadelphia, 2016.
  • Li & Barron (1999) Jonathan Li and Andrew Barron. Mixture Density Estimation. In S. Solla, T. Leen, and K. Müller (eds.), Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999.
  • Massart (2007) Pascal Massart. Concentration inequalities and model selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII-2003. Springer, 2007.
  • Maugis & Michel (2011) Cathy Maugis and Bertrand Michel. A non-asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: Probability and Statistics, 15:41–68, 2011.
  • Maugis-Rabusseau & Michel (2013) Cathy Maugis-Rabusseau and Bertrand Michel. Adaptive density estimation for clustering with Gaussian mixtures. ESAIM: Probability and Statistics, 17:698–724, 2013.
  • McDiarmid (1989) Colin McDiarmid. On the method of bounded differences. In J.Editor Siemons (ed.), Surveys in Combinatorics, 1989: Invited Papers at the Twelfth British Combinatorial Conference, London Mathematical Society Lecture Note Series, pp.  148–188. Cambridge University Press, 1989.
  • McDiarmid (1998) Colin McDiarmid. Concentration. In Michel Habib, Colin McDiarmid, Jorge Ramirez-Alfonsin, and Bruce Reed (eds.), Probabilistic Methods for Algorithmic Discrete Mathematics, pp.  195–248. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.
  • McLachlan & Peel (2004) Geoffrey J McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, 2004.
  • Meir & Zeevi (1997) Ronny Meir and Assaf Zeevi. Density estimation through convex combinations of densities: Approximation and estimation bounds. Neural Networks, 10:99–109, 02 1997.
  • Naito & Eguchi (2013) Kanta Naito and Shinto Eguchi. Density estimation with minimization of U-divergence. Machine Learning, 90(1):29–57, January 2013.
  • Ng & Kwong (2022) Tin Lok James Ng and Kwok-Kun Kwong. Universal approximation on the hypersphere. Communications in Statistics-Theory and Methods, 51:8694–8704, 2022.
  • Nguyen (2017) Hien D Nguyen. An introduction to majorization-minimization algorithms for machine learning and statistical estimation. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(2):e1198, 2017.
  • Nguyen et al. (2021) Hien D Nguyen, TrungTin Nguyen, Faicel Chamroukhi, and Geoffrey J McLachlan. Approximations of conditional probability density functions in Lebesgue spaces via mixture of experts models. Journal of Statistical Distributions and Applications, 8(1):13, August 2021.
  • Nguyen et al. (2022a) Hien D Nguyen, Florence Forbes, Gersende Fort, and Olivier Cappé. An online minorization-maximization algorithm. In Proceedings of the 17th Conference of the International Federation of Classification Societies, 2022a.
  • Nguyen et al. (2020) TrungTin Nguyen, Hien D Nguyen, Faicel Chamroukhi, and Geoffrey J McLachlan. Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Mathematics & Statistics, 7:1750861, 2020.
  • Nguyen et al. (2022b) TrungTin Nguyen, Faicel Chamroukhi, Hien D Nguyen, and Geoffrey J McLachlan. Approximation of probability density functions via location-scale finite mixtures in Lebesgue spaces. Communications in Statistics - Theory and Methods, pp.  1–12, 2022b.
  • Nguyen et al. (2022c) TrungTin Nguyen, Hien D Nguyen, Faicel Chamroukhi, and Florence Forbes. A non-asymptotic approach for model selection via penalization in high-dimensional mixture of experts models. Electronic Journal of Statistics, 16(2):4742–4822, 2022c.
  • Papageorgiou & David (1994) Haris Papageorgiou and Katerina M David. On countable mixtures of bivariate binomial distributions. Biometrical Journal, 36(5):581–601, 1994.
  • Pardo (2006) Leandro Pardo. Statistical Inference Based on Divergence Measures. CRC Press, Boca Raton, 2006.
  • Peel et al. (2001) David Peel, William J Whiten, and Geoffrey J McLachlan. Fitting mixtures of Kent distributions to aid in joint set identification. Journal of the American Statistical Association, 96:56–63, 2001.
  • Petrone (1999) Sonia Petrone. Random Bernstein Polynomials. Scandinavian Journal of Statistics, 26(3):373–393, 1999.
  • Petrone & Wasserman (2002) Sonia Petrone and Larry Wasserman. Consistency of Bernstein Polynomial Posteriors. Journal of the Royal Statistical Society Series B: Statistical Methodology, 64(1):79–100, March 2002.
  • Qin & Priebe (2013) Yichen Qin and Carey E Priebe. Maximum $L_{q}$-likelihood estimation via the expectation-maximization algorithm: a robust estimation of mixture models. Journal of the American Statistical Association, 108(503):914–928, 2013.
  • Rakhlin et al. (2005) Alexander Rakhlin, Dmitry Panchenko, and Sayan Mukherjee. Risk bounds for mixture density estimation. ESAIM: PS, 9:220–229, 2005.
  • Razaviyayn et al. (2013) Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A Unified Convergence Analysis of Block Successive Minimization Methods for Nonsmooth Optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
  • Ritter (2014) Gunter Ritter. Robust cluster analysis and variable selection. CRC Press, Boca Raton, 2014.
  • Rockafellar (1997) Ralph Tyrrell Rockafellar. Convex analysis, volume 11. Princeton University Press, Princeton, 1997.
  • Shalev-Shwartz & Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: from Theory to Algorithms. Cambridge University Press, Cambridge, 2014.
  • Shapiro et al. (2021) Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory, Third Edition. Society for Industrial and Applied Mathematics, Philadelphia, PA, July 2021.
  • Stummer & Vajda (2012) Wolfgang Stummer and Igor Vajda. On Bregman distances and divergences of probability measures. IEEE Transactions on Information Theory, 58(3):1277–1288, 2012.
  • Sundberg (2019) Rolf Sundberg. Statistical Modelling by Exponential Families. Cambridge University Press, Cambridge, 2019.
  • Temlyakov (2016) Vladimir N Temlyakov. Convergence and rate of convergence of some greedy algorithms in convex optimization. Proceedings of the Steklov Institute of Mathematics, 293:325–337, 2016.
  • van de Geer (2016) Sara van de Geer. Estimation and Testing Under Sparsity: École d’Été de Probabilités de Saint-Flour XLV – 2015. Springer, 2016.
  • van der Vaart & Wellner (1996) Aad W van der Vaart and Jon A Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.
  • Verbeek et al. (2003) Jakob J Verbeek, Nikos Vlassis, and Ben Kröse. Efficient greedy learning of Gaussian mixture models. Neural computation, 15(2):469–485, 2003.
  • Vlassis & Likas (2002) Nikos Vlassis and Aristidis Likas. A greedy EM algorithm for Gaussian mixture learning. Neural Processing Letters, 15:77–87, 2002.
  • White (1982) Halbert White. Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1–25, 1982.
  • Wu & Lange (2010) Tong Tong Wu and Kenneth Lange. The MM alternative to EM. Statistical Science, 25(4):492–505, 2010.
  • Zhang (2003) Tong Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682–691, 2003.

Appendix A Proofs of main results

The following section is devoted to establishing some technical definitions and instrumental results which are used to prove Theorem 5 and Corollary 6, and also includes the proofs of these results themselves.

A.1 Preliminaries

Recall that we are interested in bounds of the form (2). Note that $\mathcal{P}$ is a subset of the linear space

\mathcal{V}=\mathrm{cl}\left(\bigcup_{k\in\mathbb{N}}\left\{\sum_{j=1}^{k}\varpi_{j}\varphi\left(\cdot;\theta_{j}\right)\mid\varphi\left(\cdot;\theta_{j}\right)\in\mathcal{P},\,\varpi_{j}\in\mathbb{R},\,j\in\left[k\right]\right\}\right),

and hence we can apply the following result, paraphrased from Zhang (2003, Thm. II.1).

Lemma 8.

Let $\kappa$ be a differentiable and convex function on $\mathcal{V}$, and let $(\bar{f}_{k})_{k\in\mathbb{N}}$ be a sequence of functions obtained by Algorithm 1. If

\sup_{p,q\in\mathcal{C},\,\pi\in\left(0,1\right)}\frac{\mathrm{d}^{2}}{\mathrm{d}\pi^{2}}\kappa\left(\left(1-\pi\right)p+\pi q\right)\leq\mathfrak{M}<\infty,

then, for each $k\in\mathbb{N}$,

\kappa\left(\bar{f}_{k}\right)-\inf_{p\in\mathcal{C}}\kappa\left(p\right)\leq\frac{2\mathfrak{M}}{k+2}.
Algorithm 1: Algorithm for computing a greedy approximation sequence.
1: $\bar{f}_{0}\in\mathcal{P}$
2: for $k\in\mathbb{N}$ do
3:     Compute $(\bar{\pi}_{k},\bar{\theta}_{k})=\operatorname*{arg\,min}_{(\pi,\theta)\in[0,1]\times\Theta}\kappa\left((1-\pi)\bar{f}_{k-1}+\pi\varphi(\cdot;\theta)\right)$
4:     Define $\bar{f}_{k}=(1-\bar{\pi}_{k})\bar{f}_{k-1}+\bar{\pi}_{k}\varphi(\cdot;\bar{\theta}_{k})$
5: end for
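A minimal Python sketch of Algorithm 1, applied to the sample criterion $\kappa_{n}$ defined in (12) below with beta components and a crude grid search in step 3, is as follows (all choices here are illustrative assumptions rather than code from our repository):

import numpy as np
from scipy.stats import beta

def kappa_n(p, x, y, f, h):
    # Sample KL_h criterion (12) evaluated at a candidate density p.
    return (np.mean(np.log((f(x) + h(x)) / (p(x) + h(x))))
            + np.mean(np.log((f(y) + h(y)) / (p(y) + h(y)))))

def greedy_step(f_bar, x, y, f, h, grid_pi, grid_theta):
    best = None
    for pi in grid_pi:
        for (a, b) in grid_theta:
            cand = lambda z, pi=pi, a=a, b=b: (1 - pi) * f_bar(z) + pi * beta.pdf(z, a, b)
            val = kappa_n(cand, x, y, f, h)
            if best is None or val < best[0]:
                best = (val, cand)
    return best[1]

# Toy usage with a uniform target and uniform lift (placeholders).
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 1, 200), rng.uniform(0, 1, 200)
f = h = lambda z: np.ones_like(z)
grid_pi = np.linspace(0.05, 0.95, 10)
grid_theta = [(a, b) for a in (0.5, 1.0, 2.0, 5.0) for b in (0.5, 1.0, 2.0, 5.0)]
f_bar = lambda z: beta.pdf(z, 2.0, 2.0)     # initial element of P
for _ in range(3):                          # a few greedy iterations
    f_bar = greedy_step(f_bar, x, y, f, h, grid_pi, grid_theta)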

We are interested in two choices for $\kappa$:

\kappa\left(p\right)=\mathrm{KL}_{h}\left(f\,||\,p\right) (11)

and its sample counterpart,

\kappa_{n}\left(p\right)=\frac{1}{n}\sum_{i=1}^{n}\log\frac{f\left(X_{i}\right)+h\left(X_{i}\right)}{p\left(X_{i}\right)+h\left(X_{i}\right)}+\frac{1}{n}\sum_{i=1}^{n}\log\frac{f\left(Y_{i}\right)+h\left(Y_{i}\right)}{p\left(Y_{i}\right)+h\left(Y_{i}\right)}, (12)

where $(X_{i})_{i\in[n]}$ and $(Y_{i})_{i\in[n]}$ are realisations of $X$ and $Y$, respectively. We obtain the following important results.

Proposition 9.

Let ϰ\varkappa denote either κ\kappa, the KLh\mathrm{KL}_{h} divergence (11), or κn\kappa_{n}, the sample KLh\mathrm{KL}_{h} divergence (12), and assume that hah\geq a and φ(;θ)c\varphi\left(\cdot;\theta\right)\leq c, for each θΘ\theta\in\Theta. Then,

ϰ(f¯k)infp𝒞ϰ(p)4a2c2k+2,\varkappa\left(\bar{f}_{k}\right)-\inf_{p\in\mathcal{C}}\varkappa\left(p\right)\leq\frac{4a^{-2}c^{2}}{k+2},

for each kk\in\mathbb{N}, where (f¯k)k\left(\bar{f}_{k}\right)_{k\in\mathbb{N}} is obtained as per Algorithm 1.

Proof.

See Appendix C.4. ∎

Notice that sequences (f¯k)k\left(\bar{f}_{k}\right)_{k\in\mathbb{N}} obtained via Algorithm 1 are greedy approximation sequences, and that f¯k𝒞k\bar{f}_{k}\in\mathcal{C}_{k}, for each kk\in\mathbb{N}. Let (fk)k\left(f_{k}\right)_{k\in\mathbb{N}} be the sequence of minimizers defined by

fk=argminψk𝒮k×ΘkKLh(f||fk(;ψk)),f_{k}=\underset{\psi_{k}\in\mathcal{S}_{k}\times\Theta^{k}}{\operatorname*{arg\,min\,}}\mathrm{KL}_{h}\left(f\,||\,f_{k}\left(\cdot;\psi_{k}\right)\right), (13)

and let (fk,n)k\left(f_{k,n}\right)_{k\in\mathbb{N}} be the sequence of hh-MLLEs, as per (5). Then, by definition, we have the fact that κ(fk)κ(f¯k)\kappa\left(f_{k}\right)\leq\kappa\left(\bar{f}_{k}\right) and κ(fk,n)κ(f¯k)\kappa\left(f_{k,n}\right)\leq\kappa\left(\bar{f}_{k}\right), for κ\kappa set as (11) or (12), respectively. Thus, we have the following result.

Proposition 10.

For the KLh\mathrm{KL}_{h} divergence (11), under the assumption that hah\geq a and φ(;θ)c\varphi\left(\cdot;\theta\right)\leq c, for each θΘ\theta\in\Theta, we have

κ(fk)infp𝒞κ(p)4a2c2k+2\kappa\left(f_{k}\right)-\inf_{p\in\mathcal{C}}\kappa\left(p\right)\leq\frac{4a^{-2}c^{2}}{k+2} (14)

for each kk\in\mathbb{N}, where (fk)k\left(f_{k}\right)_{k\in\mathbb{N}} is the sequence of minimizers defined via (13). Furthermore, for the sample KLh\mathrm{KL}_{h} divergence (12), under the same assumptions as above, we have

κn(fk,n)infp𝒞κn(p)4a2c2k+2,\kappa_{n}\left(f_{k,n}\right)-\inf_{p\in\mathcal{C}}\kappa_{n}\left(p\right)\leq\frac{4a^{-2}c^{2}}{k+2}, (15)

for each kk\in\mathbb{N}, where (fk,n)k\left(f_{k,n}\right)_{k\in\mathbb{N}} are hh-MLLEs defined via (5).

As is common in many statistical learning and uniform convergence results (e.g., Bartlett & Mendelson, 2002, Koltchinskii & Panchenko, 2004), we employ Rademacher processes and associated bounds. Let (εi)i[n](\varepsilon_{i})_{i\in[n]} be i.i.d. Rademacher random variables, that is, 𝐏(εi=−1)=𝐏(εi=1)=1/2\mathrm{\mathbf{P}}(\varepsilon_{i}=-1)=\mathrm{\mathbf{P}}(\varepsilon_{i}=1)=\nicefrac{{1}}{{2}}, independent of (Xi)i[n](X_{i})_{i\in[n]}. The Rademacher process, indexed by a class of real measurable functions 𝒮\mathcal{S}, is defined as the quantity

Rn(s)=1ni=1ns(Xi)εi,R_{n}(s)=\frac{1}{n}\sum_{i=1}^{n}s(X_{i})\varepsilon_{i},

for s𝒮s\in\mathcal{S}. The Rademacher complexity of the class 𝒮\mathcal{S} is given by n(𝒮)=𝐄sups𝒮|Rn(s)|\mathcal{R}_{n}(\mathcal{S})=\operatorname*{{\mathrm{\mathbf{E}}}}\sup_{s\in\mathcal{S}}\lvert R_{n}(s)\rvert.
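As a concrete illustration of this definition, the sketch below estimates the conditional Rademacher complexity 𝐄ε supθ |n⁻¹∑i φ(Xi;θ)εi| by Monte Carlo, with a finite grid of Beta-density parameters standing in for Θ\Theta; the component family, the grid, and the sample are assumptions made only for the example.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)

n = 200
X = rng.uniform(0.0, 1.0, size=n)                       # a fixed sample (we condition on X_n)
thetas = [(a, b) for a in np.linspace(0.5, 5.0, 10)     # finite stand-in for Theta
          for b in np.linspace(0.5, 5.0, 10)]
Phi = np.stack([beta.pdf(X, a, b) for a, b in thetas])  # |grid| x n matrix of phi(X_i; theta)

def empirical_rademacher(Phi, n_mc=2000):
    """Monte Carlo estimate of E_eps sup_theta | n^{-1} sum_i phi(X_i; theta) eps_i |."""
    n = Phi.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))       # i.i.d. Rademacher draws
    sups = np.abs(eps @ Phi.T / n).max(axis=1)          # per-draw supremum over the grid
    return sups.mean()

print("Monte Carlo estimate of the empirical Rademacher complexity:",
      empirical_rademacher(Phi))
```

By Lemma 12 below, the same value would be obtained if the supremum were taken over the convex hull of these basis functions rather than over the basis functions themselves.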

In the subsequent section, we make use of the following result regarding the supremum of convex functions:

Lemma 11 (Rockafellar, 1997, Thm. 32.2).

Let η\eta be a convex function on a linear space 𝒯\mathcal{T}, and let 𝒮𝒯\mathcal{S}\subset\mathcal{T} be an arbitrary subset. Then,

supp𝒮η(p)=suppco(𝒮)η(p).\sup_{p\in\mathcal{S}}\eta\left(p\right)=\sup_{p\in\mathrm{co}\left(\mathcal{S}\right)}\eta\left(p\right).

In particular, we use the fact that since a linear functional of convex combinations achieves its maximum value at vertices, the Rademacher complexity of 𝒮\mathcal{S} is equal to the Rademacher complexity of co(𝒮)\mathrm{co}(\mathcal{S}) (see Lemma 21). We consequently obtain the following result.

Lemma 12.

Let (εi)i[n](\varepsilon_{i})_{i\in[n]} be i.i.d. Rademacher random variables, independent of (Xi)i[n](X_{i})_{i\in[n]}, and let 𝒫\mathcal{P} be defined as above. Then the sets 𝒞\mathcal{C} and 𝒫\mathcal{P} have equal Rademacher complexity, n(𝒞)=n(𝒫)\mathcal{R}_{n}(\mathcal{C})=\mathcal{R}_{n}(\mathcal{P}), and the expected supremum of the Rademacher process indexed by 𝒞\mathcal{C} is equal to that taken over the basis functions of 𝒫\mathcal{P}:

𝐄εsupg𝒞|1ni=1ng(Xi)εi|=𝐄εsupθΘ|1ni=1nφ(Xi;θ)εi|.{\operatorname*{{\mathrm{\mathbf{E}}}}}_{\varepsilon}\sup_{g\in\mathcal{C}}\left\lvert\frac{1}{n}\sum_{i=1}^{n}g(X_{i})\varepsilon_{i}\right\rvert={\operatorname*{{\mathrm{\mathbf{E}}}}}_{\varepsilon}\sup_{\theta\in\Theta}\left\lvert\frac{1}{n}\sum_{i=1}^{n}\varphi(X_{i};\theta)\varepsilon_{i}\right\rvert.
Proof.

Follows immediately from Lemma 11. ∎

A.2 Proofs

We first present a result establishing a uniform concentration bound for the hh-lifted log-likelihood ratios, which is instrumental in the proof of Theorem 5. Our proofs broadly follow the structure of Rakhlin et al. (2005), modified as needed for the use of KLh\mathrm{KL}_{h}.

Assume that 0φ(;θ)<c0\leq\varphi(\cdot;\theta)<{c} for some c>0c\in\operatorname*{\mathbb{R}}_{>0}. For brevity, we adopt the notation: T(g)𝒞=supg𝒞|T(g)|\left\lVert T(g)\right\rVert_{\mathcal{C}}=\sup_{g\in\mathcal{C}}\lvert T(g)\rvert.

Theorem 13.

Let X1,,XnX_{1},\ldots,X_{n} be an i.i.d. sample of size nn drawn from a fixed density ff such that 0f(x)c0\leq f(x)\leq c for all x𝒳x\in\mathcal{X}, and let hh be a positive density with 0<ah(x)b0<a\leq h(x)\leq b for all x𝒳x\in\mathcal{X}. Then, for each t>0t>0, with probability at least 1et1-\mathrm{e}^{-t},

1ni=1nlogg(Xi)+h(Xi)f(Xi)+h(Xi)𝐄flogg+hf+h𝒞w1n𝐄0clog1/2N(𝒫,ε,dn,x)dε+w2n+w3tn,\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{g(X_{i})+h(X_{i})}{f(X_{i})+h(X_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{g+h}{f+h}\right\rVert_{\mathcal{C}}\leq\frac{w_{1}}{\sqrt{n}}\operatorname*{{\mathrm{\mathbf{E}}}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon,d_{n,x})\mathrm{d}\varepsilon+\frac{w_{2}}{\sqrt{n}}+w_{3}\sqrt{\frac{t}{n}},

where w1w_{1}, w2w_{2}, and w3w_{3} are constants that each depend on some or all of aa, bb, and cc, and N(𝒫,ε,dn,x)N(\mathcal{P},\varepsilon,d_{n,x}) is the ε\varepsilon-covering number of 𝒫\mathcal{P} with respect to the following empirical L2L_{2} metric

dn,x2(φ1,φ2)=1ni=1n(φ1(Xi)φ2(Xi))2.d_{n,x}^{2}(\varphi_{1},\varphi_{2})=\frac{1}{n}\sum_{i=1}^{n}(\varphi_{1}(X_{i})-\varphi_{2}(X_{i}))^{2}.
Remark 14.

The bound on the term

1ni=1nlogg(Yi)+h(Yi)f(Yi)+h(Yi)𝐄hlogg+hf+h𝒞\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{g(Y_{i})+h(Y_{i})}{f(Y_{i})+h(Y_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\frac{g+h}{f+h}\right\rVert_{\mathcal{C}}

is the same as the above, except where the empirical distance dn,xd_{n,x} is replaced by dn,yd_{n,y}, defined in the same way as dn,xd_{n,x} but with YiY_{i} replacing XiX_{i}.

Proof of Theorem 13.

Fix hh and define the following quantities: g~=g+h\tilde{g}=g+h, f~=f+h\tilde{f}=f+h, 𝒞~=𝒞+h\tilde{\mathcal{C}}=\mathcal{C}+h,

mi=logg~(Xi)f~(Xi),mi=logg~(Xi)f~(Xi),Z(x1,,xn)=1ni=1nlogg~(Xi)f~(Xi)𝐄logg~f~𝒞~.m_{i}=\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})},\quad m_{i}^{\prime}=\log\frac{\tilde{g}(X_{i}^{\prime})}{\tilde{f}(X_{i}^{\prime})},\quad Z(x_{1},\ldots,x_{n})=\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}-\operatorname*{{\mathrm{\mathbf{E}}}}\log\frac{\tilde{g}}{\tilde{f}}\right\rVert_{\tilde{\mathcal{C}}}.

We first apply McDiarmid’s inequality (Lemma 23) to the random variable ZZ. The bound on the martingale difference is given by

|Z(X1,,Xi,,Xn)Z(X1,,Xi,,Xn)|\displaystyle\left\lvert Z(X_{1},\ldots,X_{i},\ldots,X_{n})-Z(X_{1},\ldots,X_{i}^{\prime},\ldots,X_{n})\right\rvert =|𝐄logg~f~1n(m1++mi++mn)𝒞~\displaystyle=\left\lvert\left\lVert\operatorname*{{\mathrm{\mathbf{E}}}}\log\frac{\tilde{g}}{\tilde{f}}-\frac{1}{n}(m_{1}+\!...\!+m_{i}+\!...\!+m_{n})\right\rVert_{\tilde{\mathcal{C}}}\right.
𝐄logg~f~1n(m1++mi++mn)𝒞~|\displaystyle-\left.\left\lVert\operatorname*{{\mathrm{\mathbf{E}}}}\log\frac{\tilde{g}}{\tilde{f}}-\frac{1}{n}(m_{1}+\!...\!+m_{i}^{\prime}+\!...\!+m_{n})\right\rVert_{\tilde{\mathcal{C}}}\right\rvert
1nlogg~(Xi)f~(Xi)logg~(Xi)f~(Xi)𝒞~\displaystyle\leq\frac{1}{n}\left\lVert\log\frac{\tilde{g}(X_{i}^{\prime})}{\tilde{f}(X_{i}^{\prime})}-\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}\right\rVert_{\tilde{\mathcal{C}}}
1n(logc+balogac+b)=1n2logc+ba=ci.\displaystyle\leq\frac{1}{n}\left(\log\frac{c+b}{a}-\log\frac{a}{c+b}\right)=\frac{1}{n}2\log\frac{c+b}{a}=c_{i}.

The chain of inequalities holds because of the triangle inequality and the properties of the supremum. By Lemma 23, we have

𝐏(Z𝐄Z>ε)exp{nε2(2logc+ba)2},{\mathrm{\mathbf{P}}}(Z-\operatorname*{{\mathrm{\mathbf{E}}}}Z>\varepsilon)\leq\exp\left\{-\frac{n\varepsilon^{2}}{(\sqrt{2}\log\frac{c+b}{a})^{2}}\right\},

so

𝐏(Zε+𝐄Z)1exp{nε2(2logc+ba)2},{\mathrm{\mathbf{P}}}(Z\leq\varepsilon+\operatorname*{{\mathrm{\mathbf{E}}}}Z)\geq 1-\exp\left\{-\frac{n\varepsilon^{2}}{(\sqrt{2}\log\frac{c+b}{a})^{2}}\right\},

where it follows from t=nε2/(2logc+ba)2t=n\varepsilon^{2}/(\sqrt{2}\log\frac{c+b}{a})^{2} that ε=2log(c+ba)tn\varepsilon=\sqrt{2}\log\left(\frac{c+b}{a}\right)\sqrt{\frac{t}{n}}. Therefore with probability at least 1et1-\mathrm{e}^{-t},

1ni=1nlogg~(Xi)f~(Xi)𝐄flogg~f~𝒞~𝐄1ni=1nlogg~(Xi)f~(Xi)𝐄flogg~f~𝒞~+2log(c+ba)tn.\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{g}}{\tilde{f}}\right\rVert_{\tilde{\mathcal{C}}}\leq\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{g}}{\tilde{f}}\right\rVert_{\tilde{\mathcal{C}}}+\sqrt{2}\log\left(\frac{c+b}{a}\right)\sqrt{\frac{t}{n}}.

Let (εi)i[n](\varepsilon_{i})_{i\in[n]} be i.i.d. Rademacher random variables, independent of (Xi)i[n](X_{i})_{i\in[n]}. By Lemma 24,

𝐄1ni=1nlogg~(Xi)f~(Xi)𝐄flogg~f~𝒞~2𝐄1ni=1nlogg~(Xi)f~(Xi)εi𝒞~.\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{g}}{\tilde{f}}\right\rVert_{\tilde{\mathcal{C}}}\leq 2\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}\varepsilon_{i}\right\rVert_{\tilde{\mathcal{C}}}.

By combining the results above, the following inequality holds with probability at least 1et1-\mathrm{e}^{-t}

1ni=1nlogg~(Xi)f~(Xi)𝐄flogg~f~𝒞~2𝐄1ni=1nlogg~(Xi)f~(Xi)εi𝒞~+2log(c+ba)tn.\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{g}}{\tilde{f}}\right\rVert_{\tilde{\mathcal{C}}}\leq 2\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}\varepsilon_{i}\right\rVert_{\tilde{\mathcal{C}}}+\sqrt{2}\log\left(\frac{c+b}{a}\right)\sqrt{\frac{t}{n}}.

Now let pi=g~(Xi)f~(Xi)1p_{i}=\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}-1, such that ac+bpi+1c+ba\frac{a}{c+b}\leq p_{i}+1\leq\frac{c+b}{a} holds for all i[n]i\in[n]. Additionally, let η(p)=log(p+1)\eta(p)=\log(p+1) so that η(0)=0\eta(0)=0 and note that for p[ac+b1,c+ba1]p\in\left[\frac{a}{c+b}-1,\frac{c+b}{a}-1\right], the derivative of η(p)\eta(p) is maximal at p=ac+b1p^{*}=\frac{a}{c+b}-1, and equal to η(p)=(c+b)/a\eta^{{{}^{\prime}}}(p^{*})=(c+b)/a. Therefore, ab+clog(p+1)\frac{a}{b+c}\log(p+1) is 11-Lipschitz. By Lemma 22 applied to η(p)\eta(p),

2𝐄1ni=1nlogg~(Xi)f~(Xi)εi𝒞~\displaystyle 2\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}\varepsilon_{i}\right\rVert_{\tilde{\mathcal{C}}} =2𝐄1ni=1nη(pi)εi𝒞~4(c+b)a𝐄1ni=1ng~(Xi)f~(Xi)εi1ni=1nεi𝒞~\displaystyle=2\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\eta(p_{i})\varepsilon_{i}\right\rVert_{\tilde{\mathcal{C}}}\leq\frac{4(c+b)}{a}\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}\varepsilon_{i}-\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}\right\rVert_{\tilde{\mathcal{C}}}
4(c+b)a𝐄1ni=1ng~(Xi)f~(Xi)εi𝒞~+4(c+b)a𝐄ε|1ni=1nεi|\displaystyle\leq\frac{4(c+b)}{a}\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}\varepsilon_{i}\right\rVert_{\tilde{\mathcal{C}}}+\frac{4(c+b)}{a}{\operatorname*{{\mathrm{\mathbf{E}}}}}_{\varepsilon}\left\lvert\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}\right\rvert
4(c+b)a𝐄1ni=1ng~(Xi)f~(Xi)εi𝒞~+4(c+b)a1n,\displaystyle\leq\frac{4(c+b)}{a}\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}\varepsilon_{i}\right\rVert_{\tilde{\mathcal{C}}}+\frac{4(c+b)}{a}\frac{1}{\sqrt{n}},

where the final inequality follows from the following result, proved in Haagerup (1981):

𝐄ε|1ni=1nεi|(𝐄ε{1ni=1nεi}2)1/2=1n.{\operatorname*{{\mathrm{\mathbf{E}}}}}_{\varepsilon}\left\lvert\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}\right\rvert\leq\left({\operatorname*{{\mathrm{\mathbf{E}}}}}_{\varepsilon}\left\{\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}\right\}^{2}\right)^{1/2}=\frac{1}{\sqrt{n}}.

Now, let ξi(g~i)=ag~(Xi)/f~(Xi)\xi_{i}(\tilde{g}_{i})=a\cdot{\tilde{g}(X_{i})}/{\tilde{f}(X_{i})}, and note that

|ξi(ui)ξi(vi)|=a|f~(Xi)||u(Xi)v(Xi)||u(Xi)v(Xi)|.\lvert\xi_{i}(u_{i})-\xi_{i}(v_{i})\rvert=\frac{a}{\lvert\tilde{f}(X_{i})\rvert}\lvert u(X_{i})-v(X_{i})\rvert\leq\lvert u(X_{i})-v(X_{i})\rvert.

By again applying Lemma 22, we have

4(c+b)a𝐄1ni=1ng~(Xi)f~(Xi)εi𝒞~\displaystyle\frac{4(c+b)}{a}\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}\varepsilon_{i}\right\rVert_{\tilde{\mathcal{C}}} 8(c+b)a2𝐄1ni=1ng~(Xi)εi𝒞~\displaystyle\leq\frac{8(c+b)}{a^{2}}\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}\tilde{g}(X_{i})\varepsilon_{i}\right\rVert_{\tilde{\mathcal{C}}}
8(c+b)a2𝐄1ni=1ng(Xi)εi𝒞+8(c+b)a2𝐄|1ni=1nh(Xi)εi|\displaystyle\leq\frac{8(c+b)}{a^{2}}\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}g(X_{i})\varepsilon_{i}\right\rVert_{\mathcal{C}}+\frac{8(c+b)}{a^{2}}\operatorname*{{\mathrm{\mathbf{E}}}}\left\lvert\frac{1}{n}\sum_{i=1}^{n}h(X_{i})\varepsilon_{i}\right\rvert
8(c+b)a2𝐄1ni=1ng(Xi)εi𝒞+8(c+b)a2bn.\displaystyle\leq\frac{8(c+b)}{a^{2}}\operatorname*{{\mathrm{\mathbf{E}}}}\left\lVert\frac{1}{n}\sum_{i=1}^{n}g(X_{i})\varepsilon_{i}\right\rVert_{\mathcal{C}}+\frac{8(c+b)}{a^{2}}\frac{b}{\sqrt{n}}.

Applying Lemmas 12 and 25, the following inequality holds for some constant K>0K>0:

𝐄εsupg𝒞|1ni=1ng(Xi)εi|=𝐄εsupθΘ|1ni=1nφ(Xi;θ)εi|Kn𝐄0clog1/2N(𝒫,ε,dn,x)dε,{\operatorname*{{\mathrm{\mathbf{E}}}}}_{\varepsilon}\sup_{g\in\mathcal{C}}\left\lvert\frac{1}{n}\sum_{i=1}^{n}g(X_{i})\varepsilon_{i}\right\rvert={\operatorname*{{\mathrm{\mathbf{E}}}}}_{\varepsilon}\sup_{\theta\in\Theta}\left\lvert\frac{1}{n}\sum_{i=1}^{n}\varphi(X_{i};\theta)\varepsilon_{i}\right\rvert\leq\frac{K}{\sqrt{n}}\operatorname*{{\mathrm{\mathbf{E}}}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon,d_{n,x})\mathrm{d}\varepsilon, (16)

and combining the results together, the following inequality holds with probability at least 1et1-\mathrm{e}^{-t}:

1ni=1nlogg~(Xi)f~(Xi)𝐄flogg~f~\displaystyle\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{g}}{\tilde{f}}\right\rVert 8(c+b)Ka2n𝐄0clog1/2N(𝒫,ε,dn,x)dε+(8b+4a)(c+b)a2n+2log(c+ba)tn\displaystyle\leq\frac{8(c+b)K}{a^{2}\sqrt{n}}\operatorname*{{\mathrm{\mathbf{E}}}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon,d_{n,x})\mathrm{d}\varepsilon+\frac{(8b+4a)(c+b)}{a^{2}\sqrt{n}}+\sqrt{2}\log\left(\frac{c+b}{a}\right)\sqrt{\frac{t}{n}}
=w1n𝐄0clog1/2N(𝒫,ε,dn,x)dε+w2n+w3tn,\displaystyle=\frac{w_{1}}{\sqrt{n}}\operatorname*{{\mathrm{\mathbf{E}}}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon,d_{n,x})\mathrm{d}\varepsilon+\frac{w_{2}}{\sqrt{n}}+w_{3}\sqrt{\frac{t}{n}}, (17)

where w1w_{1}, w2w_{2}, and w3w_{3} are constants that each depend on some or all of aa, bb, and cc. ∎

Remark 15.

From Lemma 25 we have that σn2supfPnf2\sigma^{2}_{n}\coloneqq\sup_{f\in\mathscr{F}}P_{n}f^{2}. To make explicit why we may take 2σn=2(supg𝒞Png2)1/2≤2c2\sigma_{n}=2\left(\sup_{g\in\mathcal{C}}P_{n}g^{2}\right)^{1/2}\leq 2c, let =𝒞\mathscr{F}=\mathcal{C} and observe

σn2=supg𝒞Png2=supg𝒞1ni=1ng(Xi)21ni=1nc2=c2.\sigma^{2}_{n}=\sup_{g\in\mathcal{C}}P_{n}g^{2}=\sup_{g\in\mathcal{C}}\frac{1}{n}\sum_{i=1}^{n}g(X_{i})^{2}\leq\frac{1}{n}\sum_{i=1}^{n}c^{2}=c^{2}.

Moreover, since the functions in 𝒞\mathcal{C} take values in [0,c][0,c], the dn,xd_{n,x}-diameter of 𝒫\mathcal{P} is at most cc, so that N(𝒫,ε,dn,x)=1N(\mathcal{P},\varepsilon,d_{n,x})=1, and hence log1/2N(𝒫,ε,dn,x)=0\log^{1/2}N(\mathcal{P},\varepsilon,d_{n,x})=0, for every ε>c\varepsilon>c. Consequently, the entropy integral may be truncated at cc, and the change of the upper limit of integration from 2c2c to cc is inconsequential.

As highlighted in Remark 14, the full result of Theorem 13 relies on the empirical L2L_{2} distances dn,xd_{n,x} and dn,yd_{n,y}. In the proof of Theorem 5, we make use of the following result to bound the covering numbers taken with respect to dn,xd_{n,x} and dn,yd_{n,y}.

Proposition 16.

By combining Lemmas 18 and 19, the following inequalities hold:

logN(𝒫,ε,)logN[](𝒫,ε,)logN(𝒫,ε/2,),\log N(\mathcal{P},\varepsilon,\operatorname*{\lVert\,\cdot\,\rVert})\leq\log N_{[]}(\mathcal{P},\varepsilon,\operatorname*{\lVert\,\cdot\,\rVert})\leq\log N(\mathcal{P},\varepsilon/2,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty}),

where N[](𝒫,ε,)N_{[]}(\mathcal{P},\varepsilon,\operatorname*{\lVert\,\cdot\,\rVert}) is the ε\varepsilon-bracketing number of 𝒫\mathcal{P}. Therefore, we have that

logN(𝒫,ε,dn,x)logN(𝒫,ε/2,), and logN(𝒫,ε,dn,y)logN(𝒫,ε/2,).\log N(\mathcal{P},\varepsilon,d_{n,x})\leq\log N(\mathcal{P},\varepsilon/2,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty}),\text{ and }\log N(\mathcal{P},\varepsilon,d_{n,y})\leq\log N(\mathcal{P},\varepsilon/2,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty}).

With this result, we can now prove Theorem 5.

Proof (of Theorem 5).

The notation is the same as in the proof of Theorem 13. The values of the constants may change from line to line.

KLh(f||fk,n)KLh(f||fk)=𝐄flogf~f~k,n+𝐄hlogf~f~k,n𝐄flogf~f~k𝐄hlogf~f~k\displaystyle\mathrm{KL}_{h}\left(f\,||\,f_{k,n}\right)-\mathrm{KL}_{h}\left(f\,||\,f_{k}\right)={\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{f}}{\tilde{f}_{k,n}}+{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\frac{\tilde{f}}{\tilde{f}_{k,n}}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{f}}{\tilde{f}_{k}}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\frac{\tilde{f}}{\tilde{f}_{k}}
=𝐄flogf~f~k,n1ni=1nlogf~(Xi)f~k,n(Xi)+1ni=1nlogf~(Xi)f~k,n(Xi)+𝐄hlogf~f~k,n1ni=1nlogf~(Yi)f~k,n(Yi)+1ni=1nlogf~(Yi)f~k,n(Yi)\displaystyle\ ={\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{f}}{\tilde{f}_{k,n}}-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(X_{i})}{\tilde{f}_{k,n}(X_{i})}+\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(X_{i})}{\tilde{f}_{k,n}(X_{i})}+{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\frac{\tilde{f}}{\tilde{f}_{k,n}}-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(Y_{i})}{\tilde{f}_{k,n}(Y_{i})}+\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(Y_{i})}{\tilde{f}_{k,n}(Y_{i})}
𝐄flogf~f~k+1ni=1nlogf~(Xi)f~k(Xi)1ni=1nlogf~(Xi)f~k(Xi)𝐄hlogf~f~k+1ni=1nlogf~(Yi)f~k(Yi)1ni=1nlogf~(Yi)f~k(Yi)\displaystyle\ \qquad-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{f}}{\tilde{f}_{k}}+\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(X_{i})}{\tilde{f}_{k}(X_{i})}-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(X_{i})}{\tilde{f}_{k}(X_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\frac{\tilde{f}}{\tilde{f}_{k}}+\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(Y_{i})}{\tilde{f}_{k}(Y_{i})}-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(Y_{i})}{\tilde{f}_{k}(Y_{i})}
=(𝐄flogf~f~k,n1ni=1nlogf~(Xi)f~k,n(Xi))+(𝐄hlogf~f~k,n1ni=1nlogf~(Yi)f~k,n(Yi))\displaystyle\ =\left({\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{f}}{\tilde{f}_{k,n}}-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(X_{i})}{\tilde{f}_{k,n}(X_{i})}\right)+\left({\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\frac{\tilde{f}}{\tilde{f}_{k,n}}-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(Y_{i})}{\tilde{f}_{k,n}(Y_{i})}\right)
+(1ni=1nlogf~(Xi)f~k(Xi)𝐄flogf~f~k)+(1ni=1nlogf~(Yi)f~k(Yi)𝐄hlogf~f~k)\displaystyle\ \quad+\left(\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(X_{i})}{\tilde{f}_{k}(X_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{f}}{\tilde{f}_{k}}\right)+\left(\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(Y_{i})}{\tilde{f}_{k}(Y_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\frac{\tilde{f}}{\tilde{f}_{k}}\right)
+(1ni=1nlogf~(Xi)f~k,n(Xi)1ni=1nlogf~(Xi)f~k(Xi))+(1ni=1nlogf~(Yi)f~k,n(Yi)1ni=1nlogf~(Yi)f~k(Yi))\displaystyle\ \quad+\left(\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(X_{i})}{\tilde{f}_{k,n}(X_{i})}-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(X_{i})}{\tilde{f}_{k}(X_{i})}\right)+\left(\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(Y_{i})}{\tilde{f}_{k,n}(Y_{i})}-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(Y_{i})}{\tilde{f}_{k}(Y_{i})}\right)
2supg~𝒞~|1ni=1nlogg~(Xi)f~(Xi)𝐄flogg~f~|+2supg~𝒞~|1ni=1nlogg~(Yi)f~(Yi)𝐄hlogg~f~|\displaystyle\ \leq 2\sup_{\tilde{g}\in\tilde{\mathcal{C}}}\left\lvert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(X_{i})}{\tilde{f}(X_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{\tilde{g}}{\tilde{f}}\right\rvert+2\sup_{\tilde{g}\in\tilde{\mathcal{C}}}\left\lvert\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{g}(Y_{i})}{\tilde{f}(Y_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\frac{\tilde{g}}{\tilde{f}}\right\rvert
+(1ni=1nlogf~(Xi)f~k,n(Xi)1ni=1nlogf~(Xi)f~k(Xi))+(1ni=1nlogf~(Yi)f~k,n(Yi)1ni=1nlogf~(Yi)f~k(Yi))\displaystyle\ \quad+\left(\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(X_{i})}{\tilde{f}_{k,n}(X_{i})}-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(X_{i})}{\tilde{f}_{k}(X_{i})}\right)+\left(\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(Y_{i})}{\tilde{f}_{k,n}(Y_{i})}-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}(Y_{i})}{\tilde{f}_{k}(Y_{i})}\right)
2𝐄{w1xn0clog1/2N(𝒫,ε,dn,x)dε}+w2xn+w3xtn+1ni=1nlogf~k(Xi)f~k,n(Xi)\displaystyle\ \leq 2\operatorname*{{\mathrm{\mathbf{E}}}}\left\{\frac{w^{x}_{1}}{\sqrt{n}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon,d_{n,x})\mathrm{d}\varepsilon\right\}+\frac{w^{x}_{2}}{\sqrt{n}}+w^{x}_{3}\sqrt{\frac{t}{n}}+\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}_{k}(X_{i})}{\tilde{f}_{k,n}(X_{i})}
+2𝐄{w1yn0clog1/2N(𝒫,ε,dn,y)dε}+w2yn+w3ytn+1ni=1nlogf~k(Yi)f~k,n(Yi)\displaystyle\ \qquad+2\operatorname*{{\mathrm{\mathbf{E}}}}\left\{\frac{w^{y}_{1}}{\sqrt{n}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon,d_{n,y})\mathrm{d}\varepsilon\right\}+\frac{w^{y}_{2}}{\sqrt{n}}+w^{y}_{3}\sqrt{\frac{t}{n}}+\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}_{k}(Y_{i})}{\tilde{f}_{k,n}(Y_{i})}
w1n0clog1/2N(𝒫,ε/2,)dε+w2n+w3tn+1ni=1nlogf~k(Xi)f~k,n(Xi)+1ni=1nlogf~k(Yi)f~k,n(Yi),\displaystyle\ \leq\frac{w_{1}}{\sqrt{n}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon/2,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty})\mathrm{d}\varepsilon+\frac{w_{2}}{\sqrt{n}}+w_{3}\sqrt{\frac{t}{n}}+\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}_{k}(X_{i})}{\tilde{f}_{k,n}(X_{i})}+\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}_{k}(Y_{i})}{\tilde{f}_{k,n}(Y_{i})},

with probability at least 1et1-\mathrm{e}^{-t}, by Theorem 13. Now, we can use (15) from Proposition 10 with the target density ff replaced by fkf_{k}; this replacement changes the sample criterion (12) only by an additive constant that does not depend on the argument pp, so the bound (15) applies unchanged. Writing κn(fk)\kappa_{n}^{(f_{k})} for the criterion (12) with ff replaced by fkf_{k}, we obtain

\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}_{k}(X_{i})}{\tilde{f}_{k,n}(X_{i})}+\frac{1}{n}\sum_{i=1}^{n}\log\frac{\tilde{f}_{k}(Y_{i})}{\tilde{f}_{k,n}(Y_{i})}=\kappa_{n}^{(f_{k})}\left(f_{k,n}\right)\leq\frac{4a^{-2}c^{2}}{k+2}+\inf_{p\in\mathcal{C}}\kappa_{n}^{(f_{k})}\left(p\right).

Since by definition we have that fk𝒞f_{k}\in\mathcal{C}, infp𝒞κn(fk)(p)κn(fk)(fk)=0\inf_{p\in\mathcal{C}}\kappa_{n}^{(f_{k})}\left(p\right)\leq\kappa_{n}^{(f_{k})}\left(f_{k}\right)=0, and so with probability at least 1et1-\mathrm{e}^{-t} we have:

KLh(f||fk,n)\displaystyle\mathrm{KL}_{h}\left(f\,||\,f_{k,n}\right) KLh(f||fk)w1n0clog1/2N(𝒫,ε/2,)dε+w2n+w3tn+w4k+2.\displaystyle-\mathrm{KL}_{h}\left(f\,||\,f_{k}\right)\leq\frac{w_{1}}{\sqrt{n}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon/2,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty})\mathrm{d}\varepsilon+\frac{w_{2}}{\sqrt{n}}+w_{3}\sqrt{\frac{t}{n}}+\frac{w_{4}}{k+2}. (18)

We can write the overall error as the sum of the approximation and estimation errors as follows. The former is bounded by (14), and the latter is bounded as above in (18). Therefore, with probability at least 1et1-\mathrm{e}^{-t},

KLh(f||fk,n)KLh(f||𝒞)\displaystyle\mathrm{KL}_{h}\left(f\,||\,f_{k,n}\right)-\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right) =[KLh(f||fk)KLh(f||𝒞)]+[KLh(f||fk,n)KLh(f||fk)]\displaystyle=[\mathrm{KL}_{h}\left(f\,||\,f_{k}\right)-\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right)]+[\mathrm{KL}_{h}\left(f\,||\,f_{k,n}\right)-\mathrm{KL}_{h}\left(f\,||\,f_{k}\right)]
w4k+2+w1n0clog1/2N(𝒫,ε/2,)dε+w2n+w3tn.\displaystyle\leq\frac{w_{4}}{k+2}+\frac{w_{1}}{\sqrt{n}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon/2,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty})\mathrm{d}\varepsilon+\frac{w_{2}}{\sqrt{n}}+w_{3}\sqrt{\frac{t}{n}}. (19)

As in Rakhlin et al. (2005), we rewrite the above probabilistic statement as a statement in terms of expectations. To this end, let

𝒜w4k+2+w1n0clog1/2N(𝒫,ε/2,)dε+w2n,\mathcal{A}\coloneqq\frac{w_{4}}{k+2}+\frac{w_{1}}{\sqrt{n}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon/2,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty})\mathrm{d}\varepsilon+\frac{w_{2}}{\sqrt{n}},

and 𝒵KLh(f||fk,n)KLh(f||𝒞)\mathcal{Z}\coloneqq\mathrm{KL}_{h}\left(f\,||\,f_{k,n}\right)-\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right). We have shown 𝐏(𝒵𝒜+w3tn)et{\mathrm{\mathbf{P}}}\left(\mathcal{Z}\geq\mathcal{A}+w_{3}\sqrt{\frac{t}{n}}\right)\leq\mathrm{e}^{-t}. Since 𝒵0\mathcal{Z}\geq 0,

𝐄{𝒵}=0𝒜𝐏(𝒵>s)ds+𝒜𝐏(𝒵>s)ds𝒜+0𝐏(𝒵>𝒜+s)ds.\operatorname*{{\mathrm{\mathbf{E}}}}\{\mathcal{Z}\}=\int^{\mathcal{A}}_{0}{\mathrm{\mathbf{P}}}(\mathcal{Z}>s)\mathrm{d}s+\int^{\infty}_{\mathcal{A}}{\mathrm{\mathbf{P}}}(\mathcal{Z}>s)\mathrm{d}s\leq\mathcal{A}+\int_{0}^{\infty}{\mathrm{\mathbf{P}}}(\mathcal{Z}>\mathcal{A}+s)\mathrm{d}s.

Setting s=w3tns=w_{3}\sqrt{\frac{t}{n}}, we have t=w5ns2t=w_{5}ns^{2} and 𝐄{𝒵}𝒜+0ew5ns2ds𝒜+wn\operatorname*{{\mathrm{\mathbf{E}}}}\{\mathcal{Z}\}\leq\mathcal{A}+\int_{0}^{\infty}e^{-w_{5}ns^{2}}\mathrm{d}s\leq\mathcal{A}+\frac{w}{\sqrt{n}}. Hence,

𝐄{KLh(f||fk,n)}KLh(f||𝒞)c1k+2+c2n0clog1/2N(𝒫,ε/2,)dε+c3n,{\operatorname*{{\mathrm{\mathbf{E}}}}}\left\{\mathrm{KL}_{h}\left(f\,||\,f_{k,n}\right)\right\}-\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right)\leq\frac{c_{1}}{k+2}+\frac{c_{2}}{\sqrt{n}}\int_{0}^{c}\log^{1/2}N\left(\mathcal{P},\varepsilon/2,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty}\right)\mathrm{d}\varepsilon+\frac{c_{3}}{\sqrt{n}},

where c1c_{1}, c2c_{2}, and c3c_{3} are constants that depend on some or all of aa, bb, and cc. ∎

Remark 17.

The approximation error characterises the suitability of the class 𝒞\mathcal{C}, i.e., how well functions in 𝒞\mathcal{C} are able to approximate a target ff that does not necessarily lie in 𝒞\mathcal{C}. The estimation error characterises the error arising from estimating the target ff on the basis of a finite sample of size nn.

Proof of Corollary 6.

Let 𝒳\mathcal{X} and Θ\Theta be compact and assume the Lipschitz condition given in (8). If φ(x;)\varphi(x;\cdot) is continuously differentiable, then, by the mean value theorem,

|φ(x;θ)φ(x;τ)|\displaystyle\left|\varphi\left(x;\theta\right)-\varphi\left(x;\tau\right)\right| k=1d|φ(x;)θk(θk)||θkτk|supθΘφ(x;)θ(θ)1θτ1.\displaystyle\leq\sum_{k=1}^{d}\left|\frac{\partial\varphi\left(x;\cdot\right)}{\partial\theta_{k}}\left(\theta_{k}^{*}\right)\right|\left|\theta_{k}-\tau_{k}\right|\leq\sup_{\theta^{*}\in\Theta}\left\|\frac{\partial\varphi\left(x;\cdot\right)}{\partial\theta}\left(\theta^{*}\right)\right\|_{1}\left\|\theta-\tau\right\|_{1}.

Setting

Φ(x)=supθΘφ(x;)θ(θ)1,\varPhi\left(x\right)=\sup_{\theta^{*}\in\Theta}\left\|\frac{\partial\varphi\left(x;\cdot\right)}{\partial\theta}\left(\theta^{*}\right)\right\|_{1},

we have Φ<\lVert\varPhi\rVert_{\infty}<\infty. From Lemma 20, we obtain the fact that

\log N_{[]}\left(\mathcal{P},2\varepsilon\lVert\varPhi\rVert_{\infty},{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty}\right)\leq\log N\left(\Theta,\varepsilon,{\operatorname*{\lVert\,\cdot\,\rVert}}_{1}\right),

which, by the change of variable \delta=2\varepsilon\lVert\varPhi\rVert_{\infty} (so that \varepsilon=\delta/\left(2\lVert\varPhi\rVert_{\infty}\right)) followed by taking \delta=\varepsilon/2, implies

\log N_{[]}\left(\mathcal{P},\varepsilon/2,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty}\right)\leq\log N\left(\Theta,\frac{\varepsilon}{4\lVert\varPhi\rVert_{\infty}},{\operatorname*{\lVert\,\cdot\,\rVert}}_{1}\right).

Since Θd\Theta\subset\mathbb{R}^{d}, using the fact that a Euclidean set of radius rr has covering number N(r,ε)(3rε)d,N\left(r,\varepsilon\right)\leq\left(\frac{3r}{\varepsilon}\right)^{d}, we have

logN(Θ,ε4Φ,1)dlog[12Φdiam(Θ)ε].\log N\left(\Theta,\frac{\varepsilon}{4\lVert\varPhi\rVert_{\infty}},{\operatorname*{\lVert\,\cdot\,\rVert}}_{1}\right)\leq d\log\left[\frac{12\lVert\varPhi\rVert_{\infty}\mathrm{diam}\left(\Theta\right)}{\varepsilon}\right].

So

0clogN(Θ,ε4Φ,1)dε0cdlog[12Φdiam(Θ)ε]dε,\int_{0}^{c}\sqrt{\log N\left(\Theta,\frac{\varepsilon}{4\lVert\varPhi\rVert_{\infty}},{\operatorname*{\lVert\,\cdot\,\rVert}}_{1}\right)}\mathrm{d}\varepsilon\leq\int_{0}^{c}\sqrt{d\log\left[\frac{12\lVert\varPhi\rVert_{\infty}\mathrm{diam}\left(\Theta\right)}{\varepsilon}\right]}\mathrm{d}\varepsilon,

and since c<c<\infty, the integral is finite, as required. ∎

Appendix B Discussions and remarks regarding hh-MLLEs

In this section, we share some commentary on the derivation of the hh-lifted KL divergence, its advantages and drawbacks, and some thoughts on the selection of the lifting function hh. We also discuss the suitability of the MM algorithms in contrast to other approaches (e.g., EM algorithms).

B.1 Elementary derivation

From Equation 4, we observe that if XX arises from a measure with density ff, and if we aim to approximate ff with a density gcok(𝒫)g\in\mathrm{co}_{k}(\mathcal{P}) that minimizes the hh-lifted KL divergence KLh\mathrm{KL}_{h} with respect to ff, then we can define an approximator (referred to as the minimum hh-lifted KL approximator) as

fk\displaystyle f_{k} =argmingcok(𝒫)[𝒳{f+h}log{f+h}dμ𝒳{f+h}log{g+h}dμ]\displaystyle=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,min\,}}\left[\int_{\mathcal{X}}\{f+h\}\log\{f+h\}\mathrm{d}\mu-\int_{\mathcal{X}}\{f+h\}\log\{g+h\}\mathrm{d}\mu\right]
=argmingcok(𝒫)𝒳{f+h}log{g+h}dμ=argmaxgcok(𝒫)𝒳{f+h}log{g+h}dμ,\displaystyle=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,min\,}}-\int_{\mathcal{X}}\{f+h\}\log\{g+h\}\mathrm{d}\mu=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,max\,}}\int_{\mathcal{X}}\{f+h\}\log\{g+h\}\mathrm{d}\mu,

noting that 𝒳{f+h}log{f+h}dμ\int_{\mathcal{X}}\{f+h\}\log\{f+h\}\mathrm{d}\mu is a constant that does not depend on the argument gg. Now, observe that

\int_{\mathcal{X}}f\log\{g+h\}\mathrm{d}\mu={\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\{g+h\}\text{ and }\int_{\mathcal{X}}h\log\{g+h\}\mathrm{d}\mu={\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\{g+h\},

since both ff and hh are densities on 𝒳\mathcal{X} with respect to the dominating measure μ\mu. If a sample 𝐗n=(Xi)i[n]\mathbf{X}_{n}=(X_{i})_{i\in[n]} is available, we can estimate the expectation 𝐄flog{g+h}{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\{g+h\} by the sample average functional

1ni=1nlog{g(Xi)+h(Xi)},\frac{1}{n}\sum_{i=1}^{n}\log\{g(X_{i})+h(X_{i})\},

resulting in the sample estimator for fkf_{k}:

fk,n=argmaxgcok(𝒫)[1ni=1nlog{g(Xi)+h(Xi)}+𝐄hlog{g+h}],f_{k,n}^{\prime}=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,max\,}}\left[\frac{1}{n}\sum_{i=1}^{n}\log\{g(X_{i})+h(X_{i})\}+{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\{g+h\}\right],

which serves as an alternative to Equation 5. However, the expectation 𝐄hlog{g+h}{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\{g+h\} is intractable, making the optimization problem computationally infeasible, especially when 𝒳\mathcal{X} is multivariate (i.e., 𝒳d\mathcal{X}\subset\mathbb{R}^{d} for d>1d>1), as integral evaluations may be challenging to compute accurately. Thus, we approximate the intractable integral 𝐄hlog{g+h}{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\{g+h\} using the sample average approximation (SAA) from stochastic programming (cf. Shapiro et al., 2021, Chapter 5), yielding the Monte Carlo approximation

1n1i=1n1log{g(Yi)+h(Yi)}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\log\{g(Y_{i})+h(Y_{i})\}

for a sufficiently large n1n_{1}\in\mathbb{N}, where each YiY_{i} is an independent and identically distributed random variable from the measure on 𝒳\mathcal{X} with density hh. This approach provides an estimator for fkf_{k} of the form

fk,n,n1=argmaxgcok(𝒫)[1ni=1nlog{g(Xi)+h(Xi)}+1n1i=1n1log{g(Yi)+h(Yi)}],f_{k,n,n_{1}}=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,max\,}}\left[\frac{1}{n}\sum_{i=1}^{n}\log\{g(X_{i})+h(X_{i})\}+\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\log\{g(Y_{i})+h(Y_{i})\}\right],

which is exactly the hh-MLLE defined by Equation (5) when we take n1=nn_{1}=n. Notably, the additional samples 𝐘n=(Yi)i[n]\mathbf{Y}_{n}=(Y_{i})_{i\in[n]} provide no information regarding 𝐄flog{g+h}{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\{g+h\}, which is the component of the objective function coupling the estimator gg with the target ff. However, they offer a feasible mechanism for approximating the otherwise intractable integral 𝐄hlog{g+h}{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\{g+h\}.
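As a minimal illustration of the SAA step, the sketch below compares the Monte Carlo approximation n_{1}^{-1}\sum_{i}\log\{g(Y_{i})+h(Y_{i})\} against a numerical quadrature of {\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\{g+h\}, for a fixed two-component beta mixture gg and a uniform lifting density hh on [0,1][0,1]; the particular mixture and sample sizes are assumptions chosen only for the example.

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

rng = np.random.default_rng(2)

# illustrative choices: uniform h on [0, 1] and a fixed two-component beta mixture g
h = lambda x: 1.0
g = lambda x: 0.4 * beta.pdf(x, 2.0, 8.0) + 0.6 * beta.pdf(x, 5.0, 2.0)

# quadrature value of E_h log{g + h} = int_0^1 h(y) log(g(y) + h(y)) dy
exact, _ = quad(lambda y: h(y) * np.log(g(y) + h(y)), 0.0, 1.0)

# SAA: draw Y_1, ..., Y_{n_1} from h and average
for n1 in (10**2, 10**3, 10**4, 10**5):
    Y = rng.uniform(0.0, 1.0, size=n1)
    saa = np.mean(np.log(g(Y) + h(Y)))
    print(f"n1 = {n1:6d}: SAA = {saa:+.5f}, quadrature = {exact:+.5f}")
```

The quadrature is feasible here only because the example is univariate; as noted above, it is precisely in multivariate settings that the SAA becomes the practical option.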

By setting n1=nn_{1}=n for the SAA approximation of 𝐄hlog{g+h}{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\{g+h\}, the convergence rate in Theorem 5 remains unaffected. Specifically, for any t>0t>0,

\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{g(X_{i})+h(X_{i})}{f(X_{i})+h(X_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\log\frac{g+h}{f+h}\right\rVert_{\mathcal{C}}\leq\frac{w_{1}}{\sqrt{n}}\operatorname*{{\mathrm{\mathbf{E}}}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon,d_{n,x})\mathrm{d}\varepsilon+\frac{w_{2}}{\sqrt{n}}+w_{3}\sqrt{\frac{t}{n}}

and

\left\lVert\frac{1}{n}\sum_{i=1}^{n}\log\frac{g(Y_{i})+h(Y_{i})}{f(Y_{i})+h(Y_{i})}-{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\frac{g+h}{f+h}\right\rVert_{\mathcal{C}}\leq\frac{w_{1}}{\sqrt{n}}\operatorname*{{\mathrm{\mathbf{E}}}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon,d_{n,y})\mathrm{d}\varepsilon+\frac{w_{2}}{\sqrt{n}}+w_{3}\sqrt{\frac{t}{n}},

with probability at least 1et1-e^{-t}, as noted in Remark 14. Given that both upper bounds are of order 𝒪(1/n)+𝒪(t/n)\mathcal{O}(1/\sqrt{n})+\mathcal{O}(\sqrt{t/n}), the combined bound in the proof of Theorem 5 in Appendix A.2 is also of this order, as required.

Finally, to obtain the additional samples 𝐘n=(Yi)i[n]\mathbf{Y}_{n}=(Y_{i})_{i\in[n]}, we simply simulate 𝐘n\mathbf{Y}_{n} from the data-generating process defined by hh. Since we can choose hh freely, selecting an hh that facilitates easy simulation (e.g., hh uniform over 𝒳\mathcal{X}, which remains bounded away from zero on a compact set) is advisable for satisfying the requirements of our theorems.

B.2 Advantages and limitations

As discussed extensively in Sections 1 and 2, our two primary benchmarks are the MLE and the minimum L2L_{2} estimator. Indeed, the MLE is simpler than the hh-MLLE, as it takes the reduced form

f^k,n=argmaxgcok(𝒫)1ni=1nlogg(Xi),\hat{f}_{k,n}=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,max\,}}~{}\frac{1}{n}\sum_{i=1}^{n}\log g(X_{i}),

and does not require a sample average approximation for intractable integrals. It is well established that the MLE estimates the minimum KL divergence approximation to the target ff

f_{k}=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,min\,}}\int_{\mathcal{X}}f\log\frac{f}{g}\mathrm{d}\mu=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,min\,}}\mathrm{KL}\left(f\,||\,g\right).

However, as highlighted in the foundational works of Li & Barron (1999) and Rakhlin et al. (2005), controlling the expected risk

{\operatorname*{{\mathrm{\mathbf{E}}}}}\left\{\mathrm{KL}\left(f\,||\,\hat{f}_{k,n}\right)\right\}-\mathrm{KL}\left(f\,||\,\mathcal{C}\right)

requires that fγf\geq\gamma for some strictly positive constant γ>0\gamma>0. This requirement excludes many interesting density functions as targets, including those that vanish at the boundaries of 𝒳\mathcal{X}, such as the β(;2,2)\beta(\cdot;2,2) distribution, or those that vanish in the interior of 𝒳\mathcal{X}, such as the examples f1f_{1} and f2f_{2} in Section 4.2. Consequently, the condition fγf\geq\gamma is restrictive and often impractical in many data analysis settings.

Alternatively, one could consider targeting the minimum L2L_{2} estimator:

fk\displaystyle f_{k} =argmingcok(𝒫)𝒳(fg)2dμ=argmingcok(𝒫)𝒳f22fg+g2dμ=argmingcok(𝒫)[2𝒳fgdμ+𝒳g2dμ].\displaystyle=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,min\,}}\int_{\mathcal{X}}(f-g)^{2}\mathrm{d}\mu=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,min\,}}\int_{\mathcal{X}}f^{2}-2fg+g^{2}\mathrm{d}\mu=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,min\,}}\left[-2\int_{\mathcal{X}}fg\mathrm{d}\mu+\int_{\mathcal{X}}g^{2}\mathrm{d}\mu\right].

Using a sample 𝐗n\mathbf{X}_{n} generated from the distribution given by ff, the first term of the objective can be approximated by 2ni=1ng(Xi)-\frac{2}{n}\sum_{i=1}^{n}g(X_{i}), which is relatively simple. However, the second term involves an intractable integral that cannot be approximated by Monte Carlo sampling from a fixed generative distribution, as it depends on the optimization argument gg. Thus, unlike the hh-MLLE, it is not feasible to reduce this intractable integral to a sample average, which implies the need for a numerical approximation in practice. This can be computationally complex, particularly when gg is intricate and 𝒳\mathcal{X} is high-dimensional. Hence, the minimum L2L_{2}-norm estimator of the form

f^k,n=argmingcok(𝒫)[2ni=1ng(Xi)+𝒳g2dμ]\hat{f}_{k,n}=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,min\,}}\left[-\frac{2}{n}\sum_{i=1}^{n}g(X_{i})+\int_{\mathcal{X}}g^{2}\mathrm{d}\mu\right]

is often computationally infeasible, though its risk

𝐄ff^k,n2infg𝒞fg2{\operatorname*{{\mathrm{\mathbf{E}}}}}\|f-\hat{f}_{k,n}\|_{2}-\inf_{g\in\mathcal{C}}\|f-g\|_{2}

can be bounded, as shown in the works of Klemelä (2007) and Klemelä (2009), even when min𝒳f=0\min_{\mathcal{X}}f=0. In comparison with the minimum L2L_{2} estimator and the MLE, we observe that the hh-MLLE allows risk bounding for targets ff not bounded from below (i.e., min𝒳f=0\min_{\mathcal{X}}f=0), without requiring intractable integral expressions. The hh-MLLE achieves the beneficial properties of both the MLE and minimum L2L_{2} estimators, which is the focus of our work.

Other divergences and risk minimization schemes for estimating ff, such as β\beta-likelihoods and LqL_{q} likelihood, could also be considered. The LqL_{q} likelihood, for instance, provides a maximizing estimator with a simple sample average expression, similar to the MLE and hh-MLLE. However, it lacks a characterization in terms of a proper divergence function, such as the KL divergence, hh-lifted KL divergence, or L2L_{2} norm. Consequently, this estimator is often inconsistent, as observed in Ferrari & Yang (2010) and Qin & Priebe (2013). These studies show that the LqL_{q} likelihood estimator may not converge meaningfully to ff, even when fcok(𝒫)f\in\mathrm{co}_{k}(\mathcal{P}), for any fixed q>0{1}q\in\mathbb{R}_{>0}\setminus\{1\}, unless a sequence of maximum LqL_{q} likelihood estimators is constructed with qq depending on nn and approaching 11 to approximate the MLE. Thus, the maximum LqL_{q} likelihood estimator does not yield the type of risk bound we require.

Similarly, with the β\beta-likelihood (or density power divergence), the situation is comparable to that of the minimum L2L_{2} norm estimator, where the sample-based estimator involves an intractable integral that cannot be approximated through SAA. Specifically, the minimum β\beta-likelihood estimator is defined as (cf. Basu et al., 1998):

f^k,n=argmingcok(𝒫)[1n(1+1β)i=1ngβ(Xi)+𝒳g1+βdμ]\hat{f}_{k,n}=\underset{g\in\mathrm{co}_{k}(\mathcal{P})}{\operatorname*{arg\,min\,}}\left[-\frac{1}{n}\left(1+\frac{1}{\beta}\right)\sum_{i=1}^{n}g^{\beta}(X_{i})+\int_{\mathcal{X}}g^{1+\beta}\mathrm{d}\mu\right]

for β>0\beta>0, which closely resembles the form of the minimum L2L_{2} estimator. Hence, the limitations of the minimum L2L_{2} estimator apply here as well, although a risk bound with respect to the β\beta-likelihood divergence could theoretically be obtained if the computational challenges are disregarded. In Section 1.3, we cite additional estimators based on various divergences and modified likelihoods. Nevertheless, in each case, one of the limitations discussed here will apply.

B.3 Selection of the lifting density function hh

The choice of hh is entirely independent of the data. In fact, hh can be any density with respect to μ\mu, satisfying 0<ah(x)b<0<a\leq h(x)\leq b<\infty for every x𝒳x\in\mathcal{X}. Beyond this requirement, our theoretical framework remains unaffected by the specific choice of hh. In Section 4, we explore cases where hh is uniform and non-uniform, demonstrating convergence in both kk and nn that aligns with the predictions of Theorem 5. For practical implementation, as discussed in Appendix B.1, hh serves as the sampling distribution for the sample average approximation (SAA) of the intractable integral 𝐄hlog{g+h}{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\log\{g+h\}. Given its role as a sampling distribution, it is advantageous to select a form for hh that is easy to sample from. In many applications, we find that the uniform distribution over 𝒳\mathcal{X} is an optimal choice for hh, as it meets the bounding conditions.

We observe that although calibrating hh does not improve the rate, it does influence the constants in the upper bound in Equation 19. Specifically, for each t>0t>0, with probability at least 1et1-\mathrm{e}^{-t},

KLh(f||fk,n)KLh(f||𝒞)w1n0clog1/2N(𝒫,ε/2,)dε+w2n+w3tn+w4k+2,\mathrm{KL}_{h}\left(f\,||\,f_{k,n}\right)-\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right)\leq\frac{w_{1}}{\sqrt{n}}\int_{0}^{c}\log^{1/2}N(\mathcal{P},\varepsilon/2,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty})\mathrm{d}\varepsilon+\frac{w_{2}}{\sqrt{n}}+w_{3}\sqrt{\frac{t}{n}}+\frac{w_{4}}{k+2},

which contributes to the constants in Theorem 5. Letting cc denote the upper bound of the target ff (i.e., f(x)c<f(x)\leq c<\infty for every x𝒳x\in\mathcal{X}), we make the following observations regarding the constants:

w1c+ba2,w2(8b+4a)(c+b)a2,w3log(c+ba),w4c2a2.w_{1}\propto\frac{c+b}{a^{2}},\quad w_{2}\propto\frac{(8b+4a)(c+b)}{a^{2}},\quad w_{3}\propto\log\left(\frac{c+b}{a}\right),\quad w_{4}\propto\frac{c^{2}}{a^{2}}.

Here, w1,w2,w_{1},\,w_{2}, and w3w_{3} are per the final bound in Equation 17 in the proof of Theorem 13, while w4w_{4} arises from the bound in Equation 15.

When hh is uniform, it takes the form h(x)=zh(x)=z, where z=1/𝒳dμz=1/\int_{\mathcal{X}}\mathrm{d}\mu, making a=z=ba=z=b. If hh is non-uniform, then necessarily a<z<ba<z<b, as there would exist a region 𝒵\mathcal{Z} of positive measure where h(x)>zh(x)>z, which implies that h(x)<zh(x)<z for some x𝒳𝒵x\in\mathcal{X}\setminus\mathcal{Z}; otherwise,

𝒳hdμ=𝒵hdμ+𝒳𝒵hdμ>μ(𝒵)μ(𝒳)+μ(𝒳𝒵)μ(𝒳)=1,\int_{\mathcal{X}}h\mathrm{d}\mu=\int_{\mathcal{Z}}h\mathrm{d}\mu+\int_{\mathcal{X}\setminus\mathcal{Z}}h\mathrm{d}\mu>\frac{\mu(\mathcal{Z})}{\mu(\mathcal{X})}+\frac{\mu(\mathcal{X}\setminus\mathcal{Z})}{\mu(\mathcal{X})}=1,

contradicting hh being a density function. Although we cannot control cc, we can choose hh to control aa and bb. Setting h=zh=z minimizes w1w_{1} and w4w_{4}: any deviation from uniformity increases the numerator of w1w_{1} (through bb) while decreasing the denominator a2a^{2} that is common to both. The same reasoning applies to w2w_{2}:

w2\displaystyle w_{2} (8b+4a)(c+b)a2={8bca2+4ca+8b2a2+4ba}.\displaystyle\propto\frac{(8b+4a)(c+b)}{a^{2}}=\left\{\frac{8bc}{a^{2}}+\frac{4c}{a}+\frac{8b^{2}}{a^{2}}+\frac{4b}{a}\right\}.

Since c>0c>0, any deviation from uniformity in hh either increases or maintains the numerators while decreasing the denominators, minimizing w2w_{2} when hh is uniform. The same logic applies to w3w_{3}, as the logarithmic function is increasing, so w3w_{3} is minimized when hh is uniform.

Consequently, we conclude that the smallest constants in Theorem 5 are achieved when hh is chosen as the uniform distribution on 𝒳\mathcal{X}. This suggests that a uniform hh is optimal from both practical and theoretical perspectives.
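The effect of the choice of hh on the constants can be made concrete with a small numerical comparison of the proportional forms of w1,,w4w_{1},\dots,w_{4} for a uniform and a mildly non-uniform lifting density on [0,1][0,1]; the particular non-uniform choice h(x)=0.5+xh(x)=0.5+x below is an assumption made only for illustration.

```python
import numpy as np

def bound_constants(a, b, c):
    """Proportional forms of w_1, ..., w_4 in terms of the bounds a <= h <= b and f <= c."""
    w1 = (c + b) / a**2
    w2 = (8 * b + 4 * a) * (c + b) / a**2
    w3 = np.log((c + b) / a)
    w4 = c**2 / a**2
    return w1, w2, w3, w4

c = 2.0                                               # assumed upper bound on the target f
# uniform h on [0, 1]: a = b = 1
print("uniform h      :", bound_constants(1.0, 1.0, c))
# non-uniform h(x) = 0.5 + x on [0, 1] (a valid density): a = 0.5, b = 1.5
print("non-uniform h  :", bound_constants(0.5, 1.5, c))
```

Each constant is at least as large for the non-uniform choice, in line with the discussion above.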

B.4 Discussions regarding the sharpness of the obtained risk bound

Similar to the role of Gaussian mixtures as the archetypal class of mixture models for Euclidean spaces, beta mixtures represent the archetypal class of mixture models on the compact interval [0,1][0,1], as established in the studies by Ghosal (2001); Petrone (1999). Just as Gaussian mixtures can approximate any continuous density on 𝒳=d\mathcal{X}=\mathbb{R}^{d} to an arbitrary level of accuracy in the LpL_{p}-norm (Nguyen et al., 2020; 2021; 2022b), mixtures of beta distributions can similarly approximate any continuous density on 𝒳=[0,1]\mathcal{X}=[0,1] with respect to the supremum norm (Ghosal, 2001; Petrone, 1999; Petrone & Wasserman, 2002). We will leverage this property in the following discussion.

Assuming the target ff is within the closure of our mixture class 𝒞\mathcal{C} (i.e., KLh(f||𝒞)=0\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right)=0), setting kn=𝒪(n)k_{n}=\mathcal{O}(\sqrt{n}) achieves a convergence rate in expected KLh\mathrm{KL}_{h} of 𝒪(1/n)\mathcal{O}(1/\sqrt{n}) for the mixture maximum hh-lifted likelihood estimator (hh-MLLE) fkn,nf_{k_{n},n}. An interesting question is whether this rate is tight and not overly conservative, given the observed rates in Table 1. We aim to investigate this question by discussing a lower bound for the estimation problem.

To approach this, we use Proposition 3 to observe that KLh\mathrm{KL}_{h} satisfies a Pinsker-like inequality:

KLh(f||g)TV(f,g),\sqrt{\mathrm{KL}_{h}\left(f\,||\,g\right)}\geq\mathrm{TV}(f,g),

where TV(f,g)=12𝒳|fg|dμ\mathrm{TV}(f,g)=\frac{1}{2}\int_{\mathcal{X}}|f-g|\mathrm{d}\mu. Using this inequality along with Corollary 6, Jensen's inequality (via the convexity of xx2x\mapsto x^{2}), and the subadditivity of the square root, we find that the hh-MLLE satisfies the following total variation bound:

𝐄{TV(f,fkn,n)}\displaystyle{\operatorname*{{\mathrm{\mathbf{E}}}}}\left\{\mathrm{TV}(f,f_{k_{n},n})\right\} w1,fkn+w2,fnw1,fkn1/2+w2,fn1/4wfn1/4,\displaystyle\leq\sqrt{\frac{w_{1,f}}{k_{n}}+\frac{w_{2,f}}{\sqrt{n}}}\leq\frac{w_{1,f}}{k_{n}^{1/2}}+\frac{w_{2,f}}{n^{1/4}}\leq\frac{w_{f}}{n^{1/4}},

for some positive constants w1,f,w2,f,wfw_{1,f},\,w_{2,f},\,w_{f} depending on ff, by taking kn=nk_{n}=\sqrt{n}. Now, consider the specific case when 𝒳=[0,1]\mathcal{X}=[0,1], and the component class 𝒫\mathcal{P} consists of beta distributions. In this case, we have (cf. Petrone & Wasserman, 2002, Eq. 5), for any continuous density function f:[0,1]0f:[0,1]\to\mathbb{R}_{\geq 0}:

infg𝒞supx[0,1]|f(x)g(x)|=0, which implies that infg𝒞KLh(f||g)=0,\inf_{g\in\mathcal{C}}\sup_{x\in[0,1]}|f(x)-g(x)|=0,\text{ which implies that }\inf_{g\in\mathcal{C}}\mathrm{KL}_{h}\left(f\,||\,g\right)=0,

since

\sup_{x\in[0,1]}|f(x)-g(x)|\geq L_{2}(f,g)\geq\sqrt{\gamma\,\mathrm{KL}_{h}\left(f\,||\,g\right)},

for any 0<γh0<\gamma\leq h, with the second inequality due to Proposition 3. Thus, for a compact parameter space Θ\Theta defining 𝒫\mathcal{P}, we assume KLh(f||𝒞)=0\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right)=0. Consequently, the rate of 𝒪(n1/4)\mathcal{O}(n^{-1/4}) for expected total variation distance is achievable in the beta mixture model setting. This convergence is uniform in the sense that

𝐄{TV(f,fkn,n)}wn1/4,{\operatorname*{{\mathrm{\mathbf{E}}}}}\left\{\mathrm{TV}(f,f_{k_{n},n})\right\}\leq\frac{w}{n^{1/4}},

where ww depends only on the maximum cfc\geq f, the diameter of Θ\Theta, and the condition KLh(f||𝒞)=0\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right)=0, with component distributions in 𝒫\mathcal{P} restricted to parameter values in Θ\Theta.

In the context of minimum total variation density estimation on [0,1][0,1], Exercise 15.14 of Devroye & Lugosi (2001) states that for every estimator f^\hat{f} and every Lipschitz continuous density ff (with a sufficiently large Lipschitz constant),

supfLip𝐄f{TV(f^,f)}Wn1/3,\sup_{f\in\mathrm{Lip}}{\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\left\{\mathrm{TV}(\hat{f},f)\right\}\geq\frac{W}{n^{1/3}},

for some universal constant WW depending only on the Lipschitz constant. This lower bound is faster than our achieved rate of 𝒪(n1/4)\mathcal{O}(n^{-1/4}), but it applies only to the smaller class of Lipschitz targets, a subset of the continuous targets satisfying KLh(f||𝒞)=0\mathrm{KL}_{h}\left(f\,||\,\mathcal{C}\right)=0.

The target f2f_{2} from our simulations in Section 4 belongs to the class of Lipschitz targets, and thus the improved lower bound rate of 𝒪(n1/3)\mathcal{O}(n^{-1/3}) from Devroye & Lugosi (2001) applies. We can compare this with nb2\sqrt{n^{-b_{2}}} for Experiment 2 in Table 1, yielding an empirical rate in nn of 𝒪(n1.03)\mathcal{O}(n^{-1.03}), with an exponent between 1.07-1.07 and 0.98-0.98 (95% confidence), over the range n{210,,215}n\in\{2^{10},\dots,2^{15}\}. Clearly, this observed rate is faster than the lower bound rate of 𝒪(n1/3)\mathcal{O}(n^{-1/3}), indicating that the faster rates observed in Table 1 are due to small values of nn and kk. As nn increases, the rate must eventually decelerate to at least 𝒪(n1/3)\mathcal{O}(n^{-1/3}) when the target ff is Lipschitz on 𝒳\mathcal{X}, which is only marginally faster than our guaranteed rate of 𝒪(n1/4)\mathcal{O}(n^{-1/4}). Demonstrating that 𝒪(n1/4)\mathcal{O}(n^{-1/4}) is minimax optimal for certain target classes ff is a complex task, left for future exploration.

Lastly, we note that our discussion in this section implies that the hh-MLLE provides an effective and generic method for obtaining estimators with total variation guarantees, which complements the comprehensive studies on the topic presented in Devroye & Györfi (1985) and Devroye & Lugosi (2001).

B.5 The KL divergence and the MLE

For any probability densities ff and gg with respect to a dominating measure μ\mu on 𝒳\mathcal{X}, the hh-lifted KL divergence is defined as

KLh(f||g)=𝒳{f+h}log(f+hg+h)dμ,\mathrm{KL}_{h}\left(f\,||\,g\right)=\int_{\mathcal{X}}\{f+h\}\log\left(\frac{f+h}{g+h}\right)\mathrm{d}\mu,

which we establish as a Bregman divergence on the space of probability densities dominated by μ\mu on 𝒳\mathcal{X} in Appendix C.1.

We previously demonstrated a relationship between KLh\mathrm{KL}_{h} and the L2L_{2} distance (Proposition 3), showing that if h(x)γ>0h(x)\geq\gamma>0 for all x𝒳x\in\mathcal{X}, then

KLh(f||g)1γL22(f,g), where L22(f,g)=fg22=𝒳(fg)2dμ\mathrm{KL}_{h}\left(f\,||\,g\right)\leq\frac{1}{\gamma}L_{2}^{2}(f,g),\text{ where }L_{2}^{2}(f,g)=\|f-g\|_{2}^{2}=\int_{\mathcal{X}}(f-g)^{2}\mathrm{d}\mu

is the square of the L2L_{2} distance between the densities. Given that we can always select h(x)γh(x)\geq\gamma, this bound is always enforceable. This relationship is stronger than that between the standard KL divergence and the L2L_{2} distance, which similarly satisfies

KL(f||g)1γL22(f,g),\mathrm{KL}\left(f\,||\,g\right)\leq\frac{1}{\gamma}L_{2}^{2}(f,g),

but with the more restrictive requirement that g(x)γ>0g(x)\geq\gamma>0 for every x𝒳x\in\mathcal{X}, limiting its applicability to settings in which the densities involved do not vanish. In the proof of Proposition 3 in Appendix C.3, we show that one can write

KLh(f||g)=2KL(f+h2,g+h2),\mathrm{KL}_{h}\left(f\,||\,g\right)=2\mathrm{KL}\left(\frac{f+h}{2},\frac{g+h}{2}\right),

which allows the application of the theory from Rakhlin et al. (2005) by considering the mixture density (f+h)/2(f+h)/2 as the target and using g+hg+h as the approximand, where gcok(𝒫)g\in\mathrm{co}_{k}\left(\mathcal{P}\right). Under this framework, the maximum likelihood estimator can be formulated as

fk,nargmingcok(𝒫)1ni=1nlog(g(Zi)+h(Zi)2),f_{k,n}\in\operatorname*{arg\,min\,}_{g\in\mathrm{co}_{k}\left(\mathcal{P}\right)}-\frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{g(Z_{i})+h(Z_{i})}{2}\right),

where (Zi)i[n]\left(Z_{i}\right)_{i\in\left[n\right]} are independent and identically distributed samples from a distribution with density (f+h)/2(f+h)/2. This sampling can be performed by choosing XiX_{i} with probability 1/21/2 or YiY_{i} with probability 1/21/2 for each i[n]i\in[n], where XiX_{i} is an observation from the generative model ff and YiY_{i} is an independent sample from the auxiliary density hh. Although the modified estimator, based on the bound from Rakhlin et al. (2005), attains equivalent convergence rates, it inefficiently utilizes observed data, as 50% of the data is replaced by simulated samples YiY_{i}. In contrast, our hh-MLLE estimator maximally utilizes all available data while achieving the same bound.
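The half-and-half sampling scheme described above can be sketched as follows, with a Beta(2, 5) generative density ff and a uniform auxiliary density hh on [0,1][0,1] standing in as assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

X = rng.beta(2.0, 5.0, size=n)        # observations from the generative model f
Y = rng.uniform(0.0, 1.0, size=n)     # independent draws from the auxiliary density h

# Z_i equals X_i with probability 1/2 and Y_i with probability 1/2,
# so that each Z_i has density (f + h)/2
coin = rng.integers(0, 2, size=n).astype(bool)
Z = np.where(coin, X, Y)

print("fraction of observed data retained:", coin.mean())  # roughly 1/2
```

Roughly half of the observed data is discarded in forming the ZiZ_{i}, which is precisely the inefficiency that the hh-MLLE avoids.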

B.6 Comparison of the MM algorithm and the EM algorithm

Since the risk functional is not a log-likelihood, a straightforward EM approach cannot be used to compute the hh-MLLE. However, by interpreting KLh\mathrm{KL}_{h} as a loss between the target mixture (f+h)/2(f+h)/2 and the estimator (fk,n+h)/2(f_{k,n}+h)/2, an EM algorithm can be constructed using the standard admixture framework (see Lange, 2013, Section 9.5). Remarkably, the EM algorithm for estimating (fk,n+h)/2(f_{k,n}+h)/2, has the same form as our MM algorithm, which leverages Jensen’s inequality (cf. Lange, 2013, Section 8.3). In fact, the majorizer in any EM algorithm results directly from Jensen’s inequality (see Lange, 2013, Section 9.2), making our MM algorithm in Section 4.1 no more complex than an EM approach for mixture models.

Beyond the EM and MM methods, no other standard algorithms typically address the generic estimation of a kk-component mixture model in cok(𝒫)\mathrm{co}_{k}(\mathcal{P}) for a given parametric class 𝒫\mathcal{P}. Since our MM algorithm follows a structure nearly identical to the EM algorithm for the MLE of this problem, it has comparable iterative complexity. Notably, per iteration, the MM approach requires additional evaluations for both 𝐗n\mathbf{X}_{n} and 𝐘n\mathbf{Y}_{n}, and for g(Xi)g(X_{i}) and h(Xi)h(X_{i}), so it requires a constant multiple of evaluations compared to EM, depending on whether hh is a uniform distribution or otherwise (typically by a factor of 2 or 4).

B.7 Non-convex optimization

We note that the hh-MLLE problem (and the corresponding MLE) are non-convex optimization problems. This implies that, aside from global optimization methods, no iterative algorithm–whether gradient-based methods like gradient descent, coordinate descent, mirror descent, or momentum-based variants–can be guaranteed to find a global optimum. Likewise, second-order techniques such as Newton and quasi-Newton methods also cannot be expected to locate the global solution. In non-convex scenarios, the primary assurance that can be offered is convergence to a critical point of the objective function. In our case, this assurance is achieved by applying Corollary 1 from Razaviyayn et al. (2013), as discussed in Section 4.1. Notably, this convergence guarantee is consistent with that provided by other iterative approaches, such as EM, gradient descent, or Newton’s method.

Additionally, it may be valuable to examine whether specific convergence rates can be ensured when the algorithm’s iterates approach a neighborhood around a critical value. In the context of the MM algorithm, we can affirmatively answer this question: since the hh-MLLE objective is twice continuously differentiable with respect to the parameter ψk\psi_{k}, it satisfies the local convergence conditions outlined in Lange (2016, Proposition 7.2.2). This result implies that if ψk(s)\psi_{k}^{(s)} lies within a sufficiently close neighborhood of a local minimizer ψk\psi_{k}^{*}, the MM algorithm converges linearly to ψk\psi_{k}^{*}. This behavior aligns with the convergence guarantees offered by other iterative methods, such as gradient descent or line-search based quasi-Newton methods. Quadratic convergence rates near ψk\psi_{k}^{*} can be achieved with a Newton method, though this forfeits the monotonicity (or stability) of the MM algorithm, as it is well-known that even in convex settings, Newton’s method can diverge if the initialization is not properly handled.

An additional advantage of the MM algorithm over Newton’s method is its capacity to decompose the original objective into a sum of functions where each component of ψk=(π1,,πk,θ1,,θk)\psi_{k}=(\pi_{1},\dots,\pi_{k},\theta_{1},\dots,\theta_{k}) is separable within the summation. In other words, we can independently optimize functions that depend only on subsets of parameters, either (π1,,πk)(\pi_{1},\dots,\pi_{k}) or each θj\theta_{j} for j=1,,kj=1,\dots,k, thereby simplifying the iterative computation. This characteristic is noted after Equation 9 in the main text. Such decomposition can lead to computational efficiency by avoiding the need to compute the Hessian matrix for Newton’s method or approximations required by quasi-Newton methods. Specifically, in cases involving mixtures of exponential family distributions such as the beta distributions discussed in Section 4.2, each parameter-separated problem becomes a strictly concave maximization problem, which can be efficiently solved (see Proposition 3.10 in Sundberg, 2019).
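
To make the separability concrete, the following sketch performs a single parameter-separated update for a two-component beta mixture on [0,1][0,1] with a uniform lift hh. The responsibility weights and the per-component weighted concave subproblems follow the Jensen-type majorization described above, but the component class, the simulated data, and the use of a generic bounded optimizer for the θj\theta_{j}-subproblems are illustrative assumptions rather than a transcription of Equation 9.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Illustrative setup: k = 2 beta components on [0, 1], with h the uniform
# density, X_n drawn from a beta "truth" and Y_n drawn from h.
k = 2
X = rng.beta(2.0, 5.0, size=500)                      # observed sample X_n
Y = rng.uniform(0.0, 1.0, size=500)                   # auxiliary sample Y_n from h
Z = np.clip(np.concatenate([X, Y]), 1e-6, 1 - 1e-6)   # both samples enter the objective
h = np.ones_like(Z)                                   # uniform lift evaluated at each point

pi = np.full(k, 1.0 / k)                              # mixing weights pi_1, ..., pi_k
theta = np.array([[1.0, 1.0], [2.0, 2.0]])            # beta shape parameters per component

# Responsibilities obtained from Jensen's inequality: component j's weight at z,
# relative to the lifted mixture f_k(z) + h(z).
comp = np.stack([pi[j] * stats.beta.pdf(Z, *theta[j]) for j in range(k)])
resp = comp / (comp.sum(axis=0) + h)

# The surrogate separates: the mixing weights depend only on (pi_1, ..., pi_k) ...
pi = resp.sum(axis=1) / resp.sum()

# ... while each theta_j solves its own weighted, concave subproblem, handled
# here by a generic bounded optimizer.
for j in range(k):
    def neg_weighted_loglik(th, w=resp[j]):
        # negative weighted beta log-likelihood for component j
        return -np.sum(w * stats.beta.logpdf(Z, th[0], th[1]))
    theta[j] = minimize(neg_weighted_loglik, theta[j], bounds=[(1e-3, None)] * 2).x

print("updated weights:", pi)
print("updated shapes:", theta)
```

Each θj\theta_{j}-subproblem is a weighted exponential-family log-likelihood, concave in the sense discussed above, so the generic solver is only a stand-in for a specialized routine.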

Appendix C Auxiliary proofs

In this section, we provide proofs of claims made in the main text that are not covered in Appendix A.

C.1 The hh-lifted KL divergence as a Bregman divergence

Let u~=u+h\tilde{u}=u+h, so that ϕ(u)=u~log(u~)u~+1\phi(u)=\tilde{u}\log(\tilde{u})-\tilde{u}+1. Then ϕ(u)=log(u~)\phi^{\prime}(u)=\log(\tilde{u}), and

Dϕ(f||g)\displaystyle D_{\phi}(f\,||\,g) =𝒳{f~log(f~)f~+1}{g~log(g~)g~+1}log(g~)(fg)dμ\displaystyle=\int_{\mathcal{X}}\{\tilde{f}\log(\tilde{f})-\tilde{f}+1\}-\{\tilde{g}\log(\tilde{g})-\tilde{g}+1\}-\log(\tilde{g})(f-g)\mathrm{d}\mu
=𝒳f~log(f~)g~log(g~)flog(g~)+glog(g~)dμ\displaystyle=\int_{\mathcal{X}}\tilde{f}\log(\tilde{f})-\tilde{g}\log(\tilde{g})-f\log(\tilde{g})+g\log(\tilde{g})\mathrm{d}\mu
=𝒳{f+h}log(f~){g+h}log(g~)flog(g~)+glog(g~)dμ\displaystyle=\int_{\mathcal{X}}\{f+h\}\log(\tilde{f})-\{g+h\}\log(\tilde{g})-f\log(\tilde{g})+g\log(\tilde{g})\mathrm{d}\mu
=𝒳{f+h}logf+hg+hdμ=KLh(f||g).\displaystyle=\int_{\mathcal{X}}\{f+h\}\log{\frac{f+h}{g+h}}\mathrm{d}\mu=\mathrm{KL}_{h}\left(f\,||\,g\right).

Here, the second equality uses the fact that 𝒳(g~f~)dμ=𝒳(gf)dμ=0\int_{\mathcal{X}}(\tilde{g}-\tilde{f})\mathrm{d}\mu=\int_{\mathcal{X}}(g-f)\mathrm{d}\mu=0, since ff and gg are both density functions.
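
As a quick numerical sanity check of this identity (not part of the argument), the following sketch evaluates both Dϕ(f||g)D_{\phi}(f\,||\,g) and KLh(f||g)\mathrm{KL}_{h}\left(f\,||\,g\right) on a discretized version of [0,1][0,1]; the particular beta densities and the uniform lift are illustrative choices.

```python
import numpy as np
from scipy import stats

# Discretize X = [0, 1] and pick illustrative densities f, g and a uniform lift h.
x = np.linspace(1e-6, 1 - 1e-6, 20_000)
dx = x[1] - x[0]
f = stats.beta.pdf(x, 2.0, 5.0)
g = 0.5 * stats.beta.pdf(x, 3.0, 3.0) + 0.5 * stats.beta.pdf(x, 1.0, 4.0)
h = np.ones_like(x)

def phi(u):
    # convex generator phi(u) = (u + h) log(u + h) - (u + h) + 1
    return (u + h) * np.log(u + h) - (u + h) + 1.0

def dphi(u):
    # derivative phi'(u) = log(u + h)
    return np.log(u + h)

# Bregman divergence D_phi(f || g), evaluated pointwise and integrated.
bregman = np.sum(phi(f) - phi(g) - dphi(g) * (f - g)) * dx

# h-lifted KL divergence KL_h(f || g).
kl_h = np.sum((f + h) * np.log((f + h) / (g + h))) * dx

print(bregman, kl_h)  # the two values agree up to discretization error
```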

C.2 Proof of Proposition 2

Let f~=f+h\tilde{f}=f+h and g~=g+h\tilde{g}=g+h. Since hh is positive, the quantity g~=infx𝒳{g(x)+h(x)}\tilde{g}_{*}=\inf_{x\in\mathcal{X}}\left\{g(x)+h(x)\right\} is positive. Similarly, since 𝒳\mathcal{X} is compact, the quantity f~=supx𝒳{f(x)+h(x)}\tilde{f}^{*}=\sup_{x\in\mathcal{X}}\left\{f(x)+h(x)\right\} is finite and positive. Define M=supx𝒳log{f~(x)/g~(x)}M=\sup_{x\in\mathcal{X}}\log\{\tilde{f}(x)/\tilde{g}(x)\}. Then Mlog(f~/g~)<M\leq\log(\tilde{f}^{*}/\tilde{g}_{*})<\infty, and

KLh(f||g)=𝒳f~logf~g~dμsupx𝒳logf~g~𝒳f~dμ=2M<.\mathrm{KL}_{h}\left(f\,||\,g\right)=\int_{\mathcal{X}}\tilde{f}\log\frac{\tilde{f}}{\tilde{g}}\mathrm{d}\mu\leq\sup_{x\in\mathcal{X}}\log\frac{\tilde{f}}{\tilde{g}}\int_{\mathcal{X}}\tilde{f}\mathrm{d}\mu=2M<\infty.

C.3 Proof of Proposition 3

Defining f~\tilde{f} and g~\tilde{g} as above, we have

KLh(f||g)=𝒳f~logf~g~dμ𝒳f~(f~g~1)dμ=𝒳(fg)2g~dμγ1L22(f,g),\mathrm{KL}_{h}\left(f\,||\,g\right)=\int_{\mathcal{X}}\tilde{f}\log\frac{\tilde{f}}{\tilde{g}}\mathrm{d}\mu\leq\int_{\mathcal{X}}\tilde{f}\left(\frac{\tilde{f}}{\tilde{g}}-1\right)\mathrm{d}\mu=\int_{\mathcal{X}}\frac{(f-g)^{2}}{\tilde{g}}\mathrm{d}\mu\leq\gamma^{-1}L_{2}^{2}(f,g),

where the first inequality follows from the elementary logarithm inequality log(x)x1\log(x)\leq x-1 for all x>0x>0, applied with x=f~/g~x=\tilde{f}/\tilde{g}. Indeed, let u(x)=log(x)x+1u(x)=\log(x)-x+1. Then u(x)=1/x1=(1x)/xu^{\prime}(x)=1/x-1=(1-x)/x, so u(x)<0u^{\prime}(x)<0 for x>1x>1 and u(x)0u^{\prime}(x)\geq 0 for 0<x10<x\leq 1. Therefore, uu is decreasing on (1,)(1,\infty) and increasing on (0,1](0,1], which yields the desired inequality u(x)u(1)=0u(x)\leq u(1)=0 for all x>0x>0.

The next equality comes from the following identities:

𝒳f~(f~g~1)dμ=𝒳f~2f~g~g~dμ=𝒳f~2f~g~f~g~+g~g~g~dμ=𝒳(f~g~)2g~dμ=𝒳(fg)2g~dμ.\int_{\mathcal{X}}\tilde{f}(\frac{\tilde{f}}{\tilde{g}}-1)\mathrm{d}\mu=\int_{\mathcal{X}}\frac{\tilde{f}^{2}-\tilde{f}\tilde{g}}{\tilde{g}}\mathrm{d}\mu=\int_{\mathcal{X}}\frac{\tilde{f}^{2}-\tilde{f}\tilde{g}-\tilde{f}\tilde{g}+\tilde{g}\tilde{g}}{\tilde{g}}\mathrm{d}\mu=\int_{\mathcal{X}}\frac{(\tilde{f}-\tilde{g})^{2}}{\tilde{g}}\mathrm{d}\mu=\int_{\mathcal{X}}\frac{(f-g)^{2}}{\tilde{g}}\mathrm{d}\mu.

The second equality in the preceding display holds because the terms added to the numerator integrate to zero:

𝒳f~g~+g~g~g~dμ=𝒳f~dμ+𝒳g~dμ=𝒳(f+h)dμ+𝒳(g+h)dμ=𝒳hdμ+𝒳hdμ=0,\int_{\mathcal{X}}\frac{-\tilde{f}\tilde{g}+\tilde{g}\tilde{g}}{\tilde{g}}\mathrm{d}\mu=-\int_{\mathcal{X}}\tilde{f}\mathrm{d}\mu+\int_{\mathcal{X}}\tilde{g}\mathrm{d}\mu=-\int_{\mathcal{X}}(f+h)\mathrm{d}\mu+\int_{\mathcal{X}}(g+h)\mathrm{d}\mu=-\int_{\mathcal{X}}h\mathrm{d}\mu+\int_{\mathcal{X}}h\mathrm{d}\mu=0,

where the penultimate equality uses 𝒳fdμ=𝒳gdμ=1\int_{\mathcal{X}}f\mathrm{d}\mu=\int_{\mathcal{X}}g\mathrm{d}\mu=1.

In fact, the proof of Proposition 3 follows the standard technique used in deriving estimation error bounds; see, for example, Meir & Zeevi (1997).

Additionally, we can show that the hh-lifted KL divergence satisfies a Pinsker-like inequality, in the sense that

KLh(f||g)TV(f,g),\sqrt{\mathrm{KL}_{h}\left(f\,||\,g\right)}\geq\mathrm{TV}(f,g),

where TV\mathrm{TV} represents the total variation distance between the densities ff and gg. Indeed, this is easy to observe since

KLh(f||g)\displaystyle\mathrm{KL}_{h}\left(f\,||\,g\right) =𝒳{f+h}logf+hg+hdμ=2f+h2log{f+h2}{g+h2}dμ=2KL(f+h2,g+h2)\displaystyle=\int_{\mathcal{X}}\left\{f+h\right\}\log\frac{f+h}{g+h}\mathrm{d}\mu=2\int\frac{f+h}{2}\log\frac{\left\{\frac{f+h}{2}\right\}}{\left\{\frac{g+h}{2}\right\}}\mathrm{d}\mu=2\mathrm{KL}\left(\frac{f+h}{2},\frac{g+h}{2}\right)
4{12𝒳|f+h2g+h2|dμ}2={𝒳|f+h2g+h2|dμ}2={12𝒳|fg|dμ}2=TV2(f,g),\displaystyle\geq 4\left\{\frac{1}{2}\int_{\mathcal{X}}\left|\frac{f+h}{2}-\frac{g+h}{2}\right|\mathrm{d}\mu\right\}^{2}\!\!\!=\left\{\int_{\mathcal{X}}\left|\frac{f+h}{2}-\frac{g+h}{2}\right|\mathrm{d}\mu\right\}^{2}\!\!\!=\left\{\frac{1}{2}\int_{\mathcal{X}}\left|f-g\right|\mathrm{d}\mu\right\}^{2}\!\!\!=\mathrm{TV}^{2}(f,g),

where the inequality is due to Pinsker’s inequality:

12KL(f||g)TV(f,g).\sqrt{\frac{1}{2}\mathrm{KL}\left(f\,||\,g\right)}\geq\mathrm{TV}\left(f,g\right).
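
With the same kind of illustrative densities as above, the following sketch numerically confirms that KLh(f||g)\mathrm{KL}_{h}\left(f\,||\,g\right) is bracketed by TV2(f,g)\mathrm{TV}^{2}(f,g) from below and by γ1L22(f,g)\gamma^{-1}L_{2}^{2}(f,g) from above, where γ\gamma is taken here as the grid minimum of g+hg+h.

```python
import numpy as np
from scipy import stats

# Illustrative densities on a discretized [0, 1], with a uniform lift h.
x = np.linspace(1e-6, 1 - 1e-6, 20_000)
dx = x[1] - x[0]
f = stats.beta.pdf(x, 2.0, 5.0)
g = 0.5 * stats.beta.pdf(x, 3.0, 3.0) + 0.5 * stats.beta.pdf(x, 1.0, 4.0)
h = np.ones_like(x)

kl_h = np.sum((f + h) * np.log((f + h) / (g + h))) * dx   # h-lifted KL divergence
l2_sq = np.sum((f - g) ** 2) * dx                          # squared L2 distance
gamma = np.min(g + h)                                       # lower bound on g + h
tv = 0.5 * np.sum(np.abs(f - g)) * dx                       # total variation distance

# The Pinsker-like lower bound and the Proposition 3 upper bound bracket KL_h.
print(tv ** 2, "<=", kl_h, "<=", l2_sq / gamma)
```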

C.4 Proof of Proposition 9

For choice (11), by the dominated convergence theorem, we observe that

d2dπ2κ((1π)p+πq)\displaystyle\frac{\mathrm{d}^{2}}{\mathrm{d}\pi^{2}}\kappa\left(\left(1-\pi\right)p+\pi q\right) =𝐄f{d2dπ2logf+h(1π)p+πq+h}+𝐄h{d2dπ2logf+h(1π)p+πq+h}\displaystyle={\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\left\{\frac{\mathrm{d}^{2}}{\mathrm{d}\pi^{2}}\log\frac{f+h}{\left(1-\pi\right)p+\pi q+h}\right\}+{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\left\{\frac{\mathrm{d}^{2}}{\mathrm{d}\pi^{2}}\log\frac{f+h}{\left(1-\pi\right)p+\pi q+h}\right\}
=𝐄f{(pq)2[(1π)p+πq+h]2}+𝐄h{(pq)2[(1π)p+πq+h]2}.\displaystyle={\operatorname*{{\mathrm{\mathbf{E}}}}}_{f}\left\{\frac{\left(p-q\right)^{2}}{\left[\left(1-\pi\right)p+\pi q+h\right]^{2}}\right\}+{\operatorname*{{\mathrm{\mathbf{E}}}}}_{h}\left\{\frac{\left(p-q\right)^{2}}{\left[\left(1-\pi\right)p+\pi q+h\right]^{2}}\right\}.

Suppose that each φ(;θ)𝒫\varphi\left(\cdot;\theta\right)\in\mathcal{P} is bounded from above by c<c<\infty. Then, since p,q𝒞p,q\in\mathcal{C} are non-negative functions, it follows that (pq)2c2\left(p-q\right)^{2}\leq c^{2}. If we further have aha\leq h for some a>0a>0, then [(1π)p+πq+h]2a2\left[\left(1-\pi\right)p+\pi q+h\right]^{2}\geq a^{2}, which implies that

d2dπ2κ((1π)p+πq)2×c2a2\frac{\mathrm{d}^{2}}{\mathrm{d}\pi^{2}}\kappa\left(\left(1-\pi\right)p+\pi q\right)\leq 2\times\frac{c^{2}}{a^{2}}

for every p,q𝒞p,q\in\mathcal{C} and π(0,1)\pi\in\left(0,1\right), and thus

supp,q𝒞,π(0,1)d2dπ2κ((1π)p+πq)2c2a2<.\sup_{p,q\in\mathcal{C},\pi\in\left(0,1\right)}\frac{\mathrm{d}^{2}}{\mathrm{d}\pi^{2}}\kappa\left(\left(1-\pi\right)p+\pi q\right)\leq\frac{2c^{2}}{a^{2}}<\infty.

Similarly, for case (12), we have

d2dπ2κn((1π)p+πq)\displaystyle\frac{\mathrm{d}^{2}}{\mathrm{d}\pi^{2}}\kappa_{n}((1-\pi)p+\pi q) =1ni=1nd2dπ2logf(xi)+h(xi)(1π)p(xi)+πq(xi)+h(xi)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\frac{\mathrm{d}^{2}}{\mathrm{d}\pi^{2}}\log\frac{f(x_{i})+h(x_{i})}{\left(1-\pi\right)p\left(x_{i}\right)+\pi q\left(x_{i}\right)+h\left(x_{i}\right)}
+1ni=1nd2dπ2logf(yi)+h(yi)(1π)p(yi)+πq(yi)+h(yi)\displaystyle\qquad\qquad+\ \frac{1}{n}\sum_{i=1}^{n}\frac{\mathrm{d}^{2}}{\mathrm{d}\pi^{2}}\log\frac{f\left(y_{i}\right)+h\left(y_{i}\right)}{\left(1-\pi\right)p\left(y_{i}\right)+\pi q\left(y_{i}\right)+h\left(y_{i}\right)}
=1ni=1n(p(xi)q(xi))2[(1π)p(xi)+πq(xi)+h(xi)]2+1ni=1n(p(yi)q(yi))2[(1π)p(yi)+πq(yi)+h(yi)]2.\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\frac{\left(p\left(x_{i}\right)-q\left(x_{i}\right)\right)^{2}}{\left[\left(1-\pi\right)p\left(x_{i}\right)+\pi q\left(x_{i}\right)+h\left(x_{i}\right)\right]^{2}}+\frac{1}{n}\sum_{i=1}^{n}\frac{\left(p\left(y_{i}\right)-q\left(y_{i}\right)\right)^{2}}{\left[\left(1-\pi\right)p\left(y_{i}\right)+\pi q\left(y_{i}\right)+h\left(y_{i}\right)\right]^{2}}.

By the same argument as for κ\kappa, we have (p(x)q(x))2c2\left(p\left(x\right)-q\left(x\right)\right)^{2}\leq c^{2} for every p,q𝒞p,q\in\mathcal{C} and every x𝒳x\in\mathcal{X}, and furthermore [(1π)p(x)+πq(x)+h(x)]2a2\left[\left(1-\pi\right)p\left(x\right)+\pi q\left(x\right)+h\left(x\right)\right]^{2}\geq a^{2} for any π(0,1)\pi\in\left(0,1\right). Thus,

supp,q𝒞,π(0,1)d2dπ2κn((1π)p+πq)2c2a2<, as required.\sup_{p,q\in\mathcal{C},\pi\in\left(0,1\right)}\frac{\mathrm{d}^{2}}{\mathrm{d}\pi^{2}}\kappa_{n}\left(\left(1-\pi\right)p+\pi q\right)\leq\frac{2c^{2}}{a^{2}}<\infty,\text{ as required.}
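
As an illustrative check of the calculation above (not part of the proof), the sketch below compares a central finite-difference estimate of the second derivative of κ\kappa along the segment (1π)p+πq(1-\pi)p+\pi q with the closed form just derived and with the bound 2c2/a22c^{2}/a^{2}; the beta densities pp, qq, the target ff, and the uniform lift hh are illustrative, and cc is taken as the grid maximum of pp and qq as a stand-in for the uniform bound on 𝒫\mathcal{P}.

```python
import numpy as np
from scipy import stats

# Illustrative densities on a discretized [0, 1]: target f, candidates p and q,
# and a uniform lift h (so a = inf h = 1).
x = np.linspace(1e-6, 1 - 1e-6, 20_000)
dx = x[1] - x[0]
f = stats.beta.pdf(x, 2.0, 5.0)
p = stats.beta.pdf(x, 3.0, 3.0)
q = stats.beta.pdf(x, 1.0, 4.0)
h = np.ones_like(x)
a = h.min()
c = max(p.max(), q.max())   # stand-in for the uniform bound c on the class P

def kappa(pi):
    """kappa((1 - pi) p + pi q) = KL_h(f || (1 - pi) p + pi q) on the grid."""
    mix = (1.0 - pi) * p + pi * q
    return np.sum((f + h) * np.log((f + h) / (mix + h))) * dx

# Central finite difference for the second derivative at an interior point.
pi0, eps = 0.4, 1e-3
d2_numeric = (kappa(pi0 + eps) - 2.0 * kappa(pi0) + kappa(pi0 - eps)) / eps**2

# Closed form from the display above, and the bound 2 c^2 / a^2.
mix0 = (1.0 - pi0) * p + pi0 * q
d2_exact = np.sum((f + h) * (p - q) ** 2 / (mix0 + h) ** 2) * dx
print(d2_numeric, d2_exact, "<=", 2 * c**2 / a**2)
```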

Appendix D Technical results

Here we collect some technical results that are required in our proofs but are stated elsewhere in the literature. In some places, notation has been modified from the original sources to remain consistent with the conventions established herein.

Lemma 18 (Kosorok, 2007. Lem 9.18).

Let N(,ε,)N(\mathscr{F},\varepsilon,\operatorname*{\lVert\,\cdot\,\rVert}) denote the ε\varepsilon-covering number of \mathscr{F}, N[](,ε,)N_{[]}(\mathscr{F},\varepsilon,\operatorname*{\lVert\,\cdot\,\rVert}) the ε\varepsilon-bracketing number of \mathscr{F}, and \operatorname*{\lVert\,\cdot\,\rVert} be any norm on \mathscr{F}. Then, for all ε>0\varepsilon>0

N(,ε,)N[](,ε,).N(\mathscr{F},\varepsilon,\operatorname*{\lVert\,\cdot\,\rVert})\leq N_{[]}(\mathscr{F},\varepsilon,\operatorname*{\lVert\,\cdot\,\rVert}).
Lemma 19 (Kosorok, 2007. Lem 9.22).

For any norm \operatorname*{\lVert\,\cdot\,\rVert} dominated by {\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty} and any class of functions \mathscr{F},

logN[](,2ε,)logN(,ε,), for all ε>0.\log N_{[]}(\mathscr{F},2\varepsilon,\operatorname*{\lVert\,\cdot\,\rVert})\leq\log N(\mathscr{F},\varepsilon,{\operatorname*{\lVert\,\cdot\,\rVert}}_{\infty}),\text{ for all $\varepsilon>0$.}
Lemma 20 (Kosorok, 2007. Thm 9.23).

For some metric dd on TT, let ={ft:tT}\mathscr{F}=\{f_{t}:t\in T\} be a class of functions satisfying

|fs(x)ft(x)|d(s,t)F(x),\lvert f_{s}(x)-f_{t}(x)\rvert\leq d(s,t)F(x),

for some fixed function FF on 𝒳\mathcal{X} and for all x𝒳x\in\mathcal{X} and s,tTs,t\in T. Then, for any norm \operatorname*{\lVert\,\cdot\,\rVert},

N[](,2εF,)N(T,ε,d).N_{[]}(\mathscr{F},2\varepsilon\lVert F\rVert,\operatorname*{\lVert\,\cdot\,\rVert})\leq N(T,\varepsilon,d).
Lemma 21 (Shalev-Shwartz & Ben-David (2014), Lem 26.7).

Let AA be a subset of m\mathbb{R}^{m} and let

A={j=1Nαj𝐚jN,𝐚jA,αj0,α1=1}.A^{\prime}=\left\{\sum_{j=1}^{N}\alpha_{j}\mathbf{a}_{j}\mid N\in\operatorname*{\mathbb{N}},\mathbf{a}_{j}\in A,\alpha_{j}\geq 0,\lVert\mathbf{\alpha}\rVert_{1}=1\right\}.

Then, n(A)=n(A)\mathcal{R}_{n}(A^{\prime})=\mathcal{R}_{n}(A), i.e., both AA and AA^{\prime} have the same Rademacher complexity.

Lemma 22 (van de Geer, 2016, Thm. 16.2).

Let (Xi)i[n](X_{i})_{i\in[n]} be non-random elements of 𝒳\mathcal{X} and let \mathscr{F} be a class of real-valued functions on 𝒳\mathcal{X}. If φi:\varphi_{i}:\operatorname*{\mathbb{R}}\to\operatorname*{\mathbb{R}}, i[n]i\in[n], are functions vanishing at zero that satisfy |φi(u)φi(v)||uv|\lvert\varphi_{i}(u)-\varphi_{i}(v)\rvert\leq\lvert u-v\rvert for all u,vu,v\in\operatorname*{\mathbb{R}}, then we have

𝐄{i=1nφi(f(Xi))εi}2𝐄{i=1nf(Xi)εi}.{\mathrm{\mathbf{E}}}\left\{\left\lVert\sum_{i=1}^{n}\varphi_{i}(f(X_{i}))\varepsilon_{i}\right\rVert_{\mathscr{F}}\right\}\leq 2{\mathrm{\mathbf{E}}}\left\{\left\lVert\sum_{i=1}^{n}f(X_{i})\varepsilon_{i}\right\rVert_{\mathscr{F}}\right\}.
Lemma 23 (McDiarmid, 1998, Thm. 3.1 or McDiarmid, 1989).

Suppose (Xi)i[n](X_{i})_{i\in[n]} are independent random variables and let Z=g(X1,,Xn)Z=g(X_{1},\ldots,X_{n}) for some function gg. If gg satisfies the bounded difference condition, that is, if there exist constants cjc_{j} such that for all j[n]j\in[n] and all x1,,xn,xjx_{1},\ldots,x_{n},x_{j}^{\prime},

|g(x1,,xj1,xj,xj+1,,xn)g(x1,,xj1,xj,xj+1,,xn)|cj,\lvert g(x_{1},\ldots,x_{j-1},x_{j},x_{j+1},\ldots,x_{n})-g(x_{1},\ldots,x_{j-1},x_{j}^{\prime},x_{j+1},\ldots,x_{n})\rvert\leq c_{j},

then

𝐏(Z𝐄Zt)exp{2t2j=1ncj2}.{\mathrm{\mathbf{P}}}(Z-{\mathrm{\mathbf{E}}}Z\geq t)\leq\exp\left\{\frac{-2t^{2}}{\sum_{j=1}^{n}c_{j}^{2}}\right\}.
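
As a simple illustration of the lemma (not required for any of our proofs), the sketch below applies the bound to the empirical mean of nn i.i.d. variables supported on [0,1][0,1], for which each cj=1/nc_{j}=1/n, and compares it with a Monte Carlo estimate of the corresponding tail probability.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, t = 50, 100_000, 0.1

# Z = g(X_1, ..., X_n) is the empirical mean of i.i.d. variables in [0, 1];
# changing one coordinate moves Z by at most c_j = 1/n, so sum_j c_j^2 = 1/n.
X = rng.uniform(0.0, 1.0, size=(reps, n))
Z = X.mean(axis=1)

empirical_tail = np.mean(Z - 0.5 >= t)               # here E[Z] = 1/2
mcdiarmid_bound = np.exp(-2.0 * t**2 / (1.0 / n))    # = exp(-2 n t^2)
print(empirical_tail, "<=", mcdiarmid_bound)
```
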
Lemma 24 (van der Vaart & Wellner, 1996, Lem. 2.3.1).

Let (f)=𝐄f\mathfrak{R}(f)=\mathrm{\mathbf{E}}f and n(f)=n1i=1nf(Xi)\mathfrak{R}_{n}(f)=n^{-1}\sum_{i=1}^{n}f(X_{i}). If Φ:>0>0\varPhi:\mathbb{R}_{>0}\to\mathbb{R}_{>0} is a convex function, then the following inequality holds for any class of measurable functions \mathscr{F}:

𝐄Φ((f)n(f))𝐄Φ(2Rn(f)),{\mathrm{\mathbf{E}}}\varPhi\left(\left\lVert\mathfrak{R}(f)-\mathfrak{R}_{n}(f)\right\rVert_{\mathscr{F}}\right)\leq{\mathrm{\mathbf{E}}}\varPhi\left(2\left\lVert R_{n}(f)\right\rVert_{\mathscr{F}}\right),

where Rn(f)R_{n}(f) is the Rademacher process indexed by \mathscr{F}. In particular, since the identity map is convex,

𝐄{(f)n(f)}2𝐄{Rn(f)}.{\mathrm{\mathbf{E}}}\left\{\lVert\mathfrak{R}(f)-\mathfrak{R}_{n}(f)\rVert_{\mathscr{F}}\right\}\leq 2{\mathrm{\mathbf{E}}}\left\{\lVert R_{n}(f)\rVert_{\mathscr{F}}\right\}.
Lemma 25 (Koltchinskii, 2011, Thm. 3.11).

Let dnd_{n} be the empirical distance

dn2(f1,f2)=1ni=1n(f1(Xi)f2(Xi))2d_{n}^{2}(f_{1},f_{2})=\frac{1}{n}\sum_{i=1}^{n}(f_{1}(X_{i})-f_{2}(X_{i}))^{2}

and denote by N(,ε,dn)N(\mathscr{F},\varepsilon,d_{n}) the ε\varepsilon-covering number of \mathscr{F}. Let σn2supfPnf2\sigma_{n}^{2}\coloneqq\sup_{f\in\mathscr{F}}P_{n}f^{2}, where PnP_{n} denotes the empirical measure, so that Pnf2=n1i=1nf2(Xi)P_{n}f^{2}=n^{-1}\sum_{i=1}^{n}f^{2}(X_{i}). Then the following inequality holds:

𝐄{Rn(f)}Kn𝐄02σnlog1/2N(,ε,dn)dε{\mathrm{\mathbf{E}}}\left\{\lVert R_{n}(f)\rVert_{\mathscr{F}}\right\}\leq\frac{K}{\sqrt{n}}\mathrm{\mathbf{E}}\int_{0}^{2\sigma_{n}}\log^{1/2}N(\mathscr{F},\varepsilon,d_{n})\mathrm{d}\varepsilon

for some constant K>0K>0.