

Log-concave Density Estimation with Independent Components

Sharvaj Kubal, Christian Campbell, and Elina Robeva
Department of Mathematics, University of British Columbia, Vancouver, Canada (erobeva@math.ubc.ca)

Submitted July 26, 2025. Funding: Elina Robeva was supported by an NSERC Discovery Grant DGECR-2020-00338.
Abstract

We propose a method for estimating a log-concave density on $\mathbb{R}^{d}$ from samples, under the assumption that there exists an orthogonal transformation that makes the components of the random vector independent. While log-concave density estimation is hard both computationally and statistically, the independent components assumption alleviates both issues, while still maintaining a large non-parametric class. We prove that under mild conditions, at most $\tilde{\mathcal{O}}_{d}(\epsilon^{-4})$ samples (suppressing constants and log factors) suffice for our proposed estimator to be within $\epsilon$ of the original density in squared Hellinger distance. On the computational front, while the usual log-concave maximum likelihood estimate can be obtained via a finite-dimensional convex program, it is slow to compute – especially in higher dimensions. We demonstrate through numerical experiments that our estimator can be computed efficiently, making it more practical to use.

keywords:
density estimation, log-concave densities, sample complexity

1 Introduction

A log-concave probability density $p$ on $\mathbb{R}^{d}$ is of the form $p(x)=e^{-\varphi(x)}$ for a closed, proper, convex function $\varphi:\mathbb{R}^{d}\to(-\infty,+\infty]$. Log-concave density estimation is a type of non-parametric density estimation dating back to the work of Walther [37]. The class of log-concave distributions includes many known parametric families, has attractive statistical properties, and admits maximum likelihood estimates (MLE) that can be computed with established optimization algorithms [33]. Moreover, the MLE is ‘automatic’ in that no tuning parameters (such as kernel bandwidths) need to be chosen – a feature that can be especially useful in higher dimensions. However, log-concave density estimation suffers from the curse of dimensionality – one requires $n\gtrsim_{d}\epsilon^{-\frac{d+1}{2}}$ samples for the log-concave MLE in $\mathbb{R}^{d}$ to be $\epsilon$-close to the true density in squared Hellinger distance [24]. Furthermore, the log-concave MLE is too slow to compute for large values of $d$ (e.g., $d\geq 6$ and $n\geq 5000$).

We reduce the family of log-concave densities by assuming that the components of the random vector at hand become independent after a suitable orthogonal transformation is applied – an independent components assumption. While the family that we study is somewhat restrictive, it includes all multivariate normal distributions, and is still non-parametric. Furthermore, the independent components assumption provides a link with independent component analysis (ICA) models, see Samworth and Yuan [32]. We propose an easy-to-compute estimator for this family of densities and show that the sample complexity rate is no longer exponential in the ambient dimension $d$: with at most $n\gtrsim_{d}\epsilon^{-4}$ samples (up to constants and log factors), the estimate is within $\epsilon$ of the true density in squared Hellinger distance.

1.1 Prior work

Shape-constrained density estimation has a long history. Classes of shape-constrained densities that have previously been studied include non-increasing [18], convex [19], $k$-monotone [5], $s$-concave [14] and quasi-concave densities [25]. The log-concave class of densities was first studied in dimension 1 by Walther [37], and a concise description of the log-concave MLE in higher dimensions was given by Cule et al. [11]. Various statistical properties of the procedure were established by Schuhmacher and Dümbgen [34], Carpenter et al. [7], Kur et al. [27], and Barber and Samworth [6] amongst others. Log-concavity has also gained popularity in applications [4, 33].

Subclasses of log-concave densities have also been studied in recent years, both to reduce the computational and statistical complexities of estimation and because they are important in their own right. In the 1-dimensional setting, Kim et al. [23] study the set of log-concave densities that are exponentials of concave functions defined by at most $k$ linear hyperplanes. In this case, they show that the statistical rate becomes parametric. Xu and Samworth [39] consider log-concave densities whose super-level sets are scalar multiples of a fixed convex body in $\mathbb{R}^{d}$, and establish univariate rates for estimation. Robeva et al. [30] consider maximum likelihood estimation for totally positive and log-concave densities, which correspond to random vectors whose components are highly positively correlated. They prove existence and uniqueness of the MLE and propose an algorithm for computing it, although the statistical rates for this estimator are still unknown. Next, Kubjas et al. [26] study the family of log-concave densities that factorize according to an undirected graphical model, also providing existence and uniqueness results for the MLE (when the graph is given and chordal) together with an algorithm. Sample complexity rates for this model have not been studied either.

Here, we study the subfamily of log-concave densities on $\mathbb{R}^{d}$ obtained by transforming product densities through orthogonal maps. These correspond to random vectors whose coordinates become independent after some orthogonal transformation. The most similar family to ours was considered by Samworth and Yuan [32], who study the class of log-concave densities of random vectors whose coordinates become independent after a linear transformation (rather than just an orthogonal transformation). They show that the projection (in a Kullback-Leibler (KL) sense) of a (not necessarily log-concave) distribution satisfying the above independent components property, to the set of log-concave distributions, still satisfies the independent components property with the same linear transformation. They also provide a method to perform density estimation in this setting, although their algorithm does not guarantee convergence to a global optimum. On the statistical front, they prove consistency; to the best of our knowledge, however, finite sample rates have not yet been established in this setting. By simplifying the model here, we are able to provide a very fast algorithm together with sample complexity upper bounds displaying univariate rates.

Towards applicability to real data for tasks such as clustering, we show that our estimator can be used in conjunction with the Expectation-Maximization (EM) algorithm [13] to estimate mixtures of densities from our subfamily – this is a semiparametric generalization of Gaussian mixtures. A similar application was proposed in Cule et al. [11], where they show that mixtures of general log-concave densities can work better than Gaussian mixtures. However, estimating a mixture of general log-concave densities can be very costly, particularly in higher dimensions. Plugging our proposed estimator into the EM algorithm instead yields a much faster computation.

1.2 Problem

Suppose $X$ is a zero-mean random vector in $\mathbb{R}^{d}$, with a (probability) density $p:\mathbb{R}^{d}\to[0,\infty)$. We say that $X$ (or $p$) satisfies the orthogonal independent components property if there exists an orthogonal matrix $\mathbf{W}\in\mathbb{R}^{d\times d}$ such that the components of $Z=\mathbf{W}X$ are independent. In this case, the joint density of $Z$ can be expressed as the product

(1)  $\displaystyle f(z)=\prod_{i=1}^{d}f_{i}(z_{i}),$

where each $f_{i}:\mathbb{R}\to[0,\infty)$ is a univariate density. We focus mainly on the case where $p$ is log-concave, which is equivalent in this setting to requiring log-concavity of $f_{1},\dots,f_{d}$. Our goal is to efficiently estimate $p$ from independent and identically distributed (iid) samples, given that it satisfies the orthogonal independent components property with some unknown $\mathbf{W}$.

By writing $X=\mathbf{W}^{-1}Z$, we can interpret this setting as an independent component analysis (ICA) model, where $Z$ is the vector of independent sources and $\mathbf{W}$ is the so-called unmixing matrix. Although mixing matrices in ICA do not need to be orthogonal to begin with, a pre-whitening procedure [10] can transform the problem to such an orthogonal form.

Instead of estimating the “$d$-dimensional” density $p$ directly, we would like to leverage the underlying structure provided by (1), which allows us to simplify $p$ (using a change-of-variables formula) as

(2)  $\displaystyle p(x)=\prod_{i=1}^{d}f_{i}\left([\mathbf{W}x]_{i}\right)=\prod_{i=1}^{d}f_{i}(w_{i}\cdot x).$

Here, $w_{1},\dots,w_{d}\in\mathbb{S}^{d-1}$ are the rows of $\mathbf{W}$, and we refer to them as independent directions of $p$. Note that the model $Z=\mathbf{W}X$ and the factorization in (2) are invariant under permuting the rows of $\mathbf{W}$ and the corresponding components of $Z$, as well as under flipping their respective signs.
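To spell out the change of variables behind (2): since $\mathbf{W}$ is orthogonal, $|\det\mathbf{W}|=1$, so the density of $X=\mathbf{W}^{-1}Z$ is

\[
p(x)=|\det\mathbf{W}|\,f(\mathbf{W}x)=\prod_{i=1}^{d}f_{i}\left([\mathbf{W}x]_{i}\right),\qquad [\mathbf{W}x]_{i}=w_{i}\cdot x.
\]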

If $\mathbf{W}$ is known beforehand – say, if it is provided by an oracle – the estimation problem simplifies, and one can use the following procedure:

  1. For each $i=1,\dots,d$, estimate the univariate density $f_{i}$ (say via log-concave maximum likelihood estimation [33]) using the samples $\{Z_{i}^{(\ell)}=w_{i}\cdot X^{(\ell)}:\ell\in[N]\}$, where $X^{(1)},\dots,X^{(N)}\overset{\mathrm{iid}}{\sim}p$. Denote the estimated density by $\hat{f}_{i}$.

  2. Plug the estimated densities into (2) to obtain the oracle-informed estimate of $p$:

     (3)  $\displaystyle\hat{p}_{\text{oracle}}(x)=\prod_{i=1}^{d}\hat{f}_{i}(w_{i}\cdot x).$

In reality, the unmixing matrix $\mathbf{W}$ is unknown, and one needs to estimate it. If the covariance matrix of $X$ has distinct, well-separated eigenvalues, then this can be done by Principal Component Analysis (PCA), which involves diagonalizing the sample covariance matrix of $X$. Otherwise, certain ICA algorithms may be used to estimate $\mathbf{W}$, provided some non-Gaussianity requirements are satisfied [3, 17]. We refrain from using the same samples for the unmixing matrix estimation and the density estimation steps. Instead, we split the samples into two subsets, $\{X^{(1)},\dots,X^{(N)}\}$ and $\{Y^{(1)},\dots,Y^{(M)}\}$, and use different subsets for the two purposes in order to prevent potentially problematic cross-interactions.

Denote the (PCA or ICA) estimate of $\mathbf{W}$ by $\hat{\mathbf{W}}$. We always require $\hat{\mathbf{W}}$ to be an orthogonal matrix. Given enough samples, the rows $\hat{w}_{1},\dots,\hat{w}_{d}\in\mathbb{S}^{d-1}$ of $\hat{\mathbf{W}}$ can be shown to be close to those of $\mathbf{W}$ up to permutations and sign-flips. More precisely, there exist a permutation $\sigma:[d]\to[d]$ and signs $\theta_{i}\in\{-1,+1\}$ such that for some threshold $\epsilon>0$,

$\displaystyle\|\hat{w}_{i}-\theta_{i}w_{\sigma(i)}\|_{2}\leq\epsilon,\quad\quad i\in[d].$

As the factorization in (2) is invariant under such permutations and sign-flips, one can simply relabel $w_{i}\leftarrow\theta_{i}w_{\sigma(i)}$ and $Z_{i}\leftarrow\theta_{i}Z_{\sigma(i)}$ to proceed with the analysis.

From $\hat{\mathbf{W}}$, we can follow a procedure analogous to that used to obtain the oracle-informed estimate of $p$. Namely, we use the estimated independent directions $\hat{w}_{i}\in\mathbb{R}^{d}$ and the samples projected onto each such direction to produce univariate estimates $\hat{p}_{\hat{w}_{i}}$; then we take the product of these estimates over $i$ (see Definition 2.4).

1.3 Summary of contributions

The main contribution of this work is an algorithm for density estimation in the setting of Section 1.2, for which we prove finite sample guarantees and demonstrate numerical performance in terms of estimation errors and runtimes. The proposed algorithm is recorded as Algorithm 1.

The theoretical analysis in Section 3 yields sample complexity bounds for our algorithm under various assumptions, which are listed in Table 1 and Table 2. Briefly, if the true density $p$ is log-concave and satisfies some moment assumptions in addition to the orthogonal independent components property, then it can be estimated efficiently. In particular, $\tilde{\mathcal{O}}_{d}(\epsilon^{-4})$ samples (up to constants and log factors) are sufficient for our proposed estimator $\hat{p}$ to be within $\epsilon$ of $p$ in squared Hellinger distance with probability 0.99. If the density $p$ has some additional smoothness, the rate can be improved up to $\epsilon^{-2}$.

Since we referred to smoothness, a quick comparison with kernel density estimators (KDE) is in order. Recall that in dimension $d$, a KDE typically requires the true density $p$ to have “smoothness of order $d$” in order to achieve a dimension-independent statistical rate. More precisely, for $p$ in the Sobolev space $\mathcal{W}^{s,1}(\mathbb{R}^{d})$ with $s\geq 1$, achieving an expected total variation error of $\epsilon$ requires $n\gtrsim_{d}\epsilon^{-\frac{d+2s}{s}}$ samples [21], so that one needs $s\gtrsim d$ to counter the curse of dimensionality. By contrast, our smoothness assumptions (see Smoothness assumption S1 and Smoothness assumption S2) are “order one” in that we only require control on the first-order (and, if available, the second-order) derivatives.

Computational experiments are presented in Section 4, where we demonstrate improved estimation errors (in squared Hellinger distance) and improved runtimes compared to the usual log-concave MLE. A key consequence of our fast algorithm is easy scalability to higher dimensions – we are able to estimate densities in $d=30$ within no more than a few seconds.

2 Set-up and algorithms

First, we establish some notation. For a random vector (e.g. $X$), the subscript labels the components of the vector in $\mathbb{R}^{d}$, whereas the superscript labels the iid samples (or copies). Consider a (probability) density $p:\mathbb{R}^{d}\to[0,\infty)$, and suppose $X\sim p$. Given a unit vector $s\in\mathbb{S}^{d-1}$, define the $s$-marginal $p_{s}:\mathbb{R}\to[0,\infty)$ to be the density of $s\cdot X$. Additionally, for any (measurable) map $T:\mathbb{R}^{d}\to\mathbb{R}^{d}$, define the pushforward $T_{\sharp}p$ to be the law of $T(X)$. For an invertible linear map $\mathbf{A}:\mathbb{R}^{d}\to\mathbb{R}^{d}$, the pushforward $\mathbf{A}_{\sharp}p$ has a density $|\mathrm{det}(\mathbf{A}^{-1})|\,p(\mathbf{A}^{-1}(x))$. As described in Section 1, we have the relationship $Z=\mathbf{W}X$, for which one can write $\mathrm{Law}(Z)=\mathbf{W}_{\sharp}p$ by identifying $\mathbf{W}$ with the linear map it induces.

For any positive integer $n$, define $[n]:=\{1,2,\dots,n\}$. We denote by $\|\cdot\|_{2}$ the Euclidean norm on $\mathbb{R}^{d}$, and by $\|\cdot\|_{\mathrm{op}}$ and $\|\cdot\|_{F}$ respectively the operator norm and the Frobenius norm on $\mathbb{R}^{d\times d}$. The squared Hellinger distance between densities $p,q:\mathbb{R}^{d}\to[0,\infty)$ is defined as $h_{d}^{2}(p,q)=(1/2)\int_{\mathbb{R}^{d}}(\sqrt{q(x)}-\sqrt{p(x)})^{2}dx=1-\int_{\mathbb{R}^{d}}\sqrt{q(x)p(x)}dx$. Note that $(p,q)\mapsto h_{d}(p,q)=(h_{d}^{2}(p,q))^{1/2}$ is a metric on the space of densities. In various proofs, we also use the Wasserstein-1 distance, written $W_{1}(\cdot,\cdot)$.
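The second expression for $h_{d}^{2}$ above is also convenient numerically. A minimal sketch for the univariate case, assuming the two densities are supplied as vectorized Python callables and the chosen interval covers their effective support:

```python
import numpy as np
from scipy.stats import norm

def squared_hellinger_1d(f, g, lo=-20.0, hi=20.0, n=20001):
    """Approximate h_1^2(f, g) = 1 - integral of sqrt(f(x) g(x)) dx over [lo, hi].

    f, g : vectorized callables returning density values on the real line.
    """
    grid = np.linspace(lo, hi, n)
    dx = grid[1] - grid[0]
    return 1.0 - np.sqrt(f(grid) * g(grid)).sum() * dx

# Example: two centered Gaussians with different scales.
h2 = squared_hellinger_1d(norm(0, 1).pdf, norm(0, 2).pdf)
```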

Finally, the symbols $c,C,C^{\prime}$ etc. stand for absolute (universal) constants that do not depend on any parameters of the problem. Also, their values may be readjusted as required in different statements. We also use constants $K_{d},K_{d}^{\prime}$ etc. that depend only on the dimension $d$.

Recall the zero-mean density $p$, satisfying the orthogonal independent components property, and consider the split-up independent samples $\{X^{(1)},\dots,X^{(N)}\}$ and $\{Y^{(1)},\dots,Y^{(M)}\}$ as introduced in Section 1. Our procedure for estimating $p$ consists of two stages: (1) use the samples $\{Y^{(1)},\dots,Y^{(M)}\}$ to estimate an unmixing matrix $\mathbf{W}$, and (2) use the samples $\{X^{(1)},\dots,X^{(N)}\}$ to estimate the marginal densities of $p$ in the directions given by the (estimated) rows of $\mathbf{W}$. We expand on these below, starting with the unmixing matrix estimation.

For $X\sim p$ (with finite second moments), consider the covariance matrix $\mathbf{\Sigma}=\mathbb{E}XX^{T}$. Since the components of $Z=\mathbf{W}X$ are independent, the covariance matrix of $Z$,

$\displaystyle\mathbf{\Sigma}_{Z}=\mathbb{E}ZZ^{T}=\mathbb{E}\mathbf{W}XX^{T}\mathbf{W}^{T}=\mathbf{W}\mathbf{\Sigma}\mathbf{W}^{T},$

is diagonal. Hence, the rows of $\mathbf{W}$ are the (normalized) eigenvectors of $\mathbf{\Sigma}$, and PCA can be used to estimate these eigenvectors under the following non-degeneracy condition:

Moment assumption M1

The density $p$ has finite second moments, and the eigenvalues of the covariance matrix $\mathbf{\Sigma}=\mathbb{E}_{X\sim p}XX^{T}$ are separated from each other by at least $\delta>0$. Heteroscedastic multivariate Gaussians $\mathcal{N}(0,\mathbf{\Sigma})$, where $\mathbf{\Sigma}$ has well-separated eigenvalues, satisfy M1, and so do uniform distributions supported on rectangles with distinct side lengths.

From samples $Y^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}p$, one can compute the empirical covariance matrix

(4)  $\displaystyle\hat{\mathbf{\Sigma}}:=\frac{1}{M}\sum_{j=1}^{M}Y^{(j)}{Y^{(j)}}^{T},$

which is symmetric and positive semi-definite. Diagonalizing it yields the PCA-based estimate of $\mathbf{W}$.

Definition 2.1 (PCA-based estimate of $\mathbf{W}$).

Let $Y^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}p$, and $\hat{\mathbf{\Sigma}}$ defined as in (4). Denote by $\hat{w}_{1},\dots,\hat{w}_{d}$ a set of orthonormal eigenvectors of $\hat{\mathbf{\Sigma}}$. Then, the estimate $\hat{\mathbf{W}}$ is defined to be the orthogonal matrix whose rows are $\hat{w}_{1},\dots,\hat{w}_{d}$.
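A minimal sketch of Definition 2.1, assuming the samples are already centered and stored as a NumPy array; `numpy.linalg.eigh` returns orthonormal eigenvectors of the symmetric matrix $\hat{\mathbf{\Sigma}}$:

```python
import numpy as np

def estimate_unmixing_pca(Y):
    """PCA-based estimate of W from centered samples Y of shape (M, d).

    Rows of the returned orthogonal matrix are orthonormal eigenvectors
    of the empirical covariance matrix, as in Definition 2.1.
    """
    M = Y.shape[0]
    Sigma_hat = (Y.T @ Y) / M                 # empirical covariance, eq. (4)
    _, eigvecs = np.linalg.eigh(Sigma_hat)    # columns are orthonormal eigenvectors
    return eigvecs.T                          # rows of W_hat = estimated directions
```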

Remark 2.2.

Note that $\hat{\mathbf{W}}$ above is defined only up to permutations and sign-flips of the rows. This is consistent, nevertheless, with our earlier discussion on the invariance of the model with respect to these transformations.

By the concentration of $\hat{\mathbf{\Sigma}}$ about $\mathbf{\Sigma}$, and the Davis-Kahan theorem [40], the $\hat{w}_{i}$’s can be shown to be good estimates of $w_{i}$, provided $\delta$ is not too small (see Proposition 3.12).

Now consider the situation where the eigenvalues of $\mathbf{\Sigma}$ are poorly separated, or possibly degenerate. PCA may fail to recover the independent directions in this case. Consider, for example, the extreme case of an isotropic uniform distribution on a cube in $\mathbb{R}^{d}$ with $\mathbf{\Sigma}=\mathbf{I}$. Here, any $w\in\mathbb{S}^{d-1}$ is an eigenvector of $\mathbf{\Sigma}$, whereas the independent directions are only the $d$ axes of the cube. When $\mathbf{\Sigma}$ has such degenerate eigenvalues, one needs to use higher moments to infer the unmixing matrix $\mathbf{W}$, which is the key principle behind ICA.

Recall our model $X=\mathbf{W}^{-1}Z$, where we think of $Z$ as the latent variables. As it is common in the ICA literature to have the latent variables ‘standardized’ to be unit-variance, we define $S:=\mathbf{\Sigma}_{Z}^{-1/2}Z$. This lets us rewrite $X=\mathbf{W}^{-1}Z=\mathbf{W}^{-1}\mathbf{\Sigma}_{Z}^{1/2}S=\mathbf{A}S$, where the new mixing matrix is $\mathbf{A}=\mathbf{W}^{-1}\mathbf{\Sigma}_{Z}^{1/2}$. By orthogonality of $\mathbf{W}$ and the diagonal nature of $\mathbf{\Sigma}_{Z}$, the columns of $\mathbf{A}$ are just scaled versions of the original independent directions $w_{1},\dots,w_{d}$.

All ICA methods require some form of non-Gaussianity; following Auddy and Yuan [3], we require the following assumptions on $S$ and $\mathbf{A}$.

Moment assumption M2

For each $i\in[d]$, assume that $\mu_{4}^{-1}\leq|\mathbb{E}S_{i}^{4}-3|\leq\mu_{4}$ and $\mathbb{E}|S_{i}|^{9}\leq\mu_{9}$ for some $\mu_{4},\mu_{9}>0$. Additionally, assume that the condition number $\mathrm{cond}(\mathbf{A})\leq\kappa$ for some $\kappa\geq 1$.

Remark 2.3.

Note that since $S=\mathbf{\Sigma}_{Z}^{-1/2}Z$, one could alternatively frame Moment assumption M2 in terms of $Z$. In particular, we have that $\mathrm{cond}(\mathbf{A})=\mathrm{cond}(\mathbf{\Sigma}_{Z})^{1/2}$.

Auddy and Yuan [3] propose a variant of the FastICA algorithm [22] that produces an estimate $\hat{\mathbf{A}}$ of $\mathbf{A}$ achieving the optimal sample complexity for computationally tractable procedures. To recover an estimate $\hat{\mathbf{W}}$ of $\mathbf{W}$, we first rescale the columns of $\hat{\mathbf{A}}$ to unit norm, and form a matrix $\tilde{\mathbf{A}}$ from these rescaled columns. To account for the possible non-orthogonality of the columns of $\tilde{\mathbf{A}}$, we compute the singular value decomposition $\tilde{\mathbf{A}}=\mathbf{U}\mathbf{D}\mathbf{V}^{T}$, and set $\hat{\mathbf{W}}:=\mathbf{V}\mathbf{U}^{T}$.
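A sketch of this post-processing step, using scikit-learn's FastICA as a stand-in for the estimator of $\hat{\mathbf{A}}$ (the specific variant of Auddy and Yuan is not assumed to be available as a package):

```python
import numpy as np
from sklearn.decomposition import FastICA

def estimate_unmixing_ica(Y):
    """ICA-based estimate of W from centered samples Y of shape (M, d)."""
    ica = FastICA(n_components=Y.shape[1], whiten="unit-variance").fit(Y)
    A_hat = ica.mixing_                               # estimated mixing matrix A
    A_tilde = A_hat / np.linalg.norm(A_hat, axis=0)   # rescale columns to unit norm
    U, _, Vt = np.linalg.svd(A_tilde)                 # A_tilde = U D V^T
    return Vt.T @ U.T                                 # W_hat := V U^T (orthogonal)
```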

Isotropic uniform and exponential densities, for example, satisfy Moment assumption M2, but Gaussians do not, due to the non-zero excess kurtosis requirement. Recall, however, that heteroscedastic Gaussians do satisfy Moment assumption M1. If the moment assumptions discussed here are too restrictive, one could use other algorithms such as Fourier PCA [17], which require weaker notions of non-Gaussianity. Our theory in the following sections is agnostic to the specific procedure used to estimate $\mathbf{W}$, and depends only on the estimation error.

Given $\hat{\mathbf{W}}$, and hence the estimated independent directions $\hat{w}_{1},\dots,\hat{w}_{d}\in\mathbb{S}^{d-1}$, we now consider the second stage of the estimation procedure – estimating the marginal densities of $p$. Take the samples $X^{(1)},\dots,X^{(N)}\overset{\mathrm{iid}}{\sim}p$, which are independent of $Y^{(1)},\dots,Y^{(M)}$, and project them as

(5)  $\displaystyle\hat{Z}_{i}^{(j)}=\hat{w}_{i}\cdot X^{(j)}.$

Conditional on $\hat{w}_{1},\dots,\hat{w}_{d}$, it holds that $\hat{Z}_{i}^{(1)},\dots,\hat{Z}_{i}^{(N)}\overset{\mathrm{iid}}{\sim}p_{\hat{w}_{i}}$ (almost surely in $\hat{w}_{i}$), which follows from the independence between $\hat{w}_{i}$ and $X^{(j)}$. Univariate density estimation can then be used to estimate each marginal $p_{\hat{w}_{i}}$ using the samples $\hat{Z}_{i}^{(1)},\dots,\hat{Z}_{i}^{(N)}$. When $p$ is log-concave in particular (which is the focus of this work), we use univariate log-concave MLE [37]. This procedure finally yields our proposed “plug-in” estimator.

Definition 2.4 (Proposed estimator).

Given estimated orthogonal independent directions $\hat{w}_{1},\dots,\hat{w}_{d}\in\mathbb{S}^{d-1}$ computed from $Y^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}p$, and univariate density estimates $\hat{p}_{\hat{w}_{1}},\dots,\hat{p}_{\hat{w}_{d}}$ computed from $X^{(1)},\dots,X^{(N)}\overset{\mathrm{iid}}{\sim}p$, define the estimator

(6)  $\displaystyle\hat{p}(x)=\prod_{i=1}^{d}\hat{p}_{\hat{w}_{i}}(\hat{w}_{i}\cdot x).$

Note that the proposed estimate is just the oracle-informed estimate with $w_{i}$ replaced by $\hat{w}_{i}$ (since $f_{i}=\text{Law}(Z_{i})=p_{w_{i}}$). Further note that if we did not split the samples to conduct covariance estimation and density estimation separately, the law of $\hat{Z}_{i}$ conditional on $\hat{w}_{i}$ would no longer be $p_{\hat{w}_{i}}$.

We summarize the above discussion as Algorithm 1. Recall that we had assumed $p$ to be zero-mean in our set-up; in the general case, one requires an additional centering step, which we include in Algorithm 1.

Algorithm 1 Proposed estimator
1: Compute the empirical mean $\hat{\mu}$ and subtract it from the samples.
2: Split the (centered) samples into two sets: $\{X^{(1)},\dots,X^{(N)}\}$ and $\{Y^{(1)},\dots,Y^{(M)}\}$.
3: $\hat{\mathbf{W}}:=\mathrm{EstimateUnmixingMatrix}(Y^{(1)},\dots,Y^{(M)})$
4: Compute $\hat{Z}^{(j)}:=\hat{\mathbf{W}}X^{(j)}$ for $j=1,\dots,N$.
5: for $i\in\{1,\dots,d\}$ do
6:   $\hat{p}_{\hat{w}_{i}}:=\mathrm{EstimateUnivariateDensity}(\hat{Z}_{i}^{(1)},\dots,\hat{Z}_{i}^{(N)})$
7: end for
8: Compute $\hat{p}(x):=\prod_{i=1}^{d}\hat{p}_{\hat{w}_{i}}(\hat{w}_{i}\cdot x)$, where $\hat{w}_{i}$ is the $i$-th row of $\hat{\mathbf{W}}$.
9: Re-introduce the mean $\hat{\mu}$ to output $\hat{p}_{\hat{\mu}}(x)=\hat{p}(x-\hat{\mu})$.
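A minimal end-to-end sketch of Algorithm 1 in Python. The two sub-procedures are left pluggable: `estimate_unmixing` can be the PCA or ICA sketch above, and `estimate_univariate` stands for any univariate density estimator returning a density function (the paper uses the R package logcondens for the log-concave MLE; a Python-callable equivalent is assumed here purely for illustration):

```python
import numpy as np

def lc_ic_estimator(samples, estimate_unmixing, estimate_univariate, split=0.5):
    """Sketch of Algorithm 1; `samples` has shape (n, d). Returns a callable x -> p_hat_mu(x)."""
    mu_hat = samples.mean(axis=0)                     # Step 1: center
    centered = samples - mu_hat
    M = int(split * len(centered))                    # Step 2: split into Y (unmixing) and X (marginals)
    Y, X = centered[:M], centered[M:]
    W_hat = estimate_unmixing(Y)                      # Step 3: orthogonal unmixing matrix
    Z_hat = X @ W_hat.T                               # Step 4: rows are W_hat X^(j)
    marginals = [estimate_univariate(Z_hat[:, i])     # Steps 5-7: univariate estimates
                 for i in range(samples.shape[1])]
    def p_hat_mu(x):                                  # Steps 8-9: product density, mean re-introduced
        z = W_hat @ (np.asarray(x) - mu_hat)
        return float(np.prod([f(zi) for f, zi in zip(marginals, z)]))
    return p_hat_mu
```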

Notice that we have not specified the sample-splitting proportion between $M$ and $N$. This is a free parameter in our method, and can be optimized based on the statistical rates obtained in Section 3, or based on numerical tests as explored in Section 4.2. An equal split with $M=N$ seems to generally work well. Interestingly, we observe empirically that not splitting the samples, i.e. reusing the same samples for the two stages, can also work very well, and can even beat the splitting version.

Finally, we have kept the estimation procedures for the unmixing matrix and the marginals unspecified in Definition 2.4 and Algorithm 1 for the sake of generality. The framework of our analysis allows for flexibility in the choice of these procedures. However, we do focus mainly on PCA, ICA, and log-concave MLE in the following sections, and refer to Algorithm 1 as the log-concave independent components estimator (LC-IC) in that case. In particular, we use the R package logcondens [31] for univariate log-concave MLE in Line 6 of Algorithm 1. This package plays an important role in the computational speed of our method.

3 Analysis and sample complexities

This section is organized as follows. Section 3.1 presents basic error bounds and a generic sample complexity result, where the estimators for the unmixing matrix and the marginals are left unspecified. Section 3.2 and Section 3.3 then consider specific estimators to provide concrete sample complexities. Section 3.4 derives stronger bounds under additional assumptions, and collects the various sample complexities. Finally, Section 3.5 considers the case of misspecified log-concavity.

3.1 Error bounds and generic sample complexities

We consider the oracle-informed estimator first. The analysis involved is also applicable to the proposed estimator.

Lemma 3.1 (Error bound on the oracle estimator).

Suppose $p$ is a zero-mean probability density on $\mathbb{R}^{d}$, satisfying the orthogonal independent components property with independent directions $w_{1},\dots,w_{d}$. Let $\hat{p}_{\mathrm{oracle}}$ be the oracle-informed estimator as defined in (3). We have the bound

$h_{d}^{2}(\hat{p}_{\mathrm{oracle}},p)=1-\prod_{i=1}^{d}(1-h_{1}^{2}(\hat{p}_{w_{i}},p_{w_{i}}))\leq d\,\max_{i\in[d]}h_{1}^{2}(\hat{p}_{w_{i}},p_{w_{i}}).$

Proof 3.2.

A direct calculation yields

$\displaystyle 1-h_{d}^{2}(\hat{p}_{\text{oracle}},p)=\int_{\mathbb{R}^{d}}\prod_{i=1}^{d}\sqrt{\hat{p}_{w_{i}}(w_{i}\cdot x)\,p_{w_{i}}(w_{i}\cdot x)}\,dx=\prod_{i=1}^{d}\int_{\mathbb{R}}\sqrt{\hat{p}_{w_{i}}(z_{i})\,p_{w_{i}}(z_{i})}\,dz_{i}=\prod_{i=1}^{d}\left(1-h_{1}^{2}(\hat{p}_{w_{i}},p_{w_{i}})\right),$

where the second equality uses the orthogonal change of variables $z=\mathbf{W}x$. Now define $b=\max_{i\in[d]}h_{1}^{2}(\hat{p}_{w_{i}},p_{w_{i}})$ and note that $b\in[0,1]$. Bernoulli’s inequality gives that $\prod_{i=1}^{d}(1-h_{1}^{2}(\hat{p}_{w_{i}},p_{w_{i}}))\geq(1-b)^{d}\geq 1-d\,b$.

Lemma 3.1 says that the problem of bounding $h_{d}^{2}(\hat{p}_{\text{oracle}},p)$ boils down to controlling the univariate estimation errors $h_{1}^{2}(\hat{p}_{w_{i}},p_{w_{i}})$. For log-concave densities, Kim and Samworth [24] and Carpenter et al. [7] have already established results of this kind. We stay in a general setting for now, by deriving the sample complexity of the oracle estimator from the sample complexity of any univariate density estimation procedure that it invokes.

Definition 3.3 (Sample complexity of a univariate density estimator).

Let $f:\mathbb{R}\to[0,\infty)$ be a univariate probability density, and let $\hat{f}$ be an estimator of $f$ computed from iid samples. For $\epsilon>0$ and $0<\gamma<1$, define the sample complexity $N^{*}=N^{*}(\epsilon,\gamma;f)$ of $\hat{f}$ to be the minimal positive integer such that $N\geq N^{*}$ iid samples from $f$ are sufficient to give

$\mathbb{P}\left(h_{1}^{2}(\hat{f},f)\leq\epsilon\right)\geq 1-\gamma.$

Now, we can immediately express the sample complexity of the oracle estimator in terms of $N^{*}$ as a consequence of Lemma 3.1 and a union bound.

Corollary 3.4 (Sample complexity of the oracle-informed estimator).

Let $p$ be a zero-mean probability density on $\mathbb{R}^{d}$, satisfying the orthogonal independent components property. Let $\hat{p}_{\mathrm{oracle}}$ be the oracle-informed estimator computed from samples $X^{(1)},\dots,X^{(N)}\overset{\mathrm{iid}}{\sim}p$. Then, for any $\epsilon>0$ and $0<\gamma<1$, we have

$\displaystyle h_{d}^{2}(\hat{p}_{\mathrm{oracle}},p)\leq\epsilon$

with probability at least $1-\gamma$, whenever

$\displaystyle N\geq\sup_{i\in[d]}N^{*}(\epsilon/d,\gamma/d;p_{w_{i}}).$

Here, $N^{*}$ is the sample complexity of the univariate density estimator invoked by $\hat{p}_{\mathrm{oracle}}$.

Having concluded the analysis of the oracle-informed estimator, we can now turn to the general setting. Here, we only have an estimate of $\mathbf{W}$, and hence, the estimation error needs to be taken into account. This contributes additional terms to the risk of the proposed estimator, as the theorem below shows. The proof can be found in Section A.1.

Theorem 3.5 (Error bound on the proposed estimator).

Suppose $p$ is a zero-mean probability density on $\mathbb{R}^{d}$, satisfying the orthogonal independent components property with independent directions $w_{1},\dots,w_{d}$. Let $\hat{p}$ be the proposed estimator (Definition 2.4). We have the error bound

$\displaystyle h_{d}(\hat{p},p)\leq\sqrt{d}\,\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},p_{\hat{w}_{i}})+{\sqrt{d}\,h_{d}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right)+h_{d}\left(\hat{\mathbf{R}}^{T}_{\sharp}p\,,\,p\right)},$

where $\hat{\mathbf{R}}=\mathbf{W}^{T}\hat{\mathbf{W}}$.

Here, $\hat{\mathbf{R}}$ is an orthogonal matrix, because $\mathbf{W}$ and $\hat{\mathbf{W}}$ are. The term $\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},p_{\hat{w}_{i}})$ in the bound of Theorem 3.5 warrants an analysis similar to the oracle-informed estimator. Since $\hat{p}_{\hat{w}_{i}}$ is a univariate density estimate of $p_{\hat{w}_{i}}$ conditional on $\hat{w}_{i}$, bounds on univariate density estimation control this term, when applied in a conditional sense.

The second and third terms in the bound of Theorem 3.5 call for a notion of stability; $h_{d}(\hat{\mathbf{R}}_{\sharp}p,p)$ measures how far the ‘rotated’ density $\hat{\mathbf{R}}_{\sharp}p$ gets from the original density $p$. If the Hellinger distance is stable with respect to small rotations of the density, then $h_{d}(\hat{\mathbf{R}}_{\sharp}p,p)$ can be bounded effectively in terms of the operator norm $\|\mathbf{I}-\hat{\mathbf{R}}\|_{\mathrm{op}}$, where $\mathbf{I}$ is the $d\times d$ identity matrix. This operator norm is then well-controlled by the unmixing matrix estimation error as

(7)  $\displaystyle\|\mathbf{I}-\hat{\mathbf{R}}\|_{\mathrm{op}}=\|\mathbf{W}-\hat{\mathbf{W}}\|_{\mathrm{op}}\leq\|\mathbf{W}-\hat{\mathbf{W}}\|_{F}=\sqrt{\textstyle{\sum_{i=1}^{d}\|\hat{w}_{i}-w_{i}\|_{2}^{2}}}\leq\sqrt{d}\,\max_{i\in[d]}\|\hat{w}_{i}-w_{i}\|_{2}.$

The next lemma demonstrates such stability for zero-mean log-concave densities.

Lemma 3.6 (Hellinger stability).

Let $p$ be a zero-mean log-concave density on $\mathbb{R}^{d}$, and let $\mathbf{A}\in\mathbb{R}^{d\times d}$ be any invertible matrix. Then,

$h_{d}^{2}(p,\mathbf{A}_{\sharp}p)\leq K_{d}\,\mathrm{cond}(\mathbf{\Sigma})^{1/4}\,\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}^{1/2},$

where $\mathbf{\Sigma}$ is the covariance matrix of $p$, and the constant $K_{d}>0$ depends only on the dimension $d$.

The proof involves the log-concave projection of Dümbgen et al. [15], particularly making use of its local Hölder continuity [6]. It is delayed to Section A.2. Note that Lemma 3.6 is quite general, as it only uses log-concavity of $p$ (with no additional smoothness requirements). However, Lemma 3.6 inherits a constant $K_{d}$ that typically depends at least exponentially on $d$. Also, the exponent $1/2$ of $\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}$ contributes a $1/\epsilon^{4}$ dependence to our sample complexity bounds (see the first two rows of Table 2). Later in Section 3.4, we show that smoothness assumptions on $p$ can give stronger bounds.

The last bit to consider is the unmixing matrix estimation error. In the same spirit as the discussion on univariate density estimation above, we consider a generic estimator for now, with sample complexity defined as below.

Definition 3.7 (Sample complexity of an unmixing matrix estimator).

Let $p$ be a zero-mean probability density on $\mathbb{R}^{d}$ satisfying the orthogonal independent components property. Let $\hat{\mathbf{W}}$ be an estimator of an unmixing matrix $\mathbf{W}$ of $p$, computed from iid samples. For $\epsilon>0$ and $0<\gamma<1$, define the sample complexity $M^{*}=M^{*}(\epsilon,\gamma;p)$ of $\hat{\mathbf{W}}$ to be the minimal positive integer such that $M\geq M^{*}$ iid samples from $p$ are sufficient to give

$\mathbb{P}\left\{\textstyle{\left(d^{-1}\sum_{i=1}^{d}\|\hat{w}_{i}-w_{i}\|_{2}^{2}\right)^{1/2}}\leq\epsilon\right\}\geq 1-\gamma,$

for some set $\{w_{1},\dots,w_{d}\}$ of independent directions of $p$.

Now, we state the main result of this section, which expresses the sample complexity of the proposed estimator (Definition 2.4) in terms of the sample complexities in Definition 3.3 and Definition 3.7. The proof is presented in Section A.3.

Theorem 3.8 (Generic sample complexity of the proposed estimator).

Let $p$ be a zero-mean log-concave density on $\mathbb{R}^{d}$, satisfying the orthogonal independent components property. Let $\hat{p}$ be the proposed estimator computed from samples $X^{(1)},\dots,X^{(N)},Y^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}p$. Then, for any $\epsilon>0$ and $0<\gamma<1$, we have

$h_{d}^{2}(\hat{p},p)\leq\epsilon$

with probability at least $1-\gamma$ whenever

(8a)  $N\geq\sup_{s\in\mathbb{S}^{d-1}}N^{*}\left(\epsilon/4d,\,\gamma/2d;\,p_{s}\right),$

and

(8b)  $M\geq M^{*}\left((K_{d}^{\prime}\,\mathrm{cond}(\mathbf{\Sigma}))^{-1/2}\epsilon^{2},\,\gamma/2;\,p\right).$

Here, $N^{*}$ and $M^{*}$ are respectively the sample complexities of the univariate density estimator and the unmixing matrix estimator invoked by $\hat{p}$. As before, $\mathbf{\Sigma}$ is the covariance matrix of $p$ and $K_{d}^{\prime}>0$ is a constant that depends only on $d$.

To obtain concrete sample complexities, one needs to choose specific estimation procedures. In Section 3.2, we consider univariate log-concave MLE and its boosted variant, and provide corresponding bounds on $N^{*}$. In Section 3.3, we present standard bounds on $M^{*}$ for PCA- and ICA-based estimators of $\mathbf{W}$.

3.2 Univariate density estimation for the marginals

Recall that for each $i=1,\dots,d$, we need to estimate the densities $p_{\hat{w}_{i}}$ from the samples $\{\hat{Z}_{i}^{(j)}=\hat{w}_{i}\cdot X^{(j)}:j\in[N]\}$. When $p$ is log-concave, an efficient as well as convenient option is log-concave MLE [33], which is in fact minimax optimal.

Proposition 3.9 (Sample complexity of univariate log-concave MLE [24]).

Let $f:\mathbb{R}\to[0,\infty)$ be a univariate log-concave density, and denote by $\hat{f}$ the log-concave MLE computed from $N$ iid samples from $f$. Given $\epsilon>0$ and $0<\gamma<1$, we have

$h_{1}^{2}(\hat{f},f)\leq\epsilon$

with probability at least $1-\gamma$ whenever

$N\geq\frac{C_{1}}{\epsilon^{5/4}}\log^{5/4}\left(\frac{C_{2}}{\gamma}\right)\vee\frac{C_{3}}{\gamma},$

for absolute constants $C_{1},C_{2},C_{3}>0$.

In terms of Definition 3.3, we can restate this as $N^{*}(\epsilon,\gamma;f)\leq C_{1}\epsilon^{-5/4}\log^{5/4}\left(C_{2}/\gamma\right)\vee C_{3}/\gamma$. Notice that the right-hand side does not depend on $f$, and as a result, the same upper bound also holds for the term $\sup_{s\in\mathbb{S}^{d-1}}N^{*}(\epsilon,\gamma;p_{s})$ from Theorem 3.8.

The statement above is a “high-probability” variant of Theorem 5 of Kim and Samworth [24], and we provide a proof in Section A.4. Note that while the $\epsilon^{-5/4}$ dependence is optimal, the $1/\gamma$ dependence in the sample complexity could be improved. A standard way to do this is via boosting. The following definition uses the boosting technique from Algorithm A of Lovász and Vempala [28].

Definition 3.10 (Boosted univariate log-concave MLE).

Let $n,q\in\mathbb{N}$, and suppose we have access to $nq$ iid samples from a univariate log-concave density $f:\mathbb{R}\to[0,\infty)$. Additionally, fix $\epsilon>0$. The boosted estimator $\hat{\hat{f}}_{\epsilon}$ at error threshold $\epsilon$ is defined via the following procedure:

  1. Split the samples into $q$ batches of size $n$, and denote by $\hat{f}^{[b]}$ ($b=1,2,\dots,q$) the log-concave MLE computed from the samples in batch $b$.

  2. For each batch $b$, count how many other batches $b^{\prime}$ satisfy $h_{1}(\hat{f}^{[b]},\hat{f}^{[b^{\prime}]})\leq 2\sqrt{\epsilon}/3$.

  3. If there exists a batch $b$ for which this number is larger than $q/2$, return $\hat{\hat{f}}_{\epsilon}:=\hat{f}^{[b]}$. Otherwise, return $\hat{\hat{f}}_{\epsilon}:=\hat{f}^{[1]}$.
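A minimal sketch of this selection rule, assuming a hypothetical callable `univariate_logconcave_mle` (e.g. a wrapper around a log-concave MLE routine) that maps samples to a density function, together with a squared-Hellinger helper such as the quadrature sketch in Section 2:

```python
import numpy as np

def boosted_logconcave_mle(samples, q, eps, univariate_logconcave_mle, squared_hellinger_1d):
    """Boosted univariate log-concave MLE at error threshold eps (Definition 3.10)."""
    batches = np.array_split(samples, q)                 # Step 1: split into q batches
    fits = [univariate_logconcave_mle(b) for b in batches]
    for b in range(q):                                   # Steps 2-3: majority selection
        n_close = sum(
            np.sqrt(squared_hellinger_1d(fits[b], fits[other])) <= 2.0 * np.sqrt(eps) / 3.0
            for other in range(q) if other != b
        )
        if n_close > q / 2:
            return fits[b]
    return fits[0]                                       # fallback: MLE from the first batch
```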

As the next proposition shows, the sample complexity of the boosted log-concave MLE depends only logarithmically on $1/\gamma$, which is much milder. See Section A.5 for the proof.

Proposition 3.11 (Sample complexity of the boosted univariate log-concave MLE).

Let $f:\mathbb{R}\to[0,\infty)$ be a univariate log-concave density, and fix $\epsilon>0$. Denote by $\hat{\hat{f}}_{\epsilon}$ the boosted log-concave MLE at error threshold $\epsilon$, computed from $q$ independent batches of $n$ iid samples from $f$ each. Given $0<\gamma<1$, we have

$h_{1}^{2}(\hat{\hat{f}}_{\epsilon},f)\leq\epsilon$

with probability at least $1-\gamma$ provided

$n\geq\frac{C_{1}}{\epsilon^{5/4}}\quad\text{and}\quad q\geq C_{2}\log\left(\frac{1}{\gamma}\right),$

for absolute constants $C_{1},C_{2}\geq 0$.

In particular, by multiplying the bounds for $n$ and $q$, we get that the sample complexity $N^{*}(\epsilon,\gamma;f)\leq C_{1}C_{2}\epsilon^{-5/4}\log(1/\gamma)$. As before, the right-hand side does not depend on $f$.

3.3 Unmixing matrix estimation

Here, we discuss standard techniques for estimating $\mathbf{W}$, and show closeness of $\hat{w}_{i}$ and $w_{i}$. Recall that the methods used here for estimating $\mathbf{W}$ – PCA or ICA – require different assumptions on the moments of $X$ (or $Z=\mathbf{W}X$) in order to yield bounds on the estimation errors.

Consider regular PCA first. The PCA estimate of $\mathbf{W}$ was obtained by computing the eigenvectors of the sample covariance matrix $\hat{\mathbf{\Sigma}}$. Bounding the sample complexity of this estimation procedure is standard, and can be broken down into two simple steps: (i) bounding the distance between the empirical covariance matrix $\hat{\mathbf{\Sigma}}$ and the true covariance matrix $\mathbf{\Sigma}$, and (ii) bounding the distances between their corresponding eigenvectors. The result is summarized in the following proposition, the proof of which is delayed to Section A.6.

Proposition 3.12 (Sample complexity of estimating $\mathbf{W}$ by PCA).

Let $p$ be a zero-mean, log-concave probability density on $\mathbb{R}^{d}$. Let $Y^{(1)},Y^{(2)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}p$ be samples, and suppose Moment assumption M1 is satisfied with eigenvalue separation $\delta>0$. Let $\epsilon>0$ and $0<\gamma\leq 1/e$, and define $\tilde{M}(d,\gamma):=d\log^{4}(1/\gamma)\log^{2}\left({C_{0}}\log^{2}(1/\gamma)\right)$. If

(9)  $\displaystyle M\geq{C_{1}}\tilde{M}(d,\gamma)\vee\left\{\frac{{C_{2}}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{2}d}{\delta^{2}\epsilon^{2}}\log^{4}(1/\gamma)\log^{2}\left(\frac{{C_{3}}\,\|\mathbf{\Sigma}\|_{\mathrm{op}}^{2}}{\delta^{2}\epsilon^{2}}\log^{2}(1/\gamma)\right)\right\},$

where $C_{0},C_{1},C_{2},C_{3}>0$ are absolute constants, then with probability at least $1-\gamma$, PCA recovers vectors $\{\hat{w}_{1},\dots,\hat{w}_{d}\}$ such that

$\|\hat{w}_{i}-w_{i}\|_{2}\leq\epsilon,$

up to permutations and sign-flips.

The right-hand side of (9) is an upper bound on the sample complexity $M^{*}(\epsilon,\gamma;p)$ of the PCA-based estimator, for $0<\gamma\leq 1/e$. The term $\tilde{M}(d,\gamma)$ is needed in order to allow the result above to hold for all positive $\epsilon$, instead of just small values of $\epsilon$.

As discussed earlier, assumptions on strict eigenvalue separation can be stringent, and possibly unrealistic in practical settings. In such cases, $\mathbf{W}$ should be estimated by ICA instead, for which we have a result similar to Proposition 3.12.

Recall the definitions of $S$ and $\mathbf{A}$ from Section 2, and express $X=\mathbf{A}S$. The algorithm of Auddy and Yuan [3] then estimates the columns of $\mathbf{A}$, which coincide with $w_{1},\dots,w_{d}\in\mathbb{S}^{d-1}$ up to scalings and signs.

The proposition below is a direct consequence of Theorem 3.4 of Auddy and Yuan [3], and provides an upper bound on $M^{*}(\epsilon,\gamma;p)$ for $3d^{-3}\leq\gamma<1$. The proof is delayed to Section A.7.

Proposition 3.13 (Sample complexity of estimating $\mathbf{W}$ by ICA [3]).

Let $p$ be a zero-mean probability density on $\mathbb{R}^{d}$ satisfying the orthogonal independent components property. Suppose, additionally, that Moment assumption M2 is satisfied with parameters $\mu_{4},\mu_{9}>0$ and $\kappa\geq 1$. Given $\epsilon>0$ and $3d^{-3}\leq\gamma<1$, and samples $Y^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}p$, if

(10)  $M\geq\frac{K^{\prime}d}{\epsilon^{2}}\log(3/\gamma)\,\vee\,K^{\prime\prime}d^{2}\log^{2}(3/\gamma)\,\vee\,(3/\gamma)^{8/9}$

for constants $K^{\prime},K^{\prime\prime}>0$ depending on $\mu_{4},\mu_{9},\kappa$, then with probability at least $1-\gamma$, ICA recovers vectors $\{\hat{w}_{1},\dots,\hat{w}_{d}\}$ such that

$\left(\frac{1}{d}\sum_{i=1}^{d}\|\hat{w}_{i}-w_{i}\|_{2}^{2}\right)^{1/2}\leq\epsilon$

for independent directions $w_{1},\dots,w_{d}$ of $p$.

The result above relies on non-zero excess kurtosis of the components of $Z$, which may be interpreted as a measure of non-Gaussianity. However, there exist distributions which differ from Gaussians only in higher cumulants. Techniques such as Fourier PCA [17] allow for estimating $\mathbf{W}$ in such cases. As discussed earlier, our theory is agnostic to specific choices of ICA algorithms.

3.4 Stronger bounds under additional assumptions

While the statement of Theorem 3.8 is quite general, the inequality in (8b) contains a constant $K^{\prime}_{d}>0$ that typically depends at least exponentially on $d$. This is owed to the stability result Lemma 3.6, which uses only the log-concavity of $p$. Stronger stability inequalities are possible when $p$ is smooth.

In what follows, we consider two possible assumptions on the smoothness of $p$. The choice of assumption then allows us to quantify stability first in either Kullback-Leibler (KL) divergence or total variation distance, before translating this to stability in Hellinger distance. Let $p$ be a probability density on $\mathbb{R}^{d}$ with mean zero.

Smoothness assumption S1

The density $p$ has finite second moments, is continuously differentiable, and satisfies $\int_{\mathbb{R}^{d}}\|x\|_{2}\|\nabla p(x)\|_{2}\,dx<\infty$. The function $\varphi=-\log p$ is finite-valued everywhere, and belongs to the Hölder space $\mathcal{C}^{1,\alpha}(\mathbb{R}^{d})$ for some $0<\alpha\leq 1$, i.e.

(11)  $\displaystyle[\nabla\varphi]_{\alpha}:=\sup_{x\neq y}\frac{\|\nabla\varphi(y)-\nabla\varphi(x)\|_{2}}{\|y-x\|_{2}^{\alpha}}<\infty.$

Note that $-\nabla\varphi=\nabla p/p$ is commonly known as the score function, whose regularity also plays a role in sampling theory. We may also interpret (11) as a modified Hölder condition on $p$. This assumption yields good stability in KL divergence, but it can be stringent since it requires, in particular, that $p$ is strictly positive everywhere. The following assumption relaxes this.

Smoothness assumption S2

The density $p$ has finite second moments, and has a (weak) gradient $\nabla p$, with

(12)  $\displaystyle\|\nabla p\|_{L^{1}}=\int_{\mathbb{R}^{d}}\|\nabla p(x)\|_{2}\,dx<\infty.$

This assumption yields a stability estimate in total variation distance (or equivalently, $L^{1}$ distance). Note that $p$ is not required to be differentiable in the strong sense, and hence can have “corners” or “kinks”; jumps, however, are still not allowed. An interesting example is the exponential-type density $p(x)=ae^{-\|x\|_{1}}$. Also, to compare with assumption S1, note that if $p(x)=e^{-\varphi(x)}$ with $[\nabla\varphi]_{\alpha}<\infty$, then we can write $\nabla p(x)=-\nabla\varphi(x)p(x)$, which can be translated into a bound on $\|\nabla p\|_{L^{1}}$.

Assumptions S1 and S2 can be interpreted better by considering a multivariate Gaussian distribution with density

$\displaystyle p(x)=a\exp\left(-\frac{1}{2}x\cdot\mathbf{\Sigma}^{-1}x\right),$

where $a=(2\pi)^{-d/2}(\det\mathbf{\Sigma})^{-1/2}$ is the normalizing constant. Here, $\varphi(x)=\frac{1}{2}x\cdot\mathbf{\Sigma}^{-1}x-\log a$ is clearly $\mathcal{C}^{1,1}(\mathbb{R}^{d})$, with $[\nabla\varphi]_{1}=1/\lambda_{\min}(\mathbf{\Sigma})$. On the other hand, $\|\nabla p\|_{L^{1}}\leq\sqrt{d/\lambda_{\min}(\mathbf{\Sigma})}<\infty$, provided $\lambda_{\min}(\mathbf{\Sigma})>0$. We start with stability under Smoothness assumption S1.
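Before doing so, here is a short check of where the $\|\nabla p\|_{L^{1}}$ bound in the Gaussian example comes from: since $\nabla p(x)=-\mathbf{\Sigma}^{-1}x\,p(x)$, Jensen's inequality (with the expectation taken over $X\sim p$) gives

\[
\|\nabla p\|_{L^{1}}=\mathbb{E}\|\mathbf{\Sigma}^{-1}X\|_{2}\leq\left(\mathbb{E}\,X^{T}\mathbf{\Sigma}^{-2}X\right)^{1/2}=\sqrt{\mathrm{tr}(\mathbf{\Sigma}^{-1})}\leq\sqrt{d/\lambda_{\min}(\mathbf{\Sigma})}.
\]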

Lemma 3.14 (Stability in KL).

Let $p$ be a zero-mean probability density on $\mathbb{R}^{d}$, satisfying Smoothness assumption S1. Let $\mathbf{R}\in\mathbb{R}^{d\times d}$ be any orthogonal matrix. Then,

$\displaystyle\mathrm{KL}(p\,||\,{\mathbf{R}_{\sharp}^{T}p})\leq[\nabla\varphi]_{\alpha}\,d^{(1+\alpha)/2}\,\|\mathbf{\Sigma}\|_{\mathrm{op}}^{(1+\alpha)/2}\,\|\mathbf{I}-\mathbf{R}\|^{1+\alpha}_{\mathrm{op}}.$

The proof involves a direct analysis of the KL integral, using the bound $\|\nabla\varphi(y)-\nabla\varphi(x)\|_{2}\leq[\nabla\varphi]_{\alpha}\|y-x\|_{2}^{\alpha}$. It is delayed to Section A.8. This result translates to Hellinger distance via the upper bound [16]

$\displaystyle h_{d}^{2}(p,{\mathbf{R}_{\sharp}^{T}p})\leq\frac{1}{2}\mathrm{KL}(p\,||\,{\mathbf{R}_{\sharp}^{T}p}).$

We remark that the exponent $1+\alpha$ of $\|\mathbf{I}-\mathbf{R}\|_{\mathrm{op}}$ in Lemma 3.14 is significantly higher than the corresponding exponent $1/2$ in Lemma 3.6, and this leads to better convergence rates as $\mathbf{R}\to\mathbf{I}$. In particular, this contributes an $\epsilon^{-2/(1+\alpha)}$ dependence to our sample complexity bounds, as compared to $\epsilon^{-4}$ (see the rows of Table 2 corresponding to S1). We note that using higher derivatives of $\varphi$ (provided $\varphi\in\mathcal{C}^{s,\alpha}(\mathbb{R}^{d})$ for $s>1$) in the proof could potentially lead to even faster convergence. Next, we move on to stability under Smoothness assumption S2.

Lemma 3.15 (Stability in TV).

Let $p$ be a zero-mean probability density on $\mathbb{R}^{d}$, satisfying Smoothness assumption S2. Let $\mathbf{A}\in\mathbb{R}^{d\times d}$ be invertible with $\|\mathbf{A}^{-1}\|_{\mathrm{op}}\leq B$. Then,

$\mathrm{TV}(p,\mathbf{A}_{\sharp}p)\equiv\frac{1}{2}\|\mathbf{A}_{\sharp}p-p\|_{L^{1}}\leq\frac{1}{2}\left[(1+B)\|\nabla p\|_{L^{1}}\sqrt{d}+d\right]\|\mathbf{\Sigma}\|_{\mathrm{op}}^{1/4}\,\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}^{1/2}.$

The proof is based on bounding the Wasserstein-1 distance between $p$ and $\mathbf{A}_{\sharp}p$, accompanied by an argument inspired by Chae and Walker [8] (see Section A.9). Again, the bound can be translated to Hellinger distance via the inequality [16]

$\displaystyle h_{d}^{2}(p,\mathbf{A}_{\sharp}p)\leq\mathrm{TV}(p,\mathbf{A}_{\sharp}p).$

Finally, in the case of orthogonal matrices, the parameter $B=1$.

Notice that the exponent of $\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}$ in Lemma 3.15 is again $1/2$ – the same as in Lemma 3.6. This contributes an $\epsilon^{-4}$ dependence to our sample complexity bounds (see the rows of Table 2 corresponding to S2). A better exponent could potentially be obtained by requiring control on higher-order Sobolev norms and following estimates from KDE theory [21].

We can now state the analogue of Theorem 3.8 under these stronger stability results. The proof is provided in Section A.10.

Corollary 3.16 (Sample complexities under smoothness assumptions).

Let $p$ be a zero-mean probability density on $\mathbb{R}^{d}$, satisfying the orthogonal independent components property. Let $\hat{p}$ be the proposed estimator computed from samples $X^{(1)},\dots,X^{(N)},Y^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}p$. Then, for any $\epsilon>0$ and $0<\gamma<1$, we have

$h_{d}^{2}(\hat{p},p)\leq\epsilon$

with probability at least $1-\gamma$ whenever

(13a)  $N\geq\sup_{s\in\mathbb{S}^{d-1}}N^{*}\left(\epsilon/4d,\,\gamma/2d;\,p_{s}\right),$

and

(13b)  $M\geq\begin{cases}M^{*}\left(\frac{1}{4}d^{-\frac{2+\alpha}{1+\alpha}}\,[\nabla\varphi]_{\alpha}^{-\frac{1}{1+\alpha}}\,\|\mathbf{\Sigma}\|_{\mathrm{op}}^{-1/2}\,\epsilon^{\frac{1}{1+\alpha}}\,,\,\gamma/2;\,p\right)&\text{if S1 holds}\\ M^{*}\left(4^{-4}\left[\|\nabla p\|_{L^{1}}d^{7/4}+d^{9/4}\right]^{-2}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{-1/2}\epsilon^{2}\,,\,\gamma/2;\,p\right)&\text{if S2 holds}.\end{cases}$

Here, $N^{*}$ and $M^{*}$ are respectively the sample complexities of the univariate density estimator and the unmixing matrix estimator invoked by $\hat{p}$.

As stated, Corollary 3.16 does not require log-concavity of $p$. If $p$ is smooth enough but not log-concave, the univariate density estimation stage of the proposed estimator can be done via kernel density estimation (KDE), for example, which would have a different sample complexity $N^{*}(\epsilon,\gamma;f)$.

Table 1: Number of samples $N$ that suffice (along with a sufficiently large $M$) for the proposed estimator to achieve $h_{d}^{2}(\hat{p},p)\leq\epsilon$ with probability at least $1-\gamma$. Here, $p$ is assumed to be a zero-mean log-concave density on $\mathbb{R}^{d}$, satisfying the orthogonal independent components property. The bottom row additionally assumes that all marginals of $p$ are log-$k$-affine.

Setting: Number of samples $N$
Log-concave MLE: $\frac{C_{1}d^{5/4}}{\epsilon^{5/4}}\log^{5/4}\left(\frac{C_{2}d}{\gamma}\right)\vee\frac{C_{3}d}{\gamma}$
Boosted log-concave MLE: $\frac{Cd^{5/4}}{\epsilon^{5/4}}\log\left(\frac{2d}{\gamma}\right)$
Adapted log-concave MLE: $\frac{Ckd^{2}}{\epsilon\gamma}\quad(\text{up to log factors})$

Table 2: Number of samples $M$ that suffice (along with a sufficiently large $N$) for the proposed estimator to achieve $h_{d}^{2}(\hat{p},p)\leq\epsilon$ with probability at least $1-\gamma$. Here, $p$ is assumed to be a zero-mean log-concave density on $\mathbb{R}^{d}$, satisfying the orthogonal independent components property. Invoking PCA or ICA additionally assumes that Moment assumption M1 or Moment assumption M2 is (respectively) satisfied. The term $\tilde{M}(d,\gamma)$ is defined in Proposition 3.12. The constants $K^{\prime},K^{\prime\prime}$ depend on parameters from Moment assumption M2.

Setting: Number of samples $M$
PCA: $C_{1}\tilde{M}(d,\gamma)\vee\left\{\frac{K_{d}^{\prime}\mathrm{cond}(\mathbf{\Sigma})\|\mathbf{\Sigma}\|_{\mathrm{op}}^{2}\log^{4}(2/\gamma)}{\delta^{2}\epsilon^{4}}\log^{2}\left(\frac{K_{d}^{\prime\prime}\mathrm{cond}(\mathbf{\Sigma})\|\mathbf{\Sigma}\|_{\mathrm{op}}^{2}\log^{2}(2/\gamma)}{\delta^{2}\epsilon^{4}}\right)\right\}$
ICA: $\frac{K_{d}K^{\prime}\mathrm{cond}(\mathbf{\Sigma})}{\epsilon^{4}}\log(6/\gamma)\,\vee\,K^{\prime\prime}d^{2}\log^{2}(6/\gamma)\,\vee\,(6/\gamma)^{8/9}$
PCA, S1: $C_{1}\tilde{M}(d,\gamma)\vee\left\{\frac{C_{2}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{3}[\nabla\varphi]_{\alpha}^{\frac{2}{1+\alpha}}d^{\frac{5+3\alpha}{1+\alpha}}\log^{4}(2/\gamma)}{\delta^{2}\epsilon^{\frac{2}{1+\alpha}}}\log^{2}\left(\frac{C_{3}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{3}[\nabla\varphi]_{\alpha}^{\frac{2}{1+\alpha}}d^{\frac{4+2\alpha}{1+\alpha}}\log^{2}(2/\gamma)}{\delta^{2}\epsilon^{\frac{2}{1+\alpha}}}\right)\right\}$
ICA, S1: $\frac{K^{\prime}\|\mathbf{\Sigma}\|_{\mathrm{op}}[\nabla\varphi]_{\alpha}^{\frac{2}{1+\alpha}}\,d^{\frac{5+3\alpha}{1+\alpha}}}{\epsilon^{\frac{2}{1+\alpha}}}\log(6/\gamma)\,\vee\,K^{\prime\prime}d^{2}\log^{2}(6/\gamma)\,\vee\,(6/\gamma)^{8/9}$
PCA, S2: $C_{1}\tilde{M}(d,\gamma)\vee\left\{\frac{C_{2}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{3}(\|\nabla p\|_{L^{1}}^{4}d^{8}+d^{10})\log^{4}(2/\gamma)}{\delta^{2}\epsilon^{4}}\log^{2}\left(\frac{C_{3}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{3}(\|\nabla p\|_{L^{1}}^{4}d^{7}+d^{9})\log^{2}(2/\gamma)}{\delta^{2}\epsilon^{4}}\right)\right\}$
ICA, S2: $\frac{K^{\prime}\|\mathbf{\Sigma}\|_{\mathrm{op}}(\|\nabla p\|_{L^{1}}^{4}d^{8}+d^{10})}{\epsilon^{4}}\log(6/\gamma)\,\vee\,K^{\prime\prime}d^{2}\log^{2}(6/\gamma)\,\vee\,(6/\gamma)^{8/9}$

A different way in which we get stronger bounds under additional assumptions is adaptation. A univariate density $f:\mathbb{R}\to[0,\infty)$ is said to be log-$k$-affine, for some $k\in\mathbb{N}$, if $\log f$ is piecewise affine with at most $k$ pieces on the support of $f$. Kim et al. [23] established parametric sample complexity of the log-concave MLE when the true density is log-$k$-affine with $k$ fixed. The following is an immediate consequence of their Theorem 3.

Proposition 3.17 (Adaptation of log-concave MLE [23]).

Let f:[0,)f:\mathbb{R}\to[0,\infty) be a log-concave density that is log-kk-affine for some kk\in\mathbb{N}, and denote by f^\hat{f} the log-concave MLE computed from NN iid samples from ff. Given ϵ>0\epsilon>0 and 0<γ<10<\gamma<1, we have

h12(f^,f)ϵh_{1}^{2}(\hat{f},f)\leq\epsilon

with probability at least 1γ1-\gamma whenever

Nlog5/4(eN/k)Ckϵγ\frac{N}{\log^{5/4}(eN/k)}\geq\frac{Ck}{\epsilon\gamma}

for an absolute constant C>0C>0.

As a consequence, we have that N(ϵ,γ;f)N^{*}(\epsilon,\gamma;f) is at most Ck/ϵγCk/\epsilon\gamma up to log factors when ff is log-kk-affine. The 1/γ1/\gamma-dependence is due to Markov’s inequality, and may be improved to log(1/γ)\log(1/\gamma) via boosting as before (see Definition 3.10).

We conclude this discussion by collecting some concrete bounds on the number of samples NN and MM that are sufficient for the proposed estimator to achieve hd2(p^,p)ϵh_{d}^{2}(\hat{p},p)\leq\epsilon with probability at least 1γ1-\gamma. These depend on the particular procedures invoked for marginal estimation and unmixing matrix estimation, as well as on the smoothness assumptions. The bounds on NN are presented in Table 1, and those for MM are tabulated in Table 2. The results in both tables take pp to be a zero-mean, log-concave density on d\mathbb{R}^{d}, satisfying the orthogonal independent components property.

3.5 Bounds under misspecification of log-concavity

Let 𝒫d\mathcal{P}_{d} be the set of probability distributions PP on d\mathbb{R}^{d} satisfying 𝔼XPX2<\mathbb{E}_{X\sim P}\|X\|_{2}<\infty and XP(XH)<1\mathbb{P}_{X\sim P}(X\in H)<1 for every hyperplane HdH\subseteq\mathbb{R}^{d}. Exactly as in the setting of densities, we say that a zero-mean distribution P𝒫dP\in\mathcal{P}_{d} satisfies the orthogonal independent components property if there exists an orthogonal matrix 𝐖d×d\mathbf{W}\in\mathbb{R}^{d\times d} such that the components of Z=𝐖XZ=\mathbf{W}X are independent when XPX\sim P.

To study the case of misspecification, we need the log-concave projection of Dümbgen et al. [15], which is a map from 𝒫d\mathcal{P}_{d} to the set of log-concave densities d\mathcal{F}_{d} on d\mathbb{R}^{d}, defined by

ψd(P):=argmaxfd𝔼XPlogf(X).\psi_{d}^{*}(P):=\mathop{\arg\max}_{f\in\mathcal{F}_{d}}\,\,\mathbb{E}_{X\sim P}\log f(X).

If PP is a zero-mean distribution in 𝒫d\mathcal{P}_{d} satisfying the orthogonal independent components property, we show below that our proposed estimator (invoking univariate log-concave MLE), computed from iid samples from PP, gets close to ψd(P)\psi_{d}^{*}(P).

Theorem 3.18 (Error bound with misspecification of log-concavity).

Let PP be a zero-mean distribution in 𝒫d\mathcal{P}_{d}, satisfying the orthogonal independent components property with independent directions w1,,wdw_{1},\dots,w_{d}. Let p^\hat{p} be the proposed estimator computed from samples X(1),,X(N),Y(1),,Y(M)iidPX^{(1)},\dots,X^{(N)},Y^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}P. We have the error bound

hd(p^,ψd(P))d\displaystyle h_{d}(\hat{p},\psi_{d}^{*}(P))\leq\sqrt{d} (maxi[d]h1(p^w^i,ψ1(Pw^i))+maxi[d]h1(ψ1(Pw^i),ψ1(Pwi)))\displaystyle\left(\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},\psi_{1}^{*}(P_{\hat{w}_{i}}))+\max_{i\in[d]}h_{1}(\psi_{1}^{*}(P_{\hat{w}_{i}}),\psi_{1}^{*}(P_{w_{i}}))\right)
+hd(ψd(𝐑^P),ψd(P)),\displaystyle+h_{d}(\psi_{d}^{*}(\hat{\mathbf{R}}_{\sharp}P),\psi_{d}^{*}(P)),

where 𝐑^=𝐖^T𝐖\hat{\mathbf{R}}=\hat{\mathbf{W}}^{T}\mathbf{W}.

See Section A.11 for the proof.

We can now obtain sample complexities; to get concrete bounds, we focus on the ICA case and recall Moment assumption M2 with parameters μ4,μ9\mu_{4},\mu_{9} and κ\kappa. Given P𝒫dP\in\mathcal{P}_{d} and XPX\sim P, define the moment quantities

ηP:=infsSd1𝔼|s(X𝔼X)|,andLq:=(𝔼X2q)1/q\eta_{P}:=\inf_{s\in S^{d-1}}\,\mathbb{E}\left|s\cdot(X-\mathbb{E}X)\right|,\quad\text{and}\quad L_{q}:=\left(\mathbb{E}\,\|X\|_{2}^{q}\right)^{1/q}

for q1q\geq 1.

Theorem 3.19 (Sample complexity with misspecification of log-concavity).

Let PP be a zero-mean distribution in 𝒫d\mathcal{P}_{d} satisfying the orthogonal independent components property. Suppose that Moment assumption M2 is satisfied, ηP>0\eta_{P}>0, and Lq<L_{q}<\infty for some q>1q>1. Let p^\hat{p} be the proposed estimator computed from samples X(1),,X(N),Y(1),,Y(M)iidPX^{(1)},\dots,X^{(N)},Y^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}P, invoking univariate log-concave MLE and ICA. Then, for any ϵ>0\epsilon>0 and 6d3<γ<16d^{-3}<\gamma<1, we have

hd2(p^,ψd(P))ϵh_{d}^{2}(\hat{p},\psi_{d}^{*}(P))\leq\epsilon

with probability at least 1γ1-\gamma whenever

Nlog3qq1N(KqLqηP8d2ϵγ)2qq1,\frac{N}{\log^{\frac{3q}{q-1}}N}\geq\left(K_{q}\sqrt{\frac{L_{q}}{\eta_{P}}}\,\frac{8d^{2}}{\epsilon\gamma}\right)^{\frac{2q}{q-1}},

and

MKCd′′L12ηP2ϵ4log(6/γ)K′′d2log2(6/γ)(6/γ)8/9.M\geq\frac{K^{\prime}C_{d}^{\prime\prime}L_{1}^{2}}{\eta_{P}^{2}\,\epsilon^{4}}\log(6/\gamma)\,\vee\,K^{\prime\prime}d^{2}\log^{2}(6/\gamma)\,\vee\,(6/\gamma)^{8/9}.

Here, the constants K,K′′>0K^{\prime},K^{\prime\prime}>0 depend on μ4,μ9,κ\mu_{4},\mu_{9},\kappa, the constant Kq>0K_{q}>0 depends only on qq, and Cd′′>0C_{d}^{\prime\prime}>0 depends only on dd.

The proof is provided in Section A.12. The inequality for NN depends polynomially on 1/γ1/\gamma, which is due to the use of Markov’s inequality in the proof. If the distribution PP is sub-exponential, better tail bounds may be used, as discussed by Barber and Samworth [6].

Figure 1: Comparing density estimation performance of our proposed log-concave independent components estimator (LC-IC) with the usual log-concave maximum likelihood estimator (LC-MLE), on simulated bivariate normal data. The ground truth and estimated densities are visualized in the top row, and the corresponding level curves are plotted in the bottom row.

4 Computational experiments

In this section, we demonstrate the performance of the log-concave independent components estimator (LC-IC, Algorithm 1) on simulated as well as real data. Section 4.1 compares the estimation errors and runtimes of LC-IC with those of usual multivariate log-concave MLE and the LogConICA algorithm of Samworth and Yuan [32], on simulated data. Section 4.2 assesses the effect of the sample splitting proportions between the two tasks: estimation of 𝐖\mathbf{W} (MM samples) and estimation of the marginals (NN samples). We also test the case of no sample splitting. Finally, Section 4.3 combines LC-IC with the Expectation-Maximization (EM) algorithm [13] to estimate mixtures of densities from real data – the Wisconsin breast cancer dataset [38].
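
As a point of reference for the experiments that follow, the snippet below is a minimal Python sketch of the LC-IC pipeline: split the samples, estimate 𝐖 by PCA on one part, fit the marginals of the projected data on the other part, and return the product-form density. It is illustrative only – the experiments in this section use R packages (logcondens, LogConcDEAD, ica), and the Gaussian marginal fit below is merely a stand-in for the univariate log-concave MLE; the helper fit_lcic and its defaults are our own naming, not part of Algorithm 1.

import numpy as np

def fit_lcic(samples, r=0.5, fit_marginal=None, rng=None):
    """Sketch of the LC-IC pipeline: estimate the unmixing matrix W by PCA on
    M = r*n samples, fit the d marginals of the projected data on the remaining
    N = n - M samples, and return the product-form density estimate."""
    rng = np.random.default_rng(rng)
    n, d = samples.shape
    idx = rng.permutation(n)
    M = int(r * n)
    Y, X = samples[idx[:M]], samples[idx[M:]]      # W-estimation / marginal-estimation parts

    # PCA step: eigenvectors of the sample covariance give the rows of W-hat.
    _, eigvecs = np.linalg.eigh(np.cov(Y, rowvar=False))
    W_hat = eigvecs.T

    if fit_marginal is None:
        # Placeholder marginal fit (Gaussian); the paper uses the univariate log-concave MLE.
        def fit_marginal(z):
            mu, sigma = z.mean(), z.std(ddof=1)
            return lambda t: np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    marginals = [fit_marginal(X @ w) for w in W_hat]

    def density(x):
        # Product-form estimate: prod_i p-hat_{w-hat_i}(w-hat_i . x).
        z = np.atleast_2d(x) @ W_hat.T
        return np.prod([f(z[:, i]) for i, f in enumerate(marginals)], axis=0)

    return density, W_hat

# Example: bivariate normal data with a random orthogonal mixing matrix.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))
data = rng.multivariate_normal([0, 0], np.diag([6.0, 3.0]), size=1000) @ Q
p_hat, W_hat = fit_lcic(data, r=0.5, rng=0)
print(p_hat(np.zeros(2)))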

4.1 Comparison of performance on simulated data

Table 3: Comparing average algorithm runtimes in seconds of our proposed log-concave independent components estimator (LC-IC) with the usual log-concave maximum likelihood estimator (LC-MLE), on simulated multivariate normal data. Here, nn is the number of samples, and dd is the dimension.
n=100n=100 n=500n=500 n=1000n=1000 n=2000n=2000 n=3000n=3000
d=2d=2 LC-MLE 0.56 11.93 60.79 469.65 1703.08
LC-IC 0.02 0.04 0.10 0.09 0.12
d=3d=3 LC-MLE 1.82 42.00 141.31 874.58 3183.45
LC-IC 0.02 0.06 0.08 0.13 0.17
d=4d=4 LC-MLE 8.70 155.03 735.06 4361.81 12933.45
LC-IC 0.03 0.07 0.11 0.16 0.24
Figure 2: Comparing estimation errors in squared Hellinger distance of our proposed log-concave independent components estimator (LC-IC) with the usual log-concave maximum likelihood estimator (LC-MLE), on simulated multivariate normal data. The dots correspond to independent experiments, and the lines represent averages. Here, nn is the number of samples, and dd is the dimension.

We first demonstrate estimation performance on bivariate normal data, and provide accompanying visualizations of the estimated density. Recalling that X=𝐖TZX=\mathbf{W}^{T}Z, we generate Z𝒩(0,𝚺Z)Z\sim\mathcal{N}(0,\mathbf{\Sigma}_{Z}) with 𝚺Z=Diag(6,3)\mathbf{\Sigma}_{Z}=\mathrm{Diag}(6,3), and choose the orthogonal matrix 𝐖\mathbf{W} at random. With n=1000n=1000 independent samples of XX generated, we equally split M=N=500M=N=500, and compute the LC-IC estimate using PCA to estimate 𝐖\mathbf{W} and logcondens to estimate the marginals. From the same n=1000n=1000 samples, we also compute the usual log-concave maximum likelihood estimate (LC-MLE) using the R package LogConcDEAD [12]. Figure 1 shows the ground truth and estimated densities together with their contour plots, and lists the estimation errors in squared Hellinger distance along with the algorithm runtimes. The integrals needed to obtain the squared Hellinger distances are computed via a simple Monte-Carlo procedure (outlined in C.1); the associated uncertainty is also presented in Figure 1.
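
For reference, the Monte-Carlo computation can exploit the identity h²(p, q) = 1 − E_{X∼p}[√(q(X)/p(X))]: draw samples from the ground-truth density and average the square-root likelihood ratio. The following Python sketch is one such estimator, with hypothetical helper names; the procedure in C.1 may differ in its details.

import numpy as np
from scipy.stats import norm

def squared_hellinger_mc(p, q, sample_p, n_mc=100_000, rng=None):
    """Monte-Carlo estimate of h^2(p, q) = 1 - E_{X~p} sqrt(q(X)/p(X)),
    returned together with the standard error of the averaged ratio."""
    rng = np.random.default_rng(rng)
    x = sample_p(n_mc, rng)                   # n_mc draws from the density p
    ratio = np.sqrt(q(x) / p(x))
    return 1.0 - ratio.mean(), ratio.std(ddof=1) / np.sqrt(n_mc)

# Example with two univariate Gaussians (true value is 1 - exp(-1/32), about 0.031).
p, q = norm(0, 1).pdf, norm(0.5, 1).pdf
print(squared_hellinger_mc(p, q, lambda n, rng: rng.standard_normal(n), rng=0))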

Table 4: Comparing average algorithm runtimes in seconds of our proposed log-concave independent components estimator (LC-IC) with the LogConICA method of Samworth and Yuan [32], on simulated Unif([1,1]d)\mathrm{Unif}([-1,1]^{d}) data. Here, nn is the number of samples, and dd is the dimension. For LogConICA, the reported runtimes are averages over 10 random initializations.
n=100n=100 n=500n=500 n=1000n=1000 n=2000n=2000 n=3000n=3000
d=2d=2 LogConICA 1.27 2.87 9.17 24.92 41.64
LC-IC 0.02 0.01 0.02 0.04 0.06
d=3d=3 LogConICA 0.38 2.99 4.01 15.99 31.35
LC-IC 0.02 0.03 0.05 0.09 0.10
d=4d=4 LogConICA 0.44 1.49 3.22 7.60 12.33
LC-IC 0.02 0.04 0.10 0.12 0.18
Figure 3: Comparing estimation errors in squared Hellinger distance of our proposed log-concave independent components estimator (LC-IC) with the LogConICA method of Samworth and Yuan [32], on simulated Unif([1,1]d)\mathrm{Unif}([-1,1]^{d}) data. The dots correspond to independent experiments, and the lines represent averages. Here, nn is the number of samples, and dd is the dimension.

Next, we conduct systematic comparison tests on simulated data from multivariate normal distributions in d=2,3,4d=2,3,4, with various values of nn. As before, set X=𝐖TZX=\mathbf{W}^{T}Z with 𝐖\mathbf{W} chosen at random and Z𝒩(0,𝚺Z)Z\sim\mathcal{N}(0,\mathbf{\Sigma}_{Z}), where 𝚺Z=Diag(15,14)\mathbf{\Sigma}_{Z}=\mathrm{Diag}(15,14) for d=2d=2, 𝚺Z=Diag(15,14,13)\mathbf{\Sigma}_{Z}=\mathrm{Diag}(15,14,13) for d=3d=3, and 𝚺Z=Diag(15,14,13,12)\mathbf{\Sigma}_{Z}=\mathrm{Diag}(15,14,13,12) for d=4d=4. PCA is used for estimating 𝐖\mathbf{W}. Each experiment is repeated 5 times, and the average metrics are presented. Algorithm runtime comparisons are given in Table 3, whereas the estimation errors in squared Hellinger distance are plotted in Figure 2. Notice that LC-IC runtimes are several orders of magnitude smaller than LC-MLE runtimes, especially for large nn. Also, the estimation errors of LC-IC are lower than those of LC-MLE, with a growing margin as dd increases. Similar tests on Gamma-distributed data are provided in C.3.

These results are not surprising as the multivariate log-concave MLE is a fully non-parametric procedure that makes no assumptions on pp except for log-concavity. A fairer comparison is with the LogConICA algorithm of Samworth and Yuan [32]. On simulated data from the uniform distribution Unif([1,1]d)\mathrm{Unif}([-1,1]^{d}) with d=2,3,4d=2,3,4, we compare LC-IC (invoking the kurtosis-based FastICA implementation of ica [20] to estimate 𝐖\mathbf{W}, and logcondens to estimate the marginals) with our implementation of LogConICA (using 10 random initializations and a maximum of 20 iterations per initialization). As before, each experiment is repeated 5 times, and the average metrics are presented. Algorithm runtime comparisons are given in Table 4, where we report the average times over the random initializations for LogConICA. The estimation errors in squared Hellinger distance are plotted in Figure 3.

As expected, LC-IC still has much shorter runtimes, because it estimates the marginals only once, whereas LogConICA updates them over multiple iterations. The low LogConICA runtimes for larger dd are peculiar, however; this appears to be because the alternating scheme can quickly get stuck at bad points and subsequently terminate. Next, notice that LogConICA can achieve smaller errors than LC-IC (see the plot for d=3d=3), but it may occasionally fail to produce good estimates due to getting stuck at bad points. This could be addressed by better-than-random initialization; one may use our LC-IC estimate as a fast and accurate initial point for LogConICA (see Figure 4).

Figure 4: Comparing estimation errors in squared Hellinger distance of LC-IC, LogConICA with random initialization, and LogConICA initialized with LC-IC, on simulated Unif([1,1]d)\mathrm{Unif}([-1,1]^{d}) data. The dots correspond to independent experiments, and the lines represent averages. Here, nn is the number of samples, and dd is the dimension.

4.2 Effect of sample splitting proportions

We assess the effect of changing the sample splitting proportions between MM and NN on the estimation errors, for multivariate normal data in various dimensions. Using the total number of samples n=M+Nn=M+N, we define the parameter r=M/nr=M/n so that the extreme r=0r=0 corresponds to all samples being used for estimating the marginals, and r=1r=1 corresponds to all samples being used for estimating 𝐖\mathbf{W}.

Figure 5: Demonstrating the effect of the sample splitting ratio r=M/nr=M/n on estimation error in squared Hellinger distance, for multivariate normal data in different dimensions dd. The dots correspond to independent experiments, and the lines represent averages.

As before, X=𝐖TZX=\mathbf{W}^{T}Z with the orthogonal matrix 𝐖\mathbf{W} chosen at random and Z𝒩(0,𝚺Z)Z\sim\mathcal{N}(0,\mathbf{\Sigma}_{Z}), where for dimension d{2,3,,9,10,15}d\in\{2,3,\dots,9,10,15\} we set 𝚺Z=Diag(15,14,,15(d1))\mathbf{\Sigma}_{Z}=\mathrm{Diag}(15,14,\dots,15-(d-1)). With n=2000n=2000 total samples, the values r{0.1,0.2,,0.9}r\in\{0.1,0.2,\dots,0.9\} are tested. PCA is used for estimating 𝐖\mathbf{W}. Each experiment is repeated 5 times independently, and the resulting estimation errors in squared Hellinger distance are presented in Figure 5 for d=5,10,15d=5,10,15.
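
For orientation, the sweep over r for a fixed dataset has the following structure; this compact Python sketch reuses the hypothetical fit_lcic and squared_hellinger_mc helpers from earlier and is not our R code.

import numpy as np

# Generate d-dimensional data as above and sweep over the splitting ratio r.
rng = np.random.default_rng(1)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma_Z = np.diag(np.arange(15.0, 15.0 - d, -1.0))
data = rng.multivariate_normal(np.zeros(d), Sigma_Z, size=2000) @ Q

for r in np.linspace(0.1, 0.9, 9):
    p_hat, _ = fit_lcic(data, r=r, rng=0)
    # ... evaluate the error of p_hat against the true density, e.g. with
    # squared_hellinger_mc, and record it for this value of r.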

Let rr^{\star} denote the minimizer of an rr-versus-error curve for a fixed dd, and notice that in Figure 5, rr^{\star} shifts to the right as dd increases. This trend is tabulated for all values of dd tested in Table 5. The observed shift is expected since the bounds for MM (see Table 2) have a stronger dependence on dd compared to the bounds for NN (see Table 1). However, we remark that the effect of dd has not been completely isolated in the above experiment. The increase in dd is accompanied by a corresponding decrease in λmin(𝚺)\lambda_{\min}(\mathbf{\Sigma}), which worsens the smoothness (or conditioning) of the density. Keeping λmin(𝚺)\lambda_{\min}(\mathbf{\Sigma}) fixed, on the other hand, would require shrinking the eigengap δ\delta with dd.

Finally, we test the case of no sample splitting in the setting described above. Here, all n=2000n=2000 samples are used for both marginal estimation and unmixing matrix estimation. We compare the resulting squared Hellinger errors with those of the usual sample splitting variant with r=0.4r=0.4, and present the results in Figure 6. We see, somewhat surprisingly, that the variant without sample splitting achieves lower errors. This suggests that any cross-interactions between the marginal and unmixing matrix estimation stages are mild, and both stages benefit from seeing more samples. Our theory from Section 3 does not directly apply to this variant without sample splitting, however.

Table 5: Dependence of the error-minimizing sample splitting ratio rr^{\star} on dimension dd for simulated multivariate normal data.
dd 2 3 4 5 6 7 8 9 10 15
rr^{\star} 0.1 0.1 0.1 0.1 0.2 0.2 0.3 0.3 0.3 0.4
Figure 6: Comparing LC-IC variants with and without sample splitting, on multivariate normal data. The variant with sample splitting uses r=0.4r=0.4. The dots correspond to independent experiments, and the lines represent averages.

4.3 Estimating mixtures of densities and clustering

Consider the problem of estimating finite mixture densities of the form

(14) p(x)=k=1Kπkpk(x),\displaystyle p(x)=\sum_{k=1}^{K}\pi_{k}p_{k}(x),

where the mixture proportions π1,,πK\pi_{1},\dots,\pi_{K} are nonnegative and sum to 1, and p1,,pK:dp_{1},\dots,p_{K}:\mathbb{R}^{d}\to\mathbb{R} are the component densities corresponding to different clusters. If the component densities are normal, the standard Gaussian EM (expectation maximization) algorithm can be used to estimate pp. Chang and Walther [9] allowed for the component densities to be univariate log-concave by combining log-concave MLE with the EM algorithm. They also considered a normal copula for the multivariate case. This was later extended to general multivariate log-concave component densities by Cule et al. [11].

Here, we consider mixture densities of the form (14) where the component densities p1,,pKp_{1},\dots,p_{K} (when centered to mean-zero positions) satisfy the log-concave independent components property. By combining the log-concave independent components estimator (LC-IC) with the EM algorithm, we can port the aforementioned computational benefits to this mixture setting.

We follow the general strategy in Cule et al. [11, Section 6.1], replacing their log-concave MLE step with LC-IC. Such a replacement is not completely straightforward, however. The log-concave MLE in this setting involves weighted samples, which are not immediately compatible with our sample splitting approach. Instead, we use a re-sampling heuristic as outlined in C.2.
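
For concreteness, the following Python sketch shows one plausible instantiation of the EM + LC-IC loop, with weighted re-sampling according to the responsibilities before each component fit. It reuses the hypothetical fit_lcic sketch from earlier and is only meant to illustrate the structure; the heuristic actually used in our experiments is the one outlined in C.2.

import numpy as np

def em_lcic(data, K=2, n_iter=20, r=0.5, rng=None):
    """Illustrative EM loop for a mixture of LC-IC components. The M-step
    re-samples each component's data in proportion to the responsibilities
    (a heuristic standing in for a weighted fit) and refits LC-IC."""
    rng = np.random.default_rng(rng)
    n, d = data.shape
    resp = rng.dirichlet(np.ones(K), size=n)          # random initial responsibilities
    for _ in range(n_iter):
        # M-step: mixture weights, component means, and re-sampled LC-IC fits.
        pis = resp.mean(axis=0)
        comps = []
        for k in range(K):
            w = resp[:, k] / resp[:, k].sum()
            idx = rng.choice(n, size=n, p=w)          # weighted re-sampling heuristic
            Xk = data[idx]
            mu_k = Xk.mean(axis=0)
            dens_k, _ = fit_lcic(Xk - mu_k, r=r, rng=int(rng.integers(1 << 30)))
            comps.append((mu_k, dens_k))
        # E-step: recompute responsibilities from the fitted component densities.
        lik = np.column_stack([pis[k] * comps[k][1](data - comps[k][0]) for k in range(K)])
        resp = lik / np.clip(lik.sum(axis=1, keepdims=True), 1e-300, None)
    return pis, comps, resp

Cluster labels are then obtained by assigning each observation to the component with the largest responsibility.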

Figure 7: Estimating mixture densities and clustering using two features from the Wisconsin breast cancer dataset [38]. Labeling Cluster 1 as Malignant and Cluster 2 as Benign gives a 77.7% classification accuracy.

Estimated mixture densities can be used for unsupervised clustering, and we test this on real data. To compare the performance of our EM + LC-IC against the results from Cule et al. [11, Section 6.2], we use the Wisconsin breast cancer dataset [38]. The dataset consists of 569 observations (individuals) and 30 features, where each observation is labeled with a diagnosis of Malignant or Benign. The test involves clustering the data without looking at the labels, and then comparing the learnt clusters with the ground truth labels.

First, we use the same two features as Cule et al. [11]: Feature 1 – standard error of the cell nucleus radius, and Feature 2 – standard error of the cell nucleus texture. This gives bivariate data (d=2d=2) with n=569n=569 samples. We employ EM + LC-IC with K=2K=2 target clusters, random initialization, and 20 EM iterations. Again, PCA is used for the 𝐖\mathbf{W}-estimation steps. Figure 7 shows the data with ground truth labels (left-most plot), level curves of the estimated component densities (center plot), and the learnt clusters (right-most plot). Labeling Cluster 1 as Malignant and Cluster 2 as Benign attains a 77.7% classification accuracy (i.e. 77.7% of the instances are correctly labeled by the learnt clusters). This is close to the 78.7% accuracy reported in Cule et al. [11] (121 misclassified instances).

Figure 8: Estimating mixture densities and clustering using all 30 features from the Wisconsin breast cancer dataset [38]. The first two principal components are plotted. Labeling Cluster 1 as Malignant and Cluster 2 as Benign gives an 89.6% classification accuracy.

Next, we show improved clustering performance by using more features. Due to the scalability of LC-IC to higher dimensions, one can use all 30 features available in the dataset, so that d=30d=30. We employ EM + LC-IC with K=2K=2 target clusters as before, choose an initialization based on agglomerative hierarchical clustering (using the R package mclust [35]), and use 15 EM iterations. The results can be visualized in a PCA subspace of the data – the span of the two leading PCA eigenvectors (or loadings). This particular PCA is only used for visualization, and is separate from the PCA used in estimating 𝐖\mathbf{W}.

Figure 8 shows the data in this PCA subspace with ground truth labels (left-most plot), level curves of the estimated component densities (center plot), and the learnt clusters (right-most plot). Labeling Cluster 1 as Malignant and Cluster 2 as Benign achieves an 89.6% classification accuracy, which is a significant improvement over the results for d=2d=2. In comparison, agglomerative hierarchical clustering on its own reaches an accuracy of 86.3%.

5 Discussion

We have looked at statistical and computational aspects of estimating log-concave densities satisfying the orthogonal independent components property. First, we proved that at most 𝒪~d(ϵ4)\tilde{\mathcal{O}}_{d}(\epsilon^{-4}) iid samples (suppressing constants and log-factors) suffice for our proposed estimator p^\hat{p} to be ϵ\epsilon-close to pp with high confidence. Although squared Hellinger distance was used here, similar results could have been obtained for total variation (TV) distance since TV(p^,p)hd(p^,p)\mathrm{TV}(\hat{p},p)\lesssim h_{d}(\hat{p},p). Note that we do not claim these sample complexity bounds to be tight. The focus here was on demonstrating univariate rates in ϵ\epsilon, together with a mild dependence on dd under additional assumptions.

The results here could potentially be processed in the framework of Ashtiani et al. [2], who prove that any family of distributions admitting a so-called sample compression scheme can be learned with few samples. Furthermore, such compressibility extends to products and mixtures of that family. Hence, demonstrating sample compression in our setting would allow extending the sample complexities to mixtures of log-concave independent components densities, providing some theoretical backing to the results from Section 4.3.

Next, the stability results from Section 3.4 could be of further theoretical interest, providing geometric insights into families of densities obtained from orthogonal or linear transformations of product densities. In particular, these could allow one to port covering and bracketing entropies from d=1d=1 to higher dimensions by discretizing the operator norm ball in d×d\mathbb{R}^{d\times d}. Also note that these stability bounds are not necessarily tight – consider a spherically symmetric density qq so that hd2(q,𝐑q)=0h_{d}^{2}(q,\mathbf{R}_{\sharp}q)=0 for all orthogonal matrices 𝐑\mathbf{R}. Our current analysis does not make use of such symmetry properties.

On the computational side, we showed that our algorithm has significantly smaller runtimes compared to the usual log-concave MLE as well as the LogConICA algorithm of Samworth and Yuan [32]. This was achieved by breaking down the higher dimensional problem into several 1-dimensional ones. In fact these 1-dimensional marginal estimation tasks are decoupled from one another, allowing for effective parallelization to further reduce runtimes. Finally, we demonstrated the scalability of our algorithm by performing density estimation with clustering in d=30d=30.

Acknowledgments

We thank the anonymous referees for several helpful suggestions, and also thank Yaniv Plan for fruitful discussions. Elina Robeva was supported by an NSERC Discovery Grant DGECR-2020-00338.

Parts of this paper were written using the “Latex Exporter” plugin for Obsidian.

References

  • Adamczak et al. [2010] Radosław Adamczak, Alexander Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann. Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society, 23(2):535–561, 2010.
  • Ashtiani et al. [2018] Hassan Ashtiani, Shai Ben-David, Nicholas Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes. Advances in Neural Information Processing Systems, 31, 2018.
  • Auddy and Yuan [2023] Arnab Auddy and Ming Yuan. Large dimensional independent component analysis: Statistical optimality and computational tractability. arXiv preprint arXiv:2303.18156, 2023.
  • Bagnoli and Bergstrom [2005] Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic Theory, 26(2):445–469, 2005.
  • Balabdaoui and Wellner [2007] Fadoua Balabdaoui and Jon A Wellner. Estimation of a k-monotone density: Limit distribution theory and the spline connection. The Annals of Statistics, 35(6):2536–2564, 2007.
  • Barber and Samworth [2021] Rina Foygel Barber and Richard J Samworth. Local continuity of log-concave projection, with applications to estimation under model misspecification. Bernoulli, 27(4):2437–2472, 2021.
  • Carpenter et al. [2018] Timothy Carpenter, Ilias Diakonikolas, Anastasios Sidiropoulos, and Alistair Stewart. Near-optimal sample complexity bounds for maximum likelihood estimation of multivariate log-concave densities. In Conference On Learning Theory, pages 1234–1262. PMLR, 2018.
  • Chae and Walker [2020] Minwoo Chae and Stephen G. Walker. Wasserstein upper bounds of the total variation for smooth densities. Statistics & Probability Letters, 163:108771, 2020. ISSN 0167-7152.
  • Chang and Walther [2007] George T Chang and Guenther Walther. Clustering with mixtures of log-concave distributions. Computational Statistics & Data Analysis, 51(12):6242–6251, 2007.
  • Comon and Jutten [2010] Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent component analysis and applications. Academic press, 2010.
  • Cule et al. [2010] Madeleine Cule, Richard Samworth, and Michael Stewart. Maximum likelihood estimation of a multi-dimensional log-concave density. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(5):545–607, 2010.
  • Cule et al. [2023] Madeleine Cule, Robert Gramacy, Richard Samworth, and Yining Chen. LogConcDEAD: Log-Concave Density Estimation in Arbitrary Dimensions, 2023. URL https://CRAN.R-project.org/package=LogConcDEAD. R package version 1.6-8.
  • Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • Doss and Wellner [2016] Charles R Doss and Jon A Wellner. Global rates of convergence of the MLEs of log-concave and s-concave densities. The Annals of Statistics, 44(3):954, 2016.
  • Dümbgen et al. [2011] Lutz Dümbgen, Richard Samworth, and Dominic Schuhmacher. Approximation by log-concave distributions, with applications to regression. Annals of statistics, 39(2):702–730, 2011.
  • Gibbs and Su [2002] Alison L Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International statistical review, 70(3):419–435, 2002.
  • Goyal et al. [2014] Navin Goyal, Santosh Vempala, and Ying Xiao. Fourier PCA and robust tensor decomposition. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC ’14, pages 584–593, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450327107.
  • Grenander [1956] Ulf Grenander. On the theory of mortality measurement: Part ii. Scandinavian Actuarial Journal, 1956(2):125–153, 1956.
  • Groeneboom et al. [2001] Piet Groeneboom, Geurt Jongbloed, and Jon A Wellner. Estimation of a convex function: Characterizations and asymptotic theory. The Annals of Statistics, 29(6):1653–1698, 2001.
  • Helwig [2022] Nathaniel E. Helwig. ica: Independent Component Analysis, 2022. URL https://CRAN.R-project.org/package=ica. R package version 1.0-3.
  • Holmström and Klemelä [1992] Lasse Holmström and Jussi Klemelä. Asymptotic bounds for the expected l1 error of a multivariate kernel density estimator. Journal of multivariate analysis, 42(2):245–266, 1992.
  • Hyvärinen and Oja [2000] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
  • Kim et al. [2018] Arlene K. Kim, Adityanand Guntuboyina, and Richard J. Samworth. Adaptation in log-concave density estimation. The Annals of Statistics, 46(5):2279 – 2306, 2018. 10.1214/17-AOS1619.
  • Kim and Samworth [2016] Arlene KH Kim and Richard J Samworth. Global rates of convergence in log-concave density estimation. The Annals of Statistics, 44(6):2756–2779, 2016.
  • Koenker and Mizera [2010] Roger Koenker and Ivan Mizera. Quasi-concave density estimation. The Annals of Statistics, 38(5):2998–3027, 2010.
  • Kubjas et al. [2022] Kaie Kubjas, Olga Kuznetsova, Elina Robeva, Pardis Semnani, and Luca Sodomaco. Log-concave density estimation in undirected graphical models. arXiv preprint arXiv:2206.05227, 2022.
  • Kur et al. [2019] Gil Kur, Yuval Dagan, and Alexander Rakhlin. Optimality of maximum likelihood for log-concave density estimation and bounded convex regression. arXiv:1903.05315, 2019.
  • Lovász and Vempala [2007] László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007. https://doi.org/10.1002/rsa.20135.
  • Meyers and Serrin [1964] Norman G. Meyers and James Serrin. H = W. Proceedings of the National Academy of Sciences, 51(6):1055–1056, 1964. 10.1073/pnas.51.6.1055.
  • Robeva et al. [2021] Elina Robeva, Bernd Sturmfels, Ngoc Tran, and Caroline Uhler. Maximum likelihood estimation for totally positive log-concave densities. Scandinavian Journal of Statistics, 48(3):817–844, 2021.
  • Rufibach and Duembgen [2023] Kaspar Rufibach and Lutz Duembgen. logcondens: Estimate a Log-Concave Probability Density from Iid Observations, 2023. URL https://CRAN.R-project.org/package=logcondens. R package version 2.1.8.
  • Samworth and Yuan [2012] Richard Samworth and Ming Yuan. Independent component analysis via nonparametric maximum likelihood estimation. The Annals of Statistics, 40(6):2973 – 3002, 2012. 10.1214/12-AOS1060.
  • Samworth [2018] Richard J. Samworth. Recent Progress in Log-Concave Density Estimation. Statistical Science, 33(4):493 – 509, 2018. 10.1214/18-STS666. URL https://doi.org/10.1214/18-STS666.
  • Schuhmacher and Dümbgen [2010] Dominic Schuhmacher and Lutz Dümbgen. Consistency of multivariate log-concave density estimators. Statistics & probability letters, 80(5-6):376–380, 2010.
  • Scrucca et al. [2023] Luca Scrucca, Chris Fraley, T. Brendan Murphy, and Adrian E. Raftery. Model-Based Clustering, Classification, and Density Estimation Using mclust in R. Chapman and Hall/CRC, 2023. ISBN 978-1032234953. 10.1201/9781003277965. URL https://mclust-org.github.io/book/.
  • Vershynin [2018] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • Walther [2002] Guenther Walther. Detecting the presence of mixing with multiscale maximum likelihood. Journal of the American Statistical Association, 97(458):508–513, 2002.
  • Wolberg et al. [1995] William Wolberg, Olvi Mangasarian, Nick Street, and W. Street. Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository, 1995. DOI: https://doi.org/10.24432/C5DW2B.
  • Xu and Samworth [2021] Min Xu and Richard J. Samworth. High-dimensional nonparametric density estimation via symmetry and shape constraints. The Annals of Statistics, 49(2):650 – 672, 2021. 10.1214/20-AOS1972. URL https://doi.org/10.1214/20-AOS1972.
  • Yu et al. [2015] Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the davis–kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015.

Appendix A Delayed proofs

Here, we present the proofs that were deferred in the main text. We will need some additional notation. Let 𝒫d\mathcal{P}_{d} be the set of probability distributions PP on d\mathbb{R}^{d} satisfying 𝔼XPX2<\mathbb{E}_{X\sim P}\|X\|_{2}<\infty and XP(XH)<1\mathbb{P}_{X\sim P}(X\in H)<1 for every hyperplane HdH\subseteq\mathbb{R}^{d}. The log-concave projection of Dümbgen et al. [15] is a map from 𝒫d\mathcal{P}_{d} to the set of log-concave densities d\mathcal{F}_{d} on d\mathbb{R}^{d}, defined by

ψd(P):=argmaxfd𝔼XPlogf(X).\psi_{d}^{*}(P):=\mathop{\arg\max}_{f\in\mathcal{F}_{d}}\,\,\mathbb{E}_{X\sim P}\log f(X).

Also define, given XP𝒫dX\sim P\in\mathcal{P}_{d} the minimal first moment

ηP:=infs𝕊d1𝔼|s(X𝔼X)|.\eta_{P}:=\inf_{s\in\mathbb{S}^{d-1}}\,\mathbb{E}\left|s\cdot(X-\mathbb{E}X)\right|.

A.1 Proof of Theorem 3.5

We recall the statement below.

Statement

Suppose pp is a zero-mean probability density on d\mathbb{R}^{d}, satisfying the orthogonal independent components property with independent directions w1,,wdw_{1},\dots,w_{d}. Let p^\hat{p} be the proposed estimator (Definition 2.4). We have the error bound

hd(p^,p)\displaystyle h_{d}(\hat{p},p) dmaxi[d]h1(p^w^i,pw^i)+dhd(𝐑^p,p)+hd(𝐑^Tp,p),\displaystyle\leq\sqrt{d}\,\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},p_{\hat{w}_{i}})+{\sqrt{d}\,h_{d}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right)+h_{d}\left(\hat{\mathbf{R}}^{T}_{\sharp}p\,,\,p\right)},

where 𝐑^=𝐖T𝐖^\hat{\mathbf{R}}=\mathbf{W}^{T}\hat{\mathbf{W}}.

Proof A.1.

For s𝕊d1s\in\mathbb{S}^{d-1}, denote the projection πs(x)=sx\pi_{s}(x)=s\cdot x. This allows us to write the proposed estimator as

(15) p^(x)=i=1dp^w^i(w^ix)=i=1dp^w^iπw^i(x).\displaystyle\hat{p}(x)=\prod_{i=1}^{d}\hat{p}_{\hat{w}_{i}}(\hat{w}_{i}\cdot x)=\prod_{i=1}^{d}\hat{p}_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}}(x).

We can break hd(p^,p)h_{d}(\hat{p},p) down into three terms that need to be controlled:

(16) hd(p^,p)=hd(ip^w^iπw^i,ipwiπwi)hd(ip^w^iπw^i,ipw^iπw^i)+hd(ipw^iπw^i,ipwiπw^i)+hd(ipwiπw^i,ipwiπwi).\displaystyle\begin{aligned} h_{d}(\hat{p},p)&=h_{d}\left(\prod_{i}\hat{p}_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}}\,,\,\prod_{i}p_{w_{i}}\circ\pi_{w_{i}}\right)\\ &\leq h_{d}\left(\prod_{i}\hat{p}_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}}\,,\,\prod_{i}p_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}}\right)\\ &\phantom{\leq}+h_{d}\left(\prod_{i}p_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}}\,,\,\prod_{i}p_{w_{i}}\circ\pi_{\hat{w}_{i}}\right)+h_{d}\left(\prod_{i}p_{w_{i}}\circ\pi_{\hat{w}_{i}}\,,\,\prod_{i}p_{w_{i}}\circ\pi_{w_{i}}\right).\end{aligned}

This is essentially removing hats one-at-a-time. The first term is handled as

(17) hd2(ip^w^iπw^i,ipw^iπw^i)dmaxi[d]h12(p^w^i,pw^i),\displaystyle h_{d}^{2}\left(\prod_{i}\hat{p}_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}}\,,\,\prod_{i}p_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}}\right)\leq d\,\max_{i\in[d]}h_{1}^{2}(\hat{p}_{\hat{w}_{i}},p_{\hat{w}_{i}}),

using the simple tensorization argument from Lemma 3.1. Since ipwiπw^i(x)=p(𝐑^x)\prod_{i}p_{w_{i}}\circ\pi_{\hat{w}_{i}}(x)=p(\hat{\mathbf{R}}x), which is the density of 𝐑^Tp\hat{\mathbf{R}}_{\sharp}^{T}p, we get that the third term equals hd(𝐑^Tp,p)h_{d}(\hat{\mathbf{R}}_{\sharp}^{T}p,p).

Finally, consider the second term. Noting that pw^k=(𝐑^p)wkp_{\hat{w}_{k}}=(\hat{\mathbf{R}}_{\sharp}p)_{w_{k}} and using Lemma B.1 gives

(18) h12(pw^i,pwi)=h12((𝐑^p)wi,pwi)hd2(𝐑^p,p).\displaystyle h_{1}^{2}(p_{\hat{w}_{i}},p_{w_{i}})=h_{1}^{2}\left((\hat{\mathbf{R}}_{\sharp}p)_{w_{i}},\,p_{w_{i}}\right)\leq h_{d}^{2}\left(\hat{\mathbf{R}}_{\sharp}p,\,p\right).

Now, to bound the second term, note that

1hd2(ipw^iπw^i,ipwiπw^i)\displaystyle 1-h_{d}^{2}\left(\prod_{i}p_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}}\,,\,\prod_{i}p_{w_{i}}\circ\pi_{\hat{w}_{i}}\right) =dipw^i(w^ix)pwi(w^ix)𝑑x\displaystyle=\int_{\mathbb{R}^{d}}\sqrt{\prod_{i}p_{\hat{w}_{i}}(\hat{w}_{i}\cdot x)\,p_{w_{i}}(\hat{w}_{i}\cdot x)}\,dx
=ipw^i(zi)pwi(zi)𝑑zi,\displaystyle=\prod_{i}\int_{\mathbb{R}}\sqrt{p_{\hat{w}_{i}}(z^{\prime}_{i})\,p_{w_{i}}(z^{\prime}_{i})}\,dz^{\prime}_{i},

via the change of variables z=𝐖^xz^{\prime}=\hat{\mathbf{W}}x. The above expression can be lower bounded using (18):

ipw^i(zi)pwi(zi)𝑑zi\displaystyle\prod_{i}\int_{\mathbb{R}}\sqrt{p_{\hat{w}_{i}}(z^{\prime}_{i})\,p_{w_{i}}(z^{\prime}_{i})}\,dz^{\prime}_{i} =i(1h12(pw^i,pwi))i(1hd2(𝐑^p,p))\displaystyle=\prod_{i}\left(1-h_{1}^{2}(p_{\hat{w}_{i}},p_{w_{i}})\right)\geq\prod_{i}\left(1-h_{d}^{2}\left(\hat{\mathbf{R}}_{\sharp}p,\,p\right)\right)
=(1hd2(𝐑^p,p))d1dhd2(𝐑^p,p),\displaystyle=\left(1-h_{d}^{2}\left(\hat{\mathbf{R}}_{\sharp}p,\,p\right)\right)^{d}\geq 1-d\cdot h_{d}^{2}\left(\hat{\mathbf{R}}_{\sharp}p,\,p\right),

where the last bound follows from Bernoulli’s inequality. Hence,

hd2(ipw^iπw^i,ipwiπw^i)dhd2(𝐑^p,p),\displaystyle h_{d}^{2}\left(\prod_{i}p_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}}\,,\,\prod_{i}p_{w_{i}}\circ\pi_{\hat{w}_{i}}\right)\leq d\cdot h_{d}^{2}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right),

which completes the proof of Theorem 3.5.

A.2 Proof of Lemma 3.6

We recall the statement below.

Statement

Let pp be a zero-mean log-concave density on d\mathbb{R}^{d}, and let 𝐀d×d\mathbf{A}\in\mathbb{R}^{d\times d} be any invertible matrix. Then,

hd2(p,𝐀p)Kdcond(𝚺)1/4𝐈𝐀op1/2,h_{d}^{2}(p,\mathbf{A}_{\sharp}p)\leq K_{d}\,\mathrm{cond}(\mathbf{\Sigma})^{1/4}\,\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}^{1/2},

where 𝚺\mathbf{\Sigma} is the covariance matrix of pp, and the constant Kd>0K_{d}>0 depends only on the dimension dd.

Proof A.2.

We will use properties of the log-concave projection ψ\psi^{*}. Firstly, pp is a log-concave density by assumption, and by the linearity and invertibility of 𝐀\mathbf{A}, so is 𝐀p\mathbf{A}_{\sharp}p. As a result, ψ(p)=p\psi^{*}(p)=p and ψ(𝐀p)=𝐀p\psi^{*}(\mathbf{A}_{\sharp}p)=\mathbf{A}_{\sharp}p. By Theorem 2 of Barber and Samworth [6],

hd2(p,𝐀p)Cd2[W1(p,𝐀p)ηpη𝐀p]1/2,h_{d}^{2}(p,\mathbf{A}_{\sharp}p)\leq C_{d}^{2}\left[\frac{W_{1}(p,\mathbf{A}_{\sharp}p)}{\eta_{p}\vee\eta_{\mathbf{A}_{\sharp}p}}\right]^{1/2},

where Cd>0C_{d}>0 is a constant that depends only on dd. We will conclude by upper-bounding the Wasserstein-1 distance W1(p,𝐀p)W_{1}(p,\mathbf{A}_{\sharp}p) and lower bounding ηp\eta_{p}.

By Lemma B.7, we have W1(p,𝐀p)tr(𝚺)𝐈𝐀opλmax(𝚺)1/2d1/2𝐈𝐀opW_{1}(p,\mathbf{A}_{\sharp}p)\leq\sqrt{\mathrm{tr}(\mathbf{\Sigma})}\,\|\mathbf{I-A}\|_{\mathrm{op}}\leq\lambda_{\max}(\mathbf{\Sigma})^{1/2}d^{1/2}\,\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}. On the other hand, by Lemma B.5, one can lower bound ηpcλmin(𝚺)1/2\eta_{p}\geq c\,\lambda_{\min}(\mathbf{\Sigma})^{1/2}, where c>0c>0 is an absolute constant (say c=1/1600c=1/1600). The result then follows with Kd=Cd2d1/4c1/2K_{d}=C_{d}^{2}\,d^{1/4}c^{-1/2}, recalling that cond(𝚺):=λmax(𝚺)/λmin(𝚺)\mathrm{cond}(\mathbf{\Sigma}):=\lambda_{\max}(\mathbf{\Sigma})/\lambda_{\min}(\mathbf{\Sigma}).

A.3 Proof of Theorem 3.8

We recall the statement below.

Statement (brief)

Let pp be a zero-mean log-concave density on d\mathbb{R}^{d} satisfying the orthogonal independent components property, and let p^\hat{p} be the proposed estimator. Then, for any ϵ>0\epsilon>0 and 0<γ<10<\gamma<1, we have

hd2(p^,p)ϵh_{d}^{2}(\hat{p},p)\leq\epsilon

with probability at least 1γ1-\gamma whenever

(19) Nsups𝕊d1N(ϵ/4d,γ/2d;ps),N\geq\sup_{s\in\mathbb{S}^{d-1}}N^{*}\left(\epsilon/4d,\,\gamma/2d;\,p_{s}\right),
and
MM((Kdcond(𝚺))1/2ϵ2,γ/2;p).M\geq M^{*}\left((K_{d}^{\prime}\,\mathrm{cond}(\mathbf{\Sigma}))^{-1/2}\epsilon^{2},\,\gamma/2;\,p\right).
Proof A.3.

By Theorem 3.5, decompose hd(p^,p)h_{d}(\hat{p},p) as

hd(p^,p)\displaystyle h_{d}(\hat{p},p) dmaxi[d]h1(p^w^i,pw^i)+dhd(𝐑^p,p)+hd(𝐑^Tp,p),\displaystyle\leq\sqrt{d}\,\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},p_{\hat{w}_{i}})+\sqrt{d}\,h_{d}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right)+h_{d}\left(\hat{\mathbf{R}}^{T}_{\sharp}p\,,\,p\right),

where 𝐑^=𝐖T𝐖^\hat{\mathbf{R}}=\mathbf{W}^{T}\hat{\mathbf{W}}. We show that dmaxi[d]h1(p^w^i,pw^i)ϵ/2\sqrt{d}\,\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},p_{\hat{w}_{i}})\leq\sqrt{\epsilon}/2 holds with probability at least 1γ/21-\gamma/2 when NN is large enough, and that dhd(𝐑^p,p)+hd(𝐑^Tp,p)ϵ/2\sqrt{d}\,h_{d}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right)+h_{d}\left(\hat{\mathbf{R}}^{T}_{\sharp}p\,,\,p\right)\leq\sqrt{\epsilon}/2 holds with probability at least 1γ/21-\gamma/2 when MM is large enough. A final union bound then gives the desired result.

We handle dmaxi[d]h1(p^w^i,pw^i)\sqrt{d}\,\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},p_{\hat{w}_{i}}) first. Write 𝐗=(X(1),,X(N))\mathbf{X}=(X^{(1)},\dots,X^{(N)}) and 𝐘=(Y(1),,Y(M))\mathbf{Y}=(Y^{(1)},\dots,Y^{(M)}), and denote the marginal error Di(𝐗,𝐘):=h12(p^w^i,pw^i)D_{i}(\mathbf{X},\mathbf{Y}):=h_{1}^{2}(\hat{p}_{\hat{w}_{i}},p_{\hat{w}_{i}}) for i=1,,di=1,\dots,d to highlight the dependence on 𝐗\mathbf{X} and 𝐘\mathbf{Y}.

{Di(𝐗,𝐘)>ϵ~}=𝔼 1{Di(𝐗,𝐘)>ϵ~}=𝔼𝐘[𝔼𝐗 1{Di(𝐗,𝐘)>ϵ~}],\displaystyle\mathbb{P}\left\{D_{i}(\mathbf{X},\mathbf{Y})>\tilde{\epsilon}\right\}=\mathbb{E}\mathbf{1}{\{D_{i}(\mathbf{X},\mathbf{Y})>\tilde{\epsilon}\}}=\mathbb{E}_{\mathbf{Y}}[\mathbb{E}_{\mathbf{X}}\mathbf{1}\{D_{i}(\mathbf{X},\mathbf{Y})>\tilde{\epsilon}\}],

where 𝟏{}\mathbf{1}\{\cdot\} denotes the indicator of an event, and 𝔼𝐗\mathbb{E}_{\mathbf{X}}, 𝔼𝐘\mathbb{E}_{\mathbf{Y}} denote expectation with respect to the distributions of 𝐗\mathbf{X} and 𝐘\mathbf{Y} respectively. As a result, it suffices to bound 𝔼𝐗 1{Di(𝐗,𝐲)>ϵ~}\mathbb{E}_{\mathbf{X}}\,\mathbf{1}\{D_{i}(\mathbf{X},\mathbf{y})>\tilde{\epsilon}\} for any deterministic 𝐲(d)N\mathbf{y}\in(\mathbb{R}^{d})^{N}.

Recall that the estimated independent directions w^1,,w^d\hat{w}_{1},\dots,\hat{w}_{d} are computed from the samples in 𝐘\mathbf{Y}. For a deterministic 𝐲\mathbf{y}, the corresponding w^1(𝐲),,w^d(𝐲)\hat{w}_{1}(\mathbf{y}),\dots,\hat{w}_{d}(\mathbf{y}) are deterministic as well. In this case, the projected samples {w^i(𝐲)X(j)}j=1N\{\hat{w}_{i}(\mathbf{y})\cdot X^{(j)}\}_{j=1}^{N} are independent and identically distributed with law pw^i(𝐲)p_{\hat{w}_{i}(\mathbf{y})}, and as a result,

𝔼𝐗 1{Di(𝐗,𝐲)>ϵ~}=𝐗{h12(p^w^i(𝐲),pw^i(𝐲))>ϵ~}γ~\displaystyle\mathbb{E}_{\mathbf{X}}\mathbf{1}\{D_{i}(\mathbf{X},\mathbf{y})>\tilde{\epsilon}\}=\mathbb{P}_{\mathbf{X}}\{h_{1}^{2}(\hat{p}_{\hat{w}_{i}(\mathbf{y})},p_{\hat{w}_{i}(\mathbf{y})})>\tilde{\epsilon}\}\leq\tilde{\gamma}

holds for γ~(0,1)\tilde{\gamma}\in(0,1) provided Nmaxs𝕊d1N(ϵ~,γ~;ps)N(ϵ~,γ~;pw^i(𝐲))N\geq\max_{s\in\mathbb{S}^{d-1}}N^{*}(\tilde{\epsilon},\tilde{\gamma};p_{s})\geq N^{*}(\tilde{\epsilon},\tilde{\gamma};p_{\hat{w}_{i}}(\mathbf{y})). Now, re-introducing the randomness in 𝐘\mathbf{Y} and taking expectation gives {Di(𝐗,𝐘)>ϵ~}γ~\mathbb{P}\{D_{i}(\mathbf{X},\mathbf{Y})>\tilde{\epsilon}\}\leq\tilde{\gamma}. We set ϵ~=ϵ/4d\tilde{\epsilon}=\epsilon/4d and γ~=γ/2d\tilde{\gamma}=\gamma/2d, and use a union bound over i=1,,di=1,\dots,d to finally get

{dmaxi[d]h1(p^w^i,pw^i)>ϵ/2}γ/2.\mathbb{P}\left\{\sqrt{d}\,\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},p_{\hat{w}_{i}})>\sqrt{\epsilon}/2\right\}\leq\gamma/2.

Now, we handle the remaining terms. By Lemma 3.6 and (7), we have

hd2(𝐑^p,p)\displaystyle h_{d}^{2}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right) Kdcond(𝚺)1/4𝐈𝐑^op1/2\displaystyle\leq K_{d}\,\mathrm{cond}(\mathbf{\Sigma})^{1/4}\|\mathbf{I}-\hat{\mathbf{R}}\|_{\mathrm{op}}^{1/2}
d1/4Kdcond(𝚺)1/4(d1i=1dw^iwi22)1/4,\displaystyle\leq d^{1/4}\,K_{d}\,\mathrm{cond}(\mathbf{\Sigma})^{1/4}\left(d^{-1}\textstyle{\sum_{i=1}^{d}\|\hat{w}_{i}-w_{i}\|_{2}^{2}}\right)^{1/4},

and similarly for hd2(𝐑^Tp,p)h_{d}^{2}\left(\hat{\mathbf{R}}^{T}_{\sharp}p,p\right). For any ϵ~>0\tilde{\epsilon}>0, note that MM(ϵ~,γ/2;p)M\geq M^{*}(\tilde{\epsilon},\gamma/2;p) samples are sufficient to give

dhd(𝐑^p,p)+hd(𝐑^Tp,p)\displaystyle\sqrt{d}\,h_{d}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right)+h_{d}\left(\hat{\mathbf{R}}^{T}_{\sharp}p\,,\,p\right) (d+1)d1/8Kd1/2cond(𝚺)1/8ϵ~1/4\displaystyle\leq(\sqrt{d}+1)d^{1/8}K_{d}^{1/2}\,\mathrm{cond}(\mathbf{\Sigma})^{1/8}\tilde{\epsilon}^{1/4}

with probability at least 1γ/21-\gamma/2. Setting ϵ~=(Kdcond(𝚺))1/2ϵ2\tilde{\epsilon}=(K_{d}^{\prime}\,\mathrm{cond}(\mathbf{\Sigma}))^{-1/2}\epsilon^{2} for a suitable constant Kd>0K_{d}^{\prime}>0 (depending only on dd) then gives

{dhd(𝐑^p,p)+hd(𝐑^Tp,p)>ϵ/2}γ/2,\mathbb{P}\left\{\sqrt{d}\,h_{d}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right)+h_{d}\left(\hat{\mathbf{R}}^{T}_{\sharp}p\,,\,p\right)>\sqrt{\epsilon}/2\right\}\leq\gamma/2,

which concludes the proof.

A.4 Proof of Proposition 3.9

We recall the statement below.

Statement

Let f:[0,)f:\mathbb{R}\to[0,\infty) be a univariate log-concave density, and denote by f^\hat{f} the log-concave MLE computed from NN iid samples from ff. Given ϵ>0\epsilon>0 and 0<γ<10<\gamma<1, we have

h12(f^,f)ϵh_{1}^{2}(\hat{f},f)\leq\epsilon

with probability at least 1γ1-\gamma whenever

NC1ϵ5/4log5/4(C2γ)C3γ,N\geq\frac{C_{1}}{\epsilon^{5/4}}\log^{5/4}\left(\frac{C_{2}}{\gamma}\right)\vee\frac{C_{3}}{\gamma},

for absolute constants C1,C2,C3>0C_{1},C_{2},C_{3}>0.

Proof A.4.

Lemma 6 and the proof of Theorem 5 of Kim and Samworth [24] imply the concentration bound

(h12(f^,f)n4/5t)Cect+C′′n1\mathbb{P}\left(h_{1}^{2}(\hat{f},f)\geq n^{-4/5}t\right)\leq C^{\prime}e^{-ct}+C^{\prime\prime}n^{-1}

for suitably large absolute constants C,C′′>0C^{\prime},C^{\prime\prime}>0. Setting t=n4/5ϵt=n^{4/5}\epsilon in the concentration bound gives that

(h12(f^,f)ϵ)Cecn4/5ϵ+C′′n1.\mathbb{P}\left(h_{1}^{2}(\hat{f},f)\geq\epsilon\right)\leq C^{\prime}e^{-cn^{4/5}\epsilon}+C^{\prime\prime}n^{-1}.

Now, Cecn4/5ϵγ/2C^{\prime}e^{-cn^{4/5}\epsilon}\leq\gamma/2 holds provided n[(cϵ)1log(2C/γ)]5/4n\geq\left[(c\epsilon)^{-1}\log\left(2C^{\prime}/\gamma\right)\right]^{5/4}, and C′′n1γ/2C^{\prime\prime}n^{-1}\leq\gamma/2 holds provided n2C′′/γn\geq 2C^{\prime\prime}/\gamma. Setting C1=c5/4C_{1}=c^{-5/4}, C2=2CC_{2}=2C^{\prime} and C3=2C′′C_{3}=2C^{\prime\prime} concludes the proof.

A.5 Proof of Proposition 3.11

We recall the statement below.

Statement

Let f:[0,)f:\mathbb{R}\to[0,\infty) be a univariate log-concave density, and fix ϵ>0\epsilon>0. Denote by f^^ϵ\hat{\hat{f}}_{\epsilon} the boosted log-concave MLE at error threshold ϵ\epsilon, computed from qq independent batches of nn iid samples from ff each. Given 0<γ<10<\gamma<1, we have

h12(f^^ϵ,f)ϵh_{1}^{2}(\hat{\hat{f}}_{\epsilon},f)\leq\epsilon

with probability at least 1γ1-\gamma provided

nC1ϵ5/4andqC2log(1γ),n\geq\frac{C_{1}}{\epsilon^{5/4}}\quad\text{and}\quad q\geq C_{2}\log\left(\frac{1}{\gamma}\right),

for absolute constants C1,C2>0C_{1},C_{2}>0.

Proof A.5.

We follow the proof of Theorem 2.8 of Lovász and Vempala [28]. Firstly, nC1ϵ5/4n\geq C_{1}\epsilon^{-5/4} for a sufficiently large C1>0C_{1}>0 ensures (by Markov’s inequality) that

{h1(f^[b],f)ϵ/3}0.9\mathbb{P}\left\{h_{1}(\hat{f}^{[b]},f)\leq\sqrt{\epsilon}/3\right\}\geq 0.9

for each batch bb. Then, by Chernoff’s inequality, strictly more than (q/2q/2)-many batches bb satisfy h1(f^[b],f)ϵ/3h_{1}(\hat{f}^{[b]},f)\leq\sqrt{\epsilon}/3 simultaneously, with probability at least 10.9q1-0.9^{q}. Under this event, suppose f^[b]\hat{f}^{[b^{\star}]} is returned. Then there exists some f^[b]\hat{f}^{[b^{\prime}]} such that h1(f^[b],f^[b])2ϵ/3h_{1}(\hat{f}^{[b^{\star}]},\hat{f}^{[b^{\prime}]})\leq 2\sqrt{\epsilon}/3 and h1(f^[b],f)ϵ/3h_{1}(\hat{f}^{[b^{\prime}]},f)\leq\sqrt{\epsilon}/3, so that by the triangle inequality, h1(f^[b],f)ϵh_{1}(\hat{f}^{[b^{\star}]},f)\leq\sqrt{\epsilon}. Finally, choosing qC2log(1/γ)q\geq C_{2}\log(1/\gamma) makes the probability of this event at least 1γ1-\gamma.
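
For concreteness, the selection step can be implemented along the following lines. This is a Python sketch of a rule consistent with the argument above (Definition 3.10 is the authoritative statement); fit_mle and hellinger are hypothetical helpers for the univariate log-concave MLE and the Hellinger distance h_1.

import numpy as np

def boosted_estimate(batches, fit_mle, hellinger, eps):
    """Batch-and-select boosting sketch: fit one estimate per batch, then return
    an estimate that lies within 2*sqrt(eps)/3 of more than half of the other
    batch estimates (such an estimate exists on the high-probability event above)."""
    fits = [fit_mle(batch) for batch in batches]
    q = len(fits)
    for f in fits:
        close = sum(hellinger(f, g) <= 2 * np.sqrt(eps) / 3 for g in fits if g is not f)
        if close > q / 2:
            return f
    return fits[0]   # fallback; not reached under the good event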

A.6 Proof of Proposition 3.12

We recall the statement below.

Statement

(Sample complexity of estimating 𝐖\mathbf{W} by PCA). Let pp be a zero-mean, log-concave probability density on d\mathbb{R}^{d}. Let Y(1),,Y(M)iidpY^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}p be samples, and suppose Moment assumption M1 is satisfied with eigenvalue separation δ>0\delta>0. Let ϵ>0\epsilon>0 and 0<γ1/e0<\gamma\leq 1/e, and define M~(d,γ):=dlog4(1/γ)log2(C0log2(1/γ))\tilde{M}(d,\gamma):=d\log^{4}(1/\gamma)\log^{2}\left(C_{0}\log^{2}(1/\gamma)\right). If

MC1M~(d,γ){C2𝚺op2dδ2ϵ2log4(1/γ)log2(C3𝚺op2δ2ϵ2log2(1/γ))},\displaystyle M\geq C_{1}\tilde{M}(d,\gamma)\vee\left\{\frac{C_{2}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{2}d}{\delta^{2}\epsilon^{2}}\log^{4}(1/\gamma)\log^{2}\left(\frac{C_{3}\,\|\mathbf{\Sigma}\|_{\mathrm{op}}^{2}}{\delta^{2}\epsilon^{2}}\log^{2}(1/\gamma)\right)\right\},

then with probability at least 1γ1-\gamma, PCA recovers vectors {w^1,,w^d}\{\hat{w}_{1},\dots,\hat{w}_{d}\} such that

w^iwi2ϵ,\|\hat{w}_{i}-w_{i}\|_{2}\leq\epsilon,

up to permutations and sign-flips.

Proof A.6.

Since wiw_{i} and w^i\hat{w}_{i} are eigenvectors of 𝚺\mathbf{\Sigma} and 𝚺^\hat{\mathbf{\Sigma}} respectively, the Davis-Kahan theorem (see Yu et al. [40], or Theorem 4.5.5 of Vershynin [36]) gives the bound

w^iwi223/2δ𝚺^𝚺op,\displaystyle\|\hat{w}_{i}-w_{i}\|_{2}\leq\frac{2^{3/2}}{\delta}\|\hat{\mathbf{\Sigma}}-\mathbf{\Sigma}\|_{\mathrm{op}},

for some permutation and choice of sign-flips of w1,,wdw_{1},\dots,w_{d}. Hence, the problem boils down to covariance estimation of log-concave random vectors [1].

To bring the problem into an isotropic setting, define random vectors V(j):=𝚺1/2Y(j)V^{(j)}:=\mathbf{\Sigma}^{-1/2}Y^{(j)} for j=1,,Mj=1,\dots,M, such that 𝔼V(j)V(j)T=𝐈\mathbb{E}V^{(j)}{V^{(j)}}^{T}=\mathbf{I}, and note that these still have log-concave densities. Let C,c>0C,c>0 be absolute constants. By Theorem 4.1 of Adamczak et al. [1], for any ϵ~(0,1)\tilde{\epsilon}\in(0,1) and t1t\geq 1, if

MCdt4ϵ~2log2(2t2ϵ~2),\displaystyle M\geq Cd\,\frac{t^{4}}{\tilde{\epsilon}^{2}}\log^{2}\left(\frac{2t^{2}}{\tilde{\epsilon}^{2}}\right),

then with probability at least 1ectd1-e^{-ct\sqrt{d}}, it holds that

(20) 1Mj=1MV(j)V(j)T𝐈opϵ~.\displaystyle\left\|\frac{1}{M}\sum_{j=1}^{M}V^{(j)}{V^{(j)}}^{T}-\mathbf{I}\right\|_{\mathrm{op}}\leq\tilde{\epsilon}.

For a γ(0,1/e]\gamma\in(0,1/e], we have log(1/γ)1\log(1/\gamma)\geq 1 so that setting t=Clog(1/γ)t=C^{\prime}\log(1/\gamma) for a large enough absolute constant C>0C^{\prime}>0 results in the failure probability being bounded above as ectdγe^{-ct\sqrt{d}}\leq\gamma. Hence, (20) holds with probability at least 1γ1-\gamma, provided

MC1dϵ~2log4(1/γ)log2(C0ϵ~2log2(1/γ)).\displaystyle M\geq\frac{C_{1}d}{\tilde{\epsilon}^{2}}\log^{4}(1/\gamma)\log^{2}\left(\frac{C_{0}}{\tilde{\epsilon}^{2}}\log^{2}(1/\gamma)\right).

Furthermore, letting ϵ~1\tilde{\epsilon}\uparrow 1 yields that

1Mj=1MV(j)V(j)T𝐈op1\displaystyle\left\|\frac{1}{M}\sum_{j=1}^{M}V^{(j)}{V^{(j)}}^{T}-\mathbf{I}\right\|_{\mathrm{op}}\leq 1

with probability at least 1γ1-\gamma provided MC1M~(d,γ)M\geq C_{1}\tilde{M}(d,\gamma). As a result, (20) also extends to ϵ~1\tilde{\epsilon}\geq 1. Finally, letting ϵ~=23/2ϵδ𝚺op1\tilde{\epsilon}=2^{-3/2}\epsilon\delta\,\|\mathbf{\Sigma}\|_{\mathrm{op}}^{-1} and adjusting the absolute constants gives the required result, since

𝚺^𝚺op\displaystyle\|\hat{\mathbf{\Sigma}}-\mathbf{\Sigma}\|_{\mathrm{op}} =1Mj=1MY(j)Y(j)T𝚺op\displaystyle=\left\|\frac{1}{M}\sum_{j=1}^{M}Y^{(j)}{Y^{(j)}}^{T}-\mathbf{\Sigma}\right\|_{\mathrm{op}}
=𝚺1/2(1Mj=1MV(j)V(j)T𝐈)𝚺1/2op\displaystyle=\left\|\mathbf{\Sigma}^{1/2}\left(\frac{1}{M}\sum_{j=1}^{M}V^{(j)}{V^{(j)}}^{T}-\mathbf{I}\right)\mathbf{\Sigma}^{1/2}\right\|_{\mathrm{op}}
𝚺opϵ~δ23/2ϵ.\displaystyle\leq\|\mathbf{\Sigma}\|_{\mathrm{op}}\tilde{\epsilon}\,\leq\frac{\delta}{2^{3/2}}\epsilon.

A.7 Proof of Proposition 3.13

We recall the statement below.

Statement

Let pp be a zero-mean probability density on d\mathbb{R}^{d} satisfying the orthogonal independent components property. Suppose, additionally, that Moment assumption M2 is satisfied with parameters μ4,μ9>0\mu_{4},\mu_{9}>0 and κ1\kappa\geq 1. Given ϵ>0\epsilon>0 and 3d3γ<13d^{-3}\leq\gamma<1, and samples Y(1),,Y(M)iidpY^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}p, if

MKdϵ2log(3/γ)K′′d2log2(3/γ)(3/γ)8/9M\geq\frac{K^{\prime}d}{\epsilon^{2}}\log(3/\gamma)\,\vee\,K^{\prime\prime}d^{2}\log^{2}(3/\gamma)\,\vee\,(3/\gamma)^{8/9}

for constants K,K′′>0K^{\prime},K^{\prime\prime}>0 depending on μ4,μ9,κ\mu_{4},\mu_{9},\kappa, then with probability at least 1γ1-\gamma, ICA recovers vectors {w^1,,w^d}\{\hat{w}_{1},\dots,\hat{w}_{d}\} such that

(1di=1dw^iwi22)1/2ϵ\left(\frac{1}{d}\sum_{i=1}^{d}\|\hat{w}_{i}-w_{i}\|_{2}^{2}\right)^{1/2}\leq\epsilon

for independent directions w1,,wdw_{1},\dots,w_{d} of pp.

Proof A.7.

Given vectors u,v𝕊d1u,v\in\mathbb{S}^{d-1}, Auddy and Yuan [3] consider the discrepancy sin(u,v)\sin\angle(u,v) in their bounds. We can relate this to the distance vu2\|v-u\|_{2} up to a sign-flip. More precisely, pick a θ{1,+1}\theta\in\{-1,+1\} such that uθv[0,1]u\cdot\theta v\in[0,1]. In this case,

θvu22=2(1uθv)2(1(uθv)2)=2sin2(u,v).\displaystyle\|\theta v-u\|_{2}^{2}=2(1-u\cdot\theta v)\leq 2\left(1-(u\cdot\theta v)^{2}\right)=2\sin^{2}\angle(u,v).

Next, the algorithm of Auddy and Yuan [3] produces vectors a^1,,a^dd\hat{a}_{1},\dots,\hat{a}_{d}\in\mathbb{R}^{d} which estimate the columns of 𝐀=𝐖T𝚺Z1/2\mathbf{A}=\mathbf{W}^{T}\mathbf{\Sigma}_{Z}^{1/2}. Recall that the columns of 𝐀\mathbf{A} are simply scalings of independent directions w1,,wdw_{1},\dots,w_{d}. Rescaling as a~i=a^i/a^i2\tilde{a}_{i}=\hat{a}_{i}/\|\hat{a}_{i}\|_{2} does not change the angles, and hence,

a~iwi22sin(a~i,wi)=2sin(a^i,wi)\|\tilde{a}_{i}-w_{i}\|_{2}\leq\sqrt{2}\sin\angle(\tilde{a}_{i},w_{i})=\sqrt{2}\sin\angle(\hat{a}_{i},w_{i})

(flipping the sign of wiw_{i} as required). To get orthonormal estimates w^1,,w^d\hat{w}_{1},\dots,\hat{w}_{d}, we form a matrix 𝐀~\tilde{\mathbf{A}} with columns a~1,,a~d\tilde{a}_{1},\dots,\tilde{a}_{d}, compute the singular value decomposition 𝐀~=𝐔𝐃𝐕T\tilde{\mathbf{A}}=\mathbf{U}\mathbf{D}\mathbf{V}^{T} and set 𝐖^T=𝐔𝐕T\hat{\mathbf{W}}^{T}=\mathbf{U}\mathbf{V}^{T}, which is the closest orthogonal matrix to 𝐀~\tilde{\mathbf{A}} in Frobenius norm. As a result,

𝐖^T𝐖TF\displaystyle\|\hat{\mathbf{W}}^{T}-\mathbf{W}^{T}\|_{F} 𝐔𝐕T𝐀~F+𝐀~𝐖TF\displaystyle\leq\|\mathbf{U}\mathbf{V}^{T}-\tilde{\mathbf{A}}\|_{F}+\|\tilde{\mathbf{A}}-\mathbf{W}^{T}\|_{F}
2𝐀~𝐖TF\displaystyle\leq 2\,\|\tilde{\mathbf{A}}-\mathbf{W}^{T}\|_{F}
=2d(d1i=1da~iwi22)1/2\displaystyle=2\sqrt{d}\,\textstyle{\left(d^{-1}\sum_{i=1}^{d}\|\tilde{a}_{i}-w_{i}\|_{2}^{2}\right)^{1/2}}
22d(d1i=1dsin2(a^i,wi))1/2.\displaystyle\leq 2\sqrt{2d}\,\textstyle{\left(d^{-1}\sum_{i=1}^{d}\sin^{2}\angle(\hat{a}_{i},w_{i})\right)^{1/2}}.

Finally, we required a bound on the 9th moment of SiS_{i} in Moment assumption M2, instead of on some (8+ϵ)(8+\epsilon)th moment, for the sake of simplicity. Using these facts, and setting δ\delta in Theorem 3.4 of Auddy and Yuan [3] to γ/3\gamma/3, gives the desired form of the result.
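
The orthogonalization step used above (rescaling the estimated columns and projecting the resulting matrix onto the orthogonal group via its singular value decomposition) is straightforward to carry out numerically; a minimal numpy sketch with illustrative names follows.

import numpy as np

def nearest_orthogonal(A_tilde):
    """Closest orthogonal matrix in Frobenius norm: with SVD A_tilde = U D V^T,
    the projection onto the orthogonal group is U V^T."""
    U, _, Vt = np.linalg.svd(A_tilde)
    return U @ Vt

# Example: rescale estimated columns to unit norm, then orthogonalize.
rng = np.random.default_rng(0)
A_hat = rng.standard_normal((3, 3))
A_tilde = A_hat / np.linalg.norm(A_hat, axis=0)
W_hat_T = nearest_orthogonal(A_tilde)
print(np.allclose(W_hat_T @ W_hat_T.T, np.eye(3)))   # True: orthogonal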

A.8 Proof of Lemma 3.14

We recall the statement below.

Statement

Let pp be a zero-mean probability density on d\mathbb{R}^{d}, satisfying Smoothness assumption S1. Let 𝐑d×d\mathbf{R}\in\mathbb{R}^{d\times d} be any orthogonal matrix. Then,

KL(p||𝐑Tp)[φ]αd(1+α)/2𝚺op(1+α)/2𝐈𝐑op1+α.\displaystyle\mathrm{KL}(p\,||\,\mathbf{R}^{T}_{\sharp}p)\leq[\nabla\varphi]_{\alpha}\,d^{(1+\alpha)/2}\,\|\mathbf{\Sigma}\|_{\mathrm{op}}^{(1+\alpha)/2}\,\|\mathbf{I}-\mathbf{R}\|^{1+\alpha}_{\mathrm{op}}.
Proof A.8.

Recall that 𝐑Tp(x)=p(𝐑x)=p𝐑(x)\mathbf{R}^{T}_{\sharp}p(x)=p(\mathbf{R}x)=p\circ\mathbf{R}(x). The KL divergence

KL(p||p𝐑)\displaystyle\mathrm{KL}(p\,||\,p\circ\mathbf{R}) =dp(x)(logp(x)logp(𝐑x))𝑑x,\displaystyle=\int_{\mathbb{R}^{d}}p(x)\left(\log p(x)-\log p(\mathbf{R}x)\right)\,dx,
(21) =dp(x)(φ(𝐑x)φ(x))𝑑x.\displaystyle=\int_{\mathbb{R}^{d}}p(x)\left(\varphi(\mathbf{R}x)-\varphi(x)\right)\,dx.

where φ(x)=logp(x)\varphi(x)=-\log p(x). Since KL is always non-negative, an upper bound suffices to show closeness between pp and p𝐑p\circ\mathbf{R}.

Using Smoothness assumption S1, i.e., φC1,α(d)\varphi\in C^{1,\alpha}(\mathbb{R}^{d}) with [φ]α<[\nabla\varphi]_{\alpha}<\infty, we get a Taylor-like expansion

(22) φ(y)φ(x)\displaystyle\varphi(y)-\varphi(x) φ(x)(yx)+[φ]αyx21+α.\displaystyle\leq\nabla\varphi(x)\cdot(y-x)+[\nabla\varphi]_{\alpha}\|y-x\|_{2}^{1+\alpha}.

for x,ydx,y\in\mathbb{R}^{d}. Setting y=𝐑xy=\mathbf{R}x, and plugging (22) into (21) yields

dp(x)(φ(𝐑x)φ(x))𝑑x\displaystyle\int_{\mathbb{R}^{d}}p(x)\left(\varphi(\mathbf{R}x)-\varphi(x)\right)\,dx
(23) dφ(x)(𝐑xx)p(x)𝑑x+[φ]αd𝐑xx21+αp(x)𝑑x.\displaystyle\leq\int_{\mathbb{R}^{d}}\nabla\varphi(x)\cdot(\mathbf{R}x-x)\,p(x)dx+[\nabla\varphi]_{\alpha}\int_{\mathbb{R}^{d}}\|\mathbf{R}x-x\|_{2}^{1+\alpha}\,p(x)dx.

The second term can be handled as

d𝐑xx21+αp(x)𝑑x\displaystyle\int_{\mathbb{R}^{d}}\|\mathbf{R}x-x\|_{2}^{1+\alpha}\,p(x)dx 𝐈𝐑op1+αdx21+αp(x)𝑑x\displaystyle\leq\|\mathbf{I}-\mathbf{R}\|_{\mathrm{op}}^{1+\alpha}\int_{\mathbb{R}^{d}}\|x\|_{2}^{1+\alpha}\,p(x)dx
=𝐈𝐑op1+α𝔼XpX21+α\displaystyle=\|\mathbf{I}-\mathbf{R}\|_{\mathrm{op}}^{1+\alpha}\,\mathbb{E}_{X\sim p}\|X\|_{2}^{1+\alpha}
𝐈𝐑op1+α(𝔼XpX22)1+α2\displaystyle\leq\|\mathbf{I}-\mathbf{R}\|_{\mathrm{op}}^{1+\alpha}\,\left(\mathbb{E}_{X\sim p}\|X\|_{2}^{2}\right)^{\frac{1+\alpha}{2}}
=𝐈𝐑op1+α(tr𝚺)1+α2,\displaystyle=\|\mathbf{I}-\mathbf{R}\|_{\mathrm{op}}^{1+\alpha}\,\left(\mathrm{tr}\,\mathbf{\Sigma}\right)^{\frac{1+\alpha}{2}},

since 1+α21+\alpha\leq 2.

Now consider the first term in (23). Note that p(x)=φ(x)p(x)\nabla p(x)=-\nabla\varphi(x)p(x), and as a result,

dφ(x)(𝐑xx)p(x)𝑑x=dp(x)(𝐑xx)dx\int_{\mathbb{R}^{d}}\nabla\varphi(x)\cdot(\mathbf{R}x-x)\,p(x)dx=\int_{\mathbb{R}^{d}}-\nabla p(x)\cdot(\mathbf{R}x-x)\,dx

is finite by the integrability conditions in Smoothness assumption S1. We use integration by parts, and write

dp(x)(𝐑xx)dx=dp(x)(𝐑𝐈)x𝑑x+limρ𝔹ρp(x)(𝐑xx)ν(x)dS(x),\displaystyle\int_{\mathbb{R}^{d}}-\nabla p(x)\cdot(\mathbf{R}x-x)\,dx=\int_{\mathbb{R}^{d}}\,p(x)\nabla\cdot(\mathbf{R}-\mathbf{I})x\,dx+\lim_{\rho\to\infty}\int_{\partial\mathbb{B}_{\rho}}-p(x)\,(\mathbf{R}x-x)\cdot\nu(x)\,dS(x),

where 𝔹ρ\mathbb{B}_{\rho} is the Euclidean ball of radius ρ\rho centered at the origin, ν(x)=x/x2\nu(x)=x/\|x\|_{2} is the outward normal vector field on 𝔹ρ\partial\mathbb{B}_{\rho}, and SS is the (d1)(d-1)-dimensional surface area measure. We claim that the boundary term is zero; to see this, define

Q(ρ):=𝔹ρx2p(x)𝑑S(x)=𝕊d1p(ρθ)ρd𝑑S(θ)Q(\rho):=\int_{\partial\mathbb{B}_{\rho}}\|x\|_{2}\,p(x)\,dS(x)=\int_{\mathbb{S}^{d-1}}p(\rho\theta)\,\rho^{d}\,dS(\theta)

for ρ>0\rho>0, and note that Q(ρ)Q(\rho) controls the boundary term as

|𝔹ρp(x)(𝐑xx)ν(x)dS(x)|\displaystyle\left|\int_{\partial\mathbb{B}_{\rho}}-p(x)\,(\mathbf{R}x-x)\cdot\nu(x)\,dS(x)\right| 𝔹ρp(x)𝐑xx2𝑑S(x)\displaystyle\leq\int_{\partial\mathbb{B}_{\rho}}\,p(x)\|\mathbf{R}x-x\|_{2}\,dS(x)
𝐈𝐑opQ(ρ).\displaystyle\leq\|\mathbf{I}-\mathbf{R}\|_{\mathrm{op}}\,Q(\rho).

It follows from the integrability conditions in Smoothness assumption S1 that QQ and its derivative QQ^{\prime} are integrable on (0,)(0,\infty), and as a consequence, limρQ(ρ)=0\lim_{\rho\to\infty}Q(\rho)=0 (see Lemma B.9 for details).

Hence,

dp(x)(𝐑xx)dx\displaystyle\int_{\mathbb{R}^{d}}-\nabla p(x)\cdot(\mathbf{R}x-x)\,dx =dp(x)(𝐑𝐈)x𝑑x\displaystyle=\int_{\mathbb{R}^{d}}p(x)\,\nabla\cdot(\mathbf{R}-\mathbf{I})x\,dx
=dp(x)tr(𝐑𝐈)𝑑x=tr(𝐑)tr(𝐈).\displaystyle=\int_{\mathbb{R}^{d}}p(x)\mathrm{tr}(\mathbf{R}-\mathbf{I})\,dx=\mathrm{tr}(\mathbf{R})-\mathrm{tr}(\mathbf{I}).

But note that for any orthogonal 𝐑\mathbf{R},

tr(𝐑)=i=1dei𝐑eii=1dei2𝐑ei2=d=tr(𝐈),\displaystyle\mathrm{tr}(\mathbf{R})=\sum_{i=1}^{d}e_{i}\cdot\mathbf{R}e_{i}\leq\sum_{i=1}^{d}\|e_{i}\|_{2}\|\mathbf{R}e_{i}\|_{2}=d=\mathrm{tr}(\mathbf{I}),

where e1,,ede_{1},\dots,e_{d} are the standard basis vectors in d\mathbb{R}^{d}. We conclude that the first term in (23) is non-positive, and combining these bounds with tr𝚺d𝚺op\mathrm{tr}\,\mathbf{\Sigma}\leq d\|\mathbf{\Sigma}\|_{\mathrm{op}} gives the desired result.

A.9 Proof of Lemma 3.15

We recall the statement below.

Statement

Let pp be a zero-mean probability density on d\mathbb{R}^{d}, satisfying Smoothness assumption S2. Let 𝐀d×d\mathbf{A}\in\mathbb{R}^{d\times d} be invertible with 𝐀1opB\|\mathbf{A}^{-1}\|_{\mathrm{op}}\leq B. Then,

TV(p,𝐀p)12𝐀ppL112[(1+B)pL1d+d]𝚺op1/4𝐈𝐀op1/2.\displaystyle\mathrm{TV}(p,\mathbf{A}_{\sharp}p)\equiv\frac{1}{2}\|\mathbf{A}_{\sharp}p-p\|_{L^{1}}\leq\frac{1}{2}\left[(1+B)\|\nabla p\|_{L^{1}}\sqrt{d}+d\right]\|\mathbf{\Sigma}\|_{\mathrm{op}}^{1/4}\,\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}^{1/2}.
Proof A.9.

Let q=𝐀pq=\mathbf{A}_{\sharp}p be the transformed density. Consider the standard Gaussian density g(x)=(2π)d/2e12x22g(x)=(2\pi)^{-d/2}\,e^{-\frac{1}{2}\|x\|_{2}^{2}} on d\mathbb{R}^{d}, and for any ϵ>0\epsilon>0 (to be fixed later), denote the scaled version by gϵ(x)=ϵdg(x/ϵ)g_{\epsilon}(x)=\epsilon^{-d}g(x/\epsilon). We can decompose

(24) qpL1\displaystyle\|q-p\|_{L^{1}} qqgϵL1+qgϵpgϵL1+pgϵpL1.\displaystyle\leq\|q-q*g_{\epsilon}\|_{L^{1}}+\|q*g_{\epsilon}-p*g_{\epsilon}\|_{L^{1}}+\|p*g_{\epsilon}-p\|_{L^{1}}.

To bound the second term of (24), we use Lemma B.11 and Lemma B.13, and get

(25) qgϵpgϵL1dϵW1(q,p).\displaystyle\|q*g_{\epsilon}-p*g_{\epsilon}\|_{L^{1}}\leq\frac{\sqrt{d}}{\epsilon}\,W_{1}(q,p).

The right-hand side is now a problem of stability in the Wasserstein-1 distance, and is handled by Lemma B.7. Noting that tr(𝚺)𝚺opd\mathrm{tr}(\mathbf{\Sigma})\leq\|\mathbf{\Sigma}\|_{\mathrm{op}}d, Lemma B.7 together with (25) give the bound

qgϵpgϵL1dϵ𝚺op1/2𝐈𝐀op,\displaystyle\|q*g_{\epsilon}-p*g_{\epsilon}\|_{L^{1}}\leq\frac{d}{\epsilon}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{1/2}\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}},

which concludes the analysis of the second term of (24). Now, we handle the first and third terms of (24), making use of the assumed smoothness (S2) of pp.

Lemma B.15 directly bounds the third term of (24). To address the first term, we apply Lemma B.15 to q(x)=|det𝐀1|p(𝐀1x)q(x)=|\det\mathbf{A}^{-1}|\,p(\mathbf{A}^{-1}x), noting that qL1𝐀1oppL1\|\nabla q\|_{L^{1}}\leq\|\mathbf{A}^{-1}\|_{\mathrm{op}}\|\nabla p\|_{L^{1}}. This finally yields the bound

qpL1(1+𝐀1op)pL1dϵ+dϵ𝚺op1/2𝐈𝐀op.\displaystyle\|q-p\|_{L^{1}}\leq(1+\|\mathbf{A}^{-1}\|_{\mathrm{op}})\|\nabla p\|_{L^{1}}\sqrt{d}\,\epsilon+\frac{d}{\epsilon}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{1/2}\,{\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}}.

The following choice completes the proof of Lemma 3.15.

ϵ=𝚺op1/4𝐈𝐀op1/2.\epsilon=\|\mathbf{\Sigma}\|_{\mathrm{op}}^{1/4}\,\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}^{1/2}.

A.10 Proof of Corollary 3.16

We recall the statement below.

Statement (brief)

Let pp be a zero-mean probability density on d\mathbb{R}^{d}, satisfying the orthogonal independent components property, and let p^\hat{p} be the proposed estimator. Then, for any ϵ>0\epsilon>0 and 0<γ<10<\gamma<1, we have

hd2(p^,p)ϵh_{d}^{2}(\hat{p},p)\leq\epsilon

with probability at least 1γ1-\gamma whenever

(26) Nsups𝕊d1N(ϵ/4d,γ/2d;ps),N\geq\sup_{s\in\mathbb{S}^{d-1}}N^{*}\left(\epsilon/4d,\,\gamma/2d;\,p_{s}\right),
and
M{M(14d2+α1+α[φ]α11+α𝚺op1/2ϵ11+α,γ/2;p)if S1 holdsM(44[pL1d7/4+d9/4]2𝚺op1/2ϵ2,γ/2;p)if S2 holds.M\geq\begin{cases}M^{*}\left(\frac{1}{4}d^{-\frac{2+\alpha}{1+\alpha}}\,[\nabla\varphi]_{\alpha}^{-\frac{1}{1+\alpha}}\,\|\mathbf{\Sigma}\|_{\mathrm{op}}^{-1/2}\,\epsilon^{\frac{1}{1+\alpha}}\,,\,\gamma/2;\,p\right)&\text{if S1 holds}\\ M^{*}\left(4^{-4}\left[\|\nabla p\|_{L^{1}}d^{7/4}+d^{9/4}\right]^{-2}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{-1/2}\epsilon^{2}\,,\,\gamma/2;\,p\right)&\text{if S2 holds}.\end{cases}
Proof A.10.

The inequality for NN is the same as in Theorem 3.8, and follows the same argument (see A.3). Now consider the inequalities for MM. Suppose Smoothness assumption S1 holds. Then,

hd2(𝐑^p,p)\displaystyle h_{d}^{2}(\hat{\mathbf{R}}_{\sharp}p,p) 12[φ]αd(1+α)/2𝚺op(1+α)/2𝐈𝐑^op1+α\displaystyle\leq\frac{1}{2}[\nabla\varphi]_{\alpha}\,d^{(1+\alpha)/2}\,\|\mathbf{\Sigma}\|_{\mathrm{op}}^{(1+\alpha)/2}\,\|\mathbf{I}-\hat{\mathbf{R}}\|^{1+\alpha}_{\mathrm{op}}
12[φ]αd1+α𝚺op(1+α)/2(d1i=1dw^iwi22)1+α2,\displaystyle\leq\frac{1}{2}[\nabla\varphi]_{\alpha}\,d^{1+\alpha}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{(1+\alpha)/2}\,\left(d^{-1}\textstyle{\sum_{i=1}^{d}\|\hat{w}_{i}-w_{i}\|_{2}^{2}}\right)^{\frac{1+\alpha}{2}},

and similarly for hd2(𝐑^Tp,p)h_{d}^{2}\left(\hat{\mathbf{R}}^{T}_{\sharp}p,p\right). For any ϵ~>0\tilde{\epsilon}>0, note that MM(ϵ~,γ/2;p)M\geq M^{*}(\tilde{\epsilon},\gamma/2;p) samples are sufficient to give

dhd(𝐑^p,p)+hd(𝐑^Tp,p)d+12d(1+α)/2[φ]α1/2𝚺op(1+α)/4ϵ~(1+α)/2\sqrt{d}\,h_{d}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right)+h_{d}\left(\hat{\mathbf{R}}^{T}_{\sharp}p\,,\,p\right)\leq\frac{\sqrt{d}+1}{2}d^{(1+\alpha)/2}\,[\nabla\varphi]_{\alpha}^{1/2}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{(1+\alpha)/4}\,\tilde{\epsilon}^{(1+\alpha)/2}

with probability at least 1γ/21-\gamma/2. Setting

ϵ~=14d2+α1+α[φ]α11+α𝚺op1/2ϵ11+α\tilde{\epsilon}=\frac{1}{4}d^{-\frac{2+\alpha}{1+\alpha}}\,[\nabla\varphi]_{\alpha}^{-\frac{1}{1+\alpha}}\,\|\mathbf{\Sigma}\|_{\mathrm{op}}^{-1/2}\,\epsilon^{\frac{1}{1+\alpha}}

then gives

{dhd(𝐑^p,p)+hd(𝐑^Tp,p)>ϵ/2}γ/2\mathbb{P}\left\{\sqrt{d}\,h_{d}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right)+h_{d}\left(\hat{\mathbf{R}}^{T}_{\sharp}p\,,\,p\right)>\sqrt{\epsilon}/2\right\}\leq\gamma/2

as desired. Now, if Smoothness assumption S2 holds, then

hd2(𝐑^p,p)\displaystyle h_{d}^{2}(\hat{\mathbf{R}}_{\sharp}p,p) [pL1d+d]d1/4𝚺op1/4(d1i=1dw^iwi22)1/4.\displaystyle\leq\left[\|\nabla p\|_{L^{1}}\sqrt{d}+d\right]\,d^{1/4}\,\|\mathbf{\Sigma}\|_{\mathrm{op}}^{1/4}\,\left(d^{-1}\textstyle{\sum_{i=1}^{d}\|\hat{w}_{i}-w_{i}\|_{2}^{2}}\right)^{1/4}.

As before, we consider ϵ~>0\tilde{\epsilon}>0 and note that MM(ϵ~,γ/2;p)M\geq M^{*}(\tilde{\epsilon},\gamma/2;p) samples are sufficient to give

dhd(𝐑^p,p)+hd(𝐑^Tp,p)2[pL1d7/4+d9/4]1/2𝚺op1/8ϵ~1/4\sqrt{d}\,h_{d}\left(\hat{\mathbf{R}}_{\sharp}p\,,\,p\right)+h_{d}\left(\hat{\mathbf{R}}^{T}_{\sharp}p\,,\,p\right)\leq 2\,\left[\|\nabla p\|_{L^{1}}d^{7/4}+d^{9/4}\right]^{1/2}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{1/8}\,\tilde{\epsilon}^{1/4}

with probability at least 1γ/21-\gamma/2. Setting

ϵ~=44[pL1d7/4+d9/4]2𝚺op1/2ϵ2\tilde{\epsilon}=4^{-4}\left[\|\nabla p\|_{L^{1}}d^{7/4}+d^{9/4}\right]^{-2}\|\mathbf{\Sigma}\|_{\mathrm{op}}^{-1/2}\epsilon^{2}

concludes the proof.

A.11 Proof of Theorem 3.18

We recall the statement below.

Statement

Let PP be a zero-mean distribution in 𝒫d\mathcal{P}_{d}, satisfying the orthogonal independent components property with independent directions w1,,wdw_{1},\dots,w_{d}. Let p^\hat{p} be the proposed estimator computed from samples X(1),,X(N),Y(1),,Y(M)iidPX^{(1)},\dots,X^{(N)},Y^{(1)},\dots,Y^{(M)}\overset{\mathrm{iid}}{\sim}P. We have the error bound

hd(p^,ψd(P))d\displaystyle h_{d}(\hat{p},\psi_{d}^{*}(P))\leq\sqrt{d} (maxi[d]h1(p^w^i,ψ1(Pw^i))+maxi[d]h1(ψ1(Pw^i),ψ1(Pwi)))\displaystyle\left(\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},\psi_{1}^{*}(P_{\hat{w}_{i}}))+\max_{i\in[d]}h_{1}(\psi_{1}^{*}(P_{\hat{w}_{i}}),\psi_{1}^{*}(P_{w_{i}}))\right)
+hd(ψd(𝐑^P),ψd(P)),\displaystyle+h_{d}(\psi_{d}^{*}(\hat{\mathbf{R}}_{\sharp}P),\psi_{d}^{*}(P)),

where 𝐑^=𝐖^T𝐖\hat{\mathbf{R}}=\hat{\mathbf{W}}^{T}\mathbf{W}.

Proof A.11.

Since PP satisfies the orthogonal independent components property with independent directions w1,,wdw_{1},\dots,w_{d}, it follows from Theorem 2 of Samworth and Yuan [32] that the density ψd(P)\psi_{d}^{*}(P) factorizes as

ψd(P)(x)=iψ1(Pwi)(wix).\psi_{d}^{*}(P)(x)=\prod_{i}\psi_{1}^{*}(P_{w_{i}})(w_{i}\cdot x).

Similar to the proof of Theorem 3.5, we then decompose

hd\displaystyle h_{d} (p^,ψd(P))=hd(ip^w^iπw^i,iψ1(Pwi)πwi)\displaystyle(\hat{p},\psi_{d}^{*}(P))=h_{d}\left(\prod_{i}\hat{p}_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}},\prod_{i}\psi_{1}^{*}(P_{w_{i}})\circ\pi_{w_{i}}\right)
hd(ip^w^iπw^i,iψ1(Pw^i)πw^i)(I)+hd(iψ1(Pw^i)πw^i,iψ1(Pwi)πw^i)(II)\displaystyle\leq\underbrace{h_{d}\left(\prod_{i}\hat{p}_{\hat{w}_{i}}\circ\pi_{\hat{w}_{i}},\prod_{i}\psi_{1}^{*}(P_{\hat{w}_{i}})\circ\pi_{\hat{w}_{i}}\right)}_{(\mathrm{I})}+\underbrace{h_{d}\left(\prod_{i}\psi_{1}^{*}(P_{\hat{w}_{i}})\circ\pi_{\hat{w}_{i}},\prod_{i}\psi_{1}^{*}(P_{{w}_{i}})\circ\pi_{\hat{w}_{i}}\right)}_{(\mathrm{II})}
+hd(iψ1(Pwi)πw^i,iψ1(Pwi)πwi)(III).\displaystyle\phantom{=}\,+\underbrace{h_{d}\left(\prod_{i}\psi_{1}^{*}(P_{{w}_{i}})\circ\pi_{\hat{w}_{i}},\prod_{i}\psi_{1}^{*}(P_{{w}_{i}})\circ\pi_{{w}_{i}}\right)}_{(\mathrm{III})}.

By the tensorization argument of Lemma 3.1,

(I)dmaxi[d]h1(p^w^i,ψ1(Pw^i)).(\mathrm{I})\leq\sqrt{d}\,\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},\psi_{1}^{*}(P_{\hat{w}_{i}})).

An identical calculation also yields

(II)dmaxi[d]h1(ψ1(Pw^i),ψ1(Pwi)).(\mathrm{II})\leq\sqrt{d}\,\max_{i\in[d]}h_{1}(\psi_{1}^{*}(P_{\hat{w}_{i}}),\psi_{1}^{*}(P_{w_{i}})).

Finally, noting that iψ1(Pwi)πw^i(x)=iψ1(Pwi)(w^ix)=ψd(P)(𝐑^Tx)\prod_{i}\psi_{1}^{*}(P_{w_{i}})\circ\pi_{\hat{w}_{i}}(x)=\prod_{i}\psi_{1}^{*}(P_{w_{i}})(\hat{w}_{i}\cdot x)=\psi_{d}^{*}(P)(\hat{\mathbf{R}}^{T}x), we can rewrite

(III)=hd(𝐑^ψd(P),ψd(P)).\displaystyle(\mathrm{III})=h_{d}(\hat{\mathbf{R}}_{\sharp}\psi_{d}^{*}(P),\psi_{d}^{*}(P)).

Since the log-concave projection ψd\psi_{d}^{*} commutes with affine maps [15], we have that 𝐑^ψd(P)=ψd(𝐑^P)\hat{\mathbf{R}}_{\sharp}\psi_{d}^{*}(P)=\psi_{d}^{*}(\hat{\mathbf{R}}_{\sharp}P), which concludes the proof.

A.12 Proof of Theorem 3.19

We recall the statement below.

Statement (brief)

Let PP be a zero-mean distribution in 𝒫d\mathcal{P}_{d} satisfying the orthogonal independent components property. Suppose that Moment assumption M2 is satisfied, ηP>0\eta_{P}>0, and Lq<L_{q}<\infty for some q>1q>1. Let p^\hat{p} be the proposed estimator, invoking log-concave MLE and ICA. Then, for any ϵ>0\epsilon>0 and 6d3<γ<16d^{-3}<\gamma<1, we have

hd2(p^,ψd(P))ϵh_{d}^{2}(\hat{p},\psi_{d}^{*}(P))\leq\epsilon

with probability at least 1γ1-\gamma whenever

Nlog3qq1N(KqLqηP8d2ϵγ)2qq1,\frac{N}{\log^{\frac{3q}{q-1}}N}\geq\left(K_{q}\sqrt{\frac{L_{q}}{\eta_{P}}}\,\frac{8d^{2}}{\epsilon\gamma}\right)^{\frac{2q}{q-1}},

and

MKCd′′L12ηP2ϵ4log(6/γ)K′′d2log2(6/γ)(6/γ)8/9.M\geq\frac{K^{\prime}C_{d}^{\prime\prime}L_{1}^{2}}{\eta_{P}^{2}\,\epsilon^{4}}\log(6/\gamma)\,\vee\,K^{\prime\prime}d^{2}\log^{2}(6/\gamma)\,\vee\,(6/\gamma)^{8/9}.
Proof A.12.

The proof follows the same pattern as that of Theorem 3.8. We use the error bound from Theorem 3.18:

hd(p^,ψd(P))\displaystyle h_{d}(\hat{p},\psi_{d}^{*}(P))\leq dmaxi[d]h1(p^w^i,ψ1(Pw^i))(I)+dmaxi[d]h1(ψ1(Pw^i),ψ1(Pwi))(II)\displaystyle\underbrace{\sqrt{d}\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},\psi_{1}^{*}(P_{\hat{w}_{i}}))}_{(\mathrm{I})}+\underbrace{\sqrt{d}\,\max_{i\in[d]}h_{1}(\psi_{1}^{*}(P_{\hat{w}_{i}}),\psi_{1}^{*}(P_{w_{i}}))}_{(\mathrm{II})}
+hd(ψd(𝐑^P),ψd(P))(III).\displaystyle+\underbrace{h_{d}(\psi_{d}^{*}(\hat{\mathbf{R}}_{\sharp}P),\psi_{d}^{*}(P))}_{(\mathrm{III})}.

First, we show that dmaxi[d]h1(p^w^i,ψ1(Pw^i))ϵ/2\sqrt{d}\,\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},\psi_{1}^{*}(P_{\hat{w}_{i}}))\leq\sqrt{\epsilon}/2 holds with probability at least 1γ/21-\gamma/2, provided NN is large enough. By a conditioning argument on 𝐘=(Y(1),,Y(M))\mathbf{Y}=(Y^{(1)},\dots,Y^{(M)}) as before, we can treat w^1,,w^d\hat{w}_{1},\dots,\hat{w}_{d} as deterministic for this part of the proof.

Define the empirical distribution P^X:=N1j=1NδX(j)\hat{P}^{X}:=N^{-1}\sum_{j=1}^{N}\delta_{X^{(j)}}, and denote its ss-marginal by P^sX\hat{P}^{X}_{s} for s𝕊d1s\in\mathbb{S}^{d-1}. Since we used log-concave MLE to estimate the marginals in this setting, we can express p^w^i=ψ1(P^w^iX)\hat{p}_{\hat{w}_{i}}=\psi_{1}^{*}(\hat{P}^{X}_{\hat{w}_{i}}). Noting that P^w^iX\hat{P}^{X}_{\hat{w}_{i}} is itself an empirical distribution of Pw^iP_{\hat{w}_{i}} for each i=1,,di=1,\dots,d, Theorem 5 of Barber and Samworth [6] gives the bound

𝔼h12(p^w^i,ψ1(Pw^i))KqLqηPlog3/2NN1212q,\mathbb{E}\,h_{1}^{2}(\hat{p}_{\hat{w}_{i}},\psi_{1}^{*}(P_{\hat{w}_{i}}))\leq K_{q}\sqrt{\frac{L_{q}}{\eta_{P}}}\,\frac{\log^{3/2}N}{N^{\frac{1}{2}-\frac{1}{2q}}},

for a constant Kq>0K_{q}>0 depending only on qq. (In applying the theorem, we have used the simple facts that (𝔼|w^iX|q)1/qLq(\mathbb{E}\,|\hat{w}_{i}\cdot X|^{q})^{1/q}\leq L_{q} and ηPw^iηP\eta_{P_{\hat{w}_{i}}}\geq\eta_{P} for all ii). A direct application of Markov’s inequality also gives the tail bound

{h12(p^w^i,ψ1(Pw^i))ϵ~}1ϵ~KqLqηPlog3/2NN1212q\mathbb{P}\left\{h_{1}^{2}(\hat{p}_{\hat{w}_{i}},\psi_{1}^{*}(P_{\hat{w}_{i}}))\geq\tilde{\epsilon}\right\}\leq\frac{1}{\tilde{\epsilon}}\cdot K_{q}\sqrt{\frac{L_{q}}{\eta_{P}}}\,\frac{\log^{3/2}N}{N^{\frac{1}{2}-\frac{1}{2q}}}

for any ϵ~>0\tilde{\epsilon}>0. The right-hand side is bounded above by some γ~(0,1)\tilde{\gamma}\in(0,1) provided

Nlog3qq1N(KqLqηP1ϵ~γ~)2qq1.\frac{N}{\log^{\frac{3q}{q-1}}N}\geq\left(K_{q}\sqrt{\frac{L_{q}}{\eta_{P}}}\,\frac{1}{\tilde{\epsilon}\tilde{\gamma}}\right)^{\frac{2q}{q-1}}.

We set ϵ~=ϵ/4d\tilde{\epsilon}=\epsilon/4d and γ~=γ/2d\tilde{\gamma}=\gamma/2d, and use a union bound over i=1,,di=1,\dots,d to finally get

{dmaxi[d]h1(p^w^i,ψ1(Pw^i))>ϵ/2}γ/2.\mathbb{P}\left\{\sqrt{d}\,\max_{i\in[d]}h_{1}(\hat{p}_{\hat{w}_{i}},\psi_{1}^{*}(P_{\hat{w}_{i}}))>\sqrt{\epsilon}/2\right\}\leq\gamma/2.

Now consider term (II). By Theorem 2 of Barber and Samworth [6], we have that

h1(ψ1(Pw^i),ψ1(Pwi))C1[W1(Pw^i,Pwi)ηP]1/4h_{1}(\psi_{1}^{*}(P_{\hat{w}_{i}}),\psi_{1}^{*}(P_{w_{i}}))\leq C_{1}\left[\frac{W_{1}(P_{\hat{w}_{i}},P_{w_{i}})}{\eta_{P}}\right]^{1/4}

for an absolute constant C1>0C_{1}>0, where one can further simplify

W1(Pw^i,Pwi)𝔼XP|w^iXwiX|(𝔼XPX2)w^iwi2=L1w^iwi2.W_{1}(P_{\hat{w}_{i}},P_{w_{i}})\leq\mathbb{E}_{X\sim P}\,|\hat{w}_{i}\cdot X-w_{i}\cdot X|\leq\left(\mathbb{E}_{X\sim P}\|X\|_{2}\right)\,\|\hat{w}_{i}-w_{i}\|_{2}=L_{1}\|\hat{w}_{i}-w_{i}\|_{2}.

Theorem 2 of Barber and Samworth [6] can also be used to bound term (III) as

hd(ψd(𝐑^P),ψd(P))Cd[W1(𝐑^P,P)ηP]1/4,h_{d}(\psi_{d}^{*}(\hat{\mathbf{R}}_{\sharp}P),\psi^{*}_{d}(P))\leq C_{d}\left[\frac{W_{1}(\hat{\mathbf{R}}_{\sharp}P,P)}{\eta_{P}}\right]^{1/4},

where Cd>0C_{d}>0 depends only on dd, and we note that

W1(𝐑^P,P)𝔼XP𝐑^XX2(𝔼XPX2)𝐈𝐑^opL1𝐖^𝐖F.W_{1}(\hat{\mathbf{R}}_{\sharp}P,P)\leq\mathbb{E}_{X\sim P}\,\|\hat{\mathbf{R}}X-X\|_{2}\leq\left(\mathbb{E}_{X\sim P}\|X\|_{2}\right)\,\|\mathbf{I}-\hat{\mathbf{R}}\|_{\mathrm{op}}\leq L_{1}\|\hat{\mathbf{W}}-\mathbf{W}\|_{F}.

Putting these inequalities together gives

(II)+(III)Cd(L1ηP)1/4(d1i=1dw^iwi22)1/8(\mathrm{II})+(\mathrm{III})\leq C_{d}^{\prime}\left(\frac{L_{1}}{\eta_{P}}\right)^{1/4}\left(d^{-1}\textstyle{\sum_{i=1}^{d}\|\hat{w}_{i}-w_{i}\|_{2}^{2}}\right)^{1/8}

for some Cd>0C_{d}^{\prime}>0 depending only on dd. Applying Proposition 3.13 allows us to conclude that (II)+(III)ϵ/2(\mathrm{II})+(\mathrm{III})\leq\sqrt{\epsilon}/2 with probability at least 1γ/21-\gamma/2 provided

MCd′′KL12ηP2ϵ4log(6/γ)K′′d2log2(6/γ)(6/γ)8/9M\geq\frac{C_{d}^{\prime\prime}K^{\prime}L_{1}^{2}}{\eta_{P}^{2}\,\epsilon^{4}}\log(6/\gamma)\,\vee\,K^{\prime\prime}d^{2}\log^{2}(6/\gamma)\,\vee\,(6/\gamma)^{8/9}

for a suitably chosen Cd′′>0C_{d}^{\prime\prime}>0 depending only on dd.

Appendix B Auxiliary lemmas

Here, we prove the various lemmas invoked in Appendix A.

Lemma B.1.

Let p(x)=i=1dpwi(wix)p(x)=\prod_{i=1}^{d}p_{w_{i}}(w_{i}\cdot x), where each pwip_{w_{i}} is a univariate probability density, and consider any other probability density q:d[0,)q:\mathbb{R}^{d}\to[0,\infty). Then,

h12(pwk,qwk)hd2(p,q),\displaystyle h_{1}^{2}(p_{w_{k}},q_{w_{k}})\leq h_{d}^{2}(p,q),

for k[d]k\in[d].

The above lemma bounds the 1-dimensional Hellinger distance between marginals by the dd-dimensional Hellinger distance between the full densities. This is a special case of the data processing inequality for ff-divergences, but we nevertheless provide a proof below.

Proof B.2.

Writing

1hd2(p,q)=dp(x)q(x)𝑑x=dipwi(wix)q(x)𝑑x,\displaystyle 1-h_{d}^{2}(p,q)=\int_{\mathbb{R}^{d}}\sqrt{p(x)q(x)}dx=\int_{\mathbb{R}^{d}}\sqrt{\prod_{i}p_{w_{i}}(w_{i}\cdot x)\,q(x)}dx,

and changing variables z=𝐖xz=\mathbf{W}x gives

dipwi(zi)q¯(z)𝑑z,\displaystyle\int_{\mathbb{R}^{d}}\sqrt{\prod_{i}p_{w_{i}}(z_{i})\,\bar{q}(z)}dz,

where q¯(z):=q(𝐖Tz)\bar{q}(z):=q(\mathbf{W}^{T}z). Splitting the components of zz as zkz_{k} and z[d]\kz_{{[d]\backslash k}} allows us to rewrite the above expression as an iterated integral, and use Cauchy-Schwarz on the inner integral:

d1ikpwi(zi)q¯(zk,z[d]\k)𝑑z[d]\kpwk(zk)𝑑zk\displaystyle\int_{\mathbb{R}}\int_{\mathbb{R}^{d-1}}\sqrt{\prod_{i\neq k}p_{w_{i}}(z_{i})\,\bar{q}(z_{k},z_{{[d]\backslash k}})}\,dz_{{[d]\backslash k}}\,\,\sqrt{p_{w_{k}}(z_{k})}\,dz_{k}
{d1ikpwi(zi)dz[d]\k}1/2{d1q¯(zk,z[d]\k)𝑑z[d]\k}1/2pwk(zk)𝑑zk.\displaystyle\leq\int_{\mathbb{R}}\left\{\int_{\mathbb{R}^{d-1}}\prod_{i\neq k}p_{w_{i}}(z_{i})\,dz_{{[d]\backslash k}}\right\}^{1/2}\,\left\{\int_{\mathbb{R}^{d-1}}\bar{q}(z_{k},z_{{[d]\backslash k}})\,dz_{{[d]\backslash k}}\right\}^{1/2}\,\sqrt{p_{w_{k}}(z_{k})}\,dz_{k}.

Since each pwip_{w_{i}} is a probability density,

d1ikpwi(zi)dz[d]\k=ikpwi(zi)𝑑zi=1.\displaystyle\int_{\mathbb{R}^{d-1}}\prod_{i\neq k}p_{w_{i}}(z_{i})\,dz_{{[d]\backslash k}}=\prod_{i\neq k}\int_{\mathbb{R}}p_{w_{i}}(z_{i})\,dz_{i}=1.

On the other hand, the inner integral marginalizes q¯\bar{q}, and

q¯ek(zk):=d1q¯(zk,z[d]\k)𝑑z[d]\k\displaystyle\bar{q}_{e_{k}}(z_{k}):=\int_{\mathbb{R}^{d-1}}\bar{q}(z_{k},z_{{[d]\backslash k}})\,dz_{{[d]\backslash k}}

is the marginal of q¯\bar{q} along the direction eke_{k}. Noting that q¯ek=qwk\bar{q}_{e_{k}}=q_{w_{k}} and chaining together the above inequalities gives the desired result:

1hd2(p,q)qwk(zk)pwk(zk)𝑑zk=1h12(pwk,qwk).\displaystyle 1-h_{d}^{2}(p,q)\leq\int_{\mathbb{R}}\sqrt{q_{w_{k}}(z_{k})p_{w_{k}}(z_{k})}\,dz_{k}=1-h_{1}^{2}(p_{w_{k}},q_{w_{k}}).

Lemma B.3 (Jensen gap for univariate log-concave densities).

Let f:[0,)f:\mathbb{R}\to[0,\infty) be any univariate log-concave probability density, and suppose XfX\sim f. Then,

cVarX𝔼|X𝔼X|VarXc\,\sqrt{\mathrm{Var}X}\leq\mathbb{E}|X-\mathbb{E}X|\leq\sqrt{\mathrm{Var}X}

holds for c=1/1600c=1/1600.

Proof B.4.

Define the ‘standardized’ random variable Z:=σ1(Xa)Z:=\sigma^{-1}(X-a) where aa and σ\sigma are the mean and standard deviation of XX respectively. Denote the density of ZZ by gg, and note that gg is an isotropic log-concave density. By Theorem 5.14(a) of Lovász and Vempala [28] specialized to the case d=1d=1, we have the lower bound g(z)29|z|g(0)g(z)\geq 2^{-9|z|}g(0) for |z|1/9|z|\leq 1/9, which can be used to estimate

𝔼|Z|=|z|g(z)𝑑z1/91/9|z|g(z)𝑑zg(0)1/91/9|z|29|z|𝑑zg(0)200.\mathbb{E}|Z|=\int_{\mathbb{R}}|z|g(z)\,dz\geq\int_{-1/9}^{1/9}|z|g(z)\,dz\geq g(0)\int_{-1/9}^{1/9}|z|2^{-9|z|}\,dz\geq\frac{g(0)}{200}.

Additionally, by Lemma 5.5(b) of Lovász and Vempala [28], g(0)18g(0)\geq\frac{1}{8}. We conclude that 𝔼|Z|c\mathbb{E}|Z|\geq c for c=1/1600c=1/1600, which can be transformed to give

𝔼|X𝔼X|=σ𝔼|Z|cσ=cVarX.\mathbb{E}|X-\mathbb{E}X|=\sigma\,\mathbb{E}|Z|\geq c\sigma=c\sqrt{\mathrm{Var}X}.

The upper bound on 𝔼|X𝔼X|\mathbb{E}|X-\mathbb{E}X| is immediate from Jensen’s inequality or the Cauchy-Schwarz inequality.
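As an informal sanity check of Lemma B.3 (not part of the proof), the two inequalities can be verified by Monte Carlo for a simple log-concave density such as the exponential distribution; the parameters below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10**6)   # Exp(1) is log-concave
mad = np.mean(np.abs(x - x.mean()))           # estimate of E|X - EX| (true value 2/e ~ 0.736)
sd = x.std()                                   # estimate of sqrt(Var X) (true value 1)
c = 1.0 / 1600
print(c * sd <= mad <= sd)                    # both inequalities of Lemma B.3 hold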

Lemma B.5 (lower bound on η\eta for a log-concave density).

Let pp be any log-concave density on d\mathbb{R}^{d}, with mean μ\mu and covariance matrix 𝚺\mathbf{\Sigma}. Then,

ηpinfs𝕊d1𝔼Xp|s(Xμ)|cλmin(𝚺)1/2\eta_{p}\equiv\inf_{s\in\mathbb{S}^{d-1}}\mathbb{E}_{X\sim p}\,|s\cdot(X-\mu)|\geq c\,\lambda_{\min}(\mathbf{\Sigma})^{1/2}

holds for c=1/1600c=1/1600.

Proof B.6.

Let sargmins𝕊d1𝔼Xp|s(Xμ)|s_{*}\in\arg\min_{s\in\mathbb{S}^{d-1}}\mathbb{E}_{X\sim p}\,|s\cdot(X-\mu)| (which exists by compactness and continuity). For XpX\sim p, note that sXs_{*}\cdot X has a univariate log-concave density psp_{s_{*}}, so that one can apply Lemma B.3 to conclude that

ηp=𝔼|sXsμ|cVar(sX)1/2cλmin(𝚺)1/2.\displaystyle\eta_{p}=\mathbb{E}\,|s_{*}\cdot X-s_{*}\cdot\mu|\geq c\,\mathrm{Var}(s_{*}\cdot X)^{1/2}\geq c\,\lambda_{\min}(\mathbf{\Sigma})^{1/2}.

Lemma B.7 (Stability in W1W_{1}).

Let pp be a probability density on d\mathbb{R}^{d} with mean zero and covariance matrix 𝚺\mathbf{\Sigma}, and let 𝐀d×d\mathbf{A}\in\mathbb{R}^{d\times d}. Then,

W1(𝐀p,p)tr(𝚺)𝐀𝐈op.W_{1}(\mathbf{A}_{\sharp}p,\,p)\leq\sqrt{\mathrm{tr}(\mathbf{\Sigma})}\,\|\mathbf{A}-\mathbf{I}\|_{\mathrm{op}}.

Proof B.8.

A direct calculation yields

W1(𝐀p,p)\displaystyle W_{1}(\mathbf{A}_{\sharp}p,\,p) infT:ddmeasurable{dT(x)x2p(x)dx:Tp=𝐀p}\displaystyle\leq\inf_{\begin{subarray}{c}T:\mathbb{R}^{d}\to\mathbb{R}^{d}\\ \textrm{measurable}\end{subarray}}\left\{\int_{\mathbb{R}^{d}}\|T(x)-x\|_{2}\,p(x)dx\,:\,T_{\sharp}p=\mathbf{A}_{\sharp}p\right\}
d𝐀xx2p(x)𝑑x\displaystyle\leq\int_{\mathbb{R}^{d}}\|\mathbf{A}x-x\|_{2}\,p(x)dx
𝐈𝐀opdx2p(x)𝑑x\displaystyle\leq\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}\int_{\mathbb{R}^{d}}\|x\|_{2}\,p(x)dx
𝐈𝐀op𝔼XpX2.\displaystyle\leq\|\mathbf{I}-\mathbf{A}\|_{\mathrm{op}}\,\mathbb{E}_{X\sim p}\|X\|_{2}\,.

The result then follows because 𝔼X2𝔼X22=tr(𝚺)\mathbb{E}\|X\|_{2}\leq\sqrt{\mathbb{E}\|X\|_{2}^{2}}=\sqrt{\mathrm{tr}(\mathbf{\Sigma})}.
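For intuition (again an informal check rather than a proof), the bound of Lemma B.7 can be probed empirically in d=1, where the map 𝐀 reduces to multiplication by a scalar and a one-dimensional Wasserstein distance routine is available in SciPy; the numbers below are illustrative.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)            # zero mean, Var ~ 1, so tr(Sigma) ~ 1
a = 1.2                                      # the linear map A is multiplication by a in d = 1
w1 = wasserstein_distance(a * x, x)          # empirical W1 between A_# p and p (~ |a - 1| E|X| ~ 0.16)
bound = np.sqrt(np.var(x)) * abs(a - 1.0)    # sqrt(tr Sigma) * ||A - I||_op = 0.2
print(w1 <= bound)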

Lemma B.9 (Controlling the boundary integral).

Let pp be a continuously differentiable probability density on d\mathbb{R}^{d}, such that dx2p(x)𝑑x<\int_{\mathbb{R}^{d}}\|x\|_{2}p(x)\,dx<\infty and dx2p(x)2𝑑x<\int_{\mathbb{R}^{d}}\|x\|_{2}\|\nabla p(x)\|_{2}\,dx<\infty. Then, the function

Q(ρ):=𝔹ρx2p(x)𝑑S(x)=𝕊d1p(ρθ)ρd𝑑S(θ)Q(\rho):=\int_{\partial\mathbb{B}_{\rho}}\|x\|_{2}\,p(x)\,dS(x)=\int_{\mathbb{S}^{d-1}}p(\rho\theta)\,\rho^{d}\,dS(\theta)

is differentiable on (0,)(0,\infty), and moreover, QQ and QQ^{\prime} are integrable on (0,)(0,\infty). Consequently, we also have that limρQ(ρ)=0\lim_{\rho\to\infty}Q(\rho)=0.

Proof B.10.

Using spherical integration, we immediately get that

0|Q(ρ)|𝑑ρ=0𝕊d1p(ρθ)ρd𝑑S(θ)𝑑ρ=dp(x)x2𝑑x<.\int_{0}^{\infty}\left|Q(\rho)\right|\,d\rho=\int_{0}^{\infty}\int_{\mathbb{S}^{d-1}}\,p(\rho\theta)\rho^{d}\,dS(\theta)\,d\rho=\int_{\mathbb{R}^{d}}\,p(x)\|x\|_{2}\,dx<\infty.

To get QQ^{\prime}, we differentiate under the integral sign as

Q(ρ)=𝕊d1{ρd(p(ρθ)θ)+(ρd1d)p(ρθ)}𝑑S(θ),Q^{\prime}(\rho)=\int_{\mathbb{S}^{d-1}}\,\left\{\rho^{d}\,(\nabla p(\rho\theta)\cdot\theta)+(\rho^{d-1}d)\,p(\rho\theta)\right\}\,dS(\theta),

which is justified because the differentiated integrand is continuous in (ρ,θ)(\rho,\theta), and as a result, can be uniformly bounded in magnitude by a constant over all θ𝕊d1\theta\in\mathbb{S}^{d-1} and all ρ\rho in any compact subset of (0,)(0,\infty). Now, integrating |Q|\left|Q^{\prime}\right| gives

0|Q(ρ)|𝑑ρ\displaystyle\int_{0}^{\infty}\left|Q^{\prime}(\rho)\right|\,d\rho 0𝕊d1|ρd(p(ρθ)θ)+(ρd1d)p(ρθ)|𝑑S(θ)𝑑ρ\displaystyle\leq\int_{0}^{\infty}\int_{\mathbb{S}^{d-1}}\,\left|\rho^{d}\,(\nabla p(\rho\theta)\cdot\theta)+(\rho^{d-1}d)\,p(\rho\theta)\right|\,dS(\theta)\,d\rho
=d|p(x)x+p(x)d|𝑑x\displaystyle=\int_{\mathbb{R}^{d}}\left|\nabla p(x)\cdot x+p(x)d\right|\,dx
dx2p(x)2𝑑x+d<.\displaystyle\leq\int_{\mathbb{R}^{d}}\|x\|_{2}\|\nabla p(x)\|_{2}\,dx+d<\infty.

What remains, finally, is to show that limρQ(ρ)=0\lim_{\rho\to\infty}Q(\rho)=0. The integrability of QQ^{\prime} implies that the limit

limρ(Q(ρ)Q(1))=1Q(r)𝑑r\lim_{\rho\to\infty}\left(Q(\rho)-Q(1)\right)=\int_{1}^{\infty}Q^{\prime}(r)\,dr

exists finitely. But if L:=limρQ(ρ)L:=\lim_{\rho\to\infty}Q(\rho) were a non-zero number, then QQ would not be integrable over (0,)(0,\infty), leading to a contradiction. Hence, we may conclude that L=0L=0 as required.

For the next few lemmas, recall that g(x)=(2π)d/2e12x22g(x)=(2\pi)^{-d/2}\,e^{-\frac{1}{2}\|x\|_{2}^{2}} and gϵ(x)=ϵdg(x/ϵ)g_{\epsilon}(x)=\epsilon^{-d}g(x/\epsilon).

Lemma B.11 (L1L^{1} bound on smoothed densities).

Let pp and qq be probability densities on d\mathbb{R}^{d}, and denote by W1(p,q)W_{1}(p,q) the Wasserstein-1 distance between them. For any ϵ>0\epsilon>0, it holds that

pgϵqgϵL1sups,tdstτsgϵτtgϵL1st2W1(p,q),\|p*g_{\epsilon}-q*g_{\epsilon}\|_{L^{1}}\leq\sup_{\begin{subarray}{c}s,t\in\mathbb{R}^{d}\\ s\neq t\end{subarray}}\frac{\|\tau_{s}g_{\epsilon}-\tau_{t}g_{\epsilon}\|_{L^{1}}}{\|s-t\|_{2}}\,W_{1}(p,\,q),

where τsh(x):=h(xs)\tau_{s}h(x):=h(x-s) denotes a translation for any function hh on d\mathbb{R}^{d}.

Proof B.12.

Denote by Γ(p,q)\Gamma(p,q) the set of all couplings (or transport plans) between pp and qq, and let γΓ(p,q)\gamma\in\Gamma(p,q). Note that one can express

pgϵ(x)qgϵ(x)\displaystyle p*g_{\epsilon}(x)-q*g_{\epsilon}(x) =dgϵ(xs)p(s)𝑑sdgϵ(xt)q(t)𝑑t\displaystyle=\int_{\mathbb{R}^{d}}g_{\epsilon}(x-s)p(s)ds-\int_{\mathbb{R}^{d}}g_{\epsilon}(x-t)q(t)dt
=d×d(gϵ(xs)gϵ(xt))𝑑γ(s,t).\displaystyle=\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\left(g_{\epsilon}(x-s)-g_{\epsilon}(x-t)\right)\,d\gamma(s,t).

A direct calculation involving an exchange of integrals and a Hölder bound gives

pgϵqgϵL1\displaystyle\|p*g_{\epsilon}-q*g_{\epsilon}\|_{L^{1}} =d|pgϵ(x)qgϵ(x)|𝑑x\displaystyle=\int_{\mathbb{R}^{d}}\left|p*g_{\epsilon}(x)-q*g_{\epsilon}(x)\right|\,dx
dd×d|gϵ(xs)gϵ(xt)|𝑑γ(s,t)𝑑x\displaystyle\leq\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\left|g_{\epsilon}(x-s)-g_{\epsilon}(x-t)\right|\,d\gamma(s,t)\,dx
=d×dd|τsgϵ(x)τtgϵ(x)|𝑑x𝑑γ(s,t)\displaystyle=\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\left|\tau_{s}g_{\epsilon}(x)-\tau_{t}g_{\epsilon}(x)\right|\,dx\,d\gamma(s,t)
=d×dτsgϵτtgϵL1𝑑γ(s,t)\displaystyle=\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\|\tau_{s}g_{\epsilon}-\tau_{t}g_{\epsilon}\|_{L^{1}}\,d\gamma(s,t)
=d×dτsgϵτtgϵL1st2st2𝑑γ(s,t)\displaystyle=\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\frac{\|\tau_{s}g_{\epsilon}-\tau_{t}g_{\epsilon}\|_{L^{1}}}{\|s-t\|_{2}}\,\|s-t\|_{2}\,d\gamma(s,t)
sups,tdstτsgϵτtgϵL1st2d×dst2𝑑γ(s,t).\displaystyle\leq\sup_{\begin{subarray}{c}s,t\in\mathbb{R}^{d}\\ s\neq t\end{subarray}}\frac{\|\tau_{s}g_{\epsilon}-\tau_{t}g_{\epsilon}\|_{L^{1}}}{\|s-t\|_{2}}\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\|s-t\|_{2}\,d\gamma(s,t).

Taking an infimum over all γΓ(p,q)\gamma\in\Gamma(p,q) gives the desired result.

Lemma B.13 (Translational stability of Gaussian kernels in L1L^{1}).

Let s,tds,t\in\mathbb{R}^{d}, and ϵ>0\epsilon>0. Then,

τsgϵτtgϵL1dϵst2.\|\tau_{s}g_{\epsilon}-\tau_{t}g_{\epsilon}\|_{L^{1}}\leq\frac{\sqrt{d}}{\epsilon}\|s-t\|_{2}.

Proof B.14.

Fix some xdx\in\mathbb{R}^{d} and consider the smooth function [0,1]ξgϵ(xs+ξ(st))[0,1]\ni\xi\mapsto g_{\epsilon}(x-s+\xi(s-t)), whose argument interpolates linearly between xsx-s and xtx-t as ξ\xi runs from 0 to 1. This lets us write

gϵ(xt)gϵ(xs)=01gϵ(xs+ξ(st))(st)𝑑ξ.g_{\epsilon}(x-t)-g_{\epsilon}(x-s)=\int_{0}^{1}\nabla g_{\epsilon}(x-s+\xi(s-t))\cdot(s-t)\,d\xi.

Integrating, we get that

d|gϵ(xt)gϵ(xs)|𝑑x\displaystyle\int_{\mathbb{R}^{d}}\left|g_{\epsilon}(x-t)-g_{\epsilon}(x-s)\right|\,dx d01gϵ(xs+ξ(st))2st2𝑑ξ𝑑x\displaystyle\leq\int_{\mathbb{R}^{d}}\int_{0}^{1}\|\nabla g_{\epsilon}(x-s+\xi(s-t))\|_{2}\,\|s-t\|_{2}\,d\xi\,dx
01st2dgϵ(xs+ξ(st))2𝑑x𝑑ξ\displaystyle\leq\int_{0}^{1}\|s-t\|_{2}\int_{\mathbb{R}^{d}}\|\nabla g_{\epsilon}(x-s+\xi(s-t))\|_{2}\,dx\,d\xi
=dgϵ(x)2𝑑xst2.\displaystyle=\int_{\mathbb{R}^{d}}\|\nabla g_{\epsilon}(x)\|_{2}\,dx\,\|s-t\|_{2}.

What remains is to bound the integral dgϵ(x)2𝑑x\int_{\mathbb{R}^{d}}\|\nabla g_{\epsilon}(x)\|_{2}\,dx, and this is straightforward. Noting that

gϵxj(x)=xjϵ2gϵ(x),\frac{\partial g_{\epsilon}}{\partial x_{j}}(x)=-\frac{x_{j}}{\epsilon^{2}}g_{\epsilon}(x),

we get

dgϵ(x)2𝑑x\displaystyle\int_{\mathbb{R}^{d}}\|\nabla g_{\epsilon}(x)\|_{2}\,dx =1ϵ2dx2gϵ(x)𝑑x=1ϵ2𝔼XN(0,ϵ2)X2dϵ,\displaystyle=\frac{1}{\epsilon^{2}}\int_{\mathbb{R}^{d}}\|x\|_{2}\,g_{\epsilon}(x)\,dx=\frac{1}{\epsilon^{2}}\,\mathbb{E}_{X\sim N(0,\epsilon^{2})}\|X\|_{2}\leq\frac{\sqrt{d}}{\epsilon},

which completes the proof of the lemma.
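As an informal numerical check of Lemma B.13 (not part of the argument), the L1 difference of two shifted Gaussian kernels can be compared with the stated bound in d=1 by quadrature; the parameter values below are arbitrary illustrative choices.

import numpy as np
from scipy.integrate import quad

eps, s, t = 0.5, 0.3, -0.2
g_eps = lambda x: np.exp(-0.5 * (x / eps) ** 2) / (eps * np.sqrt(2.0 * np.pi))   # g_eps in d = 1
l1_diff, _ = quad(lambda x: abs(g_eps(x - s) - g_eps(x - t)), -10.0, 10.0)        # ||tau_s g_eps - tau_t g_eps||_L1
bound = abs(s - t) / eps                                                           # sqrt(d)/eps * ||s - t||_2 with d = 1
print(l1_diff <= bound)    # ~0.77 <= 1.0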

Lemma B.15 (L1L^{1} error due to mollification).

Let pp be a probability density on d\mathbb{R}^{d} satisfying Smoothness assumption S2. Then, for any ϵ>0\epsilon>0,

pgϵpL1pL1dϵ.\|p*g_{\epsilon}-p\|_{L^{1}}\leq\|\nabla p\|_{L^{1}}\,\sqrt{d}\,\epsilon.

Proof B.16.

We first relate the approximation error above to the translational stability in L1L^{1}:

pgϵpL1\displaystyle\|p*g_{\epsilon}-p\|_{L^{1}} =d|d(p(xy)p(x))gϵ(y)𝑑y|𝑑x\displaystyle=\int_{\mathbb{R}^{d}}\left\lvert\int_{\mathbb{R}^{d}}\left(p(x-y)-p(x)\right)g_{\epsilon}(y)\,dy\right\rvert\,dx
dd|τyp(x)p(x)|𝑑xgϵ(y)𝑑y\displaystyle\leq\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}|\tau_{y}p(x)-p(x)|\,dx\,g_{\epsilon}(y)\,dy
=dτyppL1gϵ(y)𝑑y.\displaystyle=\int_{\mathbb{R}^{d}}\|\tau_{y}p-p\|_{L^{1}}\,g_{\epsilon}(y)\,dy.

This translational stability τyppL1\|\tau_{y}p-p\|_{L^{1}} is controlled via smoothness. Suppose first that p𝒞1(d)p\in\mathcal{C}^{1}(\mathbb{R}^{d}), so that for x,ydx,y\in\mathbb{R}^{d},

p(xy)p(x)=01p(xξy)(y)𝑑ξ.p(x-y)-p(x)=\int_{0}^{1}\nabla p(x-\xi y)\cdot(-y)\,d\xi.

Integrating gives

τyppL1\displaystyle\|\tau_{y}p-p\|_{L^{1}} d01p(xξy)2y2𝑑ξ𝑑x\displaystyle\leq\int_{\mathbb{R}^{d}}\int_{0}^{1}\|\nabla p(x-\xi y)\|_{2}\,\|y\|_{2}\,d\xi\,dx
=01y2dp(xξy)2𝑑x𝑑ξ\displaystyle=\int_{0}^{1}\|y\|_{2}\int_{\mathbb{R}^{d}}\|\nabla p(x-\xi y)\|_{2}\,dx\,d\xi
=dp(x)2𝑑xy2=pL1y2.\displaystyle=\int_{\mathbb{R}^{d}}\|\nabla p(x)\|_{2}\,dx\,\|y\|_{2}=\|\nabla p\|_{L^{1}}\,\|y\|_{2}.

If p𝒞1(d)p\notin\mathcal{C}^{1}(\mathbb{R}^{d}), one can use the fact that pp lives in the Sobolev space 𝒲1,1(d)\mathcal{W}^{1,1}(\mathbb{R}^{d}) to get an approximating sequence pk𝒞1(d)𝒲1,1(d)p_{k}\in\mathcal{C}^{1}(\mathbb{R}^{d})\cap\mathcal{W}^{1,1}(\mathbb{R}^{d}) satisfying pkpL10\|p_{k}-p\|_{L^{1}}\to 0 and pkpL10\|\nabla p_{k}-\nabla p\|_{L^{1}}\to 0 [29]. This allows us to extend the bound

τyppL1pL1y2\|\tau_{y}p-p\|_{L^{1}}\leq\|\nabla p\|_{L^{1}}\|y\|_{2}

to all p𝒲1,1(d)p\in\mathcal{W}^{1,1}(\mathbb{R}^{d}). Combining, we get

pgϵpL1\displaystyle\|p*g_{\epsilon}-p\|_{L^{1}} dτyppL1gϵ(y)𝑑ypL1dy2gϵ(y)𝑑ypL1dϵ.\displaystyle\leq\int_{\mathbb{R}^{d}}\|\tau_{y}p-p\|_{L^{1}}\,g_{\epsilon}(y)\,dy\leq\|\nabla p\|_{L^{1}}\int_{\mathbb{R}^{d}}\|y\|_{2}\,g_{\epsilon}(y)\,dy\leq\|\nabla p\|_{L^{1}}\,\sqrt{d}\,\epsilon.

Appendix C Computational details and further numerical results

C.1 Monte Carlo integration for computing squared Hellinger distances

Recall that the squared Hellinger distance between densities pp and qq on d\mathbb{R}^{d} is defined by the integral

(27) hd2(p,q):=12d(q(x)p(x))2𝑑x.\displaystyle h_{d}^{2}(p,q):=\frac{1}{2}\int_{\mathbb{R}^{d}}\left(\sqrt{q(x)}-\sqrt{p(x)}\right)^{2}dx.

If dd is larger than 4, computing the integral by discretizing on a grid can prove prohibitively expensive. Instead, one can use simple Monte Carlo integration to approximate (27). Notice that

hd2(p,q)=12d(q(x)/p(x)1)2p(x)𝑑x=𝔼Xp12(q(X)/p(X)1)2,\displaystyle h_{d}^{2}(p,q)=\frac{1}{2}\int_{\mathbb{R}^{d}}\left(\sqrt{q(x)/p(x)}-1\right)^{2}p(x)dx=\mathbb{E}_{X\sim p}\,\frac{1}{2}\left(\sqrt{q(X)/p(X)}-1\right)^{2},

where the expectation on the right-hand side can be approximated using an empirical average. We simulate samples S(1),S(2),,S(K)iidpS^{(1)},S^{(2)},\dots,S^{(K)}\overset{\mathrm{iid}}{\sim}p, and compute

1Kk=1K12(q(S(k))/p(S(k))1)2\displaystyle\frac{1}{K}\sum_{k=1}^{K}\frac{1}{2}\left(\sqrt{q(S^{(k)})/p(S^{(k)})}-1\right)^{2}

which converges almost surely to hd2(p,q)h_{d}^{2}(p,q) as KK\to\infty. In practice, we use K=10000K=10000. Additionally, we repeat the procedure 50 times and ensure that the resulting spread is not too large.
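A minimal Python sketch of this Monte Carlo procedure follows; it is purely illustrative (the Gaussian densities and parameters below are arbitrary choices, not those used in our experiments). For equal-covariance Gaussians the estimate can be compared with the closed-form value 1 − exp(−‖μ₁−μ₂‖²/8).

import numpy as np
from scipy.stats import multivariate_normal

d, K = 5, 10_000
p = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))
q = multivariate_normal(mean=0.5 * np.ones(d), cov=np.eye(d))

def hellinger_sq_mc(p, q, K, seed):
    S = p.rvs(size=K, random_state=seed)                  # S^(1),...,S^(K) iid ~ p
    ratio = q.pdf(S) / p.pdf(S)
    return 0.5 * np.mean((np.sqrt(ratio) - 1.0) ** 2)     # empirical average approximating (27)

estimates = [hellinger_sq_mc(p, q, K, seed) for seed in range(50)]      # 50 repetitions, as in the text
exact = 1.0 - np.exp(-np.sum((0.5 * np.ones(d)) ** 2) / 8.0)            # closed form for this Gaussian pair
print(np.mean(estimates), np.std(estimates), exact)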

C.2 The EM algorithm and the re-sampling heuristic

Recall the finite mixture densities discussed in Section 4.3

p(x)=k=1Kπkpk(x),\displaystyle p(x)=\sum_{k=1}^{K}\pi_{k}p_{k}(x),

and consider samples X(1),X(2),,X(n)iidpX^{(1)},X^{(2)},\dots,X^{(n)}\overset{\mathrm{iid}}{\sim}p. We briefly describe the EM algorithm following Cule et al. [11]. Given current iterates of the mixture proportions π~1,,π~K\tilde{\pi}_{1},\dots,\tilde{\pi}_{K} and component densities p~1,,p~K\tilde{p}_{1},\dots,\tilde{p}_{K}, compute the posterior probabilities

(28) θ~jkπ~kp~k(X(j))=1Kπ~p~(X(j)).\displaystyle\tilde{\theta}_{jk}\leftarrow\frac{\tilde{\pi}_{k}\tilde{p}_{k}(X^{(j)})}{\sum_{\ell=1}^{K}\tilde{\pi}_{\ell}\tilde{p}_{\ell}(X^{(j)})}.

Then, for each component k{1,,K}k\in\{1,\dots,K\}, update the density p~k\tilde{p}_{k} by maximizing a weighted likelihood over a suitable class of densities \mathcal{F} (such as log-concave densities):

(29) p~kargmaxqkj=1nθ~jklogqk(X(j)).\displaystyle\tilde{p}_{k}\leftarrow\mathop{\arg\max}_{q_{k}\in\mathcal{F}}\sum_{j=1}^{n}\tilde{\theta}_{jk}\log q_{k}(X^{(j)}).

Finally, update the mixture proportions as

(30) π~k1nj=1nθ~j,k\displaystyle\tilde{\pi}_{k}\leftarrow\frac{1}{n}\sum_{j=1}^{n}\tilde{\theta}_{j,k}

for each kk. These three steps are iterated until convergence.
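For concreteness, a minimal NumPy sketch of one such EM iteration is given below. The component densities are represented by callables, and fit_weighted is a hypothetical placeholder for the weighted maximum-likelihood step (29); it is a sketch under these assumptions, not the implementation used in our experiments.

import numpy as np

def em_step(X, pi, dens, fit_weighted):
    # X: (n, d) samples; pi: (K,) current proportions; dens: list of K callables for p~_k
    K = len(pi)
    lik = np.column_stack([pi[k] * dens[k](X) for k in range(K)])   # pi~_k p~_k(X^(j))
    theta = lik / lik.sum(axis=1, keepdims=True)                    # posterior probabilities (28)
    dens_new = [fit_weighted(X, theta[:, k]) for k in range(K)]     # weighted MLE update (29), placeholder
    pi_new = theta.mean(axis=0)                                     # mixture proportions update (30)
    return pi_new, dens_new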

When combining the EM algorithm with LC-IC, (28) and (30) remain unchanged. Only the maximum likelihood step (29) needs to be replaced, where one needs to “fit” a density to samples weighted by θ~jk\tilde{\theta}_{jk}. Since weighted samples are not immediately compatible with our sample splitting scheme, we propose the following heuristic re-sampling scheme. From the observed samples X(1),X(2),,X(n)X^{(1)},X^{(2)},\dots,X^{(n)}, construct the weighted empirical distribution

ν~k:=j=1nθ~jkδX(j)j=1nθ~jk\displaystyle\tilde{\nu}_{k}:=\frac{\sum_{j=1}^{n}\tilde{\theta}_{jk}\,\delta_{X^{(j)}}}{\sum_{j=1}^{n}\tilde{\theta}_{jk}}

for each kk. Notice that ν~k\tilde{\nu}_{k} depends on the current iterates. Next, (computationally) generate iid samples from ν~k\tilde{\nu}_{k} (hence the term re-sampled), and feed them into Algorithm 1, using MM of those re-sampled samples for the unmixing matrix estimation stage, and NN for the marginal estimation stage. The output of Algorithm 1 is then the updated iterate of p~k\tilde{p}_{k}. Unlike the sample splitting scheme discussed before, the re-sampling heuristic involves no compromise between MM and NN; one can generate as many samples from ν~k\tilde{\nu}_{k} as one likes. The experiments in Section 4.3 used N=M=4nN=M=4n.
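A minimal sketch of this re-sampling step is shown below, assuming a hypothetical function algorithm_1(Y, X) that runs Algorithm 1 with the first argument used for unmixing matrix estimation and the second for marginal estimation; names and interfaces are illustrative only.

import numpy as np

def resample_and_fit(X, theta_k, M, N, algorithm_1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    w = theta_k / theta_k.sum()                                    # weights of the empirical distribution nu~_k
    idx = rng.choice(X.shape[0], size=M + N, replace=True, p=w)    # iid draws from nu~_k
    return algorithm_1(X[idx[:M]], X[idx[M:]])                     # updated iterate of p~_k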

Figure 9: Comparing density estimation performance of LC-IC with LC-MLE on simulated Gamma-distributed data in d=2d=2. The ground truth and estimated densities are visualized in the top row, and the corresponding level curves are plotted in the bottom row.

However, we note that this heuristic does not yet have theoretical backing. Another issue with this strategy is the occurrence of spurious zero values of θ~jk\tilde{\theta}_{jk}, which can cause division-by-zero errors and likelihoods of -\infty. We address this by adding a small positive number such as 10710^{-7} to the zero values of θ~jk\tilde{\theta}_{jk}.

C.3 Comparison of performance on Gamma-distributed data

Here, we repeat the experiments from Section 4.1, but for Gamma-distributed data. Consider the bivariate case first. Let X=𝐖TZX=\mathbf{W}^{T}Z with 𝐖=𝐈\mathbf{W}=\mathbf{I} and Z=(Z1,Z2)Z=(Z_{1},Z_{2}), such that Z1Gamma(6,1)Z_{1}\sim\mathrm{Gamma}(6,1) and Z2Gamma(3,1)Z_{2}\sim\mathrm{Gamma}(3,1). We generate n=1000n=1000 iid samples of XX, split them equally into M=N=500M=N=500, and compute the LC-IC estimate using PCA for unmixing matrix estimation and logcondens for marginal estimation. From the same n=1000n=1000 samples, we also compute the LC-MLE estimate with LogConcDEAD. Figure 9 shows the ground truth and estimated densities together with their contour plots, and lists the estimation errors in squared Hellinger distance along with the algorithm runtimes.

Then, we conduct systematic tests similar to those of Section 4.1 for d=2,3,4d=2,3,4 and various nn. We set X=𝐖TZX=\mathbf{W}^{T}Z, where 𝐖=𝐈\mathbf{W}=\mathbf{I} and Z=(Z1,,Zd)Z=(Z_{1},\dots,Z_{d}), with ZiGamma(6(i1),1)Z_{i}\sim\mathrm{Gamma}(6-(i-1),1), and use PCA for estimating 𝐖\mathbf{W}. Each experiment is repeated 5 times, and the average metrics are computed. Figure 10 plots the estimation errors in squared Hellinger distance, as well as the algorithm runtimes, for both LC-MLE and LC-IC. The results are consistent with those from Section 4.1.
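The data-generating and unmixing-estimation steps of this experiment can be sketched as follows; this is an illustrative Python version (with d=3 and n=1000 as example values), and the one-dimensional log-concave marginal fits, done with the R package logcondens in our experiments, are not reproduced here.

import numpy as np

d, n = 3, 1000
rng = np.random.default_rng(0)
# Z_i ~ Gamma(6 - (i - 1), 1), independent; X = W^T Z with W = I
Z = np.column_stack([rng.gamma(shape=6 - i, scale=1.0, size=n) for i in range(d)])
X = Z - Z.mean(axis=0)                      # center, since the theory assumes zero mean

M = n // 2                                  # sample split: M for unmixing, N = n - M for marginals
cov = np.cov(X[:M], rowvar=False)
_, eigvecs = np.linalg.eigh(cov)
W_hat = eigvecs.T                            # PCA estimate of the unmixing matrix (rows w^_i)
proj = X[M:] @ W_hat.T                       # 1-D projections passed to the marginal estimator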

Figure 10: Comparing estimation errors in squared Hellinger distance (top row) and algorithm runtimes (bottom row) of LC-IC with LC-MLE, on simulated Gamma distributed data. The dots correspond to independent experiments, and the lines represent averages. Here, nn is the number of samples, and dd is the dimension.