
Intrinsic Dimension Estimation Using Wasserstein Distances

Adam Block
MIT
   Zeyu Jia
MIT
   Yury Polyanskiy
MIT
   Alexander Rakhlin
MIT
Abstract

It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.

1 Introduction

Many recent practical applications of machine learning involve a very large number of features, often many more than there are samples on which to train a model. Despite this imbalance, many modern machine learning models work astonishingly well. One of the more compelling explanations for this behavior is the manifold hypothesis, which posits that, though the data appear to the practitioner in a high-dimensional, ambient space $\mathbb{R}^{D}$, they really lie on (or close to) a low-dimensional space $M$ of “dimension” $d\ll D$, where we define dimension formally below. A good example to keep in mind is that of image data: each of thousands of pixels corresponds to three dimensions, but we expect that real images have some inherent structure that limits the true number of degrees of freedom in a realistic picture. This phenomenon has been thoroughly explored over the years, beginning with the linear case and moving into the more general, nonlinear regime, with such works as Niyogi et al., (2008, 2011); Belkin & Niyogi, (2001); Bickel et al., (2007); Levina & Bickel, (2004); Kpotufe, (2011); Kpotufe & Dasgupta, (2012); Kpotufe & Garg, (2013); Weed et al., (2019); Tenenbaum et al., (2000); Bernstein et al., (2000); Kim et al., (2019); Farahmand et al., (2007), among many others. Some authors have focused on finding representations for these lower dimensional sets (Niyogi et al., 2008; Belkin & Niyogi, 2001; Tenenbaum et al., 2000; Roweis & Saul, 2000; Donoho & Grimes, 2003), while other works have focused on leveraging the low dimensionality into statistically efficient estimators (Bickel et al., 2007; Kpotufe, 2011; Nakada & Imaizumi, 2020; Kpotufe & Dasgupta, 2012; Kpotufe & Garg, 2013; Ashlagi et al., 2021).

In this work, our primary focus is on estimating the intrinsic dimension. To see why this is an important question, note that the local estimators of Bickel et al., (2007); Kpotufe, (2011); Kpotufe & Garg, (2013) and the neural network architecture of Nakada & Imaizumi, (2020) all depend in some way on the intrinsic dimension. As noted in Levina & Bickel, (2004), while a practitioner may simply apply cross-validation to select the optimal hyperparameters, this can be very costly unless the hyperparameters have a restricted range; thus, an estimate of intrinsic dimension is critical in actually applying the above works. In addition, for manifold learning, where the goal is to construct a representation of the data manifold in a lower dimensional space, the intrinsic dimension is a key parameter in many of the most popular methods (Tenenbaum et al., 2000; Belkin & Niyogi, 2001; Donoho & Grimes, 2003; Roweis & Saul, 2000).

We propose a new estimator, based on distances between probability distributions, and provide rigorous, finite sample guarantees for the quality of the novel procedure. Recall that if $\mu,\nu$ are two measures on a metric space $(M,d_{M})$, then the Wasserstein-$p$ distance between $\mu$ and $\nu$ is

W_{p}^{M}(\mu,\nu)^{p}=\inf_{(X,Y)\sim\Gamma(\mu,\nu)}\mathbb{E}\left[d_{M}(X,Y)^{p}\right] (1)

where $\Gamma(\mu,\nu)$ is the set of all couplings of the two measures. If $M\subset\mathbb{R}^{D}$, then there are two natural metrics to put on $M$: one is simply the restriction of the Euclidean metric to $M$, while the other is the geodesic metric in $M$, i.e., the minimal length of a curve in $M$ that joins the points under consideration. In the sequel, if the metric is simply the Euclidean metric, we leave the Wasserstein distance unadorned to distinguish it from the intrinsic metric. For a thorough treatment of such distances, see Villani, (2008). We recall that the Hölder integral probability metric (Hölder IPM) is given by

d_{\beta,B}(\mu,\nu)=\sup_{f\in C_{B}^{\beta}(\Omega)}\mathbb{E}_{\mu}[f(X)]-\mathbb{E}_{\nu}[f(Y)]

where $C_{B}^{\beta}(\Omega)$ is the Hölder ball defined in the sequel. When $p=\beta=1$, the classical result of Kantorovich-Rubinstein says that the Wasserstein and Hölder distances agree. It has been known at least since Dudley, (1969) that if a space $M$ has dimension $d$, $\mathbb{P}$ is a measure with support $M$, and $P_{n}$ is the empirical measure of $n$ independent samples drawn from $\mathbb{P}$, then $W_{1}^{M}(P_{n},\mathbb{P})\asymp n^{-\frac{1}{d}}$. More recently, Weed et al., (2019) has determined sharp rates for the convergence of this quantity for higher order Wasserstein distances in terms of the intrinsic dimension of the distribution. Below, we find sharp rates for the convergence of the empirical measure to the population measure with respect to the Hölder IPM: if $\beta<\frac{d}{2}$, then $d_{\beta}(P_{n},\mathbb{P})\asymp n^{-\frac{\beta}{d}}$, and if $\beta>\frac{d}{2}$ then $d_{\beta}(P_{n},\mathbb{P})\asymp n^{-\frac{1}{2}}$. These sharp rates are intuitive in that convergence to the population measure should only depend on the intrinsic complexity (i.e., dimension) without reference to the possibly much larger ambient dimension.

The above convergence results are nice theoretical insights, but they have practical value, too. The results of Dudley, (1969); Weed et al., (2019), as well as our results on the rate of convergence of the Hölder IPM, present a natural way to estimate the intrinsic dimension: take two independent samples, $P_{n},P_{\alpha n}$, from $\mathbb{P}$ and consider the ratio $W_{p}^{M}(P_{n},\mathbb{P})/W_{p}^{M}(P_{\alpha n},\mathbb{P})$ or $d_{\beta}(P_{n},\mathbb{P})/d_{\beta}(P_{\alpha n},\mathbb{P})$; as $n\to\infty$, the first ratio should be about $\alpha^{\frac{1}{d}}$, while the second should be about $\alpha^{\frac{\beta}{d}}$, and so $d$ can be computed by taking logarithms with base $\alpha$. The first problem with this idea is that we do not know $\mathbb{P}$; to address this, we instead compute the ratios using two independent samples. A more serious issue regards how large $n$ must be in order for the asymptotic regime to apply. As we shall see below, the answer depends on the geometry of the supporting manifold.
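To make the logarithm explicit, the following display records the heuristic algebra behind the estimator; it is only a sketch, treating the asymptotic rate $W_{1}^{M}(P_{n},\mathbb{P})\asymp n^{-\frac{1}{d}}$ as an exact proportionality with constant $C$ and ignoring lower order terms:

\frac{W_{1}^{M}(P_{n},\mathbb{P})}{W_{1}^{M}(P_{\alpha n},\mathbb{P})}\approx\frac{Cn^{-\frac{1}{d}}}{C(\alpha n)^{-\frac{1}{d}}}=\alpha^{\frac{1}{d}},\qquad\text{so that}\qquad d\approx\frac{\log\alpha}{\log W_{1}^{M}(P_{n},\mathbb{P})-\log W_{1}^{M}(P_{\alpha n},\mathbb{P})}.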

We define two estimators, one using the Euclidean distance and the other using a graph-based approximation of the intrinsic distance:

d_{n}=\frac{\log\alpha}{\log W_{1}(P_{n},P_{n}^{\prime})-\log W_{1}(P_{\alpha n},P_{\alpha n}^{\prime})}\qquad\widetilde{d}_{n}=\frac{\log\alpha}{\log W_{1}^{G}(P_{n},P_{n}^{\prime})-\log W_{1}^{G}(P_{\alpha n},P_{\alpha n}^{\prime})} (2)

where the primes indicate independent samples of the same size and $G$ is a graph-based metric that approximates the intrinsic metric. Before we go into the details, we give an informal statement of our main theorem, which provides finite sample, non-asymptotic guarantees on the quality of the estimator (explicit constants are given in the formal statement of Theorem 22):

Theorem 1 (Informal version of Theorem 22).

Let $\mathbb{P}$ be a measure on $\mathbb{R}^{D}$ supported on a compact manifold $M$ of dimension $d$. Let $\tau$ be the reach of $M$, a geometric quantity defined below. Suppose we have $N$ independent samples from $\mathbb{P}$ where

N=\Omega\left(\tau^{-d}\vee\left(\frac{\operatorname{vol}M}{\omega_{d}}\right)^{\frac{d+2}{2\gamma}}\vee\left(\log\frac{1}{\rho}\right)^{3}\right)

where $\omega_{d}$ is the volume of a $d$-dimensional Euclidean unit ball. Then with probability at least $1-6\rho$, the estimated dimension $\widetilde{d}_{n}$ satisfies

\frac{d}{1+4\gamma}\leq\widetilde{d}_{n}\leq(1+4\gamma)d.

The same conclusion holds for $d_{n}$.

Although the guarantees for $d_{n}$ and $\widetilde{d}_{n}$ are similar, empirically $\widetilde{d}_{n}$ is much better, as explained below. Note that the ambient dimension $D$ never enters the statistical complexity given above. While the exponential dependence on the intrinsic dimension $d$ is unfortunate, it is likely necessary as described below.

While the reach, $\tau$, determines the sample complexity of our dimension estimator, consideration of the injectivity radius, $\iota$, is relevant for practical application. Both geometric quantities are defined formally in the following section, but, to understand the intuition, note that, as discussed above, there are two natural metrics we could place on $M=\operatorname{supp}\mathbb{P}$: the Euclidean metric and the geodesic distance. The reach is, intuitively, the size of the largest ball with respect to the ambient metric such that we can treat points in $M$ as if they were simply in Euclidean space; the injectivity radius is similar, except it treats neighborhoods with respect to the intrinsic metric. Considering that manifold distances are always at least as large as Euclidean distances, it is unsurprising that $\tau\lesssim\iota$. Getting back to dimension estimation, specializing to the case of $\beta=p=1$, and recalling (2), there are now two choices for our dimension estimator: we could use the Wasserstein distance with respect to the Euclidean metric or the Wasserstein distance with respect to the intrinsic metric (which we will denote by $W_{1}^{M}$). We will see that if $\iota\approx\tau$, then the two estimators induced by each of these distances behave similarly, but when $\iota\gg\tau$, the latter is better. While we wish to use $W_{1}^{M}(P_{n},P_{n}^{\prime})$ to estimate the dimension, we do not know the intrinsic metric. As such, we use the $k$NN graph to approximate this intrinsic metric and introduce the quantity $W_{1}^{G}(P_{n},P_{n}^{\prime})$. Note that if we had oracle access to the geodesic distance $d_{M}$, then the $W_{1}^{M}$-based estimator $\widetilde{d}_{n}$ would only require $\asymp\iota^{-d}$ samples. However, our $k$NN estimator of $d_{M}$, unfortunately, still requires $\tau^{-d}$ samples. Nevertheless, there is a practical advantage of $\widetilde{d}_{n}$ in that the metric estimator can leverage all $N=2(1+\alpha)n$ available samples, so that $\widetilde{d}_{n}$ works if $N\gtrsim\tau^{-d}$ and only $n\gtrsim\iota^{-d}$, whereas for $d_{n}$ we require $n\gtrsim\tau^{-d}$ itself.

A natural question: is this more complicated approach necessary, i.e., is $\iota\gg\tau$ on real datasets? We believe that the answer is yes. To see this, consider the case of images of the digit 7 (for example) from MNIST (LeCun & Cortes, 2010). As a demonstration, we draw pairs of samples from MNIST with sizes ranging in powers of 2 from 32 to 2048, calculate the Wasserstein distance between the two samples in each pair, and plot the resulting trend. In the right plot, we pool all of the data to estimate the manifold distances, and then use these estimated distances to compute the Wasserstein distance between the empirical distributions. In order to better compare these two approaches, we also plot the residuals to the linear fit that we expect in the asymptotic regime. Looking at Figure 1, it is clear that we are not yet in the asymptotic regime if we simply use Euclidean distances; on the other hand, the trend using the manifold distances is much more clearly linear, suggesting that the slope of the best linear fit is meaningful. Thus we see that in order to get a meaningful dimension estimate from practical data sets, we cannot simply use $W_{1}$ but must also estimate the geometry of the underlying distribution; this suggests that $\iota\gg\tau$ on this data manifold. More generally, we note that the injectivity radius, $\iota$, is intrinsic to the geometry of the manifold and thus unaffected by the imbedding; in contradistinction, the reach, $\tau$, is extrinsic and thus can be made smaller by changing the imbedding. In particular, when the obstruction to the reach being large is a “bottleneck” in the sense that the manifold is imbedded in such a way as to place distant neighborhoods of the manifold close together in Euclidean distance (see Figure 2 for an example), we may expect $\tau\ll\iota$. Intuitively, this matches the notion that the geometry of the data would be simple if we were to have access to the “correct” coordinate system and that the difficulty in understanding the geometry comes from its imbedding in the ambient space.

Figure 1: Two log-log plots comparing how $W_{1}(P_{n},P_{n}^{\prime})$ decays to how $W_{1}^{M}(P_{n},P_{n}^{\prime})$ decays as $n$ gets larger, as well as the residuals from a linear fit. The data are images of the digit 7 from MNIST with Wasserstein distances computed with the Sinkhorn algorithm (Cuturi, 2013). The manifold distances are approximated by a $k$NN graph, as described in Section 3.
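The demonstration above can be reproduced, at least in outline, with off-the-shelf tools. The following Python sketch assumes the POT library (`ot`) for Sinkhorn distances and uses a synthetic stand-in array in place of the MNIST digit-7 images; the slope of the log-log fit then gives a crude dimension estimate via $-1/\text{slope}$.

import numpy as np
import ot  # POT: Python Optimal Transport (assumed available)

rng = np.random.default_rng(0)
# Stand-in for the (N, 784) array of MNIST digit-7 images used in Figure 1.
X = rng.random((4096, 784))

def sinkhorn_w1(A, B, reg_scale=0.05):
    """Entropically regularized approximation of W_1 between the empirical
    measures of the rows of A and B, with Euclidean ground cost."""
    M = ot.dist(A, B, metric='euclidean')
    a = np.full(len(A), 1.0 / len(A))
    b = np.full(len(B), 1.0 / len(B))
    return ot.sinkhorn2(a, b, M, reg_scale * M.max())

sizes = 2 ** np.arange(5, 12)  # 32, 64, ..., 2048
dists = []
for n in sizes:
    idx = rng.choice(len(X), size=2 * n, replace=False)
    dists.append(float(sinkhorn_w1(X[idx[:n]], X[idx[n:]])))

# In the asymptotic regime log W_1 ~ -(1/d) log n, so the fitted slope estimates -1/d.
slope, _ = np.polyfit(np.log(sizes), np.log(dists), 1)
print("estimated intrinsic dimension:", -1.0 / slope)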

We emphasize that, like many estimators of intrinsic dimension, we do not claim robustness to off-manifold noise (Levina & Bickel, 2004; Farahmand et al., 2007; Kim et al., 2019). Indeed, any “fattening” of the manifold will force any consistent estimator of intrinsic dimension to asymptotically grow to the full, ambient dimension as the number of samples grows. Various works have included off-manifold noise in different ways, often with the assumption that either the noise is known (Koltchinskii, 2000) or the manifold is linear (Niles-Weed & Rigollet, 2019). Methods that do not make these simplifying assumptions are often highly sensitive to scaling parameters that are required inputs in such methods as multi-scale, local SVD (Little et al., 2009). Extensions of our method to such noisy settings are a promising avenue of future research, particularly in understanding the effect of this noise on downstream applications, as is done for Lipschitz classification in metric spaces and the resulting dimension-distortion tradeoff found in Gottlieb et al., (2016); in this work, however, we confine our theoretical study to the noiseless setting. The primary theoretical advantage of our estimator over those of Levina & Bickel, (2004); Farahmand et al., (2007) is that we do not require stringent regularity assumptions for our nonasymptotic rates to hold. We leave it to future empirical work to determine whether this weakening of assumptions allows for a better practical estimator on real-world data sets.

Our main contributions are as follows:

  • In Section 3, we introduce a new estimator of intrinsic dimension. In Theorem 22 we prove non-asymptotic bounds on the quality of the introduced estimator. Moreover, unlike the MLE estimator of Levina & Bickel, (2004), whose non-asymptotic analysis appears in Farahmand et al., (2007), our guarantees require only minimal regularity of the density of the population distribution and, unlike the estimator suggested in Kim et al., (2019), our estimator is computationally efficient and has sample complexity independent of the ambient dimension.

  • In the course of proving Theorem 22, we adapt the techniques of Bernstein et al., (2000) to provide new, non-asymptotic bounds on the quality of kNN distance as an estimate of intrinsic distance in Proposition 24, with explicit sample complexity in terms of the reach of the underlying space. To our knowledge, these are the first such non-asymptotic bounds.

We further note that the techniques we develop to prove the non-asymptotic bounds on our dimension estimator also serve to provide new statistical rates in learning Generative Adversarial Networks (GANs) with a Hölder discriminator class:

  • We prove in Theorem 25 that if $\widehat{\mu}$ is a Hölder GAN, then the distance between $\widehat{\mu}$ and $\mathbb{P}$, as measured by the Hölder IPM, is governed by rates dependent only on the intrinsic dimension of the data, independent of the ambient dimension or the dimension of the feature space. In particular, we prove in great generality that if $\mathbb{P}$ has intrinsic dimension $d$, then the rate of a Wasserstein GAN is $n^{-\frac{1}{d}}$. This improves on the recent work of Schreuder et al., (2020).

The work is presented in the order of the above listed contributions, preceded by a brief section on the geometric preliminaries and prerequisite results. We conclude the introduction by fixing notation and surveying some related work.

Notation:

We fix the following notation. We always let $\mathbb{P}$ be a probability distribution on $\mathbb{R}^{D}$ and, whenever defined, we let $d=\dim\operatorname{supp}\mathbb{P}$. We reserve $X_{1},\dots,X_{n}$ for samples taken from $\mathbb{P}$ and we denote by $P_{n}$ their empirical distribution. We reserve $\beta$ for the smoothness of a Hölder class, $\Omega\subset\mathbb{R}^{D}$ is always a bounded open domain, and $\Delta$ is always the intrinsic diameter of a closed set. We also reserve $M$ for a compact manifold. In general, we denote by $\mathcal{S}$ the support of a distribution $\mathbb{P}$ and we reuse $M=\operatorname{supp}\mathbb{P}$ if we restrict ourselves to the case where $\mathcal{S}=M$ is a compact manifold, with Riemannian metric induced by the Euclidean metric. We denote by $\operatorname{vol}M$ the volume of the manifold with respect to its inherited metric and we reserve $\omega_{d}$ for the volume of the unit ball in $\mathbb{R}^{d}$. When a compact manifold $M$ can be assumed from context, we take the uniform measure on $M$ to be the volume measure of $M$ normalized so that $M$ has unit measure.

1.1 Related Work

Dimension Estimation

There is a long history of dimension estimation, beginning with linear methods such as thresholding principal components (Fukunaga & Olsen, 1971), regressing k-Nearest-Neighbors (kNN) distances (Pettis et al., 1979), estimating packing numbers (Kégl, 2002; Grassberger & Procaccia, 2004; Camastra & Vinciarelli, 2002), an estimator based solely on neighborhood (but not metric) information that was recently proven consistent (Kleindessner & Luxburg, 2015), and many others. An exhaustive recent survey on the history of these techniques can be found in Camastra & Staiano, (2016). Perhaps the most popular choice among current practitioners is the MLE estimator of Levina & Bickel, (2004).

The MLE estimator is constructed as the maximum likelihood of a parameterized Poisson process. As worked out in Levina & Bickel, (2004), a local estimate of dimension for $k\geq 2$ and $x\in\mathbb{R}^{D}$ is given by

\widehat{m}_{k}(x)=\left(\frac{1}{k-1}\sum_{j=1}^{k}\log\frac{T_{k}(x)}{T_{j}(x)}\right)^{-1}

where $T_{j}(x)$ is the distance between $x$ and its $j^{th}$ nearest neighbor in the data set. The final estimate for fixed $k$ is given by averaging $\widehat{m}_{k}$ over the data points in order to reduce variance. While not included in the original paper, a similar motivation for such an estimator comes from noting that if $X$ is uniformly distributed on a ball of radius $R$ in $\mathbb{R}^{d}$, then $\mathbb{E}\left[\log\frac{R}{\left|\left|X\right|\right|}\right]=\frac{1}{d}$; the local estimator $\widehat{m}_{k}(x)$ is the empirical version under the assumption that the density is smooth enough to be approximately constant on this small ball. The easy computation is included for the sake of completeness in Appendix E. In Farahmand et al., (2007), the authors examined a closely related estimator and provided non-asymptotic guarantees with an exponential dependence on the intrinsic dimension, albeit with stringent regularity conditions on the density.
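For concreteness, a minimal Python sketch of this MLE estimator is given below, assuming scikit-learn is available for the nearest-neighbor search; the function name `mle_dimension` and the choice $k=10$ are our own illustrative choices, not part of the original method.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_dimension(X, k=10):
    """Levina-Bickel MLE of intrinsic dimension: average the local estimates
    m_hat_k(x) over the data points x."""
    # Distances to the k nearest neighbors of each point (column 0 is the point itself).
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    T = dist[:, 1:]                                   # T_1(x), ..., T_k(x)
    # m_hat_k(x) = ( (1/(k-1)) * sum_{j<k} log(T_k(x) / T_j(x)) )^{-1};
    # the j = k term of the sum in the display above is zero, so it is dropped.
    local = 1.0 / np.log(T[:, [-1]] / T[:, :-1]).mean(axis=1)
    return local.mean()

# Example: points drawn uniformly from a cube in R^3 give an estimate near 3.
print(mle_dimension(np.random.default_rng(0).random((1000, 3))))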

In addition to the estimators motivated by the volume growth of local balls discussed in the previous paragraph, Kim et al., (2019) proposed and analyzed a dimension estimator based on Travelling Salesman Paths (TSP). One major advantage of the TSP estimator is the lack of necessary regularity conditions on the density, requiring only an upper bound on the likelihood of the population density with respect to the volume measure on the manifold. On the other hand, the upper bound on sample complexity presented in that paper depends exponentially on the ambient dimension, which is pessimistic when the intrinsic dimension is substantially smaller. In addition, it is unclear how practical the estimator is due to the necessity of computing a solution to TSP; even ignoring this issue, Kim et al., (2019) note that practical tuning of the constants involved in their estimator is difficult, and thus deploying their estimator as-is on real-world datasets is unlikely to succeed.

Manifold Learning

The notion of reach was first introduced in Federer, (1959), and subsequently used in the machine learning and computational geometry communities in such works as Niyogi et al., (2008, 2011); Aamari et al., (2019); Amenta & Bern, (1999); Fefferman et al., (2016, 2018); Narayanan & Mitter, (2010); Efimov et al., (2019); Boissonnat et al., (2019). Perhaps most relevant to our work, Narayanan & Mitter, (2010); Fefferman et al., (2016) consider the problem of testing membership in a class of manifolds of large reach and derive tight bounds on the sample complexity of this question. Our work does not fall into the purview of their conclusions as we assume that the geometry of the underlying manifold is nice and estimate the intrinsic dimension. In the course of proving bounds on our dimension estimator, we must estimate the intrinsic metric of the data. We adapt the proofs of Tenenbaum et al., (2000); Bernstein et al., (2000); Niyogi et al., (2008) and provide tight bounds on the quality of a $k$-Nearest Neighbors ($k$NN) approximation of the intrinsic distance.

Statistical Rates of GANs

Since the introduction of Generative Adversarial Networks (GANs) in Goodfellow et al., (2014), there has been a plethora of empirical improvements and theoretical analyses. Recall that the basic GAN problem selects an estimated distribution $\widehat{\mu}$ from a class of distributions $\mathcal{P}$ minimizing some adversarially learned distance between $\widehat{\mu}$ and the empirical distribution $P_{n}$. Theoretical analyses aim to control the distance between the learned distribution $\widehat{\mu}$ and the population distribution $\mathbb{P}$ from which the data comprising $P_{n}$ are sampled. In particular, statistical rates for a number of interesting discriminator classes have been proven, including Besov balls (Uppal et al., 2019), balls in an RKHS (Liang, 2018), and neural network classes (Chen et al., 2020), among others. The latter paper, Chen et al., (2020), also considers GANs where the discriminative class is a Hölder ball, which includes the popular Wasserstein GAN framework of Arjovsky et al., (2017). They show that if $\widehat{\mu}$ is the empirical minimizer of the GAN loss and the population distribution $\mathbb{P}\ll\mathsf{Leb}_{\mathbb{R}^{D}}$, then

\mathbb{E}\left[d_{\beta}(\widehat{\mu},\mathbb{P})\right]\lesssim n^{-\frac{\beta}{2\beta+D}}

up to factors polynomial in $\log n$. Thus, in order to beat the curse of dimensionality, one requires $\beta=\Omega(D)$; note that the larger $\beta$ is, the weaker the IPM is, as the Hölder ball becomes smaller. In order to mitigate this slow rate, Schreuder et al., (2020) assume that both $\mathcal{P}$ and $\mathbb{P}$ are distributions arising from Lipschitz pushforwards of the uniform distribution on a $d$-dimensional hypercube; in this setting, they are able to remove dependence on $D$ and show that

\mathbb{E}\left[d_{\beta}(\widehat{\mu},\mathbb{P})\right]\lesssim Ln^{-\frac{\beta}{d}}\vee n^{-\frac{1}{2}}.

This last result beats the curse of dimensionality, but pays with restrictive assumptions on the generative model as well as dependence on the Lipschitz constant of the pushforward map. More importantly, the result depends exponentially not on the intrinsic dimension of $\mathbb{P}$ but rather on the dimension of the feature space used to represent $\mathbb{P}$. In practice, state-of-the-art GANs used to produce images often choose $d$ to be on the order of 128, which is much too large for the Schreuder et al., (2020) result to guarantee good performance.

2 Preliminaries

2.1 Geometry

In this work, we are primarily concerned with the case of compact manifolds isometrically imbedded in some large ambient space, $\mathbb{R}^{D}$. We note that this focus is largely in order to maintain simplicity of notation and exposition; extensions to more complicated, less regular sets with intrinsic dimension defined as the Minkowski dimension can easily be attained with our techniques. The key example to keep in mind is that of image data, where each pixel corresponds to a dimension in the ambient space, but, in reality, the distribution lives on a much smaller, imbedded subspace. Many of our results can be easily extended to the non-compact case with additional assumptions on the geometry of the space and tails of the distribution of interest.

Central to our study is the analysis of how complex the support of a distribution is. We measure complexity of a metric space by its entropy:

Definition 2.

Let $(X,d)$ be a metric space. The covering number at scale $\varepsilon>0$, $N(X,d,\varepsilon)$, is the minimal number $s$ such that there exist points $x_{1},\dots,x_{s}$ such that $X$ is contained in the union of balls of radius $\varepsilon$ centred at the $x_{i}$. The packing number at scale $\varepsilon>0$, $D(X,d,\varepsilon)$, is the maximal number $s$ such that there exist points $x_{1},\dots,x_{s}\in X$ such that $d(x_{i},x_{j})>\varepsilon$ for all $i\neq j$. The entropy is defined as $\log N(X,d,\varepsilon)$.

We recall the classical packing-covering duality, proved, for example, in (van Handel, 2014, Lemma 5.12):

Lemma 3.

For any metric space $X$ and scale $\varepsilon>0$,

D(X,d,2\varepsilon)\leq N(X,d,\varepsilon)\leq D(X,d,\varepsilon).
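To make Definition 2 and Lemma 3 concrete, the following Python sketch greedily selects centers for a finite point cloud; the selected centers are simultaneously an $\varepsilon$-cover and an $\varepsilon$-packing, so their number is sandwiched between $N(X,d,\varepsilon)$ and $D(X,d,\varepsilon)$. The function name and the restriction to Euclidean point clouds are our own illustrative choices.

import numpy as np

def greedy_epsilon_net(X, eps):
    """Greedily pick rows of X as centers until every point lies within eps of
    some center.  Any two chosen centers are more than eps apart, so the centers
    form an eps-packing as well as an eps-cover of the point cloud."""
    centers = []
    uncovered = np.ones(len(X), dtype=bool)
    while uncovered.any():
        i = int(np.flatnonzero(uncovered)[0])     # first still-uncovered point
        centers.append(i)
        uncovered &= np.linalg.norm(X - X[i], axis=1) > eps
    return X[centers]

# Example: N(X, ||.||, eps) <= len(net) <= D(X, ||.||, eps) for this point cloud.
net = greedy_epsilon_net(np.random.default_rng(0).random((500, 3)), eps=0.25)
print(len(net))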

The most important geometric quantity that determines the complexity of a problem is the dimension of the support of the population distribution. There are many, often equivalent ways to define this quantity in general. One possibility, introduced in Assouad, (1983) and subsequently used in Dasgupta & Freund, (2008); Kpotufe & Dasgupta, (2012); Kpotufe & Garg, (2013) is that of doubling dimension:

Definition 4.

Let $\mathcal{S}\subset\mathbb{R}^{D}$ be a closed set. For $x\in\mathcal{S}$, the doubling dimension at $x$ is the smallest $d$ such that for all $r>0$, the set $B_{r}(x)\cap\mathcal{S}$ can be covered by $2^{d}$ balls of radius $\frac{r}{2}$, where $B_{r}(x)$ denotes the Euclidean ball of radius $r$ centred at $x$. The doubling dimension of $\mathcal{S}$ is the supremum of the doubling dimension at $x$ for all $x\in\mathcal{S}$.

This notion of dimension plays well with the entropy, as demonstrated by the following (Kpotufe & Dasgupta, 2012, Lemma 6):

Lemma 5 (Kpotufe & Dasgupta, 2012).

Let $\mathcal{S}$ have doubling dimension $d$ and diameter $\Delta$. Then $N(\mathcal{S},\varepsilon)\leq\left(\frac{\Delta}{\varepsilon}\right)^{d}$.

We remark that a similar notion of dimension is that of the Minkowski dimension, which is defined as the asymptotic rate of growth of the entropy as the scale tends to zero. Recently, Nakada & Imaizumi, (2020) examined the effect that an assumption of small Minkowski dimension has on learning with neural networks; their central statistical result can be recovered as an immediate consequence of our complexity bounds below.

In order to develop non-asymptotic bounds, we need some understanding of the geometry of the support, $M$. We first recall the definition of the geodesic distance:

Definition 6.

Let $\mathcal{S}\subset\mathbb{R}^{D}$ be closed. A piecewise smooth curve in $\mathcal{S}$, $\gamma$, is a continuous function $\gamma:I\to\mathcal{S}$, where $I\subset\mathbb{R}$ is an interval, such that there exists a partition $I_{1},\cdots,I_{J}$ of $I$ such that $\gamma_{I_{j}}$ is smooth as a function to $\mathbb{R}^{D}$. The length of $\gamma$ is induced by the imbedding of $\mathcal{S}\subset\mathbb{R}^{D}$. For points $p,q\in\mathcal{S}$, the intrinsic (or geodesic) distance is

d_{\mathcal{S}}(p,q)=\inf\left\{\mathsf{length}\,(\gamma)|\gamma(0)=p\text{ and }\gamma(1)=q\text{ and }\gamma\text{ is a piecewise smooth curve in }\mathcal{S}\right\}.

It is clear from the fact that straight lines are geodesics in $\mathbb{R}^{D}$ that for any points $p,q\in\mathcal{S}$, $\left|\left|p-q\right|\right|\leq d_{\mathcal{S}}(p,q)$. We are concerned with two relevant geometric quantities, one extrinsic and the other intrinsic.

Definition 7.

Let $\mathcal{S}\subset\mathbb{R}^{D}$ be a closed set. Let the medial axis $\operatorname{Med}(\mathcal{S})$ be defined as

\operatorname{Med}(\mathcal{S})=\left\{x\in\mathbb{R}^{D}|\text{ there exist }p\neq q\in\mathcal{S}\text{ such that }\left|\left|p-x\right|\right|=\left|\left|q-x\right|\right|=d(x,\mathcal{S})\right\}.

In other words, the medial axis is the set of points in $\mathbb{R}^{D}$ that have at least two projections to $\mathcal{S}$. Define the reach $\tau_{\mathcal{S}}$ of $\mathcal{S}$ as $d(\mathcal{S},\operatorname{Med}(\mathcal{S}))$, the minimal distance between the set and its medial axis.

If $\mathcal{S}=M$ is a compact manifold with the induced Euclidean metric, we define the injectivity radius $\iota=\iota_{M}$ as the maximal $r$ such that if $p,q\in M$ satisfy $d_{M}(p,q)<r$, then there exists a unique length-minimizing geodesic connecting $p$ to $q$ in $M$.

For more detail on the injectivity radius, see Lee, (2018), especially Chapters 6 and 10. The difference between $\iota_{M}$ and $\tau_{M}$ is in the choice of metric with which we equip $M$. We could choose to equip $M$ with the metric induced by the Euclidean distance $\left|\left|\cdot\right|\right|$ or we could choose to use the intrinsic metric $d_{M}$ defined above. The reach quantifies the maximal radius of a ball with respect to the Euclidean distance such that the intersection of this ball with $M$ behaves roughly like Euclidean space. The injectivity radius, meanwhile, quantifies the maximal radius of a ball with respect to the intrinsic distance such that this ball looks like Euclidean space. While neither quantity is necessary for our dimension estimator, both figure heavily in the analysis. The final relevant geometric quantity is the sectional curvature. The sectional curvature of $M$ at a point $p\in M$ given two directions tangent to $M$ at $p$ is given by the Gaussian curvature at $p$ of the image of the exponential map applied to a small neighborhood of the origin in the plane determined by the two directions. Intuitively, the sectional curvature measures how tightly wound the manifold is locally around each point. For an excellent exposition on the topic, see (Lee, 2018, Chapter 8).

We now specialize to consider compact, $d$-dimensional manifolds $M$ imbedded in $\mathbb{R}^{D}$ with the induced metric (see Lee, (2018) for an accessible introduction to the geometric notions discussed here). One measure of the size of the manifold $M$ is the diameter, $\Delta$, with respect to the intrinsic distance defined above. Another notion of size is the volume measure, $\operatorname{vol}_{M}$. This measure can be defined intrinsically as integration with respect to the volume form, where the volume form can be thought of as the analogue of the Lebesgue differential in standard Euclidean space; for more details see Lee, (2018). In our setting, we could equivalently define the volume as the $d$-dimensional Hausdorff measure as in Aamari et al., (2019). Either way, when we refer to a measure $\mu_{M}$ that is uniform on the manifold, we consider the normalization such that $\mu_{M}(M)=1$, i.e., $\mu_{M}(\cdot)=\operatorname{vol}_{M}(\cdot)/\operatorname{vol}(M)$.

With the brief digression into volume concluded, we return to the notion of the reach, which encodes a number of local and global geometric properties. We summarize several of these in the following proposition:

Proposition 8.

Let $M\subset\mathbb{R}^{D}$ be a compact manifold isometrically imbedded in $\mathbb{R}^{D}$. Suppose that $\tau=\tau_{M}>0$. The following hold:

  (a) (Niyogi et al., 2008, Proposition 6.1) The norm of the second fundamental form of $M$ is bounded by $\frac{1}{\tau}$ at all points $p\in M$.

  (b) (Aamari et al., 2019, Proposition A.1 (ii)) The injectivity radius of $M$ is at least $\pi\tau$.

  (c) (Boissonnat et al., 2019, Lemma 3) If $p,q\in M$ are such that $\left|\left|p-q\right|\right|\leq 2\tau$, then $d_{M}(p,q)\leq 2\tau\arcsin\left(\frac{\left|\left|p-q\right|\right|}{2\tau}\right)$.

A few remarks are in order. First, note that the Hopf-Rinow Theorem (Hopf & Rinow, 1931) guarantees that $M$ is complete, which is fortuitous as completeness is a necessary, technical requirement for several of our arguments. Second, we note that (c) from Proposition 8 has a simple geometric interpretation: the upper bound on the right hand side is the length of the arc of a circle of radius $\tau$ containing the points $p,q$; thus, the maximal distortion of the intrinsic metric with respect to the ambient metric is bounded by that of a circle of radius $\tau$.

Point (a) in the above proposition demonstrates that control of the reach leads to control of local distortion. From the definition, it is obvious that the reach provides an upper bound for the size of the global notion of a “bottleneck,” i.e., two points $p,q\in M$ such that $\left|\left|p-q\right|\right|=2\tau<d_{M}(p,q)$. Interestingly, these two local and global notions of distortion are the only ways that the reach of a manifold can be small, as (Aamari et al., 2019, Theorem 3.4) tells us that if the reach of a manifold $M$ is $\tau$, then either there exists a bottleneck of size $2\tau$ or the norm of the second fundamental form is $\frac{1}{\tau}$ at some point. Thus, in some sense, the reach is the “correct” measure of distortion. Note that while (b) above tells us that $\iota_{M}\gtrsim\tau_{M}$, there is no comparable upper bound. To see this, consider Figure 2, which depicts a one-dimensional manifold imbedded in $\mathbb{R}^{2}$. Note that the bottleneck in the center ensures that the reach of this manifold is very small; on the other hand, it is easy to see that the injectivity radius is given by half the length of the entire curve. As the curve can be extended arbitrarily, the reach can be arbitrarily small relative to the injectivity radius.

Figure 2: Curve in $\mathbb{R}^{2}$ where $\tau\ll\iota$.

We now proceed to bound the covering number of a compact manifold using the dimension and the injectivity radius. We note that upper bounds on the covering number with respect to the ambient metric were provided in Niyogi et al., (2008); Narayanan & Mitter, (2010). A similar bound with less explicit constants can be found in (Kim et al., 2019, Lemma 4).

Proposition 9.

Let $M\subset\mathbb{R}^{D}$ be an isometrically imbedded, compact, $d$-dimensional submanifold with injectivity radius $\iota>0$ such that the sectional curvatures are bounded above by $\kappa_{1}\geq 0$ and below by $\kappa_{2}\leq 0$. If $\varepsilon<\frac{\pi}{2\sqrt{\kappa_{1}}}\wedge\iota$ then

N(M,d_{M},\varepsilon)\leq\frac{\operatorname{vol}M}{\omega_{d}}d\left(\frac{\pi}{2}\right)^{d}\varepsilon^{-d}.

If $\varepsilon<\frac{1}{\sqrt{-\kappa_{2}}}\wedge\iota$ then

\frac{\operatorname{vol}M}{\omega_{d}}d8^{-d}\varepsilon^{-d}\leq D(M,d_{M},2\varepsilon).

Moreover, for all $\varepsilon<\iota$,

\frac{\operatorname{vol}M}{\omega_{d}}d\iota^{d}(-\kappa_{2})^{\frac{d}{2}}e^{-d\iota\sqrt{-\kappa_{2}}}\varepsilon^{-d}\leq D(M,d_{M},\varepsilon).

Thus, if $\varepsilon<\tau$, where $\tau$ is the reach of $M$, then

\frac{\operatorname{vol}M}{\omega_{d}}d8^{-d}\varepsilon^{-d}\leq D(M,d_{M},2\varepsilon)\leq N(M,d_{M},\varepsilon)\leq\frac{\operatorname{vol}M}{\omega_{d}}d\left(\frac{\pi}{2}\right)^{d}\varepsilon^{-d}.

The proof of Proposition 9 can be found in Appendix A and relies on the Bishop-Gromov comparison theorem to leverage the curvature bounds from Proposition 8 into volume estimates for small intrinsic balls, a technique similar to that found in Niyogi et al., (2008); Narayanan & Mitter, (2010). The key point to note is that we have both upper and lower bounds for $\varepsilon<\iota$, as opposed to just the upper bound guaranteed by Lemma 5. As a corollary, we are also able to derive bounds for the covering number with respect to the ambient metric:

Corollary 10.

Let $M$ be as in Proposition 9. For $\varepsilon<\tau$, we can control the covering numbers of $M$ with respect to the Euclidean metric as

\frac{\operatorname{vol}M}{\omega_{d}}d16^{-d}\varepsilon^{-d}\leq D(M,\left|\left|\cdot\right|\right|,2\varepsilon)\leq N(M,\left|\left|\cdot\right|\right|,\varepsilon)\leq\frac{\operatorname{vol}M}{\omega_{d}}\left(\frac{\pi}{2}\right)^{d}\varepsilon^{-d}.

The proof of Corollary 10 follows from Proposition 9 and the metric comparisons for small scales in Proposition 8; details can be found in Appendix A.

2.2 Hölder Classes and their Complexity

In this section we make the elementary observation that complex function classes restricted to simple subsets can be much smaller than the original class. While such intuition has certainly appeared before, especially in designing estimators that can adapt to local intrinsic dimension, such as Bickel et al., (2007); Kpotufe & Dasgupta, (2012); Kpotufe, (2011); Kpotufe & Garg, (2013); Dasgupta & Freund, (2008); Steinwart et al., (2009); Nakada & Imaizumi, (2020), we codify this approach below.

To illustrate the above phenomenon at the level of empirical processes, we focus on Hölder functions in $\mathbb{R}^{D}$ for some large $D$ and let the “simple” subset be a subspace of dimension $d$ where $d\ll D$. We first recall the definition of a Hölder class:

Definition 11.

For an open domain $\Omega\subset\mathbb{R}^{d}$ and a function $f:\Omega\to\mathbb{R}$, define the $\beta$-Hölder norm as

\left|\left|f\right|\right|_{C^{\beta}(\Omega)}=\max_{0\leq\left|\gamma\right|\leq\lfloor\beta\rfloor}\sup_{x\in\Omega}\left|D^{\gamma}f(x)\right|\vee\sup_{x,y\in\Omega}\frac{\left|D^{\lfloor\beta\rfloor}f(x)-D^{\lfloor\beta\rfloor}f(y)\right|}{\left|\left|x-y\right|\right|^{\beta-\lfloor\beta\rfloor}}.

Define the Hölder ball of radius $B$, denoted by $C_{B}^{\beta}(\Omega)$, as the set of functions $f:\Omega\to\mathbb{R}$ such that $\left|\left|f\right|\right|_{C^{\beta}(\Omega)}\leq B$. If $(M,g)$ is a Riemannian manifold of class $C^{\lfloor\beta\rfloor+1}$ (see Lee, (2018)), and $f:M\to\mathbb{R}$, we define the Hölder norm analogously, replacing $\left|D^{\gamma}f(x)\right|$ with $\left|\left|\nabla^{\gamma}f(x)\right|\right|_{g}$, where $\nabla$ is the covariant derivative.

It is a classical result of Kolmogorov & Tikhomirov, (1993) that, for a bounded, open domain $\Omega\subset\mathbb{R}^{D}$, the entropy of a Hölder ball scales as

\log N\left(C_{B}^{\beta}(\Omega),\left|\left|\cdot\right|\right|_{\infty},\varepsilon\right)\asymp\left(\frac{B}{\varepsilon}\right)^{\frac{D}{\beta}}

as $\varepsilon\downarrow 0$. As a consequence, we arrive at the following result, whose proof can be found in Appendix A for the sake of completeness.

Proposition 12.

Let $\mathcal{S}\subset\Omega\subset\mathbb{R}^{d}$ be a path-connected closed set contained in an open domain $\Omega$. Let $\widetilde{\mathcal{F}}=C_{B}^{\beta}(\Omega)$ and let $\mathcal{F}=\widetilde{\mathcal{F}}|_{\mathcal{S}}$. Then,

D\left(\mathcal{S},\left(\frac{\varepsilon}{B}\right)^{\frac{1}{\beta}}\right)\leq\log D(\mathcal{F},\left|\left|\cdot\right|\right|_{\infty},2\varepsilon)\leq\log N(\mathcal{F},\left|\left|\cdot\right|\right|_{\infty},\varepsilon)\leq 3\beta^{2}\log\left(\frac{2B}{\varepsilon}\right)N\left(\mathcal{S},\left(\frac{\varepsilon}{2B}\right)^{\frac{1}{\beta}}\right).

Note that the content of the above result is really that of Kolmogorov & Tikhomirov, (1993), coupled with the fact that restriction from $\mathbb{R}^{d}$ to $\mathcal{S}$ preserves smoothness.

If we apply the easily proven volumetric bounds on covering and packing numbers for $\mathcal{S}$ a Euclidean ball to Proposition 12, we recover the classical result of Kolmogorov & Tikhomirov, (1993). The key insight is that low-dimensional subsets can have covering numbers much smaller than those of a high-dimensional Euclidean ball: if the “dimension” of $\mathcal{S}$ is $d$, then we expect the covering number of $\mathcal{S}$ to scale like $\varepsilon^{-d}$. Plugging this into Proposition 12 tells us that the entropy of $\mathcal{F}$, up to a factor logarithmic in $\frac{1}{\varepsilon}$, scales like $\varepsilon^{-\frac{d}{\beta}}\ll\varepsilon^{-\frac{D}{\beta}}$. An immediate corollary of Lemma 5 and Proposition 12 is:

Corollary 13.

Let $\mathcal{S}\subset\mathbb{R}^{D}$ be a closed set of diameter $\Delta$ and doubling dimension $d$. Let $\mathcal{S}\subset\Omega$ with $\Omega$ open and let $\mathcal{F}$ be the restriction of $C_{B}^{\beta}(\Omega)$ to $\mathcal{S}$. Then

\log N(\mathcal{F},\left|\left|\cdot\right|\right|_{\infty},\varepsilon)\leq 3\beta^{2}\left(\frac{2B\Delta^{\beta}}{\varepsilon}\right)^{\frac{d}{\beta}}\log\left(\frac{2B}{\varepsilon}\right).
Proof.

Combine the upper bound in Proposition 12 with the bound in Lemma 5. ∎

The conclusion of Corollary 13 is very useful for upper bounds, as it tells us that the entropy for Hölder balls scales at most like $\varepsilon^{-\frac{d}{\beta}}$ as $\varepsilon\downarrow 0$. If we desire comparable lower bounds, we require some of the geometry discussed above. Combining Proposition 12 and Corollary 10 yields the following bound:

Corollary 14.

Let $M\subset\mathbb{R}^{D}$ be an isometrically imbedded, compact submanifold with reach $\tau>0$. Suppose $\Omega\supset M$ is an open set and let $\mathcal{F}^{\prime}$ be the restriction of $C_{B}^{\beta}(\Omega)$ to $M$. Then for $\varepsilon\leq\tau$,

\frac{\operatorname{vol}M}{\omega_{d}}d16^{-d}\left(\frac{2B}{\varepsilon}\right)^{\frac{d}{\beta}}\leq\log D(\mathcal{F}^{\prime},\left|\left|\cdot\right|\right|_{\infty},2\varepsilon)\leq\log N(\mathcal{F}^{\prime},\left|\left|\cdot\right|\right|_{\infty},\varepsilon)\leq 3\beta^{2}\log\left(\frac{2B}{\varepsilon}\right)\frac{\operatorname{vol}M}{\omega_{d}}d\left(\frac{\pi}{2}\right)^{d}\left(\frac{2B}{\varepsilon}\right)^{\frac{d}{\beta}}.

If we set $\mathcal{F}=C_{B}^{\beta}(M)$, then we have that for all $\varepsilon<\iota$,

\frac{\operatorname{vol}M}{\omega_{d}}d\iota^{d}(-\kappa_{2})^{\frac{d}{2}}e^{-d\iota\sqrt{-\kappa_{2}}}\varepsilon^{-\frac{d}{\beta}}\leq\log N(\mathcal{F},\left|\left|\cdot\right|\right|_{\infty},\varepsilon)\leq 3\beta^{2}\log\left(\frac{2B}{\varepsilon}\right)\frac{\operatorname{vol}M}{\omega_{d}}d\left(\frac{\pi}{2}\right)^{d}\varepsilon^{-\frac{d}{\beta}}.

In essence, Corollary 14 tells us that the rate of $\varepsilon^{-\frac{d}{\beta}}$ for the growth of the entropy of Hölder balls is sharp for sufficiently small $\varepsilon$. The key difference between the first and second statements is that the first is with respect to an ambient class of functions while the second is with respect to an intrinsic class. To better illustrate the difference, consider the case where $\beta=B=1$, i.e., the class of Lipschitz functions on the manifold. In both cases, asymptotically, the entropy of Lipschitz functions scales like $\varepsilon^{-d}$; if we restrict to functions that are Lipschitz with respect to the ambient metric, then the above bound only applies for $\varepsilon<\tau$; on the other hand, if we consider the larger class of functions that are Lipschitz with respect to the intrinsic metric, the bound applies for $\varepsilon<\iota$. In the case where $\iota\gg\tau$, this can be a major improvement.

The observations in this section are undeniably simple; the real interest comes in the diverse applications of the general principle, some of which we detail below. As a final note, we remark that our guiding principle of simplifying function classes by restricting them to simple sets likely holds in far greater generality than is explored here; in particular, Sobolev and Besov classes (see, for example, (Giné & Nickl, 2016, §4.3)) likely exhibit similar behavior.

3 Dimension Estimation

We outlined the intuition behind our dimension estimation in the introduction. In this section, we formally define the estimator and analyse its theoretical performance. We first apply standard empirical process theory and our complexity bounds in the previous section to upper bound the expected Hölder IPM (defined in (1)) between empirical and population distributions:

Lemma 15.

Let $\mathcal{S}\subset\mathbb{R}^{D}$ be a compact set contained in a ball of radius $R$. Suppose that we draw $n$ independent samples from a probability measure $\mathbb{P}$ supported on $\mathcal{S}$ and denote by $P_{n}$ the corresponding empirical distribution. Let $P_{n}^{\prime}$ denote the empirical distribution of an independent sample of the same size. Then we have

\mathbb{E}\left[d_{\beta,B}(P_{n},\mathbb{P})\right]\leq\mathbb{E}\left[d_{\beta,B}(P_{n},P_{n}^{\prime})\right]\leq 16B\inf_{\delta>0}\left(2\delta+\frac{3\sqrt{6}}{\sqrt{n}}\beta\sqrt{\log\frac{1}{\delta}}\int_{\delta}^{1}\sqrt{N(\mathcal{S},\left|\left|\cdot\right|\right|,\varepsilon)}d\varepsilon\right).

In particular, there exists a universal constant $K$ such that if $N(\mathcal{S},\left|\left|\cdot\right|\right|,\varepsilon)\leq C_{1}\varepsilon^{-d}$ for some $C_{1},d>0$, then

\mathbb{E}\left[d_{\beta}(P_{n},\mathbb{P})\right]\leq C\beta B\left(1+\sqrt{\log n}\mathbf{1}_{\{d=2\beta\}}\right)\left(n^{-\frac{\beta}{d}}\vee n^{-\frac{1}{2}}\right)

holds with $C=KC_{1}$.

The proof uses the symmetrization and chaining technique and applies the complexity bounds of Hölder functions found above; the details can be found in Appendix E.

We now specialize to the case where $\beta=B=1$, due to the computational tractability of the resulting Wasserstein distance. Applying Kantorovich-Rubinstein duality (Kantorovich & Rubinshtein, 1958), we see that this special case of Lemma 15 recovers the special $p=1$ case of Weed et al., (2019). From here on, we suppose that $d>2$ and that our metric on distributions is $d_{1,1}=W_{1}$.

We begin by noting that if we have $2n$ independent samples from $\mathbb{P}$, then we can split them into two data sets of size $n$, and denote by $P_{n},P_{n}^{\prime}$ the empirical distributions thus generated. We then note that Lemma 15 implies that if $\operatorname{supp}\mathbb{P}\subset M$ and $M$ is of dimension $d$, then

\mathbb{E}\left[W_{1}(P_{n},P_{n}^{\prime})\right]\leq C_{M,d}n^{-\frac{1}{d}}.

If we were to establish a lower bound, as well as concentration of $W_{1}(P_{n},P_{n}^{\prime})$ about its mean, then we could consider the following estimator. Given a data set of size $2(\alpha+1)n$, we can break the data into four samples, $P_{n},P_{n}^{\prime}$ each of size $n$ and $P_{\alpha n},P_{\alpha n}^{\prime}$ each of size $\alpha n$. Then we would have

d_{n}:=-\frac{1}{\log_{\alpha}\left(\frac{W_{1}(P_{\alpha n},P_{\alpha n}^{\prime})}{W_{1}(P_{n},P_{n}^{\prime})}\right)}=\frac{\log\alpha}{\log W_{1}(P_{n},P_{n}^{\prime})-\log W_{1}(P_{\alpha n},P_{\alpha n}^{\prime})}\approx d. (3)
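As a concrete illustration, the following Python sketch computes the vanilla estimator $d_{n}$ from a single data matrix, using exact Wasserstein distances from the POT library (`ot`); the function name, the default $\alpha=4$, and the random splitting are our own illustrative choices.

import numpy as np
import ot  # POT: Python Optimal Transport (assumed available)

def w1_euclidean(A, B):
    """Exact W_1 between the empirical measures of the rows of A and B."""
    M = ot.dist(A, B, metric='euclidean')
    a = np.full(len(A), 1.0 / len(A))
    b = np.full(len(B), 1.0 / len(B))
    return ot.emd2(a, b, M)

def wasserstein_dimension(X, alpha=4, seed=0):
    """Vanilla estimator d_n: split X into P_n, P_n', P_{alpha n}, P_{alpha n}'
    and compare the Euclidean-cost Wasserstein distances at the two sizes."""
    rng = np.random.default_rng(seed)
    n = len(X) // (2 * (alpha + 1))
    idx = rng.permutation(len(X))
    Pn, Pn2 = X[idx[:n]], X[idx[n:2 * n]]
    Pa, Pa2 = X[idx[2 * n:(2 + alpha) * n]], X[idx[(2 + alpha) * n:(2 + 2 * alpha) * n]]
    w_small = w1_euclidean(Pn, Pn2)       # W_1(P_n, P_n')
    w_large = w1_euclidean(Pa, Pa2)       # W_1(P_{alpha n}, P_{alpha n}')
    return np.log(alpha) / (np.log(w_small) - np.log(w_large))

# Example on synthetic data lying on a 2-dimensional plane in R^10.
Z = np.random.default_rng(1).random((2000, 2)) @ np.random.default_rng(2).random((2, 10))
print(wasserstein_dimension(Z))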

Which distance on $M$ should be used to compute the Wasserstein distance, the Euclidean metric $\left|\left|\cdot\right|\right|$ or the intrinsic metric $d_{M}(\cdot,\cdot)$? As can be guessed from Corollary 14, asymptotically, both will work, but for finite sample sizes when $\iota\gg\tau$, the latter is much better. One problem remains, however: because we are not assuming $M$ to be known, we do not have access to $d_{M}$ and thus we cannot compute the necessary Wasserstein cost. In order to get around this obstacle, we recall the graph distance induced by a $k$NN graph:

Definition 16.

Let $X_{1},\dots,X_{n}\in\mathbb{R}^{D}$ be a data set and fix $\varepsilon>0$. We let $G(X,\varepsilon)$ denote the weighted graph with vertices $X_{i}$ and edges of weight $\left|\left|X_{i}-X_{j}\right|\right|$ between all vertices $X_{i},X_{j}$ such that $\left|\left|X_{i}-X_{j}\right|\right|\leq\varepsilon$. We denote by $d_{G(X,\varepsilon)}$ (or $d_{G}$ if $X,\varepsilon$ are clear from context) the geodesic distance on the graph $G(X,\varepsilon)$. We extend this metric to all of $\mathbb{R}^{D}$ by letting

d_{G}(p,q)=\left|\left|p-\pi_{G}(p)\right|\right|+d_{G}(\pi_{G}(p),\pi_{G}(q))+\left|\left|q-\pi_{G}(q)\right|\right|

where $\pi_{G}(p)\in\operatorname*{argmin}_{X_{i}}\left|\left|p-X_{i}\right|\right|$.

We now have several Wasserstein distances, each induced by a different metric; to mitigate confusion, we introduce the following notation:

Definition 17.

Let $X_{1},\dots,X_{n},X_{1}^{\prime},\dots,X_{n}^{\prime}\in\mathbb{R}^{D}$ be sampled independently from $\mathbb{P}$ such that $\operatorname{supp}\mathbb{P}\subset M$. Let $P_{n},P_{n}^{\prime}$ be the empirical distributions associated to the data $X,X^{\prime}$. Let $W_{1}(P_{n},P_{n}^{\prime})$ denote the Wasserstein cost with respect to the Euclidean metric and $W_{1}^{M}(P_{n},P_{n}^{\prime})$ denote the Wasserstein cost associated to the manifold metric, as in (1). For a fixed $\varepsilon>0$, let $W_{1}^{G}(P_{n},P_{n}^{\prime})$ denote the Wasserstein cost associated to the metric $d_{G(\operatorname{supp}P_{n}\cup\operatorname{supp}P_{n}^{\prime},\varepsilon)}$. Let $d_{n}$, $\widehat{d}_{n}$, and $\widetilde{d}_{n}$ denote the dimension estimators from (3) induced by each of the above metrics.
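A minimal sketch of $W_{1}^{G}$ is given below, assuming SciPy for the graph shortest paths and POT (`ot`) for the transport cost; the function name and the dense-matrix construction are our own illustrative choices, and $\varepsilon$ must be chosen large enough that the neighborhood graph on the pooled points is connected.

import numpy as np
import ot  # POT: Python Optimal Transport (assumed available)
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def graph_wasserstein(A, B, eps):
    """W_1^G between the empirical measures of the rows of A and B: build the
    eps-neighborhood graph on the pooled points (Definition 16), use graph
    shortest-path distances as a proxy for d_M, and solve optimal transport
    with those distances as the ground cost."""
    Z = np.vstack([A, B])
    E = cdist(Z, Z)                                   # Euclidean distances
    W = np.where(E <= eps, E, 0.0)                    # zero entries mean "no edge"
    D = shortest_path(W, method='D', directed=False)  # graph geodesic distances
    n, m = len(A), len(B)
    cost = D[:n, n:]                                  # costs between supp(P_n) and supp(P_n')
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    return ot.emd2(a, b, cost)                        # exact W_1 under the graph metric

Plugging `graph_wasserstein` in place of the Euclidean $W_{1}$ in the ratio (3) gives the estimator $\widetilde{d}_{n}$.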

Given sample distributions $P_{n},P_{n}^{\prime}$, we are able to compute $W_{1}(P_{n},P_{n}^{\prime})$ and $W_{1}^{G}(P_{n},P_{n}^{\prime})$ for any fixed $\varepsilon$, but not $W_{1}^{M}(P_{n},P_{n}^{\prime})$, because we are assuming that the learner does not have access to the manifold $M$. On the other hand, adapting techniques from Weed et al., (2019), we are able to provide a non-asymptotic lower bound on $W_{1}(P_{n},P_{n}^{\prime})$ and $W_{1}^{M}(P_{n},P_{n}^{\prime})$:

Proposition 18.

Suppose that $\mathbb{P}$ is a measure on $\mathbb{R}^{D}$ such that $\operatorname{supp}\mathbb{P}=M$, where $M$ is a $d$-dimensional, compact manifold with reach $\tau>0$, and such that the density of $\mathbb{P}$ with respect to the uniform measure on $M$ is lower bounded by $w>0$. Suppose that

n>\frac{d\operatorname{vol}M}{4w\omega_{d}}\left(\frac{\tau}{8}\right)^{-d}.

Then, almost surely,

W_{1}(P_{n},\mathbb{P})\geq\frac{1}{32}\left(\frac{d\operatorname{vol}M}{4w\omega_{d}}\right)^{\frac{1}{d}}n^{-\frac{1}{d}}.

If we assume only that

n>\left(\frac{d(-\kappa_{2})^{\frac{d}{2}}\operatorname{vol}M}{4w\omega_{d}}e^{d\iota\sqrt{-\kappa_{2}}}\right)\iota^{-d}

then, almost surely,

W_{1}^{M}(P_{n},\mathbb{P})\geq\frac{1}{32}\left(\frac{d\operatorname{vol}M}{4w\omega_{d}}\right)^{\frac{1}{d}}(-\kappa_{2})^{\frac{1}{2}}e^{\iota\sqrt{-\kappa_{2}}}n^{-\frac{1}{d}}.

An easy proof, based on the techniques of (Weed et al., 2019, Proposition 6), can be found in Appendix E. Similarly, we can apply the same proof technique as in Lemma 15 to establish the following upper bound:

Proposition 19.

Let $M\subset\mathbb{R}^{D}$ be a compact manifold with positive reach $\tau$ and dimension $d>2$. Furthermore, suppose that $\mathbb{P}$ is a probability measure on $\mathbb{R}^{D}$ with $\operatorname{supp}\mathbb{P}\subset M$. Let $X_{1},\dots,X_{n},X_{1}^{\prime},\dots,X_{n}^{\prime}\sim\mathbb{P}$ be independent with corresponding empirical distributions $P_{n},P_{n}^{\prime}$. Then if $\operatorname{diam}M=\Delta$, we have:

\mathbb{E}\left[W_{1}^{M}(P_{n},\mathbb{P})\right]\leq\mathbb{E}\left[W_{1}^{M}(P_{n},P_{n}^{\prime})\right]\leq C\left(\frac{\operatorname{vol}M}{n\omega_{d}}\right)^{\frac{1}{d}}\sqrt{\log\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right)}.

The full proof is in Appendix E and applies symmetrization and chaining, with an upper bound of Corollary 14. We note, as before, that a similar asymptotic rate is obtained by Weed et al., (2019) in a slightly different setting.

We noted above (3) that we required two facts to make our intuition precise. We have just shown that the first holds; we turn now to the second: concentration. To make this rigorous, we need one last technical concept: the $T_{2}$-inequality.

Definition 20.

Let $\mu$ be a measure on a metric space $(M,d)$. We say that $\mu$ satisfies a $T_{2}$-inequality with constant $c_{2}$ if for all measures $\nu\ll\mu$, we have

W_{2}(\mu,\nu)\leq\sqrt{2c_{2}D(\nu||\mu)}

where $D(\nu||\mu)=\mathbb{E}_{\nu}\left[\log\frac{d\nu}{d\mu}\right]$ is the well-known KL-divergence.

The reason that the $T_{2}$ inequality is useful for us is that Bobkov & Götze, (1999) tell us that such an inequality implies, and is, by Gozlan et al., (2009), equivalent to Lipschitz concentration. We note further that $W_{1}(P_{n},P_{n}^{\prime})$ is a Lipschitz function of the dataset and thus concentrates about its mean. The constant in the $T_{2}$ inequality depends on the measure $\mu$; upper bounds for specific classes of measures are well known, and sharper bounds remain an active area of research. For a more complete survey, see Bakry et al., (2014). We have the following bound:

Proposition 21.

Let $\mathbb{P}$ be a probability measure on $\mathbb{R}^{D}$ that has a density with respect to the (normalized) volume measure of $M$ lower bounded by $w$ and upper bounded by $W$, where $M$ is a $d$-dimensional manifold with reach $\tau>0$ and $\operatorname{diam}M=\Delta$. Then we have:

c_{2}\leq\frac{2\tau^{2}}{d-1}\frac{W}{w}\exp\left(d\log 3+\frac{3d^{2}\Delta^{2}}{\tau^{2}}\right). (4)

In order to bound the $T_{2}$ constant in our case, we rely on the landmark result of Otto & Villani, (2000) that relates $c_{2}$ to another functional inequality, the log-Sobolev inequality (Bakry et al., 2014, Chapter 5). There are many ways to control the log-Sobolev constant in various situations, many of which are covered in Bakry et al., (2014). We use results from Wang, (1997b), which incorporate the intrinsic geometry of the distribution, as our bound. A detailed proof can be found in Appendix B. We note that many other estimates under slightly different conditions exist, such as that in Wang, (1997a), which requires second-order control of the density of the population distribution with respect to the volume measure, and the bound in Block et al., (2020), which provides control using a measure of nonconvexity. With added assumptions, we can gain much sharper control over $c_{2}$; for example, if we assume a positive lower bound on the curvature of the support, we can apply the well-known Bakry-Émery result (Bakry & Émery, 1985) and get dimension-free bounds. As another example, if we can assume stronger control on the curvature of $M$ beyond that guaranteed by the reach, we can remove the exponential dependence on the reach entirely. For the sake of simplicity, and because we already admit an exponential dependence on the intrinsic dimension, we present only the more general bound here. We now provide a non-asymptotic bound on the quality of the estimator $\widetilde{d}_{n}$.

Theorem 22.

Let \mathbb{P} be a probability measure on D\mathbb{R}^{D} and suppose that \mathbb{P} has a density with respect to the (normalized) volume measure of MM lower bounded by ww, where MM is a dd-dimensional manifold with reach τ>0\tau>0 such that d3d\geq 3 and diamM=Δ\operatorname{diam}M=\Delta. Furthermore, suppose that \mathbb{P} satisfies a T2T_{2} inequality with constant c2c_{2}. Let γ>0\gamma>0 and suppose α,n\alpha,n satisfy

n\displaystyle n max[dvolM4wωd(8ι)d,(8c2Δ2log1ρ)2dd5]\displaystyle\geq\max\left[\frac{d\operatorname{vol}M}{4w\omega_{d}}\left(\frac{8}{\iota}\right)^{d},\left(\frac{8c_{2}}{\Delta^{2}}\log\frac{1}{\rho}\right)^{\frac{2d}{d-5}}\right]
α\displaystyle\alpha max[logd2γ(nωdΔddvolM),(48w)1γ,3dγ]\geq\max\left[\log^{\frac{d}{2\gamma}}\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right),(48w)^{\frac{1}{\gamma}},3^{\frac{d}{\gamma}}\right]
αn\displaystyle\alpha n dvolM2wωd(16πτ)dlog(dvolMρωd(16πτ)d).\displaystyle\geq\frac{d\operatorname{vol}M}{2w\omega_{d}}\left(\frac{16\pi}{\tau}\right)^{d}\log\left(\frac{d\operatorname{vol}M}{\rho\omega_{d}}\left(\frac{16\pi}{\tau}\right)^{d}\right).

Suppose we have 2(α+1)n2(\alpha+1)n samples drawn independently from \mathbb{P}. Then, with probability at least 16ρ1-6\rho, we have

d1+3γd~n(1+3γ)d.\frac{d}{1+3\gamma}\leq\widetilde{d}_{n}\leq(1+3\gamma)d.

If ι\iota is replaced by τ\tau above, we get the same bound with the vanilla estimator dnd_{n} replacing d~n\widetilde{d}_{n}.

We note that we have not made every effort to minimize the constants in the statement above, with our emphasis being the dependence of these sample complexity bounds on the relevant geometric quantities. As an immediate consequence of Theorem 22, due to the fact that dd is discrete, we can control the probability of error with sufficiently many samples. We may also apply Proposition 21 to replace c2c_{2} with our upper bound in terms of the reach.

Corollary 23.

Suppose we are in the situation of Theorem 22 and that \mathbb{P} has density upper bounded by WW with respect to the normalized uniform measure on MM. Suppose further that α,n\alpha,n satisfy

n\displaystyle n max[dvolM4wωd(8ι)d,(82τ2Δ2(d1)Wwexp(dlog3+3d2Δ2τ2)log1ρ)2dd5]\displaystyle\geq\max\left[\frac{d\operatorname{vol}M}{4w\omega_{d}}\left(\frac{8}{\iota}\right)^{d},\left(8\frac{2\tau^{2}}{\Delta^{2}(d-1)}\frac{W}{w}\exp\left(d\log 3+\frac{3d^{2}\Delta^{2}}{\tau^{2}}\right)\log\frac{1}{\rho}\right)^{\frac{2d}{d-5}}\right]
α\displaystyle\alpha max[log2d2(nωdΔddvolM),(48w)3d,33d2]\displaystyle\geq\max\left[\log^{2d^{2}}\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right),(48w)^{3d},3^{3d^{2}}\right]
αn\displaystyle\alpha n dvolM2wωd(16πτ)dlog(dvolMρωd(16πτ)d).\displaystyle\geq\frac{d\operatorname{vol}M}{2w\omega_{d}}\left(\frac{16\pi}{\tau}\right)^{d}\log\left(\frac{d\operatorname{vol}M}{\rho\omega_{d}}\left(\frac{16\pi}{\tau}\right)^{d}\right).

Then, if we round d~n\widetilde{d}_{n} to the nearest integer and denote the resulting estimator by dnd_{n}^{\prime}, we have dn=dd_{n}^{\prime}=d with probability at least 16ρ1-6\rho. Again, replacing ι\iota by τ\tau in the previous display yields the same result with d~n\widetilde{d}_{n} replaced by the vanilla estimator dnd_{n}.

Proof.

Note that because dd\in\mathbb{N}, if |d~nd|12\left|\widetilde{d}_{n}-d\right|\leq\frac{1}{2}, then rounding d~n\widetilde{d}_{n} to the nearest integer exactly recovers dd. Setting γ<14d\gamma<\frac{1}{4d} and plugging into the result of Theorem 22, along with an application of Proposition 21 to bound c2c_{2}, concludes the proof. ∎

While the appearance of ι\iota in Theorem 22 and Corollary 23 may seem minor, it is critical for any practical estimator. Although αn=Ω(τd)\alpha n=\Omega\left(\tau^{-d}\right), we may take nn as small as Ω(ιd)\Omega\left(\iota^{-d}\right). Thus, using d~n\widetilde{d}_{n} instead of the naive estimator dnd_{n} allows us to leverage the entire data set in estimating the intrinsic distances, even on the small sub-samples. From the proof, it is clear that we want α\alpha to be as large as possible; thus, if we have a total of NN samples, we wish to make nn as small as possible. If ιτ\iota\gg\tau, then we can make nn much smaller (scaling like ιd\iota^{-d}) than if we were to simply use the Euclidean distance. As a result, on any data set where ιτ\iota\gg\tau, the sample complexity of d~n\widetilde{d}_{n} can be much smaller than that of dnd_{n}.
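To make the preceding discussion concrete, the following is a minimal Python sketch of the vanilla estimator dnd_{n}, which compares empirical Wasserstein distances at sample sizes nn and αn\alpha n. Everything here is an illustrative assumption: the helper names, the choice α=4\alpha=4, the sample split, and the toy example; in particular, Euclidean distances are used, so this is dnd_{n} rather than d~n\widetilde{d}_{n}. The empirical W1W_{1} distance between two equal-size point clouds is computed as an optimal assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def empirical_w1(X, Y):
    # W1 between the uniform empirical measures on two equal-size point clouds:
    # an optimal assignment problem on the pairwise Euclidean cost matrix.
    C = cdist(X, Y)
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].mean()

def vanilla_dimension_estimate(X, n, alpha, seed=0):
    # d_n = log(alpha) / log( W1(P_n, P_n') / W1(P_{alpha n}, P_{alpha n}') ),
    # using 2*(alpha + 1)*n points split into four disjoint subsamples.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    small_1, small_2 = X[idx[:n]], X[idx[n:2 * n]]
    big_1 = X[idx[2 * n:(2 + alpha) * n]]
    big_2 = X[idx[(2 + alpha) * n:(2 + 2 * alpha) * n]]
    ratio = empirical_w1(small_1, small_2) / empirical_w1(big_1, big_2)
    return np.log(alpha) / np.log(ratio)

# Toy example: the unit sphere S^3 (intrinsic dimension 3) embedded in R^10.
rng = np.random.default_rng(1)
Z = rng.standard_normal((4000, 4))
sphere = Z / np.linalg.norm(Z, axis=1, keepdims=True)
X = np.zeros((4000, 10))
X[:, :4] = sphere
print(vanilla_dimension_estimate(X, n=200, alpha=4))  # typically near 3, though noisy; Theorem 22 assumes d >= 3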

There are two parts to the proof of Theorem 22: first, we need to establish that our metric dGd_{G} approximates dMd_{M} with high probability and thus d~nd^n\widetilde{d}_{n}\approx\widehat{d}_{n}; second, we need to show that d^n\widehat{d}_{n} is, indeed, a good estimate of dd. The second part follows from Propositions 19 and 18, and concentration; a detailed proof can be found in Appendix C. For the first part of the proof, in order to show that d^nd~n\widehat{d}_{n}\approx\widetilde{d}_{n}, we demonstrate that dMdGd_{M}\approx d_{G} in the following result:

Proposition 24.

Let \mathbb{P} be a probability measure on D\mathbb{R}^{D} and suppose that supp=M\operatorname{supp}\mathbb{P}=M, a geodesically convex, compact manifold of dimension dd and reach τ>0\tau>0. Suppose that we sample X1,,XnX_{1},\dots,X_{n}\sim\mathbb{P} independently. Let λ12\lambda\leq\frac{1}{2} and G=G(X,τλ)G=G(X,\tau\lambda). If for some ρ<1\rho<1,

nwB(τλ28)1logN(M,dM,τλ28)ρn\geq w_{B}\left(\frac{\tau\lambda^{2}}{8}\right)^{-1}\log\frac{N\left(M,d_{M},\frac{\tau\lambda^{2}}{8}\right)}{\rho}

where for any δ>0\delta>0

wB(δ)=infpM(BδM(p))w_{B}(\delta)=\inf_{p\in M}\mathbb{P}(B_{\delta}^{M}(p))

with BδM(p)B_{\delta}^{M}(p) the metric ball around pp of radius δ\delta. Then, with probability at least 1ρ1-\rho, for all x,yMx,y\in M,

(1λ)dM(x,y)dG(x,y)(1+λ)dM(x,y).\left(1-\lambda\right)d_{M}(x,y)\leq d_{G}(x,y)\leq(1+\lambda)d_{M}(x,y).

The proof of Proposition 24 follows the general outline of Bernstein et al., (2000), but is modified in two key ways: first, we control relevant geometric quantities by τ\tau instead of by the quantities in Bernstein et al., (2000); second, we provide a quantitative, nonasymptotic bound on the number of samples needed to get a good approximation with high probability. The details are deferred to Appendix D.

This result may be of interest in its own right as it provides a non-asymptotic version of the results from Tenenbaum et al., (2000); Bernstein et al., (2000). In particular, if we suppose that \mathbb{P} has a density with respect to the uniform measure on MM and this density is bounded below by a constant w>0w>0, then Proposition 24 combined with Proposition 9 tells us that if we have

nvolMw(τλ2)dlog(volMρτλ2)n\gtrsim\frac{\operatorname{vol}M}{w}\left(\tau\lambda^{2}\right)^{-d}\log\left(\frac{\operatorname{vol}M}{\rho\tau\lambda^{2}}\right)

samples, then we can recover the intrinsic distance of MM with distortion λ\lambda. We further note that the dependence on τ,λ,d\tau,\lambda,d is quite reasonable in Proposition 24. The argument requires the construction of a τλ2\tau\lambda^{2}-net on MM and it is not difficult to see that one needs a covering at scale proportional to τλ\tau\lambda in order to recover the intrinsic metric from discrete data points. For example, consider Figure 2; were a curve to be added to connect the points at the bottleneck, this would drastically decrease the intrinsic distance between the bottleneck points. In order to determine that the intrinsic distance between these points (without the connector) is actually quite large using the graph metric estimator, we need to set ε<τ\varepsilon<\tau, in which case these points are certainly only connected if there exists a point of distance less than τ\tau to the bottleneck point, which can only occur with high probability if n=Ω(τ1)n=\Omega\left(\tau^{-1}\right). We can extend this example to arbitrary dimension dd by taking the product of the curve with rSd1rS^{d-1} for r=Θ(τ)r=\Theta(\tau); in this case, a similar argument holds and we now need Ω(τd)\Omega\left(\tau^{-d}\right) points in order to guarantee with high probability that there exists a point of distance at most τ\tau to one of the bottleneck points. In this way, we see that the τd\tau^{-d} scaling is unavoidable in general. Note that the other estimators of intrinsic dimension mentioned in the introduction, in particular the MLE estimator of Levina & Bickel, (2004), implicitly require the accuracy of the kkNN distance for their estimation to hold; thus these estimators also suffer from the τd\tau^{-d} sample complexity. Finally, we remark that Kim et al., (2019) presents a minimax lower bound for a related hypothesis testing problem and shows that minimax risk is bounded below by a local analogue of the reach raised to a power that depends linearly on the intrinsic dimension.
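For concreteness, here is a minimal Python sketch of the graph metric dGd_{G} discussed above: connect every pair of sample points whose Euclidean distance is at most ε\varepsilon, weight each edge by that distance, and take shortest-path distances. The helper name and the way ε\varepsilon is chosen are illustrative assumptions; Proposition 24 suggests taking ε\varepsilon proportional to τλ\tau\lambda. With such a matrix of pairwise estimates of dMd_{M} in hand, the estimator d~n\widetilde{d}_{n} replaces the Euclidean costs in the Wasserstein computations by these graph distances.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def graph_metric(X, eps):
    # All-pairs graph distances d_G on G(X, eps): vertices are the sample points,
    # and two points are joined by an edge of length equal to their Euclidean
    # distance whenever that distance is at most eps.
    D = cdist(X, X)
    adjacency = np.where(D <= eps, D, 0.0)  # zeros mark absent edges in csgraph's dense format
    return shortest_path(adjacency, method="D", directed=False)

# Entries equal to np.inf indicate that the graph is disconnected at this scale,
# in which case eps should be increased (or more samples drawn).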

4 Application of Techniques to GANs

In this section, we note that our techniques are not confined to the realm of dimension estimation and, in fact, readily apply to other problems. As an example, consider the unsupervised learning problem of generative modeling, where we suppose that there are independent samples X1,,XnX_{1},\dots,X_{n}\sim\mathbb{P} and we wish to produce a sample X^^\widehat{X}\sim\widehat{\mathbb{P}} such that ^\widehat{\mathbb{P}} and \mathbb{P} are close. Statistically, this problem can be expressed by fixing a class of distributions 𝒫\mathcal{P} and using the data to choose μ^𝒫\widehat{\mu}\in\mathcal{P} such that μ^\widehat{\mu} is in some sense close to \mathbb{P}. For computational reasons, one wishes 𝒫\mathcal{P} to contain distributions from which it is computationally efficient to sample; in practice, 𝒫\mathcal{P} is usually the class of pushforwards of a multi-variate Gaussian distribution by some deep neural network class 𝒢\mathcal{G}. While our statistical results include this setting, they are not restricted to it and apply to general classes of distributions 𝒫\mathcal{P}.

In order to make the problem more precise, we require some notion of distance between distributions. We use the notion of the Integral Probability Metric (Müller,, 1997; Sriperumbudur et al.,, 2012) associated to a Hölder ball CBβ(Ω)C_{B}^{\beta}(\Omega), as defined above. We suppose that suppΩ\operatorname{supp}\mathbb{P}\subset\Omega and we abbreviate the corresponding IPM distance by dβ,Bd_{\beta,B}. Given the empirical distribution PnP_{n}, the GAN that we study can be expressed as

μ^argminμ𝒫dβ,B(μ,Pn)=argminμ𝒫supfCBβ(Ω)𝔼μ[f]Pnf.\widehat{\mu}\in\operatorname*{argmin}_{\mu\in\mathcal{P}}d_{\beta,B}(\mu,P_{n})=\operatorname*{argmin}_{\mu\in\mathcal{P}}\sup_{f\in C_{B}^{\beta}(\Omega)}\mathbb{E}_{\mu}[f]-P_{n}f.
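As a schematic illustration of this estimator, consider the heavily simplified setting in which 𝒫\mathcal{P} is a finite family of candidate distributions, each represented by a sampler, and the discriminating class is taken with β=B=1\beta=B=1, so that the IPM essentially reduces to W1W_{1} on a bounded domain. The sketch below simply selects the candidate whose samples are closest to the data in empirical W1W_{1}; the function names, the sample size, and the sampler interface are illustrative assumptions, and in practice the inner supremum is instead approximated by a trained discriminator network.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def empirical_w1(X, Y):
    # Empirical W1 between two equal-size point clouds via optimal assignment.
    C = cdist(X, Y)
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].mean()

def select_model(samplers, data, m=300, seed=0):
    # Pick mu_hat from a finite candidate family by minimizing the empirical W1
    # distance to the (possibly noisy or contaminated) observations.
    rng = np.random.default_rng(seed)
    scores = [empirical_w1(sampler(m, rng), data[:m]) for sampler in samplers]
    best = int(np.argmin(scores))
    return best, scores

# Each sampler is assumed to be a callable (m, rng) -> array of shape (m, D),
# e.g. the pushforward of a Gaussian by a fixed generator network.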

In this section, we generalize the results of Schreuder et al., (2020). In particular, we derive new estimation rates for a GAN using a Hölder ball as a discriminating class, assuming that the population distribution \mathbb{P} is low-dimensional; like Schreuder et al., (2020), we consider the noised and potentially contaminated setting. We have

Theorem 25.

Suppose that \mathbb{P} is a probability measure on D\mathbb{R}^{D} supported on a compact set 𝒮\mathcal{S} and suppose we have nn independent XiX_{i}\sim\mathbb{P} with empirical distribution PnP_{n}. Let ηi\eta_{i} be independent, centred random variables on D\mathbb{R}^{D} such that 𝔼[ηi2]σ2\mathbb{E}\left[\left|\left|\eta_{i}\right|\right|^{2}\right]\leq\sigma^{2}. Suppose we observe X~i\widetilde{X}_{i} such that for at least (1ε)n(1-\varepsilon)n of the X~i\widetilde{X}_{i}, we have X~i=Xi+ηi\widetilde{X}_{i}=X_{i}+\eta_{i}; let the empirical distribution of the X~i\widetilde{X}_{i} be P~n\widetilde{P}_{n}. Let 𝒫\mathcal{P} be a known set of distributions and define

μ^argminμ𝒫dβ,B(μ,P~n).\widehat{\mu}\in\operatorname*{argmin}_{\mu\in\mathcal{P}}d_{\beta,B}(\mu,\widetilde{P}_{n}).

Then if there are some C1,dC_{1},d such that N(𝒮,||||,δ)C1δdN(\mathcal{S},\left|\left|\cdot\right|\right|,\delta)\leq C_{1}\delta^{-d} for all δ>0\delta>0, we have

𝔼[dβ,B(μ^,)]infμ𝒫dβ,B(μ,)+B(σ+2ε)+CβBlogn(nβdn12)\mathbb{E}\left[d_{\beta,B}(\widehat{\mu},\mathbb{P})\right]\leq\inf_{\mu\in\mathcal{P}}d_{\beta,B}(\mu,\mathbb{P})+B(\sigma+2\varepsilon)+C\beta B\sqrt{\log n}\left(n^{-\frac{\beta}{d}}\vee n^{-\frac{1}{2}}\right)

where CC is a constant depending linearly on C1C_{1}.

We note that the logn\log n factor can easily be removed in all cases βd2\beta\neq\frac{d}{2} at the price of slightly larger constants; for the sake of simplicity, we do not pursue this argument here. The proof of Theorem 25 is similar in spirit to that of Schreuder et al., (2020), which in turn follows Liang, (2018), with details in Appendix E. The key step is applying the bounds in Lemma 15 to the arguments of Liang, (2018).

We compare our result to the corresponding theorem (Schreuder et al.,, 2020, Theorem 2). In that work, the authors considered a setting where there is a known intrinsic dimension dd and the population distribution =g#𝒰([0,1]d)\mathbb{P}=g_{\#}\mathcal{U}\left([0,1]^{d}\right), the push-forward by an LL-Lipschitz function gg of the uniform distribution on a dd-dimensional hypercube; in addition, they take 𝒫\mathcal{P} to be the set of push-forwards of U([0,1]d)U\left([0,1]^{d}\right) by functions in some class \mathcal{F}, all of whose elements are LL-Lipschitz. Their result, (Schreuder et al.,, 2020, Theorem 2), gives an upper bound of

𝔼[dβ,1(μ^,)]infμ𝒫dβ,1(μ,)+L(σ+2ε)+cLd(nβdn12).\mathbb{E}\left[d_{\beta,1}(\widehat{\mu},\mathbb{P})\right]\leq\inf_{\mu\in\mathcal{P}}d_{\beta,1}(\mu,\mathbb{P})+L(\sigma+2\varepsilon)+cL\sqrt{d}\left(n^{-\frac{\beta}{d}}\vee n^{-\frac{1}{2}}\right). (4)

Note that our result is an improvement in two key respects. First, we do not treat the intrinsic dimension dd as known, nor do we force the dimension of the feature space to be the same as the intrinsic dimension. Many of the state-of-the-art GAN architectures on datasets such as ImageNet use a feature space of dimension 128 or 256 (Wu et al.,, 2019); the best rate that the work of Schreuder et al., (2020) can give would then be n1128n^{-\frac{1}{128}}. In our setting, even if the feature space is complex, if the true distribution lies on a much lower dimensional subspace, then it is the true, intrinsic dimension that determines the rate of estimation. Second, note that the upper bound in (4) depends on the Lipschitz constant LL; as the function classes used to define the push-forwards are essentially all deep neural networks in practice, and the Lipschitz constants of such functions are exponential in depth, this can be a very pessimistic upper bound. Our result, however, does not depend on this Lipschitz constant, but rather on properties intrinsic to the probability distribution \mathbb{P}. This dependence is particularly notable in the noisy regime, where σ,ε\sigma,\varepsilon do not vanish; the large multiplicative factor of LL in this case would then make the bound useless.

We conclude this section by considering the case most often used in practice: the Wasserstein GAN.

Corollary 26.

Suppose we are in the setting of Theorem 25 and 𝒮\mathcal{S} is contained in a ball of radius RR for R12R\geq\frac{1}{2}. Then,

𝔼[W1(μ^,)]infμ𝒫W1(μ,)+σ+2Rε+CRlognn1d.\displaystyle\mathbb{E}\left[W_{1}(\widehat{\mu},\mathbb{P})\right]\leq\inf_{\mu\in\mathcal{P}}W_{1}(\mu,\mathbb{P})+\sigma+2R\varepsilon+CR\sqrt{\log n}n^{-\frac{1}{d}}.

The proof of the corollary is almost immediate from Theorem 25. With additional assumptions on the tails of the ηi\eta_{i}, we can turn our bound in expectation into a high-probability statement. In the special case with neither noise nor contamination, i.e., σ=ε=0\sigma=\varepsilon=0, we get that the Wasserstein GAN converges in Wasserstein distance at a rate of n1dn^{-\frac{1}{d}}, which we believe explains in large part the recent empirical success of modern Wasserstein GANs.

Acknowledgements

We acknowledge support from the NSF through award DMS-2031883 and from the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning. We acknowledge the support from NSF under award DMS-1953181, NSF Graduate Research Fellowship support under Grant No. 1122374, and support from the MIT-IBM Watson AI Lab.

References

  • Aamari et al., (2019) Aamari, Eddie, Kim, Jisu, Chazal, Frédéric, Michel, Bertrand, Rinaldo, Alessandro, Wasserman, Larry, et al. 2019. Estimating the reach of a manifold. Electronic journal of statistics, 13(1), 1359–1399.
  • Amenta & Bern, (1999) Amenta, Nina, & Bern, Marshall. 1999. Surface reconstruction by Voronoi filtering. Discrete & Computational Geometry, 22(4), 481–504.
  • Arjovsky et al., (2017) Arjovsky, Martin, Chintala, Soumith, & Bottou, Léon. 2017. Wasserstein generative adversarial networks. Pages 214–223 of: International conference on machine learning. PMLR.
  • Ashlagi et al., (2021) Ashlagi, Yair, Gottlieb, Lee-Ad, & Kontorovich, Aryeh. 2021. Functions with average smoothness: structure, algorithms, and learning. Pages 186–236 of: Conference on Learning Theory. PMLR.
  • Assouad, (1983) Assouad, Patrice. 1983. Plongements lipschitziens dans rnr^{n}. Bulletin de la Société Mathématique de France, 111, 429–448.
  • Bakry & Émery, (1985) Bakry, Dominique, & Émery, Michel. 1985. Diffusions hypercontractives. Pages 177–206 of: Seminaire de probabilités XIX 1983/84. Springer.
  • Bakry et al., (2014) Bakry, Dominique, Gentil, Ivan, & Ledoux, Michel. 2014. Analysis and Geometry of Markov Diffusion Operators. Springer International Publishing.
  • Belkin & Niyogi, (2001) Belkin, Mikhail, & Niyogi, Partha. 2001. Laplacian eigenmaps and spectral techniques for embedding and clustering. Pages 585–591 of: Nips, vol. 14.
  • Bernstein et al., (2000) Bernstein, Mira, Silva, Vin De, Langford, John C., & Tenenbaum, Joshua B. 2000. Graph Approximations to Geodesics on Embedded Manifolds.
  • Bickel et al., (2007) Bickel, Peter J, Li, Bo, et al. 2007. Local polynomial regression on unknown manifolds. Pages 177–186 of: Complex datasets and inverse problems. Institute of Mathematical Statistics.
  • Block et al., (2020) Block, Adam, Mroueh, Youssef, Rakhlin, Alexander, & Ross, Jerret. 2020. Fast mixing of multi-scale Langevin dynamics under the manifold hypothesis. arXiv preprint arXiv:2006.11166.
  • Bobkov & Götze, (1999) Bobkov, Sergej G, & Götze, Friedrich. 1999. Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. Journal of Functional Analysis, 163(1), 1–28.
  • Boissonnat et al., (2019) Boissonnat, Jean-Daniel, Lieutier, André, & Wintraecken, Mathijs. 2019. The reach, metric distortion, geodesic convexity and the variation of tangent spaces. Journal of applied and computational topology, 3(1), 29–58.
  • Camastra & Staiano, (2016) Camastra, Francesco, & Staiano, Antonino. 2016. Intrinsic dimension estimation: Advances and open problems. Information Sciences, 328, 26–41.
  • Camastra & Vinciarelli, (2002) Camastra, Francesco, & Vinciarelli, Alessandro. 2002. Estimating the intrinsic dimension of data with a fractal-based method. IEEE Transactions on pattern analysis and machine intelligence, 24(10), 1404–1407.
  • Chen et al., (2020) Chen, Minshuo, Liao, Wenjing, Zha, Hongyuan, & Zhao, Tuo. 2020. Statistical Guarantees of Generative Adversarial Networks for Distribution Estimation.
  • Cuturi, (2013) Cuturi, Marco. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. Pages 2292–2300 of: Advances in neural information processing systems.
  • Dasgupta & Freund, (2008) Dasgupta, Sanjoy, & Freund, Yoav. 2008. Random projection trees and low dimensional manifolds. Pages 537–546 of: Proceedings of the fortieth annual ACM symposium on Theory of computing.
  • Donoho & Grimes, (2003) Donoho, David L, & Grimes, Carrie. 2003. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10), 5591–5596.
  • Dudley, (1969) Dudley, Richard Mansfield. 1969. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics, 40(1), 40–50.
  • Efimov et al., (2019) Efimov, Kirill, Adamyan, Larisa, & Spokoiny, Vladimir. 2019. Adaptive nonparametric clustering. IEEE Transactions on Information Theory, 65(8), 4875–4892.
  • Farahmand et al., (2007) Farahmand, Amir Massoud, Szepesvári, Csaba, & Audibert, Jean-Yves. 2007. Manifold-adaptive dimension estimation. Pages 265–272 of: Proceedings of the 24th international conference on Machine learning.
  • Federer, (1959) Federer, Herbert. 1959. Curvature measures. Transactions of the American Mathematical Society, 93(3), 418–491.
  • Fefferman et al., (2016) Fefferman, Charles, Mitter, Sanjoy, & Narayanan, Hariharan. 2016. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4), 983–1049.
  • Fefferman et al., (2018) Fefferman, Charles, Ivanov, Sergei, Kurylev, Yaroslav, Lassas, Matti, & Narayanan, Hariharan. 2018. Fitting a putative manifold to noisy data. Pages 688–720 of: Conference On Learning Theory. PMLR.
  • Fukunaga & Olsen, (1971) Fukunaga, Keinosuke, & Olsen, David R. 1971. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, 100(2), 176–183.
  • Gigli & Ledoux, (2013) Gigli, Nicola, & Ledoux, Michel. 2013. From log Sobolev to Talagrand: a quick proof. Discrete and Continuous Dynamical Systems-Series A, dcds–2013.
  • Giné & Nickl, (2016) Giné, Evarist, & Nickl, Richard. 2016. Mathematical foundations of infinite-dimensional statistical models. Cambridge University Press.
  • Goodfellow et al., (2014) Goodfellow, Ian J, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, & Bengio, Yoshua. 2014. Generative adversarial networks. arXiv preprint arXiv:1406.2661.
  • Gottlieb et al., (2016) Gottlieb, Lee-Ad, Kontorovich, Aryeh, & Krauthgamer, Robert. 2016. Adaptive metric dimensionality reduction. Theoretical Computer Science, 620, 105–118.
  • Gozlan et al., (2009) Gozlan, Nathael, et al. 2009. A characterization of dimension free concentration in terms of transportation inequalities. The Annals of Probability, 37(6), 2480–2498.
  • Grassberger & Procaccia, (2004) Grassberger, Peter, & Procaccia, Itamar. 2004. Measuring the strangeness of strange attractors. Pages 170–189 of: The Theory of Chaotic Attractors. Springer.
  • Gray, (2004) Gray, Alfred. 2004. Tubes. Birkhäuser Basel.
  • Holley & Stroock, (1986) Holley, Richard, & Stroock, Daniel W. 1986. Logarithmic Sobolev inequalities and stochastic Ising models.
  • Hopf & Rinow, (1931) Hopf, Heinz, & Rinow, Willi. 1931. Über den Begriff der vollständigen differentialgeometrischen Fläche. Commentarii Mathematici Helvetici, 3(1), 209–225.
  • Kantorovich & Rubinshtein, (1958) Kantorovich, Leonid Vitaliyevich, & Rubinshtein, SG. 1958. On a space of totally additive functions. Vestnik of the St. Petersburg University: Mathematics, 13(7), 52–59.
  • Kégl, (2002) Kégl, Balázs. 2002. Intrinsic dimension estimation using packing numbers. Pages 681–688 of: NIPS. Citeseer.
  • Kim et al., (2019) Kim, Jisu, Rinaldo, Alessandro, & Wasserman, Larry. 2019. Minimax Rates for Estimating the Dimension of a Manifold. Journal of Computational Geometry, 10(1).
  • Kleindessner & Luxburg, (2015) Kleindessner, Matthäus, & Luxburg, Ulrike. 2015. Dimensionality estimation without distances. Pages 471–479 of: Artificial Intelligence and Statistics. PMLR.
  • Kolmogorov & Tikhomirov, (1993) Kolmogorov, A. N., & Tikhomirov, V. M. 1993. epsilon-Entropy and epsilon-Capacity of Sets In Functional Spaces. Pages 86–170 of: Mathematics and Its Applications. Springer Netherlands.
  • Koltchinskii, (2000) Koltchinskii, Vladimir I. 2000. Empirical geometry of multivariate data: a deconvolution approach. Annals of statistics, 591–629.
  • Kpotufe, (2011) Kpotufe, Samory. 2011. k-NN regression adapts to local intrinsic dimension. Pages 729–737 of: Proceedings of the 24th International Conference on Neural Information Processing Systems.
  • Kpotufe & Dasgupta, (2012) Kpotufe, Samory, & Dasgupta, Sanjoy. 2012. A tree-based regressor that adapts to intrinsic dimension. Journal of Computer and System Sciences, 78(5), 1496–1515.
  • Kpotufe & Garg, (2013) Kpotufe, Samory, & Garg, Vikas K. 2013. Adaptivity to Local Smoothness and Dimension in Kernel Regression. Pages 3075–3083 of: NIPS.
  • LeCun & Cortes, (2010) LeCun, Yann, & Cortes, Corinna. 2010. MNIST handwritten digit database.
  • Lee, (2018) Lee, John M. 2018. Introduction to Riemannian manifolds. Springer.
  • Levina & Bickel, (2004) Levina, Elizaveta, & Bickel, Peter. 2004. Maximum likelihood estimation of intrinsic dimension. Advances in neural information processing systems, 17, 777–784.
  • Liang, (2018) Liang, Tengyuan. 2018. On how well generative adversarial networks learn densities: Nonparametric and parametric results. arXiv preprint arXiv:1811.03179.
  • Little et al., (2009) Little, Anna V, Jung, Yoon-Mo, & Maggioni, Mauro. 2009. Multiscale estimation of intrinsic dimensionality of data sets. In: 2009 AAAI Fall Symposium Series.
  • Müller, (1997) Müller, Alfred. 1997. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 429–443.
  • Nakada & Imaizumi, (2020) Nakada, Ryumei, & Imaizumi, Masaaki. 2020. Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality. Journal of Machine Learning Research, 21(174), 1–38.
  • Narayanan & Mitter, (2010) Narayanan, Hariharan, & Mitter, Sanjoy. 2010. Sample complexity of testing the manifold hypothesis. Pages 1786–1794 of: Proceedings of the 23rd International Conference on Neural Information Processing Systems-Volume 2.
  • Niles-Weed & Rigollet, (2019) Niles-Weed, Jonathan, & Rigollet, Philippe. 2019. Estimation of Wasserstein distances in the spiked transport model. arXiv preprint arXiv:1909.07513.
  • Niyogi et al., (2008) Niyogi, Partha, Smale, Stephen, & Weinberger, Shmuel. 2008. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1-3), 419–441.
  • Niyogi et al., (2011) Niyogi, Partha, Smale, Stephen, & Weinberger, Shmuel. 2011. A topological view of unsupervised learning from noisy data. SIAM Journal on Computing, 40(3), 646–663.
  • Otto & Villani, (2000) Otto, Felix, & Villani, Cédric. 2000. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2), 361–400.
  • Pettis et al., (1979) Pettis, Karl W, Bailey, Thomas A, Jain, Anil K, & Dubes, Richard C. 1979. An intrinsic dimensionality estimator from near-neighbor information. IEEE Transactions on pattern analysis and machine intelligence, 25–37.
  • Roweis & Saul, (2000) Roweis, Sam T, & Saul, Lawrence K. 2000. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500), 2323–2326.
  • Schreuder et al., (2020) Schreuder, Nicolas, Brunel, Victor-Emmanuel, & Dalalyan, Arnak. 2020. Statistical guarantees for generative models without domination. arXiv preprint arXiv:2010.09237.
  • Sriperumbudur et al., (2012) Sriperumbudur, Bharath K, Fukumizu, Kenji, Gretton, Arthur, Schölkopf, Bernhard, Lanckriet, Gert RG, et al. 2012. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6, 1550–1599.
  • Steinwart et al., (2009) Steinwart, Ingo, Hush, Don R, Scovel, Clint, et al. 2009. Optimal Rates for Regularized Least Squares Regression. Pages 79–93 of: COLT.
  • Tenenbaum et al., (2000) Tenenbaum, Joshua B, De Silva, Vin, & Langford, John C. 2000. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500), 2319–2323.
  • Uppal et al., (2019) Uppal, Ananya, Singh, Shashank, & Póczos, Barnabás. 2019. Nonparametric density estimation & convergence rates for GANs under Besov IPM losses. arXiv preprint arXiv:1902.03511.
  • van Handel, (2014) van Handel, Ramon. 2014. Probability in high dimension. Tech. rept. PRINCETON UNIV NJ.
  • Villani, (2008) Villani, Cédric. 2008. Optimal transport: old and new. Vol. 338. Springer Science & Business Media.
  • Wang, (1997a) Wang, Feng-Yu. 1997a. Logarithmic Sobolev inequalities on noncompact Riemannian manifolds. Probability theory and related fields, 109(3), 417–424.
  • Wang, (1997b) Wang, Feng-Yu. 1997b. On estimation of the logarithmic Sobolev constant and gradient estimates of heat semigroups. Probability theory and related fields, 108(1), 87–101.
  • Weed et al., (2019) Weed, Jonathan, Bach, Francis, et al. 2019. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A), 2620–2648.
  • Wu et al., (2019) Wu, Yan, Donahue, Jeff, Balduzzi, David, Simonyan, Karen, & Lillicrap, Timothy. 2019. Logan: Latent optimisation for generative adversarial networks. arXiv preprint arXiv:1912.00953.

Appendix A Proofs from Section 2

Proof of Proposition 12.

We apply the method from the classic paper (Kolmogorov & Tikhomirov,, 1993), following notation introduced there as applicable. For the sake of simplicity, we assume that β\beta is an integer; the generalization to β\beta\not\in\mathbb{N} is analogous to that in Kolmogorov & Tikhomirov, (1993). Let Δβ=ε2B\Delta^{\beta}=\frac{\varepsilon}{2B} and let x1,,xsx_{1},\dots,x_{s} be a Δ\Delta-connected Δ\Delta net on 𝒮\mathcal{S}. For 0kβ0\leq k\leq\beta and 1is1\leq i\leq s, define

γik(f)=Dkf(xi)εk\displaystyle\gamma_{i}^{k}(f)=\left\lfloor\frac{\left|\left|D^{k}f(x_{i})\right|\right|}{\varepsilon_{k}}\right\rfloor εk=εΔk\displaystyle\varepsilon_{k}=\frac{\varepsilon}{\Delta^{k}}

where ||||\left|\left|\cdot\right|\right| is the norm on tensors induced by the ambient (Euclidean) metric and DkD^{k} is the kthk^{th} application of the covariant derivative. Let γ(f)=(γik(f))i,k\gamma(f)=\left(\gamma_{i}^{k}(f)\right)_{i,k} be the matrix of all γik(f)\gamma_{i}^{k}(f) and let UγU_{\gamma} be the set of all ff such that γ(f)=γ\gamma(f)=\gamma. Then the argument in the proof of (Kolmogorov & Tikhomirov,, 1993, Theorem XIV) applies mutatis mutandis and we note that UγU_{\gamma} are 2ε2\varepsilon neighborhoods in the Hölder norm. Thus it suffices to bound the number of possible γ\gamma. As in Kolmogorov & Tikhomirov, (1993), we note that the number of possible values for γ1k\gamma_{1}^{k} is at most 2Bεk\frac{2B}{\varepsilon_{k}}. Given the row (γik)0kβ\left(\gamma_{i}^{k}\right)_{0\leq k\leq\beta}, there are at most (4e+2)β+1(4e+2)^{\beta+1} values for the next row. Thus the total number of possible γ\gamma is bounded by

((4e+2)β+1)sk=1β2Bεk\displaystyle\left((4e+2)^{\beta+1}\right)^{s}\prod_{k=1}^{\beta}\frac{2B}{\varepsilon_{k}} =(4e+2)(β+1)sk=1β2Bε(ε2B)kβ=(4e+2)(β+1)s(2Bε)β2.\displaystyle=(4e+2)^{(\beta+1)s}\prod_{k=1}^{\beta}\frac{2B}{\varepsilon}\left(\frac{\varepsilon}{2B}\right)^{\frac{k}{\beta}}=(4e+2)^{(\beta+1)s}\left(\frac{2B}{\varepsilon}\right)^{\frac{\beta}{2}}.

By definition of the covering number and the fact that 𝒮\mathcal{S} is path-connected, we may take

s=N(𝒮,Δ)=N(𝒮,(ε2B)1β).s=N(\mathcal{S},\Delta)=N\left(\mathcal{S},\left(\frac{\varepsilon}{2B}\right)^{\frac{1}{\beta}}\right).

Taking logarithms and noting that log(4e+2)3\log(4e+2)\leq 3 concludes the proof of the upper bound.

The middle inequality is Lemma 3. For the lower bound, we again follow Kolmogorov & Tikhomirov, (1993). Define

φ(x)={ai=1D(1xi2)β2x10otherwise\varphi(x)=\begin{cases}a\prod_{i=1}^{D}\left(1-x_{i}^{2}\right)^{\frac{\beta}{2}}&\left|\left|x\right|\right|_{\infty}\leq 1\\ 0&\text{otherwise}\end{cases}

with aa a constant to be set. Choose a 2Δ2\Delta-separated set x1,,xsx^{1},\dots,x^{s} with Δ=(ε2B)1β\Delta=\left(\frac{\varepsilon}{2B}\right)^{\frac{1}{\beta}} and consider the set of functions

gσ=i=1sσiΔβφ(xxiΔ)g_{\mathbf{\sigma}}=\sum_{i=1}^{s}\sigma_{i}\Delta^{\beta}\varphi\left(\frac{x-x^{i}}{\Delta}\right)

where σi{±1}\sigma_{i}\in\{\pm 1\} and σ\mathbf{\sigma} varies over all possible sets of signs. The results of Kolmogorov & Tikhomirov, (1993) guarantee that, if aa is chosen such that gσg_{\mathbf{\sigma}}\in\mathcal{F}, the gσg_{\mathbf{\sigma}} form a 2ε2\varepsilon-separated set in \mathcal{F}; there are 2s2^{s} such sign combinations. By definition of packing numbers, we may choose

s=D(𝒮,(εB)1β).s=D\left(\mathcal{S},\left(\frac{\varepsilon}{B}\right)^{\frac{1}{\beta}}\right).

This concludes the proof of the lower bound. ∎

Proof of Proposition 9.

We note first that the second statement follows from the first by applying (b) and (c) of Proposition 8 to control the curvature and injectivity radius in terms of the reach. Furthermore, the middle inequality in the last statement follows from Lemma 3. Thus we prove the first two statements.

A volume argument yields the following control:

N(M,||||g,ε)volMinfpMvolBε2(p)N\left(M,\left|\left|\cdot\right|\right|_{g},\varepsilon\right)\leq\frac{\operatorname{vol}M}{\inf_{p\in M}\operatorname{vol}B_{\frac{\varepsilon}{2}}(p)}

where Bε2(p)B_{\frac{\varepsilon}{2}}(p) is the ball around pp of radius ε2\frac{\varepsilon}{2} with respect to the metric gg. Thus it suffices to lower bound the volume of such a ball. Because ε<ι\varepsilon<\iota, we may apply the Bishop-Gromov comparison theorem (Gray,, 2004, Theorem 3.17) to get that

volBε(p)2πd2Γ(d2)0ε(sin(tκ1)κ1)d1𝑑t=ωd0ε(κ112sin(tκ1))d1𝑑t\operatorname{vol}B_{\varepsilon}(p)\geq\frac{2\pi^{\frac{d}{2}}}{\Gamma\left(\frac{d}{2}\right)}\int_{0}^{\varepsilon}\left(\frac{\sin\left(t\sqrt{\kappa_{1}}\right)}{\sqrt{\kappa_{1}}}\right)^{d-1}dt=\omega_{d}\int_{0}^{\varepsilon}\left(\kappa_{1}^{-\frac{1}{2}}\sin\left(t\sqrt{\kappa_{1}}\right)\right)^{d-1}dt

where κ1\kappa_{1} is an upper bound on the sectional curvature. We note that for tπ2κ1t\leq\frac{\pi}{2\sqrt{\kappa_{1}}}, we have sin(tκ1)2πtκ1\sin\left(t\sqrt{\kappa_{1}}\right)\geq\frac{2}{\pi}t\sqrt{\kappa_{1}} and thus

volBε(p)ωd0ε(2πt)d1𝑑t=ωdd(2π)d1εd.\operatorname{vol}B_{\varepsilon}(p)\geq\omega_{d}\int_{0}^{\varepsilon}\left(\frac{2}{\pi}t\right)^{d-1}dt=\frac{\omega_{d}}{d}\left(\frac{2}{\pi}\right)^{d-1}\varepsilon^{d}.

The upper bound follows from control on the sectional curvature by τ\tau, appearing in (Aamari et al.,, 2019, Proposition A.1), which, in turn, is an easy consequence of applying the Gauss formula to (a) of Proposition 8.

We lower bound the packing number through an argument analogous to that for the upper bound on the covering number, this time with an upper bound on the volume of a ball of radius ε\varepsilon, again from (Gray,, 2004, Theorem 3.17), now using a lower bound on the sectional curvature. In particular, we have, for ε<ι\varepsilon<\iota,

volBε(p)ωd0ε(sin(tκ2)κ2)d1𝑑t=ωd0ε(sinh(tκ2)κ2)d1𝑑t\operatorname{vol}B_{\varepsilon}(p)\leq\omega_{d}\int_{0}^{\varepsilon}\left(\frac{\sin\left(t\sqrt{\kappa_{2}}\right)}{\sqrt{\kappa_{2}}}\right)^{d-1}dt=\omega_{d}\int_{0}^{\varepsilon}\left(\frac{\sinh\left(t\sqrt{-\kappa_{2}}\right)}{\sqrt{-\kappa_{2}}}\right)^{d-1}dt

where κ2\kappa_{2} is a lower bound on the sectional curvature. Note that for t1κ2t\leq\frac{1}{\sqrt{-\kappa_{2}}}, we have

sinh(tκ2)κ2cosh(2)t4t.\frac{\sinh\left(t\sqrt{-\kappa_{2}}\right)}{\sqrt{-\kappa_{2}}}\leq\cosh(2)t\leq 4t.

Thus,

volBε(p)ωd0ε(4t)d1𝑑t=ωdd4dεd.\operatorname{vol}B_{\varepsilon}(p)\leq\omega_{d}\int_{0}^{\varepsilon}(4t)^{d-1}dt=\frac{\omega_{d}}{d}4^{d}\varepsilon^{d}.

The volume argument tells us that

N(M,||||g,r)volMsuppMvolBr(p)N\left(M,\left|\left|\cdot\right|\right|_{g},r\right)\geq\frac{\operatorname{vol}M}{\sup_{p\in M}\operatorname{vol}B_{r}(p)}

and the result follows.

If we wish to extend the range of ε\varepsilon, we pay with a constant exponential in dd, reflecting the growth in volume of balls in negatively curved spaces. In particular, we can apply the same argument and note that, as sinh(x)x\frac{\sinh(x)}{x} is increasing, we have

sinh(tκ2)κ2sinh(ικ2)ικ2teικ2ικ2t\frac{\sinh\left(t\sqrt{-\kappa_{2}}\right)}{\sqrt{-\kappa_{2}}}\leq\frac{\sinh(\iota\sqrt{-\kappa_{2}})}{\iota\sqrt{-\kappa_{2}}}t\leq\frac{e^{\iota\sqrt{-\kappa_{2}}}}{\iota\sqrt{-\kappa_{2}}}t

for all t<ιt<\iota. Thus, for all ε<ι\varepsilon<\iota, we have:

N(M,||||g,ε)volMωddιd(κ2)d2edικ2εdN(M,\left|\left|\cdot\right|\right|_{g},\varepsilon)\geq\frac{\operatorname{vol}M}{\omega_{d}}d\iota^{d}(-\kappa_{2})^{\frac{d}{2}}e^{-d\iota\sqrt{-\kappa_{2}}}\varepsilon^{-d}

as desired. ∎

Proof of Corollary 10.

Let BεD(p)B^{\mathbb{R}^{D}}_{\varepsilon}(p) be the set of points in D\mathbb{R}^{D} with Euclidean distance to pp less than ε\varepsilon and let BεM(p)B^{M}_{\varepsilon}(p) be the set of points in MM with intrinsic (geodesic) distance to pp less than ε\varepsilon. Then, if ε2τ\varepsilon\leq 2\tau, combining the fact that straight lines are geodesics in D\mathbb{R}^{D} and (d) from Proposition 8 gives

BεM(p)BεD(p)MB2τarcsin(ε2τ)M(p)B_{\varepsilon}^{M}(p)\subset B_{\varepsilon}^{\mathbb{R}^{D}}(p)\cap M\subset B_{2\tau\arcsin\left(\frac{\varepsilon}{2\tau}\right)}^{M}(p)

In particular, this implies

N(M,dM,2τarcsin(ε2τ))\displaystyle N\left(M,d_{M},2\tau\arcsin\left(\frac{\varepsilon}{2\tau}\right)\right) N(M,||||,ε)N(M,dM,ε)\displaystyle\leq N(M,\left|\left|\cdot\right|\right|,\varepsilon)\leq N(M,d_{M},\varepsilon)
D(M,dM,2τarcsin(ε2τ))\displaystyle D\left(M,d_{M},2\tau\arcsin\left(\frac{\varepsilon}{2\tau}\right)\right) D(M,||||,ε)D(M,dM,ε)\displaystyle\leq D(M,\left|\left|\cdot\right|\right|,\varepsilon)\leq D(M,d_{M},\varepsilon)

whenever ε2τ\varepsilon\leq 2\tau. Thus, applying Proposition 9, we have

N(M,||||,ε)N(M,dM,ε)volMωdd(π2)dεd\displaystyle N(M,\left|\left|\cdot\right|\right|,\varepsilon)\leq N(M,d_{M},\varepsilon)\leq\frac{\operatorname{vol}M}{\omega_{d}}d\left(\frac{\pi}{2}\right)^{d}\varepsilon^{-d}

and similarly,

D(M,||||,2ε)D(M,dM,2τarcsin(ετ))volMωdd16dεd\displaystyle D(M,\left|\left|\cdot\right|\right|,2\varepsilon)\geq D\left(M,d_{M},2\tau\arcsin\left(\frac{\varepsilon}{\tau}\right)\right)\geq\frac{\operatorname{vol}M}{\omega_{d}}d16^{-d}\varepsilon^{-d}

using the fact that arcsin(x)2x\arcsin(x)\leq 2x for x0x\geq 0. The result follows. ∎
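The covering and packing bounds above are easy to probe numerically. The following minimal Python sketch greedily extracts an ε\varepsilon-net from a finite sample using the ambient Euclidean metric; the returned centres are pairwise more than ε\varepsilon apart, so their number is squeezed between a packing and a covering number at comparable scales. The function name and the suggested experiment are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import cdist

def greedy_net(X, eps):
    # Greedily extract an eps-net of the point cloud X: repeatedly keep the first
    # remaining point as a centre and discard everything within distance eps of it.
    centres = []
    remaining = np.arange(len(X))
    while remaining.size > 0:
        c = remaining[0]
        centres.append(c)
        dists = cdist(X[remaining], X[c][None, :]).ravel()
        remaining = remaining[dists > eps]
    return X[centres]

# On a dense sample from a d-dimensional manifold, len(greedy_net(X, eps)) should
# grow roughly like eps**(-d) as eps shrinks, in line with Proposition 9 and
# Corollary 10 (up to the constants involving vol M, tau, and omega_d).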

Appendix B Proof of Proposition 21

As stated in the body, we bound the T2T_{2} constant c2c_{2} by the log-Sobolev constant of the same measure. We thus first define a log-Sobolev inequality:

Definition 27.

Let μ\mu be a measure on MM. We say that μ\mu satisfies a log-Sobolev inequality with constant cLSc_{LS} if, for all real-valued, differentiable functions f:Mf:M\to\mathbb{R} with ∫Mf2𝑑μ=1\int_{M}f^{2}d\mu=1, we have:

Mf2log(f2)𝑑μcLSMf2𝑑μ\int_{M}f^{2}\log(f^{2})d\mu\leq c_{LS}\int_{M}\left|\left|\nabla f\right|\right|^{2}d\mu

where \nabla is the Levi-Civita connection and ||||\left|\left|\cdot\right|\right| is the norm with respect to the Riemannian metric.

While in the main body we cited Otto & Villani, (2000) for the Otto-Villani theorem, we actually need a slight strengthening of this result. For technical reasons, Otto & Villani, (2000) required the density of μ\mu to have two derivatives; more recent works have eliminated that assumption. We have:

Theorem 28 (Theorem 5.2 from Gigli & Ledoux, (2013)).

Suppose that μ\mu satisfies the log-Sobolev inequality with constant cLSc_{LS}. Then μ\mu satisfies the T2T_{2} inequality with constant c22cLSc_{2}\leq 2c_{LS}.

We now recall the key estimate from Wang, (1997b) that controls the log-Sobolev constant for the uniform measure on a compact manifold MM. (We remark that some works, including Wang, (1997b), define the log-Sobolev constant to be the inverse of our cLSc_{LS}; we translate their theorem into our terms by taking the reciprocal.)

Theorem 29 (Theorem 3.3 from Wang, (1997b)).

Let MM be a compact, dd-dimensional manifold with diameter Δ\Delta. Suppose that RicMKRic_{M}\succeq-K for some KK\in\mathbb{R}. Let μ\mu be the uniform measure on MM (i.e., the volume measure normalized so that μ(M)=1\mu(M)=1). Then μ\mu satisfies a log-Sobolev inequality with

cLS(d+2d)de2K(d+1)Δ21Ke1+dΔ2K+.c_{LS}\leq\left(\frac{d+2}{d}\right)^{d}\frac{e^{2K(d+1)\Delta^{2}}-1}{K}e^{1+d\Delta^{2}K_{+}}.

We are now ready to complete the proof.

Proof of Proposition 21.

By the Holley-Stroock perturbation theorem (Holley & Stroock,, 1986), we know that if μ\mu is the uniform measure on MM normalized such that μ(M)=1\mu(M)=1, and μ\mu satisfies a log-Sobolev inequality with constant cLSc_{LS}^{\prime}, then \mathbb{P} satisfies a log-Sobolev inequality with constant cLSWwcLSc_{LS}\leq\frac{W}{w}c_{LS}^{\prime}. By (a) from Proposition 8, we have that the sectional curvatures of MM are all bounded below by 2τ2-\frac{2}{\tau^{2}} and thus RicM(d1)2τ2Ric_{M}\succeq-(d-1)\frac{2}{\tau^{2}} (for the relationship between the Ricci tensor and the sectional curvatures, see Lee, (2018)). Noting that d+2d3\frac{d+2}{d}\leq 3 and plugging into the results of Theorem 29, we get that

cLS2τ2d1exp(dlog3+3Δ2d2τ2).c_{LS}^{\prime}\leq\frac{2\tau^{2}}{d-1}\exp\left(d\log 3+\frac{3\Delta^{2}d^{2}}{\tau^{2}}\right).

Combining this with the Holley-Stroock result and Theorem 28 concludes the proof. ∎

Appendix C Proof of Theorem 22

We first prove the following lemma on the concentration of W1(Pn,Pn)W_{1}(P_{n},P_{n}^{\prime}).

Lemma 30.

Suppose that \mathbb{P} is a probability measure on (T,d)(T,d) and that it satisfies a T2(c2)T_{2}(c_{2})-inequality. Let X1,,Xn,X1,,XnX_{1},\dots,X_{n},X_{1}^{\prime},\dots,X_{n}^{\prime} denote independent samples with corresponding empirical distributions Pn,PnP_{n},P_{n}^{\prime}. Then the following inequalities hold:

(|W1(Pn,Pn)𝔼[W1(Pn,Pn)]|t)\displaystyle\mathbb{P}\left(\left|W_{1}(P_{n},P_{n}^{\prime})-\mathbb{E}\left[W_{1}(P_{n},P_{n}^{\prime})\right]\right|\geq t\right) 2ent28c2\displaystyle\leq 2e^{-\frac{nt^{2}}{8c_{2}}}
(|W1M(Pn,Pn)𝔼[W1M(Pn,Pn)]|t)\displaystyle\mathbb{P}\left(\left|W_{1}^{M}(P_{n},P_{n}^{\prime})-\mathbb{E}\left[W_{1}^{M}(P_{n},P_{n}^{\prime})\right]\right|\geq t\right) 2ent28c2.\displaystyle\leq 2e^{-\frac{nt^{2}}{8c_{2}}}.
Proof.

We note that by Gozlan et al., (2009), in particular the form of the main theorem stated in (van Handel,, 2014, Theorem 4.31), it suffices to show that, as a function of the data, W1(Pn,Pn)W_{1}(P_{n},P_{n}^{\prime}) is 2n\frac{2}{\sqrt{n}}-Lipschitz. Note that by symmetry, it suffices to show a one-sided inequality. By the triangle inequality,

W1(Pn,Pn)W1(Pn,μ)+W1(Pn,μ)W_{1}(P_{n},P_{n}^{\prime})\leq W_{1}(P_{n},\mu)+W_{1}(P_{n}^{\prime},\mu)

for any measure μ\mu and thus it suffices to show that W1(Pn,μ)W_{1}(P_{n},\mu) is 1n\frac{1}{\sqrt{n}}-Lipschitz in the XiX_{i}. By (van Handel,, 2014, Lemma 4.34), there exists a bijection between the set of couplings between PnP_{n} and μ\mu and the set of ordered nn-tuples of measures μ1,,μn\mu_{1},\dots,\mu_{n} such that μ=1niμi\mu=\frac{1}{n}\sum_{i}\mu_{i}. Thus we see that if X,X~X,\widetilde{X} are two data sets, then

W1(Pn,μ)W1(P~n,μ)\displaystyle W_{1}(P_{n},\mu)-W_{1}(\widetilde{P}_{n},\mu) sup1ni=1nμi=μ[1ni=1n(d(Xi,y)d(X~i,y))𝑑μi(y)]\displaystyle\leq\sup_{\frac{1}{n}\sum_{i=1}^{n}\mu_{i}=\mu}\left[\frac{1}{n}\sum_{i=1}^{n}\int\left(d(X_{i},y)-d(\widetilde{X}_{i},y)\right)d\mu_{i}(y)\right]
sup1ni=1nμi=μ[1ni=1nd(Xi,X~i)𝑑μi(y)]\displaystyle\leq\sup_{\frac{1}{n}\sum_{i=1}^{n}\mu_{i}=\mu}\left[\frac{1}{n}\sum_{i=1}^{n}\int d(X_{i},\widetilde{X}_{i})d\mu_{i}(y)\right]
=1nd(Xi,X~i)\displaystyle=\frac{1}{n}\sum d(X_{i},\widetilde{X}_{i})
1nni=1nd(Xi,X~i)21ndn(X,X~).\displaystyle\leq\frac{1}{n}\sqrt{n\sum_{i=1}^{n}d(X_{i},\widetilde{X}_{i})^{2}}\leq\frac{1}{\sqrt{n}}d^{\otimes n}(X,\widetilde{X}).

The identical argument applies to W1MW_{1}^{M}. ∎
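As a quick sanity check on this Lipschitz property, one can estimate the fluctuations of W1(Pn,Pn)W_{1}(P_{n},P_{n}^{\prime}) by Monte Carlo; Lemma 30 predicts sub-Gaussian deviations at scale roughly c2/n\sqrt{c_{2}/n}, so doubling nn should shrink the observed spread by about 2\sqrt{2}. The sketch below is purely illustrative: the sampler interface, the number of repetitions, and the helper names are our own assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def empirical_w1(X, Y):
    C = cdist(X, Y)
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].mean()

def w1_fluctuation(sampler, n, reps=200, seed=0):
    # Monte Carlo estimate of the standard deviation of W1(P_n, P_n') over
    # independent draws of the two n-point samples.
    rng = np.random.default_rng(seed)
    values = [empirical_w1(sampler(n, rng), sampler(n, rng)) for _ in range(reps)]
    return float(np.std(values))

# e.g. sampler = lambda n, rng: rng.uniform(size=(n, 3)); comparing
# w1_fluctuation(sampler, 100) with w1_fluctuation(sampler, 200) should show
# roughly a 1/sqrt(2) reduction, while the mean value itself decays more slowly.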

We are now ready to show that d^n\widehat{d}_{n} is a good estimator of dd.

Proposition 31.

Suppose we are in the situation of Theorem 22 and we have

n\displaystyle n max(dvolM4wωd(ι8)d,(8c2Δ2log1ρ)d2d5)\displaystyle\geq\max\left(\frac{d\operatorname{vol}M}{4w\omega_{d}}\left(\frac{\iota}{8}\right)^{-d},\left(\frac{8c_{2}}{\Delta^{2}}\log\frac{1}{\rho}\right)^{\frac{d}{2d-5}}\right)
α\displaystyle\alpha max(logd2γ(nωdΔddvolM),(Cw)1γ)\displaystyle\geq\max\left(\log^{\frac{d}{2\gamma}}\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right),(Cw)^{\frac{1}{\gamma}}\right)

Then with probability at least 14ρ1-4\rho, we have

d1+3γd^n(1+3γ)d.\frac{d}{1+3\gamma}\leq\widehat{d}_{n}\leq(1+3\gamma)d.
Proof.

By Proposition 19 and Lemma 30, with probability at least 1ent28c21-e^{-\frac{nt^{2}}{8c_{2}}}, we have

W1M(Pn,Pn)C(volMnωd)1dlog(nωdΔddvolM)+t.\displaystyle W_{1}^{M}(P_{n},P_{n}^{\prime})\leq C\left(\frac{\operatorname{vol}M}{n\omega_{d}}\right)^{\frac{1}{d}}\sqrt[]{\log\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right)}+t.

By Proposition 18 and Lemma 30 and the left hand side of Proposition 19, we have that with probability at least 1eαnt28c21-e^{-\frac{\alpha nt^{2}}{8c_{2}}},

W1M(Pαn,Pαn)132(dvolM4wωd)1d(αn)1dtW_{1}^{M}(P_{\alpha n},P_{\alpha n}^{\prime})\geq\frac{1}{32}\left(\frac{d\operatorname{vol}M}{4w\omega_{d}}\right)^{\frac{1}{d}}(\alpha n)^{-\frac{1}{d}}-t

all under the assumption that

n>dvolM4wωd(ι8)d.n>\frac{d\operatorname{vol}M}{4w\omega_{d}}\left(\frac{\iota}{8}\right)^{-d}.

Setting t=Δ(αn)54dt=\Delta(\alpha n)^{-\frac{5}{4d}}, we see that, as α>1\alpha>1, with probability at least 12ent28c21-2e^{-\frac{nt^{2}}{8c_{2}}}, we simultaneously have

W1M(Pn,Pn)\displaystyle W_{1}^{M}(P_{n},P_{n}^{\prime}) C(volMnωd)1dlog(nωdΔddvolM)\displaystyle\leq C\left(\frac{\operatorname{vol}M}{n\omega_{d}}\right)^{\frac{1}{d}}\sqrt[]{\log\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}_{M}}\right)}
W1M(Pαn,Pαn)\displaystyle W_{1}^{M}(P_{\alpha n},P_{\alpha n}^{\prime}) 164(dvolM4wωd)1d(αn)1d.\displaystyle\geq\frac{1}{64}\left(\frac{d\operatorname{vol}M}{4w\omega_{d}}\right)^{\frac{1}{d}}(\alpha n)^{-\frac{1}{d}}.

Thus, in particular,

W1M(Pn,Pn)W1M(Pαn,Pαn)C(volMnωd)1dlog(nωdΔddvolM)164(dvolM4wωd)1d(αn)1dCw1dα1dlog(nωdΔddvolM)\displaystyle\frac{W_{1}^{M}(P_{n},P_{n}^{\prime})}{W_{1}^{M}(P_{\alpha n},P_{\alpha n}^{\prime})}\leq\frac{C\ \left(\frac{\operatorname{vol}M}{n\omega_{d}}\right)^{\frac{1}{d}}\sqrt[]{\log\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right)}}{\frac{1}{64}\left(\frac{d\operatorname{vol}M}{4w\omega_{d}}\right)^{\frac{1}{d}}(\alpha n)^{-\frac{1}{d}}}\leq Cw^{\frac{1}{d}}\alpha^{\frac{1}{d}}\sqrt[]{\log\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right)}

Thus we see that

d^n\displaystyle\widehat{d}_{n} =logαlogW1M(Pn,Pn)W1M(Pαn,Pαn)\displaystyle=\frac{\log\alpha}{\log\frac{W_{1}^{M}(P_{n},P_{n}^{\prime})}{W_{1}^{M}(P_{\alpha n},P_{\alpha n}^{\prime})}}
logα1dlogα+1dlog(Cw)+12loglog(nωdΔddvolM)\displaystyle\geq\frac{\log\alpha}{\frac{1}{d}\log\alpha+\frac{1}{d}\log(Cw)+\frac{1}{2}\log\log\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right)}
=d1+log(Cw)+d2loglog(nωdΔddvolM)logα\displaystyle=\frac{d}{1+\frac{\log(Cw)+\frac{d}{2}\log\log\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right)}{\log\alpha}}

Now, if

n\displaystyle n max(dvolM4wωd(ι8)d,(8c2Δ2log1ρ)d2d5)\displaystyle\geq\max\left(\frac{d\operatorname{vol}M}{4w\omega_{d}}\left(\frac{\iota}{8}\right)^{-d},\left(\frac{8c_{2}}{\Delta^{2}}\log\frac{1}{\rho}\right)^{\frac{d}{2d-5}}\right)
α\displaystyle\alpha max(logd2γ(nωdΔddvolM),(Cw)1γ)\displaystyle\geq\max\left(\log^{\frac{d}{2\gamma}}\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right),(Cw)^{\frac{1}{\gamma}}\right)

Then with probability at least 12ρ1-2\rho,

d^nd1+2γ.\widehat{d}_{n}\geq\frac{d}{1+2\gamma}.

An identical proof holds for the other side of the bound and thus the result holds. ∎

We are now ready to prove the main theorem using Proposition 31 and Proposition 24.

Proof of Theorem 22.

Note first that

wB(ιλ28)\displaystyle w_{B}\left(\frac{\iota\lambda^{2}}{8}\right) wωdd(π2)d(ιλ28)d\displaystyle\geq\frac{w\omega_{d}}{d}\left(\frac{\pi}{2}\right)^{-d}\left(\frac{\iota\lambda^{2}}{8}\right)^{d} (5)
N(M,dM,ιλ28)\displaystyle N\left(M,d_{M},\frac{\iota\lambda^{2}}{8}\right) volMωdd(π2)d(ιλ28)d\displaystyle\leq\frac{\operatorname{vol}M}{\omega_{d}}d\left(\frac{\pi}{2}\right)^{d}\left(\frac{\iota\lambda^{2}}{8}\right)^{-d} (6)

by Proposition 9. Setting λ=12\lambda=\frac{1}{2}, we note that by Proposition 24, if the total number of samples

2(α+1)n(wωdd(π2)d(ιλ28)d)1log(volMρωdd(τ16π)d)\displaystyle 2(\alpha+1)n\geq\left(\frac{w\omega_{d}}{d}\left(\frac{\pi}{2}\right)^{-d}\left(\frac{\iota\lambda^{2}}{8}\right)^{d}\right)^{-1}\log\left(\frac{\operatorname{vol}M}{\rho\omega_{d}}d\left(\frac{\tau}{16\pi}\right)^{-d}\right)

then with probability at least 1ρ1-\rho, we have

12dM(p,q)dG(p,q)32dM(p,q)\displaystyle\frac{1}{2}d_{M}(p,q)\leq d_{G}(p,q)\leq\frac{3}{2}d_{M}(p,q)

for all p,qMp,q\in M. Thus by the proof of Proposition 31 above,

W1M(Pn,Pn)W1M(Pαn,Pαn)1+λ1λCw1dα1dlog(nωdΔddvolM).\frac{W_{1}^{M}(P_{n},P_{n}^{\prime})}{W_{1}^{M}(P_{\alpha n},P_{\alpha n}^{\prime})}\leq\frac{1+\lambda}{1-\lambda}Cw^{\frac{1}{d}}\alpha^{\frac{1}{d}}\sqrt[]{\log\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right)}.

Thus, as long as α(1+λ1λ)dγ=3dγ\alpha\geq\left(\frac{1+\lambda}{1-\lambda}\right)^{\frac{d}{\gamma}}=3^{\frac{d}{\gamma}}, we have, with probability at least 13ρ1-3\rho,

d~nd1+3γ.\widetilde{d}_{n}\geq\frac{d}{1+3\gamma}.

A similar computation holds for the other bound.

To prove the result for dnd_{n}, note that if we replace the ι\iotas by τ\tau in (5) and (6), then the result still holds by the second part of Proposition 9. Then the identical arguments apply, mutatis mutandis, after skipping the step of approximating dMd_{M} by dGd_{G}. ∎

Appendix D Metric Estimation Proofs

In order to state our result, we need to consider the minimal amount of probability mass that \mathbb{P} puts on any intrinsic ball of a certain radius in MM. To formalize this notion, we define, for δ>0\delta>0,

wB(δ)=infpM(BδM(p)).w_{B}(\delta)=\inf_{p\in M}\mathbb{P}\left(B_{\delta}^{M}(p)\right).

We need a few lemmata:

Lemma 32.

Fix ε>0\varepsilon>0 and a set of xiMx_{i}\in M and form G(x,ε)G(x,\varepsilon). If the set of xix_{i} form a δ\delta-net for MM such that δε4\delta\leq\frac{\varepsilon}{4}, then for all x,yMx,y\in M,

dG(x,y)(1+4δε)dM(x,y).d_{G}(x,y)\leq\left(1+\frac{4\delta}{\varepsilon}\right)d_{M}(x,y).
Proof.

This is a combination of (Bernstein et al.,, 2000, Proposition 1) and (Bernstein et al.,, 2000, Theorem 2). ∎

Lemma 33.

Let 0<λ<10<\lambda<1 and let x,yMx,y\in M such that xy2τλ(1λ)\left|\left|x-y\right|\right|\leq 2\tau\lambda(1-\lambda). Then

(1λ)dM(x,y)xydM(x,y).(1-\lambda)d_{M}(x,y)\leq\left|\left|x-y\right|\right|\leq d_{M}(x,y).
Proof.

Note that 2τλ(1λ)τ22\tau\lambda(1-\lambda)\leq\frac{\tau}{2} so we are in the situation of Proposition 8 (e). Let =dM(x,y)\ell=d_{M}(x,y). Rearranging the bound in Proposition 8 (e) yields

(12τ)xy.\ell\left(1-\frac{\ell}{2\tau}\right)\leq\left|\left|x-y\right|\right|\leq\ell.

Thus it suffices to show that

2τλ.\frac{\ell}{2\tau}\leq\lambda.

Again applying Proposition 8, we see that

τ(112xyτ).\ell\leq\tau\left(1-\sqrt{1-\frac{2\left|\left|x-y\right|\right|}{\tau}}\right).

Rearranging and plugging in xy2τλ(1λ)\left|\left|x-y\right|\right|\leq 2\tau\lambda(1-\lambda) concludes the proof. ∎

The next lemma is a variant of (Niyogi et al.,, 2008, Lemma 5.1).

Lemma 34.

Let wB(δ)w_{B}(\delta) be as in Proposition 24 and let N(M,δ)N(M,\delta) be the covering number of MM at scale δ\delta. If we sample nwB(δ2)1logN(M,δ2)ρn\geq w_{B}\left(\frac{\delta}{2}\right)^{-1}\log\frac{N\left(M,\frac{\delta}{2}\right)}{\rho} points independently from \mathbb{P}, then with probability at least 1ρ1-\rho, the points form a δ\delta-net of MM.

Proof.

Let y1,,yNy_{1},\dots,y_{N} be a minimal δ2\frac{\delta}{2}-net of MM. For each yiy_{i}, the probability that a given sample xjx_{j} does not lie in Bδ2(yi)B_{\frac{\delta}{2}}(y_{i}) is bounded by 1wB(δ2)1-w_{B}\left(\frac{\delta}{2}\right) by definition. By independence, we have

(jxjBδ2(yi))(1wB(δ2))nenwB(δ2).\mathbb{P}\left(\forall j\,\,x_{j}\not\in B_{\frac{\delta}{2}}(y_{i})\right)\leq\left(1-w_{B}\left(\frac{\delta}{2}\right)\right)^{n}\leq e^{-nw_{B}\left(\frac{\delta}{2}\right)}.

By a union bound, we have

(i such that jxjBδ2(yi))N(M,δ2)enwB(δ2).\mathbb{P}\left(\exists i\text{ such that }\forall j\,\,x_{j}\not\in B_{\frac{\delta}{2}}(y_{i})\right)\leq N\left(M,\frac{\delta}{2}\right)e^{-nw_{B}\left(\frac{\delta}{2}\right)}. (7)

If nn satisfies the bound in the statement, then the right hand side of (7) is controlled by ρ\rho. ∎

Note that for any measure \mathbb{P}, a simple union bound tells us that wB(δ)N(M,δ)1w_{B}(\delta)\leq N\left(M,\delta\right)^{-1} and that equality, up to a constant, is achieved for the uniform measure. This is within a log factor of the obvious lower bound given by the covering number on the number of points required to have a δ\delta-net on MM.
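A direct numerical check of this lemma is straightforward: draw nn samples and test whether every point of a fine reference discretization of MM lies within δ\delta of some sample. The sketch below uses Euclidean distance as a stand-in for the intrinsic metric (reasonable at small scales by Lemma 33); the helper name and the reference-grid construction are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import cdist

def is_delta_net(samples, reference, delta):
    # True iff every reference point lies within distance delta of some sample,
    # i.e. the samples form a delta-net of the discretized manifold.
    return bool(cdist(reference, samples).min(axis=1).max() <= delta)

# Repeating this over many independent draws of n samples, the failure frequency
# should drop below rho once n exceeds the bound w_B(delta/2)^{-1} log(N(M, delta/2)/rho)
# from Lemma 34.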

With these lemmata, we are ready to conclude the proof:

Proof of Proposition 24.

Let ε=τλ2τλ(1λ)\varepsilon=\tau\lambda\leq 2\tau\lambda(1-\lambda), which holds since λ12\lambda\leq\frac{1}{2}. Let δ=λε4=τλ24\delta=\frac{\lambda\varepsilon}{4}=\frac{\tau\lambda^{2}}{4}. By Lemma 34, with high probability, the xix_{i} form a δ\delta-net on MM; thus, for the rest of the proof, we fix a set of xix_{i} such that this condition holds. Now we may apply Lemma 32 to yield the upper bound dG(x,y)(1+λ)dM(x,y)d_{G}(x,y)\leq(1+\lambda)d_{M}(x,y).

For the lower bound, for any points p,qMp,q\in M there are points xj0,xjmx_{j_{0}},x_{j_{m}} such that dM(p,xj0)δd_{M}(p,x_{j_{0}})\leq\delta and dM(q,xjm)δd_{M}(q,x_{j_{m}})\leq\delta by the fact that the xix_{i} form a δ\delta-net. Let xj1,,xjm1x_{j_{1}},\dots,x_{j_{m-1}} be the intermediate vertices of a shortest path in GG between xj0x_{j_{0}} and xjmx_{j_{m}}. By Lemma 33 and the fact that edges of GG only join points at Euclidean distance at most ε\varepsilon, we have

dM(p,q)\displaystyle d_{M}(p,q) dM(p,xj0)+dM(xjm,q)+i=1mdM(xji1,xji)\displaystyle\leq d_{M}\left(p,x_{j_{0}}\right)+d_{M}\left(x_{j_{m}},q\right)+\sum_{i=1}^{m}d_{M}\left(x_{j_{i-1}},x_{j_{i}}\right)
(1λ)1(pxj0+xjmq+i=1mxji1xji)\displaystyle\leq(1-\lambda)^{-1}\left(\left|\left|p-x_{j_{0}}\right|\right|+\left|\left|x_{j_{m}}-q\right|\right|+\sum_{i=1}^{m}\left|\left|x_{j_{i-1}}-x_{j_{i}}\right|\right|\right)
=(1λ)1dG(p,q).\displaystyle=(1-\lambda)^{-1}d_{G}(p,q).

Rearranging concludes the proof. ∎

Appendix E Miscellany

Proof of Lemma 15.

By symmetrization and chaining, we have

𝔼[supf1ni=1nf(Xi)f(Xi)]\displaystyle\mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-f(X_{i}^{\prime})\right] 2𝔼[supf1ni=1nεif(Xi)]2infδ>0[8δ+82nδBlogN(,||||,ε)𝑑ε]\displaystyle\leq 2\mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}f(X_{i})\right]\leq 2\inf_{\delta>0}\left[8\delta+\frac{8\sqrt{2}}{\sqrt{n}}\int_{\delta}^{B}\sqrt{\log N(\mathcal{F},\left|\left|\cdot\right|\right|_{\infty},\varepsilon)}d\varepsilon\right]
2Binfδ>0[8δ+82nδ1logN(,||||,ε2R)𝑑ε]\displaystyle\leq 2B\inf_{\delta>0}\left[8\delta+\frac{8\sqrt{2}}{\sqrt{n}}\int_{\delta}^{1}\sqrt{\log N\left(\mathcal{F},\left|\left|\cdot\right|\right|_{\infty},\frac{\varepsilon}{2R}\right)}d\varepsilon\right]
2Binfδ>0[8δ+82nδ13β2log1εN(𝒮,||||,ε)𝑑ε]\displaystyle\leq 2B\inf_{\delta>0}\left[8\delta+\frac{8\sqrt{2}}{\sqrt{n}}\int_{\delta}^{1}\sqrt{3\beta^{2}\log\frac{1}{\varepsilon}N(\mathcal{S},\left|\left|\cdot\right|\right|,\varepsilon)}d\varepsilon\right]

where the last step follows from Proposition 12. The first statement follows from noting that log1ε\sqrt{\log\frac{1}{\varepsilon}} is decreasing in ε\varepsilon, and thus allowing it to be pulled from the integral. If β>d2\beta>\frac{d}{2}, the second statement follows from plugging in δ=0\delta=0 and recovering a rate of n12n^{-\frac{1}{2}}. If β<d2\beta<\frac{d}{2}, then the second statement follows from plugging in δ=nβd\delta=n^{-\frac{\beta}{d}}. ∎

Proof of Proposition 18.

We follow the proof of (Weed et al.,, 2019, Proposition 6) and use their notation. In particular, let

Nε(,12)=inf{N(S,dM,ε)|SM and (S)12}.N_{\varepsilon}\left(\mathbb{P},\frac{1}{2}\right)=\inf\left\{N(S,d_{M},\varepsilon)|S\subset M\text{ and }\mathbb{P}(S)\geq\frac{1}{2}\right\}.

Applying a volume argument identical to that in Proposition 9, but lower bounding the probability of a ball of radius ε\varepsilon by ww multiplied by the volume of that ball, we get that

Nε(,12)volM2wωdd8dεdN_{\varepsilon}\left(\mathbb{P},\frac{1}{2}\right)\geq\frac{\operatorname{vol}M}{2w\omega_{d}}d8^{-d}\varepsilon^{-d}

if ετ\varepsilon\leq\tau. Let

ε=(volM4wωdd8d)1dn1d\varepsilon=\left(\frac{\operatorname{vol}M}{4w\omega_{d}}d8^{-d}\right)^{\frac{1}{d}}n^{-\frac{1}{d}}

and assume that

n>volM4wωdd8d(τ)dn>\frac{\operatorname{vol}M}{4w\omega_{d}}d8^{-d}\left(\tau\right)^{-d}

Let

S=1inBε2M(Xi).S=\bigcup_{1\leq i\leq n}B_{\frac{\varepsilon}{2}}^{M}(X_{i}).

Then because

N_{\varepsilon}\left(\mathbb{P},\frac{1}{2}\right)>n

by our choice of \varepsilon, and S can be covered by n balls of d_{M}-radius \frac{\varepsilon}{2}, it must be that \mathbb{P}(S)<\frac{1}{2}. Thus if X\sim\mathbb{P}, then with probability at least \frac{1}{2} we have d_{M}(X,\{X_{1},\dots,X_{n}\})\geq\frac{\varepsilon}{2}. Since any coupling of \mathbb{P} and P_{n} must therefore transport mass at least \frac{1}{2} a distance of at least \frac{\varepsilon}{2}, the Wasserstein distance between \mathbb{P} and P_{n} is at least \frac{\varepsilon}{4}. The first result follows. The identical argument, using instead the intrinsic covering numbers and the bound of Proposition 9, recovers the second statement. ∎
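As a quick numerical illustration of the mechanism behind this lower bound (not an experiment from the paper, and using the unit cube in place of M for simplicity): the distance from a fresh draw to the nearest of n i.i.d. samples shrinks at the rate n^{-1/d}, which is exactly the scale at which mass must be transported.

# Illustrative Monte Carlo (not from the paper): nearest-sample distances on the
# unit cube in R^d shrink like n^{-1/d}, the scale appearing in the lower bound.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
d = 3
for n in (1_000, 8_000, 64_000):
    X = rng.uniform(size=(n, d))           # n i.i.d. "data" points
    fresh = rng.uniform(size=(5_000, d))   # fresh draws from the same law
    nn_dist, _ = cKDTree(X).query(fresh)   # distance to nearest data point
    print(n, float(np.median(nn_dist)), n ** (-1.0 / d))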

Proof of Proposition 19.

By Kantorovich-Rubinstein duality and Jensen’s inequality, we have

\mathbb{E}\left[W_{1}^{M}(P_{n},\mathbb{P})\right]\leq\mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\mathbb{E}\left[f(X_{i})\right]\right]\leq\mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-f(X_{i}^{\prime})\right]=\mathbb{E}\left[W_{1}^{M}(P_{n},P_{n}^{\prime})\right]

where \mathcal{F} is the class of functions on M that are 1-Lipschitz with respect to d_{M}. Note that, since the supremum above is invariant under translating each f by a constant, we may take the radius of the Hölder ball \mathcal{F} to be \Delta. By symmetrization and chaining,

\mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-f(X_{i}^{\prime})\right]\leq 2\mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}f(X_{i})\right]\leq 2\inf_{\delta>0}\left[8\delta+\frac{8\sqrt{2}}{\sqrt{n}}\int_{\delta}^{\Delta}\sqrt{\log N(\mathcal{F},\left|\left|\cdot\right|\right|_{\infty},\varepsilon)}\,d\varepsilon\right]
\leq\inf_{\delta>0}\left[8\delta+\frac{8\sqrt{2}}{\sqrt{n}}\int_{\delta}^{\Delta}\sqrt{3\log\left(\frac{2\Delta}{\varepsilon}\right)\frac{d\operatorname{vol}M}{\omega_{d}}\left(\frac{\pi}{2}\right)^{d}}\left(\frac{2}{\varepsilon}\right)^{\frac{d}{2}}d\varepsilon\right]
\leq 2\Delta\inf_{\delta>0}\left[8\delta+\frac{8\sqrt{6}}{\sqrt{n}}\sqrt{\frac{d\operatorname{vol}M}{\omega_{d}}}\left(\frac{\pi}{2}\right)^{\frac{d}{2}}\sqrt{\log\frac{1}{\delta}}\int_{\delta}^{1}\left(\frac{\Delta}{\varepsilon}\right)^{-\frac{d}{2}}d\varepsilon\right]

where the last step comes from Corollary 14 together with the observation that, after recentering, every f\in\mathcal{F} satisfies \left|\left|f\right|\right|_{L^{\infty}(M)}\leq\Delta and \left|\left|\nabla f\right|\right|_{L^{\infty}(M)}\leq 1. Setting

\delta=\frac{\pi}{2}\left(\frac{d\operatorname{vol}M}{n\omega_{d}\Delta^{d}}\right)^{\frac{1}{d}}

gives

\mathbb{E}\left[W_{1}^{M}(P_{n},P_{n}^{\prime})\right]\leq C\left(\frac{\operatorname{vol}M}{n\omega_{d}}\right)^{\frac{1}{d}}\sqrt{\log\left(\frac{n\omega_{d}\Delta^{d}}{d\operatorname{vol}M}\right)}

for some C\leq 48, which concludes the proof. ∎

Proof of Theorem 25.

By the triangle inequality for d_{\beta,B} (i.e., bounding the supremum of a sum by the sum of the suprema) and the construction of \widehat{\mu},

d_{\beta,B}(\widehat{\mu},\mathbb{P})\leq d_{\beta,B}(\widehat{\mu},\widetilde{P}_{n})+d_{\beta,B}(\widetilde{P}_{n},\mathbb{P})\leq\inf_{\mu\in\mathcal{P}}d_{\beta,B}(\mu,\widetilde{P}_{n})+d_{\beta,B}(\widetilde{P}_{n},\mathbb{P})
\leq\inf_{\mu\in\mathcal{P}}d_{\beta,B}(\mu,\mathbb{P})+2d_{\beta,B}(\widetilde{P}_{n},\mathbb{P})
\leq\inf_{\mu\in\mathcal{P}}d_{\beta,B}(\mu,\mathbb{P})+2d_{\beta,B}(\widetilde{P}_{n},P_{n})+2d_{\beta,B}(P_{n},\mathbb{P}).

Taking expectations and applying Lemma 15 bounds the last term. The middle term can be bounded as follows:

d_{\beta,B}(\widetilde{P}_{n},P_{n})=\sup_{f\in C_{B}^{\beta}(\Omega)}\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-f(\widetilde{X}_{i})\leq\sup_{f\in C_{B}^{\beta}(\Omega)}\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-f(X_{i}+\eta_{i})+2B\varepsilon
\leq\sup_{f\in C_{B}^{\beta}(\Omega)}\frac{1}{n}\sum_{i=1}^{n}B\left|\left|\eta_{i}\right|\right|+2B\varepsilon

where the first inequality follows from the fact that if f\in C_{B}^{\beta}(\Omega) then \left|\left|f\right|\right|_{\infty}\leq B, so each contaminated sample contributes at most \frac{2B}{n} to the difference, and at most an \varepsilon-fraction of the samples is contaminated. The second inequality follows from the fact that f is B-Lipschitz. Taking expectations and applying Jensen’s inequality concludes the proof. ∎
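For concreteness, the Jensen step referred to above is presumably the usual second-moment bound on the noise (written here assuming the \eta_{i} are identically distributed; the precise noise assumptions are those made in the statement of Theorem 25):

\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}B\left|\left|\eta_{i}\right|\right|\right]=B\,\mathbb{E}\left[\left|\left|\eta_{1}\right|\right|\right]\leq B\sqrt{\mathbb{E}\left[\left|\left|\eta_{1}\right|\right|^{2}\right]}.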

Proof of Corollary 26.

Applying Kantorovich-Rubinstein duality, the proof follows immediately from that of Theorem 25 by setting \beta=1, with the caveat that we need to bound B and the Lipschitz constant separately. The Lipschitz constant is bounded by 1 by Kantorovich duality. The class is translation invariant, and so \left|\left|\left|f\right|\right|_{\infty}-\mathbb{E}[f]\right|\leq 2R by the fact that the Euclidean diameter of \mathcal{S} is bounded by 2R. The result follows. ∎

Lemma 35.

Let X be distributed uniformly on a centred \ell^{2}-ball in \mathbb{R}^{d} of radius R. Then,

\mathbb{E}\left[\log\frac{R}{\left|\left|X\right|\right|}\right]=\frac{1}{d}.
Proof.

Note that by scaling it suffices to prove the case R=1. By changing to polar coordinates,

\mathbb{E}\left[\log\frac{1}{\left|\left|X\right|\right|}\right]=\frac{\int_{S^{d-1}}\int_{0}^{1}\left(\log\frac{1}{r}\right)r^{d-1}\,dr\,d\theta}{\int_{S^{d-1}}\int_{0}^{1}r^{d-1}\,dr\,d\theta}
=-d\int_{0}^{1}\left(\log r\right)r^{d-1}\,dr.

Integrating by parts with u=\log r and dv=r^{d-1}\,dr then gives

-d\int_{0}^{1}\left(\log r\right)r^{d-1}\,dr=\left[\frac{r^{d}}{d}-r^{d}\log r\right]\bigg{|}_{r=0}^{r=1}=\frac{1}{d}

as desired. ∎
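A short Monte Carlo sanity check of the lemma (illustrative only; the radial sampling uses the standard fact that U^{1/d}, with U uniform on [0,1], has the law of the norm of a uniform point on the unit ball):

# Monte Carlo check of Lemma 35: for X uniform on the unit ball in R^d,
# E[log(1/||X||)] should equal 1/d.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
for d in (2, 5, 10):
    g = rng.standard_normal((n, d))
    dirs = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform directions
    radii = rng.uniform(size=n) ** (1.0 / d)             # law of ||X|| on the unit ball
    X = dirs * radii[:, None]
    est = float(np.mean(np.log(1.0 / np.linalg.norm(X, axis=1))))
    print(d, est, 1.0 / d)                               # the two should nearly agree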