
Geometric Inference on Kernel Density Estimates

Jeff M. Phillips
jeffp@cs.utah.edu
University of Utah
Thanks to support by NSF CCF-1350888, IIS-1251019, and ACI-1443046.
   Bei Wang
beiwang@sci.utah.edu
University of Utah
Thanks to support by INL 00115847 via DOE DE-AC07ID14517, DOE NETL DEEE0004449, DOE DEFC0206ER25781, DOE DE-SC0007446, and NSF 0904631.
   Yan Zheng
yanzheng@cs.utah.edu
University of Utah

We show that geometric inference of a point cloud can be calculated by examining its kernel density estimate with a Gaussian kernel. This allows one to consider kernel density estimates, which are robust to spatial noise, subsampling, and approximate computation in comparison to raw point sets. This is achieved by examining the sublevel sets of the kernel distance, which isomorphically map to superlevel sets of the kernel density estimate. We prove new properties about the kernel distance, demonstrating stability results and allowing it to inherit reconstruction results from recent advances in distance-based topological reconstruction. Moreover, we provide an algorithm to estimate its topology using weighted Vietoris-Rips complexes.

1 Introduction

Geometry and topology have become essential tools in modern data analysis: geometry to handle spatial noise and topology to identify the core structure. Topological data analysis (TDA) has found applications ranging from protein structure analysis [32, 52] to heart modeling [41] to leaf science [60], and is the central tool for identifying quantities like connectedness, cyclic structure, and intersections at various scales. Yet it can suffer from spatial noise in data, particularly outliers.

When analyzing point cloud data, classically these approaches consider α\alpha-shapes [31], where each point is replaced with a ball of radius α\alpha, and the union of these balls is analyzed. More recently a distance function interpretation [11] has become more prevalent where the union of α\alpha-radius balls can be replaced by the sublevel set (at value α\alpha) of the Hausdorff distance to the point set. Moreover, the theory can be extended to other distance functions to the point sets, including the distance-to-a-measure [15] which is more robust to noise.

This has more recently led to statistical analysis of TDA. These results show not only robustness in the function reconstruction, but also in the topology it implies about the underlying dataset. This work often operates on persistence diagrams which summarize the persistence (difference in function values between appearance and disappearance) of all homological features in a single diagram. A variety of work has developed metrics on these diagrams and probability distributions over them [55, 67], and robustness and confidence intervals on their landscapes [7, 39, 18] (summarizing again the most dominant persistent features [19]). Much of this work is independent of the function and data from which the diagram is generated, but it is now clearer than ever that it is most appropriate when the underlying function is robust to noise, e.g., the distance-to-a-measure [15].

Figure 1: Example with 10,00010{,}000 points in [0,1]2[0,1]^{2} generated on a circle or line with N(0,0.005)N(0,0.005) noise; 25%25\% of points are uniform background noise. The generating function is reconstructed with kde with σ=0.05\sigma=0.05 (upper left), and its persistence diagram based on the superlevel set filtration is shown (upper middle). A coreset [71] of the same dataset with only 1,3841{,}384 points (lower left) and persistence diagram (lower middle) are shown, again using kde. This associated confidence interval contains the dimension 11 homology features (red triangles) suggesting they are noise; this is because it models data as iid – but the coreset data is not iid, it subsamples more intelligently. We also show persistence diagrams of the original data based on the sublevel set filtration of the standard distance function (upper right, with no useful features due to noise) and the kernel distance (lower right).

A very recent addition to this progression is the new TDA package for R [38]; it includes built-in functions to analyze point sets using Hausdorff distance, distance-to-a-measure, kk-nearest neighbor density estimators, kernel density estimates, and kernel distance. The example in Figure 1 used this package to generate persistence diagrams. While the stability of the Hausdorff distance is classic [11, 31], and the distance-to-a-measure [15] and kk-nearest neighbor distances have been shown robust to various degrees [5], this paper is the first to analyze the stability of kernel density estimates and the kernel distance in the context of geometric inference. Some recent manuscripts show related results. Bobrowski et al. [6] consider kernels with finite support, and describe approximate confidence intervals on the superlevel sets, which recover approximate persistence diagrams. Chazal et al. [17] explore the robustness of the kernel distance in bootstrapping-based analysis.

In particular, we show that the kernel distance and kernel density estimates, using the Gaussian kernel, inherit some reconstruction properties of distance-to-a-measure, that these functions can also be approximately reconstructed using weighted (Vietoris-)Rips complexes [8], and that under certain regimes can infer homotopy of compact sets. Moreover, we show further robustness advantages of the kernel distance and kernel density estimates, including that they possess small coresets [57, 71] for persistence diagrams and inference.

1.1 Kernels, Kernel Density Estimates, and Kernel Distance

A kernel is a non-negative similarity measure K:d×d+K:{{\mathbb{R}}}^{d}\times{{\mathbb{R}}}^{d}\to{{\mathbb{R}}}^{+}; more similar points have higher value. For any fixed pdp\in{{\mathbb{R}}}^{d}, a kernel K(p,)K(p,\cdot) can be normalized to be a probability distribution; that is xdK(p,x)dx=1\int_{x\in{{\mathbb{R}}}^{d}}K(p,x){\rm d}{x}=1. For the purposes of this article we focus on the Gaussian kernel defined as K(p,x)=σ2exp(px2/2σ2)K(p,x)=\sigma^{2}\exp(-\|p-x\|^{2}/2\sigma^{2}). (Footnote 1: K(p,x)K(p,x) is normalized so that K(x,x)=1K(x,x)=1 for σ=1\sigma=1. The choice of coefficient σ2\sigma^{2} is not the standard normalization, but it is perfectly valid as it scales everything by a constant. It has the property that σ2K(p,x)px2/2\sigma^{2}-K(p,x)\approx\|p-x\|^{2}/2 for px\|p-x\| small.)

A kernel density estimate [65, 61, 26, 27] is a way to estimate a continuous distribution function over d{{\mathbb{R}}}^{d} for a finite point set PdP\subset{{\mathbb{R}}}^{d}; kernel density estimates have been studied and applied in a variety of contexts, for instance, under subsampling [57, 71, 3] and in motion planning [59], multimodality analysis [64, 33], surveillance [37], and road reconstruction [4]. Specifically,

kdeP(x)=1|P|pPK(p,x).\textsc{kde}_{P}(x)=\frac{1}{|P|}\sum_{p\in P}K(p,x).

The kernel distance [47, 42, 48, 58] (also called current distance or maximum mean discrepancy) is a metric [56, 66] between two point sets PP, QQ (as long as the kernel used is characteristic [66], a slight restriction of being positive definite [2, 70]; this includes the Gaussian and Laplace kernels). Define a similarity between the two point sets as

κ(P,Q)=1|P|1|Q|pPqQK(p,q).\kappa(P,Q)=\frac{1}{|P|}\frac{1}{|Q|}\sum_{p\in P}\sum_{q\in Q}K(p,q).

Then the kernel distance between two point sets is defined as

DK(P,Q)=κ(P,P)+κ(Q,Q)2κ(P,Q).D_{K}(P,Q)=\sqrt{\kappa(P,P)+\kappa(Q,Q)-2\kappa(P,Q)}.

When we let point set QQ be a single point xx, then κ(P,x)=kdeP(x)\kappa(P,x)=\textsc{kde}_{P}(x).
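To make these definitions concrete, the following sketch (assuming NumPy; the function names are ours and not from any package cited here) evaluates kdeP\textsc{kde}_{P}, κ(P,Q)\kappa(P,Q), and DK(P,Q)D_{K}(P,Q) for finite point sets with the σ2\sigma^{2}-normalized Gaussian kernel.

import numpy as np

def gauss_kernel(P, x, sigma=1.0):
    # K(p, x) = sigma^2 exp(-||p - x||^2 / (2 sigma^2)) for each row p of P
    d2 = np.sum((np.atleast_2d(P) - np.asarray(x)) ** 2, axis=1)
    return sigma ** 2 * np.exp(-d2 / (2 * sigma ** 2))

def kde(P, x, sigma=1.0):
    # kde_P(x) = (1/|P|) sum_{p in P} K(p, x)
    return gauss_kernel(P, x, sigma).mean()

def kappa(P, Q, sigma=1.0):
    # kappa(P, Q) = (1/(|P||Q|)) sum_{p in P} sum_{q in Q} K(p, q)
    P, Q = np.atleast_2d(P), np.atleast_2d(Q)
    d2 = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)
    return (sigma ** 2 * np.exp(-d2 / (2 * sigma ** 2))).mean()

def kernel_distance(P, Q, sigma=1.0):
    # D_K(P, Q) = sqrt(kappa(P,P) + kappa(Q,Q) - 2 kappa(P,Q));
    # clamp at zero to absorb floating-point rounding.
    val = kappa(P, P, sigma) + kappa(Q, Q, sigma) - 2 * kappa(P, Q, sigma)
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(0)
P = rng.normal(size=(200, 2))
Q = rng.normal(loc=0.5, size=(150, 2))
x = np.array([0.25, -0.1])
print(kde(P, x), kappa(P, x))      # kappa(P, {x}) equals kde_P(x)
print(kernel_distance(P, Q))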

Kernel density estimates can apply to any measure μ\mu (on d{{\mathbb{R}}}^{d}) as kdeμ(x)=pdK(p,x)dμ(p).\textsc{kde}_{\mu}(x)=\int_{p\in{{\mathbb{R}}}^{d}}K(p,x){\rm d}{\mu(p)}. The similarity between two measures is κ(μ,ν)=(p,q)d×dK(p,q)dmμ,ν(p,q),\kappa(\mu,\nu)=\int_{(p,q)\in{{\mathbb{R}}}^{d}\times{{\mathbb{R}}}^{d}}K(p,q){\rm d}{\textsf{m}_{\mu,\nu}(p,q)}, where mμ,ν\textsf{m}_{\mu,\nu} is the product measure of μ\mu and ν\nu (mμ,ν:=μν\textsf{m}_{\mu,\nu}:=\mu\otimes\nu), and then the kernel distance between two measures μ\mu and ν\nu is still a metric, defined as DK(μ,ν)=κ(μ,μ)+κ(ν,ν)2κ(μ,ν).D_{K}(\mu,\nu)=\sqrt{\kappa(\mu,\mu)+\kappa(\nu,\nu)-2\kappa(\mu,\nu)}. When the measure ν\nu is a Dirac measure at xx (ν(q)=0\nu(q)=0 for xqx\neq q, but integrates to 11), then κ(μ,x)=kdeμ(x)\kappa(\mu,x)=\textsc{kde}_{\mu}(x). Given a finite point set PdP\subset{{\mathbb{R}}}^{d}, we can work with the empirical measure μP\mu_{P} defined as μP=1|P|pPδp\mu_{P}=\frac{1}{|P|}\sum_{p\in P}\delta_{p}, where δp\delta_{p} is the Dirac measure on pp, and DK(μP,μQ)=DK(P,Q)D_{K}(\mu_{P},\mu_{Q})=D_{K}(P,Q).

If KK is positive definite, it is said to have the reproducing property [2, 70]. This implies that K(p,x)K(p,x) is an inner product in some reproducing kernel Hilbert space (RKHS) K\mathcal{H}_{K}. Specifically, there is a lifting map ϕ:dK\phi:{{\mathbb{R}}}^{d}\to\mathcal{H}_{K} so that K(p,x)=ϕ(p),ϕ(x)KK(p,x)=\langle\phi(p),\phi(x)\rangle_{\mathcal{H}_{K}}, and moreover the entire set PP can be represented as Φ(P)=pPϕ(p)\Phi(P)=\sum_{p\in P}\phi(p), which is a single element of K\mathcal{H}_{K} and has a norm Φ(P)K=κ(P,P)\|\Phi(P)\|_{\mathcal{H}_{K}}=\sqrt{\kappa(P,P)}. A single point xdx\in{{\mathbb{R}}}^{d} also has a norm ϕ(x)K=K(x,x)\|\phi(x)\|_{\mathcal{H}_{K}}=\sqrt{K(x,x)} in this space.

1.2 Geometric Inference and Distance to a Measure: A Review

Given an unknown compact set SdS\subset{{\mathbb{R}}}^{d} and a finite point cloud PdP\subset{{\mathbb{R}}}^{d} that comes from SS under some process, geometric inference aims to recover topological and geometric properties of SS from PP. The offset-based (and more generally, the distance function-based) approach for geometric inference reconstructs a geometric and topological approximation of SS by offsets from PP (e.g. [13, 14, 15, 20, 21]).

Given a compact set SdS\subset{{\mathbb{R}}}^{d}, we can define a distance function fSf_{S} to SS; a common example is fS(x)=infySxyf_{S}(x)=\inf_{y\in S}\|x-y\| (i.e. α\alpha-shapes). The offsets of SS are the sublevel sets of fSf_{S}, denoted (S)r=fS1([0,r])(S)^{r}=f_{S}^{-1}([0,r]). Now an approximation of SS by another compact set PdP\subset{{\mathbb{R}}}^{d} (e.g. a finite point cloud) can be quantified by the Hausdorff distance dH(S,P):=fSfP=supxd|fS(x)fP(x)|d_{H}(S,P):=\|f_{S}-f_{P}\|_{\infty}=\sup_{x\in{{\mathbb{R}}}^{d}}|f_{S}(x)-f_{P}(x)| of their distance functions. The intuition behind the inference of topology is that if dH(S,P)d_{H}(S,P) is small, then fSf_{S} and fPf_{P} are close, and subsequently SS, (S)r{(S)}^{r} and (P)r{(P)}^{r} carry the same topology for an appropriate scale rr. In other words, to compare the topology of offsets (S)r{(S)}^{r} and (P)r{(P)}^{r}, we require Hausdorff stability with respect to their distance functions fSf_{S} and fPf_{P}. An example of an offset-based topological inference result is formally stated as follows (as a particular version of the reconstruction Theorem 4.6 in [14]), where the reach of a compact set SS, 𝗋𝖾𝖺𝖼𝗁(S){\sf reach}(S), is defined as the minimum distance between SS and its medial axis [54].

Theorem 1.1 (Reconstruction from fPf_{P} [14]).

Let S,PdS,P\subset{{\mathbb{R}}}^{d} be compact sets such that 𝗋𝖾𝖺𝖼𝗁(S)>R{\sf reach}(S)>R and ε:=dH(S,P)<R/17\varepsilon:=d_{H}(S,P)<R/17. Then (S)η(S)^{\eta} and (P)r{(P)}^{r} are homotopy equivalent for sufficiently small η\eta (e.g., 0<η<R0<\eta<R) if 4εr<R3ε4\varepsilon\leq r<R-3\varepsilon.

Here η<R\eta<R ensures that the topological properties of (S)η(S)^{\eta} and (S)r(S)^{r} are the same, and the ε\varepsilon parameter ensures (S)r(S)^{r} and (P)r(P)^{r} are close. Typically ε\varepsilon is tied to the density with which a point cloud PP is sampled from SS.

For a function ϕ:d+\phi:\mathbb{R}^{d}\to\mathbb{R}^{+} to be distance-like, it should satisfy the following properties:

  • (D1) ϕ\phi is 11-Lipschitz: For all x,ydx,y\in{{\mathbb{R}}}^{d}, |ϕ(x)ϕ(y)|xy|\phi(x)-\phi(y)|\leq\|x-y\|.

  • (D2) ϕ2\phi^{2} is 11-semiconcave: The map xd(ϕ(x))2x2x\in{{\mathbb{R}}}^{d}\mapsto(\phi(x))^{2}-\|x\|^{2} is concave.

  • (D3) ϕ\phi is proper: ϕ(x)\phi(x) tends to the supremum of its range (e.g., \infty) as xx tends to infinity.

In addition to the Hausdorff stability property stated above, as explained in [15], fSf_{S} is distance-like. These three properties are paramount for geometric inference (e.g. [14, 53]). (D1) ensures that fSf_{S} is differentiable almost everywhere and the medial axis of SS has zero dd-volume [15]; and (D2) is a crucial technical tool, e.g., in proving the existence of the flow of the gradient of the distance function for topological inference [14].

Distance to a measure.

Given a probability measure μ\mu on d{{\mathbb{R}}}^{d} and a parameter m0>0m_{0}>0 smaller than the total mass of μ\mu, the distance to a measure dμ,m0ccm:d+d^{\textsf{{ccm}}}_{\mu,m_{0}}:{{\mathbb{R}}}^{d}\to{{\mathbb{R}}}^{+} [15] is defined for any point xdx\in{{\mathbb{R}}}^{d} as

dμ,m0ccm(x)=(1m0m=0m0(δμ,m(x))2dm)1/2, where δμ,m(x)=inf{r>0:μ(B¯r(x))m},d^{\textsf{{ccm}}}_{\mu,m_{0}}(x)=\left(\frac{1}{m_{0}}\int_{m=0}^{m_{0}}(\delta_{\mu,m}(x))^{2}{\rm d}{m}\right)^{1/2},\;\;\text{ where }\;\;\delta_{\mu,m}(x)=\inf\left\{r>0:\mu(\bar{B}_{r}(x))\geq m\right\},

It has been shown in [15] that dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}} is a distance-like function (satisfying (D1), (D2), and (D3)), and:

  • (M4) [Stability] For probability measures μ\mu and ν\nu on d{{\mathbb{R}}}^{d} and any m0>0m_{0}>0, dμ,m0ccmdν,m0ccm1m0W2(μ,ν)\|d^{\textsf{{ccm}}}_{\mu,m_{0}}-d^{\textsf{{ccm}}}_{\nu,m_{0}}\|_{\infty}\leq\frac{1}{\sqrt{m_{0}}}W_{2}(\mu,\nu).

Here W2W_{2} is the Wasserstein distance [69]: W2(μ,ν)=infπΠ(μ,ν)(d×dxy2dπ(x,y))1/2W_{2}(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\left(\int_{{{\mathbb{R}}}^{d}\times{{\mathbb{R}}}^{d}}||x-y||^{2}{\rm d}{\pi}(x,y)\right)^{1/2} between two measures, where dπ(x,y){\rm d}{\pi}(x,y) measures the amount of mass transferred from location xx to location yy and πΠ(μ,ν)\pi\in\Pi(\mu,\nu) is a transference plan [69].

Given a point set PP, the sublevel sets of dμP,m0ccmd^{\textsf{{ccm}}}_{\mu_{P},m_{0}} can be described as the union of balls [45], and then one can algorithmically estimate the topology (e.g., persistence diagram) with weighted alpha-shapes [45] and weighted Rips complexes [8].
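For the empirical measure μP\mu_{P} on nn points and m0=k/nm_{0}=k/n, the integral definition above reduces to averaging the squared distances from xx to its kk nearest neighbors in PP (the empirical form discussed in [15]). A minimal sketch, assuming NumPy and that m0m_{0} is a multiple of 1/n1/n; the function name is ours.

import numpy as np

def dccm_empirical(P, x, k):
    # d^ccm_{mu_P, m0}(x) with m0 = k/n: square root of the mean of the
    # squared distances from x to its k nearest neighbors in P.
    d2 = np.sort(np.sum((np.asarray(P) - np.asarray(x)) ** 2, axis=1))
    return np.sqrt(d2[:k].mean())

rng = np.random.default_rng(1)
P = rng.uniform(size=(500, 2))                 # uniform sample of the unit square
print(dccm_empirical(P, [0.5, 0.5], k=25))     # m0 = 25/500 = 0.05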

1.3 Our Results

We show how to estimate the topology (e.g., approximate persistence diagrams, infer homotopy of compact sets) using superlevel sets of the kernel density estimate of a point set PP. We accomplish this by showing that a similar set of properties holds for the kernel distance with respect to a measure μ\mu (in place of the distance to a measure dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}}), defined as

dμK(x)=DK(μ,x)=κ(μ,μ)+κ(x,x)2κ(μ,x).d^{K}_{\mu}(x)=D_{K}(\mu,x)=\sqrt{\kappa(\mu,\mu)+\kappa(x,x)-2\kappa(\mu,x)}.

This treats xx as a probability measure represented by a Dirac mass at xx. Specifically, we show dμKd^{K}_{\mu} is distance-like (it satisfies (D1), (D2), and (D3)), so it inherits reconstruction properties of dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}}. Moreover, it is stable with respect to the kernel distance:

  • (K4) [Stability] If μ\mu and ν\nu are two measures on d{{\mathbb{R}}}^{d}, then dμKdνKDK(μ,ν)\|d^{K}_{\mu}-d^{K}_{\nu}\|_{\infty}\leq D_{K}(\mu,\nu).

In addition, we show how to construct these topological estimates for dμKd^{K}_{\mu} using weighted Rips complexes, following power distance machinery introduced in [8]. That is, a particular form of power distance permits a multiplicative approximation with the kernel distance.

We also describe further advantages of the kernel distance. (i) Its sublevel sets conveniently map to the superlevel sets of a kernel density estimate. (ii) It is Lipschitz with respect to the smoothing parameter σ\sigma when the input xx is fixed. (iii) As σ\sigma tends to \infty for any two probability measures μ,ν\mu,\nu, the kernel distance is bounded by the Wasserstein distance: limσDK(μ,ν)W2(μ,ν)\lim_{\sigma\to\infty}D_{K}(\mu,\nu)\leq W_{2}(\mu,\nu). (iv) It has a small coreset representation, which allows for sparse representation and efficient, scalable computation. In particular, an ε\varepsilon-kernel sample [48, 57, 71] QQ of μ\mu is a finite point set whose size only depends on ε>0\varepsilon>0 and such that maxxd|kdeμ(x)kdeμQ(x)|=maxxd|κ(μ,x)κ(μQ,x)|ε\max_{x\in{{\mathbb{R}}}^{d}}|\textsc{kde}_{\mu}(x)-\textsc{kde}_{\mu_{Q}}(x)|=\max_{x\in{{\mathbb{R}}}^{d}}|\kappa(\mu,x)-\kappa(\mu_{Q},x)|\leq\varepsilon. These coresets preserve inference results and persistence diagrams.

2 Kernel Distance is Distance-Like

In this section we prove dμKd^{K}_{\mu} satisfies (D1), (D2), and (D3); hence it is distance-like. Recall we use the σ2\sigma^{2}-normalized Gaussian kernel Kσ(p,x)=σ2exp(px2/2σ2)K_{\sigma}(p,x)=\sigma^{2}\exp(-\|p-x\|^{2}/2\sigma^{2}). For ease of exposition, unless otherwise noted, we will assume σ\sigma is fixed and write KK instead of KσK_{\sigma}.

2.1 Semiconcave Property for dμKd^{K}_{\mu}

Lemma 2.1 (D2).

(dμK)2(d^{K}_{\mu})^{2} is 11-semiconcave: the map x(dμK(x))2x2x\mapsto(d^{K}_{\mu}(x))^{2}-\|x\|^{2} is concave.

Proof 2.2.

Let T(x)=(dμK(x))2x2T(x)=(d^{K}_{\mu}(x))^{2}-\|x\|^{2}. The proof will show that the second derivative of TT along any direction is nonpositive. We can rewrite

T(x)\displaystyle T(x) =κ(μ,μ)+κ(x,x)2κ(μ,x)x2\displaystyle=\kappa(\mu,\mu)+\kappa(x,x)-2\kappa(\mu,x)-\|x\|^{2}
=κ(μ,μ)+κ(x,x)pd(2K(p,x)+x2)dμ(p).\displaystyle=\kappa(\mu,\mu)+\kappa(x,x)-\int_{p\in{{\mathbb{R}}}^{d}}(2K(p,x)+\|x\|^{2}){\rm d}{\mu(p)}.

Note that both κ(μ,μ)\kappa(\mu,\mu) and κ(x,x)\kappa(x,x) are absolute constants, so we can ignore them in the second derivative. Furthermore, by setting t(p,x)=2K(p,x)x2t(p,x)=-2K(p,x)-\|x\|^{2}, the second derivative of T(x)T(x) is nonpositive if the second derivative of t(p,x)t(p,x) is nonpositive for all p,xdp,x\in{{\mathbb{R}}}^{d}. First note that the second derivative of x2-\|x\|^{2} is a constant 2-2 in every direction. The second derivative of K(p,x)K(p,x) is symmetric about pp, so we can consider the second derivative along any vector u=xpu=x-p,

d2du2t(p,x)=2(u2σ21)exp(u22σ2)2.\frac{{\rm d}{}^{2}}{{\rm d}{u}^{2}}t(p,x)=2\left(\frac{\|u\|^{2}}{\sigma^{2}}-1\right)\exp\left(-\frac{\|u\|^{2}}{2\sigma^{2}}\right)-2.

This reaches its maximum value at u=xp=3σ\|u\|=\|x-p\|=\sqrt{3}\sigma where it is 4exp(3/2)21.14\exp(-3/2)-2\approx-1.1; this follows by setting the derivative of s(y)=2(y1)exp(y/2)2s(y)=2(y-1)\exp(-y/2)-2 to 0 (ddys(y)=(3y)exp(y/2)\frac{{\rm d}{}}{{\rm d}{y}}s(y)=(3-y)\exp(-y/2)) and substituting y=u2/σ2y=\|u\|^{2}/\sigma^{2}.

We also note in Appendix A that semiconcavity follows trivially in the RKHS K\mathcal{H}_{K}.

2.2 Lipschitz Property for dμKd^{K}_{\mu}

We generalize a folklore relation (see [15]) between semiconcave and Lipschitz functions and prove it for completeness. A function ff is \ell-semiconcave if the function T(x)=(f(x))2x2T(x)=(f(x))^{2}-\ell\|x\|^{2} is concave.

Lemma 2.3.

Consider a twice-differentiable function gg and a parameter 1\ell\geq 1. If (g(x))2(g(x))^{2} is \ell-semiconcave, then g(x)g(x) is \ell-Lipschitz.

Proof 2.4.

The proof is by contrapositive; we assume that g(x)g(x) is not \ell-Lipschitz and then show (g(x))2(g(x))^{2} cannot be \ell-semiconcave. By this assumption, in some direction uu there is a point xx^{\prime} such that (d/du)g(x)=c>1(d/du)g(x^{\prime})=c>\ell\geq 1.

Now we examine f(x)=(g(x))2x2f(x)=(g(x))^{2}-\ell\|x\|^{2} at x=xx=x^{\prime}, and specifically its second derivative in direction uu.

dduf(x)|x=x\displaystyle\frac{d}{du}f(x)\big{|}_{x=x^{\prime}} =2(ddug(x))g(x)2x=2cg(x)2x\displaystyle=2\left(\frac{d}{du}g(x^{\prime})\right)g(x^{\prime})-2\ell\|x^{\prime}\|=2c\cdot g(x^{\prime})-2\ell\|x^{\prime}\|
d2du2f(x)|x=x\displaystyle\frac{d^{2}}{du^{2}}f(x)\big{|}_{x=x^{\prime}} =2c(ddug(x))2=2c22=2(c2)\displaystyle=2c\left(\frac{d}{du}g(x^{\prime})\right)-2\ell=2c^{2}-2\ell=2(c^{2}-\ell)

Since c2>c>1c^{2}>c>\ell\geq 1, then 2(c2)>02(c^{2}-\ell)>0 and f(x)f(x) is not \ell-semiconcave at xx^{\prime}.

We can now state the following lemma as a corollary of Lemma 2.3 and Lemma 2.1.

Lemma 2.5 (D1).

dμKd^{K}_{\mu} is 11-Lipschitz on its input.

2.3 Properness of dμKd^{K}_{\mu}

Finally, for dμKd^{K}_{\mu} to be distance-like, we need to show it is proper when its range is restricted to be less than cμ:=κ(μ,μ)+κ(x,x)c_{\mu}:=\sqrt{\kappa(\mu,\mu)+\kappa(x,x)}. Here, the value of cμc_{\mu} depends only on μ\mu and not on xx since κ(x,x)=K(x,x)=σ2\kappa(x,x)=K(x,x)=\sigma^{2}. This is required for a distance-like version ([15], Proposition 4.2) of the Isotopy Lemma ([44], Proposition 1.8).

Lemma 2.6 (D3).

dμKd^{K}_{\mu} is proper.

We delay this technical proof to Appendix A. The main technical difficulty comes in mapping standard definitions and approaches for distance functions to our function dμKd^{K}_{\mu} with a restricted range.

By properness (see the discussion in Appendix A), Lemma 2.6 also implies that dμKd^{K}_{\mu} is a closed map and its levelset at any value a[0,cμ)a\in[0,c_{\mu}) is compact. This also means that the sublevel set of dμKd^{K}_{\mu} (for ranges [0,a)[0,cμ)[0,a)\subset[0,c_{\mu})) is compact. Since the levelset (sublevel set) of dμKd^{K}_{\mu} corresponds to the levelset (superlevel set) of kdeμ\textsc{kde}_{\mu}, we have the following corollary.

Corollary 2.7.

The superlevel sets of kdeμ\textsc{kde}_{\mu} at any threshold a>0a>0 are compact.

The result in [33] shows that given a measure μP{\mu_{P}} defined by a point set PP of size nn, kdeμP\textsc{kde}_{{\mu_{P}}} has a number of modes polynomial in nn; hence the superlevel sets of kdeμP\textsc{kde}_{{\mu_{P}}} are compact in this setting. The above corollary is a more general statement as it holds for any measure.

3 Power Distance using Kernel Distance

A power distance using dμKd^{K}_{\mu} is defined with a point set PdP\subset{{\mathbb{R}}}^{d} and a metric d(,)d(\cdot,\cdot) on d{{\mathbb{R}}}^{d},

fP(μ,x)=minpP(d(p,x)2+dμK(p)2).{f_{P}}(\mu,x)=\sqrt{\min_{p\in P}\left(d(p,x)^{2}+d^{K}_{\mu}(p)^{2}\right)}.

A point xdx\in\mathbb{R}^{d} takes the distance under d(p,x)d(p,x) to the closest pPp\in P, plus a weight from dμK(p)d^{K}_{\mu}(p); thus a sublevel set of fP(μ,){f_{P}}(\mu,\cdot) is defined by a union of balls. We consider a particular choice of the distance d(p,x):=DK(p,x)d(p,x):=D_{K}(p,x) which leads to a kernel version of power distance

fPk(μ,x)=minpP(DK(p,x)2+dμK(p)2).{f^{\textsc{k}}_{P}}(\mu,x)=\sqrt{\min_{p\in P}\left({D_{K}(p,x)}^{2}+{d^{K}_{\mu}(p)}^{2}\right)}.

In Section 4.2 we use fPk(μ,x){f^{\textsc{k}}_{P}}(\mu,x) to adapt the construction introduced in [8] to approximate the persistence diagram of the sublevel sets of dμKd^{K}_{\mu}, using a weighted Rips filtration of fPk(μ,x){f^{\textsc{k}}_{P}}(\mu,x).
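As a concrete instance, the sketch below (assuming NumPy; the helper name is ours) evaluates fPk(μP,x){f^{\textsc{k}}_{P}}(\mu_{P},x) for an empirical measure μP\mu_{P} on the same point set PP, using DK(p,x)2=2σ22K(p,x)D_{K}(p,x)^{2}=2\sigma^{2}-2K(p,x) between Dirac masses and dμPK(p)2=κ(μP,μP)+σ22kdeμP(p)d^{K}_{\mu_{P}}(p)^{2}=\kappa(\mu_{P},\mu_{P})+\sigma^{2}-2\textsc{kde}_{\mu_{P}}(p).

import numpy as np

def kernel_power_distance(P, x, sigma=1.0):
    # f^K_P(mu_P, x) = sqrt( min_{p in P} ( D_K(p, x)^2 + d^K_{mu_P}(p)^2 ) )
    P = np.asarray(P, dtype=float)
    x = np.asarray(x, dtype=float)
    s2 = sigma ** 2
    KPP = s2 * np.exp(-np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1) / (2 * s2))
    kappa_PP = KPP.mean()                       # kappa(mu_P, mu_P)
    kde_at_P = KPP.mean(axis=1)                 # kde_{mu_P}(p) for each p in P
    w2 = np.maximum(kappa_PP + s2 - 2 * kde_at_P, 0.0)   # weights d^K_{mu_P}(p)^2
    KPx = s2 * np.exp(-np.sum((P - x) ** 2, axis=1) / (2 * s2))
    DK2 = 2 * s2 - 2 * KPx                      # D_K(p, x)^2 for Dirac masses
    return np.sqrt(np.min(DK2 + w2))

rng = np.random.default_rng(2)
P = rng.normal(size=(300, 2))
print(kernel_power_distance(P, [0.3, -0.2], sigma=0.5))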

Given a measure μ\mu, let p+=argmaxqdκ(μ,q)p_{+}=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa(\mu,q), and let P+dP_{+}\subset{{\mathbb{R}}}^{d} be a point set that contains p+p_{+}. We show below, in Theorem 3.8 and Theorem 3.2, that 12dμK(x)fP+k(μ,x)14dμK(x)\frac{1}{\sqrt{2}}d^{K}_{\mu}(x)\leq{f^{\textsc{k}}_{P_{+}}}(\mu,x)\leq\sqrt{14}d^{K}_{\mu}(x). However, constructing p+p_{+} exactly seems quite difficult. We also attempt to use p=argminpPpxp^{\star}=\arg\min_{p\in P}\|p-x\| in place of p+p_{+} (see Section C.1), but are not able to obtain useful bounds.

Now consider an empirical measure μP\mu_{P} defined by a point set PP. We show (in Theorem C.21 in Appendix C.2) how to construct a point p^+\hat{p}_{+} (that approximates p+p_{+}) such that DK(P,p^+)(1+δ)DK(P,p+)D_{K}(P,\hat{p}_{+})\leq(1+\delta)D_{K}(P,p_{+}) for any δ>0\delta>0. For a point set PP, the median concentration ΛP\Lambda_{P} is a radius such that no point pPp\in P has more than half of the points of PP within ΛP\Lambda_{P}, and the spread βP\beta_{P} is the ratio between the longest and shortest pairwise distances. The runtime is polynomial in nn and 1/δ1/\delta assuming βP\beta_{P} is bounded, and that σ/ΛP\sigma/\Lambda_{P} and dd are constants.

We then consider P^+=P{p^+}\hat{P}_{+}=P\cup\{\hat{p}_{+}\}, where p^+\hat{p}_{+} is found with δ=1/2\delta=1/2 in the above construction. This yields the following multiplicative bound, whose upper bound is proven in Theorem 3.10; the lower bound holds independent of the choice of PP as shown in Theorem 3.2.

Theorem 3.1.

For any point set PdP\subset{{\mathbb{R}}}^{d} and point xdx\in{{\mathbb{R}}}^{d}, with empirical measure μP\mu_{P} defined by PP, then

12dμPK(x)fP^+k(μP,x)71dμPK(x).\frac{1}{\sqrt{2}}d^{K}_{\mu_{P}}(x)\leq{f^{\textsc{k}}_{\hat{P}_{+}}}(\mu_{P},x)\leq\sqrt{71}d^{K}_{\mu_{P}}(x).
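The following sketch (assuming NumPy) empirically compares fkf^{\textsc{k}} with dμPKd^{K}_{\mu_{P}} at random query points. For simplicity it replaces the Theorem C.21 construction of p^+\hat{p}_{+} by a heuristic, namely the input point with the largest kde value; this stand-in is our assumption, so only the lower bound of the sandwich is guaranteed in advance and the upper bound is merely observed on this data.

import numpy as np

def pairwise_k(A, B, sigma):
    # matrix of K(a, b) = sigma^2 exp(-||a - b||^2 / (2 sigma^2))
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma ** 2 * np.exp(-d2 / (2 * sigma ** 2))

def sandwich_ratios(P, sigma=0.5, trials=200, seed=3):
    rng = np.random.default_rng(seed)
    P = np.asarray(P, dtype=float)
    s2 = sigma ** 2
    KPP = pairwise_k(P, P, sigma)
    kappa_PP = KPP.mean()
    p_hat = P[np.argmax(KPP.mean(axis=1))]          # heuristic stand-in for p_+
    P_plus = np.vstack([P, p_hat])
    kde_Pplus = pairwise_k(P_plus, P, sigma).mean(axis=1)
    w2 = np.maximum(kappa_PP + s2 - 2 * kde_Pplus, 0.0)   # d^K_{mu_P}(p)^2 on P_plus
    ratios = []
    for _ in range(trials):
        x = rng.uniform(P.min(axis=0), P.max(axis=0))
        kde_x = pairwise_k(x[None, :], P, sigma).mean()
        d = np.sqrt(max(kappa_PP + s2 - 2 * kde_x, 0.0))  # d^K_{mu_P}(x)
        DK2 = 2 * s2 - 2 * pairwise_k(P_plus, x[None, :], sigma)[:, 0]
        f = np.sqrt(np.min(DK2 + w2))                     # f^K_{P_plus}(mu_P, x)
        ratios.append(f / d)
    return np.array(ratios)

rng = np.random.default_rng(4)
r = sandwich_ratios(rng.uniform(size=(400, 2)))
print(r.min(), r.max())   # compare against 1/sqrt(2) ~ 0.707 and sqrt(71) ~ 8.43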

3.1 Kernel Power Distance for a Measure μ\mu

First consider the case for a kernel power distance fPk(μ,x){f^{\textsc{k}}_{P}}(\mu,x) where μ\mu is an arbitrary measure.

Theorem 3.2.

For measure μ\mu, point set PdP\subset\mathbb{R}^{d}, and xdx\in{{\mathbb{R}}}^{d}, DK(μ,x)2fPk(μ,x).D_{K}(\mu,x)\leq\sqrt{2}{f^{\textsc{k}}_{P}}(\mu,x).

Proof 3.3.

Let p=argminqP(DK(q,x)2+DK(μ,q)2)p=\arg\min_{q\in P}\left(D_{K}(q,x)^{2}+D_{K}(\mu,q)^{2}\right). Then we can use the triangle inequality and (DK(μ,p)DK(p,x))20(D_{K}(\mu,p)-D_{K}(p,x))^{2}\geq 0 to show

DK(μ,x)2(DK(μ,p)+DK(p,x))22(DK(μ,p)2+DK(p,x)2)=2fPk(μ,x)2.D_{K}(\mu,x)^{2}\leq(D_{K}(\mu,p)+D_{K}(p,x))^{2}\leq 2(D_{K}(\mu,p)^{2}+D_{K}(p,x)^{2})=2{f^{\textsc{k}}_{P}}(\mu,x)^{2}.
Lemma 3.4.

For measure μ\mu, point set PdP\subset\mathbb{R}^{d}, point pPp\in P, and point xdx\in{{\mathbb{R}}}^{d} then fPk(μ,x)22DK(μ,x)2+3DK(p,x)2.{f^{\textsc{k}}_{P}}(\mu,x)^{2}\leq 2D_{K}(\mu,x)^{2}+3D_{K}(p,x)^{2}.

Proof 3.5.

Again, we can reach this result with the triangle inequality.

fPk(μ,x)2\displaystyle{f^{\textsc{k}}_{P}}(\mu,x)^{2} DK(μ,p)2+DK(p,x)2\displaystyle\leq D_{K}(\mu,p)^{2}+D_{K}(p,x)^{2}
(DK(μ,x)+DK(p,x))2+DK(p,x)2\displaystyle\leq(D_{K}(\mu,x)+D_{K}(p,x))^{2}+D_{K}(p,x)^{2}
2DK(μ,x)2+3DK(p,x)2.\displaystyle\leq 2D_{K}(\mu,x)^{2}+3D_{K}(p,x)^{2}.

Recall the definition of a point p+=argmaxqdκ(μ,q)p_{+}=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa(\mu,q).

Lemma 3.6.

For any measure μ\mu and point x,p+dx,p_{+}\in{{\mathbb{R}}}^{d} we have DK(p+,x)2DK(μ,x)D_{K}(p_{+},x)\leq 2D_{K}(\mu,x).

Proof 3.7.

Since xx is a point in d{{\mathbb{R}}}^{d}, κ(μ,x)κ(μ,p+)\kappa(\mu,x)\leq\kappa(\mu,p_{+}) and thus DK(μ,x)DK(μ,p+)D_{K}(\mu,x)\geq D_{K}(\mu,p_{+}). Then the triangle inequality for DKD_{K} gives DK(p+,x)DK(μ,x)+DK(μ,p+)2DK(μ,x)D_{K}(p_{+},x)\leq D_{K}(\mu,x)+D_{K}(\mu,p_{+})\leq 2D_{K}(\mu,x).

Theorem 3.8.

For any measure μ\mu in d{{\mathbb{R}}}^{d} and any point xdx\in{{\mathbb{R}}}^{d}, using the point p+=argmaxqdκ(μ,q)p_{+}=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa(\mu,q) then f{p+}k(μ,x)14DK(μ,x){f^{\textsc{k}}_{\{p_{+}\}}}(\mu,x)\leq\sqrt{14}D_{K}(\mu,x).

Proof 3.9.

Combine Lemma 3.4 and Lemma 3.6 as

f{p+}k(μ,x)22DK(μ,x)2+3DK(p+,x)22DK(μ,x)2+3(4DK(μ,x)2)=14DK(μ,x)2.{f^{\textsc{k}}_{\{p_{+}\}}}(\mu,x)^{2}\leq 2D_{K}(\mu,x)^{2}+3D_{K}(p_{+},x)^{2}\leq 2D_{K}(\mu,x)^{2}+3(4D_{K}(\mu,x)^{2})=14D_{K}(\mu,x)^{2}.

We now need two properties of the point set PP to reach our bound, namely, the spread βP\beta_{P} and the median concentration ΛP\Lambda_{P}. Typically log(βP)\log(\beta_{P}) is not too large, and it makes sense to choose σ\sigma so σ/ΛP1\sigma/\Lambda_{P}\leq 1, or at least σ/ΛP=O(1)\sigma/\Lambda_{P}=O(1).

Theorem 3.10.

Consider any point set PdP\subset{{\mathbb{R}}}^{d} of size nn, with measure μP\mu_{P}, spread βP\beta_{P}, and median concentration ΛP\Lambda_{P}. We can construct a point set P^+=P{p^+}\hat{P}_{+}=P\cup\{\hat{p}_{+}\} in O(n2((σ/(ΛPδ))d+logβP))O(n^{2}((\sigma/(\Lambda_{P}\delta))^{d}+\log\beta_{P})) time such that for any point xx, fP^+k(μP,x)71DK(μP,x).{f^{\textsc{k}}_{\hat{P}_{+}}}({\mu_{P}},x)\leq\sqrt{71}D_{K}({\mu_{P}},x).

Proof 3.11.

We use Theorem C.21 to find a point p^+\hat{p}_{+} such that DK(P,p^+)(3/2)DK(P,p+)D_{K}(P,\hat{p}_{+})\leq(3/2)D_{K}(P,p_{+}). Thus for any xdx\in{{\mathbb{R}}}^{d}, using the triangle inequality

DK(p^+,x)\displaystyle D_{K}(\hat{p}_{+},x) DK(p^+,p+)+DK(p+,x)DK(μP,p^+)+DK(μP,p+)+DK(p+,x)\displaystyle\leq D_{K}(\hat{p}_{+},p_{+})+D_{K}(p_{+},x)\leq D_{K}({\mu_{P}},\hat{p}_{+})+D_{K}({\mu_{P}},p_{+})+D_{K}(p_{+},x)
(5/2)DK(μP,p+)+DK(p+,x).\displaystyle\leq(5/2)D_{K}({\mu_{P}},p_{+})+D_{K}(p_{+},x).

Now combine this with Lemma 3.4 and Lemma 3.6 as

fP^+k(μP,x)2\displaystyle{f^{\textsc{k}}_{\hat{P}_{+}}}({\mu_{P}},x)^{2} 2DK(μP,x)2+3DK(p^+,x)2\displaystyle\leq 2D_{K}({\mu_{P}},x)^{2}+3D_{K}(\hat{p}_{+},x)^{2}
2DK(μP,x)2+3((5/2)DK(μP,x)+DK(p+,x))2\displaystyle\leq 2D_{K}({\mu_{P}},x)^{2}+3((5/2)D_{K}({\mu_{P}},x)+D_{K}(p_{+},x))^{2}
2DK(μP,x)2+3(((25/4)+(5/2))DK(μP,x)2+(1+5/2)DK(p+,x)2)\displaystyle\leq 2D_{K}({\mu_{P}},x)^{2}+3\left(((25/4)+(5/2))D_{K}({\mu_{P}},x)^{2}+(1+5/2)D_{K}(p_{+},x)^{2}\right)
=(113/4)DK(μP,x)2+(21/2)DK(p+,x)2\displaystyle=(113/4)D_{K}({\mu_{P}},x)^{2}+(21/2)D_{K}(p_{+},x)^{2}
(113/4)DK(μP,x)2+(21/2)(4DK(μP,x)2)<71DK(μP,x)2.\displaystyle\leq(113/4)D_{K}({\mu_{P}},x)^{2}+(21/2)(4D_{K}({\mu_{P}},x)^{2})<71D_{K}({\mu_{P}},x)^{2}.

4 Reconstruction and Topological Estimation using Kernel Distance

Now applying distance-like properties from Section 2 and the power distance properties of Section 3 we can apply known reconstruction results to the kernel distance.

4.1 Homotopy Equivalent Reconstruction using dμKd^{K}_{\mu}

We have shown that the kernel distance function dμKd^{K}_{\mu} is a distance-like function. Therefore the reconstruction theory for a distance-like function [15] (which is an extension of results for compact sets [14]) holds in the setting of dμKd^{K}_{\mu}. We state the following two corollaries for completeness, whose proofs follow from the proofs of Proposition 4.2 and Theorem 4.6 in [15]. Before their formal statement, we need some notation adapted from [15] to make these statements precise. Let ϕ:d+\phi:{{\mathbb{R}}}^{d}\to{{\mathbb{R}}}^{+} be a distance-like function. A point xdx\in{{\mathbb{R}}}^{d} is an α\alpha-critical point if ϕ2(x+h)ϕ2(x)+2αhϕ(x)+h2\phi^{2}(x+h)\leq\phi^{2}(x)+2\alpha\|h\|\phi(x)+\|h\|^{2} with α[0,1]\alpha\in[0,1], hd\forall h\in{{\mathbb{R}}}^{d}. Let (ϕ)r={xdϕ(x)r}(\phi)^{r}=\{x\in{{\mathbb{R}}}^{d}\mid\phi(x)\leq r\} denote the sublevel set of ϕ\phi, and let (ϕ)[r1,r2]={xdr1ϕ(x)r2}(\phi)^{[r_{1},r_{2}]}=\{x\in{{\mathbb{R}}}^{d}\mid r_{1}\leq\phi(x)\leq r_{2}\} denote all points at levels in the range [r1,r2][r_{1},r_{2}]. For α[0,1]\alpha\in[0,1], the α\alpha-reach of ϕ\phi is the maximum rr such that (ϕ)r(\phi)^{r} has no α\alpha-critical point, denoted as 𝗋𝖾𝖺𝖼𝗁α(ϕ){\sf reach}_{\alpha}(\phi). When α=1\alpha=1, 𝗋𝖾𝖺𝖼𝗁1{\sf reach}_{1} coincides with reach introduced in [40].

Theorem 4.1 (Isotopy lemma on dμKd^{K}_{\mu}).

Let r1<r2r_{1}<r_{2} be two positive numbers such that dμKd^{K}_{\mu} has no critical points in (dμK)[r1,r2](d^{K}_{\mu})^{[r_{1},r_{2}]}. Then all the sublevel sets (dμK)r(d^{K}_{\mu})^{r} are isotopic for r[r1,r2]r\in[r_{1},r_{2}].

Theorem 4.2 (Reconstruction on dμKd^{K}_{\mu}).

Let dμKd^{K}_{\mu} and dνKd^{K}_{\nu} be two kernel distance functions such that dμKdνKε\|d^{K}_{\mu}-d^{K}_{\nu}\|_{\infty}\leq\varepsilon. Suppose 𝗋𝖾𝖺𝖼𝗁α(dμK)R{\sf reach}_{\alpha}(d^{K}_{\mu})\geq R for some α>0\alpha>0. Then r[4ε/α2,R3ε]\forall r\in[4\varepsilon/\alpha^{2},R-3\varepsilon], and η(0,R)\forall\eta\in(0,R), the sublevel sets (dμK)η(d^{K}_{\mu})^{\eta} and (dνK)r(d^{K}_{\nu})^{r} are homotopy equivalent for εR/(5+4/α2)\varepsilon\leq R/(5+4/\alpha^{2}).

4.2 Constructing Topological Estimates using dμKd^{K}_{\mu}

In order to actually construct a topological estimate using the kernel distance dμKd^{K}_{\mu}, one needs to be able to compute quantities related to its sublevel sets, in particular, to compute the persistence diagram of the sub-level sets filtration of dμKd^{K}_{\mu}. Now we describe such tools needed for the kernel distance based on machinery recently developed by Buchet et al. [8], which shows how to approximate the persistent homology of distance-to-a-measure for any metric space via a power distance construction. Then using similar constructions, we can use the weighted Rips filtration to approximate the persistence diagram of the kernel distance.

To state our results, we first require some technical notions and assume basic knowledge of persistent homology (see [34, 35] for a readable background). Given a metric space 𝕏{{\mathbb{X}}} with the distance d𝕏(,)d_{{{\mathbb{X}}}}(\cdot,\cdot), a set P𝕏P\subseteq{{\mathbb{X}}} and a function w:Pw:P\to{{\mathbb{R}}}, the (general) power distance ff associated with (P,w)(P,w) is defined as f(x)=minpP(d𝕏(p,x)2+w(p)2).f(x)=\sqrt{\min_{p\in P}\left(d_{{{\mathbb{X}}}}(p,x)^{2}+w(p)^{2}\right)}. Now given the set (P,w)(P,w) and its corresponding power distance ff, one could use the weighted Rips filtration to approximate the persistence diagram of ww, under certain restrictive conditions proven in Appendix D.1. Consider the sublevel set of ff, f1((,α])f^{-1}((-\infty,\alpha]). It is the union of balls centered at points pPp\in P with radius rp(α)=α2w(p)2r_{p}(\alpha)=\sqrt{\alpha^{2}-w(p)^{2}} for each pp. The weighted Čech complex Cα(P,w)C_{\alpha}(P,w) for parameter α\alpha is the set of simplices ss such that psB(p,rp(α))\bigcap_{p\in s}B(p,r_{p}(\alpha))\neq\emptyset. The weighted Rips complex Rα(P,w)R_{\alpha}(P,w) for parameter α\alpha is the maximal complex whose 11-skeleton is the same as Cα(P,w)C_{\alpha}(P,w). The corresponding weighted Rips filtration is denoted as {Rα(P,w)}\{R_{\alpha}(P,w)\}.

Setting w:=dμPKw:=d^{K}_{\mu_{P}} and given the point set P^+\hat{P}_{+} described in Section 3, consider the weighted Rips filtration {Rα(P^+,dμPK)}\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\} based on the kernel power distance fP^+k{f^{\textsc{k}}_{\hat{P}_{+}}}. We view the persistence diagrams on a logarithmic scale, that is, we change coordinates of points following the mapping (x,y)(lnx,lny)(x,y)\mapsto(\ln x,\ln y). dBlnd_{B}^{\ln} denotes the corresponding bottleneck distance between persistence diagrams. We now state a corollary of Theorem 3.1.
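A sketch of the filtration bookkeeping (assuming NumPy; the names are ours, and the output is intended to be handed to any persistence package that accepts vertex and edge filtration values of a flag complex). A vertex pp enters at α=w(p)\alpha=w(p); following the usual convention for weighted Rips in a general metric, an edge (p,q)(p,q) enters at the smallest α\alpha with rp(α)+rq(α)DK(p,q)r_{p}(\alpha)+r_{q}(\alpha)\geq D_{K}(p,q), found here by bisection.

import numpy as np

def edge_time(dpq, wp, wq, tol=1e-10):
    # smallest alpha >= max(wp, wq) with sqrt(alpha^2 - wp^2) + sqrt(alpha^2 - wq^2) >= dpq
    lo = max(wp, wq)
    def radii_sum(a):
        return np.sqrt(max(a * a - wp * wp, 0.0)) + np.sqrt(max(a * a - wq * wq, 0.0))
    if radii_sum(lo) >= dpq:
        return lo
    hi = lo + dpq                      # at this alpha each radius already exceeds dpq
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if radii_sum(mid) >= dpq:
            hi = mid
        else:
            lo = mid
    return hi

def weighted_rips_times(P_hat, w, sigma):
    # vertex and edge appearance values of {R_alpha(P_hat, w)}; higher simplices of
    # the flag complex appear when their last edge does.
    P_hat = np.asarray(P_hat, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(P_hat)
    s2 = sigma ** 2
    d2 = np.sum((P_hat[:, None, :] - P_hat[None, :, :]) ** 2, axis=-1)
    DK = np.sqrt(np.maximum(2 * s2 - 2 * s2 * np.exp(-d2 / (2 * s2)), 0.0))
    vertex_vals = w.copy()
    edge_vals = {(i, j): edge_time(DK[i, j], w[i], w[j])
                 for i in range(n) for j in range(i + 1, n)}
    return vertex_vals, edge_vals

rng = np.random.default_rng(5)
pts = rng.uniform(size=(20, 2))
wts = rng.uniform(0.1, 0.3, size=20)   # stand-in weights; in our setting w = d^K_{mu_P}
v, e = weighted_rips_times(pts, wts, sigma=0.25)
print(v[:3], sorted(e.values())[:3])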

Corollary 4.3.

The weighted Rips filtration {Rα(P^+,dμPK)}\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\} can be used to approximate the persistence diagram of dμPKd^{K}_{\mu_{P}} such that dBln(Dgm(dμPK),Dgm({Rα(P^+,dμPK)}))ln(271)d_{B}^{\rm\ln}(\textsf{Dgm}(d^{K}_{\mu_{P}}),\textsf{Dgm}(\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\}))\leq\ln(2\sqrt{71}).

Proof 4.4.

To prove that two persistence diagrams are close, one could prove that their filtrations are interleaved [12], that is, two filtrations {Uα}\{U_{\alpha}\} and {Vα}\{V_{\alpha}\} are ε\varepsilon-interleaved if for any α\alpha, UαVα+εUα+2εU_{\alpha}\subseteq V_{\alpha+\varepsilon}\subseteq U_{\alpha+2\varepsilon}.

First, Lemmas D.1 and D.3 prove that the persistence diagrams Dgm(dμPK)\textsf{Dgm}(d^{K}_{\mu_{P}}) and Dgm({Rα(P^+,dμPK)})\textsf{Dgm}(\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\}) are well-defined. Second, the result of Theorem 3.1 implies a multiplicative interleaving. Therefore for any α\alpha\in{{\mathbb{R}}},

{(d^{K}_{\mu_{P}})}^{-1}((-\infty,\alpha])\subseteq{({f^{\textsc{k}}_{\hat{P}_{+}}})}^{-1}((-\infty,\sqrt{71}\,\alpha])\subseteq{(d^{K}_{{\mu_{P}}})}^{-1}((-\infty,\sqrt{2}\sqrt{71}\,\alpha]).

On a logarithmic scale (by taking the natural log of both sides), such an interleaving becomes additive,

\ln d^{K}_{\mu_{P}}-\ln\sqrt{2}\leq\ln{f^{\textsc{k}}_{\hat{P}_{+}}}\leq\ln d^{K}_{\mu_{P}}+\ln\sqrt{71}.

Theorem 4 of [16] implies

d_{B}^{\rm\ln}(\textsf{Dgm}(d^{K}_{\mu_{P}}),\textsf{Dgm}({f^{\textsc{k}}_{\hat{P}_{+}}}))\leq\ln\sqrt{71}.

In addition, by the Persistent Nerve Lemma ([22], Theorem 6 of [62], an extension of the Nerve Theorem [46]), the sublevel set filtration of fP^+k{f^{\textsc{k}}_{\hat{P}_{+}}}, whose sublevel sets are unions of balls of increasing radius, has the same persistent homology as the nerve filtration of these balls (which, by definition, is the Čech filtration). Finally, there exists a multiplicative interleaving between weighted Rips and Čech complexes (Proposition 31 of [16]), CαRαC2α.C_{\alpha}\subseteq R_{\alpha}\subseteq C_{2\alpha}. We then obtain the following bound on persistence diagrams,

d_{B}^{\rm\ln}(\textsf{Dgm}({f^{\textsc{k}}_{\hat{P}_{+}}}),\textsf{Dgm}(\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\}))\leq\ln(2).

We use the triangle inequality to obtain the final result:

d_{B}^{\rm\ln}(\textsf{Dgm}(d^{K}_{\mu_{P}}),\textsf{Dgm}(\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\}))\leq\ln(2\sqrt{71}).

Based on Corollary 4.3, we have an algorithm that approximates the persistent homology of the sublevel set filtration of dμKd^{K}_{\mu} by constructing the weighted Rips filtration corresponding to the kernel-based power distance and computing its persistent homology. For memory efficient computation, sparse (weighted) Rips filtrations could be adapted by considering simplices on subsamples at each scale [63, 16], although some restrictions on the space apply.

4.3 Distance to the Support of a Measure vs. Kernel Distance

Suppose μ\mu is a uniform measure on a compact set SS in d{{\mathbb{R}}}^{d}. We now compare the kernel distance dμKd^{K}_{\mu} with the distance function fSf_{S} to the support SS of μ\mu. We show how dμKd^{K}_{\mu} approximates fSf_{S}, and thus allows one to infer geometric properties of SS from samples from μ\mu.

A generalized gradient and its corresponding flow associated with a distance function are described in [14] and later adapted for distance-like functions in [15]. Let fS:df_{S}:{{\mathbb{R}}}^{d}\to{{\mathbb{R}}} be a distance function associated with a compact set SS of d{{\mathbb{R}}}^{d}. It is not differentiable on the medial axis of SS. A generalized gradient function S:dd{\nabla_{S}{}}:{{\mathbb{R}}}^{d}\to{{\mathbb{R}}}^{d} coincides with the usual gradient of fSf_{S} where fSf_{S} is differentiable, is defined everywhere, and can be integrated into a continuous flow Φt:dd\Phi^{t}:{{\mathbb{R}}}^{d}\to{{\mathbb{R}}}^{d} that points away from SS. Let γ\gamma be an integral (flow) line. The following lemma shows that, when close enough to SS, dμK(x)d^{K}_{\mu}(x) is strictly increasing along any γ\gamma. The proof is quite technical and is thus deferred to Appendix D.2.

Lemma 4.5.

Given any flow line γ\gamma associated with the generalized gradient function S{\nabla_{S}{}}, dμK(x)d^{K}_{\mu}(x) is strictly monotonically increasing along γ\gamma for xx sufficiently far away from the medial axis of SS, for σR6ΔG\sigma\leq\frac{R}{6\Delta_{G}} and fS(x)(0.014R,2σ)f_{S}(x)\in(0.014R,2\sigma). Here B(σ/2)B(\sigma/2) denotes a ball of radius σ/2\sigma/2, G:=Vol(B(σ/2))Vol(S)G:=\frac{{\textsf{Vol}}(B(\sigma/2))}{{\textsf{Vol}}(S)}, ΔG:=12+3ln(4/G)\Delta_{G}:=\sqrt{12+3\ln(4/G)} and suppose R:=min(𝗋𝖾𝖺𝖼𝗁(S),𝗋𝖾𝖺𝖼𝗁(dS))>0R:=\min({\sf reach}(S),{\sf reach}({{\mathbb{R}}}^{d}\setminus S))>0.

The strict monotonicity of dμKd^{K}_{\mu} along the flow line under the conditions in Lemma 4.5 makes it possible to define a deformation retract of the sublevel sets of dμKd^{K}_{\mu} onto sublevel sets of fSf_{S}. Such a deformation retract defines a special case of homotopy equivalence between the sublevel sets of dμKd^{K}_{\mu} and sublevel sets of fSf_{S}. Consider a sufficiently large point set PdP\subset\mathbb{R}^{d} sampled from μ\mu, and its induced measure μP{\mu_{P}}. We can then also invoke Theorem 4.2 and a sampling bound (see Section 6 and Lemma B.3) to show homotopy equivalence between the sublevel sets of fSf_{S} and dμPKd^{K}_{\mu_{P}}.

Note that Lemma 4.5 uses somewhat restrictive conditions related to the reach of a compact set, however we believe such conditions could be further relaxed to be associated with the concept of μ\mu-reach as described in [14].

5 Stability Properties for the Kernel Distance to a Measure

Lemma 5.1 (K4).

For two measures μ\mu and ν\nu on d{{\mathbb{R}}}^{d} we have dμKdνKDK(μ,ν)\|d^{K}_{\mu}-d^{K}_{\nu}\|_{\infty}\leq D_{K}(\mu,\nu).

Proof 5.2.

Since DK(,)D_{K}(\cdot,\cdot) is a metric, then by triangle inequality, for any xdx\in{{\mathbb{R}}}^{d} we have DK(μ,x)DK(μ,ν)+DK(ν,x)D_{K}(\mu,x)\leq D_{K}(\mu,\nu)+D_{K}(\nu,x) and DK(ν,x)DK(ν,μ)+DK(μ,x)D_{K}(\nu,x)\leq D_{K}(\nu,\mu)+D_{K}(\mu,x). Therefore for any xdx\in{{\mathbb{R}}}^{d} we have |DK(μ,x)DK(ν,x)|DK(μ,ν)|D_{K}(\mu,x)-D_{K}(\nu,x)|\leq D_{K}(\mu,\nu), proving the claim.

Both the Wasserstein and kernel distance are integral probability metrics [66], so (M4) and (K4) are both interesting, but not easily comparable. We now attempt to reconcile this.

5.1 Comparing DKD_{K} to W2W_{2}

Lemma 5.3.

There is no Lipschitz constant γ\gamma such that for any two probability measures μ\mu and ν\nu we have W2(μ,ν)γDK(μ,ν)W_{2}(\mu,\nu)\leq\gamma D_{K}(\mu,\nu).

Proof 5.4.

Consider two measures μ\mu and ν\nu which are almost identical: the only difference is some mass of measure τ\tau is moved from its location in μ\mu a distance nn in ν\nu. The Wasserstein distance requires a transportation plan that moves this τ\tau mass in ν\nu back to where it was in μ\mu with cost τΩ(n)\tau\cdot\Omega(n) in W2(μ,ν)W_{2}(\mu,\nu). On the other hand, DK(μ,ν)=κ(μ,μ)+κ(ν,ν)2κ(μ,ν)σ2+σ220=2σD_{K}(\mu,\nu)=\sqrt{\kappa(\mu,\mu)+\kappa(\nu,\nu)-2\kappa(\mu,\nu)}\leq\sqrt{\sigma^{2}+\sigma^{2}-2\cdot 0}=\sqrt{2}\sigma is bounded.

We conjecture for any two probability measures μ\mu and ν\nu that DK(μ,ν)W2(μ,ν)D_{K}(\mu,\nu)\leq W_{2}(\mu,\nu). This would show that dμKd^{K}_{\mu} is at least as stable as dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}} since a bound on W2(μ,ν)W_{2}(\mu,\nu) would also bound DK(μ,ν)D_{K}(\mu,\nu), but not vice versa. Alternatively, this can be viewed as saying that dμKd^{K}_{\mu} is less discriminative than dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}}; we view this as a positive in this setting, as it is mainly less discriminative towards outliers (far away points). Here we only show this property for special cases and in the limit as σ\sigma\to\infty. To simplify notation, all integrals are assumed to be over the full domain d{{\mathbb{R}}}^{d}.

Two Dirac masses.

We first consider a special case when μ\mu is a Dirac mass at a point pp and ν\nu is a Dirac mass at a point qq. That is they are both single points. We can then write DK(μ,ν)=DK(p,q)D_{K}(\mu,\nu)=D_{K}(p,q). Figure 2 illustrates the result of this lemma.

Lemma 5.5.

For any points p,qdp,q\in{{\mathbb{R}}}^{d} it always holds that pqDK(p,q)\|p-q\|\geq D_{K}(p,q). When pq3σ\|p-q\|\leq\sqrt{3}\sigma then DK(p,q)pq/2D_{K}(p,q)\geq\|p-q\|/2.

Proof 5.6.

First expand DK(p,q)2D_{K}(p,q)^{2} as

DK(p,q)2=2σ22K(p,q)=2σ2(1exp(pq22σ2)).D_{K}(p,q)^{2}=2\sigma^{2}-2K(p,q)=2\sigma^{2}\left(1-\exp\left(\frac{-\|p-q\|^{2}}{2\sigma^{2}}\right)\right).

Now using that 1tet1t+(1/2)t21-t\leq e^{-t}\leq 1-t+(1/2)t^{2} for t0t\geq 0

DK(p,q)2=2σ2(1exp(pq22σ2))2σ2(pq22σ2)=pq2D_{K}(p,q)^{2}=2\sigma^{2}\left(1-\exp\left(\frac{-\|p-q\|^{2}}{2\sigma^{2}}\right)\right)\leq 2\sigma^{2}\left(\frac{\|p-q\|^{2}}{2\sigma^{2}}\right)=\|p-q\|^{2}

and

DK(p,q)2\displaystyle D_{K}(p,q)^{2} =2σ2(1exp(pq22σ2))\displaystyle=2\sigma^{2}\left(1-\exp\left(\frac{-\|p-q\|^{2}}{2\sigma^{2}}\right)\right)
2σ2(pq22σ212pq44σ4)\displaystyle\geq 2\sigma^{2}\left(\frac{\|p-q\|^{2}}{2\sigma^{2}}-\frac{1}{2}\frac{\|p-q\|^{4}}{4\sigma^{4}}\right)
=pq24(4pq2σ2)\displaystyle=\frac{\|p-q\|^{2}}{4}\left(4-\frac{\|p-q\|^{2}}{\sigma^{2}}\right)
pq2/4,\displaystyle\geq\|p-q\|^{2}/4,

where the last inequality holds when pq3σ\|p-q\|\leq\sqrt{3}\sigma.

Figure 2: Showing that x0/2DK(x,0)x0\|x-0\|/2\leq D_{K}(x,0)\leq\|x-0\|, where the first inequality holds for x3σ\|x\|\leq\sqrt{3}\sigma. The kernel distance DK(x,0)D_{K}(x,0) is shown for σ={1/2,1,2}\sigma=\{1/2,1,2\} in purple, blue, and red, respectively.
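A quick numeric check of Lemma 5.5 and Figure 2 (a sketch assuming NumPy): for two Dirac masses DK(p,q)2=2σ2(1exp(pq2/2σ2))D_{K}(p,q)^{2}=2\sigma^{2}(1-\exp(-\|p-q\|^{2}/2\sigma^{2})), and the ratio DK(p,q)/pqD_{K}(p,q)/\|p-q\| stays within [1/2,1][1/2,1] on the range pq3σ\|p-q\|\leq\sqrt{3}\sigma.

import numpy as np

def dk_two_points(dist, sigma):
    # D_K(p, q) for two Dirac masses at Euclidean distance `dist`
    return np.sqrt(2 * sigma ** 2 * (1 - np.exp(-dist ** 2 / (2 * sigma ** 2))))

sigma = 1.0
dists = np.linspace(1e-3, np.sqrt(3) * sigma, 200)
ratios = dk_two_points(dists, sigma) / dists
print(ratios.min(), ratios.max())   # observed to lie within [0.5, 1] on this range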

One Dirac mass.

Consider the case where one measure ν\nu is a Dirac mass at point xdx\in{{\mathbb{R}}}^{d}.

Lemma 5.7.

Consider two probability measures μ\mu and ν\nu on d{{\mathbb{R}}}^{d} where ν\nu is represented by a Dirac mass at a point xdx\in{{\mathbb{R}}}^{d}. Then dμK(x)=DK(μ,ν)W2(μ,ν)d^{K}_{\mu}(x)=D_{K}(\mu,\nu)\leq W_{2}(\mu,\nu) for any σ>0\sigma>0, where the equality only holds when μ\mu is also a Dirac mass at xx.

Proof 5.8.

Since both W2(μ,ν)W_{2}(\mu,\nu) and DK(μ,ν)D_{K}(\mu,\nu) are metrics and hence non-negative, we can instead consider their squared versions: (W2(μ,ν))2=ppx2μ(p)dp(W_{2}(\mu,\nu))^{2}=\int_{p}\|p-x\|^{2}\mu(p){\rm d}{p} and

(DK(μ,ν))2\displaystyle(D_{K}(\mu,\nu))^{2} =K(x,x)+(p,q)K(p,q)dmμ,μ(p,q)2pK(p,x)dμ(p)\displaystyle=K(x,x)+\int_{(p,q)}K(p,q){\rm d}{\textsf{m}_{\mu,\mu}(p,q)}-2\int_{p}K(p,x){\rm d}{\mu(p)}
=σ2(1+(p,q)exp(pq22σ2)dmμ,μ(p,q)2pexp(px22σ2)dμ(p)).\displaystyle=\sigma^{2}\left(1+\int_{(p,q)}\exp\left(-\frac{\|p-q\|^{2}}{2\sigma^{2}}\right){\rm d}{\textsf{m}_{\mu,\mu}(p,q)}-2\int_{p}\exp\left(-\frac{\|p-x\|^{2}}{2\sigma^{2}}\right){\rm d}{\mu(p)}\right).

Now use the bound 1tet11-t\leq e^{-t}\leq 1 for t0t\geq 0 to approximate

(DK(μ,ν))2\displaystyle(D_{K}(\mu,\nu))^{2} σ2(1+(p,q)(1)dmμ,μ(p,q)2p(1px22σ2)dμ(p))\displaystyle\leq\sigma^{2}\left(1+\int_{(p,q)}(1)\,{\rm d}{\textsf{m}_{\mu,\mu}(p,q)}-2\int_{p}\left(1-\frac{\|p-x\|^{2}}{2\sigma^{2}}\right){\rm d}{\mu(p)}\right)
=ppx2dμ(p)=(W2(μ,ν))2.\displaystyle=\int_{p}\|p-x\|^{2}{\rm d}{\mu(p)}=(W_{2}(\mu,\nu))^{2}.

The inequality becomes an equality only when px=0\|p-x\|=0 for all pp in the support of μ\mu, and since they are both metrics, this is the only location where they are both 00.

General case.

Next we show that if ν\nu is not a unit Dirac, then this inequality holds in the limit as σ\sigma goes to infinity. The technical work is making precise how σ2K(p,x)xp2/2\sigma^{2}-K(p,x)\leq\|x-p\|^{2}/2 and how this compares to bounds on DK(μ,ν)D_{K}(\mu,\nu) and W2(μ,ν)W_{2}(\mu,\nu).

For simpler exposition, we assume μ\mu is a probability measure, that is pμ(p)dp=1\int_{p}\mu(p){\rm d}{p}=1; otherwise we can normalize μ\mu at the appropriate locations, and all of the results go through.

Lemma 5.9.

For any p,qdp,q\in{{\mathbb{R}}}^{d} we have K(p,q)=σ2pq22+i=2(pq2)i2iσ2i2i!.\displaystyle K(p,q)=\sigma^{2}-\frac{\|p-q\|^{2}}{2}+\sum_{i=2}^{\infty}\frac{(-\|p-q\|^{2})^{i}}{2^{i}\sigma^{2i-2}i!}.

Proof 5.10.

We use the Taylor expansion of ex=i=0xi/i!=1+x+i=2xi/i!e^{x}=\sum_{i=0}^{\infty}x^{i}/i!=1+x+\sum_{i=2}^{\infty}x^{i}/i!. Then it is easy to see

K(p,q)=σ2exp(pq22σ2)=σ2pq22+i=2(pq2)i2iσ2i2i!.K(p,q)=\sigma^{2}\exp\left(-\frac{\|p-q\|^{2}}{2\sigma^{2}}\right)=\sigma^{2}-\frac{\|p-q\|^{2}}{2}+\sum_{i=2}^{\infty}\frac{(-\|p-q\|^{2})^{i}}{2^{i}\sigma^{2i-2}i!}.

This lemma illustrates why the choice of coefficient of σ2\sigma^{2} is convenient. Since then σ2K(p,q)\sigma^{2}-K(p,q) acts like 12pq2\frac{1}{2}\|p-q\|^{2}, and becomes closer as σ\sigma increases. Define μ¯=ppdμ(p)\bar{\mu}=\int_{p}p\cdot{\rm d}{\mu}(p) to represent the mean point of measure μ\mu; Var(μ)=(pp2dμ(p))μ¯2\textsf{Var}(\mu)=(\int_{p}\|p\|^{2}{\rm d}{\mu(p)})-\|\bar{\mu}\|^{2} to represent the variance of the measure μ\mu; and Δμ,ν=(p,q)i=2(pq2)i2iσ2i2i!dmμ,ν(p,q)\Delta_{\mu,\nu}=\int_{(p,q)}\sum_{i=2}^{\infty}\frac{(-\|p-q\|^{2})^{i}}{2^{i}\sigma^{2i-2}i!}{\rm d}{\textsf{m}_{\mu,\nu}(p,q)}.

Lemma 5.11.

For any xdx\in{{\mathbb{R}}}^{d} we have ppx2dμ(p)=μ¯x2+Var(μ).\displaystyle\int_{p}\|p-x\|^{2}{\rm d}{\mu(p)}=\|\bar{\mu}-x\|^{2}+\textsf{Var}(\mu).

Proof 5.12.
ppx2dμ(p)\displaystyle\int_{p}\|p-x\|^{2}{\rm d}{\mu(p)} =p(p2+x22p,x)dμ(p)\displaystyle=\int_{p}\left(\|p\|^{2}+\|x\|^{2}-2\langle p,x\rangle\right){\rm d}{\mu(p)}
=pp2dμ(p)+x22pp,xdμ(p)\displaystyle=\int_{p}\|p\|^{2}{\rm d}{\mu(p)}+\|x\|^{2}-2\int_{p}\langle p,x\rangle{\rm d}{\mu(p)}
=(pp2dμ(p)μ¯2)+μ¯2+x22μ¯,x\displaystyle=\left(\int_{p}\|p\|^{2}{\rm d}{\mu(p)}-\|\bar{\mu}\|^{2}\right)+\|\bar{\mu}\|^{2}+\|x\|^{2}-2\langle\bar{\mu},x\rangle
=Var(μ)+μ¯x2.\displaystyle=\textsf{Var}(\mu)+\|\bar{\mu}-x\|^{2}.
Lemma 5.13.

For probability measures μ\mu and ν\nu on d{{\mathbb{R}}}^{d}, κ(μ,ν)=σ212(μ¯ν¯2+Var(μ)+Var(ν))+Δμ,ν.\kappa(\mu,\nu)=\sigma^{2}-\frac{1}{2}\left(\|\bar{\mu}-\bar{\nu}\|^{2}+\textsf{Var}(\mu)+\textsf{Var}(\nu)\right)+\Delta_{\mu,\nu}.

Proof 5.14.

We use Lemma 5.9 to expand

κ(μ,ν)\displaystyle\kappa(\mu,\nu) =(p,q)K(p,q)dmμ,ν(p,q)\displaystyle=\int_{(p,q)}K(p,q){\rm d}{\textsf{m}_{\mu,\nu}(p,q)}
=σ2(p,q)(pq22i=2(pq2)i2iσ2i2i!)dmμ,ν(p,q).\displaystyle=\sigma^{2}-\int_{(p,q)}\left(\frac{\|p-q\|^{2}}{2}-\sum_{i=2}^{\infty}\frac{(-\|p-q\|^{2})^{i}}{2^{i}\sigma^{2i-2}i!}\right){\rm d}{\textsf{m}_{\mu,\nu}(p,q)}.

After shifting the Δμ,ν\Delta_{\mu,\nu} term outside, we can use Lemma 5.11 (twice) to rewrite

p(qpq2dν(q))dμ(p)\displaystyle\int_{p}\left(\int_{q}\|p-q\|^{2}{\rm d}{\nu(q)}\right){\rm d}{\mu(p)} =p(pν¯2+Var(ν))dμ(p)\displaystyle=\int_{p}\left(\|p-\bar{\nu}\|^{2}+\textsf{Var}(\nu)\right){\rm d}{\mu(p)}
=μ¯ν¯2+Var(μ)+Var(ν).\displaystyle=\|\bar{\mu}-\bar{\nu}\|^{2}+\textsf{Var}(\mu)+\textsf{Var}(\nu).
Theorem 5.15.

For any two probability measures μ\mu and ν\nu defined on d{{\mathbb{R}}}^{d} limσDK(μ,ν)=μ¯ν¯.\displaystyle\lim_{\sigma\to\infty}D_{K}(\mu,\nu)=\|\bar{\mu}-\bar{\nu}\|.

Proof 5.16.

First expand

(DK(μ,ν))2=\displaystyle(D_{K}(\mu,\nu))^{2}= κ(μ,μ)+κ(ν,ν)2κ(μ,ν)\displaystyle\;\kappa(\mu,\mu)+\kappa(\nu,\nu)-2\kappa(\mu,\nu)
=\displaystyle= (σ212μ¯μ¯2Var(μ)+Δμ,μ)+(σ212ν¯ν¯2Var(ν)+Δν,ν)\displaystyle\left(\sigma^{2}-\frac{1}{2}\|\bar{\mu}-\bar{\mu}\|^{2}-\textsf{Var}(\mu)+\Delta_{\mu,\mu}\right)+\left(\sigma^{2}-\frac{1}{2}\|\bar{\nu}-\bar{\nu}\|^{2}-\textsf{Var}(\nu)+\Delta_{\nu,\nu}\right)
2(σ212μ¯ν¯212Var(μ)12Var(ν)+Δμ,ν)\displaystyle-2\left(\sigma^{2}-\frac{1}{2}\|\bar{\mu}-\bar{\nu}\|^{2}-\frac{1}{2}\textsf{Var}(\mu)-\frac{1}{2}\textsf{Var}(\nu)+\Delta_{\mu,\nu}\right)
=\displaystyle= μ¯ν¯2+Δμ,μ+Δν,ν2Δμ,ν.\displaystyle\;\|\bar{\mu}-\bar{\nu}\|^{2}+\Delta_{\mu,\mu}+\Delta_{\nu,\nu}-2\Delta_{\mu,\nu}.

Finally, observe that all terms of Δμ,ν\Delta_{\mu,\nu} are divided by σ2\sigma^{2} or larger powers of σ\sigma. Thus as σ\sigma increases, Δμ,ν\Delta_{\mu,\nu} approaches 0 and (DK(μ,ν))2(D_{K}(\mu,\nu))^{2} approaches μ¯ν¯2\|\bar{\mu}-\bar{\nu}\|^{2}, completing the proof.

Now we can relate DK(μ,ν)D_{K}(\mu,\nu) to W2(μ,ν)W_{2}(\mu,\nu) through μ¯ν¯\|\bar{\mu}-\bar{\nu}\|. The next result is a known lower bound for the earth mover's distance ([23], Theorem 7). We reprove it in Appendix E for completeness.

Lemma 5.17.

For any probability measures μ\mu and ν\nu defined on d{{\mathbb{R}}}^{d} we have μ¯ν¯W2(μ,ν).\|\bar{\mu}-\bar{\nu}\|\leq W_{2}(\mu,\nu).

We can now combine these results to achieve the following theorem.

Theorem 5.18.

For any two probability measures μ\mu and ν\nu defined on d{{\mathbb{R}}}^{d} limσDK(μ,ν)=μ¯ν¯\displaystyle\lim_{\sigma\to\infty}D_{K}(\mu,\nu)=\|\bar{\mu}-\bar{\nu}\| and μ¯ν¯W2(μ,ν).\|\bar{\mu}-\bar{\nu}\|\leq W_{2}(\mu,\nu). Thus limσDK(μ,ν)W2(μ,ν)\lim_{\sigma\to\infty}D_{K}(\mu,\nu)\leq W_{2}(\mu,\nu).

5.2 Kernel Distance Stability with Respect to σ\sigma

We now explore the Lipschitz properties of dμKd^{K}_{\mu} with respect to the noise parameter σ\sigma. We argue that any distance function that is robust to noise needs some parameter to address how many outliers to ignore or how far away a point must be to count as an outlier. For instance, this parameter in dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}} is m0m_{0}, which controls the amount of the measure μ\mu to be used in the distance.

Here we show that dμKd^{K}_{\mu} has a particularly nice property: it is Lipschitz with respect to the choice of σ\sigma for any fixed xx. The larger σ\sigma is, the more effect outliers have; the smaller σ\sigma is, the less the data is smoothed and thus the closer the noise needs to be to the underlying object to affect the inference.

Lemma 5.19.

Let h(σ,z)=exp(z2/2σ2)h(\sigma,z)=\exp(-z^{2}/2\sigma^{2}). We can bound h(σ,z)1h(\sigma,z)\leq 1, ddσh(σ,z)(2/e)/σ\frac{{\rm d}{}}{{\rm d}{\sigma}}h(\sigma,z)\leq(2/e)/\sigma and d2dσ2h(σ,z)(18/e3)/σ2\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}h(\sigma,z)\leq(18/e^{3})/\sigma^{2} over any choice of z>0z>0.

Proof 5.20.

The first bound follows from y=z2/2σ20y=-z^{2}/2\sigma^{2}\leq 0 and exp(y)1\exp(y)\leq 1 for y0y\leq 0.

Next we define

w1(σ,z)\displaystyle w_{1}(\sigma,z) =ddσh(σ,z)=z2σ3exp(z22σ2), and\displaystyle=\frac{{\rm d}{}}{{\rm d}{\sigma}}h(\sigma,z)=\frac{z^{2}}{\sigma^{3}}\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right),\text{ and}
w2(σ,z)\displaystyle w_{2}(\sigma,z) =d2dσ2h(σ,z)=(z4σ63z2σ4)exp(z22σ2).\displaystyle=\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}h(\sigma,z)=\left(\frac{z^{4}}{\sigma^{6}}-\frac{3z^{2}}{\sigma^{4}}\right)\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right).

Now to solve the first part, we differentiate w1w_{1} with respect to zz to find its maximum over all choices of zz.

ddzw1(σ,z)=(2zσ3z3σ5)exp(z22σ2)\frac{{\rm d}{}}{{\rm d}{z}}w_{1}(\sigma,z)=\left(\frac{2z}{\sigma^{3}}-\frac{z^{3}}{\sigma^{5}}\right)\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right)

Now (d/dz)w1(σ,z)=0({\rm d}{}/{\rm d}{z})w_{1}(\sigma,z)=0 at z=0z=0, at z=2σz=\sqrt{2}\sigma, and as zz approaches \infty. Thus the maximum must occur at one of these values. Both w1(σ,0)=0w_{1}(\sigma,0)=0 and limzw1(σ,z)=0\lim_{z\to\infty}w_{1}(\sigma,z)=0, while w1(σ,2σ)=(2/e)/σw_{1}(\sigma,\sqrt{2}\sigma)=(2/e)/\sigma, proving the first part.

To solve the second part, we perform the same approach on w2w_{2}.

ddzw2(σ,z)\displaystyle\frac{{\rm d}{}}{{\rm d}{z}}w_{2}(\sigma,z) =(z5σ8+3z3σ6+4z3σ66zσ4)exp(z22σ2)\displaystyle=\left(\frac{-z^{5}}{\sigma^{8}}+\frac{3z^{3}}{\sigma^{6}}+\frac{4z^{3}}{\sigma^{6}}-\frac{6z}{\sigma^{4}}\right)\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right)
=zσ4(z4σ4+7z2σ26)exp(z22σ2)\displaystyle=\frac{z}{\sigma^{4}}\left(\frac{-z^{4}}{\sigma^{4}}+\frac{7z^{2}}{\sigma^{2}}-6\right)\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right)

Thus (d/dz)w2(σ,z)=0({\rm d}{}/{\rm d}{z})w_{2}(\sigma,z)=0 at z={0,σ,6σ}z=\{0,\sigma,\sqrt{6}\sigma\} and as zz goes to \infty for z[0,)z\in[0,\infty). Both w2(σ,0)=0w_{2}(\sigma,0)=0 and limzw2(σ,z)=0\lim_{z\to\infty}w_{2}(\sigma,z)=0. The minimum occurs at w2(σ,z=σ)=(2/e)/σ2w_{2}(\sigma,z=\sigma)=(-2/\sqrt{e})/\sigma^{2}. The maximum occurs at w2(σ,z=6σ)=(18/e3)/σ2w_{2}(\sigma,z=\sqrt{6}\sigma)=(18/e^{3})/\sigma^{2}.

Theorem 5.21.

For any measure μ\mu defined on d\mathbb{R}^{d} and xdx\in\mathbb{R}^{d}, dμK(x)d^{K}_{\mu}(x) is \ell-Lipschitz with respect to σ\sigma, for =18/e3+8/e+2<6\ell=18/e^{3}+8/e+2<6.

Proof 5.22.

Recall that mμ,ν\textsf{m}_{\mu,\nu} is the product measure of any μ\mu and ν\nu, and define Mμ,ν\textsf{M}_{\mu,\nu} as Mμ,ν(p,q)=mμ,μ(p,q)+mν,ν(p,q)2mμ,ν(p,q)\textsf{M}_{\mu,\nu}(p,q)=\textsf{m}_{\mu,\mu}(p,q)+\textsf{m}_{\nu,\nu}(p,q)-2\textsf{m}_{\mu,\nu}(p,q). It is now useful to define a function fx(σ)f_{x}(\sigma) as

fx(σ)=(p,q)exp(pq22σ2)dMμ,δx(p,q).f_{x}(\sigma)=\int_{(p,q)}\exp\left(\frac{-\|p-q\|^{2}}{2\sigma^{2}}\right){\rm d}{\textsf{M}_{\mu,\delta_{x}}(p,q)}.

So dμK(x)=σfx(σ)d^{K}_{\mu}(x)=\sigma\sqrt{f_{x}(\sigma)} and we can write another function as

F(σ)=(dμK(x))2σ2=σ2fx(σ)σ2.F(\sigma)=(d^{K}_{\mu}(x))^{2}-\ell\sigma^{2}=\sigma^{2}f_{x}(\sigma)-\ell\sigma^{2}.

Now to prove dμK(x)d^{K}_{\mu}(x) is \ell-Lipschitz, we can show that (dμK)2(d^{K}_{\mu})^{2} is \ell-semiconcave with respect to σ\sigma, and apply Lemma 2.3. This boils down to showing the second derivative of F(σ)F(\sigma) is always non-positive.

ddσF(σ)=2σfx(σ)+σ2ddσfx(σ)2σ.\frac{{\rm d}{}}{{\rm d}{\sigma}}F(\sigma)=2\sigma f_{x}(\sigma)+\sigma^{2}\frac{{\rm d}{}}{{\rm d}{\sigma}}f_{x}(\sigma)-2\sigma\ell.
d2dσ2F(σ)=σ2d2dσ2fx(σ)+4σddσfx(σ)+2fx(σ)2.\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}F(\sigma)=\sigma^{2}\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}f_{x}(\sigma)+4\sigma\frac{{\rm d}{}}{{\rm d}{\sigma}}f_{x}(\sigma)+2f_{x}(\sigma)-2\ell.

First we note that since (p,q)cdmμ,ν(p,q)=c\int_{(p,q)}c\cdot{\rm d}{}\textsf{m}_{\mu,\nu}(p,q)=c for any product distribution mμ,ν\textsf{m}_{\mu,\nu} of two distributions μ\mu and ν\nu, including when μ\mu or ν\nu is a Dirac mass, then

(p,q)cdMμ,δx(p,q)=(p,q)cd[mμ,μ+mδx,δx2mμ,δx](p,q)2c.\int_{(p,q)}c\cdot{\rm d}{}\textsf{M}_{\mu,\delta_{x}}(p,q)=\int_{(p,q)}c\cdot{\rm d}{}\Big{[}\textsf{m}_{\mu,\mu}+\textsf{m}_{\delta_{x},\delta_{x}}-2\textsf{m}_{\mu,\delta_{x}}\Big{]}(p,q)\leq 2c.

Thus since exp(pq22σ2)\exp\left(\frac{-\|p-q\|^{2}}{2\sigma^{2}}\right) is in [0,1][0,1] for all choices of pp, qq, and σ>0\sigma>0, then 0fx(σ)20\leq f_{x}(\sigma)\leq 2 and 2fx(σ)42f_{x}(\sigma)\leq 4. This bounds the third term in d2dσ2F(σ)\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}F(\sigma), we now need to use a similar approach to bound the first and second terms.

Let h(σ,z)=exp(z22σ2)h(\sigma,z)=\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right), so we can apply Lemma 5.19.

4\sigma\frac{{\rm d}{}}{{\rm d}{\sigma}}f_{x}(\sigma)=4\sigma\int_{(p,q)}\left(\frac{{\rm d}{}}{{\rm d}{\sigma}}h(\sigma,\|p-q\|)\right){\rm d}{}\textsf{M}_{\mu,\delta_{x}}(p,q)\leq 4\sigma\cdot\left((2/e)/\sigma\right)\cdot 2=16/e
\sigma^{2}\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}f_{x}(\sigma)=\sigma^{2}\int_{(p,q)}\left(\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}h(\sigma,\|p-q\|)\right){\rm d}{}\textsf{M}_{\mu,\delta_{x}}(p,q)\leq\sigma^{2}\cdot\left((18/e^{3})/\sigma^{2}\right)\cdot 2=36/e^{3}

Then we complete the proof using the upper bound on each term of d2dσ2F(σ)\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}F(\sigma):

d2dσ2F(σ)\displaystyle\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}F(\sigma) =σ2d2dσ2fx(σ)+4σddσfx(σ)+2fx(σ)2\displaystyle=\sigma^{2}\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}f_{x}(\sigma)+4\sigma\frac{{\rm d}{}}{{\rm d}{\sigma}}f_{x}(\sigma)+2f_{x}(\sigma)-2\ell
36/e3+16/e+42(18/e3+8/e+2)=0.\displaystyle\leq 36/e^{3}+16/e+4-2(18/e^{3}+8/e+2)=0.

Lipschitz in m0m_{0} for dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}}.

We show that there is no Lipschitz property for dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}} with respect to m0m_{0} that is independent of the measure μ\mu. Consider a measure μP\mu_{P} for a point set PP\subset\mathbb{R} consisting of two points, at a=0a=0 and at b=Δb=\Delta. Now consider dμP,m0ccm(a)d^{\textsf{{ccm}}}_{{\mu_{P}},m_{0}}(a). When m01/2m_{0}\leq 1/2, then dμP,m0ccm(a)=0d^{\textsf{{ccm}}}_{{\mu_{P}},m_{0}}(a)=0 is constant. But for m0=1/2+αm_{0}=1/2+\alpha with α>0\alpha>0, we have dμP,m0ccm(a)=αΔ/(1/2+α)d^{\textsf{{ccm}}}_{{\mu_{P}},m_{0}}(a)=\alpha\Delta/(1/2+\alpha) and ddm0dμP,m0ccm(a)=ddαdμP,12+αccm(a)=(1/2+2α)Δ(1/2+α)2\frac{{\rm d}{}}{{\rm d}{m_{0}}}d^{\textsf{{ccm}}}_{{\mu_{P}},m_{0}}(a)=\frac{{\rm d}{}}{{\rm d}{\alpha}}d^{\textsf{{ccm}}}_{{\mu_{P}},\frac{1}{2}+\alpha}(a)=\frac{(1/2+2\alpha)\Delta}{(1/2+\alpha)^{2}}, which is maximized as α\alpha approaches 0, with supremum 2Δ2\Delta. If n1n-1 points are at bb and 11 point is at aa, then the supremum of this derivative with respect to m0m_{0} is nΔn\Delta. Hence for a measure μP\mu_{P} defined by a point set, the supremum of ddm0dμP,m0ccm(a)\frac{{\rm d}{}}{{\rm d}{m_{0}}}d^{\textsf{{ccm}}}_{{\mu_{P}},m_{0}}(a), and hence a lower bound on the Lipschitz constant, is nΔPn\Delta_{P} where ΔP=maxp,pPpp\Delta_{P}=\max_{p,p^{\prime}\in P}\|p-p^{\prime}\|.

6 Algorithmic and Approximation Observations

Kernel coresets.

The kernel distance is robust under random samples [48]. Specifically, if QQ is a point set randomly chosen from μ\mu of size O((1/ε2)(d+log(1/δ)))O((1/\varepsilon^{2})(d+\log(1/\delta))), then kdeμkdeQε\|\textsc{kde}_{\mu}-\textsc{kde}_{Q}\|_{\infty}\leq\varepsilon with probability at least 1δ1-\delta. We call such a subset QQ an ε\varepsilon-kernel sample of (μ,K)(\mu,K). Furthermore, it is also possible to construct ε\varepsilon-kernel samples QQ of even smaller size |Q|=O(((1/ε)log(1/εδ))2d/(d+2))|Q|=O(((1/\varepsilon)\sqrt{\log(1/\varepsilon\delta)})^{2d/(d+2)}) [57]; in particular, in 2{{\mathbb{R}}}^{2} the required size is |Q|=O((1/ε)log(1/εδ))|Q|=O((1/\varepsilon)\sqrt{\log(1/\varepsilon\delta)}). Exploiting the above constructions, recent work [71] builds a data structure to allow for efficient approximate evaluations of kdeP\textsc{kde}_{P} on datasets of size |P|=100,000,000|P|=100{,}000{,}000.

These constructions of QQ also immediately imply that (dμK)2(dQK)24ε\|(d^{K}_{\mu})^{2}-(d^{K}_{Q})^{2}\|_{\infty}\leq 4\varepsilon since (dμK(x))2=κ(μ,μ)+κ(x,x)2kdeμ(x)(d^{K}_{\mu}(x))^{2}=\kappa(\mu,\mu)+\kappa(x,x)-2\textsc{kde}_{\mu}(x), and both the first and third term incur at most 2ε2\varepsilon error in converting to κ(Q,Q)\kappa(Q,Q) and 2kdeQ(x)2\textsc{kde}_{Q}(x), respectively (see Lemma B.1). Thus, an (ε2/4)(\varepsilon^{2}/4)-kernel sample QQ of (μ,K)(\mu,K) implies that dμKdQKε\|d^{K}_{\mu}-d^{K}_{Q}\|_{\infty}\leq\varepsilon (see Lemma B.3).

This implies algorithms for geometric inference on enormous noisy data sets. Moreover, if we assume an input point set QQ is drawn iid from some underlying, but unknown distribution μ\mu, we can bound approximations with respect to μ\mu.
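
As a concrete illustration of the random-sampling route, the sketch below subsamples QQ from a point set and measures the resulting kde error empirically on a set of evaluation points. It assumes the Gaussian kernel with the K(x,x)=\sigma^{2} normalization used in this paper; the sizes and evaluation points are illustrative choices only.

```python
# A minimal sketch of the random-sampling guarantee: subsample Q from P and
# empirically measure ||kde_P - kde_Q||_inf on a set of evaluation points.
# Assumes the Gaussian kernel K(p, x) = sigma^2 * exp(-||p - x||^2 / (2 sigma^2))
# with K(x, x) = sigma^2, as in the paper; sizes and evaluation points are
# illustrative choices only.
import numpy as np
from scipy.spatial.distance import cdist

def kde(P, X, sigma):
    # Kernel density estimate of the empirical measure on P, evaluated at rows of X.
    return (sigma ** 2) * np.exp(-cdist(X, P, 'sqeuclidean') / (2 * sigma ** 2)).mean(axis=1)

rng = np.random.default_rng(0)
sigma = 0.05
P = rng.random((10000, 2))                                # stand-in for a large noisy input
Q = P[rng.choice(len(P), size=1000, replace=False)]       # random kernel sample
X = rng.random((1000, 2))                                 # evaluation points
print(np.abs(kde(P, X, sigma) - kde(Q, X, sigma)).max())  # observed sup error; shrinks roughly like 1/sqrt(|Q|)
```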

Corollary 6.1.

Consider a measure μ\mu defined on d\mathbb{R}^{d}, a kernel KK, and a parameter εR/(5+4/α2)\varepsilon\leq R/(5+4/\alpha^{2}). We can create a coreset QQ of size |Q|=O(((1/ε2)log(1/εδ))2d/(d+2))|Q|=O(((1/\varepsilon^{2})\sqrt{\log(1/\varepsilon\delta)})^{2d/(d+2)}) or randomly sample |Q|=O((1/ε4)(d+log(1/δ)))|Q|=O((1/\varepsilon^{4})(d+\log(1/\delta))) points so, with probability at least 1δ1-\delta, any sublevel set (dμK)η(d^{K}_{\mu})^{\eta} is homotopy equivalent to (dQK)r(d^{K}_{Q})^{r} for r[4ε/α2,R3ε]r\in[4\varepsilon/\alpha^{2},R-3\varepsilon] and η(0,R)\eta\in(0,R).

Proof 6.2.

Those bounds are obtained by constructing an (ε2/4)(\varepsilon^{2}/4)-kernel sample [48, 57], which guarantees dμKdQKε\|d^{K}_{\mu}-d^{K}_{Q}\|_{\infty}\leq\varepsilon via Lemma B.3. Then since εR/(5+4/α2)\varepsilon\leq R/(5+4/\alpha^{2}), with Theorem 4.2 any sublevel set (dμK)η(d^{K}_{\mu})^{\eta} is homotopy equivalent to (dQK)r(d^{K}_{Q})^{r}.

Stability of persistence diagrams.

Furthermore, the stability results on persistence diagrams [24] hold for kernel density estimates and kernel distance of μ\mu and QQ (where QQ is a coreset of μ\mu with the same size bounds as above). If fgε\|f-g\|_{\infty}\leq\varepsilon, then dB(Dgm(f),Dgm(g))εd_{B}(\textsf{Dgm}(f),\textsf{Dgm}(g))\leq\varepsilon, where dBd_{B} is the bottleneck distance between persistence diagrams. Combined with the coreset results above, this immediately implies the following corollaries.

Corollary 6.3.

Consider a measure μ\mu defined on d\mathbb{R}^{d} and a kernel KK. We can create a coreset QQ of size |Q|=O(((1/ε)log(1/εδ))2d/(d+2))|Q|=O(((1/\varepsilon)\sqrt{\log(1/\varepsilon\delta)})^{2d/(d+2)}) or randomly sample |Q|=O((1/ε2)(d+log(1/δ)))|Q|=O((1/\varepsilon^{2})(d+\log(1/\delta))) points which will have the following properties with probability at least 1δ1-\delta.

  • dB(Dgm(kdeμ),Dgm(kdeQ))εd_{B}(\textsf{Dgm}(\textsc{kde}_{\mu}),\textsf{Dgm}(\textsc{kde}_{Q}))\leq\varepsilon.

  • dB(Dgm((dμK)2),Dgm((dQK)2))εd_{B}(\textsf{Dgm}((d^{K}_{\mu})^{2}),\textsf{Dgm}((d^{K}_{Q})^{2}))\leq\varepsilon.

Corollary 6.4.

Consider a measure μ\mu defined on d\mathbb{R}^{d} and a kernel KK. We can create a coreset QQ of size |Q|=O(((1/ε2)log(1/εδ))2d/(d+2))|Q|=O(((1/\varepsilon^{2})\sqrt{\log(1/\varepsilon\delta)})^{2d/(d+2)}) or randomly sample |Q|=O((1/ε4)(d+log(1/δ)))|Q|=O((1/\varepsilon^{4})(d+\log(1/\delta))) points which will have the following property with probability at least 1δ1-\delta.

  • dB(Dgm(dμK),Dgm(dQK))εd_{B}(\textsf{Dgm}(d^{K}_{\mu}),\textsf{Dgm}(d^{K}_{Q}))\leq\varepsilon.

Another bound was independently derived to show an upper bound on the size of a random sample QQ such that dB(Dgm(kdeμP),Dgm(kdeQ))εd_{B}(\textsf{Dgm}(\textsc{kde}_{\mu_{P}}),\textsf{Dgm}(\textsc{kde}_{Q}))\leq\varepsilon in [3]; this can, as above, also be translated into bounds for Dgm((dQK)2)\textsf{Dgm}((d^{K}_{Q})^{2}) and Dgm(dQK)\textsf{Dgm}(d^{K}_{Q}). This result assumes P[C,C]dP\subset[-C,C]^{d} and is parametrized by a bandwidth parameter hh that retains that xdKh(x,p)dx=1\int_{x\in\mathbb{R}^{d}}K_{h}(x,p){\rm d}{x}=1 for all pp using that K1(xp)=K(x,p)K_{1}(\|x-p\|)=K(x,p) and Kh(xp)=1hdK1(xp2/h)K_{h}(\|x-p\|)=\frac{1}{h^{d}}K_{1}(\|x-p\|^{2}/h). This ensures that K(,p)K(\cdot,p) is (1/hd)(1/h^{d})-Lipschitz and that K(x,x)=Θ(1/hd)K(x,x)=\Theta(1/h^{d}) for any xx. Then their bound requires |Q|=O(dε2hdlog(Cdεδh))|Q|=O(\frac{d}{\varepsilon^{2}h^{d}}\log(\frac{Cd}{\varepsilon\delta h})) random samples.

To compare directly against the random sampling result we derive from Joshi et al. [48], for kernel Kh(x,p)K_{h}(x,p) then kdeμPkdeQεKh(x,x)=ε/hd\|\textsc{kde}_{\mu_{P}}-\textsc{kde}_{Q}\|_{\infty}\leq\varepsilon K_{h}(x,x)=\varepsilon/h^{d}. Hence, our analysis requires |Q|=O((1/ε2h2d)(d+log(1/δ)))|Q|=O((1/\varepsilon^{2}h^{2d})(d+\log(1/\delta))), and is an improvement when h=Ω(1)h=\Omega(1) or CC is not known or bounded, as well as in some other cases as a function of ε\varepsilon, hh, δ\delta, and dd.

7 Discussion

We mention here a few other interesting aspects of our results and observations about topological inference using the kernel distance. They are related to how the noise parameter σ\sigma affects the idea of scale, and a few more experiments, including with alternate kernels.

7.1 Noise and Scale

Much of geometric and topological reconstruction grew out of the desire to understand shapes at various scales. A common mechanism is offset based; e.g., α\alpha-shapes [31] represent the scale of a shape with the α\alpha parameter controlling the offsets of a point cloud. There are two parameters with the kernel distance: rr controls the offset through the sublevel set of the function, and σ\sigma controls the noise. We argue that any function which is robust to noise must have a parameter that controls the noise (e.g. σ\sigma for dμKd^{K}_{\mu} and m0m_{0} for dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}}). Here σ\sigma clearly defines some sense of scale in the setting of density estimation [65] and has a geometrical interpretation, while m0m_{0} represents a fraction of the measure and is hard to interpret geometrically, as illustrated by the lack of a Lipschitz property for dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}} with respect to m0m_{0}.

Several insights can be drawn from the experiments below in Section 7.2. One observation is that even though there are two parameters rr and σ\sigma that control the scale, the interesting values typically have rr very close to σ\sigma. Thus, we recommend first setting σ\sigma to control the scale at which the data is studied, and then exploring the effect of varying rr for values near σ\sigma. Moreover, not much structure seems to be missed by not exploring the space of both parameters; Figure 3 shows that fixing one (of rr and σ\sigma) and varying the other can provide very similar superlevel sets. However, it is possible instead to fix rr and explore the persistent topological features in the data [36, 34] (those less affected by smoothing) by varying σ\sigma. On the other hand, it remains a challenging problem to study two-parameter persistent homology [10, 9] under the setting of the kernel distance (or kernel density estimate).

7.2 Experiments

We consider measures μP\mu_{P} defined by a point set P2P\subset{{\mathbb{R}}}^{2}. To experimentally visualize the structures of the superlevel sets of kernel density estimates, or equivalently sublevel sets of the kernel distance, we do the simplest thing and just evaluate dμPKd^{K}_{\mu_{P}} at every grid point on a sufficiently dense grid.

Grid approximation.

Due to the 1-Lipschitz property of the kernel distance, well-chosen grid points have several nice properties. We consider the functions up to some resolution parameter \varepsilon>0, consistent with the parameter used to create a coreset approximation Q. Now specifically, consider an axis-aligned grid G_{\varepsilon,d} with edge length \varepsilon/2\sqrt{d}, so no point x\in{{\mathbb{R}}}^{d} is further than \varepsilon from some grid point g\in G_{\varepsilon,d}. Since K(x,y)\leq\varepsilon when \|x-y\|\geq\sigma\sqrt{2\ln(\sigma^{2}/\varepsilon)}=\delta_{\varepsilon,\sigma}, we only need to consider grid points g\in G_{\varepsilon,d} which are within \delta_{\varepsilon,\sigma} of some point p\in P (or q\in Q, for a coreset Q of P) [48, 71]. This is at most (2\sqrt{d}/\varepsilon)^{d}(2\delta_{\varepsilon,\sigma})^{d}=O((\sigma\sqrt{\log(\sigma^{2}/\varepsilon)}/\varepsilon)^{d}) grid points around each such point, for d a fixed constant. Furthermore, due to the 1-Lipschitz property of d^{K}_{P}, when considering a specific level set at r

  • a point xx such that dPK(x)rεd^{K}_{P}(x)\leq r-\varepsilon is no further than ε\varepsilon from some gGg\in G such that dPK(g)rd^{K}_{P}(g)\leq r, and

  • every ball Bε(x)B_{\varepsilon}(x) of radius ε\varepsilon centered at some point xdx\in{{\mathbb{R}}}^{d} such that all yBε(x)y\in B_{\varepsilon}(x) have dPK(y)rd^{K}_{P}(y)\leq r contains some representative point gGε,dg\in G_{\varepsilon,d}, and hence dPK(g)rd^{K}_{P}(g)\leq r.

Thus “deep” regions and spatially thick features are preserved; however, thin passageways or layers that are near the threshold rr, even if they do not correspond to a critical point, may erroneously become disconnected, causing phantom components or other spurious topological features. Due to the Lipschitz property, these can differ from rr by at most ε\varepsilon, so the errors will have small deviation in persistence.
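
A minimal version of this grid evaluation in the plane is sketched below. It is a direct, non-optimized implementation: it builds a grid over the bounding box, keeps only grid points near the data, and evaluates d^{K}_{P} via the identity (d^{K}_{P}(x))^{2}=\kappa(P,P)+\sigma^{2}-2\textsc{kde}_{P}(x); the 3\sigma truncation radius and the resolution are illustrative choices rather than the bounds above.

```python
# A minimal sketch of the grid evaluation in the plane: evaluate d^K_P only at
# grid points near the data, using (d^K_P(x))^2 = kappa(P,P) + sigma^2 - 2*kde_P(x)
# and the Gaussian kernel with K(x, x) = sigma^2.  The grid resolution and the
# 3*sigma truncation radius are illustrative choices, not the bounds above.
import numpy as np
from scipy.spatial.distance import cdist

def kernel_distance_on_grid(P, sigma, eps):
    h = eps / (2 * np.sqrt(2))                           # grid edge length eps/(2*sqrt(d)), d = 2
    xs = np.arange(P[:, 0].min() - 3 * sigma, P[:, 0].max() + 3 * sigma, h)
    ys = np.arange(P[:, 1].min() - 3 * sigma, P[:, 1].max() + 3 * sigma, h)
    X, Y = np.meshgrid(xs, ys)
    G = np.column_stack([X.ravel(), Y.ravel()])

    sq = cdist(G, P, 'sqeuclidean')
    near = sq.min(axis=1) <= (3 * sigma) ** 2            # keep only grid points near some p in P
    kde_vals = (sigma ** 2) * np.exp(-sq[near] / (2 * sigma ** 2)).mean(axis=1)
    kappa_PP = (sigma ** 2) * np.exp(-cdist(P, P, 'sqeuclidean') / (2 * sigma ** 2)).mean()
    dK = np.sqrt(np.maximum(kappa_PP + sigma ** 2 - 2 * kde_vals, 0.0))
    return G[near], dK

# Example: the sublevel set at level r is {g : dK(g) <= r}, e.g.
# G, dK = kernel_distance_on_grid(P, sigma=0.05, eps=0.05); sublevel = G[dK <= r]
```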

Varying parameter rr or σ\sigma.

Figure 3: Sublevel sets for the kernel distance while varying the isolevel γ\gamma, for fixed σ\sigma (left) and for fixed isolevel γ\gamma but variable σ\sigma (right), with Gaussian kernel. The variable values of σ\sigma and γ\gamma are chosen to make the plots similar.

We demonstrate the geometric inference on a synthetic dataset in [0,1]2[0,1]^{2} where 900900 points are chosen near a circle centered at (0.5,0.5)(0.5,0.5) with radius 0.250.25 or along a line segment from (0,0)(0,0) to (1,1)(1,1). Each point has Gaussian noise added with standard deviation 0.010.01. The remaining 11001100 points are chosen uniformly from [0,1]2[0,1]^{2}. We use a Gaussian kernel with σ=0.05\sigma=0.05. Figure 3 shows (left) various sublevel sets γΓ\gamma\in\Gamma for the kernel distance at a fixed σ=0.05\sigma=0.05 and (right) sublevel sets for a fixed γ=0.04853\gamma=0.04853 and various values of σΣ\sigma\in\Sigma, where

Γ\displaystyle\Gamma =[0.05005,0.04979,0.04954,0.04904,0.04853] and\displaystyle=[0.05005,0.04979,0.04954,0.04904,0.04853]\text{ and }
Σ\displaystyle\Sigma =[0.0485,0.0489,0.0492,0.0495,0.05].\displaystyle=[0.0485,0.0489,0.0492,0.0495,0.05].

These choices of Γ\Gamma and Σ\Sigma were made to highlight how similar the isolevels can be.
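
For concreteness, the sketch below generates a synthetic dataset of this form; the even split of the 900 structured points between the circle and the segment is an assumption, as the text does not specify it.

```python
# Sketch of the synthetic dataset described above: 900 points near the circle or
# the segment (the even 450/450 split is an assumption, as the text does not
# specify it), with N(0, 0.01) coordinate noise, plus 1,100 uniform background points.
import numpy as np

rng = np.random.default_rng(0)

t = rng.uniform(0, 2 * np.pi, 450)
circle = np.column_stack([0.5 + 0.25 * np.cos(t), 0.5 + 0.25 * np.sin(t)])

s = rng.uniform(0, 1, 450)
segment = np.column_stack([s, s])                    # the segment from (0,0) to (1,1)

structured = np.vstack([circle, segment]) + rng.normal(0, 0.01, (900, 2))
background = rng.uniform(0, 1, (1100, 2))
P = np.vstack([structured, background])              # 2,000 points, roughly in [0,1]^2
```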

Figure 4: Alternate kernel density estimates for the same dataset as Figure 3. From left to right, they use the Laplace, triangle, Epanechnikov, and the ball kernel.

Alternative kernels.

We can choose kernels other than the Gaussian kernel in the kernel density estimate, for instance

  • the Laplace kernel K(p,x)=exp(2px/σ)K(p,x)=\exp(-2\|p-x\|/\sigma),

  • the triangle kernel K(p,x)=max{0,1px/σ}K(p,x)=\max\{0,1-\|p-x\|/\sigma\},

  • the Epanechnikov kernel K(p,x)=max{0,1px2/σ2}K(p,x)=\max\{0,1-\|p-x\|^{2}/\sigma^{2}\}, or

  • the ball kernel K(p,x)={1 if pxσ; o.w. 0}K(p,x)=\{1\text{ if }\|p-x\|\leq\sigma\text{; o.w. }0\}.

Figure 4 uses parameters chosen to make these kernels comparable to Figure 3 (left). Of these, only the Laplace kernel is characteristic [66], making the corresponding version of the kernel distance a metric. Investigating which of the above reconstruction theorems hold when using the Laplace or other kernels is an interesting question for future work.
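
For reference, these alternative kernels are straightforward to drop into a kernel density estimate; the sketch below uses exactly the unit-height forms listed above (which differ from the \sigma^{2}-scaled Gaussian used elsewhere in the paper).

```python
# The alternative kernels listed above, in the unit-height forms given in the text,
# dropped into a kernel density estimate (a sketch).
import numpy as np

def laplace(dist, sigma):       return np.exp(-2 * dist / sigma)
def triangle(dist, sigma):      return np.maximum(0.0, 1 - dist / sigma)
def epanechnikov(dist, sigma):  return np.maximum(0.0, 1 - (dist / sigma) ** 2)
def ball(dist, sigma):          return (dist <= sigma).astype(float)

def kde_alt(P, x, kernel, sigma):
    dist = np.linalg.norm(P - x, axis=1)    # ||p - x|| for each p in P
    return kernel(dist, sigma).mean()
```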

Additionally, normal vector information and even kk-forms can be used in the definition of a kernel [42, 68, 30, 29, 43, 48]; this variant is known as the current distance. In some cases it retains its metric properties and has been shown to be very useful for shape alignment in conjunction with medical imaging.

7.3 Open Questions

This work shows it is possible to prove formal reconstruction results using kernel density estimates and the kernel distance. But it also opens many interesting questions.

  • For what other types of kernels can we show reconstruction bounds? The Laplace and triangle kernels are natural choices. For both, the coreset results match those of the Gaussian kernel. The kernel distance under the Laplace kernel is also a metric, but this is not known to hold for the triangle kernel. Yet the triangle kernel would be interesting since it has bounded support, and may lend itself to easier computation.

  • The power distance construction in Section 3 requires a point p^+\hat{p}_{+}, which approximates the point with minimum kernel distance. This is intuitively because it is possible to construct a point set PP (say points lying on a circle with no points inside) such that the point p+dp_{+}\in\mathbb{R}^{d} which minimizes the kernel distance and maximizes the kernel density estimate is far from any point in the point set. For one, can p^+\hat{p}_{+} be constructed efficiently without dependence on βP\beta_{P} or ΛP/σ\Lambda_{P}/\sigma?

    But more interestingly, can we generally approximate the persistence diagram without creating a simplicial complex on a subset of the input points? We do describe some bounds on using a grid-based technique in Section 7.2, but this is also unsatisfying since it essentially requires a low-dimensional Euclidean space.

  • Since dμKd^{K}_{\mu} is Lipschitz in xx and σ\sigma, it may make sense to understand the simultaneous stability of both variables. What is the best way to understand persistence over both parameters?

  • We provided an initial bound comparing the kernel distance under the Gaussian kernel and the Wasserstein 22 distance. Can we show, under our choice of normalization, that DK(μ,ν)W2(μ,ν)D_{K}(\mu,\nu)\leq W_{2}(\mu,\nu) unconstrained? More generally, how does the kernel distance under other kernels compare with other forms of Wasserstein and other distances on measures?

Acknowledgements

The authors thank Don Sheehy, Frédéric Chazal and the rest of the Geometrica group at INRIA-Saclay for enlightening discussions on geometric and topological reconstruction. We also thank Don Sheehy for personal communications regarding the power distance constructions, and Yusu Wang for ideas towards Lemma 4.5. Finally, we are also indebted to the anonymous reviewers for many detailed suggestions leading to improvements in results and presentation.

References

  • [1] Pankaj K. Agarwal, Sariel Har-Peled, Haim Kaplan, and Micha Sharir. Union of random Minkowski sums and network vulnerability analysis. In Proceedings of 29th Symposium on Computational Geometry, 2013.
  • [2] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
  • [3] Sivaraman Balakrishnan, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Statistical inference for persistent homology. Technical report, ArXiv:1303.7117, March 2013.
  • [4] James Biagioni and Jakob Eriksson. Map inference in the face of noise and disparity. In ACM SIGSPATIAL GIS, 2012.
  • [5] Gérard Biau, Frédéric Chazal, David Cohen-Steiner, Luc Devroye, and Carlos Rodriguez. A weighted kk-nearest neighbor density estimate for geometric inference. Electronic Journal of Statistics, 5:204–237, 2011.
  • [6] Omer Bobrowski, Sayan Mukherjee, and Jonathan E. Taylor. Topological consistency via kernel estimation. Technical report, arXiv:1407.5272, 2014.
  • [7] Peter Bubenik. Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 2014.
  • [8] Mickael Buchet, Frederic Chazal, Steve Y. Oudot, and Donald R. Sheehy. Efficient and robust topological data analysis on metric spaces. arXiv:1306.0039, 2013.
  • [9] Gunnar Carlsson, Gurjeet Singh, and Afra Zomorodian. Computing multidimensional persistence. Algorithms and Computation: Lecture Notes in Computer Science, 5878:730–739, 2009.
  • [10] Gunnar Carlsson and Afra Zomorodian. The theory of multidimensional persistence. Proc. 23rd Ann. Symp. Computational Geometry, pages 184–193, 2007.
  • [11] Frédéric Chazal and David Cohen-Steiner. Geometric inference. Tessellations in the Sciences, 2012.
  • [12] Frédéric Chazal, David Cohen-Steiner, Marc Glisse, Leonidas J. Guibas, and Steve Y. Oudot. Proximity of persistence modules and their diagrams. In Proceedings 25th Annual Symposium on Computational Geometry, pages 237–246, 2009.
  • [13] Frédéric Chazal, David Cohen-Steiner, and André Lieutier. Normal cone approximation and offset shape isotopy. Computational Geometry: Theory and Applications, 42:566–581, 2009.
  • [14] Frédéric Chazal, David Cohen-Steiner, and André Lieutier. A sampling theory for compact sets in Euclidean space. Discrete Computational Geometry, 41(3):461–479, 2009.
  • [15] Frédéric Chazal, David Cohen-Steiner, and Quentin Mérigot. Geometric inference for probability measures. Foundations of Computational Mathematics, 11(6):733–751, 2011.
  • [16] Frederic Chazal, Vin de Silva, Marc Glisse, and Steve Oudot. The structure and stability of persistence modules. arXiv:1207.3674, 2013.
  • [17] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Bertrand Michel, Alessandro Rinaldo, and Larry Wasserman. Robust topological inference: Distance-to-a-measure and kernel distance. Technical report, arXiv:1412.7197, 2014.
  • [18] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. On the bootstrap for persistence diagrams and landscapes. Modeling and Analysis of Information Systems, 20:96–105, 2013.
  • [19] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, and Larry Wasserman. Stochastic convergence of persistence landscapes. In Proceedings Symposium on Computational Geometry, 2014.
  • [20] Frédéric Chazal and André Lieutier. Weak feature size and persistent homology: computing homology of solids in rn from noisy data samples. Proceedings 21st Annual Symposium on Computational Geometry, pages 255–262, 2005.
  • [21] Frédéric Chazal and André Lieutier. Topology guaranteeing manifold reconstruction using distance function to noisy data. Proceedings 22nd Annual Symposium on Computational Geometry, pages 112–118, 2006.
  • [22] Frédéric Chazal and Steve Oudot. Towards persistence-based reconstruction in euclidean spaces. Proceedings 24th Annual Symposium on Computational Geometry, pages 232–241, 2008.
  • [23] Scott Cohen. Finding color and shape patterns in images. Technical report, Stanford University: CS-TR-99-1620, 1999.
  • [24] David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete and Computational Geometry, 37:103–120, 2007.
  • [25] Hal Daumé. From zero to reproducing kernel hilbert spaces in twelve pages or less. http://pub.hal3.name/daume04rkhs.ps, 2004.
  • [26] Luc Devroye and László Györfi. Nonparametric Density Estimation: The L1L_{1} View. Wiley, 1984.
  • [27] Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer-Verlag, 2001.
  • [28] Tamal K. Dey. Curve and Surface Reconstruction: Algorithms with Mathematical Analysis. Cambridge University Press, 2007.
  • [29] Stanley Durrleman, Xavier Pennec, Alain Trouvé, and Nicholas Ayache. Measuring brain variability via sulcal lines registration: A diffeomorphic approach. In 10th International Conference on Medical Image Computing and Computer Assisted Intervention, 2007.
  • [30] Stanley Durrleman, Xavier Pennec, Alain Trouvé, and Nicholas Ayache. Sparse approximation of currents for statistics on curves and surfaces. In 11th International Conference on Medical Image Computing and Computer Assisted Intervention, 2008.
  • [31] Herbert Edelsbrunner. The union of balls and its dual shape. Proceedings 9th Annual Symposium on Computational Geometry, pages 218–231, 1993.
  • [32] Herbert Edelsbrunner, Michael Facello, Ping Fu, and Jie Liang. Measuring proteins and voids in proteins. In Proceedings 28th Annual Hawaii International Conference on Systems Science, 1995.
  • [33] Herbert Edelsbrunner, Brittany Terese Fasy, and Günter Rote. Add isotropic Gaussian kernels at own risk: More and more resilient modes in higher dimensions. Proceedings 28th Annual Symposium on Computational Geometry, pages 91–100, 2012.
  • [34] Herbert Edelsbrunner and John Harer. Persistent homology - a survey. Contemporary Mathematics, 453:257–282, 2008.
  • [35] Herbert Edelsbrunner and John Harer. Computational Topology: An Introduction. American Mathematical Society, Providence, RI, USA, 2010.
  • [36] Herbert Edelsbrunner, David Letscher, and Afra J. Zomorodian. Topological persistence and simplification. Discrete and Computational Geometry, 28:511–533, 2002.
  • [37] Ahmed Elgammal, Ramani Duraiswami, David Harwood, and Larry S. Davis. Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proc. IEEE, 90:1151–1163, 2002.
  • [38] Brittany Terese Fasy, Jisu Kim, Fabrizio Lecci, and Clément Maria. Introduction to the R package TDA. Technical report, arXiV:1411.1830, 2014.
  • [39] Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, Larry Wasserman, Sivaraman Balakrishnan, and Aarti Singh. Statistical inference for persistent homology: Confidence sets for persistence diagrams. In The Annals of Statistics, volume 42, pages 2301–2339, 2014.
  • [40] H. Federer. Curvature measures. Transactions of the American Mathematical Society, 93:418–491, 1959.
  • [41] Mingchen Gao, Chao Chen, Shaoting Zhang, Zhen Qian, Dimitris Metaxas, and Leon Axel. Segmenting the papillary muscles and the trabeculae from high resolution cardiac CT through restoration of topological handles. In Proceedings International Conference on Information Processing in Medical Imaging, 2013.
  • [42] Joan Glaunès. Transport par difféomorphismes de points, de mesures et de courants pour la comparaison de formes et l’anatomie numérique. PhD thesis, Université Paris 13, 2005.
  • [43] Joan Glaunès and Sarang Joshi. Template estimation from unlabeled point set data and surfaces for computational anatomy. In International Workshop on Mathematical Foundations of Computational Anatomy, 2006.
  • [44] Karsten Grove. Critical point theory for distance functions. Proceedings of Symposia in Pure Mathematics, 54:357–385, 1993.
  • [45] Leonidas Guibas, Quentin Mérigot, and Dmitriy Morozov. Witnessed kk-distance. Proceedings 27th Annual Symposium on Computational Geometry, pages 57–64, 2011.
  • [46] Allen Hatcher. Algebraic Topology. Cambridge University Press, 2002.
  • [47] Matthias Hein and Olivier Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In Proceedings 10th International Workshop on Artificial Intelligence and Statistics, 2005.
  • [48] Sarang Joshi, Raj Varma Kommaraju, Jeff M. Phillips, and Suresh Venkatasubramanian. Comparing distributions and shapes using the kernel distance. Proceedings 27th Annual Symposium on Computational Geometry, 2011.
  • [49] A. N. Kolmogorov, S. V. Fomin, and R. A. Silverman. Introductory Real Analysis. Dover Publications, 1975.
  • [50] John M. Lee. Introduction to topological manifolds. Springer-Verlag, 2000.
  • [51] John M. Lee. Introduction to smooth manifolds. Springer, 2003.
  • [52] Jie Liang, Herbert Edelsbrunner, Ping Fu, Pamidighantam V. Sudharkar, and Shankar Subramanian. Analytic shape computation of macromolecules: I. molecular area and volume through alpha shape. Proteins: Structure, Function, and Genetics, 33:1–17, 1998.
  • [53] André Lieutier. Any open bounded subset of n\mathbb{R}^{n} has the same homotopy type as its medial axis. Computer-Aided Design, 36:1029–1046, 2004.
  • [54] Quentin Mérigot. Geometric structure detection in point clouds. PhD thesis, Université de Nice Sophia-Antipolis, 2010.
  • [55] Yuriy Mileyko, Sayan Mukherjee, and John Harer. Probability measures on the space of persistence diagrams. Inverse Problems, 27(12), 2011.
  • [56] A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
  • [57] Jeff M. Phillips. eps-samples for kernels. Proceedings 24th Annual ACM-SIAM Symposium on Discrete Algorithms, 2013.
  • [58] Jeff M. Phillips and Suresh Venkatasubramanian. A gentle introduction to the kernel distance. arXiv:1103.1625, March 2011.
  • [59] Florian T. Pokorny, Carl Henrik Ek, Hedvig Kjellström, and Danica Kragic. Persistent homology for learning densities with bounded support. In Neural Information Processing Systems, 2012.
  • [60] Charles A. Price, Olga Symonova, Yuriy Mileyko, Troy Hilley, and Joshua W. Weitz. Leaf gui: Segmenting and analyzing the structure of leaf veins and areoles. Plant Physiology, 155:236–245, 2011.
  • [61] David W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, 1992.
  • [62] Donald R. Sheehy. A multicover nerve for geometric inference. Canadian Conference in Computational Geometry, 2012.
  • [63] Donald R. Sheehy. Linear-size approximations to the Vietoris-Rips filtration. Discrete & Computational Geometry, 49:778–796, 2013.
  • [64] Bernard W. Silverman. Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society B, 43:97–99, 1981.
  • [65] Bernard W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.
  • [66] Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
  • [67] Kathryn Turner, Yuriy Mileyko, Sayan Mukherjee, and John Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 2014.
  • [68] Marc Vaillant and Joan Glaunès. Surface matching via currents. In Proceedings Information Processing in Medical Imaging, volume 19, pages 381–92, 2005.
  • [69] Cédric Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.
  • [70] Grace Wahba. Support vector machines, reproducing kernel Hilbert spaces, and randomization GACV. In Advances in Kernel Methods – Support Vector Learning, pages 69–88. Bernhard Schölkopf and Alexander J. Smola and Christopher J. C. Burges and Rosanna Soentpiet, 1999.
  • [71] Yan Zheng, Jeffrey Jestes, Jeff M. Phillips, and Feifei Li. Quality and efficiency in kernel density estimates for large data. In Proceedings ACM Conference on the Management of Data (SIGMOD), 2012.

Appendix A Details on Distance-Like Properties of Kernel Distance

We provide further details on distance-like properties of the kernel distance.

A.1 Semiconcave Properties of Kernel Distance

We also note that semiconcavity follows quite naturally and simply in the RKHS K\mathcal{H}_{K} for dμKd^{K}_{\mu}.

Lemma A.1.

(dμK)2(d^{K}_{\mu})^{2} is 11-semiconcave in K\mathcal{H}_{K}: the map x(dμK(x))2ϕ(x)K2x\mapsto(d^{K}_{\mu}(x))^{2}-\|\phi(x)\|_{\mathcal{H}_{K}}^{2} is concave.

Proof A.2.

We can write

(d^{K}_{\mu}(x))^{2}=(D_{K}(\mu,x))^{2}=\kappa(\mu,\mu)+\kappa(x,x)-2\kappa(\mu,x)=\|\Phi(\mu)\|^{2}_{\mathcal{H}_{K}}+\|\phi(x)\|^{2}_{\mathcal{H}_{K}}-2\langle\Phi(\mu),\phi(x)\rangle_{\mathcal{H}_{K}}=\|\Phi(\mu)-\phi(x)\|^{2}_{\mathcal{H}_{K}}.

Now

(d^{K}_{\mu}(x))^{2}-\|\phi(x)\|_{\mathcal{H}_{K}}^{2}=\|\Phi(\mu)-\phi(x)\|^{2}_{\mathcal{H}_{K}}-\|\phi(x)\|^{2}_{\mathcal{H}_{K}}=\|\Phi(\mu)\|^{2}_{\mathcal{H}_{K}}-2\langle\Phi(\mu),\phi(x)\rangle_{\mathcal{H}_{K}}.

Since the above is twice-differentiable, we only need to show that its second differential is non-positive. By definition, for a fixed \mu, \Phi(\mu) and \|\Phi(\mu)\|^{2}_{\mathcal{H}_{K}} are both constant. Writing \Phi(\mu)=c_{1} and \|\Phi(\mu)\|^{2}_{\mathcal{H}_{K}}=c_{2}, we have (d^{K}_{\mu}(x))^{2}-\|\phi(x)\|_{\mathcal{H}_{K}}^{2}=c_{2}-2\langle c_{1},\phi(x)\rangle_{\mathcal{H}_{K}}. Since the RKHS \mathcal{H}_{K} is a vector space with a well-defined inner product, the above is an affine function of \phi(x); its second differential vanishes, so the map is concave.

However, this semiconcavity in K\mathcal{H}_{K} is not that useful. For unit weight elements x,ydx,y\in{{\mathbb{R}}}^{d}, an element sαs_{\alpha} such that ϕ(sα)=αϕ(y)+(1α)ϕ(x)\phi(s_{\alpha})=\alpha\phi(y)+(1-\alpha)\phi(x) is a weighted point set with a point at xx with weight (1α)(1-\alpha) and another at yy with weight α\alpha. Lemma A.1 only implies that (dK(sα))2ϕ(sα)K2α((dK(x))2ϕ(x)K2)+(1α)((dK(y))2ϕ(y)K2)(d_{K}(s_{\alpha}))^{2}-\|\phi(s_{\alpha})\|^{2}_{\mathcal{H}_{K}}\leq\alpha((d_{K}(x))^{2}-\|\phi(x)\|^{2}_{\mathcal{H}_{K}})+(1-\alpha)((d_{K}(y))^{2}-\|\phi(y)\|^{2}_{\mathcal{H}_{K}}).

A.2 Kernel Distance is Proper

We use two more general, but equivalent definitions of a proper map. Definition (i): A continuous map f:𝕏𝕐f:{{\mathbb{X}}}\to{{\mathbb{Y}}} between two topological spaces is proper if and only if the inverse image of every compact subset in 𝕐{{\mathbb{Y}}} is compact in 𝕏{{\mathbb{X}}} ([50], page 84; [51], page 45). Definition (ii): a continuous map f:𝕏𝕐f:{{\mathbb{X}}}\to{{\mathbb{Y}}} between two topological manifolds is proper if and only if for every sequence {pi}\{p_{i}\} in 𝕏{{\mathbb{X}}} that escapes to infinity, {f(pi)}\{f(p_{i})\} escapes to infinity in 𝕐{{\mathbb{Y}}} ([51], Proposition 2.17). Here, for a topological space 𝕏{{\mathbb{X}}}, a sequence {pi}\{p_{i}\} in 𝕏{{\mathbb{X}}} escapes to infinity if for every compact set G𝕏G\subset{{\mathbb{X}}}, there are at most finitely many values of ii for which piGp_{i}\in G ([51], page 46).

Lemma A.3 (Lemma 2.6).

dμKd^{K}_{\mu} is proper.

Proof A.4.

To prove that dμKd^{K}_{\mu} is proper, we prove the following two claims: (a) A continuous function f:d[0,c)f:{{\mathbb{R}}}^{d}\to[0,c) (where c is a constant) is proper if, for any sequence {xi}\{x_{i}\} in d{{\mathbb{R}}}^{d} that escapes to infinity, the sequence {f(xi)}\{f(x_{i})\} tends to cc (approaches cc in the limit); (b) letting f:=dμKf:=d^{K}_{\mu}, for any sequence {xi}\{x_{i}\} that escapes to infinity, the sequence {f(xi)}\{f(x_{i})\} tends to cμc_{\mu}, or equivalently, κ(μ,xi)\kappa(\mu,x_{i}) tends to 0.

We prove claim (a) by proving its contrapositive. If a continuous function f:d[0,c)f:{{\mathbb{R}}}^{d}\to[0,c) is not proper, then there exists a sequence {xi}\{x_{i}\} in d{{\mathbb{R}}}^{d} that escapes to infinity, such that the sequence {f(xi)}\{f(x_{i})\} does not tend to cc. So suppose ff is not proper; this implies that there exists a constant b<cb<c such that f1[0,b]f^{-1}[0,b] is not compact (based on properness definition (i)) and therefore either not closed or unbounded. We first show that A:=f1[0,b]A:=f^{-1}[0,b] is closed. We make use of the following theorem ([49], page 88, Theorem 10’): A mapping ff of a topological space 𝕏{{\mathbb{X}}} into a topological space 𝕐{{\mathbb{Y}}} is continuous if and only if the pre-image f1(F)f^{-1}(F) of every closed set F𝕐F\subset{{\mathbb{Y}}} is closed in 𝕏{{\mathbb{X}}}. Since ff is continuous, the pre-image of every closed set [a,b][a,b]\subset{{\mathbb{R}}} is closed in d{{\mathbb{R}}}^{d}. Therefore AA is closed, and so it must be unbounded. Since AA is unbounded, it contains a sequence S:={xi}S:=\{x_{i}\} that escapes to infinity. In addition, since f(xi)bf(x_{i})\leq b for all xiSx_{i}\in S, the sequence {f(xi)}\{f(x_{i})\} does not tend to cc. Therefore (a) holds by contraposition.

To prove claim (b), we need to show that for any sequence {xi}\{x_{i}\} that escapes to infinity, κ(μ,xi)\kappa(\mu,x_{i}) tends to 0. For each xix_{i}, define a radius ri=xi0/2r_{i}=\|x_{i}-0\|/2 and define a ball BiB_{i} that is centered at the origin 0 and has radius rir_{i}. As xix_{i} goes to infinity, rir_{i} increases until for any fixed arbitrary ε>0\varepsilon>0, we have pBiμ(p)dp1ε/2σ2\int_{p\in B_{i}}\mu(p){\rm d}{p}\geq 1-\varepsilon/2\sigma^{2} and thus pdBidμ(p)ε/2σ2\int_{p\in{{\mathbb{R}}}^{d}\setminus B_{i}}{\rm d}{\mu(p)}\leq\varepsilon/2\sigma^{2}. Furthermore, let pi=argminpBpxip_{i}=\arg\min_{p\in B}\|p-x_{i}\|, so xipi=ri\|x_{i}-p_{i}\|=r_{i}. Thus also as xix_{i} goes to infinity, rir_{i} increases until for any ε>0\varepsilon>0 we have K(pi,xi)ε/2K(p_{i},x_{i})\leq\varepsilon/2. We now decompose κ(μ,xi)=pBiK(p,xi)dμ(p)+qdBiK(q,xi)dμ(q)\kappa(\mu,x_{i})=\int_{p\in B_{i}}K(p,x_{i}){\rm d}{\mu(p)}+\int_{q\in{{\mathbb{R}}}^{d}\setminus B_{i}}K(q,x_{i}){\rm d}{\mu(q)}. Thus for any ε>0\varepsilon>0, as xix_{i} goes to infinity, the first term is at most ε/2\varepsilon/2 since all K(p,xi)K(pi,xi)ε/2K(p,x_{i})\leq K(p_{i},x_{i})\leq\varepsilon/2 and the second term is at most ε/2\varepsilon/2 since K(q,x)σ2K(q,x)\leq\sigma^{2} and qdBiμ(q)dqε/2σ2\int_{q\in{{\mathbb{R}}}^{d}\setminus B_{i}}\mu(q){\rm d}{q}\leq\varepsilon/2\sigma^{2}. Since these results hold for all ε\varepsilon, as xix_{i} goes to infinity and ε\varepsilon goes to 0, κ(μ,xi)\kappa(\mu,x_{i}) goes to 0.

Combine (a) with (b) and the fact that dμKd^{K}_{\mu} is a continuous (in fact, Lipschitz) function, we obtained the properness result.

Appendix B ε\varepsilon-Approximation of the Kernel Distance

Here we make explicit the way that an ε\varepsilon-kernel sample approximated the kernel distance. Recall that if QQ is an ε\varepsilon-kernel sample of μ\mu, then kdeμkdeμQ=maxxd|κ(μ,x)κ(μQ,x)|ε\|\textsc{kde}_{\mu}-\textsc{kde}_{\mu_{Q}}\|=\max_{x\in{{\mathbb{R}}}^{d}}|\kappa(\mu,x)-\kappa({\mu_{Q}},x)|\leq\varepsilon.

Lemma B.1.

If QQ is an ε\varepsilon-kernel sample of μ\mu, then (dμK)2(dμQK)24ε\|(d^{K}_{\mu})^{2}-(d^{K}_{\mu_{Q}})^{2}\|_{\infty}\leq 4\varepsilon.

Proof B.2.

First expand DK(μ,x)2=κ(x,x)+κ(μ,μ)2κ(μ,x)=σ2+κ(μ,μ)2κ(μ,x)D_{K}(\mu,x)^{2}=\kappa(x,x)+\kappa(\mu,\mu)-2\kappa(\mu,x)=\sigma^{2}+\kappa(\mu,\mu)-2\kappa(\mu,x). Replacing μ\mu with μQ{\mu_{Q}}, the first term is unaffected. The second term is bounded,

κ(μ,μ)\displaystyle\kappa(\mu,\mu) =(p,q)K(p,q)dmμ,μ(p,q)=p(qK(p,q)dμ(q))dμ(p)\displaystyle=\int_{(p,q)}K(p,q){\rm d}{\textsf{m}_{\mu,\mu}(p,q)}=\int_{p}\left(\int_{q}K(p,q){\rm d}{\mu(q)}\right){\rm d}{\mu(p)}
=pkdeμ(p)dμ(p)p(kdeμQ(p)+ε)dμ(p)\displaystyle=\int_{p}\textsc{kde}_{\mu}(p){\rm d}{\mu(p)}\leq\int_{p}(\textsc{kde}_{{\mu_{Q}}}(p)+\varepsilon){\rm d}{\mu(p)}
=pkdeμQ(p)dμ(p)+ε=p(qK(p,q)dμQ(q))dμ(p)+ε\displaystyle=\int_{p}\textsc{kde}_{{\mu_{Q}}}(p){\rm d}{\mu(p)}+\varepsilon=\int_{p}\left(\int_{q}K(p,q){\rm d}{{\mu_{Q}}(q)}\right){\rm d}{\mu(p)}+\varepsilon
=κ(μQ,μ)+ε\displaystyle=\kappa({\mu_{Q}},\mu)+\varepsilon
κ(μQ,μQ)+2ε.\displaystyle\leq\kappa({\mu_{Q}},{\mu_{Q}})+2\varepsilon.

Similar results hold by switching μQ{\mu_{Q}} with μ\mu in the above inequality, that is, κ(μQ,μQ)κ(μ,μ)+2ε\kappa({\mu_{Q}},{\mu_{Q}})\leq\kappa(\mu,\mu)+2\varepsilon. And for the third term we have a similar inequality, |2κ(μ,x)2κ(μQ,x)|2ε|2\kappa(\mu,x)-2\kappa({\mu_{Q}},x)|\leq 2\varepsilon. Combining all three terms, we have the desired result: |DK(μ,x)2DK(μQ,x)2|4ε|D_{K}(\mu,x)^{2}-D_{K}({\mu_{Q}},x)^{2}|\leq 4\varepsilon.
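
The inequality chain above is also easy to check empirically: subsample QQ from a point set PP, measure the observed kde error \hat{\varepsilon} over PP and a set of test points, and confirm that the squared kernel distances differ by at most 4\hat{\varepsilon}. A sketch, using the Gaussian kernel with K(x,x)=\sigma^{2} and illustrative sizes:

```python
# Empirical check of the 4*eps bound (a sketch): subsample Q from P, measure the
# observed kde error eps_hat over P and a test set, and verify that the squared
# kernel distances differ by at most 4*eps_hat.  Gaussian kernel with
# K(x, x) = sigma^2; the sizes below are illustrative.
import numpy as np
from scipy.spatial.distance import cdist

def kde(P, X, sigma):
    return (sigma ** 2) * np.exp(-cdist(X, P, 'sqeuclidean') / (2 * sigma ** 2)).mean(axis=1)

def dK_sq(P, X, sigma):
    # (d^K_P(x))^2 = kappa(P, P) + sigma^2 - 2 * kde_P(x)
    kappa_PP = (sigma ** 2) * np.exp(-cdist(P, P, 'sqeuclidean') / (2 * sigma ** 2)).mean()
    return kappa_PP + sigma ** 2 - 2 * kde(P, X, sigma)

rng = np.random.default_rng(1)
sigma = 0.1
P = rng.random((3000, 2))
Q = P[rng.choice(len(P), size=300, replace=False)]
X = rng.random((1000, 2))

E = np.vstack([P, X])                                    # evaluate the kde error on P and X
eps_hat = np.abs(kde(P, E, sigma) - kde(Q, E, sigma)).max()
gap = np.abs(dK_sq(P, X, sigma) - dK_sq(Q, X, sigma)).max()
print(gap <= 4 * eps_hat + 1e-12)                        # expected True
```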

Lemma B.3.

If QQ is an (ε2/4)(\varepsilon^{2}/4)-kernel sample of μ\mu, then dμKdμQKε\|d^{K}_{\mu}-d^{K}_{\mu_{Q}}\|_{\infty}\leq\varepsilon.

Proof B.4.

By Lemma B.1 this condition on QQ implies that (dμK)2(dμQK)2ε2\|(d^{K}_{\mu})^{2}-(d^{K}_{\mu_{Q}})^{2}\|_{\infty}\leq\varepsilon^{2}. We then use a basic fact for values ε0\varepsilon\geq 0 and γ0\gamma\geq 0.
\bullet γ2+ε2γ+ε\sqrt{\gamma^{2}+\varepsilon^{2}}\leq\gamma+\varepsilon. This follows since (γ+ε)2=γ2+ε2+2γεγ2+ε2(\gamma+\varepsilon)^{2}=\gamma^{2}+\varepsilon^{2}+2\gamma\varepsilon\geq\gamma^{2}+\varepsilon^{2}.

We now prove the main result as an upper and lower bound using for any xdx\in\mathbb{R}^{d}. We first use γ=dμK(x)0\gamma=d^{K}_{\mu}(x)\geq 0 and expand dμQK(x)d^{K}_{\mu_{Q}}(x) to obtain

dμQK(x)=(dμQK(x))2(dμK(x))2+ε2dμK(x)+ε.d^{K}_{\mu_{Q}}(x)=\sqrt{(d^{K}_{\mu_{Q}}(x))^{2}}\leq\sqrt{(d^{K}_{\mu}(x))^{2}+\varepsilon^{2}}\leq d^{K}_{\mu}(x)+\varepsilon.

Now we use γ=dμQK(x)0\gamma=d^{K}_{\mu_{Q}}(x)\geq 0 and expand dμK(x)d^{K}_{\mu}(x) to obtain

dμK(x)=(dμK(x))2(dμQK(x))2+ε2dμQK(x)+ε.d^{K}_{\mu}(x)=\sqrt{(d^{K}_{\mu}(x))^{2}}\leq\sqrt{(d^{K}_{\mu_{Q}}(x))^{2}+\varepsilon^{2}}\leq d^{K}_{\mu_{Q}}(x)+\varepsilon.

Hence for any xdx\in\mathbb{R}^{d} we have dμK(x)εdμQK(x)dμK(x)+εd^{K}_{\mu}(x)-\varepsilon\leq d^{K}_{\mu_{Q}}(x)\leq d^{K}_{\mu}(x)+\varepsilon.

Appendix C Power Distance Constructions

Recall we want to consider the following power distance using dμKd^{K}_{\mu} (as weight) for a measure μ\mu associated with a subset PdP\subset{{\mathbb{R}}}^{d} and metric d(,)d(\cdot,\cdot) on d{{\mathbb{R}}}^{d},

fP(μ,x)=minpP(d(p,x)2+dμK(p)2).{f_{P}}(\mu,x)=\sqrt{\min_{p\in P}\left({d(p,x)}^{2}+{d^{K}_{\mu}(p)}^{2}\right)}.

We consider a particular choice of the distance metric d(p,x)=DK(p,x)d(p,x)=D_{K}(p,x) which leads to a kernel version of the power distance

fPk(μ,x)=minpP(DK(p,x)2+dμK(p)2).{f^{\textsc{k}}_{P}}(\mu,x)=\sqrt{\min_{p\in P}\left({D_{K}(p,x)}^{2}+{d^{K}_{\mu}(p)}^{2}\right)}.

Recall that dμK(x)=DK(μ,x)d^{K}_{\mu}(x)=D_{K}(\mu,x). In this section, we will always use the notation DK(μ,ν)D_{K}(\mu,\nu), and when μ\mu or ν\nu are points (e.g. μ\mu is a Dirac mass at pp and ν\nu is a Dirac mass at qq), then we will just write DK(p,q)D_{K}(p,q). This will be especially helpful when we apply the triangle inequality in several places.
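
To make these definitions concrete, the sketch below evaluates f^{\textsc{k}}_{P}(\mu_{P},x) for the empirical measure of a finite point set, with the Gaussian kernel and the K(x,x)=\sigma^{2} normalization; it is a direct quadratic-time computation with no attempt at efficiency.

```python
# Sketch: evaluating the kernel power distance f^k_P(mu_P, x) for the empirical
# measure of a finite point set P, using the Gaussian kernel with K(x, x) = sigma^2.
# A direct quadratic-time computation, with no attempt at efficiency.
import numpy as np
from scipy.spatial.distance import cdist

def kernel_power_distance(P, x, sigma):
    K_PP = (sigma ** 2) * np.exp(-cdist(P, P, 'sqeuclidean') / (2 * sigma ** 2))
    kappa_PP = K_PP.mean()                                # kappa(mu_P, mu_P)
    kde_at_P = K_PP.mean(axis=1)                          # kde_P(p) for each p in P
    w = kappa_PP + sigma ** 2 - 2 * kde_at_P              # weights d^K_P(p)^2

    K_Px = (sigma ** 2) * np.exp(-cdist(P, np.atleast_2d(x), 'sqeuclidean') / (2 * sigma ** 2)).ravel()
    d = 2 * sigma ** 2 - 2 * K_Px                         # D_K(p, x)^2 = K(p,p) + K(x,x) - 2 K(p,x)
    return np.sqrt(np.maximum(d + w, 0.0).min())
```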

C.1 Kernel Power Distance on Point Set PP

Given a set PP defining a measure of interest μP\mu_{P}, it is of interest to consider if fPk(μP,x){f^{\textsc{k}}_{P}}({\mu_{P}},x) is multiplicatively bounded by DK(μP,x)D_{K}({\mu_{P}},x). Theorem 3.2 shows that the lower bound holds. In this section we try to provide a multiplicative approximation upper bound.

Let p=argminpPpxp^{\star}=\arg\min_{p\in P}\|p-x\|. We can start with Lemma 3.4, which reduces the problem to finding a multiplicative upper bound for DK(p,x)D_{K}(p^{\star},x) in terms of DK(μP,x)D_{K}({\mu_{P}},x). However, we are not able to provide very useful bounds, and they require more advanced techniques than the previous section. In particular, they will only apply for points xdx\in\mathbb{R}^{d} when DK(μP,x)D_{K}({\mu_{P}},x) is large enough; hence they do not approximate the minima of dμKd^{K}_{\mu} well.

For simplicity, we write dPK()=DK(μP,)d^{K}_{P}(\cdot)=D_{K}({\mu_{P}},\cdot) as DK(P,)D_{K}(P,\cdot).

The difficult case is when DK(P,x)D_{K}(P,x) is very small, and hence κ(P,P)\kappa(P,P) is very small. So we start by developing tools to upper bound κ(P,P)\kappa(P,P) using p^=argminpPDK(P,p)\hat{p}=\arg\min_{p\in P}D_{K}(P,p), a point which only provides a worse approximation than pp^{\star}.

We first provide a general result in a Hilbert space (a refinement of a vector space [25]), and then next apply it to our setting in the RKHS.

Lemma C.1.

Consider a set V={v1,,vn}V=\{v_{1},\ldots,v_{n}\} of vectors in a Hilbert space endowed with norm \|\cdot\| and inner product ,\langle\cdot,\cdot\rangle. Let each viv_{i} have norm vi=η\|v_{i}\|=\eta. Consider weights W={w1,,wn}W=\{w_{1},\ldots,w_{n}\} such that wi0w_{i}\geq 0 and i=1nwi=1\sum_{i=1}^{n}w_{i}=1. Let r=i=1nwivir=\sum_{i=1}^{n}w_{i}v_{i}. Let v^=argminviVvir\hat{v}=\arg\min_{v_{i}\in V}\|v_{i}-r\|. Then

r2η2rv^2.\|r\|^{2}\leq\eta^{2}-\|r-\hat{v}\|^{2}.
Proof C.2.

Recall elementary properties of inner product space: x2=x,x\|x\|^{2}=\langle x,x\rangle, ax,y=ax,y\langle ax,y\rangle=a\langle x,y\rangle, xy,xy=x,x+y,y2x,y\langle x-y,x-y\rangle=\langle x,x\rangle+\langle y,y\rangle-2\langle x,y\rangle. By definition of v^\hat{v}, for any viVv_{i}\in V,

vir2v^r2vi,vi+r,r2vi,rv^,v^+r,r2v^,rvi,rv^,r.\displaystyle\|v_{i}-r\|^{2}\geq\|\hat{v}-r\|^{2}\Rightarrow\langle v_{i},v_{i}\rangle+\langle r,r\rangle-2\langle v_{i},r\rangle\geq\langle\hat{v},\hat{v}\rangle+\langle r,r\rangle-2\langle\hat{v},r\rangle\Rightarrow\langle v_{i},r\rangle\leq\langle\hat{v},r\rangle.

We can decompose rr (based on linearity of an inner product space) as

r2=r,r=i=1nwivi,ri=1nwiv^,r=v^,r=12(r2+v^2v^r2).\|r\|^{2}=\langle r,r\rangle=\sum_{i=1}^{n}w_{i}\langle v_{i},r\rangle\leq\sum_{i=1}^{n}w_{i}\langle\hat{v},r\rangle=\langle\hat{v},r\rangle=\frac{1}{2}(\|r\|^{2}+\|\hat{v}\|^{2}-\|\hat{v}-r\|^{2}).

The last equality holds since v^r2=r2+v^22v^,r\|\hat{v}-r\|^{2}=\|r\|^{2}+\|\hat{v}\|^{2}-2\langle\hat{v},r\rangle. Then, since v^=η\|\hat{v}\|=\eta, we can solve for r2\|r\|^{2} as

r2η2v^r2.\|r\|^{2}\leq\eta^{2}-\|\hat{v}-r\|^{2}.
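
Since the argument uses only inner-product identities, Lemma C.1 can be sanity-checked numerically with random equal-norm vectors; the sketch below does so in \mathbb{R}^{d}, which suffices because the statement is purely about a Hilbert space.

```python
# Numeric sanity check of Lemma C.1 with random equal-norm vectors in R^d
# (a sketch; the statement is purely Hilbert-space, so R^d suffices).
import numpy as np

rng = np.random.default_rng(2)
n, d, eta = 50, 5, 1.0

V = rng.normal(size=(n, d))
V *= eta / np.linalg.norm(V, axis=1, keepdims=True)   # ||v_i|| = eta for all i
w = rng.random(n); w /= w.sum()                       # nonnegative weights summing to 1

r = w @ V                                             # r = sum_i w_i v_i
v_hat = V[np.linalg.norm(V - r, axis=1).argmin()]     # closest v_i to r
print(r @ r <= eta ** 2 - (r - v_hat) @ (r - v_hat) + 1e-12)   # expected True
```
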
Lemma C.3.

Let p^=argminpPDK(P,p)\hat{p}=\arg\min_{p\in P}D_{K}(P,p), then κ(P,P)σ2DK(P,p^)2\kappa(P,P)\leq\sigma^{2}-D_{K}(P,\hat{p})^{2}.

Proof C.4.

Let ϕK:dK\phi_{K}:{{\mathbb{R}}}^{d}\to\mathcal{H}_{K} map points in d{{\mathbb{R}}}^{d} to the reproducing kernel Hilbert space (RKHS) K\mathcal{H}_{K} defined by kernel KK. This space has norm PK=κ(P,P)\|P\|_{\mathcal{H}_{K}}=\sqrt{\kappa(P,P)} defined on a set of points PP and inner product κ(P,P)\kappa(P,P). Let ΦK(P)=1|P|pPϕK(p)\Phi_{K}(P)=\frac{1}{|P|}\sum_{p\in P}\phi_{K}(p) be the representation of a set of points PP in K\mathcal{H}_{K}. Note that DK(P,Q)=ΦK(P)ΦK(Q)KD_{K}(P,Q)=\|\Phi_{K}(P)-\Phi_{K}(Q)\|_{\mathcal{H}_{K}}. We can now apply Lemma C.1 to {ϕK(p)}pP\{\phi_{K}(p)\}_{p\in P} with weights w(p)=1/|P|w(p)=1/|P| and r=ΦK(P)r=\Phi_{K}(P), and norm η=σ\eta=\sigma. Hence κ(P,P)=PK2σ2DK(P,p^)2\kappa(P,P)=\|P\|_{\mathcal{H}_{K}}^{2}\leq\sigma^{2}-D_{K}(P,\hat{p})^{2}.

Lemma C.5.

For any s>0s>0 and any xs2x\leq s^{2}, we have s2xsx/2s\sqrt{s^{2}-x}\leq s-x/2s.

Proof C.6.

We expand the square of the desired result

(s2x)(sx/2s)2=s2x+x2/4s2.(s^{2}-x)\leq(s-x/2s)^{2}=s^{2}-x+x^{2}/4s^{2}.

After subtracting (s2x)(s^{2}-x) from both sides, it is equivalent to 0x2/4s20\leq x^{2}/4s^{2}. This holds since x2x^{2} and ss are always nonnegative.

Lemma C.7.

DK(P,x)DK(p,x)2/CσD_{K}(P,x)\geq D_{K}(p^{\star},x)^{2}/C_{\sigma} for Cσ=2σ+2C_{\sigma}=2\sigma+2.

Proof C.8.

Refer to Figure 5 for geometric intuition in this proof. Let ν0\nu_{0} be a measure that is ν0(p)=0\nu_{0}(p)=0 for all pdp\in{{\mathbb{R}}}^{d}; thus it has a norm κ(ν0,ν0)=0\kappa(\nu_{0},\nu_{0})=0. We can measure the distance from ν0\nu_{0} to xx and PP, noting that DK(ν0,x)=κ(x,x)=σD_{K}(\nu_{0},x)=\sqrt{\kappa(x,x)}=\sigma and DK(ν0,P)=κ(P,P)D_{K}(\nu_{0},P)=\sqrt{\kappa(P,P)}. Thus by triangle inequality, Lemma C.3, and Lemma C.5,

DK(P,x)\displaystyle D_{K}(P,x) DK(ν0,x)DK(ν0,P)\displaystyle\geq D_{K}(\nu_{0},x)-D_{K}(\nu_{0},P)
=σκ(P,P)\displaystyle=\sigma-\sqrt{\kappa(P,P)}
σσ2DK(P,p^)2\displaystyle\geq\sigma-\sqrt{\sigma^{2}-D_{K}(P,\hat{p})^{2}}
DK(P,p^)2/2σ.\displaystyle\geq D_{K}(P,\hat{p})^{2}/2\sigma.
Figure 5: Illustration of xx, pp^{\star}, p^\hat{p}, ν0\nu_{0}, and PP as vectors in a RKHS. Note we have omitted the ϕK\phi_{K} and ΦK\Phi_{K} maps to unclutter the notation.

We now assume that DK(P,x)<DK(p,x)/CσD_{K}(P,x)<D_{K}(p^{\star},x)/C_{\sigma} and show this is not possible. First observe that DK(P,p^)+DK(P,x)DK(p^,x)DK(p,x)D_{K}(P,\hat{p})+D_{K}(P,x)\geq D_{K}(\hat{p},x)\geq D_{K}(p^{\star},x). These expressions imply that DK(P,p^)DK(p,x)DK(P,x)(11/Cσ)DK(p,x)D_{K}(P,\hat{p})\geq D_{K}(p^{\star},x)-D_{K}(P,x)\geq(1-1/C_{\sigma})D_{K}(p^{\star},x), and thus

DK(P,x)12σDK(P,p^)212σ(11Cσ)2DK(p,x)21CσDK(p,x)2,D_{K}(P,x)\geq\frac{1}{2\sigma}D_{K}(P,\hat{p})^{2}\geq\frac{1}{2\sigma}\left(1-\frac{1}{C_{\sigma}}\right)^{2}D_{K}(p^{\star},x)^{2}\geq\frac{1}{C_{\sigma}}D_{K}(p^{\star},x)^{2},

a contradiction. The last steps follows by setting

12σ(11Cσ)2\displaystyle\frac{1}{2\sigma}\left(1-\frac{1}{C_{\sigma}}\right)^{2} 1CσCσ2(2+2σ)Cσ+10\displaystyle\geq\frac{1}{C_{\sigma}}\Rightarrow C_{\sigma}^{2}-(2+2\sigma)C_{\sigma}+1\geq 0

and solving for CσC_{\sigma},

Cσ(2+2σ)+(2+2σ)242=1+σ+σ2+2σ=1+σ+(σ+1)21.\displaystyle C_{\sigma}\geq\frac{(2+2\sigma)+\sqrt{(2+2\sigma)^{2}-4}}{2}=1+\sigma+\sqrt{\sigma^{2}+2\sigma}=1+\sigma+\sqrt{(\sigma+1)^{2}-1}.

Since Cσ=2σ+2>1+σ+(σ+1)21C_{\sigma}=2\sigma+2>1+\sigma+\sqrt{(\sigma+1)^{2}-1}, so we have 12σ(11Cσ)21Cσ\frac{1}{2\sigma}\left(1-\frac{1}{C_{\sigma}}\right)^{2}\geq\frac{1}{C_{\sigma}}.

Recall that an ε\varepsilon-kernel sample PP of μ\mu satisfies maxxd|κ(μ,x)κ(μP,x)|ε.\max_{x\in{{\mathbb{R}}}^{d}}|\kappa(\mu,x)-\kappa(\mu_{P},x)|\leq\varepsilon.

Theorem C.9.

If DK(P,x)1D_{K}(P,x)\geq 1 then fPk(P,x)6σ+8DK(P,x){f^{\textsc{k}}_{P}}(P,x)\leq\sqrt{6\sigma+8}D_{K}(P,x). If PP is an (ε/4)(\varepsilon/4)-kernel sample of μ\mu then fPk(μ,x)6σ+8(DK(μ,x)+ε){f^{\textsc{k}}_{P}}(\mu,x)\leq\sqrt{6\sigma+8}(D_{K}(\mu,x)+\varepsilon).

Proof C.10.

We combine Lemma C.7 with Lemma 3.4 to achieve

fPk(P,x)22DK(P,x)2+3DK(p,x)22DK(P,x)2+3(2σ+2)DK(P,x).{f^{\textsc{k}}_{P}}(P,x)^{2}\leq 2D_{K}(P,x)^{2}+3D_{K}(p^{\star},x)^{2}\leq 2D_{K}(P,x)^{2}+3(2\sigma+2)D_{K}(P,x).

Note that the first DK(P,x)D_{K}(P,x) term is squared and the second is not. If DK(P,x)αD_{K}(P,x)\geq\alpha, then DK(P,x)(1/α)DK(P,x)2D_{K}(P,x)\leq(1/\alpha)D_{K}(P,x)^{2}, and we have

fPk(P,x)2(2+(6+6σ)/α)DK(P,x)2.{f^{\textsc{k}}_{P}}(P,x)^{2}\leq(2+(6+6\sigma)/\alpha)D_{K}(P,x)^{2}.

Let α=1\alpha=1. We have

fPk(P,x)2(6σ+8)DK(P,x)2.{f^{\textsc{k}}_{P}}(P,x)^{2}\leq(6\sigma+8)D_{K}(P,x)^{2}.

Since DK(P,x)DK(μ,x)+εD_{K}(P,x)\leq D_{K}(\mu,x)+\varepsilon via Lemma B.1, we obtain

fPk(μ,x)6σ+8(DK(μ,x)+ε).{f^{\textsc{k}}_{P}}(\mu,x)\leq\sqrt{6\sigma+8}(D_{K}(\mu,x)+\varepsilon).

C.2 Approximating the Minimum Kernel Distance Point

The goal in this section is to find a point that approximately minimizes the kernel distance to a point set PP. We assume here PP contains nn points and describes a measure made of nn Dirac masses, one at each pPp\in P with weight 1/n1/n (this is the empirical measure μP{\mu_{P}} defined in Section 1.1). Let p+=argminqdDK(μP,q)=argmaxqdκ(μP,q)p_{+}=\arg\min_{q\in{{\mathbb{R}}}^{d}}D_{K}({\mu_{P}},q)=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa({\mu_{P}},q). Since DK(μP,q)=DK(P,q)D_{K}({\mu_{P}},q)=D_{K}(P,q), for simplicity in notation, we work with the point set PP instead of μP{\mu_{P}} for the remainder of this section. That is, we define p+=argminqdDK(P,q)=argmaxqdκ(P,q)p_{+}=\arg\min_{q\in{{\mathbb{R}}}^{d}}D_{K}(P,q)=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa(P,q). Note that p+p_{+} is chosen over all of d{{\mathbb{R}}}^{d}, as the bound in Theorem C.9 is not sufficient when choosing a point from PP. In particular, for any δ>0\delta>0, we want a point p^+\hat{p}_{+} such that DK(P,p^+)(1+δ)DK(P,p+)D_{K}(P,\hat{p}_{+})\leq(1+\delta)D_{K}(P,p_{+}).

Note that Agarwal et al. [1] provide an algorithm that with high probability finds a point q^\hat{q} such that κ(P,q^)(1δ)κ(P,p+)\kappa(P,\hat{q})\geq(1-\delta)\kappa(P,p_{+}) in time O((1/δ4)nlogn)O((1/\delta^{4})n\log n). However, this point q^\hat{q} is not sufficient for our purpose (that is, q^\hat{q} does not satisfy the condition DK(P,q^)(1+δ)DK(P,p+)D_{K}(P,\hat{q})\leq(1+\delta)D_{K}(P,p_{+})), since q^\hat{q} only yields

D_{K}(P,\hat{q})^{2}\leq\sigma^{2}+\kappa(P,P)-2(1-\delta)\kappa(P,p_{+})\nleq(1+\delta)\big{(}\sigma^{2}+\kappa(P,P)-2\kappa(P,p_{+})\big{)}=(1+\delta)D_{K}(P,p_{+})^{2},

since in general it is not true that 4κ(P,p+)σ2+κ(P,P)4\kappa(P,p_{+})\leq\sigma^{2}+\kappa(P,P), as would be required.

First we need some structural properties. For each point xdx\in{{\mathbb{R}}}^{d}, define a radius rx=argsupr>0{|Br(x)P|n/2r_{x}=\arg\sup_{r>0}\{|B_{r}(x)\cap P|\leq n/2}, where Br(x)B_{r}(x) is a ball of radius rr centered at xx. In other words, it is the largest radius such that at most half of points in PP are within Br(x)B_{r}(x). Let p^2\hat{p}_{2} be the point in PP such that p+p^2=rp+\|p_{+}-\hat{p}_{2}\|=r_{p_{+}}. In other words, p^2\hat{p}_{2} is a point such that no more than n/2n/2 points in PP satisfy p+pp+p^2\|p_{+}-p\|\geq\|p_{+}-\hat{p}_{2}\|. Finally it is useful to define rx,Kr_{x,K} which is rx,K=DK(x,p)r_{x,K}=D_{K}(x,p) where xp=rx\|x-p\|=r_{x}; in particular rp+,K=DK(p+,p^2)r_{p_{+},K}=D_{K}(p_{+},\hat{p}_{2}).

We now need to lower bound DK(P,p+)D_{K}(P,p_{+}) in terms of DK(P,p^2)D_{K}(P,\hat{p}_{2}). Lemma C.7 already provides a bound in terms of the closest point for any xdx\in{{\mathbb{R}}}^{d}. We follow a similar construction here.

Lemma C.11.

Consider a set V={v1,,vn}V=\{v_{1},\ldots,v_{n}\} of vectors in a Hilbert space endowed with norm \|\cdot\| and inner product ,\langle\cdot,\cdot\rangle. Let each viv_{i} have norm vi=η\|v_{i}\|=\eta. Consider weights W={w1,,wn}W=\{w_{1},\ldots,w_{n}\} such that 1/2wi01/2\geq w_{i}\geq 0 and i=1nwi=1\sum_{i=1}^{n}w_{i}=1. Let r=i=1nwivir=\sum_{i=1}^{n}w_{i}v_{i}. Define a partition of VV with V1V_{1} and V2V_{2} such that V2V_{2} is the smallest set such that viV2wi1/2\sum_{v_{i}\in V_{2}}w_{i}\geq 1/2, and for all v1V1v_{1}\in V_{1} and v2V2v_{2}\in V_{2} we have rv1<rv2\|r-v_{1}\|<\|r-v_{2}\|. Let v^2=argminviV2vir\hat{v}_{2}=\arg\min_{v_{i}\in V_{2}}\|v_{i}-r\|. Then

r2η2rv^222.\|r\|^{2}\leq\eta^{2}-\frac{\|r-\hat{v}_{2}\|^{2}}{2}.
Proof C.12.

For ease of notation, we assume that vi,r>vi+1,r\langle v_{i},r\rangle>\langle v_{i+1},r\rangle for all ii, and let {v1,,vk}=V1\{v_{1},\ldots,v_{k}\}=V_{1}. Let v^1=argminviV1vir=argminviVvir\hat{v}_{1}=\arg\min_{v_{i}\in V_{1}}\|v_{i}-r\|=\arg\min_{v_{i}\in V}\|v_{i}-r\|. Let v^\hat{v} be a norm η\eta vector that has v^,r=(v^1,r+v^2,r)/2\langle\hat{v},r\rangle=(\langle\hat{v}_{1},r\rangle+\langle\hat{v}_{2},r\rangle)/2. Since viV2wi1/2\sum_{v_{i}\in V_{2}}w_{i}\geq 1/2 and viV1wi1/2\sum_{v_{i}\in V_{1}}w_{i}\leq 1/2, let viV2wi=1/2+δ\sum_{v_{i}\in V_{2}}w_{i}=1/2+\delta and viV1wi=1/2δ\sum_{v_{i}\in V_{1}}w_{i}=1/2-\delta (for 0δ1/2)0\leq\delta\leq 1/2). By definition, we also have v^1,rv^2,r\langle\hat{v}_{1},r\rangle\geq\langle\hat{v}_{2},r\rangle. We can decompose rr as

r2=r,r\displaystyle\|r\|^{2}=\langle r,r\rangle =i=1nwivi,r=i=1kwivi,r+i=k+1nwivi,r\displaystyle=\sum_{i=1}^{n}w_{i}\langle v_{i},r\rangle=\sum_{i=1}^{k}w_{i}\langle v_{i},r\rangle+\sum_{i={k+1}}^{n}w_{i}\langle v_{i},r\rangle
i=1kwiv^1,r+i=k+1nwiv^2,r=(i=1kwi)v^1,r+(i=k+1nwi)v^2,r\displaystyle\leq\sum_{i=1}^{k}w_{i}\langle\hat{v}_{1},r\rangle+\sum_{i=k+1}^{n}w_{i}\langle\hat{v}_{2},r\rangle=\left(\sum_{i=1}^{k}w_{i}\right)\langle\hat{v}_{1},r\rangle+\left(\sum_{i=k+1}^{n}w_{i}\right)\langle\hat{v}_{2},r\rangle
=(1/2δ)v^1,r+(1/2+δ)v^2,r=(1/2)(v^1,r+v^2,r)+δ(v^2,rv^1,r)\displaystyle=(1/2-\delta)\langle\hat{v}_{1},r\rangle+(1/2+\delta)\langle\hat{v}_{2},r\rangle=(1/2)(\langle\hat{v}_{1},r\rangle+\langle\hat{v}_{2},r\rangle)+\delta(\langle\hat{v}_{2},r\rangle-\langle\hat{v}_{1},r\rangle)
(v^1,r+v^2,r)/2=v^,r\displaystyle\leq(\langle\hat{v}_{1},r\rangle+\langle\hat{v}_{2},r\rangle)/2=\langle\hat{v},r\rangle
=12(r2+v^2v^r2).\displaystyle=\frac{1}{2}(\|r\|^{2}+\|\hat{v}\|^{2}-\|\hat{v}-r\|^{2}).

The last equality holds since v^r2=r2+v^22v^,r\|\hat{v}-r\|^{2}=\|r\|^{2}+\|\hat{v}\|^{2}-2\langle\hat{v},r\rangle. Then, since v^=η\|\hat{v}\|=\eta, we can solve for r2\|r\|^{2} as

r2η2v^r2=η2(v^2r2+v^1r2)/2η2v^2r2/2.\|r\|^{2}\leq\eta^{2}-\|\hat{v}-r\|^{2}=\eta^{2}-(\|\hat{v}_{2}-r\|^{2}+\|\hat{v}_{1}-r\|^{2})/2\leq\eta^{2}-\|\hat{v}_{2}-r\|^{2}/2.
Lemma C.13.

Using p^2\hat{p}_{2} as defined above, then κ(P,P)σ2DK(P,p^2)2/2\kappa(P,P)\leq\sigma^{2}-D_{K}(P,\hat{p}_{2})^{2}/2.

Proof C.14.

Let ϕK:dK\phi_{K}:{{\mathbb{R}}}^{d}\to\mathcal{H}_{K} map points in d{{\mathbb{R}}}^{d} to the reproducing kernel Hilbert space (RKHS) K\mathcal{H}_{K} defined by kernel KK. This space has norm PK=κ(P,P)\|P\|_{\mathcal{H}_{K}}=\sqrt{\kappa(P,P)} defined on a set of points PP and inner product κ(P,P)\kappa(P,P). Let ΦK(P)=1|P|pPϕK(p)\Phi_{K}(P)=\frac{1}{|P|}\sum_{p\in P}\phi_{K}(p) be the representation of a set of points PP in K\mathcal{H}_{K}. Note that DK(P,Q)=ΦK(P)ΦK(Q)KD_{K}(P,Q)=\|\Phi_{K}(P)-\Phi_{K}(Q)\|_{\mathcal{H}_{K}}. We can now apply Lemma C.11 to {ϕK(p)}pP\{\phi_{K}(p)\}_{p\in P} with weights w(p)=1/|P|w(p)=1/|P| and r=ΦK(P)r=\Phi_{K}(P), and norm η=σ\eta=\sigma. Finally note that we can use ϕK(p^2)=v^2\phi_{K}(\hat{p}_{2})=\hat{v}_{2} since V2V_{2} represents the set of points which are further or equal to PP than is p^2\hat{p}_{2}. In addition, by the property of RKHS, ΦK(P)ϕK(p^2)=DK(P,p^2)\|\Phi_{K}(P)-\phi_{K}(\hat{p}_{2})\|=D_{K}(P,\hat{p}_{2}). Hence κ(P,P)=PK2σ2DK(P,p^2)2/2\kappa(P,P)=\|P\|_{\mathcal{H}_{K}}^{2}\leq\sigma^{2}-D_{K}(P,\hat{p}_{2})^{2}/2.

Lemma C.15.

DK(P,p+)DK(p+,p^2)2/(4σ)D_{K}(P,p_{+})\geq D_{K}(p_{+},\hat{p}_{2})^{2}/(4\sigma).

Proof C.16.

Refer to Figure 5 for geometric intuition in this proof. Let ν0\nu_{0} be a measure that is ν0(p)=0\nu_{0}(p)=0 for all pdp\in{{\mathbb{R}}}^{d}; thus it has a norm κ(ν0,ν0)=0\kappa(\nu_{0},\nu_{0})=0. We can measure the distance from ν0\nu_{0} to p+p_{+} and PP, noting that DK(ν0,x)=κ(x,x)=σD_{K}(\nu_{0},x)=\sqrt{\kappa(x,x)}=\sigma and DK(P,ν0)=κ(P,P)D_{K}(P,\nu_{0})=\sqrt{\kappa(P,P)}. Thus by triangle inequality, Lemma C.13, and Lemma C.5

DK(P,p+)\displaystyle D_{K}(P,p_{+}) DK(ν0,p+)DK(P,ν0)\displaystyle\geq D_{K}(\nu_{0},p_{+})-D_{K}(P,\nu_{0})
=σκ(P,P)\displaystyle=\sigma-\sqrt{\kappa(P,P)}
σσ2DK(P,p^2)2/2\displaystyle\geq\sigma-\sqrt{\sigma^{2}-D_{K}(P,\hat{p}_{2})^{2}/2}
DK(P,p^2)2/4σ.\displaystyle\geq D_{K}(P,\hat{p}_{2})^{2}/4\sigma.
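The final step in this chain uses the elementary bound \sigma-\sqrt{\sigma^{2}-c}\geq c/(2\sigma) for 0\leq c\leq\sigma^{2}, applied here with c=D_{K}(P,\hat{p}_{2})^{2}/2. Indeed, since \sigma-c/(2\sigma)\geq 0, squaring both sides shows

\sigma-\frac{c}{2\sigma}\geq\sqrt{\sigma^{2}-c}\quad\Longleftrightarrow\quad\sigma^{2}-c+\frac{c^{2}}{4\sigma^{2}}\geq\sigma^{2}-c,

and the latter inequality always holds.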

Now we place a net \mathcal{N} on {\mathbb{R}}^{d}; specifically, \mathcal{N} is a set of points such that some q\in\mathcal{N} satisfies \|q-p_{+}\|\leq\delta D_{K}(p_{+},\hat{p}_{2})^{2}/4\sigma\leq\delta D_{K}(P,p_{+}) (we refer to this inequality as the net condition). Since D_{K}(P,\cdot) is 1-Lipschitz, we have D_{K}(P,q)-D_{K}(P,p_{+})\leq\|q-p_{+}\|. This ensures that some point q\in\mathcal{N} satisfies D_{K}(P,q)\leq(1+\delta)D_{K}(P,p_{+}), and hence can serve as \hat{p}_{+}. In other words, \mathcal{N} is guaranteed to contain a point q that can serve as \hat{p}_{+}.

Note that p+p_{+} must be in CH(P){\textsc{\small CH}}(P), the convex hull of PP. Otherwise, moving to the closest point on CH(P){\textsc{\small CH}}(P) decreases the distance to all points, and thus increases κ(P,p+)\kappa(P,p_{+}), which cannot happen by definition of p+p_{+}. Let Δ\Delta be the diameter of PP (the distance between the two furthest points). Clearly for some pPp\in P we must have p+pΔ\|p_{+}-p\|\leq\Delta.

Also note that p+:=argmaxqdκ(P,q)p_{+}:=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa(P,q) must be within a distance Rσ=σ2ln(n)R_{\sigma}=\sigma\sqrt{2\ln(n)} to some pPp\in P, otherwise for p=argminpPp+pp^{\star}=\arg\min_{p\in P}\|p_{+}-p\|, we can bound κ(P,p+)K(p,p+)σ2/n=K(p,p)/nκ(P,p),\kappa(P,p_{+})\leq K(p^{\star},p_{+})\leq\sigma^{2}/n=K(p^{\star},p^{\star})/n\leq\kappa(P,p^{\star}), which means p+p_{+} is not a maximum. The first inequality is by definition of pp^{*}, the second by assuming p+pσ2ln(n)\|p_{+}-p^{\star}\|\geq\sigma\sqrt{2\ln(n)}.
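Concretely, recalling the Gaussian kernel K(x,y)=\sigma^{2}\exp(-\|x-y\|^{2}/2\sigma^{2}) (so that K(x,x)=\sigma^{2}), the second inequality follows from

K(p^{\star},p_{+})=\sigma^{2}\exp\left(-\frac{\|p_{+}-p^{\star}\|^{2}}{2\sigma^{2}}\right)\leq\sigma^{2}\exp\left(-\frac{2\sigma^{2}\ln(n)}{2\sigma^{2}}\right)=\sigma^{2}e^{-\ln(n)}=\frac{\sigma^{2}}{n}.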

Let B_{R}(p) be the ball centered at p with radius R=\min(R_{\sigma},\Delta), and let R_{p}=\min(R,r_{p}/2). So p_{+} must be in \bigcup_{p\in P}B_{R}(p). We describe the construction of a net \mathcal{N}_{p} for one ball B_{R}(p); that is, for any x such that p\in P is the closest point to x, some point q\in\mathcal{N}_{p} satisfies \|q-x\|\leq\delta(r_{x,K})^{2}/4\sigma. Thus if this point x=p_{+}, the correct property holds, and we can use the corresponding q as \hat{p}_{+}. Then \mathcal{N}=\bigcup_{p\in P}\mathcal{N}_{p} is at most n times the size of a single \mathcal{N}_{p}. Let k_{p} be the smallest integer k such that r_{p}/2\geq R/2^{k}. The net \mathcal{N}_{p} will be composed of \mathcal{N}_{p}=\bigcup_{i=0}^{k_{p}}\mathcal{N}_{i}=\mathcal{N}_{0}\cup\mathcal{N}^{\prime}_{p}, where \mathcal{N}^{\prime}_{p}=\bigcup_{i=1}^{k_{p}}\mathcal{N}_{i}.

Before we proceed with the construction, we need an assumption: that \Lambda_{P}=\min_{p\in P}r_{p} is a bounded quantity; that is, it is not too small. In other words, no point has more than half of the points within an absolute radius \Lambda_{P}. We call \Lambda_{P} the median concentration.

Lemma C.17.

A net 𝒩0\mathcal{N}_{0} can be constructed of size O((σ/δΛP)d+logd/2(n))O((\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)) so that all points xBRp(p)x\in B_{R_{p}}(p) satisfy qxδ(rx,K)2/4σ\|q-x\|\leq\delta(r_{x,K})^{2}/4\sigma for some q𝒩0q\in\mathcal{N}_{0}.

If x=p_{+}, then such a point satisfies the net condition; that is, there is a point q\in\mathcal{N}_{0} such that \|q-x\|=\|q-p_{+}\|\leq\delta(r_{p_{+},K})^{2}/(4\sigma)=\delta D_{K}(p_{+},\hat{p}_{2})^{2}/(4\sigma)\leq\delta D_{K}(P,p_{+}).

Proof C.18.

For all points x\in B_{R_{p}}(p)\subseteq B_{r_{p}/2}(p), we must have r_{x}\geq r_{p}/2; otherwise B_{r_{p}/2}(x) would be completely inside B_{r_{p}}(p) and could not contain enough points. Within B_{R_{p}}(p) we place the net \mathcal{N}_{0} so that all points x\in B_{R_{p}}(p) satisfy \|x-q\|\leq\min(\delta r_{p}^{2}/32\sigma,\sqrt{3}\sigma) for some q\in\mathcal{N}_{0}. Now \delta r_{p}^{2}/32\sigma\leq\delta r_{x}^{2}/8\sigma, and since \|x-y\|^{2}/2\leq D_{K}(x,y)^{2} for \|x-y\|\leq\sqrt{3}\sigma (via Lemma 5.5), the net ensures that if p_{+}\in B_{R_{p}}(p), then some q\in\mathcal{N}_{0} is sufficiently close to p_{+}.

Since B_{R_{p}}(p) fits in an axis-aligned box of side length \min(2R_{\sigma},r_{p}), we can describe \mathcal{N}_{0} as an axis-aligned grid with g points along each axis. We bound g in two cases. When \delta r_{p}^{2}/32\sigma<\sqrt{3}\sigma, we can set

g=\frac{R_{p}}{\delta r_{p}^{2}/(32\sigma\sqrt{d})}\leq\frac{32\sigma\sqrt{d}}{\delta r_{p}}=O(\sigma/\delta r_{p})=O(\sigma/\delta\Lambda_{P}).

Otherwise,

g=Rp3σ/dσ2ln(n)3σ/d=2dln(n)/3=O(log(n)).g=\frac{R_{p}}{\sqrt{3}\sigma/\sqrt{d}}\leq\frac{\sigma\sqrt{2\ln(n)}}{\sqrt{3}\sigma/\sqrt{d}}=\sqrt{2d\ln(n)/3}=O(\sqrt{\log(n)}).

Then we need |\mathcal{N}_{0}|=O(g^{d})=O((\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)).

When r_{p}/2<R we still need to handle points x in the annulus A_{p}=B_{R}(p)\setminus B_{r_{p}/2}(p). For a point x\in A_{p}, if p=\arg\min_{p^{\prime}\in P}\|x-p^{\prime}\| then r_{x}\geq\|x-p\|. We only need the net \mathcal{N}_{p}^{\prime} on A_{p} to cover those points for which p is the closest point; the others will be handled by the net \mathcal{N}_{p^{\prime}} of some other p^{\prime}\in P with p^{\prime}\neq p.

Recall kpk_{p} is the smallest integer kk such that rp/2R/2kr_{p}/2\geq R/2^{k}.

Lemma C.19.

A net 𝒩p\mathcal{N}_{p}^{\prime} can be constructed of size O(kp+(σ/δΛP)d+logd/2(n))O(k_{p}+(\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)) so that all points xApx\in A_{p} where p=argminpPxpp=\arg\min_{p^{\prime}\in P}\|x-p^{\prime}\|, satisfy qxδ(rx,K)2/4σ\|q-x\|\leq\delta(r_{x,K})^{2}/4\sigma for some q𝒩pq\in\mathcal{N}_{p}^{\prime}.

If x=p_{+}, then such a point satisfies the net condition; that is, there is a point q\in\mathcal{N}^{\prime}_{p} such that \|q-x\|=\|q-p_{+}\|\leq\delta(r_{p_{+},K})^{2}/(4\sigma)=\delta D_{K}(p_{+},\hat{p}_{2})^{2}/(4\sigma)\leq\delta D_{K}(P,p_{+}).

Proof C.20.

We now consider the k_{p} annuli \{A_{1},\ldots,A_{k_{p}}\} which cover A_{p}. Each A_{i}=\{x\in{\mathbb{R}}^{d}\mid R/2^{i-1}\geq\|p-x\|>R/2^{i}\} has volume O((R/2^{i-1})^{d}). For any x\in A_{i} we have r_{x}\geq\|x-p\|\geq R/2^{i}, so the Euclidean distance to the nearest q\in\mathcal{N}_{i} can be at most \min(\sqrt{3}\sigma,\delta(R/2^{i})^{2}/8\sigma). Thus we can cover A_{i} with a net \mathcal{N}_{i} of size t_{i}, based on two cases again. If \delta(R/2^{i})^{2}/8\sigma<\sqrt{3}\sigma then

ti=O(1+(R2i/(δσ(R2i)2))d)=O(1+(2iRσδ)d)=O(1)+O((σδR)d(2d)i).t_{i}=O\left(1+\left(\frac{R}{2^{i}}/\left(\frac{\delta}{\sigma}\left(\frac{R}{2^{i}}\right)^{2}\right)\right)^{d}\right)=O\left(1+\left(\frac{2^{i}}{R}\frac{\sigma}{\delta}\right)^{d}\right)=O(1)+O\left(\left(\frac{\sigma}{\delta R}\right)^{d}(2^{d})^{i}\right).

Otherwise

t_{i}=O\left(1+\left(\frac{R/2^{i}}{\sqrt{3}\sigma}\right)^{d}\right)\leq O\left(1+\left(\frac{R_{\sigma}}{2^{i}\sqrt{3}\sigma}\right)^{d}\right)=O(1)+O\left(\frac{\log^{d/2}(n)}{(2^{d})^{i}}\right),
recalling that R\leq R_{\sigma}=\sigma\sqrt{2\ln(n)}.

Since, by the minimality of k_{p}, R/2^{k_{p}}>r_{p}/4\geq\Lambda_{P}/4, the total size of \mathcal{N}_{p}^{\prime}, the union of all of these nets, is \sum_{i=1}^{k_{p}}t_{i}\leq O(k_{p})+2t_{k_{p}}+2t_{1}=O(k_{p}+(\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)). In the first case t_{k_{p}} dominates the cost, and in the second case t_{1} does.

Thus the total size of \mathcal{N}_{p} is O((\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)+k_{p}), where k_{p}\leq\log(R/r_{p})+2. It just remains to bound k_{p}. Given that no more than n/2 points are co-located at the same location (which already holds since \Lambda_{P} is a bounded quantity), for all p\in P we have r_{p}\geq\min_{q\neq q^{\prime}\in P}\|q-q^{\prime}\|. The value \beta_{P}=\Delta/\min_{q\neq q^{\prime}\in P}\|q-q^{\prime}\| is known as the spread of the point set; it is common to assume it is bounded (it is related to the precision of the coordinates), so that \log(\beta_{P}) is not too large. Thus we can bound k_{p}=O(\log(\beta_{P})).

Theorem C.21.

Consider a point set PdP\subset{{\mathbb{R}}}^{d} with nn points, spread βP\beta_{P}, and median concentration ΛP\Lambda_{P}. For any δ>0\delta>0, in time O(n2((σ/δΛP)d+logd/2(n)+log(βP)))O(n^{2}((\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)+\log(\beta_{P}))) we can find a point p^+\hat{p}_{+} such that DK(P,p^+)(1+δ)DK(P,p+)D_{K}(P,\hat{p}_{+})\leq(1+\delta)D_{K}(P,p_{+}).

Proof C.22.

Using Lemma C.17 and Lemma C.19 we can build a net \mathcal{N} of size O(n((\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)+\log(\beta_{P}))) such that some q\in\mathcal{N} satisfies \|q-p_{+}\|\leq\delta D_{K}(p_{+},\hat{p}_{2})^{2}/4\sigma, which by Lemma C.15 is at most \delta D_{K}(P,p_{+}). Since D_{K}(P,\cdot) is 1-Lipschitz, this q satisfies D_{K}(P,q)\leq(1+\delta)D_{K}(P,p_{+}).

We can find such a q and return it as \hat{p}_{+} by evaluating \kappa(P,q) for all q\in\mathcal{N} and taking the one with the largest value. This takes O(n) time for each q\in\mathcal{N}.
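The following Python sketch illustrates the overall approach in a simplified form: it builds a crude candidate grid around each input point, standing in for the nets \mathcal{N}_{p} of Lemmas C.17 and C.19 (without their careful multi-resolution sizing, so it carries no approximation guarantee), evaluates \kappa(P,q) on every candidate, and returns the best one. The function names, grid resolution, and search radius are illustrative assumptions.

import numpy as np

def kappa_point(P, q, sigma):
    # kappa(P, q) = (1/n) sum_p K(p, q), with the Gaussian kernel K(p, q) = sigma^2 exp(-||p-q||^2 / (2 sigma^2))
    sq = np.sum((P - q) ** 2, axis=1)
    return np.mean(sigma ** 2 * np.exp(-sq / (2 * sigma ** 2)))

def approx_p_plus(P, sigma, grid_per_point=5, radius_factor=1.0):
    # Simplified stand-in for the net N = union_p N_p: a small axis-aligned grid
    # of half-width radius_factor * sigma placed around every point of P.
    n, d = P.shape
    offsets = np.linspace(-radius_factor * sigma, radius_factor * sigma, grid_per_point)
    mesh = np.meshgrid(*([offsets] * d))
    grid = np.stack([m.ravel() for m in mesh], axis=1)            # (grid_per_point^d, d)
    candidates = (P[:, None, :] + grid[None, :, :]).reshape(-1, d)
    vals = np.array([kappa_point(P, q, sigma) for q in candidates])
    return candidates[np.argmax(vals)]                            # candidate maximizing kappa(P, .)

rng = np.random.default_rng(1)
P = rng.normal(size=(100, 2))
p_hat_plus = approx_p_plus(P, sigma=0.5)
print(p_hat_plus, kappa_point(P, p_hat_plus, sigma=0.5))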

We claim that in many realistic settings \sigma/\Lambda_{P}=O(1). In such a case the algorithm runs in O(n^{2}(1/\delta^{d}+\log^{d/2}n+\log(\beta_{P}))) time. If instead \Lambda_{P}=o(\sigma), then over half of the measure described by P essentially behaves as a single point. In many settings P is drawn uniformly from a compact set S; choosing \sigma so large that more than half of S has negligible diameter compared to \sigma will cause that data to be over-smoothed. In fact, the definition of \Lambda_{P} can be modified so that this radius never contains more than \tau n points for any constant \tau<1, and the bounds do not change asymptotically.

Appendix D Details on Reconstruction Properties of Kernel Distance

In this section we provide the full proof for some statements from Section 4.

D.1 Topological Estimates using Kernel Power Distance

For the persistence diagrams of the sublevel sets filtration of d^{K}_{\mu} and of the weighted Rips filtration \{R_{\alpha}(P,d^{K}_{\mu})\} to be well-defined, we need the technical condition (proved in Lemmas D.1 and D.3) that these filtrations are q-tame. Recall that a filtration F is q-tame if, for any \alpha<\beta, the homomorphism from \textsf{H}(F_{\alpha}) to \textsf{H}(F_{\beta}) induced by the canonical inclusion has finite rank [12, 16].

Lemma D.1.

The sublevel sets filtration of dμKd^{K}_{\mu} is qq-tame.

Proof D.2.

The proof resembles the proof of q-tameness for the distance-to-a-measure sublevel sets filtration (Proposition 12, [8]). We have shown that d^{K}_{\mu} is 1-Lipschitz and proper. Properness implies that any sublevel set A:=(d^{K}_{\mu})^{-1}([0,\alpha]) (for \alpha<c_{\mu}) is compact. Since {\mathbb{R}}^{d} is triangulable (i.e., homeomorphic to a locally finite simplicial complex), there exists a homeomorphism h from {\mathbb{R}}^{d} to a locally finite simplicial complex C. For any \alpha>0, since A is compact, we consider the restriction of C to a finite simplicial complex C_{\alpha} that contains h(A). The function (d^{K}_{\mu}\circ h^{-1})\mid_{C_{\alpha}} is continuous on C_{\alpha}; therefore its sublevel sets filtration is q-tame by Theorem 2.22 of [16], which states that the sublevel sets filtration of a continuous function (defined on a realization of a finite simplicial complex) is q-tame. Extending the above construction to any \alpha, the sublevel sets filtration of d^{K}_{\mu}\circ h^{-1} is therefore q-tame. As homology is preserved by the homeomorphism h, this implies that the sublevel sets filtration of d^{K}_{\mu} is q-tame.

Setting μ=μP\mu={\mu_{P}}, Lemma D.1 implies that the sublevel sets filtration of dμPKd^{K}_{\mu_{P}} is also qq-tame.

Lemma D.3.

The weighted Rips filtration {Rα(P,dμK)}\{R_{\alpha}(P,d^{K}_{\mu})\} is qq-tame for compact subset PdP\subset{{\mathbb{R}}}^{d}.

Proof D.4.

Since P is a compact subset of {\mathbb{R}}^{d}, the weighted Rips filtration \{R_{\alpha}(P,d^{K}_{\mu})\} is q-tame (so \textsf{Dgm}(\{R_{\alpha}(P,d^{K}_{\mu})\}) is well-defined) by Proposition 32 of [16], which states that the weighted Rips filtration of a compact subset P of a metric space, with respect to its corresponding weight function, is q-tame.

Setting P=P^+P=\hat{P}_{+}, μ=μP\mu={\mu_{P}}, Lemma D.3 implies that the weighted Rips filtration {Rα(P^+,dμPK)}\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\} is well-defined.

D.2 Inference of Compact Set SS with the Kernel Distance

Suppose μ\mu is a uniform measure on a compact set SS in d{{\mathbb{R}}}^{d}. We now compare the kernel distance dμKd^{K}_{\mu} with the distance function fSf_{S} to the support SS of μ\mu. We show how dμKd^{K}_{\mu} approximates fSf_{S}, and thus allows one to infer geometric properties of SS from samples from μ\mu.

For a point x\in{\mathbb{R}}^{d}, the distance function f_{S} measures the minimum distance between x and any point in S, f_{S}(x)=\inf_{y\in S}\|x-y\|. The point x_{S} that realizes the minimum in the definition of f_{S}(x) is the orthogonal projection of x onto S. The set of points x\in{\mathbb{R}}^{d} that have more than one projection onto S is the medial axis of S [54], denoted {{M}}(S). Since {{M}}(S) resides in the unbounded component of {\mathbb{R}}^{d}\setminus S, it is referred to as the outer medial axis, similar to the concept in [28]. The reach of S is the minimum distance between a point in S and a point in its medial axis, denoted {\sf reach}(S). Similarly, one can define the medial axis of {\mathbb{R}}^{d}\setminus S (i.e., the inner medial axis, which resides in the interior of S) following the definitions in [53], and denote its associated reach as {\sf reach}({\mathbb{R}}^{d}\setminus S). The concepts of reach associated with the inner and outer medial axes of S capture curvature information of the compact set.

Recall that a generalized gradient of a distance function and its corresponding flow are described in [14], and later adapted for distance-like functions in [15]. Let f_{S}:{\mathbb{R}}^{d}\to{\mathbb{R}} be the distance function associated with a compact set S of {\mathbb{R}}^{d}. It is not differentiable on the medial axis of S. It is possible to define a generalized gradient function {\nabla_{S}{}}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{d} that coincides with the usual gradient of f_{S} where f_{S} is differentiable, is defined everywhere, and can be integrated into a continuous flow \Phi^{t}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{d}. Such a flow points away from S, towards local maxima of f_{S} (which belong to the medial axis of S) [54]. The integral (flow) line \gamma of this flow starting at a point in {\mathbb{R}}^{d} can be parameterized by arc length, \gamma:[a,b]\to{\mathbb{R}}^{d}, and we have f_{S}(\gamma(b))=f_{S}(\gamma(a))+\int_{a}^{b}\|{\nabla_{S}{}}(\gamma(t))\|\,dt.
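The following rough Python sketch mimics this setup numerically: it approximates f_{S} for a compact set S represented by a dense point sample and takes Euler steps along the (generalized) gradient away from S. The circle sample, the step size, and the function names are illustrative assumptions, not part of the construction above.

import numpy as np
from scipy.spatial import cKDTree

# Dense sample standing in for a compact set S (here: the unit circle).
theta = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
S_sample = np.column_stack([np.cos(theta), np.sin(theta)])
tree = cKDTree(S_sample)

def f_S(x):
    # distance function to (the sample of) S
    dist, _ = tree.query(x)
    return dist

def flow_step(x, h=0.01):
    # one Euler step of the flow: move away from the projection x_S of x onto S
    dist, idx = tree.query(x)
    grad = (x - S_sample[idx]) / max(dist, 1e-12)  # usual gradient of f_S off the medial axis
    return x + h * grad

x = np.array([1.3, 0.2])
for _ in range(50):
    x = flow_step(x)
print(x, f_S(x))  # f_S increases along the flow line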

Figure 6: Illustrations of the geometric inference of SS from dμKd^{K}_{\mu} at three scales.
Lemma D.5 (Lemma 4.5).

Given any flow line \gamma associated with the generalized gradient function {\nabla_{S}{}}, d^{K}_{\mu}(x) is strictly monotonically increasing along \gamma for x sufficiently far away from the medial axis of S, provided \sigma\leq\frac{R}{6\Delta_{G}} and f_{S}(x)\in(0.014R,2\sigma). Here B(\sigma/2) denotes a ball of radius \sigma/2, G:=\frac{{\textsf{Vol}}(B(\sigma/2))}{{\textsf{Vol}}(S)}, \Delta_{G}:=\sqrt{12+3\ln(4/G)}, and we assume R:=\min({\sf reach}(S),{\sf reach}({\mathbb{R}}^{d}\setminus S))>0.

Proof D.6.

Since dμK(x)d^{K}_{\mu}(x) is always positive, and dμK(x)=cμ2kdeμ(x)d^{K}_{\mu}(x)=\sqrt{c_{\mu}-2\textsc{kde}_{\mu}(x)} where cμc_{\mu} is a constant that depends only on μ\mu, KK, and σ\sigma, then it is sufficient to show that kdeμ(x)\textsc{kde}_{\mu}(x) is strictly monotonically decreasing along γ\gamma.

Let u be the negative of the direction of the flow line \gamma at x (i.e., u is a unit vector that points towards S). We show that \textsc{kde}_{\mu}(x) is strictly monotonically increasing along u. Informally, we will observe that all parts of S that are “close” to x are in the direction u, and that these parts dominate the gradient of \textsc{kde}_{\mu}(x) along u. We now make this more formal by describing two quantities, B_{x} and A_{x}, illustrated in Figure 6.

For a point x\in{\mathbb{R}}^{d}, let x_{S}=\arg\min_{x^{\prime}\in S}\|x^{\prime}-x\|; since x is not on the medial axis of S, x_{S} is uniquely defined and u points in the direction (x_{S}-x)/f_{S}(x). First, we claim that there exists a ball B_{x} of radius \sigma/2 incident to x_{S} that is completely contained in S. This holds since \sigma/2\leq\frac{R}{6\Delta_{G}}<R\leq{\sf reach}({\mathbb{R}}^{d}\setminus S). In addition, since f_{S}(x)<2\sigma, no part of B_{x} is further than 3\sigma from x. Second, we claim that no part of S within \Delta_{G}\cdot\sigma (\leq R/6) of x (this includes B_{x}) is in the halfspace H_{x} with boundary passing through x and outward normal defined by u. To see this, let o be the center of a ball of radius R that is incident to x_{S} but not in S; refer to such a ball as B_{o}. This implies that the points o, x and x_{S} are collinear. Then a ball centered at x with radius R/6 should intersect S outside of B_{o}, and in the worst case, on the boundary of H_{x}. This holds as long as \|x-x_{S}\|\geq 0.014R\geq(1-\sqrt{35/36})R; see Figure 6. Define A_{x}=\{y\in S\mid\|x-y\|>\Delta_{G}\cdot\sigma\}.

Now we examine the contributions to the directional derivative of \textsc{kde}_{\mu}(x) along the direction u from points in B_{x} and A_{x}, respectively. Such a directional derivative is denoted \textsf{D}_{u}\textsc{kde}_{\mu}(x). Recall that \textsc{kde}_{\mu}(x)=\int_{y\in S}K(x,y){\rm d}\mu(y); since \mu is a uniform measure on S, \textsf{D}_{u}\textsc{kde}_{\mu}(x)=\frac{1}{{\textsf{Vol}}(S)}\int_{y\in S}\textsf{D}_{u}K(x,y){\rm d}y. For any point y\in{\mathbb{R}}^{d}, we define g(y):=\textsf{D}_{u}K(x,y)=\exp(-\|x-y\|^{2}/2\sigma^{2})\langle y-x,u\rangle. Therefore \textsf{D}_{u}\textsc{kde}_{\mu}(x)=\frac{1}{{\textsf{Vol}}(S)}\int_{y\in S}g(y){\rm d}y.
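As a quick sanity check of the closed form g(y)=\textsf{D}_{u}K(x,y), the short sketch below compares it against a finite difference of the Gaussian kernel K(x,y)=\sigma^{2}\exp(-\|x-y\|^{2}/2\sigma^{2}) (this kernel form and all numerical values are assumptions for illustration only).

import numpy as np

sigma = 0.7
K = lambda x, y: sigma ** 2 * np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
x, y = rng.normal(size=2), rng.normal(size=2)
u = rng.normal(size=2)
u /= np.linalg.norm(u)  # unit direction

g = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)) * np.dot(y - x, u)  # closed form from the proof
h = 1e-6
fd = (K(x + h * u, y) - K(x, y)) / h                                     # finite-difference D_u K(x, y)
print(g, fd)  # the two agree up to O(h)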

We now examine the contribution to \textsf{D}_{u}\textsc{kde}_{\mu}(x) from points in B_{x}, namely \frac{1}{{\textsf{Vol}}(S)}\int_{y\in B_{x}}g(y). First, for all points y\in B_{x}, since \|x-y\|\leq 3\sigma, we have \exp(-\|x-y\|^{2}/2\sigma^{2})\geq\exp(-9/2). Second, at least half of the points y\in B_{x} (covering half the volume of B_{x}) are at least \sigma/2 away from x_{S}, and correspondingly for these points \langle y-x,u\rangle\geq\sigma/2. We have \int_{y\in B_{x}}g(y)\geq\frac{1}{2}{\textsf{Vol}}(B_{x})\cdot\exp(-9/2)\cdot\sigma/2. Given {\textsf{Vol}}(B_{x})=G\cdot{\textsf{Vol}}(S), we have \frac{1}{{\textsf{Vol}}(S)}\int_{y\in B_{x}}g(y)\geq\frac{1}{4}G\cdot\exp(-9/2)\cdot\sigma. Denote B=\frac{1}{4}G\cdot\exp(-9/2)\cdot\sigma.

We now examine the contribution to \textsf{D}_{u}\textsc{kde}_{\mu}(x) from points in A_{x}, namely \frac{1}{{\textsf{Vol}}(S)}\int_{y\in A_{x}}g(y). For any point y\in{\mathbb{R}}^{d} (including y\in A_{x}), \langle y-x,u\rangle\leq\|x-y\|. Let \phi_{y}=\|x-y\|/\sigma, so we have g(y)\leq\exp(-\phi_{y}^{2}/2)\phi_{y}\sigma. This bound on g(y) is maximized at \phi_{y}=1 and is decreasing for \phi_{y}\geq 1; hence under the condition \phi_{y}\geq\Delta_{G}\geq\sqrt{12}>1 we can set \phi_{y}=\Delta_{G} to achieve the bound g(y)\leq\exp(-\Delta_{G}^{2}/2)\cdot\Delta_{G}\sigma for \|x-y\|\geq\Delta_{G}\cdot\sigma (that is, for all y\in A_{x}). Now we have \int_{y\in A_{x}}g(y)\leq{\textsf{Vol}}(S)\exp(-\Delta_{G}^{2}/2)\cdot\Delta_{G}\sigma, leading to \frac{1}{{\textsf{Vol}}(S)}\int_{y\in A_{x}}g(y)\leq\exp(-\Delta_{G}^{2}/2)\cdot\Delta_{G}\sigma. Denote A=\exp(-\Delta_{G}^{2}/2)\cdot\Delta_{G}\sigma.

Since only the points yAxy\in A_{x} could possibly reside in HxH_{x} and thus can cause g(y)g(y) to be negative, we just need to show that B>AB>A. This can be confirmed by plugging in ΔG=12+3ln(4/G)\Delta_{G}=\sqrt{12+3\ln(4/G)}, and using some algebraic manipulation.
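As a quick numerical check (separate from the algebraic argument), one can evaluate B and A directly over a range of values of G\leq 1; the sketch below does only that and is not part of the proof.

import numpy as np

sigma = 1.0  # B and A are both proportional to sigma, so its value does not matter
for G in [1.0, 0.5, 0.1, 1e-3, 1e-6]:
    Delta_G = np.sqrt(12 + 3 * np.log(4 / G))
    B = 0.25 * G * np.exp(-4.5) * sigma
    A = np.exp(-Delta_G ** 2 / 2) * Delta_G * sigma
    print(G, B > A, B / A)  # B > A holds in every case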

Appendix E Lower Bound on Wasserstein Distance

We note that the next result is a known lower bound for the earth mover's distance [23, Theorem 7]. We reprove it here for completeness.

Lemma E.1 (Lemma 5.17).

For any probability measures μ\mu and ν\nu defined on d{{\mathbb{R}}}^{d} we have μ¯ν¯W2(μ,ν).\|\bar{\mu}-\bar{\nu}\|\leq W_{2}(\mu,\nu).

Proof E.2.

Let \pi:{\mathbb{R}}^{d}\times{\mathbb{R}}^{d}\to{\mathbb{R}}^{+} describe the optimal transportation plan from \mu to \nu. Also let u_{\mu,\nu}=\frac{\bar{\mu}-\bar{\nu}}{\|\bar{\mu}-\bar{\nu}\|} be the unit vector in the direction of \bar{\mu}-\bar{\nu}. Then we can expand

(W2(μ,ν))2=(p,q)pq2dπ(p,q)\displaystyle(W_{2}(\mu,\nu))^{2}=\int_{(p,q)}\|p-q\|^{2}{\rm d}{\pi(p,q)} (p,q)((pq),uμ,ν)2dπ(p,q)\displaystyle\geq\int_{(p,q)}(\langle(p-q),u_{\mu,\nu}\rangle)^{2}{\rm d}{\pi(p,q)}
μ¯ν¯2.\displaystyle\geq\|\bar{\mu}-\bar{\nu}\|^{2}.

The first inequality follows since \langle(p-q),u_{\mu,\nu}\rangle is a (signed) projection onto a unit vector, so its magnitude is at most \|p-q\|. The second inequality follows since \pi has total mass 1 and its marginals are \mu and \nu: by Jensen's inequality, \int_{(p,q)}(\langle(p-q),u_{\mu,\nu}\rangle)^{2}{\rm d}\pi(p,q)\geq\left(\int_{(p,q)}\langle(p-q),u_{\mu,\nu}\rangle{\rm d}\pi(p,q)\right)^{2}=\langle\bar{\mu}-\bar{\nu},u_{\mu,\nu}\rangle^{2}=\|\bar{\mu}-\bar{\nu}\|^{2}. The inequality can be strict since some movement may cancel out (e.g., a rotation).
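To illustrate the bound, the following sketch computes W_{2} between two uniform empirical measures on equal-size point sets via an optimal assignment (for such measures an optimal transport plan is a one-to-one matching) and compares it to \|\bar{\mu}-\bar{\nu}\|; the data and names are illustrative only.

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(3)
P = rng.normal(size=(100, 2))               # sample carrying the uniform measure mu
Q = rng.normal(size=(100, 2)) + [2.0, 0.0]  # sample carrying the uniform measure nu, shifted

# For uniform measures on equal-size point sets, W_2^2 is the average squared
# length of an optimal one-to-one matching.
cost = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=2)
row, col = linear_sum_assignment(cost)
W2 = np.sqrt(cost[row, col].mean())

mean_gap = np.linalg.norm(P.mean(axis=0) - Q.mean(axis=0))
print(mean_gap, W2)  # mean_gap <= W2, as in Lemma E.1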