
Geometric Inference on Kernel Density Estimates

Jeff M. Phillips
jeffp@cs.utah.edu
University of Utah
Thanks to support by NSF CCF-1350888, IIS-1251019, and ACI-1443046.
   Bei Wang
beiwang@sci.utah.edu
University of Utah
Thanks to support by INL 00115847 via DOE DE-AC07ID14517, DOE NETL DEEE0004449, DOE DEFC0206ER25781, DOE DE-SC0007446, and NSF 0904631.
   Yan Zheng
yanzheng@cs.utah.edu
University of Utah

We show that geometric inference of a point cloud can be calculated by examining its kernel density estimate with a Gaussian kernel. This allows one to consider kernel density estimates, which are robust to spatial noise, subsampling, and approximate computation in comparison to raw point sets. This is achieved by examining the sublevel sets of the kernel distance, which isomorphically map to superlevel sets of the kernel density estimate. We prove new properties about the kernel distance, demonstrating stability results and allowing it to inherit reconstruction results from recent advances in distance-based topological reconstruction. Moreover, we provide an algorithm to estimate its topology using weighted Vietoris-Rips complexes.

1 Introduction

Geometry and topology have become essential tools in modern data analysis: geometry to handle spatial noise and topology to identify the core structure. Topological data analysis (TDA) has found applications ranging from protein structure analysis [32, 52] to heart modeling [41] to leaf science [60], and is the central tool for identifying quantities like connectedness, cyclic structure, and intersections at various scales. Yet it can suffer from spatial noise in data, particularly outliers.

When analyzing point cloud data, classically these approaches consider α\alpha-shapes [31], where each point is replaced with a ball of radius α\alpha, and the union of these balls is analyzed. More recently a distance function interpretation [11] has become more prevalent where the union of α\alpha-radius balls can be replaced by the sublevel set (at value α\alpha) of the Hausdorff distance to the point set. Moreover, the theory can be extended to other distance functions to the point sets, including the distance-to-a-measure [15] which is more robust to noise.

This has more recently led to statistical analysis of TDA. These results show not only robustness in the function reconstruction, but also in the topology it implies about the underlying dataset. This work often operates on persistence diagrams which summarize the persistence (difference in function values between appearance and disappearance) of all homological features in a single diagram. A variety of work has developed metrics on these diagrams and probability distributions over them [55, 67], and robustness and confidence intervals on their landscapes [7, 39, 18] (summarizing again the most dominant persistent features [19]). Much of this work is independent of the function and data from which the diagram is generated, but it is now clearer than ever that it is most appropriate when the underlying function is robust to noise, e.g., the distance-to-a-measure [15].

Figure 1: Example with 10,00010{,}000 points in [0,1]2[0,1]^{2} generated on a circle or line with N(0,0.005)N(0,0.005) noise; 25%25\% of points are uniform background noise. The generating function is reconstructed with kde with σ=0.05\sigma=0.05 (upper left), and its persistence diagram based on the superlevel set filtration is shown (upper middle). A coreset [71] of the same dataset with only 1,3841{,}384 points (lower left) and persistence diagram (lower middle) are shown, again using kde. This associated confidence interval contains the dimension 11 homology features (red triangles) suggesting they are noise; this is because it models data as iid – but the coreset data is not iid, it subsamples more intelligently. We also show persistence diagrams of the original data based on the sublevel set filtration of the standard distance function (upper right, with no useful features due to noise) and the kernel distance (lower right).

A very recent addition to this progression is the new TDA package for R [38]; it includes built-in functions to analyze point sets using Hausdorff distance, distance-to-a-measure, kk-nearest neighbor density estimators, kernel density estimates, and kernel distance. The example in Figure 1 used this package to generate persistence diagrams. While the stability of the Hausdorff distance is classic [11, 31], and the distance-to-a-measure [15] and kk-nearest neighbor distances have been shown robust to various degrees [5], this paper is the first to analyze the stability of kernel density estimates and the kernel distance in the context of geometric inference. Some recent manuscripts show related results. Bobrowski et al. [6] consider kernels with finite support, and describe approximate confidence intervals on the superlevel sets, which recover approximate persistence diagrams. Chazal et al. [17] explore the robustness of the kernel distance in bootstrapping-based analysis.

In particular, we show that the kernel distance and kernel density estimates, using the Gaussian kernel, inherit some reconstruction properties of distance-to-a-measure, that these functions can also be approximately reconstructed using weighted (Vietoris-)Rips complexes [8], and that under certain regimes can infer homotopy of compact sets. Moreover, we show further robustness advantages of the kernel distance and kernel density estimates, including that they possess small coresets [57, 71] for persistence diagrams and inference.

1.1 Kernels, Kernel Density Estimates, and Kernel Distance

A kernel is a non-negative similarity measure K:d×d+K:{{\mathbb{R}}}^{d}\times{{\mathbb{R}}}^{d}\to{{\mathbb{R}}}^{+}; more similar points have higher value. For any fixed pdp\in{{\mathbb{R}}}^{d}, a kernel K(p,)K(p,\cdot) can be normalized to be a probability distribution; that is xdK(p,x)dx=1\int_{x\in{{\mathbb{R}}}^{d}}K(p,x){\rm d}{x}=1. For the purposes of this article we focus on the Gaussian kernel defined as K(p,x)=σ2exp(px2/2σ2)K(p,x)=\sigma^{2}\exp(-\|p-x\|^{2}/2\sigma^{2}). (Footnote 1: K(p,x)K(p,x) is normalized so that K(x,x)=1K(x,x)=1 for σ=1\sigma=1. The choice of coefficient σ2\sigma^{2} is not the standard normalization, but it is perfectly valid as it scales everything by a constant. It has the property that σ2K(p,x)px2/2\sigma^{2}-K(p,x)\approx\|p-x\|^{2}/2 for px\|p-x\| small.)

A kernel density estimate [65, 61, 26, 27] is a way to estimate a continuous distribution function over d{{\mathbb{R}}}^{d} for a finite point set PdP\subset{{\mathbb{R}}}^{d}; kernel density estimates have been studied and applied in a variety of contexts, for instance, under subsampling [57, 71, 3] and in motion planning [59], multimodality analysis [64, 33], surveillance [37], and road reconstruction [4]. Specifically,

kdeP(x)=1|P|pPK(p,x).\textsc{kde}_{P}(x)=\frac{1}{|P|}\sum_{p\in P}K(p,x).

The kernel distance [47, 42, 48, 58] (also called current distance or maximum mean discrepancy) is a metric [56, 66] between two point sets PP, QQ (as long as the kernel used is characteristic [66], a slight restriction of being positive definite [2, 70]; this includes the Gaussian and Laplace kernels). Define a similarity between the two point sets as

κ(P,Q)=1|P|1|Q|pPqQK(p,q).\kappa(P,Q)=\frac{1}{|P|}\frac{1}{|Q|}\sum_{p\in P}\sum_{q\in Q}K(p,q).

Then the kernel distance between two point sets is defined as

DK(P,Q)=κ(P,P)+κ(Q,Q)2κ(P,Q).D_{K}(P,Q)=\sqrt{\kappa(P,P)+\kappa(Q,Q)-2\kappa(P,Q)}.

When we let point set QQ be a single point xx, then κ(P,x)=kdeP(x)\kappa(P,x)=\textsc{kde}_{P}(x).
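To make these definitions concrete, the following sketch (assuming NumPy; the function names are ours and not from any package cited here) evaluates kdeP\textsc{kde}_{P}, κ(P,Q)\kappa(P,Q), and DK(P,Q)D_{K}(P,Q) for finite point sets with the σ2\sigma^{2}-normalized Gaussian kernel.

import numpy as np

def gauss_kernel(P, x, sigma=1.0):
    # K(p, x) = sigma^2 exp(-||p - x||^2 / (2 sigma^2)) for each row p of P
    d2 = np.sum((np.atleast_2d(P) - np.asarray(x)) ** 2, axis=1)
    return sigma ** 2 * np.exp(-d2 / (2 * sigma ** 2))

def kde(P, x, sigma=1.0):
    # kde_P(x) = (1/|P|) sum_{p in P} K(p, x)
    return gauss_kernel(P, x, sigma).mean()

def kappa(P, Q, sigma=1.0):
    # kappa(P, Q) = (1/(|P||Q|)) sum_{p in P} sum_{q in Q} K(p, q)
    P, Q = np.atleast_2d(P), np.atleast_2d(Q)
    d2 = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)
    return (sigma ** 2 * np.exp(-d2 / (2 * sigma ** 2))).mean()

def kernel_distance(P, Q, sigma=1.0):
    # D_K(P, Q) = sqrt(kappa(P,P) + kappa(Q,Q) - 2 kappa(P,Q));
    # clamp at zero to absorb floating-point rounding.
    val = kappa(P, P, sigma) + kappa(Q, Q, sigma) - 2 * kappa(P, Q, sigma)
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(0)
P = rng.normal(size=(200, 2))
Q = rng.normal(loc=0.5, size=(150, 2))
x = np.array([0.25, -0.1])
print(kde(P, x), kappa(P, x))      # kappa(P, {x}) equals kde_P(x)
print(kernel_distance(P, Q))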

Kernel density estimates can apply to any measure μ\mu (on d{{\mathbb{R}}}^{d}) as kdeμ(x)=pdK(p,x)dμ(p).\textsc{kde}_{\mu}(x)=\int_{p\in{{\mathbb{R}}}^{d}}K(p,x){\rm d}{\mu(p)}. The similarity between two measures is κ(μ,ν)=(p,q)d×dK(p,q)dmμ,ν(p,q),\kappa(\mu,\nu)=\int_{(p,q)\in{{\mathbb{R}}}^{d}\times{{\mathbb{R}}}^{d}}K(p,q){\rm d}{\textsf{m}_{\mu,\nu}(p,q)}, where mμ,ν\textsf{m}_{\mu,\nu} is the product measure of μ\mu and ν\nu (mμ,ν:=μν\textsf{m}_{\mu,\nu}:=\mu\otimes\nu), and then the kernel distance between two measures μ\mu and ν\nu is still a metric, defined as DK(μ,ν)=κ(μ,μ)+κ(ν,ν)2κ(μ,ν).D_{K}(\mu,\nu)=\sqrt{\kappa(\mu,\mu)+\kappa(\nu,\nu)-2\kappa(\mu,\nu)}. When the measure ν\nu is a Dirac measure at xx (ν(q)=0\nu(q)=0 for xqx\neq q, but integrates to 11), then κ(μ,x)=kdeμ(x)\kappa(\mu,x)=\textsc{kde}_{\mu}(x). Given a finite point set PdP\subset{{\mathbb{R}}}^{d}, we can work with the empirical measure μP\mu_{P} defined as μP=1|P|pPδp\mu_{P}=\frac{1}{|P|}\sum_{p\in P}\delta_{p}, where δp\delta_{p} is the Dirac measure on pp, and DK(μP,μQ)=DK(P,Q)D_{K}(\mu_{P},\mu_{Q})=D_{K}(P,Q).

If KK is positive definite, it is said to have the reproducing property [2, 70]. This implies that K(p,x)K(p,x) is an inner product in some reproducing kernel Hilbert space (RKHS) K\mathcal{H}_{K}. Specifically, there is a lifting map ϕ:dK\phi:{{\mathbb{R}}}^{d}\to\mathcal{H}_{K} so that K(p,x)=ϕ(p),ϕ(x)KK(p,x)=\langle\phi(p),\phi(x)\rangle_{\mathcal{H}_{K}}, and moreover the entire set PP can be represented as Φ(P)=pPϕ(p)\Phi(P)=\sum_{p\in P}\phi(p), which is a single element of K\mathcal{H}_{K} and has a norm Φ(P)K=κ(P,P)\|\Phi(P)\|_{\mathcal{H}_{K}}=\sqrt{\kappa(P,P)}. A single point xdx\in{{\mathbb{R}}}^{d} also has a norm ϕ(x)K=K(x,x)\|\phi(x)\|_{\mathcal{H}_{K}}=\sqrt{K(x,x)} in this space.

1.2 Geometric Inference and Distance to a Measure: A Review

Given an unknown compact set SdS\subset{{\mathbb{R}}}^{d} and a finite point cloud PdP\subset{{\mathbb{R}}}^{d} that comes from SS under some process, geometric inference aims to recover topological and geometric properties of SS from PP. The offset-based (and more generally, the distance function-based) approach for geometric inference reconstructs a geometric and topological approximation of SS by offsets from PP (e.g. [13, 14, 15, 20, 21]).

Given a compact set SdS\subset{{\mathbb{R}}}^{d}, we can define a distance function fSf_{S} to SS; a common example is fS(x)=infySxyf_{S}(x)=\inf_{y\in S}\|x-y\| (i.e. α\alpha-shapes). The offsets of SS are the sublevel sets of fSf_{S}, denoted (S)r=fS1([0,r])(S)^{r}=f_{S}^{-1}([0,r]). Now an approximation of SS by another compact set PdP\subset{{\mathbb{R}}}^{d} (e.g. a finite point cloud) can be quantified by the Hausdorff distance dH(S,P):=fSfP=supxd|fS(x)fP(x)|d_{H}(S,P):=\|f_{S}-f_{P}\|_{\infty}=\sup_{x\in{{\mathbb{R}}}^{d}}|f_{S}(x)-f_{P}(x)| of their distance functions. The intuition behind the inference of topology is that if dH(S,P)d_{H}(S,P) is small, then fSf_{S} and fPf_{P} are close, and subsequently SS, (S)r{(S)}^{r} and (P)r{(P)}^{r} carry the same topology for an appropriate scale rr. In other words, to compare the topology of offsets (S)r{(S)}^{r} and (P)r{(P)}^{r}, we require Hausdorff stability with respect to their distance functions fSf_{S} and fPf_{P}. An example of an offset-based topological inference result is formally stated as follows (as a particular version of the reconstruction Theorem 4.6 in [14]), where the reach of a compact set SS, 𝗋𝖾𝖺𝖼𝗁(S){\sf reach}(S), is defined as the minimum distance between SS and its medial axis [54].

Theorem 1.1 (Reconstruction from fPf_{P} [14]).

Let S,PdS,P\subset{{\mathbb{R}}}^{d} be compact sets such that 𝗋𝖾𝖺𝖼𝗁(S)>R{\sf reach}(S)>R and ε:=dH(S,P)<R/17\varepsilon:=d_{H}(S,P)<R/17. Then (S)η(S)^{\eta} and (P)r{(P)}^{r} are homotopy equivalent for sufficiently small η\eta (e.g., 0<η<R0<\eta<R) if 4εr<R3ε4\varepsilon\leq r<R-3\varepsilon.

Here η<R\eta<R ensures that the topological properties of (S)η(S)^{\eta} and (S)r(S)^{r} are the same, and the ε\varepsilon parameter ensures (S)r(S)^{r} and (P)r(P)^{r} are close. Typically ε\varepsilon is tied to the density with which a point cloud PP is sampled from SS.

For a function ϕ:d+\phi:\mathbb{R}^{d}\to\mathbb{R}^{+} to be distance-like, it should satisfy the following properties:

  • (D1) ϕ\phi is 11-Lipschitz: For all x,ydx,y\in{{\mathbb{R}}}^{d}, |ϕ(x)ϕ(y)|xy|\phi(x)-\phi(y)|\leq\|x-y\|.

  • (D2) ϕ2\phi^{2} is 11-semiconcave: The map xd(ϕ(x))2x2x\in{{\mathbb{R}}}^{d}\mapsto(\phi(x))^{2}-\|x\|^{2} is concave.

  • (D3) ϕ\phi is proper: ϕ(x)\phi(x) tends to the supremum of its range (e.g., \infty) as xx tends to infinity.

In addition to the Hausdorff stability property stated above, as explained in [15], fSf_{S} is distance-like. These three properties are paramount for geometric inference (e.g. [14, 53]). (D1) ensures that fSf_{S} is differentiable almost everywhere and the medial axis of SS has zero dd-volume [15]; and (D2) is a crucial technical tool, e.g., in proving the existence of the flow of the gradient of the distance function for topological inference [14].

Distance to a measure.

Given a probability measure μ\mu on d{{\mathbb{R}}}^{d} and a parameter m0>0m_{0}>0 smaller than the total mass of μ\mu, the distance to a measure dμ,m0ccm:d+d^{\textsf{{ccm}}}_{\mu,m_{0}}:{{\mathbb{R}}}^{d}\to{{\mathbb{R}}}^{+} [15] is defined for any point xdx\in{{\mathbb{R}}}^{d} as

dμ,m0ccm(x)=(1m0m=0m0(δμ,m(x))2dm)1/2, where δμ,m(x)=inf{r>0:μ(B¯r(x))m},d^{\textsf{{ccm}}}_{\mu,m_{0}}(x)=\left(\frac{1}{m_{0}}\int_{m=0}^{m_{0}}(\delta_{\mu,m}(x))^{2}{\rm d}{m}\right)^{1/2},\;\;\text{ where }\;\;\delta_{\mu,m}(x)=\inf\left\{r>0:\mu(\bar{B}_{r}(x))\geq m\right\},

It has been shown in [15] that dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}} is a distance-like function (satisfying (D1), (D2), and (D3)), and:

  • (M4) [Stability] For probability measures μ\mu and ν\nu on d{{\mathbb{R}}}^{d} and any m0>0m_{0}>0, dμ,m0ccmdν,m0ccm1m0W2(μ,ν)\|d^{\textsf{{ccm}}}_{\mu,m_{0}}-d^{\textsf{{ccm}}}_{\nu,m_{0}}\|_{\infty}\leq\frac{1}{\sqrt{m_{0}}}W_{2}(\mu,\nu).

Here W2W_{2} is the Wasserstein distance [69]: W2(μ,ν)=infπΠ(μ,ν)(d×dxy2dπ(x,y))1/2W_{2}(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\left(\int_{{{\mathbb{R}}}^{d}\times{{\mathbb{R}}}^{d}}||x-y||^{2}{\rm d}{\pi}(x,y)\right)^{1/2} between two measures, where dπ(x,y){\rm d}{\pi}(x,y) measures the amount of mass transferred from location xx to location yy and πΠ(μ,ν)\pi\in\Pi(\mu,\nu) is a transference plan [69].

Given a point set PP, the sublevel sets of dμP,m0ccmd^{\textsf{{ccm}}}_{\mu_{P},m_{0}} can be described as the union of balls [45], and then one can algorithmically estimate the topology (e.g., persistence diagram) with weighted alpha-shapes [45] and weighted Rips complexes [8].
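For the empirical measure μP\mu_{P} on nn points and m0=k/nm_{0}=k/n, the integral definition above reduces to averaging the squared distances from xx to its kk nearest neighbors in PP (the empirical form discussed in [15]). A minimal sketch, assuming NumPy and that m0m_{0} is a multiple of 1/n1/n; the function name is ours.

import numpy as np

def dccm_empirical(P, x, k):
    # d^ccm_{mu_P, m0}(x) with m0 = k/n: square root of the mean of the
    # squared distances from x to its k nearest neighbors in P.
    d2 = np.sort(np.sum((np.asarray(P) - np.asarray(x)) ** 2, axis=1))
    return np.sqrt(d2[:k].mean())

rng = np.random.default_rng(1)
P = rng.uniform(size=(500, 2))                 # uniform sample of the unit square
print(dccm_empirical(P, [0.5, 0.5], k=25))     # m0 = 25/500 = 0.05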

1.3 Our Results

We show how to estimate the topology (e.g., approximate persistence diagrams, infer homotopy of compact sets) using superlevel sets of the kernel density estimate of a point set PP. We accomplish this by showing that a similar set of properties holds for the kernel distance with respect to a measure μ\mu (in place of the distance to a measure dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}}), defined as

dμK(x)=DK(μ,x)=κ(μ,μ)+κ(x,x)2κ(μ,x).d^{K}_{\mu}(x)=D_{K}(\mu,x)=\sqrt{\kappa(\mu,\mu)+\kappa(x,x)-2\kappa(\mu,x)}.

This treats xx as a probability measure represented by a Dirac mass at xx. Specifically, we show dμKd^{K}_{\mu} is distance-like (it satisfies (D1), (D2), and (D3)), so it inherits reconstruction properties of dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}}. Moreover, it is stable with respect to the kernel distance:

  • (K4) [Stability] If μ\mu and ν\nu are two measures on d{{\mathbb{R}}}^{d}, then dμKdνKDK(μ,ν)\|d^{K}_{\mu}-d^{K}_{\nu}\|_{\infty}\leq D_{K}(\mu,\nu).

In addition, we show how to construct these topological estimates for dμKd^{K}_{\mu} using weighted Rips complexes, following power distance machinery introduced in [8]. That is, a particular form of power distance permits a multiplicative approximation with the kernel distance.

We also describe further advantages of the kernel distance. (i) Its sublevel sets conveniently map to the superlevel sets of a kernel density estimate. (ii) It is Lipschitz with respect to the smoothing parameter σ\sigma when the input xx is fixed. (iii) As σ\sigma tends to \infty for any two probability measures μ,ν\mu,\nu, the kernel distance is bounded by the Wasserstein distance: limσDK(μ,ν)W2(μ,ν)\lim_{\sigma\to\infty}D_{K}(\mu,\nu)\leq W_{2}(\mu,\nu). (iv) It has a small coreset representation, which allows for sparse representation and efficient, scalable computation. In particular, an ε\varepsilon-kernel sample [48, 57, 71] QQ of μ\mu is a finite point set whose size only depends on ε>0\varepsilon>0 and such that maxxd|kdeμ(x)kdeμQ(x)|=maxxd|κ(μ,x)κ(μQ,x)|ε\max_{x\in{{\mathbb{R}}}^{d}}|\textsc{kde}_{\mu}(x)-\textsc{kde}_{\mu_{Q}}(x)|=\max_{x\in{{\mathbb{R}}}^{d}}|\kappa(\mu,x)-\kappa(\mu_{Q},x)|\leq\varepsilon. These coresets preserve inference results and persistence diagrams.

2 Kernel Distance is Distance-Like

In this section we prove dμKd^{K}_{\mu} satisfies (D1), (D2), and (D3); hence it is distance-like. Recall we use the σ2\sigma^{2}-normalized Gaussian kernel Kσ(p,x)=σ2exp(px2/2σ2)K_{\sigma}(p,x)=\sigma^{2}\exp(-\|p-x\|^{2}/2\sigma^{2}). For ease of exposition, unless otherwise noted, we will assume σ\sigma is fixed and write KK instead of KσK_{\sigma}.

2.1 Semiconcave Property for dμKd^{K}_{\mu}

Lemma 2.1 (D2).

(dμK)2(d^{K}_{\mu})^{2} is 11-semiconcave: the map x(dμK(x))2x2x\mapsto(d^{K}_{\mu}(x))^{2}-\|x\|^{2} is concave.

Proof 2.2.

Let T(x)=(dμK(x))2x2T(x)=(d^{K}_{\mu}(x))^{2}-\|x\|^{2}. The proof will show that the second derivative of TT along any direction is nonpositive. We can rewrite

T(x)\displaystyle T(x) =κ(μ,μ)+κ(x,x)2κ(μ,x)x2\displaystyle=\kappa(\mu,\mu)+\kappa(x,x)-2\kappa(\mu,x)-\|x\|^{2}
=κ(μ,μ)+κ(x,x)pd(2K(p,x)+x2)dμ(p).\displaystyle=\kappa(\mu,\mu)+\kappa(x,x)-\int_{p\in{{\mathbb{R}}}^{d}}(2K(p,x)+\|x\|^{2}){\rm d}{\mu(p)}.

Note that both κ(μ,μ)\kappa(\mu,\mu) and κ(x,x)\kappa(x,x) are absolute constants, so we can ignore them in the second derivative. Furthermore, by setting t(p,x)=2K(p,x)x2t(p,x)=-2K(p,x)-\|x\|^{2}, the second derivative of T(x)T(x) is nonpositive if the second derivative of t(p,x)t(p,x) is nonpositive for all p,xdp,x\in{{\mathbb{R}}}^{d}. First note that the second derivative of x2-\|x\|^{2} is a constant 2-2 in every direction. The second derivative of K(p,x)K(p,x) is symmetric about pp, so we can consider the second derivative along any vector u=xpu=x-p,

d2du2t(p,x)=2(u2σ21)exp(u22σ2)2.\frac{{\rm d}{}^{2}}{{\rm d}{u}^{2}}t(p,x)=2\left(\frac{\|u\|^{2}}{\sigma^{2}}-1\right)\exp\left(-\frac{\|u\|^{2}}{2\sigma^{2}}\right)-2.

This reaches its maximum value at u=xp=3σ\|u\|=\|x-p\|=\sqrt{3}\sigma where it is 4exp(3/2)21.14\exp(-3/2)-2\approx-1.1; this follows by setting the derivative of s(y)=2(y1)exp(y/2)2s(y)=2(y-1)\exp(-y/2)-2 to 0 (ddys(y)=(3y)exp(y/2)\frac{{\rm d}{}}{{\rm d}{y}}s(y)=(3-y)\exp(-y/2)) and substituting y=u2/σ2y=\|u\|^{2}/\sigma^{2}.

We also note in Appendix A that semiconcavity follows trivially in the RKHS K\mathcal{H}_{K}.

2.2 Lipschitz Property for dμKd^{K}_{\mu}

We generalize a folklore relation (see [15]) between semiconcave and Lipschitz functions and prove it for completeness. A function ff is \ell-semiconcave if the function T(x)=(f(x))2x2T(x)=(f(x))^{2}-\ell\|x\|^{2} is concave.

Lemma 2.3.

Consider a twice-differentiable function gg and a parameter 1\ell\geq 1. If (g(x))2(g(x))^{2} is \ell-semiconcave, then g(x)g(x) is \ell-Lipschitz.

Proof 2.4.

The proof is by contrapositive; we assume that g(x)g(x) is not \ell-Lipschitz and then show (g(x))2(g(x))^{2} cannot be \ell-semiconcave. By this assumption, in some direction uu there is a point xx^{\prime} such that (d/du)g(x)=c>1(d/du)g(x^{\prime})=c>\ell\geq 1.

Now we examine f(x)=(g(x))2x2f(x)=(g(x))^{2}-\ell\|x\|^{2} at x=xx=x^{\prime}, and specifically its second derivative in direction uu.

dduf(x)|x=x\displaystyle\frac{d}{du}f(x)\big{|}_{x=x^{\prime}} =2(ddug(x))g(x)2x=2cg(x)2x\displaystyle=2\left(\frac{d}{du}g(x^{\prime})\right)g(x^{\prime})-2\ell\|x^{\prime}\|=2c\cdot g(x^{\prime})-2\ell\|x^{\prime}\|
d2du2f(x)|x=x\displaystyle\frac{d^{2}}{du^{2}}f(x)\big{|}_{x=x^{\prime}} =2c(ddug(x))2=2c22=2(c2)\displaystyle=2c\left(\frac{d}{du}g(x^{\prime})\right)-2\ell=2c^{2}-2\ell=2(c^{2}-\ell)

Since c2>c>1c^{2}>c>\ell\geq 1, then 2(c2)>02(c^{2}-\ell)>0 and f(x)f(x) is not \ell-semiconcave at xx^{\prime}.

We can now state the following lemma as a corollary of Lemma 2.3 and Lemma 2.1.

Lemma 2.5 (D1).

dμKd^{K}_{\mu} is 11-Lipschitz on its input.

2.3 Properness of dμKd^{K}_{\mu}

Finally, for dμKd^{K}_{\mu} to be distance-like, we need to show it is proper when its range is restricted to be less than cμ:=κ(μ,μ)+κ(x,x)c_{\mu}:=\sqrt{\kappa(\mu,\mu)+\kappa(x,x)}. Here, the value of cμc_{\mu} depends only on μ\mu and not on xx since κ(x,x)=K(x,x)=σ2\kappa(x,x)=K(x,x)=\sigma^{2}. This is required for a distance-like version ([15], Proposition 4.2) of the Isotopy Lemma ([44], Proposition 1.8).

Lemma 2.6 (D3).

dμKd^{K}_{\mu} is proper.

We delay this technical proof to Appendix A. The main technical difficulty comes in mapping standard definitions and approaches for distance functions to our function dμKd^{K}_{\mu} with a restricted range.

By properness (see the discussion in Appendix A), Lemma 2.6 also implies that dμKd^{K}_{\mu} is a closed map and its levelset at any value a[0,cμ)a\in[0,c_{\mu}) is compact. This also means that the sublevel set of dμKd^{K}_{\mu} (for ranges [0,a)[0,cμ)[0,a)\subset[0,c_{\mu})) is compact. Since the levelset (sublevel set) of dμKd^{K}_{\mu} corresponds to the levelset (superlevel set) of kdeμ\textsc{kde}_{\mu}, we have the following corollary.

Corollary 2.7.

The superlevel sets of kdeμ\textsc{kde}_{\mu} at any threshold a>0a>0 are compact.

The result in [33] shows that given a measure μP{\mu_{P}} defined by a point set PP of size nn, kdeμP\textsc{kde}_{{\mu_{P}}} has a number of modes polynomial in nn; hence the superlevel sets of kdeμP\textsc{kde}_{{\mu_{P}}} are compact in this setting. The above corollary is a more general statement as it holds for any measure.

3 Power Distance using Kernel Distance

A power distance using dμKd^{K}_{\mu} is defined with a point set PdP\subset{{\mathbb{R}}}^{d} and a metric d(,)d(\cdot,\cdot) on d{{\mathbb{R}}}^{d},

fP(μ,x)=minpP(d(p,x)2+dμK(p)2).{f_{P}}(\mu,x)=\sqrt{\min_{p\in P}\left(d(p,x)^{2}+d^{K}_{\mu}(p)^{2}\right)}.

A point xdx\in\mathbb{R}^{d} takes the distance under d(p,x)d(p,x) to the closest pPp\in P, plus a weight from dμK(p)d^{K}_{\mu}(p); thus a sublevel set of fP(μ,){f_{P}}(\mu,\cdot) is defined by a union of balls. We consider a particular choice of the distance d(p,x):=DK(p,x)d(p,x):=D_{K}(p,x) which leads to a kernel version of power distance

fPk(μ,x)=minpP(DK(p,x)2+dμK(p)2).{f^{\textsc{k}}_{P}}(\mu,x)=\sqrt{\min_{p\in P}\left({D_{K}(p,x)}^{2}+{d^{K}_{\mu}(p)}^{2}\right)}.

In Section 4.2 we use fPk(μ,x){f^{\textsc{k}}_{P}}(\mu,x) to adapt the construction introduced in [8] to approximate the persistence diagram of the sublevel sets of dμKd^{K}_{\mu}, using a weighted Rips filtration of fPk(μ,x){f^{\textsc{k}}_{P}}(\mu,x).
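As a concrete instance, the sketch below (assuming NumPy; the helper name is ours) evaluates fPk(μP,x){f^{\textsc{k}}_{P}}(\mu_{P},x) for an empirical measure μP\mu_{P} on the same point set PP, using DK(p,x)2=2σ22K(p,x)D_{K}(p,x)^{2}=2\sigma^{2}-2K(p,x) between Dirac masses and dμPK(p)2=κ(μP,μP)+σ22kdeμP(p)d^{K}_{\mu_{P}}(p)^{2}=\kappa(\mu_{P},\mu_{P})+\sigma^{2}-2\textsc{kde}_{\mu_{P}}(p).

import numpy as np

def kernel_power_distance(P, x, sigma=1.0):
    # f^K_P(mu_P, x) = sqrt( min_{p in P} ( D_K(p, x)^2 + d^K_{mu_P}(p)^2 ) )
    P = np.asarray(P, dtype=float)
    x = np.asarray(x, dtype=float)
    s2 = sigma ** 2
    KPP = s2 * np.exp(-np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1) / (2 * s2))
    kappa_PP = KPP.mean()                       # kappa(mu_P, mu_P)
    kde_at_P = KPP.mean(axis=1)                 # kde_{mu_P}(p) for each p in P
    w2 = np.maximum(kappa_PP + s2 - 2 * kde_at_P, 0.0)   # weights d^K_{mu_P}(p)^2
    KPx = s2 * np.exp(-np.sum((P - x) ** 2, axis=1) / (2 * s2))
    DK2 = 2 * s2 - 2 * KPx                      # D_K(p, x)^2 for Dirac masses
    return np.sqrt(np.min(DK2 + w2))

rng = np.random.default_rng(2)
P = rng.normal(size=(300, 2))
print(kernel_power_distance(P, [0.3, -0.2], sigma=0.5))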

Given a measure μ\mu, let p+=argmaxqdκ(μ,q)p_{+}=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa(\mu,q), and let P+dP_{+}\subset{{\mathbb{R}}}^{d} be a point set that contains p+p_{+}. We show below, in Theorem 3.8 and Theorem 3.2, that 12dμK(x)fP+k(μ,x)14dμK(x)\frac{1}{\sqrt{2}}d^{K}_{\mu}(x)\leq{f^{\textsc{k}}_{P_{+}}}(\mu,x)\leq\sqrt{14}d^{K}_{\mu}(x). However, constructing p+p_{+} exactly seems quite difficult. We also attempt to use p=argminpPpxp^{\star}=\arg\min_{p\in P}\|p-x\| in place of p+p_{+} (see Section C.1), but are not able to obtain useful bounds.

Now consider an empirical measure μP\mu_{P} defined by a point set PP. We show (in Theorem C.21 in Appendix C.2) how to construct a point p^+\hat{p}_{+} (that approximates p+p_{+}) such that DK(P,p^+)(1+δ)DK(P,p+)D_{K}(P,\hat{p}_{+})\leq(1+\delta)D_{K}(P,p_{+}) for any δ>0\delta>0. For a point set PP, the median concentration ΛP\Lambda_{P} is a radius such that no point pPp\in P has more than half of the points of PP within ΛP\Lambda_{P}, and the spread βP\beta_{P} is the ratio between the longest and shortest pairwise distances. The runtime is polynomial in nn and 1/δ1/\delta assuming βP\beta_{P} is bounded, and that σ/ΛP\sigma/\Lambda_{P} and dd are constants.

We then consider P^+=P{p^+}\hat{P}_{+}=P\cup\{\hat{p}_{+}\}, where p^+\hat{p}_{+} is found with δ=1/2\delta=1/2 in the above construction. This yields the following multiplicative bound, whose upper bound is proven in Theorem 3.10; the lower bound holds independent of the choice of PP as shown in Theorem 3.2.

Theorem 3.1.

For any point set PdP\subset{{\mathbb{R}}}^{d} and point xdx\in{{\mathbb{R}}}^{d}, with empirical measure μP\mu_{P} defined by PP, then

12dμPK(x)fP^+k(μP,x)71dμPK(x).\frac{1}{\sqrt{2}}d^{K}_{\mu_{P}}(x)\leq{f^{\textsc{k}}_{\hat{P}_{+}}}(\mu_{P},x)\leq\sqrt{71}d^{K}_{\mu_{P}}(x).
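The following sketch (assuming NumPy) empirically compares fkf^{\textsc{k}} with dμPKd^{K}_{\mu_{P}} at random query points. For simplicity it replaces the Theorem C.21 construction of p^+\hat{p}_{+} by a heuristic, namely the input point with the largest kde value; this stand-in is our assumption, so only the lower bound of the sandwich is guaranteed in advance and the upper bound is merely observed on this data.

import numpy as np

def pairwise_k(A, B, sigma):
    # matrix of K(a, b) = sigma^2 exp(-||a - b||^2 / (2 sigma^2))
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma ** 2 * np.exp(-d2 / (2 * sigma ** 2))

def sandwich_ratios(P, sigma=0.5, trials=200, seed=3):
    rng = np.random.default_rng(seed)
    P = np.asarray(P, dtype=float)
    s2 = sigma ** 2
    KPP = pairwise_k(P, P, sigma)
    kappa_PP = KPP.mean()
    p_hat = P[np.argmax(KPP.mean(axis=1))]          # heuristic stand-in for p_+
    P_plus = np.vstack([P, p_hat])
    kde_Pplus = pairwise_k(P_plus, P, sigma).mean(axis=1)
    w2 = np.maximum(kappa_PP + s2 - 2 * kde_Pplus, 0.0)   # d^K_{mu_P}(p)^2 on P_plus
    ratios = []
    for _ in range(trials):
        x = rng.uniform(P.min(axis=0), P.max(axis=0))
        kde_x = pairwise_k(x[None, :], P, sigma).mean()
        d = np.sqrt(max(kappa_PP + s2 - 2 * kde_x, 0.0))  # d^K_{mu_P}(x)
        DK2 = 2 * s2 - 2 * pairwise_k(P_plus, x[None, :], sigma)[:, 0]
        f = np.sqrt(np.min(DK2 + w2))                     # f^K_{P_plus}(mu_P, x)
        ratios.append(f / d)
    return np.array(ratios)

rng = np.random.default_rng(4)
r = sandwich_ratios(rng.uniform(size=(400, 2)))
print(r.min(), r.max())   # compare against 1/sqrt(2) ~ 0.707 and sqrt(71) ~ 8.43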

3.1 Kernel Power Distance for a Measure μ\mu

First consider the case for a kernel power distance fPk(μ,x){f^{\textsc{k}}_{P}}(\mu,x) where μ\mu is an arbitrary measure.

Theorem 3.2.

For measure μ\mu, point set PdP\subset\mathbb{R}^{d}, and xdx\in{{\mathbb{R}}}^{d}, DK(μ,x)2fPk(μ,x).D_{K}(\mu,x)\leq\sqrt{2}{f^{\textsc{k}}_{P}}(\mu,x).

Proof 3.3.

Let p=argminqP(DK(q,x)2+DK(μ,q)2)p=\arg\min_{q\in P}\left(D_{K}(q,x)^{2}+D_{K}(\mu,q)^{2}\right). Then we can use the triangle inequality and (DK(μ,p)DK(p,x))20(D_{K}(\mu,p)-D_{K}(p,x))^{2}\geq 0 to show

DK(μ,x)2(DK(μ,p)+DK(p,x))22(DK(μ,p)2+DK(p,x)2)=2fPk(μ,x)2.D_{K}(\mu,x)^{2}\leq(D_{K}(\mu,p)+D_{K}(p,x))^{2}\leq 2(D_{K}(\mu,p)^{2}+D_{K}(p,x)^{2})=2{f^{\textsc{k}}_{P}}(\mu,x)^{2}.
Lemma 3.4.

For measure μ\mu, point set PdP\subset\mathbb{R}^{d}, point pPp\in P, and point xdx\in{{\mathbb{R}}}^{d} then fPk(μ,x)22DK(μ,x)2+3DK(p,x)2.{f^{\textsc{k}}_{P}}(\mu,x)^{2}\leq 2D_{K}(\mu,x)^{2}+3D_{K}(p,x)^{2}.

Proof 3.5.

Again, we can reach this result with the triangle inequality.

fPk(μ,x)2\displaystyle{f^{\textsc{k}}_{P}}(\mu,x)^{2} DK(μ,p)2+DK(p,x)2\displaystyle\leq D_{K}(\mu,p)^{2}+D_{K}(p,x)^{2}
(DK(μ,x)+DK(p,x))2+DK(p,x)2\displaystyle\leq(D_{K}(\mu,x)+D_{K}(p,x))^{2}+D_{K}(p,x)^{2}
2DK(μ,x)2+3DK(p,x)2.\displaystyle\leq 2D_{K}(\mu,x)^{2}+3D_{K}(p,x)^{2}.

Recall the definition of a point p+=argmaxqdκ(μ,q)p_{+}=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa(\mu,q).

Lemma 3.6.

For any measure μ\mu and point x,p+dx,p_{+}\in{{\mathbb{R}}}^{d} we have DK(p+,x)2DK(μ,x)D_{K}(p_{+},x)\leq 2D_{K}(\mu,x).

Proof 3.7.

Since xx is a point in d{{\mathbb{R}}}^{d}, κ(μ,x)κ(μ,p+)\kappa(\mu,x)\leq\kappa(\mu,p_{+}) and thus DK(μ,x)DK(μ,p+)D_{K}(\mu,x)\geq D_{K}(\mu,p_{+}). Then the triangle inequality for DKD_{K} gives DK(p+,x)DK(μ,x)+DK(μ,p+)2DK(μ,x)D_{K}(p_{+},x)\leq D_{K}(\mu,x)+D_{K}(\mu,p_{+})\leq 2D_{K}(\mu,x).

Theorem 3.8.

For any measure μ\mu in d{{\mathbb{R}}}^{d} and any point xdx\in{{\mathbb{R}}}^{d}, using the point p+=argmaxqdκ(μ,q)p_{+}=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa(\mu,q) then f{p+}k(μ,x)14DK(μ,x){f^{\textsc{k}}_{\{p_{+}\}}}(\mu,x)\leq\sqrt{14}D_{K}(\mu,x).

Proof 3.9.

Combine Lemma 3.4 and Lemma 3.6 as

f{p+}k(μ,x)22DK(μ,x)2+3DK(p+,x)22DK(μ,x)2+3(4DK(μ,x)2)=14DK(μ,x)2.{f^{\textsc{k}}_{\{p_{+}\}}}(\mu,x)^{2}\leq 2D_{K}(\mu,x)^{2}+3D_{K}(p_{+},x)^{2}\leq 2D_{K}(\mu,x)^{2}+3(4D_{K}(\mu,x)^{2})=14D_{K}(\mu,x)^{2}.

We now need two properties of the point set PP to reach our bound, namely, the spread βP\beta_{P} and the median concentration ΛP\Lambda_{P}. Typically log(βP)\log(\beta_{P}) is not too large, and it makes sense to choose σ\sigma so σ/ΛP1\sigma/\Lambda_{P}\leq 1, or at least σ/ΛP=O(1)\sigma/\Lambda_{P}=O(1).

Theorem 3.10.

Consider any point set PdP\subset{{\mathbb{R}}}^{d} of size nn, with measure μP\mu_{P}, spread βP\beta_{P}, and median concentration ΛP\Lambda_{P}. We can construct a point set P^+=P{p^+}\hat{P}_{+}=P\cup\{\hat{p}_{+}\} in O(n2((σ/(ΛPδ))d+logβP))O(n^{2}((\sigma/(\Lambda_{P}\delta))^{d}+\log\beta_{P})) time such that for any point xx, fP^+k(μP,x)71DK(μP,x).{f^{\textsc{k}}_{\hat{P}_{+}}}({\mu_{P}},x)\leq\sqrt{71}D_{K}({\mu_{P}},x).

Proof 3.11.

We use Theorem C.21 to find a point p^+\hat{p}_{+} such that DK(P,p^+)(3/2)DK(P,p+)D_{K}(P,\hat{p}_{+})\leq(3/2)D_{K}(P,p_{+}). Thus for any xdx\in{{\mathbb{R}}}^{d}, using the triangle inequality

DK(p^+,x)\displaystyle D_{K}(\hat{p}_{+},x) DK(p^+,p+)+DK(p+,x)DK(μP,p^+)+DK(μP,p+)+DK(p+,x)\displaystyle\leq D_{K}(\hat{p}_{+},p_{+})+D_{K}(p_{+},x)\leq D_{K}({\mu_{P}},\hat{p}_{+})+D_{K}({\mu_{P}},p_{+})+D_{K}(p_{+},x)
(5/2)DK(μP,p+)+DK(p+,x).\displaystyle\leq(5/2)D_{K}({\mu_{P}},p_{+})+D_{K}(p_{+},x).

Now combine this with Lemma 3.4 and Lemma 3.6 as

fP^+k(μP,x)2\displaystyle{f^{\textsc{k}}_{\hat{P}_{+}}}({\mu_{P}},x)^{2} 2DK(μP,x)2+3DK(p^+,x)2\displaystyle\leq 2D_{K}({\mu_{P}},x)^{2}+3D_{K}(\hat{p}_{+},x)^{2}
2DK(μP,x)2+3((5/2)DK(μP,x)+DK(p+,x))2\displaystyle\leq 2D_{K}({\mu_{P}},x)^{2}+3((5/2)D_{K}({\mu_{P}},x)+D_{K}(p_{+},x))^{2}
2DK(μP,x)2+3(((25/4)+(5/2))DK(μP,x)2+(1+5/2)DK(p+,x)2)\displaystyle\leq 2D_{K}({\mu_{P}},x)^{2}+3\left(((25/4)+(5/2))D_{K}({\mu_{P}},x)^{2}+(1+5/2)D_{K}(p_{+},x)^{2}\right)
=(113/4)DK(μP,x)2+(21/2)DK(p+,x)2\displaystyle=(113/4)D_{K}({\mu_{P}},x)^{2}+(21/2)D_{K}(p_{+},x)^{2}
(113/4)DK(μP,x)2+(21/2)(4DK(μP,x)2)<71DK(μP,x)2.\displaystyle\leq(113/4)D_{K}({\mu_{P}},x)^{2}+(21/2)(4D_{K}({\mu_{P}},x)^{2})<71D_{K}({\mu_{P}},x)^{2}.

4 Reconstruction and Topological Estimation using Kernel Distance

Now applying distance-like properties from Section 2 and the power distance properties of Section 3 we can apply known reconstruction results to the kernel distance.

4.1 Homotopy Equivalent Reconstruction using dμKd^{K}_{\mu}

We have shown that the kernel distance function dμKd^{K}_{\mu} is a distance-like function. Therefore the reconstruction theory for a distance-like function [15] (which is an extension of results for compact sets [14]) holds in the setting of dμKd^{K}_{\mu}. We state the following two corollaries for completeness, whose proofs follow from the proofs of Proposition 4.2 and Theorem 4.6 in [15]. Before their formal statement, we need some notation adapted from [15] to make these statements precise. Let ϕ:d+\phi:{{\mathbb{R}}}^{d}\to{{\mathbb{R}}}^{+} be a distance-like function. A point xdx\in{{\mathbb{R}}}^{d} is an α\alpha-critical point if ϕ2(x+h)ϕ2(x)+2αhϕ(x)+h2\phi^{2}(x+h)\leq\phi^{2}(x)+2\alpha\|h\|\phi(x)+\|h\|^{2} with α[0,1]\alpha\in[0,1], hd\forall h\in{{\mathbb{R}}}^{d}. Let (ϕ)r={xdϕ(x)r}(\phi)^{r}=\{x\in{{\mathbb{R}}}^{d}\mid\phi(x)\leq r\} denote the sublevel set of ϕ\phi, and let (ϕ)[r1,r2]={xdr1ϕ(x)r2}(\phi)^{[r_{1},r_{2}]}=\{x\in{{\mathbb{R}}}^{d}\mid r_{1}\leq\phi(x)\leq r_{2}\} denote all points at levels in the range [r1,r2][r_{1},r_{2}]. For α[0,1]\alpha\in[0,1], the α\alpha-reach of ϕ\phi is the maximum rr such that (ϕ)r(\phi)^{r} has no α\alpha-critical point, denoted as 𝗋𝖾𝖺𝖼𝗁α(ϕ){\sf reach}_{\alpha}(\phi). When α=1\alpha=1, 𝗋𝖾𝖺𝖼𝗁1{\sf reach}_{1} coincides with reach introduced in [40].

Theorem 4.1 (Isotopy lemma on dμKd^{K}_{\mu}).

Let r1<r2r_{1}<r_{2} be two positive numbers such that dμKd^{K}_{\mu} has no critical points in (dμK)[r1,r2](d^{K}_{\mu})^{[r_{1},r_{2}]}. Then all the sublevel sets (dμK)r(d^{K}_{\mu})^{r} are isotopic for r[r1,r2]r\in[r_{1},r_{2}].

Theorem 4.2 (Reconstruction on dμKd^{K}_{\mu}).

Let dμKd^{K}_{\mu} and dνKd^{K}_{\nu} be two kernel distance functions such that dμKdνKε\|d^{K}_{\mu}-d^{K}_{\nu}\|_{\infty}\leq\varepsilon. Suppose 𝗋𝖾𝖺𝖼𝗁α(dμK)R{\sf reach}_{\alpha}(d^{K}_{\mu})\geq R for some α>0\alpha>0. Then r[4ε/α2,R3ε]\forall r\in[4\varepsilon/\alpha^{2},R-3\varepsilon], and η(0,R)\forall\eta\in(0,R), the sublevel sets (dμK)η(d^{K}_{\mu})^{\eta} and (dνK)r(d^{K}_{\nu})^{r} are homotopy equivalent for εR/(5+4/α2)\varepsilon\leq R/(5+4/\alpha^{2}).

4.2 Constructing Topological Estimates using dμKd^{K}_{\mu}

In order to actually construct a topological estimate using the kernel distance dμKd^{K}_{\mu}, one needs to be able to compute quantities related to its sublevel sets, in particular, to compute the persistence diagram of the sub-level sets filtration of dμKd^{K}_{\mu}. Now we describe such tools needed for the kernel distance based on machinery recently developed by Buchet et al. [8], which shows how to approximate the persistent homology of distance-to-a-measure for any metric space via a power distance construction. Then using similar constructions, we can use the weighted Rips filtration to approximate the persistence diagram of the kernel distance.

To state our results, we first require some technical notions and assume basic knowledge of persistent homology (see [34, 35] for a readable background). Given a metric space 𝕏{{\mathbb{X}}} with the distance d𝕏(,)d_{{{\mathbb{X}}}}(\cdot,\cdot), a set P𝕏P\subseteq{{\mathbb{X}}} and a function w:Pw:P\to{{\mathbb{R}}}, the (general) power distance ff associated with (P,w)(P,w) is defined as f(x)=minpP(d𝕏(p,x)2+w(p)2).f(x)=\sqrt{\min_{p\in P}\left(d_{{{\mathbb{X}}}}(p,x)^{2}+w(p)^{2}\right)}. Now given the set (P,w)(P,w) and its corresponding power distance ff, one could use the weighted Rips filtration to approximate the persistence diagram of ww, under certain restrictive conditions proven in Appendix D.1. Consider the sublevel set of ff, f1((,α])f^{-1}((-\infty,\alpha]). It is the union of balls centered at points pPp\in P with radius rp(α)=α2w(p)2r_{p}(\alpha)=\sqrt{\alpha^{2}-w(p)^{2}} for each pp. The weighted Čech complex Cα(P,w)C_{\alpha}(P,w) for parameter α\alpha is the set of simplices ss such that psB(p,rp(α))\bigcap_{p\in s}B(p,r_{p}(\alpha))\neq\emptyset. The weighted Rips complex Rα(P,w)R_{\alpha}(P,w) for parameter α\alpha is the maximal complex whose 11-skeleton is the same as Cα(P,w)C_{\alpha}(P,w). The corresponding weighted Rips filtration is denoted as {Rα(P,w)}\{R_{\alpha}(P,w)\}.

Setting w:=dμPKw:=d^{K}_{\mu_{P}} and given the point set P^+\hat{P}_{+} described in Section 3, consider the weighted Rips filtration {Rα(P^+,dμPK)}\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\} based on the kernel power distance fP^+k{f^{\textsc{k}}_{\hat{P}_{+}}}. We view the persistence diagrams on a logarithmic scale, that is, we change coordinates of points following the mapping (x,y)(lnx,lny)(x,y)\mapsto(\ln x,\ln y). dBlnd_{B}^{\ln} denotes the corresponding bottleneck distance between persistence diagrams. We now state a corollary of Theorem 3.1.
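A sketch of the filtration bookkeeping (assuming NumPy; the names are ours, and the output is intended to be handed to any persistence package that accepts vertex and edge filtration values of a flag complex). A vertex pp enters at α=w(p)\alpha=w(p); following the usual convention for weighted Rips in a general metric, an edge (p,q)(p,q) enters at the smallest α\alpha with rp(α)+rq(α)DK(p,q)r_{p}(\alpha)+r_{q}(\alpha)\geq D_{K}(p,q), found here by bisection.

import numpy as np

def edge_time(dpq, wp, wq, tol=1e-10):
    # smallest alpha >= max(wp, wq) with sqrt(alpha^2 - wp^2) + sqrt(alpha^2 - wq^2) >= dpq
    lo = max(wp, wq)
    def radii_sum(a):
        return np.sqrt(max(a * a - wp * wp, 0.0)) + np.sqrt(max(a * a - wq * wq, 0.0))
    if radii_sum(lo) >= dpq:
        return lo
    hi = lo + dpq                      # at this alpha each radius already exceeds dpq
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if radii_sum(mid) >= dpq:
            hi = mid
        else:
            lo = mid
    return hi

def weighted_rips_times(P_hat, w, sigma):
    # vertex and edge appearance values of {R_alpha(P_hat, w)}; higher simplices of
    # the flag complex appear when their last edge does.
    P_hat = np.asarray(P_hat, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(P_hat)
    s2 = sigma ** 2
    d2 = np.sum((P_hat[:, None, :] - P_hat[None, :, :]) ** 2, axis=-1)
    DK = np.sqrt(np.maximum(2 * s2 - 2 * s2 * np.exp(-d2 / (2 * s2)), 0.0))
    vertex_vals = w.copy()
    edge_vals = {(i, j): edge_time(DK[i, j], w[i], w[j])
                 for i in range(n) for j in range(i + 1, n)}
    return vertex_vals, edge_vals

rng = np.random.default_rng(5)
pts = rng.uniform(size=(20, 2))
wts = rng.uniform(0.1, 0.3, size=20)   # stand-in weights; in our setting w = d^K_{mu_P}
v, e = weighted_rips_times(pts, wts, sigma=0.25)
print(v[:3], sorted(e.values())[:3])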

Corollary 4.3.

The weighted Rips filtration {Rα(P^+,dμPK)}\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\} can be used to approximate the persistence diagram of dμPKd^{K}_{\mu_{P}} such that dBln(Dgm(dμPK),Dgm({Rα(P^+,dμPK)}))ln(271)d_{B}^{\rm\ln}(\textsf{Dgm}(d^{K}_{\mu_{P}}),\textsf{Dgm}(\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\}))\leq\ln(2\sqrt{71}).

Proof 4.4.

To prove that two persistence diagrams are close, one could prove that their filtrations are interleaved [12], that is, two filtrations {Uα}\{U_{\alpha}\} and {Vα}\{V_{\alpha}\} are ε\varepsilon-interleaved if for any α\alpha, UαVα+εUα+2εU_{\alpha}\subseteq V_{\alpha+\varepsilon}\subseteq U_{\alpha+2\varepsilon}.

First, Lemmas D.1 and D.3 prove that the persistence diagrams Dgm(dμPK)\textsf{Dgm}(d^{K}_{\mu_{P}}) and Dgm({Rα(P^+,dμPK)})\textsf{Dgm}(\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\}) are well-defined. Second, the result of Theorem 3.1 implies a multiplicative interleaving. Therefore for any α\alpha\in{{\mathbb{R}}},

{(d^{K}_{\mu_{P}})}^{-1}((-\infty,\alpha])\subseteq{({f^{\textsc{k}}_{\hat{P}_{+}}})}^{-1}((-\infty,\sqrt{71}\,\alpha])\subseteq{(d^{K}_{{\mu_{P}}})}^{-1}((-\infty,\sqrt{2}\sqrt{71}\,\alpha]).

On a logarithmic scale (by taking the natural log of both sides), such an interleaving becomes additive,

\ln d^{K}_{\mu_{P}}-\ln\sqrt{2}\leq\ln{f^{\textsc{k}}_{\hat{P}_{+}}}\leq\ln d^{K}_{\mu_{P}}+\ln\sqrt{71}.

Theorem 4 of [16] implies

d_{B}^{\rm\ln}(\textsf{Dgm}(d^{K}_{\mu_{P}}),\textsf{Dgm}({f^{\textsc{k}}_{\hat{P}_{+}}}))\leq\ln\sqrt{71}.

In addition, by the Persistent Nerve Lemma ([22], Theorem 6 of [62], an extension of the Nerve Theorem [46]), the sublevel set filtration of fP^+k{f^{\textsc{k}}_{\hat{P}_{+}}}, whose sublevel sets are unions of balls of increasing radius, has the same persistent homology as the nerve filtration of these balls (which, by definition, is the Čech filtration). Finally, there exists a multiplicative interleaving between weighted Rips and Čech complexes (Proposition 31 of [16]), CαRαC2α.C_{\alpha}\subseteq R_{\alpha}\subseteq C_{2\alpha}. We then obtain the following bound on persistence diagrams,

d_{B}^{\rm\ln}(\textsf{Dgm}({f^{\textsc{k}}_{\hat{P}_{+}}}),\textsf{Dgm}(\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\}))\leq\ln(2).

We use the triangle inequality to obtain the final result:

d_{B}^{\rm\ln}(\textsf{Dgm}(d^{K}_{\mu_{P}}),\textsf{Dgm}(\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\}))\leq\ln(2\sqrt{71}).

Based on Corollary 4.3, we have an algorithm that approximates the persistent homology of the sublevel set filtration of dμKd^{K}_{\mu} by constructing the weighted Rips filtration corresponding to the kernel-based power distance and computing its persistent homology. For memory efficient computation, sparse (weighted) Rips filtrations could be adapted by considering simplices on subsamples at each scale [63, 16], although some restrictions on the space apply.

4.3 Distance to the Support of a Measure vs. Kernel Distance

Suppose μ\mu is a uniform measure on a compact set SS in d{{\mathbb{R}}}^{d}. We now compare the kernel distance dμKd^{K}_{\mu} with the distance function fSf_{S} to the support SS of μ\mu. We show how dμKd^{K}_{\mu} approximates fSf_{S}, and thus allows one to infer geometric properties of SS from samples from μ\mu.

A generalized gradient and its corresponding flow associated with a distance function are described in [14] and later adapted for distance-like functions in [15]. Let fS:df_{S}:{{\mathbb{R}}}^{d}\to{{\mathbb{R}}} be a distance function associated with a compact set SS of d{{\mathbb{R}}}^{d}. It is not differentiable on the medial axis of SS. A generalized gradient function S:dd{\nabla_{S}{}}:{{\mathbb{R}}}^{d}\to{{\mathbb{R}}}^{d} coincides with the usual gradient of fSf_{S} where fSf_{S} is differentiable, is defined everywhere, and can be integrated into a continuous flow Φt:dd\Phi^{t}:{{\mathbb{R}}}^{d}\to{{\mathbb{R}}}^{d} that points away from SS. Let γ\gamma be an integral (flow) line. The following lemma shows that, when close enough to SS, dμK(x)d^{K}_{\mu}(x) is strictly increasing along any γ\gamma. The proof is quite technical and is thus deferred to Appendix D.2.

Lemma 4.5.

Given any flow line γ\gamma associated with the generalized gradient function S{\nabla_{S}{}}, dμK(x)d^{K}_{\mu}(x) is strictly monotonically increasing along γ\gamma for xx sufficiently far away from the medial axis of SS, for σR6ΔG\sigma\leq\frac{R}{6\Delta_{G}} and fS(x)(0.014R,2σ)f_{S}(x)\in(0.014R,2\sigma). Here B(σ/2)B(\sigma/2) denotes a ball of radius σ/2\sigma/2, G:=Vol(B(σ/2))Vol(S)G:=\frac{{\textsf{Vol}}(B(\sigma/2))}{{\textsf{Vol}}(S)}, ΔG:=12+3ln(4/G)\Delta_{G}:=\sqrt{12+3\ln(4/G)} and suppose R:=min(𝗋𝖾𝖺𝖼𝗁(S),𝗋𝖾𝖺𝖼𝗁(dS))>0R:=\min({\sf reach}(S),{\sf reach}({{\mathbb{R}}}^{d}\setminus S))>0.

The strict monotonicity of dμKd^{K}_{\mu} along the flow line under the conditions in Lemma 4.5 makes it possible to define a deformation retract of the sublevel sets of dμKd^{K}_{\mu} onto sublevel sets of fSf_{S}. Such a deformation retract defines a special case of homotopy equivalence between the sublevel sets of dμKd^{K}_{\mu} and sublevel sets of fSf_{S}. Consider a sufficiently large point set PdP\subset\mathbb{R}^{d} sampled from μ\mu, and its induced measure μP{\mu_{P}}. We can then also invoke Theorem 4.2 and a sampling bound (see Section 6 and Lemma B.3) to show homotopy equivalence between the sublevel sets of fSf_{S} and dμPKd^{K}_{\mu_{P}}.

Note that Lemma 4.5 uses somewhat restrictive conditions related to the reach of a compact set, however we believe such conditions could be further relaxed to be associated with the concept of μ\mu-reach as described in [14].

5 Stability Properties for the Kernel Distance to a Measure

Lemma 5.1 (K4).

For two measures μ\mu and ν\nu on d{{\mathbb{R}}}^{d} we have dμKdνKDK(μ,ν)\|d^{K}_{\mu}-d^{K}_{\nu}\|_{\infty}\leq D_{K}(\mu,\nu).

Proof 5.2.

Since DK(,)D_{K}(\cdot,\cdot) is a metric, then by triangle inequality, for any xdx\in{{\mathbb{R}}}^{d} we have DK(μ,x)DK(μ,ν)+DK(ν,x)D_{K}(\mu,x)\leq D_{K}(\mu,\nu)+D_{K}(\nu,x) and DK(ν,x)DK(ν,μ)+DK(μ,x)D_{K}(\nu,x)\leq D_{K}(\nu,\mu)+D_{K}(\mu,x). Therefore for any xdx\in{{\mathbb{R}}}^{d} we have |DK(μ,x)DK(ν,x)|DK(μ,ν)|D_{K}(\mu,x)-D_{K}(\nu,x)|\leq D_{K}(\mu,\nu), proving the claim.

Both the Wasserstein and kernel distance are integral probability metrics [66], so (M4) and (K4) are both interesting, but not easily comparable. We now attempt to reconcile this.

5.1 Comparing DKD_{K} to W2W_{2}

Lemma 5.3.

There is no Lipschitz constant γ\gamma such that for any two probability measures μ\mu and ν\nu we have W2(μ,ν)γDK(μ,ν)W_{2}(\mu,\nu)\leq\gamma D_{K}(\mu,\nu).

Proof 5.4.

Consider two measures μ\mu and ν\nu which are almost identical: the only difference is some mass of measure τ\tau is moved from its location in μ\mu a distance nn in ν\nu. The Wasserstein distance requires a transportation plan that moves this τ\tau mass in ν\nu back to where it was in μ\mu with cost τΩ(n)\tau\cdot\Omega(n) in W2(μ,ν)W_{2}(\mu,\nu). On the other hand, DK(μ,ν)=κ(μ,μ)+κ(ν,ν)2κ(μ,ν)σ2+σ220=2σD_{K}(\mu,\nu)=\sqrt{\kappa(\mu,\mu)+\kappa(\nu,\nu)-2\kappa(\mu,\nu)}\leq\sqrt{\sigma^{2}+\sigma^{2}-2\cdot 0}=\sqrt{2}\sigma is bounded.

We conjecture for any two probability measures μ\mu and ν\nu that DK(μ,ν)W2(μ,ν)D_{K}(\mu,\nu)\leq W_{2}(\mu,\nu). This would show that dμKd^{K}_{\mu} is at least as stable as dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}} since a bound on W2(μ,ν)W_{2}(\mu,\nu) would also bound DK(μ,ν)D_{K}(\mu,\nu), but not vice versa. Alternatively, this can be viewed as saying that dμKd^{K}_{\mu} is less discriminative than dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}}; we view this as a positive in this setting, as it is mainly less discriminative towards outliers (far away points). Here we only show this property for special cases and in the limit as σ\sigma\to\infty. To simplify notation, all integrals are assumed to be over the full domain d{{\mathbb{R}}}^{d}.

Two Dirac masses.

We first consider a special case when μ\mu is a Dirac mass at a point pp and ν\nu is a Dirac mass at a point qq. That is they are both single points. We can then write DK(μ,ν)=DK(p,q)D_{K}(\mu,\nu)=D_{K}(p,q). Figure 2 illustrates the result of this lemma.

Lemma 5.5.

For any points p,qdp,q\in{{\mathbb{R}}}^{d} it always holds that pqDK(p,q)\|p-q\|\geq D_{K}(p,q). When pq3σ\|p-q\|\leq\sqrt{3}\sigma then DK(p,q)pq/2D_{K}(p,q)\geq\|p-q\|/2.

Proof 5.6.

First expand DK(p,q)2D_{K}(p,q)^{2} as

DK(p,q)2=2σ22K(p,q)=2σ2(1exp(pq22σ2)).D_{K}(p,q)^{2}=2\sigma^{2}-2K(p,q)=2\sigma^{2}\left(1-\exp\left(\frac{-\|p-q\|^{2}}{2\sigma^{2}}\right)\right).

Now using that 1tet1t+(1/2)t21-t\leq e^{-t}\leq 1-t+(1/2)t^{2} for t0t\geq 0

DK(p,q)2=2σ2(1exp(pq22σ2))2σ2(pq22σ2)=pq2D_{K}(p,q)^{2}=2\sigma^{2}\left(1-\exp\left(\frac{-\|p-q\|^{2}}{2\sigma^{2}}\right)\right)\leq 2\sigma^{2}\left(\frac{\|p-q\|^{2}}{2\sigma^{2}}\right)=\|p-q\|^{2}

and

DK(p,q)2\displaystyle D_{K}(p,q)^{2} =2σ2(1exp(pq22σ2))\displaystyle=2\sigma^{2}\left(1-\exp\left(\frac{-\|p-q\|^{2}}{2\sigma^{2}}\right)\right)
2σ2(pq22σ212pq44σ4)\displaystyle\geq 2\sigma^{2}\left(\frac{\|p-q\|^{2}}{2\sigma^{2}}-\frac{1}{2}\frac{\|p-q\|^{4}}{4\sigma^{4}}\right)
=pq24(4pq2σ2)\displaystyle=\frac{\|p-q\|^{2}}{4}\left(4-\frac{\|p-q\|^{2}}{\sigma^{2}}\right)
pq2/4,\displaystyle\geq\|p-q\|^{2}/4,

where the last inequality holds when pq3σ\|p-q\|\leq\sqrt{3}\sigma.

Figure 2: Showing that x0/2DK(x,0)x0\|x-0\|/2\leq D_{K}(x,0)\leq\|x-0\|, where the first inequality holds for x3σ\|x\|\leq\sqrt{3}\sigma. The kernel distance DK(x,0)D_{K}(x,0) is shown for σ={1/2,1,2}\sigma=\{1/2,1,2\} in purple, blue, and red, respectively.
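A quick numeric check of Lemma 5.5 and Figure 2 (a sketch assuming NumPy): for two Dirac masses DK(p,q)2=2σ2(1exp(pq2/2σ2))D_{K}(p,q)^{2}=2\sigma^{2}(1-\exp(-\|p-q\|^{2}/2\sigma^{2})), and the ratio DK(p,q)/pqD_{K}(p,q)/\|p-q\| stays within [1/2,1][1/2,1] on the range pq3σ\|p-q\|\leq\sqrt{3}\sigma.

import numpy as np

def dk_two_points(dist, sigma):
    # D_K(p, q) for two Dirac masses at Euclidean distance `dist`
    return np.sqrt(2 * sigma ** 2 * (1 - np.exp(-dist ** 2 / (2 * sigma ** 2))))

sigma = 1.0
dists = np.linspace(1e-3, np.sqrt(3) * sigma, 200)
ratios = dk_two_points(dists, sigma) / dists
print(ratios.min(), ratios.max())   # observed to lie within [0.5, 1] on this range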

One Dirac mass.

Consider the case where one measure ν\nu is a Dirac mass at point xdx\in{{\mathbb{R}}}^{d}.

Lemma 5.7.

Consider two probability measures μ\mu and ν\nu on d{{\mathbb{R}}}^{d} where ν\nu is represented by a Dirac mass at a point xdx\in{{\mathbb{R}}}^{d}. Then dμK(x)=DK(μ,ν)W2(μ,ν)d^{K}_{\mu}(x)=D_{K}(\mu,\nu)\leq W_{2}(\mu,\nu) for any σ>0\sigma>0, where the equality only holds when μ\mu is also a Dirac mass at xx.

Proof 5.8.

Since both W2(μ,ν)W_{2}(\mu,\nu) and DK(μ,ν)D_{K}(\mu,\nu) are metrics and hence non-negative, we can instead consider their squared versions: (W2(μ,ν))2=ppx2μ(p)dp(W_{2}(\mu,\nu))^{2}=\int_{p}\|p-x\|^{2}\mu(p){\rm d}{p} and

(DK(μ,ν))2\displaystyle(D_{K}(\mu,\nu))^{2} =K(x,x)+(p,q)K(p,q)dmμ,μ(p,q)2pK(p,x)dμ(p)\displaystyle=K(x,x)+\int_{(p,q)}K(p,q){\rm d}{\textsf{m}_{\mu,\mu}(p,q)}-2\int_{p}K(p,x){\rm d}{\mu(p)}
=σ2(1+(p,q)exp(pq22σ2)dmμ,μ(p,q)2pexp(px22σ2)dμ(p)).\displaystyle=\sigma^{2}\left(1+\int_{(p,q)}\exp\left(-\frac{\|p-q\|^{2}}{2\sigma^{2}}\right){\rm d}{\textsf{m}_{\mu,\mu}(p,q)}-2\int_{p}\exp\left(-\frac{\|p-x\|^{2}}{2\sigma^{2}}\right){\rm d}{\mu(p)}\right).

Now use the bound 1tet11-t\leq e^{-t}\leq 1 for t0t\geq 0 to approximate

(DK(μ,ν))2\displaystyle(D_{K}(\mu,\nu))^{2} σ2(1+(p,q)(1)dmμ,μ(p,q)2p(1px22σ2)dμ(p))\displaystyle\leq\sigma^{2}\left(1+\int_{(p,q)}(1)\,{\rm d}{\textsf{m}_{\mu,\mu}(p,q)}-2\int_{p}\left(1-\frac{\|p-x\|^{2}}{2\sigma^{2}}\right){\rm d}{\mu(p)}\right)
=ppx2dμ(p)=(W2(μ,ν))2.\displaystyle=\int_{p}\|p-x\|^{2}{\rm d}{\mu(p)}=(W_{2}(\mu,\nu))^{2}.

The inequality becomes an equality only when px=0\|p-x\|=0 for all pp in the support of μ\mu, and since they are both metrics, this is the only location where they are both 00.

General case.

Next we show that if ν\nu is not a unit Dirac, then this inequality holds in the limit as σ\sigma goes to infinity. The technical work is making precise how σ2K(p,x)xp2/2\sigma^{2}-K(p,x)\leq\|x-p\|^{2}/2 and how this compares to bounds on DK(μ,ν)D_{K}(\mu,\nu) and W2(μ,ν)W_{2}(\mu,\nu).

For simpler exposition, we assume μ\mu is a probability measure, that is pμ(p)dp=1\int_{p}\mu(p){\rm d}{p}=1; otherwise we can normalize μ\mu at the appropriate locations, and all of the results go through.

Lemma 5.9.

For any p,qdp,q\in{{\mathbb{R}}}^{d} we have K(p,q)=σ2pq22+i=2(pq2)i2iσ2i2i!.\displaystyle K(p,q)=\sigma^{2}-\frac{\|p-q\|^{2}}{2}+\sum_{i=2}^{\infty}\frac{(-\|p-q\|^{2})^{i}}{2^{i}\sigma^{2i-2}i!}.

Proof 5.10.

We use the Taylor expansion of ex=i=0xi/i!=1+x+i=2xi/i!e^{x}=\sum_{i=0}^{\infty}x^{i}/i!=1+x+\sum_{i=2}^{\infty}x^{i}/i!. Then it is easy to see

K(p,q)=σ2exp(pq22σ2)=σ2pq22+i=2(pq2)i2iσ2i2i!.K(p,q)=\sigma^{2}\exp\left(-\frac{\|p-q\|^{2}}{2\sigma^{2}}\right)=\sigma^{2}-\frac{\|p-q\|^{2}}{2}+\sum_{i=2}^{\infty}\frac{(-\|p-q\|^{2})^{i}}{2^{i}\sigma^{2i-2}i!}.

This lemma illustrates why the choice of coefficient of σ2\sigma^{2} is convenient. Since then σ2K(p,q)\sigma^{2}-K(p,q) acts like 12pq2\frac{1}{2}\|p-q\|^{2}, and becomes closer as σ\sigma increases. Define μ¯=ppdμ(p)\bar{\mu}=\int_{p}p\cdot{\rm d}{\mu}(p) to represent the mean point of measure μ\mu; Var(μ)=(pp2dμ(p))μ¯2\textsf{Var}(\mu)=(\int_{p}\|p\|^{2}{\rm d}{\mu(p)})-\|\bar{\mu}\|^{2} to represent the variance of the measure μ\mu; and Δμ,ν=(p,q)i=2(pq2)i2iσ2i2i!dmμ,ν(p,q)\Delta_{\mu,\nu}=\int_{(p,q)}\sum_{i=2}^{\infty}\frac{(-\|p-q\|^{2})^{i}}{2^{i}\sigma^{2i-2}i!}{\rm d}{\textsf{m}_{\mu,\nu}(p,q)}.

Lemma 5.11.

For any xdx\in{{\mathbb{R}}}^{d} we have ppx2dμ(p)=μ¯x2+Var(μ).\displaystyle\int_{p}\|p-x\|^{2}{\rm d}{\mu(p)}=\|\bar{\mu}-x\|^{2}+\textsf{Var}(\mu).

Proof 5.12.
ppx2dμ(p)\displaystyle\int_{p}\|p-x\|^{2}{\rm d}{\mu(p)} =p(p2+x22p,x)dμ(p)\displaystyle=\int_{p}\left(\|p\|^{2}+\|x\|^{2}-2\langle p,x\rangle\right){\rm d}{\mu(p)}
=pp2dμ(p)+x22pp,xdμ(p)\displaystyle=\int_{p}\|p\|^{2}{\rm d}{\mu(p)}+\|x\|^{2}-2\int_{p}\langle p,x\rangle{\rm d}{\mu(p)}
=(pp2dμ(p)μ¯2)+μ¯2+x22μ¯,x\displaystyle=\left(\int_{p}\|p\|^{2}{\rm d}{\mu(p)}-\|\bar{\mu}\|^{2}\right)+\|\bar{\mu}\|^{2}+\|x\|^{2}-2\langle\bar{\mu},x\rangle
=Var(μ)+μ¯x2.\displaystyle=\textsf{Var}(\mu)+\|\bar{\mu}-x\|^{2}.
Lemma 5.13.

For probability measures μ\mu and ν\nu on d{{\mathbb{R}}}^{d}, κ(μ,ν)=σ212(μ¯ν¯2+Var(μ)+Var(ν))+Δμ,ν.\kappa(\mu,\nu)=\sigma^{2}-\frac{1}{2}\left(\|\bar{\mu}-\bar{\nu}\|^{2}+\textsf{Var}(\mu)+\textsf{Var}(\nu)\right)+\Delta_{\mu,\nu}.

Proof 5.14.

We use Lemma 5.9 to expand

κ(μ,ν)\displaystyle\kappa(\mu,\nu) =(p,q)K(p,q)dmμ,ν(p,q)\displaystyle=\int_{(p,q)}K(p,q){\rm d}{\textsf{m}_{\mu,\nu}(p,q)}
=σ2(p,q)(pq22i=2(pq2)i2iσ2i2i!)dmμ,ν(p,q).\displaystyle=\sigma^{2}-\int_{(p,q)}\left(\frac{\|p-q\|^{2}}{2}-\sum_{i=2}^{\infty}\frac{(-\|p-q\|^{2})^{i}}{2^{i}\sigma^{2i-2}i!}\right){\rm d}{\textsf{m}_{\mu,\nu}(p,q)}.

After shifting the Δμ,ν\Delta_{\mu,\nu} term outside, we can use Lemma 5.11 (twice) to rewrite

p(qpq2dν(q))dμ(p)\displaystyle\int_{p}\left(\int_{q}\|p-q\|^{2}{\rm d}{\nu(q)}\right){\rm d}{\mu(p)} =p(pν¯2+Var(ν))dμ(p)\displaystyle=\int_{p}\left(\|p-\bar{\nu}\|^{2}+\textsf{Var}(\nu)\right){\rm d}{\mu(p)}
=μ¯ν¯2+Var(μ)+Var(ν).\displaystyle=\|\bar{\mu}-\bar{\nu}\|^{2}+\textsf{Var}(\mu)+\textsf{Var}(\nu).
Theorem 5.15.

For any two probability measures μ\mu and ν\nu defined on d{{\mathbb{R}}}^{d} limσDK(μ,ν)=μ¯ν¯.\displaystyle\lim_{\sigma\to\infty}D_{K}(\mu,\nu)=\|\bar{\mu}-\bar{\nu}\|.

Proof 5.16.

First expand

(DK(μ,ν))2=\displaystyle(D_{K}(\mu,\nu))^{2}= κ(μ,μ)+κ(ν,ν)2κ(μ,ν)\displaystyle\;\kappa(\mu,\mu)+\kappa(\nu,\nu)-2\kappa(\mu,\nu)
=\displaystyle= (σ212μ¯μ¯2Var(μ)+Δμ,μ)+(σ212ν¯ν¯2Var(ν)+Δν,ν)\displaystyle\left(\sigma^{2}-\frac{1}{2}\|\bar{\mu}-\bar{\mu}\|^{2}-\textsf{Var}(\mu)+\Delta_{\mu,\mu}\right)+\left(\sigma^{2}-\frac{1}{2}\|\bar{\nu}-\bar{\nu}\|^{2}-\textsf{Var}(\nu)+\Delta_{\nu,\nu}\right)
2(σ212μ¯ν¯212Var(μ)12Var(ν)+Δμ,ν)\displaystyle-2\left(\sigma^{2}-\frac{1}{2}\|\bar{\mu}-\bar{\nu}\|^{2}-\frac{1}{2}\textsf{Var}(\mu)-\frac{1}{2}\textsf{Var}(\nu)+\Delta_{\mu,\nu}\right)
=\displaystyle= μ¯ν¯2+Δμ,μ+Δν,ν2Δμ,ν.\displaystyle\;\|\bar{\mu}-\bar{\nu}\|^{2}+\Delta_{\mu,\mu}+\Delta_{\nu,\nu}-2\Delta_{\mu,\nu}.

Finally, observe that all terms of Δμ,ν\Delta_{\mu,\nu} are divided by σ2\sigma^{2} or larger powers of σ\sigma. Thus as σ\sigma increases, Δμ,ν\Delta_{\mu,\nu} approaches 0 and (DK(μ,ν))2(D_{K}(\mu,\nu))^{2} approaches μ¯ν¯2\|\bar{\mu}-\bar{\nu}\|^{2}, completing the proof.

Now we can relate DK(μ,ν)D_{K}(\mu,\nu) to W2(μ,ν)W_{2}(\mu,\nu) through μ¯ν¯\|\bar{\mu}-\bar{\nu}\|. The next result is a known lower bound for the earth mover's distance ([23], Theorem 7). We reprove it in Appendix E for completeness.

Lemma 5.17.

For any probability measures μ\mu and ν\nu defined on d{{\mathbb{R}}}^{d} we have μ¯ν¯W2(μ,ν).\|\bar{\mu}-\bar{\nu}\|\leq W_{2}(\mu,\nu).

We can now combine these results to achieve the following theorem.

Theorem 5.18.

For any two probability measures μ\mu and ν\nu defined on d{{\mathbb{R}}}^{d} limσDK(μ,ν)=μ¯ν¯\displaystyle\lim_{\sigma\to\infty}D_{K}(\mu,\nu)=\|\bar{\mu}-\bar{\nu}\| and μ¯ν¯W2(μ,ν).\|\bar{\mu}-\bar{\nu}\|\leq W_{2}(\mu,\nu). Thus limσDK(μ,ν)W2(μ,ν)\lim_{\sigma\to\infty}D_{K}(\mu,\nu)\leq W_{2}(\mu,\nu).

5.2 Kernel Distance Stability with Respect to σ\sigma

We now explore the Lipschitz properties of dμKd^{K}_{\mu} with respect to the noise parameter σ\sigma. We argue that any distance function that is robust to noise needs some parameter to address how many outliers to ignore or how far away a point must be to count as an outlier. For instance, this parameter in dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}} is m0m_{0}, which controls the amount of the measure μ\mu to be used in the distance.

Here we show that dμKd^{K}_{\mu} has a particularly nice property: it is Lipschitz with respect to the choice of σ\sigma for any fixed xx. The larger σ\sigma is, the more effect outliers have; the smaller σ\sigma is, the less the data is smoothed and thus the closer the noise needs to be to the underlying object to affect the inference.

Lemma 5.19.

Let h(σ,z)=exp(z2/2σ2)h(\sigma,z)=\exp(-z^{2}/2\sigma^{2}). We can bound h(σ,z)1h(\sigma,z)\leq 1, ddσh(σ,z)(2/e)/σ\frac{{\rm d}{}}{{\rm d}{\sigma}}h(\sigma,z)\leq(2/e)/\sigma and d2dσ2h(σ,z)(18/e3)/σ2\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}h(\sigma,z)\leq(18/e^{3})/\sigma^{2} over any choice of z>0z>0.

Proof 5.20.

The first bound follows from y=z2/2σ20y=-z^{2}/2\sigma^{2}\leq 0 and exp(y)1\exp(y)\leq 1 for y0y\leq 0.

Next we define

w1(σ,z)\displaystyle w_{1}(\sigma,z) =ddσh(σ,z)=z2σ3exp(z22σ2), and\displaystyle=\frac{{\rm d}{}}{{\rm d}{\sigma}}h(\sigma,z)=\frac{z^{2}}{\sigma^{3}}\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right),\text{ and}
w2(σ,z)\displaystyle w_{2}(\sigma,z) =d2dσ2h(σ,z)=(z4σ63z2σ4)exp(z22σ2).\displaystyle=\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}h(\sigma,z)=\left(\frac{z^{4}}{\sigma^{6}}-\frac{3z^{2}}{\sigma^{4}}\right)\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right).

Now to solve the first part, we differentiate w1w_{1} with respect to zz to find its maximum over all choices of zz.

ddzw1(σ,z)=(2zσ3z3σ5)exp(z22σ2)\frac{{\rm d}{}}{{\rm d}{z}}w_{1}(\sigma,z)=\left(\frac{2z}{\sigma^{3}}-\frac{z^{3}}{\sigma^{5}}\right)\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right)

Now (d/dz)w1(σ,z)=0({\rm d}{}/{\rm d}{z})w_{1}(\sigma,z)=0 at z=0z=0, at z=2σz=\sqrt{2}\sigma, and as zz approaches \infty. Thus the maximum must occur at one of these values. Both w1(σ,0)=0w_{1}(\sigma,0)=0 and limzw1(σ,z)=0\lim_{z\to\infty}w_{1}(\sigma,z)=0, while w1(σ,2σ)=(2/e)/σw_{1}(\sigma,\sqrt{2}\sigma)=(2/e)/\sigma, proving the first part.

To solve the second part, we perform the same approach on w2w_{2}.

ddzw2(σ,z)\displaystyle\frac{{\rm d}{}}{{\rm d}{z}}w_{2}(\sigma,z) =(z5σ8+3z3σ6+4z3σ66zσ4)exp(z22σ2)\displaystyle=\left(\frac{-z^{5}}{\sigma^{8}}+\frac{3z^{3}}{\sigma^{6}}+\frac{4z^{3}}{\sigma^{6}}-\frac{6z}{\sigma^{4}}\right)\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right)
=zσ4(z4σ4+7z2σ26)exp(z22σ2)\displaystyle=\frac{z}{\sigma^{4}}\left(\frac{-z^{4}}{\sigma^{4}}+\frac{7z^{2}}{\sigma^{2}}-6\right)\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right)

Thus (d/dz)w2(σ,z)=0({\rm d}{}/{\rm d}{z})w_{2}(\sigma,z)=0 at z={0,σ,6σ}z=\{0,\sigma,\sqrt{6}\sigma\} and as zz goes to \infty for z[0,)z\in[0,\infty). Both w2(σ,0)=0w_{2}(\sigma,0)=0 and limzw2(σ,z)=0\lim_{z\to\infty}w_{2}(\sigma,z)=0. The minimum occurs at w2(σ,z=σ)=(2/e)/σ2w_{2}(\sigma,z=\sigma)=(-2/\sqrt{e})/\sigma^{2}. The maximum occurs at w2(σ,z=6σ)=(18/e3)/σ2w_{2}(\sigma,z=\sqrt{6}\sigma)=(18/e^{3})/\sigma^{2}.

Theorem 5.21.

For any measure μ\mu defined on d\mathbb{R}^{d} and xdx\in\mathbb{R}^{d}, dμK(x)d^{K}_{\mu}(x) is \ell-Lipschitz with respect to σ\sigma, for =18/e3+8/e+2<6\ell=18/e^{3}+8/e+2<6.

Proof 5.22.

Recall that mμ,ν\textsf{m}_{\mu,\nu} is the product measure of any μ\mu and ν\nu, and define Mμ,ν\textsf{M}_{\mu,\nu} as Mμ,ν(p,q)=mμ,μ(p,q)+mν,ν(p,q)2mμ,ν(p,q)\textsf{M}_{\mu,\nu}(p,q)=\textsf{m}_{\mu,\mu}(p,q)+\textsf{m}_{\nu,\nu}(p,q)-2\textsf{m}_{\mu,\nu}(p,q). It is now useful to define a function fx(σ)f_{x}(\sigma) as

fx(σ)=(p,q)exp(pq22σ2)dMμ,δx(p,q).f_{x}(\sigma)=\int_{(p,q)}\exp\left(\frac{-\|p-q\|^{2}}{2\sigma^{2}}\right){\rm d}{\textsf{M}_{\mu,\delta_{x}}(p,q)}.

So dμK(x)=σfx(σ)d^{K}_{\mu}(x)=\sigma\sqrt{f_{x}(\sigma)} and we can write another function as

F(σ)=(dμK(x))2σ2=σ2fx(σ)σ2.F(\sigma)=(d^{K}_{\mu}(x))^{2}-\ell\sigma^{2}=\sigma^{2}f_{x}(\sigma)-\ell\sigma^{2}.

Now to prove dμK(x)d^{K}_{\mu}(x) is \ell-Lipschitz, we can show that (dμK)2(d^{K}_{\mu})^{2} is \ell-semiconcave with respect to σ\sigma, and apply Lemma 2.3. This boils down to showing the second derivative of F(σ)F(\sigma) is always non-positive.

ddσF(σ)=2σfx(σ)+σ2ddσfx(σ)2σ.\frac{{\rm d}{}}{{\rm d}{\sigma}}F(\sigma)=2\sigma f_{x}(\sigma)+\sigma^{2}\frac{{\rm d}{}}{{\rm d}{\sigma}}f_{x}(\sigma)-2\sigma\ell.
d2dσ2F(σ)=σ2d2dσ2fx(σ)+4σddσfx(σ)+2fx(σ)2.\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}F(\sigma)=\sigma^{2}\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}f_{x}(\sigma)+4\sigma\frac{{\rm d}{}}{{\rm d}{\sigma}}f_{x}(\sigma)+2f_{x}(\sigma)-2\ell.

First we note that since (p,q)cdmμ,ν(p,q)=c\int_{(p,q)}c\cdot{\rm d}{}\textsf{m}_{\mu,\nu}(p,q)=c for any product distribution mμ,ν\textsf{m}_{\mu,\nu} of two distributions μ\mu and ν\nu, including when μ\mu or ν\nu is a Dirac mass, then

(p,q)cdMμ,δx(p,q)=(p,q)cd[mμ,μ+mδx,δx2mμ,δx](p,q)2c.\int_{(p,q)}c\cdot{\rm d}{}\textsf{M}_{\mu,\delta_{x}}(p,q)=\int_{(p,q)}c\cdot{\rm d}{}\Big{[}\textsf{m}_{\mu,\mu}+\textsf{m}_{\delta_{x},\delta_{x}}-2\textsf{m}_{\mu,\delta_{x}}\Big{]}(p,q)\leq 2c.

Thus since exp(pq22σ2)\exp\left(\frac{-\|p-q\|^{2}}{2\sigma^{2}}\right) is in [0,1][0,1] for all choices of pp, qq, and σ>0\sigma>0, then 0fx(σ)20\leq f_{x}(\sigma)\leq 2 and 2fx(σ)42f_{x}(\sigma)\leq 4. This bounds the third term in d2dσ2F(σ)\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}F(\sigma), we now need to use a similar approach to bound the first and second terms.

Let h(σ,z)=exp(z22σ2)h(\sigma,z)=\exp\left(\frac{-z^{2}}{2\sigma^{2}}\right), so we can apply Lemma 5.19.

4\sigma\frac{{\rm d}{}}{{\rm d}{\sigma}}f_{x}(\sigma)=4\sigma\int_{(p,q)}\left(\frac{{\rm d}{}}{{\rm d}{\sigma}}h(\sigma,\|p-q\|)\right){\rm d}{}\textsf{M}_{\mu,\delta_{x}}(p,q)\leq 4\sigma\cdot\left((2/e)/\sigma\right)\cdot 2=16/e
\sigma^{2}\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}f_{x}(\sigma)=\sigma^{2}\int_{(p,q)}\left(\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}h(\sigma,\|p-q\|)\right){\rm d}{}\textsf{M}_{\mu,\delta_{x}}(p,q)\leq\sigma^{2}\cdot\left((18/e^{3})/\sigma^{2}\right)\cdot 2=36/e^{3}

Then we complete the proof using the upper bound on each term of d2dσ2F(σ)\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}F(\sigma):

d2dσ2F(σ)\displaystyle\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}F(\sigma) =σ2d2dσ2fx(σ)+4σddσfx(σ)+2fx(σ)2\displaystyle=\sigma^{2}\frac{{\rm d}{}^{2}}{{\rm d}{\sigma}^{2}}f_{x}(\sigma)+4\sigma\frac{{\rm d}{}}{{\rm d}{\sigma}}f_{x}(\sigma)+2f_{x}(\sigma)-2\ell
36/e3+16/e+42(18/e3+8/e+2)=0.\displaystyle\leq 36/e^{3}+16/e+4-2(18/e^{3}+8/e+2)=0.

Lipschitz in m0m_{0} for dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}}.

We show that there is no Lipschitz property for dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}} with respect to m0m_{0} that is independent of the measure μ\mu. Consider a measure μP\mu_{P} for a point set PP\subset\mathbb{R} consisting of two points, at a=0a=0 and at b=Δb=\Delta. Now consider dμP,m0ccm(a)d^{\textsf{{ccm}}}_{{\mu_{P}},m_{0}}(a). When m01/2m_{0}\leq 1/2, then dμP,m0ccm(a)=0d^{\textsf{{ccm}}}_{{\mu_{P}},m_{0}}(a)=0 is constant. But for m0=1/2+αm_{0}=1/2+\alpha with α>0\alpha>0, we have dμP,m0ccm(a)=αΔ/(1/2+α)d^{\textsf{{ccm}}}_{{\mu_{P}},m_{0}}(a)=\alpha\Delta/(1/2+\alpha) and ddm0dμP,m0ccm(a)=ddαdμP,12+αccm(a)=(1/2+2α)Δ(1/2+α)2\frac{{\rm d}{}}{{\rm d}{m_{0}}}d^{\textsf{{ccm}}}_{{\mu_{P}},m_{0}}(a)=\frac{{\rm d}{}}{{\rm d}{\alpha}}d^{\textsf{{ccm}}}_{{\mu_{P}},\frac{1}{2}+\alpha}(a)=\frac{(1/2+2\alpha)\Delta}{(1/2+\alpha)^{2}}, which is maximized as α\alpha approaches 0, with supremum 2Δ2\Delta. If n1n-1 points are at bb and 11 point is at aa, then the supremum of this derivative with respect to m0m_{0} is nΔn\Delta. Hence for a measure μP\mu_{P} defined by a point set, the supremum of ddm0dμP,m0ccm(a)\frac{{\rm d}{}}{{\rm d}{m_{0}}}d^{\textsf{{ccm}}}_{{\mu_{P}},m_{0}}(a), and hence a lower bound on the Lipschitz constant, is nΔPn\Delta_{P} where ΔP=maxp,pPpp\Delta_{P}=\max_{p,p^{\prime}\in P}\|p-p^{\prime}\|.

6 Algorithmic and Approximation Observations

Kernel coresets.

The kernel distance is robust under random samples [48]. Specifically, if QQ is a point set randomly chosen from μ\mu of size O((1/ε2)(d+log(1/δ)))O((1/\varepsilon^{2})(d+\log(1/\delta))), then kdeμkdeQε\|\textsc{kde}_{\mu}-\textsc{kde}_{Q}\|_{\infty}\leq\varepsilon with probability at least 1δ1-\delta. We call such a subset QQ an ε\varepsilon-kernel sample of (μ,K)(\mu,K). Furthermore, it is also possible to construct ε\varepsilon-kernel samples QQ of even smaller size |Q|=O(((1/ε)log(1/εδ))2d/(d+2))|Q|=O(((1/\varepsilon)\sqrt{\log(1/\varepsilon\delta)})^{2d/(d+2)}) [57]; in particular, in 2{{\mathbb{R}}}^{2} the required size is |Q|=O((1/ε)log(1/εδ))|Q|=O((1/\varepsilon)\sqrt{\log(1/\varepsilon\delta)}). Exploiting the above constructions, recent work [71] builds a data structure to allow for efficient approximate evaluations of kdeP\textsc{kde}_{P} on datasets of size |P|=100,000,000|P|=100{,}000{,}000.

These constructions of QQ also immediately imply that (dμK)2(dQK)24ε\|(d^{K}_{\mu})^{2}-(d^{K}_{Q})^{2}\|_{\infty}\leq 4\varepsilon since (dμK(x))2=κ(μ,μ)+κ(x,x)2kdeμ(x)(d^{K}_{\mu}(x))^{2}=\kappa(\mu,\mu)+\kappa(x,x)-2\textsc{kde}_{\mu}(x), and both the first and third term incur at most 2ε2\varepsilon error in converting to κ(Q,Q)\kappa(Q,Q) and 2kdeQ(x)2\textsc{kde}_{Q}(x), respectively (see Lemma B.1). Thus, an (ε2/4)(\varepsilon^{2}/4)-kernel sample QQ of (μ,K)(\mu,K) implies that dμKdQKε\|d^{K}_{\mu}-d^{K}_{Q}\|_{\infty}\leq\varepsilon (see Lemma B.3).

This implies algorithms for geometric inference on enormous noisy data sets. Moreover, if we assume an input point set QQ is drawn iid from some underlying, but unknown distribution μ\mu, we can bound approximations with respect to μ\mu.
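
As a concrete illustration of the random-sampling route, the sketch below subsamples QQ from a point set and measures the resulting kde error empirically on a set of evaluation points. It assumes the Gaussian kernel with the K(x,x)=\sigma^{2} normalization used in this paper; the sizes and evaluation points are illustrative choices only.

```python
# A minimal sketch of the random-sampling guarantee: subsample Q from P and
# empirically measure ||kde_P - kde_Q||_inf on a set of evaluation points.
# Assumes the Gaussian kernel K(p, x) = sigma^2 * exp(-||p - x||^2 / (2 sigma^2))
# with K(x, x) = sigma^2, as in the paper; sizes and evaluation points are
# illustrative choices only.
import numpy as np
from scipy.spatial.distance import cdist

def kde(P, X, sigma):
    # Kernel density estimate of the empirical measure on P, evaluated at rows of X.
    return (sigma ** 2) * np.exp(-cdist(X, P, 'sqeuclidean') / (2 * sigma ** 2)).mean(axis=1)

rng = np.random.default_rng(0)
sigma = 0.05
P = rng.random((10000, 2))                                # stand-in for a large noisy input
Q = P[rng.choice(len(P), size=1000, replace=False)]       # random kernel sample
X = rng.random((1000, 2))                                 # evaluation points
print(np.abs(kde(P, X, sigma) - kde(Q, X, sigma)).max())  # observed sup error; shrinks roughly like 1/sqrt(|Q|)
```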

Corollary 6.1.

Consider a measure μ\mu defined on d\mathbb{R}^{d}, a kernel KK, and a parameter εR/(5+4/α2)\varepsilon\leq R/(5+4/\alpha^{2}). We can create a coreset QQ of size |Q|=O(((1/ε2)log(1/εδ))2d/(d+2))|Q|=O(((1/\varepsilon^{2})\sqrt{\log(1/\varepsilon\delta)})^{2d/(d+2)}) or randomly sample |Q|=O((1/ε4)(d+log(1/δ)))|Q|=O((1/\varepsilon^{4})(d+\log(1/\delta))) points so, with probability at least 1δ1-\delta, any sublevel set (dμK)η(d^{K}_{\mu})^{\eta} is homotopy equivalent to (dQK)r(d^{K}_{Q})^{r} for r[4ε/α2,R3ε]r\in[4\varepsilon/\alpha^{2},R-3\varepsilon] and η(0,R)\eta\in(0,R).

Proof 6.2.

Those bounds are obtained by constructing an (ε2/4)(\varepsilon^{2}/4)-kernel sample [48, 57], which guarantees dμKdQKε\|d^{K}_{\mu}-d^{K}_{Q}\|_{\infty}\leq\varepsilon via Lemma B.3. Then since εR/(5+4/α2)\varepsilon\leq R/(5+4/\alpha^{2}), with Theorem 4.2 any sublevel set (dμK)η(d^{K}_{\mu})^{\eta} is homotopy equivalent to (dQK)r(d^{K}_{Q})^{r}.

Stability of persistence diagrams.

Furthermore, the stability results on persistence diagrams [24] hold for kernel density estimates and kernel distance of μ\mu and QQ (where QQ is a coreset of μ\mu with the same size bounds as above). If fgε\|f-g\|_{\infty}\leq\varepsilon, then dB(Dgm(f),Dgm(g))εd_{B}(\textsf{Dgm}(f),\textsf{Dgm}(g))\leq\varepsilon, where dBd_{B} is the bottleneck distance between persistence diagrams. Combined with the coreset results above, this immediately implies the following corollaries.

Corollary 6.3.

Consider a measure μ\mu defined on d\mathbb{R}^{d} and a kernel KK. We can create a coreset QQ of size |Q|=O(((1/ε)log(1/εδ))2d/(d+2))|Q|=O(((1/\varepsilon)\sqrt{\log(1/\varepsilon\delta)})^{2d/(d+2)}) or randomly sample |Q|=O((1/ε2)(d+log(1/δ)))|Q|=O((1/\varepsilon^{2})(d+\log(1/\delta))) points which will have the following properties with probability at least 1δ1-\delta.

  • dB(Dgm(kdeμ),Dgm(kdeQ))εd_{B}(\textsf{Dgm}(\textsc{kde}_{\mu}),\textsf{Dgm}(\textsc{kde}_{Q}))\leq\varepsilon.

  • dB(Dgm((dμK)2),Dgm((dQK)2))εd_{B}(\textsf{Dgm}((d^{K}_{\mu})^{2}),\textsf{Dgm}((d^{K}_{Q})^{2}))\leq\varepsilon.

Corollary 6.4.

Consider a measure μ\mu defined on d\mathbb{R}^{d} and a kernel KK. We can create a coreset QQ of size |Q|=O(((1/ε2)log(1/εδ))2d/(d+2))|Q|=O(((1/\varepsilon^{2})\sqrt{\log(1/\varepsilon\delta)})^{2d/(d+2)}) or randomly sample |Q|=O((1/ε4)(d+log(1/δ)))|Q|=O((1/\varepsilon^{4})(d+\log(1/\delta))) points which will have the following property with probability at least 1δ1-\delta.

  • dB(Dgm(dμK),Dgm(dQK))εd_{B}(\textsf{Dgm}(d^{K}_{\mu}),\textsf{Dgm}(d^{K}_{Q}))\leq\varepsilon.

Another bound was independently derived to show an upper bound on the size of a random sample QQ such that dB(Dgm(kdeμP),Dgm(kdeQ))εd_{B}(\textsf{Dgm}(\textsc{kde}_{\mu_{P}}),\textsf{Dgm}(\textsc{kde}_{Q}))\leq\varepsilon in [3]; this can, as above, also be translated into bounds for Dgm((dQK)2)\textsf{Dgm}((d^{K}_{Q})^{2}) and Dgm(dQK)\textsf{Dgm}(d^{K}_{Q}). This result assumes P[C,C]dP\subset[-C,C]^{d} and is parametrized by a bandwidth parameter hh that retains that xdKh(x,p)dx=1\int_{x\in\mathbb{R}^{d}}K_{h}(x,p){\rm d}{x}=1 for all pp using that K1(xp)=K(x,p)K_{1}(\|x-p\|)=K(x,p) and Kh(xp)=1hdK1(xp2/h)K_{h}(\|x-p\|)=\frac{1}{h^{d}}K_{1}(\|x-p\|^{2}/h). This ensures that K(,p)K(\cdot,p) is (1/hd)(1/h^{d})-Lipschitz and that K(x,x)=Θ(1/hd)K(x,x)=\Theta(1/h^{d}) for any xx. Then their bound requires |Q|=O(dε2hdlog(Cdεδh))|Q|=O(\frac{d}{\varepsilon^{2}h^{d}}\log(\frac{Cd}{\varepsilon\delta h})) random samples.

To compare directly against the random sampling result we derive from Joshi et al. [48], for kernel Kh(x,p)K_{h}(x,p) then kdeμPkdeQεKh(x,x)=ε/hd\|\textsc{kde}_{\mu_{P}}-\textsc{kde}_{Q}\|_{\infty}\leq\varepsilon K_{h}(x,x)=\varepsilon/h^{d}. Hence, our analysis requires |Q|=O((1/ε2h2d)(d+log(1/δ)))|Q|=O((1/\varepsilon^{2}h^{2d})(d+\log(1/\delta))), and is an improvement when h=Ω(1)h=\Omega(1) or CC is not known or bounded, as well as in some other cases as a function of ε\varepsilon, hh, δ\delta, and dd.

7 Discussion

We mention here a few other interesting aspects of our results and observations about topological inference using the kernel distance. They are related to how the noise parameter σ\sigma affects the idea of scale, and a few more experiments, including with alternate kernels.

7.1 Noise and Scale

Much of geometric and topological reconstruction grew out of the desire to understand shapes at various scales. A common mechanism is offset based; e.g., α\alpha-shapes [31] represent the scale of a shape with the α\alpha parameter controlling the offsets of a point cloud. There are two parameters with the kernel distance: rr controls the offset through the sublevel set of the function, and σ\sigma controls the noise. We argue that any function which is robust to noise must have a parameter that controls the noise (e.g. σ\sigma for dμKd^{K}_{\mu} and m0m_{0} for dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}}). Here σ\sigma clearly defines some sense of scale in the setting of density estimation [65] and has a geometrical interpretation, while m0m_{0} represents a fraction of the measure and is hard to interpret geometrically, as illustrated by the lack of a Lipschitz property for dμ,m0ccmd^{\textsf{{ccm}}}_{\mu,m_{0}} with respect to m0m_{0}.

Several insights can be drawn from the experiments below in Section 7.2. One observation is that even though there are two parameters rr and σ\sigma that control the scale, the interesting values typically have rr very close to σ\sigma. Thus, we recommend first setting σ\sigma to control the scale at which the data is studied, and then exploring the effect of varying rr for values near σ\sigma. Moreover, not much structure seems to be missed by not exploring the space of both parameters; Figure 3 shows that fixing one (of rr and σ\sigma) and varying the other can provide very similar superlevel sets. However, it is possible instead to fix rr and explore the persistent topological features in the data [36, 34] (those less affected by smoothing) by varying σ\sigma. On the other hand, it remains a challenging problem to study two-parameter persistent homology [10, 9] under the setting of the kernel distance (or kernel density estimate).

7.2 Experiments

We consider measures μP\mu_{P} defined by a point set P2P\subset{{\mathbb{R}}}^{2}. To experimentally visualize the structures of the superlevel sets of kernel density estimates, or equivalently sublevel sets of the kernel distance, we do the simplest thing and just evaluate dμPKd^{K}_{\mu_{P}} at every grid point on a sufficiently dense grid.

Grid approximation.

Due to the 1-Lipschitz property of the kernel distance, well-chosen grid points have several nice properties. We consider the functions up to some resolution parameter \varepsilon>0, consistent with the parameter used to create a coreset approximation Q. Now specifically, consider an axis-aligned grid G_{\varepsilon,d} with edge length \varepsilon/2\sqrt{d}, so no point x\in{{\mathbb{R}}}^{d} is further than \varepsilon from some grid point g\in G_{\varepsilon,d}. Since K(x,y)\leq\varepsilon when \|x-y\|\geq\sigma\sqrt{2\ln(\sigma^{2}/\varepsilon)}=\delta_{\varepsilon,\sigma}, we only need to consider grid points g\in G_{\varepsilon,d} which are within \delta_{\varepsilon,\sigma} of some point p\in P (or q\in Q, for a coreset Q of P) [48, 71]. This is at most (2\sqrt{d}/\varepsilon)^{d}(2\delta_{\varepsilon,\sigma})^{d}=O((\sigma\sqrt{\log(\sigma^{2}/\varepsilon)}/\varepsilon)^{d}) grid points around each such point, for d a fixed constant. Furthermore, due to the 1-Lipschitz property of d^{K}_{P}, when considering a specific level set at r

  • a point xx such that dPK(x)rεd^{K}_{P}(x)\leq r-\varepsilon is no further than ε\varepsilon from some gGg\in G such that dPK(g)rd^{K}_{P}(g)\leq r, and

  • every ball Bε(x)B_{\varepsilon}(x) of radius ε\varepsilon centered at some point xdx\in{{\mathbb{R}}}^{d} such that all yBε(x)y\in B_{\varepsilon}(x) have dPK(y)rd^{K}_{P}(y)\leq r contains some representative point gGε,dg\in G_{\varepsilon,d}, and hence dPK(g)rd^{K}_{P}(g)\leq r.

Thus “deep” regions and spatially thick features are preserved; however, thin passageways or layers that are near the threshold rr, even if they do not correspond to a critical point, may erroneously become disconnected, causing phantom components or other spurious topological features. Due to the Lipschitz property, these can differ from rr by at most ε\varepsilon, so the errors will have small deviation in persistence.
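
A minimal version of this grid evaluation in the plane is sketched below. It is a direct, non-optimized implementation: it builds a grid over the bounding box, keeps only grid points near the data, and evaluates d^{K}_{P} via the identity (d^{K}_{P}(x))^{2}=\kappa(P,P)+\sigma^{2}-2\textsc{kde}_{P}(x); the 3\sigma truncation radius and the resolution are illustrative choices rather than the bounds above.

```python
# A minimal sketch of the grid evaluation in the plane: evaluate d^K_P only at
# grid points near the data, using (d^K_P(x))^2 = kappa(P,P) + sigma^2 - 2*kde_P(x)
# and the Gaussian kernel with K(x, x) = sigma^2.  The grid resolution and the
# 3*sigma truncation radius are illustrative choices, not the bounds above.
import numpy as np
from scipy.spatial.distance import cdist

def kernel_distance_on_grid(P, sigma, eps):
    h = eps / (2 * np.sqrt(2))                           # grid edge length eps/(2*sqrt(d)), d = 2
    xs = np.arange(P[:, 0].min() - 3 * sigma, P[:, 0].max() + 3 * sigma, h)
    ys = np.arange(P[:, 1].min() - 3 * sigma, P[:, 1].max() + 3 * sigma, h)
    X, Y = np.meshgrid(xs, ys)
    G = np.column_stack([X.ravel(), Y.ravel()])

    sq = cdist(G, P, 'sqeuclidean')
    near = sq.min(axis=1) <= (3 * sigma) ** 2            # keep only grid points near some p in P
    kde_vals = (sigma ** 2) * np.exp(-sq[near] / (2 * sigma ** 2)).mean(axis=1)
    kappa_PP = (sigma ** 2) * np.exp(-cdist(P, P, 'sqeuclidean') / (2 * sigma ** 2)).mean()
    dK = np.sqrt(np.maximum(kappa_PP + sigma ** 2 - 2 * kde_vals, 0.0))
    return G[near], dK

# Example: the sublevel set at level r is {g : dK(g) <= r}, e.g.
# G, dK = kernel_distance_on_grid(P, sigma=0.05, eps=0.05); sublevel = G[dK <= r]
```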

Varying parameter rr or σ\sigma.

Figure 3: Sublevel sets for the kernel distance while varying the isolevel γ\gamma, for fixed σ\sigma (left) and for fixed isolevel γ\gamma but variable σ\sigma (right), with Gaussian kernel. The variable values of σ\sigma and γ\gamma are chosen to make the plots similar.

We demonstrate the geometric inference on a synthetic dataset in [0,1]2[0,1]^{2} where 900900 points are chosen near a circle centered at (0.5,0.5)(0.5,0.5) with radius 0.250.25 or along a line segment from (0,0)(0,0) to (1,1)(1,1). Each point has Gaussian noise added with standard deviation 0.010.01. The remaining 11001100 points are chosen uniformly from [0,1]2[0,1]^{2}. We use a Gaussian kernel with σ=0.05\sigma=0.05. Figure 3 shows (left) various sublevel sets γΓ\gamma\in\Gamma for the kernel distance at a fixed σ=0.05\sigma=0.05 and (right) sublevel sets for a fixed γ=0.04853\gamma=0.04853 and various values of σΣ\sigma\in\Sigma, where

Γ\displaystyle\Gamma =[0.05005,0.04979,0.04954,0.04904,0.04853] and\displaystyle=[0.05005,0.04979,0.04954,0.04904,0.04853]\text{ and }
Σ\displaystyle\Sigma =[0.0485,0.0489,0.0492,0.0495,0.05].\displaystyle=[0.0485,0.0489,0.0492,0.0495,0.05].

These choices of Γ\Gamma and Σ\Sigma were made to highlight how similar the isolevels can be.
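
For concreteness, the sketch below generates a synthetic dataset of this form; the even split of the 900 structured points between the circle and the segment is an assumption, as the text does not specify it.

```python
# Sketch of the synthetic dataset described above: 900 points near the circle or
# the segment (the even 450/450 split is an assumption, as the text does not
# specify it), with N(0, 0.01) coordinate noise, plus 1,100 uniform background points.
import numpy as np

rng = np.random.default_rng(0)

t = rng.uniform(0, 2 * np.pi, 450)
circle = np.column_stack([0.5 + 0.25 * np.cos(t), 0.5 + 0.25 * np.sin(t)])

s = rng.uniform(0, 1, 450)
segment = np.column_stack([s, s])                    # the segment from (0,0) to (1,1)

structured = np.vstack([circle, segment]) + rng.normal(0, 0.01, (900, 2))
background = rng.uniform(0, 1, (1100, 2))
P = np.vstack([structured, background])              # 2,000 points, roughly in [0,1]^2
```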

Figure 4: Alternate kernel density estimates for the same dataset as Figure 3. From left to right, they use the Laplace, triangle, Epanechnikov, and the ball kernel.

Alternative kernels.

We can choose kernels other than the Gaussian kernel in the kernel density estimate, for instance

  • the Laplace kernel K(p,x)=exp(2px/σ)K(p,x)=\exp(-2\|p-x\|/\sigma),

  • the triangle kernel K(p,x)=max{0,1px/σ}K(p,x)=\max\{0,1-\|p-x\|/\sigma\},

  • the Epanechnikov kernel K(p,x)=max{0,1px2/σ2}K(p,x)=\max\{0,1-\|p-x\|^{2}/\sigma^{2}\}, or

  • the ball kernel K(p,x)={1 if pxσ; o.w. 0}K(p,x)=\{1\text{ if }\|p-x\|\leq\sigma\text{; o.w. }0\}.

Figure 4 uses parameters chosen to make these kernels comparable to Figure 3 (left). Of these, only the Laplace kernel is characteristic [66], making the corresponding version of the kernel distance a metric. Investigating which of the above reconstruction theorems hold when using the Laplace or other kernels is an interesting question for future work.
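
For reference, these alternative kernels are straightforward to drop into a kernel density estimate; the sketch below uses exactly the unit-height forms listed above (which differ from the \sigma^{2}-scaled Gaussian used elsewhere in the paper).

```python
# The alternative kernels listed above, in the unit-height forms given in the text,
# dropped into a kernel density estimate (a sketch).
import numpy as np

def laplace(dist, sigma):       return np.exp(-2 * dist / sigma)
def triangle(dist, sigma):      return np.maximum(0.0, 1 - dist / sigma)
def epanechnikov(dist, sigma):  return np.maximum(0.0, 1 - (dist / sigma) ** 2)
def ball(dist, sigma):          return (dist <= sigma).astype(float)

def kde_alt(P, x, kernel, sigma):
    dist = np.linalg.norm(P - x, axis=1)    # ||p - x|| for each p in P
    return kernel(dist, sigma).mean()
```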

Additionally, normal vector information and even kk-forms can be used in the definition of a kernel [42, 68, 30, 29, 43, 48]; this variant is known as the current distance. In some cases it retains its metric properties and has been shown to be very useful for shape alignment in conjunction with medical imaging.

7.3 Open Questions

This work shows it is possible to prove formal reconstruction results using kernel density estimates and the kernel distance. But it also opens many interesting questions.

  • For what other types of kernels can we show reconstruction bounds? The Laplace and triangle kernels are natural choices. For both, the coreset results match those of the Gaussian kernel. The kernel distance under the Laplace kernel is also a metric, but this is not known to hold for the triangle kernel. Yet the triangle kernel would be interesting since it has bounded support, and may lend itself to easier computation.

  • The power distance construction in Section 3 requires a point p^+\hat{p}_{+}, which approximates the point with minimum kernel distance. This is intuitively because it is possible to construct a point set PP (say points lying on a circle with no points inside) such that the point p+dp_{+}\in\mathbb{R}^{d} which minimizes the kernel distance and maximizes the kernel density estimate is far from any point in the point set. For one, can p^+\hat{p}_{+} be constructed efficiently without dependence on βP\beta_{P} or ΛP/σ\Lambda_{P}/\sigma?

    But more interestingly, can we generally approximate the persistence diagram without creating a simplicial complex on a subset of the input points? We do describe some bounds on using a grid-based technique in Section 7.2, but this is also unsatisfying since it essentially requires a low-dimensional Euclidean space.

  • Since dμKd^{K}_{\mu} is Lipschitz in xx and σ\sigma, it may make sense to understand the simultaneous stability of both variables. What is the best way to understand persistence over both parameters?

  • We provided an initial bound comparing the kernel distance under the Gaussian kernel and the Wasserstein 22 distance. Can we show, under our choice of normalization, that DK(μ,ν)W2(μ,ν)D_{K}(\mu,\nu)\leq W_{2}(\mu,\nu) unconstrained? More generally, how does the kernel distance under other kernels compare with other forms of Wasserstein and other distances on measures?

Acknowledgements

The authors thank Don Sheehy, Frédéric Chazal and the rest of the Geometrica group at INRIA-Saclay for enlightening discussions on geometric and topological reconstruction. We also thank Don Sheehy for personal communications regarding the power distance constructions, and Yusu Wang for ideas towards Lemma 4.5. Finally, we are also indebted to the anonymous reviewers for many detailed suggestions leading to improvements in results and presentation.

References

  • [1] Pankaj K. Agarwal, Sariel Har-Peled, Haim Kaplan, and Micha Sharir. Union of random Minkowski sums and network vulnerability analysis. In Proceedings of 29th Symposium on Computational Geometry, 2013.
  • [2] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
  • [3] Sivaraman Balakrishnan, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Statistical inference for persistent homology. Technical report, ArXiv:1303.7117, March 2013.
  • [4] James Biagioni and Jakob Eriksson. Map inference in the face of noise and disparity. In ACM SIGSPATIAL GIS, 2012.
  • [5] Gérard Biau, Frédéric Chazal, David Cohen-Steiner, Luc Devroye, and Carlos Rodriguez. A weighted kk-nearest neighbor density estimate for geometric inference. Electronic Journal of Statistics, 5:204–237, 2011.
  • [6] Omer Bobrowski, Sayan Mukherjee, and Jonathan E. Taylor. Topological consistency via kernel estimation. Technical report, arXiv:1407.5272, 2014.
  • [7] Peter Bubenik. Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 2014.
  • [8] Mickael Buchet, Frederic Chazal, Steve Y. Oudot, and Donald R. Sheehy. Efficient and robust topological data analysis on metric spaces. arXiv:1306.0039, 2013.
  • [9] Gunnar Carlsson, Gurjeet Singh, and Afra Zomorodian. Computing multidimensional persistence. Algorithms and Computation: Lecture Notes in Computer Science, 5878:730–739, 2009.
  • [10] Gunnar Carlsson and Afra Zomorodian. The theory of multidimensional persistence. Proc. 23rd Ann. Symp. Computational Geometry, pages 184–193, 2007.
  • [11] Frédéric Chazal and David Cohen-Steiner. Geometric inference. Tessellations in the Sciences, 2012.
  • [12] Frédéric Chazal, David Cohen-Steiner, Marc Glisse, Leonidas J. Guibas, and Steve Y. Oudot. Proximity of persistence modules and their diagrams. In Proceedings 25th Annual Symposium on Computational Geometry, pages 237–246, 2009.
  • [13] Frédéric Chazal, David Cohen-Steiner, and André Lieutier. Normal cone approximation and offset shape isotopy. Computational Geometry: Theory and Applications, 42:566–581, 2009.
  • [14] Frédéric Chazal, David Cohen-Steiner, and André Lieutier. A sampling theory for compact sets in Euclidean space. Discrete Computational Geometry, 41(3):461–479, 2009.
  • [15] Frédéric Chazal, David Cohen-Steiner, and Quentin Mérigot. Geometric inference for probability measures. Foundations of Computational Mathematics, 11(6):733–751, 2011.
  • [16] Frederic Chazal, Vin de Silva, Marc Glisse, and Steve Oudot. The structure and stability of persistence modules. arXiv:1207.3674, 2013.
  • [17] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Bertrand Michel, Alessandro Rinaldo, and Larry Wasserman. Robust topological inference: Distance-to-a-measure and kernel distance. Technical report, arXiv:1412.7197, 2014.
  • [18] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. On the bootstrap for persistence diagrams and landscapes. Modeling and Analysis of Information Systems, 20:96–105, 2013.
  • [19] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, and Larry Wasserman. Stochastic convergence of persistence landscapes. In Proceedings Symposium on Computational Geometry, 2014.
  • [20] Frédéric Chazal and André Lieutier. Weak feature size and persistent homology: computing homology of solids in rn from noisy data samples. Proceedings 21st Annual Symposium on Computational Geometry, pages 255–262, 2005.
  • [21] Frédéric Chazal and André Lieutier. Topology guaranteeing manifold reconstruction using distance function to noisy data. Proceedings 22nd Annual Symposium on Computational Geometry, pages 112–118, 2006.
  • [22] Frédéric Chazal and Steve Oudot. Towards persistence-based reconstruction in euclidean spaces. Proceedings 24th Annual Symposium on Computational Geometry, pages 232–241, 2008.
  • [23] Scott Cohen. Finding color and shape patterns in images. Technical report, Stanford University: CS-TR-99-1620, 1999.
  • [24] David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete and Computational Geometry, 37:103–120, 2007.
  • [25] Hal Daumé. From zero to reproducing kernel hilbert spaces in twelve pages or less. http://pub.hal3.name/daume04rkhs.ps, 2004.
  • [26] Luc Devroye and László Györfi. Nonparametric Density Estimation: The L1L_{1} View. Wiley, 1984.
  • [27] Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer-Verlag, 2001.
  • [28] Tamal K. Dey. Curve and Surface Reconstruction: Algorithms with Mathematical Analysis. Cambridge University Press, 2007.
  • [29] Stanley Durrleman, Xavier Pennec, Alain Trouvé, and Nicholas Ayache. Measuring brain variability via sulcal lines registration: A diffeomorphic approach. In 10th International Conference on Medical Image Computing and Computer Assisted Intervention, 2007.
  • [30] Stanley Durrleman, Xavier Pennec, Alain Trouvé, and Nicholas Ayache. Sparse approximation of currents for statistics on curves and surfaces. In 11th International Conference on Medical Image Computing and Computer Assisted Intervention, 2008.
  • [31] Herbert Edelsbrunner. The union of balls and its dual shape. Proceedings 9th Annual Symposium on Computational Geometry, pages 218–231, 1993.
  • [32] Herbert Edelsbrunner, Michael Facello, Ping Fu, and Jie Liang. Measuring proteins and voids in proteins. In Proceedings 28th Annual Hawaii International Conference on Systems Science, 1995.
  • [33] Herbert Edelsbrunner, Brittany Terese Fasy, and Günter Rote. Add isotropic Gaussian kernels at own risk: More and more resilient modes in higher dimensions. Proceedings 28th Annual Symposium on Computational Geometry, pages 91–100, 2012.
  • [34] Herbert Edelsbrunner and John Harer. Persistent homology - a survey. Contemporary Mathematics, 453:257–282, 2008.
  • [35] Herbert Edelsbrunner and John Harer. Computational Topology: An Introduction. American Mathematical Society, Providence, RI, USA, 2010.
  • [36] Herbert Edelsbrunner, David Letscher, and Afra J. Zomorodian. Topological persistence and simplification. Discrete and Computational Geometry, 28:511–533, 2002.
  • [37] Ahmed Elgammal, Ramani Duraiswami, David Harwood, and Larry S. Davis. Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proc. IEEE, 90:1151–1163, 2002.
  • [38] Brittany Terese Fasy, Jisu Kim, Fabrizio Lecci, and Clément Maria. Introduction to the R package TDA. Technical report, arXiV:1411.1830, 2014.
  • [39] Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, Larry Wasserman, Sivaraman Balakrishnan, and Aarti Singh. Statistical inference for persistent homology: Confidence sets for persistence diagrams. In The Annals of Statistics, volume 42, pages 2301–2339, 2014.
  • [40] H. Federer. Curvature measures. Transactions of the American Mathematical Society, 93:418–491, 1959.
  • [41] Mingchen Gao, Chao Chen, Shaoting Zhang, Zhen Qian, Dimitris Metaxas, and Leon Axel. Segmenting the papillary muscles and the trabeculae from high resolution cardiac CT through restoration of topological handles. In Proceedings International Conference on Information Processing in Medical Imaging, 2013.
  • [42] Joan Glaunès. Transport par difféomorphismes de points, de mesures et de courants pour la comparaison de formes et l’anatomie numérique. PhD thesis, Université Paris 13, 2005.
  • [43] Joan Glaunès and Sarang Joshi. Template estimation from unlabeled point set data and surfaces for computational anatomy. In International Workshop on Mathematical Foundations of Computational Anatomy, 2006.
  • [44] Karsten Grove. Critical point theory for distance functions. Proceedings of Symposia in Pure Mathematics, 54:357–385, 1993.
  • [45] Leonidas Guibas, Quentin Mérigot, and Dmitriy Morozov. Witnessed kk-distance. Proceedings 27th Annual Symposium on Computational Geometry, pages 57–64, 2011.
  • [46] Allen Hatcher. Algebraic Topology. Cambridge University Press, 2002.
  • [47] Matthias Hein and Olivier Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In Proceedings 10th International Workshop on Artificial Intelligence and Statistics, 2005.
  • [48] Sarang Joshi, Raj Varma Kommaraju, Jeff M. Phillips, and Suresh Venkatasubramanian. Comparing distributions and shapes using the kernel distance. Proceedings 27th Annual Symposium on Computational Geometry, 2011.
  • [49] A. N. Kolmogorov, S. V. Fomin, and R. A. Silverman. Introductory Real Analysis. Dover Publications, 1975.
  • [50] John M. Lee. Introduction to topological manifolds. Springer-Verlag, 2000.
  • [51] John M. Lee. Introduction to smooth manifolds. Springer, 2003.
  • [52] Jie Liang, Herbert Edelsbrunner, Ping Fu, Pamidighantam V. Sudharkar, and Shankar Subramanian. Analytic shape computation of macromolecules: I. molecular area and volume through alpha shape. Proteins: Structure, Function, and Genetics, 33:1–17, 1998.
  • [53] André Lieutier. Any open bounded subset of n\mathbb{R}^{n} has the same homotopy type as its medial axis. Computer-Aided Design, 36:1029–1046, 2004.
  • [54] Quentin Mérigot. Geometric structure detection in point clouds. PhD thesis, Université de Nice Sophia-Antipolis, 2010.
  • [55] Yuriy Mileyko, Sayan Mukherjee, and John Harer. Probability measures on the space of persistence diagrams. Inverse Problems, 27(12), 2011.
  • [56] A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
  • [57] Jeff M. Phillips. eps-samples for kernels. Proceedings 24th Annual ACM-SIAM Symposium on Discrete Algorithms, 2013.
  • [58] Jeff M. Phillips and Suresh Venkatasubramanian. A gentle introduction to the kernel distance. arXiv:1103.1625, March 2011.
  • [59] Florian T. Pokorny, Carl Henrik Ek, Hedvig Kjellström, and Danica Kragic. Persistent homology for learning densities with bounded support. In Neural Information Processing Systems, 2012.
  • [60] Charles A. Price, Olga Symonova, Yuriy Mileyko, Troy Hilley, and Joshua W. Weitz. Leaf gui: Segmenting and analyzing the structure of leaf veins and areoles. Plant Physiology, 155:236–245, 2011.
  • [61] David W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, 1992.
  • [62] Donald R. Sheehy. A multicover nerve for geometric inference. Canadian Conference in Computational Geometry, 2012.
  • [63] Donald R. Sheehy. Linear-size approximations to the Vietoris-Rips filtration. Discrete & Computational Geometry, 49:778–796, 2013.
  • [64] Bernard W. Silverman. Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society B, 43:97–99, 1981.
  • [65] Bernard W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.
  • [66] Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
  • [67] Kathryn Turner, Yuriy Mileyko, Sayan Mukherjee, and John Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 2014.
  • [68] Marc Vaillant and Joan Glaunès. Surface matching via currents. In Proceedings Information Processing in Medical Imaging, volume 19, pages 381–92, 2005.
  • [69] Cédric Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.
  • [70] Grace Wahba. Support vector machines, reproducing kernel Hilbert spaces, and randomization GACV. In Advances in Kernel Methods – Support Vector Learning, pages 69–88. Bernhard Schölkopf and Alexander J. Smola and Christopher J. C. Burges and Rosanna Soentpiet, 1999.
  • [71] Yan Zheng, Jeffrey Jestes, Jeff M. Phillips, and Feifei Li. Quality and efficiency in kernel density estimates for large data. In Proceedings ACM Conference on the Management of Data (SIGMOD), 2012.

Appendix A Details on Distance-Like Properties of Kernel Distance

We provide further details on distance-like properties of the kernel distance.

A.1 Semiconcave Properties of Kernel Distance

We also note that semiconcavity follows quite naturally and simply in the RKHS K\mathcal{H}_{K} for dμKd^{K}_{\mu}.

Lemma A.1.

(dμK)2(d^{K}_{\mu})^{2} is 11-semiconcave in K\mathcal{H}_{K}: the map x(dμK(x))2ϕ(x)K2x\mapsto(d^{K}_{\mu}(x))^{2}-\|\phi(x)\|_{\mathcal{H}_{K}}^{2} is concave.

Proof A.2.

We can write

(d^{K}_{\mu}(x))^{2}=(D_{K}(\mu,x))^{2}=\kappa(\mu,\mu)+\kappa(x,x)-2\kappa(\mu,x)=\|\Phi(\mu)\|^{2}_{\mathcal{H}_{K}}+\|\phi(x)\|^{2}_{\mathcal{H}_{K}}-2\langle\Phi(\mu),\phi(x)\rangle_{\mathcal{H}_{K}}=\|\Phi(\mu)-\phi(x)\|^{2}_{\mathcal{H}_{K}}.

Now

(d^{K}_{\mu}(x))^{2}-\|\phi(x)\|_{\mathcal{H}_{K}}^{2}=\|\Phi(\mu)-\phi(x)\|^{2}_{\mathcal{H}_{K}}-\|\phi(x)\|^{2}_{\mathcal{H}_{K}}=\|\Phi(\mu)\|^{2}_{\mathcal{H}_{K}}-2\langle\Phi(\mu),\phi(x)\rangle_{\mathcal{H}_{K}}.

Since the above is twice-differentiable, we only need to show that its second differential is non-positive. By definition, for a fixed \mu, \Phi(\mu) and \|\Phi(\mu)\|^{2}_{\mathcal{H}_{K}} are both constant. Writing \Phi(\mu)=c_{1} and \|\Phi(\mu)\|^{2}_{\mathcal{H}_{K}}=c_{2}, we have (d^{K}_{\mu}(x))^{2}-\|\phi(x)\|_{\mathcal{H}_{K}}^{2}=c_{2}-2\langle c_{1},\phi(x)\rangle_{\mathcal{H}_{K}}. Since the RKHS \mathcal{H}_{K} is a vector space with a well-defined inner product, the above is an affine function of \phi(x); its second differential vanishes, so the map is concave.

However, this semiconcavity in K\mathcal{H}_{K} is not that useful. For unit weight elements x,ydx,y\in{{\mathbb{R}}}^{d}, an element sαs_{\alpha} such that ϕ(sα)=αϕ(y)+(1α)ϕ(x)\phi(s_{\alpha})=\alpha\phi(y)+(1-\alpha)\phi(x) is a weighted point set with a point at xx with weight (1α)(1-\alpha) and another at yy with weight α\alpha. Lemma A.1 only implies that (dK(sα))2ϕ(sα)K2α((dK(x))2ϕ(x)K2)+(1α)((dK(y))2ϕ(y)K2)(d_{K}(s_{\alpha}))^{2}-\|\phi(s_{\alpha})\|^{2}_{\mathcal{H}_{K}}\leq\alpha((d_{K}(x))^{2}-\|\phi(x)\|^{2}_{\mathcal{H}_{K}})+(1-\alpha)((d_{K}(y))^{2}-\|\phi(y)\|^{2}_{\mathcal{H}_{K}}).

A.2 Kernel Distance is Proper

We use two more general, but equivalent definitions of a proper map. Definition (i): A continuous map f:𝕏𝕐f:{{\mathbb{X}}}\to{{\mathbb{Y}}} between two topological spaces is proper if and only if the inverse image of every compact subset in 𝕐{{\mathbb{Y}}} is compact in 𝕏{{\mathbb{X}}} ([50], page 84; [51], page 45). Definition (ii): a continuous map f:𝕏𝕐f:{{\mathbb{X}}}\to{{\mathbb{Y}}} between two topological manifolds is proper if and only if for every sequence {pi}\{p_{i}\} in 𝕏{{\mathbb{X}}} that escapes to infinity, {f(pi)}\{f(p_{i})\} escapes to infinity in 𝕐{{\mathbb{Y}}} ([51], Proposition 2.17). Here, for a topological space 𝕏{{\mathbb{X}}}, a sequence {pi}\{p_{i}\} in 𝕏{{\mathbb{X}}} escapes to infinity if for every compact set G𝕏G\subset{{\mathbb{X}}}, there are at most finitely many values of ii for which piGp_{i}\in G ([51], page 46).

Lemma A.3 (Lemma 2.6).

dμKd^{K}_{\mu} is proper.

Proof A.4.

To prove that dμKd^{K}_{\mu} is proper, we prove the following two claims: (a) A continuous function f:d[0,c)f:{{\mathbb{R}}}^{d}\to[0,c) (where c is a constant) is proper if, for any sequence {xi}\{x_{i}\} in d{{\mathbb{R}}}^{d} that escapes to infinity, the sequence {f(xi)}\{f(x_{i})\} tends to cc (approaches cc in the limit); (b) letting f:=dμKf:=d^{K}_{\mu}, for any sequence {xi}\{x_{i}\} that escapes to infinity, the sequence {f(xi)}\{f(x_{i})\} tends to cμc_{\mu}, or equivalently, κ(μ,xi)\kappa(\mu,x_{i}) tends to 0.

We prove claim (a) by proving its contrapositive. If a continuous function f:d[0,c)f:{{\mathbb{R}}}^{d}\to[0,c) is not proper, then there exists a sequence {xi}\{x_{i}\} in d{{\mathbb{R}}}^{d} that escapes to infinity, such that the sequence {f(xi)}\{f(x_{i})\} does not tend to cc. So suppose ff is not proper; this implies that there exists a constant b<cb<c such that f1[0,b]f^{-1}[0,b] is not compact (based on properness definition (i)) and therefore either not closed or unbounded. We first show that A:=f1[0,b]A:=f^{-1}[0,b] is closed. We make use of the following theorem ([49], page 88, Theorem 10’): A mapping ff of a topological space 𝕏{{\mathbb{X}}} into a topological space 𝕐{{\mathbb{Y}}} is continuous if and only if the pre-image f1(F)f^{-1}(F) of every closed set F𝕐F\subset{{\mathbb{Y}}} is closed in 𝕏{{\mathbb{X}}}. Since ff is continuous, the pre-image of every closed set [a,b][a,b]\subset{{\mathbb{R}}} is closed in d{{\mathbb{R}}}^{d}. Therefore AA is closed, and so it must be unbounded. Since AA is unbounded, it contains a sequence S:={xi}S:=\{x_{i}\} that escapes to infinity. In addition, since f(xi)bf(x_{i})\leq b for all xiSx_{i}\in S, the sequence {f(xi)}\{f(x_{i})\} does not tend to cc. Therefore (a) holds by contraposition.

To prove claim (b), we need to show that for any sequence {xi}\{x_{i}\} that escapes to infinity, κ(μ,xi)\kappa(\mu,x_{i}) tends to 0. For each xix_{i}, define a radius ri=xi0/2r_{i}=\|x_{i}-0\|/2 and define a ball BiB_{i} that is centered at the origin 0 and has radius rir_{i}. As xix_{i} goes to infinity, rir_{i} increases until for any fixed arbitrary ε>0\varepsilon>0, we have pBiμ(p)dp1ε/2σ2\int_{p\in B_{i}}\mu(p){\rm d}{p}\geq 1-\varepsilon/2\sigma^{2} and thus pdBidμ(p)ε/2σ2\int_{p\in{{\mathbb{R}}}^{d}\setminus B_{i}}{\rm d}{\mu(p)}\leq\varepsilon/2\sigma^{2}. Furthermore, let pi=argminpBpxip_{i}=\arg\min_{p\in B}\|p-x_{i}\|, so xipi=ri\|x_{i}-p_{i}\|=r_{i}. Thus also as xix_{i} goes to infinity, rir_{i} increases until for any ε>0\varepsilon>0 we have K(pi,xi)ε/2K(p_{i},x_{i})\leq\varepsilon/2. We now decompose κ(μ,xi)=pBiK(p,xi)dμ(p)+qdBiK(q,xi)dμ(q)\kappa(\mu,x_{i})=\int_{p\in B_{i}}K(p,x_{i}){\rm d}{\mu(p)}+\int_{q\in{{\mathbb{R}}}^{d}\setminus B_{i}}K(q,x_{i}){\rm d}{\mu(q)}. Thus for any ε>0\varepsilon>0, as xix_{i} goes to infinity, the first term is at most ε/2\varepsilon/2 since all K(p,xi)K(pi,xi)ε/2K(p,x_{i})\leq K(p_{i},x_{i})\leq\varepsilon/2 and the second term is at most ε/2\varepsilon/2 since K(q,x)σ2K(q,x)\leq\sigma^{2} and qdBiμ(q)dqε/2σ2\int_{q\in{{\mathbb{R}}}^{d}\setminus B_{i}}\mu(q){\rm d}{q}\leq\varepsilon/2\sigma^{2}. Since these results hold for all ε\varepsilon, as xix_{i} goes to infinity and ε\varepsilon goes to 0, κ(μ,xi)\kappa(\mu,x_{i}) goes to 0.

Combine (a) with (b) and the fact that dμKd^{K}_{\mu} is a continuous (in fact, Lipschitz) function, we obtained the properness result.

Appendix B ε\varepsilon-Approximation of the Kernel Distance

Here we make explicit the way that an ε\varepsilon-kernel sample approximated the kernel distance. Recall that if QQ is an ε\varepsilon-kernel sample of μ\mu, then kdeμkdeμQ=maxxd|κ(μ,x)κ(μQ,x)|ε\|\textsc{kde}_{\mu}-\textsc{kde}_{\mu_{Q}}\|=\max_{x\in{{\mathbb{R}}}^{d}}|\kappa(\mu,x)-\kappa({\mu_{Q}},x)|\leq\varepsilon.

Lemma B.1.

If QQ is an ε\varepsilon-kernel sample of μ\mu, then (dμK)2(dμQK)24ε\|(d^{K}_{\mu})^{2}-(d^{K}_{\mu_{Q}})^{2}\|_{\infty}\leq 4\varepsilon.

Proof B.2.

First expand DK(μ,x)2=κ(x,x)+κ(μ,μ)2κ(μ,x)=σ2+κ(μ,μ)2κ(μ,x)D_{K}(\mu,x)^{2}=\kappa(x,x)+\kappa(\mu,\mu)-2\kappa(\mu,x)=\sigma^{2}+\kappa(\mu,\mu)-2\kappa(\mu,x). Replacing μ\mu with μQ{\mu_{Q}}, the first term is unaffected. The second term is bounded,

κ(μ,μ)\displaystyle\kappa(\mu,\mu) =(p,q)K(p,q)dmμ,μ(p,q)=p(qK(p,q)dμ(q))dμ(p)\displaystyle=\int_{(p,q)}K(p,q){\rm d}{\textsf{m}_{\mu,\mu}(p,q)}=\int_{p}\left(\int_{q}K(p,q){\rm d}{\mu(q)}\right){\rm d}{\mu(p)}
=pkdeμ(p)dμ(p)p(kdeμQ(p)+ε)dμ(p)\displaystyle=\int_{p}\textsc{kde}_{\mu}(p){\rm d}{\mu(p)}\leq\int_{p}(\textsc{kde}_{{\mu_{Q}}}(p)+\varepsilon){\rm d}{\mu(p)}
=pkdeμQ(p)dμ(p)+ε=p(qK(p,q)dμQ(q))dμ(p)+ε\displaystyle=\int_{p}\textsc{kde}_{{\mu_{Q}}}(p){\rm d}{\mu(p)}+\varepsilon=\int_{p}\left(\int_{q}K(p,q){\rm d}{{\mu_{Q}}(q)}\right){\rm d}{\mu(p)}+\varepsilon
=κ(μQ,μ)+ε\displaystyle=\kappa({\mu_{Q}},\mu)+\varepsilon
κ(μQ,μQ)+2ε.\displaystyle\leq\kappa({\mu_{Q}},{\mu_{Q}})+2\varepsilon.

Similar results hold by switching μQ{\mu_{Q}} with μ\mu in the above inequality, that is, κ(μQ,μQ)κ(μ,μ)+2ε\kappa({\mu_{Q}},{\mu_{Q}})\leq\kappa(\mu,\mu)+2\varepsilon. And for the third term we have a similar inequality, |2κ(μ,x)2κ(μQ,x)|2ε|2\kappa(\mu,x)-2\kappa({\mu_{Q}},x)|\leq 2\varepsilon. Combining all three terms, we have the desired result: |DK(μ,x)2DK(μQ,x)2|4ε|D_{K}(\mu,x)^{2}-D_{K}({\mu_{Q}},x)^{2}|\leq 4\varepsilon.
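
The inequality chain above is also easy to check empirically: subsample QQ from a point set PP, measure the observed kde error \hat{\varepsilon} over PP and a set of test points, and confirm that the squared kernel distances differ by at most 4\hat{\varepsilon}. A sketch, using the Gaussian kernel with K(x,x)=\sigma^{2} and illustrative sizes:

```python
# Empirical check of the 4*eps bound (a sketch): subsample Q from P, measure the
# observed kde error eps_hat over P and a test set, and verify that the squared
# kernel distances differ by at most 4*eps_hat.  Gaussian kernel with
# K(x, x) = sigma^2; the sizes below are illustrative.
import numpy as np
from scipy.spatial.distance import cdist

def kde(P, X, sigma):
    return (sigma ** 2) * np.exp(-cdist(X, P, 'sqeuclidean') / (2 * sigma ** 2)).mean(axis=1)

def dK_sq(P, X, sigma):
    # (d^K_P(x))^2 = kappa(P, P) + sigma^2 - 2 * kde_P(x)
    kappa_PP = (sigma ** 2) * np.exp(-cdist(P, P, 'sqeuclidean') / (2 * sigma ** 2)).mean()
    return kappa_PP + sigma ** 2 - 2 * kde(P, X, sigma)

rng = np.random.default_rng(1)
sigma = 0.1
P = rng.random((3000, 2))
Q = P[rng.choice(len(P), size=300, replace=False)]
X = rng.random((1000, 2))

E = np.vstack([P, X])                                    # evaluate the kde error on P and X
eps_hat = np.abs(kde(P, E, sigma) - kde(Q, E, sigma)).max()
gap = np.abs(dK_sq(P, X, sigma) - dK_sq(Q, X, sigma)).max()
print(gap <= 4 * eps_hat + 1e-12)                        # expected True
```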

Lemma B.3.

If QQ is an (ε2/4)(\varepsilon^{2}/4)-kernel sample of μ\mu, then dμKdμQKε\|d^{K}_{\mu}-d^{K}_{\mu_{Q}}\|_{\infty}\leq\varepsilon.

Proof B.4.

By Lemma B.1 this condition on QQ implies that (dμK)2(dμQK)2ε2\|(d^{K}_{\mu})^{2}-(d^{K}_{\mu_{Q}})^{2}\|_{\infty}\leq\varepsilon^{2}. We then use a basic fact for values ε0\varepsilon\geq 0 and γ0\gamma\geq 0.
\bullet γ2+ε2γ+ε\sqrt{\gamma^{2}+\varepsilon^{2}}\leq\gamma+\varepsilon. This follows since (γ+ε)2=γ2+ε2+2γεγ2+ε2(\gamma+\varepsilon)^{2}=\gamma^{2}+\varepsilon^{2}+2\gamma\varepsilon\geq\gamma^{2}+\varepsilon^{2}.

We now prove the main result as an upper and lower bound using for any xdx\in\mathbb{R}^{d}. We first use γ=dμK(x)0\gamma=d^{K}_{\mu}(x)\geq 0 and expand dμQK(x)d^{K}_{\mu_{Q}}(x) to obtain

dμQK(x)=(dμQK(x))2(dμK(x))2+ε2dμK(x)+ε.d^{K}_{\mu_{Q}}(x)=\sqrt{(d^{K}_{\mu_{Q}}(x))^{2}}\leq\sqrt{(d^{K}_{\mu}(x))^{2}+\varepsilon^{2}}\leq d^{K}_{\mu}(x)+\varepsilon.

Now we use γ=dμQK(x)0\gamma=d^{K}_{\mu_{Q}}(x)\geq 0 and expand dμK(x)d^{K}_{\mu}(x) to obtain

dμK(x)=(dμK(x))2(dμQK(x))2+ε2dμQK(x)+ε.d^{K}_{\mu}(x)=\sqrt{(d^{K}_{\mu}(x))^{2}}\leq\sqrt{(d^{K}_{\mu_{Q}}(x))^{2}+\varepsilon^{2}}\leq d^{K}_{\mu_{Q}}(x)+\varepsilon.

Hence for any xdx\in\mathbb{R}^{d} we have dμK(x)εdμQK(x)dμK(x)+εd^{K}_{\mu}(x)-\varepsilon\leq d^{K}_{\mu_{Q}}(x)\leq d^{K}_{\mu}(x)+\varepsilon.

Appendix C Power Distance Constructions

Recall we want to consider the following power distance using dμKd^{K}_{\mu} (as weight) for a measure μ\mu associated with a subset PdP\subset{{\mathbb{R}}}^{d} and metric d(,)d(\cdot,\cdot) on d{{\mathbb{R}}}^{d},

fP(μ,x)=minpP(d(p,x)2+dμK(p)2).{f_{P}}(\mu,x)=\sqrt{\min_{p\in P}\left({d(p,x)}^{2}+{d^{K}_{\mu}(p)}^{2}\right)}.

We consider a particular choice of the distance metric d(p,x)=DK(p,x)d(p,x)=D_{K}(p,x) which leads to a kernel version of the power distance

fPk(μ,x)=minpP(DK(p,x)2+dμK(p)2).{f^{\textsc{k}}_{P}}(\mu,x)=\sqrt{\min_{p\in P}\left({D_{K}(p,x)}^{2}+{d^{K}_{\mu}(p)}^{2}\right)}.

Recall that dμK(x)=DK(μ,x)d^{K}_{\mu}(x)=D_{K}(\mu,x). In this section, we will always use the notation DK(μ,ν)D_{K}(\mu,\nu), and when μ\mu or ν\nu are points (e.g. μ\mu is a Dirac mass at pp and ν\nu is a Dirac mass at qq), then we will just write DK(p,q)D_{K}(p,q). This will be especially helpful when we apply the triangle inequality in several places.
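
To make these definitions concrete, the sketch below evaluates f^{\textsc{k}}_{P}(\mu_{P},x) for the empirical measure of a finite point set, with the Gaussian kernel and the K(x,x)=\sigma^{2} normalization; it is a direct quadratic-time computation with no attempt at efficiency.

```python
# Sketch: evaluating the kernel power distance f^k_P(mu_P, x) for the empirical
# measure of a finite point set P, using the Gaussian kernel with K(x, x) = sigma^2.
# A direct quadratic-time computation, with no attempt at efficiency.
import numpy as np
from scipy.spatial.distance import cdist

def kernel_power_distance(P, x, sigma):
    K_PP = (sigma ** 2) * np.exp(-cdist(P, P, 'sqeuclidean') / (2 * sigma ** 2))
    kappa_PP = K_PP.mean()                                # kappa(mu_P, mu_P)
    kde_at_P = K_PP.mean(axis=1)                          # kde_P(p) for each p in P
    w = kappa_PP + sigma ** 2 - 2 * kde_at_P              # weights d^K_P(p)^2

    K_Px = (sigma ** 2) * np.exp(-cdist(P, np.atleast_2d(x), 'sqeuclidean') / (2 * sigma ** 2)).ravel()
    d = 2 * sigma ** 2 - 2 * K_Px                         # D_K(p, x)^2 = K(p,p) + K(x,x) - 2 K(p,x)
    return np.sqrt(np.maximum(d + w, 0.0).min())
```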

C.1 Kernel Power Distance on Point Set PP

Given a set PP defining a measure of interest μP\mu_{P}, it is of interest to consider if fPk(μP,x){f^{\textsc{k}}_{P}}({\mu_{P}},x) is multiplicatively bounded by DK(μP,x)D_{K}({\mu_{P}},x). Theorem 3.2 shows that the lower bound holds. In this section we try to provide a multiplicative approximation upper bound.

Let p=argminpPpxp^{\star}=\arg\min_{p\in P}\|p-x\|. We can start with Lemma 3.4, which reduces the problem to finding a multiplicative upper bound for DK(p,x)D_{K}(p^{\star},x) in terms of DK(μP,x)D_{K}({\mu_{P}},x). However, we are not able to provide very useful bounds, and they require more advanced techniques than the previous section. In particular, they will only apply for points xdx\in\mathbb{R}^{d} when DK(μP,x)D_{K}({\mu_{P}},x) is large enough; hence they do not approximate the minima of dμKd^{K}_{\mu} well.

For simplicity, we write dPK()=DK(μP,)d^{K}_{P}(\cdot)=D_{K}({\mu_{P}},\cdot) as DK(P,)D_{K}(P,\cdot).

The difficult case is when DK(P,x)D_{K}(P,x) is very small, and hence κ(P,P)\kappa(P,P) is very small. So we start by developing tools to upper bound κ(P,P)\kappa(P,P) using p^=argminpPDK(P,p)\hat{p}=\arg\min_{p\in P}D_{K}(P,p), a point which only provides a worse approximation than pp^{\star}.

We first provide a general result in a Hilbert space (a refinement of a vector space [25]), and then next apply it to our setting in the RKHS.

Lemma C.1.

Consider a set V={v1,,vn}V=\{v_{1},\ldots,v_{n}\} of vectors in a Hilbert space endowed with norm \|\cdot\| and inner product ,\langle\cdot,\cdot\rangle. Let each viv_{i} have norm vi=η\|v_{i}\|=\eta. Consider weights W={w1,,wn}W=\{w_{1},\ldots,w_{n}\} such that wi0w_{i}\geq 0 and i=1nwi=1\sum_{i=1}^{n}w_{i}=1. Let r=i=1nwivir=\sum_{i=1}^{n}w_{i}v_{i}. Let v^=argminviVvir\hat{v}=\arg\min_{v_{i}\in V}\|v_{i}-r\|. Then

r2η2rv^2.\|r\|^{2}\leq\eta^{2}-\|r-\hat{v}\|^{2}.
Proof C.2.

Recall elementary properties of inner product space: x2=x,x\|x\|^{2}=\langle x,x\rangle, ax,y=ax,y\langle ax,y\rangle=a\langle x,y\rangle, xy,xy=x,x+y,y2x,y\langle x-y,x-y\rangle=\langle x,x\rangle+\langle y,y\rangle-2\langle x,y\rangle. By definition of v^\hat{v}, for any viVv_{i}\in V,

vir2v^r2vi,vi+r,r2vi,rv^,v^+r,r2v^,rvi,rv^,r.\displaystyle\|v_{i}-r\|^{2}\geq\|\hat{v}-r\|^{2}\Rightarrow\langle v_{i},v_{i}\rangle+\langle r,r\rangle-2\langle v_{i},r\rangle\geq\langle\hat{v},\hat{v}\rangle+\langle r,r\rangle-2\langle\hat{v},r\rangle\Rightarrow\langle v_{i},r\rangle\leq\langle\hat{v},r\rangle.

We can decompose rr (based on linearity of an inner product space) as

r2=r,r=i=1nwivi,ri=1nwiv^,r=v^,r=12(r2+v^2v^r2).\|r\|^{2}=\langle r,r\rangle=\sum_{i=1}^{n}w_{i}\langle v_{i},r\rangle\leq\sum_{i=1}^{n}w_{i}\langle\hat{v},r\rangle=\langle\hat{v},r\rangle=\frac{1}{2}(\|r\|^{2}+\|\hat{v}\|^{2}-\|\hat{v}-r\|^{2}).

The last equality holds since v^r2=r2+v^22v^,r\|\hat{v}-r\|^{2}=\|r\|^{2}+\|\hat{v}\|^{2}-2\langle\hat{v},r\rangle. Then, since v^=η\|\hat{v}\|=\eta, we can solve for r2\|r\|^{2} as

r2η2v^r2.\|r\|^{2}\leq\eta^{2}-\|\hat{v}-r\|^{2}.
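
Since the argument uses only inner-product identities, Lemma C.1 can be sanity-checked numerically with random equal-norm vectors; the sketch below does so in \mathbb{R}^{d}, which suffices because the statement is purely about a Hilbert space.

```python
# Numeric sanity check of Lemma C.1 with random equal-norm vectors in R^d
# (a sketch; the statement is purely Hilbert-space, so R^d suffices).
import numpy as np

rng = np.random.default_rng(2)
n, d, eta = 50, 5, 1.0

V = rng.normal(size=(n, d))
V *= eta / np.linalg.norm(V, axis=1, keepdims=True)   # ||v_i|| = eta for all i
w = rng.random(n); w /= w.sum()                       # nonnegative weights summing to 1

r = w @ V                                             # r = sum_i w_i v_i
v_hat = V[np.linalg.norm(V - r, axis=1).argmin()]     # closest v_i to r
print(r @ r <= eta ** 2 - (r - v_hat) @ (r - v_hat) + 1e-12)   # expected True
```
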
Lemma C.3.

Let p^=argminpPDK(P,p)\hat{p}=\arg\min_{p\in P}D_{K}(P,p), then κ(P,P)σ2DK(P,p^)2\kappa(P,P)\leq\sigma^{2}-D_{K}(P,\hat{p})^{2}.

Proof C.4.

Let ϕK:dK\phi_{K}:{{\mathbb{R}}}^{d}\to\mathcal{H}_{K} map points in d{{\mathbb{R}}}^{d} to the reproducing kernel Hilbert space (RKHS) K\mathcal{H}_{K} defined by kernel KK. This space has norm PK=κ(P,P)\|P\|_{\mathcal{H}_{K}}=\sqrt{\kappa(P,P)} defined on a set of points PP and inner product κ(P,P)\kappa(P,P). Let ΦK(P)=1|P|pPϕK(p)\Phi_{K}(P)=\frac{1}{|P|}\sum_{p\in P}\phi_{K}(p) be the representation of a set of points PP in K\mathcal{H}_{K}. Note that DK(P,Q)=ΦK(P)ΦK(Q)KD_{K}(P,Q)=\|\Phi_{K}(P)-\Phi_{K}(Q)\|_{\mathcal{H}_{K}}. We can now apply Lemma C.1 to {ϕK(p)}pP\{\phi_{K}(p)\}_{p\in P} with weights w(p)=1/|P|w(p)=1/|P| and r=ΦK(P)r=\Phi_{K}(P), and norm η=σ\eta=\sigma. Hence κ(P,P)=PK2σ2DK(P,p^)2\kappa(P,P)=\|P\|_{\mathcal{H}_{K}}^{2}\leq\sigma^{2}-D_{K}(P,\hat{p})^{2}.

Lemma C.5.

For any s>0s>0 and any xs2x\leq s^{2}, we have s2xsx/2s\sqrt{s^{2}-x}\leq s-x/2s.

Proof C.6.

We expand the square of the desired result

(s2x)(sx/2s)2=s2x+x2/4s2.(s^{2}-x)\leq(s-x/2s)^{2}=s^{2}-x+x^{2}/4s^{2}.

After subtracting (s2x)(s^{2}-x) from both sides, it is equivalent to 0x2/4s20\leq x^{2}/4s^{2}. This holds since x2x^{2} and ss are always nonnegative.

Lemma C.7.

DK(P,x)DK(p,x)2/CσD_{K}(P,x)\geq D_{K}(p^{\star},x)^{2}/C_{\sigma} for Cσ=2σ+2C_{\sigma}=2\sigma+2.

Proof C.8.

Refer to Figure 5 for geometric intuition in this proof. Let ν0\nu_{0} be a measure that is ν0(p)=0\nu_{0}(p)=0 for all pdp\in{{\mathbb{R}}}^{d}; thus it has a norm κ(ν0,ν0)=0\kappa(\nu_{0},\nu_{0})=0. We can measure the distance from ν0\nu_{0} to xx and PP, noting that DK(ν0,x)=κ(x,x)=σD_{K}(\nu_{0},x)=\sqrt{\kappa(x,x)}=\sigma and DK(ν0,P)=κ(P,P)D_{K}(\nu_{0},P)=\sqrt{\kappa(P,P)}. Thus by triangle inequality, Lemma C.3, and Lemma C.5,

DK(P,x)\displaystyle D_{K}(P,x) DK(ν0,x)DK(ν0,P)\displaystyle\geq D_{K}(\nu_{0},x)-D_{K}(\nu_{0},P)
=σκ(P,P)\displaystyle=\sigma-\sqrt{\kappa(P,P)}
σσ2DK(P,p^)2\displaystyle\geq\sigma-\sqrt{\sigma^{2}-D_{K}(P,\hat{p})^{2}}
DK(P,p^)2/2σ.\displaystyle\geq D_{K}(P,\hat{p})^{2}/2\sigma.
Figure 5: Illustration of xx, pp^{\star}, p^\hat{p}, ν0\nu_{0}, and PP as vectors in a RKHS. Note we have omitted the ϕK\phi_{K} and ΦK\Phi_{K} maps to unclutter the notation.

We now assume that DK(P,x)<DK(p,x)/CσD_{K}(P,x)<D_{K}(p^{\star},x)/C_{\sigma} and show this is not possible. First observe that DK(P,p^)+DK(P,x)DK(p^,x)DK(p,x)D_{K}(P,\hat{p})+D_{K}(P,x)\geq D_{K}(\hat{p},x)\geq D_{K}(p^{\star},x). These expressions imply that DK(P,p^)DK(p,x)DK(P,x)(11/Cσ)DK(p,x)D_{K}(P,\hat{p})\geq D_{K}(p^{\star},x)-D_{K}(P,x)\geq(1-1/C_{\sigma})D_{K}(p^{\star},x), and thus

DK(P,x)12σDK(P,p^)212σ(11Cσ)2DK(p,x)21CσDK(p,x)2,D_{K}(P,x)\geq\frac{1}{2\sigma}D_{K}(P,\hat{p})^{2}\geq\frac{1}{2\sigma}\left(1-\frac{1}{C_{\sigma}}\right)^{2}D_{K}(p^{\star},x)^{2}\geq\frac{1}{C_{\sigma}}D_{K}(p^{\star},x)^{2},

a contradiction. The last steps follows by setting

12σ(11Cσ)2\displaystyle\frac{1}{2\sigma}\left(1-\frac{1}{C_{\sigma}}\right)^{2} 1CσCσ2(2+2σ)Cσ+10\displaystyle\geq\frac{1}{C_{\sigma}}\Rightarrow C_{\sigma}^{2}-(2+2\sigma)C_{\sigma}+1\geq 0

and solving for CσC_{\sigma},

Cσ(2+2σ)+(2+2σ)242=1+σ+σ2+2σ=1+σ+(σ+1)21.\displaystyle C_{\sigma}\geq\frac{(2+2\sigma)+\sqrt{(2+2\sigma)^{2}-4}}{2}=1+\sigma+\sqrt{\sigma^{2}+2\sigma}=1+\sigma+\sqrt{(\sigma+1)^{2}-1}.

Since Cσ=2σ+2>1+σ+(σ+1)21C_{\sigma}=2\sigma+2>1+\sigma+\sqrt{(\sigma+1)^{2}-1}, so we have 12σ(11Cσ)21Cσ\frac{1}{2\sigma}\left(1-\frac{1}{C_{\sigma}}\right)^{2}\geq\frac{1}{C_{\sigma}}.

Recall that an ε\varepsilon-kernel sample PP of μ\mu satisfies maxxd|κ(μ,x)κ(μP,x)|ε.\max_{x\in{{\mathbb{R}}}^{d}}|\kappa(\mu,x)-\kappa(\mu_{P},x)|\leq\varepsilon.

Theorem C.9.

If DK(P,x)1D_{K}(P,x)\geq 1 then fPk(P,x)6σ+8DK(P,x){f^{\textsc{k}}_{P}}(P,x)\leq\sqrt{6\sigma+8}D_{K}(P,x). If PP is an (ε/4)(\varepsilon/4)-kernel sample of μ\mu then fPk(μ,x)6σ+8(DK(μ,x)+ε){f^{\textsc{k}}_{P}}(\mu,x)\leq\sqrt{6\sigma+8}(D_{K}(\mu,x)+\varepsilon).

Proof C.10.

We combine Lemma C.7 with Lemma 3.4 to achieve

fPk(P,x)22DK(P,x)2+3DK(p,x)22DK(P,x)2+3(2σ+2)DK(P,x).{f^{\textsc{k}}_{P}}(P,x)^{2}\leq 2D_{K}(P,x)^{2}+3D_{K}(p^{\star},x)^{2}\leq 2D_{K}(P,x)^{2}+3(2\sigma+2)D_{K}(P,x).

Note that the first DK(P,x)D_{K}(P,x) term is squared and the second is not. If DK(P,x)αD_{K}(P,x)\geq\alpha, then DK(P,x)(1/α)DK(P,x)2D_{K}(P,x)\leq(1/\alpha)D_{K}(P,x)^{2}, and we have

fPk(P,x)2(2+(6+6σ)/α)DK(P,x)2.{f^{\textsc{k}}_{P}}(P,x)^{2}\leq(2+(6+6\sigma)/\alpha)D_{K}(P,x)^{2}.

Let α=1\alpha=1. We have

fPk(P,x)2(6σ+8)DK(P,x)2.{f^{\textsc{k}}_{P}}(P,x)^{2}\leq(6\sigma+8)D_{K}(P,x)^{2}.

Since DK(P,x)DK(μ,x)+εD_{K}(P,x)\leq D_{K}(\mu,x)+\varepsilon via Lemma B.1, we obtain

fPk(μ,x)6σ+8(DK(μ,x)+ε).{f^{\textsc{k}}_{P}}(\mu,x)\leq\sqrt{6\sigma+8}(D_{K}(\mu,x)+\varepsilon).

C.2 Approximating the Minimum Kernel Distance Point

The goal in this section is to find a point that approximately minimizes the kernel distance to a point set PP. We assume here PP contains nn points and describes a measure made of nn Dirac masses, one at each pPp\in P with weight 1/n1/n (this is the empirical measure μP{\mu_{P}} defined in Section 1.1). Let p+=argminqdDK(μP,q)=argmaxqdκ(μP,q)p_{+}=\arg\min_{q\in{{\mathbb{R}}}^{d}}D_{K}({\mu_{P}},q)=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa({\mu_{P}},q). Since DK(μP,q)=DK(P,q)D_{K}({\mu_{P}},q)=D_{K}(P,q), for simplicity in notation, we work with the point set PP instead of μP{\mu_{P}} for the remainder of this section. That is, we define p+=argminqdDK(P,q)=argmaxqdκ(P,q)p_{+}=\arg\min_{q\in{{\mathbb{R}}}^{d}}D_{K}(P,q)=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa(P,q). Note that p+p_{+} is chosen over all of d{{\mathbb{R}}}^{d}, as the bound in Theorem C.9 is not sufficient when choosing a point from PP. In particular, for any δ>0\delta>0, we want a point p^+\hat{p}_{+} such that DK(P,p^+)(1+δ)DK(P,p+)D_{K}(P,\hat{p}_{+})\leq(1+\delta)D_{K}(P,p_{+}).

Note that Agarwal et al. [1] provide an algorithm that with high probability finds a point q^\hat{q} such that κ(P,q^)(1δ)κ(P,p+)\kappa(P,\hat{q})\geq(1-\delta)\kappa(P,p_{+}) in time O((1/δ4)nlogn)O((1/\delta^{4})n\log n). However, this point q^\hat{q} is not sufficient for our purpose (that is, q^\hat{q} does not satisfy the condition DK(P,q^)(1+δ)DK(P,p+)D_{K}(P,\hat{q})\leq(1+\delta)D_{K}(P,p_{+})), since q^\hat{q} only yields

D_{K}(P,\hat{q})^{2}\leq\sigma^{2}+\kappa(P,P)-2(1-\delta)\kappa(P,p_{+})\nleq(1+\delta)\big{(}\sigma^{2}+\kappa(P,P)-2\kappa(P,p_{+})\big{)}=(1+\delta)D_{K}(P,p_{+})^{2},

since in general it is not true that 4κ(P,p+)σ2+κ(P,P)4\kappa(P,p_{+})\leq\sigma^{2}+\kappa(P,P), as would be required.

First we need some structural properties. For each point xdx\in{{\mathbb{R}}}^{d}, define a radius rx=argsupr>0{|Br(x)P|n/2r_{x}=\arg\sup_{r>0}\{|B_{r}(x)\cap P|\leq n/2}, where Br(x)B_{r}(x) is a ball of radius rr centered at xx. In other words, it is the largest radius such that at most half of points in PP are within Br(x)B_{r}(x). Let p^2\hat{p}_{2} be the point in PP such that p+p^2=rp+\|p_{+}-\hat{p}_{2}\|=r_{p_{+}}. In other words, p^2\hat{p}_{2} is a point such that no more than n/2n/2 points in PP satisfy p+pp+p^2\|p_{+}-p\|\geq\|p_{+}-\hat{p}_{2}\|. Finally it is useful to define rx,Kr_{x,K} which is rx,K=DK(x,p)r_{x,K}=D_{K}(x,p) where xp=rx\|x-p\|=r_{x}; in particular rp+,K=DK(p+,p^2)r_{p_{+},K}=D_{K}(p_{+},\hat{p}_{2}).

We now need to lower bound DK(P,p+)D_{K}(P,p_{+}) in terms of DK(P,p^2)D_{K}(P,\hat{p}_{2}). Lemma C.7 already provides a bound in terms of the closest point for any xdx\in{{\mathbb{R}}}^{d}. We follow a similar construction here.

Lemma C.11.

Consider a set V={v1,,vn}V=\{v_{1},\ldots,v_{n}\} of vectors in a Hilbert space endowed with norm \|\cdot\| and inner product ,\langle\cdot,\cdot\rangle. Let each viv_{i} have norm vi=η\|v_{i}\|=\eta. Consider weights W={w1,,wn}W=\{w_{1},\ldots,w_{n}\} such that 1/2wi01/2\geq w_{i}\geq 0 and i=1nwi=1\sum_{i=1}^{n}w_{i}=1. Let r=i=1nwivir=\sum_{i=1}^{n}w_{i}v_{i}. Define a partition of VV with V1V_{1} and V2V_{2} such that V2V_{2} is the smallest set such that viV2wi1/2\sum_{v_{i}\in V_{2}}w_{i}\geq 1/2, and for all v1V1v_{1}\in V_{1} and v2V2v_{2}\in V_{2} we have rv1<rv2\|r-v_{1}\|<\|r-v_{2}\|. Let v^2=argminviV2vir\hat{v}_{2}=\arg\min_{v_{i}\in V_{2}}\|v_{i}-r\|. Then

r2η2rv^222.\|r\|^{2}\leq\eta^{2}-\frac{\|r-\hat{v}_{2}\|^{2}}{2}.
Proof C.12.

For ease of notation, we assume that vi,r>vi+1,r\langle v_{i},r\rangle>\langle v_{i+1},r\rangle for all ii, and let {v1,,vk}=V1\{v_{1},\ldots,v_{k}\}=V_{1}. Let v^1=argminviV1vir=argminviVvir\hat{v}_{1}=\arg\min_{v_{i}\in V_{1}}\|v_{i}-r\|=\arg\min_{v_{i}\in V}\|v_{i}-r\|. Let v^\hat{v} be a norm η\eta vector that has v^,r=(v^1,r+v^2,r)/2\langle\hat{v},r\rangle=(\langle\hat{v}_{1},r\rangle+\langle\hat{v}_{2},r\rangle)/2. Since viV2wi1/2\sum_{v_{i}\in V_{2}}w_{i}\geq 1/2 and viV1wi1/2\sum_{v_{i}\in V_{1}}w_{i}\leq 1/2, let viV2wi=1/2+δ\sum_{v_{i}\in V_{2}}w_{i}=1/2+\delta and viV1wi=1/2δ\sum_{v_{i}\in V_{1}}w_{i}=1/2-\delta (for 0δ1/2)0\leq\delta\leq 1/2). By definition, we also have v^1,rv^2,r\langle\hat{v}_{1},r\rangle\geq\langle\hat{v}_{2},r\rangle. We can decompose rr as

r2=r,r\displaystyle\|r\|^{2}=\langle r,r\rangle =i=1nwivi,r=i=1kwivi,r+i=k+1nwivi,r\displaystyle=\sum_{i=1}^{n}w_{i}\langle v_{i},r\rangle=\sum_{i=1}^{k}w_{i}\langle v_{i},r\rangle+\sum_{i={k+1}}^{n}w_{i}\langle v_{i},r\rangle
i=1kwiv^1,r+i=k+1nwiv^2,r=(i=1kwi)v^1,r+(i=k+1nwi)v^2,r\displaystyle\leq\sum_{i=1}^{k}w_{i}\langle\hat{v}_{1},r\rangle+\sum_{i=k+1}^{n}w_{i}\langle\hat{v}_{2},r\rangle=\left(\sum_{i=1}^{k}w_{i}\right)\langle\hat{v}_{1},r\rangle+\left(\sum_{i=k+1}^{n}w_{i}\right)\langle\hat{v}_{2},r\rangle
=(1/2δ)v^1,r+(1/2+δ)v^2,r=(1/2)(v^1,r+v^2,r)+δ(v^2,rv^1,r)\displaystyle=(1/2-\delta)\langle\hat{v}_{1},r\rangle+(1/2+\delta)\langle\hat{v}_{2},r\rangle=(1/2)(\langle\hat{v}_{1},r\rangle+\langle\hat{v}_{2},r\rangle)+\delta(\langle\hat{v}_{2},r\rangle-\langle\hat{v}_{1},r\rangle)
(v^1,r+v^2,r)/2=v^,r\displaystyle\leq(\langle\hat{v}_{1},r\rangle+\langle\hat{v}_{2},r\rangle)/2=\langle\hat{v},r\rangle
=12(r2+v^2v^r2).\displaystyle=\frac{1}{2}(\|r\|^{2}+\|\hat{v}\|^{2}-\|\hat{v}-r\|^{2}).

The last equality holds since v^r2=r2+v^22v^,r\|\hat{v}-r\|^{2}=\|r\|^{2}+\|\hat{v}\|^{2}-2\langle\hat{v},r\rangle. Then, since v^=η\|\hat{v}\|=\eta, we can solve for r2\|r\|^{2} as

r2η2v^r2=η2(v^2r2+v^1r2)/2η2v^2r2/2.\|r\|^{2}\leq\eta^{2}-\|\hat{v}-r\|^{2}=\eta^{2}-(\|\hat{v}_{2}-r\|^{2}+\|\hat{v}_{1}-r\|^{2})/2\leq\eta^{2}-\|\hat{v}_{2}-r\|^{2}/2.
Lemma C.13.

Using p^2\hat{p}_{2} as defined above, then κ(P,P)σ2DK(P,p^2)2/2\kappa(P,P)\leq\sigma^{2}-D_{K}(P,\hat{p}_{2})^{2}/2.

Proof C.14.

Let ϕK:dK\phi_{K}:{{\mathbb{R}}}^{d}\to\mathcal{H}_{K} map points in d{{\mathbb{R}}}^{d} to the reproducing kernel Hilbert space (RKHS) K\mathcal{H}_{K} defined by kernel KK. This space has norm PK=κ(P,P)\|P\|_{\mathcal{H}_{K}}=\sqrt{\kappa(P,P)} defined on a set of points PP and inner product κ(P,P)\kappa(P,P). Let ΦK(P)=1|P|pPϕK(p)\Phi_{K}(P)=\frac{1}{|P|}\sum_{p\in P}\phi_{K}(p) be the representation of a set of points PP in K\mathcal{H}_{K}. Note that DK(P,Q)=ΦK(P)ΦK(Q)KD_{K}(P,Q)=\|\Phi_{K}(P)-\Phi_{K}(Q)\|_{\mathcal{H}_{K}}. We can now apply Lemma C.11 to {ϕK(p)}pP\{\phi_{K}(p)\}_{p\in P} with weights w(p)=1/|P|w(p)=1/|P| and r=ΦK(P)r=\Phi_{K}(P), and norm η=σ\eta=\sigma. Finally note that we can use ϕK(p^2)=v^2\phi_{K}(\hat{p}_{2})=\hat{v}_{2} since V2V_{2} represents the set of points which are further or equal to PP than is p^2\hat{p}_{2}. In addition, by the property of RKHS, ΦK(P)ϕK(p^2)=DK(P,p^2)\|\Phi_{K}(P)-\phi_{K}(\hat{p}_{2})\|=D_{K}(P,\hat{p}_{2}). Hence κ(P,P)=PK2σ2DK(P,p^2)2/2\kappa(P,P)=\|P\|_{\mathcal{H}_{K}}^{2}\leq\sigma^{2}-D_{K}(P,\hat{p}_{2})^{2}/2.

Lemma C.15.

DK(P,p+)DK(p+,p^2)2/(4σ)D_{K}(P,p_{+})\geq D_{K}(p_{+},\hat{p}_{2})^{2}/(4\sigma).

Proof C.16.

Refer to Figure 5 for geometric intuition in this proof. Let ν0\nu_{0} be a measure that is ν0(p)=0\nu_{0}(p)=0 for all pdp\in{{\mathbb{R}}}^{d}; thus it has a norm κ(ν0,ν0)=0\kappa(\nu_{0},\nu_{0})=0. We can measure the distance from ν0\nu_{0} to p+p_{+} and PP, noting that DK(ν0,x)=κ(x,x)=σD_{K}(\nu_{0},x)=\sqrt{\kappa(x,x)}=\sigma and DK(P,ν0)=κ(P,P)D_{K}(P,\nu_{0})=\sqrt{\kappa(P,P)}. Thus by triangle inequality, Lemma C.13, and Lemma C.5

DK(P,p+)\displaystyle D_{K}(P,p_{+}) DK(ν0,p+)DK(P,ν0)\displaystyle\geq D_{K}(\nu_{0},p_{+})-D_{K}(P,\nu_{0})
=σκ(P,P)\displaystyle=\sigma-\sqrt{\kappa(P,P)}
σσ2DK(P,p^2)2/2\displaystyle\geq\sigma-\sqrt{\sigma^{2}-D_{K}(P,\hat{p}_{2})^{2}/2}
DK(P,p^2)2/4σ.\displaystyle\geq D_{K}(P,\hat{p}_{2})^{2}/4\sigma.
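The final step in this chain uses the elementary bound \sigma-\sqrt{\sigma^{2}-c}\geq c/(2\sigma) for 0\leq c\leq\sigma^{2}, applied here with c=D_{K}(P,\hat{p}_{2})^{2}/2. Indeed, since \sigma-c/(2\sigma)\geq 0, squaring both sides shows

\sigma-\frac{c}{2\sigma}\geq\sqrt{\sigma^{2}-c}\quad\Longleftrightarrow\quad\sigma^{2}-c+\frac{c^{2}}{4\sigma^{2}}\geq\sigma^{2}-c,

and the latter inequality always holds.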

Now we place a net \mathcal{N} on {\mathbb{R}}^{d}; specifically, \mathcal{N} is a set of points such that some q\in\mathcal{N} satisfies \|q-p_{+}\|\leq\delta D_{K}(p_{+},\hat{p}_{2})^{2}/4\sigma\leq\delta D_{K}(P,p_{+}) (we refer to this inequality as the net condition). Since D_{K}(P,\cdot) is 1-Lipschitz, we have D_{K}(P,q)-D_{K}(P,p_{+})\leq\|q-p_{+}\|. This ensures that some point q\in\mathcal{N} satisfies D_{K}(P,q)\leq(1+\delta)D_{K}(P,p_{+}), and hence can serve as \hat{p}_{+}. In other words, \mathcal{N} is guaranteed to contain a point q that can serve as \hat{p}_{+}.

Note that p+p_{+} must be in CH(P){\textsc{\small CH}}(P), the convex hull of PP. Otherwise, moving to the closest point on CH(P){\textsc{\small CH}}(P) decreases the distance to all points, and thus increases κ(P,p+)\kappa(P,p_{+}), which cannot happen by definition of p+p_{+}. Let Δ\Delta be the diameter of PP (the distance between the two furthest points). Clearly for some pPp\in P we must have p+pΔ\|p_{+}-p\|\leq\Delta.

Also note that p+:=argmaxqdκ(P,q)p_{+}:=\arg\max_{q\in{{\mathbb{R}}}^{d}}\kappa(P,q) must be within a distance Rσ=σ2ln(n)R_{\sigma}=\sigma\sqrt{2\ln(n)} to some pPp\in P, otherwise for p=argminpPp+pp^{\star}=\arg\min_{p\in P}\|p_{+}-p\|, we can bound κ(P,p+)K(p,p+)σ2/n=K(p,p)/nκ(P,p),\kappa(P,p_{+})\leq K(p^{\star},p_{+})\leq\sigma^{2}/n=K(p^{\star},p^{\star})/n\leq\kappa(P,p^{\star}), which means p+p_{+} is not a maximum. The first inequality is by definition of pp^{*}, the second by assuming p+pσ2ln(n)\|p_{+}-p^{\star}\|\geq\sigma\sqrt{2\ln(n)}.
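Concretely, recalling the Gaussian kernel K(x,y)=\sigma^{2}\exp(-\|x-y\|^{2}/2\sigma^{2}) (so that K(x,x)=\sigma^{2}), the second inequality follows from

K(p^{\star},p_{+})=\sigma^{2}\exp\left(-\frac{\|p_{+}-p^{\star}\|^{2}}{2\sigma^{2}}\right)\leq\sigma^{2}\exp\left(-\frac{2\sigma^{2}\ln(n)}{2\sigma^{2}}\right)=\sigma^{2}e^{-\ln(n)}=\frac{\sigma^{2}}{n}.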

Let B_{R}(p) be the ball centered at p with radius R=\min(R_{\sigma},\Delta), and let R_{p}=\min(R,r_{p}/2). So p_{+} must be in \bigcup_{p\in P}B_{R}(p). We describe the construction of a net \mathcal{N}_{p} for one ball B_{R}(p); that is, for any x such that p\in P is the closest point to x, some point q\in\mathcal{N}_{p} satisfies \|q-x\|\leq\delta(r_{x,K})^{2}/4\sigma. Thus if this point x=p_{+}, the correct property holds, and we can use the corresponding q as \hat{p}_{+}. Then \mathcal{N}=\bigcup_{p\in P}\mathcal{N}_{p} is at most n times the size of a single \mathcal{N}_{p}. Let k_{p} be the smallest integer k such that r_{p}/2\geq R/2^{k}. The net \mathcal{N}_{p} will be composed of \mathcal{N}_{p}=\bigcup_{i=0}^{k_{p}}\mathcal{N}_{i}=\mathcal{N}_{0}\cup\mathcal{N}^{\prime}_{p}, where \mathcal{N}^{\prime}_{p}=\bigcup_{i=1}^{k_{p}}\mathcal{N}_{i}.

Before we proceed with the construction, we need an assumption: that \Lambda_{P}=\min_{p\in P}r_{p} is a bounded quantity; that is, it is not too small. In other words, no point has more than half of the points within an absolute radius \Lambda_{P}. We call \Lambda_{P} the median concentration.

Lemma C.17.

A net 𝒩0\mathcal{N}_{0} can be constructed of size O((σ/δΛP)d+logd/2(n))O((\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)) so that all points xBRp(p)x\in B_{R_{p}}(p) satisfy qxδ(rx,K)2/4σ\|q-x\|\leq\delta(r_{x,K})^{2}/4\sigma for some q𝒩0q\in\mathcal{N}_{0}.

If x=p_{+}, then such a point satisfies the net condition; that is, there is a point q\in\mathcal{N}_{0} such that \|q-x\|=\|q-p_{+}\|\leq\delta(r_{p_{+},K})^{2}/(4\sigma)=\delta D_{K}(p_{+},\hat{p}_{2})^{2}/(4\sigma)\leq\delta D_{K}(P,p_{+}).

Proof C.18.

For all points x\in B_{R_{p}}(p)\subseteq B_{r_{p}/2}(p), we must have r_{x}\geq r_{p}/2; otherwise B_{r_{p}/2}(x) would be completely inside B_{r_{p}}(p) and could not contain enough points. Within B_{R_{p}}(p) we place the net \mathcal{N}_{0} so that all points x\in B_{R_{p}}(p) satisfy \|x-q\|\leq\min(\delta r_{p}^{2}/32\sigma,\sqrt{3}\sigma) for some q\in\mathcal{N}_{0}. Now \delta r_{p}^{2}/32\sigma\leq\delta r_{x}^{2}/8\sigma, and since \|x-y\|^{2}/2\leq D_{K}(x,y)^{2} for \|x-y\|\leq\sqrt{3}\sigma (via Lemma 5.5), the net ensures that if p_{+}\in B_{R_{p}}(p), then some q\in\mathcal{N}_{0} is sufficiently close to p_{+}.

Since B_{R_{p}}(p) fits in an axis-aligned box of side length \min(2R_{\sigma},r_{p}), we can describe \mathcal{N}_{0} as an axis-aligned grid with g points along each axis. We bound g in two cases. When \delta r_{p}^{2}/32\sigma<\sqrt{3}\sigma, we can set

g=\frac{R_{p}}{\delta r_{p}^{2}/(32\sigma\sqrt{d})}\leq\frac{32\sigma\sqrt{d}}{\delta r_{p}}=O(\sigma/\delta r_{p})=O(\sigma/\delta\Lambda_{P}).

Otherwise,

g=Rp3σ/dσ2ln(n)3σ/d=2dln(n)/3=O(log(n)).g=\frac{R_{p}}{\sqrt{3}\sigma/\sqrt{d}}\leq\frac{\sigma\sqrt{2\ln(n)}}{\sqrt{3}\sigma/\sqrt{d}}=\sqrt{2d\ln(n)/3}=O(\sqrt{\log(n)}).

Then we need |\mathcal{N}_{0}|=O(g^{d})=O((\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)).

When r_{p}/2<R we still need to handle points x in the annulus A_{p}=B_{R}(p)\setminus B_{r_{p}/2}(p). For a point x\in A_{p}, if p=\arg\min_{p^{\prime}\in P}\|x-p^{\prime}\| then r_{x}\geq\|x-p\|. We only need the net \mathcal{N}_{p}^{\prime} on A_{p} to cover those points for which p is the closest point; the others will be handled by the net \mathcal{N}_{p^{\prime}} of some other p^{\prime}\in P with p^{\prime}\neq p.

Recall kpk_{p} is the smallest integer kk such that rp/2R/2kr_{p}/2\geq R/2^{k}.

Lemma C.19.

A net 𝒩p\mathcal{N}_{p}^{\prime} can be constructed of size O(kp+(σ/δΛP)d+logd/2(n))O(k_{p}+(\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)) so that all points xApx\in A_{p} where p=argminpPxpp=\arg\min_{p^{\prime}\in P}\|x-p^{\prime}\|, satisfy qxδ(rx,K)2/4σ\|q-x\|\leq\delta(r_{x,K})^{2}/4\sigma for some q𝒩pq\in\mathcal{N}_{p}^{\prime}.

If x=p_{+}, then such a point satisfies the net condition; that is, there is a point q\in\mathcal{N}^{\prime}_{p} such that \|q-x\|=\|q-p_{+}\|\leq\delta(r_{p_{+},K})^{2}/(4\sigma)=\delta D_{K}(p_{+},\hat{p}_{2})^{2}/(4\sigma)\leq\delta D_{K}(P,p_{+}).

Proof C.20.

We now consider the k_{p} annuli \{A_{1},\ldots,A_{k_{p}}\} which cover A_{p}. Each A_{i}=\{x\in{\mathbb{R}}^{d}\mid R/2^{i-1}\geq\|p-x\|>R/2^{i}\} has volume O((R/2^{i-1})^{d}). For any x\in A_{i} we have r_{x}\geq\|x-p\|\geq R/2^{i}, so the Euclidean distance to the nearest q\in\mathcal{N}_{i} can be at most \min(\sqrt{3}\sigma,\delta(R/2^{i})^{2}/8\sigma). Thus we can cover A_{i} with a net \mathcal{N}_{i} of size t_{i}, based on two cases again. If \delta(R/2^{i})^{2}/8\sigma<\sqrt{3}\sigma then

ti=O(1+(R2i/(δσ(R2i)2))d)=O(1+(2iRσδ)d)=O(1)+O((σδR)d(2d)i).t_{i}=O\left(1+\left(\frac{R}{2^{i}}/\left(\frac{\delta}{\sigma}\left(\frac{R}{2^{i}}\right)^{2}\right)\right)^{d}\right)=O\left(1+\left(\frac{2^{i}}{R}\frac{\sigma}{\delta}\right)^{d}\right)=O(1)+O\left(\left(\frac{\sigma}{\delta R}\right)^{d}(2^{d})^{i}\right).

Otherwise

t_{i}=O\left(1+\left(\frac{R/2^{i}}{\sqrt{3}\sigma}\right)^{d}\right)\leq O\left(1+\left(\frac{R_{\sigma}}{2^{i}\sqrt{3}\sigma}\right)^{d}\right)=O(1)+O\left(\frac{\log^{d/2}(n)}{(2^{d})^{i}}\right),
recalling that R\leq R_{\sigma}=\sigma\sqrt{2\ln(n)}.

Since, by the minimality of k_{p}, R/2^{k_{p}}>r_{p}/4\geq\Lambda_{P}/4, the total size of \mathcal{N}_{p}^{\prime}, the union of all of these nets, is \sum_{i=1}^{k_{p}}t_{i}\leq O(k_{p})+2t_{k_{p}}+2t_{1}=O(k_{p}+(\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)). In the first case t_{k_{p}} dominates the cost, and in the second case t_{1} does.

Thus the total size of \mathcal{N}_{p} is O((\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)+k_{p}), where k_{p}\leq\log(R/r_{p})+2. It just remains to bound k_{p}. Given that no more than n/2 points are co-located at the same location (which already holds since \Lambda_{P} is a bounded quantity), for all p\in P we have r_{p}\geq\min_{q\neq q^{\prime}\in P}\|q-q^{\prime}\|. The value \beta_{P}=\Delta/\min_{q\neq q^{\prime}\in P}\|q-q^{\prime}\| is known as the spread of the point set; it is common to assume it is bounded (it is related to the precision of the coordinates), so that \log(\beta_{P}) is not too large. Thus we can bound k_{p}=O(\log(\beta_{P})).

Theorem C.21.

Consider a point set PdP\subset{{\mathbb{R}}}^{d} with nn points, spread βP\beta_{P}, and median concentration ΛP\Lambda_{P}. For any δ>0\delta>0, in time O(n2((σ/δΛP)d+logd/2(n)+log(βP)))O(n^{2}((\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)+\log(\beta_{P}))) we can find a point p^+\hat{p}_{+} such that DK(P,p^+)(1+δ)DK(P,p+)D_{K}(P,\hat{p}_{+})\leq(1+\delta)D_{K}(P,p_{+}).

Proof C.22.

Using Lemma C.17 and Lemma C.19 we can build a net \mathcal{N} of size O(n((\sigma/\delta\Lambda_{P})^{d}+\log^{d/2}(n)+\log(\beta_{P}))) such that some q\in\mathcal{N} satisfies \|q-p_{+}\|\leq\delta D_{K}(p_{+},\hat{p}_{2})^{2}/4\sigma, which by Lemma C.15 is at most \delta D_{K}(P,p_{+}). Since D_{K}(P,\cdot) is 1-Lipschitz, this q satisfies D_{K}(P,q)\leq(1+\delta)D_{K}(P,p_{+}).

We can find such a q and return it as \hat{p}_{+} by evaluating \kappa(P,q) for all q\in\mathcal{N} and taking the one with the largest value. This takes O(n) time for each q\in\mathcal{N}.
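The following Python sketch illustrates the overall approach in a simplified form: it builds a crude candidate grid around each input point, standing in for the nets \mathcal{N}_{p} of Lemmas C.17 and C.19 (without their careful multi-resolution sizing, so it carries no approximation guarantee), evaluates \kappa(P,q) on every candidate, and returns the best one. The function names, grid resolution, and search radius are illustrative assumptions.

import numpy as np

def kappa_point(P, q, sigma):
    # kappa(P, q) = (1/n) sum_p K(p, q), with the Gaussian kernel K(p, q) = sigma^2 exp(-||p-q||^2 / (2 sigma^2))
    sq = np.sum((P - q) ** 2, axis=1)
    return np.mean(sigma ** 2 * np.exp(-sq / (2 * sigma ** 2)))

def approx_p_plus(P, sigma, grid_per_point=5, radius_factor=1.0):
    # Simplified stand-in for the net N = union_p N_p: a small axis-aligned grid
    # of half-width radius_factor * sigma placed around every point of P.
    n, d = P.shape
    offsets = np.linspace(-radius_factor * sigma, radius_factor * sigma, grid_per_point)
    mesh = np.meshgrid(*([offsets] * d))
    grid = np.stack([m.ravel() for m in mesh], axis=1)            # (grid_per_point^d, d)
    candidates = (P[:, None, :] + grid[None, :, :]).reshape(-1, d)
    vals = np.array([kappa_point(P, q, sigma) for q in candidates])
    return candidates[np.argmax(vals)]                            # candidate maximizing kappa(P, .)

rng = np.random.default_rng(1)
P = rng.normal(size=(100, 2))
p_hat_plus = approx_p_plus(P, sigma=0.5)
print(p_hat_plus, kappa_point(P, p_hat_plus, sigma=0.5))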

We claim that in many realistic settings \sigma/\Lambda_{P}=O(1). In such a case the algorithm runs in O(n^{2}(1/\delta^{d}+\log^{d/2}n+\log(\beta_{P}))) time. If instead \Lambda_{P}=o(\sigma), then over half of the measure described by P essentially behaves as a single point. In many settings P is drawn uniformly from a compact set S; choosing \sigma so large that more than half of S has negligible diameter compared to \sigma will cause that data to be over-smoothed. In fact, the definition of \Lambda_{P} can be modified so that this radius never contains more than \tau n points for any constant \tau<1, and the bounds do not change asymptotically.

Appendix D Details on Reconstruction Properties of Kernel Distance

In this section we provide the full proof for some statements from Section 4.

D.1 Topological Estimates using Kernel Power Distance

For the persistence diagrams of the sublevel sets filtration of d^{K}_{\mu} and of the weighted Rips filtration \{R_{\alpha}(P,d^{K}_{\mu})\} to be well-defined, we need the technical condition (proved in Lemmas D.1 and D.3) that these filtrations are q-tame. Recall that a filtration F is q-tame if, for any \alpha<\beta, the homomorphism from \textsf{H}(F_{\alpha}) to \textsf{H}(F_{\beta}) induced by the canonical inclusion has finite rank [12, 16].

Lemma D.1.

The sublevel sets filtration of dμKd^{K}_{\mu} is qq-tame.

Proof D.2.

The proof resembles the proof of q-tameness for the distance-to-a-measure sublevel sets filtration (Proposition 12, [8]). We have shown that d^{K}_{\mu} is 1-Lipschitz and proper. Properness implies that any sublevel set A:=(d^{K}_{\mu})^{-1}([0,\alpha]) (for \alpha<c_{\mu}) is compact. Since {\mathbb{R}}^{d} is triangulable (i.e., homeomorphic to a locally finite simplicial complex), there exists a homeomorphism h from {\mathbb{R}}^{d} to a locally finite simplicial complex C. For any \alpha>0, since A is compact, we consider the restriction of C to a finite simplicial complex C_{\alpha} that contains h(A). The function (d^{K}_{\mu}\circ h^{-1})\mid_{C_{\alpha}} is continuous on C_{\alpha}; therefore its sublevel sets filtration is q-tame by Theorem 2.22 of [16], which states that the sublevel sets filtration of a continuous function (defined on a realization of a finite simplicial complex) is q-tame. Extending the above construction to any \alpha, the sublevel sets filtration of d^{K}_{\mu}\circ h^{-1} is therefore q-tame. As homology is preserved by the homeomorphism h, this implies that the sublevel sets filtration of d^{K}_{\mu} is q-tame.

Setting μ=μP\mu={\mu_{P}}, Lemma D.1 implies that the sublevel sets filtration of dμPKd^{K}_{\mu_{P}} is also qq-tame.

Lemma D.3.

The weighted Rips filtration {Rα(P,dμK)}\{R_{\alpha}(P,d^{K}_{\mu})\} is qq-tame for compact subset PdP\subset{{\mathbb{R}}}^{d}.

Proof D.4.

Since P is a compact subset of {\mathbb{R}}^{d}, the weighted Rips filtration \{R_{\alpha}(P,d^{K}_{\mu})\} is q-tame (so \textsf{Dgm}(\{R_{\alpha}(P,d^{K}_{\mu})\}) is well-defined) by Proposition 32 of [16], which states that the weighted Rips filtration of a compact subset P of a metric space, with respect to its corresponding weight function, is q-tame.

Setting P=P^+P=\hat{P}_{+}, μ=μP\mu={\mu_{P}}, Lemma D.3 implies that the weighted Rips filtration {Rα(P^+,dμPK)}\{R_{\alpha}(\hat{P}_{+},d^{K}_{\mu_{P}})\} is well-defined.

D.2 Inference of Compact Set SS with the Kernel Distance

Suppose μ\mu is a uniform measure on a compact set SS in d{{\mathbb{R}}}^{d}. We now compare the kernel distance dμKd^{K}_{\mu} with the distance function fSf_{S} to the support SS of μ\mu. We show how dμKd^{K}_{\mu} approximates fSf_{S}, and thus allows one to infer geometric properties of SS from samples from μ\mu.

For a point x\in{\mathbb{R}}^{d}, the distance function f_{S} measures the minimum distance between x and any point in S, f_{S}(x)=\inf_{y\in S}\|x-y\|. The point x_{S} that realizes the minimum in the definition of f_{S}(x) is the orthogonal projection of x onto S. The set of points x\in{\mathbb{R}}^{d} that have more than one projection onto S is the medial axis of S [54], denoted {{M}}(S). Since {{M}}(S) resides in the unbounded component of {\mathbb{R}}^{d}\setminus S, it is referred to as the outer medial axis, similar to the concept in [28]. The reach of S is the minimum distance between a point in S and a point in its medial axis, denoted {\sf reach}(S). Similarly, one can define the medial axis of {\mathbb{R}}^{d}\setminus S (i.e., the inner medial axis, which resides in the interior of S) following the definitions in [53], and denote its associated reach as {\sf reach}({\mathbb{R}}^{d}\setminus S). The concepts of reach associated with the inner and outer medial axes of S capture curvature information of the compact set.

Recall that a generalized gradient of a distance function and its corresponding flow are described in [14], and later adapted for distance-like functions in [15]. Let f_{S}:{\mathbb{R}}^{d}\to{\mathbb{R}} be the distance function associated with a compact set S of {\mathbb{R}}^{d}. It is not differentiable on the medial axis of S. It is possible to define a generalized gradient function {\nabla_{S}{}}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{d} that coincides with the usual gradient of f_{S} where f_{S} is differentiable, is defined everywhere, and can be integrated into a continuous flow \Phi^{t}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{d}. Such a flow points away from S, towards local maxima of f_{S} (which belong to the medial axis of S) [54]. The integral (flow) line \gamma of this flow starting at a point in {\mathbb{R}}^{d} can be parameterized by arc length, \gamma:[a,b]\to{\mathbb{R}}^{d}, and we have f_{S}(\gamma(b))=f_{S}(\gamma(a))+\int_{a}^{b}\|{\nabla_{S}{}}(\gamma(t))\|\,dt.
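The following rough Python sketch mimics this setup numerically: it approximates f_{S} for a compact set S represented by a dense point sample and takes Euler steps along the (generalized) gradient away from S. The circle sample, the step size, and the function names are illustrative assumptions, not part of the construction above.

import numpy as np
from scipy.spatial import cKDTree

# Dense sample standing in for a compact set S (here: the unit circle).
theta = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
S_sample = np.column_stack([np.cos(theta), np.sin(theta)])
tree = cKDTree(S_sample)

def f_S(x):
    # distance function to (the sample of) S
    dist, _ = tree.query(x)
    return dist

def flow_step(x, h=0.01):
    # one Euler step of the flow: move away from the projection x_S of x onto S
    dist, idx = tree.query(x)
    grad = (x - S_sample[idx]) / max(dist, 1e-12)  # usual gradient of f_S off the medial axis
    return x + h * grad

x = np.array([1.3, 0.2])
for _ in range(50):
    x = flow_step(x)
print(x, f_S(x))  # f_S increases along the flow line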

Figure 6: Illustrations of the geometric inference of SS from dμKd^{K}_{\mu} at three scales.
Lemma D.5 (Lemma 4.5).

Given any flow line \gamma associated with the generalized gradient function {\nabla_{S}{}}, d^{K}_{\mu}(x) is strictly monotonically increasing along \gamma for x sufficiently far away from the medial axis of S, provided \sigma\leq\frac{R}{6\Delta_{G}} and f_{S}(x)\in(0.014R,2\sigma). Here B(\sigma/2) denotes a ball of radius \sigma/2, G:=\frac{{\textsf{Vol}}(B(\sigma/2))}{{\textsf{Vol}}(S)}, \Delta_{G}:=\sqrt{12+3\ln(4/G)}, and we assume R:=\min({\sf reach}(S),{\sf reach}({\mathbb{R}}^{d}\setminus S))>0.

Proof D.6.

Since dμK(x)d^{K}_{\mu}(x) is always positive, and dμK(x)=cμ2kdeμ(x)d^{K}_{\mu}(x)=\sqrt{c_{\mu}-2\textsc{kde}_{\mu}(x)} where cμc_{\mu} is a constant that depends only on μ\mu, KK, and σ\sigma, then it is sufficient to show that kdeμ(x)\textsc{kde}_{\mu}(x) is strictly monotonically decreasing along γ\gamma.

Let u be the negative of the direction of the flow line \gamma at x (i.e., u is a unit vector that points towards S). We show that \textsc{kde}_{\mu}(x) is strictly monotonically increasing along u. Informally, we will observe that all parts of S that are “close” to x are in the direction u, and that these parts dominate the gradient of \textsc{kde}_{\mu}(x) along u. We now make this more formal by describing two quantities, B_{x} and A_{x}, illustrated in Figure 6.

For a point x\in{\mathbb{R}}^{d}, let x_{S}=\arg\min_{x^{\prime}\in S}\|x^{\prime}-x\|; since x is not on the medial axis of S, x_{S} is uniquely defined and u points in the direction (x_{S}-x)/f_{S}(x). First, we claim that there exists a ball B_{x} of radius \sigma/2 incident to x_{S} that is completely contained in S. This holds since \sigma/2\leq\frac{R}{6\Delta_{G}}<R\leq{\sf reach}({\mathbb{R}}^{d}\setminus S). In addition, since f_{S}(x)<2\sigma, no part of B_{x} is further than 3\sigma from x. Second, we claim that no part of S within \Delta_{G}\cdot\sigma (\leq R/6) of x (this includes B_{x}) is in the halfspace H_{x} with boundary passing through x and outward normal defined by u. To see this, let o be the center of a ball of radius R that is incident to x_{S} but not in S; refer to such a ball as B_{o}. This implies that the points o, x and x_{S} are collinear. Then a ball centered at x with radius R/6 should intersect S outside of B_{o}, and in the worst case, on the boundary of H_{x}. This holds as long as \|x-x_{S}\|\geq 0.014R\geq(1-\sqrt{35/36})R; see Figure 6. Define A_{x}=\{y\in S\mid\|x-y\|>\Delta_{G}\cdot\sigma\}.

Now we examine the contributions to the directional derivative of \textsc{kde}_{\mu}(x) along the direction u from points in B_{x} and A_{x}, respectively. Such a directional derivative is denoted \textsf{D}_{u}\textsc{kde}_{\mu}(x). Recall that \textsc{kde}_{\mu}(x)=\int_{y\in S}K(x,y){\rm d}\mu(y); since \mu is a uniform measure on S, \textsf{D}_{u}\textsc{kde}_{\mu}(x)=\frac{1}{{\textsf{Vol}}(S)}\int_{y\in S}\textsf{D}_{u}K(x,y){\rm d}y. For any point y\in{\mathbb{R}}^{d}, we define g(y):=\textsf{D}_{u}K(x,y)=\exp(-\|x-y\|^{2}/2\sigma^{2})\langle y-x,u\rangle. Therefore \textsf{D}_{u}\textsc{kde}_{\mu}(x)=\frac{1}{{\textsf{Vol}}(S)}\int_{y\in S}g(y){\rm d}y.
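As a quick sanity check of the closed form g(y)=\textsf{D}_{u}K(x,y), the short sketch below compares it against a finite difference of the Gaussian kernel K(x,y)=\sigma^{2}\exp(-\|x-y\|^{2}/2\sigma^{2}) (this kernel form and all numerical values are assumptions for illustration only).

import numpy as np

sigma = 0.7
K = lambda x, y: sigma ** 2 * np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
x, y = rng.normal(size=2), rng.normal(size=2)
u = rng.normal(size=2)
u /= np.linalg.norm(u)  # unit direction

g = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)) * np.dot(y - x, u)  # closed form from the proof
h = 1e-6
fd = (K(x + h * u, y) - K(x, y)) / h                                     # finite-difference D_u K(x, y)
print(g, fd)  # the two agree up to O(h)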

We now examine the contribution to \textsf{D}_{u}\textsc{kde}_{\mu}(x) from points in B_{x}, namely \frac{1}{{\textsf{Vol}}(S)}\int_{y\in B_{x}}g(y). First, for all points y\in B_{x}, since \|x-y\|\leq 3\sigma, we have \exp(-\|x-y\|^{2}/2\sigma^{2})\geq\exp(-9/2). Second, at least half of the points y\in B_{x} (covering half the volume of B_{x}) are at least \sigma/2 away from x_{S}, and correspondingly for these points \langle y-x,u\rangle\geq\sigma/2. We have \int_{y\in B_{x}}g(y)\geq\frac{1}{2}{\textsf{Vol}}(B_{x})\cdot\exp(-9/2)\cdot\sigma/2. Given {\textsf{Vol}}(B_{x})=G\cdot{\textsf{Vol}}(S), we have \frac{1}{{\textsf{Vol}}(S)}\int_{y\in B_{x}}g(y)\geq\frac{1}{4}G\cdot\exp(-9/2)\cdot\sigma. Denote B=\frac{1}{4}G\cdot\exp(-9/2)\cdot\sigma.

We now examine the contribution to \textsf{D}_{u}\textsc{kde}_{\mu}(x) from points in A_{x}, namely \frac{1}{{\textsf{Vol}}(S)}\int_{y\in A_{x}}g(y). For any point y\in{\mathbb{R}}^{d} (including y\in A_{x}), \langle y-x,u\rangle\leq\|x-y\|. Let \phi_{y}=\|x-y\|/\sigma, so we have g(y)\leq\exp(-\phi_{y}^{2}/2)\phi_{y}\sigma. This bound on g(y) is maximized at \phi_{y}=1 and is decreasing for \phi_{y}\geq 1; hence under the condition \phi_{y}\geq\Delta_{G}\geq\sqrt{12}>1 we can set \phi_{y}=\Delta_{G} to achieve the bound g(y)\leq\exp(-\Delta_{G}^{2}/2)\cdot\Delta_{G}\sigma for \|x-y\|\geq\Delta_{G}\cdot\sigma (that is, for all y\in A_{x}). Now we have \int_{y\in A_{x}}g(y)\leq{\textsf{Vol}}(S)\exp(-\Delta_{G}^{2}/2)\cdot\Delta_{G}\sigma, leading to \frac{1}{{\textsf{Vol}}(S)}\int_{y\in A_{x}}g(y)\leq\exp(-\Delta_{G}^{2}/2)\cdot\Delta_{G}\sigma. Denote A=\exp(-\Delta_{G}^{2}/2)\cdot\Delta_{G}\sigma.

Since only the points yAxy\in A_{x} could possibly reside in HxH_{x} and thus can cause g(y)g(y) to be negative, we just need to show that B>AB>A. This can be confirmed by plugging in ΔG=12+3ln(4/G)\Delta_{G}=\sqrt{12+3\ln(4/G)}, and using some algebraic manipulation.
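As a quick numerical check (separate from the algebraic argument), one can evaluate B and A directly over a range of values of G\leq 1; the sketch below does only that and is not part of the proof.

import numpy as np

sigma = 1.0  # B and A are both proportional to sigma, so its value does not matter
for G in [1.0, 0.5, 0.1, 1e-3, 1e-6]:
    Delta_G = np.sqrt(12 + 3 * np.log(4 / G))
    B = 0.25 * G * np.exp(-4.5) * sigma
    A = np.exp(-Delta_G ** 2 / 2) * Delta_G * sigma
    print(G, B > A, B / A)  # B > A holds in every case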

Appendix E Lower Bound on Wasserstein Distance

We note that the next result is a known lower bound for the earth mover's distance [23, Theorem 7]. We reprove it here for completeness.

Lemma E.1 (Lemma 5.17).

For any probability measures μ\mu and ν\nu defined on d{{\mathbb{R}}}^{d} we have μ¯ν¯W2(μ,ν).\|\bar{\mu}-\bar{\nu}\|\leq W_{2}(\mu,\nu).

Proof E.2.

Let \pi:{\mathbb{R}}^{d}\times{\mathbb{R}}^{d}\to{\mathbb{R}}^{+} describe the optimal transportation plan from \mu to \nu. Also let u_{\mu,\nu}=\frac{\bar{\mu}-\bar{\nu}}{\|\bar{\mu}-\bar{\nu}\|} be the unit vector in the direction of \bar{\mu}-\bar{\nu}. Then we can expand

(W2(μ,ν))2=(p,q)pq2dπ(p,q)\displaystyle(W_{2}(\mu,\nu))^{2}=\int_{(p,q)}\|p-q\|^{2}{\rm d}{\pi(p,q)} (p,q)((pq),uμ,ν)2dπ(p,q)\displaystyle\geq\int_{(p,q)}(\langle(p-q),u_{\mu,\nu}\rangle)^{2}{\rm d}{\pi(p,q)}
μ¯ν¯2.\displaystyle\geq\|\bar{\mu}-\bar{\nu}\|^{2}.

The first inequality follows since \langle(p-q),u_{\mu,\nu}\rangle is a (signed) projection onto a unit vector, so its magnitude is at most \|p-q\|. The second inequality follows since \pi has total mass 1 and its marginals are \mu and \nu: by Jensen's inequality, \int_{(p,q)}(\langle(p-q),u_{\mu,\nu}\rangle)^{2}{\rm d}\pi(p,q)\geq\left(\int_{(p,q)}\langle(p-q),u_{\mu,\nu}\rangle{\rm d}\pi(p,q)\right)^{2}=\langle\bar{\mu}-\bar{\nu},u_{\mu,\nu}\rangle^{2}=\|\bar{\mu}-\bar{\nu}\|^{2}. The inequality can be strict since some movement may cancel out (e.g., a rotation).
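To illustrate the bound, the following sketch computes W_{2} between two uniform empirical measures on equal-size point sets via an optimal assignment (for such measures an optimal transport plan is a one-to-one matching) and compares it to \|\bar{\mu}-\bar{\nu}\|; the data and names are illustrative only.

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(3)
P = rng.normal(size=(100, 2))               # sample carrying the uniform measure mu
Q = rng.normal(size=(100, 2)) + [2.0, 0.0]  # sample carrying the uniform measure nu, shifted

# For uniform measures on equal-size point sets, W_2^2 is the average squared
# length of an optimal one-to-one matching.
cost = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=2)
row, col = linear_sum_assignment(cost)
W2 = np.sqrt(cost[row, col].mean())

mean_gap = np.linalg.norm(P.mean(axis=0) - Q.mean(axis=0))
print(mean_gap, W2)  # mean_gap <= W2, as in Lemma E.1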