
The Kernel Spatial Scan Statistic

Mingxuan Han u1209601@umail.utah.edu University of Utah201 Presidents Circle84112 Michael Matheny mmath@cs.utah.edu University of Utah201 Presidents Circle84112  and  Jeff M. Phillips jeffp@cs.utah.edu University of Utah201 Presidents Circle84112
Abstract.

Kulldorff's (1997) seminal paper on spatial scan statistics (SSS) has led to many methods considering different regions of interest, different statistical models, and different approximations, while also finding numerous applications in epidemiology, environmental monitoring, and homeland security. SSS provides a way to rigorously test for the existence of an anomaly and to provide statistical guarantees as to how "anomalous" that anomaly is. However, these methods rely on defining specific regions where the spatial information a point contributes is limited to a binary 0 or 1, either inside or outside the region, while in reality anomalies tend to follow smooth distributions with density decaying away from an epicenter. In this work, we propose a method that addresses this shortcoming through a continuous scan statistic that generalizes SSS by allowing the point contribution to be defined by a kernel. We provide extensive experimental and theoretical results showing that our methods can be computed efficiently while providing high statistical power for detecting anomalous regions.

1. Introduction

We propose a generalized version of spatial scan statistics called the kernel spatial scan statistic. In contrast to the many variants (AMPVW06; Kul97; huang2007spatial; jung2007spatial; neill2006bayesian; jung2010spatial; kulldorff2009scan; FNN17; HMWN18; SSMN16) of this classic method for geographic information sciences, the kernel version models the gradual dampening of an anomalous event as a data point moves further from the epicenter. As we will see, this modeling change allows for more statistical power and faster algorithms (independent of data set size), in addition to being more realistic.

Refer to caption
Figure 1. An anomaly affecting 8% of data (with rate parameters p=.9, q=.5) under the Bernoulli Kernel Spatial Scan Statistic model on geo-locations of crimes in Philadelphia.

To review, spatial scan statistics consider a baseline or population data set B where each point x\in B has an annotated value m(x). In the simplest case, this value is binary (1 = person has cancer; 0 = person does not have cancer), and it is useful to define the measured set M\subset B as M=\{x\in B\mid m(x)=1\}. Then the typical goal is to identify a region where there are significantly more measured points than one would expect from the baseline data B. To prevent overfitting (e.g., gerrymandering), the typical formulation fixes a set of potential anomalous regions \mathcal{R} induced by a family of geometric shapes: disks (Kul97), axis-aligned rectangles (NM04), ellipses (Kul7.0), halfspaces (MP18a), and others (Kul7.0). Then given a statistical discrepancy function \Phi(R)=\phi(R(M),R(B)), where typically R(M)=\frac{|R\cap M|}{|M|} and R(B)=\frac{|R\cap B|}{|B|}, the spatial scan statistic is

\Phi^{*}=\max_{R\in\mathcal{R}}\Phi(R).

And the hypothesized anomalous region is R^{*}=\mathrm{argmax}_{R\in\mathcal{R}}\Phi(R), so \Phi^{*}=\Phi(R^{*}). Conveniently, by choosing a fixed set of shapes and having a fixed baseline set B, this combinatorially limits the set of all possible regions that need to be considered (when computing the \max operator), since, for instance, there can be only O(|B|^{3}) distinct disks which each contain a different subset of points. This allows for tractable (AMPVW06) (and in some cases very scalable (MP18b)) combinatorial and statistical algorithms which can (approximately) search over the class of all shapes from that family. Alternatively, the most popular software, SatScan (Kul7.0), uses a fixed center set of possible epicenters of events, for simpler and more scalable algorithms.

However, the discreteness of these models has a strange modeling side effect. Consider the shape model of disks \mathcal{D}, where each disk D\in\mathcal{D} is defined by a center c\in\mathbb{R}^{d} and a radius r>0 as D=\{x\in\mathbb{R}^{d}\mid\|x-c\|\leq r\}. Then solving a spatial scan statistic over this family \mathcal{D} would yield an anomaly defined by a disk D; that is, all points x\in B\cap D are counted entirely inside the anomalous region, and all points x^{\prime}\in B\setminus D are considered entirely outside the anomalous region. If this region is modeling a regional event, say the area around a potentially hazardous chemical leak suspected of causing cancer in nearby residents, then the hope is that the center c identifies the location of the leak, and r determines the radius of impact. However, this implies that data points x\in B very close to the epicenter c are affected with the same likelihood as those a distance of almost but not quite r away. And those data points x^{\prime}\in B that are slightly further than r away from c are not affected at all. In reality, the data points closest to the epicenter should be more likely to be affected than those further away, even if they are within some radius r, and data points just beyond that radius should still have some, but a lessened, effect as well.

Introducing the Kernel Spatial Scan Statistic. The main modeling change of the kernel spatial scan statistic (KSSS) is to prescribe these diminishing effects of spatial anomalies as data points become further from the epicenter of a proposed event. From a modeling perspective, given the way we described the problem above, the generalization is quite natural: we simply replace the shape class \mathcal{R} (e.g., the family of all disks \mathcal{D}) with a class of non-binary continuous functions \mathcal{K}. The most natural choice (which we focus on) are kernels, and in particular Gaussian kernels. We define each K\in\mathcal{K} by a center c and a bandwidth r as K(x)=\exp(-\|x-c\|^{2}/r^{2}). This provides a real value K(x)\in[0,1], in fact a probability, for each x\in B. We interpret this as: given an anomaly model K, then for each x\in B the value K(x) is the probability that the rate of the measured event (chance that a person gets cancer) is increased.
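As a small, concrete illustration of this definition (this sketch is ours, in Python with NumPy; the function name and toy data are not from the paper), evaluating such a kernel over a baseline point set looks as follows.

import numpy as np

def gaussian_kernel(B, c, r):
    # K(x) = exp(-||x - c||^2 / r^2) for every point x in B.
    # B: (n, 2) array of baseline locations; c: (2,) center; r: bandwidth.
    sq_dists = np.sum((B - c) ** 2, axis=1)
    return np.exp(-sq_dists / r ** 2)

# toy usage: five points, anomaly centered at the origin with bandwidth 1
B = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 3.0]])
K = gaussian_kernel(B, c=np.array([0.0, 0.0]), r=1.0)  # values decay with distance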

Related Work on SSS and Kernels. There have been many papers on computing various range spaces for SSS (NM04; Tango2005; ACTW18; ICDMS2015), where a geometric region defines the set of points included in the region, for various regions such as disks, ellipses, rings, and rectangles. Other work has combined SSS and kernels as a way to penalize far away points, but still used binary regions, and only over a set of predefined starting points (DA04; DCTB2007; Patil2004). Another method (FNN17) uses a kernel SVM boundary to define a region; this provides a regularized, but otherwise very flexible class of regions, but the regions are still binary. A third method (JRSA89) proposes an inhomogeneous Poisson process model for the spatial relationship between a measured cancer rate and exposure to a single specified region (from an industrial pollution source). This models the measured rate similarly to our work, but does not search over a family of regions, and does not model a background rate.

Our contributions, and their challenges. We formally derive and discuss in more depth the KSSS in Section 2, and contrast this with related work on SSS. While the above intuition is (we believe) quite natural, and seems rather direct, a complication arises: the contributions of the data points towards the statistical discrepancy function (derived as a log-likelihood ratio) are no longer independent. This implies that K(B) and K(M) can no longer in general be scalar values (as they were with R(B) and R(M)); instead we need to pass in sets. Moreover, this means that unlike with traditional binary ranges, the value of \Phi no longer in general has a closed form; in particular the optimal rate parameters in the alternative hypothesis do not have a closed form. We circumvent this by describing a simple convex cost function for the rate parameters. And it turns out, these can then be effectively solved for with a few steps of gradient descent for each potential choice of K within the main scanning algorithm.

Our paper then focuses on the most intuitive Bernoulli model for how measured values are generated, but the procedures we derive apply similarly to the Poisson and Gaussian models we also derive. For instance, it turns out that the Gaussian-model kernel statistical discrepancy function has a closed form.

The second major challenge is that there is no longer a combinatorial limit on the number of distinct ranges to consider. There is an infinite set of potential centers c\in\mathbb{R}^{d} to consider, even with a fixed bandwidth, and each could correspond to a different \Phi(K) value. However, there is a Lipschitz property on \Phi(K) as a function of the choice of center c; that is, if we change c to c^{\prime} by a small amount, then we can upper bound the change from \Phi(K_{c}) to \Phi(K_{c^{\prime}}) by a linear function of \|c-c^{\prime}\|. This implies a finite resolution is needed on the set of center points: we can lay down a fixed resolution grid, and only consider those grid points. Notably, this property does not hold for the combinatorial SSS version, as a direct effect of the problematic boundary issue of the binary ranges.

We combine the insight of this Lipschitz property with the gradient descent used to evaluate \Phi(K_{c}) for a set of center points c, in a new algorithm KernelGrid. We then develop two improvements to this basic algorithm which make the grid adaptive in resolution, and round the effect of points far from the epicenter; embodied in our algorithm KernelFast, these considerably increase the efficiency of computing the statistic (by 30x) without significantly decreasing accuracy (with provable guarantees). Moreover, we create a coreset B_{\varepsilon} of the full data set B, with size independent of the original size |B|, that provably bounds the worst case error \varepsilon.

We empirically demonstrate the efficiency, scalability, and accuracy of these new KSSS algorithms. In particular, we show the KSSS has superior statistical power compared to traditional SSS algorithms, and exceeds the efficiency of even the heavily optimized version of those combinatorial Disk SSS algorithms.

2. Derivation of the Kernel Spatial Scan Statistic

In this section we provide a general definition for a spatial scan statistic, and then extend this to the kernel version. It turns out there are two reasonable variations of such statistics, which we call the continuous and binary settings. In each case, we then define the associated kernelized statistical discrepancy function \Phi under the Bernoulli (\Phi^{\textsf{Be}}), Poisson (\Phi^{\textsf{P}}), and Gaussian (\Phi^{\textsf{G}}) models. The two settings coincide in the Bernoulli model, but differ in the other two models.

General derivation. A spatial scan statistic considers a spatial data set B\subset\mathbb{R}^{d}, each data point x\in B endowed with a measured value m(x), and a family of measurement regions \mathcal{R}. Each region R\in\mathcal{R} specifies the way a data point x\in B is associated with the anomaly (e.g., affected or not affected). Then given a statistical discrepancy function \Phi which measures the anomalousness of a region R, the statistic is \max_{R\in\mathcal{R}}\Phi(R). To complete the definition, we need to specify \mathcal{R} and \Phi, which, in the way we define \mathcal{R}, are more intertwined than previously realized.

To define \Phi we assume a statistical model for how the values m(x) are realized, where data points x affected by the anomaly have m(x) generated at rate p and those unaffected generated at rate q.

Then we can define a null hypothesis that a potential anomalous region R has no effect on the rate parameters, so p=q; and an alternative hypothesis that the region does have an effect and (w.l.o.g.) p>q.

For both the null and alternative hypotheses, and a region R\in\mathcal{R}, we can then define a likelihood, denoted L_{0}(q) and L(p,q,R), respectively. The spatial scan statistic is then the log-likelihood ratio (LRT)

\Phi(R)=\log\left(\frac{\max_{p,q}L(p,q,R)}{\max_{q}L_{0}(q)}\right)=\left(\max_{p,q}\log L(p,q,R)\right)-\left(\max_{q}\log L_{0}(q)\right).

Now the main distinction with the kernel spatial scan statistic is that \mathcal{R} is specified with a family of kernels \mathcal{K}, so that each K\in\mathcal{K} specifies a probability K(x) that x\in B is affected by the anomaly. This is consistent with the traditional spatial scan statistic (e.g., with \mathcal{R} as disks \mathcal{D}), where this probability was always 0 or 1. Now this probability can be any continuous value. Then we can express the mean rate/intensity g(x) for each of the underlying distributions from which m(x) is generated, as a function of K(x), p, and q. Two natural and distinct models arise for kernel spatial scan statistics.

Continuous Setting. In the continuous setting, which will be our default model, we directly model the mean rate g(x) as a convex combination between p and q as,

g(x)=K(x)p+(1-K(x))q.

Thus each x (with nonzero K(x) value) has a slightly different rate. Considering the potentially hazardous chemical leak suspected of causing cancer, this setting implies that residents who live closer to the center of the potential leak (where K(x) is larger) would be affected more (have an elevated rate) compared to residents who live farther away. The kernel function K(x) models a decay effect from a center, and smooths out the effect of distance.

Binary Setting. In the second setting, the binary setting, the mean rate \breve{g}(x) is defined

\breve{g}(x)=\begin{cases}p&\text{ w.p. }K(x)\\ q&\text{ w.p. }(1-K(x)).\end{cases}

To clarify notation, each part of the model associated with this setting (e.g., \breve{g}) will carry a breve to distinguish it from the continuous setting. In this case, as with the traditional SSS, the rate parameter for each x is either p or q, and cannot take any other value. However, this rate assignment is not deterministic: the rate p is assigned with probability K(x), so points closer to the epicenter (larger K(x)) have a higher probability of being assigned the rate p. The rate \breve{g}(x) for each x is a mixture model with known mixture weight determined by K(x).

The null models. We show that the choice of model, binary setting vs. continuous setting, does not change the null hypothesis (e.g., \ell_{0}=\breve{\ell}_{0}).

2.1. Bernoulli

Under the Bernoulli model, the measured value m(x)\in\{0,1\}, and these values are generated independently. Consider a 1 value indicating that someone is diagnosed with cancer. Then an anomalous region may be associated with a leaky chemical plant, where the residents near the plant have an elevated rate of cancer p, whereas the background population may have a lower rate q of cancer. That is, cancer occurs through natural mutation at a rate q, but if exposed to certain chemicals, there is another mechanism to get cancer that occurs at rate p-q (for a total rate of q+(p-q)=p). Under the binary setting, any exposure to the chemical triggers this secondary mechanism, the chance of exposure is modeled as proportional to K(x), and the rate at x is modeled well by \breve{g}(x). Alternatively, the rate of the secondary mechanism may increase as the amount of exposure to the chemical increases (those living closer are exposed to more chemicals), with rate g(x) modeled in the continuous setting. These are both potentially the correct biological model, so we analyze both of them.

For this model we can define two subsets of B as M=\{x\in B\mid m(x)=1\} and B\setminus M=\{x\in B\mid m(x)=0\}.

In either setting the null likelihood is defined

L_{0}^{\textsf{Be}}(q)=\breve{L}_{0}^{\textsf{Be}}(q)=\prod_{x\in M}q\prod_{x\in B\setminus M}(1-q),

then

\ell_{0}^{\textsf{Be}}(q)=\breve{\ell}_{0}^{\textsf{Be}}(q)=\sum_{x\in M}\log(q)+\sum_{x\in B\setminus M}\log(1-q),

which is maximized over q at q=|M|/|B| as,

\ell_{0}^{\textsf{Be}^{*}}=\breve{\ell}_{0}^{\textsf{Be}^{*}}=|M|\log\frac{|M|}{|B|}+(|B|-|M|)\log\left(1-\frac{|M|}{|B|}\right).

The continuous setting. We first derive the continuous setting \Phi^{\textsf{Be}}, starting with the likelihood under the alternative hypothesis.

This is a product over the rates of the measured and baseline points:

L^{\textsf{Be}}(p,q,K)=\prod_{x\in M}g(x)\cdot\prod_{x\in B\setminus M}(1-g(x))

and so

\ell^{\textsf{Be}}(p,q,K)=\log(L^{\textsf{Be}}(p,q,K))=\sum_{x\in M}\log g(x)+\sum_{x\in B\setminus M}\log(1-g(x))
=\sum_{x\in M}\log(pK(x)+q(1-K(x)))+\sum_{x\in B\setminus M}\log(1-pK(x)-q(1-K(x))).

Unfortunately, we know of no closed form for the maximum of \ell^{\textsf{Be}}(p,q,K) over the choice of p,q, and therefore this form cannot be simplified further than

{\Phi^{\textsf{Be}}}^{*}(K)=\max_{p,q}\ell^{\textsf{Be}}(p,q,K)-{\ell_{0}^{\textsf{Be}}}^{*}.
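To make these quantities concrete, the following sketch (ours, in NumPy; the helper names are illustrative) evaluates \ell^{\textsf{Be}}(p,q,K), the maximized null term {\ell_{0}^{\textsf{Be}}}^{*}, and their difference for a fixed choice of p,q; the scan statistic then maximizes this difference over p,q (and over kernels K).

import numpy as np

def bernoulli_alt_loglik(p, q, K, m):
    # ell^Be(p,q,K) with g(x) = p K(x) + q (1 - K(x)); m is the 0/1 array m(x).
    g = p * K + q * (1.0 - K)
    return np.sum(m * np.log(g) + (1 - m) * np.log(1.0 - g))

def bernoulli_null_loglik(m):
    # Maximized null log-likelihood ell_0^Be*, with q = |M|/|B|.
    q0 = m.mean()
    n, n_m = len(m), m.sum()
    return n_m * np.log(q0) + (n - n_m) * np.log(1.0 - q0)

def phi_bernoulli(p, q, K, m):
    # Phi^Be for a fixed (p, q); the statistic maximizes this over p > q.
    return bernoulli_alt_loglik(p, q, K, m) - bernoulli_null_loglik(m)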

The binary setting. We continue by deriving the binary setting \breve{\Phi}^{\textsf{Be}}, starting with

\breve{L}^{\textsf{Be}}(p,q,K)=\prod_{x\in B}\left(p^{m(x)}(1-p)^{1-m(x)}K(x)+q^{m(x)}(1-q)^{1-m(x)}(1-K(x))\right),

and so

\breve{\ell}^{\textsf{Be}}(p,q,K)=\log(\breve{L}^{\textsf{Be}}(p,q,K))
=\sum_{x\in B}\log\left(p^{m(x)}(1-p)^{1-m(x)}K(x)+q^{m(x)}(1-q)^{1-m(x)}(1-K(x))\right).

Similarly to the continuous setting, there is no closed form for the maximum of \log(\breve{L}^{\textsf{Be}}(p,q,K)) over choices of p,q, so we write the form below.

\breve{\Phi}^{\textsf{Be}^{*}}(K)=\max_{p,q}\breve{\ell}^{\textsf{Be}}(p,q,K)-{\ell_{0}^{\textsf{Be}}}^{*}

Equivalence. It turns out, under the Bernoulli model, these two settings have equivalent statistics to optimize.

Lemma 2.1.

\breve{\ell}^{\textsf{Be}}(p,q,K) and \ell^{\textsf{Be}}(p,q,K) are exactly the same, hence \breve{\Phi}^{\textsf{Be}^{*}}={\Phi^{\textsf{Be}}}^{*}, which implies that the Bernoulli models under the two settings are equivalent to each other.

Proof.

We simply expand the binary setting as follows.

\breve{\ell}^{\textsf{Be}}(p,q,K)
=\sum_{x\in M}\log(p^{m(x)}(1-p)^{1-m(x)}K(x)+q^{m(x)}(1-q)^{1-m(x)}(1-K(x)))
+\sum_{x\in B\setminus M}\log(p^{m(x)}(1-p)^{1-m(x)}K(x)+q^{m(x)}(1-q)^{1-m(x)}(1-K(x)))
=\sum_{x\in M}\log(pK(x)+q(1-K(x)))+\sum_{x\in B\setminus M}\log(1-(p-q)K(x)-q)
=\sum_{x\in M}\log(g(x))+\sum_{x\in B\setminus M}\log(1-g(x))
=\ell^{\textsf{Be}}(p,q,K)

Since \breve{\ell}^{\textsf{Be}}(p,q,K)=\ell^{\textsf{Be}}(p,q,K) and \breve{\ell}_{0}^{\textsf{Be}^{*}}={\ell_{0}^{\textsf{Be}}}^{*}, then \breve{\Phi}^{\textsf{Be}^{*}}={\Phi^{\textsf{Be}}}^{*}. ∎

2.2. Gaussian

The Gaussian model can be used to analyze spatial datasets with continuous values m(x) (e.g., temperature, rainfall, or income), which we assume vary with a normal distribution with a fixed known standard deviation \sigma. Under this model, both the continuous and binary settings are again well motivated.

Consider an insect infestation, as it affects agriculture. Here we assume fields of crops are measured at discrete locations B, and each has an observed yield rate m(x), which under the null model varies normally around a value q. In the continuous setting, the yield rate at x\in B is affected proportionally to K(x), depending on how close it is to the epicenter. This may for instance model that fewer insects reach fields further from the epicenter, and the yield rate is affected relative to the number of insects that reach the field x. In the binary setting, it may be that if insects reach a field, then they dramatically change the yield rate (e.g., they eat and propagate until almost all of the crops are eaten). In the latter scenario, the correct model is the binary one, with a mixture model of two rates governed by K(x), the closeness to the epicenter.

In either setting the null likelihood is defined as,

L_{0}^{\textsf{G}}(q)=\breve{L}_{0}^{\textsf{G}}(q)=\prod_{x\in B}\exp\left(-\frac{(m(x)-q)^{2}}{2\sigma^{2}}\right),

then

\ell_{0}^{\textsf{G}}(q)=\breve{\ell}_{0}^{\textsf{G}}(q)=\sum_{x\in B}-\frac{(m(x)-q)^{2}}{2\sigma^{2}},

which is maximized over q at q=\frac{\sum_{x\in B}m(x)}{|B|}=\hat{m} as,

\ell_{0}^{\textsf{G}^{*}}=\breve{\ell}_{0}^{\textsf{G}^{*}}=\frac{-1}{2\sigma^{2}}\sum_{x\in B}(\hat{m}-m(x))^{2}.

The continuous setting. We first derive the continuous setting \Phi^{\textsf{G}}; terms free of p and q are treated as constants and ignored. We start with

L^{\textsf{G}}(p,q,K)=\prod_{x\in B}\exp\left(-\frac{(m(x)-g(x))^{2}}{2\sigma^{2}}\right)

and so

\ell^{\textsf{G}}(p,q,K)=\log(L^{\textsf{G}}(p,q,K))=\sum_{x\in B}-\frac{(m(x)-g(x))^{2}}{2\sigma^{2}}
=\sum_{x\in B}-\frac{(m(x)-pK(x)-q(1-K(x)))^{2}}{2\sigma^{2}}.

Fortunately, there is a closed form for the maximum of \ell^{\textsf{G}}(p,q,K) over the choice of p,q, obtained by setting \frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}p}=0 and \frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}q}=0. Hence we arrive at the closed form of the Gaussian kernel statistical discrepancy given by the theorem below.

Theorem 2.2.

The Gaussian kernel statistical discrepancy function is

\Phi^{\textsf{G}^{*}}(K)=\frac{1}{2\sigma^{2}}\left(\sum_{x\in B}(\bar{m}-m(x))^{2}-\sum_{x\in B}(m(x)-\hat{p}K(x)-\hat{q}(1-K(x)))^{2}\right)

where \bar{m}=\frac{1}{|B|}\sum_{x\in B}m(x), \hat{p}=\frac{K_{\pm}K_{-m}-K_{m}K_{-2}}{K_{\pm}^{2}-K_{2}K_{-2}}, \hat{q}=\frac{K_{m}K_{\pm}-K_{2}K_{-m}}{K_{\pm}^{2}-K_{2}K_{-2}}, using the following terms: K_{m}=\sum_{x\in B}K(x)m(x), K_{2}=\sum_{x\in B}K(x)^{2}, K_{\pm}=\sum_{x\in B}K(x)(1-K(x)), K_{-m}=\sum_{x\in B}m(x)(1-K(x)), and K_{-2}=\sum_{x\in B}(1-K(x))^{2}.

Proof.

For the alternative hypothesis, the log-likelihood is

\ell^{\textsf{G}}(p,q,K)=\frac{-1}{2\sigma^{2}}\sum_{x\in B}(m(x)-g(x))^{2}.

The optimal values of p,q minimize

-\ell^{\textsf{G}}(p,q,K)=\frac{1}{2\sigma^{2}}\sum_{x\in B}(m(x)-pK(x)-q(1-K(x)))^{2}.

By setting the partial derivatives with respect to p and q of -\ell^{\textsf{G}}(p,q,K) equal to 0, we have,

\frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}p}=\sum_{x\in B}K(x)[m(x)-K(x)p-(1-K(x))q]=0,

and

\frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}q}=\sum_{x\in B}(1-K(x))[m(x)-K(x)p-(1-K(x))q]=0,

and these two can be further simplified to,

\frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}p}=\sum_{x\in B}K(x)m(x)-p\sum_{x\in B}K(x)^{2}-q\sum_{x\in B}K(x)(1-K(x))=0,

and

\frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}q}=\sum_{x\in B}(1-K(x))m(x)-p\sum_{x\in B}(1-K(x))K(x)-q\sum_{x\in B}(1-K(x))^{2}=0.

Replacing these sums with the notation defined in the theorem,

K_{m}-pK_{2}-qK_{\pm}=0,

and

K_{-m}-pK_{\pm}-qK_{-2}=0.

Then we can solve for the optimal values of p and q as p=\frac{K_{\pm}K_{-m}-K_{m}K_{-2}}{K_{\pm}^{2}-K_{2}K_{-2}}=\hat{p} and q=\frac{K_{m}K_{\pm}-K_{2}K_{-m}}{K_{\pm}^{2}-K_{2}K_{-2}}=\hat{q}.

Hence we have the closed form

\ell^{\textsf{G}^{*}}=\max_{p,q}\ell^{\textsf{G}}(p,q,K)=\frac{-1}{2\sigma^{2}}\sum_{x\in B}(m(x)-\hat{p}K(x)-\hat{q}(1-K(x)))^{2}. ∎
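This closed form translates directly into a short computation; the following sketch (ours, in NumPy) computes \hat{p}, \hat{q}, and the resulting discrepancy for a single kernel K, following the terms defined in Theorem 2.2.

import numpy as np

def gaussian_discrepancy(K, m, sigma=1.0):
    # Closed-form Phi^G*(K): K is the array K(x), m the measured values m(x).
    Km  = np.sum(K * m)            # K_m
    K2  = np.sum(K ** 2)           # K_2
    Kpm = np.sum(K * (1.0 - K))    # K_{+-}
    Knm = np.sum(m * (1.0 - K))    # K_{-m}
    Kn2 = np.sum((1.0 - K) ** 2)   # K_{-2}
    det = Kpm ** 2 - K2 * Kn2
    p_hat = (Kpm * Knm - Km * Kn2) / det
    q_hat = (Km * Kpm - K2 * Knm) / det
    alt_rss  = np.sum((m - p_hat * K - q_hat * (1.0 - K)) ** 2)
    null_rss = np.sum((m.mean() - m) ** 2)
    # larger values indicate a better fit of the alternative over the null
    return (null_rss - alt_rss) / (2.0 * sigma ** 2), p_hat, q_hat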

The binary setting. Now we derive the binary setting \breve{\Phi}^{\textsf{G}}, starting with

\breve{L}^{\textsf{G}}(p,q,K)=\prod_{x\in B}\exp\left(-\frac{(m(x)-p)^{2}}{2\sigma^{2}}\right)K(x)+\exp\left(-\frac{(m(x)-q)^{2}}{2\sigma^{2}}\right)(1-K(x))

and so,

\breve{\ell}^{\textsf{G}}(p,q,K)
=\sum_{x\in B}\log\left(\exp\left(-\frac{(m(x)-p)^{2}}{2\sigma^{2}}\right)K(x)+\exp\left(-\frac{(m(x)-q)^{2}}{2\sigma^{2}}\right)(1-K(x))\right).

In contrast to the continuous setting, we know of no closed form for the maximum of \breve{\ell}^{\textsf{G}}(p,q,K) over p and q. Hence,

\breve{\Phi}^{\textsf{G}^{*}}(K)=\max_{p,q}\breve{\ell}^{\textsf{G}}(p,q,K)-{\ell_{0}^{\textsf{G}}}^{*}.

Equivalence. Unlike the Bernoulli model, the Gaussian models under the binary setting and the continuous setting are not equivalent to each other. Under the continuous setting,

m(x)\sim\mathcal{N}(g(x),\sigma),

however, under the binary setting, each data point follows a two-component Gaussian mixture model with mixture weights K(x) and 1-K(x), and so it is no longer a Gaussian distribution.

2.3. Poisson

In the Poisson model the measured value m(x) is a non-negative integer, m(x)\in\mathbb{Z}_{+}. This can for instance model the number of check-ins or comments m(x) posted at each geo-located business x\in B (a proxy, for instance, for the number of customers). An event, e.g., a festival, protest, or other large impromptu gathering, could be modeled spatially by a kernel K, and it affects the rates at each x in the two different settings.

In the continuous setting, the closer a restaurant is to the center of the event (modeled by K(x)), the more its usual number of check-ins (modeled by q) will trend towards p. On the other hand, in the binary setting, only certain businesses are affected (e.g., a coffee shop, but not a fancy dinner location), but if a business is affected, its rate is elevated all the way from q to p. Perhaps advertising at a festival encouraged people to patronize certain restaurants, or a protest encouraged them to give bad reviews to certain nearby restaurants, but not others. Hence, these two settings relate to two different ways an event could affect Poisson check-in or comment rates.

In either setting, the null likelihood is defined as,

L_{0}^{\textsf{P}}(q)=\breve{L}_{0}^{\textsf{P}}(q)=\prod_{x\in B}\frac{e^{-q}q^{m(x)}}{m(x)!},

then

\ell_{0}^{\textsf{P}}(q)=\breve{\ell}_{0}^{\textsf{P}}(q)=\sum_{x\in B}\log\left(\frac{e^{-q}q^{m(x)}}{m(x)!}\right)
=\sum_{x\in B}-q+m(x)\log(q)-\log(m(x)!),

which is maximized over q at q=\frac{\sum_{x\in B}m(x)}{|B|}=\hat{m} as,

\ell_{0}^{\textsf{P}^{*}}=\breve{\ell}_{0}^{\textsf{P}^{*}}=\sum_{x\in B}-\hat{m}+m(x)\log(\hat{m})-\log(m(x)!)
=-|B|\hat{m}+\sum_{x\in B}\left(m(x)\log(\hat{m})-\log(m(x)!)\right).

The continuous setting. We first derive the continuous setting \Phi^{\textsf{P}}, starting with,

L^{\textsf{P}}(p,q,K)=\prod_{x\in B}\frac{e^{-g(x)}g(x)^{m(x)}}{m(x)!}
=\prod_{x\in B}\frac{e^{-pK(x)-q(1-K(x))}(pK(x)+q(1-K(x)))^{m(x)}}{m(x)!},

so

\ell^{\textsf{P}}(p,q,K)
=\sum_{x\in B}-g(x)+m(x)\log(g(x))-\log(m(x)!)
=\sum_{x\in B}-(pK(x)+q(1-K(x)))+m(x)\log(pK(x)+q(1-K(x)))-\log(m(x)!).

There is no closed form for the maximum of \ell^{\textsf{P}}(p,q,K) over the choice of p and q, and hence

\Phi^{\textsf{P}^{*}}(K)=\max_{p,q}\ell^{\textsf{P}}(p,q,K)-\ell_{0}^{\textsf{P}^{*}}.

The binary setting. We continue by deriving the binary setting \breve{\Phi}^{\textsf{P}^{*}}, starting with

\breve{L}^{\textsf{P}}(p,q,K)=\prod_{x\in B}\frac{e^{-q}q^{m(x)}}{m(x)!}(1-K(x))+\frac{e^{-p}p^{m(x)}}{m(x)!}K(x),

so

\breve{\ell}^{\textsf{P}}(p,q,K)=\sum_{x\in B}\log\left(\frac{e^{-q}q^{m(x)}}{m(x)!}(1-K(x))+\frac{e^{-p}p^{m(x)}}{m(x)!}K(x)\right).

As in the continuous setting, there is no closed form for the maximum of \breve{\ell}^{\textsf{P}}(p,q,K), hence,

\breve{\Phi}^{\textsf{P}^{*}}(K)=\max_{p,q}\breve{\ell}^{\textsf{P}}(p,q,K)-\breve{\ell}_{0}^{\textsf{P}^{*}}.

Equivalence. In the Poisson model the binary setting and the continuous setting are not equivalent to each other. Under the continuous setting,

m(x)\sim\mathsf{Poi}(g(x)),

however, under the binary setting m(x) follows a Poisson mixture model with mixture weights K(x) and 1-K(x), which is no longer a Poisson distribution.

3. Computing the Approximate KSSS

The traditional SSS can combinatorially search over all disks (SSSS; MP18a; Kul97) to solve for or approximate \max_{D\in\mathcal{D}}\Phi(D), evaluating \Phi(D) exactly. Our new KSSS algorithms will instead search over a grid of possible centers c, and approximate \Phi(K_{c}) with gradient descent, yet will achieve the same sort of strong error guarantees as the combinatorial versions. Improvements will allow for adaptive gridding, pruning of far points, and sampling.

3.1. Approximating \Phi with GD

We cannot directly calculate \Phi(K)=\max_{p,q}\Phi_{p,q}(K), since it does not have a closed form. Instead we run gradient descent \mathsf{GradDesc} over -\Phi_{p,q}(K_{c}) on p,q for a fixed c. Since we have shown \Phi_{p,q}(K) is convex over p,q this will converge, and since Lemma A.8 bounds its Lipschitz constant by 2|B| it will converge quickly. In particular, from starting points p_{0},q_{0}, after s steps we can bound

|\Phi_{p^{*},q^{*}}(K)-\Phi_{p_{s},q_{s}}(K)|\leq\frac{|B|\|(p^{*},q^{*})-(p_{0},q_{0})\|}{s},

for the found rate parameters p_{s},q_{s}. Since 0\leq p,q\leq 1, after s_{\varepsilon}=|B|/\varepsilon steps we are guaranteed to have |\Phi_{p^{*},q^{*}}(K)-\Phi_{p_{s_{\varepsilon}},q_{s_{\varepsilon}}}(K)|\leq\varepsilon. We always initialize this procedure on \Phi_{p,q}(K_{c}) with the \hat{p},\hat{q} found on a nearby K_{c^{\prime}}, and as a result found that running s=3 or 4 steps is sufficient. Each step of gradient descent takes O(|B|) time to compute the gradient.
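A minimal sketch (ours, in NumPy) of this inner loop for the Bernoulli model: a few clipped gradient ascent steps on (p,q), which in the full algorithm are warm-started from the optimum found at a nearby center.

import numpy as np

def grad_desc_phi(K, m, p0=0.6, q0=0.5, steps=4, tol=1e-3):
    # Approximate max_{p,q} ell^Be(p,q,K) with a few ascent steps on (p, q).
    # K: kernel values K(x); m: 0/1 measured values m(x).
    n = len(m)
    p, q = p0, q0
    step = 1.0 / (2.0 * n)  # conservative step given the O(|B|) Lipschitz bound
    for _ in range(steps):
        g = p * K + q * (1.0 - K)
        dp = np.sum(m * K / g - (1 - m) * K / (1.0 - g))
        dq = np.sum(m * (1.0 - K) / g - (1 - m) * (1.0 - K) / (1.0 - g))
        p = float(np.clip(p + step * dp, tol, 1.0 - tol))
        q = float(np.clip(q + step * dq, tol, 1.0 - tol))
    g = p * K + q * (1.0 - K)
    ell = np.sum(m * np.log(g) + (1 - m) * np.log(1.0 - g))
    return p, q, ell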

3.2. Gridding and Pruning

Computing \Phi(K) on every center in \mathbb{R}^{2} is impossible, but Lemma A.4 shows that for \hat{c} close to the true maximizer c^{*}, \Phi(K_{\hat{c}}) will approximate the true maximum. From Lemma A.4 we have |\Phi(K_{\hat{c}})-\Phi(K_{c^{*}})|\leq\frac{1}{r}\sqrt{\frac{8}{e}}\|\hat{c}-c^{*}\|. To get a bound of |\Phi(K_{\hat{c}})-\Phi(K_{c^{*}})|\leq\varepsilon we need \|\hat{c}-c^{*}\|\frac{1}{r}\sqrt{\frac{8}{e}}\leq\varepsilon, or rearranging, \|\hat{c}-c^{*}\|\leq\varepsilon r\sqrt{\frac{e}{8}}. Placing center points on a regular grid with side length \tau_{\varepsilon}=\varepsilon r\sqrt{\frac{e}{4}} will ensure that a grid center point is close enough to the true maximizer.

Assume that B\subset\Omega_{\Lambda}\subset\mathbb{R}^{2}, where w.l.o.g. \Omega_{\Lambda}=[0,L]\times[0,L], and \Lambda=L/r is a unitless resolution parameter which represents the ratio of the domain size to the scale of the anomalies. Next define a regular orthogonal grid G_{\varepsilon} over \Omega_{\Lambda} at resolution \tau_{\varepsilon}. This contains |G_{\varepsilon}\cap\Omega_{\Lambda}|=\frac{\Lambda^{2}}{\varepsilon^{2}}\frac{4}{e} points. We compute the scan statistic \Phi(K_{c}) choosing as c each point in G_{\varepsilon}\cap\Omega_{\Lambda}. This algorithm, denoted KernelGrid and shown in Algorithm 1, can be seen to run in O(|G_{\varepsilon}\cap\Omega_{\Lambda}|\cdot|B|s_{\varepsilon})=O(\frac{\Lambda^{2}}{\varepsilon^{2}}|B|s_{\varepsilon}) time, using s_{\varepsilon} iterations of gradient descent (in practice s_{\varepsilon}=4).

Lemma 3.1.

KernelGrid(B,\varepsilon,\Omega_{\Lambda}) returns \Phi(K_{\hat{c}}) for a center \hat{c} so that |\max_{K_{c}\in\mathcal{K}_{r}}\Phi(K_{c})-\Phi(K_{\hat{c}})|\leq\varepsilon in time O(\frac{\Lambda^{2}}{\varepsilon^{2}}|B|s_{\varepsilon}), which in the worst case is O(\frac{\Lambda^{2}}{\varepsilon^{3}}|B|^{2}).

Algorithm 1 KernelGrid(B,\varepsilon,\Omega_{\Lambda})
 Initialize \Phi=0; define G_{\varepsilon,\Lambda}=G_{\varepsilon}\cap\Omega_{\Lambda}
for c\in G_{\varepsilon,\Lambda} do
  \Phi_{c}=\mathsf{GradDesc}(\Phi_{p,q}(K_{c})) over p,q on B
  if (\Phi_{c}>\Phi) then \Phi=\Phi_{c}
 return \Phi
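In Python, the pseudocode above could look roughly as follows (our sketch, reusing the hypothetical helpers gaussian_kernel and grad_desc_phi from the earlier sketches). Since \ell and \Phi differ only by the constant {\ell_{0}}^{*}, maximizing \ell over centers finds the same \hat{c}.

import numpy as np

def kernel_grid(B, m, r, eps, bounds):
    # Scan a regular grid of centers at resolution tau_eps = eps * r * sqrt(e)/2.
    (x0, x1), (y0, y1) = bounds
    tau = eps * r * np.sqrt(np.e) / 2.0
    best, best_c, p, q = -np.inf, None, 0.6, 0.5
    for cx in np.arange(x0, x1 + tau, tau):
        for cy in np.arange(y0, y1 + tau, tau):
            K = gaussian_kernel(B, np.array([cx, cy]), r)
            p, q, ell = grad_desc_phi(K, m, p0=p, q0=q)  # warm start from last center
            if ell > best:
                best, best_c = ell, (cx, cy)
    # subtract the (c-independent) null term from best to report Phi at best_c
    return best, best_c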

Adaptive Gridding. We next adjust the grid resolution based on the density of B. We partition the \Omega_{\Lambda} domain with a coarse grid H_{\varepsilon} with side length 2r_{\text{max}} (from Lemma A.3). For a cell \gamma\in H_{\varepsilon} in this grid, let S_{\gamma} denote the 6r_{\text{max}}\times 6r_{\text{max}} region which expands the grid cell \gamma by the length of one grid cell in each direction. For any center c\in\gamma, all points x\in B which are within a distance of 2r_{\text{max}} from c must be within S_{\gamma}. Hence, by Lemma A.3 we can evaluate \Phi(K^{\prime}_{c}) for any c\in\gamma by inspecting only S_{\gamma}\cap B.

Moreover, by the local density argument in Lemma A.5, we can describe a new grid G^{\prime}_{\varepsilon,\gamma} inside of each \gamma\in H_{\varepsilon} with center separation \beta depending only on the local number of points |S_{\gamma}\cap B|. In particular, for c,c^{\prime}\in\gamma with \|c-c^{\prime}\|=\beta we have

|\Phi(K^{\prime}_{c})-\Phi(K^{\prime}_{c^{\prime}})|\leq\beta\frac{|S_{\gamma}\cap B|}{|B|}\frac{2r_{\text{max}}}{r^{2}}+\varepsilon.

To guarantee that every c\in\gamma has a center c^{\prime}\in G^{\prime}_{\varepsilon,\gamma} so that |\Phi(K_{c})-\Phi(K^{\prime}_{c^{\prime}})|\leq 2\varepsilon, we set the side length in G^{\prime}_{\varepsilon,\gamma} as

\beta_{\gamma}=\varepsilon\frac{|B|}{|B\cap S_{\gamma}|}\frac{r^{2}}{2r_{\text{max}}},

so the number of grid points in G^{\prime}_{\varepsilon,\gamma} is

|G^{\prime}_{\varepsilon,\gamma}|=\frac{4r_{\text{max}}^{2}}{\beta_{\gamma}^{2}}=\frac{r^{4}_{\text{max}}}{r^{4}}\frac{16}{\varepsilon^{2}}\frac{|B\cap S_{\gamma}|^{2}}{|B|^{2}}.

Now define G^{\prime}_{\varepsilon}=\bigcup_{\gamma\in H_{\varepsilon}}G^{\prime}_{\varepsilon,\gamma} as the union of these adaptively defined subgrids over each coarse grid cell. Its total size is

|G^{\prime}_{\varepsilon}|=\sum_{\gamma\in H_{\varepsilon}}|G^{\prime}_{\varepsilon,\gamma}|=\sum_{\gamma\in H_{\varepsilon}}\frac{r^{4}_{\text{max}}}{r^{4}}\frac{16}{\varepsilon^{2}}\frac{|B\cap S_{\gamma}|^{2}}{|B|^{2}}
=\frac{r^{4}_{\text{max}}}{r^{4}}\frac{16}{\varepsilon^{2}}\cdot\frac{1}{|B|^{2}}\sum_{\gamma\in H_{\varepsilon}}|B\cap S_{\gamma}|^{2}
=\log^{2}(|B|/\varepsilon)\frac{16}{\varepsilon^{2}}\cdot\frac{1}{|B|^{2}}\sum_{\gamma\in H_{\varepsilon}}|B\cap S_{\gamma}|^{2}
\leq\log^{2}(|B|/\varepsilon)\frac{1296}{\varepsilon^{2}},

where the last inequality follows from Cauchy-Schwarz and the fact that each point x\in B appears in 9 cells S_{\gamma}. This replaces the dependence on the domain size \Lambda^{2} in the size of the grid G_{\varepsilon} with a mere logarithmic \log^{2}(|B|/\varepsilon) dependence on |B| in G^{\prime}_{\varepsilon}. We did not minimize constants, and in practice we use significantly smaller constants.

Moreover, since this holds for each c\in\gamma, and the same gridding mechanism is applied for each \gamma\in H_{\varepsilon}, this holds for all c\in\Omega_{\Lambda}. We call the algorithm that extends Algorithm 1 to use this grid G^{\prime}_{\varepsilon} in place of G_{\varepsilon} KernelAdaptive.
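To make the per-cell spacing concrete, here is a sketch (ours) of computing \beta_{\gamma} and the candidate centers for one coarse cell; it assumes r_{\text{max}} has already been fixed as in Lemma A.3.

import numpy as np

def subgrid_for_cell(cell_origin, n_local, n_total, r, r_max, eps):
    # beta_gamma = eps * (|B| / |B ∩ S_gamma|) * r^2 / (2 r_max), inside a
    # 2 r_max x 2 r_max coarse cell; returns the candidate center locations.
    beta = eps * (n_total / max(n_local, 1)) * r ** 2 / (2.0 * r_max)
    ticks = np.arange(0.0, 2.0 * r_max, beta)
    gx, gy = np.meshgrid(cell_origin[0] + ticks, cell_origin[1] + ticks)
    return np.column_stack([gx.ravel(), gy.ravel()])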

Lemma 3.2.

KernelAdaptive(B,\varepsilon,\Omega_{\Lambda}) returns \Phi(K_{\hat{c}}) for a center \hat{c} so that |\max_{K_{c}\in\mathcal{K}_{r}}\Phi(K_{c})-\Phi(K_{\hat{c}})|\leq\varepsilon in time O(\frac{1}{\varepsilon^{2}}\log^{2}\frac{|B|}{\varepsilon}|B|s_{\varepsilon}), which in the worst case is O(\frac{1}{\varepsilon^{3}}|B|^{2}\log^{2}\frac{|B|}{\varepsilon}).

Pruning. For both gridding methods the runtime is roughly the number of centers times the O(|B|) time to compute the gradient. But via Lemma A.3 we can ignore the contribution of far away points, and thus only need the nearby points in the gradient computation.

However, this provides no worst-case asymptotic improvement in runtime for KernelGrid or KernelAdaptive, since all of B may reside in a single r_{\text{max}}\times r_{\text{max}} cell. But in the practical settings we consider, this does provide a significant speedup, as the data is usually spread over a domain many times the size of r_{\text{max}}.

We define two new methods, KernelPrune and KernelFast (shown in Algorithm 2), where the former extends the KernelGrid method and the latter extends KernelAdaptive with pruning. In particular, we note that the bounds in Lemma 3.2 hold for KernelFast.

Algorithm 2 KernelFast(B,\varepsilon,\Omega_{\Lambda})
 Initialize \Phi=0; define grid H_{\varepsilon} over \Omega_{\Lambda}
for \gamma\in H_{\varepsilon} do
  Define G^{\prime}_{\varepsilon,\gamma} adaptively on \gamma and S_{\gamma}\cap B
  for c\in G^{\prime}_{\varepsilon,\gamma} do
   \Phi_{c}=\mathsf{GradDesc}(\Phi_{p,q}(K_{c})) over p,q on pruned set B\cap S_{\gamma}
   if (\Phi_{c}>\Phi) then \Phi=\Phi_{c}
 return \Phi
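A compact sketch (ours) of this pruned adaptive loop, built on the hypothetical helpers from the previous sketches; the handling of points outside S_{\gamma} is simplified here relative to the accounting in Lemma A.3.

import numpy as np

def kernel_fast(B, m, r, r_max, eps, bounds):
    # Coarse cells of side 2*r_max; adaptive subgrid per cell; pruning to S_gamma.
    (x0, x1), (y0, y1) = bounds
    side = 2.0 * r_max
    best, best_c = -np.inf, None
    for cx0 in np.arange(x0, x1, side):
        for cy0 in np.arange(y0, y1, side):
            # S_gamma: the coarse cell padded by one cell in each direction
            mask = ((B[:, 0] >= cx0 - side) & (B[:, 0] < cx0 + 2 * side) &
                    (B[:, 1] >= cy0 - side) & (B[:, 1] < cy0 + 2 * side))
            Bs, ms = B[mask], m[mask]
            if len(Bs) == 0:
                continue
            centers = subgrid_for_cell((cx0, cy0), len(Bs), len(B), r, r_max, eps)
            for c in centers:
                K = gaussian_kernel(Bs, c, r)
                _, _, ell = grad_desc_phi(K, ms)
                if ell > best:
                    best, best_c = ell, tuple(c)
    return best, best_c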

3.3. Sampling

We can dramatically improve runtimes on large data sets by sampling a coreset B_{\varepsilon} iid from B, according to Lemma A.7. With probability 1-\delta we need |B_{\varepsilon}|=O(\frac{1}{\varepsilon^{2}}\log^{2}\frac{1}{\varepsilon}\log\frac{\kappa}{\delta}) samples, where \kappa is the number of center evaluations, and can be set to the grid size. For KernelGrid \kappa=|G_{\varepsilon}|=O(\frac{\Lambda^{2}}{\varepsilon^{2}}), and in KernelAdaptive \kappa=|G^{\prime}_{\varepsilon}|=O(\frac{1}{\varepsilon^{2}}\log^{2}\frac{|B_{\varepsilon}|}{\varepsilon})=O(\frac{1}{\varepsilon^{2}}\log^{2}\frac{1}{\varepsilon}). We restate the runtime bounds with sampling to show they are independent of |B|.

Lemma 3.3.

KernelGrid(B_{\varepsilon},\varepsilon,\Omega_{\Lambda}) and KernelPrune(B_{\varepsilon},\varepsilon,\Omega_{\Lambda}) with sample size |B_{\varepsilon}|=O(\frac{1}{\varepsilon^{2}}\log\frac{\Lambda}{\varepsilon\delta}) return \Phi(K_{\hat{c}}) for a center \hat{c} so that |\max_{K_{c}\in\mathcal{K}_{r}}\Phi(K_{c})-\Phi(K_{\hat{c}})|\leq\varepsilon in time O(\frac{s_{\varepsilon}}{\varepsilon^{4}}\log\frac{\Lambda}{\varepsilon\delta}), with probability 1-\delta. In the worst case the runtime is O(\frac{\Lambda^{2}}{\varepsilon^{7}}\log^{2}\frac{\Lambda}{\varepsilon\delta}).

Lemma 3.4.

KernelAdaptive(B_{\varepsilon},\varepsilon,\Omega_{\Lambda}) and KernelFast(B_{\varepsilon},\varepsilon,\Omega_{\Lambda}) with sample size |B_{\varepsilon}|=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\varepsilon\delta}) return \Phi(K_{\hat{c}}) for a center \hat{c} so that |\max_{K_{c}\in\mathcal{K}_{r}}\Phi(K_{c})-\Phi(K_{\hat{c}})|\leq\varepsilon in time O(\frac{s_{\varepsilon}}{\varepsilon^{4}}\log^{3}\frac{1}{\varepsilon\delta}), with probability 1-\delta. In the worst case the runtime is O(\frac{1}{\varepsilon^{7}}\log^{4}\frac{1}{\varepsilon\delta}).

3.4. Multiple Bandwidths

We next show how to choose a sequence of bandwidths such that one of them is close to the r used in any K\in\mathcal{K} (assuming a reasonable but large range on the values r), and then take the maximum over all of these runs. If the sequence of bandwidths r_{0},\ldots,r_{s} is such that r_{i+1}-r_{i}\leq\frac{er_{i}\varepsilon}{4}, then |\Phi(K_{r_{i}})-\Phi(K_{r})|\leq\varepsilon for any r\in[r_{i},r_{i+1}].

Lemma 3.5.

A geometrically increasing sequence of bandwidths R_{\text{min}}=r_{0},\ldots,r_{s}=R_{\text{max}} with s=O(\frac{1}{\varepsilon}\log\frac{R_{\text{max}}}{R_{\text{min}}}) is sufficient so that |\max_{i}\Phi(K_{r_{i}})-\Phi(K_{r})|\leq\varepsilon for any bandwidth r\in[R_{\text{min}},R_{\text{max}}].

Proof.

To guarantee an \varepsilon error for a bandwidth r, there must be a nearby bandwidth r_{i}: if |\Phi(K_{r_{i}})-\Phi(K_{r_{i+1}})|\leq\varepsilon then |\Phi(K_{r_{i}})-\Phi(K_{r})|\leq\varepsilon for r\in[r_{i},r_{i+1}].

From Lemma A.6 we can use the Lipschitz bound at r_{i} to note that |\Phi(K_{r_{i}})-\Phi(K_{r_{i+1}})|\leq\frac{4}{er_{i}}(r_{i+1}-r_{i}). Setting this less than \varepsilon we can rearrange to get that r_{i+1}\leq(\frac{e}{4}\varepsilon+1)r_{i}. That is, r_{0}(\frac{\varepsilon e}{4}+1)^{s}\geq r_{s}, which can be rearranged to get s=\frac{\log(R_{\text{max}}/R_{\text{min}})}{\log(\frac{\varepsilon e}{4}+1)}. Since \log(x+1)\geq\frac{x}{2} when x\in(0,1), we have s\leq\frac{8}{\varepsilon e}\log\frac{R_{\text{max}}}{R_{\text{min}}}. ∎

Running our KSSS over a large sequence of bandwidths is simple and merely increases the runtime by a O(\frac{1}{\varepsilon}\log\frac{R_{\text{max}}}{R_{\text{min}}}) factor. Our experiments in Section 4.4 suggest that choosing 4 to 6 bandwidths should be sufficient (e.g., for scales R_{\text{max}}/R_{\text{min}}=1{,}000).
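The geometric schedule of Lemma 3.5 is straightforward to generate; a sketch (ours):

import numpy as np

def bandwidth_schedule(r_min, r_max, eps):
    # r_{i+1} = (1 + e*eps/4) * r_i, covering [r_min, r_max].
    ratio = 1.0 + np.e * eps / 4.0
    s = int(np.ceil(np.log(r_max / r_min) / np.log(ratio)))
    return r_min * ratio ** np.arange(s + 1)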

Refer to caption
Refer to caption
Refer to caption
Figure 2. New KSSS algorithms compared by statistical power as sample size increases, using \|c-\hat{c}\|, \Phi(K_{\hat{c}}), and \mathsf{JS}(K_{c},K_{\hat{c}}).

4. Experiments

We compare our new KSSS algorithms to the state-of-the-art methods in terms of empirical efficiency, statistical power, and sample complexity, on large spatial data sets with planted anomalies.

Data sets. We run experiments on two large spatial data sets recording incidents of crime; these are used to represent the baseline data B. The first contains geo-locations of all crimes in Philadelphia from 2006-2015, and has a total size of |B|=687{,}636; a subsample is shown in Figure 1. The second is the well-known Chicago Crime Dataset from 2001-2017, and has a total size of |B|=6{,}886{,}676, which is 10x the size of the Philadelphia set.

In modeling crime hot spots, these may often be associated with an individual or group of individuals who live at a fixed location. Then the crimes they may commit would often be centered at that location, and be more likely to happen nearby, and less likely further away. A natural way to model this decaying crime likelihood would be with a Gaussian kernel — as opposed to a uniform probability within a fixed radius, and no increased probability outside that zone. Hence, our KSSS is a good model to potentially detect such spatial anomalies.

Planting anomalous regions. To conduct controlled experiments, we use the spatial data sets B above, but choose the m values in a synthetic way. In particular, we plant anomalous regions K_{c}\in\mathcal{K}_{r}, and then each data point x\in B is assigned to a group P (with probability K(x)) or Q (otherwise). Those x\in P are assigned m(x) through a Bernoulli process at rate p, that is m(x)=1 with probability p and 0 otherwise; those x\in Q are assigned m(x) at rate q. Given a region K_{c}, this could model a pattern of crimes (those with m(x)=1 may be all vehicle thefts or have a suspect matching a description), where c may represent the epicenter of the targeted pattern of crime. We use p=0.8 and q=0.5.

We repeat this experiment with 20 planted regions and plot the median on the Philadelphia data set to compare our new algorithms and to compare sample complexity properties against existing algorithms. We use 3 planted regions on the Chicago data set to compare scalability (these take considerably longer to run). We attempt to fix the size of P so |P|=f|B|, by adjusting the fixed and known bandwidth parameter r of each planted region. We set f=0.03 for Philadelphia and f=0.01 for Chicago, so each planted region is fairly small, containing about 3% or 1% of the data.
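A sketch (ours, in NumPy) of this planting procedure, with the group assignment and the two Bernoulli rates made explicit:

import numpy as np

def plant_anomaly(B, c, r, p=0.8, q=0.5, rng=None):
    # Assign each x in B to group P with probability K(x), then draw
    # m(x) ~ Bernoulli(p) for x in P and m(x) ~ Bernoulli(q) otherwise.
    rng = np.random.default_rng() if rng is None else rng
    K = np.exp(-np.sum((B - c) ** 2, axis=1) / r ** 2)
    in_P = rng.random(len(B)) < K
    rates = np.where(in_P, p, q)
    return (rng.random(len(B)) < rates).astype(int)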

Evaluating the models. A statistical power test plants an anomalous region (for instance as described above), and then determines how often an algorithm can recover that region; it measures recall. However, the considered algorithms typically do not recover exactly the planted region, so we measure how close the recovered region K_{\hat{c}} is to the planted region K_{c}. To do so we measure the following (a sketch of the Jaccard computation follows the list):

  • the distance between found centers \|c-\hat{c}\|; smaller is better.

  • \Phi(K_{\hat{c}}); larger is better (it may be larger than \Phi(K_{c})).

  • the extended Jaccard similarity \mathsf{JS}(K_{c},K_{\hat{c}}), defined

    \mathsf{JS}(K,\hat{K})=\frac{\langle K,\hat{K}\rangle_{B}}{\langle K,K\rangle_{B}+\langle\hat{K},\hat{K}\rangle_{B}-\langle K,\hat{K}\rangle_{B}}

    where \langle K,\hat{K}\rangle_{B}=\sum_{x\in B}K(x)\hat{K}(x); larger is better.
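For completeness, the extended Jaccard similarity can be computed directly from the two kernel value vectors over B (our sketch):

import numpy as np

def extended_jaccard(K, K_hat):
    # JS(K, K_hat) with <K, K'>_B = sum_x K(x) K'(x).
    inner = np.sum(K * K_hat)
    return inner / (np.sum(K * K) + np.sum(K_hat * K_hat) - inner)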

We plot medians over 20 trials; the targeted and hence measured values have variance because planted regions may not be the optimal region, since the m(x) values are generated under a random process. When we cannot control the x-value (when using time) we plot a kernel smoothing over different parameters on 3 trials.

Refer to caption
Figure 3. Runtime of new KSSS algorithms as a function of sample size.
Refer to caption
Refer to caption
Refer to caption
Figure 4. New KSSS vs. Disk SSS algorithms: statistical power as a function of sample size, using \|c-\hat{c}\|, \Phi(K_{\hat{c}}), and \mathsf{JS}(K_{c},K_{\hat{c}}).
Refer to caption
Refer to caption
Refer to caption
Figure 5. Accuracy measures as a function of runtime using \|c-\hat{c}\|, \Phi(K_{\hat{c}}), and \mathsf{JS}(K_{c},K_{\hat{c}}).

4.1. Comparing New KSSS Algorithms

We first compare the new KSSS algorithms against each other, as we increase the sample size |B_{\varepsilon}| and set the corresponding gridding and pruning parameters to match the expected error \varepsilon from sample size |B_{\varepsilon}|, as dictated in Section 3.2.

We observe in Figure 2 that all of the new KSSS algorithms achieve high power at about the same rate. In particular, when the sample size reaches about |B_{\varepsilon}|=1{,}000, they have all plateaued near their best values, with large power: the center distance is close to 0, \Phi(K_{\hat{c}}) is near its maximum, and \mathsf{JS}(K_{c},K_{\hat{c}}) is almost 0.9. At medium sample sizes |B_{\varepsilon}|=500, KernelAdaptive and KernelFast have worse accuracy, yet reach maximum power around the same sample size; so for very small sample sizes, we recommend KernelPrune.

In Figure 3 we see that the improvements from KernelGrid up to KernelFast are tremendous: a speed-up of roughly 20x to 30x. By comparing KernelPrune and KernelAdaptive we see most of the improvement comes from the adaptive gridding, but the pruning is also important, itself adding a 2x to 3x speed-up.

4.2. Power vs. Sample Size

We next compare our KSSS algorithms against existing, standard Disk SSS algorithms. As comparison, we first consider a fast reimplementation of SatScan (Kul97; Kul7.0) in C++. To even the comparison, we consider the exact same center set (defined on grid G_{\varepsilon}) for potential epicenters, and consider all possible radii of disks. Second, we compare against a heavily-optimized DiskScan algorithm (MP18b) for disks, which chooses a very small "net" of points to combinatorially reduce the set of disks scanned, but still guarantees \varepsilon-accuracy (in some sense similar to our adaptive approaches). For these algorithms we maximize Kulldorff's Bernoulli likelihood function (Kul97), whose \log has a closed form over binary ranges D\in\mathcal{D}.

Figure 4 shows the power versus sample size (representing how many data points are available), using the same metrics as before. The KSSS algorithms perform consistently and significantly better; to see this, consider a fixed y value in each plot. For instance, the KSSS algorithms reach \|c-\hat{c}\|<0.05, \Phi(K_{\hat{c}})>0.003, and \mathsf{JS}(K_{c},K_{\hat{c}})>0.8 after about 1000 data samples, whereas it takes the Disk SSS algorithms about 2500 data samples.

4.3. Power vs. Time

We next measure the power as a function of the runtime of the algorithms, again comparing the new KSSS algorithms with the traditional Disk SSS algorithms. We increase the sample size |B_{\varepsilon}| as before, now from the Chicago dataset, and adjust the other error parameters accordingly to match the theoretical error.

Figure 5 shows KernelFast significantly outperforms SatScan and DiskScan in these measures, in orders of magnitude less time. It reaches a small distance to the planted center faster (10 seconds vs. 1000 or more seconds). In 5 seconds it achieves a \Phi^{*} of 0.0006, and 0.00075 in 100 seconds, whereas in 1000 seconds the Disk SSS algorithms only reach 0.0004. Similarly for Jaccard similarity, KernelFast reaches 0.8 in 5 seconds and 0.95 in 100 seconds, whereas in 1000 seconds the Disk SSS algorithms only reach 0.5.

Refer to caption
Refer to caption
Figure 6. Accuracy as a function of the bandwidth r of the planted region.

4.4. Sensitivity to Bandwidth

So far we chose r to be the bandwidth of the planted anomaly K_{c}\in\mathcal{K}_{r} (this is natural if we know the nature of the event). But if the true anomaly bandwidth is not known, or only known within some range, then our method should be insensitive to this parameter. On the Philadelphia dataset we consider 30 geometrically increasing bandwidths scaled so that for the original bandwidth r we considered rs where s\in[10^{-2},10]. In Figure 6 we show the accuracy using the Jaccard similarity and the \Phi-value found, over 20 trials. Our KSSS algorithms are effective at fitting events with s\in[0.5,2], indicating quite a bit of leeway in which r to use. That is, the sample complexity would not change, but the time complexity may increase by a factor of only 2x to 5x if we also search over a range of r.

5. Conclusion

In this work, we generalized the spatial scan statistic so that ranges can be more flexible in their boundary conditions. In particular, this allows the anomalous regions to be defined by a kernel, so the anomaly is most intense at an epicenter, and its effect decays gradually moving away from that center. However, given this new definition, it is no longer possible to define and reason about a finite number of combinatorially defined anomalous ranges. Moreover, the log-likelihood ratio tests derived do not have closed form solutions, and as a result we develop new algorithmic techniques to deal with these two issues. These new algorithms are guaranteed to approximately detect the kernel range which maximizes the new discrepancy function up to any error precision, and the runtime depends only on the error parameter. We also conducted controlled experiments on planted anomalies which conclusively demonstrated that our new algorithms can detect regions with fewer samples and in less time than the traditional disk-based combinatorial algorithms made popular by the SatScan software. That is, we show that the newly proposed Kernel Spatial Scan Statistic theoretically and experimentally outperforms the existing Spatial Scan Statistic methods.

References

  • [1] D. Agarwal, A. McGregor, J. M. Phillips, S. Venkatasubramanian, and Z. Zhu. Spatial scan statistics: Approximations and performance study. In KDD, 2006.
  • [2] E. Arias-Castro, R. M. Castro, E. Tánczos, and M. Wang. Distribution-free detection of structured anomalies: Permutation and rank-based scans. JASA, 113:789–801, 2018.
  • [3] P. J. Diggle. A point process modelling approach to raised incidence of a rare phenomenon in the vicinity of a prespecified point. J. R. Statist. Soc. A, 153(3):349–362, 1990.
  • [4] L. Duczmal and R. Assunção. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics & Data Analysis, 45(2):269 – 286, 2004.
  • [5] L. Duczmal, A. L. Cançado, R. H. Takahashi, and L. F. Bessegato. A genetic algorithm for irregularly shaped spatial scan statistics. Computational Statistics & Data Analysis, 52(1):43 – 52, 2007.
  • [6] Y. N. Dylan Fitzpatrick and D. B. Neill. Support vector subset scan for spatial outbreak detection. Online Journal of Public Health Informatics, 2017.
  • [7] E. Eftelioglu, S. Shekhar, D. Oliver, X. Zhou, M. R. Evans, Y. Xie, J. M. Kang, R. Laubscher, and C. Farah. Ring-shaped hotspot detection: A summary of results. Proceedings - IEEE International Conference on Data Mining, ICDM, 2015:815–820, 01 2015.
  • [8] W. Herlands, E. McFowland, A. Wilson, and D. Neill. Gaussian process subset scanning for anomalous pattern detection in non-iid data. In AIStats, 2018.
  • [9] L. Huang, M. Kulldorff, and D. Gregorio. A spatial scan statistic for survival data. Biometrics, 63(1):109–118, 2007.
  • [10] I. Jung, M. Kulldorff, and A. C. Klassen. A spatial scan statistic for ordinal data. Statistics in medicine, 26(7):1594–1607, 2007.
  • [11] I. Jung, M. Kulldorff, and O. J. Richard. A spatial scan statistic for multinomial data. Statistics in medicine, 29(18):1910–1918, 2010.
  • [12] M. Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26:1481–1496, 1997.
  • [13] M. Kulldorff. SatScan User Guide. http://www.satscan.org/, 9.6 edition, 2018.
  • [14] M. Kulldorff, L. Huang, and K. Konty. A scan statistic for continuous data based on the normal probability model. Inter. Journal of Health Geographics, 8:58, 2009.
  • [15] M. Matheny and J. M. Phillips. Computing approximate statistical discrepancy. ISAAC, 2018.
  • [16] M. Matheny and J. M. Phillips. Practical low-dimensional halfspace range space sampling. ESA, 2018.
  • [17] M. Matheny, R. Singh, L. Zhang, K. Wang, and J. M. Phillips. Scalable spatial scan statistics through sampling. In SIGSPATIAL, 2016.
  • [18] D. B. Neill and A. W. Moore. Rapid detection of significant spatial clusters. In KDD, 2004.
  • [19] D. B. Neill, A. W. Moore, and G. F. Cooper. A bayesian spatial scan statistic. In Advances in neural information processing systems, pages 1003–1010, 2006.
  • [20] G. P. Patil and C. Taillie. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics, 11(2):183–197, Jun 2004.
  • [21] S. Speakman, S. Somanchi, E. McFowland III, and D. B. Neill. Penalized fast subset scanning. Journal of Computational and Graphical Statistics, 25(2):382–404, 2016.
  • [22] T. Tango and K. Takahashi. A flexibly shaped spatial scan statistic for detecting clusters. Inter. J. of Health Geographics, 4:11, 2005.
  • [23] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theo. of Prob and App, 16:264–280, 1971.

Appendix A Approximating the Bernoulli KSSS

In the following sections, we focus on the Bernoulli model and for notational simplicity write $\Phi=\Phi^{\textsf{Be}}$ and $\ell=\ell^{\textsf{Be}}$. Sometimes we also use $\Phi_{p,q}(K)=\Phi(p,q,K)$.

The values of $p$ and $q$ are bounded between $0$ and $1$, but at these extremes $\Phi_{p,q}(K)$ can be unbounded. Instead we bound structural properties of $\Phi$ by assuming $p=p^{*}$ and $q=q^{*}$.

Lemma A.1.

When $(p,q)=(p^{*},q^{*})=\mathrm{argmax}_{p,q}\Phi(p,q,K)$, then
\[
|B|=\sum_{x\in B\setminus M}\frac{1}{1-g(x)} \quad\text{ and }\quad |B|=\sum_{x\in M}\frac{1}{g(x)}.
\]
Proof.

Consider first the derivatives of the alternative hypothesis $\ell=\ell(p,q,K)$ with respect to $p$ and $q$:
\[
\frac{\mathsf{d}\ell}{\mathsf{d}p} = \sum_{x\in M}\frac{K(x)}{g(x)}-\sum_{x\in B\setminus M}\frac{K(x)}{1-g(x)}
\qquad
\frac{\mathsf{d}\ell}{\mathsf{d}q} = \sum_{x\in M}\frac{1-K(x)}{g(x)}-\sum_{x\in B\setminus M}\frac{1-K(x)}{1-g(x)}.
\]
From this we can see that at the maximum $p^{*},q^{*}$, where $g(x)=p^{*}K(x)+q^{*}(1-K(x))$,
\[
\frac{\mathsf{d}\ell}{\mathsf{d}p}p^{*}+\frac{\mathsf{d}\ell}{\mathsf{d}q}q^{*}=\sum_{x\in M}\frac{g(x)}{g(x)}-\sum_{x\in B\setminus M}\frac{g(x)}{1-g(x)}=0.
\]
Since $\sum_{x\in M}\frac{g(x)}{g(x)}=|M|$, it follows that $\sum_{x\in B\setminus M}\frac{g(x)}{1-g(x)}=|M|$, and then we can derive
\[
|M|=\sum_{x\in B\setminus M}\frac{1+g(x)-1}{1-g(x)}=\sum_{x\in B\setminus M}\frac{1}{1-g(x)}-|B\setminus M|.
\]
As a result
\[
|B|=\sum_{x\in B\setminus M}\frac{1}{1-g(x)}.
\]
Similarly
\[
\frac{\mathsf{d}\ell}{\mathsf{d}p}(1-p^{*})+\frac{\mathsf{d}\ell}{\mathsf{d}q}(1-q^{*})= \sum_{x\in M}\frac{1-g(x)}{g(x)}-\sum_{x\in B\setminus M}\frac{1-g(x)}{1-g(x)},
\]
and following similar steps we ultimately find
\[
|B|=\sum_{x\in M}\frac{1}{g(x)}. \qquad\qed
\]

This implies simple bounds on $g(x)$ as well.

Lemma A.2.

When $(p,q)=(p^{*},q^{*})=\mathrm{argmax}_{p,q}\Phi(p,q,K)$, then
\[
g(x)\in\left[\frac{1}{|B|},1\right]\ \text{if } x\in M \quad\text{ and }\quad g(x)\in\left[0,\frac{|B|-1}{|B|}\right]\ \text{if } x\in B\setminus M.
\]
Proof.

From Lemma A.1 we can state that
\[
|B|=\sum_{x\in M}\frac{1}{g(x)}\geq\frac{1}{g(x)}
\]
for any $x\in M$, and therefore $g(x)\geq\frac{1}{|B|}$ for $x\in M$. In the case that $x\in B\setminus M$ we have that
\[
|B|=\sum_{x\in B\setminus M}\frac{1}{1-g(x)}\geq\frac{1}{1-g(x)}
\]
for any $x\in B\setminus M$. Therefore $1-g(x)\geq\frac{1}{|B|}$, that is $g(x)\leq\frac{|B|-1}{|B|}$. ∎

A.1. Spatial Approximation of the KSSS

In this section we prove lemmas that can be used to spatially approximate the KSSS, either by ignoring far-away points or by restricting the set of candidate centers.

Truncating Kernels. We argue here that we can consider a simpler set of truncated kernels $\mathcal{K}_{r,\varepsilon}$ in place of $\mathcal{K}_{r}$, replacing each $K_{c}\in\mathcal{K}_{r}$ with a corresponding $K^{\prime}_{c}\in\mathcal{K}_{r,\varepsilon}$, without affecting $\ell(p,q,K)$, and hence $\Phi(K)$, by more than an additive $\varepsilon$-error. Specifically, define any $K^{\prime}_{c}\in\mathcal{K}_{r,\varepsilon}$ using $r_{\text{max}}=r\sqrt{\log(|B|/\varepsilon)}$ as
\[
K^{\prime}_{c}(x)=\begin{cases}K_{c}(x)=\exp(-\|x-c\|^{2}/r^{2})&\text{ if }\|x-c\|\leq r_{\text{max}}\\ 0&\text{ otherwise.}\end{cases}
\]
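A minimal sketch of this truncation (the numpy layout and helper name are our own; it simply evaluates $K^{\prime}_{c}$ as defined above):

```python
import numpy as np

def truncated_kernel(X, c, r, eps, n):
    """Evaluate the truncated Gaussian kernel K'_c at points X.

    X   : (m, 2) array of points
    c   : (2,) center
    r   : bandwidth
    eps : additive error target
    n   : |B|, the size of the baseline set
    """
    r_max = r * np.sqrt(np.log(n / eps))      # truncation radius from Lemma A.3
    d = np.linalg.norm(X - c, axis=1)
    K = np.exp(-(d ** 2) / r ** 2)
    K[d > r_max] = 0.0                        # points beyond r_max contribute 0
    return K
```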
Lemma A.3.

For any data set $B$, center $c\in\mathbb{R}^{d}$, and error $\varepsilon>0$,
\[
|\Phi(K_{c})-\Phi(K^{\prime}_{c})|\leq\varepsilon.
\]
Proof.

The key observation is that for $\|x-c\|>r_{\text{max}}$ we have $K(x)\leq\exp(-\log(|B|/\varepsilon))=\varepsilon/|B|$. Since $0\leq q,p\leq 1$, then
\[
|g(x)-g^{\prime}(x)|\leq\varepsilon/|B|
\]
as well, where $g^{\prime}(x)=q+(p-q)K^{\prime}(x)$.

Next we rewrite $\ell(p,q,K)$ as
\[
\ell(p,q,K)=\frac{1}{|B|}\sum_{x\in B}\log\left(\begin{cases}g(x)&\text{ if }m(x)=1\\ 1-g(x)&\text{ if }m(x)=0\end{cases}\right).
\]
Since $g(x)>1/|B|$ for $x\in M$ and $1-g(x)>1/|B|$ for $x\in B\setminus M$ by Lemma A.2, we can analyze each term of the average as $\log(\cdot)$ of a quantity at least $1/|B|$. If that quantity changes by at most $\varepsilon/|B|$, then that term changes by at most
\[
\log(1/|B|+\varepsilon/|B|)-\log(1/|B|)\leq(\varepsilon/|B|)\cdot|B|=\varepsilon,
\]
by taking the derivative of $\log$ at $1/|B|$. Hence each term changes by at most $\varepsilon$, and $\ell(p,q,K)$ takes the average over these terms, so the main claim holds. ∎
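For concreteness, here is a sketch of evaluating the rewritten log-likelihood $\ell(p,q,K_c)$ above on a finite data set (variable names and array layout are our own assumptions):

```python
import numpy as np

def log_likelihood(X, m, c, r, p, q):
    """Average Bernoulli log-likelihood ell(p, q, K_c) over the baseline set.

    X : (n, 2) array of baseline points B
    m : (n,) array of 0/1 measurements
    c : (2,) kernel center;  r : bandwidth;  p, q : rate parameters
    """
    K = np.exp(-np.sum((X - c) ** 2, axis=1) / r ** 2)   # K_c(x) for each x
    g = q + (p - q) * K                                   # g(x) = p K(x) + q (1 - K(x))
    terms = np.where(m == 1, np.log(g), np.log(1.0 - g))
    return terms.mean()
```

This is the alternative log-likelihood only; forming $\Phi$ additionally involves the null model, which does not depend on the center $c$.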

Center Point Lipschitz Bound. Next we show that $\Phi(K_{c})$ is stable with respect to small changes in its epicenter $c$.

Lemma A.4.

The magnitude of the gradient of $\Phi(K_{c})$ with respect to the center $c$, for any considered $K_{c}\in\mathcal{K}_{r}$, is bounded by
\[
|\nabla_{c}\Phi(K_{c})|\leq(1/r)\sqrt{8/e}.
\]
Proof.

We take the derivative of $c$ in direction $\vec{u}$ (a unit vector in $\mathbb{R}^{2}$) at magnitude $t$. First analyze the Gaussian kernel under such a shift as $K_{c}(x)=\exp(-\|c-x+t\vec{u}\|^{2}/r^{2})$, so as $t\to 0$ we have
\[
\left|\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\right|=\left|\frac{2\langle\vec{u},(c-x)\rangle}{r^{2}}\exp\left(-\frac{\|c-x\|^{2}}{r^{2}}\right)\right|.
\]
This is maximized for $\vec{u}=(c-x)/\|c-x\|$, so
\[
\left|\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\right|\leq\left|\frac{2\|c-x\|}{r^{2}}\exp\left(-\frac{\|c-x\|^{2}}{r^{2}}\right)\right|\leq\frac{1}{r}\sqrt{\frac{2}{e}},
\]
since the middle expression is maximized at $\|c-x\|/r=\sqrt{1/2}$.

Now examine $\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}t}=\frac{\mathsf{d}\ell(p^{*},q^{*},K_{c})}{\mathsf{d}t}$, which expands to
\[
\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}t}=\sum_{x\in B}\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}K_{c}(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}+\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}p^{*}}\frac{\mathsf{d}p^{*}}{\mathsf{d}t}+\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}q^{*}}\frac{\mathsf{d}q^{*}}{\mathsf{d}t}.
\]
Since at $p=p^{*},q=q^{*}$ both $\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}p}=0$ and $\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}q}=0$, then as long as $\frac{\mathsf{d}p^{*}}{\mathsf{d}t}$ and $\frac{\mathsf{d}q^{*}}{\mathsf{d}t}$ are bounded the associated terms are also $0$. We bound these by taking the derivative of the equation $|B|=\sum_{x\in M}\frac{1}{g(x)}$ from Lemma A.1 with respect to $t$:
\[
\sum_{x\in M}\frac{1}{g(x)^{2}}\left(\frac{\mathsf{d}p^{*}}{\mathsf{d}t}K_{c}(x)+\frac{\mathsf{d}q^{*}}{\mathsf{d}t}(1-K_{c}(x))+\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}(p^{*}-q^{*})\right)=\frac{\mathsf{d}|B|}{\mathsf{d}t}=0.
\]
The term $\sum_{x\in M}\big|\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\big|\frac{p^{*}-q^{*}}{g(x)^{2}}<C$ for some constant $C$ at $p^{*}$ and $q^{*}$, by Lemma A.2 and since $\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}$ is bounded. Hence
\begin{align*}
C &\geq\left|\sum_{x\in M}\frac{1}{g(x)^{2}}\left(\frac{\mathsf{d}p^{*}}{\mathsf{d}t}K_{c}(x)+\frac{\mathsf{d}q^{*}}{\mathsf{d}t}(1-K_{c}(x))\right)\right|\\
&\geq\left|\frac{\mathsf{d}p^{*}}{\mathsf{d}t}\sum_{x\in M}K_{c}(x)+\frac{\mathsf{d}q^{*}}{\mathsf{d}t}\sum_{x\in M}(1-K_{c}(x))\right|.
\end{align*}
Therefore if $K_{c}(x)>0$ for any $x\in M$, then $\frac{\mathsf{d}p^{*}}{\mathsf{d}t}$ and $\frac{\mathsf{d}q^{*}}{\mathsf{d}t}$ are bounded. This holds for any $K_{c}$ considered, as otherwise $\Phi=0$ since the alternative hypothesis $\ell(K_{c})$ is equal to the null hypothesis.

Finally, we bound the magnitude of the first term:
\begin{align*}
\left|\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}t}\right| &=\left|\frac{1}{|B|}\left(\sum_{x\in M}\frac{\mathsf{d}\log(g(x))}{\mathsf{d}t}+\sum_{x\in B\setminus M}\frac{\mathsf{d}\log(1-g(x))}{\mathsf{d}t}\right)\right| \quad\text{for fixed }p^{*},q^{*}\\
&=\frac{1}{|B|}\left|\sum_{x\in M}\frac{p^{*}-q^{*}}{g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}-\sum_{x\in B\setminus M}\frac{p^{*}-q^{*}}{1-g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\right|\\
&\leq\frac{1}{|B|}\left(\frac{1}{r}\sqrt{\frac{2}{e}}\right)\left(\left|\sum_{x\in M}\frac{p^{*}-q^{*}}{g(x)}\right|+\left|\sum_{x\in B\setminus M}\frac{p^{*}-q^{*}}{1-g(x)}\right|\right)\\
&\leq\frac{1}{r}\sqrt{\frac{8}{e}},
\end{align*}
where the last step follows from $p^{*}-q^{*}<1$ and Lemma A.1. ∎
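A quick numerical sanity check of the per-point derivative bound $(1/r)\sqrt{2/e}$ used above (a throwaway verification, not part of any algorithm):

```python
import numpy as np

# Check: max over z >= 0 of (2z / r^2) * exp(-z^2 / r^2) equals (1/r) * sqrt(2/e),
# attained at z = r / sqrt(2).
r = 3.0
z = np.linspace(0.0, 10.0 * r, 200001)
vals = (2.0 * z / r ** 2) * np.exp(-(z ** 2) / r ** 2)
print(vals.max())                        # numerically ~ 0.286 for r = 3
print((1.0 / r) * np.sqrt(2.0 / np.e))   # closed form ~ 0.286
print(z[vals.argmax()] / r)              # maximizing ratio ~ 0.7071 ~ sqrt(1/2)
```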

This bound suggests a further improvement for centers that have few points in their neighborhood. We state this bound in terms of a truncated kernel, but it also applies to non-truncated kernels through Lemma A.3.

Lemma A.5.

For a truncated kernel $K^{\prime}_{c}\in\mathcal{K}_{r,\varepsilon}$, if we shift the center $c$ to $c^{\prime}$ with $\|c-c^{\prime}\|\leq\beta\leq r_{\text{max}}$, then
\[
|\Phi(K^{\prime}_{c})-\Phi(K^{\prime}_{c^{\prime}})|\leq\beta\frac{|D\cap B|}{|B|}\frac{2r_{\text{max}}}{r^{2}}+\varepsilon,
\]
where $D$ is a disk centered at $c$ of radius $2r_{\text{max}}$.

Proof.

From the previous proof, for a regular kernel $K_{c}\in\mathcal{K}_{r}$,
\[
\left|\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}t}\right|\leq\frac{1}{|B|}\left|\sum_{x\in M}\frac{p^{*}-q^{*}}{g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}-\sum_{x\in B\setminus M}\frac{p^{*}-q^{*}}{1-g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\right|.
\]
Adapting this to a truncated kernel $K^{\prime}_{c}\in\mathcal{K}_{r,\varepsilon}$, this magnitude times $\beta$ would provide a bound, but a poor one, since the derivative is unbounded for $x$ on the boundary. Rather, we analyze the sum over $x\in B$ in three cases. For $x\notin D$, both $K^{\prime}_{c}(x)=0$ and $K^{\prime}_{c^{\prime}}(x)=0$, so these points do not contribute. For $\|x-c\|\in[r_{\text{max}}-\beta,r_{\text{max}}+\beta]$ the derivative is unbounded, but $K^{\prime}_{c}(x)\in[0,\varepsilon]$ and $K^{\prime}_{c^{\prime}}(x)\in[0,\varepsilon]$ by Lemma A.3, so the total contribution of these terms to the average is at most $\varepsilon$. What remains is to analyze the $x\in B$ such that $\|x-c\|\leq r_{\text{max}}-\beta$, where the derivative bound applies. However, we will not use the same approach as in the non-truncated kernel.

We write this contribution as the sum over $M$ and $B$ separately, $\frac{\beta}{|B|}(\Delta_{M}-\Delta_{B})$, focusing only on points in $D$ (and thus double counting ones near the boundary), where
\[
\Delta_{M}=\sum_{x\in D\cap M}\frac{p^{*}-q^{*}}{g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\;\text{ and }\;\Delta_{B}=\sum_{x\in D\cap B\setminus M}\frac{p^{*}-q^{*}}{1-g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}.
\]
To bound $\Delta_{M}$, observe that $g(x)=(p^{*}-q^{*})K_{c}(x)+q^{*}\geq(p^{*}-q^{*})K_{c}(x)$, and $\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}=-2\frac{\|x-c\|}{r^{2}}K_{c}(x)$. Thus
\[
|\Delta_{M}|\leq\left|\sum_{x\in D\cap M}\frac{p^{*}-q^{*}}{(p^{*}-q^{*})K_{c}(x)}2K_{c}(x)\frac{\|x-c\|}{r^{2}}\right|\leq|D\cap M|\frac{2r_{\text{max}}}{r^{2}}.
\]
To bound $\Delta_{B}$, we use that $1-g(x)=1-q^{*}-(p^{*}-q^{*})K(x)\geq 1-q^{*}$, that $p^{*}-q^{*}\leq 1-q^{*}$, and from the previous proof that $\left|\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\right|\leq\frac{1}{r}\sqrt{\frac{2}{e}}$. Thus
\[
|\Delta_{B}|\leq\left|\sum_{x\in D\cap B\setminus M}\frac{1-q^{*}}{1-q^{*}}\frac{1}{r}\sqrt{\frac{2}{e}}\right|=|D\cap B\setminus M|\frac{1}{r}\sqrt{\frac{2}{e}}.
\]
Since $2>\sqrt{2/e}$ we can combine these contributions as
\[
\frac{\beta}{|B|}|\Delta_{M}-\Delta_{B}|\leq\beta\frac{|D\cap B|}{|B|}\frac{2r_{\text{max}}}{r^{2}}. \qquad\qed
\]
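Lemmas A.4 and A.5 justify evaluating $\Phi$ only on a finite grid of candidate centers. A minimal sketch of constructing such a grid (the spacing rule is our reading of the Lipschitz bound in Lemma A.4, and the bounding-box gridding is an illustrative assumption):

```python
import numpy as np

def candidate_centers(X, r, eps):
    """Lay down a square grid of candidate epicenters over the data's bounding box.

    By Lemma A.4 the statistic changes by at most (1/r)*sqrt(8/e) per unit of
    center movement, so spacing the grid at beta = eps * r / sqrt(8/e) keeps the
    additive error from snapping to the nearest grid point below roughly eps.
    """
    beta = eps * r / np.sqrt(8.0 / np.e)
    (xmin, ymin), (xmax, ymax) = X.min(axis=0), X.max(axis=0)
    xs = np.arange(xmin, xmax + beta, beta)
    ys = np.arange(ymin, ymax + beta, beta)
    gx, gy = np.meshgrid(xs, ys)
    return np.column_stack([gx.ravel(), gy.ravel()])
```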

A.2. Bandwidth Approximations of the KSSS

We mainly focus on solving $\max_{K_{c}\in\mathcal{K}_{r}}\Phi(K_{c})$ for a fixed bandwidth $r$. Here we consider the stability of the choice of $r$, in case it is not assumed known and must be searched over.

Lemma A.6.

For fixed $p$, $q$, and $c$ we have $\big|\frac{\mathsf{d}\Phi(K)}{\mathsf{d}r}\big|\leq\frac{4}{er}$.

Proof.

Let $z_{x}=\|x-c\|_{2}$ and $w_{x}=z_{x}/r$, so $K(x)=\exp(-w_{x}^{2})$. We compute the derivative of $\ell=\ell(p,q,K)$ with respect to $r$:
\begin{align*}
\frac{\mathsf{d}\ell}{\mathsf{d}r} &=\frac{1}{|B|}\left(\sum_{x\in M}\frac{2z_{x}^{2}(p-q)e^{-z_{x}^{2}/r^{2}}}{r^{3}g(x)}-\sum_{x\in B\setminus M}\frac{2z_{x}^{2}(p-q)e^{-z_{x}^{2}/r^{2}}}{r^{3}(1-g(x))}\right)\\
&=\frac{2(p-q)}{r|B|}\left(\sum_{x\in M}\frac{w_{x}^{2}e^{-w_{x}^{2}}}{g(x)}-\sum_{x\in B\setminus M}\frac{w_{x}^{2}e^{-w_{x}^{2}}}{1-g(x)}\right).
\end{align*}
By two observations, $\max_{w>0}w^{2}e^{-w^{2}}=1/e$ and $p-q\leq 1$, this simplifies to
\[
\left|\frac{\mathsf{d}\ell}{\mathsf{d}r}\right|\leq\frac{2}{r|B|}\left(\sum_{x\in M}\frac{1/e}{g(x)}+\sum_{x\in B\setminus M}\frac{1/e}{1-g(x)}\right).
\]
By Lemma A.1 each of the sums over $x\in M$ and over $x\in B\setminus M$ is at most $(1/e)|B|$, and hence
\[
\left|\frac{\mathsf{d}\ell}{\mathsf{d}r}\right|\leq\frac{4/e}{r}. \qquad\qed
\]
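This Lipschitz bound in $r$ suggests that, when the bandwidth is unknown, a geometric grid of bandwidths suffices. A back-of-the-envelope sketch of choosing such a grid from Lemma A.6 (the step-size rule below is our own reading of the bound, not something stated elsewhere in the paper):

```python
import numpy as np

def bandwidth_grid(r_lo, r_hi, eps):
    """Geometric grid of bandwidths so consecutive values change Phi by at most ~eps.

    Integrating |d Phi / d r| <= 4/(e r) between consecutive bandwidths gives a
    change of at most (4/e) * log(ratio), so a per-step ratio of exp(eps*e/4) suffices.
    """
    ratio = np.exp(eps * np.e / 4.0)
    n = int(np.ceil(np.log(r_hi / r_lo) / np.log(ratio))) + 1
    return r_lo * ratio ** np.arange(n)
```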

A.3. Sampling Bounds

Sampling can be used to create a coreset of a truly massive data set $B$ while retaining the ability to approximately solve $\max_{K\in\mathcal{K}}\Phi(K)$. In particular, we consider creating an iid sample $B_{\varepsilon}$ from $B$.

Lemma A.7.

For a sample $B_{\varepsilon}$ of size $t=O(\frac{1}{\varepsilon^{2}}\log^{2}\frac{1}{\varepsilon}\log\frac{\kappa}{\delta})$ and for $\kappa$ different centers $c\in C$, with probability at least $1-\delta$, all estimates $\hat{\ell}(K_{c})$ from $B_{\varepsilon}$ of $\ell(K_{c})$ satisfy $|\ell(K_{c})-\hat{\ell}(K_{c})|\leq\varepsilon$.

Proof.

To approximate $-\ell(K_{c})$ we will use a standard Chernoff-Hoeffding bound: for $t$ independent random variables $X_{1},\ldots,X_{t}$ with $\mathsf{E}[X_{i}]=\mu$ and $X_{i}\in[0,\Delta]$, the average $\bar{X}=\frac{1}{t}\sum_{i=1}^{t}X_{i}$ concentrates as $\Pr[|\bar{X}-\mu|\geq\alpha]\leq 2\exp\left(-\frac{2\alpha^{2}t}{\Delta^{2}}\right)$. Set
\[
\mu=-\frac{1}{|B|}\sum_{x\in B}\Big(m(x)\log(g(x))+(1-m(x))\log(1-g(x))\Big)=-\ell(K_{c}),
\]
and map the $i$th sample $x^{\prime}\in B_{\varepsilon}$ to the $i$th random variable
\[
X_{i}=-\big(m(x^{\prime})\log(g(x^{\prime}))+(1-m(x^{\prime}))\log(1-g(x^{\prime}))\big).
\]
Since $g(x)>1/|B|$ for $m(x)=1$ and $1-g(x)>1/|B|$ for $m(x)=0$ by Lemma A.2, we have $\Delta=-\log(1/|B|)=\log(|B|)$. Plugging these quantities into the Chernoff-Hoeffding bound yields
\[
\Pr[|\hat{\ell}(K_{c})-\ell(K_{c})|\geq\varepsilon]\leq 2\exp\left(-\frac{2\varepsilon^{2}t}{\log^{2}(|B|)}\right)\leq\delta^{\prime}.
\]
Solving for $t$ finds $t=\frac{\log^{2}|B|}{2\varepsilon^{2}}\log\frac{2}{\delta^{\prime}}$. Then taking a union bound over $\kappa$ centers reveals that, with probability at least $1-\delta=1-\delta^{\prime}\kappa$, this holds by setting
\[
t=\frac{\log^{2}|B|}{2\varepsilon^{2}}\log\frac{2\kappa}{\delta}.
\]
At this point, we almost have the desired bound; however, it has a $\log^{2}|B|$ factor in $t$ in place of the desired $\log^{2}\frac{1}{\varepsilon}$. One should believe that it is possible to remove the dependence on $|B|$, since given one sample with the above bound, the dependence on $|B|$ has been reduced to logarithmic. We can apply the bound again, so the dependence is reduced to $\log^{2}(\frac{1}{\varepsilon}\log^{2}|B|)$, and so on. Ultimately, it can be shown that we can directly replace $|B|$ with $\mathsf{poly}(1/\varepsilon)$ in the bounds using a classic trick [23] of considering two samples $B_{\varepsilon}$ and $B_{\varepsilon}^{\prime}$, and arguing that if they yield results close to each other, then they should both be close to $B$. We omit the standard, but slightly unintuitive, details. ∎
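A minimal sketch of the sampling step implied by Lemma A.7, using the $\log^{2}|B|$ form of the bound derived above (the constants and with-replacement sampling are our own simplifications):

```python
import numpy as np

def sample_coreset(X, m, eps, delta, kappa, rng=None):
    """Draw an iid sample B_eps of B large enough to estimate ell(K_c) within eps
    at kappa candidate centers, with probability at least 1 - delta (Lemma A.7)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    t = int(np.ceil((np.log(n) ** 2 / (2 * eps ** 2)) * np.log(2 * kappa / delta)))
    t = min(t, n)                       # no point sampling more than |B|
    idx = rng.integers(0, n, size=t)    # sample with replacement (iid)
    return X[idx], m[idx]
```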

A.4. Convexity in pp and qq

We next show that maximizing $\Phi_{p,q}(K)$ over $p$ and $q$ is a convex optimization problem (minimizing $-\Phi_{p,q}(K)$) with a Lipschitz bound in terms of $p$ and $q$. However, such a bound does not exist over all of $p,q\in(0,1)$, as the gradient is unbounded near the boundary. We instead define a set of constraints related to $g(x)$ at the optimal $p^{*},q^{*}$ that allow a Lipschitz constant.

Lemma A.8.

The following optimization problem is convex with Lipschitz constant $2|B|$, and its feasible domain contains $p^{*},q^{*}=\mathrm{argmin}_{p,q}-\Phi_{p,q}(K)$.

(1)
\[
\begin{split}
\min_{p,q}\;-\Phi_{p,q}(K) \quad\text{subject to}\quad
& 1/|B|-g(x)\leq 0 \;\text{ for }x\in M,\\
& (|B|-1)/|B|-g(x)\geq 0 \;\text{ for }x\in B\setminus M,\\
& p\in(0,1),\; q\in(0,1).
\end{split}
\]
Proof.

The set of constraints follows from Lemma A.2, and from that bound we know the optimal $p^{*}$ and $q^{*}$ are contained in this domain. Each constraint is a linear function of $p$ and $q$, and hence the feasible region is convex. Since $g(x)$ is an affine function of $p$ and $q$, it is both convex and concave with respect to $p,q$. The logarithm is concave and non-decreasing, hence both $\log(g(x))$ and $\log(1-g(x))$ are concave. The sum of concave functions is concave, and thus $-\ell(p,q,K)$, and hence $-\Phi(p,q,K)$, is convex.

The task becomes to show that the absolute values of the first-order partial derivatives are bounded for $p$ and $q$ in this domain. For $p$ (the argument for $q$ is symmetric),
\begin{align*}
\left|\frac{\mathsf{d}\ell(p,q,K)}{\mathsf{d}p}\right| &=\left|\frac{1}{|B|}\left(\sum_{x\in M}\frac{K(x)}{g(x)}-\sum_{x\in B\setminus M}\frac{K(x)}{1-g(x)}\right)\right|\\
&\leq\frac{1}{|B|}\left(\left|\sum_{x\in M}\frac{K(x)}{g(x)}\right|+\left|\sum_{x\in B\setminus M}\frac{K(x)}{1-g(x)}\right|\right)\\
&\leq\frac{1}{|B|}\left|\sum_{x\in M}\frac{1}{g(x)}\right|+\frac{1}{|B|}\left|\sum_{x\in B\setminus M}\frac{1}{1-g(x)}\right|\\
&\leq\frac{1}{|B|}\sum_{x\in M}|B|+\frac{1}{|B|}\sum_{x\in B\setminus M}|B| = |M|+|B\setminus M| = |B| \leq 2|B|,
\end{align*}
where the steps follow from the triangle inequality, from $K(x)\leq 1$, and from the constraints $g(x)\geq 1/|B|$ for $x\in M$ and $1-g(x)\geq 1/|B|$ for $x\in B\setminus M$. ∎
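Given Lemma A.8, $(p^{*},q^{*})$ at a fixed center can be found with any off-the-shelf convex solver. A minimal sketch using projected gradient ascent on $\ell$ (the box projection below is a simplification contained inside the constraint set of (1), and the step-size schedule is an arbitrary choice of ours):

```python
import numpy as np

def fit_p_q(K, m, n_iters=500):
    """Projected gradient ascent on ell(p, q, K_c) for fixed kernel evaluations K.

    K : (n,) kernel values K_c(x) for the baseline points
    m : (n,) 0/1 measurements
    Returns approximate maximizers (p, q) of the average log-likelihood.
    """
    n = len(K)
    lo, hi = 1.0 / n, 1.0 - 1.0 / n           # simplified box inside the constraints of (1)
    p, q = 0.6, 0.4                           # arbitrary feasible start with p > q
    for it in range(1, n_iters + 1):
        g = q + (p - q) * K
        resid = np.where(m == 1, 1.0 / g, -1.0 / (1.0 - g)) / n
        dp = np.sum(resid * K)                # d ell / d p
        dq = np.sum(resid * (1.0 - K))        # d ell / d q
        step = 0.1 / np.sqrt(it)
        p = np.clip(p + step * dp, lo, hi)    # ascend ell within the box
        q = np.clip(q + step * dq, lo, hi)
    return p, q
```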