
The Kernel Spatial Scan Statistic

Mingxuan Han u1209601@umail.utah.edu University of Utah201 Presidents Circle84112 Michael Matheny mmath@cs.utah.edu University of Utah201 Presidents Circle84112  and  Jeff M. Phillips jeffp@cs.utah.edu University of Utah201 Presidents Circle84112
Abstract.

Kulldorff's (1997) seminal paper on spatial scan statistics (SSS) has led to many methods considering different regions of interest, different statistical models, and different approximations, while also finding numerous applications in epidemiology, environmental monitoring, and homeland security. SSS provides a way to rigorously test for the existence of an anomaly and to provide statistical guarantees as to how "anomalous" that anomaly is. However, these methods rely on defining specific regions where the spatial information a point contributes is limited to a binary 0 or 1, either inside or outside the region, while in reality anomalies tend to follow smooth distributions with density decaying away from an epicenter. In this work, we propose a method that addresses this shortcoming through a continuous scan statistic that generalizes SSS by allowing the point contribution to be defined by a kernel. We provide extensive experimental and theoretical results showing that our methods can be computed efficiently while providing high statistical power for detecting anomalous regions.

1. Introduction

We propose a generalized version of spatial scan statistics called the kernel spatial scan statistic. In contrast to the many variants (AMPVW06; Kul97; huang2007spatial; jung2007spatial; neill2006bayesian; jung2010spatial; kulldorff2009scan; FNN17; HMWN18; SSMN16) of this classic method for geographic information sciences, the kernel version models the gradual dampening of an anomalous event as a data point moves further from the epicenter. As we will see, this modeling change allows for more statistical power and faster algorithms (independent of data set size), in addition to being more realistic.

Refer to caption
Figure 1. An anomaly affecting 8% of data (with rate parameters p=.9, q=.5) under the Bernoulli Kernel Spatial Scan Statistic model on geo-locations of crimes in Philadelphia.

To review, spatial scan statistics consider a baseline or population data set B where each point x\in B has an annotated value m(x). In the simplest case, this value is binary (1 = person has cancer; 0 = person does not have cancer), and it is useful to define the measured set M\subset B as M=\{x\in B\mid m(x)=1\}. Then the typical goal is to identify a region where there are significantly more measured points than one would expect from the baseline data B. To prevent overfitting (e.g., gerrymandering), the typical formulation fixes a set of potential anomalous regions \mathcal{R} induced by a family of geometric shapes: disks (Kul97), axis-aligned rectangles (NM04), ellipses (Kul7.0), halfspaces (MP18a), and others (Kul7.0). Then given a statistical discrepancy function \Phi(R)=\phi(R(M),R(B)), where typically R(M)=\frac{|R\cap M|}{|M|} and R(B)=\frac{|R\cap B|}{|B|}, the spatial scan statistic is

\Phi^{*}=\max_{R\in\mathcal{R}}\Phi(R).

And the hypothesized anomalous region is R^{*}=\mathrm{argmax}_{R\in\mathcal{R}}\Phi(R), so \Phi^{*}=\Phi(R^{*}). Conveniently, by choosing a fixed set of shapes and having a fixed baseline set B, this combinatorially limits the set of all possible regions that need to be considered (when computing the \max operator), since, for instance, there can be only O(|B|^{3}) distinct disks which each contain a different subset of points. This allows for tractable (AMPVW06) (and in some cases very scalable (MP18b)) combinatorial and statistical algorithms which can (approximately) search over the class of all shapes from that family. Alternatively, the most popular software, SatScan (Kul7.0), uses a fixed center set of possible epicenters of events, for simpler and more scalable algorithms.

However, the discreteness of these models has a strange modeling side effect. Consider the shape model of disks \mathcal{D}, where each disk D\in\mathcal{D} is defined by a center c\in\mathbb{R}^{d} and a radius r>0 as D=\{x\in\mathbb{R}^{d}\mid\|x-c\|\leq r\}. Then solving a spatial scan statistic over this family \mathcal{D} would yield an anomaly defined by a disk D; that is, all points x\in B\cap D are counted entirely inside the anomalous region, and all points x^{\prime}\in B\setminus D are considered entirely outside the anomalous region. If this region is modeling a regional event, say the area around a potentially hazardous chemical leak suspected of causing cancer in nearby residents, then the hope is that the center c identifies the location of the leak, and r determines the radius of impact. However, this implies that data points x\in B very close to the epicenter c are affected with the same likelihood as those a distance of almost but not quite r away. And those data points x^{\prime}\in B that are slightly further than r away from c are not affected at all. In reality, the data points closest to the epicenter should be more likely to be affected than those further away, even if they are within some radius r, and data points just beyond that radius should still have some, but a lessened, effect as well.

Introducing the Kernel Spatial Scan Statistic. The main modeling change of the kernel spatial scan statistic (KSSS) is to prescribe these diminishing effects of spatial anomalies as data points become further from the epicenter of a proposed event. From a modeling perspective, given the way we described the problem above, the generalization is quite natural: we simply replace the shape class \mathcal{R} (e.g., the family of all disks \mathcal{D}) with a class of non-binary continuous functions \mathcal{K}. The most natural choice (which we focus on) are kernels, and in particular Gaussian kernels. We define each K\in\mathcal{K} by a center c and a bandwidth r as K(x)=\exp(-\|x-c\|^{2}/r^{2}). This provides a real value K(x)\in[0,1], in fact a probability, for each x\in B. We interpret this as: given an anomaly model K, then for each x\in B the value K(x) is the probability that the rate of the measured event (chance that a person gets cancer) is increased.
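As a small, concrete illustration of this definition (this sketch is ours, in Python with NumPy; the function name and toy data are not from the paper), evaluating such a kernel over a baseline point set looks as follows.

import numpy as np

def gaussian_kernel(B, c, r):
    # K(x) = exp(-||x - c||^2 / r^2) for every point x in B.
    # B: (n, 2) array of baseline locations; c: (2,) center; r: bandwidth.
    sq_dists = np.sum((B - c) ** 2, axis=1)
    return np.exp(-sq_dists / r ** 2)

# toy usage: five points, anomaly centered at the origin with bandwidth 1
B = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 3.0]])
K = gaussian_kernel(B, c=np.array([0.0, 0.0]), r=1.0)  # values decay with distance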

Related Work on SSS and Kernels. There have been many papers on computing various range spaces for SSS (NM04; Tango2005; ACTW18; ICDMS2015), where a geometric region defines the set of points included in the region, for various regions such as disks, ellipses, rings, and rectangles. Other work has combined SSS and kernels as a way to penalize far away points, but still used binary regions, and only over a set of predefined starting points (DA04; DCTB2007; Patil2004). Another method (FNN17) uses a kernel SVM boundary to define a region; this provides a regularized, but otherwise very flexible class of regions, but the regions are still binary. A third method (JRSA89) proposes an inhomogeneous Poisson process model for the spatial relationship between a measured cancer rate and exposure to a single specified region (from an industrial pollution source). This models the measured rate similarly to our work, but does not search over a family of regions, and does not model a background rate.

Our contributions, and their challenges. We formally derive and discuss in more depth the KSSS in Section 2, and contrast this with related work on SSS. While the above intuition is (we believe) quite natural, and seems rather direct, a complication arises: the contributions of the data points towards the statistical discrepancy function (derived as a log-likelihood ratio) are no longer independent. This implies that K(B) and K(M) can no longer in general be scalar values (as they were with R(B) and R(M)); instead we need to pass in sets. Moreover, this means that unlike with traditional binary ranges, the value of \Phi no longer in general has a closed form; in particular the optimal rate parameters in the alternative hypothesis do not have a closed form. We circumvent this by describing a simple convex cost function for the rate parameters. And it turns out, these can then be effectively solved for with a few steps of gradient descent for each potential choice of K within the main scanning algorithm.

Our paper then focuses on the most intuitive Bernoulli model for how measured values are generated, but the procedures we derive apply similarly to the Poisson and Gaussian models we also derive. For instance, it turns out that the Gaussian-model kernel statistical discrepancy function has a closed form.

The second major challenge is that there is no longer a combinatorial limit on the number of distinct ranges to consider. There is an infinite set of potential centers c\in\mathbb{R}^{d} to consider, even with a fixed bandwidth, and each could correspond to a different \Phi(K) value. However, there is a Lipschitz property on \Phi(K) as a function of the choice of center c; that is, if we change c to c^{\prime} by a small amount, then we can upper bound the change from \Phi(K_{c}) to \Phi(K_{c^{\prime}}) by a linear function of \|c-c^{\prime}\|. This implies a finite resolution is needed on the set of center points: we can lay down a fixed resolution grid, and only consider those grid points. Notably, this property does not hold for the combinatorial SSS version, as a direct effect of the problematic boundary issue of the binary ranges.

We combine the insight of this Lipschitz property with the gradient descent used to evaluate \Phi(K_{c}) for a set of center points c, in a new algorithm KernelGrid. We then develop two improvements to this basic algorithm which make the grid adaptive in resolution, and round the effect of points far from the epicenter; embodied in our algorithm KernelFast, these considerably increase the efficiency of computing the statistic (by 30x) without significantly decreasing accuracy (with provable guarantees). Moreover, we create a coreset B_{\varepsilon} of the full data set B, with size independent of the original size |B|, that provably bounds the worst case error \varepsilon.

We empirically demonstrate the efficiency, scalability, and accuracy of these new KSSS algorithms. In particular, we show the KSSS has superior statistical power compared to traditional SSS algorithms, and exceeds the efficiency of even the heavily optimized version of those combinatorial Disk SSS algorithms.

2. Derivation of the Kernel Spatial Scan Statistic

In this section we provide a general definition for a spatial scan statistic, and then extend this to the kernel version. It turns out there are two reasonable variations of such statistics, which we call the continuous and binary settings. In each case, we then define the associated kernelized statistical discrepancy function \Phi under the Bernoulli (\Phi^{\textsf{Be}}), Poisson (\Phi^{\textsf{P}}), and Gaussian (\Phi^{\textsf{G}}) models. The two settings coincide in the Bernoulli model, but differ in the other two models.

General derivation. A spatial scan statistic considers a spatial data set B\subset\mathbb{R}^{d}, each data point x\in B endowed with a measured value m(x), and a family of measurement regions \mathcal{R}. Each region R\in\mathcal{R} specifies the way a data point x\in B is associated with the anomaly (e.g., affected or not affected). Then given a statistical discrepancy function \Phi which measures the anomalousness of a region R, the statistic is \max_{R\in\mathcal{R}}\Phi(R). To complete the definition, we need to specify \mathcal{R} and \Phi, which, in the way we define \mathcal{R}, are more intertwined than previously realized.

To define \Phi we assume a statistical model for how the values m(x) are realized, where data points x affected by the anomaly have m(x) generated at rate p and those unaffected generated at rate q.

Then we can define a null hypothesis that a potential anomalous region R has no effect on the rate parameters, so p=q; and an alternative hypothesis that the region does have an effect and (w.l.o.g.) p>q.

For both the null and alternative hypotheses, and a region R\in\mathcal{R}, we can then define a likelihood, denoted L_{0}(q) and L(p,q,R), respectively. The spatial scan statistic is then the log-likelihood ratio (LRT)

\Phi(R)=\log\left(\frac{\max_{p,q}L(p,q,R)}{\max_{q}L_{0}(q)}\right)=\left(\max_{p,q}\log L(p,q,R)\right)-\left(\max_{q}\log L_{0}(q)\right).

Now the main distinction with the kernel spatial scan statistic is that \mathcal{R} is specified with a family of kernels \mathcal{K}, so that each K\in\mathcal{K} specifies a probability K(x) that x\in B is affected by the anomaly. This is consistent with the traditional spatial scan statistic (e.g., with \mathcal{R} as disks \mathcal{D}), where this probability was always 0 or 1. Now this probability can be any continuous value. Then we can express the mean rate/intensity g(x) for each of the underlying distributions from which m(x) is generated, as a function of K(x), p, and q. Two natural and distinct models arise for kernel spatial scan statistics.

Continuous Setting. In the continuous setting, which will be our default model, we directly model the mean rate g(x) as a convex combination between p and q as,

g(x)=K(x)p+(1-K(x))q.

Thus each x (with nonzero K(x) value) has a slightly different rate. Considering the potentially hazardous chemical leak suspected of causing cancer, this setting implies that residents who live closer to the center of the potential leak (where K(x) is larger) would be affected more (have an elevated rate) compared to residents who live farther away. The kernel function K(x) models a decay effect from a center, and smooths out the effect of distance.

Binary Setting. In the second setting, the binary setting, the mean rate \breve{g}(x) is defined

\breve{g}(x)=\begin{cases}p&\text{ w.p. }K(x)\\ q&\text{ w.p. }(1-K(x)).\end{cases}

To clarify notation, each part of the model associated with this setting (e.g., \breve{g}) will carry a breve to distinguish it from the continuous setting. In this case, as with the traditional SSS, the rate parameter for each x is either p or q, and cannot take any other value. However, this rate assignment is not deterministic: the rate p is assigned with probability K(x), so points closer to the epicenter (larger K(x)) have a higher probability of being assigned the rate p. The rate \breve{g}(x) for each x is a mixture model with known mixture weight determined by K(x).

The null models. We show that the choice of model, binary setting vs. continuous setting, does not change the null hypothesis (e.g., \ell_{0}=\breve{\ell}_{0}).

2.1. Bernoulli

Under the Bernoulli model, the measured value m(x)\in\{0,1\}, and these values are generated independently. Consider a 1 value indicating that someone is diagnosed with cancer. Then an anomalous region may be associated with a leaky chemical plant, where the residents near the plant have an elevated rate of cancer p, whereas the background population may have a lower rate q of cancer. That is, cancer occurs through natural mutation at a rate q, but if exposed to certain chemicals, there is another mechanism to get cancer that occurs at rate p-q (for a total rate of q+(p-q)=p). Under the binary setting, any exposure to the chemical triggers this secondary mechanism, the chance of exposure is modeled as proportional to K(x), and the rate at x is modeled well by \breve{g}(x). Alternatively, the rate of the secondary mechanism may increase as the amount of exposure to the chemical increases (those living closer are exposed to more chemicals), with rate g(x) modeled in the continuous setting. These are both potentially the correct biological model, so we analyze both of them.

For this model we can define two subsets of B as M=\{x\in B\mid m(x)=1\} and B\setminus M=\{x\in B\mid m(x)=0\}.

In either setting the null likelihood is defined

L_{0}^{\textsf{Be}}(q)=\breve{L}_{0}^{\textsf{Be}}(q)=\prod_{x\in M}q\prod_{x\in B\setminus M}(1-q),

then

\ell_{0}^{\textsf{Be}}(q)=\breve{\ell}_{0}^{\textsf{Be}}(q)=\sum_{x\in M}\log(q)+\sum_{x\in B\setminus M}\log(1-q),

which is maximized over q at q=|M|/|B| as,

\ell_{0}^{\textsf{Be}^{*}}=\breve{\ell}_{0}^{\textsf{Be}^{*}}=|M|\log\frac{|M|}{|B|}+(|B|-|M|)\log\left(1-\frac{|M|}{|B|}\right).

The continuous setting. We first derive the continuous setting \Phi^{\textsf{Be}}, starting with the likelihood under the alternative hypothesis.

This is a product over the rates of the measured and baseline points:

L^{\textsf{Be}}(p,q,K)=\prod_{x\in M}g(x)\cdot\prod_{x\in B\setminus M}(1-g(x))

and so

\ell^{\textsf{Be}}(p,q,K)=\log(L^{\textsf{Be}}(p,q,K))=\sum_{x\in M}\log g(x)+\sum_{x\in B\setminus M}\log(1-g(x))
=\sum_{x\in M}\log(pK(x)+q(1-K(x)))+\sum_{x\in B\setminus M}\log(1-pK(x)-q(1-K(x))).

Unfortunately, we know of no closed form for the maximum of \ell^{\textsf{Be}}(p,q,K) over the choice of p,q, and therefore this form cannot be simplified further than

{\Phi^{\textsf{Be}}}^{*}(K)=\max_{p,q}\ell^{\textsf{Be}}(p,q,K)-{\ell_{0}^{\textsf{Be}}}^{*}.
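To make these quantities concrete, the following sketch (ours, in NumPy; the helper names are illustrative) evaluates \ell^{\textsf{Be}}(p,q,K), the maximized null term {\ell_{0}^{\textsf{Be}}}^{*}, and their difference for a fixed choice of p,q; the scan statistic then maximizes this difference over p,q (and over kernels K).

import numpy as np

def bernoulli_alt_loglik(p, q, K, m):
    # ell^Be(p,q,K) with g(x) = p K(x) + q (1 - K(x)); m is the 0/1 array m(x).
    g = p * K + q * (1.0 - K)
    return np.sum(m * np.log(g) + (1 - m) * np.log(1.0 - g))

def bernoulli_null_loglik(m):
    # Maximized null log-likelihood ell_0^Be*, with q = |M|/|B|.
    q0 = m.mean()
    n, n_m = len(m), m.sum()
    return n_m * np.log(q0) + (n - n_m) * np.log(1.0 - q0)

def phi_bernoulli(p, q, K, m):
    # Phi^Be for a fixed (p, q); the statistic maximizes this over p > q.
    return bernoulli_alt_loglik(p, q, K, m) - bernoulli_null_loglik(m)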

The binary setting. We continue by deriving the binary setting \breve{\Phi}^{\textsf{Be}}, starting with

\breve{L}^{\textsf{Be}}(p,q,K)=\prod_{x\in B}\left(p^{m(x)}(1-p)^{1-m(x)}K(x)+q^{m(x)}(1-q)^{1-m(x)}(1-K(x))\right),

and so

\breve{\ell}^{\textsf{Be}}(p,q,K)=\log(\breve{L}^{\textsf{Be}}(p,q,K))
=\sum_{x\in B}\log\left(p^{m(x)}(1-p)^{1-m(x)}K(x)+q^{m(x)}(1-q)^{1-m(x)}(1-K(x))\right).

Similarly to the continuous setting, there is no closed form for the maximum of \log(\breve{L}^{\textsf{Be}}(p,q,K)) over choices of p,q, so we write the form below.

\breve{\Phi}^{\textsf{Be}^{*}}(K)=\max_{p,q}\breve{\ell}^{\textsf{Be}}(p,q,K)-{\ell_{0}^{\textsf{Be}}}^{*}

Equivalence. It turns out, under the Bernoulli model, these two settings have equivalent statistics to optimize.

Lemma 2.1.

\breve{\ell}^{\textsf{Be}}(p,q,K) and \ell^{\textsf{Be}}(p,q,K) are exactly the same, hence \breve{\Phi}^{\textsf{Be}^{*}}={\Phi^{\textsf{Be}}}^{*}, which implies that the Bernoulli models under the two settings are equivalent to each other.

Proof.

We simply expand the binary setting as follows.

\breve{\ell}^{\textsf{Be}}(p,q,K)
=\sum_{x\in M}\log(p^{m(x)}(1-p)^{1-m(x)}K(x)+q^{m(x)}(1-q)^{1-m(x)}(1-K(x)))
+\sum_{x\in B\setminus M}\log(p^{m(x)}(1-p)^{1-m(x)}K(x)+q^{m(x)}(1-q)^{1-m(x)}(1-K(x)))
=\sum_{x\in M}\log(pK(x)+q(1-K(x)))+\sum_{x\in B\setminus M}\log(1-(p-q)K(x)-q)
=\sum_{x\in M}\log(g(x))+\sum_{x\in B\setminus M}\log(1-g(x))
=\ell^{\textsf{Be}}(p,q,K)

Since \breve{\ell}^{\textsf{Be}}(p,q,K)=\ell^{\textsf{Be}}(p,q,K) and \breve{\ell}_{0}^{\textsf{Be}^{*}}={\ell_{0}^{\textsf{Be}}}^{*}, then \breve{\Phi}^{\textsf{Be}^{*}}={\Phi^{\textsf{Be}}}^{*}. ∎

2.2. Gaussian

The Gaussian model can be used to analyze spatial datasets with continuous values m(x) (e.g., temperature, rainfall, or income), which we assume vary with a normal distribution with a fixed known standard deviation \sigma. Under this model, both the continuous and binary settings are again well motivated.

Consider an insect infestation, as it affects agriculture. Here we assume fields of crops are measured at discrete locations B, and each has an observed yield rate m(x), which under the null model varies normally around a value q. In the continuous setting, the yield rate at x\in B is affected proportionally to K(x), depending on how close it is to the epicenter. This may for instance model that fewer insects reach fields further from the epicenter, and the yield rate is affected relative to the number of insects that reach the field x. In the binary setting, it may be that if insects reach a field, then they dramatically change the yield rate (e.g., they eat and propagate until almost all of the crops are eaten). In the latter scenario, the correct model is the binary one, with a mixture model of two rates governed by K(x), the closeness to the epicenter.

In either setting the null likelihood is defined as,

L_{0}^{\textsf{G}}(q)=\breve{L}_{0}^{\textsf{G}}(q)=\prod_{x\in B}\exp\left(-\frac{(m(x)-q)^{2}}{2\sigma^{2}}\right),

then

\ell_{0}^{\textsf{G}}(q)=\breve{\ell}_{0}^{\textsf{G}}(q)=\sum_{x\in B}-\frac{(m(x)-q)^{2}}{2\sigma^{2}},

which is maximized over q at q=\frac{\sum_{x\in B}m(x)}{|B|}=\hat{m} as,

\ell_{0}^{\textsf{G}^{*}}=\breve{\ell}_{0}^{\textsf{G}^{*}}=\frac{-1}{2\sigma^{2}}\sum_{x\in B}(\hat{m}-m(x))^{2}.

The continuous setting. We first derive the continuous setting \Phi^{\textsf{G}}; terms free of p and q are treated as constants and ignored. We start with

L^{\textsf{G}}(p,q,K)=\prod_{x\in B}\exp\left(-\frac{(m(x)-g(x))^{2}}{2\sigma^{2}}\right)

and so

\ell^{\textsf{G}}(p,q,K)=\log(L^{\textsf{G}}(p,q,K))=\sum_{x\in B}-\frac{(m(x)-g(x))^{2}}{2\sigma^{2}}
=\sum_{x\in B}-\frac{(m(x)-pK(x)-q(1-K(x)))^{2}}{2\sigma^{2}}.

Fortunately, there is a closed form for the maximum of \ell^{\textsf{G}}(p,q,K) over the choice of p,q, obtained by setting \frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}p}=0 and \frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}q}=0. Hence we arrive at the closed form of the Gaussian kernel statistical discrepancy given by the theorem below.

Theorem 2.2.

The Gaussian kernel statistical discrepancy function is

\Phi^{\textsf{G}^{*}}(K)=\frac{1}{2\sigma^{2}}\left(\sum_{x\in B}(\bar{m}-m(x))^{2}-\sum_{x\in B}(m(x)-\hat{p}K(x)-\hat{q}(1-K(x)))^{2}\right)

where \bar{m}=\frac{1}{|B|}\sum_{x\in B}m(x), \hat{p}=\frac{K_{\pm}K_{-m}-K_{m}K_{-2}}{K_{\pm}^{2}-K_{2}K_{-2}}, \hat{q}=\frac{K_{m}K_{\pm}-K_{2}K_{-m}}{K_{\pm}^{2}-K_{2}K_{-2}}, using the following terms: K_{m}=\sum_{x\in B}K(x)m(x), K_{2}=\sum_{x\in B}K(x)^{2}, K_{\pm}=\sum_{x\in B}K(x)(1-K(x)), K_{-m}=\sum_{x\in B}m(x)(1-K(x)), and K_{-2}=\sum_{x\in B}(1-K(x))^{2}.

Proof.

For the alternative hypothesis, the log-likelihood is

\ell^{\textsf{G}}(p,q,K)=\frac{-1}{2\sigma^{2}}\sum_{x\in B}(m(x)-g(x))^{2}.

The optimal values of p,q minimize

-\ell^{\textsf{G}}(p,q,K)=\frac{1}{2\sigma^{2}}\sum_{x\in B}(m(x)-pK(x)-q(1-K(x)))^{2}.

By setting the partial derivatives with respect to p and q of -\ell^{\textsf{G}}(p,q,K) equal to 0, we have,

\frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}p}=\sum_{x\in B}K(x)[m(x)-K(x)p-(1-K(x))q]=0,

and

\frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}q}=\sum_{x\in B}(1-K(x))[m(x)-K(x)p-(1-K(x))q]=0,

and these two can be further simplified to,

\frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}p}=\sum_{x\in B}K(x)m(x)-p\sum_{x\in B}K(x)^{2}-q\sum_{x\in B}K(x)(1-K(x))=0,

and

\frac{\mathsf{d}\ell^{\textsf{G}}(p,q,K)}{\mathsf{d}q}=\sum_{x\in B}(1-K(x))m(x)-p\sum_{x\in B}(1-K(x))K(x)-q\sum_{x\in B}(1-K(x))^{2}=0.

Replacing these sums with the notation defined in the theorem,

K_{m}-pK_{2}-qK_{\pm}=0,

and

K_{-m}-pK_{\pm}-qK_{-2}=0.

Then we can solve for the optimal values of p and q as p=\frac{K_{\pm}K_{-m}-K_{m}K_{-2}}{K_{\pm}^{2}-K_{2}K_{-2}}=\hat{p} and q=\frac{K_{m}K_{\pm}-K_{2}K_{-m}}{K_{\pm}^{2}-K_{2}K_{-2}}=\hat{q}.

Hence we have the closed form

\ell^{\textsf{G}^{*}}=\max_{p,q}\ell^{\textsf{G}}(p,q,K)=\frac{-1}{2\sigma^{2}}\sum_{x\in B}(m(x)-\hat{p}K(x)-\hat{q}(1-K(x)))^{2}. ∎
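This closed form translates directly into a short computation; the following sketch (ours, in NumPy) computes \hat{p}, \hat{q}, and the resulting discrepancy for a single kernel K, following the terms defined in Theorem 2.2.

import numpy as np

def gaussian_discrepancy(K, m, sigma=1.0):
    # Closed-form Phi^G*(K): K is the array K(x), m the measured values m(x).
    Km  = np.sum(K * m)            # K_m
    K2  = np.sum(K ** 2)           # K_2
    Kpm = np.sum(K * (1.0 - K))    # K_{+-}
    Knm = np.sum(m * (1.0 - K))    # K_{-m}
    Kn2 = np.sum((1.0 - K) ** 2)   # K_{-2}
    det = Kpm ** 2 - K2 * Kn2
    p_hat = (Kpm * Knm - Km * Kn2) / det
    q_hat = (Km * Kpm - K2 * Knm) / det
    alt_rss  = np.sum((m - p_hat * K - q_hat * (1.0 - K)) ** 2)
    null_rss = np.sum((m.mean() - m) ** 2)
    # larger values indicate a better fit of the alternative over the null
    return (null_rss - alt_rss) / (2.0 * sigma ** 2), p_hat, q_hat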

The binary setting. Now we derive the binary setting \breve{\Phi}^{\textsf{G}}, starting with

\breve{L}^{\textsf{G}}(p,q,K)=\prod_{x\in B}\exp\left(-\frac{(m(x)-p)^{2}}{2\sigma^{2}}\right)K(x)+\exp\left(-\frac{(m(x)-q)^{2}}{2\sigma^{2}}\right)(1-K(x))

and so,

\breve{\ell}^{\textsf{G}}(p,q,K)
=\sum_{x\in B}\log\left(\exp\left(-\frac{(m(x)-p)^{2}}{2\sigma^{2}}\right)K(x)+\exp\left(-\frac{(m(x)-q)^{2}}{2\sigma^{2}}\right)(1-K(x))\right).

In contrast to the continuous setting, we know of no closed form for the maximum of \breve{\ell}^{\textsf{G}}(p,q,K) over p and q. Hence,

\breve{\Phi}^{\textsf{G}^{*}}(K)=\max_{p,q}\breve{\ell}^{\textsf{G}}(p,q,K)-{\ell_{0}^{\textsf{G}}}^{*}.

Equivalence. Unlike the Bernoulli model, the Gaussian models under the binary setting and the continuous setting are not equivalent to each other. Under the continuous setting,

m(x)\sim\mathcal{N}(g(x),\sigma),

however, under the binary setting, each data point follows a two-component Gaussian mixture model with mixture weights K(x) and 1-K(x), and so it is no longer a Gaussian distribution.

2.3. Poisson

In the Poisson model the measured value m(x) is a non-negative integer, m(x)\in\mathbb{Z}_{+}. This can for instance model the number of check-ins or comments m(x) posted at each geo-located business x\in B (a proxy, for instance, for the number of customers). An event, e.g., a festival, protest, or other large impromptu gathering, could be modeled spatially by a kernel K, and it affects the rates at each x in the two different settings.

In the continuous setting, the closer a restaurant is to the center of the event (modeled by K(x)), the more its usual number of check-ins (modeled by q) will trend towards p. On the other hand, in the binary setting, only certain businesses are affected (e.g., a coffee shop, but not a fancy dinner location), but if a business is affected, its rate is elevated all the way from q to p. Perhaps advertising at a festival encouraged people to patronize certain restaurants, or a protest encouraged them to give bad reviews to certain nearby restaurants, but not others. Hence, these two settings relate to two different ways an event could affect Poisson check-in or comment rates.

In either setting, the null likelihood is defined as,

L_{0}^{\textsf{P}}(q)=\breve{L}_{0}^{\textsf{P}}(q)=\prod_{x\in B}\frac{e^{-q}q^{m(x)}}{m(x)!},

then

\ell_{0}^{\textsf{P}}(q)=\breve{\ell}_{0}^{\textsf{P}}(q)=\sum_{x\in B}\log\left(\frac{e^{-q}q^{m(x)}}{m(x)!}\right)
=\sum_{x\in B}-q+m(x)\log(q)-\log(m(x)!),

which is maximized over q at q=\frac{\sum_{x\in B}m(x)}{|B|}=\hat{m} as,

\ell_{0}^{\textsf{P}^{*}}=\breve{\ell}_{0}^{\textsf{P}^{*}}=\sum_{x\in B}-\hat{m}+m(x)\log(\hat{m})-\log(m(x)!)
=-|B|\hat{m}+\sum_{x\in B}\left(m(x)\log(\hat{m})-\log(m(x)!)\right).

The continuous setting. We first derive the continuous setting \Phi^{\textsf{P}}, starting with,

L^{\textsf{P}}(p,q,K)=\prod_{x\in B}\frac{e^{-g(x)}g(x)^{m(x)}}{m(x)!}
=\prod_{x\in B}\frac{e^{-pK(x)-q(1-K(x))}(pK(x)+q(1-K(x)))^{m(x)}}{m(x)!},

so

\ell^{\textsf{P}}(p,q,K)
=\sum_{x\in B}-g(x)+m(x)\log(g(x))-\log(m(x)!)
=\sum_{x\in B}-(pK(x)+q(1-K(x)))+m(x)\log(pK(x)+q(1-K(x)))-\log(m(x)!).

There is no closed form for the maximum of \ell^{\textsf{P}}(p,q,K) over the choice of p and q, and hence

\Phi^{\textsf{P}^{*}}(K)=\max_{p,q}\ell^{\textsf{P}}(p,q,K)-\ell_{0}^{\textsf{P}^{*}}.

The binary setting. We continue by deriving the binary setting \breve{\Phi}^{\textsf{P}^{*}}, starting with

\breve{L}^{\textsf{P}}(p,q,K)=\prod_{x\in B}\frac{e^{-q}q^{m(x)}}{m(x)!}(1-K(x))+\frac{e^{-p}p^{m(x)}}{m(x)!}K(x),

so

\breve{\ell}^{\textsf{P}}(p,q,K)=\sum_{x\in B}\log\left(\frac{e^{-q}q^{m(x)}}{m(x)!}(1-K(x))+\frac{e^{-p}p^{m(x)}}{m(x)!}K(x)\right).

As in the continuous setting, there is no closed form for the maximum of \breve{\ell}^{\textsf{P}}(p,q,K), hence,

\breve{\Phi}^{\textsf{P}^{*}}(K)=\max_{p,q}\breve{\ell}^{\textsf{P}}(p,q,K)-\breve{\ell}_{0}^{\textsf{P}^{*}}.

Equivalence. In the Poisson model the binary setting and the continuous setting are not equivalent to each other. Under the continuous setting,

m(x)\sim\mathsf{Poi}(g(x)),

however, under the binary setting m(x) follows a Poisson mixture model with mixture weights K(x) and 1-K(x), which is no longer a Poisson distribution.

3. Computing the Approximate KSSS

The traditional SSS can combinatorially search over all disks (SSSS; MP18a; Kul97) to solve for or approximate \max_{D\in\mathcal{D}}\Phi(D), evaluating \Phi(D) exactly. Our new KSSS algorithms will instead search over a grid of possible centers c, and approximate \Phi(K_{c}) with gradient descent, yet will achieve the same sort of strong error guarantees as the combinatorial versions. Improvements will allow for adaptive gridding, pruning of far points, and sampling.

3.1. Approximating \Phi with GD

We cannot directly calculate \Phi(K)=\max_{p,q}\Phi_{p,q}(K), since it does not have a closed form. Instead we run gradient descent \mathsf{GradDesc} over -\Phi_{p,q}(K_{c}) on p,q for a fixed c. Since we have shown \Phi_{p,q}(K) is convex over p,q this will converge, and since Lemma A.8 bounds its Lipschitz constant by 2|B| it will converge quickly. In particular, from starting points p_{0},q_{0}, after s steps we can bound

|\Phi_{p^{*},q^{*}}(K)-\Phi_{p_{s},q_{s}}(K)|\leq\frac{|B|\|(p^{*},q^{*})-(p_{0},q_{0})\|}{s},

for the found rate parameters p_{s},q_{s}. Since 0\leq p,q\leq 1, after s_{\varepsilon}=|B|/\varepsilon steps we are guaranteed to have |\Phi_{p^{*},q^{*}}(K)-\Phi_{p_{s_{\varepsilon}},q_{s_{\varepsilon}}}(K)|\leq\varepsilon. We always initialize this procedure on \Phi_{p,q}(K_{c}) with the \hat{p},\hat{q} found on a nearby K_{c^{\prime}}, and as a result found that running s=3 or 4 steps is sufficient. Each step of gradient descent takes O(|B|) time to compute the gradient.
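A minimal sketch (ours, in NumPy) of this inner loop for the Bernoulli model: a few clipped gradient ascent steps on (p,q), which in the full algorithm are warm-started from the optimum found at a nearby center.

import numpy as np

def grad_desc_phi(K, m, p0=0.6, q0=0.5, steps=4, tol=1e-3):
    # Approximate max_{p,q} ell^Be(p,q,K) with a few ascent steps on (p, q).
    # K: kernel values K(x); m: 0/1 measured values m(x).
    n = len(m)
    p, q = p0, q0
    step = 1.0 / (2.0 * n)  # conservative step given the O(|B|) Lipschitz bound
    for _ in range(steps):
        g = p * K + q * (1.0 - K)
        dp = np.sum(m * K / g - (1 - m) * K / (1.0 - g))
        dq = np.sum(m * (1.0 - K) / g - (1 - m) * (1.0 - K) / (1.0 - g))
        p = float(np.clip(p + step * dp, tol, 1.0 - tol))
        q = float(np.clip(q + step * dq, tol, 1.0 - tol))
    g = p * K + q * (1.0 - K)
    ell = np.sum(m * np.log(g) + (1 - m) * np.log(1.0 - g))
    return p, q, ell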

3.2. Gridding and Pruning

Computing \Phi(K) on every center in \mathbb{R}^{2} is impossible, but Lemma A.4 shows that for \hat{c} close to the true maximizer c^{*}, \Phi(K_{\hat{c}}) will approximate the true maximum. From Lemma A.4 we have |\Phi(K_{\hat{c}})-\Phi(K_{c^{*}})|\leq\frac{1}{r}\sqrt{\frac{8}{e}}\|\hat{c}-c^{*}\|. To get a bound of |\Phi(K_{\hat{c}})-\Phi(K_{c^{*}})|\leq\varepsilon we need \|\hat{c}-c^{*}\|\frac{1}{r}\sqrt{\frac{8}{e}}\leq\varepsilon, or rearranging, \|\hat{c}-c^{*}\|\leq\varepsilon r\sqrt{\frac{e}{8}}. Placing center points on a regular grid with side length \tau_{\varepsilon}=\varepsilon r\sqrt{\frac{e}{4}} will ensure that a grid center point is close enough to the true maximizer.

Assume that B\subset\Omega_{\Lambda}\subset\mathbb{R}^{2}, where w.l.o.g. \Omega_{\Lambda}=[0,L]\times[0,L], and \Lambda=L/r is a unitless resolution parameter which represents the ratio of the domain size to the scale of the anomalies. Next define a regular orthogonal grid G_{\varepsilon} over \Omega_{\Lambda} at resolution \tau_{\varepsilon}. This contains |G_{\varepsilon}\cap\Omega_{\Lambda}|=\frac{\Lambda^{2}}{\varepsilon^{2}}\frac{4}{e} points. We compute the scan statistic \Phi(K_{c}) choosing as c each point in G_{\varepsilon}\cap\Omega_{\Lambda}. This algorithm, denoted KernelGrid and shown in Algorithm 1, can be seen to run in O(|G_{\varepsilon}\cap\Omega_{\Lambda}|\cdot|B|s_{\varepsilon})=O(\frac{\Lambda^{2}}{\varepsilon^{2}}|B|s_{\varepsilon}) time, using s_{\varepsilon} iterations of gradient descent (in practice s_{\varepsilon}=4).

Lemma 3.1.

KernelGrid(B,\varepsilon,\Omega_{\Lambda}) returns \Phi(K_{\hat{c}}) for a center \hat{c} so that |\max_{K_{c}\in\mathcal{K}_{r}}\Phi(K_{c})-\Phi(K_{\hat{c}})|\leq\varepsilon in time O(\frac{\Lambda^{2}}{\varepsilon^{2}}|B|s_{\varepsilon}), which in the worst case is O(\frac{\Lambda^{2}}{\varepsilon^{3}}|B|^{2}).

Algorithm 1 KernelGrid(B,\varepsilon,\Omega_{\Lambda})
 Initialize \Phi=0; define G_{\varepsilon,\Lambda}=G_{\varepsilon}\cap\Omega_{\Lambda}
for c\in G_{\varepsilon,\Lambda} do
  \Phi_{c}=\mathsf{GradDesc}(\Phi_{p,q}(K_{c})) over p,q on B
  if (\Phi_{c}>\Phi) then \Phi=\Phi_{c}
 return \Phi
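In Python, the pseudocode above could look roughly as follows (our sketch, reusing the hypothetical helpers gaussian_kernel and grad_desc_phi from the earlier sketches). Since \ell and \Phi differ only by the constant {\ell_{0}}^{*}, maximizing \ell over centers finds the same \hat{c}.

import numpy as np

def kernel_grid(B, m, r, eps, bounds):
    # Scan a regular grid of centers at resolution tau_eps = eps * r * sqrt(e)/2.
    (x0, x1), (y0, y1) = bounds
    tau = eps * r * np.sqrt(np.e) / 2.0
    best, best_c, p, q = -np.inf, None, 0.6, 0.5
    for cx in np.arange(x0, x1 + tau, tau):
        for cy in np.arange(y0, y1 + tau, tau):
            K = gaussian_kernel(B, np.array([cx, cy]), r)
            p, q, ell = grad_desc_phi(K, m, p0=p, q0=q)  # warm start from last center
            if ell > best:
                best, best_c = ell, (cx, cy)
    # subtract the (c-independent) null term from best to report Phi at best_c
    return best, best_c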

Adaptive Gridding. We next adjust the grid resolution based on the density of B. We partition the \Omega_{\Lambda} domain with a coarse grid H_{\varepsilon} with side length 2r_{\text{max}} (from Lemma A.3). For a cell \gamma\in H_{\varepsilon} in this grid, let S_{\gamma} denote the 6r_{\text{max}}\times 6r_{\text{max}} region which expands the grid cell \gamma by the length of one grid cell in each direction. For any center c\in\gamma, all points x\in B which are within a distance of 2r_{\text{max}} from c must be within S_{\gamma}. Hence, by Lemma A.3 we can evaluate \Phi(K^{\prime}_{c}) for any c\in\gamma by inspecting only S_{\gamma}\cap B.

Moreover, by the local density argument in Lemma A.5, we can describe a new grid G^{\prime}_{\varepsilon,\gamma} inside of each \gamma\in H_{\varepsilon} with center separation \beta depending only on the local number of points |S_{\gamma}\cap B|. In particular, for c,c^{\prime}\in\gamma with \|c-c^{\prime}\|=\beta we have

|\Phi(K^{\prime}_{c})-\Phi(K^{\prime}_{c^{\prime}})|\leq\beta\frac{|S_{\gamma}\cap B|}{|B|}\frac{2r_{\text{max}}}{r^{2}}+\varepsilon.

To guarantee that every c\in\gamma has a center c^{\prime}\in G^{\prime}_{\varepsilon,\gamma} so that |\Phi(K_{c})-\Phi(K^{\prime}_{c^{\prime}})|\leq 2\varepsilon, we set the side length in G^{\prime}_{\varepsilon,\gamma} as

\beta_{\gamma}=\varepsilon\frac{|B|}{|B\cap S_{\gamma}|}\frac{r^{2}}{2r_{\text{max}}},

so the number of grid points in G^{\prime}_{\varepsilon,\gamma} is

|G^{\prime}_{\varepsilon,\gamma}|=\frac{4r_{\text{max}}^{2}}{\beta_{\gamma}^{2}}=\frac{r^{4}_{\text{max}}}{r^{4}}\frac{16}{\varepsilon^{2}}\frac{|B\cap S_{\gamma}|^{2}}{|B|^{2}}.

Now define G^{\prime}_{\varepsilon}=\bigcup_{\gamma\in H_{\varepsilon}}G^{\prime}_{\varepsilon,\gamma} as the union of these adaptively defined subgrids over each coarse grid cell. Its total size is

|G^{\prime}_{\varepsilon}|=\sum_{\gamma\in H_{\varepsilon}}|G^{\prime}_{\varepsilon,\gamma}|=\sum_{\gamma\in H_{\varepsilon}}\frac{r^{4}_{\text{max}}}{r^{4}}\frac{16}{\varepsilon^{2}}\frac{|B\cap S_{\gamma}|^{2}}{|B|^{2}}
=\frac{r^{4}_{\text{max}}}{r^{4}}\frac{16}{\varepsilon^{2}}\cdot\frac{1}{|B|^{2}}\sum_{\gamma\in H_{\varepsilon}}|B\cap S_{\gamma}|^{2}
=\log^{2}(|B|/\varepsilon)\frac{16}{\varepsilon^{2}}\cdot\frac{1}{|B|^{2}}\sum_{\gamma\in H_{\varepsilon}}|B\cap S_{\gamma}|^{2}
\leq\log^{2}(|B|/\varepsilon)\frac{1296}{\varepsilon^{2}},

where the last inequality follows from Cauchy-Schwarz and the fact that each point x\in B appears in 9 cells S_{\gamma}. This replaces the dependence on the domain size \Lambda^{2} in the size of the grid G_{\varepsilon} with a mere logarithmic \log^{2}(|B|/\varepsilon) dependence on |B| in G^{\prime}_{\varepsilon}. We did not minimize constants, and in practice we use significantly smaller constants.

Moreover, since this holds for each c\in\gamma, and the same gridding mechanism is applied for each \gamma\in H_{\varepsilon}, this holds for all c\in\Omega_{\Lambda}. We call the algorithm that extends Algorithm 1 to use this grid G^{\prime}_{\varepsilon} in place of G_{\varepsilon} KernelAdaptive.
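To make the per-cell spacing concrete, here is a sketch (ours) of computing \beta_{\gamma} and the candidate centers for one coarse cell; it assumes r_{\text{max}} has already been fixed as in Lemma A.3.

import numpy as np

def subgrid_for_cell(cell_origin, n_local, n_total, r, r_max, eps):
    # beta_gamma = eps * (|B| / |B ∩ S_gamma|) * r^2 / (2 r_max), inside a
    # 2 r_max x 2 r_max coarse cell; returns the candidate center locations.
    beta = eps * (n_total / max(n_local, 1)) * r ** 2 / (2.0 * r_max)
    ticks = np.arange(0.0, 2.0 * r_max, beta)
    gx, gy = np.meshgrid(cell_origin[0] + ticks, cell_origin[1] + ticks)
    return np.column_stack([gx.ravel(), gy.ravel()])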

Lemma 3.2.

KernelAdaptive(B,\varepsilon,\Omega_{\Lambda}) returns \Phi(K_{\hat{c}}) for a center \hat{c} so that |\max_{K_{c}\in\mathcal{K}_{r}}\Phi(K_{c})-\Phi(K_{\hat{c}})|\leq\varepsilon in time O(\frac{1}{\varepsilon^{2}}\log^{2}\frac{|B|}{\varepsilon}|B|s_{\varepsilon}), which in the worst case is O(\frac{1}{\varepsilon^{3}}|B|^{2}\log^{2}\frac{|B|}{\varepsilon}).

Pruning. For both gridding methods the runtime is roughly the number of centers times the O(|B|) time to compute the gradient. But via Lemma A.3 we can ignore the contribution of far away points, and thus only need the nearby points in the gradient computation.

However, this provides no worst-case asymptotic improvement in runtime for KernelGrid or KernelAdaptive, since all of B may reside in a single r_{\text{max}}\times r_{\text{max}} cell. But in the practical settings we consider, this does provide a significant speedup, as the data is usually spread over a domain many times the size of r_{\text{max}}.

We define two new methods, KernelPrune and KernelFast (shown in Algorithm 2), where the former extends the KernelGrid method and the latter extends KernelAdaptive with pruning. In particular, we note that the bounds in Lemma 3.2 hold for KernelFast.

Algorithm 2 KernelFast(B,\varepsilon,\Omega_{\Lambda})
 Initialize \Phi=0; define grid H_{\varepsilon} over \Omega_{\Lambda}
for \gamma\in H_{\varepsilon} do
  Define G^{\prime}_{\varepsilon,\gamma} adaptively on \gamma and S_{\gamma}\cap B
  for c\in G^{\prime}_{\varepsilon,\gamma} do
   \Phi_{c}=\mathsf{GradDesc}(\Phi_{p,q}(K_{c})) over p,q on pruned set B\cap S_{\gamma}
   if (\Phi_{c}>\Phi) then \Phi=\Phi_{c}
 return \Phi
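A compact sketch (ours) of this pruned adaptive loop, built on the hypothetical helpers from the previous sketches; the handling of points outside S_{\gamma} is simplified here relative to the accounting in Lemma A.3.

import numpy as np

def kernel_fast(B, m, r, r_max, eps, bounds):
    # Coarse cells of side 2*r_max; adaptive subgrid per cell; pruning to S_gamma.
    (x0, x1), (y0, y1) = bounds
    side = 2.0 * r_max
    best, best_c = -np.inf, None
    for cx0 in np.arange(x0, x1, side):
        for cy0 in np.arange(y0, y1, side):
            # S_gamma: the coarse cell padded by one cell in each direction
            mask = ((B[:, 0] >= cx0 - side) & (B[:, 0] < cx0 + 2 * side) &
                    (B[:, 1] >= cy0 - side) & (B[:, 1] < cy0 + 2 * side))
            Bs, ms = B[mask], m[mask]
            if len(Bs) == 0:
                continue
            centers = subgrid_for_cell((cx0, cy0), len(Bs), len(B), r, r_max, eps)
            for c in centers:
                K = gaussian_kernel(Bs, c, r)
                _, _, ell = grad_desc_phi(K, ms)
                if ell > best:
                    best, best_c = ell, tuple(c)
    return best, best_c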

3.3. Sampling

We can dramatically improve runtimes on large data sets by sampling a coreset B_{\varepsilon} iid from B, according to Lemma A.7. With probability 1-\delta we need |B_{\varepsilon}|=O(\frac{1}{\varepsilon^{2}}\log^{2}\frac{1}{\varepsilon}\log\frac{\kappa}{\delta}) samples, where \kappa is the number of center evaluations, and can be set to the grid size. For KernelGrid \kappa=|G_{\varepsilon}|=O(\frac{\Lambda^{2}}{\varepsilon^{2}}), and in KernelAdaptive \kappa=|G^{\prime}_{\varepsilon}|=O(\frac{1}{\varepsilon^{2}}\log^{2}\frac{|B_{\varepsilon}|}{\varepsilon})=O(\frac{1}{\varepsilon^{2}}\log^{2}\frac{1}{\varepsilon}). We restate the runtime bounds with sampling to show they are independent of |B|.

Lemma 3.3.

KernelGrid(B_{\varepsilon},\varepsilon,\Omega_{\Lambda}) and KernelPrune(B_{\varepsilon},\varepsilon,\Omega_{\Lambda}) with sample size |B_{\varepsilon}|=O(\frac{1}{\varepsilon^{2}}\log\frac{\Lambda}{\varepsilon\delta}) return \Phi(K_{\hat{c}}) for a center \hat{c} so that |\max_{K_{c}\in\mathcal{K}_{r}}\Phi(K_{c})-\Phi(K_{\hat{c}})|\leq\varepsilon in time O(\frac{s_{\varepsilon}}{\varepsilon^{4}}\log\frac{\Lambda}{\varepsilon\delta}), with probability 1-\delta. In the worst case the runtime is O(\frac{\Lambda^{2}}{\varepsilon^{7}}\log^{2}\frac{\Lambda}{\varepsilon\delta}).

Lemma 3.4.

KernelAdaptive(B_{\varepsilon},\varepsilon,\Omega_{\Lambda}) and KernelFast(B_{\varepsilon},\varepsilon,\Omega_{\Lambda}) with sample size |B_{\varepsilon}|=O(\frac{1}{\varepsilon^{2}}\log\frac{1}{\varepsilon\delta}) return \Phi(K_{\hat{c}}) for a center \hat{c} so that |\max_{K_{c}\in\mathcal{K}_{r}}\Phi(K_{c})-\Phi(K_{\hat{c}})|\leq\varepsilon in time O(\frac{s_{\varepsilon}}{\varepsilon^{4}}\log^{3}\frac{1}{\varepsilon\delta}), with probability 1-\delta. In the worst case the runtime is O(\frac{1}{\varepsilon^{7}}\log^{4}\frac{1}{\varepsilon\delta}).

3.4. Multiple Bandwidths

We next show how to choose a sequence of bandwidths such that one of them is close to the r used in any K\in\mathcal{K} (assuming a reasonable but large range on the values r), and then take the maximum over all of these runs. If the sequence of bandwidths r_{0},\ldots,r_{s} is such that r_{i+1}-r_{i}\leq\frac{er_{i}\varepsilon}{4}, then |\Phi(K_{r_{i}})-\Phi(K_{r})|\leq\varepsilon for any r\in[r_{i},r_{i+1}].

Lemma 3.5.

A geometrically increasing sequence of bandwidths R_{\text{min}}=r_{0},\ldots,r_{s}=R_{\text{max}} with s=O(\frac{1}{\varepsilon}\log\frac{R_{\text{max}}}{R_{\text{min}}}) is sufficient so that |\max_{i}\Phi(K_{r_{i}})-\Phi(K_{r})|\leq\varepsilon for any bandwidth r\in[R_{\text{min}},R_{\text{max}}].

Proof.

To guarantee an \varepsilon error for a bandwidth r, there must be a nearby bandwidth r_{i}: if |\Phi(K_{r_{i}})-\Phi(K_{r_{i+1}})|\leq\varepsilon then |\Phi(K_{r_{i}})-\Phi(K_{r})|\leq\varepsilon for r\in[r_{i},r_{i+1}].

From Lemma A.6 we can use the Lipschitz bound at r_{i} to note that |\Phi(K_{r_{i}})-\Phi(K_{r_{i+1}})|\leq\frac{4}{er_{i}}(r_{i+1}-r_{i}). Setting this less than \varepsilon we can rearrange to get that r_{i+1}\leq(\frac{e}{4}\varepsilon+1)r_{i}. That is, r_{0}(\frac{\varepsilon e}{4}+1)^{s}\geq r_{s}, which can be rearranged to get s=\frac{\log(R_{\text{max}}/R_{\text{min}})}{\log(\frac{\varepsilon e}{4}+1)}. Since \log(x+1)\geq\frac{x}{2} when x\in(0,1), we have s\leq\frac{8}{\varepsilon e}\log\frac{R_{\text{max}}}{R_{\text{min}}}. ∎

Running our KSSS over a large sequence of bandwidths is simple and merely increases the runtime by a O(\frac{1}{\varepsilon}\log\frac{R_{\text{max}}}{R_{\text{min}}}) factor. Our experiments in Section 4.4 suggest that choosing 4 to 6 bandwidths should be sufficient (e.g., for scales R_{\text{max}}/R_{\text{min}}=1{,}000).
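The geometric schedule of Lemma 3.5 is straightforward to generate; a sketch (ours):

import numpy as np

def bandwidth_schedule(r_min, r_max, eps):
    # r_{i+1} = (1 + e*eps/4) * r_i, covering [r_min, r_max].
    ratio = 1.0 + np.e * eps / 4.0
    s = int(np.ceil(np.log(r_max / r_min) / np.log(ratio)))
    return r_min * ratio ** np.arange(s + 1)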

Refer to caption
Refer to caption
Refer to caption
Figure 2. New KSSS algorithms compared by statistical power as sample size increases, using \|c-\hat{c}\|, \Phi(K_{\hat{c}}), and \mathsf{JS}(K_{c},K_{\hat{c}}).

4. Experiments

We compare our new KSSS algorithms to the state-of-the-art methods in terms of empirical efficiency, statistical power, and sample complexity, on large spatial data sets with planted anomalies.

Data sets. We run experiments on two large spatial data sets recording incidents of crime; these are used to represent the baseline data B. The first contains geo-locations of all crimes in Philadelphia from 2006-2015, and has a total size of |B|=687{,}636; a subsample is shown in Figure 1. The second is the well-known Chicago Crime Dataset from 2001-2017, and has a total size of |B|=6{,}886{,}676, which is 10x the size of the Philadelphia set.

In modeling crime hot spots, these may often be associated with an individual or group of individuals who live at a fixed location. Then the crimes they may commit would often be centered at that location, and be more likely to happen nearby, and less likely further away. A natural way to model this decaying crime likelihood would be with a Gaussian kernel — as opposed to a uniform probability within a fixed radius, and no increased probability outside that zone. Hence, our KSSS is a good model to potentially detect such spatial anomalies.

Planting anomalous regions. To conduct controlled experiments, we use the spatial data sets B above, but choose the m values in a synthetic way. In particular, we plant anomalous regions K_{c}\in\mathcal{K}_{r}, and then each data point x\in B is assigned to a group P (with probability K(x)) or Q (otherwise). Those x\in P are assigned m(x) through a Bernoulli process at rate p, that is m(x)=1 with probability p and 0 otherwise; those x\in Q are assigned m(x) at rate q. Given a region K_{c}, this could model a pattern of crimes (those with m(x)=1 may be all vehicle thefts or have a suspect matching a description), where c may represent the epicenter of the targeted pattern of crime. We use p=0.8 and q=0.5.

We repeat this experiment with 20 planted regions and plot the median on the Philadelphia data set to compare our new algorithms and to compare sample complexity properties against existing algorithms. We use 3 planted regions on the Chicago data set to compare scalability (these take considerably longer to run). We attempt to fix the size of P so |P|=f|B|, by adjusting the fixed and known bandwidth parameter r of each planted region. We set f=0.03 for Philadelphia and f=0.01 for Chicago, so each planted region is fairly small, containing about 3% or 1% of the data.
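A sketch (ours, in NumPy) of this planting procedure, with the group assignment and the two Bernoulli rates made explicit:

import numpy as np

def plant_anomaly(B, c, r, p=0.8, q=0.5, rng=None):
    # Assign each x in B to group P with probability K(x), then draw
    # m(x) ~ Bernoulli(p) for x in P and m(x) ~ Bernoulli(q) otherwise.
    rng = np.random.default_rng() if rng is None else rng
    K = np.exp(-np.sum((B - c) ** 2, axis=1) / r ** 2)
    in_P = rng.random(len(B)) < K
    rates = np.where(in_P, p, q)
    return (rng.random(len(B)) < rates).astype(int)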

Evaluating the models. A statistical power test plants an anomalous region (for instance as described above), and then determines how often an algorithm can recover that region; it measures recall. However, the considered algorithms typically do not recover exactly the planted region, so we measure how close the recovered region K_{\hat{c}} is to the planted region K_{c}. To do so we measure the following (a sketch of the Jaccard computation follows the list):

  • the distance between found centers \|c-\hat{c}\|; smaller is better.

  • \Phi(K_{\hat{c}}); larger is better (it may be larger than \Phi(K_{c})).

  • the extended Jaccard similarity \mathsf{JS}(K_{c},K_{\hat{c}}), defined

    \mathsf{JS}(K,\hat{K})=\frac{\langle K,\hat{K}\rangle_{B}}{\langle K,K\rangle_{B}+\langle\hat{K},\hat{K}\rangle_{B}-\langle K,\hat{K}\rangle_{B}}

    where \langle K,\hat{K}\rangle_{B}=\sum_{x\in B}K(x)\hat{K}(x); larger is better.
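For completeness, the extended Jaccard similarity can be computed directly from the two kernel value vectors over B (our sketch):

import numpy as np

def extended_jaccard(K, K_hat):
    # JS(K, K_hat) with <K, K'>_B = sum_x K(x) K'(x).
    inner = np.sum(K * K_hat)
    return inner / (np.sum(K * K) + np.sum(K_hat * K_hat) - inner)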

We plot medians over 20 trials; the targeted and hence measured values have variance because planted regions may not be the optimal region, since the m(x) values are generated under a random process. When we cannot control the x-value (when using time) we plot a kernel smoothing over different parameters on 3 trials.

Refer to caption
Figure 3. Runtime of new KSSS algorithms as a function of sample size.
Refer to caption
Refer to caption
Refer to caption
Figure 4. New KSSS vs. Disk SSS algorithms: statistical power as a function of sample size, using \|c-\hat{c}\|, \Phi(K_{\hat{c}}), and \mathsf{JS}(K_{c},K_{\hat{c}}).
Refer to caption
Refer to caption
Refer to caption
Figure 5. Accuracy measures as a function of runtime using \|c-\hat{c}\|, \Phi(K_{\hat{c}}), and \mathsf{JS}(K_{c},K_{\hat{c}}).

4.1. Comparing New KSSS Algorithms

We first compare the new KSSS algorithms against each other, as we increase the sample size |B_{\varepsilon}| and set the corresponding gridding and pruning parameters to match the expected error \varepsilon from sample size |B_{\varepsilon}|, as dictated in Section 3.2.

We observe in Figure 2 that all of the new KSSS algorithms achieve high power at about the same rate. In particular, when the sample size reaches about |B_{\varepsilon}|=1{,}000, they have all plateaued near their best values, with large power: the center distance is close to 0, \Phi(K_{\hat{c}}) is near its maximum, and \mathsf{JS}(K_{c},K_{\hat{c}}) is almost 0.9. At medium sample sizes |B_{\varepsilon}|=500, KernelAdaptive and KernelFast have worse accuracy, yet reach maximum power around the same sample size; so for very small sample sizes, we recommend KernelPrune.

In Figure 3 we see that the improvements from KernelGrid up to KernelFast are tremendous: a speed-up of roughly 20x to 30x. By comparing KernelPrune and KernelAdaptive we see most of the improvement comes from the adaptive gridding, but the pruning is also important, itself adding a 2x to 3x speed-up.

4.2. Power vs. Sample Size

We next compare our KSSS algorithms against existing, standard Disk SSS algorithms. As comparison, we first consider a fast reimplementation of SatScan (Kul97; Kul7.0) in C++. To even the comparison, we consider the exact same center set (defined on grid G_{\varepsilon}) for potential epicenters, and consider all possible radii of disks. Second, we compare against a heavily-optimized DiskScan algorithm (MP18b) for disks, which chooses a very small "net" of points to combinatorially reduce the set of disks scanned, but still guarantees \varepsilon-accuracy (in some sense similar to our adaptive approaches). For these algorithms we maximize Kulldorff's Bernoulli likelihood function (Kul97), whose \log has a closed form over binary ranges D\in\mathcal{D}.

Figure 4 shows the power versus sample size (representing how many data points are available), using the same metrics as before. The KSSS algorithms perform consistently and significantly better; to see this, consider a fixed y value in each plot. For instance, the KSSS algorithms reach \|c-\hat{c}\|<0.05, \Phi(K_{\hat{c}})>0.003, and \mathsf{JS}(K_{c},K_{\hat{c}})>0.8 after about 1000 data samples, whereas it takes the Disk SSS algorithms about 2500 data samples.

4.3. Power vs. Time

We next measure the power as a function of the runtime of the algorithms, again comparing the new KSSS algorithms with the traditional Disk SSS algorithms. We increase the sample size |B_{\varepsilon}| as before, now from the Chicago dataset, and adjust the other error parameters accordingly to match the theoretical error.

Figure 5 shows KernelFast significantly outperforms SatScan and DiskScan in these measures, in orders of magnitude less time. It reaches a small distance to the planted center faster (10 seconds vs. 1000 or more seconds). In 5 seconds it achieves a \Phi^{*} of 0.0006, and 0.00075 in 100 seconds, whereas in 1000 seconds the Disk SSS algorithms only reach 0.0004. Similarly for Jaccard similarity, KernelFast reaches 0.8 in 5 seconds and 0.95 in 100 seconds, whereas in 1000 seconds the Disk SSS algorithms only reach 0.5.

Refer to caption
Refer to caption
Figure 6. Accuracy as a function of the bandwidth r of the planted region.

4.4. Sensitivity to Bandwidth

So far we chose r to be the bandwidth of the planted anomaly K_{c}\in\mathcal{K}_{r} (this is natural if we know the nature of the event). But if the true anomaly bandwidth is not known, or only known within some range, then our method should be insensitive to this parameter. On the Philadelphia dataset we consider 30 geometrically increasing bandwidths scaled so that for the original bandwidth r we considered rs where s\in[10^{-2},10]. In Figure 6 we show the accuracy using the Jaccard similarity and the \Phi-value found, over 20 trials. Our KSSS algorithms are effective at fitting events with s\in[0.5,2], indicating quite a bit of leeway in which r to use. That is, the sample complexity would not change, but the time complexity may increase by a factor of only 2x to 5x if we also search over a range of r.

5. Conclusion

In this work, we generalized the spatial scan statistic so that ranges can be more flexible in their boundary conditions. In particular, this allows the anomalous regions to be defined by a kernel, so the anomaly is most intense at an epicenter, and its effect decays gradually moving away from that center. However, given this new definition, it is no longer possible to define and reason about a finite number of combinatorially defined anomalous ranges. Moreover, the log-likelihood ratio tests derived do not have closed form solutions, and as a result we develop new algorithmic techniques to deal with these two issues. These new algorithms are guaranteed to approximately detect the kernel range which maximizes the new discrepancy function up to any error precision, and the runtime depends only on the error parameter. We also conducted controlled experiments on planted anomalies which conclusively demonstrated that our new algorithms can detect regions with fewer samples and in less time than the traditional disk-based combinatorial algorithms made popular by the SatScan software. That is, we show that the newly proposed Kernel Spatial Scan Statistic theoretically and experimentally outperforms the existing Spatial Scan Statistic methods.

References

  • [1] D. Agarwal, A. McGregor, J. M. Phillips, S. Venkatasubramanian, and Z. Zhu. Spatial scan statistics: Approximations and performance study. In KDD, 2006.
  • [2] E. Arias-Castro, R. M. Castro, E. Tánczos, and M. Wang. Distribution-free detection of structured anomalies: Permutation and rank-based scans. JASA, 113:789–801, 2018.
  • [3] P. J. Diggle. A point process modelling approach to raised incidence of a rare phenomenon in the vicinity of a prespecified point. J. R. Statist. Soc. A, 153(3):349–362, 1990.
  • [4] L. Duczmal and R. Assunção. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics & Data Analysis, 45(2):269 – 286, 2004.
  • [5] L. Duczmal, A. L. Cançado, R. H. Takahashi, and L. F. Bessegato. A genetic algorithm for irregularly shaped spatial scan statistics. Computational Statistics & Data Analysis, 52(1):43 – 52, 2007.
  • [6] Y. N. Dylan Fitzpatrick and D. B. Neill. Support vector subset scan for spatial outbreak detection. Online Journal of Public Health Informatics, 2017.
  • [7] E. Eftelioglu, S. Shekhar, D. Oliver, X. Zhou, M. R. Evans, Y. Xie, J. M. Kang, R. Laubscher, and C. Farah. Ring-shaped hotspot detection: A summary of results. Proceedings - IEEE International Conference on Data Mining, ICDM, 2015:815–820, 01 2015.
  • [8] W. Herlands, E. McFowland, A. Wilson, and D. Neill. Gaussian process subset scanning for anomalous pattern detection in non-iid data. In AIStats, 2018.
  • [9] L. Huang, M. Kulldorff, and D. Gregorio. A spatial scan statistic for survival data. Biometrics, 63(1):109–118, 2007.
  • [10] I. Jung, M. Kulldorff, and A. C. Klassen. A spatial scan statistic for ordinal data. Statistics in medicine, 26(7):1594–1607, 2007.
  • [11] I. Jung, M. Kulldorff, and O. J. Richard. A spatial scan statistic for multinomial data. Statistics in medicine, 29(18):1910–1918, 2010.
  • [12] M. Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26:1481–1496, 1997.
  • [13] M. Kulldorff. SatScan User Guide. http://www.satscan.org/, 9.6 edition, 2018.
  • [14] M. Kulldorff, L. Huang, and K. Konty. A scan statistic for continuous data based on the normal probability model. Inter. Journal of Health Geographics, 8:58, 2009.
  • [15] M. Matheny and J. M. Phillips. Computing approximate statistical discrepancy. ISAAC, 2018.
  • [16] M. Matheny and J. M. Phillips. Practical low-dimensional halfspace range space sampling. ESA, 2018.
  • [17] M. Matheny, R. Singh, L. Zhang, K. Wang, and J. M. Phillips. Scalable spatial scan statistics through sampling. In SIGSPATIAL, 2016.
  • [18] D. B. Neill and A. W. Moore. Rapid detection of significant spatial clusters. In KDD, 2004.
  • [19] D. B. Neill, A. W. Moore, and G. F. Cooper. A bayesian spatial scan statistic. In Advances in neural information processing systems, pages 1003–1010, 2006.
  • [20] G. P. Patil and C. Taillie. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics, 11(2):183–197, Jun 2004.
  • [21] S. Speakman, S. Somanchi, E. McFowland III, and D. B. Neill. Penalized fast subset scanning. Journal of Computational and Graphical Statistics, 25(2):382–404, 2016.
  • [22] T. Tango and K. Takahashi. A flexibly shaped spatial scan statistic for detecting clusters. Inter. J. of Health Geographics, 4:11, 2005.
  • [23] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theo. of Prob and App, 16:264–280, 1971.

Appendix A Approximating the Bernoulli KSSS

In the following sections, we focus on the Bernoulli model and for notational simplicity write $\Phi=\Phi^{\textsf{Be}}$ and $\ell=\ell^{\textsf{Be}}$. Sometimes we also use $\Phi_{p,q}(K)=\Phi(p,q,K)$.

The values of $p$ and $q$ are bounded between $0$ and $1$, but at these extremes $\Phi_{p,q}(K)$ can be unbounded. Instead we bound structural properties of $\Phi$ by assuming $p=p^{*}$ and $q=q^{*}$.

Lemma A.1.

When $(p,q)=(p^{*},q^{*})=\mathrm{argmax}_{p,q}\Phi(p,q,K)$, then
\[
|B|=\sum_{x\in B\setminus M}\frac{1}{1-g(x)} \quad\text{ and }\quad |B|=\sum_{x\in M}\frac{1}{g(x)}.
\]
Proof.

Consider first the derivatives of the alternative hypothesis $\ell=\ell(p,q,K)$ with respect to $p$ and $q$:
\[
\frac{\mathsf{d}\ell}{\mathsf{d}p} = \sum_{x\in M}\frac{K(x)}{g(x)}-\sum_{x\in B\setminus M}\frac{K(x)}{1-g(x)}
\qquad
\frac{\mathsf{d}\ell}{\mathsf{d}q} = \sum_{x\in M}\frac{1-K(x)}{g(x)}-\sum_{x\in B\setminus M}\frac{1-K(x)}{1-g(x)}.
\]
From this we can see that at the maximum $p^{*},q^{*}$, where $g(x)=p^{*}K(x)+q^{*}(1-K(x))$,
\[
\frac{\mathsf{d}\ell}{\mathsf{d}p}p^{*}+\frac{\mathsf{d}\ell}{\mathsf{d}q}q^{*}=\sum_{x\in M}\frac{g(x)}{g(x)}-\sum_{x\in B\setminus M}\frac{g(x)}{1-g(x)}=0.
\]
Since $\sum_{x\in M}\frac{g(x)}{g(x)}=|M|$, it follows that $\sum_{x\in B\setminus M}\frac{g(x)}{1-g(x)}=|M|$, and then we can derive
\[
|M|=\sum_{x\in B\setminus M}\frac{1+g(x)-1}{1-g(x)}=\sum_{x\in B\setminus M}\frac{1}{1-g(x)}-|B\setminus M|.
\]
As a result
\[
|B|=\sum_{x\in B\setminus M}\frac{1}{1-g(x)}.
\]
Similarly
\[
\frac{\mathsf{d}\ell}{\mathsf{d}p}(1-p^{*})+\frac{\mathsf{d}\ell}{\mathsf{d}q}(1-q^{*})= \sum_{x\in M}\frac{1-g(x)}{g(x)}-\sum_{x\in B\setminus M}\frac{1-g(x)}{1-g(x)},
\]
and following similar steps we ultimately find
\[
|B|=\sum_{x\in M}\frac{1}{g(x)}. \qquad\qed
\]

This implies simple bounds on $g(x)$ as well.

Lemma A.2.

When $(p,q)=(p^{*},q^{*})=\mathrm{argmax}_{p,q}\Phi(p,q,K)$, then
\[
g(x)\in\left[\frac{1}{|B|},1\right]\ \text{if } x\in M \quad\text{ and }\quad g(x)\in\left[0,\frac{|B|-1}{|B|}\right]\ \text{if } x\in B\setminus M.
\]
Proof.

From Lemma A.1 we can state that
\[
|B|=\sum_{x\in M}\frac{1}{g(x)}\geq\frac{1}{g(x)}
\]
for any $x\in M$, and therefore $g(x)\geq\frac{1}{|B|}$ for $x\in M$. In the case that $x\in B\setminus M$ we have that
\[
|B|=\sum_{x\in B\setminus M}\frac{1}{1-g(x)}\geq\frac{1}{1-g(x)}
\]
for any $x\in B\setminus M$. Therefore $1-g(x)\geq\frac{1}{|B|}$, that is $g(x)\leq\frac{|B|-1}{|B|}$. ∎

A.1. Spatial Approximation of the KSSS

In this section we prove lemmas that can be used to spatially approximate the KSSS, either by ignoring far-away points or by restricting the set of candidate centers.

Truncating Kernels. We argue here that we can consider a simpler set of truncated kernels $\mathcal{K}_{r,\varepsilon}$ in place of $\mathcal{K}_{r}$, replacing each $K_{c}\in\mathcal{K}_{r}$ with a corresponding $K^{\prime}_{c}\in\mathcal{K}_{r,\varepsilon}$, without affecting $\ell(p,q,K)$, and hence $\Phi(K)$, by more than an additive $\varepsilon$-error. Specifically, define any $K^{\prime}_{c}\in\mathcal{K}_{r,\varepsilon}$ using $r_{\text{max}}=r\sqrt{\log(|B|/\varepsilon)}$ as
\[
K^{\prime}_{c}(x)=\begin{cases}K_{c}(x)=\exp(-\|x-c\|^{2}/r^{2})&\text{ if }\|x-c\|\leq r_{\text{max}}\\ 0&\text{ otherwise.}\end{cases}
\]
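A minimal sketch of this truncation (the numpy layout and helper name are our own; it simply evaluates $K^{\prime}_{c}$ as defined above):

```python
import numpy as np

def truncated_kernel(X, c, r, eps, n):
    """Evaluate the truncated Gaussian kernel K'_c at points X.

    X   : (m, 2) array of points
    c   : (2,) center
    r   : bandwidth
    eps : additive error target
    n   : |B|, the size of the baseline set
    """
    r_max = r * np.sqrt(np.log(n / eps))      # truncation radius from Lemma A.3
    d = np.linalg.norm(X - c, axis=1)
    K = np.exp(-(d ** 2) / r ** 2)
    K[d > r_max] = 0.0                        # points beyond r_max contribute 0
    return K
```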
Lemma A.3.

For any data set $B$, center $c\in\mathbb{R}^{d}$, and error $\varepsilon>0$,
\[
|\Phi(K_{c})-\Phi(K^{\prime}_{c})|\leq\varepsilon.
\]
Proof.

The key observation is that for $\|x-c\|>r_{\text{max}}$ we have $K(x)\leq\exp(-\log(|B|/\varepsilon))=\varepsilon/|B|$. Since $0\leq q,p\leq 1$, then
\[
|g(x)-g^{\prime}(x)|\leq\varepsilon/|B|
\]
as well, where $g^{\prime}(x)=q+(p-q)K^{\prime}(x)$.

Next we rewrite $\ell(p,q,K)$ as
\[
\ell(p,q,K)=\frac{1}{|B|}\sum_{x\in B}\log\left(\begin{cases}g(x)&\text{ if }m(x)=1\\ 1-g(x)&\text{ if }m(x)=0\end{cases}\right).
\]
Since $g(x)>1/|B|$ for $x\in M$ and $1-g(x)>1/|B|$ for $x\in B\setminus M$ by Lemma A.2, we can analyze each term of the average as $\log(\cdot)$ of a quantity at least $1/|B|$. If that quantity changes by at most $\varepsilon/|B|$, then that term changes by at most
\[
\log(1/|B|+\varepsilon/|B|)-\log(1/|B|)\leq(\varepsilon/|B|)\cdot|B|=\varepsilon,
\]
by taking the derivative of $\log$ at $1/|B|$. Hence each term changes by at most $\varepsilon$, and $\ell(p,q,K)$ takes the average over these terms, so the main claim holds. ∎
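For concreteness, here is a sketch of evaluating the rewritten log-likelihood $\ell(p,q,K_c)$ above on a finite data set (variable names and array layout are our own assumptions):

```python
import numpy as np

def log_likelihood(X, m, c, r, p, q):
    """Average Bernoulli log-likelihood ell(p, q, K_c) over the baseline set.

    X : (n, 2) array of baseline points B
    m : (n,) array of 0/1 measurements
    c : (2,) kernel center;  r : bandwidth;  p, q : rate parameters
    """
    K = np.exp(-np.sum((X - c) ** 2, axis=1) / r ** 2)   # K_c(x) for each x
    g = q + (p - q) * K                                   # g(x) = p K(x) + q (1 - K(x))
    terms = np.where(m == 1, np.log(g), np.log(1.0 - g))
    return terms.mean()
```

This is the alternative log-likelihood only; forming $\Phi$ additionally involves the null model, which does not depend on the center $c$.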

Center Point Lipschitz Bound. Next we show that $\Phi(K_{c})$ is stable with respect to small changes in its epicenter $c$.

Lemma A.4.

The magnitude of the gradient of $\Phi(K_{c})$ with respect to the center $c$, for any considered $K_{c}\in\mathcal{K}_{r}$, is bounded by
\[
|\nabla_{c}\Phi(K_{c})|\leq(1/r)\sqrt{8/e}.
\]
Proof.

We take the derivative of $c$ in direction $\vec{u}$ (a unit vector in $\mathbb{R}^{2}$) at magnitude $t$. First analyze the Gaussian kernel under such a shift as $K_{c}(x)=\exp(-\|c-x+t\vec{u}\|^{2}/r^{2})$, so as $t\to 0$ we have
\[
\left|\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\right|=\left|\frac{2\langle\vec{u},(c-x)\rangle}{r^{2}}\exp\left(-\frac{\|c-x\|^{2}}{r^{2}}\right)\right|.
\]
This is maximized for $\vec{u}=(c-x)/\|c-x\|$, so
\[
\left|\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\right|\leq\left|\frac{2\|c-x\|}{r^{2}}\exp\left(-\frac{\|c-x\|^{2}}{r^{2}}\right)\right|\leq\frac{1}{r}\sqrt{\frac{2}{e}},
\]
since the middle expression is maximized at $\|c-x\|/r=\sqrt{1/2}$.

Now examine $\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}t}=\frac{\mathsf{d}\ell(p^{*},q^{*},K_{c})}{\mathsf{d}t}$, which expands to
\[
\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}t}=\sum_{x\in B}\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}K_{c}(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}+\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}p^{*}}\frac{\mathsf{d}p^{*}}{\mathsf{d}t}+\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}q^{*}}\frac{\mathsf{d}q^{*}}{\mathsf{d}t}.
\]
Since at $p=p^{*},q=q^{*}$ both $\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}p}=0$ and $\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}q}=0$, then as long as $\frac{\mathsf{d}p^{*}}{\mathsf{d}t}$ and $\frac{\mathsf{d}q^{*}}{\mathsf{d}t}$ are bounded the associated terms are also $0$. We bound these by taking the derivative of the equation $|B|=\sum_{x\in M}\frac{1}{g(x)}$ from Lemma A.1 with respect to $t$:
\[
\sum_{x\in M}\frac{1}{g(x)^{2}}\left(\frac{\mathsf{d}p^{*}}{\mathsf{d}t}K_{c}(x)+\frac{\mathsf{d}q^{*}}{\mathsf{d}t}(1-K_{c}(x))+\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}(p^{*}-q^{*})\right)=\frac{\mathsf{d}|B|}{\mathsf{d}t}=0.
\]
The term $\sum_{x\in M}\big|\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\big|\frac{p^{*}-q^{*}}{g(x)^{2}}<C$ for some constant $C$ at $p^{*}$ and $q^{*}$, by Lemma A.2 and since $\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}$ is bounded. Hence
\begin{align*}
C &\geq\left|\sum_{x\in M}\frac{1}{g(x)^{2}}\left(\frac{\mathsf{d}p^{*}}{\mathsf{d}t}K_{c}(x)+\frac{\mathsf{d}q^{*}}{\mathsf{d}t}(1-K_{c}(x))\right)\right|\\
&\geq\left|\frac{\mathsf{d}p^{*}}{\mathsf{d}t}\sum_{x\in M}K_{c}(x)+\frac{\mathsf{d}q^{*}}{\mathsf{d}t}\sum_{x\in M}(1-K_{c}(x))\right|.
\end{align*}
Therefore if $K_{c}(x)>0$ for any $x\in M$, then $\frac{\mathsf{d}p^{*}}{\mathsf{d}t}$ and $\frac{\mathsf{d}q^{*}}{\mathsf{d}t}$ are bounded. This holds for any $K_{c}$ considered, as otherwise $\Phi=0$ since the alternative hypothesis $\ell(K_{c})$ is equal to the null hypothesis.

Finally, we bound the magnitude of the first term:
\begin{align*}
\left|\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}t}\right| &=\left|\frac{1}{|B|}\left(\sum_{x\in M}\frac{\mathsf{d}\log(g(x))}{\mathsf{d}t}+\sum_{x\in B\setminus M}\frac{\mathsf{d}\log(1-g(x))}{\mathsf{d}t}\right)\right| \quad\text{for fixed }p^{*},q^{*}\\
&=\frac{1}{|B|}\left|\sum_{x\in M}\frac{p^{*}-q^{*}}{g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}-\sum_{x\in B\setminus M}\frac{p^{*}-q^{*}}{1-g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\right|\\
&\leq\frac{1}{|B|}\left(\frac{1}{r}\sqrt{\frac{2}{e}}\right)\left(\left|\sum_{x\in M}\frac{p^{*}-q^{*}}{g(x)}\right|+\left|\sum_{x\in B\setminus M}\frac{p^{*}-q^{*}}{1-g(x)}\right|\right)\\
&\leq\frac{1}{r}\sqrt{\frac{8}{e}},
\end{align*}
where the last step follows from $p^{*}-q^{*}<1$ and Lemma A.1. ∎
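A quick numerical sanity check of the per-point derivative bound $(1/r)\sqrt{2/e}$ used above (a throwaway verification, not part of any algorithm):

```python
import numpy as np

# Check: max over z >= 0 of (2z / r^2) * exp(-z^2 / r^2) equals (1/r) * sqrt(2/e),
# attained at z = r / sqrt(2).
r = 3.0
z = np.linspace(0.0, 10.0 * r, 200001)
vals = (2.0 * z / r ** 2) * np.exp(-(z ** 2) / r ** 2)
print(vals.max())                        # numerically ~ 0.286 for r = 3
print((1.0 / r) * np.sqrt(2.0 / np.e))   # closed form ~ 0.286
print(z[vals.argmax()] / r)              # maximizing ratio ~ 0.7071 ~ sqrt(1/2)
```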

This bound suggests a further improvement for centers that have few points in their neighborhood. We state this bound in terms of a truncated kernel, but it also applies to non-truncated kernels through Lemma A.3.

Lemma A.5.

For a truncated kernel $K^{\prime}_{c}\in\mathcal{K}_{r,\varepsilon}$, if we shift the center $c$ to $c^{\prime}$ with $\|c-c^{\prime}\|\leq\beta\leq r_{\text{max}}$, then
\[
|\Phi(K^{\prime}_{c})-\Phi(K^{\prime}_{c^{\prime}})|\leq\beta\frac{|D\cap B|}{|B|}\frac{2r_{\text{max}}}{r^{2}}+\varepsilon,
\]
where $D$ is a disk centered at $c$ of radius $2r_{\text{max}}$.

Proof.

From the previous proof, for a regular kernel $K_{c}\in\mathcal{K}_{r}$,
\[
\left|\frac{\mathsf{d}\Phi(K_{c})}{\mathsf{d}t}\right|\leq\frac{1}{|B|}\left|\sum_{x\in M}\frac{p^{*}-q^{*}}{g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}-\sum_{x\in B\setminus M}\frac{p^{*}-q^{*}}{1-g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\right|.
\]
Adapting this to a truncated kernel $K^{\prime}_{c}\in\mathcal{K}_{r,\varepsilon}$, this magnitude times $\beta$ would provide a bound, but a poor one, since the derivative is unbounded for $x$ on the boundary. Rather, we analyze the sum over $x\in B$ in three cases. For $x\notin D$, both $K^{\prime}_{c}(x)=0$ and $K^{\prime}_{c^{\prime}}(x)=0$, so these points do not contribute. For $\|x-c\|\in[r_{\text{max}}-\beta,r_{\text{max}}+\beta]$ the derivative is unbounded, but $K^{\prime}_{c}(x)\in[0,\varepsilon]$ and $K^{\prime}_{c^{\prime}}(x)\in[0,\varepsilon]$ by Lemma A.3, so the total contribution of these terms to the average is at most $\varepsilon$. What remains is to analyze the $x\in B$ such that $\|x-c\|\leq r_{\text{max}}-\beta$, where the derivative bound applies. However, we will not use the same approach as in the non-truncated kernel.

We write this contribution as the sum over $M$ and $B$ separately, $\frac{\beta}{|B|}(\Delta_{M}-\Delta_{B})$, focusing only on points in $D$ (and thus double counting ones near the boundary), where
\[
\Delta_{M}=\sum_{x\in D\cap M}\frac{p^{*}-q^{*}}{g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\;\text{ and }\;\Delta_{B}=\sum_{x\in D\cap B\setminus M}\frac{p^{*}-q^{*}}{1-g(x)}\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}.
\]
To bound $\Delta_{M}$, observe that $g(x)=(p^{*}-q^{*})K_{c}(x)+q^{*}\geq(p^{*}-q^{*})K_{c}(x)$, and $\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}=-2\frac{\|x-c\|}{r^{2}}K_{c}(x)$. Thus
\[
|\Delta_{M}|\leq\left|\sum_{x\in D\cap M}\frac{p^{*}-q^{*}}{(p^{*}-q^{*})K_{c}(x)}2K_{c}(x)\frac{\|x-c\|}{r^{2}}\right|\leq|D\cap M|\frac{2r_{\text{max}}}{r^{2}}.
\]
To bound $\Delta_{B}$, we use that $1-g(x)=1-q^{*}-(p^{*}-q^{*})K(x)\geq 1-q^{*}$, that $p^{*}-q^{*}\leq 1-q^{*}$, and from the previous proof that $\left|\frac{\mathsf{d}K_{c}(x)}{\mathsf{d}t}\right|\leq\frac{1}{r}\sqrt{\frac{2}{e}}$. Thus
\[
|\Delta_{B}|\leq\left|\sum_{x\in D\cap B\setminus M}\frac{1-q^{*}}{1-q^{*}}\frac{1}{r}\sqrt{\frac{2}{e}}\right|=|D\cap B\setminus M|\frac{1}{r}\sqrt{\frac{2}{e}}.
\]
Since $2>\sqrt{2/e}$ we can combine these contributions as
\[
\frac{\beta}{|B|}|\Delta_{M}-\Delta_{B}|\leq\beta\frac{|D\cap B|}{|B|}\frac{2r_{\text{max}}}{r^{2}}. \qquad\qed
\]
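Lemmas A.4 and A.5 justify evaluating $\Phi$ only on a finite grid of candidate centers. A minimal sketch of constructing such a grid (the spacing rule is our reading of the Lipschitz bound in Lemma A.4, and the bounding-box gridding is an illustrative assumption):

```python
import numpy as np

def candidate_centers(X, r, eps):
    """Lay down a square grid of candidate epicenters over the data's bounding box.

    By Lemma A.4 the statistic changes by at most (1/r)*sqrt(8/e) per unit of
    center movement, so spacing the grid at beta = eps * r / sqrt(8/e) keeps the
    additive error from snapping to the nearest grid point below roughly eps.
    """
    beta = eps * r / np.sqrt(8.0 / np.e)
    (xmin, ymin), (xmax, ymax) = X.min(axis=0), X.max(axis=0)
    xs = np.arange(xmin, xmax + beta, beta)
    ys = np.arange(ymin, ymax + beta, beta)
    gx, gy = np.meshgrid(xs, ys)
    return np.column_stack([gx.ravel(), gy.ravel()])
```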

A.2. Bandwidth Approximations of the KSSS

We mainly focus on solving $\max_{K_{c}\in\mathcal{K}_{r}}\Phi(K_{c})$ for a fixed bandwidth $r$. Here we consider the stability of the choice of $r$, in case it is not assumed known and must be searched over.

Lemma A.6.

For fixed $p$, $q$, and $c$ we have $\big|\frac{\mathsf{d}\Phi(K)}{\mathsf{d}r}\big|\leq\frac{4}{er}$.

Proof.

Let $z_{x}=\|x-c\|_{2}$ and $w_{x}=z_{x}/r$, so $K(x)=\exp(-w_{x}^{2})$. We compute the derivative of $\ell=\ell(p,q,K)$ with respect to $r$:
\begin{align*}
\frac{\mathsf{d}\ell}{\mathsf{d}r} &=\frac{1}{|B|}\left(\sum_{x\in M}\frac{2z_{x}^{2}(p-q)e^{-z_{x}^{2}/r^{2}}}{r^{3}g(x)}-\sum_{x\in B\setminus M}\frac{2z_{x}^{2}(p-q)e^{-z_{x}^{2}/r^{2}}}{r^{3}(1-g(x))}\right)\\
&=\frac{2(p-q)}{r|B|}\left(\sum_{x\in M}\frac{w_{x}^{2}e^{-w_{x}^{2}}}{g(x)}-\sum_{x\in B\setminus M}\frac{w_{x}^{2}e^{-w_{x}^{2}}}{1-g(x)}\right).
\end{align*}
By two observations, $\max_{w>0}w^{2}e^{-w^{2}}=1/e$ and $p-q\leq 1$, this simplifies to
\[
\left|\frac{\mathsf{d}\ell}{\mathsf{d}r}\right|\leq\frac{2}{r|B|}\left(\sum_{x\in M}\frac{1/e}{g(x)}+\sum_{x\in B\setminus M}\frac{1/e}{1-g(x)}\right).
\]
By Lemma A.1 each of the sums over $x\in M$ and over $x\in B\setminus M$ is at most $(1/e)|B|$, and hence
\[
\left|\frac{\mathsf{d}\ell}{\mathsf{d}r}\right|\leq\frac{4/e}{r}. \qquad\qed
\]
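This Lipschitz bound in $r$ suggests that, when the bandwidth is unknown, a geometric grid of bandwidths suffices. A back-of-the-envelope sketch of choosing such a grid from Lemma A.6 (the step-size rule below is our own reading of the bound, not something stated elsewhere in the paper):

```python
import numpy as np

def bandwidth_grid(r_lo, r_hi, eps):
    """Geometric grid of bandwidths so consecutive values change Phi by at most ~eps.

    Integrating |d Phi / d r| <= 4/(e r) between consecutive bandwidths gives a
    change of at most (4/e) * log(ratio), so a per-step ratio of exp(eps*e/4) suffices.
    """
    ratio = np.exp(eps * np.e / 4.0)
    n = int(np.ceil(np.log(r_hi / r_lo) / np.log(ratio))) + 1
    return r_lo * ratio ** np.arange(n)
```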

A.3. Sampling Bounds

Sampling can be used to create a coreset of a truly massive data set $B$ while retaining the ability to approximately solve $\max_{K\in\mathcal{K}}\Phi(K)$. In particular, we consider creating an iid sample $B_{\varepsilon}$ from $B$.

Lemma A.7.

For a sample $B_{\varepsilon}$ of size $t=O(\frac{1}{\varepsilon^{2}}\log^{2}\frac{1}{\varepsilon}\log\frac{\kappa}{\delta})$ and for $\kappa$ different centers $c\in C$, with probability at least $1-\delta$, all estimates $\hat{\ell}(K_{c})$ from $B_{\varepsilon}$ of $\ell(K_{c})$ satisfy $|\ell(K_{c})-\hat{\ell}(K_{c})|\leq\varepsilon$.

Proof.

To approximate $-\ell(K_{c})$ we will use a standard Chernoff-Hoeffding bound: for $t$ independent random variables $X_{1},\ldots,X_{t}$ with $\mathsf{E}[X_{i}]=\mu$ and $X_{i}\in[0,\Delta]$, the average $\bar{X}=\frac{1}{t}\sum_{i=1}^{t}X_{i}$ concentrates as $\Pr[|\bar{X}-\mu|\geq\alpha]\leq 2\exp\left(-\frac{2\alpha^{2}t}{\Delta^{2}}\right)$. Set
\[
\mu=-\frac{1}{|B|}\sum_{x\in B}\Big(m(x)\log(g(x))+(1-m(x))\log(1-g(x))\Big)=-\ell(K_{c}),
\]
and map the $i$th sample $x^{\prime}\in B_{\varepsilon}$ to the $i$th random variable
\[
X_{i}=-\big(m(x^{\prime})\log(g(x^{\prime}))+(1-m(x^{\prime}))\log(1-g(x^{\prime}))\big).
\]
Since $g(x)>1/|B|$ for $m(x)=1$ and $1-g(x)>1/|B|$ for $m(x)=0$ by Lemma A.2, we have $\Delta=-\log(1/|B|)=\log(|B|)$. Plugging these quantities into the Chernoff-Hoeffding bound yields
\[
\Pr[|\hat{\ell}(K_{c})-\ell(K_{c})|\geq\varepsilon]\leq 2\exp\left(-\frac{2\varepsilon^{2}t}{\log^{2}(|B|)}\right)\leq\delta^{\prime}.
\]
Solving for $t$ finds $t=\frac{\log^{2}|B|}{2\varepsilon^{2}}\log\frac{2}{\delta^{\prime}}$. Then taking a union bound over $\kappa$ centers reveals that, with probability at least $1-\delta=1-\delta^{\prime}\kappa$, this holds by setting
\[
t=\frac{\log^{2}|B|}{2\varepsilon^{2}}\log\frac{2\kappa}{\delta}.
\]
At this point, we almost have the desired bound; however, it has a $\log^{2}|B|$ factor in $t$ in place of the desired $\log^{2}\frac{1}{\varepsilon}$. One should believe that it is possible to remove the dependence on $|B|$, since given one sample with the above bound, the dependence on $|B|$ has been reduced to logarithmic. We can apply the bound again, so the dependence is reduced to $\log^{2}(\frac{1}{\varepsilon}\log^{2}|B|)$, and so on. Ultimately, it can be shown that we can directly replace $|B|$ with $\mathsf{poly}(1/\varepsilon)$ in the bounds using a classic trick [23] of considering two samples $B_{\varepsilon}$ and $B_{\varepsilon}^{\prime}$, and arguing that if they yield results close to each other, then they should both be close to $B$. We omit the standard, but slightly unintuitive, details. ∎
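A minimal sketch of the sampling step implied by Lemma A.7, using the $\log^{2}|B|$ form of the bound derived above (the constants and with-replacement sampling are our own simplifications):

```python
import numpy as np

def sample_coreset(X, m, eps, delta, kappa, rng=None):
    """Draw an iid sample B_eps of B large enough to estimate ell(K_c) within eps
    at kappa candidate centers, with probability at least 1 - delta (Lemma A.7)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    t = int(np.ceil((np.log(n) ** 2 / (2 * eps ** 2)) * np.log(2 * kappa / delta)))
    t = min(t, n)                       # no point sampling more than |B|
    idx = rng.integers(0, n, size=t)    # sample with replacement (iid)
    return X[idx], m[idx]
```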

A.4. Convexity in pp and qq

We next show that maximizing $\Phi_{p,q}(K)$ over $p$ and $q$ is a convex optimization problem (minimizing $-\Phi_{p,q}(K)$) with a Lipschitz bound in terms of $p$ and $q$. However, such a bound does not exist over all of $p,q\in(0,1)$, as the gradient is unbounded near the boundary. We instead define a set of constraints related to $g(x)$ at the optimal $p^{*},q^{*}$ that allow a Lipschitz constant.

Lemma A.8.

The following optimization problem is convex with Lipschitz constant $2|B|$, and its feasible domain contains $p^{*},q^{*}=\mathrm{argmin}_{p,q}-\Phi_{p,q}(K)$.

(1)
\[
\begin{split}
\min_{p,q}\;-\Phi_{p,q}(K) \quad\text{subject to}\quad
& 1/|B|-g(x)\leq 0 \;\text{ for }x\in M,\\
& (|B|-1)/|B|-g(x)\geq 0 \;\text{ for }x\in B\setminus M,\\
& p\in(0,1),\; q\in(0,1).
\end{split}
\]
Proof.

The set of constraints follows from Lemma A.2, and from that bound we know the optimal $p^{*}$ and $q^{*}$ are contained in this domain. Each constraint is a linear function of $p$ and $q$, and hence the feasible region is convex. Since $g(x)$ is an affine function of $p$ and $q$, it is both convex and concave with respect to $p,q$. The logarithm is concave and non-decreasing, hence both $\log(g(x))$ and $\log(1-g(x))$ are concave. The sum of concave functions is concave, and thus $-\ell(p,q,K)$, and hence $-\Phi(p,q,K)$, is convex.

The task becomes to show that the absolute values of the first-order partial derivatives are bounded for $p$ and $q$ in this domain. For $p$ (the argument for $q$ is symmetric),
\begin{align*}
\left|\frac{\mathsf{d}\ell(p,q,K)}{\mathsf{d}p}\right| &=\left|\frac{1}{|B|}\left(\sum_{x\in M}\frac{K(x)}{g(x)}-\sum_{x\in B\setminus M}\frac{K(x)}{1-g(x)}\right)\right|\\
&\leq\frac{1}{|B|}\left(\left|\sum_{x\in M}\frac{K(x)}{g(x)}\right|+\left|\sum_{x\in B\setminus M}\frac{K(x)}{1-g(x)}\right|\right)\\
&\leq\frac{1}{|B|}\left|\sum_{x\in M}\frac{1}{g(x)}\right|+\frac{1}{|B|}\left|\sum_{x\in B\setminus M}\frac{1}{1-g(x)}\right|\\
&\leq\frac{1}{|B|}\sum_{x\in M}|B|+\frac{1}{|B|}\sum_{x\in B\setminus M}|B| = |M|+|B\setminus M| = |B| \leq 2|B|,
\end{align*}
where the steps follow from the triangle inequality, from $K(x)\leq 1$, and from the constraints $g(x)\geq 1/|B|$ for $x\in M$ and $1-g(x)\geq 1/|B|$ for $x\in B\setminus M$. ∎
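Given Lemma A.8, $(p^{*},q^{*})$ at a fixed center can be found with any off-the-shelf convex solver. A minimal sketch using projected gradient ascent on $\ell$ (the box projection below is a simplification contained inside the constraint set of (1), and the step-size schedule is an arbitrary choice of ours):

```python
import numpy as np

def fit_p_q(K, m, n_iters=500):
    """Projected gradient ascent on ell(p, q, K_c) for fixed kernel evaluations K.

    K : (n,) kernel values K_c(x) for the baseline points
    m : (n,) 0/1 measurements
    Returns approximate maximizers (p, q) of the average log-likelihood.
    """
    n = len(K)
    lo, hi = 1.0 / n, 1.0 - 1.0 / n           # simplified box inside the constraints of (1)
    p, q = 0.6, 0.4                           # arbitrary feasible start with p > q
    for it in range(1, n_iters + 1):
        g = q + (p - q) * K
        resid = np.where(m == 1, 1.0 / g, -1.0 / (1.0 - g)) / n
        dp = np.sum(resid * K)                # d ell / d p
        dq = np.sum(resid * (1.0 - K))        # d ell / d q
        step = 0.1 / np.sqrt(it)
        p = np.clip(p + step * dp, lo, hi)    # ascend ell within the box
        q = np.clip(q + step * dq, lo, hi)
    return p, q
```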