
Robust Proximity Search for Balls using Sublinear Space

(Work on this paper was partially supported by NSF AF awards CCF-0915984 and CCF-1217462.)

Sariel Har-Peled Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL, 61801, USA; sariel@uiuc.edu; http://www.uiuc.edu/~sariel/.    Nirman Kumar Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL, 61801, USA; nkumar5@illinois.edu; http://www.cs.uiuc.edu/~nkumar5/.
Abstract

Given a set of $n$ disjoint balls $b_1,\dots,b_n$ in $\mathbb{R}^d$, we provide a data structure, of near linear size, that can answer $(1\pm\varepsilon)$-approximate $k$th-nearest neighbor queries in $O(\log n+1/\varepsilon^d)$ time, where $k$ and $\varepsilon$ are provided at query time. If $k$ and $\varepsilon$ are provided in advance, we provide a data structure to answer such queries, that requires (roughly) $O(n/k)$ space; that is, the data structure has sublinear space requirement if $k$ is sufficiently large.

1 Introduction

The nearest neighbor problem is a fundamental problem in Computer Science [17, 1]. Here, one is given a set of points $\mathsf{P}$, and given a query point $\mathtt{q}$ one needs to output the nearest point in $\mathsf{P}$ to $\mathtt{q}$. There is a trivial $O(n)$ algorithm for this problem. Typically the set of data points is fixed, while different queries keep arriving. Thus, one can use preprocessing to facilitate a faster query. There are several applications of nearest neighbor search in computer science including pattern recognition, information retrieval, vector compression, computational statistics, clustering, data mining and learning among many others, see for instance the survey by Clarkson [10] for references. If one is interested in guaranteed performance and near linear space, there is no known way to solve this problem efficiently (i.e., with logarithmic query time) for dimension $d>2$.

In light of the above, major effort has been devoted to develop approximation algorithms for nearest neighbor search [6, 16, 10, 13]. In the $(1+\varepsilon)$-approximate nearest neighbor problem, one is additionally given an approximation parameter $\varepsilon>0$ and one is required to find a point $\mathsf{u}\in\mathsf{P}$ such that $\mathsf{d}(\mathtt{q},\mathsf{u})\leq(1+\varepsilon)\mathsf{d}(\mathtt{q},\mathsf{P})$. In $d$ dimensional Euclidean space, one can answer ANN queries in $O(\log n+1/\varepsilon^{d-1})$ time using linear space [6, 12]. Unfortunately, the constant hidden in the $O$ notation is exponential in the dimension (and this is true for all bounds mentioned in this paper), and specifically because of the $1/\varepsilon^{d-1}$ in the query time, this approach is only efficient in low dimensions. Interestingly, for this data structure, the approximation parameter $\varepsilon$ need not be specified during the construction, and one can provide it during the query. An alternative approach is to use Approximate Voronoi Diagrams (AVD), introduced by Har-Peled [11], which is a partition of space into regions of low total complexity, with a representative point for each region, that is an ANN for any point in the region. In particular, Har-Peled showed that there is such a decomposition of size $O((n/\varepsilon^d)\log^2 n)$, see also [13]. This allows ANN queries to be answered in $O(\log n)$ time. Arya and Malamatos [2] showed how to build AVDs of linear complexity (i.e., $O(n/\varepsilon^d)$). Their construction uses WSPD (Well Separated Pairs Decomposition) [8]. Further trade-offs between query time and space usage for AVDs were studied by Arya et al. [4].

A more general problem is the $k$-nearest neighbors problem where one is interested in finding the $k$ points in $\mathsf{P}$ nearest to the query point $\mathtt{q}$. This is widely used in classification, where the majority label is used to label the query point. A restricted version is to find only the $k$th-nearest neighbor. This problem and its approximate version have been considered in [3, 14].

Recently, the authors [14] showed that one can compute a $(k,\varepsilon)$-AVD that $(1+\varepsilon)$-approximates the distance to the $k$th nearest neighbor, and surprisingly, requires $O(n/k)$ space; that is, sublinear space if $k$ is sufficiently large. For example, for the case $k=\Omega(\sqrt{n})$, which is of interest in practice, the space required is only $O(\sqrt{n})$. Such ANN is of interest when one is worried that there is noise in the data, and thus one is interested in the distance to the $k$th NN which is more robust and noise resistant. Alternatively, one can think about such data structures as enabling one to summarize the data in a way that still facilitates meaningful proximity queries.

In this paper we consider a generalization of the $k$th-nearest neighbor problem. Here, we are given a set of $n$ disjoint balls in $\mathbb{R}^d$ and we want to preprocess them, so that given a query point we can find approximately the $k$th closest ball. The distance of a query point to a ball is defined as the distance to its boundary if the point is outside the ball or $0$ otherwise. Clearly, this problem is a generalization of the $k$th-nearest neighbor problem by viewing points as balls of radius $0$. Algorithms for the $k$th-nearest neighbor for points do not extend in a straightforward manner to this problem because the distance function is no longer a metric. Indeed, there can be two very far off points both very close to a single ball, and thus the triangle inequality does not hold. The problem of finding the closest ball can also be modeled as a problem of approximating the minimization diagram of a set of functions; here, a function would correspond to the distance from one of the given balls. There has been some recent work by the authors on this topic, see [15], where a fairly general class of functions admits a near-linear sized data structure permitting a logarithmic time query for the problem of approximating the minimization diagram. However, the problem that we consider in this paper does not fall under the framework of [15]. The technical assumptions of [15] mandate that the set of points which form the $0$-sublevel set of a distance function, i.e., the set of points at which the distance function is $0$, is a single point (or an empty set). This is not the case for the problem we consider here. Also, we are interested in the more general $k$th-nearest neighbor problem, while [15] only considers the nearest-neighbor problem, i.e., $k=1$.

We first show how to preprocess the set of balls into a data structure requiring space $O(n)$, in $O(n\log n)$ time, so that given a query point $\mathtt{q}$, a number $1\leq k\leq n$ and $\varepsilon>0$, one can compute a $(1\pm\varepsilon)$-approximate $k$th closest ball in time $O(\log n+\varepsilon^{-d})$. If both $k$ and $\varepsilon$ are available during preprocessing, one can preprocess the balls into a $(k,\varepsilon)$-AVD, using $O\!\left(\frac{n}{k\varepsilon^d}\log(1/\varepsilon)\right)$ space, so that given a query point $\mathtt{q}$, a $(k,\varepsilon)$-ANN closest ball can be computed, in $O(\log(n/k)+\log(1/\varepsilon))$ time.

Paper Organization

In Section 2, we define the problem, list some assumptions, and introduce notations. In Section 3, we set up some basic data structures to answer approximate range counting queries for balls. In Section 4, we present the data structure, query algorithm and proof of correctness for our data structure which can compute $(1\pm\varepsilon)$-approximate $k$th-nearest neighbors of a query point when $k,\varepsilon$ are only provided during query time. In Section 5 we present approximate quorum clustering, see [9, 14], for a set of disjoint balls. Using this, in Section 6, we present the $(k,\varepsilon)$-AVD construction. We conclude in Section 7.

2 Problem definition and notation

We are given a set of disjoint balls $\mathcal{B}=\{b_1,\dots,b_n\}$, where $b_i=\mathsf{b}(\mathsf{c}_i,\mathsf{r}_i)$, for $i=1,\ldots,n$. Here $\mathsf{b}(\mathsf{c},\mathsf{r})\subseteq\mathbb{R}^d$ denotes the (closed) ball with center $\mathsf{c}$ and radius $\mathsf{r}\geq 0$. Additionally, we are given an approximation parameter $\varepsilon\in(0,1)$. For a point $\mathtt{q}\in\mathbb{R}^d$, the distance of $\mathtt{q}$ to a ball $b=\mathsf{b}(\mathsf{c},\mathsf{r})$ is $\mathsf{d}(\mathtt{q},b)=\max\left(\|\mathtt{q}-\mathsf{c}\|-\mathsf{r},\,0\right)$.

(Footnote on "disjoint": Our data structure and algorithm work for the more general case where the balls are interior disjoint, where we define the interior of a "point ball", i.e., a ball of radius $0$, as the point itself. This is not the usual topological definition.)
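The distance function above is short enough to state directly in code. The following is a minimal sketch, not part of the paper; the function name `dist_to_ball` and the sample points are illustrative assumptions. It also shows why this distance is not a metric: two far-apart points can both be close to the same ball.

```python
import math

def dist_to_ball(q, center, radius):
    """d(q, b) = max(||q - c|| - r, 0) for the closed ball b(center, radius)."""
    return max(math.dist(q, center) - radius, 0.0)

# A point outside the ball, and a point inside it.
outside = dist_to_ball((3.0, 4.0), (0.0, 0.0), 1.0)
inside = dist_to_ball((0.5, 0.0), (0.0, 0.0), 1.0)

# Two points on opposite sides of a unit ball: each is close to the ball,
# yet they are far from each other, so the triangle inequality fails.
p1, p2 = (-1.1, 0.0), (1.1, 0.0)
d1 = dist_to_ball(p1, (0.0, 0.0), 1.0)
d2 = dist_to_ball(p2, (0.0, 0.0), 1.0)
violates_triangle = d1 + d2 < math.dist(p1, p2)
print(outside, inside, violates_triangle)
```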

Observation 2.1.

For two balls $b_1\subseteq b_2\subseteq\mathbb{R}^d$, and any point $\mathtt{q}\in\mathbb{R}^d$, we have $\mathsf{d}(\mathtt{q},b_1)\geq\mathsf{d}(\mathtt{q},b_2)$.

The $k$th-nearest neighbor distance of $\mathtt{q}$ to $\mathcal{B}$, denoted by $\mathsf{d}_k(\mathtt{q},\mathcal{B})$, is the $k$th smallest number in $\mathsf{d}(\mathtt{q},b_1),\dots,\mathsf{d}(\mathtt{q},b_n)$. Similarly, for a given set of points $\mathsf{P}$, $\mathsf{d}_k(\mathtt{q},\mathsf{P})$ denotes the $k$th-nearest neighbor distance of $\mathtt{q}$ to $\mathsf{P}$.

We aim to build a data structure to answer $(1\pm\varepsilon)$-approximate $k$th-nearest neighbor (i.e., $(k,\varepsilon)$-ANN) queries, where for any query point $\mathtt{q}\in\mathbb{R}^d$ one needs to output a ball $b\in\mathcal{B}$ such that $(1-\varepsilon)\mathsf{d}_k(\mathtt{q},\mathcal{B})\leq\mathsf{d}(\mathtt{q},b)\leq(1+\varepsilon)\mathsf{d}_k(\mathtt{q},\mathcal{B})$. There are different variants depending on whether $\varepsilon$ and $k$ are provided with the query or in advance.
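As a reference point for the definitions above, the following brute-force sketch computes $\mathsf{d}_k(\mathtt{q},\mathcal{B})$ by sorting and checks whether a candidate ball is a valid $(k,\varepsilon)$-ANN answer. It takes linear time per query and is only meant to make the definition concrete; the function names and the sample balls are assumptions made for illustration, not the data structure of this paper.

```python
import math

def dist_to_ball(q, ball):
    """Distance from q to a ball given as (center, radius)."""
    center, radius = ball
    return max(math.dist(q, center) - radius, 0.0)

def kth_nn_distance(q, balls, k):
    """d_k(q, B): the k-th smallest distance from q to the balls (k is 1-indexed)."""
    return sorted(dist_to_ball(q, b) for b in balls)[k - 1]

def is_k_eps_ann(q, balls, k, eps, candidate):
    """True if candidate satisfies (1-eps) d_k <= d(q, candidate) <= (1+eps) d_k."""
    dk = kth_nn_distance(q, balls, k)
    d = dist_to_ball(q, candidate)
    return (1 - eps) * dk <= d <= (1 + eps) * dk

# Three disjoint balls in the plane, each given as (center, radius).
balls = [((0.0, 0.0), 1.0), ((4.0, 0.0), 1.0), ((0.0, 10.0), 2.0)]
q = (2.0, 0.0)
print(kth_nn_distance(q, balls, 2))
print(is_k_eps_ann(q, balls, 3, 0.1, balls[2]))
print(is_k_eps_ann(q, balls, 3, 0.1, balls[0]))
```

Here the two nearest balls are both at distance 1 from the query, so neither is an acceptable answer for $k=3$, where the third ball is much farther away.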

We use cube to denote a set of the form $[a_1,a_1+\ell]\times[a_2,a_2+\ell]\times\cdots\times[a_d,a_d+\ell]\subseteq\mathbb{R}^d$, where $a_1,\ldots,a_d\in\mathbb{R}$ and $\ell\geq 0$ is the side length of the cube.

Observation 2.2.

For any set of balls $\mathcal{B}$, the function $\mathsf{d}_k(\mathtt{q},\mathcal{B})$ is a $1$-Lipschitz function; that is, for any two points $\mathsf{u},\mathsf{v}$, we have that $\mathsf{d}_k(\mathsf{u},\mathcal{B})\leq\mathsf{d}_k(\mathsf{v},\mathcal{B})+\|\mathsf{u}-\mathsf{v}\|$.

Assumption 2.3.

We assume all the balls are contained inside the cube $[1/2-\delta,\,1/2+\delta]^d$, which can be ensured by translation and scaling (which preserves order of distances), where $\delta=\varepsilon/4$. As such, we can ignore queries outside the unit cube $[0,1]^d$, as any input ball is a valid answer in this case.

For a real positive number $x$ and a point $\mathsf{p}=(\mathsf{p}_1,\ldots,\mathsf{p}_d)\in\mathbb{R}^d$, define $\mathsf{G}_x(\mathsf{p})$ to be the grid point $\left(\lfloor\mathsf{p}_1/x\rfloor x,\ldots,\lfloor\mathsf{p}_d/x\rfloor x\right)$. The number $x$ is the width or side length of the grid $\mathsf{G}_x$. The mapping $\mathsf{G}_x$ partitions $\mathbb{R}^d$ into cubes that are called grid cells.
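The grid mapping can be illustrated with a few lines of code. This is a small sketch under the definition above; the name `grid_point` is an assumption. Two points lie in the same grid cell exactly when they map to the same grid point.

```python
import math

def grid_point(p, x):
    """G_x(p): round each coordinate of p down to a multiple of the grid width x."""
    return tuple(math.floor(pi / x) * x for pi in p)

# Both points fall into the cell whose lowest corner is (0.25, 0.75).
a = grid_point((0.3, 0.8), 0.25)
b = grid_point((0.26, 0.99), 0.25)
print(a, b, a == b)
```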

Definition 2.4.

A cube is a canonical cube if it is contained inside the unit cube $U=[0,1]^d$, it is a cell in a grid $\mathsf{G}_r$, and $r$ is a power of two (i.e., it might correspond to a node in a quadtree having $[0,1]^d$ as its root cell). We will refer to such a grid $\mathsf{G}_r$ as a canonical grid. Note that all the cells corresponding to nodes of a compressed quadtree are canonical.

Definition 2.5.

Given a set $b\subseteq\mathbb{R}^d$, and a parameter $\delta>0$, let $\mathsf{G}_{\approx}(b,\delta)$ denote the set of canonical grid cells of side length $2^{\lfloor\log_2(\delta\,\mathrm{diam}(b)/\sqrt{d})\rfloor}$ that intersect $b$, where $\mathrm{diam}(b)=\max_{\mathsf{p},\mathsf{u}\in b}\|\mathsf{p}-\mathsf{u}\|$ denotes the diameter of $b$. Clearly, the diameter of any grid cell of $\mathsf{G}_{\approx}(b,\delta)$ is at most $\delta\,\mathrm{diam}(b)$. Let $\mathsf{G}_{\approx}(b)=\mathsf{G}_{\approx}(b,1)$. It is easy to verify that $|\mathsf{G}_{\approx}(b)|=O(1)$. The set $\mathsf{G}_{\approx}(b)$ is the grid approximation to $b$.
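To make the grid approximation concrete, the sketch below computes the side length from Definition 2.5 and enumerates the grid cells that intersect a ball of positive radius. It is an illustration only: the names `grid_side` and `grid_approximation` are assumptions, cells are represented by their lowest corner, and the restriction to cells inside the unit cube is not enforced (the example ball lies well inside $[0,1]^2$).

```python
import itertools
import math

def grid_side(diam, delta, d):
    """Side length 2^floor(log2(delta * diam / sqrt(d))); requires diam > 0."""
    return 2.0 ** math.floor(math.log2(delta * diam / math.sqrt(d)))

def grid_approximation(center, radius, delta=1.0):
    """Return (side, cells): the cells of that side length meeting b(center, radius)."""
    d = len(center)
    side = grid_side(2 * radius, delta, d)
    lo = [math.floor((c - radius) / side) for c in center]
    hi = [math.floor((c + radius) / side) for c in center]
    cells = []
    for idx in itertools.product(*(range(l, h + 1) for l, h in zip(lo, hi))):
        corner = [i * side for i in idx]
        # Closest point of the (closed) cell to the ball center.
        nearest = [min(max(c, a), a + side) for c, a in zip(center, corner)]
        if math.dist(center, nearest) <= radius:
            cells.append(tuple(corner))
    return side, cells

side, cells = grid_approximation((0.5, 0.5), 0.1)
print(side, len(cells))
```

For this ball the cell diameter is `side * sqrt(2)`, which is below the ball diameter $0.2$, and only a constant number of cells is reported, in line with $|\mathsf{G}_{\approx}(b)|=O(1)$.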

Let $\mathcal{B}$ be a family of balls in $\mathbb{R}^d$. Given a set $X\subseteq\mathbb{R}^d$, let

$$\mathcal{B}(X)=\left\{b\in\mathcal{B}\;\middle|\;b\cap X\neq\emptyset\right\}$$

denote the set of all balls in $\mathcal{B}$ that intersect $X$.

For two compact sets $X,Y\subseteq\mathbb{R}^d$, $X\preceq Y$ if and only if $\mathrm{diam}(X)\leq\mathrm{diam}(Y)$. For a set $X$ and a set of balls $\mathcal{B}$, let $\mathcal{B}_{\succeq}(X)=\left\{b\in\mathcal{B}\;\middle|\;b\cap X\neq\emptyset\text{ and }b\succeq X\right\}$. Let $\mathsf{c}_d$ denote the maximum number of pairwise disjoint balls of radius at least $\mathsf{r}$, that may intersect a given ball of radius $\mathsf{r}$ in $\mathbb{R}^d$. Clearly, we have $|\mathcal{B}_{\succeq}(b)|\leq\mathsf{c}_d$ for any ball $b$. We have the following bounds.

Lemma 2.6.

$2\leq\mathsf{c}_d\leq 3^d$ for all $d$.

Proof.

Let $b=\mathsf{b}(\mathsf{c},\mathsf{r})$ be a given ball of radius $\mathsf{r}$. For the lower bound we can take two balls both of radius $\mathsf{r}$ which touch $b$ at diametrically opposite points and lie outside $b$. We now show the upper bound. Let $\mathcal{B}$ be a set of disjoint balls, each having radius at least $\mathsf{r}$ and touching $b$. Consider a ball $b'\in\mathcal{B}$. If no point of the boundary of $b'$ touches $b$, then clearly $b'$ contains $b$ in its interior and it is easy to see that $|\mathcal{B}|=1$. As such we assume that all balls in $\mathcal{B}$ have some point of their boundary inside $b$. Take any point $\mathsf{p}$ of the boundary of $b'$ such that $\mathsf{p}$ is in $b$, and consider a ball of radius $\mathsf{r}$ that lies completely inside $b'$ and is tangent to $b'$ at $\mathsf{p}$. We can find such a ball for each ball in $\mathcal{B}$. Moreover, these balls are all disjoint. Thus we have $|\mathcal{B}|$ disjoint balls of radius exactly $\mathsf{r}$ that touch $b$. It is easy to see that all such balls are completely inside $\mathsf{b}(\mathsf{c},3\mathsf{r})$. By a simple volume packing bound it follows that $|\mathcal{B}|\leq 3^d$.

Definition 2.7.

For a parameter $\delta\geq 0$, a function $f:\mathbb{R}^{+}\to\mathbb{R}^{+}$ is $\delta$-monotonic, if for every $x\geq 0$, $f(x/(1+\delta))\leq f(x)$.

3 Approximate range counting for balls

Data-structure 3.1.

For a given set of disjoint balls $\mathcal{B}=\{b_1,\ldots,b_n\}$ in $\mathbb{R}^d$, we build the following data structure, that is useful in performing several of the tasks at hand.

  1.  (A)

    Store balls in a (compressed) quadtree. For $i=1,2,\dots,n$, let $G_i=\mathsf{G}_{\approx}(b_i)$, and let $G=\bigcup_{i=1}^{n}G_i$ denote the union of these cells. Let $\mathcal{T}$ be a compressed quadtree decomposition of $[0,1]^d$, such that all the cells of $G$ are cells of $\mathcal{T}$. We preprocess $\mathcal{T}$ to answer point location queries for the cells of $G$. This takes $O(n\log n)$ time, see [12].

  2.  (B)

    Compute list of "large" balls intersecting each cell. For each node $u$ of $\mathcal{T}$, there is a list of balls registered with it. Formally, register a ball $b_i$ with all the cells of $G_i$. Clearly, each ball is registered with $O(1)$ cells, and it is easy to see that each cell has $O(1)$ balls registered with it, since the balls are disjoint.

    Next, for a cell $\Box$ in $\mathcal{T}$ we compute a list storing $\mathcal{B}_{\succeq}(\Box)$, and these balls are associated with this cell. These lists are computed in a top-down manner. To this end, propagate from a node $u$ its list $\mathcal{B}_{\succeq}(\Box)$ (which we assume is already computed) down to its children. A node receiving such a list scans it, and keeps only the balls that intersect its cell (adding to this list the balls already registered with this cell). For a node $\nu\in\mathcal{T}$, let $\mathcal{B}_{\nu}$ be this list.

  3.  (C)

    Build compressed quadtree on centers of balls. Let $\mathcal{C}$ be the set of centers of the balls of $\mathcal{B}$. Build, in $O(n\log n)$ time, a compressed quadtree $\mathcal{T}_{\mathcal{C}}$ storing $\mathcal{C}$.

  4.  (D)

    ANN for centers of balls. Build a data structure $\mathcal{D}$, for answering $2$-approximate $k$-nearest neighbor distances on $\mathcal{C}$, the set of centers of the balls, see [14], where $k$ and $\varepsilon$ are provided with the query. The data structure $\mathcal{D}$ returns a point $\mathsf{c}\in\mathcal{C}$ such that $\mathsf{d}_k(\mathtt{q},\mathcal{C})\leq\mathsf{d}(\mathtt{q},\mathsf{c})\leq 2\mathsf{d}_k(\mathtt{q},\mathcal{C})$.

  5.  (E)

    Answering approximate range searching for the centers of balls.

    Given a query ball $b_{\mathtt{q}}=\mathsf{b}(\mathtt{q},x)$ and a parameter $\delta>0$, one can, using $\mathcal{T}_{\mathcal{C}}$, report (approximately), in $O(\log n+1/\delta^d)$ time, the points in $b_{\mathtt{q}}\cap\mathcal{C}$. Specifically, the query process computes $O(1/\delta^d)$ sets of points, such that their union $X$ has the property that $b_{\mathtt{q}}\cap\mathcal{C}\subseteq X\subseteq(1+\delta)b_{\mathtt{q}}\cap\mathcal{C}$, where $(1+\delta)b_{\mathtt{q}}$ is the scaling of $b_{\mathtt{q}}$ by a factor of $1+\delta$ around its center. Indeed, compute the set $\mathsf{G}_{\approx}(b_{\mathtt{q}})$, and then using cell queries in $\mathcal{T}_{\mathcal{C}}$ compute the corresponding cells (this takes $O(\log n)$ time). Now, descend to the relevant level of the quadtree to all the cells of the right size that intersect $b_{\mathtt{q}}$. Clearly, the union of the points stored in their subtrees is the desired set. This takes overall $O(\log n+1/\delta^d)$ time.

    A similar data structure for approximate range searching is provided by Arya and Mount [5], and our description above is provided for the sake of completeness.

Overall, it takes $O(n\log n)$ time to build this data structure.

We denote the collection of data structures above by $\mathcal{DS}_{3.1}$ and where necessary, specific functionality it provides, say for finding the large balls intersecting a cell, by $\mathcal{DS}_{3.1}$ (B).

3.1 Approximate range counting among balls

We need the ability to answer approximate range counting queries on a set of disjoint balls. Specifically, given a set of disjoint balls $\mathcal{B}$, and a query ball $b$, the target is to compute the size of the set $b\cap\mathcal{B}=\left\{b'\in\mathcal{B}\;\middle|\;b'\cap b\neq\emptyset\right\}$. To make this query computationally fast, we allow an approximation. More precisely, for a ball $b$ a set $\widetilde{b}$ is a $(1+\delta)$-ball of $b$, if $b\subseteq\widetilde{b}\subseteq(1+\delta)b$, where $(1+\delta)b$ is the $(1+\delta)$-scaling of $b$ around its center. The purpose here, given a query ball $b$, is to compute the size of the set $\widetilde{b}\cap\mathcal{B}$ for some $(1+\delta)$-ball $\widetilde{b}$ of $b$.
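The approximation allowed here can be read as an interval of acceptable answers: any count between the number of balls meeting the query ball and the number meeting its $(1+\delta)$-scaling is valid. The following brute-force sketch spells this out; the function names and the example are assumptions for illustration, and it is a linear-time reference check rather than the query procedure of this section.

```python
import math

def count_intersecting(q, x, balls):
    """Number of closed balls (center, radius) that intersect the closed ball b(q, x)."""
    return sum(1 for c, r in balls if math.dist(q, c) <= x + r)

def is_valid_range_count(n_reported, q, x, delta, balls):
    """True if n_reported lies between the exact counts for b(q, x) and b(q, (1+delta)x)."""
    lo = count_intersecting(q, x, balls)
    hi = count_intersecting(q, (1 + delta) * x, balls)
    return lo <= n_reported <= hi

balls = [((0.0, 0.0), 1.0), ((3.0, 0.0), 1.0), ((10.0, 0.0), 1.0)]
q = (1.5, 0.0)
# b(q, 0.4) meets no ball, while b(q, 0.6) meets the first two.
print(count_intersecting(q, 0.4, balls), count_intersecting(q, 0.6, balls))
print(is_valid_range_count(1, q, 0.4, 0.5, balls))
```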

Lemma 3.2.

Given a compressed quadtree $\mathcal{T}$ of size $n$, a convex set $X$, and a parameter $\delta>0$, one can compute the set of nodes in $\mathcal{T}$ that realizes $\mathsf{G}_{\approx}(X,\delta)$ (see Definition 2.5), in $O(\log n+1/\delta^d)$ time. Specifically, this outputs a set $X_N$ of nodes, of size $O(1/\delta^d)$, such that their cells intersect $\mathsf{G}_{\approx}(X,\delta)$, and their parents' cell diameter is larger than $\delta\,\mathrm{diam}(X)$. Note that the cells in $X_N$ might be significantly larger if they are leaves of $\mathcal{T}$.

Proof.

Let $\mathsf{G}_{\approx}=\mathsf{G}_{\approx}(X,1)$ be the grid approximation to $X$. Using cell queries on the compressed quadtree, one can compute the cells of $\mathcal{T}$ that correspond to these canonical cells. Specifically, for each cube $\Box\in\mathsf{G}_{\approx}(X)$, the query either returns a node for which this is its cell, or it returns a compressed edge of the quadtree; that is, two cells (one is a parent of the other), such that $\Box$ is contained in one of them and contains the other. Such a cell query takes $O(\log n)$ time [12]. This returns $O(1)$ nodes in $\mathcal{T}$ such that their cells cover $\mathsf{G}_{\approx}(X)$.

Now, traverse down the compressed quadtree starting from these nodes and collect all the nodes of the quadtree that are relevant. Clearly, one has to go at most $O(\log(1/\delta))$ levels down the quadtree to get these nodes, and this takes $O(1/\delta^d)$ time overall.

Lemma 3.3.

Let $X$ be any convex set in $\mathbb{R}^d$, and let $\delta>0$ be a parameter. Using $\mathcal{DS}_{3.1}$, one can compute, in $O(\log n+1/\delta^d)$ time, all the balls of $\mathcal{B}$ that intersect $X$, with diameter $\geq\delta\,\mathrm{diam}(X)$.

Proof.

We compute the cells of the quadtree realizing $\mathsf{G}_{\approx}(X,\delta)$ using Lemma 3.2. Now, from each such cell (and its parent), we extract the list of large balls intersecting it (there are $O(1/\delta^d)$ such nodes, and the size of each such list is $O(1)$). Next we check for each such ball if it intersects $X$ and if its diameter is at least $\delta\,\mathrm{diam}(X)$. We return the list of all such balls.

3.2 Answering a query

Given a query ball $b_{\mathtt{q}}=\mathsf{b}(\mathtt{q},x)$, and an approximation parameter $\delta>0$, our purpose is to compute a number $N$, such that $\left|\mathcal{B}(\mathsf{b}(\mathtt{q},x))\right|\leq N\leq\left|\mathcal{B}(\mathsf{b}(\mathtt{q},(1+\delta)x))\right|$.

The query algorithm works as follows:

  1.   (A)

    Using Lemma 3.3, compute a set $X$ of all the balls that intersect $b_{\mathtt{q}}$ and are of radius $\geq\delta x/4$.

  2.   (B)

    Using $\mathcal{DS}_{3.1}$, compute $O(1/\delta^d)$ cells of $\mathcal{T}_{\mathcal{C}}$ that correspond to $\mathsf{G}_{\approx}(b_{\mathtt{q}}(1+\delta/4),\delta/4)$. Let $N'$ be the total number of points in $\mathcal{C}$ stored in these nodes.

  3.   (C)

    The quantity $N'+|X|$ is almost the desired quantity, except that we might be counting some of the balls of $X$ twice. To this end, let $N''$ be the number of balls in $X$ with centers in $\mathsf{G}_{\approx}(b_{\mathtt{q}}(1+\delta/4),\delta/4)$.

  4.   (D)

    Let $N\leftarrow N'+|X|-N''$. Return $N$.

We only sketch the proof, as the proof is straightforward. Indeed, the union of the cells of $\mathsf{G}_{\approx}(b_{\mathtt{q}}(1+\delta/4),\delta/4)$ contains $\mathsf{b}(\mathtt{q},x(1+\delta/4))$ and is contained in $\mathsf{b}(\mathtt{q},(1+\delta)x)$. All the balls with radius smaller than $\delta x/4$ and intersecting $\mathsf{b}(\mathtt{q},x)$ have their centers in cells of $\mathsf{G}_{\approx}(b_{\mathtt{q}}(1+\delta/4),\delta/4)$, and their number is computed correctly. Similarly, the "large" balls are computed correctly. The last stage ensures we do not over-count by $1$ each large ball that also has its center in $\mathsf{G}_{\approx}(b_{\mathtt{q}}(1+\delta/4),\delta/4)$. It is also easy to check that $\left|\mathcal{B}(\mathsf{b}(\mathtt{q},x))\right|\leq N\leq\left|\mathcal{B}(\mathsf{b}(\mathtt{q},x(1+\delta)))\right|$. The same result can be used for $x/(1+\delta)$ to get $\delta$-monotonicity of $N$.

We now analyze the running time. Computing all the cells of $\mathsf{G}_{\approx}(b_{\mathtt{q}}(1+\delta/4),\delta/4)$ takes $O(\log n+1/\delta^d)$ time. Computing the "large" balls takes $O(\log n+1/\delta^d)$ time. Checking for each large ball if it is already counted by the "small" balls takes $O(1/\delta^d)$ by using a grid. We denote the above query algorithm by rangeCount$(\mathtt{q},x,\delta)$.
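The counting rule $N=N'+|X|-N''$ can be simulated without any quadtree. The sketch below is a brute-force stand-in that scans all balls (so it runs in linear time, not in $O(\log n+1/\delta^d)$), and it uses plain grid cells of the side length from Definition 2.5 without restricting them to the unit cube. The function name `range_count` and the example input are assumptions made for illustration; it requires $x>0$.

```python
import math

def range_count(q, x, delta, balls):
    """Brute-force simulation of N = N' + |X| - N'' for the query ball b(q, x)."""
    d = len(q)
    big_r = x * (1 + delta / 4)  # radius of the enlarged query ball
    # Cell side for G(b_q(1 + delta/4), delta/4); the enlarged ball has diameter 2*big_r.
    side = 2.0 ** math.floor(math.log2((delta / 4) * 2 * big_r / math.sqrt(d)))

    def center_in_cells(c):
        """Does the grid cell containing c intersect the enlarged query ball?"""
        corner = [math.floor(ci / side) * side for ci in c]
        nearest = [min(max(qi, a), a + side) for qi, a in zip(q, corner)]
        return math.dist(q, nearest) <= big_r

    # X: "large" balls (radius >= delta*x/4) that intersect b(q, x).
    large = [(c, r) for c, r in balls
             if r >= delta * x / 4 and math.dist(q, c) <= x + r]
    n_prime = sum(1 for c, _ in balls if center_in_cells(c))    # N'
    n_double = sum(1 for c, _ in large if center_in_cells(c))   # N''
    return n_prime + len(large) - n_double

balls = [((0.0, 0.0), 0.05), ((0.3, 0.0), 0.05), ((1.0, 0.0), 0.5), ((3.0, 3.0), 0.1)]
q = (0.1, 0.0)
print(range_count(q, 0.5, 0.5, balls))
```

In this example the two small balls are counted through their centers, the ball of radius $0.5$ is counted as a large ball, and the far ball is not counted at all.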

The above implies the following.

Lemma 3.4.

Given a set $\mathcal{B}$ of $n$ disjoint balls in $\mathbb{R}^d$, it can be preprocessed, in $O(n\log n)$ time, into a data structure of size $O(n)$, such that given a query ball $\mathsf{b}(\mathtt{q},x)$ and approximation parameter $\delta>0$, the query algorithm rangeCount$(\mathtt{q},x,\delta)$ returns, in $O(\log n+1/\delta^d)$ time, a number $N$ satisfying the following:

  1.   (A)

    $N\leq\left|\mathcal{B}(\mathsf{b}(\mathtt{q},(1+\delta)x))\right|$,

  2.   (B)

    $\left|\mathcal{B}(\mathsf{b}(\mathtt{q},x))\right|\leq N$, and

  3.   (C)

    for a query ball $\mathsf{b}(\mathtt{q},x)$ and $\delta$, the number $N$ is $\delta$-monotonic as a function of $x$, see Definition 2.7.

4 Answering $k$-ANN queries among balls

4.1 Computing a constant factor approximation to $\mathsf{d}_k(\mathtt{q},\mathcal{B})$

Lemma 4.1.

Let $\mathcal{B}$ be a set of disjoint balls in $\mathbb{R}^d$, and consider a ball $b=\mathsf{b}(\mathtt{q},r)$ that intersects at least $k$ balls of $\mathcal{B}$. Then, among the $k$ nearest neighbors of $\mathtt{q}$ from $\mathcal{B}$, there are at least $\max(0,k-\mathsf{c}_d)$ balls of radius at most $r$. The centers of all these balls are in $\mathsf{b}(\mathtt{q},2r)$.

Proof.

Consider the $k$ nearest neighbors of $\mathtt{q}$ from $\mathcal{B}$. Any such ball that has its center outside $\mathsf{b}(\mathtt{q},2r)$ has radius at least $r$, since it intersects $b=\mathsf{b}(\mathtt{q},r)$. Since the number of balls that are of radius at least $r$ and intersecting $b$ is bounded by $\mathsf{c}_d$, there must be at least $\max(0,k-\mathsf{c}_d)$ balls among the $k$ nearest neighbors, each having radius less than $r$. Now, $\mathsf{b}(\mathtt{q},2r)$ will contain the centers of all such balls.

Corollary 4.2.

Let $\gamma=\min(k,\mathsf{c}_d)$. Then, $\mathsf{d}_{k-\gamma}(\mathtt{q},\mathcal{C})/2\leq\mathsf{d}_k(\mathtt{q},\mathcal{B})$.
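The corollary can be sanity-checked numerically on random input. The sketch below is an illustration under stated assumptions, not part of the paper: it works in the plane, generates disjoint balls by rejection sampling with a fixed seed, replaces the unknown constant $\mathsf{c}_d$ by its upper bound $3^d=9$ from Lemma 2.6 (which only weakens the left-hand side, so the inequality must still hold), and treats the $0$th-nearest center distance as $0$. All names are hypothetical.

```python
import math
import random

C_D_UPPER = 3 ** 2  # assumption: upper bound 3^d on c_d for d = 2, used in place of c_d

def dist_to_ball(q, ball):
    center, radius = ball
    return max(math.dist(q, center) - radius, 0.0)

def kth_smallest(values, k):
    """k-th smallest value (1-indexed); taken to be 0 when k == 0."""
    return 0.0 if k == 0 else sorted(values)[k - 1]

def random_disjoint_balls(n, rng, max_radius=0.05, max_tries=100_000):
    """Rejection-sample up to n pairwise disjoint balls with centers in the unit square."""
    balls = []
    for _ in range(max_tries):
        if len(balls) == n:
            break
        c = (rng.random(), rng.random())
        r = rng.uniform(0.0, max_radius)
        if all(math.dist(c, c2) > r + r2 for c2, r2 in balls):
            balls.append((c, r))
    return balls

def corollary_holds(q, balls, k):
    """Check d_{k-gamma}(q, C) / 2 <= d_k(q, B), up to a small floating-point slack."""
    gamma = min(k, C_D_UPPER)
    centers_term = kth_smallest([math.dist(q, c) for c, _ in balls], k - gamma) / 2
    balls_term = kth_smallest([dist_to_ball(q, b) for b in balls], k)
    return centers_term <= balls_term + 1e-12

rng = random.Random(0)
balls = random_disjoint_balls(40, rng)
queries = [(rng.random(), rng.random()) for _ in range(20)]
all_ok = all(corollary_holds(q, balls, k)
             for q in queries for k in range(1, len(balls) + 1))
print(all_ok)
```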

The basic observation is that we only need a rough approximation to the right radius, as using approximate range counting (i.e., Lemma 3.4), one can improve the approximation.

Let $x_i$ denote the distance of $\mathtt{q}$ to the $i$th closest center in $\mathcal{C}$. Let $d_k=\mathsf{d}_k(\mathtt{q},\mathcal{B})$. Let $i$ be the minimum index such that $d_k\leq x_i$. Since $d_k\leq x_k$, it must be that $i\leq k$. There are several possibilities:

  1.   (A)

    If $i\leq k-\mathsf{c}_d$ (i.e., $d_k\leq x_{k-\mathsf{c}_d}$) then, by Lemma 4.1, the ball $\mathsf{b}(\mathtt{q},2d_k)$ contains at least $k-\mathsf{c}_d$ centers. As such, $d_k<x_{k-\mathsf{c}_d}\leq 2d_k$, and $x_{k-\mathsf{c}_d}$ is a good approximation to $d_k$.

  2.   (B)

    If $i>k-\mathsf{c}_d$, and $d_k\leq 4x_{i-1}$, then $x_{i-1}$ is the desired approximation.

  3.   (C)

    If $i>k-\mathsf{c}_d$, and $d_k\geq x_i/4$, then $x_i$ is the desired approximation.

  4.   (D)

    Otherwise, it must be that $i>k-\mathsf{c}_d$, and $4x_{i-1}<d_k<x_i/4$. Let $b_j=\mathsf{b}(\mathsf{c}_j,\mathsf{r}_j)$ be the $j$th closest ball to $\mathtt{q}$, for $j=1,\ldots,k$. It must be that $b_i,\ldots,b_k$ are much larger than $\mathsf{b}(\mathtt{q},d_k)$. But then, the balls $b_i,\ldots,b_k$ must intersect $\mathsf{b}(\mathtt{q},x_i/2)$, and their radius is at least $x_i/2$. We can easily compute these big balls using $\mathcal{DS}_{3.1}$ (B), and the number of centers of the small balls close to the query, and then compute $d_k$ exactly.

We build $\mathcal{DS}_{3.1}$ in $O(n\log n)$ time.

First we introduce some notation. For $x\geq 0$, let $N(x)$ denote the number of balls in $\mathcal{B}$ that intersect $\mathsf{b}(\mathtt{q},x)$; that is, $N(x)=\left|\left\{b\in\mathcal{B}\;\middle|\;b\cap\mathsf{b}(\mathtt{q},x)\neq\emptyset\right\}\right|$, and let $C(x)$ denote the number of centers in $\mathsf{b}(\mathtt{q},x)$, i.e., $C(x)=\left|\mathcal{C}\cap\mathsf{b}(\mathtt{q},x)\right|$. Also, let $\#(x)$ denote the $2$-approximation to the number of balls of $\mathcal{B}$ intersecting $\mathsf{b}(\mathtt{q},x)$, as computed by Lemma 3.4; that is, $N(x)\leq\#(x)\leq N(2x)$.

We now provide our algorithm to answer a query. We are given a query point $\mathtt{q}\in\mathbb{R}^d$ and a number $k$.

Using $\mathcal{DS}_{3.1}$, compute a $2$-approximation for the smallest ball containing $k-i$ centers of $\mathcal{B}$, for $i=0,\ldots,\gamma$, where $\gamma=\min(k,\mathsf{c}_d)$, and let $r_{k-i}$ be this radius. That is, for $i=0,\ldots,\gamma$, we have $C(r_{k-i}/2)\leq k-i\leq C(r_{k-i})$. For $i=0,\ldots,\gamma$, compute $N_{k-i}=\#(r_{k-i})$ (Lemma 3.4).

Let $\alpha$ be the maximum index such that $N_{k-\alpha}\geq k$. Clearly, $\alpha$ is well defined as $N_k\geq k$. The algorithm is executed in the following steps.

  1.  (A)

    If α=γ\alpha=\gamma we return 2rkγ2r_{k-\gamma}.

  2.  (B)

    If #(rkα/4)<k\#\!\left({r_{k-\alpha}/4}\right)<k, we return 2rkα2r_{k-\alpha}.

  3.  (C)

    Otherwise, compute all the balls of $\mathcal{B}$ that are of radius at least $r_{k-\alpha}/4$ and intersect the ball $\mathsf{b}(\mathtt{q},r_{k-\alpha}/4)$, using $\mathcal{DS}_{3.1}$ (B). For each such ball $b$, compute the distance $\zeta=\mathsf{d}(\mathtt{q},b)$ of $\mathtt{q}$ to it. Return $2\zeta$ for the minimum such number $\zeta$ such that $\#(\zeta)\geq k$.

Lemma 4.3.

Given a set of nn disjoint balls \mathcal{B} in IRd{\rm I\!\hskip-0.24994ptR}^{d}, one can preprocess them, in O(nlogn)O(n\log n) time, into a data structure of size O(n)O(n), such that given a query point 𝚚IRd\mathtt{q}\in{\rm I\!\hskip-0.24994ptR}^{d}, and a number kk, one can compute, in O(logn)O(\log n) time, a number xx such that, x/4𝖽k(𝚚,)4xx/4\leq\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)\leq 4x.

Proof.

The data structure and query algorithm are described above. We next prove correctness. To prove that (A) returns the correct answer observe that under the given assumptions,

\[ r_{k-\gamma}/4\leq\mathsf{d}_{k-\gamma}(\mathtt{q},\mathcal{C})/2\leq\mathsf{d}_{k}(\mathtt{q},\mathcal{B})\leq 2r_{k-\gamma}, \]

where the second inequality follows from Corollary 4.2, and the third inequality follows as N(2rkγ)#(rkγ)kN(2r_{k-\gamma})\geq\#\!\left({r_{k-\gamma}}\right)\geq k, while 𝖽k(𝚚,)\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right) is the smallest number xx such that N(x)kN(x)\geq k.

For (B), observe that $N(r_{k-\alpha}/4)\leq\#(r_{k-\alpha}/4)<k$, and as such $r_{k-\alpha}/4<\mathsf{d}_{k}(\mathtt{q},\mathcal{B})$. By the choice of $\alpha$, we have $\#(r_{k-\alpha})\geq k$, and so $N(2r_{k-\alpha})\geq\#(r_{k-\alpha})\geq k$; thus $\mathsf{d}_{k}(\mathtt{q},\mathcal{B})\leq 2r_{k-\alpha}$.

For (C), first observe that α<γ\alpha<\gamma as the algorithm did not return in (A). Since α\alpha is the maximum index such that #(rkα)k\#\!\left({r_{k-\alpha}}\right)\geq k, so N(rkα1)#(rkα1)<kN(r_{k-\alpha-1})\leq\#\!\left({r_{k-\alpha-1}}\right)<k implying, rkα1<𝖽k(𝚚,)r_{k-\alpha-1}<\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right). Also, 𝖽k(𝚚,)rkα/4\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)\leq r_{k-\alpha}/4, as the algorithm did not return in (B). Now the ball 𝖻(𝚚,rkα1)\mathsf{b}(\mathtt{q},r_{k-\alpha-1}) contains at least kα1k-\alpha-1 centers from 𝒞\mathcal{C}, but it does not contain kαk-\alpha centers. Indeed, otherwise we would have 𝖽kα(𝚚,𝒞)rkα1\mathsf{d}_{k-\alpha}\!\left({\mathtt{q},\mathcal{C}}\right)\leq r_{k-\alpha-1} and so rkα2𝖽kα(𝚚,𝒞)2rkα1r_{k-\alpha}\leq 2\mathsf{d}_{k-\alpha}\!\left({\mathtt{q},\mathcal{C}}\right)\leq 2r_{k-\alpha-1}, but on the other hand rkα1<𝖽k(𝚚,)rkα/4r_{k-\alpha-1}<\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)\leq r_{k-\alpha}/4, which would be a contradiction. Similarly, there is no center of any ball whose distance from 𝚚\mathtt{q} is in the range (rkα1,rkα/2)(r_{k-\alpha-1},r_{k-\alpha}/2) otherwise we would have that 𝖽kα(𝚚,𝒞)<rkα/2\mathsf{d}_{k-\alpha}\!\left({\mathtt{q},\mathcal{C}}\right)<r_{k-\alpha}/2 and this would mean that rkα2𝖽kα(𝚚,𝒞)<rkαr_{k-\alpha}\leq 2\mathsf{d}_{k-\alpha}\!\left({\mathtt{q},\mathcal{C}}\right)<r_{k-\alpha}, a contradiction. Now, the center of the kkth closest ball is clearly more than rkα1r_{k-\alpha-1} away from 𝚚\mathtt{q}. As such its distance from 𝚚\mathtt{q} is at least rkα/2r_{k-\alpha}/2. Since 𝖽k(𝚚,)rkα/4\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)\leq r_{k-\alpha}/4 it follows that the kkth closest ball intersects 𝖻(𝚚,rkα/4)\mathsf{b}(\mathtt{q},r_{k-\alpha}/4) and moreover, its radius is at least rkα/4r_{k-\alpha}/4. Since we compute all such balls in (C), we do encounter the kkth closest ball. 
It is easy to see that in this case we return a number ζ\zeta satisfying, ζ/2𝖽k(𝚚,)2ζ\zeta/2\leq\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)\leq 2\zeta.

As for the running time, notice that we need to use the algorithm of Lemma 3.4 O(1)O(1) times, each iteration taking time O(logn)O(\log n). After this we need another O(logn)O(\log n) time for the invocation of the algorithm in Lemma 3.3. As such, the total query time is O(logn)O(\log n).

We now show how to refine the approximation.

Lemma 4.4.

Given a set \mathcal{B} of nn balls in IRd{\rm I\!\hskip-0.24994ptR}^{d}, it can be preprocessed, in O(nlogn)O(n\log n) time, into a data structure of size O(n)O(n). Given a query point 𝚚\mathtt{q}, numbers k,xk,x, and an approximation parameter ε>0{\varepsilon}>0, such that x/4𝖽k(𝚚,)4xx/4\leq\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)\leq 4x, one can find a ball bb\in\mathcal{B} such that, (1ε)𝖽k(𝚚,)𝖽(𝚚,b)(1+ε)𝖽k(𝚚,)(1-{\varepsilon})\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)\leq\mathsf{d}\!\left({\mathtt{q},b}\right)\leq(1+{\varepsilon})\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right), in O(logn+1/εd)O\!\left({\log n+1/{\varepsilon}^{d}}\right) time.

Proof.

We are going to use the same data structure as Lemma 3.4, for the query ball b𝚚=𝖻(𝚚,4x(1+ε))b_{\mathtt{q}}=\mathsf{b}(\mathtt{q},4x(1+{\varepsilon})). We compute all large balls of \mathcal{B} that intersect b𝚚b_{\mathtt{q}}. Here a large ball is a ball of radius >xε>x{\varepsilon}, and a ball of radius at most xεx{\varepsilon} is considered to be a small ball. Consider the O(1/εd)O(1/{\varepsilon}^{d}) grid cells of 𝖦(b𝚚,ε/16)\mathsf{G}_{\approx}\!\left({b_{\mathtt{q}},{\varepsilon}/16}\right). In O(1/εd)O(1/{\varepsilon}^{d}) time we can record the number of centers of large balls inside any such cell. Clearly, any small ball that intersects 𝖻(𝚚,4x)\mathsf{b}(\mathtt{q},4x) has its center in some cell of 𝖦(b𝚚,ε/16)\mathsf{G}_{\approx}\!\left({b_{\mathtt{q}},{\varepsilon}/16}\right). We use the quadtree 𝒯𝒞\mathcal{T}_{\mathcal{C}} to find out exactly the number of centers, NN_{\mathsf{\Box}}, of small balls in each cell \mathsf{\Box} of 𝖦(b𝚚,ε/16)\mathsf{G}_{\approx}\!\left({b_{\mathtt{q}},{\varepsilon}/16}\right), by finding the total number of centers using 𝒯𝒞\mathcal{T}_{\mathcal{C}}, and decreasing this by the count of centers of large balls in that cell. This can be done in time O(logn+1/εd)O(\log n+1/{\varepsilon}^{d}). We pick an arbitrary point in \mathsf{\Box}, and assign it weight NN_{\mathsf{\Box}}, and treat it as representing all the small balls in this grid cell – clearly, this introduces an error of size εx\leq{\varepsilon}x in the distance of such a ball from 𝚚\mathtt{q}, and as such we can ignore it in our argument. In the end of this snapping process, we have O(1/εd)O(1/{\varepsilon}^{d}) weighted points, and O(1/εd)O(1/{\varepsilon}^{d}) large balls. We know the distance of the query point from each one of these points/balls. This results in O(1/εd)O(1/{\varepsilon}^{d}) weighted distances, and we want the smallest \ell, such that the total weight of the distances \leq\ell is at least kk. 
This can be done by weighted median selection in linear time in the number of distances, which is O(1/εd)O(1/{\varepsilon}^{d}). Once we get the required point we can output any ball bb corresponding to the point. Clearly, bb satisfies the required conditions.
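The final selection step can be sketched as follows. This is a minimal illustration, assuming the snapped points and large balls are given as hypothetical `(distance, weight)` pairs (weight $1$ for a large ball, $N_{\mathsf{\Box}}$ for a cell representative). For brevity it sorts the pairs, which costs $O(m\log m)$ for $m$ pairs, rather than using the linear-time weighted selection the proof relies on.

```python
def weighted_kth_distance(items, k):
    """Return the smallest distance l such that the total weight of all
    items with distance <= l is at least k, or None if the total weight
    is below k. `items` is a list of (distance, weight) pairs."""
    total = 0
    for distance, weight in sorted(items):
        total += weight
        if total >= k:
            return distance
    return None

# Two weighted cell representatives and one large ball (weight 1).
items = [(5.0, 1), (1.0, 2), (3.0, 4)]
```

Replacing the sort by a weighted median selection routine would recover the $O(1/{\varepsilon}^{d})$ bound stated in the proof.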

4.2 The result

Theorem 4.5.

Given a set of nn disjoint balls \mathcal{B} in IRd{\rm I\!\hskip-0.24994ptR}^{d}, one can preprocess them in time O(nlogn)O(n\log n) into a data structure of size O(n)O(n), such that given a query point 𝚚IRd\mathtt{q}\in{\rm I\!\hskip-0.24994ptR}^{d}, a number kk with 1kn1\leq k\leq n and ε>0{\varepsilon}>0, one can find in time O(logn+εd)O\!\left({\log n+{\varepsilon}^{-d}}\right) a ball bb\in\mathcal{B}, such that, (1ε)𝖽k(𝚚,)𝖽(𝚚,b)(1+ε)𝖽k(𝚚,)(1-{\varepsilon})\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)\leq\mathsf{d}\!\left({\mathtt{q},b}\right)\leq(1+{\varepsilon})\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right).

5 Quorum clustering

We are given a set \mathcal{B} of nn disjoint balls in IRd{\rm I\!\hskip-0.24994ptR}^{d}, and we describe how to compute quorum clustering for them quickly.

Let ξ\xi be some constant. Let 0=\mathcal{B}_{0}=\emptyset. For i=1,,mi=1,\ldots,m, let i=(j=0i1j)\mathcal{R}_{i}=\mathcal{B}\setminus(\bigcup_{j=0}^{i-1}\mathcal{B}_{j}), and let Λi=𝖻(𝗐i,𝗑i)\Lambda_{i}=\mathsf{b}(\mathsf{w}_{i},\mathsf{x}_{i}) be any ball that satisfies,

  1.   (A)

    Λi\Lambda_{i} contains min(k𝖼d,|i|)\min(k-\mathsf{c}_{d},\left\lvert{\mathcal{R}_{i}}\right\rvert) balls of i\mathcal{R}_{i} completely inside it,

  2.   (B)

    Λi\Lambda_{i} intersects at least kk balls of \mathcal{B}, and

  3.   (C)

    the radius of Λi\Lambda_{i} is at most ξ\xi times the radius of the smallest ball satisfying the above conditions.

Next, we remove any k𝖼dk-\mathsf{c}_{d} balls that are contained in Λi\Lambda_{i} from i\mathcal{R}_{i} to get the set i+1\mathcal{R}_{i+1}. We call the removed set of balls i\mathcal{B}_{i}. We repeat this process till all balls are extracted. Notice that at each step ii, we only require that the Λi\Lambda_{i} intersects kk balls of \mathcal{B} (and not i\mathcal{R}_{i}), but that it must contain k𝖼dk-\mathsf{c}_{d} balls from i\mathcal{R}_{i}. Also, the last quorum ball may contain fewer balls. The balls Λ1,,Λm\Lambda_{1},\ldots,\Lambda_{m}, are the resulting ξ\xi-approximate quorum clustering.

5.1 Computing an approximate quorum clustering

Definition 5.1.

For a set 𝖯\mathsf{P} of nn points in IRd{\rm I\!\hskip-0.24994ptR}^{d}, and an integer \ell, with 1n1\leq\ell\leq n, let ropt(𝖯,)r_{\mathrm{opt}}\!\left({{\mathsf{P}},{\ell}}\right) denote the radius of the smallest ball which contains at least \ell points from 𝖯\mathsf{P}, i.e., ropt(𝖯,)=min𝚚IRd𝖽(𝚚,𝖯)r_{\mathrm{opt}}\!\left({{\mathsf{P}},{\ell}}\right)=\min_{\mathtt{q}\in{\rm I\!\hskip-0.19925ptR}^{d}}\mathsf{d}_{\ell}\!\left({\mathtt{q},\mathsf{P}}\right).

Similarly, for a set \mathcal{R} of nn balls in IRd{\rm I\!\hskip-0.24994ptR}^{d}, and an integer \ell, with 1n1\leq\ell\leq n, let Ropt(,)R_{\mathrm{opt}}\!\left({{\mathcal{R}},{\ell}}\right) denote the radius of the smallest ball which completely contains at least \ell balls from \mathcal{R}.

Lemma 5.2 ([14]).

Given a set 𝖯\mathsf{P} of nn points in IRd{\rm I\!\hskip-0.24994ptR}^{d} and integer \ell, with 1n1\leq\ell\leq n, one can compute, in O(nlogn)O(n\log n) time, a sequence of n/\lceil n/\ell\rceil balls, 𝗈1=𝖻(𝗎1,ψ1),,𝗈n/=𝖻(𝗎n/,ψn/)\mathsf{o}_{1}=\mathsf{b}(\mathsf{u}_{1},\psi_{1}),\ldots,\mathsf{o}_{\lceil n/\ell\rceil}=\mathsf{b}(\mathsf{u}_{\lceil n/\ell\rceil},\psi_{\lceil n/\ell\rceil}), such that, for all i,1in/i,1\leq i\leq\lceil n/\ell\rceil, we have

  1.  (A)

    For every ball $\mathsf{o}_{i}$, there is an associated subset $\mathsf{P}_{i}$ of $\min(\ell,|\mathsf{Q}_{i}|)$ points of $\mathsf{Q}_{i}=\mathsf{P}\setminus(\mathsf{P}_{1}\cup\ldots\cup\mathsf{P}_{i-1})$, that it covers.

  2.  (B)

    The ball 𝗈i=𝖻(𝗎i,ψi)\mathsf{o}_{i}=\mathsf{b}(\mathsf{u}_{i},\psi_{i}) is a 22-approximation to the smallest ball covering min(,|𝖰i|)\min(\ell,\left\lvert{\mathsf{Q}_{i}}\right\rvert) points in 𝖰i\mathsf{Q}_{i}; that is, ψi/2ropt(𝖰i,min(,|𝖰i|))ψi\psi_{i}/2\leq r_{\mathrm{opt}}\!\left({{\mathsf{Q}_{i}},{\min(\ell,\left\lvert{\mathsf{Q}_{i}}\right\rvert)}}\right)\leq\psi_{i}.

The algorithm to construct an approximate quorum clustering is as follows. We use the algorithm of Lemma 5.2 with the set of points 𝖯=𝒞\mathsf{P}=\mathcal{C}, and =k𝖼d\ell=k-\mathsf{c}_{d} to get a list of m=n/(k𝖼d)m=\lceil n/(k-\mathsf{c}_{d})\rceil balls 𝗈1=𝖻(𝗎1,ψ1),,𝗈m=𝖻(𝗎m,ψm)\mathsf{o}_{1}=\mathsf{b}(\mathsf{u}_{1},\psi_{1}),\ldots,\mathsf{o}_{m}=\mathsf{b}(\mathsf{u}_{m},\psi_{m}), satisfying the conditions of Lemma 5.2. Next we use the algorithm of Theorem 4.5, to compute (k,ε)(k,{\varepsilon})-ANN distances from the centers 𝗎1,,𝗎m\mathsf{u}_{1},\ldots,\mathsf{u}_{m}, to the balls of \mathcal{B}.

Thus, we get numbers γi\gamma_{i} satisfying, (1/2)𝖽k(𝗎i,)γi(3/2)𝖽k(𝗎i,)(1/2)\mathsf{d}_{k}\!\left({\mathsf{u}_{i},\mathcal{B}}\right)\leq\gamma_{i}\leq(3/2)\mathsf{d}_{k}\!\left({\mathsf{u}_{i},\mathcal{B}}\right). Let ζi=max(2γi,3ψi)\zeta_{i}=\max(2\gamma_{i},3\psi_{i}), for i=1,,mi=1,\ldots,m. Sort ζ1,,ζm\zeta_{1},\ldots,\zeta_{m} (we assume for the sake of simplicity of exposition that ζm\zeta_{m}, being the radius of the last cluster is the largest number). Suppose the sorted order is the permutation π\pi of {1,,m}\left\{{1,\ldots,m}\right\} (by assumption π(m)=m\pi(m)=m). We output the balls Λi=𝖻(𝗎π(i),ζπ(i))\Lambda_{i}=\mathsf{b}(\mathsf{u}_{\pi(i)},\zeta_{\pi(i)}), for i=1,,mi=1,\ldots,m, as the approximate quorum clustering.
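The last step of this construction is simple enough to sketch directly. The snippet below assumes the centers $\mathsf{u}_{i}$, the radii $\psi_{i}$ from Lemma 5.2, and the approximate distances $\gamma_{i}$ are already available as plain lists (the function and argument names are hypothetical). It computes $\zeta_{i}=\max(2\gamma_{i},3\psi_{i})$ and outputs the balls in sorted order of radius; it does not treat the last cluster specially.

```python
def order_quorum_balls(centers, psi, gamma):
    """Return the balls (u_{pi(i)}, zeta_{pi(i)}) sorted by increasing zeta,
    where zeta_i = max(2 * gamma_i, 3 * psi_i)."""
    zeta = [max(2 * g, 3 * p) for g, p in zip(gamma, psi)]
    order = sorted(range(len(centers)), key=lambda i: zeta[i])  # permutation pi
    return [(centers[i], zeta[i]) for i in order]

centers = [(0, 0), (5, 5), (9, 1)]
psi = [1.0, 0.5, 2.0]
gamma = [2.0, 0.5, 1.0]
clusters = order_quorum_balls(centers, psi, gamma)
```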

5.2 Correctness

Lemma 5.3.

Let ={b1,,bn}\mathcal{B}=\left\{{b_{1},\ldots,b_{n}}\right\} be a set of nn disjoint balls, where bi=𝖻(𝖼i,𝗋i)b_{i}=\mathsf{b}(\mathsf{c}_{i},\mathsf{r}_{i}), for i=1,,ni=1,\ldots,n. Let 𝒞={𝖼1,,𝖼n}\mathcal{C}=\left\{{\mathsf{c}_{1},\ldots,\mathsf{c}_{n}}\right\} be the set of centers of these balls. Let b=𝖻(𝖼,𝗋)b=\mathsf{b}(\mathsf{c},\mathsf{r}) be any ball that contains at least \ell centers from 𝒞\mathcal{C}, for some 2n2\leq\ell\leq n. Then 𝖻(𝖼,3𝗋)\mathsf{b}(\mathsf{c},3\mathsf{r}) contains the \ell balls that correspond to those centers.

Proof.

Without loss of generality suppose bb contains the \ell centers 𝖼1,,𝖼\mathsf{c}_{1},\ldots,\mathsf{c}_{\ell}, from 𝒞\mathcal{C}. Now consider any index ii with 1i1\leq i\leq\ell, and consider any jij\neq i, which exists as 2\ell\geq 2 by assumption. Since 𝖻(𝖼,𝗋)\mathsf{b}(\mathsf{c},\mathsf{r}) contains both 𝖼i\mathsf{c}_{i} and 𝖼j\mathsf{c}_{j}, 2𝗋𝖼i𝖼j2\mathsf{r}\geq\left\lVert{\mathsf{c}_{i}-\mathsf{c}_{j}}\right\rVert by the triangle inequality. On the other hand, as the balls bib_{i} and bjb_{j} are disjoint we have that 𝖼i𝖼j𝗋i+𝗋j𝗋i\left\lVert{\mathsf{c}_{i}-\mathsf{c}_{j}}\right\rVert\geq\mathsf{r}_{i}+\mathsf{r}_{j}\geq\mathsf{r}_{i}. It follows that 𝗋i2𝗋\mathsf{r}_{i}\leq 2\mathsf{r} for all 1i1\leq i\leq\ell. As such the ball 𝖻(𝖼,3𝗋)\mathsf{b}(\mathsf{c},3\mathsf{r}) must contain the entire ball bib_{i}, and thus it contains all the \ell balls b1,,bb_{1},\ldots,b_{\ell}, corresponding to the centers.
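As a sanity check of the constant $3$, the following sketch tests the statement of Lemma 5.3 on a concrete instance. The helper names are assumptions for this illustration, and a small tolerance is used for floating-point comparisons. The example also shows that a factor of $2$ would not suffice here.

```python
import math

def ball_contains_ball(c, r, ci, ri, tol=1e-12):
    """True if the ball b(ci, ri) lies completely inside b(c, r)."""
    return math.dist(c, ci) + ri <= r + tol

def check_lemma_5_3(c, r, balls):
    """For disjoint `balls`, check that b(c, 3r) contains every ball whose
    center lies in b(c, r), provided at least two centers lie in b(c, r)."""
    inside = [(ci, ri) for ci, ri in balls if math.dist(c, ci) <= r]
    if len(inside) < 2:  # the lemma assumes l >= 2
        return True
    return all(ball_contains_ball(c, 3 * r, ci, ri) for ci, ri in inside)

# Two disjoint balls: their centers are 2 apart and their radii sum to 1.9.
balls = [((0.0, 0.0), 1.8), ((2.0, 0.0), 0.1)]
holds = check_lemma_5_3((1.0, 0.0), 1.0, balls)
```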

Lemma 5.4.

Let ={b1=𝖻(𝖼1,𝗋1),,bn=𝖻(𝖼n,𝗋n)}\mathcal{B}=\left\{{b_{1}=\mathsf{b}(\mathsf{c}_{1},\mathsf{r}_{1}),\ldots,b_{n}=\mathsf{b}(\mathsf{c}_{n},\mathsf{r}_{n})}\right\} be a set of nn disjoint balls in IRd{\rm I\!\hskip-0.24994ptR}^{d}. Let 𝒞={𝖼1,,𝖼n}\mathcal{C}=\left\{{\mathsf{c}_{1},\ldots,\mathsf{c}_{n}}\right\} be the corresponding set of centers, and let \ell be an integer with 2n2\leq\ell\leq n. Then, ropt(𝒞,)Ropt(,)3ropt(𝒞,)r_{\mathrm{opt}}\!\left({{\mathcal{C}},{\ell}}\right)\leq R_{\mathrm{opt}}\!\left({{\mathcal{B}},{\ell}}\right)\leq 3r_{\mathrm{opt}}\!\left({{\mathcal{C}},{\ell}}\right).

Proof.

The first inequality follows since the ball realizing the optimal covering of \ell balls, clearly contains their centers as well, and therefore \ell points from 𝒞\mathcal{C}. To see the second inequality, consider the ball b=𝖻(𝖼,𝗋)b=\mathsf{b}(\mathsf{c},\mathsf{r}) realizing ropt(𝒞,)r_{\mathrm{opt}}\!\left({{\mathcal{C}},{\ell}}\right), and use Lemma 5.3 on it. This implies Ropt(,)3ropt(𝒞,)R_{\mathrm{opt}}\!\left({{\mathcal{B}},{\ell}}\right)\leq 3r_{\mathrm{opt}}\!\left({{\mathcal{C}},{\ell}}\right).

Lemma 5.5.

The balls $\Lambda_{1},\ldots,\Lambda_{m}$ computed above are a $12$-approximate quorum clustering of $\mathcal{B}$.

Proof.

Consider the balls $\mathsf{o}_{1}=\mathsf{b}(\mathsf{u}_{1},\psi_{1}),\ldots,\mathsf{o}_{m}=\mathsf{b}(\mathsf{u}_{m},\psi_{m})$ computed by the algorithm of Lemma 5.2. Suppose $\mathcal{C}_{i}$, for $i=1,\ldots,m$, is the set of centers assigned to the ball $\mathsf{o}_{i}$. That is, $\mathcal{C}_{1},\ldots,\mathcal{C}_{m}$ form a disjoint decomposition of $\mathcal{C}$, each of size $k-\mathsf{c}_{d}$ (except for the last set, which might be smaller; a technicality that we ignore for the sake of simplicity of exposition).

For $i=1,\ldots,m$, let $\mathcal{B}_{i}$ denote the set of balls corresponding to the centers in $\mathcal{C}_{i}$. While constructing the approximate quorum clusters, we assign the set of balls $\mathcal{B}_{\pi(i)}$ to $\Lambda_{i}$, for $i=1,\ldots,m$. Now, fix an $i$ with $1\leq i\leq m-1$. The balls of $\bigcup_{j=1}^{i}\mathcal{B}_{\pi(j)}$ have been used up. Consider an optimal ball, i.e., a ball $b=\mathsf{b}(\mathsf{c},\mathsf{r})$ that completely contains $k-\mathsf{c}_{d}$ balls among $\bigcup_{j=i+1}^{m}\mathcal{B}_{\pi(j)}$ and intersects $k$ balls from $\mathcal{B}$, and is the smallest such ball. Fix some $k-\mathsf{c}_{d}$ balls from $\bigcup_{j=i+1}^{m}\mathcal{B}_{\pi(j)}$ that this optimal ball contains. Consider the set of centers $\mathcal{C}^{\prime}$ of these balls. The quorum clusters $\mathsf{o}_{\pi(j)}$, for $j=i+1,\ldots,m$, contain all these centers, by construction. Out of the indices $\{\pi(i+1),\ldots,\pi(m)\}$, suppose $p$ is the minimum index such that $\mathsf{o}_{p}$ contains one of these centers. When $\mathsf{o}_{p}$ was constructed, i.e., at the $p$th iteration of the algorithm of Lemma 5.2, all the centers from $\mathcal{C}^{\prime}$ were available. Since the optimal ball $b=\mathsf{b}(\mathsf{c},\mathsf{r})$ contains $k-\mathsf{c}_{d}$ available centers too, it follows that $\psi_{p}\leq 2\mathsf{r}$, as guaranteed by Lemma 5.2. Since $k-\mathsf{c}_{d}\geq 2$, by Lemma 5.3, $\mathsf{b}(\mathsf{u}_{p},3\psi_{p})$ contains the balls of $\mathcal{B}_{p}$.
Moreover, by the Lipschitz property, see Observation 2.2, it follows that $\mathsf{d}_{k}(\mathsf{u}_{p},\mathcal{B})\leq\mathsf{d}_{k}(\mathsf{c},\mathcal{B})+\|\mathsf{u}_{p}-\mathsf{c}\|\leq\mathsf{r}+(\mathsf{r}+\psi_{p})\leq 4\mathsf{r}$, where the second-to-last inequality follows as the ball $b=\mathsf{b}(\mathsf{c},\mathsf{r})$ and the ball $\mathsf{o}_{p}=\mathsf{b}(\mathsf{u}_{p},\psi_{p})$ intersect. Therefore, for the index $p$ we have that $\mathsf{d}_{k}(\mathsf{u}_{p},\mathcal{B})\leq 2\gamma_{p}\leq 3\mathsf{d}_{k}(\mathsf{u}_{p},\mathcal{B})\leq 12\mathsf{r}$, and also that $3\psi_{p}\leq 6\mathsf{r}$. As such, $\zeta_{p}=\max(2\gamma_{p},3\psi_{p})\leq 12\mathsf{r}$. The index $\pi(i+1)$ minimizes this quantity among the indices $\{\pi(i+1),\ldots,\pi(m)\}$ (as we took the sorted order); as such, the radius of $\Lambda_{i+1}$ satisfies $\zeta_{\pi(i+1)}\leq\zeta_{p}\leq 12\mathsf{r}$.

Lemma 5.6.

Given a set \mathcal{B} of nn disjoint balls in IRd{\rm I\!\hskip-0.24994ptR}^{d}, such that (k𝖼d)|n(k-\mathsf{c}_{d})|n, and a number kk with 2𝖼d<kn2\mathsf{c}_{d}<k\leq n, in O(nlogn)O(n\log n) time, one can output a sequence of m=n/(k𝖼d)m={n/(k-\mathsf{c}_{d})} balls Λ1,,Λm\Lambda_{1},\ldots,\Lambda_{m}, such that

  1.  (A)

    For each ball Λi\Lambda_{i}, there is an associated subset i\mathcal{B}_{i} of k𝖼dk-\mathsf{c}_{d} balls of i=(1i1)\mathcal{R}_{i}=\mathcal{B}\setminus(\mathcal{B}_{1}\cup\ldots\cup\mathcal{B}_{i-1}), that it completely covers.

  2.  (B)

    The ball Λi\Lambda_{i} intersects at least kk balls from \mathcal{B}.

  3.  (C)

    The radius of the ball Λi\Lambda_{i} is at most 1212 times that of the smallest ball covering k𝖼dk-\mathsf{c}_{d} balls of i{\mathcal{R}_{i}} completely, and intersecting kk balls of \mathcal{B}.

Proof.

The correctness was proved in Lemma 5.5. The time bound is also easy to see, as the computation time is dominated by the time in Lemma 5.2, which is $O(n\log n)$.

6 Construction of the sublinear space data structure for (k,ε)(k,{\varepsilon})-ANN

Here we show how to compute an approximate Voronoi diagram for approximating the $k$th-nearest ball, that takes $O(n/k)$ space. We assume $k>2\mathsf{c}_{d}$ without loss of generality, and we let $m=\lceil n/(k-\mathsf{c}_{d})\rceil=O(n/k)$. Here $k$ and ${\varepsilon}$ are specified in advance.

6.1 Preliminaries

The following notation was introduced in [14]. A ball bb of radius 𝗋\mathsf{r} in IRd{\rm I\!\hskip-0.24994ptR}^{d}, centered at a point 𝖼\mathsf{c}, can be interpreted as a point in IRd+1{\rm I\!\hskip-0.24994ptR}^{d+1}, denoted by b=(𝖼,𝗋)b^{\prime}=\left({\mathsf{c},\mathsf{r}}\right). For a regular point 𝗉IRd\mathsf{p}\in{\rm I\!\hskip-0.24994ptR}^{d}, its corresponding image under this transformation is the mapped point 𝗉=(𝗉,0)IRd+1\mathsf{p}^{\prime}=\left({\mathsf{p},0}\right)\in{\rm I\!\hskip-0.24994ptR}^{d+1}, i.e., we view it as a ball of radius 0 and use the mapping defined on balls. Given point 𝗎=(𝗎1,,𝗎d)IRd\mathsf{u}=\!\left({\mathsf{u}_{1},\dots,\mathsf{u}_{d}}\right)\in{\rm I\!\hskip-0.24994ptR}^{d} we will denote its Euclidean norm by 𝗎\left\lVert{\mathsf{u}}\right\rVert. We will consider a point 𝗎=(𝗎1,𝗎2,,𝗎d+1)IRd+1\mathsf{u}=\!\left({\mathsf{u}_{1},\mathsf{u}_{2},\dots,\mathsf{u}_{d+1}}\right)\in{\rm I\!\hskip-0.24994ptR}^{d+1} to be in the product metric of IRd×IR{\rm I\!\hskip-0.24994ptR}^{d}\times{\rm I\!\hskip-0.24994ptR} and endowed with the product metric norm

\[ \|\mathsf{u}\|_{\oplus}=\sqrt{\mathsf{u}_{1}^{2}+\dots+\mathsf{u}_{d}^{2}}+|\mathsf{u}_{d+1}|. \]

It can be verified that the above defines a norm, and for any 𝗎IRd+1\mathsf{u}\in{\rm I\!\hskip-0.24994ptR}^{d+1} we have 𝗎𝗎2𝗎\left\lVert{\mathsf{u}}\right\rVert\leq\left\lVert{\mathsf{u}}\right\rVert_{\oplus}\leq\sqrt{2}\left\lVert{\mathsf{u}}\right\rVert.
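The lifting and the product metric norm can be illustrated with a short sketch. The function names are chosen for this example only, and `math.hypot` with more than two arguments requires Python 3.8 or later. Note that for a lifted query point $\mathtt{q}^{\prime}=(\mathtt{q},0)$ and a lifted ball $b^{\prime}=(\mathsf{c},\mathsf{r})$, the definition gives $\|\mathtt{q}^{\prime}-b^{\prime}\|_{\oplus}=\|\mathtt{q}-\mathsf{c}\|+\mathsf{r}$.

```python
import math

def norm_oplus(u):
    """Product metric norm on R^d x R: Euclidean norm of the first d
    coordinates plus the absolute value of the last coordinate."""
    *head, last = u
    return math.hypot(*head) + abs(last)

def lift_ball(center, radius):
    """Map the ball b(center, radius) in R^d to the point (center, radius)."""
    return (*center, radius)

def lift_point(p):
    """Map a point of R^d to R^(d+1) by viewing it as a ball of radius 0."""
    return (*p, 0.0)

def lifted_distance(q, center, radius):
    """|| q' - b' ||_oplus for a query point q and a ball b(center, radius)."""
    qp, bp = lift_point(q), lift_ball(center, radius)
    return norm_oplus(tuple(a - b for a, b in zip(qp, bp)))

u = (3.0, 4.0, 2.0)
euclid = math.sqrt(sum(x * x for x in u))
```

For `u = (3, 4, 2)` the product norm is `5 + 2 = 7`, which lies between the Euclidean norm $\sqrt{29}$ and $\sqrt{2}\cdot\sqrt{29}$, as the inequality above requires.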

6.2 Construction

The input is a set \mathcal{B} of nn disjoint balls in IRd{\rm I\!\hskip-0.24994ptR}^{d}, and parameters kk and ε{\varepsilon}.

The construction of the data structure is similar to the construction of the kkth-nearest neighbor data structure from the authors’ paper [14]. We compute, using Lemma 5.6, a ξ\xi-approximate quorum clustering of \mathcal{B} with m=n/(k𝖼d)=O(n/k)m={n/(k-\mathsf{c}_{d})}=O(n/k) balls, Σ={Λ1=𝖻(𝗐1,𝗑1),,Λm=𝖻(𝗐m,𝗑m)}\Sigma=\left\{{\Lambda_{1}=\mathsf{b}(\mathsf{w}_{1},\mathsf{x}_{1}),\ldots,\Lambda_{m}=\mathsf{b}(\mathsf{w}_{m},\mathsf{x}_{m})}\right\}, where ξ12\xi\leq 12. The algorithm then continues as follows:

  1. (A)

    Compute an exponential grid around each quorum cluster. Specifically, let

    \[ \mathcal{I}=\bigcup_{i=1}^{m}\;\bigcup_{j=0}^{\lceil\log(32\xi/{\varepsilon})\rceil}\mathsf{G}_{\approx}\!\left(\mathsf{b}(\mathsf{w}_{i},2^{j}\mathsf{x}_{i}),\frac{{\varepsilon}}{\zeta_{1}}\right) \tag{6.1} \]

    be the set of grid cells covering the quorum clusters and their immediate environ, where ζ1\zeta_{1} is a sufficiently large constant (say, ζ1=256ξ\zeta_{1}=256\xi).

  2. (B)

    Intuitively, $\mathcal{I}$ takes care of the region of space immediately next to a quorum cluster (that is, if the query point falls into one of the grid cells of $\mathcal{I}$, we can answer a query in constant time). For the other regions of space, we can apply a construction of an approximate Voronoi diagram for the centers of the clusters (the details are somewhat more involved). To this end, lift the quorum clusters into points in $\mathbb{R}^{d+1}$, as follows

    Σ={Λ1,,Λm},\displaystyle\Sigma^{\prime}=\left\{{\Lambda_{1}^{\prime},\dots,\Lambda_{m}^{\prime}}\right\},

    where Λi=(𝗐i,𝗑i)IRd+1\Lambda_{i}^{\prime}=\!\left({\mathsf{w}_{i},\mathsf{x}_{i}}\right)\in{\rm I\!\hskip-0.24994ptR}^{d+1}, for i=1,,mi=1,\ldots,m. Note that all points in Σ\Sigma^{\prime} belong to U=[0,1]d+1U^{\prime}=[0,1]^{d+1} by Assumption 2.3. Now build a (1+ε/8)(1+{\varepsilon}/8)-AVD for Σ\Sigma^{\prime} using the algorithm of Arya and Malamatos [2], for distances specified by the \left\lVert{\cdot}\right\rVert_{\oplus} norm. The AVD construction provides a list of canonical cubes covering [0,1]d+1[0,1]^{d+1} such that in the smallest cube containing the query point, the associated point of Σ\Sigma^{\prime}, is a (1+ε/8)(1+{\varepsilon}/8)-ANN to the query point. (Note that these cubes are not necessarily disjoint. In particular, the smallest cube containing the query point 𝚚\mathtt{q} is the one that determines the assigned approximate nearest neighbor to 𝚚\mathtt{q}.)

    Clip this collection of cubes to the hyperplane xd+1=0x_{d+1}=0 (i.e., throw away cubes that do not have a face on this hyperplane). For a cube \mathsf{\Box} in this collection, denote by nn()\mathrm{nn^{\prime}}\!\left({\mathsf{\Box}}\right), the point of Σ\Sigma^{\prime} assigned to it. Let 𝒮\mathcal{S} be this resulting set of canonical dd-dimensional cubes.

  3. (C)

    Let $\mathcal{W}$ be the space decomposition resulting from overlaying the two collections of cubes, i.e., $\mathcal{I}$ and $\mathcal{S}$. Formally, we compute a compressed quadtree $\mathcal{T}$ that has all the canonical cubes of $\mathcal{I}$ and $\mathcal{S}$ as nodes, and $\mathcal{W}$ is the resulting decomposition of space into cells. One can overlay two compressed quadtrees representing the two sets in linear time [7, 12]. Here, a cell associated with a leaf is a canonical cube, and a cell associated with a compressed node is the set difference of two canonical cubes. Each node in this compressed quadtree contains two pointers: one to the smallest cube of $\mathcal{I}$, and one to the smallest cube of $\mathcal{S}$, that contains it. This information can be computed by doing a BFS on the tree.

    For each cell 𝒲\mathsf{\Box}\in\mathcal{W} we store the following.

    1.   (I)

      An arbitrary representative point 𝗋𝖾𝗉\mathsf{\Box}_{\mathsf{rep}}\in\mathsf{\Box}.

    2.   (II)

      The point nn()Σ\mathrm{nn^{\prime}}\!\left({\mathsf{\Box}}\right)\in\Sigma^{\prime} that is associated with the smallest cell of 𝒮\mathcal{S} that contains this cell. We also store an arbitrary ball, 𝐛()\mathbf{b}\!\left({\mathsf{\Box}}\right)\in\mathcal{B}, that is one of the balls completely inside the cluster specified by nn()\mathrm{nn^{\prime}}\!\left({\mathsf{\Box}}\right) – we assume we stored such a ball inside each quorum cluster, when it was computed.

    3.   (III)

      A number βk(𝗋𝖾𝗉)\mathrm{\beta}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right) that satisfies 𝖽k(𝗋𝖾𝗉,)βk(𝗋𝖾𝗉)(1+ε/4)𝖽k(𝗋𝖾𝗉,)\mathsf{d}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}},\mathcal{B}}\right)\leq\mathrm{\beta}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right)\leq(1+{\varepsilon}/4)\mathsf{d}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}},\mathcal{B}}\right), and a ball nnk(𝗋𝖾𝗉)\mathrm{nn}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right)\in\mathcal{B} that realizes this distance. In order to compute βk(𝗋𝖾𝗉)\mathrm{\beta}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right) and nnk(𝗋𝖾𝗉)\mathrm{nn}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right) use the data structure of Section 4, see Theorem 4.5.

6.3 Answering a query

Given a query point 𝚚\mathtt{q}, compute the leaf cell (equivalently the smallest cell) in 𝒲\mathcal{W} that contains 𝚚\mathtt{q} by performing a point-location query in the compressed quadtree 𝒯\mathcal{T}. Let \mathsf{\Box} be this cell. Let,

\[ \lambda^{*}=\min\!\left(\|\mathtt{q}^{\prime}-\mathrm{nn^{\prime}}(\mathsf{\Box})\|_{\oplus},\;\mathrm{\beta}_{k}(\mathsf{\Box}_{\mathsf{rep}})+\|\mathtt{q}-\mathsf{\Box}_{\mathsf{rep}}\|\right). \tag{6.2} \]

If 𝖽𝗂𝖺𝗆()(ε/8)λ\mathsf{diam}\!\left({{\mathsf{\Box}}}\right)\leq({\varepsilon}/8)\lambda^{*} we return nnk(𝗋𝖾𝗉)\mathrm{nn}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right) as the approximate kkth-nearest neighbor, else we return 𝐛()\mathbf{b}\!\left({\mathsf{\Box}}\right).
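The decision made at query time can be sketched as below, assuming point location has already returned the cell containing $\mathtt{q}$. The `Cell` record and its field names are hypothetical and simply mirror items (I) to (III) stored per cell; balls are represented by opaque labels. The first term of $\lambda^{*}$ uses the fact that $\|\mathtt{q}^{\prime}-\Lambda_{i}^{\prime}\|_{\oplus}=\|\mathtt{q}-\mathsf{w}_{i}\|+\mathsf{x}_{i}$ for the lifted point $\mathtt{q}^{\prime}=(\mathtt{q},0)$.

```python
import math
from typing import NamedTuple, Tuple

class Cell(NamedTuple):
    rep: Tuple[float, ...]          # representative point of the cell
    nn_center: Tuple[float, ...]    # w_i of the quorum ball nn'(cell)
    nn_radius: float                # x_i of the quorum ball nn'(cell)
    fallback_ball: str              # b(cell): a ball inside that cluster
    beta: float                     # beta_k(rep), approximates d_k(rep, B)
    nn_k: str                       # ball realizing beta_k(rep)
    diam: float                     # diameter of the cell

def answer_query(q, cell, eps):
    """Return the ball reported for query point q located in `cell`."""
    lifted = math.dist(q, cell.nn_center) + cell.nn_radius  # ||q' - nn'||_oplus
    via_rep = cell.beta + math.dist(q, cell.rep)
    lam = min(lifted, via_rep)                              # lambda^* of (6.2)
    if cell.diam <= (eps / 8) * lam:
        return cell.nn_k
    return cell.fallback_ball

small = Cell((0.1, 0.0), (3.0, 4.0), 2.0, "cluster-ball", 4.0, "kth-ball", 0.05)
large = small._replace(diam=1.0)
```

With `eps = 0.5` the threshold is `(0.5 / 8) * 4.1`, so the small cell reports the stored $k$th-nearest-neighbor ball and the large cell falls back to the ball stored with the quorum cluster.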

6.4 Correctness

Lemma 6.1.

The number λ=min(𝚚nn(),βk(𝗋𝖾𝗉)+𝚚𝗋𝖾𝗉)\lambda^{*}=\min\!\left({\left\lVert{\mathtt{q}^{\prime}-\mathrm{nn^{\prime}}\!\left({\mathsf{\Box}}\right)}\right\rVert_{\oplus},\mathrm{\beta}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right)+\left\lVert{\mathtt{q}-\mathsf{\Box}_{\mathsf{rep}}}\right\rVert}\right) satisfies, 𝖽k(𝚚,)λ\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)\leq\lambda^{*}.

Proof.

This follows by the Lipschitz property, see Observation 2.2.

Lemma 6.2.

Let 𝒲\mathsf{\Box}\in\mathcal{W} be any cell containing 𝚚\mathtt{q}. If 𝖽𝗂𝖺𝗆()ε𝖽k(𝚚,)/4\mathsf{diam}\!\left({{\mathsf{\Box}}}\right)\leq{\varepsilon}\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)/4, then nnk(𝗋𝖾𝗉)\mathrm{nn}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right) is a valid (1±ε)(1\pm{\varepsilon})-approximate kkth-nearest neighbor of 𝚚\mathtt{q}.

Proof.

For the point 𝗋𝖾𝗉\mathsf{\Box}_{\mathsf{rep}}, by Observation 2.2, we have that

\[ \mathsf{d}_{k}(\mathsf{\Box}_{\mathsf{rep}},\mathcal{B})\leq\mathsf{d}_{k}(\mathtt{q},\mathcal{B})+\|\mathtt{q}-\mathsf{\Box}_{\mathsf{rep}}\|\leq\mathsf{d}_{k}(\mathtt{q},\mathcal{B})+\mathsf{diam}(\mathsf{\Box})\leq(1+{\varepsilon}/4)\mathsf{d}_{k}(\mathtt{q},\mathcal{B}). \]

Therefore, the ball nnk(𝗋𝖾𝗉)\mathrm{nn}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right) satisfies

\[ \mathsf{d}(\mathsf{\Box}_{\mathsf{rep}},\mathrm{nn}_{k}(\mathsf{\Box}_{\mathsf{rep}}))\leq(1+{\varepsilon}/4)\mathsf{d}_{k}(\mathsf{\Box}_{\mathsf{rep}},\mathcal{B})\leq(1+{\varepsilon}/4)^{2}\mathsf{d}_{k}(\mathtt{q},\mathcal{B})\leq(1+3{\varepsilon}/4)\mathsf{d}_{k}(\mathtt{q},\mathcal{B}). \]

As such we have that

\[ \mathsf{d}(\mathtt{q},\mathrm{nn}_{k}(\mathsf{\Box}_{\mathsf{rep}}))\leq\mathsf{d}(\mathsf{\Box}_{\mathsf{rep}},\mathrm{nn}_{k}(\mathsf{\Box}_{\mathsf{rep}}))+\|\mathtt{q}-\mathsf{\Box}_{\mathsf{rep}}\|\leq\left((1+3{\varepsilon}/4)+{\varepsilon}/4\right)\mathsf{d}_{k}(\mathtt{q},\mathcal{B})\leq(1+{\varepsilon})\mathsf{d}_{k}(\mathtt{q},\mathcal{B}). \]

Similarly, using the Lipschitz property, we can argue that, 𝖽(𝚚,nnk(𝗋𝖾𝗉))(1ε)𝖽k(𝚚,)\mathsf{d}\!\left({\mathtt{q},\mathrm{nn}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right)}\right)\geq(1-{\varepsilon})\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right), and therefore we have, (1ε)𝖽k(𝚚,)𝖽(𝚚,nnk(𝗋𝖾𝗉))(1+ε)𝖽k(𝚚,)(1-{\varepsilon})\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right)\leq\mathsf{d}\!\left({\mathtt{q},\mathrm{nn}_{k}\!\left({\mathsf{\Box}_{\mathsf{rep}}}\right)}\right)\leq(1+{\varepsilon})\mathsf{d}_{k}\!\left({\mathtt{q},\mathcal{B}}\right), and the required guarantees are satisfied.

Lemma 6.3.

For any point $\mathtt{q}\in\mathbb{R}^{d}$ there is a quorum ball $\Lambda_{i}=\mathsf{b}(\mathsf{w}_{i},\mathsf{x}_{i})$ such that (A) $\Lambda_{i}$ intersects $\mathsf{b}(\mathtt{q},\mathsf{d}_{k}(\mathtt{q},\mathcal{B}))$, (B) $\mathsf{x}_{i}\leq 3\xi\mathsf{d}_{k}(\mathtt{q},\mathcal{B})$, and (C) $\|\mathtt{q}-\mathsf{w}_{i}\|\leq 4\xi\mathsf{d}_{k}(\mathtt{q},\mathcal{B})$.

Proof.

By assumption, $k>2\mathsf{c}_d$, and so by Lemma 4.1, among the $k$ nearest neighbors of $\mathtt{q}$ there are $k-\mathsf{c}_d$ balls of radius at most $\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. Let $\mathcal{B}'$ denote the set of these balls. Among the indices $1,\ldots,m$, let $i$ be the minimum index such that one of these $k-\mathsf{c}_d$ balls is completely covered by the quorum cluster $\Lambda_i=\mathsf{b}(\mathsf{w}_i,\mathsf{x}_i)$. Since $\mathsf{b}\left(\mathtt{q},\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\right)$ intersects this ball while $\Lambda_i$ completely contains it, $\Lambda_i$ intersects $\mathsf{b}\left(\mathtt{q},\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\right)$. Now consider the time $\Lambda_i$ was constructed, i.e., the $i$th iteration of the quorum clustering algorithm. At this time, by the minimality of $i$, all of $\mathcal{B}'$ was available, i.e., none of its balls were assigned to earlier quorum clusters. The ball $\mathsf{b}\left(\mathtt{q},3\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\right)$ contains $k-\mathsf{c}_d$ unused balls and touches $k$ balls from $\mathcal{B}$; as such, the smallest such ball had radius at most $3\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. By the guarantee on quorum clustering, $\mathsf{x}_i\leq 3\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. As for the last part, as the balls $\mathsf{b}\left(\mathtt{q},\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\right)$ and $\Lambda_i=\mathsf{b}(\mathsf{w}_i,\mathsf{x}_i)$ intersect and $\mathsf{x}_i\leq 3\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$, we have by the triangle inequality that $\lVert\mathtt{q}-\mathsf{w}_i\rVert\leq(1+3\xi)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\leq 4\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$, as $\xi\geq 1$.

Definition 6.4.

For a given query point, any quorum cluster that satisfies the conditions of Lemma 6.3 is defined to be an anchor cluster. By Lemma 6.3 an anchor cluster always exists.

Lemma 6.5.

Suppose that among the quorum cluster balls $\Lambda_1,\ldots,\Lambda_m$, there is some ball $\Lambda_i=\mathsf{b}(\mathsf{w}_i,\mathsf{x}_i)$ that satisfies $\lVert\mathtt{q}-\mathsf{w}_i\rVert\leq 8\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$ and $\varepsilon\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)/4\leq\mathsf{x}_i\leq 8\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. Then the output of the algorithm is correct.

Proof.

We have

\[
\frac{32\xi\,\mathsf{x}_i}{\varepsilon}
\geq \frac{32\xi\left(\varepsilon\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)/4\right)}{\varepsilon}
= 8\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)
\geq \lVert\mathtt{q}-\mathsf{w}_i\rVert.
\]

Thus, by construction, the expanded environ of the quorum cluster $\mathsf{b}(\mathsf{w}_i,\mathsf{x}_i)$ contains the query point, see Eq. (6.1). Let $j$ be the smallest non-negative integer such that $2^j\mathsf{x}_i\geq\mathsf{d}\left(\mathtt{q},\mathsf{w}_i\right)$. We have that $2^j\mathsf{x}_i\leq\max\left(\mathsf{x}_i,2\mathsf{d}\left(\mathtt{q},\mathsf{w}_i\right)\right)$. As such, if $\Box$ is the smallest cell in $\mathcal{W}$ containing the query point $\mathtt{q}$, then

\[
\begin{aligned}
\mathsf{diam}\left(\Box\right)
&\leq \frac{\varepsilon}{\zeta_1}\,2^{j+1}\mathsf{x}_i
\leq \frac{\varepsilon}{\zeta_1}\cdot\max\left(2\mathsf{x}_i,\,4\mathsf{d}\left(\mathtt{q},\mathsf{w}_i\right)\right)
\leq \frac{\varepsilon}{\zeta_1}\cdot\max\left(16\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right),\,32\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\right) \\
&\leq \frac{\varepsilon}{8}\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right),
\end{aligned}
\]

by Eq. (6.1), provided that $\zeta_1\geq 256\xi$. Now, by Lemma 6.1 we have that $\lambda^*\geq\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$, so $\mathsf{diam}\left(\Box\right)\leq(\varepsilon/8)\lambda^*$. Therefore, the algorithm returns $\mathrm{nn}_k\left(\Box_{\mathsf{rep}}\right)$ as the $(1\pm\varepsilon)$-approximate $k$th-nearest neighbor, and by Lemma 6.2 this is a correct answer.
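The choice of the scale $j$ and the resulting bound on the cell diameter can be illustrated by a short computation. This is a sketch under stated assumptions, not the actual data structure: the names `smallest_scale` and `cell_diameter_bound` are hypothetical, `radius` stands for $\mathsf{x}_i$, `dist` for $\mathsf{d}(\mathtt{q},\mathsf{w}_i)$, `xi_const` for $\xi$, and the cell diameter is taken to be exactly the upper bound $(\varepsilon/\zeta_1)2^{j+1}\mathsf{x}_i$ with $\zeta_1 = 256\xi$.

```python
def smallest_scale(radius: float, dist: float) -> int:
    """Smallest non-negative integer j with 2**j * radius >= dist."""
    if radius <= 0.0:
        raise ValueError("radius must be positive")
    j = 0
    while (2 ** j) * radius < dist:
        j += 1
    return j


def cell_diameter_bound(eps: float, xi_const: float,
                        radius: float, dist: float) -> float:
    """Upper bound (eps / zeta_1) * 2**(j+1) * radius on the cell diameter.

    zeta_1 is set to 256 * xi_const, the threshold used in the proof.
    """
    zeta_1 = 256.0 * xi_const
    j = smallest_scale(radius, dist)
    return (eps / zeta_1) * (2 ** (j + 1)) * radius


# Illustrative values (assumptions): d_k = 1, xi = 2, eps = 0.5.
d_k, xi_const, eps = 1.0, 2.0, 0.5
radius, dist = 0.2, 10.0   # eps*d_k/4 <= radius <= 8*xi*d_k, dist <= 8*xi*d_k
j = smallest_scale(radius, dist)
scaled = (2 ** j) * radius
# Either j = 0 (scaled equals radius), or 2**(j-1) * radius < dist.
assert scaled <= max(radius, 2.0 * dist)
assert cell_diameter_bound(eps, xi_const, radius, dist) <= (eps / 8.0) * d_k
```

If $j = 0$ the scaled radius is $\mathsf{x}_i$ itself; otherwise $2^{j-1}\mathsf{x}_i < \mathsf{d}(\mathtt{q},\mathsf{w}_i)$ by minimality of $j$, which gives the factor $2$ in the maximum.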

Lemma 6.6.

The query algorithm always outputs a correct approximate answer, i.e., the output ball $b$ satisfies $(1-\varepsilon)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\leq\mathsf{d}\left(\mathtt{q},b\right)\leq(1+\varepsilon)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$.

Proof.

Suppose that among the quorum cluster balls $\Lambda_1=\mathsf{b}(\mathsf{w}_1,\mathsf{x}_1),\ldots,\Lambda_m=\mathsf{b}(\mathsf{w}_m,\mathsf{x}_m)$, there is some ball $\Lambda_i$ such that $\lVert\mathtt{q}-\mathsf{w}_i\rVert\leq 8\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$ and $(\varepsilon/4)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\leq\mathsf{x}_i\leq 8\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. Then, by Lemma 6.5, the algorithm returns a valid approximate answer. Assume this condition is not satisfied. Let the anchor cluster be $\Lambda=\mathsf{b}(\mathsf{w},\mathsf{x})$. Since the anchor cluster satisfies $\lVert\mathtt{q}-\mathsf{w}\rVert\leq 4\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$ and $\mathsf{x}\leq 3\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$, it must be the case that $\mathsf{x}<(\varepsilon/4)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. Since the anchor cluster intersects $\mathsf{b}\left(\mathtt{q},\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\right)$, we have that $\lVert\mathtt{q}-\mathsf{w}\rVert\leq(1+\varepsilon/4)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. Thus, $\lVert\mathtt{q}'-\Lambda'\rVert_{\oplus}=\lVert\mathtt{q}-\mathsf{w}\rVert+\mathsf{x}\leq(1+\varepsilon/2)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. Let $\Box$ be the smallest cell in which $\mathtt{q}$ is located. Now consider the point $\mathrm{nn'}\left(\Box\right)\in\Sigma'$. Suppose it corresponds to the cluster $\Lambda_j$, i.e., $\Lambda_j'=\mathrm{nn'}\left(\Box\right)$.

Since $\mathrm{nn'}\left(\Box\right)$ is a $(1+\varepsilon/8)$-ANN to $\mathtt{q}$ among the points of $\Sigma'$, we have $\lVert\mathtt{q}'-\mathrm{nn'}\left(\Box\right)\rVert_{\oplus}\leq(1+\varepsilon/8)\lVert\mathtt{q}'-\Lambda'\rVert_{\oplus}\leq(1+\varepsilon/8)(1+\varepsilon/2)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\leq(1+\varepsilon)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\leq 2\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\leq 8\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. It follows that $\lVert\mathtt{q}-\mathsf{w}_j\rVert\leq 8\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$ and $\mathsf{x}_j\leq 8\xi\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. By our assumption, it must be the case that $\mathsf{x}_j<(\varepsilon/4)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. Now, there are two cases. Suppose that $\mathsf{diam}\left(\Box\right)\leq(\varepsilon/8)\lambda^*$. Since $\lambda^*\leq\lVert\mathtt{q}'-\mathrm{nn'}\left(\Box\right)\rVert_{\oplus}$, we have $\lambda^*\leq 2\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. As such, $\mathsf{diam}\left(\Box\right)\leq(\varepsilon/4)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. In this case the algorithm returns $\mathrm{nn}_k\left(\Box\right)$, and the result is correct by Lemma 6.2.

On the other hand, if we return $\mathbf{b}\left(\Box\right)$, it is easy to see that $\mathsf{d}\left(\mathtt{q},\mathbf{b}\left(\Box\right)\right)\leq\lVert\mathtt{q}-\mathsf{w}_j\rVert+\mathsf{x}_j\leq(1+\varepsilon)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$. Also, as $\mathbf{b}\left(\Box\right)$ lies completely inside $\Lambda_j$, it follows by Observation 2.1 that $\mathsf{d}\left(\mathtt{q},\mathbf{b}\left(\Box\right)\right)\geq\mathsf{d}\left(\mathtt{q},\Lambda_j\right)\geq\lVert\mathtt{q}-\mathsf{w}_j\rVert-\mathsf{x}_j\geq\left(\lVert\mathtt{q}-\mathsf{w}_j\rVert+\mathsf{x}_j\right)-2\mathsf{x}_j\geq\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)-(\varepsilon/2)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\geq(1-\varepsilon/2)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$, where the second last inequality follows by Lemma 6.1.
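The two cases in the proof above correspond to a single comparison made at query time. The following is a minimal sketch of that decision, assuming the located cell already stores its diameter and its two candidate balls, and that $\lambda^*$ has been computed as described earlier in the paper (it is passed in as a plain number here). All function and parameter names are hypothetical.

```python
import math
from typing import Sequence, TypeVar

Ball = TypeVar("Ball")


def oplus_distance(q: Sequence[float], center: Sequence[float],
                   radius: float) -> float:
    """The quantity ||q' - Lambda'||_oplus = ||q - w|| + x for a cluster b(w, x)."""
    return math.dist(q, center) + radius


def choose_answer(eps: float, cell_diam: float, lambda_star: float,
                  knn_candidate: Ball, stored_ball: Ball) -> Ball:
    """Pick one of the two candidate balls associated with a cell.

    If the cell is small relative to lambda_star, the precomputed
    approximate kth-nearest neighbor of the cell representative is
    returned; otherwise the other ball stored with the cell is returned.
    """
    if cell_diam <= (eps / 8.0) * lambda_star:
        return knn_candidate
    return stored_ball
```

For example, with $\varepsilon = 0.5$ and $\lambda^* = 2$ the threshold is $0.125$, so a cell of diameter $0.1$ yields the first candidate and a cell of diameter $0.2$ yields the second. After point location this step uses a constant number of arithmetic operations.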

6.5 The result

Theorem 6.7.

Given a set $\mathcal{B}$ of $n$ disjoint balls in $\mathbb{R}^d$, a number $k$, with $1\leq k\leq n$, and $\varepsilon\in(0,1)$, one can preprocess $\mathcal{B}$ in $O\left(n\log n+\frac{n}{k}C_\varepsilon\log n+\frac{n}{k}C_\varepsilon'\right)$ time, where $C_\varepsilon=O\left(\varepsilon^{-d}\log\varepsilon^{-1}\right)$ and $C_\varepsilon'=O\left(\varepsilon^{-2d}\log\varepsilon^{-1}\right)$. The space used by the data structure is $O(C_\varepsilon n/k)$. Given a query point $\mathtt{q}$, this data structure outputs a ball $b\in\mathcal{B}$ in $O\left(\log\frac{n}{k\varepsilon}\right)$ time, such that $(1-\varepsilon)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)\leq\mathsf{d}\left(\mathtt{q},b\right)\leq(1+\varepsilon)\,\mathsf{d}_k\left(\mathtt{q},\mathcal{B}\right)$.

Proof.

If $k\leq 2\mathsf{c}_d$ then Theorem 4.5 provides the desired result. For $k>2\mathsf{c}_d$, the correctness was proved in Lemma 6.6. We only need to bound the construction time and space as well as the query time. Computing the quorum clustering takes $O(n\log n)$ time by Lemma 5.6. Observe that $\lvert\mathcal{I}\rvert=O\left(\frac{n}{k\varepsilon^d}\log\frac{1}{\varepsilon}\right)$. From the construction of Arya and Malamatos [2], we have $\lvert\mathcal{S}\rvert=O\left(\frac{n}{k\varepsilon^d}\log\frac{1}{\varepsilon}\right)$ (note that since we clip the construction to a hyperplane, we get $1/\varepsilon^d$ in the bound and not $1/\varepsilon^{d+1}$). A careful implementation of this stage takes $O\left(n\log n+\lvert\mathcal{W}\rvert\left(\log n+\frac{1}{\varepsilon^{d-1}}\right)\right)$ time. Overlaying the two compressed quadtrees representing them takes time linear in their size, that is, $O\left(\lvert\mathcal{I}\rvert+\lvert\mathcal{S}\rvert\right)$.

The most expensive step is to perform the $(1\pm\varepsilon/4)$-approximate $k$th-nearest neighbor query for each cell in the resulting decomposition of $\mathcal{W}$, see Eq. (6.2) (i.e., computing $\beta_k\left(\Box_{\mathsf{rep}}\right)$ for each cell $\Box\in\mathcal{W}$). Using the data structure of Section 4 (see Theorem 4.5), each query takes $O\left(\log n+1/\varepsilon^d\right)$ time. Over all cells, this takes

\[
O\left(n\log n+\lvert\mathcal{W}\rvert\left(\log n+\frac{1}{\varepsilon^d}\right)\right)
=O\left(n\log n+\frac{n}{k\varepsilon^d}\log\frac{1}{\varepsilon}\log n+\frac{n}{k\varepsilon^{2d}}\log\frac{1}{\varepsilon}\right)
\]

time, and this bounds the overall construction time.

The query algorithm is a point location query followed by an $O(1)$ time computation, and takes $O\left(\log\left(\frac{n}{k\varepsilon}\right)\right)$ time.

Note that the space decomposition generated by Theorem 6.7 can be interpreted as a space decomposition of complexity $O(C_\varepsilon n/k)$, where every cell has two input balls associated with it, which are the candidates to be the desired $(k,\varepsilon)$-ANN. That is, Theorem 6.7 computes a $(k,\varepsilon)$-AVD of the input balls.

7 Conclusions

In this paper, we presented a generalization of the usual $(1\pm\varepsilon)$-approximate $k$th-nearest neighbor problem in $\mathbb{R}^d$, where the input is a set of balls of arbitrary radii, while the query is a point. We first presented a data structure that takes $O(n)$ space and has query time $O(\log n+\varepsilon^{-d})$. Here, both $k$ and $\varepsilon$ can be supplied at query time. Next, we presented a $(k,\varepsilon)$-AVD taking $O(n/k)$ space, thus showing, surprisingly, that the problem can be solved in sublinear space if $k$ is sufficiently large.

References

  • [1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.
  • [2] S. Arya and T. Malamatos. Linear-size approximate Voronoi diagrams. In Proc. 13th ACM-SIAM Sympos. Discrete Algs., pages 147–155, 2002.
  • [3] S. Arya, T. Malamatos, and D. M. Mount. Space-time tradeoffs for approximate spherical range counting. In Proc. 16th ACM-SIAM Sympos. Discrete Algs., pages 535–544, 2005.
  • [4] S. Arya, T. Malamatos, and D. M. Mount. Space-time tradeoffs for approximate nearest neighbor searching. J. Assoc. Comput. Mach., 57(1):1–54, 2009.
  • [5] S. Arya and D. M. Mount. Approximate range searching. Comput. Geom. Theory Appl., 17:135–152, 2000.
  • [6] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. Assoc. Comput. Mach., 45(6):891–923, 1998.
  • [7] M. de Berg, H. Haverkort, S. Thite, and L. Toma. Star-quadtrees and guard-quadtrees: I/O-efficient indexes for fat triangulations and low-density planar subdivisions. Comput. Geom. Theory Appl., 43:493–513, July 2010.
  • [8] P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to $k$-nearest-neighbors and $n$-body potential fields. J. Assoc. Comput. Mach., 42:67–90, 1995.
  • [9] P. Carmi, S. Dolev, S. Har-Peled, M. J. Katz, and M. Segal. Geographic quorum systems approximations. Algorithmica, 41(4):233–244, 2005.
  • [10] K. L. Clarkson. Nearest-neighbor searching and metric space dimensions. In G. Shakhnarovich, T. Darrell, and P. Indyk, editors, Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pages 15–59. MIT Press, 2006.
  • [11] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. In Proc. 42nd Annu. IEEE Sympos. Found. Comput. Sci., pages 94–103, 2001.
  • [12] S. Har-Peled. Geometric Approximation Algorithms, volume 173 of Mathematical Surveys and Monographs. Amer. Math. Soc., 2011.
  • [13] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. Theory Comput., 8:321–350, 2012. Special issue in honor of Rajeev Motwani.
  • [14] S. Har-Peled and N. Kumar. Down the rabbit hole: Robust proximity search in sublinear space. In Proc. 53rd Annu. IEEE Sympos. Found. Comput. Sci., pages 430–439, 2012.
  • [15] S. Har-Peled and N. Kumar. Approximating minimization diagrams and generalized proximity search. In Proc. 54th Annu. IEEE Sympos. Found. Comput. Sci., pages 717–726, 2013.
  • [16] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput., pages 604–613, 1998.
  • [17] G. Shakhnarovich, T. Darrell, and P. Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing). The MIT Press, 2006.