Robust Proximity Search for Balls using Sublinear Space[1]

[1] Work on this paper was partially supported by NSF AF awards CCF-0915984 and CCF-1217462.
Abstract
Given a set of $n$ disjoint balls in $\mathbb{R}^d$, we provide a data structure, of near linear size, that can answer $(1\pm\varepsilon)$-approximate $k$th-nearest neighbor queries in $O(\log n + 1/\varepsilon^d)$ time, where $k$ and $\varepsilon$ are provided at query time. If $k$ and $\varepsilon$ are provided in advance, we provide a data structure to answer such queries, that requires (roughly) $O(n/k)$ space; that is, the data structure has sublinear space requirement if $k$ is sufficiently large.
1 Introduction
The nearest neighbor problem is a fundamental problem in Computer Science [17, 1]. Here, one is given a set of points $P \subseteq \mathbb{R}^d$, and given a query point $q$ one needs to output the nearest point in $P$ to $q$. There is a trivial $O(n)$ time algorithm for this problem, where $n = |P|$. Typically the set of data points is fixed, while different queries keep arriving. Thus, one can use preprocessing to facilitate a faster query. There are several applications of nearest neighbor search in computer science including pattern recognition, information retrieval, vector compression, computational statistics, clustering, data mining and learning among many others, see for instance the survey by Clarkson [10] for references. If one is interested in guaranteed performance and near linear space, there is no known way to solve this problem efficiently (i.e., logarithmic query time) for dimension $d > 2$, while using near linear space for the data structure.
In light of the above, major effort has been devoted to develop approximation algorithms for nearest neighbor search [6, 16, 10, 13]. In the $(1+\varepsilon)$-approximate nearest neighbor (ANN) problem, one is additionally given an approximation parameter $\varepsilon > 0$ and one is required to find a point $p \in P$ such that $\|q - p\| \le (1+\varepsilon)\, d(q, P)$. In $d$-dimensional Euclidean space, one can answer ANN queries in $O(\log n + 1/\varepsilon^{d-1})$ time using linear space [6, 12]. Unfortunately, the constant hidden in the $O$ notation is exponential in the dimension (and this is true for all bounds mentioned in this paper), and specifically because of the $1/\varepsilon^{d-1}$ term in the query time, this approach is only efficient in low dimensions. Interestingly, for this data structure, the approximation parameter $\varepsilon$ need not be specified during the construction, and one can provide it during the query. An alternative approach is to use Approximate Voronoi Diagrams (AVD), introduced by Har-Peled [11], which is a partition of space into regions of low total complexity, with a representative point for each region, that is an ANN for any point in the region. In particular, Har-Peled showed that there is such a decomposition of near linear size, see also [13]. This allows ANN queries to be answered in $O(\log n)$ time. Arya and Malamatos [2] showed how to build AVDs of linear complexity (i.e., $O(n/\varepsilon^d)$). Their construction uses WSPD (Well Separated Pairs Decomposition) [8]. Further trade-offs between query time and space usage for AVDs were studied by Arya et al. [4].
A more general problem is the $k$-nearest neighbors problem, where one is interested in finding the $k$ points in $P$ nearest to the query point $q$. This is widely used in classification, where the majority label of the $k$ nearest neighbors is used to label the query point. A restricted version is to find only the $k$th-nearest neighbor. This problem and its approximate version have been considered in [3, 14].
Recently, the authors [14] showed that one can compute a $(k,\varepsilon)$-AVD that $(1\pm\varepsilon)$-approximates the distance to the $k$th nearest neighbor, and surprisingly, requires (roughly) $O(n/k)$ space; that is, sublinear space if $k$ is sufficiently large. For example, for the case $k = \sqrt{n}$, which is of interest in practice, the space required is only (roughly) $O(\sqrt{n})$. Such an ANN is of interest when one is worried that there is noise in the data, and thus one is interested in the distance to the $k$th NN, which is more robust and noise resistant. Alternatively, one can think about such data structures as enabling one to summarize the data in a way that still facilitates meaningful proximity queries.
In this paper we consider a generalization of the $k$th-nearest neighbor problem. Here, we are given a set $\mathcal{B}$ of $n$ disjoint balls in $\mathbb{R}^d$ and we want to preprocess them, so that given a query point $q$ we can find, approximately, the $k$th closest ball. The distance of a query point to a ball is defined as the distance to its boundary if the point is outside the ball, or zero otherwise. Clearly, this problem is a generalization of the $k$th-nearest neighbor problem, by viewing points as balls of radius zero. Algorithms for the $k$th-nearest neighbor for points do not extend in a straightforward manner to this problem because the distance function is no longer a metric. Indeed, there can be two very far off points both very close to a single ball, and thus the triangle inequality does not hold. The problem of finding the closest ball can also be modeled as a problem of approximating the minimization diagram of a set of functions; here, a function would correspond to the distance from one of the given balls. There has been some recent work by the authors on this topic, see [15], where a fairly general class of functions admits a near-linear sized data structure permitting a logarithmic time query for the problem of approximating the minimization diagram. However, the problem that we consider in this paper does not fall under the framework of [15]. The technical assumptions of [15] mandate that the set of points which form the $0$-sublevel set of a distance function, i.e., the set of points at which the distance function is zero, is a single point (or an empty set). This is not the case for the problem we consider here. Also, we are interested in the more general $k$th-nearest neighbor problem, while [15] only considers the nearest-neighbor problem, i.e., $k = 1$.
We first show how to preprocess the set of balls into a data structure requiring $O(n)$ space, in $O(n \log n)$ time, so that given a query point $q$, a number $k$ and an approximation parameter $\varepsilon > 0$, one can compute a $(1\pm\varepsilon)$-approximate $k$th closest ball in $O(\log n + 1/\varepsilon^d)$ time. If both $k$ and $\varepsilon$ are available during preprocessing, one can preprocess the balls into a $(k,\varepsilon)$-AVD, using (roughly) $O(n/k)$ space, so that given a query point $q$, a $(k,\varepsilon)$-ANN closest ball can be computed in $O(\log n)$ time.
Paper Organization
In Section 2, we define the problem, list some assumptions, and introduce notation. In Section 3, we set up some basic data structures to answer approximate range counting queries for balls. In Section 4, we present the data structure, query algorithm and proof of correctness for our data structure, which can compute $(1\pm\varepsilon)$-approximate $k$th-nearest neighbors of a query point when $k$ and $\varepsilon$ are only provided at query time. In Section 5 we present approximate quorum clustering, see [9, 14], for a set of disjoint balls. Using this, in Section 6, we present the $(k,\varepsilon)$-AVD construction. We conclude in Section 7.
2 Problem definition and notation
We are given a set of disjoint[2] balls $\mathcal{B} = \{b_1, \ldots, b_n\}$, where $b_i = b(c_i, r_i)$, for $i = 1, \ldots, n$. Here $b(c, r)$ denotes the (closed) ball with center $c$ and radius $r$. Additionally, we are given an approximation parameter $\varepsilon > 0$. For a point $p \in \mathbb{R}^d$, the distance of $p$ to a ball $b = b(c, r)$ is
$$ d(p, b) \;=\; \max\bigl( \|p - c\| - r,\; 0 \bigr). $$

[2] Our data structure and algorithm work for the more general case where the balls are interior disjoint, where we define the interior of a "point ball", i.e., a ball of radius zero, as the point itself. This is not the usual topological definition.
Observation 2.1.
For two balls $b_1 \subseteq b_2$, and any point $p$, we have $d(p, b_2) \le d(p, b_1)$.
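To make the distance function concrete, here is a minimal Python sketch (the names are ours, not from the paper), assuming the definition $d(p, b(c,r)) = \max(\|p-c\| - r, 0)$ given above:

```python
import math

def dist_point_ball(p, c, r):
    """Distance from point p to the closed ball b(c, r): the distance to the
    boundary if p lies outside the ball, and 0 otherwise."""
    return max(math.dist(p, c) - r, 0.0)

# A point inside the ball is at distance 0.
assert dist_point_ball((0.5, 0.0), (0.0, 0.0), 1.0) == 0.0
# If b(c1, r1) is contained in b(c2, r2), no point is farther from the
# bigger ball than from the smaller one.
p = (5.0, 0.0)
assert dist_point_ball(p, (0.0, 0.0), 2.0) <= dist_point_ball(p, (0.0, 0.0), 1.0)
```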
The $k$th-nearest neighbor distance of $p$ to $\mathcal{B}$, denoted by $d_k(p, \mathcal{B})$, is the $k$th smallest number in the multiset $\{d(p, b_1), \ldots, d(p, b_n)\}$. Similarly, for a given set of points $P$, $d_k(p, P)$ denotes the $k$th-nearest neighbor distance of $p$ to $P$.
We aim to build a data structure to answer $(1\pm\varepsilon)$-approximate $k$th-nearest neighbor (i.e., $(k,\varepsilon)$-ANN) queries, where for any query point $q$ one needs to output a ball $b \in \mathcal{B}$ such that $(1-\varepsilon)\, d_k(q, \mathcal{B}) \le d(q, b) \le (1+\varepsilon)\, d_k(q, \mathcal{B})$. There are different variants depending on whether $k$ and $\varepsilon$ are provided with the query or in advance.
We use cube to denote a set of the form $[a_1, a_1+\ell] \times \cdots \times [a_d, a_d+\ell] \subseteq \mathbb{R}^d$, where $a_1, \ldots, a_d \in \mathbb{R}$ and $\ell > 0$ is the side length of the cube.
Observation 2.2.
For any set of balls $\mathcal{B}$, the function $d_k(\cdot, \mathcal{B})$ is a $1$-Lipschitz function; that is, for any two points $p, q \in \mathbb{R}^d$, we have that $|d_k(p, \mathcal{B}) - d_k(q, \mathcal{B})| \le \|p - q\|$.
Assumption 2.3.
We assume all the balls are contained inside the unit cube $[0,1]^d$, which can be ensured by translation and scaling (which preserve the order of distances). As such, we can ignore queries outside the unit cube $[0,1]^d$, as any input ball is a valid answer in this case.
For a real positive number $r$ and a point $p = (p_1, \ldots, p_d) \in \mathbb{R}^d$, define $G_r(p)$ to be the grid point $\bigl( \lfloor p_1/r \rfloor r, \ldots, \lfloor p_d/r \rfloor r \bigr)$. The number $r$ is the width or side length of the grid $G_r$. The mapping $G_r$ partitions $\mathbb{R}^d$ into cubes that are called grid cells.
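The grid snapping $G_r$ can be sketched in a few lines of Python (illustrative only; the function name is ours):

```python
import math

def snap(p, r):
    """The grid point G_r(p) = (floor(p1/r)*r, ..., floor(pd/r)*r); all points
    snapping to the same grid point lie in the same grid cell of width r."""
    return tuple(math.floor(x / r) * r for x in p)

# Points in the same width-0.25 cell snap to the same grid point.
assert snap((0.30, 0.70), 0.25) == (0.25, 0.50)
assert snap((0.30, 0.70), 0.25) != snap((0.55, 0.70), 0.25)
```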
Definition 2.4.
A cube is a canonical cube if it is contained inside the unit cube $[0,1]^d$, it is a cell in a grid $G_r$, and $r$ is a power of two (i.e., it might correspond to a node in a quadtree having $[0,1]^d$ as its root cell). We will refer to such a grid $G_r$ as a canonical grid. Note that all the cells corresponding to nodes of a compressed quadtree are canonical.
Definition 2.5.
Given a set $X \subseteq \mathbb{R}^d$, and a parameter $0 < \psi \le 1$, let $\mathcal{G}(X, \psi)$ denote the set of canonical grid cells of side length $2^{\lfloor \lg (\psi\, \mathrm{diam}(X)) \rfloor}$ that intersect $X$, where $\mathrm{diam}(X)$ denotes the diameter of $X$. Clearly, the diameter of any grid cell of $\mathcal{G}(X, \psi)$ is at most $\sqrt{d}\, \psi\, \mathrm{diam}(X)$. Let $\mathcal{C}(X, \psi) = \bigcup_{\square \in \mathcal{G}(X, \psi)} \square$. It is easy to verify that $|\mathcal{G}(X, \psi)| = O(1/\psi^d)$. The set $\mathcal{G}(X, \psi)$ is the grid approximation to $X$.
Let $\mathcal{B}$ be a family of balls in $\mathbb{R}^d$. Given a set $X \subseteq \mathbb{R}^d$, let
$$ \mathcal{B} \sqcap X \;=\; \bigl\{ b \in \mathcal{B} \;:\; b \cap X \ne \emptyset \bigr\} $$
denote the set of all balls in $\mathcal{B}$ that intersect $X$.
For two compact sets $X, Y \subseteq \mathbb{R}^d$, we write $X \sqcap Y$ if and only if $X \cap Y \ne \emptyset$. For a set $X$ and a set of balls $\mathcal{B}$, let $\mathcal{B} \sqcap X = \{ b \in \mathcal{B} : b \sqcap X \}$. Let $\kappa(r, R)$ denote the maximum number of pairwise disjoint balls of radius at least $r$, that may intersect a given ball of radius $R$ in $\mathbb{R}^d$. Clearly, we have $|\mathcal{B} \sqcap b| \le \kappa(r, R)$ for any ball $b$ of radius $R$, if the balls of $\mathcal{B}$ are disjoint and of radius at least $r$. We have the following bounds,
Lemma 2.6.
$\kappa(r, R) = O\bigl( (1 + R/r)^d \bigr)$ for all $r, R > 0$.
Proof.
Let $b = b(c, R)$ be a given ball of radius $R$. For the lower bound we can take two balls, both of radius $r$, which touch $b$ at diametrically opposite points and lie outside $b$. We now show the upper bound. Let $\mathcal{B}'$ be a set of disjoint balls, each having radius at least $r$ and touching $b$. Consider a ball $b' \in \mathcal{B}'$. If no point of the boundary of $b'$ touches $b$, then clearly $b'$ contains $b$ in its interior, and it is easy to see that $|\mathcal{B}'| = 1$ in this case. As such, we assume that all balls in $\mathcal{B}'$ have some point of their boundary inside $b$. For each $b' \in \mathcal{B}'$, take any point $p$ of the boundary of $b'$ such that $p$ is in $b$, and consider a ball of radius exactly $r$ that lies completely inside $b'$ and is tangent to $b'$ at $p$. We can find such a ball for each ball in $\mathcal{B}'$. Moreover, these balls are all disjoint. Thus we have $|\mathcal{B}'|$ disjoint balls of radius exactly $r$ that touch $b$. It is easy to see that all such balls are completely inside $b(c, R + 2r)$. By a simple volume packing bound it follows that $|\mathcal{B}'| \le (R + 2r)^d / r^d = O\bigl( (1 + R/r)^d \bigr)$.
Definition 2.7.
For a parameter $t \ge 1$, a function $f : \mathbb{R}_{\ge 0} \to \mathbb{R}$ is $t$-monotonic, if for every $r \ge 0$, $f(tr) \ge f(r)$.
3 Approximate range counting for balls
Data-structure 3.1.
For a given set $\mathcal{B}$ of $n$ disjoint balls in $\mathbb{R}^d$, we build the following data structure, that is useful in performing several of the tasks at hand.
-
(A)
Store balls in a (compressed) quadtree. For $i = 1, \ldots, n$, let $X_i = \mathcal{G}(b_i, 1)$, and let $\mathcal{X} = \bigcup_i X_i$ denote the union of these cells. Let $\mathcal{T}$ be a compressed quadtree decomposition of $[0,1]^d$, such that all the cells of $\mathcal{X}$ are cells of $\mathcal{T}$. We preprocess $\mathcal{T}$ to answer point location queries for the cells of $\mathcal{T}$. This takes $O(n \log n)$ time, see [12].
-
(B)
Compute list of “large” balls intersecting each cell. For each node of $\mathcal{T}$, there is a list of balls registered with it. Formally, register each ball $b_i$ with all the cells of $X_i = \mathcal{G}(b_i, 1)$. Clearly, each ball is registered with $O(1)$ cells, and it is easy to see that each cell has $O(1)$ balls registered with it, since the balls are disjoint.
Next, for a cell $\square$ in $\mathcal{T}$ we compute a list storing the “large” balls intersecting $\square$, and these balls are associated with this cell. These lists are computed in a top-down manner. To this end, propagate from a node its list (which we assume is already computed) down to its children. A node receiving such a list scans it, and keeps only the balls that intersect its cell (adding to this list the balls already registered with this cell). For a node $v$, let $\mathcal{L}(v)$ be this list.
-
(C)
Build compressed quadtree on centers of balls. Let $C$ be the set of centers of the balls of $\mathcal{B}$. Build, in $O(n \log n)$ time, a compressed quadtree $\mathcal{T}_C$ storing $C$.
-
(D)
ANN for centers of balls. Build a data structure $\mathcal{D}$, for answering $(1\pm\varepsilon)$-approximate $k$th-nearest neighbor distances on $C$, the set of centers of the balls, see [14], where $k$ and $\varepsilon$ are provided with the query. The data structure $\mathcal{D}$ returns a point $c \in C$ such that $(1-\varepsilon)\, d_k(q, C) \le \|q - c\| \le (1+\varepsilon)\, d_k(q, C)$.
-
(E)
Answering approximate range searching for the centers of balls.
Given a query ball $b = b(q, r)$ and a parameter $\varepsilon > 0$, one can, using $\mathcal{T}_C$, report (approximately), in $O(\log n + 1/\varepsilon^d)$ time, the points in $C \cap b$. Specifically, the query process computes $O(1/\varepsilon^d)$ sets of points, such that their union $P'$ has the property that $C \cap b \subseteq P' \subseteq C \cap (1+\varepsilon)b$, where $(1+\varepsilon)b$ is the scaling of $b$ by a factor of $1+\varepsilon$ around its center. Indeed, compute the set $\mathcal{G}(b, \varepsilon)$, and then using cell queries in $\mathcal{T}_C$ compute the corresponding cells (this takes $O(\log n + 1/\varepsilon^d)$ time). Now, descend to the relevant level of the quadtree to all the cells of the right size that intersect $b$. Clearly, the union of points stored in their subtrees is the desired set. This takes $O(\log n + 1/\varepsilon^d)$ time overall.
A similar data structure for approximate range searching was given by Arya and Mount [5]; our description above is included for the sake of completeness.
Overall, it takes $O(n \log n)$ time to build this data structure.
We denote the collection of data structures above by and where necessary, specific functionality it provides, say for finding the large balls intersecting a cell, by (B).
3.1 Approximate range counting among balls
We need the ability to answer approximate range counting queries on a set of disjoint balls. Specifically, given a set of disjoint balls $\mathcal{B}$, and a query ball $b$, the target is to compute the size of the set $\mathcal{B} \sqcap b$. To make this query computationally fast, we allow an approximation. More precisely, for a ball $b$, a ball $b'$ is a $(1+\varepsilon)$-ball of $b$, if $b \subseteq b' \subseteq (1+\varepsilon)b$, where $(1+\varepsilon)b$ is the $(1+\varepsilon)$-scaling of $b$ around its center. The purpose here, given a query ball $b$, is to compute the size of the set $\mathcal{B} \sqcap b'$ for some $(1+\varepsilon)$-ball $b'$ of $b$.
Lemma 3.2.
Given a compressed quadtree $\mathcal{T}$ of size $O(n)$, a convex set $X$, and a parameter $0 < \psi \le 1$, one can compute the set of nodes in $\mathcal{T}$ that realizes $\mathcal{G}(X, \psi)$ (see Definition 2.5), in $O(\log n + 1/\psi^d)$ time. Specifically, this outputs a set $\mathcal{V}$ of nodes, of size $O(1/\psi^d)$, such that their cells intersect $X$, and their parents' cell diameter is larger than $\psi\, \mathrm{diam}(X)$. Note that the cells in $\mathcal{V}$ might be significantly larger if they are leaves of $\mathcal{T}$.
Proof.
Let $\mathcal{G} = \mathcal{G}(X, \psi)$ be the grid approximation to $X$. Using cell queries on the compressed quadtree, one can compute the cells of $\mathcal{T}$ that correspond to these canonical cells. Specifically, for each cube $\square \in \mathcal{G}$, the query either returns a node for which this is its cell, or it returns a compressed edge of the quadtree; that is, two cells (one a parent of the other), such that $\square$ is contained in one of them and contains the other. Such a cell query takes $O(\log n)$ time [12]. This returns $O(1/\psi^d)$ nodes in $\mathcal{T}$ such that their cells cover $X$.

Now, traverse down the compressed quadtree starting from these nodes and collect all the nodes of the quadtree that are relevant. Clearly, one has to go at most a constant number of levels down the quadtree to get these nodes, and this takes $O(\log n + 1/\psi^d)$ time overall.
Lemma 3.3.
Let $X$ be any convex set in $\mathbb{R}^d$, and let $0 < \psi \le 1$ be a parameter. Using Data-structure 3.1, one can compute, in $O(\log n + 1/\psi^d)$ time, all the balls of $\mathcal{B}$ that intersect $X$, with diameter at least $\psi\, \mathrm{diam}(X)$.
Proof.
We compute the cells of the quadtree realizing $\mathcal{G}(X, \psi)$ using Lemma 3.2. Now, from each such cell (and its parent), we extract the list of large balls intersecting it (there are $O(1/\psi^d)$ such nodes, and the size of each such list is $O(1)$). Next we check for each such ball if it intersects $X$ and if its diameter is at least $\psi\, \mathrm{diam}(X)$. We return the list of all such balls.
3.2 Answering a query
Given a query ball $b = b(q, r)$, and an approximation parameter $\varepsilon > 0$, our purpose is to compute a number $N$, such that $|\mathcal{B} \sqcap b| \le N \le |\mathcal{B} \sqcap (1+\varepsilon)b|$.
The query algorithm works as follows:
-
(A)
Using Lemma 3.3, compute the set $\mathcal{B}_{\ge}$ of all the balls that intersect $b$ and are of radius at least $\varepsilon r$.
-
(B)
Using $\mathcal{T}_C$, compute the cells of $\mathcal{T}_C$ that correspond to $\mathcal{G}(b, \varepsilon)$. Let $N_1$ be the total number of points of $C$ stored in these nodes.
-
(C)
The quantity $N_1$ is almost the desired quantity, except that we might be counting some of the balls of $\mathcal{B}_{\ge}$ twice. To this end, let $N_2$ be the number of balls in $\mathcal{B}_{\ge}$ with centers in the cells computed above.
-
(D)
Let $N = N_1 + |\mathcal{B}_{\ge}| - N_2$. Return $N$.
We only sketch the proof, as it is straightforward. Indeed, the union of the cells of $\mathcal{G}(b, \varepsilon)$ contains $b$ and is contained in $(1+\varepsilon)b$. All the balls with radius smaller than $\varepsilon r$ that intersect $b$ have their centers in cells of $\mathcal{G}(b, \varepsilon)$, and their number is computed correctly. Similarly, the “large” balls are computed correctly. The last stage ensures we do not over-count each large ball that also has its center in one of these cells. It is also easy to check that $N \le |\mathcal{B} \sqcap (1+\varepsilon)b|$. The same argument can be used to establish the monotonicity of $N$ as a function of $r$.
We now analyze the running time. Computing all the cells of $\mathcal{G}(b, \varepsilon)$ takes $O(\log n + 1/\varepsilon^d)$ time. Computing the “large” balls takes $O(\log n + 1/\varepsilon^d)$ time. Checking for each large ball if it is already counted by the “small” balls takes $O(1/\varepsilon^d)$ time by using a grid. We denote the above query algorithm by rangeCount$(b, \varepsilon)$.
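The guarantee the counting query must satisfy can be stated operationally. The following Python sketch (with hypothetical names, not from the paper) gives the exact count and the sandwich test that any acceptable approximate answer must pass:

```python
import math

def exact_count(balls, q, r):
    """Number of balls b(c, rb) intersecting the query ball b(q, r)."""
    return sum(1 for c, rb in balls if math.dist(q, c) <= r + rb)

def is_valid_answer(balls, q, r, eps, N):
    """The guarantee of an approximate range count N:
    |B intersecting b(q,r)| <= N <= |B intersecting b(q,(1+eps)r)|."""
    return exact_count(balls, q, r) <= N <= exact_count(balls, q, (1 + eps) * r)

balls = [((0.0, 0.0), 1.0), ((2.4, 0.0), 0.5), ((3.1, 0.0), 0.01)]
q = (0.0, 0.0)
# b(q, 2) intersects the first two balls; the third only after expansion.
assert exact_count(balls, q, 2.0) == 2
assert is_valid_answer(balls, q, 2.0, 0.6, 3)  # 3 is acceptable for eps = 0.6
```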
The above implies the following.
Lemma 3.4.
Given a set $\mathcal{B}$ of $n$ disjoint balls in $\mathbb{R}^d$, it can be preprocessed, in $O(n \log n)$ time, into a data structure of size $O(n)$, such that given a query ball $b = b(q, r)$ and an approximation parameter $\varepsilon > 0$, the query algorithm rangeCount$(b, \varepsilon)$ returns, in $O(\log n + 1/\varepsilon^d)$ time, a number $N$ satisfying the following:
-
(A)
$|\mathcal{B} \sqcap b| \le N$,
-
(B)
$N \le |\mathcal{B} \sqcap (1+\varepsilon)b|$, and
-
(C)
for a fixed query point $q$ and parameter $\varepsilon$, the number $N$ is $(1+\varepsilon)$-monotonic as a function of $r$, see Definition 2.7.
4 Answering $(k,\varepsilon)$-ANN queries among balls
4.1 Computing a constant factor approximation to $d_k(q, \mathcal{B})$
Lemma 4.1.
Let $\mathcal{B}$ be a set of disjoint balls in $\mathbb{R}^d$, and consider a ball $b = b(q, r)$ that intersects at least $k$ balls of $\mathcal{B}$, where $k > \lambda = \kappa(r, r) = O(1)$, see Lemma 2.6. Then, among the $k$ nearest neighbors of $q$ from $\mathcal{B}$, there are at least $k - \lambda$ balls of radius at most $r$. The centers of all these balls are in $b(q, 3r)$.
Proof.
Consider the $k$ nearest neighbors of $q$ from $\mathcal{B}$. Any such ball that has its center outside $b(q, 3r)$ has radius at least $r$, since it intersects $b(q, r)$. Since the number of balls that are of radius at least $r$ and intersect $b(q, r)$ is bounded by $\lambda = \kappa(r, r) = O(1)$, there must be at least $k - \lambda$ balls among the $k$ nearest neighbors, each having radius less than $r$. Now, $b(q, 3r)$ will contain the centers of all such balls.
Corollary 4.2.
Let $r = d_k(q, \mathcal{B})$. Then, $d_{k - \lambda}(q, C) \le 3r$.
The basic observation is that we only need a rough approximation to the right radius, as using approximate range counting (i.e., Lemma 3.4), one can improve the approximation.
Let denote the distance of to the th closest center in . Let . Let be the minimum index, such that . Since , it must be that . There are several possibilities:
-
(A)
If (i.e., ) then, by Lemma 4.1, the ball contains at least centers. As such, , and is a good approximation to .
-
(B)
If , and , then is the desired approximation.
-
(C)
If , and , then is the desired approximation.
-
(D)
Otherwise, it must be that , and . Let be the th closest ball to , for . It must be that are much larger than . But then, the balls must intersect , and their radius is at least . We can easily compute these big balls using (B), and the number of centers of the small balls close to query, and then compute exactly.
We build the data structure described above in $O(n \log n)$ time.
First we introduce some notation. For $r \ge 0$, let $f(r)$ denote the number of balls in $\mathcal{B}$ that intersect $b(q, r)$; that is, $f(r) = |\mathcal{B} \sqcap b(q, r)|$. Let $g(r)$ denote the number of centers in $b(q, r)$, i.e., $g(r) = |C \cap b(q, r)|$. Also, let $h(r)$ denote the approximation to the number of balls of $\mathcal{B}$ intersecting $b(q, r)$, as computed by Lemma 3.4; that is, $f(r) \le h(r) \le f\bigl((1+\varepsilon)r\bigr)$.
We now provide our algorithm to answer a query. We are given a query point and a number .
Using , compute a -approximation for the smallest ball containing centers of , for , where , and let be this radius. That is, for , we have . For , compute (Lemma 3.4).
Let be the maximum index such that . Clearly, is well defined as . The algorithm is executed in the following steps.
-
(A)
If we return .
-
(B)
If , we return .
-
(C)
Otherwise, compute all the balls of that are of radius at least and intersect the ball , using (B). For each such ball , compute the distance of to it. Return for the minimum such number such that .
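For reference, the quantity the query algorithm approximates can be computed naively in $O(n \log n)$ time by sorting all point-to-ball distances; a Python sketch (names ours):

```python
import math

def kth_ball_distance(balls, q, k):
    """d_k(q, B): the kth smallest point-to-ball distance (1-indexed),
    where d(q, b(c, r)) = max(||q - c|| - r, 0)."""
    dists = sorted(max(math.dist(q, c) - r, 0.0) for c, r in balls)
    return dists[k - 1]

balls = [((0.0, 0.0), 1.0), ((4.0, 0.0), 1.0), ((9.0, 0.0), 2.0)]
assert kth_ball_distance(balls, (0.0, 0.0), 1) == 0.0  # inside the first ball
assert kth_ball_distance(balls, (0.0, 0.0), 2) == 3.0  # second closest ball
```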
Lemma 4.3.
Given a set $\mathcal{B}$ of $n$ disjoint balls in $\mathbb{R}^d$, one can preprocess them, in $O(n \log n)$ time, into a data structure of size $O(n)$, such that given a query point $q$ and a number $k$, $1 \le k \le n$, one can compute, in $O(\log n)$ time, a number $\alpha$ such that $d_k(q, \mathcal{B}) \le \alpha \le c\, d_k(q, \mathcal{B})$, for some constant $c$.
Proof.
The data structure and query algorithm are described above. We next prove correctness. To prove that (A) returns the correct answer observe that under the given assumptions,
where the second inequality follows from Corollary 4.2, and the third inequality follows as , while is the smallest number such that .
For (B) observe that we have that and as such we have . But by assumption, and so , thus .
For (C), first observe that as the algorithm did not return in (A). Since is the maximum index such that , so implying, . Also, , as the algorithm did not return in (B). Now the ball contains at least centers from , but it does not contain centers. Indeed, otherwise we would have and so , but on the other hand , which would be a contradiction. Similarly, there is no center of any ball whose distance from is in the range otherwise we would have that and this would mean that , a contradiction. Now, the center of the th closest ball is clearly more than away from . As such its distance from is at least . Since it follows that the th closest ball intersects and moreover, its radius is at least . Since we compute all such balls in (C), we do encounter the th closest ball. It is easy to see that in this case we return a number satisfying, .
We now show how to refine the approximation.
Lemma 4.4.
Given a set $\mathcal{B}$ of $n$ balls in $\mathbb{R}^d$, it can be preprocessed, in $O(n \log n)$ time, into a data structure of size $O(n)$. Given a query point $q$, numbers $k$ and $\alpha$, and an approximation parameter $\varepsilon > 0$, such that $d_k(q, \mathcal{B}) \le \alpha \le c\, d_k(q, \mathcal{B})$, one can find a ball $b \in \mathcal{B}$ such that $(1-\varepsilon)\, d_k(q, \mathcal{B}) \le d(q, b) \le (1+\varepsilon)\, d_k(q, \mathcal{B})$, in $O(\log n + 1/\varepsilon^d)$ time.
Proof.
We are going to use the same data structure as Lemma 3.4, for the query ball $b = b(q, \alpha)$. We compute all large balls of $\mathcal{B}$ that intersect $b$. Here a large ball is a ball of radius at least $\varepsilon \alpha$, and a ball of radius at most $\varepsilon \alpha$ is considered to be a small ball. Consider the grid cells of $\mathcal{G}(b, \varepsilon)$. In $O(\log n + 1/\varepsilon^d)$ time we can record the number of centers of large balls inside any such cell. Clearly, any small ball that intersects $b$ has its center in some cell of $\mathcal{G}(b, \varepsilon)$. We use the quadtree to find out exactly the number of centers, $n_\square$, of small balls in each cell $\square$ of $\mathcal{G}(b, \varepsilon)$, by finding the total number of centers using $\mathcal{T}_C$, and decreasing this by the count of centers of large balls in that cell. This can be done in $O(\log n + 1/\varepsilon^d)$ time. We pick an arbitrary point in $\square$, assign it weight $n_\square$, and treat it as representing all the small balls in this grid cell – clearly, this introduces an error of size $O(\varepsilon \alpha)$ in the distance of such a ball from $q$, and as such we can ignore it in our argument. At the end of this snapping process, we have $O(1/\varepsilon^d)$ weighted points, and $O(1/\varepsilon^d)$ large balls. We know the distance of the query point from each one of these points/balls. This results in $O(1/\varepsilon^d)$ weighted distances, and we want the smallest distance $\ell$, such that the total weight of the distances at most $\ell$ is at least $k$. This can be done by weighted median selection in time linear in the number of distances, which is $O(1/\varepsilon^d)$. Once we get the required point we can output any ball corresponding to the point. Clearly, the output ball satisfies the required conditions.
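The final selection step above can be sketched as follows; this is a simple sorting-based stand-in for the linear-time weighted selection the proof invokes (names ours):

```python
def weighted_kth_distance(dist_weight_pairs, k):
    """Smallest distance ell such that the total weight of the pairs with
    distance <= ell is at least k. (A linear-time weighted selection is
    possible; sorting keeps the sketch simple.)"""
    total = 0
    for d, w in sorted(dist_weight_pairs):
        total += w
        if total >= k:
            return d
    raise ValueError("total weight is smaller than k")

# Weighted representatives at distances 1, 2 and 5 with weights 1, 3 and 4:
# the cumulative weight first reaches 4 at distance 2.
assert weighted_kth_distance([(2.0, 3), (1.0, 1), (5.0, 4)], 4) == 2.0
```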
4.2 The result
Theorem 4.5.
Given a set $\mathcal{B}$ of $n$ disjoint balls in $\mathbb{R}^d$, one can preprocess them, in $O(n \log n)$ time, into a data structure of size $O(n)$, such that given a query point $q$, a number $k$ with $1 \le k \le n$, and an approximation parameter $\varepsilon > 0$, one can find, in $O(\log n + 1/\varepsilon^d)$ time, a ball $b \in \mathcal{B}$ such that $(1-\varepsilon)\, d_k(q, \mathcal{B}) \le d(q, b) \le (1+\varepsilon)\, d_k(q, \mathcal{B})$.
5 Quorum clustering
We are given a set $\mathcal{B}$ of $n$ disjoint balls in $\mathbb{R}^d$, and we describe how to compute a quorum clustering for them quickly.
Let $c \ge 1$ be some constant, and let $m = \lceil n/k \rceil$. Set $\mathcal{B}_1 = \mathcal{B}$. For $i = 1, \ldots, m$, let $\hat{b}_i$ be any ball that satisfies,
-
(A)
$\hat{b}_i$ contains $k$ balls of $\mathcal{B}_i$ completely inside it,
-
(B)
$\hat{b}_i$ intersects at least $k$ balls of $\mathcal{B}$, and
-
(C)
the radius of $\hat{b}_i$ is at most $c$ times the radius of the smallest ball satisfying the above conditions.
Next, we remove the balls that are contained in $\hat{b}_i$ from $\mathcal{B}_i$ to get the set $\mathcal{B}_{i+1}$. We call the removed set of balls $\mathcal{B}(\hat{b}_i)$. We repeat this process till all balls are extracted. Notice that at each step $i$, we only require that the ball $\hat{b}_i$ intersects at least $k$ balls of $\mathcal{B}$ (and not $\mathcal{B}_i$), but that it must contain $k$ balls from $\mathcal{B}_i$. Also, the last quorum ball may contain fewer balls. The balls $\hat{b}_1, \ldots, \hat{b}_m$ are the resulting $c$-approximate quorum clustering.
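The greedy extraction process can be sketched as follows. This is an illustrative simplification, not the paper's algorithm: it restricts candidate cluster centers to input-ball centers (so each round's radius is only approximately minimal), and it ignores the intersection condition:

```python
import math

def quorum_clusters(balls, k):
    """Greedy quorum-clustering sketch. Each round picks a cluster ball fully
    containing k of the remaining balls, then removes the contained balls.
    balls: list of (center, radius) pairs."""
    remaining = list(balls)
    clusters = []
    while remaining:
        m = min(k, len(remaining))
        best_c, best_r = None, math.inf
        for c, _ in remaining:
            # Smallest radius around c that fully contains m remaining balls.
            need = sorted(math.dist(c, ci) + ri for ci, ri in remaining)[m - 1]
            if need < best_r:
                best_c, best_r = c, need
        covered = [(ci, ri) for ci, ri in remaining
                   if math.dist(best_c, ci) + ri <= best_r + 1e-9]
        clusters.append(((best_c, best_r), covered))
        remaining = [b for b in remaining if b not in covered]
    return clusters
```

Each round removes at least $m$ balls (at least $m$ remaining balls satisfy the coverage test for the chosen center), so the process terminates with roughly $\lceil n/k \rceil$ clusters.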
5.1 Computing an approximate quorum clustering
Definition 5.1.
For a set $P$ of $n$ points in $\mathbb{R}^d$, and an integer $\ell$, with $1 \le \ell \le n$, let $r_{\mathrm{opt}}(P, \ell)$ denote the radius of the smallest ball which contains at least $\ell$ points from $P$.

Similarly, for a set $\mathcal{B}$ of $n$ balls in $\mathbb{R}^d$, and an integer $\ell$, with $1 \le \ell \le n$, let $r_{\mathrm{opt}}(\mathcal{B}, \ell)$ denote the radius of the smallest ball which completely contains at least $\ell$ balls from $\mathcal{B}$.
Lemma 5.2 ([14]).
Given a set $P$ of $n$ points in $\mathbb{R}^d$ and an integer $k$, with $1 \le k \le n$, one can compute, in $O(n \log n)$ time, a sequence of $m = \lceil n/k \rceil$ balls, $\tilde{b}_1, \ldots, \tilde{b}_m$, such that, for all $i$, we have
-
(A)
For every ball $\tilde{b}_i$, there is an associated subset $P_i$ of $k$ points of $P$, that it covers.
-
(B)
The ball $\tilde{b}_i$ is a constant-factor approximation to the smallest ball covering $k$ points of $P \setminus (P_1 \cup \cdots \cup P_{i-1})$; that is, its radius is at most a constant times the radius of the smallest such ball.
The algorithm to construct an approximate quorum clustering is as follows. We use the algorithm of Lemma 5.2 with the set of points $C$ (the centers of the balls of $\mathcal{B}$), and the parameter $k$, to get a list of balls $\tilde{b}_1, \ldots, \tilde{b}_m$ satisfying the conditions of Lemma 5.2. Next we use the algorithm of Theorem 4.5 to compute approximate $k$th-nearest neighbor distances from the centers $\tilde{c}_1, \ldots, \tilde{c}_m$ of these balls to the balls of $\mathcal{B}$.
Thus, we get numbers $\gamma_1, \ldots, \gamma_m$ satisfying $d_k(\tilde{c}_i, \mathcal{B}) \le \gamma_i \le 2\, d_k(\tilde{c}_i, \mathcal{B})$. Let $R_i = \mathrm{radius}(\tilde{b}_i) + \gamma_i$, for $i = 1, \ldots, m$. Sort the numbers $R_1, \ldots, R_m$ (we assume, for the sake of simplicity of exposition, that $R_m$, being the radius of the last cluster, is the largest number). Suppose the sorted order is the permutation $\pi$ of $\{1, \ldots, m\}$ (by assumption $\pi(m) = m$). We output the balls $b(\tilde{c}_{\pi(i)}, R_{\pi(i)})$, for $i = 1, \ldots, m$, as the approximate quorum clustering.
5.2 Correctness
Lemma 5.3.
Let $\mathcal{B} = \{b_1, \ldots, b_n\}$ be a set of disjoint balls, where $b_i = b(c_i, r_i)$, for $i = 1, \ldots, n$. Let $C$ be the set of centers of these balls. Let $b = b(c, r)$ be any ball that contains at least $\ell$ centers from $C$, for some $\ell \ge 2$. Then $b(c, 3r)$ contains the balls that correspond to those centers.
Proof.
Without loss of generality, suppose $b$ contains the centers $c_1, \ldots, c_\ell$ from $C$. Now consider any index $i$ with $1 \le i \le \ell$, and consider any $j \ne i$ with $1 \le j \le \ell$, which exists as $\ell \ge 2$ by assumption. Since $b$ contains both $c_i$ and $c_j$, we have $\|c_i - c_j\| \le 2r$ by the triangle inequality. On the other hand, as the balls $b_i$ and $b_j$ are disjoint we have that $r_i + r_j \le \|c_i - c_j\|$. It follows that $r_i \le 2r$ for all $1 \le i \le \ell$. As such, the ball $b(c, 3r)$ must contain the entire ball $b_i$, and thus it contains all the balls $b_1, \ldots, b_\ell$, corresponding to the centers.
Lemma 5.4.
Let $\mathcal{B}$ be a set of $n$ disjoint balls in $\mathbb{R}^d$. Let $C$ be the corresponding set of centers, and let $\ell$ be an integer with $2 \le \ell \le n$. Then, $r_{\mathrm{opt}}(C, \ell) \le r_{\mathrm{opt}}(\mathcal{B}, \ell) \le 3\, r_{\mathrm{opt}}(C, \ell)$.
Proof.
The first inequality follows since the ball realizing the optimal covering of $\ell$ balls clearly contains their centers as well, and therefore at least $\ell$ points from $C$. To see the second inequality, consider the ball realizing $r_{\mathrm{opt}}(C, \ell)$, and use Lemma 5.3 on it. This implies $r_{\mathrm{opt}}(\mathcal{B}, \ell) \le 3\, r_{\mathrm{opt}}(C, \ell)$.
Lemma 5.5.
The balls computed above are an $O(1)$-approximate quorum clustering of $\mathcal{B}$.
Proof.
Consider the balls $\tilde{b}_1, \ldots, \tilde{b}_m$ computed by the algorithm of Lemma 5.2. Suppose $P_i$, for $i = 1, \ldots, m$, is the set of centers assigned to the ball $\tilde{b}_i$. That is, $P_1, \ldots, P_m$ form a disjoint decomposition of $C$, each set of size $k$ (except for the last set, which might be smaller – a technicality that we ignore for the sake of simplicity of exposition).
For , let denote the set of balls corresponding to the centers in . Now while constructing the approximate quorum clusters we are going to assign the set of balls for , to . Now, fix a with . The balls of have been used up. Consider an optimal ball, i.e., a ball that contains completely balls among and intersects balls from , and is the smallest such possible. Fix some balls from that this optimal ball contains. Consider the sets of centers of these balls. The quorum clusters for , contain all these centers, by construction. Out of these indices, i.e., out of the indices , suppose is the minimum index such that contains one of these centers. When was constructed, i.e., at the th iteration of the algorithm of Lemma 5.2, all the centers from were available. Now since the optimal ball contains available centers too, it follows that since Lemma 5.2 guarantees this. Since , by Lemma 5.3, contains the balls of . Moreover, by the Lipschitz property, see Observation 2.2, it follows that , where the second last inequality follows as the balls and the ball intersect. Therefore, for the index we have that, , and also that . As such . The index minimizes this quantity among the indices (as we took the sorted order), as such it follows that .
Lemma 5.6.
Given a set $\mathcal{B}$ of $n$ disjoint balls in $\mathbb{R}^d$, and a number $k$ with $1 \le k \le n$, in $O(n \log n)$ time, one can output a sequence of $m = \lceil n/k \rceil$ balls $\hat{b}_1, \ldots, \hat{b}_m$, such that
-
(A)
For each ball $\hat{b}_i$, there is an associated subset $\mathcal{B}(\hat{b}_i)$ of $k$ balls of $\mathcal{B}$, that it completely covers.
-
(B)
The ball $\hat{b}_i$ intersects at least $k$ balls from $\mathcal{B}$.
-
(C)
The radius of the ball $\hat{b}_i$ is at most $O(1)$ times that of the smallest ball covering $k$ balls of $\mathcal{B} \setminus \bigl( \mathcal{B}(\hat{b}_1) \cup \cdots \cup \mathcal{B}(\hat{b}_{i-1}) \bigr)$ completely, and intersecting $k$ balls of $\mathcal{B}$.
6 Construction of the sublinear space data structure for $(k,\varepsilon)$-ANN
Here we show how to compute an approximate Voronoi diagram for approximating the $k$th-nearest ball, that takes (roughly) $O(n/k)$ space. We assume $k > 1$ without loss of generality, and we let $m = \lceil n/k \rceil$. Here $k$ and $\varepsilon$ are prespecified in advance.
6.1 Preliminaries
The following notation was introduced in [14]. A ball $b = b(c, r)$ of radius $r$ in $\mathbb{R}^d$, centered at a point $c$, can be interpreted as a point in $\mathbb{R}^{d+1}$, denoted by $b^{\uparrow} = (c, r)$. For a regular point $p \in \mathbb{R}^d$, its corresponding image under this transformation is the mapped point $p^{\uparrow} = (p, 0)$, i.e., we view it as a ball of radius zero and use the mapping defined on balls. Given a point $p \in \mathbb{R}^d$ we will denote its Euclidean norm by $\|p\|$. We will consider a point $(p, t) \in \mathbb{R}^{d+1}$ to be in the product metric of $\mathbb{R}^d \times \mathbb{R}$, endowed with the product metric norm
$$ \|(p, t)\|_{\oplus} \;=\; \|p\| + |t|. $$
It can be verified that the above defines a norm, and for any point $p \in \mathbb{R}^d$ and ball $b$ we have $d(p, b) \le \|p^{\uparrow} - b^{\uparrow}\|_{\oplus}$.
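A small Python sketch of the lifting and of the product norm as we have written it above (the concrete form of the norm is our reading of [14], and the names are ours):

```python
import math

def lift(c, r):
    """Map the ball b(c, r) to the point (c, r) in R^{d+1}; a point is lifted
    as a ball of radius 0."""
    return (*c, r)

def prod_dist(u, v):
    """Product-metric distance on R^d x R: Euclidean distance in the first
    factor plus absolute difference in the last coordinate."""
    return math.dist(u[:-1], v[:-1]) + abs(u[-1] - v[-1])

# The lifted distance never underestimates the point-to-ball distance.
q, c, r = (5.0, 0.0), (0.0, 0.0), 1.0
d_ball = max(math.dist(q, c) - r, 0.0)           # distance to the ball: 4.0
d_lift = prod_dist(lift(q, 0.0), lift(c, r))     # 5.0 + 1.0 = 6.0
assert d_ball <= d_lift
```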
6.2 Construction
The input is a set $\mathcal{B}$ of $n$ disjoint balls in $\mathbb{R}^d$, and parameters $k$ and $\varepsilon$.
The construction of the data structure is similar to the construction of the $k$th-nearest neighbor data structure from the authors' paper [14]. We compute, using Lemma 5.6, an $O(1)$-approximate quorum clustering of $\mathcal{B}$ with $m$ balls, $\hat{b}_1 = b(\hat{c}_1, \hat{r}_1), \ldots, \hat{b}_m = b(\hat{c}_m, \hat{r}_m)$, where $m = \lceil n/k \rceil$. The algorithm then continues as follows:
-
(A)
Compute an exponential grid around each quorum cluster. Specifically, let
$$ \mathcal{I} \;=\; \bigcup_{i=1}^{m} \; \bigcup_{j=0}^{\lceil \lg(1/\varepsilon) \rceil} \mathcal{G}\Bigl( b\bigl( \hat{c}_i,\, 2^j \hat{r}_i \bigr),\; \varepsilon/t \Bigr) \qquad (6.1) $$
be the set of grid cells covering the quorum clusters and their immediate environs, where $t$ is a sufficiently large constant.
-
(B)
Intuitively, $\mathcal{I}$ takes care of the region of space immediately next to a quorum cluster.[3] For the other regions of space, we can apply a construction of an approximate Voronoi diagram for the centers of the clusters (the details are somewhat more involved). To this end, lift the quorum clusters into points in $\mathbb{R}^{d+1}$, as follows:
$$ U \;=\; \bigl\{ \hat{b}_1^{\uparrow}, \ldots, \hat{b}_m^{\uparrow} \bigr\} \;=\; \bigl\{ (\hat{c}_i, \hat{r}_i) \;:\; i = 1, \ldots, m \bigr\}, $$
where $\hat{b}_i = b(\hat{c}_i, \hat{r}_i)$, for $i = 1, \ldots, m$. Note that all points in $U$ belong to $[0,1]^{d+1}$ by Assumption 2.3. Now build an AVD for $U$ using the algorithm of Arya and Malamatos [2], for distances specified by the $\|\cdot\|_{\oplus}$ norm. The AVD construction provides a list of canonical cubes covering $[0,1]^{d+1}$ such that in the smallest cube containing the query point, the associated point of $U$ is an ANN to the query point. (Note that these cubes are not necessarily disjoint. In particular, the smallest cube containing the query point is the one that determines the assigned approximate nearest neighbor to it.)

[3] That is, intuitively, if the query point falls into one of the grid cells of $\mathcal{I}$, we can answer a query in constant time.
Clip this collection of cubes to the hyperplane $x_{d+1} = 0$ (i.e., throw away cubes that do not have a face on this hyperplane). For a cube $\square$ in this collection, denote by $\mathrm{nn}(\square)$ the point of $U$ assigned to it. Let $\mathcal{W}$ be this resulting set of canonical $d$-dimensional cubes.
-
(C)
Let $\mathcal{S}$ be the space decomposition resulting from overlaying the two collections of cubes, i.e., $\mathcal{I}$ and $\mathcal{W}$. Formally, we compute a compressed quadtree $\mathcal{T}'$ that has all the canonical cubes of $\mathcal{I}$ and $\mathcal{W}$ as nodes, and $\mathcal{S}$ is the resulting decomposition of space into cells. One can overlay two compressed quadtrees representing the two sets in linear time [7, 12]. Here, a cell associated with a leaf is a canonical cube, and a cell associated with a compressed node is the set difference of two canonical cubes. Each node in this compressed quadtree contains two pointers – to the smallest cube of $\mathcal{I}$, and to the smallest cube of $\mathcal{W}$, that contains it. This information can be computed by doing a BFS on the tree.
For each cell we store the following.
-
(I)
An arbitrary representative point inside the cell.
-
(II)
The point $\mathrm{nn}(\square) \in U$ that is associated with the smallest cell of $\mathcal{W}$ that contains this cell. We also store an arbitrary ball, $b_\square$, that is one of the balls completely inside the cluster specified by $\mathrm{nn}(\square)$ – we assume we stored such a ball inside each quorum cluster, when it was computed.
6.3 Answering a query
Given a query point $q$, compute the leaf cell (equivalently the smallest cell) in $\mathcal{S}$ that contains $q$ by performing a point-location query in the compressed quadtree $\mathcal{T}'$. Let $\square$ be this cell. Let,
(6.2)
If we return as the approximate th-nearest neighbor, else we return .
6.4 Correctness
Lemma 6.1.
The number satisfies, .
Proof.
This follows by the Lipschitz property, see Observation 2.2.
Lemma 6.2.
Let be any cell containing . If , then is a valid -approximate th-nearest neighbor of .
Proof.
Similarly, using the Lipschitz property, we can argue that, , and therefore we have, , and the required guarantees are satisfied.
Lemma 6.3.
For any point there is a quorum ball such that (A) intersects , (B) , and (C) .
Proof.
By assumption, , and so by Lemma 4.1, among the nearest neighbors of there are balls of radius at most . Let denote the set of these balls. Among the indices , let be the minimum index such that one of these balls is completely covered by the quorum cluster . Since intersects the ball while completely contains it, clearly intersects . Now consider the time at which was constructed, i.e., the th iteration of the quorum clustering algorithm. At this time, by assumption, all of was available; that is, none of its balls had been assigned to earlier quorum clusters. The ball contains unused balls and touches balls from ; as such, the smallest such ball had radius at most . By the guarantee on quorum clustering, . As for the last part, since the balls and intersect and , we have by the triangle inequality that , as .
Definition 6.4.
Lemma 6.5.
Suppose that among the quorum cluster balls there is some ball that satisfies and . Then the output of the algorithm is correct.
Proof.
We have
Thus, by construction, the expanded environ of the quorum cluster contains the query point, see Eq. (6.1). Let be the smallest non-negative integer such that . We have that, . As such, if is the smallest cell in containing the query point , then
by Eq. (6.1), and if . Now, by Lemma 6.1 we have that , so . Therefore, the algorithm returns as the -approximate th-nearest neighbor, and by Lemma 6.2 this is a correct answer.
Lemma 6.6.
The query algorithm always outputs a correct approximate answer, i.e., the output ball satisfies .
Proof.
If among the quorum cluster balls there is some ball such that and , then by Lemma 6.5 the algorithm returns a valid approximate answer. Assume, then, that this condition is not satisfied. Let the anchor cluster be . Since the anchor cluster satisfies and , it must be the case that . Since the anchor cluster intersects , we have that . Thus, . Let be the smallest cell in which is located, and consider the point . Suppose it corresponds to the cluster , i.e., . Since is a -ANN to among the points of , . It follows that , and . By our assumption, it must be the case that . Now, there are two cases. First, suppose that . Then, since , we have , so . As such, . In this case the algorithm returns , and the result is correct by Lemma 6.2. On the other hand, if the algorithm returns , it is easy to see that . Also, as lies completely inside , it follows by Observation 2.1 that , where the second-to-last inequality follows by Lemma 6.1.
6.5 The result
Theorem 6.7.
Given a set of disjoint balls in , a number , with , and , one can preprocess , in time, where and . The space used by the data structure is . Given a query point , this data structure outputs a ball in time, such that .
Proof.
If , then Theorem 4.5 provides the desired result. For , the correctness was proved in Lemma 6.6, so we only need to bound the construction time, the space, and the query time. Computing the quorum clustering takes time by Lemma 5.6. Observe that . From the construction of Arya and Malamatos [2], we have (note that, since we clip the construction to a hyperplane, we get in the bound and not ). A careful implementation of this stage takes time. Overlaying the two compressed quadtrees representing them takes time linear in their size, that is, .
The most expensive step is to perform the -approximate th-nearest neighbor query for each cell in the resulting decomposition of , see Eq. (6.2) (i.e., computing for each cell ). Using the data structure of Section 4 (see Theorem 4.5), each query takes time, and this bounds the overall construction time.
The query algorithm performs a point-location query followed by an time computation, and thus takes time overall.
7 Conclusions
In this paper, we presented a generalization of the usual -approximate th-nearest neighbor problem in , where the input is a set of balls of arbitrary radii, while the query is a point. We first presented a data structure that takes space and answers queries in time; here, both and can be supplied at query time. Next, we presented an -AVD taking space, showing, surprisingly, that the problem can be solved in sublinear space if is sufficiently large.
References
- [1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.
- [2] S. Arya and T. Malamatos. Linear-size approximate Voronoi diagrams. In Proc. 13th ACM-SIAM Sympos. Discrete Algs., pages 147–155, 2002.
- [3] S. Arya, T. Malamatos, and D. M. Mount. Space-time tradeoffs for approximate spherical range counting. In Proc. 16th ACM-SIAM Sympos. Discrete Algs., pages 535–544, 2005.
- [4] S. Arya, T. Malamatos, and D. M. Mount. Space-time tradeoffs for approximate nearest neighbor searching. J. Assoc. Comput. Mach., 57(1):1–54, 2009.
- [5] S. Arya and D. M. Mount. Approximate range searching. Comput. Geom. Theory Appl., 17:135–152, 2000.
- [6] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. Assoc. Comput. Mach., 45(6):891–923, 1998.
- [7] M. de Berg, H. Haverkort, S. Thite, and L. Toma. Star-quadtrees and guard-quadtrees: I/O-efficient indexes for fat triangulations and low-density planar subdivisions. Comput. Geom. Theory Appl., 43:493–513, July 2010.
- [8] P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to -nearest-neighbors and -body potential fields. J. Assoc. Comput. Mach., 42:67–90, 1995.
- [9] P. Carmi, S. Dolev, S. Har-Peled, M. J. Katz, and M. Segal. Geographic quorum systems approximations. Algorithmica, 41(4):233–244, 2005.
- [10] K. L. Clarkson. Nearest-neighbor searching and metric space dimensions. In G. Shakhnarovich, T. Darrell, and P. Indyk, editors, Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pages 15–59. MIT Press, 2006.
- [11] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. In Proc. 42nd Annu. IEEE Sympos. Found. Comput. Sci., pages 94–103, 2001.
- [12] S. Har-Peled. Geometric Approximation Algorithms, volume 173 of Mathematical Surveys and Monographs. Amer. Math. Soc., 2011.
- [13] S. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. Theory Comput., 8:321–350, 2012. Special issue in honor of Rajeev Motwani.
- [14] S. Har-Peled and N. Kumar. Down the rabbit hole: Robust proximity search in sublinear space. In Proc. 53rd Annu. IEEE Sympos. Found. Comput. Sci., pages 430–439, 2012.
- [15] S. Har-Peled and N. Kumar. Approximating minimization diagrams and generalized proximity search. In Proc. 54th Annu. IEEE Sympos. Found. Comput. Sci., pages 717–726, 2013.
- [16] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput., pages 604–613, 1998.
- [17] G. Shakhnarovich, T. Darrell, and P. Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing). The MIT Press, 2006.