
A Note on Graph-Based Nearest Neighbor Search

Hongya Wang    Zhizheng Wang   Wei Wang   Yingyuan Xiao§   Zeng Zhao   Kaixiang Yang
School of Computer Science and Technology, Donghua University, China
‡ School of Computer Science and Engineering, University of New South Wales, Australia
§ School of Computer Science and Engineering, Tianjin University of Technology, China
hywang@dhu.edu.cn
Abstract

Nearest neighbor search has found numerous applications in machine learning, data mining and massive data processing systems. The past few years have witnessed the popularity of the graph-based nearest neighbor search paradigm because of its superiority over the space-partitioning algorithms. While many empirical studies demonstrate the efficiency of graph-based algorithms, not much attention has been paid to a more fundamental question: why do graph-based algorithms work so well in practice? And which data property affects the efficiency, and how? In this paper, we try to answer these questions. Our insight is that “the probability that the neighbors of a point o tend to also be neighbors in the KNN graph” is a crucial data property for query efficiency. For a given dataset, such a property can be quantitatively measured by the clustering coefficient of the KNN graph.

To show how the clustering coefficient affects the performance, we identify that, instead of the global connectivity, the local connectivity around a given query q has a more direct impact on recall. Specifically, we observe that a high clustering coefficient makes most of the k nearest neighbors of q sit in a maximum strongly connected component (SCC) of the graph. From the algorithmic point of view, we show that the search procedure is actually composed of two phases - one outside the maximum SCC and the other inside it, which is different from the widely accepted single or multiple path search models. We prove that the commonly used graph-based search algorithm is guaranteed to traverse the maximum SCC once it visits any point in it. Our analysis reveals that a high clustering coefficient leads to a large maximum SCC, and thus provides good answer quality with the help of the two-phase search procedure. Extensive empirical results over a comprehensive collection of datasets validate our findings.

I Introduction

Nearest neighbor search among database vectors for a query is a key building block for problems such as large-scale image search and information retrieval, recommendation, entity resolution, and sequence matching. As database size and vector dimensionality increase, exact nearest neighbor search becomes expensive and is often considered impractical due to its long search latency. To reduce the search cost, approximate nearest neighbor (ANN) search is used, which provides a better tradeoff among accuracy, latency, and memory overhead.

Roughly speaking, the existing ANN methods can be classified into space-partitioning algorithms and graph-based ones (note that this categorization is not fixed or unique). The space-partitioning methods further fall into three categories - tree-based, product quantization (PQ) and locality sensitive hashing (LSH) [1, 2]. Recent empirical studies show that graph-based ANN search algorithms are more efficient than space-partitioning methods such as PQ and LSH, and thus have been adopted by many commercial applications at Facebook, Microsoft, Taobao, etc. [3, 4, 5].

While a lot of empirical studies validate the efficiency of graph-based ANN search algorithms, not much attention has been paid to a more fundamental question: why are graph-based ANN search algorithms so efficient? And which data property affects the efficiency, and how? Two recent papers analyze the asymptotic performance of graph-based methods for datasets uniformly distributed on a d-dimensional Euclidean sphere [6, 7]. The worst-case analysis shows that the asymptotic behavior of a greedy graph-based search only matches the optimal hash-based algorithm [8], which is far worse than the practical performance of graph-based algorithms and thus cannot answer these questions.

A few conceptual graph models such as the Monotonic Search Network Model [9], Delaunay Graph Model [10, 11] and Navigable Small World Model [12, 13] have been proposed to inspire the construction of ANN search graphs. As will be discussed in Section II, none of them can explain the success of graph-based algorithms either. Actually, the vast majority (if not all) of practical ANN search graphs use approximate KNN graphs as the index structure instead of the conceptual models, due to time or space constraints, and are thus devoid of the theoretical features provided by these models.

In this paper, we argue that, for a specific dataset, the clustering coefficient [14] of its KNN graph is an important indicator of how efficiently graph-based algorithms work. The clustering coefficient of the KNN graph is the probability that the neighbors of a point are also neighbors of each other. Comprehensive experimental results reveal that the higher the clustering coefficient, the more efficiently graph-based algorithms perform. Since the clustering coefficient is data dependent, graph-based algorithms perform rather poorly on datasets such as Random with a very small clustering coefficient, whereas they do well on datasets such as Sift and Audio with larger ones.

We also study how the clustering coefficient affects the performance. The analysis of complex networks indicates that a large clustering coefficient leads to high global connectivity [15]. Our insight is that, instead of the global connectivity, the local graph structure is more crucial for high ANN search recall. Particularly, we observed that, for datasets with a large clustering coefficient, most of the kNN of a given query (which may or may not be in the dataset) lie in the maximum strongly connected component (SCC) of the subgraph composed of these kNN. Moreover, we show that the search procedure actually consists of two phases, one outside the maximum SCC and one inside it, in contrast to the common wisdom of single or multiple path search models. Then, we prove that the commonly used graph search algorithm is guaranteed to visit all kNN in the maximum SCC under a mild condition, which suggests that the size of the maximum SCC determines the answer quality of kNN search. This sheds light on the strong positive correlation between the clustering coefficient and the result quality, and thus answers the two aforementioned questions.

To sum up, the main contributions of this paper are:

  • We introduce a new quantitative measure, the Clustering Coefficient of the KNN graph, for the difficulty of graph-based nearest neighbor search. To the best of our knowledge, this is the first measure that can explain the efficiency of this class of important ANN search algorithms.

  • The conceptual models such as MSNETs and Delaunay graphs claim that the NN can be found by walking along a single path. Instead, we found that the search procedure is actually composed of two phases. In the second phase the algorithm traverses the maximum SCC of the kNN of a query, whose size is a determining factor for answer quality, i.e., recall.

  • We prove that the graph-based search algorithm is guaranteed to visit all points in the maximum SCC once it enters it. Extensive empirical results over a comprehensive collection of datasets validate our observations.

We believe that this note could provide a different perspective on graph-based ANN search methods and might inspire more interesting work along this line.

II Graph-Based Nearest Neighbor Search

II-A Graph Construction and Search Algorithms

In the sequel, we will use nodes, points and vertices interchangeably without ambiguity. A directed graph G = (V, E) consists of a nonempty vertex set V and a set E of edges such that each edge e ∈ E is assigned to an ordered pair (u, v) with u, v ∈ V. Most graph-based algorithms build directed graphs to navigate the kNN search. To the best of our knowledge, the idea of using graphs to process ANN search can be traced back to the Pathfinder project, which was initiated to support efficient search, classification, and clustering in computer vision systems [9]. In this project, Dearholt et al. designed and implemented the monotonic search network to retrieve the best match of an entity in a collection of geometric objects. Since then, researchers from different communities such as theory, databases and pattern recognition have explored different ways to construct search graphs, inspired by various graph/network models such as the relative neighborhood graph [16, 17], Delaunay graph [18, 10, 19], KNN graph [20, 21] and navigable small world network [13, 22, 23]. Thanks to its appealing practical performance, the graph-based ANN search paradigm has become an active research direction and quite a lot of new approaches have been developed recently [5, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 3, 34].

While motivated by different graph/network models, most (if not all) practical graph-based algorithms essentially use approximate KNN graphs as the core index structure. A specific algorithm distinguishes itself from the others mainly in its edge selection heuristics, i.e., the way it selects a subset of links between a point and its neighbors. Algorithm 1 depicts the general framework of index construction for graph-based ANN search.

Input: V is the vertex set and K is used to control the graph quality
Output: The search graph G
1 for each v ∈ V do
2       Find the exact or approximate K nearest neighbors of v;
3       Choose a subset C of these KNNs based on some specific heuristic;
4       Add directed or bi-directional connections between v and every vertex in C;
        // update G
5 return G
Algorithm 1 The General Graph Construction Framework(V, K)
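
To make the framework concrete, the following is a minimal Python sketch of Algorithm 1. It is illustrative only: the exact KNN step is done by brute force and select_subset is a placeholder for an edge selection heuristic (e.g., RNG/MRNG-style pruning); neither reflects the construction of any particular system.

import numpy as np

def build_knn_graph(X, K, select_subset=None):
    # X: n x d data matrix; K: neighbors per node; select_subset: optional heuristic
    n = X.shape[0]
    graph = [set() for _ in range(n)]                    # adjacency lists of G
    for v in range(n):
        # exact KNN by brute force (practical systems use approximate KNN here)
        dist = np.linalg.norm(X - X[v], axis=1)
        knn = np.argsort(dist)[1:K + 1]                  # skip v itself
        C = select_subset(v, knn, X) if select_subset else knn
        for u in C:
            graph[v].add(int(u))                         # directed edge v -> u
            graph[int(u)].add(v)                         # and the reverse link
    return graph

In practice the exact KNN step is replaced by an approximate procedure (for example NN-Descent-style neighborhood refinement), since the brute-force version above costs O(n^2 d).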

For almost all graph-based methods, the ANN search procedure is based on the same principle. For a query q, start at an initial vertex chosen arbitrarily or using some sophisticated selection rule. Move along an edge to the adjacent vertex with minimum distance to q. Repeat this step until the current element v is closer to q than all its neighbors, and then report v as the NN of q. We call this the single path search model. Figure 1 illustrates the search procedure for q in a sample graph with points o_1 to o_6, where o_1 is the starting point and dashed lines indicate the search path. At the end of the search, the NN of q, i.e., o_6, is identified.

Figure 1: An illustrative example of graph search

For practical ANN search graphs, e.g., HNSW and NSG, there is no guarantee that a monotonic search path always exists for any given query [22, 5]. As a result, the search can easily get trapped in a local optimum, meaning that v is not the NN of q. To address this issue, backtracking is employed - we go back to visited vertices and follow another outgoing link to restart the procedure. We call this the multiple path search model. Algorithm 2 sketches the commonly adopted search algorithm that allows for backtracking, which will be discussed in detail in Section IV. Figure 2 illustrates a search path with backtracking. The starting point is o_1 and o_2 is a local optimum since the true NN of q is o_4. By backtracking the algorithm gets back to o_3, which is farther from q than o_2, and finally finds the true NN of q.

Figure 2: An illustrative example of graph search with backtracking
Input: Graph G, entry vertex s, query q, priority queue of size L
Output: k nearest neighbors of q
cand.push(s); // add s to the priority queue of candidates
result_L.push(s); // add s to the priority queue that stores the L nearest points to q
1 while |cand| > 0 do
2       v = cand.top(); cand.pop(); // v is the nearest point in cand to q
3       o = result_L.bottom(); // o is the furthest point in result_L to q
4       if d(v, q) > d(o, q) then
5             break; // all points in result_L have been evaluated
6       for each neighbor e of v in G do
7             if e ∉ visited then
8                   cand.push(e);
9                   result_L.push(e);
10      visited.push(v); // add v to the visited set
11 return the top k points in result_L
Algorithm 2 Graph-based kNN Search(G, s, q, L)

Please note that L is often greater than k to achieve better answer quality. For ease of presentation, we assume k = L throughout this paper unless stated otherwise.
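
For reference, the following is a compact Python sketch of Algorithm 2 with the two priority queues implemented as binary heaps. It follows the common implementation convention of marking points as visited when they are pushed (a minor variant of the pseudocode above); dist(v, q) is an assumed user-supplied function returning the distance from database point v to the query.

import heapq

def graph_knn_search(graph, dist, s, q, k, L):
    # graph: adjacency lists (vertex id -> iterable of neighbor ids)
    d_s = dist(s, q)
    cand = [(d_s, s)]            # min-heap: closest unexpanded candidate first
    result = [(-d_s, s)]         # max-heap via negation: farthest kept result first
    visited = {s}
    while cand:
        d_v, v = heapq.heappop(cand)
        # stop when the closest candidate is worse than the worst of the L results
        if len(result) >= L and d_v > -result[0][0]:
            break
        for e in graph[v]:
            if e in visited:
                continue
            visited.add(e)
            d_e = dist(e, q)
            heapq.heappush(cand, (d_e, e))
            heapq.heappush(result, (-d_e, e))
            if len(result) > L:
                heapq.heappop(result)        # keep only the L nearest so far
    # sort the kept results by increasing distance and return the top k ids
    return [v for _, v in sorted((-d, v) for d, v in result)][:k]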

II-B Review of Graph Search Models and Their Limitations

While empirical studies demonstrate that the graph-based ANN search algorithms are very competitive, it is widely recognized that the graph-based methods are mostly based on heuristics and not well understood quanlitatively [31, 5]. As an exception, two recent papers take the first step to analyze the asymptotic performance of graph-based methods for datasets uniformly distributed on a dd-dimensional Euclidean sphere [6, 7]. The worst-case analysis shows that the asymptotic behavior of a greedy graph-based search only matches the optimal hashing-based algorithm [8].

It was experimentally observed that the graph-based methods are orders of magnitude faster than the hashing-based algorithms [26]. Thus, though interesting from a purely theoretical perspective, their theory fails to explain the salient practical performance of the graph-based algorithms. Next, we will review several graph/network models that inspire practical graph-based algorithms, and then point out their limitations.

Monotonic Search Network Model. The monotonic search networks (MSNETs) are defined as a family of graphs such that, for any two vertices in the graph, there exists at least one monotonic search path between them [9]. If the query point happens to be equal to a point of the dataset, then a simple greedy search will succeed in locating the query point along a path of monotonically decreasing distance to the query point. The original MSNET is not practical, even for datasets of moderate size, because of its O(n^3) indexing complexity and unbounded average out-degree [9]. A recent proposal, the monotonic relative neighborhood graph, reduces the graph construction time to O(n^2 log n). This, however, still does not make MSNETs applicable in practice.

Delaunay Graph Model. Given a set of elements in a Euclidean space, the Voronoi diagram associates a Voronoi region with each element, which gives rise to a notion of neighborhood. The significance of this neighborhood is that if a query is closer to a database element than to all its neighbors, then we have found the nearest element in the whole database [10, 11]. The Delaunay graph is the dual of the Voronoi diagram, where each element p is connected to all elements that share a Voronoi edge with p. Using the Delaunay graph, Algorithm 2 is guaranteed to find the NN of q ∉ V. Unfortunately, the worst-case combinatorial complexity of the Voronoi diagram in dimension d grows as Θ(n^{d/2}) [35]. In addition, the Delaunay graph quickly reduces to the complete graph as d grows, making it infeasible for NN search in high-dimensional spaces [18].

Navigable Small World Model. Networks with logarithmic or polylogarithmic complexity of greedy graph search are known as navigable small world networks [12, 23, 13]. Inspired by this model, Malkov et al. proposed the navigable small world graph (NSW) and hierarchical NSW (HNSW) by introducing “long” links during the approximate KNN graph construction, expecting that the greedy routing achieves polylogarithmic complexity for NN search [22]. They demonstrate experimentally that the number of hops during graph routing is polylogarithmic with respect to the network size on a collection of real-life datasets. However, unlike the ideal navigable small world model, no rigorous theoretical analysis is provided for NSW and HNSW.

We argue that these conceptual models are inadequate for explaining why, in most cases, the search procedure quickly converges to the nearest neighbor, since:

  • For ideal models, the MSNET alone gives no hint as to how graph-based methods generalize to out-of-sample queries, i.e., queries that are not in V. The Delaunay graph supports out-of-sample queries, but does not guarantee that the NN can be found for a query q ∈ V. For example, suppose Algorithm 2 can reach q by traversing one monotonic search path s v_1 v_2 ... v_i q from s to q; we actually have no idea whether v_i is the NN of q at all, because there may be multiple monotonic search paths and the NN of q may lie on some other path. The navigable small world model only gives an intuitive explanation of the existence of short search paths but offers no quantitative justification of why the NN of q can be found.

  • More importantly, the vast majority of graph-based algorithms use approximate KNN graphs or their variants, instead of the aforementioned conceptual models, as the index structure. By limiting the maximum out-degree, approximate KNN graphs are far more sparse than MSNETs and Delaunay graphs, which makes them devoid of the nice theoretical properties, i.e., the existence of a monotonic search path or a Voronoi neighborhood.

To sum up, the existing models fail to explain the practical success of the graph-based methods. We view this as a significant gap between the theory and practice of the graph-based search paradigm. In this paper, we try to explain more quantitatively, from a different perspective, why the approximate KNN graph-based methods work so well in practice.

III Clustering Coefficient of KNN Graph and Its Impact on Search Performance

In graph theory, the clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together, which has been used successfully in its own right across a wide range of applications in complex networks. To name a few, Burt uses local clustering as a probe for the existence of so-called “structural holes” in a network, and Dorogovtsev et al. found that C_i falls off with k_i approximately as k_i^{-1} for certain models of scale-free networks [15], where k_i is the degree of v_i.

There are different ways to define the clustering coefficient. In this paper, we adopt the commonly used definition given by Watts and Strogatz [14]. The local clustering coefficient C_i of a vertex v_i is defined as

C_i = (number of pairs of neighbors of v_i that are connected) / (number of pairs of neighbors of v_i)    (1)

To calculate C_i, we go through all distinct pairs of vertices that are neighbors of v_i in the network, count the number of such pairs that are connected to each other, and divide by the total number of pairs k_i(k_i - 1)/2. Figure 3 illustrates the definition of the local clustering coefficient C_i. The degree of v_i is 4 and there exist two edges between its neighbors. Hence, by definition C_i = 2 / (4 × (4 - 1) / 2) = 1/3.

Figure 3: An illustrative example of local clustering coefficient

The clustering coefficient for the whole network is the average

C = (1/n) Σ_{i=1}^{n} C_i    (2)

The local clustering coefficient C_i of a vertex v_i describes the likelihood that the neighbors of v_i are also connected, i.e., the probability that two randomly selected neighbors of v_i are neighbors of each other. Roughly speaking, it tells how well the neighborhood of the node is connected. If the neighborhood is fully connected, the local clustering coefficient is 1, and a value close to 0 means that there are hardly any connections in the neighborhood. If most of the nodes in the network have a high clustering coefficient, then the network will probably have many edges connecting nodes to each other.
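
As a concrete reference, the sketch below computes C_i and C exactly as in Equations (1) and (2), assuming the graph is given as a list or dict mapping each vertex to the set of its neighbors (an undirected view, as in the Watts-Strogatz definition). The pairwise check costs O(k_i^2) per vertex, so this is meant for illustration rather than for million-scale graphs.

def local_clustering_coefficient(graph, i):
    # C_i per Equation (1): fraction of pairs of i's neighbors that are themselves connected
    nbrs = list(graph[i])
    k_i = len(nbrs)
    if k_i < 2:
        return 0.0                 # vertices with fewer than two neighbors contribute 0 here
    connected_pairs = sum(
        1 for a in range(k_i) for b in range(a + 1, k_i)
        if nbrs[b] in graph[nbrs[a]]
    )
    return connected_pairs / (k_i * (k_i - 1) / 2)

def clustering_coefficient(graph):
    # C per Equation (2): average of C_i over all vertices (vertex ids assumed to be 0..n-1)
    n = len(graph)
    return sum(local_clustering_coefficient(graph, i) for i in range(n)) / n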

The clustering coefficient of a KNN graph depends on K and the intrinsic features of the dataset. Table I lists the clustering coefficients for various K for three typical datasets. As one can see, the larger K is, the greater the clustering coefficient. Moreover, the relative order of the clustering coefficients is stable, independent of K. In the sequel, we will use the clustering coefficient in the case of K = 50 as the default, since K cannot be too large due to the index space constraint.

TABLE I: Clustering coefficient for different K for three typical datasets
Dataset K=20 K=50 K=100 K=150 K=200
Sift 0.1159 0.1249 0.1371 0.1419 0.1468
Glove 0.0881 0.1029 0.1289 0.1358 0.1427
Random 0.00047 0.00074 0.00092 0.00114 0.00139

Our key observation is that the clustering coefficient of the KNN graph is an informative measure of the efficiency of graph-based ANN search methods. In this paper, a KNN graph is defined as a graph in which each vertex v has bi-directional edges with its K nearest neighbors. This model is reasonable because practical graph-based algorithms such as HNSW and NSG always add bi-directional links between a point and its KNN as much as possible [22, 5]. Table II lists the statistics of the datasets, the clustering coefficients of the KNN graph in increasing order, the recalls of top-50 queries and the average number of hops in the graph for a collection of datasets under HNSW and NSG, the two state-of-the-art graph-based algorithms (statistics are collected over 1,000 random queries). For both methods, the maximum out-degree (MOD) is 70 and the parameter L, which controls the number of hops in the graph, is set to 50. Interesting observations can be made as follows:

  • NSG consistently outperforms HNSW in recall with a slightly greater average number of hops, which approximately translates to the number of distance evaluations since the MODs are identical for both algorithms. This observation agrees with the results reported in [5].

  • A more interesting observation is that, with around the same number of average hops in the graph, clustering coefficient and recall are strongly correlated. Particularly, the Pearson correlation coefficients between the clustering coefficient and recall for NSG and HNSW are 0.794 and 0.762, respectively. Besides, independent of the data cardinality and dimensionality, a high clustering coefficient (greater than 0.12) often leads to high recall whereas a low clustering coefficient (lower than 0.1) results in low recall. As an extreme example, the clustering coefficient of the Random dataset is only 0.00074, which makes graph-based algorithms very inefficient. One reason that the recall of NSG is greater than that of HNSW is that the quality of the neighbors in NSG is better than in HNSW, that is, NSG is much closer to an exact KNN graph than HNSW. Please note that the datasets are comprehensive enough in terms of size, dimensionality and data types (images, text, audio and synthetic). Detailed descriptions of these datasets can be found in [24] (https://github.com/DBWangGroupUNSW/).

TABLE II: Clustering coefficient vs. Efficiency
Dataset Size Dim Clustering coefficient HNSW Recall HNSW # of Hops NSG Recall NSG # of Hops
Random 1,000,000 128 0.00074 0.0049 61.3 0.02 64.8
Gist 1,000,000 960 0.080 0.5984 54.9 0.7688 54.7
NUSWIDE 268,643 500 0.096 0.4343 58.0 0.5430 59.8
GLOVE 1,192,514 100 0.103 0.4903 60.3 0.694 56.1
ImageNet 2,340,373 150 0.114 0.6643 53.3 0.8608 55.5
Sift 1,000,000 128 0.125 0.8667 52.0 0.9453 54.2
Sun 79,106 512 0.140 0.8941 51.1 0.9562 52.0
Cifar 50,000 512 0.141 0.9196 51.0 0.9685 51.5
Deep 1,000,000 256 0.144 0.8205 52.7 0.9078 54.7
MillionSong 992,272 420 0.163 0.5984 51.4 0.9608 55.1
Ukbench 1,097,907 128 0.189 0.8893 51.7 0.9545 54.5
Enron 94,987 1369 0.209 0.7599 52.3 0.9421 53.3
Trevi 99,900 4096 0.215 0.8845 51.1 0.9498 52.8
AUDIO 53,387 192 0.253 0.9553 51.0 0.9815 52.5
MNIST 69,000 784 0.286 0.9728 51.7 0.9878 53.2
Notre 332,668 128 0.287 0.9248 52.4 0.9674 53.8

These observations suggest that the clustering coefficient is a promising measure of the efficiency of graph-based algorithms. Intuitively, the higher the clustering coefficient of the KNN graph (K should be as small as possible to reduce the memory footprint and the query cost), the better the graph is connected, which means that graph connectivity has a significant impact on the result quality of ANN search. To have an in-depth understanding of how connectivity affects the search performance, we scrutinized the graph traversal steps of a sample of queries and found that the local connectivity, instead of the global one, is the determining factor. To formally characterize the local connectivity, we propose the notion of the maximum strongly connected neighborhood as follows.

Definition 1.

A directed graph is strongly connected if there is a directed path from every vertex to every other vertex. A strongly connected component (SCC) of a directed graph is a maximal strongly connected subgraph of this graph.

Definition 2.

The k-neighborhood of a point v, denoted by N_k(v), is the set of the k nearest elements of v, i.e., o_1, ..., o_k ∈ V.

Please note that the only requirement is that the k nearest neighbors of v belong to V; v itself may or may not be in V. This definition allows our analysis to support out-of-sample queries.

Definition 3.

A subgraph of G is the k-neighborhood subgraph associated with a vertex v, denoted by G_k(v), if V(G_k(v)) = N_k(v) and E(G_k(v)) ⊆ E(G).

Definition 4.

The maximum strongly connected neighborhood of G_k(v), denoted by C_k(v), is an SCC of G_k(v) such that |C_k(v)| ≥ |C_i| for all i, where the C_i are the SCCs of G_k(v).

Please note that K and k have totally different meanings - K is the number of links per node in the KNN graph and is determined at graph construction time, whereas k is the search parameter and thus may vary according to the users' requirements.

Figure 4 illustrates these definitions using a simple example. o_1 to o_5 are the top-5 NN of query q and there are three undirected edges (equivalent to six directed edges) in q's 5-neighborhood subgraph G_5(q). Three SCCs exist in G_5(q) and the maximum SCC C_5(q) is composed of points o_1, o_2 and o_3.

Figure 4: An illustrative example of maximum strongly connected neighborhood
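
The sketch below puts Definitions 2-4 together: it computes N_k(q) by brute force, builds the k-neighborhood subgraph induced on N_k(q), and returns its largest SCC as C_k(q). The networkx dependency is assumed only for the SCC computation; any SCC routine (e.g., Tarjan's algorithm) would serve equally well.

import numpy as np
import networkx as nx

def max_scc_neighborhood(graph, X, q, k):
    # graph: mapping vertex id -> iterable of out-neighbors (the directed KNN graph)
    # X: n x d data matrix; q: query vector (may or may not be a row of X)

    # Definition 2: N_k(q), the k nearest database points to q
    dist = np.linalg.norm(X - q, axis=1)
    N_k = set(int(i) for i in np.argsort(dist)[:k])

    # Definition 3: the k-neighborhood subgraph G_k(q), here induced on N_k(q)
    G_k = nx.DiGraph()
    G_k.add_nodes_from(N_k)
    for v in N_k:
        for u in graph[v]:
            if u in N_k:
                G_k.add_edge(v, u)

    # Definition 4: C_k(q), the SCC of G_k(q) with the largest size
    return max(nx.strongly_connected_components(G_k), key=len)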

To show the impact of C_k(v) on algorithm performance, Table III lists the 3 largest SCCs for 100 random kNN queries with k = 50 over three typical datasets. As we can see, the ratios of the size of C_k(v) (SCC1) to k are very close to the recalls listed in Table II for the three datasets, respectively. Other state-of-the-art algorithms, such as HNSW, exhibit similar trends.

TABLE III: SCCs of 100 random queries for three typical datasets under NSG
SCC-id Sift size Sift ratio Glove size Glove ratio Random size Random ratio
SCC1 47.8 95.6% 33.8 67.6% 2.4 4.8%
SCC2 0.8 1.6% 1.8 3.6% 1.4 2.8%
SCC3 0.2 0.4% 1.8 3.6% 1.2 2.4%

To eliminate the bias caused by specific graph construction algorithms, we also studied the exact KNN graph and found similar results. Table IV lists, for 100 random top-50 NN queries, the average sizes of the top-3 SCCs over Sift, Glove and Random. In this experiment, we only put a directed link from a point to each of its KNN and no link is added manually in the reverse direction, i.e., the KNN graph is directed. K is set to 50. From Table IV we can see that, independent of the specific graph-based algorithm, the clustering coefficient also has a significant impact on the size of C_k(v).

TABLE IV: SCCs of 100 random queries for exact directed KKNN graph
SCC-id Sift size Sift ratio Glove size Glove ratio Random size Random ratio
SCC1 36.8 73.6% 30.2 60.4% 1.2 2.4%
SCC2 3.6 7.2% 2.4 4.8% 1 2%
SCC3 1.6 3.2% 1.6 3.2% 1 2%

We also examined the undirected KNN graph, where bi-directional links are added manually between a point and its KNN. The trend listed in Table V is very similar to that of Table IV except that the sizes of C_k(v) are larger. This is because more links are added to the graph. Actually, practical graph-based algorithms lie somewhere between the undirected and the directed KNN graph since they always try to add bi-directional links as long as the memory budget allows. Please note that exact KNN graphs are not practical because of the unaffordable construction time and unlimited maximum out-degree, which translates into too much memory cost.

TABLE V: SCCs of 100 random queries for exact undirected KKNN graph
SCC-id Sift size Sift ratio Glove size Glove ratio Random size Random ratio
SCC1 48.8 97.6% 46.2 92.4% 6 12%
SCC2 0.2 0.4% 1 2% 2.2 4.4%
SCC3 0 0% 0.2 0.4% 1.4 2.8%

In a nutshell, all these experiments demonstrate that the clustering coefficient of the KNN graph is an informative measure of the size of the maximum strongly connected neighborhood and of the performance of graph-based algorithms over a specific dataset. Next, we will analyze how C_k(q) affects the recall for a given query q. Particularly, we will show that Algorithm 2, the core algorithmic component of graph search, can effectively reach C_k(q) and identify all kNN in C_k(q). This explains why a greater clustering coefficient and a larger C_k(q) lead to better performance.

IV Two-Phase kNN Search in Graphs

TABLE VI: Statistics of two-phase ANN search for Sift
Query ID # of Hops in P_1 # of Hops in P_2 |C_k(q)| Fraction of C_k(q) visited in P_2 |C̄_k(q)|
1 4 50 38 100% 4
2 4 50 43 100% 1
3 5 50 48 100% 0
4 4 50 50 100% 0
5 5 50 45 100% 0
TABLE VII: Statistics of two-phase ANN search for GLOVE
Query ID # of Hops in P_1 # of Hops in P_2 |C_k(q)| Fraction of C_k(q) visited in P_2 |C̄_k(q)|
1 3 52 27 100% 7
2 1 59 28 100% 0
3 11 39 20 100% 8
4 2 49 38 100% 4
5 2 48 24 100% 6
TABLE VIII: Statistics of two-phase ANN search for Random
Query ID # of Hops in P_1 # of Hops in P_2 |C_k(q)| Fraction of C_k(q) visited in P_2 |C̄_k(q)|
1 79 7 1 100% 0
2 83 0 1 0% 0
3 50 47 1 100% 0
4 46 25 2 100% 0
5 56 0 1 0% 0

The common wisdom about Algorithm 2 is as follows. Starting from the entry vertex s, which is chosen at random or using some auxiliary method, Algorithm 2 finds a directed path from s to the query q, hoping that the NN of q is identified through the walk. Since only local information, i.e., the adjacent vertices of the visited vertices, is used, this class of algorithms is termed decentralized algorithms [12]. Particularly, for ANN search, Algorithm 2 first follows the out-edges of s to get its immediate neighbors, and then examines the distances from these neighbors to q. The one with the minimum distance to q is selected as the next base vertex for iteration. The same procedure is repeated at each step of the traversal until Algorithm 2 reaches a local optimum, namely, the immediate neighbors of the base vertex do not contain a vertex that is closer to q than the base vertex itself. Backtracking is used to jump out of the local optimum and increase the odds of finding the true NN. Recall that we call this search paradigm the multiple path search model.

Different from the traditional point of view, we observe that Algorithm 2 is actually composed of two phases. In the first phase, the algorithm starts with an initial point, walks the graph and encounters a point within C_k(q). In the second phase, the algorithm traverses C_k(q) and a small number of points not in C_k(q). Figure 5 depicts the two-phase search procedure. Theorem 1 proves that Algorithm 2 is guaranteed to find all points in C_k(q) under a mild condition.

Figure 5: Two-phase ANN graph search
Theorem 1.

Algorithm 2 is guaranteed to visit all points in C_k(q) starting from any point in C_k(q).

Proof.

We know that a directed graph is strongly connected iff all the vertices of the graph are part of some cycle. Please note that this cycle need not be a Hamiltonian cycle. Suppose all vertices not in C_k(q) but adjacent to vertices in C_k(q) are farther from q than all vertices in C_k(q). Without loss of generality, suppose the first vertex visited is v_1; then Algorithm 2 will visit all vertices following the cycle and push every vertex into cand and result_L. Since all vertices in C_k(q) are closer to q than the other vertices, the distance of the bottom element of result_L will always be greater than that of the top element in cand until all vertices in C_k(q) are visited (all elements in result_L are initialized to infinity at the beginning). Please note that in each loop cand pops the element that has been pushed into result_L, which guarantees that Algorithm 2 always terminates. ∎

Theorem 1 suggests a different perspective for understanding the graph-based methods. Rather than searching a single path (without backtracking) or multiple paths (with backtracking) in the graph, the search algorithm actually traverses a strongly connected neighborhood around the query. In other words, a high-quality C_k(q), together with Algorithm 2, delivers the salient performance. The analysis in Section III reveals that the quality of C_k(q) is data dependent and closely related to the clustering coefficient of the KNN graph. Therefore, there exists a significant performance disparity across datasets, and we can use the clustering coefficient of the KNN graph as a meaningful measure of the efficiency of the graph-based methods.

It is possible that there exist a few vertices adjacent to vertices in C_k(q) which are not in C_k(q) but are closer to q than some vertices in C_k(q). In this case, the algorithm may also visit such vertices, and the answer quality will be higher than that obtained by traversing C_k(q) alone, since closer vertices outside C_k(q) are visited as well.

The probability of the search algorithm getting into C_k(q) increases exponentially with L', the number of times the search is trapped in a local optimum and backtracks to a distant point before entering C_k(q), and is expressed as follows, where p_i is the probability of getting into C_k(q) along a single path.

P = 1 - ∏_{i=1}^{L'} (1 - p_i)    (3)
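
As a purely hypothetical numeric illustration of Equation (3) (the p_i are not measured quantities): if every restart had the same chance p_i = p = 0.3 of entering C_k(q), then P = 1 - (1 - 0.3)^{L'}, which already exceeds 0.97 after L' = 10 restarts.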

The rigorous calculation of P is infeasible. Empirically, for datasets with relatively large clustering coefficients we observed that (1) Algorithm 2 can quickly reach C_k(q), and (2) the path length of the first phase is far shorter than that of the second phase. Table VI, Table VII and Table VIII list the numbers of hops in Phase 1 and Phase 2, the size of C_k(q), the fraction of points in C_k(q) that are visited in Phase 2 and the number of true top-k NN not in C_k(q) that are found during Phase 2 (denoted by |C̄_k(q)|) for Sift, Glove and Random, respectively. NSG is used and the statistics of five random queries are reported. Please note that HNSW and the exact KNN graph exhibit similar trends, so we do not report their results. Several interesting observations can be made:

  • As we can see, independent of the dataset, the two-phase search model is applicable to all queries listed. As proved in Theorem 1, once the search algorithm enters Phase 2, all points in C_k(q) will be visited, which demonstrates the importance of the quality of C_k(q).

  • Besides the true top-k NN in C_k(q), other true top-k NN may also be visited during Phase 2, especially for Glove where the size of C_k(q) is relatively small. This is mainly because the search algorithm jumps into a smaller SCC or visits kNN that have only a single directed edge to the maximum SCC.

  • For Sift and Glove, where the size of C_k(q) is far greater than that of Random, the second phase dominates the search cost and the algorithm jumps into C_k(q) in a very small number of steps. In contrast, it is very hard for the algorithm to find a true top-k NN for Random since the size of C_k(q) is too small (in most cases equal to 1). For example, q_2 and q_5 do not enter Phase 2 and did not find any true NN. As a result, the recall for Random is very low.

Figure 6: An example of two-phase ANN graph search (best viewed in color)

To build the reader's intuition, Figure 6 illustrates the search procedure of a top-10 query on the Sift dataset with NSG. The green point is the query and the red points denote the true NNs in the maximum SCC, which are strongly connected. Dashed lines in blue with single or double arrows represent the directed or bi-directional edges between points. The solid arrowed lines in yellow depict the search path during kNN search. As we can see, starting from the entry point, the algorithm jumps into the maximum SCC in three steps. After traversing the maximum SCC, which consists of six true NNs, it continues the search by visiting one true NN (in black) and four other points before the termination condition is met. Since k is small in this example, the size of the maximum SCC is not that large. This can be informally explained by the fact that, under the same edge connection probability, the connectivity increases as the number of vertices becomes large, according to random network theory [15].

The case of small k: Users may be interested in only a small number of nearest neighbors of q, say k ranging from 1 to 10. In this case, the size and quality of C_k(q) may not be good enough to achieve high recall. To get precise results, L is often set greater than k, say 50-200. The net effect is that the search algorithm actually visits C_L(q), which contains most of the top-L NN if the clustering coefficient is large enough; Algorithm 2 then identifies the best k results and outputs them.

V Related Work

V-A Measures for the Difficulty of Nearest Neighbor Search

The difficulty of (approximate) NN search in a given dataset has drawn much attention in recent years. Beyer et al. and Francois et al. show that NN search will be meaningless when the number of dimensions goes to infinity [36, 37], respectively. However, they did not provide a non-asymptotic analysis for the case where the number of dimensions is finite. Moreover, the effect of other crucial properties such as the sparsity of data vectors has not been studied. To the best of our knowledge, He et al. proposed the first concrete measure, called Relative Contrast (RC), to evaluate the influence of several data characteristics such as dimensionality, sparsity and dataset size simultaneously on the difficulty of NN search [38]. They present a theoretical analysis to prove how RC determines the complexity of Locality Sensitive Hashing, a popular approximate NN search method. Relative Contrast also provides an explanation for a family of heuristic hashing algorithms with good practical performance based on PCA. However, no evidence is given that RC can be used to explain the success of graph-based NN search methods directly.

Identifying the intrinsic dimensionality (ID) of datasets has been studied for decades given its importance in machine learning, databases and data mining. Recently, local ID has gained much attention since it is very useful when data is composed of heterogeneous manifolds. In addition to applications in manifold learning, measures of local ID have been used in the context of evaluating the difficulty of NN search [39]. Several local intrinsic dimensionality models have been proposed, such as the expansion dimension (ED) [40], the generalized expansion dimension (GED) [41], the minimum neighbor distance (MiND) [42], and the local continuous intrinsic dimension (LID) [43]. While these measures have been shown to be useful in their own right, none of them is applicable to explaining the salient performance of the graph-based methods.

V-B A Brief Review of the Existing ANN Search Methods

Approximate nearest neighbor search (ANNS) has been a hot topic for decades; it provides fundamental support for many applications in data mining, databases and information retrieval [44, 45, 46]. There is a large amount of significant literature on algorithms for approximate nearest neighbor search, which are mainly divided into the following categories: tree-structure based approaches, hashing-based approaches, quantization-based approaches, and graph-based approaches.

V-B1 tree-structure based approaches

Hierarchical structure (tree) based methods offer a natural way to continuously partition a dataset into discrete regions at multiple scales, such as the KD-tree [47], R-tree [48], and SR-tree [49]. These methods perform very well when the dimensionality of the data is relatively low. However, they have been shown to be inefficient when the dimensionality of the data is high. It has been shown in [50] that when the dimensionality exceeds about 10, existing indexing data structures based on space partitioning are slower than the brute-force, linear-scan approach. Many new hierarchical-structure-based methods [51] have been presented to address this limitation.

V-B2 hashing-based approaches

Among the approximate NN search algorithms, Locality Sensitive Hashing is the most widely used one due to its excellent theoretical guarantees and empirical performance. E2LSH, the classical LSH implementation for the ℓ_2 norm, cannot solve the c-ANN search problem directly. In practice, one has to either assume there exists a “magical” radius r, which can lead to arbitrarily bad outputs, or use multiple hash tables tailored for different radii, which may lead to prohibitively large space consumption in indexing. To reduce the storage cost, LSB-Forest [52] and C2LSH [53] use the so-called virtual rehashing technique, implicitly or explicitly, to avoid building physical hash tables for each search radius. The index size of LSB-Forest is far greater than that of C2LSH because the former ensures that the worst-case I/O cost is sub-linear in both n and d whereas the latter has no such guarantee - it only bounds the number of candidates by some constant but ignores the overhead of index access.

Based on the idea of query-aware hashing, the two state-of-the-art algorithms QALSH and SRS further improve the efficiency over C2LSH by using different index structures and search methods, respectively. SRS uses an m-dimensional R-tree (typically m ≤ 10) to store the ⟨g(o), oid⟩ pair for each point o and transforms the c-ANN search in the d-dimensional space into a range query in the m-dimensional projection space. The rationale is that the probability that a point o is the NN of q decreases as Δ_m(o) increases, where Δ_m(o) = ‖g_m(q) − g_m(o)‖. During c-ANN search, points are accessed in increasing order of their Δ_m(o).
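
The core idea can be sketched in a few lines of Python; this is only a hedged illustration of the projection-and-ranking step, not the actual SRS implementation, which stores the projections in an R-tree and applies an early-termination test.

import numpy as np

def srs_candidate_order(X, q, m=6, seed=0):
    # Rank points by the m-dimensional projected distance Delta_m(o) = ||g(q) - g(o)||;
    # the R-tree index and the stopping condition of the real algorithm are omitted.
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((X.shape[1], m))   # m random Gaussian projection vectors
    proj_X, proj_q = X @ G, q @ G
    delta = np.linalg.norm(proj_X - proj_q, axis=1)
    return np.argsort(delta)                   # access order for c-ANN candidates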

Motivated by the observation that the optimal ℓ_p metric is application-dependent, LazyLSH [54] is proposed to solve the NN search problem for the fractional distance metrics, i.e., ℓ_p metrics (0 < p < 1), with a single index. FALCONN is the state-of-the-art LSH scheme for the angular distance, both theoretically and practically [55]. Except for E2LSH and FALCONN, the other algorithms are disk-based and thus can handle datasets that do not fit into memory.

All of the aforementioned LSH algorithms provide probability guarantees on the result quality (recall and/or precision). To achieve better efficiency, many LSH extensions such as Multi-probe LSH [56], SK-LSH [57], LSH-forest [58] and Selective hashing [59] use heuristics to access more plausible buckets or re-organize datasets, and do not ensure any LSH-like theoretical guarantee.

V-B3 quantization-based approaches

The most common quantization-based method is product quantization (PQ) [2]. It seeks to perform a dimension reduction similar to that of hashing algorithms, but in a way that better retains information about the relative distances between points in the original vector space. Formally, a quantizer is a function q mapping a D-dimensional vector x ∈ ℝ^D to a vector q(x) ∈ C = {c_i; i ∈ I}, where the index set I is finite: I = 0, ..., k−1. The reproduction values c_i are called centroids. The set V_i of vectors mapped to a given index i is referred to as a cell, and is defined as

V_i ≜ {x ∈ ℝ^D : q(x) = c_i}

The k cells of a quantizer form a partition of ℝ^D, so all the vectors lying in the same cell V_i are reconstructed by the same centroid c_i. Due to the huge number of samples required and the complexity of learning the quantizer, PQ uses m distinct quantizers to quantize the subvectors separately. An input vector is divided into m distinct subvectors u_j, 1 ≤ j ≤ m. The dimension of each subvector is D* = D/m. An input vector x is mapped as follows:

u_1(x) = (x_1, ..., x_{D*}), ..., u_m(x) = (x_{D−D*+1}, ..., x_D)  →  q_1(u_1(x)), ..., q_m(u_m(x))

where q_j is a low-complexity quantizer associated with the j-th subvector. The codebook is defined as the Cartesian product

C = C_1 × ... × C_m

and a centroid of this set is the concatenation of the centroids of the m subquantizers. All subquantizers have the same finite number k* of reproduction values, so the total number of centroids is k = (k*)^m.

After applying PQ, all database vectors are replaced by their reproduction values. To speed up the query, PQ uses a look-up table to directly obtain the distance between the reproduction values and the query vector. Two methods are proposed to compute an approximate Euclidean distance between these vectors: the so-called Asymmetric Distance Computation (ADC) and the Symmetric Distance Computation (SDC). See Figure 7 for an illustration. We take ADC as an example.

Figure 7: Two methods to compute an approximate Euclidean distance

The database vector y is represented by q(y), but the query x is not encoded. The distance d(x, y) is approximated by the distance d(x, q(y)), which is computed using the decomposition

d(x, q(y)) = √( Σ_j d(u_j(x), q_j(u_j(y)))^2 ),

where the squared distances d(u_j(x), c_{j,i})^2, j = 1, ..., m, i = 1, ..., k*, are computed before the search. The calculation method of SDC is similar to that of ADC, but the query vector x is represented by q(x). SDC limits the memory usage associated with the queries, while ADC has a lower distance distortion for a similar complexity.

PQ offers three attractive properties: (1) PQ compresses an input vector into a short code (e.g., 64 bits), which enables it to handle typically one billion data points in memory; (2) the approximate distance between a raw vector and a compressed PQ code is computed efficiently (the so-called asymmetric distance computation (ADC) and symmetric distance computation (SDC)), which is a good estimation of the original Euclidean distance; and (3) the data structure and coding algorithm are simple, which allows them to hybridize with other indexing structures. Because these methods avoid distance calculations on the original data vectors, they incur a certain loss of accuracy. When the recall is required to be close to 1.0, the required length of the candidate list is close to the size of the dataset. Many quantization-based methods try to reduce the quantization error to improve accuracy, such as Optimized Product Quantization (OPQ) [60] and Tree Quantization (TQ) [61].
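
Below is a minimal Python sketch of PQ encoding and ADC under the notation above, assuming the per-subspace codebooks (a list of m arrays with k* centroids each, e.g., obtained from k-means) have already been trained. A real implementation would precompute the m x k* squared distances d(u_j(x), c_{j,i})^2 once per query as the look-up table mentioned above, instead of recomputing them per database vector.

import numpy as np

def pq_encode(x, codebooks):
    # Encode vector x into m sub-codes; codebooks[j] has shape (k_star, D_star)
    m = len(codebooks)
    D_star = codebooks[0].shape[1]
    codes = np.empty(m, dtype=np.int32)
    for j in range(m):
        u_j = x[j * D_star:(j + 1) * D_star]                    # j-th subvector u_j(x)
        codes[j] = np.argmin(np.linalg.norm(codebooks[j] - u_j, axis=1))
    return codes

def adc_distance(x, codes, codebooks):
    # Asymmetric distance d(x, q(y)): the query x stays raw, y is given by its PQ code
    m = len(codebooks)
    D_star = codebooks[0].shape[1]
    sq = 0.0
    for j in range(m):
        u_j = x[j * D_star:(j + 1) * D_star]
        c = codebooks[j][codes[j]]                              # reproduction value q_j(u_j(y))
        sq += np.sum((u_j - c) ** 2)
    return np.sqrt(sq)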

V-B4 graph-based approaches

Recently, graph-based methods have drawn considerable attention, such as NSG [5], HNSW [22], Efanna [62], and FANNG [27]. Graph-based methods construct a kNN graph offline, which can be regarded as a big network in high-dimensional space. However, the construction cost of the exact kNN graph grows rapidly with the dataset size. Many researchers therefore turn to building an approximate kNN graph, but this is still time consuming. There are two main types of graphs: directed graphs and undirected graphs.

At the online search stage, they all use the greedy search algorithm or its variants. These methods require an initial point to start from, and the easiest way is to choose it randomly. During the search, the algorithm can quickly converge from the initial point to the neighborhood of the query point. One problem with this method is that it easily converges to local optima, resulting in low recall. One way to alleviate this problem is to provide a better initial candidate set for a query point: instead of random selection, one can use the Navigating node (the approximate medoid of the dataset) and its neighbors as the candidates. Another method is to try to make the constructed graph monotonic. The edge selection strategy of MRNG, first proposed in [5], can ensure that the graph is a Monotonic Search Network (MSNET). Ideally, the search path iterates from the starting point until it reaches the query point, which means that no backtracking occurs during the search.

Because the construction of the graph greatly affects search performance, many researchers focus on constructing index graphs. The fundamental issue is how to choose the neighbors of nodes in the graph. We introduce two state-of-the-art graph neighbor selection strategies: Relative Neighborhood Graphs (RNG) [17] and Monotonic Relative Neighborhood Graphs (MRNG) [5]. Formally, given two points p and q in ℝ^D, B(p, δ(p,q)) represents an open sphere with center p and radius δ(p,q). The lune_pq is defined as:

lune_pq = B(p, δ(p,q)) ∩ B(q, δ(p,q))

FANNG [27] and HNSW [22] adopt the RNG's edge selection strategy to construct the index. RNG is an edge selection strategy based on an undirected graph, and it selects edges by checking whether there is a point in the intersection of two open spheres. In Figure 8(a), node p has a set of neighbor candidates for selection. If there is no node in lune_pr, then p and r are linked. Otherwise, there is no edge between p and r. Because r ∈ lune_ps, s ∈ lune_pt, t ∈ lune_pu, and u ∈ lune_pq, there are no edges between p and s, t, u, q. Although RNG can reduce the out-degree to a constant C_d + o(1), it does not have sufficient edges to be an MSNET. NSG adopts the MRNG's edge selection strategy to construct the index, which is a directed graph. In Figure 8(b), p and r are linked to each other because there is no node in lune_pr. p and s are not linked because p and r are linked and r ∈ lune_ps. However, p and t are linked because p and s are not linked and s ∈ lune_pt. The graph constructed by MRNG is an MSNET. The common purpose of these two graph construction methods is to reduce the average out-degree of the graph so as to make the graph sparse and reduce the search complexity. These edge selection strategies have achieved attractive results, which enables many graph-based methods, such as Efanna [62], KGraph, HNSW and NSG, to perform well in terms of search time.

(a) RNG
(b) MRNG
Figure 8: Two state-of-the-art edge selection strategies
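
The RNG-style pruning described above can be sketched as follows, in the approximate form commonly used by heuristic implementations: candidates are scanned in increasing distance from p, and a candidate r is kept only if no already-kept (hence closer) neighbor s lies in lune_pr, i.e., no kept s satisfies d(s, r) < d(p, r). The exact RNG/MRNG definitions test against all points rather than only the kept ones.

import numpy as np

def rng_select(p, candidates, X):
    # candidates: ids of p's KNN sorted by increasing distance to p; X: data matrix
    selected = []
    for r in candidates:
        d_pr = np.linalg.norm(X[p] - X[r])
        # r is occluded if some kept neighbor s lies inside lune_pr
        occluded = any(np.linalg.norm(X[s] - X[r]) < d_pr for s in selected)
        if not occluded:
            selected.append(r)
    return selected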

VI Conclusion

This paper takes a first step toward explaining why graph-based search algorithms work so well in practice and suggests that the clustering coefficient of the KNN graph is an important measure of the efficiency of these algorithms. A detailed analysis is also conducted to show how the clustering coefficient affects the local structure of KNN graphs. A few open problems still exist. For example, formal analysis under some simplified data model is important to gain a more rigorous understanding of the graph search procedure.

Acknowledgements

The work reported in this paper is partially supported by NSFC under grant number 61370205, NSF of Xinjiang Key Laboratory under grant number 2019D04024.

References

  • [1] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in SoCG, 2004, pp. 253–262.
  • [2] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, 2011.
  • [3] M. Douze, A. Sablayrolles, and H. Jégou, “Link and code: Fast indexing with graphs and compact regression codes,” in CVPR, 2018, pp. 3646–3654.
  • [4] Y. Dong, P. Indyk, I. P. Razenshteyn, and T. Wagner, “Learning space partitions for nearest neighbor search,” in ICLR, 2020.
  • [5] C. Fu, C. Xiang, C. Wang, and D. Cai, “Fast approximate nearest neighbor search with the navigating spreading-out graph,” Proc. VLDB Endow., vol. 12, no. 5, pp. 461–474, 2019.
  • [6] T. Laarhoven, “Graph-based time-space trade-offs for approximate near neighbors,” in SoCG, B. Speckmann and C. D. Tóth, Eds., pp. 57:1–57:14.
  • [7] L. Prokhorenkova, “Graph-based nearest neighbor search: From practice to theory,” CoRR, vol. abs/1907.00845, 2019.
  • [8] A. Andoni, T. Laarhoven, I. P. Razenshteyn, and E. Waingarten, “Optimal hashing-based time-space trade-offs for approximate near neighbors,” in SODA, 2017, pp. 47–66.
  • [9] D. W. Dearholt, N. Gonzales, and G. Kurup, “Monotonic search networks for computer vision databases,” in Twenty-Second Asilomar Conference on Signals, Systems and Computers, vol. 2, 1988, pp. 548–553.
  • [10] T. B. Sebastian and B. B. Kimia, “Metric-based shape retrieval in large databases,” in ICPR, 2002, pp. 291–296.
  • [11] S. Morozov and A. Babenko, “Non-metric similarity graphs for maximum inner product search,” in NIPS, 2018, pp. 4726–4735.
  • [12] J. M. Kleinberg, “Navigation in a small world,” Nature, vol. 406, no. 6798, p. 845, 2000.
  • [13] Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Approximate nearest neighbor algorithm based on navigable small world graphs,” Inf. Syst., vol. 45, pp. 61–68, 2014.
  • [14] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, pp. 440–442, 1998.
  • [15] M. Newman, Networks: An Introduction.   Oxford University Press, 2010.
  • [16] S. Arya and D. M. Mount, “Approximate nearest neighbor queries in fixed dimensions,” in SODA, 1993, pp. 271–280.
  • [17] J. W. Jaromczyk and G. T. Toussaint, “Relative neighborhood graphs and their relatives,” Proceedings of the IEEE, vol. 80, no. 9, pp. 1502–1517, 1992.
  • [18] G. Navarro, “Searching in metric spaces by spatial approximation,” in SPIRE/CRIWG, 1999, pp. 141–148.
  • [19] F. Aurenhammer, “Voronoi diagrams - A survey of a fundamental geometric data structure,” ACM Comput. Surv., vol. 23, no. 3, pp. 345–405, 1991.
  • [20] R. Paredes and E. Chávez, “Using the k-nearest neighbor graph for proximity searching in metric spaces,” in SPIRE, 2005, pp. 127–138.
  • [21] “KGraph.” [Online]. Available: https://github.com/aaalgo/kgraph
  • [22] Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 4, pp. 824–836, 2020.
  • [23] J. M. Kleinberg, “The small-world phenomenon: an algorithmic perspective,” in STOC, 2000, pp. 163–170.
  • [24] W. Li, Y. Zhang, Y. Sun, W. Wang, W. Zhang, and X. Lin, “Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement (v1.0),” CoRR, vol. abs/1610.02455, 2016.
  • [25] K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang, “Fast approximate nearest-neighbor search with k-nearest neighbor graph,” in IJCAI, 2011, pp. 1312–1317.
  • [26] M. Aumüller, E. Bernhardsson, and A. J. Faithfull, “Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms,” Inf. Syst., vol. 87, 2020.
  • [27] B. Harwood and T. Drummond, “FANNG: fast approximate nearest neighbour graphs,” in CVPR, 2016, pp. 5713–5722.
  • [28] K. Aoyama, K. Saito, H. Sawada, and N. Ueda, “Fast approximate similarity search based on degree-reduced neighborhood graphs,” in SIGKDD, 2011, pp. 1055–1063.
  • [29] M. Iwasaki and D. Miyazaki, “Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data,” CoRR, vol. abs/1810.07355, 2018.
  • [30] M. Iwasaki, “Pruned bi-directed k-nearest neighbor graph for proximity search,” in SISAP, vol. 9939, 2016, pp. 20–33.
  • [31] D. Baranchuk and A. Babenko, “Towards similarity graphs constructed by deep reinforcement learning,” CoRR, vol. abs/1911.12122, 2019.
  • [32] D. Baranchuk, D. Persiyanov, A. Sinitsin, and A. Babenko, “Learning to route in similarity graphs,” in ICML, vol. 97, 2019, pp. 475–484.
  • [33] Z. Zhou, S. Tan, Z. Xu, and P. Li, “Möbius transformation for fast inner product search on graph,” in NeurIPS, 2019, pp. 8216–8227.
  • [34] J. Wang and S. Li, “Query-driven iterated neighborhood graph search for large scale indexing,” in ACM MM, 2012, pp. 179–188.
  • [35] F. P. Preparata and M. I. Shamos, Computational Geometry - An Introduction.   Springer, 1985.
  • [36] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is ”nearest neighbor” meaningful?” in ICDT, 1999, pp. 217–235.
  • [37] D. François, V. Wertz, and M. Verleysen, “The concentration of fractional distances,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 7, pp. 873–886, 2007.
  • [38] J. He, S. Kumar, and S.-F. Chang, “On the difficulty of nearest neighbor search,” in ICML, 2012, pp. 1127–1134.
  • [39] M. E. Houle and M. Nett, “Rank-based similarity search: Reducing the dimensional dependence,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 136–150, 2015.
  • [40] D. R. Karger and M. Ruhl, “Finding nearest neighbors in growth-restricted metrics,” in STOC, 2002, pp. 741–750.
  • [41] M. E. Houle, H. Kashima, and M. Nett, “Generalized expansion dimension,” in ICDM Workshops, 2012, pp. 587–594.
  • [42] A. Rozza, G. Lombardi, C. Ceruti, E. Casiraghi, and P. Campadelli, “Novel high intrinsic dimensionality estimators,” Mach. Learn., vol. 89, no. 1-2, pp. 37–65, 2012.
  • [43] M. E. Houle, “Dimensionality, discriminability, density and distance distributions,” in ICDM Workshops, 2013, pp. 468–473.
  • [44] W. G. Aref, A. C. Catlin, J. Fan, A. K. Elmagarmid, M. A. Hammad, I. F. Ilyas, M. S. Marzouk, and X. Zhu, “A video database management system for advancing video database research,” in Multimedia Information Systems, 2002, pp. 8–17.
  • [45] R. Fagin, R. Kumar, and D. Sivakumar, “Efficient similarity search and classification via rank aggregation,” in SIGMOD, 2003, pp. 301–312.
  • [46] Y. Ke, R. Sukthankar, and L. Huston, “An efficient parts-based near-duplicate and sub-image retrieval system,” in ACM Multimedia, 2004, pp. 869–876.
  • [47] J. L. Bentley, “K-d trees for semidynamic point sets,” in SoCG, 1990, pp. 187–197.
  • [48] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in SIGMOD, 1984, pp. 47–57.
  • [49] N. Katayama and S. Satoh, “The sr-tree: An index structure for high-dimensional nearest neighbor queries,” in SIGMOD.   ACM Press, 1997, pp. 369–380.
  • [50] R. Weber, H.-J. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in VLDB.   Morgan Kaufmann, 1998, pp. 194–205.
  • [51] P. Ram and K. Sinha, “Revisiting kd-tree for nearest neighbor search,” in KDD, 2019, pp. 1378–1388.
  • [52] Y. Tao, K. Yi, C. Sheng, and P. Kalnis, “Quality and efficiency in high dimensional nearest neighbor search,” in SIGMOD, 2009, pp. 563–576.
  • [53] J. Gan, J. Feng, Q. Fang, and W. Ng, “Locality-sensitive hashing scheme based on dynamic collision counting,” in SIGMOD, 2012, pp. 541–552.
  • [54] Y. Zheng, Q. Guo, A. K. H. Tung, and S. Wu, “Lazylsh: Approximate nearest neighbor search for multiple distance functions with a single index,” in SIGMOD, 2016, pp. 2023–2037.
  • [55] A. Andoni, P. Indyk, T. Laarhoven, I. P. Razenshteyn, and L. Schmidt, “Practical and optimal LSH for angular distance,” in NIPS, 2015, pp. 1225–1233.
  • [56] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe lsh: Efficient indexing for high-dimensional similarity search,” in VLDB, 2007, pp. 950–961.
  • [57] Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen, “SK-LSH: an efficient index structure for approximate nearest neighbor search,” PVLDB, vol. 7, no. 9, pp. 745–756, 2014.
  • [58] M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” in WWW, 2005, pp. 651–660.
  • [59] J. Gao, H. V. Jagadish, B. C. Ooi, and S. Wang, “Selective hashing: Closing the gap between radius search and k-nn search,” in SIGKDD, 2015, pp. 349–358.
  • [60] T. Ge, K. He, Q. Ke, and J. Sun, “Optimized product quantization for approximate nearest neighbor search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2946–2953.
  • [61] A. Babenko and V. Lempitsky, “Tree quantization for large-scale similarity search and classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4240–4248.
  • [62] C. Fu and D. Cai, “Efanna: An extremely fast approximate nearest neighbor search algorithm based on knn graph,” arXiv preprint arXiv:1609.07228, 2016.