
A Note on Graph-Based Nearest Neighbor Search

Hongya Wang    Zhizheng Wang   Wei Wang   Yingyuan Xiao§   Zeng Zhao   Kaixiang Yang
School of Computer Science and Technology, Donghua University, China
‡ School of Computer Science and Engineering, University of New South Wales, Australia
§ School of Computer Science and Engineering, Tianjin University of Technology, China
hywang@dhu.edu.cn
Abstract

Nearest neighbor search has found numerous applications in machine learning, data mining and massive data processing systems. The past few years have witnessed the popularity of the graph-based nearest neighbor search paradigm because of its superiority over the space-partitioning algorithms. While many empirical studies demonstrate the efficiency of graph-based algorithms, not much attention has been paid to a more fundamental question: why do graph-based algorithms work so well in practice? And which data property affects the efficiency, and how? In this paper, we try to answer these questions. Our insight is that “the probability that the neighbors of a point o tend to also be neighbors in the KNN graph” is a crucial data property for query efficiency. For a given dataset, such a property can be quantitatively measured by the clustering coefficient of the KNN graph.

To show how the clustering coefficient affects the performance, we identify that, instead of the global connectivity, the local connectivity around a given query q has a more direct impact on recall. Specifically, we observe that a high clustering coefficient makes most of the k nearest neighbors of q sit in a maximum strongly connected component (SCC) of the graph. From the algorithmic point of view, we show that the search procedure is actually composed of two phases - one outside the maximum SCC and the other inside it, which is different from the widely accepted single or multiple path search models. We prove that the commonly used graph-based search algorithm is guaranteed to traverse the maximum SCC once it visits any point in it. Our analysis reveals that a high clustering coefficient leads to a large maximum SCC, and thus provides good answer quality with the help of the two-phase search procedure. Extensive empirical results over a comprehensive collection of datasets validate our findings.

I Introduction

Nearest neighbor search among database vectors for a query is a key building block for problems such as large-scale image search and information retrieval, recommendation, entity resolution, and sequence matching. As database size and vector dimensionality increase, exact nearest neighbor search becomes expensive and is often considered impractical due to its long search latency. To reduce the search cost, approximate nearest neighbor (ANN) search is used, which provides a better tradeoff among accuracy, latency, and memory overhead.

Roughly speaking, the existing ANN methods can be classified into space-partitioning algorithms and graph-based ones (note that this categorization is not fixed or unique). The space-partitioning methods further fall into three categories - tree-based, product quantization (PQ) and locality sensitive hashing (LSH) [1, 2]. Recent empirical studies show that graph-based ANN search algorithms are more efficient than space-partitioning methods such as PQ and LSH, and thus have been adopted by many commercial applications at Facebook, Microsoft, Taobao, etc. [3, 4, 5].

While a lot of empirical studies validate the efficiency of graph-based ANN search algorithms, not much attention has been paid to a more fundamental question: why are graph-based ANN search algorithms so efficient? And which data property affects the efficiency, and how? Two recent papers analyze the asymptotic performance of graph-based methods for datasets uniformly distributed on a d-dimensional Euclidean sphere [6, 7]. The worst-case analysis shows that the asymptotic behavior of a greedy graph-based search only matches the optimal hash-based algorithm [8], which is far worse than the practical performance of graph-based algorithms and thus cannot answer these questions.

A few conceptual graph models such as the Monotonic Search Network Model [9], Delaunay Graph Model [10, 11] and Navigable Small World Model [12, 13] have been proposed to inspire the construction of ANN search graphs. As will be discussed in Section II, none of them can explain the success of graph-based algorithms either. Actually, the vast majority (if not all) of practical ANN search graphs use approximate KNN graphs as the index structure instead of the conceptual models, due to time or space constraints, and are thus devoid of the theoretical features provided by these models.

In this paper, we argue that, for a specific dataset, the clustering coefficient [14] of its KNN graph is an important indicator of how efficiently graph-based algorithms work. The clustering coefficient of the KNN graph is the probability that the neighbors of a point are also neighbors of each other. Comprehensive experimental results reveal that the higher the clustering coefficient, the more efficiently graph-based algorithms perform. Since the clustering coefficient is data dependent, graph-based algorithms perform rather poorly on datasets such as Random with a very small clustering coefficient, whereas they do well on datasets such as Sift and Audio with larger ones.

We also study how the clustering coefficient affects the performance. The analysis of complex networks indicates that a large clustering coefficient leads to high global connectivity [15]. Our insight is that, instead of the global connectivity, the local graph structure is more crucial for high ANN search recall. Particularly, we observed that, for datasets with a large clustering coefficient, most of the kNN of a given query (which may or may not be in the dataset) lie in the maximum strongly connected component (SCC) of the subgraph composed of these kNN. Moreover, we show that the search procedure actually consists of two phases, one outside the maximum SCC and one inside it, in contrast to the common wisdom of single or multiple path search models. Then, we prove that the commonly used graph search algorithm is guaranteed to visit all kNN in the maximum SCC under a mild condition, which suggests that the size of the maximum SCC determines the answer quality of kNN search. This sheds light on the strong positive correlation between the clustering coefficient and the result quality, and thus answers the two aforementioned questions.

To sum up, the main contributions of this paper are:

  • We introduce a new quantitative measure, the Clustering Coefficient of the KNN graph, for the difficulty of graph-based nearest neighbor search. To the best of our knowledge, this is the first measure that can explain the efficiency of this class of important ANN search algorithms.

  • The conceptual models such as MSNETs and Delaunay graphs claim that the NN can be found by walking along a single path. Instead, we found that the search procedure is actually composed of two phases. In the second phase the algorithm traverses the maximum SCC of the kNN of a query, whose size is a determining factor for answer quality, i.e., recall.

  • We prove that the graph-based search algorithm is guaranteed to visit all points in the maximum SCC once it enters it. Extensive empirical results over a comprehensive collection of datasets validate our observations.

We believe that this note could provide a different perspective on graph-based ANN search methods and might inspire more interesting work along this line.

II Graph-Based Nearest Neighbor Search

II-A Graph Construction and Search Algorithms

In the sequel, we will use nodes, points and vertices interchangeably without ambiguity. A directed graph G = (V, E) consists of a nonempty vertex set V and a set E of edges such that each edge e ∈ E is assigned to an ordered pair (u, v) with u, v ∈ V. Most graph-based algorithms build directed graphs to navigate the kNN search. To the best of our knowledge, the idea of using graphs to process ANN search can be traced back to the Pathfinder project, which was initiated to support efficient search, classification, and clustering in computer vision systems [9]. In this project, Dearholt et al. designed and implemented the monotonic search network to retrieve the best match of an entity in a collection of geometric objects. Since then, researchers from different communities such as theory, databases and pattern recognition have explored different ways to construct search graphs, inspired by various graph/network models such as the relative neighborhood graph [16, 17], Delaunay graph [18, 10, 19], KNN graph [20, 21] and navigable small world network [13, 22, 23]. Thanks to its appealing practical performance, the graph-based ANN search paradigm has become an active research direction and quite a lot of new approaches have been developed recently [5, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 3, 34].

While motivated by different graph/network models, most (if not all) practical graph-based algorithms essentially use approximate KNN graphs as the core index structure. A specific algorithm distinguishes itself from the others mainly in its edge selection heuristics, i.e., the way it selects a subset of links between a point and its neighbors. Algorithm 1 depicts the general framework of index construction for graph-based ANN search.

Input: V is the vertex set and K is used to control the graph quality
Output: The search graph G
1 for each v ∈ V do
2       Find the exact or approximate K nearest neighbors of v;
3       Choose a subset C of these KNNs based on some specific heuristic;
4       Add directed or bi-directional connections between v and every vertex in C;
        // update G
5 return G
Algorithm 1 The General Graph Construction Framework(V, K)
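
To make the framework concrete, the following is a minimal Python sketch of Algorithm 1. It is illustrative only: the exact KNN step is done by brute force and select_subset is a placeholder for an edge selection heuristic (e.g., RNG/MRNG-style pruning); neither reflects the construction of any particular system.

import numpy as np

def build_knn_graph(X, K, select_subset=None):
    # X: n x d data matrix; K: neighbors per node; select_subset: optional heuristic
    n = X.shape[0]
    graph = [set() for _ in range(n)]                    # adjacency lists of G
    for v in range(n):
        # exact KNN by brute force (practical systems use approximate KNN here)
        dist = np.linalg.norm(X - X[v], axis=1)
        knn = np.argsort(dist)[1:K + 1]                  # skip v itself
        C = select_subset(v, knn, X) if select_subset else knn
        for u in C:
            graph[v].add(int(u))                         # directed edge v -> u
            graph[int(u)].add(v)                         # and the reverse link
    return graph

In practice the exact KNN step is replaced by an approximate procedure (for example NN-Descent-style neighborhood refinement), since the brute-force version above costs O(n^2 d).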

For almost all graph-based methods, the ANN search procedure is based on the same principle. For a query q, start at an initial vertex chosen arbitrarily or using some sophisticated selection rule. Move along an edge to the adjacent vertex with minimum distance to q. Repeat this step until the current element v is closer to q than all its neighbors, and then report v as the NN of q. We call this the single path search model. Figure 1 illustrates the search procedure for q in a sample graph with points o_1 to o_6, where o_1 is the starting point and dashed lines indicate the search path. At the end of the search, the NN of q, i.e., o_6, is identified.

Figure 1: An illustrative example of graph search

For practical ANN search graphs, e.g., HNSW and NSG, there is no guarantee that a monotonic search path always exists for any given query [22, 5]. As a result, the search can easily get trapped in a local optimum, meaning that v is not the NN of q. To address this issue, backtracking is employed - we go back to visited vertices and follow another outgoing link to restart the procedure. We call this the multiple path search model. Algorithm 2 sketches the commonly adopted search algorithm that allows for backtracking, which will be discussed in detail in Section IV. Figure 2 illustrates a search path with backtracking. The starting point is o_1 and o_2 is a local optimum since the true NN of q is o_4. By backtracking the algorithm gets back to o_3, which is farther from q than o_2, and finally finds the true NN of q.

Figure 2: An illustrative example of graph search with backtracking
Input: Graph G, entry vertex s, query q, priority queue of size L
Output: k nearest neighbors of q
cand.push(s); // add s to the priority queue of candidates
result_L.push(s); // add s to the priority queue that stores the L nearest points to q
1 while |cand| > 0 do
2       v = cand.top(); cand.pop(); // v is the nearest point in cand to q
3       o = result_L.bottom(); // o is the furthest point in result_L to q
4       if d(v, q) > d(o, q) then
5             break; // all points in result_L have been evaluated
6       for each neighbor e of v in G do
7             if e ∉ visited then
8                   cand.push(e);
9                   result_L.push(e);
10      visited.push(v); // add v to the visited set
11 return the top k points in result_L
Algorithm 2 Graph-based kNN Search(G, s, q, L)

Please note that L is often greater than k to achieve better answer quality. For ease of presentation, we assume k = L throughout this paper unless stated otherwise.
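
For reference, the following is a compact Python sketch of Algorithm 2 with the two priority queues implemented as binary heaps. It follows the common implementation convention of marking points as visited when they are pushed (a minor variant of the pseudocode above); dist(v, q) is an assumed user-supplied function returning the distance from database point v to the query.

import heapq

def graph_knn_search(graph, dist, s, q, k, L):
    # graph: adjacency lists (vertex id -> iterable of neighbor ids)
    d_s = dist(s, q)
    cand = [(d_s, s)]            # min-heap: closest unexpanded candidate first
    result = [(-d_s, s)]         # max-heap via negation: farthest kept result first
    visited = {s}
    while cand:
        d_v, v = heapq.heappop(cand)
        # stop when the closest candidate is worse than the worst of the L results
        if len(result) >= L and d_v > -result[0][0]:
            break
        for e in graph[v]:
            if e in visited:
                continue
            visited.add(e)
            d_e = dist(e, q)
            heapq.heappush(cand, (d_e, e))
            heapq.heappush(result, (-d_e, e))
            if len(result) > L:
                heapq.heappop(result)        # keep only the L nearest so far
    # sort the kept results by increasing distance and return the top k ids
    return [v for _, v in sorted((-d, v) for d, v in result)][:k]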

II-B Review of Graph Search Models and Their Limitations

While empirical studies demonstrate that the graph-based ANN search algorithms are very competitive, it is widely recognized that the graph-based methods are mostly based on heuristics and not well understood quanlitatively [31, 5]. As an exception, two recent papers take the first step to analyze the asymptotic performance of graph-based methods for datasets uniformly distributed on a dd-dimensional Euclidean sphere [6, 7]. The worst-case analysis shows that the asymptotic behavior of a greedy graph-based search only matches the optimal hashing-based algorithm [8].

It was experimentally observed that the graph-based methods are orders of magnitude faster than the hashing-based algorithms [26]. Thus, though interesting from a purely theoretical perspective, their theory fails to explain the salient practical performance of the graph-based algorithms. Next, we will review several graph/network models that inspire practical graph-based algorithms, and then point out their limitations.

Monotonic Search Network Model. The monotonic search networks (MSNETs) are defined as a family of graphs such that, for any two vertices in the graph, there exists at least one monotonic search path between them [9]. If the query point happens to be equal to a point of the dataset, then a simple greedy search will succeed in locating the query point along a path of monotonically decreasing distance to the query point. The original MSNET is not practical, even for datasets of moderate size, because of its O(n^3) indexing complexity and unbounded average out-degree [9]. A recent proposal, the monotonic relative neighborhood graph, reduces the graph construction time to O(n^2 log n). This, however, still does not make MSNETs applicable in practice.

Delaunay Graph Model. Given a set of elements in a Euclidean space, the Voronoi diagram associates a Voronoi region with each element, which gives rise to a notion of neighborhood. The significance of this neighborhood is that if a query is closer to a database element than to all its neighbors, then we have found the nearest element in the whole database [10, 11]. The Delaunay graph is the dual of the Voronoi diagram, where each element p is connected to all elements that share a Voronoi edge with p. Using the Delaunay graph, Algorithm 2 is guaranteed to find the NN of q ∉ V. Unfortunately, the worst-case combinatorial complexity of the Voronoi diagram in dimension d grows as Θ(n^{d/2}) [35]. In addition, the Delaunay graph quickly reduces to the complete graph as d grows, making it infeasible for NN search in high-dimensional spaces [18].

Navigable Small World Model. Networks with logarithmic or polylogarithmic complexity of greedy graph search are known as navigable small world networks [12, 23, 13]. Inspired by this model, Malkov et al. proposed the navigable small world graph (NSW) and hierarchical NSW (HNSW) by introducing “long” links during the approximate KNN graph construction, expecting that the greedy routing achieves polylogarithmic complexity for NN search [22]. They demonstrate experimentally that the number of hops during graph routing is polylogarithmic with respect to the network size on a collection of real-life datasets. However, unlike the ideal navigable small world model, no rigorous theoretical analysis is provided for NSW and HNSW.

We argue that these conceptual models are inadequate for explaining why, in most cases, the search procedure quickly converges to the nearest neighbor, since:

  • For ideal models, the MSNET alone gives no hint as to how graph-based methods generalize to out-of-sample queries, i.e., queries that are not in V. The Delaunay graph supports out-of-sample queries, but does not guarantee that the NN can be found for a query q ∈ V. For example, suppose Algorithm 2 can reach q by traversing one monotonic search path s v_1 v_2 ... v_i q from s to q; we actually have no idea whether v_i is the NN of q at all, because there may be multiple monotonic search paths and the NN of q may lie on some other path. The navigable small world model only gives an intuitive explanation of the existence of short search paths but offers no quantitative justification of why the NN of q can be found.

  • More importantly, the vast majority of graph-based algorithms use approximate KNN graphs or their variants, instead of the aforementioned conceptual models, as the index structure. By limiting the maximum out-degree, approximate KNN graphs are far more sparse than MSNETs and Delaunay graphs, which makes them devoid of the nice theoretical properties, i.e., the existence of a monotonic search path or a Voronoi neighborhood.

To sum up, the existing models fail to explain the practical success of the graph-based methods. We view this as a significant gap between the theory and practice of the graph-based search paradigm. In this paper, we try to explain more quantitatively, from a different perspective, why the approximate KNN graph-based methods work so well in practice.

III Clustering Coefficient of KNN Graph and Its Impact on Search Performance

In graph theory, the clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together, which has been used successfully in its own right across a wide range of applications in complex networks. To name a few, Burt uses local clustering as a probe for the existence of so-called “structural holes” in a network, and Dorogovtsev et al. found that C_i falls off with k_i approximately as k_i^{-1} for certain models of scale-free networks [15], where k_i is the degree of v_i.

There are different ways to define the clustering coefficient. In this paper, we adopt the commonly used definition given by Watts and Strogatz [14]. The local clustering coefficient C_i of a vertex v_i is defined as

C_i = (number of pairs of neighbors of v_i that are connected) / (number of pairs of neighbors of v_i)    (1)

To calculate C_i, we go through all distinct pairs of vertices that are neighbors of v_i in the network, count the number of such pairs that are connected to each other, and divide by the total number of pairs k_i(k_i - 1)/2. Figure 3 illustrates the definition of the local clustering coefficient C_i. The degree of v_i is 4 and there exist two edges between its neighbors. Hence, by definition C_i = 2 / (4 × (4 - 1) / 2) = 1/3.

Figure 3: An illustrative example of local clustering coefficient

The clustering coefficient for the whole network is the average

C = (1/n) Σ_{i=1}^{n} C_i    (2)

The local clustering coefficient C_i of a vertex v_i describes the likelihood that the neighbors of v_i are also connected, i.e., the probability that two randomly selected neighbors of v_i are neighbors of each other. Roughly speaking, it tells how well the neighborhood of the node is connected. If the neighborhood is fully connected, the local clustering coefficient is 1, and a value close to 0 means that there are hardly any connections in the neighborhood. If most of the nodes in the network have a high clustering coefficient, then the network will probably have many edges connecting nodes to each other.
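
As a concrete reference, the sketch below computes C_i and C exactly as in Equations (1) and (2), assuming the graph is given as a list or dict mapping each vertex to the set of its neighbors (an undirected view, as in the Watts-Strogatz definition). The pairwise check costs O(k_i^2) per vertex, so this is meant for illustration rather than for million-scale graphs.

def local_clustering_coefficient(graph, i):
    # C_i per Equation (1): fraction of pairs of i's neighbors that are themselves connected
    nbrs = list(graph[i])
    k_i = len(nbrs)
    if k_i < 2:
        return 0.0                 # vertices with fewer than two neighbors contribute 0 here
    connected_pairs = sum(
        1 for a in range(k_i) for b in range(a + 1, k_i)
        if nbrs[b] in graph[nbrs[a]]
    )
    return connected_pairs / (k_i * (k_i - 1) / 2)

def clustering_coefficient(graph):
    # C per Equation (2): average of C_i over all vertices (vertex ids assumed to be 0..n-1)
    n = len(graph)
    return sum(local_clustering_coefficient(graph, i) for i in range(n)) / n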

The clustering coefficient of a KNN graph depends on K and the intrinsic features of the dataset. Table I lists the clustering coefficients for various K for three typical datasets. As one can see, the larger K is, the greater the clustering coefficient. Moreover, the relative order of the clustering coefficients is stable, independent of K. In the sequel, we will use the clustering coefficient in the case of K = 50 as the default, since K cannot be too large due to the index space constraint.

TABLE I: Clustering coefficient for different K for three typical datasets
Dataset K=20 K=50 K=100 K=150 K=200
Sift 0.1159 0.1249 0.1371 0.1419 0.1468
Glove 0.0881 0.1029 0.1289 0.1358 0.1427
Random 0.00047 0.00074 0.00092 0.00114 0.00139

Our key observation is that the clustering coefficient of the KNN graph is an informative measure of the efficiency of graph-based ANN search methods. In this paper, a KNN graph is defined as a graph in which each vertex v has bi-directional edges with its K nearest neighbors. This model is reasonable because practical graph-based algorithms such as HNSW and NSG always add bi-directional links between a point and its KNN as much as possible [22, 5]. Table II lists the statistics of the datasets, the clustering coefficients of the KNN graph in increasing order, the recalls of top-50 queries and the average number of hops in the graph for a collection of datasets under HNSW and NSG, the two state-of-the-art graph-based algorithms (statistics are collected over 1,000 random queries). For both methods, the maximum out-degree (MOD) is 70 and the parameter L, which controls the number of hops in the graph, is set to 50. Interesting observations can be made as follows:

  • NSG consistently outperforms HNSW in recall with a slightly greater average number of hops, which approximately translates to the number of distance evaluations since the MODs are identical for both algorithms. This observation agrees with the results reported in [5].

  • A more interesting observation is that, with around the same number of average hops in the graph, clustering coefficient and recall are strongly correlated. Particularly, the Pearson correlation coefficients between the clustering coefficient and recall for NSG and HNSW are 0.794 and 0.762, respectively. Besides, independent of the data cardinality and dimensionality, a high clustering coefficient (greater than 0.12) often leads to high recall whereas a low clustering coefficient (lower than 0.1) results in low recall. As an extreme example, the clustering coefficient of the Random dataset is only 0.00074, which makes graph-based algorithms very inefficient. One reason that the recall of NSG is greater than that of HNSW is that the quality of the neighbors in NSG is better than in HNSW, that is, NSG is much closer to an exact KNN graph than HNSW. Please note that the datasets are comprehensive enough in terms of size, dimensionality and data types (images, text, audio and synthetic). Detailed descriptions of these datasets can be found in [24] (https://github.com/DBWangGroupUNSW/).

TABLE II: Clustering coefficient vs. Efficiency
Dataset Size Dim Clustering coefficient HNSW Recall HNSW # of Hops NSG Recall NSG # of Hops
Random 1,000,000 128 0.00074 0.0049 61.3 0.02 64.8
Gist 1,000,000 960 0.080 0.5984 54.9 0.7688 54.7
NUSWIDE 268,643 500 0.096 0.4343 58.0 0.5430 59.8
GLOVE 1,192,514 100 0.103 0.4903 60.3 0.694 56.1
ImageNet 2,340,373 150 0.114 0.6643 53.3 0.8608 55.5
Sift 1,000,000 128 0.125 0.8667 52.0 0.9453 54.2
Sun 79,106 512 0.140 0.8941 51.1 0.9562 52.0
Cifar 50,000 512 0.141 0.9196 51.0 0.9685 51.5
Deep 1,000,000 256 0.144 0.8205 52.7 0.9078 54.7
MillionSong 992,272 420 0.163 0.5984 51.4 0.9608 55.1
Ukbench 1,097,907 128 0.189 0.8893 51.7 0.9545 54.5
Enron 94,987 1369 0.209 0.7599 52.3 0.9421 53.3
Trevi 99,900 4096 0.215 0.8845 51.1 0.9498 52.8
AUDIO 53,387 192 0.253 0.9553 51.0 0.9815 52.5
MNIST 69,000 784 0.286 0.9728 51.7 0.9878 53.2
Notre 332,668 128 0.287 0.9248 52.4 0.9674 53.8

These observations suggest that the clustering coefficient is a promising measure of the efficiency of graph-based algorithms. Intuitively, the higher the clustering coefficient of the KNN graph (K should be as small as possible to reduce the memory footprint and the query cost), the better the graph is connected, which means that graph connectivity has a significant impact on the result quality of ANN search. To have an in-depth understanding of how connectivity affects the search performance, we scrutinized the graph traversal steps of a sample of queries and found that the local connectivity, instead of the global one, is the determining factor. To formally characterize the local connectivity, we propose the notion of the maximum strongly connected neighborhood as follows.

Definition 1.

A directed graph is strongly connected if there is a directed path from every vertex to every other vertex. A strongly connected component (SCC) of a directed graph is a maximal strongly connected subgraph of this graph.

Definition 2.

The k-neighborhood of a point v, denoted by N_k(v), is the set of the k nearest elements of v, i.e., o_1, ..., o_k ∈ V.

Please note that the only requirement is that the k nearest neighbors of v belong to V; v itself may or may not be in V. This definition allows our analysis to support out-of-sample queries.

Definition 3.

A subgraph of G is the k-neighborhood subgraph associated with a vertex v, denoted by G_k(v), if V(G_k(v)) = N_k(v) and E(G_k(v)) ⊆ E(G).

Definition 4.

The maximum strongly connected neighborhood of G_k(v), denoted by C_k(v), is an SCC of G_k(v) such that |C_k(v)| ≥ |C_i| for all i, where the C_i are the SCCs of G_k(v).

Please note that K and k have totally different meanings - K is the number of links per node in the KNN graph and is determined at graph construction time, whereas k is the search parameter and thus may vary according to the users' requirements.

Figure 4 illustrates these definitions using a simple example. o_1 to o_5 are the top-5 NN of query q and there are three undirected edges (equivalent to six directed edges) in q's 5-neighborhood subgraph G_5(q). Three SCCs exist in G_5(q) and the maximum SCC C_5(q) is composed of points o_1, o_2 and o_3.

Figure 4: An illustrative example of maximum strongly connected neighborhood
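
The sketch below puts Definitions 2-4 together: it computes N_k(q) by brute force, builds the k-neighborhood subgraph induced on N_k(q), and returns its largest SCC as C_k(q). The networkx dependency is assumed only for the SCC computation; any SCC routine (e.g., Tarjan's algorithm) would serve equally well.

import numpy as np
import networkx as nx

def max_scc_neighborhood(graph, X, q, k):
    # graph: mapping vertex id -> iterable of out-neighbors (the directed KNN graph)
    # X: n x d data matrix; q: query vector (may or may not be a row of X)

    # Definition 2: N_k(q), the k nearest database points to q
    dist = np.linalg.norm(X - q, axis=1)
    N_k = set(int(i) for i in np.argsort(dist)[:k])

    # Definition 3: the k-neighborhood subgraph G_k(q), here induced on N_k(q)
    G_k = nx.DiGraph()
    G_k.add_nodes_from(N_k)
    for v in N_k:
        for u in graph[v]:
            if u in N_k:
                G_k.add_edge(v, u)

    # Definition 4: C_k(q), the SCC of G_k(q) with the largest size
    return max(nx.strongly_connected_components(G_k), key=len)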

To show the impact of C_k(v) on algorithm performance, Table III lists the 3 largest SCCs for 100 random kNN queries with k = 50 over three typical datasets. As we can see, the ratios of the size of C_k(v) (SCC1) to k are very close to the recalls listed in Table II for the three datasets, respectively. Other state-of-the-art algorithms, such as HNSW, exhibit similar trends.

TABLE III: SCCs of 100 random queries for three typical datasets under NSG
SCC-id Sift size Sift ratio Glove size Glove ratio Random size Random ratio
SCC1 47.8 95.6% 33.8 67.6% 2.4 4.8%
SCC2 0.8 1.6% 1.8 3.6% 1.4 2.8%
SCC3 0.2 0.4% 1.8 3.6% 1.2 2.4%

To eliminate the bias caused by specific graph construction algorithms, we also studied the exact KNN graph and found similar results. Table IV lists, for 100 random top-50 NN queries, the average sizes of the top-3 SCCs over Sift, Glove and Random. In this experiment, we only put a directed link from a point to each of its KNN and no link is added manually in the reverse direction, i.e., the KNN graph is directed. K is set to 50. From Table IV we can see that, independent of the specific graph-based algorithm, the clustering coefficient also has a significant impact on the size of C_k(v).

TABLE IV: SCCs of 100 random queries for exact directed KKNN graph
SCC-id Sift size Sift ratio Glove size Glove ratio Random size Random ratio
SCC1 36.8 73.6% 30.2 60.4% 1.2 2.4%
SCC2 3.6 7.2% 2.4 4.8% 1 2%
SCC3 1.6 3.2% 1.6 3.2% 1 2%

We also examined the undirected KNN graph, where bi-directional links are added manually between a point and its KNN. The trend listed in Table V is very similar to that of Table IV except that the sizes of C_k(v) are larger. This is because more links are added to the graph. Actually, practical graph-based algorithms lie somewhere between the undirected and the directed KNN graph since they always try to add bi-directional links as long as the memory budget allows. Please note that exact KNN graphs are not practical because of the unaffordable construction time and unlimited maximum out-degree, which translates into too much memory cost.

TABLE V: SCCs of 100 random queries for exact undirected KKNN graph
SCC-id Sift size Sift ratio Glove size Glove ratio Random size Random ratio
SCC1 48.8 97.6% 46.2 92.4% 6 12%
SCC2 0.2 0.4% 1 2% 2.2 4.4%
SCC3 0 0% 0.2 0.4% 1.4 2.8%

In a nutshell, all these experiments demonstrate that the clustering coefficient of the KNN graph is an informative measure of the size of the maximum strongly connected neighborhood and of the performance of graph-based algorithms over a specific dataset. Next, we will analyze how C_k(q) affects the recall for a given query q. Particularly, we will show that Algorithm 2, the core algorithmic component of graph search, can effectively reach C_k(q) and identify all kNN in C_k(q). This explains why a greater clustering coefficient and a larger C_k(q) lead to better performance.

IV Two-Phase kNN Search in Graphs

TABLE VI: Statistics of two-phase ANN search for Sift
Query ID # of Hops in P_1 # of Hops in P_2 |C_k(q)| Fraction of C_k(q) visited in P_2 |C̄_k(q)|
1 4 50 38 100% 4
2 4 50 43 100% 1
3 5 50 48 100% 0
4 4 50 50 100% 0
5 5 50 45 100% 0
TABLE VII: Statistics of two-phase ANN search for GLOVE
Query ID # of Hops in P_1 # of Hops in P_2 |C_k(q)| Fraction of C_k(q) visited in P_2 |C̄_k(q)|
1 3 52 27 100% 7
2 1 59 28 100% 0
3 11 39 20 100% 8
4 2 49 38 100% 4
5 2 48 24 100% 6
TABLE VIII: Statistics of two-phase ANN search for Random
Query ID # of Hops in P_1 # of Hops in P_2 |C_k(q)| Fraction of C_k(q) visited in P_2 |C̄_k(q)|
1 79 7 1 100% 0
2 83 0 1 0% 0
3 50 47 1 100% 0
4 46 25 2 100% 0
5 56 0 1 0% 0

The common wisdom about Algorithm 2 is as follows. Starting from the entry vertex s, which is chosen at random or using some auxiliary method, Algorithm 2 finds a directed path from s to the query q, hoping that the NN of q is identified through the walk. Since only local information, i.e., the adjacent vertices of the visited vertices, is used, this class of algorithms is termed decentralized algorithms [12]. Particularly, for ANN search, Algorithm 2 first follows the out-edges of s to get its immediate neighbors, and then examines the distances from these neighbors to q. The one with the minimum distance to q is selected as the next base vertex for iteration. The same procedure is repeated at each step of the traversal until Algorithm 2 reaches a local optimum, namely, the immediate neighbors of the base vertex do not contain a vertex that is closer to q than the base vertex itself. Backtracking is used to jump out of the local optimum and increase the odds of finding the true NN. Recall that we call this search paradigm the multiple path search model.

Different from the traditional point of view, we observe that Algorithm 2 is actually composed of two phases. In the first phase, the algorithm starts with an initial point, walks the graph and encounters a point within C_k(q). In the second phase, the algorithm traverses C_k(q) and a small number of points not in C_k(q). Figure 5 depicts the two-phase search procedure. Theorem 1 proves that Algorithm 2 is guaranteed to find all points in C_k(q) under a mild condition.

Figure 5: Two-phase ANN graph search
Theorem 1.

Algorithm 2 is guaranteed to visit all points in C_k(q) starting from any point in C_k(q).

Proof.

We know that a directed graph is strongly connected iff all the vertices of the graph are part of some cycle. Please note that this cycle need not be a Hamiltonian cycle. Suppose all vertices not in C_k(q) but adjacent to vertices in C_k(q) are farther from q than all vertices in C_k(q). Without loss of generality, suppose the first vertex visited is v_1; then Algorithm 2 will visit all vertices following the cycle and push every vertex into cand and result_L. Since all vertices in C_k(q) are closer to q than the other vertices, the distance of the bottom element of result_L will always be greater than that of the top element in cand until all vertices in C_k(q) are visited (all elements in result_L are initialized to infinity at the beginning). Please note that in each loop cand pops the element that has been pushed into result_L, which guarantees that Algorithm 2 always terminates. ∎

Theorem 1 suggests a different perspective for understanding the graph-based methods. Rather than searching a single path (without backtracking) or multiple paths (with backtracking) in the graph, the search algorithm actually traverses a strongly connected neighborhood around the query. In other words, a high-quality C_k(q), together with Algorithm 2, delivers the salient performance. The analysis in Section III reveals that the quality of C_k(q) is data dependent and closely related to the clustering coefficient of the KNN graph. Therefore, there exists a significant performance disparity across datasets, and we can use the clustering coefficient of the KNN graph as a meaningful measure of the efficiency of the graph-based methods.

It is possible that there exist a few vertices adjacent to vertices in C_k(q) which are not in C_k(q) but are closer to q than some vertices in C_k(q). In this case, the algorithm may also visit such vertices, and the answer quality will be higher than that obtained by traversing C_k(q) alone, since closer vertices outside C_k(q) are visited as well.

The probability of the search algorithm getting into C_k(q) increases exponentially with L', the number of times the search is trapped in a local optimum and backtracks to a distant point before entering C_k(q), and is expressed as follows, where p_i is the probability of getting into C_k(q) along a single path.

P = 1 - ∏_{i=1}^{L'} (1 - p_i)    (3)
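
As a purely hypothetical numeric illustration of Equation (3) (the p_i are not measured quantities): if every restart had the same chance p_i = p = 0.3 of entering C_k(q), then P = 1 - (1 - 0.3)^{L'}, which already exceeds 0.97 after L' = 10 restarts.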

The rigorous calculation of P is infeasible. Empirically, for datasets with relatively large clustering coefficients we observed that (1) Algorithm 2 can quickly reach C_k(q), and (2) the path length of the first phase is far shorter than that of the second phase. Table VI, Table VII and Table VIII list the numbers of hops in Phase 1 and Phase 2, the size of C_k(q), the fraction of points in C_k(q) that are visited in Phase 2 and the number of true top-k NN not in C_k(q) that are found during Phase 2 (denoted by |C̄_k(q)|) for Sift, Glove and Random, respectively. NSG is used and the statistics of five random queries are reported. Please note that HNSW and the exact KNN graph exhibit similar trends, so we do not report their results. Several interesting observations can be made:

  • As we can see, independent of the dataset, the two-phase search model is applicable to all queries listed. As proved in Theorem 1, once the search algorithm enters Phase 2, all points in C_k(q) will be visited, which demonstrates the importance of the quality of C_k(q).

  • Besides the true top-k NN in C_k(q), other true top-k NN may also be visited during Phase 2, especially for Glove where the size of C_k(q) is relatively small. This is mainly because the search algorithm jumps into a smaller SCC or visits kNN that have only a single directed edge to the maximum SCC.

  • For Sift and Glove, where the size of C_k(q) is far greater than that of Random, the second phase dominates the search cost and the algorithm jumps into C_k(q) in a very small number of steps. In contrast, it is very hard for the algorithm to find a true top-k NN for Random since the size of C_k(q) is too small (in most cases equal to 1). For example, q_2 and q_5 do not enter Phase 2 and did not find any true NN. As a result, the recall for Random is very low.

Figure 6: An example of two-phase ANN graph search (best viewed in color)

To build the reader's intuition, Figure 6 illustrates the search procedure of a top-10 query on the Sift dataset with NSG. The green point is the query and the red points denote the true NNs in the maximum SCC, which are strongly connected. Dashed lines in blue with single or double arrows represent the directed or bi-directional edges between points. The solid arrowed lines in yellow depict the search path during kNN search. As we can see, starting from the entry point, the algorithm jumps into the maximum SCC in three steps. After traversing the maximum SCC, which consists of six true NNs, it continues the search by visiting one true NN (in black) and four other points before the termination condition is met. Since k is small in this example, the size of the maximum SCC is not that large. This can be informally explained by the fact that, under the same edge connection probability, the connectivity increases as the number of vertices becomes large, according to random network theory [15].

The case of small k: Users may be interested in only a small number of nearest neighbors of q, say k ranging from 1 to 10. In this case, the size and quality of C_k(q) may not be good enough to achieve high recall. To get precise results, L is often set greater than k, say 50-200. The net effect is that the search algorithm actually visits C_L(q), which contains most of the top-L NN if the clustering coefficient is large enough; Algorithm 2 then identifies the best k results and outputs them.

V Related Work

V-A Measures for the Difficulty of Nearest Neighbor Search

The difficulty of (approximate) NN search in a given dataset has drawn much attention in recent years. Beyer et al. and Francois et al. show that NN search will be meaningless when the number of dimensions goes to infinity [36, 37], respectively. However, they did not provide a non-asymptotic analysis for the case where the number of dimensions is finite. Moreover, the effect of other crucial properties such as the sparsity of data vectors has not been studied. To the best of our knowledge, He et al. proposed the first concrete measure, called Relative Contrast (RC), to evaluate the influence of several data characteristics such as dimensionality, sparsity and dataset size simultaneously on the difficulty of NN search [38]. They present a theoretical analysis to prove how RC determines the complexity of Locality Sensitive Hashing, a popular approximate NN search method. Relative Contrast also provides an explanation for a family of heuristic hashing algorithms with good practical performance based on PCA. However, no evidence is given that RC can be used to explain the success of graph-based NN search methods directly.

Identifying the intrinsic dimensionality (ID) of datasets has been studied for decades given its importance in machine learning, databases and data mining. Recently, local ID has gained much attention since it is very useful when data is composed of heterogeneous manifolds. In addition to applications in manifold learning, measures of local ID have been used in the context of evaluating the difficulty of NN search [39]. Several local intrinsic dimensionality models have been proposed, such as the expansion dimension (ED) [40], the generalized expansion dimension (GED) [41], the minimum neighbor distance (MiND) [42], and the local continuous intrinsic dimension (LID) [43]. While these measures have been shown to be useful in their own right, none of them is applicable to explaining the salient performance of the graph-based methods.

V-B A Brief Review of the Existing ANN Search Methods

Approximate nearest neighbor search (ANNS) has been a hot topic for decades; it provides fundamental support for many applications in data mining, databases and information retrieval [44, 45, 46]. There is a large amount of significant literature on algorithms for approximate nearest neighbor search, which are mainly divided into the following categories: tree-structure based approaches, hashing-based approaches, quantization-based approaches, and graph-based approaches.

V-B1 tree-structure based approaches

Hierarchical structure (tree) based methods offer a natural way to continuously partition a dataset into discrete regions at multiple scales, such as the KD-tree [47], R-tree [48], and SR-tree [49]. These methods perform very well when the dimensionality of the data is relatively low. However, they have been shown to be inefficient when the dimensionality of the data is high. It has been shown in [50] that when the dimensionality exceeds about 10, existing indexing data structures based on space partitioning are slower than the brute-force, linear-scan approach. Many new hierarchical-structure-based methods [51] have been presented to address this limitation.

V-B2 hashing-based approaches

Among the approximate NN search algorithms, Locality Sensitive Hashing is the most widely used one due to its excellent theoretical guarantees and empirical performance. E2LSH, the classical LSH implementation for the ℓ_2 norm, cannot solve the c-ANN search problem directly. In practice, one has to either assume there exists a “magical” radius r, which can lead to arbitrarily bad outputs, or use multiple hash tables tailored for different radii, which may lead to prohibitively large space consumption in indexing. To reduce the storage cost, LSB-Forest [52] and C2LSH [53] use the so-called virtual rehashing technique, implicitly or explicitly, to avoid building physical hash tables for each search radius. The index size of LSB-Forest is far greater than that of C2LSH because the former ensures that the worst-case I/O cost is sub-linear in both n and d whereas the latter has no such guarantee - it only bounds the number of candidates by some constant but ignores the overhead of index access.

Based on the idea of query-aware hashing, the two state-of-the-art algorithms QALSH and SRS further improve the efficiency over C2LSH by using different index structures and search methods, respectively. SRS uses an m-dimensional R-tree (typically m ≤ 10) to store the ⟨g(o), oid⟩ pair for each point o and transforms the c-ANN search in the d-dimensional space into a range query in the m-dimensional projection space. The rationale is that the probability that a point o is the NN of q decreases as Δ_m(o) increases, where Δ_m(o) = ‖g_m(q) − g_m(o)‖. During c-ANN search, points are accessed in increasing order of their Δ_m(o).
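
The core idea can be sketched in a few lines of Python; this is only a hedged illustration of the projection-and-ranking step, not the actual SRS implementation, which stores the projections in an R-tree and applies an early-termination test.

import numpy as np

def srs_candidate_order(X, q, m=6, seed=0):
    # Rank points by the m-dimensional projected distance Delta_m(o) = ||g(q) - g(o)||;
    # the R-tree index and the stopping condition of the real algorithm are omitted.
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((X.shape[1], m))   # m random Gaussian projection vectors
    proj_X, proj_q = X @ G, q @ G
    delta = np.linalg.norm(proj_X - proj_q, axis=1)
    return np.argsort(delta)                   # access order for c-ANN candidates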

Motivated by the observation that the optimal ℓ_p metric is application-dependent, LazyLSH [54] is proposed to solve the NN search problem for the fractional distance metrics, i.e., ℓ_p metrics (0 < p < 1), with a single index. FALCONN is the state-of-the-art LSH scheme for the angular distance, both theoretically and practically [55]. Except for E2LSH and FALCONN, the other algorithms are disk-based and thus can handle datasets that do not fit into memory.

All of the aforementioned LSH algorithms provide probability guarantees on the result quality (recall and/or precision). To achieve better efficiency, many LSH extensions such as Multi-probe LSH [56], SK-LSH [57], LSH-forest [58] and Selective hashing [59] use heuristics to access more plausible buckets or re-organize datasets, and do not ensure any LSH-like theoretical guarantee.

V-B3 quantization-based approaches

The most common quantization-based method is product quantization (PQ) [2]. It seeks to perform a dimension reduction similar to that of hashing algorithms, but in a way that better retains information about the relative distances between points in the original vector space. Formally, a quantizer is a function q mapping a D-dimensional vector x ∈ ℝ^D to a vector q(x) ∈ C = {c_i; i ∈ I}, where the index set I is finite: I = 0, ..., k−1. The reproduction values c_i are called centroids. The set V_i of vectors mapped to a given index i is referred to as a cell, and is defined as

V_i ≜ {x ∈ ℝ^D : q(x) = c_i}

The k cells of a quantizer form a partition of ℝ^D, so all the vectors lying in the same cell V_i are reconstructed by the same centroid c_i. Due to the huge number of samples required and the complexity of learning the quantizer, PQ uses m distinct quantizers to quantize the subvectors separately. An input vector is divided into m distinct subvectors u_j, 1 ≤ j ≤ m. The dimension of each subvector is D* = D/m. An input vector x is mapped as follows:

u_1(x) = (x_1, ..., x_{D*}), ..., u_m(x) = (x_{D−D*+1}, ..., x_D)  →  q_1(u_1(x)), ..., q_m(u_m(x))

where q_j is a low-complexity quantizer associated with the j-th subvector. The codebook is defined as the Cartesian product

C = C_1 × ... × C_m

and a centroid of this set is the concatenation of the centroids of the m subquantizers. All subquantizers have the same finite number k* of reproduction values, so the total number of centroids is k = (k*)^m.

After applying PQ, all database vectors are replaced by their reproduction values. To speed up the query, PQ uses a look-up table to directly obtain the distance between the reproduction values and the query vector. Two methods are proposed to compute an approximate Euclidean distance between these vectors: the so-called Asymmetric Distance Computation (ADC) and the Symmetric Distance Computation (SDC). See Figure 7 for an illustration. We take ADC as an example.

Figure 7: Two methods to compute an approximate Euclidean distance

The database vector y is represented by q(y), but the query x is not encoded. The distance d(x, y) is approximated by the distance d(x, q(y)), which is computed using the decomposition

d(x, q(y)) = √( Σ_j d(u_j(x), q_j(u_j(y)))^2 ),

where the squared distances d(u_j(x), c_{j,i})^2, j = 1, ..., m, i = 1, ..., k*, are computed before the search. The calculation method of SDC is similar to that of ADC, but the query vector x is represented by q(x). SDC limits the memory usage associated with the queries, while ADC has a lower distance distortion for a similar complexity.

PQ offers three attractive properties: (1) PQ compresses an input vector into a short code (e.g., 64 bits), which enables it to handle typically one billion data points in memory; (2) the approximate distance between a raw vector and a compressed PQ code is computed efficiently (the so-called asymmetric distance computation (ADC) and symmetric distance computation (SDC)), which is a good estimation of the original Euclidean distance; and (3) the data structure and coding algorithm are simple, which allows them to hybridize with other indexing structures. Because these methods avoid distance calculations on the original data vectors, they incur a certain loss of accuracy. When the recall is required to be close to 1.0, the required length of the candidate list is close to the size of the dataset. Many quantization-based methods try to reduce the quantization error to improve accuracy, such as Optimized Product Quantization (OPQ) [60] and Tree Quantization (TQ) [61].
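
Below is a minimal Python sketch of PQ encoding and ADC under the notation above, assuming the per-subspace codebooks (a list of m arrays with k* centroids each, e.g., obtained from k-means) have already been trained. A real implementation would precompute the m x k* squared distances d(u_j(x), c_{j,i})^2 once per query as the look-up table mentioned above, instead of recomputing them per database vector.

import numpy as np

def pq_encode(x, codebooks):
    # Encode vector x into m sub-codes; codebooks[j] has shape (k_star, D_star)
    m = len(codebooks)
    D_star = codebooks[0].shape[1]
    codes = np.empty(m, dtype=np.int32)
    for j in range(m):
        u_j = x[j * D_star:(j + 1) * D_star]                    # j-th subvector u_j(x)
        codes[j] = np.argmin(np.linalg.norm(codebooks[j] - u_j, axis=1))
    return codes

def adc_distance(x, codes, codebooks):
    # Asymmetric distance d(x, q(y)): the query x stays raw, y is given by its PQ code
    m = len(codebooks)
    D_star = codebooks[0].shape[1]
    sq = 0.0
    for j in range(m):
        u_j = x[j * D_star:(j + 1) * D_star]
        c = codebooks[j][codes[j]]                              # reproduction value q_j(u_j(y))
        sq += np.sum((u_j - c) ** 2)
    return np.sqrt(sq)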

V-B4 graph-based approaches

Recently, graph-based methods have drawn considerable attention, such as NSG [5], HNSW [22], Efanna [62], and FANNG [27]. Graph-based methods construct a kNN graph offline, which can be regarded as a big network in high-dimensional space. However, the construction cost of the exact kNN graph grows rapidly with the dataset size. Many researchers therefore turn to building an approximate kNN graph, but this is still time consuming. There are two main types of graphs: directed graphs and undirected graphs.

At the online search stage, they all use the greedy search algorithm or its variants. These methods require an initial point to start from, and the easiest way is to choose it randomly. During the search, the algorithm can quickly converge from the initial point to the neighborhood of the query point. One problem with this method is that it easily converges to local optima, resulting in low recall. One way to alleviate this problem is to provide a better initial candidate set for a query point: instead of random selection, one can use the Navigating node (the approximate medoid of the dataset) and its neighbors as the candidates. Another method is to try to make the constructed graph monotonic. The edge selection strategy of MRNG, first proposed in [5], can ensure that the graph is a Monotonic Search Network (MSNET). Ideally, the search path iterates from the starting point until it reaches the query point, which means that no backtracking occurs during the search.

Because the construction of the graph greatly affects search performance, many researchers focus on constructing index graphs. The fundamental issue is how to choose the neighbors of nodes in the graph. We introduce two state-of-the-art graph neighbor selection strategies: Relative Neighborhood Graphs (RNG) [17] and Monotonic Relative Neighborhood Graphs (MRNG) [5]. Formally, given two points p and q in ℝ^D, B(p, δ(p,q)) represents an open sphere with center p and radius δ(p,q). The lune_pq is defined as:

lune_pq = B(p, δ(p,q)) ∩ B(q, δ(p,q))

FANNG [27] and HNSW [22] adopt the RNG's edge selection strategy to construct the index. RNG is an edge selection strategy based on an undirected graph, and it selects edges by checking whether there is a point in the intersection of two open spheres. In Figure 8(a), node p has a set of neighbor candidates for selection. If there is no node in lune_pr, then p and r are linked. Otherwise, there is no edge between p and r. Because r ∈ lune_ps, s ∈ lune_pt, t ∈ lune_pu, and u ∈ lune_pq, there are no edges between p and s, t, u, q. Although RNG can reduce the out-degree to a constant C_d + o(1), it does not have sufficient edges to be an MSNET. NSG adopts the MRNG's edge selection strategy to construct the index, which is a directed graph. In Figure 8(b), p and r are linked to each other because there is no node in lune_pr. p and s are not linked because p and r are linked and r ∈ lune_ps. However, p and t are linked because p and s are not linked and s ∈ lune_pt. The graph constructed by MRNG is an MSNET. The common purpose of these two graph construction methods is to reduce the average out-degree of the graph so as to make the graph sparse and reduce the search complexity. These edge selection strategies have achieved attractive results, which enables many graph-based methods, such as Efanna [62], KGraph, HNSW and NSG, to perform well in terms of search time.

(a) RNG
(b) MRNG
Figure 8: Two state-of-the-art edge selection strategies
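
The RNG-style pruning described above can be sketched as follows, in the approximate form commonly used by heuristic implementations: candidates are scanned in increasing distance from p, and a candidate r is kept only if no already-kept (hence closer) neighbor s lies in lune_pr, i.e., no kept s satisfies d(s, r) < d(p, r). The exact RNG/MRNG definitions test against all points rather than only the kept ones.

import numpy as np

def rng_select(p, candidates, X):
    # candidates: ids of p's KNN sorted by increasing distance to p; X: data matrix
    selected = []
    for r in candidates:
        d_pr = np.linalg.norm(X[p] - X[r])
        # r is occluded if some kept neighbor s lies inside lune_pr
        occluded = any(np.linalg.norm(X[s] - X[r]) < d_pr for s in selected)
        if not occluded:
            selected.append(r)
    return selected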

VI Conclusion

This paper takes a first step toward explaining why graph-based search algorithms work so well in practice and suggests that the clustering coefficient of the KNN graph is an important measure of the efficiency of these algorithms. A detailed analysis is also conducted to show how the clustering coefficient affects the local structure of KNN graphs. A few open problems still exist. For example, formal analysis under some simplified data model is important to gain a more rigorous understanding of the graph search procedure.

Acknowledgements

The work reported in this paper is partially supported by NSFC under grant number 61370205, NSF of Xinjiang Key Laboratory under grant number 2019D04024.

References

  • [1] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in SoCG, 2004, pp. 253–262.
  • [2] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, 2011.
  • [3] M. Douze, A. Sablayrolles, and H. Jégou, “Link and code: Fast indexing with graphs and compact regression codes,” in CVPR, 2018, pp. 3646–3654.
  • [4] Y. Dong, P. Indyk, I. P. Razenshteyn, and T. Wagner, “Learning space partitions for nearest neighbor search,” in ICLR, 2020.
  • [5] C. Fu, C. Xiang, C. Wang, and D. Cai, “Fast approximate nearest neighbor search with the navigating spreading-out graph,” Proc. VLDB Endow., vol. 12, no. 5, pp. 461–474, 2019.
  • [6] T. Laarhoven, “Graph-based time-space trade-offs for approximate near neighbors,” in SoCG, B. Speckmann and C. D. Tóth, Eds., pp. 57:1–57:14.
  • [7] L. Prokhorenkova, “Graph-based nearest neighbor search: From practice to theory,” CoRR, vol. abs/1907.00845, 2019.
  • [8] A. Andoni, T. Laarhoven, I. P. Razenshteyn, and E. Waingarten, “Optimal hashing-based time-space trade-offs for approximate near neighbors,” in SODA, 2017, pp. 47–66.
  • [9] D. W. Dearholt, N. Gonzales, and G. Kurup, “Monotonic search networks for computer vision databases,” in Twenty-Second Asilomar Conference on Signals, Systems and Computers, vol. 2, 1988, pp. 548–553.
  • [10] T. B. Sebastian and B. B. Kimia, “Metric-based shape retrieval in large databases,” in ICPR, 2002, pp. 291–296.
  • [11] S. Morozov and A. Babenko, “Non-metric similarity graphs for maximum inner product search,” in NIPS, 2018, pp. 4726–4735.
  • [12] J. M. Kleinberg, “Navigation in a small world,” Nature, vol. 406, no. 6798, p. 845, 2000.
  • [13] Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Approximate nearest neighbor algorithm based on navigable small world graphs,” Inf. Syst., vol. 45, pp. 61–68, 2014.
  • [14] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, pp. 440–442, 1998.
  • [15] M. Newman, Networks: An Introduction.   Oxford University Press, 2010.
  • [16] S. Arya and D. M. Mount, “Approximate nearest neighbor queries in fixed dimensions,” in SODA, 1993, pp. 271–280.
  • [17] J. W. Jaromczyk and G. T. Toussaint, “Relative neighborhood graphs and their relatives,” Proceedings of the IEEE, vol. 80, no. 9, pp. 1502–1517, 1992.
  • [18] G. Navarro, “Searching in metric spaces by spatial approximation,” in SPIRE/CRIWG, 1999, pp. 141–148.
  • [19] F. Aurenhammer, “Voronoi diagrams - A survey of a fundamental geometric data structure,” ACM Comput. Surv., vol. 23, no. 3, pp. 345–405, 1991.
  • [20] R. Paredes and E. Chávez, “Using the k-nearest neighbor graph for proximity searching in metric spaces,” in SPIRE, 2005, pp. 127–138.
  • [21] “KGraph.” [Online]. Available: https://github.com/aaalgo/kgraph
  • [22] Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 4, pp. 824–836, 2020.
  • [23] J. M. Kleinberg, “The small-world phenomenon: an algorithmic perspective,” in STOC, 2000, pp. 163–170.
  • [24] W. Li, Y. Zhang, Y. Sun, W. Wang, W. Zhang, and X. Lin, “Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement (v1.0),” CoRR, vol. abs/1610.02455, 2016.
  • [25] K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang, “Fast approximate nearest-neighbor search with k-nearest neighbor graph,” in IJCAI, 2011, pp. 1312–1317.
  • [26] M. Aumüller, E. Bernhardsson, and A. J. Faithfull, “Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms,” Inf. Syst., vol. 87, 2020.
  • [27] B. Harwood and T. Drummond, “FANNG: fast approximate nearest neighbour graphs,” in CVPR, 2016, pp. 5713–5722.
  • [28] K. Aoyama, K. Saito, H. Sawada, and N. Ueda, “Fast approximate similarity search based on degree-reduced neighborhood graphs,” in SIGKDD, 2011, pp. 1055–1063.
  • [29] M. Iwasaki and D. Miyazaki, “Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data,” CoRR, vol. abs/1810.07355, 2018.
  • [30] M. Iwasaki, “Pruned bi-directed k-nearest neighbor graph for proximity search,” in SISAP, vol. 9939, 2016, pp. 20–33.
  • [31] D. Baranchuk and A. Babenko, “Towards similarity graphs constructed by deep reinforcement learning,” CoRR, vol. abs/1911.12122, 2019.
  • [32] D. Baranchuk, D. Persiyanov, A. Sinitsin, and A. Babenko, “Learning to route in similarity graphs,” in ICML, vol. 97, 2019, pp. 475–484.
  • [33] Z. Zhou, S. Tan, Z. Xu, and P. Li, “Möbius transformation for fast inner product search on graph,” in NeurIPS, 2019, pp. 8216–8227.
  • [34] J. Wang and S. Li, “Query-driven iterated neighborhood graph search for large scale indexing,” in ACM MM, 2012, pp. 179–188.
  • [35] F. P. Preparata and M. I. Shamos, Computational Geometry - An Introduction.   Springer, 1985.
  • [36] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is ”nearest neighbor” meaningful?” in ICDT, 1999, pp. 217–235.
  • [37] D. François, V. Wertz, and M. Verleysen, “The concentration of fractional distances,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 7, pp. 873–886, 2007.
  • [38] J. He, S. Kumar, and S.-F. Chang, “On the difficulty of nearest neighbor search,” in ICML, 2012, pp. 1127–1134.
  • [39] M. E. Houle and M. Nett, “Rank-based similarity search: Reducing the dimensional dependence,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 136–150, 2015.
  • [40] D. R. Karger and M. Ruhl, “Finding nearest neighbors in growth-restricted metrics,” in STOC, 2002, pp. 741–750.
  • [41] M. E. Houle, H. Kashima, and M. Nett, “Generalized expansion dimension,” in ICDM Workshops, 2012, pp. 587–594.
  • [42] A. Rozza, G. Lombardi, C. Ceruti, E. Casiraghi, and P. Campadelli, “Novel high intrinsic dimensionality estimators,” Mach. Learn., vol. 89, no. 1-2, pp. 37–65, 2012.
  • [43] M. E. Houle, “Dimensionality, discriminability, density and distance distributions,” in ICDM Workshops, 2013, pp. 468–473.
  • [44] W. G. Aref, A. C. Catlin, J. Fan, A. K. Elmagarmid, M. A. Hammad, I. F. Ilyas, M. S. Marzouk, and X. Zhu, “A video database management system for advancing video database research,” in Multimedia Information Systems, 2002, pp. 8–17.
  • [45] R. Fagin, R. Kumar, and D. Sivakumar, “Efficient similarity search and classification via rank aggregation,” in SIGMOD, 2003, pp. 301–312.
  • [46] Y. Ke, R. Sukthankar, and L. Huston, “An efficient parts-based near-duplicate and sub-image retrieval system,” in ACM Multimedia, 2004, pp. 869–876.
  • [47] J. L. Bentley, “K-d trees for semidynamic point sets,” in SoCG, 1990, pp. 187–197.
  • [48] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in SIGMOD, 1984, pp. 47–57.
  • [49] N. Katayama and S. Satoh, “The sr-tree: An index structure for high-dimensional nearest neighbor queries,” in SIGMOD.   ACM Press, 1997, pp. 369–380.
  • [50] R. Weber, H.-J. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in VLDB.   Morgan Kaufmann, 1998, pp. 194–205.
  • [51] P. Ram and K. Sinha, “Revisiting kd-tree for nearest neighbor search,” in KDD, 2019, pp. 1378–1388.
  • [52] Y. Tao, K. Yi, C. Sheng, and P. Kalnis, “Quality and efficiency in high dimensional nearest neighbor search,” in SIGMOD, 2009, pp. 563–576.
  • [53] J. Gan, J. Feng, Q. Fang, and W. Ng, “Locality-sensitive hashing scheme based on dynamic collision counting,” in SIGMOD, 2012, pp. 541–552.
  • [54] Y. Zheng, Q. Guo, A. K. H. Tung, and S. Wu, “Lazylsh: Approximate nearest neighbor search for multiple distance functions with a single index,” in SIGMOD, 2016, pp. 2023–2037.
  • [55] A. Andoni, P. Indyk, T. Laarhoven, I. P. Razenshteyn, and L. Schmidt, “Practical and optimal LSH for angular distance,” in NIPS, 2015, pp. 1225–1233.
  • [56] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe lsh: Efficient indexing for high-dimensional similarity search,” in VLDB, 2007, pp. 950–961.
  • [57] Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen, “SK-LSH: an efficient index structure for approximate nearest neighbor search,” PVLDB, vol. 7, no. 9, pp. 745–756, 2014.
  • [58] M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” in WWW, 2005, pp. 651–660.
  • [59] J. Gao, H. V. Jagadish, B. C. Ooi, and S. Wang, “Selective hashing: Closing the gap between radius search and k-nn search,” in SIGKDD, 2015, pp. 349–358.
  • [60] T. Ge, K. He, Q. Ke, and J. Sun, “Optimized product quantization for approximate nearest neighbor search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2946–2953.
  • [61] A. Babenko and V. Lempitsky, “Tree quantization for large-scale similarity search and classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4240–4248.
  • [62] C. Fu and D. Cai, “Efanna: An extremely fast approximate nearest neighbor search algorithm based on knn graph,” arXiv preprint arXiv:1609.07228, 2016.