
Revisiting $k$-Nearest Neighbor Graph Construction on High-Dimensional Data: Experiments and Analyses

Yingfan Liu, Hong Cheng and Jiangtao Cui. Yingfan Liu and Jiangtao Cui are with the School of Computer Science and Technology, Xidian University, Xi'an, China (e-mail: {liuyingfan, cuijt}@xidian.edu.cn). Hong Cheng is with the Chinese University of Hong Kong, Hong Kong SAR, China (e-mail: hcheng@se.cuhk.edu.hk).
Abstract

The $k$-nearest neighbor graph (KNNG) on high-dimensional data is a data structure widely used in many applications such as similarity search, dimension reduction and clustering. Due to its increasing popularity, several methods under the same framework have been proposed in the past decade. This framework contains two steps, i.e., building an initial KNNG (denoted as INIT) and then refining it by neighborhood propagation (denoted as NBPG). However, several questions remain to be answered. First, the literature lacks a comprehensive experimental comparison among representative solutions. Second, some recently proposed indexing structures, e.g., SW and HNSW, have not been used or tested for building an initial KNNG. Third, the relationship between data properties and the effectiveness of NBPG is still not clear. To address these issues, we comprehensively compare the representative approaches on real-world high-dimensional data sets to provide practical and insightful suggestions for users. As a first attempt, we take SW and HNSW as alternatives for INIT in our experiments. Moreover, we investigate the effectiveness of NBPG and find a strong correlation between the hubness phenomenon and the performance of NBPG.

Index Terms:
$k$-nearest neighbor graph, neighborhood propagation, high-dimensional data, hubness, experimental analysis

1 Introduction

The $k$-nearest neighbor graph (KNNG) on high-dimensional data is a useful data structure for many applications in various domains such as computer vision, data mining and machine learning. Given a $d$-dimensional data set $D\subset\mathbb{R}^{d}$ and a positive integer $k$, a KNNG on $D$ treats each point $u\in D$ as a graph node and creates directed edges from $u$ to its $k$ nearest neighbors (KNNs) in $D\setminus\{u\}$. In this paper, we use the terms point, vector and node interchangeably.

KNNG is widely used as a building block for several fundamental operations, such as approximate $k$-nearest neighbor (ANN) search, manifold learning and clustering. The state-of-the-art ANN search methods (e.g., DPG [17] and NSG [10]) first construct an accurate KNNG on $D$ and then adjust the neighborhood of each node on top of the KNNG. In order to generate a low-dimensional and compact description of $D$, many manifold learning methods (e.g., LLE [25], Isomap [31], t-SNE [32] and LargeVis [29]) require its KNNG, which captures the local geometric properties of $D$. Besides, some clustering methods [35, 33, 8] improve their performance following the idea that each point and its KNNs should be assigned to the same cluster.

Since computing the exact KNNG is prohibitively expensive in practice, researchers have focused on the approximate KNNG to balance efficiency and accuracy. Nevertheless, an accurate KNNG is still crucial, because its accuracy affects the downstream applications. Take DPG [17], an ANN search method, as an example. We conduct ANN search with DPG on two widely-used data sets, Sift and Gist (whose details can be found in Section 3.5), and show the results in Figure 1. As the KNNG's accuracy increases, DPG's performance is clearly enhanced: less cost leads to more accurate results.

Figure 1: The effect of KNNG's accuracy on DPG's performance on (a) Sift and (b) Gist. Each curve corresponds to a KNNG built for DPG.

During the past decade, a few approximate methods [6, 34, 36, 29] have been proposed. In general, they follow the same framework that quickly constructs an initial KNNG with low or moderate accuracy and then refines it by neighborhood propagation. Existing methods usually employ index structures to construct the initial KNNG, including binary partition trees [6, 34], random projection trees [29] and locality sensitive hashing [36]. For neighborhood propagation, existing methods follow the same principle "a neighbor of my neighbor is also likely to be my neighbor" [9], despite some variations in technical details. In the rest of this paper, we denote the first step of constructing an initial KNNG as INIT and the second step of neighborhood propagation as NBPG.

Given the current research outcomes on KNNG construction in the literature, several questions remain to be answered. First, the literature lacks a comprehensive experimental comparison among representative solutions. This may be because the authors come from different research communities (e.g., data mining, computer vision, machine learning) and are not familiar with some other works. Second, some related techniques in the literature have not been considered for the purpose of KNNG construction, such as the SW graph [21] and HNSW [22], which are the state-of-the-art methods for ANN search. It remains unclear whether they are suitable for KNNG construction. Third, NBPG methods are effective on some data but much less so on other data. The relationship between the effectiveness of NBPG and data properties is not yet clear.

To address the above issues, we revisit existing methods and conduct a comprehensive experimental study of both the INIT and NBPG steps, to understand the strengths and limitations of different methods and provide deeper insights for users. In addition to existing methods for the INIT step, we adopt two related techniques for ANN search, i.e., SW and HNSW, as alternative methods for INIT, denoted as SW KNNG and HNSW KNNG respectively. Based on our extensive experiments, we recommend HNSW KNNG as the first choice for INIT.

We categorize NBPG methods into three categories, i.e., UniProp [6, 36, 29], BiProp [9] and DeepSearch [34]. Both UniProp and BiProp adopt an iterative approach to refine the initial KNNG step by step, but UniProp only considers nearest neighbors in NBPG, while BiProp considers both nearest neighbors and reverse nearest neighbors. In contrast, DeepSearch runs for only one iteration, which conducts ANN search on a rapidly-built online proximity graph, e.g., the initial KNNG in [34] or the HNSW graph as pointed out in this paper. According to our experimental results, no single method dominates in all cases. However, we do not recommend the popular UniProp, because it usually cannot achieve a high-accuracy KNNG and is sensitive to $k$. KGraph (an improved implementation of [9], which does not appear in the experiments of several existing works [6, 34, 36, 29]), a variant of BiProp, has the best performance for constructing a high-accuracy KNNG, but requires much more memory than other methods. Deep HNSW, a new combination under the framework, presents the best balance among efficiency, accuracy and memory requirement in most cases.

Besides the experimental study, we further investigate the effectiveness of NBPG with respect to data properties. We employ the definition of node hubness in [24] to characterize each node, where the hubness of a node is defined as the number of its exact reverse KNNs. Further, we extend this definition and define the data hubness of a data set as the normalized sum of the largest node hubness values of its top nodes up to a specified percentage, in order to characterize the data set. With these two definitions, our analyses yield two interesting findings. First, a high-hub data set usually exhibits worse NBPG performance than a low-hub data set. Second, a high-hub node usually achieves higher accuracy of its KNNs than a low-hub node.

We summarize our contributions as follows.

  • We revisit the framework of constructing KNNG and introduce new methods or combinations following it.

  • We conduct a comprehensive experimental comparison among representative solutions and provide our practical suggestions.

  • We investigate the effectiveness of NBPG and discover a strong correlation between its performance and the hubness phenomenon.

The rest of this paper is organized as follows. In Section 2, we define the problem and show the framework. In Section 3, we present INIT methods and conduct experimental evaluations on INIT. In Section 4, we describe various NBPG methods and show our experimental results. In Section 5, we investigate the effectiveness of NBPG mechanism and present our insights. In Section 6, we discuss related work. Finally, we conclude this paper in Section 7.

2 Overview of KNNG Construction

2.1 Problem Definition

Let $D\subset\mathbb{R}^{d}$ be a data set that consists of $n$ $d$-dimensional real vectors. The KNNG on $D$ is defined as follows.

Definition 1

[KNNG]. Given $D\subset\mathbb{R}^{d}$ and a positive integer $k$, let $G=(V,E)$ be the KNNG of $D$, where $V$ is the node set and $E$ is the edge set. Each node $u\in V$ uniquely represents a vector in $D$. A directed edge $(u,v)\in E$ exists iff $v$ is one of $u$'s KNNs in $D\setminus\{u\}$.

Figure 2 shows the KNNG of a data set with 6 data points, where each node has two out-going edges pointing to its 2NNs. For simplicity, we refer to the approximate KNNG as KNNG and approximate KNNs as KNNs in the rest of the paper. When we refer to the exact KNNG and exact KNNs, we mention them explicitly.

In this paper, we focus on in-memory solutions for high-dimensional dense vectors with the Euclidean distance as the distance measure, which is the most popular setting in the literature. This setting allows us to focus on a comprehensive comparison of existing works. We leave other settings, e.g., disk-based solutions, sparse data and other distance measures, to future work.

Figure 2: An illustrative KNNG with $n=6$ and $k=2$.

2.2 The Framework of KNNG Construction

Algorithm 1 shows the framework of KNNG construction, adopted by many methods in the literature [6, 34, 36, 29]. This framework contains two steps, i.e., INIT and NBPG. The philosophy of this framework is that INIT generates an initial KNNG quickly, and then NBPG efficiently refines it, following the principle "a neighbor of my neighbor is also likely to be my neighbor" [9]. Note that we do not expect INIT to pay a high cost for high accuracy, because the refinement is done in NBPG.

Input: Data set $D$ and a positive integer $k$
Output: KNNG $G$
1 INIT: generate an initial graph $G$ quickly;
2 NBPG: refine $G$ by neighborhood propagation;
3 return $G$;
Algorithm 1 Framework of KNNG Construction

In this paper, we will introduce a few representative approaches for INIT and NBPG respectively. For each method, we also analyze its memory requirement and time complexity. For the memory requirement, we only count the auxiliary data structures of each method, ignoring the common structures, i.e., the data $D$ and the KNNG $G$.

2.3 KNNG vs Proximity Graph

Proximity graphs (e.g., SW [21], HNSW [22], DPG [17] and NSG [10]) are the state-of-the-art methods for ANN search. Like KNNG, they treat each point $u$ as a graph node, but adopt different strategies to define $u$'s edges. The relationship between KNNG and proximity graphs goes both ways. On one hand, KNNG can be treated as a special type of proximity graph; hence, KNNG is used for ANN search in recent benchmarks [2, 1, 17], where it is usually denoted as KGraph, which is actually a KNNG construction method [9]. On the other hand, some proximity graphs such as DPG and NSG require an accurate KNNG as a prerequisite, so their construction cost is higher than that of KNNG. However, SW and HNSW are not built on top of KNNG and thus may be built faster than KNNG with adequate settings. As a new perspective, we show that SW and HNSW can also be used to build KNNG.

Moreover, we show ANN search on a proximity graph in Algorithm 2, which will be discussed more than once in this paper. Its core idea is similar to NBPG, i.e., "a neighbor of my neighbor is also likely to be my neighbor". Let $H$ be a proximity graph and $q$ be a query. The search starts from specified or randomly-selected node(s) $ep$, which are first pushed into the candidate set $pool$ (a sorted node list). Let $L=\max(k,efSearch)$ be the size of $pool$. In each iteration in Lines 4-11, we find the first unexpanded node $u$ in $pool$ and then expand $u$ in Lines 6-10, denoted as $expand(q,u,H)$. This expansion treats each neighbor $v$ of $u$ on $H$ as a candidate to refine $pool$ in Lines 8-10, which is denoted as $update(pool,v)$. Once the first $efSearch$ nodes in $pool$ have been expanded, the search terminates and returns the first $k$ nodes in $pool$. Obviously, $efSearch$ is key to the search performance: a large $efSearch$ increases both the cost and the accuracy.

Input: $H$, $q$, $k$, $efSearch$ and $ep$
Output: KNNs of $q$
1 let $pool$ be the candidate set and push $ep$ into $pool$;
2 $L=\max(k,efSearch)$ and $i=0$;
3 while $i<efSearch$ do
4    $u=pool[i]$ and mark $u$ as expanded;
5    /* Procedure: $expand(q,u,H)$ */
6    for each neighbor $v$ of $u$ on $H$ do
7       /* Procedure: $update(pool,v)$ */
8       $pool.add(v,dist(q,v))$;
9       sort $pool$ in ascending order of $dist(q,\cdot)$;
10      if $pool.size()>L$ then $pool.resize(L)$;
11   $i=$ index of the first unexpanded node in $pool$;
12 return the first $k$ nodes in $pool$;
Algorithm 2 $Search\_on\_Graph(H,q,k,efSearch,ep)$
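To make Algorithm 2 concrete, the following is a minimal Python sketch of the greedy search procedure. The function name, the adjacency-list representation of the graph (a dict from node id to neighbor ids over a NumPy matrix of vectors) and the entry-point list are our own simplifications, not taken from any public implementation.

```python
import numpy as np

def search_on_graph(H, data, q, k, ef_search, ep):
    """Greedy ANN search on a proximity graph H (sketch of Algorithm 2).

    H    : dict mapping node id -> list of neighbor ids
    data : (n, d) NumPy array of database vectors
    q    : (d,) query vector
    ep   : iterable of entry-point node ids
    """
    L = max(k, ef_search)
    dist = lambda v: float(np.linalg.norm(data[v] - q))
    pool = sorted({v: dist(v) for v in ep}.items(), key=lambda x: x[1])
    expanded = set()
    i = 0
    while i < ef_search and i < len(pool):
        u = pool[i][0]
        expanded.add(u)                          # mark u as expanded
        seen = {v for v, _ in pool}
        for v in H[u]:                           # expand(q, u, H)
            if v in seen:
                continue
            pool.append((v, dist(v)))            # update(pool, v)
            seen.add(v)
        pool.sort(key=lambda x: x[1])
        pool = pool[:L]
        # index of the first unexpanded node in pool
        i = next((j for j, (v, _) in enumerate(pool) if v not in expanded),
                 len(pool))
    return [v for v, _ in pool[:k]]
```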

3 Constructing Initial Graphs

In this section, we introduce several representative methods for INIT. They can be classified into two broad categories: the partition-based approach and the small world based approach. The main idea of the former is to partition the data points into sufficiently small but intra-similar groups. Then each point finds its KNNs in the most promising group(s) and an initial KNNG is generated. The partition-based approach includes Multiple Division [34], LSH KNNG [36], and LargeVis [29], which adopt different partitioning techniques. The main idea of the small world based approach is to create a navigable small world graph among the data points in a greedy manner, the structure of which is helpful for identifying the KNNs of any data point. The small world based approach was proposed for ANN search in the literature, not for KNNG construction, but we find it straightforward to extend it for this purpose. So we include these techniques in our study, denoted as SW KNNG [21] and HNSW KNNG [22]. We conduct extensive experiments to test and compare these methods for INIT.

3.1 Multiple Random Division

Figure 3: Illustration of the division of $D$ by the principal direction (represented by the solid arrow). As shown by the dashed line, the splitting hyperplane perpendicular to the principal direction divides $D$ into two disjoint groups, $D_{1}$ and $D_{2}$, which are further partitioned recursively.
Figure 4: Illustration of three random divisions of a data set $D$. For a data point $u$ denoted as a star, its KNNs are computed in the group of each division, and then combined to return the best KNNs.
Figure 5: The process of creating a division by LSH in LSH KNNG.

Wang et al. [34] proposed a multiple random divide-and-conquer approach to build base approximate neighborhood graphs and then ensemble them together to obtain an initial KNNG. Their method, which we denote as Multiple Division, recursively partitions the data points in $D$ into two disjoint groups. To assign nearby points to the same group, the partitioning is done along the principal direction, which can be generated rapidly [34]. The data set $D$ is partitioned into two disjoint groups $D_{1}$ and $D_{2}$ (as illustrated in Figure 3), which are further split recursively until the group to be split contains at most $T_{div}$ points. Then a neighborhood subgraph for each finally-generated group is built in a brute-force manner, which costs $O(d*T_{div}^{2})$.

A single division as described above yields a base approximate neighborhood graph containing a series of isolated subgraphs, and it is unable to connect a point with its closer neighbors lying in different subgraphs. Thus, Multiple Division uses $L_{div}$ random divisions. Rather than computing the principal direction from the whole group, Multiple Division computes it over a random sample of the group. Each point $u\in D$ obtains a set of KNNs from each division, thus $L_{div}$ sets of KNNs in total, from which the best KNNs are generated. We illustrate this process in Figure 4.

Cost Analysis: Since the $L_{div}$ divisions can be generated independently, the memory requirement is $O(n)$ to store the current division. For each division, the recursive partition costs $O(n*d*\log n)$ and the brute-force procedure on all finally-generated groups requires $O(n*d*T_{div})$. With $L_{div}$ divisions, the cost of Multiple Division is $O(L_{div}*n*d*(\log n+T_{div}))$. We can see that $L_{div}$ and $T_{div}$ are key to the cost of Multiple Division.
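As a rough illustration of one such division, the Python sketch below recursively splits a set of point ids along a principal direction estimated from a random sample, and solves each leaf group by brute force. The sample size of 100 and the SVD-based estimate of the principal direction are our simplifying assumptions, not the exact procedure of [34].

```python
import numpy as np

def divide(data, ids, T_div, rng, groups):
    """Recursively split `ids` (NumPy int array) along a sampled principal direction."""
    if len(ids) <= T_div:
        groups.append(ids)
        return
    sample = data[rng.choice(ids, size=min(100, len(ids)), replace=False)]
    centered = sample - sample.mean(axis=0)
    # leading right-singular vector approximates the principal direction
    direction = np.linalg.svd(centered, full_matrices=False)[2][0]
    order = np.argsort(data[ids] @ direction)
    half = len(ids) // 2
    divide(data, ids[order[:half]], T_div, rng, groups)
    divide(data, ids[order[half:]], T_div, rng, groups)

def brute_force_knn(data, group, k):
    """Exact KNNs within one leaf group: {u: [k nearest ids in group]}."""
    sub = data[group]
    sq = (sub ** 2).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (sub @ sub.T)   # squared distances
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]
    return {int(group[i]): [int(group[j]) for j in nn[i]] for i in range(len(group))}
```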

3.2 LSH KNNG

Zhang et al. [36] proposed a method, denoted as LSH KNNG, which uses the locality sensitive hashing (LSH) technique to divide $D$ into groups of equal size, and then builds a neighborhood graph on each group using the brute-force method. To enhance the accuracy, similar to Multiple Division, $L_{hash}$ divisions are created to build $L_{hash}$ base approximate neighborhood graphs, which are then combined to generate an initial KNNG of higher accuracy.

The details of such a division are illustrated in Figure 5. For each $d$-dimensional vector, LSH KNNG uses Anchor Graph Hashing (AGH) [18], an LSH technique, to embed it into a binary vector in $\{-1,1\}^{b}$. Then, the binary vector is transformed into a one-dimensional projection onto a random vector in $\mathbb{R}^{b}$. All points in $D$ are sorted in ascending order of their one-dimensional projections and then divided into equal-size disjoint groups accordingly. Let $T_{hash}$ be the size of such a group. A neighborhood graph is then built on each group in a brute-force manner.

Cost Analysis: Like Multiple Division, each LSH division can be built independently. Hence, its memory requirement is caused by the binary codes and the division result, i.e., $O(n*b)$ in total. In each division, generating binary codes costs $O(n*d*b)$ and the brute-force procedure on all groups costs $O(n*d*T_{hash})$. In total, the cost of LSH KNNG is $O(L_{hash}*n*d*(b+T_{hash}))$. We can see that the cost of LSH KNNG is sensitive to $L_{hash}$ and $T_{hash}$, since $b$ is usually a fixed value.
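A simplified Python sketch of one LSH division follows. We stand in for the AGH codes with random signed projections, which is an assumption for illustration only; the brute-force step inside each group is the same as in Multiple Division.

```python
import numpy as np

def lsh_division(data, b, T_hash, rng):
    """One division for LSH KNNG: returns a list of (roughly) equal-size groups of ids.

    Binary codes are produced by random hyperplanes here as a stand-in for AGH.
    """
    n, d = data.shape
    W = rng.standard_normal((d, b))
    codes = np.where(data @ W >= 0.0, 1.0, -1.0)    # n x b codes in {-1, 1}
    r = rng.standard_normal(b)
    proj = codes @ r                                # 1-D projection of each code
    order = np.argsort(proj)
    return [order[i:i + T_hash] for i in range(0, n, T_hash)]
```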

3.3 LargeVis

Tang et al. [29] proposed a method, denoted as LargeVis, to construct a KNNG for the purpose of visualizing large-scale and high-dimensional data. Like Multiple Division, LargeVis recursively partitions the data points in $D$ into two small groups, but the partitioning is along a random projection (RP) that connects two randomly-sampled points from the group to be split. The partitioning results are organized as a tree called an RP tree [7]. Similar to Multiple Division, $L_{tree}$ RP trees are built to enhance the accuracy. An initial KNNG is then generated by conducting a KNN search procedure for each point on those trees. Each search process terminates after retrieving $L_{tree}*k$ candidates (summarized according to the public code from the authors: https://github.com/lferry007/LargeVis).

Cost Analysis: Unlike Multiple Division and LSH KNNG, we have to store all $L_{tree}$ RP trees in memory, since the ANN search for each point is done on those trees simultaneously. Each tree requires $O(n*d)$ space, since each inner node stores a random projection vector. Hence, the total memory requirement is $O(L_{tree}*n*d)$. For each query, the main cost is to reach the leaf node from the root node in each tree and then verify $L_{tree}*k$ candidates. Hence, the total cost of LargeVis is $O(L_{tree}*n*d*(\log n+k))$. We can see that $L_{tree}$ and $k$ are key to the cost of LargeVis.
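The random-projection split used by an RP tree can be sketched as below: two points are sampled from the group, and the group is split by the hyperplane bisecting the segment between them. This is a minimal illustration under our own naming; it omits the tree bookkeeping and the KNN search over multiple trees.

```python
import numpy as np

def rp_split(data, ids, rng):
    """Split `ids` by the perpendicular bisector of two randomly sampled points."""
    a, b = data[rng.choice(ids, size=2, replace=False)]
    direction = b - a
    midpoint = (a + b) / 2.0
    side = (data[ids] - midpoint) @ direction > 0.0
    return ids[~side], ids[side]

def build_rp_leaves(data, ids, leaf_size, rng, leaves):
    """Recursively split until groups are small; collect the leaf groups."""
    if len(ids) <= leaf_size:
        leaves.append(ids)
        return
    left, right = rp_split(data, ids, rng)
    if len(left) == 0 or len(right) == 0:   # degenerate split, stop here
        leaves.append(ids)
        return
    build_rp_leaves(data, left, leaf_size, rng, leaves)
    build_rp_leaves(data, right, leaf_size, rng, leaves)
```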

3.4 Small World based Approach

The SW graph [21] and the HNSW graph [22] are the state-of-the-art methods for ANN search [1, 17]. In the literature, neither technique has been used for INIT. According to their working mechanisms, it is straightforward to extend them for INIT, but a thorough comparison with other INIT methods in the literature is still needed.

Figure 6: The insertion of data point $u$ in SW. $v$ and $w$ are the neighbors found by ANN search; undirected edges (dashed lines) are then created. Bold arrows represent the search path.

The SW graph $G_{sw}$ construction process starts from an empty graph, and iteratively inserts each data point $u$ into $G_{sw}$. In each insertion, we first treat $u$ as a new node and then conduct the ANN search $\mathcal{A}=Search\_on\_Graph(G_{sw},u,M_{sw},efConstruction,ep)$ as in Algorithm 2, where $G_{sw}$ is the current SW graph and contains only a part of $D$. Afterwards, undirected edges are created between $u$ and the $M_{sw}$ neighbors in $\mathcal{A}$. Figure 6 illustrates the insertion process for a data point $u$. It is easy to generate an initial KNNG during the construction of $G_{sw}$. Before building $G_{sw}$, each point's KNN set is initialized as empty. When inserting $u$, its similar candidates are retrieved by the search process and a distance is computed between $u$ and each candidate $v$. We use $v$ (resp. $u$) to refine the KNN set of $u$ (resp. $v$). Once the SW graph is constructed, an initial KNNG is also generated. For simplicity, this method is denoted as SW KNNG.

Cost Analysis: The main memory requirement of SW KNNG is to store the SW graph, which needs $O(n*M_{sw})$ space. Its main cost is to insert $n$ nodes. By experiments, we find that each insertion expands $O(efConstruction)$ nodes, as demonstrated in Appendix A. However, there is no explicit upper bound on the neighborhood size. Let $M_{sw}^{max}$ be the maximum neighborhood size in $G_{sw}$ ($M_{sw}^{max}$ is specific to the data set and is obviously affected by the hubness phenomenon, as discussed in Section 5.1). In the worst case, the time complexity is $O(n*d*efConstruction*M_{sw}^{max})$.
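The way SW KNNG harvests KNN candidates while the graph is being built can be sketched as follows, reusing the search_on_graph function given after Algorithm 2. The names and the random entry point are our own simplifications; for brevity the sketch only refines the KNN sets with the $M_{sw}$ returned neighbors, whereas the full method uses every candidate whose distance was computed during the search.

```python
import numpy as np

def build_sw_knng(data, k, M_sw, ef_construction, rng):
    """Incrementally build an SW graph and harvest an initial KNNG along the way."""
    n = len(data)
    sw = {0: []}                                 # adjacency list of the SW graph
    knn = {u: [] for u in range(n)}              # u -> sorted list of (dist, v)

    def update_knn(u, v, d):
        knn[u].append((d, v))
        knn[u].sort()
        del knn[u][k:]

    for u in range(1, n):
        ep = [int(rng.integers(u))]              # random entry point among inserted nodes
        # search_on_graph: the sketch given after Algorithm 2
        neighbors = search_on_graph(sw, data, data[u], M_sw, ef_construction, ep)
        sw[u] = []
        for v in neighbors:                      # undirected edges u <-> v
            sw[u].append(v)
            sw[v].append(u)
            d = float(np.linalg.norm(data[u] - data[v]))
            update_knn(u, v, d)                  # refine both KNN sets
            update_knn(v, u, d)
    return {u: [v for _, v in knn[u]] for u in range(n)}
```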

The HNSW graph $G_{hnsw}$ is a hierarchical structure with multiple layers. A higher layer contains fewer data points, and layer 0 contains all data points in $D$. To insert a new point $u$, an integer maximum layer $l_{h}(u)$ is randomly selected so that $u$ appears in layers 0 to $l_{h}(u)$. To insert $u$, the search process starts from the entry point $ep$ in the highest layer. The nearest node found in the current layer serves as the entry point of the search in the next layer, and this process repeats until layer 0. Note that the search in each layer follows Algorithm 2. After the search, HNSW creates undirected edges between $u$ and $M_{hnsw}$ neighbors selected from the $efConstruction$ returned ones, which is done in layers $l_{h}(u)$ down to 0 respectively. Unlike SW, HNSW imposes a threshold $M_{hnsw}$ on the neighborhood size. Once the number of edges of a node exceeds $M_{hnsw}$, HNSW prunes them by a link diversification strategy. Figure 7 illustrates an insertion in HNSW. By the same process as in SW KNNG, we build the initial KNNG by tracing each pair of distance computations when building the HNSW graph. We denote this method as HNSW KNNG.

Figure 7: The insertion of data point $u$ in HNSW with $l_{h}(u)=1$. After the search, undirected edges (dashed lines) are added between $u$ and its neighbors in both layer 0 and layer 1.

Cost Analysis: The main memory requirement of HNSW KNNG is caused by the HNSW graph, which contains $O(n)$ nodes, each with up to $M_{hnsw}$ neighbors. Hence, its memory requirement is $O(n*M_{hnsw})$. The cost of HNSW KNNG mainly consists of two parts, i.e., the ANN search for each node and the pruning operations for oversize nodes. We find that each search expands $O(efConstruction)$ nodes, as demonstrated in Appendix B, and thus the total cost of ANN search is $O(n*d*M_{hnsw}*efConstruction)$. Each pruning operation costs $O(M_{hnsw}^{2}*d)$, but the exact number of such operations is unknown in advance. In the worst case, adding each undirected edge leads to a pruning operation, giving $n*M_{hnsw}$ pruning operations in total. However, the practical number of pruning operations on real data is far smaller than $n*M_{hnsw}$, as shown in Appendix B. Overall, the total cost of HNSW KNNG is $O(n*d*M_{hnsw}*(efConstruction+M_{hnsw}^{2}))$ in the worst case.
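The link diversification step mentioned above can be sketched as follows. This follows a commonly described heuristic of keeping a candidate only if it is closer to the base node than to every neighbor already kept; it should be read as an approximation in our own naming, not the exact rule of the reference HNSW implementation.

```python
import numpy as np

def prune_neighbors(data, u, candidates, M_hnsw):
    """Keep at most M_hnsw diverse neighbors of node u (heuristic pruning sketch).

    A candidate v is kept only if it is closer to u than to every neighbor
    already kept, which discourages redundant edges pointing in the same direction.
    """
    order = sorted(candidates, key=lambda v: np.linalg.norm(data[v] - data[u]))
    kept = []
    for v in order:
        d_uv = np.linalg.norm(data[v] - data[u])
        if all(d_uv < np.linalg.norm(data[v] - data[w]) for w in kept):
            kept.append(v)
        if len(kept) == M_hnsw:
            break
    return kept
```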

3.5 Experimental Evaluation of INIT

3.5.1 Experimental Settings

We first describe the experimental settings in this paper, for both INIT methods and NBPG methods.

Data Sets. We use 8 real data sets with various sizes and dimensions from different domains, as listed in Table I. Sift (http://corpus-texmex.irisa.fr) contains 1,000,000 128-dimensional SIFT vectors. Nuscm (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) consists of 269,648 225-dimensional block-wise color moments. Nusbow (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) has 269,648 500-dimensional bags of words based on SIFT descriptors. Gist (http://corpus-texmex.irisa.fr) contains 1,000,000 960-dimensional Gist vectors. Glove (http://nlp.stanford.edu/projects/glove/) consists of 1,193,514 100-dimensional word features extracted from Tweets. Msdrp (http://www.ifs.tuwien.ac.at/mir/msd/download.html) contains 994,185 1440-dimensional Rhythm Patterns extracted from the same number of contemporary popular music tracks. Msdtrh (http://www.ifs.tuwien.ac.at/mir/msd/download.html) contains 994,185 420-dimensional Temporal Rhythm Histograms extracted from the same tracks. Uqv (http://staff.itee.uq.edu.au/shenht/UQ_VIDEO/index.htm) contains 3,305,525 256-dimensional feature vectors extracted from the keyframes of a video data set.

TABLE I: Data Summary: the space is measured in MB and the cost of the baseline method is measured in seconds.
Data n d Space Baseline Domain
Sift 1,000,000 128 488 6,115 Image
Nuscm 269,648 225 231 803 Image
Nusbow 269,648 500 514 1,850 Image
Gist 1,000,000 960 3,662 63,430 Image
Glove 1,193,514 100 455 6,648 Text
Msdrp 994,185 1,440 5,461 87,630 Audio
Msdtrh 994,185 420 1,593 24,053 Audio
Uqv 3,305,525 256 3,228 178,089 Video

Performance Indicators. We evaluate the performance of each method by three factors, i.e., cost, accuracy and memory requirement. We use the execution time to measure the construction cost, denoted as $cost$. The accuracy is measured by $recall$. Given a KNNG $G$, let $N_{k}(u)$ be the KNNs of $u$ from $G$, and $N_{k}^{*}(u)$ be its exact KNNs. Then $recall(u)=|N_{k}^{*}(u)\cap N_{k}(u)|/k$. The $recall$ of $G$ is averaged over all nodes in $G$. The memory requirement is estimated by the memory space occupied by the auxiliary structures of each method. By default, $k$ is set to 20 in this paper.
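The recall measure is straightforward to compute once the exact KNNs are available; a short sketch, assuming both graphs are stored as plain id lists:

```python
def knng_recall(knng, exact_knng, k):
    """Average recall of an approximate KNNG against the exact one.

    knng, exact_knng : dict mapping node id -> list of (at most k) neighbor ids
    """
    total = 0.0
    for u, approx in knng.items():
        total += len(set(approx) & set(exact_knng[u])) / k
    return total / len(knng)
```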

Computing Environments. All experiments are conducted on a workstation with an Intel(R) Xeon(R) E5-2697 CPU and 512 GB memory. The codes are implemented in C++ and compiled by g++4.8 with the -O3 option. We disable SIMD instructions to make a fair comparison, since some public codes are implemented without enabling those instructions. We run those methods in parallel (for two reasons: all the methods are easy to parallelize, and there are many experiments in this paper), and the number of threads is set to 16 by default.

3.5.2 Compared Methods and Parameter Selection

We evaluate the five methods described in this section. As listed below, their parameters are carefully selected following the goal of INIT, i.e., rapidly generating an initial KNNG with low or moderate accuracy. Note that the refinement will be done in NBPG.

  • Multiple Division: $T_{div}=500$, as in [34], and $L_{div}$ (10 by default) is varied in $\{2,4,6,8,10\}$.

  • LSH KNNG: $T_{hash}=200$, optimized by experiments, and $L_{hash}$ is varied in $\{5,10,20,40\}$.

  • LargeVis: we vary $L_{tree}$ (50 by default) in $\{25,50,75,100\}$ and use default settings for the others.

  • SW KNNG: $M_{sw}=k$ and $efConstruction$ is varied in $\{5,10,20,40,80\}$.

  • HNSW KNNG: $M_{hnsw}=20$, optimized by experiments, and $efConstruction$ (80 by default) is varied in $\{20,40,80,160\}$.

Moreover, we use the brute-force method as the baseline, which requires $n*(n-1)/2$ distance computations. We list the cost of the baseline method in Table I.

3.5.3 Experimental Results

Figure 8: Comparing INIT methods on (a) Sift, (b) Nuscm, (c) Nusbow, (d) Gist, (e) Glove, (f) Msdrp, (g) Msdtrh and (h) Uqv. $cost$ is measured in seconds.

A good INIT method should strike a good balance between $cost$ and $recall$. We show the experimental results of $cost$ vs $recall$ in Figure 8. We can see that the small world based approaches clearly outperform existing methods that employ traditional index structures. Compared with SW KNNG, HNSW KNNG presents a stable balance between $cost$ and $recall$. We notice that on data sets such as Nusbow, Glove and Msdrp, SW KNNG presents both high accuracy and high cost. As mentioned previously, this is not the purpose of INIT. We believe this is caused by the hubness phenomenon, as discussed later in Section 5.1.

Multiple Division performs better than LargeVis and LSH KNNG. This superior performance may come from the principal-direction-based projection, which is more effective than LSH functions and random projections for data partitioning. On Nusbow, Msdtrh and Uqv, LSH KNNG fails to construct a KNNG due to runtime errors in computing the eigenvectors when training the AGH model.

We show the memory requirements of the auxiliary data structures of the INIT methods with their default settings in Table II. We can see that LargeVis requires the most memory due to storing 50 RP trees, followed by HNSW KNNG and SW KNNG, which need to store an HNSW graph and an SW graph respectively. Due to the independence of multiple divisions, Multiple Division and LSH KNNG only store the structures for one division in memory and thus require noticeably less memory.

TABLE II: Memory Requirement (MB) of Auxiliary Structures for INIT Methods
Data Multiple Division LSH KNNG LargeVis SW KNNG HNSW KNNG
Sift 7.7 30.5 722 164 180
Nuscm 2.3 8.2 243 44 49
Nusbow 3.2 8.2 329 44 49
Gist 11.3 30.5 874 164 179
Glove 9.2 36.4 1,208 196 215
Msdrp 16 30.3 1,013 163 179
Msdtrh 8.4 30.3 881 163 179
Uqv 25.6 100.9 2,488 542 654

Summary. For INIT, our first recommendation is HNSW KNNG, because it presents consistently superior performance over the other competitors. Even though its memory requirement is the second largest, it is still acceptable, especially compared to the data size shown in Table I. Our second recommendation is Multiple Division, which has good performance and requires little memory. Without information about the data hubness, we do not recommend SW KNNG, since the user may encounter huge cost to complete the task.

4 Neighborhood Propagation

In this section, we study the NBPG technique in the framework of KNNG construction, which is used to refine the initial KNNG. In general, all methods in the literature follow the same principle: “a neighbor of my neighbor is also likely to be my neighbor” [9], but differ in the technical details. We categorize them into three categories: UniProp [36, 29], BiProp [9] and DeepSearch [34], and describe their core ideas as follows.

4.1 UniProp

As a popular NBPG method, UniProp stands for uni-directional propagation, which iteratively refines a node's KNNs from the KNNs of its KNNs [36, 29]. Let $N_{k}^{t}(u)$ be the KNNs of $u$ found in iteration $t$. We can obtain $N_{k}^{0}(u)$ from the initial KNNG constructed in the INIT step. Algorithm 3 shows how to iteratively refine the KNNs of each node. In iteration $t$, we initially set $N_{k}^{t}(u)$ to $N_{k}^{t-1}(u)$ and then conduct $update(N_{k}^{t}(u),w)$ (as in Algorithm 2) for each $w\in\cup_{v\in N_{k}^{t-1}(u)}N_{k}^{t-1}(v)$. UniProp repeats for $nIter$ iterations and then terminates.

Cost Analysis: UniProp needs to store the last-iteration KNNG. Hence its space requirement is $O(n*k)$. Its time complexity is $O(nIter*n*d*k^{2})$ in the worst case.

Input: $D$, $k$, $nIter$, initial KNNG $G$
Output: $G$
1 for $t:1\to nIter$ do
2    for each $u$ in $D$ do
3       $N_{k}^{t}(u)=N_{k}^{t-1}(u)$;
4       for each $v\in N_{k}^{t-1}(u)$ do
5          for each $w\in N_{k}^{t-1}(v)$ do
6             $update(N_{k}^{t}(u),w)$;
7 return $G$;
Algorithm 3 UniProp
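A direct Python transcription of Algorithm 3 is given below, keeping each node's neighbors as a sorted list of (distance, id) pairs; the function and variable names are our own.

```python
import numpy as np

def uniprop(data, knng, k, n_iter):
    """UniProp: refine each node's KNNs using the KNNs of its KNNs (Algorithm 3)."""
    dist = lambda u, v: float(np.linalg.norm(data[u] - data[v]))
    # knng: dict u -> list of neighbor ids; convert to sorted (dist, id) pools
    pools = {u: sorted((dist(u, v), v) for v in nbrs) for u, nbrs in knng.items()}

    def update(pool, u, w):
        if w == u or any(w == v for _, v in pool):
            return
        pool.append((dist(u, w), w))
        pool.sort()
        del pool[k:]

    for _ in range(n_iter):
        prev = {u: [v for _, v in pool] for u, pool in pools.items()}
        for u in pools:
            for v in prev[u]:
                for w in prev[v]:
                    update(pools[u], u, w)
    return {u: [v for _, v in pool] for u, pool in pools.items()}
```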

4.2 BiProp

In contrast to UniProp, another type of method considers both the nearest neighbors and the reverse nearest neighbors of a node for neighborhood propagation [9]. We thus denote it as BiProp, which stands for bi-directional propagation. For each node $u$, BiProp maintains a larger neighborhood $u.pool$ of size $L\geq k$, which contains $u$'s $L$ nearest neighbors found so far. BiProp also takes an iterative strategy. Let $N_{m}^{t}(u)$ be the $m$ nearest neighbors of $u$ in iteration $t$, where $m$ is a variable and does not necessarily equal $k$. Note that $N_{m}^{t}(u)\subset u.pool$. We use $R_{m}^{t}(u)$ to denote $u$'s reverse neighbor set $\{v|u\in N_{m}^{t}(v)\}$, and thus $\sum_{u\in D}|R_{m}^{t}(u)|=\sum_{v\in D}|N_{m}^{t}(v)|$. Then we denote the candidate set for neighborhood propagation as:

$B_{m}^{t}(u)=N_{m}^{t}(u)\cup R_{m}^{t}(u).$

We show the procedure of BiProp in Algorithm 4. $u.pool$ is initialized randomly, and so is $B_{m}^{0}(u)$. In iteration $t$, a brute-force search is conducted on $B_{m}^{t-1}(u)$ for each $u$, as in Lines 4 to 9. At the end of each iteration, BiProp updates $N_{m}^{t}(u)$ and $B_{m}^{t}(u)$ in Lines 10 and 11.

In practice, the reverse neighbor set $R_{m}^{t}(u)$ may have a large cardinality (e.g., tens of thousands or even more) for some $u\in D$. Processing such a super node in one iteration costs as much as $O(d*(m+|R_{m}^{t}(u)|)^{2})$. To address this issue, BiProp imposes a threshold $T$ on $|R_{m}^{t}(u)|$, and oversize sets are reduced by random shuffling in each iteration.

Both NNDes (a well-known KNNG construction method, publicly available at https://code.google.com/archive/p/nndes) and KGraph (publicly available at https://github.com/aaalgo/kgraph; it is well known as a method for ANN search, but ignored for KNNG construction) follow the idea of BiProp. Their key difference is how $m$ and $L$ are set. In NNDes, $L=k$ and $m\leq k$ for all nodes, which is specified in advance. In KGraph, $L$ is a specified value, but $m$ varies with the node $u$ and the iteration $t$, denoted as $m^{t}(u)$. To determine $m^{t}(u)$, three rules must be satisfied simultaneously, summarized from the source code. Let $\delta$ be a small value (e.g., 10).

  • Rule 1: $m^{0}(u)=\delta$ and $m^{t}(u)-m^{t-1}(u)\leq\delta$.

  • Rule 2: $|N_{m}^{t}(u)\setminus N_{m}^{t-1}(u)|\leq\delta$.

  • Rule 3: $m^{t}(u)$ increases iff $\exists v\in B_{m}^{t-1}(u)$ such that the brute-force search on $B_{m}^{t-1}(u)$ refines $N_{k}^{t}(v)$.

Cost Analysis: In NNDes, we need to store $N_{m}^{t}(\cdot)$ and $B_{m}^{t}(\cdot)$. Since $m\leq k$ and $T=k$, the required memory is $O(n*k)$. Its time complexity is $O(nIter*n*d*k^{2})$ in the worst case. In KGraph, we need to store $u.pool$ and $B_{m}^{t}(u)$ for each $u$. $|N_{m}^{t}(u)|$ is bounded by $L$ and thus $|B_{m}^{t}(u)|\leq L+T$. Its memory requirement is $O(n*L)$, which can be large since $L$ is usually set to a large value (e.g., 100). Its time complexity is $O(nIter*n*d*(L+T)^{2})$ in the worst case, but it is very fast in practice due to the filters discussed in Section 4.4.

Input: $D$, $k$, $nIter$
Output: $G$
1 for each $u\in D$ do
2    randomly initialize $u.pool$ and $B_{m}^{0}(u)$;
3 for $t:1\to nIter$ do
4    for each $u$ in $D$ do
5       /* Procedure: $brute\_force(B_{m}^{t-1}(u))$ */
6       for each $v$ in $B_{m}^{t-1}(u)$ do
7          for each $w$ in $B_{m}^{t-1}(u)$ and $v<w$ do
8             $update(v.pool,w)$;
9             $update(w.pool,v)$;
10   for each $u$ in $D$ do
11      update $N_{m}^{t}(u)$ and $B_{m}^{t}(u)$;
12 for each $u$ in $D$ do
13    $N_{k}(u)=$ the first $k$ nodes in $u.pool$;
14 return $G$;
Algorithm 4 BiProp
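An NNDes-style Python sketch of Algorithm 4 follows, with a fixed $m=k$, a random initial pool and the reverse-neighbor cap $T$; KGraph's adaptive $m^{t}(u)$ (Rules 1-3) is omitted for brevity, and all names are ours.

```python
import numpy as np

def biprop(data, k, n_iter, T, rng):
    """NNDes-style BiProp (Algorithm 4) with fixed m = k and reverse-neighbor cap T."""
    n = len(data)
    dist = lambda u, v: float(np.linalg.norm(data[u] - data[v]))
    # random initial pools: u -> sorted list of (dist, v)
    pools = {u: sorted((dist(u, v), v)
                       for v in rng.choice(n, size=k, replace=False) if v != u)
             for u in range(n)}

    def update(u, w):
        pool = pools[u]
        if w == u or any(w == v for _, v in pool):
            return
        pool.append((dist(u, w), w))
        pool.sort()
        del pool[k:]

    for _ in range(n_iter):
        # candidate set B(u) = forward neighbors plus (capped) reverse neighbors
        B = {u: {v for _, v in pools[u]} for u in range(n)}
        reverse = {u: [] for u in range(n)}
        for u in range(n):
            for _, v in pools[u]:
                reverse[v].append(u)
        for u in range(n):
            capped = rng.permutation(reverse[u])[:T]   # random shuffle, keep at most T
            B[u].update(int(x) for x in capped)
        for u in range(n):                             # brute force inside B(u)
            cand = list(B[u])
            for i, v in enumerate(cand):
                for w in cand[i + 1:]:
                    update(v, w)
                    update(w, v)
    return {u: [v for _, v in pools[u]] for u in range(n)}
```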

4.3 DeepSearch

Wang et al. introduced the initial idea of DeepSearch [34], shown in Algorithm 5. It follows the idea of ANN search on a proximity graph $H$, which is created online before neighborhood propagation. For each $u\in D$, it refines $N_{k}(u)$ by conducting ANN search with starting nodes taken from the initial KNNG. Unlike UniProp and BiProp, DeepSearch runs for only one iteration.

The proximity graph $H$ is key to the performance of DeepSearch. An appropriate $H$ should satisfy two conditions. First, $H$ should be built rapidly, since it is built online. Second, $H$ should have good ANN search performance. In [34], $H$ is the initial KNNG built by Multiple Division, denoted as DeepMdiv, which satisfies the first condition but fails the second. Among existing proximity graphs, HNSW is a good choice for DeepSearch, as it satisfies both conditions; we denote this method as Deep HNSW. Others such as DPG [17] and NSG [10] require an expensive modification of an online-built KNNG, which violates the first condition, so they are not suitable for DeepSearch. The construction cost of SW is unstable, as shown in Section 3.5, and its ANN search performance is usually worse than other proximity graphs [17]; hence SW is not suitable either.

Cost Analysis: DeepSearch needs to store the proximity graph. Thus DeepMdiv requires $O(n*k)$ and Deep HNSW needs $O(n*M_{hnsw})$. The cost of DeepSearch is determined by two factors, i.e., the number of expanded nodes and their neighborhood sizes. We find that both DeepMdiv and Deep HNSW expand $O(efSearch)$ nodes, as demonstrated in Appendix C. Each node on the KNNG has $k$ neighbors and thus the cost of DeepMdiv is $O(n*d*efSearch*k)$. Each node on the HNSW graph has at most $M_{hnsw}$ edges and thus the cost of Deep HNSW is $O(n*d*efSearch*M_{hnsw})$.

Input: $k$, proximity graph $H$ and $efSearch$
Output: $G$
1 for each $u$ in $D$ do
2    $ep=N_{k}(u)$ from the initial KNNG;
3    $N_{k}(u)=Search\_on\_Graph(H,u,k,efSearch,ep)$;
4 return $G$;
Algorithm 5 DeepSearch
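DeepSearch reduces to one call of the graph search per node; a short sketch, reusing the search_on_graph function given after Algorithm 2 and assuming nodes are numbered 0 to n-1:

```python
def deep_search(data, H, init_knng, k, ef_search):
    """DeepSearch (Algorithm 5): one round of ANN search per node on graph H,
    seeded with the node's neighbors in the initial KNNG."""
    refined = {}
    for u in range(len(data)):
        ep = init_knng[u]                        # entry points from the initial KNNG
        result = search_on_graph(H, data, data[u], k + 1, ef_search, ep)
        refined[u] = [v for v in result if v != u][:k]   # drop u itself if found
    return refined
```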

4.4 Optimizations

In NBPG, there exist repeated distance computations for the same pair of points, which increases the computational cost. In this section, we discuss how to reduce such repeated computations, which fall into two categories: intra-iteration and inter-iteration.

In the left graph in Figure 9, $w_{1}$ is a neighbor of both $v_{1}$ and $v_{3}$, which are neighbors of $u$. In the neighborhood propagation, the pair $(u,w_{1})$ will be checked twice in the same iteration; we call this case intra-iteration repetition. To avoid it, we use a global filter, i.e., a bitmap $visited$ of size $n$ in each iteration, where $visited[u]$ indicates whether $u$ has been visited or not. Take UniProp as an example. Before line 4 in Algorithm 3, we initialize all $visited$ elements as $false$. Then we check $visited[w]$ for each candidate $w$ before line 6. If $w$ is a new candidate, we conduct $update(N_{k}^{t}(u),w)$ and mark $visited[w]$ as $true$; otherwise, $w$ is ignored. The global filter is also applicable to BiProp and DeepSearch.

Inter-iteration repetitions refer to repeated computations across different iterations. Consider the two iterations $t_{1},t_{2}$ in Figure 9, in which the pair $(u,w_{2})$ is considered twice. To avoid inter-iteration repetition, we can use a local filter. Take UniProp as an example. For each neighbor $v\in N_{k}^{t}(u)$, we assign it an attribute $isOld$, which is set to $false$ at the start. If $v$ has just been added to $N_{k}^{t}(u)$ as a new neighbor, its $isOld$ is $false$; otherwise it is $true$ as an old neighbor. In Figure 9, we use solid arrows to represent new KNN pairs and dashed arrows for old KNN pairs. In iteration $t_{2}$, $w_{2}$ is an old neighbor of $v_{3}$ and $v_{3}$ is an old neighbor of $u$. Thus the pair $(u,w_{2})$ has been considered in a previous iteration and $w_{2}$ should be ignored by $u$. Note that inter-iteration repetition cannot be completely avoided by the local filter: if $u$ finds $w_{1}$ via a new neighbor $v_{2}$ in iteration $t_{2}$, this is a repeated computation from iteration $t_{1}$. The local filter can also be used for BiProp, but is not necessary for DeepSearch, which runs only one iteration.

To the best of our knowledge, the global filter has been widely used in existing studies, while the local filter has not. The filters accelerate NBPG methods, but do not change the worst-case cost analysis. On the other hand, they introduce new data structures and require more memory. The space required by the global filter is $O(n)$, while that of the local filter is determined by the neighborhood sizes. Specifically, UniProp and NNDes require $O(n*k)$, while KGraph needs $O(n*L)$. Since the local filter is attached to the neighbors, it does not change the final space complexity.

Figure 9: Illustrations of repeated pairs with $u$ as the query in UniProp.
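A minimal Python sketch of the two filters inside one UniProp iteration is shown below; `visited` implements the global filter and `is_old` the local one, with names of our own choosing.

```python
import numpy as np

def uniprop_iteration_filtered(data, pools, is_old, k):
    """One UniProp iteration with the global (visited) and local (is_old) filters.

    pools  : dict u -> sorted list of (dist, v)
    is_old : dict u -> {v: True/False}, marking neighbors seen in earlier iterations
    """
    dist = lambda u, v: float(np.linalg.norm(data[u] - data[v]))
    prev = {u: [v for _, v in pool] for u, pool in pools.items()}
    for u in pools:
        visited = set(prev[u])                   # global filter for this query
        visited.add(u)
        for v in prev[u]:
            for w in prev[v]:
                if w in visited:                 # intra-iteration repetition
                    continue
                visited.add(w)
                if is_old[u].get(v, False) and is_old[v].get(w, False):
                    continue                     # inter-iteration repetition (local filter)
                pool = pools[u]
                pool.append((dist(u, w), w))
                pool.sort()
                del pool[k:]
    for u in pools:
        # neighbors surviving from before become old; freshly added ones are new
        is_old[u] = {v: (v in prev[u]) for _, v in pools[u]}
```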

4.5 Experimental Evaluation of NBPG

We use the same experimental settings as in Section 3.5.1.

Figure 10: Comparing NBPG methods on (a) Sift, (b) Nuscm, (c) Nusbow, (d) Gist, (e) Glove, (f) Msdrp, (g) Msdtrh and (h) Uqv. $cost$ is measured in seconds.

4.5.1 Compared Methods and Parameter Selection

We compare 6 NBPG methods as listed below and carefully optimize their parameters for their best performance. Following the framework, each NBPG method is combined with our first choice for INIT, HNSW KNNG, with $efConstruction=80$ and $M_{hnsw}=20$ by default. We list the settings of the key parameters as follows.

  • UniProp: LargeVis [29] (we improve the authors' code by adding the local filter to reduce inter-iteration repeated distance computations in UniProp) and Uni HNSW. LargeVis uses 50 RP trees to build the initial KNNG, while Uni HNSW uses HNSW KNNG. The number of iterations $nIter$ is varied from 0 to 4.

  • BiProp: NNDes [9] and KGraph [9]. The number of iterations $nIter$ is set between 0 and 16.

  • DeepSearch: DeepMdiv [34] and Deep HNSW. $efSearch$ in DeepSearch is set between 0 and 160. The number of random divisions $L_{div}$ is 10 in DeepMdiv.

4.5.2 Experimental Results

The experimental results of $cost$ vs $recall$ are plotted in Figure 10. We first compare the methods within each category. For UniProp, we can see that Uni HNSW clearly outperforms LargeVis in most cases, because Uni HNSW performs UniProp based on a better initial KNNG generated by HNSW KNNG. Notably, both methods have significant improvement in the first two iterations but converge rapidly afterwards. This is because most KNNs have become old neighbors in subsequent iterations and fewer candidates will be found for KNN refinement. For BiProp, KGraph clearly outperforms NNDes. This is because NNDes fetches neighbors from a smaller neighbor pool in NBPG than KGraph. Moreover, KGraph carefully determines the neighborhood size $m^{t}(u)$ and limits the size of $R_{m}^{t}(u)$ for each $u$ in each iteration, to balance efficiency and accuracy. For DeepSearch, Deep HNSW is much better than DeepMdiv on all data sets except Nusbow, for two reasons. First, HNSW KNNG outperforms Multiple Division in INIT. Second, the HNSW graph has much better ANN search performance than KNNG, as demonstrated in recent benchmarks [1, 2, 17].

We list the memory requirements of the auxiliary data structures of the NBPG methods in Table III. We can see that KGraph requires the most memory for auxiliary structures, followed by NNDes. On data with relatively small dimensions, such as Sift, Nuscm and Glove, KGraph even needs more space than the data itself. This is because KGraph maintains a large neighbor pool for each node and stores reverse neighbors. The space requirement of UniProp is mainly caused by the last-iteration KNNG, while those of Deep HNSW and DeepMdiv are due to their proximity graphs.

TABLE III: Memory Requirement (MB) of Auxiliary Structures for NBPG Methods
Data UniProp NNDes KGraph Deep HNSW DeepMdiv
Sift 97 332 1,158 181 78
Nuscm 26 90 309 49 21
Nusbow 26 90 318 49 21
Gist 97 332 1,177 180 78
Glove 115 396 1,346 216 93
Msdrp 96 330 1,101 180 77
Msdtrh 96 330 1,116 180 77
Uqv 319 1,098 3,644 657 256

Let us now consider and compare all three categories of NBPG together. Overall, no single method dominates in all cases. Even with a small memory requirement, UniProp usually cannot achieve high accuracy due to its fast convergence. KGraph is the best choice for a high-recall KNNG, but not for a moderate-recall KNNG. Besides, KGraph requires considerably more memory and thus is not suitable when memory is limited. We can see that Deep HNSW is the most balanced method, considering efficiency, accuracy and memory requirement simultaneously.

Figure 11: Performance of KGraph with various initial KNNGs on (a) Nuscm and (b) Msdrp. $cost$ is measured in seconds.
Figure 12: NBPG performance on large data, (a) Sift100M and (b) Deep100M, with $cost$ in seconds.

4.5.3 KGraph with an Initial KNNG

KGraph demonstrates impressive performance in the above experiments, but it starts from a random initial KNNG. Thus it is interesting to study the following question: how well can KGraph perform if we provide a good-quality initial KNNG? To answer this question, we create new combinations, LvKGraph, MdKGraph and HnswKGraph, with three different initial KNNGs (built by LargeVis, Multiple Division and HNSW KNNG respectively), and then perform KGraph-style neighborhood propagation on top of these initial KNNGs. We show the results in Figure 11.

Figure 13: The influence of $k$ on NBPG methods on (a) Sift, (b) Nuscm, (c) Nusbow, (d) Gist, (e) Glove, (f) Msdrp, (g) Msdtrh and (h) Uqv. $cost$ is measured in seconds.

We can see that KGraph with a good-quality initial KNNG does not improve its performance. Even though HnswKGraph seems better than KGraph for a small or moderate $recall$, it is close to or even worse than KGraph when $recall$ reaches a high value. Moreover, MdKGraph and LvKGraph are even worse than KGraph. This is caused by the fact that determining $m^{t}(u)$ in KGraph is sensitive to the updates of KNNs, according to Rule 3 in Section 4.2. When KGraph starts with an accurate KNNG, fewer updates happen in each iteration and thus fewer new neighbors join in the next iteration. Moreover, there may exist repeated distance computations between INIT and KGraph, since some similar pairs may be computed in both steps.

4.5.4 Influences of kk

As $k$ is a key parameter in NBPG, we study its influence on three representative methods, i.e., Uni HNSW, Deep HNSW and KGraph. Here we select two $k$ values for each method, i.e., a small value of 5 and a large value of 100. We show the results in Figure 13. We can see that $k$ has an obvious influence on Uni HNSW. With a large $k$, its $cost$ and $recall$ increase significantly, while they remain almost unchanged with a small $k$. This is because up to $k^{2}$ candidates will be found for each query in one iteration, which leads to larger $cost$ and better $recall$. In addition, Uni HNSW is worse than its competitors for a large $k$, except on Nusbow and Msdrp. As for Deep HNSW and KGraph, they have to pay more cost to achieve the same $recall$ for a large $k$, but they are not as significantly affected by $k$ as Uni HNSW, because the way they find candidates is independent of $k$.

4.5.5 Performance on Large Data

To test the scalability of NBPG methods, we use two large-scale data sets, Sift100M (http://corpus-texmex.irisa.fr/) and Deep100M, to evaluate the performance of three representative methods, i.e., Uni HNSW, KGraph and Deep HNSW. Sift100M contains 100 million 128-dimensional SIFT vectors, while Deep100M consists of 100 million 96-dimensional float vectors randomly sampled from the learn set of Deep1B (https://yadi.sk/d/11eDCm7Dsn9GA). To deal with such large data, we set the number of threads to 48 for all three methods. We show the results in Figure 12. We can see that KGraph clearly outperforms Uni HNSW and Deep HNSW. This is because the construction of the HNSW graph costs quite a lot of time, i.e., around 6,000 seconds for both data sets. Note that KGraph requires much more memory than its competitors.

4.5.6 Summary

No single method dominates in all cases for KNNG construction. We do not recommend UniProp, since it cannot achieve high accuracy due to its fast convergence and is quite sensitive to $k$. If the memory resource is sufficient and a high-recall KNNG is required, KGraph is the first choice. Otherwise, Deep HNSW is our first recommendation due to its superior balance among efficiency, accuracy and memory requirement.

TABLE IV: Data Hubness and Converged Recalls
Group: Low Hub (Sift, Uqv, Msdtrh), Moderate Hub (Nuscm, Gist), High Hub (Glove, Nusbow, Msdrp)
Data Sift Uqv Msdtrh Nuscm Gist Glove Nusbow Msdrp
$H_{20}(0.001,\cdot)$ 0.007 0.014 0.012 0.036 0.039 0.075 0.066 0.259
$H_{20}(0.01,\cdot)$ 0.047 0.063 0.07 0.157 0.164 0.243 0.269 0.407
$H_{20}(0.1,\cdot)$ 0.285 0.319 0.355 0.531 0.549 0.647 0.694 0.740
$recall_{converged}^{\texttt{LargeVis}}$ 0.929 0.967 0.929 0.835 0.673 0.619 0.599 0.865
$recall_{converged}^{\texttt{Deep HNSW}}$ 0.992 0.982 0.991 0.942 0.874 0.786 0.661 0.879
$recall_{converged}^{\texttt{KGraph}}$ 0.986 0.972 0.990 0.968 0.943 0.779 0.867 0.926

5 Exploring Neighborhood Propagation

We are interested in the reasons why NBPG is effective. The principle ("a neighbor of my neighbor is also likely to be my neighbor" [9]) is intuitively correct and works effectively in practice, but the literature lacks convincing insights into why it works. This motivates us to explore the NBPG mechanism. To achieve this goal, we employ the node hubness defined in [24] to characterize each node. Further, we develop this concept and propose the data hubness to characterize a data set. With these two definitions, we obtain two insights. First, the data hubness has a significant effect on the performance of NBPG. Second, the node hubness of a node has an obvious effect on its accuracy during the NBPG process. To present the effect of node hubness on accuracy, we explore the dynamic processes of three representative NBPG methods, i.e., LargeVis, KGraph and Deep HNSW, respectively.

5.1 Hubness and Accuracy of Reverse KNNs

According to [24], a hub node appears in the KNNs of many nodes, i.e., it is a popular exact KNN of other nodes. Accordingly, a hub node has a large in-degree in the exact KNNG. We restate its definition from [24] as follows.

Definition 2

Node Hubness. Given a node $v\in D$ and $k$, we define its hubness, denoted as $h_{k}(v)$, as the number of its exact reverse KNNs, i.e., $h_{k}(v)=|R_{k}^{*}(v)|$, where $R_{k}^{*}(v)=\{u|v\in N_{k}^{*}(u)\}$ and $N_{k}^{*}(u)$ represents the exact KNNs of $u$.

According to [24], a hub node is usually close to the data center or the cluster center, and thus it is naturally attractive to other nodes as one of their exact KNNs. Further, we extend the hubness concept to the whole data set as follows.

Definition 3

Data Hubness. Given $D$, $k$ and a percentage $x$, we define the data hubness of $D$ at $x$ in Equation 1, where $\pi$ is a permutation of the nodes in $D$ and $u_{\pi(i)}$ is the node with the $i$-th largest node hubness.

$H_{k}(x,D)=\frac{\sum_{i=1}^{\lceil x\times n\rceil}h_{k}(u_{\pi(i)})}{n\times k},\quad x\in[0,1]$ (1)

We can see that $H_{k}(x,D)$ is the normalized sum of the largest $\lceil x\times n\rceil$ $h_{k}$ values and thus lies between 0 and 1. We divide the eight data sets into three groups according to their data hubness, i.e., low hub, moderate hub and high hub, as shown in Table IV. We can see that Msdrp has the largest data hubness, where the top $0.1\%$ (995) points account for $25.9\%$ of the hubness (corresponding to over 5 million edges in the exact KNNG).

In addition, we define the accuracy of reverse KNNs. For a KNNG $G$, let $R_{k}(v)=\{u|v\in N_{k}(u)\}$ be $v$'s reverse KNNs. Then we define the accuracy of $v$'s reverse KNNs as $recall^{R}(v)=|R_{k}(v)\cap R_{k}^{*}(v)|/|R_{k}^{*}(v)|$, where $R_{k}^{*}(v)$ is defined as in Definition 2. $recall^{R}$ reflects the accuracy of a KNNG to some extent, since a node probably has a high $recall^{R}$ value on an accurate KNNG.
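Both measures can be computed directly from the exact and approximate KNNGs; a short sketch under the definitions above, with function names of our own choosing:

```python
import math
from collections import defaultdict

def reverse_knn(knng):
    """Invert a KNNG: v -> set of nodes u with v among u's KNNs."""
    rev = defaultdict(set)
    for u, nbrs in knng.items():
        for v in nbrs:
            rev[v].add(u)
    return rev

def node_hubness(exact_knng):
    """h_k(v) = |R*_k(v)| for every node v (Definition 2)."""
    rev = reverse_knn(exact_knng)
    return {v: len(rev[v]) for v in exact_knng}

def data_hubness(exact_knng, k, x):
    """H_k(x, D): normalized sum of the ceil(x*n) largest node hubness values."""
    h = sorted(node_hubness(exact_knng).values(), reverse=True)
    n = len(exact_knng)
    top = math.ceil(x * n)
    return sum(h[:top]) / (n * k)

def reverse_recall(knng, exact_knng, v):
    """recall^R(v) = |R_k(v) intersect R*_k(v)| / |R*_k(v)|."""
    approx_rev = reverse_knn(knng)[v]
    exact_rev = reverse_knn(exact_knng)[v]
    return len(approx_rev & exact_rev) / len(exact_rev) if exact_rev else 0.0
```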

5.2 Effect of Data Hubness on NBPG Performance

In this part, we present the effect of data hubness on the performance of NBPG. We discuss the performance in two aspects, i.e., (1) computational cost vs accuracy and (2) the converged accuracy. We use $scan\_rate=\frac{\#total\_dist}{n*(n-1)/2}$ to measure the computational cost, where $\#total\_dist$ is the total number of distance computations during KNNG construction and $n*(n-1)/2$ is the number of distance computations in the brute-force method. Unlike $cost$, $scan\_rate$ is independent of $d$. We show the results in Figure 14, where each subfigure compares data sets of the same size, so that we can focus on the effect of data hubness. We can see that data hubness has a significant effect on the performance of NBPG methods: with a larger data hubness, each NBPG method has to compute more distances to achieve the same $recall$.

Figure 14: Effect of data hubness on NBPG on (a) Sift & Gist and (b) Nuscm & Nusbow.

The converged $recall$ is the value beyond which paying more cost cannot improve $recall$ to a significant extent. We believe that the converged $recall$ reflects a characteristic of an NBPG method. Here we obtain this value for each method by setting its parameters large enough; specifically, $nIter=4$ in LargeVis, $efSearch=160$ in Deep HNSW and $nIter=16$ in KGraph. We show the results in Table IV. We can see that a lower hub data set usually has a higher converged $recall$, while a higher hub data set has a lower converged $recall$. This indicates that it is computationally expensive to obtain a high-recall KNNG on a high hub data set.

Figure 15: Exploring LargeVis. Panels (a)-(b): Sift; (c)-(d): Glove.

5.3 Exploring UniProp

As a representative UniProp method, LargeVis starts from an initial KNNG with low or moderate accuracy and then conducts UniProp for a few iterations. Here we vary the number of iterations nIter from 0 to 3, where nIter = 0 corresponds to the initial KNNG. Due to the space limit, we only report the results on Sift and Glove in Figure 15. The $x$-axis shows the node hubness of individual nodes and the $y$-axis shows recall or $recall^R$. We can see that the recall values of all nodes increase substantially, especially in the first iteration. Notably, most of the newly found exact KNNs are hub nodes, as reflected by the significant increase in the $recall^R$ values of hub nodes.

Even though the $recall^R$ values of high hub nodes are not high in the beginning, the cardinality of $R_k^*(\cdot)$ for these nodes is large, so a good number of them are already correctly connected to by many nodes as exact KNNs. Hence each node has a high chance to reach those hub nodes via one of its KNNs in UniProp, which explains the significant increase of recall in the first iteration. Overall, UniProp works well in finding hub nodes as exact KNNs, but misses a significant portion of non-hub nodes, which are connected to by fewer nodes and thus are hard to reach in UniProp. That is why its converged recall (as shown in Table IV) is relatively low, especially on high hub data.
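This behavior follows directly from the structure of a UniProp pass. The sketch below is our simplification, not the LargeVis implementation (dist_fn and data are hypothetical stand-ins); it shows that candidates come only from the fixed neighborhoods of the current KNNs, which favors the well-connected hub nodes.

```python
def uniprop_iteration(knn, dist_fn, data, k):
    """One UniProp pass (simplified sketch in the spirit of LargeVis).

    knn: dict mapping each node id to its current approximate KNN list.
    dist_fn and data are hypothetical stand-ins for the distance function
    and the data matrix.
    """
    new_knn = {}
    for u, neighbors in knn.items():
        candidates = set(neighbors)
        for v in neighbors:                # candidates = KNNs of u's KNNs
            candidates.update(knn[v])
        candidates.discard(u)
        new_knn[u] = sorted(candidates, key=lambda c: dist_fn(data[u], data[c]))[:k]
    return new_knn
```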

5.4 Exploring BiProp

As a representative BiProp method, KGraph starts from a random initial graph and then conducts NBPG for a few iterations. Here we vary the number of iterations from 1 to 10. We show the results on Msdtrh and Nusbow in Figure 16. Both recall and $recall^R$ are quite low in the first two iterations due to the random initial graph. As KGraph continues, recall and $recall^R$ of high hub nodes increase markedly and almost reach their converged values by the fifth iteration. Unlike UniProp, recall and $recall^R$ of moderate and low hub nodes keep growing in later iterations. This is because KGraph conducts a brute-force search within $B_m^t(\cdot)$, where each reverse neighbor (probably a non-hub node) treats every other reverse neighbor as a candidate. In other words, KGraph enhances the communication among non-hub nodes.

Notably, on Msdtrh, recall and $recall^R$ are quite low for some high hub nodes whose $h_{20}$ values exceed 400. This is because KGraph imposes a threshold on the number of reverse neighbors to ensure efficiency. Hence the reverse-neighbor lists of high hub nodes are truncated, which negatively affects their recall and $recall^R$. This loss, however, is compensated by the increased accuracy of non-hub nodes.
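For contrast, a BiProp pass in the NN-Descent style followed by KGraph can be sketched as below (again our simplification with hypothetical dist_fn and data; max_reverse mimics the reverse-neighbor threshold discussed above). The key difference is that nodes inside $B_m^t(u)$, including the reverse neighbors, propose each other as candidates, which is what lets non-hub nodes catch up.

```python
def biprop_iteration(knn, dist_fn, data, k, max_reverse=50):
    """One BiProp pass in the NN-Descent style (simplified sketch).

    knn: dict mapping each node id in 0..n-1 to its current approximate KNN list.
    max_reverse caps each reverse-neighbor list, mimicking KGraph's threshold
    (the parameter name and default are hypothetical).
    """
    n = len(knn)
    reverse = {u: [] for u in range(n)}
    for u in range(n):
        for v in knn[u]:
            if len(reverse[v]) < max_reverse:
                reverse[v].append(u)
    candidates = {u: set(knn[u]) for u in range(n)}
    for u in range(n):
        B = list(set(knn[u]) | set(reverse[u]))
        for i, a in enumerate(B):          # brute-force comparison inside B(u)
            for b in B[i + 1:]:
                candidates[a].add(b)
                candidates[b].add(a)
    new_knn = {}
    for u in range(n):
        cand = candidates[u] - {u}
        new_knn[u] = sorted(cand, key=lambda c: dist_fn(data[u], data[c]))[:k]
    return new_knn
```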

Figure 16: Exploring KGraph. Panels (a)-(b): Msdtrh; (c)-(d): Nusbow.
Figure 17: Exploring Deep HNSW. Panels (a)-(b): Nuscm; (c)-(d): Msdrp.

5.5 Exploring DeepSearch

As a representative DeepSearch method, the results of Deep HNSW are shown in Figure 17, where efSearch varies from 0 to 160. When efSearch = 0, it corresponds to the initial KNNG, where both recall and $recall^R$ of hub nodes are higher than those of non-hub nodes. Note that this is different from the initial KNNG generated by LargeVis, where hub nodes have even smaller $recall^R$ values than non-hub nodes. This is because the construction of the HNSW graph also employs an idea similar to NBPG, which establishes the accuracy advantage of hub nodes.

As in KGraph, recall and $recall^R$ of non-hub nodes increase progressively in Deep HNSW as efSearch grows. DeepSearch enhances its accuracy by enlarging the expanded neighborhood of each node. As a result, more candidates are examined with a larger efSearch, which increases the chance of finding non-hub nodes.
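The role of efSearch can be seen in a generic best-first search on a proximity graph, sketched below (our simplification of DeepSearch, not the authors' code; graph, dist_fn, data and entry_point are hypothetical stand-ins). The pool of size efSearch bounds how far the search wanders beyond the hub region, so enlarging it admits more non-hub candidates.

```python
import heapq

def deep_search(graph, dist_fn, data, query_id, entry_point, ef_search, k):
    """Best-first search on a proximity graph (DeepSearch-style sketch).

    graph: dict node -> list of neighbor ids (the initial KNNG for DeepMdiv,
    or the base layer of the HNSW graph for Deep HNSW).
    entry_point: any node other than query_id to start the search from.
    ef_search bounds the candidate pool; a larger value expands more nodes
    and therefore reaches more non-hub nodes.
    """
    d0 = dist_fn(data[query_id], data[entry_point])
    candidates = [(d0, entry_point)]        # min-heap: closest unexpanded node first
    pool = [(-d0, entry_point)]             # max-heap (negated): best ef_search results
    visited = {query_id, entry_point}
    while candidates:
        d, u = heapq.heappop(candidates)
        if len(pool) >= ef_search and d > -pool[0][0]:
            break                           # no remaining candidate can improve the pool
        for v in graph[u]:
            if v in visited:
                continue
            visited.add(v)
            dv = dist_fn(data[query_id], data[v])
            if len(pool) < ef_search or dv < -pool[0][0]:
                heapq.heappush(candidates, (dv, v))
                heapq.heappush(pool, (-dv, v))
                if len(pool) > ef_search:
                    heapq.heappop(pool)
    return [v for _, v in sorted((-d, v) for d, v in pool)][:k]
```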

5.6 Comparisons and Discussions

Now let us compare the converged accuracy of the three representative NBPG methods. We show the results in Figure 18. On the low hub data set Uqv, it is interesting that both KGraph and Deep HNSW have small recall and $recall^R$ for high hub nodes (i.e., $h_k > 200$), while LargeVis achieves much larger values. This is because both KGraph and Deep HNSW limit the neighborhood size of each node $u$: KGraph limits the size of $R_m^t(u)$, while the HNSW graph limits the number of $u$'s edges. Since low hub data has only a small number of hub nodes, this has a small impact on the overall recall. To some extent, this phenomenon reflects the preference of UniProp for hub nodes.

On the moderate hub data Gist, KGraph shows a clear accuracy advantage across all hubness levels, followed by Deep HNSW and LargeVis. As indicated by the converged recall in Table IV, it becomes more difficult to reach high accuracy as the data hubness increases. The lower accuracy of LargeVis indicates that exploring the KNNs of each KNN does not work well for the low hub nodes of such difficult data, since UniProp obtains candidates from a fixed neighborhood. To find more exact KNNs, the neighborhood has to be enlarged. Compared with LargeVis, Deep HNSW explores more than $k$ neighbors' neighborhoods on a fixed graph (i.e., the HNSW graph), while KGraph enlarges $B_m^t(u)$ for each promising point $u$ iteratively. Moreover, both use the reverse neighbors in the neighborhood propagation, which enhances the accuracy of non-hub nodes.

Figure 18: Comparing Converged Accuracy. Panels (a)-(b): Uqv; (c)-(d): Gist.

5.7 Summary

We find that the hubness phenomenon affects NBPG performance significantly in two aspects. First, there is a strong correlation between the data hubness and the performance of an NBPG method: on low hub data we can expect a good balance between efficiency and accuracy and a high converged recall, while on high hub data we can expect the opposite. Second, the node hubness of a node significantly affects its accuracy in both recall and $recall^R$: a high hub node tends to have high recall and $recall^R$ values, while a low hub node tends to have the opposite.

6 Related Work

Similarity search on high-dimensional data is another problem closely related to KNNG construction. Various index structures have been designed to accelerate similarity search. Tree structures were popular in the early years, and many trees [12, 5, 15, 26] have been proposed. Locality sensitive hashing (LSH) [14, 11] attracted a lot of attention during the past decade due to its good theoretical guarantee, and a number of LSH-based methods [20, 30, 19, 28, 13] were created. Besides, there are also a few methods [27, 23, 16, 3, 4] based on the inverted index, which assigns similar points to the same inverted list and only accesses the most promising lists during the search procedure. Recently, proximity graphs [9, 21, 22, 17, 10] have become more and more popular due to their superior performance over other structures.

7 Conclusion

In this paper, we revisit existing studies of constructing a KNNG on high-dimensional data. We start from the framework adopted by existing methods, which contains two steps, i.e., INIT and NBPG, and conduct a comprehensive experimental study on representative methods for each step. According to our experimental results, KGraph performs best at constructing a high-recall KNNG, but requires much more memory space. A new combination called Deep HNSW presents a better balance among efficiency, accuracy and memory requirement than its competitors. Notably, the popular NBPG technique UniProp is not recommended, since it cannot achieve a high-recall KNNG and is quite sensitive to $k$. Finally, we employ the hubness concepts to investigate the effectiveness of NBPG and have two interesting findings. First, the performance of NBPG is clearly affected by the data hubness. Second, the accuracy of each node is significantly influenced by its node hubness.

References

  • [1] https://github.com/searchivarius/nmslib.
  • [2] https://github.com/spotify/annoy.
  • [3] A. Babenko and V. Lempitsky. The inverted multi-index. In CVPR, pages 3069–3076. IEEE, 2012.
  • [4] D. Baranchuk, A. Babenko, and Y. Malkov. Revisiting the inverted indices for billion-scale approximate nearest neighbors. In ECCV, pages 202–216, 2018.
  • [5] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. SIGMOD Record, 19(2):322–331, 1990.
  • [6] J. Chen, H. R. Fang, and Y. Saad. Fast approximate knn graph construction for high dimensional data via recursive lanczos bisection. Journal of Machine Learning Research, 10(9):1989–2012, 2009.
  • [7] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In STOC, pages 537–546. ACM, 2008.
  • [8] C. Deng and W. Zhao. Fast k-means based on k-nn graph. In ICDE, pages 1220–1223. IEEE, 2018.
  • [9] W. Dong, C. Moses, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW, pages 577–586. ACM, 2011.
  • [10] C. Fu, C. Xiang, C. Wang, and D. Cai. Fast approximate nearest neighbor search with the navigating spreading-out graph. PVLDB, 12(5):461–474, 2019.
  • [11] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.
  • [12] A. Guttman. R-trees: a dynamic index structure for spatial searching. In SIGMOD, pages 47–57. ACM, 1984.
  • [13] Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng. Query-aware locality-sensitive hashing for approximate nearest neighbor search. PVLDB, 9(1):1–12, 2016.
  • [14] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, pages 604–613. ACM, 1998.
  • [15] H. V. Jagadish, B. C. Ooi, K. L. Tan, C. Yu, and R. Zhang. iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, 30(2):364–397, 2005.
  • [16] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.
  • [17] W. Li, Y. Zhang, Y. Sun, W. Wang, M. Li, W. Zhang, and X. Lin. Approximate nearest neighbor search on high dimensional data – experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering, 2019.
  • [18] W. Liu, J. Wang, S. Kumar, and S. F. Chang. Hashing with graphs. In ICML, pages 1–8, 2011.
  • [19] Y. Liu, J. Cui, Z. Huang, H. Li, and H. Shen. SK-LSH: an efficient index structure for approximate nearest neighbor search. PVLDB, 7(9):745–756, 2014.
  • [20] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In VLDB, pages 950–961, 2007.
  • [21] Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45:61–68, 2014.
  • [22] Y. Malkov and D. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [23] J. Philbin, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, pages 1–8, 2008.
  • [24] M. Radovanović, A. Nanopoulos, and M. Ivanović. Hubs in space: popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11(Sep):2487–2531, 2010.
  • [25] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • [26] C. Silpa-Anan and R. Hartley. Optimised KD-trees for fast image descriptor matching. In CVPR, pages 1–8. IEEE, 2008.
  • [27] J. Sivic and A. Zisserman. Video google: A text retrieval approach to objects matching in videos. In ICCV, pages 1470–1478. IEEE, 2003.
  • [28] Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: Solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. PVLDB, 8(1):1–12, 2015.
  • [29] J. Tang, J. Liu, M. Zhang, and Q. Mei. Visualizing large-scale and high-dimensional data. In WWW, pages 287–297, 2016.
  • [30] Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In SIGMOD, pages 563–576. ACM, 2009.
  • [31] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
  • [32] L. Van Der Maaten. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.
  • [33] J. Wang, J. Wang, Q. Ke, G. Zeng, and S. Li. Fast approximate k-means via cluster closures. In Multimedia Data Mining and Analytics, pages 373–395. Springer, 2015.
  • [34] J. Wang, J. Wang, G. Zeng, Z. Tu, R. Gan, and S. Li. Scalable k-nn graph construction for visual descriptors. In CVPR, pages 1106–1113. IEEE, 2012.
  • [35] W. Zhang, X. Wang, D. Zhao, and X. Tang. Graph degree linkage: Agglomerative clustering on a directed graph. In ECCV, pages 428–441, 2012.
  • [36] Y. Zhang, K. Huang, G. Geng, and C. Liu. Fast knn graph construction with locality sensitive hashing. In ECML, pages 660–674. Springer, 2013.
Yingfan Liu is currently a Lecturer in the School of Computer Science and Technology, Xidian University, China. He obtained his PhD degree from the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong in 2019. His research interests include the management of large-scale complex data and autonomous RDBMSs.
Hong Cheng is an Associate Professor in the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong. She received her Ph.D. degree from the University of Illinois at Urbana-Champaign in 2008. Her research interests include data mining, database systems, and machine learning. She received research paper awards at ICDE'07, SIGKDD'06 and SIGKDD'05, and the certificate of recognition for the 2009 SIGKDD Doctoral Dissertation Award. She is a recipient of the 2010 Vice-Chancellor's Exemplary Teaching Award at the Chinese University of Hong Kong.
Jiangtao Cui received the MS and PhD degrees, both in computer science, from Xidian University, China, in 2001 and 2005, respectively. Between 2007 and 2008, he was with the Data and Knowledge Engineering group at the University of Queensland, Australia, working on high-dimensional indexing for large-scale image retrieval. He is currently a professor in the School of Computer Science and Technology, Xidian University, China. His current research interests include data and knowledge engineering, data security, and high-dimensional indexing.

Appendix A The cost model of SW KNNG

The main cost of SW KNNG is to conduct ANN search for each $u \in D$. For each query $u$, the search iteratively expands $u$'s close nodes until the termination condition is met. Hence we care about the number $\#expand$ of expanded nodes per query. To investigate it, we conduct experiments on real data and show the results in Figure 19. We can see that as efConstruction increases, the ratio $\frac{\#expand}{efConstruction}$ decreases and gradually approaches 1. We therefore conclude that each query expands $O(efConstruction)$ nodes on average. Unfortunately, nodes in an SW graph have varying numbers of neighbors, so we cannot determine the number of accessed candidates precisely. Let $M_{sw}^{max}$ be the maximum neighborhood size in the SW graph. In the worst case, the time complexity of SW KNNG is $O(n \cdot d \cdot efConstruction \cdot M_{sw}^{max})$.

Figure 19: #expand vs. efConstruction in SW KNNG.

Appendix B The cost model of HNSW KNNG

The main cost of HNSW KNNG consists of two parts, i.e., ANN search for each node and pruning oversize neighborhoods. As in SW KNNG, we find that the number $\#expand$ of expanded nodes per query is quite close to efConstruction. We show the relationship between the ratio $\frac{\#expand}{efConstruction}$ and efConstruction in Figure 20. The ratio approaches 1 as efConstruction increases, so we conclude that each query expands $O(efConstruction)$ nodes on average. Unlike an SW graph, each node in an HNSW graph has at most $M_{hnsw}$ neighbors. Moreover, after each search, HNSW selects $M_{hnsw}$ neighbors out of the efConstruction ones found, following the same link diversification strategy as the pruning operations. Each such selection costs $O(efConstruction \cdot M_{hnsw} \cdot d)$. In total, the ANN search for $n$ nodes costs $O(n \cdot efConstruction \cdot M_{hnsw} \cdot d)$.
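The selection step can be sketched as follows, based on the diversification idea of [22] rather than the exact nmslib code: a candidate is kept only if it is closer to $u$ than to every neighbor kept so far, and ranking plus checking touches $O(efConstruction \cdot M_{hnsw})$ pairs, which matches the per-node cost term above.

```python
def select_diverse_neighbors(candidates, dist_fn, data, u, m):
    """Pick at most m diverse neighbors for u out of the efConstruction
    candidates returned by the ANN search (sketch of the heuristic idea).

    A candidate is kept only if it is closer to u than to every neighbor
    already kept, which spreads the kept edges in different directions.
    """
    ranked = sorted(candidates, key=lambda c: dist_fn(data[u], data[c]))
    kept = []
    for c in ranked:
        if all(dist_fn(data[u], data[c]) < dist_fn(data[c], data[r]) for r in kept):
            kept.append(c)
            if len(kept) == m:
                break
    return kept
```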

Figure 20: #expand vs. efConstruction in HNSW KNNG.

Like $\#expand$, we investigate the number $\#prune$ of pruning operations experimentally. We use heuristic 2 (i.e., the default setting) in the nmslib library as the link diversification strategy, which generates fewer pruning operations. We show the experimental results in Figure 21. $\#prune$ increases noticeably as efConstruction rises. However, we set efConstruction to a small value (e.g., 80) in this paper, because we do not expect a highly accurate KNNG from an INIT method. Hence, the practical $\#prune$ remains sufficiently small.

Figure 21: The effect of efConstruction on #prune.

In addition, the data hubness (defined in Section 5.1) also has a positive influence on $\#prune$. Take Sift and Gist as an example: with the same efConstruction value, Gist has an obviously larger $\#prune$ than Sift. Similar phenomena can be observed when comparing the pair of Nuscm and Nusbow and the pair of Msdrp and Msdtrh. Moreover, the data size also positively affects $\#prune$. In Figure 22, we show the results on four subsets of Sift100M with sizes 1, 2, 4 and 8 million, denoted as Sift1M, Sift2M, Sift4M and Sift8M respectively. The details of Sift100M can be found in Section 4.5.5.

Figure 22: The effect of data size on #prune.

Although affected by several factors, $\#prune$ stays close to $O(n)$ in practice, especially for a small efConstruction value. Hence, the practical values of $\#prune$ are far smaller than the worst case $n \cdot M_{hnsw}$, where $M_{hnsw} = 20$ in our experiments.

Appendix C The cost model of DeepSearch

DeepSearch conducts ANN search on an online-built proximity graph $H$ for each query $u \in D$. The search process is controlled by the parameter efSearch. The main cost of DeepSearch lies in expanding nodes similar to each query in $H$, so the number $\#expand$ is the key to its cost model. Here, we investigate two variants of DeepSearch, i.e., DeepMdiv and Deep HNSW. We show the relationship between $\frac{\#expand}{efSearch}$ and efSearch in Figure 23.

Figure 23: #expand vs. efSearch in DeepSearch. Panels: (a) DeepMdiv; (b) Deep HNSW.

The ratio gradually approaches 1 as efSearch increases, so DeepMdiv and Deep HNSW expand $O(efSearch)$ nodes per query. DeepMdiv takes the initial KNNG as the proximity graph, where each node has exactly $k$ neighbors; its cost is therefore $O(n \cdot d \cdot efSearch \cdot k)$. Deep HNSW uses the HNSW graph, where each node has up to $M_{hnsw}$ neighbors; its cost is $O(n \cdot d \cdot efSearch \cdot M_{hnsw})$.