Sublinear Time Nearest Neighbor Search over Generalized Weighted Manhattan Distance
Abstract
Nearest Neighbor Search (NNS) over generalized weighted distances is fundamental to a wide range of applications. The problem of NNS over the generalized weighted square Euclidean distance has been studied in previous work. However, numerous studies have shown that the Manhattan distance could be more effective than the Euclidean distance for high-dimensional NNS, which indicates that the generalized weighted Manhattan distance is possibly more practical than the generalized weighted square Euclidean distance in high dimensions. To the best of our knowledge, no prior work solves the problem of NNS over the generalized weighted Manhattan distance in sublinear time. This paper achieves that goal by proposing two novel hashing schemes, $(d_w^{l_1}, l_2)$-ALSH and $(d_w^{l_1}, \theta)$-ALSH, where $d_w^{l_1}$ denotes the generalized weighted Manhattan distance.
1 Introduction
Nearest Neighbor Search (NNS) over generalized weighted distances is fundamental to a wide variety of applications, such as personalized recommendation [11, 14, 18] and kNN classification [3, 19]. Given a set of data points $D \subset \mathbb{R}^d$ and a query point $q \in \mathbb{R}^d$ with a weight vector $w \in \mathbb{R}^d$, NNS over a generalized weighted distance, denoted by $d_w(\cdot, \cdot)$, is to find the point $o^* \in D$ that is closest to $q$ under $d_w$. Formally, the goal of NNS over $d_w$ is to return
$o^* = \arg\min_{o \in D} d_w(o, q).$   (1)
Note that the weight vector $w$ is specified along with the query $q$ rather than pre-specified. Moreover, each element of $w$ can be either positive or non-positive.
The generalized weighted Manhattan distance, denoted by $d_w^{l_1}$, and the generalized weighted square Euclidean distance, denoted by $d_w^{l_2^2}$, are two typical generalized weighted distances, derived from the Manhattan distance and the Euclidean distance, respectively. For any two points $x, q \in \mathbb{R}^d$, the distances $d_w^{l_1}(x, q)$ and $d_w^{l_2^2}(x, q)$ are respectively computed as follows:
$d_w^{l_1}(x, q) = \sum_{i=1}^{d} w_i\,|x_i - q_i|, \qquad d_w^{l_2^2}(x, q) = \sum_{i=1}^{d} w_i\,(x_i - q_i)^2,$   (2)
where $w = (w_1, w_2, \ldots, w_d)$. A recent paper [16] studied the problem of NNS over $d_w^{l_2^2}$ and provided two sublinear time solutions for it. However, to the best of our knowledge, there is no prior work that solves the problem of NNS over $d_w^{l_1}$ in sublinear time. In fact, plenty of studies [1, 12] have shown that the Manhattan distance can be more effective than the Euclidean distance for producing meaningful NNS results in high-dimensional spaces, which indicates that NNS over $d_w^{l_1}$ is possibly more practical than NNS over $d_w^{l_2^2}$ in many real scenarios. In this paper, we aim to propose sublinear time methods for efficiently solving the problem of NNS over $d_w^{l_1}$.
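As a concrete instance of Equation 2, take $x = (1, 3)$, $q = (2, 1)$ and $w = (0.5, -1)$: then $d_w^{l_1}(x, q) = 0.5 \cdot |1 - 2| + (-1) \cdot |3 - 1| = -1.5$, which illustrates that the generalized weighted Manhattan distance can even be negative when some weights are non-positive.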
As a matter of fact, existing methods cannot handle NNS over $d_w^{l_1}$ well. Specifically, the brute-force linear scan scales linearly with the data size and thus may yield unsatisfactory performance. The conventional spatial index-based methods [2, 6, 22] only perform well for NNS in low dimensions due to the "curse of dimensionality" [5]. Locality-Sensitive Hashing (LSH) [27] is a popular approach for approximate NNS and exhibits good performance in high-dimensional cases. In the literature, a number of efficient LSH schemes [7, 9, 10, 13, 15, 17, 26, 28] have been proposed based on LSH families, and some of them can answer NNS queries even in sublinear time. Unfortunately, they cannot be applied to answer NNS queries over $d_w^{l_1}$ unless $w$ is fixed to an all-1 vector.
Recently, Asymmetric Locality-Sensitive Hashing (ALSH) was extended from LSH so that the problems of Maximum Inner Product Search (MIPS) and NNS over $d_w^{l_2^2}$ can be addressed in sublinear time [16, 20, 21, 23, 24]. An ALSH scheme relies on an ALSH family. As far as we know, no ALSH family has been proposed for NNS over $d_w^{l_1}$ in previous work. To provide sublinear time solutions for NNS over $d_w^{l_1}$ in this paper, we follow the ALSH approach and propose ALSH schemes built on ALSH families that are suitable for NNS over $d_w^{l_1}$.
Outline. In Section 2, we review the approaches of LSH and ALSH. In Section 3, we show that there is no LSH or ALSH family for NNS over $d_w^{l_1}$ over the entire space $\mathbb{R}^d$, and that there is no LSH family for NNS over $d_w^{l_1}$ over bounded spaces in $\mathbb{R}^d$. Then we seek ALSH families for NNS over $d_w^{l_1}$ over bounded spaces in $\mathbb{R}^d$. As a result, we propose two suitable ALSH families and further obtain two sublinear time ALSH schemes, $(d_w^{l_1}, l_2)$-ALSH and $(d_w^{l_1}, \theta)$-ALSH, in Section 4.
2 Preliminaries
Before introducing our proposed solutions to the problem of NNS over $d_w^{l_1}$, we first present the preliminaries on LSH and ALSH.
2.1 Locality-Sensitive Hashing
Let $dist(\cdot, \cdot)$ be a distance function and $\mathcal{X} \times \mathcal{Y}$ be the space where $dist$ is defined. Assume that data points and query points are located in $\mathcal{X}$ and $\mathcal{Y}$, respectively. Then, an LSH family is formally defined as follows.
Definition 1 (LSH Family)
An LSH family $\mathcal{H} = \{h : \mathcal{X} \cup \mathcal{Y} \to U\}$ is called $(r_1, r_2, p_1, p_2)$-sensitive if for any $x \in \mathcal{X}$ and $q \in \mathcal{Y}$, the following conditions are satisfied:
• If $dist(x, q) \le r_1$, then $\Pr[h(x) = h(q)] \ge p_1$;
• If $dist(x, q) \ge r_2$, then $\Pr[h(x) = h(q)] \le p_2$;
• $r_1 < r_2$ and $p_1 > p_2$.
As we can see from Definition 1, an LSH family is essentially a set of hash functions that hash closer points into the same bucket with higher probability. Thus, the basic idea of an LSH scheme is to use an LSH family to hash points such that only the data points that have the same hash code as the query point are likely to be retrieved to find approximate nearest neighbors. In the following, we review two popular LSH families that were proposed for the $l_2$ distance (a.k.a. the Euclidean distance) and the Angular distance, respectively.
The $l_2$ distance between any two points $x, y \in \mathbb{R}^d$ is computed as $l_2(x, y) = \|x - y\|_2$, where $\|\cdot\|_2$ is the $l_2$-norm of a vector. The LSH family proposed for the $l_2$ distance in [7] is $\mathcal{H}_{l_2} = \{h_{a,b}^{l_2}\}$, where
$h_{a,b}^{l_2}(x) = \left\lfloor \frac{a \cdot x + b}{r} \right\rfloor.$   (3)
Here, $a$ is a $d$-dimensional vector where each element is chosen independently from the standard normal distribution, $b$ is a real number chosen uniformly at random from $[0, r]$, and $r$ is a user-specified positive constant. Let $s = l_2(x, y)$. The collision probability function is
$p^{l_2}(s) = \Pr[h_{a,b}^{l_2}(x) = h_{a,b}^{l_2}(y)] = 1 - 2\Phi(-r/s) - \frac{2s}{\sqrt{2\pi}\,r}\left(1 - e^{-r^2/(2s^2)}\right),$   (4)
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution [7].
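For concreteness, here is a minimal Python/NumPy sketch of one hash function from this family; the function names and parameter choices are illustrative, not from the paper.

```python
import numpy as np

def make_l2_lsh(dim, r, rng=np.random.default_rng(0)):
    """Sample one hash function h(x) = floor((a.x + b) / r) from the l2 LSH family [7]."""
    a = rng.standard_normal(dim)   # each element drawn from the standard normal distribution
    b = rng.uniform(0.0, r)        # offset chosen uniformly at random from [0, r]
    return lambda x: int(np.floor((a @ x + b) / r))

# Example: nearby points usually land in the same bucket, distant ones usually do not.
h = make_l2_lsh(dim=8, r=4.0)
x = np.ones(8)
print(h(x), h(x + 0.1), h(x + 100.0))
```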
The Angular distance between any two points $x$ and $y$ is computed as $\theta(x, y) = \arccos\left(\frac{x \cdot y}{\|x\|_2 \|y\|_2}\right)$. The LSH family proposed for the Angular distance in [4] is $\mathcal{H}_{\theta} = \{h_{a}^{\theta}\}$, where
$h_{a}^{\theta}(x) = \mathrm{sign}(a \cdot x),$   (5)
and $a$ is a $d$-dimensional vector where each element is chosen independently from the standard normal distribution. Let $s = \theta(x, y)$. The collision probability function is
$p^{\theta}(s) = \Pr[h_{a}^{\theta}(x) = h_{a}^{\theta}(y)] = 1 - \frac{s}{\pi}.$   (6)
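Similarly, a minimal sketch of this family, with a Monte Carlo check of the collision probability in Equation 6; the names and the sample size are illustrative.

```python
import numpy as np

def make_angular_lsh(dim, rng=np.random.default_rng(0)):
    """Sample one hash function h(x) = sign(a.x) from the Angular-distance LSH family [4]."""
    a = rng.standard_normal(dim)
    return lambda x: 1 if a @ x >= 0 else -1

# Empirical collision rate should be close to 1 - theta/pi (Equation 6).
rng = np.random.default_rng(1)
x, y = rng.standard_normal(16), rng.standard_normal(16)
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
hashes = [make_angular_lsh(16, np.random.default_rng(s)) for s in range(2000)]
emp = np.mean([h(x) == h(y) for h in hashes])
print(emp, 1 - theta / np.pi)
```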
2.2 Asymmetric Locality-Sensitive Hashing
Recent studies have shown that ALSH is an effective approach for solving the problems of MIPS and NNS over $d_w^{l_2^2}$ [16, 21, 23, 24]. An ALSH scheme processes NNS queries in a similar way to an LSH scheme and relies on an ALSH family. Formally, the definition of an ALSH family is as follows.
Definition 2 (ALSH Family)
An ALSH family $\mathcal{H} = \{(f, g)\}$ is called $(r_1, r_2, p_1, p_2)$-sensitive if for any data point $x \in \mathcal{X}$ and query point $q \in \mathcal{Y}$, the following conditions are satisfied:
• If $dist(x, q) \le r_1$, then $\Pr[f(x) = g(q)] \ge p_1$;
• If $dist(x, q) \ge r_2$, then $\Pr[f(x) = g(q)] \le p_2$;
• $r_1 < r_2$ and $p_1 > p_2$.
From Definition 2, we can see that an ALSH family consists of a set of hash functions $f$ for data points and a set of hash functions $g$ for query points, and it ensures that each query point collides with closer data points with higher probability. In practice, $(f, g)$ is often implemented with an LSH family $\{h\}$ and two vector functions $P(\cdot)$ and $Q(\cdot)$, called the Preprocessing Transformation and the Query Transformation, respectively [16, 21, 23, 24]. Thus, the hash value of each data point $x$ is computed as $f(x) = h(P(x))$ and the hash value of each query point $q$ is computed as $g(q) = h(Q(q))$, as sketched below.
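The asymmetric pattern can be sketched as follows; P_example and Q_example are toy placeholder transformations (not the ones proposed later in this paper), included only to show how a single LSH function h is shared between the data side and the query side.

```python
import numpy as np

def alsh_pair(h, P, Q):
    """Build the asymmetric pair (f, g): data points are hashed as h(P(x)),
    query points as h(Q(q)), with the same underlying LSH function h."""
    f = lambda x: h(P(x))   # preprocessing transformation, applied offline to the data
    g = lambda q: h(Q(q))   # query transformation, applied at query time
    return f, g

# Illustrative placeholders only: pad data and query vectors differently.
P_example = lambda x: np.concatenate([x, [np.dot(x, x)]])
Q_example = lambda q: np.concatenate([q, [0.0]])
```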
Fundamentally, both LSH and ALSH schemes obtain approximate nearest neighbors by efficiently solving the $(r_1, r_2)$-Near Neighbor Search ($(r_1, r_2)$-NNS) problem defined as follows.
Definition 3 ($(r_1, r_2)$-NNS)
Given a distance function $dist(\cdot, \cdot)$, two distance thresholds $r_1$ and $r_2$ ($r_1 < r_2$) and a data set $D$, for any query point $q$, the $(r_1, r_2)$-NNS problem is to return a point $o' \in D$ satisfying $dist(o', q) \le r_2$ if there exists a point $o \in D$ satisfying $dist(o, q) \le r_1$.
The theorem below indicates that the $(r_1, r_2)$-NNS problem can be solved with an LSH or ALSH scheme in sublinear time.

Theorem 1 (see, e.g., [15, 21])

Given an $(r_1, r_2, p_1, p_2)$-sensitive LSH or ALSH family, there exists a scheme that solves the $(r_1, r_2)$-NNS problem over $n$ data points with $O(n^{\rho} \log n)$ query time and $O(n^{1+\rho})$ space, where $\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)} < 1$.
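For intuition, consider a hypothetical family (not one from this paper) with $p_1 = 1/2$ and $p_2 = 1/4$: then $\rho = \ln(1/p_1)/\ln(1/p_2) = \ln 2/\ln 4 = 0.5$, so the query time is $O(\sqrt{n}\log n)$, which is sublinear in the data size $n$.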
3 Negative Results
In this section, we present some negative results on the existence of LSH and ALSH families for NNS over $d_w^{l_1}$.
The following theorem indicates that it is impossible to find an LSH or ALSH family for NNS over $d_w^{l_1}$ over the entire space $\mathbb{R}^d$ ($d \ge 3$).
Theorem 2
For any $r_1 < r_2$, $p_1 > p_2$ and $d \ge 3$, there is no $(r_1, r_2, p_1, p_2)$-sensitive LSH or ALSH family for NNS over $d_w^{l_1}$ over $\mathbb{R}^d$.
Proof. An LSH (or ALSH) family for NNS over $d_w^{l_1}$ over $\mathbb{R}^d$ ($d \ge 3$) is also an LSH (or ALSH) family for NNS over $d_w^{l_1}$ over a three-dimensional subspace, i.e., over $\mathbb{R}^3$. Hence, we only need to prove that there is no LSH or ALSH family for NNS over $d_w^{l_1}$ over $\mathbb{R}^3$. Assume by contradiction that for some $r_1 < r_2$ and $p_1 > p_2$ there exists an $(r_1, r_2, p_1, p_2)$-sensitive LSH family or ALSH family for NNS over $d_w^{l_1}$ over $\mathbb{R}^3$. Consider a set of $n$ data points $\{x_1, \ldots, x_n\}$ and a set of $n$ query points $\{q_1, \ldots, q_n\}$, where for $1 \le i \le n$,
(7)
The weight vector $w_j$ specified along with each query point $q_j$ is set as follows:
(8)
Thus, the distance $d_{w_j}^{l_1}(x_i, q_j)$ is determined for every $1 \le i, j \le n$. As can be seen, $d_{w_j}^{l_1}(x_i, q_j) \le r_1$ if $i \le j$ and $d_{w_j}^{l_1}(x_i, q_j) \ge r_2$ if $i > j$. Let $S$ be the $n \times n$ sign matrix where each element is

$S_{ij} = +1$ if $d_{w_j}^{l_1}(x_i, q_j) \le r_1$, and $S_{ij} = -1$ if $d_{w_j}^{l_1}(x_i, q_j) \ge r_2$.   (9)

Obviously, $S$ is triangular with +1 on and above the diagonal and -1 below it. Consider also the $n \times n$ matrix $P$ of collision probabilities, $P_{ij} = \Pr[h(x_i) = h(q_j)]$ (for an LSH family) or $P_{ij} = \Pr[f(x_i) = g(q_j)]$ (for an ALSH family). Let $t = \frac{p_1 + p_2}{2}$ and $\epsilon = \frac{p_1 - p_2}{2}$. It is easy to see that $S_{ij}(P_{ij} - t) \ge \epsilon$ for every $1 \le i, j \le n$. That is,

$S \circ (P - tJ) \ge \epsilon J,$   (10)

where $\circ$ denotes the Hadamard (element-wise) product and $J$ is the $n \times n$ all-ones matrix. From [25], the margin complexity of the sign matrix $S$ is $mc(S) = \min_{X : S \circ X \ge J} \|X\|_{\max}$, where $\|\cdot\|_{\max}$ is the max-norm of a matrix. Since $S$ is an $n \times n$ triangular matrix, the margin complexity of $S$ is $\Omega(\log n)$ according to [8]. Therefore, from Equation 10, we can obtain

$mc(S) \le \frac{1}{\epsilon}\,\|P - tJ\|_{\max}.$   (11)

Since $P$ is a collision probability matrix, the max-norm of $P$ satisfies $\|P\|_{\max} \le 1$ [20]. Shifting $P$ by $tJ$ changes the max-norm by at most $t \le 1$. Thus, we have

$\|P - tJ\|_{\max} \le 2.$   (12)

Combining Equations 11 and 12, we can easily derive that $\Omega(\log n) \le \frac{4}{p_1 - p_2}$. For any fixed $p_1 > p_2$, we get a contradiction by selecting a large enough $n$.
The proof of Theorem 2 is similar to that of Theorem 3.1 in [21]. Due to space limitations, for the details of the max-norm and margin complexity involved in the proof of Theorem 2, please refer to http://proceedings.mlr.press/v37/neyshabur15-supp.pdf.
Actually, in real scenarios data points and query points are usually located in bounded spaces. Consider the typical case of $\mathcal{X} = \mathcal{Y} = \mathcal{B}$, where $\mathcal{B} \subset \mathbb{R}^d$ is a bounded space. The following theorem shows the nonexistence of an LSH family for NNS over $d_w^{l_1}$ over $\mathcal{B}$.
Theorem 3
For any $r_1 < r_2$ and $p_1 > p_2$, there is no $(r_1, r_2, p_1, p_2)$-sensitive LSH family for NNS over $d_w^{l_1}$ over $\mathcal{B}$.
Proof. Assume by contradiction that for some $r_1 < r_2$ and $p_1 > p_2$ there exists an $(r_1, r_2, p_1, p_2)$-sensitive LSH family for NNS over $d_w^{l_1}$ over $\mathcal{B}$. Let $x$ and $y$ be two distinct points in $\mathcal{B}$ (we ignore the trivial case that $\mathcal{B}$ contains only a single point). As $x \neq y$, we can always set $w$ to a value such that $d_w^{l_1}(x, y) \le r_1$ and thus $\Pr[h(x) = h(y)] \ge p_1$. Moreover, we can always set $w$ to another value such that $d_w^{l_1}(x, y) \ge r_2$ and thus $\Pr[h(x) = h(y)] \le p_2$. However, since data points should be hashed before queries arrive, $h$ cannot involve $w$. So $\Pr[h(x) = h(y)]$ is not affected by $w$, which leads to a contradiction because $p_1 > p_2$.
Due to the negative results in Theorems 2 and 3, we seek to propose ALSH families for NNS over $d_w^{l_1}$ over bounded spaces in Section 4. Notice that if an ALSH family is suitable for NNS over $d_w^{l_1}$ over a bounded space $\mathcal{B}$, it must also be suitable for NNS over $d_w^{l_1}$ over any bounded space contained in $\mathcal{B}$. Thus, it is sufficient to deal with the case where the bounded space contains all data and query points. Further, suppose that all point coordinates are non-negative; otherwise, this can be satisfied by shifting all points by the same vector without changing the results of NNS over $d_w^{l_1}$.
4 Our Solutions
Let $\mathcal{N}_M = \{0, 1, \ldots, M\}^d$ ($M$ is a positive integer). The following Observation 1 indicates that if we find an ALSH family for NNS over $d_w^{l_1}$ over $\mathcal{N}_M$, a similar ALSH family can be immediately obtained for NNS over $d_w^{l_1}$ over other bounded spaces. Thus, we only need to consider NNS over $d_w^{l_1}$ over $\mathcal{N}_M$ in the rest of the paper. Note that $M$ in our solutions can be an arbitrary positive integer.
Observation 1
Define a vector function $\phi(\cdot)$ that rescales each point of a bounded space into $\mathcal{N}_M$. For any $r_1 < r_2$ and $p_1 > p_2$, if $\mathcal{H} = \{(f, g)\}$ is an $(r_1, r_2, p_1, p_2)$-sensitive ALSH family for NNS over $d_w^{l_1}$ over $\mathcal{N}_M$, then $\{(f \circ \phi, g \circ \phi)\}$ must be an $(r_1', r_2', p_1, p_2)$-sensitive ALSH family for NNS over $d_w^{l_1}$ over the bounded space, where $r_1'$ and $r_2'$ are the thresholds $r_1$ and $r_2$ rescaled accordingly.
Proof. Let $x$ and $q$ be a data point and a query point in the bounded space. A simple calculation shows that $d_w^{l_1}(\phi(x), \phi(q))$ is proportional to $d_w^{l_1}(x, q)$. Thus, $d_w^{l_1}(\phi(x), \phi(q)) \le r_1$ if $d_w^{l_1}(x, q) \le r_1'$ and $d_w^{l_1}(\phi(x), \phi(q)) \ge r_2$ if $d_w^{l_1}(x, q) \ge r_2'$. Further, since $\mathcal{H}$ is $(r_1, r_2, p_1, p_2)$-sensitive over $\mathcal{N}_M$, we have $\Pr[f(\phi(x)) = g(\phi(q))] \ge p_1$ if $d_w^{l_1}(x, q) \le r_1'$ and $\Pr[f(\phi(x)) = g(\phi(q))] \le p_2$ if $d_w^{l_1}(x, q) \ge r_2'$. As a result, $\{(f \circ \phi, g \circ \phi)\}$ is an $(r_1', r_2', p_1, p_2)$-sensitive ALSH family for NNS over $d_w^{l_1}$ over the bounded space (note: $r_1' < r_2'$ always holds since $r_1 < r_2$).
4.1 From NNS over $d_w^{l_1}$ to MIPS
In the following, we take two steps to convert the problem of NNS over $d_w^{l_1}$ into the problem of MIPS. As a result of these steps, a novel preprocessing transformation and a novel query transformation are introduced for data points and query points, respectively. The two transformations are essential to our solutions.
Step 1: Convert NNS over $d_w^{l_1}$ into NNS over $d_w^{H}$
The generalized weighted Hamming distance, denoted by $d_w^{H}$, is defined on the Hamming space and computed in the same way as the generalized weighted Manhattan distance $d_w^{l_1}$. That is, $d_w^{H}(y, z) = \sum_{i} w_i\,|y_i - z_i|$ for any $y, z \in \{0, 1\}^{d'}$ and $w \in \mathbb{R}^{d'}$.
Inspired by [10], we complete this step by applying unary coding. Specifically, each point $x = (x_1, x_2, \ldots, x_d) \in \mathcal{N}_M$ is mapped into a binary vector $u(x) = (u(x_1); u(x_2); \ldots; u(x_d))$, where (;) is the concatenation and each $u(x_i)$ is the unary representation of $x_i$, i.e., a sequence of $x_i$ 1's followed by $(M - x_i)$ 0's. Then $u(x) \in \{0, 1\}^{dM}$ for any $x \in \mathcal{N}_M$. Moreover, the weight vector $w = (w_1, w_2, \ldots, w_d)$ is mapped into $\bar{w} = (\bar{w}_1; \bar{w}_2; \ldots; \bar{w}_d)$, where each $\bar{w}_i$ is a sequence of $M$ copies of $w_i$. As a result, we have
$d_w^{l_1}(x, q) = d_{\bar{w}}^{H}(u(x), u(q)),$   (13)
where $x, q \in \mathcal{N}_M$, $w \in \mathbb{R}^d$, $\bar{w} \in \mathbb{R}^{dM}$ and $u(x), u(q) \in \{0, 1\}^{dM}$. Equation 13 indicates that through the above mappings NNS over $d_w^{l_1}$ over $\mathcal{N}_M$ is converted into NNS over $d_{\bar{w}}^{H}$ over $\{0, 1\}^{dM}$.
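A small Python sketch of the unary-coding step; the helper names are illustrative, and the assertion checks the equality stated in Equation 13 on a random instance.

```python
import numpy as np

def unary(x, M):
    """Map an integer vector x in {0,...,M}^d to u(x) in {0,1}^(d*M):
    each coordinate x_i becomes x_i ones followed by (M - x_i) zeros."""
    return np.concatenate([[1] * int(xi) + [0] * int(M - xi) for xi in x])

def repeat_weights(w, M):
    """Map w in R^d to w_bar in R^(d*M) by repeating each w_i M times."""
    return np.repeat(w, M)

# Check Equation 13 on a random instance (domain assumed to be {0,...,M}^d).
rng = np.random.default_rng(0)
d, M = 5, 7
x, q = rng.integers(0, M + 1, d), rng.integers(0, M + 1, d)
w = rng.standard_normal(d)
lhs = np.sum(w * np.abs(x - q))                                        # d_w^{l1}(x, q)
rhs = np.sum(repeat_weights(w, M) * np.abs(unary(x, M) - unary(q, M)))  # weighted Hamming
assert np.isclose(lhs, rhs)
```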
Step 2: Convert NNS over $d_{\bar{w}}^{H}$ into MIPS
This step is based on the following observation.
Observation 2
For any $y, z \in \{0, 1\}$, the equation $|y - z| = y(1 - z) + (1 - y)z$ always holds.
Proof. We only need to check two cases. Case 1: If $y = z$, then $|y - z| = 0 = y(1 - z) + (1 - y)z$. Case 2: If $y \ne z$, then $|y - z| = 1 = y(1 - z) + (1 - y)z$.
For any $y \in \{0, 1\}^{dM}$ and $\bar{w} \in \mathbb{R}^{dM}$, we define the preprocessing and query vector functions as follows:

(14)
(15)

where

(16)
(17)

According to Observation 2, we have

(18)

where $\langle \cdot, \cdot \rangle$ is the inner product of two vectors, and the transformed data and query vectors in Equation 18 are respectively as follows:

(19)
(20)

From Equation 18, we can see that NNS over $d_{\bar{w}}^{H}$ over $\{0, 1\}^{dM}$ can be converted into MIPS over the transformed data and query vectors.
To sum up, after Steps 1 and 2, we convert NNS over $d_w^{l_1}$ into MIPS by using two composite functions that respectively map data points and query points from $\mathcal{N}_M$ into two higher-dimensional spaces. Let $P(\cdot)$ denote the composite function for data points and $Q(\cdot)$ the one for query points. The vector functions $P(\cdot)$ and $Q(\cdot)$ are respectively the preprocessing and query transformations for the ALSH families introduced later.
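Because Equations 14-20 are not reproduced above, the following sketch shows one concrete way Steps 1 and 2 can be combined; the transformations P_vec and Q_vec below are an assumption built on the binary identity in Observation 2, not necessarily the paper's exact definitions.

```python
import numpy as np

def unary(x, M):
    """Unary-code an integer vector: x_i ones then (M - x_i) zeros per coordinate."""
    return np.concatenate([[1] * int(xi) + [0] * int(M - xi) for xi in x])

def P_vec(x, M):
    """Assumed preprocessing transformation: u(x) concatenated with its complement."""
    ux = unary(x, M)
    return np.concatenate([ux, 1 - ux])

def Q_vec(q, w, M):
    """Assumed query transformation: weighted parts of u(q), negated, so that
    <P_vec(x), Q_vec(q, w)> = -d_w^{l1}(x, q)."""
    uq, wbar = unary(q, M), np.repeat(w, M)
    return -np.concatenate([wbar * (1 - uq), wbar * uq])

# Minimizing d_w^{l1}(x, q) is then the same as maximizing the inner product,
# i.e., a Maximum Inner Product Search (MIPS) instance.
rng = np.random.default_rng(0)
d, M = 4, 5
x, q, w = rng.integers(0, M + 1, d), rng.integers(0, M + 1, d), rng.standard_normal(d)
assert np.isclose(P_vec(x, M) @ Q_vec(q, w, M), -np.sum(w * np.abs(x - q)))
```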
4.2 ALSH Schemes for NNS over $d_w^{l_1}$
Next, we formally present two ALSH schemes for NNS over $d_w^{l_1}$: the first one is called $(d_w^{l_1}, l_2)$-ALSH and the second one is called $(d_w^{l_1}, \theta)$-ALSH. $(d_w^{l_1}, l_2)$-ALSH solves the problem of NNS over $d_w^{l_1}$ by reducing it to the problem of NNS over the $l_2$ distance, while $(d_w^{l_1}, \theta)$-ALSH solves it by reducing it to the problem of NNS over the Angular distance.
4.2.1 $(d_w^{l_1}, l_2)$-ALSH
Based on the transformations $P(\cdot)$ and $Q(\cdot)$ and the LSH family $\mathcal{H}_{l_2}$ introduced in Section 2.1, $(d_w^{l_1}, l_2)$-ALSH uses the ALSH family $\mathcal{H}_1 = \{(f, g)\}$, where $f(x) = h_{a,b}^{l_2}(P(x))$ and $g(q) = h_{a,b}^{l_2}(Q(q))$ for $h_{a,b}^{l_2} \in \mathcal{H}_{l_2}$. Combining Equations 13 and 18 we obtain
(21)
It is easy to know
(22)
(23) |
Thus, we have
(24) |
Let $s = l_2(P(x), Q(q))$. According to Equations 4 and 24, the collision probability function with respect to $d_w^{l_1}(x, q)$ is
(25) |
Since $p^{l_2}(\cdot)$ is a decreasing function, the collision probability in Equation 25 decreases as $d_w^{l_1}(x, q)$ increases, and hence $p_1 > p_2$ holds for any $r_1 < r_2$, where $p_1$ and $p_2$ denote the collision probabilities in Equation 25 at distances $r_1$ and $r_2$. Therefore, we obtain the following Lemma 1.
Lemma 1
$\mathcal{H}_1$ is $(r_1, r_2, p_1, p_2)$-sensitive for any $r_1 < r_2$, where $p_1$ and $p_2$ are the collision probabilities given by Equation 25 at distances $r_1$ and $r_2$, respectively.
Theorem 4
$(d_w^{l_1}, l_2)$-ALSH can solve the problem of $(r_1, r_2)$-NNS over $d_w^{l_1}$ with $O(n^{\rho_1} \log n)$ query time and $O(n^{1+\rho_1})$ space, where $\rho_1 = \frac{\ln(1/p_1)}{\ln(1/p_2)} < 1$ and $p_1$, $p_2$ are as in Lemma 1.
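To make the reduction concrete, here is a small hedged sketch that inlines the assumed transformations from the Section 4.1 sketch (so the snippet is self-contained); it checks numerically that the $l_2$ distance after transformation never decreases as $d_w^{l_1}(x, q)$ grows, which is the kind of monotone relationship the scheme relies on.

```python
import numpy as np

rng = np.random.default_rng(2)
d, M = 4, 6
u = lambda v: (np.arange(M) < np.asarray(v)[:, None]).ravel().astype(float)  # unary coding
P = lambda x: np.concatenate([u(x), 1 - u(x)])                    # assumed data transform
Q = lambda q, w: -np.concatenate([np.repeat(w, M) * (1 - u(q)), np.repeat(w, M) * u(q)])

q, w = rng.integers(0, M + 1, d), rng.standard_normal(d)
pairs = sorted((np.sum(w * np.abs(x - q)), np.linalg.norm(P(x) - Q(q, w)))
               for x in (rng.integers(0, M + 1, d) for _ in range(200)))
l2s = [l2 for _, l2 in pairs]
# Sorted by d_w^{l1}(x, q), the l2 distances after transformation never decrease,
# so the l2 LSH family of Section 2.1 can be applied to the transformed vectors.
assert all(a <= b + 1e-9 for a, b in zip(l2s, l2s[1:]))
```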
4.2.2 $(d_w^{l_1}, \theta)$-ALSH
Now we introduce the scheme of $(d_w^{l_1}, \theta)$-ALSH. Based on the transformations $P(\cdot)$ and $Q(\cdot)$ and the LSH family $\mathcal{H}_{\theta}$ introduced in Section 2.1, $(d_w^{l_1}, \theta)$-ALSH uses the ALSH family $\mathcal{H}_2 = \{(f, g)\}$, where $f(x) = h_{a}^{\theta}(P(x))$ and $g(q) = h_{a}^{\theta}(Q(q))$ for $h_{a}^{\theta} \in \mathcal{H}_{\theta}$. According to Equations 21, 22 and 23, the relationship between $\theta(P(x), Q(q))$ and $d_w^{l_1}(x, q)$ is as follows:
(26) |
Let $s = \theta(P(x), Q(q))$. From Equations 6 and 26, it can be seen that the collision probability function with respect to $d_w^{l_1}(x, q)$ is
(27) |
It is easy to know that the collision probability in Equation 27 is a decreasing function of $d_w^{l_1}(x, q)$. Thus, $p_1 > p_2$ holds for any $r_1 < r_2$, where $p_1$ and $p_2$ denote the collision probabilities in Equation 27 at distances $r_1$ and $r_2$. Then we obtain the following Lemma 2.
Lemma 2
$\mathcal{H}_2$ is $(r_1, r_2, p_1, p_2)$-sensitive for any $r_1 < r_2$, where $p_1$ and $p_2$ are the collision probabilities given by Equation 27 at distances $r_1$ and $r_2$, respectively.
Theorem 5
$(d_w^{l_1}, \theta)$-ALSH can solve the problem of $(r_1, r_2)$-NNS over $d_w^{l_1}$ with $O(n^{\rho_2} \log n)$ query time and $O(n^{1+\rho_2})$ space, where $\rho_2 = \frac{\ln(1/p_1)}{\ln(1/p_2)} < 1$ and $p_1$, $p_2$ are as in Lemma 2.
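As a hedged illustration, under the assumed transformations from the Section 4.1 sketch, $\|P(x)\|_2 = \sqrt{dM}$ and $\|Q(q)\|_2 = \sqrt{M}\,\|w\|_2$ are constants for a fixed query, so $\theta(P(x), Q(q)) = \arccos\!\left(\frac{-d_w^{l_1}(x, q)}{M\sqrt{d}\,\|w\|_2}\right)$, which increases monotonically with $d_w^{l_1}(x, q)$; this is the kind of monotone relationship that Equation 26 captures and that makes the Angular LSH family applicable to the transformed vectors.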
4.2.3 Implementation Details of $(d_w^{l_1}, l_2)$-ALSH and $(d_w^{l_1}, \theta)$-ALSH
The scheme of $(d_w^{l_1}, l_2)$-ALSH (or $(d_w^{l_1}, \theta)$-ALSH) needs to compute the hash values $f(x) = h_{a,b}^{l_2}(P(x))$ and $g(q) = h_{a,b}^{l_2}(Q(q))$ (or $f(x) = h_{a}^{\theta}(P(x))$ and $g(q) = h_{a}^{\theta}(Q(q))$). The running time of computing $f(x)$ is dominated by the time cost of obtaining $a \cdot P(x)$, and the running time of computing $g(q)$ is dominated by the time cost of obtaining $a \cdot Q(q)$, where $a$ is a vector of the same dimensionality as $P(x)$ whose entries are chosen independently from the standard normal distribution. The naive approach to obtain $a \cdot P(x)$ or $a \cdot Q(q)$ is to compute the inner product of the two corresponding vectors directly. However, this requires $O(dM)$ multiplications and additions, which is expensive when $M$ is large.
Next, we show how to obtain $a \cdot P(x)$ with only $O(d)$ additions and obtain $a \cdot Q(q)$ with only $O(d)$ additions and $d$ multiplications. Suppose $a = (a_1; a_2; \ldots)$, where each $a_i \in \mathbb{R}^M$. According to Equations 14-17, 19 and 20, $a \cdot P(x)$ and $a \cdot Q(q)$ decompose into one term per block $a_i$. Since the corresponding block of $P(x)$ is a sequence of 0's followed by 1's and the corresponding block of $Q(q)$ is built from a sequence of 1's followed by 0's, the term of $a \cdot P(x)$ for block $a_i$ is the sum of the last elements of $a_i$, and the term of $a \cdot Q(q)$ for block $a_i$ is, up to the weight $w_i$, the sum of the first elements of $a_i$. Thus, we preprocess the vector $a$ to obtain its per-block prefix sums $A = (A_1; A_2; \ldots)$, where
$A_i[k] = \sum_{j=1}^{k} a_i[j], \quad k = 0, 1, \ldots, M.$   (28)
Then $a \cdot P(x)$ can be written as a sum of $d$ such block terms read off from $A$, so it can be obtained with $O(d)$ additions by using $A$. Similarly, $a \cdot Q(q)$ can be written as a weighted sum of $d$ block terms read off from $A$, so it can be obtained with $O(d)$ additions and $d$ multiplications by using $A$, as sketched below.
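A hedged Python/NumPy sketch of this precomputation, under the same assumed P/Q layout as the earlier sketches (the projection vector a is assumed to have one length-M block per coordinate and per half of the transformed vector); the function names are illustrative, and the final check compares against the naive inner products.

```python
import numpy as np

def precompute_prefix_sums(a, d, M):
    """Per-block prefix sums of the projection vector a (assumed length 2*d*M):
    csum[i, k] = sum of the first k entries of block i."""
    blocks = a.reshape(2 * d, M)
    return np.concatenate([np.zeros((2 * d, 1)), np.cumsum(blocks, axis=1)], axis=1)

def proj_P(csum, x, d, M):
    """a . P(x) with O(d) additions: first-x_i sums plus last-(M - x_i) sums."""
    idx = np.arange(d)
    return np.sum(csum[idx, x]) + np.sum(csum[d + idx, M] - csum[d + idx, x])

def proj_Q(csum, q, w, d, M):
    """a . Q(q, w) with O(d) additions and d multiplications by the weights."""
    idx = np.arange(d)
    per_coord = (csum[idx, M] - csum[idx, q]) + csum[d + idx, q]
    return -np.sum(w * per_coord)

# Consistency check against the naive O(d*M) inner products.
rng = np.random.default_rng(3)
d, M = 4, 6
a = rng.standard_normal(2 * d * M)
u = lambda v: (np.arange(M) < np.asarray(v)[:, None]).ravel().astype(float)
P = lambda x: np.concatenate([u(x), 1 - u(x)])
Q = lambda q, w: -np.concatenate([np.repeat(w, M) * (1 - u(q)), np.repeat(w, M) * u(q)])
x, q, w = rng.integers(0, M + 1, d), rng.integers(0, M + 1, d), rng.standard_normal(d)
csum = precompute_prefix_sums(a, d, M)
assert np.isclose(proj_P(csum, x, d, M), a @ P(x))
assert np.isclose(proj_Q(csum, q, w, d, M), a @ Q(q, w))
```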
5 Conclusion
This paper studies the fundamental problem of Nearest Neighbor Search (NNS) over the generalized weighted Manhattan distance $d_w^{l_1}$. As far as we know, there is no prior work that solves this problem in sublinear time. In this paper, we first prove that there is no LSH or ALSH family for $d_w^{l_1}$ over the entire space $\mathbb{R}^d$. Then, we prove that there is still no LSH family suitable for $d_w^{l_1}$ over a bounded space. After that, we propose two ALSH families for $d_w^{l_1}$ over a bounded space. Based on these ALSH families, two ALSH schemes, $(d_w^{l_1}, l_2)$-ALSH and $(d_w^{l_1}, \theta)$-ALSH, are proposed for solving NNS over $d_w^{l_1}$ in sublinear time.
References
- [1] Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional spaces. In ICDT, 2001.
- [2] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 1975.
- [3] Gautam Bhattacharya, Koushik Ghosh, and Ananda S. Chowdhury. Granger causality driven AHP for feature weighted knn. Pattern Recognit., 2017.
- [4] Moses Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
- [5] Lei Chen. Curse of dimensionality. In Encyclopedia of Database Systems. 2009.
- [6] King Lum Cheung and Ada Wai-Chee Fu. Enhanced nearest neighbour search on the r-tree. SIGMOD Rec., 1998.
- [7] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, 2004.
- [8] Jürgen Forster, Niels Schmitt, Hans Ulrich Simon, and Thorsten Suttorp. Estimating the optimal margins of embeddings in euclidean half spaces. Mach. Learn., 2003.
- [9] Junhao Gan, Jianlin Feng, Qiong Fang, and Wilfred Ng. Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD, 2012.
- [10] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
- [11] Yupeng Gu, Bo Zhao, David Hardtke, and Yizhou Sun. Learning global term weights for content-based recommender systems. In WWW, 2016.
- [12] Alexander Hinneburg, Charu C. Aggarwal, and Daniel A. Keim. What is the nearest neighbor in high dimensional spaces? In VLDB, 2000.
- [13] Qiang Huang, Jianlin Feng, Qiong Fang, Wilfred Ng, and Wei Wang. Query-aware locality-sensitive hashing scheme for $l_p$ norm. VLDB J., 2017.
- [14] Chein-Shung Hwang, Yi-Ching Su, and Kuo-Cheng Tseng. Using genetic algorithms for personalized recommendation. In ICCCI, Lecture Notes in Computer Science. Springer, 2010.
- [15] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
- [16] Yifan Lei, Qiang Huang, Mohan S. Kankanhalli, and Anthony K. H. Tung. Sublinear time nearest neighbor search over generalized weighted space. In ICML, 2019.
- [17] Kejing Lu, Hongya Wang, Wei Wang, and Mineichi Kudo. VHP: approximate nearest neighbor search via virtual hypersphere partitioning. Proc. VLDB Endow., 2020.
- [18] Julian J. McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-based recommendations on styles and substitutes. In SIGIR, 2015.
- [19] Alejandro Moreo, Andrea Esuli, and Fabrizio Sebastiani. Learning to weight for text classification. IEEE Trans. Knowl. Data Eng., 2020.
- [20] Behnam Neyshabur, Yury Makarychev, and Nathan Srebro. Clustering, hamming embedding, generalized LSH and the max norm. In ALT, 2014.
- [21] Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric lshs for inner product search. In ICML, 2015.
- [22] Hanan Samet. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.
- [23] Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NeurIPS, 2014.
- [24] Anshumali Shrivastava and Ping Li. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In UAI, 2015.
- [25] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In COLT, Lecture Notes in Computer Science. Springer, 2005.
- [26] Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. TODS, 2010.
- [27] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. CoRR, 2014.
- [28] Bolong Zheng, Xi Zhao, Lianggui Weng, Nguyen Quoc Viet Hung, Hang Liu, and Christian S. Jensen. PM-LSH: A fast and accurate LSH framework for high-dimensional approximate NN search. Proc. VLDB Endow., 2020.