
Set2Box: Similarity Preserving Representation Learning for Sets

Geon Lee KAIST AI
geonlee0325@kaist.ac.kr
   Chanyoung Park KAIST ISysE & AI
cy.park@kaist.ac.kr
   Kijung Shin KAIST AI & EE
kijungs@kaist.ac.kr
Abstract

Sets have been used for modeling various types of objects (e.g., a document as the set of keywords in it and a customer as the set of the items that she has purchased). Measuring similarity (e.g., Jaccard Index) between sets has been a key building block of a wide range of applications, including plagiarism detection, recommendation, and graph compression. However, as sets have grown in numbers and sizes, the computational cost and storage required for set similarity computation have become substantial, and this has led to the development of hashing- and sketching-based solutions.

In this work, we propose Set2Box, a learning-based approach for compressed representations of sets from which various similarity measures can be estimated accurately in constant time. The key idea is to represent sets as boxes to precisely capture overlaps of sets. Additionally, based on the proposed box quantization scheme, we design Set2Box+, which yields more concise yet more accurate box representations of sets. Through extensive experiments on 8 real-world datasets, we show that, compared to baseline approaches, Set2Box+ is (a) Accurate: achieving up to 40.8× smaller estimation error while requiring 60% fewer bits to encode sets, (b) Concise: yielding up to 96.8× more concise representations with similar estimation error, and (c) Versatile: enabling the estimation of four set-similarity measures from a single representation of each set.

I Introduction

Sets are ubiquitous, modeling various types of objects in many domains, including (a) a document: modeled as the set of keywords in it, (b) a customer: modeled as the set of the items that she has purchased, (c) a social circle: modeled as the set of its members, and (d) a question on online Q/A platforms: modeled as the set of tags attached to the question. Moreover, a number of set similarity measures (e.g., Jaccard Index and Dice Index), most of which are based on the overlaps between sets, have been developed.

As a result of the omnipresence of sets, measuring their similarity has been employed as a fundamental building block of a wide range of applications, including the following:

\circ Plagiarism Detection: Plagiarism is a critical problem in the digital age, where a vast amount of resources is accessible. A text is modeled as a “bag of words,” and texts whose set representations are highly similar are suspected of plagiarism [1].

\circ Gene Expression Mining: Mining gene expressions is useful for understanding clinical conditions (e.g., tumor and cancer). The functionality of a set of genes is estimated by comparing the set with other sets with known functionality [2].

\circ Recommendation: Recommendation is essential to support users in finding relevant items. To this end, it is useful to identify users with similar tastes (e.g., users who purchased a similar set of items and users with similar activities) [3, 4].

\circ Graph Compression: As large-scale graphs are omnipresent, compressing them into coarse-grained summary graphs so that they fit in main memory is important. In many graph compression algorithms, nodes with similar sets of neighbors are merged into a supernode to yield a concise summary graph while minimizing the information loss [5, 6].

\circ Medical Image Analysis: CT and MRI scans provide exquisite details of the inner body (e.g., the brain), and they are often described as a collection of spatially localized anatomical features termed “keypoints”. Sets of keypoints from different images are compared to diagnose and investigate diseases [7, 8, 9].

As sets grow in numbers and sizes, computation of set similarity requires substantial computational cost and storage. For example, similarities between tens of millions of nodes, which are represented as neighborhood sets of up to millions of neighbors, were measured for graph compression [5]. Moreover, similarities between tens of thousands of movies, which are represented as sets of up to hundreds of thousands of users who have rated them, were measured for movie recommendation [3].

In order to reduce the space and computation required for set-similarity computation, a number of approaches based on hashing and sketching [4, 10] have been developed. While their simplicity and theoretical guarantees are tempting, significant gains are expected if patterns in a given collection of sets can be learned and exploited.

In this paper, we propose Set2Box, a learning-based approach for compressed representations of sets from which various similarity measures can be estimated accurately in constant time. The key idea of Set2Box is to represent sets as boxes so as to accurately capture the overlaps between sets and, from them, the similarities between sets. Specifically, by using the volumes of boxes to approximate the sizes of sets, Set2Box derives representations that are: (a) Concise: representing sets of arbitrary sizes using the same number of bits, (b) Accurate: accurately modeling overlaps between sets, and (c) Versatile: supporting the estimation of various set similarity measures in constant time. These properties are supported by the geometric nature of boxes, which share primary characteristics of sets. In addition, we propose Set2Box+, which yields even more concise yet more accurate boxes based on the proposed box quantization scheme. We summarize our contributions as follows:

  • Accurate & Versatile Algorithm: We propose Set2Box, a set representation learning method that accurately preserves similarity between sets in terms of four measures.

  • Concise & Effective Algorithm: We devise Set2Box+ to enhance Set2Box through an end-to-end box quantization scheme. It yields up to 40.8× more accurate similarity estimation while requiring 60% fewer bits than its competitors.

  • Extensive Experiments: Using 8 real-world datasets, we validate the advantages of Set2Box+ over its competitors and the effectiveness of each of its components.

For reproducibility, the code and data are available at https://github.com/geon0325/Set2Box.

TABLE I: Frequently-used symbols.
Notation | Definition
$\mathcal{S}=\{s_{1},\dots,s_{|\mathcal{S}|}\}$ | set of sets
$\mathcal{E}=\{e_{1},\dots,e_{|\mathcal{E}|}\}$ | set of entities
$\mathrm{B}=(\mathrm{c},\mathrm{f})$ | a box with center $\mathrm{c}$ and offset $\mathrm{f}$
$\mathbb{V}(\mathrm{B})$ | volume of box $\mathrm{B}$
$\mathcal{T}^{+}$ and $\mathcal{T}^{-}$ | sets of positive & negative samples
$\mathbf{Q}^{\mathrm{c}}\in\mathbb{R}^{|\mathcal{E}|\times d}$ | center embedding matrix of entities
$\mathbf{Q}^{\mathrm{f}}\in\mathbb{R}_{+}^{|\mathcal{E}|\times d}$ | offset embedding matrix of entities
$D$ | number of subspaces
$K$ | number of key boxes in each subspace

In Section II, we review related work. In Section III, we define the problem of similarity-preserving set embedding and discuss intuitive approaches. In Section IV, we present Set2Box and Set2Box+. In Section V, we provide experimental results. In Section VI, we analyze the considered methods. Lastly, we offer conclusions in Section VII.

II Related Work

Here, we review previous studies related to our work.

Similarity-Preserving Embedding: Representation learning for preserving similarities between instances has been studied for graphs [11, 12, 13, 14], images [15, 16, 17], and texts [18]. These methods aim to yield high-quality embeddings by minimizing the information loss of the original data. However, most of them are designed to preserve a predetermined similarity matrix, which is not extensible to new measures [13, 14]. In this paper, we focus on the problem of learning similarity-preserving representations for sets, and we aim to learn a versatile representation of sets from which various similarity measures (e.g., Jaccard Index and Dice Index) can be estimated.

Box Embedding: Boxes [19] are useful abstractions for expressing high-order information in data. Thanks to their powerful expressiveness, they have been used in diverse applications including knowledge bases [20, 21, 22, 23, 24, 25], word embedding [26], image embedding [27], and recommender systems [28, 29]. For instance, Query2Box [24] uses boxes to embed queries with logical conjunctions ($\wedge$) or disjunctions ($\lor$). Zhang et al. [28] represent users as boxes to accurately model the users' preferences for items. In contrast, in this work, we embed sets as boxes to accurately preserve their structural relationships and the similarities between them. From an algorithmic perspective, methods for improving the optimization of box embeddings have been presented; examples include smoothing hard edges using Gaussian convolutions [30] and improving the parameter identifiability of boxes using Gumbel random variables [31].

Set Embedding: The problem of embedding sets has attracted much attention, with the unique requirements of permutation invariance and size flexibility. For example, DeepSets [32] uses simple symmetric functions over input features, and Set2Set [33] is based on an LSTM-based pooling function. Set Transformer [34] uses an attention-based pooling function to aggregate information of the entities. Despite their promising results in some predictive tasks, they suffer from several limitations. First, they require attribute information of entities, which in fact largely affects the quality of the set embeddings. In addition, their set representations are trained specifically for downstream tasks, and thus they may lose the explicit similarity information of sets, which we aim to preserve in this paper. Alternatively, sets can be represented as compact binary vectors by hashing or sketching [4, 10], without requiring attribute information. Such binary vectors are used by Locality Sensitive Hashing (LSH) and its variants [35, 36, 37] for a rapid search of similar sets based on a predefined similarity measure (e.g., Jaccard Index). Refer to Section III for further discussion of set embedding methods.

Differentiable Product Quantization: Product quantization [38, 39] is an effective strategy for vector compression. Recently, deep learning methods for learning discrete codes in an end-to-end manner have been proposed [40, 41], and they have been applied in knowledge graphs [42] and image retrieval [43, 44, 45]. In this paper, we propose a novel box quantization method for compressing boxes while preserving their original geometric properties.

III Preliminaries

In this section, we introduce notations and define the problem. Then, we review some intuitive methods for the problem.

Notations: Consider a set $\mathcal{S}=\{s_{1},\cdots,s_{|\mathcal{S}|}\}$ of sets and a set $\mathcal{E}=\{e_{1},\cdots,e_{|\mathcal{E}|}\}$ of entities. Each set $s\in\mathcal{S}$ is a non-empty subset of $\mathcal{E}$, and its size (i.e., cardinality) is denoted by $|s|$. A representation of the set $s$ is denoted by $\mathrm{z}_{s}$, and its encoding cost (the number of bits to encode $\mathrm{z}_{s}$) is denoted by $Cost(\mathrm{z}_{s})$. Refer to Table I for frequently-used notations.

Problem Definition: The problem of learning similarity-preserving set representations, which we focus on in this work, is formulated as follows:

Problem 1 (Similarity-Preserving Set Embedding).
  • Given: (1) a set $\mathcal{S}$ of sets and (2) a budget $b$

  • Find: a latent representation $\mathrm{z}_{s}$ of each set $s\in\mathcal{S}$

  • to Minimize: the difference between (1) the similarity between $s$ and $s^{\prime}$ and (2) the similarity between $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$, for all $s\neq s^{\prime}\in\mathcal{S}$

  • Subject to: the total encoding cost $\textit{Cost}(\{\mathrm{z}_{s}:s\in\mathcal{S}\})\leq b$.

In this paper, we consider four set-similarity measures and use the mean squared error (MSE)¹ to measure the differences, while our proposed methods are not specialized to these choices.

¹ MSE $=\sum_{s\neq s^{\prime}\in\mathcal{S}}{|\text{sim}(s,s^{\prime})-\widehat{\text{sim}}(\mathrm{z}_{s},\mathrm{z}_{s^{\prime}})|}^{2}$, where $\text{sim}(\cdot,\cdot)$ and $\widehat{\text{sim}}(\cdot,\cdot)$ are the similarity between sets and that between latent representations, respectively.

Desirable Properties: We expect set embeddings for Problem 1 to have the following desirable properties:

  • Accuracy: How can we accurately preserve similarities between sets? Similarities approximated using learned representations should be close to ground-truth similarities.

  • Conciseness: How can we obtain compact representations that give a good trade-off between accuracy and encoding cost? It is desirable to use a smaller amount of memory to store embeddings while keeping them informative.

  • Generalizability: Due to the size flexibility of sets, there are infinitely many possible combinations of entities, and thus retraining the entire model for new sets is intractable. It is desirable for a model to be generalizable to unseen sets.

  • Versatility: While there have been various definitions of set similarities, the choice of the similarity metric plays a key role in practical analyses and applications. This motivates us to learn versatile representations of sets that can be used to approximate diverse similarity measures.

  • Speed: Using the obtained embeddings, set similarities should be rapidly estimated, regardless of their cardinalities.

(a) Random Hashing: MSE = 0.0884, Cost = 77.312 KB
(b) Vector Embedding: MSE = 0.0495, Cost = 77.312 KB
(c) Set2Box+: MSE = 0.0125, Cost = 15.695 KB
Figure 1: Compared to intuitive methods, Set2Box+ preserves the Overlap Coefficient between sets in the MovieLens 1M dataset more accurately while requiring a smaller encoding cost. Rows and columns represent sets, and each cell represents the estimation error of pairwise set similarity. The indices of the sets are sorted by the sizes of the sets.

Intuitive Methods: Keeping the above desirable properties in mind, we discuss simple and intuitive set-embedding methods for similarity preservation.

  • Random Hashing [4]: Each set $s$ is encoded as a binary vector $\mathrm{z}_{s}\in\{0,1\}^{d}$ by mapping each entity into one of $d$ different values using a hash function $h(\cdot):\mathcal{E}\rightarrow\{1,\cdots,d\}$. Specifically, the representation $\mathrm{z}_{s}$ is derived by:

    \mathrm{z}_{s}[i]=\begin{cases}1&\text{if $\exists e\in s$ s.t. $h(e)=i$}\\ 0&\text{otherwise.}\end{cases}

    The size of the set $s$ is estimated from the L1 norm (or the number of nonzero elements) of $\mathrm{z}_{s}$, i.e., $|s|\approx\|\mathrm{z}_{s}\|_{1}$. In addition, the sizes of the intersection and the union of sets $s$ and $s^{\prime}$ are estimated from:

    |s\cap s^{\prime}|\approx\|\mathrm{z}_{s}\;\textbf{AND}\;\mathrm{z}_{s^{\prime}}\|_{1}\;\;\;\text{and}\;\;\;|s\cup s^{\prime}|\approx\|\mathrm{z}_{s}\;\textbf{OR}\;\mathrm{z}_{s^{\prime}}\|_{1},

    respectively, where AND and OR are dimension-wise operations. Based on these approximations, any set similarity (e.g., Jaccard Index) can be estimated (see the sketch after this list).

  • Vector Embedding: Another popular approach is to represent sets as vectors and to compute inner products between them to estimate a predefined set similarity. More precisely, given two sets $s$ and $s^{\prime}$ and their vector representations $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$, it aims to approximate a predefined $\text{sim}(s,s^{\prime})$ by the inner product of $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$, i.e., $\langle\mathrm{z}_{s},\mathrm{z}_{s^{\prime}}\rangle\approx\text{sim}(s,s^{\prime})$.
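As a concrete illustration, the following is a minimal sketch of the random-hashing baseline in Python/NumPy; the dimensionality, the entity-id range, and the hash construction are illustrative assumptions, not the exact setup used in the paper.

```python
# A minimal sketch of the random-hashing baseline (hypothetical helper names;
# d, the entity-id range, and the hash function are illustrative choices).
import numpy as np

def encode_set(entities, d, seed=0):
    """Encode a set of entity ids as a d-dimensional binary vector."""
    rng = np.random.default_rng(seed)
    # One shared random mapping h: entity id -> bucket in {0, ..., d-1}
    # (same seed => same mapping for every set).
    hash_table = rng.integers(0, d, size=10_000)   # assumes entity ids < 10,000
    z = np.zeros(d, dtype=bool)
    for e in entities:
        z[hash_table[e]] = True
    return z

def estimate_jaccard(z_s, z_t):
    """Estimate |s ∩ s'| / |s ∪ s'| from the binary sketches."""
    inter = np.logical_and(z_s, z_t).sum()   # |s ∩ s'| ≈ ||z_s AND z_s'||_1
    union = np.logical_or(z_s, z_t).sum()    # |s ∪ s'| ≈ ||z_s OR z_s'||_1
    return inter / union if union > 0 else 0.0

# Example: two overlapping sets of entity ids.
s, t = {1, 2, 3, 4}, {3, 4, 5}
print(estimate_jaccard(encode_set(s, d=64), encode_set(t, d=64)))
```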

These methods, however, suffer from several limitations. In random hashing, the maximum size of a set that a binary vector can accurately represent is $d$, and thus sets whose sizes are larger than $d$ inevitably suffer from information loss. This is empirically verified in Figure 1(a); while estimations are accurate for small sets, the error increases as the sets grow larger. The vector embedding method avoids this problem but shows weakness in its versatility. That is, vectors are derived to preserve a predefined similarity (e.g., Jaccard Index), and thus they are not reusable to estimate other similarity measures (e.g., Dice Index). To address these issues, in this work, we propose Set2Box and Set2Box+, novel end-to-end algorithms for similarity-preserving set embedding. As shown in Figure 1, Set2Box+ preserves similarities between sets more accurately than the random hashing and vector embedding methods, while requiring fewer bits to encode sets.

IV Proposed Method

In this section, we present our proposed method for similarity-preserving set embedding. We first present Set2Box, a novel algorithm for learning similarity-preserving set representations using boxes (Sec. IV-A). Then, we propose Set2Box+, an advanced version of Set2Box, which achieves better conciseness and accuracy (Sec. IV-B).

IV-A Set2Box: Preliminary Version 

How can we derive set embeddings that accurately preserve similarity in terms of various metrics? Towards this goal, we first present Set2Box, a preliminary set representation method that effectively learns the set itself and the structural relations with other sets.

Concepts: A box is a $d$-dimensional hyper-rectangle whose representation consists of its center and offset [19]. The center describes the location of the box in the latent space, and the offset is half the length of each edge of the box. Formally, given a box $\mathrm{B}=(\mathrm{c},\mathrm{f})$ whose center $\mathrm{c}\in\mathbb{R}^{d}$ and offset $\mathrm{f}\in\mathbb{R}_{+}^{d}$ are in the same latent space, the box is defined as the bounded region:

\mathrm{B}\equiv\{\mathrm{p}\in\mathbb{R}^{d}:\mathrm{c}-\mathrm{f}\preceq\mathrm{p}\preceq\mathrm{c}+\mathrm{f}\},

where $\mathrm{p}$ is any point within the box. We let $\mathrm{m}\in\mathbb{R}^{d}$ and $\mathrm{M}\in\mathbb{R}^{d}$ be the vectors representing the minimum and the maximum of the box in each dimension, respectively, i.e., $\mathrm{m}=\mathrm{c}-\mathrm{f}$ and $\mathrm{M}=\mathrm{c}+\mathrm{f}$. Given two boxes $\mathrm{B}_{X}=(\mathrm{c}_{X},\mathrm{f}_{X})$ and $\mathrm{B}_{Y}=(\mathrm{c}_{Y},\mathrm{f}_{Y})$, their intersection is also a box, represented as:

\mathrm{B}_{X}\cap\mathrm{B}_{Y}\equiv\{\mathrm{p}\in\mathbb{R}^{d}:\mathbf{max}(\mathrm{m}_{X},\mathrm{m}_{Y})\preceq\mathrm{p}\preceq\mathbf{min}(\mathrm{M}_{X},\mathrm{M}_{Y})\}.

The volume $\mathbb{V}(\mathrm{B})$ of the box $\mathrm{B}$ is computed as the product of the edge lengths over all dimensions, i.e., $\mathbb{V}(\mathrm{B})=\prod_{i=1}^{d}(\mathrm{M}[i]-\mathrm{m}[i])$. The volume of the union of the two boxes is simply computed as $\mathbb{V}(\mathrm{B}_{X})+\mathbb{V}(\mathrm{B}_{Y})-\mathbb{V}(\mathrm{B}_{X}\cap\mathrm{B}_{Y})$.
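The following is a minimal PyTorch sketch of these box operations (our own helper names, not the authors' code): volumes from edge lengths, and the intersection box from element-wise max/min of the corners.

```python
# A minimal sketch: a box is a pair (center c, offset f), its corners are
# m = c - f and M = c + f, its volume is the product of edge lengths, and the
# intersection of two boxes is again a box.
import torch

def box_volume(c, f):
    m, M = c - f, c + f
    return torch.clamp(M - m, min=0).prod(dim=-1)       # hard (un-smoothed) volume

def box_intersection(c_x, f_x, c_y, f_y):
    m = torch.maximum(c_x - f_x, c_y - f_y)              # max of the two minima
    M = torch.minimum(c_x + f_x, c_y + f_y)              # min of the two maxima
    return (m + M) / 2, torch.clamp(M - m, min=0) / 2    # center and offset of B_X ∩ B_Y

def union_volume(c_x, f_x, c_y, f_y):
    c_i, f_i = box_intersection(c_x, f_x, c_y, f_y)
    return box_volume(c_x, f_x) + box_volume(c_y, f_y) - box_volume(c_i, f_i)
```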

Representation: The core idea of Set2Box is to model each set $s$ as a box $\mathrm{B}_{s}=(\mathrm{c}_{s},\mathrm{f}_{s})$ so that its relations with other sets are properly preserved in the latent space. To this end, Set2Box makes the volumes of the boxes proportional to the relative sizes of the sets, i.e., $\mathbb{V}(\mathrm{B}_{s})\propto|s|$. Beyond the single-set level, Set2Box aims to preserve the relations between different sets by making the volumes of box intersections proportional to the intersection sizes of the sets, i.e., $\mathbb{V}(\mathrm{B}_{s_{i}}\cap\mathrm{B}_{s_{j}})\propto|s_{i}\cap s_{j}|$. Notably, Set2Box not only addresses the limitations of random hashing and vector-based embeddings, but also has various advantages derived from the unique properties of boxes, as we discuss in Section VI.

Objective: We now turn our attention to how to capture such overlaps between sets using boxes. Recall that our goal is to derive accurate and versatile representations of sets. Towards the first goal, we take relations beyond pairwise ones into consideration. Specifically, we consider three different levels of set relations (i.e., single, pair, and triple-wise relations) to capture the underlying high-order structure of sets. Towards the second goal, we aim to derive versatile set representations that can be used to estimate various similarity measures (e.g., Jaccard Index and Dice Index). With these goals in mind, we design an objective function that preserves the elemental relations among each triple of sets. Specifically, given a triple $\{s_{i},s_{j},s_{k}\}$ of sets, we consider seven cardinalities from three different levels of subsets: (1) $|s_{i}|$, $|s_{j}|$, $|s_{k}|$, (2) $|s_{i}\cap s_{j}|$, $|s_{j}\cap s_{k}|$, $|s_{k}\cap s_{i}|$, and (3) $|s_{i}\cap s_{j}\cap s_{k}|$, which contain single, pair, and triple-wise information, respectively, and we denote them by $c_{1}(s_{i},s_{j},s_{k})$ to $c_{7}(s_{i},s_{j},s_{k})$. These seven cardinalities fully describe the relations among the three sets, and the considered similarity measures are all computable from them. In this regard, we aim to preserve the ratios of the seven cardinalities by the volumes of the boxes $\mathrm{B}_{s_{i}}$, $\mathrm{B}_{s_{j}}$, and $\mathrm{B}_{s_{k}}$ by minimizing the following objective:

\mathcal{J}(s_{i},s_{j},s_{k},\mathrm{B}_{s_{i}},\mathrm{B}_{s_{j}},\mathrm{B}_{s_{k}})=\sum_{\ell=1}^{7}\left(p_{\ell}(s_{i},s_{j},s_{k})-\hat{p}_{\ell}(\mathrm{B}_{s_{i}},\mathrm{B}_{s_{j}},\mathrm{B}_{s_{k}})\right)^{2},

where $p_{\ell}$ is the ratio of the $\ell$th cardinality among the three sets (i.e., $p_{\ell}=c_{\ell}/\sum_{\ell^{\prime}}c_{\ell^{\prime}}$) and $\hat{p}_{\ell}$ is the corresponding ratio estimated by the boxes. Since there exist $\binom{|\mathcal{S}|}{3}$ possible triples of sets, taking all such combinations into account is practically intractable, and thus we resort to sampling some of them. We sample a set $\mathcal{T}$ of triples that consists of a set $\mathcal{T}^{+}$ of positive triples and a set $\mathcal{T}^{-}$ of negative triples, i.e., $\mathcal{T}=\mathcal{T}^{+}\cup\mathcal{T}^{-}$. Specifically, the positive set $\mathcal{T}^{+}$ and the negative set $\mathcal{T}^{-}$ are obtained by sampling triples of connected (i.e., overlapping) sets and triples of uniformly random sets, respectively. Then, the final objective function we aim to minimize is:

\mathcal{L}=\sum\nolimits_{\{s_{i},s_{j},s_{k}\}\in\mathcal{T}}\mathcal{J}(s_{i},s_{j},s_{k},\mathrm{B}_{s_{i}},\mathrm{B}_{s_{j}},\mathrm{B}_{s_{k}}). (1)

Notably, the proposed objective function captures not only the pairwise interactions between sets, but also the triple-wise relations, and thus high-order overlapping patterns of the sets. In addition, it does not rely on any predefined similarity measure; rather, it is a general objective for learning key structural patterns of sets and their neighbors. This prevents the model from overfitting to a specific measure and enables the model to yield accurate estimates for diverse metrics, as shown empirically in Section V.
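For concreteness, a minimal sketch of the ground-truth side of this objective is given below: the seven cardinalities of a triple of sets and their normalized ratios $p_{\ell}$. The estimated ratios $\hat{p}_{\ell}$ would be computed analogously from the (intersection) volumes of the three boxes; function names are illustrative.

```python
# A minimal sketch of the seven cardinalities c_1, ..., c_7 of a triple of sets
# and their normalized ratios p_ell = c_ell / sum(c).
def cardinalities(si, sj, sk):
    return [
        len(si), len(sj), len(sk),                    # single-set sizes
        len(si & sj), len(sj & sk), len(sk & si),     # pairwise intersections
        len(si & sj & sk),                            # triple-wise intersection
    ]

def ratios(si, sj, sk):
    c = cardinalities(si, sj, sk)
    total = sum(c)
    return [x / total for x in c]

# The per-triple loss J is then sum_ell (p_ell - p_hat_ell)^2.
p = ratios({1, 2, 3}, {2, 3, 4}, {3, 4, 5})
print(p)  # seven ratios summing to 1
```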

Box Embedding: Then, given a set $s$, how can we derive the box $\mathrm{B}_{s}=(\mathrm{c}_{s},\mathrm{f}_{s})$, that is, its center $\mathrm{c}_{s}$ and offset $\mathrm{f}_{s}$? To make the method generalizable to unseen sets, Set2Box introduces a pair of learnable embedding matrices $\mathbf{Q}^{\mathrm{c}}\in\mathbb{R}^{|\mathcal{E}|\times d}$ and $\mathbf{Q}^{\mathrm{f}}\in\mathbb{R}_{+}^{|\mathcal{E}|\times d}$ of entities, where $\mathbf{Q}_{i}^{\mathrm{c}}\in\mathbb{R}^{d}$ and $\mathbf{Q}_{i}^{\mathrm{f}}\in\mathbb{R}^{d}$ represent the center and offset of an entity $e_{i}$, respectively. Then, the embeddings of the entities in the set $s$ are aggregated to obtain the center $\mathrm{c}_{s}$ and the offset $\mathrm{f}_{s}$:

\mathrm{c}_{s}=\textbf{pooling}(s,\mathbf{Q}^{\mathrm{c}})\;\;\;\text{and}\;\;\;\mathrm{f}_{s}=\textbf{pooling}(s,\mathbf{Q}^{\mathrm{f}}),

where pooling is a permutation-invariant function. Instead of using simple functions such as mean or max, we use attention to highlight the entities that are important for obtaining either the center or the offset of the box. To this end, we define a pooling function that takes the context of each set into account, termed set-context pooling (SCP). Specifically, given a set $s$ and an entity embedding matrix $\mathbf{Q}$ (which can be either $\mathbf{Q}^{\mathrm{c}}$ or $\mathbf{Q}^{\mathrm{f}}$), it first obtains the set-specific context vector $\mathrm{b}_{s}$:

\mathrm{b}_{s}=\sum_{e_{i}\in s}\alpha_{i}\mathbf{Q}_{i}\;\;\;\text{where}\;\;\;\alpha_{i}=\frac{\exp(\mathrm{a}^{\intercal}\mathbf{Q}_{i})}{\sum_{e_{j}\in s}\exp(\mathrm{a}^{\intercal}\mathbf{Q}_{j})},

where $\mathrm{a}$ is a global context vector shared by all sets. Then, using the context vector $\mathrm{b}_{s}$, which specifically contains the information on set $s$, it obtains the output embedding from:

\textbf{SCP}(s,\mathbf{Q})=\sum_{e_{i}\in s}\omega_{i}\mathbf{Q}_{i}\;\;\text{where}\;\;\omega_{i}=\frac{\exp(\mathrm{b}_{s}^{\intercal}\mathbf{Q}_{i})}{\sum_{e_{j}\in s}\exp(\mathrm{b}_{s}^{\intercal}\mathbf{Q}_{j})}.

To be precise, $\mathrm{c}_{s}=\textbf{SCP}(s,\mathbf{Q}^{\mathrm{c}})$ and $\mathrm{f}_{s}=|s|^{\frac{1}{d}}\,\textbf{SCP}(s,\mathbf{Q}^{\mathrm{f}})$. Note that, for the offset $\mathrm{f}_{s}$, we further take unique geometric properties of boxes into consideration. For any entity $e_{i}\in s$, the subset relation $\{e_{i}\}\subseteq s$ holds, and thus the analogous condition $\mathrm{B}_{\{e_{i}\}}\subseteq\mathrm{B}_{s}$ is desired for boxes, which requires $\mathrm{f}_{\{e_{i}\}}\preceq\mathrm{f}_{s}$ and thus $\max_{e_{i}\in s}\mathrm{f}_{\{e_{i}\}}=\max_{e_{i}\in s}\mathbf{Q}_{i}^{\mathrm{f}}\preceq\mathrm{f}_{s}$. However, since SCP is a weighted mean of the entities' embeddings, its output is bounded by the input embeddings in every dimension, i.e., $\min_{e_{i}\in s}\mathbf{Q}_{i}^{\mathrm{f}}\preceq\textbf{SCP}(s,\mathbf{Q}^{\mathrm{f}})\preceq\max_{e_{i}\in s}\mathbf{Q}_{i}^{\mathrm{f}}$, which inevitably contradicts the aforementioned condition. For this reason, we multiply the offset by an additional factor $|s|^{\frac{1}{d}}$ that helps boxes properly preserve the set similarity, i.e., $\mathrm{f}_{s}=|s|^{\frac{1}{d}}\,\textbf{SCP}(s,\mathbf{Q}^{\mathrm{f}})$.
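A minimal PyTorch sketch of SCP is shown below; the module and variable names are ours, and the initialization details are illustrative assumptions rather than the authors' exact implementation.

```python
# A minimal sketch of set-context pooling (SCP): a global context vector a
# produces per-entity weights alpha, their weighted sum gives a set-specific
# context b_s, and b_s in turn produces the final attention weights omega.
import torch
import torch.nn as nn

class SCP(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.randn(dim))   # global context vector

    def forward(self, Q_s):                        # Q_s: (|s|, d) embeddings of entities in s
        alpha = torch.softmax(Q_s @ self.a, dim=0)            # first-level weights
        b_s = (alpha.unsqueeze(-1) * Q_s).sum(dim=0)          # set-specific context
        omega = torch.softmax(Q_s @ b_s, dim=0)               # second-level weights
        return (omega.unsqueeze(-1) * Q_s).sum(dim=0)         # pooled embedding

# Center and offset of the box for a set s with |s| = 5 entities:
d, scp_c, scp_f = 4, SCP(4), SCP(4)
Q_c, Q_f = torch.randn(5, d), torch.rand(5, d)      # rows for the entities in s
c_s = scp_c(Q_c)
f_s = (5 ** (1.0 / d)) * scp_f(Q_f)                 # |s|^(1/d) regularizer on the offset
```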

Smoothing Boxes: By definition, a box $\mathrm{B}=(\mathrm{c},\mathrm{f})$ is a bounded region with hard edges whose volume is

\mathbb{V}(\mathrm{B})=\prod\nolimits_{i=1}^{d}(\mathrm{M}[i]-\mathrm{m}[i])=\prod\nolimits_{i=1}^{d}\text{ReLU}(\mathrm{M}[i]-\mathrm{m}[i]),

where $\mathrm{m}=\mathrm{c}-\mathrm{f}$ and $\mathrm{M}=\mathrm{c}+\mathrm{f}$. This, however, disables gradient-based optimization when boxes are disjoint [30], and thus we smooth the boxes by using an approximation of ReLU:

\mathbb{V}(\mathrm{B})=\prod\nolimits_{i=1}^{d}\text{Softplus}(\mathrm{M}[i]-\mathrm{m}[i]),

where $\text{Softplus}(x)=\frac{1}{\beta}\log\left(1+\exp(\beta x)\right)$ approximates $\text{ReLU}(x)$ and becomes closer to it as $\beta$ increases (specifically, Softplus $\rightarrow$ ReLU as $\beta\rightarrow\infty$). In this way, any pair of boxes overlaps, and thus non-zero gradients are computed for optimization.
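A minimal sketch of the smoothed volume, assuming PyTorch's built-in softplus (the helper name is illustrative):

```python
# Softplus with parameter beta replaces ReLU so that even disjoint boxes
# yield non-zero gradients during training.
import torch
import torch.nn.functional as F

def smoothed_volume(c, f, beta=2.0):
    m, M = c - f, c + f
    return F.softplus(M - m, beta=beta).prod(dim=-1)

# As beta grows, the smoothed volume approaches the hard volume prod(ReLU(M - m)).
```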

Encoding Cost: Each box consists of two vectors, a center and an offset, and it requires $2\cdot 32d=64d$ bits to encode them, assuming that we are using float-32 to represent each real number. Thus, $64|\mathcal{S}|d$ bits are required to store the box embeddings of $|\mathcal{S}|$ sets.

IV-B Set2Box+: Advanced Version 

We describe Set2Box+, which enhances Set2Box in terms of conciseness and accuracy based on an end-to-end box quantization scheme. Specifically, Set2Box+ compresses the box embeddings into a compact set of key boxes and a set of discrete codes from which the original boxes are reconstructed.

Box Quantization: We propose box quantization, a novel scheme for compressing boxes using a substantially smaller number of bits. Conventional product quantization methods [40], which are designed for vector compression, are straightforwardly applicable by independently quantizing the center and the offset of each box. However, such an approach hardly makes use of the geometric properties of boxes, and thus it does not properly reflect the complex relations between them. The proposed box quantization scheme effectively addresses this issue through two steps: (1) box discretization and (2) box reconstruction.

(a) Product Quantization
(b) Box Quantization
Figure 2: An example of (a) product quantization and (b) box quantization when $D$ (number of subspaces) $=4$ and $K$ (number of key boxes) $=3$. While inner products of vectors are computed to measure closeness in product quantization [40], the proposed box quantization scheme incorporates geometric relations between boxes.

\circ Box Discretization. Given the box $\mathrm{B}_{s}=(\mathrm{c}_{s},\mathrm{f}_{s})$ of a set $s$, we discretize the box as a $K$-way $D$-dimensional discrete code $\mathrm{C}_{s}\in\{1,\cdots,K\}^{D}$, which is more compact and requires far fewer bits to encode than real numbers. To this end, we divide the $d$-dimensional latent space into $D$ subspaces ($\mathbb{R}^{d/D}$) and, for each subspace, learn $K$ key boxes. Specifically, in the $i$th subspace, the $j$th key box is denoted by $\mathrm{K}_{j}^{(i)}=(\mathrm{c}_{j}^{(i)},\mathrm{f}_{j}^{(i)})$, where $\mathrm{c}_{j}^{(i)}\in\mathbb{R}^{d/D}$ and $\mathrm{f}_{j}^{(i)}\in\mathbb{R}_{+}^{d/D}$ are the center and offset of the key box, respectively. The original box $\mathrm{B}_{s}$ is also partitioned into $D$ sub-boxes $\mathrm{B}_{s}^{(1)},\cdots,\mathrm{B}_{s}^{(D)}$, and the $i$th code of $\mathrm{C}_{s}$ is decided by:

\mathrm{C}_{s}[i]=\operatorname*{arg\,min}_{j}\textbf{dist}\left(\mathrm{B}_{s}^{(i)},\;\mathrm{K}_{j}^{(i)}\right),

where $\textbf{dist}(\cdot,\cdot)$ measures the distance (i.e., dissimilarity) between two boxes, and the criterion can be flexibly chosen. In this paper, we specify the dist function, using softmax, as:

\mathrm{C}_{s}[i]=\operatorname*{arg\,max}_{j}\frac{\exp\left(\textbf{BOR}\left(\mathrm{B}_{s}^{(i)},\mathrm{K}_{j}^{(i)}\right)\right)}{\sum_{j^{\prime}}\exp\left(\textbf{BOR}\left(\mathrm{B}_{s}^{(i)},\mathrm{K}_{j^{\prime}}^{(i)}\right)\right)}, (2)

where BOR (Box Overlap Ratio) is defined to measure how much two boxes $\mathrm{B}_{X}$ and $\mathrm{B}_{Y}$ overlap:

\textbf{BOR}(\mathrm{B}_{X},\mathrm{B}_{Y})=\frac{1}{2}\left(\frac{\mathbb{V}(\mathrm{B}_{X}\cap\mathrm{B}_{Y})}{\mathbb{V}(\mathrm{B}_{X})}+\frac{\mathbb{V}(\mathrm{B}_{X}\cap\mathrm{B}_{Y})}{\mathbb{V}(\mathrm{B}_{Y})}\right).

As shown in Figure 2, the proposed box quantization scheme incorporates the geometric relations between boxes, unlike conventional product quantization methods on vectors. To sum up, for each $i$th subspace, we search for the key box closest to the sub-box $\mathrm{B}_{s}^{(i)}$ and assign its index as the value of the $i$th dimension of the discrete code.
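The following is a minimal sketch of box discretization under the definitions above; the key boxes are assumed to be given as tensors of centers and offsets, and the helper names are ours.

```python
# For each of the D subspaces, compare the sub-box against the K key boxes via
# BOR and record the index of the best-overlapping key box.
import torch

def hard_volume(f):
    return torch.clamp(2 * f, min=0).prod(dim=-1)         # M - m = 2f

def bor(c_x, f_x, c_y, f_y):
    m = torch.maximum(c_x - f_x, c_y - f_y)
    M = torch.minimum(c_x + f_x, c_y + f_y)
    v_int = torch.clamp(M - m, min=0).prod(dim=-1)
    return 0.5 * (v_int / hard_volume(f_x) + v_int / hard_volume(f_y))

def discretize(c_s, f_s, key_c, key_f, D):
    """key_c, key_f: (D, K, d/D) centers/offsets of key boxes; returns a code of length D."""
    c_sub = c_s.view(D, -1)                                # (D, d/D) sub-box centers
    f_sub = f_s.view(D, -1)                                # (D, d/D) sub-box offsets
    code = []
    for i in range(D):
        scores = bor(c_sub[i], f_sub[i], key_c[i], key_f[i])   # broadcast over K key boxes
        code.append(int(scores.argmax()))
    return code
```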

\circ Box Reconstruction. Once the discrete code $\mathrm{C}_{s}$ of a set $s$ is generated, in this step, we reconstruct the original box from it. To be specific, we obtain the reconstructed box $\widehat{\mathrm{B}}_{s}=(\widehat{\mathrm{c}}_{s},\widehat{\mathrm{f}}_{s})$ by concatenating the $D$ key boxes from each subspace encoded in $\mathrm{C}_{s}$:

\widehat{\mathrm{B}}_{s}=\Big{\|}_{i=1}^{D}\mathrm{K}_{\mathrm{C}_{s}[i]}^{(i)}.

More precisely, $\widehat{\mathrm{B}}_{s}$ is reconstructed by concatenating the centers and the offsets of the $D$ key boxes, respectively. Since $\mathrm{C}_{s}$ encodes key boxes that largely overlap with the box $\mathrm{B}_{s}$ (i.e., have high BOR), if properly encoded, the reconstructed box $\widehat{\mathrm{B}}_{s}$ can be expected to be geometrically similar to the original box $\mathrm{B}_{s}$. In Section V-C, we demonstrate the effectiveness of the proposed box quantization scheme by comparing it with product quantization.

Differentiable Optimization: Recall that Set2Box+ is an end-to-end learnable algorithm, which requires all processes to be differentiable. However, the $\operatorname*{arg\,max}$ operation in Eq. (2) is non-differentiable, so we instead use the softmax with temperature $\tau$:

\widetilde{\mathrm{C}}_{s}[i][j]=\frac{\exp\left(\textbf{BOR}\left(\mathrm{B}_{s}^{(i)},\mathrm{K}_{j}^{(i)}\right)/\tau\right)}{\sum_{j^{\prime}}\exp\left(\textbf{BOR}\left(\mathrm{B}_{s}^{(i)},\mathrm{K}_{j^{\prime}}^{(i)}\right)/\tau\right)}. (3)

Note that $\widetilde{\mathrm{C}}_{s}[i]$ is a $K$-dimensional probabilistic vector whose $j$th element indicates the probability of $\mathrm{K}_{j}^{(i)}$ being assigned as the closest key box, i.e., the probability of $\mathrm{C}_{s}[i]=j$. Then, the key box $\widetilde{\mathrm{K}}_{s}^{(i)}=(\widetilde{\mathrm{c}}_{s}^{(i)},\widetilde{\mathrm{f}}_{s}^{(i)})$ in the $i$th subspace is the weighted sum of the $K$ key boxes:

\widetilde{\mathrm{K}}_{s}^{(i)}=\sum\nolimits_{j=1}^{K}\widetilde{\mathrm{C}}_{s}[i][j]\cdot\mathrm{K}_{j}^{(i)}.

If $\tau=0$, Eq. (3) is equivalent to the $\operatorname*{arg\,max}$ function, i.e., a one-hot vector whose $\mathrm{C}_{s}[i]$th dimension is 1 and whose other dimensions are 0. In this case, $\widetilde{\mathrm{K}}_{s}^{(i)}$ becomes equivalent to $\mathrm{K}_{\mathrm{C}_{s}[i]}^{(i)}$, which is the exact reconstruction derivable from the discrete code $\mathrm{C}_{s}$. However, since this hard selection is non-differentiable and thus prevents end-to-end optimization, we resort to the fully differentiable approximation given by the softmax with $\tau\neq 0$. Specifically, we use different values of $\tau$ in the forward ($\tau=0$) and backward ($\tau=1$) passes, which effectively enables differentiable optimization.
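A minimal sketch of this forward/backward trick, implemented as a straight-through estimator in PyTorch (our own formulation of the idea, not the authors' code):

```python
# Forward pass: the hard one-hot assignment (tau -> 0).
# Backward pass: gradients flow through the soft probabilities (tau = 1).
import torch

def straight_through_select(bor_scores, tau=1.0):
    """bor_scores: (K,) BOR values of one sub-box against the K key boxes."""
    soft = torch.softmax(bor_scores / tau, dim=-1)          # used in the backward pass
    index = soft.argmax(dim=-1)
    hard = torch.zeros_like(soft).scatter_(-1, index.unsqueeze(-1), 1.0)
    return hard + soft - soft.detach()                      # forward: hard, backward: soft

# The selected key box is then the weighted sum of the K key boxes with these weights.
```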

Joint Training: For further improvement, we introduce a joint learning scheme into the box quantization. Given a triple $\{s_{i},s_{j},s_{k}\}$ of sets from the training data $\mathcal{T}$, we obtain their boxes $\mathrm{B}_{s_{i}}$, $\mathrm{B}_{s_{j}}$, and $\mathrm{B}_{s_{k}}$ and their reconstructed counterparts $\widehat{\mathrm{B}}_{s_{i}}$, $\widehat{\mathrm{B}}_{s_{j}}$, and $\widehat{\mathrm{B}}_{s_{k}}$ using box quantization. While the basic version of Set2Box+ optimizes the following objective:

TABLE II: Statistics of the 8 real-world datasets: the number of entities $|\mathcal{E}|$, the number of sets $|\mathcal{S}|$, the maximum set size $\textbf{max}_{s\in\mathcal{S}}|s|$, and the size of the dataset $\textbf{sum}_{s\in\mathcal{S}}|s|$.
Dataset | $|\mathcal{E}|$ | $|\mathcal{S}|$ | $\textbf{max}_{s\in\mathcal{S}}|s|$ | $\textbf{sum}_{s\in\mathcal{S}}|s|$
Yelp (YP) | 25,252 | 25,656 | 649 | 467K
Amazon (AM) | 55,700 | 105,655 | 555 | 858K
Netflix (NF) | 17,769 | 478,615 | 12,206 | 56.92M
Gplus (GP) | 107,596 | 72,271 | 5,056 | 13.71M
Twitter (TW) | 81,305 | 70,097 | 1,205 | 1.76M
MovieLens 1M (ML1) | 3,533 | 6,038 | 1,435 | 575K
MovieLens 10M (ML10) | 10,472 | 69,816 | 3,375 | 5.89M
MovieLens 20M (ML20) | 22,884 | 138,362 | 4,168 | 12.20M

\mathcal{J}(s_{i},s_{j},s_{k},\widehat{\mathrm{B}}_{s_{i}},\widehat{\mathrm{B}}_{s_{j}},\widehat{\mathrm{B}}_{s_{k}}),

we additionally make use of the original boxes during the optimization. Specifically, we jointly train the original boxes together with the reconstructed ones so that both types of boxes can achieve high accuracy. To this end, we consider the following eight losses:

\mathcal{J}(s_{i},s_{j},s_{k},{\mathrm{B}}_{s_{i}},{\mathrm{B}}_{s_{j}},{\mathrm{B}}_{s_{k}}),\;\;\;\;\;\mathcal{J}(s_{i},s_{j},s_{k},\widehat{\mathrm{B}}_{s_{i}},{\mathrm{B}}_{s_{j}},{\mathrm{B}}_{s_{k}}),
\mathcal{J}(s_{i},s_{j},s_{k},{\mathrm{B}}_{s_{i}},\widehat{\mathrm{B}}_{s_{j}},{\mathrm{B}}_{s_{k}}),\;\;\;\;\;\mathcal{J}(s_{i},s_{j},s_{k},{\mathrm{B}}_{s_{i}},{\mathrm{B}}_{s_{j}},\widehat{\mathrm{B}}_{s_{k}}),
\mathcal{J}(s_{i},s_{j},s_{k},\widehat{\mathrm{B}}_{s_{i}},\widehat{\mathrm{B}}_{s_{j}},{\mathrm{B}}_{s_{k}}),\;\;\;\;\;\mathcal{J}(s_{i},s_{j},s_{k},\widehat{\mathrm{B}}_{s_{i}},{\mathrm{B}}_{s_{j}},\widehat{\mathrm{B}}_{s_{k}}),
\mathcal{J}(s_{i},s_{j},s_{k},{\mathrm{B}}_{s_{i}},\widehat{\mathrm{B}}_{s_{j}},\widehat{\mathrm{B}}_{s_{k}}),\;\;\;\;\;\mathcal{J}(s_{i},s_{j},s_{k},\widehat{\mathrm{B}}_{s_{i}},\widehat{\mathrm{B}}_{s_{j}},\widehat{\mathrm{B}}_{s_{k}}),

where we denote them by $\mathcal{J}_{1}$ to $\mathcal{J}_{8}$ for the sake of brevity. Notably, $\mathcal{J}_{1}$, which utilizes only the original boxes, is the objective used for Set2Box, and $\mathcal{J}_{8}$ considers only the reconstructed boxes. Based on these joint views from the two types of boxes, the final loss function we aim to minimize is:

\mathcal{L}=\sum_{\{s_{i},s_{j},s_{k}\}\in\mathcal{T}}\lambda\left(\mathcal{J}_{1}+\mathcal{J}_{2}+\mathcal{J}_{3}+\mathcal{J}_{4}+\mathcal{J}_{5}+\mathcal{J}_{6}+\mathcal{J}_{7}\right)+\mathcal{J}_{8}, (4)

where $\lambda$ is a coefficient balancing the losses from the joint views against the loss from the reconstructed boxes. In this way, both the original boxes and the reconstructed ones are trained together to be properly located and shaped in the latent space. Note that even though both types of boxes are jointly trained to achieve high accuracy, only the reconstructed boxes are used for inference. We conduct ablation studies to verify the effectiveness of the joint training scheme in Section V-C.

Encoding Cost: To encode the reconstructed boxes, Set2Box+ requires (1) the key boxes and (2) a discrete code for each set. There exist $K$ key boxes in each of the $D$ subspaces, each of dimensionality $d/D$, which requires $64Kd$ bits in total. Each set is encoded as a $K$-way $D$-dimensional code, which requires $D\log_{2}K$ bits. To sum up, to encode $|\mathcal{S}|$ sets, Set2Box+ requires $64Kd+|\mathcal{S}|D\log_{2}K$ bits. Notably, if $K\ll|\mathcal{S}|$, then the $64Kd$ bits are negligible, and typically $D\log_{2}K\ll 64d$ holds. Thus, the encoding cost of Set2Box+ is considerably smaller than that of Set2Box.
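As a quick sanity check of these formulas, the snippet below evaluates the two costs for a hypothetical collection of 100,000 sets with the configuration $(d,D,K)=(32,16,30)$ used later in the experiments; the set count is illustrative.

```python
# Compare the encoding cost of Set2Box and Set2Box+ for a hypothetical |S|.
import math

num_sets, d, D, K = 100_000, 32, 16, 30
cost_set2box = 64 * d * num_sets                           # 64·d·|S| bits
cost_set2boxp = 64 * K * d + num_sets * D * math.log2(K)   # 64·K·d + |S|·D·log2(K) bits
print(cost_set2box, cost_set2boxp, cost_set2boxp / cost_set2box)
```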

Similarity Computation: Once we obtain set representations, it is desirable to rapidly compute the estimated similarities in the latent space. The boxes derived by Set2Box and Set2Box+ allow the pairwise similarity between two sets to be computed in constant time, as formalized in Lemma 1.

Lemma 1 (Time Complexity of Similarity Estimation).

Given a pair of sets $s$ and $s^{\prime}$ and their boxes $\mathrm{B}_{s}$ and $\mathrm{B}_{s^{\prime}}$, respectively, it takes $O(d)$ time to compute the estimated similarity $\widehat{\text{sim}}(\mathrm{B}_{s},\mathrm{B}_{s^{\prime}})$, where $d$ is a user-defined constant that does not depend on the sizes of $s$ and $s^{\prime}$.

Proof. Assume that the true similarity $\text{sim}(s,s^{\prime})$ is computable from $|s|$, $|s^{\prime}|$, and $|s\cap s^{\prime}|$. These are estimated by $\mathbb{V}(\mathrm{B}_{s})$, $\mathbb{V}(\mathrm{B}_{s^{\prime}})$, and $\mathbb{V}(\mathrm{B}_{s}\cap\mathrm{B}_{s^{\prime}})$, respectively, and each of them is computed as the product of $d$ values, which takes $O(d)$ time. Hence, the total time complexity is $O(d)$. ∎
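A minimal sketch of such constant-time estimation is given below, using the hard box volumes and the four measures defined in Section V-A; the helper names are ours.

```python
# Estimate |s|, |s'|, and |s ∩ s'| by box volumes, then plug them into the
# Overlap Coefficient, Cosine Similarity, Jaccard Index, and Dice Index.
import torch

def estimate_similarities(c_s, f_s, c_t, f_t):
    def vol(f):
        return torch.clamp(2 * f, min=0).prod(dim=-1)   # edge lengths are 2f
    m = torch.maximum(c_s - f_s, c_t - f_t)
    M = torch.minimum(c_s + f_s, c_t + f_t)
    v_s, v_t = vol(f_s), vol(f_t)
    v_int = torch.clamp(M - m, min=0).prod(dim=-1)
    v_union = v_s + v_t - v_int
    return {
        "overlap": v_int / torch.minimum(v_s, v_t),
        "cosine":  v_int / torch.sqrt(v_s * v_t),
        "jaccard": v_int / v_union,
        "dice":    v_int / (0.5 * (v_s + v_t)),
    }
```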

V Experimental Results

(a) Overlap Coefficient
(b) Cosine Similarity
(c) Jaccard Index
(d) Dice Index
Figure 3: Set2Box+ preserves set similarities more accurately than Set2Bin, Set2Vec, Set2Vec+, and Set2Box. Note that Set2Box+ uses only 0.31-0.40 of the bits used by the competitors to embed sets. Moreover, while Set2Vec and Set2Vec+ need to be trained separately for each similarity metric, Set2Box and Set2Box+ do not require separate training.

We review our experiments, which are designed to answer Q1-Q3.

  1. Q1. Accuracy & Conciseness: Does Set2Box+ derive more concise and accurate set representations than its competitors?

  2. Q2. Effectiveness: How does Set2Box+ yield concise and accurate representations? Are all of its components useful?

  3. Q3. Effects of Parameters: How do the parameters of Set2Box+ affect the quality of set representations?

V-A Experimental Settings

Machines & Implementations: All experiments were conducted on a Linux server with RTX 3090 Ti GPUs. We implemented all methods, including Set2Box and Set2Box+, using the PyTorch library.

Hyperparameter Tuning: Table III describes the hyperparameter search space of each method. The numbers of training samples, $|\mathcal{T}^{+}|$ and $|\mathcal{T}^{-}|$, are both set to 10 for Set2Box, Set2Box+, and their variants. For the vector-based methods, Set2Vec and Set2Vec+, since three pairwise relations are extractable from each triple, $\left\lceil\frac{7}{3}|\mathcal{T}^{+}|\right\rceil$ positive samples and $\left\lceil\frac{7}{3}|\mathcal{T}^{-}|\right\rceil$ negative samples are used for training. We fix the batch size to 512 and use the Adam optimizer. In Set2Box+, we fix the softmax temperature $\tau$ to 1.

TABLE III: Search space of each method.
Method | Hyperparameter | Selection Pool
Set2Box | Learning rate | 0.001, 0.01
Set2Box | Box smoothing parameter $\beta$ | 1, 2, 4
Set2Box+ | Learning rate | 0.001, 0.01
Set2Box+ | Box smoothing parameter $\beta$ | 1, 2, 4
Set2Box+ | Joint training coefficient $\lambda$ | 0, 0.001, 0.01, 0.1, 1
TABLE IV: Encoding cost of the methods covered in this work. $|\mathcal{S}|$: number of sets. $d$: dimensionality. $D$: number of subspaces. $K$: number of key boxes (vectors) in each subspace.
Method | Encoding Cost (bits)
Set2Bin | $d|\mathcal{S}|$
Set2Vec | $32d|\mathcal{S}|$
Set2Vec+ | $32d|\mathcal{S}|$
Set2Box-order | $32d|\mathcal{S}|$
Set2Box-PQ | $64Kd+2|\mathcal{S}|D\log_{2}K$
Set2Box-BQ | $64Kd+|\mathcal{S}|D\log_{2}K$
Set2Box | $64d|\mathcal{S}|$
Set2Box+ | $64Kd+|\mathcal{S}|D\log_{2}K$

Datasets: We use eight publicly available datasets, summarized in Table II. The details of each dataset are as follows:

  • Yelp (YP) consists of user ratings on locations (e.g., hotels and restaurants), and each set is a group of locations that a user rated. Ratings higher than 3 are considered.

  • Amazon (AM) contains reviews of products (specifically, those categorized as Movies & TV) by users. In the dataset, each user has at least 5 reviews. A group of products reviewed by the same user is abstracted as a set.

  • Netflix (NF) is a collection of movie ratings from users. Each set consists of the movies rated by a user, and each entity is a movie. We consider ratings higher than 3.

  • Gplus (GP) is a directed social network that consists of ‘circles’ from Google+. Each set is the group of neighboring nodes of a node.

  • Twitter (TW) is also a directed social network consisting of ‘circles’ from Twitter. Each set is the group of neighbors of a node in the graph.

  • MovieLens (ML1, ML10, and ML20) are collections of movie ratings from anonymous users. Each set is a group of movies that a user rated. Sets of movies with ratings higher than 3 are considered.

Baselines: We compare Set2Box and Set2Box+ with the following baselines including the variants of the methods discussed in Section III:

  • Set2Bin encodes each set $s$ as a binary vector $\mathrm{z}_{s}\in\{0,1\}^{d}$ using a random hash function. See Section III for details.

  • Set2Vec embeds each set $s$ as a vector $\mathrm{z}_{s}\in\mathbb{R}^{d}$, which is obtained by pooling learnable entity embeddings using SCP. Precisely, given two sets $s$ and $s^{\prime}$ and their vector representations $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$, it aims to approximate the predefined set similarity $\text{sim}(s,s^{\prime})$ by the inner product of $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$, i.e., $\langle\mathrm{z}_{s},\mathrm{z}_{s^{\prime}}\rangle\approx\text{sim}(s,s^{\prime})$.

  • Set2Vec+ incorporates entity features $\mathbf{X}\in\mathbb{R}^{|\mathcal{E}|\times d}$ into the set representation. Features are projected using a fully-connected layer and then pooled into a set embedding.

In order to obtain the entity features for Set2Vec+, which incorporates features into the set representations, we generate a projected graph (a.k.a. clique expansion) where nodes are entities and any two nodes are adjacent if and only if their corresponding entities co-appear in at least one set. Specifically, we generate a weighted graph by assigning a weight (specifically, the number of sets in which the two corresponding entities co-appear) to each edge, and we apply node2vec [46], a popular random-walk-based network embedding method, to the graph to obtain the feature of each entity. Recall that the vector-based methods, Set2Vec and Set2Vec+, are not versatile, and thus they need to be trained specifically for each metric, while the proposed methods Set2Box and Set2Box+ do not require separate training. It should be noted that search methods (e.g., LSH) are not direct competitors of the considered embedding methods, whose common goal is similarity preservation. We summarize the encoding cost of each method, including the variants of Set2Box and Set2Box+ used in Section V-C, in Table IV.

(a) Effects of the proposed box quantization scheme
(b) Effects of the joint training scheme
Figure 4: Relative MSEs, defined in Eq. (5) and Eq. (6), of the considered set similarity measures in each dataset. The proposed schemes, box quantization (Set2Box-PQ vs. Set2Box-BQ) and joint training (Set2Box-BQ vs. Set2Box+), improve the accuracy.

Evaluation: For the Netflix dataset, whose number of sets is relatively large, we used 5% of the sets for training. For the other datasets, we used 20% of the sets for training. The remaining sets are split in half and used for validation and testing. We measured the Mean Squared Error (MSE) to evaluate the accuracy of the set similarity approximation. Since the number of possible pairs of sets is $O(|\mathcal{S}|^{2})$, which may be considerably large, we sample 100,000 pairs uniformly at random for evaluation. We consider four representative set-similarity measures for evaluation: Overlap Coefficient (OC), Cosine Similarity (CS), Jaccard Index (JI), and Dice Index (DI), which are defined as:

\frac{|s\cap s^{\prime}|}{\min(|s|,|s^{\prime}|)},\;\;\;\frac{|s\cap s^{\prime}|}{\sqrt{|s|\cdot|s^{\prime}|}},\;\;\;\frac{|s\cap s^{\prime}|}{|s\cup s^{\prime}|},\;\;\;\text{and}\;\;\;\frac{|s\cap s^{\prime}|}{\frac{1}{2}\left(|s|+|s^{\prime}|\right)},

respectively, between a pair of sets $s$ and $s^{\prime}$.

V-B Q1. Accuracy & Conciseness

We compare the MSE of the set similarity estimation achieved by Set2Box+ and its competitors. We set the dimension to 256 for Set2Bin, 8 for the vector-based methods (Set2Vec and Set2Vec+), and 4 for Set2Box, so that they use the same number of bits to encode sets. For Set2Box+, we set $(d,D,K)=(32,16,30)$, which results in only 31-40% of the encoding cost of the other methods, unless otherwise stated.

Accuracy: As seen in Figure 3, Set2Box+ yields the most accurate set representations while using a smaller number of bits to encode them. For example, in the Twitter dataset, Set2Box+ gives 40.8× smaller MSE for the Jaccard Index compared to Set2Bin. In the Amazon dataset, Set2Box+ gives 11.0× smaller MSE for the Overlap Coefficient than Set2Vec+. In both cases, Set2Box+ uses about 31% of the encoding cost used by the competitors.

(a) MovieLens 1M
(b) Yelp
Metric | ML1 | ML10 | ML20 | YP | AM | GP | TW | NF
JI | 8.0 | 11.1 | 12.9 | 34.9 | 33.6 | 76.2 | 41.2 | 16.2
DI | 8.0 | 15.9 | 17.7 | 27.3 | 27.2 | 63.5 | 31.7 | 22.7
OC | 8.0 | 12.7 | 16.1 | 34.9 | 28.8 | 96.8 | 60.3 | 19.5
CS | 8.0 | 15.9 | 16.1 | 28.8 | 28.8 | 68.2 | 38.0 | 22.7
(a) The factor by which Set2Bin's encoding cost must grow to match the accuracy of Set2Box+.
Figure 5: Set2Box+ yields more concise set representations than its competitors.

Conciseness: To verify the conciseness of Set2Box+, we measure the accuracy of the competitors across various encoding costs. As seen in Figure 5, Set2Box+ yields compact representations of sets while keeping them informative. Vector-based methods are prone to the curse of dimensionality and hardly benefit from higher dimensionality. While the MSE of Set2Bin decreases with its dimension, it still requires a much larger space to achieve the MSE of Set2Box+. For example, Set2Bin requires 8.0× and 34.9× more bits to achieve the same accuracy as Set2Box+ in MovieLens 1M and Yelp, respectively. This is more noticeable in larger datasets, where Set2Bin requires up to 96.8× the encoding cost of Set2Box+, as shown in Figure 5(a). These results demonstrate the conciseness of Set2Box+. In addition, Figure 6 shows how the accuracy of the estimations by Set2Box and Set2Box+ depends on their encoding costs.

Speed: For the considered methods, Figure 7 shows the loss (relative to the loss after the first epoch) over time in two large datasets, MovieLens 20M and Netflix. The loss in Set2Box+ drops over time and eventually converges within an hour.

Further Analysis of Accuracy: We analyze how similar the sets estimated to be similar by each considered method actually are. To this end, for each set $s$, we compare its $k$ most similar sets $G_{s,k}$ to the sets $\widehat{G}_{s,k}$ estimated to be the most similar by each method. Then, as in [4], we measure the quality $q(\widehat{G}_{s,k})$ of $\widehat{G}_{s,k}$, which indicates how similar the sets in $\widehat{G}_{s,k}$ are to $s$ compared to those in $G_{s,k}$:

q(\widehat{G}_{s,k})=\frac{\sum_{s^{\prime}\in\widehat{G}_{s,k}}\text{sim}(s,s^{\prime})}{\sum_{s^{\prime}\in G_{s,k}}\text{sim}(s,s^{\prime})},

where $\text{sim}(\cdot,\cdot)$ is the similarity (spec., Jaccard Index) between sets. The quality ranges from 0 to 1, and it is ideally close to 1, which indicates that the sets estimated to be similar by the method are nearly as similar as the ideal sets. Based on this criterion, we measure the quality of each method with different values of $k$. We use the above configuration for Set2Box+, while the dimensions of the other methods are adjusted to require a similar encoding cost to Set2Box+. As shown in Figure 8, the average quality $q(\widehat{G}_{s,k})$ is the highest for Set2Box+ for all considered values of $k$ in MovieLens 1M and Yelp, implying that the sets estimated to be similar by the proposed methods are indeed similar to each other.
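A minimal sketch of this quality measure, with illustrative inputs (a dictionary of true similarities and a method's ranking); names and data are not from the paper.

```python
# Compare the total true similarity of the k sets ranked most similar by a
# method against that of the true top-k sets.
def quality(sims_true, ranked_by_method, k):
    """sims_true: dict mapping candidate set id -> true similarity to s."""
    ideal = sorted(sims_true.values(), reverse=True)[:k]          # G_{s,k}
    estimated = [sims_true[s2] for s2 in ranked_by_method[:k]]    # \hat{G}_{s,k}
    return sum(estimated) / sum(ideal)

sims = {"s1": 0.9, "s2": 0.7, "s3": 0.4, "s4": 0.1}
print(quality(sims, ranked_by_method=["s1", "s3"], k=2))   # 0.8125
```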

(a) Set2Box
(b) Set2Box+
Figure 6: Set2Box and Set2Box+ give more accurate estimations with larger space. Yelp is used for the plots.

(a) MovieLens 20M
(b) Netflix
Figure 7: Set2Box+ converges over time.

(a) MovieLens 1M
(b) Yelp
Figure 8: Sets estimated to be similar by Set2Box+ are indeed similar to each other.

V-C Q2. Effectiveness

To verify the effectiveness of each component of Set2Box+, we conduct ablation studies by comparing it with its variants. We first consider the following variants:

  • Set2Box-PQ: Given a box $\mathrm{B}=(\mathrm{c},\mathrm{f})$, we apply end-to-end differentiable product quantization (PQ) [40] to the center $\mathrm{c}$ and the offset $\mathrm{f}$ independently. Dot products between the query vector ($\mathrm{c}$ or $\mathrm{f}$) and the key vectors are computed to measure the distances. Notably, it yields two independent discrete codes for the center and the offset, and thus its encoding cost is $64Kd+2|\mathcal{S}|D\log_{2}K$ bits, which is approximately twice that of Set2Box+.

  • Set2Box-BQ: A special case of Set2Box+ with $\lambda=0$, where the proposed box quantization is applied but joint training is not.

We set $(d,D,K)$ to $(32,8,30)$ for Set2Box-PQ and $(32,16,30)$ for Set2Box-BQ and Set2Box+ so that they all require the same amount of storage.

Effects of Box Quantization: We examine the effectiveness of the proposed box quantization scheme in Section IV-B by comparing Set2Box-BQ with Set2Box-PQ. To this end, we measure the relative MSE, defined as:

\frac{\text{MSE of Set2Box-BQ}}{\text{MSE of Set2Box-PQ}}, (5)

of each dataset. Figure 4(a) demonstrates that Set2Box-BQ generally derives more accurate set representations than Set2Box-PQ, implying the effectiveness of the proposed box quantization scheme. As shown in Table V, on average, Set2Box-BQ yields up to 26% smaller MSE than Set2Box-PQ while using approximately the same number of bits. For example, in MovieLens 10M, Set2Box-BQ gives 1.89× more accurate estimation than Set2Box-PQ in approximating the Dice Index. While Set2Box-PQ discretizes the centers and offsets of the boxes independently, without considering their geometric properties, the proposed box quantization scheme effectively takes the geometric relations between boxes into account and thus yields higher-quality compression.

Effects of Joint Training: We analyze the effects of the joint training scheme of Set2Box+ by comparing Set2Box-BQ ($\lambda=0$) and Set2Box+ ($\lambda\geq 0$). To this end, we measure the relative MSE, defined as:

\frac{\text{MSE of Set2Box+}}{\text{MSE of Set2Box-BQ}}, (6)

of each dataset. Figure 4(b) shows that Set2Box+ is superior to Set2Box-BQ in most datasets, indicating that jointly training the reconstructed boxes with the original ones leads to more accurate boxes. As summarized in Table V, joint training, together with the box quantization scheme, reduces the average MSEs on the considered datasets by up to 44%. For example, together with the box quantization scheme, joint training reduces the estimation error by 64% and 38% for the Jaccard Index on Gplus and for the Overlap Coefficient on Netflix, respectively. These results imply that learning the quantized boxes simultaneously with the original boxes improves the quality of the quantization and thus its effectiveness. To further analyze these results, we investigate how the loss decreases with respect to $\lambda$. In Figure 9, we observe that training the reconstructed boxes alone ($\lambda=0$) is unstable, and learning the original boxes together ($\lambda>0$) helps not only stabilize but also facilitate the optimization.

TABLE V: The proposed schemes, box quantization and joint training, in Set2Box+ incrementally improve the accuracy (in terms of MSE) averaged over all considered datasets.
Method | OC | CS | JI | DI
Set2Box-PQ | 0.0129 | 0.0028 | 0.0012 | 0.0023
Set2Box-BQ | 0.0106 (-17%) | 0.0023 (-17%) | 0.0009 (-26%) | 0.0019 (-17%)
Set2Box+ | 0.0077 (-40%) | 0.0016 (-44%) | 0.0007 (-41%) | 0.0013 (-42%)
(a) MovieLens 1M
(b) Yelp
Figure 9: The joint training scheme in Set2Box+ facilitates and stabilizes optimization in MovieLens 10M and Yelp.

Effects of Boxes: To confirm the effectiveness of using boxes for representing sets, we consider Set2Box-order, which is also a region-based geometric embedding method:

  • Set2Box-order: A set $s$ is represented as a $d$-dimensional nonnegative vector $\mathrm{z}_{s}\in\mathbb{R}_{+}^{d}$ whose volume is $\mathbb{V}(\mathrm{z}_{s})=\exp(-\sum_{i}\mathrm{z}_{s}[i])$. The volumes of the intersection and the union of the representations $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$ of two sets $s$ and $s^{\prime}$ are, respectively,

    $\mathbb{V}(\mathrm{z}_{s}\wedge\mathrm{z}_{s^{\prime}})=\exp\left(-\sum_{i}\max(\mathrm{z}_{s}[i],\mathrm{z}_{s^{\prime}}[i])\right),$
    $\mathbb{V}(\mathrm{z}_{s}\vee\mathrm{z}_{s^{\prime}})=\exp\left(-\sum_{i}\min(\mathrm{z}_{s}[i],\mathrm{z}_{s^{\prime}}[i])\right).$

    The encoding cost is $32d|\mathcal{S}|$ bits (a minimal sketch of these volume computations follows this list).
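The following numpy sketch implements the volume formulas above and estimates the Jaccard Index as the ratio of the intersection volume to the union volume; the helper names and the example vectors are ours, for illustration only.

```python
import numpy as np

def volume(z):
    # V(z) = exp(-sum_i z[i]); z is a nonnegative d-dimensional vector.
    return np.exp(-np.sum(z))

def intersection_volume(z1, z2):
    # V(z1 ^ z2) = exp(-sum_i max(z1[i], z2[i]))
    return np.exp(-np.sum(np.maximum(z1, z2)))

def union_volume(z1, z2):
    # V(z1 v z2) = exp(-sum_i min(z1[i], z2[i]))
    return np.exp(-np.sum(np.minimum(z1, z2)))

def jaccard_estimate(z1, z2):
    # Jaccard Index ~ intersection volume / union volume.
    return intersection_volume(z1, z2) / union_volume(z1, z2)

# Hypothetical 8-dimensional nonnegative embeddings of two sets.
z_s  = np.array([0.2, 1.0, 0.1, 0.7, 0.3, 0.0, 0.5, 0.9])
z_sp = np.array([0.4, 0.8, 0.2, 0.1, 0.3, 0.6, 0.5, 0.2])
print(jaccard_estimate(z_s, z_sp))
```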

We set the dimensions for Set2Box-order and Set2Box to 8 and 4, respectively, so that their encoding costs are the same (Set2Box stores a $d$-dimensional center and a $d$-dimensional offset per set, i.e., $64d|\mathcal{S}|$ bits, so $d=4$ matches the $32\cdot 8\cdot|\mathcal{S}|$ bits of Set2Box-order). In Table VI, we compare Set2Box with Set2Box-order in terms of the average MSE on the considered datasets for each measure. Set2Box yields more accurate representations than Set2Box-order, implying the effectiveness of boxes for similarity-preserving set representation. For example, Set2Box achieves 62% lower average MSE than Set2Box-order in preserving the Overlap Coefficient.

TABLE VI: Set2Box yields smaller MSE on average in the considered datasets than Set2Box-order.
Method OC CS JI DI
Set2Box-order 0.0320 0.0033 0.0008 0.0027
Set2Box 0.0121 (-62%) 0.0028 (-14%) 0.0006 (-22%) 0.0022 (-17%)

V-D Q3. Effects of Parameters

We analyze how the parameters of Set2Box+ affect the quality of the set representations. First, the number of subspaces ($D$) and the number of key boxes in each subspace ($K$) are the key parameters that control the encoding cost of Set2Box+. In Figure 10, we investigate how the performance of Set2Box+ depends on $D$ and $K$ while fixing $d$ to $32$. Typically, the accuracy improves as $D$ and $K$ increase, at the expense of extra encoding cost. In addition, the performance is affected more heavily by $D$ than by $K$.

In Figure 11, we examine how the coefficient $\lambda$ in Eq. (4) affects the accuracy of Set2Box+. To this end, we measure the relative MSE (relative to the MSE when $\lambda=0$) for different $\lambda$ values. As shown in Figure 11, joint training is beneficial, but putting too much weight on the original boxes sometimes prevents Set2Box+ from learning meaningful reconstructed boxes.

In Figure 12, we examine the effects of the numbers of training samples, $|\mathcal{T}^{+}|$ and $|\mathcal{T}^{-}|$, in Set2Box+. The accuracy is robust to these parameters, and thus using only a small number of samples for training is enough (we consistently use $|\mathcal{T}^{+}|=|\mathcal{T}^{-}|=10$ in all experiments).

Figure 10: Effects of $K$ and $D$ in Set2Box+ on the approximation of the Jaccard Indices in MovieLens 10M and Yelp. (a) MovieLens 1M; (b) Yelp.
Figure 11: Effects of $\lambda$ in Set2Box+. (a) MovieLens 1M; (b) Yelp.
Figure 12: Effects of $|\mathcal{T}^{+}|$ and $|\mathcal{T}^{-}|$ in Set2Box+. (a) MovieLens 1M; (b) Yelp.

VI Discussions

To further support the effectiveness of the proposed methods, Set2Box and Set2Box+, we analyze the properties of boxes and of the other representation methods, and relate them to the properties of sets. To this end, we review the following questions for the methods used in Section V, as summarized in Table VII:

RQ1. Are basic set properties supported?

A1. Boxes naturally satisfy six representative set properties, which are listed in Table VIII. These properties are also satisfied by Set2Bin and Set2Box-order, but not by the vector-based methods Set2Vec and Set2Vec+, since their embeddings do not encode information about the set itself (e.g., its size).
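As a sanity check of a subset of these laws, the sketch below models a box by its lower and upper corners, takes the intersection to be the elementwise [max of lower corners, min of upper corners] and the union to be the smallest enclosing box, and verifies the idempotent, commutative, and absorption laws on random boxes. This is an illustration under these simplified (hard, non-smoothed) box operations, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_box(d=4):
    # A box is a pair (lower corner, upper corner) with lower <= upper.
    lo = rng.uniform(0, 1, d)
    hi = lo + rng.uniform(0, 1, d)
    return lo, hi

def box_intersection(a, b):
    # Elementwise max of lower corners, min of upper corners.
    return np.maximum(a[0], b[0]), np.minimum(a[1], b[1])

def box_union(a, b):
    # Smallest enclosing box (the usual relaxation of the union of two boxes).
    return np.minimum(a[0], b[0]), np.maximum(a[1], b[1])

def same(a, b):
    return np.allclose(a[0], b[0]) and np.allclose(a[1], b[1])

bx, by = random_box(), random_box()
assert same(box_intersection(bx, bx), bx)                  # idempotent law
assert same(box_union(bx, by), box_union(by, bx))          # commutative law
assert same(box_intersection(bx, box_union(bx, by)), bx)   # absorption law
assert same(box_union(bx, box_intersection(bx, by)), bx)   # absorption law
print("idempotent, commutative, and absorption laws hold for these boxes")
```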

RQ2. Are sets of any-size representable?

A2. In Set2Box and Set2Box+, boxes of various volumes can be learned by adjusting their offsets, and thus sets of any size can be accurately represented. So can Set2Box-order, by controlling the L1 norm of the vector. In contrast, Set2Bin inevitably suffers from information loss for sets larger than $d$ (see Figure 1 in Section III). The vector-based methods have no limitation regarding set sizes.

RQ3. Are representations expressive enough?

A3. Boxes of diverse shapes and sizes can be located anywhere in the Euclidean latent space by controlling their centers and offsets. This property makes boxes expressive, enabling them to capture complex relations with other boxes. In Set2Box-order, a single nonnegative vector is learned to control the volume of the region, and this nonnegativity limits the expressiveness of the embeddings. Set2Bin suffers from hash collisions, so different sets can be represented by the same binary vector, which causes considerable information loss if $d$ is not large enough. Despite their wide usage in various fields, empirically, Set2Vec and Set2Vec+ have limited power to accurately preserve similarities between sets. In particular, the set embeddings they produce for a specific measure are not extensible to estimating other measures.
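To illustrate why a single box per set suffices for multiple measures, the sketch below computes hard (non-smoothed) box volumes and derives the four measures considered in this paper (Overlap Coefficient, Cosine Similarity, Jaccard Index, and Dice Index) by plugging those volumes into the standard set-similarity formulas; the boxes and helper names are hypothetical.

```python
import numpy as np

def box_volume(lo, hi):
    # Volume of an axis-aligned box given its lower/upper corners.
    return float(np.prod(np.maximum(hi - lo, 0.0)))

def intersection_volume(lo1, hi1, lo2, hi2):
    # The intersection of two axis-aligned boxes is again a box.
    return box_volume(np.maximum(lo1, lo2), np.minimum(hi1, hi2))

def similarities(lo1, hi1, lo2, hi2):
    v1, v2 = box_volume(lo1, hi1), box_volume(lo2, hi2)
    vi = intersection_volume(lo1, hi1, lo2, hi2)
    vu = v1 + v2 - vi  # union volume via inclusion-exclusion
    return {
        "OC": vi / min(v1, v2),        # Overlap Coefficient
        "CS": vi / np.sqrt(v1 * v2),   # Cosine Similarity
        "JI": vi / vu,                 # Jaccard Index
        "DI": 2 * vi / (v1 + v2),      # Dice Index
    }

# Hypothetical 3-dimensional boxes standing in for two learned set representations.
lo1, hi1 = np.array([0.0, 0.0, 0.0]), np.array([2.0, 1.0, 1.0])
lo2, hi2 = np.array([1.0, 0.0, 0.0]), np.array([3.0, 1.0, 1.0])
print(similarities(lo1, hi1, lo2, hi2))
```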

TABLE VII: Properties of the considered methods regarding RQ1, RQ2, and RQ3 in Section VI.
Method RQ1 RQ2 RQ3
Set2Bin ✓ ✗ ✗
Set2Vec & Set2Vec+ ✗ ✓ ✗
Set2Box-order ✓ ✓ ✗
Set2Box & Set2Box+ ✓ ✓ ✓

VII Conclusions

In this work, we propose Set2Box, an effective representation learning method for preserving similarities between sets. Thanks to the unique geometric properties of boxes, Set2Box accurately preserves various similarities without assumptions about measures. Additionally, we develop Set2Box+, which is equipped with novel box quantization and joint training schemes. Our empirical results support that Set2Box+ has the following strengths over its competitors:

  • Accurate: Set2Box+ yields up to 40.8× smaller estimation error than its competitors, while requiring smaller encoding costs.

  • Concise: Set2Box+ requires up to 96.8× smaller encoding costs to achieve accuracy similar to that of its competitors.

  • Versatile: Set2Box+ is free from assumptions about similarity measures to be preserved.

For reproducibility, the code and data are available at https://github.com/geon0325/Set2Box.

TABLE VIII: Boxes satisfy various set properties.
Property: Statement satisfied by boxes
1. Transitivity Law: $\mathrm{B}_{X}\subseteq\mathrm{B}_{Y},\ \mathrm{B}_{Y}\subseteq\mathrm{B}_{Z}\rightarrow\mathrm{B}_{X}\subseteq\mathrm{B}_{Z}$
2. Idempotent Law: $\mathrm{B}_{X}\cup\mathrm{B}_{X}=\mathrm{B}_{X}$, $\mathrm{B}_{X}\cap\mathrm{B}_{X}=\mathrm{B}_{X}$
3. Commutative Law: $\mathrm{B}_{X}\cup\mathrm{B}_{Y}=\mathrm{B}_{Y}\cup\mathrm{B}_{X}$, $\mathrm{B}_{X}\cap\mathrm{B}_{Y}=\mathrm{B}_{Y}\cap\mathrm{B}_{X}$
4. Associative Law: $\mathrm{B}_{X}\cup(\mathrm{B}_{Y}\cup\mathrm{B}_{Z})=(\mathrm{B}_{X}\cup\mathrm{B}_{Y})\cup\mathrm{B}_{Z}$, $\mathrm{B}_{X}\cap(\mathrm{B}_{Y}\cap\mathrm{B}_{Z})=(\mathrm{B}_{X}\cap\mathrm{B}_{Y})\cap\mathrm{B}_{Z}$
5. Absorption Law: $\mathrm{B}_{X}\cup(\mathrm{B}_{X}\cap\mathrm{B}_{Y})=\mathrm{B}_{X}$, $\mathrm{B}_{X}\cap(\mathrm{B}_{X}\cup\mathrm{B}_{Y})=\mathrm{B}_{X}$
6. Distributive Law: $\mathrm{B}_{X}\cap(\mathrm{B}_{Y}\cup\mathrm{B}_{Z})=(\mathrm{B}_{X}\cap\mathrm{B}_{Y})\cup(\mathrm{B}_{X}\cap\mathrm{B}_{Z})$, $\mathrm{B}_{X}\cup(\mathrm{B}_{Y}\cap\mathrm{B}_{Z})=(\mathrm{B}_{X}\cup\mathrm{B}_{Y})\cap(\mathrm{B}_{X}\cup\mathrm{B}_{Z})$

References

  • [1] L. Moussiades and A. Vakali, “Pdetect: A clustering approach for detecting plagiarism in source code datasets,” The Computer Journal, vol. 48, no. 6, pp. 651–661, 2005.
  • [2] N. A. Yousri and D. M. Elkaffash, “Associating gene functional groups with multiple clinical conditions using jaccard similarity,” in BIBMW, 2011.
  • [3] Y. Koren, “Factorization meets the neighborhood: a multifaceted collaborative filtering model,” in KDD, 2008.
  • [4] R. Guerraoui, A.-M. Kermarrec, O. Ruas, and F. Taïani, “Smaller, faster & lighter knn graph constructions,” in WWW, 2020.
  • [5] K. Shin, A. Ghoting, M. Kim, and H. Raghavan, “Sweg: Lossless and lossy summarization of web-scale graphs,” in WWW, 2019.
  • [6] S. Navlakha, R. Rastogi, and N. Shrivastava, “Graph summarization with bounded error,” in SIGMOD, 2008.
  • [7] L. Chauvin, K. Kumar, C. Wachinger, M. Vangel, J. de Guise, C. Desrosiers, W. Wells, M. Toews, A. D. N. Initiative et al., “Neuroimage signature from salient keypoints is highly specific to individuals and shared by close relatives,” NeuroImage, vol. 204, p. 116208, 2020.
  • [8] L. Chauvin, K. Kumar, C. Desrosiers, W. Wells III, and M. Toews, “Efficient pairwise neuroimage analysis using the soft jaccard index and 3d keypoint sets,” arXiv preprint arXiv:2103.06966, 2021.
  • [9] M. Toews, W. Wells III, D. L. Collins, and T. Arbel, “Feature-based morphometry: Discovering group-related anatomical patterns,” NeuroImage, vol. 49, no. 3, pp. 2318–2327, 2010.
  • [10] B. Cui, H. T. Shen, J. Shen, and K. L. Tan, “Exploring bit-difference for approximate knn search in high-dimensional databases,” in AusDM, 2005.
  • [11] W. Jin, T. Derr, Y. Wang, Y. Ma, Z. Liu, and J. Tang, “Node similarity preserving graph convolutional networks,” in WSDM, 2021.
  • [12] Y. Xie, M. Gong, S. Wang, W. Liu, and B. Yu, “Sim2vec: Node similarity preserving network embedding,” Information Sciences, vol. 495, pp. 37–51, 2019.
  • [13] A. Tsitsulin, D. Mottin, P. Karras, and E. Müller, “Verse: Versatile graph embeddings from similarity measures,” in WWW, 2018.
  • [14] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in KDD, 2016.
  • [15] F. Tung and G. Mori, “Similarity-preserving knowledge distillation,” in ICCV, 2019.
  • [16] Q. Li, Z. Sun, R. He, and T. Tan, “Deep supervised discrete hashing,” in NeurIPS, 2017.
  • [17] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval,” in AAAI, 2016.
  • [18] Y. Li, Y. Sun, and N. Zhu, “Berttocnn: Similarity-preserving enhanced knowledge distillation for stance detection,” PLOS ONE, vol. 16, no. 9, p. e0257130, 2021.
  • [19] L. Vilnis, X. Li, S. Murty, and A. McCallum, “Probabilistic embedding of knowledge graphs with box lattice measures,” in ACL, 2018.
  • [20] R. Abboud, I. Ceylan, T. Lukasiewicz, and T. Salvatori, “Boxe: A box embedding model for knowledge base completion,” in NeurIPS, 2020.
  • [21] L. Liu, B. Du, H. Ji, C. Zhai, and H. Tong, “Neural-answering logical queries on knowledge graphs,” in KDD, 2021.
  • [22] Y. Onoe, M. Boratko, A. McCallum, and G. Durrett, “Modeling fine-grained entity types with box embeddings,” in ACL, 2021.
  • [23] X. Chen, M. Boratko, M. Chen, S. S. Dasgupta, X. L. Li, and A. McCallum, “Probabilistic box embeddings for uncertain knowledge graph reasoning,” in ACL, 2021.
  • [24] H. Ren, W. Hu, and J. Leskovec, “Query2box: Reasoning over knowledge graphs in vector space using box embeddings,” in ICLR, 2019.
  • [25] D. Patel and S. Sankar, “Representing joint hierarchies with box embeddings,” in AKBC, 2020.
  • [26] S. S. Dasgupta, M. Boratko, S. Atmakuri, X. L. Li, D. Patel, and A. McCallum, “Word2box: Learning word representation using box embeddings,” arXiv preprint arXiv:2106.14361, 2021.
  • [27] A. Rau, G. Garcia-Hernando, D. Stoyanov, G. J. Brostow, and D. Turmukhambetov, “Predicting visual overlap of images through interpretable non-metric box embeddings,” in ECCV, 2020.
  • [28] S. Zhang, H. Liu, A. Zhang, Y. Hu, C. Zhang, Y. Li, T. Zhu, S. He, and W. Ou, “Learning user representations with hypercuboids for recommender systems,” in WSDM, 2021.
  • [29] K. Deng, J. Huang, and J. Qin, “Box4rec: Box embedding for sequential recommendation,” in PAKDD, 2021.
  • [30] X. Li, L. Vilnis, D. Zhang, M. Boratko, and A. McCallum, “Smoothing the geometry of probabilistic box embeddings,” in ICLR, 2018.
  • [31] S. Dasgupta, M. Boratko, D. Zhang, L. Vilnis, X. Li, and A. McCallum, “Improving local identifiability in probabilistic box embeddings,” in NeurIPS, 2020.
  • [32] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, “Deep sets,” in NeurIPS, 2017.
  • [33] O. Vinyals, S. Bengio, and M. Kudlur, “Order matters: Sequence to sequence for sets,” in ICLR, 2015.
  • [34] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, “Set transformer: A framework for attention-based permutation-invariant neural networks,” in ICML, 2019.
  • [35] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in STOC, 1998.
  • [36] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in STOC, 2002.
  • [37] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, “Super-bit locality-sensitive hashing,” in NIPS, 2012.
  • [38] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2010.
  • [39] T. Ge, K. He, Q. Ke, and J. Sun, “Optimized product quantization for approximate nearest neighbor search,” in CVPR, 2013.
  • [40] T. Chen, L. Li, and Y. Sun, “Differentiable product quantization for end-to-end embedding compression,” in ICML, 2020.
  • [41] T. Chen, M. R. Min, and Y. Sun, “Learning k-way d-dimensional discrete codes for compact embedding representations,” in ICML, 2018.
  • [42] M. Sachan, “Knowledge graph embedding compression,” in ACL, 2020.
  • [43] B. Klein and L. Wolf, “End-to-end supervised product quantization for image search and retrieval,” in CVPR, 2019.
  • [44] Y. K. Jang and N. I. Cho, “Self-supervised product quantization for deep unsupervised image retrieval,” in CVPR, 2021.
  • [45] S. Morozov and A. Babenko, “Unsupervised neural quantization for compressed-domain similarity search,” in ICCV, 2019.
  • [46] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in KDD, 2016.