Set2Box: Similarity Preserving Representation Learning for Sets

Geon Lee KAIST AI
geonlee0325@kaist.ac.kr Chanyoung Park KAIST ISysE & AI
cy.park@kaist.ac.kr Kijung Shin KAIST AI & EE
kijungs@kaist.ac.kr

Abstract

Sets have been used for modeling various types of objects (e.g., a document as the set of keywords in it and a customer as the set of the items that she has purchased). Measuring similarity (e.g., Jaccard Index) between sets has been a key building block of a wide range of applications, including, plagiarism detection, recommendation, and graph compression. However, as sets have grown in numbers and sizes, the computational cost and storage required for set similarity computation have become substantial, and this has led to the development of hashing and sketching based solutions.

In this work, we propose Set2Box, a learning-based approach for compressed representations of sets from which various similarity measures can be estimated accurately in constant time. The key idea is to represent sets as boxes to precisely capture overlaps of sets. Additionally, based on the proposed box quantization scheme, we design Set2Box⁺, which yields more concise but more accurate box representations of sets. Through extensive experiments on $8$ real-world datasets, we show that, compared to baseline approaches, Set2Box⁺ is (a) Accurate: achieving up to 40.8 $\times$ smaller estimation error while requiring 60% fewer bits to encode sets, (b) Concise: yielding up to 96.8 $\times$ more concise representations with similar estimation error, and (c) Versatile: enabling the estimation of four set-similarity measures from a single representation of each set.

I Introduction

Sets are ubiquitous, modeling various types of objects in many domains, including (a) a document: modeled as the set of keywords in it, (b) a customer: modeled as the set of the items that she has purchased, (c) a social circle: modeled as the set of its members, and (d) a question on online Q/A platforms: modeled as the set of tags attached to the question. Moreover, a number of set similarity measures (e.g., Jaccard Index and Dice Index), most of which are based on the overlaps between sets, have been developed.

As a result of the omnipresence of sets, measuring their similarity has been employed as a fundamental building block of a wide range of applications, including the following:

$\circ$ Plagiarism Detection: Plagiarism is a critical problem in the digital age, where a vast amount of resources is accessible. A text is modeled as a “bag of words,” and texts whose set representations are highly similar are suspected of plagiarism [1].

$\circ$ Gene Expression Mining: Mining gene expressions is useful for understanding clinical conditions (e.g., tumor and cancer). The functionality of a set of genes is estimated by comparing the set with other sets with known functionality [2].

$\circ$ Recommendation: Recommendation is essential to support users in finding relevant items. To this end, it is useful to identify users with similar tastes (e.g., users who purchased a similar set of items and users with similar activities) [3, 4].

$\circ$ Graph Compression: As large-scale graphs are omnipresent, compressing them into coarse-grained summary graphs so that they fit in main memory is important. In many graph compression algorithms, nodes with similar sets of neighbors are merged into a supernode to yield a concise summary graph while minimizing the information loss [5, 6].

$\circ$ Medical Image Analysis: CT or MRI provide exquisite details of inner body (e.g., brain), and they are often described as a collection of spatially localized anatomical features termed “keypoints”. Sets of keypoints from different images are compared to diagnose and investigate diseases [7, 8, 9].

As sets grow in numbers and sizes, computation of set similarity requires substantial computational cost and storage. For example, similarities between tens of millions of nodes, which are represented as neighborhood sets of up to millions of neighbors, were measured for graph compression [5]. Moreover, similarities between tens of thousands of movies, which are represented as sets of up to hundreds of thousands of users who have rated them, were measured for movie recommendation [3].

In order to reduce the space and computation required for set-similarity computation, a number of approaches based on hashing and sketching [4, 10] have been developed. While their simplicity and theoretical guarantees are tempting, significant gains are expected if patterns in a given collection of sets can be learned and exploited.

In this paper, we propose Set2Box, a learning-based approach for compressed representations of sets from which various similarity measures can be estimated accurately in constant time. The key idea of Set2Box is to represent sets as boxes to accurately capture the overlaps between sets and thus their similarity based on them. Specifically, by utilizing the volumes of the boxes to approximate the sizes of the sets, Set2Box derives representations that are: (a) Concise: can represent sets of arbitrary sizes using the same number of bits, (b) Accurate: can accurately model overlaps between sets, and (c) Versatile: can be used to estimate various set similarity measures in a constant time. These properties are supported by the geometric nature of boxes, which share primary characteristics of sets. In addition, we propose Set2Box⁺, which yields even more concise but more accurate boxes based on the proposed box quantization scheme. We summarize our contributions as follows:

•

Accurate & Versatile Algorithm: We propose Set2Box, a set representation learning method that accurately preserves similarity between sets in terms of four measures.
•

Concise & Effective Algorithm: We devise Set2Box⁺ to enhance Set2Box through an end-to-end box quantization scheme. It yields up to $40.8\times$ more accurate similarity estimation while requiring $60\%$ fewer bits than its competitors.
•

Extensive Experiments: Using 8 real-world datasets, we validate the advantages of Set2Box⁺ over its competitors and the effectiveness of each of its components.

For reproducibility, the code and data are available at https://github.com/geon0325/Set2Box.

TABLE I: Frequently-used symbols.

Notation	Definition
$\mathcal{S}=\{s_{1},...,s_{\|\mathcal{S}\|}\}$	set of sets
$\mathcal{E}=\{e_{1},...,e_{\|\mathcal{E}\|}\}$	set of entities
$\mathrm{B}=(\mathrm{c},\mathrm{f})$	a box with center $\mathrm{c}$ and offset $\mathrm{f}$
$\mathbb{V}(\mathrm{B})$	volume of box $\mathrm{B}$
$\mathcal{T}^{+}$ and $\mathcal{T}^{-}$	a set of positive & negative samples
$\mathbf{Q}^{\mathrm{c}}\in\mathbb{R}^{\|\mathcal{E}\|\times d}$	center embedding matrix of entities
$\mathbf{Q}^{\mathrm{f}}\in\mathbb{R}_{+}^{\|\mathcal{E}\|\times d}$	offset embedding matrix of entities
$D$	number of subspaces
$K$	number of key boxes in each subspace

In Section II, we review related work. In Section III, we define the problem of similarity-preserving set embedding and discuss intuitive approaches. In Section IV, we present Set2Box and Set2Box⁺. In Section V, we provide experimental results. In Section VI, we analyze the considered methods. Lastly, we offer conclusions in Section VII.

II Related Work

Here, we review previous studies related to our work.

Similarity-Preserving Embedding: Representation learning for preserving similarities between instances has been studied for graphs [11, 12, 13, 14], images [15, 16, 17], and texts [18]. These methods aim to yield high-quality embeddings by minimizing the information loss of the original data. However, most of them are designed to preserve the predetermined similarity matrix, which are not extensible to new measures [13, 14]. In this paper, we focus on the problem of learning similarity-preserving representations for sets, and we aim to learn a versatile representation of sets, which various similarity measures (e.g., Jaccard Index and Dice Index) can be estimated from.

Box Embedding: Boxes [19] are useful abstractions to express high-order information of the data. Thanks to their powerful expressiveness, they have been used in diverse applications including knowledge bases [20, 21, 22, 23, 24, 25], word embedding [26], image embedding [27], and recommender systems [28, 29]. For instance, Query2Box [24] uses boxes to embed queries with conjunctions ( $\wedge$ ) or logical disjunctions ( $\lor$ ). Zhang et al. [28] represent users as boxes to accurately model the users’ preferences to the items. However, in this work, we embed sets as boxes to accurately preserve their structural relationships and also similarities between them. In an algorithmic aspect, methods for improving the optimization of learning boxes have been presented, and examples include smoothing hard edges using Gaussian convolutions [30] and improving the parameter identifiability of boxes using Gumbel random variables [31].

Set Embedding: The problem of embedding sets has attracted much attention, with unique requirements of permutation invariance and size flexibility. For example, DeepSets [32] uses simple symmetric functions over input features, and Set2Set [33] is based on a LSTM-based pooling function. Set Transformer [34] uses an attention-based pooling function to aggregate information of the entities. Despite their promising results in some predictive tasks, they suffer from several limitations. First, they require attribute information of entities, which in fact largely affects the quality of the set embeddings. In addition, set representations are trained specifically for downstream tasks, and thus they may lose explicit similarity information of sets, which we aim to preserve in this paper. In another aspect, sets can be represented as compact binary vectors by hashing or sketching [4, 10], without requiring attribute information. Such binary vectors are used by Locality Sensitive Hashing (LSH) and its variants [35, 36, 37] for a rapid search of similar sets based on a predefined similarity measure (e.g., Jaccard Index). Refer to Section III for further discussion of set embedding methods.

Differentiable Product Quantization: Product quantization [38, 39] is an effective strategy for vector compression. Recently, deep learning methods for learning discrete codes in an end-to-end manner have been proposed [40, 41], and they have been applied in knowledge graphs [42] and image retrieval [43, 44, 45]. In this paper, we propose a novel box quantization method for compressing boxes while preserving their original geometric properties.

III Preliminaries

In this section, we introduce notations and define the problem. Then, we review some intuitive methods for the problem.

Notations: Consider a set $\mathcal{S}=\{s_{1},\cdots,s_{|\mathcal{S}|}\}$ of sets and a set $\mathcal{E}=\{e_{1},\cdots,e_{|\mathcal{E}|}\}$ of entities. Each set $s\in\mathcal{S}$ is a non-empty subset of $\mathcal{E}$ and its size (i.e., cardinality) is denoted by $|s|$ . A representation of the set $s$ is denoted by $\mathrm{z}_{s}$ and its encoding cost (the number of bits to encode $\mathrm{z}_{s}$ ) in bits is denoted by $Cost(\mathrm{z}_{s})$ . Refer to Table I for frequently-used notations.

Problem Definition: The problem of learning similarity-preserving set representations, which we focus in this work, is formulated as:

Problem 1 (Similarity-Preserving Set Embedding).

•

Given: (1) a set $\mathcal{S}$ of sets and (2) a budget $b$
•

Find: a latent representation $\mathrm{z}_{s}$ of each set $s\in\mathcal{S}$
•

to Minimize: the difference between (1) the similarity between $s$ and $s^{\prime}$ , and (2) the similarity between $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$ for all $s\neq s^{\prime}\in\mathcal{S}$
•

Subject to: the total encoding cost $\textit{Cost}(\{\mathrm{z}_{s}:s\in\mathcal{S}\})\leq b$ .

In this paper, we consider four set-similarity measures and use the mean squared error (MSE)¹¹1 $\sum_{s\neq s^{\prime}\in\mathcal{S}}{|\text{sim}(s,s^{\prime})-\widehat{\text{sim}}(\mathrm{z}_{s},\mathrm{z}_{s^{\prime}})|}^{2}$ $\text{sim}(\cdot,\cdot)$ and $\widehat{\text{sim}}(\cdot,\cdot)$ are similarity between sets and that between latent representations, respectively. to measure the differences, while our proposed methods are not specialized to the choices.

Desirable Properties: We expect set embeddings for Problem 1 to have the following desirable properties:

•

Accuracy: How can we accurately preserve similarities between sets? Similarities approximated using learned representations should be close to ground-truth similarities.
•

Conciseness: How can we obtain compact representations that give a good trade-off between accuracy and encoding cost? It is desirable to use less amount of memory to store embeddings while keeping them informative.
•

Generalizability: Due to the size flexibility of sets, there are infinitely many number of combinations of entities, and thus retraining the entire model for new sets is intractable. It is desirable for a model to be generalizable to unseen sets.
•

Versatility: While there have been various definitions of set similarities, the choice of the similarity metric plays a key role in practical analyses and applications. This motivates us to learn versatile representations of sets that can be used to approximate diverse similarity measures.
•

Speed: Using the obtained embeddings, set similarities should be rapidly estimated, regardless of their cardinalities.

Refer to caption — (a) Random Hashing
MSE = 0.0884
Cost = 77.312 KB

Intuitive Methods: Keeping the above desirable properties in mind, we discuss simple and intuitive set-embedding methods for similarity preservation.

•

Random Hashing [4]: Each set $s$ is encoded as a binary vector $\mathrm{z}_{s}\in\{0,1\}^{d}$ by mapping each entity into one of the $d$ different values using a hash function $h(\cdot):\mathcal{E}\rightarrow\{1,\cdots,d\}$ . Specifically, the representation $\mathrm{z}_{s}$ is derived by:

\mathrm{z}_{s}[i]=\begin{cases}1&\text{if $\exists e\in s$ s.t. $h(e)=i$}\\ 0&\text{otherwise.}\end{cases}

The size of the set $s$ is estimated from the L1 norm (or the number of nonzero elements) of $\mathrm{z}_{s}$ , i.e., $|s|\approx\|\mathrm{z}_{s}\|_{1}$ . In addition, sizes of the intersection and the union of sets $s$ and $s^{\prime}$ are estimated from:

|s\cap s^{\prime}|\approx\|\mathrm{z}_{s}\;\textbf{AND}\;\mathrm{z}_{s^{\prime}}\|_{1}\;\;\;\text{and}\;\;\;|s\cup s^{\prime}|\approx\|\mathrm{z}_{s}\;\textbf{OR}\;\mathrm{z}_{s^{\prime}}\|_{1},

respectively, where AND and OR are dimension-wise operations. Based on these approximations, any set similarities (e.g., Jaccard Index) can be estimated.

•

Vector Embedding: Another popular approach is to represent sets as vectors and compute the inner products between them to estimate a predefined set similarity. More precisely, given two sets $s$ and $s^{\prime}$ and their vector representations $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$ , it aims to approximate predefined $\text{sim}(s,s^{\prime})$ by the inner product of $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$ , i.e., $\langle\mathrm{z}_{s},\mathrm{z}_{s^{\prime}}\rangle\approx\text{sim}(s,s^{\prime})$ .

These methods, however, suffer from several limitations. In random hashing, the maximum size of a set that a binary vector can accurately represent is $d$ , and thus sets whose sizes are larger than $d$ inevitably suffer from information loss. This is empirically verified in Figure 1(a); while estimations are accurately made in small sets, the error increases as the sizes of the sets are larger. The vector embedding method avoids such a problem but shows weakness in its versatility. That is, vectors are derived to preserve a predefined similarity (e.g., Jaccard Index), and thus they are not reusable to estimate other similarity measures (e.g., Dice Index). To address these issues, in this work, we propose Set2Box and Set2Box⁺, novel end-to-end algorithms for similarity preserving set embedding. As shown in Figure 1, Set2Box⁺ accurately preserves similarities between sets compared to random hashing and vector embedding methods, while requiring fewer bits to encode sets.

IV Proposed Method

In this section, we present our proposed method for similarity-preserving set embedding. We first present Set2Box, a novel algorithm for learning similarity-preserving set representations using boxes (Sec. IV-A). Then we propose Set2Box⁺, an advanced version of Set2Box, which derives better conciseness and accuracy (Sec. IV-B).

IV-A Set2Box: Preliminary Version

How can we derive set embeddings that accurately preserve similarity in terms of various metrics? Towards this goal, we first present Set2Box, a preliminary set representation method that effectively learns the set itself and the structural relations with other sets.

Concepts: A box is a $d$ -dimensional hyper-rectangle whose representation consists of its center and offset [19]. The center describes the location of the box in the latent space and the offset is the length of each edge of the box. Formally, given a box $\mathrm{B}=(\mathrm{c},\mathrm{f})$ whose center $\mathrm{c}\in\mathbb{R}^{d}$ and offset $\mathrm{f}\in\mathbb{R}_{+}^{d}$ are in the same latent space, the box is defined as a bounded region:

\mathrm{B}\equiv\{\mathrm{p}\in\mathbb{R}^{d}:\mathrm{c}-\mathrm{f}\preceq\mathrm{p}\preceq\mathrm{c}+\mathrm{f}\},

where $\mathrm{p}$ is any point within the box. We let $\mathrm{m}\in\mathbb{R}^{d}$ and $\mathrm{M}\in\mathbb{R}^{d}$ be the vectors representing the minimum and the maximum at each dimension, respectively, i.e., $\mathrm{m}=\mathrm{c}-\mathrm{f}$ and $\mathrm{M}=\mathrm{c}+\mathrm{f}$ . Given two boxes $\mathrm{B}_{X}=(\mathrm{c}_{X},\mathrm{f}_{X})$ and $\mathrm{B}_{Y}=(\mathrm{c}_{Y},\mathrm{f}_{Y})$ , the intersection is also a box, represented as:

\displaystyle\mathrm{B}_{X}\cap\mathrm{B}_{Y}\!\!\equiv

\displaystyle\{\mathrm{p}\in\mathbb{R}^{d}:\mathbf{max}(\mathrm{m}_{X},\mathrm{m}_{Y})\preceq\mathrm{p}\preceq\mathbf{min}(\mathrm{M}_{X},\mathrm{M}_{Y})\}.

The volume $\mathbb{V}(\mathrm{B})$ of the box $\mathrm{B}$ is computed by the product of the length of an edge in each dimension, i.e., $\mathbb{V}(\mathrm{B})=\prod_{i=1}^{d}(\mathrm{M}[i]-\mathrm{m}[i])$ . The volume of the union of the two boxes is simply computed by $\mathbb{V}(\mathrm{B}_{X})+\mathbb{V}(\mathrm{B}_{Y})-\mathbb{V}(\mathrm{B}_{X}\cap\mathrm{B}_{Y})$ .

Representation: The core idea of Set2Box is to model each set $s$ as a box $\mathrm{B}_{s}=(\mathrm{c}_{s},\mathrm{f}_{s})$ so that the relations with other sets are properly preserved in the latent space. To this end, Set2Box approximates the volumes of the boxes to the relative sizes of the sets, i.e., $\mathbb{V}(\mathrm{B}_{s})\propto|s|$ . In addition to the single-set level, Set2Box aims to preserve the relations between different sets by approximating the volumes of the intersection of the boxes to the intersection sizes of the sets, i.e., $\mathbb{V}(\mathrm{B}_{s_{i}}\cap\mathrm{B}_{s_{j}})\propto|s_{i}\cap s_{j}|$ . Notably, Set2Box not only addresses limitations of random hashing and vector-based embeddings, but it also has various advantages benefited from unique properties of boxes, as we discuss in Section VI.

Objective: Now we turn our attention to how to capture such overlaps between sets using boxes. Recall that our goal is to derive accurate and versatile representations of sets, and towards the first goal, we take relations beyond pairwise into consideration. Specifically, we consider three different levels of set relations (i.e., single, pair, and triple-wise relations) to capture the underlying high-order structure of sets. In another aspect, we aim to derive versatile set representations that can be used to estimate various similarity measures (e.g., Jaccard Index and Dice Index). With these goals in mind, we design an objective function that aims to preserve elemental relations among triple of sets. Specifically, given a triple $\{s_{i},s_{j},s_{k}\}$ of sets, we consider seven cardinalities from three different levels of subsets: (1) $|s_{i}|$ , $|s_{j}|$ , $|s_{k}|$ , (2) $|s_{i}\cap s_{j}|$ , $|s_{j}\cap s_{k}|$ , $|s_{k}\cap s_{i}|$ , and (3) $|s_{i}\cap s_{j}\cap s_{k}|$ which contain single, pair, and triple-wise information, respectively, and we denote them from $c_{1}(s_{i},s_{j},s_{k})$ to $c_{7}(s_{i},s_{j},s_{k})$ . These seven elements fully-describe the relations among the three sets, and we argue that any similarity measures are computable using them. In this regard, we aim to preserve the ratios of the seven cardinalities by the volumes of the boxes $\mathrm{B}_{s_{i}}$ , $\mathrm{B}_{s_{j}}$ , and $\mathrm{B}_{s_{k}}$ by minimizing the following objective:

\mathcal{J}(s_{i},s_{j},s_{k},\mathrm{B}_{s_{i}},\mathrm{B}_{s_{j}},\mathrm{B}_{s_{k}})=\\ \sum_{\ell=1}^{7}\left(p_{\ell}(s_{i},s_{j},s_{k})-\hat{p}_{\ell}(\mathrm{B}_{s_{i}},\mathrm{B}_{s_{j}},\mathrm{B}_{s_{k}})\right)^{2},

where $p_{\ell}$ is the ratio of the $\ell$ ^th cardinality among the three sets (i.e., $p_{\ell}=c_{\ell}/\sum_{\ell^{\prime}}c_{\ell^{\prime}}$ ) and $\hat{p}_{\ell}$ is the corresponding ratio estimated by the boxes. Since there exist $|\mathcal{S}|\choose 3$ possible triples of sets, taking all such combinations into account is practically intractable, and thus we resort to sampling some of them. We sample a set $\mathcal{T}$ of triples that consists of a set $\mathcal{T}^{+}$ of positive triples and a set $\mathcal{T}^{-}$ of negative triples, i.e., $\mathcal{T}=\mathcal{T}^{+}\cup\mathcal{T}^{-}$ . Specifically, the positive set $\mathcal{T}^{+}$ and the negative set $\mathcal{T}^{-}$ are obtained by sampling three connected (i.e., overlapping) sets and three uniform random sets, respectively. Then, the final objective function we aim to minimize is:

\mathcal{L}=\sum\nolimits_{\{s_{i},s_{j},s_{k}\}\in\mathcal{T}}\mathcal{J}(s_{i},s_{j},s_{k},\mathrm{B}_{s_{i}},\mathrm{B}_{s_{j}},\mathrm{B}_{s_{k}}).

(1)

Notably, the proposed objective function aims to capture not only the pairwise interactions between sets, but also the triplewise relations to capture high-order overlapping patterns of the sets. In addition, it does not rely on any predefined similarity measure, but is a general objective for learning key structural patterns of sets and their neighbors. This prevents the model from overfitting to a specific measure and enables the model to yield accurate estimates to diverse metrics, as shown empirically in Section V.

Box Embedding: Then, given a set $s$ , how can we derive the box $\mathrm{B}_{s}=(\mathrm{c}_{s},\mathrm{f}_{s})$ , that is, its center $\mathrm{c}_{s}$ and offset $\mathrm{f}_{s}$ ? To make the method generalizable to unseen sets, Set2Box introduces a pair of learnable embedding matrices $\mathbf{Q}^{\mathrm{c}}\in\mathbb{R}^{|\mathcal{E}|\times d}$ and $\mathbf{Q}^{\mathrm{f}}\in\mathbb{R}_{+}^{|\mathcal{E}|\times d}$ of entities, where $\mathbf{Q}_{i}^{\mathrm{c}}\in\mathbb{R}^{d}$ and $\mathbf{Q}_{i}^{\mathrm{f}}\in\mathbb{R}^{d}$ represent the center and offset of an entity $e_{i}$ , respectively. Then, the embeddings of the entities in the set $s$ are aggregated to obtain the center $\mathrm{c}_{s}$ and the offset $\mathrm{f}_{s}$ :

\mathrm{c}_{s}=\textbf{pooling}(s,\mathbf{Q}^{\mathrm{c}})\;\;\;\text{and}\;\;\;\mathrm{f}_{s}=\textbf{pooling}(s,\mathbf{Q}^{\mathrm{f}})

where pooling is a permutation invariant function. Instead of using simple functions such as mean or max, we use attentions to highlight the entities that are important to obtain either the center or the offset of the box. To this end, we define a pooling function that takes the context of each set into account, termed set-context pooling (SCP). Specifically, given a set $s$ and an item embedding matrix $\mathbf{Q}$ (which can be either $\mathbf{Q}^{\mathrm{c}}$ or $\mathbf{Q}^{\mathrm{f}}$ ), it first obtains the set-specific context vector $\mathrm{b}_{s}$ :

\mathrm{b}_{s}=\sum_{e_{i}\in s}\alpha_{i}\mathbf{Q}_{i}\;\;\;\text{where}\;\;\;\alpha_{i}=\frac{\exp(\mathrm{a}^{\intercal}\mathbf{Q}_{i})}{\sum_{e_{j}\in s}\exp(\mathrm{a}^{\intercal}\mathbf{Q}_{j})}

where $\mathrm{a}$ is a global context vector shared by all sets. Then using the context vector $\mathrm{b}_{s}$ , which specifically contains the information on set $s$ , it obtains the output embedding from:

\textbf{SCP}(s,\mathbf{Q})=\sum_{e_{i}\in s}\omega_{i}\mathbf{Q}_{i}\;\;\text{where}\;\;\omega_{i}=\frac{\exp(\mathrm{b}_{s}^{\intercal}\mathbf{Q}_{i})}{\sum_{e_{j}\in s}\exp(\mathrm{b}_{s}^{\intercal}\mathbf{Q}_{j})}.

To be precise, $\mathrm{c}_{s}=\textbf{SCP}(s,\mathbf{Q}^{\mathrm{c}})$ and $\mathrm{f}_{s}=|s|^{\frac{1}{d}}\textbf{SCP}(s,\mathbf{Q}^{\mathrm{f}})$ . Note that, for the offset $\mathrm{f}_{s}$ , we further take unique geometric properties of boxes into consideration. For any entity $e_{i}\in s$ , the subset relation $\{e_{i}\}\subseteq s$ holds, and thus the same condition $\mathrm{B}_{\{e_{i}\}}\subseteq\mathrm{B}_{s}$ for boxes is desired, which should satisfy $\mathrm{f}_{\{e_{i}\}}\preceq\mathrm{f}_{s}$ and thus $\max_{e_{i}\in s}\mathrm{f}_{\{e_{i}\}}=\max_{e_{i}\in s}\mathbf{Q}_{i}^{\mathrm{f}}\preceq\mathrm{f}_{s}$ . However, since SCP is the weighted mean of entities’ embeddings, the output of SCP is bounded by the input embeddings in all dimensions, i.e., $\min_{e_{i}\in s}\mathbf{Q}_{i}^{\mathrm{f}}\preceq\mathrm{f}_{s}\preceq\max_{e_{i}\in s}\mathbf{Q}_{i}^{\mathrm{f}}$ , which inevitably contradicts the aforementioned condition. In these regards, for the offset $\mathrm{f}_{s}$ , we multiply an additional regularizer $|s|^{\frac{1}{d}}$ that helps boxes to properly preserve the set similarity, i.e., $\mathrm{f}_{s}=|s|^{\frac{1}{d}}\textbf{SCP}(s,\mathbf{Q}^{\mathrm{f}})$ .

Smoothing Boxes: By definition, a box $\mathrm{B}=(\mathrm{c},\mathrm{f})$ is a bounded region with hard edges whose volume is

\mathbb{V}(\mathrm{B})=\prod\nolimits_{i=1}^{d}(\mathrm{M}[i]-\mathrm{m}[i])=\prod\nolimits_{i=1}^{d}\text{ReLU}(\mathrm{M}[i]-\mathrm{m}[i])

where $\mathrm{m}=\mathrm{c}-\mathrm{f}$ and $\mathrm{M}=\mathrm{c}+\mathrm{f}$ . This, however, disables gradient-based optimization when boxes are disjoint [30], and thus we smooth the boxes by using an approximation of ReLU:

\mathbb{V}(\mathrm{B})=\prod\nolimits_{i=1}^{d}\text{Softplus}(\mathrm{M}[i]-\mathrm{m}[i])

where $\text{Softplus}(\mathrm{x})=\frac{1}{\beta}\log\left(1+\exp(\beta\mathrm{x})\right)$ is an approximation to $\text{ReLU}(\mathrm{x})$ , and it becomes closer to ReLU as $\beta$ increases (specifically, Softplus $\rightarrow$ ReLU as $\beta\rightarrow\infty$ ). In this way, any pairs of boxes overlap each other, and thus non-zero gradients are computed for optimization.

Encoding Cost: Each box consists of two vectors, a center and an offset, and it requires $2\cdot 32d=64d$ bits to encode them, assuming that we are using float-32 to represent each real number. Thus, $64|\mathcal{S}|d$ bits are required to store the box embeddings of $|\mathcal{S}|$ sets.

IV-B Set2Box⁺: Advanced Version

We describe Set2Box⁺, which enhances Set2Box in terms of conciseness and accuracy, based on an end-to-end box quantization scheme. Specifically, Set2Box⁺ compresses the the box embeddings into a compact set of key boxes and a set of discrete codes to reconstruct the original boxes.

Box Quantization: We propose box quantization, a novel scheme for compressing boxes by using substantially smaller number of bits. Note that conventional product quantization methods [40], which are for vector compression, are straightforwardly applicable, by independently reducing the center and the offset of the box. However, it hardly makes use of geometric properties of boxes, and thus it does not properly reflect the complex relations between them. The proposed box quantization scheme effectively addresses this issue through two steps: (1) box discretization and (2) box reconstruction.

$\circ$ Box Discretization. Given a box $\mathrm{B}_{s}=(\mathrm{c}_{s},\mathrm{f}_{s})$ of set $s$ , we discretize the box as a $K$ -way $D$ -dimensional discrete code $\mathrm{C}_{s}\in\{1,\cdots,K\}^{D}$ which is more compact and requires much less number of bits to encode than real numbers. To this end, we divide the $d$ -dimensional latent space into $D$ subspaces ( $\mathbb{R}^{d/D}$ ) and, for each subspace, learn $K$ key boxes. Specifically, in the $i$ ^th subspace, the $j$ ^th key box is denoted by $\mathrm{K}_{j}^{(i)}=(\mathrm{c}_{j}^{(i)},\mathrm{f}_{j}^{(i)})$ where $\mathrm{c}_{j}^{(i)}\in\mathbb{R}^{d/D}$ and $\mathrm{f}_{j}^{(i)}\in\mathbb{R}_{+}^{d/D}$ are the center and offset of the key box, respectively. The original box $\mathrm{B}_{s}$ is also partitioned into $D$ sub-boxes $\mathrm{B}_{s}^{(1)},\cdots,\mathrm{B}_{s}^{(D)}$ and the $i$ ^th code of $\mathrm{C}_{s}$ is decided by:

\mathrm{C}_{s}[i]=\operatorname*{arg\,min}_{j}\textbf{dist}\left(\mathrm{B}_{s}^{(i)},\;\mathrm{K}_{j}^{(i)}\right)

where dist( $\cdot,\cdot$ ) measures the distance (i.e., dissimilarity) between two boxes, and we can flexibly select the criterion. In this paper, we specify the dist function, using softmax, as:

\mathrm{C}_{s}[i]=\operatorname*{arg\,max}_{j}\frac{\exp\left(\textbf{BOR}\left(\mathrm{B}_{s}^{(i)},\mathrm{K}_{j}^{(i)}\right)\right)}{\sum_{j^{\prime}}\exp\left(\textbf{BOR}\left(\mathrm{B}_{s}^{(i)},\mathrm{K}_{j^{\prime}}^{(i)}\right)\right)}

(2)

where BOR (Box Overlap Ratio) is defined to measure how much a box $\mathrm{B}_{X}$ and a box $\mathrm{B}_{Y}$ overlap:

\textbf{BOR}(\mathrm{B}_{X},\mathrm{B}_{Y})=\frac{1}{2}\left(\frac{\mathbb{V}(\mathrm{B}_{X}\cap\mathrm{B}_{Y})}{\mathbb{V}(\mathrm{B}_{X})}+\frac{\mathbb{V}(\mathrm{B}_{X}\cap\mathrm{B}_{Y})}{\mathbb{V}(\mathrm{B}_{Y})}\right).

As shown in Figure 2, the proposed box quantization scheme incorporates the geometric relations between boxes, differently from conventional product quantization methods on vectors. To sum up, for each $i$ ^th subspace, we search for the key sub-box closest to the sub-box $\mathrm{B}_{s}^{(i)}$ and assign its index as the $i$ ^th dimension’s value of its discrete code.

$\circ$ Box Reconstruction. Once the discrete code $\mathrm{C}_{s}$ of set $s$ is generated, in this step, we reconstruct the original box based on it. To be specific, we obtain the reconstructed box $\widehat{\mathrm{B}}_{s}=(\widehat{c}_{s},\widehat{f}_{s})$ by concatenating $D$ key boxes from each subspace encoded in $\mathrm{C}_{s}$ :

\widehat{\mathrm{B}}_{s}=\Big{\|}_{i=1}^{D}\mathrm{K}_{\mathrm{C}_{s}[i]}^{(i)}.

More precisely, $\widehat{\mathrm{B}}_{s}$ is reconstructed by concatenating the centers and the offsets of the $D$ key boxes respectively. Since $\mathrm{C}_{s}$ encodes key boxes that largely overlap with the box $\mathrm{B}_{s}$ (i.e., high BOR), if properly encoded, we can expect the reconstructed box $\widehat{\mathrm{B}}_{s}$ to be geometrically similar to the original box $\mathrm{B}_{s}$ . In Section V-C, we demonstrate the effectiveness of the proposed box quantization scheme by comparing it with the product quantization method.

Differentiable Optimization: Recall that Set2Box⁺ is an end-to-end learnable algorithm, which requires all processes to be differentiable. However, the $\operatorname*{arg\,max}$ operation in Eq. (2) is non-differentiable, and to this end, we utilize the softmax with the temperature $\tau$ :

\widetilde{\mathrm{C}}_{s}[i]=\frac{\exp\left(\textbf{BOR}\left(\mathrm{B}_{s}^{(i)},\mathrm{K}_{j}^{(i)}\right)/\tau\right)}{\sum_{j^{\prime}}\exp\left(\textbf{BOR}\left(\mathrm{B}_{s}^{(i)},\mathrm{K}_{j^{\prime}}^{(i)}\right)/\tau\right)}.

(3)

Note that $\widetilde{\mathrm{C}}_{s}[i]$ is a $K$ -dimensional probabilistic vector whose $j$ ^th element indicates the probability for $\mathrm{K}_{j}^{(i)}$ being assigned as the closest key box, i.e., the probability of $\mathrm{C}_{s}[i]=j$ . Then, the key box $\widetilde{\mathrm{K}}_{s}^{(i)}=(\widetilde{\mathrm{c}}_{s}^{(i)},\widetilde{\mathrm{f}}_{s}^{(i)})$ in the $i$ ^th subspace is the weighted sum of the $K$ key boxes:

\widetilde{\mathrm{K}}_{s}^{(i)}=\sum\nolimits_{j=1}^{K}\widetilde{C}_{s}[i][j]\cdot\mathrm{K}_{j}^{(i)}.

If $\tau=0$ , Eq. (3) is equivalent to the $\operatorname*{arg\,max}$ function, i.e., a one-hot vector where $\mathrm{C}_{s}[i]$ ^th dimension is 1 and others are 0. In this case, $\widetilde{\mathrm{K}}_{s}^{(i)}$ becomes equivalent to $\mathrm{K}_{\mathrm{C}_{s}[i]}^{(i)}$ , which is the exact reconstruction derivable from the discrete code $\mathrm{C}_{s}$ . However, since this hard selection is non-differentiable and thus prevents an end-to-end optimization, we resort to the approximation by using the softmax with $\tau\neq 0$ which is fully differentiable. Specifically, we use different $\tau$ s’ in forward ( $\tau=0$ ) and backward ( $\tau=1$ ) passes, which effectively enables differentiable optimization.

Joint Training: For further improvement, we introduce a joint learning scheme in the box quantization scheme. Given a triple $\{s_{i},s_{j},s_{k}\}$ of sets from the training data $\mathcal{T}$ , we obtain their boxes $\mathrm{B}_{s_{i}}$ , $\mathrm{B}_{s_{j}}$ , and $\mathrm{B}_{s_{k}}$ and their reconstructed ones $\widehat{\mathrm{B}}_{s_{i}}$ , $\widehat{\mathrm{B}}_{s_{j}}$ , and $\widehat{\mathrm{B}}_{s_{k}}$ using the box quantization. While the basic version of Set2Box⁺ optimizes the following objective:

TABLE II: Statistics of the 8 real-world datasets: the number of entities

|\mathcal{E}|

, the number of sets

|\mathcal{S}|

, the maximum set size

\textbf{max}_{s\in\mathcal{S}}|s|

, and the size of the dataset

\textbf{sum}_{s\in\mathcal{S}}|s|

Dataset	$\|\mathcal{E}\|$	$\|\mathcal{S}\|$	$\textbf{max}_{s\in\mathcal{S}}\|s\|$	$\textbf{sum}_{s\in\mathcal{S}}\|s\|$
Yelp (YP)	25,252	25,656	649	467K
Amazon (AM)	55,700	105,655	555	858K
Netflix (NF)	17,769	478,615	12,206	56.92M
Gplus (GP)	107,596	72,271	5,056	13.71M
Twitter (TW)	81,305	70,097	1,205	1.76M
MovieLens 1M (ML1)	3,533	6,038	1,435	575K
MovieLens 10M (ML10)	10,472	69,816	3,375	5.89M
MovieLens 20M (ML20)	22,884	138,362	4,168	12.20M

\mathcal{J}(s_{i},s_{j},s_{j},\widehat{\mathrm{B}}_{s_{i}},\widehat{\mathrm{B}}_{s_{j}},\widehat{\mathrm{B}}_{s_{k}}),

we additionally make use of the original boxes during the optimization. Specifically, we jointly train the original boxes together with the reconstructed ones so that both types of boxes can achieve high accuracy. To this end, we consider the following eight losses:

	$\displaystyle\mathcal{J}(s_{i},s_{j},s_{j},{\mathrm{B}}_{s_{i}},{\mathrm{B}}_{s_{j}},{\mathrm{B}}_{s_{k}}),\;\;\;\;\;\mathcal{J}(s_{i},s_{j},s_{j},\widehat{\mathrm{B}}_{s_{i}},{\mathrm{B}}_{s_{j}},{\mathrm{B}}_{s_{k}}),$
	$\displaystyle\mathcal{J}(s_{i},s_{j},s_{j},{\mathrm{B}}_{s_{i}},\widehat{\mathrm{B}}_{s_{j}},{\mathrm{B}}_{s_{k}}),\;\;\;\;\;\mathcal{J}(s_{i},s_{j},s_{j},{\mathrm{B}}_{s_{i}},{\mathrm{B}}_{s_{j}},\widehat{\mathrm{B}}_{s_{k}}),$
	$\displaystyle\mathcal{J}(s_{i},s_{j},s_{j},\widehat{\mathrm{B}}_{s_{i}},\widehat{\mathrm{B}}_{s_{j}},{\mathrm{B}}_{s_{k}}),\;\;\;\;\;\mathcal{J}(s_{i},s_{j},s_{j},\widehat{\mathrm{B}}_{s_{i}},{\mathrm{B}}_{s_{j}},\widehat{\mathrm{B}}_{s_{k}}),$
	$\displaystyle\mathcal{J}(s_{i},s_{j},s_{j},{\mathrm{B}}_{s_{i}},\widehat{\mathrm{B}}_{s_{j}},\widehat{\mathrm{B}}_{s_{k}}),\;\;\;\;\;\mathcal{J}(s_{i},s_{j},s_{j},\widehat{\mathrm{B}}_{s_{i}},\widehat{\mathrm{B}}_{s_{j}},\widehat{\mathrm{B}}_{s_{k}}),$

where we denote them by $\mathcal{J}_{1}$ to $\mathcal{J}_{8}$ , for the sake of brevity. Notably, $\mathcal{J}_{1}$ , which utilizes only the original boxes, is an objective used for Set2Box, and $\mathcal{J}_{8}$ considers only the reconstructed boxes. Based on these joint views from different types of boxes, the final loss function we aim to minimize is:

\mathcal{L}=\!\!\!\!\!\sum_{\{s_{i},s_{j},s_{k}\}\in\mathcal{T}}\!\!\!\!\!\lambda\left(\mathcal{J}_{1}+\mathcal{J}_{2}+\mathcal{J}_{3}+\mathcal{J}_{4}+\mathcal{J}_{5}+\mathcal{J}_{6}+\mathcal{J}_{7}\right)+\mathcal{J}_{8},

(4)

where $\lambda$ is the coefficient for balancing the losses between the joint views and the loss from the reconstructed boxes. In this way, both original boxes and the reconstructed ones are trained together to be properly located and shaped in the latent space. Note that even though both types of boxes are jointly trained to achieve high accuracy, only the reconstructed boxes are used for inference. We conduct ablation studies to verify the effectiveness of the joint training scheme in Section V-C.

Encoding Cost: To encode the reconstructed boxes for each set, Set2Box⁺ requires (1) key boxes and (2) discrete codes to encode each set. There exist $K$ key boxes in each of the $D$ subspaces whose dimensionality is $d/D$ , which requires $64Kd$ bits to encode. Each set is encoded as a $K$ -way $D$ -dimensional vector, which requires $D\log_{2}K$ bits. To sum up, to encode $|\mathcal{S}|$ sets, Set2Box⁺ requires $64Kd+|\mathcal{S}|D\log_{2}K$ bits. Notably, if $K\ll|\mathcal{S}|$ , then $64Kd$ bits are negligible, and typically, $D\log_{2}K\ll 64d$ holds. Thus, the encoding cost of Set2Box⁺ is considerably smaller than that of Set2Box.

Similarity Computation: Once we obtain set representations, it is desirable to rapidly compute the estimated similarities in the latent space. Boxes, which Set2Box and Set2Box⁺ derive, require constant time to compute a pairwise similarity between two sets, as formalized in Lemma 1.

Lemma 1 (Time Complexity of Similarity Estimation).

Given a pair of sets $s$ and $s^{\prime}$ and their boxes $\mathrm{B}_{s}$ and $\mathrm{B}_{s^{\prime}}$ , respectively, it takes $O(d)$ time to compute the estimated similarity $\widehat{\text{sim}}(\mathrm{B}_{s},\mathrm{B}_{s^{\prime}})$ , where $d$ is a user-defined constant that does not depend on the sizes of $s$ and $s^{\prime}$ .

Proof. Assume that the true similarity $\text{sim}(s,s^{\prime})$ is computable using $|s|$ , $|s^{\prime}|$ , and $|s\cap s^{\prime}|$ . They are estimated by $\mathbb{V}(\mathrm{B}_{s})$ , $\mathbb{V}(\mathrm{B}_{s^{\prime}})$ , and $\mathbb{V}(\mathrm{B}_{s}\cap\mathrm{B}_{s^{\prime}})$ , respectively, and each of them is computed by the product of $d$ values, which takes $O(d)$ time. Hence, the total time complexity is $O(d)$ . ∎

V Experimental Results

We review our experiments designed for answering Q1-Q3.

Q1.

Accuracy & Conciseness: Does Set2Box⁺ derive concise and accurate set representations than its competitors?
Q2.

Effectiveness: How does Set2Box⁺ yield concise and accurate representations? Are all its components useful?
Q3.

Effects of Parameters: How do the parameters of Set2Box⁺ affect the quality of set representations?

V-A Experimental Settings

Machines & Implementations: All experiments were conducted on a Linux server with RTX 3090Ti GPUs. We implemented all methods including Set2Box and Set2Box⁺ using the Pytorch library.

Hyperparameter Tuning Table III describes the hyperparameter search space of each method. The number of training samples, $|\mathcal{T}^{+}|$ and $|\mathcal{T}^{-}|$ , are both set to $10$ for Set2Box, Set2Box⁺, and their variants. For the vector-based methods, Set2Vec and Set2Vec⁺, since three pairwise relations are extractable from each triple, $\left\lceil\frac{7}{3}|\mathcal{T}^{+}|\right\rceil$ positive triples and $\left\lceil\frac{7}{3}|\mathcal{T}^{-}|\right\rceil$ negative samples are used for training. We fix the batch size to $512$ and use the Adam optimizer. In Set2Box⁺, we fix the softmax temperature $\tau$ to $1$ .

TABLE III: Search space of each method.

Method	Hyperparameter	Selection Pool
Set2Box	Learning rate	$0.001$ , $0.01$
Set2Box	Box smoothing parameter $\beta$	$1$ , $2$ , $4$
Set2Box⁺	Learning rate	$0.001$ , $0.01$
	Box smoothing parameter $\beta$	$1$ , $2$ , $4$
	Joint training coefficient $\lambda$	$0$ , $0.001$ , $0.01$ , $0.1$ , $1$

TABLE IV: Encoding cost of the methods covered in this work.

|\mathcal{S}|

: number of sets.

d

: dimensionality.

D

: number of subspaces.

K

: number of key boxes (vectors) in each subspace.

Method		Encoding Cost (bits)
Set2Bin		$d\|\mathcal{S}\|$
Set2Vec		$32d\|\mathcal{S}\|$
Set2Vec⁺		$32d\|\mathcal{S}\|$
Set2Box-order		$32d\|\mathcal{S}\|$
Set2Box-PQ		$64Kd+2\|\mathcal{S}\|D\log_{2}K$
Set2Box-BQ		$64Kd+\|\mathcal{S}\|D\log_{2}K$
Set2Box		$64d\|\mathcal{S}\|$
Set2Box⁺		$64Kd+\|\mathcal{S}\|D\log_{2}K$

Datasets: We use eight publicly available datasets in Table II. The details of each dataset is as follows:

•

Yelp (YP) consists of user ratings on locations (e.g., hotels and restaurants), and each set is a group of locations that a user rated. Ratings higher than 3 are considered.
•

Amazon (AM) contains reviews of products (specifically, those categorized as Movies & TV) by users. In the dataset, each user has at least 5 reviews. A group of products reviewed by the same user is abstracted as a set.
•

Netflix (NF) is a collections of movie ratings from users. Each set is a set of movies rated by each user, and each entity is a movie. We consider ratings higher than 3.
•

Gplus (GP) is a directed social network that consists of ‘circles’ from Google+. Each set is the group of neighboring nodes of each node.
•

Twitter (TW) is also a directed social networks consisting of ‘circles’ in Twitter. Each set is a group of neighbors of each node in the graph.
•

MovieLens (ML1, ML10, and ML20) are collections of movie ratings from anonymous users. Each set is a group of movies that a user rated. Sets of movies with ratings higher than 3 are considered.

Baselines: We compare Set2Box and Set2Box⁺ with the following baselines including the variants of the methods discussed in Section III:

•

Set2Bin encodes each set $s$ as a binary vector $\mathrm{z}_{s}\in\{0,1\}^{d}$ using a random hash function. See Section III for details.
•

Set2Vec embeds each set $s$ as a vector $\mathrm{z}_{s}\in\mathbb{R}^{d}$ which is obtained by pooling learnable entity embeddings using SCP. Precisely, given two sets $s$ and $s^{\prime}$ and their vector representations $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$ , it aims to approximate the predefined set similarity $\text{sim}(s,s^{\prime})$ by the inner product of $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$ , i.e., $\langle\mathrm{z}_{s},\mathrm{z}_{s^{\prime}}\rangle\approx\text{sim}(s,s^{\prime})$ .
•

Set2Vec⁺ incorporates entity features $\mathbf{X}\in\mathbb{R}^{|\mathcal{E}|\times d}$ into the set representation. Features are projected using a fully-connected layer and then pooled to a set embedding.

In order to obtain the entity features for Set2Vec⁺, which incorporates features into the set representations, we generate a projected graph (a.k.a., clique expansion) where nodes are entities, and any two nodes are adjacent if and only if their corresponding entities co-appear in at least one set. Specifically, we generate a weighted graph by assigning a weight (specifically, the number of sets where the two corresponding entities co-appear) to each edge, and we apply node2vec [46], a popular random walk-based network embedding method, to the graph to obtain the feature of each entity. Recall that vector-based methods, Set2Vec and Set2Vec⁺, are not versatile, and thus they need to be trained specifically to each metric, while the proposed methods Set2Box and Set2Box⁺ do not separate training. It should be noticed that search methods (e.g., LSH) are not direct competitors of the considered embedding methods, whose common goal is similarity preservation. We summarize the encoding cost of each method, including the variants of Set2Box and Set2Box⁺ used in Section V-C, in Table IV.

Evaluation: For the Netflix dataset, whose number of sets is relatively large, we used $5\%$ of the sets for training. For the other datasets, we used $20\%$ of the sets for training. The remaining sets are split into half and used for validation and test. We measured the Mean Squared Error (MSE) to evaluate the accuracy of the set similarity approximation. Since the number of possible pairs of sets is $O(|\mathcal{S}|^{2})$ , which may be considerably large, we sample 100,000 pairs uniformly at random for evaluation. We consider four representative set-similarity measures for evaluation: Overlap Coefficient (OC), Cosine Similarity (CS), Jaccard Index (JI), and Dice Index (DI), which are defined as:

\frac{|s\cap s^{\prime}|}{\min(|s|,|s^{\prime}|)},\;\;\;\frac{|s\cap s^{\prime}|}{\sqrt{|s|\cdot|s^{\prime}|}},\;\;\;\frac{|s\cap s^{\prime}|}{|s\cup s^{\prime}|},\;\;\;\text{and}\;\;\;\frac{|s\cap s^{\prime}|}{\frac{1}{2}\left(|s|+|s^{\prime}|\right)},

respectively, between a pair of sets $s$ and $s^{\prime}$ .

V-B Q1. Accuracy & Conciseness

We compare the MSE of the set similarity estimation derived by Set2Box⁺ and its competitors. We set dimensions to 256 for Set2Bin, 8 for vector based methods (Set2Vec and Set2Vec⁺), and 4 for Set2Box, so that they use the same number of bits to encode sets. For Set2Box⁺, we set $(d,D,K)=(32,16,30)$ , which results in only $31-40\%$ of the encoding cost of the other methods, unless otherwise stated.

Accuracy: As seen in Figure 3, Set2Box⁺ yields the most accurate set representations while using a smaller number of bits to encode them. For example, in the Twitter dataset, Set2Box⁺ gives $40.8\times$ smaller MSE for the Jaccard Index compared to Set2Bin. In the Amazon dataset, Set2Box⁺ gives $11.0\times$ smaller MSE for the Overlap Coefficient than Set2Vec⁺. In both cases, Set2Box⁺ uses about $31\%$ of the encoding costs used by the competitors.

Metric	ML1	ML10	ML20	YP	AM	GP	TW	NF
JI	8.0	11.1	12.9	34.9	33.6	76.2	41.2	16.2
DI	8.0	15.9	17.7	27.3	27.2	63.5	31.7	22.7
OC	8.0	12.7	16.1	34.9	28.8	96.8	60.3	19.5
CS	8.0	15.9	16.1	28.8	28.8	68.2	38.0	22.7

Conciseness: To verify the conciseness of Set2Box⁺, we measure the accuracy of competitors across various encoding costs. As seen in Figure 5, Set2Box⁺ yields compact representations of sets while keeping them informative. Vector-based methods are prone to the curse of dimensionality and hardly benefit from high dimensionality. While the MSE of Set2Bin decreases with respect to its dimension, it still requires larger space to achieve the MSE of Set2Box⁺. For example, Set2Bin requires $8.0\times$ and $34.9\times$ more bits to achieve the same accuracy of Set2Box⁺ in MovieLens 1M and Yelp, respectively. This is more noticeable in larger datasets, where Set2Bin requires up to $96.8\times$ of the encoding cost of Set2Box⁺, as shown in Figure V(a). These results demonstrate the conciseness of Set2Box⁺. In addition, Figure 6 shows how the accuracy of estimations by Set2Box and Set2Box⁺ depend on their encoding costs.

Speed: For the considered methods, Figure 7 shows the loss (relative to the loss after the first epoch) over time in two large datasets, MovieLens 20M and Netflix. The loss in Set2Box⁺ drops over time and eventually converges within an hour.

Further Analysis of Accuracy: We analyze how much sets estimated to be similar by the considered methods are actually similar. To this end, for each set $s$ , we compare its $k$ most similar sets $G_{s,k}$ to those $\widehat{G}_{s,k}$ estimated to be the most similar using each method. Then, as in [4], we measure the quality $q(\widehat{G}_{s,k})$ of $\widehat{G}_{s,k}$ which measures how sets in $\widehat{G}_{s,k}$ are similar enough to $s$ compared to those in $G_{s,k}$ :

q(\widehat{G}_{s,k})=\frac{\sum_{s^{\prime}\in\widehat{G}_{s,k}}\text{sim}(s,s^{\prime})}{\sum_{s^{\prime}\in G_{s,k}}\text{sim}(s,s^{\prime})}

where $\text{sim}(\cdot,\cdot)$ is the similarity (spec., Jaccard Index) between sets. The quality ranges from $0$ to $1$ , and it is ideally close to 1, which indicates that the sets estimated to be similar by the method are similar enough compared to the ideal sets. Based on the above criterion, we measure the quality of each method with different $k$ ’s. We use the above configuration for Set2Box⁺, while the dimensions of the other methods are adjusted to require a similar encoding cost to Set2Box⁺. As shown in Figure 8, the average quality $q(\widehat{G}_{s,k})$ is the highest in Set2Box⁺ in all considered $k$ ’s in MovieLens 1M and Yelp, implying that the sets estimated to be similar by the proposed methods are indeed similar to each other.

V-C Q2. Effectiveness

To verify the effectiveness of each component of Set2Box⁺, we conduct ablation studies by comparing it with its variants. We first consider the following variants:

•

Set2Box-PQ: Given a box $\mathrm{B}=(\mathrm{c},\mathrm{f})$ , we apply an end-to-end differentiable product quantization (PQ) [40] to the center $\mathrm{c}$ and the offset $\mathrm{f}$ independently. Dot products between the query vector ( $\mathrm{c}$ or $\mathrm{r}$ ) and the key vectors are computed to measure the distances. Notably, it yields two independent discrete codes for the center and the offset, and thus its encoding cost is $64Kd+2|\mathcal{S}|D\log_{2}K$ bits, which is approximately twice that of Set2Box⁺.
•

Set2Box-BQ: A special case of Set2Box⁺ with $\lambda=0$ , where the proposed box quantization is applied but joint training is not.

We set $(d,D,K)$ to $(32,8,30)$ for Set2Box-PQ and $(32,16,30)$ for Set2Box-BQ and Set2Box⁺ so they all require the the same amount of storage.

Effects of Box Quantization: We examine the effectiveness of the proposed box quantization scheme in Section IV-B by comparing Set2Box-BQ with Set2Box-PQ. To this end, we measure the relative MSE defined as:

\frac{\text{MSE of {Set2Box-BQ}}}{\text{MSE of {Set2Box-PQ}}}

(5)

of each dataset. Figure 4(a) demonstrates that Set2Box-BQ generally derives more accurate set representations compared to Set2Box-PQ, implying the effectiveness of the proposed box quantization scheme. As shown in Table V, on average, Set2Box-BQ yields up to $26\%$ smaller MSE than Set2Box-PQ while using approximately the same number of bits. For example, in MovieLens 10M, Set2Box-BQ gives $1.89\times$ more accurate estimation than Set2Box-PQ in approximating the Dice Index. While Set2Box-PQ discretizes the center and offset of the boxes independently, without the consideration of their geometric properties, the proposed box quantization scheme effectively takes the geometric relations between boxes into account and thus yields high-quality compression.

Effects of Joint Training: We analyze the effects of the joint training scheme of Set2Box⁺ by comparing Set2Box-BQ ( $\lambda=0$ ) and Set2Box⁺ ( $\lambda\geq 0$ ) and to this end, we measure the relative MSE defined as:

\frac{\text{MSE of {Set2Box}\textsuperscript{+}}}{\text{MSE of {Set2Box-BQ}}}

(6)

of each dataset. Figure 4(b) shows that Set2Box⁺ is superior compared to Set2Box-BQ in most datasets indicating that jointly training the reconstruction boxes with the original ones leads to accurate boxes. As summarized in Table V, joint training reduces the average MSEs on the considered datasets, by up to $44\%$ , together with the box quantization scheme. For example, together with the box quantization scheme, joint training reduces estimation error by $64\%$ and $38\%$ for the Jaccard Index on Gplus and for the Overlap Coefficient on Netflix, respectively. These results imply that learning quantized boxes simultaneously with the original boxes improves the quality of the quantization and thus its effectiveness. To further analyze these results, we investigate how the loss decreases with respect to $\lambda$ . In Figure 9, we observe that training the reconstructed boxes alone ( $\lambda=0$ ) is unstable, and learning the original boxes together ( $\lambda>0$ ) helps not only stabilize but also facilitate the optimization.

TABLE V: The proposed schemes: box quantization and joint training in Set2Box⁺ incrementally improves the accuracy (in terms of MSE) averaged over all considered datasets.

Method	OC	CS	JI	DI
Set2Box-PQ	0.0129	0.0028	0.0012	0.0023
Set2Box-BQ	0.0106 (-17%)	0.0023 (-17%)	0.0009 (-26%)	0.0019 (-17%)
Set2Box⁺	0.0077 (-40%)	0.0016 (-44%)	0.0007 (-41%)	0.0013 (-42%)

Effects of Boxes: To confirm the effectiveness of using boxes for representing sets, we consider Set2Box-order, which is also a region-based geometric embedding method:

•

Set2Box-order: A set $s$ is represented as a $d$ -dimensional vector $\mathrm{z}_{s}\in\mathbb{R}_{+}^{d}$ whose volume is computed as $\mathbb{V}(\mathrm{z}_{s})=\exp(-\sum_{i}\mathrm{z}_{s}[i])$ . The volume of the intersection and the union of two representations $\mathrm{z}_{s}$ and $\mathrm{z}_{s^{\prime}}$ of sets $s$ and $s^{\prime}$ , respectively, are:

	$\displaystyle\mathbb{V}(\mathrm{z}_{s}\wedge\mathrm{z}_{s^{\prime}})=\exp\left(-\textstyle\sum\nolimits_{i}\max(\mathrm{z}_{s}[i],\mathrm{z}_{s^{\prime}}[i])\right),$
	$\displaystyle\mathbb{V}(\mathrm{z}_{s}\vee\mathrm{z}_{s^{\prime}})=\exp\left(-\textstyle\sum\nolimits_{i}\min(\mathrm{z}_{s}[i],\mathrm{z}_{s^{\prime}}[i])\right),$

respectively. The encoding cost is $32d|\mathcal{S}|$ bits.

We set the dimensions for Set2Box-order and Set2Box to $8$ and $4$ , respectively, so that their encoding costs are the same. In Table VI, we compare Set2Box with Set2Box-order in terms of the average MSE on the considered datasets for each measure. Set2Box yields more accurate representations than Set2Box-order, implying the effectiveness of boxes to represent sets for similarity preservation. For example, Set2Box achieved $62\%$ lower average MSE than Set2Box-order in preserving the Overlap Coefficient.

TABLE VI: Set2Box yields smaller MSE on average in the considered datasets than Set2Box-order.

Method	OC	CS	JI	DI
Set2Box-order	0.0320	0.0033	0.0008	0.0027
Set2Box	0.0121 (-62%)	0.0028 (-14%)	0.0006 (-22%)	0.0022 (-17%)

V-D Q3. Effects of Parameters

We analyze how parameters of Set2Box⁺ affect the embedding quality of the set representations. First, the number of subspaces ( $D$ ) and the number of key boxes in each subspace ( $K$ ) are the key parameters that control the encoding cost of Set2Box⁺. In Figure 10, we investigate how the performance of Set2Box⁺ depends on $D$ and $K$ values while fixing $d$ to $32$ . Typically, the accuracy improves as $D$ and $K$ increase at the expense of extra encoding cost. In addition, its performance is affected more heavily by $D$ than by $K$ .

In Figure 11, we observe how the coefficient $\lambda$ in Eq. (4) affects the accuracy of Set2Box⁺. To this end, we measure relative MSE (relative to MSE when $\lambda=0$ ) with different $\lambda$ values. As shown in Figure 11, joint training is beneficial, but overemphasizing the joint views sometimes prevents Set2Box⁺ from learning meaningful reconstructed boxes.

In Figure 12, we observe the effects of the number of training samples, $|\mathcal{T}^{+}|$ and $|\mathcal{T}^{-}|$ , in Set2Box⁺. We can see that the accuracy is robust to the parameters, and thus using only a small number of samples for training is enough (we consistently use $|\mathcal{T}^{+}|=|\mathcal{T}^{-}|=10$ in all experiments).

VI Discussions

To further support the effectiveness of the proposed methods, Set2Box and Set2Box⁺, we analyze the properties of boxes and other representation methods and find their relations with sets. To this end, we review the following questions for the methods used in Section V, as summarized in Table VII:

RQ1. Are basic set properties supported?

A1. Boxes naturally satisfy six representative set properties, which are listed in Table VIII. These properties are also met by Set2Bin and Set2Box-order, but not by the vector-based methods Set2Vec and Set2Vec⁺, since they do not contain information on the set itself (e.g., set sizes).

RQ2. Are sets of any-size representable?

A2. In Set2Box and Set2Box⁺, boxes of various volumes can be learned by adjusting their offsets, and thus sets of any sizes are accurately learnable. So does the Set2Box-order, by controlling the L1 norm of the vector. However, Set2Bin inevitably suffers from information loss for sets larger than $d$ (see Figure 1 in Section III). The vector-based methods have no limitations regarding set sizes.

RQ3. Are representations expressive enough?

R3. Boxes of diverse shapes and sizes can be located anywhere in the Euclidean latent space by controlling their centers and offsets. This property provides boxes with expressiveness, enabling them to capture complex relations with other boxes. In Set2Box-order, a single nonnegative vector is learned to control the volume of the region. This nonnegativity limits the expressiveness of the embeddings. Set2Bin suffers from hash collisions, and thus different sets can be represented as the same binary vector, which causes considerable information loss if $d$ is not large enough. Despite their wide usage in various fields, empirically, Set2Vec and Set2Vec⁺ have limited power to accurately preserve similarities between sets. In particular, set embeddings obtained from them for a specific measure are not extensible to estimate other measures.

TABLE VII: Properties of the considered methods regarding Q1, Q2, and Q3 in Section VI.

Method	Q1	Q2	Q3
Set2Bin	✓	✗	✗
Set2Vec & Set2Vec⁺	✗	✓	✗
Set2Box-order	✓	✓	✗
Set2Box & Set2Box⁺	✓	✓	✓

VII Conclusions

In this work, we propose Set2Box, an effective representation learning method for preserving similarities between sets. Thanks to the unique geometric properties of boxes, Set2Box accurately preserves various similarities without assumptions about measures. Additionally, we develop Set2Box⁺, which is equipped with novel box quantization and joint training schemes. Our empirical results support that Set2Box⁺ has the following strengths over its competitors:

•

Accurate: Set2Box⁺ yields up to $40.8\times$ smaller estimation error than competitors, requiring smaller encoding costs.
•

Concise: Set2Box⁺ requires up to $96.8\times$ smaller encoding costs to achieve the same accuracy of the competitors.
•

Versatile: Set2Box⁺ is free from assumptions about similarity measures to be preserved.