
TableDC: Deep Clustering for Tabular Data

Hafiz Tayyab Rauf (0000-0002-1515-3187), Department of Computer Science, University of Manchester, Manchester, UK, M13 9PL. hafiztayyab.rauf@manchester.ac.uk
Andre Freitas (0000-0002-4430-4837), Department of Computer Science, University of Manchester, Manchester, UK, M13 9PL; IDIAP Research Institute, Martigny, Switzerland. andre.freitas@manchester.ac.uk
Norman W. Paton (0000-0003-2008-6617), Department of Computer Science, University of Manchester, Manchester, UK, M13 9PL. norman.paton@manchester.ac.uk
Abstract.

Deep clustering (DC), a fusion of deep representation learning and clustering, has recently demonstrated positive results in data science, particularly in text processing and computer vision. However, joint optimization of feature learning and data distribution in the multi-dimensional space is domain-specific, so existing DC methods struggle to generalize to other application domains (such as data integration and cleaning). In data management tasks, where high-density embeddings and overlapping clusters dominate, a DC algorithm designed around these data properties is better placed to support data cleaning and integration. This paper presents a deep clustering algorithm for tabular data (TableDC) that reflects the properties of data management applications, particularly schema inference, entity resolution, and domain discovery. To address overlapping clusters, TableDC integrates the Mahalanobis distance, which accounts for variance and correlation within the data, offering a similarity method suitable for tables, rows, or columns in high-dimensional latent spaces. TableDC provides flexibility in the final clustering assignment and shows higher tolerance to outliers through its heavy-tailed Cauchy distribution as the similarity kernel. The proposed similarity measure is particularly beneficial where the embeddings of raw data are densely packed and exhibit high degrees of overlap. Data cleaning tasks may involve a large number of clusters, which affects the scalability of existing DC methods. TableDC's self-supervised module efficiently learns data embeddings with a large number of clusters, compared to existing benchmarks, which scale in quadratic time. We evaluated TableDC against several existing DC, Standard Clustering (SC), and state-of-the-art bespoke methods over benchmark datasets. TableDC consistently outperforms existing DC, SC, and bespoke methods.

PVLDB Reference Format:
PVLDB, 14(1): XXX-XXX, 2020.
doi:XX.XX/XXX.XX This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.
doi:XX.XX/XXX.XX

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/hafizrauf/TableDC.

1. Introduction

Deep Clustering (DC) combines the unsupervised grouping of related data items with the learning of clustering-friendly representations for the data to be clustered. DC uses deep neural network architectures with unsupervised clustering mechanisms to effectively classify multi-dimensional, unlabeled data. Early implementations primarily used autoencoders for dimensionality reduction, where the embeddings in the latent space were clustered using standard clustering (SC) algorithms such as K-means (Yang et al., 2017). Advancements in DC have led to more sophisticated architectures, based on variational autoencoders (Dilokthanakul et al., 2016), Generative Adversarial Networks (Mukherjee et al., 2019), and self-supervised learning paradigms (Alwassel et al., 2020). Recent DC methods focus on integrating representation learning and cluster assignment within end-to-end trainable models using graph-based neural networks (Liu et al., 2022; Tu et al., 2021; Bo et al., 2020) to capture relational structures within data.

DC has been employed successfully in several domains, particularly image processing (Zhou et al., 2022, 2022; Qian, 2023), which includes downstream tasks such as segmentation and object detection (Zhou et al., 2022), clustering of in and out-of-distribution noise in corrupted images (Albert et al., 2022), and unsupervised image alignment and clustering (Jin et al., 2023; Qian, 2023). Further typical applications of DC include community and anomaly detection (Emadi and Mazinani, 2018; Liu et al., 2020) and medical applications (Kart et al., 2021). These DC proposals excel in determining complex patterns within images and graphs, extracting features, and subsequently clustering similar images and communities based on those patterns. The successful application of DC in specific domains has led to the development of more specialized algorithms that reflect specific properties of the domain features, even if the algorithms are not explicitly domain-specific. For example, in (Chen et al., 2022a), the authors use domain-specific 3D geometric consistency to guide the learning of 2D image representation and solve several 2D-image-based downstream tasks.

In data management, and in particular data integration and cleaning, a variety of important problems make use of clustering, including schema inference (Kellou-Menouer and Kedad, 2015; Bonifati et al., 2022), entity resolution (Papadakis et al., 2020; Costa et al., 2010; Hassanzadeh et al., 2009) and domain discovery (Piai et al., 2023; Ota et al., 2020). We have investigated the application of existing DC algorithms to these three problems, with positive results; DC algorithms consistently outperform established SC algorithms for a variety of initial data representations for all of tables, columns and rows (Rauf et al., 2024). However, existing DC algorithms do not fully exhibit the properties required to cluster latent representations of table objects, which is needed to support the transfer to data cleaning and integration tasks. In our previous work (Rauf et al., 2024), we observed that existing DC algorithms often fall short when applied to data integration tasks, where the nature of the data is fundamentally different from that in image processing applications, with tables, rows or columns being compared in the latent space instead of images.

In this paper, we propose a new deep clustering algorithm for tabular data (TableDC) that reflects the properties of table embeddings and data management applications, specifically: (i) Data representations are typically dense, where features are closely packed together. As the dimensionality increases, it becomes more challenging for the models to interpret the relationships between features. High density in data reduces the differences in features and makes it difficult to differentiate between clusters effectively. The closeness of data points in the latent space leads to ambiguity, where a data point could reasonably belong to multiple clusters. (ii) Tabular data contains higher levels of semantic and syntactic variability than image data, leading to increased ambiguity and cluster overlap. Tables, columns and rows commonly involve the composition of different types of semantic objects (e.g., named entities, categories, numbers), each in their own way subject to ambiguity, leading to potentially overlapping embeddings when projected to high-dimensional latent spaces. In contrast, image data, while complex, is based on a single modality, where visual features transfer more smoothly across domains, naturally leading to better cluster separation. (iii) In a data management setting, data can have a large number of clusters compared to existing DC applications. In our previous work (Rauf et al., 2024), we observed that existing DC algorithms struggle with scalability when the data has a large number of clusters. This affects the DC method’s performance in terms of the high probability of misclassification, a higher computational burden, and cluster instability, where small changes in data and model parameters lead to significant changes in cluster assignments.

The contributions of this paper are:

  (1)

    The identification of features of data management applications that challenge the assumptions that underpin existing DC algorithms.

  (2)

    The proposal of a new DC algorithm, TableDC, that builds on the analysis at (1) to provide an algorithm that: (i) Maintains a balance between cluster compactness and separation during cluster center initialization to better assist TableDC training with robust cluster shapes. (ii) Integrates Mahalanobis distance, a covariance-aware distance that considers the data’s variance and covariance to handle dense features, unlike Euclidean distance, which treats all dimensions equally. (iii) Offers flexibility in overlapping clusters through its heavy-tailed Cauchy distribution as a similarity kernel, especially when the cluster boundaries are ambiguous. (iv) Scales in quasi-linear time under significant increases in the number of clusters.

  (3)

    The experimental evaluation of TableDC from (2) with both state-of-the-art SC and DC algorithms and recent bespoke algorithms on real-world datasets for schema inference, entity resolution and domain discovery, which shows that TableDC significantly outperforms existing techniques.

2. Related Work

This section reviews related work in areas of relevance to the development of a deep clustering algorithm for data management applications, specifically: (i) existing DC algorithms, their features and applications; and (ii) clustering in data management, focusing on the three applications on which we evaluate techniques and representations for tables, columns and rows.

2.1. Existing DC algorithms

The basic DC architecture consists of two phases: (i) representation learning and (ii) clustering. Joint optimization of both phases helps each to improve the other, iteratively raising the algorithm’s performance. A precursor to (i) and (ii) is (iii) cluster initialization, which contributes to the algorithm’s performance by providing the initial data shape. With better cluster initialization, a DC algorithm is less likely to get stuck in local minima and requires fewer iterations to converge.

Regarding (i), the most widely used architecture is the basic Auto-encoder (AE) (Yang et al., 2017), where the encoder function $f_{e}$ encodes an input representation $x_{i}$ into a latent representation given by $h_{i}=f_{e}(x_{i})=\frac{1}{1+e^{-(Wx_{i}+b_{i})}}$. Similarly, the decoder function $f_{d}$ decodes $h_{i}$ back into the reconstructed input $x^{\prime}_{i}=f_{d}(h_{i})$. Here, $W$ and $b_{i}$ represent the weights and biases of the neural network, respectively. The optimization function of the AE with $N$ samples is defined as $f_{min}=\min\frac{1}{N}\sum_{i=1}^{N}\|x_{i}-x^{\prime}_{i}\|^{2}$. Many recent DC proposals use generative models in representation learning, especially Variational Auto-Encoders (VAEs), which combine an AE architecture with probabilistic graphical models. In VAEs, the encoder maps the input data into a probabilistic distribution over the latent space while the decoder reconstructs the data from the latent probabilistic representation. Some recent proposals with VAEs in representation learning are DGG (Yang et al., 2019), LTVAE (Li et al., 2019), VaDE (Jiang et al., 2017) and MFCVAE (Falck et al., 2021).

To benefit from the local and global structure of data in representation learning, several recent DC proposals combine basic AE learning with a Graph Convolutional Network (GCN) to assist the model in learning the structural information. Some recent examples are DCRN (Liu et al., 2022), DFCN (Tu et al., 2021) and SDCN (Bo et al., 2020). Subspace representation learning is another recent technique that identifies subspaces that optimize cluster separation, unlike AEs and VAEs, which require additional clustering mechanisms. Subspace learning seeks to identify and represent underlying subspaces within high-dimensional data, while AEs and VAEs focus on reconstructing the input data as accurately as possible. The latest DC methods based on Subspace representation learning are EDESC (Cai et al., 2022), DMCAG (Cui et al., 2023), SAGSC (Fettal et al., 2023) and OMSC (Chen et al., 2022b).

DC is inherently more complex for the tabular data embeddings obtained by Large Language Models (LLMs) than for widely used sparse image or textual datasets like CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), ImageNet (Russakovsky et al., 2015), USPS (LeCun et al., 1990) and Reuters (Lewis et al., 2004), primarily due to differences in data structure and feature representation. Image datasets include spatial and visual coherence that DC methods can better differentiate, even when the resolution is low or the number of classes is high. DC has been used widely on complex and varied datasets such as ImageNet, where, despite the vast diversity, the visual patterns (extracted from different textures, shapes, objects, and backgrounds) retain a level of consistency.

On the other hand, the primary issue with data management datasets is the high density and significant overlap in the feature space of their embeddings; specifically, tabular data embeddings represent a complex relationship among different features, including categorical and numerical data.

In relation to (ii), clustering, the aim of existing work on DC is to use the low-dimensional representation from the representation learning module to obtain clustering assignment probabilities for soft clustering. In the related work, several clustering mechanisms are employed to work jointly with representation learning for optimal clustering. The most widely used is self-supervised clustering (Dizaji et al., 2017; Guo et al., 2017; Peng et al., 2017; Xie et al., 2016; Bo et al., 2020; Tu et al., 2021), which aims to group data points based on learned representations without explicit labels. The approach uses two distributions: the cluster assignment distribution $Q$, representing the probability that a data instance belongs to a particular cluster, and an auxiliary distribution $P$, a modified version of $Q$ that emphasizes more confident cluster assignments; training minimizes the Kullback-Leibler (KL) divergence between $Q$ and $P$. $Q$ is calculated using the distance between the data point $x_{i}$ and each cluster centroid $\mu_{k}$; the closer $x_{i}$ is to $\mu_{k}$, the higher the probability in $Q$. Self-supervised clustering focuses on high-confidence instances by squaring the probabilities in $P$, and prevents cluster dominance by normalizing the probabilities, ensuring balanced clustering assignments.
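As a concrete illustration of these two distributions, the following minimal NumPy sketch follows the commonly used DEC-style formulation (not TableDC's own kernel, which is introduced in Section 3): a Student's t soft assignment Q over latent points and centroids, and the sharpened auxiliary target P derived from it. Array shapes, names and the kernel parameter are illustrative assumptions.

import numpy as np

def soft_assignments(z, centroids, alpha=1.0):
    # z: (n, d) latent points; centroids: (k, d) cluster centers.
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # squared distances, shape (n, k)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)             # Student's t kernel
    return q / q.sum(axis=1, keepdims=True)                      # each row of Q sums to 1

def target_distribution(q):
    # Square Q to emphasize confident assignments, normalize by cluster
    # frequency to prevent large clusters from dominating, then re-normalize rows.
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

Training then minimizes KL(P || Q) while Q is recomputed from the evolving latent representation.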

Pseudo-labeling is another clustering technique recently utilized in DC (Niu et al., 2022; Zhou et al., 2022), where the model generates labels that are then used to fine-tune the model, effectively creating a feedback loop where the model iteratively refines its clustering assignment. The recent success of contrastive learning has led to its use in DC (Sun et al., 2023; Zhang et al., 2023; Liu et al., 2023; Ma and Kim, 2022), where the data points from low dimensional representations are paired in a way that the same cluster (positive pairs) is contrasted with the different clusters (negative pairs). Further improvements in contrastive deep clustering include the contrast comparison between instance-to-instance, instance-to-cluster, and cluster-to-cluster (Zhou et al., 2022).

TableDC employs a self-supervised clustering approach by comparing the encodings of tables, rows or columns within a latent space. The selection of an appropriate distance function is critical in this context, as it determines how the distances between each table, row or column and the cluster centroids in the latent space are measured. A simple distance metric like Euclidean distance is beneficial when dealing with data where features exhibit limited variance and correlation, as in image vectors or sparse data matrices (Bo et al., 2020; Tu et al., 2021). However, a more sophisticated approach is required in dense embeddings, where the data points show significant feature correlation. Here, the Mahalanobis distance is particularly advantageous (Mahalanobis, 2018). Its ability to account for feature correlation enables the model to discern even subtle differences among clusters, mainly when a cluster shows high similarity across multiple features. Mahalanobis distance is particularly relevant when comparing tables, columns or rows, as it allows for a more nuanced interpretation and grouping of data points based on their inherent relationships and similarities.

In relation to (iii), most existing works use K-means to initialize clusters, whether self-supervised or subspace clustering such as EDESC (Cai et al., 2022), DCRN (Liu et al., 2022), DFCN (Tu et al., 2021) and SDCN (Bo et al., 2020). The quality of the initial clusters significantly affects the final clusters. To our knowledge, there is no related work on different cluster initialization approaches in the DC environment. To address this, we compare different cluster initialization approaches in an ablation study in Section 5.1.

2.2. Clustering in data management

Clustering is important in data management for several data integration and cleaning tasks, including those discussed here.

Schema Inference identifies the raw data’s types, relationships and attributes to infer an overarching schema (Kellou-Menouer et al., 2022). Schema inference has been explored to infer a schema that fully describes some closely related datasets  (Baazizi et al., 2019; Bex et al., 2007) or to summarise structures that appear in more diverse datasets. A common first step in the latter case is the identification of recurring structural patterns, which can be used as types in an inferred schema. In this setting, a variety of SC algorithms have been used to group similar structures (e.g., (Bonifati et al., 2022; Kellou-Menouer and Kedad, 2015; Tsuboi and Suzuki, 2019)).

Entity Resolution is the well-explored problem of identifying two or more records that represent the same real-world object (Christophides et al., 2021). Several proposals use clustering, especially for entity matching and duplicate detection, for which empirical comparisons have been carried out (Hassanzadeh et al., 2009; Costa et al., 2010; Saeedi et al., 2018; Draisbach et al., 2020). Although many entity resolution proposals focus on pairwise comparisons, some proposals incorporate clustering because the transitive closure of pairwise similarity may not always lead to robust clusters (Fisher et al., 2015; Mandilaras et al., 2021).

Domain Discovery involves identifying columns that draw values from the same collection of values. Most existing works have used bespoke algorithms (Li et al., 2017; Ota et al., 2020; Piai et al., 2023). However, we cast domain discovery as a clustering problem, where the goal is to cluster a set of columns that share semantic types. RaF-STD (Piai et al., 2023) uses clustering to discover semantic types for heterogeneous sources. Similarly, D4 (Ota et al., 2020) provides a column-based clustering approach to discovering local and strong domains from a set of columns, while handling the challenge of incomplete columns. Building on language models, Starmie (Fan et al., 2023) provides column clustering based on the learned column representation through contrastive learning.

3. Proposed Model

We propose a clustering algorithm, TableDC, that benefits from representation learning in an end-to-end framework to perform several tabular data integration tasks, particularly schema inference, entity resolution, and domain discovery. TableDC considers features of embeddings explicitly in the latent space during its training, such as preserving correlation among highly dense data features through the Mahalanobis distance measure and using a heavy-tailed similarity kernel that provides the flexibility of cluster assignments when the cluster boundaries are ambiguous.

The representation learning architecture of TableDC is based on self-supervised learning (Bo et al., 2020). The self-supervised module adopts the Mahalanobis distance (Mahalanobis, 2018) between data points and their centroids, as it naturally accounts for the variance of different dimensions and is less sensitive to noise, ensuring that no particular feature dominates the distance measure due to its scale. Unlike the Euclidean distance used in existing self-supervised modules (Bo et al., 2020), which treats all dimensions of the data as orthogonal for clustering, the Mahalanobis distance captures the correlation between different dimensions by weighting the importance of each dimension, which can differ based on its semantics.

TableDC integrates an autoencoder for representation learning and a self-supervised module to optimize the representation for clustering. An autoencoder consists of two main parts: the encoder and the decoder. The encoder function compresses the input $x$ (an embedding matrix of tables, rows or columns) into a latent-space representation. Mathematically, it can be represented as:

(1) $h = f(W_{e} \times x + b_{e})$

where $W_{e}$ is the weight matrix, $b_{e}$ is the bias term, and $f$ is an activation function for the encoder, such as sigmoid or ReLU (Nair and Hinton, 2010). The result, $h$, is the encoded representation of the input $x$. The decoder function works to reconstruct $x$ from the internal representation. It aims to map the encoded data back from the latent space to the original space. The decoder can be represented as:

(2) $\hat{x} = g(W_{d} \times h + b_{d})$

where $h$ is the encoded representation from the encoder, $W_{d}$ is the weight matrix, $b_{d}$ is the bias term, and $g$ is an activation function for the decoder. The result, $\hat{x}$, is the reconstructed representation of the original input $x$. An overview of the TableDC framework is presented in Figure 1.

Refer to caption
Figure 1. Overview of TableDC framework. An autoencoder takes an embedding matrix as input, representing a set of rows for entity resolution, a set of columns for domain discovery, or a set of tables for schema inference. z is the latent representation obtained from the autoencoder, and c is initialized using Birch. In the self-supervised module, the m distribution is calculated from q by taking the Mahalanobis distance between z and c and setting the Cauchy distribution as a kernel to measure the similarity. p is the target distribution, which is also calculated from q.
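For reference, a minimal PyTorch sketch of the encoder/decoder pair in Equations (1)-(2) is shown below; the hidden width of 500 is an illustrative assumption, while the latent size of 100 follows the setting reported in Section 4.3.

import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, input_dim, latent_dim=100):
        super().__init__()
        # Eq. (1): h = f(W_e x + b_e), here with a ReLU non-linearity
        self.encoder = nn.Sequential(nn.Linear(input_dim, 500), nn.ReLU(),
                                     nn.Linear(500, latent_dim))
        # Eq. (2): x_hat = g(W_d h + b_d), mirroring the encoder
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 500), nn.ReLU(),
                                     nn.Linear(500, input_dim))

    def forward(self, x):
        z = self.encoder(x)      # latent representation (h, later called z)
        x_hat = self.decoder(z)  # reconstruction of the input embedding matrix
        return z, x_hat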

Assume we have a latent vector with $n$ data points $z = \{z_{1}, z_{2}, \dots, z_{n}\}$, where each $z_{i}$ represents a row, column or table depending on the data integration problem. We want to cluster the data points into $\mathbb{K}$ clusters with centers $c = \{c_{1}, c_{2}, \dots, c_{k}\}$. We create a covariance matrix $\Sigma$ to compute the Mahalanobis distance between the data points $z$ and the cluster centers $c$. Since we have noisy data (for example, columns with missing instances, similar tables with different unit measurements, and duplicate rows that affect the embeddings), the variance between different dimensions misleads the importance of the features. We aim to minimize the noise effect when calculating the covariance by scaling the identity matrix. The scaled identity matrix ensures that all features do not have the same importance to avoid the Euclidean effect and also handles multicollinearity in the data when the covariance matrix is singular. In the case of overlapping clusters, it avoids noisy correlations. The covariance matrix is created by scaling the identity matrix with a scale factor $\delta = 0.01$ to adjust the strictness of distance between data points:

(3) $\Sigma = \delta \cdot I$,

where $I$ is the identity matrix. We calculate the Cholesky decomposition of the covariance matrix $\Sigma$ to simplify its inversion. The Cholesky decomposition is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix $L$ and its conjugate transpose $L^{T}$. In our case, since the covariance matrix $\Sigma$ is real and symmetric, the Cholesky decomposition can be represented as:

(4) $C = LL^{T}$,

where $C$ represents the Cholesky decomposition of the covariance matrix $\Sigma$, $L$ is the lower triangular matrix with real and positive diagonal entries, and $L^{T}$ is the transpose of $L$. The matrix $L$ is computed using the Cholesky decomposition algorithm.

The Cholesky decomposition is used to solve a linear system to obtain the Mahalanobis distance, as it provides an efficient way to compute the inverse of the covariance matrix:

(5) $\Sigma^{-1} = (C)^{-1} = L^{-T}L^{-1}$.

By using the Cholesky decomposition, we can efficiently solve the linear system required to compute the Mahalanobis distance between the latent representations and the cluster centers.

(6) $D_{\text{M}}(z_{i}, c_{j}) = \sqrt{(z_{i}-c_{j})^{\text{T}}\,\Sigma^{-1}\,(z_{i}-c_{j})}$,

where $\Sigma$ is the covariance matrix.
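A minimal NumPy/SciPy sketch of Equations (3)-(6) is given below: build the scaled-identity covariance, factor it with the Cholesky decomposition, and obtain the Mahalanobis distances between the latent points z and the centers c via triangular solves rather than an explicit inverse. Shapes and names are illustrative assumptions.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def mahalanobis_distances(z, c, delta=0.01):
    # z: (n, d) latent points; c: (k, d) cluster centers.
    d = z.shape[1]
    sigma = delta * np.eye(d)                        # Eq. (3): Sigma = delta * I
    L = cholesky(sigma, lower=True)                  # Eq. (4): Sigma = L L^T
    diff = z[:, None, :] - c[None, :, :]             # pairwise differences, shape (n, k, d)
    # Solve L y = (z - c)^T, so that ||y||_2 = sqrt((z-c)^T Sigma^{-1} (z-c)), i.e. Eq. (6).
    y = solve_triangular(L, diff.reshape(-1, d).T, lower=True)
    return np.sqrt((y ** 2).sum(axis=0)).reshape(z.shape[0], c.shape[0])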

The Student’s t-distribution (van der Maaten and Hinton, 2008) is a common choice in existing DC applications as a kernel to measure the similarity between data points and their cluster centers. However, with a large degrees-of-freedom parameter, the Student’s t-distribution approaches a normal distribution and shows less tolerance to outliers and overlapping clusters. In data integration, where the data is more prone to overlaps, i.e., the same magnitude with different semantics, we instead integrate the Cauchy distribution (Cauchy et al., 1847), whose mean and variance are undefined, making it robust to outliers, as its shape is unaffected by them. We calculate the soft assignments $q$ for each data point to each cluster using the Cauchy distribution. Soft assignments $q$ can be calculated using the Mahalanobis distance:

(7) $q_{ij} = \frac{1}{1 + \frac{D_{\text{M}}(z_{i},c_{j})^{2}}{\gamma^{2}}}$,

where $\gamma$ is a hyperparameter for the Cauchy distribution. We then normalize $q$ to ensure that the sum of the soft assignments for each data point is 1:

(8) $q_{ij} = \frac{q_{ij}}{\sum_{j=1}^{k} q_{ij} + \epsilon}$,

where $\epsilon$ is a small value added to prevent division by zero. We then compute the predicted probabilities by applying the softmax function to $q$:

(9) $m_{ij} = \frac{e^{q_{ij}}}{\sum_{j=1}^{k} e^{q_{ij}}}$.
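The following minimal NumPy sketch strings Equations (7)-(9) together: the Cauchy-kernel soft assignments, their row normalization, and the softmax that yields the clustering probabilities m. The values of gamma and epsilon are illustrative assumptions.

import numpy as np

def clustering_probabilities(D_M, gamma=1.0, eps=1e-8):
    # D_M: (n, k) Mahalanobis distances between points and cluster centers.
    q = 1.0 / (1.0 + (D_M ** 2) / gamma ** 2)      # Eq. (7): Cauchy similarity kernel
    q = q / (q.sum(axis=1, keepdims=True) + eps)   # Eq. (8): soft assignments per row
    e = np.exp(q - q.max(axis=1, keepdims=True))   # numerically stable softmax
    m = e / e.sum(axis=1, keepdims=True)           # Eq. (9): clustering probabilities
    return q, m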

3.1. Loss Function

Since $m$ acts as clustering probabilities and can be further optimized, we integrate KL-divergence to refine the clustering probabilities $m$ to be more accurate and representative of the underlying data. KL-divergence provides a flexible way to guide the model toward the expected clustering assignments. We use the following objective function to optimize $m$:

(10) $ce_{loss} = \text{KL}_{\text{div}}(p\,\|\,m) = \sum_{i=1}^{n}\sum_{j=1}^{k} p_{ij} \cdot \log\frac{p_{ij}}{m_{ij}}$,
(11) $p_{ij} = \frac{q_{ij}^{2}}{\sum_{i=1}^{n} q_{ij}}$,

$ce_{loss}$ is the clustering loss, where $p_{ij}$ is the target distribution, calculated by adjusting the soft assignments $q$. This step aims to make the soft assignments more robust during the clustering process. We use the following reconstruction loss ($re_{loss}$) to maintain the inherent structure and essential features in a compressed form:

(12) $re_{loss} = \frac{1}{n}\sum_{i=1}^{n}(x_{i} - \bar{x}_{i})^{2}$,

where $\bar{x}$ is the reconstructed input. $ce_{loss}$ and $re_{loss}$ work together to achieve effective clustering while maintaining the quality of the data reconstruction. The total loss is given by:

(13) $loss = \alpha \cdot ce_{loss} + re_{loss}$,

where $\alpha$ is a weighting factor that balances the contribution of the clustering loss ($ce_{loss}$) and the reconstruction loss ($re_{loss}$) in the total loss. Combining $ce_{loss}$ and $re_{loss}$ enables the autoencoder to learn meaningful latent representations that can be effectively clustered while preserving the structure and information in the input data. We show the pseudo-code for the training of TableDC in Algorithm 1.

Algorithm 1 Pseudo-code for training of TableDC
0:  $x$: representation of rows, columns or tables; $\mathbb{K}$: number of clusters; $maxiter$: maximum number of iterations
0:  $m$: clustering probabilities
1:  Initialize $W_{e}, b_{e}, W_{d}, b_{d}$ by pre-training the autoencoder
2:  Initialize cluster centers $c$ using the Birch algorithm
3:  for $i=1$ to $maxiter$ do
4:     Generate latent representation $z$ from the autoencoder using (1)
5:     Compute the inverse of the covariance matrix $\Sigma^{-1}$ using the Cholesky decomposition from (5)
6:     Calculate the Mahalanobis distance $D_{\text{M}}$ between $z$ and $c$ using (6)
7:     Apply the Cauchy distribution to $D_{\text{M}}$ to get soft assignments $q$ from (7)
8:     Normalize $q$ and apply softmax to get predicted probabilities $m$
9:     Calculate target distribution $p$ using (11)
10:     Calculate $ce_{loss}$ and $re_{loss}$ from (10) and (12), respectively
11:     Back-propagate and update parameters in TableDC
12:  end for
13:  return  Updated $m$
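As a complement to Algorithm 1, the sketch below shows one way its training loop could be realized in PyTorch, reusing an AE module like the one sketched earlier and exploiting the fact that, with Sigma = delta * I, the squared Mahalanobis distance reduces to the squared Euclidean distance divided by delta. Treating the cluster centers as trainable parameters, the value of gamma, the learning rate, and the row normalization of p are assumptions; alpha = 0.9 and delta = 0.01 follow the paper.

import torch
import torch.nn.functional as F

def train_tabledc(model, x, centers, max_iter=200, gamma=1.0, delta=0.01, alpha=0.9, lr=1e-3):
    # x: (n, d_in) embedding matrix; centers: (k, d_latent) Birch initialization (Algorithm 2).
    centers = torch.nn.Parameter(centers.clone())
    opt = torch.optim.Adam(list(model.parameters()) + [centers], lr=lr)
    for _ in range(max_iter):
        z, x_hat = model(x)                                                  # step 4
        d2 = ((z.unsqueeze(1) - centers.unsqueeze(0)) ** 2).sum(-1) / delta  # steps 5-6
        q = 1.0 / (1.0 + d2 / gamma ** 2)                       # step 7: Cauchy kernel, Eq. (7)
        q = q / (q.sum(1, keepdim=True) + 1e-8)                 # step 8: normalize, Eq. (8)
        m = F.softmax(q, dim=1)                                 # step 8: softmax, Eq. (9)
        p = (q ** 2 / q.sum(0)).detach()                        # step 9: target, Eq. (11)
        p = p / p.sum(1, keepdim=True)                          # row-normalized so KL compares distributions
        ce_loss = F.kl_div(m.log(), p, reduction='batchmean')   # Eq. (10)
        re_loss = F.mse_loss(x_hat, x)                          # Eq. (12)
        loss = alpha * ce_loss + re_loss                        # Eq. (13)
        opt.zero_grad()                                         # step 11: update parameters
        loss.backward()
        opt.step()
    return m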

The reconstruction loss, $re_{loss}$ (Equation 12), of TableDC consistently decreases over epochs during training (see ablation study in Section 5.2), which shows that TableDC is learning a meaningful representation of the data, capturing essential features and structures. In $ce_{loss}$ (Equation 10), $m$ is a softmax-ed $q$ that consistently diverges from the target distribution $p$ (see ablation study in Section 5.2). After applying softmax, $q$ becomes a probability distribution $m$ over clusters. Each row of the probability matrix sums to 1, representing the probability of each point belonging to each cluster. The softmax function emphasizes the significant values in the probability matrix $m$ while suppressing smaller values, making TableDC's soft assignments sharper. The choice of the hyperparameter $\alpha$ allows for fine-tuning the trade-off between clustering performance and reconstruction quality. We use $\alpha = 0.9$ (Equation 13), so TableDC gives more importance to the clustering loss, $ce_{loss}$ (Equation 10), than to $re_{loss}$. We determined the weight $\alpha = 0.9$ for $ce_{loss}$ through empirical evaluation.

The presence of noise, or near-equal probabilities towards more than one cluster, can push the model towards more confident cluster assignments despite inherent ambiguities. $p$ is artificially derived from $q$ by re-normalizing, making high-probability assignments higher and diminishing low-probability assignments. However, the high divergence of $m$ from the sharpened $p$ shows that TableDC is more confident with the actual soft assignments $q$ becoming sharper over time. Moreover, the data points in the target $p$ need more intra-cluster distance instead of high cohesion. A consistent increase in the divergence of $m$ from $p$ indicates that TableDC's internal clustering, as represented by $m$, is moving in a direction that better aligns with the true data structure. $p$ is an auxiliary target to guide the clustering process and does not represent the ground truth; in the presence of overlapping clusters, divergence from $p$ is not inherently poor. Considering this phenomenon, we can treat the KL-divergence as a maximization problem, where we prefer $m$ to have maximum divergence from $p$, defined as a sequence of KL-divergences over $N$ iterations: $\{D_{KL}(p\parallel m^{1}), D_{KL}(p\parallel m^{2}), \ldots, D_{KL}(p\parallel m^{N})\}$, where $D_{KL}(p\parallel m^{k})$ represents the KL-divergence between $p$ and $m$ at the $k^{th}$ iteration. The KL-divergence increases over $N$ if $D_{KL}(p\parallel m^{k+1}) > D_{KL}(p\parallel m^{k})$ for all $k$ in $\{1, 2, \ldots, N-1\}$.

3.2. Initialisation

In the training phase of DC algorithms, the quality of the initial centroids can directly affect the quality of the final clusters. Poor initialization can lead to poorly formed clusters, such as unevenly sized, too small, or too large. Existing DC methods (Bo et al., 2020; Liu et al., 2022; Tu et al., 2021) use K-means to initialize cluster centers $c$. However, in data integration problems dealing with highly dense and overlapping embedding representations, K-means may not be a suitable initialization (see ablation study in Section 5) for the following reasons: (i) In densely packed and overlapping data dimensions, the Euclidean distance between $c$ and $z$ becomes uniform, which leads to poor cluster initialization. (ii) In a dense data space, many sub-optimal solutions exist, and K-means is prone to getting stuck in local minima. (iii) K-means initialization is based on hard assignment, where the centers are randomly initialized. However, dense data with overlapping clusters requires soft assignments based on probabilities. K-means initialization is often considered suitable in existing DC proposals due to the nature of the data, which is mainly sparse or semi-dense, particularly in image representations. The Euclidean distance between different points becomes more distinct in a sparse data space, resulting in well-separated clusters. This distinct separation enhances the effectiveness of K-means initialization in these contexts, facilitating the clustering process by providing a clear starting point for the algorithm.

In TableDC, we integrate the Birch algorithm (Zhang et al., 1996) to initialize the clusters. Birch is a traditional hierarchical clustering algorithm that handles large and noisy databases. Birch uses a Clustering Feature tree (CF-tree) to process each data point. A CF-tree is a set of nodes, and each node consists of a triplet that efficiently stores information on the data points of a cluster in a compressed format instead of processing the complete dataset. The triplet includes the number of data points per cluster, their linear sum, and their squared sum. In a dense space, each node of the CF-tree represents aggregated properties of the data points instead of the distance between data points, which enables Birch initialization to avoid close proximity and high overlaps. Since Birch is hierarchical, the multiple levels of the CF-tree can effectively capture different granularities of clusters even if overlaps exist at one level. We compare different cluster center initialization techniques in the ablation study (see Section 5.1). The pseudo-code for the initialization of $c$ using Birch is given in Algorithm 2.

Algorithm 2 Pseudo-code for initialization of c using Birch
0:  z: data points; K: number of clusters; T: threshold; B: branching factor (the maximum number of child nodes); L: The maximum number of entries in a leaf node
0:  c: cluster centers and cluster assignments for each z
1:  Create an empty CF-tree with T, B, and L.
2:  Initialize Birch clustering with K
3:  for each z in the dataset do
4:     Traverse the CF-tree to find a suitable corresponding leaf node
5:     Update the CF-tree in the path from the root to the last leaf node at each level
6:     if the value of the leaf node > T then
7:        Split the leaf node
8:     end if
9:  end for
10:  Compute clusters k and assign a cluster index to each z
11:  for each k do
12:     Compute c as a mean of z assigned to k
13:     Add c to the list of cluster centers
14:  end for
15:  return  $c$
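A minimal scikit-learn sketch of this initialization step is shown below: a Birch CF-tree is fit over the latent points z, and the initial center of each cluster is taken as the mean of its assigned points, as in Algorithm 2. The threshold and branching factor values are illustrative assumptions.

import numpy as np
from sklearn.cluster import Birch

def birch_centers(z, n_clusters, threshold=0.5, branching_factor=50):
    # z: (n, d) latent representations from the pre-trained autoencoder.
    birch = Birch(n_clusters=n_clusters, threshold=threshold,
                  branching_factor=branching_factor)
    labels = birch.fit_predict(z)
    # Initial center c_k = mean of the latent points assigned to cluster k.
    return np.vstack([z[labels == k].mean(axis=0) for k in range(n_clusters)])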

4. Evaluation

This section evaluates TableDC using existing benchmarks with a variety of different embedding approaches, in comparison with other clustering algorithms and bespoke solutions.

4.1. Experimental setup

4.1.1. Datasets

TableDC is evaluated on six benchmark datasets widely adopted for data integration problems, specifically schema inference, domain discovery and entity resolution. A brief description and statistical detail of each data set are given below and in Table 1.

  • T2D Entity-Level Gold standard (T2D) (Ritze and Bizer, 2017) is a widely used collection of web tables with mappings to their corresponding DBpedia class. Regarding schema inference, T2D can be used to align the schema of web tables (structure and format) with the schema used in DBpedia. In T2D, we excluded all tables categorized under the DBpedia class ’Thing’. This is because the ’Thing’ category is mapped to over 50% of the web tables, leading to a significant risk of data imbalance.

  • The Table Union Search (TUS) benchmark (Nargesian et al., 2018) refers to finding unionable tables for a given query table. However, for a clustering application, the problem can be recast so that each cluster contains unionable tables.

  • MusicBrainz (Saeedi et al., 2017) dataset is derived from authentic records of songs sourced from the MusicBrainz database. It combines the data from five distinct sources, where approximately 50% of the original song records are duplicated across two to five sources.

  • Geographic Settlements (GeoSet) (Saeedi et al., 2017) represents a collection of geographical real-world entities from four different sources (Geonames, Freebase, DBpedia, and NYTimes).

  • Di2KG (Camera and Monitor) (http://di2kg.inf.uniroma3.it/datasets.html) contains camera and monitor web records scraped from multiple e-commerce web pages. Both datasets are highly heterogeneous. For example, the specification lcd display from www.price-hunt.com and monitor from www.cambuy.com.au semantically represent the same domain despite having different syntactic structures.

We use T2D and TUS for schema inference, MusicBrainz and GeoSet for entity resolution, and Di2KG (Camera and Monitor) datasets for domain discovery.

Table 1. Statistics of benchmark datasets used to evaluate TableDC for schema inference, entity resolution and domain discovery.
Granularity  Dataset       Instances  Clusters
Tables       Web tables    429        26
Tables       TUS           4248       37
Rows         Music Brainz  2002       684
Rows         GeoSet        3021       786
Columns      Camera        19036      56
Columns      Monitor       34481      81

4.1.2. Benchmark Methods

We compared TableDC222https://github.com/hafizrauf/TableDC with three SC and five DC clustering methods. The selected SC methods represent different paradigms. Similarly, the chosen DC methods represent different approaches to representation learning (self-supervised and subspace). A short description of each benchmark method is given below:

  • K-means  (Hartigan and Wong, 1979) is a classical clustering algorithm; it assigns instances to the nearest centroid to create clusters. Centroids are updated based on the assigned points.

  • DBSCAN (Ester et al., 1996) identifies distinct clusters based on the data density, considering the number of neighboring points within a specified radius. DBSCAN is particularly good with data having variation in densities.

  • Birch  (Zhang et al., 1996) is a hierarchical clustering algorithm built on a CF-tree to summarize the information (on each tree node) needed to cluster instances.

  • SHGP (Yang et al., 2022) is a self-supervised method that shares an attention-aggregation mechanism between two modules, Att-LPA to produce pseudo-labels and Att-HGNN, to learn object embeddings. Both modules effectively improve each other to optimize embeddings.

  • DCRN (Liu et al., 2022) learns representations through AE and GCN by reducing the information correlation to improve the discriminative property of the resulting features.

  • DFCN (Tu et al., 2021) focuses on the fusion of representation learning of AE and graph neural networks. It works on a dynamic fusion mechanism to refine the graph neural network structure to better integrate with AE representation learning.

  • EDESC (Cai et al., 2022) is a subspace representation-based learning method that iteratively refines the bases of the subspace derived from deep representations.

  • SDCN (Bo et al., 2020) effectively utilizes the structural information in the AE representation learning. The dual self-supervised mechanism effectively updates the model parameters to produce clustering-specific representations.

Table 2. Comparing schema inference clustering results (TableDC vs. existing DC methods). The best results are shown in bold. Representations marked with * are obtained on instance-level data and TT* refers to TabTransformer.
Method TUS Web tables
SBERT FastText TT* SBERT* SBERT USE TT* SBERT*
ARI ACC ARI ACC ARI ACC ARI ACC ARI ACC ARI ACC ARI ACC ARI ACC
K-means 0.73 0.79 0.63 0.69 0.22 0.35 0.46 0.56 0.27 0.45 0.19 0.40 -0.013 0.27 0.32 0.51
DBSCAN 0.17 0.47 0.01 0.25 0.02 0.26 0.09 0.38 0.0 0.29 0.0 0.29 -0.007 0.26 0.0 0.29
Birch 0.22 0.40 0.0 0.20 0.18 0.35 0.54 0.58 0.33 0.49 0.22 0.40 -0.013 0.27 0.46 0.61
SHGP 0.61 0.72 0.43 0.56 0.05 0.19 0.13 0.29 0.08 0.29 0.07 0.28 -0.0031 0.27 0.09 0.30
DCRN 0.50 0.59 \infty \infty 0.018 0.24 0.15 0.26 0.07 0.25 0.03 0.25 -0.02 0.25 0.17 0.36
DFCN 0.55 0.65 0.43 0.52 0.20 0.37 0.34 0.43 0.24 0.40 0.19 0.33 -0.0006 0.24 0.37 0.48
EDESC 0.28 0.35 0.22 0.32 0.14 0.30 0.36 0.43 0.26 0.44 0.22 0.37 0.08 0.25 -0.04 0.26
SDCN 0.60 0.57 0.53 0.47 0.25 0.40 0.45 0.49 0.20 0.34 0.08 0.33 0.0 0.29 0.30 0.42
TableDC 0.88 0.87 0.73 0.76 0.24 0.37 0.63 0.66 0.62 0.65 0.25 0.42 -0.006 0.26 0.61 0.65

4.1.3. Representations

Each data integration task has an associated embedding model for representing tables, rows or columns which is integrated into the DC method for clustering. We use six different embedding models, each of which is specific to one or more data integration tasks; details are given below:

Schema Inference: We use pre-trained embeddings to get the representations of raw data. We categorize these into two categories. When we consider headers, we use SBERT (Reimers and Gurevych, 2019), FastText (Grave et al., 2018) and USE (Cer et al., 2018); with instances, we use TabTransformer and SBERT. SBERT uses siamese and triplet network structures to extract semantically meaningful embeddings from sentences. FastText is specialized in extracting morphological information and considers subwords, representing each word as a bag of character n-grams. Like SBERT, USE is also a sentence encoder specializing in capturing the semantic meanings of sentences. Considering the instances along with the header is a trickier task and requires more robust transformers. We fine-tuned TabTransformer (Huang et al., 2020) on both datasets to benefit from the categorical and continuous features in the dataset. We also encode instance data with SBERT; we convert each row of the table into a sentence sequence, where each row is represented as a sequence of its cell values appended with the [SEP] token for SBERT to separate segments and maintain the row boundaries in the text. [SEP] carries the table structure within the BERT model in encoded form.
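A minimal sketch of this instance-level SBERT encoding is shown below: each row is serialized as a [SEP]-joined sequence of its cell values and encoded with a sentence-transformers model. The specific model name is an illustrative assumption, not necessarily the checkpoint used in the paper.

from sentence_transformers import SentenceTransformer

def encode_rows(rows, model_name="all-mpnet-base-v2"):
    # rows: list of rows, each row a list of cell values.
    model = SentenceTransformer(model_name)
    sentences = [" [SEP] ".join(str(v) for v in row) for row in rows]
    return model.encode(sentences)  # (num_rows, embedding_dim) matrix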

Entity Resolution: We train EmbDi (Cappuzzo et al., 2020) on both datasets to embed rows and use pre-trained SBERT. EmbDi is a graph-based representation method based on tripartite connectivity within a graph of table elements to extract structural information. A tripartite connection contains a column node (attribute representation), a value node (unique value representation), and a row node (unique tuple token).

Domain Discovery: Similar to schema inference, we embed columns in two categories, with and without column headers. We used SBERT and T5 (Raffel et al., 2020) pre-trained embeddings for both purposes. When considering instances, we embed column values using SBERT and T5 and aggregate the individual embeddings to get the final column embeddings. T5 is a widely used language model developed by Google for information retrieval tasks and generating embeddings, specifically capturing semantic information within data.

4.2. Metrics

We adopted two widely used standard clustering evaluation metrics, Accuracy (ACC) (Yang et al., 2010) and the Adjusted Rand Index (ARI) (Wu et al., 2019), to test the performance of TableDC and existing benchmark methods. ARI ranges from -1 to 1, where 1 indicates perfect matching, 0 indicates random labeling, and negative values indicate worse-than-random labeling. ACC ranges from 0 to 1.
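For reproducibility, a minimal sketch of how these two metrics are typically computed is given below: ARI directly from scikit-learn, and ACC via the usual Hungarian matching between predicted cluster indices and ground-truth labels.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    # y_true, y_pred: integer labels of the same length.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_pred.max(), y_true.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                      # contingency counts
    row, col = linear_sum_assignment(-cost)  # best one-to-one cluster-to-label mapping
    return cost[row, col].sum() / y_true.size

# Usage: adjusted_rand_score(y_true, y_pred) for ARI, clustering_accuracy(y_true, y_pred) for ACC.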

4.3. Parameter setting

TableDC consists of four AE layers and is trained using the Adam optimizer. The loss function combines KL-divergence (to minimize the Mahalanobis distance in self-supervised learning) and mean squared error loss for reconstruction. The latent space size is fixed at 100. To be consistent with the existing DC benchmarks, we pre-trained TableDC with the AE for 30 epochs for schema inference and domain discovery. Pretraining provides a good starting point for the network parameters, avoiding random initialization and local minima during the clustering phase, which leads to faster convergence. Since entity resolution involves many clusters, the CF-tree in Birch initialization needs more nodes with detailed cluster summaries, each representing a smaller and more defined cluster; we therefore pre-trained TableDC for 100 epochs for entity resolution. Training epochs are fixed to 200 for schema inference, 100 for domain discovery, and 50 for entity resolution. We run all existing benchmark methods for the same number of epochs adopted for TableDC, excluding the internal parameters, for which we use the originally published values for each existing method, such as the number and size of each layer. We tuned the threshold T (the maximum radius of a cluster in the CF-tree) when initializing the centroids in TableDC for those experiments where the number of subclusters found on each CF-tree node is too small to summarize the cluster information. We adopted a grid search on the threshold T, dividing the number of sub-clusters until the CF-tree becomes stable. Some existing methods need K-means for centroid initialization and clustering assignments; to avoid extreme cases, we initialize 20 times and choose the best solution.

4.4. Schema inference clustering results

Schema inference as a clustering problem can be defined as the clustering of tables that share a similar schema. Table 2 shows the clustering results of schema inference on Table Union Search (TUS) and web tables datasets encoded with several representations.

We observe the following: (i) TableDC obtained the best results over both datasets when considering schema-level data (SBERT, FastText, USE) compared to all benchmark methods, including SC. TableDC improves the clustering performance (with SBERT) by 0.28 and 0.27 ARI and 0.30 and 0.15 ACC, compared to SDCN and SHGP, respectively. (ii) TableDC exhibits good but not always best performance on instance-level data (TabTransformer, SBERT) and was outperformed by SDCN by a small margin (0.01 ARI and 0.03 ACC) on the TUS dataset and by EDESC on the web tables dataset using TabTransformer. However, TableDC outperformed the other methods with SBERT, improving clustering performance by 0.31 average ARI and 0.23 average ACC on the TUS dataset. (iii) TableDC handles overlapping clusters due to its heavy-tailed Cauchy distribution as a similarity kernel. For example, TableDC assigns two tables that share similar schemas, RadioStation.(Radio, Country, Language) and Country.(Country, Language, Broadcasters), with a 0.79 cosine similarity score in the representation space, to two different clusters, compared to SDCN and DFCN, which failed to handle the high overlap and put them in one cluster, leading to misclassification. (iv) TableDC provides effective distance measures in a dense space through the Mahalanobis distance measure when calculating the distance between data points. Two instances from different GT clusters are closely packed in the latent space when the distance measured is Euclidean, compared to the Mahalanobis distance, which helps the model by placing both instances further apart. For example, the Mahalanobis distance between different attributes of two tables, Book.(rank, title, genre, creator, date, freq) and Film.(rank, title, year, director, overall rank), is 0.99 compared to the Euclidean distance, which is 0.71, indicating that during self-supervision in TableDC, both instances are further apart despite having highly overlapping values. Experimentally, TableDC correctly placed the two instances in different clusters, whereas EDESC and DCRN misclassified them into one cluster due to a low Euclidean distance score. (v) TableDC focuses on semantic similarity more than distributional frequency. For example, when considering instances on TUS with SBERT, attributes (Day, Month) with different frequencies (i.e., the frequency of each day) present in a column should be in the same clusters, ignoring the syntactic properties or the count of a particular day repeated in a column. However, SDCN and DCRN consider the syntactic similarity and the frequency (Tuesday and Wednesday are the highest counts in both tables) and incorrectly place the two tables in different clusters, in contrast with TableDC.

Table 3. Comparing entity resolution clustering results (TableDC vs. existing DC methods). The best results are shown in bold.
Music Brainz GeoSet
Method SBERT EmbDi SBERT EmbDi
ARI ACC ARI ACC ARI ACC ARI ACC
K-means 0.40 0.68 0.39 0.64 0.57 0.83 0.56 0.71
DBSCAN 0.0 0.002 N/A N/A 0.0 0.001 0.0 0.001
Birch 0.56 0.76 0.41 0.67 0.56 0.75 0.59 0.71
SHGP 0.15 0.47 0.20 0.51 0.52 0.77 0.43 0.63
DFCN 0.03 0.31 0.03 0.33 0.34 0.59 0.37 0.57
EDESC -0.001 0.03 0.03 0.29 0.30 0.73 0.25 0.52
SDCN 0.002 0.13 0.06 0.47 0.05 0.66 0.52 0.66
TableDC 0.80 0.88 0.51 0.71 0.65 0.86 0.60 0.71
Table 4. Comparing domain discovery clustering results (TableDC vs. existing DC methods). The best results are shown in bold.
Method Camera Monitor
SBERT SBERT* T5* SBERT SBERT* T5*
ARI ACC ARI ACC ARI ACC ARI ACC ARI ACC ARI ACC
K-means 0.74 0.70 0.49 0.56 0.49 0.56 0.59 0.58 0.54 0.53 0.52 0.53
DBSCAN 0.73 0.69 -0.005 0.25 -0.005 0.25 0.27 0.50 0.007 0.22 0.006 0.22
Birch 0.76 0.70 0.78 0.74 0.69 0.68 0.55 0.55 0.60 0.61 0.53 0.53
SHGP 0.65 0.65 0.47 0.56 0.41 0.51 0.59 0.60 0.48 0.52 0.46 0.49
DCRN 0.60 0.60 0.41 0.44 N/A N/A N/A N/A N/A N/A N/A N/A
DFCN 0.77 0.72 0.64 0.68 0.57 0.61 0.59 0.59 0.53 0.54 0.50 0.50
EDESC 0.41 0.48 0.57 0.59 0.46 0.54 0.32 0.42 0.48 0.50 0.46 0.49
SDCN 0.68 0.67 0.68 0.67 0.66 0.63 0.47 0.52 0.52 0.54 0.55 0.54
TableDC 0.80 0.72 0.82 0.75 0.77 0.74 0.64 0.65 0.63 0.61 0.58 0.57

4.5. Entity resolution clustering results

Entity resolution as a clustering problem can be defined as clustering multiple records that refer to the same real-world entity. Table 3 shows the clustering results of entity resolution on GeoSet and Music Brainz datasets encoded with several representations.

We observe the following: (i) The best results are with TableDC on both datasets, outperforming all other methods by 0.63 average ARI and 0.53 average ACC on Music Brainz and 0.31 average ARI and 0.24 average ACC on GeoSet using SBERT. (ii) TableDC performed well with SBERT compared to EmbDi, delivering good semantic similarity performance among different records of the same real-world entity. TableDC is better with SBERT than EmbDi by 0.20 ARI and 0.17 ACC with Music Brainz, while 0.05 ARI and 0.15 ACC with GeoSet. (iii) TableDC more effectively adopted the semantic mapping by SBERT than SHGP and DFCN by clustering contextually similar records. For example, two records of geographical locations (name: Manchester (England), longitude: -2.23743, latitude: 53.4809) from data.nytimes.com and (name: manchester united kingdom, longitude: -2.24, latitude: 53.48) from rdf.freebase.com have a low syntactic similarity (pairwise Euclidean distance) of 0.47 and high cosine similarity (pairwise cosine distance) of 0.86. SHGP and DFCN misclassified both instances and grouped them into different clusters compared to TableDC, which forms one cluster with contextually similar records. (iv) TableDC with EmbDi on GeoSet created fewer unary clusters (50) than SHGP (101) and EDESC (90), indicating that TableDC effectively balances cluster purity and cluster size. A higher ARI score with fewer unary clusters implies that TableDC is better at avoiding unnecessary fragmentation of the rows into too many small clusters. (v) Like schema inference, in entity resolution, existing clustering methods failed to form clusters based on the context of a particular text. For example, two different songs with some similar values of attributes are clustered together by SDCN and SHGP with EmbDi, i.e. (title: Jabberwock, length: 292706, year: 2010, language: French) and (title: Bubble Star, length: 244346, year: 2004, language: French). Both records share the same language and numeric values with a similar range encoded by EmbDi despite having different titles. TableDC managed to put them in different clusters based on the context of both records despite them having several similar attribute values.

4.6. Domain discovery clustering results

Domain discovery as a clustering problem can be defined as a clustering of the columns from a collection of datasets that refer to the same domain. Table 4 shows the clustering results of domain discovery on Camera and Monitor datasets encoded with several representations.

We observe the following: (i) TableDC obtained the best results with SBERT and T5 when considering column values of the Camera dataset, outperforming other methods by an average of 0.32 and 0.18 with SBERT and 0.30 and 0.20 with T5 on ACC and ARI, respectively. (ii) TableDC also led on the Monitor dataset by an average of 0.18 and 0.11 with SBERT and an average of 0.14 and 0.09 with T5 on ARI and ACC, respectively. (iii) TableDC shows a superior capacity for learning and interpreting the column’s context in the latent space compared to other methods. TableDC’s ability to learn and utilize complex patterns shows the better integration of the embeddings obtained using T5 on the Camera dataset. For example, the cosine similarity of the T5 vectors of two columns (image sensor: CMOS) and (optical sensor: CMOS) is 0.61, and TableDC learns the context and assigns both columns to the same cluster, compared to SDCN and DFCN, which incorrectly assign them to different clusters. (iv) The clustering performance of existing methods is affected by the length of columns compared to TableDC, which handles the columns uniformly regardless of the length. For example, two columns of the Camera dataset with different lengths (camera color: silver gray/ black/ red/ silver) and (color: black) are categorized in different clusters by DCRN with SBERT; however, TableDC overrides the column structure and correctly puts them together in the same cluster. In the experimental evaluation, some experiments with DCRN have not managed to scale on available resources and are reported as N/A.

4.7. Comparison with bespoke solutions

Previous experiments have compared TableDC with other SC and DC algorithms, using the same representations. This section complements such results by comparing TableDC with existing state-of-the-art solutions.

4.7.1. Schema Inference

Although there are many proposals for schema inference techniques, as surveyed in (Cebiric et al., 2019; Kellou-Menouer et al., 2022), few of these are for tabular data. Thus the comparators in this section use state-of-the-art techniques for table similarity along with standard clustering algorithms. Specifically for table similarity, we use:

  • $D^{3}L$ (Bogatu et al., 2020), which combines the results of several largely syntactic methods for comparing column headers and values into an overall table similarity score. $D^{3}L$ is used with K-means for clustering because it exhibits the best results compared to Birch and DBSCAN.

  • Starmie (Fan et al., 2023), which uses self-supervised contrastive learning to fine-tune RoBERTa (Liu et al., 2019) language models for column similarity; table similarity is then supported by combining these column similarities. Starmie is used with the connected component algorithm (Fan et al., 2023) as in the original paper.

4.7.2. Entity Resolution

There are many proposals for entity resolution techniques, as surveyed in (Christophides et al., 2021; Elmagarmid et al., 2007), though many of these consider pairwise similarity rather than producing clusters. Thus we specifically select a framework that can produce clusters of instances, namely JedAI  (Papadakis et al., 2020; Mandilaras et al., 2021), which provides several workflows for entity resolution. Within JedAI, we use the schema agnostic workflow, as TableDC is also schema agnostic; in essence, both techniques can be used to look for duplicates in tables with different structures. JedAI workflows can also be configured to support different comparison metrics, and results are reported for Jaccard, Cosine and Dice.

4.7.3. Domain Discovery

There are relatively few fully unsupervised proposals for Domain Discovery. Here we use one bespoke method, namely D4 (Ota et al., 2020), which identifies domains by considering overlaps between column instances, and, as in schema inference, we also use Starmie (Fan et al., 2023) along with an SC algorithm. The comparison with Starmie thus contrasts a language model that has been fine-tuned for similarity with a language model in TableDC that is fine-tuned for clustering.

4.7.4. Observations

For each problem, we compare against two proposals that use techniques published in top outlets (ICDE, PVLDB, Information Systems) since 2020. The results of the comparison of TableDC with bespoke solutions are presented in Figure 2, which reports the ARI and ACC on two datasets for each of schema inference, entity resolution and domain discovery. We can observe that, for schema inference, TableDC consistently outperforms D3L and Starmie on web tables. For TUS, D3L obtains 0.06 higher ACC than TableDC, while TableDC is still ahead by 0.07 ARI. This indicates that TableDC is good at identifying the overall clustering structure, whereas D3L is better at assigning the exact labels correctly. TableDC also achieves good results on entity resolution compared with JedAI under different similarity measures; JedAI with cosine distance is second best, with 0.58 ARI and 0.69 ACC on the Music Brainz dataset. Lastly, TableDC outperforms D4 (0.29 ARI and 0.27 ACC) in column clustering due to its ability to learn contextualized column embeddings effectively, providing more accurate clustering with 0.77 ARI and 0.74 ACC on the Camera dataset. Overall, in the 12 experiments, TableDC provides the best results in 11 cases, demonstrating that, though generic, TableDC compares well with recent, specialized state-of-the-art proposals.

(a) Schema Inference: ARI and ACC of D3L, Starmie and TableDC on TUS and web tables.
(b) Entity Resolution: ARI and ACC of JedAI (Cosine and Dice) and TableDC on GeoSet and Music Brainz.
(c) Domain Discovery: ARI and ACC of D4, Starmie and TableDC on Camera and Monitor.
Figure 2. TableDC vs. bespoke solutions for each problem. TableDC is integrated with SBERT in (a) and (b) and T5 in (c). In (b), Cosine and Dice are different similarity metrics of the JedAI framework (Papadakis et al., 2020).

4.8. Scalability

We observe that existing DC proposals struggle to scale to large numbers of clusters K (see Figure 3). To compare the scalability of different DC methods, we used Music Brainz, scaled up to large numbers of clusters, up to K = 2400. The times are reported for runs on an Nvidia A100 GPU with 80 GB of GPU RAM.

Most existing DC algorithms use a GCN over an adjacency matrix constructed to capture underlying relationships among different features. This leads to a sparse adjacency matrix in scenarios where the representation is not dense. The complexity of GCN-based representation learning, assuming a sparse adjacency matrix, is proportional to the number of vertices and edges. In data management applications, however, we often deal with dense representations with overlapping distances. For dense representations, the multiplication between an adjacency matrix A (of size N×N) and a feature matrix (of size N×D) becomes more computationally intensive as K and D increase, leading to greater complexity.
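
The contrast can be made concrete with a short sketch (arbitrary illustrative sizes N, D and K): a dense adjacency-by-feature product of the kind used in GCN-based methods against the point-to-centroid distance computation that TableDC relies on. The asymptotic costs in the comments are standard operation counts, not measured figures.

```python
# Sketch contrasting the dominant operations (arbitrary sizes; not measured benchmarks).
import numpy as np

N, D, K = 2000, 384, 500                      # points, embedding dimension, clusters (illustrative)
A = np.random.rand(N, N).astype(np.float32)   # dense adjacency matrix (GCN-style propagation)
X = np.random.rand(N, D).astype(np.float32)   # feature / embedding matrix
C = np.random.rand(K, D).astype(np.float32)   # cluster centroids

# GCN-style propagation: A @ X costs roughly O(N^2 * D) multiply-adds per layer when A is dense.
H = A @ X

# Point-to-centroid squared distances via ||x||^2 - 2 x.c + ||c||^2,
# costing roughly O(N * K * D) and avoiding an explicit (N, K, D) tensor.
sq_dist = (X ** 2).sum(1, keepdims=True) - 2 * X @ C.T + (C ** 2).sum(1)[None, :]
print(H.shape, sq_dist.shape)                 # (N, D), (N, K)
```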

Refer to caption
Figure 3. Scalability comparison of TableDC with existing DC approaches with respect to the number of clusters K. DFCN and DCRN are not included in the comparison because we were unable to run either on the available hardware resources with a high number of clusters.

In contrast, TableDC does not rely on a GCN to capture relationships but instead uses the Mahalanobis distance to determine correlations between different features. TableDC employs an autoencoder whose cost grows linearly with the number of data points N and the size of the hidden layer H. The self-supervised module, which computes differences between cluster centroids over dense representations, maintains a manageable computational load because the calculation only involves the distance from each data point to every cluster centroid. TableDC’s efficiency is further improved by using a Cholesky decomposition to compute the Mahalanobis distance: decomposing the covariance matrix allows distances over dense representations to be computed more quickly (Golub, 2013). Additionally, the Cauchy distribution kernel normalizes these distances and applies softmax operations, ensuring that the computational overhead remains low even as the number of clusters K increases.
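
A minimal sketch of this computation is given below, assuming a shared covariance matrix estimated from the latent embeddings; the function name and the row-wise normalization of the Cauchy kernel are illustrative choices rather than the exact implementation in TableDC.

```python
# Sketch: soft cluster assignments from Mahalanobis distances (via a Cholesky factor)
# passed through a heavy-tailed Cauchy kernel. Illustrative, not the authors' code.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def cauchy_soft_assignments(Z: np.ndarray, centroids: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Z: (N, D) latent embeddings; centroids: (K, D) cluster centres; returns (N, K) soft assignments."""
    D = Z.shape[1]
    cov = np.cov(Z, rowvar=False) + eps * np.eye(D)   # regularized shared covariance
    L = cholesky(cov, lower=True)                     # cov = L @ L.T

    # Whitening with L turns the Mahalanobis distance into a Euclidean one:
    # d_M(z, mu)^2 = || L^{-1} (z - mu) ||^2.
    diff = Z[:, None, :] - centroids[None, :, :]                  # (N, K, D)
    white = solve_triangular(L, diff.reshape(-1, D).T, lower=True)
    sq_dist = (white ** 2).sum(axis=0).reshape(Z.shape[0], -1)    # (N, K)

    sim = 1.0 / (1.0 + sq_dist)                       # heavy-tailed Cauchy kernel
    return sim / sim.sum(axis=1, keepdims=True)       # row-normalize to soft assignments
```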

5. Ablation study

This section presents an ablation study that investigates the impact of three components on TableDC: (i) cluster initialization, (ii) the loss function, and (iii) the distance measure and similarity kernel used in the self-supervised module.

5.1. Cluster Initialization

To investigate how cluster initialization affects clustering performance, we compare several cluster centroid initialization approaches with Birch initialization. We record the ARI obtained with each initialization method to evaluate cluster quality. Figure 4 shows a bar chart comparison of TableDC trained with different initialization methods, including K-means, which is widely used in existing DC algorithms.

ARI of TableDC with K-means, K-means++, Bisecting K-Means, GMM and Birch initialization for Schema Inference, Entity Resolution and Domain Discovery.
Figure 4. Impact of different cluster initialization methods on clustering performance. We used SBERT on web tables for schema inference, EmbDi on GeoSet for entity resolution, and SBERT on Camera for Domain Discovery.

We can observe from Figure 4 that TableDC with Birch initialization provides the best results for all three problems. The higher ARI score of TableDC with Birch initialization indicates accurate identification of homogeneous clusters and clear separation between different clusters. K-means++ initialization is second best for domain discovery. From the comparative analysis, we also observed that Birch initialization yields enhanced cluster compactness (how tightly the data points are grouped around the centroid of the cluster) and homogeneity (the similarity of data points within the same cluster).
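
As an illustration, initial centroids can be derived from a Birch clustering of the embeddings along the following lines (using scikit-learn’s Birch); this is a sketch of the general approach rather than TableDC’s exact initialization code.

```python
# Sketch: deriving K initial cluster centroids from a Birch clustering of the embeddings.
import numpy as np
from sklearn.cluster import Birch

def birch_initial_centroids(Z: np.ndarray, k: int) -> np.ndarray:
    """Z: (N, D) embeddings; returns one centroid per discovered cluster (normally k)."""
    labels = Birch(n_clusters=k).fit_predict(Z)
    return np.stack([Z[labels == c].mean(axis=0) for c in np.unique(labels)])

# Example: centroids that could initialize the self-supervised clustering module.
rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 16))
centroids = birch_initial_centroids(Z, k=5)
print(centroids.shape)   # (5, 16)
```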

Refer to caption
(a) re_loss minimization
Refer to caption
(b) divergence of q from p
Figure 5. re_loss and KL_div (between q and p): comparative analysis of TableDC with benchmark methods for schema inference on web tables data.
Table 5. TableDC clustering results with different distance metrics and similarity kernels in the self-supervised module. We used SBERT (schema only) on the web tables and Monitor datasets for schema inference and domain discovery, and SBERT (values only) on the Music Brainz dataset for entity resolution.
             | Schema Inference                 | Entity Resolution                | Domain Discovery
Distance     | Euclidean   Cosine   Mahalanobis | Euclidean   Cosine   Mahalanobis | Euclidean   Cosine   Mahalanobis
ARI          | 0.49        0.30     0.62        | 0.0014      0.0014   0.88        | 0.56        0.30     0.64
ACC          | 0.62        0.51     0.65        | 0.0599      0.0300   0.87        | 0.57        0.41     0.65
Distribution | Student's t Normal   Cauchy      | Student's t Normal   Cauchy      | Student's t Normal   Cauchy
ARI          | 0.52        0.45     0.62        | 0.37        0.0      0.88        | 0.54        0.56     0.64
ACC          | 0.60        0.64     0.65        | 0.69        0.0025   0.87        | 0.06        0.0      0.65

5.2. Loss Function

To empirically assess the impact of the loss functions in TableDC and the existing benchmarks for data management problems, we use schema inference on the web tables data to plot (i) re_loss and (ii) KL_div (see Figure 5). A well-trained AE with effective re_loss minimization leads to embeddings where clusters are more distinct and separable, specifically when the original embeddings are dense and overlapping. We can observe from Figure 5 that the re_loss of TableDC is significantly reduced and more consistent compared with the benchmark methods, which exhibit fluctuations in loss while learning dense embeddings. For KL_div, the relationship between p and q (see Section 3.1) is investigated to show whether q deviates from p. This divergence is necessary because the data points in the target p require greater intra-cluster distance rather than high cohesion. We only consider benchmarks that use the p and q distributions and a KL_div loss, to align the comparison with TableDC. TableDC shows a high divergence of q from p (see Figure 5), which means that TableDC is more confident, with the soft assignments q becoming sharper over time; over a highly packed p, q needs to reduce cohesiveness.
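
As a rough sketch of the two training signals plotted in Figure 5, the reconstruction loss and KL_div can be computed as follows; the DEC-style target distribution used here is only a stand-in for the definition of p in Section 3.1.

```python
# Sketch of the two training signals discussed above: autoencoder reconstruction loss
# and KL(p || q). The target p below is a DEC-style sharpening of q, used here only
# as a stand-in for the paper's own definition of p.
import torch
import torch.nn.functional as F

def reconstruction_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(x_hat, x)

def target_distribution(q: torch.Tensor) -> torch.Tensor:
    weight = q ** 2 / q.sum(dim=0)            # sharpen and re-weight the soft assignments
    return (weight.t() / weight.sum(dim=1)).t()

def kl_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    return F.kl_div(q.log(), p, reduction="batchmean")   # KL(p || q)

# Usage: q is the (N, K) soft-assignment matrix produced by the clustering head.
q = torch.softmax(torch.randn(8, 4), dim=1)
p = target_distribution(q).detach()
x = torch.randn(8, 32)
x_hat = x + 0.1 * torch.randn(8, 32)          # stand-in for the decoder output
print(reconstruction_loss(x, x_hat).item(), kl_divergence(p, q).item())
```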

5.3. Self-supervision over different distance functions and similarity kernels.

We evaluate the impact of different distance measures and similarity kernels on the performance of TableDC, whose self-supervised module employs the Mahalanobis distance and the Cauchy distribution to manage densely overlapping data. We record the ARI and ACC (see Table 5) of TableDC keeping the Cauchy distribution as the similarity kernel and replacing the Mahalanobis distance with the Euclidean (Bo et al., 2020) and Cosine (Salton et al., 1975) distances, each providing different geometric properties: the Euclidean distance measures the straight-line distance between data points and their centroids, and the Cosine distance evaluates the cosine of the angle between vectors, emphasizing orientation over magnitude. Similarly, we kept the Mahalanobis distance fixed while varying the similarity kernel, replacing the Cauchy distribution with the Student’s t-distribution (Bo et al., 2020) (with an additional degrees-of-freedom parameter that adjusts the distribution’s kurtosis) and the Normal distribution (which provides a standard Gaussian decay of similarity).
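
For a squared distance d², the three similarity kernels compared here can be written roughly as follows; the exact parameterizations used in TableDC and the benchmark methods may differ.

```python
# Sketch of the three similarity kernels compared in the ablation, as functions of
# squared distance d2. Exact parameterizations in the compared methods may differ.
import numpy as np

def cauchy_kernel(d2: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + d2)                              # heavy tails: slow decay with distance

def student_t_kernel(d2: np.ndarray, nu: float = 1.0) -> np.ndarray:
    return (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)        # kurtosis controlled by degrees of freedom nu

def normal_kernel(d2: np.ndarray) -> np.ndarray:
    return np.exp(-0.5 * d2)                             # Gaussian decay: heavily penalizes distant points

d2 = np.array([0.1, 1.0, 10.0, 100.0])
for name, kernel in [("Cauchy", cauchy_kernel), ("Student's t", student_t_kernel), ("Normal", normal_kernel)]:
    print(name, np.round(kernel(d2), 4))
```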

From Table 5, we can observe that the combination of the Mahalanobis distance with the Cauchy distribution is optimal for all three problems. The Euclidean distance with a Cauchy distribution performs better than the Cosine distance for schema inference and domain discovery, showing that the straight-line distance between points is more informative than the angle when clusters are dense and overlapping. With the Student’s t distribution, TableDC shows a higher ARI for schema inference but significantly lower ACC than with the Normal distribution. The higher ARI suggests that it can effectively group similar items into clusters, but the lower ACC indicates that it fails to handle overlapping clusters and mismatches predicted labels against the actual labels.

6. Conclusions

This paper proposes a deep clustering algorithm for data cleaning and integration problems and has demonstrated its applicability to schema inference, entity resolution and domain discovery. TableDC utilizes the Mahalanobis distance measure in its self-supervised module to optimize the data distribution during training. The strength of TableDC is its ability to account for the covariance among different dimensions of the embeddings, which is often ignored by the conventional Euclidean distance. This ensures that the distance between data points remains consistent irrespective of the orientation of the data in the latent space. Our method enhances cluster separation in scenarios with overlapping features in the latent space by weighting dimensions based on the degree of their inter-dependencies. The experimental evaluation shows that TableDC consistently outperforms existing clustering algorithms and problem-specific methods.

References

  • Albert et al. (2022) Paul Albert, Eric Arazo, Noel E. O’Connor, and Kevin McGuinness. 2022. Embedding Contrastive Unsupervised Features to Cluster In- And Out-of-Distribution Noise in Corrupted Image Datasets. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXI (Lecture Notes in Computer Science), Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (Eds.), Vol. 13691. Springer, 402–419. https://doi.org/10.1007/978-3-031-19821-2_23
  • Alwassel et al. (2020) Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, and Du Tran. 2020. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/6f2268bd1d3d3ebaabb04d6b5d099425-Abstract.html
  • Baazizi et al. (2019) Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2019. Parametric schema inference for massive JSON datasets. VLDB J. 28, 4 (2019), 497–521. https://doi.org/10.1007/s00778-018-0532-7
  • Bex et al. (2007) Geert Jan Bex, Frank Neven, and Stijn Vansummeren. 2007. Inferring XML Schema Definitions from XML Data. In Proc. 33rd VLDB. 998–1009. http://www.vldb.org/conf/2007/papers/research/p998-bex.pdf
  • Bo et al. (2020) Deyu Bo, Xiao Wang, Chuan Shi, Meiqi Zhu, Emiao Lu, and Peng Cui. 2020. Structural Deep Clustering Network. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, Yennun Huang, Irwin King, Tie-Yan Liu, and Maarten van Steen (Eds.). ACM / IW3C2, 1400–1410. https://doi.org/10.1145/3366423.3380214
  • Bogatu et al. (2020) Alex Bogatu, Alvaro A. A. Fernandes, Norman W. Paton, and Nikolaos Konstantinou. 2020. Dataset Discovery in Data Lakes. In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020. IEEE, 709–720. https://doi.org/10.1109/ICDE48307.2020.00067
  • Bonifati et al. (2022) Angela Bonifati, Stefania Dumbrava, and Nicolas Mir. 2022. Hierarchical Clustering for Property Graph Schema Discovery. In Proc. 25th EDBT. 2:449–2:453. https://doi.org/10.48786/edbt.2022.39
  • Cai et al. (2022) Jinyu Cai, Jicong Fan, Wenzhong Guo, Shiping Wang, Yunhe Zhang, and Zhao Zhang. 2022. Efficient Deep Embedded Subspace Clustering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 21–30. https://doi.org/10.1109/CVPR52688.2022.00012
  • Cappuzzo et al. (2020) Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 1335–1349. https://doi.org/10.1145/3318464.3389742
  • Cauchy et al. (1847) Augustin Cauchy et al. 1847. Méthode générale pour la résolution des systemes d’équations simultanées. Comp. Rend. Sci. Paris 25, 1847 (1847), 536–538.
  • Cebiric et al. (2019) Sejla Cebiric, François Goasdoué, Haridimos Kondylakis, Dimitris Kotzinos, Ioana Manolescu, Georgia Troullinou, and Mussab Zneika. 2019. Summarizing semantic graphs: a survey. VLDB J. 28, 3 (2019), 295–327. https://doi.org/10.1007/s00778-018-0528-3
  • Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, Eduardo Blanco and Wei Lu (Eds.). Association for Computational Linguistics, 169–174. https://doi.org/10.18653/V1/D18-2029
  • Chen et al. (2022b) Mansheng Chen, Chang-Dong Wang, Dong Huang, Jian-Huang Lai, and Philip S. Yu. 2022b. Efficient Orthogonal Multi-view Subspace Clustering. In KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, Aidong Zhang and Huzefa Rangwala (Eds.). ACM, 127–135. https://doi.org/10.1145/3534678.3539282
  • Chen et al. (2022a) Nenglun Chen, Lei Chu, Hao Pan, Yan Lu, and Wenping Wang. 2022a. Self-Supervised Image Representation Learning with Geometric Set Consistency. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 19270–19280. https://doi.org/10.1109/CVPR52688.2022.01869
  • Christophides et al. (2021) Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, and Kostas Stefanidis. 2021. An Overview of End-to-End Entity Resolution for Big Data. ACM Comput. Surv. 53, 6 (2021), 127:1–127:42. https://doi.org/10.1145/3418896
  • Coates et al. (2011) Adam Coates, Andrew Y. Ng, and Honglak Lee. 2011. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011 (JMLR Proceedings), Geoffrey J. Gordon, David B. Dunson, and Miroslav Dudík (Eds.), Vol. 15. JMLR.org, 215–223. http://proceedings.mlr.press/v15/coates11a/coates11a.pdf
  • Costa et al. (2010) Gianni Costa, Giuseppe Manco, and Riccardo Ortale. 2010. An incremental clustering scheme for data de-duplication. Data Min. Knowl. Discov. 20, 1 (2010), 152–187. https://doi.org/10.1007/S10618-009-0155-0
  • Cui et al. (2023) Chenhang Cui, Yazhou Ren, Jingyu Pu, Xiaorong Pu, and Lifang He. 2023. Deep Multi-view Subspace Clustering with Anchor Graph. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China. ijcai.org, 3577–3585. https://doi.org/10.24963/IJCAI.2023/398
  • Dilokthanakul et al. (2016) Nat Dilokthanakul, Pedro A. M. Mediano, Marta Garnelo, Matthew C. H. Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. 2016. Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders. CoRR abs/1611.02648 (2016). arXiv:1611.02648 http://arxiv.org/abs/1611.02648
  • Dizaji et al. (2017) Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. 2017. Deep Clustering via Joint Convolutional Autoencoder Embedding and Relative Entropy Minimization. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. IEEE Computer Society, 5747–5756. https://doi.org/10.1109/ICCV.2017.612
  • Draisbach et al. (2020) Uwe Draisbach, Peter Christen, and Felix Naumann. 2020. Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection. ACM J. Data Inf. Qual. 12, 1 (2020), 3:1–3:30. https://doi.org/10.1145/3352591
  • Elmagarmid et al. (2007) Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 1–16. https://doi.org/10.1109/TKDE.2007.250581
  • Emadi and Mazinani (2018) Hossein Saeedi Emadi and Sayyed Majid Mazinani. 2018. A Novel Anomaly Detection Algorithm Using DBSCAN and SVM in Wireless Sensor Networks. Wirel. Pers. Commun. 98, 2 (2018), 2025–2035. https://doi.org/10.1007/s11277-017-4961-1
  • Ester et al. (1996) Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, Evangelos Simoudis, Jiawei Han, and Usama M. Fayyad (Eds.). AAAI Press, 226–231. http://www.aaai.org/Library/KDD/1996/kdd96-037.php
  • Falck et al. (2021) Fabian Falck, Haoting Zhang, Matthew Willetts, George Nicholson, Christopher Yau, and Chris C. Holmes. 2021. Multi-Facet Clustering Variational Autoencoders. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (Eds.). 8676–8690. https://proceedings.neurips.cc/paper/2021/hash/48cb136b65a69e8c2aa22913a0d91b2f-Abstract.html
  • Fan et al. (2023) Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, and Renée J. Miller. 2023. Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. Proc. VLDB Endow. 16, 7 (2023), 1726–1739. https://doi.org/10.14778/3587136.3587146
  • Fettal et al. (2023) Chakib Fettal, Lazhar Labiod, and Mohamed Nadif. 2023. Scalable Attributed-Graph Subspace Clustering. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, Brian Williams, Yiling Chen, and Jennifer Neville (Eds.). AAAI Press, 7559–7567. https://doi.org/10.1609/AAAI.V37I6.25918
  • Fisher et al. (2015) Jeffrey Fisher, Peter Christen, Qing Wang, and Erhard Rahm. 2015. A Clustering-Based Framework to Control Block Sizes for Entity Resolution. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, Longbing Cao, Chengqi Zhang, Thorsten Joachims, Geoffrey I. Webb, Dragos D. Margineantu, and Graham Williams (Eds.). ACM, 279–288. https://doi.org/10.1145/2783258.2783396
  • Golub (2013) Gene Golub. 2013. Matrix Computations. Johns Hopkins University Press. https://doi.org/10.56021/9781421407944
  • Grave et al. (2018) Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomás Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, Nicoletta Calzolari et al. (Eds.). European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2018/summaries/627.html
  • Guo et al. (2017) Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. 2017. Improved Deep Embedded Clustering with Local Structure Preservation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, Carles Sierra (Ed.). ijcai.org, 1753–1759. https://doi.org/10.24963/IJCAI.2017/243
  • Hartigan and Wong (1979) J. A. Hartigan and M. A. Wong. 1979. Algorithm AS 136: A K-Means Clustering Algorithm. Applied Statistics 28, 1 (1979), 100. https://doi.org/10.2307/2346830
  • Hassanzadeh et al. (2009) Oktie Hassanzadeh, Fei Chiang, Renée J. Miller, and Hyun Chul Lee. 2009. Framework for Evaluating Clustering Algorithms in Duplicate Detection. Proc. VLDB Endow. 2, 1 (2009), 1282–1293. https://doi.org/10.14778/1687627.1687771
  • Huang et al. (2020) Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar S. Karnin. 2020. TabTransformer: Tabular Data Modeling Using Contextual Embeddings. CoRR abs/2012.06678 (2020). arXiv:2012.06678 https://arxiv.org/abs/2012.06678
  • Jiang et al. (2017) Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. 2017. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, Carles Sierra (Ed.). ijcai.org, 1965–1972. https://doi.org/10.24963/IJCAI.2017/273
  • Jin et al. (2023) Jiaqi Jin, Siwei Wang, Zhibin Dong, Xinwang Liu, and En Zhu. 2023. Deep Incomplete Multi-View Clustering with Cross-View Partial Sample and Prototype Alignment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 11600–11609. https://doi.org/10.1109/CVPR52729.2023.01116
  • Kart et al. (2021) Turkay Kart, Wenjia Bai, Ben Glocker, and Daniel Rueckert. 2021. DeepMCAT: Large-Scale Deep Clustering for Medical Image Categorization. In Deep Generative Models, and Data Augmentation, Labelling, and Imperfections - First Workshop, DGM4MICCAI 2021, and First Workshop, DALI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, October 1, 2021, Proceedings (Lecture Notes in Computer Science), Sandy Engelhardt, Ilkay Öksüz, Dajiang Zhu, Yixuan Yuan, Anirban Mukhopadhyay, Nicholas Heller, Sharon Xiaolei Huang, Hien Van Nguyen, Raphael Sznitman, and Yuan Xue (Eds.), Vol. 13003. Springer, 259–267. https://doi.org/10.1007/978-3-030-88210-5_26
  • Kellou-Menouer et al. (2022) Kenza Kellou-Menouer, Nikolaos Kardoulakis, Georgia Troullinou, Zoubida Kedad, Dimitris Plexousakis, and Haridimos Kondylakis. 2022. A survey on semantic schema discovery. VLDB J. 31, 4 (2022), 675–710. https://doi.org/10.1007/S00778-021-00717-X
  • Kellou-Menouer and Kedad (2015) Kenza Kellou-Menouer and Zoubida Kedad. 2015. Schema Discovery in RDF Data Sources. In Conceptual Modeling - 34th International Conference, ER (LNCS), Vol. 9381. Springer, 481–495. https://doi.org/10.1007/978-3-319-25264-3_36
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
  • LeCun et al. (1990) Yann LeCun, Ofer Matan, Bernhard E. Boser, John S. Denker, Don Henderson, Richard E. Howard, Wayne E. Hubbard, L. D. Jacket, and Henry S. Baird. 1990. Handwritten zip code recognition with multilayer networks. In 10th IAPR International Conference on Pattern Recognition, Conference C: image, speech, and signal processing, and Conference D: computer architecture for vision in pattern recognition, ICPR 1990, Atlantic City, NJ, USA, 16-21 June, 1990, Volume 2. IEEE, 35–40. https://doi.org/10.1109/ICPR.1990.119325
  • Lewis et al. (2004) David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. J. Mach. Learn. Res. 5 (2004), 361–397. http://jmlr.org/papers/volume5/lewis04a/lewis04a.pdf
  • Li et al. (2017) Keqian Li, Yeye He, and Kris Ganjam. 2017. Discovering Enterprise Concepts Using Spreadsheet Tables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017. ACM, 1873–1882. https://doi.org/10.1145/3097983.3098102
  • Li et al. (2019) Xiaopeng Li, Zhourong Chen, Leonard K. M. Poon, and Nevin L. Zhang. 2019. Learning Latent Superstructures in Variational Autoencoders for Deep Multidimensional Clustering. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=SJgNwi09Km
  • Liu et al. (2020) Fanzhen Liu, Shan Xue, Jia Wu, Chuan Zhou, Wenbin Hu, Cécile Paris, Surya Nepal, Jian Yang, and Philip S. Yu. 2020. Deep Learning for Community Detection: Progress, Challenges and Opportunities. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, Christian Bessiere (Ed.). ijcai.org, 4981–4987. https://doi.org/10.24963/ijcai.2020/693
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692
  • Liu et al. (2022) Yue Liu, Wenxuan Tu, Sihang Zhou, Xinwang Liu, Linxuan Song, Xihong Yang, and En Zhu. 2022. Deep Graph Clustering via Dual Correlation Reduction. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022. AAAI Press, 7603–7611. https://doi.org/10.1609/AAAI.V36I7.20726
  • Liu et al. (2023) Yue Liu, Xihong Yang, Sihang Zhou, Xinwang Liu, Zhen Wang, Ke Liang, Wenxuan Tu, Liang Li, Jingcan Duan, and Cancan Chen. 2023. Hard Sample Aware Network for Contrastive Deep Graph Clustering. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, Brian Williams, Yiling Chen, and Jennifer Neville (Eds.). AAAI Press, 8914–8922. https://doi.org/10.1609/AAAI.V37I7.26071
  • Ma and Kim (2022) Xin Ma and Won Hwa Kim. 2022. Locally Normalized Soft Contrastive Clustering for Compact Clusters. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, Luc De Raedt (Ed.). ijcai.org, 3313–3320. https://doi.org/10.24963/IJCAI.2022/460
  • Mahalanobis (2018) Prasanta Chandra Mahalanobis. 2018. On the generalized distance in statistics. Sankhyā: The Indian Journal of Statistics, Series A (2008-) 80 (2018), S1–S7.
  • Mandilaras et al. (2021) Georgios M. Mandilaras, George Papadakis, Luca Gagliardelli, Giovanni Simonini, Emmanouil Thanos, George Giannakopoulos, Sonia Bergamaschi, Themis Palpanas, Manolis Koubarakis, Alicia Lara-Clares, and Antonio Fariña. 2021. Reproducible experiments on Three-Dimensional Entity Resolution with JedAI. Inf. Syst. 102 (2021), 101830. https://doi.org/10.1016/J.IS.2021.101830
  • Mukherjee et al. (2019) Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kannan. 2019. ClusterGAN: Latent Space Clustering in Generative Adversarial Networks. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 4610–4617. https://doi.org/10.1609/AAAI.V33I01.33014610
  • Nair and Hinton (2010) Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, Johannes Fürnkranz and Thorsten Joachims (Eds.). Omnipress, 807–814. https://icml.cc/Conferences/2010/papers/432.pdf
  • Nargesian et al. (2018) Fatemeh Nargesian, Erkang Zhu, Ken Q. Pu, and Renée J. Miller. 2018. Table Union Search on Open Data. Proc. VLDB Endow. 11, 7 (2018), 813–825. https://doi.org/10.14778/3192965.3192973
  • Niu et al. (2022) Chuang Niu, Hongming Shan, and Ge Wang. 2022. SPICE: Semantic Pseudo-Labeling for Image Clustering. IEEE Trans. Image Process. 31 (2022), 7264–7278. https://doi.org/10.1109/TIP.2022.3221290
  • Ota et al. (2020) Masayo Ota, Heiko Mueller, Juliana Freire, and Divesh Srivastava. 2020. Data-Driven Domain Discovery for Structured Datasets. Proc. VLDB Endow. 13, 7 (2020), 953–965. https://doi.org/10.14778/3384345.3384346
  • Papadakis et al. (2020) George Papadakis, Georgios M. Mandilaras, Luca Gagliardelli, Giovanni Simonini, Emmanouil Thanos, George Giannakopoulos, Sonia Bergamaschi, Themis Palpanas, and Manolis Koubarakis. 2020. Three-dimensional Entity Resolution with JedAI. Inf. Syst. 93 (2020), 101565. https://doi.org/10.1016/J.IS.2020.101565
  • Peng et al. (2017) Xi Peng, Jiashi Feng, Jiwen Lu, Wei-Yun Yau, and Zhang Yi. 2017. Cascade Subspace Clustering. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, Satinder Singh and Shaul Markovitch (Eds.). AAAI Press, 2478–2484. https://doi.org/10.1609/AAAI.V31I1.10824
  • Piai et al. (2023) Federico Piai, Paolo Atzeni, Paolo Merialdo, and Divesh Srivastava. 2023. Fine-grained semantic type discovery for heterogeneous sources using clustering. VLDB J. 32, 2 (2023), 305–324. https://doi.org/10.1007/S00778-022-00743-3
  • Qian (2023) Qi Qian. 2023. Stable Cluster Discrimination for Deep Clustering. CoRR abs/2311.14310 (2023). https://doi.org/10.48550/ARXIV.2311.14310 arXiv:2311.14310
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 21 (2020), 140:1–140:67. http://jmlr.org/papers/v21/20-074.html
  • Rauf et al. (2024) Hafiz Tayyab Rauf, André Freitas, and Norman W. Paton. 2024. Deep Clustering for Data Cleaning and Integration. In Proceedings 27th International Conference on Extending Database Technology, EDBT 2024, Paestum, Italy, March 25 - March 28, Letizia Tanca, Qiong Luo, Giuseppe Polese, Loredana Caruccio, Xavier Oriol, and Donatella Firmani (Eds.). OpenProceedings.org, 636–649. https://doi.org/10.48786/EDBT.2024.55
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics. https://doi.org/10.18653/v1/d19-1410
  • Ritze and Bizer (2017) Dominique Ritze and Christian Bizer. 2017. Matching Web Tables To DBpedia - A Feature Utility Study. In Proceedings of the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21-24, 2017, Volker Markl, Salvatore Orlando, Bernhard Mitschang, Periklis Andritsos, Kai-Uwe Sattler, and Sebastian Breß (Eds.). OpenProceedings.org, 210–221. https://doi.org/10.5441/002/edbt.2017.20
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 115, 3 (2015), 211–252. https://doi.org/10.1007/S11263-015-0816-Y
  • Saeedi et al. (2017) Alieh Saeedi, Eric Peukert, and Erhard Rahm. 2017. Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution. In Advances in Databases and Information Systems - 21st European Conference, ADBIS 2017, Nicosia, Cyprus, September 24-27, 2017, Proceedings (Lecture Notes in Computer Science), Marite Kirikova, Kjetil Nørvg, and George A. Papadopoulos (Eds.), Vol. 10509. Springer, 278–293. https://doi.org/10.1007/978-3-319-66917-5_19
  • Saeedi et al. (2018) Alieh Saeedi, Eric Peukert, and Erhard Rahm. 2018. Using Link Features for Entity Clustering in Knowledge Graphs. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings (Lecture Notes in Computer Science), Aldo Gangemi, Roberto Navigli, Maria-Esther Vidal, Pascal Hitzler, Raphaël Troncy, Laura Hollink, Anna Tordai, and Mehwish Alam (Eds.), Vol. 10843. Springer, 576–592. https://doi.org/10.1007/978-3-319-93417-4_37
  • Salton et al. (1975) Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A Vector Space Model for Automatic Indexing. Commun. ACM 18, 11 (1975), 613–620. https://doi.org/10.1145/361219.361220
  • Sun et al. (2023) Li Sun, Feiyang Wang, Junda Ye, Hao Peng, and Philip S. Yu. 2023. CONGREGATE: Contrastive Graph Clustering in Curvature Spaces. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China. ijcai.org, 2296–2305. https://doi.org/10.24963/IJCAI.2023/255
  • Tsuboi and Suzuki (2019) Yuroti Tsuboi and Nobutaka Suzuki. 2019. An Algorithm for Extracting Shape Expression Schemas from Graphs. In Proc. ACM Symposium on Document Engineering 2019. 32:1–32:4. https://doi.org/10.1145/3342558.3345417
  • Tu et al. (2021) Wenxuan Tu, Sihang Zhou, Xinwang Liu, Xifeng Guo, Zhiping Cai, En Zhu, and Jieren Cheng. 2021. Deep Fusion Clustering Network. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021. AAAI Press, 9978–9987. https://doi.org/10.1609/AAAI.V35I11.17198
  • van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008).
  • Wu et al. (2019) Jianlong Wu, Keyu Long, Fei Wang, Chen Qian, Cheng Li, Zhouchen Lin, and Hongbin Zha. 2019. Deep Comprehensive Correlation Mining for Image Clustering. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 8149–8158. https://doi.org/10.1109/ICCV.2019.00824
  • Xie et al. (2016) Junyuan Xie, Ross B. Girshick, and Ali Farhadi. 2016. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 (JMLR Workshop and Conference Proceedings), Maria-Florina Balcan and Kilian Q. Weinberger (Eds.), Vol. 48. JMLR.org, 478–487. http://proceedings.mlr.press/v48/xieb16.html
  • Yang et al. (2017) Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. 2017. Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 3861–3870. http://proceedings.mlr.press/v70/yang17b.html
  • Yang et al. (2019) Linxiao Yang, Ngai-Man Cheung, Jiaying Li, and Jun Fang. 2019. Deep Clustering by Gaussian Mixture Variational Autoencoders With Graph Embedding. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 6439–6448. https://doi.org/10.1109/ICCV.2019.00654
  • Yang et al. (2022) Yaming Yang, Ziyu Guan, Zhe Wang, Wei Zhao, Cai Xu, Weigang Lu, and Jianbin Huang. 2022. Self-supervised Heterogeneous Graph Pre-training Based on Structural Clustering. In NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/6c7297baffe5c85ea1d9e1ccb1222ab8-Abstract-Conference.html
  • Yang et al. (2010) Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. 2010. Image Clustering Using Local Discriminant Models and Global Integration. IEEE Trans. Image Process. 19, 10 (2010), 2761–2773. https://doi.org/10.1109/TIP.2010.2049235
  • Zhang et al. (1996) Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4-6, 1996, H. V. Jagadish and Inderpal Singh Mumick (Eds.). ACM Press, 103–114. https://doi.org/10.1145/233269.233324
  • Zhang et al. (2023) Yuchao Zhang, Yuan Yuan, and Qi Wang. 2023. Multi-level Graph Contrastive Prototypical Clustering. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China. ijcai.org, 4611–4619. https://doi.org/10.24963/IJCAI.2023/513
  • Zhou et al. (2022) Sheng Zhou, Hongjia Xu, Zhuonan Zheng, Jiawei Chen, Zhao Li, Jiajun Bu, Jia Wu, Xin Wang, Wenwu Zhu, and Martin Ester. 2022. A Comprehensive Survey on Deep Clustering: Taxonomy, Challenges, and Future Directions. CoRR abs/2206.07579 (2022). https://doi.org/10.48550/arXiv.2206.07579 arXiv:2206.07579