
Clustering with UMAP: Why and How Connectivity Matters

Ayush Dalmia, Suzanna Sia
Abstract

Topology based dimensionality reduction methods such as t-SNE and UMAP have strong mathematical foundations and are based on the intuition that the topology in low dimensions should be close to that of high dimensions. Given that the initial structure is a precursor to the success of the algorithm, this naturally raises the question: What makes a “good” topological structure for dimensionality reduction? In this paper, which focuses on UMAP, we study the effects of node connectivity (k-Nearest Neighbors vs mutual k-Nearest Neighbors) and relative neighborhood (Adjacent vs Path Neighbors) on dimensionality reduction. We explore these concepts through extensive ablation studies on 4 standard image and text datasets (MNIST, FMNIST, 20NG, AG), reducing to 2 and 64 dimensions. Our findings indicate that a more refined notion of connectivity (mutual k-Nearest Neighbors with minimum spanning tree) together with a flexible method of constructing the local neighborhood (Path Neighbors) can achieve a much better representation than default UMAP, as measured by downstream clustering performance.

1 Introduction

Dimension reduction techniques have become a standard tool in both machine learning and data analysis. Topology based dimensionality reduction techniques such as t-SNE (Van der Maaten and Hinton 2008) and Uniform Manifold Approximation and Projection (UMAP; McInnes, Healy, and Melville (2018)) have been increasing in popularity due to their effectiveness in data visualisation and finding better representations for downstream tasks such as classification and clustering. While various topological dimensionality reduction methods differ in mathematical details for optimizing the low-dimensional manifold, these methods typically start with an initial graph structure in high dimensions. This naturally raises the question: What makes a “good” topological structure for dimensionality reduction? We explore different notions of connectivity for the initial graph structure and evaluate the performance of the resulting low-dimensional representation on a downstream clustering task using Normalised Mutual Information (NMI).

The most general initialization can be straightforwardly obtained by a weighted k-nearest neighbor (k-NN) graph. However, the k-NN graph is relatively susceptible to the “curse of dimensionality” and the associated distance concentration effect, where distances are similar in high dimensions, as well as the hub effect, where certain points become highly influential when highly connected. This skews the local representation of high dimensional data, deteriorating its performance for various similarity-based machine learning tasks such as classification (Tomašev and Buza 2015; Dinu and Baroni 2015), clustering (Tomašev et al. 2014) and most notably dimension reduction (Feldbauer and Flexer 2018).

In this paper, we focus on UMAP, which has advantages over other dimension reduction techniques such as PCA, Isomap (Balasubramanian et al. 2002), and t-SNE in visualization quality, in preserving a dataset’s global structure (inter-class distances) and local structure (intra-class distances), and in runtime performance. UMAP has been increasingly used in various real world applications such as vision (Allaoui, Kherfi, and Cheriet 2020; Vermeulen et al. 2021), NLP (Kayal 2021; Rother et al. 2020), and population genetics (Diaz-Papkovich, Anderson-Trocmé, and Gravel 2020; Becht et al. 2019), with competitive performance on unsupervised clustering and outlier detection among other tasks.

In Section 3, we propose a refinement in the graph construction stage of the UMAP algorithm that uses a mutual k-NN graph instead of k-NN, to reduce the undesired distance concentration and hub effects. A consequence of applying the mutual k-NN graph is that it can result in isolated components that adversely affect the initialization of the lower dimensional vectors. Therefore, in Section 3.1 we explore different methods suggested by prior research for linking the isolated components, such as connecting adjacent neighbors and adopting the edges from a minimum spanning tree (MST). The resulting connected graph can then be refined (Section 3.2) to obtain the neighborhood of each point using the Shortest Path Distance, so that non-adjacent neighbors (which we term Path Neighbors) can also be considered part of the local neighborhood.

In Section 4, we conduct experiments on 4 standard image and text datasets, and visualise the resulting representations after applying UMAP, which suggest how each Nearest Neighbor variant affects the low dimensional representation. The contributions of this paper are:

  • using a mutual k-NN graph rather than the original k-NN graph improves the separation between similar classes for both image and text datasets.

  • having a minimally connected graph by adding the minimum edges from the MST (MST-min) is best if we have access to flexible methods of constructing the local neighborhood (e.g. via MST-min with Shortest Path Distance neighbors (Path Neighbors)).

  • we show that simple graph processing can result in relative gains of 5-18% on NMI consistently across all datasets, and analyse the effect of graph processing on downstream representations.

2 Background

2.1 Understanding UMAP

UMAP first constructs a high dimensional graph representation of the data (graph construction phase), which is used to represent the “fuzzy simplicial complex”. Next it optimizes low-dimensional vectors to be as similar as possible to the graph representation in higher dimensions, by minimizing the cross-entropy of two fuzzy sets with the same underlying elements (datapoints).
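For orientation, the end-to-end pipeline can be reproduced with the reference umap-learn library; the snippet below is a minimal sketch with illustrative parameter values, not the settings tuned in our experiments.

# Minimal sketch of the default UMAP pipeline (umap-learn).
# Parameter values are illustrative, not the settings used in our experiments.
import numpy as np
import umap  # pip install umap-learn

X = np.random.rand(1000, 50)            # stand-in for a high-dimensional dataset

reducer = umap.UMAP(
    n_neighbors=15,      # size of the k-NN neighborhood used in graph construction
    min_dist=0.1,        # how tightly points are packed in the low-dimensional layout
    n_components=2,      # target dimensionality
    metric="euclidean",
)
embedding = reducer.fit_transform(X)     # (1000, 2) low-dimensional vectors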

Graph Construction

Our work is concerned with the graph construction phase of UMAP. In default UMAP, a weighted k-NN graph is constructed from the data, where each vertex represents a datapoint and edge weights represent the likelihood that two points are connected (McInnes, Healy, and Melville 2018). Formally, let $X=\{x_1,\cdots,x_N\}$ be a dataset with $N$ points, $x\in\mathbb{R}^M$. For each datapoint $x_i$, the set of points in its local neighborhood is computed using a distance metric $d:X\times X\rightarrow\mathbb{R}$. In default UMAP, the neighborhood is found using the nearest neighbor descent algorithm (Dong, Moses, and Li 2011). Then for each $x_i$, we have $\rho_i$, which reflects a connectivity constraint that data points are assumed to be locally connected. Here $\rho_i=\min\{d(x_i,x_j)\,|\,d(x_i,x_j)>0,x_j\in X\}$. This ensures that $x_i$ connects to at least one other datapoint. We define a weighted graph $G=(V,E,w)$ where the vertices $V$ are simply the set of datapoints $X$ and $w$ is the set of weights corresponding to $E$. $G$ represents the “fuzzy simplicial complex”. Details on how to construct $G$ can be found in McInnes, Healy, and Melville (2018).
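As an illustration of the connectivity constraint, $\rho_i$ can be read off directly from the k-NN distances. The sketch below uses exact nearest neighbors from scikit-learn for clarity (default UMAP instead uses the approximate nearest neighbor descent algorithm); it is illustrative, not the code used in our experiments.

# Sketch: rho_i is the distance from each point to its closest non-identical neighbor.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_connectivity_rho(X, k=15):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is returned as its own neighbor
    dists, _ = nn.kneighbors(X)                        # shape (N, k+1); column 0 is the self-distance 0
    return np.array([row[row > 0].min() for row in dists])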

The central focus of this paper is to demonstrate how refining the set of directed edges $E$, which reflects the connectivity of the graph in high-dimensional representations, affects the subsequent low-dimensional representations. The set of directed edges is $E=\{(x_i,x_j)\,|\,i\leq N,x_j\in\mathcal{M}_i\}$, where we define $\mathcal{M}_i$ to be the local neighborhood of points for $x_i$ that collectively determines the final set of directed edges $E$. In this work, our starting point for $\mathcal{M}_i$ is the mutual nearest neighbors, and we deal with the inherent asymmetry of this graph, which could result in data points being completely isolated (Section 3.1).

Optimizing lower dimensional representation

Details about this optimization are not critical to understanding our proposed method, and we present this for completeness. Given GG, the data is projected into lower dimensions via a force-directed graph layout algorithm. To initialize the lower dimensional vectors, UMAP uses the eigenvectors of a normalized graph Laplacian of the “fuzzy simplicial complex” affinity matrix, also known as spectral vectors. For theoretical foundations and mathematical details, we refer interested readers to (McInnes, Healy, and Melville 2018).

Figure 1: Visual comparisons between all the methods used in Section 3.1: (a) edge (blue) added by NN; (b) edge (blue) added by MST-min; (c) edge (blue) added by MST-all. Black edges represent edges from the mutual k-NN graph. Blue edges represent the edges added using each connectivity method: connecting isolated nodes by nearest neighbor (NN), connecting components with the minimum edges of a minimum spanning tree (MST-min), and with all the edges from the minimum spanning tree (MST-all).

2.2 Mutual k-Nearest Neighbors

The default choice of $\mathcal{M}_i$ is the nearest neighbors of $x_i$. However, previous research has shown that a weighted k-NN graph captures noisy relationships and may not be an accurate representation of the underlying local structure of a high dimensional dataset due to the “curse of dimensionality” (Liu and Zhang 2012; Radovanović, Nanopoulos, and Ivanovic 2010). Instead we utilize a weighted mutual k-nearest neighbor graph (weighted mutual k-NN graph) to represent the underlying structure of our data. A mutual k-NN graph is defined as a graph that only has an undirected edge between two vertices $x_i$ and $x_j$ if $x_i$ is in the k-nearest neighbors of $x_j$ and $x_j$ is in the k-nearest neighbors of $x_i$. Therefore our local neighborhood is $\mathcal{M}_i=\{x_j\,|\,x_j\in knn(x_i),x_i\in knn(x_j)\}$.
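This definition translates directly into a graph construction: keep an edge only if it appears in both directions of the k-NN graph. The sketch below uses scikit-learn and scipy sparse matrices and is illustrative rather than the exact code used in our experiments.

# Sketch: build a mutual k-NN graph by keeping only edges (i, j) where
# x_j is in knn(x_i) AND x_i is in knn(x_j).
from sklearn.neighbors import kneighbors_graph

def mutual_knn_graph(X, k=15, metric="euclidean"):
    # Directed k-NN graph: knn[i, j] = d(x_i, x_j) if x_j is among the
    # k nearest neighbors of x_i, and 0 otherwise.
    knn = kneighbors_graph(X, n_neighbors=k, mode="distance", metric=metric)
    # Element-wise minimum with the transpose zeroes out one-directional edges,
    # leaving a symmetric matrix with only the mutual edges.
    return knn.minimum(knn.T)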

A mutual k-NN graph can hence be interpreted as a subgraph of the original k-NN graph. Mutual k-NN graphs have been shown to have many desirable properties compared to their k-NN counterparts when combating the “curse of dimensionality”. One such property is that edges in mutual k-NN graphs are a stronger indicator of similarity than the unidirectional nearest neighbor relationship, alleviating the distance concentration effect (Jégou et al. 2010). Mutual k-NN graphs are also less prone to the hub effect than the regular k-NN graph, since each vertex in a mutual k-NN graph is guaranteed to have degree at most k (Ozaki et al. 2011). They have also shown strong performance when used in both unsupervised clustering (Maier, Hein, and Luxburg 2009; Sardana and Bhatnagar 2014; Abbas and Shoukry 2012) and semi-supervised graph-based classification algorithms (Ozaki et al. 2011; de Sousa, Rezende, and Batista 2013). In Section 3, we describe how we utilize the mutual k-NN graph to create our high dimensional graph representation.

3 Methodology

In order to obtain the fuzzy simplicial complex for UMAP, our starting point is a weighted mutual k-NN graph as motivated in Section 2.2. However, this graph may have completely isolated vertices and disconnected components, violating UMAP’s assumption that the manifold is locally connected (Section 2.1). In addition, having many components is undesirable for the initialization of the lower dimensional spectral vectors (McInnes, Healy, and Melville 2018). Therefore, we consider different connectivity methods, such as using the minimum spanning tree (MST) to add edges, which we elaborate on in Section 3.1.

Next, we consider that using a vertex’s adjacent neighborhood for $\mathcal{M}_i$ may not be an adequate representation, as this excludes all points that are more than 1 hop away. Hence we additionally refine $\mathcal{M}_i$ using the “new” neighborhood calculated with the Shortest Path Distance to other nodes on the connected graph. This is elaborated in Section 3.2. Our method can thus be viewed as first building a loosely connected (mutual k-NN) graph, then increasing connectivity using MST variants, and finally refining $\mathcal{M}_i$ to include neighbors that are more than one hop away.

3.1 Increasing Connectivity for mutual k-NN

Since a mutual k-NN graph only retains a subset of the edges from the original k-NN graph (Section 2.2), the resulting mutual k-NN graph often contains disconnected components and potentially isolated vertices. Isolated vertices violate one of UMAP’s primary assumptions that the underlying manifold is locally connected, and disconnected components negatively impact the spectral initialization (McInnes, Healy, and Melville 2018). Spectral initialization is crucial for preserving the global structure of the high dimensional data in the low dimensional vectors (Kobak and Linderman 2021).

Naively, we could increase $k$, the number of nearest neighbors used for calculating the mutual k-NN graph, to obtain a more connected graphical representation. However, increasing $k$ does not guarantee that the resulting graph has no isolated vertices or fewer disconnected components. We therefore consider the following methods for increasing the connectivity of the mutual k-NN graph, presented in order of increasing connectivity.

  1. NN: Add an undirected edge between each isolated vertex and its original nearest neighbor (de Sousa, Rezende, and Batista 2013).

  2. MST-min: To achieve a connected graph, add the minimum number of edges from a maximum spanning tree to the mutual k-NN graph that has been weighted with similarity-based metrics (Ozaki et al. 2011). We adapt this by calculating the minimum spanning tree over distances (Algorithm 1).

  3. MST-all: Add all the edges of the MST.

We call this graph $G^{\prime}$, the locally connected mutual k-NN graph, and run experiments with $G^{\prime}$ in Section 4.

3.2 Finding New Local Neighborhood $\mathcal{M}_i$

Next, we wish to obtain a “fuzzy simplicial complex” representation as described in Section 2.1 from the connected mutual k-NN graph $G^{\prime}$ (Section 3.1). To achieve this, we first need to obtain the new local neighborhood $\mathcal{M}_i$ for each $x_i\in X$ using $G^{\prime}$. The most straightforward way is to use adjacent vertices in $G^{\prime}$. However, this excludes all points that are more than 1 hop away, even if they are relatively close.

Instead, we compute $G$ from $G^{\prime}$ by refining the local neighborhood $\mathcal{M}_i$ using the Shortest Path Distance between nodes on $G^{\prime}$, computed with Dijkstra’s algorithm (see Fig. 2 for an example; Algorithm 2). This shortest path distance can be considered a new distance metric, as it directly aligns with UMAP’s definition of an extended-pseudo-metric space. That is, we replace $d(x_i,x_j)$ with $d_{\mathrm{path}}(x_i,x_j)$ in the graph construction phase (Section 2.1).

Figure 2: Example of the local neighborhood we may encounter. If we are trying to find the neighborhood for A, then we would only select nodes B, C, D with Adjacent Neighbors. However, this excludes closer nodes such as E and F; by using Path Neighbors, we can include E and F in the neighborhood of A.
Input: The mutual k-NN graph $mkNN$, and the k-NN graph $kNN$ with edge weights given by the distance between points
Output: Connected mutual k-NN graph $G^{\prime}$
Initialize $G^{\prime}$ as a copy of $mkNN$
$MST \leftarrow$ minimum weight spanning tree of $kNN$
$sort\_e \leftarrow$ edges of $MST$ sorted by ascending weight
foreach edge $e \in sort\_e$ do
      if $e$ connects two components in $G^{\prime}$ then
            Add undirected edge $e$ to $G^{\prime}$
      end if
end foreach
return $G^{\prime}$
Algorithm 1: Connect disconnected components in the mutual k-NN graph with MST-min
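For reference, a Python sketch of Algorithm 1 using scipy’s minimum spanning tree and a simple union-find over components; it is an illustrative re-implementation, not the exact experimental code.

# Sketch of Algorithm 1 (MST-min): add the lightest MST edges needed to merge
# the components of the mutual k-NN graph into a single connected graph G'.
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

def connect_with_mst_min(mutual_knn, knn):
    """mutual_knn, knn: sparse (n x n) distance matrices."""
    g_prime = mutual_knn.tolil(copy=True)

    # Components of the mutual k-NN graph before any edges are added.
    n_comp, labels = connected_components(mutual_knn, directed=False)

    # Union-find over component ids; each accepted edge merges two components.
    parent = list(range(n_comp))
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # MST of the original k-NN graph, edges visited in ascending weight order.
    mst = minimum_spanning_tree(knn).tocoo()
    order = np.argsort(mst.data)
    for i, j, w in zip(mst.row[order], mst.col[order], mst.data[order]):
        ci, cj = find(labels[i]), find(labels[j])
        if ci != cj:                        # edge connects two components in G'
            g_prime[i, j] = g_prime[j, i] = w
            parent[ci] = cj
            n_comp -= 1
            if n_comp == 1:                 # G' is now connected
                break
    return g_prime.tocsr()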
Input: Connected mutual k-NN graph $G^{\prime}$, number of new nearest neighbors to search for $k_{new}$
Output: Dictionary $\mathcal{M}$ that returns the new neighborhood for each point, and dictionary $\mathcal{M}_{dists}$ that returns the distances between each point and its nearest Path Neighbors
Initialize $\mathcal{M}$ and $\mathcal{M}_{dists}$
foreach $x_i \in X$ do
      Initialize the neighborhood $\mathcal{M}_i$ of $x_i$, and a min heap $Q$
      Insert each vertex $v$ adjacent to $x_i$ in $G^{\prime}$ into $Q$ with the distance between $v$ and $x_i$
      while $Q$ not empty and $|\mathcal{M}_i| < k_{new}$ do
            Extract the minimum distance vertex $y$ with distance $y_{path\_dist}$ from $Q$
            if $y \notin \mathcal{M}_i$ then
                  Append $y$ to $\mathcal{M}_i$ and $y_{path\_dist}$ to $\mathcal{M}_{dists}(x_i)$
                  foreach $v \in$ adjacent vertices of $y$ do
                        Insert $v$ into $Q$ with a distance of $y_{path\_dist}$ + distance between $y$ and $v$
                  end foreach
            end if
      end while
end foreach
return $\mathcal{M}$ and $\mathcal{M}_{dists}$
Algorithm 2: Find the new local neighborhood $\mathcal{M}_i$ (Path Neighbors) using Dijkstra’s algorithm
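Likewise, a Python sketch of Algorithm 2, using a heap-based Dijkstra search that terminates once $k_{new}$ Path Neighbors have been found for a point; the adjacency-list representation is an assumption made for readability.

# Sketch of Algorithm 2: for each point, run a truncated Dijkstra search on the
# connected mutual k-NN graph G' and keep the k_new closest reachable points
# (its Path Neighbors) together with their shortest path distances.
import heapq

def path_neighbors(g_prime, k_new):
    """g_prime: adjacency dict {node: [(neighbor, distance), ...]} of the connected graph."""
    M, M_dists = {}, {}
    for x in g_prime:
        neighbors, dists = [], []
        seen = set()
        # Min-heap of (path distance from x, vertex), seeded with the adjacent vertices of x.
        q = [(d, v) for v, d in g_prime[x]]
        heapq.heapify(q)
        while q and len(neighbors) < k_new:
            y_dist, y = heapq.heappop(q)
            if y in seen or y == x:
                continue
            seen.add(y)
            neighbors.append(y)
            dists.append(y_dist)
            # Push the neighbors of y with their path distance through y.
            for v, d in g_prime[y]:
                if v not in seen:
                    heapq.heappush(q, (y_dist + d, v))
        M[x], M_dists[x] = neighbors, dists
    return M, M_dists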

3.3 Computational Complexity

The complexity of constructing the mutual k-NN graph from the original k-NN graph is $O(kn)$. The additional complexity of connecting disconnected components varies depending on the method chosen. Adding edges between isolated vertices and their nearest neighbors (NN) is $O(n)$, while the MST variants have a known time complexity of $O(E\log V) = O(kn\log n)$: since we construct the MST from the original k-NN graph, we know that $E=nk$ and $V=n$. With the MST, we add either all the edges (MST-all), which has an upper bound of $O(n)$ since there are $n-1$ edges in the MST, or the minimum number of edges (MST-min) until we have one component, which has an upper bound of $O(n\log n)$ since we need to sort the edges first.

Finally, using Path Neighbors on the mutual k-NN graph has an additional cost since we have to perform a graph search. We use a variant of Dijkstra’s algorithm, which has a known time complexity of $O(V+E\log V)$. Running the search for all $V=n$ nodes with $E\leq nk$, we get a final complexity of $O(n(n+kn\log n))$. It is important to note that the worst case for Dijkstra’s algorithm assumes one may need to explore the whole graph, as the algorithm is typically used to find the shortest distance between two points. However, since we are using Dijkstra’s algorithm only to find the new closest points, in practice we are likely to terminate much earlier. This search can be further optimized by parallelizing the individual searches needed to find new nearest neighbors for each point.
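Summarizing the costs above (a sketch; constants and implementation details are omitted):

  • mutual k-NN graph from the k-NN graph: $O(kn)$

  • NN (connect isolated vertices): $O(n)$

  • MST of the k-NN graph: $O(E\log V) = O(kn\log n)$ with $E=nk$, $V=n$

  • adding MST-all / MST-min edges: $O(n)$ / $O(n\log n)$

  • Path Neighbors ($n$ truncated Dijkstra searches): $n \cdot O(n+kn\log n) = O(n(n+kn\log n))$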

4 Experiments

4.1 Datasets

We used two standard image classification datasets, MNIST (LeCun and Cortes 2010) and Fashion MNIST (FMNIST; Xiao, Rasul, and Vollgraf (2017)), and two standard text classification datasets, 20 Newsgroups (20NG; http://qwone.com/~jason/20Newsgroups/) and AG’s News Topic (AG; Zhang, Zhao, and LeCun (2015)), for evaluation. For the image classification datasets, no preprocessing was applied and each image was flattened to create a 1D vector. For the text classification datasets, we lowercase tokens, remove stopwords, punctuation and digits, and exclude words that appear in fewer than 5 documents. After preprocessing, we converted each dataset into both Count vectors and TF-IDF vectors. We opted to use the Count based representation of text as an illustration of datasets that have very high-dimensional and sparse representations (TF-IDF is a weighting on the sparse representation). While our method is applicable to word or document vectors, these are heavily dependent on the text encoder and it is more difficult to interpret findings as we do not know what the dense high-dimensional vector space represents in the first place. Code is available at https://github.com/adalmia96/umap-mnn.
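The text preprocessing can be reproduced with standard scikit-learn vectorizers; the snippet below is a sketch under that assumption (the token pattern, stopword list, and data loading are illustrative stand-ins, not the exact pipeline used in our experiments).

# Sketch of the text preprocessing: lowercase, drop stopwords/punctuation/digits,
# keep only words appearing in at least 5 documents, then build Count and TF-IDF vectors.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = fetch_20newsgroups(subset="all").data             # raw 20NG documents

vectorizer_args = dict(
    lowercase=True,
    stop_words="english",
    token_pattern=r"(?u)\b[a-zA-Z]{2,}\b",                # alphabetic tokens only
    min_df=5,                                             # word must appear in >= 5 documents
)
count_vectors = CountVectorizer(**vectorizer_args).fit_transform(docs)   # sparse counts
tfidf_vectors = TfidfVectorizer(**vectorizer_args).fit_transform(docs)   # sparse TF-IDF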

4.2 Evaluation and Performance Metrics

Quantitative Evaluation: We evaluate the methods by comparing the clustering performance of the resulting low dimensional vectors. We performed KMeans clustering, setting the number of clusters to the ground truth number of topics for each dataset. We use Normalised Mutual Information (NMI), a widely used clustering metric that ranges from 0 to 1 (1 being a perfect score), to evaluate the clusters; NMI was the primary metric used to evaluate clustering quality for UMAP and subsequent works (McInnes, Healy, and Melville 2018). As UMAP is a stochastic algorithm, the NMI scores presented in Table 1 are averaged across low-dimensional vectors from 5 random seeds.
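A sketch of this evaluation loop, where reduce stands in for whichever dimensionality reduction variant is being scored (the seeding scheme shown here is an assumption):

# Sketch of the quantitative evaluation: KMeans with the ground-truth number of
# topics, scored with NMI, averaged over 5 random seeds.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def mean_nmi(reduce, X, labels, n_classes, seeds=(0, 1, 2, 3, 4)):
    scores = []
    for seed in seeds:
        low_dim = reduce(X, random_state=seed)             # e.g. a UMAP variant
        pred = KMeans(n_clusters=n_classes, random_state=seed).fit_predict(low_dim)
        scores.append(normalized_mutual_info_score(labels, pred))
    return float(np.mean(scores))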

Qualitative Evaluation: In addition, we consider several desirable properties for dimensionality reduction in qualitative visualisations: preservation of local structure, preservation of global structure, and separation of classes. A dimensionality reduction method which preserves the local structure means that intra-class distances are preserved, i.e. datapoints belonging to the same class are close to each other. A method which preserves the global structure means that inter-class distances are preserved, i.e. clusters of similar classes appear closer together and clusters of dissimilar classes appear farther from each other. Finally, the classes should be well separated (not clumped together).

4.3 Experimental Conditions

Normalized Mutual Information       Adjacent Neighbors            Path Neighbors
Dataset        Dim   UMAP     NN     MST-min  MST-all      NN     MST-min  MST-all
MNIST           2    0.854    0.917  0.920    0.920        0.898  0.920    0.920
FMNIST          2    0.615    0.669  0.693    0.694        0.648  0.698    0.649
20NG Count      2    0.446    0.467  0.474    0.481        0.511  0.523    0.524
20NG TF-IDF     2    0.478    0.461  0.456    0.461        0.521  0.526    0.522
AG Count        2    0.589    0.582  0.571    0.589        0.589  0.630    0.642
AG TF-IDF       2    0.503    0.455  0.475    0.506        0.520  0.540    0.545
MNIST          64    0.862    0.915  0.919    0.920        0.910  0.919    0.918
FMNIST         64    0.626    0.685  0.703    0.703        0.667  0.698    0.679
20NG Count     64    0.487    0.525  0.529    0.535        0.560  0.563    0.565
20NG TF-IDF    64    0.566    0.556  0.556    0.560        0.592  0.594    0.595
AG Count       64    0.612    0.633  0.633    0.638        0.660  0.660    0.666
AG TF-IDF      64    0.575    0.548  0.548    0.558        0.593  0.612    0.615
Table 1: NMI Results for clustering each of the vectors generated using each method described in Section 4.3 with KMeans. For all datasets, using a mutual k-NN representation with one of the MST variants and combined with Path Neighbors provided the best NMI results.

Our starting point for all connectivity methods is the mutual k-NN graph, which we compare against default UMAP with its standard k-NN graph. As described in Section 3.1, we tested NN, which connects each isolated vertex to its nearest neighbor, MST-min, which adds the minimum number of edges from the minimum spanning tree (MST), and MST-all, which adds all the edges from the minimum spanning tree. This gives us a connected graph $G^{\prime}$. Next, we consider two methods for obtaining the local neighborhood $\mathcal{M}_i$ of each point, which is used for the final graph $G$ (Section 3.2). For Adjacent Neighbors, $G=G^{\prime}$ as there is no change to the local neighborhood. For Path Neighbors, we additionally consider nodes from $G^{\prime}$ which are more than one hop away until we have $k$ neighbors, to obtain the ‘new’ local neighborhood $\mathcal{M}_i$ which is used for $G$.

Hyperparameter Search

For all methods we perform a grid search to find the best $k$, the number of initial Nearest Neighbors (before applying the mutual NN restriction), by searching from 10 to 50 in increments of 5. We find the best min_dist, a hyperparameter that controls how tightly UMAP packs points together, by searching from 0 to 1 in increments of 0.1. We use Euclidean distance for the image datasets, Jaccard distance for the Count vectors, and cosine distance for the TF-IDF vectors (these distance metrics were best for the original UMAP embeddings).
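A sketch of this grid search for the default UMAP baseline, where the selection criterion (NMI of the downstream KMeans clusters) is an assumption spelled out for illustration:

# Sketch of the hyperparameter grid search: k in {10, 15, ..., 50} and
# min_dist in {0.0, 0.1, ..., 1.0}, scored by downstream clustering NMI.
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def grid_search_umap(X, labels, n_classes, metric="euclidean", n_components=2):
    best_params, best_nmi = None, -1.0
    for k in range(10, 55, 5):
        for min_dist in np.round(np.arange(0.0, 1.01, 0.1), 1):
            low_dim = umap.UMAP(n_neighbors=k, min_dist=float(min_dist),
                                n_components=n_components, metric=metric).fit_transform(X)
            pred = KMeans(n_clusters=n_classes).fit_predict(low_dim)
            nmi = normalized_mutual_info_score(labels, pred)
            if nmi > best_nmi:
                best_params, best_nmi = (k, float(min_dist)), nmi
    return best_params, best_nmi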

5 Results

From Table 1, we see that MST variants combined with Path Neighbors to find $\mathcal{M}_i$ consistently produced better clustering results across all datasets for both 2 and 64 dimensions ($p<0.01$, two-tailed t-test for original UMAP vs Path Neighbors + MST-min). As a first step to uncover why, we present the 2D projections generated for MNIST, FMNIST, and 20NG Count vectors using each method in Fig. 3. We observe that MST variants combined with Path Neighbors consistently produce clearer separation between classes and fewer “random projections” (better local structure), and preserve the global structure, which leads to consistently better clustering results (Table 1).

For MNIST, we see that the global structure was preserved among different digits, such as having 1 and 0 at far corners and placing similar digits such as 4, 7, 9 closer together. There is also more separation within the groups of similar digits (4, 7, 9). Similarly for the FMNIST dataset, the vectors using the aforementioned method preserved the global structure separating the clothing classes (T-shirt, Coat, etc.) from the footwear classes (Sandal, Sneaker, Ankle-boot) while also depicting a clearer separation between the footwear classes. This contrasts with original UMAP, which has poorer separation between similar classes. Finally for 20NG, the generated vectors create a better distinction between similar subjects such as the recreation (rec) topics.

In the following sections, we explore how the mutual k-NN graph affects separation of classes (5.1), how connectivity reduces random projections (5.2), and how the choice of local neighborhood affects the structure of the final vectors (5.3).

Figure 3: Visual comparisons between all the methods tested in Section 4.3 for MNIST, FMNIST, and 20NG Count. We observe that using the MST variants for Connectivity combined with Path Neighbors improves the separation between locally similar classes while also preserving the global relationships that occur in the datasets.
Figure 4: Plots comparing how NMI varied based on the Connectivity and Local Neighborhood used. We also plot the variances of the 2D vectors generated for each class from the 20NG Count vectors for original UMAP, Adjacent Neighbors, and Path Neighbors (using MST-all).

5.1 Mutual k-NN vs k-NN Graph (default UMAP)

Using a mutual k-NN representation results in improved separation between similar classes. In general, we observe that for most of the mutual k-NN graph based vectors (Fig. 3), there is better separation between similar classes than in the original UMAP vectors, regardless of Connectivity (NN, MST variants). We observed the desired separation between similar classes such as the 4, 7, 9 in MNIST and the footwear classes in FMNIST. Mutual k-NN graphs have previously been shown to be a useful method for removing edges between points that are only loosely similar, directly reducing distance concentration and hub effects.

5.2 Connectivity for Mutual k-NN

We consider three methods for connecting the mutual k-NN graph. In terms of connectivity, NN < MST-min < MST-all.

NN performs worse than the MST variants, with vectors that are randomly scattered in 2D space. For both MNIST and FMNIST, we see that NN, which connects isolated vertices to their nearest neighbor, had multiple small clusters of points scattered throughout the vector space. Given that KMeans is sensitive to outliers, these randomly projected points negatively affect clustering performance, as seen in Table 1. Since a mutual k-NN graph only retains a subset of the edges from the original k-NN graph, it can result in a very sparse representation. When constructing the mutual k-NN graph, we observed that approximately 1000 (3%) points were separated into small components for MNIST and approximately 7000 (10%) points for FMNIST. Our results show that stronger notions of connectivity than NN are required.

We would expect that higher connectivity, which reduces random scattering of points, would be better for clustering. However, we observe that too much connectivity, from using all the edges of the MST (MST-all) with Path Neighbors, can hurt performance on FMNIST (Section 5.3).

5.3 Local Neighborhood for Connected Mutual k-NN graph

We consider two methods, Adjacent Neighbors and Path Neighbors, for finding the local neighborhood $\mathcal{M}_i$ of each point, after obtaining a connected mutual k-NN graph $G^{\prime}$.

Path Neighbors achieves the best clustering performance together with MST-min. We generally observe similar clustering performance whether we use the minimum number of edges from the MST (MST-min) or all the edges (MST-all). In the case of FMNIST, however, using MST-all with Path Neighbors results in worse clustering performance (0.698 for MST-min vs 0.649 for MST-all at 2 dimensions, and 0.698 vs 0.679 at 64 dimensions). Although using Adjacent vs Path Neighbors resulted in similar clustering performance for the image based datasets, Path Neighbors produced better results for the text based datasets. From Table 1, both MST-min and MST-all produced better results when using Path vs Adjacent Neighbors for text based datasets ($p<0.05$).

Adjacent Neighbors produces poorer local structure than Path Neighbors. Visually, the vectors generated using Adjacent Neighbors with MST-min result in dispersed dense clusters of points, e.g., the footwear classes in FMNIST and the recreation topics in 20NG. However, when we apply Path Neighbors, the groups of points belonging to the same class are less dispersed (Fig. 3). This indicates that Adjacent Neighbors produces poorer local structure than Path Neighbors, i.e. datapoints of the same class are far from each other.

To investigate further why Adjacent Neighbors produces worse results for text datasets, we compute the variance along the dimensions of the 2D vectors for each class from the 20NG Count vectors and plot them in Fig. 4. We find that Adjacent Neighbors has greater variance for each class than both Path Neighbors and original UMAP, indicating greater dispersion of points within each class. High within-class variance is bad for clustering and indicates a poor lower dimensional representation, as there is no distinct range of values associated with the class label.
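A sketch of the per-class variance computation behind Fig. 4 (illustrative):

# Variance of the 2D vectors along each dimension, computed separately per class.
# Higher per-class variance indicates that points of the same class are more dispersed.
import numpy as np

def per_class_variance(embedding_2d, labels):
    """embedding_2d: (N, 2) low-dimensional vectors; labels: (N,) class labels."""
    return {c: embedding_2d[labels == c].var(axis=0)       # variance along each dimension
            for c in np.unique(labels)}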

Path Neighbors increases the connectivity of the final $G$, and can therefore rely on a more refined notion of $G^{\prime}$ with MST-min instead of MST-all. We can interpret Path Neighbors as a method which strictly increases the general connectivity of $G$. Consider the local neighborhood $\mathcal{M}_i$ for Adjacent Neighbors, which is not guaranteed to have $k$ connected neighbors. Points with smaller $\mathcal{M}_i$ will be close to primarily a few adjacent neighbors and repelled further away from the other points. This creates small groups of points belonging to the same class spread across the vector space. On the other hand, for original UMAP and Path Neighbors vectors, $|\mathcal{M}_i|=k$, and local groups of points are more likely to be connected to other groups within the same class. This increase in connectivity explains why, visually, the Path Neighbors method generates vectors which are less dispersed (within class) than the Adjacent Neighbors method, while still preserving the underlying structure of the data. Across Table 1, we see that Path Neighbors performs consistently well with MST-min across both image and text datasets, and allows more flexibility in building a locally and globally connected fuzzy simplicial set.

5.4 Number of Dimensions

We also found that clustering performance was consistent across dimensions for the image datasets but was better at higher dimensions for the text datasets with TF-IDF representations. Table 1 shows results for 2 and 64 dimensions. While it is not surprising that 64 dimensions allow the high-dimensional text datasets (>26,000 features for AG and >34,000 features for 20NG) to preserve more information, it is interesting that MST variants with Path Neighbors do not produce worse results at 2 vs 64 dimensions.

6 Conclusion

The initialization of a weighted connected graph is critical to the success of topology based dimensionality reduction methods. In this work, we established how starting with stricter conditions of connectivity (mutual k-NN graph vs standard k-NN graph) results in a better topology. Next, using a flexible method to expand the local neighborhood, and therefore the connectivity of the final graph, is best done on a minimally connected (MST-min) mutual k-NN graph. Visualisations of the resulting vectors show better separation between similar classes while preserving the overall structure of the data. Our quantitative experiments indicate these vectors consistently produce better clustering (relative gains of 5-18%) across all datasets for both 2 and 64 dimensions despite the simplicity of the graph methods, highlighting the role of graph connectivity in topology based dimensionality reduction methods.

Acknowledgments

We thank the anonymous reviewers for helpful feedback. Also Milind Agarwal, Desh Raj, Taryn Wong, and Jinyi Yang for proof-reading and Kelly Marchisio for discussions at an early stage of this project.

References

  • Abbas and Shoukry (2012) Abbas, M. A.; and Shoukry, A. A. 2012. CMUNE: A clustering using mutual nearest neighbors algorithm. In 2012 11th International Conference on Information Science, Signal Processing and their Applications (ISSPA), 1192–1197.
  • Allaoui, Kherfi, and Cheriet (2020) Allaoui, M.; Kherfi, M. L.; and Cheriet, A. 2020. Considerably Improving Clustering Algorithms Using UMAP Dimensionality Reduction Technique: A Comparative Study. In El Moataz, A.; Mammass, D.; Mansouri, A.; and Nouboud, F., eds., Image and Signal Processing, 317–325. Cham: Springer International Publishing. ISBN 978-3-030-51935-3.
  • Balasubramanian et al. (2002) Balasubramanian, M.; Schwartz, E. L.; Tenenbaum, J. B.; de Silva, V.; and Langford, J. C. 2002. The isomap algorithm and topological stability. Science, 295(5552): 7–7.
  • Becht et al. (2019) Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.; Kwok, I.; Ng, L.; Ginhoux, F.; and Newell, E. 2019. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37: 38–44.
  • de Sousa, Rezende, and Batista (2013) de Sousa, C. A. R.; Rezende, S. O.; and Batista, G. E. 2013. Influence of graph construction on semi-supervised learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 160–175. Springer.
  • Diaz-Papkovich, Anderson-Trocmé, and Gravel (2020) Diaz-Papkovich, A.; Anderson-Trocmé, L.; and Gravel, S. 2020. A review of UMAP in population genetics. Journal of Human Genetics.
  • Dinu and Baroni (2015) Dinu, G.; and Baroni, M. 2015. Improving zero-shot learning by mitigating the hubness problem. CoRR, abs/1412.6568.
  • Dong, Moses, and Li (2011) Dong, W.; Moses, C.; and Li, K. 2011. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, 577–586.
  • Feldbauer and Flexer (2018) Feldbauer, R.; and Flexer, A. 2018. A comprehensive empirical comparison of hubness reduction in high-dimensional spaces. Knowledge and Information Systems, 59: 137 – 166.
  • Jégou et al. (2010) Jégou, H.; Schmid, C.; Harzallah, H.; and Verbeek, J. 2010. Accurate Image Search Using the Contextual Dissimilarity Measure. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32: 2 – 11.
  • Kayal (2021) Kayal, S. 2021. Unsupervised Sentence-embeddings by Manifold Approximation and Projection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 1–11. Online: Association for Computational Linguistics.
  • Kobak and Linderman (2021) Kobak, D.; and Linderman, G. 2021. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, 39: 1–2.
  • LeCun and Cortes (2010) LeCun, Y.; and Cortes, C. 2010. MNIST handwritten digit database.
  • Liu and Zhang (2012) Liu, H.; and Zhang, S. 2012. Noisy data elimination using mutual k-nearest neighbor for classification mining. Journal of Systems and Software, 85(5): 1067–1074.
  • Maier, Hein, and Luxburg (2009) Maier, M.; Hein, M.; and Luxburg, U. 2009. Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theoretical Computer Science, 410.
  • McInnes, Healy, and Melville (2018) McInnes, L.; Healy, J.; and Melville, J. 2018. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
  • Ozaki et al. (2011) Ozaki, K.; Shimbo, M.; Komachi, M.; and Matsumoto, Y. 2011. Using the Mutual k-Nearest Neighbor Graphs for Semi-supervised Classification on Natural Language Data. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, 154–162. Portland, Oregon, USA: Association for Computational Linguistics.
  • Radovanović, Nanopoulos, and Ivanovic (2010) Radovanović, M.; Nanopoulos, A.; and Ivanovic, M. 2010. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. J. Mach. Learn. Res., 11: 2487–2531.
  • Rother et al. (2020) Rother, D.; Haider, T.; Eger, S.; et al. 2020. CMCE at SemEval-2020 Task 1: Clustering on Manifolds of Contextualized Embeddings to Detect Historical Meaning Shifts. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, 187–193. Barcelona (online): International Committee for Computational Linguistics.
  • Sardana and Bhatnagar (2014) Sardana, D.; and Bhatnagar, R. 2014. Graph Clustering Using Mutual K-Nearest Neighbors. In Ślezak, D.; Schaefer, G.; Vuong, S. T.; and Kim, Y.-S., eds., Active Media Technology, 35–48. Cham: Springer International Publishing. ISBN 978-3-319-09912-5.
  • Tomašev and Buza (2015) Tomašev, N.; and Buza, K. 2015. Hubness-aware kNN classification of high-dimensional data in presence of label noise. Neurocomputing, 160: 157–172.
  • Tomašev et al. (2014) Tomašev, N.; Radovanovic, M.; Mladenic, D.; and Ivanovic, M. 2014. The Role of Hubness in Clustering High-Dimensional Data. IEEE Transactions on Knowledge and Data Engineering, 26(3): 739–751.
  • Van der Maaten and Hinton (2008) Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research, 9(11).
  • Vermeulen et al. (2021) Vermeulen, M.; Smith, K.; Eremin, K.; Rayner, G.; and Walton, M. 2021. Application of Uniform Manifold Approximation and Projection (UMAP) in spectral imaging of artworks. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 252: 119547.
  • Xiao, Rasul, and Vollgraf (2017) Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv, abs/1708.07747.
  • Zhang, Zhao, and LeCun (2015) Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28: 649–657.
Accuracy                            Adjacent Neighbors            Path Neighbors
Dataset        Dim   UMAP     NN     MST-min  MST-all      NN     MST-min  MST-all
MNIST           2    0.852    0.966  0.968    0.968        0.929  0.968    0.968
FMNIST          2    0.578    0.670  0.665    0.692        0.610  0.693    0.549
20NG Count      2    0.489    0.524  0.520    0.531        0.562  0.570    0.558
20NG TF-IDF     2    0.551    0.531  0.511    0.538        0.582  0.589    0.582
AG Count        2    0.802    0.831  0.823    0.833        0.827  0.859    0.865
AG TF-IDF       2    0.805    0.751  0.755    0.795        0.780  0.842    0.841
MNIST          64    0.824    0.950  0.968    0.968        0.953  0.967    0.968
FMNIST         64    0.558    0.653  0.666    0.663        0.643  0.670    0.651
20NG Count     64    0.515    0.565  0.586    0.583        0.592  0.589    0.589
20NG TF-IDF    64    0.650    0.621  0.621    0.622        0.658  0.663    0.663
AG Count       64    0.810    0.854  0.857    0.857        0.869  0.869    0.873
AG TF-IDF      64    0.810    0.804  0.802    0.809        0.840  0.842    0.847
Table 2: Accuracy Results for clustering each of the vectors generated using each method described in Section 4.3 with KMeans.
Purity                              Adjacent Neighbors            Path Neighbors
Dataset        Dim   UMAP     NN     MST-min  MST-all      NN     MST-min  MST-all
MNIST           2    0.888    0.966  0.968    0.968        0.939  0.968    0.968
FMNIST          2    0.639    0.700  0.707    0.721        0.677  0.724    0.645
20NG Count      2    0.508    0.538  0.535    0.547        0.576  0.583    0.579
20NG TF-IDF     2    0.570    0.554  0.537    0.563        0.600  0.606    0.606
AG Count        2    0.815    0.831  0.823    0.833        0.827  0.859    0.865
AG TF-IDF       2    0.805    0.751  0.755    0.795        0.700  0.829    0.826
MNIST          64    0.874    0.950  0.968    0.968        0.953  0.967    0.968
FMNIST         64    0.635    0.702  0.716    0.710        0.657  0.709    0.669
20NG Count     64    0.534    0.580  0.603    0.601        0.609  0.607    0.607
20NG TF-IDF    64    0.662    0.640  0.639    0.645        0.676  0.695    0.699
AG Count       64    0.820    0.854  0.854    0.857        0.869  0.869    0.873
AG TF-IDF      64    0.821    0.804  0.802    0.809        0.840  0.856    0.857
Table 3: Purity Results for clustering each of the vectors generated using each method described in Section 4.3 with KMeans.
Adjusted Rand Index                 Adjacent Neighbors            Path Neighbors
Dataset        Dim   UMAP     NN     MST-min  MST-all      NN     MST-min  MST-all
MNIST           2    0.810    0.927  0.931    0.931        0.885  0.931    0.931
FMNIST          2    0.485    0.547  0.535    0.558        0.523  0.580    0.492
20NG Count      2    0.309    0.348  0.347    0.356        0.386  0.394    0.381
20NG TF-IDF     2    0.398    0.351  0.330    0.349        0.403  0.421    0.413
AG Count        2    0.620    0.614  0.600    0.620        0.603  0.669    0.669
AG TF-IDF       2    0.530    0.464  0.498    0.540        0.519  0.569    0.571
MNIST          64    0.805    0.900  0.930    0.931        0.920  0.930    0.932
FMNIST         64    0.484    0.500  0.547    0.535        0.503  0.539    0.499
20NG Count     64    0.329    0.401  0.406    0.405        0.419  0.420    0.420
20NG TF-IDF    64    0.450    0.436  0.434    0.442        0.481  0.493    0.495
AG Count       64    0.660    0.662  0.662    0.668        0.691  0.691    0.699
AG TF-IDF      64    0.624    0.563  0.576    0.560        0.630  0.665    0.655
Table 4: Adjusted Rand Index Results for clustering each of the vectors generated using each method described in Section 4.3 with KMeans.