
Clustering with UMAP: Why and How Connectivity Matters

Ayush Dalmia, Suzanna Sia
Abstract

Topology based dimensionality reduction methods such as t-SNE and UMAP have strong mathematical foundations and are based on the intuition that the topology in low dimensions should be close to that of high dimensions. Given that the initial structure is a precursor to the success of the algorithm, this naturally raises the question: What makes a “good” topological structure for dimensionality reduction? In this paper, which focuses on UMAP, we study the effects of node connectivity (k-Nearest Neighbors vs mutual k-Nearest Neighbors) and relative neighborhood (Adjacent vs Path Neighbors) on dimensionality reduction. We explore these concepts through extensive ablation studies on 4 standard image and text datasets (MNIST, FMNIST, 20NG, AG), reducing to 2 and 64 dimensions. Our findings indicate that a more refined notion of connectivity (mutual k-Nearest Neighbors with minimum spanning tree) together with a flexible method of constructing the local neighborhood (Path Neighbors) can achieve a much better representation than default UMAP, as measured by downstream clustering performance.

1 Introduction

Dimension reduction techniques have become a standard tool in both machine learning and data analysis. Topology based dimensionality reduction techniques such as t-SNE (Van der Maaten and Hinton 2008) and Uniform Manifold Approximation and Projection (UMAP; McInnes, Healy, and Melville (2018)) have been increasing in popularity due to their effectiveness in data visualisation and finding better representations for downstream tasks such as classification and clustering. While various topological dimensionality reduction methods differ in mathematical details for optimizing the low-dimensional manifold, these methods typically start with an initial graph structure in high dimensions. This naturally raises the question: What makes a “good” topological structure for dimensionality reduction? We explore different notions of connectivity for the initial graph structure and evaluate the performance of the resulting low-dimensional representation on a downstream clustering task using Normalised Mutual Information (NMI).

The most general initialization can be straightforwardly obtained by a weighted k-nearest neighbor (k-NN) graph. However, the k-NN graph is relatively susceptible to the “curse of dimensionality” and the associated distance concentration effect, where distances are similar in high dimensions, as well as the hub effect, where certain points become highly influential when highly connected. This skews the local representation of high dimensional data, deteriorating its performance for various similarity-based machine learning tasks such as classification (Tomašev and Buza 2015; Dinu and Baroni 2015), clustering (Tomašev et al. 2014) and most notably dimension reduction (Feldbauer and Flexer 2018).

In this paper, we focus on UMAP, which has advantages over other dimension reduction techniques such as PCA, Isomap (Balasubramanian et al. 2002), and t-SNE in visualization quality, in preserving a dataset’s global structure (inter-class distances) and local structure (intra-class distances), and in runtime performance. UMAP has been increasingly used in various real world applications such as vision (Allaoui, Kherfi, and Cheriet 2020; Vermeulen et al. 2021), NLP (Kayal 2021; Rother et al. 2020), and population genetics (Diaz-Papkovich, Anderson-Trocmé, and Gravel 2020; Becht et al. 2019), with competitive performance on unsupervised clustering and outlier detection among other tasks.

In Section 3, we propose a refinement in the graph construction stage of the UMAP algorithm that uses a mutual k-NN graph instead of k-NN, to reduce the undesired distance concentration and hub effects. A consequence of applying the mutual k-NN graph is that it can result in isolated components that adversely affect the initialization of the lower dimensional vectors. Therefore, in Section 3.1 we explore different methods suggested by prior research for linking the isolated components, such as connecting adjacent neighbors and adopting the edges from a minimum spanning tree (MST). The resulting connected graph can then be refined (Section 3.2) to obtain the neighborhood of each point using the Shortest Path Distance, so that non-adjacent neighbors (which we term Path Neighbors) can also be considered part of the local neighborhood.

In Section 4, we conduct experiments on 4 standard image and text datasets, and visualise the resulting representations after applying UMAP, which suggest how each Nearest Neighbor variant affects the low dimensional representation. The contributions of this paper are:

  • using a mutual k-NN graph rather than the original k-NN graph improves the separation between similar classes for both image and text datasets.

  • having a minimally connected graph by adding the minimum edges from the MST (MST-min) is best if we have access to flexible methods of constructing the local neighborhood (e.g. via MST-min with Shortest Path Distance neighbors (Path Neighbors)).

  • we show that simple graph processing can result in relative gains of 5-18% on NMI consistently across all datasets, and analyse the effect of graph processing on downstream representations.

2 Background

2.1 Understanding UMAP

UMAP first constructs a high dimensional graph representation of the data (graph construction phase), which is used to represent the “fuzzy simplicial complex”. Next it optimizes low-dimensional vectors to be as similar as possible to the graph representation in higher dimensions, by minimizing the cross-entropy of two fuzzy sets with the same underlying elements (datapoints).
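For orientation, the end-to-end pipeline can be reproduced with the reference umap-learn library; the snippet below is a minimal sketch with illustrative parameter values, not the settings tuned in our experiments.

# Minimal sketch of the default UMAP pipeline (umap-learn).
# Parameter values are illustrative, not the settings used in our experiments.
import numpy as np
import umap  # pip install umap-learn

X = np.random.rand(1000, 50)            # stand-in for a high-dimensional dataset

reducer = umap.UMAP(
    n_neighbors=15,      # size of the k-NN neighborhood used in graph construction
    min_dist=0.1,        # how tightly points are packed in the low-dimensional layout
    n_components=2,      # target dimensionality
    metric="euclidean",
)
embedding = reducer.fit_transform(X)     # (1000, 2) low-dimensional vectors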

Graph Construction

Our work is concerned with the graph construction phase of UMAP. In default UMAP, a weighted k-NN graph is constructed from the data, where each vertex represents a datapoint and edge weights represent the likelihood that two points are connected (McInnes, Healy, and Melville 2018). Formally, let $X=\{x_1,\cdots,x_N\}$ be a dataset with $N$ points, $x\in\mathbb{R}^M$. For each datapoint $x_i$, the set of points in its local neighborhood is computed using a distance metric $d:X\times X\rightarrow\mathbb{R}$. In default UMAP, the neighborhood is found using the nearest neighbor descent algorithm (Dong, Moses, and Li 2011). Then for each $x_i$, we have $\rho_i$, which reflects a connectivity constraint that data points are assumed to be locally connected. Here $\rho_i=\min\{d(x_i,x_j)\,|\,d(x_i,x_j)>0,x_j\in X\}$. This ensures that $x_i$ connects to at least one other datapoint. We define a weighted graph $G=(V,E,w)$ where the vertices $V$ are simply the set of datapoints $X$ and $w$ is the set of weights corresponding to $E$. $G$ represents the “fuzzy simplicial complex”. Details on how to construct $G$ can be found in McInnes, Healy, and Melville (2018).
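As an illustration of the connectivity constraint, $\rho_i$ can be read off directly from the k-NN distances. The sketch below uses exact nearest neighbors from scikit-learn for clarity (default UMAP instead uses the approximate nearest neighbor descent algorithm); it is illustrative, not the code used in our experiments.

# Sketch: rho_i is the distance from each point to its closest non-identical neighbor.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_connectivity_rho(X, k=15):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is returned as its own neighbor
    dists, _ = nn.kneighbors(X)                        # shape (N, k+1); column 0 is the self-distance 0
    return np.array([row[row > 0].min() for row in dists])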

The central focus of this paper is to demonstrate how refining the set of directed edges $E$, which reflects the connectivity of the graph in high-dimensional representations, affects the subsequent low-dimensional representations. The set of directed edges is $E=\{(x_i,x_j)\,|\,i\leq N,x_j\in\mathcal{M}_i\}$, where we define $\mathcal{M}_i$ to be the local neighborhood of points for $x_i$ that collectively determines the final set of directed edges $E$. In this work, our starting point for $\mathcal{M}_i$ is the mutual nearest neighbors, and we deal with the inherent asymmetry of this graph, which could result in data points being completely isolated (Section 3.1).

Optimizing lower dimensional representation

Details about this optimization are not critical to understanding our proposed method, and we present this for completeness. Given GG, the data is projected into lower dimensions via a force-directed graph layout algorithm. To initialize the lower dimensional vectors, UMAP uses the eigenvectors of a normalized graph Laplacian of the “fuzzy simplicial complex” affinity matrix, also known as spectral vectors. For theoretical foundations and mathematical details, we refer interested readers to (McInnes, Healy, and Melville 2018).

Figure 1: Visual comparisons between all the methods used in Section 3.1: (a) edge (blue) added by NN; (b) edge (blue) added by MST-min; (c) edge (blue) added by MST-all. Black edges represent edges from the mutual k-NN graph. Blue edges represent the edges added using each connectivity method: connecting isolated nodes by nearest neighbor (NN), connecting components with the minimum edges of a minimum spanning tree (MST-min), and with all the edges from the minimum spanning tree (MST-all).

2.2 Mutual k-Nearest Neighbors

The default choice of $\mathcal{M}_i$ is the nearest neighbors of $x_i$. However, previous research has shown that a weighted k-NN graph captures noisy relationships and may not be an accurate representation of the underlying local structure of a high dimensional dataset due to the “curse of dimensionality” (Liu and Zhang 2012; Radovanović, Nanopoulos, and Ivanovic 2010). Instead we utilize a weighted mutual k-nearest neighbor graph (weighted mutual k-NN graph) to represent the underlying structure of our data. A mutual k-NN graph is defined as a graph that only has an undirected edge between two vertices $x_i$ and $x_j$ if $x_i$ is in the k-nearest neighbors of $x_j$ and $x_j$ is in the k-nearest neighbors of $x_i$. Therefore our local neighborhood is $\mathcal{M}_i=\{x_j\,|\,x_j\in knn(x_i),x_i\in knn(x_j)\}$.
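This definition translates directly into a graph construction: keep an edge only if it appears in both directions of the k-NN graph. The sketch below uses scikit-learn and scipy sparse matrices and is illustrative rather than the exact code used in our experiments.

# Sketch: build a mutual k-NN graph by keeping only edges (i, j) where
# x_j is in knn(x_i) AND x_i is in knn(x_j).
from sklearn.neighbors import kneighbors_graph

def mutual_knn_graph(X, k=15, metric="euclidean"):
    # Directed k-NN graph: knn[i, j] = d(x_i, x_j) if x_j is among the
    # k nearest neighbors of x_i, and 0 otherwise.
    knn = kneighbors_graph(X, n_neighbors=k, mode="distance", metric=metric)
    # Element-wise minimum with the transpose zeroes out one-directional edges,
    # leaving a symmetric matrix with only the mutual edges.
    return knn.minimum(knn.T)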

A mutual k-NN graph can hence be interpreted as a subgraph of the original k-NN graph. Mutual k-NN graphs have been shown to have many desirable properties compared to their k-NN counterparts when combating the “curse of dimensionality”. One such property is that edges in mutual k-NN graphs are a stronger indicator of similarity than the unidirectional nearest neighbor relationship, alleviating the distance concentration effect (Jégou et al. 2010). Mutual k-NN graphs are also less prone to the hub effect than the regular k-NN graph, since each vertex in a mutual k-NN graph is guaranteed to have degree at most k (Ozaki et al. 2011). They have also shown strong performance when used in both unsupervised clustering (Maier, Hein, and Luxburg 2009; Sardana and Bhatnagar 2014; Abbas and Shoukry 2012) and semi-supervised graph-based classification algorithms (Ozaki et al. 2011; de Sousa, Rezende, and Batista 2013). In Section 3, we describe how we utilize the mutual k-NN graph to create our high dimensional graph representation.

3 Methodology

In order to obtain the fuzzy simplicial complex for UMAP, our starting point is a weighted mutual k-NN graph as motivated in Section 2.2. However, this graph may have completely isolated vertices and disconnected components, violating UMAP’s assumption that the manifold is locally connected (Section 2.1). In addition, having many components is undesirable for the initialization of the lower dimensional spectral vectors (McInnes, Healy, and Melville 2018). Therefore, we consider different connectivity methods, such as using the minimum spanning tree (MST) to add edges, which we elaborate on in Section 3.1.

Next, we consider that using a vertex’s adjacent neighborhood for $\mathcal{M}_i$ may not be an adequate representation, as this excludes all points that are more than 1 hop away. Hence we additionally refine $\mathcal{M}_i$ using the “new” neighborhood calculated with the Shortest Path Distance to other nodes on the connected graph. This is elaborated in Section 3.2. Our method can thus be viewed as first building a loosely connected (mutual k-NN) graph, then increasing connectivity using MST variants, and finally refining $\mathcal{M}_i$ to include neighbors that are more than one hop away.

3.1 Increasing Connectivity for mutual k-NN

Since a mutual k-NN graph only retains a subset of the edges from the original k-NN graph (Section 2.2), the resulting mutual k-NN graph often contains disconnected components and potentially isolated vertices. Isolated vertices violate one of UMAP’s primary assumptions that the underlying manifold is locally connected, and disconnected components negatively impact the spectral initialization (McInnes, Healy, and Melville 2018). Spectral initialization is crucial for preserving the global structure of the high dimensional data in the low dimensional vectors (Kobak and Linderman 2021).

Naively, we could increase $k$, the number of nearest neighbors used for calculating the mutual k-NN graph, to obtain a more connected graphical representation. However, increasing $k$ does not guarantee that the resulting graph has no isolated vertices or fewer disconnected components. We therefore consider the following methods for increasing the connectivity of the mutual k-NN graph, presented in order of increasing connectivity.

  1. NN: Add an undirected edge between each isolated vertex and its original nearest neighbor (de Sousa, Rezende, and Batista 2013).

  2. MST-min: To achieve a connected graph, add the minimum number of edges from a maximum spanning tree to the mutual k-NN graph that has been weighted with similarity-based metrics (Ozaki et al. 2011). We adapt this by calculating the minimum spanning tree over distances (Algorithm 1).

  3. MST-all: Add all the edges of the MST.

We call this graph $G^{\prime}$, the locally connected mutual k-NN graph, and run experiments with $G^{\prime}$ in Section 4.

3.2 Finding New Local Neighborhood $\mathcal{M}_i$

Next, we wish to obtain a “fuzzy simplicial complex” representation as described in Section 2.1 from the connected mutual k-NN graph $G^{\prime}$ (Section 3.1). To achieve this, we first need to obtain the new local neighborhood $\mathcal{M}_i$ for each $x_i\in X$ using $G^{\prime}$. The most straightforward way is to use adjacent vertices in $G^{\prime}$. However, this excludes all points that are more than 1 hop away, even if they are relatively close.

Instead, we compute $G$ from $G^{\prime}$ by refining the local neighborhood $\mathcal{M}_i$ using the Shortest Path Distance between nodes on $G^{\prime}$, computed with Dijkstra’s algorithm (see Fig. 2 for an example; Algorithm 2). This shortest path distance can be considered a new distance metric, as it directly aligns with UMAP’s definition of an extended-pseudo-metric space. That is, we replace $d(x_i,x_j)$ with $d_{\mathrm{path}}(x_i,x_j)$ in the graph construction phase (Section 2.1).

Figure 2: Example of the local neighborhood we may encounter. If we are trying to find the neighborhood for A, then we would only select nodes B, C, D with Adjacent Neighbors. However, this excludes closer nodes such as E and F; by using Path Neighbors, we can include E and F in the neighborhood of A.
Input: The mutual k-NN graph $mkNN$, and the k-NN graph $kNN$ with edge weights given by the distance between points
Output: Connected mutual k-NN graph $G^{\prime}$
Initialize $G^{\prime}$ as a copy of $mkNN$
$MST \leftarrow$ minimum weight spanning tree of $kNN$
$sort\_e \leftarrow$ edges of $MST$ sorted by ascending weight
foreach edge $e \in sort\_e$ do
      if $e$ connects two components in $G^{\prime}$ then
            Add undirected edge $e$ to $G^{\prime}$
      end if
end foreach
return $G^{\prime}$
Algorithm 1: Connect disconnected components in the mutual k-NN graph with MST-min
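For reference, a Python sketch of Algorithm 1 using scipy’s minimum spanning tree and a simple union-find over components; it is an illustrative re-implementation, not the exact experimental code.

# Sketch of Algorithm 1 (MST-min): add the lightest MST edges needed to merge
# the components of the mutual k-NN graph into a single connected graph G'.
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

def connect_with_mst_min(mutual_knn, knn):
    """mutual_knn, knn: sparse (n x n) distance matrices."""
    g_prime = mutual_knn.tolil(copy=True)

    # Components of the mutual k-NN graph before any edges are added.
    n_comp, labels = connected_components(mutual_knn, directed=False)

    # Union-find over component ids; each accepted edge merges two components.
    parent = list(range(n_comp))
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # MST of the original k-NN graph, edges visited in ascending weight order.
    mst = minimum_spanning_tree(knn).tocoo()
    order = np.argsort(mst.data)
    for i, j, w in zip(mst.row[order], mst.col[order], mst.data[order]):
        ci, cj = find(labels[i]), find(labels[j])
        if ci != cj:                        # edge connects two components in G'
            g_prime[i, j] = g_prime[j, i] = w
            parent[ci] = cj
            n_comp -= 1
            if n_comp == 1:                 # G' is now connected
                break
    return g_prime.tocsr()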
Input: Connected mutual k-NN graph $G^{\prime}$, number of new nearest neighbors to search for $k_{new}$
Output: Dictionary $\mathcal{M}$ that returns the new neighborhood for each point, and dictionary $\mathcal{M}_{dists}$ that returns the distances between each point and its nearest Path Neighbors
Initialize $\mathcal{M}$ and $\mathcal{M}_{dists}$
foreach $x_i \in X$ do
      Initialize the neighborhood $\mathcal{M}_i$ of $x_i$, and a min heap $Q$
      Insert each vertex $v$ adjacent to $x_i$ in $G^{\prime}$ into $Q$ with the distance between $v$ and $x_i$
      while $Q$ not empty and $|\mathcal{M}_i| < k_{new}$ do
            Extract the minimum distance vertex $y$ with distance $y_{path\_dist}$ from $Q$
            if $y \notin \mathcal{M}_i$ then
                  Append $y$ to $\mathcal{M}_i$ and $y_{path\_dist}$ to $\mathcal{M}_{dists}(x_i)$
                  foreach $v \in$ adjacent vertices of $y$ do
                        Insert $v$ into $Q$ with a distance of $y_{path\_dist}$ + distance between $y$ and $v$
                  end foreach
            end if
      end while
end foreach
return $\mathcal{M}$ and $\mathcal{M}_{dists}$
Algorithm 2: Find the new local neighborhood $\mathcal{M}_i$ (Path Neighbors) using Dijkstra’s algorithm
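Likewise, a Python sketch of Algorithm 2, using a heap-based Dijkstra search that terminates once $k_{new}$ Path Neighbors have been found for a point; the adjacency-list representation is an assumption made for readability.

# Sketch of Algorithm 2: for each point, run a truncated Dijkstra search on the
# connected mutual k-NN graph G' and keep the k_new closest reachable points
# (its Path Neighbors) together with their shortest path distances.
import heapq

def path_neighbors(g_prime, k_new):
    """g_prime: adjacency dict {node: [(neighbor, distance), ...]} of the connected graph."""
    M, M_dists = {}, {}
    for x in g_prime:
        neighbors, dists = [], []
        seen = set()
        # Min-heap of (path distance from x, vertex), seeded with the adjacent vertices of x.
        q = [(d, v) for v, d in g_prime[x]]
        heapq.heapify(q)
        while q and len(neighbors) < k_new:
            y_dist, y = heapq.heappop(q)
            if y in seen or y == x:
                continue
            seen.add(y)
            neighbors.append(y)
            dists.append(y_dist)
            # Push the neighbors of y with their path distance through y.
            for v, d in g_prime[y]:
                if v not in seen:
                    heapq.heappush(q, (y_dist + d, v))
        M[x], M_dists[x] = neighbors, dists
    return M, M_dists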

3.3 Computational Complexity

The complexity of constructing the mutual k-NN graph from the original k-NN graph is $O(kn)$. The additional complexity of connecting disconnected components varies depending on the method chosen. Adding edges between isolated vertices and their nearest neighbors (NN) is $O(n)$, while the MST variants have a known time complexity of $O(E\log V) = O(kn\log n)$: since we construct the MST from the original k-NN graph, we know that $E=nk$ and $V=n$. With the MST, we add either all the edges (MST-all), which has an upper bound of $O(n)$ since there are $n-1$ edges in the MST, or the minimum number of edges (MST-min) until we have one component, which has an upper bound of $O(n\log n)$ since we need to sort the edges first.

Finally, using Path Neighbors on the mutual k-NN graph has an additional cost since we have to perform a graph search. We use a variant of Dijkstra’s algorithm, which has a known time complexity of $O(V+E\log V)$. Running the search for all $V=n$ nodes with $E\leq nk$, we get a final complexity of $O(n(n+kn\log n))$. It is important to note that the worst case for Dijkstra’s algorithm assumes one may need to explore the whole graph, as the algorithm is typically used to find the shortest distance between two points. However, since we are using Dijkstra’s algorithm only to find the new closest points, in practice we are likely to terminate much earlier. This search can be further optimized by parallelizing the individual searches needed to find new nearest neighbors for each point.
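Summarizing the costs above (a sketch; constants and implementation details are omitted):

  • mutual k-NN graph from the k-NN graph: $O(kn)$

  • NN (connect isolated vertices): $O(n)$

  • MST of the k-NN graph: $O(E\log V) = O(kn\log n)$ with $E=nk$, $V=n$

  • adding MST-all / MST-min edges: $O(n)$ / $O(n\log n)$

  • Path Neighbors ($n$ truncated Dijkstra searches): $n \cdot O(n+kn\log n) = O(n(n+kn\log n))$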

4 Experiments

4.1 Datasets

We used two standard image classification datasets, MNIST (LeCun and Cortes 2010) and Fashion MNIST (FMNIST; Xiao, Rasul, and Vollgraf (2017)), and two standard text classification datasets, 20 Newsgroups (20NG; http://qwone.com/~jason/20Newsgroups/) and AG’s News Topic (AG; Zhang, Zhao, and LeCun (2015)), for evaluation. For the image classification datasets, no preprocessing was applied and each image was flattened to create a 1D vector. For the text classification datasets, we lowercase tokens, remove stopwords, punctuation and digits, and exclude words that appear in fewer than 5 documents. After preprocessing, we converted each dataset into both Count vectors and TF-IDF vectors. We opted to use the Count based representation of text as an illustration of datasets that have very high-dimensional and sparse representations (TF-IDF is a weighting on the sparse representation). While our method is applicable to word or document vectors, these are heavily dependent on the text encoder and it is more difficult to interpret findings as we do not know what the dense high-dimensional vector space represents in the first place. Code is available at https://github.com/adalmia96/umap-mnn.
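The text preprocessing can be reproduced with standard scikit-learn vectorizers; the snippet below is a sketch under that assumption (the token pattern, stopword list, and data loading are illustrative stand-ins, not the exact pipeline used in our experiments).

# Sketch of the text preprocessing: lowercase, drop stopwords/punctuation/digits,
# keep only words appearing in at least 5 documents, then build Count and TF-IDF vectors.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = fetch_20newsgroups(subset="all").data             # raw 20NG documents

vectorizer_args = dict(
    lowercase=True,
    stop_words="english",
    token_pattern=r"(?u)\b[a-zA-Z]{2,}\b",                # alphabetic tokens only
    min_df=5,                                             # word must appear in >= 5 documents
)
count_vectors = CountVectorizer(**vectorizer_args).fit_transform(docs)   # sparse counts
tfidf_vectors = TfidfVectorizer(**vectorizer_args).fit_transform(docs)   # sparse TF-IDF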

4.2 Evaluation and Performance Metrics

Quantitative Evaluation: We evaluate the methods by comparing the clustering performance of the resulting low dimensional vectors. We performed KMeans clustering, setting the number of clusters to the ground truth number of topics for each dataset. We use Normalised Mutual Information (NMI), a widely used clustering metric that ranges from 0 to 1 (1 being a perfect score), to evaluate the clusters; NMI was the primary metric used to evaluate clustering quality for UMAP and subsequent works (McInnes, Healy, and Melville 2018). As UMAP is a stochastic algorithm, the NMI scores presented in Table 1 are averaged across low-dimensional vectors from 5 random seeds.
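A sketch of this evaluation loop, where reduce stands in for whichever dimensionality reduction variant is being scored (the seeding scheme shown here is an assumption):

# Sketch of the quantitative evaluation: KMeans with the ground-truth number of
# topics, scored with NMI, averaged over 5 random seeds.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def mean_nmi(reduce, X, labels, n_classes, seeds=(0, 1, 2, 3, 4)):
    scores = []
    for seed in seeds:
        low_dim = reduce(X, random_state=seed)             # e.g. a UMAP variant
        pred = KMeans(n_clusters=n_classes, random_state=seed).fit_predict(low_dim)
        scores.append(normalized_mutual_info_score(labels, pred))
    return float(np.mean(scores))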

Qualitative Evaluation: In addition, we consider several desirable properties for dimensionality reduction in qualitative visualisations: preservation of local structure, preservation of global structure, and separation of classes. A dimensionality reduction method which preserves the local structure means that intra-class distances are preserved, i.e. datapoints belonging to the same class are close to each other. A method which preserves the global structure means that inter-class distances are preserved, i.e. clusters of similar classes appear closer together and clusters of dissimilar classes appear farther from each other. Finally, the classes should be well separated (not clumped together).

4.3 Experimental Conditions

Normalized Mutual Information       Adjacent Neighbors            Path Neighbors
Dataset        Dim   UMAP     NN     MST-min  MST-all      NN     MST-min  MST-all
MNIST           2    0.854    0.917  0.920    0.920        0.898  0.920    0.920
FMNIST          2    0.615    0.669  0.693    0.694        0.648  0.698    0.649
20NG Count      2    0.446    0.467  0.474    0.481        0.511  0.523    0.524
20NG TF-IDF     2    0.478    0.461  0.456    0.461        0.521  0.526    0.522
AG Count        2    0.589    0.582  0.571    0.589        0.589  0.630    0.642
AG TF-IDF       2    0.503    0.455  0.475    0.506        0.520  0.540    0.545
MNIST          64    0.862    0.915  0.919    0.920        0.910  0.919    0.918
FMNIST         64    0.626    0.685  0.703    0.703        0.667  0.698    0.679
20NG Count     64    0.487    0.525  0.529    0.535        0.560  0.563    0.565
20NG TF-IDF    64    0.566    0.556  0.556    0.560        0.592  0.594    0.595
AG Count       64    0.612    0.633  0.633    0.638        0.660  0.660    0.666
AG TF-IDF      64    0.575    0.548  0.548    0.558        0.593  0.612    0.615
Table 1: NMI Results for clustering each of the vectors generated using each method described in Section 4.3 with KMeans. For all datasets, using a mutual k-NN representation with one of the MST variants and combined with Path Neighbors provided the best NMI results.

Our starting point for all connectivity methods is the mutual k-NN graph, which we compare against default UMAP with its standard k-NN graph. As described in Section 3.1, we tested NN, which connects each isolated vertex to its nearest neighbor, MST-min, which adds the minimum number of edges from the minimum spanning tree (MST), and MST-all, which adds all the edges from the minimum spanning tree. This gives us a connected graph $G^{\prime}$. Next, we consider two methods for obtaining the local neighborhood $\mathcal{M}_i$ of each point, which is used for the final graph $G$ (Section 3.2). For Adjacent Neighbors, $G=G^{\prime}$ as there is no change to the local neighborhood. For Path Neighbors, we additionally consider nodes from $G^{\prime}$ which are more than one hop away until we have $k$ neighbors, to obtain the ‘new’ local neighborhood $\mathcal{M}_i$ which is used for $G$.

Hyperparameter Search

For all methods we perform a grid search to find the best $k$, the number of initial Nearest Neighbors (before applying the mutual NN restriction), by searching from 10 to 50 in increments of 5. We find the best min_dist, a hyperparameter that controls how tightly UMAP packs points together, by searching from 0 to 1 in increments of 0.1. We use Euclidean distance for the image datasets, Jaccard distance for the Count vectors, and cosine distance for the TF-IDF vectors (these distance metrics were best for the original UMAP embeddings).
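A sketch of this grid search for the default UMAP baseline, where the selection criterion (NMI of the downstream KMeans clusters) is an assumption spelled out for illustration:

# Sketch of the hyperparameter grid search: k in {10, 15, ..., 50} and
# min_dist in {0.0, 0.1, ..., 1.0}, scored by downstream clustering NMI.
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def grid_search_umap(X, labels, n_classes, metric="euclidean", n_components=2):
    best_params, best_nmi = None, -1.0
    for k in range(10, 55, 5):
        for min_dist in np.round(np.arange(0.0, 1.01, 0.1), 1):
            low_dim = umap.UMAP(n_neighbors=k, min_dist=float(min_dist),
                                n_components=n_components, metric=metric).fit_transform(X)
            pred = KMeans(n_clusters=n_classes).fit_predict(low_dim)
            nmi = normalized_mutual_info_score(labels, pred)
            if nmi > best_nmi:
                best_params, best_nmi = (k, float(min_dist)), nmi
    return best_params, best_nmi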

5 Results

From Table 1, we see that MST variants combined with Path Neighbors to find $\mathcal{M}_i$ consistently produced better clustering results across all datasets for both 2 and 64 dimensions ($p<0.01$, two-tailed t-test for original UMAP vs Path Neighbors + MST-min). As a first step to uncover why, we present the 2D projections generated for MNIST, FMNIST, and 20NG Count vectors using each method in Fig. 3. We observe that MST variants combined with Path Neighbors consistently produce clearer separation between classes and fewer “random projections” (better local structure), and preserve the global structure, which leads to consistently better clustering results (Table 1).

For MNIST, we see that the global structure was preserved among different digits, such as having 1 and 0 at far corners and placing similar digits such as 4, 7, 9 closer together. There is also more separation within the groups of similar digits (4, 7, 9). Similarly for the FMNIST dataset, the vectors using the aforementioned method preserved the global structure separating the clothing classes (T-shirt, Coat, etc.) from the footwear classes (Sandal, Sneaker, Ankle-boot) while also depicting a clearer separation between the footwear classes. This contrasts with original UMAP, which has poorer separation between similar classes. Finally for 20NG, the generated vectors create a better distinction between similar subjects such as the recreation (rec) topics.

In the following sections, we explore how the mutual k-NN graph affects separation of classes (5.1), how connectivity reduces random projections (5.2), and how the choice of local neighborhood affects the structure of the final vectors (5.3).

Figure 3: Visual comparisons between all the methods tested in Section 4.3 for MNIST, FMNIST, and 20NG Count. We observe that using the MST variants for Connectivity combined with Path Neighbors improves the separation between locally similar classes while also preserving the global relationships that occur in the datasets.
Figure 4: Plots comparing how NMI varied based on the Connectivity and Local Neighborhood used. We also plot the variances of the 2D vectors generated for each class from the 20NG Count vectors for original UMAP, Adjacent Neighbors, and Path Neighbors (using MST-all).

5.1 Mutual k-NN vs k-NN Graph (default UMAP)

Using a mutual k-NN representation results in improved separation between similar classes. In general, we observe that for most of the mutual k-NN graph based vectors (Fig. 3), there is better separation between similar classes than in the original UMAP vectors, regardless of Connectivity (NN, MST variants). We observed the desired separation between similar classes such as the 4, 7, 9 in MNIST and the footwear classes in FMNIST. Mutual k-NN graphs have previously been shown to be a useful method for removing edges between points that are only loosely similar, directly reducing distance concentration and hub effects.

5.2 Connectivity for Mutual k-NN

We consider three methods for connecting the mutual k-NN graph. In terms of connectivity, NN < MST-min < MST-all.

NN performs worse than the MST variants, with vectors that are randomly scattered in 2D space. For both MNIST and FMNIST, we see that NN, which connects isolated vertices to their nearest neighbor, had multiple small clusters of points scattered throughout the vector space. Given that KMeans is sensitive to outliers, these randomly projected points negatively affect clustering performance, as seen in Table 1. Since a mutual k-NN graph only retains a subset of the edges from the original k-NN graph, it can result in a very sparse representation. When constructing the mutual k-NN graph, we observed that approximately 1000 (3%) points were separated into small components for MNIST and approximately 7000 (10%) points for FMNIST. Our results show that stronger notions of connectivity than NN are required.

We would expect that higher connectivity, which reduces random scattering of points, would be better for clustering. However, we observe that too much connectivity, from using all the edges of the MST (MST-all) with Path Neighbors, can hurt performance on FMNIST (Section 5.3).

5.3 Local Neighborhood for Connected Mutual k-NN graph

We consider two methods, Adjacent Neighbors and Path Neighbors, for finding the local neighborhood $\mathcal{M}_i$ of each point, after obtaining a connected mutual k-NN graph $G^{\prime}$.

Path Neighbors achieves the best clustering performance together with MST-min. We generally observe similar clustering performance whether we use the minimum number of edges from the MST (MST-min) or all the edges (MST-all). In the case of FMNIST, however, using MST-all with Path Neighbors results in worse clustering performance (0.698 for MST-min vs 0.649 for MST-all at 2 dimensions, and 0.698 vs 0.679 at 64 dimensions). Although using Adjacent vs Path Neighbors resulted in similar clustering performance for the image based datasets, Path Neighbors produced better results for the text based datasets. From Table 1, both MST-min and MST-all produced better results when using Path vs Adjacent Neighbors for text based datasets ($p<0.05$).

Adjacent Neighbors produces poorer local structure than Path Neighbors. Visually, the vectors generated using Adjacent Neighbors with MST-min result in dispersed dense clusters of points, e.g., the footwear classes in FMNIST and the recreation topics in 20NG. However, when we apply Path Neighbors, the groups of points belonging to the same class are less dispersed (Fig. 3). This indicates that Adjacent Neighbors produces poorer local structure than Path Neighbors, i.e. datapoints of the same class are far from each other.

To investigate further why Adjacent Neighbors produces worse results for text datasets, we compute the variance along the dimensions of the 2D vectors for each class from the 20NG Count vectors and plot them in Fig. 4. We find that Adjacent Neighbors has greater variance for each class than both Path Neighbors and original UMAP, indicating greater dispersion of points within each class. High within-class variance is bad for clustering and indicates a poor lower dimensional representation, as there is no distinct range of values associated with the class label.
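A sketch of the per-class variance computation behind Fig. 4 (illustrative):

# Variance of the 2D vectors along each dimension, computed separately per class.
# Higher per-class variance indicates that points of the same class are more dispersed.
import numpy as np

def per_class_variance(embedding_2d, labels):
    """embedding_2d: (N, 2) low-dimensional vectors; labels: (N,) class labels."""
    return {c: embedding_2d[labels == c].var(axis=0)       # variance along each dimension
            for c in np.unique(labels)}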

Path Neighbors increases the connectivity of the final $G$, and can therefore rely on a more refined notion of $G^{\prime}$ with MST-min instead of MST-all. We can interpret Path Neighbors as a method which strictly increases the general connectivity of $G$. Consider the local neighborhood $\mathcal{M}_i$ for Adjacent Neighbors, which is not guaranteed to have $k$ connected neighbors. Points with smaller $\mathcal{M}_i$ will be close to primarily a few adjacent neighbors and repelled further away from the other points. This creates small groups of points belonging to the same class spread across the vector space. On the other hand, for original UMAP and Path Neighbors vectors, $|\mathcal{M}_i|=k$, and local groups of points are more likely to be connected to other groups within the same class. This increase in connectivity explains why, visually, the Path Neighbors method generates vectors which are less dispersed (within class) than the Adjacent Neighbors method, while still preserving the underlying structure of the data. Across Table 1, we see that Path Neighbors performs consistently well with MST-min across both image and text datasets, and allows more flexibility in building a locally and globally connected fuzzy simplicial set.

5.4 Number of Dimensions

We also found that clustering performance was consistent across dimensions for the image datasets but was better at higher dimensions for the text datasets with TF-IDF representations. Table 1 shows results for 2 and 64 dimensions. While it is not surprising that 64 dimensions allow the high-dimensional text datasets (>26,000 features for AG and >34,000 features for 20NG) to preserve more information, it is interesting that MST variants with Path Neighbors do not produce worse results at 2 vs 64 dimensions.

6 Conclusion

The initialization of a weighted connected graph is critical to the success of topology based dimensionality reduction methods. In this work, we established how starting with stricter conditions of connectivity (mutual k-NN graph vs standard k-NN graph) results in a better topology. Next, using a flexible method to expand the local neighborhood, and therefore the connectivity of the final graph, is best done on a minimally connected (MST-min) mutual k-NN graph. Visualisations of the resulting vectors show better separation between similar classes while preserving the overall structure of the data. Our quantitative experiments indicate these vectors consistently produce better clustering (relative gains of 5-18%) across all datasets for both 2 and 64 dimensions despite the simplicity of the graph methods, highlighting the role of graph connectivity in topology based dimensionality reduction methods.

Acknowledgments

We thank the anonymous reviewers for helpful feedback. Also Milind Agarwal, Desh Raj, Taryn Wong, and Jinyi Yang for proof-reading and Kelly Marchisio for discussions at an early stage of this project.

References

  • Abbas and Shoukry (2012) Abbas, M. A.; and Shoukry, A. A. 2012. CMUNE: A clustering using mutual nearest neighbors algorithm. In 2012 11th International Conference on Information Science, Signal Processing and their Applications (ISSPA), 1192–1197.
  • Allaoui, Kherfi, and Cheriet (2020) Allaoui, M.; Kherfi, M. L.; and Cheriet, A. 2020. Considerably Improving Clustering Algorithms Using UMAP Dimensionality Reduction Technique: A Comparative Study. In El Moataz, A.; Mammass, D.; Mansouri, A.; and Nouboud, F., eds., Image and Signal Processing, 317–325. Cham: Springer International Publishing. ISBN 978-3-030-51935-3.
  • Balasubramanian et al. (2002) Balasubramanian, M.; Schwartz, E. L.; Tenenbaum, J. B.; de Silva, V.; and Langford, J. C. 2002. The isomap algorithm and topological stability. Science, 295(5552): 7–7.
  • Becht et al. (2019) Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.; Kwok, I.; Ng, L.; Ginhoux, F.; and Newell, E. 2019. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37: 38–44.
  • de Sousa, Rezende, and Batista (2013) de Sousa, C. A. R.; Rezende, S. O.; and Batista, G. E. 2013. Influence of graph construction on semi-supervised learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 160–175. Springer.
  • Diaz-Papkovich, Anderson-Trocmé, and Gravel (2020) Diaz-Papkovich, A.; Anderson-Trocmé, L.; and Gravel, S. 2020. A review of UMAP in population genetics. Journal of Human Genetics.
  • Dinu and Baroni (2015) Dinu, G.; and Baroni, M. 2015. Improving zero-shot learning by mitigating the hubness problem. CoRR, abs/1412.6568.
  • Dong, Moses, and Li (2011) Dong, W.; Moses, C.; and Li, K. 2011. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, 577–586.
  • Feldbauer and Flexer (2018) Feldbauer, R.; and Flexer, A. 2018. A comprehensive empirical comparison of hubness reduction in high-dimensional spaces. Knowledge and Information Systems, 59: 137 – 166.
  • Jégou et al. (2010) Jégou, H.; Schmid, C.; Harzallah, H.; and Verbeek, J. 2010. Accurate Image Search Using the Contextual Dissimilarity Measure. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32: 2 – 11.
  • Kayal (2021) Kayal, S. 2021. Unsupervised Sentence-embeddings by Manifold Approximation and Projection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 1–11. Online: Association for Computational Linguistics.
  • Kobak and Linderman (2021) Kobak, D.; and Linderman, G. 2021. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, 39: 1–2.
  • LeCun and Cortes (2010) LeCun, Y.; and Cortes, C. 2010. MNIST handwritten digit database.
  • Liu and Zhang (2012) Liu, H.; and Zhang, S. 2012. Noisy data elimination using mutual k-nearest neighbor for classification mining. Journal of Systems and Software, 85(5): 1067–1074.
  • Maier, Hein, and Luxburg (2009) Maier, M.; Hein, M.; and Luxburg, U. 2009. Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theoretical Computer Science, 410.
  • McInnes, Healy, and Melville (2018) McInnes, L.; Healy, J.; and Melville, J. 2018. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
  • Ozaki et al. (2011) Ozaki, K.; Shimbo, M.; Komachi, M.; and Matsumoto, Y. 2011. Using the Mutual k-Nearest Neighbor Graphs for Semi-supervised Classification on Natural Language Data. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, 154–162. Portland, Oregon, USA: Association for Computational Linguistics.
  • Radovanović, Nanopoulos, and Ivanovic (2010) Radovanović, M.; Nanopoulos, A.; and Ivanovic, M. 2010. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. J. Mach. Learn. Res., 11: 2487–2531.
  • Rother et al. (2020) Rother, D.; Haider, T.; Eger, S.; et al. 2020. CMCE at SemEval-2020 Task 1: Clustering on Manifolds of Contextualized Embeddings to Detect Historical Meaning Shifts. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, 187–193. Barcelona (online): International Committee for Computational Linguistics.
  • Sardana and Bhatnagar (2014) Sardana, D.; and Bhatnagar, R. 2014. Graph Clustering Using Mutual K-Nearest Neighbors. In Ślezak, D.; Schaefer, G.; Vuong, S. T.; and Kim, Y.-S., eds., Active Media Technology, 35–48. Cham: Springer International Publishing. ISBN 978-3-319-09912-5.
  • Tomašev and Buza (2015) Tomašev, N.; and Buza, K. 2015. Hubness-aware kNN classification of high-dimensional data in presence of label noise. Neurocomputing, 160: 157–172.
  • Tomašev et al. (2014) Tomašev, N.; Radovanovic, M.; Mladenic, D.; and Ivanovic, M. 2014. The Role of Hubness in Clustering High-Dimensional Data. IEEE Transactions on Knowledge and Data Engineering, 26(3): 739–751.
  • Van der Maaten and Hinton (2008) Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research, 9(11).
  • Vermeulen et al. (2021) Vermeulen, M.; Smith, K.; Eremin, K.; Rayner, G.; and Walton, M. 2021. Application of Uniform Manifold Approximation and Projection (UMAP) in spectral imaging of artworks. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 252: 119547.
  • Xiao, Rasul, and Vollgraf (2017) Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv, abs/1708.07747.
  • Zhang, Zhao, and LeCun (2015) Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28: 649–657.
Accuracy                            Adjacent Neighbors            Path Neighbors
Dataset        Dim   UMAP     NN     MST-min  MST-all      NN     MST-min  MST-all
MNIST           2    0.852    0.966  0.968    0.968        0.929  0.968    0.968
FMNIST          2    0.578    0.670  0.665    0.692        0.610  0.693    0.549
20NG Count      2    0.489    0.524  0.520    0.531        0.562  0.570    0.558
20NG TF-IDF     2    0.551    0.531  0.511    0.538        0.582  0.589    0.582
AG Count        2    0.802    0.831  0.823    0.833        0.827  0.859    0.865
AG TF-IDF       2    0.805    0.751  0.755    0.795        0.780  0.842    0.841
MNIST          64    0.824    0.950  0.968    0.968        0.953  0.967    0.968
FMNIST         64    0.558    0.653  0.666    0.663        0.643  0.670    0.651
20NG Count     64    0.515    0.565  0.586    0.583        0.592  0.589    0.589
20NG TF-IDF    64    0.650    0.621  0.621    0.622        0.658  0.663    0.663
AG Count       64    0.810    0.854  0.857    0.857        0.869  0.869    0.873
AG TF-IDF      64    0.810    0.804  0.802    0.809        0.840  0.842    0.847
Table 2: Accuracy Results for clustering each of the vectors generated using each method described in Section 4.3 with KMeans.
Purity                              Adjacent Neighbors            Path Neighbors
Dataset        Dim   UMAP     NN     MST-min  MST-all      NN     MST-min  MST-all
MNIST           2    0.888    0.966  0.968    0.968        0.939  0.968    0.968
FMNIST          2    0.639    0.700  0.707    0.721        0.677  0.724    0.645
20NG Count      2    0.508    0.538  0.535    0.547        0.576  0.583    0.579
20NG TF-IDF     2    0.570    0.554  0.537    0.563        0.600  0.606    0.606
AG Count        2    0.815    0.831  0.823    0.833        0.827  0.859    0.865
AG TF-IDF       2    0.805    0.751  0.755    0.795        0.700  0.829    0.826
MNIST          64    0.874    0.950  0.968    0.968        0.953  0.967    0.968
FMNIST         64    0.635    0.702  0.716    0.710        0.657  0.709    0.669
20NG Count     64    0.534    0.580  0.603    0.601        0.609  0.607    0.607
20NG TF-IDF    64    0.662    0.640  0.639    0.645        0.676  0.695    0.699
AG Count       64    0.820    0.854  0.854    0.857        0.869  0.869    0.873
AG TF-IDF      64    0.821    0.804  0.802    0.809        0.840  0.856    0.857
Table 3: Purity Results for clustering each of the vectors generated using each method described in Section 4.3 with KMeans.
Adjusted Rand Index                 Adjacent Neighbors            Path Neighbors
Dataset        Dim   UMAP     NN     MST-min  MST-all      NN     MST-min  MST-all
MNIST           2    0.810    0.927  0.931    0.931        0.885  0.931    0.931
FMNIST          2    0.485    0.547  0.535    0.558        0.523  0.580    0.492
20NG Count      2    0.309    0.348  0.347    0.356        0.386  0.394    0.381
20NG TF-IDF     2    0.398    0.351  0.330    0.349        0.403  0.421    0.413
AG Count        2    0.620    0.614  0.600    0.620        0.603  0.669    0.669
AG TF-IDF       2    0.530    0.464  0.498    0.540        0.519  0.569    0.571
MNIST          64    0.805    0.900  0.930    0.931        0.920  0.930    0.932
FMNIST         64    0.484    0.500  0.547    0.535        0.503  0.539    0.499
20NG Count     64    0.329    0.401  0.406    0.405        0.419  0.420    0.420
20NG TF-IDF    64    0.450    0.436  0.434    0.442        0.481  0.493    0.495
AG Count       64    0.660    0.662  0.662    0.668        0.691  0.691    0.699
AG TF-IDF      64    0.624    0.563  0.576    0.560        0.630  0.665    0.655
Table 4: Adjusted Rand Index Results for clustering each of the vectors generated using each method described in Section 4.3 with KMeans.