
Stochastic Cluster Embedding

Zhirong Yang1,2, Yuwei Chen3, Denis Sedov2, Samuel Kaski2, and Jukka Corander4,5,6
1Norwegian University of Science and Technology
2Aalto University
3Finnish Geospatial Research Institute
4University of Oslo
5Wellcome Sanger Institute
6University of Helsinki
zhirong.yang@ntnu.no
Abstract

Neighbor Embedding (NE) aims to preserve pairwise similarities between data items and has been shown to yield an effective principle for data visualization. However, even the best existing NE methods such as Stochastic Neighbor Embedding (SNE) may leave large-scale patterns hidden, for example clusters, despite strong signals being present in the data. To address this, we propose a new cluster visualization method based on the Neighbor Embedding principle. We first present a family of Neighbor Embedding methods that generalizes SNE by using non-normalized Kullback-Leibler divergence with a scale parameter. In this family, much better cluster visualizations often appear with a parameter value different from the one corresponding to SNE. We also develop efficient software that employs asynchronous stochastic block coordinate descent to optimize the new family of objective functions. Our experimental results demonstrate that the method consistently and substantially improves the visualization of data clusters compared with the state-of-the-art NE approaches. The code of our method is publicly available at https://github.com/rozyangno/sce.

Keywords Clustering  \cdot Information Divergence  \cdot Neighbor Embedding  \cdot Nonlinear Dimensionality Reduction  \cdot Stochastic Optimization  \cdot Visualization

1 Introduction

The rapid growth in the amount of data processed by analysts demands more efficient information digestion and communication methods. Data visualization by dimensionality reduction helps a viewer digest information in massive data sets quickly. Therefore, it is increasingly applied as a critical component in scientific research, digital libraries, data mining, financial data analysis, market studies, manufacturing production control, drug discovery, etc.

Stochastic Neighbor Embedding (SNE; Hinton and Roweis, 2003) is a widely used nonlinear dimensionality reduction (NLDR) method, which approximately preserves the pairwise probabilities of being neighbors (neighboring probabilities for short) in the input space. In particular, the Student t-Distributed Stochastic Neighbor Embedding (t-SNE; van der Maaten and Hinton, 2008) has become one of the most popular nonlinear dimensionality reduction methods for data visualization. The t-SNE method employs a heavy-tailed distribution for the neighboring probabilities in the embedding and minimizes their Kullback-Leibler divergence against precomputed input probabilities.

Discovery of large-scale patterns such as clusters is an important task of NLDR. It is often believed that t-SNE can show clusters for well clusterable data, with a smaller Kullback-Leibler divergence corresponding to a better quality. However, recently we found many counter-examples where t-SNE may not correctly visualize the clusters even if the input neighborhoods are well clusterable (see Section 5 for details). Tuning hyperparameters in t-SNE does not help in solving this issue because a better t-SNE learning objective can correspond to a worse cluster embedding. The root cause of the failure is that t-SNE was designed to preserve the neighborhood or local information, which is not necessarily directly connected with finding large-scale patterns such as clusters.

To address the problem, we propose a new method called Stochastic Cluster Embedding (SCE) for better cluster visualizations. We generalize SNE using a family of I-divergences parameterized by a scaling factor $s$, between the non-normalized similarities in the input and output space. We show that SNE is a special case in the family with $s$ chosen to be the normalizing factor of output similarities. However, through a user study, we find that the best $s$ value for cluster visualization is often different from the one chosen by SNE.

To overcome the t-SNE drawback, SCE employs another choice that mixes the input similarities in calculating $s$. The scale factor is adaptively adjusted when optimizing the new learning objective, and the data points are thus better grouped. We have also developed an efficient optimization algorithm that employs asynchronous stochastic block coordinate descent. The new algorithm can utilize parallel computing devices such as CPUs or GPUs and is suitable for mega-scale problems with massive data items.

Our new method is tested on six real-world data sets and compared with the state-of-the-art nonlinear dimensionality reduction approaches, including t-SNE (van der Maaten and Hinton, 2008). The results show that our method consistently performs better than t-SNE for cluster visualization. In all tested cases where t-SNE fails, SCE can show the clusters clearly and accurately. Among all compared methods, our new method is the only one that gives good cluster visualizations for all tested data sets.

The remainder of the paper is organized as follows. The popular visualization method SNE is briefly reviewed in Section 2. In Section 3 we give our notations about clustering and cluster visualization. In Section 4, we present the new method based on Neighbor Embedding for cluster visualization, including its learning objective function and optimization algorithms. We present the experiment settings and results in Section 5 and conclude the paper by presenting possible directions for future research in Section 6.

2 Stochastic Neighbor Embedding

Neighbor Embedding (NE) is a collection of methods that seek vectorial representations of data items according to their neighborhoods or similarities in the original space (Yang et al., 2013, 2014). Given a set of multivariate data points $\{x_{1},x_{2},\dots,x_{N}\}$, where $x_{i}\in\mathbb{R}^{D}$, their pairwise similarities are encoded in a nonnegative square matrix $p$. Neighbor Embedding finds a mapping $x_{i}\mapsto y_{i}\in\mathbb{R}^{d}$ for $i=1,\dots,N$ ($d=2$ or $d=3$ for visualization) such that the similarities in the mapped space, encoded by $q_{ij}$, approximate those in $p$. The approximation is realized by minimizing a certain divergence between $p$ and $q$. For abbreviation we also write $Y=[y_{1},\dots,y_{N}]$.

SNE is a family of Neighbor Embedding methods which minimizes the Kullback-Leibler divergence $D_{\text{KL}}(P||Q)$ between the normalized input and output similarities $P$ and $Q$. The matrix $P$ can be precomputed, for example, $P_{ij}=(P_{i|j}+P_{j|i})/(2N)$ with $P_{i|j}=p_{ij}/\sum_{l}p_{il}$. In this paper we focus on the matrix-wise normalization of $Q$ using $Q_{ij}=q_{ij}/\sum_{ab}q_{ab}$, where the matrix-wise summation is over off-diagonal elements, i.e. $\sum_{ij}A_{ij}\stackrel{\text{def}}{=}\sum_{ij:i\neq j}A_{ij}$. In SNE, the output similarities are usually given by a Gaussian kernel $q_{ij}=\exp\left(-\|y_{i}-y_{j}\|^{2}\right)$ or a Student $t$-kernel $q_{ij}=(1+\|y_{i}-y_{j}\|^{2})^{-1}$. SNE equipped with the latter kernel is called Student t-distributed Stochastic Neighbor Embedding (t-SNE; van der Maaten and Hinton, 2008).
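For concreteness, the quantities above can be computed as in the following NumPy sketch (an illustration with helper names of our own; it assumes a dense, zero-diagonal similarity matrix, whereas practical implementations work with sparse similarities):

```python
import numpy as np

def input_probabilities(p):
    """Symmetrize conditional probabilities and normalize matrix-wise."""
    np.fill_diagonal(p, 0.0)
    P_cond = p / p.sum(axis=1, keepdims=True)      # P_{i|j} = p_ij / sum_l p_il
    P = (P_cond + P_cond.T) / (2.0 * p.shape[0])   # P_ij = (P_{i|j} + P_{j|i}) / (2N)
    return P

def output_similarities(Y, kernel="student-t"):
    """Non-normalized output similarities q_ij for a 2-D or 3-D embedding Y."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    q = np.exp(-d2) if kernel == "gaussian" else 1.0 / (1.0 + d2)
    np.fill_diagonal(q, 0.0)                       # sums exclude the diagonal
    return q, q / q.sum()                          # q and its normalization Q
```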

3 Cluster Visualization

We focus on Neighbor Embedding for cluster visualization. A clustering divides the data objects into a number of groups, such that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters) (Tan et al., 2005). The pairwise similarities can be encoded in a similarity matrix $p$, defined as in NE.

A similarity matrix is considered (well) clusterable if there is a clustering such that high similarities occur (much) more often within clusters than between clusters. If we sort the rows and columns of such a similarity matrix according to the cluster labels, we should observe a diagonal blockwise pattern, where high similarities are denser within the cluster blocks. The pattern is much clearer for a well clusterable similarity matrix.

A good cluster visualization is a display where the user can easily see the groups of data points. In scatter plots, there should be clear space separating the groups such that points in the same group are closer to each other than to those in other groups. Besides sufficient separation, a good cluster visualization should also respect the within-cluster information by approximating the corresponding neighborhoods.

It is often believed that t-SNE can achieve a good cluster visualization (van der Maaten and Hinton, 2008) and some successful examples have been presented in the literature. However, like any data analysis method, t-SNE may also produce false-negative results for challenging data sets. That is, t-SNE can result in no apparent clustering for some $p$ matrices, despite them being well clusterable. Such errors are demonstrated with several examples in Section 5. The problem cannot be solved by using better optimization algorithms because minimizing the t-SNE objective aims at a better matching of local information, and this may not correspond to a better large-scale pattern. A smaller t-SNE cost can in fact correspond to a worse cluster visualization.

4 Neighbor Embedding for Cluster Visualization

In this section, we show in closer detail why t-SNE can fail despite a $p$ matrix being well clusterable and how to correct the problem to obtain good cluster visualizations.

4.1 Generalized Stochastic Neighbor Embedding

We begin with a variant of the Kullback-Leibler divergence (Amari, 1985) which is defined over non-normalized output similarities with a parameter $s>0$:

D_{\text{I}}(P||sq)=\sum_{ij}\left[P_{ij}\ln\frac{P_{ij}}{sq_{ij}}-P_{ij}+sq_{ij}\right].   (1)

The divergence is called the non-normalized KL-divergence or I-divergence, which measures the discrepancy between $P$ and $q$ at a certain scale $s$. In this work we focus on the Student $t$-kernel $q_{ij}=(1+\|y_{i}-y_{j}\|^{2})^{-1}$.

We call the NE variant that minimizes $D_{\text{I}}(P||sq)$ Generalized Stochastic Neighbor Embedding (GSNE), as it generalizes SNE. To see this, we eliminate $s$ by setting $\partial D_{\text{I}}(P||sq)/\partial s=0$, which gives (see Yang et al., 2014, or the supplemental document Section 1)

\arg\min_{Y}D_{\text{KL}}(P||Q)=\arg\min_{Y}\min_{s>0}D_{\text{I}}(P||sq),   (2)

where the equality holds when

s=\frac{1}{\sum_{ij}q_{ij}}.   (3)

That is, SNE is a special case of GSNE with $s$ set to the normalizing factor of $q$.
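The equivalence is easy to check numerically. The following sketch (illustrative only; the helper names and the random test data are ours) evaluates $D_{\text{I}}(P||sq)$ at the SNE choice of $s$ from Eq. 3 and confirms that it coincides with $D_{\text{KL}}(P||Q)$:

```python
import numpy as np

def t_kernel(Y):
    """Student t-kernel q_ij = (1 + ||y_i - y_j||^2)^(-1), diagonal excluded."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + d2)
    np.fill_diagonal(q, 0.0)           # sums are over off-diagonal pairs
    return q

def i_divergence(P, q, s):
    """Non-normalized KL (I-) divergence D_I(P || s q) of Eq. 1."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (s * q[mask]))) - P.sum() + s * q.sum()

rng = np.random.default_rng(0)
N = 50
P = rng.random((N, N)); np.fill_diagonal(P, 0.0)
P = P + P.T; P /= P.sum()              # matrix-wise normalized input similarities
Y = rng.normal(scale=1e-2, size=(N, 2))
q = t_kernel(Y)

s_sne = 1.0 / q.sum()                  # Eq. 3: the SNE choice of s
Q = q / q.sum()
d_kl = np.sum(P[P > 0] * np.log(P[P > 0] / Q[P > 0]))
print(np.isclose(i_divergence(P, q, s_sne), d_kl))   # True up to round-off
```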

The SNE choice of $s$ in Eq. 3 is not necessarily an optimal choice for cluster visualization. It aims at matching two neighboring probability matrices. Such a locality-preserving objective can prevent the discovery of large-scale patterns, such as clusters present among the input items. Fortunately, below we show that using another choice of $s$ can correct the problem.

4.2 Selecting $s$ for better cluster visualization

The GSNE objective can be rewritten as

D_{\text{I}}(P||sq)=\mathcal{J}_{\text{attraction}}+\mathcal{J}_{\text{repulsion}}+C   (4)

where $\mathcal{J}_{\text{attraction}}=-\sum_{ij}P_{ij}\ln q_{ij}$ and $\mathcal{J}_{\text{repulsion}}=s\sum_{ij}q_{ij}$ respectively correspond to attractive and repulsive forces, and $C=-\ln s-1+\sum_{ij}P_{ij}\ln P_{ij}$ is a constant w.r.t. $q$ for a given $s$. Here the scale parameter $s$ also controls the tradeoff between the two forces.
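For completeness, the decomposition follows directly from Eq. 1 by expanding the logarithm and using $\sum_{ij}P_{ij}=1$:

D_{\text{I}}(P||sq)=\sum_{ij}\left[P_{ij}\ln P_{ij}-P_{ij}\ln s-P_{ij}\ln q_{ij}-P_{ij}+sq_{ij}\right]=\underbrace{-\sum_{ij}P_{ij}\ln q_{ij}}_{\mathcal{J}_{\text{attraction}}}+\underbrace{s\sum_{ij}q_{ij}}_{\mathcal{J}_{\text{repulsion}}}+\underbrace{\Big(-\ln s-1+\sum_{ij}P_{ij}\ln P_{ij}\Big)}_{C}.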

It is known that increasing attraction, e.g., by replacing $P$ with $\beta P$ ($\beta>1$), can encourage the mapped data points to form tighter clumps, with more empty space in the visualization (see e.g., van der Maaten and Hinton, 2008; van der Maaten, 2014; Belkina et al., 2019). The trick is called "early exaggeration" and has been used in t-SNE initialization, with $\beta=4$ by van der Maaten and Hinton (2008) or $\beta=12$ by van der Maaten (2014) in the first 250 iterations (other t-SNE optimization methods, e.g. Belkina et al., 2019, may spend a different number of iterations in early exaggeration). The "early exaggeration" trick is equivalent to setting

s=\frac{1}{\beta\sum_{ij}q_{ij}}   (5)

during the initialization. Note that after the initialization, t-SNE reverts to $\beta=1$, i.e. the original choice in Eq. 3. As we will see in Section 5 and the supplemental document Section 4, the "early exaggeration" trick still cannot produce good cluster visualizations, even if $\beta>1$ is used throughout the iterations (see the supplemental document Section 3).

In this work, we propose to choose

s=\frac{1}{\sum_{ij}w_{ij}q_{ij}},   (6)

where $w_{ij}=\alpha N(N-1)P_{ij}+(1-\alpha)$ with $\alpha\in[0,1]$, for better cluster visualizations. When $\alpha=0$, the choice reduces to the SNE one. When $\alpha>0$, it can adaptively strengthen the relative attraction (i.e. reduce the repulsion) and thereby improve cluster visualization. To see this, we rewrite the denominator in Eq. 6:

\sum_{ij}w_{ij}q_{ij}=N(N-1)\left[\alpha\,\mathbb{E}_{(i,j)\sim\text{Categorical}(P)}\{q_{ij}\}+(1-\alpha)\,\mathbb{E}_{(i,j)\sim\text{Uniform}\left([1,\dots,N]^{2}\right)}\{q_{ij}\}\right].   (7)

In the brackets, there are two expectations over different distributions. Because $Y$ is random initially, the two expectations do not differ much, and overall the optimization is similar to the original SNE. After minimizing the discrepancy $D_{\text{I}}(P||sq)$ for a while, data pairs that correspond to large $P_{ij}$'s will be mapped closer. As a result, $\mathbb{E}_{(i,j)\sim\text{Categorical}(P)}\{q_{ij}\}$ becomes larger than $\mathbb{E}_{(i,j)\sim\text{Uniform}\left([1,\dots,N]^{2}\right)}\{q_{ij}\}$, which leads to a smaller $s$ and thus smaller repulsion than in SNE. In this work, we simply set $\alpha=0.5$, and we find this already suffices for many data sets. Because our method often produces clearer cluster visualizations, we have named it Stochastic Cluster Embedding (SCE).
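As a concrete illustration, the SCE scale in Eq. 6 can be computed as in the following sketch (a dense-matrix toy version with our own naming; the actual algorithm in Section 4.3 estimates this quantity stochastically instead of forming it exactly):

```python
import numpy as np

def sce_scale(P, q, alpha=0.5):
    """Scale s of Eq. 6 with weights w_ij = alpha*N*(N-1)*P_ij + (1-alpha).

    P : matrix-wise normalized input similarities (zero diagonal, sums to 1)
    q : non-normalized output similarities, e.g. the Student t-kernel
    """
    N = P.shape[0]
    w = alpha * N * (N - 1) * P + (1.0 - alpha)
    np.fill_diagonal(w, 0.0)               # sums are over off-diagonal pairs
    return 1.0 / np.sum(w * q)
```

Setting alpha=0 in the sketch recovers the SNE choice of Eq. 3.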

4.3 Optimization

The GSNE or SCE objective function is smooth over $Y$. Therefore it can be optimized by any existing unconstrained smooth optimization technique, such as the gradient descent with momentum used in t-SNE. However, the original t-SNE algorithm runs in a centralized manner and is thus slow for large-scale data sets. Moreover, the tree-based approximation (Yang et al., 2013; van der Maaten, 2014; Vladymyrov and Carreira-Perpiñán, 2014) of the objective function and gradient calculation requires rather complex programming (see e.g., Chan et al., 2019).

Here we develop a simple tree-free parallel algorithm to optimize SCE. It repeats the following steps until the maximum number of iterations is reached or the change of $Y$ is smaller than a given tolerance:

  1. update $Y$ given the current $s$;
  2. estimate $s$ given the current $Y$.

Because we usually initialize $Y$ with small numbers around zero, all $q_{ij}$'s are close to 1 at the beginning. Therefore it is reasonable to initialize $s^{-1}=N(N-1)$.

In both steps, the computation is distributed to a number of computing units called workers. In Step 1, we first rewrite $D_{\text{I}}(P||sq)$ in a stochastic form:

\mathcal{J}_{\text{attraction}}=\mathbb{E}_{(i,j)\sim\text{Categorical}(P)}\left\{\mathcal{J}_{\text{attraction}}^{(i,j)}\right\},   (8)
\mathcal{J}_{\text{repulsion}}=\mathbb{E}_{(i,j)\sim\text{Uniform}\left([1,\dots,N]^{2}\right)}\left\{\mathcal{J}_{\text{repulsion}}^{(i,j)}\right\},   (9)

where $\mathcal{J}_{\text{attraction}}^{(i,j)}=-\ln q_{ij}$ and $\mathcal{J}_{\text{repulsion}}^{(i,j)}=sN(N-1)q_{ij}$. According to this form, each worker randomly draws a pair $(i,j)$ for attraction and another pair for repulsion, and calculates the corresponding descent directions (negative partial stochastic gradients) w.r.t. $y_{i}$ and $y_{j}$:

-\frac{\partial\mathcal{J}_{\text{attraction}}^{(i,j)}}{\partial y_{i}}=\frac{\partial\mathcal{J}_{\text{attraction}}^{(i,j)}}{\partial y_{j}}=-2q_{ij}(y_{i}-y_{j}),   (10)
-\frac{\partial\mathcal{J}_{\text{repulsion}}^{(i,j)}}{\partial y_{i}}=\frac{\partial\mathcal{J}_{\text{repulsion}}^{(i,j)}}{\partial y_{j}}=2sN(N-1)q_{ij}^{2}(y_{i}-y_{j}),   (11)

and updates the corresponding mapped points by stochastic partial gradient descent. In this way, each worker requires only $O(1)$ cost for updating a pair of mapped points $y_{i}$ and $y_{j}$.

Next we consider how to estimate $s$ in an asynchronous, stochastic, and distributed manner. The $(i,j)$ samples and the corresponding $q_{ij}$'s in the denominator of Eq. 6 (that is, $s^{-1}$) have already been obtained in Step 1. Denote by $\xi$ and $\omega$, respectively, the weighted sum and the weighted count of the newly calculated $q_{ij}$'s. Then $\frac{N(N-1)\xi}{\omega}$ is a stochastic approximation of $s^{-1}$, which we mix with the current estimate to form the new one:

s^{-1}\leftarrow\rho s^{-1}+(1-\rho)\frac{N(N-1)\xi}{\omega},   (12)

where $\rho\in(0,1)$ is the forgetting rate. We find that $\rho=\frac{N(N-1)}{N(N-1)+\omega}$ works well in practice.

The pseudo-code of the SCE algorithm is given in Algorithm 1. Because the algorithm is almost lock-free, it is straightforward to implement it efficiently on multi-core CPUs and GPUs. Our algorithm belongs to the family of stochastic block coordinate descent optimization, for which Richtárik and Takáč (2011) gave the convergence guarantee and convergence rate.

Algorithm 1 Stochastic Cluster Embedding
1: Input: similarity matrix $P$, number of iterations $T$.
2: Initialize $Y=\{y_{i}\}_{i=1}^{N}$ with small numbers, e.g. $y_{id}\sim\mathcal{N}(0,10^{-4})$ for all $i,d$.
3: Initialize $s^{-1}\leftarrow N(N-1)$
4: for $t=0$ to $T-1$ do
5:     $\eta_{t}\leftarrow 1-t/T$
6:     $\xi\leftarrow 0$
7:     $\omega\leftarrow 0$
8:     parallel for each worker do
9:         Draw $(i,j)\sim\text{Categorical}(P)$ ▷ handle attraction
10:        $q_{ij}\leftarrow(1+\|y_{i}-y_{j}\|^{2})^{-1}$
11:        $\nabla\leftarrow-2q_{ij}(y_{i}-y_{j})$
12:        $y_{i}\leftarrow y_{i}+\eta_{t}\nabla$;  $y_{j}\leftarrow y_{j}-\eta_{t}\nabla$
13:        $\xi\leftarrow\xi+\alpha q_{ij}$;  $\omega\leftarrow\omega+\alpha$
14:        Draw $(i,j)\sim\text{Uniform}([1,\dots,N]^{2})$ ▷ handle repulsion
15:        $q_{ij}\leftarrow(1+\|y_{i}-y_{j}\|^{2})^{-1}$
16:        $\nabla\leftarrow 2sN(N-1)q_{ij}^{2}(y_{i}-y_{j})$
17:        $y_{i}\leftarrow y_{i}+\eta_{t}\nabla$;  $y_{j}\leftarrow y_{j}-\eta_{t}\nabla$
18:        $\xi\leftarrow\xi+(1-\alpha)q_{ij}$;  $\omega\leftarrow\omega+(1-\alpha)$
19:     end for
20:     $\rho\leftarrow\frac{N(N-1)}{N(N-1)+\omega}$
21:     $s^{-1}\leftarrow\rho s^{-1}+(1-\rho)\frac{N(N-1)\xi}{\omega}$
22: end for
23: Output: low-dimensional representations $Y$
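To make the algorithm concrete, the following is a simplified serial NumPy sketch of Algorithm 1 (our own illustrative code, not the released parallel implementation; it assumes a dense, matrix-normalized $P$ with zero diagonal, and the default sampling budget is only indicative):

```python
import numpy as np

def sce_embed(P, T=200, batch=2000, alpha=0.5, dim=2, seed=0):
    """Serial sketch of Algorithm 1 (the released code instead runs the
    inner loop asynchronously in parallel and samples from a sparse P)."""
    rng = np.random.default_rng(seed)
    N = P.shape[0]
    flat = (P / P.sum()).ravel()                   # Categorical(P) over ordered pairs
    Y = rng.normal(scale=1e-2, size=(N, dim))
    s_inv = float(N * (N - 1))                     # line 3: initialize s^{-1}

    for t in range(T):
        eta = 1.0 - t / T                          # line 5: decaying step size
        xi, omega = 0.0, 0.0
        s = 1.0 / s_inv
        att = rng.choice(N * N, size=batch, p=flat)
        rep = rng.integers(N, size=(batch, 2))
        for a, (ri, rj) in zip(att, rep):
            i, j = divmod(int(a), N)               # attraction pair ~ Categorical(P)
            q = 1.0 / (1.0 + np.sum((Y[i] - Y[j]) ** 2))
            g = -2.0 * q * (Y[i] - Y[j])           # descent direction, cf. Eq. 10
            Y[i] += eta * g; Y[j] -= eta * g
            xi += alpha * q; omega += alpha
            if ri == rj:                           # repulsion pair ~ Uniform
                continue
            q = 1.0 / (1.0 + np.sum((Y[ri] - Y[rj]) ** 2))
            g = 2.0 * s * N * (N - 1) * q * q * (Y[ri] - Y[rj])   # cf. Eq. 11
            Y[ri] += eta * g; Y[rj] -= eta * g
            xi += (1.0 - alpha) * q; omega += (1.0 - alpha)
        rho = N * (N - 1) / (N * (N - 1) + omega)  # line 20
        s_inv = rho * s_inv + (1.0 - rho) * N * (N - 1) * xi / omega
    return Y
```

In the released software the inner loop is executed asynchronously by many workers and $P$ is sparse, which is what makes the method practical for mega-scale data sets.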

5 Experiments

We have compared our method with t-SNE, as well as two other more recent methods called LargeVis and UMAP:

  • t-SNE: We have used the implementation by van der Maaten (2014), available at https://github.com/lvdmaaten/bhtsne, where the maximum number of iterations was set to 10000 (ten times the default) to get closer to convergence.

  • LargeVis (Tang et al., 2016): we have used its official implementation on GitHub, available at https://github.com/lferry007/LargeVis.

  • UMAP (Uniform Manifold Approximation and Projection; McInnes et al., 2018): we have used the umap-learn package in Python.

We have used six real-world data sets in our experiments:

  • SHUTTLE: the Statlog (Shuttle) Data Set in the UCI repository, available at https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle). There are 58,000 samples of 9 dimensions in three large and four small classes.

  • MNIST: the MNIST database of handwritten digits, available at http://yann.lecun.com/exdb/mnist/. There are 70,000 samples of 784 dimensions in ten digit classes.

  • IJCNN: the IJCNN 2001 neural network competition data, available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. There are 126,701 samples of 22 dimensions from ten engines (classes).

  • TOMORADAR: the data was collected with a helicopter-borne microwave profiling radar (Chen et al., 2017), termed FGI-Tomoradar, to investigate the vertical topographic structure of forests. After preprocessing, the data set contains 120,024 samples of 8192 dimensions from three classes.

  • FLOW-CYTOMETRY: the single-cell biology data set collected from Flow Repository, available at https://flowrepository.org/id/FR-FCM-ZZ36. After preprocessing, the data set contains 1,000,000 samples of 17 dimensions.

  • HIGGS: the HIGGS Data Set in the UCI repository, available at https://archive.ics.uci.edu/ml/datasets/HIGGS. The data was produced using Monte Carlo simulations of the particles in a physics experiment. There are 11,000,000 data points of 28 dimensions. Previously the data were used for classification between the bosons and the background particles, whereas there is little research on unsupervised learning on these data. Here we compared visualizations to discover the particle clusters.

To our knowledge, how to optimally convert vectorial data to a similarity matrix is a largely unsolved problem, and there is no universally best solution. In practice, popular choices are Entropic Affinity (EA; Hinton and Roweis, 2003) and $k$-Nearest Neighbors ($k$-NN) with tuning of the perplexity (or $k$) parameter.

This work focuses on cluster embedding of a given similarity matrix $P$. We have constructed the $P$ matrix by using EA with perplexity 30 for SHUTTLE and IJCNN. We have used the symmetrized $k$-NN graph adjacency matrix as $P$ for MNIST, TOMORADAR, FLOW-CYTOMETRY and HIGGS, with $k=10$, $k=50$, $k=15$ and $k=5$, respectively. In this way, the constructed similarity matrices are well clusterable because they exhibit a diagonal blockwise pattern, as we shall see in Figure 8. A good cluster visualization method should be able to clearly show these clusters.
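For illustration, a symmetrized $k$-NN similarity matrix of this kind can be constructed as in the sketch below (using scikit-learn's kneighbors_graph; the matrix-wise normalization shown here is one reasonable choice and may differ in detail from our preprocessing):

```python
from sklearn.neighbors import kneighbors_graph

def knn_similarity(X, k=10):
    """Binary symmetrized k-NN adjacency, normalized matrix-wise to sum to one."""
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity")  # sparse 0/1, no self-loops
    A = A.maximum(A.T)     # symmetrize: connect i and j if either is a k-NN of the other
    return A / A.sum()     # matrix-wise normalization, playing the role of P
```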

We have performed three groups of empirical studies to verify the advantages of the proposed SCE method. Below we first demonstrate its visualizations compared with those by t-SNE, LargeVis, and UMAP. Second, we compare the $s$ choices in SCE and t-SNE against our user study. Third, we verify the clustering quality of the compared methods by examining how well they group the input similarities.

5.1 Visualization comparison

The SCE visualizations compared with the other methods are shown in Figures 1 to 6. It is important to notice that cluster visualizations are unsupervised. Therefore, we should focus on the colorless scatter plots and check whether large-scale patterns (i.e. clusters) appear or not. The colors in the sub-figures are only used for reference to verify the alignment between clusters and ground truth classes.

We can see that SCE works well for all six data sets. Compared with the other approaches, SCE typically shows several compact clusters with clear separation among them. Therefore it is easy to see the major clusters in all cases even if color is removed (i.e. without supervised labels).

t-SNE fails for five of the six test data sets. For SHUTTLE, IJCNN, and TOMORADAR, the t-SNE layouts overall look like a single diamond with too many small groups, where no major clusters can be identified. For FLOW-CYTOMETRY the t-SNE visualization is nearly a single ball, while for HIGGS it is just a hairball. The only barely successful case is MNIST, but there are still many data points scattered between the clusters, leaving the boundaries unclear.

LargeVis is slightly better than t-SNE. It correctly shows ten clusters for MNIST. In the LargeVis visualizations of SHUTTLE, IJCNN, TOMORADAR and FLOW-CYTOMETRY, we can also see some groups of data points. However, there are still too many small groups, and it is hard to identify the major clusters. LargeVis fails for HIGGS, where no clear cluster is shown.

UMAP works better than t-SNE and LargeVis for some data sets. It also correctly shows ten clusters for MNIST. UMAP can separate the ten engine clusters for IJCNN. We can barely see several major clusters in its FLOW-CYTOMETRY visualization. The method also successfully identifies several clusters for HIGGS. However, UMAP does not work well for SHUTTLE because there are many small groups without clear separation of major clusters. UMAP fails for TOMORADAR, where no clustering pattern is found.

Figure 1: Visualizations of the SHUTTLE data set by using the compared methods: (a) SCE (69 seconds), (b) t-SNE, (c) LargeVis, (d) UMAP. The classes are shown by colors in the small sub-figures.

Figure 2: Visualizations of the MNIST data set by using the compared methods: (a) SCE (47 seconds), (b) t-SNE, (c) LargeVis, (d) UMAP. The classes are shown by colors in the small sub-figures.

Figure 3: Visualizations of the IJCNN data set by using the compared methods: (a) SCE (75 seconds), (b) t-SNE, (c) LargeVis, (d) UMAP. The classes are shown by colors in the small sub-figures.

Figure 4: Visualizations of the TOMORADAR data set by using the compared methods: (a) SCE (660 seconds), (b) t-SNE, (c) LargeVis, (d) UMAP. The classes are shown by colors in the small sub-figures.

Figure 5: Visualizations of the FLOW-CYTOMETRY data set by using the compared methods: (a) SCE (7713 seconds), (b) t-SNE, (c) LargeVis, (d) UMAP.

Figure 6: Visualizations of the HIGGS data set by using the compared methods: (a) SCE (8969 seconds), (b) t-SNE, (c) LargeVis, (d) UMAP.

5.2 User study

Since visualizations are designed for human use, we have performed a user study of the human preference among GSNE visualizations corresponding to a range of $s$ values for seeing clusters. We can then compare the resulting $s$ values in SCE and t-SNE to see which is closer to the human choices.

We have used the four smallest data sets SHUTTLE, MNIST, IJCNN, and TOMORADAR. For each data set, we have run GSNE with $s=10^{t}\cdot N^{-2}$, where $t\in[-4,6]$ for TOMORADAR and $t\in[0,8]$ for the other data sets. These ranges of $s$ values were set wide enough, from over-attractive to over-repulsive, such that meaningful $s$ choices should lie in between.

The user interface of the study can be found at http://clres.cs.hut.fi/ClAnalysis/webpage.html. For each data set, the series of visualizations is presented to a user (see the supplemental document Section 2 for a screenshot), who uses a slider to specify the $s$ value and inspects the corresponding precomputed visualization. The user selects a preferred $s$ value for cluster visualization and then presses the "Next" button. The system records the user's choice, and the study proceeds to the next series of visualizations.

We first performed a controlled laboratory study, where 40 users came to the test computer room and gave their evaluations. We later conducted a crowdsourcing user study, following the established good practices of crowdsourcing for visualization research (Borgo et al., 2017). Using the crowdsourcing platform CrowdFlower (https://www.crowdflower.com/, later acquired by appen.com), we collected empirical data from a large and diverse population of 300 participants. We then combined the data of the controlled and crowdsourcing studies for the analysis, leading to 340 answers for each data set. More details about the user study can be found in the supplemental document Section 2.

The results are shown in Figure 7. The SNE choice of $s$ according to Eq. 3 is shown by blue dash-dotted lines. We can see that the $s$ chosen by SNE lies to the right of the human median (solid green line) for all data sets, which indicates that, to human eyes, GSNE with a smaller $s$ is often better than t-SNE for cluster visualization. In contrast, the SCE choice (red dashed lines) is closer to the human median for all four data sets.

Figure 7: The $s$ values for cluster visualization on (a) SHUTTLE, (b) MNIST, (c) IJCNN, and (d) TOMORADAR: (gray bars) histogram of human choices, (green solid line) median of human choices, (red dashed line) SCE choice, and (blue dash-dotted line) t-SNE choice.

5.3 Clustering quality

Cluster visualization is an unsupervised task where the supervised labels are not available. Even if class labels are available in some data sets, they are not necessarily aligned with the intrinsic data clusters, for example, in SHUTTLE and HIGGS.

Here we have used an unsupervised approach to verify the SCE clustering quality. We first manually clustered the mapped data points in the 2D space. Because most clusters are well separated in the SCE visualizations, the manual clustering is easy with little ambiguity. After the clustering, we reordered the rows and columns of the input $P$ matrix according to the cluster labels and examined the nonzero entries (blue dots) in Figure 8. We can see blockwise diagonal patterns in all plots, where each block corresponds to a cluster in the visualizations. Moreover, the blue dots within clusters are denser than those between clusters, which means SCE achieves good clustering quality.
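A minimal sketch of this check is given below (it assumes a sparse $P$ and a NumPy array of manually assigned cluster labels, and uses matplotlib's spy in place of the Matlab spy function used for Figure 8):

```python
import numpy as np
import matplotlib.pyplot as plt

def spy_by_cluster(P, labels, frac=0.1, seed=0):
    """Reorder a sparse similarity matrix by cluster labels and spy-plot it."""
    rng = np.random.default_rng(seed)
    keep = rng.random(P.shape[0]) < frac     # uniform subsample of the points
    order = np.argsort(labels[keep], kind="stable")
    sub = P[keep][:, keep]                   # restrict to the subsample
    sub = sub[order][:, order]               # sort rows/columns by cluster label
    plt.spy(sub, markersize=0.2)
    plt.show()
```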

Figure 8: Visualization of the similarity matrix $P$ of the experimented data sets, (a) SHUTTLE, (b) MNIST, (c) IJCNN, (d) TOMORADAR, (e) FLOW-CYTOMETRY, and (f) HIGGS, using the Matlab spy function, where the rows and columns are sorted by the manual cluster labels. Blue dots show the 1's in the matrix and white dots show the 0's. Due to limited resolution, the figures show a uniform subsample of 10% of the data points.

6 Conclusions

We have presented a new nonlinear dimensionality reduction method called Stochastic Cluster Embedding for better cluster visualization. Our method modifies t-SNE by using an adaptive and effective attraction-repulsion tradeoff. We have tested our method on various real-world data sets and compared it with other modern NLDR methods. The experimental results show that our method can consistently identify the intrinsic clusters. Furthermore, we have contributed a simple and fast optimization algorithm that can easily be implemented on modern parallel computing platforms.

In this work, we have only considered the layout algorithms which produce the embedding coordinates. The visualization quality is also determined by the visual elements such as dot sizes, colors, opacity in the display. One promising area for further research would be to incorporate a cognitive user model, which could potentially improve cluster visualization to a significant degree. Such models could be fitted with Approximate Bayesian Computation as shown by e.g., Kangasrääsiö et al. (2017); Micallef et al. (2017) using the efficient implementation of inference algorithms available in the ELFI Python package (Lintusaari et al., 2018).

Appendix A Optimization equivalence

The optimization equivalence connection was originally proposed between $\alpha$- and Rényi-divergences, and between $\beta$- and $\gamma$-divergences (Yang et al., 2014). Here we make the connection explicit between non-normalized and normalized KL-divergences.

Proof.

The minimum of $D_{\text{I}}(P||sq)$ over $s$ appears when $\frac{\partial D_{\text{I}}(P||sq)}{\partial s}=0$. That is,
-s^{-1}\sum_{ij}P_{ij}+\sum_{ij}q_{ij}=0
or

s=\frac{1}{\sum_{ij}q_{ij}}.   (13)

Obviously such $s>0$. Substituting it back into $D_{\text{I}}(P||sq)$, we obtain

D_{\text{I}}(P||sq)=\sum_{ij}\left[P_{ij}\ln\frac{P_{ij}}{\left(\sum_{ab}q_{ab}\right)^{-1}q_{ij}}-P_{ij}+\frac{q_{ij}}{\sum_{ab}q_{ab}}\right]   (14)
=\sum_{ij}P_{ij}\ln\frac{P_{ij}}{Q_{ij}},   (15)

which is recognized as $D_{\text{KL}}(P||Q)$. ∎

Appendix B User study of GSNE

We conducted a user study in both a controlled laboratory and on the crowdsourcing platform CrowdFlower (respectively 40 and 300 participants per data set) to identify a value for ss that human users prefer for cluster visualization with GSNE. Thus, in total, we empirically collected 340 answers for each of the 4 data sets.

To be eligible for participation in the user study, the user should have at least a high-school education and be somewhat familiar with scatter plots. In addition, all participants should report that they have not participated in the user study before. These controls are designed to ensure the validity of the collected data.

Each test trial begins with an orientation phase. The user first sees several examples of good and bad cluster visualizations as well as their explanations. Next, the user practices on a few visualizations for synthetic and real-world data sets. We also used these practice questions to screen out robot visits and careless answer clicks (Borgo et al., 2017).

Then the actual user study starts. The tasks are presented to the user in random order, one at a time. Figure 9 shows a screenshot of the user interface. At the top is the task description. In the middle is the GSNE visualization for user inspection. The user can move the slider below the visualization to specify the $s$ value. Once $s$ is changed, the program shows the corresponding GSNE visualization in the middle display. We did not use colors to show ground truth classes because cluster visualization is unsupervised. When the user has decided on the most preferred cluster visualization, he or she clicks the "Next page" button at the bottom and proceeds to the next task. At the same time, the system records the user's choice. We removed answers from inattentive participants who did not move the slider in any of the tasks.

Figure 9: The user interface of the user study.

Appendix C Comparison with over-exaggerated t-SNE

The "early exaggeration" trick in the original t-SNE optimization algorithm also amplifies the attraction force to achieve more compact visualizations. We have tested whether using the trick throughout all iterations can give good cluster visualizations for the tested data sets. We call this over-exaggerated t-SNE (with $\beta=12$) and compare it with SCE. The resulting visualizations are shown in Figures 10 and 11.

We can see that using over-exaggeration only occasionally yields some clusters, which is inferior to the SCE visualizations on the right. For SHUTTLE and IJCNN, over-exaggerated t-SNE produces some clusters but also shatters others. For TOMORADAR, the shattering is more severe, and over-exaggerated t-SNE thus badly fails for cluster visualization. For MNIST, over-exaggerated t-SNE barely produces ten clusters, though many data points are scattered between and around the clumps.

Figure 10: Visualizations of the SHUTTLE and MNIST data sets by using over-exaggerated t-SNE (left) and SCE (right). The classes are shown by colors in the small sub-figures.

Figure 11: Visualizations of the IJCNN and TOMORADAR data sets by using over-exaggerated t-SNE (left) and SCE (right). The classes are shown by colors in the small sub-figures.

Appendix D Comparison with opt-SNE

Here we provide a comparison with a recent method called optimized t-SNE (opt-SNE; Belkina et al., 2019), which also adjusts the attraction-repulsion tradeoff. It automates the selection of three important parameters for the t-SNE run: 1) the initial learning rate, 2) the number of iterations spent in early exaggeration, and 3) the total number of iterations. Despite the changes in initialization, opt-SNE still minimizes the original t-SNE objective, i.e. the normalized KL-divergence (KLD), because the opt-SNE authors believed that lower KLD values indicate superior visualization quality (Belkina et al., 2019, Page 5).

Despite the similarity, our method differs from opt-SNE in a major way: SCE does not minimize the t-SNE cost function throughout the iterations, because we find that a lower KLD can correspond to a worse visualization in terms of showing clusters.

We have tried opt-SNE on five of the tested data sets: SHUTTLE, IJCNN, MNIST, TOMORADAR and HIGGS. First we applied opt-SNE to the original vectorial data; that is, opt-SNE computed the similarity matrix using its built-in function. We used the default parameters of opt-SNE (with "–optsne" set to True). The resulting visualizations are given in Figures 12 and 13. Next, for an ablation study, we modified opt-SNE so that it admits the same similarity matrix $P$ we used for SCE. The resulting visualizations are given in Figures 14 and 15.

The cluster embeddings found by opt-SNE are worse than those by SCE, regardless of whether the built-in similarity calculation is used. For SHUTTLE, IJCNN, and TOMORADAR, the opt-SNE visualizations show many small pieces and no clear cluster pattern. For MNIST and HIGGS, opt-SNE barely finds the clusters, and the cluster boundaries are unclear compared to the SCE visualizations.

Figure 12: Visualizations of the SHUTTLE, MNIST, IJCNN, and TOMORADAR data sets by using opt-SNE. The input to opt-SNE is the original vectorial data. The classes are shown by colors in the small sub-figures.

Figure 13: Visualization of the HIGGS data set by using opt-SNE.

Figure 14: Visualizations of the SHUTTLE, MNIST, IJCNN, and TOMORADAR data sets by using opt-SNE and the same similarity matrices as for SCE. The classes are shown by colors in the small sub-figures.

Figure 15: Visualization of the HIGGS data set by using opt-SNE and the same similarity matrix as for SCE.

References

  • Amari (1985) S. Amari. Differential-Geometrical Methods in Statistics. Springer Verlag, 1985.
  • Belkina et al. (2019) A. Belkina, C. Ciccolella, R. Anno, R. Halpert, J. Spidlen, and J. Snyder-Cappione. Automated optimized parameters for t-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nature Communications, 10(5415), 2019.
  • Borgo et al. (2017) R. Borgo, B. Lee, B. Bach, S. Fabrikant, R. Jianu, A. Kerren, S. Kobourov, F. McGee, L. Micallef, T. von Landesberger, K. Ballweg, S. Diehl, P. Simonetto, and M. Zhou. Crowdsourcing for information visualization: Promises and pitfalls. In Evaluation in the Crowd. Crowdsourcing and Human-Centered Experiments, pages 99–138, 2017.
  • Chan et al. (2019) David M. Chan, Roshan Rao, Forrest Huang, and John F. Canny. GPU accelerated t-distributed stochastic neighbor embedding. Journal of Parallel and Distributed Computing, 131:1–13, 2019.
  • Chen et al. (2017) Y. Chen, T. Hakala, M. Karjalainen, Z. Feng, J. Tang, P. Litkey, A. Kukko, A. Jaakkola, and J. Hyyppä. UAV-borne profiling radar for forest research. Remote Sensing, 9(1):58, 2017.
  • Hinton and Roweis (2003) G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), pages 857–864, 2003.
  • Kangasrääsiö et al. (2017) A. Kangasrääsiö, K. Athukorala, A. Howes, J. Corander, S. Kaski, and A. Oulasvirta. Inferring cognitive models from data using Approximate Bayesian Computation. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI), pages 1295–1306, 2017.
  • Lintusaari et al. (2018) Jarno Lintusaari, Henri Vuollekoski, Antti Kangasrääsiö, Kusti Skytén, Marko Järvenpää, Pekka Marttinen, Michael U. Gutmann, Aki Vehtari, Jukka Corander, and Samuel Kaski. ELFI: Engine for likelihood-free inference. Journal of Machine Learning Research, 19(16):1–7, 2018.
  • McInnes et al. (2018) L. McInnes, J. Healy, and J. Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints, February 2018.
  • Micallef et al. (2017) L. Micallef, G. Palmas, A. Oulasvirta, and T. Weinkauf. Towards perceptual optimization of the visual design of scatterplots. IEEE Transactions on Visualization and Computer Graphics, 23(6):1588–1599, 2017.
  • Richtárik and Takáč (2011) P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2011.
  • Tan et al. (2005) P. Tan, M. Steinbach, A. Karpatne, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
  • Tang et al. (2016) J. Tang, J. Liu, M. Zhang, and Q. Mei. Visualizing large-scale and high-dimensional data. In Proceedings of International Conference on World Wide Web (WWW), pages 287–297, 2016.
  • van der Maaten (2014) L. van der Maaten. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15:3221–3245, 2014.
  • van der Maaten and Hinton (2008) L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • Vladymyrov and Carreira-Perpiñán (2014) M. Vladymyrov and M. Carreira-Perpiñán. Linear-time training of nonlinear low-dimensional embeddings. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pages 968–977, 2014.
  • Yang et al. (2013) Z. Yang, J. Peltonen, and S. Kaski. Scalable optimization of neighbor embedding for visualization. In Proceedings of International Conference on Machine Learning (ICML), pages 127–135, 2013.
  • Yang et al. (2014) Z. Yang, J. Peltonen, and S. Kaski. Optimization equivalence of divergences improves neighbor embedding. In Proceedings of International Conference on Machine Learning (ICML), pages 460–468, 2014.