
A Spectral Method for Assessing and Combining Multiple Data Visualizations

Rong Ma¹, Eric D. Sun², and James Zou²

¹Department of Statistics, Stanford University
²Department of Biomedical Data Science, Stanford University
Abstract

Dimension reduction and data visualization aim to project a high-dimensional dataset to a low-dimensional space while capturing the intrinsic structures in the data. It is an indispensable part of modern data science, and many dimension reduction and visualization algorithms have been developed. However, different algorithms have their own strengths and weaknesses, making it critically important to evaluate their relative performance for a given dataset, and to leverage and combine their individual strengths. In this paper, we propose an efficient spectral method for assessing and combining multiple visualizations of a given dataset produced by diverse algorithms. The proposed method provides a quantitative measure – the visualization eigenscore – of the relative performance of the visualizations for preserving the structure around each data point. Then it leverages the eigenscores to obtain a consensus visualization, which has much improved quality over the individual visualizations in capturing the underlying true data structure. Our approach is flexible and works as a wrapper around any visualizations. We analyze multiple simulated and real-world datasets from diverse applications to demonstrate the effectiveness of the eigenscores for evaluating visualizations and the superiority of the proposed consensus visualization. Furthermore, we establish rigorous theoretical justification of our method based on a general statistical framework, yielding fundamental principles behind the empirical success of consensus visualization along with practical guidance.

KEY WORDS: data visualization; dimension reduction; high-dimensional data; manifold learning; spectral method.

1 INTRODUCTION

Data visualization and dimension reduction constitute a central topic in statistics and data science, as they facilitate intuitive understanding and global views of high-dimensional datasets and their underlying structural patterns through a low-dimensional embedding of the data (Donoho, 2017; Chen et al., 2020). The past decades have witnessed an explosion in machine learning algorithms for data visualization and dimension reduction. Many of them, such as Laplacian eigenmap (Belkin and Niyogi, 2003), kernel principal component analysis (kPCA) (Schölkopf et al., 1997), t-SNE (van der Maaten and Hinton, 2008), and UMAP (McInnes et al., 2018), have been regarded as indispensable tools and state-of-the-art techniques for generating graphics in academic and professional writing (Chen et al., 2007), and for exploratory data analysis and pattern discovery in many research disciplines, such as astrophysics (Traven et al., 2017), computer vision (Cheng et al., 2015), genetics (Platzer, 2013), and molecular biology (Olivon et al., 2018), especially single-cell transcriptomics (Kobak and Berens, 2019), among others.

However, the wide availability and functional diversity of data visualization methods also bring forth new challenges to data analysts and practitioners (Nonato and Aupetit, 2018; Espadoto et al., 2019). On the one hand, it is critically important to determine which method among the extensive list is most suitable and reliable for embedding a given dataset. In fact, even for a single visualization method, such as t-SNE or UMAP, there are often multiple tuning parameters to be determined by the users, and different tuning parameters may lead to distinct visualizations (Kobak and Linderman, 2021; Cai and Ma, 2021). Thus, for a given dataset, selecting the most suitable visualization method along with its tuning parameters calls for a method that provides quantitative and objective assessment of different visualizations of the dataset. On the other hand, as different methods are usually based on distinct ideas and heuristics, they generate qualitatively diverse visualizations of a dataset, each containing important features about the data that are possibly unique to the visualization method. Meanwhile, due to the noisiness and high dimensionality of many real-world datasets, their low-dimensional visualizations necessarily contain distortions from the underlying true structures, which again may vary from one visualization to another. It is therefore of substantial practical interest to combine strengths and reach a consensus among multiple data visualizations, in order to obtain an even better “meta-visualization” of the data that captures the most information and is least susceptible to the distortions. Naturally, a meta-visualization would also save practitioners from painstakingly selecting a single visualization method among many.

In this paper, we propose an efficient spectral approach for simultaneously assessing and combining multiple data visualizations produced by diverse dimension reduction/visualization algorithms, allowing for different settings of tuning parameters for individual algorithms. Specifically, the proposed method takes as input a collection of visualizations, or low-dimensional embeddings of a dataset, hereafter referred to as “candidate visualizations,” and summarizes each visualization by a normalized pairwise-distance matrix among the samples. With respect to each sample in the dataset, we construct a comparison matrix from these normalized distance matrices, characterizing the local concordance between each pair of candidate visualizations. Based on eigen-decomposition of the comparison matrices, we propose a quantitative measure, referred to as the “visualization eigenscore,” that quantifies the relative performance of the candidate visualizations in a sample-wise manner, reflecting their local concordance with the underlying low-dimensional structure contained in the data. To obtain a meta-visualization, the candidate visualizations are combined into a meta-distance matrix, defined as a row-wise weighted average of those normalized distance matrices, using the corresponding eigenscores as the weights. The meta-distance matrix is then used to produce a meta-visualization, based on an existing method such as UMAP or kPCA, which is shown to be more reliable and more informative than the individual candidate visualizations. Our method is schematically summarized in Figure 1 and Algorithm 1, and detailed in Section 2.1. The meta-visualization thus obtained reflects a joint perspective aggregating various aspects of the data that are oftentimes captured separately by individual candidate visualizations.

Refer to caption
Figure 1: A graphical illustration of the proposed method. The algorithm takes as input the normalized pairwise distance matrices associated to a collection of candidate visualizations (viz1 to viz4) of a dataset. For each sample of the dataset, we compute the similarity matrix between the rows of the normalized distance matrices associated to the sample (rows highlighted in the same color), and then define the corresponding eigenscores as the first eigenvector of the similarity matrix. The size of the circles in the similarity matrices and the vectors of eigenscores indicate the magnitude of the entries (assumed to be non-negative). The meta-distance matrix is defined such that its rows are the eigenscore-weighted average of the rows in the normalized distance matrices. The meta-distance leads to a meta-visualization, expected to be more concordant with the underlying true structure than individual candidate visualizations.

Numerically, through extensive simulations and analysis of multiple real-world datasets with diverse underlying structures, we show the effectiveness of the proposed eigenscores in assessing and ranking a collection of candidate visualizations, and demonstrate the superiority of the final meta-visualization over all the candidate visualizations in terms of identification and characterization of these structural patterns. To achieve a deeper understanding of the proposed method, we also develop a formal statistical framework that rigorously justifies the proposed scoring and meta-visualization method, providing theoretical insights into the fundamental principles behind the empirical success of the method, along with its proper interpretation and practical guidance.

1.1 Related Works

Quantitative assessment of dimension reduction and data visualization algorithms is of substantial practical interest, and has been extensively studied in the past two decades. For example, many evaluation methods are based on distortion measures from metric geometry (Abraham et al., 2006, 2009; Chennuru Vankadara and von Luxburg, 2018; Bartal et al., 2019), whereas other methods rely on information-theoretic precision-recall measures (Venna et al., 2010; Arora et al., 2018), co-ranking structure (Mokbel et al., 2013), or graph-based criteria (Wang et al., 2021; Cai and Ma, 2021). See also Bertini et al. (2011), Nonato and Aupetit (2018) and Espadoto et al. (2019) for recent reviews. However, most of these existing methods evaluate data visualizations by comparing them directly with the original dataset, without accounting for its noisiness. The resulting assessment may therefore suffer from intrinsic bias, because it ignores the underlying true structures, which are only approximately represented by the noisy observations; see Section 2.3.5 and Supplement Figure 21 for more discussion. To address this issue, the proposed eigenscores, in contrast, provide provably consistent assessment and ranking of visualizations, reflecting their relative concordance with the underlying noiseless structures in the data.

Compared to the quantitative assessment of data visualizations, there is a scarcity of meta-visualization methods that combine the strengths of multiple data visualizations. In Pagliosa et al. (2015), an interactive method is developed that assesses and combines different multidimensional projection methods via a convex combination technique. In contrast, for supervised learning tasks such as classification, there is a long history of research on designing and developing meta-classifiers that combine multiple classifiers (Woods et al., 1997; Tax et al., 2000; Parisi et al., 2014; Liu et al., 2017; Mohandes et al., 2018). Compared with meta-classification, the main difficulty of meta-visualization lies in the identification of a common space to properly align multiple visualizations, or low-dimensional embeddings, whose scales and coordinate bases may drastically differ from one to another (see, for example, Figures 3-5(a)). Moreover, unlike many meta-classifiers, which combine presumably independent classifiers trained over different datasets, a meta-visualization procedure typically relies on multiple visualizations of the same dataset, and therefore has to deal with a more complicated correlation structure among the visualizations. The current study provides the first meta-visualization method that can flexibly combine any number of visualizations, and that has interpretable and provable performance guarantees.

1.2 Main Contributions

The main contributions of the current study can be summarized as follows:

  • We propose a computationally efficient spectral method for assessing and combining multiple data visualizations. The method is generic and easy to implement: it does not require knowledge of the original dataset, and can be applied to a large number of data visualizations generated by diverse methods.

  • For any collection of visualizations of a dataset, our method provides a quantitative measure – eigenscore – of the relative performance of the visualizations for preserving the structure around each data point. The eigenscores are useful in their own right for assessing the local and global reliability of a visualization in representing the underlying structures of the data, and for guiding the selection of hyper-parameters.

  • The proposed method automatically combines strengths and ameliorates weaknesses (distortions) of the candidate visualizations, leading to a meta-visualization which is provably better than all the candidate visualizations under a wide range of settings. We show that the meta-visualization is able to capture diverse intrinsic structures, such as clusters, trajectories, and mixed low-dimensional structures, contained in noisy and high-dimensional datasets.

  • We establish rigorous theoretical justifications of the method under a general signal-plus-noise model (Section 2.3) in the large-sample limit. We prove the convergence of the eigenscores to certain underlying true concordance measures, the guaranteed performance of the meta-visualization and its advantages over alternative methods, its robustness against possible adversarial candidate visualizations, along with their conditions, interpretations, and practical implications.

The proposed method is described in detail in Section 2.1, and empirically illustrated and evaluated in Section 2.2, through extensive simulation studies and analyses of three real-world datasets with diverse underlying structures. In Section 2.3, we show results from our theoretical analysis, which unveils fundamental principles associated to the method, such as the benefits of including qualitatively and functionally diverse candidate visualizations.

2 RESULTS

2.1 Eigenscore and Meta-Visualization Methodology

Throughout, without loss of generality, we assume that for visualization purpose the target embedding is two-dimensional, although our discussion applies to any finite-dimensional embedding.

Algorithm 1 Spectral assessment and combination of multiple data visualizations
Input: candidate visualizations $\{\mathbb{X}^{(k)}_{i}\}_{1\leq i\leq n}$ for $k\in\{1,2,...,K\}$.
1. Construct normalized pairwise-distance matrices: for each $k\in\{1,2,...,K\}$, calculate
$$\bar{\mathbb{P}}^{(k)}=[\mathbb{D}^{(k)}]^{-1}\mathbb{P}^{(k)}, \tag{2.1}$$
where $\mathbb{P}^{(k)}=(\|\mathbb{X}_{i}^{(k)}-\mathbb{X}_{j}^{(k)}\|_{2})_{1\leq i,j\leq n}$ and $\mathbb{D}^{(k)}=\text{diag}(\|\mathbb{P}_{1.}^{(k)}\|_{2},...,\|\mathbb{P}_{n.}^{(k)}\|_{2})$.
2. Obtain eigenscores: for each $i\in\{1,2,...,n\}$,
     (i) calculate the similarity matrix
$$\mathbb{G}_{i}=((\bar{\mathbb{P}}_{i.}^{(k_{1})})^{\top}\bar{\mathbb{P}}_{i.}^{(k_{2})})_{1\leq k_{1},k_{2}\leq K}. \tag{2.2}$$
     (ii) perform eigen-decomposition of $\mathbb{G}_{i}$ and define the eigenscores
$$\widehat{\mathbb{s}}_{i}=(\hat{s}_{i,1},\hat{s}_{i,2},...,\hat{s}_{i,K}):=|\widehat{\mathbb{u}}_{i}|, \tag{2.3}$$
where $\widehat{\mathbb{u}}_{i}$ is the eigenvector of $\mathbb{G}_{i}$ associated to its largest eigenvalue.
3. Construct meta-distance matrix: for each $i\in\{1,2,...,n\}$, calculate the eigenscore-weighted average
$$\bar{\mathbb{P}}_{i.}^{m}=\sum_{k=1}^{K}\hat{s}_{i,k}\bar{\mathbb{P}}_{i.}^{(k)}, \tag{2.4}$$
and define $\bar{\mathbb{P}}^{m}\in\mathbb{R}^{n\times n}$ whose $i$-th row is $\bar{\mathbb{P}}_{i.}^{m}$.
4. Obtain meta-visualization: apply an existing visualization method (e.g., UMAP or kPCA) to $\bar{\mathbb{P}}^{m}$ to obtain a meta-visualization.
Output: the eigenscores $\{\widehat{\mathbb{s}}_{i}\}_{1\leq i\leq n}$, and the meta-visualization.

We consider visualizing a $p$-dimensional dataset $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ containing $n$ samples. From $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$, suppose we obtain a collection of $K$ (candidate) visualizations of the data, produced by various visualization methods. We denote these visualizations as two-dimensional embeddings $\{\mathbb{X}_{i}^{(k)}\}_{1\leq i\leq n}\subset\mathbb{R}^{2}$ for $k\in\{1,2,...,K\}$. Our approach only needs access to the low-dimensional embeddings $\{\mathbb{X}_{i}^{(k)}\}_{1\leq i\leq n}$ rather than the raw data $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$; as a result, users can apply our method even when they do not have access to the original data, which is often the case.

2.1.1 Measuring Normalized Distances From Each Visualization

To make the proposed method invariant to the respective scale and coordinate basis (i.e., directionality) of the low-dimensional embeddings generated by different visualization methods, we start by considering the normalized pairwise-distance matrix for each visualization.

Specifically, for each $k\in\{1,2,...,K\}$, we define the normalized pairwise-distance matrix

$$\bar{\mathbb{P}}^{(k)}=[\mathbb{D}^{(k)}]^{-1}\mathbb{P}^{(k)}\in\mathbb{R}^{n\times n}, \tag{2.5}$$

where

$$\mathbb{P}^{(k)}=(\|\mathbb{X}^{(k)}_{i}-\mathbb{X}^{(k)}_{j}\|_{2})_{1\leq i,j\leq n}\in\mathbb{R}^{n\times n} \tag{2.6}$$

is the un-normalized Euclidean distance matrix, and $\mathbb{D}^{(k)}=\text{diag}(\|\mathbb{P}^{(k)}_{1.}\|_{2},\|\mathbb{P}^{(k)}_{2.}\|_{2},...,\|\mathbb{P}^{(k)}_{n.}\|_{2})$ is a diagonal matrix with its diagonal entries being the $\ell_{2}$-norms of the rows $\{\mathbb{P}^{(k)}_{1.},...,\mathbb{P}^{(k)}_{n.}\}$ of $\mathbb{P}^{(k)}$. As a result, the normalized distance matrix $\bar{\mathbb{P}}^{(k)}$ has unit-vector rows, and is invariant to any scaling and rotation of the visualization $\{\mathbb{X}_{i}^{(k)}\}_{1\leq i\leq n}$.

The normalized distance matrices $\{\bar{\mathbb{P}}^{(k)}\}_{1\leq k\leq K}$ summarize the candidate visualizations in a compact and efficient way. Their scale- and rotation-invariance properties are particularly useful for comparing visualizations produced by distinct methods.
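As a concrete illustration of this first step, the following minimal NumPy sketch (our own rendering of (2.5)-(2.6), not the authors' released code) computes the normalized pairwise-distance matrix of a single two-dimensional embedding:

```python
import numpy as np

def normalized_distance_matrix(X):
    """Row-normalized pairwise-distance matrix (2.5) of one candidate embedding.

    X : (n, 2) array of embedding coordinates.
    Returns an (n, n) matrix whose rows are unit vectors; the output is invariant
    to any scaling or rotation of X.
    """
    P = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # distance matrix P^(k)
    row_norms = np.linalg.norm(P, axis=1, keepdims=True)         # ||P_i.||_2 for each i
    return P / row_norms                                         # [D^(k)]^{-1} P^(k)

# toy usage: a random 2-D embedding of n = 100 samples
X = np.random.default_rng(0).normal(size=(100, 2))
P_bar = normalized_distance_matrix(X)
assert np.allclose(np.linalg.norm(P_bar, axis=1), 1.0)           # rows are unit vectors
```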

2.1.2 Sample-wise Eigenscores for Assessing Visualizations

Our spectral method for assessing multiple visualizations is based on the normalized distance matrices $\{\bar{\mathbb{P}}^{(k)}\}_{1\leq k\leq K}$. For each $i\in\{1,2,...,n\}$, we define the similarity matrix

$$\mathbb{G}_{i}=((\bar{\mathbb{P}}_{i.}^{(k_{1})})^{\top}\bar{\mathbb{P}}_{i.}^{(k_{2})})_{1\leq k_{1},k_{2}\leq K}\in\mathbb{R}^{K\times K}, \tag{2.7}$$

which summarizes the pairwise similarity between the candidate visualizations with respect to sample $i$. By construction, the entries of $\mathbb{G}_{i}$ are inner-products between unit vectors, each representing the normalized distances associated with sample $i$ in a candidate visualization. Naturally, a larger entry $(\bar{\mathbb{P}}_{i.}^{(k_{1})})^{\top}\bar{\mathbb{P}}_{i.}^{(k_{2})}$ indicates higher concordance between the two candidate visualizations. Then, for each $i\in\{1,2,...,n\}$, we define the vector of eigenscores $\widehat{\mathbb{s}}_{i}=(\hat{s}_{i,1},...,\hat{s}_{i,K})$ for the candidate visualizations with respect to sample $i$ as the absolute value of the eigenvector $\widehat{\mathbb{u}}_{i}\in\mathbb{R}^{K}$ of $\mathbb{G}_{i}$ associated to its largest eigenvalue, that is,

$$\widehat{\mathbb{s}}_{i}:=|\widehat{\mathbb{u}}_{i}|, \tag{2.8}$$

where the absolute value function $|\cdot|$ is applied entrywise. As will be explained later (Section 2.3.2), the nonnegative components of $\widehat{\mathbb{s}}_{i}$ quantify the relative performance of the $K$ candidate visualizations with respect to sample $i$, with higher eigenscores indicating better performance. Consequently, for each candidate visualization $\{\mathbb{X}_{i}^{(k)}\}_{1\leq i\leq n}$, one obtains a set of eigenscores $\{\hat{s}_{i,k}\}_{1\leq i\leq n}$ summarizing its performance relative to other candidate visualizations in a sample-wise manner. Ranking and selection among candidate visualizations can be achieved based on various summary statistics of the eigenscores, such as mean, median, or coefficient of variation, depending on the specific applications. In particular, when some candidate visualizations are produced by the same method but under different tuning parameters, the eigenscores can be used to select the most suitable tuning parameters for visualizing the dataset. However, a more substantial application of the eigenscores is to combine multiple data visualizations into a meta-visualization, which has improved signal-to-noise ratio and higher resolution of the structural information contained in the data.

Importantly, as will be shown later (Section 2.3.5), the eigenscores essentially take the underlying true signals rather than the noisy observations $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ as their referential target for performance assessment, making the method easier to implement and less susceptible to the effect of noise in the original data (Section 2.3.5 and Supplement Figure 21).
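For illustration, the following sketch (again a minimal NumPy rendering of our own, not the authors' implementation) computes the similarity matrix $\mathbb{G}_i$ in (2.7) and the eigenscores in (2.8) for one sample, given a list of normalized distance matrices such as those produced by the earlier sketch:

```python
import numpy as np

def eigenscores_for_sample(P_bars, i):
    """Eigenscores (2.8) of K candidate visualizations with respect to sample i.

    P_bars : list of K (n, n) row-normalized distance matrices.
    Returns a length-K vector of nonnegative scores.
    """
    rows = np.stack([P_bar[i] for P_bar in P_bars])   # (K, n): i-th row of each matrix
    G = rows @ rows.T                                  # K x K similarity matrix G_i (2.7)
    eigvals, eigvecs = np.linalg.eigh(G)               # symmetric eigendecomposition
    return np.abs(eigvecs[:, -1])                      # |leading eigenvector|

# a visualization's overall quality can then be summarized, e.g., by its median eigenscore:
# median_scores = np.median([eigenscores_for_sample(P_bars, i) for i in range(n)], axis=0)
```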

2.1.3 Meta-Visualization using Eigenscores

Using the above eigenscores, one can construct a meta-distance matrix properly combining the information contained in each candidate visualization. Specifically, for each $i\in\{1,2,...,n\}$, we define the vector of meta-distances with respect to sample $i$ as the eigenscore-weighted average of all the normalized distances with respect to sample $i$, that is,

$$\bar{\mathbb{P}}_{i.}^{m}=\sum_{k=1}^{K}\hat{s}_{i,k}\bar{\mathbb{P}}_{i.}^{(k)}\in\mathbb{R}^{n}. \tag{2.9}$$

Then, the meta-distance matrix is defined as $\bar{\mathbb{P}}^{m}\in\mathbb{R}^{n\times n}$ whose $i$-th row is $\bar{\mathbb{P}}_{i.}^{m}$. To obtain a meta-visualization, we take the meta-distance matrix $\bar{\mathbb{P}}^{m}$ and apply an existing visualization method that accepts $\bar{\mathbb{P}}^{m}$ (or its symmetrized version $\bar{\mathbb{P}}^{m}+(\bar{\mathbb{P}}^{m})^{\top}$) as its input.

Intuitively, for each $i=1,2,...,n$, we essentially apply a principal component (PC) analysis to the normalized distance matrix $\bar{\mathbb{P}}_{i}=\begin{bmatrix}\bar{\mathbb{P}}_{i.}^{(1)}&\bar{\mathbb{P}}_{i.}^{(2)}&...&\bar{\mathbb{P}}_{i.}^{(K)}\end{bmatrix}\in\mathbb{R}^{n\times K}$. Specifically, by definition (Jolliffe and Cadima, 2016) the leading eigenvector $\widehat{\mathbb{u}}_{i}$ of $\mathbb{G}_{i}=\bar{\mathbb{P}}_{i}^{\top}\bar{\mathbb{P}}_{i}\in\mathbb{R}^{K\times K}$ contains the first PC loadings of $\bar{\mathbb{P}}_{i}$, and the first PC is the linear combination $\bar{\mathbb{P}}_{i}\widehat{\mathbb{u}}_{i}=\sum_{k=1}^{K}\hat{u}_{i,k}\bar{\mathbb{P}}_{i.}^{(k)}$. Under the condition that the first PC loadings are all nonnegative (which is ensured with high probability under condition (C2) below), the first PC $\bar{\mathbb{P}}_{i}\widehat{\mathbb{u}}_{i}$ is exactly the meta-distance $\bar{\mathbb{P}}_{i.}^{m}$ defined in (2.9) above. When interpreted as PC loadings, the leading eigenvector $\widehat{\mathbb{u}}_{i}$ of $\mathbb{G}_{i}$ contains weights for the different vectors $\{\bar{\mathbb{P}}_{i.}^{(k)}\}_{1\leq k\leq K}$ so that the final linear combination $\bar{\mathbb{P}}_{i}\widehat{\mathbb{u}}_{i}$ has the largest variance, that is, summarizes the most information contained in $\bar{\mathbb{P}}_{i}$. It is in this sense that the meta-distance $\bar{\mathbb{P}}_{i.}^{m}$ is a consensus across $\{\bar{\mathbb{P}}_{i.}^{(k)}\}_{1\leq k\leq K}$.

For our own numerical studies (Section 2.2), we used UMAP for meta-visualizing datasets with cluster structures, and kPCA for meta-visualizing all the other datasets with smoother manifold structures, such as trajectory, cycle, or mixed structures. The choice of UMAP in the former case was due to its advantage in treating large numbers of clusters without requiring prior knowledge about the number of clusters (McInnes et al., 2018; Cai and Ma, 2021), whereas the choice of kPCA in the latter case was rooted in its advantage in capturing nonlinear smooth manifold structures (Ding and Ma, 2022). In each case, the hyper-parameters used for generating the meta-visualization were determined without further tuning – for example, when using UMAP for meta-visualization, we set the hyper-parameters the same as those associated to the UMAP candidate visualization that achieved the higher median eigenscore. Moreover, while in general UMAP/kPCA works well as a default method for meta-visualization, our proposed algorithm is robust with respect to the choice of this final visualization method. In our numerical analysis (Section 2.2), we observed empirically that other methods such as t-SNE and PHATE could also lead to meta-visualizations with comparably substantial improvement over individual candidate visualizations in terms of the concordance with the underlying true low-dimensional structure of the data (see Supplement Figure 11). In addition, the meta-visualization shows robustness to potential outliers in the data (Figures 4 and 5).
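To make the last two steps concrete, here is a minimal sketch that assembles the meta-distance matrix (2.9) and embeds its symmetrized version with UMAP; the use of the umap-learn package and its metric="precomputed" option is our own choice for illustration, and kPCA or another method accepting a distance (or kernel) matrix could be substituted, as discussed above.

```python
import numpy as np
import umap  # assumed: the umap-learn package

def spectral_meta_distance(P_bars):
    """Eigenscore-weighted meta-distance matrix (2.9).

    P_bars : list of K (n, n) row-normalized distance matrices.
    """
    n = P_bars[0].shape[0]
    P_meta = np.zeros((n, n))
    for i in range(n):
        rows = np.stack([P_bar[i] for P_bar in P_bars])    # (K, n)
        G = rows @ rows.T                                   # similarity matrix G_i
        _, eigvecs = np.linalg.eigh(G)
        s_i = np.abs(eigvecs[:, -1])                        # eigenscores for sample i
        P_meta[i] = s_i @ rows                              # weighted average of rows
    return P_meta

# toy usage with K = 4 random candidate embeddings of n = 60 samples
rng = np.random.default_rng(1)
P_bars = []
for _ in range(4):
    X = rng.normal(size=(60, 2))
    P = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    P_bars.append(P / np.linalg.norm(P, axis=1, keepdims=True))
P_meta = spectral_meta_distance(P_bars)
D_sym = P_meta + P_meta.T                                   # symmetrized meta-distance
meta_embedding = umap.UMAP(metric="precomputed").fit_transform(D_sym)
```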

In Section 2.3, under a generic signal-plus-noise model, we obtain explicit theoretical conditions under which the performance of the proposed spectral method is guaranteed. Specifically, we show the convergence of the eigenscores to a desirable concordance measure between the candidate visualizations and the underlying true pattern, characterized by their associated pairwise distance matrices of the samples. In addition, we show improved performance of the meta-visualization over the candidate visualizations in terms of their concordance with the underlying true pattern, and its robustness against possible adversarial candidate visualizations. These conditions provide proper interpretations and guidance on the application of the method, such as how to more effectively prepare the candidate visualizations (Section 2.3.6). For clarity, we summarize these technical conditions informally as follows, and relegate their precise statements to Section 2.3.

  • (C1’)

    The performances of the candidate visualizations are sufficiently diverse in terms of their individual distortions from the underlying true structures.

  • (C2’)

    The candidate visualizations altogether contain a sufficient amount of information about the underlying true structures.

Intuitively, Condition (C1’) concerns the diversity of the methods producing the candidate visualizations, whereas Condition (C2’) relates to the quality of the candidate visualizations. In practice, Condition (C2’) is satisfied when the signal-to-noise ratio in the data, as described by (2.11), is sufficiently large, so that the adopted visualization methods perform reasonably well on average. On the other hand, from Section 2.3, a sufficient condition for (C1’) is that at most $\sqrt{K}$ out of the $K$ candidate visualizations are very similarly distorted from the true patterns in terms of the normalized distances $\bar{\mathbb{P}}^{(k)}$. This would allow, for example, groups of up to 3 to 4 candidate visualizations out of 10 to 15 visualizations being produced by very similar procedures, such as the same method under different hyper-parameters.

2.2 Simulations and Visualization of Real-World Datasets

2.2.1 Simulation Studies: Visualizing Noisy Low-Dimensional Structures

To demonstrate the wide range of applicability and the empirical advantage of the proposed method, we consider the visualization of three families of noisy datasets, each containing a distinct low-dimensional structure as its underlying true signal. We assess the performance of the eigenscores and the quality of the resulting meta-distance matrix based on 16 candidate visualizations produced by multiple visualization methods.

For a given sample size $n$, we generate $p$-dimensional noisy observations $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ from the signal-plus-noise model $\mathbb{Y}_{i}=\mathbb{Y}_{i}^{*}+\mathbb{Z}_{i}$, where $\{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n}$ are the underlying noiseless samples (signals), and $\{\mathbb{Z}_{i}\}_{1\leq i\leq n}$ are the random noises. Specifically, we generate the true signals $\{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n}$ from various low-dimensional structures isometrically embedded in the $p$-dimensional Euclidean space. Each of the low-dimensional structures lies in some $r$-dimensional linear subspace, and is subject to an arbitrary rotation in $\mathbb{R}^{p}$, so that these signals are generally $p$-dimensional vectors with dense (nonzero) coordinates. Then we generate i.i.d. noise vectors $\mathbb{Z}_{i}$ from the standard multivariate normal distribution $\mathcal{N}({\bf 0},{\bf I}_{p})$, and use the $p$-dimensional noisy vector $\mathbb{Y}_{i}=\mathbb{Y}_{i}^{*}+\mathbb{Z}_{i}$ as the final observed data. In this way, we simulate noisy observations $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ of an intrinsically $r$-dimensional structure. In our simulations, for a given signal-to-noise ratio (SNR) parameter $\theta>0$, we generate $\{\mathbb{Y}^{*}_{i}\}_{1\leq i\leq n}$ uniformly from each of the following three structures:

  • (i)

    Finite point mixture with $r=5$: $\{\mathbb{Y}^{*}_{i}\}_{1\leq i\leq n}$ are independently sampled from the discrete set $\{\bm{\gamma}_{1},\bm{\gamma}_{2},...,\bm{\gamma}_{r+1}\}\subset\mathbb{R}^{p}$ with equal probability, where the $\bm{\gamma}_{i}$'s are arbitrary orthogonal vectors in $\mathbb{R}^{p}$ with the same length, i.e., $\|\bm{\gamma}_{i}\|_{2}=\theta$ for $1\leq i\leq r+1$.

  • (ii)

    “Smiley face” with $r=2$: $\{\mathbb{Y}^{*}_{i}\}_{1\leq i\leq n}$ are generated independently and uniformly from a two-dimensional “smiley face” structure (Supplement Figure 6 left) of diameter $\theta$, isometrically embedded in $\mathbb{R}^{p}$ and subject to an arbitrary rotation.

  • (iii)

    “Mammoth” manifold with $r=3$: $\{\mathbb{Y}^{*}_{i}\}_{1\leq i\leq n}$ are generated independently and uniformly from a three-dimensional “mammoth” manifold (Supplement Figure 6 right) of diameter $\theta$, isometrically embedded in $\mathbb{R}^{p}$ and subject to an arbitrary rotation.

The datasets thus generated cover diverse structures including Gaussian mixture clusters (i), mixed-type nonlinear clusters (ii), and a connected smooth manifold (iii). Specifically, the first family of datasets has $p=500$ and $n=900$, and was obtained by fixing various values of the SNR parameter $\theta$ and generating $\mathbb{Y}^{*}_{i}\in\mathbb{R}^{p}$ from setting (i) above to obtain the noisy dataset $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ as described earlier. Similarly, the second and third families of datasets were obtained by drawing $\mathbb{Y}^{*}_{i}\in\mathbb{R}^{p}$ from settings (ii) and (iii), respectively, and generating datasets $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ with $p=300$ and $n=500$, for various values of $\theta$.
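For concreteness, the following sketch (our own illustration, using the parameter values of setting (i)) generates one such simulated dataset: $r+1$ orthogonal cluster centers of length $\theta$ embedded in $\mathbb{R}^p$ through a random orthonormal basis, plus standard Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, r, theta = 900, 500, 5, 5.0

# r + 1 orthonormal directions in R^p via QR of a random Gaussian matrix,
# scaled to length theta; the random basis plays the role of an arbitrary rotation
Q, _ = np.linalg.qr(rng.normal(size=(p, r + 1)))
centers = theta * Q.T                                   # (r + 1, p) cluster centers

labels = rng.integers(0, r + 1, size=n)                 # equal-probability mixture
Y_star = centers[labels]                                # noiseless signals Y_i^*
Y = Y_star + rng.normal(size=(n, p))                    # observed data Y_i = Y_i^* + Z_i
```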

For each dataset $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$, we consider 12 existing data visualization tools including principal component analysis (PCA), multi-dimensional scaling (MDS), Kruskal’s non-metric MDS (iMDS) (Kruskal, 1978), Sammon’s mapping (Sammon) (Sammon, 1969), locally linear embedding (LLE) (Roweis and Saul, 2000), Hessian LLE (HLLE) (Donoho and Grimes, 2003), isomap (Tenenbaum et al., 2000), kPCA, Laplacian eigenmap (LEIM), UMAP, t-SNE and PHATE (Moon et al., 2019). For methods such as kPCA, t-SNE, UMAP and PHATE that require tuning parameters, we consider two different settings (Section A.2 of the Supplement) of tuning parameters for each method, denoted as “kPCA1” and “kPCA2,” etc. Therefore, for each dataset we obtain $K=16$ candidate visualizations corresponding to different combinations of visualization tools and tuning parameters. Applying our proposed method, we obtain eigenscores $\{\widehat{\mathbb{s}}_{i}\}_{1\leq i\leq n}$ for the candidate visualizations. We also compare two meta-distances based on the 16 visualizations, namely, the proposed spectral meta-distance matrix (“meta-spec”) based on the eigenscores, and the naive meta-distance matrix (“meta-aver”) assigning equal weights to all the candidate visualizations, as in (2.13).
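As a rough sketch of how such a panel of candidate embeddings could be produced with off-the-shelf Python implementations, one might proceed as below; the specific packages (scikit-learn, umap-learn) and parameter values here are our own assumptions, and the settings actually used are detailed in Section A.2 of the Supplement.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import (MDS, TSNE, Isomap, LocallyLinearEmbedding,
                              SpectralEmbedding)
import umap  # assumed: umap-learn

def candidate_visualizations(Y):
    """Return a dict of 2-D embeddings of the data matrix Y produced by several methods."""
    methods = {
        "PCA":    PCA(n_components=2),
        "MDS":    MDS(n_components=2),
        "Isomap": Isomap(n_components=2, n_neighbors=15),
        "LLE":    LocallyLinearEmbedding(n_components=2, n_neighbors=15),
        "LEIM":   SpectralEmbedding(n_components=2),          # Laplacian eigenmap
        "kPCA1":  KernelPCA(n_components=2, kernel="rbf"),
        "tSNE1":  TSNE(n_components=2, perplexity=30),
        "UMAP1":  umap.UMAP(n_components=2, n_neighbors=15),
    }
    return {name: m.fit_transform(Y) for name, m in methods.items()}

# toy usage on a small random data matrix (n = 100 samples, p = 20 features)
Y = np.random.default_rng(2).normal(size=(100, 20))
embeddings = candidate_visualizations(Y)
```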

Table 1: Empirical mean and standard error (SE) of the averaged cosines $\frac{1}{n}\sum_{i=1}^{n}\cos\angle(\widehat{\mathbb{s}}_{i},\mathbb{s}_{i})$ between the eigenscores and the true concordance measures, over each family of datasets associated with a given low-dimensional structure under various values of the SNR parameter $\theta$.

Low-Dimensional Structure | Gaussian mixture | Smiley face | Mammoth
Simulation Setting | $(n,p)=(900,500)$ | $(n,p)=(500,300)$ | $(n,p)=(500,300)$
Empirical Mean (SE) | 0.992 ($10^{-5}$) | 0.986 ($10^{-5}$) | 0.990 ($10^{-5}$)

To evaluate the proposed eigenscores, for each setting and each $i\in\{1,2,...,n\}$, we compute $\cos\angle(\widehat{\mathbb{s}}_{i},\mathbb{s}_{i}):=\frac{(\widehat{\mathbb{s}}_{i})^{\top}\mathbb{s}_{i}}{\|\widehat{\mathbb{s}}_{i}\|_{2}\|\mathbb{s}_{i}\|_{2}}$, the cosine of the angle between the eigenscores $\widehat{\mathbb{s}}_{i}$ and the true local concordance $\mathbb{s}_{i}$ defined as

$$\mathbb{s}_{i}:=((\bar{\mathbb{P}}_{i.}^{(1)})^{\top}\bar{\mathbb{P}}^{*}_{i.},(\bar{\mathbb{P}}_{i.}^{(2)})^{\top}\bar{\mathbb{P}}^{*}_{i.},...,(\bar{\mathbb{P}}_{i.}^{(K)})^{\top}\bar{\mathbb{P}}^{*}_{i.})\in\mathbb{R}^{K}, \tag{2.10}$$

where $\bar{\mathbb{P}}^{*}_{i.}$ is the $i$-th row of the normalized distance matrix $\bar{\mathbb{P}}^{*}$ for the underlying noiseless samples $\{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n}$, defined as in (2.5) with the $\mathbb{X}_{i}^{(k)}$'s replaced by the $\mathbb{Y}^{*}_{i}$'s. Table 1 shows the empirical mean and standard error (SE) of the averaged cosines $\frac{1}{n}\sum_{i=1}^{n}\cos\angle(\widehat{\mathbb{s}}_{i},\mathbb{s}_{i})$, over the family of datasets under the same low-dimensional structure and across the values of $\theta$ shown in Figure 2(a). Our simulations show that $\cos\angle(\widehat{\mathbb{s}}_{i},\mathbb{s}_{i})\approx 1$, indicating that the eigenscores $\widehat{\mathbb{s}}_{i}$ essentially characterize the true concordance between the patterns contained in each candidate visualization and those of the underlying noiseless samples, evaluated locally with respect to sample $i$. This justifies the proposed eigenscore as a precise measure of the performance of the candidate visualizations in preserving the underlying true signals.
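In outline, the evaluation behind Table 1 can be reproduced as in the following sketch (our own schematic code): given the candidate embeddings and the noiseless signals $\mathbb{Y}_i^*$, it computes the eigenscores, the true concordance vectors (2.10), and their average cosine.

```python
import numpy as np

def row_normalized_distances(X):
    """Normalized pairwise-distance matrix (2.5) of a point set X (n x d)."""
    P = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def mean_eigenscore_cosine(embeddings, Y_star):
    """Average cosine between eigenscores and the true concordance vectors (2.10).

    embeddings : list of K (n, 2) candidate embeddings; Y_star : (n, p) noiseless signals.
    """
    P_bars = [row_normalized_distances(X) for X in embeddings]   # candidate matrices
    P_star = row_normalized_distances(Y_star)                    # noiseless reference
    n = Y_star.shape[0]
    cosines = np.empty(n)
    for i in range(n):
        rows = np.stack([P_bar[i] for P_bar in P_bars])          # (K, n)
        G = rows @ rows.T
        s_hat = np.abs(np.linalg.eigh(G)[1][:, -1])              # eigenscores
        s_true = rows @ P_star[i]                                 # true concordance s_i
        cosines[i] = s_hat @ s_true / (np.linalg.norm(s_hat) * np.linalg.norm(s_true))
    return cosines.mean()
```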

Figure 2: (a) Boxplots of the mean concordance with the underlying true pattern for 15 candidate visualizations (HLLE omitted due to very low concordance) and the two meta-distance matrices under each simulation setting (Left: Gaussian mixture; Middle: smiley face; Right: mammoth) across various values of the SNR parameter $\theta$. See Supplement Figure 7 for complete plots. (b) Top: examples of candidate visualizations along with their sample-wise concordance $\{(\bar{\mathbb{P}}_{i.}^{(k)})^{\top}\bar{\mathbb{P}}^{*}_{i.}\}_{1\leq i\leq n}$ with the structure of the noiseless samples, and the proposed meta-visualization using UMAP along with the concordance $\{(\bar{\mathbb{P}}_{i.}^{m})^{\top}\bar{\mathbb{P}}^{*}_{i.}\}_{1\leq i\leq n}$ for the proposed meta-distance. Bottom: boxplots (center line, median; box limits, upper and lower quartiles; points, outliers) of concordance measures as grouped by the true clusters under the Gaussian mixture setting. See Supplement Figure 8 for more examples.

To assess the quality of the two meta-distance matrices, for each dataset we compare the mean concordance $\frac{1}{n}\sum_{i=1}^{n}(\bar{\mathbb{P}}_{i.}^{(k)})^{\top}\bar{\mathbb{P}}^{*}_{i.}$ between the normalized distance of each candidate visualization and that of the underlying noiseless samples, with the mean concordance $\frac{1}{n}\sum_{i=1}^{n}(\bar{\mathbb{P}}_{i.}^{m})^{\top}\bar{\mathbb{P}}^{*}_{i.}$ between the obtained meta-distance and that of the underlying noiseless samples. Figure 2(a) and Supplement Figure 7 show boxplots of these mean concordances for the 16 candidate visualizations and the two meta-distances under each setting of underlying structures across various values of $\theta$. We observe that for each of the three structures, our proposed meta-distance is substantially more concordant with the underlying true patterns than every candidate visualization and the naive meta-distance, indicating the superiority of the proposed meta-distance. To further demonstrate the advantage of the spectral meta-distance and its benefits to the final meta-visualization, we compared our proposed meta-visualization using UMAP with the candidate visualizations of a dataset under setting (i) with $\theta=5$, and present their sample-wise concordance $\{(\bar{\mathbb{P}}_{i.}^{(k)})^{\top}\bar{\mathbb{P}}^{*}_{i.}\}_{1\leq i\leq n}$ for each $k$, and $\{(\bar{\mathbb{P}}_{i.}^{m})^{\top}\bar{\mathbb{P}}^{*}_{i.}\}_{1\leq i\leq n}$ for the proposed meta-distance (Figure 2(b) and Supplement Figure 8). We observe that, while each individual method may capture some clusters in the dataset but miss others, the proposed meta-visualization is able to combine the strengths of all the candidate visualizations in order to capture all the underlying clusters. Finally, to demonstrate the flexibility of our method with respect to higher intrinsic dimension $r$, under setting (i) we further evaluated the performance of different methods for $r\in\{15,30,50\}$. Supplement Figure 9 shows the consistent and superior performance of the proposed method compared to the other approaches.

2.2.2 Visualizing Clusters of Religious and Biblical Texts

Cluster data are ubiquitous in scientific research and industrial applications. Our first real data example concerns $n=590$ fragments of text, extracted from English translations of eight religious books or sacred scripts: the Book of Proverbs (BOP), Book of Ecclesiastes (BOE1), Book of Ecclesiasticus (BOE2), Book of Wisdom (BOW), Four Noble Truths of Buddhism (BUD), Tao Te Ching (TTC), Yogasutras (YOG), and Upanishads (UPA) (Sah and Fokoué, 2019). All the text fragments were pre-processed using natural language processing into a $590\times 8265$ Document Term Matrix that counts the frequency of 8265 atomic words, such as truth, diligent, sense, and power, in each text fragment. In other words, each text fragment was treated as a bag of words, represented by a vector with $8265$ features. The word counts were centred and normalized before downstream analysis.

Figure 3: Visualization of 590 fragments of texts from eight religious and biblical books. (a) Three examples of candidate visualizations. The samples are marked by eight different symbols and colors according to their associated books. More examples are included in Supplement Figure 10. (b) Boxplots of eigenscores for all 16 candidate visualizations. (c) The proposed spectral meta-visualization using UMAP. (d) Median silhouette indices for the 16 candidate and 2 meta-visualizations. The error bars of the meta-visualizations indicate the variability (95% confidence interval) due to the visualization method (UMAP) applied to the meta-distance matrix (2.9).

As in our simulation studies, we again consider $K=16$ candidate visualizations generated by 12 different methods with various tuning parameters (see Section A.2 of the Supplement for implementation details). Figure 3(a) contains examples of candidate visualizations obtained by PHATE, t-SNE, and kPCA, whose median eigenscores were ranked top, middle, and bottom among all the visualizations (Figure 3(b)), respectively. More examples are included in Supplement Figure 10. In each visualization, the samples (text fragments) are colored by their associated books, showing how well the visualization captures the underlying clusters of the samples. The usefulness and validity of the eigenscores in Figure 3(b) can be verified empirically by visually comparing the clarity of the cluster patterns demonstrated by each candidate visualization in Figure 3(a) and in Supplement Figure 10. Figure 3(c) is the proposed meta-visualization of the samples obtained by applying UMAP to the meta-distance matrix (with a slight abuse of notation, we use “meta-spec” and “meta-aver” hereafter to refer to the final meta-visualizations rather than the meta-distance matrices as in Section 2.2.1), which shows substantially better clustering of the text fragments in accordance with their sources. In addition, the meta-visualization also reflects deeper relationships between the eight religious books, such as the similarity between the two Hinduism books YOG and UPA, the similarity between Buddhism (BUD) and Taoism (TTC), the similarity between the four Christian books BOE1, BOE2, BOP, and BOW, as well as the general discrepancy between Asian religions (Hinduism, Buddhism, Taoism) and non-Asian religions (Christianity). All of these important phenomena, while salient in our meta-visualization, appear only vaguely in a few candidate visualizations such as those produced by PHATE (Figure 3(a)) and UMAP (Supplement Figure 10).

To quantitatively evaluate the preservation of the underlying clustering pattern, we computed for each visualization the Silhouette indices (Rousseeuw, 1987) with respect to the underlying true cluster membership, based on the normalized pairwise-distance matrices of the embeddings defined in (2.5). The Silhouette index (see Section A.2 of the Supplement for its definition), defined for each individual sample in a visualization, measures the discrepancy between the within-class distances and the inter-class distances with respect to a given sample. As a result, for a given visualization, its Silhouette indices together indicate how well the underlying cluster pattern is preserved, with higher Silhouette indices indicating that the underlying clusters are more separated. Empirically, we observed a notable correlation ($\rho=0.679$) between the median Silhouette indices and the median eigenscores across the candidate visualizations (Supplement Figure 11). In addition, for each candidate visualization, we found that samples with higher Silhouette indices tend to have higher eigenscores (Supplement Figure 12), demonstrating the effectiveness of the eigenscores and their benefits to the final meta-visualization. In Figure 3(d), we show that, even taking into account the stochasticity of the visualization method (UMAP) applied to the meta-distance matrix, our meta-visualization had a median Silhouette index much higher than those of the candidate visualizations, as well as that of the meta-visualization “meta-aver” based on the naive meta-distance. It is of interest to note that “meta-spec” was the only visualization with a positive median Silhouette index, showing its better separation of clusters compared with the other visualizations. Importantly, the proposed meta-visualization was not sensitive to the specific visualization method applied to the meta-distance matrices – similar results were obtained when we replaced UMAP by PHATE, the method having the highest median eigenscore in Figure 3(b), or by t-SNE, for meta-visualization (Supplement Figure 11).
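For reference, a minimal sketch of how such Silhouette indices could be computed from a normalized distance matrix with scikit-learn is given below; symmetrizing the (row-normalized, hence asymmetric) distance matrix before calling silhouette_samples is our own simplification rather than a step prescribed by the paper.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def median_silhouette(P_bar, labels):
    """Median Silhouette index of a visualization w.r.t. true cluster labels.

    P_bar  : (n, n) row-normalized distance matrix of the embedding, as in (2.5).
    labels : length-n array of true cluster memberships.
    """
    D = (P_bar + P_bar.T) / 2.0        # symmetrize so it is a valid precomputed distance
    np.fill_diagonal(D, 0.0)
    s = silhouette_samples(D, labels, metric="precomputed")
    return np.median(s)
```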

2.2.3 Visualizing Cell Cycles

Our second real data example concerns visualization of a different low-dimensional structure, namely, a mixture of a cycle and clusters, contained in the gene expression profiles of a collection of mouse embryonic stem cells, as a result of the cell cycle mechanism. The cell cycle, or cell-division cycle, is the series of events that take place in a cell that cause it to divide into two daughter cells (see https://en.wikipedia.org/wiki/Cell_cycle). Identifying the cell cycle stages of individual cells analyzed during development is important for understanding its wide-ranging effects on cellular physiology and gene expression profiles. Specifically, our dataset contains $n=288$ mouse embryonic stem cells, whose underlying cell cycle stages were determined using flow cytometry sorting (Buettner et al., 2015). Among them, one-third (96) of the cells are in the G1 stage, one-third in the S stage, and the rest in the G2M stage. The raw count data were preprocessed and normalized, leading to a dataset consisting of standardized expression levels of $1147$ cell-cycle related genes for the 288 cells (see Section A.2 of the Supplement for implementation details of our data preprocessing).

Figure 4: Visualization of the cell cycle of 288 mouse embryonic stem cells. (a) Three examples of candidate visualizations. The cells are marked by three different symbols and colors according to their associated cell cycle stages. More examples are included in Supplement Figure 14. (b) Boxplots of eigenscores for all 16 candidate visualizations. (c) The proposed meta-visualization using kPCA. (d) Median Silhouette indices versus Kendall’s tau statistics for the 16 candidate and the 2 meta-visualizations (Red: proposed spectral meta-visualization; Orange: naive simple average meta-visualization). For both metrics, a higher value indicates a better visualization of the respective structure (cluster/cycle).

We obtained 16 candidate visualizations as before and applied our proposed method. Figure 4(a) contains examples of candidate visualizations obtained by t-SNE, LEIM, and kPCA, whose median eigenscores were ranked top, middle, and bottom among all the visualizations, respectively, with the cells colored according to their true cell cycle stages. Figure 4(b) contains the boxplots of eigenscores for the candidate visualizations, indicating the overall quality of each visualization. The variation of eigenscores within each candidate visualization suggests that different visualizations have their own unique features and strengths to contribute to the meta-visualization (Supplement Figure 16). Figure 4(c) is the proposed meta-visualization obtained by applying kPCA to the meta-distance matrix. Compared with Figure 4(a), the proposed meta-visualization shows better clustering of the cells according to their cell cycle stages, as well as a more salient cyclic structure underlying the three cell cycle stages (Supplement Figure 15). To quantify the performance of each visualization in terms of these two underlying structures (cluster and cycle), we considered two distinct metrics, namely, the median Silhouette index with respect to the underlying true cell cycle stages, and the Kendall’s tau statistic (Kendall, 1938) between the inferred relative order of the cells and their true order on the cycle. Specifically, to infer the relative order of the cells, we projected the coordinates of each visualization onto the two-dimensional unit circle centred at the origin (Supplement Figure 15), and then determined the relative orders based on the cells’ respective projected positions on the unit circle. Figure 4(d) shows that the proposed meta-visualization was significantly better than all the candidate visualizations and the naive meta-visualization in representing both aspects of the data.
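A sketch of this cycle-based evaluation is given below; centering the embedding before projecting onto the unit circle and the use of SciPy's kendalltau are our own choices, and in practice the inferred circular order would additionally need to be aligned with the true order (e.g., over starting points and orientations) before computing the statistic.

```python
import numpy as np
from scipy.stats import kendalltau

def circular_order_concordance(X, true_order):
    """Kendall's tau between a circular ordering inferred from a 2-D embedding
    and the true ordering of the samples along the cycle.

    X          : (n, 2) embedding coordinates.
    true_order : length-n array of true positions along the cycle.
    """
    Xc = X - X.mean(axis=0)                            # center the embedding (our assumption)
    angles = np.arctan2(Xc[:, 1], Xc[:, 0])            # angle of each point's projection
                                                       # onto the unit circle
    inferred_order = np.argsort(np.argsort(angles))    # rank of each cell by angle
    tau, _ = kendalltau(inferred_order, true_order)    # not aligned over start/orientation
    return tau
```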

2.2.4 Visualizing Trajectories of Cell Differentiation

Our third real data example concerns visualization of a mixed pattern of a trajectory and clusters underlying the gene expression profiles of a collection of cells undergoing differentiation (Hayashi et al., 2018). Specifically, 421 mouse embryonic stem cells were induced to differentiate into primitive endoderm cells. After the induction of differentiation, the cells were dissociated and individually captured at 12- or 24-hour intervals (0, 12, 24, 48, and 72 h), and each cell was sequenced to obtain the final total RNA sequencing reads using the random displacement amplification sequencing technology. As a result, at each of the five time points, there were about 70 to 90 cells captured and sequenced. The raw count data were preprocessed and normalized (see Section A.2 of the Supplement for implementation details), leading to a dataset consisting of standardized expression levels of the $500$ most variable genes for the 421 cells.

Figure 5: Visualization of 421 cells undergoing differentiation. (a) Three examples of candidate visualizations. The individual cells are marked by five different symbols and colors according to the time points they were captured and sequenced. More examples are included in Supplement Figure 17. (b) Boxplots of eigenscores for all 16 candidate visualizations. (c) The proposed meta-visualization using kPCA. (d) Median Silhouette indices versus Kendall’s tau statistics for the 16 candidate and the 2 meta-visualizations (Red: proposed spectral meta-visualization; Orange: naive simple average meta-visualization).

Again, we obtained 16 candidate visualizations as before and applied our proposed method. In Figure 5(a)-(c) we show examples of candidate visualizations, boxplots of the eigenscores, and the meta-visualization using kPCA. The global (Figure 5(b)) and local (Supplement Figure 18) variation of eigenscores demonstrates the contribution of different visualizations to the final meta-visualization according to their respective performance. We observed that some candidate visualizations such as kPCA, UMAP (Figure 5(a)) and PHATE (Supplement Figure 17) to some extent captured the underlying trajectory structure consistent with the time course of the cells. However, the meta-visualization in Figure 5(c) shows much more salient patterns in terms of both the underlying trajectory and the cluster pattern among the cells, by locally combining strengths of the individual visualizations (Supplement Figure 18). We quantified the performance of the visualizations in these two aspects using the median Silhouette index with respect to the underlying true cluster membership (i.e., batches of the time course) and the Kendall’s tau statistic between the inferred cell order and the true order along the progression path. To infer the relative order of the cells from a visualization, we ordered all the cells based on the two-dimensional embedding along the direction that explained the most variability of the cells. In Figure 5(d), we observe that the proposed meta-visualization had the largest median Silhouette index as well as the largest Kendall’s tau statistic, compared with all the candidate visualizations and the naive meta-visualization, showing the superiority of the proposed meta-visualization in both aspects.
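A corresponding sketch for the trajectory case, ordering cells by their projection onto the leading principal direction of the embedding (our reading of "the direction that explained the most variability"):

```python
import numpy as np
from scipy.stats import kendalltau

def trajectory_order_concordance(X, true_order):
    """Kendall's tau between the ordering of cells along the leading principal
    direction of a 2-D embedding and their true ordering along the trajectory."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                           # projection onto the first PC direction
    tau, _ = kendalltau(scores, true_order)
    return abs(tau)                               # sign of the direction is arbitrary
```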

2.2.5 Computational Cost

For datasets of moderate size such as the ones analyzed in the previous sections, the proposed method had a computational cost comparable to that of t-SNE or UMAP for generating a single candidate visualization (Supplement Figure 19). As for very large and high-dimensional datasets, a few features of the proposed algorithm make it readily scalable. First, although our method relies on computing the leading eigenvector of generally non-sparse matrices, these matrices (i.e., $\mathbb{G}_{i}$ in Algorithm 1) are of dimension $K\times K$, where $K$ – the number of candidate visualizations – is usually much smaller than the sample size $n$ or the dimensionality $p$ of the original data. Thus, for each sample $i$, the computational cost due to the eigendecomposition is mild. Second, given the candidate visualizations, our proposed algorithm is independent of the dimensionality ($p$) of the original dataset, as it only requires as input a set of low-dimensional embeddings produced by different visualization methods. Third, since our algorithm computes the eigenscores and the meta-distance with respect to each sample individually, the algorithm can be easily parallelized and carried out on multiple cores to further reduce the time cost.
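As an illustration of the third point, here is a sketch of the per-sample computation parallelized with joblib; the joblib package is our own choice, and any multiprocessing facility would serve.

```python
import numpy as np
from joblib import Parallel, delayed  # assumed: the joblib package

def meta_row(P_bars, i):
    """Eigenscores and meta-distance row for a single sample i."""
    rows = np.stack([P_bar[i] for P_bar in P_bars])   # (K, n)
    G = rows @ rows.T                                  # K x K similarity matrix
    s_i = np.abs(np.linalg.eigh(G)[1][:, -1])          # eigenscores for sample i
    return s_i, s_i @ rows                             # (scores, meta-distance row)

def parallel_meta_distance(P_bars, n_jobs=-1):
    """Compute all eigenscores and the meta-distance matrix across multiple cores."""
    n = P_bars[0].shape[0]
    results = Parallel(n_jobs=n_jobs)(delayed(meta_row)(P_bars, i) for i in range(n))
    scores = np.stack([s for s, _ in results])
    P_meta = np.stack([row for _, row in results])
    return scores, P_meta
```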

To demonstrate the computational efficiency of the proposed method for large and high-dimensional datasets, we evaluated the proposed method on real single-cell transcriptomic datasets (Buckley et al., 2022) of various sample sizes ($n\in\{1000,2000,4000,8000,14000\}$ cells of nine different cell types from the neurogenic regions of mice) and dimensions ($p\in\{500,1000,2000\}$ genes). For each dataset, we obtained 11 candidate visualizations and applied Algorithm 1 to generate the final meta-visualization (see Section A.2 of the Supplement for implementation details). Supplement Figure 20(b) contains boxplots of median Silhouette indices for each candidate visualization and the meta-visualization (highlighted in red) with respect to the underlying true cell types, showing the stable and superior performance of the proposed method under various sample sizes and dimensions. In Supplement Figure 20(a), we compared the running time for generating the 11 candidate visualizations with that for generating the meta-visualizations based on Algorithm 1, on a MacBook Pro with a 2.2 GHz 6-Core Intel Core i7. In general, as $n$ became large, the running time of the proposed algorithm also increased, but remained much less than that for generating the candidate visualizations. The difference in time cost became more significant as $n$ increased, demonstrating that for very large and high-dimensional datasets the computational cost essentially comes from generating the candidate visualizations, rather than from the meta-visualization step. In particular, for a dataset of sample size as large as $8000$ and of dimension $2000$, it took about 60 minutes to generate all 11 candidate visualizations, and about an additional 12 minutes to generate the meta-visualization. Moreover, Supplement Figure 20(a) also shows that, for each $n$, when $p$ increased, the running time for generating the candidate visualizations was longer, but the time cost for meta-visualization remained about the same (difference less than one minute). We also note that users often create multiple visualizations for data exploration, and our approach can simply reuse these visualizations at little additional computational cost.

2.3 Theoretical Justifications and Fundamental Principles

2.3.1 General Model for Multiple Visualizations

We develop a general and flexible theoretical framework to investigate the statistical properties of the proposed method, as well as the fundamental principles behind its empirical success. (Throughout, we adopt the following notation. For a matrix $\mathbb{A}=(a_{ij})\in\mathbb{R}^{n\times n}$, we define its spectral norm as $\|\mathbb{A}\|=\sup_{\|\mathbb{x}\|_{2}\leq 1}\|\mathbb{A}\mathbb{x}\|_{2}$. For sequences $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}=o(b_{n})$ or $b_{n}\gg a_{n}$ if $\lim_{n}a_{n}/b_{n}=0$, and write $a_{n}\asymp b_{n}$ if there exist constants $C_{1},C_{2}>0$ such that $C_{1}b_{n}\leq a_{n}\leq C_{2}b_{n}$ for all $n$.) As can be seen from Section 2.1, there are two key ingredients of our proposed method, namely, the eigenscores $\{\widehat{\mathbb{s}}_{i}\}_{1\leq i\leq n}$ for evaluating the candidate visualizations, and the meta-distance matrix $\bar{\mathbb{P}}^{m}$ that combines multiple candidate visualizations to obtain a meta-visualization.

To formally study their properties, we introduce a generic model for the collection of $K$ candidate visualizations produced by multiple visualization methods, with possibly different settings of tuning parameters for a single method, as considered in the previous sections. Specifically, we assume $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ are generated as

$$\mathbb{Y}_{i}=\mathbb{Y}^{*}_{i}+\mathbb{Z}_{i},\qquad i=1,2,...,n, \tag{2.11}$$

where $\{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n}$ are the underlying noiseless samples and $\{\mathbb{Z}_{i}\}_{1\leq i\leq n}$ are the random noises. Recall that $\{\mathbb{P}^{(k)}\}_{1\leq k\leq K}$ are the distance matrices associated to the candidate visualizations. Then, for the candidate visualizations, we consider a scaled signal-plus-noise expression

$$\mathbb{P}^{(k)}_{i.}=c_{i,k}(\mathbb{P}^{*}_{i.}+\mathbb{h}^{(k)}_{i}),\qquad k=1,2,...,K, \tag{2.12}$$

induced by (2.11), where $c_{i,k}\geq 0$ is a global scaling parameter, $\mathbb{P}^{*}_{i.}$ is the $i$-th row of the pairwise distance matrix $\mathbb{P}^{*}=(\|\mathbb{Y}^{*}_{i}-\mathbb{Y}^{*}_{j}\|_{2})_{1\leq i,j\leq n}$ of the underlying noiseless samples, and $\mathbb{h}_{i}^{(k)}$ is a random vector characterizing the relative distortion of $\mathbb{P}_{i.}^{(k)}$, associated to the $k$-th candidate visualization, from the underlying true pattern $\mathbb{P}^{*}_{i.}$. Before characterizing the distributions of $\{\mathbb{h}_{i}^{(k)}\}_{1\leq k\leq K}$, we point out that, in principle, the relative distortions $\{\mathbb{h}_{i}^{(k)}\}_{1\leq k\leq K}$ are jointly determined by the random noises $\{\mathbb{Z}_{i}\}_{1\leq i\leq n}$ in (2.11) and by the features of, and relations between, the specific visualization methods. Importantly, in line with what is often encountered in practice, equation (2.12) allows for flexible and possibly distinct scaling and directionality for different candidate visualizations, by introducing the visualization-specific parameter $c_{i,k}$, and by focusing on the pairwise distance matrices rather than the low-dimensional embeddings $\{\mathbb{X}_{i}^{(k)}\}_{1\leq i\leq n}$ themselves.

To quantitatively describe the variability of the distortions \{\mathbb{h}_{i}^{(k)}\}_{1\leq k\leq K} across the K candidate visualizations, we assume

  • (C1a)

    \{\mathbb{h}_{i}^{(k)}\}_{1\leq k\leq K} are identically distributed sub-Gaussian vectors with parameter \sigma^{2}, that is, for any deterministic unit vector \mathbb{g}\in\mathbb{R}^{n}, we have \mathbb{E}\exp\{(\mathbb{h}^{(k)}_{i})^{\top}\mathbb{g}\}\leq\exp(\sigma^{2}/2); moreover, \|\mathbb{h}^{(k)}_{i}\|^{2}_{2}=c\sigma^{2}n(1+o(1)) with high probability (an event A_{n} holds with high probability if there exists some N>0 such that P(A_{n})\geq 1-n^{-D} for all n\geq N, for some large constant D>0) for some constant c>0.

This assumption makes (2.12) a generative model for \{\mathbb{P}_{i.}^{(k)}\}_{1\leq k\leq K} with ground truth \mathbb{P}^{*}_{i.} and random distortions, where the variance parameter \sigma describes the average level of distortion of the candidate visualizations from the truth after proper scaling. In relation to (2.11), such a condition can be satisfied when the signal structure \{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n} is bounded, the noise \{\mathbb{Z}_{i}\}_{1\leq i\leq n} is sub-Gaussian, and the dimension reduction map underlying the candidate visualization is bounded and sufficiently smooth. See Section B.2 of the Supplement for details. In addition, we also need to characterize the correlations among these random distortions, not only because the candidate visualizations are typically obtained from the same dataset \{\mathbb{Y}_{i}\}_{1\leq i\leq n}, but also because of the possible similarity between the adopted visualization methods, such as MDS and iMDS, or t-SNE under different tuning parameters. Specifically, for any j,k\in\{1,2,\ldots,K\}, we define the cross-visualization covariance \mathbb{\Sigma}_{jk}=\mathbb{E}\mathbb{h}^{(j)}_{i}(\mathbb{h}^{(k)}_{i})^{\top}, and quantify the level of dependence between a pair of candidate visualizations by \rho_{jk}=\|\mathbb{\Sigma}_{jk}\|/\sigma^{2}. By Condition (C1a), we have \rho_{jj}\leq 1 for all j. For the correlation parameters \{\rho_{jk}\}_{1\leq j,k\leq K}, we assume

  • (C1b)

    The matrix \mathbb{R}=(\rho_{jk})_{1\leq j,k\leq K} satisfies \rho:=\|\mathbb{R}\|=o(K).

Condition (C1b) covers a wide range of correlation structures among the candidate visualizations, allowing in particular for a subset of highly correlated visualizations possibly produced by very similar methods. The parameter \rho characterizes the overall correlation strength among the candidate visualizations, which is assumed to be not too large. As a comparison, note that a set of pairwise independent candidate visualizations implies \rho\approx 1, whereas a set of identical candidate visualizations has \rho\approx K. In particular, the requirement \rho=o(K) is satisfied if, for example, among the K candidate visualizations there are subsets of at most \sqrt{K} visualizations produced by very similar procedures, such as the same method under different tuning parameters, so that \rho\leq\sqrt{K}=o(K). When Condition (C1b) fails, all the candidate visualizations are essentially distorted from the truth in similar ways, so combining them will not be substantially more informative than any individual visualization.
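For concreteness, the following small R illustration (our own toy correlation structure, not taken from the paper) shows how a few groups of near-identical methods affect \rho: with K=16 visualizations split into 4 groups of 4 strongly correlated methods, \rho is close to the group size 4=\sqrt{K}, which is o(K).

    # Toy numerical check of Condition (C1b) (our own illustration):
    # 4 groups of 4 near-identical methods, cross-group correlation ~ 0.
    K <- 16
    group <- rep(1:4, each = 4)
    R <- outer(group, group, function(a, b) ifelse(a == b, 1, 0.05))
    rho <- max(eigen(R, symmetric = TRUE, only.values = TRUE)$values)
    rho  # approximately 4.6, close to the group size and well below K = 16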

2.3.2 Convergence of Eigenscores

Under Condition (C1a), it holds that \mathbb{E}\|\mathbb{h}_{i}^{(k)}\|_{2}\asymp\sigma\sqrt{n}. Hence, we can use the quantity \frac{\|\mathbb{P}^{*}_{i.}\|_{2}}{\sigma\sqrt{n}} to characterize the overall SNR in the candidate visualizations as modelled by (2.12), which reflects the average quality of the candidate visualizations in preserving the underlying true patterns around sample i. Before stating our main theorems, we first introduce our main assumption on the minimal SNR requirement, that is,

  • (C2)

    For (\sigma,\rho) defined in (C1a) and (C1b), it holds that \frac{\|\mathbb{P}^{*}_{i.}\|_{2}}{\sigma\sqrt{n}}\gg\sqrt{\rho/K} and K=o(n) as n\to\infty.

Our algorithm is expected to perform well if \sqrt{\rho/K} is small relative to the overall SNR. The condition K=o(n) is easily satisfied for a sufficiently large dataset.

Recall that \mathbb{s}_{i}=((\bar{\mathbb{P}}_{i.}^{(1)})^{\top}\bar{\mathbb{P}}^{*}_{i.},(\bar{\mathbb{P}}_{i.}^{(2)})^{\top}\bar{\mathbb{P}}^{*}_{i.},\ldots,(\bar{\mathbb{P}}_{i.}^{(K)})^{\top}\bar{\mathbb{P}}^{*}_{i.}). The following theorem concerns the convergence of the eigenscores to the true concordance \mathbb{s}_{i}, and is proved in Section B.3 of the Supplement.

Theorem 1.

Under Conditions (C1a), (C1b) and (C2), for each i\in\{1,2,\ldots,n\}, it holds that \cos\angle(\widehat{\mathbb{s}}_{i},\mathbb{s}_{i})=\frac{(\widehat{\mathbb{s}}_{i})^{\top}\mathbb{s}_{i}}{\|\widehat{\mathbb{s}}_{i}\|_{2}\|\mathbb{s}_{i}\|_{2}}\to 1 in probability as n\to\infty.

Theorem 1 implies that, as long as the candidate visualizations contain a sufficient amount of information about the underlying true structure, and are not excessively correlated, the proposed eigenscores \{\widehat{\mathbb{s}}_{i}\}_{1\leq i\leq n} are quantitatively reliable, as they converge to the actual quality measures \{\mathbb{s}_{i}\}_{1\leq i\leq n} asymptotically. In other words, the eigenscores provide a point-wise consistent estimate of the concordance between the candidate visualizations, as summarized by \{\mathbb{P}^{(k)}\}_{1\leq k\leq K}, and the underlying true patterns \mathbb{P}^{*}, justifying the empirical observations in Table 1. Importantly, Condition (C2) suggests that our proposed eigenscores may benefit from a larger number K of candidate visualizations, or a smaller overall correlation \rho, that is, a collection of functionally more diverse candidate visualizations; see further discussion in Section 2.3.6.
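To make the construction concrete, the following R sketch computes, for a single sample i, the eigenscores and the eigenscore-weighted meta-distance row from a list of candidate distance matrices. It follows the notation used in the proof of Theorem 1 (Section B.3), where the eigenscore vector is the entrywise absolute value of the leading left singular vector of the matrix of normalized distance rows; the function name and interface here are ours and need not match the released implementation.

    # Minimal sketch (not the released implementation) for one sample i.
    # P_list is a list of K n x n pairwise-distance matrices, one per
    # candidate visualization.
    eigenscore_row <- function(P_list, i) {
      # Stack the normalized i-th rows of the K candidate distance matrices (K x n).
      Xbar <- t(sapply(P_list, function(P) P[i, ] / sqrt(sum(P[i, ]^2))))
      s_hat <- abs(svd(Xbar, nu = 1, nv = 0)$u[, 1])  # eigenscores of sample i
      w <- s_hat / sqrt(sum(s_hat^2))                 # normalized weights
      meta_row <- as.vector(crossprod(Xbar, w))       # eigenscore-weighted meta-distance row
      list(eigenscore = s_hat, meta = meta_row)
    }

Applying eigenscore_row() to every i yields the full meta-distance matrix used in the final embedding step of Algorithm 1.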

2.3.3 Theoretical Guarantee for Meta-Visualization

Our second theorem concerns the guaranteed performance of our proposed meta-distance matrix and its improvement upon the individual candidate visualizations in the large-sample limit.

Theorem 2.

Under Conditions (C1a), (C1b) and (C2), for each i\in\{1,2,\ldots,n\}, it holds that \cos\angle(\bar{\mathbb{P}}^{m}_{i.},\mathbb{P}^{*}_{i.})\to 1 in probability as n\to\infty. Moreover, for any constant \delta\in(0,1), there exists a constant C>0 such that, whenever \|\mathbb{P}_{i.}^{*}\|_{2}\leq C\sigma\sqrt{n}, we have \max_{1\leq k\leq K}\cos\angle(\mathbb{P}^{(k)}_{i.},\mathbb{P}^{*}_{i.})<1-\delta in probability as n\to\infty.

Theorem 2 is proved in Section B.4 of the Supplement. In addition to the point-wise consistency of \bar{\mathbb{P}}^{m}, as described by \cos\angle(\bar{\mathbb{P}}^{m}_{i.},\mathbb{P}^{*}_{i.})\to 1 in probability, Theorem 2 also ensures that the proposed meta-distance is in general no worse than the individual candidate visualizations, suggesting a competitive performance of the meta-visualization. In particular, if in addition to Conditions (C1a), (C1b) and (C2) we also have \frac{\|\mathbb{P}_{i.}^{*}\|_{2}}{\sigma\sqrt{n}}\leq C, that is, the magnitude of the random distortions from the true structure \mathbb{P}_{i.}^{*} is relatively large, then each candidate visualization necessarily has at most mediocre performance, i.e., \max_{1\leq k\leq K}\cos\angle(\mathbb{P}^{(k)}_{i.},\mathbb{P}^{*}_{i.})<1-\delta in probability. In such cases, the proposed meta-distance is still consistent and thus strictly better than all candidate visualizations. Theorem 2 justifies the superior performance of the spectral meta-visualization demonstrated in Section 2.2, compared with the 16 candidate visualizations.

Among the three conditions required for the consistency of the proposed meta-distance matrix, Condition (C2) is the most critical, as it describes the minimal SNR requirement, that is, how much information the candidate visualizations altogether should contain about the underlying true structure of the data. In this connection, our theoretical analysis indicates that such a signal strength condition is in fact also necessary, not only for the proposed method but for any possible method. More specifically, in Section B.6 of the Supplement, we prove (Theorem 4) that it is impossible to construct a meta-distance matrix that is consistent when Condition (C2) is violated. This result shows that the setting in which our meta-visualization algorithm works well is essentially the most general one possible.

2.3.4 Robustness of Spectral Weighting against Adversarial Visualizations

In Section 2.2, in addition to the proposed meta-visualization, we also considered the meta-visualization based on the naive meta-distance matrix \bar{\mathbb{P}}^{a}, whose rows are

\bar{\mathbb{P}}_{i.}^{a}=\frac{1}{K}\sum_{k=1}^{K}\bar{\mathbb{P}}_{i.}^{(k)}\in\mathbb{R}^{n}, \qquad (2.13)

which is a simple average across all the candidate visualizations. We observed in all our real-world data analyses that such a naive meta-visualization only had mediocre performance compared to the candidate visualizations (Figures 3, 4 and 5), and was much worse than the proposed spectral meta-visualization. These empirical observations suggest the advantage of informative weighting for combining candidate visualizations.

The empirically observed suboptimality of the non-informative weighting procedure can be justified rigorously by theory. Our next theorem concerns the behavior of the proposed meta-distance matrix \bar{\mathbb{P}}^{m} and the naive meta-distance matrix \bar{\mathbb{P}}^{a} when combining a mixture of well-conditioned candidate visualizations, as characterized by Conditions (C1a), (C1b) and (C2), and some adversarial candidate visualizations whose pairwise-distance matrices do not contain any information about the true structure. Specifically, we suppose that among all the K candidate visualizations, there is a collection \mathcal{C}_{0} of (1-\eta)K well-conditioned candidate visualizations for some small \eta\in(0,1), and a collection \mathcal{C}_{1} of \eta K adversarial candidate visualizations.

Theorem 3.

For any i\in\{1,2,\ldots,n\}, suppose that among all the K candidate visualizations there is a collection \mathcal{C}_{0} of (1-\eta)K candidate visualizations, for some small \eta\in(0,1), satisfying Conditions (C1a), (C1b) and (C2), and a collection \mathcal{C}_{1} of \eta K adversarial candidate visualizations such that (\mathbb{P}_{i.}^{(k)})^{\top}\mathbb{P}_{i.}^{*}=0 for all k\in\mathcal{C}_{1}. Then, for the proposed meta-distance \bar{\mathbb{P}}^{m}, we still have \cos\angle(\bar{\mathbb{P}}^{m}_{i.},\mathbb{P}^{*}_{i.})\to 1 in probability as n\to\infty. However, for the naive meta-distance \bar{\mathbb{P}}^{a}, even if \|\mathbb{P}_{i.}^{*}\|_{2}\gg\sigma\sqrt{n}, we have \cos\angle(\bar{\mathbb{P}}^{a}_{i.},\mathbb{P}^{*}_{i.})<1-\eta in probability as n\to\infty.

Theorem 3 is proved in Section B.5 of the Supplement. By Theorem 3, on the one hand, even when a small proportion of very poor (adversarial) candidate visualizations are combined with other relatively good visualizations, the proposed method still performs well, thanks to the consistent eigenscore weighting established in Theorem 1. On the other hand, no matter how strong the SNR is for the well-conditioned candidate visualizations, the method based on non-informative weighting is strictly sub-optimal. Indeed, when \|\mathbb{P}_{i.}^{*}\|_{2}\gg\sigma\sqrt{n}\asymp\mathbb{E}\|\mathbb{h}_{i}\|_{2}, although we have \cos\angle(\mathbb{P}_{i.}^{(k)},\mathbb{P}^{*}_{i.})\to 1 in probability for all k\in\mathcal{C}_{0}, the non-informative weighting suffers from the non-negligible negative effects of the adversarial visualizations in \mathcal{C}_{1}, causing a strict deviation from \mathbb{P}^{*}; see, for example, Figures 3-5(b)(d) for empirical evidence from real-world data.
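As a small sanity check of this phenomenon, the following toy R simulation (our own construction; the adversarial rows are only meant to illustrate the linear algebra and are not valid distance vectors) contrasts the naive average with the eigenscore-weighted combination when a few candidate rows are essentially orthogonal to the truth.

    # Toy contrast (ours) between the naive average (2.13) and the
    # eigenscore-weighted combination when 4 of 12 candidate rows are adversarial.
    set.seed(1)
    n <- 500; K <- 12; n_adv <- 4
    y <- abs(rnorm(n))                                  # true distance row P*_i.
    adv <- rnorm(n); adv <- adv - sum(adv * y) / sum(y^2) * y   # direction orthogonal to y
    good <- replicate(K - n_adv, y + 0.5 * rnorm(n))    # well-conditioned candidates
    bad  <- replicate(n_adv, adv + 0.05 * rnorm(n))     # adversarial candidates
    Xbar <- apply(cbind(good, bad), 2, function(x) x / sqrt(sum(x^2)))  # unit columns
    cosang <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
    s_hat <- abs(svd(t(Xbar), nu = 1, nv = 0)$u[, 1])   # eigenscores (notation of Section B.3)
    c(naive    = cosang(rowMeans(Xbar), y),             # non-informative weighting
      spectral = cosang(Xbar %*% (s_hat / sum(s_hat)), y))  # eigenscore weighting

In our runs of this toy example, the eigenscore weights concentrate on the well-conditioned rows, so the spectral combination stays close to the truth while the naive average is visibly pulled toward the adversarial direction.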

2.3.5 Limitations of Original Noisy High-Dimensional Data

The proposed eigenscores provide an efficient and consistent way of evaluating the performance of the candidate visualizations. As mentioned in Section 1, a number of metrics have been proposed to quantify the distortion of a visualization by comparing the low-dimensional embedding directly with the original high-dimensional data. Such metrics essentially treat the original high-dimensional data as the ground truth, and do not take into account the noisiness of the high-dimensional data. However, many datasets arising from real-world applications, as modelled by (2.11), are themselves very noisy, and may not be an ideal reference point for evaluating a visualization that has probably already significantly denoised the data through dimension reduction. For example, the three real-world datasets considered in Section 2.2 are all high-dimensional and contain many more features than samples. In each case, there are underlying clusters among the samples, but the original datasets showed significantly weaker cluster structure than most of the 16 candidate visualizations (Supplement Figure 21), suggesting that directly comparing a visualization with the noisy high-dimensional data may be misleading. In this respect, our theorems indicate that the proposed spectral method is able to precisely assess and effectively combine multiple visualizations to better capture the underlying noiseless structure \mathbb{P}^{*}, without referring to the original noisy datasets, making it more robust, flexible, and computationally efficient.

2.3.6 Benefits of Including More Functionally Diverse Visualizations

Our theoretical analysis implies that the proposed meta-visualization may benefit from a larger number (larger K) of functionally diverse (smaller \rho) candidate visualizations. To empirically verify this theoretical observation, we focused on the religious and biblical text data and the mouse embryonic stem cell data considered in Section 2.2, and obtained spectral meta-visualizations based on a smaller but relatively diverse collection of 5 candidate visualizations, produced by arguably the most popular methods, namely, t-SNE, PHATE, UMAP, PCA and MDS. Compared with the 16 candidate visualizations considered in Section 2.2, here we have a presumably similar \rho but much smaller K. As a result, for the religious and biblical text data, the meta-visualization had a median Silhouette index of 0.187 (Supplement Figure 13), which was smaller than the median Silhouette index of 0.275 based on the 16 candidate visualizations in Figure 3(d); for the cell cycle data, the meta-visualization had a median Silhouette index of -0.062 and a Kendall's tau statistic of 0.313, both smaller than the respective values based on the 16 candidate visualizations in Figure 4(d). On the other hand, we also evaluated the effect of increasing \rho while keeping K fixed. Specifically, when we obtained 16 candidate visualizations all produced by PHATE with varying nearest-neighbor parameters, the final spectral meta-visualization had a median Silhouette index of 0.094, which was even lower than that of the above meta-visualization based on five distinct methods, although still slightly better than the 16 PHATE-based candidate visualizations (Supplement Figure 13). This empirical evidence is in line with our theoretical predictions, suggesting benefits of including more numerous and more diverse visualizations.

3 DISCUSSION

We developed a spectral method in the current study to assess and combine multiple data visualizations. The proposed meta-visualization combines candidate visualizations through a weighted arithmetic average of their normalized distance matrices, with weights given by their corresponding eigenscores. Although the proposed method was shown, both in theory and numerically, to outperform the individual candidate visualizations and their naive combination, it is still unclear whether there exist other forms of combination that lead to even better meta-visualizations. For example, one could consider constructing a meta-distance matrix using a geometric or harmonic (weighted) average, or an average based on barycentric coordinates (Floater, 2015). We plan to investigate how to optimally combine multiple visualizations in subsequent work.

Although originally developed for data visualization, the proposed method can be useful for other supervised and unsupervised machine learning tasks, such as combining multiple algorithms for clustering, classification, or prediction. For example, for a given dataset, if one has a collection of predicted cluster memberships produced by multiple clustering algorithms, one could construct cluster co-membership matrices whose (i,j)-th entry is 1 if samples i and j are assigned to the same cluster and 0 otherwise. One may then define the similarity matrix as in (2.7), obtain the eigenscores for the candidate clusterings, and obtain a meta-clustering using (2.9). It is of interest to study its empirical performance and whether the fundamental principles unveiled in the current work continue to hold for such a broader range of learning tasks.
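As a rough illustration of this idea, the following R sketch is our own approximation; the exact forms of (2.7) and (2.9) are given in the main text and may differ in detail from what is written here.

    # Rough sketch (ours) of the meta-clustering idea: combine K candidate
    # clusterings via co-membership matrices and eigenscore weights, then
    # re-cluster the combined affinity matrix.
    meta_cluster <- function(labels_list, n_clusters) {
      n <- length(labels_list[[1]])
      # Co-membership matrices: entry (i, j) is 1 if samples i and j share a cluster.
      C_list <- lapply(labels_list, function(lab) outer(lab, lab, "==") * 1)
      M <- matrix(0, n, n)
      for (i in seq_len(n)) {
        Xbar <- t(sapply(C_list, function(C) C[i, ] / sqrt(sum(C[i, ]^2))))
        s_hat <- abs(svd(Xbar, nu = 1, nv = 0)$u[, 1])            # eigenscores for sample i
        M[i, ] <- as.vector(crossprod(Xbar, s_hat / sum(s_hat)))  # combined affinity row
      }
      # Simple choice: hierarchical clustering on the symmetrized dissimilarity.
      cutree(hclust(as.dist(1 - (M + t(M)) / 2)), k = n_clusters)
    }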

Data Availability

The religious and biblical text data (Sah and Fokoué, 2019) were downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/A+study+of++Asian+Religious+and+Biblical+Texts). The cell cycle analysis was based on the mouse embryonic stem cell data (Buettner et al., 2015) downloaded from EMBL-EBI with accession code E-MTAB-2805. The cell trajectory analysis was based on the mouse embryonic stem cell single-cell data (Hayashi et al., 2018) downloaded from the Gene Expression Omnibus with accession code GSE98664. The single-cell transcriptomic dataset (Buckley et al., 2022) used for evaluating computational cost is accessible at BioProject PRJNA795276.

Code Availability

The R code implementing the method, and for reproducing our simulations and data analyses, is available on our GitHub page: https://github.com/rongstat/meta-visualization.

References

  • Abraham et al. (2006) Abraham, I., Y. Bartal, and O. Neiman (2006). Advances in metric embedding theory. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pp.  271–286.
  • Abraham et al. (2009) Abraham, I., Y. Bartal, and O. Neiman (2009). On low dimensional local embeddings. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp.  875–884. SIAM.
  • Arora et al. (2018) Arora, S., W. Hu, and P. K. Kothari (2018). An analysis of the t-SNE algorithm for data visualization. In Conference on Learning Theory, pp.  1455–1462. PMLR.
  • Bartal et al. (2019) Bartal, Y., N. Fandina, and O. Neiman (2019). Dimensionality reduction: theoretical perspective on practical measures. Advances in Neural Information Processing Systems 32.
  • Belkin and Niyogi (2003) Belkin, M. and P. Niyogi (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396.
  • Bertini et al. (2011) Bertini, E., A. Tatu, and D. Keim (2011). Quality metrics in high-dimensional data visualization: An overview and systematization. IEEE Transactions on Visualization and Computer Graphics 17(12), 2203–2212.
  • Bhatia (2013) Bhatia, R. (2013). Matrix Analysis, Volume 169. Springer Science & Business Media.
  • Buckley et al. (2022) Buckley, M. T., E. Sun, B. M. George, L. Liu, N. Schaum, L. Xu, J. M. Reyes, M. A. Goodell, I. L. Weissman, T. Wyss-Coray, et al. (2022). Cell type-specific aging clocks to quantify aging and rejuvenation in regenerative regions of the brain. bioRxiv.
  • Buettner et al. (2015) Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle (2015). Computational analysis of cell-to-cell heterogeneity in single-cell rna-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology 33(2), 155–160.
  • Cai et al. (2021) Cai, T. T., H. Li, and R. Ma (2021). Optimal structured principal subspace estimation: Metric entropy and minimax rates. J. Mach. Learn. Res. 22, 46–1.
  • Cai and Ma (2021) Cai, T. T. and R. Ma (2021). Theoretical foundations of t-sne for visualizing high-dimensional clustered data. arXiv preprint arXiv:2105.07536.
  • Cai and Zhang (2018) Cai, T. T. and A. Zhang (2018). Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics 46(1), 60–89.
  • Chen et al. (2007) Chen, C.-h., W. K. Härdle, and A. Unwin (2007). Handbook of Data Visualization. Springer Science & Business Media.
  • Chen et al. (2020) Chen, M., H. Hauser, P. Rheingans, and G. Scheuermann (2020). Foundations of Data Visualization. Springer.
  • Cheng et al. (2015) Cheng, J., H. Liu, F. Wang, H. Li, and C. Zhu (2015). Silhouette analysis for human action recognition based on supervised temporal t-SNE and incremental learning. IEEE Transactions on Image Processing 24(10), 3203–3217.
  • Chennuru Vankadara and von Luxburg (2018) Chennuru Vankadara, L. and U. von Luxburg (2018). Measures of distortion for machine learning. Advances in Neural Information Processing Systems 31.
  • Ding and Ma (2022) Ding, X. and R. Ma (2022). Learning low-dimensional nonlinear structures from high-dimensional noisy data: An integral operator approach. arXiv preprint arXiv:2203.00126.
  • Donoho (2017) Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics 26(4), 745–766.
  • Donoho and Grimes (2003) Donoho, D. L. and C. Grimes (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences 100(10), 5591–5596.
  • Espadoto et al. (2019) Espadoto, M., R. M. Martins, A. Kerren, N. S. Hirata, and A. C. Telea (2019). Toward a quantitative survey of dimension reduction techniques. IEEE Transactions on Visualization and Computer Graphics 27(3), 2153–2173.
  • Floater (2015) Floater, M. S. (2015). Generalized barycentric coordinates and applications. Acta Numerica 24, 161–214.
  • Hayashi et al. (2018) Hayashi, T., H. Ozaki, Y. Sasagawa, M. Umeda, H. Danno, and I. Nikaido (2018). Single-cell full-length total rna sequencing uncovers dynamics of recursive splicing and enhancer rnas. Nature Communications 9(1), 1–16.
  • Jolliffe and Cadima (2016) Jolliffe, I. T. and J. Cadima (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374(2065), 20150202.
  • Kendall (1938) Kendall, M. G. (1938). A new measure of rank correlation. Biometrika 30(1/2), 81–93.
  • Kobak and Berens (2019) Kobak, D. and P. Berens (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications 10(1), 1–14.
  • Kobak and Linderman (2021) Kobak, D. and G. C. Linderman (2021). Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology 39(2), 156–157.
  • Kruskal (1978) Kruskal, J. B. (1978). Multidimensional Scaling. Number 11. Sage.
  • Liu et al. (2017) Liu, Z.-G., Q. Pan, J. Dezert, and A. Martin (2017). Combination of classifiers with optimal weight based on evidential reasoning. IEEE Transactions on Fuzzy Systems 26(3), 1217–1230.
  • McInnes et al. (2018) McInnes, L., J. Healy, and J. Melville (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
  • Mohandes et al. (2018) Mohandes, M., M. Deriche, and S. O. Aliyu (2018). Classifiers combination techniques: A comprehensive review. IEEE Access 6, 19626–19639.
  • Mokbel et al. (2013) Mokbel, B., W. Lueks, A. Gisbrecht, and B. Hammer (2013). Visualizing the quality of dimensionality reduction. Neurocomputing 112, 109–123.
  • Moon et al. (2019) Moon, K. R., D. van Dijk, Z. Wang, S. Gigante, D. B. Burkhardt, W. S. Chen, K. Yim, A. v. d. Elzen, M. J. Hirn, R. R. Coifman, et al. (2019). Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology 37(12), 1482–1492.
  • Nonato and Aupetit (2018) Nonato, L. G. and M. Aupetit (2018). Multidimensional projection for visual analytics: Linking techniques with distortions, tasks, and layout enrichment. IEEE Transactions on Visualization and Computer Graphics 25(8), 2650–2673.
  • Olivon et al. (2018) Olivon, F., N. Elie, G. Grelier, F. Roussi, M. Litaudon, and D. Touboul (2018). Metgem software for the generation of molecular networks based on the t-SNE algorithm. Analytical Chemistry 90(23), 13900–13908.
  • Pagliosa et al. (2015) Pagliosa, P., F. V. Paulovich, R. Minghim, H. Levkowitz, and L. G. Nonato (2015). Projection inspector: Assessment and synthesis of multidimensional projections. Neurocomputing 150, 599–610.
  • Parisi et al. (2014) Parisi, F., F. Strino, B. Nadler, and Y. Kluger (2014). Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences 111(4), 1253–1258.
  • Platzer (2013) Platzer, A. (2013). Visualization of snps with t-SNE. PloS One 8(2), e56883.
  • Rousseeuw (1987) Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65.
  • Roweis and Saul (2000) Roweis, S. T. and L. K. Saul (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326.
  • Sah and Fokoué (2019) Sah, P. and E. Fokoué (2019). What do asian religions have in common? an unsupervised text analytics exploration. arXiv preprint arXiv:1912.10847.
  • Sammon (1969) Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers 100(5), 401–409.
  • Schölkopf et al. (1997) Schölkopf, B., A. Smola, and K.-R. Müller (1997). Kernel principal component analysis. In International Conference on Artificial Neural Networks, pp. 583–588. Springer.
  • Tax et al. (2000) Tax, D. M., M. Van Breukelen, R. P. Duin, and J. Kittler (2000). Combining multiple classifiers by averaging or by multiplying? Pattern Recognition 33(9), 1475–1485.
  • Tenenbaum et al. (2000) Tenenbaum, J. B., V. d. Silva, and J. C. Langford (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323.
  • Traven et al. (2017) Traven, G., G. Matijevič, T. Zwitter, M. Žerjal, J. Kos, M. Asplund, J. Bland-Hawthorn, A. R. Casey, G. De Silva, K. Freeman, et al. (2017). The galah survey: classification and diagnostics with t-SNE reduction of spectral information. The Astrophysical Journal Supplement Series 228(2), 24.
  • van der Maaten and Hinton (2008) van der Maaten, L. and G. Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605.
  • Venna et al. (2010) Venna, J., J. Peltonen, K. Nybo, H. Aidos, and S. Kaski (2010). Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research 11(2).
  • Vershynin (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, Volume 47. Cambridge University Press.
  • Wang et al. (2021) Wang, Y., H. Huang, C. Rudin, and Y. Shaposhnik (2021). Understanding how dimension reduction tools work: An empirical approach to deciphering t-sne, umap, trimap, and pacmap for data visualization. Journal of Machine Learning Research 22, 1–73.
  • Woods et al. (1997) Woods, K., W. P. Kegelmeyer, and K. Bowyer (1997). Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4), 405–410.
  • Yu et al. (2015) Yu, Y., T. Wang, and R. J. Samworth (2015). A useful variant of the davis–kahan theorem for statisticians. Biometrika 102(2), 315–323.

Acknowledgement

The authors would like to thank the editor and two anonymous reviewers for their suggestions and comments, which have resulted in a significant improvement of the manuscript. R.M. would like to thank David Donoho and Rui Duan for helpful discussions. J.Z. is supported by NSF CAREER 1942926, NIH P30AG059307, 5RM1HG010023 and grants from the Chan-Zuckerberg Initiative and the Emerson Collective. R.M. is supported by David Donoho at Stanford University.

Supplement to “A Spectral Method for Assessing and Combining Multiple Data Visualizations” Rong Ma1, Eric D. Sun2 and James Zou2

Department of Statistics, Stanford University1
Department of Biomedical Data Science, Stanford University2

Appendix A Supplementary Results and Figures from Real Data Analysis

A.1 Silhouette Index

Consider a partition \{1,2,\ldots,n\}=C_{1}\cup C_{2}\cup\ldots\cup C_{K} of the n samples into K non-overlapping subsets, with each cluster containing at least 2 samples, representing the true cluster membership of the n samples. Let d(i,j) be the distance between samples i and j in a certain vector space. For each sample i\in C_{k} for some k\in\{1,2,\ldots,K\}, we define

a(i)=\frac{1}{|C_{k}|-1}\sum_{j\in C_{k}\setminus\{i\}}d(i,j), \qquad (A.1)

as the mean distance between sample i and all other samples in cluster C_{k}, and define

b(i)=\min_{k^{\prime}\neq k}\frac{1}{|C_{k^{\prime}}|}\sum_{j\in C_{k^{\prime}}}d(i,j), \qquad (A.2)

as the smallest mean distance from sample i to the samples in any other cluster, of which sample i is not a member. Then the Silhouette index of sample i is defined as

SI(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}. \qquad (A.3)

By definition, we have SI(i)\in(-1,1), and a higher SI(i) indicates better concordance between the distances \{d(i,j)\}_{j\neq i} and the underlying true cluster membership.
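For completeness, a direct R transcription of (A.1)-(A.3) for a single sample is given below; the silhouette() function in the R package cluster provides an equivalent, vectorized implementation.

    # Direct implementation of (A.1)-(A.3) for a single sample i, given a
    # pairwise-distance matrix D and a vector of cluster labels.
    silhouette_index <- function(i, D, labels) {
      same <- setdiff(which(labels == labels[i]), i)
      a_i <- mean(D[i, same])                                    # (A.1)
      b_i <- min(sapply(setdiff(unique(labels), labels[i]),
                        function(k) mean(D[i, labels == k])))    # (A.2)
      (b_i - a_i) / max(a_i, b_i)                                # (A.3)
    }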

A.2 Implementation Details and Supplementary Figures

Simulations.

For each simulation setting, we let the diameter of the underlying structure vary within a certain range so that the final results are comparable across different structures. The boxplots in Figure 2(a), Figure 7 and Figure 9 summarize the simulation results across 20 equispaced diameter values for each underlying structure.

The 16 candidate visualizations were obtained by the following R functions (a consolidated call sketch is given after the list):

  • PCA: the fast SVD function svds from R package rARPACK with embedding dimension k=2.

  • MDS: the basic R function cmdscale with embedding dimension k=2.

  • Sammon: the R function sammon from R package MASS with embedding dimension k=2.

  • LLE: the R function lle from R package lle with parameters m=2, k=20, reg=2.

  • HLLE: the R function embed from R package dimRed with parameters method="HLLE", knn=20, ndim=2.

  • Isomap: the R function embed from R package dimRed with parameters method="Isomap", knn=20, ndim=2.

  • kPCA1&2: the R function embed from R package dimRed with parameters method="kPCA", kpar=list(sigma=width), ndim=2, where we set width=0.01 for kPCA1 and width=0.001 for kPCA2.

  • LEIM: the R function embed from R package dimRed with parameters ndim=2 and method = "LaplacianEigenmaps".

  • UMAP1&2: the R function umap from R package uwot with parameters n_neighbors=n, n_components=2, where we set n=30 for UMAP1 and n=50 for UMAP2.

  • tSNE1&2: the R function embed from R package dimRed with parameters method="tSNE", perplexity=n, ndim=2, where we set n=10 for tSNE1 and n=50 for tSNE2.

  • PHATE1&2: the R function phate from R package phateR with parameters knn=n, ndim=2, where we set n=30 for PHATE1 and n=50 for PHATE2.
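Below is a consolidated call sketch for a few of these candidate visualizations (`dat` denotes the n x p centred and normalized data matrix; argument names follow the current dimRed, uwot and phateR interfaces and may vary slightly across package versions).

    # Illustrative calls for a few candidate visualizations (a sketch).
    library(dimRed); library(uwot); library(phateR)
    vis_isomap <- embed(dat, .method = "Isomap", knn = 20, ndim = 2)   # Isomap
    vis_umap1  <- umap(dat, n_neighbors = 30, n_components = 2)        # UMAP1
    vis_phate1 <- phate(dat, knn = 30, ndim = 2)                       # PHATE1
    # Two-dimensional coordinates: vis_isomap@data@data, vis_umap1, vis_phate1$embedding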

Figure 6: Low-dimensional structures used for simulations. Left: data points uniformly sampled from the two-dimensional “smiley face.” Right: data points uniformly sampled from the three-dimensional “mammoth” manifold.
Figure 7: Boxplots for the mean concordance of the 16 candidate visualizations and the 2 meta-visualizations under each simulation setting (left: finite mixture of points; middle: smiley face; right: mammoth) across 20 equispaced values of \theta from some intervals. The proposed spectral meta-distance matrix had superior performance compared to the others.
Figure 8: Examples of candidate visualizations of the simulated data generated from a 6-class Gaussian mixture model (i.e., setting (i)), along with their pointwise concordance, and boxplots grouped by clusters. The plots indicate the strengths and weaknesses of different methods in capturing the underlying clusters.
Figure 9: Boxplots for the mean concordance of the 16 candidate visualizations and the 2 meta-visualizations under the Gaussian mixture model (i.e., setting (i)) with various intrinsic dimensions (r) across different values of \theta\in[5,10]. The plots demonstrate the flexibility of the proposed method with respect to the intrinsic dimension r.
Religious and Biblical Texts Data.

Each visualization method was applied to the Document Term Matrix, with 8265 centred and normalized features and 590 samples (text fragments). The 16 candidate visualizations were obtained by the R functions listed above, with the same tuning parameters as in the simulation studies. For the 16 PHATE-based candidate visualizations in Figure 13, we considered 16 equally spaced values of the nearest-neighbor parameter knn, ranging from 2 to 150.

Figure 10: More examples of visualization of 590 fragments of religious text. The cluster patterns are far from clear based on individual methods.
Figure 11: Visualization of 590 fragments of religious text. Left: median Silhouette indices for 16 candidate and 2 meta-visualizations and the original data (ori), where the two meta-visualizations are based on t-SNE. Middle: median Silhouette indices for 16 candidate and 2 meta-visualizations and the original data, where the two meta-visualizations are based on PHATE. Right: scatter plot of averaged eigenscores and averaged Silhouette indices for the 16 candidate visualizations. The left and middle panels indicate that the improvement in performance was not sensitive to the specific visualization method. The right panel shows the potential interpretation and the effectiveness of the eigenscores.
Figure 12: Comparison of eigenscores (top) and Silhouette indices (bottom) of four candidate visualizations, grouped by clusters. Notable correlation between the two quantities indicates the effectiveness of the spectral weighting approach.
Figure 13: Visualization of 590 fragments of religious text. Left: boxplots of Silhouette indices for 5 candidate and 2 meta-visualizations and the original data (ori), where the 5 candidate visualizations are based on t-SNE, PHATE, UMAP, PCA and MDS, respectively. Right: boxplots of Silhouette indices for 16 candidate and 2 meta-visualizations and the original data (ori), where the 16 candidate visualizations are based on PHATE with varying nearest neighbor parameters. Compared with Figure 3(d), they showed that the proposed method may benefit from additional, and more diverse candidate visualizations.
Cell Cycle Data.

The raw count data were preprocessed, normalized, and scaled by following the standard procedure (R functions CreateSeuratObject, NormalizeData and ScaleData under default settings) as implemented in the R package Seurat (https://cran.r-project.org/web/packages/Seurat/index.html). We also applied the R function FindVariableFeatures in Seurat to identify the 2000 most variable genes for subsequent analysis. The final p=1147 cell-cycle-related genes were selected based on two-sample t-tests. The 16 candidate visualizations were generated in the same way as in the previous example, with the same set of tuning parameters.

Figure 14: More examples of visualization of 288 mouse embryonic stem cells in cell cycles.
Figure 15: Examples of projection of candidate visualization onto the unit circle centred at the origin, from which the Kendall’s tau statistics were computed. The colors correspond to the true cell cycle stages, suggesting that the spectral meta-visualization recovered the cell cycle better than other methods.
Figure 16: Illustrations of eigenscores for four candidate visualizations of the cell cycle data, showing that different visualizations may contribute to the meta-visualization from distinct aspects. Lighter color indicates better concordance to the underlying structures as assessed by the eigenscores.
Cell Differentiation Data.

The raw count data were preprocessed, normalized, and scaled using the Seurat package, following the same procedure as described previously. The 16 candidate visualizations were generated in the same way as in the previous examples, with the same set of tuning parameters.

Computational Cost.

We considered the single-cell transcriptomic dataset of Buckley et al. (2022), which contains more than 20,000 cells of different cell types from the neurogenic regions of 28 mice. For each n\in\{1000,2000,4000,8000,14000\}, we randomly selected n cells of nine different cell types, and selected subsets of p\in\{500,1000,2000\} genes to obtain an n\times p count matrix. After normalizing the count matrix, we applied various visualization methods (PCA, HLLE, kPCA, LEIM, UMAP, t-SNE and PHATE) that are in general scalable to large datasets (i.e., costing less than one minute for visualizing 1000 samples of dimension 300) to generate 11 candidate visualizations (with two different parameter settings for kPCA, t-SNE, UMAP and PHATE). We then ran our proposed algorithm to obtain the final meta-visualization.
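A sketch of this subsampling and normalization step is given below; the object names (counts, cell_types, selected_types) are placeholders of ours, not from the paper's code, and one (n, p) configuration is shown.

    # Sketch of the subsampling used for the timing experiments (placeholder names).
    set.seed(1)
    n <- 4000; p <- 1000
    cells <- sample(which(cell_types %in% selected_types), n)   # cells from the 9 cell types
    genes <- sample(nrow(counts), p)
    X <- t(as.matrix(counts[genes, cells]))                     # n x p count matrix
    X <- log1p(sweep(X, 1, pmax(rowSums(X), 1), "/") * 1e4)     # library-size normalization, log scale
    # X is then passed to each of the 11 candidate visualization methods.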

Figure 17: More examples of visualization of the 421 cells under differentiation.
Figure 18: Illustrations of eigenscores for four candidate visualizations of the cell differentiation data, showing that different visualizations may contribute to the meta-visualization from distinct aspects. Lighter color indicates better concordance to the underlying structures as assessed by the eigenscores.
Figure 19: Computational time (log of running time in seconds) for each candidate visualization and the proposed method, including all four steps of Algorithm 1, for the three real-world datasets considered in Section 2.2. Left: religious and biblical text data; Middle: cell cycle data; Right: cell differentiation data. The proposed method had a computational time comparable to that of tSNE or UMAP for generating a single candidate visualization.
Figure 20: (a) Computational time (in mins) for generating 11 candidate visualizations (“generate candidate visualizations”) for single-cell transcriptomic datasets of various sample sizes and dimensions, and that for generating the meta-visualizations (“generate meta-visualization”) based on Algorithm 1. (b) Boxplots of the median Silhouette indices for 11 candidate visualizations and the meta-visualization (highlighted in red) with respect to the underlying true cell types. The running time of the proposed algorithm increased with nn, but remained much less than that for generating the candidate visualizations. For each nn, when pp increased, the running time for generating the candidate visualizations was longer, but the time cost for meta-visualization remained the same.
Figure 21: Boxplots of Silhouette index for 16 candidate and 2 meta-visualizations, and the original data for the three datasets analyzed in the main paper. The original datasets showed significantly weaker cluster structure compared to most of the 16 candidate visualizations, suggesting that directly comparing a visualization with the noisy high-dimensional data may be misleading.

Appendix B Proof of Main Theorems

B.1 Notations

For a vector \mathbb{a}=(a_{1},\ldots,a_{n})^{\top}\in\mathbb{R}^{n}, we denote by \text{diag}(a_{1},\ldots,a_{n})\in\mathbb{R}^{n\times n} the diagonal matrix whose i-th diagonal entry is a_{i}, and define the \ell_{p} norm \|\mathbb{a}\|_{p}=\big(\sum_{i=1}^{n}|a_{i}|^{p}\big)^{1/p} and the \ell_{\infty} norm \|\mathbb{a}\|_{\infty}=\max_{1\leq i\leq n}|a_{i}|. For a matrix \mathbb{A}=(a_{ij})\in\mathbb{R}^{n\times n}, we define its Frobenius norm as \|\mathbb{A}\|_{F}=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}a^{2}_{ij}}, and its spectral norm as \|\mathbb{A}\|=\sup_{\|\mathbb{x}\|_{2}\leq 1}\|\mathbb{A}\mathbb{x}\|_{2}; we also denote by \mathbb{A}_{.i}\in\mathbb{R}^{n} its i-th column and by \mathbb{A}_{i.}\in\mathbb{R}^{n} its i-th row. For sequences \{a_{n}\} and \{b_{n}\}, we write a_{n}=o(b_{n}) or b_{n}\gg a_{n} if \lim_{n}a_{n}/b_{n}=0, and write a_{n}=O(b_{n}), a_{n}\lesssim b_{n} or b_{n}\gtrsim a_{n} if there exists a constant C such that a_{n}\leq Cb_{n} for all n. We write a_{n}\asymp b_{n} if a_{n}\lesssim b_{n} and a_{n}\gtrsim b_{n}.

B.2 Sufficient Condition for (C1a)

We provide a sufficient condition, in light of the signal-plus-noise model (2.11) of the main text, that implies the sub-Gaussian condition (C1a) for \mathbb{h}_{i}^{(k)} in (2.12), together with some intuition. In general, the distortion vector \mathbb{h}_{i}^{(k)} is jointly determined by the noise \{\mathbb{Z}_{i}\}_{1\leq i\leq n}, the noiseless samples \{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n} and the dimension reduction map f_{k}:\mathbb{R}^{p}\to\mathbb{R}^{2} associated with the k-th visualization method. Accordingly, our sufficient condition for (C1a) essentially involves regularity of the signal structure \{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n} and of the dimension reduction (DR) maps f_{k}:\mathbb{R}^{p}\to\mathbb{R}^{2} for 1\leq k\leq K, and sub-Gaussianity of the noise vectors \{\mathbb{Z}_{i}\}_{1\leq i\leq n}. We first state our sufficient condition precisely.

  • (C01)

    (Regularity of the signal and DR map) The noiseless samples \{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n} lie on a bounded manifold \mathcal{M} embedded in \mathbb{R}^{p}, and the DR map f_{k} and its first-order derivative are bounded in the sense that for all \mathbb{Y}\in\mathbb{R}^{p},

    L^{-1}\leq\|f_{k}(\mathbb{Y})\|_{2}\leq L,\qquad \Big\|\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\Big\|\leq C_{f}, \qquad (B.1)

    almost surely for some constants L>1 and C_{f}>0.

  • (C02)

    (Sub-Gaussian noise) The noise vectors \{\mathbb{Z}_{i}\}_{1\leq i\leq n} are sub-Gaussian random vectors.

Intuitively, (C01) requires that the underlying signal structure is bounded and that the DR map is also bounded and sufficiently smooth. In particular, we allow f_{k} itself to be random, as in the case of randomized algorithms such as t-SNE and UMAP. The sub-Gaussian condition (C02) on the noise vectors \mathbb{Z}_{i} is mild and allows for a wide range of noise structures.

Below we show that Conditions (C01) and (C02) jointly imply the sub-Gaussianity of \mathbb{h}_{i}^{(k)}. Firstly, note that by definition \mathbb{P}_{i.}^{(k)}=(\|f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{1})\|_{2},\|f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{2})\|_{2},\ldots,\|f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{n})\|_{2})^{\top}. Then for each i, we can define the pairwise distance vector for the noiseless samples associated with the k-th visualization method as

\mathbb{P}_{i.}^{*(k)}=(\|f_{k}(\mathbb{Y}^{*}_{i})-f_{k}(\mathbb{Y}^{*}_{1})\|_{2},\|f_{k}(\mathbb{Y}^{*}_{i})-f_{k}(\mathbb{Y}^{*}_{2})\|_{2},\ldots,\|f_{k}(\mathbb{Y}^{*}_{i})-f_{k}(\mathbb{Y}^{*}_{n})\|_{2})^{\top}. \qquad (B.2)

Recall that \mathbb{P}_{i.}^{*}=(\|\mathbb{Y}^{*}_{i}-\mathbb{Y}^{*}_{1}\|_{2},\|\mathbb{Y}^{*}_{i}-\mathbb{Y}^{*}_{2}\|_{2},\ldots,\|\mathbb{Y}^{*}_{i}-\mathbb{Y}^{*}_{n}\|_{2})^{\top}. Then, by (2.12), it follows that

\mathbb{h}_{i}^{(k)} =c_{i,k}^{-1}\mathbb{P}_{i.}^{(k)}-\mathbb{P}_{i.}^{*}
=\Big(c_{i,k}^{-1}\mathbb{P}_{ij}^{*(k)}+\frac{c_{i,k}^{-1}\mathbb{g}_{k,ij}^{\top}[f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{j})-f_{k}(\mathbb{Y}^{*}_{i})+f_{k}(\mathbb{Y}^{*}_{j})]}{\|\mathbb{g}_{k,ij}\|_{2}}-\mathbb{P}_{ij}^{*}\Big)_{1\leq j\leq n} \qquad (B.3)
=\Big(c_{i,k}^{-1}\mathbb{P}_{ij}^{*(k)}-\mathbb{P}_{ij}^{*}+\frac{c_{i,k}^{-1}\mathbb{g}_{k,ij}^{\top}}{\|\mathbb{g}_{k,ij}\|_{2}}\Big[\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\Big|_{\mathbb{Y}=\mathbb{s}_{i}}\mathbb{Z}_{i}-\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\Big|_{\mathbb{Y}=\mathbb{s}_{j}}\mathbb{Z}_{j}\Big]\Big)_{1\leq j\leq n}, \qquad (B.4)

where in (B.3) we used a Taylor expansion of \mathbb{P}_{ij}^{(k)}=\|f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{j})\|_{2} at f_{k}(\mathbb{Y}^{*}_{i})-f_{k}(\mathbb{Y}^{*}_{j}), with \mathbb{g}_{k,ij} being some point between f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{j}) and f_{k}(\mathbb{Y}^{*}_{i})-f_{k}(\mathbb{Y}^{*}_{j}), and in (B.4) we used a Taylor expansion of f_{k}(\mathbb{Y}_{i}) at \mathbb{Y}^{*}_{i}, with \mathbb{s}_{i} being some point between \mathbb{Y}_{i} and \mathbb{Y}_{i}^{*}.

For each i, since c_{i,k} is a parameter that accounts for the possible scaling difference caused by the dimension reduction map f_{k}, without loss of generality we can take c_{i,k}=\frac{\|f_{k}(\mathbb{Y}^{*}_{i})\|_{2}}{\|\mathbb{Y}_{i}^{*}\|_{2}}. Under Condition (C01), it follows that c_{i,k}^{-1}\mathbb{P}_{ij}^{*(k)}-\mathbb{P}_{ij}^{*} in (B.4) is bounded and therefore a sub-Gaussian random variable. Similarly, for the random variable

\frac{c_{i,k}^{-1}\mathbb{g}_{k,ij}^{\top}}{\|\mathbb{g}_{k,ij}\|_{2}}\Big[\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\Big|_{\mathbb{Y}=\mathbb{s}_{i}}\mathbb{Z}_{i}-\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\Big|_{\mathbb{Y}=\mathbb{s}_{j}}\mathbb{Z}_{j}\Big] \qquad (B.5)

in (B.4), under Condition (C01) we also have the boundedness of \big\|\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\big|_{\mathbb{Y}=\mathbb{s}_{i}}\big\| and \big\|\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\big|_{\mathbb{Y}=\mathbb{s}_{j}}\big\|. By Proposition 2.5.2 of Vershynin (2018), these, along with the boundedness of c_{i,k}^{-1} and \frac{\mathbb{g}_{k,ij}}{\|\mathbb{g}_{k,ij}\|_{2}} and the sub-Gaussianity of \mathbb{Z}_{i} and \mathbb{Z}_{j} from (C02), imply that (B.5) is also a sub-Gaussian random variable. Thus, we have verified the sub-Gaussianity of \mathbb{h}_{i}^{(k)} under Conditions (C01) and (C02). In particular, the sub-Gaussian parameter \sigma^{2} in (C1a) is jointly determined by the underlying manifold \mathcal{M}, the scale (L) and the smoothness (C_{f}) of the DR map.

B.3 Eigenscore Consistency: Proof of Theorem 1

For simplicity, we omit the dependence on i\in\{1,2,\ldots,n\} and hereafter denote \mathbb{x}_{k}=(x_{k1},\ldots,x_{kn})=\mathbb{P}^{(k)}_{i.}, \bm{y}=(y_{1},\ldots,y_{n})=\mathbb{P}_{i.}^{*}, \mathbb{z}_{k}=(z_{k1},\ldots,z_{kn})=\mathbb{h}_{i}^{(k)}, \widehat{\mathbb{s}}=\widehat{\mathbb{s}}_{i}, \mathbb{s}=\mathbb{s}_{i} and c_{k}=c_{i,k}. Therefore, model (2.12) in the main paper can be rewritten as

\mathbb{x}_{k}=c_{k}(\bm{y}+\mathbb{z}_{k}),\qquad k=1,2,\ldots,K. \qquad (B.6)

We denote

𝕩¯k=𝕩k𝕩k2=𝒚+𝕫k𝒚+𝕫k2,\bar{\mathbb{x}}_{k}=\frac{\mathbb{x}_{k}}{\|\mathbb{x}_{k}\|_{2}}=\frac{\bm{y}+\mathbb{z}_{k}}{\|\bm{y}+\mathbb{z}_{k}\|_{2}}, (B.7)

and define the matrix of normalized vectors

𝕏¯=[𝕩¯1𝕩¯2𝕩¯K]K×n,\bar{\mathbb{X}}=\begin{bmatrix}\bar{\mathbb{x}}_{1}^{\top}\\ \bar{\mathbb{x}}_{2}^{\top}\\ \vdots\\ \bar{\mathbb{x}}_{K}^{\top}\\ \end{bmatrix}\in\mathbb{R}^{K\times n},

Then

𝕏¯𝕏¯=(𝕏¯j𝕏¯k)1j,kn,\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=(\bar{\mathbb{X}}_{j}^{\top}\bar{\mathbb{X}}_{k})_{1\leq j,k\leq n},

where 𝕏¯jK\bar{\mathbb{X}}_{j}\in\mathbb{R}^{K} is the jj-th column of 𝕏¯\bar{\mathbb{X}}, so that

𝕏¯j=(yj+z1j𝒚+𝕫12,yj+z2j𝒚+𝕫22,,yj+zKj𝒚+𝕫K2).\bar{\mathbb{X}}_{j}=\bigg{(}\frac{y_{j}+z_{1j}}{\|\bm{y}+\mathbb{z}_{1}\|_{2}},\frac{y_{j}+z_{2j}}{\|\bm{y}+\mathbb{z}_{2}\|_{2}},...,\frac{y_{j}+z_{Kj}}{\|\bm{y}+\mathbb{z}_{K}\|_{2}}\bigg{)}^{\top}.

For any 1j,kn1\leq j,k\leq n, it follows that

𝕏¯j𝕏¯k=i=1K(yj+zij)(yk+zik)𝒚+𝕫i22\displaystyle\bar{\mathbb{X}}_{j}^{\top}\bar{\mathbb{X}}_{k}=\sum_{i=1}^{K}\frac{(y_{j}+z_{ij})(y_{k}+z_{ik})}{\|\bm{y}+\mathbb{z}_{i}\|_{2}^{2}}
=(i=1K𝒚+𝕫i22)yjyk+(i=1Kzik𝒚+𝕫i22)yj+(i=1Kzij𝒚+𝕫i22)yk+i=1Kzijzik𝒚+𝕫i22.\displaystyle=\bigg{(}\sum_{i=1}^{K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg{)}y_{j}y_{k}+\bigg{(}\sum_{i=1}^{K}z_{ik}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg{)}y_{j}+\bigg{(}\sum_{i=1}^{K}z_{ij}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg{)}y_{k}+\sum_{i=1}^{K}z_{ij}z_{ik}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}.

This implies that

𝕏¯𝕏¯=𝕘22𝒚𝒚+𝒚𝕘+𝕘𝒚+,\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=\|\mathbb{g}\|_{2}^{2}\cdot\bm{y}\bm{y}^{\top}+\bm{y}\mathbb{g}^{\top}\mathbb{H}^{\top}+\mathbb{H}\mathbb{g}\bm{y}^{\top}+\mathbb{H}\mathbb{H}^{\top}, (B.8)

where

𝕘=(𝒚+𝕫121,𝒚+𝕫221,,𝒚+𝕫K21),=[𝕫1𝒚+𝕫12𝕫2𝒚+𝕫22𝕫K𝒚+𝕫K2]n×K\mathbb{g}=(\|\bm{y}+\mathbb{z}_{1}\|_{2}^{-1},\|\bm{y}+\mathbb{z}_{2}\|_{2}^{-1},...,\|\bm{y}+\mathbb{z}_{K}\|_{2}^{-1}),\qquad\mathbb{H}=\begin{bmatrix}\frac{\mathbb{z}_{1}}{\|\bm{y}+\mathbb{z}_{1}\|_{2}}&\frac{\mathbb{z}_{2}}{\|\bm{y}+\mathbb{z}_{2}\|_{2}}&...&\frac{\mathbb{z}_{K}}{\|\bm{y}+\mathbb{z}_{K}\|_{2}}\end{bmatrix}\in\mathbb{R}^{n\times K}

We denote

𝕏¯𝕏¯=𝕘22𝒚𝒚+𝔼,𝔼=𝒚𝕘+𝕘𝒚+,\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=\|\mathbb{g}\|_{2}^{2}\cdot\bm{y}\bm{y}^{\top}+\mathbb{E},\qquad\mathbb{E}=\bm{y}\mathbb{g}^{\top}\mathbb{H}^{\top}+\mathbb{H}\mathbb{g}\bm{y}^{\top}+\mathbb{H}\mathbb{H}^{\top}, (B.9)

Let the singular value decomposition of 𝕏¯\bar{\mathbb{X}} be

𝕏¯=i=1rσ^i𝕦^i𝕧^i,\bar{\mathbb{X}}=\sum_{i=1}^{r}\widehat{\sigma}_{i}\hat{\mathbb{u}}_{i}\hat{\mathbb{v}}_{i}^{\top},

where r=min{n,K}r=\min\{n,K\} and σ^1σ^2σ^r\widehat{\sigma}_{1}\geq\widehat{\sigma}_{2}\geq...\geq\widehat{\sigma}_{r}. Then we have

𝕏¯𝕏¯=i=1rσ^i2𝕧^i𝕧^i,𝕏¯𝕏¯=i=1rσ^i2𝕦^i𝕦^i,\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=\sum_{i=1}^{r}\widehat{\sigma}_{i}^{2}\hat{\mathbb{v}}_{i}\hat{\mathbb{v}}_{i}^{\top},\qquad\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}=\sum_{i=1}^{r}\widehat{\sigma}_{i}^{2}\hat{\mathbb{u}}_{i}\hat{\mathbb{u}}_{i}^{\top}, (B.10)

and

𝕤𝒚2=(𝕩1𝒚𝕩12,,𝕩K𝒚𝕩K2)=𝕏¯𝒚=i=1rσ^i𝕦^i𝕧^i𝒚.\mathbb{s}\cdot\|\bm{y}\|_{2}=\bigg{(}\frac{\mathbb{x}_{1}^{\top}\bm{y}}{\|\mathbb{x}_{1}\|_{2}},...,\frac{\mathbb{x}_{K}^{\top}\bm{y}}{\|\mathbb{x}_{K}\|_{2}}\bigg{)}^{\top}=\bar{\mathbb{X}}\bm{y}=\sum_{i=1}^{r}\widehat{\sigma}_{i}\hat{\mathbb{u}}_{i}\hat{\mathbb{v}}_{i}^{\top}\bm{y}. (B.11)

Therefore, we have

1|cos(𝕦^1,𝕤)|=1|cos(𝕦^1,𝕏¯𝒚)|=1σ^1|𝕧^1𝒚|𝕏¯𝒚2\displaystyle 1-|\cos\angle(\hat{\mathbb{u}}_{1},\mathbb{s})|=1-|\cos\angle(\hat{\mathbb{u}}_{1},\bar{\mathbb{X}}\bm{y})|=1-\frac{\widehat{\sigma}_{1}|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}
=𝕏¯𝒚2σ^1|𝕧^1𝒚|𝕏¯𝒚2i=2rσ^i𝕦^i𝕧^i𝒚2𝕏¯𝒚2σ^2𝕍^1𝒚σ^1|𝕧^1𝒚|σ^2𝕍^1𝒚,\displaystyle=\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}-\widehat{\sigma}_{1}|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}\leq\frac{\|\sum_{i=2}^{r}\widehat{\sigma}_{i}\hat{\mathbb{u}}_{i}\hat{\mathbb{v}}_{i}^{\top}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}\leq\frac{\widehat{\sigma}_{2}\|\hat{\mathbb{V}}_{-1}^{\top}\bm{y}\|}{\widehat{\sigma}_{1}|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|-\widehat{\sigma}_{2}\|\hat{\mathbb{V}}_{-1}^{\top}\bm{y}\|},

where \widehat{\mathbb{V}}_{-1}=\begin{bmatrix}\hat{\mathbb{v}}_{2}&\ldots&\hat{\mathbb{v}}_{r}\end{bmatrix}\in\mathbb{R}^{n\times(r-1)}. Now we derive estimates of the random quantities \widehat{\sigma}_{1}, \widehat{\sigma}_{2}, |\hat{\mathbb{v}}_{1}^{\top}\bm{y}| and \|\widehat{\mathbb{V}}_{-1}^{\top}\bm{y}\| to obtain an upper bound on the last term in the above inequality. Consider the decomposition (B.9) and the first equation of (B.10). Firstly, by Weyl's inequality (e.g., Corollary III.2.6 of Bhatia 2013), we have

|σ^12𝕘22𝒚22|𝔼,\big{|}\widehat{\sigma}_{1}^{2}-\|\mathbb{g}\|_{2}^{2}\cdot\|\bm{y}\|_{2}^{2}\big{|}\leq\|\mathbb{E}\|, (B.12)

and

|σ^22|𝔼.|\widehat{\sigma}_{2}^{2}|\leq\|\mathbb{E}\|. (B.13)

Secondly, by the Davis-Kahan perturbation theorem (e.g., Theorem 1 of Yu et al., 2015), we have

𝕍^1𝒚/𝒚=1(𝕧^1𝒚)2/𝒚22𝔼𝕘22𝒚22.\|\widehat{\mathbb{V}}_{-1}^{\top}\bm{y}\|/\|\bm{y}\|=\sqrt{1-(\hat{\mathbb{v}}_{1}^{\top}\bm{y})^{2}/\|\bm{y}\|_{2}^{2}}\leq\frac{\|\mathbb{E}\|}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}}. (B.14)

Thus, if we denote R=𝔼/(𝕘22𝒚22)R=\|\mathbb{E}\|/(\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}), by (B.12) and (B.13), if R<1R<1, we have

σ^2σ^1𝔼1/2𝕘22𝒚22𝔼=R1/21R,\frac{\widehat{\sigma}_{2}}{\widehat{\sigma}_{1}}\leq\frac{\|\mathbb{E}\|^{1/2}}{\sqrt{\|\mathbb{g}\|_{2}^{2}\cdot\|\bm{y}\|_{2}^{2}-\|\mathbb{E}\|}}=\frac{R^{1/2}}{\sqrt{1-R}},

and by (B.14), we have

𝕍^1𝒚|𝕧^1𝒚|R1R2.\frac{\|\widehat{\mathbb{V}}^{\top}_{-1}\bm{y}\|}{|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|}\leq\frac{R}{\sqrt{1-R^{2}}}.

Therefore, if R<1R<1, we have

1|cos(𝕦^1,𝕤)|σ^2𝕍^1𝒚σ^1|𝕧^1𝒚|1σ^2𝕍^1𝒚σ^1|𝕧^1𝒚|R1/21RR1R21R1/21RR1R2=R3/2(1R)(1R2)R3/2.1-|\cos\angle(\hat{\mathbb{u}}_{1},\mathbb{s})|\leq\frac{\frac{\widehat{\sigma}_{2}\|\widehat{\mathbb{V}}_{-1}^{\top}\bm{y}\|}{\widehat{\sigma}_{1}|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|}}{1-\frac{\widehat{\sigma}_{2}\|\widehat{\mathbb{V}}_{-1}^{\top}\bm{y}\|}{\widehat{\sigma}_{1}|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|}}\leq\frac{\frac{R^{1/2}}{\sqrt{1-R}}\frac{R}{\sqrt{1-R^{2}}}}{1-\frac{R^{1/2}}{\sqrt{1-R}}\frac{R}{\sqrt{1-R^{2}}}}=\frac{R^{3/2}}{\sqrt{(1-R)(1-R^{2})}-R^{3/2}}. (B.15)

If R<0.5R<0.5, we further have

1|cos(𝕦^1,𝕤)|4R3/2.1-|\cos\angle(\hat{\mathbb{u}}_{1},\mathbb{s})|\leq 4R^{3/2}. (B.16)

The rest of the proof is devoted to a stochastic upper bound of R. By the elementary inequality \|{\bf A}\mathbb{B}\|\leq\|{\bf A}\|\cdot\|\mathbb{B}\| and the triangle inequality, we have

𝔼2𝒚2𝕘2+2.\|\mathbb{E}\|\leq 2\|\bm{y}\|_{2}\cdot\|\mathbb{g}\|_{2}\cdot\|\mathbb{H}\|+\|\mathbb{H}\|^{2}. (B.17)

Then it follows that

R=𝔼𝕘22𝒚222𝒚2𝕘2+2𝕘22𝒚22=2𝕘2𝒚2+2𝕘22𝒚22,R=\frac{\|\mathbb{E}\|}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}}\leq\frac{2\|\bm{y}\|_{2}\|\mathbb{g}\|_{2}\|\mathbb{H}\|+\|\mathbb{H}\|^{2}}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}}=\frac{2\|\mathbb{H}\|}{\|\mathbb{g}\|_{2}\|\bm{y}\|_{2}}+\frac{\|\mathbb{H}\|^{2}}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}}, (B.18)

Hence, it suffices to obtain an upper bound for \|\mathbb{H}\| and a lower bound for \|\mathbb{g}\|_{2}.

Upper bound of \|\mathbb{H}\|.

Since

=[𝕫1𝕫2𝕫K]diag(𝒚+𝕫121,𝒚+𝕫221,,𝒚+𝕫K21)0diag(𝕘).\mathbb{H}=\begin{bmatrix}\mathbb{z}_{1}&\mathbb{z}_{2}&...&\mathbb{z}_{K}\end{bmatrix}\cdot\text{diag}(\|\bm{y}+\mathbb{z}_{1}\|_{2}^{-1},\|\bm{y}+\mathbb{z}_{2}\|_{2}^{-1},...,\|\bm{y}+\mathbb{z}_{K}\|_{2}^{-1})\equiv\mathbb{H}_{0}\cdot\text{diag}(\mathbb{g}). (B.19)

we have

0𝕘.\|\mathbb{H}\|\leq\|\mathbb{H}_{0}\|\cdot\|\mathbb{g}\|_{\infty}. (B.20)

The following lemma gives the upper bound of 0\|\mathbb{H}_{0}\|, which is proved in Section B.7 below.

Lemma 1.

Under the conditions of Theorem 1, it holds that \|\mathbb{H}_{0}\|\leq C\sigma\sqrt{\rho}(\sqrt{n}+\sqrt{K}) with probability at least 1-n^{-c} for some universal constants C,c>0.

Now for

𝕘=1min1iK𝒚+𝕫i2,\|\mathbb{g}\|_{\infty}=\frac{1}{\min_{1\leq i\leq K}\|\bm{y}+\mathbb{z}_{i}\|_{2}},

note that

𝒚+𝕫i22=𝒚22+𝕫i22+2𝒚𝕫i,\|\bm{y}+\mathbb{z}_{i}\|_{2}^{2}=\|\bm{y}\|_{2}^{2}+\|\mathbb{z}_{i}\|_{2}^{2}+2\bm{y}^{\top}\mathbb{z}_{i},

By Condition (C1a), we have

|𝒚𝕫i|cσ𝒚2lognin probability,|\bm{y}^{\top}\mathbb{z}_{i}|\leq c^{\prime}\sigma\|\bm{y}\|_{2}\sqrt{\log n}\qquad\text{in probability,}

for any small constant $c^{\prime}>0$, and that

\min_{1\leq i\leq K}\|\mathbb{z}_{i}\|_{2}^{2}=cn\sigma^{2}(1+o(1))\quad\text{in probability.}

Since $2\sigma\|\bm{y}\|_{2}\sqrt{\log n}\leq\sigma^{2}\log^{2}n+\|\bm{y}\|_{2}^{2}/\log n$, we have

\min_{1\leq i\leq K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{2}\geq\|\bm{y}\|_{2}^{2}+cn\sigma^{2}(1+o(1))-c^{\prime}\sigma^{2}\log^{2}n-c^{\prime}\|\bm{y}\|_{2}^{2}/\log n\geq(1-\delta_{n})\|\bm{y}\|_{2}^{2}+(1-\delta_{n})cn\sigma^{2},

in probability, for some $\delta_{n}\to 0$.

In other words, we have

\min_{1\leq i\leq K}\|\bm{y}+\mathbb{z}_{i}\|_{2}\geq(1-\delta_{n})\|\bm{y}\|_{2}+(1-\delta_{n})c^{1/2}\sigma\sqrt{n}, (B.21)

in probability, for some non-negative sequence $\delta_{n}\to 0$. This implies

\|\mathbb{H}\|\leq C\frac{\sigma\sqrt{\rho}(\sqrt{n}+\sqrt{K})}{\|\bm{y}\|_{2}+\sigma\sqrt{n}}, (B.22)

in probability.

Lower bound of $\|\mathbb{g}\|_{2}$.

By a similar argument to the one leading to (B.21), we have

\|\mathbb{g}\|_{2}^{2}=\sum_{i=1}^{K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\geq K\min_{1\leq i\leq K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}=\frac{K}{\max_{1\leq i\leq K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{2}}\geq\frac{K}{(1+\delta^{\prime}_{n})[\|\bm{y}\|_{2}^{2}+c\sigma^{2}n]}, (B.23)

in probability, for some non-negative sequence $\delta^{\prime}_{n}\to 0$.

Completing the proof.

Combining the above results, we have

\frac{\|\mathbb{H}\|}{\|\mathbb{g}\|_{2}\|\bm{y}\|_{2}}\lesssim\frac{\sigma\sqrt{\rho}(\sqrt{n/K}+1)}{\|\bm{y}\|_{2}}\lesssim\frac{\sqrt{\rho n/K}}{\|\bm{y}\|_{2}/\sigma},

where the last inequality follows from $n>K$. Thus, under Condition (C2), for sufficiently large $n$ we always have $R<0.5$, and therefore by (B.16) we have

|\cos\angle(\hat{\mathbb{u}}_{1},\mathbb{s})|\to 1,\qquad\text{in probability.} (B.24)

Finally, note that for any $i=1,2,...,K$, we have

s_{i}=\frac{\|\bm{y}\|_{2}^{2}+\mathbb{z}_{i}^{\top}\bm{y}}{\|\bm{y}+\mathbb{z}_{i}\|_{2}\cdot\|\bm{y}\|_{2}}\gtrsim\frac{\|\bm{y}\|_{2}^{2}-c\sigma\|\bm{y}\|_{2}\sqrt{n/K}}{\|\bm{y}\|_{2}(\|\bm{y}\|_{2}+\sigma\sqrt{n})}\geq\frac{\|\bm{y}\|_{2}-c\sigma\sqrt{n/K}}{\|\bm{y}\|_{2}+\sigma\sqrt{n}}, (B.25)

in probability. By Condition (C2), whenever $K\ll n$ the right-hand side of the above inequality is strictly positive. In other words, in probability, $\mathbb{s}$ is a strictly positive vector. Thus, for $\widehat{\mathbb{s}}=|\hat{\mathbb{u}}_{1}|$, under the same event it holds that

\cos\angle(\widehat{\mathbb{s}},\mathbb{s})=\cos\angle(|\hat{\mathbb{u}}_{1}|,\mathbb{s})\geq|\cos\angle(\hat{\mathbb{u}}_{1},\mathbb{s})|. (B.26)

This along with (B.24) implies

\cos\angle(\widehat{\mathbb{s}},\mathbb{s})\to 1,\qquad\text{in probability.} (B.27)
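To illustrate (B.27), the following is a minimal simulation sketch (our own illustration with assumed values of $n$, $K$ and $\sigma$; it is not the paper's code): it generates candidate rows $\mathbb{x}_{k}=\bm{y}+\mathbb{z}_{k}$, forms the row-normalized matrix $\bar{\mathbb{X}}$, and compares $\widehat{\mathbb{s}}=|\hat{\mathbb{u}}_{1}|$ with the true concordance scores $s_{k}=\cos\angle(\mathbb{x}_{k},\bm{y})$.

\begin{verbatim}
import numpy as np

# Simulation sketch for Theorem 1 (illustrative parameter values only).
rng = np.random.default_rng(0)
n, K, sigma = 2000, 20, 0.5
y = rng.normal(size=n)                                  # true structure (vectorized)
X = y + sigma * rng.normal(size=(K, n))                 # candidate rows x_k = y + z_k
Xbar = X / np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm rows

U, S, Vt = np.linalg.svd(Xbar, full_matrices=False)
s_hat = np.abs(U[:, 0])                                 # |leading left singular vector|
s_true = Xbar @ y / np.linalg.norm(y)                   # s_k = cos angle(x_k, y)

cos = s_hat @ s_true / (np.linalg.norm(s_hat) * np.linalg.norm(s_true))
print("cos angle(s_hat, s) =", round(float(cos), 4))    # close to 1, as in (B.27)
\end{verbatim}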

B.4 Guarantee of Meta-Visualization: Proof of Theorem 2

We still use the notation defined at the beginning of the proof of Theorem 1. In addition, we denote $\mathbb{x}^{m}=\bar{\mathbb{P}}_{i.}^{m}$, and define

\mathbb{x}^{*}=\sum_{i=1}^{K}w_{i}\bar{\mathbb{x}}_{i}, (B.28)

where $w_{i}=s_{i}/\|\mathbb{s}\|_{2}$. To begin with, note that

\cos\angle(\mathbb{x}^{*},\bm{y})=\frac{\sum_{i=1}^{K}w_{i}\bar{\mathbb{x}}_{i}^{\top}\bm{y}}{\|\sum_{i=1}^{K}w_{i}\bar{\mathbb{x}}_{i}\|_{2}\|\bm{y}\|_{2}},
\sum_{i=1}^{K}w_{i}\bar{\mathbb{x}}_{i}^{\top}\bm{y}=\frac{\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})^{2}}{\sqrt{\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})^{2}}}=\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}^{2}}{\sqrt{\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})^{2}}}=\|\bar{\mathbb{X}}\bm{y}\|_{2},

and

\bigg\|\sum_{i=1}^{K}w_{i}\bar{\mathbb{x}}_{i}\bigg\|_{2}=\frac{\|\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})\bar{\mathbb{x}}_{i}\|_{2}}{\sqrt{\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})^{2}}}=\frac{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}}{\sqrt{\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})^{2}}}=\frac{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}.

Then

|\cos\angle(\mathbb{x}^{*},\bm{y})|=\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}^{2}}{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}\|\bm{y}\|_{2}}.

Similarly, we have

\max_{1\leq i\leq K}|\cos\angle(\mathbb{x}_{i},\bm{y})|=\|\bar{\mathbb{X}}\bm{y}\|_{\infty}/\|\bm{y}\|_{2}.

We first prove some auxiliary results. Since

\bar{\mathbb{X}}^{\top}=\begin{bmatrix}\bm{y}+\mathbb{z}_{1}&\cdots&\bm{y}+\mathbb{z}_{K}\end{bmatrix}\cdot\text{diag}(\mathbb{g}),

we have, for sufficiently large $n$,

\|\bar{\mathbb{X}}\|\leq\|\mathbb{g}\|_{\infty}\cdot\bigg\|\begin{bmatrix}\bm{y}+\mathbb{z}_{1}&\cdots&\bm{y}+\mathbb{z}_{K}\end{bmatrix}\bigg\|\leq\|\mathbb{g}\|_{\infty}\cdot(\sqrt{K}\|\bm{y}\|_{2}+\|\mathbb{H}_{0}\|)\leq\|\mathbb{g}\|_{\infty}\cdot[\sqrt{K}\|\bm{y}\|_{2}+C\sigma\sqrt{\rho}(\sqrt{n}+\sqrt{K})] (B.29)

in probability, according to Lemma 1, and

\|\bar{\mathbb{X}}\bm{y}\|_{2}=\bigg\|\bm{y}^{\top}\begin{bmatrix}\bm{y}+\mathbb{z}_{1}&\cdots&\bm{y}+\mathbb{z}_{K}\end{bmatrix}\cdot\text{diag}(\mathbb{g})\bigg\|_{2}\geq\min_{1\leq i\leq K}|\|\bm{y}\|_{2}^{2}+\bm{y}^{\top}\mathbb{z}_{i}|\cdot\|\mathbb{g}\|_{2}\geq\|\mathbb{g}\|_{2}\cdot(1-\delta_{n})\|\bm{y}\|_{2}^{2}, (B.30)
\|\bar{\mathbb{X}}\bm{y}\|_{\infty}\leq\max_{1\leq i\leq K}|\|\bm{y}\|_{2}^{2}+\bm{y}^{\top}\mathbb{z}_{i}|\cdot\|\mathbb{g}\|_{\infty}\leq\|\mathbb{g}\|_{\infty}(1+\delta_{n})\|\bm{y}\|_{2}^{2}, (B.31)

for some non-negative sequence $\delta_{n}\to 0$ as $n\to\infty$, in probability. In addition, by (B.23) and (B.21), we have

\|\mathbb{g}\|_{2}\geq\frac{\sqrt{K}}{(1+\delta_{n})(\|\bm{y}\|_{2}+c\sigma\sqrt{n})},\qquad\|\mathbb{g}\|_{\infty}\leq\frac{1}{(1-\delta_{n})(\|\bm{y}\|_{2}+c\sigma\sqrt{n})}, (B.32)

in probability, for some $\delta_{n}\to 0$. Now we show that

\cos\angle(\mathbb{x}^{m},\mathbb{x}^{*})\to 1. (B.33)

To see this, since

\mathbb{x}^{m}=\sum_{i=1}^{K}\hat{s}_{i}\bar{\mathbb{x}}_{i}=\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}, (B.34)

and $\mathbb{x}^{*}=\bar{\mathbb{X}}^{\top}\mathbb{w}$ as in (B.28), we have

\cos\angle(\mathbb{x}^{m},\mathbb{x}^{*})\geq\frac{\mathbb{w}^{\top}\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}}{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}=\frac{\widehat{\mathbb{s}}^{\top}\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}}{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}+\frac{\widehat{\mathbb{s}}^{\top}\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}(\mathbb{w}-\widehat{\mathbb{s}})}{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}. (B.35)

Hence

|1-\cos\angle(\mathbb{x}^{m},\mathbb{x}^{*})|=\bigg|1-\frac{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}}{\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}+\frac{\widehat{\mathbb{s}}^{\top}\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}(\mathbb{w}-\widehat{\mathbb{s}})}{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}\bigg|
\leq\frac{|\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}-\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}|}{\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}+\frac{|\widehat{\mathbb{s}}^{\top}\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}(\widehat{\mathbb{s}}-\mathbb{w})|}{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}
\leq\frac{2\|\bar{\mathbb{X}}\|\|\widehat{\mathbb{s}}-\mathbb{w}\|_{2}}{\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}
\leq\frac{2\|\bar{\mathbb{X}}\|\|\bar{\mathbb{X}}\bm{y}\|_{2}\|\widehat{\mathbb{s}}-\mathbb{w}\|_{2}}{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}}.

Note that by Theorem 1, we have $\|\widehat{\mathbb{s}}-\mathbb{w}\|_{2}\to 0$ in probability. Then it suffices to show that

\frac{\|\bar{\mathbb{X}}\|\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}}=O_{P}(1). (B.36)

To see this, observe that

\frac{\|\bar{\mathbb{X}}\|\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}}\leq\frac{\|\bar{\mathbb{X}}\|\|\bar{\mathbb{X}}\bm{y}\|_{2}\|\bm{y}\|_{2}}{|\bm{y}^{\top}\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}|}=\frac{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}. (B.37)

Note that

\bar{\mathbb{X}}^{\top}=\big(\begin{bmatrix}\bm{y}&\cdots&\bm{y}\end{bmatrix}+\mathbb{H}_{0}\big)\cdot\text{diag}(\mathbb{g}). (B.38)

By Condition (C2), whenever $K\ll n$, we have

\bm{y}^{\top}\bar{\mathbb{X}}^{\top}=\big(\begin{bmatrix}\|\bm{y}\|_{2}^{2}&\cdots&\|\bm{y}\|_{2}^{2}\end{bmatrix}+\bm{y}^{\top}\mathbb{H}_{0}\big)\cdot\text{diag}(\mathbb{g})=\begin{bmatrix}\|\bm{y}\|_{2}^{2}(1+o(1))&\cdots&\|\bm{y}\|_{2}^{2}(1+o(1))\end{bmatrix}\cdot\text{diag}(\mathbb{g}),

in probability, which implies

\|\bar{\mathbb{X}}\bm{y}\|_{2}=\|\bm{y}\|_{2}^{2}(1+o(1))\sqrt{\sum_{i=1}^{K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}}\asymp\|\bm{y}\|_{2}^{2}\sqrt{K}(\|\bm{y}\|_{2}+\sigma\sqrt{n})^{-1}

in probability, where the last step follows from (B.21) and (B.23). On the other hand, we have

\|\bar{\mathbb{X}}\|\leq\big\|\begin{bmatrix}\bm{y}&\cdots&\bm{y}\end{bmatrix}+\mathbb{H}_{0}\big\|\cdot\|\mathbb{g}\|_{\infty}\lesssim(\|\bm{y}\|_{2}+\sigma\sqrt{n})^{-1}\big(\sqrt{K}\|\bm{y}\|_{2}+\sigma\sqrt{\rho}(\sqrt{K}+\sqrt{n})\big)\lesssim\sqrt{K}\|\bm{y}\|_{2}(\|\bm{y}\|_{2}+\sigma\sqrt{n})^{-1}.

Thus, we have

\frac{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}=O_{P}(1), (B.39)

which along with (B.37) implies (B.36). This completes our proof of (B.33). Finally, we note that

|\cos\angle(\mathbb{x}^{*},\bm{y})|=\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}^{2}}{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}\|\bm{y}\|_{2}}\geq\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}^{2}}{\|\bar{\mathbb{X}}\|\cdot\|\bar{\mathbb{X}}\bm{y}\|_{2}\|\bm{y}\|_{2}}=\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\|\cdot\|\bm{y}\|_{2}},

so that

1-\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}}\geq 1-|\cos\angle(\mathbb{x}^{*},\bm{y})|.

Now in order to prove the theorem, we show that, on the one hand,

1-\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}}\to 0\qquad\text{in probability}, (B.40)

and on the other hand, under the additional assumption that $\|\bm{y}\|_{2}/(\sigma\sqrt{n})\leq C$ for some sufficiently small constant $C>0$,

\frac{\|\bar{\mathbb{X}}\bm{y}\|_{\infty}}{\|\bm{y}\|_{2}}\leq 1-\delta,\qquad\text{in probability}, (B.41)

for some constant $\delta>0$.

To obtain (B.40), we note that by (B.29), (B.30) and (B.32), for some sequence $\delta_{n}\to 0$, we have

\frac{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}-\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}}=\frac{\|\bm{y}\|_{2}-\|\bar{\mathbb{X}}\bm{y}\|_{2}/\|\bar{\mathbb{X}}\|}{\|\bm{y}\|_{2}}\leq\frac{\|\bm{y}\|_{2}-\frac{\|\mathbb{g}\|_{2}(1-\delta_{n})\|\bm{y}\|_{2}}{\|\mathbb{g}\|_{\infty}(1+\delta_{n})\sqrt{K}}}{\|\bm{y}\|_{2}}\leq 1-\frac{(1-\delta_{n})^{2}(1+c\sigma\sqrt{n}/\|\bm{y}\|_{2})}{(1+\delta_{n})^{2}(1+c\sigma\sqrt{n}/\|\bm{y}\|_{2})},

in probability. To obtain (B.41), for some sequence $\delta_{n}\to 0$, we have

\frac{\|\bar{\mathbb{X}}\bm{y}\|_{\infty}}{\|\bm{y}\|_{2}}\leq\|\mathbb{g}\|_{\infty}(1+\delta_{n})\|\bm{y}\|_{2}\leq\frac{(1+\delta_{n})}{(1-\delta_{n})(1+c\sigma\sqrt{n}/\|\bm{y}\|_{2})}\leq 1-\delta, (B.42)

in probability, as long as $\|\bm{y}\|_{2}/(\sigma\sqrt{n})\leq C$ for some sufficiently small constant $C>0$. This completes our proof of Theorem 2.
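The contrast between (B.40) and (B.41) can also be seen in a small simulation. The sketch below (our own illustration with assumed parameter values, not the paper's code) compares the concordance of the eigenscore-weighted combination $\mathbb{x}^{m}=\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}$ with that of the best individual candidate in a noisy regime.

\begin{verbatim}
import numpy as np

# Simulation sketch for Theorem 2 (illustrative parameter values only).
rng = np.random.default_rng(1)
n, K, sigma = 2000, 30, 2.0                 # large sigma: each candidate is very noisy
y = rng.normal(size=n)
Xbar = y + sigma * rng.normal(size=(K, n))
Xbar /= np.linalg.norm(Xbar, axis=1, keepdims=True)

s_hat = np.abs(np.linalg.svd(Xbar, full_matrices=False)[0][:, 0])
x_meta = Xbar.T @ s_hat                     # weighted combination, as in (B.34)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("best single candidate:", round(max(cosine(row, y) for row in Xbar), 4))
print("weighted combination :", round(cosine(x_meta, y), 4))   # noticeably closer to 1
\end{verbatim}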

B.5 Robustness of Spectral Weighting: Proof of Theorem 3

For the first statement concerning $\bar{\mathbb{P}}^{m}$, we first note that, by the proof of Theorem 2, we have $\cos\angle(\mathbb{x}^{*},\bm{y})\to 1$ under the conditions of Theorem 1, as the effects of the adversarial candidate visualizations vanish in (B.28). Therefore, it suffices to show that under such an adversarial setting, we still have $\cos\angle(\mathbb{x}^{m},\mathbb{x}^{*})\to 1$ in probability.

To see this, by the proof of Theorem 2, it suffices to show that $\|\widehat{\mathbb{s}}-\mathbb{w}\|_{2}\to 0$ in probability, or, equivalently,

\cos\angle(\widehat{\mathbb{s}},\mathbb{s})\to 1,\qquad\text{in probability}. (B.43)

By the proof of Theorem 1, we have

\bar{\mathbb{X}}_{j}=\bigg(\frac{y_{j}+z_{1j}}{\|\bm{y}+\mathbb{z}_{1}\|_{2}},...,\frac{y_{j}+z_{K^{\prime}j}}{\|\bm{y}+\mathbb{z}_{K^{\prime}}\|_{2}},\frac{z_{K^{\prime}+1,j}}{\|\mathbb{z}_{K^{\prime}+1}\|_{2}},...,\frac{z_{Kj}}{\|\mathbb{z}_{K}\|_{2}}\bigg)^{\top},

where we denote $\mathbb{z}_{k}=\mathbb{x}_{k}$ with $\mathbb{z}_{k}^{\top}\bm{y}=0$ for all $k\geq K^{\prime}+1$. For any $1\leq j,k\leq n$, it follows that

\bar{\mathbb{X}}_{j}^{\top}\bar{\mathbb{X}}_{k}=\sum_{i=1}^{K^{\prime}}\frac{(y_{j}+z_{ij})(y_{k}+z_{ik})}{\|\bm{y}+\mathbb{z}_{i}\|_{2}^{2}}+\sum_{i=K^{\prime}+1}^{K}\frac{z_{ij}z_{ik}}{\|\mathbb{z}_{i}\|_{2}^{2}}=\bigg(\sum_{i=1}^{K^{\prime}}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg)y_{j}y_{k}+\bigg(\sum_{i=1}^{K^{\prime}}z_{ik}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg)y_{j}+\bigg(\sum_{i=1}^{K^{\prime}}z_{ij}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg)y_{k}+\sum_{i=1}^{K^{\prime}}z_{ij}z_{ik}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}+\sum_{i=K^{\prime}+1}^{K}\frac{z_{ij}z_{ik}}{\|\mathbb{z}_{i}\|_{2}^{2}},

where $K^{\prime}=(1-\eta)K$. This implies that

\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=\|\mathbb{g}\|_{2}^{2}\cdot\bm{y}\bm{y}^{\top}+\bm{y}\mathbb{g}^{\top}\mathbb{H}^{\top}+\mathbb{H}\mathbb{g}\bm{y}^{\top}+\mathbb{H}\mathbb{H}^{\top}+\mathbb{L}\mathbb{L}^{\top}, (B.44)

where

\mathbb{g}=(\|\bm{y}+\mathbb{z}_{1}\|_{2}^{-1},\|\bm{y}+\mathbb{z}_{2}\|_{2}^{-1},...,\|\bm{y}+\mathbb{z}_{K^{\prime}}\|_{2}^{-1}),\qquad\mathbb{H}=\begin{bmatrix}\frac{\mathbb{z}_{1}}{\|\bm{y}+\mathbb{z}_{1}\|_{2}}&\frac{\mathbb{z}_{2}}{\|\bm{y}+\mathbb{z}_{2}\|_{2}}&\cdots&\frac{\mathbb{z}_{K^{\prime}}}{\|\bm{y}+\mathbb{z}_{K^{\prime}}\|_{2}}\end{bmatrix}\in\mathbb{R}^{n\times K^{\prime}},

and

\mathbb{L}=\begin{bmatrix}\frac{\mathbb{z}_{K^{\prime}+1}}{\|\mathbb{z}_{K^{\prime}+1}\|_{2}}&\cdots&\frac{\mathbb{z}_{K}}{\|\mathbb{z}_{K}\|_{2}}\end{bmatrix}\in\mathbb{R}^{n\times\eta K}.

In other words,

\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=\|\mathbb{g}\|_{2}^{2}\cdot\bm{y}\bm{y}^{\top}+\mathbb{L}\mathbb{L}^{\top}+\mathbb{E},\qquad\mathbb{E}=\bm{y}\mathbb{g}^{\top}\mathbb{H}^{\top}+\mathbb{H}\mathbb{g}\bm{y}^{\top}+\mathbb{H}\mathbb{H}^{\top}. (B.45)

Note that by definition

\mathbb{L}\mathbb{L}^{\top}\bm{y}=0,

so we can treat the above decomposition as a low-rank matrix $\|\mathbb{g}\|_{2}^{2}\cdot\bm{y}\bm{y}^{\top}+\mathbb{L}\mathbb{L}^{\top}$ plus a noise matrix $\mathbb{E}$. In this respect, the Davis-Kahan perturbation theorem applies similarly as in the proof of Theorem 1, only with $\frac{\|\mathbb{E}\|}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}}$ therein replaced by

\frac{\|\mathbb{E}\|}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}-\|\mathbb{L}\|^{2}}.

Note that by definition we have

\|\mathbb{L}\|\leq\sqrt{\eta K}, (B.46)

whereas by (B.23) we have

\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}\geq\frac{(1-\eta)K\|\bm{y}\|_{2}^{2}}{(1+\delta^{\prime}_{n})(\|\bm{y}\|_{2}^{2}+c\sigma^{2}n)}. (B.47)

Then, as long as

\frac{1-\eta}{(1+\delta^{\prime}_{n})(1+c\sigma^{2}n/\|\bm{y}\|_{2}^{2})}>2\eta, (B.48)

we have

\frac{\|\mathbb{E}\|}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}-\|\mathbb{L}\|^{2}}\leq\frac{\|\mathbb{E}\|}{\eta K},

and the rest of the argument leading to (B.43) follows as in the proof of Theorem 2. Finally, note that a sufficient condition for (B.48) is

\|\bm{y}\|_{2}/\sigma\geq C\sqrt{n},\qquad\eta<C^{-1}, (B.49)

for some sufficiently large $C>0$. This completes the first part of the proof.

As for the second statement concerning $\bar{\mathbb{P}}^{a}$, note that by (B.25), whenever $\|\bm{y}\|_{2}/\sigma\geq C\sqrt{n}$ for some sufficiently large $C$, all of the $(1-\eta)K$ non-adversarial candidate visualizations are concordant with the true structure in probability, that is, $\cos\angle(\mathbb{x}_{k},\bm{y})\to_{P}1$ for all $k\in\mathcal{C}_{0}$.

\sum_{k\in\mathcal{C}_{0}}[1-\cos\angle(\mathbb{x}_{k},\bm{y})]\leq(1-\eta)K\epsilon/C,

in probability, for some constant $\epsilon>0$. Without loss of generality, we assume $\bm{y}=(1,0,...,0)$ and $\bar{\mathbb{x}}_{k}=(0,1,0,...,0)$ for all $k\in\mathcal{C}_{1}$. Thus, it follows that

\sum_{k=1}^{K}\bar{\mathbb{x}}_{k}=(0,\eta K,0,...,0)+((1-\eta)K,0,...,0)+\sum_{k=1}^{(1-\eta)K}(\bar{\mathbb{x}}_{k}-\bm{y}),

and

\bigg\|\sum_{k=1}^{K}\bar{\mathbb{x}}_{k}\bigg\|_{2}\geq\sqrt{\eta^{2}K^{2}+(1-\eta)^{2}K^{2}}-\sum_{k=1}^{(1-\eta)K}\|\bar{\mathbb{x}}_{k}-\bm{y}\|_{2}\geq K\sqrt{\eta^{2}+(1-\eta)^{2}}-2(1-\eta)K\epsilon/C,

in probability. If $C$ is chosen such that $\epsilon/C<0.01$, then we have

\cos\angle\bigg(\sum_{k=1}^{K}\bar{\mathbb{x}}_{k},\bm{y}\bigg)\leq\frac{(1-\eta)K+2(1-\eta)K\epsilon/C}{K\sqrt{\eta^{2}+(1-\eta)^{2}}-2(1-\eta)K\epsilon/C}\leq\frac{1-\eta+0.02(1-\eta)}{\sqrt{1-2\eta+2\eta^{2}}-0.02(1-\eta)}<1-\eta,

in probability.
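The two statements of Theorem 3 can likewise be illustrated numerically. In the sketch below (our own illustration under an assumed adversarial setup, not the paper's code), a fraction $\eta$ of the candidates carries a structure orthogonal to $\bm{y}$; the spectral weights essentially ignore these candidates, whereas the unweighted average remains pulled toward the adversarial direction.

\begin{verbatim}
import numpy as np

# Simulation sketch for Theorem 3 (illustrative parameter values only).
rng = np.random.default_rng(2)
n, K, eta, sigma = 2000, 30, 0.4, 0.2
K_good = int((1 - eta) * K)
y = rng.normal(size=n)
adv = rng.normal(size=n)
adv -= (adv @ y) / (y @ y) * y                          # adversarial direction, orthogonal to y

rows = [y + sigma * rng.normal(size=n) for _ in range(K_good)]
rows += [adv + sigma * rng.normal(size=n) for _ in range(K - K_good)]
Xbar = np.array(rows)
Xbar /= np.linalg.norm(Xbar, axis=1, keepdims=True)

s_hat = np.abs(np.linalg.svd(Xbar, full_matrices=False)[0][:, 0])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("spectral weighting:", round(cosine(Xbar.T @ s_hat, y), 4))   # close to 1
print("unweighted average:", round(cosine(Xbar.sum(axis=0), y), 4)) # bounded away from 1
\end{verbatim}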

B.6 Necessity of the Signal Strength Condition (C2)

In this part, we show that the signal strength condition (C2) required by the proposed method is in fact asymptotically sharp, in the sense that it is essentially necessary for any method to achieve consistent estimation of the true structure $\mathbb{P}^{*}$.

Specifically, we consider the settings where the following two conditions hold:

(C1c) $\{\mathbb{h}_{i}^{(k)}\}_{1\leq k\leq K}$ are identically distributed Gaussian random vectors with mean zero and covariance matrix $\sigma^{2}{\bf I}_{n}$.

(C1d) The matrix $\mathbb{R}=(\rho_{jk})_{1\leq j,k\leq K}$ satisfies $\rho:=\|\mathbb{R}\|=O(1)$.

Note that Conditions (C1c) and (C1d) are sufficient conditions for Conditions (C1a) and (C1b), respectively, so that Conditions (C1a) and (C1b) are satisfied throughout our discussion below.

Using the previous notations, we can rewrite our model for a given $i$ as

\mathbb{x}_{k}=c_{k}(\bm{y}+\mathbb{z}_{k}),\qquad k=1,2,...,K,

where by (C1c) $\{\mathbb{z}_{k}\}_{1\leq k\leq K}$ are identically distributed Gaussian vectors with mean zero and covariance matrix $\sigma^{2}{\bf I}_{n}$, and $\mathbb{R}\in\mathbb{R}^{K\times K}$, defined similarly as in the main paper, describes the correlation structure among $\{\mathbb{z}_{k}\}_{1\leq k\leq K}$. We consider the parameter space for the model parameters $(\{c_{k}\}_{1\leq k\leq K},\bm{y},\sigma,\mathbb{R})$ as

\Theta(t,\sigma,K,n)=\bigg\{\{c_{k}\}_{1\leq k\leq K}\subset\mathbb{R},\ \bm{y}\in\mathbb{R}^{n},\ \|\bm{y}\|_{2}=t,\ \sigma>0,\ \mathbb{R}\in\mathbb{R}^{K\times K},\ \|\mathbb{R}\|\leq C\bigg\},

for some sufficiently large universal constant $C>1$. Define a subset of $\Theta(t,\sigma,K,n)$ as

\Theta_{0}(t,\sigma,K,n)=\bigg\{c_{k}=1\ \forall 1\leq k\leq K,\ \bm{y}\in\mathbb{R}^{n},\ \sigma>0,\ \mathbb{R}={\bf I}_{K}\bigg\}.

From our previous analysis (Theorem 2), it follows that as long as $t/\sigma\gg\sqrt{n/K}$, the proposed method satisfies

\lim_{n\to\infty}\inf_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}(\cos\angle(\mathbb{x}^{m},\bm{y})>1-\epsilon)=1, (B.50)

for all $\epsilon>0$. Now we establish the following minimax lower bound, which shows that consistent estimation of $\bm{y}$ is impossible over the parameter space $\Theta(t,\sigma,K,n)$ for any estimator $\widehat{\bm{y}}$ whenever $t/\sigma\leq C_{1}\sqrt{n/K}$ for some constant $C_{1}>0$.

Theorem 4.

Under Conditions (C1c) and (C1d), there exists some absolute constant $C_{1}>0$ such that, as long as $t/\sigma\leq C_{1}\sqrt{n/K}$, we have

\lim_{n\to\infty}\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}(\cos\angle(\widehat{\bm{y}},\bm{y})<0.7)>c_{1}, (B.51)

for some absolute constant $c_{1}>0$.

Proof.

To see this, note that

\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}(\cos\angle(\widehat{\bm{y}},\bm{y})<0.7)\geq\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta_{0}(t,\sigma,K,n)}P_{\theta}(\cos\angle(\widehat{\bm{y}},\bm{y})<0.7). (B.52)

Over the parameter space $\Theta_{0}(t,\sigma,K,n)$, the model reduces to

\mathbb{X}={\bf 1}\bm{y}^{\top}+\mathbb{Z}\in\mathbb{R}^{K\times n}, (B.53)

which is a rank-one matrix denoising model, where $\|{\bf 1}\bm{y}^{\top}\|=t\sqrt{K}$ and $\mathbb{Z}$ has i.i.d. rows drawn from $N(0,\sigma^{2}{\bf I}_{n})$. Estimating the direction of $\bm{y}$ is thus equivalent to estimating the leading right singular vector of the rank-one matrix ${\bf 1}\bm{y}^{\top}$. We recall the following lemma concerning the minimax lower bound for singular vector estimation in the low-rank matrix denoising model, proved by Cai et al. (2021).

Lemma 2.

Let $\bar{\bm{y}}=\bm{y}/\|\bm{y}\|_{2}$. There exist constants $c_{1},c_{2}>0$ such that

\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}\bigg(\|\bar{\bm{y}}\bar{\bm{y}}^{\top}-\widehat{\bm{y}}\widehat{\bm{y}}^{\top}\|_{F}\geq\min\bigg\{1,\frac{c_{1}\sigma\sqrt{n}}{t\sqrt{K}}\bigg\}\bigg)\geq c_{2}, (B.54)

for sufficiently large $n$.

By Lemma 1 of Cai and Zhang (2018), we have

\|\bar{\bm{y}}\bar{\bm{y}}^{\top}-\widehat{\bm{y}}\widehat{\bm{y}}^{\top}\|_{F}=\sqrt{2}\sin\angle(\bm{y},\widehat{\bm{y}})=\sqrt{1-\cos^{2}\angle(\bm{y},\widehat{\bm{y}})}\leq\sqrt{2(1-\cos\angle(\bm{y},\widehat{\bm{y}}))}. (B.55)

Then, it follows that $\|\bar{\bm{y}}\bar{\bm{y}}^{\top}-\widehat{\bm{y}}\widehat{\bm{y}}^{\top}\|_{F}\geq\sqrt{0.6}$ implies $\cos\angle(\bm{y},\widehat{\bm{y}})<0.7$. Therefore, whenever the constant $C_{1}>0$ is chosen small enough that $c_{1}/C_{1}>\sqrt{0.6}$, the assumption $t/\sigma\leq C_{1}\sqrt{n/K}$ gives $\frac{c_{1}\sigma\sqrt{n}}{t\sqrt{K}}\geq c_{1}/C_{1}>\sqrt{0.6}$, and hence

\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}(\|\bar{\bm{y}}\bar{\bm{y}}^{\top}-\widehat{\bm{y}}\widehat{\bm{y}}^{\top}\|_{F}\geq\sqrt{0.6})\geq c_{2}, (B.56)

which implies

\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}(\cos\angle(\bm{y},\widehat{\bm{y}})<0.7)\geq c_{2}. (B.57)

This completes the proof of the theorem. ∎
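The threshold $t/\sigma\asymp\sqrt{n/K}$ separating (B.50) from (B.51) can be observed empirically in the reduced model (B.53). The following sketch (our own illustration with assumed values, not the paper's code) computes the leading right singular vector of $\mathbb{X}={\bf 1}\bm{y}^{\top}+\mathbb{Z}$ for signal strengths below, at, and above the threshold.

\begin{verbatim}
import numpy as np

# Simulation sketch for the rank-one denoising model (B.53) (illustrative values only).
rng = np.random.default_rng(3)
n, K, sigma = 2000, 20, 1.0
threshold = np.sqrt(n / K)                  # the critical order of t / sigma

for ratio in (0.3, 1.0, 10.0):              # t/sigma as a multiple of sqrt(n/K)
    t = ratio * threshold * sigma
    y = rng.normal(size=n)
    y *= t / np.linalg.norm(y)              # ||y||_2 = t
    X = np.ones((K, 1)) * y[None, :] + sigma * rng.normal(size=(K, n))
    v1 = np.linalg.svd(X, full_matrices=False)[2][0]    # leading right singular vector
    print(f"t/sigma = {ratio:>4} x sqrt(n/K):  |cos| = {abs(v1 @ y) / t:.3f}")
\end{verbatim}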

B.7 Proof of Auxiliary Lemma 1

We need the following lemma, proved in Cai and Zhang (2018), to obtain the upper bound.

Lemma 3.

For any $p\geq 1$, denote by $\mathbb{B}^{p}=\{x\in\mathbb{R}^{p}:\|x\|_{2}\leq 1\}$ the $p$-dimensional unit ball in the Euclidean space. Suppose $K\in\mathbb{R}^{p_{1}\times p_{2}}$ is a random matrix. Then, for any $t>0$,

P(\|K\|\geq 3t)\leq 7^{p_{1}+p_{2}}\cdot\max_{u\in\mathbb{B}^{p_{1}},v\in\mathbb{B}^{p_{2}}}P(|u^{\top}Kv|\geq t).

By Lemma 3, applied to $\mathbb{H}_{0}\in\mathbb{R}^{n\times K}$, it follows that

P(\|\mathbb{H}_{0}\|\geq 3t)\leq 7^{n+K}\cdot\max_{\mathbb{u}\in\mathbb{B}^{n},\mathbb{v}\in\mathbb{B}^{K}}P(|\mathbb{u}^{\top}\mathbb{H}_{0}\mathbb{v}|\geq t). (B.58)

For any $\mathbb{u}\in\mathbb{B}^{n}$ and $\mathbb{v}\in\mathbb{B}^{K}$, we have

\mathbb{u}^{\top}\mathbb{H}_{0}\mathbb{v}=\text{tr}(\mathbb{H}_{0}\mathbb{v}\mathbb{u}^{\top})=\langle\mathbb{H}_{0},\mathbb{u}\mathbb{v}^{\top}\rangle, (B.59)

where $\langle\cdot,\cdot\rangle$ is the Hilbert-Schmidt inner product for matrices. Now let $\mathbb{\Sigma}\in\mathbb{R}^{nK\times nK}$ be $\mathbb{E}[\text{vec}(\mathbb{H}_{0})\text{vec}(\mathbb{H}_{0})^{\top}]$, and denote its $(i,j)$-th $n\times n$ block by $\mathbb{\Sigma}_{ij}$. By Conditions (C1a) and (C1b) and the definition of $\{\rho_{ij}\}$, we have

\mathbb{\Sigma}=\begin{bmatrix}\mathbb{\Sigma}_{11}&\mathbb{\Sigma}_{12}&\cdots&\mathbb{\Sigma}_{1K}\\ \mathbb{\Sigma}_{21}&\mathbb{\Sigma}_{22}&\cdots&\mathbb{\Sigma}_{2K}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbb{\Sigma}_{K1}&\mathbb{\Sigma}_{K2}&\cdots&\mathbb{\Sigma}_{KK}\end{bmatrix},

where $\|\mathbb{\Sigma}_{ij}\|=\sigma^{2}\rho_{ij}$ for $1\leq i,j\leq K$. In particular, we have

\|\mathbb{\Sigma}\|\leq\sigma^{2}\|\mathbb{R}\|=\sigma^{2}\rho. (B.60)

To see this, if we write $\mathbb{v}=(\mathbb{v}_{1},...,\mathbb{v}_{K})^{\top}\in\mathbb{R}^{nK}$ with each $\mathbb{v}_{i}\in\mathbb{R}^{n}$, then

\|\mathbb{\Sigma}\|=\max_{\|\mathbb{v}\|_{2}=1}\mathbb{v}^{\top}\mathbb{\Sigma}\mathbb{v}=\max_{\|\mathbb{v}\|_{2}=1}\sum_{1\leq i,j\leq K}\mathbb{v}_{i}^{\top}\mathbb{\Sigma}_{ij}\mathbb{v}_{j}\leq\max_{\|\mathbb{v}\|_{2}=1}\sum_{1\leq i,j\leq K}\|\mathbb{v}_{i}\|_{2}\|\mathbb{v}_{j}\|_{2}\|\mathbb{\Sigma}_{ij}\|\leq\sigma^{2}\max_{\|\mathbb{l}\|_{2}=1}\mathbb{l}^{\top}\mathbb{R}\mathbb{l}=\sigma^{2}\|\mathbb{R}\|.

Now by (B.60), we have

\mathbb{E}|\langle\mathbb{H}_{0},\mathbb{u}\mathbb{v}^{\top}\rangle|\leq C\sigma\sqrt{\rho}. (B.61)

By properties of sub-Gaussian vectors, it holds that

P(|\langle\mathbb{H}_{0},\mathbb{u}\mathbb{v}^{\top}\rangle|\leq C\sigma\sqrt{\rho}(1+t))\geq 1-e^{-ct^{2}}. (B.62)

Combining (B.62) and (B.58), we have

P(\|\mathbb{H}_{0}\|\geq 3C\sigma\sqrt{\rho}(1+t))\leq 7^{n+K}\max_{\mathbb{u}\in\mathbb{B}^{n},\mathbb{v}\in\mathbb{B}^{K}}P(|\mathbb{u}^{\top}\mathbb{H}_{0}\mathbb{v}|\geq C\sigma\sqrt{\rho}(1+t))\leq 7^{n+K}e^{-ct^{2}}\leq\exp\{c^{\prime}(n+K)-ct^{2}\}.

Setting $t=c_{1}(\sqrt{n}+\sqrt{K})$ for some sufficiently large $c_{1}>0$, we have

P(\|\mathbb{H}_{0}\|\geq C\sigma\sqrt{\rho}(\sqrt{n}+\sqrt{K}))\leq n^{-c}. (B.63)

This completes the proof of Lemma 1.
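As a quick empirical counterpart to Lemma 1, the sketch below (our own illustration with assumed values of $n$, $K$, $\sigma$, and with independent columns so that $\rho=\|\mathbb{R}\|=1$) compares $\|\mathbb{H}_{0}\|$ with $\sigma(\sqrt{n}+\sqrt{K})$.

\begin{verbatim}
import numpy as np

# Simulation sketch for Lemma 1 (illustrative values; independent columns, so rho = 1).
rng = np.random.default_rng(4)
n, K, sigma = 2000, 20, 0.7
H0 = sigma * rng.normal(size=(n, K))        # columns play the role of the noise vectors z_k
print("||H_0|| (operator norm)  :", round(float(np.linalg.norm(H0, ord=2)), 2))
print("sigma*(sqrt(n)+sqrt(K))  :", round(sigma * (np.sqrt(n) + np.sqrt(K)), 2))
\end{verbatim}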