
A Spectral Method for Assessing and Combining Multiple Data Visualizations

Rong Ma¹, Eric D. Sun², and James Zou²

¹Department of Statistics, Stanford University
²Department of Biomedical Data Science, Stanford University
Abstract

Dimension reduction and data visualization aim to project a high-dimensional dataset to a low-dimensional space while capturing the intrinsic structures in the data. It is an indispensable part of modern data science, and many dimension reduction and visualization algorithms have been developed. However, different algorithms have their own strengths and weaknesses, making it critically important to evaluate their relative performance for a given dataset, and to leverage and combine their individual strengths. In this paper, we propose an efficient spectral method for assessing and combining multiple visualizations of a given dataset produced by diverse algorithms. The proposed method provides a quantitative measure – the visualization eigenscore – of the relative performance of the visualizations for preserving the structure around each data point. Then it leverages the eigenscores to obtain a consensus visualization, which has much improved quality over the individual visualizations in capturing the underlying true data structure. Our approach is flexible and works as a wrapper around any visualizations. We analyze multiple simulated and real-world datasets from diverse applications to demonstrate the effectiveness of the eigenscores for evaluating visualizations and the superiority of the proposed consensus visualization. Furthermore, we establish rigorous theoretical justification of our method based on a general statistical framework, yielding fundamental principles behind the empirical success of consensus visualization along with practical guidance.

KEY WORDS: data visualization; dimension reduction; high-dimensional data; manifold learning; spectral method.

1 INTRODUCTION

Data visualization and dimension reduction constitute a central topic in statistics and data science, as they facilitate intuitive understanding and global views of high-dimensional datasets and their underlying structural patterns through a low-dimensional embedding of the data (Donoho, 2017; Chen et al., 2020). The past decades have witnessed an explosion in machine learning algorithms for data visualization and dimension reduction. Many of them, such as Laplacian eigenmap (Belkin and Niyogi, 2003), kernel principal component analysis (kPCA) (Schölkopf et al., 1997), t-SNE (van der Maaten and Hinton, 2008), and UMAP (McInnes et al., 2018), have been regarded as indispensable tools and state-of-the-art techniques for generating graphics in academic and professional writing (Chen et al., 2007), and for exploratory data analysis and pattern discovery in many research disciplines, such as astrophysics (Traven et al., 2017), computer vision (Cheng et al., 2015), genetics (Platzer, 2013), and molecular biology (Olivon et al., 2018), especially single-cell transcriptomics (Kobak and Berens, 2019), among others.

However, the wide availability and functional diversity of data visualization methods also bring forth new challenges to data analysts and practitioners (Nonato and Aupetit, 2018; Espadoto et al., 2019). On the one hand, it is critically important to determine which method among the extensive list is most suitable and reliable for embedding a given dataset. In fact, even for a single visualization method, such as t-SNE or UMAP, there are often multiple tuning parameters to be determined by the users, and different tuning parameters may lead to distinct visualizations (Kobak and Linderman, 2021; Cai and Ma, 2021). Thus, for a given dataset, selecting the most suitable visualization method along with its tuning parameters calls for a method that provides quantitative and objective assessment of different visualizations of the dataset. On the other hand, as different methods are usually based on distinct ideas and heuristics, they generate qualitatively diverse visualizations of a dataset, each containing important features about the data that are possibly unique to the visualization method. Meanwhile, due to the noisiness and high dimensionality of many real-world datasets, their low-dimensional visualizations necessarily contain distortions from the underlying true structures, which again may vary from one visualization to another. It is therefore of substantial practical interest to combine strengths and reach a consensus among multiple data visualizations, in order to obtain an even better “meta-visualization” of the data that captures the most information and is least susceptible to the distortions. Naturally, a meta-visualization would also save practitioners from painstakingly selecting a single visualization method among many.

In this paper, we propose an efficient spectral approach for simultaneously assessing and combining multiple data visualizations produced by diverse dimension reduction/visualization algorithms, allowing for different settings of tuning parameters for individual algorithms. Specifically, the proposed method takes as input a collection of visualizations, or low-dimensional embeddings of a dataset, hereafter referred to as “candidate visualizations,” and summarizes each visualization by a normalized pairwise-distance matrix among the samples. With respect to each sample in the dataset, we construct a comparison matrix from these normalized distance matrices, characterizing the local concordance between each pair of candidate visualizations. Based on eigen-decomposition of the comparison matrices, we propose a quantitative measure, referred to as the “visualization eigenscore,” that quantifies the relative performance of the candidate visualizations in a sample-wise manner, reflecting their local concordance with the underlying low-dimensional structure contained in the data. To obtain a meta-visualization, the candidate visualizations are combined into a meta-distance matrix, defined as a row-wise weighted average of those normalized distance matrices, using the corresponding eigenscores as the weights. The meta-distance matrix is then used to produce a meta-visualization, based on an existing method such as UMAP or kPCA, which is shown to be more reliable and more informative than the individual candidate visualizations. Our method is schematically summarized in Figure 1 and Algorithm 1, and detailed in Section 2.1. The meta-visualization thus obtained reflects a joint perspective aggregating various aspects of the data that are oftentimes captured separately by individual candidate visualizations.

Refer to caption
Figure 1: A graphical illustration of the proposed method. The algorithm takes as input the normalized pairwise distance matrices associated to a collection of candidate visualizations (viz1 to viz4) of a dataset. For each sample of the dataset, we compute the similarity matrix between the rows of the normalized distance matrices associated to the sample (rows highlighted in the same color), and then define the corresponding eigenscores as the first eigenvector of the similarity matrix. The size of the circles in the similarity matrices and the vectors of eigenscores indicate the magnitude of the entries (assumed to be non-negative). The meta-distance matrix is defined such that its rows are the eigenscore-weighted average of the rows in the normalized distance matrices. The meta-distance leads to a meta-visualization, expected to be more concordant with the underlying true structure than individual candidate visualizations.

Numerically, through extensive simulations and analysis of multiple real-world datasets with diverse underlying structures, we show the effectiveness of the proposed eigenscores in assessing and ranking a collection of candidate visualizations, and demonstrate the superiority of the final meta-visualization over all the candidate visualizations in terms of identification and characterization of these structural patterns. To achieve a deeper understanding of the proposed method, we also develop a formal statistical framework that rigorously justifies the proposed scoring and meta-visualization method, providing theoretical insights into the fundamental principles behind the empirical success of the method, along with its proper interpretation and practical guidance.

1.1 Related Works

Quantitative assessment of dimension reduction and data visualization algorithms is of substantial practical interest, and has been extensively studied in the past two decades. For example, many evaluation methods are based on distortion measures from metric geometry (Abraham et al., 2006, 2009; Chennuru Vankadara and von Luxburg, 2018; Bartal et al., 2019), whereas other methods rely on information-theoretic precision-recall measures (Venna et al., 2010; Arora et al., 2018), co-ranking structure (Mokbel et al., 2013), or graph-based criteria (Wang et al., 2021; Cai and Ma, 2021). See also Bertini et al. (2011), Nonato and Aupetit (2018) and Espadoto et al. (2019) for recent reviews. However, most of these existing methods evaluate data visualizations by comparing them directly with the original dataset, without accounting for its noisiness. The resulting assessment may therefore suffer from intrinsic bias, because it ignores the underlying true structures, which are only approximately represented by the noisy observations; see Section 2.3.5 and Supplement Figure 21 for more discussion. To address this issue, the proposed eigenscores, in contrast, provide provably consistent assessment and ranking of visualizations, reflecting their relative concordance with the underlying noiseless structures in the data.

Compared to the quantitative assessment of data visualizations, there is a scarcity of meta-visualization methods that combine the strengths of multiple data visualizations. In Pagliosa et al. (2015), an interactive method is developed that assesses and combines different multidimensional projection methods via a convex combination technique. In contrast, for supervised learning tasks such as classification, there is a long history of research on designing and developing meta-classifiers that combine multiple classifiers (Woods et al., 1997; Tax et al., 2000; Parisi et al., 2014; Liu et al., 2017; Mohandes et al., 2018). Compared with meta-classification, the main difficulty of meta-visualization lies in the identification of a common space to properly align multiple visualizations, or low-dimensional embeddings, whose scales and coordinate bases may drastically differ from one to another (see, for example, Figures 3-5(a)). Moreover, unlike many meta-classifiers, which combine presumably independent classifiers trained over different datasets, a meta-visualization procedure typically relies on multiple visualizations of the same dataset, and therefore has to deal with a more complicated correlation structure among the visualizations. The current study provides the first meta-visualization method that can flexibly combine any number of visualizations, and that has interpretable and provable performance guarantees.

1.2 Main Contributions

The main contributions of the current study can be summarized as follows:

  • We propose a computationally efficient spectral method for assessing and combining multiple data visualizations. The method is generic and easy to implement: it does not require knowledge of the original dataset, and can be applied to a large number of data visualizations generated by diverse methods.

  • For any collection of visualizations of a dataset, our method provides a quantitative measure – eigenscore – of the relative performance of the visualizations for preserving the structure around each data point. The eigenscores are useful in their own right for assessing the local and global reliability of a visualization in representing the underlying structures of the data, and for guiding the selection of hyper-parameters.

  • The proposed method automatically combines strengths and ameliorates weaknesses (distortions) of the candidate visualizations, leading to a meta-visualization which is provably better than all the candidate visualizations under a wide range of settings. We show that the meta-visualization is able to capture diverse intrinsic structures, such as clusters, trajectories, and mixed low-dimensional structures, contained in noisy and high-dimensional datasets.

  • We establish rigorous theoretical justifications of the method under a general signal-plus-noise model (Section 2.3) in the large-sample limit. We prove the convergence of the eigenscores to certain underlying true concordance measures, the guaranteed performance of the meta-visualization and its advantages over alternative methods, its robustness against possible adversarial candidate visualizations, along with their conditions, interpretations, and practical implications.

The proposed method is described in detail in Section 2.1, and empirically illustrated and evaluated in Section 2.2, through extensive simulation studies and analyses of three real-world datasets with diverse underlying structures. In Section 2.3, we show results from our theoretical analysis, which unveils fundamental principles associated to the method, such as the benefits of including qualitatively and functionally diverse candidate visualizations.

2 RESULTS

2.1 Eigenscore and Meta-Visualization Methodology

Throughout, without loss of generality, we assume that for visualization purpose the target embedding is two-dimensional, although our discussion applies to any finite-dimensional embedding.

Algorithm 1 Spectral assessment and combination of multiple data visualizations
Input: candidate visualizations $\{\mathbb{X}^{(k)}_{i}\}_{1\leq i\leq n}$ for $k\in\{1,2,...,K\}$.
1. Construct normalized pairwise-distance matrices: for each $k\in\{1,2,...,K\}$, calculate
$$\bar{\mathbb{P}}^{(k)}=[\mathbb{D}^{(k)}]^{-1}\mathbb{P}^{(k)}, \tag{2.1}$$
where $\mathbb{P}^{(k)}=(\|\mathbb{X}_{i}^{(k)}-\mathbb{X}_{j}^{(k)}\|_{2})_{1\leq i,j\leq n}$ and $\mathbb{D}^{(k)}=\text{diag}(\|\mathbb{P}_{1.}^{(k)}\|_{2},...,\|\mathbb{P}_{n.}^{(k)}\|_{2})$.
2. Obtain eigenscores: for each $i\in\{1,2,...,n\}$,
     (i) calculate the similarity matrix
$$\mathbb{G}_{i}=((\bar{\mathbb{P}}_{i.}^{(k_{1})})^{\top}\bar{\mathbb{P}}_{i.}^{(k_{2})})_{1\leq k_{1},k_{2}\leq K}. \tag{2.2}$$
     (ii) perform eigen-decomposition of $\mathbb{G}_{i}$ and define the eigenscores
$$\widehat{\mathbb{s}}_{i}=(\hat{s}_{i,1},\hat{s}_{i,2},...,\hat{s}_{i,K}):=|\widehat{\mathbb{u}}_{i}|, \tag{2.3}$$
where $\widehat{\mathbb{u}}_{i}$ is the eigenvector of $\mathbb{G}_{i}$ associated to its largest eigenvalue.
3. Construct meta-distance matrix: for each $i\in\{1,2,...,n\}$, calculate the eigenscore-weighted average
$$\bar{\mathbb{P}}_{i.}^{m}=\sum_{k=1}^{K}\hat{s}_{i,k}\bar{\mathbb{P}}_{i.}^{(k)}, \tag{2.4}$$
and define $\bar{\mathbb{P}}^{m}\in\mathbb{R}^{n\times n}$ whose $i$-th row is $\bar{\mathbb{P}}_{i.}^{m}$.
4. Obtain meta-visualization: apply an existing visualization method (e.g., UMAP or kPCA) to $\bar{\mathbb{P}}^{m}$ to obtain a meta-visualization.
Output: the eigenscores $\{\widehat{\mathbb{s}}_{i}\}_{1\leq i\leq n}$, and the meta-visualization.

We consider visualizing a $p$-dimensional dataset $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ containing $n$ samples. From $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$, suppose we obtain a collection of $K$ (candidate) visualizations of the data, produced by various visualization methods. We denote these visualizations as two-dimensional embeddings $\{\mathbb{X}_{i}^{(k)}\}_{1\leq i\leq n}\subset\mathbb{R}^{2}$ for $k\in\{1,2,...,K\}$. Our approach only needs access to the low-dimensional embeddings $\{\mathbb{X}_{i}^{(k)}\}_{1\leq i\leq n}$ rather than the raw data $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$; as a result, users can apply our method even when they do not have access to the original data, which is often the case.

2.1.1 Measuring Normalized Distances From Each Visualization

To make the proposed method invariant to the respective scale and coordinate basis (i.e., directionality) of the low-dimensional embeddings generated by different visualization methods, we start by considering the normalized pairwise-distance matrix for each visualization.

Specifically, for each $k\in\{1,2,...,K\}$, we define the normalized pairwise-distance matrix

$$\bar{\mathbb{P}}^{(k)}=[\mathbb{D}^{(k)}]^{-1}\mathbb{P}^{(k)}\in\mathbb{R}^{n\times n}, \tag{2.5}$$

where

$$\mathbb{P}^{(k)}=(\|\mathbb{X}^{(k)}_{i}-\mathbb{X}^{(k)}_{j}\|_{2})_{1\leq i,j\leq n}\in\mathbb{R}^{n\times n} \tag{2.6}$$

is the un-normalized Euclidean distance matrix, and $\mathbb{D}^{(k)}=\text{diag}(\|\mathbb{P}^{(k)}_{1.}\|_{2},\|\mathbb{P}^{(k)}_{2.}\|_{2},...,\|\mathbb{P}^{(k)}_{n.}\|_{2})$ is a diagonal matrix with its diagonal entries being the $\ell_{2}$-norms of the rows $\{\mathbb{P}^{(k)}_{1.},...,\mathbb{P}^{(k)}_{n.}\}$ of $\mathbb{P}^{(k)}$. As a result, the normalized distance matrix $\bar{\mathbb{P}}^{(k)}$ has unit-vector rows, and is invariant to any scaling and rotation of the visualization $\{\mathbb{X}_{i}^{(k)}\}_{1\leq i\leq n}$.

The normalized distance matrices $\{\bar{\mathbb{P}}^{(k)}\}_{1\leq k\leq K}$ summarize the candidate visualizations in a compact and efficient way. Their scale- and rotation-invariance properties are particularly useful for comparing visualizations produced by distinct methods.
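As a concrete illustration of this first step, the following minimal NumPy sketch (our own rendering of (2.5)-(2.6), not the authors' released code) computes the normalized pairwise-distance matrix of a single two-dimensional embedding:

```python
import numpy as np

def normalized_distance_matrix(X):
    """Row-normalized pairwise-distance matrix (2.5) of one candidate embedding.

    X : (n, 2) array of embedding coordinates.
    Returns an (n, n) matrix whose rows are unit vectors; the output is invariant
    to any scaling or rotation of X.
    """
    P = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # distance matrix P^(k)
    row_norms = np.linalg.norm(P, axis=1, keepdims=True)         # ||P_i.||_2 for each i
    return P / row_norms                                         # [D^(k)]^{-1} P^(k)

# toy usage: a random 2-D embedding of n = 100 samples
X = np.random.default_rng(0).normal(size=(100, 2))
P_bar = normalized_distance_matrix(X)
assert np.allclose(np.linalg.norm(P_bar, axis=1), 1.0)           # rows are unit vectors
```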

2.1.2 Sample-wise Eigenscores for Assessing Visualizations

Our spectral method for assessing multiple visualizations is based on the normalized distance matrices $\{\bar{\mathbb{P}}^{(k)}\}_{1\leq k\leq K}$. For each $i\in\{1,2,...,n\}$, we define the similarity matrix

$$\mathbb{G}_{i}=((\bar{\mathbb{P}}_{i.}^{(k_{1})})^{\top}\bar{\mathbb{P}}_{i.}^{(k_{2})})_{1\leq k_{1},k_{2}\leq K}\in\mathbb{R}^{K\times K}, \tag{2.7}$$

which summarizes the pairwise similarity between the candidate visualizations with respect to sample $i$. By construction, the entries of $\mathbb{G}_{i}$ are inner-products between unit vectors, each representing the normalized distances associated with sample $i$ in a candidate visualization. Naturally, a larger entry $(\bar{\mathbb{P}}_{i.}^{(k_{1})})^{\top}\bar{\mathbb{P}}_{i.}^{(k_{2})}$ indicates higher concordance between the two candidate visualizations. Then, for each $i\in\{1,2,...,n\}$, we define the vector of eigenscores $\widehat{\mathbb{s}}_{i}=(\hat{s}_{i,1},...,\hat{s}_{i,K})$ for the candidate visualizations with respect to sample $i$ as the absolute value of the eigenvector $\widehat{\mathbb{u}}_{i}\in\mathbb{R}^{K}$ of $\mathbb{G}_{i}$ associated to its largest eigenvalue, that is,

$$\widehat{\mathbb{s}}_{i}:=|\widehat{\mathbb{u}}_{i}|, \tag{2.8}$$

where the absolute value function $|\cdot|$ is applied entrywise. As will be explained later (Section 2.3.2), the nonnegative components of $\widehat{\mathbb{s}}_{i}$ quantify the relative performance of the $K$ candidate visualizations with respect to sample $i$, with higher eigenscores indicating better performance. Consequently, for each candidate visualization $\{\mathbb{X}_{i}^{(k)}\}_{1\leq i\leq n}$, one obtains a set of eigenscores $\{\hat{s}_{i,k}\}_{1\leq i\leq n}$ summarizing its performance relative to other candidate visualizations in a sample-wise manner. Ranking and selection among candidate visualizations can be achieved based on various summary statistics of the eigenscores, such as mean, median, or coefficient of variation, depending on the specific applications. In particular, when some candidate visualizations are produced by the same method but under different tuning parameters, the eigenscores can be used to select the most suitable tuning parameters for visualizing the dataset. However, a more substantial application of the eigenscores is to combine multiple data visualizations into a meta-visualization, which has improved signal-to-noise ratio and higher resolution of the structural information contained in the data.

Importantly, as will be shown later (Section 2.3.5), the eigenscores essentially take the underlying true signals rather than the noisy observations $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ as their referential target for performance assessment, making the method easier to implement and less susceptible to the effect of noise in the original data (Section 2.3.5 and Supplement Figure 21).
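For illustration, the following sketch (again a minimal NumPy rendering of our own, not the authors' implementation) computes the similarity matrix $\mathbb{G}_i$ in (2.7) and the eigenscores in (2.8) for one sample, given a list of normalized distance matrices such as those produced by the earlier sketch:

```python
import numpy as np

def eigenscores_for_sample(P_bars, i):
    """Eigenscores (2.8) of K candidate visualizations with respect to sample i.

    P_bars : list of K (n, n) row-normalized distance matrices.
    Returns a length-K vector of nonnegative scores.
    """
    rows = np.stack([P_bar[i] for P_bar in P_bars])   # (K, n): i-th row of each matrix
    G = rows @ rows.T                                  # K x K similarity matrix G_i (2.7)
    eigvals, eigvecs = np.linalg.eigh(G)               # symmetric eigendecomposition
    return np.abs(eigvecs[:, -1])                      # |leading eigenvector|

# a visualization's overall quality can then be summarized, e.g., by its median eigenscore:
# median_scores = np.median([eigenscores_for_sample(P_bars, i) for i in range(n)], axis=0)
```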

2.1.3 Meta-Visualization using Eigenscores

Using the above eigenscores, one can construct a meta-distance matrix properly combining the information contained in each candidate visualization. Specifically, for each $i\in\{1,2,...,n\}$, we define the vector of meta-distances with respect to sample $i$ as the eigenscore-weighted average of all the normalized distances with respect to sample $i$, that is,

$$\bar{\mathbb{P}}_{i.}^{m}=\sum_{k=1}^{K}\hat{s}_{i,k}\bar{\mathbb{P}}_{i.}^{(k)}\in\mathbb{R}^{n}. \tag{2.9}$$

Then, the meta-distance matrix is defined as $\bar{\mathbb{P}}^{m}\in\mathbb{R}^{n\times n}$ whose $i$-th row is $\bar{\mathbb{P}}_{i.}^{m}$. To obtain a meta-visualization, we take the meta-distance matrix $\bar{\mathbb{P}}^{m}$ and apply an existing visualization method that accepts $\bar{\mathbb{P}}^{m}$ (or its symmetrized version $\bar{\mathbb{P}}^{m}+(\bar{\mathbb{P}}^{m})^{\top}$) as its input.

Intuitively, for each $i=1,2,...,n$, we essentially apply a principal component (PC) analysis to the normalized distance matrix $\bar{\mathbb{P}}_{i}=\begin{bmatrix}\bar{\mathbb{P}}_{i.}^{(1)}&\bar{\mathbb{P}}_{i.}^{(2)}&...&\bar{\mathbb{P}}_{i.}^{(K)}\end{bmatrix}\in\mathbb{R}^{n\times K}$. Specifically, by definition (Jolliffe and Cadima, 2016) the leading eigenvector $\widehat{\mathbb{u}}_{i}$ of $\mathbb{G}_{i}=\bar{\mathbb{P}}_{i}^{\top}\bar{\mathbb{P}}_{i}\in\mathbb{R}^{K\times K}$ contains the first PC loadings of $\bar{\mathbb{P}}_{i}$, and the first PC is the linear combination $\bar{\mathbb{P}}_{i}\widehat{\mathbb{u}}_{i}=\sum_{k=1}^{K}\hat{u}_{i,k}\bar{\mathbb{P}}_{i.}^{(k)}$. Under the condition that the first PC loadings are all nonnegative (which is ensured with high probability under condition (C2) below), the first PC $\bar{\mathbb{P}}_{i}\widehat{\mathbb{u}}_{i}$ is exactly the meta-distance $\bar{\mathbb{P}}_{i.}^{m}$ defined in (2.9) above. When interpreted as PC loadings, the leading eigenvector $\widehat{\mathbb{u}}_{i}$ of $\mathbb{G}_{i}$ contains weights for the different vectors $\{\bar{\mathbb{P}}_{i.}^{(k)}\}_{1\leq k\leq K}$ so that the final linear combination $\bar{\mathbb{P}}_{i}\widehat{\mathbb{u}}_{i}$ has the largest variance, that is, summarizes the most information contained in $\bar{\mathbb{P}}_{i}$. It is in this sense that the meta-distance $\bar{\mathbb{P}}_{i.}^{m}$ is a consensus across $\{\bar{\mathbb{P}}_{i.}^{(k)}\}_{1\leq k\leq K}$.

For our own numerical studies (Section 2.2), we used UMAP for meta-visualizing datasets with cluster structures, and kPCA for meta-visualizing all the other datasets with smoother manifold structures, such as trajectory, cycle, or mixed structures. The choice of UMAP in the former case was due to its advantage in treating large numbers of clusters without requiring prior knowledge about the number of clusters (McInnes et al., 2018; Cai and Ma, 2021), whereas the choice of kPCA in the latter case was rooted in its advantage in capturing nonlinear smooth manifold structures (Ding and Ma, 2022). In each case, the hyper-parameters used for generating the meta-visualization were determined without further tuning – for example, when using UMAP for meta-visualization, we set the hyper-parameters the same as those associated to the UMAP candidate visualization that achieved the higher median eigenscore. Moreover, while in general UMAP/kPCA works well as a default method for meta-visualization, our proposed algorithm is robust with respect to the choice of this final visualization method. In our numerical analysis (Section 2.2), we observed empirically that other methods such as t-SNE and PHATE could also lead to meta-visualizations with comparably substantial improvement over individual candidate visualizations in terms of the concordance with the underlying true low-dimensional structure of the data (see Supplement Figure 11). In addition, the meta-visualization shows robustness to potential outliers in the data (Figures 4 and 5).
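To make the last two steps concrete, here is a minimal sketch that assembles the meta-distance matrix (2.9) and embeds its symmetrized version with UMAP; the use of the umap-learn package and its metric="precomputed" option is our own choice for illustration, and kPCA or another method accepting a distance (or kernel) matrix could be substituted, as discussed above.

```python
import numpy as np
import umap  # assumed: the umap-learn package

def spectral_meta_distance(P_bars):
    """Eigenscore-weighted meta-distance matrix (2.9).

    P_bars : list of K (n, n) row-normalized distance matrices.
    """
    n = P_bars[0].shape[0]
    P_meta = np.zeros((n, n))
    for i in range(n):
        rows = np.stack([P_bar[i] for P_bar in P_bars])    # (K, n)
        G = rows @ rows.T                                   # similarity matrix G_i
        _, eigvecs = np.linalg.eigh(G)
        s_i = np.abs(eigvecs[:, -1])                        # eigenscores for sample i
        P_meta[i] = s_i @ rows                              # weighted average of rows
    return P_meta

# toy usage with K = 4 random candidate embeddings of n = 60 samples
rng = np.random.default_rng(1)
P_bars = []
for _ in range(4):
    X = rng.normal(size=(60, 2))
    P = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    P_bars.append(P / np.linalg.norm(P, axis=1, keepdims=True))
P_meta = spectral_meta_distance(P_bars)
D_sym = P_meta + P_meta.T                                   # symmetrized meta-distance
meta_embedding = umap.UMAP(metric="precomputed").fit_transform(D_sym)
```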

In Section 2.3, under a generic signal-plus-noise model, we obtain explicit theoretical conditions under which the performance of the proposed spectral method is guaranteed. Specifically, we show the convergence of the eigenscores to a desirable concordance measure between the candidate visualizations and the underlying true pattern, characterized by their associated pairwise distance matrices of the samples. In addition, we show improved performance of the meta-visualization over the candidate visualizations in terms of their concordance with the underlying true pattern, and its robustness against possible adversarial candidate visualizations. These conditions provide proper interpretations and guidance on the application of the method, such as how to more effectively prepare the candidate visualizations (Section 2.3.6). For clarity, we summarize these technical conditions informally as follows, and relegate their precise statements to Section 2.3.

  • (C1’)

    The performances of the candidate visualizations are sufficiently diverse in terms of their individual distortions from the underlying true structures.

  • (C2’)

    The candidate visualizations altogether contain a sufficient amount of information about the underlying true structures.

Intuitively, Condition (C1’) concerns the diversity of the methods producing the candidate visualizations, whereas Condition (C2’) relates to the quality of the candidate visualizations. In practice, Condition (C2’) is satisfied when the signal-to-noise ratio in the data, as described by (2.11), is sufficiently large, so that the adopted visualization methods perform reasonably well on average. On the other hand, from Section 2.3, a sufficient condition for (C1’) is that at most $\sqrt{K}$ out of the $K$ candidate visualizations are very similarly distorted from the true patterns in terms of the normalized distances $\bar{\mathbb{P}}^{(k)}$. This would allow, for example, groups of up to 3 to 4 candidate visualizations out of 10 to 15 visualizations being produced by very similar procedures, such as the same method under different hyper-parameters.

2.2 Simulations and Visualization of Real-World Datasets

2.2.1 Simulation Studies: Visualizing Noisy Low-Dimensional Structures

To demonstrate the wide range of applicability and the empirical advantage of the proposed method, we consider the visualization of three families of noisy datasets, each containing a distinct low-dimensional structure as its underlying true signal. We assess the performance of the eigenscores and the quality of the resulting meta-distance matrix based on 16 candidate visualizations produced by multiple visualization methods.

For a given sample size $n$, we generate $p$-dimensional noisy observations $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ from the signal-plus-noise model $\mathbb{Y}_{i}=\mathbb{Y}_{i}^{*}+\mathbb{Z}_{i}$, where $\{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n}$ are the underlying noiseless samples (signals), and $\{\mathbb{Z}_{i}\}_{1\leq i\leq n}$ are the random noises. Specifically, we generate the true signals $\{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n}$ from various low-dimensional structures isometrically embedded in the $p$-dimensional Euclidean space. Each of the low-dimensional structures lies in some $r$-dimensional linear subspace, and is subject to an arbitrary rotation in $\mathbb{R}^{p}$, so that these signals are generally $p$-dimensional vectors with dense (nonzero) coordinates. Then we generate i.i.d. noise vectors $\mathbb{Z}_{i}$ from the standard multivariate normal distribution $\mathcal{N}({\bf 0},{\bf I}_{p})$, and use the $p$-dimensional noisy vector $\mathbb{Y}_{i}=\mathbb{Y}_{i}^{*}+\mathbb{Z}_{i}$ as the final observed data. In this way, we simulate noisy observations $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ of an intrinsically $r$-dimensional structure. In our simulations, for a given signal-to-noise ratio (SNR) parameter $\theta>0$, we generate $\{\mathbb{Y}^{*}_{i}\}_{1\leq i\leq n}$ uniformly from each of the following three structures:

  • (i)

    Finite point mixture with $r=5$: $\{\mathbb{Y}^{*}_{i}\}_{1\leq i\leq n}$ are independently sampled from the discrete set $\{\bm{\gamma}_{1},\bm{\gamma}_{2},...,\bm{\gamma}_{r+1}\}\subset\mathbb{R}^{p}$ with equal probability, where the $\bm{\gamma}_{i}$'s are arbitrary orthogonal vectors in $\mathbb{R}^{p}$ with the same length, i.e., $\|\bm{\gamma}_{i}\|_{2}=\theta$ for $1\leq i\leq r+1$.

  • (ii)

    “Smiley face” with $r=2$: $\{\mathbb{Y}^{*}_{i}\}_{1\leq i\leq n}$ are generated independently and uniformly from a two-dimensional “smiley face” structure (Supplement Figure 6 left) of diameter $\theta$, isometrically embedded in $\mathbb{R}^{p}$ and subject to an arbitrary rotation.

  • (iii)

    “Mammoth” manifold with $r=3$: $\{\mathbb{Y}^{*}_{i}\}_{1\leq i\leq n}$ are generated independently and uniformly from a three-dimensional “mammoth” manifold (Supplement Figure 6 right) of diameter $\theta$, isometrically embedded in $\mathbb{R}^{p}$ and subject to an arbitrary rotation.

The datasets thus generated cover diverse structures including Gaussian mixture clusters (i), mixed-type nonlinear clusters (ii), and a connected smooth manifold (iii). Specifically, the first family of datasets has $p=500$ and $n=900$, and was obtained by fixing various values of the SNR parameter $\theta$ and generating $\mathbb{Y}^{*}_{i}\in\mathbb{R}^{p}$ from setting (i) above to obtain the noisy dataset $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ as described earlier. Similarly, the second and third families of datasets were obtained by drawing $\mathbb{Y}^{*}_{i}\in\mathbb{R}^{p}$ from settings (ii) and (iii), respectively, and generating datasets $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ with $p=300$ and $n=500$, for various values of $\theta$.
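For concreteness, the following sketch (our own illustration, using the parameter values of setting (i)) generates one such simulated dataset: $r+1$ orthogonal cluster centers of length $\theta$ embedded in $\mathbb{R}^p$ through a random orthonormal basis, plus standard Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, r, theta = 900, 500, 5, 5.0

# r + 1 orthonormal directions in R^p via QR of a random Gaussian matrix,
# scaled to length theta; the random basis plays the role of an arbitrary rotation
Q, _ = np.linalg.qr(rng.normal(size=(p, r + 1)))
centers = theta * Q.T                                   # (r + 1, p) cluster centers

labels = rng.integers(0, r + 1, size=n)                 # equal-probability mixture
Y_star = centers[labels]                                # noiseless signals Y_i^*
Y = Y_star + rng.normal(size=(n, p))                    # observed data Y_i = Y_i^* + Z_i
```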

For each dataset $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$, we consider 12 existing data visualization tools including principal component analysis (PCA), multi-dimensional scaling (MDS), Kruskal’s non-metric MDS (iMDS) (Kruskal, 1978), Sammon’s mapping (Sammon) (Sammon, 1969), locally linear embedding (LLE) (Roweis and Saul, 2000), Hessian LLE (HLLE) (Donoho and Grimes, 2003), isomap (Tenenbaum et al., 2000), kPCA, Laplacian eigenmap (LEIM), UMAP, t-SNE and PHATE (Moon et al., 2019). For methods such as kPCA, t-SNE, UMAP and PHATE that require tuning parameters, we consider two different settings (Section A.2 of the Supplement) of tuning parameters for each method, denoted as “kPCA1” and “kPCA2,” etc. Therefore, for each dataset we obtain $K=16$ candidate visualizations corresponding to different combinations of visualization tools and tuning parameters. Applying our proposed method, we obtain eigenscores $\{\widehat{\mathbb{s}}_{i}\}_{1\leq i\leq n}$ for the candidate visualizations. We also compare two meta-distances based on the 16 visualizations, namely, the proposed spectral meta-distance matrix (“meta-spec”) based on the eigenscores, and the naive meta-distance matrix (“meta-aver”) assigning equal weights to all the candidate visualizations, as in (2.13).
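As a rough sketch of how such a panel of candidate embeddings could be produced with off-the-shelf Python implementations, one might proceed as below; the specific packages (scikit-learn, umap-learn) and parameter values here are our own assumptions, and the settings actually used are detailed in Section A.2 of the Supplement.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import (MDS, TSNE, Isomap, LocallyLinearEmbedding,
                              SpectralEmbedding)
import umap  # assumed: umap-learn

def candidate_visualizations(Y):
    """Return a dict of 2-D embeddings of the data matrix Y produced by several methods."""
    methods = {
        "PCA":    PCA(n_components=2),
        "MDS":    MDS(n_components=2),
        "Isomap": Isomap(n_components=2, n_neighbors=15),
        "LLE":    LocallyLinearEmbedding(n_components=2, n_neighbors=15),
        "LEIM":   SpectralEmbedding(n_components=2),          # Laplacian eigenmap
        "kPCA1":  KernelPCA(n_components=2, kernel="rbf"),
        "tSNE1":  TSNE(n_components=2, perplexity=30),
        "UMAP1":  umap.UMAP(n_components=2, n_neighbors=15),
    }
    return {name: m.fit_transform(Y) for name, m in methods.items()}

# toy usage on a small random data matrix (n = 100 samples, p = 20 features)
Y = np.random.default_rng(2).normal(size=(100, 20))
embeddings = candidate_visualizations(Y)
```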

Table 1: Empirical mean and standard error (SE) of the averaged cosines $\frac{1}{n}\sum_{i=1}^{n}\cos\angle(\widehat{\mathbb{s}}_{i},\mathbb{s}_{i})$ between the eigenscores and the true concordance measures, over each family of datasets associated with a given low-dimensional structure under various values of the SNR parameter $\theta$.

Low-Dimensional Structure | Gaussian mixture | Smiley face | Mammoth
Simulation Setting | $(n,p)=(900,500)$ | $(n,p)=(500,300)$ | $(n,p)=(500,300)$
Empirical Mean (SE) | 0.992 ($10^{-5}$) | 0.986 ($10^{-5}$) | 0.990 ($10^{-5}$)

To evaluate the proposed eigenscores, for each setting and each $i\in\{1,2,...,n\}$, we compute $\cos\angle(\widehat{\mathbb{s}}_{i},\mathbb{s}_{i}):=\frac{(\widehat{\mathbb{s}}_{i})^{\top}\mathbb{s}_{i}}{\|\widehat{\mathbb{s}}_{i}\|_{2}\|\mathbb{s}_{i}\|_{2}}$, the cosine of the angle between the eigenscores $\widehat{\mathbb{s}}_{i}$ and the true local concordance $\mathbb{s}_{i}$ defined as

$$\mathbb{s}_{i}:=((\bar{\mathbb{P}}_{i.}^{(1)})^{\top}\bar{\mathbb{P}}^{*}_{i.},(\bar{\mathbb{P}}_{i.}^{(2)})^{\top}\bar{\mathbb{P}}^{*}_{i.},...,(\bar{\mathbb{P}}_{i.}^{(K)})^{\top}\bar{\mathbb{P}}^{*}_{i.})\in\mathbb{R}^{K}, \tag{2.10}$$

where $\bar{\mathbb{P}}^{*}_{i.}$ is the $i$-th row of the normalized distance matrix $\bar{\mathbb{P}}^{*}$ for the underlying noiseless samples $\{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n}$, defined as in (2.5) with the $\mathbb{X}_{i}^{(k)}$'s replaced by the $\mathbb{Y}^{*}_{i}$'s. Table 1 shows the empirical mean and standard error (SE) of the averaged cosines $\frac{1}{n}\sum_{i=1}^{n}\cos\angle(\widehat{\mathbb{s}}_{i},\mathbb{s}_{i})$, over the family of datasets under the same low-dimensional structure and across the values of $\theta$ shown in Figure 2(a). Our simulations show that $\cos\angle(\widehat{\mathbb{s}}_{i},\mathbb{s}_{i})\approx 1$, indicating that the eigenscores $\widehat{\mathbb{s}}_{i}$ essentially characterize the true concordance between the patterns contained in each candidate visualization and those of the underlying noiseless samples, evaluated locally with respect to sample $i$. This justifies the proposed eigenscore as a precise measure of the performance of the candidate visualizations in preserving the underlying true signals.
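In outline, the evaluation behind Table 1 can be reproduced as in the following sketch (our own schematic code): given the candidate embeddings and the noiseless signals $\mathbb{Y}_i^*$, it computes the eigenscores, the true concordance vectors (2.10), and their average cosine.

```python
import numpy as np

def row_normalized_distances(X):
    """Normalized pairwise-distance matrix (2.5) of a point set X (n x d)."""
    P = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def mean_eigenscore_cosine(embeddings, Y_star):
    """Average cosine between eigenscores and the true concordance vectors (2.10).

    embeddings : list of K (n, 2) candidate embeddings; Y_star : (n, p) noiseless signals.
    """
    P_bars = [row_normalized_distances(X) for X in embeddings]   # candidate matrices
    P_star = row_normalized_distances(Y_star)                    # noiseless reference
    n = Y_star.shape[0]
    cosines = np.empty(n)
    for i in range(n):
        rows = np.stack([P_bar[i] for P_bar in P_bars])          # (K, n)
        G = rows @ rows.T
        s_hat = np.abs(np.linalg.eigh(G)[1][:, -1])              # eigenscores
        s_true = rows @ P_star[i]                                 # true concordance s_i
        cosines[i] = s_hat @ s_true / (np.linalg.norm(s_hat) * np.linalg.norm(s_true))
    return cosines.mean()
```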

Figure 2: (a) Boxplots of the mean concordance with the underlying true pattern for 15 candidate visualizations (HLLE omitted due to very low concordance) and the two meta-distance matrices under each simulation setting (Left: Gaussian mixture; Middle: smiley face; Right: mammoth) across various values of the SNR parameter $\theta$. See Supplement Figure 7 for complete plots. (b) Top: examples of candidate visualizations along with their sample-wise concordance $\{(\bar{\mathbb{P}}_{i.}^{(k)})^{\top}\bar{\mathbb{P}}^{*}_{i.}\}_{1\leq i\leq n}$ with the structure of the noiseless samples, and the proposed meta-visualization using UMAP along with the concordance $\{(\bar{\mathbb{P}}_{i.}^{m})^{\top}\bar{\mathbb{P}}^{*}_{i.}\}_{1\leq i\leq n}$ for the proposed meta-distance. Bottom: boxplots (center line, median; box limits, upper and lower quartiles; points, outliers) of concordance measures as grouped by the true clusters under the Gaussian mixture setting. See Supplement Figure 8 for more examples.

To assess the quality of the two meta-distance matrices, for each dataset we compare the mean concordance $\frac{1}{n}\sum_{i=1}^{n}(\bar{\mathbb{P}}_{i.}^{(k)})^{\top}\bar{\mathbb{P}}^{*}_{i.}$ between the normalized distance of each candidate visualization and that of the underlying noiseless samples, with the mean concordance $\frac{1}{n}\sum_{i=1}^{n}(\bar{\mathbb{P}}_{i.}^{m})^{\top}\bar{\mathbb{P}}^{*}_{i.}$ between the obtained meta-distance and that of the underlying noiseless samples. Figure 2(a) and Supplement Figure 7 show boxplots of these mean concordances for the 16 candidate visualizations and the two meta-distances under each setting of underlying structures across various values of $\theta$. We observe that for each of the three structures, our proposed meta-distance is substantially more concordant with the underlying true patterns than every candidate visualization and the naive meta-distance, indicating the superiority of the proposed meta-distance. To further demonstrate the advantage of the spectral meta-distance and its benefits to the final meta-visualization, we compared our proposed meta-visualization using UMAP with the candidate visualizations of a dataset under setting (i) with $\theta=5$, and present their sample-wise concordance $\{(\bar{\mathbb{P}}_{i.}^{(k)})^{\top}\bar{\mathbb{P}}^{*}_{i.}\}_{1\leq i\leq n}$ for each $k$, and $\{(\bar{\mathbb{P}}_{i.}^{m})^{\top}\bar{\mathbb{P}}^{*}_{i.}\}_{1\leq i\leq n}$ for the proposed meta-distance (Figure 2(b) and Supplement Figure 8). We observe that, while each individual method may capture some clusters in the dataset but miss others, the proposed meta-visualization is able to combine the strengths of all the candidate visualizations in order to capture all the underlying clusters. Finally, to demonstrate the flexibility of our method with respect to higher intrinsic dimension $r$, under setting (i) we further evaluated the performance of different methods for $r\in\{15,30,50\}$. Supplement Figure 9 shows the consistent and superior performance of the proposed method compared to the other approaches.

2.2.2 Visualizing Clusters of Religious and Biblical Texts

Cluster data are ubiquitous in scientific research and industrial applications. Our first real data example concerns $n=590$ fragments of text, extracted from English translations of eight religious books or sacred scripts: the Book of Proverbs (BOP), Book of Ecclesiastes (BOE1), Book of Ecclesiasticus (BOE2), Book of Wisdom (BOW), Four Noble Truths of Buddhism (BUD), Tao Te Ching (TTC), Yogasutras (YOG), and Upanishads (UPA) (Sah and Fokoué, 2019). All the text fragments were pre-processed using natural language processing into a $590\times 8265$ Document Term Matrix that counts the frequency of 8265 atomic words, such as truth, diligent, sense, and power, in each text fragment. In other words, each text fragment was treated as a bag of words, represented by a vector with $8265$ features. The word counts were centred and normalized before downstream analysis.

Figure 3: Visualization of 590 fragments of texts from eight religious and biblical books. (a) Three examples of candidate visualizations. The samples are marked by eight different symbols and colors according to their associated books. More examples are included in Supplement Figure 10. (b) Boxplots of eigenscores for all 16 candidate visualizations. (c) The proposed spectral meta-visualization using UMAP. (d) Median silhouette indices for the 16 candidate and 2 meta-visualizations. The error bars of the meta-visualizations indicate the variability (95% confidence interval) due to the visualization method (UMAP) applied to the meta-distance matrix (2.9).

As in our simulation studies, we again consider $K=16$ candidate visualizations generated by 12 different methods with various tuning parameters (see Section A.2 of the Supplement for implementation details). Figure 3(a) contains examples of candidate visualizations obtained by PHATE, t-SNE, and kPCA, whose median eigenscores were ranked top, middle, and bottom among all the visualizations (Figure 3(b)), respectively. More examples are included in Supplement Figure 10. In each visualization, the samples (text fragments) are colored by their associated books, showing how well the visualization captures the underlying clusters of the samples. The usefulness and validity of the eigenscores in Figure 3(b) can be verified empirically by visually comparing the clarity of the cluster patterns demonstrated by each candidate visualization in Figure 3(a) and in Supplement Figure 10. Figure 3(c) is the proposed meta-visualization of the samples obtained by applying UMAP to the meta-distance matrix (with a slight abuse of notation, we use “meta-spec” and “meta-aver” hereafter to refer to the final meta-visualizations rather than the meta-distance matrices as in Section 2.2.1), which shows substantially better clustering of the text fragments in accordance with their sources. In addition, the meta-visualization also reflects deeper relationships between the eight religious books, such as the similarity between the two Hinduism books YOG and UPA, the similarity between Buddhism (BUD) and Taoism (TTC), the similarity between the four Christian books BOE1, BOE2, BOP, and BOW, as well as the general discrepancy between Asian religions (Hinduism, Buddhism, Taoism) and non-Asian religions (Christianity). All of these important phenomena, while salient in our meta-visualization, appear only vaguely in a few candidate visualizations such as those produced by PHATE (Figure 3(a)) and UMAP (Supplement Figure 10).

To quantitatively evaluate the preservation of the underlying clustering pattern, we computed for each visualization the Silhouette indices (Rousseeuw, 1987) with respect to the underlying true cluster membership, based on the normalized pairwise-distance matrices of the embeddings defined in (2.5). The Silhouette index (see Section A.2 of the Supplement for its definition), defined for each individual sample in a visualization, measures the discrepancy between the within-class distances and the inter-class distances with respect to a given sample. As a result, for a given visualization, its Silhouette indices together indicate how well the underlying cluster pattern is preserved, with higher Silhouette indices indicating that the underlying clusters are more separated. Empirically, we observed a notable correlation ($\rho=0.679$) between the median Silhouette indices and the median eigenscores across the candidate visualizations (Supplement Figure 11). In addition, for each candidate visualization, we found that samples with higher Silhouette indices tend to have higher eigenscores (Supplement Figure 12), demonstrating the effectiveness of the eigenscores and their benefits to the final meta-visualization. In Figure 3(d), we show that, even taking into account the stochasticity of the visualization method (UMAP) applied to the meta-distance matrix, our meta-visualization had a median Silhouette index much higher than those of the candidate visualizations, as well as that of the meta-visualization “meta-aver” based on the naive meta-distance. It is of interest to note that “meta-spec” was the only visualization with a positive median Silhouette index, showing its better separation of clusters compared with the other visualizations. Importantly, the proposed meta-visualization was not sensitive to the specific visualization method applied to the meta-distance matrices – similar results were obtained when we replaced UMAP by PHATE, the method having the highest median eigenscore in Figure 3(b), or by t-SNE, for meta-visualization (Supplement Figure 11).
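For reference, a minimal sketch of how such Silhouette indices could be computed from a normalized distance matrix with scikit-learn is given below; symmetrizing the (row-normalized, hence asymmetric) distance matrix before calling silhouette_samples is our own simplification rather than a step prescribed by the paper.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def median_silhouette(P_bar, labels):
    """Median Silhouette index of a visualization w.r.t. true cluster labels.

    P_bar  : (n, n) row-normalized distance matrix of the embedding, as in (2.5).
    labels : length-n array of true cluster memberships.
    """
    D = (P_bar + P_bar.T) / 2.0        # symmetrize so it is a valid precomputed distance
    np.fill_diagonal(D, 0.0)
    s = silhouette_samples(D, labels, metric="precomputed")
    return np.median(s)
```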

2.2.3 Visualizing Cell Cycles

Our second real data example concerns visualization of a different low-dimensional structure, namely, a mixture of a cycle and clusters, contained in the gene expression profiles of a collection of mouse embryonic stem cells, as a result of the cell cycle mechanism. The cell cycle, or cell-division cycle, is the series of events that take place in a cell that cause it to divide into two daughter cells (see https://en.wikipedia.org/wiki/Cell_cycle). Identifying the cell cycle stages of individual cells analyzed during development is important for understanding its wide-ranging effects on cellular physiology and gene expression profiles. Specifically, our dataset contains $n=288$ mouse embryonic stem cells, whose underlying cell cycle stages were determined using flow cytometry sorting (Buettner et al., 2015). Among them, one-third (96) of the cells are in the G1 stage, one-third in the S stage, and the rest in the G2M stage. The raw count data were preprocessed and normalized, leading to a dataset consisting of standardized expression levels of $1147$ cell-cycle related genes for the 288 cells (see Section A.2 of the Supplement for implementation details of our data preprocessing).

Figure 4: Visualization of the cell cycle of 288 mouse embryonic stem cells. (a) Three examples of candidate visualizations. The cells are marked by three different symbols and colors according to their associated cell cycle stages. More examples are included in Supplement Figure 14. (b) Boxplots of eigenscores for all 16 candidate visualizations. (c) The proposed meta-visualization using kPCA. (d) Median Silhouette indices versus Kendall’s tau statistics for the 16 candidate and the 2 meta-visualizations (Red: proposed spectral meta-visualization; Orange: naive simple average meta-visualization). For both metrics, a higher value indicates a better visualization of the respective structure (cluster/cycle).

We obtained 16 candidate visualizations as before and applied our proposed method. Figure 4(a) contains examples of candidate visualizations obtained by t-SNE, LEIM, and kPCA, whose median eigenscores were ranked top, middle, and bottom among all the visualizations, respectively, with the cells colored according to their true cell cycle stages. Figure 4(b) contains the boxplots of eigenscores for the candidate visualizations, indicating the overall quality of each visualization. The variation of eigenscores within each candidate visualization suggests that different visualizations have their own unique features and strengths to contribute to the meta-visualization (Supplement Figure 16). Figure 4(c) is the proposed meta-visualization obtained by applying kPCA to the meta-distance matrix. Compared with Figure 4(a), the proposed meta-visualization shows better clustering of the cells according to their cell cycle stages, as well as a more salient cyclic structure underlying the three cell cycle stages (Supplement Figure 15). To quantify the performance of each visualization in terms of these two underlying structures (cluster and cycle), we considered two distinct metrics, namely, the median Silhouette index with respect to the underlying true cell cycle stages, and the Kendall’s tau statistic (Kendall, 1938) between the inferred relative order of the cells and their true order on the cycle. Specifically, to infer the relative order of the cells, we projected the coordinates of each visualization onto the two-dimensional unit circle centred at the origin (Supplement Figure 15), and then determined the relative orders based on the cells’ respective projected positions on the unit circle. Figure 4(d) shows that the proposed meta-visualization was significantly better than all the candidate visualizations and the naive meta-visualization in representing both aspects of the data.
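A sketch of this cycle-based evaluation is given below; centering the embedding before projecting onto the unit circle and the use of SciPy's kendalltau are our own choices, and in practice the inferred circular order would additionally need to be aligned with the true order (e.g., over starting points and orientations) before computing the statistic.

```python
import numpy as np
from scipy.stats import kendalltau

def circular_order_concordance(X, true_order):
    """Kendall's tau between a circular ordering inferred from a 2-D embedding
    and the true ordering of the samples along the cycle.

    X          : (n, 2) embedding coordinates.
    true_order : length-n array of true positions along the cycle.
    """
    Xc = X - X.mean(axis=0)                            # center the embedding (our assumption)
    angles = np.arctan2(Xc[:, 1], Xc[:, 0])            # angle of each point's projection
                                                       # onto the unit circle
    inferred_order = np.argsort(np.argsort(angles))    # rank of each cell by angle
    tau, _ = kendalltau(inferred_order, true_order)    # not aligned over start/orientation
    return tau
```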

2.2.4 Visualizing Trajectories of Cell Differentiation

Our third real data example concerns visualization of a mixed pattern of a trajectory and clusters underlying the gene expression profiles of a collection of cells undergoing differentiation (Hayashi et al., 2018). Specifically, 421 mouse embryonic stem cells were induced to differentiate into primitive endoderm cells. After the induction of differentiation, the cells were dissociated and individually captured at 12- or 24-hour intervals (0, 12, 24, 48, and 72 h), and each cell was sequenced to obtain the final total RNA sequencing reads using the random displacement amplification sequencing technology. As a result, at each of the five time points, there were about 70 to 90 cells captured and sequenced. The raw count data were preprocessed and normalized (see Section A.2 of the Supplement for implementation details), leading to a dataset consisting of standardized expression levels of the $500$ most variable genes for the 421 cells.

Figure 5: Visualization of 421 cells undergoing differentiation. (a) Three examples of candidate visualizations. The individual cells are marked by five different symbols and colors according to the time points they were captured and sequenced. More examples are included in Supplement Figure 17. (b) Boxplots of eigenscores for all 16 candidate visualizations. (c) The proposed meta-visualization using kPCA. (d) Median Silhouette indices versus Kendall’s tau statistics for the 16 candidate and the 2 meta-visualizations (Red: proposed spectral meta-visualization; Orange: naive simple average meta-visualization).

Again, we obtained 16 candidate visualizations as before and applied our proposed method. In Figure 5(a)-(c) we show examples of candidate visualizations, boxplots of the eigenscores, and the meta-visualization using kPCA. The global (Figure 5(b)) and local (Supplement Figure 18) variation of eigenscores demonstrates the contribution of different visualizations to the final meta-visualization according to their respective performance. We observed that some candidate visualizations such as kPCA, UMAP (Figure 5(a)) and PHATE (Supplement Figure 17) to some extent captured the underlying trajectory structure consistent with the time course of the cells. However, the meta-visualization in Figure 5(c) shows much more salient patterns in terms of both the underlying trajectory and the cluster pattern among the cells, by locally combining strengths of the individual visualizations (Supplement Figure 18). We quantified the performance of the visualizations in these two aspects using the median Silhouette index with respect to the underlying true cluster membership (i.e., batches of the time course) and the Kendall’s tau statistic between the inferred cell order and the true order along the progression path. To infer the relative order of the cells from a visualization, we ordered all the cells based on the two-dimensional embedding along the direction that explained the most variability of the cells. In Figure 5(d), we observe that the proposed meta-visualization had the largest median Silhouette index as well as the largest Kendall’s tau statistic, compared with all the candidate visualizations and the naive meta-visualization, showing the superiority of the proposed meta-visualization in both aspects.
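A corresponding sketch for the trajectory case, ordering cells by their projection onto the leading principal direction of the embedding (our reading of "the direction that explained the most variability"):

```python
import numpy as np
from scipy.stats import kendalltau

def trajectory_order_concordance(X, true_order):
    """Kendall's tau between the ordering of cells along the leading principal
    direction of a 2-D embedding and their true ordering along the trajectory."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                           # projection onto the first PC direction
    tau, _ = kendalltau(scores, true_order)
    return abs(tau)                               # sign of the direction is arbitrary
```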

2.2.5 Computational Cost

For datasets of moderate size such as the ones analyzed in the previous sections, the proposed method had a computational cost comparable to that of t-SNE or UMAP for generating a single candidate visualization (Supplement Figure 19). As for very large and high-dimensional datasets, a few features of the proposed algorithm make it readily scalable. First, although our method relies on computing the leading eigenvector of generally non-sparse matrices, these matrices (i.e., $\mathbb{G}_{i}$ in Algorithm 1) are of dimension $K\times K$, where $K$ – the number of candidate visualizations – is usually much smaller than the sample size $n$ or the dimensionality $p$ of the original data. Thus, for each sample $i$, the computational cost due to the eigendecomposition is mild. Second, given the candidate visualizations, our proposed algorithm is independent of the dimensionality ($p$) of the original dataset, as it only requires as input a set of low-dimensional embeddings produced by different visualization methods. Third, since our algorithm computes the eigenscores and the meta-distance with respect to each sample individually, the algorithm can be easily parallelized and carried out on multiple cores to further reduce the time cost.
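As an illustration of the third point, here is a sketch of the per-sample computation parallelized with joblib; the joblib package is our own choice, and any multiprocessing facility would serve.

```python
import numpy as np
from joblib import Parallel, delayed  # assumed: the joblib package

def meta_row(P_bars, i):
    """Eigenscores and meta-distance row for a single sample i."""
    rows = np.stack([P_bar[i] for P_bar in P_bars])   # (K, n)
    G = rows @ rows.T                                  # K x K similarity matrix
    s_i = np.abs(np.linalg.eigh(G)[1][:, -1])          # eigenscores for sample i
    return s_i, s_i @ rows                             # (scores, meta-distance row)

def parallel_meta_distance(P_bars, n_jobs=-1):
    """Compute all eigenscores and the meta-distance matrix across multiple cores."""
    n = P_bars[0].shape[0]
    results = Parallel(n_jobs=n_jobs)(delayed(meta_row)(P_bars, i) for i in range(n))
    scores = np.stack([s for s, _ in results])
    P_meta = np.stack([row for _, row in results])
    return scores, P_meta
```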

To demonstrate the computational efficiency of the proposed method for large and high-dimensional datasets, we evaluated the proposed method on real single-cell transcriptomic datasets (Buckley et al., 2022) of various sample sizes ($n\in\{1000,2000,4000,8000,14000\}$ cells of nine different cell types from the neurogenic regions of mice) and dimensions ($p\in\{500,1000,2000\}$ genes). For each dataset, we obtained 11 candidate visualizations and applied Algorithm 1 to generate the final meta-visualization (see Section A.2 of the Supplement for implementation details). Supplement Figure 20(b) contains boxplots of median Silhouette indices for each candidate visualization and the meta-visualization (highlighted in red) with respect to the underlying true cell types, showing the stable and superior performance of the proposed method under various sample sizes and dimensions. In Supplement Figure 20(a), we compared the running time for generating the 11 candidate visualizations with that for generating the meta-visualizations based on Algorithm 1, on a MacBook Pro with a 2.2 GHz 6-Core Intel Core i7. In general, as $n$ became large, the running time of the proposed algorithm also increased, but remained much less than that for generating the candidate visualizations. The difference in time cost became more significant as $n$ increased, demonstrating that for very large and high-dimensional datasets the computational cost essentially comes from generating the candidate visualizations, rather than from the meta-visualization step. In particular, for a dataset of sample size as large as $8000$ and of dimension $2000$, it took about 60 minutes to generate all 11 candidate visualizations, and about an additional 12 minutes to generate the meta-visualization. Moreover, Supplement Figure 20(a) also shows that, for each $n$, when $p$ increased, the running time for generating the candidate visualizations was longer, but the time cost for meta-visualization remained about the same (difference less than one minute). We also note that users often create multiple visualizations for data exploration, and our approach can simply reuse these visualizations at little additional computational cost.

2.3 Theoretical Justifications and Fundamental Principles

2.3.1 General Model for Multiple Visualizations

We develop a general and flexible theoretical framework to investigate the statistical properties of the proposed method, as well as the fundamental principles behind its empirical success. (Throughout, we adopt the following notation. For a matrix $\mathbb{A}=(a_{ij})\in\mathbb{R}^{n\times n}$, we define its spectral norm as $\|\mathbb{A}\|=\sup_{\|\mathbb{x}\|_{2}\leq 1}\|\mathbb{A}\mathbb{x}\|_{2}$. For sequences $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}=o(b_{n})$ or $b_{n}\gg a_{n}$ if $\lim_{n}a_{n}/b_{n}=0$, and write $a_{n}\asymp b_{n}$ if there exist constants $C_{1},C_{2}>0$ such that $C_{1}b_{n}\leq a_{n}\leq C_{2}b_{n}$ for all $n$.) As can be seen from Section 2.1, there are two key ingredients of our proposed method, namely, the eigenscores $\{\widehat{\mathbb{s}}_{i}\}_{1\leq i\leq n}$ for evaluating the candidate visualizations, and the meta-distance matrix $\bar{\mathbb{P}}^{m}$ that combines multiple candidate visualizations to obtain a meta-visualization.

To formally study their properties, we introduce a generic model for the collection of $K$ candidate visualizations produced by multiple visualization methods, with possibly different settings of tuning parameters for a single method, as considered in the previous sections. Specifically, we assume $\{\mathbb{Y}_{i}\}_{1\leq i\leq n}$ are generated as

$$\mathbb{Y}_{i}=\mathbb{Y}^{*}_{i}+\mathbb{Z}_{i},\qquad i=1,2,...,n, \tag{2.11}$$

where $\{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n}$ are the underlying noiseless samples and $\{\mathbb{Z}_{i}\}_{1\leq i\leq n}$ are the random noises. Recall that $\{\mathbb{P}^{(k)}\}_{1\leq k\leq K}$ are the distance matrices associated to the candidate visualizations. Then, for the candidate visualizations, we consider a scaled signal-plus-noise expression

$$\mathbb{P}^{(k)}_{i.}=c_{i,k}(\mathbb{P}^{*}_{i.}+\mathbb{h}^{(k)}_{i}),\qquad k=1,2,...,K, \tag{2.12}$$

induced by (2.11), where $c_{i,k}\geq 0$ is a global scaling parameter, $\mathbb{P}^{*}_{i.}$ is the $i$-th row of the pairwise distance matrix $\mathbb{P}^{*}=(\|\mathbb{Y}^{*}_{i}-\mathbb{Y}^{*}_{j}\|_{2})_{1\leq i,j\leq n}$ of the underlying noiseless samples, and $\mathbb{h}_{i}^{(k)}$ is a random vector characterizing the relative distortion of $\mathbb{P}_{i.}^{(k)}$, associated to the $k$-th candidate visualization, from the underlying true pattern $\mathbb{P}^{*}_{i.}$. Before characterizing the distributions of $\{\mathbb{h}_{i}^{(k)}\}_{1\leq k\leq K}$, we point out that, in principle, the relative distortions $\{\mathbb{h}_{i}^{(k)}\}_{1\leq k\leq K}$ are jointly determined by the random noises $\{\mathbb{Z}_{i}\}_{1\leq i\leq n}$ in (2.11) and by the features of, and relations between, the specific visualization methods. Importantly, in line with what is often encountered in practice, equation (2.12) allows for flexible and possibly distinct scaling and directionality for different candidate visualizations, by introducing the visualization-specific parameter $c_{i,k}$, and by focusing on the pairwise distance matrices rather than the low-dimensional embeddings $\{\mathbb{X}_{i}^{(k)}\}_{1\leq i\leq n}$ themselves.

To quantitatively describe the variability of the distortions \{\mathbb{h}_{i}^{(k)}\}_{1\leq k\leq K} across the K candidate visualizations, we assume

  • (C1a)

    \{\mathbb{h}_{i}^{(k)}\}_{1\leq k\leq K} are identically distributed sub-Gaussian vectors with parameter \sigma^{2}, that is, for any deterministic unit vector \mathbb{g}\in\mathbb{R}^{n}, we have \mathbb{E}\exp\{(\mathbb{h}^{(k)}_{i})^{\top}\mathbb{g}\}\leq\exp(\sigma^{2}/2); moreover, \|\mathbb{h}^{(k)}_{i}\|^{2}_{2}=c\sigma^{2}n(1+o(1)) with high probability (an event A_{n} holds with high probability if there exists some N>0 such that P(A_{n})\geq 1-n^{-D} for all n\geq N, for some large constant D>0) for some constant c>0.

This assumption makes (2.12) a generative model for \{\mathbb{P}_{i.}^{(k)}\}_{1\leq k\leq K} with ground truth \mathbb{P}^{*}_{i.} and random distortions, where the variance parameter \sigma describes the average level of distortion of the candidate visualizations from the truth after proper scaling. In relation to (2.11), such a condition can be satisfied when the signal structure \{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n} is bounded, the noise \{\mathbb{Z}_{i}\}_{1\leq i\leq n} is sub-Gaussian, and the dimension reduction map underlying the candidate visualization is bounded and sufficiently smooth. See Section B.2 of the Supplement for details. In addition, we also need to characterize the correlations among these random distortions, not only because the candidate visualizations are typically obtained from the same dataset \{\mathbb{Y}_{i}\}_{1\leq i\leq n}, but also because of the possible similarity between the adopted visualization methods, such as MDS and iMDS, or t-SNE under different tuning parameters. Specifically, for any j,k\in\{1,2,\ldots,K\}, we define the cross-visualization covariance \mathbb{\Sigma}_{jk}=\mathbb{E}\mathbb{h}^{(j)}_{i}(\mathbb{h}^{(k)}_{i})^{\top}, and quantify the level of dependence between a pair of candidate visualizations by \rho_{jk}=\|\mathbb{\Sigma}_{jk}\|/\sigma^{2}. By Condition (C1a), we have \rho_{jj}\leq 1 for all j. For the correlation parameters \{\rho_{jk}\}_{1\leq j,k\leq K}, we assume

  • (C1b)

    The matrix \mathbb{R}=(\rho_{jk})_{1\leq j,k\leq K} satisfies \rho:=\|\mathbb{R}\|=o(K).

Condition (C1b) covers a wide range of correlation structures among the candidate visualizations, allowing in particular for a subset of highly correlated visualizations possibly produced by very similar methods. The parameter \rho characterizes the overall correlation strength among the candidate visualizations, which is assumed to be not too large. As a comparison, note that a set of pairwise independent candidate visualizations implies \rho\approx 1, whereas a set of identical candidate visualizations has \rho\approx K. In particular, the requirement \rho=o(K) is satisfied if, for example, among the K candidate visualizations there are subsets of at most \sqrt{K} visualizations produced by very similar procedures, such as the same method under different tuning parameters, so that \rho\leq\sqrt{K}=o(K). When Condition (C1b) fails, all the candidate visualizations are essentially distorted from the truth in similar ways, so combining them will not be substantially more informative than any individual visualization.
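For concreteness, the following small R illustration (our own toy correlation structure, not taken from the paper) shows how a few groups of near-identical methods affect \rho: with K=16 visualizations split into 4 groups of 4 strongly correlated methods, \rho is close to the group size 4=\sqrt{K}, which is o(K).

    # Toy numerical check of Condition (C1b) (our own illustration):
    # 4 groups of 4 near-identical methods, cross-group correlation ~ 0.
    K <- 16
    group <- rep(1:4, each = 4)
    R <- outer(group, group, function(a, b) ifelse(a == b, 1, 0.05))
    rho <- max(eigen(R, symmetric = TRUE, only.values = TRUE)$values)
    rho  # approximately 4.6, close to the group size and well below K = 16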

2.3.2 Convergence of Eigenscores

Under Condition (C1a), it holds that \mathbb{E}\|\mathbb{h}_{i}^{(k)}\|_{2}\asymp\sigma\sqrt{n}. Hence, we can use the quantity \frac{\|\mathbb{P}^{*}_{i.}\|_{2}}{\sigma\sqrt{n}} to characterize the overall SNR in the candidate visualizations as modelled by (2.12), which reflects the average quality of the candidate visualizations in preserving the underlying true patterns around sample i. Before stating our main theorems, we first introduce our main assumption on the minimal SNR requirement, that is,

  • (C2)

    For (\sigma,\rho) defined in (C1a) and (C1b), it holds that \frac{\|\mathbb{P}^{*}_{i.}\|_{2}}{\sigma\sqrt{n}}\gg\sqrt{\rho/K} and K=o(n) as n\to\infty.

Our algorithm is expected to perform well if \sqrt{\rho/K} is small relative to the overall SNR. The condition K=o(n) is easily satisfied for a sufficiently large dataset.

Recall that \mathbb{s}_{i}=((\bar{\mathbb{P}}_{i.}^{(1)})^{\top}\bar{\mathbb{P}}^{*}_{i.},(\bar{\mathbb{P}}_{i.}^{(2)})^{\top}\bar{\mathbb{P}}^{*}_{i.},\ldots,(\bar{\mathbb{P}}_{i.}^{(K)})^{\top}\bar{\mathbb{P}}^{*}_{i.}). The following theorem concerns the convergence of the eigenscores to the true concordance \mathbb{s}_{i}, and is proved in Section B.3 of the Supplement.

Theorem 1.

Under Conditions (C1a), (C1b) and (C2), for each i\in\{1,2,\ldots,n\}, it holds that \cos\angle(\widehat{\mathbb{s}}_{i},\mathbb{s}_{i})=\frac{(\widehat{\mathbb{s}}_{i})^{\top}\mathbb{s}_{i}}{\|\widehat{\mathbb{s}}_{i}\|_{2}\|\mathbb{s}_{i}\|_{2}}\to 1 in probability as n\to\infty.

Theorem 1 implies that, as long as the candidate visualizations contain a sufficient amount of information about the underlying true structure, and are not excessively correlated, the proposed eigenscores \{\widehat{\mathbb{s}}_{i}\}_{1\leq i\leq n} are quantitatively reliable, as they converge to the actual quality measures \{\mathbb{s}_{i}\}_{1\leq i\leq n} asymptotically. In other words, the eigenscores provide a point-wise consistent estimate of the concordance between the candidate visualizations, as summarized by \{\mathbb{P}^{(k)}\}_{1\leq k\leq K}, and the underlying true patterns \mathbb{P}^{*}, justifying the empirical observations in Table 1. Importantly, Condition (C2) suggests that our proposed eigenscores may benefit from a larger number K of candidate visualizations, or a smaller overall correlation \rho, that is, a collection of functionally more diverse candidate visualizations; see further discussion in Section 2.3.6.
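To make the construction concrete, the following R sketch computes, for a single sample i, the eigenscores and the eigenscore-weighted meta-distance row from a list of candidate distance matrices. It follows the notation used in the proof of Theorem 1 (Section B.3), where the eigenscore vector is the entrywise absolute value of the leading left singular vector of the matrix of normalized distance rows; the function name and interface here are ours and need not match the released implementation.

    # Minimal sketch (not the released implementation) for one sample i.
    # P_list is a list of K n x n pairwise-distance matrices, one per
    # candidate visualization.
    eigenscore_row <- function(P_list, i) {
      # Stack the normalized i-th rows of the K candidate distance matrices (K x n).
      Xbar <- t(sapply(P_list, function(P) P[i, ] / sqrt(sum(P[i, ]^2))))
      s_hat <- abs(svd(Xbar, nu = 1, nv = 0)$u[, 1])  # eigenscores of sample i
      w <- s_hat / sqrt(sum(s_hat^2))                 # normalized weights
      meta_row <- as.vector(crossprod(Xbar, w))       # eigenscore-weighted meta-distance row
      list(eigenscore = s_hat, meta = meta_row)
    }

Applying eigenscore_row() to every i yields the full meta-distance matrix used in the final embedding step of Algorithm 1.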

2.3.3 Theoretical Guarantee for Meta-Visualization

Our second theorem concerns the guaranteed performance of our proposed meta-distance matrix and its improvement upon the individual candidate visualizations in the large-sample limit.

Theorem 2.

Under Conditions (C1a), (C1b) and (C2), for each i\in\{1,2,\ldots,n\}, it holds that \cos\angle(\bar{\mathbb{P}}^{m}_{i.},\mathbb{P}^{*}_{i.})\to 1 in probability as n\to\infty. Moreover, for any constant \delta\in(0,1), there exists a constant C>0 such that, whenever \|\mathbb{P}_{i.}^{*}\|_{2}\leq C\sigma\sqrt{n}, we have \max_{1\leq k\leq K}\cos\angle(\mathbb{P}^{(k)}_{i.},\mathbb{P}^{*}_{i.})<1-\delta in probability as n\to\infty.

Theorem 2 is proved in Section B.4 of the Supplement. In addition to the point-wise consistency of \bar{\mathbb{P}}^{m}, as described by \cos\angle(\bar{\mathbb{P}}^{m}_{i.},\mathbb{P}^{*}_{i.})\to 1 in probability, Theorem 2 also ensures that the proposed meta-distance is in general no worse than the individual candidate visualizations, suggesting a competitive performance of the meta-visualization. In particular, if in addition to Conditions (C1a), (C1b) and (C2) we also have \frac{\|\mathbb{P}_{i.}^{*}\|_{2}}{\sigma\sqrt{n}}\leq C, that is, the magnitude of the random distortions from the true structure \mathbb{P}_{i.}^{*} is relatively large, then each candidate visualization necessarily has at most mediocre performance, i.e., \max_{1\leq k\leq K}\cos\angle(\mathbb{P}^{(k)}_{i.},\mathbb{P}^{*}_{i.})<1-\delta in probability. In such cases, the proposed meta-distance is still consistent and thus strictly better than all candidate visualizations. Theorem 2 justifies the superior performance of the spectral meta-visualization demonstrated in Section 2.2, compared with the 16 candidate visualizations.

Among the three conditions required for the consistency of the proposed meta-distance matrix, Condition (C2) is the most critical, as it describes the minimal SNR requirement, that is, how much information the candidate visualizations altogether should contain about the underlying true structure of the data. In this connection, our theoretical analysis indicates that such a signal strength condition is in fact also necessary, not only for the proposed method but for any possible method. More specifically, in Section B.6 of the Supplement, we prove (Theorem 4) that it is impossible to construct a meta-distance matrix that is consistent when Condition (C2) is violated. This result shows that the setting in which our meta-visualization algorithm works well is essentially the most general one possible.

2.3.4 Robustness of Spectral Weighting against Adversarial Visualizations

In Section 2.2, in addition to the proposed meta-visualization, we also considered the meta-visualization based on the naive meta-distance matrix \bar{\mathbb{P}}^{a}, whose rows are

\bar{\mathbb{P}}_{i.}^{a}=\frac{1}{K}\sum_{k=1}^{K}\bar{\mathbb{P}}_{i.}^{(k)}\in\mathbb{R}^{n}, \qquad (2.13)

which is a simple average across all the candidate visualizations. We observed in all our real-world data analyses that such a naive meta-visualization only had mediocre performance compared to the candidate visualizations (Figures 3, 4 and 5), and was much worse than the proposed spectral meta-visualization. These empirical observations suggest the advantage of informative weighting for combining candidate visualizations.

The empirically observed suboptimality of the non-informative weighting procedure can be justified rigorously by theory. Our next theorem concerns the behavior of the proposed meta-distance matrix \bar{\mathbb{P}}^{m} and the naive meta-distance matrix \bar{\mathbb{P}}^{a} when combining a mixture of well-conditioned candidate visualizations, as characterized by Conditions (C1a), (C1b) and (C2), and some adversarial candidate visualizations whose pairwise-distance matrices do not contain any information about the true structure. Specifically, we suppose that among all the K candidate visualizations, there is a collection \mathcal{C}_{0} of (1-\eta)K well-conditioned candidate visualizations for some small \eta\in(0,1), and a collection \mathcal{C}_{1} of \eta K adversarial candidate visualizations.

Theorem 3.

For any i\in\{1,2,\ldots,n\}, suppose that among all the K candidate visualizations there is a collection \mathcal{C}_{0} of (1-\eta)K candidate visualizations, for some small \eta\in(0,1), satisfying Conditions (C1a), (C1b) and (C2), and a collection \mathcal{C}_{1} of \eta K adversarial candidate visualizations such that (\mathbb{P}_{i.}^{(k)})^{\top}\mathbb{P}_{i.}^{*}=0 for all k\in\mathcal{C}_{1}. Then, for the proposed meta-distance \bar{\mathbb{P}}^{m}, we still have \cos\angle(\bar{\mathbb{P}}^{m}_{i.},\mathbb{P}^{*}_{i.})\to 1 in probability as n\to\infty. However, for the naive meta-distance \bar{\mathbb{P}}^{a}, even if \|\mathbb{P}_{i.}^{*}\|_{2}\gg\sigma\sqrt{n}, we have \cos\angle(\bar{\mathbb{P}}^{a}_{i.},\mathbb{P}^{*}_{i.})<1-\eta in probability as n\to\infty.

Theorem 3 is proved in Section B.5 of the Supplement. By Theorem 3, on the one hand, even when a small proportion of very poor (adversarial) candidate visualizations are combined with other relatively good visualizations, the proposed method still performs well, thanks to the consistent eigenscore weighting established in Theorem 1. On the other hand, no matter how strong the SNR is for the well-conditioned candidate visualizations, the method based on non-informative weighting is strictly sub-optimal. Indeed, when \|\mathbb{P}_{i.}^{*}\|_{2}\gg\sigma\sqrt{n}\asymp\mathbb{E}\|\mathbb{h}_{i}\|_{2}, although we have \cos\angle(\mathbb{P}_{i.}^{(k)},\mathbb{P}^{*}_{i.})\to 1 in probability for all k\in\mathcal{C}_{0}, the non-informative weighting suffers from the non-negligible negative effects of the adversarial visualizations in \mathcal{C}_{1}, causing a strict deviation from \mathbb{P}^{*}; see, for example, Figures 3-5(b)(d) for empirical evidence from real-world data.
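As a small sanity check of this phenomenon, the following toy R simulation (our own construction; the adversarial rows are only meant to illustrate the linear algebra and are not valid distance vectors) contrasts the naive average with the eigenscore-weighted combination when a few candidate rows are essentially orthogonal to the truth.

    # Toy contrast (ours) between the naive average (2.13) and the
    # eigenscore-weighted combination when 4 of 12 candidate rows are adversarial.
    set.seed(1)
    n <- 500; K <- 12; n_adv <- 4
    y <- abs(rnorm(n))                                  # true distance row P*_i.
    adv <- rnorm(n); adv <- adv - sum(adv * y) / sum(y^2) * y   # direction orthogonal to y
    good <- replicate(K - n_adv, y + 0.5 * rnorm(n))    # well-conditioned candidates
    bad  <- replicate(n_adv, adv + 0.05 * rnorm(n))     # adversarial candidates
    Xbar <- apply(cbind(good, bad), 2, function(x) x / sqrt(sum(x^2)))  # unit columns
    cosang <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
    s_hat <- abs(svd(t(Xbar), nu = 1, nv = 0)$u[, 1])   # eigenscores (notation of Section B.3)
    c(naive    = cosang(rowMeans(Xbar), y),             # non-informative weighting
      spectral = cosang(Xbar %*% (s_hat / sum(s_hat)), y))  # eigenscore weighting

In our runs of this toy example, the eigenscore weights concentrate on the well-conditioned rows, so the spectral combination stays close to the truth while the naive average is visibly pulled toward the adversarial direction.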

2.3.5 Limitations of Original Noisy High-Dimensional Data

The proposed eigenscores provide an efficient and consistent way of evaluating the performance of the candidate visualizations. As mentioned in Section 1, a number of metrics have been proposed to quantify the distortion of a visualization by comparing the low-dimensional embedding directly with the original high-dimensional data. Such metrics essentially treat the original high-dimensional data as the ground truth, and do not take into account the noisiness of the high-dimensional data. However, many datasets arising from real-world applications, as modelled by (2.11), are themselves very noisy, and may not be an ideal reference point for evaluating a visualization that has probably already significantly denoised the data through dimension reduction. For example, the three real-world datasets considered in Section 2.2 are all high-dimensional and contain many more features than samples. In each case, there are underlying clusters among the samples, but the original datasets showed significantly weaker cluster structure than most of the 16 candidate visualizations (Supplement Figure 21), suggesting that directly comparing a visualization with the noisy high-dimensional data may be misleading. In this respect, our theorems indicate that the proposed spectral method is able to precisely assess and effectively combine multiple visualizations to better capture the underlying noiseless structure \mathbb{P}^{*}, without referring to the original noisy datasets, making it more robust, flexible, and computationally efficient.

2.3.6 Benefits of Including More Functionally Diverse Visualizations

Our theoretical analysis implies that the proposed meta-visualization may benefit from a larger number (larger K) of functionally diverse (smaller \rho) candidate visualizations. To empirically verify this theoretical observation, we focused on the religious and biblical text data and the mouse embryonic stem cell data considered in Section 2.2, and obtained spectral meta-visualizations based on a smaller but relatively diverse collection of 5 candidate visualizations, produced by arguably the most popular methods, namely, t-SNE, PHATE, UMAP, PCA and MDS. Compared with the 16 candidate visualizations considered in Section 2.2, here we have a presumably similar \rho but much smaller K. As a result, for the religious and biblical text data, the meta-visualization had a median Silhouette index of 0.187 (Supplement Figure 13), which was smaller than the median Silhouette index of 0.275 based on the 16 candidate visualizations in Figure 3(d); for the cell cycle data, the meta-visualization had a median Silhouette index of -0.062 and a Kendall's tau statistic of 0.313, both smaller than the respective values based on the 16 candidate visualizations in Figure 4(d). On the other hand, we also evaluated the effect of increasing \rho while keeping K fixed. Specifically, when we obtained 16 candidate visualizations all produced by PHATE with varying nearest-neighbor parameters, the final spectral meta-visualization had a median Silhouette index of 0.094, which was even lower than that of the above meta-visualization based on five distinct methods, although still slightly better than the 16 PHATE-based candidate visualizations (Supplement Figure 13). This empirical evidence is in line with our theoretical predictions, suggesting benefits of including more numerous and more diverse visualizations.

3 DISCUSSION

We developed a spectral method in the current study to assess and combine multiple data visualizations. The proposed meta-visualization combines candidate visualizations through a weighted arithmetic average of their normalized distance matrices, with weights given by their corresponding eigenscores. Although the proposed method was shown, both in theory and numerically, to outperform the individual candidate visualizations and their naive combination, it is still unclear whether there exist other forms of combination that lead to even better meta-visualizations. For example, one could consider constructing a meta-distance matrix using a geometric or harmonic (weighted) average, or an average based on barycentric coordinates (Floater, 2015). We plan to investigate how to optimally combine multiple visualizations in subsequent work.

Although originally developed for data visualization, the proposed method can be useful for other supervised and unsupervised machine learning tasks, such as combining multiple algorithms for clustering, classification, or prediction. For example, for a given dataset, if one has a collection of predicted cluster memberships produced by multiple clustering algorithms, one could construct cluster co-membership matrices whose (i,j)-th entry is 1 if samples i and j are assigned to the same cluster and 0 otherwise. One may then define the similarity matrix as in (2.7), obtain the eigenscores for the candidate clusterings, and obtain a meta-clustering using (2.9). It is of interest to study its empirical performance and whether the fundamental principles unveiled in the current work continue to hold for such a broader range of learning tasks.
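As a rough illustration of this idea, the following R sketch is our own approximation; the exact forms of (2.7) and (2.9) are given in the main text and may differ in detail from what is written here.

    # Rough sketch (ours) of the meta-clustering idea: combine K candidate
    # clusterings via co-membership matrices and eigenscore weights, then
    # re-cluster the combined affinity matrix.
    meta_cluster <- function(labels_list, n_clusters) {
      n <- length(labels_list[[1]])
      # Co-membership matrices: entry (i, j) is 1 if samples i and j share a cluster.
      C_list <- lapply(labels_list, function(lab) outer(lab, lab, "==") * 1)
      M <- matrix(0, n, n)
      for (i in seq_len(n)) {
        Xbar <- t(sapply(C_list, function(C) C[i, ] / sqrt(sum(C[i, ]^2))))
        s_hat <- abs(svd(Xbar, nu = 1, nv = 0)$u[, 1])            # eigenscores for sample i
        M[i, ] <- as.vector(crossprod(Xbar, s_hat / sum(s_hat)))  # combined affinity row
      }
      # Simple choice: hierarchical clustering on the symmetrized dissimilarity.
      cutree(hclust(as.dist(1 - (M + t(M)) / 2)), k = n_clusters)
    }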

Data Availability

The religious and biblical text data (Sah and Fokoué, 2019) were downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/A+study+of++Asian+Religious+and+Biblical+Texts). The cell cycle analysis was based on the mouse embryonic stem cell data (Buettner et al., 2015) downloaded from EMBL-EBI with accession code E-MTAB-2805. The cell trajectory analysis was based on the mouse embryonic stem cell single-cell data (Hayashi et al., 2018) downloaded from the Gene Expression Omnibus with accession code GSE98664. The single-cell transcriptomic dataset (Buckley et al., 2022) used for evaluating computational cost is accessible at BioProject PRJNA795276.

Code Availability

The R code implementing the method, and for reproducing our simulations and data analyses, is available on our GitHub page: https://github.com/rongstat/meta-visualization.

References

  • Abraham et al. (2006) Abraham, I., Y. Bartal, and O. Neiman (2006). Advances in metric embedding theory. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pp.  271–286.
  • Abraham et al. (2009) Abraham, I., Y. Bartal, and O. Neiman (2009). On low dimensional local embeddings. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp.  875–884. SIAM.
  • Arora et al. (2018) Arora, S., W. Hu, and P. K. Kothari (2018). An analysis of the t-SNE algorithm for data visualization. In Conference on Learning Theory, pp.  1455–1462. PMLR.
  • Bartal et al. (2019) Bartal, Y., N. Fandina, and O. Neiman (2019). Dimensionality reduction: theoretical perspective on practical measures. Advances in Neural Information Processing Systems 32.
  • Belkin and Niyogi (2003) Belkin, M. and P. Niyogi (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396.
  • Bertini et al. (2011) Bertini, E., A. Tatu, and D. Keim (2011). Quality metrics in high-dimensional data visualization: An overview and systematization. IEEE Transactions on Visualization and Computer Graphics 17(12), 2203–2212.
  • Bhatia (2013) Bhatia, R. (2013). Matrix Analysis, Volume 169. Springer Science & Business Media.
  • Buckley et al. (2022) Buckley, M. T., E. Sun, B. M. George, L. Liu, N. Schaum, L. Xu, J. M. Reyes, M. A. Goodell, I. L. Weissman, T. Wyss-Coray, et al. (2022). Cell type-specific aging clocks to quantify aging and rejuvenation in regenerative regions of the brain. bioRxiv.
  • Buettner et al. (2015) Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle (2015). Computational analysis of cell-to-cell heterogeneity in single-cell rna-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology 33(2), 155–160.
  • Cai et al. (2021) Cai, T. T., H. Li, and R. Ma (2021). Optimal structured principal subspace estimation: Metric entropy and minimax rates. J. Mach. Learn. Res. 22, 46–1.
  • Cai and Ma (2021) Cai, T. T. and R. Ma (2021). Theoretical foundations of t-sne for visualizing high-dimensional clustered data. arXiv preprint arXiv:2105.07536.
  • Cai and Zhang (2018) Cai, T. T. and A. Zhang (2018). Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics 46(1), 60–89.
  • Chen et al. (2007) Chen, C.-h., W. K. Härdle, and A. Unwin (2007). Handbook of Data Visualization. Springer Science & Business Media.
  • Chen et al. (2020) Chen, M., H. Hauser, P. Rheingans, and G. Scheuermann (2020). Foundations of Data Visualization. Springer.
  • Cheng et al. (2015) Cheng, J., H. Liu, F. Wang, H. Li, and C. Zhu (2015). Silhouette analysis for human action recognition based on supervised temporal t-SNE and incremental learning. IEEE Transactions on Image Processing 24(10), 3203–3217.
  • Chennuru Vankadara and von Luxburg (2018) Chennuru Vankadara, L. and U. von Luxburg (2018). Measures of distortion for machine learning. Advances in Neural Information Processing Systems 31.
  • Ding and Ma (2022) Ding, X. and R. Ma (2022). Learning low-dimensional nonlinear structures from high-dimensional noisy data: An integral operator approach. arXiv preprint arXiv:2203.00126.
  • Donoho (2017) Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics 26(4), 745–766.
  • Donoho and Grimes (2003) Donoho, D. L. and C. Grimes (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences 100(10), 5591–5596.
  • Espadoto et al. (2019) Espadoto, M., R. M. Martins, A. Kerren, N. S. Hirata, and A. C. Telea (2019). Toward a quantitative survey of dimension reduction techniques. IEEE Transactions on Visualization and Computer Graphics 27(3), 2153–2173.
  • Floater (2015) Floater, M. S. (2015). Generalized barycentric coordinates and applications. Acta Numerica 24, 161–214.
  • Hayashi et al. (2018) Hayashi, T., H. Ozaki, Y. Sasagawa, M. Umeda, H. Danno, and I. Nikaido (2018). Single-cell full-length total rna sequencing uncovers dynamics of recursive splicing and enhancer rnas. Nature Communications 9(1), 1–16.
  • Jolliffe and Cadima (2016) Jolliffe, I. T. and J. Cadima (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374(2065), 20150202.
  • Kendall (1938) Kendall, M. G. (1938). A new measure of rank correlation. Biometrika 30(1/2), 81–93.
  • Kobak and Berens (2019) Kobak, D. and P. Berens (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications 10(1), 1–14.
  • Kobak and Linderman (2021) Kobak, D. and G. C. Linderman (2021). Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology 39(2), 156–157.
  • Kruskal (1978) Kruskal, J. B. (1978). Multidimensional Scaling. Number 11. Sage.
  • Liu et al. (2017) Liu, Z.-G., Q. Pan, J. Dezert, and A. Martin (2017). Combination of classifiers with optimal weight based on evidential reasoning. IEEE Transactions on Fuzzy Systems 26(3), 1217–1230.
  • McInnes et al. (2018) McInnes, L., J. Healy, and J. Melville (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
  • Mohandes et al. (2018) Mohandes, M., M. Deriche, and S. O. Aliyu (2018). Classifiers combination techniques: A comprehensive review. IEEE Access 6, 19626–19639.
  • Mokbel et al. (2013) Mokbel, B., W. Lueks, A. Gisbrecht, and B. Hammer (2013). Visualizing the quality of dimensionality reduction. Neurocomputing 112, 109–123.
  • Moon et al. (2019) Moon, K. R., D. van Dijk, Z. Wang, S. Gigante, D. B. Burkhardt, W. S. Chen, K. Yim, A. v. d. Elzen, M. J. Hirn, R. R. Coifman, et al. (2019). Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology 37(12), 1482–1492.
  • Nonato and Aupetit (2018) Nonato, L. G. and M. Aupetit (2018). Multidimensional projection for visual analytics: Linking techniques with distortions, tasks, and layout enrichment. IEEE Transactions on Visualization and Computer Graphics 25(8), 2650–2673.
  • Olivon et al. (2018) Olivon, F., N. Elie, G. Grelier, F. Roussi, M. Litaudon, and D. Touboul (2018). Metgem software for the generation of molecular networks based on the t-SNE algorithm. Analytical Chemistry 90(23), 13900–13908.
  • Pagliosa et al. (2015) Pagliosa, P., F. V. Paulovich, R. Minghim, H. Levkowitz, and L. G. Nonato (2015). Projection inspector: Assessment and synthesis of multidimensional projections. Neurocomputing 150, 599–610.
  • Parisi et al. (2014) Parisi, F., F. Strino, B. Nadler, and Y. Kluger (2014). Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences 111(4), 1253–1258.
  • Platzer (2013) Platzer, A. (2013). Visualization of snps with t-SNE. PloS One 8(2), e56883.
  • Rousseeuw (1987) Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65.
  • Roweis and Saul (2000) Roweis, S. T. and L. K. Saul (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326.
  • Sah and Fokoué (2019) Sah, P. and E. Fokoué (2019). What do asian religions have in common? an unsupervised text analytics exploration. arXiv preprint arXiv:1912.10847.
  • Sammon (1969) Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers 100(5), 401–409.
  • Schölkopf et al. (1997) Schölkopf, B., A. Smola, and K.-R. Müller (1997). Kernel principal component analysis. In International Conference on Artificial Neural Networks, pp. 583–588. Springer.
  • Tax et al. (2000) Tax, D. M., M. Van Breukelen, R. P. Duin, and J. Kittler (2000). Combining multiple classifiers by averaging or by multiplying? Pattern Recognition 33(9), 1475–1485.
  • Tenenbaum et al. (2000) Tenenbaum, J. B., V. d. Silva, and J. C. Langford (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323.
  • Traven et al. (2017) Traven, G., G. Matijevič, T. Zwitter, M. Žerjal, J. Kos, M. Asplund, J. Bland-Hawthorn, A. R. Casey, G. De Silva, K. Freeman, et al. (2017). The galah survey: classification and diagnostics with t-SNE reduction of spectral information. The Astrophysical Journal Supplement Series 228(2), 24.
  • van der Maaten and Hinton (2008) van der Maaten, L. and G. Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605.
  • Venna et al. (2010) Venna, J., J. Peltonen, K. Nybo, H. Aidos, and S. Kaski (2010). Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research 11(2).
  • Vershynin (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, Volume 47. Cambridge University Press.
  • Wang et al. (2021) Wang, Y., H. Huang, C. Rudin, and Y. Shaposhnik (2021). Understanding how dimension reduction tools work: An empirical approach to deciphering t-sne, umap, trimap, and pacmap for data visualization. Journal of Machine Learning Research 22, 1–73.
  • Woods et al. (1997) Woods, K., W. P. Kegelmeyer, and K. Bowyer (1997). Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4), 405–410.
  • Yu et al. (2015) Yu, Y., T. Wang, and R. J. Samworth (2015). A useful variant of the davis–kahan theorem for statisticians. Biometrika 102(2), 315–323.

Acknowledgement

The authors would like to thank the editor and two anonymous reviewers for their suggestions and comments, which have resulted in a significant improvement of the manuscript. R.M. would like to thank David Donoho and Rui Duan for helpful discussions. J.Z. is supported by NSF CAREER 1942926, NIH P30AG059307, 5RM1HG010023 and grants from the Chan-Zuckerberg Initiative and the Emerson Collective. R.M. is supported by David Donoho at Stanford University.

Supplement to “A Spectral Method for Assessing and Combining Multiple Data Visualizations” Rong Ma1, Eric D. Sun2 and James Zou2

Department of Statistics, Stanford University1
Department of Biomedical Data Science, Stanford University2

Appendix A Supplementary Results and Figures from Real Data Analysis

A.1 Silhouette Index

Consider a partition \{1,2,\ldots,n\}=C_{1}\cup C_{2}\cup\ldots\cup C_{K} of the n samples into K non-overlapping subsets, with each cluster containing at least 2 samples, representing the true cluster membership of the n samples. Let d(i,j) be the distance between samples i and j in a certain vector space. For each sample i\in C_{k} for some k\in\{1,2,\ldots,K\}, we define

a(i)=\frac{1}{|C_{k}|-1}\sum_{j\in C_{k}\setminus\{i\}}d(i,j), \qquad (A.1)

as the mean distance between sample i and all other samples in cluster C_{k}, and define

b(i)=\min_{k^{\prime}\neq k}\frac{1}{|C_{k^{\prime}}|}\sum_{j\in C_{k^{\prime}}}d(i,j), \qquad (A.2)

as the smallest mean distance from sample i to the samples in any other cluster, of which sample i is not a member. Then the Silhouette index of sample i is defined as

SI(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}. \qquad (A.3)

By definition, we have SI(i)\in(-1,1), and a higher SI(i) indicates better concordance between the distances \{d(i,j)\}_{j\neq i} and the underlying true cluster membership.
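For completeness, a direct R transcription of (A.1)-(A.3) for a single sample is given below; the silhouette() function in the R package cluster provides an equivalent, vectorized implementation.

    # Direct implementation of (A.1)-(A.3) for a single sample i, given a
    # pairwise-distance matrix D and a vector of cluster labels.
    silhouette_index <- function(i, D, labels) {
      same <- setdiff(which(labels == labels[i]), i)
      a_i <- mean(D[i, same])                                    # (A.1)
      b_i <- min(sapply(setdiff(unique(labels), labels[i]),
                        function(k) mean(D[i, labels == k])))    # (A.2)
      (b_i - a_i) / max(a_i, b_i)                                # (A.3)
    }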

A.2 Implementation Details and Supplementary Figures

Simulations.

For each simulation setting, we let the diameter of the underlying structure vary within a certain range so that the final results are comparable across different structures. The boxplots in Figure 2(a), Figure 7 and Figure 9 summarize the simulation results across 20 equispaced diameter values for each underlying structure.

The 16 candidate visualizations were obtained by the following R functions (a consolidated call sketch is given after the list):

  • PCA: the fast SVD function svds from R package rARPACK with embedding dimension k=2.

  • MDS: the basic R function cmdscale with embedding dimension k=2.

  • Sammon: the R function sammon from R package MASS with embedding dimension k=2.

  • LLE: the R function lle from R package lle with parameters m=2, k=20, reg=2.

  • HLLE: the R function embed from R package dimRed with parameters method="HLLE", knn=20, ndim=2.

  • Isomap: the R function embed from R package dimRed with parameters method="Isomap", knn=20, ndim=2.

  • kPCA1&2: the R function embed from R package dimRed with parameters method="kPCA", kpar=list(sigma=width), ndim=2, where we set width=0.01 for kPCA1 and width=0.001 for kPCA2.

  • LEIM: the R function embed from R package dimRed with parameters ndim=2 and method = "LaplacianEigenmaps".

  • UMAP1&2: the R function umap from R package uwot with parameters n_neighbors=n, n_components=2, where we set n=30 for UMAP1 and n=50 for UMAP2.

  • tSNE1&2: the R function embed from R package dimRed with parameters method="tSNE", perplexity=n, ndim=2, where we set n=10 for tSNE1 and n=50 for tSNE2.

  • PHATE1&2: the R function phate from R package phateR with parameters knn=n, ndim=2, where we set n=30 for PHATE1 and n=50 for PHATE2.
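Below is a consolidated call sketch for a few of these candidate visualizations (`dat` denotes the n x p centred and normalized data matrix; argument names follow the current dimRed, uwot and phateR interfaces and may vary slightly across package versions).

    # Illustrative calls for a few candidate visualizations (a sketch).
    library(dimRed); library(uwot); library(phateR)
    vis_isomap <- embed(dat, .method = "Isomap", knn = 20, ndim = 2)   # Isomap
    vis_umap1  <- umap(dat, n_neighbors = 30, n_components = 2)        # UMAP1
    vis_phate1 <- phate(dat, knn = 30, ndim = 2)                       # PHATE1
    # Two-dimensional coordinates: vis_isomap@data@data, vis_umap1, vis_phate1$embedding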

Figure 6: Low-dimensional structures used for simulations. Left: data points uniformly sampled from the two-dimensional “smiley face.” Right: data points uniformly sampled from the three-dimensional “mammoth” manifold.
Figure 7: Boxplots for the mean concordance of the 16 candidate visualizations and the 2 meta-visualizations under each simulation setting (left: finite mixture of points; middle: smiley face; right: mammoth) across 20 equispaced values of \theta from some intervals. The proposed spectral meta-distance matrix had superior performance compared to the others.
Figure 8: Examples of candidate visualizations of the simulated data generated from a 6-class Gaussian mixture model (i.e., setting (i)), along with their pointwise concordance, and boxplots grouped by clusters. The plots indicate the strengths and weaknesses of different methods in capturing the underlying clusters.
Figure 9: Boxplots for the mean concordance of the 16 candidate visualizations and the 2 meta-visualizations under the Gaussian mixture model (i.e., setting (i)) with various intrinsic dimensions (r) across different values of \theta\in[5,10]. The plots demonstrate the flexibility of the proposed method with respect to the intrinsic dimension r.
Religious and Biblical Texts Data.

Each visualization method was applied to the Document Term Matrix, with 8265 centred and normalized features and 590 samples (text fragments). The 16 candidate visualizations were obtained by the R functions listed above, with the same tuning parameters as in the simulation studies. For the 16 PHATE-based candidate visualizations in Figure 13, we considered 16 equally spaced values of the nearest-neighbor parameter knn, ranging from 2 to 150.

Figure 10: More examples of visualization of 590 fragments of religious text. The cluster patterns are far from clear based on individual methods.
Figure 11: Visualization of 590 fragments of religious text. Left: median Silhouette indices for 16 candidate and 2 meta-visualizations and the original data (ori), where the two meta-visualizations are based on t-SNE. Middle: median Silhouette indices for 16 candidate and 2 meta-visualizations and the original data, where the two meta-visualizations are based on PHATE. Right: scatter plot of averaged eigenscores and averaged Silhouette indices for the 16 candidate visualizations. The left and middle panels indicate that the improvement in performance was not sensitive to the specific visualization method. The right panel shows the potential interpretation and the effectiveness of the eigenscores.
Figure 12: Comparison of eigenscores (top) and Silhouette indices (bottom) of four candidate visualizations, grouped by clusters. Notable correlation between the two quantities indicates the effectiveness of the spectral weighting approach.
Figure 13: Visualization of 590 fragments of religious text. Left: boxplots of Silhouette indices for 5 candidate and 2 meta-visualizations and the original data (ori), where the 5 candidate visualizations are based on t-SNE, PHATE, UMAP, PCA and MDS, respectively. Right: boxplots of Silhouette indices for 16 candidate and 2 meta-visualizations and the original data (ori), where the 16 candidate visualizations are based on PHATE with varying nearest neighbor parameters. Compared with Figure 3(d), they showed that the proposed method may benefit from additional, and more diverse candidate visualizations.
Cell Cycle Data.

The raw count data were preprocessed, normalized, and scaled by following the standard procedure (R functions CreateSeuratObject, NormalizeData and ScaleData under default settings) as implemented in the R package Seurat (https://cran.r-project.org/web/packages/Seurat/index.html). We also applied the R function FindVariableFeatures in Seurat to identify the 2000 most variable genes for subsequent analysis. The final p=1147 cell-cycle-related genes were selected based on two-sample t-tests. The 16 candidate visualizations were generated in the same way as in the previous example, with the same set of tuning parameters.

Figure 14: More examples of visualization of 288 mouse embryonic stem cells in cell cycles.
Figure 15: Examples of projection of candidate visualization onto the unit circle centred at the origin, from which the Kendall’s tau statistics were computed. The colors correspond to the true cell cycle stages, suggesting that the spectral meta-visualization recovered the cell cycle better than other methods.
Figure 16: Illustrations of eigenscores for four candidate visualizations of the cell cycle data, showing that different visualizations may contribute to the meta-visualization from distinct aspects. Lighter color indicates better concordance to the underlying structures as assessed by the eigenscores.
Cell Differentiation Data.

The raw count data were preprocessed, normalized, and scaled using the Seurat package, following the same procedure as described previously. The 16 candidate visualizations were generated in the same way as in the previous examples, with the same set of tuning parameters.

Computational Cost.

We considered the single-cell transcriptomic dataset of Buckley et al. (2022), which contains more than 20,000 cells of different cell types from the neurogenic regions of 28 mice. For each n\in\{1000,2000,4000,8000,14000\}, we randomly selected n cells of nine different cell types, and selected subsets of p\in\{500,1000,2000\} genes to obtain an n\times p count matrix. After normalizing the count matrix, we applied various visualization methods (PCA, HLLE, kPCA, LEIM, UMAP, t-SNE and PHATE) that are in general scalable to large datasets (i.e., costing less than one minute for visualizing 1000 samples of dimension 300) to generate 11 candidate visualizations (with two different parameter settings for kPCA, t-SNE, UMAP and PHATE). We then ran our proposed algorithm to obtain the final meta-visualization.
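A sketch of this subsampling and normalization step is given below; the object names (counts, cell_types, selected_types) are placeholders of ours, not from the paper's code, and one (n, p) configuration is shown.

    # Sketch of the subsampling used for the timing experiments (placeholder names).
    set.seed(1)
    n <- 4000; p <- 1000
    cells <- sample(which(cell_types %in% selected_types), n)   # cells from the 9 cell types
    genes <- sample(nrow(counts), p)
    X <- t(as.matrix(counts[genes, cells]))                     # n x p count matrix
    X <- log1p(sweep(X, 1, pmax(rowSums(X), 1), "/") * 1e4)     # library-size normalization, log scale
    # X is then passed to each of the 11 candidate visualization methods.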

Figure 17: More examples of visualization of the 421 cells under differentiation.
Figure 18: Illustrations of eigenscores for four candidate visualizations of the cell differentiation data, showing that different visualizations may contribute to the meta-visualization from distinct aspects. Lighter color indicates better concordance to the underlying structures as assessed by the eigenscores.
Figure 19: Computational time (log of running time in seconds) for each candidate visualization and the proposed method, including all four steps of Algorithm 1, for the three real-world datasets considered in Section 2.2. Left: religious and biblical text data; Middle: cell cycle data; Right: cell differentiation data. The proposed method had a computational time comparable to that of tSNE or UMAP for generating a single candidate visualization.
Figure 20: (a) Computational time (in mins) for generating 11 candidate visualizations (“generate candidate visualizations”) for single-cell transcriptomic datasets of various sample sizes and dimensions, and that for generating the meta-visualizations (“generate meta-visualization”) based on Algorithm 1. (b) Boxplots of the median Silhouette indices for 11 candidate visualizations and the meta-visualization (highlighted in red) with respect to the underlying true cell types. The running time of the proposed algorithm increased with nn, but remained much less than that for generating the candidate visualizations. For each nn, when pp increased, the running time for generating the candidate visualizations was longer, but the time cost for meta-visualization remained the same.
Figure 21: Boxplots of Silhouette index for 16 candidate and 2 meta-visualizations, and the original data for the three datasets analyzed in the main paper. The original datasets showed significantly weaker cluster structure compared to most of the 16 candidate visualizations, suggesting that directly comparing a visualization with the noisy high-dimensional data may be misleading.

Appendix B Proof of Main Theorems

B.1 Notations

For a vector \mathbb{a}=(a_{1},\ldots,a_{n})^{\top}\in\mathbb{R}^{n}, we denote by \text{diag}(a_{1},\ldots,a_{n})\in\mathbb{R}^{n\times n} the diagonal matrix whose i-th diagonal entry is a_{i}, and define the \ell_{p} norm \|\mathbb{a}\|_{p}=\big(\sum_{i=1}^{n}|a_{i}|^{p}\big)^{1/p} and the \ell_{\infty} norm \|\mathbb{a}\|_{\infty}=\max_{1\leq i\leq n}|a_{i}|. For a matrix \mathbb{A}=(a_{ij})\in\mathbb{R}^{n\times n}, we define its Frobenius norm as \|\mathbb{A}\|_{F}=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}a^{2}_{ij}}, and its spectral norm as \|\mathbb{A}\|=\sup_{\|\mathbb{x}\|_{2}\leq 1}\|\mathbb{A}\mathbb{x}\|_{2}; we also denote by \mathbb{A}_{.i}\in\mathbb{R}^{n} its i-th column and by \mathbb{A}_{i.}\in\mathbb{R}^{n} its i-th row. For sequences \{a_{n}\} and \{b_{n}\}, we write a_{n}=o(b_{n}) or b_{n}\gg a_{n} if \lim_{n}a_{n}/b_{n}=0, and write a_{n}=O(b_{n}), a_{n}\lesssim b_{n} or b_{n}\gtrsim a_{n} if there exists a constant C such that a_{n}\leq Cb_{n} for all n. We write a_{n}\asymp b_{n} if a_{n}\lesssim b_{n} and a_{n}\gtrsim b_{n}.

B.2 Sufficient Condition for (C1a)

We provide a sufficient condition, in light of the signal-plus-noise model (2.11) of the main text, that implies the sub-Gaussian condition (C1a) for \mathbb{h}_{i}^{(k)} in (2.12), together with some intuition. In general, the distortion vector \mathbb{h}_{i}^{(k)} is jointly determined by the noise \{\mathbb{Z}_{i}\}_{1\leq i\leq n}, the noiseless samples \{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n} and the dimension reduction map f_{k}:\mathbb{R}^{p}\to\mathbb{R}^{2} associated with the k-th visualization method. Accordingly, our sufficient condition for (C1a) essentially involves regularity of the signal structure \{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n} and of the dimension reduction (DR) maps f_{k}:\mathbb{R}^{p}\to\mathbb{R}^{2} for 1\leq k\leq K, and sub-Gaussianity of the noise vectors \{\mathbb{Z}_{i}\}_{1\leq i\leq n}. We first state our sufficient condition precisely.

  • (C01)

    (Regularity of the signal and DR map) The noiseless samples \{\mathbb{Y}_{i}^{*}\}_{1\leq i\leq n} lie on a bounded manifold \mathcal{M} embedded in \mathbb{R}^{p}, and the DR map f_{k} and its first-order derivative are bounded in the sense that for all \mathbb{Y}\in\mathbb{R}^{p},

    L^{-1}\leq\|f_{k}(\mathbb{Y})\|_{2}\leq L,\qquad \Big\|\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\Big\|\leq C_{f}, \qquad (B.1)

    almost surely for some constants L>1 and C_{f}>0.

  • (C02)

    (Sub-Gaussian noise) The noise vectors \{\mathbb{Z}_{i}\}_{1\leq i\leq n} are sub-Gaussian random vectors.

Intuitively, (C01) requires that the underlying signal structure is bounded and that the DR map is also bounded and sufficiently smooth. In particular, we allow f_{k} itself to be random, as in the case of randomized algorithms such as t-SNE and UMAP. The sub-Gaussian condition (C02) on the noise vectors \mathbb{Z}_{i} is mild and allows for a wide range of noise structures.

Below we show that Conditions (C01) and (C02) jointly imply the sub-Gaussianity of \mathbb{h}_{i}^{(k)}. Firstly, note that by definition \mathbb{P}_{i.}^{(k)}=(\|f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{1})\|_{2},\|f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{2})\|_{2},\ldots,\|f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{n})\|_{2})^{\top}. Then for each i, we can define the pairwise distance vector for the noiseless samples associated with the k-th visualization method as

\mathbb{P}_{i.}^{*(k)}=(\|f_{k}(\mathbb{Y}^{*}_{i})-f_{k}(\mathbb{Y}^{*}_{1})\|_{2},\|f_{k}(\mathbb{Y}^{*}_{i})-f_{k}(\mathbb{Y}^{*}_{2})\|_{2},\ldots,\|f_{k}(\mathbb{Y}^{*}_{i})-f_{k}(\mathbb{Y}^{*}_{n})\|_{2})^{\top}. \qquad (B.2)

Recall that \mathbb{P}_{i.}^{*}=(\|\mathbb{Y}^{*}_{i}-\mathbb{Y}^{*}_{1}\|_{2},\|\mathbb{Y}^{*}_{i}-\mathbb{Y}^{*}_{2}\|_{2},\ldots,\|\mathbb{Y}^{*}_{i}-\mathbb{Y}^{*}_{n}\|_{2})^{\top}. Then, by (2.12), it follows that

\mathbb{h}_{i}^{(k)} =c_{i,k}^{-1}\mathbb{P}_{i.}^{(k)}-\mathbb{P}_{i.}^{*}
=\Big(c_{i,k}^{-1}\mathbb{P}_{ij}^{*(k)}+\frac{c_{i,k}^{-1}\mathbb{g}_{k,ij}^{\top}[f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{j})-f_{k}(\mathbb{Y}^{*}_{i})+f_{k}(\mathbb{Y}^{*}_{j})]}{\|\mathbb{g}_{k,ij}\|_{2}}-\mathbb{P}_{ij}^{*}\Big)_{1\leq j\leq n} \qquad (B.3)
=\Big(c_{i,k}^{-1}\mathbb{P}_{ij}^{*(k)}-\mathbb{P}_{ij}^{*}+\frac{c_{i,k}^{-1}\mathbb{g}_{k,ij}^{\top}}{\|\mathbb{g}_{k,ij}\|_{2}}\Big[\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\Big|_{\mathbb{Y}=\mathbb{s}_{i}}\mathbb{Z}_{i}-\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\Big|_{\mathbb{Y}=\mathbb{s}_{j}}\mathbb{Z}_{j}\Big]\Big)_{1\leq j\leq n}, \qquad (B.4)

where in (B.3) we used a Taylor expansion of \mathbb{P}_{ij}^{(k)}=\|f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{j})\|_{2} at f_{k}(\mathbb{Y}^{*}_{i})-f_{k}(\mathbb{Y}^{*}_{j}), with \mathbb{g}_{k,ij} being some point between f_{k}(\mathbb{Y}_{i})-f_{k}(\mathbb{Y}_{j}) and f_{k}(\mathbb{Y}^{*}_{i})-f_{k}(\mathbb{Y}^{*}_{j}), and in (B.4) we used a Taylor expansion of f_{k}(\mathbb{Y}_{i}) at \mathbb{Y}^{*}_{i}, with \mathbb{s}_{i} being some point between \mathbb{Y}_{i} and \mathbb{Y}_{i}^{*}.

For each i, since c_{i,k} is a parameter that accounts for the possible scaling difference caused by the dimension reduction map f_{k}, without loss of generality we can take c_{i,k}=\frac{\|f_{k}(\mathbb{Y}^{*}_{i})\|_{2}}{\|\mathbb{Y}_{i}^{*}\|_{2}}. Under Condition (C01), it follows that c_{i,k}^{-1}\mathbb{P}_{ij}^{*(k)}-\mathbb{P}_{ij}^{*} in (B.4) is bounded and therefore a sub-Gaussian random variable. Similarly, for the random variable

\frac{c_{i,k}^{-1}\mathbb{g}_{k,ij}^{\top}}{\|\mathbb{g}_{k,ij}\|_{2}}\Big[\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\Big|_{\mathbb{Y}=\mathbb{s}_{i}}\mathbb{Z}_{i}-\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\Big|_{\mathbb{Y}=\mathbb{s}_{j}}\mathbb{Z}_{j}\Big] \qquad (B.5)

in (B.4), under Condition (C01) we also have the boundedness of \big\|\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\big|_{\mathbb{Y}=\mathbb{s}_{i}}\big\| and \big\|\frac{\partial f_{k}(\mathbb{Y})}{\partial\mathbb{Y}}\big|_{\mathbb{Y}=\mathbb{s}_{j}}\big\|. By Proposition 2.5.2 of Vershynin (2018), these, along with the boundedness of c_{i,k}^{-1} and \frac{\mathbb{g}_{k,ij}}{\|\mathbb{g}_{k,ij}\|_{2}} and the sub-Gaussianity of \mathbb{Z}_{i} and \mathbb{Z}_{j} from (C02), imply that (B.5) is also a sub-Gaussian random variable. Thus, we have verified the sub-Gaussianity of \mathbb{h}_{i}^{(k)} under Conditions (C01) and (C02). In particular, the sub-Gaussian parameter \sigma^{2} in (C1a) is jointly determined by the underlying manifold \mathcal{M}, the scale (L) and the smoothness (C_{f}) of the DR map.

B.3 Eigenscore Consistency: Proof of Theorem 1

For simplicity, we omit the dependence on i\in\{1,2,\ldots,n\} and hereafter denote \mathbb{x}_{k}=(x_{k1},\ldots,x_{kn})=\mathbb{P}^{(k)}_{i.}, \bm{y}=(y_{1},\ldots,y_{n})=\mathbb{P}_{i.}^{*}, \mathbb{z}_{k}=(z_{k1},\ldots,z_{kn})=\mathbb{h}_{i}^{(k)}, \widehat{\mathbb{s}}=\widehat{\mathbb{s}}_{i}, \mathbb{s}=\mathbb{s}_{i} and c_{k}=c_{i,k}. Therefore, model (2.12) in the main paper can be rewritten as

\mathbb{x}_{k}=c_{k}(\bm{y}+\mathbb{z}_{k}),\qquad k=1,2,\ldots,K. \qquad (B.6)

We denote

𝕩¯k=𝕩k𝕩k2=𝒚+𝕫k𝒚+𝕫k2,\bar{\mathbb{x}}_{k}=\frac{\mathbb{x}_{k}}{\|\mathbb{x}_{k}\|_{2}}=\frac{\bm{y}+\mathbb{z}_{k}}{\|\bm{y}+\mathbb{z}_{k}\|_{2}}, (B.7)

and define the matrix of normalized vectors

𝕏¯=[𝕩¯1𝕩¯2𝕩¯K]K×n,\bar{\mathbb{X}}=\begin{bmatrix}\bar{\mathbb{x}}_{1}^{\top}\\ \bar{\mathbb{x}}_{2}^{\top}\\ \vdots\\ \bar{\mathbb{x}}_{K}^{\top}\\ \end{bmatrix}\in\mathbb{R}^{K\times n},

Then

𝕏¯𝕏¯=(𝕏¯j𝕏¯k)1j,kn,\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=(\bar{\mathbb{X}}_{j}^{\top}\bar{\mathbb{X}}_{k})_{1\leq j,k\leq n},

where 𝕏¯jK\bar{\mathbb{X}}_{j}\in\mathbb{R}^{K} is the jj-th column of 𝕏¯\bar{\mathbb{X}}, so that

𝕏¯j=(yj+z1j𝒚+𝕫12,yj+z2j𝒚+𝕫22,,yj+zKj𝒚+𝕫K2).\bar{\mathbb{X}}_{j}=\bigg{(}\frac{y_{j}+z_{1j}}{\|\bm{y}+\mathbb{z}_{1}\|_{2}},\frac{y_{j}+z_{2j}}{\|\bm{y}+\mathbb{z}_{2}\|_{2}},...,\frac{y_{j}+z_{Kj}}{\|\bm{y}+\mathbb{z}_{K}\|_{2}}\bigg{)}^{\top}.

For any 1j,kn1\leq j,k\leq n, it follows that

𝕏¯j𝕏¯k=i=1K(yj+zij)(yk+zik)𝒚+𝕫i22\displaystyle\bar{\mathbb{X}}_{j}^{\top}\bar{\mathbb{X}}_{k}=\sum_{i=1}^{K}\frac{(y_{j}+z_{ij})(y_{k}+z_{ik})}{\|\bm{y}+\mathbb{z}_{i}\|_{2}^{2}}
=(i=1K𝒚+𝕫i22)yjyk+(i=1Kzik𝒚+𝕫i22)yj+(i=1Kzij𝒚+𝕫i22)yk+i=1Kzijzik𝒚+𝕫i22.\displaystyle=\bigg{(}\sum_{i=1}^{K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg{)}y_{j}y_{k}+\bigg{(}\sum_{i=1}^{K}z_{ik}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg{)}y_{j}+\bigg{(}\sum_{i=1}^{K}z_{ij}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg{)}y_{k}+\sum_{i=1}^{K}z_{ij}z_{ik}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}.

This implies that

𝕏¯𝕏¯=𝕘22𝒚𝒚+𝒚𝕘+𝕘𝒚+,\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=\|\mathbb{g}\|_{2}^{2}\cdot\bm{y}\bm{y}^{\top}+\bm{y}\mathbb{g}^{\top}\mathbb{H}^{\top}+\mathbb{H}\mathbb{g}\bm{y}^{\top}+\mathbb{H}\mathbb{H}^{\top}, (B.8)

where

𝕘=(𝒚+𝕫121,𝒚+𝕫221,,𝒚+𝕫K21),=[𝕫1𝒚+𝕫12𝕫2𝒚+𝕫22𝕫K𝒚+𝕫K2]n×K\mathbb{g}=(\|\bm{y}+\mathbb{z}_{1}\|_{2}^{-1},\|\bm{y}+\mathbb{z}_{2}\|_{2}^{-1},...,\|\bm{y}+\mathbb{z}_{K}\|_{2}^{-1}),\qquad\mathbb{H}=\begin{bmatrix}\frac{\mathbb{z}_{1}}{\|\bm{y}+\mathbb{z}_{1}\|_{2}}&\frac{\mathbb{z}_{2}}{\|\bm{y}+\mathbb{z}_{2}\|_{2}}&...&\frac{\mathbb{z}_{K}}{\|\bm{y}+\mathbb{z}_{K}\|_{2}}\end{bmatrix}\in\mathbb{R}^{n\times K}

We denote

𝕏¯𝕏¯=𝕘22𝒚𝒚+𝔼,𝔼=𝒚𝕘+𝕘𝒚+,\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=\|\mathbb{g}\|_{2}^{2}\cdot\bm{y}\bm{y}^{\top}+\mathbb{E},\qquad\mathbb{E}=\bm{y}\mathbb{g}^{\top}\mathbb{H}^{\top}+\mathbb{H}\mathbb{g}\bm{y}^{\top}+\mathbb{H}\mathbb{H}^{\top}, (B.9)

Let the singular value decomposition of 𝕏¯\bar{\mathbb{X}} be

𝕏¯=i=1rσ^i𝕦^i𝕧^i,\bar{\mathbb{X}}=\sum_{i=1}^{r}\widehat{\sigma}_{i}\hat{\mathbb{u}}_{i}\hat{\mathbb{v}}_{i}^{\top},

where r=min{n,K}r=\min\{n,K\} and σ^1σ^2σ^r\widehat{\sigma}_{1}\geq\widehat{\sigma}_{2}\geq...\geq\widehat{\sigma}_{r}. Then we have

𝕏¯𝕏¯=i=1rσ^i2𝕧^i𝕧^i,𝕏¯𝕏¯=i=1rσ^i2𝕦^i𝕦^i,\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=\sum_{i=1}^{r}\widehat{\sigma}_{i}^{2}\hat{\mathbb{v}}_{i}\hat{\mathbb{v}}_{i}^{\top},\qquad\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}=\sum_{i=1}^{r}\widehat{\sigma}_{i}^{2}\hat{\mathbb{u}}_{i}\hat{\mathbb{u}}_{i}^{\top}, (B.10)

and

𝕤𝒚2=(𝕩1𝒚𝕩12,,𝕩K𝒚𝕩K2)=𝕏¯𝒚=i=1rσ^i𝕦^i𝕧^i𝒚.\mathbb{s}\cdot\|\bm{y}\|_{2}=\bigg{(}\frac{\mathbb{x}_{1}^{\top}\bm{y}}{\|\mathbb{x}_{1}\|_{2}},...,\frac{\mathbb{x}_{K}^{\top}\bm{y}}{\|\mathbb{x}_{K}\|_{2}}\bigg{)}^{\top}=\bar{\mathbb{X}}\bm{y}=\sum_{i=1}^{r}\widehat{\sigma}_{i}\hat{\mathbb{u}}_{i}\hat{\mathbb{v}}_{i}^{\top}\bm{y}. (B.11)

Therefore, we have

1|cos(𝕦^1,𝕤)|=1|cos(𝕦^1,𝕏¯𝒚)|=1σ^1|𝕧^1𝒚|𝕏¯𝒚2\displaystyle 1-|\cos\angle(\hat{\mathbb{u}}_{1},\mathbb{s})|=1-|\cos\angle(\hat{\mathbb{u}}_{1},\bar{\mathbb{X}}\bm{y})|=1-\frac{\widehat{\sigma}_{1}|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}
=𝕏¯𝒚2σ^1|𝕧^1𝒚|𝕏¯𝒚2i=2rσ^i𝕦^i𝕧^i𝒚2𝕏¯𝒚2σ^2𝕍^1𝒚σ^1|𝕧^1𝒚|σ^2𝕍^1𝒚,\displaystyle=\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}-\widehat{\sigma}_{1}|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}\leq\frac{\|\sum_{i=2}^{r}\widehat{\sigma}_{i}\hat{\mathbb{u}}_{i}\hat{\mathbb{v}}_{i}^{\top}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}\leq\frac{\widehat{\sigma}_{2}\|\hat{\mathbb{V}}_{-1}^{\top}\bm{y}\|}{\widehat{\sigma}_{1}|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|-\widehat{\sigma}_{2}\|\hat{\mathbb{V}}_{-1}^{\top}\bm{y}\|},

where \widehat{\mathbb{V}}_{-1}=\begin{bmatrix}\hat{\mathbb{v}}_{2}&\ldots&\hat{\mathbb{v}}_{r}\end{bmatrix}\in\mathbb{R}^{n\times(r-1)}. Now we derive estimates of the random quantities \widehat{\sigma}_{1}, \widehat{\sigma}_{2}, |\hat{\mathbb{v}}_{1}^{\top}\bm{y}| and \|\widehat{\mathbb{V}}_{-1}^{\top}\bm{y}\| to obtain an upper bound on the last term in the above inequality. Consider the decomposition (B.9) and the first equation of (B.10). Firstly, by Weyl's inequality (e.g., Corollary III.2.6 of Bhatia 2013), we have

|σ^12𝕘22𝒚22|𝔼,\big{|}\widehat{\sigma}_{1}^{2}-\|\mathbb{g}\|_{2}^{2}\cdot\|\bm{y}\|_{2}^{2}\big{|}\leq\|\mathbb{E}\|, (B.12)

and

|σ^22|𝔼.|\widehat{\sigma}_{2}^{2}|\leq\|\mathbb{E}\|. (B.13)

Secondly, by the Davis-Kahan perturbation theorem (e.g., Theorem 1 of Yu et al., 2015), we have

𝕍^1𝒚/𝒚=1(𝕧^1𝒚)2/𝒚22𝔼𝕘22𝒚22.\|\widehat{\mathbb{V}}_{-1}^{\top}\bm{y}\|/\|\bm{y}\|=\sqrt{1-(\hat{\mathbb{v}}_{1}^{\top}\bm{y})^{2}/\|\bm{y}\|_{2}^{2}}\leq\frac{\|\mathbb{E}\|}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}}. (B.14)

Thus, if we denote R=𝔼/(𝕘22𝒚22)R=\|\mathbb{E}\|/(\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}), by (B.12) and (B.13), if R<1R<1, we have

σ^2σ^1𝔼1/2𝕘22𝒚22𝔼=R1/21R,\frac{\widehat{\sigma}_{2}}{\widehat{\sigma}_{1}}\leq\frac{\|\mathbb{E}\|^{1/2}}{\sqrt{\|\mathbb{g}\|_{2}^{2}\cdot\|\bm{y}\|_{2}^{2}-\|\mathbb{E}\|}}=\frac{R^{1/2}}{\sqrt{1-R}},

and by (B.14), we have

𝕍^1𝒚|𝕧^1𝒚|R1R2.\frac{\|\widehat{\mathbb{V}}^{\top}_{-1}\bm{y}\|}{|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|}\leq\frac{R}{\sqrt{1-R^{2}}}.

Therefore, if R<1R<1, we have

1|cos(𝕦^1,𝕤)|σ^2𝕍^1𝒚σ^1|𝕧^1𝒚|1σ^2𝕍^1𝒚σ^1|𝕧^1𝒚|R1/21RR1R21R1/21RR1R2=R3/2(1R)(1R2)R3/2.1-|\cos\angle(\hat{\mathbb{u}}_{1},\mathbb{s})|\leq\frac{\frac{\widehat{\sigma}_{2}\|\widehat{\mathbb{V}}_{-1}^{\top}\bm{y}\|}{\widehat{\sigma}_{1}|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|}}{1-\frac{\widehat{\sigma}_{2}\|\widehat{\mathbb{V}}_{-1}^{\top}\bm{y}\|}{\widehat{\sigma}_{1}|\hat{\mathbb{v}}_{1}^{\top}\bm{y}|}}\leq\frac{\frac{R^{1/2}}{\sqrt{1-R}}\frac{R}{\sqrt{1-R^{2}}}}{1-\frac{R^{1/2}}{\sqrt{1-R}}\frac{R}{\sqrt{1-R^{2}}}}=\frac{R^{3/2}}{\sqrt{(1-R)(1-R^{2})}-R^{3/2}}. (B.15)

If R<0.5R<0.5, we further have

1|cos(𝕦^1,𝕤)|4R3/2.1-|\cos\angle(\hat{\mathbb{u}}_{1},\mathbb{s})|\leq 4R^{3/2}. (B.16)

The rest of the proof is devoted to a stochastic upper bound of R. By the elementary inequality \|{\bf A}\mathbb{B}\|\leq\|{\bf A}\|\cdot\|\mathbb{B}\| and the triangle inequality, we have

𝔼2𝒚2𝕘2+2.\|\mathbb{E}\|\leq 2\|\bm{y}\|_{2}\cdot\|\mathbb{g}\|_{2}\cdot\|\mathbb{H}\|+\|\mathbb{H}\|^{2}. (B.17)

Then it follows that

R=𝔼𝕘22𝒚222𝒚2𝕘2+2𝕘22𝒚22=2𝕘2𝒚2+2𝕘22𝒚22,R=\frac{\|\mathbb{E}\|}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}}\leq\frac{2\|\bm{y}\|_{2}\|\mathbb{g}\|_{2}\|\mathbb{H}\|+\|\mathbb{H}\|^{2}}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}}=\frac{2\|\mathbb{H}\|}{\|\mathbb{g}\|_{2}\|\bm{y}\|_{2}}+\frac{\|\mathbb{H}\|^{2}}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}}, (B.18)

Hence, it suffices to obtain an upper bound for \|\mathbb{H}\| and a lower bound for \|\mathbb{g}\|_{2}.

Upper bound of \|\mathbb{H}\|.

Since

=[𝕫1𝕫2𝕫K]diag(𝒚+𝕫121,𝒚+𝕫221,,𝒚+𝕫K21)0diag(𝕘).\mathbb{H}=\begin{bmatrix}\mathbb{z}_{1}&\mathbb{z}_{2}&...&\mathbb{z}_{K}\end{bmatrix}\cdot\text{diag}(\|\bm{y}+\mathbb{z}_{1}\|_{2}^{-1},\|\bm{y}+\mathbb{z}_{2}\|_{2}^{-1},...,\|\bm{y}+\mathbb{z}_{K}\|_{2}^{-1})\equiv\mathbb{H}_{0}\cdot\text{diag}(\mathbb{g}). (B.19)

we have

0𝕘.\|\mathbb{H}\|\leq\|\mathbb{H}_{0}\|\cdot\|\mathbb{g}\|_{\infty}. (B.20)

The following lemma gives the upper bound of 0\|\mathbb{H}_{0}\|, which is proved in Section B.7 below.

Lemma 1.

Under the conditions of Theorem 1, it holds that \|\mathbb{H}_{0}\|\leq C\sigma\sqrt{\rho}(\sqrt{n}+\sqrt{K}) with probability at least 1-n^{-c} for some universal constants C,c>0.

Now for

𝕘=1min1iK𝒚+𝕫i2,\|\mathbb{g}\|_{\infty}=\frac{1}{\min_{1\leq i\leq K}\|\bm{y}+\mathbb{z}_{i}\|_{2}},

note that

𝒚+𝕫i22=𝒚22+𝕫i22+2𝒚𝕫i,\|\bm{y}+\mathbb{z}_{i}\|_{2}^{2}=\|\bm{y}\|_{2}^{2}+\|\mathbb{z}_{i}\|_{2}^{2}+2\bm{y}^{\top}\mathbb{z}_{i},

By Condition (C1a), we have

|𝒚𝕫i|cσ𝒚2lognin probability,|\bm{y}^{\top}\mathbb{z}_{i}|\leq c^{\prime}\sigma\|\bm{y}\|_{2}\sqrt{\log n}\qquad\text{in probability,}

for any small constant $c^{\prime}>0$, and that

\min_{1\leq i\leq K}\|\mathbb{z}_{i}\|_{2}^{2}=cn\sigma^{2}(1+o(1))\quad\text{in probability.}

Since $2\sigma\|\bm{y}\|_{2}\sqrt{\log n}\leq\sigma^{2}\log^{2}n+\|\bm{y}\|_{2}^{2}/\log n$, we have

\min_{1\leq i\leq K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{2}\geq\|\bm{y}\|_{2}^{2}+cn\sigma^{2}(1+o(1))-c^{\prime}\sigma^{2}\log^{2}n-c^{\prime}\|\bm{y}\|_{2}^{2}/\log n\geq(1-\delta_{n})\|\bm{y}\|_{2}^{2}+(1-\delta_{n})cn\sigma^{2},

in probability, for some $\delta_{n}\to 0$.

In other words, we have

\min_{1\leq i\leq K}\|\bm{y}+\mathbb{z}_{i}\|_{2}\geq(1-\delta_{n})\|\bm{y}\|_{2}+(1-\delta_{n})c^{1/2}\sigma\sqrt{n}, (B.21)

in probability, for some non-negative sequence $\delta_{n}\to 0$. This implies

\|\mathbb{H}\|\leq C\frac{\sigma\sqrt{\rho}(\sqrt{n}+\sqrt{K})}{\|\bm{y}\|_{2}+\sigma\sqrt{n}}, (B.22)

in probability.

Lower bound of $\|\mathbb{g}\|_{2}$.

By a similar argument to the one leading to (B.21), we have

\|\mathbb{g}\|_{2}^{2}=\sum_{i=1}^{K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\geq K\min_{1\leq i\leq K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}=\frac{K}{\max_{1\leq i\leq K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{2}}\geq\frac{K}{(1+\delta^{\prime}_{n})[\|\bm{y}\|_{2}^{2}+c\sigma^{2}n]}, (B.23)

in probability, for some non-negative sequence $\delta^{\prime}_{n}\to 0$.

Completing the proof.

Combining the above results, we have

\frac{\|\mathbb{H}\|}{\|\mathbb{g}\|_{2}\|\bm{y}\|_{2}}\lesssim\frac{\sigma\sqrt{\rho}(\sqrt{n/K}+1)}{\|\bm{y}\|_{2}}\lesssim\frac{\sqrt{\rho n/K}}{\|\bm{y}\|_{2}/\sigma},

where the last inequality follows from $n>K$. Thus, under Condition (C2), for sufficiently large $n$ we always have $R<0.5$, and therefore by (B.16) we have

|\cos\angle(\hat{\mathbb{u}}_{1},\mathbb{s})|\to 1,\qquad\text{in probability.} (B.24)

Finally, note that for any $i=1,2,...,K$, we have

s_{i}=\frac{\|\bm{y}\|_{2}^{2}+\mathbb{z}_{i}^{\top}\bm{y}}{\|\bm{y}+\mathbb{z}_{i}\|_{2}\cdot\|\bm{y}\|_{2}}\gtrsim\frac{\|\bm{y}\|_{2}^{2}-c\sigma\|\bm{y}\|_{2}\sqrt{n/K}}{\|\bm{y}\|_{2}(\|\bm{y}\|_{2}+\sigma\sqrt{n})}\geq\frac{\|\bm{y}\|_{2}-c\sigma\sqrt{n/K}}{\|\bm{y}\|_{2}+\sigma\sqrt{n}}, (B.25)

in probability. By Condition (C2), whenever $K\ll n$ the right-hand side of the above inequality is strictly positive. In other words, in probability, $\mathbb{s}$ is a strictly positive vector. Thus, for $\widehat{\mathbb{s}}=|\hat{\mathbb{u}}_{1}|$, under the same event it holds that

\cos\angle(\widehat{\mathbb{s}},\mathbb{s})=\cos\angle(|\hat{\mathbb{u}}_{1}|,\mathbb{s})\geq|\cos\angle(\hat{\mathbb{u}}_{1},\mathbb{s})|. (B.26)

This along with (B.24) implies

\cos\angle(\widehat{\mathbb{s}},\mathbb{s})\to 1,\qquad\text{in probability.} (B.27)
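To illustrate (B.27), the following is a minimal simulation sketch (our own illustration with assumed values of $n$, $K$ and $\sigma$; it is not the paper's code): it generates candidate rows $\mathbb{x}_{k}=\bm{y}+\mathbb{z}_{k}$, forms the row-normalized matrix $\bar{\mathbb{X}}$, and compares $\widehat{\mathbb{s}}=|\hat{\mathbb{u}}_{1}|$ with the true concordance scores $s_{k}=\cos\angle(\mathbb{x}_{k},\bm{y})$.

\begin{verbatim}
import numpy as np

# Simulation sketch for Theorem 1 (illustrative parameter values only).
rng = np.random.default_rng(0)
n, K, sigma = 2000, 20, 0.5
y = rng.normal(size=n)                                  # true structure (vectorized)
X = y + sigma * rng.normal(size=(K, n))                 # candidate rows x_k = y + z_k
Xbar = X / np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm rows

U, S, Vt = np.linalg.svd(Xbar, full_matrices=False)
s_hat = np.abs(U[:, 0])                                 # |leading left singular vector|
s_true = Xbar @ y / np.linalg.norm(y)                   # s_k = cos angle(x_k, y)

cos = s_hat @ s_true / (np.linalg.norm(s_hat) * np.linalg.norm(s_true))
print("cos angle(s_hat, s) =", round(float(cos), 4))    # close to 1, as in (B.27)
\end{verbatim}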

B.4 Guarantee of Meta-Visualization: Proof of Theorem 2

We still use the notation defined at the beginning of the proof of Theorem 1. In addition, we denote $\mathbb{x}^{m}=\bar{\mathbb{P}}_{i.}^{m}$, and define

\mathbb{x}^{*}=\sum_{i=1}^{K}w_{i}\bar{\mathbb{x}}_{i}, (B.28)

where $w_{i}=s_{i}/\|\mathbb{s}\|_{2}$. To begin with, note that

\cos\angle(\mathbb{x}^{*},\bm{y})=\frac{\sum_{i=1}^{K}w_{i}\bar{\mathbb{x}}_{i}^{\top}\bm{y}}{\|\sum_{i=1}^{K}w_{i}\bar{\mathbb{x}}_{i}\|_{2}\|\bm{y}\|_{2}},
\sum_{i=1}^{K}w_{i}\bar{\mathbb{x}}_{i}^{\top}\bm{y}=\frac{\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})^{2}}{\sqrt{\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})^{2}}}=\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}^{2}}{\sqrt{\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})^{2}}}=\|\bar{\mathbb{X}}\bm{y}\|_{2},

and

\bigg\|\sum_{i=1}^{K}w_{i}\bar{\mathbb{x}}_{i}\bigg\|_{2}=\frac{\|\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})\bar{\mathbb{x}}_{i}\|_{2}}{\sqrt{\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})^{2}}}=\frac{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}}{\sqrt{\sum_{i=1}^{K}(\bar{\mathbb{x}}_{i}^{\top}\bm{y})^{2}}}=\frac{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}.

Then

|\cos\angle(\mathbb{x}^{*},\bm{y})|=\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}^{2}}{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}\|\bm{y}\|_{2}}.

Similarly, we have

\max_{1\leq i\leq K}|\cos\angle(\mathbb{x}_{i},\bm{y})|=\|\bar{\mathbb{X}}\bm{y}\|_{\infty}/\|\bm{y}\|_{2}.

We first prove some auxiliary results. Since

\bar{\mathbb{X}}^{\top}=\begin{bmatrix}\bm{y}+\mathbb{z}_{1}&\cdots&\bm{y}+\mathbb{z}_{K}\end{bmatrix}\cdot\text{diag}(\mathbb{g}),

we have, for sufficiently large $n$,

\|\bar{\mathbb{X}}\|\leq\|\mathbb{g}\|_{\infty}\cdot\bigg\|\begin{bmatrix}\bm{y}+\mathbb{z}_{1}&\cdots&\bm{y}+\mathbb{z}_{K}\end{bmatrix}\bigg\|\leq\|\mathbb{g}\|_{\infty}\cdot(\sqrt{K}\|\bm{y}\|_{2}+\|\mathbb{H}_{0}\|)\leq\|\mathbb{g}\|_{\infty}\cdot[\sqrt{K}\|\bm{y}\|_{2}+C\sigma\sqrt{\rho}(\sqrt{n}+\sqrt{K})] (B.29)

in probability, according to Lemma 1, and

\|\bar{\mathbb{X}}\bm{y}\|_{2}=\bigg\|\bm{y}^{\top}\begin{bmatrix}\bm{y}+\mathbb{z}_{1}&\cdots&\bm{y}+\mathbb{z}_{K}\end{bmatrix}\cdot\text{diag}(\mathbb{g})\bigg\|_{2}\geq\min_{1\leq i\leq K}|\|\bm{y}\|_{2}^{2}+\bm{y}^{\top}\mathbb{z}_{i}|\cdot\|\mathbb{g}\|_{2}\geq\|\mathbb{g}\|_{2}\cdot(1-\delta_{n})\|\bm{y}\|_{2}^{2}, (B.30)
\|\bar{\mathbb{X}}\bm{y}\|_{\infty}\leq\max_{1\leq i\leq K}|\|\bm{y}\|_{2}^{2}+\bm{y}^{\top}\mathbb{z}_{i}|\cdot\|\mathbb{g}\|_{\infty}\leq\|\mathbb{g}\|_{\infty}(1+\delta_{n})\|\bm{y}\|_{2}^{2}, (B.31)

for some non-negative sequence $\delta_{n}\to 0$ as $n\to\infty$, in probability. In addition, by (B.23) and (B.21), we have

\|\mathbb{g}\|_{2}\geq\frac{\sqrt{K}}{(1+\delta_{n})(\|\bm{y}\|_{2}+c\sigma\sqrt{n})},\qquad\|\mathbb{g}\|_{\infty}\leq\frac{1}{(1-\delta_{n})(\|\bm{y}\|_{2}+c\sigma\sqrt{n})}, (B.32)

in probability, for some $\delta_{n}\to 0$. Now we show that

\cos\angle(\mathbb{x}^{m},\mathbb{x}^{*})\to 1. (B.33)

To see this, since

\mathbb{x}^{m}=\sum_{i=1}^{K}\hat{s}_{i}\bar{\mathbb{x}}_{i}=\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}, (B.34)

and $\mathbb{x}^{*}=\bar{\mathbb{X}}^{\top}\mathbb{w}$ as in (B.28), we have

\cos\angle(\mathbb{x}^{m},\mathbb{x}^{*})\geq\frac{\mathbb{w}^{\top}\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}}{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}=\frac{\widehat{\mathbb{s}}^{\top}\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}}{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}+\frac{\widehat{\mathbb{s}}^{\top}\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}(\mathbb{w}-\widehat{\mathbb{s}})}{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}. (B.35)

Hence

|1-\cos\angle(\mathbb{x}^{m},\mathbb{x}^{*})|=\bigg|1-\frac{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}}{\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}+\frac{\widehat{\mathbb{s}}^{\top}\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}(\mathbb{w}-\widehat{\mathbb{s}})}{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}\bigg|
\leq\frac{|\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}-\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}|}{\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}+\frac{|\widehat{\mathbb{s}}^{\top}\bar{\mathbb{X}}\bar{\mathbb{X}}^{\top}(\widehat{\mathbb{s}}-\mathbb{w})|}{\|\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}\|_{2}\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}
\leq\frac{2\|\bar{\mathbb{X}}\|\|\widehat{\mathbb{s}}-\mathbb{w}\|_{2}}{\|\bar{\mathbb{X}}^{\top}\mathbb{w}\|_{2}}
\leq\frac{2\|\bar{\mathbb{X}}\|\|\bar{\mathbb{X}}\bm{y}\|_{2}\|\widehat{\mathbb{s}}-\mathbb{w}\|_{2}}{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}}.

Note that by Theorem 1, we have $\|\widehat{\mathbb{s}}-\mathbb{w}\|_{2}\to 0$ in probability. Then it suffices to show that

\frac{\|\bar{\mathbb{X}}\|\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}}=O_{P}(1). (B.36)

To see this, observe that

\frac{\|\bar{\mathbb{X}}\|\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}}\leq\frac{\|\bar{\mathbb{X}}\|\|\bar{\mathbb{X}}\bm{y}\|_{2}\|\bm{y}\|_{2}}{|\bm{y}^{\top}\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}|}=\frac{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}. (B.37)

Note that

\bar{\mathbb{X}}^{\top}=\big(\begin{bmatrix}\bm{y}&\cdots&\bm{y}\end{bmatrix}+\mathbb{H}_{0}\big)\cdot\text{diag}(\mathbb{g}). (B.38)

By Condition (C2), whenever $K\ll n$, we have

\bm{y}^{\top}\bar{\mathbb{X}}^{\top}=\big(\begin{bmatrix}\|\bm{y}\|_{2}^{2}&\cdots&\|\bm{y}\|_{2}^{2}\end{bmatrix}+\bm{y}^{\top}\mathbb{H}_{0}\big)\cdot\text{diag}(\mathbb{g})=\begin{bmatrix}\|\bm{y}\|_{2}^{2}(1+o(1))&\cdots&\|\bm{y}\|_{2}^{2}(1+o(1))\end{bmatrix}\cdot\text{diag}(\mathbb{g}),

in probability, which implies

\|\bar{\mathbb{X}}\bm{y}\|_{2}=\|\bm{y}\|_{2}^{2}(1+o(1))\sqrt{\sum_{i=1}^{K}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}}\asymp\|\bm{y}\|_{2}^{2}\sqrt{K}(\|\bm{y}\|_{2}+\sigma\sqrt{n})^{-1}

in probability, where the last step follows from (B.21) and (B.23). On the other hand, we have

\|\bar{\mathbb{X}}\|\leq\big\|\begin{bmatrix}\bm{y}&\cdots&\bm{y}\end{bmatrix}+\mathbb{H}_{0}\big\|\cdot\|\mathbb{g}\|_{\infty}\lesssim(\|\bm{y}\|_{2}+\sigma\sqrt{n})^{-1}\big(\sqrt{K}\|\bm{y}\|_{2}+\sigma\sqrt{\rho}(\sqrt{K}+\sqrt{n})\big)\lesssim\sqrt{K}\|\bm{y}\|_{2}(\|\bm{y}\|_{2}+\sigma\sqrt{n})^{-1}.

Thus, we have

\frac{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\bm{y}\|_{2}}=O_{P}(1), (B.39)

which along with (B.37) implies (B.36). This completes our proof of (B.33). Finally, we note that

|\cos\angle(\mathbb{x}^{*},\bm{y})|=\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}^{2}}{\|\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}\bm{y}\|_{2}\|\bm{y}\|_{2}}\geq\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}^{2}}{\|\bar{\mathbb{X}}\|\cdot\|\bar{\mathbb{X}}\bm{y}\|_{2}\|\bm{y}\|_{2}}=\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\|\cdot\|\bm{y}\|_{2}},

so that

1-\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}}\geq 1-|\cos\angle(\mathbb{x}^{*},\bm{y})|.

Now in order to prove the theorem, we show that, on the one hand,

1-\frac{\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}}\to 0\qquad\text{in probability}, (B.40)

and on the other hand, under the additional assumption that $\|\bm{y}\|_{2}/(\sigma\sqrt{n})\leq C$ for some sufficiently small constant $C>0$,

\frac{\|\bar{\mathbb{X}}\bm{y}\|_{\infty}}{\|\bm{y}\|_{2}}\leq 1-\delta,\qquad\text{in probability}, (B.41)

for some constant $\delta>0$.

To obtain (B.40), we note that by (B.29), (B.30) and (B.32), for some sequence $\delta_{n}\to 0$, we have

\frac{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}-\|\bar{\mathbb{X}}\bm{y}\|_{2}}{\|\bar{\mathbb{X}}\|\|\bm{y}\|_{2}}=\frac{\|\bm{y}\|_{2}-\|\bar{\mathbb{X}}\bm{y}\|_{2}/\|\bar{\mathbb{X}}\|}{\|\bm{y}\|_{2}}\leq\frac{\|\bm{y}\|_{2}-\frac{\|\mathbb{g}\|_{2}(1-\delta_{n})\|\bm{y}\|_{2}}{\|\mathbb{g}\|_{\infty}(1+\delta_{n})\sqrt{K}}}{\|\bm{y}\|_{2}}\leq 1-\frac{(1-\delta_{n})^{2}(1+c\sigma\sqrt{n}/\|\bm{y}\|_{2})}{(1+\delta_{n})^{2}(1+c\sigma\sqrt{n}/\|\bm{y}\|_{2})},

in probability. To obtain (B.41), for some sequence $\delta_{n}\to 0$, we have

\frac{\|\bar{\mathbb{X}}\bm{y}\|_{\infty}}{\|\bm{y}\|_{2}}\leq\|\mathbb{g}\|_{\infty}(1+\delta_{n})\|\bm{y}\|_{2}\leq\frac{(1+\delta_{n})}{(1-\delta_{n})(1+c\sigma\sqrt{n}/\|\bm{y}\|_{2})}\leq 1-\delta, (B.42)

in probability, as long as $\|\bm{y}\|_{2}/(\sigma\sqrt{n})\leq C$ for some sufficiently small constant $C>0$. This completes our proof of Theorem 2.
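The contrast between (B.40) and (B.41) can also be seen in a small simulation. The sketch below (our own illustration with assumed parameter values, not the paper's code) compares the concordance of the eigenscore-weighted combination $\mathbb{x}^{m}=\bar{\mathbb{X}}^{\top}\widehat{\mathbb{s}}$ with that of the best individual candidate in a noisy regime.

\begin{verbatim}
import numpy as np

# Simulation sketch for Theorem 2 (illustrative parameter values only).
rng = np.random.default_rng(1)
n, K, sigma = 2000, 30, 2.0                 # large sigma: each candidate is very noisy
y = rng.normal(size=n)
Xbar = y + sigma * rng.normal(size=(K, n))
Xbar /= np.linalg.norm(Xbar, axis=1, keepdims=True)

s_hat = np.abs(np.linalg.svd(Xbar, full_matrices=False)[0][:, 0])
x_meta = Xbar.T @ s_hat                     # weighted combination, as in (B.34)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("best single candidate:", round(max(cosine(row, y) for row in Xbar), 4))
print("weighted combination :", round(cosine(x_meta, y), 4))   # noticeably closer to 1
\end{verbatim}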

B.5 Robustness of Spectral Weighting: Proof of Theorem 3

For the first statement concerning $\bar{\mathbb{P}}^{m}$, we first note that, by the proof of Theorem 2, we have $\cos\angle(\mathbb{x}^{*},\bm{y})\to 1$ under the conditions of Theorem 1, as the effects of the adversarial candidate visualizations vanish in (B.28). Therefore, it suffices to show that under such an adversarial setting, we still have $\cos\angle(\mathbb{x}^{m},\mathbb{x}^{*})\to 1$ in probability.

To see this, by the proof of Theorem 2, it suffices to show that $\|\widehat{\mathbb{s}}-\mathbb{w}\|_{2}\to 0$ in probability, or, equivalently,

\cos\angle(\widehat{\mathbb{s}},\mathbb{s})\to 1,\qquad\text{in probability}. (B.43)

By the proof of Theorem 1, we have

\bar{\mathbb{X}}_{j}=\bigg(\frac{y_{j}+z_{1j}}{\|\bm{y}+\mathbb{z}_{1}\|_{2}},...,\frac{y_{j}+z_{K^{\prime}j}}{\|\bm{y}+\mathbb{z}_{K^{\prime}}\|_{2}},\frac{z_{K^{\prime}+1,j}}{\|\mathbb{z}_{K^{\prime}+1}\|_{2}},...,\frac{z_{Kj}}{\|\mathbb{z}_{K}\|_{2}}\bigg)^{\top},

where we denote $\mathbb{z}_{k}=\mathbb{x}_{k}$ with $\mathbb{z}_{k}^{\top}\bm{y}=0$ for all $k\geq K^{\prime}+1$. For any $1\leq j,k\leq n$, it follows that

\bar{\mathbb{X}}_{j}^{\top}\bar{\mathbb{X}}_{k}=\sum_{i=1}^{K^{\prime}}\frac{(y_{j}+z_{ij})(y_{k}+z_{ik})}{\|\bm{y}+\mathbb{z}_{i}\|_{2}^{2}}+\sum_{i=K^{\prime}+1}^{K}\frac{z_{ij}z_{ik}}{\|\mathbb{z}_{i}\|_{2}^{2}}=\bigg(\sum_{i=1}^{K^{\prime}}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg)y_{j}y_{k}+\bigg(\sum_{i=1}^{K^{\prime}}z_{ik}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg)y_{j}+\bigg(\sum_{i=1}^{K^{\prime}}z_{ij}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}\bigg)y_{k}+\sum_{i=1}^{K^{\prime}}z_{ij}z_{ik}\|\bm{y}+\mathbb{z}_{i}\|_{2}^{-2}+\sum_{i=K^{\prime}+1}^{K}\frac{z_{ij}z_{ik}}{\|\mathbb{z}_{i}\|_{2}^{2}},

where $K^{\prime}=(1-\eta)K$. This implies that

\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=\|\mathbb{g}\|_{2}^{2}\cdot\bm{y}\bm{y}^{\top}+\bm{y}\mathbb{g}^{\top}\mathbb{H}^{\top}+\mathbb{H}\mathbb{g}\bm{y}^{\top}+\mathbb{H}\mathbb{H}^{\top}+\mathbb{L}\mathbb{L}^{\top}, (B.44)

where

\mathbb{g}=(\|\bm{y}+\mathbb{z}_{1}\|_{2}^{-1},\|\bm{y}+\mathbb{z}_{2}\|_{2}^{-1},...,\|\bm{y}+\mathbb{z}_{K^{\prime}}\|_{2}^{-1}),\qquad\mathbb{H}=\begin{bmatrix}\frac{\mathbb{z}_{1}}{\|\bm{y}+\mathbb{z}_{1}\|_{2}}&\frac{\mathbb{z}_{2}}{\|\bm{y}+\mathbb{z}_{2}\|_{2}}&\cdots&\frac{\mathbb{z}_{K^{\prime}}}{\|\bm{y}+\mathbb{z}_{K^{\prime}}\|_{2}}\end{bmatrix}\in\mathbb{R}^{n\times K^{\prime}},

and

\mathbb{L}=\begin{bmatrix}\frac{\mathbb{z}_{K^{\prime}+1}}{\|\mathbb{z}_{K^{\prime}+1}\|_{2}}&\cdots&\frac{\mathbb{z}_{K}}{\|\mathbb{z}_{K}\|_{2}}\end{bmatrix}\in\mathbb{R}^{n\times\eta K}.

In other words,

\bar{\mathbb{X}}^{\top}\bar{\mathbb{X}}=\|\mathbb{g}\|_{2}^{2}\cdot\bm{y}\bm{y}^{\top}+\mathbb{L}\mathbb{L}^{\top}+\mathbb{E},\qquad\mathbb{E}=\bm{y}\mathbb{g}^{\top}\mathbb{H}^{\top}+\mathbb{H}\mathbb{g}\bm{y}^{\top}+\mathbb{H}\mathbb{H}^{\top}. (B.45)

Note that by definition

\mathbb{L}\mathbb{L}^{\top}\bm{y}=0,

so we can treat the above decomposition as a low-rank matrix $\|\mathbb{g}\|_{2}^{2}\cdot\bm{y}\bm{y}^{\top}+\mathbb{L}\mathbb{L}^{\top}$ plus a noise matrix $\mathbb{E}$. In this respect, the Davis-Kahan perturbation theorem applies similarly as in the proof of Theorem 1, only with $\frac{\|\mathbb{E}\|}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}}$ therein replaced by

\frac{\|\mathbb{E}\|}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}-\|\mathbb{L}\|^{2}}.

Note that by definition we have

\|\mathbb{L}\|\leq\sqrt{\eta K}, (B.46)

whereas by (B.23) we have

\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}\geq\frac{(1-\eta)K\|\bm{y}\|_{2}^{2}}{(1+\delta^{\prime}_{n})(\|\bm{y}\|_{2}^{2}+c\sigma^{2}n)}. (B.47)

Then, as long as

\frac{1-\eta}{(1+\delta^{\prime}_{n})(1+c\sigma^{2}n/\|\bm{y}\|_{2}^{2})}>2\eta, (B.48)

we have

\frac{\|\mathbb{E}\|}{\|\mathbb{g}\|_{2}^{2}\|\bm{y}\|_{2}^{2}-\|\mathbb{L}\|^{2}}\leq\frac{\|\mathbb{E}\|}{\eta K},

and the rest of the argument leading to (B.43) follows as in the proof of Theorem 2. Finally, note that a sufficient condition for (B.48) is

\|\bm{y}\|_{2}/\sigma\geq C\sqrt{n},\qquad\eta<C^{-1}, (B.49)

for some sufficiently large $C>0$. This completes the first part of the proof.

As for the second statement concerning $\bar{\mathbb{P}}^{a}$, note that by (B.25), whenever $\|\bm{y}\|_{2}/\sigma\geq C\sqrt{n}$ for some sufficiently large $C$, all of the $(1-\eta)K$ non-adversarial candidate visualizations are concordant with the true structure in probability, that is, $\cos\angle(\mathbb{x}_{k},\bm{y})\to_{P}1$ for all $k\in\mathcal{C}_{0}$.

\sum_{k\in\mathcal{C}_{0}}[1-\cos\angle(\mathbb{x}_{k},\bm{y})]\leq(1-\eta)K\epsilon/C,

in probability, for some constant $\epsilon>0$. Without loss of generality, we assume $\bm{y}=(1,0,...,0)$ and $\bar{\mathbb{x}}_{k}=(0,1,0,...,0)$ for all $k\in\mathcal{C}_{1}$. Thus, it follows that

\sum_{k=1}^{K}\bar{\mathbb{x}}_{k}=(0,\eta K,0,...,0)+((1-\eta)K,0,...,0)+\sum_{k=1}^{(1-\eta)K}(\bar{\mathbb{x}}_{k}-\bm{y}),

and

\bigg\|\sum_{k=1}^{K}\bar{\mathbb{x}}_{k}\bigg\|_{2}\geq\sqrt{\eta^{2}K^{2}+(1-\eta)^{2}K^{2}}-\sum_{k=1}^{(1-\eta)K}\|\bar{\mathbb{x}}_{k}-\bm{y}\|_{2}\geq K\sqrt{\eta^{2}+(1-\eta)^{2}}-2(1-\eta)K\epsilon/C,

in probability. If $C$ is chosen such that $\epsilon/C<0.01$, then we have

\cos\angle\bigg(\sum_{k=1}^{K}\bar{\mathbb{x}}_{k},\bm{y}\bigg)\leq\frac{(1-\eta)K+2(1-\eta)K\epsilon/C}{K\sqrt{\eta^{2}+(1-\eta)^{2}}-2(1-\eta)K\epsilon/C}\leq\frac{1-\eta+0.02(1-\eta)}{\sqrt{1-2\eta+2\eta^{2}}-0.02(1-\eta)}<1-\eta,

in probability.
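The two statements of Theorem 3 can likewise be illustrated numerically. In the sketch below (our own illustration under an assumed adversarial setup, not the paper's code), a fraction $\eta$ of the candidates carries a structure orthogonal to $\bm{y}$; the spectral weights essentially ignore these candidates, whereas the unweighted average remains pulled toward the adversarial direction.

\begin{verbatim}
import numpy as np

# Simulation sketch for Theorem 3 (illustrative parameter values only).
rng = np.random.default_rng(2)
n, K, eta, sigma = 2000, 30, 0.4, 0.2
K_good = int((1 - eta) * K)
y = rng.normal(size=n)
adv = rng.normal(size=n)
adv -= (adv @ y) / (y @ y) * y                          # adversarial direction, orthogonal to y

rows = [y + sigma * rng.normal(size=n) for _ in range(K_good)]
rows += [adv + sigma * rng.normal(size=n) for _ in range(K - K_good)]
Xbar = np.array(rows)
Xbar /= np.linalg.norm(Xbar, axis=1, keepdims=True)

s_hat = np.abs(np.linalg.svd(Xbar, full_matrices=False)[0][:, 0])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("spectral weighting:", round(cosine(Xbar.T @ s_hat, y), 4))   # close to 1
print("unweighted average:", round(cosine(Xbar.sum(axis=0), y), 4)) # bounded away from 1
\end{verbatim}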

B.6 Necessity of the Signal Strength Condition (C2)

In this part, we show that the signal strength condition (C2) required by the proposed method is in fact asymptotically sharp, in the sense that it is essentially necessary for any method to achieve consistent estimation of the true structure $\mathbb{P}^{*}$.

Specifically, we consider the settings where the following two conditions hold:

(C1c) $\{\mathbb{h}_{i}^{(k)}\}_{1\leq k\leq K}$ are identically distributed Gaussian random vectors with mean zero and covariance matrix $\sigma^{2}{\bf I}_{n}$.

(C1d) The matrix $\mathbb{R}=(\rho_{jk})_{1\leq j,k\leq K}$ satisfies $\rho:=\|\mathbb{R}\|=O(1)$.

Note that Conditions (C1c) and (C1d) are sufficient conditions for Conditions (C1a) and (C1b), respectively, so that Conditions (C1a) and (C1b) are satisfied throughout our discussion below.

Using the previous notations, we can rewrite our model for a given $i$ as

\mathbb{x}_{k}=c_{k}(\bm{y}+\mathbb{z}_{k}),\qquad k=1,2,...,K,

where by (C1c) $\{\mathbb{z}_{k}\}_{1\leq k\leq K}$ are identically distributed Gaussian vectors with mean zero and covariance matrix $\sigma^{2}{\bf I}_{n}$, and $\mathbb{R}\in\mathbb{R}^{K\times K}$, defined similarly as in the main paper, describes the correlation structure among $\{\mathbb{z}_{k}\}_{1\leq k\leq K}$. We consider the parameter space for the model parameters $(\{c_{k}\}_{1\leq k\leq K},\bm{y},\sigma,\mathbb{R})$ as

\Theta(t,\sigma,K,n)=\bigg\{\{c_{k}\}_{1\leq k\leq K}\subset\mathbb{R},\ \bm{y}\in\mathbb{R}^{n},\ \|\bm{y}\|_{2}=t,\ \sigma>0,\ \mathbb{R}\in\mathbb{R}^{K\times K},\ \|\mathbb{R}\|\leq C\bigg\},

for some sufficiently large universal constant $C>1$. Define a subset of $\Theta(t,\sigma,K,n)$ as

\Theta_{0}(t,\sigma,K,n)=\bigg\{c_{k}=1\ \forall 1\leq k\leq K,\ \bm{y}\in\mathbb{R}^{n},\ \sigma>0,\ \mathbb{R}={\bf I}_{K}\bigg\}.

From our previous analysis (Theorem 2), it follows that as long as $t/\sigma\gg\sqrt{n/K}$, the proposed method satisfies

\lim_{n\to\infty}\inf_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}(\cos\angle(\mathbb{x}^{m},\bm{y})>1-\epsilon)=1, (B.50)

for all $\epsilon>0$. Now we establish the following minimax lower bound, which shows that consistent estimation of $\bm{y}$ is impossible over the parameter space $\Theta(t,\sigma,K,n)$ for any estimator $\widehat{\bm{y}}$ whenever $t/\sigma\leq C_{1}\sqrt{n/K}$ for some constant $C_{1}>0$.

Theorem 4.

Under Conditions (C1c) and (C1d), there exists some absolute constant $C_{1}>0$ such that, as long as $t/\sigma\leq C_{1}\sqrt{n/K}$, we have

\lim_{n\to\infty}\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}(\cos\angle(\widehat{\bm{y}},\bm{y})<0.7)>c_{1}, (B.51)

for some absolute constant $c_{1}>0$.

Proof.

To see this, note that

\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}(\cos\angle(\widehat{\bm{y}},\bm{y})<0.7)\geq\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta_{0}(t,\sigma,K,n)}P_{\theta}(\cos\angle(\widehat{\bm{y}},\bm{y})<0.7). (B.52)

Over the parameter space $\Theta_{0}(t,\sigma,K,n)$, the model reduces to

\mathbb{X}={\bf 1}\bm{y}^{\top}+\mathbb{Z}\in\mathbb{R}^{K\times n}, (B.53)

which is a rank-one matrix denoising model, where $\|{\bf 1}\bm{y}^{\top}\|=t\sqrt{K}$ and $\mathbb{Z}$ has i.i.d. rows drawn from $N(0,\sigma^{2}{\bf I}_{n})$. Estimating the direction of $\bm{y}$ is thus equivalent to estimating the leading right singular vector of the rank-one matrix ${\bf 1}\bm{y}^{\top}$. We recall the following lemma concerning the minimax lower bound for singular vector estimation in the low-rank matrix denoising model, proved by Cai et al. (2021).

Lemma 2.

Let $\bar{\bm{y}}=\bm{y}/\|\bm{y}\|_{2}$. There exist constants $c_{1},c_{2}>0$ such that

\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}\bigg(\|\bar{\bm{y}}\bar{\bm{y}}^{\top}-\widehat{\bm{y}}\widehat{\bm{y}}^{\top}\|_{F}\geq\min\bigg\{1,\frac{c_{1}\sigma\sqrt{n}}{t\sqrt{K}}\bigg\}\bigg)\geq c_{2}, (B.54)

for sufficiently large $n$.

By Lemma 1 of Cai and Zhang (2018), we have

\|\bar{\bm{y}}\bar{\bm{y}}^{\top}-\widehat{\bm{y}}\widehat{\bm{y}}^{\top}\|_{F}=\sqrt{2}\sin\angle(\bm{y},\widehat{\bm{y}})=\sqrt{1-\cos^{2}\angle(\bm{y},\widehat{\bm{y}})}\leq\sqrt{2(1-\cos\angle(\bm{y},\widehat{\bm{y}}))}. (B.55)

Then, it follows that $\|\bar{\bm{y}}\bar{\bm{y}}^{\top}-\widehat{\bm{y}}\widehat{\bm{y}}^{\top}\|_{F}\geq\sqrt{0.6}$ implies $\cos\angle(\bm{y},\widehat{\bm{y}})<0.7$. Therefore, whenever the constant $C_{1}>0$ is chosen small enough that $c_{1}/C_{1}>\sqrt{0.6}$, the assumption $t/\sigma\leq C_{1}\sqrt{n/K}$ gives $\frac{c_{1}\sigma\sqrt{n}}{t\sqrt{K}}\geq c_{1}/C_{1}>\sqrt{0.6}$, and hence

\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}(\|\bar{\bm{y}}\bar{\bm{y}}^{\top}-\widehat{\bm{y}}\widehat{\bm{y}}^{\top}\|_{F}\geq\sqrt{0.6})\geq c_{2}, (B.56)

which implies

\inf_{\widehat{\bm{y}}}\sup_{\theta\in\Theta(t,\sigma,K,n)}P_{\theta}(\cos\angle(\bm{y},\widehat{\bm{y}})<0.7)\geq c_{2}. (B.57)

This completes the proof of the theorem. ∎
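The threshold $t/\sigma\asymp\sqrt{n/K}$ separating (B.50) from (B.51) can be observed empirically in the reduced model (B.53). The following sketch (our own illustration with assumed values, not the paper's code) computes the leading right singular vector of $\mathbb{X}={\bf 1}\bm{y}^{\top}+\mathbb{Z}$ for signal strengths below, at, and above the threshold.

\begin{verbatim}
import numpy as np

# Simulation sketch for the rank-one denoising model (B.53) (illustrative values only).
rng = np.random.default_rng(3)
n, K, sigma = 2000, 20, 1.0
threshold = np.sqrt(n / K)                  # the critical order of t / sigma

for ratio in (0.3, 1.0, 10.0):              # t/sigma as a multiple of sqrt(n/K)
    t = ratio * threshold * sigma
    y = rng.normal(size=n)
    y *= t / np.linalg.norm(y)              # ||y||_2 = t
    X = np.ones((K, 1)) * y[None, :] + sigma * rng.normal(size=(K, n))
    v1 = np.linalg.svd(X, full_matrices=False)[2][0]    # leading right singular vector
    print(f"t/sigma = {ratio:>4} x sqrt(n/K):  |cos| = {abs(v1 @ y) / t:.3f}")
\end{verbatim}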

B.7 Proof of Auxiliary Lemma 1

We need the following lemma, proved in Cai and Zhang (2018), to obtain the upper bound.

Lemma 3.

For any $p\geq 1$, denote by $\mathbb{B}^{p}=\{x\in\mathbb{R}^{p}:\|x\|_{2}\leq 1\}$ the $p$-dimensional unit ball in the Euclidean space. Suppose $K\in\mathbb{R}^{p_{1}\times p_{2}}$ is a random matrix. Then, for any $t>0$,

P(\|K\|\geq 3t)\leq 7^{p_{1}+p_{2}}\cdot\max_{u\in\mathbb{B}^{p_{1}},v\in\mathbb{B}^{p_{2}}}P(|u^{\top}Kv|\geq t).

By Lemma 3, applied to $\mathbb{H}_{0}\in\mathbb{R}^{n\times K}$, it follows that

P(\|\mathbb{H}_{0}\|\geq 3t)\leq 7^{n+K}\cdot\max_{\mathbb{u}\in\mathbb{B}^{n},\mathbb{v}\in\mathbb{B}^{K}}P(|\mathbb{u}^{\top}\mathbb{H}_{0}\mathbb{v}|\geq t). (B.58)

For any $\mathbb{u}\in\mathbb{B}^{n}$ and $\mathbb{v}\in\mathbb{B}^{K}$, we have

\mathbb{u}^{\top}\mathbb{H}_{0}\mathbb{v}=\text{tr}(\mathbb{H}_{0}\mathbb{v}\mathbb{u}^{\top})=\langle\mathbb{H}_{0},\mathbb{u}\mathbb{v}^{\top}\rangle, (B.59)

where $\langle\cdot,\cdot\rangle$ is the Hilbert-Schmidt inner product for matrices. Now let $\mathbb{\Sigma}\in\mathbb{R}^{nK\times nK}$ be $\mathbb{E}[\text{vec}(\mathbb{H}_{0})\text{vec}(\mathbb{H}_{0})^{\top}]$, and denote its $(i,j)$-th $n\times n$ block by $\mathbb{\Sigma}_{ij}$. By Conditions (C1a) and (C1b) and the definition of $\{\rho_{ij}\}$, we have

\mathbb{\Sigma}=\begin{bmatrix}\mathbb{\Sigma}_{11}&\mathbb{\Sigma}_{12}&\cdots&\mathbb{\Sigma}_{1K}\\ \mathbb{\Sigma}_{21}&\mathbb{\Sigma}_{22}&\cdots&\mathbb{\Sigma}_{2K}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbb{\Sigma}_{K1}&\mathbb{\Sigma}_{K2}&\cdots&\mathbb{\Sigma}_{KK}\end{bmatrix},

where $\|\mathbb{\Sigma}_{ij}\|=\sigma^{2}\rho_{ij}$ for $1\leq i,j\leq K$. In particular, we have

\|\mathbb{\Sigma}\|\leq\sigma^{2}\|\mathbb{R}\|=\sigma^{2}\rho. (B.60)

To see this, if we write $\mathbb{v}=(\mathbb{v}_{1},...,\mathbb{v}_{K})^{\top}\in\mathbb{R}^{nK}$ with each $\mathbb{v}_{i}\in\mathbb{R}^{n}$, then

\|\mathbb{\Sigma}\|=\max_{\|\mathbb{v}\|_{2}=1}\mathbb{v}^{\top}\mathbb{\Sigma}\mathbb{v}=\max_{\|\mathbb{v}\|_{2}=1}\sum_{1\leq i,j\leq K}\mathbb{v}_{i}^{\top}\mathbb{\Sigma}_{ij}\mathbb{v}_{j}\leq\max_{\|\mathbb{v}\|_{2}=1}\sum_{1\leq i,j\leq K}\|\mathbb{v}_{i}\|_{2}\|\mathbb{v}_{j}\|_{2}\|\mathbb{\Sigma}_{ij}\|\leq\sigma^{2}\max_{\|\mathbb{l}\|_{2}=1}\mathbb{l}^{\top}\mathbb{R}\mathbb{l}=\sigma^{2}\|\mathbb{R}\|.

Now by (B.60), we have

\mathbb{E}|\langle\mathbb{H}_{0},\mathbb{u}\mathbb{v}^{\top}\rangle|\leq C\sigma\sqrt{\rho}. (B.61)

By properties of sub-Gaussian vectors, it holds that

P(|\langle\mathbb{H}_{0},\mathbb{u}\mathbb{v}^{\top}\rangle|\leq C\sigma\sqrt{\rho}(1+t))\geq 1-e^{-ct^{2}}. (B.62)

Combining (B.62) and (B.58), we have

P(\|\mathbb{H}_{0}\|\geq 3C\sigma\sqrt{\rho}(1+t))\leq 7^{n+K}\max_{\mathbb{u}\in\mathbb{B}^{n},\mathbb{v}\in\mathbb{B}^{K}}P(|\mathbb{u}^{\top}\mathbb{H}_{0}\mathbb{v}|\geq C\sigma\sqrt{\rho}(1+t))\leq 7^{n+K}e^{-ct^{2}}\leq\exp\{c^{\prime}(n+K)-ct^{2}\}.

Setting $t=c_{1}(\sqrt{n}+\sqrt{K})$ for some sufficiently large $c_{1}>0$, we have

P(\|\mathbb{H}_{0}\|\geq C\sigma\sqrt{\rho}(\sqrt{n}+\sqrt{K}))\leq n^{-c}. (B.63)

This completes the proof of Lemma 1.
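As a quick empirical counterpart to Lemma 1, the sketch below (our own illustration with assumed values of $n$, $K$, $\sigma$, and with independent columns so that $\rho=\|\mathbb{R}\|=1$) compares $\|\mathbb{H}_{0}\|$ with $\sigma(\sqrt{n}+\sqrt{K})$.

\begin{verbatim}
import numpy as np

# Simulation sketch for Lemma 1 (illustrative values; independent columns, so rho = 1).
rng = np.random.default_rng(4)
n, K, sigma = 2000, 20, 0.7
H0 = sigma * rng.normal(size=(n, K))        # columns play the role of the noise vectors z_k
print("||H_0|| (operator norm)  :", round(float(np.linalg.norm(H0, ord=2)), 2))
print("sigma*(sqrt(n)+sqrt(K))  :", round(sigma * (np.sqrt(n) + np.sqrt(K)), 2))
\end{verbatim}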