
Impact of signal-to-noise ratio and bandwidth on graph Laplacian spectrum from high-dimensional noisy point cloud

Xiucai Ding Department of Statistics, University of California, Davis, CA, USA xcading@ucdavis.edu  and  Hau-Tieng Wu Department of Mathematics and Department of Statistical Science, Duke University, Durham, NC, USA hauwu@math.duke.edu
Abstract.

We systematically study the spectrum of a kernel-based graph Laplacian (GL) constructed from a high-dimensional and noisy random point cloud in the nonnull setup. The problem is motivated by the model in which the clean signal is sampled from a manifold embedded in a low-dimensional Euclidean subspace and corrupted by high-dimensional noise. We quantify how the signal and noise interact over different regions of signal-to-noise ratio (SNR), and report the resulting peculiar spectral behavior of the GL. In addition, we explore the impact of the chosen kernel bandwidth on the spectrum of the GL over different regions of SNR, which leads to an adaptive choice of kernel bandwidth that coincides with common practice for real data. This result paves the way to a theoretical understanding of how practitioners apply the GL when the dataset is noisy.

1. Introduction

Spectral algorithms are popular and have been widely applied in unsupervised machine learning, such as the eigenmap [5], locally linear embedding (LLE) [54], and the diffusion map (DM) [17], to name but a few. A common ground of these algorithms is the graph Laplacian (GL) and its spectral decomposition, which have been extensively discussed in spectral graph theory [16]. Due to its wide practical applications, there are by now rich theoretical results on the spectral behavior of the GL under the manifold model when the point cloud is clean; for example, the pointwise convergence [5, 36, 35, 57], the $L^{2}$ convergence without rate [6, 63, 58], the $L^{2}$ convergence with rate [34], the $L^{\infty}$ convergence with rate [25], etc. We refer readers to the cited literature and the references therein for more information; see also, for example, [41, 12] for more relevant results. However, to our knowledge, limited results are known about the GL spectrum when the dataset is contaminated by noise [59, 26, 29, 55], particularly when the noise is high-dimensional [26, 29, 55]. Specifically, in the high-dimensional setup, [26, 29] report controls on the operator norm of the difference between the clean and noisy GLs in some specific signal-to-noise ratio (SNR) regions. However, a finer and more complete description of the spectrum is still lacking. The main focus of this work is extending existing results so that the spectral behavior of the GL is better depicted when the dataset is contaminated by high-dimensional noise.

GL is not uniquely defined, and there are several possible constructions from a given point cloud. For example, it can be constructed via local barycentric coordinates as in LLE [64] or by taking landmarks as in Roseland [55], the kernel can be asymmetric [17], the metric can be non-Euclidean [60], etc. In this paper, we focus on a specific setup; that is, the GL is constructed from a random point cloud $\mathcal{X}:=\{\mathbf{x}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{p}$ via a symmetric kernel with the usual Euclidean metric. To be more specific, from $\mathcal{X}$, construct the affinity matrix (or kernel matrix) $\mathbf{W}\in\mathbb{R}^{n\times n}$ by

(1.1) \mathbf{W}(i,j)=\exp\left(-\upsilon\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}}{h}\right),\ 1\leq i,j\leq n,

where $\upsilon>0$ is a chosen parameter and $h\equiv h(n)$ is the chosen bandwidth. In other words, we focus on kernels of the exponential type, that is, $f(x)=\exp(-\upsilon x)$, to simplify the discussion. Then, define the transition matrix

(1.2) \mathbf{A}=\mathbf{D}^{-1}\mathbf{W},

where $\mathbf{D}$ is the degree matrix, which is a diagonal matrix with diagonal entries defined as

(1.3) \mathbf{D}(i,i)=\sum_{j=1}^{n}\mathbf{W}(i,j),\ i=1,2,\cdots,n.

Note that $\mathbf{A}$ is row-stochastic. The (normalized) GL is defined as

\mathbf{L}:=\frac{1}{h}(\mathbf{I}-\mathbf{A})\,.

Since $\mathbf{L}$ and $\mathbf{A}$ are related by an isotropic shift and a universal scaling, we focus on studying the spectral distributions of $\mathbf{W}$ and $\mathbf{A}$ in the rest of the paper. With the eigendecomposition of $\mathbf{A}$, we could proceed with several data analysis tasks. For example, spectral clustering is carried out by combining the top few non-trivial eigenvectors with $k$-means [62], embedding the dataset into a low-dimensional Euclidean space by eigenvectors and/or eigenvalues reduces the dataset dimension [5, 17], etc. Thus, the spectral behavior of $\mathbf{A}$ is the key to fully understanding these algorithms.
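
For concreteness, the following sketch (in Python with NumPy; the function name and variable names are ours, not from the paper) illustrates the construction of $\mathbf{W}$, $\mathbf{D}$, $\mathbf{A}$ and $\mathbf{L}$ in (1.1)-(1.3) from a point cloud stored as the rows of a matrix. It only fixes the notation and is not part of the theoretical development.

import numpy as np

def graph_laplacian(X, h, upsilon=1.0):
    """Build W, A and L of (1.1)-(1.3) from an (n, p) array X whose rows are x_i.

    Illustrative sketch; h is the bandwidth and upsilon the kernel parameter."""
    # Pairwise squared Euclidean distances ||x_i - x_j||^2.
    sq_norms = (X ** 2).sum(axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    D2 = np.maximum(D2, 0.0)                 # guard against round-off

    W = np.exp(-upsilon * D2 / h)            # affinity matrix (1.1)
    deg = W.sum(axis=1)                      # degrees D(i, i) in (1.3)
    A = W / deg[:, None]                     # row-stochastic transition matrix (1.2)
    L = (np.eye(X.shape[0]) - A) / h         # normalized GL
    return W, A, L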

1.1. Mathematical setup

We now specify the high-dimensional noisy model and the problem we are concerned with in this work. Suppose $\mathcal{X}$ consists of independent and identically distributed (i.i.d.) samples of a sub-Gaussian random vector

(1.4) \mathbf{x}_{i}=\mathbf{z}_{i}+\mathbf{y}_{i},\ 1\leq i\leq n\,,

where $\mathbf{z}_{i}$ is a random vector with mean $0$ and $\textnormal{cov}(\mathbf{z}_{i})=\text{diag}\{\lambda_{1},\cdots,\lambda_{d},0,\cdots,0\}$, $\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{d}>0$, and $\mathbf{y}_{i}$ is sub-Gaussian with independent entries,

(1.5) \mathbb{E}(\mathbf{y}_{i})=\mathbf{0}\ \mbox{ and }\ \textnormal{cov}(\mathbf{y}_{i})=\mathbf{I}_{p}\,.

We also assume that $\mathbf{z}_{i}$ and $\mathbf{y}_{i}$ are independent. As a result, $\mathbf{x}_{i}$ is a sub-Gaussian random vector with mean $0$ and covariance

(1.6) \Sigma=\text{diag}\{\lambda_{1}+1,\cdots,\lambda_{d}+1,1,\cdots,1\}\,,

where $d\in\mathbb{N}$. In this model, $\mathbf{z}_{i}$ and $\mathbf{y}_{i}$ represent the signal and noise parts of the point cloud, respectively. To simplify the discussion, we also assume that $\mathbf{z}_{i1},\ldots,\mathbf{z}_{id}$ are continuous random variables. Denote $\mathcal{Y}:=\{\mathbf{y}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{p}$ and $\mathcal{Z}:=\{\mathbf{z}_{i}\}_{i=1}^{n}$.

We adopt the high-dimensional setting and assume that

(1.7) \gamma\leq c_{n}:=\frac{n}{p}\leq\gamma^{-1}

for some constant $0<\gamma\leq 1$; in other words, we focus on the large $p$ and large $n$ setup. We also assume that $d$ is independent of $n$. Note that (1.4) is closely related to the commonly applied spiked covariance model [39].

Next, we consider the following setup to control the signal strength:

(1.8) \lambda_{i}\asymp n^{\alpha_{i}},\ 0\leq\alpha_{i}<\infty\ \text{are some constants}.

Thus, $\lambda_{i}$ represents the signal strength in the $i$-th component of the signal. On the other hand, the total noise in the dataset $\mathcal{X}$ is $\mathrm{tr}(\textnormal{cov}(\mathbf{y}_{i}))=p$, and we call $\lambda_{i}/p$ the signal-to-noise ratio (SNR) associated with the $i$-th component of the signal.
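
For readers who prefer a concrete reference point, the following sketch (Python/NumPy; the function name, the Gaussian choice of the sub-Gaussian laws, and the default parameters are our illustrative assumptions) draws a sample from the model (1.4)-(1.8) with $d=1$ and $\lambda\asymp n^{\alpha}$.

import numpy as np

def sample_spiked_cloud(n, p, alpha, rng=None):
    """Draw x_i = z_i + y_i, i = 1..n, from (1.4)-(1.8) with d = 1 and lambda = n**alpha.

    Gaussian signal and noise are used as one admissible sub-Gaussian choice."""
    rng = np.random.default_rng(rng)
    lam = float(n) ** alpha                           # signal strength (1.8)
    Z = np.zeros((n, p))
    Z[:, 0] = np.sqrt(lam) * rng.standard_normal(n)   # cov(z_i) = diag(lambda, 0, ..., 0)
    Y = rng.standard_normal((n, p))                   # cov(y_i) = I_p as in (1.5)
    return Z + Y, Z, Y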

Denote by $\mathbf{W}_{1}\in\mathbb{R}^{n\times n}$ the matrix such that

(1.9) \mathbf{W}_{1}(i,j)=\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{h}\right)\,.

Note that compared with $\mathbf{W}$ defined in (1.1), $\mathbf{W}_{1}$ is the affinity matrix constructed from the clean signal $\mathcal{Z}$. Moreover, let $\mathbf{W}_{y}$ be the affinity matrix associated with $\mathcal{Y}$, which represents the noise part. Their associated transition matrices $\mathbf{A}_{1}$ and $\mathbf{A}_{y}$ are defined similarly as in (1.2). We will show in part (1) of Theorem 2.7, part (2) of Theorem 3.1 and Corollary 3.2 that under (1.4), when the SNR is above some threshold and the bandwidth is properly chosen, we can separate the signal and noise parts in the sense that $\mathbf{W}$ is close to

(1.10) \mathbf{W}_{1}\circ\mathbf{W}_{y}

with high probability, where $\circ$ stands for the Hadamard product. The closeness is quantified using the normalized operator norm; that is, $n^{-1}\|\mathbf{W}-\mathbf{W}_{1}\circ\mathbf{W}_{y}\|=o(1)$ with high probability. In fact, using the definition of $\mathbf{W}_{c}$ below in (2.15), we immediately see that in general we have $\mathbf{W}=\mathbf{W}_{1}\circ\mathbf{W}_{y}\circ\mathbf{W}_{c}$. When the SNR is relatively large and the bandwidth is chosen properly, we show that with high probability $\max_{i,j}|\mathbf{W}_{c}(i,j)-1|=o(1)$, and Lemma A.1 leads to the claim.
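
The exact factorization $\mathbf{W}=\mathbf{W}_{1}\circ\mathbf{W}_{y}\circ\mathbf{W}_{c}$ follows from $\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}=\|\mathbf{z}_{i}-\mathbf{z}_{j}\|^{2}+\|\mathbf{y}_{i}-\mathbf{y}_{j}\|^{2}+2(\mathbf{z}_{i}-\mathbf{z}_{j})^{\top}(\mathbf{y}_{i}-\mathbf{y}_{j})$. The following sketch (Python/NumPy; the function names are ours) verifies this identity numerically and reports the normalized operator norm $n^{-1}\|\mathbf{W}-\mathbf{W}_{1}\circ\mathbf{W}_{y}\|$, the quantity controlled above when the SNR is large and the bandwidth is proper.

import numpy as np

def sqdist(M):
    """Pairwise squared Euclidean distances between the rows of M."""
    s = (M ** 2).sum(axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * M @ M.T, 0.0)

def hadamard_check(Z, Y, h, upsilon=1.0):
    """Check W = W1 o Wy o Wc entrywise and return n^{-1} ||W - W1 o Wy||_op."""
    X = Z + Y
    W  = np.exp(-upsilon * sqdist(X) / h)
    W1 = np.exp(-upsilon * sqdist(Z) / h)           # clean affinity matrix (1.9)
    Wy = np.exp(-upsilon * sqdist(Y) / h)           # noise affinity matrix

    g = (Z * Y).sum(axis=1)                         # z_i^T y_i
    G = Z @ Y.T                                     # G[i, j] = z_i^T y_j
    cross = g[:, None] + g[None, :] - G - G.T       # (z_i - z_j)^T (y_i - y_j)
    Wc = np.exp(-2.0 * upsilon * cross / h)         # cross-term matrix, cf. (2.15)

    assert np.allclose(W, W1 * Wy * Wc)             # exact factorization (up to round-off)
    return np.linalg.norm(W - W1 * Wy, 2) / X.shape[0]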

The main problem we study in this work is the relationship between the spectra of $\mathbf{W}_{1}$ and $\mathbf{W}$ over different SNR regions and bandwidths; that is, how do the noise and the chosen bandwidth impact the spectrum of the affinity matrix, and how does the noisy spectrum deviate from that of the clean affinity matrix? By combining existing understanding of $\mathbf{W}_{1}$ under suitable models, such as manifolds [34, 25], we obtain a finer understanding of the commonly applied GL.

1.2. Relationship with the manifold model

We claim that this seemingly trivial spiked covariance model (1.4) overlaps with the commonly considered nonlinear manifold model. Consider the case that the manifold is embedded into a subspace of fixed dimension in $\mathbb{R}^{p}$. With some mild assumptions on the manifold, and the fact that the Euclidean distance of $\mathbb{R}^{p}$ is invariant to rotation, the noisy manifold model can be studied by the spiked covariance model satisfying (1.4)-(1.8). We defer the details to Section A.1 to avoid distraction. We shall, however, emphasize that this does not mean that we could understand the manifold structure by studying the spiked covariance model. The problem we study in this paper is the relationship between the noisy and clean affinity matrices; that is, how the noise impacts the GL. The problem of exploring the manifold structure from a clean dataset via the GL is a different one; see, for example, [35, 34, 25]. Therefore, by studying the relationship between the noisy and clean GLs via the model where the data is a sum of two sub-Gaussian random vectors, we could further explore the manifold structure from a noisy dataset.

1.3. Some related works

The focus of the current paper is to study the GL spectrum under the nonnull case for the point cloud $\mathcal{X}=\{\mathbf{x}_{i}\}_{i=1}^{n}$ described in (1.4), and connect it to the GL spectrum under the null case; that is, when the GL is constructed from the pure noise point cloud $\mathcal{Y}=\{\mathbf{y}_{i}\}_{i=1}^{n}$. The eigenvalues of $\mathbf{W}$, a high-dimensional Euclidean distance kernel random matrix, in the null case have been studied in several works. In the pioneering work [27], the author studied the spectrum of $\mathbf{W}$ under the null case assuming the bandwidth $h=p$, and showed that the complicated kernel matrix can be well approximated by a low rank perturbed Gram matrix; see [27, Theorem 2.2] or (A.19) below for more details. It was concluded that studying $\mathbf{W}$ in the null case is closely related to studying the principal component analysis (PCA) of the dataset with a low rank perturbation. The results in [27] were extended to more general kernels beyond the exponential kernel function, and $\mathcal{Y}$ can be anisotropic under some moment assumptions; for example, more general kernels with Gaussian noise were studied in [15], the Gaussian assumption was removed in [22, 11], the convergence rates of individual eigenvalues of $\mathbf{W}$ were provided in [18], etc.

However, much less is known for the nonnull case (1.4). To our knowledge, the most relevant works are [26, 29]. By assuming $h=p$ and $\mathbb{E}\|\mathbf{z}_{i}\|_{2}^{2}\asymp p$, in [26] the author showed that $n^{-1}\mathbf{W}$ could be well approximated by $n^{-1}\exp(-2\upsilon)\mathbf{W}_{1}$, which connects the noisy observation and the clean signal. In fact, the results of [26, Theorem 2.1] were established for a class of smooth kernels, and the noise could be anisotropic. In a recent work [1], the authors established a concentration inequality for $\|\mathbf{W}-\mathbb{E}\mathbf{W}\|$ and used it to study spectral clustering.

We mention that other types of kernel random matrices related to (1.1) have also been studied in the literature. For example, the inner-product type kernel random matrices of the form $f(\mathbf{x}_{i}^{\top}\mathbf{x}_{j})$, where $f$ is some kernel function, have drawn attention among researchers. In [27], the author showed that under some regularity assumptions, inner-product kernel matrices could also be approximated by a Gram matrix with low rank perturbations, which has been generalized in [15, 22, 11, 32, 43, 52]. Another example is the Euclidean random matrices of the form $F(\mathbf{x}_{i}-\mathbf{x}_{j})$ arising from physics, where $F$ is some measurable function. In particular, the empirical spectral distribution has been extensively studied in, for example, [47, 10, 38], among others.

Finally, our present work is also in the line of research regarding the robustness of GL. There have been efforts in different settings along this direction. For example, the perturbation of the eigenvectors of GL was studied in [46], the consistency and robustness of spectral clustering were analyzed in [1, 63, 9], and the robustness of DM was studied in [59, 55], among others.

1.4. An overview of our results

Our main contribution is a systematic treatment of the spectrum of the GL constructed from a high-dimensional random point cloud corrupted by high-dimensional noise; that is, we consider the nonnull setup (1.4). We establish a connection between the spectra of the noisy and clean affinity matrices under different choices of bandwidth and different SNRs by extensively expanding the results reported in [26] with tools from random matrix theory. More specifically, we allow the signal strength to diverge with $n$ so that the relative strength of signal and noise is captured, and characterize the spectral distribution of the noisy affinity matrix by studying how the signal and noise interact and how different bandwidths impact the spectra of $\mathbf{W}$ and $\mathbf{A}$. Motivated by our theoretical results, we propose an adaptive bandwidth selection algorithm with theoretical support. The proposed method utilizes a certain quantile of the pairwise distances as in (3.6), where the quantile can be selected using our proposed Algorithm 1. We provide detailed results when $d=1$ in (1.4), and discuss how to extend the results to $d>1$ and why some cases are challenging when $d>1$. Our result, when combined with existing manifold learning results like [35, 34, 25] and others, paves the way to a better understanding of how GL-based algorithms behave in practice, and provides theoretical support for the commonly applied bandwidth scheme, when $\{\mathbf{z}_{i}\}_{i=1}^{n}$ is distributed on a low-dimensional, compact and smooth manifold.

We now provide a heuristic explanation of our results assuming $d=1$ and $h=p$. In Section 2, when $\alpha<1$, we show that the noisy kernel affinity matrix $\mathbf{W}$ provides limited useful information about the clean affinity matrix $\mathbf{W}_{1}$ in (1.9). Specifically, similar to the null setting, $\mathbf{W}$ can be well approximated by a Gram matrix with a finite rank perturbation; see Section 2.2 for more details. When $\alpha\geq 1$, $\mathbf{W}$ becomes closer to $\mathbf{W}_{1}$ up to a universal scaling and an isotropic shift (c.f. (2.14)). Note that some related results have been established for $\alpha=1$ in [26]. We show that the convergence rate is adaptive to $\alpha$, and provide a quantification of how the eigenvalues of the noisy affinity matrix $\mathbf{W}$ converge. We mention that when $\alpha>1$, the classic bandwidth choice $h=p$ needs to be modified to reflect the underlying signal structure. Specifically, if $h=p$ and $\alpha>1$, we have $\mathbf{W}\approx\mathbf{I}$. These results are stated in Theorem 2.7. Similar results hold for the transition matrix $\mathbf{A}$, and they are reported in Section 2.4.

Motivated by the fact that $\mathbf{W}$ becomes trivial when $\alpha>1$ is large, in Section 3 we focus on the case $h=p+\lambda$. When $\alpha<1$, the results are analogous to the setting $h=p$. When $\alpha\geq 1$, the spectrum of $\mathbf{W}$ is dramatically different. Specifically, when $\alpha=1$, $\mathbf{W}$ is close to $\mathbf{W}_{1}$ up to a universal scaling and an isotropic shift (c.f. (3.1)), and when $\alpha>1$, $\mathbf{W}$ is close to $\mathbf{W}_{1}$ without any scaling or shift. Moreover, besides the top $\log(n)$ eigenvalues of $\mathbf{W}$, the other eigenvalues are trivial; see Theorem 3.1 for more details.

In practice, $\lambda$ is unknown and not easy to estimate, especially when $\{\mathbf{z}_{i}\}$ are sampled from a nonlinear geometric object. For practical implementation, we propose a bandwidth selection algorithm in Section 3.2. It turns out that the proposed bandwidth selection algorithm bypasses the challenge of estimating $\lambda$, and the result coincides with that determined by the ad hoc bandwidth selection method commonly applied by practitioners; see [56, 44], among many others. Note that the bandwidth issue is also discussed in [1]. To our knowledge, our result is the first step toward a theoretical understanding of this common practice.

Finally, we point out the technical ingredients of our proof. We focus on the bandwidth $h=p$ and take $d=1$ as an example. First, when $\lambda$ is bounded, i.e., $\alpha=0$, the spectrum of $\mathbf{W}$ has been studied in [27] using an entrywise expansion to the order of two (with a third order error). When $0<\alpha<0.5$, the above strategy can be adapted straightforwardly with some modifications. When $0.5\leq\alpha<1$, we need to expand the entries to a higher order depending on $\alpha$. With this expansion, we show that except for the Gram matrix, the other parts are either of fixed rank or negligible; see Sections B.1 and B.2 for more details. Second, when $\alpha\geq 1$, the entrywise expansion strategy fails and we need to conduct a more careful analysis. In particular, thanks to the Hadamard product representation of $\mathbf{W}$ in (1.10), we need to analyze the spectral norm of $\mathbf{W}_{1}$ carefully; see Lemma A.10, which is of independent interest. Third, to explore the number of informative eigenfunctions, we need to investigate the individual eigenvalues of $\mathbf{W}_{1}$ using Mehler’s formula (c.f. (A.32)); see Section B.3 for more details.

The paper is organized in the following way. The main results are described in Section 2 for the bandwidth $h\asymp p$ and in Section 3 for the bandwidth $h\asymp(\lambda+p)$, together with an adaptive bandwidth selection algorithm. The numerical studies are reported in Section 4. The paper ends with the discussion and conclusion in Section 5. The background and necessary results for the proof are listed in Section A, and the proofs of the main results are given in Sections B and C.

Conventions. We systematically use the following notations. For a smooth function $f(x)$, denote by $f^{(k)}(x)$, $k=0,1,2,\cdots$, the $k$-th derivative of $f(x)$. For two sequences $a_{n}$ and $b_{n}$ depending on $n$, the notation $a_{n}=O(b_{n})$ means that $|a_{n}|\leq C|b_{n}|$ for some constant $C>0$, and $a_{n}=o(b_{n})$ means that $|a_{n}|\leq c_{n}|b_{n}|$ for some positive sequence $c_{n}\downarrow 0$ as $n\rightarrow\infty$. We also write $a_{n}\asymp b_{n}$ if $a_{n}=O(b_{n})$ and $b_{n}=O(a_{n})$.

2. Main results (I): classic bandwidth choice $h\asymp p$

In this section, we state the main results regarding the spectra of $\mathbf{W}$ and $\mathbf{A}$ when the bandwidth satisfies $h\asymp p$. For definiteness, without loss of generality, we assume that $h=p$. Such a choice has appeared in many works on kernel methods and manifold learning, e.g., [27, 18, 22, 42, 28, 29]. Since our focus is the nonlinear interaction of the signal and noise in the kernel method, in what follows, we focus on reporting the results for $d=1$ and omit the subscripts in (1.8). We refer the readers to Remark 2.9 and Section A.7.2 below for a discussion of the setting when $d>1$.

2.1. Some definitions

We start by introducing the notion of stochastic domination [30]. It makes precise statements of the form “$\mathsf{X}^{(n)}$ is bounded with high probability by $\mathsf{Y}^{(n)}$ up to small powers of $n$”.

Definition 2.1 (Stochastic domination).

Let

\mathsf{X}=\big\{\mathsf{X}^{(n)}(u):n\in\mathbb{N},\ u\in\mathsf{U}^{(n)}\big\},\qquad \mathsf{Y}=\big\{\mathsf{Y}^{(n)}(u):n\in\mathbb{N},\ u\in\mathsf{U}^{(n)}\big\},

be two families of nonnegative random variables, where $\mathsf{U}^{(n)}$ is a possibly $n$-dependent parameter set. We say that $\mathsf{X}$ is stochastically dominated by $\mathsf{Y}$, uniformly in the parameter $u$, if for all small $\upsilon>0$ and large $D>0$, there exists $n_{0}(\upsilon,D)\in\mathbb{N}$ so that we have

\sup_{u\in\mathsf{U}^{(n)}}\mathbb{P}\Big(\mathsf{X}^{(n)}(u)>n^{\upsilon}\mathsf{Y}^{(n)}(u)\Big)\leq n^{-D},

for sufficiently large $n\geq n_{0}(\upsilon,D)$. We interchangeably use the notation $\mathsf{X}=O_{\prec}(\mathsf{Y})$ or $\mathsf{X}\prec\mathsf{Y}$ if $\mathsf{X}$ is stochastically dominated by $\mathsf{Y}$, uniformly in $u$, when there is no danger of confusion. In addition, we say that an $n$-dependent event $\Omega\equiv\Omega(n)$ holds with high probability if for a $D>1$, there exists $n_{0}=n_{0}(D)>0$ so that

\mathbb{P}(\Omega)\geq 1-n^{-D},

when $n\geq n_{0}$.

For any constant $c\in\mathbb{R}$, denote by $T_{c}$ the shift operator that shifts a probability measure $\mu$ defined on $\mathbb{R}$ by $c$; that is,

(2.1) T_{c}\mu(I):=\mu(I-c),

where $I\subset\mathbb{R}$ is a measurable subset. Throughout the rest of the paper, for any symmetric $n\times n$ matrix $\mathbf{B}$, the eigenvalues are ordered in decreasing fashion so that $\lambda_{1}(\mathbf{B})\geq\lambda_{2}(\mathbf{B})\geq\cdots\geq\lambda_{n}(\mathbf{B})$.

Definition 2.2.

For a given probability measure $\mu$ and $n\in\mathbb{N}$, define the $j$-th typical location of $\mu$ as $\gamma_{\mu}(j)$; that is,

(2.2) \int_{\gamma_{\mu}(j)}^{\infty}\mu(\mathrm{d}x)=\frac{j}{n}\,,

where $j=1,\ldots,n$. The quantity $\gamma_{\mu}(j)$ is also called the $j/n$-quantile of $\mu$.

Let $\mathbf{Y}\in\mathbb{R}^{p\times n}$ be the data matrix associated with the point cloud $\mathcal{Y}$; that is, the $i$-th column of $\mathbf{Y}$ is $\mathbf{y}_{i}$. Denote the empirical spectral distribution (ESD) of $\mathbf{Q}=\frac{\sigma^{2}}{p}\mathbf{Y}^{\top}\mathbf{Y}$ as

\mu_{\mathbf{Q}}(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{\lambda_{i}(\mathbf{Q})\leq x\}},\ x\in\mathbb{R}.

Here, $\sigma>0$ stands for the standard deviation of the scaled noise $\sigma\mathbf{y}_{i}$. It is well known that in the high-dimensional region (1.7), $\mu_{\mathbf{Q}}$ has the same asymptotic properties [40] as the so-called Marchenko-Pastur (MP) law [45], denoted as $\mu_{c_{n},\sigma^{2}}$, satisfying

(2.3) \mu_{c_{n},\sigma^{2}}(I)=(1-c_{n})_{+}\chi_{I}(0)+\zeta_{c_{n},\sigma^{2}}(I)\,,

where $\chi_{I}$ is the indicator function, $(a)_{+}:=0$ when $a\leq 0$ and $(a)_{+}:=a$ when $a>0$,

(2.4) \mathrm{d}\zeta_{c_{n},\sigma^{2}}(x)=\frac{1}{2\pi\sigma^{2}}\frac{\sqrt{(\lambda_{+}-x)(x-\lambda_{-})}}{c_{n}x}\mathrm{d}x\,,

$\lambda_{+}=(1+\sigma^{2}\sqrt{c_{n}})^{2}$ and $\lambda_{-}=(1-\sigma^{2}\sqrt{c_{n}})^{2}$.

Denote

(2.5) \tau\equiv\tau(\lambda):=2\left(\frac{\lambda}{p}+1\right),

and for any kernel function $f(x)$, define

(2.6) \varsigma\equiv\varsigma(\lambda):=f(0)+2f^{\prime}(\tau)-f(\tau)\,.

Till the end of this paper, we will use $\tau$ and $\varsigma$ for simplicity, unless we need to specify the values of $\tau$ and $\varsigma$ at certain points of $\lambda$. As mentioned below (1.1), we focus on $f(x)=\exp(-\upsilon x)$ throughout the paper unless otherwise specified. Recall the shift operator defined in (2.1). Denote

(2.7) \nu_{\lambda}:=T_{\varsigma(\lambda)}\mu_{c_{n},-2f^{\prime}(\tau(\lambda))},

where $\mu_{c_{n},-2f^{\prime}(\tau(\lambda))}$ is the MP law defined in (2.3) with $\sigma^{2}$ replaced by $-2f^{\prime}(\tau(\lambda))$.
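
As a purely numerical companion to (2.3)-(2.7), the following sketch (Python/NumPy; the function name and grid size are ours) computes the typical locations $\gamma_{\nu_{\lambda}}(j)$ of Definition 2.2 for $f(x)=\exp(-\upsilon x)$, for which $-2f^{\prime}(\tau)=2\upsilon\exp(-\upsilon\tau)$ and $\varsigma=1-(1+2\upsilon)\exp(-\upsilon\tau)$. For simplicity, the possible point mass at $0$ in (2.3) is ignored and the continuous part is renormalized; this is a simplification rather than the paper's exact convention.

import numpy as np

def typical_locations(n, p, lam, upsilon=1.0, grid_size=20000):
    """Approximate typical locations gamma_{nu_lambda}(j), j = 1..n, of (2.7).

    Follows (2.3)-(2.7) with f(x) = exp(-upsilon x); the atom at 0 is ignored."""
    c_n = n / p
    tau = 2.0 * (lam / p + 1.0)                                   # (2.5)
    sigma2 = 2.0 * upsilon * np.exp(-upsilon * tau)               # -2 f'(tau)
    shift = 1.0 - (1.0 + 2.0 * upsilon) * np.exp(-upsilon * tau)  # varsigma in (2.6)

    # Continuous MP density (2.4) on [lambda_-, lambda_+], using the paper's convention.
    lam_m = (1.0 - sigma2 * np.sqrt(c_n)) ** 2
    lam_p = (1.0 + sigma2 * np.sqrt(c_n)) ** 2
    x = np.linspace(max(lam_m, 1e-12), lam_p, grid_size)
    dens = np.sqrt(np.maximum((lam_p - x) * (x - lam_m), 0.0)) / (2.0 * np.pi * sigma2 * c_n * x)

    cdf = np.cumsum(dens) * (x[1] - x[0])
    cdf /= cdf[-1]                                                # renormalize the continuous part
    j = np.arange(1, n + 1)
    gamma = np.interp(1.0 - j / n, cdf, x)                        # gamma(j): upper j/n-quantile (2.2)
    return gamma + shift                                          # apply the shift T_varsigma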

2.2. Spectrum of kernel affinity matrices: low signal-to-noise region $0\leq\alpha<1$

In this subsection, we present the results when the SNR is small in the sense that $0\leq\alpha<1$ in Theorems 2.3 and 2.5. In this setting, there does not exist a natural connection between the spectrum of $\mathbf{W}$ and that of the signal part $\mathbf{W}_{1}$ in (1.9). More specifically, even though the spectrum of $\mathbf{W}$ can be studied, $\|\mathbf{W}-\mathbf{W}_{1}\|$ is not close to zero.

In the first result, we consider the case when $\lambda$ is bounded from above; that is, $\alpha=0$. In this bounded region, as in the null case studied in [27, 18] (see Lemma A.9 for more details), the spectrum is governed by the MP law, except for a few outlying eigenvalues. These eigenvalues come either from the kernel function expansion, which will be detailed in the proof, or, when the signal is above some threshold, from the Gram matrix $\mathbf{Q}_{x}=\frac{1}{p}\mathbf{X}^{\top}\mathbf{X}$, where $\mathbf{X}\in\mathbb{R}^{p\times n}$ is the data matrix associated with the noisy observations $\mathbf{x}_{i}$ defined in (1.4). This result is not surprising, since the signal is asymptotically negligible compared with the noise.

Theorem 2.3 (Bounded region).

Suppose (1.1) and (1.4)-(1.8) hold true, $d=1$ and $h=p$. Moreover, we assume that $\lambda$ is a fixed constant. Set

\mathsf{S}:=\begin{cases}3,&\lambda\leq\sqrt{c_{n}};\\ 4,&\lambda>\sqrt{c_{n}}.\end{cases}

For any given small $\epsilon>0$, when $n$ is sufficiently large, for some constant $C>0$, with probability at least $1-O(n^{-1/2})$, we have

(2.8) \left|\lambda_{i}(\mathbf{W})-\gamma_{\nu_{0}}(i)\right|\leq Cn^{-1/4},\ \mathsf{S}<i\leq(1-\epsilon)n,

where $\nu_{0}$ is defined in (2.7).

Remark 2.4.

In Theorem 2.3, we focus on reporting the bulk eigenvalues of $\mathbf{W}$. In this case, the outlying eigenvalues mainly come from the kernel function expansion, which we call the “kernel effect” hereafter, and the resulting Gram matrix. For example, as we will see in the proofs in Section B, we have that $\lambda_{1}(\mathbf{W})=n\exp(-\tau\upsilon)+o_{\prec}(1)$ and $\lambda_{2}(\mathbf{W})=\|\mathrm{Sh}_{1}(\tau)+\mathrm{Sh}_{2}(\tau)\|+o_{\prec}(1)$, where $\mathrm{Sh}_{1}(\tau)$ and $\mathrm{Sh}_{2}(\tau)$ are defined in (A.20). Moreover, we mention that Theorem 2.3 holds for a more general kernel function as in [27, 18]; that is, (1.1) is replaced by

(2.9) \mathbf{W}(i,j)=f\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}}{h}\right),\ 1\leq i,j\leq n,

where $f\in C^{3}(\mathbb{R})$ is monotonically decreasing, bounded, and $f(2)>0$. Moreover, we remark that in (2.8) we can relax $\epsilon>0$ to $\epsilon=0$ under the additional assumption that $|c_{n}-1|\geq\tau$ for some constant $\tau>0$, which is a standard assumption in the random matrix theory literature guaranteeing that the smallest eigenvalue of the Gram matrix is bounded from below; for instance, see the monograph [31]. Finally, we mention that in the pure noise setting, i.e., $\lambda=0$, the asymptotics of the spectrum of $\mathbf{W}$ (equivalently, $\mathbf{W}_{y}$ in this setting) has been established in [18] using a different approach, where the results hold for $\epsilon=0$ regardless of the ratio $n/p$. However, the bound in [18] is weaker than what we show here; that is, the rate is $n^{-1/9}$ when $i$ is bounded, and the rate $n^{-1/4}$ only appears when $i\asymp n$.
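
The bulk approximation (2.8) can be examined numerically along the following lines. This is a hedged sketch (Python/NumPy); all names, as well as the use of an independent Gaussian Gram matrix as a Monte Carlo proxy for the typical locations $\gamma_{\nu_{0}}(i)$, are our choices and not the procedure used in the proofs.

import numpy as np

def bulk_check(n=300, p=300, lam=1.0, upsilon=0.5, seed=0):
    """Compare the bulk eigenvalues of W (bounded region, h = p) with a proxy
    for gamma_{nu_0}(i): the ordered eigenvalues of varsigma I - 2 f'(tau) G^T G / p,
    built from an independent Gaussian matrix G."""
    rng = np.random.default_rng(seed)
    Z = np.zeros((n, p))
    Z[:, 0] = np.sqrt(lam) * rng.standard_normal(n)    # clean signal, d = 1
    X = Z + rng.standard_normal((n, p))                 # noisy observations (1.4)

    s = (X ** 2).sum(axis=1)
    D2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)
    eig_W = np.sort(np.linalg.eigvalsh(np.exp(-upsilon * D2 / p)))[::-1]

    tau = 2.0 * (lam / p + 1.0)                                   # (2.5)
    sigma2 = 2.0 * upsilon * np.exp(-upsilon * tau)               # -2 f'(tau)
    shift = 1.0 - (1.0 + 2.0 * upsilon) * np.exp(-upsilon * tau)  # varsigma in (2.6)
    G = rng.standard_normal((p, n))
    proxy = np.sort(sigma2 * np.linalg.eigvalsh(G.T @ G / p))[::-1] + shift

    S, k = 4, int(0.9 * n)                              # skip outliers, look at the bulk
    return np.max(np.abs(eig_W[S:k] - proxy[S:k]))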

In the second result, we study the case when the signal strength $\lambda$ diverges with $n$, but slowly; that is, $\lambda=\lambda(n)\asymp n^{\alpha}$, where $0<\alpha<1$. In this region, since $\alpha<1$, the signal is still weaker than the noise, and again it is asymptotically negligible. Thus, it is not surprising to see that the noise dominates and the spectral distribution is almost like the MP law, except for the first few eigenvalues. Again, as in the bounded region, these finitely many outlying eigenvalues come from the interaction of the nonlinear kernel and the Gram matrix.

Theorem 2.5 (Slowly divergent region).

Suppose (1.1) and (1.4)-(1.8) hold true, $d=1$ and $h=p$. For any given small $\epsilon>0$, when $n$ is sufficiently large, the following holds with probability at least $1-O(n^{-1/2})$:

  1. (1)

    When $0<\alpha<0.5-\epsilon$, for some constant $C>0$, we have that

    (2.10) |\lambda_{i}(\mathbf{W})-\gamma_{\nu_{0}}(i)|\leq C\max\left\{n^{-1/4},n^{\epsilon}\frac{\lambda}{\sqrt{n}}\right\},\ 4<i\leq(1-\epsilon)n.
  2. (2)

    When $0.5-\epsilon\leq\alpha<1$, denote

    (2.11) \mathfrak{d}\equiv\mathfrak{d}(\alpha):=\left\lceil\frac{1}{1-\alpha}\right\rceil+1.

    For some constant $C>0$, there exists some integer $\mathsf{K}$ satisfying

    4\leq\mathsf{K}\leq C4^{\mathfrak{d}},

    so that with high probability, for all $\mathsf{K}<i\leq(1-\epsilon)n$, we have that

    (2.12) |\lambda_{i}(\mathbf{W})-\gamma_{\nu_{0}}(i)|\leq C\max\left\{p^{\mathcal{B}(\alpha)},\frac{\lambda}{p}\right\},

    where $\mathcal{B}(\alpha)<0$ is defined as

    (2.13) \mathcal{B}(\alpha)=(\alpha-1)\left\lceil\frac{1}{1-\alpha}\right\rceil+\alpha\,.

As in Theorem 2.3, since the outlying eigenvalues are impacted by both the kernel effect and the signal, we focus on reporting the bulk eigenvalues. Similar to the discussion in Remark 2.4, the outlying eigenvalues can be figured out from the proof in Section B. Moreover, the number of outlying eigenvalues is adaptive to $\alpha$. As can be seen from Theorem 2.5, for any small $\epsilon>0$, when $0<\alpha<0.5-\epsilon$, the results are similar to (2) of Theorem 2.3 except for the convergence rates. When $\alpha\geq 0.5-\epsilon$, there will be more, but still finitely many, outlying eigenvalues, which arise from a higher order expansion in the proof. Finally, we mention that Theorem 2.5 holds for a more general kernel function like that in (2.9).

We remark that when $\alpha<1$ and $\lambda>\sqrt{c_{n}}$, the non-bulk eigenvalues (i.e., outlying eigenvalues) “seem” to be useful for understanding the signal. For example, we may potentially utilize the number of outliers to determine whether the signal strength is stronger than $\sqrt{c_{n}}$. However, to the best of our knowledge, there is limited literature on utilizing this information via the GL, since $\mathbf{W}$ ($\mathbf{A}$ respectively) is not close to $\mathbf{W}_{1}$ ($\mathbf{A}_{1}$ respectively) in this region, except for some relevant work in [28] when $1/2<\alpha<1$. Recall that some of the non-bulk eigenvalues are generated by the kernel effect instead of the signal. As commented in Remark 2.4, while it is true that when $\lambda$ is bounded the kernel effect can be quantified, when $\lambda$ diverges, we lose the ability to distinguish these outliers. In particular, when $1/2\leq\alpha<1$, although we show in (2) of Theorem 2.5 that the number of non-bulk eigenvalues is finite and depends on the signal strength, we do not know the exact number of outliers or the relation between the signal and the outliers. Therefore, it is challenging to recover the signal using this information. For example, when the signal is supported on a linear subspace so that there are multiple spikes, to our knowledge it is challenging to estimate the dimension via the GL.

We emphasize that the GL approach is very different from PCA. Note that for the $d$-spiked model, only the $d$ largest eigenvalues can possibly detach from the bulk and be located when PCA is applied. However, for the GL approach, even a single spike can lead to multiple outlying eigenvalues, as in (2) of Theorem 2.5. This shows that even for a simple $1$-dim linear manifold that is realized as a $1$-dim linear subspace in $\mathbb{R}^{p}$, the nonlinear method via the GL is very different from the linear method via PCA. More details on this aspect can be found in Section 5. The problem becomes more challenging if the underlying structure is nonlinear, even under the assumption that the manifold model can be reduced to the low rank spiked model of Section A.1 that we focus on in this paper.

Finally, we point out that when $1/2<\alpha<1$, although it is challenging to obtain precise information about the clean signal counterpart from a direct application of the standard GL via $\mathbf{W}$ (c.f. Theorem 2.5) or $\mathbf{A}$ (c.f. Corollary 2.10), we could consider a variant of the GL via the transition matrix $\mathbf{A}$ obtained by zeroing out the diagonal elements of $\mathbf{W}$, as proposed in [29]. It has been shown in [29] that the zeroing-out strategy could help the analysis of noisy datasets. We refer the readers to Appendix A.7.1 for more details.

Remark 2.6.

When $\alpha<1$, although it is still unclear how to use the bulk eigenvalues of the noisy observation to extract information about the clean signal counterpart, this is a starting point for detecting the existence of a strong signal (i.e., $\alpha\geq 1$). For example, when $\alpha<1$, Theorems 2.3 and 2.5 demonstrate that most of the eigenvalues follow the MP law, so that except for a finite number of outliers, two consecutive eigenvalues should be close to each other within a distance of $o(1)$. Under the alternative (i.e., $\alpha\geq 1$), since the noisy GL is close to the clean GL as shown in Section 2.3 below, two consecutive eigenvalues can be separated by a distance of constant order.

2.3. Spectrum of kernel affinity matrices: high signal-to-noise region $\alpha\geq 1$

In this subsection, we present results showing that the spectra of $\mathbf{W}$ and $\mathbf{W}_{1}$ in (1.9) can be connected after proper scaling when the SNR is relatively large, i.e., $\alpha\geq 1$. We first prepare some notation. Denote

(2.14) \mathbf{W}_{a_{1}}:=\exp(-2\upsilon)\mathbf{W}_{1}+(1-\exp(-2\upsilon))\mathbf{I}_{n}.

Clearly, $\mathbf{W}_{a_{1}}$ is closely related to $\mathbf{W}_{1}$ via a scaling and an isotropic shift, and $\mathbf{W}_{1}$ contains only the signal information. On the other hand, note that $\mathbf{W}_{a_{1}}=[\exp(-2\upsilon)\bm{1}\bm{1}^{\top}+(1-\exp(-2\upsilon))\mathbf{I}_{n}]\circ\mathbf{W}_{1}$, where the matrix $\exp(-2\upsilon)\bm{1}\bm{1}^{\top}+(1-\exp(-2\upsilon))\mathbf{I}_{n}$ comes from the first order Taylor expansion of $\mathbf{W}_{y}$. We introduce another affinity matrix that will be used when $\alpha$ is so large that the bandwidth $h\asymp p$ is relatively small compared with the signal strength. Define $\mathbf{W}_{c}\in\mathbb{R}^{n\times n}$ by

(2.15) \mathbf{W}_{c}(i,j)=\exp\left(-2\upsilon\frac{(\mathbf{z}_{i}-\mathbf{z}_{j})^{\top}(\mathbf{y}_{i}-\mathbf{y}_{j})}{h}\right),

and

(2.16) \widetilde{\mathbf{W}}_{a_{1}}:=\mathbf{W}_{a_{1}}\circ\mathbf{W}_{c}.

Note that $\widetilde{\mathbf{W}}_{a_{1}}$ differs from $\mathbf{W}_{a_{1}}$ by the matrix $\mathbf{W}_{c}$; it will be used when $\alpha\geq 2$.
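
A minimal sketch of the surrogate matrices (2.14)-(2.16), assuming simulated signal and noise matrices Z and Y (whose rows are $\mathbf{z}_{i}$ and $\mathbf{y}_{i}$) and the bandwidth $h=p$; the function names are ours.

import numpy as np

def sqdist(M):
    """Pairwise squared Euclidean distances between the rows of M."""
    s = (M ** 2).sum(axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * M @ M.T, 0.0)

def surrogates(Z, Y, upsilon=1.0):
    """Return W_{a_1} of (2.14), W_c of (2.15) and tilde W_{a_1} of (2.16) with h = p."""
    n, p = Z.shape
    W1 = np.exp(-upsilon * sqdist(Z) / p)                        # clean affinity matrix (1.9)
    Wa1 = np.exp(-2.0 * upsilon) * W1 + (1.0 - np.exp(-2.0 * upsilon)) * np.eye(n)

    g = (Z * Y).sum(axis=1)                                      # z_i^T y_i
    G = Z @ Y.T                                                  # G[i, j] = z_i^T y_j
    cross = g[:, None] + g[None, :] - G - G.T                    # (z_i - z_j)^T (y_i - y_j)
    Wc = np.exp(-2.0 * upsilon * cross / p)                      # (2.15)
    return Wa1, Wc, Wa1 * Wc                                     # (2.16)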

Our main result for this SNR region is Theorem 2.7 below. For some constant $C\equiv C(\alpha)>0$, denote

(2.17) \mathsf{T}_{\alpha}:=\begin{cases}C\log n,&\alpha=1;\\ Cn^{\alpha-1},&1<\alpha<2.\end{cases}
Theorem 2.7.

Suppose (1.1) and (1.4)-(1.8) hold true, $d=1$ and $h=p$.

  1. (1)

    When $1\leq\alpha<2$, we have that

    (2.18) \left\|\frac{1}{n}\mathbf{W}-\frac{1}{n}\mathbf{W}_{a_{1}}\right\|\prec n^{-1/2}.

    Moreover, for some universal large constant $D>2$, we have that for $i\geq\mathsf{T}_{\alpha}$ in (2.17),

    (2.19) \left|\lambda_{i}(\mathbf{W}_{a_{1}})-(1-\exp(-2\upsilon))\right|\prec n^{-D}.
  2. (2)

    When $\alpha\geq 2$, we have that

    (2.20) \left\|\frac{1}{n}\mathbf{W}-\frac{1}{n}\widetilde{\mathbf{W}}_{a_{1}}\right\|\prec n^{-\alpha/2}+n^{-3/2}.

    Furthermore, when $\alpha$ is larger and satisfies

    (2.21) \alpha>\frac{2}{t}+1

    for a constant $t\in(0,1)$, we have that with probability $1-O(n^{1-t(\alpha-1)/2})$, for some constant $C>0$,

    (2.22) \left\|\widetilde{\mathbf{W}}_{a_{1}}-\mathbf{I}_{n}\right\|\leq Cn\exp(-\upsilon(\lambda/p)^{1-t}),

    and consequently

    (2.23) \left\|\mathbf{W}-\mathbf{I}_{n}\right\|\leq n\exp(-\upsilon(\lambda/p)^{1-t}).

The scaling $n^{-1}$ in (2.18) is commonly used in the manifold learning and machine learning literature [1, 12, 26, 41, 53, 63]. On one hand, (1) of Theorem 2.7 shows that once the SNR is “relatively large” ($1\leq\alpha<2$), we may access the spectrum of the clean affinity matrix $\mathbf{W}_{1}$ via the noisy affinity matrix $\mathbf{W}$, as described in (2.18), and the clean affinity matrix may contain useful information about the signal. In this case, the signal is strong enough to compete with the noise, so that we are able to recover the “top few” eigenvalues of the kernel matrix associated with the clean data via $\mathbf{W}_{a_{1}}$. In particular, (2.19) implies that we should focus on the top $\mathsf{T}_{\alpha}$ eigenvalues of $\mathbf{W}$, since the remaining eigenvalues are not informative. This coincides with what practitioners usually do in data analysis. Note that $\mathsf{T}_{\alpha}$ increases with $\alpha$, which fits our intuition, since the SNR becomes larger.

On the other hand, we find that the classic bandwidth choice $h\asymp p$ is not a good choice when the SNR is “too large” ($\alpha\geq 2$). First, (2) of Theorem 2.7 states that when $\alpha\geq 2$, since the bandwidth is too small compared with the signal strength, the noisy affinity matrix will be close to a mixture of signal and noise. In particular, when the signal is even stronger in the sense of (2.21), we will not be able to obtain information from the noisy affinity matrix. This can be understood as follows. Since the signal is far stronger than the noise, equivalently we could say that the signal is contaminated by “small” noise. However, since the bandwidth is set to $h=p$, which is “small” compared with the signal strength, the exponential decay of the kernel forces each point to be “blind” to the other points. In this case, $\mathbf{W}(i,j)$ is close to $0$ when $i\neq j$, and the affinity matrix $\mathbf{W}$ is close to the identity matrix, which leads to (2.23). As we will see in Theorem 3.1, all these issues can be addressed using a different bandwidth. For the readers’ convenience, in Figure 1, we use a simulation to summarize the phase transitions observed in Theorems 2.3, 2.5 and 2.7. For the numerical accuracy of the established theorems, we refer the readers to Section 4.1 below for more details.

Figure 1. An illustration of the phase transition phenomenon of the affinity matrix spectrum. The noise $\mathbf{y}_{i}$, $i=1,\ldots,n$, is i.i.d. sampled from $\mathcal{N}(0,I_{p})$ and the clean signal is one-dimensional with $\mathbf{z}_{i1}$ i.i.d. sampled from $\mathcal{N}(0,\lambda)$, where $\lambda>0$. We take $n=p=300$. The kernel is $f(x)=\exp(-x/2)$ and the bandwidth is $h=p$. For each $\lambda$, the 300 eigenvalues are plotted in descending order. In the bounded and slowly divergent regions, the second to the 300-th eigenvalues are zoomed in for better visualization.
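
A sketch along the following lines (Python with NumPy and Matplotlib; the representative values of $\lambda$ for each region and all names are our choices) reproduces the type of experiment summarized in Figure 1 with $n=p=300$, $f(x)=\exp(-x/2)$ and $h=p$.

import numpy as np
import matplotlib.pyplot as plt

n = p = 300
upsilon = 0.5                                    # f(x) = exp(-x / 2)
rng = np.random.default_rng(0)

def affinity_spectrum(lam):
    """Eigenvalues (descending) of W built from x_i = z_i + y_i with d = 1 and h = p."""
    Z = np.zeros((n, p))
    Z[:, 0] = np.sqrt(lam) * rng.standard_normal(n)
    X = Z + rng.standard_normal((n, p))
    s = (X ** 2).sum(axis=1)
    D2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)
    return np.sort(np.linalg.eigvalsh(np.exp(-upsilon * D2 / p)))[::-1]

# One representative lambda per SNR region.
for lam, label in [(1.0, "bounded"), (n ** 0.6, "slowly divergent"),
                   (float(n), "alpha = 1"), (float(n) ** 2, "alpha = 2")]:
    plt.plot(affinity_spectrum(lam), label=label)
plt.legend(); plt.xlabel("index"); plt.ylabel("eigenvalue"); plt.show()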

Technically, in the previous results when $\alpha<1$, the kernel function $f(x)$ only contributes to the measure $\nu_{0}$ via (2.7), and its decay rate does not play a role in the conclusions. However, once the signal becomes stronger, the kernel decay rate plays an essential role. We focus on $f(x)=\exp(-\upsilon x)$ to simplify the discussion in this paper, and postpone the discussion of general kernels to our future work.

Remark 2.8.

When $1\leq\alpha<2$ in Theorem 2.7, we have shown that besides the top $\mathsf{T}_{\alpha}$ eigenvalues of $\mathbf{W}_{a_{1}}$, the remaining eigenvalues of $\mathbf{W}_{a_{1}}$ are trivial. Since the error bound in (2.19) is much smaller than the one in (2.18), the smaller eigenvalues of $\mathbf{W}$ may fluctuate more and have a non-trivial ESD. The ESD of $\mathbf{W}$ is best formulated using the Stieltjes transform [2]. Here we provide a remark on the approximation of the Stieltjes transforms. Denote $\mathbf{W}_{b_{1}}$ as

(2.24) \mathbf{W}_{b_{1}}:=\Big(\frac{2\upsilon\exp(-2\upsilon)}{p}\mathbf{Y}^{\top}\mathbf{Y}+2\upsilon\exp(-4\upsilon)\mathbf{I}_{n}\Big)\circ\mathbf{W}_{1}\,.

Note that compared with $\mathbf{W}_{a_{1}}$, the matrix $\frac{2\upsilon\exp(-2\upsilon)}{p}\mathbf{Y}^{\top}\mathbf{Y}+2\upsilon\exp(-4\upsilon)\mathbf{I}_{n}$ in $\mathbf{W}_{b_{1}}$ comes from a higher order Taylor expansion of $\mathbf{W}_{y}$. Let $m_{\mathbf{W}}(z)$ and $m_{\mathbf{W}_{b_{1}}}(z)$ be the Stieltjes transforms of $\mathbf{W}$ and $\mathbf{W}_{b_{1}}$, respectively. In Section D, we show that

(2.25) \sup_{z\in\mathcal{D}}|m_{\mathbf{W}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|\prec\frac{1}{n^{1-\alpha/2}\eta^{2}}\,,

where $\mathcal{D}$ is the domain of spectral parameters defined as

(2.26) \mathcal{D}:=\mathcal{D}(1/4,\mathsf{a}):=\Big\{z=E+\mathrm{i}\eta:\mathsf{a}\leq E\leq\frac{1}{\mathsf{a}},\ n^{-1/2+\alpha/4+\mathsf{a}}\leq\eta\leq\frac{1}{\mathsf{a}}\Big\},

and $0<\mathsf{a}<1$ is some fixed (small) constant. This result helps us further peek into the intricate relationship between the clean and noisy affinity matrices. However, it does not provide information about each individual eigenvalue.

Remark 2.9.

In the above theorems, we focus on reporting the results for the case $d=1$ in (1.4). We now discuss how to generalize our results to $d>1$. There are two major cases: the first is when all signal strengths are in the same SNR region, and the second is when the signal strengths might be in different SNR regions. We start with the first case, for which there are four regions to discuss.

  1. (1)

    When all $\alpha_{i}$, $1\leq i\leq d$, are very large in the sense that they satisfy the condition in (2) of Theorem 2.7, following the same argument, we can immediately conclude that the results stated in (2) of Theorem 2.7 still hold true by setting $\alpha:=\max_{i}\alpha_{i}$ in (2.21).

  2. (2)

    When $\alpha_{i}>1$, $1\leq i\leq d$, satisfy condition (1) of Theorem 2.7, the results (2.18) and (2.20) of Theorem 2.7 hold with some changes. Indeed, $n^{-\alpha/2}$ in (2.20) should be replaced by $\sum_{l=1}^{d}n^{-\alpha_{l}/2}$ and $Cn^{\alpha-1}$ in (2.17) should be replaced by $Cn^{\min_{l}\alpha_{l}-1}$. Moreover, in this setup, since the noisy affinity matrix will be close to a matrix depending on the clean affinity matrix $\mathbf{W}_{1}$, where $\mathbf{W}_{1}(i,j)=\exp\left(-\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{h}\right)$, which in general does not follow the MP law, the spectrum of $\mathbf{W}$ varies according to the specific values of $\lambda_{i}$, $1\leq i\leq d$. More discussion of this setup with $d=2$, together with simulations, can be found in Section A.7.2.

  3. (3)

    When $\alpha_{i}=1$, $i=1,\ldots,d$, this is the region where our argument cannot be directly applied and we need a substantial generalization of the proof. In particular, our proof essentially relies on Mehler’s formula in Section A.6, which, to our knowledge, has only been proved for $d=1$. Nevertheless, we believe it is possible to generalize this formula to $d>1$ following the arguments of [33].

  4. (4)

    When $0\leq\alpha_{i}<1$, $1\leq i\leq d$, we could directly generalize the result to $d>1$ with all the key ingredients, i.e., Lemmas A.4–A.9, in Section A.5. Specifically, to extend Theorem 2.3 concerning $\alpha_{i}=0$, $1\leq i\leq d$, denote $r=0$ if $\lambda_{i}\leq\sqrt{c_{n}}$ for all $1\leq i\leq d$, $r=d$ if $\lambda_{i}>\sqrt{c_{n}}$ for all $1\leq i\leq d$, or $1\leq r<d$ if $\lambda_{1}\geq\cdots\geq\lambda_{r}>\sqrt{c_{n}}\geq\lambda_{r+1}\geq\cdots\geq\lambda_{d}$. The proof of Theorem 2.3 still holds by updating $\mathsf{S}=3+r$. In fact, according to (B.13), the $\mathsf{O}$ matrix therein is of rank three and the Gram matrix $\mathbf{X}^{\top}\mathbf{X}$ can generate $r$ outliers according to Lemma A.6. Theorem 2.5 still holds when $d>1$. Following its proof, when $0<\alpha_{i}<0.5-\epsilon$ for all $1\leq i\leq d$, the first part of the theorem still holds when $i>d+r$, and the rate $\lambda/\sqrt{n}$ should be replaced by $\sum_{l=1}^{d}\lambda_{l}/\sqrt{n}$. In fact, in this setting, all $\lambda_{i}>\sqrt{c_{n}}$ and each will generate an outlier. Moreover, when all $\alpha_{i}\geq 0.5-\epsilon$, by replacing (2.11) with $\sum_{l=1}^{d}\mathfrak{d}_{l}$, where $\mathfrak{d}_{l}=\left\lceil\frac{1}{1-\alpha_{l}}\right\rceil+1$, we conclude that the second part of the theorem still holds true by replacing $\lambda/p$ with $\sum_{l=1}^{d}\lambda_{l}/p$ and $p^{\mathcal{B}(\alpha)}$ with $\sum_{l=1}^{d}p^{\mathcal{B}(\alpha_{l})}$. We emphasize that in the above settings, since $\tau$ defined in (2.5) satisfies $\tau\rightarrow 2$ as $n\rightarrow\infty$, the bulk eigenvalues can be characterized by the same MP law as in (2.8) and (2.12). However, the number of outliers may depend on the signal strengths $\lambda_{i}$, $1\leq i\leq d$. See Section A.7.2 for more discussion and a simulation with $d=2$.

The second case includes various combinations, some of which might be challenging. We discuss only one setup, assuming the signal strengths fall in two different regions; that is, there exists $1<r<d$ such that

\alpha_{1}\geq\alpha_{2}\geq\cdots\geq\alpha_{r}\geq 1>\alpha_{r+1}\geq\cdots\geq\alpha_{d}\,.

Theorem 2.7 still holds by replacing $\alpha$ by $\max_{i}\alpha_{i}$ and $\mathbf{W}_{1}$ in (2.14) by $\mathsf{W}_{1}$, where $\bm{z}_{i}=({\mathbf{z}}_{i1},\cdots,{\mathbf{z}}_{ir},0,\cdots,0)$ and $\mathsf{W}_{1}(i,j)=\exp\left(-\frac{\|\bm{z}_{i}-\bm{z}_{j}\|_{2}^{2}}{h}\right)$; i.e., the clean affinity matrix is defined using only those components with large SNRs. That is to say, the spectrum of $\mathbf{W}$ for the value of $d$ is close to that of the clean affinity matrix for the value of $r$ with the same signals $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{r}$. The detailed statements and proofs are similar to the case $d=1$, except for extra notational complication. We defer more discussion of this setup with $d=2$ to Section A.7.2.

In short, there are still several open challenges when $d>1$, and we will explore them in our future work.

2.4. Spectrum of transition matrices

In this subsection, we state the main results for the transition matrix $\mathbf{A}$ defined in (1.2). Even though there is an extra normalization step by the degree matrix $\mathbf{D}$, we will see that most of the spectral study of $\mathbf{A}$ boils down to that of $\mathbf{W}$. In what follows, we provide the counterparts of the results in Sections 2.2 and 2.3 for $\mathbf{A}$. Similar discussions to those in Remarks 2.8 and 2.9 also hold.

For $\mathbf{W}_{a_{1}}$ defined in (2.14), denote $\mathbf{A}_{a_{1}}=\mathbf{D}_{a_{1}}^{-1}\mathbf{W}_{a_{1}}$, in analogy with the definition (1.2). Similarly, for $\widetilde{\mathbf{W}}_{a_{1}}$ defined in (2.16), denote $\widetilde{\mathbf{A}}_{a_{1}}=\widetilde{\mathbf{D}}_{a_{1}}^{-1}\widetilde{\mathbf{W}}_{a_{1}}$.

Corollary 2.10.

Suppose (1.1)-(1.8) hold true, $d=1$, $h=p$ and $f(x)=\exp(-\upsilon x)$. When $0\leq\alpha<1$, the results of Theorems 2.3 and 2.5 hold for the eigenvalues of $n\mathbf{A}$ by replacing $\mathbf{W}$ with $n\mathbf{A}$ and the measure $\nu_{0}$ with $\check{\nu}_{0}=T_{\varsigma(0)}\mu_{c_{n},-2f^{\prime}(\tau(0))/(f(\tau(\lambda)))}$. When $1\leq\alpha<2$, for (1) of Theorem 2.7, the counterpart of (2.18) reads as

\|\mathbf{A}-\mathbf{A}_{a_{1}}\|\prec n^{\frac{\alpha-2}{2}},

and the counterpart of (2.19) is

\lambda_{i}(\mathbf{A}_{a_{1}})\prec n^{\frac{\alpha-3}{2}}.

Moreover, when $\alpha\geq 2$, for (2) of Theorem 2.7, the counterpart of (2.20) reads as

\|\mathbf{A}-\widetilde{\mathbf{A}}_{a_{1}}\|\prec n^{-\frac{1}{2}}.

The remaining parts hold by replacing $\widetilde{\mathbf{W}}_{a_{1}}$ and $\mathbf{W}$ with $\widetilde{\mathbf{A}}_{a_{1}}$ and $\mathbf{A}$, respectively.

3. Main results (II): a different bandwidth choice $h\asymp(p+\lambda)$

As discussed after Theorem 2.7, when the SNR is large, the classic bandwidth choice $h\asymp p$ is too small compared with the signal. For example, according to (2) of Theorem 2.7, we cannot obtain any information about the clean signal when $\alpha\geq 2$ if $h\asymp p$. To address this issue, we consider a different bandwidth $h\asymp(p+\lambda)$, where $\lambda$ is the signal strength. We show that this signal-dependent bandwidth results in a meaningful spectral convergence result, which is stated in Theorem 3.1. As in Section 2, we focus on the setting $d=1$ and set $\lambda:=\lambda_{1}$. The discussion for the setting $d>1$ is similar to that in Remark 2.9; we only state the differences here. When $0\leq\alpha_{i}\leq 1$, $i=1,\ldots,d$, we have $h\asymp\sum_{l=1}^{d}\lambda_{l}+p\asymp p$, so all the arguments in (3) and (4) of Remark 2.9 directly apply. When $\alpha_{i}>1$, $1\leq i\leq d$, which corresponds to the strong signal cases (1) and (2) in Remark 2.9, following the proof of (3.5) below, a similar argument to case (2) of Remark 2.9 still applies. For definiteness, below we state our results for the spectra of $\mathbf{W}$ and $\mathbf{A}$ assuming $h=p+\lambda$ in Section 3.1. Since $\lambda$ is usually unknown in practice, in Section 3.2 we propose a bandwidth selection algorithm for practical implementation. With this algorithm, even without knowledge of the signal strength, we can still obtain meaningful spectral results; see Corollary 3.2.

3.1. Spectra of affinity and transition matrices

In this subsection, we state the results for the spectra of $\mathbf{W}$ and $\mathbf{A}$ when $h=\lambda+p$. Denote

(3.1) \mathbf{W}_{a_{2}}=\exp\left(-\frac{2p\upsilon}{h}\right)\mathbf{W}_{1}+\left(1-\exp\left(-\frac{2p\upsilon}{h}\right)\right)\mathbf{I}_{n},

where $\mathbf{W}_{1}$ is constructed using the bandwidth $h=\lambda+p$.

Theorem 3.1.

Suppose (1.1)-(1.8) hold true, $d=1$ and $h=\lambda+p$. The following results hold.

  1. (1)

    When $0\leq\alpha<1$, Theorems 2.3 and 2.5 hold with $\nu_{0}$ replaced by

    \widetilde{\nu}_{0}=T_{\varsigma_{h}}\mu_{c_{n},\eta}\,,

    where

    (3.2) \eta:=\frac{2p\upsilon\exp(-2p\upsilon/h)}{h},\qquad \varsigma_{h}:=\varsigma_{h}(\tau):=1-\frac{2\upsilon p}{h}\exp(-\upsilon\tau p/h)-\exp(-\upsilon\tau p/h).
  2. (2)

    When $\alpha\geq 1$, we have that

    (3.3) \left\|\frac{1}{n}\mathbf{W}-\frac{1}{n}\mathbf{W}_{a_{2}}\right\|\prec n^{-1/2},

    and for some large constants $D>2$ and $C>0$, we have that when $i\geq C\log n$,

    (3.4) \left|\lambda_{i}(\mathbf{W}_{a_{2}})-(1-\exp(-2p\upsilon/(p+\lambda)))\right|\prec n^{-D}.

    Moreover, when $\alpha>1$, we have that

    (3.5) \left\|\frac{1}{n}\mathbf{W}-\frac{1}{n}\mathbf{W}_{1}\right\|\prec n^{-1/2}+n^{1-\alpha}.

Finally, similar results hold for the transition matrix $\mathbf{A}$ by replacing $\frac{1}{n}\mathbf{W}$, $\frac{1}{n}\mathbf{W}_{a_{2}}$ and $\frac{1}{n}\mathbf{W}_{1}$ in (3.3)-(3.5) by $\mathbf{A}$, $\mathbf{A}_{a_{2}}$ and $\mathbf{A}_{1}$ respectively, where $\mathbf{A}_{a_{2}}$ and $\mathbf{A}_{1}$ are defined by plugging $\mathbf{W}_{a_{2}}$ and $\mathbf{W}_{1}$ into (1.2).

With this bandwidth, we have addressed the issue encountered in (2) of Theorem 2.7. Specifically, according to Theorem 3.1, when $\alpha<1$, the noise dominates and $h=\lambda+p\asymp p$ does not lead to any essential difference in the spectrum compared with that from the fixed bandwidth $h=p$. On the other hand, when $\alpha\geq 1$, such a bandwidth choice contributes significantly to the spectrum. In particular, compared with (2.17), $\mathbf{W}_{a_{2}}$ only has $O(\log n)$ nontrivial eigenvalues. Combined with (3.3), by properly choosing the bandwidth, once the SNR is relatively large, the noisy affinity matrix $\mathbf{W}$ captures the spectrum of the clean affinity matrix $\mathbf{W}_{1}$ via $\mathbf{W}_{a_{2}}$. In addition, when $\alpha>1$, we can replace $\mathbf{W}_{a_{2}}$ with $\mathbf{W}_{1}$ directly. Finally, we mention that in the small SNR region $1/2<\alpha\leq 1$, modifying the transition matrix $\mathbf{A}$ by zeroing out the diagonal terms before normalization [29] could be useful. For more discussion, we refer the readers to Section A.7.1.
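
To see the effect numerically, a hedged sketch like the following (Python/NumPy; the choice $\alpha=2.5$ and all names are ours) contrasts the classic bandwidth $h=p$ with $h=p+\lambda$ for a strong signal: with $h=p$ the noisy $\mathbf{W}$ degenerates towards the identity (cf. (2.23)), while with $h=p+\lambda$ it stays close to $\mathbf{W}_{1}$ (cf. (3.5)).

import numpy as np

def sqdist(M):
    """Pairwise squared Euclidean distances between the rows of M."""
    s = (M ** 2).sum(axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * M @ M.T, 0.0)

def bandwidth_comparison(n=300, p=300, alpha=2.5, upsilon=1.0, seed=0):
    """Return n^{-1} ||W - W_1|| for h = p and for h = p + lambda when alpha > 1."""
    rng = np.random.default_rng(seed)
    lam = float(n) ** alpha
    Z = np.zeros((n, p))
    Z[:, 0] = np.sqrt(lam) * rng.standard_normal(n)
    X = Z + rng.standard_normal((n, p))
    out = {}
    for name, h in [("h = p", float(p)), ("h = p + lambda", p + lam)]:
        W = np.exp(-upsilon * sqdist(X) / h)
        W1 = np.exp(-upsilon * sqdist(Z) / h)          # clean affinity matrix with the same h
        out[name] = np.linalg.norm(W - W1, 2) / n
    return out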

3.2. An adaptive choice of bandwidth

While the above result connects the spectra of the noisy and clean affinity matrices, in general $\lambda$ is unknown. In this subsection, we provide an adaptive choice of $h$ depending on the dataset, without providing an estimator for $\lambda$. Such a choice enables us to recover the results of Theorem 3.1.

Given some constant $0<\omega<1$, we choose $h\equiv h(\omega)$ according to

(3.6) \int_{0}^{h}\mathrm{d}\mu_{\mathtt{dist}}=\omega\,,

where $\mu_{\mathtt{dist}}$ is the empirical distribution of the pairwise squared distances $\{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}\}$, $i\neq j$. In this subsection, when there is no danger of confusion, we abuse the notation and denote the affinity and transition matrices constructed using $h$ from (3.6) as $\mathbf{W}$ and $\mathbf{A}$ respectively, and define $\mathbf{W}_{1}$ in (1.9) and $\mathbf{W}_{a_{2}}$ in (3.1) with $h$ in (3.6).
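
In code, (3.6) amounts to taking the $\omega$-quantile of the off-diagonal pairwise squared distances; the following sketch (Python/NumPy; the function name is ours) makes this explicit.

import numpy as np

def quantile_bandwidth(X, omega):
    """Bandwidth h(omega) of (3.6): the omega-quantile of the off-diagonal
    pairwise squared distances ||x_i - x_j||^2; rows of X are the points x_i."""
    s = (X ** 2).sum(axis=1)
    D2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)
    off_diag = D2[~np.eye(X.shape[0], dtype=bool)]      # exclude i = j
    return np.quantile(off_diag, omega)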

As we show in the proof, when $0\leq\alpha<1$, $h\asymp p$ (see (C.1) in the proof), and when $\alpha\geq 1$, $h\asymp\lambda\asymp p+\lambda$ (see (C.2) in the proof). In this sense, the following corollary recovers the results of Theorem 3.1, while the choice of bandwidth is practical.

Corollary 3.2.

Suppose (1.1)-(1.8) hold true and $d=1$. For any $0<\omega<1$, let $h$ be the bandwidth chosen according to (3.6). Then Theorem 3.1 holds true.

Since Corollary 3.2 recovers the results of Theorem 3.1, the same comments after Theorem 3.1 and the discussions about manifolds in Subsection 3.3 hold for Corollary 3.2. We comment that in practice, researchers usually choose the 25th or 50th percentile of all pairwise distances, or the distances of nearest-neighbor pairs, as the bandwidth; see, for example, [56, 44]. Corollary 3.2 provides a theoretical justification for this commonly applied ad hoc bandwidth selection method.

Next, we discuss how to choose $\omega$ in practice. Based on the obtained theoretical results, when $\alpha\geq 1$, the outliers represent the signal information (c.f. (3.1)), except for those associated with the kernel effect. Thus, we propose Algorithm 1 to choose $\omega$ adaptively; it seeks a bandwidth such that the affinity matrix has the largest number of outliers.

Algorithm 1 Adaptive choice of ω\omega
  1. (1)

    Take fixed constants ωL<ωU,\omega_{L}<\omega_{U}, where 0<ωL,ωU<1,0<\omega_{L},\omega_{U}<1, e.g., ωL=0.05\omega_{L}=0.05 and ωU=0.95\omega_{U}=0.95. For some large integer TT, we construct a partition of the interval [ωL,ωU],[\omega_{L},\omega_{U}], denoted as 𝒫=(ω0,ω1,,ωT)\mathcal{P}=(\omega_{0},\omega_{1},\cdots,\omega_{T}), where ωi=ωL+iT(ωUωL)\omega_{i}=\omega_{L}+\frac{i}{T}(\omega_{U}-\omega_{L}).

  2. (2)

    For the sequence of quantiles {ωi}i=0T,\{\omega_{i}\}_{i=0}^{T}, calculate the associated bandwidths according to (3.6), denoted as {hi}i=0T.\{h_{i}\}_{i=0}^{T}.

  3. (3)

    For each 0iT,0\leq i\leq T, calculate the eigenvalues of the affinity matrix 𝐖i\mathbf{W}_{i} constructed using the bandwidth hi.h_{i}. Denote the eigenvalues of 𝐖i\mathbf{W}_{i} in decreasing order as {λk(i)}k=1n.\{\lambda_{k}^{(i)}\}_{k=1}^{n}.

  4. (4)

    For a given threshold 𝗌>0\mathsf{s}>0 satisfying 𝗌0\mathsf{s}\rightarrow 0 as n,n\rightarrow\infty, denote

    𝗄(ωi):=max1kn1{k|λk(i)λk+1(i)1+𝗌}.\mathsf{k}(\omega_{i}):=\max_{1\leq k\leq n-1}\left\{k\bigg{|}\,\frac{\lambda_{k}^{(i)}}{\lambda_{k+1}^{(i)}}\geq 1+\mathsf{s}\right\}.
  5. (5)

    Choose the quantile ω\omega such that

    (3.7) ω=max[argmaxωi𝗄(ωi)].\omega=\max\Big{[}\underset{\omega_{i}}{\mathrm{argmax}}\ \mathsf{k}(\omega_{i})\Big{]}\,.

Note that we need a threshold 𝗌\mathsf{s} in step 4 of Algorithm 1. We suggest adopting the resampling method established in [50, Section 4] and [19, Section 4.1]. This method provides a choice of 𝗌\mathsf{s} to distinguish the outlying eigenvalues from the bulk eigenvalues given the ratio cn=n/pc_{n}=n/p. The main rationale supporting this approach is that the bulk eigenvalues are close to each other (c.f. Remark 2.8), and hence the ratios of consecutive bulk eigenvalues will be close to one. Moreover, in step 5 of the algorithm, when there are multiple ωi\omega_{i} that achieve the argmax, we choose the largest one for the purpose of robustness. A minimal implementation sketch of Algorithm 1 is given below.
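The following minimal Python sketch (function names are ours and purely illustrative; it reuses the bandwidth_from_quantile sketch from Section 3.2) illustrates Algorithm 1 with the Gaussian kernel f(x)=exp(x/2)f(x)=\exp(-x/2). The threshold 𝗌\mathsf{s} is supplied by the user, e.g., from the resampling method of [50, 19], which we do not reproduce here.

```python
import numpy as np

def affinity_matrix(X, h, f=lambda x: np.exp(-x / 2.0)):
    """Kernel affinity matrix W(i, j) = f(||x_i - x_j||_2^2 / h), cf. (1.1)."""
    sq_norms = np.sum(X ** 2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return f(np.clip(D2, 0.0, None) / h)

def choose_omega(X, s, omega_L=0.05, omega_U=0.95, T=91):
    """Adaptive choice of omega following Algorithm 1.

    s : threshold separating outlying and bulk eigenvalues; in the paper it is
        chosen by the resampling method of [50, 19], not reproduced here.
    """
    omegas = omega_L + np.arange(T + 1) / T * (omega_U - omega_L)     # step (1)
    n_outliers = []
    for omega in omegas:
        h = bandwidth_from_quantile(X, omega)                         # step (2), cf. (3.6)
        evals = np.linalg.eigvalsh(affinity_matrix(X, h))[::-1]       # step (3), decreasing order
        ratios = evals[:-1] / evals[1:]                               # step (4)
        ks = np.flatnonzero(ratios >= 1.0 + s) + 1                    # candidate indices k
        n_outliers.append(ks.max() if ks.size else 0)
    n_outliers = np.asarray(n_outliers)
    # step (5): among the maximizers, pick the largest quantile for robustness
    return omegas[np.flatnonzero(n_outliers == n_outliers.max())[-1]]
```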

Next, we numerically illustrate how the ω\omega chosen by Algorithm 1 depends on α\alpha. Consider the nonlinear manifold S1S^{1}, the canonical 11-dim sphere, isometrically embedded in the first two axes of p\mathbb{R}^{p} and scaled by λ\sqrt{\lambda}, where λ>0\lambda>0; that is, 𝒛i:=λ[cosθi,sinθi,0,,0]p\bm{z}_{i}:=\sqrt{\lambda}[\cos\theta_{i},\sin\theta_{i},0,\cdots,0]^{\top}\in\mathbb{R}^{p}, where θi\theta_{i} is uniformly sampled from [0,2π][0,2\pi]. Next, we add Gaussian white noise to 𝒛i\bm{z}_{i} via 𝐱i=𝒛i+𝐲ip,\mathbf{x}_{i}=\bm{z}_{i}+\mathbf{y}_{i}\in\mathbb{R}^{p}, where 𝐲i𝒩(𝟎,𝐈)\mathbf{y}_{i}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), i=1,2,,n,i=1,2,\cdots,n, are noise independent of 𝒛i.\bm{z}_{i}. We consider this example since its topology is nontrivial and we know the ground truth. In Figure 2, we record the chosen ω\omega for different α\alpha from 𝐖\mathbf{W}. When α\alpha is small (i.e., α<1\alpha<1), since the bandwidth choice does not essentially influence the resulting spectrum, the algorithm offers a large quantile in light of (3.7). When λ\lambda is large (i.e., α>2\alpha>2), we get a small quantile. Intuitively, the larger bandwidth selected when α\alpha is small can be understood as the algorithm combating the noise by averaging over a wider neighborhood. In Figure 2, we also record the chosen ω\omega for different α\alpha from 𝐀\mathbf{A}, where we simply replace the role of 𝐖\mathbf{W} by 𝐀\mathbf{A}. We observe the same behavior as that from 𝐖\mathbf{W}. Finally, note that the choices of ω\omega are insensitive to the aspect ratio cnc_{n} in (1.7). This finding suggests that the bandwidth selection is not sensitive to the ambient space dimension.
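For reproducibility, a minimal sketch of the data generation used in this illustration (noisy S1S^{1} with λ=nα\lambda=n^{\alpha}; the function name and variable names are ours) is given below; the quantile ω\omega can then be obtained with the choose_omega sketch above.

```python
import numpy as np

def noisy_circle(n, p, alpha, seed=0):
    """x_i = sqrt(lambda) [cos(theta_i), sin(theta_i), 0, ..., 0] + y_i with
    lambda = n**alpha and y_i ~ N(0, I_p), as in the Figure 2 experiment."""
    rng = np.random.default_rng(seed)
    lam = float(n) ** alpha
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    Z = np.zeros((n, p))
    Z[:, 0] = np.sqrt(lam) * np.cos(theta)
    Z[:, 1] = np.sqrt(lam) * np.sin(theta)
    return Z + rng.standard_normal((n, p))

# Example mirroring one panel of Figure 2 (c_n = n/p = 0.5, s from the resampling method):
# X = noisy_circle(n=300, p=600, alpha=1.5)
# omega = choose_omega(X, s=0.24)
```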

Refer to caption
Refer to caption
Figure 2. Adaptive choices of ω\omega by the affinity matrix 𝐖\mathbf{W} and the transition matrix 𝐀\mathbf{A} are shown on the left and right subplots respectively. In the simulation, we use the Gaussian random vectors with λ=nα\lambda=n^{\alpha} under the setting n=300n=300 for different values of cnc_{n} in (1.7). The kernel function is f(x)=exp(x/2)f(x)=\exp(-x/2) and the bandwidth is chosen adaptively according to (3.6), where ω\omega is chosen using Algorithm 1 with T=91T=91, ωL=0.05\omega_{L}=0.05, and ωU=0.95\omega_{U}=0.95. 𝗌=0.24,0.17,0.12\mathsf{s}=0.24,0.17,0.12 are chosen using the resampling method as in [50] for cn=0.5,1,2,c_{n}=0.5,1,2, respectively.

3.3. Connection with the manifold learning

To discuss the connection with manifold learning, we focus on Theorem 3.1 with α>1\alpha>1, and hence Corollary 3.2, which states that, by replacing 1n𝐖\frac{1}{n}\mathbf{W}, 1n𝐖a2\frac{1}{n}\mathbf{W}_{a_{2}} and 1n𝐖1\frac{1}{n}\mathbf{W}_{1} in (3.3)-(3.5) with 𝐀\mathbf{A}, 𝐀a2\mathbf{A}_{a_{2}} and 𝐀1\mathbf{A}_{1}, the relationship between the eigenvalues of 𝐀\mathbf{A} and 𝐀1\mathbf{A}_{1} is established when α>1\alpha>1. We know that all except the top Clog(n)C\log(n) eigenvalues of 𝐖\mathbf{W} are trivial according to (3.4), and by Weyl’s inequality, the top Clog(n)C\log(n) eigenvalues of 𝐀\mathbf{A} and 𝐀1\mathbf{A}_{1} differ by n1/2+n1αn^{-1/2}+n^{1-\alpha}. On the other hand, the eigenvalues of 𝐀1\mathbf{A}_{1} have been extensively studied in the literature. Below, we take the result in [41] as an example. Suppose the clean data 𝒵\mathcal{Z} is sampled from an mm-dim closed (compact without boundary) and smooth manifold, which is embedded in a dd-dim subspace of p\mathbb{R}^{p}, following a proper sampling condition on the sampling density function 𝗉\mathsf{p} (see Section A.1 for details of this setup). To link our result to that shown in [41], note that 𝐳i𝐳j2\|\mathbf{z}_{i}-\mathbf{z}_{j}\|^{2} is of order λ\lambda by assumption, so that the selected bandwidth is of the same order as 𝐳i𝐳j2\|\mathbf{z}_{i}-\mathbf{z}_{j}\|^{2}. Thus, since λ/(p+λ)1\lambda/(p+\lambda)\asymp 1 when nn\to\infty, the eigenvalues of 𝐀1\mathbf{A}_{1} converge to the eigenvalues of the integral operator

Th(x)=Mexp(υxy22)h(y)𝗉(y)𝑑V(y),Th(x)=\int_{M}\exp\left(-\upsilon\|x-y\|^{2}_{2}\right)h(y)\mathsf{p}(y)dV(y)\,,

where hh is a smooth function defined on MM and dVdV is the volume density. By combining the above facts, we conclude that under the high-dimensional noise setup, when the SNR is sufficiently large and the bandwidth is chosen properly, we can recover at least the top few eigenvalues of the associated integral operator from the noisy transition matrix 𝐀\mathbf{A}. Since our focus in this paper is not manifold learning itself but how the high-dimensional noise impacts the spectrum of GL, for more discussion and details about manifold learning, we refer readers to [25] and the citations therein.

4. Numerical studies

In this section, we conduct Monte Carlo simulations to illustrate the accuracy and usefulness of our results and the proposed algorithm. In Section 4.1, we conduct numerical simulations to illustrate the accuracy of our established theorems for various values of cn=0.5,1,2.c_{n}=0.5,1,2. We also show the impact of n.n. In Section 4.2, we examine the usefulness of our proposed Algorithm 1 and compare it with some methods in the literature on two manifolds of different dimensions.

4.1. Accuracy of our asymptotic results

In this subsection, we conduct numerical simulations to examine the accuracy of the established results. For simplicity, we focus on checking the results in Section 2 when h=p,h=p, which is the key part of the paper. Similar discussions can be applied to the results in Section 3 when h=λ+p.h=\lambda+p.

In Figure 3, we study the low SNR setting when 0α<10\leq\alpha<1 as in Section 2.2, particularly the closeness between the bulk eigenvalues of 𝐖\mathbf{W} and the quantiles of the MP law ν0\nu_{0} shown in (2.8) and (2.12). We also show that even for a relatively small value of n=200,n=200, our results are reasonably accurate for various values of cn=0.5,1,2.c_{n}=0.5,1,2. Then we study the region when α1\alpha\geq 1 as in Section 2.3. In Figure 4, we study the SNR region when 1α<21\leq\alpha<2 and check (2.18). Moreover, in Figure 5, we examine the accuracy of (2.23) when α\alpha is very large. Again, we find that our results are accurate even for a relatively small value of n=200n=200 under different settings of cn=0.5,1,2.c_{n}=0.5,1,2.

Since our results are stated in the asymptotic sense when nn is sufficiently large, in Figure 6 we examine how the value of nn impacts our results. For various values of cnc_{n} and SNRs, we find that our results are reasonably accurate once n100n\geq 100. We also see that when the SNR becomes large, our results are accurate even for small nn, like n<100n<100.

We point out that, for better visualization, in the above plots we report the sorted eigenvalues instead of their histogram. The main reason is that we focus on each individual eigenvalue rather than the global empirical distribution, which is already known in the literature. The simulation results are based on a single trial, which emphasizes the concentration with high probability established in our main results. For a histogram visualization, we would have to run several simulations and average them. Since this is not the main focus of the paper, we only generate one such plot in Figure 7, based on 1,000 trials.

Refer to caption
Refer to caption
Refer to caption
Figure 3. Low SNR setting. Here λ=p0.2\lambda=p^{0.2}, n=200n=200 with cn=n/p,h=pc_{n}=n/p,h=p and the random variables are standard Gaussian. To examine the bulk eigenvalues, we start from the 10th eigenvalue of 𝐖\mathbf{W}, i.e., λi(𝐖)\lambda_{i}(\mathbf{W}), where i10.i\geq 10. For the legend, the sample means the eigenvalues of 𝐖\mathbf{W} and the limit means the quantiles (c.f. (2.2)) of the MP law ν0\nu_{0} defined in (2.7).
Refer to caption
Refer to caption
Refer to caption
Figure 4. Moderate SNR region. Here λ=p1.9\lambda=p^{1.9}, n=200n=200 with cn=n/p,h=pc_{n}=n/p,h=p and the random variables are standard Gaussian. For the legend, the sample means the eigenvalues of 𝐖\mathbf{W} and the limit means the eigenvalues of 𝐖a1\mathbf{W}_{a_{1}} as in (2.18).
Refer to caption
Refer to caption
Refer to caption
Figure 5. Large SNR setting. Here λ=p5\lambda=p^{5}, n=200n=200 with cn=n/p,h=pc_{n}=n/p,h=p and the random variables are standard Gaussian. For the legend, the sample means the eigenvalues of 𝐖\mathbf{W} and the limit means unity as predicted in (2.23).
Refer to caption
Refer to caption
Refer to caption
Figure 6. Dependence on the sample size nn. We examine how the values of nn influence the results as in Figures 3–5. When the SNR is small such that λ=p0.2,\lambda=p^{0.2}, the error rate is defined as supi10|λi(𝐖)γν0(i)|.\sup_{i\geq 10}|\lambda_{i}(\mathbf{W})-\gamma_{\nu_{0}}(i)|. When the SNR is large such that α1,\alpha\geq 1, the error rate is defined as the operator norm of the difference between 𝐖\mathbf{W} and its corresponding limit matrix. Here cn=n/p,h=pc_{n}=n/p,h=p and the random variables are standard Gaussian. We conclude that once nn is relatively large, say n100,n\geq 100, our results will be reasonably accurate. We also observe that when the SNR is large, smaller values of nn already suffice for accuracy.
Refer to caption
Refer to caption
Refer to caption
Figure 7. Histogram of low SNR setting. The settings are the same as in the caption of Figure 3. The simulations are based on 1,000 repetitions. In the plots, we have removed the point mass.

4.2. Efficiency of our proposed bandwidth selection algorithm (Algorithm 1)

In this subsection, we show the usefulness of our proposed Algorithm 1 and compare it with some existing methods in the literature. We consider two manifolds of different dimensions, and compare the following four setups. (1) The clean affinity matrix 𝐖1\mathbf{W}_{1} with the bandwidth h=λ+ph=\lambda+p; (2) 𝐖\mathbf{W} constructed using our Algorithm 1; (3) 𝐖\mathbf{W} constructed using the bandwidth via (3.6) with ω=0.5,\omega=0.5, which has been used in [56, 44]; (4) 𝐖\mathbf{W} using the bandwidth h=ph=p as in [27, 18, 22, 42, 28, 29]. Since the accuracy of the eigenvalues has been discussed in Section 4.1, and although we do not study eigenvectors of GL theoretically in this paper, in what follows we demonstrate the eigenvectors of 𝐖\mathbf{W} and compare them with those of 𝐖1\mathbf{W}_{1} constructed from the clean signal {ξi}i=1n\{\xi_{i}\}_{i=1}^{n} to further understand the impact of bandwidth selection.

We start with a 1-dimensional smooth and closed manifold M1M_{1}, which is parametrized by Φ:uaR[2cos(u), 3(10.8e8cos(u)2)cos(2π(u/(2π))2),(10.8e8cos(u)2)sin(u), 0,,0]p\Phi:\,u\mapsto aR[2\cos(u),\,3(1-0.8e^{-8\cos(u)^{2}})\cos(2\pi(u/(2\pi))^{2}),\,\\ (1-0.8e^{-8\cos(u)^{2}})\sin(u),\,0,\cdots,0]^{\top}\in\mathbb{R}^{p}, where RO(p)R\in O(p), a>0a>0 and u(0,2π]u\in(0,2\pi]. In other words, the 1-dimensional manifold M1M_{1} is embedded in a 3-dim Euclidean subspace of p\mathbb{R}^{p}. Now, sample independently and uniformly nn points u1,,unu_{1},\ldots,u_{n} from 𝚄𝚗𝚒𝚏𝚘𝚛𝚖(0,2π)\mathtt{Uniform}(0,2\pi), and hence nn points ξi=Φ(ui)\xi_{i}=\Phi(u_{i}). Next, we add Gaussian white noise to ξi\xi_{i} via 𝐱i=ξi+𝐲ip,\mathbf{x}_{i}=\xi_{i}+\mathbf{y}_{i}\in\mathbb{R}^{p}, where 𝐲i𝒩(𝟎,𝐈)\mathbf{y}_{i}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), i=1,2,,n,i=1,2,\cdots,n, are noise independent of ξi.\xi_{i}. The results of embedding this manifold by the top three nontrivial eigenvectors are shown in Figure 8. We can see that with the proposed bandwidth selection algorithm, the embedding of the noisy data is closer to that from the clean data. To further quantify this closeness, we view the eigenvectors from 𝐖1\mathbf{W}_{1} as the truth, and compare them with the eigenvectors of 𝐖\mathbf{W} obtained with different bandwidths by evaluating the root mean square error (RMSE). Note that the sign ambiguity of eigenvectors is handled by taking the smaller value of vj(c)vj(w)2\|v^{(c)}_{j}-v^{(w)}_{j}\|_{2} and vj(c)+vj(w)2\|v^{(c)}_{j}+v^{(w)}_{j}\|_{2}, where vj(c)v^{(c)}_{j} is the jj-th eigenvector of 𝐖1\mathbf{W}_{1} associated with the jj-th largest eigenvalue and vj(w)v^{(w)}_{j} is the jj-th eigenvector of 𝐖\mathbf{W} associated with the jj-th largest eigenvalue. We repeat the random sampling 300 times, and plot the error bars with mean±\pmstandard deviation in Figure 9. It is clear that with the adaptive bandwidth selection algorithm, the top several eigenvectors of 𝐖\mathbf{W} are close to those of 𝐖1\mathbf{W}_{1}.
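A minimal sketch of this sign-aligned eigenvector comparison (the function name eigvec_rmse is ours and purely illustrative) is as follows.

```python
import numpy as np

def eigvec_rmse(W_clean, W_noisy, k=9):
    """Per-eigenvector RMSE between the top-k eigenvectors of two affinity
    matrices, resolving the sign ambiguity via min(||v - u||, ||v + u||)."""
    n = W_clean.shape[0]
    _, V_c = np.linalg.eigh(W_clean)          # columns ordered by increasing eigenvalue
    _, V_w = np.linalg.eigh(W_noisy)
    V_c = V_c[:, ::-1][:, :k]                 # keep eigenvectors of the k largest eigenvalues
    V_w = V_w[:, ::-1][:, :k]
    diffs = np.minimum(np.linalg.norm(V_c - V_w, axis=0),
                       np.linalg.norm(V_c + V_w, axis=0))
    return diffs / np.sqrt(n)                 # entrywise root mean squared error
```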

Refer to caption
Figure 8. Comparison of different bandwidth choices for cn=0.5.c_{n}=0.5. We consider the one-dim manifold M1M_{1} with a=20pa=20\sqrt{p} and take n=400.n=400. For the title in each subplot, Clean means the matrix 𝐖1,\mathbf{W}_{1}, Adap means using the bandwidth according to our proposed Algorithm 1, Medq means using the bandwidth via (3.6) with ω=0.5\omega=0.5 and h=ph=p means using the bandwidth h=p.h=p.
Refer to caption
Refer to caption
Refer to caption
Figure 9. From left to right: comparison of eigenvectors determined by different bandwidth choice schemes for cn=0.5, 1,2c_{n}=0.5,\ 1,2. We consider the one-dim manifold M1M_{1} with a=20pa=20\sqrt{p} and take n=400.n=400. The red, blue and black lines indicate the mean±\pmstandard deviation of RMSE when we compare the first nine eigenvectors evaluated from Clean and Adap, Clean and Medq, and Clean and h=ph=p, where Clean means the matrix 𝐖1\mathbf{W}_{1} with the bandwidth p+λp+\lambda, Adap means using the bandwidth according to our proposed Algorithm 1, Medq means using the bandwidth via (3.6) with ω=0.5\omega=0.5 and h=ph=p means using the bandwidth h=p.h=p.

Next, we consider the Klein bottle, which is a 22-dim compact and smooth manifold that cannot be embedded into a three-dimensional Euclidean space. First, set

(4.1) 𝐳i=aR[(2cos(𝐮i(1))+1)cos(𝐮i(2))(2cos(𝐮i(1))+1)sin(𝐮i(2))2sin(𝐮i(1))cos(𝐮i(2)/2)2sin(𝐮i(1))sin(𝐮i(2)/2)00]p,\displaystyle\mathbf{z}_{i}=aR\begin{bmatrix}(2\cos(\mathbf{u}_{i}(1))+1)\cos(\mathbf{u}_{i}(2))\\ (2\cos(\mathbf{u}_{i}(1))+1)\sin(\mathbf{u}_{i}(2))\\ 2\sin(\mathbf{u}_{i}(1))\cos(\mathbf{u}_{i}(2)/2)\\ 2\sin(\mathbf{u}_{i}(1))\sin(\mathbf{u}_{i}(2)/2)\\ 0\\ \vdots\\ 0\end{bmatrix}\in\mathbb{R}^{p},

where 𝐮i\mathbf{u}_{i}, i=1,,ni=1,\ldots,n, are sampled independently and uniformly from [0,2π]×[0,2π]2[0,2\pi]\times[0,2\pi]\subset\mathbb{R}^{2}, RO(p)R\in O(p), and a>0a>0 is the signal strength. In other words, we sample nonuniformly (uniform sampling in the parameter space induces a nonuniform density with respect to the volume measure) from the Klein bottle isometrically embedded in a 44-dimensional subspace of p\mathbb{R}^{p}. Next, we add noise to 𝐳i\mathbf{z}_{i} by setting 𝐱i=𝐳i+𝐲ip,\mathbf{x}_{i}=\mathbf{z}_{i}+\mathbf{y}_{i}\in\mathbb{R}^{p}, where 𝐲i𝒩(𝟎,𝐈)\mathbf{y}_{i}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), i=1,2,,n,i=1,2,\cdots,n, are noise independent of 𝐳i.\mathbf{z}_{i}. The eigenvectors under the mentioned four setups when the signal strength is equivalent to α=1\alpha=1 are shown in Figure 10, where cn=0.5c_{n}=0.5 and we plot the entries of different eigenvectors over 𝐮i\mathbf{u}_{i} with colors; that is, the color at 𝐮i\mathbf{u}_{i} represents the ii-th entry of the associated eigenvector. We apply the same RMSE evaluation used in the M1M_{1} example above; that is, we show the RMSE of the eigenvectors of different 𝐖\mathbf{W} when compared with those from 𝐖1\mathbf{W}_{1} as the truth. We repeat the random sampling 300 times, and plot the error bars with mean±\pmstandard deviation in Figure 11. It is clear that with the adaptive bandwidth selection algorithm, the top several eigenvectors of 𝐖\mathbf{W} are close to those of 𝐖1\mathbf{W}_{1}.
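For completeness, a minimal sketch of the noisy Klein bottle sampling in (4.1) is given below; for simplicity we take R=𝐈pR=\mathbf{I}_{p}, which does not change the affinity matrix, as discussed in Section A.1. The function name is ours and purely illustrative.

```python
import numpy as np

def noisy_klein_bottle(n, p, a, seed=0):
    """Sample n noisy points from the Klein bottle parametrization (4.1),
    placed in the first four coordinates of R^p (R = I_p for simplicity)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 2.0 * np.pi, size=(n, 2))
    Z = np.zeros((n, p))
    Z[:, 0] = (2.0 * np.cos(u[:, 0]) + 1.0) * np.cos(u[:, 1])
    Z[:, 1] = (2.0 * np.cos(u[:, 0]) + 1.0) * np.sin(u[:, 1])
    Z[:, 2] = 2.0 * np.sin(u[:, 0]) * np.cos(u[:, 1] / 2.0)
    Z[:, 3] = 2.0 * np.sin(u[:, 0]) * np.sin(u[:, 1] / 2.0)
    return a * Z + rng.standard_normal((n, p))
```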

Refer to caption
Figure 10. Comparison of different bandwidth choices for cn=0.5.c_{n}=0.5. We consider the Klein bottle parametrized in (4.1) with a=20p.a=20\sqrt{p}. Here n=800.n=800. For the ylabel on the first column, Clean means the matrix 𝐖1\mathbf{W}_{1} with the bandwidth p+λp+\lambda, Adap means using the bandwidth according to our proposed Algorithm 1, Medq means using the bandwidth via (3.6) with ω=0.5\omega=0.5 and h=ph=p means using the bandwidth h=p.h=p.
Refer to caption
Refer to caption
Refer to caption
Figure 11. From left to right: Comparison of eigenvectors determined by different bandwidth choice schemes for cn=0.5, 1,2c_{n}=0.5,\ 1,2. We consider the Klein bottle parametrized in (4.1) with a=20p.a=20\sqrt{p}. Here n=800.n=800. The red, blue and black lines indicate the mean±\pmstandard deviation of RMSE when we compare the first nine eigenvectors evaluated from Clean and Adap, Clean and Medq, and Clean and h=ph=p, where Clean means the matrix 𝐖1\mathbf{W}_{1} with the bandwidth p+λp+\lambda, Adap means using the bandwidth according to our proposed Algorithm 1, Medq means using the bandwidth via (3.6) with ω=0.5\omega=0.5 and h=ph=p means using the bandwidth h=p.h=p.

The results of the above two examples support the potential of our proposed bandwidth selection algorithm, particularly when compared with two bandwidth selection methods commonly considered in the literature. Note that since only the eigenvalue information is used in Algorithm 1, only partial information in the kernel random matrix is utilized. We hypothesize that by taking the eigenvectors into account, we could achieve a better bandwidth selection algorithm. Since the study of eigenvectors is beyond the scope of this paper, we will explore this possibility in our future work.

5. Discussion and Conclusion

We provide a systematic study of the spectrum of GL under the nonnull setup over different SNR regions, particularly when d=1d=1, and explore the impact of the chosen bandwidth. Specifically, we show that under a proper SNR and bandwidth, the spectrum of 𝐀1\mathbf{A}_{1} from the clean dataset can be extracted from that of 𝐀\mathbf{A} from the noisy dataset. We also provide a new algorithm that selects a suitable bandwidth by maximizing the number of “outliers” lying outside the spectral bulk associated with the noise.

Note that we need the assumption that the entries of 𝐲i\mathbf{y}_{i} are independent because our arguments depend on established results in the literature, for example, [8], which are proved under this assumption. Nevertheless, we believe that this assumption can be removed with extra technical effort. A natural strategy is to utilize the Gaussian comparison method in the random matrix theory literature, as in [31]. This strategy contains two steps. In the first step, we establish the results for Gaussian random vectors, for which uncorrelatedness is equivalent to independence. In the second step, we prove that the results hold for sub-Gaussian random vectors using the comparison arguments as in [31] under certain moment matching conditions. Since this is not the main focus of the current paper and the extension requires generalizing many existing results in the literature, for example, [8], we will pursue this direction in the future.

A comparison with principal component analysis (PCA) deserves a discussion. It is well known that when d=1d=1, so that the signal is sampled from a linear subspace, we can easily recover this linear subspace by any linear method like PCA. At first glance, it seems that the problem is resolved. However, if the purpose is studying the geometric and/or topological structure of the dataset, we may need further analysis. For example, if we want to answer a question like “is the 11-dim signal supported on two disjoint subsets (or is the 11-dim linear manifold disconnected)?”, then PCA cannot answer it and we need other tools (for example, GL via spectral clustering). The same fact holds when d>1d>1, where the nonlinear manifold is supported in a dd-dim subspace while its dimension is strictly smaller than dd. In this case, while we could possibly recover the dd-dim subspace by PCA, if we want to further study the nonlinear geometric and/or topological structure of the manifold, we need GL as the analysis tool. For readers who are interested in recovering the manifold structure via GL, see [35, 34, 25] and the literature therein. Based on our results, we know that even for the 1-dimensional linear manifold (for example, 𝐳i\mathbf{z}_{i} in (1.4)–(1.6) satisfies that 𝐞1𝐳i\mathbf{e}_{1}^{\top}\mathbf{z}_{i} is uniformly distributed on [0,1][0,1]), the linear and nonlinear methods behave significantly differently.

One significant difference that we shall emphasize is the phase transition phenomenon when a nonlinear method like GL is applied. While the discussion could be much more general, the essence is the same, so we continue the discussion with the 1-dimensional linear manifold below. For simplicity, we focus on the results in Section 2 for the kernel affinity matrix 𝐖\mathbf{W} when the bandwidth is chosen so that hp.h\asymp p. From Theorem 2.3 to Theorem 2.7, we observe several phase transitions for eigenvalues depending on the signal strength λ\lambda. In the case when λ\lambda is bounded, we observe three or four outlying eigenvalues according to the magnitude of λ,\lambda, where three of these outlying eigenvalues are from the kernel effect; see Lemma A.9 for details. When λ\lambda transitions from the subcritical region λcn\lambda\leq\sqrt{c_{n}} to the supercritical region λ>cn\lambda>\sqrt{c_{n}}, one extra outlier is generated due to the well-known BBP transition [3] phenomenon for the spiked covariance matrix model. Moreover, the rest of the non-outlying eigenvalues still obey a shifted MP law. As will be clear from the proof, in the bounded region, studying the affinity matrix is directly related to studying PCA of the dataset via the Gram matrix; see (A.22) for details. However, once the signal strength diverges, the spectrum of the affinity matrix behaves totally differently. In PCA, under the setting of (1.6), no matter how large λ\lambda is, we can only observe a single outlier and its strength increases as λ\lambda increases. Moreover, only the first eigenvalue is influenced by λ\lambda and the remaining eigenvalues satisfy the MP law, and this MP law is independent of λ.\lambda. Specifically, the second eigenvalue, i.e., the first non-outlying eigenvalue, will stick to the right-most edge of the MP law. We refer the readers to Lemma A.6 for more details and Figure 12 for an illustration. In contrast, regarding the nonlinear kernel method, the values and the number of outlying eigenvalues vary according to the signal strength. That is, the magnitude of λ\lambda has a possible impact on all eigenvalues of 𝐖\mathbf{W} through the kernel function. Heuristically, this is because PCA explores the point cloud in a linear fashion so that λ\lambda will only have an impact on the direction which explains the most variance, i.e., λ1(𝐂)\lambda_{1}(\mathbf{C}), where 𝐂\mathbf{C} is the covariance matrix that is directly related to the Gram matrix 𝐆\mathbf{G}. However, the kernel method deals with the data in a nonlinear way. As a consequence, all eigenvalues of 𝐖\mathbf{W} contain the signal information. In other words, unlike the bulk (non-outlier) eigenvalues of 𝐆\mathbf{G}, all eigenvalues of 𝐖\mathbf{W} change when λ\lambda increases. Consider the following three cases. First, as we will see in the proof below, the eigenvalue corresponding to the BBP transition in the supercritical region will increase when λ\lambda increases, and this eigenvalue will eventually be close to 11 when λ\lambda diverges following (2.21). Thus, we expect that the magnitude of this outlying eigenvalue follows an asymmetric bell curve as α\alpha increases. Second, the eigenvalues corresponding to the kernel effect (see (A.20)) will decrease when λ\lambda increases since the kernel function ff is a decreasing function. Third, as we can see from Theorems 2.5 and 2.7, the non-outlying eigenvalues become outlying eigenvalues when λ\lambda increases.
In Figure 13, we illustrate the above phenomena by investigating the behavior of some eigenvalues.

Refer to caption
Refer to caption
Figure 12. Outlier and the first non-outlying eigenvalues of 𝐆=1p𝐗𝐗.\mathbf{G}=\frac{1}{p}\mathbf{X}^{\top}\mathbf{X}. In this simulation, we use the Gaussian random vectors with covariance matrix (1.6) with λ=pα\lambda=p^{\alpha} under the setting n=200n=200 and cn=n/p.c_{n}=n/p. The left panel corresponds to the first eigenvalue, i.e., the outlying eigenvalue of 𝐆\mathbf{G}, and the right panel corresponds to the second eigenvalue, i.e., the first non-outlying eigenvalue, of 𝐆.\mathbf{G}.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 13. Eigenvalues of 𝐖\mathbf{W} for different α\alpha with the kernel f(x)=exp(x/2)f(x)=\exp(-x/2) and bandwidth h=ph=p. In this simulation, we use the Gaussian random vectors with covariance matrix (1.6) with λ=pα\lambda=p^{\alpha} with different cn=n/pc_{n}=n/p and n=200n=200. The top left panel corresponds to the outlying eigenvalue (i.e. λ1(𝐖)\lambda_{1}(\mathbf{W})) associated with Sh0(τ)\mathrm{Sh}_{0}(\tau) in (A.20), the top right panel stands for the outlying eigenvalue (i.e. λ2(𝐖)\lambda_{2}(\mathbf{W})) from the BBP transition effect, the bottom left panel is an eigenvalue (i.e. λ8(𝐖)\lambda_{8}(\mathbf{W})) which gradually detaches from the bulk spectrum and becomes an outlying eigenvalue when α\alpha increases. The bottom right panel is an eigenvalue sticking to the bulk spectrum (i.e. λ80(𝐖)\lambda_{80}(\mathbf{W})). In all cases, we see that the eigenvalues change when α\alpha increases.

While our results pave the way towards a foundation for statistical inference of various kernel-based unsupervised learning algorithms for various data analysis purposes, like visualization, dimension reduction, spectral clustering, etc., there are several problems we need to further study. First, as mentioned in Remark 2.9, there are some open problems, particularly when αi=1\alpha_{i}=1 for all i=1,,di=1,\ldots,d. Second, the behavior of eigenvectors of 𝐀\mathbf{A} from the noisy dataset and their relationship with the eigenfunctions of the Laplace-Beltrami operator under the manifold model need to be explored so that the above-mentioned data analysis purposes can be justified under the manifold setup. Third, in practice, noise may have a fat tail, the kernel may not be Gaussian [26, 29] or not isotropic [64], and the kernel function might be group-valued [58]. Fourth, the bandwidth selection algorithm is established under rather idealized assumptions, and its performance on real-world datasets needs further exploration. We will explore these problems in our future work.

Acknowledgment

The authors would like to thank the associate editor and two anonymous reviewers for many insightful comments and suggestions, which have resulted in a significant improvement of the paper.

Appendix A Preliminary results

In this section, we collect and prove some useful preliminary results, which will be used later in the technical proof.

A.1. From manifold to spiked model

In this subsection, we detail the claim in Section 1.2 and explain why the manifold model and (1.4) overlap. Suppose {𝐳i0}\{\mathbf{z}^{0}_{i}\} are i.i.d. samples of a pp-dim random vector ZZ, and suppose the range of ZZ is supported on an mm-dimensional, connected, smooth and compact manifold MM, where m1m\geq 1, isometrically embedded in p\mathbb{R}^{p} via ι\iota. Assume there exists dmd\geq m, where dd is independent of nn and dpd\leq p, so that the embedded MM is supported in a dd-dim affine subspace of p\mathbb{R}^{p}. Here we assume that dd is fixed. Since we consider the kernel distance matrix as in (1.1), without loss of generality, we can assume that 𝔼𝐳i0=0\mathbb{E}\mathbf{z}^{0}_{i}=0; that is, the embedded manifold is centered at 0. Also, assume the density function on the manifold associated with the sampling scheme is smooth with a positive lower bound. See [14] for a more detailed discussion of the manifold model and the relevant notion of density function. Suppose the noisy data is

(A.1) 𝐱i=λ𝐳i0+𝐲i,\mathbf{x}_{i}=\sqrt{\lambda}\mathbf{z}^{0}_{i}+\mathbf{y}_{i}\,,

where λ>0\lambda>0 represents the signal strength, and 𝐲i\mathbf{y}_{i} is the independent noise that satisfies (1.5). Denote 𝐳i:=λ𝐳i0\mathbf{z}_{i}:=\sqrt{\lambda}\mathbf{z}^{0}_{i}. Thus, there exists a p×pp\times p orthogonal matrix RpR_{p} such that

Rp𝐳i0=(zi10,zi20,,zid0,0,,0).R_{p}\mathbf{z}^{0}_{i}=(z^{0}_{i1},z^{0}_{i2},\cdots,z^{0}_{id},0,\cdots,0)^{\top}\,.

Since MM is compact, zij0z^{0}_{ij}, j=1,,dj=1,\ldots,d, is bounded and hence sub-Gaussian with the variance controlled by 𝒦:=maxx,yMι(x)ι(y)p\mathcal{K}:=\max_{x,y\in M}\|\iota(x)-\iota(y)\|_{\mathbb{R}^{p}}, which is independent of pp since the manifold is assumed to be fixed. On the other hand, since MM is connected, zij0z^{0}_{ij}, j=1,,dj=1,\ldots,d, is a continuous random variable. We can further choose another rotation R¯p\bar{R}_{p} so that the first dd coordinates of 𝐳i0\mathbf{z}^{0}_{i} are whitened; that is, R¯p𝐳i0\bar{R}_{p}\mathbf{z}^{0}_{i} has the covariance structure

diag(λ10,λ20,,λd0,0,,0)p×p,\texttt{diag}(\lambda^{0}_{1},\lambda^{0}_{2},\ldots,\lambda^{0}_{d},0,\ldots,0)\in\mathbb{R}^{p\times p}\,,

where by the assumption of the density function, we have λi0>0\lambda^{0}_{i}>0 for i=1,,di=1,\ldots,d. Note that λi0\lambda^{0}_{i} is controlled by 𝒦\mathcal{K}, and by the smoothness of the manifold, we could assume without loss of generality that cλi01/cc\leq\lambda^{0}_{i}\leq 1/c for c(0,1)c\in(0,1) for all i=1,,di=1,\ldots,d. Then, multiply the noisy dataset in (A.1) by R¯p\bar{R}_{p} from the left and get

(A.2) 𝐱¯i=𝐳¯i+𝐲¯i,\bar{\mathbf{x}}_{i}=\bar{\mathbf{z}}_{i}+\bar{\mathbf{y}}_{i}\,,

where 𝐱¯i:=R¯p𝐱i\bar{\mathbf{x}}_{i}:=\bar{R}_{p}\mathbf{x}_{i}, 𝐳¯i:=R¯p𝐳i\bar{\mathbf{z}}_{i}:=\bar{R}_{p}\mathbf{z}_{i} and 𝐲¯i:=R¯p𝐲i\bar{\mathbf{y}}_{i}:=\bar{R}_{p}\mathbf{y}_{i}. It is easy to see that since {𝐲i}\{\mathbf{y}_{i}\} are isotropic (c.f. (1.5)), {𝐲¯i}\{\bar{\mathbf{y}}_{i}\} are sub-Gaussian random vectors satisfying (1.5). Thus, since 𝐳¯i\bar{\mathbf{z}}_{i} and 𝐲¯i\bar{\mathbf{y}}_{i} are still independent, the covariance of {𝐱¯i}\{\bar{\mathbf{x}}_{i}\} becomes

Σ¯p:=diag(λ1+1,,λd+1,1,,1)p×p,\bar{\Sigma}_{p}:=\texttt{diag}(\lambda_{1}+1,\cdots,\lambda_{d}+1,1,\cdots,1)\in\mathbb{R}^{p\times p}\,,

where λl:=λλl0\lambda_{l}:=\lambda\lambda^{0}_{l} for l=1,,dl=1,\ldots,d. Note that in this case, λi\lambda_{i} are of the same order. By the above definitions and the invariance of the 2\ell_{2} norm, we have

𝐱¯i𝐱¯j\displaystyle\|\bar{\mathbf{x}}_{i}-\bar{\mathbf{x}}_{j}\|\, =𝐱i𝐱j,\displaystyle=\|\mathbf{x}_{i}-\mathbf{x}_{j}\|\,,
𝐳¯i𝐳¯j\displaystyle\|\bar{\mathbf{z}}_{i}-\bar{\mathbf{z}}_{j}\|\, =𝐳i𝐳j,\displaystyle=\|\mathbf{z}_{i}-\mathbf{z}_{j}\|\,,
𝐲¯i𝐲¯j\displaystyle\|\bar{\mathbf{y}}_{i}-\bar{\mathbf{y}}_{j}\|\, =𝐲i𝐲j,\displaystyle=\|\mathbf{y}_{i}-\mathbf{y}_{j}\|\,,

which means that the affinity matrices (1.1) and transition matrices (1.9) remain unchanged after applying the orthogonal matrix. Thus, (A.2) is reduced to (1.4) and it suffices to focus on model (1.4).

Under the above setting, where a low-dimensional manifold is embedded into an affine subspace with a fixed dimension dd, the nonlinear manifold model is thus closely related to the spiked covariance matrix model. Recall that according to Nash’s isometric embedding theorem [48], there exists an embedding so that dd is smaller than m(3m+11)/2m(3m+11)/2, but there may exist embeddings so that dd is larger than m(3m+11)/2m(3m+11)/2. More complicated models might even need dd to diverge as nn\to\infty. In these settings, the nonlinear manifold model will be reduced to other random matrix models, e.g., the divergent spiked model [13] or the divergent-rank signal-plus-noise model [23, 24, 20, 21]. We believe that the spectrum of the GL under these settings can also be investigated once the spectrum of these random matrix models is well understood. Since this is not the focus of the current paper, we will pursue this direction in the future.

A.2. Some linear algebra facts

We record some linear algebraic results. The first one is for the Hadamard product from [27], Lemma A.5.

Lemma A.1.

Suppose 𝐌\mathbf{M} is a real symmetric matrix with nonnegative entries and 𝐄\mathbf{E} is a real symmetric matrix. Then we have that

σ1(𝐌𝐄)maxi,j|𝐄(i,j)|σ1(𝐌),\sigma_{1}(\mathbf{M}\circ\mathbf{E})\leq\max_{i,j}|\mathbf{E}(i,j)|\sigma_{1}(\mathbf{M}),

where σ1(𝐌)\sigma_{1}(\mathbf{M}) stands for the largest singular value of 𝐌.\mathbf{M}.

The following lemma is commonly referred to as the Gershgorin circle theorem, and its proof can be found in [37, Section 6.1].

Lemma A.2.

Let A=(aij)A=(a_{ij}) be a real n×nn\times n matrix. For 1in,1\leq i\leq n, let Ri=ji|aij|R_{i}=\sum_{{j\neq{i}}}\left|a_{{ij}}\right| be the sum of the absolute values of the non-diagonal entries in the ii-th row. Let D(aii,Ri)D(a_{ii},R_{i})\subseteq\mathbb{C} be the closed disc with center aiia_{ii} and radius RiR_{i}, referred to as the Gershgorin disc. Every eigenvalue of AA lies within at least one of the Gershgorin discs D(aii,Ri)D(a_{ii},R_{i}).
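As a quick numerical illustration of Lemma A.2 (for a symmetric matrix, so that the eigenvalues are real and the Gershgorin discs reduce to intervals on the real line), consider the following sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2.0                                   # symmetric, so eigenvalues are real

centers = np.diag(A)
radii = np.abs(A).sum(axis=1) - np.abs(centers)       # R_i = sum_{j != i} |a_ij|

# every eigenvalue lies in at least one interval [a_ii - R_i, a_ii + R_i]
for lam in np.linalg.eigvalsh(A):
    assert np.any(np.abs(lam - centers) <= radii)
```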

We also collect some important matrix inequalities. For details, we refer readers to [18, Lemma SI.1.9].

Lemma A.3.

For two n×nn\times n symmetric matrices 𝐀\mathbf{A} and 𝐁,\mathbf{B}, we have that

i=1n|λi(𝐀)λi(𝐁)|2tr{(𝐀𝐁)2}.\sum_{i=1}^{n}|\lambda_{i}(\mathbf{A})-\lambda_{i}(\mathbf{B})|^{2}\leq\operatorname{tr}\{(\mathbf{A}-\mathbf{B})^{2}\}.

Moreover, let m𝐀(z)m_{\mathbf{A}}(z) and m𝐁(z)m_{\mathbf{B}}(z) be the Stieltjes transforms of the ESDs of 𝐀\mathbf{A} and 𝐁\mathbf{B} respectively, then we have

|m𝐀(z)m𝐁(z)|rank{𝐀𝐁}nmin{2η,𝐀𝐁η2}.|m_{\mathbf{A}}(z)-m_{\mathbf{B}}(z)|\leq\frac{\operatorname{rank}\{\mathbf{A}-\mathbf{B}\}}{n}\min\left\{\frac{2}{\eta},\frac{\|\mathbf{A}-\mathbf{B}\|}{\eta^{2}}\right\}.

A.3. Some concentration inequalities for sub-Gaussian random vectors

We record some auxiliary lemmas for our technical proof. We start with the concentration inequalities for the sub-Gaussian random vector 𝐲\mathbf{y} that satisfies

𝔼(exp(𝐚𝐲))exp(𝐚22/2).\mathbb{E}(\exp(\mathbf{a}^{\top}\mathbf{y}))\leq\exp(\|\mathbf{a}\|_{2}^{2}/2).

The first lemma establishes the concentration inequalities when λ\lambda is bounded.

Lemma A.4.

Suppose (1.4)-(1.6) hold. Assume d1d\geq 1 is fixed, λl1\lambda_{l}\asymp 1 for l=1,,dl=1,\ldots,d and write

(A.3) 𝐳i=[λ1zi1,λ2zi2,,λdzid, 00]\mathbf{z}_{i}=[\sqrt{\lambda_{1}}z_{i1},\ \sqrt{\lambda_{2}}z_{i2},\ldots,\sqrt{\lambda_{d}}z_{id},\ 0\ldots 0]

for all 1in1\leq i\leq n, where var(zil)=1\textnormal{var}(z_{il})=1 for all l=1,,dl=1,\ldots,d. Then, for iji\neq j and t>0,t>0, we have

(A.4) (|1p𝐲i𝐲j|>t)exp(pt2/2),\mathbb{P}\left(\left|\frac{1}{p}\mathbf{y}_{i}^{\top}\mathbf{y}_{j}\right|>t\right)\leq\exp(-pt^{2}/2)\,,

as well as

(A.5) (|1p𝐱i𝐱j1p𝐳i𝐳j|>t)exp(pt2/2).\mathbb{P}\left(\left|\frac{1}{p}\mathbf{x}_{i}^{\top}\mathbf{x}_{j}-\frac{1}{p}\mathbf{z}_{i}^{\top}\mathbf{z}_{j}\right|>t\right)\leq\exp(-pt^{2}/2)\,.

For the diagonal terms, for t>0,t>0, we have for some universal constants C,C1>0,C,C_{1}>0,

(A.6) \displaystyle\mathbb{P} (|1p𝐲i221|>t){2exp(C1pt2),0<tC2exp(C1pt),t>C,\displaystyle\left(\left|\frac{1}{p}\|\mathbf{y}_{i}\|_{2}^{2}-1\right|>t\right)\leq\begin{cases}2\exp(-C_{1}pt^{2}),&0<t\leq C\\ 2\exp(-C_{1}pt),&t>C\,,\end{cases}

as well as

(A.7) \displaystyle\mathbb{P} (|1p𝐱i221p𝐳i22|>t){2exp(C1pt2),0<tC2exp(C1pt),t>C.\displaystyle\left(\left|\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-\frac{1}{p}\|\mathbf{z}_{i}\|_{2}^{2}\right|>t\right)\leq\begin{cases}2\exp(-C_{1}pt^{2}),&0<t\leq C\\ 2\exp(-C_{1}pt),&t>C.\end{cases}

Especially, the above results imply that

1p|𝐲i𝐲j|n1/2,\displaystyle\frac{1}{p}\left|\mathbf{y}_{i}^{\top}\mathbf{y}_{j}\right|\prec n^{-1/2},
(A.8) 1p|𝐱i𝐱j|n1/2,\displaystyle\frac{1}{p}\left|\mathbf{x}_{i}^{\top}\mathbf{x}_{j}\right|\prec n^{-1/2}\,,

as well as

|1p𝐲i221|\displaystyle\left|\frac{1}{p}\|\mathbf{y}_{i}\|_{2}^{2}-1\right|\, n1/2,\displaystyle\prec n^{-1/2},
(A.9) |1p𝐱i22(1+l=1dλlp)|\displaystyle\left|\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-\left(1+\frac{\sum_{l=1}^{d}\lambda_{l}}{p}\right)\right|\, n1/2.\displaystyle\prec n^{-1/2}.

Note that since pp and nn are of the same order, the above results hold when 𝐲i\mathbf{y}_{i} is replaced by the vector 𝐳:=[z1,,zn]n\bm{z}:=[z_{1},\ldots,z_{n}]^{\top}\in\mathbb{R}^{n}, which is a sub-Gaussian random vector.

Proof.

We adapt (A.3) in the proof. When ij,i\neq j, (A.4) has been proved in [18, Lemma A.2]. Observe that

(A.10) 𝐱i𝐱j𝐳i𝐳j=𝐲i𝐲j+l=1dλl(zilyjl+zjlyil).\mathbf{x}_{i}^{\top}\mathbf{x}_{j}-\mathbf{z}_{i}^{\top}\mathbf{z}_{j}=\mathbf{y}_{i}^{\top}\mathbf{y}_{j}+\sum_{l=1}^{d}\sqrt{\lambda_{l}}(z_{il}y_{jl}+z_{jl}y_{il}).

Since λl1\lambda_{l}\asymp 1 and dd is fixed, we find that 𝐲i𝐲j\mathbf{y}_{i}^{\top}\mathbf{y}_{j} is the leading order term. (A.5) follows from (A.4) and (A.10). When i=ji=j, (A.6) has been proved in [61, Corollary 2.8.3]. (A.7) follows from (A.10) and (A.6).

(A.8) ((A.9), respectively) follows from (A.4) and (A.5) ((A.6) and (A.7), respectively) for scalar random variables and the fact that λl1.\lambda_{l}\asymp 1.
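As a simple sanity check of the above bounds (under the assumption of standard Gaussian noise, which is sub-Gaussian), one can verify numerically that the fluctuations in (A.4) and (A.6) are indeed of order p1/2p^{-1/2}; the following sketch is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 2000, 500

# off-diagonal term: (1/p) y_i^T y_j concentrates around 0 at scale p^{-1/2}, cf. (A.4)
cross = np.array([rng.standard_normal(p) @ rng.standard_normal(p) / p for _ in range(trials)])
# diagonal term: (1/p) ||y_i||_2^2 concentrates around 1 at the same scale, cf. (A.6)
diag = np.array([np.sum(rng.standard_normal(p) ** 2) / p for _ in range(trials)])

print(np.std(cross) * np.sqrt(p))        # O(1)
print(np.std(diag - 1.0) * np.sqrt(p))   # O(1)
```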

Then we provide the concentration inequalities when λ\lambda is in the slowly divergent region. Indeed, in this region, the results of the diagonal parts of Lemma A.4 still apply.

Lemma A.5.

Suppose (1.4)-(1.6) hold. Assume d1d\geq 1 is fixed, λl=nαl\lambda_{l}=n^{\alpha_{l}} with 0<αl<10<\alpha_{l}<1 for all 1ld1\leq l\leq d and recall (A.3). Then when iji\neq j, we have

1p|𝐱i𝐱j|\displaystyle\frac{1}{p}\left|\mathbf{x}_{i}^{\top}\mathbf{x}_{j}\right| l=1dλln+1n,\displaystyle\,\prec\frac{\sum_{l=1}^{d}\lambda_{l}}{n}+\frac{1}{\sqrt{n}},
(A.11) |1p𝐱i22(1+l=1dλlp)|\displaystyle\left|\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-\left(1+\frac{\sum_{l=1}^{d}\lambda_{l}}{p}\right)\right| l=1dλln+1n.\displaystyle\,\prec\frac{\sum_{l=1}^{d}\lambda_{l}}{n}+\frac{1}{\sqrt{n}}.
Proof.

We only discuss the second bound; the first one can be dealt with in a similar way. We define λfl=λl\lambda_{fl}=\lfloor\lambda_{l}\rfloor as the floor of λl\lambda_{l}. We again adapt the notation (A.3) in the proof. Note that using (A.10) we have

l=1dλflpzil2\displaystyle\sum_{l=1}^{d}\frac{\lambda_{fl}}{p}z_{il}^{2} +1pj=1p𝐲ij21p𝐱i22l=1d2λlpzil𝐲il\displaystyle+\frac{1}{p}\sum_{j=1}^{p}\mathbf{y}_{ij}^{2}\leq\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-\sum_{l=1}^{d}\frac{2\sqrt{\lambda_{l}}}{p}z_{il}\mathbf{y}_{il}
(A.12) l=1dλfl+1pzil2+1pj=1p𝐲ij2.\displaystyle\leq\sum_{l=1}^{d}\frac{\lambda_{fl}+1}{p}z_{il}^{2}+\frac{1}{p}\sum_{j=1}^{p}\mathbf{y}_{ij}^{2}.

We study the upper bound of the sandwich inequality, and the lower bound follows by the same argument. On one hand, according to (A.6), we have that

1pj=1p𝐲ij2=1+O(n1/2).\frac{1}{p}\sum_{j=1}^{p}\mathbf{y}_{ij}^{2}=1+O_{\prec}(n^{-1/2}).

Moreover, since zilz_{il} is a sub-Gaussian random variable, we have zil21z_{il}^{2}\prec 1. This yields that

i=1dλfl+1pzil2=l=1d(λfl+1)p+O(n1+α),\sum_{i=1}^{d}\frac{\lambda_{fl}+1}{p}z_{il}^{2}=\frac{\sum_{l=1}^{d}(\lambda_{fl}+1)}{p}+O_{\prec}(n^{{-1}+\alpha}),

where we used the fact that αl<1\alpha_{l}<1 for all 1ld.1\leq l\leq d. Similarly, we have

|λlpzil𝐲il|λlp.\left|\frac{\sqrt{\lambda_{l}}}{p}z_{il}\mathbf{y}_{il}\right|\prec\frac{\sqrt{\lambda_{l}}}{p}.

Combining the above bounds with (A.12), we readily obtain that

1p𝐱i221+l=1dλlp+O(l=1dλln+1n).\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}\leq 1+\frac{\sum_{l=1}^{d}\lambda_{l}}{p}+O_{\prec}\left(\frac{\sum_{l=1}^{d}\lambda_{l}}{n}+\frac{1}{\sqrt{n}}\right).

A.4. Some results for Gram matrices

Denote the Gram matrix of the point clouds {𝐱i}p\{\mathbf{x}_{i}\}\subset\mathbb{R}^{p} in the form of (1.4) as

(A.13) 𝐆x=1p𝐗𝐗,𝐗=[𝐱1𝐱n]p×n.\mathbf{G}_{x}=\frac{1}{p}\mathbf{X}^{\top}\mathbf{X},\ \mathbf{X}=[\mathbf{x}_{1}\cdots\mathbf{x}_{n}]\in\mathbb{R}^{p\times n}.

The eigenvalues of 𝐆x\mathbf{G}_{x} have been thoroughly studied in the literature; see [8, 7, 4, 51], among others. We summarize those results relevant to this paper in the following lemma.

Lemma A.6.

Suppose (1.4)-(1.7) hold true and d1d\geq 1 is fixed. Assume that there exists some 0dd0\leq d^{\prime}\leq d so that λ1λ2λd>cnλd+1λd0.\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{d^{\prime}}>\sqrt{c_{n}}\geq\lambda_{d^{\prime}+1}\geq\cdots\geq\lambda_{d}\geq 0. Then, if d>0d^{\prime}>0, we have that for 1jd1\leq j\leq d^{\prime},

|λj(𝐆x)(1+λj)(1+cnλj1)|n1/2λj(λjcn)1/2.|\lambda_{j}(\mathbf{G}_{x})-(1+\lambda_{j})(1+c_{n}\lambda_{j}^{-1})|\prec n^{-1/2}\sqrt{\lambda_{j}}(\lambda_{j}-\sqrt{c_{n}})^{1/2}.

If d<dd^{\prime}<d, we have that for d+1jdd^{\prime}+1\leq j\leq d,

|λj(𝐆x)γμcn,1(1)|n2/3.|\lambda_{j}(\mathbf{G}_{x})-\gamma_{\mu_{c_{n},1}}(1)|\prec n^{-2/3}.

Moreover, for 1i(1ϵ)n1\leq i\leq(1-\epsilon)n, where ϵ>0\epsilon>0 is a fixed small constant, we have

(A.14) |λi+d(𝐆x)γμcn,1(i)|n2/3i1/3.\left|\lambda_{i+d}(\mathbf{G}_{x})-\gamma_{\mu_{c_{n},1}}(i)\right|\prec n^{-2/3}i^{-1/3}\,.
Proof.

We mention that the results under our setup have been originally proved in [8] (see Section 1.2 and Theorems 2.3 and 2.7 therein), and stated in the current form. ∎

The following lemma provides a control for the Hadamard product related to the Gram matrix. In our setup with the point cloud 𝒳\mathcal{X}, as discussed around (1.6), 𝐱i\mathbf{x}_{i} itself is a sub-Gaussian random vector with a spiked Σ\Sigma as in (1.6). We thus can extract the probability and bounds by tracking the proof in [27, Step (iv) on Page 21 of the proof of Theorem 2.1].

Lemma A.7.

Suppose (1.5)-(1.7) hold true, d1d\geq 1, and λl=nαl\lambda_{l}=n^{\alpha_{l}} for l=1,,dl=1,\ldots,d, where 0<αl<10<\alpha_{l}<1. Let 𝐆x\mathbf{G}_{x} be the Gram matrix associated with the point cloud 𝒳.\mathcal{X}. Denote

(A.15) 𝐏x:=𝐆xdiag{𝐆x(1,1),,𝐆x(n,n)}.\mathbf{P}_{x}:=\mathbf{G}_{x}-\text{diag}\{\mathbf{G}_{x}(1,1),\ldots,\mathbf{G}_{x}(n,n)\}\,.

For some constant C>0C>0, when nn is sufficiently large, with probability at least 1O(n1/2)1-O(n^{-1/2}), we have

𝐏x𝐏xl=1d(λl+1)2+p1p2(𝟏𝟏𝐈n)\displaystyle\left\|\mathbf{P}_{x}\circ\mathbf{P}_{x}-\frac{\sum_{l=1}^{d}(\lambda_{l}+1)^{2}+p-1}{p^{2}}(\mathbf{1}\mathbf{1}^{\top}-\mathbf{I}_{n})\right\|
(A.16) \displaystyle\leq Cmax{n1/4,l=1dλlp}.\displaystyle\,C\max\left\{n^{-1/4},\frac{\sum_{l=1}^{d}\lambda_{l}}{p}\right\}.

We also need the following lemma for the Gram matrix of noisy signals.

Lemma A.8.

Suppose (1.4)-(1.8) hold true, d1d\geq 1, λl=nαl\lambda_{l}=n^{\alpha_{l}}, where 0<αlαl1α1<10<\alpha_{l}\leq\alpha_{l-1}\leq\cdots\leq\alpha_{1}<1, for l=1,,dl=1,\ldots,d. Let 𝐏y\mathbf{P}_{y} be denoted as (A.15) for the point cloud 𝒴.\mathcal{Y}. Then we have

(A.17) 𝐏x𝐏x𝐏y𝐏y{l=1dλln,0<α1<0.5(l=1dλl)2n,otherwise.\displaystyle\left\|\mathbf{P}_{x}\circ\mathbf{P}_{x}-\mathbf{P}_{y}\circ\mathbf{P}_{y}\right\|\prec\begin{cases}\frac{\sum_{l=1}^{d}\lambda_{l}}{\sqrt{n}},&0<\alpha_{1}<0.5\\ \frac{(\sum_{l=1}^{d}\lambda_{l})^{2}}{n},&\text{otherwise}.\end{cases}
Proof.

When iji\neq j,

1p2((𝐱i𝐱j)2(𝐲i𝐲j)2)\displaystyle\frac{1}{p^{2}}((\mathbf{x}_{i}^{\top}\mathbf{x}_{j})^{2}-(\mathbf{y}_{i}^{\top}\mathbf{y}_{j})^{2})
(A.18) =\displaystyle= l=1dλl(zilyjl+zjlyil)+𝐳i𝐳jp𝐱i𝐱j+𝐲i𝐲jp.\displaystyle\,\frac{\sum_{l=1}^{d}\sqrt{\lambda_{l}}(z_{il}y_{jl}+z_{jl}y_{il})+\mathbf{z}_{i}^{\top}\mathbf{z}_{j}}{p}\frac{\mathbf{x}_{i}^{\top}\mathbf{x}_{j}+\mathbf{y}_{i}^{\top}\mathbf{y}_{j}}{p}\,.

By the assumption that the standard deviations of the entries of zilz_{il} and yily_{il} are of order λl\sqrt{\lambda_{l}} and 11 respectively, we have

l=1dλl(zilyjl+zjlyil)+𝐳i𝐳jpl=1dλln,\frac{\sum_{l=1}^{d}\sqrt{\lambda_{l}}(z_{il}y_{jl}+z_{jl}y_{il})+\mathbf{z}_{i}^{\top}\mathbf{z}_{j}}{p}\prec\frac{\sum_{l=1}^{d}\lambda_{l}}{n}\,,

and by (A.5), we have

𝐱i𝐱j+𝐲i𝐲jp1n\frac{\mathbf{x}_{i}^{\top}\mathbf{x}_{j}+\mathbf{y}_{i}^{\top}\mathbf{y}_{j}}{p}\prec\frac{1}{\sqrt{n}}

when α1<0.5\alpha_{1}<0.5 and

𝐱i𝐱j+𝐲i𝐲jpl=1dλln\frac{\mathbf{x}_{i}^{\top}\mathbf{x}_{j}+\mathbf{y}_{i}^{\top}\mathbf{y}_{j}}{p}\prec\frac{\sum_{l=1}^{d}\lambda_{l}}{n}

when 0.5α1<10.5\leq\alpha_{1}<1. Therefore, using the Gershgorin circle theorem, we conclude the claimed bound. ∎

A.5. Some results for affinity matrices

Lemma A.9.

Suppose (1.1) and (1.4)-(1.8) hold true, d1d\geq 1, αl=0\alpha_{l}=0 for l=1,,dl=1,\ldots,d in (1.8) (i.e., λ\lambda is bounded), and h=ph=p in (1.1). Let Φ=(ϕ1,,ϕn)\Phi=(\phi_{1},\ldots,\phi_{n})^{\top} with ϕi=1p𝐱i22(1+l=1dλl/p)\phi_{i}=\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-(1+\sum_{l=1}^{d}\lambda_{l}/p), i=1,2,,ni=1,2,\cdots,n. Denote

(A.19) 𝐊d𝐊d(τ)=\displaystyle\mathbf{K}_{d}\equiv\mathbf{K}_{d}(\tau)= 2f(τ)p1𝐗𝐗+ς𝐈n+Sh0(τ)+Sh1(τ)+Sh2(τ),\displaystyle\,-2f^{\prime}(\tau)p^{-1}\mathbf{X}^{\top}\mathbf{X}+\varsigma\mathbf{I}_{n}+\mathrm{Sh}_{0}(\tau)+\mathrm{Sh}_{1}(\tau)+\mathrm{Sh}_{2}(\tau),

where f(x)f(x) is a general kernel function satisfying the conditions in Remark 2.4, ς\varsigma is defined in (2.6),

(A.20) Sh0(τ):=f(τ)𝟏𝟏,\displaystyle\mathrm{Sh}_{0}(\tau):=f(\tau)\mathbf{1}\mathbf{1}^{\top},
Sh1(τ):=f(τ)[𝟏Φ+Φ𝟏],\displaystyle\mathrm{Sh}_{1}(\tau):=f^{\prime}(\tau)[\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top}],
Sh2(τ):=f(2)(τ)2[𝟏(ΦΦ)+(ΦΦ)𝟏+2ΦΦ\displaystyle\mathrm{Sh}_{2}(\tau):=\frac{f^{(2)}(\tau)}{2}\Big{[}\mathbf{1}(\Phi\circ\Phi)^{\top}+(\Phi\circ\Phi)\mathbf{1}^{\top}+2\Phi\Phi^{\top}
(A.21) +4p2(l=1d(λl+1)2+p)𝟏𝟏],\displaystyle\qquad\qquad\quad+\frac{4}{p^{2}}\Big{(}\sum_{l=1}^{d}(\lambda_{l}+1)^{2}+p\Big{)}\mathbf{1}\mathbf{1}^{\top}\Big{]},

and \circ is the Hadamard product. Then for some small constant ξ>0\xi>0, when nn is sufficiently large, we have that with probability at least 1O(n1/2)1-O(n^{-1/2})

(A.22) 𝐖𝐊dnξ.\|\mathbf{W}-\mathbf{K}_{d}\|\leq n^{-\xi}\,.
Proof.

See [18, Lemma A.10] for the case d=1d=1. For d>1d>1, by modifying the proof of [18, Lemma A.10] using Lemmas A.4A.8 and the assumption that dd is fixed, we get the result. ∎

We also need the following lemma, which is of independent interest.

Lemma A.10.

Suppose (1.1) and (1.4)-(1.8) hold true, d1d\geq 1 is fixed and α1α2αd1\alpha_{1}\geq\alpha_{2}\geq\ldots\geq\alpha_{d}\geq 1. For the matrix 𝐖1\mathbf{W}_{1} associated with the clean signal defined in (1.9) and any δ>0\delta>0, there exists Cδ>0C_{\delta}>0 such that, with high probability,

(A.23) 𝐖1(nCδnδ)exp(υγnα112(1δ))+Cδnδ,\|\mathbf{W}_{1}\|\leq(n-C_{\delta}n^{\delta})\exp(-\upsilon\gamma n^{\alpha_{1}-1-2(1-\delta)})+C_{\delta}n^{\delta}\,,

where γ\gamma is defined in (1.7).

Proof.

Assume d=1d=1 and denote α=α1\alpha=\alpha_{1}. Throughout the proof, we adapt the notation (A.3). For an arbitrarily small constant ϵ>0\epsilon>0 and a given δ>0,\delta>0, let Cδ>0C_{\delta}>0 be some constant depending on δ\delta, and denote the event 𝒜(δ)\mathcal{A}(\delta) as

𝒜(δ):={There existCδnδzissuch that|zi|nϵ}.\mathcal{A}(\delta):=\left\{\text{There exist}\ C_{\delta}n^{\delta}\ z_{i}^{\prime}s\ \text{such that}\ |z_{i}|\leq n^{-\epsilon}\right\}\,.

Note that ziz_{i} is the signal part. Denote :=(|zi|nϵ).\ell:=\mathbb{P}(|z_{i}|\leq n^{-\epsilon}). Due to the independence, we find that

(A.24) (𝒜(δ))=(nCδnδ)Cδnδ(1)nCδnδ.\mathbb{P}(\mathcal{A}(\delta))={n\choose C_{\delta}n^{\delta}}\ell^{C_{\delta}n^{\delta}}(1-\ell)^{n-C_{\delta}n^{\delta}}.

Using Stirling’s formula, when nn is sufficiently large, we find that

(A.25) (nCδnδ)=\displaystyle{n\choose C_{\delta}n^{\delta}}= n!(nCδnδ)!(Cδnδ)!\displaystyle\,\frac{n!}{(n-C_{\delta}n^{\delta})!(C_{\delta}n^{\delta})!}
\displaystyle\asymp 2πnn+1/2exp(n)(Cδnδ)Cδnδ1/2exp(Cδnδ)\displaystyle\,\sqrt{2\pi}n^{n+1/2}\exp(-n)(C_{\delta}n^{\delta})^{-C_{\delta}n^{\delta}-1/2}\exp(C_{\delta}n^{\delta})
×(nCδnδ)n+Cδnδ1/2exp(nCδnδ)\displaystyle\qquad\times(n-C_{\delta}n^{\delta})^{-n+C_{\delta}n^{\delta}-1/2}\exp(n-C_{\delta}n^{\delta})
\displaystyle\asymp 2πCδn1δnCδnδ(1+Cδnδnnδ)nCδnδ(nCδnδ)Cδnδ\displaystyle\,\sqrt{\frac{2\pi}{C_{\delta}}}\sqrt{\frac{n^{1-\delta}}{n-C_{\delta}n^{\delta}}}\left(1+\frac{C_{\delta}n^{\delta}}{n-n^{\delta}}\right)^{n-C_{\delta}n^{\delta}}\left(\frac{n}{C_{\delta}n^{\delta}}\right)^{C_{\delta}n^{\delta}}
\displaystyle\asymp 2πCδn(1δ)Cδnδδ/2exp(Cδnδ).\displaystyle\,\sqrt{\frac{2\pi}{C_{\delta}}}n^{(1-\delta)C_{\delta}n^{\delta}-\delta/2}\exp(C_{\delta}n^{\delta})\,.

We next provide an estimate for .\ell. Denote the probability density function (PDF) of zi2z_{i}^{2} as ϱ\varrho and that of ziz_{i} as ϱ~.\widetilde{\varrho}. Recall the assumption that {zi}i=1n\{z_{i}\}_{i=1}^{n} are continuous random variables with ϱ~(0)0\widetilde{\varrho}(0)\neq 0. By using the fact that ϱ(y)=(ϱ~(y)+ϱ~(y))/(2y)\varrho(y)=(\widetilde{\varrho}(\sqrt{y})+\widetilde{\varrho}(-\sqrt{y}))/(2\sqrt{y}), we find that

ϱ(y)O(1y).\varrho(y)\asymp O\left(\frac{1}{\sqrt{y}}\right)\,.

Consequently, we have that for any small y>0,y>0,

(A.26) (zi2y)O(y).\mathbb{P}(z_{i}^{2}\leq y)\asymp O(\sqrt{y})\,.

By (A.26), we immediately have that

(A.27) nϵ.\ell\asymp n^{-\epsilon}\,.

Since 0\ell\rightarrow 0 as nn\rightarrow\infty, we have

(A.28) Cδnδ(1)nCδnδexp(n)Cδnδ,n.\displaystyle\ell^{C_{\delta}n^{\delta}}(1-\ell)^{n-C_{\delta}n^{\delta}}\asymp\exp(-\ell n)\ell^{C_{\delta}n^{\delta}},\ n\rightarrow\infty\,.

By plugging (A.25) and (A.28) into (A.24), we obtain

(𝒜(δ))\displaystyle\mathbb{P}(\mathcal{A}(\delta))\asymp exp(n+Cδ(1δ)nδlogn+Cδnδlog)\displaystyle\,\exp(-\ell n+C_{\delta}(1-\delta)n^{\delta}\log n+C_{\delta}n^{\delta}\log\ell)
\displaystyle\asymp exp(n1ϵ+Cδ(1δ)nδlognCδϵnδlogn+Cδnδ)\displaystyle\,\exp\left(-n^{1-\epsilon}+C_{\delta}(1-\delta)n^{\delta}\log n-C_{\delta}\epsilon n^{\delta}\log n+C_{\delta}n^{\delta}\right)
(A.29) =\displaystyle= exp(n1ϵ+Cδ[1δϵ+1/log(n)]nδlogn),\displaystyle\,\exp\left(-n^{1-\epsilon}+C_{\delta}\left[1-\delta-\epsilon+1/\log(n)\right]n^{\delta}\log n\right),

where the second asymptotic comes from plugging in (A.27). In light of (A.29), to make 𝒜(δ)\mathcal{A}(\delta) a high probability event, we may take ϵ+δ=1\epsilon+\delta=1, which leads to

(𝒜(δ))exp((Cδ1)nδ).\mathbb{P}(\mathcal{A}(\delta))\asymp\exp\left((C_{\delta}-1)n^{\delta}\right).

We can choose 1<Cδ21<C_{\delta}\leq 2 such that for any large constant D>0,D>0, when nn is large enough, we have that

(A.30) (𝒜(δ))1nD.\mathbb{P}(\mathcal{A}(\delta))\geq 1-n^{-D}\,.

Note that 𝐖1(i,j)=exp(υλ(zizj)2/p)\mathbf{W}_{1}(i,j)=\exp\left(-\upsilon\lambda(z_{i}-z_{j})^{2}/p\right). On the event 𝒜(δ)\mathcal{A}(\delta), by the Gershgorin circle theorem and the definition of 𝐖1\mathbf{W}_{1}, we find that

(A.31) 𝐖1(nCδnδ)exp(υγnα12(1δ))+Cδnδ,\|\mathbf{W}_{1}\|\leq(n-C_{\delta}n^{\delta})\exp(-\upsilon\gamma n^{\alpha-1-2(1-\delta)})+C_{\delta}n^{\delta}\,,

where in the inequality we consider the worst scenario such that all the elements in 𝒜(δ)\mathcal{A}(\delta) are either on the same row or column. This finishes the claim with high probability.

To show the case when d>1d>1, we need a slight modification of (A.24). Suppose α1α2αd1\alpha_{1}\geq\alpha_{2}\geq\ldots\geq\alpha_{d}\geq 1. Since zikz_{ik}, k=1,,dk=1,\ldots,d, are in general dependent, (A.24) has to be modified to accommodate this additional dependence. In fact, our results hold true by replacing α\alpha in (A.31) with α1\alpha_{1}, since we have 𝐖1(i,j)=k=1dexp(υλk(zikzjk)2/p)exp(υλ1(zi1zj1)2/p)\mathbf{W}_{1}(i,j)=\prod_{k=1}^{d}\exp\left(-\upsilon\lambda_{k}(z_{ik}-z_{jk})^{2}/p\right)\leq\exp\left(-\upsilon\lambda_{1}(z_{i1}-z_{j1})^{2}/p\right), which reduces our discussion to the case d=1.d=1.

A.6. Orthogonal polynomials and kernel expansion

We first recall the celebrated Mehler’s formula (for instance, see equation (5) in [49] or [33])

(A.32) 11t2\displaystyle\frac{1}{\sqrt{1-t^{2}}} exp(2txyt2(x2+y2)2(1t2))=n=0tnn!H~n(x)H~n(y),\displaystyle\exp\left(\frac{2txy-t^{2}(x^{2}+y^{2})}{2(1-t^{2})}\right)=\sum_{n=0}^{\infty}\frac{t^{n}}{n!}\widetilde{H}_{n}(x)\widetilde{H}_{n}(y),

where H~m(x)\widetilde{H}_{m}(x) is the scaled Hermite polynomial defined as

H~m(x)=Hm(x/2)2m\widetilde{H}_{m}(x)=\frac{H_{m}(x/\sqrt{2})}{\sqrt{2^{m}}}

and Hm(x)H_{m}(x) is the standard Hermite polynomial defined as

Hm(x)=(1)mexp(x2)dmexp(x2)dxm.H_{m}(x)=(-1)^{m}\exp(x^{2})\frac{\mathrm{d}^{m}\exp(-x^{2})}{\mathrm{d}x^{m}}.

We mention that H~m(x)\widetilde{H}_{m}(x) is referred to as the probabilistic version of the Hermite polynomial. We will see later that in our proof, the above Mehler’s formula provides a convenient way to handle the interaction term when we write the affinity matrix as a summation of rank one matrices.
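As a sanity check, (A.32) can be verified numerically by truncating the series; the sketch below evaluates the probabilists' Hermite polynomials H~n\widetilde{H}_{n} through numpy.polynomial.hermite_e, and the truncation level N=60N=60 and the test points are arbitrary choices of ours.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

def mehler_lhs(x, y, t):
    """Closed-form left-hand side of (A.32)."""
    return np.exp((2 * t * x * y - t ** 2 * (x ** 2 + y ** 2)) / (2 * (1 - t ** 2))) \
        / np.sqrt(1 - t ** 2)

def mehler_rhs(x, y, t, N=60):
    """Right-hand side of (A.32), truncated at N terms; He.hermeval evaluates
    series in the probabilists' Hermite polynomials, i.e., in H~_n."""
    total, fact = 0.0, 1.0
    for n in range(N):
        if n > 0:
            fact *= n                                  # fact = n!
        coeff = np.zeros(n + 1)
        coeff[n] = 1.0                                 # coefficients selecting He_n
        total += t ** n / fact * He.hermeval(x, coeff) * He.hermeval(y, coeff)
    return total

print(mehler_lhs(0.3, -0.7, 0.5), mehler_rhs(0.3, -0.7, 0.5))   # the two values agree
```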

A.7. More remarks

A.7.1. Zeroing-out technique

Here we discuss the trick of zeroing-out diagonal terms proposed in [29]. First of all, we summarize the idea and restate the results in [29]. Then we modify it to our setting (1.4)–(1.6). To simplify the discussion, we focus our discussion on the setting d=1d=1 with the signal strength λnα\lambda\asymp n^{\alpha}, where α0\alpha\geq 0. The zeroed out affinity matrix is defined as

𝐖̊=𝐖(𝟏𝟏𝐈n),\mathring{\mathbf{W}}=\mathbf{W}\circ\left(\mathbf{1}\mathbf{1}^{\top}-\mathbf{I}_{n}\right),

where 𝟏n\mathbf{1}\in\mathbb{R}^{n} is the vector with 11 in all entries. Denote the associated degree matrix as 𝐃̊.\mathring{\mathbf{D}}. Consequently, the modified transition matrix is

(A.33) 𝐀̊=𝐃̊1𝐖̊.\mathring{\mathbf{A}}=\mathring{\mathbf{D}}^{-1}\mathring{\mathbf{W}}.
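A minimal sketch of the construction (A.33) (the function name is ours and purely illustrative) is as follows.

```python
import numpy as np

def zeroed_out_transition(W):
    """Zeroing-out trick of [29]: drop the diagonal of W before the
    row normalization, cf. (A.33)."""
    W_ring = W.copy()
    np.fill_diagonal(W_ring, 0.0)        # W o (1 1^T - I_n)
    deg = W_ring.sum(axis=1)             # diagonal entries of the degree matrix
    return W_ring / deg[:, None]         # D^{-1} W with the zeroed-out matrices
```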

Recall that the transition matrix for the signal part is defined as 𝐀1=𝐃11𝐖1\mathbf{A}_{1}=\mathbf{D}_{1}^{-1}\mathbf{W}_{1}. It can be concluded from [29] that with high probability the spectrum of 𝐀̊\mathring{\mathbf{A}} is close to that of 𝐀1\mathbf{A}_{1} in that

(A.34) 𝐀̊𝐀1=o(1),\|\mathring{\mathbf{A}}-\mathbf{A}_{1}\|=o_{\mathbb{P}}(1),

provided the following two conditions are satisfied:

(1). The signal strength satisfies

α>12.\alpha>\frac{1}{2}.

(2). The off-diagonal entries of the signal kernel affinity matrix should satisfy that

(A.35) infiji𝐖1(i,j)nγ>0,\inf_{i}\sum_{j\neq i}\frac{\mathbf{W}_{1}(i,j)}{n}\geq\gamma>0,

for some universal constant γ>0.\gamma>0. Even though [29] did not provide a detailed discussion on the bandwidth selection, the assumption (A.35) imposes a natural condition on the bandwidth.

We now explain how the zeroing-out trick is related to our approach in the large signal-to-noise ratio region. Recall that the transition matrix for the observation is defined as 𝐀=𝐃1𝐖\mathbf{A}=\mathbf{D}^{-1}\mathbf{W}. As shown in part 2) of Theorem 3.1, when the signal strength is α>1\alpha>1 and the bandwidth is either hλh\asymp\lambda or selected adaptively according to the method proposed in Section 3.2, the spectrum of 𝐀\mathbf{A} will be close to that of 𝐀1\mathbf{A}_{1} with high probability, i.e.,

𝐀𝐀1=o(1).\|\mathbf{A}-\mathbf{A}_{1}\|=o_{\mathbb{P}}(1).

We emphasize that when α>1\alpha>1 and hλh\asymp\lambda or selected adaptively according to the method proposed in Section 3.2, it can be concluded from the proof of Corollary 3.2 that (A.35) holds with high probability. Consequently, together with (A.34), we can conclude that when the signal-to-noise ratio is large in the sense α>1\alpha>1 and the bandwidth is selected properly as in Section 3.2, the spectrum of 𝐀\mathbf{A} is asymptotically the same as the zeroing-out matrix 𝐀̊.\mathring{\mathbf{A}}. We mention that although in this setting our approach and results are asymptotically equivalent to the zeroing-out trick, our method provides an adaptive and theoretically justified method to select the bandwidth instead of choosing a fixed bandwidth according to (A.35). In fact, in the simulation of [29], the authors used a similar approach empirically.

When the signal-to-noise ratio is “smaller” so that 1/2<α1,1/2<\alpha\leq 1, the zeroing-out trick has a significant impact on the spectrum. In this region, the proposed bandwidth selection algorithm will choose a bandwidth satisfying hph\asymp p, and the spectral behavior of the noisy GL is recorded in Corollary 2.10, part 1) of Theorem 3.1 and Corollary 3.2. We see that the spectrum of 𝐀\mathbf{A} is no longer close to that of 𝐀1\mathbf{A}_{1} under our setup. However, the zeroing-out trick works provided (A.35) holds. Therefore, when 1/2<α11/2<\alpha\leq 1 and the bandwidth is properly selected, the result can be improved using the zeroing-out trick in the sense of (A.34).

Finally, when the SNR is “very small” so that α1/2,\alpha\leq 1/2, both 𝐀̊\mathring{\mathbf{A}} and 𝐀\mathbf{A} are dominated by the noise. In this case, we are not able to extract useful information about the signal, even with the zeroing-out trick. For an illustration, in Figure 14, we show the performance of 𝐀̊\mathring{\mathbf{A}} and 𝐀\mathbf{A} in different SNR regions by comparing some of their eigenfunctions (eigenvectors) with those of the clean signal matrix 𝐀1.\mathbf{A}_{1}. We can conclude that the zeroing-out trick can be beneficial when 0.5<α10.5<\alpha\leq 1 provided the bandwidth is carefully selected.

Figure 14. Comparison of different GLs. We compare the 3rd eigenfunctions (eigenvectors) of 𝐀1,\mathbf{A}_{1}, 𝐀\mathbf{A} with bandwidth constructed using Algorithm 1, and 𝐀̊\mathring{\mathbf{A}} with h=35.h=35. Here p=200,n=400p=200,n=400 and the random variables are Gaussian. In the legend, Clean means 𝐀1,\mathbf{A}_{1}, Adap means 𝐀\mathbf{A} constructed using Algorithm 1, and Zeroing means 𝐀̊\mathring{\mathbf{A}} with h=35.h=35. We can see that the zeroing-out trick with the bandwidth chosen as h=35h=35 outperforms our method when 0.5<α1.0.5<\alpha\leq 1.

A.7.2. More remarks on d>1d>1

We continue the discussion in Remark 2.9 and provide some simulations with d=2d=2. We assume that α1α20\alpha_{1}\geq\alpha_{2}\geq 0, and discuss two cases with different α1\alpha_{1} and α2.\alpha_{2}.

First, we discuss the setting when the bulk eigenvalues are governed by the MP law, i.e., in the region 0α2α1<1.0\leq\alpha_{2}\leq\alpha_{1}<1. Recall that the shifted MP law νλ\nu_{\lambda} defined in (2.7) depends on the signal level via ττ(λ)\tau\equiv\tau(\lambda) in (2.5), and when α<1,\alpha<1, τ2\tau\rightarrow 2 as n,n\rightarrow\infty, which is independent of λ\lambda asymptotically. We can thus always set λ=0\lambda=0 in (2.5) and use ν0\nu_{0} for the MP law as in (2.8) and (2.12). Therefore, when 0α2α1<1,0\leq\alpha_{2}\leq\alpha_{1}<1, the bulk distribution is the same as that in (2.8) and (2.12), which is characterized by the shifted MP law,

ν0=Tζ(0)μcn,2f(2),\nu_{0}=T_{\zeta(0)}\mu_{c_{n},-2f^{\prime}(2)},

where TT is the shift operator defined in (2.1), μcn,2f(2)\mu_{c_{n},-2f^{\prime}(2)} is the MP law defined in (2.3) with σ2\sigma^{2} replaced by 2f(2)-2f^{\prime}(2), and ζ(0)\zeta(0) is defined by inserting λ=0\lambda=0 (or equivalently τ(0)=2\tau(0)=2) in (2.6); that is,

ζ(0)=f(0)+2f(2)f(2).\zeta(0)=f(0)+2f^{\prime}(2)-f(2)\,.

In general, although the bulk distribution is the same for different finite d,d, the number of outliers can vary according to λi\lambda_{i}, 1id.1\leq i\leq d. On the technical level, the proofs in Appendices B.1 and B.2 follow after some minor modifications, especially in the Taylor expansion. For example, in (B.1), the key parameter τ\tau should be defined as τ=2((λ1+λ2)/p+1)\tau=2((\lambda_{1}+\lambda_{2})/p+1) when d=2d=2. For an illustration, in Figure 15 we show that the bulk eigenvalues are essentially identical for the settings d=1d=1 and d=2d=2 when the exponents are less than one, for various values of cn=0.5,1,2c_{n}=0.5,1,2.

Figure 15. Bulk eigenvalues of 𝐖\mathbf{W} for different intrinsic dimensions dd\in\mathbb{N} in the low SNR region. For the construction of 𝐖,\mathbf{W}, we consider h=p.h=p. We compare the bulk eigenvalues for two kernel affinity matrices constructed from two point clouds with different values of dd. In the simulations, we use Gaussian random vectors with covariance matrix (1.6). Specifically, for the setting d=1,d=1, we choose λ1=p0.4,λ2==λp=1,\lambda_{1}=p^{0.4},\lambda_{2}=\cdots=\lambda_{p}=1, and for the setting d=2,d=2, we choose λ1=p0.4,λ2=p0.1,λ3==λp=1.\lambda_{1}=p^{0.4},\lambda_{2}=p^{0.1},\lambda_{3}=\cdots=\lambda_{p}=1. We also consider the comparison for three different settings of cn=n/pc_{n}=n/p with n=200.n=200. We can see that the bulk eigenvalues for these two settings are fairly close to each other. For definiteness and simplicity, we plot the bulk eigenvalues starting from λ10(𝐖).\lambda_{10}(\mathbf{W}).
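For readers who wish to reproduce such a comparison, the following numpy sketch builds, under the Gaussian setup described in the caption and with h=p, the kernel affinity matrices for d=1 and d=2 and extracts their bulk eigenvalues; the sample sizes and the seed are arbitrary illustrative choices.

```python
import numpy as np

def bulk_eigenvalues(spikes, n=200, p=400, seed=0):
    """Eigenvalues of the Gaussian-kernel affinity matrix W with bandwidth h = p,
    for Gaussian data whose covariance is diag(spikes, 1, ..., 1)."""
    rng = np.random.default_rng(seed)
    cov_diag = np.ones(p)
    cov_diag[:len(spikes)] = spikes
    X = rng.standard_normal((p, n)) * np.sqrt(cov_diag)[:, None]
    G = X.T @ X
    sq = np.diag(G)
    W = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * G) / p)     # bandwidth h = p
    return np.sort(np.linalg.eigvalsh(W))[::-1]

p = 400                                                         # c_n = n / p = 0.5
eigs_d1 = bulk_eigenvalues([p ** 0.4], p=p)                     # d = 1
eigs_d2 = bulk_eigenvalues([p ** 0.4, p ** 0.1], p=p)           # d = 2
# Compare the bulks, starting from the 10th eigenvalue as in Figure 15.
print(np.abs(eigs_d1[9:60] - eigs_d2[9:60]).max())
```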

Second, we discuss the region when α21\alpha_{2}\geq 1. As in Theorem 2.7, we can show that the spectrum of 𝐖\mathbf{W} is close to those of the matrices defined in (2.14) and (2.16), which depend on the clean signal part 𝐖1\mathbf{W}_{1}. As in the case when d=1d=1, the spectrum of 𝐖1\mathbf{W}_{1} may not follow the MP law and depends on both λ1\lambda_{1} and λ2\lambda_{2} as well as the chosen bandwidth. This dependence suggests that the spectrum of 𝐖\mathbf{W} might be different from that when d=1d=1. In Figure 16, we show numerically how the bulk under the setup d=2d=2 and h=ph=p is different from that when d=1d=1 and h=ph=p.

Third, when α11>α20,\alpha_{1}\geq 1>\alpha_{2}\geq 0, the spectrum of 𝐖\mathbf{W} will be close to some matrices that depend only on λ1.\lambda_{1}. In this case, the spectrum of 𝐖\mathbf{W} is close to that of the 𝐖1\mathbf{W}_{1} arising in the d=1d=1 case with signal strength λ1.\lambda_{1}. See Figure 17 for an illustration, where we see that the bulks are fairly close to each other, while they may not necessarily follow the MP law.

Figure 16. Comparison of bulk eigenvalues of 𝐖\mathbf{W} for different intrinsic dimensions, d=1d=1 and d=2d=2. When d=2d=2, we have two large signals; that is, α1α21\alpha_{1}\geq\alpha_{2}\geq 1. For the construction of 𝐖,\mathbf{W}, we consider h=p.h=p. In this simulation, we use Gaussian random vectors with the covariance matrix (1.6). When d=1,d=1, we choose λ1=p2,λ2==λp=1\lambda_{1}=p^{2},\lambda_{2}=\cdots=\lambda_{p}=1; when d=2,d=2, we choose λ1=p2,λ2=p1.6,λ3==λp=1.\lambda_{1}=p^{2},\lambda_{2}=p^{1.6},\lambda_{3}=\cdots=\lambda_{p}=1. We consider the comparison for three different settings of cn=n/pc_{n}=n/p with n=200n=200. For a better visualization of the bulk eigenvalues, we start from λ10(𝐖).\lambda_{10}(\mathbf{W}). We can see that the bulk eigenvalues for these two settings are different and do not necessarily follow the MP law.
Figure 17. Comparison of bulk eigenvalues of 𝐖\mathbf{W} for different intrinsic dimensions, d=1d=1 and d=2d=2. When d=2d=2, we have one large signal and one small signal; that is, α11>α20\alpha_{1}\geq 1>\alpha_{2}\geq 0. For the construction of 𝐖,\mathbf{W}, we consider h=p.h=p. In the simulations, we use Gaussian random vectors with the covariance matrix (1.6). When d=1,d=1, we choose λ1=p2,λ2==λp=1\lambda_{1}=p^{2},\lambda_{2}=\cdots=\lambda_{p}=1; when d=2,d=2, we choose λ1=p2,λ2=p0.4,λ3==λp=1.\lambda_{1}=p^{2},\lambda_{2}=p^{0.4},\lambda_{3}=\cdots=\lambda_{p}=1. We consider three different settings of cn=n/pc_{n}=n/p with n=200.n=200. For a better visualization of the bulk eigenvalues, we start from λ10(𝐖).\lambda_{10}(\mathbf{W}). We see that the bulk eigenvalues are fairly close to each other.

Appendix B Proof of main results in Section 2

In this section, we prove the main results in Section 2.

B.1. Proof of Theorem 2.3

Recall τ\tau defined in (2.5). Since the proof holds for the general kernel functions described in Remark 2.4, we carry out our analysis with such a general kernel function f(x)f(x).

Proof.

We start by simplifying 𝐖\mathbf{W}. Let δij\delta_{ij} denote the Kronecker delta. By the Taylor expansion, when iji\neq j, we have that

𝐖(i,j)=\displaystyle\mathbf{W}(i,j)= f(τ)+f(τ)[𝐎x(i,j)2𝐏x(i,j)]\displaystyle f(\tau)+f^{\prime}(\tau)\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]
+f(2)(τ)2[𝐎x(i,j)2𝐏x(i,j)]2\displaystyle+\frac{f^{(2)}(\tau)}{2}\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]^{2}
(B.1) +f(3)(ξx(i,j))6[𝐎x(i,j)2𝐏x(i,j)]3,\displaystyle+\frac{f^{(3)}(\xi_{x}(i,j))}{6}\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]^{3},

where 𝐏x(i,j)\mathbf{P}_{x}(i,j) is defined in (A.15), 𝐎x(i,j)\mathbf{O}_{x}(i,j) is defined as

(B.2) 𝐎x(i,j)=(1δij)(𝐱i22+𝐱j22pτ),\mathbf{O}_{x}(i,j)=(1-\delta_{ij})\left(\frac{\|\mathbf{x}_{i}\|_{2}^{2}+\|\mathbf{x}_{j}\|_{2}^{2}}{p}-\tau\right),

and ξx(i,j)\xi_{x}(i,j) is some value between (𝐱i22+𝐱j22)/p(\|\mathbf{x}_{i}\|_{2}^{2}+\|\mathbf{x}_{j}\|_{2}^{2})/p and τ=2(λ/p+1)\tau=2(\lambda/p+1), where τ\tau is defined in (2.5). Consequently, we find that 𝐖\mathbf{W} can be rewritten as

𝐖=\displaystyle\mathbf{W}= f(τ)𝟏𝟏2f(τ)p𝐗𝐗+ς(λ)𝐈\displaystyle\,f(\tau)\mathbf{1}\mathbf{1}^{\top}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}
+[f(τ)𝐎x+f(2)(τ)2𝐇x+f(3)(ξx(i,j))6𝐐x]\displaystyle+\Big{[}f^{\prime}(\tau)\mathbf{O}_{x}+\frac{f^{(2)}(\tau)}{2}\mathbf{H}_{x}+\frac{f^{(3)}(\xi_{x}(i,j))}{6}\mathbf{Q}_{x}\Big{]}
(B.3) +2f(τ)(1pdiag(𝐱12,,𝐱n2)1),\displaystyle+2f^{\prime}(\tau)\left(\frac{1}{p}\text{diag}(\|\mathbf{x}_{1}\|^{2},\ldots,\|\mathbf{x}_{n}\|^{2})-1\right),

where we used the shorthand notations

𝐇x(i,j)=[𝐎x(i,j)2𝐏x(i,j)]2\displaystyle\mathbf{H}_{x}(i,j)=\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]^{2}
𝐐x(i,j)=[𝐎x(i,j)2𝐏x(i,j)]3.\displaystyle\mathbf{Q}_{x}(i,j)=\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]^{3}.

With this expansion, we immediately obtain

𝐖=\displaystyle\mathbf{W}= f(τ)𝟏𝟏2f(τ)p𝐗𝐗+ς(λ)𝐈+f(τ)𝐎x\displaystyle\,f(\tau)\mathbf{1}\mathbf{1}^{\top}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}+f^{\prime}(\tau)\mathbf{O}_{x}
(B.4) +f(2)(τ)2𝐇x+O(n1/2),\displaystyle+\frac{f^{(2)}(\tau)}{2}\mathbf{H}_{x}+O_{\prec}(n^{-1/2})\,,

where the error quantified by OO_{\prec} is in the operator norm, the term 1pdiag(𝐱12,,𝐱n2)1\frac{1}{p}\text{diag}(\|\mathbf{x}_{1}\|^{2},\ldots,\|\mathbf{x}_{n}\|^{2})-1 is controlled by Lemma A.5, and the term 𝐐x\mathbf{Q}_{x} is controlled by the facts that 𝐎x(i,j)(1δij)n1/2\mathbf{O}_{x}(i,j)\prec(1-\delta_{ij})n^{-1/2}, 𝐏x(i,j)(1δij)n1/2\mathbf{P}_{x}(i,j)\prec(1-\delta_{ij})n^{-1/2} and the Gershgorin circle theorem.

Next, we control 𝐎x\mathbf{O}_{x} and 𝐇x\mathbf{H}_{x}. Since 𝐎x=𝟏Φ+Φ𝟏2diag{ϕ1,,ϕn}\mathbf{O}_{x}=\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top}-2\text{diag}\{\phi_{1},\cdots,\phi_{n}\}, we could approximate 𝐎x\mathbf{O}_{x} by 𝟏Φ+Φ𝟏\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top} via

(B.5) 𝐎x(𝟏Φ+Φ𝟏)\displaystyle\|\mathbf{O}_{x}-(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})\| =2diag{ϕ1,,ϕn}n1/2,\displaystyle=\|2\text{diag}\{\phi_{1},\cdots,\phi_{n}\}\|\prec n^{-1/2}\,,

where Φ=(ϕ1,,ϕn)\Phi=(\phi_{1},\ldots,\phi_{n}) with ϕi=1p𝐱i22(1+λ/p)\phi_{i}=\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-(1+\lambda/p), i=1,2,,ni=1,2,\cdots,n, is defined in (A.19), and the last bound comes from Lemma A.5.

For 𝐇x\mathbf{H}_{x}, we write 𝐇x(i,j)=[𝐎x(i,j)2𝐏x(i,j)]2=𝐎x(i,j)2+4𝐏x(i,j)24𝐎x(i,j)𝐏x(i,j)\mathbf{H}_{x}(i,j)=[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)]^{2}=\mathbf{O}_{x}(i,j)^{2}+4\mathbf{P}_{x}(i,j)^{2}-4\mathbf{O}_{x}(i,j)\mathbf{P}_{x}(i,j) and focus on the term 𝐎x(i,j)𝐏x(i,j).\mathbf{O}_{x}(i,j)\mathbf{P}_{x}(i,j). Since 𝟏Φ𝐏x=𝐏xdiag{ϕ1,,ϕn}\mathbf{1}\Phi^{\top}\circ\mathbf{P}_{x}=\mathbf{P}_{x}\text{diag}\{\phi_{1},\cdots,\phi_{n}\} and Φ𝟏𝐏x=diag{ϕ1,,ϕn}𝐏x\Phi\mathbf{1}^{\top}\circ\mathbf{P}_{x}=\text{diag}\{\phi_{1},\cdots,\phi_{n}\}\mathbf{P}_{x}, while the diagonal part of 𝐎x\mathbf{O}_{x} does not contribute because 𝐏x\mathbf{P}_{x} has zero diagonal, we have (also see the proof of [27, Theorem 2.2])

(B.6) 𝐎x𝐏x=\displaystyle\mathbf{O}_{x}\circ\mathbf{P}_{x}=\, 𝐏xdiag{ϕ1,,ϕn}+diag{ϕ1,,ϕn}𝐏x.\displaystyle\mathbf{P}_{x}\text{diag}\{\phi_{1},\cdots,\phi_{n}\}+\text{diag}\{\phi_{1},\cdots,\phi_{n}\}\mathbf{P}_{x}\,.

Moreover, construct 𝐏y\mathbf{P}_{y} from 𝒴\mathcal{Y} in the same way as (A.15) and write

(B.7) 𝐏x𝐏y=1p(𝒛𝒛+𝒛𝒚+𝒚𝒛)(𝟏𝟏𝐈n),\mathbf{P}_{x}-\mathbf{P}_{y}=\frac{1}{p}\left(\bm{z}\bm{z}^{\top}+\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top}\right)\circ(\bm{1}\bm{1}^{\top}-\mathbf{I}_{n})\,,

where 𝒚:=(𝐲11,,𝐲n1),𝒛:=(𝐳11,,𝐳n1).\bm{y}:=(\mathbf{y}_{11},\cdots,\mathbf{y}_{n1})^{\top},\bm{z}:=(\mathbf{z}_{11},\cdots,\mathbf{z}_{n1})^{\top}. We find that

\|\mathbf{P}_{x}-\mathbf{P}_{y}\|\leq\frac{1}{p}\left(\|\bm{z}\bm{z}^{\top}+\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top}\|+\|(\bm{z}\bm{z}^{\top}+\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top})\circ\mathbf{I}_{n}\|\right).

Note that 𝒛𝒛+𝒛𝒚+𝒚𝒛/p(𝒛+𝒚2+𝒚2)/p(2𝒛2+3𝒚2)/pλ+1\|\bm{z}\bm{z}^{\top}+\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top}\|/p\leq(\|\bm{z}+\bm{y}\|^{2}+\|\bm{y}\|^{2})/p\leq(2\|\bm{z}\|^{2}+3\|\bm{y}\|^{2})/p\prec\lambda+1 by (A.4) and (A.5), and p1(𝒛𝒛+𝒛𝒚+𝒚𝒛)𝐈n(λ+1)/pp^{-1}\|(\bm{z}\bm{z}^{\top}+\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top})\circ\mathbf{I}_{n}\|\prec(\lambda+1)/p by the fact that |𝒛i1|2λ|\bm{z}_{i1}|^{2}\prec\lambda and |𝒛i1𝒚i1|λ|\bm{z}_{i1}\bm{y}_{i1}|\prec\sqrt{\lambda}, so we have 𝐏x𝐏yλ+1.\|\mathbf{P}_{x}-\mathbf{P}_{y}\|\prec\lambda+1. By Lemma A.6 and (A.4), we find that 𝐏y1\|\mathbf{P}_{y}\|\prec 1. On the other hand, by the fact that 𝐆x=(𝐙𝐙+𝐙𝐘+𝐘𝐙+𝐘𝐘)/p(𝒛2+2𝒛𝒚)/p+𝐘𝐘/pλ+1\|\mathbf{G}_{x}\|=\|(\mathbf{Z}^{\top}\mathbf{Z}+\mathbf{Z}^{\top}\mathbf{Y}+\mathbf{Y}^{\top}\mathbf{Z}+\mathbf{Y}^{\top}\mathbf{Y})/p\|\leq(\|\bm{z}\|^{2}+2\|\bm{z}\|\|\bm{y}\|)/p+\|\mathbf{Y}^{\top}\mathbf{Y}\|/p\prec\lambda+1 and (A.5), we have

(B.8) 𝐏xλ+1.\|\mathbf{P}_{x}\|\prec\lambda+1\,.

By (B.5), we conclude that

𝐎x𝐏xλ+1n.\|\mathbf{O}_{x}\circ\mathbf{P}_{x}\|\prec\frac{\lambda+1}{\sqrt{n}}.

With the above preparation, 𝐖\mathbf{W} is reduced from (B.1) to

𝐖=\displaystyle\mathbf{W}=\, f(τ)𝟏𝟏2f(τ)p𝐗𝐗+ς(λ)𝐈n+f(τ)𝐎x\displaystyle f(\tau)\mathbf{1}\mathbf{1}^{\top}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}_{n}+f^{\prime}(\tau)\mathbf{O}_{x}
+f(2)(τ)2𝐎x𝐎x+2f(2)(τ)𝐏x𝐏x\displaystyle+\frac{f^{(2)}(\tau)}{2}\mathbf{O}_{x}\circ\mathbf{O}_{x}+2f^{(2)}(\tau)\mathbf{P}_{x}\circ\mathbf{P}_{x}
(B.9) +O(λ+1n).\displaystyle+O_{\prec}\left(\frac{\lambda+1}{\sqrt{n}}\right).

We further simplify 𝐖\mathbf{W}. Recall that 𝐏y\mathbf{P}_{y} is constructed from the point cloud 𝒴\mathcal{Y} in the same way as 𝐏x\mathbf{P}_{x}. By Lemma A.8, 𝐏x𝐏x\mathbf{P}_{x}\circ\mathbf{P}_{x} can be replaced by 𝐏y𝐏y\mathbf{P}_{y}\circ\mathbf{P}_{y}. Moreover, by a discussion similar to (B.5), we can control 𝐎x𝐎x\mathbf{O}_{x}\circ\mathbf{O}_{x} via

\displaystyle\| (𝟏Φ+Φ𝟏)(𝟏Φ+Φ𝟏)𝐎x𝐎x=2diag{ϕ12,,ϕn2}n1.\displaystyle(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})\circ(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})-\mathbf{O}_{x}\circ\mathbf{O}_{x}\|=\|2\text{diag}\{\phi_{1}^{2},\cdots,\phi_{n}^{2}\}\|\prec n^{-1}.

Combining all the above results, and applying Lemma A.7, we have simplified 𝐖\mathbf{W} as

(B.10) 𝐖=𝖶+ς(λ)𝐈n+O(nϵλ+1n+n1/4),\displaystyle\mathbf{W}=\mathsf{W}+\varsigma(\lambda)\mathbf{I}_{n}+O\left(n^{\epsilon}\frac{\lambda+1}{\sqrt{n}}+n^{-1/4}\right)\,,

where

𝖶:=\displaystyle\mathsf{W}:=\, (f(τ)+2f(2)(2)p1)𝟏𝟏2f(τ)p𝐗𝐗\displaystyle(f(\tau)+2f^{(2)}(2)p^{-1})\mathbf{1}\mathbf{1}^{\top}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}
(B.11) +f(τ)(𝟏Φ+Φ𝟏)+f(2)(τ)2(𝟏Φ+Φ𝟏)(𝟏Φ+Φ𝟏),\displaystyle+f^{\prime}(\tau)(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})+\frac{f^{(2)}(\tau)}{2}(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})\circ(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})\,,

with probability at least 1O(n1/2)1-O(n^{-1/2}) for some small ϵ>0\epsilon>0.

With the above simplification, we discuss the outlying eigenvalues. Invoking (B.10), since ς(λ)𝐈n\varsigma(\lambda)\mathbf{I}_{n} is simply an isotropic shift, the outlying eigenvalues of 𝐖\mathbf{W} can only come from 𝖶\mathsf{W}. Notice that by the identity 𝐚𝐛𝐮𝐯=(𝐚𝐮)(𝐛𝐯)\mathbf{a}\mathbf{b}^{\top}\circ\mathbf{u}\mathbf{v}^{\top}=(\mathbf{a}\circ\mathbf{u})(\mathbf{b}\circ\mathbf{v})^{\top}, we find that

(𝟏Φ\displaystyle(\mathbf{1}\Phi^{\top} +Φ𝟏)(𝟏Φ+Φ𝟏)\displaystyle+\Phi\mathbf{1}^{\top})\circ(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})
(B.12) =𝟏(ΦΦ)+(ΦΦ)𝟏+2ΦΦ,\displaystyle=\mathbf{1}(\Phi\circ\Phi)^{\top}+(\Phi\circ\Phi)\mathbf{1}^{\top}+2\Phi\Phi^{\top}\,,

which leads to a rearrangement of 𝖶\mathsf{W} to

𝖶=\displaystyle\mathsf{W}=\, 𝟏[12(f(τ)+2f(2)(2)p1)𝟏+f(τ)Φ\displaystyle\mathbf{1}\Big{[}\frac{1}{2}(f(\tau)+2f^{(2)}(2)p^{-1})\mathbf{1}^{\top}+f^{\prime}(\tau)\Phi^{\top}
+f(2)(τ)2(ΦΦ)]+[12(f(τ)+2f(2)(2)p1)𝟏+f(τ)Φ\displaystyle\qquad+\frac{f^{(2)}(\tau)}{2}(\Phi\circ\Phi)^{\top}\Big{]}+\,\Big{[}\frac{1}{2}(f(\tau)+2f^{(2)}(2)p^{-1})\mathbf{1}+f^{\prime}(\tau)\Phi
+f(2)(τ)2(ΦΦ)]𝟏+f(2)(τ)ΦΦ2f(τ)p𝐗𝐗\displaystyle\qquad+\frac{f^{(2)}(\tau)}{2}(\Phi\circ\Phi)\Big{]}\mathbf{1}^{\top}+f^{(2)}(\tau)\Phi\Phi^{\top}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}
(B.13) :=\displaystyle:=\, 𝖮2f(τ)p𝐗𝐗.\displaystyle\mathsf{O}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}.

Note that 𝖮\mathsf{O} is of rank at most three since the first two terms of (B.13) form a matrix of rank at most 22 and ΦΦ\Phi\Phi^{\top} is a rank-one matrix with the spectral norm of order O(n)O(\sqrt{n}). With (B.10), we can therefore conclude our proof using Lemma A.6.

B.2. Proof of Theorem 2.5

Since the proof in this subsection holds for the general kernel functions described in Remark 2.4, we will carry out our analysis with such a general kernel function f(x)f(x). Note that in this case, τ\tau defined in (2.5) is still bounded from above. So for a fixed KK\in\mathbb{N}, the first KK coefficients in the Taylor expansion can be well controlled under the smoothness assumption, i.e., f(k)(τ)1f^{(k)}(\tau)\asymp 1 for k=1,2,,Kk=1,2,\ldots,K. However, Lemma A.9 is invalid since α>0\alpha>0. Moreover, in this region, although the concentration inequality (c.f. Lemma A.5) still works, its rate becomes worse as λ\lambda becomes larger. In [26], the author only needs to carry out the Taylor expansion up to the third order since λ\lambda is fixed. In our setup, to handle the divergent λ\lambda, we need a higher-order expansion that is adaptive to λ.\lambda. Thus, due to the nature of the convergence rate in Lemma A.5, we employ different proof strategies for the cases 0<α<0.50<\alpha<0.5 and 0.5α<1.0.5\leq\alpha<1. When 0<α<0.50<\alpha<0.5, the proof of Theorem 2.3 still holds. When 0.5α<10.5\leq\alpha<1, we need a higher-order Taylor expansion to control the convergence. This is due to the second term of (A.5), where the concentration inequalities for 𝐱i𝐱j\mathbf{x}_{i}^{\top}\mathbf{x}_{j}, iji\neq j, have different upper bounds for different α\alpha.

Proof of case (1), 0<α<0.50<\alpha<0.5.

By (B.1) and Lemma A.5, we find that when iji\neq j,

𝐖(i,j)=\displaystyle\mathbf{W}(i,j)= f(τ)+f(τ)[𝐎x(i,j)2𝐏x(i,j)]\displaystyle\,f(\tau)+f^{\prime}(\tau)\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]
+f(2)(τ)2[𝐎x(i,j)2𝐏x(i,j)]2+O(n3/2),\displaystyle+\frac{f^{(2)}(\tau)}{2}\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]^{2}+O_{\prec}(n^{-3/2}),

where we used the fact that f(3)(ξx(i,j))f^{(3)}(\xi_{x}(i,j)) is bounded. By a discussion similar to the one leading to (B.4), together with the Gershgorin circle theorem, we find that (B.4) also holds true. The rest of the proof follows the lines of the proof of Theorem 2.3 using Lemmas A.5 and A.6. We omit the details here.

Proof of Case (2), 0.5α<10.5\leq\alpha<1.

For simplicity, we introduce

(B.14) 𝐋x:=𝐎x𝐏x.\mathbf{L}_{x}:=\mathbf{O}_{x}-\mathbf{P}_{x}.

By Lemma A.5 and notations defined in (B.2), we have

(B.15) |𝐏x(i,j)|=O(λ/n) and |𝐎x(i,j)|=O(λ/n).|\mathbf{P}_{x}(i,j)|=O_{\prec}(\lambda/n)\ \mbox{ and }\ |\mathbf{O}_{x}(i,j)|=O_{\prec}(\lambda/n)\,.

By the Taylor expansion, when iji\neq j, we have

(B.16) 𝐖(i,j)=\displaystyle\mathbf{W}(i,j)= k=0𝔡1f(k)(τ)k!𝐋x(i,j)k+f(𝔡)(ξx(i,j))𝔡!𝐋x(i,j)𝔡,\displaystyle\,\sum_{k=0}^{\mathfrak{d}-1}\frac{f^{(k)}(\tau)}{k!}\mathbf{L}_{x}(i,j)^{k}+\frac{f^{(\mathfrak{d})}(\xi_{x}(i,j))}{\mathfrak{d}!}\mathbf{L}_{x}(i,j)^{\mathfrak{d}}\,,

where 𝔡\mathfrak{d} is defined in (2.11) and ξx(i,j)\xi_{x}(i,j) is some value between (𝐱i22+𝐱j22)/p(\|\mathbf{x}_{i}\|_{2}^{2}+\|\mathbf{x}_{j}\|_{2}^{2})/p and τ.\tau. Consider 𝐖~,𝐑𝔡n×n\widetilde{\mathbf{W}},\,\mathbf{R}_{\mathfrak{d}}\in\mathbb{R}^{n\times n} defined as

𝐖~(i,j):=k=3𝔡1f(k)(τ)𝐋x(i,j)kk!,\displaystyle\widetilde{\mathbf{W}}(i,j):=\sum_{k=3}^{\mathfrak{d}-1}\frac{f^{(k)}(\tau)\mathbf{L}_{x}(i,j)^{k}}{k!}\,,
𝐑𝔡(i,j)=f(𝔡)(ξx(i,j))𝔡!𝐋x(i,j)𝔡,\displaystyle\mathbf{R}_{\mathfrak{d}}(i,j)=\frac{f^{(\mathfrak{d})}(\xi_{x}(i,j))}{\mathfrak{d}!}\mathbf{L}_{x}(i,j)^{\mathfrak{d}},

so that 𝐖=k=02f(k)(τ)k!𝐋x(i,j)k+𝐖~+𝐑𝔡\mathbf{W}=\sum_{k=0}^{2}\frac{f^{(k)}(\tau)}{k!}\mathbf{L}_{x}(i,j)^{k}+\widetilde{\mathbf{W}}+\mathbf{R}_{\mathfrak{d}}. We start from claiming that

(B.17) |𝐑𝔡(i,j)|p(α)1,\left|\mathbf{R}_{\mathfrak{d}}(i,j)\right|\prec p^{\mathcal{B}(\alpha)-1}\,,

where (α)=(α1)11α+α<0\mathcal{B}(\alpha)={(\alpha-1)\left\lceil\frac{1}{1-\alpha}\right\rceil+\alpha<0} is defined in (2.13). To see (B.17), we use (A.5) and the fact that dd is finite to get

𝐋x(i,j)𝔡n𝔡(1+α)=n(α1)11α+α1=n(α)1.\mathbf{L}_{x}(i,j)^{\mathfrak{d}}\prec n^{\mathfrak{d}(-1+\alpha)}=n^{(\alpha-1)\left\lceil\frac{1}{1-\alpha}\right\rceil+\alpha-1}=n^{\mathcal{B}(\alpha)-1}\,.

Together with (B.16), by the Gershgorin circle theorem, we have

𝐑𝔡=𝐖k=0𝔡1f(k)(τ)k!𝐋xkn(α),\|\mathbf{R}_{\mathfrak{d}}\|=\left\|\mathbf{W}-\sum_{k=0}^{\mathfrak{d}-1}\frac{f^{(k)}(\tau)}{k!}\mathbf{L}_{x}^{\circ k}\right\|\prec n^{\mathcal{B}(\alpha)},

where we set

𝐋xk:=𝐋x𝐋xktimes.\mathbf{L}_{x}^{\circ k}:=\underbrace{\mathbf{L}_{x}\circ\cdots\circ\mathbf{L}_{x}}_{k\ \text{times}}.

Similar definition applies to 𝐎xk\mathbf{O}_{x}^{\circ k} and 𝐏xk.\mathbf{P}_{x}^{\circ k}.

Next, we study 𝐖~.\widetilde{\mathbf{W}}. Recall the definition of Φ\Phi in (A.20). To simplify the notation, we denote 𝖥1:=𝟏Φ+Φ𝟏\mathsf{F}_{1}:=\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top} and 𝖥2:=2diag{ϕ1,,ϕn}\mathsf{F}_{2}:=-2\operatorname{diag}\{\phi_{1},\cdots,\phi_{n}\}, and obtain

(B.18) 𝐎x=𝖥1+𝖥2.\mathbf{O}_{x}=\mathsf{F}_{1}+\mathsf{F}_{2}.

Clearly, rank(𝖥1)2\operatorname{rank}(\mathsf{F}_{1})\leq 2, and by Lemma A.5, we have

(B.19) 𝐎x𝖥1λp.\|\mathbf{O}_{x}-\mathsf{F}_{1}\|\prec\frac{\lambda}{p}.

For any 3k𝔡13\leq k\leq\mathfrak{d}-1, in view of the expansion

𝐋xk=l=0k(kl)𝐎xl(𝐏x)(kl),\mathbf{L}_{x}^{\circ k}=\sum_{l=0}^{k}{k\choose l}\mathbf{O}_{x}^{\circ l}\circ(-\mathbf{P}_{x})^{\circ(k-l)}\,,

below we examine 𝐎xl(𝐏x)(kl)\mathbf{O}_{x}^{\circ l}\circ(-\mathbf{P}_{x})^{\circ(k-l)} term by term.

First, when l=0l=0, we only have the term 𝐏xk\mathbf{P}^{\circ k}_{x}. We focus on the discussion when k=3k=3, and the same argument holds when k>3k>3. We need the following identity. For any n×nn\times n matrix 𝐄\mathbf{E} and vectors 𝐮,𝐯n\mathbf{u},\mathbf{v}\in\mathbb{R}^{n},

(B.20) 𝐄𝐮𝐯=diag(𝐮)𝐄diag(𝐯).\mathbf{E}\circ\mathbf{u}\mathbf{v}^{\top}=\text{diag}(\mathbf{u})\mathbf{E}\text{diag}(\mathbf{v}).
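Identity (B.20) is elementary; a quick numerical confirmation (illustration only) reads as follows.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
E = rng.standard_normal((n, n))
u, v = rng.standard_normal(n), rng.standard_normal(n)

# (E o u v^T)_{ij} = E_{ij} u_i v_j = (diag(u) E diag(v))_{ij}
assert np.allclose(E * np.outer(u, v), np.diag(u) @ E @ np.diag(v))
```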

Note that the expansion in (B.7) still holds, and to further simplify the notation, we denote

(B.21) 𝐏x𝐏y=𝖳𝖰,\displaystyle\mathbf{P}_{x}-\mathbf{P}_{y}=\mathsf{T}\circ\mathsf{Q}\,,

where 𝖳:=1p(𝒛𝒚+𝒚𝒛+𝒛𝒛)\mathsf{T}:=\frac{1}{p}(\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top}+\bm{z}\bm{z}^{\top}) and 𝖰:=𝟏𝟏𝐈n\mathsf{Q}:=\mathbf{1}\mathbf{1}^{\top}-\mathbf{I}_{n}. We thus have that

𝐏x3𝐏y3\displaystyle\mathbf{P}_{x}^{\circ 3}-\mathbf{P}_{y}^{\circ 3} =(𝐏x𝐏y)(𝐏x2+𝐏x𝐏y+𝐏y2)\displaystyle=(\mathbf{P}_{x}-\mathbf{P}_{y})\circ(\mathbf{P}_{x}^{\circ 2}+\mathbf{P}_{x}\circ\mathbf{P}_{y}+\mathbf{P}_{y}^{\circ 2})
(B.22) =(𝐏x𝐏y)𝐏x2+(𝐏x𝐏y)𝐏x𝐏y+(𝐏x𝐏y)𝐏y2.\displaystyle=(\mathbf{P}_{x}-\mathbf{P}_{y})\circ\mathbf{P}_{x}^{\circ 2}+(\mathbf{P}_{x}-\mathbf{P}_{y})\circ\mathbf{P}_{x}\circ\mathbf{P}_{y}+(\mathbf{P}_{x}-\mathbf{P}_{y})\circ\mathbf{P}_{y}^{\circ 2}.

We control the first term; the other terms can be controlled in the same way. Since 𝐏y=𝐆y𝖰\mathbf{P}_{y}=\mathbf{G}_{y}\circ\mathsf{Q} and 𝐆y=p1𝐘𝐘\mathbf{G}_{y}=p^{-1}\mathbf{Y}^{\top}\mathbf{Y}, we have that

𝐏x2=(𝖳+𝐆y)(𝖳+𝐆y)𝖰.\mathbf{P}_{x}^{\circ 2}=\left(\mathsf{T}+\mathbf{G}_{y}\right)\circ\left(\mathsf{T}+\mathbf{G}_{y}\right)\circ\mathsf{Q}.

Together with (B.21) and the fact that 𝖰2=𝖰\mathsf{Q}^{\circ 2}=\mathsf{Q}, we obtain

(𝐏x𝐏y)𝐏x2\displaystyle(\mathbf{P}_{x}-\mathbf{P}_{y})\circ\mathbf{P}_{x}^{\circ 2} =(𝖳+𝐆y)(𝖳+𝐆y)𝖳𝖰\displaystyle=\left(\mathsf{T}+\mathbf{G}_{y}\right)\circ\left(\mathsf{T}+\mathbf{G}_{y}\right)\circ\mathsf{T}\circ\mathsf{Q}
=(𝖳𝖳+2𝖳𝐆y+𝐆y𝐆y)𝖳𝖰\displaystyle=\left(\mathsf{T}\circ\mathsf{T}+2\mathsf{T}\circ\mathbf{G}_{y}+\mathbf{G}_{y}\circ\mathbf{G}_{y}\right)\circ\mathsf{T}\circ\mathsf{Q}
=𝖳3𝖰+𝖱1+𝖱2,\displaystyle=\mathsf{T}^{\circ 3}\circ\mathsf{Q}+\mathsf{R}_{1}+\mathsf{R}_{2}\,,

where 𝖱1:=2𝖳2𝐆y𝖰\mathsf{R}_{1}:=2\mathsf{T}^{\circ 2}\circ\mathbf{G}_{y}\circ\mathsf{Q} and 𝖱2:=𝐆y2𝖳𝖰\mathsf{R}_{2}:=\mathbf{G}_{y}^{\circ 2}\circ\mathsf{T}\circ\mathsf{Q}. Now we discuss the above three terms one by one. First, using (B.20), we have

𝖳3𝖰=𝖳3[𝖳𝐈n]3=𝖳3+O(λ3p3),\displaystyle\mathsf{T}^{\circ 3}\circ\mathsf{Q}=\mathsf{T}^{\circ 3}-\left[\mathsf{T}\circ\mathbf{I}_{n}\right]^{3}=\mathsf{T}^{\circ 3}+O_{\prec}\left(\frac{\lambda^{3}}{p^{3}}\right),

where in the second equality we used the fact that 𝖳(i,i)λ/p.\mathsf{T}(i,i)\prec\lambda/p. Second, we have

𝖱1\displaystyle\mathsf{R}_{1} =2𝖳2𝐆y2[𝖳𝐈n]2[𝐆y𝐈n]\displaystyle=2\mathsf{T}^{\circ 2}\circ\mathbf{G}_{y}-2\left[\mathsf{T}\circ\mathbf{I}_{n}\right]^{2}[\mathbf{G}_{y}\circ\mathbf{I}_{n}]
=2𝖳2𝐆y+O((λ/p)2).\displaystyle=2\mathsf{T}^{\circ 2}\circ\mathbf{G}_{y}+O_{\prec}\left((\lambda/p)^{2}\right).

On one hand, we can use (B.20) to write

𝖳2𝐆y=\displaystyle\mathsf{T}^{\circ 2}\circ\mathbf{G}_{y}= 1p2[diag(𝒛)]2𝐆y[diag(𝒛)]2\displaystyle\,\frac{1}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]^{2}\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]^{2}
+2p2[diag(𝒛)]2𝐆y[diag(𝒛)][diag(𝒚)]\displaystyle+\frac{2}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]^{2}\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]\left[\operatorname{diag}(\bm{y})\right]
+2p2[diag(𝒛)][diag(𝒚)]𝐆y[diag(𝒛)]2\displaystyle+\frac{2}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]\left[\operatorname{diag}(\bm{y})\right]\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]^{2}
+1p2[diag(𝒛)]2𝐆y[diag(𝒚)]2\displaystyle+\frac{1}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]^{2}\mathbf{G}_{y}\left[\operatorname{diag}(\bm{y})\right]^{2}
+2p2[diag(𝒛)][diag(𝒚)]𝐆y[diag(𝒛)][diag(𝒚)]\displaystyle+\frac{2}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]\left[\operatorname{diag}(\bm{y})\right]\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]\left[\operatorname{diag}(\bm{y})\right]
+1p2[diag(𝒚)]2𝐆y[diag(𝒛)]2.\displaystyle+\frac{1}{p^{2}}\left[\operatorname{diag}(\bm{y})\right]^{2}\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]^{2}.

The first term of the above equation is the leading order term which can be bounded as follows

1p2[diag(𝒛)]2𝐆y[diag(𝒛)]2(λ/p)2,\left\|\frac{1}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]^{2}\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]^{2}\right\|\prec(\lambda/p)^{2},

where we used the fact that 𝐆y=O(1).\|\mathbf{G}_{y}\|=O_{\prec}(1). The other terms can be bounded similarly so that

𝖱1=O((λ/p)2).\mathsf{R}_{1}=O_{\prec}\left((\lambda/p)^{2}\right).

Similarly, we can control 𝖱2.\mathsf{R}_{2}. This shows that

(𝐏x𝐏y)𝐏x2=𝖳3+O((λ/p)2).(\mathbf{P}_{x}-\mathbf{P}_{y})\circ\mathbf{P}_{x}^{\circ 2}=\mathsf{T}^{\circ 3}+O_{\prec}((\lambda/p)^{2}).

Analogously, we can analyze the other two terms in (B.2) and obtain that

(B.23) 𝐏x3𝐏y3=𝖳3+𝖳2+O(λ/p).\mathbf{P}_{x}^{\circ 3}-\mathbf{P}_{y}^{\circ 3}=\mathsf{T}^{\circ 3}+\mathsf{T}^{\circ 2}+O_{\prec}(\lambda/p).

Moreover, since 𝐏y3=𝐏y(𝐏y2𝖰/p)+𝐏y𝖰/p\mathbf{P}_{y}^{\circ 3}=\mathbf{P}_{y}\circ(\mathbf{P}_{y}^{\circ 2}-\mathsf{Q}/p)+\mathbf{P}_{y}\circ\mathsf{Q}/p, we have by Lemma A.1 that

𝐏y(𝐏y2𝖰/p)maxi,j|𝐏y(i,j)|𝐏y2𝖰/pλ/p.\left\|\mathbf{P}_{y}\circ(\mathbf{P}_{y}^{\circ 2}-\mathsf{Q}/p)\right\|\leq\max_{i,j}|\mathbf{P}_{y}(i,j)|\left\|\mathbf{P}_{y}^{\circ 2}-\mathsf{Q}/p\right\|\prec\lambda/p\,.

On the other hand, since 𝐏y𝖰/p=1p𝐏y=O(1/p)\mathbf{P}_{y}\circ\mathsf{Q}/p=\frac{1}{p}\mathbf{P}_{y}=O_{\prec}(1/p), we have 𝐏y3λ/p\|\mathbf{P}_{y}^{\circ 3}\|\prec\lambda/p. Consequently, we have that

𝐏x3=𝖳3+𝖳2+O(λ/p).\mathbf{P}_{x}^{\circ 3}=\mathsf{T}^{\circ 3}+\mathsf{T}^{\circ 2}+O_{\prec}(\lambda/p).

We mention that since 𝖳\mathsf{T} has rank at most two, the rank of 𝖳3+𝖳2\mathsf{T}^{\circ 3}+\mathsf{T}^{\circ 2} is at most 23+2224.2^{3}+2^{2}\leq 2^{4}. Similarly, using the above discussion for general k>3,k>3, we can show that

𝐏xk=j=2k𝖳j+O(λ/p).\mathbf{P}_{x}^{\circ k}=\sum_{j=2}^{k}\mathsf{T}^{\circ j}+O_{\prec}(\lambda/p).

Second, when l=kl=k, we only have the term 𝐎xk\mathbf{O}_{x}^{\circ k}. When k=2,k=2, using (B.18) and the fact that 𝖥2\mathsf{F}_{2} is diagonal, we have that

𝐎x2=𝖥12+(𝖥2)2+2𝖥2diag(𝖥1).\mathbf{O}_{x}^{\circ 2}=\mathsf{F}_{1}^{\circ 2}+(\mathsf{F}_{2})^{2}+2\mathsf{F}_{2}\operatorname{diag}(\mathsf{F}_{1}).

By Lemma A.5 (i.e., (B.19)), we have that

𝐎x2=𝖥12+O((λ/p)2).\mathbf{O}_{x}^{\circ 2}=\mathsf{F}_{1}^{\circ 2}+O_{\prec}((\lambda/p)^{2}).

Similarly, for general k2,k\geq 2, we have that

𝐎xk=𝖥1k+O((λ/p)k).\mathbf{O}_{x}^{\circ k}=\mathsf{F}_{1}^{\circ k}+O_{\prec}((\lambda/p)^{k}).

Third, when klk\neq l and l0l\neq 0, we discuss a typical case when k=4k=4 and l=2l=2, i.e., 𝐎x2𝐏x2.\mathbf{O}_{x}^{\circ 2}\circ\mathbf{P}_{x}^{\circ 2}. We prepare some bounds. Recall that

𝐏x=𝐆x1pdiag{𝐱122,,𝐱n22},\mathbf{P}_{x}=\mathbf{G}_{x}-\frac{1}{p}\operatorname{diag}\{\|\mathbf{x}_{1}\|_{2}^{2},\cdots,\|\mathbf{x}_{n}\|_{2}^{2}\}\,,

where we have

𝐆x=𝐆y+1p(𝒛𝒚+𝒚𝒛+𝒛𝒛)=:𝐆y+𝖳.\mathbf{G}_{x}=\mathbf{G}_{y}+\frac{1}{p}(\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top}+\bm{z}\bm{z}^{\top})=:\mathbf{G}_{y}+\mathsf{T}\,.

By the definition of 𝖳\mathsf{T}, we immediately have

(B.24) rank(𝖳)2 and maxi,j|𝖳(i,j)|λ/p,\operatorname{rank}(\mathsf{T})\leq 2\ \ \mbox{ and }\ \ \max_{i,j}|\mathsf{T}(i,j)|\prec\lambda/p\,,

where the bound for maxi,j|𝖳(i,j)|\max_{i,j}|\mathsf{T}(i,j)| holds by the tail bound of the maximum of a finite set of sub-Gaussian random variables. Similarly, we have the bound

(B.25) maxi,j|𝐎x(i,j)|λ/p\max_{i,j}|\mathbf{O}_{x}(i,j)|\prec\lambda/p

by (A.5). By an expansion, we have

𝐏x2𝖳2=\displaystyle\mathbf{P}_{x}^{\circ 2}-\mathsf{T}^{\circ 2}=\, (1pdiag{𝐱122,,𝐱n22})2+𝐆y𝐆y\displaystyle\left(\frac{1}{p}\operatorname{diag}\{\|\mathbf{x}_{1}\|_{2}^{2},\cdots,\|\mathbf{x}_{n}\|_{2}^{2}\}\right)^{2}+\mathbf{G}_{y}\circ\mathbf{G}_{y}
2𝐆y1pdiag{𝐱122,,𝐱n22}+2𝐆y𝖳\displaystyle-2\mathbf{G}_{y}\circ\frac{1}{p}\operatorname{diag}\{\|\mathbf{x}_{1}\|_{2}^{2},\cdots,\|\mathbf{x}_{n}\|_{2}^{2}\}+2\mathbf{G}_{y}\circ\mathsf{T}
(B.26) 2𝖳1pdiag{𝐱122,,𝐱n22},\displaystyle-2\mathsf{T}\circ\frac{1}{p}\operatorname{diag}\{\|\mathbf{x}_{1}\|_{2}^{2},\cdots,\|\mathbf{x}_{n}\|_{2}^{2}\}\,,

which leads to

(B.27) 𝐏x2𝖳21\|\mathbf{P}_{x}^{\circ 2}-\mathsf{T}^{\circ 2}\|\prec 1

by Lemma A.1 with the fact 𝐆y1\|\mathbf{G}_{y}\|\prec 1, 1pdiag{𝐱122,,𝐱n22}1+λ/p\|\frac{1}{p}\operatorname{diag}\{\|\mathbf{x}_{1}\|_{2}^{2},\cdots,\|\mathbf{x}_{n}\|_{2}^{2}\}\|\prec 1+\lambda/p by Lemma A.5, (B.24) and maxi,j|1p𝐲i𝐲j|1\max_{i,j}|\frac{1}{p}\mathbf{y}_{i}^{\top}\mathbf{y}_{j}|\prec 1. So we have the bound

𝐎x2𝐏x2𝐎x2𝖳2\displaystyle\|\mathbf{O}_{x}^{\circ 2}\circ\mathbf{P}_{x}^{\circ 2}-\mathbf{O}_{x}^{\circ 2}\circ\mathsf{T}^{\circ 2}\| maxij|𝐎x2(i,j)|𝐏x2𝖳2\displaystyle\leq\max_{ij}|\mathbf{O}_{x}^{\circ 2}(i,j)|\|\mathbf{P}_{x}^{\circ 2}-\mathsf{T}^{\circ 2}\|
=O((λ/p)2),\displaystyle=O_{\prec}((\lambda/p)^{2})\,,

where the first bound comes from Lemma A.1 and the second bound comes from (B.25). Together with 𝐎x𝖥1=𝖥2=2diag{ϕ1,,ϕn}\mathbf{O}_{x}-\mathsf{F}_{1}=\mathsf{F}_{2}=-2\operatorname{diag}\{\phi_{1},\cdots,\phi_{n}\}, by the same argument we have that

𝐎x2𝐏x2𝖥12𝖳2=O(λ/p).\|\mathbf{O}_{x}^{\circ 2}\circ\mathbf{P}_{x}^{\circ 2}-\mathsf{F}_{1}^{\circ 2}\circ\mathsf{T}^{\circ 2}\|=O_{\prec}(\lambda/p).

Using the simple estimate that rank(AB)rank(A)rank(B)\operatorname{rank}(A\circ B)\leq\operatorname{rank}(A)\operatorname{rank}(B), we find that

rank(𝖥12𝖳2)24.\operatorname{rank}\left(\mathsf{F}_{1}^{\circ 2}\circ\mathsf{T}^{\circ 2}\right)\leq 2^{4}.

The other values of kk and ll can be handled in the same way. Precisely, when l>0l>0, 𝐎xl𝐏x(kl)\mathbf{O}_{x}^{\circ l}\circ\mathbf{P}_{x}^{\circ(k-l)} can be approximated by 𝖥1l𝖳(kl)\mathsf{F}_{1}^{\circ l}\circ\mathsf{T}^{\circ(k-l)} with a norm difference of order O(λ/p)O_{\prec}(\lambda/p), where rank(𝖥1l𝖳(kl))2k\operatorname{rank}\left(\mathsf{F}_{1}^{\circ l}\circ\mathsf{T}^{\circ(k-l)}\right)\leq 2^{k}. Therefore, up to an error of O(λ/p)O_{\prec}(\lambda/p), all the terms in 𝐋xk\mathbf{L}_{x}^{\circ k}, except 𝐏xk\mathbf{P}_{x}^{\circ k}, which will be absorbed into the first-order expansion, can be well approximated using a matrix of rank at most Ck22kC^{\prime}k2^{2k}, where 0<C10<C^{\prime}\leq 1.

Consequently, we can find a matrix 𝐌f\mathbf{M}_{f} of rank at most C22𝔡C2^{2\mathfrak{d}}, where C>0C>0, to approximate 𝐖~\widetilde{\mathbf{W}} so that

\|\widetilde{\mathbf{W}}-\mathbf{M}_{f}\|\prec\lambda/p\,.

This indicates that 𝐖~\widetilde{\mathbf{W}} will generate at most C22𝔡C2^{2\mathfrak{d}} outlying eigenvalues. Therefore, by replacing k=02f(k)(τ)k!𝐋xk\sum_{k=0}^{2}\frac{f^{(k)}(\tau)}{k!}\mathbf{L}_{x}^{\circ k} using (B.19) and the facts that 𝐎x2\mathbf{O}_{x}^{\circ 2} can be replaced by 𝖥1\mathsf{F}_{1} and 𝐎x𝐏x\mathbf{O}_{x}\circ\mathbf{P}_{x} can be approximated by 𝖥1𝖳\mathsf{F}_{1}\circ\mathsf{T} with rank bounded by 44, by the same argument as above, we have

𝐖+f(τ)𝐏xf(2)(τ)2𝐏x𝐏x𝐌~fmax{n(α),λ/p},\displaystyle\bigg{\|}\mathbf{W}+f^{\prime}(\tau)\mathbf{P}_{x}-\frac{f^{(2)}(\tau)}{2}\mathbf{P}_{x}\circ\mathbf{P}_{x}-\widetilde{\mathbf{M}}_{f}\bigg{\|}\prec\max\{n^{\mathcal{B}(\alpha)}\,,\lambda/p\}\,,

where

𝐌~f=f(τ)𝖰+(f(τ)+f(2)(τ)2)𝖥1f(2)(τ)𝖥1𝖳+𝐌f\widetilde{\mathbf{M}}_{f}=f(\tau)\mathsf{Q}+\Big{(}f^{\prime}(\tau)+\frac{f^{(2)}(\tau)}{2}\Big{)}\mathsf{F}_{1}-f^{(2)}(\tau)\mathsf{F}_{1}\circ\mathsf{T}+\mathbf{M}_{f}

is a low rank matrix of rank at most C22𝔡+6C2^{2\mathfrak{d}}+6.

To finish the spectral analysis of 𝐖\mathbf{W} in (B.16), it remains to deal with the second-order Taylor approximation, 𝐏x𝐏x\mathbf{P}_{x}\circ\mathbf{P}_{x}, so as to show that the non-outlying eigenvalues follow the MP law (recall that the first-order expansion 𝐏x\mathbf{P}_{x} involves 𝐗𝐗\mathbf{X}^{\top}\mathbf{X}). The discussion is similar to (B.2). First, we extend the Hadamard product result in Lemma A.8. We have the following expansion:

𝐏x𝐏x𝐏y𝐏y\displaystyle\mathbf{P}_{x}\circ\mathbf{P}_{x}-\mathbf{P}_{y}\circ\mathbf{P}_{y}
=\displaystyle=\, (𝐏x+𝐏y)(𝐏x𝐏y)\displaystyle(\mathbf{P}_{x}+\mathbf{P}_{y})\circ(\mathbf{P}_{x}-\mathbf{P}_{y})
=\displaystyle=\, [𝐏x+𝐏y]𝖳𝖰\displaystyle[\mathbf{P}_{x}+\mathbf{P}_{y}]\circ\mathsf{T}\circ\mathsf{Q}
=\displaystyle=\, 𝖳2𝖰+2𝐏y𝖳𝖰\displaystyle\mathsf{T}^{\circ 2}\circ\mathsf{Q}+2\mathbf{P}_{y}\circ\mathsf{T}\circ\mathsf{Q}
=\displaystyle=\, 2pdiag(𝒛)𝐏ydiag(𝒛)+2pdiag(𝒛)𝐏ydiag(𝒚)\displaystyle\frac{2}{p}\text{diag}(\bm{z})\mathbf{P}_{y}\text{diag}(\bm{z})+\frac{2}{p}\text{diag}(\bm{z})\mathbf{P}_{y}\text{diag}(\bm{y})
+2pdiag(𝒚)𝐏ydiag(𝒛)+𝖳2𝖰,\displaystyle\qquad+\frac{2}{p}\text{diag}(\bm{y})\mathbf{P}_{y}\text{diag}(\bm{z})+\mathsf{T}^{\circ 2}\circ\mathsf{Q}\,,

where the first three terms on the right-hand side can be controlled by O(λ/p)O_{\prec}(\lambda/p) by a discussion similar to (B.23). As a result, we have

(B.28) 𝐏x𝐏x𝐏y𝐏y=𝖳2𝖰+O(λ/p).\displaystyle\mathbf{P}_{x}\circ\mathbf{P}_{x}-\mathbf{P}_{y}\circ\mathbf{P}_{y}=\mathsf{T}^{\circ 2}\circ\mathsf{Q}+O_{\prec}(\lambda/p).

Since 𝖳2𝖰=𝖳2𝖳2𝐈n\mathsf{T}^{\circ 2}\circ\mathsf{Q}=\mathsf{T}^{\circ 2}-\mathsf{T}^{\circ 2}\circ\mathbf{I}_{n}, where the diagonal part has operator norm O((λ/p)2)O_{\prec}((\lambda/p)^{2}), and rank(𝖳2)rank(𝖳)24\operatorname{rank}(\mathsf{T}^{\circ 2})\leq\operatorname{rank}(\mathsf{T})^{2}\leq 4, we have that when 0.5α<10.5\leq\alpha<1, 𝐏x𝐏x\mathbf{P}_{x}\circ\mathbf{P}_{x} differs from 𝐏y𝐏y\mathbf{P}_{y}\circ\mathbf{P}_{y} by a negligible error plus a perturbation generating at most four extra outlying eigenvalues. The other four outlying eigenvalues come from the first- and second-order terms in the Taylor expansion, as in (B.13). The discussion is similar to the equations from (B.1) to (B.13). We omit the details here. Finally, we emphasize that the discussion of (B.28) is nearly optimal in light of the obtained bound.

B.3. Proof of Theorem 2.7

Although the kernel is f(x)=exp(υx)f(x)=\exp(-\upsilon x), for notational simplicity we keep using the symbol f(x)f(x) for the kernel function and use the notation (A.3).

Proof of (1) when 1α<21\leq\alpha<2.

Let

𝐂0=f(2)𝟏𝟏+(1f(2))𝐈.\mathbf{C}_{0}=f(2)\mathbf{1}\mathbf{1}^{\top}+(1-f(2))\mathbf{I}.

In light of (2.14) and 𝐖1(i,i)=1\mathbf{W}_{1}(i,i)=1, we have

𝐖a1=𝐂0𝐖1.\mathbf{W}_{a_{1}}=\mathbf{C}_{0}\circ\mathbf{W}_{1}.

Recall (2.15). By the definition of 𝐖\mathbf{W}, we have that

(B.29) 𝐖=𝐖c𝐖y𝐖1.\mathbf{W}=\mathbf{W}_{c}\circ\mathbf{W}_{y}\circ\mathbf{W}_{1}.
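Note that (B.29) is an entrywise identity: with h=p and d=1, writing x_i = z_i e_1 + y_i gives ‖x_i−x_j‖² = (z_i−z_j)² + ‖y_i−y_j‖² + 2(z_i−z_j)(y_{i1}−y_{j1}), and exponentiating each piece produces the three Hadamard factors. The numpy sketch below verifies this numerically; the explicit formulas used for W_1, W_y and W_c are the natural ones consistent with (2.15) and (B.54), stated here only as assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, upsilon = 50, 100, 1.0
lam = p ** 1.2                                  # signal strength, here alpha = 1.2

z = np.sqrt(lam) * rng.standard_normal(n)       # scalar signal coordinates z_i
Y = rng.standard_normal((p, n))                 # noise vectors y_i (columns)
X = Y.copy()
X[0, :] += z                                    # x_i = z_i e_1 + y_i

def sqdist(M):
    G = M.T @ M
    d = np.diag(G)
    return d[:, None] + d[None, :] - 2.0 * G

dz  = z[:, None] - z[None, :]
dy1 = Y[0, :][:, None] - Y[0, :][None, :]

W  = np.exp(-upsilon * sqdist(X) / p)           # observed affinity, bandwidth h = p
W1 = np.exp(-upsilon * dz ** 2 / p)             # signal kernel
Wy = np.exp(-upsilon * sqdist(Y) / p)           # noise kernel
Wc = np.exp(-2.0 * upsilon * dz * dy1 / p)      # cross term

assert np.allclose(W, Wc * Wy * W1)             # entrywise check of (B.29)
```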

Consequently, we have that

𝐖𝐖a1\displaystyle\mathbf{W}-\mathbf{W}_{a_{1}} =[(𝐖c𝟏𝟏)𝐖y𝐖1]+[(𝐖y𝐂0)𝐖1]\displaystyle=\left[(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{W}_{y}\circ\mathbf{W}_{1}\right]+\left[(\mathbf{W}_{y}-\mathbf{C}_{0})\circ\mathbf{W}_{1}\right]
(B.30) :=1+2,\displaystyle:=\mathcal{E}_{1}+\mathcal{E}_{2},

where we used the relation 𝐖a1=(𝟏𝟏)𝐖a1.\mathbf{W}_{a_{1}}=(\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{W}_{a_{1}}. For 2\mathcal{E}_{2}, by Lemma A.1 and the first order Taylor approximation of 𝐖y\mathbf{W}_{y}, we see that

(B.31) 21nλ1(𝐖1),\|\mathcal{E}_{2}\|\prec\frac{1}{\sqrt{n}}\lambda_{1}(\mathbf{W}_{1})\,,

where we used (A.4) and (A.4). To control λ1(𝐖1)\lambda_{1}(\mathbf{W}_{1}), we apply Lemma A.10, where in order to make the first term of the right-hand side of (A.23) negligible, we take

(B.32) δ>max{0,3α2},\delta>\max\left\{0,\,\frac{3-\alpha}{2}\right\},

and hence with high probability,

(B.33) 𝐖1=O(nmax{0,3α2}).\|\mathbf{W}_{1}\|=O(n^{\max\left\{0,\,\frac{3-\alpha}{2}\right\}})\,.

By (B.31), when α1\alpha\geq 1, we find that

(B.34) 1n2n3/2+nα/2.\left\|\frac{1}{n}\mathcal{E}_{2}\right\|\prec n^{-3/2}+n^{-\alpha/2}\,.

Next, we discuss the first term 1.\mathcal{E}_{1}. By using Lemma A.1 twice and (B.33), we see that

(B.35) 1maxi,j|𝐖c(i,j)1|maxi,j𝐖y(i,j)𝐖1.\|\mathcal{E}_{1}\|\prec\max_{i,j}\left|\mathbf{W}_{c}(i,j)-1\right|\max_{i,j}\mathbf{W}_{y}(i,j)\|\mathbf{W}_{1}\|.

By definition, we have

maxi,j𝐖y(i,j)1.\max_{i,j}\mathbf{W}_{y}(i,j)\leq 1.

Moreover, using the definition (2.15) and h=ph=p, we have

(B.36) maxi,j|𝐖c(i,j)1|nα/21\max_{i,j}\left|\mathbf{W}_{c}(i,j)-1\right|\prec n^{\alpha/2-1}

by the bound |𝒛i𝒚j|=|zi𝒚j1|λ|\bm{z}_{i}^{\top}\bm{y}_{j}|=|z_{i}\bm{y}_{j1}|\prec\sqrt{\lambda} and the Taylor expansion of f(x)f(x) around 0. Together with (B.33) and (B.35), we readily obtain that

1n1n1/2.\left\|\frac{1}{n}\mathcal{E}_{1}\right\|\prec n^{-1/2}.

This completes the proof of (2.18).

The proof of (2.19) is more involved. By the definition of 𝐖a1\mathbf{W}_{a_{1}}, it suffices to study 𝐖1.\mathbf{W}_{1}. By Mehler’s formula (A.32), we find that when iji\neq j,

(B.37) 𝐖1(i,j)=1t02\displaystyle\mathbf{W}_{1}(i,j)=\sqrt{1-t_{0}^{2}} exp(3t0222(1t02)(zi2+zj2))m=0t0mm!H~m(zi)H~m(zj),\displaystyle\exp\left(\frac{3t_{0}^{2}-2}{2(1-t_{0}^{2})}(z_{i}^{2}+z_{j}^{2})\right)\sum_{m=0}^{\infty}\frac{t_{0}^{m}}{m!}\widetilde{H}_{m}(z_{i})\widetilde{H}_{m}(z_{j}),

where t0t_{0} is defined as

0<t0:=υ+υ2+16β2/υ24(β/υ)<1andβ=λ/p.0<t_{0}:=\frac{-\upsilon+\sqrt{\upsilon^{2}+16\beta^{2}/\upsilon^{2}}}{4(\beta/\upsilon)}<1\quad\text{and}\quad\beta=\lambda/p\,.

By a direct calculation, we have

1t02\displaystyle 1-t^{2}_{0}\, =υ3υ2+16(β2/υ2)υ48β2,\displaystyle=\frac{\upsilon^{3}\sqrt{\upsilon^{2}+16(\beta^{2}/\upsilon^{2})}-\upsilon^{4}}{8\beta^{2}},
1t0\displaystyle 1-t_{0}\, =2υυ2+16β2/υ2+υ+4(β/υ).\displaystyle=\frac{2\upsilon}{\sqrt{\upsilon^{2}+16\beta^{2}/\upsilon^{2}}+\upsilon+4(\beta/\upsilon)}.
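Both identities can be checked by direct algebra; a quick numerical sanity check (illustration only, with arbitrary positive υ and β) is given below.

```python
import numpy as np

upsilon, beta = 1.5, 7.0                         # any positive values
s = np.sqrt(upsilon ** 2 + 16.0 * beta ** 2 / upsilon ** 2)

t0 = (-upsilon + s) / (4.0 * beta / upsilon)     # definition of t0 above

assert 0.0 < t0 < 1.0
assert np.isclose(1.0 - t0 ** 2, (upsilon ** 3 * s - upsilon ** 4) / (8.0 * beta ** 2))
assert np.isclose(1.0 - t0, 2.0 * upsilon / (s + upsilon + 4.0 * beta / upsilon))
```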

Note that when α>1\alpha>1, 1t021β1-t^{2}_{0}\asymp\frac{1}{\beta} as pp\to\infty. Therefore, with (B.37), we find that 𝐖1\mathbf{W}_{1} can be written as an infinite summation of rank-one matrices such that

(B.38) 𝐖1=1t02m=0t0mm!𝐇m𝐇m,\mathbf{W}_{1}=\sqrt{1-t_{0}^{2}}\sum_{m=0}^{\infty}\frac{t_{0}^{m}}{m!}{\mathbf{H}}_{m}{\mathbf{H}}_{m}^{\top},

where 𝐇mn{\mathbf{H}}_{m}\in\mathbb{R}^{n} is defined as

𝐇m=𝐰𝐇~m,𝐇~m(i)=H~m(zi),{\mathbf{H}}_{m}=\mathbf{w}\circ\widetilde{\mathbf{H}}_{m}\,,\quad\widetilde{\mathbf{H}}_{m}(i)=\widetilde{H}_{m}(z_{i})\,,

and 𝐰=(𝐰1,,𝐰n)n\mathbf{w}=(\mathbf{w}_{1},\cdots,\mathbf{w}_{n})^{\top}\in\mathbb{R}^{n} is defined as

𝐰i=exp(3t0222(1t02)zi2), 1in.\mathbf{w}_{i}=\exp\left(\frac{3t_{0}^{2}-2}{2(1-t_{0}^{2})}z_{i}^{2}\right),\ 1\leq i\leq n.

Since both ziz_{i} and zjz_{j} are sub-Gaussian random variables, for some large constants C,D>0C,D>0,

(B.39) (|zi|2Clogn)1O(nD).\mathbb{P}(|z_{i}|^{2}\leq C\log n)\geq 1-O(n^{-D}).

Therefore, we conclude that with high probability, for some constant C1>0C_{1}>0,

exp(3t0222(1t02)(zi2+zj2))\displaystyle\exp\left(\frac{3t_{0}^{2}-2}{2(1-t_{0}^{2})}(z_{i}^{2}+z_{j}^{2})\right) (exp(Clogn))3t0222(1t02)\displaystyle\leq(\exp(C\log n))^{\frac{3t_{0}^{2}-2}{2(1-t_{0}^{2})}}
(B.40) nC1β.\displaystyle\leq n^{C_{1}\beta}.

To finish the proof, we control 𝐇m𝐇m=𝐇m22\|\mathbf{H}_{m}\mathbf{H}_{m}^{\top}\|=\|\mathbf{H}_{m}\|_{2}^{2} case by case.

Case I: α=1.\alpha=1. In this case, since λp\lambda\asymp p, we have β1\beta\asymp 1 and t0(0,1)t_{0}\in(0,1) is a constant away from 11. Since the degree of H~m(x)\widetilde{H}_{m}(x) is mm\in\mathbb{N}, we have |H~m(x)|xm|\widetilde{H}_{m}(x)|\asymp x^{m} when xx\rightarrow\infty. Together with (B.39), we find that with high probability, for some constant C>0C>0

(B.41) 𝐇m22nC1β+1(Clogn)m.\|{\mathbf{H}}_{m}\|_{2}^{2}\leq n^{C_{1}\beta+1}(C\log n)^{m}.

Consequently, we have that for some constants C2,C3>0C_{2},C_{3}>0,

(B.42) t0mm!𝐇m22\displaystyle\frac{t_{0}^{m}}{m!}\|{\mathbf{H}}_{m}\|_{2}^{2} C3nC2m1/2(et0Clognm)m,\displaystyle\leq C_{3}n^{C_{2}}m^{-1/2}\left(\frac{et_{0}C\log n}{m}\right)^{m},

where we use Stirling’s formula. For notational convenience, we set

(B.43) 𝐖1=𝐖11+𝐖12,\mathbf{W}_{1}=\mathbf{W}_{11}+\mathbf{W}_{12},

where

(B.44) 𝐖11=1t02m=0C0lognt0mm!𝐇m𝐇m\mathbf{W}_{11}=\sqrt{1-t_{0}^{2}}\sum_{m=0}^{C_{0}\log n}\frac{t_{0}^{m}}{m!}{\mathbf{H}}_{m}{\mathbf{H}}_{m}^{\top}

and C0C_{0} will be chosen later. Now we control 𝐖11\mathbf{W}_{11} and 𝐖12\mathbf{W}_{12}. Choosing a fixed large constant C0>0C_{0}>0 such that C0>et0CC_{0}>et_{0}C, we set m0=C0logn.m_{0}=C_{0}\log n. When nn is large enough, by (B.42), we have that for some constants C4>0C_{4}>0 and 0<𝔞<10<\mathfrak{a}<1,

m=C0lognt0mm!𝐇m22\displaystyle\sum_{m=C_{0}\log n}^{\infty}\frac{t_{0}^{m}}{m!}\|{\mathbf{H}}_{m}\|_{2}^{2}\leq C3nC2m=C0logn1m(et0lognm)m\displaystyle\,C_{3}n^{C_{2}}\sum_{m=C_{0}\log n}^{\infty}\frac{1}{\sqrt{m}}\left(\frac{et_{0}\log n}{m}\right)^{m}
\displaystyle\leq C4nC2C0logn𝔞xdx\displaystyle\,C_{4}n^{C_{2}}\int_{C_{0}\log n}^{\infty}\mathfrak{a}^{x}\mathrm{d}x
(B.45) =\displaystyle= C4nC21log𝔞𝔞C0logn.\displaystyle\,C_{4}n^{C_{2}}\frac{1}{\log\mathfrak{a}}\mathfrak{a}^{C_{0}\log n}\,.

This yields that for some constants C5>0C_{5}>0, with high probability

(B.46) m=C0lognt0mm!𝐇m22\displaystyle\sum_{m=C_{0}\log n}^{\infty}\frac{t_{0}^{m}}{m!}\|{\mathbf{H}}_{m}\|_{2}^{2} C4𝔞C0lognnC2nC5logn+C2.\displaystyle\leq C_{4}\mathfrak{a}^{C_{0}\log n}n^{C_{2}}\asymp n^{-C_{5}\log n+C_{2}}.

Thus, for a big constant D>2D>2, when nn is sufficiently large, with high probability we have

(B.47) 𝐖12\displaystyle\|\mathbf{W}_{12}\| =𝐖11t02m=0C0lognt0mm!𝐇m𝐇mnD.\displaystyle=\left\|\mathbf{W}_{1}-\sqrt{1-t_{0}^{2}}\sum_{m=0}^{C_{0}\log n}\frac{t_{0}^{m}}{m!}{\mathbf{H}}_{m}{\mathbf{H}}_{m}^{\top}\right\|\leq n^{-D}\,.

This completes the proof for the case α=1\alpha=1 using (2.14) and (B.43) since the rank of 𝐖11\mathbf{W}_{11} is bounded by C0lognC_{0}\log n.

Case II: 1<α<21<\alpha<2. It is easy to see that in this case, (B.41) still holds true with high probability. Since β\beta diverges in this case, we find that

t0mm!𝐇~m(𝐲)22\displaystyle\frac{t_{0}^{m}}{m!}\|\widetilde{\mathbf{H}}_{m}(\mathbf{y})\|_{2}^{2} CnC1β+1(m)m1/2em(Clogn)m\displaystyle\leq Cn^{C_{1}\beta+1}(m)^{-m-1/2}e^{m}(C\log n)^{m}
=CnC1β+1m1/2(eClognm)m.\displaystyle=Cn^{C_{1}\beta+1}m^{-1/2}\left(\frac{eC\log n}{m}\right)^{m}.

Now for some fixed large constant C0>0C_{0}>0, we set m0=C0nα1.m_{0}=C_{0}n^{\alpha-1}. By a discussion similar to (B.46), we have that for some constants C,C1>0C,C_{1}>0, with high probability

m=m0t0mm!𝐇~m(𝐳)22CnC1(1α)nα1.\sum_{m=m_{0}}^{\infty}\frac{t_{0}^{m}}{m!}\|\widetilde{\mathbf{H}}_{m}(\mathbf{z})\|_{2}^{2}\leq Cn^{C_{1}(1-\alpha)n^{\alpha-1}}.

The rest of the proof is similar to the case α=1\alpha=1 except that we can follow (A.23) to show that 𝐖1nδ\|\mathbf{W}_{1}\|\prec n^{\delta} for δ>(3α)/2.\delta>(3-\alpha)/2. This concludes the proof for the case 1<α<21<\alpha<2 and hence completes the proof of (2.19).

Proof of (2) when α2\alpha\geq 2.

We first prove (2.20). By (B.29), using the definition (2.16), we find that

𝐖𝐖~a1=(𝐂0𝐖y)𝐖c𝐖1.\mathbf{W}-\widetilde{\mathbf{W}}_{a_{1}}=(\mathbf{C}_{0}-\mathbf{W}_{y})\circ\mathbf{W}_{c}\circ\mathbf{W}_{1}.

Then the proof follows from a discussion similar to (B.34). For the rest of the proof, due to similarity, we only prove (2.23). Note that since the ziz_{i}’s are continuous random variables by assumption, they are pairwise distinct with high probability. Rewrite

𝐖(i,j)=exp(υλp𝐱i𝐱j22λ).\mathbf{W}(i,j)=\exp\left(-\upsilon\frac{\lambda}{p}\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}\right).

Clearly, λpnα1\frac{\lambda}{p}\asymp n^{\alpha-1}. We then show that for the given constant t(0,1)t\in(0,1), with probability at least 1O(nδ)1-O(n^{-\delta}), where δ>1\delta>1, we have

𝐱i𝐱j22λ(pλ)t,\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}\geq\left(\frac{p}{\lambda}\right)^{t},

when iji\neq j. By a direct expansion,

(B.48) 𝐱i𝐱j22λ=\displaystyle\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}= (zizj)2+𝐲i𝐲j22λ+2(zizj)(𝐲i1𝐲j1)λ.\displaystyle\left(z_{i}-z_{j}\right)^{2}+\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{\lambda}+\frac{2(z_{i}-z_{j})(\mathbf{y}_{i1}-\mathbf{y}_{j1})}{\lambda}\,.

By Lemma A.4, we find that with high probability,

𝐲i𝐲j22λpλ.\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{\lambda}\prec\frac{p}{\lambda}.

Similarly, we have that

|2(zizj)(𝐲i1𝐲j1)λ|1λ.\left|\frac{2(z_{i}-z_{j})(\mathbf{y}_{i1}-\mathbf{y}_{j1})}{\lambda}\right|\prec\frac{1}{\lambda}.

Using the above result, we find that for some constant C>0C>0, for iji\neq j,

(𝐱i𝐱j22λ(pλ)t)\displaystyle\mathbb{P}\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}\leq\left(\frac{p}{\lambda}\right)^{t}\right) ((zizj)2(pλ)tCpλ)C(pλ)t/2,\displaystyle\leq\mathbb{P}\left((z_{i}-z_{j})^{2}\leq\left(\frac{p}{\lambda}\right)^{t}-C\frac{p}{\lambda}\right)\leq C\left(\frac{p}{\lambda}\right)^{t/2},

where the final inequality comes from a discussion similar to (A.26) since α2\alpha\geq 2. This leads to

(𝐱i𝐱j22λ(pλ)t)1C(pλ)t/2.\mathbb{P}\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}\geq\left(\frac{p}{\lambda}\right)^{t}\right)\geq 1-C\left(\frac{p}{\lambda}\right)^{t/2}.

Note that under the condition (2.21), we have t(α1)/2>1t(\alpha-1)/2>1, and hence (pλ)t/2=o(n1)\left(\frac{p}{\lambda}\right)^{t/2}=o(n^{-1}). Therefore, by a direct union bound, for each fixed ii, we have that

(B.49) maxj\displaystyle\max_{j}\mathbb{P} (𝐱i𝐱j22λ(pλ)t|ji)1Cn(pλ)t/2.\displaystyle\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}\geq\left(\frac{p}{\lambda}\right)^{t}\middle|j\neq i\right)\geq 1-Cn\left(\frac{p}{\lambda}\right)^{t/2}.

This implies that with probability at least 1O(n1t(α1)/2)1-O(n^{1-t(\alpha-1)/2}), for some constant C>0C>0, we have that

(B.50) |ji𝐖(i,j)|\displaystyle\left|\sum_{j\neq i}\mathbf{W}(i,j)\right| Cnexp(υλp(pλ)t)=Cnexp(υ(λp)1t).\displaystyle\leq Cn\exp\left(-\upsilon\frac{\lambda}{p}\left(\frac{p}{\lambda}\right)^{t}\right)=Cn\exp\left(-\upsilon\left(\frac{\lambda}{p}\right)^{1-t}\right).

Consequently, by the Gershgorin circle theorem, we conclude that with probability 1O(n1t(α1)/2)1-O(n^{1-t(\alpha-1)/2})

(B.51) \left|\lambda_{i}(\mathbf{W})-1\right|\leq Cn\exp\left(-\upsilon\left(\frac{\lambda}{p}\right)^{1-t}\right).

This completes our proof.

B.4. Proof of Corollary 2.10

In this subsection, we prove the results for the transition matrix 𝐀.\mathbf{A}.

Proof of Corollary 2.10.

When α=0\alpha=0, it follows from the fact that

(B.52) (n1𝐃)1f(2)1𝐈nn1/2.\|(n^{-1}\mathbf{D})^{-1}-f(2)^{-1}\mathbf{I}_{n}\|\prec n^{-1/2}.

The proof can be found in [18, Lemma 4.5], and we omit it. Together with (B.10) and (B.13), we have that with probability 1O(n1/2)1-O(n^{-1/2}),

n𝐃1/2\displaystyle n\mathbf{D}^{-1/2} 𝐖𝐃1/2=n𝐃1/2𝖮𝐃1/2\displaystyle\mathbf{W}\mathbf{D}^{-1/2}=n\mathbf{D}^{-1/2}\mathsf{O}\mathbf{D}^{-1/2}
+n𝐃1/2(2f(τ)p𝐗𝐗+ς(λ)𝐈n)𝐃1/2\displaystyle+n\mathbf{D}^{-1/2}\left(-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}_{n}\right)\mathbf{D}^{-1/2}
+O(n1/4+nϵ1/2).\displaystyle+O(n^{-1/4}+n^{\epsilon-1/2})\,.

By definition, the first term is a matrix with rank at most three since 𝖮\mathsf{O} is of rank at most three. For the second term, from (B.52) and Lemma A.6, we have

n𝐃1/2(2f(τ)p𝐗𝐗+ς(λ)𝐈n)𝐃1/2\displaystyle\Big{\|}n\mathbf{D}^{-1/2}\left(-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}_{n}\right)\mathbf{D}^{-1/2}
1f(2)(2f(τ)p𝐗𝐗+ς(λ)𝐈n)n1/2.\displaystyle\qquad-\frac{1}{f(2)}\left(-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}_{n}\right)\Big{\|}\prec n^{-1/2}.

As a result, since the spectra of n𝐀n\mathbf{A} and n𝐃1/2𝐖𝐃1/2n\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2} are the same, and with probability 1O(n1/2)1-O(n^{-1/2}), n𝐃1/2𝐖𝐃1/2n\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2} can be approximated by 1f(2)(2f(τ)p𝐗𝐗+ς(λ)𝐈n)\frac{1}{f(2)}\left(-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}_{n}\right) plus a perturbation of rank at most 33 with an error bound O(n1/4)O(n^{-1/4}), we obtain the claim by Weyl’s lemma.

When 0<α<10<\alpha<1, by a discussion similar to (B.52) using Lemma A.5, we find that

(n1𝐃)1f(τ)1𝐈nn1/2+λp.\|(n^{-1}\mathbf{D})^{-1}-f(\tau)^{-1}\mathbf{I}_{n}\|\prec n^{-1/2}+\frac{\lambda}{p}\,.

By the same argument as that for α=0\alpha=0, we conclude the proof.

Next, we handle the case 1α<2.1\leq\alpha<2. Note that

𝐀𝐀a1\displaystyle\|\mathbf{A}-\mathbf{A}_{a_{1}}\|\leq\, 𝐃1(𝐖𝐖a1)+𝐃1(𝐃a1𝐃)𝐃a11𝐖a1\displaystyle\|\mathbf{D}^{-1}(\mathbf{W}-\mathbf{W}_{a_{1}})\|+\|\mathbf{D}^{-1}(\mathbf{D}_{a_{1}}-\mathbf{D})\mathbf{D}_{a_{1}}^{-1}\mathbf{W}_{a_{1}}\|
(B.53) \displaystyle\leq\, 𝐃1𝐖𝐖a1+𝐃1(𝐃𝐃a1)𝐃a11𝐖a1.\displaystyle\|\mathbf{D}^{-1}\|\|\mathbf{W}-\mathbf{W}_{a_{1}}\|+\|\mathbf{D}^{-1}(\mathbf{D}-\mathbf{D}_{a_{1}})\|\|\mathbf{D}_{a_{1}}^{-1}\mathbf{W}_{a_{1}}\|\,.

By a direct expansion, we have

(B.54) 𝐃(i,i)𝐃a1(i,i)\displaystyle\mathbf{D}(i,i)-\mathbf{D}_{a_{1}}(i,i)
=\displaystyle=\, jiexp(υ𝐱i𝐱j22p)jiexp(2υ)exp(υ𝐳i𝐳j22p)\displaystyle\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{p}\right)-\sum_{j\neq i}\exp(-2\upsilon)\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)
=\displaystyle=\, jiexp(υ𝐳i𝐳j22p)\displaystyle\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)
×[exp(υ𝐲i𝐲j22p)exp(2υ(𝐳i𝐳j)(𝐲i𝐲j)p)exp(2υ)].\displaystyle\times\Big{[}\exp\left(-\upsilon\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{p}\right)\exp\left(-2\upsilon\frac{(\mathbf{z}_{i}-\mathbf{z}_{j})^{\top}(\mathbf{y}_{i}-\mathbf{y}_{j})}{p}\right)-\exp(-2\upsilon)\Big{]}.

To control 𝐃(i,i)𝐃a1(i,i)\mathbf{D}(i,i)-\mathbf{D}_{a_{1}}(i,i), we bound the terms on the right-hand side. First, since exp(2υ)\exp(-2\upsilon) is the zeroth-order Taylor approximation of exp(υ𝐲i𝐲j22/p)\exp(-\upsilon\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}/p) around 𝐲i𝐲j22/p=2\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}/p=2, together with (B.36) and the fact that maxij|exp(υ𝐲i𝐲j22p)exp(2υ)|n1/2\max_{ij}\Big{|}\exp\left(-\upsilon\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{p}\right)-\exp(-2\upsilon)\Big{|}\prec n^{-1/2}, we have

maxi,j|exp(υ𝐲i𝐲j22p)exp(2υ(𝐳i𝐳j)(𝐲i𝐲j)p)\displaystyle\max_{i,j}\Big{|}\exp\left(-\upsilon\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{p}\right)\exp\left(-2\upsilon\frac{(\mathbf{z}_{i}-\mathbf{z}_{j})^{\top}(\mathbf{y}_{i}-\mathbf{y}_{j})}{p}\right)
(B.55) exp(2υ)|nα/21.\displaystyle\qquad-\exp(-2\upsilon)\Big{|}\prec n^{\alpha/2-1}\,.

Since exp(υ𝐳i𝐳j22p)>0\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)>0, this implies that

𝐃(i,i)𝐃a1(i,i)=O(nα/21jiexp(υ𝐳i𝐳j22p)).\displaystyle\mathbf{D}(i,i)-\mathbf{D}_{a_{1}}(i,i)=O_{\prec}\left(n^{\alpha/2-1}\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)\right)\,.

Denote Δ:=𝐃1(𝐃𝐃a1).\Delta:=\mathbf{D}^{-1}(\mathbf{D}-\mathbf{D}_{a_{1}}). Since 𝐃(i,i)\mathbf{D}(i,i) and 𝐃a1(i,i)\mathbf{D}_{a_{1}}(i,i) are positive and 𝐃a1(i,i)=jiexp(υ𝐳i𝐳j22p)+1\mathbf{D}_{a_{1}}(i,i)=\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)+1, we have that

maxi|Δ(i,i)|\displaystyle\max_{i}|\Delta(i,i)| =|𝐃(i,i)𝐃a1(i,i)|𝐃(i,i)\displaystyle=\frac{|\mathbf{D}(i,i)-\mathbf{D}_{a_{1}}(i,i)|}{\mathbf{D}(i,i)}
nα/21jiexp(υ𝐳i𝐳j22p)jiexp(υ𝐳i𝐳j22p)+1nα/21.\displaystyle\prec\frac{n^{\alpha/2-1}\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)}{\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)+1}\prec n^{\alpha/2-1}.

As a result, we have 𝐃1(𝐃𝐃a1)𝐃a11𝐖a1nα/21\|\mathbf{D}^{-1}(\mathbf{D}-\mathbf{D}_{a_{1}})\|\|\mathbf{D}_{a_{1}}^{-1}\mathbf{W}_{a_{1}}\|\prec n^{\alpha/2-1} since 𝐃a11𝐖a11\|\mathbf{D}_{a_{1}}^{-1}\mathbf{W}_{a_{1}}\|\leq 1.

Next, we control 𝐃1𝐖𝐖a1\|\mathbf{D}^{-1}\|\|\mathbf{W}-\mathbf{W}_{a_{1}}\|. Using the same argument leading to (A.30), we find that for 0<δ<10<\delta<1, with high probability, there exist O(nδ)O(n^{\delta}) indices ii such that |zi|n(1δ).|z_{i}|\leq n^{-(1-\delta)}. Note that (A.23) requires that δ3α2.\delta\geq\frac{3-\alpha}{2}. Choosing δ=3α2,\delta=\frac{3-\alpha}{2}, we obtain that with high probability, for some constant C>0C>0,

\min_{i}\mathbf{D}(i,i)\geq C\inf_{i}\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)\geq Cn^{\frac{3-\alpha}{2}}.

This implies that 𝐃1n(α3)/2.\|\mathbf{D}^{-1}\|\prec n^{(\alpha-3)/2}. Combining this with (2.18), we find that 𝐃1𝐖𝐖a1nα/21\|\mathbf{D}^{-1}\|\|\mathbf{W}-\mathbf{W}_{a_{1}}\|\prec n^{\alpha/2-1}, and hence

𝐀𝐀a1nα/21.\|\mathbf{A}-\mathbf{A}_{a_{1}}\|\prec n^{\alpha/2-1}\,.

This concludes our proof.

Then we prove the case when α2.\alpha\geq 2. The counterpart of (2.20) follows from a discussion similar to the case 1α<21\leq\alpha<2 except that (B.54) should be replaced by

𝐃(i,i)𝐃~a1(i,i)=\displaystyle\mathbf{D}(i,i)-\widetilde{\mathbf{D}}_{a_{1}}(i,i)=\, jiexp(υ𝐳i𝐳j22p)exp(2υ(𝐳i𝐳j)(𝐲i𝐲j)p)\displaystyle\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)\exp\left(-2\upsilon\frac{(\mathbf{z}_{i}-\mathbf{z}_{j})^{\top}(\mathbf{y}_{i}-\mathbf{y}_{j})}{p}\right)
×[exp(υ𝐲i𝐲j22p)exp(2υ)],\displaystyle\times\left[\exp\left(-\upsilon\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{p}\right)-\exp(-2\upsilon)\right],

so that

maxi|𝐃(i,i)𝐃~a1(i,i)|n1/2.\max_{i}|\mathbf{D}(i,i)-\widetilde{\mathbf{D}}_{a_{1}}(i,i)|\prec n^{-1/2}.

We omit the details due to similarity. Finally, we prove the results when α\alpha is larger in the sense of (2.21). By (B.50) and the definition of 𝐖\mathbf{W} (i.e., 𝐖(i,i)=1\mathbf{W}(i,i)=1), we find that with probability at least 1O(n1t(α1)/2)1-O(n^{1-t(\alpha-1)/2}), for some constant C>0C>0,

𝐃𝐈nCnexp(υ(λp)1t).\|\mathbf{D}-\mathbf{I}_{n}\|\leq Cn\exp\left(-\upsilon\left(\frac{\lambda}{p}\right)^{1-t}\right).

This bound, together with (2.23), leads to

𝐖𝐃1/2𝐖𝐃1/2\displaystyle\|\mathbf{W}-\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\|\leq (𝐃1/2+1)𝐖𝐃1/2𝐈n\displaystyle\,(\|\mathbf{D}^{-1/2}\|+1)\|\mathbf{W}\|\|\mathbf{D}^{-1/2}-\mathbf{I}_{n}\|
\displaystyle\leq  2Cnexp(υ(λp)1t).\displaystyle\,2Cn\exp\left(-\upsilon\left(\frac{\lambda}{p}\right)^{1-t}\right).

Since \mathbf{A} is similar to \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}, the two matrices share the same eigenvalues, and Weyl's lemma gives \max_{i}|\lambda_{i}(\mathbf{A})-\lambda_{i}(\mathbf{W})|\leq\|\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}-\mathbf{W}\|, so the claim follows. This finishes the proof of the third part when (2.21) holds.

Appendix C Proof of the results in Section 3

In this section, we prove the main results in Section 3.

C.1. Proof of Theorem 3.1

In this subsection, we prove Theorem 3.1 for h=\lambda+p. We only prove the result for \mathbf{W} and omit the details for \mathbf{A}, since the proof is similar to that of Corollary 2.10. Throughout the proof, we write f(x)=\exp(-\upsilon x) for ease of statement.

Proof.

For part (1), denote

f(𝐱i𝐱j22h)=g(𝐱i𝐱j22p),h=p+λ,f\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{h}\right)=g\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{p}\right),\ h=p+\lambda,

where g(x):=f(px/h).g(x):=f(px/h). Since 0α<10\leq\alpha<1, we have that

pp+λ1.\frac{p}{p+\lambda}\asymp 1.

Then we can apply the proofs of Theorems 2.3 and 2.5 verbatim to the kernel function g(x) to conclude the proof. The only difference is an extra factor p/h in the derivative of the kernel; we omit the details.
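For concreteness, the extra factor is simply the chain rule applied to the rescaled kernel; a short worked version, under the stated choices f(x)=\exp(-\upsilon x) and h=p+\lambda, reads

g^{\prime}(x)=\frac{p}{h}f^{\prime}\left(\frac{px}{h}\right)=-\upsilon\frac{p}{h}\exp\left(-\upsilon\frac{px}{h}\right),\qquad\frac{p}{h}=\frac{p}{p+\lambda}\asymp 1\ \mbox{ for }0\leq\alpha<1,

so every derivative bound used for f in the proofs of Theorems 2.3 and 2.5 holds for g with constants changed by at most a factor of order one.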

For part (2), since β=λ/(λ+p)1\beta=\lambda/(\lambda+p)\asymp 1, (3.3) and (3.4) follow from the proof of (2.18) and Case (I) (below equation (B.3)) of the proof of (2.19), respectively. For (3.5), the proof follows from (3.3) and the bound

𝐖a2𝐖1=\displaystyle\left\|\mathbf{W}_{a_{2}}-\mathbf{W}_{1}\right\|= (exp(2pυ/(p+λ))1)𝐖1+(1exp(2pυ/(p+λ)))𝐈n\displaystyle\,\|(\exp(-2p\upsilon/(p+\lambda))-1)\mathbf{W}_{1}+(1-\exp(-2p\upsilon/(p+\lambda)))\mathbf{I}_{n}\|
\displaystyle\leq n1α𝐖1+n1αn1α+n2α,\displaystyle\,n^{1-\alpha}\|\mathbf{W}_{1}\|+n^{1-\alpha}\prec n^{1-\alpha}+n^{2-\alpha}\,,

where the first inequality comes from 1-\exp(-\frac{2p\upsilon}{p+\lambda})\asymp\frac{2p\upsilon}{p+\lambda}\asymp n^{1-\alpha} when \alpha>1 (note that \frac{2p\upsilon}{p+\lambda}=o(1) in this case), and the last bound follows from the fact that \|\mathbf{W}_{1}\|\prec n. ∎

C.2. Proof of Corollary 3.2

In this subsection, we prove Corollary 3.2.

Proof.

First, consider the first statement when 0\leq\alpha<1. Recall that the square of a sub-Gaussian random variable is sub-exponential, and that a sum of sub-exponential random variables is again sub-exponential. Since \mathbf{x}_{i},\mathbf{x}_{j}, i\neq j, are independent, the random variable \|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2} is sub-exponential, and by Lemma A.5, we have

𝐱i𝐱j22=2(p+λ)+O(pα+p).\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}=2(p+\lambda)+O_{\prec}(p^{\alpha}+\sqrt{p})\,.

Since \alpha<1, when p is large enough, \|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2} concentrates around 2p with high probability. Thus, for any \omega\in(0,1),

(C.1) hph\asymp p

holds with high probability. We can thus rewrite

f(𝐱i𝐱j22h)=g(𝐱i𝐱j22p),f\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{h}\right)=g\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{p}\right),

where g(x):=f(px/h). Then we can apply the proofs of Theorems 2.3 and 2.5 to the kernel function g(x) to conclude the proof. As before, the only difference is an extra factor p/h in the derivative of the kernel, which is of order one by (C.1); we omit the details.

We next prove the second statement when α1\alpha\geq 1. We first show the counterpart of (3.3). In this case, we claim that for some given sufficiently small ϵ>0\epsilon>0, we have that with high probability, for some constants C1,C2>0C_{1},C_{2}>0,

(C.2) C1(λlog1n+p)hC2λlog2n.C_{1}(\lambda\log^{-1}n+p)\leq h\leq C_{2}\lambda\log^{2}n.

To see this claim, we follow the notation of (A.3). Note that

𝐱i𝐱j22=\displaystyle\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}=\, 𝐲i𝐲j2+λ(zizj)2+2λ(𝐲i1𝐲j1)(zizj).\displaystyle\|\mathbf{y}_{i}-\mathbf{y}_{j}\|^{2}+\lambda(z_{i}-z_{j})^{2}+2\sqrt{\lambda}(\mathbf{y}_{i1}-\mathbf{y}_{j1})(z_{i}-z_{j})\,.

By Lemma A.4, \|\mathbf{y}_{i}-\mathbf{y}_{j}\|^{2}=2p+O_{\prec}(\sqrt{p}). Also, since \mathbf{y}_{i1}-\mathbf{y}_{j1} and z_{i}-z_{j} are sub-Gaussian, |(\mathbf{y}_{i1}-\mathbf{y}_{j1})(z_{i}-z_{j})|=O_{\prec}(\log(n)). It thus remains to handle (z_{i}-z_{j})^{2}. On the one hand, since z_{i}-z_{j} is sub-Gaussian, we have that for some constant C>0, (z_{i}-z_{j})^{2}\leq C\log^{2}n with high probability; on the other hand, using a discussion similar to (A.30) by setting \epsilon=0.5(\log\log n/\log n), we have that with high probability there are at least Cn/\sqrt{\log n} of the (z_{i}-z_{j})^{2}\,'s satisfying (z_{i}-z_{j})^{2}\geq\log^{-1}n. Since \omega\in(0,1), when n is large enough, we have \omega>C/\sqrt{\log n}. This proves (C.2). Using the bandwidth h, we now have

𝐖y(i,j)=f(𝐲i𝐲j22h)=f(ph𝐲i𝐲j22p).\mathbf{W}_{y}(i,j)=f\left(\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{h}\right)=f\left(\frac{p}{h}\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{p}\right).

Denote \widetilde{\mathbf{C}}_{0}=f(2p/h)\mathbf{1}\mathbf{1}^{\top}+(1-f(2p/h))\mathbf{I}. Consider the same expansion as in (B.3) using the bandwidth h; that is, \mathbf{W}-\mathbf{W}_{a_{2}}=[(\mathbf{W}_{c}-\bm{1}\bm{1}^{\top})\circ\mathbf{W}_{y}\circ\mathbf{W}_{1}]+[(\mathbf{W}_{y}-\widetilde{\mathbf{C}}_{0})\circ\mathbf{W}_{1}]. By Lemma A.1, we have

𝐖𝐖a2\displaystyle\|\mathbf{W}-\mathbf{W}_{a_{2}}\|\leq\, (maxi,j|𝐖c(i,j)1|maxi,j𝐖y(i,j)\displaystyle(\max_{i,j}\left|\mathbf{W}_{c}(i,j)-1\right|\max_{i,j}\mathbf{W}_{y}(i,j)
+maxi,j|𝐖y(i,j)𝐂~0(i,j)|)𝐖1.\displaystyle+\max_{i,j}|\mathbf{W}_{y}(i,j)-\widetilde{\mathbf{C}}_{0}(i,j)|)\|\mathbf{W}_{1}\|\,.

Due to (C.2), we have |(zizj)(𝐲i1𝐲j1)|/h=O(λ1/2)|(z_{i}-z_{j})(\mathbf{y}_{i1}-\mathbf{y}_{j1})|/h=O_{\prec}(\lambda^{-1/2}) and hence maxi,j|𝐖c(i,j)1|λ/h=O(λ1/2)=O(nα/2)\max_{i,j}|\mathbf{W}_{c}(i,j)-1|\prec\sqrt{\lambda}/h=O(\lambda^{-1/2})=O(n^{-\alpha/2}). On the other hand, we have maxi,j𝐖y(i,j)1\max_{i,j}\mathbf{W}_{y}(i,j)\leq 1 by definition and maxi,j|𝐖y(i,j)𝐂~0(i,j)|n1/2\max_{i,j}|\mathbf{W}_{y}(i,j)-\widetilde{\mathbf{C}}_{0}(i,j)|\prec n^{-1/2} by Lemma A.4. As a result, we obtain

(C.3) 𝐖𝐖a2n1/2𝐖1.\|\mathbf{W}-\mathbf{W}_{a_{2}}\|\prec n^{-1/2}\|\mathbf{W}_{1}\|.

To finish the proof of the counterpart of (3.3), we need to control \|\mathbf{W}_{1}\| with h satisfying (C.2). We use the trivial bound \|\mathbf{W}_{1}\|=O(n), which follows from the Gershgorin circle theorem. As a result, we have

1n𝐖1n𝐖a2n1/2.\left\|\frac{1}{n}\mathbf{W}-\frac{1}{n}\mathbf{W}_{a_{2}}\right\|\prec n^{-1/2}\,.

The argument for the counterpart of (3.4) is analogous to that in Case (\mathbf{I}) in the proof of (2.19), with the following necessary modifications. For \beta:=\lambda/h, we have \frac{1}{C_{2}\log^{2}n}\leq\beta\leq\frac{1}{C_{1}(\log^{-1}n+p/\lambda)} by (C.2). First, when \beta\asymp 1, the discussion reduces to Case (\mathbf{I}) (i.e., the arguments below (B.3)) of the proof of (2.19). Second, when \beta diverges and \beta\leq C\min\{\log n,n^{\alpha-1}\} for some constant C>0, the discussion again reduces to Case (\mathbf{I}). Finally, we discuss \beta=o(1) satisfying \beta\geq 1/(C_{2}\log^{2}n). In this case, we still have t_{0}<1. Recall (B.38), where we have 1-t_{0}^{2}\asymp\frac{1}{\beta}\leq C\log^{2}n for some constant C>0. However, compared to the error bound in (B.46), the extra factor \log^{2}n is negligible. Then this case again reduces to Case (\mathbf{I}). This completes the proof.

Appendix D Verification of Remark 2.8

In this section, we justify (2.25) of Remark 2.8.

Justification of Remark 2.8.

We focus on the case α=1\alpha=1, and the same claim holds for 1<α<21<\alpha<2. Recall (B.29). Using a discussion similar to (B.3), we have that

𝐖𝐖b1=\displaystyle\mathbf{W}-\mathbf{W}_{b_{1}}=\, (𝐖c𝟏𝟏)𝐖y𝐖1+𝐖1𝐑1+𝐖1(exp(2υ)𝟏𝟏)\displaystyle(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{W}_{y}\circ\mathbf{W}_{1}+\mathbf{W}_{1}\circ\mathbf{R}_{1}+\mathbf{W}_{1}\circ\left(\exp(-2\upsilon)\mathbf{1}\mathbf{1}^{\top}\right)
(D.1) :=\displaystyle:=\, 𝐄0+𝐄1+𝐄2,\displaystyle\mathbf{E}_{0}+\mathbf{E}_{1}+\mathbf{E}_{2},

where 𝐑1\mathbf{R}_{1} is the error of the first order entrywise expansion of 𝐖y\mathbf{W}_{y} defined as

𝐑1\displaystyle\mathbf{R}_{1} :=𝐖y[exp(2υ)𝟏𝟏+2υexp(2υ)p𝐘𝐘+2υexp(4υ)𝐈]\displaystyle:=\mathbf{W}_{y}-\Big{[}\exp(-2\upsilon)\mathbf{1}\mathbf{1}^{\top}+\frac{2\upsilon\exp(-2\upsilon)}{p}\mathbf{Y}^{\top}\mathbf{Y}+2\upsilon\exp(-4\upsilon)\mathbf{I}\Big{]}
(D.2) :=𝐖y𝐔y.\displaystyle:=\mathbf{W}_{y}-\mathbf{U}_{y}\,.

Take the same decomposition as in (B.44) with some fixed large constant C_{0}>0 and m=C_{0}\log n. By (B.46), we have that with high probability

(D.3) rank(𝐖1,1)m,maxi,j|𝐖1,2(i,j)|nD,\operatorname{rank}(\mathbf{W}_{1,1})\leq m,\quad\max_{i,j}|\mathbf{W}_{1,2}(i,j)|\leq n^{-D},

for some large constant D>2. We start with the discussion of \mathbf{E}_{2} in (D.1). Using (B.43) and the fact that \mathbf{W}_{1}\circ\left(\exp(-2\upsilon)\mathbf{1}\mathbf{1}^{\top}\right)=\exp(-2\upsilon)\mathbf{W}_{1}, we obtain that

(D.4) 𝐄2\displaystyle\mathbf{E}_{2} =exp(2υ)𝐖1,1+exp(2υ)𝐖1,2=:𝐄2,1+𝐄2,2.\displaystyle=\exp(-2\upsilon)\mathbf{W}_{1,1}+\exp(-2\upsilon)\mathbf{W}_{1,2}=:\mathbf{E}_{2,1}+\mathbf{E}_{2,2}\,.

By (D.3) and the Gershgorin circle theorem, with high probability, for some constant C>0,C>0, we have

(D.5) \operatorname{rank}\left(\mathbf{E}_{2,1}\right)\leq m\ \ \mbox{ and }\ \ \left\|\mathbf{E}_{2,2}\right\|\leq Cn^{-D+1}.
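For the second bound in (D.5), a short worked version of the Gershgorin step, assuming (as in the rest of the argument) that \mathbf{W}_{1,2} inherits the symmetry of \mathbf{W}_{1}, reads

\|\mathbf{E}_{2,2}\|\leq\max_{i}\sum_{j=1}^{n}|\mathbf{E}_{2,2}(i,j)|\leq\exp(-2\upsilon)\,n\max_{i,j}|\mathbf{W}_{1,2}(i,j)|\leq Cn^{-D+1},

where the last step uses (D.3).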

We control m𝐖b1+𝐄1+𝐄2(z)m𝐖b1+𝐄1(z)m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z) by the triangle inequality,

|m𝐖b1+𝐄1+𝐄2(z)m𝐖b1+𝐄1(z)|\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)|
\displaystyle\leq\, |m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2,1}}(z)|+|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2,1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)|\,.

By Lemma A.3, we have

|m𝐖b1+𝐄1+𝐄2(z)m𝐖b1+𝐄1+𝐄2,1(z)|\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2,1}}(z)|
(D.6) \displaystyle\leq\, rank(𝐄2,2)nmin{2η,𝐄2,2η2}CnD1η2,\displaystyle\frac{\operatorname{rank}(\mathbf{E}_{2,2})}{n}\min\left\{\frac{2}{\eta},\,\frac{\|\mathbf{E}_{2,2}\|}{\eta^{2}}\right\}\leq\frac{C}{n^{D-1}\eta^{2}}\,,

where we use the fact that the rank of 𝐄2,2\mathbf{E}_{2,2} may be full and DD is large. Similarly, we have

|m𝐖b1+𝐄2,1+𝐄1(z)m𝐖b1+𝐄1(z)|\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{2,1}+\mathbf{E}_{1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)|
(D.7) \displaystyle\leq\, rank(𝐄2,1)nmin{2η,𝐄2,1η2}2C0lognnη,\displaystyle\frac{\operatorname{rank}(\mathbf{E}_{2,1})}{n}\min\left\{\frac{2}{\eta},\,\frac{\|\mathbf{E}_{2,1}\|}{\eta^{2}}\right\}\leq\frac{2C_{0}\log n}{n\eta}\,,

where we use the fact that \|\mathbf{E}_{2,1}\|\leq\|\exp(-2\upsilon)\mathbf{W}_{1}\|\leq\exp(-2\upsilon)n^{-\delta} for some 0<\delta<1/2, as in the argument leading to (A.23). In conclusion, since D is a large constant, we have controlled the \mathbf{E}_{2} term in the Stieltjes transform, and obtain

(D.8) |m𝐖b1+𝐄1+𝐄2(z)m𝐖b1+𝐄1(z)|3C0lognnη\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)|\leq\frac{3C_{0}\log n}{n\eta}\,

for z\in\mathcal{D} defined in (2.8). It is easy to see that the above control is negligible compared to the rate n^{-1/2+\epsilon}\eta^{-2} for any arbitrarily small \epsilon>0. Consequently, it suffices to focus on \mathbf{E}_{1} in (D.1). To control \mathbf{E}_{1}, note that

(D.9) 𝐄1=𝐖1,1𝐑1+𝐖1,2𝐑1\mathbf{E}_{1}=\mathbf{W}_{1,1}\circ\mathbf{R}_{1}+\mathbf{W}_{1,2}\circ\mathbf{R}_{1}

and we control

|m𝐖b1+𝐄1(z)m𝐖b1(z)|\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|
\displaystyle\leq |m𝐖b1+𝐖1,1𝐑1+𝐖1,2𝐑1(z)m𝐖b1+𝐖1,1𝐑1(z)|\displaystyle\,|m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}+\mathbf{W}_{1,2}\circ\mathbf{R}_{1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}}(z)|
+|m𝐖b1+𝐖1,1𝐑1(z)m𝐖b1(z)|.\displaystyle+|m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|\,.

By a discussion similar to (B.1) for the point cloud \{\mathbf{y}_{i}\} (or see the proof of Lemma 4.2 of [18]), we have

(D.10) maxi,j|𝐑1(i,j)|n1/2.\max_{i,j}|\mathbf{R}_{1}(i,j)|\prec n^{-1/2}.

Together with (D.3), by the Gershgorin circle theorem, we conclude that

(D.11) 𝐖1,2𝐑1nD+1/2.\|\mathbf{W}_{1,2}\circ\mathbf{R}_{1}\|\prec n^{-D+1/2}.

Since D>2, this error is negligible by the same argument as in (D.6); thus we have the desired control of |m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}+\mathbf{W}_{1,2}\circ\mathbf{R}_{1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}}(z)|. It remains to control |m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|. Set \mathcal{E}_{1}:=\mathbf{W}_{1,1}\circ\mathbf{R}_{1} for convenience. Note that with high probability, for some constant C>0, we have that

(D.12) maxi,j|𝐖1,1(i,j)|C.\max_{i,j}|\mathbf{W}_{1,1}(i,j)|\leq C.
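For readability, recall that for a symmetric matrix \mathbf{H}\in\mathbb{R}^{n\times n}, the empirical Stieltjes transform is, under the standard convention (see, e.g., [2, 31]),

m_{\mathbf{H}}(z)=\frac{1}{n}\operatorname{tr}\left(\mathbf{H}-z\mathbf{I}_{n}\right)^{-1}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\lambda_{i}(\mathbf{H})-z},\qquad z=E+\mathrm{i}\eta,\ \eta>0,

so that differences of Stieltjes transforms can be compared eigenvalue by eigenvalue, as in the next display.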

By the definition of the Stieltjes transform, for any even integer q, we have

|m𝐖b1+1(z)m𝐖b1+𝐑1(z)|q\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathcal{E}_{1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{R}_{1}}(z)|^{q}
(1ni=1n|λi(𝐖b1+1)λi(𝐖b1+𝐑1)||λi(𝐖b1+1)z||λi(𝐖b1+𝐑1)z|)q,\displaystyle\leq\left(\frac{1}{n}\sum_{i=1}^{n}\frac{|\lambda_{i}(\mathbf{W}_{b_{1}}+\mathcal{E}_{1})-\lambda_{i}(\mathbf{W}_{b_{1}}+\mathbf{R}_{1})|}{|\lambda_{i}(\mathbf{W}_{b_{1}}+\mathcal{E}_{1})-z||\lambda_{i}(\mathbf{W}_{b_{1}}+\mathbf{R}_{1})-z|}\right)^{q}\,,

which, by the Cauchy-Schwarz inequality, is bounded by

1nq[(i=1n|λi(𝐖b1+1)λi(𝐖b1+𝐑1)|2)\displaystyle\,\frac{1}{n^{q}}\bigg{[}\left(\sum_{i=1}^{n}|\lambda_{i}(\mathbf{W}_{b_{1}}+\mathcal{E}_{1})-\lambda_{i}(\mathbf{W}_{b_{1}}+\mathbf{R}_{1})|^{2}\right)
×(i=1n1|λi(𝐖b1+1)z|2|λi(𝐖b1+𝐑1)z|2)]q/2\displaystyle\qquad\qquad\times\left(\sum_{i=1}^{n}\frac{1}{|\lambda_{i}(\mathbf{W}_{b_{1}}+\mathcal{E}_{1})-z|^{2}|\lambda_{i}(\mathbf{W}_{b_{1}}+\mathbf{R}_{1})-z|^{2}}\right)\bigg{]}^{q/2}
(D.13) \displaystyle\leq 1nq/2η2q(tr{(1𝐑1)2})q/2,\displaystyle\,\frac{1}{n^{q/2}\eta^{2q}}(\text{tr}\{(\mathcal{E}_{1}-\mathbf{R}_{1})^{2}\})^{q/2},

where the last inequality comes from the Hoffman-Wielandt inequality (Lemma A.3) and the trivial bound |\lambda-z|^{-1}\leq\eta^{-1} for any \lambda\in\mathbb{R}. It remains to control the right-hand side of (D.13). By (D.12), the fact that \mathcal{E}_{1}-\mathbf{R}_{1}=(\mathbf{W}_{1,1}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{R}_{1}, and the symmetry of \mathcal{E}_{1}, for some constant C>0, we can replace (D.13) with

(D.14) Cnq/2η2q[tr(𝐑12)]q/2.\frac{C}{n^{q/2}\eta^{2q}}[\text{tr}(\mathbf{R}_{1}^{2})]^{q/2}\,.
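To spell out this replacement, note that since \mathbf{W}_{1,1} and \mathbf{R}_{1} are symmetric,

\text{tr}\{(\mathcal{E}_{1}-\mathbf{R}_{1})^{2}\}=\sum_{i,j}\left(\mathbf{W}_{1,1}(i,j)-1\right)^{2}\mathbf{R}_{1}(i,j)^{2}\leq(C+1)^{2}\sum_{i,j}\mathbf{R}_{1}(i,j)^{2}=(C+1)^{2}\text{tr}(\mathbf{R}_{1}^{2}),

where the inequality uses (D.12).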

Thus, we only need to consider the \mathbf{R}_{1} part, and we claim that (D.14) can be controlled by a term of order (\log n)^{q}n^{-q/2}\eta^{-2q}; the proof can be found in [18, eq. (C.8)]. By the same argument used to control |m_{\mathbf{W}_{b_{1}}+\mathcal{E}_{1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{R}_{1}}(z)|^{q}, we also control |m_{\mathbf{W}_{b_{1}}+\mathbf{R}_{1}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|^{q} by a term of order (\log n)^{q}n^{-q/2}\eta^{-2q}. Collecting the above controls, including those of |m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)| and |m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|, we conclude the bound for |m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|. To finish the proof, we control |m_{\mathbf{W}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)|, which amounts to controlling \mathbf{E}_{0}. We can control \mathbf{E}_{0} by an analogous argument, decomposing \mathbf{W}_{1} as in (B.43) and using (B.36), Lemma A.3, and a discussion similar to (D.13). We only sketch the proof. Recall (D.1). Note that we have

𝐄0=\displaystyle\mathbf{E}_{0}= (𝐖c𝟏𝟏)𝐔y𝐖1,1\displaystyle\,(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{U}_{y}\circ\mathbf{W}_{1,1}
+(𝐖c𝟏𝟏)𝐔y𝐖1,2\displaystyle+(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{U}_{y}\circ\mathbf{W}_{1,2}
+(𝐖c𝟏𝟏)𝐑1𝐖1,1\displaystyle+(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{R}_{1}\circ\mathbf{W}_{1,1}
+(𝐖c𝟏𝟏)𝐑1𝐖1,2.\displaystyle+(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{R}_{1}\circ\mathbf{W}_{1,2}.

We explain how to control the first term, which is the leading order term. Observe that

(𝐖c𝟏𝟏)𝐔y𝐖1,1\displaystyle(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{U}_{y}\circ\mathbf{W}_{1,1}
=\displaystyle=\, exp(2υ)(𝐖c𝟏𝟏)𝐖1,1\displaystyle\exp(-2\upsilon)(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{W}_{1,1}
+2υexp(4υ)[(𝐖c𝟏𝟏)𝐈n][𝐖1,1𝐈n]\displaystyle+2\upsilon\exp(-4\upsilon)[(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{I}_{n}][\mathbf{W}_{1,1}\circ\mathbf{I}_{n}]
+(𝐖c𝟏𝟏)2υexp(2υ)p𝐘𝐘𝐖1,1.\displaystyle+(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\frac{2\upsilon\exp(-2\upsilon)}{p}\mathbf{Y}^{\top}\mathbf{Y}\circ\mathbf{W}_{1,1}.

First, by (D.12) and (B.36), the operator norm of the second term can be bounded by n^{\alpha/2-1}, and hence the difference of the Stieltjes transforms can be bounded similarly to (D.7). Second, the first and third terms can be bounded using a discussion similar to (D.13) together with the counterparts of (D.10), which read as

maxi,j|[exp(2υ)(𝐖c𝟏𝟏)𝐖1,1](i,j)|nα/21\displaystyle\max_{i,j}\left|\left[\exp(-2\upsilon)(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{W}_{1,1}\right](i,j)\right|\prec n^{\alpha/2-1}

and

maxi,j|[(𝐖c𝟏𝟏)2υexp(2υ)p𝐘𝐘𝐖1,1](i,j)|nα/21.\displaystyle\max_{i,j}\left|\left[(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\frac{2\upsilon\exp(-2\upsilon)}{p}\mathbf{Y}^{\top}\mathbf{Y}\circ\mathbf{W}_{1,1}\right](i,j)\right|\prec n^{\alpha/2-1}.

This completes our proof.

References

  • [1] A. A. Amini and Z. S. Razaee. Concentration of kernel matrices with application to kernel spectral clustering. The Annals of Statistics, 49(1):531 – 556, 2021.
  • [2] Z. Bai and J. W. Silverstein. Spectral analysis of large dimensional random matrices. Springer Series in Statistics. Springer, New York, second edition, 2010.
  • [3] J. Baik, G. Ben Arous, and S. Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab., 33(5):1643–1697, 2005.
  • [4] Z. Bao, X. Ding, J. Wang, and K. Wang. Statistical inference for principal components of spiked covariance matrices. The Annals of Statistics, 50(2):1144–1169, 2022.
  • [5] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural. Comput., 15(6):1373–1396, 2003.
  • [6] M. Belkin and P. Niyogi. Convergence of laplacian eigenmaps. In Advances in Neural Information Processing Systems, pages 129–136, 2007.
  • [7] F. Benaych-Georges and R. R. Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521, 2011.
  • [8] A. Bloemendal, A. Knowles, H.-T. Yau, and J. Yin. On the principal components of sample covariance matrices. Probab. Theory Related Fields, 164(1-2):459–552, 2016.
  • [9] A. Bojchevski, Y. Matkovic, and S. Günnemann. Robust spectral clustering for noisy data: Modeling sparse corruptions improves latent embeddings. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, page 737–746, 2017.
  • [10] C. Bordenave. Eigenvalues of Euclidean random matrices. Random Structures Algorithms, 33(4):515–532, 2008.
  • [11] C. Bordenave. On Euclidean random matrices in high dimension. Electron. Commun. Probab., 18:no. 25, 8, 2013.
  • [12] M. L. Braun. Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7(82):2303–2328, 2006.
  • [13] T. T. Cai, X. Han, and G. Pan. Limiting laws for divergent spiked eigenvalues and largest nonspiked eigenvalue of sample covariance matrices. The Annals of Statistics, 48(3):1255–1280, 2020.
  • [14] M.-Y. Cheng and H.-T. Wu. Local linear regression on manifolds and its geometric interpretation. Journal of the American Statistical Association, 108(504):1421–1434, 2013.
  • [15] X. Cheng and A. Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 2(04):1350010, 2013.
  • [16] F. R. K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. Published for the Conference Board of the Mathematical Sciences, Washington, DC; by the American Mathematical Society, Providence, RI, 1997.
  • [17] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
  • [18] X. Ding and H.-T. Wu. On the spectral property of kernel-based sensor fusion algorithms of high dimensional data. IEEE Transactions on Information Theory, 67(1):640–670, 2021.
  • [19] X. Ding and F. Yang. Spiked separable covariance matrices and principal components. Ann. Statist., 49(2):1113–1138, 2021.
  • [20] X. Ding and F. Yang. Edge statistics of large dimensional deformed rectangular matrices. Journal of Multivariate Analysis, page 105051, 2022.
  • [21] X. Ding and F. Yang. Tracy-Widom distribution for heterogeneous gram matrices with applications in signal detection. IEEE Transactions on Information Theory, 68(10):6682–6715, 2022.
  • [22] Y. Do and V. Vu. The spectrum of random kernel matrices: universality results for rough and varying kernels. Random Matrices: Theory and Applications, 2(03):1350005, 2013.
  • [23] R. B. Dozier and J. W. Silverstein. Analysis of the limiting spectral distribution of large dimensional information-plus-noise type matrices. Journal of Multivariate Analysis, 98(6):1099–1122, 2007.
  • [24] R. B. Dozier and J. W. Silverstein. On the empirical distribution of eigenvalues of large dimensional information-plus-noise-type matrices. Journal of Multivariate Analysis, 98(4):678–694, 2007.
  • [25] D. B. Dunson, H.-T. Wu, and N. Wu. Spectral convergence of graph laplacian and heat kernel reconstruction in LL^{\infty} from random samples. Applied and Computational Harmonic Analysis, 55:282–336, 2021.
  • [26] N. El Karoui. On information plus noise kernel random matrices. Ann. Statist., 38(5):3191–3216, 10 2010.
  • [27] N. El Karoui. The spectrum of kernel random matrices. Ann. Statist., 38(1):1–50, 2010.
  • [28] N. El Karoui and H.-T. Wu. Graph connection Laplacian and random matrices with random blocks. Information and Inference: A Journal of the IMA, 4(1):1–44, 2015.
  • [29] N. El Karoui and H.-T. Wu. Graph connection Laplacian methods can be made robust to noise. The Annals of Statistics, 44(1):346–372, 2016.
  • [30] L. Erdős, A. Knowles, and H.-T. Yau. Averaging fluctuations in resolvents of random band matrices. Ann. Henri Poincaré, 14(8):1837–1926, 2013.
  • [31] L. Erdős and H. Yau. A Dynamical Approach to Random Matrix Theory. Courant Lecture Notes. Courant Institute of Mathematical Sciences, New York University, 2017.
  • [32] Z. Fan and A. Montanari. The spectral norm of random inner-product kernel matrices. Probability Theory and Related Fields, 173(1):27–85, 2019.
  • [33] D. Foata. A combinatorial proof of the Mehler formula. Journal of Combinatorial Theory, Series A, 24(3):367–376, 1978.
  • [34] N. García Trillos, M. Gerlach, M. Hein, and D. Slepcev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace-Beltrami operator. Found. Comput. Math., 20(4):827–887, 2020.
  • [35] E. Giné and V. Koltchinskii. Empirical graph Laplacian approximation of Laplace–Beltrami operators: Large sample results. In High dimensional probability, pages 238–259. Institute of Mathematical Statistics, 2006.
  • [36] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. J. Mach. Learn. Res., 8:1325–1368, 2007.
  • [37] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.
  • [38] T. Jiang. Distributions of eigenvalues of large Euclidean matrices generated from lpl_{p} balls and spheres. Linear Algebra Appl., 473:14–36, 2015.
  • [39] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295–327, 2001.
  • [40] A. Knowles and J. Yin. Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169(1):257–352, 2017.
  • [41] V. Koltchinskii and E. Giné. Random matrix approximation of spectra of integral operators. Bernoulli, pages 113–167, 2000.
  • [42] Z. Liao. A Random Matrix Framework for Large Dimensional Machine Learning and Neural Networks. PhD thesis, Université Paris-Saclay, 2019.
  • [43] Z. Liao and R. Couillet. On inner-product kernels of high dimensional data. In 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 579–583, 2019.
  • [44] Y.-T. Lin, J. Malik, and H.-T. Wu. Wave-shape oscillatory model for nonstationary periodic time series analysis. Foundations of Data Science, 3(2):99–131, 2021.
  • [45] V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967.
  • [46] F. G. Meyer and X. Shen. Perturbation of the eigenvectors of the graph laplacian: Application to image denoising. Applied and Computational Harmonic Analysis, 36(2):326–334, 2014.
  • [47] M. Mézard, G. Parisi, and A. Zee. Spectra of Euclidean random matrices. Nuclear Phys. B, 559(3):689–701, 1999.
  • [48] J. Nash. The imbedding problem for Riemannian manifolds. Annals of Mathematics, pages 20–63, 1956.
  • [49] H. Q. Ngo. \mathbb{P}-Species and the q-Mehler formula. Sém. Lothar. Combin, (48):B48b, 2002.
  • [50] D. Passemier and J. Yao. Estimation of the number of spikes, possibly equal, in the high-dimensional case. Journal of Multivariate Analysis, 127:173–183, 2014.
  • [51] D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17(4):1617–1642, 2007.
  • [52] S. Prasad Kasiviswanathan and M. Rudelson. Spectral Norm of Random Kernel Matrices with Applications to Privacy. arXiv preprint arXiv:1504.05880, 2015.
  • [53] L. Rosasco, M. Belkin, and E. De Vito. On learning with integral operators. J. Mach. Learn. Res., 11:905–934, 2010.
  • [54] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • [55] C. Shen and H.-T. Wu. Scalability and robustness of spectral embedding: landmark diffusion is all you need. Information and Inference: A Journal of the IMA, 2022. iaac013.
  • [56] T. Shnitzer, M. Ben-Chen, L. Guibas, R. Talmon, and H.-T. Wu. Recovering hidden components in multimodal data with composite diffusion operators. SIAM Journal on Mathematics of Data Science, 1(3):588–616, 2019.
  • [57] A. Singer. From graph to manifold laplacian: The convergence rate. Applied and Computational Harmonic Analysis, 21(1):128–134, 2006.
  • [58] A. Singer and H.-T. Wu. Vector diffusion maps and the connection Laplacian. Comm. Pure Appl. Math., 65(8):1067–1144, 2012.
  • [59] S. Steinerberger. A Filtering Technique for Markov Chains with Applications to Spectral Embedding. Applied and Computational Harmonic Analysis, 40:575–587, 2016.
  • [60] R. Talmon and R. R. Coifman. Empirical intrinsic geometry for nonlinear modeling and time series filtering. Proceedings of the National Academy of Sciences, 110(31):12535–12540, 2013.
  • [61] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  • [62] U. von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17(4):395–416, 2007.
  • [63] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Ann. Statist., 36(2):555–586, 2008.
  • [64] H.-T. Wu and N. Wu. Think globally, fit locally under the Manifold Setup: Asymptotic Analysis of Locally Linear Embedding. Annals of Statistics, 46(6B):3805–3837, 2018.