
Impact of signal-to-noise ratio and bandwidth on graph Laplacian spectrum from high-dimensional noisy point cloud

Xiucai Ding Department of Statistics, University of California, Davis, CA, USA xcading@ucdavis.edu  and  Hau-Tieng Wu Department of Mathematics and Department of Statistical Science, Duke University, Durham, NC, USA hauwu@math.duke.edu
Abstract.

We systematically study the spectrum of a kernel-based graph Laplacian (GL) constructed from a high-dimensional and noisy random point cloud in the nonnull setup. The problem is motivated by the model in which the clean signal is sampled from a manifold embedded in a low-dimensional Euclidean subspace and corrupted by high-dimensional noise. We quantify how the signal and noise interact over different regions of signal-to-noise ratio (SNR), and report the resulting peculiar spectral behavior of the GL. In addition, we explore the impact of the chosen kernel bandwidth on the spectrum of the GL over different regions of SNR, which leads to an adaptive choice of kernel bandwidth that coincides with common practice for real data. This result paves the way to a theoretical understanding of how practitioners apply the GL when the dataset is noisy.

1. Introduction

Spectral algorithms are popular and have been widely applied in unsupervised machine learning, such as the eigenmap [5], locally linear embedding (LLE) [54], and the diffusion map (DM) [17], to name but a few. A common ground of these algorithms is the graph Laplacian (GL) and its spectral decomposition, which have been extensively discussed in spectral graph theory [16]. Due to its wide practical applications, there are by now rich theoretical results on the spectral behavior of the GL under the manifold model when the point cloud is clean; for example, the pointwise convergence [5, 36, 35, 57], the $L^{2}$ convergence without rate [6, 63, 58], the $L^{2}$ convergence with rate [34], the $L^{\infty}$ convergence with rate [25], etc. We refer readers to the cited literature and the references therein for more information; see also, for example, [41, 12] for more relevant results. However, to our knowledge, limited results are known about the GL spectrum when the dataset is contaminated by noise [59, 26, 29, 55], particularly when the noise is high-dimensional [26, 29, 55]. Specifically, in the high-dimensional setup, [26, 29] report controls on the operator norm of the difference between the clean and noisy GLs in some specific signal-to-noise ratio (SNR) regions. However, a finer and more complete description of the spectrum is still lacking. The main focus of this work is extending existing results so that the spectral behavior of the GL is better depicted when the dataset is contaminated by high-dimensional noise.

GL is not uniquely defined, and there are several possible constructions from a given point cloud. For example, it can be constructed via local barycentric coordinates as in LLE [64] or by taking landmarks as in Roseland [55], the kernel can be asymmetric [17], the metric can be non-Euclidean [60], etc. In this paper, we focus on a specific setup; that is, the GL is constructed from a random point cloud $\mathcal{X}:=\{\mathbf{x}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{p}$ via a symmetric kernel with the usual Euclidean metric. To be more specific, from $\mathcal{X}$, construct the affinity matrix (or kernel matrix) $\mathbf{W}\in\mathbb{R}^{n\times n}$ by

(1.1) \mathbf{W}(i,j)=\exp\left(-\upsilon\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}}{h}\right),\ 1\leq i,j\leq n,

where $\upsilon>0$ is a chosen parameter and $h\equiv h(n)$ is the chosen bandwidth. In other words, we focus on kernels of the exponential type, that is, $f(x)=\exp(-\upsilon x)$, to simplify the discussion. Then, define the transition matrix

(1.2) \mathbf{A}=\mathbf{D}^{-1}\mathbf{W},

where $\mathbf{D}$ is the degree matrix, which is a diagonal matrix with diagonal entries defined as

(1.3) \mathbf{D}(i,i)=\sum_{j=1}^{n}\mathbf{W}(i,j),\ i=1,2,\cdots,n.

Note that $\mathbf{A}$ is row-stochastic. The (normalized) GL is defined as

\mathbf{L}:=\frac{1}{h}(\mathbf{I}-\mathbf{A})\,.

Since $\mathbf{L}$ and $\mathbf{A}$ are related by an isotropic shift and a universal scaling, we focus on studying the spectral distributions of $\mathbf{W}$ and $\mathbf{A}$ in the rest of the paper. With the eigendecomposition of $\mathbf{A}$, we could proceed with several data analysis tasks. For example, spectral clustering is carried out by combining the top few non-trivial eigenvectors with $k$-means [62], embedding the dataset into a low-dimensional Euclidean space by eigenvectors and/or eigenvalues reduces the dataset dimension [5, 17], etc. Thus, the spectral behavior of $\mathbf{A}$ is the key to fully understanding these algorithms.
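
For concreteness, the following sketch (in Python with NumPy; the function name and variable names are ours, not from the paper) illustrates the construction of $\mathbf{W}$, $\mathbf{D}$, $\mathbf{A}$ and $\mathbf{L}$ in (1.1)-(1.3) from a point cloud stored as the rows of a matrix. It only fixes the notation and is not part of the theoretical development.

import numpy as np

def graph_laplacian(X, h, upsilon=1.0):
    """Build W, A and L of (1.1)-(1.3) from an (n, p) array X whose rows are x_i.

    Illustrative sketch; h is the bandwidth and upsilon the kernel parameter."""
    # Pairwise squared Euclidean distances ||x_i - x_j||^2.
    sq_norms = (X ** 2).sum(axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    D2 = np.maximum(D2, 0.0)                 # guard against round-off

    W = np.exp(-upsilon * D2 / h)            # affinity matrix (1.1)
    deg = W.sum(axis=1)                      # degrees D(i, i) in (1.3)
    A = W / deg[:, None]                     # row-stochastic transition matrix (1.2)
    L = (np.eye(X.shape[0]) - A) / h         # normalized GL
    return W, A, L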

1.1. Mathematical setup

We now specify the high-dimensional noisy model and the problem we are concerned with in this work. Suppose $\mathcal{X}$ consists of independent and identically distributed (i.i.d.) samples of a sub-Gaussian random vector

(1.4) \mathbf{x}_{i}=\mathbf{z}_{i}+\mathbf{y}_{i},\ 1\leq i\leq n\,,

where $\mathbf{z}_{i}$ is a random vector with mean $0$ and $\textnormal{cov}(\mathbf{z}_{i})=\text{diag}\{\lambda_{1},\cdots,\lambda_{d},0,\cdots,0\}$, $\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{d}>0$, and $\mathbf{y}_{i}$ is sub-Gaussian with independent entries,

(1.5) \mathbb{E}(\mathbf{y}_{i})=\mathbf{0}\ \mbox{ and }\ \textnormal{cov}(\mathbf{y}_{i})=\mathbf{I}_{p}\,.

We also assume that $\mathbf{z}_{i}$ and $\mathbf{y}_{i}$ are independent. As a result, $\mathbf{x}_{i}$ is a sub-Gaussian random vector with mean $0$ and covariance

(1.6) \Sigma=\text{diag}\{\lambda_{1}+1,\cdots,\lambda_{d}+1,1,\cdots,1\}\,,

where $d\in\mathbb{N}$. In this model, $\mathbf{z}_{i}$ and $\mathbf{y}_{i}$ represent the signal and noise parts of the point cloud, respectively. To simplify the discussion, we also assume that $\mathbf{z}_{i1},\ldots,\mathbf{z}_{id}$ are continuous random variables. Denote $\mathcal{Y}:=\{\mathbf{y}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{p}$ and $\mathcal{Z}:=\{\mathbf{z}_{i}\}_{i=1}^{n}$.

We adopt the high-dimensional setting and assume that

(1.7) \gamma\leq c_{n}:=\frac{n}{p}\leq\gamma^{-1}

for some constant $0<\gamma\leq 1$; in other words, we focus on the large $p$ and large $n$ setup. We also assume that $d$ is independent of $n$. Note that (1.4) is closely related to the commonly applied spiked covariance model [39].

Next, we consider the following setup to control the signal strength:

(1.8) \lambda_{i}\asymp n^{\alpha_{i}},\ 0\leq\alpha_{i}<\infty\ \text{are some constants}.

Thus, $\lambda_{i}$ represents the signal strength in the $i$-th component of the signal. On the other hand, the total noise in the dataset $\mathcal{X}$ is $\mathrm{tr}(\textnormal{cov}(\mathbf{y}_{i}))=p$, and we call $\lambda_{i}/p$ the signal-to-noise ratio (SNR) associated with the $i$-th component of the signal.
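
For readers who prefer a concrete reference point, the following sketch (Python/NumPy; the function name, the Gaussian choice of the sub-Gaussian laws, and the default parameters are our illustrative assumptions) draws a sample from the model (1.4)-(1.8) with $d=1$ and $\lambda\asymp n^{\alpha}$.

import numpy as np

def sample_spiked_cloud(n, p, alpha, rng=None):
    """Draw x_i = z_i + y_i, i = 1..n, from (1.4)-(1.8) with d = 1 and lambda = n**alpha.

    Gaussian signal and noise are used as one admissible sub-Gaussian choice."""
    rng = np.random.default_rng(rng)
    lam = float(n) ** alpha                           # signal strength (1.8)
    Z = np.zeros((n, p))
    Z[:, 0] = np.sqrt(lam) * rng.standard_normal(n)   # cov(z_i) = diag(lambda, 0, ..., 0)
    Y = rng.standard_normal((n, p))                   # cov(y_i) = I_p as in (1.5)
    return Z + Y, Z, Y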

Denote by $\mathbf{W}_{1}\in\mathbb{R}^{n\times n}$ the matrix such that

(1.9) \mathbf{W}_{1}(i,j)=\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{h}\right)\,.

Note that compared with $\mathbf{W}$ defined in (1.1), $\mathbf{W}_{1}$ is the affinity matrix constructed from the clean signal $\mathcal{Z}$. Moreover, let $\mathbf{W}_{y}$ be the affinity matrix associated with $\mathcal{Y}$, which represents the noise part. Their associated transition matrices $\mathbf{A}_{1}$ and $\mathbf{A}_{y}$ are defined similarly as in (1.2). We will show in part (1) of Theorem 2.7, part (2) of Theorem 3.1 and Corollary 3.2 that under (1.4), when the SNR is above some threshold and the bandwidth is properly chosen, we can separate the signal and noise parts in the sense that $\mathbf{W}$ is close to

(1.10) \mathbf{W}_{1}\circ\mathbf{W}_{y}

with high probability, where $\circ$ stands for the Hadamard product. The closeness is quantified using the normalized operator norm; that is, $n^{-1}\|\mathbf{W}-\mathbf{W}_{1}\circ\mathbf{W}_{y}\|=o(1)$ with high probability. In fact, using the definition of $\mathbf{W}_{c}$ below in (2.15), we immediately see that in general we have $\mathbf{W}=\mathbf{W}_{1}\circ\mathbf{W}_{y}\circ\mathbf{W}_{c}$. When the SNR is relatively large and the bandwidth is chosen properly, we show that with high probability $\max_{i,j}|\mathbf{W}_{c}(i,j)-1|=o(1)$, and Lemma A.1 leads to the claim.
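
The exact factorization $\mathbf{W}=\mathbf{W}_{1}\circ\mathbf{W}_{y}\circ\mathbf{W}_{c}$ follows from $\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}=\|\mathbf{z}_{i}-\mathbf{z}_{j}\|^{2}+\|\mathbf{y}_{i}-\mathbf{y}_{j}\|^{2}+2(\mathbf{z}_{i}-\mathbf{z}_{j})^{\top}(\mathbf{y}_{i}-\mathbf{y}_{j})$. The following sketch (Python/NumPy; the function names are ours) verifies this identity numerically and reports the normalized operator norm $n^{-1}\|\mathbf{W}-\mathbf{W}_{1}\circ\mathbf{W}_{y}\|$, the quantity controlled above when the SNR is large and the bandwidth is proper.

import numpy as np

def sqdist(M):
    """Pairwise squared Euclidean distances between the rows of M."""
    s = (M ** 2).sum(axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * M @ M.T, 0.0)

def hadamard_check(Z, Y, h, upsilon=1.0):
    """Check W = W1 o Wy o Wc entrywise and return n^{-1} ||W - W1 o Wy||_op."""
    X = Z + Y
    W  = np.exp(-upsilon * sqdist(X) / h)
    W1 = np.exp(-upsilon * sqdist(Z) / h)           # clean affinity matrix (1.9)
    Wy = np.exp(-upsilon * sqdist(Y) / h)           # noise affinity matrix

    g = (Z * Y).sum(axis=1)                         # z_i^T y_i
    G = Z @ Y.T                                     # G[i, j] = z_i^T y_j
    cross = g[:, None] + g[None, :] - G - G.T       # (z_i - z_j)^T (y_i - y_j)
    Wc = np.exp(-2.0 * upsilon * cross / h)         # cross-term matrix, cf. (2.15)

    assert np.allclose(W, W1 * Wy * Wc)             # exact factorization (up to round-off)
    return np.linalg.norm(W - W1 * Wy, 2) / X.shape[0]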

The main problem we study in this work is the relationship between the spectra of $\mathbf{W}_{1}$ and $\mathbf{W}$ over different SNR regions and bandwidths; that is, how do the noise and the chosen bandwidth impact the spectrum of the affinity matrix, and how does the noisy spectrum deviate from that of the clean affinity matrix? By combining existing understanding of $\mathbf{W}_{1}$ under suitable models, such as manifolds [34, 25], we obtain a finer understanding of the commonly applied GL.

1.2. Relationship with the manifold model

We claim that this seemingly trivial spiked covariance model (1.4) overlaps with the commonly considered nonlinear manifold model. Consider the case that the manifold is embedded into a subspace of fixed dimension in $\mathbb{R}^{p}$. With some mild assumptions on the manifold, and the fact that the Euclidean distance of $\mathbb{R}^{p}$ is invariant to rotation, the noisy manifold model can be studied by the spiked covariance model satisfying (1.4)-(1.8). We defer the details to Section A.1 to avoid distraction. We shall, however, emphasize that this does not mean that we could understand the manifold structure by studying the spiked covariance model. The problem we study in this paper is the relationship between the noisy and clean affinity matrices; that is, how the noise impacts the GL. The problem of exploring the manifold structure from a clean dataset via the GL is a different one; see, for example, [35, 34, 25]. Therefore, by studying the relationship between the noisy and clean GLs via the model where the data is a sum of two sub-Gaussian random vectors, we could further explore the manifold structure from a noisy dataset.

1.3. Some related works

The focus of the current paper is to study the GL spectrum under the nonnull case for the point cloud $\mathcal{X}=\{\mathbf{x}_{i}\}_{i=1}^{n}$ described in (1.4), and connect it to the GL spectrum under the null case; that is, when the GL is constructed from the pure noise point cloud $\mathcal{Y}=\{\mathbf{y}_{i}\}_{i=1}^{n}$. The eigenvalues of $\mathbf{W}$, a high-dimensional Euclidean distance kernel random matrix, in the null case have been studied in several works. In the pioneering work [27], the author studied the spectrum of $\mathbf{W}$ under the null case assuming the bandwidth $h=p$, and showed that the complicated kernel matrix can be well approximated by a low rank perturbed Gram matrix; see [27, Theorem 2.2] or (A.19) below for more details. It was concluded that studying $\mathbf{W}$ in the null case is closely related to studying the principal component analysis (PCA) of the dataset with a low rank perturbation. The results in [27] were extended to more general kernels beyond the exponential kernel function, and $\mathcal{Y}$ can be anisotropic under some moment assumptions; for example, more general kernels with Gaussian noise were studied in [15], the Gaussian assumption was removed in [22, 11], the convergence rates of individual eigenvalues of $\mathbf{W}$ were provided in [18], etc.

However, much less is known for the nonnull case (1.4). To our knowledge, the most relevant works are [26, 29]. By assuming $h=p$ and $\mathbb{E}\|\mathbf{z}_{i}\|_{2}^{2}\asymp p$, in [26] the author showed that $n^{-1}\mathbf{W}$ could be well approximated by $n^{-1}\exp(-2\upsilon)\mathbf{W}_{1}$, which connects the noisy observation and the clean signal. In fact, the results of [26, Theorem 2.1] were established for a class of smooth kernels, and the noise could be anisotropic. In a recent work [1], the authors established a concentration inequality for $\|\mathbf{W}-\mathbb{E}\mathbf{W}\|$ and used it to study spectral clustering.

We mention that other types of kernel random matrices related to (1.1) have also been studied in the literature. For example, the inner-product type kernel random matrices of the form $f(\mathbf{x}_{i}^{\top}\mathbf{x}_{j})$, where $f$ is some kernel function, have drawn attention among researchers. In [27], the author showed that under some regularity assumptions, inner-product kernel matrices could also be approximated by a Gram matrix with low rank perturbations, which has been generalized in [15, 22, 11, 32, 43, 52]. Another example is the Euclidean random matrices of the form $F(\mathbf{x}_{i}-\mathbf{x}_{j})$ arising from physics, where $F$ is some measurable function. In particular, the empirical spectral distribution has been extensively studied in, for example, [47, 10, 38], among others.

Finally, our present work is also in the line of research regarding the robustness of GL. There have been efforts in different settings along this direction. For example, the perturbation of the eigenvectors of GL was studied in [46], the consistency and robustness of spectral clustering were analyzed in [1, 63, 9], and the robustness of DM was studied in [59, 55], among others.

1.4. An overview of our results

Our main contribution is a systematic treatment of the spectrum of the GL constructed from a high-dimensional random point cloud corrupted by high-dimensional noise; that is, we consider the nonnull setup (1.4). We establish a connection between the spectra of the noisy and clean affinity matrices under different choices of bandwidth and different SNRs by extensively expanding the results reported in [26] with tools from random matrix theory. More specifically, we allow the signal strength to diverge with $n$ so that the relative strength of signal and noise is captured, and characterize the spectral distribution of the noisy affinity matrix by studying how the signal and noise interact and how different bandwidths impact the spectra of $\mathbf{W}$ and $\mathbf{A}$. Motivated by our theoretical results, we propose an adaptive bandwidth selection algorithm with theoretical support. The proposed method utilizes a certain quantile of the pairwise distances as in (3.6), where the quantile can be selected using our proposed Algorithm 1. We provide detailed results when $d=1$ in (1.4), and discuss how to extend the results to $d>1$ and why some cases are challenging when $d>1$. Our result, when combined with existing manifold learning results like [35, 34, 25] and others, paves the way to a better understanding of how GL-based algorithms behave in practice, and provides theoretical support for the commonly applied bandwidth scheme, when $\{\mathbf{z}_{i}\}_{i=1}^{n}$ is distributed on a low-dimensional, compact and smooth manifold.

We now provide a heuristic explanation of our results assuming $d=1$ and $h=p$. In Section 2, when $\alpha<1$, we show that the noisy kernel affinity matrix $\mathbf{W}$ provides limited useful information about the clean affinity matrix $\mathbf{W}_{1}$ in (1.9). Specifically, similar to the null setting, $\mathbf{W}$ can be well approximated by a Gram matrix with a finite rank perturbation; see Section 2.2 for more details. When $\alpha\geq 1$, $\mathbf{W}$ becomes closer to $\mathbf{W}_{1}$ up to a universal scaling and an isotropic shift (c.f. (2.14)). Note that some related results have been established for $\alpha=1$ in [26]. We show that the convergence rate is adaptive to $\alpha$, and provide a quantification of how the eigenvalues of the noisy affinity matrix $\mathbf{W}$ converge. We mention that when $\alpha>1$, the classic bandwidth choice $h=p$ needs to be modified to reflect the underlying signal structure. Specifically, if $h=p$ and $\alpha>1$, we have $\mathbf{W}\approx\mathbf{I}$. These results are stated in Theorem 2.7. Similar results hold for the transition matrix $\mathbf{A}$, and they are reported in Section 2.4.

Motivated by the fact that $\mathbf{W}$ becomes trivial when $\alpha>1$ is large, in Section 3 we focus on the case $h=p+\lambda$. When $\alpha<1$, the results are analogous to the setting $h=p$. When $\alpha\geq 1$, the spectrum of $\mathbf{W}$ is dramatically different. Specifically, when $\alpha=1$, $\mathbf{W}$ is close to $\mathbf{W}_{1}$ up to a universal scaling and an isotropic shift (c.f. (3.1)), and when $\alpha>1$, $\mathbf{W}$ is close to $\mathbf{W}_{1}$ without any scaling or shift. Moreover, besides the top $\log(n)$ eigenvalues of $\mathbf{W}$, the other eigenvalues are trivial; see Theorem 3.1 for more details.

In practice, $\lambda$ is unknown and not easy to estimate, especially when $\{\mathbf{z}_{i}\}$ are sampled from a nonlinear geometric object. For practical implementation, we propose a bandwidth selection algorithm in Section 3.2. It turns out that the proposed bandwidth selection algorithm bypasses the challenge of estimating $\lambda$, and the result coincides with that determined by the ad hoc bandwidth selection method commonly applied by practitioners; see [56, 44], among many others. Note that the bandwidth issue is also discussed in [1]. To our knowledge, our result is the first step toward a theoretical understanding of this common practice.

Finally, we point out the technical ingredients of our proof. We focus on the bandwidth $h=p$ and take $d=1$ as an example. First, when $\lambda$ is bounded, i.e., $\alpha=0$, the spectrum of $\mathbf{W}$ has been studied in [27] using an entrywise expansion to the order of two (with a third order error). When $0<\alpha<0.5$, the above strategy can be adapted straightforwardly with some modifications. When $0.5\leq\alpha<1$, we need to expand the entries to a higher order depending on $\alpha$. With this expansion, we show that except for the Gram matrix, the other parts are either of fixed rank or negligible; see Sections B.1 and B.2 for more details. Second, when $\alpha\geq 1$, the entrywise expansion strategy fails and we need to conduct a more careful analysis. In particular, thanks to the Hadamard product representation of $\mathbf{W}$ in (1.10), we need to analyze the spectral norm of $\mathbf{W}_{1}$ carefully; see Lemma A.10, which is of independent interest. Third, to explore the number of informative eigenfunctions, we need to investigate the individual eigenvalues of $\mathbf{W}_{1}$ using Mehler’s formula (c.f. (A.32)); see Section B.3 for more details.

The paper is organized in the following way. The main results are described in Section 2 for the bandwidth $h\asymp p$ and in Section 3 for the bandwidth $h\asymp(\lambda+p)$, together with an adaptive bandwidth selection algorithm. The numerical studies are reported in Section 4. The paper ends with the discussion and conclusion in Section 5. The background and necessary results for the proof are listed in Section A, and the proofs of the main results are given in Sections B and C.

Conventions. We systematically use the following notations. For a smooth function $f(x)$, denote by $f^{(k)}(x)$, $k=0,1,2,\cdots$, the $k$-th derivative of $f(x)$. For two sequences $a_{n}$ and $b_{n}$ depending on $n$, the notation $a_{n}=O(b_{n})$ means that $|a_{n}|\leq C|b_{n}|$ for some constant $C>0$, and $a_{n}=o(b_{n})$ means that $|a_{n}|\leq c_{n}|b_{n}|$ for some positive sequence $c_{n}\downarrow 0$ as $n\rightarrow\infty$. We also write $a_{n}\asymp b_{n}$ if $a_{n}=O(b_{n})$ and $b_{n}=O(a_{n})$.

2. Main results (I): classic bandwidth choice $h\asymp p$

In this section, we state the main results regarding the spectra of $\mathbf{W}$ and $\mathbf{A}$ when the bandwidth satisfies $h\asymp p$. For definiteness, without loss of generality, we assume that $h=p$. Such a choice has appeared in many works on kernel methods and manifold learning, e.g., [27, 18, 22, 42, 28, 29]. Since our focus is the nonlinear interaction of the signal and noise in the kernel method, in what follows, we focus on reporting the results for $d=1$ and omit the subscripts in (1.8). We refer the readers to Remark 2.9 and Section A.7.2 below for a discussion of the setting when $d>1$.

2.1. Some definitions

We start by introducing the notion of stochastic domination [30]. It makes precise statements of the form “$\mathsf{X}^{(n)}$ is bounded with high probability by $\mathsf{Y}^{(n)}$ up to small powers of $n$”.

Definition 2.1 (Stochastic domination).

Let

\mathsf{X}=\big\{\mathsf{X}^{(n)}(u):n\in\mathbb{N},\ u\in\mathsf{U}^{(n)}\big\},\qquad \mathsf{Y}=\big\{\mathsf{Y}^{(n)}(u):n\in\mathbb{N},\ u\in\mathsf{U}^{(n)}\big\},

be two families of nonnegative random variables, where $\mathsf{U}^{(n)}$ is a possibly $n$-dependent parameter set. We say that $\mathsf{X}$ is stochastically dominated by $\mathsf{Y}$, uniformly in the parameter $u$, if for all small $\upsilon>0$ and large $D>0$, there exists $n_{0}(\upsilon,D)\in\mathbb{N}$ so that we have

\sup_{u\in\mathsf{U}^{(n)}}\mathbb{P}\Big(\mathsf{X}^{(n)}(u)>n^{\upsilon}\mathsf{Y}^{(n)}(u)\Big)\leq n^{-D},

for sufficiently large $n\geq n_{0}(\upsilon,D)$. We interchangeably use the notation $\mathsf{X}=O_{\prec}(\mathsf{Y})$ or $\mathsf{X}\prec\mathsf{Y}$ if $\mathsf{X}$ is stochastically dominated by $\mathsf{Y}$, uniformly in $u$, when there is no danger of confusion. In addition, we say that an $n$-dependent event $\Omega\equiv\Omega(n)$ holds with high probability if for a $D>1$, there exists $n_{0}=n_{0}(D)>0$ so that

\mathbb{P}(\Omega)\geq 1-n^{-D},

when $n\geq n_{0}$.

For any constant $c\in\mathbb{R}$, denote by $T_{c}$ the shift operator that shifts a probability measure $\mu$ defined on $\mathbb{R}$ by $c$; that is,

(2.1) T_{c}\mu(I):=\mu(I-c),

where $I\subset\mathbb{R}$ is a measurable subset. Throughout the rest of the paper, for any symmetric $n\times n$ matrix $\mathbf{B}$, the eigenvalues are ordered in decreasing fashion so that $\lambda_{1}(\mathbf{B})\geq\lambda_{2}(\mathbf{B})\geq\cdots\geq\lambda_{n}(\mathbf{B})$.

Definition 2.2.

For a given probability measure $\mu$ and $n\in\mathbb{N}$, define the $j$-th typical location of $\mu$ as $\gamma_{\mu}(j)$; that is,

(2.2) \int_{\gamma_{\mu}(j)}^{\infty}\mu(\mathrm{d}x)=\frac{j}{n}\,,

where $j=1,\ldots,n$. The quantity $\gamma_{\mu}(j)$ is also called the $j/n$-quantile of $\mu$.

Let $\mathbf{Y}\in\mathbb{R}^{p\times n}$ be the data matrix associated with the point cloud $\mathcal{Y}$; that is, the $i$-th column of $\mathbf{Y}$ is $\mathbf{y}_{i}$. Denote the empirical spectral distribution (ESD) of $\mathbf{Q}=\frac{\sigma^{2}}{p}\mathbf{Y}^{\top}\mathbf{Y}$ as

\mu_{\mathbf{Q}}(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{\lambda_{i}(\mathbf{Q})\leq x\}},\ x\in\mathbb{R}.

Here, $\sigma>0$ stands for the standard deviation of the scaled noise $\sigma\mathbf{y}_{i}$. It is well known that in the high-dimensional region (1.7), $\mu_{\mathbf{Q}}$ has the same asymptotic properties [40] as the so-called Marchenko-Pastur (MP) law [45], denoted as $\mu_{c_{n},\sigma^{2}}$, satisfying

(2.3) \mu_{c_{n},\sigma^{2}}(I)=(1-c_{n})_{+}\chi_{I}(0)+\zeta_{c_{n},\sigma^{2}}(I)\,,

where $\chi_{I}$ is the indicator function, $(a)_{+}:=0$ when $a\leq 0$ and $(a)_{+}:=a$ when $a>0$,

(2.4) \mathrm{d}\zeta_{c_{n},\sigma^{2}}(x)=\frac{1}{2\pi\sigma^{2}}\frac{\sqrt{(\lambda_{+}-x)(x-\lambda_{-})}}{c_{n}x}\mathrm{d}x\,,

$\lambda_{+}=(1+\sigma^{2}\sqrt{c_{n}})^{2}$ and $\lambda_{-}=(1-\sigma^{2}\sqrt{c_{n}})^{2}$.

Denote

(2.5) \tau\equiv\tau(\lambda):=2\left(\frac{\lambda}{p}+1\right),

and for any kernel function $f(x)$, define

(2.6) \varsigma\equiv\varsigma(\lambda):=f(0)+2f^{\prime}(\tau)-f(\tau)\,.

Till the end of this paper, we will use $\tau$ and $\varsigma$ for simplicity, unless we need to specify the values of $\tau$ and $\varsigma$ at certain points of $\lambda$. As mentioned below (1.1), we focus on $f(x)=\exp(-\upsilon x)$ throughout the paper unless otherwise specified. Recall the shift operator defined in (2.1). Denote

(2.7) \nu_{\lambda}:=T_{\varsigma(\lambda)}\mu_{c_{n},-2f^{\prime}(\tau(\lambda))},

where $\mu_{c_{n},-2f^{\prime}(\tau(\lambda))}$ is the MP law defined in (2.3) with $\sigma^{2}$ replaced by $-2f^{\prime}(\tau(\lambda))$.
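
As a purely numerical companion to (2.3)-(2.7), the following sketch (Python/NumPy; the function name and grid size are ours) computes the typical locations $\gamma_{\nu_{\lambda}}(j)$ of Definition 2.2 for $f(x)=\exp(-\upsilon x)$, for which $-2f^{\prime}(\tau)=2\upsilon\exp(-\upsilon\tau)$ and $\varsigma=1-(1+2\upsilon)\exp(-\upsilon\tau)$. For simplicity, the possible point mass at $0$ in (2.3) is ignored and the continuous part is renormalized; this is a simplification rather than the paper's exact convention.

import numpy as np

def typical_locations(n, p, lam, upsilon=1.0, grid_size=20000):
    """Approximate typical locations gamma_{nu_lambda}(j), j = 1..n, of (2.7).

    Follows (2.3)-(2.7) with f(x) = exp(-upsilon x); the atom at 0 is ignored."""
    c_n = n / p
    tau = 2.0 * (lam / p + 1.0)                                   # (2.5)
    sigma2 = 2.0 * upsilon * np.exp(-upsilon * tau)               # -2 f'(tau)
    shift = 1.0 - (1.0 + 2.0 * upsilon) * np.exp(-upsilon * tau)  # varsigma in (2.6)

    # Continuous MP density (2.4) on [lambda_-, lambda_+], using the paper's convention.
    lam_m = (1.0 - sigma2 * np.sqrt(c_n)) ** 2
    lam_p = (1.0 + sigma2 * np.sqrt(c_n)) ** 2
    x = np.linspace(max(lam_m, 1e-12), lam_p, grid_size)
    dens = np.sqrt(np.maximum((lam_p - x) * (x - lam_m), 0.0)) / (2.0 * np.pi * sigma2 * c_n * x)

    cdf = np.cumsum(dens) * (x[1] - x[0])
    cdf /= cdf[-1]                                                # renormalize the continuous part
    j = np.arange(1, n + 1)
    gamma = np.interp(1.0 - j / n, cdf, x)                        # gamma(j): upper j/n-quantile (2.2)
    return gamma + shift                                          # apply the shift T_varsigma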

2.2. Spectrum of kernel affinity matrices: low signal-to-noise region $0\leq\alpha<1$

In this subsection, we present the results when the SNR is small in the sense that $0\leq\alpha<1$ in Theorems 2.3 and 2.5. In this setting, there does not exist a natural connection between the spectrum of $\mathbf{W}$ and that of the signal part $\mathbf{W}_{1}$ in (1.9). More specifically, even though the spectrum of $\mathbf{W}$ can be studied, $\|\mathbf{W}-\mathbf{W}_{1}\|$ is not close to zero.

In the first result, we consider the case when $\lambda$ is bounded from above; that is, $\alpha=0$. In this bounded region, as in the null case studied in [27, 18] (see Lemma A.9 for more details), the spectrum is governed by the MP law, except for a few outlying eigenvalues. These eigenvalues come either from the kernel function expansion, which will be detailed in the proof, or, when the signal is above some threshold, from the Gram matrix $\mathbf{Q}_{x}=\frac{1}{p}\mathbf{X}^{\top}\mathbf{X}$, where $\mathbf{X}\in\mathbb{R}^{p\times n}$ is the data matrix associated with the noisy observations $\mathbf{x}_{i}$ defined in (1.4). This result is not surprising, since the signal is asymptotically negligible compared with the noise.

Theorem 2.3 (Bounded region).

Suppose (1.1) and (1.4)-(1.8) hold true, $d=1$ and $h=p$. Moreover, we assume that $\lambda$ is a fixed constant. Set

\mathsf{S}:=\begin{cases}3,&\lambda\leq\sqrt{c_{n}};\\ 4,&\lambda>\sqrt{c_{n}}.\end{cases}

For any given small $\epsilon>0$, when $n$ is sufficiently large, for some constant $C>0$, with probability at least $1-O(n^{-1/2})$, we have

(2.8) \left|\lambda_{i}(\mathbf{W})-\gamma_{\nu_{0}}(i)\right|\leq Cn^{-1/4},\ \mathsf{S}<i\leq(1-\epsilon)n,

where $\nu_{0}$ is defined in (2.7).

Remark 2.4.

In Theorem 2.3, we focus on reporting the bulk eigenvalues of $\mathbf{W}$. In this case, the outlying eigenvalues mainly come from the kernel function expansion, which we call the “kernel effect” hereafter, and the resulting Gram matrix. For example, as we will see in the proofs in Section B, we have that $\lambda_{1}(\mathbf{W})=n\exp(-\tau\upsilon)+o_{\prec}(1)$ and $\lambda_{2}(\mathbf{W})=\|\mathrm{Sh}_{1}(\tau)+\mathrm{Sh}_{2}(\tau)\|+o_{\prec}(1)$, where $\mathrm{Sh}_{1}(\tau)$ and $\mathrm{Sh}_{2}(\tau)$ are defined in (A.20). Moreover, we mention that Theorem 2.3 holds for a more general kernel function as in [27, 18]; that is, (1.1) is replaced by

(2.9) \mathbf{W}(i,j)=f\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}}{h}\right),\ 1\leq i,j\leq n,

where $f\in C^{3}(\mathbb{R})$ is monotonically decreasing, bounded, and $f(2)>0$. Moreover, we remark that in (2.8) we can relax $\epsilon>0$ to $\epsilon=0$ under the additional assumption that $|c_{n}-1|\geq\tau$ for some constant $\tau>0$, which is a standard assumption in the random matrix theory literature guaranteeing that the smallest eigenvalue of the Gram matrix is bounded from below; for instance, see the monograph [31]. Finally, we mention that in the pure noise setting, i.e., $\lambda=0$, the asymptotics of the spectrum of $\mathbf{W}$ (equivalently, $\mathbf{W}_{y}$ in this setting) has been established in [18] using a different approach, where the results hold for $\epsilon=0$ regardless of the ratio $n/p$. However, the bound in [18] is weaker than what we show here; that is, the rate is $n^{-1/9}$ when $i$ is bounded, and the rate $n^{-1/4}$ only appears when $i\asymp n$.
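
The bulk approximation (2.8) can be examined numerically along the following lines. This is a hedged sketch (Python/NumPy); all names, as well as the use of an independent Gaussian Gram matrix as a Monte Carlo proxy for the typical locations $\gamma_{\nu_{0}}(i)$, are our choices and not the procedure used in the proofs.

import numpy as np

def bulk_check(n=300, p=300, lam=1.0, upsilon=0.5, seed=0):
    """Compare the bulk eigenvalues of W (bounded region, h = p) with a proxy
    for gamma_{nu_0}(i): the ordered eigenvalues of varsigma I - 2 f'(tau) G^T G / p,
    built from an independent Gaussian matrix G."""
    rng = np.random.default_rng(seed)
    Z = np.zeros((n, p))
    Z[:, 0] = np.sqrt(lam) * rng.standard_normal(n)    # clean signal, d = 1
    X = Z + rng.standard_normal((n, p))                 # noisy observations (1.4)

    s = (X ** 2).sum(axis=1)
    D2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)
    eig_W = np.sort(np.linalg.eigvalsh(np.exp(-upsilon * D2 / p)))[::-1]

    tau = 2.0 * (lam / p + 1.0)                                   # (2.5)
    sigma2 = 2.0 * upsilon * np.exp(-upsilon * tau)               # -2 f'(tau)
    shift = 1.0 - (1.0 + 2.0 * upsilon) * np.exp(-upsilon * tau)  # varsigma in (2.6)
    G = rng.standard_normal((p, n))
    proxy = np.sort(sigma2 * np.linalg.eigvalsh(G.T @ G / p))[::-1] + shift

    S, k = 4, int(0.9 * n)                              # skip outliers, look at the bulk
    return np.max(np.abs(eig_W[S:k] - proxy[S:k]))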

In the second result, we study the case when the signal strength $\lambda$ diverges with $n$, but slowly; that is, $\lambda=\lambda(n)\asymp n^{\alpha}$, where $0<\alpha<1$. In this region, since $\alpha<1$, the signal is still weaker than the noise, and again it is asymptotically negligible. Thus, it is not surprising to see that the noise dominates and the spectral distribution is almost like the MP law, except for the first few eigenvalues. Again, as in the bounded region, these finitely many outlying eigenvalues come from the interaction of the nonlinear kernel and the Gram matrix.

Theorem 2.5 (Slowly divergent region).

Suppose (1.1) and (1.4)-(1.8) hold true, $d=1$ and $h=p$. For any given small $\epsilon>0$, when $n$ is sufficiently large, the following holds with probability at least $1-O(n^{-1/2})$:

  1. (1)

    When $0<\alpha<0.5-\epsilon$, for some constant $C>0$, we have that

    (2.10) |\lambda_{i}(\mathbf{W})-\gamma_{\nu_{0}}(i)|\leq C\max\left\{n^{-1/4},n^{\epsilon}\frac{\lambda}{\sqrt{n}}\right\},\ 4<i\leq(1-\epsilon)n.
  2. (2)

    When $0.5-\epsilon\leq\alpha<1$, denote

    (2.11) \mathfrak{d}\equiv\mathfrak{d}(\alpha):=\left\lceil\frac{1}{1-\alpha}\right\rceil+1.

    For some constant $C>0$, there exists some integer $\mathsf{K}$ satisfying

    4\leq\mathsf{K}\leq C4^{\mathfrak{d}},

    so that with high probability, for all $\mathsf{K}<i\leq(1-\epsilon)n$, we have that

    (2.12) |\lambda_{i}(\mathbf{W})-\gamma_{\nu_{0}}(i)|\leq C\max\left\{p^{\mathcal{B}(\alpha)},\frac{\lambda}{p}\right\},

    where $\mathcal{B}(\alpha)<0$ is defined as

    (2.13) \mathcal{B}(\alpha)=(\alpha-1)\left\lceil\frac{1}{1-\alpha}\right\rceil+\alpha\,.

As in Theorem 2.3, since the outlying eigenvalues are impacted by both the kernel effect and the signal, we focus on reporting the bulk eigenvalues. Similar to the discussion in Remark 2.4, the outlying eigenvalues can be figured out from the proof in Section B. Moreover, the number of outlying eigenvalues is adaptive to $\alpha$. As can be seen from Theorem 2.5, for any small $\epsilon>0$, when $0<\alpha<0.5-\epsilon$, the results are similar to (2) of Theorem 2.3 except for the convergence rates. When $\alpha\geq 0.5-\epsilon$, there will be more, but still finitely many, outlying eigenvalues, which arise from a higher order expansion in the proof. Finally, we mention that Theorem 2.5 holds for a more general kernel function like that in (2.9).

We remark that when $\alpha<1$ and $\lambda>\sqrt{c_{n}}$, the non-bulk eigenvalues (i.e., outlying eigenvalues) “seem” to be useful for understanding the signal. For example, we may potentially utilize the number of outliers to determine whether the signal strength is stronger than $\sqrt{c_{n}}$. However, to the best of our knowledge, there is limited literature on utilizing this information via the GL, since $\mathbf{W}$ ($\mathbf{A}$ respectively) is not close to $\mathbf{W}_{1}$ ($\mathbf{A}_{1}$ respectively) in this region, except for some relevant work in [28] when $1/2<\alpha<1$. Recall that some of the non-bulk eigenvalues are generated by the kernel effect instead of the signal. As commented in Remark 2.4, while it is true that when $\lambda$ is bounded the kernel effect can be quantified, when $\lambda$ diverges, we lose the ability to distinguish these outliers. In particular, when $1/2\leq\alpha<1$, although we show in (2) of Theorem 2.5 that the number of non-bulk eigenvalues is finite and depends on the signal strength, we do not know the exact number of outliers or the relation between the signal and the outliers. Therefore, it is challenging to recover the signal using this information. For example, when the signal is supported on a linear subspace so that there are multiple spikes, to our knowledge it is challenging to estimate the dimension via the GL.

We emphasize that the GL approach is very different from PCA. Note that for the $d$-spiked model, only the $d$ largest eigenvalues can possibly detach from the bulk and be located when PCA is applied. However, for the GL approach, even a single spike can lead to multiple outlying eigenvalues, as in (2) of Theorem 2.5. This shows that even for a simple $1$-dim linear manifold that is realized as a $1$-dim linear subspace in $\mathbb{R}^{p}$, the nonlinear method via the GL is very different from the linear method via PCA. More details on this aspect can be found in Section 5. The problem becomes more challenging if the underlying structure is nonlinear, even under the assumption that the manifold model can be reduced to the low rank spiked model of Section A.1 that we focus on in this paper.

Finally, we point out that when $1/2<\alpha<1$, although it is challenging to obtain precise information about the clean signal counterpart from a direct application of the standard GL via $\mathbf{W}$ (c.f. Theorem 2.5) or $\mathbf{A}$ (c.f. Corollary 2.10), we could consider a variant of the GL via the transition matrix $\mathbf{A}$ obtained by zeroing out the diagonal elements of $\mathbf{W}$, as proposed in [29]. It has been shown in [29] that the zeroing-out strategy could help the analysis of noisy datasets. We refer the readers to Appendix A.7.1 for more details.

Remark 2.6.

When $\alpha<1$, although it is still unclear how to use the bulk eigenvalues of the noisy observation to extract information about the clean signal counterpart, this is a starting point for detecting the existence of a strong signal (i.e., $\alpha\geq 1$). For example, when $\alpha<1$, Theorems 2.3 and 2.5 demonstrate that most of the eigenvalues follow the MP law, so that except for a finite number of outliers, two consecutive eigenvalues should be close to each other within a distance of $o(1)$. Under the alternative (i.e., $\alpha\geq 1$), since the noisy GL is close to the clean GL as shown in Section 2.3 below, two consecutive eigenvalues can be separated by a distance of constant order.

2.3. Spectrum of kernel affinity matrices: high signal-to-noise region $\alpha\geq 1$

In this subsection, we present results showing that the spectra of $\mathbf{W}$ and $\mathbf{W}_{1}$ in (1.9) can be connected after proper scaling when the SNR is relatively large, i.e., $\alpha\geq 1$. We first prepare some notation. Denote

(2.14) \mathbf{W}_{a_{1}}:=\exp(-2\upsilon)\mathbf{W}_{1}+(1-\exp(-2\upsilon))\mathbf{I}_{n}.

Clearly, $\mathbf{W}_{a_{1}}$ is closely related to $\mathbf{W}_{1}$ via a scaling and an isotropic shift, and $\mathbf{W}_{1}$ contains only the signal information. On the other hand, note that $\mathbf{W}_{a_{1}}=[\exp(-2\upsilon)\bm{1}\bm{1}^{\top}+(1-\exp(-2\upsilon))\mathbf{I}_{n}]\circ\mathbf{W}_{1}$, where the matrix $\exp(-2\upsilon)\bm{1}\bm{1}^{\top}+(1-\exp(-2\upsilon))\mathbf{I}_{n}$ comes from the first order Taylor expansion of $\mathbf{W}_{y}$. We introduce another affinity matrix that will be used when $\alpha$ is so large that the bandwidth $h\asymp p$ is relatively small compared with the signal strength. Define $\mathbf{W}_{c}\in\mathbb{R}^{n\times n}$ by

(2.15) \mathbf{W}_{c}(i,j)=\exp\left(-2\upsilon\frac{(\mathbf{z}_{i}-\mathbf{z}_{j})^{\top}(\mathbf{y}_{i}-\mathbf{y}_{j})}{h}\right),

and

(2.16) \widetilde{\mathbf{W}}_{a_{1}}:=\mathbf{W}_{a_{1}}\circ\mathbf{W}_{c}.

Note that $\widetilde{\mathbf{W}}_{a_{1}}$ differs from $\mathbf{W}_{a_{1}}$ by the matrix $\mathbf{W}_{c}$; it will be used when $\alpha\geq 2$.
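
A minimal sketch of the surrogate matrices (2.14)-(2.16), assuming simulated signal and noise matrices Z and Y (whose rows are $\mathbf{z}_{i}$ and $\mathbf{y}_{i}$) and the bandwidth $h=p$; the function names are ours.

import numpy as np

def sqdist(M):
    """Pairwise squared Euclidean distances between the rows of M."""
    s = (M ** 2).sum(axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * M @ M.T, 0.0)

def surrogates(Z, Y, upsilon=1.0):
    """Return W_{a_1} of (2.14), W_c of (2.15) and tilde W_{a_1} of (2.16) with h = p."""
    n, p = Z.shape
    W1 = np.exp(-upsilon * sqdist(Z) / p)                        # clean affinity matrix (1.9)
    Wa1 = np.exp(-2.0 * upsilon) * W1 + (1.0 - np.exp(-2.0 * upsilon)) * np.eye(n)

    g = (Z * Y).sum(axis=1)                                      # z_i^T y_i
    G = Z @ Y.T                                                  # G[i, j] = z_i^T y_j
    cross = g[:, None] + g[None, :] - G - G.T                    # (z_i - z_j)^T (y_i - y_j)
    Wc = np.exp(-2.0 * upsilon * cross / p)                      # (2.15)
    return Wa1, Wc, Wa1 * Wc                                     # (2.16)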

Our main result for this SNR region is Theorem 2.7 below. For some constant $C\equiv C(\alpha)>0$, denote

(2.17) \mathsf{T}_{\alpha}:=\begin{cases}C\log n,&\alpha=1;\\ Cn^{\alpha-1},&1<\alpha<2.\end{cases}
Theorem 2.7.

Suppose (1.1) and (1.4)-(1.8) hold true, $d=1$ and $h=p$.

  1. (1)

    When $1\leq\alpha<2$, we have that

    (2.18) \left\|\frac{1}{n}\mathbf{W}-\frac{1}{n}\mathbf{W}_{a_{1}}\right\|\prec n^{-1/2}.

    Moreover, for some universal large constant $D>2$, we have that for $i\geq\mathsf{T}_{\alpha}$ in (2.17),

    (2.19) \left|\lambda_{i}(\mathbf{W}_{a_{1}})-(1-\exp(-2\upsilon))\right|\prec n^{-D}.
  2. (2)

    When $\alpha\geq 2$, we have that

    (2.20) \left\|\frac{1}{n}\mathbf{W}-\frac{1}{n}\widetilde{\mathbf{W}}_{a_{1}}\right\|\prec n^{-\alpha/2}+n^{-3/2}.

    Furthermore, when $\alpha$ is larger and satisfies

    (2.21) \alpha>\frac{2}{t}+1

    for a constant $t\in(0,1)$, we have that with probability $1-O(n^{1-t(\alpha-1)/2})$, for some constant $C>0$,

    (2.22) \left\|\widetilde{\mathbf{W}}_{a_{1}}-\mathbf{I}_{n}\right\|\leq Cn\exp(-\upsilon(\lambda/p)^{1-t}),

    and consequently

    (2.23) \left\|\mathbf{W}-\mathbf{I}_{n}\right\|\leq n\exp(-\upsilon(\lambda/p)^{1-t}).

The scaling $n^{-1}$ in (2.18) is commonly used in the manifold learning and machine learning literature [1, 12, 26, 41, 53, 63]. On one hand, (1) of Theorem 2.7 shows that once the SNR is “relatively large” ($1\leq\alpha<2$), we may access the spectrum of the clean affinity matrix $\mathbf{W}_{1}$ via the noisy affinity matrix $\mathbf{W}$, as described in (2.18), and the clean affinity matrix may contain useful information about the signal. In this case, the signal is strong enough to compete with the noise, so that we are able to recover the “top few” eigenvalues of the kernel matrix associated with the clean data via $\mathbf{W}_{a_{1}}$. In particular, (2.19) implies that we should focus on the top $\mathsf{T}_{\alpha}$ eigenvalues of $\mathbf{W}$, since the remaining eigenvalues are not informative. This coincides with what practitioners usually do in data analysis. Note that $\mathsf{T}_{\alpha}$ increases with $\alpha$, which fits our intuition, since the SNR becomes larger.

On the other hand, we find that the classic bandwidth choice $h\asymp p$ is not a good choice when the SNR is “too large” ($\alpha\geq 2$). First, (2) of Theorem 2.7 states that when $\alpha\geq 2$, since the bandwidth is too small compared with the signal strength, the noisy affinity matrix will be close to a mixture of signal and noise. In particular, when the signal is even stronger in the sense of (2.21), we will not be able to obtain information from the noisy affinity matrix. This can be understood as follows. Since the signal is far stronger than the noise, equivalently we could say that the signal is contaminated by “small” noise. However, since the bandwidth is set to $h=p$, which is “small” compared with the signal strength, the exponential decay of the kernel forces each point to be “blind” to the other points. In this case, $\mathbf{W}(i,j)$ is close to $0$ when $i\neq j$, and the affinity matrix $\mathbf{W}$ is close to the identity matrix, which leads to (2.23). As we will see in Theorem 3.1, all these issues can be addressed using a different bandwidth. For the readers’ convenience, in Figure 1, we use a simulation to summarize the phase transitions observed in Theorems 2.3, 2.5 and 2.7. For the numerical accuracy of the established theorems, we refer the readers to Section 4.1 below for more details.

Figure 1. An illustration of the phase transition phenomenon of the affinity matrix spectrum. The noise $\mathbf{y}_{i}$, $i=1,\ldots,n$, is i.i.d. sampled from $\mathcal{N}(0,I_{p})$ and the clean signal is one-dimensional with $\mathbf{z}_{i1}$ i.i.d. sampled from $\mathcal{N}(0,\lambda)$, where $\lambda>0$. We take $n=p=300$. The kernel is $f(x)=\exp(-x/2)$ and the bandwidth is $h=p$. For each $\lambda$, the 300 eigenvalues are plotted in descending order. In the bounded and slowly divergent regions, the second to the 300-th eigenvalues are zoomed in for better visualization.
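
A sketch along the following lines (Python with NumPy and Matplotlib; the representative values of $\lambda$ for each region and all names are our choices) reproduces the type of experiment summarized in Figure 1 with $n=p=300$, $f(x)=\exp(-x/2)$ and $h=p$.

import numpy as np
import matplotlib.pyplot as plt

n = p = 300
upsilon = 0.5                                    # f(x) = exp(-x / 2)
rng = np.random.default_rng(0)

def affinity_spectrum(lam):
    """Eigenvalues (descending) of W built from x_i = z_i + y_i with d = 1 and h = p."""
    Z = np.zeros((n, p))
    Z[:, 0] = np.sqrt(lam) * rng.standard_normal(n)
    X = Z + rng.standard_normal((n, p))
    s = (X ** 2).sum(axis=1)
    D2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)
    return np.sort(np.linalg.eigvalsh(np.exp(-upsilon * D2 / p)))[::-1]

# One representative lambda per SNR region.
for lam, label in [(1.0, "bounded"), (n ** 0.6, "slowly divergent"),
                   (float(n), "alpha = 1"), (float(n) ** 2, "alpha = 2")]:
    plt.plot(affinity_spectrum(lam), label=label)
plt.legend(); plt.xlabel("index"); plt.ylabel("eigenvalue"); plt.show()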

Technically, in the previous results when $\alpha<1$, the kernel function $f(x)$ only contributes to the measure $\nu_{0}$ via (2.7), and its decay rate does not play a role in the conclusions. However, once the signal becomes stronger, the kernel decay rate plays an essential role. We focus on $f(x)=\exp(-\upsilon x)$ to simplify the discussion in this paper, and postpone the discussion of general kernels to our future work.

Remark 2.8.

When $1\leq\alpha<2$ in Theorem 2.7, we have shown that besides the top $\mathsf{T}_{\alpha}$ eigenvalues of $\mathbf{W}_{a_{1}}$, the remaining eigenvalues of $\mathbf{W}_{a_{1}}$ are trivial. Since the error bound in (2.19) is much smaller than the one in (2.18), the smaller eigenvalues of $\mathbf{W}$ may fluctuate more and have a non-trivial ESD. The ESD of $\mathbf{W}$ is best formulated using the Stieltjes transform [2]. Here we provide a remark on the approximation of the Stieltjes transforms. Denote $\mathbf{W}_{b_{1}}$ as

(2.24) \mathbf{W}_{b_{1}}:=\Big(\frac{2\upsilon\exp(-2\upsilon)}{p}\mathbf{Y}^{\top}\mathbf{Y}+2\upsilon\exp(-4\upsilon)\mathbf{I}_{n}\Big)\circ\mathbf{W}_{1}\,.

Note that compared with $\mathbf{W}_{a_{1}}$, the matrix $\frac{2\upsilon\exp(-2\upsilon)}{p}\mathbf{Y}^{\top}\mathbf{Y}+2\upsilon\exp(-4\upsilon)\mathbf{I}_{n}$ in $\mathbf{W}_{b_{1}}$ comes from a higher order Taylor expansion of $\mathbf{W}_{y}$. Let $m_{\mathbf{W}}(z)$ and $m_{\mathbf{W}_{b_{1}}}(z)$ be the Stieltjes transforms of $\mathbf{W}$ and $\mathbf{W}_{b_{1}}$, respectively. In Section D, we show that

(2.25) \sup_{z\in\mathcal{D}}|m_{\mathbf{W}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|\prec\frac{1}{n^{1-\alpha/2}\eta^{2}}\,,

where $\mathcal{D}$ is the domain of spectral parameters defined as

(2.26) \mathcal{D}:=\mathcal{D}(1/4,\mathsf{a}):=\Big\{z=E+\mathrm{i}\eta:\mathsf{a}\leq E\leq\frac{1}{\mathsf{a}},\ n^{-1/2+\alpha/4+\mathsf{a}}\leq\eta\leq\frac{1}{\mathsf{a}}\Big\},

and $0<\mathsf{a}<1$ is some fixed (small) constant. This result helps us further peek into the intricate relationship between the clean and noisy affinity matrices. However, it does not provide information about each individual eigenvalue.

Remark 2.9.

In the above theorems, we focus on reporting the results for the case $d=1$ in (1.4). We now discuss how to generalize our results to $d>1$. There are two major cases: the first is when all signal strengths are in the same SNR region, and the second is when the signal strengths might be in different SNR regions. We start with the first case, for which there are four regions to discuss.

  1. (1)

    When all $\alpha_{i}$, $1\leq i\leq d$, are very large in the sense that they satisfy the condition in (2) of Theorem 2.7, following the same argument, we can immediately conclude that the results stated in (2) of Theorem 2.7 still hold true by setting $\alpha:=\max_{i}\alpha_{i}$ in (2.21).

  2. (2)

    When $\alpha_{i}>1$, $1\leq i\leq d$, satisfy condition (1) of Theorem 2.7, the results (2.18) and (2.20) of Theorem 2.7 hold with some changes. Indeed, $n^{-\alpha/2}$ in (2.20) should be replaced by $\sum_{l=1}^{d}n^{-\alpha_{l}/2}$ and $Cn^{\alpha-1}$ in (2.17) should be replaced by $Cn^{\min_{l}\alpha_{l}-1}$. Moreover, in this setup, since the noisy affinity matrix will be close to a matrix depending on the clean affinity matrix $\mathbf{W}_{1}$, where $\mathbf{W}_{1}(i,j)=\exp\left(-\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{h}\right)$, which in general does not follow the MP law, the spectrum of $\mathbf{W}$ varies according to the specific values of $\lambda_{i}$, $1\leq i\leq d$. More discussion of this setup with $d=2$, together with simulations, can be found in Section A.7.2.

  3. (3)

    When $\alpha_{i}=1$, $i=1,\ldots,d$, this is the region where our argument cannot be directly applied and we need a substantial generalization of the proof. In particular, our proof essentially relies on Mehler’s formula in Section A.6, which, to our knowledge, has only been proved for $d=1$. Nevertheless, we believe it is possible to generalize this formula to $d>1$ following the arguments of [33].

  4. (4)

    When $0\leq\alpha_{i}<1$, $1\leq i\leq d$, we could directly generalize the result to $d>1$ with all the key ingredients, i.e., Lemmas A.4–A.9, in Section A.5. Specifically, to extend Theorem 2.3 concerning $\alpha_{i}=0$, $1\leq i\leq d$, denote $r=0$ if $\lambda_{i}\leq\sqrt{c_{n}}$ for all $1\leq i\leq d$, $r=d$ if $\lambda_{i}>\sqrt{c_{n}}$ for all $1\leq i\leq d$, or $1\leq r<d$ if $\lambda_{1}\geq\cdots\geq\lambda_{r}>\sqrt{c_{n}}\geq\lambda_{r+1}\geq\cdots\geq\lambda_{d}$. The proof of Theorem 2.3 still holds by updating $\mathsf{S}=3+r$. In fact, according to (B.13), the $\mathsf{O}$ matrix therein is of rank three and the Gram matrix $\mathbf{X}^{\top}\mathbf{X}$ can generate $r$ outliers according to Lemma A.6. Theorem 2.5 still holds when $d>1$. Following its proof, when $0<\alpha_{i}<0.5-\epsilon$ for all $1\leq i\leq d$, the first part of the theorem still holds when $i>d+r$, and the rate $\lambda/\sqrt{n}$ should be replaced by $\sum_{l=1}^{d}\lambda_{l}/\sqrt{n}$. In fact, in this setting, all $\lambda_{i}>\sqrt{c_{n}}$ and each will generate an outlier. Moreover, when all $\alpha_{i}\geq 0.5-\epsilon$, by replacing (2.11) with $\sum_{l=1}^{d}\mathfrak{d}_{l}$, where $\mathfrak{d}_{l}=\left\lceil\frac{1}{1-\alpha_{l}}\right\rceil+1$, we conclude that the second part of the theorem still holds true by replacing $\lambda/p$ with $\sum_{l=1}^{d}\lambda_{l}/p$ and $p^{\mathcal{B}(\alpha)}$ with $\sum_{l=1}^{d}p^{\mathcal{B}(\alpha_{l})}$. We emphasize that in the above settings, since $\tau$ defined in (2.5) satisfies $\tau\rightarrow 2$ as $n\rightarrow\infty$, the bulk eigenvalues can be characterized by the same MP law as in (2.8) and (2.12). However, the number of outliers may depend on the signal strengths $\lambda_{i}$, $1\leq i\leq d$. See Section A.7.2 for more discussion and a simulation with $d=2$.

The second case includes various combinations, some of which might be challenging. We discuss only one setup, assuming the signal strengths fall in two different regions; that is, there exists $1<r<d$ such that

\alpha_{1}\geq\alpha_{2}\geq\cdots\geq\alpha_{r}\geq 1>\alpha_{r+1}\geq\cdots\geq\alpha_{d}\,.

Theorem 2.7 still holds by replacing $\alpha$ by $\max_{i}\alpha_{i}$ and $\mathbf{W}_{1}$ in (2.14) by $\mathsf{W}_{1}$, where $\bm{z}_{i}=({\mathbf{z}}_{i1},\cdots,{\mathbf{z}}_{ir},0,\cdots,0)$ and $\mathsf{W}_{1}(i,j)=\exp\left(-\frac{\|\bm{z}_{i}-\bm{z}_{j}\|_{2}^{2}}{h}\right)$; i.e., the clean affinity matrix is defined using only those components with large SNRs. That is to say, the spectrum of $\mathbf{W}$ for the value of $d$ is close to that of the clean affinity matrix for the value of $r$ with the same signals $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{r}$. The detailed statements and proofs are similar to the case $d=1$, except for extra notational complication. We defer more discussion of this setup with $d=2$ to Section A.7.2.

In short, there are still several open challenges when $d>1$, and we will explore them in our future work.

2.4. Spectrum of transition matrices

In this subsection, we state the main results for the transition matrix $\mathbf{A}$ defined in (1.2). Even though there is an extra normalization step by the degree matrix $\mathbf{D}$, we will see that most of the spectral study of $\mathbf{A}$ boils down to that of $\mathbf{W}$. In what follows, we provide the counterparts of the results in Sections 2.2 and 2.3 for $\mathbf{A}$. Similar discussions to those in Remarks 2.8 and 2.9 also hold.

For $\mathbf{W}_{a_{1}}$ defined in (2.14), denote $\mathbf{A}_{a_{1}}=\mathbf{D}_{a_{1}}^{-1}\mathbf{W}_{a_{1}}$, in analogy with the definition (1.2). Similarly, for $\widetilde{\mathbf{W}}_{a_{1}}$ defined in (2.16), denote $\widetilde{\mathbf{A}}_{a_{1}}=\widetilde{\mathbf{D}}_{a_{1}}^{-1}\widetilde{\mathbf{W}}_{a_{1}}$.

Corollary 2.10.

Suppose (1.1)-(1.8) hold true, $d=1$, $h=p$ and $f(x)=\exp(-\upsilon x)$. When $0\leq\alpha<1$, the results of Theorems 2.3 and 2.5 hold for the eigenvalues of $n\mathbf{A}$ by replacing $\mathbf{W}$ with $n\mathbf{A}$ and the measure $\nu_{0}$ with $\check{\nu}_{0}=T_{\varsigma(0)}\mu_{c_{n},-2f^{\prime}(\tau(0))/(f(\tau(\lambda)))}$. When $1\leq\alpha<2$, for (1) of Theorem 2.7, the counterpart of (2.18) reads as

\|\mathbf{A}-\mathbf{A}_{a_{1}}\|\prec n^{\frac{\alpha-2}{2}},

and the counterpart of (2.19) is

\lambda_{i}(\mathbf{A}_{a_{1}})\prec n^{\frac{\alpha-3}{2}}.

Moreover, when $\alpha\geq 2$, for (2) of Theorem 2.7, the counterpart of (2.20) reads as

\|\mathbf{A}-\widetilde{\mathbf{A}}_{a_{1}}\|\prec n^{-\frac{1}{2}}.

The remaining parts hold by replacing $\widetilde{\mathbf{W}}_{a_{1}}$ and $\mathbf{W}$ with $\widetilde{\mathbf{A}}_{a_{1}}$ and $\mathbf{A}$, respectively.

3. Main results (II): a different bandwidth choice $h\asymp(p+\lambda)$

As discussed after Theorem 2.7, when the SNR is large, the classic bandwidth choice $h\asymp p$ is too small compared with the signal. For example, according to (2) of Theorem 2.7, we cannot obtain any information about the clean signal when $\alpha\geq 2$ if $h\asymp p$. To address this issue, we consider a different bandwidth $h\asymp(p+\lambda)$, where $\lambda$ is the signal strength. We show that this signal-dependent bandwidth results in a meaningful spectral convergence result, which is stated in Theorem 3.1. As in Section 2, we focus on the setting $d=1$ and set $\lambda:=\lambda_{1}$. The discussion for the setting $d>1$ is similar to that in Remark 2.9; we only state the differences here. When $0\leq\alpha_{i}\leq 1$, $i=1,\ldots,d$, we have $h\asymp\sum_{l=1}^{d}\lambda_{l}+p\asymp p$, so all the arguments in (3) and (4) of Remark 2.9 directly apply. When $\alpha_{i}>1$, $1\leq i\leq d$, which corresponds to the strong signal cases (1) and (2) in Remark 2.9, following the proof of (3.5) below, a similar argument to case (2) of Remark 2.9 still applies. For definiteness, below we state our results for the spectra of $\mathbf{W}$ and $\mathbf{A}$ assuming $h=p+\lambda$ in Section 3.1. Since $\lambda$ is usually unknown in practice, in Section 3.2 we propose a bandwidth selection algorithm for practical implementation. With this algorithm, even without knowledge of the signal strength, we can still obtain meaningful spectral results; see Corollary 3.2.

3.1. Spectra of affinity and transition matrices

In this subsection, we state the results for the spectra of $\mathbf{W}$ and $\mathbf{A}$ when $h=\lambda+p$. Denote

(3.1) \mathbf{W}_{a_{2}}=\exp\left(-\frac{2p\upsilon}{h}\right)\mathbf{W}_{1}+\left(1-\exp\left(-\frac{2p\upsilon}{h}\right)\right)\mathbf{I}_{n},

where $\mathbf{W}_{1}$ is constructed using the bandwidth $h=\lambda+p$.

Theorem 3.1.

Suppose (1.1)-(1.8) hold true, $d=1$ and $h=\lambda+p$. The following results hold.

  1. (1)

    When $0\leq\alpha<1$, Theorems 2.3 and 2.5 hold with $\nu_{0}$ replaced by

    \widetilde{\nu}_{0}=T_{\varsigma_{h}}\mu_{c_{n},\eta}\,,

    where

    (3.2) \eta:=\frac{2p\upsilon\exp(-2p\upsilon/h)}{h},\qquad \varsigma_{h}:=\varsigma_{h}(\tau):=1-\frac{2\upsilon p}{h}\exp(-\upsilon\tau p/h)-\exp(-\upsilon\tau p/h).
  2. (2)

    When $\alpha\geq 1$, we have that

    (3.3) \left\|\frac{1}{n}\mathbf{W}-\frac{1}{n}\mathbf{W}_{a_{2}}\right\|\prec n^{-1/2},

    and for some large constants $D>2$ and $C>0$, we have that when $i\geq C\log n$,

    (3.4) \left|\lambda_{i}(\mathbf{W}_{a_{2}})-(1-\exp(-2p\upsilon/(p+\lambda)))\right|\prec n^{-D}.

    Moreover, when $\alpha>1$, we have that

    (3.5) \left\|\frac{1}{n}\mathbf{W}-\frac{1}{n}\mathbf{W}_{1}\right\|\prec n^{-1/2}+n^{1-\alpha}.

Finally, similar results hold for the transition matrix $\mathbf{A}$ by replacing $\frac{1}{n}\mathbf{W}$, $\frac{1}{n}\mathbf{W}_{a_{2}}$ and $\frac{1}{n}\mathbf{W}_{1}$ in (3.3)-(3.5) by $\mathbf{A}$, $\mathbf{A}_{a_{2}}$ and $\mathbf{A}_{1}$ respectively, where $\mathbf{A}_{a_{2}}$ and $\mathbf{A}_{1}$ are defined by plugging $\mathbf{W}_{a_{2}}$ and $\mathbf{W}_{1}$ into (1.2).

With this bandwidth, we have addressed the issue encountered in (2) of Theorem 2.7. Specifically, according to Theorem 3.1, when $\alpha<1$, the noise dominates and $h=\lambda+p\asymp p$ does not lead to any essential difference in the spectrum compared with that from the fixed bandwidth $h=p$. On the other hand, when $\alpha\geq 1$, such a bandwidth choice contributes significantly to the spectrum. In particular, compared with (2.17), $\mathbf{W}_{a_{2}}$ only has $O(\log n)$ nontrivial eigenvalues. Combined with (3.3), by properly choosing the bandwidth, once the SNR is relatively large, the noisy affinity matrix $\mathbf{W}$ captures the spectrum of the clean affinity matrix $\mathbf{W}_{1}$ via $\mathbf{W}_{a_{2}}$. In addition, when $\alpha>1$, we can replace $\mathbf{W}_{a_{2}}$ with $\mathbf{W}_{1}$ directly. Finally, we mention that in the small SNR region $1/2<\alpha\leq 1$, modifying the transition matrix $\mathbf{A}$ by zeroing out the diagonal terms before normalization [29] could be useful. For more discussion, we refer the readers to Section A.7.1.
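
To see the effect numerically, a hedged sketch like the following (Python/NumPy; the choice $\alpha=2.5$ and all names are ours) contrasts the classic bandwidth $h=p$ with $h=p+\lambda$ for a strong signal: with $h=p$ the noisy $\mathbf{W}$ degenerates towards the identity (cf. (2.23)), while with $h=p+\lambda$ it stays close to $\mathbf{W}_{1}$ (cf. (3.5)).

import numpy as np

def sqdist(M):
    """Pairwise squared Euclidean distances between the rows of M."""
    s = (M ** 2).sum(axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * M @ M.T, 0.0)

def bandwidth_comparison(n=300, p=300, alpha=2.5, upsilon=1.0, seed=0):
    """Return n^{-1} ||W - W_1|| for h = p and for h = p + lambda when alpha > 1."""
    rng = np.random.default_rng(seed)
    lam = float(n) ** alpha
    Z = np.zeros((n, p))
    Z[:, 0] = np.sqrt(lam) * rng.standard_normal(n)
    X = Z + rng.standard_normal((n, p))
    out = {}
    for name, h in [("h = p", float(p)), ("h = p + lambda", p + lam)]:
        W = np.exp(-upsilon * sqdist(X) / h)
        W1 = np.exp(-upsilon * sqdist(Z) / h)          # clean affinity matrix with the same h
        out[name] = np.linalg.norm(W - W1, 2) / n
    return out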

3.2. An adaptive choice of bandwidth

While the above result connects the spectra of the noisy and clean affinity matrices, in general $\lambda$ is unknown. In this subsection, we provide an adaptive choice of $h$ depending on the dataset, without providing an estimator for $\lambda$. Such a choice enables us to recover the results of Theorem 3.1.

Given some constant $0<\omega<1$, we choose $h\equiv h(\omega)$ according to

(3.6) \int_{0}^{h}\mathrm{d}\mu_{\mathtt{dist}}=\omega\,,

where $\mu_{\mathtt{dist}}$ is the empirical distribution of the pairwise squared distances $\{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}\}$, $i\neq j$. In this subsection, when there is no danger of confusion, we abuse the notation and denote the affinity and transition matrices constructed using $h$ from (3.6) as $\mathbf{W}$ and $\mathbf{A}$ respectively, and define $\mathbf{W}_{1}$ in (1.9) and $\mathbf{W}_{a_{2}}$ in (3.1) with $h$ in (3.6).
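
In code, (3.6) amounts to taking the $\omega$-quantile of the off-diagonal pairwise squared distances; the following sketch (Python/NumPy; the function name is ours) makes this explicit.

import numpy as np

def quantile_bandwidth(X, omega):
    """Bandwidth h(omega) of (3.6): the omega-quantile of the off-diagonal
    pairwise squared distances ||x_i - x_j||^2; rows of X are the points x_i."""
    s = (X ** 2).sum(axis=1)
    D2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)
    off_diag = D2[~np.eye(X.shape[0], dtype=bool)]      # exclude i = j
    return np.quantile(off_diag, omega)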

As we show in the proof, when $0\leq\alpha<1$, $h\asymp p$ (see (C.1) in the proof), and when $\alpha\geq 1$, $h\asymp\lambda\asymp p+\lambda$ (see (C.2) in the proof). In this sense, the following corollary recovers the results of Theorem 3.1, while the choice of bandwidth is practical.

Corollary 3.2.

Suppose (1.1)-(1.8) hold true and $d=1$. For any $0<\omega<1$, let $h$ be the bandwidth chosen according to (3.6). Then Theorem 3.1 holds true.

Since Corollary 3.2 recovers the results of Theorem 3.1, the same comments after Theorem 3.1 and the discussions about manifolds in Subsection 3.3 hold for Corollary 3.2. We comment that in practice, researchers usually choose the 25th or 50th percentile of all pairwise distances, or the distances of nearest-neighbor pairs, as the bandwidth; see, for example, [56, 44]. Corollary 3.2 provides a theoretical justification for this commonly applied ad hoc bandwidth selection method.

Next, we discuss how to choose $\omega$ in practice. Based on the obtained theoretical results, when $\alpha\geq 1$, the outliers represent the signal information (c.f. (3.1)), except for those associated with the kernel effect. Thus, we propose Algorithm 1 to choose $\omega$ adaptively; it seeks a bandwidth such that the affinity matrix has the largest number of outliers.

Algorithm 1 Adaptive choice of ω\omega
  1. (1)

    Take fixed constants ωL<ωU,\omega_{L}<\omega_{U}, where 0<ωL,ωU<1,0<\omega_{L},\omega_{U}<1, e.g., ωL=0.05\omega_{L}=0.05 and ωU=0.95\omega_{U}=0.95. For some large integer TT, we construct a partition of the interval [ωL,ωU],[\omega_{L},\omega_{U}], denoted as 𝒫=(ω0,ω1,,ωT)\mathcal{P}=(\omega_{0},\omega_{1},\cdots,\omega_{T}), where ωi=ωL+iT(ωUωL)\omega_{i}=\omega_{L}+\frac{i}{T}(\omega_{U}-\omega_{L}).

  2. (2)

    For the sequence of quantiles {ωi}i=0T,\{\omega_{i}\}_{i=0}^{T}, calculate the associated bandwidths according to (3.6), denoted as {hi}i=0T.\{h_{i}\}_{i=0}^{T}.

  3. (3)

    For each 0iT,0\leq i\leq T, calculate the eigenvalues of the affinity matrix 𝐖i\mathbf{W}_{i} constructed using the bandwidth hi.h_{i}. Denote the eigenvalues of 𝐖i\mathbf{W}_{i} in decreasing order as {λk(i)}k=1n.\{\lambda_{k}^{(i)}\}_{k=1}^{n}.

  4. (4)

    For a given threshold 𝗌>0\mathsf{s}>0 satisfying 𝗌0\mathsf{s}\rightarrow 0 as n,n\rightarrow\infty, denote

    𝗄(ωi):=max1kn1{k|λk(i)λk+1(i)1+𝗌}.\mathsf{k}(\omega_{i}):=\max_{1\leq k\leq n-1}\left\{k\bigg{|}\,\frac{\lambda_{k}^{(i)}}{\lambda_{k+1}^{(i)}}\geq 1+\mathsf{s}\right\}.
  5. (5)

    Choose the quantile ω\omega such that

    (3.7) ω=max[argmaxωi𝗄(ωi)].\omega=\max\Big{[}\underset{\omega_{i}}{\mathrm{argmax}}\ \mathsf{k}(\omega_{i})\Big{]}\,.

Note that we need a threshold 𝗌\mathsf{s} in step 4 of Algorithm 1. We suggest adopting the resampling method established in [50, Section 4] and [19, Section 4.1]. This method provides a choice of 𝗌\mathsf{s} to distinguish the outlying eigenvalues from the bulk eigenvalues given the ratio cn=n/pc_{n}=n/p. The main rationale supporting this approach is that the bulk eigenvalues are close to each other (c.f. Remark 2.8), and hence the ratios of consecutive bulk eigenvalues will be close to one. Moreover, in step 5 of the algorithm, when there are multiple ωi\omega_{i} that achieve the argmax, we choose the largest one for the purpose of robustness. A minimal implementation sketch of Algorithm 1 is given below.
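The following minimal Python sketch (function names are ours and purely illustrative; it reuses the bandwidth_from_quantile sketch from Section 3.2) illustrates Algorithm 1 with the Gaussian kernel f(x)=exp(x/2)f(x)=\exp(-x/2). The threshold 𝗌\mathsf{s} is supplied by the user, e.g., from the resampling method of [50, 19], which we do not reproduce here.

```python
import numpy as np

def affinity_matrix(X, h, f=lambda x: np.exp(-x / 2.0)):
    """Kernel affinity matrix W(i, j) = f(||x_i - x_j||_2^2 / h), cf. (1.1)."""
    sq_norms = np.sum(X ** 2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return f(np.clip(D2, 0.0, None) / h)

def choose_omega(X, s, omega_L=0.05, omega_U=0.95, T=91):
    """Adaptive choice of omega following Algorithm 1.

    s : threshold separating outlying and bulk eigenvalues; in the paper it is
        chosen by the resampling method of [50, 19], not reproduced here.
    """
    omegas = omega_L + np.arange(T + 1) / T * (omega_U - omega_L)     # step (1)
    n_outliers = []
    for omega in omegas:
        h = bandwidth_from_quantile(X, omega)                         # step (2), cf. (3.6)
        evals = np.linalg.eigvalsh(affinity_matrix(X, h))[::-1]       # step (3), decreasing order
        ratios = evals[:-1] / evals[1:]                               # step (4)
        ks = np.flatnonzero(ratios >= 1.0 + s) + 1                    # candidate indices k
        n_outliers.append(ks.max() if ks.size else 0)
    n_outliers = np.asarray(n_outliers)
    # step (5): among the maximizers, pick the largest quantile for robustness
    return omegas[np.flatnonzero(n_outliers == n_outliers.max())[-1]]
```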

Next, we numerically illustrate how the ω\omega chosen by Algorithm 1 depends on α\alpha. Consider the nonlinear manifold S1S^{1}, the canonical 11-dim sphere, isometrically embedded in the first two axes of p\mathbb{R}^{p} and scaled by λ\sqrt{\lambda}, where λ>0\lambda>0; that is, 𝒛i:=λ[cosθi,sinθi,0,,0]p\bm{z}_{i}:=\sqrt{\lambda}[\cos\theta_{i},\sin\theta_{i},0,\cdots,0]^{\top}\in\mathbb{R}^{p}, where θi\theta_{i} is uniformly sampled from [0,2π][0,2\pi]. Next, we add Gaussian white noise to 𝒛i\bm{z}_{i} via 𝐱i=𝒛i+𝐲ip,\mathbf{x}_{i}=\bm{z}_{i}+\mathbf{y}_{i}\in\mathbb{R}^{p}, where 𝐲i𝒩(𝟎,𝐈)\mathbf{y}_{i}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), i=1,2,,n,i=1,2,\cdots,n, are noise independent of 𝒛i.\bm{z}_{i}. We consider this example since its topology is nontrivial and we know the ground truth. In Figure 2, we record the chosen ω\omega for different α\alpha from 𝐖\mathbf{W}. When α\alpha is small (i.e., α<1\alpha<1), since the bandwidth choice does not essentially influence the resulting spectrum, the algorithm offers a large quantile in light of (3.7). When λ\lambda is large (i.e., α>2\alpha>2), we get a small quantile. Intuitively, the larger bandwidth selected when α\alpha is small can be understood as the algorithm combating the noise by averaging over a wider neighborhood. In Figure 2, we also record the chosen ω\omega for different α\alpha from 𝐀\mathbf{A}, where we simply replace the role of 𝐖\mathbf{W} by 𝐀\mathbf{A}. We observe the same behavior as that from 𝐖\mathbf{W}. Finally, note that the choices of ω\omega are insensitive to the aspect ratio cnc_{n} in (1.7). This finding suggests that the bandwidth selection is not sensitive to the ambient space dimension.
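For reproducibility, a minimal sketch of the data generation used in this illustration (noisy S1S^{1} with λ=nα\lambda=n^{\alpha}; the function name and variable names are ours) is given below; the quantile ω\omega can then be obtained with the choose_omega sketch above.

```python
import numpy as np

def noisy_circle(n, p, alpha, seed=0):
    """x_i = sqrt(lambda) [cos(theta_i), sin(theta_i), 0, ..., 0] + y_i with
    lambda = n**alpha and y_i ~ N(0, I_p), as in the Figure 2 experiment."""
    rng = np.random.default_rng(seed)
    lam = float(n) ** alpha
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    Z = np.zeros((n, p))
    Z[:, 0] = np.sqrt(lam) * np.cos(theta)
    Z[:, 1] = np.sqrt(lam) * np.sin(theta)
    return Z + rng.standard_normal((n, p))

# Example mirroring one panel of Figure 2 (c_n = n/p = 0.5, s from the resampling method):
# X = noisy_circle(n=300, p=600, alpha=1.5)
# omega = choose_omega(X, s=0.24)
```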

Refer to caption
Refer to caption
Figure 2. Adaptive choices of ω\omega by the affinity matrix 𝐖\mathbf{W} and the transition matrix 𝐀\mathbf{A} are shown on the left and right subplots respectively. In the simulation, we use the Gaussian random vectors with λ=nα\lambda=n^{\alpha} under the setting n=300n=300 for different values of cnc_{n} in (1.7). The kernel function is f(x)=exp(x/2)f(x)=\exp(-x/2) and the bandwidth is chosen adaptively according to (3.6), where ω\omega is chosen using Algorithm 1 with T=91T=91, ωL=0.05\omega_{L}=0.05, and ωU=0.95\omega_{U}=0.95. 𝗌=0.24,0.17,0.12\mathsf{s}=0.24,0.17,0.12 are chosen using the resampling method as in [50] for cn=0.5,1,2,c_{n}=0.5,1,2, respectively.

3.3. Connection with the manifold learning

To discuss the connection with manifold learning, we focus on Theorem 3.1 with α>1\alpha>1, and hence Corollary 3.2, which states that, by replacing 1n𝐖\frac{1}{n}\mathbf{W}, 1n𝐖a2\frac{1}{n}\mathbf{W}_{a_{2}} and 1n𝐖1\frac{1}{n}\mathbf{W}_{1} in (3.3)-(3.5) with 𝐀\mathbf{A}, 𝐀a2\mathbf{A}_{a_{2}} and 𝐀1\mathbf{A}_{1}, the relationship between the eigenvalues of 𝐀\mathbf{A} and 𝐀1\mathbf{A}_{1} is established when α>1\alpha>1. We know that all except the top Clog(n)C\log(n) eigenvalues of 𝐖\mathbf{W} are trivial according to (3.4), and by Weyl’s inequality, the top Clog(n)C\log(n) eigenvalues of 𝐀\mathbf{A} and 𝐀1\mathbf{A}_{1} differ by n1/2+n1αn^{-1/2}+n^{1-\alpha}. On the other hand, the eigenvalues of 𝐀1\mathbf{A}_{1} have been extensively studied in the literature. Below, we take the result in [41] as an example. Suppose the clean data 𝒵\mathcal{Z} is sampled from an mm-dim closed (compact without boundary) and smooth manifold, which is embedded in a dd-dim subspace of p\mathbb{R}^{p}, following a proper sampling condition on the sampling density function 𝗉\mathsf{p} (see Section A.1 for details of this setup). To link our result to that shown in [41], note that 𝐳i𝐳j2\|\mathbf{z}_{i}-\mathbf{z}_{j}\|^{2} is of order λ\lambda by assumption, so that the selected bandwidth is of the same order as 𝐳i𝐳j2\|\mathbf{z}_{i}-\mathbf{z}_{j}\|^{2}. Thus, since λ/(p+λ)1\lambda/(p+\lambda)\asymp 1 when nn\to\infty, the eigenvalues of 𝐀1\mathbf{A}_{1} converge to the eigenvalues of the integral operator

Th(x)=Mexp(υxy22)h(y)𝗉(y)𝑑V(y),Th(x)=\int_{M}\exp\left(-\upsilon\|x-y\|^{2}_{2}\right)h(y)\mathsf{p}(y)dV(y)\,,

where hh is a smooth function defined on MM and dVdV is the volume density. By combining the above facts, we conclude that under the high-dimensional noise setup, when the SNR is sufficiently large and the bandwidth is chosen properly, we can recover at least the top few eigenvalues of the associated integral operator from the noisy transition matrix 𝐀\mathbf{A}. Since our focus in this paper is not manifold learning itself but how the high-dimensional noise impacts the spectrum of GL, for more discussion and details about manifold learning, we refer readers to [25] and the citations therein.

4. Numerical studies

In this section, we conduct Monte Carlo simulations to illustrate the accuracy and usefulness of our results and the proposed algorithm. In Section 4.1, we conduct numerical simulations to illustrate the accuracy of our established theorems for various values of cn=0.5,1,2.c_{n}=0.5,1,2. We also show the impact of n.n. In Section 4.2, we examine the usefulness of our proposed Algorithm 1 and compare it with some methods in the literature on two manifolds of different dimensions.

4.1. Accuracy of our asymptotic results

In this subsection, we conduct numerical simulations to examine the accuracy of the established results. For simplicity, we focus on checking the results in Section 2 when h=p,h=p, which is the key part of the paper. Similar discussions can be applied to the results in Section 3 when h=λ+p.h=\lambda+p.

In Figure 3, we study the low SNR setting when 0α<10\leq\alpha<1 as in Section 2.2, particularly the closeness between the bulk eigenvalues of 𝐖\mathbf{W} and the quantiles of the MP law ν0\nu_{0} shown in (2.8) and (2.12). We also show that even for a relatively small value of n=200,n=200, our results are reasonably accurate for various values of cn=0.5,1,2.c_{n}=0.5,1,2. Then we study the region when α1\alpha\geq 1 as in Section 2.3. In Figure 4, we study the SNR region when 1α<21\leq\alpha<2 and check (2.18). Moreover, in Figure 5, we examine the accuracy of (2.23) when α\alpha is very large. Again, we find that our results are accurate even for a relatively small value of n=200n=200 under different settings of cn=0.5,1,2.c_{n}=0.5,1,2.

Since our results are stated in the asymptotic sense when nn is sufficiently large, in Figure 6 we examine how the value of nn impacts our results. For various values of cnc_{n} and SNRs, we find that our results are reasonably accurate once n100n\geq 100. We also see that when the SNR becomes large, our results are accurate even for small nn, like n<100n<100.

We point out that, for better visualization, in the above plots we report the sorted eigenvalues instead of their histogram. The main reason is that we focus on each individual eigenvalue rather than the global empirical distribution, which is already known in the literature. The simulation results are based on a single trial, which emphasizes the concentration with high probability established in our main results. For a histogram visualization, we would have to run several simulations and average them. Since this is not the main focus of the paper, we only generate one such plot in Figure 7, based on 1,000 trials.

Refer to caption
Refer to caption
Refer to caption
Figure 3. Low SNR setting. Here λ=p0.2\lambda=p^{0.2}, n=200n=200 with cn=n/p,h=pc_{n}=n/p,h=p and the random variables are standard Gaussian. To examine the bulk eigenvalues, we start from the 10th eigenvalue of 𝐖\mathbf{W}, i.e., λi(𝐖)\lambda_{i}(\mathbf{W}), where i10.i\geq 10. For the legend, the sample means the eigenvalues of 𝐖\mathbf{W} and the limit means the quantiles (c.f. (2.2)) of the MP law ν0\nu_{0} defined in (2.7).
Refer to caption
Refer to caption
Refer to caption
Figure 4. Moderate SNR region. Here λ=p1.9\lambda=p^{1.9}, n=200n=200 with cn=n/p,h=pc_{n}=n/p,h=p and the random variables are standard Gaussian. For the legend, the sample means the eigenvalues of 𝐖\mathbf{W} and the limit means the eigenvalues of 𝐖a1\mathbf{W}_{a_{1}} as in (2.18).
Refer to caption
Refer to caption
Refer to caption
Figure 5. Large SNR setting. Here λ=p5\lambda=p^{5}, n=200n=200 with cn=n/p,h=pc_{n}=n/p,h=p and the random variables are standard Gaussian. For the legend, the sample means the eigenvalues of 𝐖\mathbf{W} and the limit means unity as predicted in (2.23).
Refer to caption
Refer to caption
Refer to caption
Figure 6. Dependence on the sample size nn. We examine how the values of nn influence the results as in Figures 3–5. When the SNR is small such that λ=p0.2,\lambda=p^{0.2}, the error rate is defined as supi10|λi(𝐖)γν0(i)|.\sup_{i\geq 10}|\lambda_{i}(\mathbf{W})-\gamma_{\nu_{0}}(i)|. When the SNR is large such that α1,\alpha\geq 1, the error rate is defined as the operator norm of the difference between 𝐖\mathbf{W} and its corresponding limit matrix. Here cn=n/p,h=pc_{n}=n/p,h=p and the random variables are standard Gaussian. We conclude that once nn is relatively large, say n100,n\geq 100, our results will be reasonably accurate. We also observe that when the SNR is large, smaller values of nn already suffice for accuracy.
Refer to caption
Refer to caption
Refer to caption
Figure 7. Histogram of low SNR setting. The settings are the same as in the caption of Figure 3. The simulations are based on 1,000 repetitions. In the plots, we have removed the point mass.

4.2. Efficiency of our proposed bandwidth selection algorithm (Algorithm 1)

In this subsection, we show the usefulness of our proposed Algorithm 1 and compare it with some existing methods in the literature. We consider two manifolds of different dimensions, and compare the following four setups. (1) The clean affinity matrix 𝐖1\mathbf{W}_{1} with the bandwidth h=λ+ph=\lambda+p; (2) 𝐖\mathbf{W} constructed using our Algorithm 1; (3) 𝐖\mathbf{W} constructed using the bandwidth via (3.6) with ω=0.5,\omega=0.5, which has been used in [56, 44]; (4) 𝐖\mathbf{W} using the bandwidth h=ph=p as in [27, 18, 22, 42, 28, 29]. Since the accuracy of the eigenvalues has been discussed in Section 4.1, and although we do not study eigenvectors of GL theoretically in this paper, in what follows we demonstrate the eigenvectors of 𝐖\mathbf{W} and compare them with those of 𝐖1\mathbf{W}_{1} constructed from the clean signal {ξi}i=1n\{\xi_{i}\}_{i=1}^{n} to further understand the impact of bandwidth selection.

We start with a 1-dimensional smooth and closed manifold M1M_{1}, which is parametrized by Φ:uaR[2cos(u), 3(10.8e8cos(u)2)cos(2π(u/(2π))2),(10.8e8cos(u)2)sin(u), 0,,0]p\Phi:\,u\mapsto aR[2\cos(u),\,3(1-0.8e^{-8\cos(u)^{2}})\cos(2\pi(u/(2\pi))^{2}),\,\\ (1-0.8e^{-8\cos(u)^{2}})\sin(u),\,0,\cdots,0]^{\top}\in\mathbb{R}^{p}, where RO(p)R\in O(p), a>0a>0 and u(0,2π]u\in(0,2\pi]. In other words, the 1-dimensional manifold M1M_{1} is embedded in a 3-dim Euclidean subspace of p\mathbb{R}^{p}. Now, sample independently and uniformly nn points u1,,unu_{1},\ldots,u_{n} from 𝚄𝚗𝚒𝚏𝚘𝚛𝚖(0,2π)\mathtt{Uniform}(0,2\pi), and hence nn points ξi=Φ(ui)\xi_{i}=\Phi(u_{i}). Next, we add Gaussian white noise to ξi\xi_{i} via 𝐱i=ξi+𝐲ip,\mathbf{x}_{i}=\xi_{i}+\mathbf{y}_{i}\in\mathbb{R}^{p}, where 𝐲i𝒩(𝟎,𝐈)\mathbf{y}_{i}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), i=1,2,,n,i=1,2,\cdots,n, are noise independent of ξi.\xi_{i}. The results of embedding this manifold by the top three nontrivial eigenvectors are shown in Figure 8. We can see that with the proposed bandwidth selection algorithm, the embedding of the noisy data is closer to that from the clean data. To further quantify this closeness, we view the eigenvectors from 𝐖1\mathbf{W}_{1} as the truth, and compare them with the eigenvectors of 𝐖\mathbf{W} obtained with different bandwidths by evaluating the root mean square error (RMSE). Note that the sign ambiguity of eigenvectors is handled by taking the smaller value of vj(c)vj(w)2\|v^{(c)}_{j}-v^{(w)}_{j}\|_{2} and vj(c)+vj(w)2\|v^{(c)}_{j}+v^{(w)}_{j}\|_{2}, where vj(c)v^{(c)}_{j} is the jj-th eigenvector of 𝐖1\mathbf{W}_{1} associated with the jj-th largest eigenvalue and vj(w)v^{(w)}_{j} is the jj-th eigenvector of 𝐖\mathbf{W} associated with the jj-th largest eigenvalue. We repeat the random sampling 300 times, and plot the error bars with mean±\pmstandard deviation in Figure 9. It is clear that with the adaptive bandwidth selection algorithm, the top several eigenvectors of 𝐖\mathbf{W} are close to those of 𝐖1\mathbf{W}_{1}.
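A minimal sketch of this sign-aligned eigenvector comparison (the function name eigvec_rmse is ours and purely illustrative) is as follows.

```python
import numpy as np

def eigvec_rmse(W_clean, W_noisy, k=9):
    """Per-eigenvector RMSE between the top-k eigenvectors of two affinity
    matrices, resolving the sign ambiguity via min(||v - u||, ||v + u||)."""
    n = W_clean.shape[0]
    _, V_c = np.linalg.eigh(W_clean)          # columns ordered by increasing eigenvalue
    _, V_w = np.linalg.eigh(W_noisy)
    V_c = V_c[:, ::-1][:, :k]                 # keep eigenvectors of the k largest eigenvalues
    V_w = V_w[:, ::-1][:, :k]
    diffs = np.minimum(np.linalg.norm(V_c - V_w, axis=0),
                       np.linalg.norm(V_c + V_w, axis=0))
    return diffs / np.sqrt(n)                 # entrywise root mean squared error
```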

Refer to caption
Figure 8. Comparison of different bandwidth choices for cn=0.5.c_{n}=0.5. We consider the one-dim manifold M1M_{1} with a=20pa=20\sqrt{p} and take n=400.n=400. For the title in each subplot, Clean means the matrix 𝐖1,\mathbf{W}_{1}, Adap means using the bandwidth according to our proposed Algorithm 1, Medq means using the bandwidth via (3.6) with ω=0.5\omega=0.5 and h=ph=p means using the bandwidth h=p.h=p.
Refer to caption
Refer to caption
Refer to caption
Figure 9. From left to right: comparison of eigenvectors determined by different bandwidth choice schemes for cn=0.5, 1,2c_{n}=0.5,\ 1,2. We consider the one-dim manifold M1M_{1} with a=20pa=20\sqrt{p} and take n=400.n=400. The red, blue and black lines indicate the mean±\pmstandard deviation of RMSE when we compare the first nine eigenvectors evaluated from Clean and Adap, Clean and Medq, and Clean and h=ph=p, where Clean means the matrix 𝐖1\mathbf{W}_{1} with the bandwidth p+λp+\lambda, Adap means using the bandwidth according to our proposed Algorithm 1, Medq means using the bandwidth via (3.6) with ω=0.5\omega=0.5 and h=ph=p means using the bandwidth h=p.h=p.

Next, we consider the Klein bottle, which is a 22-dim compact and smooth manifold that cannot be embedded into a three-dimensional Euclidean space. First, set

(4.1) 𝐳i=aR[(2cos(𝐮i(1))+1)cos(𝐮i(2))(2cos(𝐮i(1))+1)sin(𝐮i(2))2sin(𝐮i(1))cos(𝐮i(2)/2)2sin(𝐮i(1))sin(𝐮i(2)/2)00]p,\displaystyle\mathbf{z}_{i}=aR\begin{bmatrix}(2\cos(\mathbf{u}_{i}(1))+1)\cos(\mathbf{u}_{i}(2))\\ (2\cos(\mathbf{u}_{i}(1))+1)\sin(\mathbf{u}_{i}(2))\\ 2\sin(\mathbf{u}_{i}(1))\cos(\mathbf{u}_{i}(2)/2)\\ 2\sin(\mathbf{u}_{i}(1))\sin(\mathbf{u}_{i}(2)/2)\\ 0\\ \vdots\\ 0\end{bmatrix}\in\mathbb{R}^{p},

where 𝐮i\mathbf{u}_{i}, i=1,,ni=1,\ldots,n, are sampled independently and uniformly from [0,2π]×[0,2π]2[0,2\pi]\times[0,2\pi]\subset\mathbb{R}^{2}, RO(p)R\in O(p), and a>0a>0 is the signal strength. In other words, we sample nonuniformly (uniform sampling in the parameter space induces a nonuniform density with respect to the volume measure) from the Klein bottle isometrically embedded in a 44-dimensional subspace of p\mathbb{R}^{p}. Next, we add noise to 𝐳i\mathbf{z}_{i} by setting 𝐱i=𝐳i+𝐲ip,\mathbf{x}_{i}=\mathbf{z}_{i}+\mathbf{y}_{i}\in\mathbb{R}^{p}, where 𝐲i𝒩(𝟎,𝐈)\mathbf{y}_{i}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), i=1,2,,n,i=1,2,\cdots,n, are noise independent of 𝐳i.\mathbf{z}_{i}. The eigenvectors under the mentioned four setups when the signal strength is equivalent to α=1\alpha=1 are shown in Figure 10, where cn=0.5c_{n}=0.5 and we plot the entries of different eigenvectors over 𝐮i\mathbf{u}_{i} with colors; that is, the color at 𝐮i\mathbf{u}_{i} represents the ii-th entry of the associated eigenvector. We apply the same RMSE evaluation used in the M1M_{1} example above; that is, we show the RMSE of the eigenvectors of different 𝐖\mathbf{W} when compared with those from 𝐖1\mathbf{W}_{1} as the truth. We repeat the random sampling 300 times, and plot the error bars with mean±\pmstandard deviation in Figure 11. It is clear that with the adaptive bandwidth selection algorithm, the top several eigenvectors of 𝐖\mathbf{W} are close to those of 𝐖1\mathbf{W}_{1}.
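For completeness, a minimal sketch of the noisy Klein bottle sampling in (4.1) is given below; for simplicity we take R=𝐈pR=\mathbf{I}_{p}, which does not change the affinity matrix, as discussed in Section A.1. The function name is ours and purely illustrative.

```python
import numpy as np

def noisy_klein_bottle(n, p, a, seed=0):
    """Sample n noisy points from the Klein bottle parametrization (4.1),
    placed in the first four coordinates of R^p (R = I_p for simplicity)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 2.0 * np.pi, size=(n, 2))
    Z = np.zeros((n, p))
    Z[:, 0] = (2.0 * np.cos(u[:, 0]) + 1.0) * np.cos(u[:, 1])
    Z[:, 1] = (2.0 * np.cos(u[:, 0]) + 1.0) * np.sin(u[:, 1])
    Z[:, 2] = 2.0 * np.sin(u[:, 0]) * np.cos(u[:, 1] / 2.0)
    Z[:, 3] = 2.0 * np.sin(u[:, 0]) * np.sin(u[:, 1] / 2.0)
    return a * Z + rng.standard_normal((n, p))
```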

Refer to caption
Figure 10. Comparison of different bandwidth choices for cn=0.5.c_{n}=0.5. We consider the Klein bottle parametrized in (4.1) with a=20p.a=20\sqrt{p}. Here n=800.n=800. For the ylabel on the first column, Clean means the matrix 𝐖1\mathbf{W}_{1} with the bandwidth p+λp+\lambda, Adap means using the bandwidth according to our proposed Algorithm 1, Medq means using the bandwidth via (3.6) with ω=0.5\omega=0.5 and h=ph=p means using the bandwidth h=p.h=p.
Refer to caption
Refer to caption
Refer to caption
Figure 11. From left to right: Comparison of eigenvectors determined by different bandwidth choice schemes for cn=0.5, 1,2c_{n}=0.5,\ 1,2. We consider the Klein bottle parametrized in (4.1) with a=20p.a=20\sqrt{p}. Here n=800.n=800. The red, blue and black lines indicate the mean±\pmstandard deviation of RMSE when we compare the first nine eigenvectors evaluated from Clean and Adap, Clean and Medq, and Clean and h=ph=p, where Clean means the matrix 𝐖1\mathbf{W}_{1} with the bandwidth p+λp+\lambda, Adap means using the bandwidth according to our proposed Algorithm 1, Medq means using the bandwidth via (3.6) with ω=0.5\omega=0.5 and h=ph=p means using the bandwidth h=p.h=p.

The results of the above two examples support the potential of our proposed bandwidth selection algorithm, particularly when compared with two bandwidth selection methods commonly considered in the literature. Note that since only the eigenvalue information is used in Algorithm 1, only partial information in the kernel random matrix is utilized. We hypothesize that by taking the eigenvectors into account, we could achieve a better bandwidth selection algorithm. Since the study of eigenvectors is beyond the scope of this paper, we will explore this possibility in our future work.

5. Discussion and Conclusion

We provide a systematic study of the spectrum of GL under the nonnull setup over different SNR regions, particularly when d=1d=1, and explore the impact of the chosen bandwidth. Specifically, we show that under a proper SNR and bandwidth, the spectrum of 𝐀1\mathbf{A}_{1} from the clean dataset can be extracted from that of 𝐀\mathbf{A} from the noisy dataset. We also provide a new algorithm that selects a suitable bandwidth by maximizing the number of “outliers” lying outside the spectral bulk associated with the noise.

Note that we need the assumption that the entries of 𝐲i\mathbf{y}_{i} are independent because our arguments depend on established results in the literature, for example, [8], which are proved under this assumption. Nevertheless, we believe that this assumption can be removed with extra technical effort. A natural strategy is to utilize the Gaussian comparison method in the random matrix theory literature, as in [31]. This strategy contains two steps. In the first step, we establish the results for Gaussian random vectors, for which uncorrelatedness is equivalent to independence. In the second step, we prove that the results hold for sub-Gaussian random vectors using the comparison arguments as in [31] under certain moment matching conditions. Since this is not the main focus of the current paper and the extension requires generalizing many existing results in the literature, for example, [8], we will pursue this direction in the future.

A comparison with principal component analysis (PCA) deserves a discussion. It is well known that when d=1d=1, so that the signal is sampled from a linear subspace, we can easily recover this linear subspace by any linear method like PCA. At first glance, it seems that the problem is resolved. However, if the purpose is studying the geometric and/or topological structure of the dataset, we may need further analysis. For example, if we want to answer a question like “is the 11-dim signal supported on two disjoint subsets (or is the 11-dim linear manifold disconnected)?”, then PCA cannot answer it and we need other tools (for example, GL via spectral clustering). The same fact holds when d>1d>1, where the nonlinear manifold is supported in a dd-dim subspace while its dimension is strictly smaller than dd. In this case, while we could possibly recover the dd-dim subspace by PCA, if we want to further study the nonlinear geometric and/or topological structure of the manifold, we need GL as the analysis tool. For readers who are interested in recovering the manifold structure via GL, see [35, 34, 25] and the literature therein. Based on our results, we know that even for the 1-dimensional linear manifold (for example, 𝐳i\mathbf{z}_{i} in (1.4)–(1.6) satisfies that 𝐞1𝐳i\mathbf{e}_{1}^{\top}\mathbf{z}_{i} is uniformly distributed on [0,1][0,1]), the linear and nonlinear methods behave significantly differently.

One significant difference that we shall emphasize is the phase transition phenomenon when a nonlinear method like GL is applied. While the discussion could be much more general, the essence is the same, so we continue the discussion with the 1-dimensional linear manifold below. For simplicity, we focus on the results in Section 2 for the kernel affinity matrix 𝐖\mathbf{W} when the bandwidth is chosen so that hp.h\asymp p. From Theorem 2.3 to Theorem 2.7, we observe several phase transitions for eigenvalues depending on the signal strength λ\lambda. In the case when λ\lambda is bounded, we observe three or four outlying eigenvalues according to the magnitude of λ,\lambda, where three of these outlying eigenvalues are from the kernel effect; see Lemma A.9 for details. When λ\lambda transitions from the subcritical region λcn\lambda\leq\sqrt{c_{n}} to the supercritical region λ>cn\lambda>\sqrt{c_{n}}, one extra outlier is generated due to the well-known BBP transition [3] phenomenon for the spiked covariance matrix model. Moreover, the rest of the non-outlying eigenvalues still obey a shifted MP law. As will be clear from the proof, in the bounded region, studying the affinity matrix is directly related to studying PCA of the dataset via the Gram matrix; see (A.22) for details. However, once the signal strength diverges, the spectrum of the affinity matrix behaves totally differently. In PCA, under the setting of (1.6), no matter how large λ\lambda is, we can only observe a single outlier and its strength increases as λ\lambda increases. Moreover, only the first eigenvalue is influenced by λ\lambda and the remaining eigenvalues satisfy the MP law, and this MP law is independent of λ.\lambda. Specifically, the second eigenvalue, i.e., the first non-outlying eigenvalue, will stick to the right-most edge of the MP law. We refer the readers to Lemma A.6 for more details and Figure 12 for an illustration. In contrast, regarding the nonlinear kernel method, the values and the number of outlying eigenvalues vary according to the signal strength. That is, the magnitude of λ\lambda has a possible impact on all eigenvalues of 𝐖\mathbf{W} through the kernel function. Heuristically, this is because PCA explores the point cloud in a linear fashion so that λ\lambda will only have an impact on the direction which explains the most variance, i.e., λ1(𝐂)\lambda_{1}(\mathbf{C}), where 𝐂\mathbf{C} is the covariance matrix that is directly related to the Gram matrix 𝐆\mathbf{G}. However, the kernel method deals with the data in a nonlinear way. As a consequence, all eigenvalues of 𝐖\mathbf{W} contain the signal information. In other words, unlike the bulk (non-outlier) eigenvalues of 𝐆\mathbf{G}, all eigenvalues of 𝐖\mathbf{W} change when λ\lambda increases. Consider the following three cases. First, as we will see in the proof below, the eigenvalue corresponding to the BBP transition in the supercritical region will increase when λ\lambda increases, and this eigenvalue will eventually be close to 11 when λ\lambda diverges following (2.21). Thus, we expect that the magnitude of this outlying eigenvalue follows an asymmetric bell curve as α\alpha increases. Second, the eigenvalues corresponding to the kernel effect (see (A.20)) will decrease when λ\lambda increases since the kernel function ff is a decreasing function. Third, as we can see from Theorems 2.5 and 2.7, the non-outlying eigenvalues become outlying eigenvalues when λ\lambda increases.
In Figure 13, we illustrate the above phenomena by investigating the behavior of some eigenvalues.

Refer to caption
Refer to caption
Figure 12. Outlier and the first non-outlying eigenvalues of 𝐆=1p𝐗𝐗.\mathbf{G}=\frac{1}{p}\mathbf{X}^{\top}\mathbf{X}. In this simulation, we use the Gaussian random vectors with covariance matrix (1.6) with λ=pα\lambda=p^{\alpha} under the setting n=200n=200 and cn=n/p.c_{n}=n/p. The left panel corresponds to the first eigenvalue, i.e., the outlying eigenvalue of 𝐆\mathbf{G}, and the right panel corresponds to the second eigenvalue, i.e., the first non-outlying eigenvalue, of 𝐆.\mathbf{G}.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 13. Eigenvalues of 𝐖\mathbf{W} for different α\alpha with the kernel f(x)=exp(x/2)f(x)=\exp(-x/2) and bandwidth h=ph=p. In this simulation, we use the Gaussian random vectors with covariance matrix (1.6) with λ=pα\lambda=p^{\alpha} with different cn=n/pc_{n}=n/p and n=200n=200. The top left panel corresponds to the outlying eigenvalue (i.e. λ1(𝐖)\lambda_{1}(\mathbf{W})) associated with Sh0(τ)\mathrm{Sh}_{0}(\tau) in (A.20), the top right panel stands for the outlying eigenvalue (i.e. λ2(𝐖)\lambda_{2}(\mathbf{W})) from the BBP transition effect, the bottom left panel is an eigenvalue (i.e. λ8(𝐖)\lambda_{8}(\mathbf{W})) which gradually detaches from the bulk spectrum and becomes an outlying eigenvalue when α\alpha increases. The bottom right panel is an eigenvalue sticking to the bulk spectrum (i.e. λ80(𝐖)\lambda_{80}(\mathbf{W})). In all cases, we see that the eigenvalues change when α\alpha increases.

While our results pave the way towards a foundation for statistical inference of various kernel-based unsupervised learning algorithms for various data analysis purposes, like visualization, dimension reduction, spectral clustering, etc., there are several problems we need to further study. First, as mentioned in Remark 2.9, there are some open problems, particularly when αi=1\alpha_{i}=1 for all i=1,,di=1,\ldots,d. Second, the behavior of eigenvectors of 𝐀\mathbf{A} from the noisy dataset and their relationship with the eigenfunctions of the Laplace-Beltrami operator under the manifold model need to be explored so that the above-mentioned data analysis purposes can be justified under the manifold setup. Third, in practice, noise may have a fat tail, the kernel may not be Gaussian [26, 29] or not isotropic [64], and the kernel function might be group-valued [58]. Fourth, the bandwidth selection algorithm is established under rather idealized assumptions, and its performance on real-world datasets needs further exploration. We will explore these problems in our future work.

Acknowledgment

The authors would like to thank the associate editor and two anonymous reviewers for many insightful comments and suggestions, which have resulted in a significant improvement of the paper.

Appendix A Preliminary results

In this section, we collect and prove some useful preliminary results, which will be used later in the technical proof.

A.1. From manifold to spiked model

In this subsection, we detail the claim in Section 1.2 and explain why the manifold model and (1.4) overlap. Suppose {𝐳i0}\{\mathbf{z}^{0}_{i}\} are i.i.d. samples of a pp-dim random vector ZZ, and suppose the range of ZZ is supported on an mm-dimensional, connected, smooth and compact manifold MM, where m1m\geq 1, isometrically embedded in p\mathbb{R}^{p} via ι\iota. Assume there exists dmd\geq m, where dd is independent of nn and dpd\leq p, so that the embedded MM is supported in a dd-dim affine subspace of p\mathbb{R}^{p}. Here we assume that dd is fixed. Since we consider the kernel distance matrix as in (1.1), without loss of generality, we can assume that 𝔼𝐳i0=0\mathbb{E}\mathbf{z}^{0}_{i}=0; that is, the embedded manifold is centered at 0. Also, assume the density function on the manifold associated with the sampling scheme is smooth with a positive lower bound. See [14] for a more detailed discussion of the manifold model and the relevant notion of density function. Suppose the noisy data is

(A.1) 𝐱i=λ𝐳i0+𝐲i,\mathbf{x}_{i}=\sqrt{\lambda}\mathbf{z}^{0}_{i}+\mathbf{y}_{i}\,,

where λ>0\lambda>0 represents the signal strength, and 𝐲i\mathbf{y}_{i} is the independent noise that satisfies (1.5). Denote 𝐳i:=λ𝐳i0\mathbf{z}_{i}:=\sqrt{\lambda}\mathbf{z}^{0}_{i}. Thus, there exists a p×pp\times p orthogonal matrix RpR_{p} such that

Rp𝐳i0=(zi10,zi20,,zid0,0,,0).R_{p}\mathbf{z}^{0}_{i}=(z^{0}_{i1},z^{0}_{i2},\cdots,z^{0}_{id},0,\cdots,0)^{\top}\,.

Since MM is compact, zij0z^{0}_{ij}, j=1,,dj=1,\ldots,d, is bounded and hence sub-Gaussian with the variance controlled by 𝒦:=maxx,yMι(x)ι(y)p\mathcal{K}:=\max_{x,y\in M}\|\iota(x)-\iota(y)\|_{\mathbb{R}^{p}}, which is independent of pp since the manifold is assumed to be fixed. On the other hand, since MM is connected, zij0z^{0}_{ij}, j=1,,dj=1,\ldots,d, is a continuous random variable. We can further choose another rotation R¯p\bar{R}_{p} so that the first dd coordinates of 𝐳i0\mathbf{z}^{0}_{i} are whitened; that is, R¯p𝐳i0\bar{R}_{p}\mathbf{z}^{0}_{i} has the covariance structure

diag(λ10,λ20,,λd0,0,,0)p×p,\texttt{diag}(\lambda^{0}_{1},\lambda^{0}_{2},\ldots,\lambda^{0}_{d},0,\ldots,0)\in\mathbb{R}^{p\times p}\,,

where by the assumption of the density function, we have λi0>0\lambda^{0}_{i}>0 for i=1,,di=1,\ldots,d. Note that λi0\lambda^{0}_{i} is controlled by 𝒦\mathcal{K}, and by the smoothness of the manifold, we could assume without loss of generality that cλi01/cc\leq\lambda^{0}_{i}\leq 1/c for c(0,1)c\in(0,1) for all i=1,,di=1,\ldots,d. Then, multiply the noisy dataset in (A.1) by R¯p\bar{R}_{p} from the left and get

(A.2) 𝐱¯i=𝐳¯i+𝐲¯i,\bar{\mathbf{x}}_{i}=\bar{\mathbf{z}}_{i}+\bar{\mathbf{y}}_{i}\,,

where 𝐱¯i:=R¯p𝐱i\bar{\mathbf{x}}_{i}:=\bar{R}_{p}\mathbf{x}_{i}, 𝐳¯i:=R¯p𝐳i\bar{\mathbf{z}}_{i}:=\bar{R}_{p}\mathbf{z}_{i} and 𝐲¯i:=R¯p𝐲i\bar{\mathbf{y}}_{i}:=\bar{R}_{p}\mathbf{y}_{i}. It is easy to see that since {𝐲i}\{\mathbf{y}_{i}\} are isotropic (c.f. (1.5)), {𝐲¯i}\{\bar{\mathbf{y}}_{i}\} are sub-Gaussian random vectors satisfying (1.5). Thus, since 𝐳¯i\bar{\mathbf{z}}_{i} and 𝐲¯i\bar{\mathbf{y}}_{i} are still independent, the covariance of {𝐱¯i}\{\bar{\mathbf{x}}_{i}\} becomes

Σ¯p:=diag(λ1+1,,λd+1,1,,1)p×p,\bar{\Sigma}_{p}:=\texttt{diag}(\lambda_{1}+1,\cdots,\lambda_{d}+1,1,\cdots,1)\in\mathbb{R}^{p\times p}\,,

where λl:=λλl0\lambda_{l}:=\lambda\lambda^{0}_{l} for l=1,,dl=1,\ldots,d. Note that in this case, λi\lambda_{i} are of the same order. By the above definitions and the invariance of the 2\ell_{2} norm, we have

𝐱¯i𝐱¯j\displaystyle\|\bar{\mathbf{x}}_{i}-\bar{\mathbf{x}}_{j}\|\, =𝐱i𝐱j,\displaystyle=\|\mathbf{x}_{i}-\mathbf{x}_{j}\|\,,
𝐳¯i𝐳¯j\displaystyle\|\bar{\mathbf{z}}_{i}-\bar{\mathbf{z}}_{j}\|\, =𝐳i𝐳j,\displaystyle=\|\mathbf{z}_{i}-\mathbf{z}_{j}\|\,,
𝐲¯i𝐲¯j\displaystyle\|\bar{\mathbf{y}}_{i}-\bar{\mathbf{y}}_{j}\|\, =𝐲i𝐲j,\displaystyle=\|\mathbf{y}_{i}-\mathbf{y}_{j}\|\,,

which means that the affinity matrices (1.1) and transition matrices (1.9) remain unchanged after applying the orthogonal matrix. Thus, (A.2) is reduced to (1.4) and it suffices to focus on model (1.4).

Under the above setting, where a low-dimensional manifold is embedded into an affine subspace with a fixed dimension dd, the nonlinear manifold model is thus closely related to the spiked covariance matrix model. Recall that according to Nash’s isometric embedding theorem [48], there exists an embedding so that dd is smaller than m(3m+11)/2m(3m+11)/2, but there may exist embeddings so that dd is larger than m(3m+11)/2m(3m+11)/2. More complicated models might even need dd to diverge as nn\to\infty. In these settings, the nonlinear manifold model will be reduced to other random matrix models, e.g., the divergent spiked model [13] or the divergent-rank signal-plus-noise model [23, 24, 20, 21]. We believe that the spectrum of the GL under these settings can also be investigated once the spectrum of these random matrix models is well understood. Since this is not the focus of the current paper, we will pursue this direction in the future.

A.2. Some linear algebra facts

We record some linear algebraic results. The first one is for the Hadamard product from [27], Lemma A.5.

Lemma A.1.

Suppose 𝐌\mathbf{M} is a real symmetric matrix with nonnegative entries and 𝐄\mathbf{E} is a real symmetric matrix. Then we have that

σ1(𝐌𝐄)maxi,j|𝐄(i,j)|σ1(𝐌),\sigma_{1}(\mathbf{M}\circ\mathbf{E})\leq\max_{i,j}|\mathbf{E}(i,j)|\sigma_{1}(\mathbf{M}),

where σ1(𝐌)\sigma_{1}(\mathbf{M}) stands for the largest singular value of 𝐌.\mathbf{M}.

The following lemma is commonly referred to as the Gershgorin circle theorem, and its proof can be found in [37, Section 6.1].

Lemma A.2.

Let A=(aij)A=(a_{ij}) be a real n×nn\times n matrix. For 1in,1\leq i\leq n, let Ri=ji|aij|R_{i}=\sum_{{j\neq{i}}}\left|a_{{ij}}\right| be the sum of the absolute values of the non-diagonal entries in the ii-th row. Let D(aii,Ri)D(a_{ii},R_{i})\subseteq\mathbb{C} be the closed disc with center aiia_{ii} and radius RiR_{i}, referred to as the Gershgorin disc. Every eigenvalue of AA lies within at least one of the Gershgorin discs D(aii,Ri)D(a_{ii},R_{i}).
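As a quick numerical illustration of Lemma A.2 (for a symmetric matrix, so that the eigenvalues are real and the Gershgorin discs reduce to intervals on the real line), consider the following sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2.0                                   # symmetric, so eigenvalues are real

centers = np.diag(A)
radii = np.abs(A).sum(axis=1) - np.abs(centers)       # R_i = sum_{j != i} |a_ij|

# every eigenvalue lies in at least one interval [a_ii - R_i, a_ii + R_i]
for lam in np.linalg.eigvalsh(A):
    assert np.any(np.abs(lam - centers) <= radii)
```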

We also collect some important matrix inequalities. For details, we refer readers to [18, Lemma SI.1.9].

Lemma A.3.

For two n×nn\times n symmetric matrices 𝐀\mathbf{A} and 𝐁,\mathbf{B}, we have that

i=1n|λi(𝐀)λi(𝐁)|2tr{(𝐀𝐁)2}.\sum_{i=1}^{n}|\lambda_{i}(\mathbf{A})-\lambda_{i}(\mathbf{B})|^{2}\leq\operatorname{tr}\{(\mathbf{A}-\mathbf{B})^{2}\}.

Moreover, let m𝐀(z)m_{\mathbf{A}}(z) and m𝐁(z)m_{\mathbf{B}}(z) be the Stieltjes transforms of the ESDs of 𝐀\mathbf{A} and 𝐁\mathbf{B} respectively, then we have

|m𝐀(z)m𝐁(z)|rank{𝐀𝐁}nmin{2η,𝐀𝐁η2}.|m_{\mathbf{A}}(z)-m_{\mathbf{B}}(z)|\leq\frac{\operatorname{rank}\{\mathbf{A}-\mathbf{B}\}}{n}\min\left\{\frac{2}{\eta},\frac{\|\mathbf{A}-\mathbf{B}\|}{\eta^{2}}\right\}.

A.3. Some concentration inequalities for sub-Gaussian random vectors

We record some auxiliary lemmas for our technical proof. We start with the concentration inequalities for the sub-Gaussian random vector 𝐲\mathbf{y} that satisfies

𝔼(exp(𝐚𝐲))exp(𝐚22/2).\mathbb{E}(\exp(\mathbf{a}^{\top}\mathbf{y}))\leq\exp(\|\mathbf{a}\|_{2}^{2}/2).

The first lemma establishes the concentration inequalities when λ\lambda is bounded.

Lemma A.4.

Suppose (1.4)-(1.6) hold. Assume d1d\geq 1 is fixed, λl1\lambda_{l}\asymp 1 for l=1,,dl=1,\ldots,d and write

(A.3) 𝐳i=[λ1zi1,λ2zi2,,λdzid, 00]\mathbf{z}_{i}=[\sqrt{\lambda_{1}}z_{i1},\ \sqrt{\lambda_{2}}z_{i2},\ldots,\sqrt{\lambda_{d}}z_{id},\ 0\ldots 0]

for all 1in1\leq i\leq n, where var(zil)=1\textnormal{var}(z_{il})=1 for all l=1,,dl=1,\ldots,d. Then, for iji\neq j and t>0,t>0, we have

(A.4) (|1p𝐲i𝐲j|>t)exp(pt2/2),\mathbb{P}\left(\left|\frac{1}{p}\mathbf{y}_{i}^{\top}\mathbf{y}_{j}\right|>t\right)\leq\exp(-pt^{2}/2)\,,

as well as

(A.5) (|1p𝐱i𝐱j1p𝐳i𝐳j|>t)exp(pt2/2).\mathbb{P}\left(\left|\frac{1}{p}\mathbf{x}_{i}^{\top}\mathbf{x}_{j}-\frac{1}{p}\mathbf{z}_{i}^{\top}\mathbf{z}_{j}\right|>t\right)\leq\exp(-pt^{2}/2)\,.

For the diagonal terms, for t>0,t>0, we have for some universal constants C,C1>0,C,C_{1}>0,

(A.6) \displaystyle\mathbb{P} (|1p𝐲i221|>t){2exp(C1pt2),0<tC2exp(C1pt),t>C,\displaystyle\left(\left|\frac{1}{p}\|\mathbf{y}_{i}\|_{2}^{2}-1\right|>t\right)\leq\begin{cases}2\exp(-C_{1}pt^{2}),&0<t\leq C\\ 2\exp(-C_{1}pt),&t>C\,,\end{cases}

as well as

(A.7) \displaystyle\mathbb{P} (|1p𝐱i221p𝐳i22|>t){2exp(C1pt2),0<tC2exp(C1pt),t>C.\displaystyle\left(\left|\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-\frac{1}{p}\|\mathbf{z}_{i}\|_{2}^{2}\right|>t\right)\leq\begin{cases}2\exp(-C_{1}pt^{2}),&0<t\leq C\\ 2\exp(-C_{1}pt),&t>C.\end{cases}

Especially, the above results imply that

1p|𝐲i𝐲j|n1/2,\displaystyle\frac{1}{p}\left|\mathbf{y}_{i}^{\top}\mathbf{y}_{j}\right|\prec n^{-1/2},
(A.8) 1p|𝐱i𝐱j|n1/2,\displaystyle\frac{1}{p}\left|\mathbf{x}_{i}^{\top}\mathbf{x}_{j}\right|\prec n^{-1/2}\,,

as well as

|1p𝐲i221|\displaystyle\left|\frac{1}{p}\|\mathbf{y}_{i}\|_{2}^{2}-1\right|\, n1/2,\displaystyle\prec n^{-1/2},
(A.9) |1p𝐱i22(1+l=1dλlp)|\displaystyle\left|\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-\left(1+\frac{\sum_{l=1}^{d}\lambda_{l}}{p}\right)\right|\, n1/2.\displaystyle\prec n^{-1/2}.

Note that since pp and nn are of the same order, the above results hold when 𝐲i\mathbf{y}_{i} is replaced by the vector 𝐳:=[z1,,zn]n\bm{z}:=[z_{1},\ldots,z_{n}]^{\top}\in\mathbb{R}^{n}, which is a sub-Gaussian random vector.

Proof.

We adapt (A.3) in the proof. When ij,i\neq j, (A.4) has been proved in [18, Lemma A.2]. Observe that

(A.10) 𝐱i𝐱j𝐳i𝐳j=𝐲i𝐲j+l=1dλl(zilyjl+zjlyil).\mathbf{x}_{i}^{\top}\mathbf{x}_{j}-\mathbf{z}_{i}^{\top}\mathbf{z}_{j}=\mathbf{y}_{i}^{\top}\mathbf{y}_{j}+\sum_{l=1}^{d}\sqrt{\lambda_{l}}(z_{il}y_{jl}+z_{jl}y_{il}).

Since λl1\lambda_{l}\asymp 1 and dd is fixed, we find that 𝐲i𝐲j\mathbf{y}_{i}^{\top}\mathbf{y}_{j} is the leading order term. (A.5) follows from (A.4) and (A.10). When i=ji=j, (A.6) has been proved in [61, Corollary 2.8.3]. (A.7) follows from (A.10) and (A.6).

(A.8) ((A.9), respectively) follows from (A.4) and (A.5) ((A.6) and (A.7), respectively) for scalar random variables and the fact that λl1.\lambda_{l}\asymp 1.
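As a simple sanity check of the above bounds (under the assumption of standard Gaussian noise, which is sub-Gaussian), one can verify numerically that the fluctuations in (A.4) and (A.6) are indeed of order p1/2p^{-1/2}; the following sketch is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 2000, 500

# off-diagonal term: (1/p) y_i^T y_j concentrates around 0 at scale p^{-1/2}, cf. (A.4)
cross = np.array([rng.standard_normal(p) @ rng.standard_normal(p) / p for _ in range(trials)])
# diagonal term: (1/p) ||y_i||_2^2 concentrates around 1 at the same scale, cf. (A.6)
diag = np.array([np.sum(rng.standard_normal(p) ** 2) / p for _ in range(trials)])

print(np.std(cross) * np.sqrt(p))        # O(1)
print(np.std(diag - 1.0) * np.sqrt(p))   # O(1)
```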

Then we provide the concentration inequalities when λ\lambda is in the slowly divergent region. Indeed, in this region, the results of the diagonal parts of Lemma A.4 still apply.

Lemma A.5.

Suppose (1.4)-(1.6) hold. Assume d1d\geq 1 is fixed, λl=nαl\lambda_{l}=n^{\alpha_{l}} with 0<αl<10<\alpha_{l}<1 for all 1ld1\leq l\leq d and recall (A.3). Then when iji\neq j, we have

1p|𝐱i𝐱j|\displaystyle\frac{1}{p}\left|\mathbf{x}_{i}^{\top}\mathbf{x}_{j}\right| l=1dλln+1n,\displaystyle\,\prec\frac{\sum_{l=1}^{d}\lambda_{l}}{n}+\frac{1}{\sqrt{n}},
(A.11) |1p𝐱i22(1+l=1dλlp)|\displaystyle\left|\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-\left(1+\frac{\sum_{l=1}^{d}\lambda_{l}}{p}\right)\right| l=1dλln+1n.\displaystyle\,\prec\frac{\sum_{l=1}^{d}\lambda_{l}}{n}+\frac{1}{\sqrt{n}}.
Proof.

We only discuss the second bound; the first one can be dealt with in a similar way. We define λfl=λl\lambda_{fl}=\lfloor\lambda_{l}\rfloor as the floor of λl\lambda_{l}. We again adapt the notation (A.3) in the proof. Note that using (A.10) we have

l=1dλflpzil2\displaystyle\sum_{l=1}^{d}\frac{\lambda_{fl}}{p}z_{il}^{2} +1pj=1p𝐲ij21p𝐱i22l=1d2λlpzil𝐲il\displaystyle+\frac{1}{p}\sum_{j=1}^{p}\mathbf{y}_{ij}^{2}\leq\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-\sum_{l=1}^{d}\frac{2\sqrt{\lambda_{l}}}{p}z_{il}\mathbf{y}_{il}
(A.12) l=1dλfl+1pzil2+1pj=1p𝐲ij2.\displaystyle\leq\sum_{l=1}^{d}\frac{\lambda_{fl}+1}{p}z_{il}^{2}+\frac{1}{p}\sum_{j=1}^{p}\mathbf{y}_{ij}^{2}.

We study the upper bound of the sandwich inequality, and the lower bound follows by the same argument. On one hand, according to (A.6), we have that

1pj=1p𝐲ij2=1+O(n1/2).\frac{1}{p}\sum_{j=1}^{p}\mathbf{y}_{ij}^{2}=1+O_{\prec}(n^{-1/2}).

Moreover, since zilz_{il} is a sub-Gaussian random variable, we have zil21z_{il}^{2}\prec 1. This yields that

i=1dλfl+1pzil2=l=1d(λfl+1)p+O(n1+α),\sum_{i=1}^{d}\frac{\lambda_{fl}+1}{p}z_{il}^{2}=\frac{\sum_{l=1}^{d}(\lambda_{fl}+1)}{p}+O_{\prec}(n^{{-1}+\alpha}),

where we used the fact that αl<1\alpha_{l}<1 for all 1ld.1\leq l\leq d. Similarly, we have

|λlpzil𝐲il|λlp.\left|\frac{\sqrt{\lambda_{l}}}{p}z_{il}\mathbf{y}_{il}\right|\prec\frac{\sqrt{\lambda_{l}}}{p}.

Combining the above bounds with (A.12), we readily obtain that

1p𝐱i221+l=1dλlp+O(l=1dλln+1n).\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}\leq 1+\frac{\sum_{l=1}^{d}\lambda_{l}}{p}+O_{\prec}\left(\frac{\sum_{l=1}^{d}\lambda_{l}}{n}+\frac{1}{\sqrt{n}}\right).

A.4. Some results for Gram matrices

Denote the Gram matrix of the point clouds {𝐱i}p\{\mathbf{x}_{i}\}\subset\mathbb{R}^{p} in the form of (1.4) as

(A.13) 𝐆x=1p𝐗𝐗,𝐗=[𝐱1𝐱n]p×n.\mathbf{G}_{x}=\frac{1}{p}\mathbf{X}^{\top}\mathbf{X},\ \mathbf{X}=[\mathbf{x}_{1}\cdots\mathbf{x}_{n}]\in\mathbb{R}^{p\times n}.

The eigenvalues of 𝐆x\mathbf{G}_{x} have been thoroughly studied in the literature; see [8, 7, 4, 51], among others. We summarize those results relevant to this paper in the following lemma.

Lemma A.6.

Suppose (1.4)-(1.7) hold true and d1d\geq 1 is fixed. Assume that there exists some 0dd0\leq d^{\prime}\leq d so that λ1λ2λd>cnλd+1λd0.\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{d^{\prime}}>\sqrt{c_{n}}\geq\lambda_{d^{\prime}+1}\geq\cdots\geq\lambda_{d}\geq 0. Then, if d>0d^{\prime}>0, we have that for 1jd1\leq j\leq d^{\prime},

|λj(𝐆x)(1+λj)(1+cnλj1)|n1/2λj(λjcn)1/2.|\lambda_{j}(\mathbf{G}_{x})-(1+\lambda_{j})(1+c_{n}\lambda_{j}^{-1})|\prec n^{-1/2}\sqrt{\lambda_{j}}(\lambda_{j}-\sqrt{c_{n}})^{1/2}.

If d<dd^{\prime}<d, we have that for d+1jdd^{\prime}+1\leq j\leq d,

|λj(𝐆x)γμcn,1(1)|n2/3.|\lambda_{j}(\mathbf{G}_{x})-\gamma_{\mu_{c_{n},1}}(1)|\prec n^{-2/3}.

Moreover, for 1i(1ϵ)n1\leq i\leq(1-\epsilon)n, where ϵ>0\epsilon>0 is a fixed small constant, we have

(A.14) |λi+d(𝐆x)γμcn,1(i)|n2/3i1/3.\left|\lambda_{i+d}(\mathbf{G}_{x})-\gamma_{\mu_{c_{n},1}}(i)\right|\prec n^{-2/3}i^{-1/3}\,.
Proof.

We mention that the results under our setup have been originally proved in [8] (see Section 1.2 and Theorems 2.3 and 2.7 therein), and stated in the current form. ∎

The following lemma provides a control for the Hadamard product related to the Gram matrix. In our setup with the point cloud 𝒳\mathcal{X}, as discussed around (1.6), 𝐱i\mathbf{x}_{i} itself is a sub-Gaussian random vector with a spiked Σ\Sigma as in (1.6). We thus can extract the probability and bounds by tracking the proof in [27, Step (iv) on Page 21 of the proof of Theorem 2.1].

Lemma A.7.

Suppose (1.5)-(1.7) hold true, d1d\geq 1, and λl=nαl\lambda_{l}=n^{\alpha_{l}} for l=1,,dl=1,\ldots,d, where 0<αl<10<\alpha_{l}<1. Let 𝐆x\mathbf{G}_{x} be the Gram matrix associated with the point cloud 𝒳.\mathcal{X}. Denote

(A.15) 𝐏x:=𝐆xdiag{𝐆x(1,1),,𝐆x(n,n)}.\mathbf{P}_{x}:=\mathbf{G}_{x}-\text{diag}\{\mathbf{G}_{x}(1,1),\ldots,\mathbf{G}_{x}(n,n)\}\,.

For some constant C>0C>0, when nn is sufficiently large, with probability at least 1O(n1/2)1-O(n^{-1/2}), we have

𝐏x𝐏xl=1d(λl+1)2+p1p2(𝟏𝟏𝐈n)\displaystyle\left\|\mathbf{P}_{x}\circ\mathbf{P}_{x}-\frac{\sum_{l=1}^{d}(\lambda_{l}+1)^{2}+p-1}{p^{2}}(\mathbf{1}\mathbf{1}^{\top}-\mathbf{I}_{n})\right\|
(A.16) \displaystyle\leq Cmax{n1/4,l=1dλlp}.\displaystyle\,C\max\left\{n^{-1/4},\frac{\sum_{l=1}^{d}\lambda_{l}}{p}\right\}.

We also need the following lemma for the Gram matrix of noisy signals.

Lemma A.8.

Suppose (1.4)-(1.8) hold true, d1d\geq 1, λl=nαl\lambda_{l}=n^{\alpha_{l}}, where 0<αlαl1α1<10<\alpha_{l}\leq\alpha_{l-1}\leq\cdots\leq\alpha_{1}<1, for l=1,,dl=1,\ldots,d. Let 𝐏y\mathbf{P}_{y} be denoted as (A.15) for the point cloud 𝒴.\mathcal{Y}. Then we have

(A.17) 𝐏x𝐏x𝐏y𝐏y{l=1dλln,0<α1<0.5(l=1dλl)2n,otherwise.\displaystyle\left\|\mathbf{P}_{x}\circ\mathbf{P}_{x}-\mathbf{P}_{y}\circ\mathbf{P}_{y}\right\|\prec\begin{cases}\frac{\sum_{l=1}^{d}\lambda_{l}}{\sqrt{n}},&0<\alpha_{1}<0.5\\ \frac{(\sum_{l=1}^{d}\lambda_{l})^{2}}{n},&\text{otherwise}.\end{cases}
Proof.

When iji\neq j,

1p2((𝐱i𝐱j)2(𝐲i𝐲j)2)\displaystyle\frac{1}{p^{2}}((\mathbf{x}_{i}^{\top}\mathbf{x}_{j})^{2}-(\mathbf{y}_{i}^{\top}\mathbf{y}_{j})^{2})
(A.18) =\displaystyle= l=1dλl(zilyjl+zjlyil)+𝐳i𝐳jp𝐱i𝐱j+𝐲i𝐲jp.\displaystyle\,\frac{\sum_{l=1}^{d}\sqrt{\lambda_{l}}(z_{il}y_{jl}+z_{jl}y_{il})+\mathbf{z}_{i}^{\top}\mathbf{z}_{j}}{p}\frac{\mathbf{x}_{i}^{\top}\mathbf{x}_{j}+\mathbf{y}_{i}^{\top}\mathbf{y}_{j}}{p}\,.

By the assumption that the standard deviations of the entries of zilz_{il} and yily_{il} are of order λl\sqrt{\lambda_{l}} and 11 respectively, we have

l=1dλl(zilyjl+zjlyil)+𝐳i𝐳jpl=1dλln,\frac{\sum_{l=1}^{d}\sqrt{\lambda_{l}}(z_{il}y_{jl}+z_{jl}y_{il})+\mathbf{z}_{i}^{\top}\mathbf{z}_{j}}{p}\prec\frac{\sum_{l=1}^{d}\lambda_{l}}{n}\,,

and by (A.5), we have

𝐱i𝐱j+𝐲i𝐲jp1n\frac{\mathbf{x}_{i}^{\top}\mathbf{x}_{j}+\mathbf{y}_{i}^{\top}\mathbf{y}_{j}}{p}\prec\frac{1}{\sqrt{n}}

when α1<0.5\alpha_{1}<0.5 and

𝐱i𝐱j+𝐲i𝐲jpl=1dλln\frac{\mathbf{x}_{i}^{\top}\mathbf{x}_{j}+\mathbf{y}_{i}^{\top}\mathbf{y}_{j}}{p}\prec\frac{\sum_{l=1}^{d}\lambda_{l}}{n}

when 0.5α1<10.5\leq\alpha_{1}<1. Therefore, using the Gershgorin circle theorem, we conclude the claimed bound. ∎

A.5. Some results for affinity matrices

Lemma A.9.

Suppose (1.1) and (1.4)-(1.8) hold true, d1d\geq 1, αl=0\alpha_{l}=0 for l=1,,dl=1,\ldots,d in (1.8) (i.e., λ\lambda is bounded), and h=ph=p in (1.1). Let Φ=(ϕ1,,ϕn)\Phi=(\phi_{1},\ldots,\phi_{n})^{\top} with ϕi=1p𝐱i22(1+l=1dλl/p)\phi_{i}=\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-(1+\sum_{l=1}^{d}\lambda_{l}/p), i=1,2,,ni=1,2,\cdots,n. Denote

(A.19) 𝐊d𝐊d(τ)=\displaystyle\mathbf{K}_{d}\equiv\mathbf{K}_{d}(\tau)= 2f(τ)p1𝐗𝐗+ς𝐈n+Sh0(τ)+Sh1(τ)+Sh2(τ),\displaystyle\,-2f^{\prime}(\tau)p^{-1}\mathbf{X}^{\top}\mathbf{X}+\varsigma\mathbf{I}_{n}+\mathrm{Sh}_{0}(\tau)+\mathrm{Sh}_{1}(\tau)+\mathrm{Sh}_{2}(\tau),

where f(x)f(x) is a general kernel function satisfying the conditions in Remark 2.4, ς\varsigma is defined in (2.6),

(A.20) Sh0(τ):=f(τ)𝟏𝟏,\displaystyle\mathrm{Sh}_{0}(\tau):=f(\tau)\mathbf{1}\mathbf{1}^{\top},
Sh1(τ):=f(τ)[𝟏Φ+Φ𝟏],\displaystyle\mathrm{Sh}_{1}(\tau):=f^{\prime}(\tau)[\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top}],
Sh2(τ):=f(2)(τ)2[𝟏(ΦΦ)+(ΦΦ)𝟏+2ΦΦ\displaystyle\mathrm{Sh}_{2}(\tau):=\frac{f^{(2)}(\tau)}{2}\Big{[}\mathbf{1}(\Phi\circ\Phi)^{\top}+(\Phi\circ\Phi)\mathbf{1}^{\top}+2\Phi\Phi^{\top}
(A.21) +4p2(l=1d(λl+1)2+p)𝟏𝟏],\displaystyle\qquad\qquad\quad+\frac{4}{p^{2}}\Big{(}\sum_{l=1}^{d}(\lambda_{l}+1)^{2}+p\Big{)}\mathbf{1}\mathbf{1}^{\top}\Big{]},

and \circ is the Hadamard product. Then for some small constant ξ>0\xi>0, when nn is sufficiently large, we have that with probability at least 1O(n1/2)1-O(n^{-1/2})

(A.22) 𝐖𝐊dnξ.\|\mathbf{W}-\mathbf{K}_{d}\|\leq n^{-\xi}\,.
Proof.

See [18, Lemma A.10] for the case d=1d=1. For d>1d>1, by modifying the proof of [18, Lemma A.10] using Lemmas A.4A.8 and the assumption that dd is fixed, we get the result. ∎

We also need the following lemma, which is of independent interest.

Lemma A.10.

Suppose (1.1) and (1.4)-(1.8) hold true, d1d\geq 1 is fixed and α1α2αd1\alpha_{1}\geq\alpha_{2}\geq\ldots\geq\alpha_{d}\geq 1. For the matrix 𝐖1\mathbf{W}_{1} associated with the clean signal defined in (1.9) and any δ>0\delta>0, there exists Cδ>0C_{\delta}>0 such that, with high probability,

(A.23) 𝐖1(nCδnδ)exp(υγnα112(1δ))+Cδnδ,\|\mathbf{W}_{1}\|\leq(n-C_{\delta}n^{\delta})\exp(-\upsilon\gamma n^{\alpha_{1}-1-2(1-\delta)})+C_{\delta}n^{\delta}\,,

where γ\gamma is defined in (1.7).

Proof.

Assume d=1d=1 and denote α=α1\alpha=\alpha_{1}. Throughout the proof, we adapt the notation (A.3). For an arbitrarily small constant ϵ>0\epsilon>0 and a given δ>0,\delta>0, let Cδ>0C_{\delta}>0 be some constant depending on δ\delta, and denote the event 𝒜(δ)\mathcal{A}(\delta) as

𝒜(δ):={There existCδnδzissuch that|zi|nϵ}.\mathcal{A}(\delta):=\left\{\text{There exist}\ C_{\delta}n^{\delta}\ z_{i}^{\prime}s\ \text{such that}\ |z_{i}|\leq n^{-\epsilon}\right\}\,.

Note that ziz_{i} is the signal part. Denote :=(|zi|nϵ).\ell:=\mathbb{P}(|z_{i}|\leq n^{-\epsilon}). Due to the independence, we find that

(A.24) (𝒜(δ))=(nCδnδ)Cδnδ(1)nCδnδ.\mathbb{P}(\mathcal{A}(\delta))={n\choose C_{\delta}n^{\delta}}\ell^{C_{\delta}n^{\delta}}(1-\ell)^{n-C_{\delta}n^{\delta}}.

Using Stirling’s formula, when nn is sufficiently large, we find that

(A.25) (nCδnδ)=\displaystyle{n\choose C_{\delta}n^{\delta}}= n!(nCδnδ)!(Cδnδ)!\displaystyle\,\frac{n!}{(n-C_{\delta}n^{\delta})!(C_{\delta}n^{\delta})!}
\displaystyle\asymp 2πnn+1/2exp(n)(Cδnδ)Cδnδ1/2exp(Cδnδ)\displaystyle\,\sqrt{2\pi}n^{n+1/2}\exp(-n)(C_{\delta}n^{\delta})^{-C_{\delta}n^{\delta}-1/2}\exp(C_{\delta}n^{\delta})
×(nCδnδ)n+Cδnδ1/2exp(nCδnδ)\displaystyle\qquad\times(n-C_{\delta}n^{\delta})^{-n+C_{\delta}n^{\delta}-1/2}\exp(n-C_{\delta}n^{\delta})
\displaystyle\asymp 2πCδn1δnCδnδ(1+Cδnδnnδ)nCδnδ(nCδnδ)Cδnδ\displaystyle\,\sqrt{\frac{2\pi}{C_{\delta}}}\sqrt{\frac{n^{1-\delta}}{n-C_{\delta}n^{\delta}}}\left(1+\frac{C_{\delta}n^{\delta}}{n-n^{\delta}}\right)^{n-C_{\delta}n^{\delta}}\left(\frac{n}{C_{\delta}n^{\delta}}\right)^{C_{\delta}n^{\delta}}
\displaystyle\asymp 2πCδn(1δ)Cδnδδ/2exp(Cδnδ).\displaystyle\,\sqrt{\frac{2\pi}{C_{\delta}}}n^{(1-\delta)C_{\delta}n^{\delta}-\delta/2}\exp(C_{\delta}n^{\delta})\,.

We next provide an estimate for .\ell. Denote the probability density function (PDF) of zi2z_{i}^{2} as ϱ\varrho and that of ziz_{i} as ϱ~.\widetilde{\varrho}. Recall the assumption that {zi}i=1n\{z_{i}\}_{i=1}^{n} are continuous random variables with ϱ~(0)0\widetilde{\varrho}(0)\neq 0. By using the fact that ϱ(y)=(ϱ~(y)+ϱ~(y))/(2y)\varrho(y)=(\widetilde{\varrho}(\sqrt{y})+\widetilde{\varrho}(-\sqrt{y}))/(2\sqrt{y}), we find that

ϱ(y)O(1y).\varrho(y)\asymp O\left(\frac{1}{\sqrt{y}}\right)\,.

Consequently, we have that for any small y>0,y>0,

(A.26) (zi2y)O(y).\mathbb{P}(z_{i}^{2}\leq y)\asymp O(\sqrt{y})\,.

By (A.26), we immediately have that

(A.27) nϵ.\ell\asymp n^{-\epsilon}\,.

Since 0\ell\rightarrow 0 as nn\rightarrow\infty, we have

(A.28) Cδnδ(1)nCδnδexp(n)Cδnδ,n.\displaystyle\ell^{C_{\delta}n^{\delta}}(1-\ell)^{n-C_{\delta}n^{\delta}}\asymp\exp(-\ell n)\ell^{C_{\delta}n^{\delta}},\ n\rightarrow\infty\,.

By plugging (A.25) and (A.28) into (A.24), we obtain

(𝒜(δ))\displaystyle\mathbb{P}(\mathcal{A}(\delta))\asymp exp(n+Cδ(1δ)nδlogn+Cδnδlog)\displaystyle\,\exp(-\ell n+C_{\delta}(1-\delta)n^{\delta}\log n+C_{\delta}n^{\delta}\log\ell)
\displaystyle\asymp exp(n1ϵ+Cδ(1δ)nδlognCδϵnδlogn+Cδnδ)\displaystyle\,\exp\left(-n^{1-\epsilon}+C_{\delta}(1-\delta)n^{\delta}\log n-C_{\delta}\epsilon n^{\delta}\log n+C_{\delta}n^{\delta}\right)
(A.29) =\displaystyle= exp(n1ϵ+Cδ[1δϵ+1/log(n)]nδlogn),\displaystyle\,\exp\left(-n^{1-\epsilon}+C_{\delta}\left[1-\delta-\epsilon+1/\log(n)\right]n^{\delta}\log n\right),

where the second asymptotic comes from plugging in (A.27). In light of (A.29), to make 𝒜(δ)\mathcal{A}(\delta) a high probability event, we may take ϵ+δ=1\epsilon+\delta=1, which leads to

(𝒜(δ))exp((Cδ1)nδ).\mathbb{P}(\mathcal{A}(\delta))\asymp\exp\left((C_{\delta}-1)n^{\delta}\right).

We can choose 1<Cδ21<C_{\delta}\leq 2 such that for any large constant D>0,D>0, when nn is large enough, we have that

(A.30) (𝒜(δ))1nD.\mathbb{P}(\mathcal{A}(\delta))\geq 1-n^{-D}\,.

Note that 𝐖1(i,j)=exp(υλ(zizj)2/p)\mathbf{W}_{1}(i,j)=\exp\left(-\upsilon\lambda(z_{i}-z_{j})^{2}/p\right). On the event 𝒜(δ)\mathcal{A}(\delta), by the Gershgorin circle theorem and the definition of 𝐖1\mathbf{W}_{1}, we find that

(A.31) 𝐖1(nCδnδ)exp(υγnα12(1δ))+Cδnδ,\|\mathbf{W}_{1}\|\leq(n-C_{\delta}n^{\delta})\exp(-\upsilon\gamma n^{\alpha-1-2(1-\delta)})+C_{\delta}n^{\delta}\,,

where in the inequality we consider the worst scenario such that all the elements in 𝒜(δ)\mathcal{A}(\delta) are either on the same row or column. This finishes the claim with high probability.

To show the case when d>1d>1, we need a slight modification of (A.24). Suppose α1α2αd1\alpha_{1}\geq\alpha_{2}\geq\ldots\geq\alpha_{d}\geq 1. Since zikz_{ik}, k=1,,dk=1,\ldots,d, are in general dependent, (A.24) has to be modified to accommodate this additional dependence. In fact, our results hold true by replacing α\alpha in (A.31) with α1\alpha_{1}, since we have 𝐖1(i,j)=k=1dexp(υλk(zikzjk)2/p)exp(υλ1(zi1zj1)2/p)\mathbf{W}_{1}(i,j)=\prod_{k=1}^{d}\exp\left(-\upsilon\lambda_{k}(z_{ik}-z_{jk})^{2}/p\right)\leq\exp\left(-\upsilon\lambda_{1}(z_{i1}-z_{j1})^{2}/p\right), which reduces our discussion to the case d=1.d=1.

A.6. Orthogonal polynomials and kernel expansion

We first recall the celebrated Mehler’s formula (for instance, see equation (5) in [49] or [33])

(A.32) 11t2\displaystyle\frac{1}{\sqrt{1-t^{2}}} exp(2txyt2(x2+y2)2(1t2))=n=0tnn!H~n(x)H~n(y),\displaystyle\exp\left(\frac{2txy-t^{2}(x^{2}+y^{2})}{2(1-t^{2})}\right)=\sum_{n=0}^{\infty}\frac{t^{n}}{n!}\widetilde{H}_{n}(x)\widetilde{H}_{n}(y),

where H~m(x)\widetilde{H}_{m}(x) is the scaled Hermite polynomial defined as

H~m(x)=Hm(x/2)2m\widetilde{H}_{m}(x)=\frac{H_{m}(x/\sqrt{2})}{\sqrt{2^{m}}}

and Hm(x)H_{m}(x) is the standard Hermite polynomial defined as

Hm(x)=(1)mexp(x2)dmexp(x2)dxm.H_{m}(x)=(-1)^{m}\exp(x^{2})\frac{\mathrm{d}^{m}\exp(-x^{2})}{\mathrm{d}x^{m}}.

We mention that H~m(x)\widetilde{H}_{m}(x) is referred to as the probabilistic version of the Hermite polynomial. We will see later that in our proof, the above Mehler’s formula provides a convenient way to handle the interaction term when we write the affinity matrix as a summation of rank one matrices.
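As a sanity check, (A.32) can be verified numerically by truncating the series; the sketch below evaluates the probabilists' Hermite polynomials H~n\widetilde{H}_{n} through numpy.polynomial.hermite_e, and the truncation level N=60N=60 and the test points are arbitrary choices of ours.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

def mehler_lhs(x, y, t):
    """Closed-form left-hand side of (A.32)."""
    return np.exp((2 * t * x * y - t ** 2 * (x ** 2 + y ** 2)) / (2 * (1 - t ** 2))) \
        / np.sqrt(1 - t ** 2)

def mehler_rhs(x, y, t, N=60):
    """Right-hand side of (A.32), truncated at N terms; He.hermeval evaluates
    series in the probabilists' Hermite polynomials, i.e., in H~_n."""
    total, fact = 0.0, 1.0
    for n in range(N):
        if n > 0:
            fact *= n                                  # fact = n!
        coeff = np.zeros(n + 1)
        coeff[n] = 1.0                                 # coefficients selecting He_n
        total += t ** n / fact * He.hermeval(x, coeff) * He.hermeval(y, coeff)
    return total

print(mehler_lhs(0.3, -0.7, 0.5), mehler_rhs(0.3, -0.7, 0.5))   # the two values agree
```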

A.7. More remarks

A.7.1. Zeroing-out technique

Here we discuss the trick of zeroing-out diagonal terms proposed in [29]. First of all, we summarize the idea and restate the results in [29]. Then we modify it to our setting (1.4)–(1.6). To simplify the discussion, we focus our discussion on the setting d=1d=1 with the signal strength λnα\lambda\asymp n^{\alpha}, where α0\alpha\geq 0. The zeroed out affinity matrix is defined as

𝐖̊=𝐖(𝟏𝟏𝐈n),\mathring{\mathbf{W}}=\mathbf{W}\circ\left(\mathbf{1}\mathbf{1}^{\top}-\mathbf{I}_{n}\right),

where 𝟏n\mathbf{1}\in\mathbb{R}^{n} is the vector with 11 in all entries. Denote the associated degree matrix as 𝐃̊.\mathring{\mathbf{D}}. Consequently, the modified transition matrix is

(A.33) 𝐀̊=𝐃̊1𝐖̊.\mathring{\mathbf{A}}=\mathring{\mathbf{D}}^{-1}\mathring{\mathbf{W}}.
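A minimal sketch of the construction (A.33) (the function name is ours and purely illustrative) is as follows.

```python
import numpy as np

def zeroed_out_transition(W):
    """Zeroing-out trick of [29]: drop the diagonal of W before the
    row normalization, cf. (A.33)."""
    W_ring = W.copy()
    np.fill_diagonal(W_ring, 0.0)        # W o (1 1^T - I_n)
    deg = W_ring.sum(axis=1)             # diagonal entries of the degree matrix
    return W_ring / deg[:, None]         # D^{-1} W with the zeroed-out matrices
```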

Recall that the transition matrix for the signal part is defined as 𝐀1=𝐃11𝐖1\mathbf{A}_{1}=\mathbf{D}_{1}^{-1}\mathbf{W}_{1}. It can be concluded from [29] that with high probability the spectrum of 𝐀̊\mathring{\mathbf{A}} is close to that of 𝐀1\mathbf{A}_{1} in that

(A.34) 𝐀̊𝐀1=o(1),\|\mathring{\mathbf{A}}-\mathbf{A}_{1}\|=o_{\mathbb{P}}(1),

provided the following two conditions are satisfied:

(1). The signal strength satisfies

α>12.\alpha>\frac{1}{2}.

(2). The off-diagonal entries of the signal kernel affinity matrix should satisfy that

(A.35) infiji𝐖1(i,j)nγ>0,\inf_{i}\sum_{j\neq i}\frac{\mathbf{W}_{1}(i,j)}{n}\geq\gamma>0,

for some universal constant γ>0.\gamma>0. Even though [29] did not provide a detailed discussion on the bandwidth selection, the assumption (A.35) imposes a natural condition on the bandwidth.

We now explain how the zeroing-out trick is related to our approach in the large signal-to-noise ratio region. Recall that the transition matrix for the observation is defined as 𝐀=𝐃1𝐖\mathbf{A}=\mathbf{D}^{-1}\mathbf{W}. As shown in part 2) of Theorem 3.1, when the signal strength is α>1\alpha>1 and the bandwidth is either hλh\asymp\lambda or selected adaptively according to the method proposed in Section 3.2, the spectrum of 𝐀\mathbf{A} will be close to that of 𝐀1\mathbf{A}_{1} with high probability, i.e.,

𝐀𝐀1=o(1).\|\mathbf{A}-\mathbf{A}_{1}\|=o_{\mathbb{P}}(1).

We emphasize that when α>1\alpha>1 and hλh\asymp\lambda or selected adaptively according to the method proposed in Section 3.2, it can be concluded from the proof of Corollary 3.2 that (A.35) holds with high probability. Consequently, together with (A.34), we can conclude that when the signal-to-noise ratio is large in the sense α>1\alpha>1 and the bandwidth is selected properly as in Section 3.2, the spectrum of 𝐀\mathbf{A} is asymptotically the same as the zeroing-out matrix 𝐀̊.\mathring{\mathbf{A}}. We mention that although in this setting our approach and results are asymptotically equivalent to the zeroing-out trick, our method provides an adaptive and theoretically justified method to select the bandwidth instead of choosing a fixed bandwidth according to (A.35). In fact, in the simulation of [29], the authors used a similar approach empirically.

When the signal-to-noise ratio is “smaller” so that 1/2<α1,1/2<\alpha\leq 1, the zeroing-out trick has a significant impact on the spectrum. In this region, the proposed bandwidth selection algorithm will choose a bandwidth satisfying hph\asymp p, and the spectral behavior of the noisy GL is recorded in Corollary 2.10, part 1) of Theorem 3.1 and Corollary 3.2. We see that the spectrum of 𝐀\mathbf{A} is no longer close to that of 𝐀1\mathbf{A}_{1} under our setup. However, the zeroing-out trick works provided (A.35) holds. Therefore, when 1/2<α11/2<\alpha\leq 1 and the bandwidth is properly selected, the result can be improved using the zeroing-out trick in the sense of (A.34).

Finally, when the SNR is “very small” so that α1/2,\alpha\leq 1/2, both 𝐀̊\mathring{\mathbf{A}} and 𝐀\mathbf{A} are dominated by the noise. In this case, we are not able to extract useful information about the signal, even with the zeroing-out trick. For an illustration, in Figure 14, we show the performance of 𝐀̊\mathring{\mathbf{A}} and 𝐀\mathbf{A} in different SNR regions by comparing some of their eigenfunctions (eigenvectors) with those of the clean signal matrix 𝐀1.\mathbf{A}_{1}. We can conclude that the zeroing-out trick can be beneficial when 0.5<α10.5<\alpha\leq 1 provided the bandwidth is carefully selected.

Figure 14. Comparison of different GLs. We compare the 3rd eigenfunctions (eigenvectors) of 𝐀1,\mathbf{A}_{1}, 𝐀\mathbf{A} with bandwidth constructed using Algorithm 1, and 𝐀̊\mathring{\mathbf{A}} with h=35.h=35. Here p=200,n=400p=200,n=400 and the random variables are Gaussian. In the legend, Clean means 𝐀1,\mathbf{A}_{1}, Adap means 𝐀\mathbf{A} constructed using Algorithm 1, and Zeroing means 𝐀̊\mathring{\mathbf{A}} with h=35.h=35. We can see that the zeroing-out trick with the bandwidth chosen as h=35h=35 outperforms our method when 0.5<α1.0.5<\alpha\leq 1.

A.7.2. More remarks on d>1d>1

We continue the discussion in Remark 2.9 and provide some simulations with d=2d=2. We assume that α1α20\alpha_{1}\geq\alpha_{2}\geq 0, and discuss two cases with different α1\alpha_{1} and α2.\alpha_{2}.

First, we discuss the setting when the bulk eigenvalues are governed by the MP law, i.e., in the region 0α2α1<1.0\leq\alpha_{2}\leq\alpha_{1}<1. Recall that the shifted MP law νλ\nu_{\lambda} defined in (2.7) depends on the signal level via ττ(λ)\tau\equiv\tau(\lambda) in (2.5), and when α<1,\alpha<1, τ2\tau\rightarrow 2 as n,n\rightarrow\infty, which is independent of λ\lambda asymptotically. We can thus always set λ=0\lambda=0 in (2.5) and use ν0\nu_{0} for the MP law as in (2.8) and (2.12). Therefore, when 0α2α1<1,0\leq\alpha_{2}\leq\alpha_{1}<1, the bulk distribution is the same as that in (2.8) and (2.12), which is characterized by the shifted MP law,

ν0=Tζ(0)μcn,2f(2),\nu_{0}=T_{\zeta(0)}\mu_{c_{n},-2f^{\prime}(2)},

where TT is the shift operator defined in (2.1), μcn,2f(2)\mu_{c_{n},-2f^{\prime}(2)} is the MP law defined in (2.3) with σ2\sigma^{2} replaced by 2f(2)-2f^{\prime}(2), and ζ(0)\zeta(0) is defined by inserting λ=0\lambda=0 (or equivalently τ(0)=2\tau(0)=2) in (2.6); that is,

ζ(0)=f(0)+2f(2)f(2).\zeta(0)=f(0)+2f^{\prime}(2)-f(2)\,.

In general, although the bulk distribution is the same for different finite d,d, the number of outliers can vary according to λi\lambda_{i}, 1id.1\leq i\leq d. On the technical level, the proofs in Appendices B.1 and B.2 follow after some minor modifications, especially in the Taylor expansion. For example, in (B.1), the key parameter τ\tau should be defined as τ=2((λ1+λ2)/p+1)\tau=2((\lambda_{1}+\lambda_{2})/p+1) when d=2d=2. For an illustration, in Figure 15 we show that the bulk eigenvalues are essentially identical for the settings d=1d=1 and d=2d=2 when the exponents are less than one, for various values of cn=0.5,1,2c_{n}=0.5,1,2.

Figure 15. Bulk eigenvalues of 𝐖\mathbf{W} for different intrinsic dimensions dd\in\mathbb{N} in the low SNR region. For the construction of 𝐖,\mathbf{W}, we consider h=p.h=p. We compare the bulk eigenvalues for two kernel affinity matrices constructed from two point clouds with different values of dd. In the simulations, we use Gaussian random vectors with covariance matrix (1.6). Specifically, for the setting d=1,d=1, we choose λ1=p0.4,λ2==λp=1,\lambda_{1}=p^{0.4},\lambda_{2}=\cdots=\lambda_{p}=1, and for the setting d=2,d=2, we choose λ1=p0.4,λ2=p0.1,λ3==λp=1.\lambda_{1}=p^{0.4},\lambda_{2}=p^{0.1},\lambda_{3}=\cdots=\lambda_{p}=1. We also consider the comparison for three different settings of cn=n/pc_{n}=n/p with n=200.n=200. We can see that the bulk eigenvalues for these two settings are fairly close to each other. For definiteness and simplicity, we plot the bulk eigenvalues starting from λ10(𝐖).\lambda_{10}(\mathbf{W}).
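For readers who wish to reproduce such a comparison, the following numpy sketch builds, under the Gaussian setup described in the caption and with h=p, the kernel affinity matrices for d=1 and d=2 and extracts their bulk eigenvalues; the sample sizes and the seed are arbitrary illustrative choices.

```python
import numpy as np

def bulk_eigenvalues(spikes, n=200, p=400, seed=0):
    """Eigenvalues of the Gaussian-kernel affinity matrix W with bandwidth h = p,
    for Gaussian data whose covariance is diag(spikes, 1, ..., 1)."""
    rng = np.random.default_rng(seed)
    cov_diag = np.ones(p)
    cov_diag[:len(spikes)] = spikes
    X = rng.standard_normal((p, n)) * np.sqrt(cov_diag)[:, None]
    G = X.T @ X
    sq = np.diag(G)
    W = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * G) / p)     # bandwidth h = p
    return np.sort(np.linalg.eigvalsh(W))[::-1]

p = 400                                                         # c_n = n / p = 0.5
eigs_d1 = bulk_eigenvalues([p ** 0.4], p=p)                     # d = 1
eigs_d2 = bulk_eigenvalues([p ** 0.4, p ** 0.1], p=p)           # d = 2
# Compare the bulks, starting from the 10th eigenvalue as in Figure 15.
print(np.abs(eigs_d1[9:60] - eigs_d2[9:60]).max())
```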

Second, we discuss the region when α21\alpha_{2}\geq 1. As in Theorem 2.7, we can show that the spectrum of 𝐖\mathbf{W} is close to those of the matrices defined in (2.14) and (2.16), which depend on the clean signal part 𝐖1\mathbf{W}_{1}. As in the case when d=1d=1, the spectrum of 𝐖1\mathbf{W}_{1} may not follow the MP law and depends on both λ1\lambda_{1} and λ2\lambda_{2} as well as the chosen bandwidth. This dependence suggests that the spectrum of 𝐖\mathbf{W} might be different from that when d=1d=1. In Figure 16, we show numerically how the bulk under the setup d=2d=2 and h=ph=p is different from that when d=1d=1 and h=ph=p.

Third, when α11>α20,\alpha_{1}\geq 1>\alpha_{2}\geq 0, the spectrum of 𝐖\mathbf{W} will be close to some matrices that depend only on λ1.\lambda_{1}. In this case, the spectrum of 𝐖\mathbf{W} is close to that of the 𝐖1\mathbf{W}_{1} arising in the d=1d=1 case with signal strength λ1.\lambda_{1}. See Figure 17 for an illustration, where we see that the bulks are fairly close to each other, while they may not necessarily follow the MP law.

Figure 16. Comparison of bulk eigenvalues of 𝐖\mathbf{W} for different intrinsic dimensions, d=1d=1 and d=2d=2. When d=2d=2, we have two large signals; that is, α1α21\alpha_{1}\geq\alpha_{2}\geq 1. For the construction of 𝐖,\mathbf{W}, we consider h=p.h=p. In this simulation, we use Gaussian random vectors with the covariance matrix (1.6). When d=1,d=1, we choose λ1=p2,λ2==λp=1\lambda_{1}=p^{2},\lambda_{2}=\cdots=\lambda_{p}=1; when d=2,d=2, we choose λ1=p2,λ2=p1.6,λ3==λp=1.\lambda_{1}=p^{2},\lambda_{2}=p^{1.6},\lambda_{3}=\cdots=\lambda_{p}=1. We consider the comparison for three different settings of cn=n/pc_{n}=n/p with n=200n=200. For a better visualization of the bulk eigenvalues, we start from λ10(𝐖).\lambda_{10}(\mathbf{W}). We can see that the bulk eigenvalues for these two settings are different and do not necessarily follow the MP law.
Figure 17. Comparison of bulk eigenvalues of 𝐖\mathbf{W} for different intrinsic dimensions, d=1d=1 and d=2d=2. When d=2d=2, we have one large signal and one small signal; that is, α11>α20\alpha_{1}\geq 1>\alpha_{2}\geq 0. For the construction of 𝐖,\mathbf{W}, we consider h=p.h=p. In the simulations, we use Gaussian random vectors with the covariance matrix (1.6). When d=1,d=1, we choose λ1=p2,λ2==λp=1\lambda_{1}=p^{2},\lambda_{2}=\cdots=\lambda_{p}=1; when d=2,d=2, we choose λ1=p2,λ2=p0.4,λ3==λp=1.\lambda_{1}=p^{2},\lambda_{2}=p^{0.4},\lambda_{3}=\cdots=\lambda_{p}=1. We consider three different settings of cn=n/pc_{n}=n/p with n=200.n=200. For a better visualization of the bulk eigenvalues, we start from λ10(𝐖).\lambda_{10}(\mathbf{W}). We see that the bulk eigenvalues are fairly close to each other.

Appendix B Proof of main results in Section 2

In this section, we prove the main results in Section 2.

B.1. Proof of Theorem 2.3

Recall τ\tau defined in (2.5). Since the proof holds for the general kernel functions described in Remark 2.4, we carry out our analysis with such a general kernel function f(x)f(x).

Proof.

We start by simplifying 𝐖\mathbf{W}. Let δij\delta_{ij} denote the Kronecker delta. By the Taylor expansion, when iji\neq j, we have that

𝐖(i,j)=\displaystyle\mathbf{W}(i,j)= f(τ)+f(τ)[𝐎x(i,j)2𝐏x(i,j)]\displaystyle f(\tau)+f^{\prime}(\tau)\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]
+f(2)(τ)2[𝐎x(i,j)2𝐏x(i,j)]2\displaystyle+\frac{f^{(2)}(\tau)}{2}\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]^{2}
(B.1) +f(3)(ξx(i,j))6[𝐎x(i,j)2𝐏x(i,j)]3,\displaystyle+\frac{f^{(3)}(\xi_{x}(i,j))}{6}\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]^{3},

where 𝐏x(i,j)\mathbf{P}_{x}(i,j) is defined in (A.15), 𝐎x(i,j)\mathbf{O}_{x}(i,j) is defined as

(B.2) 𝐎x(i,j)=(1δij)(𝐱i22+𝐱j22pτ),\mathbf{O}_{x}(i,j)=(1-\delta_{ij})\left(\frac{\|\mathbf{x}_{i}\|_{2}^{2}+\|\mathbf{x}_{j}\|_{2}^{2}}{p}-\tau\right),

and ξx(i,j)\xi_{x}(i,j) is some value between (𝐱i22+𝐱j22)/p(\|\mathbf{x}_{i}\|_{2}^{2}+\|\mathbf{x}_{j}\|_{2}^{2})/p and τ=2(λ/p+1)\tau=2(\lambda/p+1), where τ\tau is defined in (2.5). Consequently, we find that 𝐖\mathbf{W} can be rewritten as

𝐖=\displaystyle\mathbf{W}= f(τ)𝟏𝟏2f(τ)p𝐗𝐗+ς(λ)𝐈\displaystyle\,f(\tau)\mathbf{1}\mathbf{1}^{\top}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}
+[f(τ)𝐎x+f(2)(τ)2𝐇x+f(3)(ξx(i,j))6𝐐x]\displaystyle+\Big{[}f^{\prime}(\tau)\mathbf{O}_{x}+\frac{f^{(2)}(\tau)}{2}\mathbf{H}_{x}+\frac{f^{(3)}(\xi_{x}(i,j))}{6}\mathbf{Q}_{x}\Big{]}
(B.3) +2f(τ)(1pdiag(𝐱12,,𝐱n2)1),\displaystyle+2f^{\prime}(\tau)\left(\frac{1}{p}\text{diag}(\|\mathbf{x}_{1}\|^{2},\ldots,\|\mathbf{x}_{n}\|^{2})-1\right),

where we used the shorthand notations

𝐇x(i,j)=[𝐎x(i,j)2𝐏x(i,j)]2\displaystyle\mathbf{H}_{x}(i,j)=\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]^{2}
𝐐x(i,j)=[𝐎x(i,j)2𝐏x(i,j)]3.\displaystyle\mathbf{Q}_{x}(i,j)=\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]^{3}.

With this expansion, we immediately obtain

𝐖=\displaystyle\mathbf{W}= f(τ)𝟏𝟏2f(τ)p𝐗𝐗+ς(λ)𝐈+f(τ)𝐎x\displaystyle\,f(\tau)\mathbf{1}\mathbf{1}^{\top}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}+f^{\prime}(\tau)\mathbf{O}_{x}
(B.4) +f(2)(τ)2𝐇x+O(n1/2),\displaystyle+\frac{f^{(2)}(\tau)}{2}\mathbf{H}_{x}+O_{\prec}(n^{-1/2})\,,

where the error quantified by OO_{\prec} is in the operator norm, the term 1pdiag(𝐱12,,𝐱n2)1\frac{1}{p}\text{diag}(\|\mathbf{x}_{1}\|^{2},\ldots,\|\mathbf{x}_{n}\|^{2})-1 is controlled by Lemma A.5, and the term 𝐐x\mathbf{Q}_{x} is controlled by the facts that 𝐎x(i,j)(1δij)n1/2\mathbf{O}_{x}(i,j)\prec(1-\delta_{ij})n^{-1/2}, 𝐏x(i,j)(1δij)n1/2\mathbf{P}_{x}(i,j)\prec(1-\delta_{ij})n^{-1/2} and the Gershgorin circle theorem.

Next, we control 𝐎x\mathbf{O}_{x} and 𝐇x\mathbf{H}_{x}. Since 𝐎x=𝟏Φ+Φ𝟏2diag{ϕ1,,ϕn}\mathbf{O}_{x}=\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top}-2\text{diag}\{\phi_{1},\cdots,\phi_{n}\}, we could approximate 𝐎x\mathbf{O}_{x} by 𝟏Φ+Φ𝟏\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top} via

(B.5) 𝐎x(𝟏Φ+Φ𝟏)\displaystyle\|\mathbf{O}_{x}-(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})\| =2diag{ϕ1,,ϕn}n1/2,\displaystyle=\|2\text{diag}\{\phi_{1},\cdots,\phi_{n}\}\|\prec n^{-1/2}\,,

where Φ=(ϕ1,,ϕn)\Phi=(\phi_{1},\ldots,\phi_{n}) with ϕi=1p𝐱i22(1+λ/p)\phi_{i}=\frac{1}{p}\|\mathbf{x}_{i}\|_{2}^{2}-(1+\lambda/p), i=1,2,,ni=1,2,\cdots,n, is defined in (A.19), and the last bound comes from Lemma A.5.

For 𝐇x\mathbf{H}_{x}, we write 𝐇x(i,j)=[𝐎x(i,j)2𝐏x(i,j)]2=𝐎x(i,j)2+4𝐏x(i,j)24𝐎x(i,j)𝐏x(i,j)\mathbf{H}_{x}(i,j)=[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)]^{2}=\mathbf{O}_{x}(i,j)^{2}+4\mathbf{P}_{x}(i,j)^{2}-4\mathbf{O}_{x}(i,j)\mathbf{P}_{x}(i,j) and focus on the term 𝐎x(i,j)𝐏x(i,j).\mathbf{O}_{x}(i,j)\mathbf{P}_{x}(i,j). Since 𝟏Φ𝐏x=𝐏xdiag{ϕ1,,ϕn}\mathbf{1}\Phi^{\top}\circ\mathbf{P}_{x}=\mathbf{P}_{x}\text{diag}\{\phi_{1},\cdots,\phi_{n}\} and Φ𝟏𝐏x=diag{ϕ1,,ϕn}𝐏x\Phi\mathbf{1}^{\top}\circ\mathbf{P}_{x}=\text{diag}\{\phi_{1},\cdots,\phi_{n}\}\mathbf{P}_{x}, while the diagonal part of 𝐎x\mathbf{O}_{x} does not contribute because 𝐏x\mathbf{P}_{x} has zero diagonal, we have (also see the proof of [27, Theorem 2.2])

(B.6) 𝐎x𝐏x=\displaystyle\mathbf{O}_{x}\circ\mathbf{P}_{x}=\, 𝐏xdiag{ϕ1,,ϕn}+diag{ϕ1,,ϕn}𝐏x.\displaystyle\mathbf{P}_{x}\text{diag}\{\phi_{1},\cdots,\phi_{n}\}+\text{diag}\{\phi_{1},\cdots,\phi_{n}\}\mathbf{P}_{x}\,.

Moreover, construct 𝐏y\mathbf{P}_{y} from 𝒴\mathcal{Y} in the same way as (A.15) and write

(B.7) 𝐏x𝐏y=1p(𝒛𝒛+𝒛𝒚+𝒚𝒛)(𝟏𝟏𝐈n),\mathbf{P}_{x}-\mathbf{P}_{y}=\frac{1}{p}\left(\bm{z}\bm{z}^{\top}+\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top}\right)\circ(\bm{1}\bm{1}^{\top}-\mathbf{I}_{n})\,,

where 𝒚:=(𝐲11,,𝐲n1),𝒛:=(𝐳11,,𝐳n1).\bm{y}:=(\mathbf{y}_{11},\cdots,\mathbf{y}_{n1})^{\top},\bm{z}:=(\mathbf{z}_{11},\cdots,\mathbf{z}_{n1})^{\top}. We find that

\|\mathbf{P}_{x}-\mathbf{P}_{y}\|\leq\frac{1}{p}\left(\|\bm{z}\bm{z}^{\top}+\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top}\|+\|(\bm{z}\bm{z}^{\top}+\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top})\circ\mathbf{I}_{n}\|\right).

Note that 𝒛𝒛+𝒛𝒚+𝒚𝒛/p(𝒛+𝒚2+𝒚2)/p(2𝒛2+3𝒚2)/pλ+1\|\bm{z}\bm{z}^{\top}+\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top}\|/p\leq(\|\bm{z}+\bm{y}\|^{2}+\|\bm{y}\|^{2})/p\leq(2\|\bm{z}\|^{2}+3\|\bm{y}\|^{2})/p\prec\lambda+1 by (A.4) and (A.5), and p1(𝒛𝒛+𝒛𝒚+𝒚𝒛)𝐈n(λ+1)/pp^{-1}\|(\bm{z}\bm{z}^{\top}+\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top})\circ\mathbf{I}_{n}\|\prec(\lambda+1)/p by the fact that |𝒛i1|2λ|\bm{z}_{i1}|^{2}\prec\lambda and |𝒛i1𝒚i1|λ|\bm{z}_{i1}\bm{y}_{i1}|\prec\sqrt{\lambda}, so we have 𝐏x𝐏yλ+1.\|\mathbf{P}_{x}-\mathbf{P}_{y}\|\prec\lambda+1. By Lemma A.6 and (A.4), we find that 𝐏y1\|\mathbf{P}_{y}\|\prec 1. On the other hand, by the fact that 𝐆x=(𝐙𝐙+𝐙𝐘+𝐘𝐙+𝐘𝐘)/p(𝒛2+2𝒛𝒚)/p+𝐘𝐘/pλ+1\|\mathbf{G}_{x}\|=\|(\mathbf{Z}^{\top}\mathbf{Z}+\mathbf{Z}^{\top}\mathbf{Y}+\mathbf{Y}^{\top}\mathbf{Z}+\mathbf{Y}^{\top}\mathbf{Y})/p\|\leq(\|\bm{z}\|^{2}+2\|\bm{z}\|\|\bm{y}\|)/p+\|\mathbf{Y}^{\top}\mathbf{Y}\|/p\prec\lambda+1 and (A.5), we have

(B.8) 𝐏xλ+1.\|\mathbf{P}_{x}\|\prec\lambda+1\,.

By (B.5), we conclude that

𝐎x𝐏xλ+1n.\|\mathbf{O}_{x}\circ\mathbf{P}_{x}\|\prec\frac{\lambda+1}{\sqrt{n}}.

With the above preparation, 𝐖\mathbf{W} is reduced from (B.1) to

𝐖=\displaystyle\mathbf{W}=\, f(τ)𝟏𝟏2f(τ)p𝐗𝐗+ς(λ)𝐈n+f(τ)𝐎x\displaystyle f(\tau)\mathbf{1}\mathbf{1}^{\top}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}_{n}+f^{\prime}(\tau)\mathbf{O}_{x}
+f(2)(τ)2𝐎x𝐎x+2f(2)(τ)𝐏x𝐏x\displaystyle+\frac{f^{(2)}(\tau)}{2}\mathbf{O}_{x}\circ\mathbf{O}_{x}+2f^{(2)}(\tau)\mathbf{P}_{x}\circ\mathbf{P}_{x}
(B.9) +O(λ+1n).\displaystyle+O_{\prec}\left(\frac{\lambda+1}{\sqrt{n}}\right).

We further simplify 𝐖\mathbf{W}. Recall that 𝐏y\mathbf{P}_{y} is constructed from the point cloud 𝒴\mathcal{Y} in the same way as 𝐏x\mathbf{P}_{x}. By Lemma A.8, 𝐏x𝐏x\mathbf{P}_{x}\circ\mathbf{P}_{x} can be replaced by 𝐏y𝐏y\mathbf{P}_{y}\circ\mathbf{P}_{y}. Moreover, by a discussion similar to (B.5), we can control 𝐎x𝐎x\mathbf{O}_{x}\circ\mathbf{O}_{x} via

\displaystyle\| (𝟏Φ+Φ𝟏)(𝟏Φ+Φ𝟏)𝐎x𝐎x=2diag{ϕ12,,ϕn2}n1.\displaystyle(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})\circ(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})-\mathbf{O}_{x}\circ\mathbf{O}_{x}\|=\|2\text{diag}\{\phi_{1}^{2},\cdots,\phi_{n}^{2}\}\|\prec n^{-1}.

Combining all the above results, and applying Lemma A.7, we have simplified 𝐖\mathbf{W} as

(B.10) 𝐖=𝖶+ς(λ)𝐈n+O(nϵλ+1n+n1/4),\displaystyle\mathbf{W}=\mathsf{W}+\varsigma(\lambda)\mathbf{I}_{n}+O\left(n^{\epsilon}\frac{\lambda+1}{\sqrt{n}}+n^{-1/4}\right)\,,

where

𝖶:=\displaystyle\mathsf{W}:=\, (f(τ)+2f(2)(2)p1)𝟏𝟏2f(τ)p𝐗𝐗\displaystyle(f(\tau)+2f^{(2)}(2)p^{-1})\mathbf{1}\mathbf{1}^{\top}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}
(B.11) +f(τ)(𝟏Φ+Φ𝟏)+f(2)(τ)2(𝟏Φ+Φ𝟏)(𝟏Φ+Φ𝟏),\displaystyle+f^{\prime}(\tau)(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})+\frac{f^{(2)}(\tau)}{2}(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})\circ(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})\,,

with probability at least 1O(n1/2)1-O(n^{-1/2}) for some small ϵ>0\epsilon>0.

With the above simplification, we discuss the outlying eigenvalues. Invoking (B.10), since ς(λ)𝐈n\varsigma(\lambda)\mathbf{I}_{n} is simply an isotropic shift, the outlying eigenvalues of 𝐖\mathbf{W} can only come from 𝖶\mathsf{W}. Notice that by the identity 𝐚𝐛𝐮𝐯=(𝐚𝐮)(𝐛𝐯)\mathbf{a}\mathbf{b}^{\top}\circ\mathbf{u}\mathbf{v}^{\top}=(\mathbf{a}\circ\mathbf{u})(\mathbf{b}\circ\mathbf{v})^{\top}, we find that

(𝟏Φ\displaystyle(\mathbf{1}\Phi^{\top} +Φ𝟏)(𝟏Φ+Φ𝟏)\displaystyle+\Phi\mathbf{1}^{\top})\circ(\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top})
(B.12) =𝟏(ΦΦ)+(ΦΦ)𝟏+2ΦΦ,\displaystyle=\mathbf{1}(\Phi\circ\Phi)^{\top}+(\Phi\circ\Phi)\mathbf{1}^{\top}+2\Phi\Phi^{\top}\,,

which leads to a rearrangement of 𝖶\mathsf{W} to

𝖶=\displaystyle\mathsf{W}=\, 𝟏[12(f(τ)+2f(2)(2)p1)𝟏+f(τ)Φ\displaystyle\mathbf{1}\Big{[}\frac{1}{2}(f(\tau)+2f^{(2)}(2)p^{-1})\mathbf{1}^{\top}+f^{\prime}(\tau)\Phi^{\top}
+f(2)(τ)2(ΦΦ)]+[12(f(τ)+2f(2)(2)p1)𝟏+f(τ)Φ\displaystyle\qquad+\frac{f^{(2)}(\tau)}{2}(\Phi\circ\Phi)^{\top}\Big{]}+\,\Big{[}\frac{1}{2}(f(\tau)+2f^{(2)}(2)p^{-1})\mathbf{1}+f^{\prime}(\tau)\Phi
+f(2)(τ)2(ΦΦ)]𝟏+f(2)(τ)ΦΦ2f(τ)p𝐗𝐗\displaystyle\qquad+\frac{f^{(2)}(\tau)}{2}(\Phi\circ\Phi)\Big{]}\mathbf{1}^{\top}+f^{(2)}(\tau)\Phi\Phi^{\top}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}
(B.13) :=\displaystyle:=\, 𝖮2f(τ)p𝐗𝐗.\displaystyle\mathsf{O}-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}.

Note that 𝖮\mathsf{O} is of rank at most three since the first two terms of (B.13) form a matrix of rank at most 22 and ΦΦ\Phi\Phi^{\top} is a rank-one matrix with the spectral norm of order O(n)O(\sqrt{n}). With (B.10), we can therefore conclude our proof using Lemma A.6.

B.2. Proof of Theorem 2.5

Since the proof in this subsection holds for the general kernel functions described in Remark 2.4, we will carry out our analysis with such a general kernel function f(x)f(x). Note that in this case, τ\tau defined in (2.5) is still bounded from above. So for a fixed KK\in\mathbb{N}, the first KK coefficients in the Taylor expansion can be well controlled under the smoothness assumption, i.e., f(k)(τ)1f^{(k)}(\tau)\asymp 1 for k=1,2,,Kk=1,2,\ldots,K. However, Lemma A.9 is invalid since α>0\alpha>0. Moreover, in this region, although the concentration inequality (c.f. Lemma A.5) still works, its rate becomes worse as λ\lambda becomes larger. In [26], the author only needs to carry out the Taylor expansion up to the third order since λ\lambda is fixed. In our setup, to handle the divergent λ\lambda, we need a higher-order expansion that is adaptive to λ.\lambda. Thus, due to the nature of the convergence rate in Lemma A.5, we employ different proof strategies for the cases 0<α<0.50<\alpha<0.5 and 0.5α<1.0.5\leq\alpha<1. When 0<α<0.50<\alpha<0.5, the proof of Theorem 2.3 still holds. When 0.5α<10.5\leq\alpha<1, we need a higher-order Taylor expansion to control the convergence. This is due to the second term of (A.5), where the concentration inequalities for 𝐱i𝐱j\mathbf{x}_{i}^{\top}\mathbf{x}_{j}, iji\neq j, have different upper bounds for different α\alpha.

Proof of case (1), 0<α<0.50<\alpha<0.5.

By (B.1) and Lemma A.5, we find that when iji\neq j,

𝐖(i,j)=\displaystyle\mathbf{W}(i,j)= f(τ)+f(τ)[𝐎x(i,j)2𝐏x(i,j)]\displaystyle\,f(\tau)+f^{\prime}(\tau)\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]
+f(2)(τ)2[𝐎x(i,j)2𝐏x(i,j)]2+O(n3/2),\displaystyle+\frac{f^{(2)}(\tau)}{2}\left[\mathbf{O}_{x}(i,j)-2\mathbf{P}_{x}(i,j)\right]^{2}+O_{\prec}(n^{-3/2}),

where we used the fact that f(3)(ξx(i,j))f^{(3)}(\xi_{x}(i,j)) is bounded. By a discussion similar to the one leading to (B.4), together with the Gershgorin circle theorem, we find that (B.4) also holds true. The rest of the proof follows the lines of the proof of Theorem 2.3 using Lemmas A.5 and A.6. We omit the details here.

Proof of Case (2), 0.5α<10.5\leq\alpha<1.

For simplicity, we introduce

(B.14) 𝐋x:=𝐎x𝐏x.\mathbf{L}_{x}:=\mathbf{O}_{x}-\mathbf{P}_{x}.

By Lemma A.5 and notations defined in (B.2), we have

(B.15) |𝐏x(i,j)|=O(λ/n) and |𝐎x(i,j)|=O(λ/n).|\mathbf{P}_{x}(i,j)|=O_{\prec}(\lambda/n)\ \mbox{ and }\ |\mathbf{O}_{x}(i,j)|=O_{\prec}(\lambda/n)\,.

By the Taylor expansion, when iji\neq j, we have

(B.16) 𝐖(i,j)=\displaystyle\mathbf{W}(i,j)= k=0𝔡1f(k)(τ)k!𝐋x(i,j)k+f(𝔡)(ξx(i,j))𝔡!𝐋x(i,j)𝔡,\displaystyle\,\sum_{k=0}^{\mathfrak{d}-1}\frac{f^{(k)}(\tau)}{k!}\mathbf{L}_{x}(i,j)^{k}+\frac{f^{(\mathfrak{d})}(\xi_{x}(i,j))}{\mathfrak{d}!}\mathbf{L}_{x}(i,j)^{\mathfrak{d}}\,,

where 𝔡\mathfrak{d} is defined in (2.11) and ξx(i,j)\xi_{x}(i,j) is some value between (𝐱i22+𝐱j22)/p(\|\mathbf{x}_{i}\|_{2}^{2}+\|\mathbf{x}_{j}\|_{2}^{2})/p and τ.\tau. Consider 𝐖~,𝐑𝔡n×n\widetilde{\mathbf{W}},\,\mathbf{R}_{\mathfrak{d}}\in\mathbb{R}^{n\times n} defined as

𝐖~(i,j):=k=3𝔡1f(k)(τ)𝐋x(i,j)kk!,\displaystyle\widetilde{\mathbf{W}}(i,j):=\sum_{k=3}^{\mathfrak{d}-1}\frac{f^{(k)}(\tau)\mathbf{L}_{x}(i,j)^{k}}{k!}\,,
𝐑𝔡(i,j)=f(𝔡)(ξx(i,j))𝔡!𝐋x(i,j)𝔡,\displaystyle\mathbf{R}_{\mathfrak{d}}(i,j)=\frac{f^{(\mathfrak{d})}(\xi_{x}(i,j))}{\mathfrak{d}!}\mathbf{L}_{x}(i,j)^{\mathfrak{d}},

so that 𝐖=k=02f(k)(τ)k!𝐋x(i,j)k+𝐖~+𝐑𝔡\mathbf{W}=\sum_{k=0}^{2}\frac{f^{(k)}(\tau)}{k!}\mathbf{L}_{x}(i,j)^{k}+\widetilde{\mathbf{W}}+\mathbf{R}_{\mathfrak{d}}. We start from claiming that

(B.17) |𝐑𝔡(i,j)|p(α)1,\left|\mathbf{R}_{\mathfrak{d}}(i,j)\right|\prec p^{\mathcal{B}(\alpha)-1}\,,

where (α)=(α1)11α+α<0\mathcal{B}(\alpha)={(\alpha-1)\left\lceil\frac{1}{1-\alpha}\right\rceil+\alpha<0} is defined in (2.13). To see (B.17), we use (A.5) and the fact that dd is finite to get

𝐋x(i,j)𝔡n𝔡(1+α)=n(α1)11α+α1=n(α)1.\mathbf{L}_{x}(i,j)^{\mathfrak{d}}\prec n^{\mathfrak{d}(-1+\alpha)}=n^{(\alpha-1)\left\lceil\frac{1}{1-\alpha}\right\rceil+\alpha-1}=n^{\mathcal{B}(\alpha)-1}\,.

Together with (B.16), by the Gershgorin circle theorem, we have

𝐑𝔡=𝐖k=0𝔡1f(k)(τ)k!𝐋xkn(α),\|\mathbf{R}_{\mathfrak{d}}\|=\left\|\mathbf{W}-\sum_{k=0}^{\mathfrak{d}-1}\frac{f^{(k)}(\tau)}{k!}\mathbf{L}_{x}^{\circ k}\right\|\prec n^{\mathcal{B}(\alpha)},

where we set

𝐋xk:=𝐋x𝐋xktimes.\mathbf{L}_{x}^{\circ k}:=\underbrace{\mathbf{L}_{x}\circ\cdots\circ\mathbf{L}_{x}}_{k\ \text{times}}.

Similar definition applies to 𝐎xk\mathbf{O}_{x}^{\circ k} and 𝐏xk.\mathbf{P}_{x}^{\circ k}.

Next, we study 𝐖~.\widetilde{\mathbf{W}}. Recall the definition of Φ\Phi in (A.20). To simplify the notation, we denote 𝖥1:=𝟏Φ+Φ𝟏\mathsf{F}_{1}:=\mathbf{1}\Phi^{\top}+\Phi\mathbf{1}^{\top} and 𝖥2:=2diag{ϕ1,,ϕn}\mathsf{F}_{2}:=-2\operatorname{diag}\{\phi_{1},\cdots,\phi_{n}\}, and obtain

(B.18) 𝐎x=𝖥1+𝖥2.\mathbf{O}_{x}=\mathsf{F}_{1}+\mathsf{F}_{2}.

Clearly, rank(𝖥1)2\operatorname{rank}(\mathsf{F}_{1})\leq 2, and by Lemma A.5, we have

(B.19) 𝐎x𝖥1λp.\|\mathbf{O}_{x}-\mathsf{F}_{1}\|\prec\frac{\lambda}{p}.

For any 3k𝔡13\leq k\leq\mathfrak{d}-1, in view of the expansion

𝐋xk=l=0k(kl)𝐎xl(𝐏x)(kl),\mathbf{L}_{x}^{\circ k}=\sum_{l=0}^{k}{k\choose l}\mathbf{O}_{x}^{\circ l}\circ(-\mathbf{P}_{x})^{\circ(k-l)}\,,

below we examine 𝐎xl(𝐏x)(kl)\mathbf{O}_{x}^{\circ l}\circ(-\mathbf{P}_{x})^{\circ(k-l)} term by term.

First, when l=0l=0, we only have the term 𝐏xk\mathbf{P}^{\circ k}_{x}. We focus on the discussion when k=3k=3, and the same argument holds when k>3k>3. We need the following identity. For any n×nn\times n matrix 𝐄\mathbf{E} and vectors 𝐮,𝐯n\mathbf{u},\mathbf{v}\in\mathbb{R}^{n},

(B.20) 𝐄𝐮𝐯=diag(𝐮)𝐄diag(𝐯).\mathbf{E}\circ\mathbf{u}\mathbf{v}^{\top}=\text{diag}(\mathbf{u})\mathbf{E}\text{diag}(\mathbf{v}).
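Identity (B.20) is elementary; a quick numerical confirmation (illustration only) reads as follows.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
E = rng.standard_normal((n, n))
u, v = rng.standard_normal(n), rng.standard_normal(n)

# (E o u v^T)_{ij} = E_{ij} u_i v_j = (diag(u) E diag(v))_{ij}
assert np.allclose(E * np.outer(u, v), np.diag(u) @ E @ np.diag(v))
```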

Note that the expansion in (B.7) still holds, and to further simplify the notation, we denote

(B.21) 𝐏x𝐏y=𝖳𝖰,\displaystyle\mathbf{P}_{x}-\mathbf{P}_{y}=\mathsf{T}\circ\mathsf{Q}\,,

where 𝖳:=1p(𝒛𝒚+𝒚𝒛+𝒛𝒛)\mathsf{T}:=\frac{1}{p}(\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top}+\bm{z}\bm{z}^{\top}) and 𝖰:=𝟏𝟏𝐈n\mathsf{Q}:=\mathbf{1}\mathbf{1}^{\top}-\mathbf{I}_{n}. We thus have that

𝐏x3𝐏y3\displaystyle\mathbf{P}_{x}^{\circ 3}-\mathbf{P}_{y}^{\circ 3} =(𝐏x𝐏y)(𝐏x2+𝐏x𝐏y+𝐏y2)\displaystyle=(\mathbf{P}_{x}-\mathbf{P}_{y})\circ(\mathbf{P}_{x}^{\circ 2}+\mathbf{P}_{x}\circ\mathbf{P}_{y}+\mathbf{P}_{y}^{\circ 2})
(B.22) =(𝐏x𝐏y)𝐏x2+(𝐏x𝐏y)𝐏x𝐏y+(𝐏x𝐏y)𝐏y2.\displaystyle=(\mathbf{P}_{x}-\mathbf{P}_{y})\circ\mathbf{P}_{x}^{\circ 2}+(\mathbf{P}_{x}-\mathbf{P}_{y})\circ\mathbf{P}_{x}\circ\mathbf{P}_{y}+(\mathbf{P}_{x}-\mathbf{P}_{y})\circ\mathbf{P}_{y}^{\circ 2}.

We control the first term; the other terms can be controlled in the same way. Since 𝐏y=𝐆y𝖰\mathbf{P}_{y}=\mathbf{G}_{y}\circ\mathsf{Q} and 𝐆y=p1𝐘𝐘\mathbf{G}_{y}=p^{-1}\mathbf{Y}^{\top}\mathbf{Y}, we have that

𝐏x2=(𝖳+𝐆y)(𝖳+𝐆y)𝖰.\mathbf{P}_{x}^{\circ 2}=\left(\mathsf{T}+\mathbf{G}_{y}\right)\circ\left(\mathsf{T}+\mathbf{G}_{y}\right)\circ\mathsf{Q}.

Together with (B.21) and the fact that 𝖰2=𝖰\mathsf{Q}^{\circ 2}=\mathsf{Q}, we obtain

(𝐏x𝐏y)𝐏x2\displaystyle(\mathbf{P}_{x}-\mathbf{P}_{y})\circ\mathbf{P}_{x}^{\circ 2} =(𝖳+𝐆y)(𝖳+𝐆y)𝖳𝖰\displaystyle=\left(\mathsf{T}+\mathbf{G}_{y}\right)\circ\left(\mathsf{T}+\mathbf{G}_{y}\right)\circ\mathsf{T}\circ\mathsf{Q}
=(𝖳𝖳+2𝖳𝐆y+𝐆y𝐆y)𝖳𝖰\displaystyle=\left(\mathsf{T}\circ\mathsf{T}+2\mathsf{T}\circ\mathbf{G}_{y}+\mathbf{G}_{y}\circ\mathbf{G}_{y}\right)\circ\mathsf{T}\circ\mathsf{Q}
=𝖳3𝖰+𝖱1+𝖱2,\displaystyle=\mathsf{T}^{\circ 3}\circ\mathsf{Q}+\mathsf{R}_{1}+\mathsf{R}_{2}\,,

where 𝖱1:=2𝖳2𝐆y𝖰\mathsf{R}_{1}:=2\mathsf{T}^{\circ 2}\circ\mathbf{G}_{y}\circ\mathsf{Q} and 𝖱2:=𝐆y2𝖳𝖰\mathsf{R}_{2}:=\mathbf{G}_{y}^{\circ 2}\circ\mathsf{T}\circ\mathsf{Q}. Now we discuss the above three terms one by one. First, using (B.20), we have

𝖳3𝖰=𝖳3[𝖳𝐈n]3=𝖳3+O(λ3p3),\displaystyle\mathsf{T}^{\circ 3}\circ\mathsf{Q}=\mathsf{T}^{\circ 3}-\left[\mathsf{T}\circ\mathbf{I}_{n}\right]^{3}=\mathsf{T}^{\circ 3}+O_{\prec}\left(\frac{\lambda^{3}}{p^{3}}\right),

where in the second equality we used the fact that 𝖳(i,i)λ/p.\mathsf{T}(i,i)\prec\lambda/p. Second, we have

𝖱1\displaystyle\mathsf{R}_{1} =2𝖳2𝐆y2[𝖳𝐈n]2[𝐆y𝐈n]\displaystyle=2\mathsf{T}^{\circ 2}\circ\mathbf{G}_{y}-2\left[\mathsf{T}\circ\mathbf{I}_{n}\right]^{2}[\mathbf{G}_{y}\circ\mathbf{I}_{n}]
=2𝖳2𝐆y+O((λ/p)2).\displaystyle=2\mathsf{T}^{\circ 2}\circ\mathbf{G}_{y}+O_{\prec}\left((\lambda/p)^{2}\right).

On one hand, we can use (B.20) to write

𝖳2𝐆y=\displaystyle\mathsf{T}^{\circ 2}\circ\mathbf{G}_{y}= 1p2[diag(𝒛)]2𝐆y[diag(𝒛)]2\displaystyle\,\frac{1}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]^{2}\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]^{2}
+2p2[diag(𝒛)]2𝐆y[diag(𝒛)][diag(𝒚)]\displaystyle+\frac{2}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]^{2}\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]\left[\operatorname{diag}(\bm{y})\right]
+2p2[diag(𝒛)][diag(𝒚)]𝐆y[diag(𝒛)]2\displaystyle+\frac{2}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]\left[\operatorname{diag}(\bm{y})\right]\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]^{2}
+1p2[diag(𝒛)]2𝐆y[diag(𝒚)]2\displaystyle+\frac{1}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]^{2}\mathbf{G}_{y}\left[\operatorname{diag}(\bm{y})\right]^{2}
+2p2[diag(𝒛)][diag(𝒚)]𝐆y[diag(𝒛)][diag(𝒚)]\displaystyle+\frac{2}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]\left[\operatorname{diag}(\bm{y})\right]\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]\left[\operatorname{diag}(\bm{y})\right]
+1p2[diag(𝒚)]2𝐆y[diag(𝒛)]2.\displaystyle+\frac{1}{p^{2}}\left[\operatorname{diag}(\bm{y})\right]^{2}\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]^{2}.

The first term of the above equation is the leading order term which can be bounded as follows

1p2[diag(𝒛)]2𝐆y[diag(𝒛)]2(λ/p)2,\left\|\frac{1}{p^{2}}\left[\operatorname{diag}(\bm{z})\right]^{2}\mathbf{G}_{y}\left[\operatorname{diag}(\bm{z})\right]^{2}\right\|\prec(\lambda/p)^{2},

where we used the fact that 𝐆y=O(1).\|\mathbf{G}_{y}\|=O_{\prec}(1). The other terms can be bounded similarly so that

𝖱1=O((λ/p)2).\mathsf{R}_{1}=O_{\prec}\left((\lambda/p)^{2}\right).

Similarly, we can control 𝖱2.\mathsf{R}_{2}. This shows that

(𝐏x𝐏y)𝐏x2=𝖳3+O((λ/p)2).(\mathbf{P}_{x}-\mathbf{P}_{y})\circ\mathbf{P}_{x}^{\circ 2}=\mathsf{T}^{\circ 3}+O_{\prec}((\lambda/p)^{2}).

Analogously, we can analyze the other two terms in (B.2) and obtain that

(B.23) 𝐏x3𝐏y3=𝖳3+𝖳2+O(λ/p).\mathbf{P}_{x}^{\circ 3}-\mathbf{P}_{y}^{\circ 3}=\mathsf{T}^{\circ 3}+\mathsf{T}^{\circ 2}+O_{\prec}(\lambda/p).

Moreover, since 𝐏y3=𝐏y(𝐏y2𝖰/p)+𝐏y𝖰/p\mathbf{P}_{y}^{\circ 3}=\mathbf{P}_{y}\circ(\mathbf{P}_{y}^{\circ 2}-\mathsf{Q}/p)+\mathbf{P}_{y}\circ\mathsf{Q}/p, we have by Lemma A.1 that

𝐏y(𝐏y2𝖰/p)maxi,j|𝐏y(i,j)|𝐏y2𝖰/pλ/p.\left\|\mathbf{P}_{y}\circ(\mathbf{P}_{y}^{\circ 2}-\mathsf{Q}/p)\right\|\leq\max_{i,j}|\mathbf{P}_{y}(i,j)|\left\|\mathbf{P}_{y}^{\circ 2}-\mathsf{Q}/p\right\|\prec\lambda/p\,.

On the other hand, since 𝐏y𝖰/p=1p𝐏y=O(1/p)\mathbf{P}_{y}\circ\mathsf{Q}/p=\frac{1}{p}\mathbf{P}_{y}=O_{\prec}(1/p), we have 𝐏y3λ/p\|\mathbf{P}_{y}^{\circ 3}\|\prec\lambda/p. Consequently, we have that

𝐏x3=𝖳3+𝖳2+O(λ/p).\mathbf{P}_{x}^{\circ 3}=\mathsf{T}^{\circ 3}+\mathsf{T}^{\circ 2}+O_{\prec}(\lambda/p).

We mention that since 𝖳\mathsf{T} has rank at most two, the rank of 𝖳3+𝖳2\mathsf{T}^{\circ 3}+\mathsf{T}^{\circ 2} is at most 23+2224.2^{3}+2^{2}\leq 2^{4}. Similarly, using the above discussion for general k>3,k>3, we can show that

𝐏xk=j=2k𝖳j+O(λ/p).\mathbf{P}_{x}^{\circ k}=\sum_{j=2}^{k}\mathsf{T}^{\circ j}+O_{\prec}(\lambda/p).

Second, when l=kl=k, we only have the term 𝐎xk\mathbf{O}_{x}^{\circ k}. When k=2,k=2, using (B.18) and the fact that 𝖥2\mathsf{F}_{2} is diagonal, we have that

𝐎x2=𝖥12+(𝖥2)2+2𝖥2diag(𝖥1).\mathbf{O}_{x}^{\circ 2}=\mathsf{F}_{1}^{\circ 2}+(\mathsf{F}_{2})^{2}+2\mathsf{F}_{2}\operatorname{diag}(\mathsf{F}_{1}).

By Lemma A.5 (i.e., (B.19)), we have that

𝐎x2=𝖥12+O((λ/p)2).\mathbf{O}_{x}^{\circ 2}=\mathsf{F}_{1}^{\circ 2}+O_{\prec}((\lambda/p)^{2}).

Similarly, for general k2,k\geq 2, we have that

𝐎xk=𝖥1k+O((λ/p)k).\mathbf{O}_{x}^{\circ k}=\mathsf{F}_{1}^{\circ k}+O_{\prec}((\lambda/p)^{k}).

Third, when klk\neq l and l0l\neq 0, we discuss a typical case when k=4k=4 and l=2l=2, i.e., 𝐎x2𝐏x2.\mathbf{O}_{x}^{\circ 2}\circ\mathbf{P}_{x}^{\circ 2}. We prepare some bounds. Recall that

𝐏x=𝐆x1pdiag{𝐱122,,𝐱n22},\mathbf{P}_{x}=\mathbf{G}_{x}-\frac{1}{p}\operatorname{diag}\{\|\mathbf{x}_{1}\|_{2}^{2},\cdots,\|\mathbf{x}_{n}\|_{2}^{2}\}\,,

where we have

𝐆x=𝐆y+1p(𝒛𝒚+𝒚𝒛+𝒛𝒛)=:𝐆y+𝖳.\mathbf{G}_{x}=\mathbf{G}_{y}+\frac{1}{p}(\bm{z}\bm{y}^{\top}+\bm{y}\bm{z}^{\top}+\bm{z}\bm{z}^{\top})=:\mathbf{G}_{y}+\mathsf{T}\,.

By the definition of 𝖳\mathsf{T}, we immediately have

(B.24) rank(𝖳)2 and maxi,j|𝖳(i,j)|λ/p,\operatorname{rank}(\mathsf{T})\leq 2\ \ \mbox{ and }\ \ \max_{i,j}|\mathsf{T}(i,j)|\prec\lambda/p\,,

where the bound for maxi,j|𝖳(i,j)|\max_{i,j}|\mathsf{T}(i,j)| holds by the tail bound of the maximum of a finite set of sub-Gaussian random variables. Similarly, we have the bound

(B.25) maxi,j|𝐎x(i,j)|λ/p\max_{i,j}|\mathbf{O}_{x}(i,j)|\prec\lambda/p

by (A.5). By an expansion, we have

𝐏x2𝖳2=\displaystyle\mathbf{P}_{x}^{\circ 2}-\mathsf{T}^{\circ 2}=\, (1pdiag{𝐱122,,𝐱n22})2+𝐆y𝐆y\displaystyle\left(\frac{1}{p}\operatorname{diag}\{\|\mathbf{x}_{1}\|_{2}^{2},\cdots,\|\mathbf{x}_{n}\|_{2}^{2}\}\right)^{2}+\mathbf{G}_{y}\circ\mathbf{G}_{y}
2𝐆y1pdiag{𝐱122,,𝐱n22}+2𝐆y𝖳\displaystyle-2\mathbf{G}_{y}\circ\frac{1}{p}\operatorname{diag}\{\|\mathbf{x}_{1}\|_{2}^{2},\cdots,\|\mathbf{x}_{n}\|_{2}^{2}\}+2\mathbf{G}_{y}\circ\mathsf{T}
(B.26) 2𝖳1pdiag{𝐱122,,𝐱n22},\displaystyle-2\mathsf{T}\circ\frac{1}{p}\operatorname{diag}\{\|\mathbf{x}_{1}\|_{2}^{2},\cdots,\|\mathbf{x}_{n}\|_{2}^{2}\}\,,

which leads to

(B.27) 𝐏x2𝖳21\|\mathbf{P}_{x}^{\circ 2}-\mathsf{T}^{\circ 2}\|\prec 1

by Lemma A.1 with the fact 𝐆y1\|\mathbf{G}_{y}\|\prec 1, 1pdiag{𝐱122,,𝐱n22}1+λ/p\|\frac{1}{p}\operatorname{diag}\{\|\mathbf{x}_{1}\|_{2}^{2},\cdots,\|\mathbf{x}_{n}\|_{2}^{2}\}\|\prec 1+\lambda/p by Lemma A.5, (B.24) and maxi,j|1p𝐲i𝐲j|1\max_{i,j}|\frac{1}{p}\mathbf{y}_{i}^{\top}\mathbf{y}_{j}|\prec 1. So we have the bound

𝐎x2𝐏x2𝐎x2𝖳2\displaystyle\|\mathbf{O}_{x}^{\circ 2}\circ\mathbf{P}_{x}^{\circ 2}-\mathbf{O}_{x}^{\circ 2}\circ\mathsf{T}^{\circ 2}\| maxij|𝐎x2(i,j)|𝐏x2𝖳2\displaystyle\leq\max_{ij}|\mathbf{O}_{x}^{\circ 2}(i,j)|\|\mathbf{P}_{x}^{\circ 2}-\mathsf{T}^{\circ 2}\|
=O((λ/p)2),\displaystyle=O_{\prec}((\lambda/p)^{2})\,,

where the first bound comes from Lemma A.1 and the second bound comes from (B.25). Together with 𝐎x𝖥1=𝖥2=2diag{ϕ1,,ϕn}\mathbf{O}_{x}-\mathsf{F}_{1}=\mathsf{F}_{2}=-2\operatorname{diag}\{\phi_{1},\cdots,\phi_{n}\}, by the same argument we have that

𝐎x2𝐏x2𝖥12𝖳2=O(λ/p).\|\mathbf{O}_{x}^{\circ 2}\circ\mathbf{P}_{x}^{\circ 2}-\mathsf{F}_{1}^{\circ 2}\circ\mathsf{T}^{\circ 2}\|=O_{\prec}(\lambda/p).

Using the simple estimate that rank(AB)rank(A)rank(B)\operatorname{rank}(A\circ B)\leq\operatorname{rank}(A)\operatorname{rank}(B), we find that

rank(𝖥12𝖳2)24.\operatorname{rank}\left(\mathsf{F}_{1}^{\circ 2}\circ\mathsf{T}^{\circ 2}\right)\leq 2^{4}.

The other values of kk and ll can be handled in the same way. Precisely, when l>0l>0, 𝐎xl𝐏x(kl)\mathbf{O}_{x}^{\circ l}\circ\mathbf{P}_{x}^{\circ(k-l)} can be approximated by 𝖥1l𝖳(kl)\mathsf{F}_{1}^{\circ l}\circ\mathsf{T}^{\circ(k-l)} with a norm difference of order O(λ/p)O_{\prec}(\lambda/p), where rank(𝖥1l𝖳(kl))2k\operatorname{rank}\left(\mathsf{F}_{1}^{\circ l}\circ\mathsf{T}^{\circ(k-l)}\right)\leq 2^{k}. Therefore, up to an error of O(λ/p)O_{\prec}(\lambda/p), all the terms in 𝐋xk\mathbf{L}_{x}^{\circ k}, except 𝐏xk\mathbf{P}_{x}^{\circ k}, which will be absorbed into the first-order expansion, can be well approximated using a matrix of rank at most Ck22kC^{\prime}k2^{2k}, where 0<C10<C^{\prime}\leq 1.

Consequently, we can find a matrix 𝐌f\mathbf{M}_{f} of rank at most C22𝔡C2^{2\mathfrak{d}}, where C>0C>0, to approximate 𝐖~\widetilde{\mathbf{W}} so that

\|\widetilde{\mathbf{W}}-\mathbf{M}_{f}\|\prec\lambda/p\,.

This indicates that 𝐖~\widetilde{\mathbf{W}} will generate at most C22𝔡C2^{2\mathfrak{d}} outlying eigenvalues. Therefore, by replacing k=02f(k)(τ)k!𝐋xk\sum_{k=0}^{2}\frac{f^{(k)}(\tau)}{k!}\mathbf{L}_{x}^{\circ k} using (B.19) and the facts that 𝐎x2\mathbf{O}_{x}^{\circ 2} can be replaced by 𝖥1\mathsf{F}_{1} and 𝐎x𝐏x\mathbf{O}_{x}\circ\mathbf{P}_{x} can be approximated by 𝖥1𝖳\mathsf{F}_{1}\circ\mathsf{T} with rank bounded by 44, by the same argument as above, we have

𝐖+f(τ)𝐏xf(2)(τ)2𝐏x𝐏x𝐌~fmax{n(α),λ/p},\displaystyle\bigg{\|}\mathbf{W}+f^{\prime}(\tau)\mathbf{P}_{x}-\frac{f^{(2)}(\tau)}{2}\mathbf{P}_{x}\circ\mathbf{P}_{x}-\widetilde{\mathbf{M}}_{f}\bigg{\|}\prec\max\{n^{\mathcal{B}(\alpha)}\,,\lambda/p\}\,,

where

𝐌~f=f(τ)𝖰+(f(τ)+f(2)(τ)2)𝖥1f(2)(τ)𝖥1𝖳+𝐌f\widetilde{\mathbf{M}}_{f}=f(\tau)\mathsf{Q}+\Big{(}f^{\prime}(\tau)+\frac{f^{(2)}(\tau)}{2}\Big{)}\mathsf{F}_{1}-f^{(2)}(\tau)\mathsf{F}_{1}\circ\mathsf{T}+\mathbf{M}_{f}

is a low rank matrix of rank at most C22𝔡+6C2^{2\mathfrak{d}}+6.

To finish the spectral analysis of 𝐖\mathbf{W} in (B.16), it remains to deal with the second-order Taylor approximation, 𝐏x𝐏x\mathbf{P}_{x}\circ\mathbf{P}_{x}, so as to show that the non-outlying eigenvalues follow the MP law (recall that the first-order expansion 𝐏x\mathbf{P}_{x} involves 𝐗𝐗\mathbf{X}^{\top}\mathbf{X}). The discussion is similar to (B.2). First, we extend the Hadamard product result in Lemma A.8. We have the following expansion:

𝐏x𝐏x𝐏y𝐏y\displaystyle\mathbf{P}_{x}\circ\mathbf{P}_{x}-\mathbf{P}_{y}\circ\mathbf{P}_{y}
=\displaystyle=\, (𝐏x+𝐏y)(𝐏x𝐏y)\displaystyle(\mathbf{P}_{x}+\mathbf{P}_{y})\circ(\mathbf{P}_{x}-\mathbf{P}_{y})
=\displaystyle=\, [𝐏x+𝐏y]𝖳𝖰\displaystyle[\mathbf{P}_{x}+\mathbf{P}_{y}]\circ\mathsf{T}\circ\mathsf{Q}
=\displaystyle=\, 𝖳2𝖰+2𝐏y𝖳𝖰\displaystyle\mathsf{T}^{\circ 2}\circ\mathsf{Q}+2\mathbf{P}_{y}\circ\mathsf{T}\circ\mathsf{Q}
=\displaystyle=\, 2pdiag(𝒛)𝐏ydiag(𝒛)+2pdiag(𝒛)𝐏ydiag(𝒚)\displaystyle\frac{2}{p}\text{diag}(\bm{z})\mathbf{P}_{y}\text{diag}(\bm{z})+\frac{2}{p}\text{diag}(\bm{z})\mathbf{P}_{y}\text{diag}(\bm{y})
+2pdiag(𝒚)𝐏ydiag(𝒛)+𝖳2𝖰,\displaystyle\qquad+\frac{2}{p}\text{diag}(\bm{y})\mathbf{P}_{y}\text{diag}(\bm{z})+\mathsf{T}^{\circ 2}\circ\mathsf{Q}\,,

where the first three terms on the right-hand side can be controlled by O(λ/p)O_{\prec}(\lambda/p) by a discussion similar to (B.23). As a result, we have

(B.28) 𝐏x𝐏x𝐏y𝐏y=𝖳2𝖰+O(λ/p).\displaystyle\mathbf{P}_{x}\circ\mathbf{P}_{x}-\mathbf{P}_{y}\circ\mathbf{P}_{y}=\mathsf{T}^{\circ 2}\circ\mathsf{Q}+O_{\prec}(\lambda/p).

Since 𝖳2𝖰=𝖳2𝖳2𝐈n\mathsf{T}^{\circ 2}\circ\mathsf{Q}=\mathsf{T}^{\circ 2}-\mathsf{T}^{\circ 2}\circ\mathbf{I}_{n}, where the diagonal part has operator norm O((λ/p)2)O_{\prec}((\lambda/p)^{2}), and rank(𝖳2)rank(𝖳)24\operatorname{rank}(\mathsf{T}^{\circ 2})\leq\operatorname{rank}(\mathsf{T})^{2}\leq 4, we have that when 0.5α<10.5\leq\alpha<1, 𝐏x𝐏x\mathbf{P}_{x}\circ\mathbf{P}_{x} differs from 𝐏y𝐏y\mathbf{P}_{y}\circ\mathbf{P}_{y} by a negligible error plus a perturbation generating at most four extra outlying eigenvalues. The other four outlying eigenvalues come from the first- and second-order terms in the Taylor expansion, as in (B.13). The discussion is similar to the equations from (B.1) to (B.13). We omit the details here. Finally, we emphasize that the discussion of (B.28) is nearly optimal in light of the obtained bound.

B.3. Proof of Theorem 2.7

Although the kernel is f(x)=exp(υx)f(x)=\exp(-\upsilon x), for notational simplicity we keep using the symbol f(x)f(x) for the kernel function and use the notation (A.3).

Proof of (1) when 1α<21\leq\alpha<2.

Let

𝐂0=f(2)𝟏𝟏+(1f(2))𝐈.\mathbf{C}_{0}=f(2)\mathbf{1}\mathbf{1}^{\top}+(1-f(2))\mathbf{I}.

In light of (2.14) and 𝐖1(i,i)=1\mathbf{W}_{1}(i,i)=1, we have

𝐖a1=𝐂0𝐖1.\mathbf{W}_{a_{1}}=\mathbf{C}_{0}\circ\mathbf{W}_{1}.

Recall (2.15). By the definition of 𝐖\mathbf{W}, we have that

(B.29) 𝐖=𝐖c𝐖y𝐖1.\mathbf{W}=\mathbf{W}_{c}\circ\mathbf{W}_{y}\circ\mathbf{W}_{1}.
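Note that (B.29) is an entrywise identity: with h=p and d=1, writing x_i = z_i e_1 + y_i gives ‖x_i−x_j‖² = (z_i−z_j)² + ‖y_i−y_j‖² + 2(z_i−z_j)(y_{i1}−y_{j1}), and exponentiating each piece produces the three Hadamard factors. The numpy sketch below verifies this numerically; the explicit formulas used for W_1, W_y and W_c are the natural ones consistent with (2.15) and (B.54), stated here only as assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, upsilon = 50, 100, 1.0
lam = p ** 1.2                                  # signal strength, here alpha = 1.2

z = np.sqrt(lam) * rng.standard_normal(n)       # scalar signal coordinates z_i
Y = rng.standard_normal((p, n))                 # noise vectors y_i (columns)
X = Y.copy()
X[0, :] += z                                    # x_i = z_i e_1 + y_i

def sqdist(M):
    G = M.T @ M
    d = np.diag(G)
    return d[:, None] + d[None, :] - 2.0 * G

dz  = z[:, None] - z[None, :]
dy1 = Y[0, :][:, None] - Y[0, :][None, :]

W  = np.exp(-upsilon * sqdist(X) / p)           # observed affinity, bandwidth h = p
W1 = np.exp(-upsilon * dz ** 2 / p)             # signal kernel
Wy = np.exp(-upsilon * sqdist(Y) / p)           # noise kernel
Wc = np.exp(-2.0 * upsilon * dz * dy1 / p)      # cross term

assert np.allclose(W, Wc * Wy * W1)             # entrywise check of (B.29)
```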

Consequently, we have that

𝐖𝐖a1\displaystyle\mathbf{W}-\mathbf{W}_{a_{1}} =[(𝐖c𝟏𝟏)𝐖y𝐖1]+[(𝐖y𝐂0)𝐖1]\displaystyle=\left[(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{W}_{y}\circ\mathbf{W}_{1}\right]+\left[(\mathbf{W}_{y}-\mathbf{C}_{0})\circ\mathbf{W}_{1}\right]
(B.30) :=1+2,\displaystyle:=\mathcal{E}_{1}+\mathcal{E}_{2},

where we used the relation 𝐖a1=(𝟏𝟏)𝐖a1.\mathbf{W}_{a_{1}}=(\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{W}_{a_{1}}. For 2\mathcal{E}_{2}, by Lemma A.1 and the first order Taylor approximation of 𝐖y\mathbf{W}_{y}, we see that

(B.31) 21nλ1(𝐖1),\|\mathcal{E}_{2}\|\prec\frac{1}{\sqrt{n}}\lambda_{1}(\mathbf{W}_{1})\,,

where we used (A.4) and (A.4). To control λ1(𝐖1)\lambda_{1}(\mathbf{W}_{1}), we apply Lemma A.10, where in order to make the first term of the right-hand side of (A.23) negligible, we take

(B.32) δ>max{0,3α2},\delta>\max\left\{0,\,\frac{3-\alpha}{2}\right\},

and hence with high probability,

(B.33) 𝐖1=O(nmax{0,3α2}).\|\mathbf{W}_{1}\|=O(n^{\max\left\{0,\,\frac{3-\alpha}{2}\right\}})\,.

By (B.31), when α1\alpha\geq 1, we find that

(B.34) 1n2n3/2+nα/2.\left\|\frac{1}{n}\mathcal{E}_{2}\right\|\prec n^{-3/2}+n^{-\alpha/2}\,.

Next, we discuss the first term 1.\mathcal{E}_{1}. By using Lemma A.1 twice and (B.33), we see that

(B.35) 1maxi,j|𝐖c(i,j)1|maxi,j𝐖y(i,j)𝐖1.\|\mathcal{E}_{1}\|\prec\max_{i,j}\left|\mathbf{W}_{c}(i,j)-1\right|\max_{i,j}\mathbf{W}_{y}(i,j)\|\mathbf{W}_{1}\|.

By definition, we have

maxi,j𝐖y(i,j)1.\max_{i,j}\mathbf{W}_{y}(i,j)\leq 1.

Moreover, using the definition (2.15) and h=ph=p, we have

(B.36) maxi,j|𝐖c(i,j)1|nα/21\max_{i,j}\left|\mathbf{W}_{c}(i,j)-1\right|\prec n^{\alpha/2-1}

by the bound |𝒛i𝒚j|=|zi𝒚j1|λ|\bm{z}_{i}^{\top}\bm{y}_{j}|=|z_{i}\bm{y}_{j1}|\prec\sqrt{\lambda} and the Taylor expansion of f(x)f(x) around 0. Together with (B.33) and (B.35), we readily obtain that

1n1n1/2.\left\|\frac{1}{n}\mathcal{E}_{1}\right\|\prec n^{-1/2}.

This completes the proof of (2.18).

The proof of (2.19) is more involved. By the definition of 𝐖a1\mathbf{W}_{a_{1}}, it suffices to study 𝐖1.\mathbf{W}_{1}. By Mehler’s formula (A.32), we find that when iji\neq j,

(B.37) 𝐖1(i,j)=1t02\displaystyle\mathbf{W}_{1}(i,j)=\sqrt{1-t_{0}^{2}} exp(3t0222(1t02)(zi2+zj2))m=0t0mm!H~m(zi)H~m(zj),\displaystyle\exp\left(\frac{3t_{0}^{2}-2}{2(1-t_{0}^{2})}(z_{i}^{2}+z_{j}^{2})\right)\sum_{m=0}^{\infty}\frac{t_{0}^{m}}{m!}\widetilde{H}_{m}(z_{i})\widetilde{H}_{m}(z_{j}),

where t0t_{0} is defined as

0<t0:=υ+υ2+16β2/υ24(β/υ)<1andβ=λ/p.0<t_{0}:=\frac{-\upsilon+\sqrt{\upsilon^{2}+16\beta^{2}/\upsilon^{2}}}{4(\beta/\upsilon)}<1\quad\text{and}\quad\beta=\lambda/p\,.

By a direct calculation, we have

1t02\displaystyle 1-t^{2}_{0}\, =υ3υ2+16(β2/υ2)υ48β2,\displaystyle=\frac{\upsilon^{3}\sqrt{\upsilon^{2}+16(\beta^{2}/\upsilon^{2})}-\upsilon^{4}}{8\beta^{2}},
1t0\displaystyle 1-t_{0}\, =2υυ2+16β2/υ2+υ+4(β/υ).\displaystyle=\frac{2\upsilon}{\sqrt{\upsilon^{2}+16\beta^{2}/\upsilon^{2}}+\upsilon+4(\beta/\upsilon)}.
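Both identities can be checked by direct algebra; a quick numerical sanity check (illustration only, with arbitrary positive υ and β) is given below.

```python
import numpy as np

upsilon, beta = 1.5, 7.0                         # any positive values
s = np.sqrt(upsilon ** 2 + 16.0 * beta ** 2 / upsilon ** 2)

t0 = (-upsilon + s) / (4.0 * beta / upsilon)     # definition of t0 above

assert 0.0 < t0 < 1.0
assert np.isclose(1.0 - t0 ** 2, (upsilon ** 3 * s - upsilon ** 4) / (8.0 * beta ** 2))
assert np.isclose(1.0 - t0, 2.0 * upsilon / (s + upsilon + 4.0 * beta / upsilon))
```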

Note that when α>1\alpha>1, 1t021β1-t^{2}_{0}\asymp\frac{1}{\beta} as pp\to\infty. Therefore, with (B.37), we find that 𝐖1\mathbf{W}_{1} can be written as an infinite summation of rank-one matrices such that

(B.38) 𝐖1=1t02m=0t0mm!𝐇m𝐇m,\mathbf{W}_{1}=\sqrt{1-t_{0}^{2}}\sum_{m=0}^{\infty}\frac{t_{0}^{m}}{m!}{\mathbf{H}}_{m}{\mathbf{H}}_{m}^{\top},

where 𝐇mn{\mathbf{H}}_{m}\in\mathbb{R}^{n} is defined as

𝐇m=𝐰𝐇~m,𝐇~m(i)=H~m(zi),{\mathbf{H}}_{m}=\mathbf{w}\circ\widetilde{\mathbf{H}}_{m}\,,\quad\widetilde{\mathbf{H}}_{m}(i)=\widetilde{H}_{m}(z_{i})\,,

and 𝐰=(𝐰1,,𝐰n)n\mathbf{w}=(\mathbf{w}_{1},\cdots,\mathbf{w}_{n})^{\top}\in\mathbb{R}^{n} is defined as

𝐰i=exp(3t0222(1t02)zi2), 1in.\mathbf{w}_{i}=\exp\left(\frac{3t_{0}^{2}-2}{2(1-t_{0}^{2})}z_{i}^{2}\right),\ 1\leq i\leq n.

Since both ziz_{i} and zjz_{j} are sub-Gaussian random variables, for some large constants C,D>0C,D>0,

(B.39) (|zi|2Clogn)1O(nD).\mathbb{P}(|z_{i}|^{2}\leq C\log n)\geq 1-O(n^{-D}).

Therefore, we conclude that with high probability, for some constant C1>0C_{1}>0,

exp(3t0222(1t02)(zi2+zj2))\displaystyle\exp\left(\frac{3t_{0}^{2}-2}{2(1-t_{0}^{2})}(z_{i}^{2}+z_{j}^{2})\right) (exp(Clogn))3t0222(1t02)\displaystyle\leq(\exp(C\log n))^{\frac{3t_{0}^{2}-2}{2(1-t_{0}^{2})}}
(B.40) nC1β.\displaystyle\leq n^{C_{1}\beta}.

To finish the proof, we control 𝐇m𝐇m=𝐇m22\|\mathbf{H}_{m}\mathbf{H}_{m}^{\top}\|=\|\mathbf{H}_{m}\|_{2}^{2} case by case.

Case I: α=1.\alpha=1. In this case, since λp\lambda\asymp p, we have β1\beta\asymp 1 and t0(0,1)t_{0}\in(0,1) is a constant away from 11. Since the degree of H~m(x)\widetilde{H}_{m}(x) is mm\in\mathbb{N}, we have |H~m(x)|xm|\widetilde{H}_{m}(x)|\asymp x^{m} when xx\rightarrow\infty. Together with (B.39), we find that with high probability, for some constant C>0C>0

(B.41) 𝐇m22nC1β+1(Clogn)m.\|{\mathbf{H}}_{m}\|_{2}^{2}\leq n^{C_{1}\beta+1}(C\log n)^{m}.

Consequently, we have that for some constants C2,C3>0C_{2},C_{3}>0,

(B.42) t0mm!𝐇m22\displaystyle\frac{t_{0}^{m}}{m!}\|{\mathbf{H}}_{m}\|_{2}^{2} C3nC2m1/2(et0Clognm)m,\displaystyle\leq C_{3}n^{C_{2}}m^{-1/2}\left(\frac{et_{0}C\log n}{m}\right)^{m},

where we use Stirling’s formula. For notational convenience, we set

(B.43) 𝐖1=𝐖11+𝐖12,\mathbf{W}_{1}=\mathbf{W}_{11}+\mathbf{W}_{12},

where

(B.44) 𝐖11=1t02m=0C0lognt0mm!𝐇m𝐇m\mathbf{W}_{11}=\sqrt{1-t_{0}^{2}}\sum_{m=0}^{C_{0}\log n}\frac{t_{0}^{m}}{m!}{\mathbf{H}}_{m}{\mathbf{H}}_{m}^{\top}

and C0C_{0} will be chosen later. Now we control 𝐖11\mathbf{W}_{11} and 𝐖12\mathbf{W}_{12}. Choosing a fixed large constant C0>0C_{0}>0 such that C0>et0CC_{0}>et_{0}C, we set m0=C0logn.m_{0}=C_{0}\log n. When nn is large enough, by (B.42), we have that for some constants C4>0C_{4}>0 and 0<𝔞<10<\mathfrak{a}<1,

m=C0lognt0mm!𝐇m22\displaystyle\sum_{m=C_{0}\log n}^{\infty}\frac{t_{0}^{m}}{m!}\|{\mathbf{H}}_{m}\|_{2}^{2}\leq C3nC2m=C0logn1m(et0lognm)m\displaystyle\,C_{3}n^{C_{2}}\sum_{m=C_{0}\log n}^{\infty}\frac{1}{\sqrt{m}}\left(\frac{et_{0}\log n}{m}\right)^{m}
\displaystyle\leq C4nC2C0logn𝔞xdx\displaystyle\,C_{4}n^{C_{2}}\int_{C_{0}\log n}^{\infty}\mathfrak{a}^{x}\mathrm{d}x
(B.45) =\displaystyle= C4nC21log𝔞𝔞C0logn.\displaystyle\,C_{4}n^{C_{2}}\frac{1}{\log\mathfrak{a}}\mathfrak{a}^{C_{0}\log n}\,.

This yields that for some constants C5>0C_{5}>0, with high probability

(B.46) m=C0lognt0mm!𝐇m22\displaystyle\sum_{m=C_{0}\log n}^{\infty}\frac{t_{0}^{m}}{m!}\|{\mathbf{H}}_{m}\|_{2}^{2} C4𝔞C0lognnC2nC5logn+C2.\displaystyle\leq C_{4}\mathfrak{a}^{C_{0}\log n}n^{C_{2}}\asymp n^{-C_{5}\log n+C_{2}}.

Thus, for a big constant D>2D>2, when nn is sufficiently large, with high probability we have

(B.47) 𝐖12\displaystyle\|\mathbf{W}_{12}\| =𝐖11t02m=0C0lognt0mm!𝐇m𝐇mnD.\displaystyle=\left\|\mathbf{W}_{1}-\sqrt{1-t_{0}^{2}}\sum_{m=0}^{C_{0}\log n}\frac{t_{0}^{m}}{m!}{\mathbf{H}}_{m}{\mathbf{H}}_{m}^{\top}\right\|\leq n^{-D}\,.

This completes the proof for the case α=1\alpha=1 using (2.14) and (B.43) since the rank of 𝐖11\mathbf{W}_{11} is bounded by C0lognC_{0}\log n.

Case II: 1<α<21<\alpha<2. It is easy to see that in this case, (B.41) still holds true with high probability. Since β\beta diverges in this case, we find that

t0mm!𝐇~m(𝐲)22\displaystyle\frac{t_{0}^{m}}{m!}\|\widetilde{\mathbf{H}}_{m}(\mathbf{y})\|_{2}^{2} CnC1β+1(m)m1/2em(Clogn)m\displaystyle\leq Cn^{C_{1}\beta+1}(m)^{-m-1/2}e^{m}(C\log n)^{m}
=CnC1β+1m1/2(eClognm)m.\displaystyle=Cn^{C_{1}\beta+1}m^{-1/2}\left(\frac{eC\log n}{m}\right)^{m}.

Now for some fixed large constant C0>0C_{0}>0, we set m0=C0nα1.m_{0}=C_{0}n^{\alpha-1}. By a discussion similar to (B.46), we have that for some constants C,C1>0C,C_{1}>0, with high probability

m=m0t0mm!𝐇~m(𝐳)22CnC1(1α)nα1.\sum_{m=m_{0}}^{\infty}\frac{t_{0}^{m}}{m!}\|\widetilde{\mathbf{H}}_{m}(\mathbf{z})\|_{2}^{2}\leq Cn^{C_{1}(1-\alpha)n^{\alpha-1}}.

The rest of the proof is similar to the case α=1\alpha=1 except that we can follow (A.23) to show that 𝐖1nδ\|\mathbf{W}_{1}\|\prec n^{\delta} for δ>(3α)/2.\delta>(3-\alpha)/2. This concludes the proof for the case 1<α<21<\alpha<2 and hence completes the proof of (2.19).

Proof of (2) when α2\alpha\geq 2.

We first prove (2.20). By (B.29), using the definition (2.16), we find that

𝐖𝐖~a1=(𝐂0𝐖y)𝐖c𝐖1.\mathbf{W}-\widetilde{\mathbf{W}}_{a_{1}}=(\mathbf{C}_{0}-\mathbf{W}_{y})\circ\mathbf{W}_{c}\circ\mathbf{W}_{1}.

Then the proof follows from a discussion similar to (B.34). For the rest of the proof, due to similarity, we only prove (2.23). Note that since the ziz_{i}’s are continuous random variables by assumption, they are pairwise distinct with high probability. Rewrite

𝐖(i,j)=exp(υλp𝐱i𝐱j22λ).\mathbf{W}(i,j)=\exp\left(-\upsilon\frac{\lambda}{p}\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}\right).

Clearly, λpnα1\frac{\lambda}{p}\asymp n^{\alpha-1}. We then show that for the given constant t(0,1)t\in(0,1), with probability at least 1O(nδ)1-O(n^{-\delta}), where δ>1\delta>1, we have

𝐱i𝐱j22λ(pλ)t,\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}\geq\left(\frac{p}{\lambda}\right)^{t},

when iji\neq j. By a direct expansion,

(B.48) 𝐱i𝐱j22λ=\displaystyle\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}= (zizj)2+𝐲i𝐲j22λ+2(zizj)(𝐲i1𝐲j1)λ.\displaystyle\left(z_{i}-z_{j}\right)^{2}+\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{\lambda}+\frac{2(z_{i}-z_{j})(\mathbf{y}_{i1}-\mathbf{y}_{j1})}{\lambda}\,.

By Lemma A.4, we find that with high probability,

𝐲i𝐲j22λpλ.\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{\lambda}\prec\frac{p}{\lambda}.

Similarly, we have that

|2(zizj)(𝐲i1𝐲j1)λ|1λ.\left|\frac{2(z_{i}-z_{j})(\mathbf{y}_{i1}-\mathbf{y}_{j1})}{\lambda}\right|\prec\frac{1}{\lambda}.

Using the above result, we find that for some constant C>0C>0, for iji\neq j,

(𝐱i𝐱j22λ(pλ)t)\displaystyle\mathbb{P}\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}\leq\left(\frac{p}{\lambda}\right)^{t}\right) ((zizj)2(pλ)tCpλ)C(pλ)t/2,\displaystyle\leq\mathbb{P}\left((z_{i}-z_{j})^{2}\leq\left(\frac{p}{\lambda}\right)^{t}-C\frac{p}{\lambda}\right)\leq C\left(\frac{p}{\lambda}\right)^{t/2},

where the final inequality comes from a discussion similar to (A.26) since α2\alpha\geq 2. This leads to

(𝐱i𝐱j22λ(pλ)t)1C(pλ)t/2.\mathbb{P}\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}\geq\left(\frac{p}{\lambda}\right)^{t}\right)\geq 1-C\left(\frac{p}{\lambda}\right)^{t/2}.

Note that under the condition (2.21), we have t(α1)/2>1t(\alpha-1)/2>1, and hence (pλ)t/2=o(n1)\left(\frac{p}{\lambda}\right)^{t/2}=o(n^{-1}). Therefore, by a direct union bound, for each fixed ii, we have that

(B.49) maxj\displaystyle\max_{j}\mathbb{P} (𝐱i𝐱j22λ(pλ)t|ji)1Cn(pλ)t/2.\displaystyle\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{\lambda}\geq\left(\frac{p}{\lambda}\right)^{t}\middle|j\neq i\right)\geq 1-Cn\left(\frac{p}{\lambda}\right)^{t/2}.

This implies that with probability at least 1O(n1t(α1)/2)1-O(n^{1-t(\alpha-1)/2}), for some constant C>0C>0, we have that

(B.50) |ji𝐖(i,j)|\displaystyle\left|\sum_{j\neq i}\mathbf{W}(i,j)\right| Cnexp(υλp(pλ)t)=Cnexp(υ(λp)1t).\displaystyle\leq Cn\exp\left(-\upsilon\frac{\lambda}{p}\left(\frac{p}{\lambda}\right)^{t}\right)=Cn\exp\left(-\upsilon\left(\frac{\lambda}{p}\right)^{1-t}\right).

Consequently, by the Gershgorin circle theorem, we conclude that with probability 1O(n1t(α1)/2)1-O(n^{1-t(\alpha-1)/2})

(B.51) \left|\lambda_{i}(\mathbf{W})-1\right|\leq Cn\exp\left(-\upsilon\left(\frac{\lambda}{p}\right)^{1-t}\right).

This completes our proof.

B.4. Proof of Corollary 2.10

In this subsection, we prove the results for the transition matrix 𝐀.\mathbf{A}.

Proof of Corollary 2.10.

When α=0\alpha=0, it follows from the fact that

(B.52) (n1𝐃)1f(2)1𝐈nn1/2.\|(n^{-1}\mathbf{D})^{-1}-f(2)^{-1}\mathbf{I}_{n}\|\prec n^{-1/2}.

The proof can be found in [18, Lemma 4.5], and we omit it. Together with (B.10) and (B.13), we have that with probability 1O(n1/2)1-O(n^{-1/2}),

n𝐃1/2\displaystyle n\mathbf{D}^{-1/2} 𝐖𝐃1/2=n𝐃1/2𝖮𝐃1/2\displaystyle\mathbf{W}\mathbf{D}^{-1/2}=n\mathbf{D}^{-1/2}\mathsf{O}\mathbf{D}^{-1/2}
+n𝐃1/2(2f(τ)p𝐗𝐗+ς(λ)𝐈n)𝐃1/2\displaystyle+n\mathbf{D}^{-1/2}\left(-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}_{n}\right)\mathbf{D}^{-1/2}
+O(n1/4+nϵ1/2).\displaystyle+O(n^{-1/4}+n^{\epsilon-1/2})\,.

By definition, the first term is a matrix with rank at most three since 𝖮\mathsf{O} is of rank at most three. For the second term, from (B.52) and Lemma A.6, we have

n𝐃1/2(2f(τ)p𝐗𝐗+ς(λ)𝐈n)𝐃1/2\displaystyle\Big{\|}n\mathbf{D}^{-1/2}\left(-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}_{n}\right)\mathbf{D}^{-1/2}
1f(2)(2f(τ)p𝐗𝐗+ς(λ)𝐈n)n1/2.\displaystyle\qquad-\frac{1}{f(2)}\left(-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}_{n}\right)\Big{\|}\prec n^{-1/2}.

As a result, since the spectra of n𝐀n\mathbf{A} and n𝐃1/2𝐖𝐃1/2n\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2} are the same, and with probability 1O(n1/2)1-O(n^{-1/2}), n𝐃1/2𝐖𝐃1/2n\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2} can be approximated by 1f(2)(2f(τ)p𝐗𝐗+ς(λ)𝐈n)\frac{1}{f(2)}\left(-\frac{2f^{\prime}(\tau)}{p}\mathbf{X}^{\top}\mathbf{X}+\varsigma(\lambda)\mathbf{I}_{n}\right) plus a perturbation of rank at most 33 with an error bound O(n1/4)O(n^{-1/4}), we obtain the claim by Weyl’s lemma.

When 0<α<10<\alpha<1, by a discussion similar to (B.52) using Lemma A.5, we find that

(n1𝐃)1f(τ)1𝐈nn1/2+λp.\|(n^{-1}\mathbf{D})^{-1}-f(\tau)^{-1}\mathbf{I}_{n}\|\prec n^{-1/2}+\frac{\lambda}{p}\,.

By the same argument as that for α=0\alpha=0, we conclude the proof.

Next, we handle the case 1α<2.1\leq\alpha<2. Note that

𝐀𝐀a1\displaystyle\|\mathbf{A}-\mathbf{A}_{a_{1}}\|\leq\, 𝐃1(𝐖𝐖a1)+𝐃1(𝐃a1𝐃)𝐃a11𝐖a1\displaystyle\|\mathbf{D}^{-1}(\mathbf{W}-\mathbf{W}_{a_{1}})\|+\|\mathbf{D}^{-1}(\mathbf{D}_{a_{1}}-\mathbf{D})\mathbf{D}_{a_{1}}^{-1}\mathbf{W}_{a_{1}}\|
(B.53) \displaystyle\leq\, 𝐃1𝐖𝐖a1+𝐃1(𝐃𝐃a1)𝐃a11𝐖a1.\displaystyle\|\mathbf{D}^{-1}\|\|\mathbf{W}-\mathbf{W}_{a_{1}}\|+\|\mathbf{D}^{-1}(\mathbf{D}-\mathbf{D}_{a_{1}})\|\|\mathbf{D}_{a_{1}}^{-1}\mathbf{W}_{a_{1}}\|\,.

By a direct expansion, we have

(B.54) 𝐃(i,i)𝐃a1(i,i)\displaystyle\mathbf{D}(i,i)-\mathbf{D}_{a_{1}}(i,i)
=\displaystyle=\, jiexp(υ𝐱i𝐱j22p)jiexp(2υ)exp(υ𝐳i𝐳j22p)\displaystyle\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{p}\right)-\sum_{j\neq i}\exp(-2\upsilon)\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)
=\displaystyle=\, jiexp(υ𝐳i𝐳j22p)\displaystyle\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)
×[exp(υ𝐲i𝐲j22p)exp(2υ(𝐳i𝐳j)(𝐲i𝐲j)p)exp(2υ)].\displaystyle\times\Big{[}\exp\left(-\upsilon\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{p}\right)\exp\left(-2\upsilon\frac{(\mathbf{z}_{i}-\mathbf{z}_{j})^{\top}(\mathbf{y}_{i}-\mathbf{y}_{j})}{p}\right)-\exp(-2\upsilon)\Big{]}.

To control 𝐃(i,i)𝐃a1(i,i)\mathbf{D}(i,i)-\mathbf{D}_{a_{1}}(i,i), we bound the terms on the right-hand side. First, since exp(2υ)\exp(-2\upsilon) is the zeroth-order Taylor approximation of exp(υ𝐲i𝐲j22/p)\exp(-\upsilon\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}/p) around 𝐲i𝐲j22/p=2\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}/p=2, together with (B.36) and the fact that maxij|exp(υ𝐲i𝐲j22p)exp(2υ)|n1/2\max_{ij}\Big{|}\exp\left(-\upsilon\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{p}\right)-\exp(-2\upsilon)\Big{|}\prec n^{-1/2}, we have

maxi,j|exp(υ𝐲i𝐲j22p)exp(2υ(𝐳i𝐳j)(𝐲i𝐲j)p)\displaystyle\max_{i,j}\Big{|}\exp\left(-\upsilon\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{p}\right)\exp\left(-2\upsilon\frac{(\mathbf{z}_{i}-\mathbf{z}_{j})^{\top}(\mathbf{y}_{i}-\mathbf{y}_{j})}{p}\right)
(B.55) exp(2υ)|nα/21.\displaystyle\qquad-\exp(-2\upsilon)\Big{|}\prec n^{\alpha/2-1}\,.

Since exp(υ𝐳i𝐳j22p)>0\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)>0, this implies that

𝐃(i,i)𝐃a1(i,i)=O(nα/21jiexp(υ𝐳i𝐳j22p)).\displaystyle\mathbf{D}(i,i)-\mathbf{D}_{a_{1}}(i,i)=O_{\prec}\left(n^{\alpha/2-1}\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)\right)\,.

Denote Δ:=𝐃1(𝐃𝐃a1).\Delta:=\mathbf{D}^{-1}(\mathbf{D}-\mathbf{D}_{a_{1}}). Since 𝐃(i,i)\mathbf{D}(i,i) and 𝐃a1(i,i)\mathbf{D}_{a_{1}}(i,i) are positive and 𝐃a1(i,i)=jiexp(υ𝐳i𝐳j22p)+1\mathbf{D}_{a_{1}}(i,i)=\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)+1, we have that

maxi|Δ(i,i)|\displaystyle\max_{i}|\Delta(i,i)| =|𝐃(i,i)𝐃a1(i,i)|𝐃(i,i)\displaystyle=\frac{|\mathbf{D}(i,i)-\mathbf{D}_{a_{1}}(i,i)|}{\mathbf{D}(i,i)}
nα/21jiexp(υ𝐳i𝐳j22p)jiexp(υ𝐳i𝐳j22p)+1nα/21.\displaystyle\prec\frac{n^{\alpha/2-1}\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)}{\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)+1}\prec n^{\alpha/2-1}.

As a result, we have 𝐃1(𝐃𝐃a1)𝐃a11𝐖a1nα/21\|\mathbf{D}^{-1}(\mathbf{D}-\mathbf{D}_{a_{1}})\|\|\mathbf{D}_{a_{1}}^{-1}\mathbf{W}_{a_{1}}\|\prec n^{\alpha/2-1} since 𝐃a11𝐖a11\|\mathbf{D}_{a_{1}}^{-1}\mathbf{W}_{a_{1}}\|\leq 1.

Next, we control 𝐃1𝐖𝐖a1\|\mathbf{D}^{-1}\|\|\mathbf{W}-\mathbf{W}_{a_{1}}\|. Using the same argument leading to (A.30), we find that for 0<δ<10<\delta<1, with high probability, there exist O(nδ)O(n^{\delta}) indices ii such that |zi|n(1δ).|z_{i}|\leq n^{-(1-\delta)}. Note that (A.23) requires that δ3α2.\delta\geq\frac{3-\alpha}{2}. Choosing δ=3α2,\delta=\frac{3-\alpha}{2}, we obtain that with high probability, for some constant C>0C>0,

\min_{i}\mathbf{D}(i,i)\geq C\inf_{i}\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)\geq Cn^{\frac{3-\alpha}{2}}.

This implies that 𝐃1n(α3)/2.\|\mathbf{D}^{-1}\|\prec n^{(\alpha-3)/2}. Combining this with (2.18), we find that 𝐃1𝐖𝐖a1nα/21\|\mathbf{D}^{-1}\|\|\mathbf{W}-\mathbf{W}_{a_{1}}\|\prec n^{\alpha/2-1}, and hence

𝐀𝐀a1nα/21.\|\mathbf{A}-\mathbf{A}_{a_{1}}\|\prec n^{\alpha/2-1}\,.

This concludes our proof.

Then we prove the case when α2.\alpha\geq 2. The counterpart of (2.20) follows from a discussion similar to the case 1α<21\leq\alpha<2 except that (B.54) should be replaced by

𝐃(i,i)𝐃~a1(i,i)=\displaystyle\mathbf{D}(i,i)-\widetilde{\mathbf{D}}_{a_{1}}(i,i)=\, jiexp(υ𝐳i𝐳j22p)exp(2υ(𝐳i𝐳j)(𝐲i𝐲j)p)\displaystyle\sum_{j\neq i}\exp\left(-\upsilon\frac{\|\mathbf{z}_{i}-\mathbf{z}_{j}\|_{2}^{2}}{p}\right)\exp\left(-2\upsilon\frac{(\mathbf{z}_{i}-\mathbf{z}_{j})^{\top}(\mathbf{y}_{i}-\mathbf{y}_{j})}{p}\right)
×[exp(υ𝐲i𝐲j22p)exp(2υ)],\displaystyle\times\left[\exp\left(-\upsilon\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{p}\right)-\exp(-2\upsilon)\right],

so that

maxi|𝐃(i,i)𝐃~a1(i,i)|n1/2.\max_{i}|\mathbf{D}(i,i)-\widetilde{\mathbf{D}}_{a_{1}}(i,i)|\prec n^{-1/2}.

We omit the details due to similarity. Finally, we prove the results when α\alpha is larger in the sense of (2.21). By (B.50) and the definition of 𝐖\mathbf{W} (i.e., 𝐖(i,i)=1\mathbf{W}(i,i)=1), we find that with probability at least 1O(n1t(α1)/2)1-O(n^{1-t(\alpha-1)/2}), for some constant C>0C>0,

𝐃𝐈nCnexp(υ(λp)1t).\|\mathbf{D}-\mathbf{I}_{n}\|\leq Cn\exp\left(-\upsilon\left(\frac{\lambda}{p}\right)^{1-t}\right).

This bound, together with (2.23), leads to

𝐖𝐃1/2𝐖𝐃1/2\displaystyle\|\mathbf{W}-\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\|\leq (𝐃1/2+1)𝐖𝐃1/2𝐈n\displaystyle\,(\|\mathbf{D}^{-1/2}\|+1)\|\mathbf{W}\|\|\mathbf{D}^{-1/2}-\mathbf{I}_{n}\|
\displaystyle\leq  2Cnexp(υ(λp)1t).\displaystyle\,2Cn\exp\left(-\upsilon\left(\frac{\lambda}{p}\right)^{1-t}\right).

Since \mathbf{A} is similar to \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}, the two matrices share the same eigenvalues, and Weyl's lemma gives \max_{i}|\lambda_{i}(\mathbf{A})-\lambda_{i}(\mathbf{W})|\leq\|\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}-\mathbf{W}\|, so the claim follows. This finishes the proof of the third part when (2.21) holds.

Appendix C Proof of the results in Section 3

In this section, we prove the main results in Section 3.

C.1. Proof of Theorem 3.1

In this subsection, we prove Theorem 3.1 for h=\lambda+p. We only prove the result for \mathbf{W} and omit the details for \mathbf{A}, since the proof is similar to that of Corollary 2.10. Throughout the proof, we write f(x)=\exp(-\upsilon x) for ease of statement.

Proof.

For part (1), denote

f(𝐱i𝐱j22h)=g(𝐱i𝐱j22p),h=p+λ,f\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{h}\right)=g\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{p}\right),\ h=p+\lambda,

where g(x):=f(px/h).g(x):=f(px/h). Since 0α<10\leq\alpha<1, we have that

pp+λ1.\frac{p}{p+\lambda}\asymp 1.

Then we can apply the proofs of Theorems 2.3 and 2.5 verbatim to the kernel function g(x) to conclude the proof. The only difference is an extra factor p/h in the derivative of the kernel; we omit the details.
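For concreteness, the extra factor is simply the chain rule applied to the rescaled kernel; a short worked version, under the stated choices f(x)=\exp(-\upsilon x) and h=p+\lambda, reads

g^{\prime}(x)=\frac{p}{h}f^{\prime}\left(\frac{px}{h}\right)=-\upsilon\frac{p}{h}\exp\left(-\upsilon\frac{px}{h}\right),\qquad\frac{p}{h}=\frac{p}{p+\lambda}\asymp 1\ \mbox{ for }0\leq\alpha<1,

so every derivative bound used for f in the proofs of Theorems 2.3 and 2.5 holds for g with constants changed by at most a factor of order one.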

For part (2), since β=λ/(λ+p)1\beta=\lambda/(\lambda+p)\asymp 1, (3.3) and (3.4) follow from the proof of (2.18) and Case (I) (below equation (B.3)) of the proof of (2.19), respectively. For (3.5), the proof follows from (3.3) and the bound

𝐖a2𝐖1=\displaystyle\left\|\mathbf{W}_{a_{2}}-\mathbf{W}_{1}\right\|= (exp(2pυ/(p+λ))1)𝐖1+(1exp(2pυ/(p+λ)))𝐈n\displaystyle\,\|(\exp(-2p\upsilon/(p+\lambda))-1)\mathbf{W}_{1}+(1-\exp(-2p\upsilon/(p+\lambda)))\mathbf{I}_{n}\|
\displaystyle\leq n1α𝐖1+n1αn1α+n2α,\displaystyle\,n^{1-\alpha}\|\mathbf{W}_{1}\|+n^{1-\alpha}\prec n^{1-\alpha}+n^{2-\alpha}\,,

where the first inequality comes from 1-\exp(-\frac{2p\upsilon}{p+\lambda})\asymp\frac{2p\upsilon}{p+\lambda}\asymp n^{1-\alpha} when \alpha>1 (note that \frac{2p\upsilon}{p+\lambda}=o(1) in this case), and the last bound follows from the fact that \|\mathbf{W}_{1}\|\prec n. ∎

C.2. Proof of Corollary 3.2

In this subsection, we prove Corollary 3.2.

Proof.

First, consider the first statement when 0\leq\alpha<1. Recall that the square of a sub-Gaussian random variable is sub-exponential, and that a sum of sub-exponential random variables is again sub-exponential. Since \mathbf{x}_{i},\mathbf{x}_{j}, i\neq j, are independent, the random variable \|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2} is sub-exponential, and by Lemma A.5, we have

𝐱i𝐱j22=2(p+λ)+O(pα+p).\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}=2(p+\lambda)+O_{\prec}(p^{\alpha}+\sqrt{p})\,.

Since \alpha<1, when p is large enough, \|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2} concentrates around 2p with high probability. Thus, for any \omega\in(0,1),

(C.1) hph\asymp p

holds with high probability. We can thus rewrite

f(𝐱i𝐱j22h)=g(𝐱i𝐱j22p),f\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{h}\right)=g\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{p}\right),

where g(x):=f(px/h). Then we can apply the proofs of Theorems 2.3 and 2.5 to the kernel function g(x) to conclude the proof. As before, the only difference is an extra factor p/h in the derivative of the kernel, which is of order one by (C.1); we omit the details.

We next prove the second statement when α1\alpha\geq 1. We first show the counterpart of (3.3). In this case, we claim that for some given sufficiently small ϵ>0\epsilon>0, we have that with high probability, for some constants C1,C2>0C_{1},C_{2}>0,

(C.2) C1(λlog1n+p)hC2λlog2n.C_{1}(\lambda\log^{-1}n+p)\leq h\leq C_{2}\lambda\log^{2}n.

To see this claim, we follow the notation of (A.3). Note that

𝐱i𝐱j22=\displaystyle\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}=\, 𝐲i𝐲j2+λ(zizj)2+2λ(𝐲i1𝐲j1)(zizj).\displaystyle\|\mathbf{y}_{i}-\mathbf{y}_{j}\|^{2}+\lambda(z_{i}-z_{j})^{2}+2\sqrt{\lambda}(\mathbf{y}_{i1}-\mathbf{y}_{j1})(z_{i}-z_{j})\,.

By Lemma A.4, \|\mathbf{y}_{i}-\mathbf{y}_{j}\|^{2}=2p+O_{\prec}(\sqrt{p}). Also, since \mathbf{y}_{i1}-\mathbf{y}_{j1} and z_{i}-z_{j} are sub-Gaussian, |(\mathbf{y}_{i1}-\mathbf{y}_{j1})(z_{i}-z_{j})|=O_{\prec}(\log(n)). It thus remains to handle (z_{i}-z_{j})^{2}. On the one hand, since z_{i}-z_{j} is sub-Gaussian, we have that for some constant C>0, (z_{i}-z_{j})^{2}\leq C\log^{2}n with high probability; on the other hand, using a discussion similar to (A.30) by setting \epsilon=0.5(\log\log n/\log n), we have that with high probability there are at least Cn/\sqrt{\log n} of the (z_{i}-z_{j})^{2}\,'s satisfying (z_{i}-z_{j})^{2}\geq\log^{-1}n. Since \omega\in(0,1), when n is large enough, we have \omega>C/\sqrt{\log n}. This proves (C.2). Using the bandwidth h, we now have

𝐖y(i,j)=f(𝐲i𝐲j22h)=f(ph𝐲i𝐲j22p).\mathbf{W}_{y}(i,j)=f\left(\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{h}\right)=f\left(\frac{p}{h}\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{p}\right).

Denote \widetilde{\mathbf{C}}_{0}=f(2p/h)\mathbf{1}\mathbf{1}^{\top}+(1-f(2p/h))\mathbf{I}. Consider the same expansion as in (B.3) using the bandwidth h; that is, \mathbf{W}-\mathbf{W}_{a_{2}}=[(\mathbf{W}_{c}-\bm{1}\bm{1}^{\top})\circ\mathbf{W}_{y}\circ\mathbf{W}_{1}]+[(\mathbf{W}_{y}-\widetilde{\mathbf{C}}_{0})\circ\mathbf{W}_{1}]. By Lemma A.1, we have

𝐖𝐖a2\displaystyle\|\mathbf{W}-\mathbf{W}_{a_{2}}\|\leq\, (maxi,j|𝐖c(i,j)1|maxi,j𝐖y(i,j)\displaystyle(\max_{i,j}\left|\mathbf{W}_{c}(i,j)-1\right|\max_{i,j}\mathbf{W}_{y}(i,j)
+maxi,j|𝐖y(i,j)𝐂~0(i,j)|)𝐖1.\displaystyle+\max_{i,j}|\mathbf{W}_{y}(i,j)-\widetilde{\mathbf{C}}_{0}(i,j)|)\|\mathbf{W}_{1}\|\,.

Due to (C.2), we have |(zizj)(𝐲i1𝐲j1)|/h=O(λ1/2)|(z_{i}-z_{j})(\mathbf{y}_{i1}-\mathbf{y}_{j1})|/h=O_{\prec}(\lambda^{-1/2}) and hence maxi,j|𝐖c(i,j)1|λ/h=O(λ1/2)=O(nα/2)\max_{i,j}|\mathbf{W}_{c}(i,j)-1|\prec\sqrt{\lambda}/h=O(\lambda^{-1/2})=O(n^{-\alpha/2}). On the other hand, we have maxi,j𝐖y(i,j)1\max_{i,j}\mathbf{W}_{y}(i,j)\leq 1 by definition and maxi,j|𝐖y(i,j)𝐂~0(i,j)|n1/2\max_{i,j}|\mathbf{W}_{y}(i,j)-\widetilde{\mathbf{C}}_{0}(i,j)|\prec n^{-1/2} by Lemma A.4. As a result, we obtain

(C.3) 𝐖𝐖a2n1/2𝐖1.\|\mathbf{W}-\mathbf{W}_{a_{2}}\|\prec n^{-1/2}\|\mathbf{W}_{1}\|.

To finish the proof of the counterpart of (3.3), we need to control \|\mathbf{W}_{1}\| with h satisfying (C.2). We use the trivial bound \|\mathbf{W}_{1}\|=O(n), which follows from the Gershgorin circle theorem. As a result, we have

1n𝐖1n𝐖a2n1/2.\left\|\frac{1}{n}\mathbf{W}-\frac{1}{n}\mathbf{W}_{a_{2}}\right\|\prec n^{-1/2}\,.

The argument for the counterpart of (3.4) is analogous to that in Case (\mathbf{I}) in the proof of (2.19), with the following necessary modifications. For \beta:=\lambda/h, we have \frac{1}{C_{2}\log^{2}n}\leq\beta\leq\frac{1}{C_{1}(\log^{-1}n+p/\lambda)} by (C.2). First, when \beta\asymp 1, the discussion reduces to Case (\mathbf{I}) (i.e., the arguments below (B.3)) of the proof of (2.19). Second, when \beta diverges and \beta\leq C\min\{\log n,n^{\alpha-1}\} for some constant C>0, the discussion again reduces to Case (\mathbf{I}). Finally, we discuss \beta=o(1) satisfying \beta\geq 1/(C_{2}\log^{2}n). In this case, we still have t_{0}<1. Recall (B.38), where we have 1-t_{0}^{2}\asymp\frac{1}{\beta}\leq C\log^{2}n for some constant C>0. However, compared to the error bound in (B.46), the extra factor \log^{2}n is negligible. Then this case again reduces to Case (\mathbf{I}). This completes the proof.

Appendix D Verification of Remark 2.8

In this section, we justify (2.25) of Remark 2.8.

Justification of Remark 2.8.

We focus on the case α=1\alpha=1, and the same claim holds for 1<α<21<\alpha<2. Recall (B.29). Using a discussion similar to (B.3), we have that

𝐖𝐖b1=\displaystyle\mathbf{W}-\mathbf{W}_{b_{1}}=\, (𝐖c𝟏𝟏)𝐖y𝐖1+𝐖1𝐑1+𝐖1(exp(2υ)𝟏𝟏)\displaystyle(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{W}_{y}\circ\mathbf{W}_{1}+\mathbf{W}_{1}\circ\mathbf{R}_{1}+\mathbf{W}_{1}\circ\left(\exp(-2\upsilon)\mathbf{1}\mathbf{1}^{\top}\right)
(D.1) :=\displaystyle:=\, 𝐄0+𝐄1+𝐄2,\displaystyle\mathbf{E}_{0}+\mathbf{E}_{1}+\mathbf{E}_{2},

where 𝐑1\mathbf{R}_{1} is the error of the first order entrywise expansion of 𝐖y\mathbf{W}_{y} defined as

𝐑1\displaystyle\mathbf{R}_{1} :=𝐖y[exp(2υ)𝟏𝟏+2υexp(2υ)p𝐘𝐘+2υexp(4υ)𝐈]\displaystyle:=\mathbf{W}_{y}-\Big{[}\exp(-2\upsilon)\mathbf{1}\mathbf{1}^{\top}+\frac{2\upsilon\exp(-2\upsilon)}{p}\mathbf{Y}^{\top}\mathbf{Y}+2\upsilon\exp(-4\upsilon)\mathbf{I}\Big{]}
(D.2) :=𝐖y𝐔y.\displaystyle:=\mathbf{W}_{y}-\mathbf{U}_{y}\,.

Take the same decomposition as in (B.44) with some fixed large constant C_{0}>0 and m=C_{0}\log n. By (B.46), we have that with high probability

(D.3) rank(𝐖1,1)m,maxi,j|𝐖1,2(i,j)|nD,\operatorname{rank}(\mathbf{W}_{1,1})\leq m,\quad\max_{i,j}|\mathbf{W}_{1,2}(i,j)|\leq n^{-D},

for some large constant D>2. We start with the discussion of \mathbf{E}_{2} in (D.1). Using (B.43) and the fact that \mathbf{W}_{1}\circ\left(\exp(-2\upsilon)\mathbf{1}\mathbf{1}^{\top}\right)=\exp(-2\upsilon)\mathbf{W}_{1}, we obtain that

(D.4) 𝐄2\displaystyle\mathbf{E}_{2} =exp(2υ)𝐖1,1+exp(2υ)𝐖1,2=:𝐄2,1+𝐄2,2.\displaystyle=\exp(-2\upsilon)\mathbf{W}_{1,1}+\exp(-2\upsilon)\mathbf{W}_{1,2}=:\mathbf{E}_{2,1}+\mathbf{E}_{2,2}\,.

By (D.3) and the Gershgorin circle theorem, with high probability, for some constant C>0,C>0, we have

(D.5) \operatorname{rank}\left(\mathbf{E}_{2,1}\right)\leq m\ \ \mbox{ and }\ \ \left\|\mathbf{E}_{2,2}\right\|\leq Cn^{-D+1}.
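For the second bound in (D.5), a short worked version of the Gershgorin step, assuming (as in the rest of the argument) that \mathbf{W}_{1,2} inherits the symmetry of \mathbf{W}_{1}, reads

\|\mathbf{E}_{2,2}\|\leq\max_{i}\sum_{j=1}^{n}|\mathbf{E}_{2,2}(i,j)|\leq\exp(-2\upsilon)\,n\max_{i,j}|\mathbf{W}_{1,2}(i,j)|\leq Cn^{-D+1},

where the last step uses (D.3).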

We control m𝐖b1+𝐄1+𝐄2(z)m𝐖b1+𝐄1(z)m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z) by the triangle inequality,

|m𝐖b1+𝐄1+𝐄2(z)m𝐖b1+𝐄1(z)|\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)|
\displaystyle\leq\, |m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2,1}}(z)|+|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2,1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)|\,.

By Lemma A.3, we have

|m𝐖b1+𝐄1+𝐄2(z)m𝐖b1+𝐄1+𝐄2,1(z)|\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2,1}}(z)|
(D.6) \displaystyle\leq\, rank(𝐄2,2)nmin{2η,𝐄2,2η2}CnD1η2,\displaystyle\frac{\operatorname{rank}(\mathbf{E}_{2,2})}{n}\min\left\{\frac{2}{\eta},\,\frac{\|\mathbf{E}_{2,2}\|}{\eta^{2}}\right\}\leq\frac{C}{n^{D-1}\eta^{2}}\,,

where we use the fact that the rank of 𝐄2,2\mathbf{E}_{2,2} may be full and DD is large. Similarly, we have

|m𝐖b1+𝐄2,1+𝐄1(z)m𝐖b1+𝐄1(z)|\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{2,1}+\mathbf{E}_{1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)|
(D.7) \displaystyle\leq\, rank(𝐄2,1)nmin{2η,𝐄2,1η2}2C0lognnη,\displaystyle\frac{\operatorname{rank}(\mathbf{E}_{2,1})}{n}\min\left\{\frac{2}{\eta},\,\frac{\|\mathbf{E}_{2,1}\|}{\eta^{2}}\right\}\leq\frac{2C_{0}\log n}{n\eta}\,,

where we use the fact that \|\mathbf{E}_{2,1}\|\leq\|\exp(-2\upsilon)\mathbf{W}_{1}\|\leq\exp(-2\upsilon)n^{-\delta} for some 0<\delta<1/2, as in the argument leading to (A.23). In conclusion, since D is a large constant, we have controlled the \mathbf{E}_{2} term in the Stieltjes transform, and obtain

(D.8) |m𝐖b1+𝐄1+𝐄2(z)m𝐖b1+𝐄1(z)|3C0lognnη\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)|\leq\frac{3C_{0}\log n}{n\eta}\,

for z\in\mathcal{D} defined in (2.8). It is easy to see that the above control is negligible compared to the rate n^{-1/2+\epsilon}\eta^{-2} for any arbitrarily small \epsilon>0. Consequently, it suffices to focus on \mathbf{E}_{1} in (D.1). To control \mathbf{E}_{1}, note that

(D.9) 𝐄1=𝐖1,1𝐑1+𝐖1,2𝐑1\mathbf{E}_{1}=\mathbf{W}_{1,1}\circ\mathbf{R}_{1}+\mathbf{W}_{1,2}\circ\mathbf{R}_{1}

and we control

|m𝐖b1+𝐄1(z)m𝐖b1(z)|\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|
\displaystyle\leq |m𝐖b1+𝐖1,1𝐑1+𝐖1,2𝐑1(z)m𝐖b1+𝐖1,1𝐑1(z)|\displaystyle\,|m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}+\mathbf{W}_{1,2}\circ\mathbf{R}_{1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}}(z)|
+|m𝐖b1+𝐖1,1𝐑1(z)m𝐖b1(z)|.\displaystyle+|m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|\,.

By a discussion similar to (B.1) for the point cloud \{\mathbf{y}_{i}\} (or see the proof of Lemma 4.2 of [18]), we have

(D.10) maxi,j|𝐑1(i,j)|n1/2.\max_{i,j}|\mathbf{R}_{1}(i,j)|\prec n^{-1/2}.

Together with (D.3), by the Gershgorin circle theorem, we conclude that

(D.11) 𝐖1,2𝐑1nD+1/2.\|\mathbf{W}_{1,2}\circ\mathbf{R}_{1}\|\prec n^{-D+1/2}.

Since D>2, this error is negligible by the same argument as in (D.6); thus we have the desired control of |m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}+\mathbf{W}_{1,2}\circ\mathbf{R}_{1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}}(z)|. It remains to control |m_{\mathbf{W}_{b_{1}}+\mathbf{W}_{1,1}\circ\mathbf{R}_{1}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|. Set \mathcal{E}_{1}:=\mathbf{W}_{1,1}\circ\mathbf{R}_{1} for convenience. Note that with high probability, for some constant C>0, we have that

(D.12) maxi,j|𝐖1,1(i,j)|C.\max_{i,j}|\mathbf{W}_{1,1}(i,j)|\leq C.
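For readability, recall that for a symmetric matrix \mathbf{H}\in\mathbb{R}^{n\times n}, the empirical Stieltjes transform is, under the standard convention (see, e.g., [2, 31]),

m_{\mathbf{H}}(z)=\frac{1}{n}\operatorname{tr}\left(\mathbf{H}-z\mathbf{I}_{n}\right)^{-1}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\lambda_{i}(\mathbf{H})-z},\qquad z=E+\mathrm{i}\eta,\ \eta>0,

so that differences of Stieltjes transforms can be compared eigenvalue by eigenvalue, as in the next display.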

By the definition of the Stieltjes transform, for any even integer q, we have

|m𝐖b1+1(z)m𝐖b1+𝐑1(z)|q\displaystyle|m_{\mathbf{W}_{b_{1}}+\mathcal{E}_{1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{R}_{1}}(z)|^{q}
(1ni=1n|λi(𝐖b1+1)λi(𝐖b1+𝐑1)||λi(𝐖b1+1)z||λi(𝐖b1+𝐑1)z|)q,\displaystyle\leq\left(\frac{1}{n}\sum_{i=1}^{n}\frac{|\lambda_{i}(\mathbf{W}_{b_{1}}+\mathcal{E}_{1})-\lambda_{i}(\mathbf{W}_{b_{1}}+\mathbf{R}_{1})|}{|\lambda_{i}(\mathbf{W}_{b_{1}}+\mathcal{E}_{1})-z||\lambda_{i}(\mathbf{W}_{b_{1}}+\mathbf{R}_{1})-z|}\right)^{q}\,,

which, by the Cauchy-Schwarz inequality, is bounded by

1nq[(i=1n|λi(𝐖b1+1)λi(𝐖b1+𝐑1)|2)\displaystyle\,\frac{1}{n^{q}}\bigg{[}\left(\sum_{i=1}^{n}|\lambda_{i}(\mathbf{W}_{b_{1}}+\mathcal{E}_{1})-\lambda_{i}(\mathbf{W}_{b_{1}}+\mathbf{R}_{1})|^{2}\right)
×(i=1n1|λi(𝐖b1+1)z|2|λi(𝐖b1+𝐑1)z|2)]q/2\displaystyle\qquad\qquad\times\left(\sum_{i=1}^{n}\frac{1}{|\lambda_{i}(\mathbf{W}_{b_{1}}+\mathcal{E}_{1})-z|^{2}|\lambda_{i}(\mathbf{W}_{b_{1}}+\mathbf{R}_{1})-z|^{2}}\right)\bigg{]}^{q/2}
(D.13) \displaystyle\leq 1nq/2η2q(tr{(1𝐑1)2})q/2,\displaystyle\,\frac{1}{n^{q/2}\eta^{2q}}(\text{tr}\{(\mathcal{E}_{1}-\mathbf{R}_{1})^{2}\})^{q/2},

where the last inequality comes from the Hoffman-Wielandt inequality (Lemma A.3) and the trivial bound |\lambda-z|^{-1}\leq\eta^{-1} for any \lambda\in\mathbb{R}. It remains to control the right-hand side of (D.13). By (D.12), the fact that \mathcal{E}_{1}-\mathbf{R}_{1}=(\mathbf{W}_{1,1}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{R}_{1}, and the symmetry of \mathcal{E}_{1}, for some constant C>0, we can replace (D.13) with

(D.14) Cnq/2η2q[tr(𝐑12)]q/2.\frac{C}{n^{q/2}\eta^{2q}}[\text{tr}(\mathbf{R}_{1}^{2})]^{q/2}\,.
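To spell out this replacement, note that since \mathbf{W}_{1,1} and \mathbf{R}_{1} are symmetric,

\text{tr}\{(\mathcal{E}_{1}-\mathbf{R}_{1})^{2}\}=\sum_{i,j}\left(\mathbf{W}_{1,1}(i,j)-1\right)^{2}\mathbf{R}_{1}(i,j)^{2}\leq(C+1)^{2}\sum_{i,j}\mathbf{R}_{1}(i,j)^{2}=(C+1)^{2}\text{tr}(\mathbf{R}_{1}^{2}),

where the inequality uses (D.12).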

Thus, we only need to consider the \mathbf{R}_{1} part, and we claim that (D.14) can be controlled by a term of order (\log n)^{q}n^{-q/2}\eta^{-2q}; the proof can be found in [18, eq. (C.8)]. By the same argument used to control |m_{\mathbf{W}_{b_{1}}+\mathcal{E}_{1}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{R}_{1}}(z)|^{q}, we also control |m_{\mathbf{W}_{b_{1}}+\mathbf{R}_{1}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|^{q} by a term of order (\log n)^{q}n^{-q/2}\eta^{-2q}. Collecting the above controls, including those of |m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)| and |m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|, we conclude the bound for |m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)-m_{\mathbf{W}_{b_{1}}}(z)|. To finish the proof, we control |m_{\mathbf{W}}(z)-m_{\mathbf{W}_{b_{1}}+\mathbf{E}_{1}+\mathbf{E}_{2}}(z)|, which amounts to controlling \mathbf{E}_{0}. We can control \mathbf{E}_{0} by an analogous argument, decomposing \mathbf{W}_{1} as in (B.43) and using (B.36), Lemma A.3, and a discussion similar to (D.13). We only sketch the proof. Recall (D.1). Note that we have

𝐄0=\displaystyle\mathbf{E}_{0}= (𝐖c𝟏𝟏)𝐔y𝐖1,1\displaystyle\,(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{U}_{y}\circ\mathbf{W}_{1,1}
+(𝐖c𝟏𝟏)𝐔y𝐖1,2\displaystyle+(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{U}_{y}\circ\mathbf{W}_{1,2}
+(𝐖c𝟏𝟏)𝐑1𝐖1,1\displaystyle+(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{R}_{1}\circ\mathbf{W}_{1,1}
+(𝐖c𝟏𝟏)𝐑1𝐖1,2.\displaystyle+(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{R}_{1}\circ\mathbf{W}_{1,2}.

We explain how to control the first term, which is the leading order term. Observe that

(𝐖c𝟏𝟏)𝐔y𝐖1,1\displaystyle(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{U}_{y}\circ\mathbf{W}_{1,1}
=\displaystyle=\, exp(2υ)(𝐖c𝟏𝟏)𝐖1,1\displaystyle\exp(-2\upsilon)(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{W}_{1,1}
+2υexp(4υ)[(𝐖c𝟏𝟏)𝐈n][𝐖1,1𝐈n]\displaystyle+2\upsilon\exp(-4\upsilon)[(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{I}_{n}][\mathbf{W}_{1,1}\circ\mathbf{I}_{n}]
+(𝐖c𝟏𝟏)2υexp(2υ)p𝐘𝐘𝐖1,1.\displaystyle+(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\frac{2\upsilon\exp(-2\upsilon)}{p}\mathbf{Y}^{\top}\mathbf{Y}\circ\mathbf{W}_{1,1}.

First, by (D.12) and (B.36), the operator norm of the second term can be bounded by n^{\alpha/2-1}, and hence the difference of the Stieltjes transforms can be bounded similarly to (D.7). Second, the first and third terms can be bounded using a discussion similar to (D.13) together with the counterparts of (D.10), which read as

maxi,j|[exp(2υ)(𝐖c𝟏𝟏)𝐖1,1](i,j)|nα/21\displaystyle\max_{i,j}\left|\left[\exp(-2\upsilon)(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\mathbf{W}_{1,1}\right](i,j)\right|\prec n^{\alpha/2-1}

and

maxi,j|[(𝐖c𝟏𝟏)2υexp(2υ)p𝐘𝐘𝐖1,1](i,j)|nα/21.\displaystyle\max_{i,j}\left|\left[(\mathbf{W}_{c}-\mathbf{1}\mathbf{1}^{\top})\circ\frac{2\upsilon\exp(-2\upsilon)}{p}\mathbf{Y}^{\top}\mathbf{Y}\circ\mathbf{W}_{1,1}\right](i,j)\right|\prec n^{\alpha/2-1}.

This completes our proof.

References

  • [1] A. A. Amini and Z. S. Razaee. Concentration of kernel matrices with application to kernel spectral clustering. The Annals of Statistics, 49(1):531 – 556, 2021.
  • [2] Z. Bai and J. W. Silverstein. Spectral analysis of large dimensional random matrices. Springer Series in Statistics. Springer, New York, second edition, 2010.
  • [3] J. Baik, G. Ben Arous, and S. Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab., 33(5):1643–1697, 2005.
  • [4] Z. Bao, X. Ding, J. Wang, and K. Wang. Statistical inference for principal components of spiked covariance matrices. The Annals of Statistics, 50(2):1144–1169, 2022.
  • [5] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural. Comput., 15(6):1373–1396, 2003.
  • [6] M. Belkin and P. Niyogi. Convergence of laplacian eigenmaps. In Advances in Neural Information Processing Systems, pages 129–136, 2007.
  • [7] F. Benaych-Georges and R. R. Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521, 2011.
  • [8] A. Bloemendal, A. Knowles, H.-T. Yau, and J. Yin. On the principal components of sample covariance matrices. Probab. Theory Related Fields, 164(1-2):459–552, 2016.
  • [9] A. Bojchevski, Y. Matkovic, and S. Günnemann. Robust spectral clustering for noisy data: Modeling sparse corruptions improves latent embeddings. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, page 737–746, 2017.
  • [10] C. Bordenave. Eigenvalues of Euclidean random matrices. Random Structures Algorithms, 33(4):515–532, 2008.
  • [11] C. Bordenave. On Euclidean random matrices in high dimension. Electron. Commun. Probab., 18:no. 25, 8, 2013.
  • [12] M. L. Braun. Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7(82):2303–2328, 2006.
  • [13] T. T. Cai, X. Han, and G. Pan. Limiting laws for divergent spiked eigenvalues and largest nonspiked eigenvalue of sample covariance matrices. The Annals of Statistics, 48(3):1255–1280, 2020.
  • [14] M.-Y. Cheng and H.-T. Wu. Local linear regression on manifolds and its geometric interpretation. Journal of the American Statistical Association, 108(504):1421–1434, 2013.
  • [15] X. Cheng and A. Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 2(04):1350010, 2013.
  • [16] F. R. K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. Published for the Conference Board of the Mathematical Sciences, Washington, DC; by the American Mathematical Society, Providence, RI, 1997.
  • [17] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
  • [18] X. Ding and H.-T. Wu. On the spectral property of kernel-based sensor fusion algorithms of high dimensional data. IEEE Transactions on Information Theory, 67(1):640–670, 2021.
  • [19] X. Ding and F. Yang. Spiked separable covariance matrices and principal components. Ann. Statist., 49(2):1113–1138, 2021.
  • [20] X. Ding and F. Yang. Edge statistics of large dimensional deformed rectangular matrices. Journal of Multivariate Analysis, page 105051, 2022.
  • [21] X. Ding and F. Yang. Tracy-Widom distribution for heterogeneous gram matrices with applications in signal detection. IEEE Transactions on Information Theory, 68(10):6682–6715, 2022.
  • [22] Y. Do and V. Vu. The spectrum of random kernel matrices: universality results for rough and varying kernels. Random Matrices: Theory and Applications, 2(03):1350005, 2013.
  • [23] R. B. Dozier and J. W. Silverstein. Analysis of the limiting spectral distribution of large dimensional information-plus-noise type matrices. Journal of Multivariate Analysis, 98(6):1099–1122, 2007.
  • [24] R. B. Dozier and J. W. Silverstein. On the empirical distribution of eigenvalues of large dimensional information-plus-noise-type matrices. Journal of Multivariate Analysis, 98(4):678–694, 2007.
  • [25] D. B. Dunson, H.-T. Wu, and N. Wu. Spectral convergence of graph laplacian and heat kernel reconstruction in LL^{\infty} from random samples. Applied and Computational Harmonic Analysis, 55:282–336, 2021.
  • [26] N. El Karoui. On information plus noise kernel random matrices. Ann. Statist., 38(5):3191–3216, 10 2010.
  • [27] N. El Karoui. The spectrum of kernel random matrices. Ann. Statist., 38(1):1–50, 2010.
  • [28] N. El Karoui and H.-T. Wu. Graph connection Laplacian and random matrices with random blocks. Information and Inference: A Journal of the IMA, 4(1):1–44, 2015.
  • [29] N. El Karoui and H.-T. Wu. Graph connection Laplacian methods can be made robust to noise. The Annals of Statistics, 44(1):346–372, 2016.
  • [30] L. Erdős, A. Knowles, and H.-T. Yau. Averaging fluctuations in resolvents of random band matrices. Ann. Henri Poincaré, 14(8):1837–1926, 2013.
  • [31] L. Erdős and H. Yau. A Dynamical Approach to Random Matrix Theory. Courant Lecture Notes. Courant Institute of Mathematical Sciences, New York University, 2017.
  • [32] Z. Fan and A. Montanari. The spectral norm of random inner-product kernel matrices. Probability Theory and Related Fields, 173(1):27–85, 2019.
  • [33] D. Foata. A combinatorial proof of the Mehler formula. Journal of Combinatorial Theory, Series A, 24(3):367–376, 1978.
  • [34] N. García Trillos, M. Gerlach, M. Hein, and D. Slepcev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace-Beltrami operator. Found. Comput. Math., 20(4):827–887, 2020.
  • [35] E. Giné and V. Koltchinskii. Empirical graph Laplacian approximation of Laplace–Beltrami operators: Large sample results. In High dimensional probability, pages 238–259. Institute of Mathematical Statistics, 2006.
  • [36] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. J. Mach. Learn. Res., 8:1325–1368, 2007.
  • [37] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.
  • [38] T. Jiang. Distributions of eigenvalues of large Euclidean matrices generated from lpl_{p} balls and spheres. Linear Algebra Appl., 473:14–36, 2015.
  • [39] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295–327, 2001.
  • [40] A. Knowles and J. Yin. Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169(1):257–352, 2017.
  • [41] V. Koltchinskii and E. Giné. Random matrix approximation of spectra of integral operators. Bernoulli, pages 113–167, 2000.
  • [42] Z. Liao. A Random Matrix Framework for Large Dimensional Machine Learning and Neural Networks. PhD thesis, Université Paris-Saclay, 2019.
  • [43] Z. Liao and R. Couillet. On inner-product kernels of high dimensional data. In 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 579–583, 2019.
  • [44] Y.-T. Lin, J. Malik, and H.-T. Wu. Wave-shape oscillatory model for nonstationary periodic time series analysis. Foundations of Data Science, 3(2):99–131, 2021.
  • [45] V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967.
  • [46] F. G. Meyer and X. Shen. Perturbation of the eigenvectors of the graph laplacian: Application to image denoising. Applied and Computational Harmonic Analysis, 36(2):326–334, 2014.
  • [47] M. Mézard, G. Parisi, and A. Zee. Spectra of Euclidean random matrices. Nuclear Phys. B, 559(3):689–701, 1999.
  • [48] J. Nash. The imbedding problem for Riemannian manifolds. Annals of Mathematics, pages 20–63, 1956.
  • [49] H. Q. Ngo. \mathbb{P}-Species and the q-Mehler formula. Sém. Lothar. Combin, (48):B48b, 2002.
  • [50] D. Passemier and J. Yao. Estimation of the number of spikes, possibly equal, in the high-dimensional case. Journal of Multivariate Analysis, 127:173–183, 2014.
  • [51] D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17(4):1617–1642, 2007.
  • [52] S. Prasad Kasiviswanathan and M. Rudelson. Spectral Norm of Random Kernel Matrices with Applications to Privacy. arXiv preprint arXiv:1504.05880, 2015.
  • [53] L. Rosasco, M. Belkin, and E. De Vito. On learning with integral operators. J. Mach. Learn. Res., 11:905–934, 2010.
  • [54] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • [55] C. Shen and H.-T. Wu. Scalability and robustness of spectral embedding: landmark diffusion is all you need. Information and Inference: A Journal of the IMA, 2022. iaac013.
  • [56] T. Shnitzer, M. Ben-Chen, L. Guibas, R. Talmon, and H.-T. Wu. Recovering hidden components in multimodal data with composite diffusion operators. SIAM Journal on Mathematics of Data Science, 1(3):588–616, 2019.
  • [57] A. Singer. From graph to manifold laplacian: The convergence rate. Applied and Computational Harmonic Analysis, 21(1):128–134, 2006.
  • [58] A. Singer and H.-T. Wu. Vector diffusion maps and the connection Laplacian. Comm. Pure Appl. Math., 65(8):1067–1144, 2012.
  • [59] S. Steinerberger. A Filtering Technique for Markov Chains with Applications to Spectral Embedding. Applied and Computational Harmonic Analysis, 40:575–587, 2016.
  • [60] R. Talmon and R. R. Coifman. Empirical intrinsic geometry for nonlinear modeling and time series filtering. Proceedings of the National Academy of Sciences, 110(31):12535–12540, 2013.
  • [61] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  • [62] U. von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17(4):395–416, 2007.
  • [63] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Ann. Statist., 36(2):555–586, 2008.
  • [64] H.-T. Wu and N. Wu. Think globally, fit locally under the Manifold Setup: Asymptotic Analysis of Locally Linear Embedding. Annals of Statistics, 46(6B):3805–3837, 2018.