On uniform consistency of spectral embeddings
Abstract
In this paper, we study the convergence of the spectral embeddings obtained from the leading eigenvectors of certain similarity matrices to their population counterparts. We opt to study this convergence in a uniform (instead of average) sense and highlight the benefits of this choice. Using the Newton-Kantorovich Theorem and other tools from functional analysis, we first establish a general perturbation result for orthonormal bases of invariant subspaces. We then apply this general result to normalized spectral clustering. By tapping into the rich literature of Sobolev spaces and exploiting some concentration results in Hilbert spaces, we are able to prove a finite sample error bound on the uniform consistency error of the spectral embeddings in normalized spectral clustering.
1 Introduction
Spectral methods are a staple of modern statistics. For statistical learning tasks such as clustering or classification, one can featurize the data with spectral methods and then perform the task on the features. In the past twenty years, spectral methods have seen wide applications in image segmentation [33], novelty detection [18], community detection [13], and bioinformatics [17], and their effectiveness is partly credited to their ability to reveal the latent low-dimensional structure in the data.
Spectral embedding gets its name from the fact that the embeddings are constructed from the spectral decomposition of a positive-definite matrix. For example, in normalized spectral clustering [27], the normalized Laplacian embedding is given by
(1.1)
where are the observations, the ’s are the standard basis vectors (all zeros except a one in the -th entry), is the desired dimension of the embedding, and the columns of are the leading eigenvectors of the normalized Laplacian matrix. As described, spectral embeddings are only defined on points in the training data, but it is possible to evaluate them on points that are not in the training data through out-of-sample extensions [4, 41]. Some other examples of spectral methods are Isomap [36], Laplacian [3] and Hessian eigenmaps [14], and diffusion maps [10].
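To make the construction in (1.1) concrete, the following minimal numerical sketch computes the normalized Laplacian embedding from a Gaussian similarity matrix; the kernel choice, the bandwidth, and the function names are illustrative assumptions rather than choices made in this paper.

import numpy as np

def gaussian_kernel(X, bandwidth=1.0):
    # Pairwise similarities K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def normalized_laplacian_embedding(X, dim, bandwidth=1.0):
    # Rows of the returned matrix are the embedded sample points.
    K = gaussian_kernel(X, bandwidth)
    d = K.sum(axis=1)                      # degrees
    L = K / np.sqrt(np.outer(d, d))        # D^{-1/2} K D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :dim]       # leading `dim` eigenvectors

# Example: two well-separated Gaussian blobs embed into two nearly orthogonal directions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
print(normalized_laplacian_embedding(X, dim=2).shape)  # (100, 2)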
Since downstream procedures take the embeddings as input, it is imperative that the embeddings have certain consistency properties to ensure the quality of the ultimate output. Specifically, we ask
• In the large sample limit, do the embedded representations of the data “converge” to certain population level representations?
• If the embedded representations do converge, in what sense do they converge?
While there are many results on the convergence of eigenvalues and spectral projections, few results directly address the convergence of the embedded representations in a general setting; the notable exception is von Luxburg, Belkin and Bousquet [41]. This is a gap in the literature because it is the embedded representation, not the spectral projections or the eigenvalues, that is the input to downstream applications. In this paper, we address the two questions and provide direct answers: we show that the sample level embeddings converge uniformly to their population counterparts up to a unitary transformation. We improve the result of von Luxburg, Belkin and Bousquet [41] by considering multidimensional embeddings and allowing for non-simple eigenvalues.
For a concrete application of our result, let us return to spectral clustering. The population counterpart of the normalized Laplacian embedding is given by
where are the leading eigenfunctions of the normalized Laplacian operator [41, 30]. As is shown in von Luxburg, Belkin and Bousquet [41], the normalized Laplacian matrix has an operator counterpart that we shall refer to as the empirical normalized Laplacian operator. Let be the leading eigenfunctions of this operator, and define the embedding
(1.2)
The embedding coincides with on the sample points, i.e. for all . We shall show that the sample level embedding converges uniformly to its population counterpart:
(1.3)
where is some metric on . This implies converges uniformly to the restriction of to the sample points.
1.1 Main results
In this section, we state our results in an informal manner. These results are made precise and proved in subsequent sections.
Our first main result concerns the effect of perturbation on the invariant subspace of an operator. It serves as a general recipe for establishing uniform consistency type results. Although in statistics and machine learning, we mainly work with real-valued functions, our main spectral perturbation result is stated for complex-valued functions. This choice is technically convenient because the complex numbers are algebraically closed while the real numbers are not. In most applications of the result, the complex-valued functions only take real values.
Suppose is a complex Hilbert space whose elements are bounded complex-valued continuous functions over a domain . Let be two operators from to that are close in Hilbert-Schmidt norm. Let be the top eigenfunctions of and be those of . As long as and are appropriately normalized, we expect to be close to up to some unitary transformation. This is indeed the case and is characterized as follows by our first result.
Result 1 (General recipe for uniform consistency).
Define as and as . There are constants that only depend on such that as long as , we have
(1.4)
where is the space of unitary matrices in , and the -norm of an operator is defined as
It is not hard to notice the correspondence between (1.4) and (1.3): is the analogue of ; is the analogue of ; the two-to-infinity norm guarantees the convergence is uniform, and the distance metric is chosen to measure Euclidean distance up to a unitary transformation (normalized eigenfunctions of the same eigenvalue are only determined up to a unitary transformation). This observation justifies naming the uniform consistency error. It is also worth mentioning that the constant is inversely proportional to a measure of the eigengap between the -th and -th eigenvalues of . This provides further justification for studying the convergence of the leading eigenspace as a whole, rather than studying the convergence of the individual eigenspaces as in von Luxburg, Belkin and Bousquet [41]. Not only is the former more general and realistic, but it also leads to better constants. In many applications, the top eigenvalues are usually clustered together, but there is a large gap between the top eigenvalues and the rest of the spectrum. Thus it is hard to estimate the corresponding eigenfunctions individually, but easy to estimate them altogether up to a unitary transformation.
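Numerically, the left-hand side of (1.4) is often upper bounded by aligning the two embedding matrices with the Frobenius-optimal orthogonal matrix from the orthogonal Procrustes problem and then taking the largest row norm of the residual. The sketch below follows that heuristic; the Procrustes rotation need not be the exact minimizer of the two-to-infinity norm, so it gives an upper bound, and the function name is ours.

import numpy as np

def uniform_consistency_error_ub(A, B):
    # A, B: (n, r) arrays whose rows are the sample and population embeddings.
    # Align B to A with the Frobenius-optimal orthogonal matrix (Procrustes),
    # then report the largest row-wise Euclidean error of the residual.
    U, _, Vt = np.linalg.svd(B.T @ A)
    W = U @ Vt
    return np.linalg.norm(A - B @ W, axis=1).max()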
Result 1 provides a general approach to proving uniform consistency: we simply need to bound the difference between the sample level operator and its population counterpart in an appropriate norm. The proof of Result 1 is also interesting in its own right. We identify the invariant subspace directly by solving an operator equation and appeal to the Newton-Kantorovich Theorem to characterize the solution. The main benefit of this approach is that it overcomes the limitations of traditional approaches when working with non-unitarily invariant norms.
Our second main result is a finite sample uniform error bound for embedding in normalized spectral clustering. Let denote the space of bounded continuous complex-valued functions over . Define as where are the leading real-valued eigenfunctions of the normalized Laplacian operator. Define as where are defined as in (1.2) and real-valued. Applying Result 1, we obtain
Result 2 (Uniform consistency for normalized spectral clustering).
Under suitable conditions, there are constants that are independent of and the randomness of the sample, such that whenever the sample size for some , we have
with probability at least .
Although Result 2 is an application of Result 1, its proof is by no means simple. The main technical challenge is establishing concentration bounds for Hilbert-Schmidt operators. Result 2 suggests that the convergence rate, under appropriate conditions, is (modulo a log factor). Moreover, in the context of clustering, the notion of uniform consistency leads to stronger assurances about the correctness of the clustering output. For example, in spectral clustering, the points are clustered based on their embeddings. Uniform convergence implies the embeddings of all points are close to their population counterparts. As long as the error in the embeddings is small enough, it is possible to show that all points are correctly clustered. This is not possible if the embeddings only “converge in mean”: .
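To make the last claim concrete, here is a stylized sufficient condition; the separation and diameter assumptions below are ours, for illustration only, and we write $\hat\rho_n$ for the aligned sample embedding and $\rho$ for the population embedding. Suppose $\max_i \|\hat\rho_n(x_i)-\rho(x_i)\|\le\varepsilon$, population embeddings within a cluster are within distance $d$ of each other, and embeddings across clusters are at distance at least $s>d$. Then
\[
\|\hat\rho_n(x_i)-\hat\rho_n(x_j)\| \le d + 2\varepsilon \ \text{(same cluster)},
\qquad
\|\hat\rho_n(x_i)-\hat\rho_n(x_j)\| \ge s - 2\varepsilon \ \text{(different clusters)},
\]
so whenever $4\varepsilon < s - d$, every within-cluster distance is strictly smaller than every between-cluster distance and the clusters are recovered exactly. No such guarantee follows from an average-case bound, under which a small fraction of points may still be embedded far from their population positions.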
1.2 Related literature
Most closely related to our results are the works of von Luxburg, Belkin and Bousquet [41] and Rosasco, Belkin and Vito [29]. For normalized spectral clustering, von Luxburg, Belkin and Bousquet [41] proved the convergence of the eigenvalues and spectral projections of the sample level operator to their population counterparts. They also established uniform convergence to their population counterparts of eigenfunctions whose corresponding eigenvalue has multiplicity one. Our results are in the same vein as theirs in that we also study uniform convergence of eigenfunctions. We improve upon their uniform convergence result by considering multiple eigenfunctions at once and allowing for non-simple eigenvalues. In the context of unnormalized spectral clustering, Rosasco, Belkin and Vito [29] studied the convergence rate of the -distance between the ordered spectrum of the sample level operator and that of the population operator, and derived a finite sample bound for the deviation between the sample level and population level spectral projections associated with the top eigenvalues. They also obtained a finite sample spectral projection error bound for the asymmetric normalized graph Laplacian. Our work is related to theirs in that both study the convergence of the leading eigenspace, and we owe much of our concentration results to them. At the same time, the two works are quite distinct. Firstly, our notion of convergence is uniform consistency of the eigenfunctions, while theirs is in terms of the induced RKHS norm between the spectral projections. To the best of our knowledge, it is non-trivial to establish one set of results from the other. Secondly, we study the normalized symmetric graph Laplacian, while they study the unnormalized graph Laplacian and the asymmetric normalized graph Laplacian.
The general relationship between the spectral properties of an empirical operator/matrix and those of its population counterpart has also been studied in other contexts. In Koltchinskii and Giné [22], it is proved that the -distance between the ordered spectra of an integral operator and its empirical version tends to zero almost surely if and only if the kernel is square integrable. Convergence rates and distributional limits were also obtained under stronger conditions. In Koltchinskii [21], this line of results is extended by proving laws of large numbers and central limit theorems for quadratic forms induced by spectral projections. The investigation of spectrum convergence is continued in Mendelson and Pajor [25] and Mendelson and Pajor [26], where the authors related various distance metrics between two ordered spectra to the deviation of the sample mean of i.i.d. rank-one operators from its population mean. Similar problems have also been studied in the kernel principal component analysis (KPCA) literature. For example, in Shawe-Taylor et al. [31] and Blanchard, Bousquet and Zwald [5], the concentration properties of the sum of the top eigenvalues and the sum of all but the top eigenvalues of the empirical kernel matrix are studied, because such partial sums are closely related to the reconstruction error of KPCA. In Zwald and Blanchard [42], a finite sample error bound on the difference between the projection operator onto the leading eigenspace of the empirical covariance operator and that onto the leading eigenspace of the population covariance operator is derived. We remark that none of the results mentioned in this paragraph addressed the consistency of the embedding directly, nor did any consider the kernel matrix normalized by the degree matrix.
Unlike our results, which are model-agnostic, the properties of spectral methods have also been studied in model-specific settings. For example, Rohe, Chatterjee and Yu [28] and Lei and Rinaldo [23] investigated the spectral convergence properties of the graph Laplacian and the consistency of spectral clustering in terms of community membership recovery under stochastic block models. When the data are sampled from a finite mixture of nonparametric distributions, Shi, Belkin and Yu [32] studied how the leading eigenfunctions and eigenvectors of the population level integral operator can reflect clustering information; Schiebinger, Wainwright and Yu [30] studied the geometry of the embedded samples generated by normalized spectral clustering and showed that the embedded samples for different clusters are approximately orthogonal when the mixtures have small overlap and the sample size is large. We remark that in all the results mentioned in this section so far, the kernel function is fixed. For the relationship between the graph Laplacian and the Laplace-Beltrami operator on a manifold, and the properties of spectral clustering when the kernel is chosen adaptively, we refer readers to the series of works by Trillos et al. [38, 39, 37] and the references therein.
Lastly, entrywise or row-wise analysis of eigenvectors and eigenspaces of matrices has been studied in the recent literature. For general purposes, deterministic bounds are derived by Fan, Wang and Zhong [16], Cape, Tang and Priebe [7], and Damle and Sun [12], where the first two are for rectangular matrices and the last is for symmetric matrices. When probabilistic assumptions are imposed on the true and perturbed matrices, Cape, Tang and Priebe [8], Abbe et al. [1], and Mao, Sarkar and Chakrabarti [24] obtain stronger bounds for various tasks by taking advantage of the structure of the random matrices. Compared to this literature, we remark that our work provides a deterministic bound rooted in the perturbation theory of linear operators, and the bound can be applied to many problems in statistics (e.g., spectral clustering and kernel PCA) to help characterize the spectral embedding of individual samples.
1.3 Main contributions
We view our main contributions as threefold and list them in the order of appearance. First, we demonstrate that the Newton-Kantorovich Theorem provides a general approach to studying the effect of local perturbations on the invariant spaces of an operator. This result may be of independent interest to researchers working on spectral perturbation theory. Second, we study the convergence of the embeddings via the uniform consistency error and offer a general recipe for establishing non-asymptotic uniform consistency type results that handles multiple eigenfunctions at once and is not limited to simple eigenvalues. Third, we apply our recipe to normalized spectral clustering and give a novel proof of a finite sample error bound on the uniform consistency error of the spectral embeddings.
1.4 Structure of the paper
The rest of the paper is organized as follows: a review of relevant mathematical preliminaries is provided in Section 2; the exact statement and proof of Result 1 are in Section 3; the exact statement and proof of Result 2 are in Section 4; a discussion of various issues relevant to our results is in Section 5; proofs of some secondary lemmas and an additional application are relegated to the appendix.
2 Preliminaries and notations
In this section, we discuss various basic concepts and preliminary results that will be used repeatedly throughout the paper. More technical results that are section specific shall be introduced as needed later in the paper.
2.1 Operator theory
We assume readers are familiar with basic concepts such as Banach spaces, Hilbert spaces, linear operators, operator norms, and spectra of operators. From now on, we let denote either the field of real numbers or the field of complex numbers, denote Banach spaces over the same field , and denote Hilbert spaces over the same field .
We would like to first highlight a nuance in the definition of a linear operator. For a linear operator , we adopt the convention from Kato [20] and allow it to be defined only on a linear manifold in (in Kato [20], linear manifold is just a synonym for affine subspace), denoted (in Ciarlet [9], for example, such a distinction is not made). We call the domain of and can naturally define the range of as . As for , we call them the domain space and the range space respectively.
For a linear operator , we say is bounded if , and when is bounded, we define its operator norm . Throughout the paper, when has no subscript, it defaults to the operator norm. We use to denote the space of all bounded linear operators from to . When , we simply write as . We say is a compact operator if the closure of the image of any bounded set in under is compact. It is known that compact operators are bounded.
For a bounded linear operator , define its adjoint as the unique operator from to satisfying for . Here, we use to denote the inner product in the Hilbert space . A basic property of is , where the norm is operator norm and we use the notation to explicitly specify the domain space and range space. When , is called self-adjoint if is equal to its adjoint , and is called positive if for any , .
We say a Hilbert space is separable if it has a basis of countably many elements. We say a bounded linear operator is Hilbert-Schmidt if where is an orthonormal basis of . We use to denote the space of all Hilbert-Schmidt operators from to ; this space is also a Hilbert space with respect to the inner product . We use to denote the norm induced by this inner product and note that all Hilbert-Schmidt operators are compact. We also note the Hilbert-Schmidt norm is stronger than operator norm in that , and the Hilbert-Schmidt norm is compatible with the operator norm in the following sense: for any Hilbert-Schmidt operator and bounded operator , their product and are Hilbert-Schmidt and their Hilbert-Schmidt norm satisfies
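In standard notation (the symbols $T$, $S$ here are generic: $T$ Hilbert-Schmidt and $S$ bounded), these compatibility relations read
\[
\|T\|_{\mathrm{op}} \le \|T\|_{\mathrm{HS}},
\qquad
\|TS\|_{\mathrm{HS}} \le \|T\|_{\mathrm{HS}}\,\|S\|_{\mathrm{op}},
\qquad
\|ST\|_{\mathrm{HS}} \le \|S\|_{\mathrm{op}}\,\|T\|_{\mathrm{HS}}.
\]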
2.2 Spectral theory for linear operators
In this subsection, we set . Let be a bounded linear operator. Similar to matrices, we say is an eigenvalue of if for some eigenvector ,
In other words, is an eigenvalue if the null space is not . We call the eigenspace associated with , and the dimension of is called the geometric multiplicity of . The spectrum of is defined as , where is the resolvent set
Eigenvalues are in the spectrum, but generally contains more than just eigenvalues. If is a compact operator, has the following structure: is a countable set of isolated eigenvalues, each with finite geometric multiplicity, and the only possible accumulation point of is . If is self-adjoint, then all the eigenvalues must be real. If is a positive operator, then all its eigenvalues are real and non-negative. Therefore, for any compact positive self-adjoint operator, we can arrange the non-zero eigenvalues of into a non-increasing sequence of positive numbers (the largest eigenvalue is bounded by the operator norm of ), and repeat each eigenvalue a number of times equal to its geometric multiplicity.
Another remarkable fact in the spectral theory of linear operators concerns spectral projections. Let be a closed simple rectifiable curve. Assume the part of enclosed inside consists of a finite number of eigenvalues . Then the projection which projects to the direct sum of the eigenspaces of , i.e. , can be defined. Technicalities aside, this projection has the following contour integration expression
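In standard notation, with $T$ the operator in question and $\Gamma$ the curve, this is the Riesz projection
\[
P \;=\; \frac{1}{2\pi i}\oint_{\Gamma} (zI - T)^{-1}\,dz .
\]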
2.3 Function spaces
Let be a bounded open subset of , we now define several function spaces we are going to work with. Define the space of bounded continuous functions as
It can be shown that is a norm on and is a Banach space with respect to this infinity norm.
We can also define the space of complex-valued square integrable functions . Suppose is a measure space where is the Lebesgue -algebra and is a measure; then is defined as the set of measurable functions such that
In fact, is a Hilbert space with respect to the inner product
We also define , the space of square summable infinite sequences of complex numbers. It is well known that is a complex Hilbert space with respect to the inner product
2.4 Reproducing Kernel Hilbert Space (RKHS)
Let be a subset of and be a set of functions . Suppose is a Hilbert space with respect to some inner product . If in , all point evaluation functionals are bounded, i.e.
where is some constant depending on , then it can be shown that there exists a unique conjugate symmetric positive definite kernel function , such that the following reproducing property is satisfied:
The kernel is called the reproducing kernel and is called a reproducing kernel Hilbert space (RKHS).
We say a kernel function is positive definite if for any , any and any , the quadratic form
(2.1)
is non-negative. The kernel function for any RKHS is positive definite.
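In one common convention, with $k$ the reproducing kernel of a space $\mathcal{H}$ of functions on a set $\mathcal{X}$, the reproducing property and the quadratic form in (2.1) read
\[
f(x) = \langle f,\, k(\cdot,x)\rangle_{\mathcal H}\quad\text{for all } f\in\mathcal H,\ x\in\mathcal X,
\qquad
\sum_{i=1}^{n}\sum_{j=1}^{n} c_i\,\overline{c_j}\,k(x_i,x_j)\;\ge\;0 .
\]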
3 Uniform error bound for spectral embedding
In this section, we prove the first result described in the previous section. We first lay out the assumptions and notations. Let be a subset of and be a probability measure whose density function is supported on . Let denote the space of complex-valued square integrable functions on and be a subspace of . We assume is equipped with its own inner product and is a Hilbert space with respect to this inner product. We also require to be such that for every , which is an equivalence class in , there exists a representative function in the class such that . Since , is unique, and we can define the infinity norm on by setting . We require that on , the norm induced by the -inner product, denoted , be stronger than the infinity norm; that is, there is a constant such that for all . (This in fact implies is an RKHS, but since we do not use the reproducing property anywhere in the proof, we find framing as an RKHS unnecessary.)
Let and be two Hilbert-Schmidt operators from to ; can be seen as a perturbed version of and we use to denote their difference. Suppose all the eigenvalues of (counting geometric multiplicity) can be arranged in a non-increasing (possibly infinite) sequence of non-negative real numbers with a positive gap between and . Suppose the eigenvalues of can also be arranged in a non-increasing sequence of non-negative real numbers. We do not assume, however, any eigengap for .
Let be the eigenfunctions associated with eigenvalues . We assume are so picked that they constitute a set of orthonormal vectors in . We then pick so that constitute a complete orthonormal basis of . Define by and by . Define their adjoints with respect to the standard inner product on and . Since , we can also view as the range (domain) space of (). The exact range space of shall be clear from the context.
When the perturbation has small enough Hilbert-Schmidt norm, necessarily has an eigengap. In this case, the leading -dimensional invariant subspace of is well defined. We pick to be an orthonormal set of vectors in such that they span the leading invariant subspace of , and define as .
Last but not least, define and . They are intuitively the “coordinate space” for functions in under the basis in . Working with these coordinates simplifies our notation. The following facts regarding and hold true (with proofs in the appendix).
Lemma 3.1.
Assuming , we have
1. the set is equal to the set ;
2. is a subspace of ; it is also a Hilbert space with respect to the -induced inner product
3. , , , and , with operator norms satisfying
Because of item 1 of the lemma, we do not need to distinguish between and ; we denote both by . To keep notation manageable, define for any ; e.g. is shorthand for .
We also need the following quantities to define the constants in Result 1. Let be the boundary of the rectangle
(3.1)
Let denote the length of and define
(3.2)
which is necessarily finite. Define a measure of spectral separation
It is reasonable to expect that the larger the eigengap, the larger the . Moreover, when has only eigenvalues or is self-adjoint from to , the separation is provably lower bounded by the eigengap .
Define constant
We are now ready to state the main theorem of this section.
Theorem 3.2 (General recipe for uniform consistency).
Under the assumptions above, as long as as an operator from to has Hilbert-Schmidt norm , the uniform consistency error satisfies
Here, are two constants independent of the choice of defined as
(3.3)
(3.4)
We remark that since is inversely proportional to , it is beneficial to study the convergence of the leading eigenspace as a whole. When eigenspaces are treated individually, each eigenspace converges slowly because the leading eigenvalues may cluster together and we have a small , but when treated as a whole, we get a larger and thus faster convergence because the leading eigenvalues are well-separated from the rest of the spectrum.
The rest of the section is devoted to proving Theorem 3.2. The proof strategy is to express in terms of the solution of an operator equation and directly bound . We present the proof in five steps. In step one, we characterize the invariant subspace of in terms of the solution of a quadratic operator equation. In step two, we apply the Newton-Kantorovich Theorem to show this equation does have a solution when the perturbation is small. In step three, we introduce some additional conditions that guarantee the invariant space from step two is the leading invariant space. In step four, we directly bound the error term . In step five, we assemble all pieces together and prove Theorem 3.2. A similar approach was used in Stewart [35] to study the invariant subspace of matrices.
3.1 Step one: equation characterization of the invariant subspace
In this section, our goal is to find a such that the range of is an invariant subspace of . It turns out that any that satisfies the following quadratic operator equation suffices.
Proposition 3.3.
As long as satisfies the equation
(3.5)
the range of is an invariant subspace of .
Proof.
First, we note (3.5) is a well-defined equation of operators in . This can be seen from our assumption and item 3 of Lemma 3.1. Next, we assert that equation (3.5) implies (we do not differentiate equality in from equality in , because the two are equivalent)
(3.6)
which suggests that the range of is invariant under . To prove this assertion, note
∎
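For orientation, in Stewart-style invariant subspace perturbation theory, an equation of this kind takes the following block form: writing the perturbed operator with blocks $A_{11}, A_{12}, A_{21}, A_{22}$ relative to the unperturbed leading eigenspace and its orthogonal complement, the unknown (generically denoted $P$ here) solves the quadratic (Riccati-type) equation
\[
A_{21} + A_{22}P - PA_{11} - PA_{12}P = 0,
\]
after which the range of $\begin{pmatrix} I \\ P \end{pmatrix}$ is invariant under the perturbed operator. Equation (3.5) is of this general type, though the exact parametrization used here may differ; the block notation above is ours, for illustration.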
3.2 Step two: solve the equation with the Newton-Kantorovich Theorem
After characterizing the invariant subspace of in terms of a solution of (3.5), we apply the Newton-Kantorovich Theorem to prove a solution to (3.5) exists. The Newton-Kantorovich Theorem constructs a root of a function between Banach spaces when certain conditions on the function itself and its first and second order derivatives are met. The construction is algorithmic: the root is the limit point of a sequence of iterates generated by the Newton-Raphson method for root finding. The exact version of the Newton-Kantorovich Theorem we use is from the appendix of Karow and Kressner [19].
Theorem 3.4 (Newton-Kantorovich).
Let be Banach spaces and let be twice continuously differentiable in a sufficiently large neighborhood of . Suppose that there exists a linear operator with a continuous inverse and satisfying the following conditions:
(3.7)
(3.8)
(3.9)
If and , then there exists a solution of such that
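For orientation, one classical formulation of the theorem (stated here with generic constants, not necessarily matching the exact version from Karow and Kressner [19] used below) reads: with $L = F'(x_0)$ invertible, suppose
\[
\|L^{-1}F(x_0)\| \le \eta,
\qquad
\|L^{-1}F''(x)\| \le K \ \text{in a sufficiently large neighborhood of } x_0,
\qquad
h := K\eta \le \tfrac12 ;
\]
then $F$ has a root $x^{\ast}$ with $\|x^{\ast}-x_0\| \le \frac{1-\sqrt{1-2h}}{h}\,\eta \le 2\eta$.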
We are now ready to prove the proposition below, which states that when , , , are small relative to , equation (3.5) has a solution.
Proposition 3.5.
Let and . When and , there exists with such that equation (3.5) is satisfied.
Proof of Proposition 3.5.
After rearrangement, (3.5) is equivalent to
(3.10)
Let denote the space of bounded linear operators from to . Since is finite dimensional, any linear operator from to is bounded and Hilbert-Schmidt. We can thus use the Hilbert-Schmidt norm as the default norm on , and is a Hilbert space with respect to this norm. This fact also allows us to define as and as . Noting that the images of under are still in , we can verify that and are indeed well defined.
We assert that and that is one-to-one and onto, and defer the proof of this to a lemma. The implication of this is that is invertible with .
We are now ready to verify the three assumptions of Newton-Kantorovich theorem.
(A1):
(3.11)
(A2): The Fréchet derivative of at is given by
In particular, when ,
Consequently,
We thus have
(A3): The second order Fréchet derivative at is a linear operator in :
where is
Therefore the second derivative is a constant for every and we have,
(Conclusion:) With all assumptions in place, we apply the Newton-Kantorovich Theorem and conclude as follows. When
equation (3.5) has solution such that
(3.12)
∎
3.3 Step three: showing the invariant space is the leading eigenspace
In step two, we obtained an invariant subspace, but there is no guarantee that the invariant subspace we obtained is the leading -dimensional invariant subspace of . In this subsection, we give sufficient conditions to ensure this. When is small, we show several things must happen: first, the range of is dimensional; second, the eigenvalues of the restriction of to this subspace are contained in the interval for some small ; third, has exactly eigenvalues (counting geometric multiplicity) in the interval . These facts combined imply that the invariant subspace from Proposition 3.5 has to be the leading -dimensional invariant subspace.
The first point is not hard to show. Suppose the range of has fewer than dimensions; then there exists with such that . But since , when is small, the vector simply cannot be zero. Stated formally, we have
Lemma 3.6.
When , the range of is -dimensional.
As for the second point, which is to determine the eigenvalues of the restriction of , note that (3.5) implies has matrix representation in the basis . Since eigenvalues are not affected by the choice of bases, we know the eigenvalues of on the invariant space are those of . Next, we recall a perturbation result for eigenvalues.
Lemma 3.7.
Assuming , we have (addition is set addition)
(3.13)
Proof.
First note that . Suppose is a real eigenvalue of ; then there exists with such that . It thus follows
which suggests is within from at least one of . This is equivalent to the claim of (3.13). ∎
From the lemma, we know (assuming )
(3.14)
For the third point, we need the following result from Rosasco, Belkin and Vito [29] (their Theorem 20), which they credit to Anselone [2].
Theorem 3.8.
Let be a compact operator. Given a finite set of non-zero eigenvalues of , let be any simple rectifiable closed curve (having positive direction) with inside and outside. Let be the spectral projection associated with , that is,
(3.15)
and define
(3.16)
Let be another compact operator such that
(3.17)
where is the length of ; then the following facts hold true.
1. The curve is a subset of the resolvent set of enclosing a finite set of non-zero eigenvalues of ;
2. The dimension of the range of is equal to the dimension of the range of , where .
From the theorem above, we can take as in (3.1), i.e. as the boundary of the rectangle
(3.18)
For small enough, contains exactly the top eigenvalues of . Combining the three points, we obtain sufficient conditions for the range of to be the leading invariant subspace of .
Proposition 3.9.
Proof.
First, our choice of contains and only contains the top eigenvalues of . Next, since we assumed also has only real eigenvalues, Lemma 3.7 applies. So from (3.14), (3.19), and (3.20), we know
which is enclosed in . We also see from (3.19) that Lemma 3.6 applies. Finally, condition (3.21) on ensures Theorem 3.8 applies, so has only eigenvalues in . It thus follows the invariant subspace induced by is the -dimensional leading invariant subspace of . ∎
3.4 Step four: bound uniform consistency error
In this step, we bound the uniform consistency error
where has orthonormal columns spanning the leading invariant subspace of . Since we require , we need to orthonormalize the “columns” of . Let us define to be the adjoint of with respect to and . We can verify that for some (as we will see from Lemma 3.23, is well-defined when is small), because
Meanwhile, note that by assumption, is stronger than , i.e. for all . This implies for
The consequence of this is
(3.22)
We also need the following handy result.
Lemma 3.10.
Let have operator norm . Then
(3.23)
Proof.
Suppose . Note that , so the Hermitian matrix is invertible with spectrum in . Consequently, , so .
Similarly, we have . It remains to verify , which is easy. ∎
Now we have
Proposition 3.11.
Suppose and for some . We have
Proof.
3.5 Step five: put all pieces together
We combine the previous steps together and prove Theorem 3.2. To this end, we need the following lemma that relates , to .
Lemma 3.12.
Let
then for any
Proof.
Proof of Theorem 3.2.
Define as
We have by assumption. By Lemma 3.12, this assumption implies
for all . Thus
Proposition 3.5 guarantees (3.5) has a solution and
(3.24)
We check that our choice of satisfies conditions (3.21) and (3.20), so Proposition 3.9 implies the invariant space from Proposition 3.5 is the leading invariant space. Finally, (3.24) implies the conditions of Proposition 3.11 are satisfied, so we have
where .
∎
4 Application to normalized spectral clustering
In spectral clustering, we start from a subset , a probability measure on (assume the underlying -algebra of is the Lebesgue -algebra), and a continuous symmetric positive definite real-valued kernel function . After observing samples , we construct the matrix of their pairwise similarities: , and then normalize it to obtain the normalized Laplacian matrix
where and is the degree matrix. (The normalized Laplacian matrix is usually defined as , but the eigenvectors of and those of are identical, and it is more convenient to study .) It is possible to show that is symmetric and positive semi-definite, so it has an eigenvalue decomposition. We denote the eigenpairs of by and sort the eigenvalues in descending order:
In this paper, we normalize the eigenvectors of so that . The spectral embedding matrix is whose columns are .
Suppose for now that the kernel is bounded away from by a positive number and bounded from above. The operator counterpart of is the following operator, which can be shown to be a bounded linear operator in
(4.1)
where is the sample degree function. Although is introduced as an operator in , we remark that the defining element of is the integral form, and the domain space and range space need not be restricted to . In fact, the actual we shall work with is an operator between Hilbert spaces; is only chosen here for the ease of understanding. The same remark also applies to other operators we shall subsequently define.
The operator is the operator counterpart of because , where is the restriction operator defined as
In other words, if we identify functions with vectors by the restriction operator , “behaves as” . The eigenvalues and eigenvectors (eigenfunctions) of and are also closely related in the following sense.
Lemma 4.1.
Suppose the real-valued kernel function is continuous and bounded from below and above: . Let be defined as in (4.1) where the domain space and range space are both . If is a non-trivial eigenpair of (i.e. ), then is an eigenpair of . Conversely, if is an eigenpair of , then , where
(4.2)
is an eigenpair of with . Moreover, this choice of is such that and the restriction of onto sample points agrees with , i.e. .
Proof.
Let be an eigenpair of : . We check that is an eigenpair of :
Conversely, if is an eigenpair of , we check that is an eigenpair of :
It remains to check is indeed in . To this end, note that since the kernel function is continuous and bounded from above, we know . Since is bounded from below, we know is continuous and , so . Thus the average of such terms is also in . ∎
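For concreteness, the sketch below implements a Nyström-type extension of the flavor described in Lemma 4.1: an eigenvector of the normalized Laplacian matrix is extended to an arbitrary point through the kernel and the empirical degree function. The exact normalization (the $1/(n\lambda)$ factor and the degree scaling) is an assumption for illustration and may differ from (4.2) by constants; the function names are ours.

import numpy as np

def empirical_degree(x, X, kernel):
    # d_n(x) = (1/n) * sum_j k(x, x_j), the sample degree function.
    return np.mean([kernel(x, xj) for xj in X])

def extend_eigenvector(x_new, X, kernel, v, lam):
    # Extend an eigenvector v (with eigenvalue lam > 0) of the normalized
    # Laplacian matrix built from the sample X to a new point x_new.
    n = len(X)
    d_new = empirical_degree(x_new, X, kernel)
    d_sample = np.array([empirical_degree(xj, X, kernel) for xj in X])
    k_new = np.array([kernel(x_new, xj) for xj in X])
    weights = k_new / np.sqrt(d_new * d_sample)
    return weights @ v / (n * lam)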
The population version of is the normalized Laplacian operator
where is the (population) degree function. Under appropriate assumptions, it can be shown that we can choose , the top eigenfunctions of , to be real-valued and orthonormal in . We can thus define as . We can similarly define with , the extension of top eigenvectors of according to (4.2). Our goal in this section is to apply our general theory to prove the following result.
Theorem 4.2.
Under the general assumptions defined below, there exist that are determined by such that whenever the sample size for some , we have with confidence
The general assumptions referred to in Theorem 4.2 are
General Assumptions.
The set is a bounded connected open set in with a nice boundary. (We need the boundary to be quasi-resolved [6] for inequality (4.10) and for Lemma A.2 [15, 11]; we also need to satisfy the cone condition [6]. We omit the definitions of these conditions because they are very technical and not relevant to the main story of the paper.) The probability measure is defined with respect to Lebesgue measure and admits a density function . Moreover, there exist constants such that almost surely with respect to the Lebesgue measure. The kernel is symmetric, positive, and there exist constants such that for . Treated as an operator from to , the eigenvalues of satisfy . The top eigenfunctions of , . (The function space shall be defined in Section 4.2.)
4.1 Overview of the proof
The most challenging parts in applying the general theory are to identify the correct Hilbert space to work with, and to show that , as an operator from to , has Hilbert-Schmidt norm tending to zero as goes to infinity. It turns out that under the general assumptions, we may set to be a Sobolev space of sufficiently high order. As for bounding , we first decompose as the product of three operators. Let us define
(4.3)
(4.4)
Then . Similarly, we have where are the sample level versions of and , defined using and . We shall establish the concentration of to and of to , and invoke the triangle inequality to bound .
Although the general theory does all the heavy lifting, there is one additional step we must take to finish the full proof of Theorem 4.2. In our general theory, has columns orthonormal in that span the leading invariant space of . In Theorem 4.2, however, the same leading invariant space is spanned by the columns of , which are only orthonormal in . Morally speaking, when is large, and are roughly the same up to some unitary transformation, so switching from to should not inflate the consistency error by any order of magnitude. The exact error bound shall be obtained through a uniform law of large numbers.
The rigorous treatment shall be presented in five parts. In part one, we introduce the Sobolev space we work with and lay out its basic properties. In part two, we bound the norm of operator differences such as and and express in terms of them. In part three, we invoke concentration results in Hilbert spaces and relate the norm of operator differences to sample size. In part four, we check the remaining conditions required by our general theory and combine all previous pieces together. In part five, we deal with the error induced by the difference of and and complete the proof.
4.2 Part one: The Sobolev space
First recall that by assumption is a bounded connected open subset of with a nice boundary. Given , the Sobolev space of order is defined as
where is the (weak) derivative of with respect to the multi-index and is the complex Hilbert space of complex-valued functions square integrable under Lebesgue measure. The space is a separable Hilbert space with respect to the inner product
Let be the set of complex-valued continuous bounded functions such that all the derivatives up to order exist and are continuous bounded functions. The space is a Banach space with respect to the norm
Since is bounded, we know and where is a constant only depending on . We also know from the Sobolev embedding theorem (see Chapter 4.6 of Burenkov [6]) that for with , we have
(4.5)
where is a constant depending only on and .
Taking and , we see
with for for some constant . This norm relationship suggests that is an RKHS with a bounded kernel .
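For reference, the Sobolev embedding inequality behind (4.5) and the preceding display, in its usual form (generic symbols; the precise indices and constants are as in Burenkov [6]):
\[
\|f\|_{C^{m}} \;\le\; C(\Omega,s,m)\,\|f\|_{H^{s}(\Omega)}
\qquad\text{whenever } s-m > d/2 ,
\]
and in particular, taking $m=0$ and $s>d/2$, point evaluation is bounded on $H^{s}(\Omega)$.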
4.3 Part two: bounds on operator differences
Similar to (4.4), we define multiplication operators
(4.6)
(4.7)
In this subsection, we show and their operator norms are appropriately bounded, that is
Lemma 4.3.
Under the general assumptions, all the following operators are bounded linear operators in , and there exists a suitable constant such that
(4.8)
(4.9)
Proof.
Let . For any , clearly with . Since and are weighted averages of , it follows
Since inherit the pointwise bound from , we know with
Next, we know from Lemma 15 of Chapter 4 of Burenkov [6] that for and , we have and
(4.10)
We can use this inequality to prove and bound their operator norm. For example, plugging in into (4.10), and noticing for these choices of , because , we conclude
Note that by the embedding theorem, can be embedded into , so plugging in , we see
For the bound on , we follow essentially the same route. We first bound , then argue has pointwise lower and upper bound. It then follows that , and we see via (4.10) that . Taking as the maximum of to completes the proof. ∎
Lemma 4.4.
Under the general assumptions, we have
4.4 Part three: concentration in Hilbert space
In this subsection, we show are both Hilbert-Schmidt operators from to and establish some concentration results regarding and . With these results and Lemma 4.4, we will be able to bound . The required concentration bounds are obtained through the following result on concentration in (complex) Hilbert spaces (see Section 2.4 of Rosasco, Belkin and Vito [29]).
Lemma 4.5.
Let be zero mean independent random variables with values in a separable (complex) Hilbert space such that for all . Then with probability at least , we have
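A typical bound of this type, stated here with generic constants that need not match the exact version in Rosasco, Belkin and Vito [29]: if the $\xi_i$ are independent, zero mean, and $\|\xi_i\|_{\mathcal H}\le C$ almost surely, then with probability at least $1-\delta$,
\[
\Bigl\|\frac{1}{n}\sum_{i=1}^{n}\xi_i\Bigr\|_{\mathcal H}
\;\le\;
\frac{C}{\sqrt{n}} + C\sqrt{\frac{2\log(1/\delta)}{n}},
\]
which follows from $\mathbb{E}\,\bigl\|\sum_{i}\xi_i\bigr\|_{\mathcal H}^{2}=\sum_{i}\mathbb{E}\,\|\xi_i\|_{\mathcal H}^{2}\le nC^{2}$ together with McDiarmid's bounded differences inequality applied to the norm.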
With this lemma, we can show
Lemma 4.6.
Under the general assumptions, the following facts hold true:
1. For some constant , with confidence
2. Both and are Hilbert-Schmidt operators from to , and there exists some constant that doesn’t depend on such that their Hilbert-Schmidt norm is bounded.
3. For some constant , with confidence
Proof.
For item 1, consider random variables for . They are clearly zero mean. From the proof of Lemma 4.3, we see . We thus have
where is some constant depending on the Lebesgue measure of the bounded set . This suggests ’s are bounded. Since is a separable Hilbert space, we apply Lemma 4.5 and conclude that we have with probability
For item 2, let us fix any and consider the operator where . This operator is in fact a Hilbert-Schmidt operator from to . To see this, note that . With the same reasoning used for proving item 1, we see has a bound uniform on . It remains to show has a uniform bound. Let be the evaluation functional, i.e. . We know from the embedding theorem that for all . But also induces this point evaluation functional, so by the Riesz representation theorem, . Hence for some , for all . Now let be random. We see , i.e. is Hilbert-Schmidt. By the same reasoning, we see that the claim for in item 2 is also true.
For item 3, consider random variables . We know from item 2 that . Since is separable, the Hilbert space is also separable. We also know is zero mean and is bounded. We can thus apply Lemma 4.5 and conclude that we have with probability
for . ∎
Proposition 4.7.
Under the general assumptions, with probability , we have
for some constant .
Proof.
A union bound and a direct application of Lemma 4.4 suffice for the proof. ∎
4.5 Part four: checking conditions for the general theory
In the first three paragraphs of Section 3, we laid out the conditions that must be satisfied for our general theory to apply. We have already checked most of them implicitly in the previous three subsections, but for completeness, we summarize all such conditions here and prove them.
Lemma 4.8.
Under the general conditions, the following facts hold true:
1. The Sobolev space is a subspace of .
2. The norm is stronger than the infinity norm.
3. Both are Hilbert-Schmidt from to .
4. All eigenvalues of (counting multiplicity) can be arranged in a non-increasing (possibly infinite) sequence of non-negative real numbers with a positive gap between and .
5. The top eigenfunctions and can be picked to form an orthonormal set of functions in .
6. has a sequence of non-increasing, real, non-negative eigenvalues.
Proof.
For item 1, this is because is a subspace of and, under our assumptions on and , and are the same space. First of all, since the underlying -algebra of is the Lebesgue -algebra, the sets of measurable functions are the same. If , then is also in because . The converse is also true. It is not hard to see that our assumptions ensure the Lebesgue measure is absolutely continuous with respect to with the density being a.s. Noticing , we can prove the converse.
Item 2 is a consequence of the Sobolev embedding theorem and has been used time and again in the previous subsections. Item 3 is the joint consequence of Lemmas 4.3 and 4.6.
For items 4 and 5, we first show as an operator from to is positive, self-adjoint, and Hilbert-Schmidt. Let denote the normalized kernel. The self-adjointness is due to the (conjugate) symmetry inherited from :
We thus see , i.e. is self-adjoint. To see why is Hilbert-Schmidt, let real-valued functions be an orthonormal basis of . We calculate
The positive part is slightly more involved. To show is positive, we need to show for
Let us fix a sample size and draw i.i.d. samples . Then since the kernel is positive definite, the quadratic form
is non-negative regardless of what samples we draw. It thus follows that the expectation of this quadratic form is non-negative. A simple calculation suggests that the expectation is in fact
Since by our assumption and , we see is finite. Since can be arbitrarily large, must be non-negative.
According to the spectral theory for positive, self-adjoint, Hilbert-Schmidt operators we introduced in Section 2.2, we immediately see that most parts of items 4 and 5 are true. The remaining part to check for item 5 is that , which is implied by our assumption that . The eigengap part in item 4 is also covered by the general assumptions. A nuance in item 4 is that the eigenvalues and eigenvectors there are under the premise that is an operator from to . But since is a subspace of , we can only have fewer eigenvalues than when treating as an operator in . Moreover, since our general assumptions imply , the leading eigenspace remains unchanged after the restriction from to .
Item 6 holds because of the relationship between the spectrum of and that of the symmetric positive semi-definite kernel matrix . An analogue of Lemma 4.1 is also true with the therein replaced by
∎
Because of Lemma 4.8, we can apply a slightly modified version of Theorem 3.2 (see the proof of Theorem 3.2) to obtain the following.
Proposition 4.9.
For some constant , whenever , we have with confidence that
(4.11)
4.6 Part five: error induced by
We deal with the error induced by the operator not having orthonormal columns in . Introducing the shorthand , we have
(4.12)
Here, the equality in the second step is true because and span the same leading eigenspace (since is constructed from the eigenvectors of , which are linearly independent, the columns cannot be linearly dependent functions in ) and has orthonormal columns in . Inspecting (4.12), we see is bounded by Proposition 4.11, and is roughly thus bounded, so it all boils down to how “far” is from a unitary matrix in . In fact, we have
Lemma 4.10.
Assume all the singular values of are less than ; then there exists a unitary matrix such that
Proof.
Suppose admits singular value decomposition , then . Let be the -th column in , be the -th column in where is the standard basis in . We know where .
Since are unitary matrices, are orthonormal in and orthonormal in . So is orthogonal to . At the same time, from the diagonal structure of , we know is orthogonal to as well. This suggests is collinear with . On top of that, since the diagonal entries of are real positive values, we know . This in fact holds for all , i.e. . Taking , which is a unitary matrix, we have
(4.13)
By our assumption on the singular values, we know for all , . Note that for , , we see
(4.14)
Since the ’s depend on the samples , they are random. What they have in common is that they have bounded norms, because
Using Dudley’s inequality and standard results on covering numbers in Sobolev spaces, we can show (with proof in the appendix)
Lemma 4.11.
For our choice of , we have with probability
We are now ready to prove Theorem 4.2.
Proof of Theorem 4.2.
For fixed sample size , let be the event when the concentration in Proposition 4.7 holds, and let be the event when the concentration in Lemma 4.11 holds. From now on, we condition on the intersection , which happens with probability greater than or equal to .
First of all, on this event, we know Proposition 4.11 also holds. We thus have
So is close to . Since in Theorem 4.2, and we have the freedom of choosing , we can set large enough such that . Imitating the proof of Proposition 4.11, we can similarly show is on the order of . We can thus assume is also large enough to ensure .
Meanwhile, due to the uniform law of large numbers in Lemma 4.11, we can always let be greater than by setting large enough. The condition on the singular values in Lemma 4.10 is thus satisfied, and from it we see there exists a unitary matrix such that
where we assumed . Since must be large anyway for concentration results like Theorem 4.2 to be meaningful, this assumption is harmless.
5 Discussion
We would like to first comment on the relationship between Theorem 3.2 and the concentration of spectral projections (see Proposition 22 in Rosasco, Belkin and Vito [29]). Our result in fact easily implies the concentration of spectral projections. To see this, simply note the difference in projections can be written as and apply the triangle inequality. We believe it is possible to go from the concentration of spectral projections in Hilbert space to Theorem 3.2, but the road is treacherous. At a high level, we need to project an orthonormal basis of the leading invariant space of the perturbed operator onto that of the unperturbed operator, and then perform Gram-Schmidt on the projections. During this process, we need to convert back and forth from to , and we foresee countless petty and pesky technical details. But it is our belief that the concentration of spectral projections in induced operator norm is equivalent to Theorem 3.2.
We would also like to comment on the generality of the Newton-Kantorovich Theorem. By that, we mean the operator equation (3.5) need not be restricted to the space of . We can have slightly altered versions of (3.5) involving , , or that induce an invariant subspace and still apply the Newton-Kantorovich Theorem to solve them. For example, we should be able to replace every in this paper with and remake the proof to make everything go through. A word of caution is that to obtain operator norm convergence from the sample level operator to the population operator, the function space one works with has to have some kind of “smoothness”. Either the kind of smoothness from an RKHS or the kind from is fine, but spaces like or where functions may oscillate wildly while still having a small norm are not okay, because adversarial functions can be chosen to ruin operator norm convergence. This point was also mentioned in von Luxburg, Belkin and Bousquet [41].
Finally, we would like to comment on our complex-valued functions assumption and the fact that Theorem 3.2 needs a unitary matrix . We feel that since everything is real, the unitary matrix is an artifact rather than a necessity, and our proof could be altered so that only an orthogonal matrix is needed (although we do not know how at the moment). We have also checked that we can get by with real Hilbert or Banach spaces and real-valued functions for almost all lemmas and theorems except for Theorem 3.8. On the brighter side, working with complex numbers makes our result more general and gives us the freedom of using a complex-valued kernel function, although such freedom is rarely taken advantage of in statistics or machine learning. Last but not least, we wish to point out that due to length constraints, we only presented one application, normalized spectral clustering, but other applications of our general theory are possible. For example, uniform consistency results can be obtained for kernel PCA, and the proof is much simpler than that for normalized spectral clustering. We include such results in the appendix.
References
- Abbe, E., Fan, J., Wang, K. and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. Annals of Statistics 48, 1452.
- Anselone, P. M. (1971). Collectively Compact Operator Approximation Theory and Applications to Integral Equations. Automatic Computation. Prentice Hall.
- Belkin, M. and Niyogi, P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15, 1373–1396.
- Bengio, Y., Paiement, J.-F., Vincent, P., Delalleau, O., Le Roux, N. and Ouimet, M. (2003). Out-of-sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering. In Proceedings of the 16th International Conference on Neural Information Processing Systems (NIPS'03), 177–184. MIT Press, Cambridge, MA, USA.
- Blanchard, G., Bousquet, O. and Zwald, L. (2007). Statistical properties of kernel principal component analysis. Machine Learning 66, 259–294.
- Burenkov, V. I. (1998). Sobolev Spaces on Domains. Teubner-Texte zur Mathematik. B. G. Teubner Verlagsgesellschaft, Leipzig.
- Cape, J., Tang, M. and Priebe, C. E. (2019a). The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. The Annals of Statistics 47, 2405–2439.
- Cape, J., Tang, M. and Priebe, C. E. (2019b). Signal-plus-noise matrix models: eigenvector deviations and fluctuations. Biometrika 106, 243–250.
- Ciarlet, P. G. (2013). Linear and Nonlinear Functional Analysis with Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
- Coifman, R. R., Lafon, S., Lee, A. B., Maggioni, M., Nadler, B., Warner, F. and Zucker, S. W. (2005). Geometric Diffusions as a Tool for Harmonic Analysis and Structure Definition of Data: Diffusion Maps. Proceedings of the National Academy of Sciences 102, 7426–7431.
- Cucker, F. and Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society 39, 1–49.
- Damle, A. and Sun, Y. (2020). Uniform bounds for invariant subspace perturbations. SIAM Journal on Matrix Analysis and Applications 41, 1208–1236.
- Donetti, L. and Muñoz, M. A. (2004). Detecting network communities: a new systematic and efficient algorithm. Journal of Statistical Mechanics: Theory and Experiment 2004, P10012.
- Donoho, D. L. and Grimes, C. (2003). Hessian Eigenmaps: Locally Linear Embedding Techniques for High-Dimensional Data. Proceedings of the National Academy of Sciences 100, 5591–5596.
- Edmunds, D. E. and Triebel, H. (1996). Function Spaces, Entropy Numbers, Differential Operators. Cambridge Tracts in Mathematics. Cambridge University Press.
- Fan, J., Wang, W. and Zhong, Y. (2018). An eigenvector perturbation bound and its application to robust covariance estimation. Journal of Machine Learning Research 18, 1–42.
- Higgs, B. W., Weller, J. and Solka, J. L. (2006). Spectral embedding finds meaningful (relevant) structure in image and microarray data. BMC Bioinformatics 7, 74.
- Hoffmann, H. (2007). Kernel PCA for novelty detection. Pattern Recognition 40, 863–874.
- Karow, M. and Kressner, D. (2014). On a Perturbation Bound for Invariant Subspaces of Matrices. SIAM Journal on Matrix Analysis and Applications 35, 599–618.
- Kato, T. (1995). Perturbation Theory for Linear Operators. Classics in Mathematics. Springer-Verlag, Berlin Heidelberg.
- Koltchinskii, V. I. (1998). Asymptotics of Spectral Projections of Some Random Matrices Approximating Integral Operators. In High Dimensional Probability (E. Eberlein, M. Hahn and M. Talagrand, eds.), 191–227. Birkhäuser, Basel.
- Koltchinskii and Giné [2000] {barticle}[author] \bauthor\bsnmKoltchinskii, \bfnmVladimir\binitsV. and \bauthor\bsnmGiné, \bfnmEvarist\binitsE. (\byear2000). \btitleRandom matrix approximation of spectra of integral operators. \bjournalBernoulli \bvolume6 \bpages113–167. \endbibitem
- Lei and Rinaldo [2015] {barticle}[author] \bauthor\bsnmLei, \bfnmJing\binitsJ. and \bauthor\bsnmRinaldo, \bfnmAlessandro\binitsA. (\byear2015). \btitleConsistency of spectral clustering in stochastic block models. \bjournalAnn. Statist. \bvolume43 \bpages215–237. \bdoi10.1214/14-AOS1274 \endbibitem
- Mao, Sarkar and Chakrabarti [2021] {barticle}[author] \bauthor\bsnmMao, \bfnmXueyu\binitsX., \bauthor\bsnmSarkar, \bfnmPurnamrita\binitsP. and \bauthor\bsnmChakrabarti, \bfnmDeepayan\binitsD. (\byear2021). \btitleEstimating mixed memberships with sharp eigenvector deviations. \bjournalJournal of the American Statistical Association \bvolume116 \bpages1928–1940. \endbibitem
- Mendelson and Pajor [2005] {binproceedings}[author] \bauthor\bsnmMendelson, \bfnmS.\binitsS. and \bauthor\bsnmPajor, \bfnmA.\binitsA. (\byear2005). \btitleEllipsoid Approximation Using Random Vectors. In \bbooktitleLearning Theory (\beditor\bfnmPeter\binitsP. \bsnmAuer and \beditor\bfnmRon\binitsR. \bsnmMeir, eds.) \bpages429–443. \bpublisherSpringer Berlin Heidelberg, \baddressBerlin, Heidelberg. \endbibitem
- Mendelson and Pajor [2006] {barticle}[author] \bauthor\bsnmMendelson, \bfnmShahar\binitsS. and \bauthor\bsnmPajor, \bfnmAlain\binitsA. (\byear2006). \btitleOn singular values of matrices with independent rows. \bjournalBernoulli \bvolume12 \bpages761–773. \bdoi10.3150/bj/1161614945 \endbibitem
- Ng, Jordan and Weiss [2001] {binproceedings}[author] \bauthor\bsnmNg, \bfnmAndrew Y.\binitsA. Y., \bauthor\bsnmJordan, \bfnmMichael I.\binitsM. I. and \bauthor\bsnmWeiss, \bfnmYair\binitsY. (\byear2001). \btitleOn Spectral Clustering: Analysis and an Algorithm. In \bbooktitleAdvances in Neural Information Processing Systems \bpages849–856. \bpublisherMIT Press. \endbibitem
- Rohe, Chatterjee and Yu [2011] {barticle}[author] \bauthor\bsnmRohe, \bfnmKarl\binitsK., \bauthor\bsnmChatterjee, \bfnmSourav\binitsS. and \bauthor\bsnmYu, \bfnmBin\binitsB. (\byear2011). \btitleSpectral clustering and the high-dimensional stochastic blockmodel. \bjournalAnn. Statist. \bvolume39 \bpages1878–1915. \bdoi10.1214/11-AOS887 \endbibitem
- Rosasco, Belkin and Vito [2010] {barticle}[author] \bauthor\bsnmRosasco, \bfnmLorenzo\binitsL., \bauthor\bsnmBelkin, \bfnmMikhail\binitsM. and \bauthor\bsnmVito, \bfnmErnesto De\binitsE. D. (\byear2010). \btitleOn Learning with Integral Operators. \bjournalJournal of Machine Learning Research \bvolume11 \bpages905-934. \endbibitem
- Schiebinger, Wainwright and Yu [2015] {barticle}[author] \bauthor\bsnmSchiebinger, \bfnmGeoffrey\binitsG., \bauthor\bsnmWainwright, \bfnmMartin J.\binitsM. J. and \bauthor\bsnmYu, \bfnmBin\binitsB. (\byear2015). \btitleThe Geometry of Kernelized Spectral Clustering. \bjournalThe Annals of Statistics \bvolume43 \bpages819-846. \bdoi10.1214/14-AOS1283 \endbibitem
- Shawe-Taylor et al. [2005] {barticle}[author] \bauthor\bsnmShawe-Taylor, \bfnmJ.\binitsJ., \bauthor\bsnmWilliams, \bfnmC. K. I.\binitsC. K. I., \bauthor\bsnmCristianini, \bfnmN.\binitsN. and \bauthor\bsnmKandola, \bfnmJ.\binitsJ. (\byear2005). \btitleOn the eigenspectrum of the gram matrix and the generalization error of kernel-PCA. \bjournalIEEE Transactions on Information Theory \bvolume51 \bpages2510-2522. \bdoi10.1109/TIT.2005.850052 \endbibitem
- Shi, Belkin and Yu [2009] {barticle}[author] \bauthor\bsnmShi, \bfnmTao\binitsT., \bauthor\bsnmBelkin, \bfnmMikhail\binitsM. and \bauthor\bsnmYu, \bfnmBin\binitsB. (\byear2009). \btitleData spectroscopy: Eigenspaces of convolution operators and clustering. \bjournalAnn. Statist. \bvolume37 \bpages3960–3984. \bdoi10.1214/09-AOS700 \endbibitem
- Shi and Malik [2000] {barticle}[author] \bauthor\bsnmShi, \bfnmJ\binitsJ. and \bauthor\bsnmMalik, \bfnmJ\binitsJ. (\byear2000). \btitleNormalized cuts and image segmentation. \bjournalIEEE Transactions on Pattern Analysis and Machine Intelligence \bvolume22 \bpages888-905. \bdoi10.1109/34.868688 \endbibitem
- Steinwart and Christmann [2008] {bbook}[author] \bauthor\bsnmSteinwart, \bfnmIngo\binitsI. and \bauthor\bsnmChristmann, \bfnmAndreas\binitsA. (\byear2008). \btitleSupport vector machines. \bpublisherSpringer Science & Business Media. \endbibitem
- Stewart [1971] {barticle}[author] \bauthor\bsnmStewart, \bfnmG.\binitsG. (\byear1971). \btitleError Bounds for Approximate Invariant Subspaces of Closed Linear Operators. \bjournalSIAM Journal on Numerical Analysis \bvolume8 \bpages796-808. \bdoi10.1137/0708073 \endbibitem
- Tenenbaum [2000] {barticle}[author] \bauthor\bsnmTenenbaum, \bfnmJ. B.\binitsJ. B. (\byear2000). \btitleA Global Geometric Framework for Nonlinear Dimensionality Reduction. \bjournalScience \bvolume290 \bpages2319-2323. \bdoi10.1126/science.290.5500.2319 \endbibitem
- Trillos, Hoffmann and Hosseini [2019] {barticle}[author] \bauthor\bsnmTrillos, \bfnmNicolas Garcia\binitsN. G., \bauthor\bsnmHoffmann, \bfnmFranca\binitsF. and \bauthor\bsnmHosseini, \bfnmBamdad\binitsB. (\byear2019). \btitleGeometric structure of graph Laplacian embeddings. \bpagesarXiv:1901.10651. \endbibitem
- Trillos and Slepčev [2015] {barticle}[author] \bauthor\bsnmTrillos, \bfnmNicolás García\binitsN. G. and \bauthor\bsnmSlepčev, \bfnmDejan\binitsD. (\byear2015). \btitleA variational approach to the consistency of spectral clustering. \bpagesarXiv:1508.01928. \endbibitem
- Trillos et al. [2018] {barticle}[author] \bauthor\bsnmTrillos, \bfnmNicolas Garcia\binitsN. G., \bauthor\bsnmGerlach, \bfnmMoritz\binitsM., \bauthor\bsnmHein, \bfnmMatthias\binitsM. and \bauthor\bsnmSlepcev, \bfnmDejan\binitsD. (\byear2018). \btitleError estimates for spectral convergence of the graph Laplacian on random geometric graphs towards the Laplace–Beltrami operator. \bpagesarXiv:1801.10108. \endbibitem
- Vershynin [2018] {bbook}[author] \bauthor\bsnmVershynin, \bfnmRoman\binitsR. (\byear2018). \btitleHigh-Dimensional Probability: An Introduction with Applications in Data Science. \bseriesCambridge Series in Statistical and Probabilistic Mathematics. \bpublisherCambridge University Press. \bdoi10.1017/9781108231596 \endbibitem
- von Luxburg, Belkin and Bousquet [2008] {barticle}[author] \bauthor\bsnmvon Luxburg, \bfnmUlrike\binitsU., \bauthor\bsnmBelkin, \bfnmMikhail\binitsM. and \bauthor\bsnmBousquet, \bfnmOlivier\binitsO. (\byear2008). \btitleConsistency of Spectral Clustering. \bjournalThe Annals of Statistics \bvolume36 \bpages555-586. \bdoi10.1214/009053607000000640 \endbibitem
- Zwald and Blanchard [2006] {bincollection}[author] \bauthor\bsnmZwald, \bfnmLaurent\binitsL. and \bauthor\bsnmBlanchard, \bfnmGilles\binitsG. (\byear2006). \btitleOn the Convergence of Eigenspaces in Kernel Principal Component Analysis. In \bbooktitleAdvances in Neural Information Processing Systems 18 (\beditor\bfnmY.\binitsY. \bsnmWeiss, \beditor\bfnmB.\binitsB. \bsnmSchölkopf and \beditor\bfnmJ. C.\binitsJ. C. \bsnmPlatt, eds.) \bpages1649–1656. \bpublisherMIT Press. \endbibitem
Appendix A Proofs
A.1 Proof of Lemma 3.1
Proof of Lemma 3.1.
For item 1, we prove the two inclusions and . First, for any , we know , so . Since , we see . Second, for any , suppose without loss of generality that for some . Note . Since by assumption, we know , so . This establishes the other inclusion.
For item 2, since is a subspace of , is a subspace of , which is equal to . Checking that is an inner product is routine. We next show the completeness of . Let be a Cauchy sequence in . Since by definition , the sequence is Cauchy in . Suppose for some . Since by assumption the norm is stronger than , which is in turn stronger than , we know . Since the range of is closed in , we know is also in its range and . Since and the right-hand side converges to zero because , we conclude that the space is indeed complete.
For item 3, for any with , we have
This shows . As noted in the proof of item 2, the norm is stronger than the norm, i.e., for some constant and . Therefore
We thus see . The fact that is a simple consequence of for . Finally, we have for
It thus follows . ∎
A.2 Proof of a lemma used in proving Theorem 3.5
In the proof of Theorem 3.5, we used the following lemma, which we now state and prove.
Lemma A.1.
The operator defined as as is one-to-one and onto. Moreover, .
Proof.
First, we note
To show that is one-to-one and onto, it suffices to show that for any , there exists a unique such that . Denote the standard orthonormal basis in by . Due to the diagonal structure of , we see . So to show the existence and uniqueness of such that , it suffices to show that for , there exists a unique such that . To this end, it suffices to show that lies in the resolvent set of . This is indeed true because 1) is a compact operator from to ; 2) . The second point is obvious, and the first point follows from
Next, we show that is a bounded operator. Once is bounded, since is one-to-one and onto and is a Banach space, we know from the bounded inverse theorem that , which is equivalent to .
The operator is indeed bounded because
∎
We remark that when is self-adjoint, the proof of this lemma simplifies considerably. In fact, we have
Since is self-adjoint, its operator norm is its largest eigenvalue, which is . We see immediately in this case that , so the eigengap is recovered. In unnormalized spectral clustering, where is set to be the RKHS associated with the kernel function, we claim is self-adjoint.
A.3 Proof of Lemma 4.11
At the core of Lemma 4.11 is a uniform law of large numbers over the unit ball in . We need the following two lemmas in the proof: the first is from Cucker and Smale [11] (Proposition 6), and the second is from Vershynin [40] (Theorem 8.1.6).
Lemma A.2.
Denote . When , for all ,
for some constant .
Lemma A.3.
Let be a random process on a metric space with sub-gaussian increments, i.e.
Then, for every , the event
holds with probability at least .
We now prove Lemma 4.11. We refer readers who are unfamiliar with the arguments below to the proof of Theorem 8.2.3 in Vershynin [40].
Proof of Lemma 4.11.
We first show on , the random process has sub-gaussian increments. For fixed , we have
So ’s are independent and mean zero. It thus follows
By the centering lemma, we know
Note that because of the embedding, we have
The random variable is thus bounded. Since bounded random variables have bounded norm, we see
Putting pieces together, we have
Next, it is easy to check that and (because of our choice of and Lemma A.2). It thus follows that the event
holds with probability .
By exactly the same argument, we can also show that the event
holds with probability . Taking a union bound, we see
holds with probability . Rewrite as and the proof is complete. ∎
Appendix B Application to kernel PCA
To further demonstrate the use of the general theory, we apply it in this section to kernel principal component analysis (kernel PCA). In kernel PCA, we start from a metric space , a probability measure on , and a continuous positive definite kernel function . After observing samples , we are interested in the matrix of their pairwise similarities: , assuming that the data mapped into the feature space are centered. Since is symmetric and positive semi-definite, it has an eigenvalue decomposition. We denote the eigenpairs by and sort the eigenvalues in descending order:
The eigenvectors are normalized to have . Then the matrix consists of the leading principal components.
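As a concrete, simplified illustration of the construction just described, the following sketch computes the leading principal components directly from the kernel matrix. The Gaussian kernel, the synthetic data, and the unit-Euclidean-norm convention for the eigenvectors are assumptions made only for this example, and centering in feature space is omitted.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_pca_embedding(X, r, bandwidth=1.0):
    """Return the r leading eigenvalues of K and the matrix of r leading eigenvectors."""
    K = gaussian_kernel(X, X, bandwidth)      # pairwise similarity matrix on the sample
    eigvals, eigvecs = np.linalg.eigh(K)      # ascending eigenvalues, unit-norm eigenvectors
    order = np.argsort(eigvals)[::-1][:r]     # indices of the r largest eigenvalues
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # synthetic sample of n = 200 points in R^2
vals, V = kernel_pca_embedding(X, r=3)
print(vals)        # leading eigenvalues of K in descending order
print(V.shape)     # (200, 3): each row is the embedded representation of one sample point
```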
Let be the RKHS associated with kernel . Recall that the tensor product of ,
is a linear operator. Then the operator counterpart of is the empirical covariance operator
(B.1) |
where is the corresponding feature in of () under the feature map .
It turns out that the eigenvalues and eigenvectors of and are closely related. To formulate this relationship, define the restriction operator by . One can verify that the adjoint of is given by , where . The eigenvalues and eigenvectors (eigenfunctions) of and are then related in the following sense.
Lemma B.1.
The following facts hold true:
1.  and ;
2. If is a non-trivial eigenpair of (i.e. ), then is an eigenpair for .
3. If is a non-trivial eigenpair of , then , where
(B.2)
is an eigenpair for with . Moreover, this choice of is such that and the restriction of onto sample points agrees with , i.e. .
Proof.
For item one, it is easy to verify that is the linear transformation defined by . For the other half, note that
At the same time, by the reproducing property
We thus conclude the two are equal.
For item two, since by assumption , we have , which is exactly what the statement suggests.
For item three, if is an eigenpair of , we check that is an eigenpair of :
Moreover, is a linear combination of and therefore belongs to . ∎
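Lemma B.1 can also be checked numerically once an explicit feature map is available. The sketch below uses random Fourier features as an approximate feature map for a Gaussian kernel; this choice of kernel, the scaling of the kernel matrix by 1/n, and the normalization of the extension are illustrative assumptions for the sketch rather than the paper's construction (the paper's normalization is fixed in (B.2)).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, D = 300, 2, 500
X = rng.normal(size=(n, d))

# Random Fourier features give an explicit (approximate) feature map for a Gaussian kernel,
# so that k(x, y) is approximately phi(x) @ phi(y).
W = rng.normal(size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
phi = lambda Z: np.sqrt(2.0 / D) * np.cos(Z @ W + b)

Phi = phi(X)                     # n x D feature matrix
K = Phi @ Phi.T                  # kernel (similarity) matrix on the sample
C_hat = Phi.T @ Phi / n          # empirical covariance operator, written in feature coordinates

# With the conventions of this sketch (no 1/n in K, a 1/n in C_hat), the non-zero
# eigenvalues of the empirical covariance operator coincide with those of K / n,
# in the spirit of items 2 and 3 of Lemma B.1.
eig_K = np.sort(np.linalg.eigvalsh(K / n))[::-1][:10]
eig_C = np.sort(np.linalg.eigvalsh(C_hat))[::-1][:10]
print(np.allclose(eig_K, eig_C))

# Item 3: extending an eigenvector v of K (eigenvalue mu) to the function
# f(x) = sum_i v_i k(x, x_i) / mu reproduces v when restricted back to the sample points.
mu_all, V = np.linalg.eigh(K)
mu, v = mu_all[-1], V[:, -1]                           # largest eigenpair of K
f = lambda x: (phi(x[None, :]) @ (Phi.T @ v))[0] / mu  # extension, up to the normalization in (B.2)
print(np.allclose(np.array([f(x) for x in X]), v))
```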
The population version of is the covariance operator
where , , and . We will later justify the expectation of such random elements in an appropriate Hilbert space. Under appropriate assumptions, it can be shown that , the top eigenfunctions of , can be chosen to be real-valued and orthonormal in . Then we can define as . Similarly, we can define with , the extension of the top orthonormal eigenvectors of according to (B.2). We are now ready to apply our general theory to prove the following result, which is analogous to Theorem 3.2.
Theorem B.2.
Under the general assumptions stated below, there exist determined by such that whenever the sample size for some , we have with confidence ,
(B.3) |
The general assumptions referred to in Theorem B.2 are the following.
General Assumptions. The set is a separable topological space. The kernel is continuous, symmetric, positive semi-definite, and
(B.4) |
Treated as an operator from to , the eigenvalues of satisfy . The top eigenfunctions of , .
Condition (B.4) ensures that all of the operators we are working with are Hilbert-Schmidt, and further guarantees concentration of bounded random elements in the Hilbert space of Hilbert-Schmidt operators. Separability of and continuity of ensure that the RKHS is separable, by Lemma 4.33 of [34].
B.1 Overview of the proof
The proof of Theorem B.2 is simpler than that of Theorem 3.2 because we can work directly with the reproducing kernel Hilbert space associated with . We shall show that , as an operator from to , has Hilbert-Schmidt norm tending to zero as goes to infinity. Recall that the columns of are only orthonormal in , whereas the general theory requires , whose columns are orthonormal in . As in the proof of Theorem 3.2, we therefore also need to control the error induced by the distinction between orthonormality in and .
The rigorous treatment is presented in four parts. In part one, we introduce , the Hilbert space of Hilbert-Schmidt operators from to that we work with, and justify the random elements in and . In part two, we build up concentration results in the Hilbert space . In part three, we check the remaining conditions required by our general theory. In part four, we put all of the above ingredients together and apply our general theory to complete the proof.
B.2 Part one: the space and random elements in and
The space collects all of the Hilbert-Schmidt operators from to ; it is itself a Hilbert space. Before justifying the covariance operator, we first define the mean element in and the cross-covariance operator in .
The mean element in is defined as the element such that for any . Let be another probability space, and let be a reproducing kernel Hilbert space, associated with a kernel , containing , as its elements. The cross-covariance operator is a Hilbert-Schmidt operator from to such that for any and , . Finally, the covariance operator is .
The covariance operator is indeed Hilbert-Schmidt because by the general assumption.
B.3 Part two: concentration in the Hilbert space
In part one, we showed that is a Hilbert-Schmidt operator from to . In this subsection, we show concentration of using Lemma 4.5.
Lemma B.3.
Under the general assumptions, with probability , we have
(B.5) |
for some constant .
Proof.
We denote . For any , we have
which implies that the ’s are bounded, zero-mean, independent random variables in .
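Although the proof relies on a concentration inequality for bounded random elements of , the 1/√n rate in (B.5) can also be seen in a small Monte Carlo experiment. The sketch below assumes a Gaussian kernel and standard normal data, uses a large hold-out sample as a proxy for the population covariance operator, and computes Hilbert-Schmidt distances through the identity that the HS inner product of the rank-one operators built from k(·,x) and k(·,y) equals k(x,y)²; none of these choices are prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss_K(X, Y):
    """Gaussian kernel matrix with unit bandwidth."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / 2.0)

# HS inner products of the rank-one operators k(.,x) (x) k(.,x) reduce to k(x, y)^2, so HS
# distances between empirical covariance operators can be computed from kernel matrices alone.
X_ref = rng.normal(size=(3000, 2))                 # large hold-out sample: population proxy
m = len(X_ref)
c = (gauss_K(X_ref, X_ref) ** 2).sum() / m ** 2    # reference self-term, computed once

def hs_err(X):
    """HS distance between the empirical covariance operator of X and that of the reference."""
    n = len(X)
    a = (gauss_K(X, X) ** 2).sum() / n ** 2
    b = (gauss_K(X, X_ref) ** 2).sum() / (n * m)
    return np.sqrt(max(a - 2.0 * b + c, 0.0))

for n in [50, 200, 800]:
    errs = [hs_err(rng.normal(size=(n, 2))) for _ in range(20)]
    # The mean error roughly halves each time n quadruples, consistent with the 1/sqrt(n)
    # rate in (B.5); the finite reference sample eventually contributes a small floor.
    print(n, np.mean(errs))
```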
B.4 Part three: checking the conditions of the general theory
Lemma B.4.
Under the general conditions, the following facts hold true:
1. The reproducing kernel Hilbert space is a subspace of .
2. The norm is stronger than infinity norm.
3. Both are Hilbert-Schmidt from to .
4. All eigenvalues of (counting multiplicity) can be arranged in a decreasing (possibly infinite) sequence of non-negative real numbers with a positive gap between and .
5. The top eigenfunctions and can be picked to form an orthonormal set of functions in .
6. has a sequence of non-increasing, real, non-negative eigenvalues.
Proof.
Recall that the general conditions entail that . The kernel is therefore a Mercer kernel, which satisfies the condition . Item 1 is then an implication of Mercer’s theorem.
For item 2, for any , by the reproducing property and the Cauchy–Schwarz inequality, we have . Therefore .
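In generic notation for a kernel $k$ with RKHS $\mathcal{H}$ on a domain $\mathcal{X}$ (the paper's own symbols are fixed earlier in the text), the standard chain of inequalities behind item 2 reads
\[
|f(x)| \;=\; |\langle f, k(x,\cdot)\rangle_{\mathcal{H}}| \;\le\; \|f\|_{\mathcal{H}}\,\sqrt{k(x,x)} \;\le\; \Big(\sup_{x\in\mathcal{X}} k(x,x)\Big)^{1/2}\,\|f\|_{\mathcal{H}},
\]
so that $\|f\|_{\infty} \le \big(\sup_{x\in\mathcal{X}} k(x,x)\big)^{1/2}\,\|f\|_{\mathcal{H}}$ whenever the kernel is bounded on the diagonal.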
Item 3 was checked in an intermediate step of the proof of Lemma B.3.
Both item 4 and item 5 are ensured by Mercer’s theorem.
Item 6 is true because of the relationship between the spectrum of and that of the symmetric positive semi-definite kernel matrix , which has been checked in Lemma B.1. ∎