
On uniform consistency of spectral embeddings

Ruofei Zhao (rfzhao@umich.edu), Songkai Xue (sxue@umich.edu), Yuekai Sun (yuekai@umich.edu)
University of Michigan, 1085 S University, Ann Arbor, MI 48109
(2023)
Abstract

In this paper, we study the convergence of the spectral embeddings obtained from the leading eigenvectors of certain similarity matrices to their population counterparts. We opt to study this convergence in a uniform (instead of average) sense and highlight the benefits of this choice. Using the Newton-Kantorovich Theorem and other tools from functional analysis, we first establish a general perturbation result for orthonormal bases of invariant subspaces. We then apply this general result to normalized spectral clustering. By tapping into the rich literature of Sobolev spaces and exploiting some concentration results in Hilbert spaces, we are able to prove a finite sample error bound on the uniform consistency error of the spectral embeddings in normalized spectral clustering.

Keywords: spectral embedding, normalized spectral clustering, uniform consistency, the Newton-Kantorovich Theorem, Sobolev spaces, functional analysis, concentration in Hilbert spaces

MSC classes: 62H30, 47A55

1 Introduction

Spectral methods are a staple of modern statistics. For statistical learning tasks such as clustering or classification, one can featurize the data with spectral methods and then perform the task on the features. In the past twenty years, spectral methods have seen wide application in image segmentation [33], novelty detection [18], community detection [13], and bioinformatics [17], and their effectiveness is partly credited to their ability to reveal the latent low-dimensional structure in the data.

Spectral embedding gets its name from the fact that the embeddings are constructed from the spectral decomposition of a positive-definite matrix. For example, in normalized spectral clustering [27], the normalized Laplacian embedding $\Phi_n : \{x_i\}_{i=1}^n \to \mathbb{R}^K$ is given by

\Phi_n(x_i)^T = e_i^T V, \quad i \in [n],   (1.1)

where $\{x_i\}_{i=1}^n$ are the observations, $e_i \in \mathbb{R}^n$ is the vector with a one in the $i$-th entry and zeros elsewhere, $K$ is the desired dimension of the embedding, and the columns of $V \in \mathbb{R}^{n \times K}$ are the leading eigenvectors of the normalized Laplacian matrix. As described, spectral embeddings are only defined on points in the training data, but it is possible to evaluate them on points that are not in the training data through out-of-sample extensions [4, 41]. Some other examples of spectral methods are Isomap [36], Laplacian [3] and Hessian eigenmaps [14], and diffusion maps [10].
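For concreteness, the following minimal numpy sketch builds the normalized Laplacian matrix from pairwise similarities and reads off the embedding (1.1). The Gaussian kernel and its bandwidth are illustrative assumptions of ours, not choices made in this paper; note that a constant factor such as $1/n$ on the similarity matrix cancels in the normalization and does not change the eigenvectors.

```python
import numpy as np

def laplacian_embedding(X, K, bandwidth=1.0):
    """Normalized Laplacian embedding (1.1): row i is Phi_n(x_i) in R^K."""
    # Pairwise similarities with a Gaussian kernel (an illustrative choice).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * bandwidth ** 2))
    # Normalized Laplacian L_n = D^{-1/2} K_n D^{-1/2}.
    d = W.sum(axis=1)
    L = W / np.sqrt(np.outer(d, d))
    # np.linalg.eigh returns eigenvalues in ascending order.
    _, V = np.linalg.eigh(L)
    return V[:, -K:][:, ::-1]          # columns: leading K eigenvectors

X = np.random.default_rng(0).standard_normal((200, 2))
Phi = laplacian_embedding(X, K=3)      # Phi[i] = Phi_n(x_i)
```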

Since downstream procedures take the embeddings as input, it is imperative that the embeddings have certain consistency properties to ensure the quality of the ultimate output. Specifically, we ask

  • In the large sample limit, do the embedded representations of the data “converge” to certain population level representations?

  • If the embedded representations do converge, in what sense do they converge?

While there are many results on the convergence of eigenvalues and spectral projections, there are only a few results that directly address the convergence of the embedded representation in a general setting, the only exception being von Luxburg, Belkin and Bousquet [41]. This is a gap in the literature because it is the embedded representation, not the spectral projections or the eigenvalues, that is the input to downstream applications. In this paper, we address the two questions and provide direct answers: we show the sample level embeddings converge uniformly to their population counterparts up to a unitary transformation. We improve the result of von Luxburg, Belkin and Bousquet [41] by considering multidimensional embeddings and allowing for non-simple eigenvalues.

For a concrete application of our result, let us return to spectral clustering. The population counterpart of the normalized Laplacian embedding is given by

\Psi(x)^T = \begin{bmatrix} f_1(x) & \dots & f_K(x) \end{bmatrix},

where $f_1, \dots, f_K$ are the leading eigenfunctions of the normalized Laplacian operator [41, 30]. As is shown in von Luxburg, Belkin and Bousquet [41], the normalized Laplacian matrix has an operator counterpart that we shall refer to as the empirical normalized Laplacian operator. Let $\hat{f}_{n,1}, \dots, \hat{f}_{n,K}$ be the leading eigenfunctions of this operator, and define the embedding

\Psi_n(x)^T = \begin{bmatrix} \hat{f}_{n,1}(x) & \dots & \hat{f}_{n,K}(x) \end{bmatrix}.   (1.2)

The embedding $\Psi_n$ coincides with $\Phi_n$ on the sample points, i.e. $\Psi_n(x_i) = \Phi_n(x_i)$ for all $i \in [n]$. We shall show that the sample level embedding converges uniformly to its population counterpart:

\sup\{d(\Psi_n(x), \Psi(x)) : x \in \mathcal{X}\} \overset{p}{\to} 0,   (1.3)

where $d$ is some metric on $\mathbb{R}^K$. This implies $\Phi_n$ converges uniformly to the restriction of $\Psi$ to the sample points.

1.1 Main results

In this section, we state our results in an informal manner. These results are made precise and proved in subsequent sections.

Our first main result concerns the effect of perturbation on the invariant subspace of an operator. It serves as a general recipe for establishing uniform consistency type results. Although in statistics and machine learning, we mainly work with real-valued functions, our main spectral perturbation result is stated for complex-valued functions. This choice is technically convenient because the complex numbers are algebraically closed while the real numbers are not. In most applications of the result, the complex-valued functions only take real values.

Suppose $\mathcal{H}$ is a complex Hilbert space whose elements are bounded complex-valued continuous functions over a domain $\mathcal{X}$. Let $T, \widetilde{T}$ be two operators from $\mathcal{H}$ to $\mathcal{H}$ that are close in Hilbert-Schmidt norm. Let $\{f_i\}_{i=1}^K$ be the top $K$ eigenfunctions of $T$ and $\{\tilde{f}_i\}_{i=1}^K$ be those of $\widetilde{T}$. As long as $\{f_i\}_{i=1}^K$ and $\{\tilde{f}_i\}_{i=1}^K$ are appropriately normalized, we expect $\{f_i\}_{i=1}^K$ to be close to $\{\tilde{f}_i\}_{i=1}^K$ up to some unitary transformation. This is indeed the case and is characterized as follows by our first result.

Result 1 (General recipe for uniform consistency).

Define $V_1 : \mathbb{C}^K \to \mathcal{H}$ as $V_1 \alpha = \sum_{i=1}^K \alpha_i f_i$ and $\widetilde{V}_1 : \mathbb{C}^K \to \mathcal{H}$ as $\widetilde{V}_1 \alpha = \sum_{i=1}^K \alpha_i \tilde{f}_i$. There are constants $C_1, C_2 > 0$ that only depend on $T$ such that as long as $\|\widetilde{T} - T\|_{HS} \leq C_1$, we have

\inf\{\|V_1 - \widetilde{V}_1 Q\|_{2\to\infty} : Q \in \mathbb{U}^K\} \leq C_2 \|\widetilde{T} - T\|_{HS},   (1.4)

where $\mathbb{U}^K$ is the space of unitary matrices in $\mathbb{C}^{K \times K}$, and the $2\to\infty$ norm of an operator $A : \mathbb{C}^K \to \mathcal{H}$ is defined as

\|A\|_{2\to\infty} = \sup\{\|A\alpha\|_\infty : \alpha \in \mathbb{C}^K, \|\alpha\|_2 = 1\}.

It is not hard to notice the correspondence between (1.4) and (1.3): $V_1$ is the analogue of $\Psi$; $\widetilde{V}_1$ is the analogue of $\Psi_n$; the $2\to\infty$ norm guarantees the convergence is uniform; and the distance metric $d$ is chosen to measure Euclidean distance up to a unitary transformation (normalized eigenfunctions of the same eigenvalue are only determined up to a unitary transformation). This observation justifies naming $\inf\{\|V_1 - \widetilde{V}_1 Q\|_{2\to\infty} : Q \in \mathbb{U}^K\}$ the uniform consistency error. It is also worth mentioning that the constant $C_2$ is inversely proportional to a measure of the eigengap between the $K$-th and $(K+1)$-th eigenvalues of $T$. This provides further justification for studying the convergence of the leading eigenspace as a whole, rather than studying the convergence of the individual eigenspaces as in von Luxburg, Belkin and Bousquet [41]. Not only is the former more general and realistic, but it leads to better constants as well. In many applications, the top eigenvalues are clustered together, but there is a large gap between the top eigenvalues and the rest of the spectrum. It is then hard to estimate the corresponding eigenfunctions individually, but easy to estimate them altogether up to a unitary transformation.
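In the finite-dimensional analogue where $V_1$ and $\widetilde{V}_1$ are $n \times K$ matrices with orthonormal columns, the infimum over $\mathbb{U}^K$ in (1.4) has no closed form under the $2\to\infty$ norm. A standard surrogate (our own illustration, not a construction from the paper) is to align with the Frobenius-optimal Procrustes rotation and then evaluate the $2\to\infty$ norm, which upper bounds the infimum:

```python
import numpy as np

def two_to_inf(A):
    # ||A||_{2->inf} = largest row l2-norm of a matrix acting C^K -> C^n.
    return np.sqrt((np.abs(A) ** 2).sum(axis=1)).max()

def uniform_consistency_error(V1, V1_tilde):
    # Frobenius-optimal unitary alignment (Procrustes): Q = U W^* from
    # the SVD of V1_tilde^* V1; this upper bounds the infimum in (1.4).
    U, _, Wh = np.linalg.svd(V1_tilde.conj().T @ V1)
    Q = U @ Wh
    return two_to_inf(V1 - V1_tilde @ Q)
```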

Result 1 provides a general approach to proving uniform consistency: we simply need to bound the difference between the sample level operator and its population counterpart in an appropriate norm. The proof of Result 1 is also interesting in its own right. We identify the invariant subspace directly by solving an operator equation and appeal to the Newton-Kantorovich Theorem to characterize the solution. The main benefit of this approach is that it overcomes the limitations of traditional approaches when working with norms that are not unitarily invariant.

Our second main result is a finite sample uniform error bound for the embedding in normalized spectral clustering. Let $C_b(\mathcal{X})$ denote the space of bounded continuous complex-valued functions over $\mathcal{X}$. Define $V_1 : \mathbb{C}^K \to C_b(\mathcal{X})$ as $V_1 \alpha = \sum_{i=1}^K \alpha_i f_i$, where $f_1, \dots, f_K$ are the leading real-valued eigenfunctions of the normalized Laplacian operator. Define $\widehat{V}_{n,1} : \mathbb{C}^K \to C_b(\mathcal{X})$ as $\widehat{V}_{n,1} \alpha = \sum_{i=1}^K \alpha_i \hat{f}_{n,i}$, where the $\hat{f}_{n,i}$ are defined as in (1.2) and real-valued. Applying Result 1, we obtain

Result 2 (Uniform consistency for normalized spectral clustering).

Under suitable conditions, there are constants $C_4, C_5 > 0$ that are independent of $n$ and the randomness of the sample, such that whenever the sample size satisfies $n \geq C_4 \tau$ for some $\tau > 1$, we have

\inf\{\|V_1 - \widehat{V}_{n,1} Q\|_{2\to\infty} : Q \in \mathbb{U}^K\} \leq C_5 \frac{\sqrt{\tau}}{\sqrt{n}},

with probability at least $1 - 8e^{-\tau}$.

Although Result 2 is an application of Result 1, its proof is by no means simple. The main technical challenge is establishing concentration bounds for Hilbert-Schmidt operators. Result 2 suggests that the convergence rate, under appropriate conditions, is $\mathcal{O}(\frac{1}{\sqrt{n}})$ (modulo a log factor). Moreover, in the context of clustering, the notion of uniform consistency leads to stronger assurances about the correctness of the clustering output. For example, in spectral clustering, the points are clustered based on their embeddings. Uniform convergence implies the embeddings of all points are close to their population counterparts. As long as the error in the embeddings is small enough, it is possible to show that all points are correctly clustered. This is not possible if the embeddings only "converge in mean": $\|V_1 - \widehat{V}_{n,1} Q\|_{2 \to L^2(\mathcal{X}, \mathbb{P})} \to 0$.

1.2 Related literature

Most closely related to our results are the works of von Luxburg, Belkin and Bousquet [41] and Rosasco, Belkin and Vito [29]. For normalized spectral clustering, von Luxburg, Belkin and Bousquet [41] proved the convergence of the eigenvalues and spectral projections of the sample level operator to their population counterparts. They also established uniform convergence to their population counterparts of eigenfunctions whose corresponding eigenvalues have multiplicity one. Our results are in the same vein as theirs in that we also study uniform convergence of eigenfunctions. We improve upon their uniform convergence result by considering multiple eigenfunctions at once and allowing for non-simple eigenvalues. In the context of unnormalized spectral clustering, Rosasco, Belkin and Vito [29] studied the convergence rate of the $l^2$-distance between the ordered spectrum of the sample level operator and that of the population operator, and derived a finite sample bound on the deviation between the sample level and population level spectral projections associated with the top $K$ eigenvalues. They also obtained a finite sample spectral projection error bound for the asymmetric normalized graph Laplacian. Our work is related to theirs in that both study the convergence of the leading eigenspace, and we owe much of our concentration results to them. At the same time, the two works are quite distinct. First, our notion of convergence is uniform consistency of the eigenfunctions, while theirs is in terms of the induced RKHS norm between the spectral projections. To the best of our knowledge, it is non-trivial to establish one set of results from the other. Second, we study the normalized symmetric graph Laplacian, while they study the unnormalized graph Laplacian and the asymmetric normalized graph Laplacian.

The general relationship between the spectral properties of an empirical operator/matrix and those of its population counterpart has also been studied in other contexts. Koltchinskii and Giné [22] proved that the $l^2$-distance between the ordered spectra of an integral operator and its empirical version tends to zero almost surely if and only if the kernel is square integrable. Convergence rates and distributional limits were also obtained under stronger conditions. Koltchinskii [21] extended this result by proving laws of large numbers and central limit theorems for quadratic forms induced by spectral projections. The investigation of spectrum convergence is continued in Mendelson and Pajor [25, 26], where the authors relate various distance metrics between two ordered spectra to the deviation of the sample mean of i.i.d. rank one operators from its population mean. Similar problems have also been studied in the kernel principal component analysis (KPCA) literature. For example, Shawe-Taylor et al. [31] and Blanchard, Bousquet and Zwald [5] study the concentration properties of the sum of the top $K$ eigenvalues and the sum of all but the top $K$ eigenvalues of the empirical kernel matrix, because such partial sums are closely related to the reconstruction error of KPCA. Zwald and Blanchard [42] derive a finite sample error bound on the difference between the projection onto the leading eigenspace of the empirical covariance operator and the projection onto the leading eigenspace of the population covariance operator. We remark that none of the results mentioned in this paragraph addresses the consistency of the embedding directly, nor does any consider a kernel matrix normalized by the degree matrix.

Unlike our results, which are model-agnostic, the properties of spectral methods have also been studied in model-specific settings. For example, Rohe, Chatterjee and Yu [28] and Lei and Rinaldo [23] investigated the spectral convergence properties of the graph Laplacian and the consistency of spectral clustering, in terms of community membership recovery, under stochastic block models. When the data are sampled from a finite mixture of nonparametric distributions, Shi, Belkin and Yu [32] studied how the leading eigenfunctions and eigenvectors of the population level integral operator reflect clustering information; Schiebinger, Wainwright and Yu [30] studied the geometry of the embedded samples generated by normalized spectral clustering and showed that the embedded samples for different clusters are approximately orthogonal when the mixture components have small overlap and the sample size is large. We remark that in all the results mentioned so far in this section, the kernel function is fixed. For the relationship between the graph Laplacian and the Laplace-Beltrami operator on a manifold, and for the properties of spectral clustering when the kernel is chosen adaptively, we refer readers to the series of works by Trillos et al. [38, 39, 37] and the references therein.

Lastly, entrywise and row-wise analyses of eigenvectors and eigenspaces of matrices have received recent attention. General-purpose deterministic $\ell_{2\to\infty}$ bounds are derived by Fan, Wang and Zhong [16], Cape, Tang and Priebe [7], and Damle and Sun [12], where the first two handle rectangular matrices and the last handles symmetric matrices. When probabilistic assumptions are imposed on the true and perturbed matrices, Cape, Tang and Priebe [8], Abbe et al. [1], and Mao, Sarkar and Chakrabarti [24] obtain stronger $\ell_{2\to\infty}$ bounds for various tasks by taking advantage of the structure of the random matrices. Compared with this literature, our work provides a deterministic bound which contributes to the $\ell_{2\to\infty}$ perturbation theory of linear operators, and the bound can be applied to many problems in statistics (e.g., spectral clustering and kernel PCA) to help characterize the spectral embeddings of individual samples.

1.3 Main contributions

We view our main contributions as threefold and list them in order of appearance. First, we demonstrate that the Newton-Kantorovich Theorem provides a general approach to studying the effect of local perturbations on the invariant subspaces of an operator. This result may be of independent interest to researchers working on spectral perturbation theory. Second, we study the convergence of the embeddings via the uniform consistency error and offer a general recipe for establishing non-asymptotic uniform consistency type results that handles multiple eigenfunctions at once and is not limited to simple eigenvalues. Third, we apply our recipe to normalized spectral clustering and give a novel proof of a finite sample error bound on the uniform consistency error of the spectral embeddings.

1.4 Structure of the paper

The rest of the paper is organized as follows. A review of relevant mathematical preliminaries is provided in Section 2; the exact statement and proof of Result 1 are in Section 3; the exact statement and proof of Result 2 are in Section 4; a discussion of various issues relevant to our results is in Section 5; proofs of some secondary lemmas and an additional application are relegated to the appendix.

2 Preliminaries and notations

In this section, we discuss various basic concepts and preliminary results that will be used repeatedly throughout the paper. More technical results that are section specific shall be introduced as needed later in the paper.

2.1 Operator theory

We assume readers are familiar with basic concepts such as Banach spaces, Hilbert spaces, linear operators, operator norms, and spectra of operators. From now on, we let $\mathbb{K}$ denote either the field of real numbers or the field of complex numbers, $\mathcal{Y}_1, \mathcal{Y}_2$ denote Banach spaces over the same field $\mathbb{K}$, and $\mathcal{H}_1, \mathcal{H}_2$ denote Hilbert spaces over the same field $\mathbb{K}$.

We would like to first highlight a nuance in the definition of a linear operator. For a linear operator $A : \mathcal{Y}_1 \to \mathcal{Y}_2$, we adopt the convention from Kato [20] and allow $A$ to be defined only on a linear manifold in $\mathcal{Y}_1$ (in Kato [20], linear manifold is just a synonym for affine subspace), denoted $D(A)$; in Ciarlet [9], for example, such a distinction is not made. We call $D(A)$ the domain of $A$ and can naturally define the range of $A$ as $R(A) := \{Ay \;|\; y \in D(A)\}$. As for $\mathcal{Y}_1, \mathcal{Y}_2$, we call them the domain space and the range space respectively.

For a linear operator $A : \mathcal{Y}_1 \to \mathcal{Y}_2$, we say $A$ is bounded if $\sup_{\|y\|_{\mathcal{Y}_1}=1} \|Ay\|_{\mathcal{Y}_2} < \infty$, and when $A$ is bounded, we define its operator norm $\|A\| := \sup_{\|y\|_{\mathcal{Y}_1}=1} \|Ay\|_{\mathcal{Y}_2}$. Throughout the paper, when $\|\cdot\|$ has no subscript, it defaults to the operator norm. We use $\mathcal{L}(\mathcal{Y}_1, \mathcal{Y}_2)$ to denote the space of all bounded linear operators from $\mathcal{Y}_1$ to $\mathcal{Y}_2$. When $\mathcal{Y}_1 = \mathcal{Y}_2$, we simply write $\mathcal{L}(\mathcal{Y}_1, \mathcal{Y}_1)$ as $\mathcal{L}(\mathcal{Y}_1)$. We say $A$ is a compact operator if the closure of the image of any bounded set in $\mathcal{Y}_1$ under $A$ is compact. It is known that compact operators are bounded.

For a bounded linear operator $A : \mathcal{H}_1 \to \mathcal{H}_2$, define its adjoint $A^* : \mathcal{H}_2 \to \mathcal{H}_1$ as the unique operator from $\mathcal{H}_2$ to $\mathcal{H}_1$ satisfying $\langle Af, g\rangle_{\mathcal{H}_2} = \langle f, A^* g\rangle_{\mathcal{H}_1}$ for all $f \in \mathcal{H}_1$ and $g \in \mathcal{H}_2$. Here, we use $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ to denote the inner product in the Hilbert space $\mathcal{H}$. A basic property of $A^*$ is $\|A\|_{\mathcal{H}_1\to\mathcal{H}_2} = \|A^*\|_{\mathcal{H}_2\to\mathcal{H}_1}$, where the norm is the operator norm and we use the $\mathcal{H}_1\to\mathcal{H}_2$ notation to explicitly specify the domain space and range space. When $\mathcal{H}_1 = \mathcal{H}_2$, $A$ is called self-adjoint if $A$ is equal to its adjoint $A^*$, and $A$ is called positive if $\langle Af, f\rangle_{\mathcal{H}_1} \geq 0$ for every $f \in \mathcal{H}_1$.

We say a Hilbert space is separable if it has a basis of countably many elements. We say a bounded linear operator $A : \mathcal{H}_1 \to \mathcal{H}_2$ is Hilbert-Schmidt if $\sum_{i\in\mathcal{I}} \|Ae_i\|_{\mathcal{H}_2}^2 < \infty$, where $\{e_i : i \in \mathcal{I}\}$ is an orthonormal basis of $\mathcal{H}_1$. We use $HS(\mathcal{H}_1, \mathcal{H}_2)$ to denote the space of all Hilbert-Schmidt operators from $\mathcal{H}_1$ to $\mathcal{H}_2$; this space is itself a Hilbert space with respect to the inner product $\langle A, B\rangle_{HS} := \sum_{i\in\mathcal{I}} \langle Ae_i, Be_i\rangle_{\mathcal{H}_2}$. We use $\|\cdot\|_{HS}$ to denote the norm induced by this inner product and note that all Hilbert-Schmidt operators are compact. We also note the Hilbert-Schmidt norm is stronger than the operator norm in that $\|A\| \leq \|A\|_{HS}$, and the Hilbert-Schmidt norm is compatible with the operator norm in the following sense: for any Hilbert-Schmidt operator $A$ and bounded operator $B$, the products $AB$ and $BA$ are Hilbert-Schmidt with

\|AB\|_{HS} \leq \|A\|_{HS}\|B\|, \qquad \|BA\|_{HS} \leq \|B\|\|A\|_{HS}.
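For matrices, the Hilbert-Schmidt norm is the Frobenius norm, so these inequalities are easy to spot-check numerically; a small sanity check of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
hs = np.linalg.norm                      # Frobenius = Hilbert-Schmidt norm
op = lambda M: np.linalg.norm(M, 2)      # operator (spectral) norm

assert op(A) <= hs(A)                    # ||A|| <= ||A||_HS
assert hs(A @ B) <= hs(A) * op(B) + 1e-12
assert hs(B @ A) <= op(B) * hs(A) + 1e-12
```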

2.2 Spectral theory for linear operators

In this subsection, we set $\mathbb{K} = \mathbb{C}$. Let $A : \mathcal{H}_1 \to \mathcal{H}_1$ be a bounded linear operator. As with matrices, we say $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ if for some eigenvector $f \in \mathcal{H}_1$,

Af = \lambda f \;\;\text{ and }\;\; f \neq 0.

In other words, $\lambda$ is an eigenvalue if the null space $N(\lambda I - A)$ is not $\{0\}$. We call $N(\lambda I - A)$ the eigenspace associated with $\lambda$, and the dimension of $N(\lambda I - A)$ is called the geometric multiplicity of $\lambda$. The spectrum of $A$ is defined as $\sigma(A) := \mathbb{C}\setminus\rho(A)$, where $\rho(A)$ is the resolvent set

\rho(A) := \{\lambda \in \mathbb{C} \;|\; (\lambda I - A)^{-1} \in \mathcal{L}(\mathcal{H}_1)\}.

Eigenvalues are in the spectrum, but $\sigma(A)$ generally contains more than just eigenvalues. If $A$ is a compact operator, $\sigma(A)$ has the following structure: $\sigma(A)\setminus\{0\}$ is a countable set of isolated eigenvalues, each with finite geometric multiplicity, and the only possible accumulation point of $\sigma(A)$ is $0$. If $A$ is self-adjoint, then all the eigenvalues must be real. If $A$ is a positive operator, then all its eigenvalues are real and non-negative. Therefore, for any compact positive self-adjoint operator $A$, we can arrange the non-zero eigenvalues of $A$ into a non-increasing sequence of positive numbers (the largest eigenvalue is bounded by the operator norm of $A$), repeating each eigenvalue a number of times equal to its geometric multiplicity.

Another remarkable fact in the spectral theory of linear operators concerns spectral projections. Let $\Gamma \subset \rho(A)$ be a closed simple rectifiable curve. Assume the part of $\sigma(A)$ enclosed inside $\Gamma$ is a finite set of eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_K$. Then the projection $P : \mathcal{H}_1 \to \mathcal{H}_1$ onto the direct sum of the eigenspaces of $\{\lambda_i\}_{i=1}^K$, i.e. $\bigoplus_{i=1}^K N(\lambda_i I - A)$, can be defined. Technicalities aside, this projection has the following contour integral expression:

P = \frac{1}{2\pi i}\int_\Gamma (\gamma I - A)^{-1}\, d\gamma.
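The contour formula can be evaluated numerically for a matrix. The sketch below (our own illustration, using a circle in place of the rectangular contour used later in (3.1)) recovers the orthogonal projection onto the top-$K$ eigenspace of a symmetric matrix by trapezoidal quadrature:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6)); A = (A + A.T) / 2   # self-adjoint
w, V = np.linalg.eigh(A)                             # ascending eigenvalues
K = 2
# Circle enclosing exactly the top-K eigenvalues and no others.
center = (w[-1] + w[-K]) / 2
radius = (w[-1] - w[-K]) / 2 + (w[-K] - w[-K - 1]) / 2

m = 500
P = np.zeros_like(A, dtype=complex)
for t in np.linspace(0.0, 2 * np.pi, m, endpoint=False):
    gamma = center + radius * np.exp(1j * t)         # point on the contour
    dgamma = 1j * radius * np.exp(1j * t)            # gamma'(t)
    P += np.linalg.inv(gamma * np.eye(6) - A) * dgamma
P *= (2 * np.pi / m) / (2j * np.pi)                  # (1/2*pi*i) * integral

P_true = V[:, -K:] @ V[:, -K:].T                     # spectral projection
print(np.linalg.norm(P.real - P_true))               # small
```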

2.3 Function spaces

Let $\mathcal{X}$ be a bounded open subset of $\mathbb{R}^p$. We now define several function spaces we are going to work with. Define the space of bounded continuous functions $C_b(\mathcal{X})$ as

C_b(\mathcal{X}) := \{f \;|\; f : \mathcal{X} \to \mathbb{C} \text{ is a bounded continuous function}\}.

It can be shown that $\|f\|_\infty := \sup_{x\in\mathcal{X}} |f(x)|$ is a norm on $C_b(\mathcal{X})$ and that $C_b(\mathcal{X})$ is a Banach space with respect to this infinity norm.

We can also define the space of complex-valued square integrable functions $L^2(\mathcal{X}, \mu)$. Suppose $(\mathcal{X}, \mathcal{B}, \mu)$ is a measure space where $\mathcal{B}$ is the Lebesgue $\sigma$-algebra and $\mu$ is a measure; then $L^2(\mathcal{X}, \mu)$ is defined as the set of measurable functions such that

\int_{\mathcal{X}} |f(x)|^2\, d\mu < \infty.

In fact, $L^2(\mathcal{X}, \mu)$ is a Hilbert space with respect to the inner product

\langle f, g\rangle_{L^2(\mathcal{X},\mu)} := \int_{\mathcal{X}} f\bar{g}\, d\mu.

We also define $l^2$, the space of square summable infinite sequences of complex numbers. It is well known that $l^2$ is a complex Hilbert space with respect to the inner product

\langle u, v\rangle_{l^2} := \sum_{i=1}^\infty u_i \bar{v_i}.

2.4 Reproducing Kernel Hilbert Space (RKHS)

Let $\mathcal{X}$ be a subset of $\mathbb{R}^p$ and $\mathcal{H}$ be a set of functions $f : \mathcal{X} \to \mathbb{C}$. Suppose $\mathcal{H}$ is a Hilbert space with respect to some inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$. If all point evaluation functionals on $\mathcal{H}$ are bounded, i.e.

|f(x)| \leq C_x \|f\|_{\mathcal{H}} \quad \forall f \in \mathcal{H},

where $C_x$ is some constant depending on $x$, then it can be shown that there exists a unique conjugate symmetric positive definite kernel function $k : \mathcal{X}\times\mathcal{X} \to \mathbb{C}$ such that the following reproducing property is satisfied:

f(x) = \langle f, k(\cdot, x)\rangle_{\mathcal{H}}.

The kernel $k$ is called the reproducing kernel and $\mathcal{H}$ is called a reproducing kernel Hilbert space (RKHS).

We say a kernel function $k$ is positive definite if for any $n \in \mathbb{N}^+$, any $x_1, x_2, \ldots, x_n \in \mathcal{X}$ and any $\xi_1, \xi_2, \ldots, \xi_n \in \mathbb{C}$, the quadratic form

\sum_{i,j=1}^n k(x_i, x_j) \bar{\xi_i}\xi_j   (2.1)

is non-negative. The kernel function of any RKHS is positive definite.
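As a quick illustration of our own (with a Gaussian kernel, a classic positive definite kernel), the quadratic form (2.1) is the Hermitian form of the kernel Gram matrix, so positive definiteness can be checked through the Gram matrix's eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((30, 2))                   # points in R^2
sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
G = np.exp(-sq / 2.0)                              # Gram matrix k(x_i, x_j)

xi = rng.standard_normal(30) + 1j * rng.standard_normal(30)
quad = xi.conj() @ G @ xi                          # the form (2.1)
print(quad.real >= 0, np.linalg.eigvalsh(G).min() >= -1e-10)  # True True
```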

3 Uniform error bound for spectral embedding

In this section, we prove the first result described in the previous section. We first lay out the assumptions and notations. Let $\mathcal{X}$ be a subset of $\mathbb{R}^p$ and $\mathbb{P}$ be a probability measure whose density function is supported on $\mathcal{X}$. Let $L^2(\mathcal{X}, \mathbb{P})$ denote the space of complex-valued square integrable functions on $\mathcal{X}$ and $\mathcal{H}$ be a subspace of $L^2(\mathcal{X}, \mathbb{P})$. We assume $\mathcal{H}$ is equipped with its own inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and is a Hilbert space with respect to this inner product. We also require $\mathcal{H}$ to be such that for every $h \in \mathcal{H}$, which is an equivalence class in $L^2(\mathcal{X}, \mathbb{P})$, there exists a representative function $h'$ in the class such that $h' \in C_b(\mathcal{X})$. Since $\mathrm{supp}(\mathbb{P}) = \mathcal{X}$, $h'$ is unique, and we can define the infinity norm on $\mathcal{H}$ by setting $\|h\|_\infty := \|h'\|_\infty$. We require that on $\mathcal{H}$, the norm induced by the $\mathcal{H}$-inner product, denoted $\|\cdot\|_{\mathcal{H}}$, be stronger than the infinity norm; that is, there is a constant $C_{\mathcal{H}} > 0$ such that $\|f\|_\infty \leq C_{\mathcal{H}}\|f\|_{\mathcal{H}}$ for all $f \in \mathcal{H}$. (This in fact implies $\mathcal{H}$ is an RKHS. But since we do not use the reproducing property anywhere in the proof, we find framing $\mathcal{H}$ as an RKHS unnecessary.)

Let $T$ and $\widetilde{T}$ be two Hilbert-Schmidt operators from $\mathcal{H}$ to $\mathcal{H}$; $\widetilde{T}$ can be seen as a perturbed version of $T$, and we use $E := \widetilde{T} - T$ to denote their difference. Suppose all the eigenvalues of $T$ (counting geometric multiplicity) can be arranged in a non-increasing (possibly infinite) sequence of non-negative real numbers $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_K > \lambda_{K+1} \geq \ldots \geq 0$ with a positive gap between $\lambda_K$ and $\lambda_{K+1}$. Suppose the eigenvalues of $\widetilde{T}$ can also be arranged in a non-increasing sequence of non-negative real numbers. We do not assume, however, any eigengap for $\widetilde{T}$.

Let $\{f_i\}_{i=1}^K \subset \mathcal{H}$ be the eigenfunctions associated with the eigenvalues $\lambda_1, \ldots, \lambda_K$. We assume $\{f_i\}_{i=1}^K$ are picked so that they constitute a set of orthonormal vectors in $L^2(\mathcal{X}, \mathbb{P})$. We then pick $\{f_i\}_{i=K+1}^\infty$ so that $\{f_i\}_{i=1}^\infty$ constitutes a complete orthonormal basis of $L^2(\mathcal{X}, \mathbb{P})$. Define $V_1 : \mathbb{C}^K \to L^2(\mathcal{X}, \mathbb{P})$ by $V_1\alpha = \sum_{i=1}^K \alpha_i f_i$ and $V_2 : l^2 \to L^2(\mathcal{X}, \mathbb{P})$ by $V_2\beta = \sum_{i=1}^\infty \beta_i f_{K+i}$. Define their adjoints $V_1^*, V_2^*$ with respect to the standard inner products on $\mathbb{C}^K$, $l^2$ and $L^2(\mathcal{X}, \mathbb{P})$. Since $\{f_i\}_{i=1}^K \subset \mathcal{H}$, we can also view $\mathcal{H}$ as the range (domain) space of $V_1$ ($V_1^*$). The exact range space of $V_1$ shall be clear from the context.

When the perturbation $E$ has small enough Hilbert-Schmidt norm, $\widetilde{T}$ necessarily has an eigengap. In this case, the leading $K$-dimensional invariant subspace of $\widetilde{T}$ is well defined. We pick $\{\tilde{f}_i\}_{i=1}^K$ to be an orthonormal set of vectors in $L^2(\mathcal{X}, \mathbb{P})$ that spans the leading invariant subspace of $\widetilde{T}$, and define $\widetilde{V}_1 : \mathbb{C}^K \to L^2(\mathcal{X}, \mathbb{P})$ as $\widetilde{V}_1\alpha = \sum_{i=1}^K \alpha_i \tilde{f}_i$.

Last but not least, define $V_2^{-1}\mathcal{H} := \{l \in l^2 : V_2 l \in \mathcal{H}\}$ and $V_2^*\mathcal{H} := \{V_2^* h : h \in \mathcal{H}\}$. These are, intuitively, the "coordinate spaces" for functions in $\mathcal{H}$ under the basis in $V_2$. Working with these coordinates simplifies our notation. The following facts regarding $V_2^{-1}\mathcal{H}$ and $V_2^*\mathcal{H}$ hold true (with proof in the appendix).

Lemma 3.1.

Assuming $f_1, \dots, f_K \in \mathcal{H}$, we have

  1. the set $V_2^{-1}\mathcal{H}$ is equal to the set $V_2^*\mathcal{H}$;

  2. $V_2^{-1}\mathcal{H}$ is a subspace of $l^2$; it is also a Hilbert space with respect to the $\mathcal{H}$-induced inner product

(b_1, b_2)_{V_2^{-1}\mathcal{H}} := \langle V_2 b_1, V_2 b_2\rangle_{\mathcal{H}};

  3. $V_1 \in \mathcal{L}(\mathbb{C}^K, \mathcal{H})$, $V_1^* \in \mathcal{L}(\mathcal{H}, \mathbb{C}^K)$, $V_2 \in \mathcal{L}(V_2^{-1}\mathcal{H}, \mathcal{H})$, and $V_2^* \in \mathcal{L}(\mathcal{H}, V_2^{-1}\mathcal{H})$, with operator norms satisfying

\|V_1\|_{2\to\mathcal{H}} \leq \sqrt{K}\max_{i\in[K]}\|f_i\|_{\mathcal{H}}, \quad \|V_1^*\|_{\mathcal{H}\to 2} \leq C_{\mathcal{H}}\sqrt{K},
\|V_2\|_{V_2^{-1}\mathcal{H}\to\mathcal{H}} = 1, \quad \|V_2^*\|_{\mathcal{H}\to V_2^{-1}\mathcal{H}} \leq 1 + C_{\mathcal{H}} K \max_{i\in[K]}\|f_i\|_{\mathcal{H}}.

Because of item 1 of the lemma, we do not need to distinguish between $V_2^{-1}\mathcal{H}$ and $V_2^*\mathcal{H}$; we denote both by $\tilde{l}^2$. To keep notation manageable, define $\widetilde{T}_{ij} = V_i^*\widetilde{T}V_j$ for $i, j \in \{1, 2\}$; e.g. $\widetilde{T}_{21}$ is shorthand for $V_2^*\widetilde{T}V_1$. The blocks $T_{ij} = V_i^* T V_j$ and $E_{ij} = V_i^* E V_j$ are defined analogously.

We also need the following quantities to define the constants in Result 1. Let $\Gamma$ be the boundary of the rectangle

\big\{\lambda \in \mathbb{C} \;|\; \tfrac{\lambda_K + \lambda_{K+1}}{2} \leq \mathrm{re}(\lambda) \leq \|T\|_{\mathcal{H}\to\mathcal{H}} + 1,\; |\mathrm{im}(\lambda)| \leq 1\big\}.   (3.1)

Let $l(\Gamma)$ denote the length of $\Gamma$ and define

\eta = \frac{1}{\sup_{\lambda\in\Gamma}\|(\lambda I - T)^{-1}\|_{op}},   (3.2)

which is necessarily finite. Define a measure of spectral separation

\delta := \mathrm{sep}(T_{11}, T_{22}) := \inf\Big\{\big\|T_{22}Y - YT_{11}\big\|_{HS} \;\Big|\; Y \in \mathcal{L}(\mathbb{C}^K, \tilde{l}^2),\; \|Y\|_{HS} = 1\Big\}.

It is reasonable to expect that the larger the eigengap, the larger $\mathrm{sep}(T_{11}, T_{22})$ is. And when $T$ has only $K$ eigenvalues or is self-adjoint from $\mathcal{H}$ to $\mathcal{H}$, one can prove that the separation $\mathrm{sep}(T_{11}, T_{22})$ is lower bounded by the eigengap $\lambda_K - \lambda_{K+1}$.
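In finite dimensions, $\mathrm{sep}(T_{11}, T_{22})$ is the smallest singular value of the Sylvester operator $Y \mapsto T_{22}Y - YT_{11}$, which can be computed from a Kronecker-product representation. The toy check below (our own illustration) confirms that for diagonal, hence self-adjoint, blocks it equals the eigengap $\lambda_K - \lambda_{K+1}$:

```python
import numpy as np

lam = np.array([1.0, 0.9, 0.5, 0.3])            # K = 2 leading eigenvalues
T11, T22 = np.diag(lam[:2]), np.diag(lam[2:])
# vec(T22 Y - Y T11) = (I kron T22 - T11^T kron I) vec(Y)
S = np.kron(np.eye(2), T22) - np.kron(T11.T, np.eye(2))
sep = np.linalg.svd(S, compute_uv=False).min()
print(sep, lam[1] - lam[2])                      # both equal 0.4
```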

Define the constant

C_3 := \max\Big\{C_{\mathcal{H}},\; 1 + C_{\mathcal{H}}\|V_1\|_{2\to\mathcal{H}},\; \|V_1\|_{2\to\mathcal{H}}(1 + C_{\mathcal{H}}\|V_1\|_{2\to\mathcal{H}})\Big\}.

We are now ready to state the main theorem of this section.

Theorem 3.2 (General recipe for uniform consistency).

Under the assumptions above, as long as $E := \widetilde{T} - T$, as an operator from $\mathcal{H}$ to $\mathcal{H}$, has Hilbert-Schmidt norm $\|E\|_{HS} \leq C_1$, the uniform consistency error satisfies

\inf\{\|V_1 - \widetilde{V}_1 Q\|_{2\to\infty} : Q \in \mathbb{U}^K\} \leq C_2\|E\|_{HS}.

Here, $C_1, C_2 > 0$ are two constants independent of the choice of $\widetilde{T}$, defined as

C_1 := \frac{1}{C_3}\min\Big\{\frac{\lambda_K - \lambda_{K+1}}{8}, \frac{1}{2}, \frac{\delta}{4}, \frac{\delta}{4C_{\mathcal{H}}}, \frac{C_3\eta^2}{\eta + l(\Gamma)/2\pi}\Big\},   (3.3)
C_2 := \frac{4C_3 C_{\mathcal{H}}(\|V_1\|_{2\to\infty} + 1)}{\delta}.   (3.4)

We remark that since $C_2$ is inversely proportional to $\delta$, it is beneficial to study the convergence of the leading eigenspace as a whole. When eigenspaces are treated individually, each eigenspace converges slowly because the leading eigenvalues may cluster together and $\delta$ is small; but when treated as a whole, we get a larger $\delta$, and thus faster convergence, because the leading eigenvalues are well-separated from the rest of the spectrum.

The rest of the section is devoted to proving Theorem 3.2. The proof strategy is to express $\widetilde{V}_1 Q$ in terms of the solution of an operator equation and directly bound $\|V_1 - \widetilde{V}_1 Q\|_{2\to\infty}$. We present the proof in five steps. In step one, we characterize the invariant subspace of $\widetilde{T}$ in terms of the solution of a quadratic operator equation. In step two, we apply the Newton-Kantorovich Theorem to show this equation does have a solution when the perturbation is small. In step three, we introduce some additional conditions that guarantee the invariant subspace from step two is the leading invariant subspace. In step four, we directly bound the error term $\|V_1 - \widetilde{V}_1 Q\|_{2\to\infty}$. In step five, we assemble all the pieces together and prove Theorem 3.2. A similar approach was used in Stewart [35] to study the invariant subspaces of matrices.

3.1 Step one: equation characterization of the invariant subspace

In this section, our goal is to find a $Y \in \mathcal{L}(\mathbb{C}^K, \tilde{l}^2)$ such that the range of $V_1 + V_2 Y \in \mathcal{L}(\mathbb{C}^K, \mathcal{H})$ is an invariant subspace of $\widetilde{T}$. It turns out that any $Y$ that satisfies the following quadratic operator equation suffices.

Proposition 3.3.

As long as $Y \in \mathcal{L}(\mathbb{C}^K, \tilde{l}^2)$ satisfies the equation

\widetilde{T}_{21} + \widetilde{T}_{22}Y = Y\widetilde{T}_{11} + Y\widetilde{T}_{12}Y,   (3.5)

the range of $V_1 + V_2 Y \in \mathcal{L}(\mathbb{C}^K, \mathcal{H})$ is an invariant subspace of $\widetilde{T}$.

Proof.

First, we note (3.5) is a well-defined equation of operators in $\mathcal{L}(\mathbb{C}^K, \tilde{l}^2)$. This can be seen from our assumption $\widetilde{T} \in \mathcal{L}(\mathcal{H})$ and item 3 of Lemma 3.1. Next, we assert that equation (3.5) implies (we do not differentiate equality in $L^2(\mathcal{X}, \mathbb{P})$ from equality in $\mathcal{H}$, because the two are equivalent)

\widetilde{T}(V_1 + V_2 Y) = (V_1 + V_2 Y)(\widetilde{T}_{11} + \widetilde{T}_{12}Y),   (3.6)

which says that the range of $V_1 + V_2 Y$ is invariant under $\widetilde{T}$. To prove this assertion, note that $V_1 V_1^* + V_2 V_2^* = I$ and

\widetilde{T}(V_1 + V_2 Y) = (V_1 V_1^* + V_2 V_2^*)\widetilde{T}(V_1 + V_2 Y)
= V_1\widetilde{T}_{11} + V_1\widetilde{T}_{12}Y + V_2\widetilde{T}_{21} + V_2\widetilde{T}_{22}Y
= V_1\widetilde{T}_{11} + V_1\widetilde{T}_{12}Y + V_2 Y\widetilde{T}_{11} + V_2 Y\widetilde{T}_{12}Y
= (V_1 + V_2 Y)(\widetilde{T}_{11} + \widetilde{T}_{12}Y),

where the third equality uses (3.5). ∎

3.2 Step two: solve the equation with the Newton-Kantorovich Theorem

After characterizing the invariant subspace of T~\widetilde{T} in terms of a solution of (3.5), we apply the Newton-Kantorovich Theorem to prove a solution to (3.5) exists. The Newton-Kantorovich Theorem constructs a root of a function between Banach spaces when certain conditions on the function itself and its first and second order derivatives are met. The construction is algorithmic: the root is the limit point of a sequence of iterates generated by the Newton-Raphson method for root finding. The exact version of the Newton-Kantorovich Theorem we use is from the appendix of Karow and Kressner [19].

Theorem 3.4 (Newton-Kantorovich).

Let $\mathcal{E}, \mathcal{Z}$ be Banach spaces and let $F : \mathcal{Z} \to \mathcal{E}$ be twice continuously differentiable in a sufficiently large neighborhood $\Omega$ of $Z \in \mathcal{Z}$. Suppose that there exists a linear operator $\mathbb{T} : \mathcal{Z} \to \mathcal{E}$ with a continuous inverse $\mathbb{T}^{-1}$ satisfying the following conditions:

\|\mathbb{T}^{-1}(F(Z))\|_{\mathcal{Z}} \leq a,   (3.7)
\|\mathbb{T}^{-1}\circ F'(Z) - I\|_{op} \leq b,   (3.8)
\|\mathbb{T}^{-1}\circ F''(\widetilde{Z})\|_{op} \leq c, \qquad \forall \widetilde{Z} \in \Omega.   (3.9)

If $b < 1$ and $h := \frac{ac}{(1-b)^2} < \frac{1}{2}$, then there exists a solution $Z_E$ of $F(Z_E) = 0$ such that

\|Z_E - Z\|_{\mathcal{Z}} \leq r_0 \qquad \text{with} \qquad r_0 := \frac{2a}{(1-b)(1+\sqrt{1-2h})}.

We are now ready to prove the proposition below, which states that when $\|E_{11}\|_{HS}$, $\|E_{21}\|_{HS}$, $\|E_{22}\|_{HS}$, $\|E_{12}\|_{HS}$ are small relative to $\mathrm{sep}(T_{11}, T_{22})$, equation (3.5) has a solution.

Proposition 3.5.

Let $\delta := \mathrm{sep}(T_{11}, T_{22})$ and $s_E := \delta - \|E_{22}\|_{HS} - \|E_{11}\|_{HS}$. When $s_E > 0$ and $\frac{\|E_{21}\|_{HS}\|E_{12}\|_{HS}}{s_E^2} < \frac{1}{4}$, there exists $Y \in \mathcal{L}(\mathbb{C}^K, \tilde{l}^2)$ with $\|Y\|_{HS} \leq \frac{2\|E_{21}\|_{HS}}{s_E}$ such that equation (3.5) is satisfied.

Proof of Proposition 3.5.

After rearrangement, (3.5) is equivalent to

E_{21} + (T_{22} + E_{22})Y - Y(T_{11} + E_{11}) - YE_{12}Y = 0.   (3.10)

Let $\mathcal{E} := \mathcal{L}(\mathbb{C}^K, \tilde{l}^2)$ denote the space of bounded linear operators from $\mathbb{C}^K$ to $\tilde{l}^2$. Since $\mathbb{C}^K$ is finite dimensional, any linear operator from $\mathbb{C}^K$ to $\tilde{l}^2$ is bounded and Hilbert-Schmidt. We can thus use the Hilbert-Schmidt norm as the default norm on $\mathcal{E}$, and $\mathcal{E}$ is a Hilbert space with respect to this norm. This fact also allows us to define $f : \mathcal{E} \to \mathcal{E}$ as $f(Y) := E_{21} + (T_{22} + E_{22})Y - Y(T_{11} + E_{11}) - YE_{12}Y$ and $\mathcal{T} : \mathcal{E} \to \mathcal{E}$ as $\mathcal{T}(Y) := T_{22}Y - YT_{11}$. Noting that the images of $\mathcal{H}$ under $T, E$ are still in $\mathcal{H}$, we can verify $\mathcal{T}$ and $f$ are indeed well defined.

We assert that $\delta > 0$ and that $\mathcal{T}$ is one-to-one and onto, and defer the proof of this to a lemma. The implication is that $\mathcal{T}$ is invertible with $\|\mathcal{T}^{-1}\|_{op} = \frac{1}{\mathrm{sep}(T_{11}, T_{22})} \leq \frac{1}{\delta}$.

We are now ready to verify the three assumptions of the Newton-Kantorovich Theorem.

(A1):

\big\|\mathcal{T}^{-1}(f(0))\big\|_{HS} = \big\|\mathcal{T}^{-1}(E_{21})\big\|_{HS} \leq \big\|\mathcal{T}^{-1}\big\|_{op}\big\|E_{21}\big\|_{HS} \leq \frac{\|E_{21}\|_{HS}}{\delta} =: a.   (3.11)

(A2): The Fréchet derivative of $f$ at $Y_0$ is given by

f'(Y_0) : \mathcal{E} \to \mathcal{E}, \quad \Delta Y \mapsto (T_{22} + E_{22})\Delta Y - \Delta Y(T_{11} + E_{11}) - \Delta Y E_{12} Y_0 - Y_0 E_{12}\Delta Y.

In particular, when $Y_0 = 0$,

f'(0) : \mathcal{E} \to \mathcal{E}, \quad \Delta Y \mapsto (T_{22} + E_{22})\Delta Y - \Delta Y(T_{11} + E_{11}).

Consequently,

\big(\mathcal{T}^{-1}\circ f'(0) - I\big) : \mathcal{E} \to \mathcal{E}, \quad \Delta Y \mapsto \mathcal{T}^{-1}(E_{22}\Delta Y - \Delta Y E_{11}).

We thus have

\big\|\mathcal{T}^{-1}\circ f'(0) - I\big\| = \sup_{\|\Delta Y\|_{HS}=1}\big\|\mathcal{T}^{-1}(E_{22}\Delta Y - \Delta Y E_{11})\big\|_{HS}
\leq \sup_{\|\Delta Y\|_{HS}=1}\big\|\mathcal{T}^{-1}\big\|_{op}\big\|E_{22}\Delta Y - \Delta Y E_{11}\big\|_{HS}
\leq \frac{1}{\delta}\sup_{\|\Delta Y\|_{HS}=1}\Big\{\|E_{22}\|_{HS}\|\Delta Y\|_{HS} + \|\Delta Y\|_{HS}\|E_{11}\|_{HS}\Big\}
\leq \frac{\|E_{22}\|_{HS} + \|E_{11}\|_{HS}}{\delta} =: b.

(A3): The second order Fréchet derivative at $Y_0$ is a linear operator in $\mathcal{L}(\mathcal{E}, \mathcal{L}(\mathcal{E}, \mathcal{E}))$:

f''(Y_0) : \mathcal{E} \to \mathcal{L}(\mathcal{E}, \mathcal{E}), \quad \Delta_1 Y \mapsto \mathcal{T}_{\Delta_1 Y},

where $\mathcal{T}_{\Delta_1 Y}$ is

\mathcal{T}_{\Delta_1 Y} : \mathcal{E} \to \mathcal{E}, \quad \Delta_2 Y \mapsto -\Delta_1 Y E_{12}\Delta_2 Y - \Delta_2 Y E_{12}\Delta_1 Y.

Therefore the second derivative is the same for every $Y_0 \in \mathcal{E}$, and we have

\big\|\mathcal{T}^{-1}\circ f''(Y_0)\big\|_{op} = \sup_{\|\Delta_1 Y\|_{HS}=1}\big\|\mathcal{T}^{-1}\circ\mathcal{T}_{\Delta_1 Y}\big\|_{op}
= \sup_{\|\Delta_1 Y\|_{HS}=1}\sup_{\|\Delta_2 Y\|_{HS}=1}\big\|\big(\mathcal{T}^{-1}\circ\mathcal{T}_{\Delta_1 Y}\big)(\Delta_2 Y)\big\|_{\mathcal{E}}
\leq \frac{1}{\delta}\sup_{\|\Delta_1 Y\|_{HS}=1}\sup_{\|\Delta_2 Y\|_{HS}=1}\big\|\Delta_1 Y E_{12}\Delta_2 Y + \Delta_2 Y E_{12}\Delta_1 Y\big\|_{\mathcal{E}}
\leq \frac{2\|E_{12}\|_{HS}}{\delta} =: c.

(Conclusion:) With all the assumptions in place, we apply the Newton-Kantorovich Theorem and conclude as follows. When

s_E := \delta - \|E_{22}\|_{HS} - \|E_{11}\|_{HS} > 0 \quad\text{and}\quad \frac{\|E_{21}\|_{HS}\|E_{12}\|_{HS}}{s_E^2} < \frac{1}{4},

equation (3.5) has a solution $Y_E$ such that

\big\|Y_E\big\|_{HS} \leq \frac{2\|E_{21}\|_{HS}}{s_E + \sqrt{s_E^2 - 4\|E_{21}\|_{HS}\|E_{12}\|_{HS}}} \leq \frac{2\|E_{21}\|_{HS}}{s_E}.   (3.12) ∎
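The construction behind Proposition 3.5 can be imitated numerically. In the matrix sketch below (our own illustration; scipy's Sylvester solver plays the role of $\mathcal{T}^{-1}$), a simplified Newton (chord) iteration started at $Y = 0$ solves (3.10), and the resulting $Y$ indeed produces an invariant subspace of the perturbed operator as in (3.6):

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(1)
T11, T22 = np.diag([1.0, 0.9]), np.diag([0.5, 0.3, 0.1])
E11, E12 = 0.01 * rng.standard_normal((2, 2)), 0.01 * rng.standard_normal((2, 3))
E21, E22 = 0.01 * rng.standard_normal((3, 2)), 0.01 * rng.standard_normal((3, 3))

def f(Y):  # left-hand side of (3.10)
    return E21 + (T22 + E22) @ Y - Y @ (T11 + E11) - Y @ E12 @ Y

Y = np.zeros((3, 2))
for _ in range(50):
    # chord step: solve T22 D - D T11 = -f(Y), i.e. apply T^{-1} to -f(Y)
    Y += solve_sylvester(T22, -T11, -f(Y))
print(np.linalg.norm(f(Y)))                      # ~ machine precision

# Invariance check (3.6): T_tilde W = W (T11 + E11 + E12 Y) with W = [I; Y].
T_tilde = np.block([[T11 + E11, E12], [E21, T22 + E22]])
W = np.vstack([np.eye(2), Y])
print(np.linalg.norm(T_tilde @ W - W @ (T11 + E11 + E12 @ Y)))  # ~ 0
```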

3.3 Step three: showing the invariant space is the leading eigenspace

In step two, we obtained an invariant subspace, but there is no guarantee that it is the leading $K$-dimensional invariant subspace of $\widetilde{T}$. In this subsection, we give sufficient conditions that ensure this. When $\|E\|_{HS}$ is small, we show several things must happen: first, the range of $V_1 + V_2 Y_E$ is $K$-dimensional; second, the eigenvalues of the restriction of $\widetilde{T}$ to this subspace are contained in the interval $[\lambda_K - \epsilon, \lambda_1 + \epsilon]$ for some small $\epsilon$; third, $\widetilde{T}$ has exactly $K$ eigenvalues (counting geometric multiplicity) in the interval $[\lambda_K - \epsilon, \lambda_{\max}(\widetilde{T}) + \epsilon]$. These facts combined imply the invariant subspace from Proposition 3.5 has to be the leading $K$-dimensional invariant subspace.

The first point is not hard to show. Suppose the range of $V_1 + V_2 Y_E$ has fewer than $K$ dimensions; then there exists $s \in \mathbb{C}^K$ with $\|s\|_2 = 1$ such that $(V_1 + V_2 Y_E)s = 0$. But since $\|(V_1 + V_2 Y_E)s\|_{L^2} \geq \|V_1 s\|_{L^2} - \|V_2 Y_E s\|_{L^2} \geq 1 - C_{\mathcal{H}}\|V_2 Y_E s\|_{\mathcal{H}} \geq 1 - C_{\mathcal{H}}\|Y_E\|_{HS}$, when $\|Y_E\|_{HS}$ is small, the vector $(V_1 + V_2 Y_E)s$ simply cannot be zero. Stated formally, we have

Lemma 3.6.

When $C_{\mathcal{H}}\|Y_E\|_{HS} < 1$, the range of $V_1 + V_2 Y_E$ is $K$-dimensional.

As for the second point, which is to determine the eigenvalues of the restriction of $\widetilde{T}$, note that (3.5) implies $\widetilde{T}$ has matrix representation $\widetilde{T}_{11} + \widetilde{T}_{12}Y_E$ in the basis $V_1 + V_2 Y_E$. Since eigenvalues are not affected by the choice of basis, we know the eigenvalues of $\widetilde{T}$ on the invariant subspace are those of $\widetilde{T}_{11} + \widetilde{T}_{12}Y_E$. Next, we recall a perturbation result for eigenvalues.

Lemma 3.7.

Assuming $\sigma(T_{11} + E_{11} + E_{12}Y_E) \subset \mathbb{R}$, we have (the addition below is set addition)

\sigma(T_{11} + E_{11} + E_{12}Y_E) \subset \sigma(T_{11}) + \big[-\|E_{11} + E_{12}Y_E\|, \|E_{11} + E_{12}Y_E\|\big].   (3.13)
Proof.

First note that $T_{11} = \mathrm{diag}\big((\lambda_1, \lambda_2, \ldots, \lambda_K)^T\big)$. Suppose $\lambda$ is a real eigenvalue of $T_{11} + E_{11} + E_{12}Y_E \in \mathbb{C}^{K\times K}$; then there exists $v \in \mathbb{C}^K$ with $\|v\|_2 = 1$ such that $(T_{11} + E_{11} + E_{12}Y_E)v = \lambda v$. It thus follows that

\inf_{i\in[K]}|\lambda_i - \lambda| \leq \Big(\sum_{i=1}^K (\lambda_i - \lambda)^2 |v_i|^2\Big)^{1/2} = \|(\lambda I - T_{11})v\| = \|(E_{11} + E_{12}Y_E)v\| \leq \|E_{11} + E_{12}Y_E\|,

which shows $\lambda$ is within $\|E_{11} + E_{12}Y_E\|$ of at least one of $\{\lambda_i\}_{i=1}^K$. This is equivalent to the claim of (3.13). ∎

From the lemma, we know (assuming $\|Y_E\|_{HS} < 1$)

\sigma(T_{11} + E_{11} + E_{12}Y_E) \subset \big[\lambda_K - \|E_{11}\|_{HS} - \|E_{12}\|_{HS},\; \lambda_1 + \|E_{11}\|_{HS} + \|E_{12}\|_{HS}\big].   (3.14)
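Lemma 3.7 is easy to spot-check numerically. In the sketch below (our own illustration), we symmetrize the perturbation so that the spectrum is real, as the lemma assumes:

```python
import numpy as np

lam = np.array([1.0, 0.9, 0.8])
E = 0.05 * np.random.default_rng(2).standard_normal((3, 3))
E = (E + E.T) / 2                              # keep the spectrum real
mu = np.linalg.eigvalsh(np.diag(lam) + E)      # perturbed eigenvalues
dist = np.abs(mu[:, None] - lam[None, :]).min(axis=1)
print(dist.max() <= np.linalg.norm(E, 2))      # True, as (3.13) claims
```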

For the third point, we need the following result from Rosasco, Belkin and Vito [29] (their Theorem 20), which they credit to Anselone [2].

Theorem 3.8.

Let $A \in \mathcal{L}(\mathcal{H})$ be a compact operator. Given a finite set $\Lambda$ of non-zero eigenvalues of $A$, let $\Gamma$ be any simple rectifiable closed curve (having positive direction) with $\Lambda$ inside and $\sigma(A)\setminus\Lambda$ outside. Let $P$ be the spectral projection associated with $\Lambda$, that is,

P = \frac{1}{2\pi i}\int_\Gamma (\lambda I - A)^{-1}\, d\lambda,   (3.15)

and define

\eta^{-1} = \sup_{\lambda\in\Gamma}\|(\lambda I - A)^{-1}\|_{op}.   (3.16)

Let $B$ be another compact operator such that

\|B - A\|_{op} \leq \frac{\eta^2}{\eta + l(\Gamma)/2\pi} < \eta,   (3.17)

where $l(\Gamma)$ is the length of $\Gamma$. Then the following facts hold true.

  1. The curve $\Gamma$ is a subset of the resolvent set of $B$ and encloses a finite set $\widehat{\Lambda}$ of non-zero eigenvalues of $B$;

  2. The dimension of the range of $P$ is equal to the dimension of the range of $\widehat{P}$, where $\widehat{P} = \frac{1}{2\pi i}\int_\Gamma (\lambda I - B)^{-1}\, d\lambda$.

Applying the theorem above, we can take $\Gamma$ as in (3.1), i.e. as the boundary of the rectangle

\big\{\lambda \in \mathbb{C} \;|\; \tfrac{\lambda_K + \lambda_{K+1}}{2} \leq \mathrm{re}(\lambda) \leq \|T\|_{\mathcal{H}\to\mathcal{H}} + 1,\; |\mathrm{im}(\lambda)| \leq 1\big\}.   (3.18)

For $\|E\|_{op} \leq \|E\|_{HS}$ small enough, $\Gamma$ encloses exactly the top $K$ eigenvalues of $\widetilde{T}$. Combining the three points, we obtain sufficient conditions for the range of $V_1 + V_2 Y_E$ to be the leading invariant subspace of $\widetilde{T}$.

Proposition 3.9.

Assuming

\|Y_E\|_{HS} < \min\Big\{1, \frac{1}{C_{\mathcal{H}}}\Big\},   (3.19)
\|E_{11}\|_{HS} + \|E_{12}\|_{HS} < \min\Big\{\frac{\lambda_K - \lambda_{K+1}}{2}, 1\Big\},   (3.20)
\|E\|_{HS} \leq \min\Big\{\frac{\eta^2}{\eta + l(\Gamma)/2\pi}, 1\Big\},   (3.21)

where $\Gamma, \eta$ are defined according to (3.18) and (3.16) respectively, the range of $V_1 + V_2 Y_E$ is the leading invariant subspace of $\widetilde{T}$.

Proof.

First, our choice of $\Gamma$ encloses exactly the top $K$ eigenvalues of $T$. Next, since we assumed $\widetilde{T}$ also has only real eigenvalues, Lemma 3.7 applies. So from (3.14), (3.19), and (3.20), we know

\sigma(T_{11} + E_{11} + E_{12}Y_E) \subset \big(\tfrac{\lambda_K + \lambda_{K+1}}{2}, \lambda_1 + 1\big),

which is enclosed in $\Gamma$. We also see from (3.19) that Lemma 3.6 applies. Finally, condition (3.21) on $E$ ensures Theorem 3.8 applies, so $\widetilde{T}$ has only $K$ eigenvalues inside $\Gamma$. It thus follows that the invariant subspace induced by $V_1 + V_2 Y_E$ is the $K$-dimensional leading invariant subspace of $\widetilde{T}$. ∎

3.4 Step four: bound uniform consistency error

In this step, we bound the uniform consistency error

\inf\{\|V_1 - \widetilde{V}_1 Q\|_{2\to\infty} : Q \in \mathbb{U}^K\},

where $\widetilde{V}_1$ has orthonormal columns spanning the leading invariant subspace of $\widetilde{T}$. Since we require $\widetilde{V}_1^*\widetilde{V}_1 = I$, we need to orthonormalize the "columns" of $V_1 + V_2 Y_E$. Let us define $Y_E^*$ to be the adjoint of $Y_E$ with respect to $\mathbb{C}^K$ and $l^2$. We can verify that $\widetilde{V}_1 Q = (V_1 + V_2 Y_E)(I + Y_E^* Y_E)^{-1/2}$ for some $Q \in \mathbb{U}^K$ (as we will see from Lemma 3.10, $(I + Y_E^* Y_E)^{-1/2}$ is well-defined when $\|Y_E\|_{HS}$ is small) because

(V_1 + V_2 Y_E)^*(V_1 + V_2 Y_E) = V_1^* V_1 + Y_E^* V_2^* V_2 Y_E = I + Y_E^* Y_E.

Meanwhile, note that by assumption, $\|\cdot\|_{\mathcal{H}}$ is stronger than $\|\cdot\|_\infty$, i.e. $C_{\mathcal{H}}\|f\|_{\mathcal{H}} \geq \|f\|_\infty$ for all $f \in \mathcal{H}$. This implies, for all $l \in \tilde{l}^2$,

C_{\mathcal{H}}\|l\|_{\tilde{l}^2} = C_{\mathcal{H}}\|V_2 l\|_{\mathcal{H}} \geq \|V_2 l\|_\infty \geq \|V_2 l\|_{L^2} = \|l\|_{l^2}.

The consequence of this is

C_{\mathcal{H}}\|Y_E\|_{2\to\tilde{l}^2} \geq \|Y_E\|_{2\to l^2}.   (3.22)

We also need the following handy result.

Lemma 3.10.

Let $Y : \mathbb{C}^K \to l^2$ have operator norm $\|Y\|_{2\to l^2} < 1$. Then

\big\|(I + Y^*Y)^{-\frac{1}{2}}\big\|_2 \leq 1, \quad \big\|I - (I + Y^*Y)^{-\frac{1}{2}}\big\|_2 \leq \|Y\|_{2\to l^2}.   (3.23)
Proof.

Suppose $\|Y\|_{2\to l^2} = r < 1$. Note that $\|Y^*Y\| = r^2 < 1$, so the Hermitian matrix $I + Y^*Y$ is invertible with spectrum in $[1, 1 + r^2]$. Consequently, $\sigma\big((I + Y^*Y)^{-\frac{1}{2}}\big) \subset [\frac{1}{\sqrt{1+r^2}}, 1]$, so $\big\|(I + Y^*Y)^{-\frac{1}{2}}\big\| \leq 1$.

Similarly, we have $\sigma(I - (I + Y^*Y)^{-\frac{1}{2}}) \subset [0, 1 - \frac{1}{\sqrt{1+r^2}}]$. It remains to verify $1 - \frac{1}{\sqrt{1+r^2}} \leq r$, which is easy. ∎

Now we have

Proposition 3.11.

Suppose $\|Y_E\|_{2\to\tilde{l}^2} < 1/C_{\mathcal{H}}$ and $\widetilde{V}_1 Q = (V_1 + V_2 Y_E)(I + Y_E^* Y_E)^{-1/2}$ for some $Q \in \mathbb{U}^K$. We have

\inf\{\|V_1 - \widetilde{V}_1 Q\|_{2\to\infty} : Q \in \mathbb{U}^K\} \leq C_{\mathcal{H}}(\|V_1\|_{2\to\infty} + 1)\|Y_E\|_{2\to\tilde{l}^2}.
Proof.

The condition YE2l~2<1/C\|Y_{E}\|_{2\to\tilde{l}^{2}}<1/C_{\mathcal{H}} ensures Lemma 3.23 applies. We then directly calculate

inf{V1V~1Q2:Q𝕆K}\displaystyle\quad\inf\{\|V_{1}-\widetilde{V}_{1}Q\|_{2\to\infty}:Q\in\mathbb{O}^{K}\}
V1(V1+V2YE)(I+YEYE)122\displaystyle\leq\big{\|}V_{1}-(V_{1}+V_{2}Y_{E})(I+Y_{E}^{*}Y_{E})^{-\frac{1}{2}}\big{\|}_{2\to\infty}
V1(I(I+YEYE)12)2+V2YE(I+YEYE)122\displaystyle\leq\big{\|}V_{1}(I-(I+Y_{E}^{*}Y_{E})^{-\frac{1}{2}})\big{\|}_{2\to\infty}+\big{\|}V_{2}Y_{E}(I+Y_{E}^{*}Y_{E})^{-\frac{1}{2}}\big{\|}_{2\to\infty}
V12YE2l2+V2YE2\displaystyle\leq\|V_{1}\|_{2\to\infty}\|Y_{E}\|_{2\to l^{2}}+\|V_{2}Y_{E}\|_{2\to\infty}
V12YE2l2+CYE2l~2\displaystyle\leq\|V_{1}\|_{2\to\infty}\|Y_{E}\|_{2\to l^{2}}+C_{\mathcal{H}}\|Y_{E}\|_{2\to\tilde{l}^{2}}
C(V12+1)YE2l~2.\displaystyle\leq C_{\mathcal{H}}(\|V_{1}\|_{2\to\infty}+1)\|Y_{E}\|_{2\to\tilde{l}^{2}}. ∎

3.5 Step five: put all pieces together

We combine the previous steps together and prove Theorem 3.2. To this end, we need the following lemma that relates EijHS\|E_{ij}\|_{HS}, i,j{1,2}i,j\in\{1,2\} to EHS\|E\|_{HS}.

Lemma 3.12.

Let

C3=max{C,1+CV12,V12(1+CV12)},C_{3}=\max\Big{\{}C_{\mathcal{H}},1+C_{\mathcal{H}}\|V_{1}\|_{2\to\mathcal{H}},\|V_{1}\|_{2\to\mathcal{H}}(1+C_{\mathcal{H}}\|V_{1}\|_{2\to\mathcal{H}})\Big{\}},

then for any i,j{1,2}i,j\in\{1,2\}

EijHSC3EHS.\|E_{ij}\|_{HS}\leq C_{3}\|E\|_{HS}.
Proof.

Note the fact

ViEVjHSViopEHSVjop.\|V_{i}^{*}EV_{j}\|_{HS}\leq\|V_{i}^{*}\|_{op}\|E\|_{HS}\|V_{j}\|_{op}.

We plug in the bounds from item 3 in Lemma 3.1 to obtain the stated result. ∎

Proof of Theorem 3.2.

Define C1C_{1} as

C1:=1C3min{λKλK+18,12,δ4,δ4C,C3η2η+l(Γ)/2π}.C_{1}:=\frac{1}{C_{3}}\min\Big{\{}\frac{\lambda_{K}-\lambda_{K+1}}{8},\frac{1}{2},\frac{\delta}{4},\frac{\delta}{4C_{\mathcal{H}}},\frac{C_{3}\eta^{2}}{\eta+l(\Gamma)/2\pi}\Big{\}}.

We have EHSC1\|E\|_{HS}\leq C_{1} by assumption. By Lemma 3.12, this assumption implies

EijHSC3EHS<δ4\|E_{ij}\|_{HS}\leq C_{3}\|E\|_{HS}<\frac{\delta}{4}

for all i,j{1,2}i,j\in\{1,2\}. Thus

sE=δE11HSE22HSδ2,\displaystyle s_{E}=\delta-\|E_{11}\|_{HS}-\|E_{22}\|_{HS}\geq\frac{\delta}{2},
E21HSE12HSsE2<14.\displaystyle\frac{\|E_{21}\|_{HS}\|E_{12}\|_{HS}}{s_{E}^{2}}<\frac{1}{4}.

Proposition 3.5 guarantees (3.5) has a solution YEY_{E} and

YEHS4C3δEHS<min{1C,1}.\|Y_{E}\|_{HS}\leq\frac{4C_{3}}{\delta}\|E\|_{HS}<\min\{\frac{1}{C_{\mathcal{H}}},1\}. (3.24)

We check that our choice of C1C_{1} satisfies conditions (3.20) and (3.21), so Proposition 3.9 implies that the invariant subspace from Proposition 3.5 is the leading invariant subspace. Finally, (3.24) implies the conditions of Proposition 3.11 are satisfied, so we have

inf{V1V~1Q2:Q𝕌K}C2EHS\inf\{\|V_{1}-\widetilde{V}_{1}Q\|_{2\to\infty}:Q\in\mathbb{U}^{K}\}\leq C_{2}\|E\|_{HS}

where C2=4C3C(V12+1)/δC_{2}=4C_{3}C_{\mathcal{H}}(\|V_{1}\|_{2\to\infty}+1)/\delta. ∎

4 Application to normalized spectral clustering

In spectral clustering, we start from a subset 𝒳p\mathcal{X}\subset\mathbb{R}^{p}, a probability measure \mathbb{P} on 𝒳\mathcal{X},888Assume the underlying σ\sigma-algebra of \mathbb{P} is the Lebesgue σ\sigma-algebra. and a continuous symmetric positive definite real-valued kernel function k:𝒳×𝒳k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}. After observing samples X1,,XniidX_{1},\dots,X_{n}\overset{\textrm{iid}}{\sim}\mathbb{P}, we construct matrix Knn×nK_{n}\in\mathbb{R}^{n\times n} of their pairwise similarities: Kn=[1nk(Xi,Xj)]i,j=1nK_{n}=\begin{bmatrix}\frac{1}{n}k(X_{i},X_{j})\end{bmatrix}_{i,j=1}^{n}, and then normalize it to obtain the normalized Laplacian matrix

Ln=Dn12KnDn12,L_{n}=D_{n}^{-\frac{1}{2}}K_{n}D_{n}^{-\frac{1}{2}},

where dn=Kn1nd_{n}=K_{n}1_{n} and Dn=diag(dn)D_{n}=\operatorname{diag}(d_{n}) is the degree matrix.999The normalized Laplacian matrix is usually defined as InLnI_{n}-L_{n}, but the eigenvectors of LnL_{n} and those of InLnI_{n}-L_{n} are identical and it is more convenient to study LnL_{n}. It is possible to show that LnL_{n} is symmetric and positive semi-definite, so it has an eigenvalue decomposition. We denote the eigenpairs of LnL_{n} by (λ^k,vk)(\widehat{\lambda}_{k},v_{k}) and sort the eigenvalues in descending order:

λ^1λ^n0.\widehat{\lambda}_{1}\geq\dots\geq\widehat{\lambda}_{n}\geq 0.

In this paper, we normalize the eigenvectors of LnL_{n} so that n12vk2=1n^{-\frac{1}{2}}\|v_{k}\|_{2}=1. The spectral embedding matrix is Vn×KV\in\mathbb{R}^{n\times K} whose columns are v1,,vKv_{1},\dots,v_{K}.
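For concreteness, here is a minimal sketch of these sample-level objects; the Gaussian kernel, the bandwidth, and the function name are illustrative assumptions of ours (any kernel satisfying the assumptions below would do):

```python
# Build K_n = [k(X_i, X_j)/n], the degree vector d_n = K_n 1_n, the normalized
# Laplacian L_n = D_n^{-1/2} K_n D_n^{-1/2}, and the embedding matrix V whose
# columns are the top-K eigenvectors normalized so that n^{-1/2} ||v_k||_2 = 1.
import numpy as np

def spectral_embedding(X, K, bandwidth=1.0):
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Kn = np.exp(-sq / (2 * bandwidth ** 2)) / n
    dn = Kn.sum(axis=1)
    Ln = Kn / np.sqrt(np.outer(dn, dn))
    _, vec = np.linalg.eigh(Ln)          # eigenvalues in ascending order
    V = vec[:, ::-1][:, :K]              # leading K eigenvectors
    return np.sqrt(n) * V

emb = spectral_embedding(np.random.default_rng(2).standard_normal((200, 2)), K=2)
```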

Suppose for now that the kernel kk is bounded away from 0 by a positive number and bounded from above. The operator counterpart of LnL_{n} is the following operator, which can be shown to be a bounded linear operator in (Cb(𝒳))\mathcal{L}(C_{b}(\mathcal{X}))

(T^nf)(x)=𝒳k(x,y)dn(x)1/2dn(y)1/2f(y)𝑑Pn(y),(\widehat{T}_{n}f)(x)=\int_{\mathcal{X}}\frac{k(x,y)}{d_{n}(x)^{1/2}d_{n}(y)^{1/2}}f(y)dP_{n}(y), (4.1)

where dn(x)=𝒳k(x,y)𝑑Pn(y)d_{n}(x)=\int_{\mathcal{X}}k(x,y)dP_{n}(y) is the sample degree function. Although T^n\widehat{T}_{n} is introduced as an operator in (Cb(𝒳))\mathcal{L}(C_{b}(\mathcal{X})), we remark that the defining element of T^n\widehat{T}_{n} is its integral form, and the domain and range spaces need not be restricted to Cb(𝒳)C_{b}(\mathcal{X}). In fact, the actual T^n\widehat{T}_{n} we shall work with is an operator between Hilbert spaces; (Cb(𝒳))\mathcal{L}(C_{b}(\mathcal{X})) is only chosen here for ease of understanding. The same remark also applies to other operators we shall subsequently define.

The operator T^n\widehat{T}_{n} is the operator counterpart of LnL_{n} because ρnT^n=Lnρn\rho_{n}\circ\widehat{T}_{n}=L_{n}\circ\rho_{n}, where ρn:Cb(𝒳)n\rho_{n}:C_{b}(\mathcal{X})\to\mathbb{C}^{n} is the restriction operator defined as

ρnf=[f(X1)f(Xn)]T.\rho_{n}f=\begin{bmatrix}f(X_{1})&\dots&f(X_{n})\end{bmatrix}^{T}.

In other words, if we identify functions fCb(𝒳)f\in C_{b}(\mathcal{X}) with vectors vnv\in\mathbb{C}^{n} by the restriction operator ρn\rho_{n}, T^n\widehat{T}_{n} “behaves as” LnL_{n}. The eigenvalues and eigenvectors (eigenfunctions) of T^n\widehat{T}_{n} and LnL_{n} are also closely related in the following sense.

Lemma 4.1.

Suppose real-valued kernel function k(x,y)k(x,y) is continuous and bounded from below and above: 0<κl<k(x,y)<κu<0<\kappa_{l}<k(x,y)<\kappa_{u}<\infty. Let T^n\widehat{T}_{n} be defined as in (4.1) where the domain space and range space are both Cb(𝒳)C_{b}(\mathcal{X}). If (λ^,f)(\widehat{\lambda},f) is a non-trivial eigenpair of T^n\widehat{T}_{n} (i.e. λ^0\widehat{\lambda}\neq 0), then (λ^,ρnf)(\widehat{\lambda},\rho_{n}f) is an eigenpair of LnL_{n}. Conversely, if (λ^,v)(\widehat{\lambda},v) is an eigenpair of LnL_{n}, then (λ^,f^)(\widehat{\lambda},\widehat{f}), where

f^(x)=1λ^ni=1nk(x,Xi)dn(x)1/2dn(Xi)1/2vi,\widehat{f}(x)=\frac{1}{\widehat{\lambda}n}\sum_{i=1}^{n}\frac{k(x,X_{i})}{d_{n}(x)^{1/2}d_{n}(X_{i})^{1/2}}v_{i}, (4.2)

is an eigenpair of T^n\widehat{T}_{n} with f^Cb(𝒳)\widehat{f}\in C_{b}(\mathcal{X}). Moreover, this choice of f^\widehat{f} is such that f^L2(𝒳,n)=1\|\widehat{f}\|_{L^{2}(\mathcal{X},\mathbb{P}_{n})}=1 and the restriction of f^\widehat{f} onto sample points agrees with vv, i.e. ρnf^=v\rho_{n}\widehat{f}=v.

Proof.

Let (λ^,f^)(\widehat{\lambda},\widehat{f}) be an eigenpair of T^n\widehat{T}_{n}: T^nf^=λ^f^\widehat{T}_{n}\widehat{f}=\widehat{\lambda}\widehat{f}. We check that (λ^,ρnf^)(\widehat{\lambda},\rho_{n}\widehat{f}) is an eigenpair of LnL_{n}:

Lnρnf^=ρnT^nf^=ρnλ^f^=λ^ρnf^.L_{n}\rho_{n}\widehat{f}=\rho_{n}\widehat{T}_{n}\widehat{f}=\rho_{n}\widehat{\lambda}\widehat{f}=\widehat{\lambda}\rho_{n}\widehat{f}.

Conversely, if (λ^,v)(\widehat{\lambda},v) is an eigenpair of LnL_{n}, we check that (λ^,f^)(\widehat{\lambda},\widehat{f}) is an eigenpair of T^n\widehat{T}_{n}:

(T^nf^)(x)\displaystyle(\widehat{T}_{n}\widehat{f})(x) =1ni=1nk(x,Xi)dn(x)1/2dn(Xi)1/2{1λ^nj=1nk(Xi,Xj)dn(Xi)1/2dn(Xj)1/2vj}\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\frac{k(x,X_{i})}{d_{n}(x)^{1/2}d_{n}(X_{i})^{1/2}}\left\{\textstyle\frac{1}{\widehat{\lambda}n}\sum_{j=1}^{n}\frac{k(X_{i},X_{j})}{d_{n}(X_{i})^{1/2}d_{n}(X_{j})^{1/2}}v_{j}\right\}
=1ni=1nk(x,Xi)dn(x)1/2dn(Xi)1/2{1λ^[Lnv]i}\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\frac{k(x,X_{i})}{d_{n}(x)^{1/2}d_{n}(X_{i})^{1/2}}\left\{\textstyle\frac{1}{\widehat{\lambda}}[L_{n}v]_{i}\right\}
=1nj=1nk(x,Xj)dn(x)1/2dn(Xj)1/2vj\displaystyle=\frac{1}{n}\sum_{j=1}^{n}\frac{k(x,X_{j})}{d_{n}(x)^{1/2}d_{n}(X_{j})^{1/2}}v_{j}
=(λ^f^)(x).\displaystyle=(\widehat{\lambda}\widehat{f})(x).

It remains to check f^\widehat{f} is indeed in Cb(𝒳)C_{b}(\mathcal{X}). To this end, note that since the kernel function kk is continuous and bounded from above, we know k(,Xi)Cb(𝒳)k(\cdot,X_{i})\in C_{b}(\mathcal{X}). Since kk is bounded from below, we know dn(x)d_{n}(x) is continuous and dn(x)>κld_{n}(x)>\kappa_{l}, so k(,Xi)/(dn()1/2dn(Xi)1/2)Cb(𝒳)k(\cdot,X_{i})/(d_{n}(\cdot)^{1/2}d_{n}(X_{i})^{1/2})\in C_{b}(\mathcal{X}). Thus f^\widehat{f}, a linear combination of such terms, is also in Cb(𝒳)C_{b}(\mathcal{X}). ∎
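The extension (4.2) is straightforward to implement. The sketch below is illustrative (the function names are ours, and `kernel` stands for any kernel satisfying the assumptions of Lemma 4.1):

```python
# Nystrom-style extension of an eigenvector v of L_n (with eigenvalue lam != 0)
# to a function on all of X, following (4.2); restricting the result to the
# sample points recovers v.
import numpy as np

def extend_eigenvector(x, X, v, lam, kernel):
    n = len(X)
    dn = lambda z: np.mean([kernel(z, Xi) for Xi in X])   # sample degree d_n(z)
    dX = np.array([dn(Xi) for Xi in X])
    kx = np.array([kernel(x, Xi) for Xi in X])
    return kx @ (v / np.sqrt(dX)) / (lam * n * np.sqrt(dn(x)))
```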

The population version of T^n\widehat{T}_{n} is the normalized Laplacian operator

T:Cb(𝒳)Cb(𝒳) defined as (Tf)(x)=𝒳k(x,y)d(x)1/2d(y)1/2f(y)𝑑P(y),T:C_{b}(\mathcal{X})\to C_{b}(\mathcal{X})\text{ defined as }(Tf)(x)=\int_{\mathcal{X}}\frac{k(x,y)}{d(x)^{1/2}d(y)^{1/2}}f(y)dP(y),

where d(x)=𝒳k(x,y)𝑑P(y)d(x)=\int_{\mathcal{X}}k(x,y)dP(y) is the (population) degree function. Under appropriate assumptions, it can be shown that we can choose {fi}i=1K\{f_{i}\}_{i=1}^{K}, the top KK eigenfunctions of TT, to be real-valued and orthonormal in L2(𝒳,P)L^{2}(\mathcal{X},P). We can thus define V1:KCb(X)V_{1}:\mathbb{C}^{K}\to C_{b}(X) as V1α=i=1KαifiV_{1}\alpha=\sum_{i=1}^{K}\alpha_{i}f_{i}. We can similarly define V^1\widehat{V}_{1} with {f^k}k=1K\{\widehat{f}_{k}\}_{k=1}^{K}, the extension of top KK eigenvectors of LnL_{n} according to (4.2). Our goal in this section is to apply our general theory to prove the following result.

Theorem 4.2.

Under the general assumptions defined below, there exist constants C4,C5C_{4},C_{5} that are determined by 𝒳,,k\mathcal{X},\mathbb{P},k such that whenever the sample size nC4τn\geq C_{4}\tau for some τ>1\tau>1, we have with confidence 18eτ1-8e^{-\tau}

inf{V1V^1Q2:Q𝕌K}C5τn.\inf\{\|V_{1}-\widehat{V}_{1}Q\|_{2\to\infty}:Q\in\mathbb{U}^{K}\}\leq C_{5}\frac{\sqrt{\tau}}{\sqrt{n}}.

The general assumptions referred to in Theorem 4.2 are

General Assumptions.

The set 𝒳\mathcal{X} is a bounded connected open set in p\mathbb{R}^{p} with a nice boundary.101010We need the boundary to be quasi-resolved [6] for inequality (4.10) and CC^{\infty} for Lemma A.2 [15, 11]. We also need 𝒳\mathcal{X} to satisfy the cone condition [6]. We omit the definitions of these conditions because they are very technical and not relevant to the main story of the paper. The probability measure \mathbb{P} is defined with respect to Lebesgue measure and admits a density function p(x)p(x). Moreover, there exist constants 0<pl<pu<0<p_{l}<p_{u}<\infty such that pl<p(x)<pup_{l}<p(x)<p_{u} almost surely with respect to the Lebesgue measure. The kernel k(,)Cbp+2(𝒳×𝒳)k(\cdot,\cdot)\in C_{b}^{p+2}(\mathcal{X}\times\mathcal{X}) is symmetric, positive definite, and there exist constants 0<κl<κu<0<\kappa_{l}<\kappa_{u}<\infty such that κl<k(x,y)<κu\kappa_{l}<k(x,y)<\kappa_{u} for all x,y𝒳x,y\in\mathcal{X}. Treated as an operator from L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}) to L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}), the eigenvalues of TT satisfy λ1λK>λK+10\lambda_{1}\geq\ldots\geq\lambda_{K}>\lambda_{K+1}\geq\ldots\geq 0. The top KK eigenfunctions of TT satisfy {fi}i=1KCbp+2(𝒳)\{f_{i}\}_{i=1}^{K}\subset C_{b}^{p+2}(\mathcal{X}).111111The function space Cbp+2(𝒳)C_{b}^{p+2}(\mathcal{X}) is defined in Section 4.2.
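In practice, the left-hand side of the bound in Theorem 4.2 can be estimated by aligning the two embeddings with an explicit unitary matrix. A minimal sketch of such a helper follows (our own illustration, not part of the theorem; orthogonal Procrustes minimizes the Frobenius norm rather than the two-to-infinity norm, so the resulting Q is only a convenient surrogate for the infimum):

```python
# Align V_hat to V with the Procrustes-optimal unitary Q, then report the
# two-to-infinity norm of the difference, i.e. the worst row-wise 2-norm error.
import numpy as np

def two_to_infty_error(V, V_hat):
    U, _, Wh = np.linalg.svd(V_hat.conj().T @ V)
    Q = U @ Wh                                  # unitary minimizing ||V - V_hat Q||_F
    return np.max(np.linalg.norm(V - V_hat @ Q, axis=1))
```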

4.1 Overview of the proof

The most challenging parts in applying the general theory are to identify the correct Hilbert space \mathcal{H} to work with, and to show that TT^nT-\widehat{T}_{n}, as an operator from \mathcal{H} to \mathcal{H}, has Hilbert-Schmidt norm tending to zero as nn goes to infinity. It turns out that under the general assumptions, we may set \mathcal{H} to be a Sobolev space of sufficiently high order. As for bounding TT^nHS\|T-\widehat{T}_{n}\|_{HS}, we first decompose T,T^nT,\widehat{T}_{n} as products of three operators. Let us define

D1/2: as (D1/2f)(x)=f(x)/d(x),\displaystyle D^{-1/2}:\mathcal{H}\to\mathcal{H}\text{ as }(D^{-1/2}f)(x)=f(x)/\sqrt{d(x)}, (4.3)
K: as (Kf)(x)=𝒳k(x,y)f(y)𝑑P(y).\displaystyle K:\mathcal{H}\to\mathcal{H}\text{ as }(Kf)(x)=\int_{\mathcal{X}}k(x,y)f(y)dP(y). (4.4)

Then T=D1/2KD1/2T=D^{-1/2}KD^{-1/2}. Similarly, we have T^n=Dn1/2KnDn1/2\widehat{T}_{n}=D_{n}^{-1/2}K_{n}D_{n}^{-1/2} where Dn1/2,KnD_{n}^{-1/2},K_{n} are the sample-level versions of D1/2D^{-1/2} and KK defined using dnd_{n} and n\mathbb{P}_{n}. We shall establish the concentration of KnK_{n} to KK and Dn1/2D_{n}^{-1/2} to D1/2D^{-1/2} and invoke the triangle inequality to bound TT^nHS\|T-\widehat{T}_{n}\|_{HS}.

Although the general theory does all the heavy lifting, there is one additional step we must take to finish the full proof of Theorem 4.2. In our general theory, V~1\widetilde{V}_{1} has columns orthonormal in L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}) that span the leading invariant space of T~\widetilde{T}. In Theorem 4.2, however, the same leading invariant space is spanned by the columns of V^1\widehat{V}_{1}, which are only orthonormal in L2(𝒳,n)L^{2}(\mathcal{X},\mathbb{P}_{n}). Morally speaking, when nn is large, V^1\widehat{V}_{1} and V~1\widetilde{V}_{1} are roughly the same up to some unitary transformation, so switching from V~1\widetilde{V}_{1} to V^1\widehat{V}_{1} should not inflate the consistency error by any order of magnitude. The exact error bound shall be obtained through a uniform law of large numbers.

The rigorous treatment shall be presented in five parts. In part one, we introduce the Sobolev space we work with and lay out its basic properties. In part two, we bound the norm of operator differences such as Dn1/2D1/2\|D_{n}^{-1/2}-D^{-1/2}\|_{\mathcal{H}\to\mathcal{H}} and KKnHS\|K-K_{n}\|_{HS} and express TT^nHS\|T-\widehat{T}_{n}\|_{HS} in terms of them. In part three, we invoke concentration results in Hilbert spaces and relate the norm of operator differences to sample size. In part four, we check the remaining conditions required by our general theory and combine all previous pieces together. In part five, we deal with the error induced by the difference of V^1\widehat{V}_{1} and V~1\widetilde{V}_{1} and complete the proof.

4.2 Part one: The Sobolev space s\mathcal{H}^{s}

First recall that by assumption 𝒳\mathcal{X} is a bounded connected open subset of p\mathbb{R}^{p} with a nice boundary. Given ss\in\mathbb{N}, the Sobolev space s=s(𝒳)\mathcal{H}^{s}=\mathcal{H}^{s}(\mathcal{X}) of order ss is defined as

s:={fL2(𝒳,dx)|DαfL2(𝒳,dx),|α|=s},\mathcal{H}^{s}\;:=\;\{f\in L^{2}(\mathcal{X},dx)\;|\;D^{\alpha}f\in L^{2}(\mathcal{X},dx),\forall|\alpha|=s\},

where DαfD^{\alpha}f is the (weak) derivative of ff with respect to the multi-index α\alpha and L2(𝒳,dx)L^{2}(\mathcal{X},dx) is the complex Hilbert space of complex-valued functions square integrable under Lebesgue measure. The space s\mathcal{H}^{s} is a separable Hilbert space with respect to the inner product

f,gs=f,gL2(𝒳,dx)+|α|=sDαf,DαgL2(𝒳,dx).\langle f,g\rangle_{\mathcal{H}^{s}}=\langle f,g\rangle_{L^{2}(\mathcal{X},dx)}+\sum_{|\alpha|=s}\langle D^{\alpha}f,D^{\alpha}g\rangle_{L^{2}(\mathcal{X},dx)}.

Let Cbs(𝒳)C_{b}^{s}(\mathcal{X}) be the set of complex-valued continuous bounded functions such that all the derivatives up to order ss exist and are continuous bounded functions. The space Cbs(𝒳)C_{b}^{s}(\mathcal{X}) is a Banach space with respect to the norm

fCbs=f+|α|=sDαf.\|f\|_{C_{b}^{s}}=\|f\|_{\infty}+\sum_{|\alpha|=s}\|D^{\alpha}f\|_{\infty}.

Since 𝒳\mathcal{X} is bounded, we know Cbs(𝒳)sC_{b}^{s}(\mathcal{X})\subset\mathcal{H}^{s} and fsCsfCbs\|f\|_{\mathcal{H}^{s}}\leq C_{s}\|f\|_{C_{b}^{s}} where CsC_{s} is a constant depending only on ss and the Lebesgue measure of 𝒳\mathcal{X}. We also know from the Sobolev embedding theorem (see Chapter 4.6 of Burenkov [6]) that for l,ml,m\in\mathbb{N} with lm>p/2l-m>p/2, we have

lCbm(𝒳)fCbmCm,lfl\mathcal{H}^{l}\subset C_{b}^{m}(\mathcal{X})\qquad\|f\|_{C_{b}^{m}}\leq C_{m,l}\|f\|_{\mathcal{H}^{l}} (4.5)

where Cm,lC_{m,l} is a constant depending only on mm and ll.

Taking l=s=p/2+1l=s=\lfloor p/2\rfloor+1 and m=0m=0, we see

Cbs(𝒳)sCb(𝒳),C_{b}^{s}(\mathcal{X})\subset\mathcal{H}^{s}\subset C_{b}(\mathcal{X}),

with fC6fs\|f\|_{\infty}\leq C_{6}\|f\|_{\mathcal{H}^{s}} for all fsf\in\mathcal{H}^{s} for some constant C6C_{6}. This norm relationship implies that s\mathcal{H}^{s} is an RKHS with a bounded reproducing kernel s(,)s(\cdot,\cdot).

4.3 Part two: bounds on operator differences

Similar to (4.3), we define multiplication operators

D1/2:ss as (D1/2f)(x)=d(x)f(x),\displaystyle D^{1/2}:\mathcal{H}^{s}\to\mathcal{H}^{s}\text{ as }(D^{1/2}f)(x)=\sqrt{d(x)}f(x), (4.6)
D:ss as (Df)(x)=d(x)f(x).\displaystyle D:\mathcal{H}^{s}\to\mathcal{H}^{s}\text{ as }(Df)(x)=d(x)f(x). (4.7)

In this subsection, we show D1/2,D1/2,D,Dn1/2,Dn1/2,Dn(s)D^{1/2},D^{-1/2},D,D_{n}^{1/2},D_{n}^{-1/2},D_{n}\in\mathcal{L}(\mathcal{H}^{s}) and their operator norms are appropriately bounded, that is

Lemma 4.3.

Under the general assumptions, all the following operators are bounded linear operators in (s)\mathcal{L}(\mathcal{H}^{s}), and there exists a suitable constant C7>0C_{7}>0 such that

D1/2ss,D1/2ss,Dn1/2ss,Dn1/2ssC7\displaystyle\|D^{1/2}\|_{\mathcal{H}^{s}\to\mathcal{H}^{s}},\|D^{-1/2}\|_{\mathcal{H}^{s}\to\mathcal{H}^{s}},\|D_{n}^{1/2}\|_{\mathcal{H}^{s}\to\mathcal{H}^{s}},\|D_{n}^{-1/2}\|_{\mathcal{H}^{s}\to\mathcal{H}^{s}}\leq C_{7} (4.8)
(D1/2+Dn1/2)1ssC7,DDnssC7ddnp+2\displaystyle\|(D^{1/2}+D_{n}^{1/2})^{-1}\|_{\mathcal{H}^{s}\to\mathcal{H}^{s}}\leq C_{7},\quad\|D-D_{n}\|_{\mathcal{H}^{s}\to\mathcal{H}^{s}}\leq C_{7}\|d-d_{n}\|_{\mathcal{H}^{p+2}} (4.9)
Proof.

Let C8=kCbp+2(𝒳×𝒳)C_{8}=\|k\|_{C_{b}^{p+2}(\mathcal{X}\times\mathcal{X})}. For any x𝒳x\in\mathcal{X}, clearly kx:=k(,x)Cbp+2(𝒳)k_{x}:=k(\cdot,x)\in C_{b}^{p+2}(\mathcal{X}) with kxCbp+2C8\|k_{x}\|_{C_{b}^{p+2}}\leq C_{8}. Since dd and dnd_{n} are weighted averages of the functions kxk_{x}, it follows that

dCbp+2,dnCbp+2C8.\|d\|_{C_{b}^{p+2}},\|d_{n}\|_{C_{b}^{p+2}}\leq C_{8}.

Since d,dnd,d_{n} inherit the κl,κu\kappa_{l},\kappa_{u} pointwise bound from k(,)k(\cdot,\cdot), we know d1/2,dn1/2,d1/2,dn1/2Cbp+2(𝒳)d^{1/2},d_{n}^{1/2},d^{-1/2},d_{n}^{-1/2}\in C_{b}^{p+2}(\mathcal{X}) with

d1/2Cbp+2,dn1/2Cbp+2,d1/2Cbp+2,dn1/2Cbp+2C9.\|d^{1/2}\|_{C_{b}^{p+2}},\|d_{n}^{1/2}\|_{C_{b}^{p+2}},\|d^{-1/2}\|_{C_{b}^{p+2}},\|d_{n}^{-1/2}\|_{C_{b}^{p+2}}\leq C_{9}.

Next, we know from Lemma 15 of Chapter 4 of Burenkov [6] that for gCbs(𝒳)g\in C_{b}^{s}(\mathcal{X}) and fsf\in\mathcal{H}^{s}, we have gfsgf\in\mathcal{H}^{s} and

gfsgCbsfs.\|gf\|_{\mathcal{H}^{s}}\leq\|g\|_{C_{b}^{s}}\|f\|_{\mathcal{H}^{s}}. (4.10)

We can use this inequality to prove D1/2,D1/2,D,Dn1/2,Dn1/2,Dn(s)D^{1/2},D^{-1/2},D,D_{n}^{1/2},D_{n}^{-1/2},D_{n}\in\mathcal{L}(\mathcal{H}^{s}) and bound their operator norm. For example, plugging in g=d1/2,dn1/2,d1/2,dn1/2g=d^{1/2},d_{n}^{1/2},d^{-1/2},d_{n}^{-1/2} into (4.10), and noticing for these choices of gg, gCbsgCbp+2\|g\|_{C_{b}^{s}}\leq\|g\|_{C_{b}^{p+2}} because p+2>sp+2>s, we conclude

D1/2,D1/2,Dn1/2,Dn1/2C9.\|D^{1/2}\|,\|D^{-1/2}\|,\|D_{n}^{1/2}\|,\|D_{n}^{-1/2}\|\leq C_{9}.

Note that by the embedding theorem, p+2\mathcal{H}^{p+2} can be embedded into Cbs(𝒳)C_{b}^{s}(\mathcal{X}), so plugging in ddnCbp+2(𝒳)p+2d-d_{n}\in C_{b}^{p+2}(\mathcal{X})\subset\mathcal{H}^{p+2}, we see

DDnddnCbsC10ddnp+2.\|D-D_{n}\|\leq\|d-d_{n}\|_{C_{b}^{s}}\leq C_{10}\|d-d_{n}\|_{\mathcal{H}^{p+2}}.

For the bound on (D1/2+Dn1/2)1\|(D^{1/2}+D_{n}^{1/2})^{-1}\|, we follow essentially the same route. We first bound d1/2+dn1/2Cbp+2\|d^{1/2}+d_{n}^{1/2}\|_{C_{b}^{p+2}}, then argue that d1/2+dn1/2d^{1/2}+d_{n}^{1/2} has pointwise lower and upper bounds. It then follows that (d1/2+dn1/2)1Cbp+2C11\|(d^{1/2}+d_{n}^{1/2})^{-1}\|_{C_{b}^{p+2}}\leq C_{11}, and we see via (4.10) that (D1/2+Dn1/2)1ssC11\|(D^{1/2}+D_{n}^{1/2})^{-1}\|_{\mathcal{H}^{s}\to\mathcal{H}^{s}}\leq C_{11}. Taking C7C_{7} as the maximum of C9C_{9}, C10C_{10}, and C11C_{11} completes the proof. ∎

Lemma 4.4.

Under the general assumptions, we have

TT^nHSC12((KnHS+KHS)ddnp+2+KKnHS).\|T-\widehat{T}_{n}\|_{HS}\leq C_{12}\Big{(}\big{(}\|K_{n}\|_{HS}+\|K\|_{HS}\big{)}\|d-d_{n}\|_{\mathcal{H}^{p+2}}+\|K-K_{n}\|_{HS}\Big{)}.
Proof.

First note

D1/2Dn1/2\displaystyle\quad D^{-1/2}-D_{n}^{-1/2}
=Dn1/2(Dn1/2D1/2)D1/2\displaystyle=D_{n}^{-1/2}(D_{n}^{1/2}-D^{1/2})D^{-1/2}
=Dn1/2(DnD)(Dn1/2+D1/2)1D1/2.\displaystyle=D_{n}^{-1/2}(D_{n}-D)(D_{n}^{1/2}+D^{1/2})^{-1}D^{-1/2}.

Applying the bounds from Lemma 4.3, we see

D1/2Dn1/2C74ddnp+2.\|D^{-1/2}-D_{n}^{-1/2}\|\leq C_{7}^{4}\|d-d_{n}\|_{\mathcal{H}^{p+2}}.

We also have decomposition

D1/2KD1/2Dn1/2KnDn1/2\displaystyle\quad D^{-1/2}KD^{-1/2}-D_{n}^{-1/2}K_{n}D_{n}^{-1/2}
=D1/2K(D1/2Dn1/2)+(D1/2KDn1/2Kn)Dn1/2\displaystyle=D^{-1/2}K(D^{-1/2}-D_{n}^{-1/2})+(D^{-1/2}K-D_{n}^{-1/2}K_{n})D_{n}^{-1/2}
=D1/2K(D1/2Dn1/2)+D1/2(KKn)Dn1/2+(D1/2Dn1/2)KnDn1/2.\displaystyle=D^{-1/2}K(D^{-1/2}-D_{n}^{-1/2})+D^{-1/2}(K-K_{n})D_{n}^{-1/2}+(D^{-1/2}-D_{n}^{-1/2})K_{n}D_{n}^{-1/2}.

Taking the Hilbert-Schmidt norm on both sides, we have121212We have not yet shown that K,KnK,K_{n} are Hilbert-Schmidt operators from s\mathcal{H}^{s} to s\mathcal{H}^{s}; this is shown in Lemma 4.6.

TT^nHS\displaystyle\quad\|T-\widehat{T}_{n}\|_{HS}
C75KHSddnp+2+C72KKnHS+C75KnHSddnp+2\displaystyle\leq C_{7}^{5}\|K\|_{HS}\|d-d_{n}\|_{\mathcal{H}^{p+2}}+C_{7}^{2}\|K-K_{n}\|_{HS}+C_{7}^{5}\|K_{n}\|_{HS}\|d-d_{n}\|_{\mathcal{H}^{p+2}}
C12((KnHS+KHS)ddnp+2+KKnHS)\displaystyle\leq C_{12}\Big{(}\big{(}\|K_{n}\|_{HS}+\|K\|_{HS}\big{)}\|d-d_{n}\|_{\mathcal{H}^{p+2}}+\|K-K_{n}\|_{HS}\Big{)}

for some appropriately chosen C12C_{12}. ∎
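The first display in the proof is an algebraic identity for the (commuting) multiplication operators; here is a quick numerical check with diagonal stand-ins (the values are arbitrary choices of ours):

```python
# Check 1/sqrt(d) - 1/sqrt(dn) = (dn - d) / (sqrt(dn) (sqrt(dn) + sqrt(d)) sqrt(d))
# entrywise, which mirrors the operator identity
#   D^{-1/2} - D_n^{-1/2} = D_n^{-1/2} (D_n - D) (D_n^{1/2} + D^{1/2})^{-1} D^{-1/2}.
import numpy as np

rng = np.random.default_rng(3)
d, dn = rng.uniform(1, 2, 10), rng.uniform(1, 2, 10)
lhs = 1 / np.sqrt(d) - 1 / np.sqrt(dn)
rhs = (dn - d) / (np.sqrt(dn) * (np.sqrt(dn) + np.sqrt(d)) * np.sqrt(d))
print(np.allclose(lhs, rhs))   # True
```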

4.4 Part three: concentration in Hilbert space

In this subsection, we show that K,KnK,K_{n} are both Hilbert-Schmidt operators from s\mathcal{H}^{s} to s\mathcal{H}^{s} and establish some concentration results regarding ddnp+2\|d-d_{n}\|_{\mathcal{H}^{p+2}} and KKnHS\|K-K_{n}\|_{HS}. With these results and Lemma 4.4, we will be able to bound TT^nHS\|T-\widehat{T}_{n}\|_{HS}. The required concentration bounds are obtained through the following lemma on concentration in a separable (complex) Hilbert space (see section 2.4 of Rosasco, Belkin and Vito [29]).

Lemma 4.5.

Let ξ1,,ξn\xi_{1},\ldots,\xi_{n} be zero mean independent random variables with values in a separable (complex) Hilbert space \mathcal{H} such that ξiC\|\xi_{i}\|_{\mathcal{H}}\leq C for all i[n]i\in[n]. Then with probability at least 12eτ1-2e^{-\tau}, we have

1ni=1nξiC2τn.\big{\|}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\big{\|}_{\mathcal{H}}\leq\frac{C\sqrt{2\tau}}{\sqrt{n}}.
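A Monte Carlo sketch of Lemma 4.5 with ℋ = ℝ^d (the dimensions, sample sizes, and distribution are illustrative assumptions of ours): the empirical frequency with which the bound holds should be at least about 1 − 2e^{−τ}.

```python
# Zero-mean vectors of norm at most C = 1; check how often the norm of the
# average stays below C sqrt(2 tau / n).
import numpy as np

rng = np.random.default_rng(4)
d, n, trials, tau = 20, 2000, 500, 3.0
xi = rng.uniform(-1, 1, (trials, n, d))
xi /= np.maximum(np.linalg.norm(xi, axis=-1, keepdims=True), 1.0)  # ||xi_i|| <= 1
norms = np.linalg.norm(xi.mean(axis=1), axis=-1)
print((norms <= np.sqrt(2 * tau / n)).mean())   # close to 1, cf. 1 - 2e^{-tau} ~ 0.90
```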

With this lemma, we can show

Lemma 4.6.

Under the general assumptions, the following facts hold true:

  1. 1.

    For some constant C13C_{13}, with confidence 12eτ1-2e^{-\tau}

    ddnp+2C13τn.\|d-d_{n}\|_{\mathcal{H}^{p+2}}\leq C_{13}\frac{\sqrt{\tau}}{\sqrt{n}}.
  2. 2.

    Both KK and KnK_{n} are Hilbert-Schmidt operators from s\mathcal{H}^{s} to s\mathcal{H}^{s}, and there exists a constant C14C_{14} independent of nn such that KnHS,KHSC14\|K_{n}\|_{HS},\|K\|_{HS}\leq C_{14}.

  3. 3.

    For some constant C15C_{15}, with confidence 12eτ1-2e^{-\tau}

    KKnHSC15τn.\|K-K_{n}\|_{HS}\leq C_{15}\frac{\sqrt{\tau}}{\sqrt{n}}.
Proof.

For item 1, consider random variables ξi=k(,Xi)dp+2\xi_{i}=k(\cdot,X_{i})-d\in\mathcal{H}^{p+2} for i[n]i\in[n]. They are clearly zero mean. From the proof of Lemma 4.3, we see d,k(,Xi)Cbp+2(𝒳)d,k(\cdot,X_{i})\in C_{b}^{p+2}(\mathcal{X}). We thus have

ξip+2k(,Xi)p+2+dp+2C𝒳(k(,Xi)Cbp+2+dCbp+2)\|\xi_{i}\|_{\mathcal{H}^{p+2}}\leq\|k(\cdot,X_{i})\|_{\mathcal{H}^{p+2}}+\|d\|_{\mathcal{H}^{p+2}}\leq C_{\mathcal{X}}\big{(}\|k(\cdot,X_{i})\|_{C_{b}^{p+2}}+\|d\|_{C_{b}^{p+2}}\big{)}

where C𝒳C_{\mathcal{X}} is some constant depending on the Lebesgue measure of the bounded set 𝒳\mathcal{X}. This shows that the ξi\xi_{i} are bounded. Since p+2\mathcal{H}^{p+2} is a separable Hilbert space, we apply Lemma 4.5 and conclude that we have with probability 12eτ1-2e^{-\tau}

ddnp+2C13τn.\|d-d_{n}\|_{\mathcal{H}^{p+2}}\leq C_{13}\frac{\sqrt{\tau}}{\sqrt{n}}.

For item 2, let us fix any x𝒳x\in\mathcal{X} and consider the operator ,sxskx\langle\cdot,s_{x}\rangle_{\mathcal{H}^{s}}k_{x}, where sx:=s(,x)s_{x}:=s(\cdot,x). This operator is in fact a Hilbert-Schmidt operator from s\mathcal{H}^{s} to s\mathcal{H}^{s}. To see this, note that ,sxskxHS=sxskxs\|\langle\cdot,s_{x}\rangle_{\mathcal{H}^{s}}k_{x}\|_{HS}=\|s_{x}\|_{\mathcal{H}^{s}}\|k_{x}\|_{\mathcal{H}^{s}}. With the same reasoning used for proving item 1, we see kxs\|k_{x}\|_{\mathcal{H}^{s}} has a bound uniform in x𝒳x\in\mathcal{X}. It remains to show sxs\|s_{x}\|_{\mathcal{H}^{s}} has a uniform bound. Let δx:s\delta_{x}:\mathcal{H}^{s}\to\mathbb{C} be the evaluation functional, i.e. δx(f)=f(x)\delta_{x}(f)=f(x). We know from the embedding theorem that δxopC6\|\delta_{x}\|_{op}\leq C_{6} for all x𝒳x\in\mathcal{X}. But sxs_{x} also induces this point evaluation functional, so by the Riesz representation theorem, sxs=δxopC6\|s_{x}\|_{\mathcal{H}^{s}}=\|\delta_{x}\|_{op}\leq C_{6}. Hence for some C14C_{14}, ,sxskxHSC14\|\langle\cdot,s_{x}\rangle_{\mathcal{H}^{s}}k_{x}\|_{HS}\leq C_{14} for all x𝒳x\in\mathcal{X}. Now let xx be random. By the reproducing property f,sXs=f(X)\langle f,s_{X}\rangle_{\mathcal{H}^{s}}=f(X), we have K=𝔼,sXiskXiK=\mathbb{E}\langle\cdot,s_{X_{i}}\rangle_{\mathcal{H}^{s}}k_{X_{i}}, so KHS=𝔼,sXiskXiHS𝔼,sXiskXiHSC14\|K\|_{HS}=\|\mathbb{E}\langle\cdot,s_{X_{i}}\rangle_{\mathcal{H}^{s}}k_{X_{i}}\|_{HS}\leq\mathbb{E}\|\langle\cdot,s_{X_{i}}\rangle_{\mathcal{H}^{s}}k_{X_{i}}\|_{HS}\leq C_{14}, i.e. KK is Hilbert-Schmidt. By the same reasoning, we see the claim for KnK_{n} in item 2 is also true.

For item 3, consider random variables ωi:=,sXiskXiK\omega_{i}:=\langle\cdot,s_{X_{i}}\rangle_{\mathcal{H}^{s}}k_{X_{i}}-K. We know from item 2 that ωiHS(s)\omega_{i}\in HS(\mathcal{H}^{s}). Since s\mathcal{H}^{s} is separable, the Hilbert space HS(s)HS(\mathcal{H}^{s}) is also separable. We also know ωi\omega_{i} is zero mean and ωiHS2C14\|\omega_{i}\|_{HS}\leq 2C_{14} is bounded. We can thus apply Lemma 4.5 and conclude that we have with probability 12eτ1-2e^{-\tau}

KKnHSC15τn\|K-K_{n}\|_{HS}\leq C_{15}\frac{\sqrt{\tau}}{\sqrt{n}}

for C15:=2C14C_{15}:=2C_{14}. ∎

Combining Lemmas 4.4 and 4.6, we obtain the result we want.

Proposition 4.7.

Under the general assumptions, with probability 14eτ1-4e^{-\tau}, we have

TT^nHSC16τn\|T-\widehat{T}_{n}\|_{HS}\leq C_{16}\frac{\sqrt{\tau}}{\sqrt{n}}

for some constant C16C_{16}.

Proof.

A union bound over the events in items 1 and 3 of Lemma 4.6, combined with the uniform bound on KHS,KnHS\|K\|_{HS},\|K_{n}\|_{HS} from item 2 and a direct application of Lemma 4.4, suffices for the proof. ∎

4.5 Part four: checking conditions for the general theory

In the first three paragraphs of Section 3, we laid out the conditions that must be satisfied for our general theory to apply. We have already checked most of them implicitly in the previous three subsections, but for completeness, we summarize all such conditions here and prove them.

Lemma 4.8.

Under the general assumptions, the following facts hold true:

  1. 1.

    The Sobolev space HsH^{s} is a subspace of L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}).

  2. 2.

    The s\mathcal{H}^{s} norm s\|\cdot\|_{\mathcal{H}^{s}} is stronger than the infinity norm.

  3. 3.

    Both T,T^nT,\widehat{T}_{n} are Hilbert-Schmidt from s\mathcal{H}^{s} to s\mathcal{H}^{s}.

  4. 4.

    All eigenvalues of TT (counting multiplicity) can be arranged in a decreasing (possibly infinite) sequence of non-negative real numbers λ1λ2λK>λK+10\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{K}>\lambda_{K+1}\geq\ldots\geq 0 with a positive gap between λK\lambda_{K} and λK+1\lambda_{K+1}.

  5. 5.

    The top KK eigenfunctions {fi}i=1Ks\{f_{i}\}_{i=1}^{K}\subset\mathcal{H}^{s} and can be picked to form an orthonormal set of functions in L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}).

  6. 6.

    T^n\widehat{T}_{n} has a sequence of non-increasing, real, non-negative eigenvalues.

Proof.

For item 1, this is because s\mathcal{H}^{s} is a subspace of L2(𝒳,dx)L^{2}(\mathcal{X},dx) and, under our assumptions on 𝒳\mathcal{X} and \mathbb{P}, L2(𝒳,dx)L^{2}(\mathcal{X},dx) and L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}) are the same space. First of all, since the underlying σ\sigma-algebra of \mathbb{P} is the Lebesgue σ\sigma-algebra, the sets of measurable functions coincide. If fL2(𝒳,dx)f\in L^{2}(\mathcal{X},dx), then ff is also in L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}) because 𝒳ff¯𝑑P=𝒳ff¯p(x)𝑑xpu𝒳ff¯𝑑x<\int_{\mathcal{X}}f\bar{f}dP=\int_{\mathcal{X}}f\bar{f}p(x)dx\leq p_{u}\int_{\mathcal{X}}f\bar{f}dx<\infty. The converse is also true: our assumptions ensure that the Lebesgue measure is absolutely continuous with respect to \mathbb{P} with density 1/p(x)1/p(x) a.s., and since 1/p(x)<1/pl1/p(x)<1/p_{l}, the same argument applies.

Item 2 is a consequence of the Sobolev embedding theorem and has been used time and again in the previous subsections. Item 3 is the joint consequence of Lemmas 4.3 and 4.6.

For item 4 and 5, we first show TT as an operator from L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}) to L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}) is positive, self-adjoint, and Hilbert-Schmidt. Let h(x,y):=k(x,y)/d(x)d(y)h(x,y):=k(x,y)/\sqrt{d(x)d(y)} denote the normalized kernel. The self-adjointness is due to the (conjugate) symmetry h(,)h(\cdot,\cdot) inherited from k(,)k(\cdot,\cdot):

f,Tg=f(x)(h(x,y)¯g(y)¯𝑑P(y))𝑑P(x)=h(y,x)f(x)g(y)¯𝑑P(x)𝑑P(y),\displaystyle\langle f,Tg\rangle=\int f(x)\Big{(}\int{\mspace{2.5mu}\overline{\mspace{-2.5mu}h(x,y)}}{\mspace{2.5mu}\overline{\mspace{-2.5mu}g(y)}}dP(y)\Big{)}dP(x)=\iint h(y,x)f(x){\mspace{2.5mu}\overline{\mspace{-2.5mu}g(y)}}dP(x)dP(y),
\langle Tf,g\rangle=\int\Big{(}\int h(x,y)f(y)dP(y)\Big{)}{\overline{g(x)}}dP(x)=\iint h(x,y)f(y){\overline{g(x)}}dP(x)dP(y).

We thus see f,Tg=Tf,g\langle f,Tg\rangle=\langle Tf,g\rangle, i.e. TT is self-adjoint. To see why TT is Hilbert-Schmidt, let real-valued functions {ei}i=1\{e_{i}\}_{i=1}^{\infty} be an orthonormal basis of L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}) . We calculate

THS2=i=1((h(x,y)ei(y)𝑑P(y))2𝑑P(x))=(i=1hx,eiL22)𝑑P(x)\displaystyle\|T\|_{HS}^{2}=\sum_{i=1}^{\infty}\Bigg{(}\int\Big{(}\int h(x,y)e_{i}(y)dP(y)\Big{)}^{2}dP(x)\Bigg{)}=\int\Big{(}\sum_{i=1}^{\infty}\langle h_{x},e_{i}\rangle_{L^{2}}^{2}\Big{)}dP(x)
=h2(x,y)𝑑P(x)𝑑P(y)κu2/κl2<.\displaystyle=\iint h^{2}(x,y)dP(x)dP(y)\leq\kappa_{u}^{2}/\kappa_{l}^{2}<\infty.

The positive part is slightly more involved. Writing g:=f/dg:=f/\sqrt{d}, which belongs to L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}) because dκld\geq\kappa_{l}, we have Tf,f=k(x,y)g(x)¯g(y)𝑑P(x)𝑑P(y)\langle Tf,f\rangle=\iint k(x,y)\overline{g(x)}g(y)dP(x)dP(y). To show TT is positive, it therefore suffices to show that for all fL2(𝒳,)f\in L^{2}(\mathcal{X},\mathbb{P})

\iint k(x,y)\overline{f(x)}f(y)\,dP(x)\,dP(y)\geq 0.

Let us fix a sample size nn and draw i.i.d. samples X1,X2,,XnX_{1},X_{2},\ldots,X_{n}\sim\mathbb{P}. Then since the kernel k(,)k(\cdot,\cdot) is positive definite, the quadratic form

1n2i,j=1nk(Xi,Xj)f(Xi)¯f(Xj)\frac{1}{n^{2}}\sum_{i,j=1}^{n}k(X_{i},X_{j}){\mspace{2.5mu}\overline{\mspace{-2.5mu}f(X_{i})}}f(X_{j})

is non-negative regardless of what samples we draw. It thus follows that the expectation of this quadratic form is non-negative. Splitting the double sum into its n(n1)n(n-1) off-diagonal terms and its nn diagonal terms, a simple calculation shows that the expectation is in fact

\frac{n-1}{n}\iint k(x,y)\overline{f(x)}f(y)\,dP(x)\,dP(y)+\frac{1}{n}\int k(x,x)\overline{f(x)}f(x)\,dP(x).

Since by our assumption k(x,x)κuk(x,x)\leq\kappa_{u} and fL2(𝒳,)f\in L^{2}(\mathcal{X},\mathbb{P}), we see k(x,x)f(x)¯f(x)𝑑P(x)\int k(x,x)\overline{f(x)}f(x)dP(x) is finite. Since nn can be arbitrarily large, the double integral, and hence Tf,f\langle Tf,f\rangle, must be non-negative.

According to the spectral theory for positive, self-adjoint, Hilbert-Schmidt operators we introduced in Section 2.2, we immediately see most parts of items 4 and 5 are true. The remaining part to check for item 5 is that {fi}i=1Ks\{f_{i}\}_{i=1}^{K}\subset\mathcal{H}^{s}, which is implied by our assumption that {fi}i=1KCbp+2(𝒳)\{f_{i}\}_{i=1}^{K}\subset C_{b}^{p+2}(\mathcal{X}). The eigengap part in item 4 is also covered by the general assumptions. A nuance in item 4 is that the eigenvalues and eigenvectors there are under the premise that TT is an operator from s\mathcal{H}^{s} to s\mathcal{H}^{s}. But since s\mathcal{H}^{s} is a subspace of L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}), TT can only have fewer eigenvalues than when it is treated as an element of (L2(𝒳,))\mathcal{L}(L^{2}(\mathcal{X},\mathbb{P})). Moreover, since our general assumptions imply {fi}i=1Ks\{f_{i}\}_{i=1}^{K}\subset\mathcal{H}^{s}, the leading eigenspace remains unchanged after the restriction from (L2(𝒳,))\mathcal{L}(L^{2}(\mathcal{X},\mathbb{P})) to s\mathcal{H}^{s}.

For item 6, this is true because of the relationship between the spectrum of T^n\widehat{T}_{n} and that of the symmetric positive semi-definite kernel matrix LnL_{n}: an analogue of Lemma 4.1 also holds with the Cb(𝒳)C_{b}(\mathcal{X}) therein replaced by s\mathcal{H}^{s}. ∎

Because of Lemma 4.8, we can apply a slightly modified version of Theorem 3.2 (see the proof of Theorem 3.2) to obtain the following.

Proposition 4.9.

For some constants C17,C18C_{17},C_{18}, whenever n>C17τn>C_{17}\tau, we have with confidence 14eτ1-4e^{-\tau} that

V1(V1+V2Y)(I+YY)1/22C18τn\|V_{1}-(V_{1}+V_{2}Y)(I+Y^{*}Y)^{-1/2}\|_{2\to\infty}\leq C_{18}\frac{\sqrt{\tau}}{\sqrt{n}} (4.11)
Proof.

By Proposition 4.7, we have with probability 14eτ1-4e^{-\tau},

TT^nHSC16τn.\|T-\widehat{T}_{n}\|_{HS}\leq C_{16}\frac{\sqrt{\tau}}{\sqrt{n}}.

We can set C17C_{17} sufficiently large that C16τnC16/C17C1C_{16}\frac{\sqrt{\tau}}{\sqrt{n}}\leq C_{16}/\sqrt{C_{17}}\leq C_{1}, where C1C_{1} is the constant from Theorem 3.2. Hence by an intermediate step in the proof of Theorem 3.2, we conclude

V1(V1+V2Y)(I+YY)1/22C18τn.\|V_{1}-(V_{1}+V_{2}Y)(I+Y^{*}Y)^{-1/2}\|_{2\to\infty}\leq C_{18}\frac{\sqrt{\tau}}{\sqrt{n}}. ∎

4.6 Part five: error induced by V^1\widehat{V}_{1}

We deal with the error induced by operator V^1\widehat{V}_{1} not having orthonormal columns in L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}). Introduce the shorthand W:=(V1+V2Y)(I+YY)1/2W:=(V_{1}+V_{2}Y)(I+Y^{*}Y)^{-1/2} and we have

V1V^1Q2\displaystyle\|V_{1}-\widehat{V}_{1}Q\|_{2\to\infty} V1W2+WV^1Q2\displaystyle\leq\|V_{1}-W\|_{2\to\infty}+\|W-\widehat{V}_{1}Q\|_{2\to\infty}
=V1W2+WWWV^1Q2\displaystyle=\|V_{1}-W\|_{2\to\infty}+\|W-WW^{*}\widehat{V}_{1}Q\|_{2\to\infty}
V1W2+W2QWV^12.\displaystyle\leq\|V_{1}-W\|_{2\to\infty}+\|W\|_{2\to\infty}\|Q^{*}-W^{*}\widehat{V}_{1}\|_{2}. (4.12)

Here, the equality in the second step is true because V^1\widehat{V}_{1} and WW span the same leading eigenspace131313Since V^1\widehat{V}_{1} is constructed from the eigenvectors of LnL_{n} which are linearly independent, the columns cannot be linearly dependent functions in sCb(𝒳)\mathcal{H}^{s}\subset C_{b}(\mathcal{X}). and WW has orthonormal columns in L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}). Inspecting (4.12), we see V1W2\|V_{1}-W\|_{2\to\infty} is bounded by Proposition 4.9, and W2\|W\|_{2\to\infty} is roughly V12\|V_{1}\|_{2\to\infty} and thus bounded, so it all boils down to how “far” WV^1W^{*}\widehat{V}_{1} is from a unitary matrix in K×K\mathbb{C}^{K\times K}. In fact, we have

Lemma 4.10.

Assume all the singular values of WV^1W^{*}\widehat{V}_{1} are less than 22, then there exists unitary matrix QK×KQ\in\mathbb{C}^{K\times K} such that

QWV^122W2s2supgs1|Pn|g|2P|g|2|.\|Q-W^{*}\widehat{V}_{1}\|_{2}\leq 2\|W\|_{2\to\mathcal{H}^{s}}^{2}\sup_{{\|g\|_{\mathcal{H}^{s}}\leq 1}}\Big{|}P_{n}|g|^{2}-P|g|^{2}\Big{|}.
Proof.

Suppose WV^1W^{*}\widehat{V}_{1} admits singular value decomposition WV^1=AΣBW^{*}\widehat{V}_{1}=A\Sigma B^{*}, then Σ=AWV^1B\Sigma=A^{*}W^{*}\widehat{V}_{1}B. Let gi:=WAeig_{i}:=WAe_{i} be the ii-th column in WAWA, g~i:=V^1Bei\tilde{g}_{i}:=\widehat{V}_{1}Be_{i} be the ii-th column in V^1B\widehat{V}_{1}B where {ei}i=1K\{e_{i}\}_{i=1}^{K} is the standard basis in K\mathbb{R}^{K}. We know Σ=(Σij)\Sigma=(\Sigma_{ij}) where Σij=g~i,gjL2(𝒳,)\Sigma_{ij}=\langle\tilde{g}_{i},g_{j}\rangle_{L^{2}(\mathcal{X},\mathbb{P})}.

Since A,BA,B are unitary matrices, {gi}i=1K\{g_{i}\}_{i=1}^{K} are orthonormal in L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}) and {g~i}i=1K\{\tilde{g}_{i}\}_{i=1}^{K} orthonormal in L2(𝒳,n)L^{2}(\mathcal{X},\mathbb{P}_{n}). So g1g_{1} is orthogonal to g2,,gKg_{2},\ldots,g_{K}. At the same time, from the diagonal structure of Σ\Sigma, we know g~1\tilde{g}_{1} is orthogonal to g2,,gKg_{2},\ldots,g_{K} as well. This suggests g1g_{1} is collinear with g~1\tilde{g}_{1}. On top of that, since the diagonal entries of Σ\Sigma are real positive values, we know g~1=g1/g1L2(𝒳,n)\tilde{g}_{1}=g_{1}/\|g_{1}\|_{L^{2}(\mathcal{X},\mathbb{P}_{n})}. This in fact holds for all i[K]i\in[K], i.e. g~i=gi/giL2(𝒳,n)\tilde{g}_{i}=g_{i}/\|g_{i}\|_{L^{2}(\mathcal{X},\mathbb{P}_{n})}. Taking Q=ABQ=AB^{*}, which is a unitary matrix, we have

QWV^12=ΣI2=maxi[K]|1g~i,gi|=maxi[K]|11giL2(𝒳,n)|.\|Q-W^{*}\widehat{V}_{1}\|_{2}=\|\Sigma-I\|_{2}=\max_{i\in[K]}|1-\langle\tilde{g}_{i},g_{i}\rangle|=\max_{i\in[K]}|1-\frac{1}{\|g_{i}\|_{L^{2}(\mathcal{X},\mathbb{P}_{n})}}|. (4.13)

By our assumption on the singular values, we know for all i[K]i\in[K], 1giL2(𝒳,n)=g~i,gi2\frac{1}{\|g_{i}\|_{L^{2}(\mathcal{X},\mathbb{P}_{n})}}=\langle\tilde{g}_{i},g_{i}\rangle\leq 2. Note that for x12x\geq\frac{1}{2}, |11x|2|x1|2|x21||1-\frac{1}{x}|\leq 2|x-1|\leq 2|x^{2}-1|, we see

maxi[K]|11giL2(𝒳,n)|2maxi[K]|1giL2(𝒳,n)2|=2maxi[K]|1nj=1n|gi(Xj)|2𝔼|gi(X)|2|.\max_{i\in[K]}|1-\frac{1}{\|g_{i}\|_{L^{2}(\mathcal{X},\mathbb{P}_{n})}}|\leq 2\max_{i\in[K]}|1-\|g_{i}\|_{L^{2}(\mathcal{X},\mathbb{P}_{n})}^{2}|=2\max_{i\in[K]}\big{|}\frac{1}{n}\sum_{j=1}^{n}|g_{i}(X_{j})|^{2}-\mathbb{E}|g_{i}(X)|^{2}\big{|}. (4.14)

Since the gig_{i} depend on the samples {Xi}i=1n\{X_{i}\}_{i=1}^{n}, they are random. They do, however, share a deterministic bound on their s\mathcal{H}^{s} norms:

gis=WAeisW2sA2ei2=W2s.\|g_{i}\|_{\mathcal{H}^{s}}=\|WAe_{i}\|_{\mathcal{H}^{s}}\leq\|W\|_{2\to\mathcal{H}^{s}}\|A\|_{2}\|e_{i}\|_{2}=\|W\|_{2\to\mathcal{H}^{s}}.

Therefore,

maxi[K]|1nj=1n|gi(Xj)|2𝔼|gi(X)|2|supgsW2s|Pn|g|2P|g|2|=W2s2supgs1|Pn|g|2P|g|2|.\max_{i\in[K]}\big{|}\frac{1}{n}\sum_{j=1}^{n}|g_{i}(X_{j})|^{2}-\mathbb{E}|g_{i}(X)|^{2}\big{|}\leq\sup_{\|g\|_{\mathcal{H}^{s}}\leq\|W\|_{2\to\mathcal{H}^{s}}}\Big{|}P_{n}|g|^{2}-P|g|^{2}\Big{|}=\|W\|_{2\to\mathcal{H}^{s}}^{2}\sup_{{\|g\|_{\mathcal{H}^{s}}\leq 1}}\Big{|}P_{n}|g|^{2}-P|g|^{2}\Big{|}. (4.15)

Chaining (4.15) and (4.14) completes the proof. ∎

Using Dudley's inequality and standard results on covering numbers of Sobolev spaces, we can show the following (with proof in the appendix).

Lemma 4.11.

For our choice of ss, we have with probability 14exp(τ)1-4\textrm{exp}(-\tau)

supgs1|Pn|g|2P|g|2|C19+C20τn.\sup_{{\|g\|_{\mathcal{H}^{s}}\leq 1}}\Big{|}P_{n}|g|^{2}-P|g|^{2}\Big{|}\leq\frac{C_{19}+C_{20}\sqrt{\tau}}{\sqrt{n}}.

We are now ready to prove Theorem 4.2.

Proof of Theorem 4.2.

For a fixed sample size nn, let n,1\mathcal{E}_{n,1} be the event on which the concentration in Proposition 4.7 holds, and let n,2\mathcal{E}_{n,2} be the event on which the concentration in Lemma 4.11 holds. From now on, we condition on the intersection n,1n,2\mathcal{E}_{n,1}\cap\mathcal{E}_{n,2}, which happens with probability greater than or equal to 18eτ1-8e^{-\tau}.

First of all, on this event, we know the conclusion of Proposition 4.9 also holds. We thus have

V1W2C18τn.\|V_{1}-W\|_{2\to\infty}\leq C_{18}\frac{\sqrt{\tau}}{\sqrt{n}}.

So W2\|W\|_{2\to\infty} is close to V12\|V_{1}\|_{2\to\infty}. Since in Theorem 4.2 we have n/τC4n/\tau\geq C_{4} and the freedom of choosing C4C_{4}, we can set C4C_{4} large enough that W22V12\|W\|_{2\to\infty}\leq 2\|V_{1}\|_{2\to\infty}. Imitating the proof of Proposition 4.9, we can similarly show V1W2s\|V_{1}-W\|_{2\to\mathcal{H}^{s}} is on the order of τ/n\sqrt{\tau}/\sqrt{n}. We can thus assume C4C_{4} is also large enough to ensure W2s2V12s\|W\|_{2\to\mathcal{H}^{s}}\leq 2\|V_{1}\|_{2\to\mathcal{H}^{s}}.

Meanwhile, due to the uniform law of large numbers in Lemma 4.11, we can always make giL2(𝒳,n)\|g_{i}\|_{L^{2}(\mathcal{X},\mathbb{P}_{n})} greater than 1/21/2 by setting C4C_{4} large enough. The condition on singular values in Lemma 4.10 is thus satisfied, and from it we see there exists a unitary matrix QQ such that

QWV^122W2Hs2C19+C20τn2W2Hs2(C19+C20)τn,\|Q-W^{*}\widehat{V}_{1}\|_{2}\leq 2\|W\|_{2\to H^{s}}^{2}\frac{C_{19}+C_{20}\sqrt{\tau}}{\sqrt{n}}\leq 2\|W\|_{2\to H^{s}}^{2}\frac{(C_{19}+C_{20})\sqrt{\tau}}{\sqrt{n}},

where we assumed τ1\tau\geq 1. Since for concentration results like Theorem 4.2 to be meaningful, τ\tau is large anyway, this assumption is harmless.

Going back to (4.12), we have

V1V^1Q2\displaystyle\|V_{1}-\widehat{V}_{1}Q\|_{2\to\infty} V1W2+W2QWV^12\displaystyle\leq\|V_{1}-W\|_{2\to\infty}+\|W\|_{2\to\infty}\|Q^{*}-W^{*}\widehat{V}_{1}\|_{2} (4.16)
C18τn+2W2W2Hs2(C19+C20)τn\displaystyle\leq C_{18}\frac{\sqrt{\tau}}{\sqrt{n}}+2\|W\|_{2\to\infty}\|W\|_{2\to H^{s}}^{2}\frac{(C_{19}+C_{20})\sqrt{\tau}}{\sqrt{n}} (4.17)
C18τn+16V12V12Hs2(C19+C20)τn.\displaystyle\leq C_{18}\frac{\sqrt{\tau}}{\sqrt{n}}+16\|V_{1}\|_{2\to\infty}\|V_{1}\|_{2\to H^{s}}^{2}\frac{(C_{19}+C_{20})\sqrt{\tau}}{\sqrt{n}}. (4.18)

Setting C5=C18+16V12V12s2(C19+C20)C_{5}=C_{18}+16\|V_{1}\|_{2\to\infty}\|V_{1}\|_{2\to\mathcal{H}^{s}}^{2}(C_{19}+C_{20}) thus completes the proof. ∎

5 Discussion

We would first like to comment on the relationship between Theorem 3.2 and the concentration of spectral projections (see Proposition 22 in Rosasco, Belkin and Vito [29]). Our result in fact easily implies the concentration of spectral projections. To see this, simply note the difference in projections can be written as V1V1V~1V~1V_{1}V_{1}^{*}-\widetilde{V}_{1}\widetilde{V}_{1}^{*} and apply the triangle inequality. We believe it is possible to go from the concentration of spectral projections in the Hilbert space \mathcal{H} to Theorem 3.2, but the road is treacherous. On a high level, we would need to project an orthonormal basis of the leading invariant space of the perturbed operator onto that of the unperturbed operator, and then perform Gram-Schmidt on the projections. During this process, we would need to convert back and forth between \mathcal{H} and L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}), and we foresee countless petty and pesky technical details. But it is our belief that the concentration of spectral projections in the \mathcal{H}-induced operator norm is equivalent to Theorem 3.2.

We would also like to comment on the generality of the Newton-Kantorovich Theorem. By that, we mean the operator equation (3.5) need not be restricted to the space (K,)\mathcal{L}(\mathbb{C}^{K},\mathcal{H}). We can have slightly altered versions of (3.5) involving L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}), Cb(𝒳)C_{b}(\mathcal{X}), or Cb1(𝒳)C_{b}^{1}(\mathcal{X}) that induce an invariant subspace and still apply the Newton-Kantorovich Theorem to solve them. For example, we should be able to replace every \mathcal{H} in this paper with Cb1(𝒳)C_{b}^{1}(\mathcal{X}) and rework the proof so that everything goes through. A word of caution is that to obtain operator norm convergence from the sample-level operator to the population operator, the function space one works with has to have some kind of “smoothness”. Either the kind of smoothness from an RKHS or the kind from Cb1(𝒳)C_{b}^{1}(\mathcal{X}) is fine, but spaces like Cb(𝒳)C_{b}(\mathcal{X}) or L2(𝒳,)L^{2}(\mathcal{X},\mathbb{P}), where functions may oscillate wildly while still having a small norm, are not, because adversarial functions can be chosen to ruin operator norm convergence. This point was also mentioned in von Luxburg, Belkin and Bousquet [41].

Finally, we would like to comment on our complex-valued functions assumption and the fact that Theorem 3.2 needs a unitary matrix QQ. We suspect that, since everything is real, the unitary matrix is an artifact rather than a necessity, and our proof could be altered so that only a real orthogonal matrix is needed (although we do not know how at the moment). We have also checked that we can get by with real Hilbert or Banach spaces and real-valued functions for almost all lemmas and theorems except for Theorem 3.8. But on the brighter side, working with complex numbers makes our result more general and gives us the freedom of using a complex-valued kernel function, although such freedom is rarely taken advantage of in statistics or machine learning. Last but not least, we wish to point out that due to length constraints, we presented only one application, normalized spectral clustering, but other applications of our general theory are possible. For example, uniform consistency results can be obtained for kernel PCA, and the proof is much simpler than that for normalized spectral clustering. We include such results in the appendix.

References

  • Abbe, E., Fan, J., Wang, K. and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. Annals of Statistics 48, 1452.
  • Anselone, P. M. (1971). Collectively Compact Operator Approximation Theory and Applications to Integral Equations. Automatic Computation. Prentice Hall.
  • Belkin, M. and Niyogi, P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15, 1373–1396.
  • Bengio, Y., Paiement, J.-F., Vincent, P., Delalleau, O., Le Roux, N. and Ouimet, M. (2003). Out-of-sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering. In Proceedings of the 16th International Conference on Neural Information Processing Systems (NIPS'03), 177–184. MIT Press, Cambridge, MA, USA.
  • Blanchard, G., Bousquet, O. and Zwald, L. (2007). Statistical properties of kernel principal component analysis. Machine Learning 66, 259–294.
  • Burenkov, V. I. (1998). Sobolev Spaces on Domains. Teubner-Texte zur Mathematik. B. G. Teubner Verlagsgesellschaft, Leipzig.
  • Cape, J., Tang, M. and Priebe, C. E. (2019a). The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. The Annals of Statistics 47, 2405–2439.
  • Cape, J., Tang, M. and Priebe, C. E. (2019b). Signal-plus-noise matrix models: eigenvector deviations and fluctuations. Biometrika 106, 243–250.
  • Ciarlet, P. G. (2013). Linear and Nonlinear Functional Analysis with Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
  • Coifman, R. R., Lafon, S., Lee, A. B., Maggioni, M., Nadler, B., Warner, F. and Zucker, S. W. (2005). Geometric Diffusions as a Tool for Harmonic Analysis and Structure Definition of Data: Diffusion Maps. Proceedings of the National Academy of Sciences 102, 7426–7431.
  • Cucker, F. and Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society 39, 1–49.
  • Damle, A. and Sun, Y. (2020). Uniform bounds for invariant subspace perturbations. SIAM Journal on Matrix Analysis and Applications 41, 1208–1236.
  • Donetti, L. and Muñoz, M. A. (2004). Detecting network communities: a new systematic and efficient algorithm. Journal of Statistical Mechanics: Theory and Experiment 2004, P10012.
  • Donoho, D. L. and Grimes, C. (2003). Hessian Eigenmaps: Locally Linear Embedding Techniques for High-Dimensional Data. Proceedings of the National Academy of Sciences 100, 5591–5596.
  • Edmunds, D. E. and Triebel, H. (1996). Function Spaces, Entropy Numbers, Differential Operators. Cambridge Tracts in Mathematics. Cambridge University Press.
  • Fan, J., Wang, W. and Zhong, Y. (2018). An ℓ∞ eigenvector perturbation bound and its application to robust covariance estimation. Journal of Machine Learning Research 18, 1–42.
  • Higgs, B. W., Weller, J. and Solka, J. L. (2006). Spectral embedding finds meaningful (relevant) structure in image and microarray data. BMC Bioinformatics 7, 74.
  • Hoffmann, H. (2007). Kernel PCA for novelty detection. Pattern Recognition 40, 863–874.
  • Karow, M. and Kressner, D. (2014). On a Perturbation Bound for Invariant Subspaces of Matrices. SIAM Journal on Matrix Analysis and Applications 35, 599–618.
  • Kato, T. (1995). Perturbation Theory for Linear Operators. Classics in Mathematics. Springer-Verlag, Berlin Heidelberg.
  • Koltchinskii, V. I. (1998). Asymptotics of Spectral Projections of Some Random Matrices Approximating Integral Operators. In High Dimensional Probability (E. Eberlein, M. Hahn and M. Talagrand, eds.), 191–227. Birkhäuser, Basel.
  • Koltchinskii, V. and Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli 6, 113–167.
  • Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. The Annals of Statistics 43, 215–237.
  • Mao, X., Sarkar, P. and Chakrabarti, D. (2021). Estimating mixed memberships with sharp eigenvector deviations. Journal of the American Statistical Association 116, 1928–1940.
  • Mendelson, S. and Pajor, A. (2005). Ellipsoid Approximation Using Random Vectors. In Learning Theory (P. Auer and R. Meir, eds.), 429–443. Springer, Berlin Heidelberg.
  • Mendelson, S. and Pajor, A. (2006). On singular values of matrices with independent rows. Bernoulli 12, 761–773.
  • Ng, A. Y., Jordan, M. I. and Weiss, Y. (2001). On Spectral Clustering: Analysis and an Algorithm. In Advances in Neural Information Processing Systems, 849–856. MIT Press.
  • Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39, 1878–1915.
  • Rosasco, L., Belkin, M. and De Vito, E. (2010). On Learning with Integral Operators. Journal of Machine Learning Research 11, 905–934.
  • Schiebinger, G., Wainwright, M. J. and Yu, B. (2015). The Geometry of Kernelized Spectral Clustering. The Annals of Statistics 43, 819–846.
  • Shawe-Taylor, J., Williams, C. K. I., Cristianini, N. and Kandola, J. (2005). On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory 51, 2510–2522.
  • Shi, T., Belkin, M. and Yu, B. (2009). Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics 37, 3960–3984.
  • Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 888–905.
  • Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer Science & Business Media.
  • Stewart, G. (1971). Error Bounds for Approximate Invariant Subspaces of Closed Linear Operators. SIAM Journal on Numerical Analysis 8, 796–808.
  • Tenenbaum, J. B. (2000). A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 2319–2323.
  • Trillos, N. G., Hoffmann, F. and Hosseini, B. (2019). Geometric structure of graph Laplacian embeddings. arXiv:1901.10651.
  • Trillos, N. G. and Slepčev, D. (2015). A variational approach to the consistency of spectral clustering. arXiv:1508.01928.
  • Trillos, N. G., Gerlach, M., Hein, M. and Slepčev, D. (2018). Error estimates for spectral convergence of the graph Laplacian on random geometric graphs towards the Laplace–Beltrami operator. arXiv:1801.10108.
  • Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
  • von Luxburg, U., Belkin, M. and Bousquet, O. (2008). Consistency of Spectral Clustering. The Annals of Statistics 36, 555–586.
  • Zwald, L. and Blanchard, G. (2006). On the Convergence of Eigenspaces in Kernel Principal Component Analysis. In Advances in Neural Information Processing Systems 18 (Y. Weiss, B. Schölkopf and J. C. Platt, eds.), 1649–1656. MIT Press.

Appendix A Proofs

A.1 Proof of Lemma 3.1

Proof of Lemma 3.1.

For item 1, we show double inclusion: $V_2^{-1}\mathcal{H}\subset V_2^*\mathcal{H}$ and $V_2^*\mathcal{H}\subset V_2^{-1}\mathcal{H}$. First, for any $l\in V_2^{-1}\mathcal{H}$, we know $V_2l\in\mathcal{H}$, so $V_2^*(V_2l)\in V_2^*\mathcal{H}$. Since $l=V_2^*V_2l$, we see $V_2^{-1}\mathcal{H}\subset V_2^*\mathcal{H}$. Second, for any $l\in V_2^*\mathcal{H}$, write $l=V_2^*h$ for some $h\in\mathcal{H}$. Note $V_2V_2^*h=h-V_1V_1^*h$. Since $f_1,\ldots,f_K\in\mathcal{H}$ by assumption, we know $V_1V_1^*h\in\mathcal{H}$, so $V_2l=h-V_1V_1^*h\in\mathcal{H}$. This shows the other inclusion.

For item 2, since $\mathcal{H}$ is a subspace of $L^2(\mathcal{X},\mathbb{P})$, $V_2^*\mathcal{H}$ is a subspace of $V_2^*L^2(\mathcal{X},\mathbb{P})$, which equals $l^2$. Checking that $(\cdot,\cdot)$ is an inner product is routine. We next show the completeness of $V_2^{-1}\mathcal{H}$. Let $\{b_i\}_{i=1}^\infty$ be a Cauchy sequence in $V_2^{-1}\mathcal{H}$. Since by definition $\|b\|_{V_2^{-1}\mathcal{H}}=\|V_2b\|_{\mathcal{H}}$, the sequence $\{V_2b_i\}_{i=1}^\infty$ is Cauchy in $\mathcal{H}$, so $V_2b_i\xrightarrow{\mathcal{H}}y$ for some $y\in\mathcal{H}$. Since by assumption the $\|\cdot\|_{\mathcal{H}}$ norm is stronger than $\|\cdot\|_\infty$, which is in turn stronger than $\|\cdot\|_{L^2}$, we know $V_2b_i\xrightarrow{L^2}y$. Since the range of $V_2$ is closed in $L^2(\mathcal{X},\mathbb{P})$, $y$ is also in the range and $y=V_2V_2^*y$. Now $\|b_i-V_2^*y\|_{V_2^{-1}\mathcal{H}}=\|V_2b_i-V_2V_2^*y\|_{\mathcal{H}}$, and the right-hand side converges to zero because $V_2b_i\xrightarrow{\mathcal{H}}y$, so the space $V_2^{-1}\mathcal{H}$ is indeed complete.

For item 3, for any $\alpha\in\mathbb{C}^K$ with $\|\alpha\|_2=1$, we have

$\|V_1\alpha\|_{\mathcal{H}}=\big\|\sum_{i=1}^K\alpha_if_i\big\|_{\mathcal{H}}\leq\sum_{i=1}^K|\alpha_i|\,\|f_i\|_{\mathcal{H}}\leq\sqrt{K}\max_{i\in[K]}\|f_i\|_{\mathcal{H}}.$

This shows $\|V_1\|_{2\to\mathcal{H}}\leq\sqrt{K}\max_{i\in[K]}\|f_i\|_{\mathcal{H}}$. As noted in the proof of item 2, the $\|\cdot\|_{\mathcal{H}}$ norm is stronger than the $\|\cdot\|_{L^2}$ norm, i.e. $\|h\|_{L^2}\leq C_{\mathcal{H}}\|h\|_{\mathcal{H}}$ for some constant $C_{\mathcal{H}}$ and all $h\in\mathcal{H}$. Therefore

$\|V_1^*h\|_2^2=\sum_{i=1}^K\langle f_i,h\rangle_{L^2}^2\leq\sum_{i=1}^K\|f_i\|_{L^2}^2\|h\|_{L^2}^2\leq C_{\mathcal{H}}^2K\|h\|_{\mathcal{H}}^2.$

We thus see $\|V_1^*\|_{\mathcal{H}\to2}\leq C_{\mathcal{H}}\sqrt{K}$. The fact that $\|V_2\|_{V_2^{-1}\mathcal{H}\to\mathcal{H}}=1$ is a simple consequence of $\|b\|_{V_2^{-1}\mathcal{H}}=\|V_2b\|_{\mathcal{H}}$ for all $b\in V_2^{-1}\mathcal{H}$. Finally, for all $h\in\mathcal{H}$,

$\|V_2^*h\|_{V_2^{-1}\mathcal{H}}=\|V_2V_2^*h\|_{\mathcal{H}}=\|(I-V_1V_1^*)h\|_{\mathcal{H}}\leq(1+\|V_1\|_{2\to\mathcal{H}}\|V_1^*\|_{\mathcal{H}\to2})\|h\|_{\mathcal{H}}.$

It thus follows that $\|V_2^*\|_{\mathcal{H}\to V_2^{-1}\mathcal{H}}\leq1+C_{\mathcal{H}}K\max_{i\in[K]}\|f_i\|_{\mathcal{H}}$. ∎

A.2 Proof of a lemma used in proving Theorem 3.5

In the proof of Theorem 3.5, we used the following lemma. We now state and prove it.

Lemma A.1.

The operator $\mathcal{T}:\mathcal{E}\to\mathcal{E}$ defined by $\mathcal{T}(Y):=T_{22}Y-YT_{11}$ is one-to-one and onto. Moreover, $\inf_{\|Y\|_{HS}=1}\|T_{22}Y-YT_{11}\|_{HS}>0$.

Proof.

First, we note that

$T_{11}:\mathbb{R}^K\to\mathbb{R}^K,\qquad(a_1,a_2,\ldots,a_K)\mapsto(\lambda_1a_1,\lambda_2a_2,\ldots,\lambda_Ka_K).$

To show $\mathcal{T}$ is one-to-one and onto, it suffices to show that for any $g\in\mathcal{E}$ there exists a unique $y\in\mathcal{E}$ such that $g=\mathcal{T}(y)$. Denote the standard orthonormal basis of $\mathbb{R}^K$ by $\{e_i\}_{i=1}^K$. Due to the diagonal structure of $T_{11}$, we see $(\mathcal{T}y)(e_i)=(T_{22}-\lambda_iI)ye_i$. So it suffices to show that for each $i\in[K]$ there exists a unique $ye_i$ such that $(T_{22}-\lambda_iI)ye_i=ge_i$, and to this end it suffices to show that $\lambda_i$ lies in the resolvent set of $T_{22}$. This is indeed true because (1) $T_{22}$ is a compact operator from $\tilde l^2$ to $\tilde l^2$; (2) $\sigma(T_{22})\subset\{\lambda_{K+1},\lambda_{K+2},\ldots\}\cup\{0\}$. The second point is obvious, and the first point follows from

$\|T_{22}\|_{HS}=\|F_\perp^*TF_\perp\|_{HS}\leq\|F_\perp^*\|_{op}\|T\|_{HS}\|F_\perp\|_{op}\leq\|T\|_{HS}.$

Next, we show $\mathcal{T}$ is a bounded operator. Once $\mathcal{T}$ is shown to be bounded, since $\mathcal{T}$ is one-to-one and onto and $\mathcal{E}$ is a Banach space, the bounded inverse theorem gives $\mathcal{T}^{-1}\in\mathcal{L}(\mathcal{E})$, which is equivalent to $\inf_{\|Y\|_{HS}=1}\|T_{22}Y-YT_{11}\|_{HS}>0$.

The operator $\mathcal{T}$ is indeed bounded because

$\|T_{22}Y-YT_{11}\|_{HS}\leq\|T_{22}\|_{HS}\|Y\|_{HS}+\|T_{11}\|_{HS}\|Y\|_{HS}\leq2\|T\|_{HS}\|Y\|_{HS}.$

We remark that when $T_{22}$ is self-adjoint, the proof of this lemma is greatly simplified. In fact, we have

$\|T_{22}Y-YT_{11}\|_{HS}\geq\|YT_{11}\|_{HS}-\|T_{22}Y\|_{HS}\geq\lambda_K\|Y\|_{HS}-\|T_{22}\|_{op}\|Y\|_{HS}.$

Since $T_{22}$ is self-adjoint, its operator norm equals its largest eigenvalue, which is at most $\lambda_{K+1}$. We see immediately that in this case $\inf_{\|Y\|_{HS}=1}\|T_{22}Y-YT_{11}\|_{HS}\geq\lambda_K-\lambda_{K+1}$, so the eigengap is recovered. In unnormalized spectral clustering, where $\mathcal{H}$ is taken to be the RKHS associated with the kernel function, we claim that $T_{22}$ is self-adjoint. ∎
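As a sanity check on this remark, the following numpy sketch (our own illustration, not part of the proof) truncates the self-adjoint case to finite dimensions. It represents $\mathcal{T}(Y)=T_{22}Y-YT_{11}$ as a matrix acting on $\mathrm{vec}(Y)$ via Kronecker products and compares the smallest singular value of that matrix, which for the truncation equals $\inf_{\|Y\|_{HS}=1}\|T_{22}Y-YT_{11}\|_{HS}$, with the eigengap $\lambda_K-\lambda_{K+1}$. The dimensions and the eigenvalue sequence are arbitrary choices.

import numpy as np

# Finite truncation of the self-adjoint case: T11 is K x K, T22 is m x m, both diagonal.
K, m = 3, 50
lam = 1.0 / np.arange(1, K + m + 1)      # a strictly decreasing eigenvalue sequence
T11 = np.diag(lam[:K])                   # top K eigenvalues
T22 = np.diag(lam[K:])                   # remaining eigenvalues

# vec(T22 Y - Y T11) = (I_K kron T22 - T11 kron I_m) vec(Y)  (column-major vec)
S = np.kron(np.eye(K), T22) - np.kron(T11, np.eye(m))

# Smallest singular value of S = inf over ||Y||_HS = 1 of ||T22 Y - Y T11||_HS.
sigma_min = np.linalg.svd(S, compute_uv=False).min()
print(sigma_min, lam[K - 1] - lam[K])    # both print lambda_K - lambda_{K+1}

Both printed values agree here, matching the lower bound $\lambda_K-\lambda_{K+1}$ (with equality in this diagonal example).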

A.3 Proof of Lemma 4.11

At the core of Lemma 4.11 is a uniform law of large numbers over the unit ball of $\mathcal{H}^s$. We need the following two lemmas in the proof; the first is from Cucker and Smale [11] (Proposition 6), and the second is from Vershynin [40] (Theorem 8.1.6).

Lemma A.2.

Denote $\mathcal{G}=\{g\in\mathcal{H}^s:\|g\|_{\mathcal{H}^s}\leq1\}$. When $s>p/2$, for all $\epsilon>0$,

$\log\mathcal{N}(\mathcal{G},\|\cdot\|_\infty,\epsilon)\leq\big(\tfrac{C}{\epsilon}\big)^{p/s}+1$

for some constant $C$.

Lemma A.3.

Let $(X_t)_{t\in T}$ be a random process on a metric space $(T,d)$ with sub-gaussian increments, i.e.

$\|X_t-X_s\|_{\Psi_2}\leq Kd(t,s)\quad\text{for all }t,s\in T.$

Then, for every $u\geq0$, the event

$\sup_{t\in T}X_t\leq CK\Big[\int_0^\infty\sqrt{\log\mathcal{N}(T,d,\epsilon)}\,d\epsilon+u\cdot\mathrm{diam}(T)\Big]$

holds with probability at least $1-2\exp(-u^2)$.

We now prove Lemma 4.11. We refer readers unfamiliar with the arguments below to the proof of Theorem 8.2.3 in Vershynin [40].

Proof of Lemma 4.11.

We first show that on $\mathcal{G}=\{g\in\mathcal{H}^s:\|g\|_{\mathcal{H}^s}\leq1\}$, the random process $P_n|g|^2-P|g|^2$ has sub-gaussian increments. For fixed $f,g\in\mathcal{G}$, we have

$\|P_nf\bar f-Pf\bar f-P_ng\bar g+Pg\bar g\|_{\Psi_2}=\frac1n\big\|\sum_{i=1}^nZ_i\big\|_{\Psi_2},\quad\text{where }Z_i=(f\bar f-g\bar g)(X_i)-\mathbb{E}(f\bar f-g\bar g)(X).$

The $Z_i$'s are independent and mean zero. It thus follows that

$\|P_nf\bar f-Pf\bar f-P_ng\bar g+Pg\bar g\|_{\Psi_2}\leq\frac{C_{21}}{n}\big(\sum_{i=1}^n\|Z_i\|_{\Psi_2}^2\big)^{1/2}.$

By the centering lemma, we know

$\|Z_i\|_{\Psi_2}\leq C_{22}\,\|f(X_i)\overline{f(X_i)}-g(X_i)\overline{g(X_i)}\|_{\Psi_2}.$

Note that because of the embedding, we have

$\|f\bar f-g\bar g\|_\infty\leq\|f(\bar f-\bar g)\|_\infty+\|(f-g)\bar g\|_\infty\leq\|f\|_\infty\|f-g\|_\infty+\|g\|_\infty\|f-g\|_\infty\leq2C_6\|f-g\|_\infty.$

The random variable $f(X_i)\overline{f(X_i)}-g(X_i)\overline{g(X_i)}$ is thus bounded. Since bounded random variables have bounded $\Psi_2$ norm, we see

$\|f(X_i)\overline{f(X_i)}-g(X_i)\overline{g(X_i)}\|_{\Psi_2}\leq2C_6C_{23}\|f-g\|_\infty.$

Putting the pieces together, we have

$\|P_n|f|^2-P|f|^2-P_n|g|^2+P|g|^2\|_{\Psi_2}\leq\frac{C_{24}}{\sqrt n}\|f-g\|_\infty.$

Next, it is easy to check that $\mathrm{diam}(\mathcal{G})\leq2C_6$ and $\int_0^{2C_6}\sqrt{\log\mathcal{N}(\mathcal{G},\|\cdot\|_\infty,\epsilon)}\,d\epsilon<\infty$ (by our choice of $s=\lfloor p/2\rfloor+1$ and Lemma A.2). It thus follows from Lemma A.3 that the event

$\sup_{g\in\mathcal{G}}\big(P_n|g|^2-P|g|^2\big)\leq\frac{C_{19}+C_{20}u}{\sqrt n}$

holds with probability at least $1-2\exp(-u^2)$.

By exactly the same argument, the event

$\sup_{g\in\mathcal{G}}\big(P|g|^2-P_n|g|^2\big)\leq\frac{C_{19}+C_{20}u}{\sqrt n}$

also holds with probability at least $1-2\exp(-u^2)$. Taking a union bound, we see that

$\sup_{g\in\mathcal{G}}\big|P|g|^2-P_n|g|^2\big|\leq\frac{C_{19}+C_{20}u}{\sqrt n}$

holds with probability at least $1-4\exp(-u^2)$. Writing $\tau=u^2$ completes the proof. ∎
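The $1/\sqrt n$ rate in Lemma 4.11 can be visualized with a small simulation. The sketch below is purely illustrative and is not part of the argument: it replaces the Sobolev ball $\mathcal{G}$ with a simple one-parameter stand-in class $\{g_t(x)=\cos(tx):t\in(0,1]\}$, for which $P|g_t|^2$ has a closed form under $X\sim\mathrm{Uniform}(0,1)$, and checks that $\sqrt n\cdot\mathbb{E}\sup_t|P_n|g_t|^2-P|g_t|^2|$ stabilizes as $n$ grows.

import numpy as np

rng = np.random.default_rng(0)

# A one-parameter stand-in for the Sobolev ball: g_t(x) = cos(t x), t in (0, 1].
ts = np.linspace(1e-6, 1.0, 200)
# Closed form: P g_t^2 = E cos^2(t X) = 1/2 + sin(2 t) / (4 t) for X ~ Uniform(0, 1).
P = 0.5 + np.sin(2 * ts) / (4 * ts)

for n in [100, 400, 1600, 6400]:
    sups = []
    for _ in range(100):                    # Monte Carlo repetitions
        X = rng.uniform(0.0, 1.0, size=n)
        Pn = (np.cos(np.outer(ts, X)) ** 2).mean(axis=1)
        sups.append(np.abs(Pn - P).max())   # sup over the class of |P_n g^2 - P g^2|
    print(n, np.sqrt(n) * np.mean(sups))    # roughly constant in n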

Appendix B Application to kernel PCA

To further demonstrate the use of the general theory, we apply it to kernel principal component analysis (kernel PCA) in this section. In kernel PCA, we start from a metric space $\mathcal{X}$, a probability measure $\mathbb{P}$ on $\mathcal{X}$, and a continuous positive definite kernel function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. After observing samples $X_1,\ldots,X_n\overset{\mathrm{iid}}{\sim}\mathbb{P}$, we are interested in the matrix $K_n\in\mathbb{R}^{n\times n}$ of their pairwise similarities, $K_n=[\frac1nk(X_i,X_j)]_{i,j=1}^n$, assuming the data mapped into the feature space are centered. Since $K_n$ is symmetric and positive semi-definite, it has an eigenvalue decomposition. We denote the eigenpairs by $(\widehat\lambda_k,v_k)$ and sort the eigenvalues in descending order:

$\widehat\lambda_1\geq\cdots\geq\widehat\lambda_n\geq0.$

The eigenvectors $v_k$ are normalized to have $\|v_k\|_2=\sqrt n$. Then the matrix $V=[v_1,\ldots,v_K]\in\mathbb{R}^{n\times K}$ consists of the leading $K$ principal components.
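For concreteness, this construction takes only a few lines of numpy. The sketch below is our own illustration; the Gaussian kernel, bandwidth, and data distribution are arbitrary choices, not part of the theory.

import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 2
X = rng.normal(size=(n, 2))            # observations in R^2

def kernel(A, B, bw=1.0):
    # Gaussian kernel; any bounded continuous positive definite kernel would do
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

Kn = kernel(X, X) / n                  # K_n = [k(X_i, X_j) / n]
lam, U = np.linalg.eigh(Kn)            # ascending eigenvalues, orthonormal columns
lam, U = lam[::-1], U[:, ::-1]         # sort in descending order
V = np.sqrt(n) * U[:, :K]              # leading K PCs, normalized so ||v_k||_2 = sqrt(n)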

Let $\mathcal{H}$ be the RKHS associated with the kernel $k(\cdot,\cdot)$. Recall that the tensor product of $a,b\in\mathcal{H}$,

$a\otimes b:\mathcal{H}\to\mathcal{H},\qquad f\mapsto\langle b,f\rangle_{\mathcal{H}}\,a,$

is a linear operator. The operator counterpart of $K_n$ is then the empirical covariance operator

$\Sigma_n=\frac1n\sum_{i=1}^nk_{X_i}\otimes k_{X_i},$   (B.1)

where $k_{X_i}=k(\cdot,X_i)$ is the feature in $\mathcal{H}$ corresponding to $X_i$ ($i=1,\ldots,n$) under the feature map $x\mapsto k(\cdot,x)$.

It turns out that the eigenvalues and eigenvectors of $K_n$ and $\Sigma_n$ are closely related. To formulate this relationship, define the restriction operator $\zeta:\mathcal{H}\to\mathbb{R}^n$ by $\zeta f=\frac1{\sqrt n}(f(X_1),\ldots,f(X_n))^T$. One can verify that the adjoint $\zeta^*:\mathbb{R}^n\to\mathcal{H}$ is given by $\zeta^*\alpha=\frac1{\sqrt n}\sum_{i=1}^n\alpha_ik_{X_i}$, where $\alpha=(\alpha_1,\ldots,\alpha_n)^T$. The eigenvalues and eigenvectors (eigenfunctions) of $\Sigma_n$ and $K_n$ are related in the following sense.

Lemma B.1.

The following facts hold true:

  1. $K_n=\zeta\zeta^*$ and $\Sigma_n=\zeta^*\zeta$;

  2. if $(\widehat\lambda,f)$ is a non-trivial eigenpair of $\Sigma_n$ (i.e. $\widehat\lambda\neq0$), then $(\widehat\lambda,\zeta f)$ is an eigenpair of $K_n$;

  3. if $(\widehat\lambda,v)$ is a non-trivial eigenpair of $K_n$, then $(\widehat\lambda,\widehat f)$, where

    $\widehat f=\frac1{\widehat\lambda\sqrt n}\zeta^*v,\quad\text{i.e.}\quad\widehat f(x)=\frac1{\widehat\lambda n}\sum_{i=1}^nk(x,X_i)v_i,$   (B.2)

    is an eigenpair of $\Sigma_n$ with $\widehat f\in\mathcal{H}$. Moreover, this choice of $\widehat f$ satisfies $\|\widehat f\|_{L^2(\mathcal{X},\mathbb{P}_n)}=1$, and the restriction of $\widehat f$ to the sample points agrees with $v$, i.e. $\widehat f(X_i)=v_i$ for all $i\in[n]$.

Proof.

For item one, it is easy to verify that $\zeta\zeta^*:\mathbb{R}^n\to\mathbb{R}^n$ is the linear transformation defined by $K_n$. For the other half, note that

$\zeta^*\zeta f=\frac1n\sum_{i=1}^nf(X_i)k(\cdot,X_i).$

At the same time, by the reproducing property,

$\Sigma_nf=\frac1n\sum_{i=1}^n\langle k_{X_i},f\rangle_{\mathcal{H}}\,k_{X_i}=\frac1n\sum_{i=1}^nf(X_i)k(\cdot,X_i).$

We thus conclude the two are equal.

For item two, since by assumption $\zeta^*\zeta f=\widehat\lambda f$, we have $K_n\zeta f=\zeta\zeta^*\zeta f=\widehat\lambda\,\zeta f$, which is exactly the statement.

For item three, if $(\widehat\lambda,v)$ is an eigenpair of $K_n$, we check that $(\widehat\lambda,\widehat f)$ is an eigenpair of $\Sigma_n$:

$\Sigma_n\widehat f=\frac1n\sum_{i=1}^n(k_{X_i}\otimes k_{X_i})\Big(\frac1{\widehat\lambda n}\sum_{j=1}^nk_{X_j}v_j\Big)=\frac1{\widehat\lambda n^2}\sum_{i=1}^nk_{X_i}\sum_{j=1}^nk(X_i,X_j)v_j=\frac1{\widehat\lambda n}\sum_{i=1}^nk_{X_i}[K_nv]_i=\frac1{\widehat\lambda n}\sum_{i=1}^nk_{X_i}\widehat\lambda v_i=\widehat\lambda\widehat f.$

Moreover, $\widehat f$ is a linear combination of the $k_{X_i}$ and therefore belongs to $\mathcal{H}$. Finally, $\widehat f(X_i)=\frac1{\widehat\lambda}[K_nv]_i=v_i$, and since $\|v\|_2=\sqrt n$ we get $\|\widehat f\|_{L^2(\mathcal{X},\mathbb{P}_n)}^2=\frac1n\sum_{i=1}^nv_i^2=1$. ∎
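Formula (B.2) is a Nyström-type out-of-sample extension, and the properties claimed in item 3 are easy to check numerically. The following self-contained sketch (our own illustration, with the same arbitrary Gaussian-kernel setup as in the earlier block) extends the top eigenvector of $K_n$ and verifies that $\widehat f(X_i)=v_i$ and $\|\widehat f\|_{L^2(\mathcal{X},\mathbb{P}_n)}=1$:

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))

def kernel(A, B, bw=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

Kn = kernel(X, X) / n
lam, U = np.linalg.eigh(Kn)
lam_top, v = lam[-1], np.sqrt(n) * U[:, -1]   # top eigenpair of K_n, ||v||_2 = sqrt(n)

def f_hat(x):
    # (B.2): f_hat(x) = (1 / (lam * n)) * sum_i k(x, X_i) v_i
    return kernel(np.atleast_2d(x), X) @ v / (lam_top * n)

print(np.allclose(f_hat(X), v))               # restriction to the sample agrees with v
print(np.mean(f_hat(X) ** 2))                 # ||f_hat||_{L^2(P_n)}^2 = 1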

The population version of $\Sigma_n$ is the covariance operator

$\Sigma=\mathbb{E}\,k_X\otimes k_X,$

where $k_X=k(\cdot,X)$ and $X\sim\mathbb{P}$. We will justify the expectation of such random elements in an appropriate Hilbert space below. Under appropriate assumptions, we can choose $\{f_i\}_{i=1}^K$, the top $K$ eigenfunctions of $\Sigma$, to be real-valued and orthonormal in $L^2(\mathcal{X},\mathbb{P})$. Then we can define $V_1:\mathbb{C}^K\to\mathcal{H}$ by $V_1\alpha=\sum_{i=1}^K\alpha_if_i$. Similarly, we define $\widehat V_1$ with $\{\widehat f_i\}_{i=1}^K$, the extensions of the top $K$ orthonormal eigenvectors of $K_n$ according to (B.2). We are now ready to apply our general theory to prove the following result, which is similar to Theorem 3.2.

Theorem B.2.

Under the general assumptions stated below, there exist constants $C_6,C_7$ determined by $\mathcal{X}$, $\mathbb{P}$, and $k(\cdot,\cdot)$ such that whenever the sample size satisfies $n\geq C_6\tau$ for some $\tau>1$, we have with probability at least $1-6e^{-\tau}$,

$\inf\big\{\|V_1-\widehat V_1Q\|_{2\to\infty}:Q\in\mathbb{U}^K\big\}\leq C_7\frac{\sqrt\tau}{\sqrt n}.$   (B.3)

The general assumptions referred to in Theorem B.2 are the following.

General Assumptions. The set $\mathcal{X}$ is a separable topological space. The kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is continuous, symmetric, positive semi-definite, and

$\sup_{x\in\mathcal{X}}k(x,x)<\infty.$   (B.4)

Treated as an operator from $\mathcal{H}$ to $\mathcal{H}$, $\Sigma$ has eigenvalues satisfying $\lambda_1\geq\ldots\geq\lambda_K>\lambda_{K+1}\geq\ldots\geq0$. The top $K$ eigenfunctions of $\Sigma$ satisfy $\{f_i\}_{i=1}^K\subset C_b(\mathcal{X})$.

Condition (B.4) ensures that all of the operators we work with are Hilbert–Schmidt, and further guarantees concentration of bounded random elements in the Hilbert space of Hilbert–Schmidt operators. Separability of $\mathcal{X}$ and continuity of $k(\cdot,\cdot)$ ensure that the RKHS $\mathcal{H}$ is separable by Lemma 4.33 of [34].

B.1 Overview of the proof

The proof of Theorem B.2 is simpler than that of Theorem 3.2 because we can work directly with the reproducing kernel Hilbert space $\mathcal{H}$ associated with $k(\cdot,\cdot)$. We shall show that $\Sigma_n-\Sigma$, as an operator from $\mathcal{H}$ to $\mathcal{H}$, has Hilbert–Schmidt norm tending to zero as $n$ goes to infinity. Recall that the columns of $\widehat V_1$ are only orthonormal in $L^2(\mathcal{X},\mathbb{P}_n)$, while the general theory requires $\widetilde V_1$, whose columns are orthonormal in $L^2(\mathcal{X},\mathbb{P})$. As in the proof of Theorem 3.2, we therefore need to deal with the error induced by the distinction between orthonormality in $L^2(\mathcal{X},\mathbb{P}_n)$ and in $L^2(\mathcal{X},\mathbb{P})$, which is one of the key steps of that proof.

The rigorous treatment is presented in four parts. In part one, we introduce $\mathcal{L}_{HS}(\mathcal{H})$, the Hilbert space of Hilbert–Schmidt operators from $\mathcal{H}$ to $\mathcal{H}$, and justify the random elements in $\mathcal{H}$ and $\mathcal{L}_{HS}(\mathcal{H})$. In part two, we build up concentration results in the Hilbert space $\mathcal{L}_{HS}(\mathcal{H})$. In part three, we check the remaining conditions required by our general theory. In part four, we put all the ingredients together and apply our general theory to complete the proof.

B.2 Part one: the space $\mathcal{L}_{HS}(\mathcal{H})$ and random elements in $\mathcal{H}$ and $\mathcal{L}_{HS}(\mathcal{H})$

The space $\mathcal{L}_{HS}(\mathcal{H})$ collects all Hilbert–Schmidt operators from $\mathcal{H}$ to $\mathcal{H}$; it is itself a Hilbert space. Before justifying the covariance operator, we first define the mean element in $\mathcal{H}$ and the cross-covariance operator in $\mathcal{L}_{HS}(\widetilde{\mathcal{H}},\mathcal{H})$.

The mean element in $\mathcal{H}$ is defined as $\mu_X=\mathbb{E}_{X\sim\mathbb{P}}[k(\cdot,X)]\in\mathcal{H}$, the element satisfying $\langle f,\mu_X\rangle_{\mathcal{H}}=\mathbb{E}_{X\sim\mathbb{P}}[f(X)]$ for every $f\in\mathcal{H}$. Let $(\mathcal{Y},\mathcal{B}_{\mathcal{Y}},\mathbb{Q})$ be another probability space and let $\widetilde{\mathcal{H}}$ be a reproducing kernel Hilbert space associated with a kernel $\widetilde k(\cdot,\cdot)$ containing $\widetilde k(\cdot,y)$, $y\in\mathcal{Y}$, as its elements. The cross-covariance operator $C_{X,Y}=\mathbb{E}_{X\sim\mathbb{P},Y\sim\mathbb{Q}}[k(\cdot,X)\otimes\widetilde k(\cdot,Y)]$ is a Hilbert–Schmidt operator from $\widetilde{\mathcal{H}}$ to $\mathcal{H}$ such that for any $f\in\mathcal{H}$ and $g\in\widetilde{\mathcal{H}}$, $\langle f,C_{X,Y}g\rangle_{\mathcal{H}}=\mathbb{E}_{X\sim\mathbb{P},Y\sim\mathbb{Q}}[f(X)g(Y)]$. Finally, the covariance operator is $\Sigma=\mathbb{E}\,k_X\otimes k_X=C_{X,X}\in\mathcal{L}(\mathcal{H})$.

The covariance operator is indeed Hilbert–Schmidt because $\|\mathbb{E}\,k_X\otimes k_X\|_{HS}\leq\mathbb{E}\|k_X\otimes k_X\|_{HS}=\mathbb{E}[k(X,X)]\leq\sup_{x\in\mathcal{X}}k(x,x)<\infty$ by the general assumptions.
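The defining property of the mean element has a direct empirical analogue: for $\mu_n=\frac1n\sum_ik(\cdot,X_i)$ and any $f\in\mathcal{H}$, the reproducing property gives $\langle f,\mu_n\rangle_{\mathcal{H}}=\frac1n\sum_if(X_i)$. The sketch below (an illustration under arbitrary choices of kernel, data, and $f$; the names are ours) checks this identity for an $f$ in the span of a few kernel sections:

import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 5
X = rng.normal(size=(n, 1))            # X_i ~ P
Y = rng.normal(size=(m, 1))            # anchor points defining f
c = rng.normal(size=m)                 # f = sum_j c_j k(., Y_j) lies in H

def kernel(A, B, bw=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

# <f, mu_n>_H expands, via the reproducing property, to (1/n) sum_i sum_j c_j k(Y_j, X_i)
lhs = c @ kernel(Y, X).mean(axis=1)
rhs = (kernel(X, Y) @ c).mean()        # (1/n) sum_i f(X_i)
print(np.isclose(lhs, rhs))            # True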

B.3 Part two: concentration in the Hilbert space $\mathcal{L}_{HS}(\mathcal{H})$

In part one we showed that $\Sigma$ is a Hilbert–Schmidt operator from $\mathcal{H}$ to $\mathcal{H}$. In this subsection, we show concentration of $\|\Sigma_n-\Sigma\|_{HS}$ using Lemma 4.5.

Lemma B.3.

Under the general assumptions, with probability at least $1-2e^{-\tau}$, we have

$\Big\|\frac1n\sum_{i=1}^nk_{X_i}\otimes k_{X_i}-\mathbb{E}\,k_X\otimes k_X\Big\|_{HS}\leq C_{23}\frac{\sqrt\tau}{\sqrt n}$   (B.5)

for some constant $C_{23}$.

Proof.

We denote $M:=\sqrt{\sup_{x\in\mathcal{X}}k(x,x)}<\infty$. For any $i\in[n]$, we have

$\|k_{X_i}\otimes k_{X_i}-\mathbb{E}\,k_X\otimes k_X\|_{HS}\leq\|k_{X_i}\otimes k_{X_i}\|_{HS}+\|\mathbb{E}\,k_X\otimes k_X\|_{HS}=\|k_{X_i}\|_{\mathcal{H}}^2+\|\mathbb{E}\,k_X\otimes k_X\|_{HS}=k(X_i,X_i)+\|\mathbb{E}\,k_X\otimes k_X\|_{HS}\leq M^2+\|\mathbb{E}\,k_X\otimes k_X\|_{HS},$

which implies that the $k_{X_i}\otimes k_{X_i}-\mathbb{E}\,k_X\otimes k_X$ are bounded, zero-mean, independent random elements of $\mathcal{L}_{HS}(\mathcal{H})$.

We can take $C_{23}=\sqrt2\,(M^2+\|\mathbb{E}\,k_X\otimes k_X\|_{HS})$. Then by Lemma 4.5, we have

$\Big\|\frac1n\sum_{i=1}^nk_{X_i}\otimes k_{X_i}-\mathbb{E}\,k_X\otimes k_X\Big\|_{HS}=\Big\|\frac1n\sum_{i=1}^n(k_{X_i}\otimes k_{X_i}-\mathbb{E}\,k_X\otimes k_X)\Big\|_{HS}\leq C_{23}\frac{\sqrt\tau}{\sqrt n}$

with probability at least $1-2e^{-\tau}$. ∎
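Lemma B.3 is easy to visualize when the kernel has a finite-dimensional feature map, so that $\Sigma_n$ and $\Sigma$ are matrices and $\|\cdot\|_{HS}$ is the Frobenius norm. The sketch below is our own illustration, with $k(x,y)=\varphi(x)\cdot\varphi(y)$ for $\varphi(x)=(\cos x,\sin x)$ and $X\sim\mathrm{Uniform}(0,2\pi)$, so that $\Sigma=\frac12I$; it checks that $\sqrt n\cdot\mathbb{E}\|\Sigma_n-\Sigma\|_{HS}$ stays bounded, consistent with (B.5).

import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # explicit feature map: k(x, y) = phi(x) . phi(y), with sup_x k(x, x) = 1
    return np.stack([np.cos(x), np.sin(x)], axis=-1)

Sigma = 0.5 * np.eye(2)                # E[phi(X) phi(X)^T] for X ~ Uniform(0, 2 pi)
for n in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(200):
        F = phi(rng.uniform(0, 2 * np.pi, size=n))   # n x 2 feature matrix
        Sigma_n = F.T @ F / n                        # empirical covariance operator
        errs.append(np.linalg.norm(Sigma_n - Sigma, "fro"))
    print(n, np.sqrt(n) * np.mean(errs))             # roughly constant, as in (B.5)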

B.4 Part three: checking conditions for the general theory

Lemma B.4.

Under the general assumptions, the following facts hold true:

  1. The reproducing kernel Hilbert space $\mathcal{H}$ is a subspace of $L^2(\mathcal{X},\mathbb{P})$.

  2. The $\mathcal{H}$ norm $\|\cdot\|_{\mathcal{H}}$ is stronger than the infinity norm $\|\cdot\|_\infty$.

  3. Both $\Sigma$ and $\Sigma_n$ are Hilbert–Schmidt from $\mathcal{H}$ to $\mathcal{H}$.

  4. All eigenvalues of $\Sigma$ (counting multiplicity) can be arranged in a decreasing (possibly infinite) sequence of non-negative real numbers $\lambda_1\geq\lambda_2\geq\ldots\geq\lambda_K>\lambda_{K+1}\geq\ldots\geq0$ with a positive gap between $\lambda_K$ and $\lambda_{K+1}$.

  5. The top $K$ eigenfunctions satisfy $\{f_i\}_{i=1}^K\subset\mathcal{H}$ and can be chosen to form an orthonormal set of functions in $L^2(\mathcal{X},\mathbb{P})$.

  6. $\Sigma_n$ has a sequence of non-increasing, real, non-negative eigenvalues.

Proof.

Recall that the general assumptions entail $M:=\sqrt{\sup_{x\in\mathcal{X}}k(x,x)}<\infty$. The kernel $k(\cdot,\cdot)$ is therefore a Mercer kernel and satisfies $\int_{\mathcal{X}\times\mathcal{X}}k^2(x,y)\,d\mathbb{P}(x)\,d\mathbb{P}(y)<\infty$. Item 1 is then an implication of Mercer's theorem.

For item 2, for any $f\in\mathcal{H}$, the reproducing property and the Cauchy–Schwarz inequality give $|f(x)|^2=\langle f,k(\cdot,x)\rangle_{\mathcal{H}}^2\leq k(x,x)\|f\|_{\mathcal{H}}^2\leq M^2\|f\|_{\mathcal{H}}^2$. Therefore $\|f\|_\infty\leq M\|f\|_{\mathcal{H}}$.

Item 3 was checked in an intermediate step of the proof of Lemma B.3.

Items 4 and 5 are ensured by Mercer's theorem.

Item 6 holds because of the relationship between the spectrum of $\Sigma_n$ and that of the symmetric positive semi-definite kernel matrix $K_n$, which was established in Lemma B.1. ∎

B.5 Part four: putting all ingredients together

To complete the proof of Theorem B.2, we must deal with the error induced by the operator $\widehat V_1$ having columns orthonormal only in $L^2(\mathcal{X},\mathbb{P}_n)$ and not in $L^2(\mathcal{X},\mathbb{P})$. This can be accomplished using the same trick as in part five of the proof of Theorem 3.2, so we omit the technical redundancy here.