On uniform consistency of spectral embeddings
Abstract
In this paper, we study the convergence of the spectral embeddings obtained from the leading eigenvectors of certain similarity matrices to their population counterparts. We opt to study this convergence in a uniform (instead of average) sense and highlight the benefits of this choice. Using the Newton-Kantorovich Theorem and other tools from functional analysis, we first establish a general perturbation result for orthonormal bases of invariant subspaces. We then apply this general result to normalized spectral clustering. By tapping into the rich literature of Sobolev spaces and exploiting some concentration results in Hilbert spaces, we are able to prove a finite sample error bound on the uniform consistency error of the spectral embeddings in normalized spectral clustering.
1 Introduction
Spectral methods are a staple of modern statistics. For statistical learning tasks such as clustering or classification, one can featurize the data with spectral methods and then perform the task on the features. In the past twenty years, spectral methods have seen wide applications in image segmentation [33], novelty detection [18], community detection [13], and bioinformatics [17], and their effectiveness is partly credited to their ability to reveal the latent low-dimensional structure in the data.
Spectral embedding gets its name from the fact that the embeddings are constructed from the spectral decomposition of a positive-definite matrix. For example, in normalized spectral clustering [27], the normalized Laplacian embedding is given by
(1.1)
where are the observations, the ’s are the standard basis vectors (all zeros except a one in the -th entry), is the desired dimension of the embedding, and the columns of are the leading eigenvectors of the normalized Laplacian matrix. As described, spectral embeddings are only defined on points in the training data, but it is possible to evaluate them on points that are not in the training data through out-of-sample extensions [4, 41]. Some other examples of spectral methods are Isomap [36], Laplacian [3] and Hessian eigenmaps [14], and diffusion maps [10].
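To make the construction in (1.1) concrete, the following minimal numerical sketch computes the normalized Laplacian embedding from a Gaussian similarity matrix; the kernel choice, the bandwidth, and the function names are illustrative assumptions rather than choices made in this paper.

import numpy as np

def gaussian_kernel(X, bandwidth=1.0):
    # Pairwise similarities K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def normalized_laplacian_embedding(X, dim, bandwidth=1.0):
    # Rows of the returned matrix are the embedded sample points.
    K = gaussian_kernel(X, bandwidth)
    d = K.sum(axis=1)                      # degrees
    L = K / np.sqrt(np.outer(d, d))        # D^{-1/2} K D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :dim]       # leading `dim` eigenvectors

# Example: two well-separated Gaussian blobs embed into two nearly orthogonal directions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
print(normalized_laplacian_embedding(X, dim=2).shape)  # (100, 2)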
Since downstream procedures take the embeddings as input, it is imperative that the embeddings have certain consistency properties to ensure the quality of the ultimate output. Specifically, we ask
• In the large sample limit, do the embedded representations of the data “converge” to certain population level representations?
• If the embedded representations do converge, in what sense do they converge?
While there are many results on the convergence of eigenvalues and spectral projections, few results directly address the convergence of the embedded representations in a general setting; the notable exception is von Luxburg, Belkin and Bousquet [41]. This is a gap in the literature because it is the embedded representation, not the spectral projections or the eigenvalues, that is the input to downstream applications. In this paper, we address the two questions and provide direct answers: we show that the sample level embeddings converge uniformly to their population counterparts up to a unitary transformation. We improve the result of von Luxburg, Belkin and Bousquet [41] by considering multidimensional embeddings and allowing for non-simple eigenvalues.
For a concrete application of our result, let us return to spectral clustering. The population counterpart of the normalized Laplacian embedding is given by
where are the leading eigenfunctions of the normalized Laplacian operator [41, 30]. As is shown in von Luxburg, Belkin and Bousquet [41], the normalized Laplacian matrix has an operator counterpart that we shall refer to as the empirical normalized Laplacian operator. Let be the leading eigenfunctions of this operator, and define the embedding
(1.2)
The embedding coincides with on the sample points, i.e. for all . We shall show that the sample level embedding converges uniformly to its population counterpart:
(1.3)
where is some metric on . This implies converges uniformly to the restriction of to the sample points.
1.1 Main results
In this section, we state our results in an informal manner. These results are made precise and proved in subsequent sections.
Our first main result concerns the effect of perturbation on the invariant subspace of an operator. It serves as a general recipe for establishing uniform consistency type results. Although in statistics and machine learning, we mainly work with real-valued functions, our main spectral perturbation result is stated for complex-valued functions. This choice is technically convenient because the complex numbers are algebraically closed while the real numbers are not. In most applications of the result, the complex-valued functions only take real values.
Suppose is a complex Hilbert space whose elements are bounded complex-valued continuous functions over a domain . Let be two operators from to that are close in Hilbert-Schmidt norm. Let be the top eigenfunctions of and be those of . As long as and are appropriately normalized, we expect to be close to up to some unitary transformation. This is indeed the case and is characterized as follows by our first result.
Result 1 (General recipe for uniform consistency).
Define as and as . There are constants that only depend on such that as long as , we have
(1.4)
where is the space of unitary matrices in , and the -norm of an operator is defined as
It is not hard to notice the correspondence between (1.4) and (1.3): is the analogue of ; is the analogue of ; the two-to-infinity norm guarantees the convergence is uniform, and the distance metric is chosen to measure Euclidean distance up to a unitary transformation (normalized eigenfunctions of the same eigenvalue are only determined up to a unitary transformation). This observation justifies naming the uniform consistency error. It is also worth mentioning that the constant is inversely proportional to a measure of the eigengap between the -th and -th eigenvalues of . This provides further justification for studying the convergence of the leading eigenspace as a whole, rather than studying the convergence of the individual eigenspaces as in von Luxburg, Belkin and Bousquet [41]. Not only is the former more general and realistic, but it also leads to better constants. In many applications, the top eigenvalues are usually clustered together, but there is a large gap between the top eigenvalues and the rest of the spectrum. Thus it is hard to estimate the corresponding eigenfunctions individually, but easy to estimate them altogether up to a unitary transformation.
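Numerically, the left-hand side of (1.4) is often upper bounded by aligning the two embedding matrices with the Frobenius-optimal orthogonal matrix from the orthogonal Procrustes problem and then taking the largest row norm of the residual. The sketch below follows that heuristic; the Procrustes rotation need not be the exact minimizer of the two-to-infinity norm, so it gives an upper bound, and the function name is ours.

import numpy as np

def uniform_consistency_error_ub(A, B):
    # A, B: (n, r) arrays whose rows are the sample and population embeddings.
    # Align B to A with the Frobenius-optimal orthogonal matrix (Procrustes),
    # then report the largest row-wise Euclidean error of the residual.
    U, _, Vt = np.linalg.svd(B.T @ A)
    W = U @ Vt
    return np.linalg.norm(A - B @ W, axis=1).max()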
Result 1 provides a general approach to proving uniform consistency: we simply need to bound the difference between the sample level operator and its population counterpart in an appropriate norm. The proof of Result 1 is also interesting in its own right. We identify the invariant subspace directly by solving an operator equation and appeal to the Newton-Kantorovich Theorem to characterize the solution. The main benefit of this approach is that it overcomes the limitations of traditional approaches when working with non-unitarily invariant norms.
Our second main result is a finite sample uniform error bound for embedding in normalized spectral clustering. Let denote the space of bounded continuous complex-valued functions over . Define as where are the leading real-valued eigenfunctions of the normalized Laplacian operator. Define as where are defined as in (1.2) and real-valued. Applying Result 1, we obtain
Result 2 (Uniform consistency for normalized spectral clustering).
Under suitable conditions, there are constants that are independent of and the randomness of the sample, such that whenever the sample size for some , we have
with probability at least .
Although Result 2 is an application of Result 1, its proof is by no means simple. The main technical challenge is establishing concentration bounds for Hilbert-Schmidt operators. Result 2 suggests that the convergence rate, under appropriate conditions, is (modulo a log factor). Moreover, in the context of clustering, the notion of uniform consistency leads to stronger assurances about the correctness of the clustering output. For example, in spectral clustering, the points are clustered based on their embeddings. Uniform convergence implies the embeddings of all points are close to their population counterparts. As long as the error in the embeddings is small enough, it is possible to show that all points are correctly clustered. This is not possible if the embeddings only “converge in mean”: .
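To make the last claim concrete, here is a stylized sufficient condition; the separation and diameter assumptions below are ours, for illustration only, and we write $\hat\rho_n$ for the aligned sample embedding and $\rho$ for the population embedding. Suppose $\max_i \|\hat\rho_n(x_i)-\rho(x_i)\|\le\varepsilon$, population embeddings within a cluster are within distance $d$ of each other, and embeddings across clusters are at distance at least $s>d$. Then
\[
\|\hat\rho_n(x_i)-\hat\rho_n(x_j)\| \le d + 2\varepsilon \ \text{(same cluster)},
\qquad
\|\hat\rho_n(x_i)-\hat\rho_n(x_j)\| \ge s - 2\varepsilon \ \text{(different clusters)},
\]
so whenever $4\varepsilon < s - d$, every within-cluster distance is strictly smaller than every between-cluster distance and the clusters are recovered exactly. No such guarantee follows from an average-case bound, under which a small fraction of points may still be embedded far from their population positions.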
1.2 Related literature
Most closely related to our results are the works of von Luxburg, Belkin and Bousquet [41] and Rosasco, Belkin and Vito [29]. For normalized spectral clustering, von Luxburg, Belkin and Bousquet [41] proved the convergence of the eigenvalues and spectral projections of the sample level operator to their population counterparts. They also established uniform convergence to their population counterparts of eigenfunctions whose corresponding eigenvalue has multiplicity one. Our results are in the same vein as theirs in that we also study uniform convergence of eigenfunctions. We improve upon their uniform convergence result by considering multiple eigenfunctions at once and allowing for non-simple eigenvalues. In the context of unnormalized spectral clustering, Rosasco, Belkin and Vito [29] studied the convergence rate of the -distance between the ordered spectrum of the sample level operator and that of the population operator, and derived a finite sample bound for the deviation between the sample level and population level spectral projections associated with the top eigenvalues. They also obtained a finite sample spectral projection error bound for the asymmetric normalized graph Laplacian. Our work is related to theirs in that both study the convergence of the leading eigenspace, and we owe much of our concentration results to them. At the same time, the two works are quite distinct. Firstly, our notion of convergence is uniform consistency of the eigenfunctions, while theirs is in terms of the induced RKHS norm between the spectral projections. To the best of our knowledge, it is non-trivial to establish one set of results from the other. Secondly, we study the normalized symmetric graph Laplacian, while they study the unnormalized graph Laplacian and the asymmetric normalized graph Laplacian.
The general relationship between the spectral properties of an empirical operator/matrix and those of its population counterpart has also been studied in other contexts. In Koltchinskii and Giné [22], it is proved that the -distance between the ordered spectra of an integral operator and its empirical version tends to zero almost surely if and only if the kernel is square integrable. Convergence rates and distributional limits were also obtained under stronger conditions. In Koltchinskii [21], this line of results is extended by proving laws of large numbers and central limit theorems for quadratic forms induced by spectral projections. The investigation of spectrum convergence is continued in Mendelson and Pajor [25] and Mendelson and Pajor [26], where the authors related various distance metrics between two ordered spectra to the deviation of the sample mean of i.i.d. rank-one operators from its population mean. Similar problems have also been studied in the kernel principal component analysis (KPCA) literature. For example, in Shawe-Taylor et al. [31] and Blanchard, Bousquet and Zwald [5], the concentration properties of the sum of the top eigenvalues and the sum of all but the top eigenvalues of the empirical kernel matrix are studied, because such partial sums are closely related to the reconstruction error of KPCA. In Zwald and Blanchard [42], a finite sample error bound on the difference between the projection operator onto the leading eigenspace of the empirical covariance operator and that onto the leading eigenspace of the population covariance operator is derived. We remark that none of the results mentioned in this paragraph addressed the consistency of the embedding directly, nor did any consider the kernel matrix normalized by the degree matrix.
Unlike our results, which are model-agnostic, the properties of spectral methods have also been studied in model-specific settings. For example, Rohe, Chatterjee and Yu [28] and Lei and Rinaldo [23] investigated the spectral convergence properties of the graph Laplacian and the consistency of spectral clustering in terms of community membership recovery under stochastic block models. When the data are sampled from a finite mixture of nonparametric distributions, Shi, Belkin and Yu [32] studied how the leading eigenfunctions and eigenvectors of the population level integral operator can reflect clustering information; Schiebinger, Wainwright and Yu [30] studied the geometry of the embedded samples generated by normalized spectral clustering and showed that the embedded samples for different clusters are approximately orthogonal when the mixtures have small overlap and the sample size is large. We remark that in all the results mentioned in this section so far, the kernel function is fixed. For the relationship between the graph Laplacian and the Laplace-Beltrami operator on a manifold, and the properties of spectral clustering when the kernel is chosen adaptively, we refer readers to the series of works by Trillos et al. [38, 39, 37] and the references therein.
Lastly, entrywise or row-wise analysis of eigenvectors and eigenspaces of matrices has been studied in the recent literature. For general purposes, deterministic bounds are derived by Fan, Wang and Zhong [16], Cape, Tang and Priebe [7], and Damle and Sun [12], where the first two are for rectangular matrices and the last is for symmetric matrices. When probabilistic assumptions are imposed on the true and perturbed matrices, Cape, Tang and Priebe [8], Abbe et al. [1], and Mao, Sarkar and Chakrabarti [24] obtain stronger bounds for various tasks by taking advantage of the structure of the random matrices. Compared to this literature, we remark that our work provides a deterministic bound rooted in the perturbation theory of linear operators, and the bound can be applied to many problems in statistics (e.g., spectral clustering and kernel PCA) to help characterize the spectral embedding of individual samples.
1.3 Main contributions
We view our main contributions as threefold and list them in the order of appearance. First, we demonstrate that the Newton-Kantorovich Theorem provides a general approach to studying the effect of local perturbations on the invariant spaces of an operator. This result may be of independent interest to researchers working on spectral perturbation theory. Second, we study the convergence of the embeddings via the uniform consistency error and offer a general recipe for establishing non-asymptotic uniform consistency type results that handles multiple eigenfunctions at once and is not limited to simple eigenvalues. Third, we apply our recipe to normalized spectral clustering and give a novel proof of a finite sample error bound on the uniform consistency error of the spectral embeddings.
1.4 Structure of the paper
The rest of the paper is organized as follows: a review of relevant mathematical preliminaries is provided in Section 2; the exact statement and proof of Result 1 are in Section 3; the exact statement and proof of Result 2 are in Section 4; a discussion of various issues relevant to our results is in Section 5; proofs of some secondary lemmas and an additional application are relegated to the appendix.
2 Preliminaries and notations
In this section, we discuss various basic concepts and preliminary results that will be used repeatedly throughout the paper. More technical results that are section specific shall be introduced as needed later in the paper.
2.1 Operator theory
We assume readers are familiar with basic concepts such as Banach spaces, Hilbert spaces, linear operators, operator norms, and spectra of operators. From now on, we let denote either the field of real numbers or the field of complex numbers, denote Banach spaces over the same field , and denote Hilbert spaces over the same field .
We would like to first highlight a nuance in the definition of a linear operator. For a linear operator , we adopt the convention from Kato [20] and allow it to be defined only on a linear manifold in (in Kato [20], linear manifold is just a synonym for affine subspace), denoted (in Ciarlet [9], for example, such a distinction is not made). We call the domain of and can naturally define the range of as . As for , we call them the domain space and the range space respectively.
For a linear operator , we say is bounded if , and when is bounded, we define its operator norm . Throughout the paper, when has no subscript, it defaults to the operator norm. We use to denote the space of all bounded linear operators from to . When , we simply write as . We say is a compact operator if the closure of the image of any bounded set in under is compact. It is known that compact operators are bounded.
For a bounded linear operator , define its adjoint as the unique operator from to satisfying for . Here, we use to denote the inner product in the Hilbert space . A basic property of is , where the norm is operator norm and we use the notation to explicitly specify the domain space and range space. When , is called self-adjoint if is equal to its adjoint , and is called positive if for any , .
We say a Hilbert space is separable if it has a basis of countably many elements. We say a bounded linear operator is Hilbert-Schmidt if where is an orthonormal basis of . We use to denote the space of all Hilbert-Schmidt operators from to ; this space is also a Hilbert space with respect to the inner product . We use to denote the norm induced by this inner product and note that all Hilbert-Schmidt operators are compact. We also note the Hilbert-Schmidt norm is stronger than operator norm in that , and the Hilbert-Schmidt norm is compatible with the operator norm in the following sense: for any Hilbert-Schmidt operator and bounded operator , their product and are Hilbert-Schmidt and their Hilbert-Schmidt norm satisfies
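In standard notation (the symbols $T$, $S$ here are generic: $T$ Hilbert-Schmidt and $S$ bounded), these compatibility relations read
\[
\|T\|_{\mathrm{op}} \le \|T\|_{\mathrm{HS}},
\qquad
\|TS\|_{\mathrm{HS}} \le \|T\|_{\mathrm{HS}}\,\|S\|_{\mathrm{op}},
\qquad
\|ST\|_{\mathrm{HS}} \le \|S\|_{\mathrm{op}}\,\|T\|_{\mathrm{HS}}.
\]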
2.2 Spectral theory for linear operators
In this subsection, we set . Let be a bounded linear operator. Similar to matrices, we say is an eigenvalue of if for some eigenvector ,
In other words, is an eigenvalue if the null space is not . We call the eigenspace associated with , and the dimension of is called the geometric multiplicity of . The spectrum of is defined as , where is the resolvent set
Eigenvalues are in the spectrum, but generally contains more than just eigenvalues. If is a compact operator, has the following structure: is a countable set of isolated eigenvalues, each with finite geometric multiplicity, and the only possible accumulation point of is . If is self-adjoint, then all the eigenvalues must be real. If is a positive operator, then all its eigenvalues are real and non-negative. Therefore, for any compact positive self-adjoint operator, we can arrange the non-zero eigenvalues of into a non-increasing sequence of positive numbers (the largest eigenvalue is bounded by the operator norm of ), and repeat each eigenvalue a number of times equal to its geometric multiplicity.
Another remarkable fact in the spectral theory of linear operators concerns spectral projections. Let be a closed simple rectifiable curve. Assume the part of enclosed inside consists of a finite number of eigenvalues . Then the projection which projects to the direct sum of the eigenspaces of , i.e. , can be defined. Technicalities aside, this projection has the following contour integration expression
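In standard notation, with $T$ the operator in question and $\Gamma$ the curve, this is the Riesz projection
\[
P \;=\; \frac{1}{2\pi i}\oint_{\Gamma} (zI - T)^{-1}\,dz .
\]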
2.3 Function spaces
Let be a bounded open subset of , we now define several function spaces we are going to work with. Define the space of bounded continuous functions as
It can be shown that is a norm on and is a Banach space with respect to this infinity norm.
We can also define the space of complex-valued square integrable functions . Suppose is a measure space where is the Lebesgue -algebra and is a measure; then is defined as the set of measurable functions such that
In fact, is a Hilbert space with respect to the inner product
We also define , the space of square summable infinite sequences of complex numbers. It is well known that is a complex Hilbert space with respect to the inner product
2.4 Reproducing Kernel Hilbert Space (RKHS)
Let be a subset of and be a set of functions . Suppose is a Hilbert space with respect to some inner product . If in , all point evaluation functionals are bounded, i.e.
where is some constant depending on , then it can be shown that there exists a unique conjugate symmetric positive definite kernel function , such that the following reproducing property is satisfied:
The kernel is called the reproducing kernel and is called a reproducing kernel Hilbert space (RKHS).
We say a kernel function is positive definite if for any , any and any , the quadratic form
(2.1)
is non-negative. The kernel function for any RKHS is positive definite.
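In one common convention, with $k$ the reproducing kernel of a space $\mathcal{H}$ of functions on a set $\mathcal{X}$, the reproducing property and the quadratic form in (2.1) read
\[
f(x) = \langle f,\, k(\cdot,x)\rangle_{\mathcal H}\quad\text{for all } f\in\mathcal H,\ x\in\mathcal X,
\qquad
\sum_{i=1}^{n}\sum_{j=1}^{n} c_i\,\overline{c_j}\,k(x_i,x_j)\;\ge\;0 .
\]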
3 Uniform error bound for spectral embedding
In this section, we prove the first result described in the previous section. We first lay out the assumptions and notations. Let be a subset of and be a probability measure whose density function is supported on . Let denote the space of complex-valued square integrable functions on and be a subspace of . We assume is equipped with its own inner product and is a Hilbert space with respect to this inner product. We also require to be such that for every , which is an equivalence class in , there exists a representative function in the class such that . Since , is unique, and we can define the infinity norm on by setting . We require that on , the norm induced by the -inner product, denoted , be stronger than the infinity norm; that is, there is a constant such that for all . (This in fact implies is an RKHS, but since we do not use the reproducing property anywhere in the proof, we find framing as an RKHS unnecessary.)
Let and be two Hilbert-Schmidt operators from to ; can be seen as a perturbed version of and we use to denote their difference. Suppose all the eigenvalues of (counting geometric multiplicity) can be arranged in a non-increasing (possibly infinite) sequence of non-negative real numbers with a positive gap between and . Suppose the eigenvalues of can also be arranged in a non-increasing sequence of non-negative real numbers. We do not assume, however, any eigengap for .
Let be the eigenfunctions associated with eigenvalues . We assume are so picked that they constitute a set of orthonormal vectors in . We then pick so that constitute a complete orthonormal basis of . Define by and by . Define their adjoints with respect to the standard inner product on and . Since , we can also view as the range (domain) space of (). The exact range space of shall be clear from the context.
When the perturbation has small enough Hilbert-Schmidt norm, necessarily has an eigengap. In this case, the leading -dimensional invariant subspace of is well defined. We pick to be an orthonormal set of vectors in such that they span the leading invariant subspace of , and define as .
Last but not least, define and . They are intuitively the “coordinate space” for functions in under the basis in . Working with these coordinates simplifies our notation. The following facts regarding and hold true (with proofs in the appendix).
Lemma 3.1.
Assuming , we have
1. the set is equal to the set ;
2. is a subspace of ; it is also a Hilbert space with respect to the -induced inner product
3. , , , and , with operator norms satisfying
Because of item 1 of the lemma, we do not need to distinguish between and ; we denote both by . To keep notation manageable, define for any ; e.g. is shorthand for .
We also need the following quantities to define the constants in Result 1. Let be the boundary of the rectangle
(3.1)
Let denote the length of and define
(3.2)
which is necessarily finite. Define a measure of spectral separation
It is reasonable to expect that the larger the eigengap, the larger the . Moreover, when has only eigenvalues or is self-adjoint from to , the separation is provably lower bounded by the eigengap .
Define constant
We are now ready to state the main theorem of this section.
Theorem 3.2 (General recipe for uniform consistency).
Under the assumptions above, as long as as an operator from to has Hilbert-Schmidt norm , the uniform consistency error satisfies
Here, are two constants independent of the choice of defined as
(3.3)
(3.4)
We remark that since is inversely proportional to , it is beneficial to study the convergence of the leading eigenspace as a whole. When eigenspaces are treated individually, each eigenspace converges slowly because the leading eigenvalues may cluster together and we have a small , but when treated as a whole, we get a larger and thus faster convergence because the leading eigenvalues are well-separated from the rest of the spectrum.
The rest of the section is devoted to proving Theorem 3.2. The proof strategy is to express in terms of the solution of an operator equation and directly bound . We present the proof in five steps. In step one, we characterize the invariant subspace of in terms of the solution of a quadratic operator equation. In step two, we apply the Newton-Kantorovich Theorem to show this equation does have a solution when the perturbation is small. In step three, we introduce some additional conditions that guarantee the invariant space from step two is the leading invariant space. In step four, we directly bound the error term . In step five, we assemble all pieces together and prove Theorem 3.2. A similar approach was used in Stewart [35] to study the invariant subspace of matrices.
3.1 Step one: equation characterization of the invariant subspace
In this section, our goal is to find a such that the range of is an invariant subspace of . It turns out that any that satisfies the following quadratic operator equation suffices.
Proposition 3.3.
As long as satisfies the equation
(3.5)
the range of is an invariant subspace of .
Proof.
First, we note (3.5) is a well-defined equation of operators in . This can be seen from our assumption and item 3 of Lemma 3.1. Next, we assert that equation (3.5) implies (we do not differentiate equality in from equality in , because the two are equivalent)
(3.6)
which suggests that the range of is invariant under . To prove this assertion, note
∎
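For orientation, in Stewart-style invariant subspace perturbation theory, an equation of this kind takes the following block form: writing the perturbed operator with blocks $A_{11}, A_{12}, A_{21}, A_{22}$ relative to the unperturbed leading eigenspace and its orthogonal complement, the unknown (generically denoted $P$ here) solves the quadratic (Riccati-type) equation
\[
A_{21} + A_{22}P - PA_{11} - PA_{12}P = 0,
\]
after which the range of $\begin{pmatrix} I \\ P \end{pmatrix}$ is invariant under the perturbed operator. Equation (3.5) is of this general type, though the exact parametrization used here may differ; the block notation above is ours, for illustration.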
3.2 Step two: solve the equation with the Newton-Kantorovich Theorem
After characterizing the invariant subspace of in terms of a solution of (3.5), we apply the Newton-Kantorovich Theorem to prove a solution to (3.5) exists. The Newton-Kantorovich Theorem constructs a root of a function between Banach spaces when certain conditions on the function itself and its first and second order derivatives are met. The construction is algorithmic: the root is the limit point of a sequence of iterates generated by the Newton-Raphson method for root finding. The exact version of the Newton-Kantorovich Theorem we use is from the appendix of Karow and Kressner [19].
Theorem 3.4 (Newton-Kantorovich).
Let be Banach spaces and let be twice continuously differentiable in a sufficiently large neighborhood of . Suppose that there exists a linear operator with a continuous inverse and satisfying the following conditions:
(3.7)
(3.8)
(3.9)
If and , then there exists a solution of such that
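For orientation, one classical formulation of the theorem (stated here with generic constants, not necessarily matching the exact version from Karow and Kressner [19] used below) reads: with $L = F'(x_0)$ invertible, suppose
\[
\|L^{-1}F(x_0)\| \le \eta,
\qquad
\|L^{-1}F''(x)\| \le K \ \text{in a sufficiently large neighborhood of } x_0,
\qquad
h := K\eta \le \tfrac12 ;
\]
then $F$ has a root $x^{\ast}$ with $\|x^{\ast}-x_0\| \le \frac{1-\sqrt{1-2h}}{h}\,\eta \le 2\eta$.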
We are now ready to prove the proposition below, which states that when , , , are small relative to , equation (3.5) has a solution.
Proposition 3.5.
Let and . When and , there exists with such that equation (3.5) is satisfied.
Proof of Proposition 3.5.
After rearrangement, (3.5) is equivalent to
(3.10)
Let denote the space of bounded linear operators from to . Since is finite dimensional, any linear operator from to is bounded and Hilbert-Schmidt. We can thus use the Hilbert-Schmidt norm as the default norm on , and is a Hilbert space with respect to this norm. This fact also allows us to define as and as . Noting that the images of under are still in , we can verify that and are indeed well defined.
We assert that and that is one-to-one and onto, and defer the proof of this to a lemma. The implication of this is that is invertible with .
We are now ready to verify the three assumptions of Newton-Kantorovich theorem.
(A1):
(3.11)
(A2): The Fréchet derivative of at is given by
In particular, when ,
Consequently,
We thus have
(A3): The second order Fréchet derivative at is a linear operator in :
where is
Therefore the second derivative is a constant for every and we have,
(Conclusion:) With all assumptions in place, we apply the Newton-Kantorovich Theorem and conclude as follows. When
equation (3.5) has solution such that
(3.12)
∎
3.3 Step three: showing the invariant space is the leading eigenspace
In step two, we obtained an invariant subspace, but there is no guarantee that the invariant subspace we obtained is the leading -dimensional invariant subspace of . In this subsection, we give sufficient conditions to ensure this. When is small, we show several things must happen: first, the range of is dimensional; second, the eigenvalues of the restriction of to this subspace are contained in the interval for some small ; third, has exactly eigenvalues (counting geometric multiplicity) in the interval . These facts combined imply that the invariant subspace from Proposition 3.5 has to be the leading -dimensional invariant subspace.
The first point is not hard to show. Suppose the range of has fewer than dimensions; then there exists with such that . But since , when is small, the vector simply cannot be zero. Stated formally, we have
Lemma 3.6.
When , the range of is -dimensional.
As for the second point, which is to determine the eigenvalues of the restriction of , note that (3.5) implies has matrix representation in the basis . Since eigenvalues are not affected by the choice of bases, we know the eigenvalues of on the invariant space are those of . Next, we recall a perturbation result for eigenvalues.
Lemma 3.7.
Assuming , we have (addition is set addition)
(3.13)
Proof.
First note that . Suppose is a real eigenvalue of ; then there exists with such that . It thus follows
which suggests is within from at least one of . This is equivalent to the claim of (3.13). ∎
From the lemma, we know (assuming )
(3.14)
For the third point, we need the following result from Rosasco, Belkin and Vito [29] (their Theorem 20), which they credit to Anselone [2].
Theorem 3.8.
Let be a compact operator. Given a finite set of non-zero eigenvalues of , let be any simple rectifiable closed curve (having positive direction) with inside and outside. Let be the spectral projection associated with , that is,
(3.15)
and define
(3.16)
Let be another compact operator such that
(3.17)
where is the length of ; then the following facts hold true.
1. The curve is a subset of the resolvent set of enclosing a finite set of non-zero eigenvalues of ;
2. The dimension of the range of is equal to the dimension of the range of , where .
From the theorem above, we can take as in (3.1), i.e. as the boundary of the rectangle
(3.18)
For small enough, contains exactly the top eigenvalues of . Combining the three points, we obtain sufficient conditions for the range of to be the leading invariant subspace of .
Proposition 3.9.
Proof.
First, our choice of contains and only contains the top eigenvalues of . Next, since we assumed also has only real eigenvalues, Lemma 3.7 applies. So from (3.14), (3.19), and (3.20), we know
which is enclosed in . We also see from (3.19) that Lemma 3.6 applies. Finally, condition (3.21) on ensures Theorem 3.8 applies, so has only eigenvalues in . It thus follows the invariant subspace induced by is the -dimensional leading invariant subspace of . ∎
3.4 Step four: bound uniform consistency error
In this step, we bound the uniform consistency error
where has orthonormal columns spanning the leading invariant subspace of . Since we require , we need to orthonormalize the “columns” of . Let us define to be the adjoint of with respect to and . We can verify that for some (as we will see from Lemma 3.23, is well-defined when is small), because
Meanwhile, note that by assumption, is stronger than , i.e. for all . This implies for
The consequence of this is
(3.22)
We also need the following handy result.
Lemma 3.10.
Let have operator norm . Then
(3.23)
Proof.
Suppose . Note that , so the Hermitian matrix is invertible with spectrum in . Consequently, , so .
Similarly, we have . It remains to verify , which is easy. ∎
Now we have
Proposition 3.11.
Suppose and for some . We have
Proof.
3.5 Step five: put all pieces together
We combine the previous steps together and prove Theorem 3.2. To this end, we need the following lemma that relates , to .
Lemma 3.12.
Let
then for any
Proof.
Proof of Theorem 3.2.
Define as
We have by assumption. By Lemma 3.12, this assumption implies
for all . Thus
Proposition 3.5 guarantees (3.5) has a solution and
(3.24)
We check that our choice of satisfies conditions (3.21) and (3.20), so Proposition 3.9 implies the invariant space from Proposition 3.5 is the leading invariant space. Finally, (3.24) implies the conditions of Proposition 3.11 are satisfied, so we have
where .
∎
4 Application to normalized spectral clustering
In spectral clustering, we start from a subset , a probability measure on (assume the underlying -algebra of is the Lebesgue -algebra), and a continuous symmetric positive definite real-valued kernel function . After observing samples , we construct the matrix of their pairwise similarities: , and then normalize it to obtain the normalized Laplacian matrix
where and is the degree matrix. (The normalized Laplacian matrix is usually defined as , but the eigenvectors of and those of are identical, and it is more convenient to study .) It is possible to show that is symmetric and positive semi-definite, so it has an eigenvalue decomposition. We denote the eigenpairs of by and sort the eigenvalues in descending order:
In this paper, we normalize the eigenvectors of so that . The spectral embedding matrix is whose columns are .
Suppose for now that the kernel is bounded away from by a positive number and bounded from above. The operator counterpart of is the following operator, which can be shown to be a bounded linear operator in
(4.1)
where is the sample degree function. Although is introduced as an operator in , we remark that the defining element of is the integral form, and the domain space and range space need not be restricted to . In fact, the actual we shall work with is an operator between Hilbert spaces; is only chosen here for the ease of understanding. The same remark also applies to other operators we shall subsequently define.
The operator is the operator counterpart of because , where is the restriction operator defined as
In other words, if we identify functions with vectors by the restriction operator , “behaves as” . The eigenvalues and eigenvectors (eigenfunctions) of and are also closely related in the following sense.
Lemma 4.1.
Suppose the real-valued kernel function is continuous and bounded from below and above: . Let be defined as in (4.1) where the domain space and range space are both . If is a non-trivial eigenpair of (i.e. ), then is an eigenpair of . Conversely, if is an eigenpair of , then , where
(4.2)
is an eigenpair of with . Moreover, this choice of is such that and the restriction of onto sample points agrees with , i.e. .
Proof.
Let be an eigenpair of : . We check that is an eigenpair of :
Conversely, if is an eigenpair of , we check that is an eigenpair of :
It remains to check is indeed in . To this end, note that since the kernel function is continuous and bounded from above, we know . Since is bounded from below, we know is continuous and , so . Thus the average of such terms is also in . ∎
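For concreteness, the sketch below implements a Nyström-type extension of the flavor described in Lemma 4.1: an eigenvector of the normalized Laplacian matrix is extended to an arbitrary point through the kernel and the empirical degree function. The exact normalization (the $1/(n\lambda)$ factor and the degree scaling) is an assumption for illustration and may differ from (4.2) by constants; the function names are ours.

import numpy as np

def empirical_degree(x, X, kernel):
    # d_n(x) = (1/n) * sum_j k(x, x_j), the sample degree function.
    return np.mean([kernel(x, xj) for xj in X])

def extend_eigenvector(x_new, X, kernel, v, lam):
    # Extend an eigenvector v (with eigenvalue lam > 0) of the normalized
    # Laplacian matrix built from the sample X to a new point x_new.
    n = len(X)
    d_new = empirical_degree(x_new, X, kernel)
    d_sample = np.array([empirical_degree(xj, X, kernel) for xj in X])
    k_new = np.array([kernel(x_new, xj) for xj in X])
    weights = k_new / np.sqrt(d_new * d_sample)
    return weights @ v / (n * lam)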
The population version of is the normalized Laplacian operator
where is the (population) degree function. Under appropriate assumptions, it can be shown that we can choose , the top eigenfunctions of , to be real-valued and orthonormal in . We can thus define as . We can similarly define with , the extension of top eigenvectors of according to (4.2). Our goal in this section is to apply our general theory to prove the following result.
Theorem 4.2.
Under the general assumptions defined below, there exist that are determined by such that whenever the sample size for some , we have with confidence
The general assumptions referred to in Theorem 4.2 are
General Assumptions.
The set is a bounded connected open set in with a nice boundary. (We need the boundary to be quasi-resolved [6] for inequality (4.10) and for Lemma A.2 [15, 11]; we also need to satisfy the cone condition [6]. We omit the definitions of these conditions because they are very technical and not relevant to the main story of the paper.) The probability measure is defined with respect to Lebesgue measure and admits a density function . Moreover, there exist constants such that almost surely with respect to the Lebesgue measure. The kernel is symmetric, positive, and there exist constants such that for . Treated as an operator from to , the eigenvalues of satisfy . The top eigenfunctions of , . (The function space shall be defined in Section 4.2.)
4.1 Overview of the proof
The most challenging parts in applying the general theory are to identify the correct Hilbert space to work with, and to show that , as an operator from to , has Hilbert-Schmidt norm tending to zero as goes to infinity. It turns out that under the general assumptions, we may set to be a Sobolev space of sufficiently high order. As for bounding , we first decompose as the product of three operators. Let us define
(4.3)
(4.4)
Then . Similarly, we have where are the sample level versions of and , defined using and . We shall establish the concentration of to and of to , and invoke the triangle inequality to bound .
Although the general theory does all the heavy lifting, there is one additional step we must take to finish the full proof of Theorem 4.2. In our general theory, has columns orthonormal in that span the leading invariant space of . In Theorem 4.2, however, the same leading invariant space is spanned by the columns of , which are only orthonormal in . Morally speaking, when is large, and are roughly the same up to some unitary transformation, so switching from to should not inflate the consistency error by any order of magnitude. The exact error bound shall be obtained through a uniform law of large numbers.
The rigorous treatment shall be presented in five parts. In part one, we introduce the Sobolev space we work with and lay out its basic properties. In part two, we bound the norm of operator differences such as and and express in terms of them. In part three, we invoke concentration results in Hilbert spaces and relate the norm of operator differences to sample size. In part four, we check the remaining conditions required by our general theory and combine all previous pieces together. In part five, we deal with the error induced by the difference of and and complete the proof.
4.2 Part one: The Sobolev space
First recall that by assumption is a bounded connected open subset of with a nice boundary. Given , the Sobolev space of order is defined as
where is the (weak) derivative of with respect to the multi-index and is the complex Hilbert space of complex-valued functions square integrable under Lebesgue measure. The space is a separable Hilbert space with respect to the inner product
Let be the set of complex-valued continuous bounded functions such that all the derivatives up to order exist and are continuous bounded functions. The space is a Banach space with respect to the norm
Since is bounded, we know and where is a constant only depending on . We also know from the Sobolev embedding theorem (see Chapter 4.6 of Burenkov [6]) that for with , we have
(4.5)
where is a constant depending only on and .
Taking and , we see
with for for some constant . This norm relationship suggests that is an RKHS with a bounded kernel .
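For reference, the Sobolev embedding inequality behind (4.5) and the preceding display, in its usual form (generic symbols; the precise indices and constants are as in Burenkov [6]):
\[
\|f\|_{C^{m}} \;\le\; C(\Omega,s,m)\,\|f\|_{H^{s}(\Omega)}
\qquad\text{whenever } s-m > d/2 ,
\]
and in particular, taking $m=0$ and $s>d/2$, point evaluation is bounded on $H^{s}(\Omega)$.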
4.3 Part two: bounds on operator differences
Similar to (4.4), we define multiplication operators
(4.6)
(4.7)
In this subsection, we show and their operator norms are appropriately bounded, that is
Lemma 4.3.
Under the general assumptions, all the following operators are bounded linear operators in , and there exists a suitable constant such that
(4.8)
(4.9)
Proof.
Let . For any , clearly with . Since and are weighted averages of , it follows
Since inherit the pointwise bound from , we know with
Next, we know from Lemma 15 of Chapter 4 of Burenkov [6] that for and , we have and
(4.10)
We can use this inequality to prove and bound their operator norm. For example, plugging in into (4.10), and noticing for these choices of , because , we conclude
Note that by the embedding theorem, can be embedded into , so plugging in , we see
For the bound on , we follow essentially the same route. We first bound , then argue has pointwise lower and upper bound. It then follows that , and we see via (4.10) that . Taking as the maximum of to completes the proof. ∎
Lemma 4.4.
Under the general assumptions, we have
4.4 Part three: concentration in Hilbert space
In this subsection, we show are both Hilbert-Schmidt operators from to and establish some concentration results regarding and . With these results and Lemma 4.4, we will be able to bound . The required concentration bounds are obtained through the following result on concentration in (complex) Hilbert spaces (see Section 2.4 of Rosasco, Belkin and Vito [29]).
Lemma 4.5.
Let be zero mean independent random variables with values in a separable (complex) Hilbert space such that for all . Then with probability at least , we have
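A typical bound of this type, stated here with generic constants that need not match the exact version in Rosasco, Belkin and Vito [29]: if the $\xi_i$ are independent, zero mean, and $\|\xi_i\|_{\mathcal H}\le C$ almost surely, then with probability at least $1-\delta$,
\[
\Bigl\|\frac{1}{n}\sum_{i=1}^{n}\xi_i\Bigr\|_{\mathcal H}
\;\le\;
\frac{C}{\sqrt{n}} + C\sqrt{\frac{2\log(1/\delta)}{n}},
\]
which follows from $\mathbb{E}\,\bigl\|\sum_{i}\xi_i\bigr\|_{\mathcal H}^{2}=\sum_{i}\mathbb{E}\,\|\xi_i\|_{\mathcal H}^{2}\le nC^{2}$ together with McDiarmid's bounded differences inequality applied to the norm.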
With this lemma, we can show
Lemma 4.6.
Under the general assumptions, the following facts hold true:
1. For some constant , with confidence
2. Both and are Hilbert-Schmidt operators from to , and there exists some constant that doesn’t depend on such that their Hilbert-Schmidt norm is bounded.
3. For some constant , with confidence
Proof.
For item 1, consider random variables for . They are clearly zero mean. From the proof of Lemma 4.3, we see . We thus have
where is some constant depending on the Lebesgue measure of the bounded set . This suggests ’s are bounded. Since is a separable Hilbert space, we apply Lemma 4.5 and conclude that we have with probability
For item 2, let us fix any and consider the operator where . This operator is in fact a Hilbert-Schmidt operator from to . To see this, note that . With the same reasoning used for proving item 1, we see has a bound uniform on . It remains to show has a uniform bound. Let be the evaluation functional, i.e. . We know from the embedding theorem that for all . But also induces this point evaluation functional, so by the Riesz representation theorem, . Hence for some , for all . Now let be random. We see , i.e. is Hilbert-Schmidt. By the same reasoning, we see that the claim for in item 2 is also true.
For item 3, consider random variables . We know from item 2 that . Since is separable, the Hilbert space is also separable. We also know is zero mean and is bounded. We can thus apply Lemma 4.5 and conclude that we have with probability
for . ∎
Proposition 4.7.
Under the general assumptions, with probability , we have
for some constant .
Proof.
A union bound and a direct application of Lemma 4.4 suffice for the proof. ∎
4.5 Part four: checking conditions for the general theory
In the first three paragraphs of Section 3, we laid out the conditions that must be satisfied for our general theory to apply. We have already checked most of them implicitly in the previous three subsections, but for completeness, we summarize all such conditions here and prove them.
Lemma 4.8.
Under the general conditions, the following facts hold true:
1. The Sobolev space is a subspace of .
2. The norm is stronger than the infinity norm.
3. Both are Hilbert-Schmidt from to .
4. All eigenvalues of (counting multiplicity) can be arranged in a non-increasing (possibly infinite) sequence of non-negative real numbers with a positive gap between and .
5. The top eigenfunctions and can be picked to form an orthonormal set of functions in .
6. has a sequence of non-increasing, real, non-negative eigenvalues.
Proof.
For item 1, this is because is a subspace of and, under our assumptions on and , and are the same space. First of all, since the underlying -algebra of is the Lebesgue -algebra, the sets of measurable functions are the same. If , then is also in because . The converse is also true. It is not hard to see that our assumptions ensure the Lebesgue measure is absolutely continuous with respect to with the density being a.s. Noticing , we can prove the converse.
Item 2 is a consequence of the Sobolev embedding theorem and has been used time and again in the previous subsections. Item 3 is the joint consequence of Lemmas 4.3 and 4.6.
For items 4 and 5, we first show as an operator from to is positive, self-adjoint, and Hilbert-Schmidt. Let denote the normalized kernel. The self-adjointness is due to the (conjugate) symmetry inherited from :
We thus see , i.e. is self-adjoint. To see why is Hilbert-Schmidt, let real-valued functions be an orthonormal basis of . We calculate
The positive part is slightly more involved. To show is positive, we need to show for
Let us fix a sample size and draw i.i.d. samples . Then since the kernel is positive definite, the quadratic form
is non-negative regardless of what samples we draw. It thus follows that the expectation of this quadratic form is non-negative. A simple calculation suggests that the expectation is in fact
Since by our assumption and , we see is finite. Since can be arbitrarily large, must be non-negative.
According to the spectral theory for positive, self-adjoint, Hilbert-Schmidt operators we introduced in Section 2.2, we immediately see that most parts of items 4 and 5 are true. The remaining part to check for item 5 is that , which is implied by our assumption that . The eigengap part in item 4 is also covered by the general assumptions. A nuance in item 4 is that the eigenvalues and eigenvectors there are under the premise that is an operator from to . But since is a subspace of , we can only have fewer eigenvalues than when treating as an operator in . Moreover, since our general assumptions imply , the leading eigenspace remains unchanged after the restriction from to .
Item 6 holds because of the relationship between the spectrum of and that of the symmetric positive semi-definite kernel matrix . An analogue of Lemma 4.1 is also true with the therein replaced by
∎
Because of Lemma 4.8, we can apply a slightly modified version of Theorem 3.2 (see the proof of Theorem 3.2) to obtain the following.
Proposition 4.9.
For some constant , whenever , we have with confidence that
(4.11)
4.6 Part five: error induced by
We deal with the error induced by the operator not having orthonormal columns in . Introducing the shorthand , we have
(4.12)
Here, the equality in the second step is true because and span the same leading eigenspace (since is constructed from the eigenvectors of , which are linearly independent, the columns cannot be linearly dependent functions in ) and has orthonormal columns in . Inspecting (4.12), we see is bounded by Proposition 4.11, and is roughly thus bounded, so it all boils down to how “far” is from a unitary matrix in . In fact, we have
Lemma 4.10.
Assume all the singular values of are less than ; then there exists a unitary matrix such that
Proof.
Suppose admits singular value decomposition , then . Let be the -th column in , be the -th column in where is the standard basis in . We know where .
Since are unitary matrices, are orthonormal in and orthonormal in . So is orthogonal to . At the same time, from the diagonal structure of , we know is orthogonal to as well. This suggests is collinear with . On top of that, since the diagonal entries of are real positive values, we know . This in fact holds for all , i.e. . Taking , which is a unitary matrix, we have
(4.13)
By our assumption on the singular values, we know for all , . Note that for , , we see
(4.14)
Since the ’s depend on the samples , they are random. What they have in common is that they have bounded norms, because
Using Dudley’s inequality and standard results on covering numbers in Sobolev spaces, we can show (with proof in the appendix)
Lemma 4.11.
For our choice of , we have with probability
We are now ready to prove Theorem 4.2.
Proof of Theorem 4.2.
For fixed sample size , let be the event when the concentration in Proposition 4.7 holds, and let be the event when the concentration in Lemma 4.11 holds. From now on, we condition on the intersection , which happens with probability greater than or equal to .
First of all, on this event, we know Proposition 4.11 also holds. We thus have
So is close to . Since in Theorem 4.2, and we have the freedom of choosing , we can set large enough such that . Imitating the proof of Proposition 4.11, we can similarly show is on the order of . We can thus assume is also large enough to ensure .
Meanwhile, due to the uniform law of large numbers in Lemma 4.11, we can always let be greater than by setting large enough. The condition on the singular values in Lemma 4.10 is thus satisfied, and from it we see there exists a unitary matrix such that
where we assumed . Since must be large anyway for concentration results like Theorem 4.2 to be meaningful, this assumption is harmless.
5 Discussion
We would like to first comment on the relationship between Theorem 3.2 and the concentration of spectral projections (see Proposition 22 in Rosasco, Belkin and Vito [29]). Our result in fact easily implies the concentration of spectral projections. To see this, simply note the difference in projections can be written as and apply the triangle inequality. We believe it is possible to go from the concentration of spectral projections in Hilbert space to Theorem 3.2, but the road is treacherous. At a high level, we need to project an orthonormal basis of the leading invariant space of the perturbed operator onto that of the unperturbed operator, and then perform Gram-Schmidt on the projections. During this process, we need to convert back and forth from to , and we foresee countless petty and pesky technical details. But it is our belief that the concentration of spectral projections in induced operator norm is equivalent to Theorem 3.2.
We would also like to comment on the generality of the Newton-Kantorovich Theorem. By that, we mean the operator equation (3.5) need not be restricted to the space of . We can have slightly altered versions of (3.5) involving , , or that induce an invariant subspace and still apply the Newton-Kantorovich Theorem to solve them. For example, we should be able to replace every in this paper with and remake the proof to make everything go through. A word of caution is that to obtain operator norm convergence from the sample level operator to the population operator, the function space one works with has to have some kind of “smoothness”. Either the kind of smoothness from an RKHS or the kind from is fine, but spaces like or where functions may oscillate wildly while still having a small norm are not okay, because adversarial functions can be chosen to ruin operator norm convergence. This point was also mentioned in von Luxburg, Belkin and Bousquet [41].
Finally, we would like to comment on our complex-valued functions assumption and the fact that Theorem 3.2 needs a unitary matrix . We feel that since everything is real, the unitary matrix is an artifact rather than a necessity, and our proof could be altered so that only an orthogonal matrix is needed (although we do not know how at the moment). We have also checked that we can get by with real Hilbert or Banach spaces and real-valued functions for almost all lemmas and theorems except for Theorem 3.8. On the brighter side, working with complex numbers makes our result more general and gives us the freedom of using a complex-valued kernel function, although such freedom is rarely taken advantage of in statistics or machine learning. Last but not least, we wish to point out that due to length constraints, we only presented one application, normalized spectral clustering, but other applications of our general theory are possible. For example, uniform consistency results can be obtained for kernel PCA, and the proof is much simpler than that for normalized spectral clustering. We include such results in the appendix.
References
- Abbe, E., Fan, J., Wang, K. and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. Annals of Statistics 48, 1452.
- Anselone, P. M. (1971). Collectively Compact Operator Approximation Theory and Applications to Integral Equations. Automatic Computation. Prentice Hall.
- Belkin, M. and Niyogi, P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15, 1373–1396.
- Bengio, Y., Paiement, J.-F., Vincent, P., Delalleau, O., Le Roux, N. and Ouimet, M. (2003). Out-of-sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering. In Proceedings of the 16th International Conference on Neural Information Processing Systems (NIPS'03), 177–184. MIT Press, Cambridge, MA, USA.
- Blanchard, G., Bousquet, O. and Zwald, L. (2007). Statistical properties of kernel principal component analysis. Machine Learning 66, 259–294.
- Burenkov, V. I. (1998). Sobolev Spaces on Domains. Teubner-Texte zur Mathematik. B. G. Teubner Verlagsgesellschaft, Leipzig.
- Cape, J., Tang, M. and Priebe, C. E. (2019a). The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. The Annals of Statistics 47, 2405–2439.
- Cape, J., Tang, M. and Priebe, C. E. (2019b). Signal-plus-noise matrix models: eigenvector deviations and fluctuations. Biometrika 106, 243–250.
- Ciarlet, P. G. (2013). Linear and Nonlinear Functional Analysis with Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
- Coifman, R. R., Lafon, S., Lee, A. B., Maggioni, M., Nadler, B., Warner, F. and Zucker, S. W. (2005). Geometric Diffusions as a Tool for Harmonic Analysis and Structure Definition of Data: Diffusion Maps. Proceedings of the National Academy of Sciences 102, 7426–7431.
- Cucker, F. and Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society 39, 1–49.
- Damle, A. and Sun, Y. (2020). Uniform bounds for invariant subspace perturbations. SIAM Journal on Matrix Analysis and Applications 41, 1208–1236.
- Donetti, L. and Muñoz, M. A. (2004). Detecting network communities: a new systematic and efficient algorithm. Journal of Statistical Mechanics: Theory and Experiment 2004, P10012.
- Donoho, D. L. and Grimes, C. (2003). Hessian Eigenmaps: Locally Linear Embedding Techniques for High-Dimensional Data. Proceedings of the National Academy of Sciences 100, 5591–5596.
- Edmunds, D. E. and Triebel, H. (1996). Function Spaces, Entropy Numbers, Differential Operators. Cambridge Tracts in Mathematics. Cambridge University Press.
- Fan, J., Wang, W. and Zhong, Y. (2018). An eigenvector perturbation bound and its application to robust covariance estimation. Journal of Machine Learning Research 18, 1–42.
- Higgs, B. W., Weller, J. and Solka, J. L. (2006). Spectral embedding finds meaningful (relevant) structure in image and microarray data. BMC Bioinformatics 7, 74.
- Hoffmann, H. (2007). Kernel PCA for novelty detection. Pattern Recognition 40, 863–874.
- Karow, M. and Kressner, D. (2014). On a Perturbation Bound for Invariant Subspaces of Matrices. SIAM Journal on Matrix Analysis and Applications 35, 599–618.
- Kato, T. (1995). Perturbation Theory for Linear Operators. Classics in Mathematics. Springer-Verlag, Berlin Heidelberg.
- Koltchinskii, V. I. (1998). Asymptotics of Spectral Projections of Some Random Matrices Approximating Integral Operators. In High Dimensional Probability (E. Eberlein, M. Hahn and M. Talagrand, eds.), 191–227. Birkhäuser, Basel.
- Koltchinskii and Giné [2000] {barticle}[author] \bauthor\bsnmKoltchinskii, \bfnmVladimir\binitsV. and \bauthor\bsnmGiné, \bfnmEvarist\binitsE. (\byear2000). \btitleRandom matrix approximation of spectra of integral operators. \bjournalBernoulli \bvolume6 \bpages113–167. \endbibitem
- Lei and Rinaldo [2015] {barticle}[author] \bauthor\bsnmLei, \bfnmJing\binitsJ. and \bauthor\bsnmRinaldo, \bfnmAlessandro\binitsA. (\byear2015). \btitleConsistency of spectral clustering in stochastic block models. \bjournalAnn. Statist. \bvolume43 \bpages215–237. \bdoi10.1214/14-AOS1274 \endbibitem
- Mao, Sarkar and Chakrabarti [2021] {barticle}[author] \bauthor\bsnmMao, \bfnmXueyu\binitsX., \bauthor\bsnmSarkar, \bfnmPurnamrita\binitsP. and \bauthor\bsnmChakrabarti, \bfnmDeepayan\binitsD. (\byear2021). \btitleEstimating mixed memberships with sharp eigenvector deviations. \bjournalJournal of the American Statistical Association \bvolume116 \bpages1928–1940. \endbibitem
- Mendelson and Pajor [2005] {binproceedings}[author] \bauthor\bsnmMendelson, \bfnmS.\binitsS. and \bauthor\bsnmPajor, \bfnmA.\binitsA. (\byear2005). \btitleEllipsoid Approximation Using Random Vectors. In \bbooktitleLearning Theory (\beditor\bfnmPeter\binitsP. \bsnmAuer and \beditor\bfnmRon\binitsR. \bsnmMeir, eds.) \bpages429–443. \bpublisherSpringer Berlin Heidelberg, \baddressBerlin, Heidelberg. \endbibitem
- Mendelson and Pajor [2006] {barticle}[author] \bauthor\bsnmMendelson, \bfnmShahar\binitsS. and \bauthor\bsnmPajor, \bfnmAlain\binitsA. (\byear2006). \btitleOn singular values of matrices with independent rows. \bjournalBernoulli \bvolume12 \bpages761–773. \bdoi10.3150/bj/1161614945 \endbibitem
- Ng, Jordan and Weiss [2001] {binproceedings}[author] \bauthor\bsnmNg, \bfnmAndrew Y.\binitsA. Y., \bauthor\bsnmJordan, \bfnmMichael I.\binitsM. I. and \bauthor\bsnmWeiss, \bfnmYair\binitsY. (\byear2001). \btitleOn Spectral Clustering: Analysis and an Algorithm. In \bbooktitleAdvances in Neural Information Processing Systems \bpages849–856. \bpublisherMIT Press. \endbibitem
- Rohe, Chatterjee and Yu [2011] {barticle}[author] \bauthor\bsnmRohe, \bfnmKarl\binitsK., \bauthor\bsnmChatterjee, \bfnmSourav\binitsS. and \bauthor\bsnmYu, \bfnmBin\binitsB. (\byear2011). \btitleSpectral clustering and the high-dimensional stochastic blockmodel. \bjournalAnn. Statist. \bvolume39 \bpages1878–1915. \bdoi10.1214/11-AOS887 \endbibitem
- Rosasco, Belkin and Vito [2010] {barticle}[author] \bauthor\bsnmRosasco, \bfnmLorenzo\binitsL., \bauthor\bsnmBelkin, \bfnmMikhail\binitsM. and \bauthor\bsnmVito, \bfnmErnesto De\binitsE. D. (\byear2010). \btitleOn Learning with Integral Operators. \bjournalJournal of Machine Learning Research \bvolume11 \bpages905-934. \endbibitem
- Schiebinger, Wainwright and Yu [2015] {barticle}[author] \bauthor\bsnmSchiebinger, \bfnmGeoffrey\binitsG., \bauthor\bsnmWainwright, \bfnmMartin J.\binitsM. J. and \bauthor\bsnmYu, \bfnmBin\binitsB. (\byear2015). \btitleThe Geometry of Kernelized Spectral Clustering. \bjournalThe Annals of Statistics \bvolume43 \bpages819-846. \bdoi10.1214/14-AOS1283 \endbibitem
- Shawe-Taylor et al. [2005] {barticle}[author] \bauthor\bsnmShawe-Taylor, \bfnmJ.\binitsJ., \bauthor\bsnmWilliams, \bfnmC. K. I.\binitsC. K. I., \bauthor\bsnmCristianini, \bfnmN.\binitsN. and \bauthor\bsnmKandola, \bfnmJ.\binitsJ. (\byear2005). \btitleOn the eigenspectrum of the gram matrix and the generalization error of kernel-PCA. \bjournalIEEE Transactions on Information Theory \bvolume51 \bpages2510-2522. \bdoi10.1109/TIT.2005.850052 \endbibitem
- Shi, Belkin and Yu [2009] {barticle}[author] \bauthor\bsnmShi, \bfnmTao\binitsT., \bauthor\bsnmBelkin, \bfnmMikhail\binitsM. and \bauthor\bsnmYu, \bfnmBin\binitsB. (\byear2009). \btitleData spectroscopy: Eigenspaces of convolution operators and clustering. \bjournalAnn. Statist. \bvolume37 \bpages3960–3984. \bdoi10.1214/09-AOS700 \endbibitem
- Shi and Malik [2000] {barticle}[author] \bauthor\bsnmShi, \bfnmJ\binitsJ. and \bauthor\bsnmMalik, \bfnmJ\binitsJ. (\byear2000). \btitleNormalized cuts and image segmentation. \bjournalIEEE Transactions on Pattern Analysis and Machine Intelligence \bvolume22 \bpages888-905. \bdoi10.1109/34.868688 \endbibitem
- Steinwart and Christmann [2008] {bbook}[author] \bauthor\bsnmSteinwart, \bfnmIngo\binitsI. and \bauthor\bsnmChristmann, \bfnmAndreas\binitsA. (\byear2008). \btitleSupport vector machines. \bpublisherSpringer Science & Business Media. \endbibitem
- Stewart [1971] {barticle}[author] \bauthor\bsnmStewart, \bfnmG.\binitsG. (\byear1971). \btitleError Bounds for Approximate Invariant Subspaces of Closed Linear Operators. \bjournalSIAM Journal on Numerical Analysis \bvolume8 \bpages796-808. \bdoi10.1137/0708073 \endbibitem
- Tenenbaum [2000] {barticle}[author] \bauthor\bsnmTenenbaum, \bfnmJ. B.\binitsJ. B. (\byear2000). \btitleA Global Geometric Framework for Nonlinear Dimensionality Reduction. \bjournalScience \bvolume290 \bpages2319-2323. \bdoi10.1126/science.290.5500.2319 \endbibitem
- Trillos, Hoffmann and Hosseini [2019] {barticle}[author] \bauthor\bsnmTrillos, \bfnmNicolas Garcia\binitsN. G., \bauthor\bsnmHoffmann, \bfnmFranca\binitsF. and \bauthor\bsnmHosseini, \bfnmBamdad\binitsB. (\byear2019). \btitleGeometric structure of graph Laplacian embeddings. \bpagesarXiv:1901.10651. \endbibitem
- Trillos and Slepčev [2015] {barticle}[author] \bauthor\bsnmTrillos, \bfnmNicolás García\binitsN. G. and \bauthor\bsnmSlepčev, \bfnmDejan\binitsD. (\byear2015). \btitleA variational approach to the consistency of spectral clustering. \bpagesarXiv:1508.01928. \endbibitem
- Trillos et al. [2018] {barticle}[author] \bauthor\bsnmTrillos, \bfnmNicolas Garcia\binitsN. G., \bauthor\bsnmGerlach, \bfnmMoritz\binitsM., \bauthor\bsnmHein, \bfnmMatthias\binitsM. and \bauthor\bsnmSlepcev, \bfnmDejan\binitsD. (\byear2018). \btitleError estimates for spectral convergence of the graph Laplacian on random geometric graphs towards the Laplace–Beltrami operator. \bpagesarXiv:1801.10108. \endbibitem
- Vershynin [2018] {bbook}[author] \bauthor\bsnmVershynin, \bfnmRoman\binitsR. (\byear2018). \btitleHigh-Dimensional Probability: An Introduction with Applications in Data Science. \bseriesCambridge Series in Statistical and Probabilistic Mathematics. \bpublisherCambridge University Press. \bdoi10.1017/9781108231596 \endbibitem
- von Luxburg, Belkin and Bousquet [2008] {barticle}[author] \bauthor\bsnmvon Luxburg, \bfnmUlrike\binitsU., \bauthor\bsnmBelkin, \bfnmMikhail\binitsM. and \bauthor\bsnmBousquet, \bfnmOlivier\binitsO. (\byear2008). \btitleConsistency of Spectral Clustering. \bjournalThe Annals of Statistics \bvolume36 \bpages555-586. \bdoi10.1214/009053607000000640 \endbibitem
- Zwald and Blanchard [2006] {bincollection}[author] \bauthor\bsnmZwald, \bfnmLaurent\binitsL. and \bauthor\bsnmBlanchard, \bfnmGilles\binitsG. (\byear2006). \btitleOn the Convergence of Eigenspaces in Kernel Principal Component Analysis. In \bbooktitleAdvances in Neural Information Processing Systems 18 (\beditor\bfnmY.\binitsY. \bsnmWeiss, \beditor\bfnmB.\binitsB. \bsnmSchölkopf and \beditor\bfnmJ. C.\binitsJ. C. \bsnmPlatt, eds.) \bpages1649–1656. \bpublisherMIT Press. \endbibitem
Appendix A Proofs
A.1 Proof of Lemma 3.1
Proof of Lemma 3.1.
For item 1, we prove the two inclusions and . First, for any , we know , so . Since , we see . Second, for any , suppose without loss of generality that for some . Note . Since by assumption, we know , so . This establishes the other inclusion.
For item 2, since is a subspace of , is a subspace of , which is equal to . Checking that is an inner product is routine. We next show the completeness of . Let be a Cauchy sequence in . Since by definition , the sequence is Cauchy in . Suppose for some . Since by assumption the norm is stronger than , which is in turn stronger than , we know . Since the range of is closed in , we know is also in its range and . Since and the right-hand side converges to zero because , we conclude that the space is indeed complete.
For item 3, for any with , we have
This shows . As noted in the proof of item 2, the norm is stronger than the norm, i.e., for some constant and . Therefore
We thus see . The fact that is a simple consequence of for . Finally, we have for
It thus follows . ∎
A.2 Proof of a lemma used in proving Theorem 3.5
In the proof of Theorem 3.5, we used the following lemma, which we now state and prove.
Lemma A.1.
The operator defined as as is one-to-one and onto. Moreover, .
Proof.
First, we note
To show that is one-to-one and onto, it suffices to show that for any , there exists a unique such that . Denote the standard orthonormal basis in by . Due to the diagonal structure of , we see . So to show the existence and uniqueness of such that , it suffices to show that for , there exists a unique such that . To this end, it suffices to show that lies in the resolvent set of . This is indeed true because 1) is a compact operator from to ; 2) . The second point is obvious, and the first point follows from
Next, we show that is a bounded operator. Once is bounded, since is one-to-one and onto and is a Banach space, we know from the bounded inverse theorem that , which is equivalent to .
The operator is indeed bounded because
∎
We remark that when is self-adjoint, the proof of this lemma simplifies considerably. In fact, we have
Since is self-adjoint, its operator norm is its largest eigenvalue, which is . We see immediately in this case that , so the eigengap is recovered. In unnormalized spectral clustering, where is set to be the RKHS associated with the kernel function, we claim is self-adjoint.
A.3 Proof of Lemma 4.11
At the core of Lemma 4.11 is a uniform law of large numbers over the unit ball in . We need the following two lemmas in the proof: the first is from Cucker and Smale [11] (Proposition 6), and the second is from Vershynin [40] (Theorem 8.1.6).
Lemma A.2.
Denote . When , for all ,
for some constant .
Lemma A.3.
Let be a random process on a metric space with sub-gaussian increments, i.e.
Then, for every , the event
holds with probability at least .
We now prove Lemma 4.11. We refer readers who are unfamiliar with the arguments below to the proof of Theorem 8.2.3 in Vershynin [40].
Proof of Lemma 4.11.
We first show on , the random process has sub-gaussian increments. For fixed , we have
So ’s are independent and mean zero. It thus follows
By the centering lemma, we know
Note that because of the embedding, we have
The random variable is thus bounded. Since bounded random variables have bounded norm, we see
Putting pieces together, we have
Next, it is easy to check that and (because of our choice of and Lemma A.2). It thus follows that the event
holds with probability .
By exactly the same argument, we can also show that the event
holds with probability . Taking a union bound, we see
holds with probability . Rewrite as and the proof is complete. ∎
Appendix B Application to kernel PCA
To further demonstrate the use of the general theory, we apply it in this section to kernel principal component analysis (kernel PCA). In kernel PCA, we start from a metric space , a probability measure on , and a continuous positive definite kernel function . After observing samples , we are interested in the matrix of their pairwise similarities: , assuming that the data mapped into the feature space are centered. Since is symmetric and positive semi-definite, it has an eigenvalue decomposition. We denote the eigenpairs by and sort the eigenvalues in descending order:
The eigenvectors are normalized to have . Then the matrix consists of the leading principal components.
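As a concrete, simplified illustration of the construction just described, the following sketch computes the leading principal components directly from the kernel matrix. The Gaussian kernel, the synthetic data, and the unit-Euclidean-norm convention for the eigenvectors are assumptions made only for this example, and centering in feature space is omitted.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_pca_embedding(X, r, bandwidth=1.0):
    """Return the r leading eigenvalues of K and the matrix of r leading eigenvectors."""
    K = gaussian_kernel(X, X, bandwidth)      # pairwise similarity matrix on the sample
    eigvals, eigvecs = np.linalg.eigh(K)      # ascending eigenvalues, unit-norm eigenvectors
    order = np.argsort(eigvals)[::-1][:r]     # indices of the r largest eigenvalues
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # synthetic sample of n = 200 points in R^2
vals, V = kernel_pca_embedding(X, r=3)
print(vals)        # leading eigenvalues of K in descending order
print(V.shape)     # (200, 3): each row is the embedded representation of one sample point
```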
Let be the RKHS associated with kernel . Recall that the tensor product of ,
is a linear operator. Then the operator counterpart of is the empirical covariance operator
(B.1) |
where is the corresponding feature in of () under the feature map .
It turns out that the eigenvalues and eigenvectors of and are closely related. To formulate this relationship, define the restriction operator by . One can verify that the adjoint of is given by , where . The eigenvalues and eigenvectors (eigenfunctions) of and are then related in the following sense.
Lemma B.1.
The following facts hold true:
1.  and ;
2. If is a non-trivial eigenpair of (i.e. ), then is an eigenpair for .
3. If is a non-trivial eigenpair of , then , where
(B.2)
is an eigenpair for with . Moreover, this choice of is such that and the restriction of onto sample points agrees with , i.e. .
Proof.
For item one, it is easy to verify that is the linear transformation defined by . For the other half, note that
At the same time, by the reproducing property
We thus conclude the two are equal.
For item two, since by assumption , we have , which is exactly what the statement suggests.
For item three, if is an eigenpair of , we check that is an eigenpair of :
Moreover, is a linear combination of and therefore belongs to . ∎
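Lemma B.1 can also be checked numerically once an explicit feature map is available. The sketch below uses random Fourier features as an approximate feature map for a Gaussian kernel; this choice of kernel, the scaling of the kernel matrix by 1/n, and the normalization of the extension are illustrative assumptions for the sketch rather than the paper's construction (the paper's normalization is fixed in (B.2)).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, D = 300, 2, 500
X = rng.normal(size=(n, d))

# Random Fourier features give an explicit (approximate) feature map for a Gaussian kernel,
# so that k(x, y) is approximately phi(x) @ phi(y).
W = rng.normal(size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
phi = lambda Z: np.sqrt(2.0 / D) * np.cos(Z @ W + b)

Phi = phi(X)                     # n x D feature matrix
K = Phi @ Phi.T                  # kernel (similarity) matrix on the sample
C_hat = Phi.T @ Phi / n          # empirical covariance operator, written in feature coordinates

# With the conventions of this sketch (no 1/n in K, a 1/n in C_hat), the non-zero
# eigenvalues of the empirical covariance operator coincide with those of K / n,
# in the spirit of items 2 and 3 of Lemma B.1.
eig_K = np.sort(np.linalg.eigvalsh(K / n))[::-1][:10]
eig_C = np.sort(np.linalg.eigvalsh(C_hat))[::-1][:10]
print(np.allclose(eig_K, eig_C))

# Item 3: extending an eigenvector v of K (eigenvalue mu) to the function
# f(x) = sum_i v_i k(x, x_i) / mu reproduces v when restricted back to the sample points.
mu_all, V = np.linalg.eigh(K)
mu, v = mu_all[-1], V[:, -1]                           # largest eigenpair of K
f = lambda x: (phi(x[None, :]) @ (Phi.T @ v))[0] / mu  # extension, up to the normalization in (B.2)
print(np.allclose(np.array([f(x) for x in X]), v))
```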
The population version of is the covariance operator
where , , and . We will later justify the expectation of such random elements in an appropriate Hilbert space. Under appropriate assumptions, it can be shown that , the top eigenfunctions of , can be chosen to be real-valued and orthonormal in . Then we can define as . Similarly, we can define with , the extension of the top orthonormal eigenvectors of according to (B.2). We are now ready to apply our general theory to prove the following result, which is analogous to Theorem 3.2.
Theorem B.2.
Under the general assumptions stated below, there exist determined by such that whenever the sample size for some , we have with confidence ,
(B.3) |
The general assumptions referred to in Theorem B.2 are the following.
General Assumptions. The set is a separable topological space. The kernel is continuous, symmetric, positive semi-definite, and
(B.4) |
Treated as an operator from to , the eigenvalues of satisfy . The top eigenfunctions of , .
Condition (B.4) ensures that all of the operators we are working with are Hilbert-Schmidt, and further guarantees concentration of bounded random elements in the Hilbert space of Hilbert-Schmidt operators. Separability of and continuity of ensure that the RKHS is separable, by Lemma 4.33 of [34].
B.1 Overview of the proof
The proof of Theorem B.2 is simpler than that of Theorem 3.2 because we can work directly with the reproducing kernel Hilbert space associated with . We shall show that , as an operator from to , has Hilbert-Schmidt norm tending to zero as goes to infinity. Recall that the columns of are only orthonormal in , whereas the general theory requires , whose columns are orthonormal in . As in the proof of Theorem 3.2, we therefore also need to control the error induced by the distinction between orthonormality in and .
The rigorous treatment is presented in four parts. In part one, we introduce , the Hilbert space of Hilbert-Schmidt operators from to that we work with, and justify the random elements in and . In part two, we build up concentration results in the Hilbert space . In part three, we check the remaining conditions required by our general theory. In part four, we put all of the above ingredients together and apply our general theory to complete the proof.
B.2 Part one: the space and random elements in and
The space collects all of the Hilbert-Schmidt operators from to ; it is itself a Hilbert space. Before justifying the covariance operator, we first define the mean element in and the cross-covariance operator in .
The mean element in is defined as the element such that for any . Let be another probability space, and let be a reproducing kernel Hilbert space, associated with a kernel , containing , as its elements. The cross-covariance operator is a Hilbert-Schmidt operator from to such that for any and , . Finally, the covariance operator is .
The covariance operator is indeed Hilbert-Schmidt because by the general assumption.
B.3 Part two: concentration in the Hilbert space
In part one, we showed that is a Hilbert-Schmidt operator from to . In this subsection, we show concentration of using Lemma 4.5.
Lemma B.3.
Under the general assumptions, with probability , we have
(B.5) |
for some constant .
Proof.
We denote . For any , we have
which implies that the ’s are bounded, zero-mean, independent random variables in .
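Although the proof relies on a concentration inequality for bounded random elements of , the 1/√n rate in (B.5) can also be seen in a small Monte Carlo experiment. The sketch below assumes a Gaussian kernel and standard normal data, uses a large hold-out sample as a proxy for the population covariance operator, and computes Hilbert-Schmidt distances through the identity that the HS inner product of the rank-one operators built from k(·,x) and k(·,y) equals k(x,y)²; none of these choices are prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss_K(X, Y):
    """Gaussian kernel matrix with unit bandwidth."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / 2.0)

# HS inner products of the rank-one operators k(.,x) (x) k(.,x) reduce to k(x, y)^2, so HS
# distances between empirical covariance operators can be computed from kernel matrices alone.
X_ref = rng.normal(size=(3000, 2))                 # large hold-out sample: population proxy
m = len(X_ref)
c = (gauss_K(X_ref, X_ref) ** 2).sum() / m ** 2    # reference self-term, computed once

def hs_err(X):
    """HS distance between the empirical covariance operator of X and that of the reference."""
    n = len(X)
    a = (gauss_K(X, X) ** 2).sum() / n ** 2
    b = (gauss_K(X, X_ref) ** 2).sum() / (n * m)
    return np.sqrt(max(a - 2.0 * b + c, 0.0))

for n in [50, 200, 800]:
    errs = [hs_err(rng.normal(size=(n, 2))) for _ in range(20)]
    # The mean error roughly halves each time n quadruples, consistent with the 1/sqrt(n)
    # rate in (B.5); the finite reference sample eventually contributes a small floor.
    print(n, np.mean(errs))
```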
B.4 Part three: checking the conditions of the general theory
Lemma B.4.
Under the general conditions, the following facts hold true:
1. The reproducing kernel Hilbert space is a subspace of .
2. The norm is stronger than infinity norm.
3. Both are Hilbert-Schmidt from to .
4. All eigenvalues of (counting multiplicity) can be arranged in a decreasing (possibly infinite) sequence of non-negative real numbers with a positive gap between and .
5. The top eigenfunctions and can be picked to form an orthonormal set of functions in .
6. has a sequence of non-increasing, real, non-negative eigenvalues.
Proof.
Recall that the general conditions entail that . The kernel is therefore a Mercer kernel, which satisfies the condition . Item 1 is then an implication of Mercer’s theorem.
For item 2, for any , by the reproducing property and the Cauchy–Schwarz inequality, we have . Therefore .
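In generic notation for a kernel $k$ with RKHS $\mathcal{H}$ on a domain $\mathcal{X}$ (the paper's own symbols are fixed earlier in the text), the standard chain of inequalities behind item 2 reads
\[
|f(x)| \;=\; |\langle f, k(x,\cdot)\rangle_{\mathcal{H}}| \;\le\; \|f\|_{\mathcal{H}}\,\sqrt{k(x,x)} \;\le\; \Big(\sup_{x\in\mathcal{X}} k(x,x)\Big)^{1/2}\,\|f\|_{\mathcal{H}},
\]
so that $\|f\|_{\infty} \le \big(\sup_{x\in\mathcal{X}} k(x,x)\big)^{1/2}\,\|f\|_{\mathcal{H}}$ whenever the kernel is bounded on the diagonal.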
Item 3 was checked in an intermediate step of the proof of Lemma B.3.
Both item 4 and item 5 are ensured by Mercer’s theorem.
Item 6 is true because of the relationship between the spectrum of and that of the symmetric positive semi-definite kernel matrix , which has been checked in Lemma B.1. ∎