How do kernel-based sensor fusion algorithms behave under high dimensional noise?
Abstract.
We study the behavior of two kernel-based sensor fusion algorithms, nonparametric canonical correlation analysis (NCCA) and alternating diffusion (AD), in the nonnull setting where the clean datasets collected from two sensors are modeled by a common low dimensional manifold embedded in a high dimensional Euclidean space and the datasets are corrupted by high dimensional noise. We establish the asymptotic limits and convergence rates for the eigenvalues of the associated kernel matrices, assuming that the sample dimension and sample size are comparably large, where NCCA and AD are conducted using the Gaussian kernel. It turns out that both the asymptotic limits and convergence rates depend on the signal-to-noise ratio (SNR) of each sensor and the selected bandwidths. On the one hand, we show that if NCCA and AD are directly applied to the noisy point clouds without any sanity check, they may generate artificial information that misleads scientists’ interpretation. On the other hand, we prove that if the bandwidths are selected adequately, both NCCA and AD can be made robust to high dimensional noise when the SNRs are relatively large.
1. Introduction
How to adequately quantify the system of interest by assembling the available information from multiple datasets collected simultaneously from different sensors is a long-standing and commonly encountered problem in data science. This problem is commonly referred to as the sensor fusion problem [20, 30, 19, 49]. While the simplest approach to “fuse” the information is a simple concatenation of the information available from each sensor, it is neither the best nor the most efficient approach. To achieve a better and more efficient fusion algorithm, researchers usually face several challenges. For example, the sensors might be heterogeneous, datasets from different sensors might not be properly aligned, and the datasets might be high dimensional and noisy, to name but a few. Roughly speaking, researchers are interested in extracting the common components (information) shared by different sensors, if any exist.
A lot of effort has been invested in finding a satisfactory algorithm based on various models. Historically, when a linear structure can safely be assumed in the common information shared by different sensors, the most typical algorithm for handling this problem is canonical correlation analysis (CCA) [24] and its descendants [23, 20, 25], which is a far from complete list. In the modern data analysis era, due to advances in sensor development and the growing complexity of problems, researchers may need to take the nonlinear structure of the datasets into account to better understand them. To handle this nonlinear structure, several nonlinear sensor fusion algorithms have been developed, for example, nonparametric canonical correlation analysis (NCCA) [39], alternating diffusion (AD) [31, 45] and its generalization [42], time coupled diffusion maps [38], multiview diffusion maps [34], etc. See [42] for a recent and more thorough list of available nonlinear tools and [50] for a recent review. The main idea behind these developments is that the nonlinear structure is modeled by various nonlinear geometric structures, and the algorithms are designed to preserve and capture this structure. Such ideas and algorithms have been successfully applied to many real world problems, like audio-visual voice activity detection [9], the study of sequential audio-visual correspondence [10], automatic sleep stage annotation from two electroencephalogram signals [35], seismic event modeling [33], fetal electrocardiogram analysis [42] and IQ prediction from two fMRI paradigms [48], which is a far from complete list.
While these kernel-based sensor fusion algorithms have been developed and applied for a while, there are still several gaps toward a solid practical application and a sound theoretical understanding of these tools. One important gap is understanding how the inevitable noise, particularly when the data dimension is high, impacts the kernel-based sensor fusion algorithms. For example, can we be sure that the obtained fused information is really informative, particularly when the datasets are noisy or when one sensor is broken? If the signal-to-noise ratios of the two sensors are different, how does this noise impact the information captured by these kernel-based algorithms? To our knowledge, the developed kernel-based sensor fusion algorithms do not account for how the noise interacts with the algorithm, and most theoretical understanding is mainly based on the nonlinear data structure without considering the impact of high dimensional noise, except for a recent effort in the null case [7]. In this paper, we focus on one specific challenge among many; that is, we study how high dimensional noise impacts the spectrum of two kernel-based sensor fusion algorithms, NCCA and AD, in the non-null setup when there are two sensors.
We briefly recall the NCCA and AD algorithms. Consider two noisy point clouds, and . For some bandwidths and some fixed constant chosen by the user, we consider two affinity matrices, and , defined as
(1) |
where . Here, and are related to the point clouds and respectively. Denote the associated degree matrices and which are diagonal matrices such that
(2) |
Moreover, denote the transition matrices as
The NCCA and AD matrices are defined as
(3) |
respectively. Note that in the current paper, for simplicity, we focus our study on the Gaussian kernel. More general kernel functions are left for future work. Usually, the top few eigenpairs of and are used as features of the extracted common information shared by the two sensors. We emphasize that in general, while and are diagonalizable, and are not. But theoretically we can obtain the top few eigenpairs without a problem under the common manifold model [45, 42], since asymptotically and both converge to self-adjoint operators. To avoid this issue, researchers also consider the singular value decomposition (SVD) of and . Another important fact is that usually we are interested in the case when and are aligned; that is, and are sampled from the same system at the same time. However, the algorithm can be applied to any two datasets of the same size, although this is not our concern in this paper.
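For concreteness, the following NumPy sketch implements one common reading of the pipeline just described. Since the displayed formulas (1)–(3) are not reproduced here, the Gaussian affinity exp(-||x_i - x_j||^2 / h) (with the user-chosen constant in the exponent omitted), the row-stochastic normalization A = D^{-1} W, and the assignment of the products A^(1) (A^(2))^T and A^(1) A^(2) to NCCA and AD are assumptions of this sketch rather than a restatement of (3).

```python
import numpy as np

def pairwise_sq_dists(X):
    """Pairwise squared Euclidean distances between the columns of X (shape (p, n))."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    return np.maximum(d2, 0.0)            # clip tiny negatives caused by round-off

def transition(X, h):
    """Row-stochastic transition matrix A = D^{-1} W with W_ij = exp(-||x_i - x_j||^2 / h)."""
    W = np.exp(-pairwise_sq_dists(X) / h)
    return W / W.sum(axis=1, keepdims=True)

def ncca_ad(X1, X2, h1, h2):
    """One common convention: NCCA matrix A1 @ A2.T and AD matrix A1 @ A2 (cf. (3))."""
    A1, A2 = transition(X1, h1), transition(X2, h2)
    return A1 @ A2.T, A1 @ A2

# Toy usage with the classic bandwidth h = p; singular values are reported because
# the product matrices need not be symmetric.
rng = np.random.default_rng(0)
p, n = 100, 300
X1, X2 = rng.standard_normal((p, n)), rng.standard_normal((p, n))
ncca_mat, ad_mat = ncca_ad(X1, X2, h1=float(p), h2=float(p))
print(np.linalg.svd(ncca_mat, compute_uv=False)[:5])
```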
1.1. Some related works
In this subsection, we summarize some related results. Since the NCCA and AD matrices (3) are essentially products of transition matrices, we start by summarizing results on the affinity and transition matrices when there is only one sensor. On the one hand, in the noiseless setting, the spectral properties have been widely studied, for example, in [2, 21, 22, 44, 18, 11], to name but a few. In summary, under the manifold model, researchers show that the graph Laplacian (GL) converges to the Laplace–Beltrami operator in various settings with a properly chosen bandwidth. On the other hand, the spectral properties have been investigated in [4, 3, 8, 14, 13, 17, 28] under the null setup. These works essentially show that when contains pure high-dimensional noise, the affinity and transition matrices are governed by a low-rank perturbed Gram matrix when the bandwidth . Despite the rich literature on these two extreme setups, limited results are available in the intermediate, or nonnull, setup [12, 6, 15]. For example, when the signal-to-noise ratio (SNR), which will be defined precisely later, is sufficiently large, the spectral properties of the GL constructed from the noisy observations are close to those constructed from the clean signal. Moreover, the bandwidth plays an important role in the nonnull setup. For a more comprehensive review and a sophisticated study of the spectral properties of the affinity and transition matrices for an individual point cloud, we refer the readers to [6, Sections 1.2 and 1.3].
For the NCCA and AD matrices, on the one hand, in the noiseless setting, there have been several results under the common manifold model [32, 46]. On the other hand, under the null setup in which both sensors only capture high dimensional white noise, their spectral properties have been studied recently [7]. Specifically, except for a few larger outliers, when and the edge eigenvalues of or converge to some deterministic limit depending on the free convolution (c.f. Definition 2.3) of two Marchenko–Pastur (MP) laws [37]. However, in the nonnull setting, when both sensors are contaminated by noise, to our knowledge there does not exist any theoretical study, particularly in the high dimensional setup.
1.2. An overview of our results
We now provide an overview of our results. The main contribution of this paper is a comprehensive study of NCCA and AD in the non-null case in the high dimensional setup. This result can be viewed as a continuation of the study of the null case [7]. We focus on the setup in which the signal is modeled by a low dimensional manifold. It turns out that this problem can be recast as studying the algorithms under the commonly applied spiked model, which will be made clear later. In addition to providing a theoretical justification based on kernel random matrix theory, we propose a method to choose the bandwidth adaptively. Moreover, peculiar and counterintuitive results will be presented when the two sensors behave differently, which emphasizes the importance of carefully applying these algorithms in practice. In Section 3, we investigate the eigenvalues of the NCCA and AD matrices when and , which is a common choice in the literature. The behavior of the eigenvalues varies according to the SNRs of both point clouds. When both SNRs are small, the spectral behavior of and is like that in the null case, while both the number of outliers and the convergence rates depend on the SNRs; see Theorem 3.1 for details. Furthermore, if one of the sensors has a large SNR and the other one has a small SNR, the eigenvalues of and provide limited information about the signal; see Theorem 3.2 for details. We emphasize that this result warns us that if we directly apply NCCA and AD without any sanity check, it may result in a misleading conclusion. When both SNRs are larger, the eigenvalues are close to those of the clean NCCA and AD matrices; see Theorem 3.3 for more details. It is clear that the classic bandwidth choices for are inappropriate when the SNR is large, since the bandwidth is too small compared with the signal strength. In this case , and we obtain limited information about the signal; see (42) for details. To handle this issue, in Section 4, we consider bandwidths that are adaptively chosen according to the dataset. With this choice, when the SNRs are large, NCCA and AD become non-trivial and informative; that is, NCCA and AD are robust against the high dimensional noise. See Theorem 4.1 for details.
Conventions. The fundamental large parameter is and we always assume that and are comparable to and depend on . We use to denote a generic positive constant, and the value may change from one line to the next. Similarly, we use , , , etc., to denote generic small positive constants. If a constant depends on a quantity , we use or to indicate this dependence. For two quantities and depending on , the notation means that for some constant , and means that for some positive sequence as . We also use the notations if , and if and . For a matrix , indicates the operator norm of , and means for some constant . Finally, for a random vector we say it is sub-Gaussian if for any deterministic vector , we have .
The paper is organized as follows. In Section 2, we introduce the mathematical setup and some background from random matrix theory. In Section 3, we state our main results for the classic choice of bandwidth. In Section 4, we state the main results for the adaptively chosen bandwidth. In Section 5, we offer the technical proofs of the main results. In Section 5.1, we provide and prove some preliminary results which are used in the technical proofs.
2. Mathematical framework and background
2.1. Mathematical framework
We focus on the following model for the datasets and . Assume that the first sensor samples clean signals i.i.d. from a sub-Gaussian vector , denoted as , where is a probability space. Similarly, assume that the second sensor also samples clean signals i.i.d. from a sub-Gaussian vector , denoted as . Since we focus on the distance of pairwise samples, without loss of generality, we assume that
(4) |
Denote and , and to simplify the discussion, we assume that and admit the following spectral decomposition
(5) |
where and are fixed integers. We model the common information by assuming that there exists a bijection so that
(6) |
that is, we have for any . In practice, the clean signals and are contaminated by two sequences of i.i.d. sub-Gaussian noise and , respectively, so that the data generating process follows
(7) |
where
(8) |
We further assume that and are independent of each other and also independent of and . We are mainly interested in the high dimensional setting; that is, and are comparably as large as More specifically, we assume that there exists some small constant such that
(9) |
The SNRs in our setting are defined as and respectively, so that for all and ,
(10) |
for some constants . To avoid repetitions, we summarize the assumptions as follows.
In view of (5), the model (7) for each sensor is related to the spiked covariance matrix model [27]. We comment that this seemingly simple model, particularly (5), includes the commonly considered nonlinear common manifold model. In the literature, the common manifold model means that the two sensors sample simultaneously from one low dimensional manifold; that is, and is an identity map, where is a low dimensional smooth and compact manifold embedded in the high dimensional Euclidean space. Since we are interested in kernel matrices depending on pairwise distances, which are invariant to rotation, when combined with Nash’s embedding theorem, the common manifold can be assumed to be supported in the first few axes of the high dimensional space, as in (5). As a result, the common manifold model becomes a special case of the model (7). We refer readers to [6] for a detailed discussion of this relationship. A special example of the common manifold model is the widely considered linear subspace as the common component; that is, when embedded in for . In this case, we could simply apply CCA to estimate the common component, and its behavior in the high dimensional setup has been studied in [1, 36].
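To make the model concrete, the following sketch generates two aligned noisy point clouds sharing a common one-dimensional manifold (a circle) supported in the first two coordinates, in the spirit of (5)–(8) with the identity map as the bijection in (6). Parametrizing the signal strength by an exponent alpha, so that a larger alpha mimics a larger SNR in (10), is an illustrative assumption, and all names are hypothetical.

```python
import numpy as np

def common_manifold_data(n=300, p=400, alpha1=1.0, alpha2=1.0, seed=0):
    """Two aligned noisy sensors sharing a common circle-valued clean signal.

    Columns are samples. The clean signal lives in the first two coordinates and is
    scaled by sqrt(p**alpha); the exponent alpha is an illustrative stand-in for the
    SNR parametrization in (10)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)        # common latent variable
    circle = np.vstack([np.cos(theta), np.sin(theta)])   # clean signal, shape (2, n)

    def sensor(alpha):
        data = rng.standard_normal((p, n))               # i.i.d. high dimensional noise
        data[:2] += np.sqrt(p ** alpha) * circle         # low-rank (spiked) clean signal
        return data

    return sensor(alpha1), sensor(alpha2)

X1, X2 = common_manifold_data(alpha1=1.2, alpha2=0.3)    # one strong and one weak sensor
```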
We should emphasize that through the analysis of NCCA and AD under the common component model satisfying Assumption (2.1), we do not claim that we could understand the underlying manifold structure. The problem we are asking here is the nontrivial relationship between the noisy and clean affinity and transition matrices, while the problem of exploring the manifold structure from the clean datasets [18, 11] is a different one, which is usually understood as the manifold learning problem. To study the nontrivial relationship between the noisy and clean affinity and transition matrices, it is the spiked covariance structure that we focus on, but not the possibly non-trivial . By establishing the nontrivial relationship between the noisy and clean affinity and transition matrices in this paper, when combined with the knowledge of manifold learning via the kernel-based manifold learning algorithm with clean datasets [31, 46, 43], we know how to explore the common manifold structure, which depends on , from the noisy datasets.
Remark 2.2.
While it is not our focus in this paper, we should mention that our model includes the case where the datasets captured by the two sensors do not lie exactly on one manifold , but on two manifolds that are diffeomorphic to [43]. Specifically, the first sensor samples points from , while the second sensor simultaneously samples points from , where and are both diffeomorphisms and ; that is, in (6). Note that in this case, might be different from . Moreover, the samples from the two sensors can be more general. For example, in [46], the principal bundle structure is considered to model the “nuisance”, which can be understood as “deterministic noise”, and in [31] a metric space is considered as the common component. While it is possible to consider a more complicated model, since we are interested in studying how noise impacts NCCA and AD, in this paper we simply focus on the above model and do not further elaborate on this possible extension.
2.2. Some random matrix theory background
In this subsection, we introduce some random matrix theory background and necessary notations. Let be the data matrix associated with ; that is, the -th column is , and consider the scaled noise , where stands for the standard deviation of the scaled noise. Denote the empirical spectral distribution (ESD) of as
It is well-known that in the high dimensional regime (9), has the same asymptotics [29] as the so-called MP law [37], denoted as , satisfying
(11) |
where is a measurable set, is the indicator function and when and when ,
(12) |
and . Denote
(13) |
and for ,
(14) |
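As a quick numerical illustration of the ESD and the MP law appearing in (11)–(14): for i.i.d. unit-variance noise and aspect ratio c = p/n, the spectrum of the sample covariance matrix concentrates on the MP bulk with edges (1 ± sqrt(c))^2. This is the standard statement for unit variance; the exact scaling convention used in (11)–(12) is not reproduced above, so the sketch below should be read with that caveat.

```python
import numpy as np

def mp_edges(c):
    """Bulk edges (1 - sqrt(c))^2 and (1 + sqrt(c))^2 of the MP law with ratio c = p/n."""
    return (1.0 - np.sqrt(c)) ** 2, (1.0 + np.sqrt(c)) ** 2

p, n = 400, 800
rng = np.random.default_rng(0)
Z = rng.standard_normal((p, n))          # i.i.d. unit-variance noise
S = Z @ Z.T / n                          # p x p sample covariance matrix
evals = np.linalg.eigvalsh(S)            # its ESD approximates the MP law
lo, hi = mp_edges(p / n)
print(f"empirical range: [{evals.min():.3f}, {evals.max():.3f}]")
print(f"MP bulk edges : [{lo:.3f}, {hi:.3f}]")
```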
For any constant , denote by the shifting operator that shifts a probability measure defined on by ; that is,
(15) |
where means the shifted set. Using the notation (11), for denote
(16) |
Next, we introduce a -dependent quantity of some probability measure. For a given probability measure and define as
(17) |
Finally, we recall the following notion of stochastic domination [16, Chapter 6.3] that we will frequently use. Let and be two families of nonnegative random variables, where is a possibly -dependent parameter set. We say that is stochastically dominated by , uniformly in the parameter , if for any small and large , there exists so that we have , for a sufficiently large . We interchangeably use the notation or if is stochastically dominated by , uniformly in , when there is no danger of confusion. In addition, we say that an -dependent event holds with high probability if for a , there exists so that , when
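Since the displays inside the definition above are not reproduced, we record the standard form of stochastic domination from [16] for the reader's convenience; the symbol names below are chosen for readability and may differ from the paper's notation.

```latex
% X = (X^{(n)}(u))_{u \in U^{(n)}} is stochastically dominated by Y = (Y^{(n)}(u))_{u \in U^{(n)}},
% uniformly in u, if for every small \epsilon > 0 and large D > 0 there exists n_0(\epsilon, D)
% such that
\sup_{u \in U^{(n)}} \mathbb{P}\left( X^{(n)}(u) > n^{\epsilon}\, Y^{(n)}(u) \right) \le n^{-D},
\qquad \text{for all } n \ge n_0(\epsilon, D).
```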
2.3. A brief summary of free multiplication of random matrices
In this subsection, we summarize some preliminary results about free multiplication of random matrices from [5, 26]. Given some probability measure its Stieltjes transform and -transform are defined as
where , respectively. We next introduce the subordination functions utilizing the -transform [26, 47]. For any two probability measures and , there exist analytic functions and satisfying
(18) |
Armed with the subordination functions, we now introduce the free multiplicative convolution of and denoted as , when and are compactly supported on but not both delta measures supported on ; see Definition 2.7 of [5].
Definition 2.3.
Denote the analytic function by
(19) |
Then the free multiplicative convolution is defined as the unique probability measure such that (19) holds for all i.e., is the -transform of Moreover, and are referred to as the subordination functions.
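The free multiplicative convolution can also be approximated numerically, which is the form in which it enters the results below: if A and B are positive definite matrices with prescribed spectra and U is Haar orthogonal, the ESD of A^{1/2} U B U^T A^{1/2} (which shares the spectrum of A U B U^T) approximates the free multiplicative convolution of the two ESDs in large dimension. The sketch below is illustrative only, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import ortho_group

def free_mult_eigs(a_evals, b_evals, seed=0):
    """Eigenvalues of A^{1/2} U B U^T A^{1/2} with Haar orthogonal U; for large dimension
    their distribution approximates the free multiplicative convolution of the inputs."""
    n = len(a_evals)
    U = ortho_group.rvs(n, random_state=seed)
    A_half = np.diag(np.sqrt(np.maximum(a_evals, 0.0)))   # clip round-off negatives
    M = A_half @ U @ np.diag(b_evals) @ U.T @ A_half      # symmetric; same spectrum as A U B U^T
    return np.linalg.eigvalsh(M)

# Example: free multiplicative convolution of two independent noise Gram spectra (MP-type laws).
n, p = 400, 800
rng = np.random.default_rng(1)
Z1, Z2 = rng.standard_normal((n, p)), rng.standard_normal((n, p))
a = np.linalg.eigvalsh(Z1 @ Z1.T / p)
b = np.linalg.eigvalsh(Z2 @ Z2.T / p)
print(free_mult_eigs(a, b)[-5:])   # the largest values sit near the right edge of the limit
```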
For and defined in (16), we have two sequences and , where . Note that we have
(20) |
where are the right edges of and respectively. Denote two positive definite matrices and as follows
(21) |
Let be a Haar distributed random matrix in and denote
The following lemma summarizes the rigidity of eigenvalues of
Lemma 2.4.
Suppose (21) holds. Then we have that
3. Main results (I)–classic bandwidth:
In this section, we state our main results regarding the eigenvalues of and when , where For definiteness, we assume that and In what follows, for ease of statement, we focus on reporting the results for and hence omit the subscripts of the indices in (10). For the general setting with or , we refer the readers to Remark 3.4 below for more details. Finally, we focus on reporting the results for the NCCA matrix . The results for the AD matrix are similar. For the details of the AD matrix , we refer the readers to Remark 3.5 below. Moreover, by symmetry, without loss of generality, we always assume that ; that is, the first sensor always has a larger SNR.
3.1. Noninformative region:
In this subsection, we state the results when at least one sensor contains strong noise, or equivalently has a small SNR; that is, . In this case, NCCA and AD will not be able to provide useful information, or can only provide limited information, about the underlying common manifold.
3.1.1. When both sensors have small SNRs,
In this case, both sensors have small SNRs such that the noise dominates the signal. For some fixed integers satisfying
(22) |
where is some constant, denote as
(23) |
Moreover, define as
(24) |
Theorem 3.1.
Intuitively, in this region we cannot obtain any information about the signal, since asymptotically the noise dominates the signal. In practice, the datasets might fall in this region when both sensors are corrupted or the environmental noise is too strong. This intuition is confirmed by Theorem 3.1. As discussed in [6, 13], when the noise dominates the signal, the outlier eigenvalues mainly come from the kernel function expansion or the Gram matrix and hence are not useful for studying the underlying manifold structure. The number of these outlier eigenvalues depends on the SNRs, as can be seen in (23), and can be figured out from the kernel function expansion.
We should point out that (25) and (26) are mainly technical assumptions and are commonly used in the random matrix theory literature. They guarantee that the individual bulk eigenvalues of can be characterized by the quantiles of the free multiplicative convolution. Specifically, (26) ensures that the Gram matrices are bounded from below, and (25) has been used in [7] to ensure that the eigenvectors of the Gram matrix are Haar distributed. As discussed in [7], while it is widely accepted that the eigenvectors of the Gram matrix from i.i.d. sub-Gaussian random vectors are Haar distributed, we are not aware of a proper proof. Since the proof of this theorem depends on the results in [7], we impose the same condition. The assumption (25) can be removed once one can show that the eigenvectors of the Gram matrix from i.i.d. sub-Gaussian random vectors are Haar distributed. Since this is not the focus of the current paper, we will pursue this direction in future work.
3.1.2. When one sensor has a small SNR,
In Theorem 3.2 below, we consider that i.e., one of the sensors has a large SNR whereas the other is dominated by the noise. We first prepare some notation. Let and be the affinity matrices associated with and respectively, where the subscript is shorthand for the signal. In other words, and are constructed from the clean signal. In general, since may be different from , and might be different. Denote
(28) |
Analogously, we denote the associated degree matrix and transition matrix as and respectively, that is,
(29) |
Define and similarly. Note that from the random walk perspective, (and as well) describe a lazy random walk on the clean dataset. We further introduce some other matrices,
We then define the associate degree matrix and transition matrix as and respectively; that is,
(30) |
and will be used when is too large () so that the bandwidth is insufficient to capture the relationship between two different samples.
Theorem 3.2.
Suppose Assumption 2.1 holds with , , and . Then we have that for ,
(33) |
where is defined in (24) and is defined in (31). Furthermore, when is larger in the sense that for any given small constant
(34) |
then with probability at least for some sufficiently small constant and some constant and all we have
(35) |
This is a potentially confusing region. In practice, it captures the situation where one sensor is corrupted so that the signal part becomes weak. Since we still have one sensor available with a strong SNR, it is expected that we could still obtain something useful. However, Theorem 3.2 shows that the corrupted sensor unfortunately contaminates the overall performance of the sensor fusion algorithm. Note that since the first sensor has a large SNR, the noisy transition matrix is close to the transition matrix , which only depends on the signal part when , and the transition matrix , which is a mixture of the signal and noise, when . This fact has been shown in [6]. However, for the second sensor, due to the strong noise, will be close to a perturbed Gram matrix that mainly comes from the high dimensional noise. Consequently, as illustrated in (33), the NCCA matrix will be close to , which is a product of the clean transition matrix and the shifted Gram matrix. Clearly, the clean transition matrix is contaminated by the shifted Gram matrix, which does not contain any information about the signal. This limits the information we can obtain.
In the extreme case when is larger in the sense of (34), the chosen bandwidth is too small compared with the signal so that the transition matrix will be close to the identity matrix. Consequently, as in (35), the NCCA matrix will be mainly characterized by the perturbed Gram matrix whose limiting ESD follows the MP law with proper scaling.
We should however emphasize that, as elaborated in [6], when the SNR is large, particularly when , we should consider a different bandwidth, namely the bandwidth determined by a percentile of the pairwise distances, which is commonly used in practice. It is thus natural to ask whether, if the bandwidth is chosen “properly”, we would eventually obtain useful information. We will answer this question in Section 4.
3.2. Informative region:
In this subsection, we state the results when both of the sensors have large SNR (). Recall (29), (30) and denote analogously for the point cloud For some constant denote
(36) |
Theorem 3.3.
Theorem 3.3 shows that when , where and both SNRs are large, the NCCA matrix constructed from the noisy dataset can be well approximated by that constructed from the clean dataset of the common manifold. The main reason has been elaborated in [6] for the case of a single sensor. In the two-sensor case, combining (37) and (38), we see that except for the first eigenvalues, the remaining eigenvalues are negligible and not informative. Moreover, (2) and (3) reveal important information about the bandwidth; that is, if the bandwidth choice is improper, like and , the result can be misleading in general. For instance, when and are large, ideally we should have a “very clean” dataset and we would expect to obtain useful information about the signal. However, this result says that we cannot obtain any useful information from NCCA; in particular, see (42). This is nevertheless intuitively true, since when the bandwidth is too small, the relationship between two distinct points cannot be captured by the kernel; that is, when , with high probability (see the proof below or [6] for a precise statement of this argument). This problem can be fixed if we choose a proper bandwidth. In Section 4, we state the corresponding results when the bandwidths are selected properly, in which case this counterintuitive result is eliminated.
Remark 3.4.
In the above theorems, we focus on reporting the results for the case in (5). In this remark, we discuss how to generalize the results to the setting when or First, when Theorem 3.1 still holds after a minor modification; for example, in (22) should be replaced by and the error bound in (27) should be replaced by
where ’s are defined similarly as in (24). Similar arguments apply for Theorem 3.2. Second, when , where Theorem 3.3 holds by setting Finally, suppose that there exist some integers such that
Then we have that Theorem 3.3 still holds by setting and the affinity and transition matrices in (29) should be defined using the signal part with large SNRs. For example, should be defined via
The detailed statements and proofs are similar to the setting except for extra notational complexity. Since this is not the main focus of the current paper, we omit the details here.
Remark 3.5.
Throughout the paper, we focus on reporting the results of the NCCA matrix. However, our results can also be applied to the AD matrix with a minor modification based on their definitions in (3). Specifically, Theorem 3.1 holds for , Theorem 3.2 holds for and Theorem 3.3 holds for by replacing with with and with . Since the proof is similar, we omit details.
4. Main results (II)–adaptive choice of bandwidth
As discussed after Theorem 3.3, when the SNRs are large, the bandwidth choice for is improper. One solution to this issue has been discussed in [6] for the single-sensor case; that is, the bandwidth is determined by a percentile of all pairwise distances. It is thus natural to hypothesize that the same solution holds for the kernel-based sensor fusion approach. As in Section 3, we focus on the case and the discussion for the general setting is similar to that of Remark 3.4. As before, we also assume that Also, we focus on reporting the results for the NCCA matrix. The discussion for the AD matrix is similar to that of Remark 3.5.
We first recall the adaptive bandwidth selection approach of [6], motivated by the empirical approach commonly used in practice. Let and be the empirical distributions of the pairwise distances and respectively. Then we choose the bandwidths and by
(43) |
where and are fixed constants chosen by the user. Define in the same way as that in (45), as that in (48), and as that in (46) using (43). Similarly, we can define the counterparts for the point cloud
Recall that and are the affinity matrices associated with and With a slight abuse of notation, for we denote
(44) |
where are constructed using the adaptively selected bandwidth . Clearly, and differ by an isotropic spectral shift, and when , asymptotically and are the same. Note that compared to (28), the difference is that we use the modified bandwidth in (44). This difference is significant, particularly when is large. Indeed, when is large, defined in (28) is close to an identity matrix, while defined in (44) encodes information about the signal. Specifically, asymptotically we can show that defined in (44) converges to an integral operator defined on the manifold, whose spectral structure is commonly used in the manifold learning community to study the signal. See [6] for more discussion. We then define
(45) |
and
(46) |
where . Compared to (45), (46) does not contain the scaling and shift of the signal parts. Moreover, denote
(47) |
Theorem 4.1.
Theorem 4.1 (1) states that if both sensors have low SNRs, the NCCA matrix has spectral behavior similar to that in Theorem 3.1; that is, when the SNRs are small, due to the impact of the noise, even if a signal exists, we may not obtain a useful result. The reason is that we still have for with high probability (see (100)), so the bandwidth choice does not influence the conclusion. In particular, most of the eigenvalues of are governed by the free multiplicative convolution of two MP-type laws, which are essentially the limiting empirical spectral distributions of Gram matrices containing only white noise.
On the other hand, when the signals are stronger; that is, we are able to approximate the associated clean NCCA matrix of the underlying clean common component, as is detailed in Theorem 4.1 (2). This result can be interpreted as saying that NCCA is robust to the noise. In particular, when we see that and come from the clean dataset directly. Finally, we point out that, compared to (36), except for the top eigenvalues for some constant , the remaining eigenvalues are not informative. When the NCCA matrix is always informative, in contrast to (2) and (3) of Theorem 3.3. As a result, when combined with the existing theory about AD [31, 45], the first few eigenpairs of NCCA and AD capture the geometry of the common manifold under the manifold setup.
Theorem 4.1 (1) also describes the behavior of NCCA when one sensor has a high SNR while the other one has a low SNR, which is the most interesting and counterintuitive case. In this case, even if the bandwidths of both sensors are chosen according to (43), the NCCA matrix still encodes limited information about the signal, similarly to what is stated in Theorem 3.2. Indeed, the NCCA matrix is close to a product matrix that is a mixture of signal and noise, as shown in (48). While contains information about the signal, it is contaminated by through the product, which comes from the noise-dominated dataset collected from the other sensor. Since the spectral behavior of follows the shifted and scaled MP law, overall we obtain limited information about the signal if we apply the kernel-based sensor fusion algorithm. In this case, it is better to simply consider the dataset with a high SNR. Based on the above discussion and practical experience, we would like to mention a potential danger of directly applying NCCA (or AD) without confirming the signal quality. This result warns us that if we directly apply AD without any sanity check, it may result in a misleading conclusion, or give us lower quality information. Therefore, before applying NCCA and AD, it is suggested to follow the common practice of detecting the existence of signal in each of the sensors.
For the choices of the constants and we comment that in practice researchers usually choose or [42]. In [6], we propose an algorithm to choose their values adaptively. The main idea is that the algorithm seeks a bandwidth so that the affinity matrix has the largest number of outlier eigenvalues. We refer the readers to [6, Section 3.2] for more details.
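The idea described above can be sketched as follows: scan omega over a grid, form the affinity matrix with each candidate bandwidth, count the eigenvalues escaping a putative noise bulk, and keep the omega producing the most outliers. The bulk-edge proxy used below (a fixed multiple of the median eigenvalue) is purely illustrative and is not the estimator of [6, Section 3.2]; all names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Pairwise squared Euclidean distances between the columns of X (shape (p, n))."""
    sq = np.sum(X ** 2, axis=0)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)

def choose_omega(X, omegas=(0.1, 0.25, 0.5, 0.75, 0.9), edge_factor=3.0):
    """Return the quantile level whose bandwidth yields the most outlier eigenvalues of the
    Gaussian affinity matrix; the bulk edge below is only a crude illustrative proxy."""
    d2 = pairwise_sq_dists(X)
    upper = d2[np.triu_indices_from(d2, k=1)]          # distinct pairs only
    best_omega, best_count = None, -1
    for omega in omegas:
        h = np.quantile(upper, omega)
        evals = np.linalg.eigvalsh(np.exp(-d2 / h))
        n_out = int(np.sum(evals > edge_factor * np.median(evals)))
        if n_out > best_count:
            best_omega, best_count = omega, n_out
    return best_omega
```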
Last but not least, we point out that our results can potentially be used to detect common components. Usually, researchers rely on background knowledge to decide whether common information exists. For example, it is not surprising that two electroencephalogram channels share the same brain activity. However, while physiologically the brain and heart share common information [41], it is less clear whether the electroencephalogram and the electrocardiogram share anything in common, and what that common information is. Answering this complicated question may require a lot of scientific work, but the first step toward it is a powerful tool to confirm whether two sensors share the same information, in addition to checking whether the signal quality is sufficient. Since this is not the focus of the current paper, we will address this issue in future work.
5. Proof of main theorems
We now provide the proofs of the main theoretical results in Sections 3 and 4. We start by collecting some technical lemmas needed in the proofs.
5.1. Some technical lemmas
The following lemma provides some deterministic inequalities for the products of matrices.
Lemma 5.1.
(1). Suppose that is a real symmetric matrix with nonnegative entries and is another real symmetric matrix. Then we have that
where is the Hadamard product and stands for the largest singular value of
(2). Suppose and are two positive definite matrices. Then for all we have that
Proof.
The following Lemma 5.2 collects some concentration inequalities.
Lemma 5.2.
Proof.
See Lemma A.2 of [6]. ∎
In the following lemma, we prove some results regarding the concentration of the affinity matrices when and .
Lemma 5.3.
Proof.
First, (56) has been proved in [6] using the entry-wise Taylor expansion and the Gershgorin circle theorem; see the proof of Theorems 2.3 and 2.5 of [6]. Second, (58) has been proved in the proof of Theorem 2.5 of [6]. Third, we prove (57). By Lemma 5.2 and a discussion similar to [7, Lemma IV.5], when and
(59) |
Consequently,
where we used the fact that Since by the Gershgorin circle theorem, we conclude that This concludes our proof. ∎
In the following lemma, we collect the results regarding the affinity matrices when and Recall defined via (29).
Lemma 5.4.
Suppose Assumption 2.1 holds with , . For some constant denote
(60) |
Then we have:
(1). When if
(61) |
Moreover, we have that for in (60),
(62) |
On the other hand, when we have that
Finally, when is larger in the sense that (34) holds, we have that with probability at least for some constant
(63) |
(2). When is chosen according to (43), we have that
(64) |
Moreover, we have that for for some constant ,
(65) |
Recall (46). Finally, when we have that
(66) |
Similar results hold for
Proof.
See Corollary 2.11 and Theorem 3.1 of [6]. ∎
Finally, we record the results for the rigidity of eigenvalues of non-spiked Gram matrix. Denote the non-spiked Gram matrix as , where
Proof.
See [40, Theorem 3.3]. ∎
5.2. Proof of Theorem 3.1
We need the following notations. Denote , where , . Similarly, we define with , For denote
(67) | |||
(68) | |||
(69) |
Proof.
Case (1). . By (59), we conclude that
(70) |
Therefore, it suffices to consider To ease the heavy notation, we denote
(71) |
where and . With the above notations, by Lemma 5.3, we have that
(72) |
Denote
(73) |
Since and and contain samples from the common manifold, we can set
Moreover, denote
With the above notations, denote
(74) |
(75) |
Note that
(76) |
Moreover, are rank-one matrices and by (52),
(77) |
In light of (76), we can write
(78) |
We can further write
(79) |
where
On one hand, it is easy to see that On the other hand, by (77), (53) and Lemma 5.5, using the assumption that we obtain that
(80) |
Denote the spectral decompositions of and as
(81) |
Let and be the quantiles of and respectively as constructed via (20). Let be the eigenvalues of . For some small we denote an event
(82) |
Since is a Gaussian random matrix, we have that is a Haar orthogonal random matrix. Since and are independent, we have that is also a Haar orthogonal random matrix when is fixed. Since Lemma 5.5 implies that is a high probability event, in what follows, we focus our discussion on the high probability event and is a deterministic orthonormal matrix.
On one hand, have the same eigenvalues as by construction. On the other hand, by Lemma 5.5, we have that for
where and are diagonal matrices containing and respectively. Note that the rigidity of the eigenvalues of has been studied in [5] under the Gaussian assumption and summarized in Lemma 2.4. Together with Lemma 2.4, we conclude that for
(83) |
Note that
(84) |
We then analyze the rank of Recall that for any compatible matrices and we have that
Since and we conclude that
Consequently, we have that
(85) |
By (78), (79), (84) and (85), utilizing (70), we obtain that
(86) |
where we used Since , by (83), we have finished our proof for case (1).
Case (2). . Recall (76). In this case, according to Lemma 5.3, we require a high order expansion up to the degree of in (22) for Recall (24) and (78). By Lemma 5.3, we have that
(87) |
where is defined in (71) and is defined in (54) below satisfying Using a decomposition similar to (84) with (87), by (77) and the assumption , we obtain that
(88) |
It is easy to see that the rank of the second to the fourth terms of (88) is bounded by On the other hand, by (59), the first inequality of (70) should be replaced by
(89) |
The rest of the discussion follows from the case (1). This completes our proof for case (2).
Case (3). . The discussion is similar to case (2) except that we also need to conduct a high order expansion for Similar to (87), by Lemma 5.3, we have that
(90) |
By decomposition similar to (88), with (90), by (77), we have that
On the one hand, the rank of the second term on the right-hand side of the above equation can be bounded by On the other hand, the first term can again be analyzed in the same way as in the passage from (82) to (83) using Lemma 2.4. Finally, by (59), similarly to (89), we have that
(91) |
The rest of the proof follows from the discussion of case (1). This completes the proof of Case (3) using the fact ∎
5.3. Proof of Theorem 3.2
We now prove Theorem 3.2 when .
Case (1). . Decompose by
(92) |
First, we have that Moreover, using the decomposition (76), similar to (86), by (77), we can further write that
(93) |
Second, by Lemma 5.4, we have that
(94) |
Together with (70) and the fact that , using the definition (31), we have that
(95) |
We can therefore conclude the proof using (93).
Next, when is larger in the sense of (34), by Lemma 5.4, we find that with probability at least , for some constant
(96) |
Consequently, we have
(97) |
where in the second inequality we use the fact that since . By a result analogous to (57) for , we have that for
(98) |
Together with (97), we conclude our proof.
Case (2). . The discussion is similar to case (1) except that we need to conduct a high order expansion for Note that
By (77), (90) and (91), we have that
We can therefore conclude our proof by (94) with a discussion similar to (95). Finally, when is larger, we can conclude our proof using a discussion similar to (98) with (97) and Lemma 5.4. Together with (91), we conclude the proof.
5.4. Proof of Theorem 3.3
5.5. Proof of Theorem 4.1
To prove the results of Theorem 4.1, we first study the adaptive bandwidth and . When by Lemma 5.2 about the sub-Gaussian random vector, we have that for
Hence, are concentrated around . Then for any and chosen according to (43), we have that
(100) |
Similarly, when we have that Now we can prove part (1) when . Denote
where Since , we can apply the proof of Theorem 3.1 to the kernel functions The only difference is that the constant is now replaced by When the modification is similar except that we also need to use (2) of Lemma 5.4.
The other two cases can be obtained similarly by recalling the following fact. For any , let be the bandwidth selected using (43); then we have that for some constants , with high probability,
(101) |
Also, note that (2) of Lemma 5.4 holds. See Corollary 3.2 of [6] for the proof. With this fact, for part (2), (49) follows from (64) and its counterpart for and the fact ; (50) follows from (65) and its counterpart for and (2) of Lemma 5.1; (51) follows from (66) and its counterpart for and the assumption
References
- [1] Z. Bao, J. Hu, G. Pan, and W. Zhou. Canonical correlation coefficients of high-dimensional Gaussian vectors: Finite rank case. The Annals of Statistics, 47(1):612 – 640, 2019.
- [2] M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289–1308, 2008.
- [3] C. Bordenave. On Euclidean random matrices in high dimension. Electron. Commun. Probab., 18:no. 25, 8, 2013.
- [4] X. Cheng and A. Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 02(04):1350010, 2013.
- [5] X. Ding and H. C. Ji. Local laws for multiplication of random matrices and spiked invariant model. arXiv preprint arXiv 2010.16083, 2020.
- [6] X. Ding and H.-T. Wu. Impact of signal-to-noise ratio and bandwidth on graph Laplacian spectrum from high-dimensional noisy point cloud. arXiv preprint arXiv 2011.10725, 2020.
- [7] X. Ding and H. T. Wu. On the spectral property of kernel-based sensor fusion algorithms of high dimensional data. IEEE Transactions on Information Theory, 67(1):640–670, 2021.
- [8] Y. Do and V. Vu. The spectrum of random kernel matrices: Universality results for rough and varying kernels. Random Matrices: Theory and Applications, 02(03):1350005, 2013.
- [9] D. Dov, R. Talmon, and I. Cohen. Kernel-based sensor fusion with application to audio-visual voice activity detection. IEEE Transactions on Signal Processing, 64(24):6406–6416, 2016.
- [10] D. Dov, R. Talmon, and I. Cohen. Sequential audio-visual correspondence with alternating diffusion kernels. IEEE Transactions on Signal Processing, 66(12):3100–3111, 2018.
- [11] D. B. Dunson, H.-T. Wu, and N. Wu. Diffusion based Gaussian process regression via heat kernel reconstruction. Applied and Computational Harmonic Analysis, 2021.
- [12] N. El Karoui. On information plus noise kernel random matrices. The Annals of Statistics, 38(5):3191 – 3216, 2010.
- [13] N. El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1 – 50, 2010.
- [14] N. El Karoui and H.-T. Wu. Graph connection Laplacian and random matrices with random blocks. Information and Inference: A Journal of the IMA, 4(1):1–44, 2015.
- [15] N. El Karoui and H.-T. Wu. Graph connection Laplacian methods can be made robust to noise. The Annals of Statistics, 44(1):346 – 372, 2016.
- [16] L. Erdős and H. Yau. A Dynamical Approach to Random Matrix Theory. Courant Lecture Notes. American Mathematical Society, 2017.
- [17] Z. Fan and A. Montanari. The spectral norm of random inner-product kernel matrices. Probab. Theory Related Fields, 173(1-2):27–85, 2019.
- [18] N. García Trillos, M. Gerlach, M. Hein, and D. Slepcev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace-Beltrami operator. Found. Comput. Math., 20(4):827–887, 2020.
- [19] F. Gustafsson. Statistical Sensor Fusion. Professional Publishing House, 2012.
- [20] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
- [21] M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds – weak and strong pointwise consistency of graph Laplacians. In P. Auer and R. Meir, editors, Learning Theory, pages 470–485, 2005.
- [22] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. J. Mach. Learn. Res., 8:1325–1368, 2007.
- [23] P. Horst. Relations among m sets of measures. Psychometrika, 26(2):129–149, 1961.
- [24] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
- [25] H. Hwang, K. Jung, Y. Takane, and T. S. Woodward. A unified approach to multiple-set canonical correlation analysis and principal components analysis. British Journal of Mathematical and Statistical Psychology, 66(2):308–321, 2013.
- [26] H. C. Ji. Regularity Properties of Free Multiplicative Convolution on the Positive Line. International Mathematics Research Notices, 07 2020. rnaa152.
- [27] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295–327, 2001.
- [28] S. P. Kasiviswanathan and M. Rudelson. Spectral norm of random kernel matrices with applications to privacy. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2015, August 24-26, 2015, Princeton, NJ, USA, volume 40 of LIPIcs, pages 898–914. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.
- [29] A. Knowles and J. Yin. Anisotropic local laws for random matrices. Probab. Theory Related Fields, 169(1-2):257–352, 2017.
- [30] D. Lahat, T. Adali, and C. Jutten. Multimodal data fusion: An overview of methods, challenges, and prospects. Proceedings of the IEEE, 103(9):1449–1477, 2015.
- [31] R. R. Lederman and R. Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018.
- [32] R. R. Lederman and R. Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018.
- [33] O. Lindenbaum, Y. Bregman, N. Rabin, and A. Averbuch. Multiview kernels for low-dimensional modeling of seismic events. IEEE Transactions on Geoscience and Remote Sensing, 56(6):3300–3310, 2018.
- [34] O. Lindenbaum, A. Yeredor, M. Salhov, and A. Averbuch. Multi-view diffusion maps. Information Fusion, 55:127–149, 2020.
- [35] G.-R. Liu, Y.-L. Lo, J. Malik, Y.-C. Sheu, and H.-T. Wu. Diffuse to fuse EEG spectra – intrinsic geometry of sleep dynamics for classification. Biomedical Signal Processing and Control, 55:101576, 2020.
- [36] Z. Ma and F. Yang. Sample canonical correlation coefficients of high-dimensional random vectors with finite rank correlations. arXiv preprint arXiv 2102.03297, 2021.
- [37] V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967.
- [38] N. F. Marshall and M. J. Hirn. Time coupled diffusion maps. Applied and Computational Harmonic Analysis, 45(3):709–728, 2018.
- [39] T. Michaeli, W. Wang, and K. Livescu. Nonparametric canonical correlation analysis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, page 1967–1976, 2016.
- [40] N. S. Pillai and J. Yin. Universality of covariance matrices. The Annals of Applied Probability, 24(3):935 – 1001, 2014.
- [41] M. A. Samuels. The brain–heart connection. Circulation, 116(1):77–84, 2007.
- [42] T. Shnitzer, M. Ben-Chen, L. Guibas, R. Talmon, and H.-T. Wu. Recovering hidden components in multimodal data with composite diffusion operators. SIAM J. Math. Data Sci., 1(3):588–616, 2019.
- [43] T. Shnitzer, M. Ben-Chen, L. Guibas, R. Talmon, and H.-T. Wu. Recovering hidden components in multimodal data with composite diffusion operators. SIAM Journal on Mathematics of Data Science, 1(3):588–616, 2019.
- [44] A. Singer. From graph to manifold laplacian: The convergence rate. Applied and Computational Harmonic Analysis, 21(1):128–134, 2006.
- [45] R. Talmon and H.-T. Wu. Latent common manifold learning with alternating diffusion: Analysis and applications. Applied and Computational Harmonic Analysis, 47(3):848–892, 2019.
- [46] R. Talmon and H.-T. Wu. Latent common manifold learning with alternating diffusion: analysis and applications. Applied and Computational Harmonic Analysis, 47(3):848–892, 2019.
- [47] D. Voiculescu. Multiplication of certain non-commuting random variables. Journal of Operator Theory, 18(2):223–235, 1987.
- [48] L. Xiao, J. M. Stephen, T. W. Wilson, V. D. Calhoun, and Y.-P. Wang. A manifold regularized multi-task learning model for IQ prediction from two fMRI paradigms. IEEE Transactions on Biomedical Engineering, 67(3):796–806, 2019.
- [49] J. Zhao, X. Xie, X. Xu, and S. Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.
- [50] X. Zhuang, Z. Yang, and D. Cordes. A technical review of canonical correlation analysis for neuroscience applications. Human Brain Mapping, 41(13):3807–3833, 2020.