How do kernel-based sensor fusion algorithms behave under high dimensional noise?
Abstract.
We study the behavior of two kernel-based sensor fusion algorithms, nonparametric canonical correlation analysis (NCCA) and alternating diffusion (AD), in the nonnull setting where the clean datasets collected from two sensors are modeled by a common low dimensional manifold embedded in a high dimensional Euclidean space and the datasets are corrupted by high dimensional noise. We establish the asymptotic limits and convergence rates for the eigenvalues of the associated kernel matrices, assuming that the sample dimension and sample size are comparably large, where NCCA and AD are conducted using the Gaussian kernel. It turns out that both the asymptotic limits and convergence rates depend on the signal-to-noise ratio (SNR) of each sensor and the selected bandwidths. On the one hand, we show that if NCCA and AD are directly applied to the noisy point clouds without any sanity check, they may generate artificial information that misleads scientists’ interpretation. On the other hand, we prove that if the bandwidths are selected adequately, both NCCA and AD can be made robust to high dimensional noise when the SNRs are relatively large.
1. Introduction
How to adequately quantify the system of interest by assembling the available information from multiple datasets collected simultaneously from different sensors is a long-standing and commonly encountered problem in data science. This problem is commonly referred to as the sensor fusion problem [20, 30, 19, 49]. While the simplest approach to “fuse” the information is a simple concatenation of the information available from each sensor, it is neither the best nor the most efficient approach. To achieve a better and more efficient fusion algorithm, researchers usually face several challenges. For example, the sensors might be heterogeneous, datasets from different sensors might not be properly aligned, and the datasets might be high dimensional and noisy, to name but a few. Roughly speaking, researchers are interested in extracting the common components (information) shared by different sensors, if any exist.
A lot of effort has been invested in finding a satisfactory algorithm based on various models. Historically, when a linear structure can safely be assumed in the common information shared by different sensors, the most typical algorithm for handling this problem is canonical correlation analysis (CCA) [24] and its descendants [23, 20, 25], which is a far from complete list. In the modern data analysis era, due to advances in sensor development and the growing complexity of problems, researchers may need to take the nonlinear structure of the datasets into account to better understand them. To handle this nonlinear structure, several nonlinear sensor fusion algorithms have been developed, for example, nonparametric canonical correlation analysis (NCCA) [39], alternating diffusion (AD) [31, 45] and its generalization [42], time coupled diffusion maps [38], multiview diffusion maps [34], etc. See [42] for a recent and more thorough list of available nonlinear tools and [50] for a recent review. The main idea behind these developments is that the nonlinear structure is modeled by various nonlinear geometric structures, and the algorithms are designed to preserve and capture this structure. Such ideas and algorithms have been successfully applied to many real world problems, like audio-visual voice activity detection [9], the study of sequential audio-visual correspondence [10], automatic sleep stage annotation from two electroencephalogram signals [35], seismic event modeling [33], fetal electrocardiogram analysis [42] and IQ prediction from two fMRI paradigms [48], which is a far from complete list.
While these kernel-based sensor fusion algorithms have been developed and applied for a while, there are still several gaps toward a solid practical application and a sound theoretical understanding of these tools. One important gap is understanding how the inevitable noise, particularly when the data dimension is high, impacts the kernel-based sensor fusion algorithms. For example, can we be sure that the obtained fused information is really informative, particularly when the datasets are noisy or when one sensor is broken? If the signal-to-noise ratios of the two sensors are different, how does this noise impact the information captured by these kernel-based algorithms? To our knowledge, the developed kernel-based sensor fusion algorithms do not account for how the noise interacts with the algorithm, and most theoretical understanding is mainly based on the nonlinear data structure without considering the impact of high dimensional noise, except for a recent effort in the null case [7]. In this paper, we focus on one specific challenge among many; that is, we study how high dimensional noise impacts the spectrum of two kernel-based sensor fusion algorithms, NCCA and AD, in the non-null setup when there are two sensors.
We briefly recall the NCCA and AD algorithms. Consider two noisy point clouds, and . For some bandwidths and some fixed constant chosen by the user, we consider two affinity matrices, and , defined as
(1) |
where . Here, and are related to the point clouds and respectively. Denote the associated degree matrices and which are diagonal matrices such that
(2) |
Moreover, denote the transition matrices as
The NCCA and AD matrices are defined as
(3) |
respectively. Note that in the current paper, for simplicity, we focus our study on the Gaussian kernel. More general kernel functions are left for future work. Usually, the top few eigenpairs of and are used as features of the extracted common information shared by the two sensors. We emphasize that in general, while and are diagonalizable, and are not. But theoretically we can obtain the top few eigenpairs without a problem under the common manifold model [45, 42], since asymptotically and both converge to self-adjoint operators. To avoid this issue, researchers also consider the singular value decomposition (SVD) of and . Another important fact is that usually we are interested in the case when and are aligned; that is, and are sampled from the same system at the same time. However, the algorithm can be applied to any two datasets of the same size, although this is not our concern in this paper.
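For concreteness, the following NumPy sketch implements one common reading of the pipeline just described. Since the displayed formulas (1)–(3) are not reproduced here, the Gaussian affinity exp(-||x_i - x_j||^2 / h) (with the user-chosen constant in the exponent omitted), the row-stochastic normalization A = D^{-1} W, and the assignment of the products A^(1) (A^(2))^T and A^(1) A^(2) to NCCA and AD are assumptions of this sketch rather than a restatement of (3).

```python
import numpy as np

def pairwise_sq_dists(X):
    """Pairwise squared Euclidean distances between the columns of X (shape (p, n))."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    return np.maximum(d2, 0.0)            # clip tiny negatives caused by round-off

def transition(X, h):
    """Row-stochastic transition matrix A = D^{-1} W with W_ij = exp(-||x_i - x_j||^2 / h)."""
    W = np.exp(-pairwise_sq_dists(X) / h)
    return W / W.sum(axis=1, keepdims=True)

def ncca_ad(X1, X2, h1, h2):
    """One common convention: NCCA matrix A1 @ A2.T and AD matrix A1 @ A2 (cf. (3))."""
    A1, A2 = transition(X1, h1), transition(X2, h2)
    return A1 @ A2.T, A1 @ A2

# Toy usage with the classic bandwidth h = p; singular values are reported because
# the product matrices need not be symmetric.
rng = np.random.default_rng(0)
p, n = 100, 300
X1, X2 = rng.standard_normal((p, n)), rng.standard_normal((p, n))
ncca_mat, ad_mat = ncca_ad(X1, X2, h1=float(p), h2=float(p))
print(np.linalg.svd(ncca_mat, compute_uv=False)[:5])
```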
1.1. Some related works
In this subsection, we summarize some related results. Since the NCCA and AD matrices (3) are essentially products of transition matrices, we start by summarizing results on the affinity and transition matrices when there is only one sensor. On the one hand, in the noiseless setting, the spectral properties have been widely studied, for example, in [2, 21, 22, 44, 18, 11], to name but a few. In summary, under the manifold model, researchers show that the graph Laplacian (GL) converges to the Laplace–Beltrami operator in various settings with a properly chosen bandwidth. On the other hand, the spectral properties have been investigated in [4, 3, 8, 14, 13, 17, 28] under the null setup. These works essentially show that when contains pure high-dimensional noise, the affinity and transition matrices are governed by a low-rank perturbed Gram matrix when the bandwidth . Despite the rich literature on these two extreme setups, limited results are available in the intermediate, or nonnull, setup [12, 6, 15]. For example, when the signal-to-noise ratio (SNR), which will be defined precisely later, is sufficiently large, the spectral properties of the GL constructed from the noisy observations are close to those constructed from the clean signal. Moreover, the bandwidth plays an important role in the nonnull setup. For a more comprehensive review and a sophisticated study of the spectral properties of the affinity and transition matrices for an individual point cloud, we refer the readers to [6, Sections 1.2 and 1.3].
For the NCCA and AD matrices, on the one hand, in the noiseless setting, there have been several results under the common manifold model [32, 46]. On the other hand, under the null setup in which both sensors only capture high dimensional white noise, their spectral properties have been studied recently [7]. Specifically, except for a few larger outliers, when and the edge eigenvalues of or converge to some deterministic limit depending on the free convolution (c.f. Definition 2.3) of two Marchenko–Pastur (MP) laws [37]. However, in the nonnull setting, when both sensors are contaminated by noise, to our knowledge there does not exist any theoretical study, particularly in the high dimensional setup.
1.2. An overview of our results
We now provide an overview of our results. The main contribution of this paper is a comprehensive study of NCCA and AD in the non-null case in the high dimensional setup. This result can be viewed as a continuation of the study of the null case [7]. We focus on the setup in which the signal is modeled by a low dimensional manifold. It turns out that this problem can be recast as studying the algorithms under the commonly applied spiked model, which will be made clear later. In addition to providing a theoretical justification based on kernel random matrix theory, we propose a method to choose the bandwidth adaptively. Moreover, peculiar and counterintuitive results will be presented when the two sensors behave differently, which emphasizes the importance of carefully applying these algorithms in practice. In Section 3, we investigate the eigenvalues of the NCCA and AD matrices when and , which is a common choice in the literature. The behavior of the eigenvalues varies according to the SNRs of both point clouds. When both SNRs are small, the spectral behavior of and is like that in the null case, while both the number of outliers and the convergence rates depend on the SNRs; see Theorem 3.1 for details. Furthermore, if one of the sensors has a large SNR and the other one has a small SNR, the eigenvalues of and provide limited information about the signal; see Theorem 3.2 for details. We emphasize that this result warns us that if we directly apply NCCA and AD without any sanity check, it may result in a misleading conclusion. When both SNRs are larger, the eigenvalues are close to those of the clean NCCA and AD matrices; see Theorem 3.3 for more details. It is clear that the classic bandwidth choices for are inappropriate when the SNR is large, since the bandwidth is too small compared with the signal strength. In this case , and we obtain limited information about the signal; see (42) for details. To handle this issue, in Section 4, we consider bandwidths that are adaptively chosen according to the dataset. With this choice, when the SNRs are large, NCCA and AD become non-trivial and informative; that is, NCCA and AD are robust against the high dimensional noise. See Theorem 4.1 for details.
Conventions. The fundamental large parameter is and we always assume that and are comparable to and depend on . We use to denote a generic positive constant, and the value may change from one line to the next. Similarly, we use , , , etc., to denote generic small positive constants. If a constant depends on a quantity , we use or to indicate this dependence. For two quantities and depending on , the notation means that for some constant , and means that for some positive sequence as . We also use the notations if , and if and . For a matrix , indicates the operator norm of , and means for some constant . Finally, for a random vector we say it is sub-Gaussian if for any deterministic vector , we have .
The paper is organized as follows. In Section 2, we introduce the mathematical setup and some background from random matrix theory. In Section 3, we state our main results for the classic choice of bandwidth. In Section 4, we state the main results for the adaptively chosen bandwidth. In Section 5, we offer the technical proofs of the main results. In Section 5.1, we provide and prove some preliminary results which are used in the technical proofs.
2. Mathematical framework and background
2.1. Mathematical framework
We focus on the following model for the datasets and . Assume that the first sensor samples clean signals i.i.d. from a sub-Gaussian vector , denoted as , where is a probability space. Similarly, assume that the second sensor also samples clean signals i.i.d. from a sub-Gaussian vector , denoted as . Since we focus on the distance of pairwise samples, without loss of generality, we assume that
(4) |
Denote and , and to simplify the discussion, we assume that and admit the following spectral decomposition
(5) |
where and are fixed integers. We model the common information by assuming that there exists a bijection so that
(6) |
that is, we have for any . In practice, the clean signals and are contaminated by two sequences of i.i.d. sub-Gaussian noise and , respectively, so that the data generating process follows
(7) |
where
(8) |
We further assume that and are independent of each other and also independent of and . We are mainly interested in the high dimensional setting; that is, and are comparably as large as More specifically, we assume that there exists some small constant such that
(9) |
The SNRs in our setting are defined as and respectively, so that for all and ,
(10) |
for some constants . To avoid repetitions, we summarize the assumptions as follows.
In view of (5), the model (7) for each sensor is related to the spiked covariance matrix model [27]. We comment that this seemingly simple model, particularly (5), includes the commonly considered nonlinear common manifold model. In the literature, the common manifold model means that the two sensors sample simultaneously from one low dimensional manifold; that is, and is an identity map, where is a low dimensional smooth and compact manifold embedded in the high dimensional Euclidean space. Since we are interested in kernel matrices depending on pairwise distances, which are invariant to rotation, when combined with Nash’s embedding theorem, the common manifold can be assumed to be supported in the first few axes of the high dimensional space, as in (5). As a result, the common manifold model becomes a special case of the model (7). We refer readers to [6] for a detailed discussion of this relationship. A special example of the common manifold model is the widely considered linear subspace as the common component; that is, when embedded in for . In this case, we could simply apply CCA to estimate the common component, and its behavior in the high dimensional setup has been studied in [1, 36].
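To make the model concrete, the following sketch generates two aligned noisy point clouds sharing a common one-dimensional manifold (a circle) supported in the first two coordinates, in the spirit of (5)–(8) with the identity map as the bijection in (6). Parametrizing the signal strength by an exponent alpha, so that a larger alpha mimics a larger SNR in (10), is an illustrative assumption, and all names are hypothetical.

```python
import numpy as np

def common_manifold_data(n=300, p=400, alpha1=1.0, alpha2=1.0, seed=0):
    """Two aligned noisy sensors sharing a common circle-valued clean signal.

    Columns are samples. The clean signal lives in the first two coordinates and is
    scaled by sqrt(p**alpha); the exponent alpha is an illustrative stand-in for the
    SNR parametrization in (10)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)        # common latent variable
    circle = np.vstack([np.cos(theta), np.sin(theta)])   # clean signal, shape (2, n)

    def sensor(alpha):
        data = rng.standard_normal((p, n))               # i.i.d. high dimensional noise
        data[:2] += np.sqrt(p ** alpha) * circle         # low-rank (spiked) clean signal
        return data

    return sensor(alpha1), sensor(alpha2)

X1, X2 = common_manifold_data(alpha1=1.2, alpha2=0.3)    # one strong and one weak sensor
```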
We should emphasize that through the analysis of NCCA and AD under the common component model satisfying Assumption (2.1), we do not claim that we could understand the underlying manifold structure. The problem we are asking here is the nontrivial relationship between the noisy and clean affinity and transition matrices, while the problem of exploring the manifold structure from the clean datasets [18, 11] is a different one, which is usually understood as the manifold learning problem. To study the nontrivial relationship between the noisy and clean affinity and transition matrices, it is the spiked covariance structure that we focus on, but not the possibly non-trivial . By establishing the nontrivial relationship between the noisy and clean affinity and transition matrices in this paper, when combined with the knowledge of manifold learning via the kernel-based manifold learning algorithm with clean datasets [31, 46, 43], we know how to explore the common manifold structure, which depends on , from the noisy datasets.
Remark 2.2.
While it is not our focus in this paper, we should mention that our model includes the case where the datasets captured by the two sensors do not lie exactly on one manifold , but on two manifolds that are diffeomorphic to [43]. Specifically, the first sensor samples points from , while the second sensor simultaneously samples points from , where and are both diffeomorphisms and ; that is, in (6). Note that in this case, might be different from . Moreover, the samples from the two sensors can be more general. For example, in [46], the principal bundle structure is considered to model the “nuisance”, which can be understood as “deterministic noise”, and in [31] a metric space is considered as the common component. While it is possible to consider a more complicated model, since we are interested in studying how noise impacts NCCA and AD, in this paper we simply focus on the above model and do not further elaborate on this possible extension.
2.2. Some random matrix theory background
In this subsection, we introduce some random matrix theory background and necessary notations. Let be the data matrix associated with ; that is, the -th column is , and consider the scaled noise , where stands for the standard deviation of the scaled noise. Denote the empirical spectral distribution (ESD) of as
It is well-known that in the high dimensional regime (9), has the same asymptotics [29] as the so-called MP law [37], denoted as , satisfying
(11) |
where is a measurable set, is the indicator function and when and when ,
(12) |
and . Denote
(13) |
and for ,
(14) |
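As a quick numerical illustration of the ESD and the MP law appearing in (11)–(14): for i.i.d. unit-variance noise and aspect ratio c = p/n, the spectrum of the sample covariance matrix concentrates on the MP bulk with edges (1 ± sqrt(c))^2. This is the standard statement for unit variance; the exact scaling convention used in (11)–(12) is not reproduced above, so the sketch below should be read with that caveat.

```python
import numpy as np

def mp_edges(c):
    """Bulk edges (1 - sqrt(c))^2 and (1 + sqrt(c))^2 of the MP law with ratio c = p/n."""
    return (1.0 - np.sqrt(c)) ** 2, (1.0 + np.sqrt(c)) ** 2

p, n = 400, 800
rng = np.random.default_rng(0)
Z = rng.standard_normal((p, n))          # i.i.d. unit-variance noise
S = Z @ Z.T / n                          # p x p sample covariance matrix
evals = np.linalg.eigvalsh(S)            # its ESD approximates the MP law
lo, hi = mp_edges(p / n)
print(f"empirical range: [{evals.min():.3f}, {evals.max():.3f}]")
print(f"MP bulk edges : [{lo:.3f}, {hi:.3f}]")
```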
For any constant , denote by the shifting operator that shifts a probability measure defined on by ; that is,
(15) |
where means the shifted set. Using the notation (11), for denote
(16) |
Next, we introduce a -dependent quantity of some probability measure. For a given probability measure and define as
(17) |
Finally, we recall the following notion of stochastic domination [16, Chapter 6.3] that we will frequently use. Let and be two families of nonnegative random variables, where is a possibly -dependent parameter set. We say that is stochastically dominated by , uniformly in the parameter , if for any small and large , there exists so that we have , for a sufficiently large . We interchangeably use the notation or if is stochastically dominated by , uniformly in , when there is no danger of confusion. In addition, we say that an -dependent event holds with high probability if for a , there exists so that , when
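Since the displays inside the definition above are not reproduced, we record the standard form of stochastic domination from [16] for the reader's convenience; the symbol names below are chosen for readability and may differ from the paper's notation.

```latex
% X = (X^{(n)}(u))_{u \in U^{(n)}} is stochastically dominated by Y = (Y^{(n)}(u))_{u \in U^{(n)}},
% uniformly in u, if for every small \epsilon > 0 and large D > 0 there exists n_0(\epsilon, D)
% such that
\sup_{u \in U^{(n)}} \mathbb{P}\left( X^{(n)}(u) > n^{\epsilon}\, Y^{(n)}(u) \right) \le n^{-D},
\qquad \text{for all } n \ge n_0(\epsilon, D).
```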
2.3. A brief summary of free multiplication of random matrices
In this subsection, we summarize some preliminary results about free multiplication of random matrices from [5, 26]. Given some probability measure its Stieltjes transform and -transform are defined as
where , respectively. We next introduce the subordination functions utilizing the -transform [26, 47]. For any two probability measures and , there exist analytic functions and satisfying
(18) |
Armed with the subordination functions, we now introduce the free multiplicative convolution of and denoted as , when and are compactly supported on but not both delta measures supported on ; see Definition 2.7 of [5].
Definition 2.3.
Denote the analytic function by
(19) |
Then the free multiplicative convolution is defined as the unique probability measure such that (19) holds for all i.e., is the -transform of Moreover, and are referred to as the subordination functions.
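The free multiplicative convolution can also be approximated numerically, which is the form in which it enters the results below: if A and B are positive definite matrices with prescribed spectra and U is Haar orthogonal, the ESD of A^{1/2} U B U^T A^{1/2} (which shares the spectrum of A U B U^T) approximates the free multiplicative convolution of the two ESDs in large dimension. The sketch below is illustrative only, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import ortho_group

def free_mult_eigs(a_evals, b_evals, seed=0):
    """Eigenvalues of A^{1/2} U B U^T A^{1/2} with Haar orthogonal U; for large dimension
    their distribution approximates the free multiplicative convolution of the inputs."""
    n = len(a_evals)
    U = ortho_group.rvs(n, random_state=seed)
    A_half = np.diag(np.sqrt(np.maximum(a_evals, 0.0)))   # clip round-off negatives
    M = A_half @ U @ np.diag(b_evals) @ U.T @ A_half      # symmetric; same spectrum as A U B U^T
    return np.linalg.eigvalsh(M)

# Example: free multiplicative convolution of two independent noise Gram spectra (MP-type laws).
n, p = 400, 800
rng = np.random.default_rng(1)
Z1, Z2 = rng.standard_normal((n, p)), rng.standard_normal((n, p))
a = np.linalg.eigvalsh(Z1 @ Z1.T / p)
b = np.linalg.eigvalsh(Z2 @ Z2.T / p)
print(free_mult_eigs(a, b)[-5:])   # the largest values sit near the right edge of the limit
```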
For and defined in (16), we have two sequences and , where . Note that we have
(20) |
where are the right edges of and respectively. Denote two positive definite matrices and as follows
(21) |
Let be a Haar distributed random matrix in and denote
The following lemma summarizes the rigidity of eigenvalues of
Lemma 2.4.
Suppose (21) holds. Then we have that
3. Main results (I)–classic bandwidth:
In this section, we state our main results regarding the eigenvalues of and when , where For definiteness, we assume that and In what follows, for ease of statement, we focus on reporting the results for and hence omit the subscripts of the indices in (10). For the general setting with or , we refer the readers to Remark 3.4 below for more details. Finally, we focus on reporting the results for the NCCA matrix . The results for the AD matrix are similar. For the details of the AD matrix , we refer the readers to Remark 3.5 below. Moreover, by symmetry, without loss of generality, we always assume that ; that is, the first sensor always has a larger SNR.
3.1. Noninformative region:
In this subsection, we state the results when at least one sensor contains strong noise, or equivalently has a small SNR; that is, . In this case, NCCA and AD will not be able to provide useful information, or can only provide limited information, about the underlying common manifold.
3.1.1. When both sensors have small SNRs,
In this case, both sensors have small SNRs such that the noise dominates the signal. For some fixed integers satisfying
(22) |
where is some constant, denote as
(23) |
Moreover, define as
(24) |
Theorem 3.1.
Intuitively, in this region we cannot obtain any information about the signal, since asymptotically the noise dominates the signal. In practice, the datasets might fall in this region when both sensors are corrupted or the environmental noise is too strong. This intuition is confirmed by Theorem 3.1. As discussed in [6, 13], when the noise dominates the signal, the outlier eigenvalues mainly come from the kernel function expansion or the Gram matrix and hence are not useful for studying the underlying manifold structure. The number of these outlier eigenvalues depends on the SNRs, as can be seen in (23), and can be figured out from the kernel function expansion.
We should point out that (25) and (26) are mainly technical assumptions and are commonly used in the random matrix theory literature. They guarantee that the individual bulk eigenvalues of can be characterized by the quantiles of the free multiplicative convolution. Specifically, (26) ensures that the Gram matrices are bounded from below, and (25) has been used in [7] to ensure that the eigenvectors of the Gram matrix are Haar distributed. As discussed in [7], while it is widely accepted that the eigenvectors of the Gram matrix from i.i.d. sub-Gaussian random vectors are Haar distributed, we are not aware of a proper proof. Since the proof of this theorem depends on the results in [7], we impose the same condition. The assumption (25) can be removed once one can show that the eigenvectors of the Gram matrix from i.i.d. sub-Gaussian random vectors are Haar distributed. Since this is not the focus of the current paper, we will pursue this direction in future work.
3.1.2. When one sensor has a small SNR,
In Theorem 3.2 below, we consider that i.e., one of the sensors has a large SNR whereas the other is dominated by the noise. We first prepare some notation. Let and be the affinity matrices associated with and respectively, where the subscript is shorthand for the signal. In other words, and are constructed from the clean signal. In general, since may be different from , and might be different. Denote
(28) |
Analogously, we denote the associated degree matrix and transition matrix as and respectively, that is,
(29) |
Define and similarly. Note that from the random walk perspective, (and as well) describe a lazy random walk on the clean dataset. We further introduce some other matrices,
We then define the associate degree matrix and transition matrix as and respectively; that is,
(30) |
and will be used when is too large () so that the bandwidth is insufficient to capture the relationship between two different samples.
Theorem 3.2.
Suppose Assumption 2.1 holds with , , and . Then we have that for ,
(33) |
where is defined in (24) and is defined in (31). Furthermore, when is larger in the sense that for any given small constant
(34) |
then with probability at least for some sufficiently small constant and some constant and all we have
(35) |
This is a potentially confusing region. In practice, it captures the situation where one sensor is corrupted so that the signal part becomes weak. Since we still have one sensor available with a strong SNR, it is expected that we could still obtain something useful. However, Theorem 3.2 shows that the corrupted sensor unfortunately contaminates the overall performance of the sensor fusion algorithm. Note that since the first sensor has a large SNR, the noisy transition matrix is close to the transition matrix , which only depends on the signal part when , and the transition matrix , which is a mixture of the signal and noise, when . This fact has been shown in [6]. However, for the second sensor, due to the strong noise, will be close to a perturbed Gram matrix that mainly comes from the high dimensional noise. Consequently, as illustrated in (33), the NCCA matrix will be close to , which is a product of the clean transition matrix and the shifted Gram matrix. Clearly, the clean transition matrix is contaminated by the shifted Gram matrix, which does not contain any information about the signal. This limits the information we can obtain.
In the extreme case when is larger in the sense of (34), the chosen bandwidth is too small compared with the signal so that the transition matrix will be close to the identity matrix. Consequently, as in (35), the NCCA matrix will be mainly characterized by the perturbed Gram matrix whose limiting ESD follows the MP law with proper scaling.
We should however emphasize that, as elaborated in [6], when the SNR is large, particularly when , we should consider a different bandwidth, namely the bandwidth determined by a percentile of the pairwise distances, which is commonly used in practice. It is thus natural to ask whether, if the bandwidth is chosen “properly”, we would eventually obtain useful information. We will answer this question in Section 4.
3.2. Informative region:
In this subsection, we state the results when both of the sensors have large SNR (). Recall (29), (30) and denote analogously for the point cloud For some constant denote
(36) |
Theorem 3.3.
Theorem 3.3 shows that when , where and both SNRs are large, the NCCA matrix constructed from the noisy dataset can be well approximated by that constructed from the clean dataset of the common manifold. The main reason has been elaborated in [6] for the case of a single sensor. In the two-sensor case, combining (37) and (38), we see that except for the first eigenvalues, the remaining eigenvalues are negligible and not informative. Moreover, (2) and (3) reveal important information about the bandwidth; that is, if the bandwidth choice is improper, like and , the result can be misleading in general. For instance, when and are large, ideally we should have a “very clean” dataset and we would expect to obtain useful information about the signal. However, this result says that we cannot obtain any useful information from NCCA; in particular, see (42). This is nevertheless intuitively true, since when the bandwidth is too small, the relationship between two distinct points cannot be captured by the kernel; that is, when , with high probability (see the proof below or [6] for a precise statement of this argument). This problem can be fixed if we choose a proper bandwidth. In Section 4, we state the corresponding results when the bandwidths are selected properly, in which case this counterintuitive result is eliminated.
Remark 3.4.
In the above theorems, we focus on reporting the results for the case in (5). In this remark, we discuss how to generalize the results to the setting when or First, when Theorem 3.1 still holds after a minor modification; for example, in (22) should be replaced by and the error bound in (27) should be replaced by
where ’s are defined similarly as in (24). Similar arguments apply for Theorem 3.2. Second, when , where Theorem 3.3 holds by setting Finally, suppose that there exist some integers such that
Then we have that Theorem 3.3 still holds by setting and the affinity and transition matrices in (29) should be defined using the signal part with large SNRs. For example, should be defined via
The detailed statements and proofs are similar to the setting except for extra notational complexity. Since this is not the main focus of the current paper, we omit the details here.
Remark 3.5.
Throughout the paper, we focus on reporting the results of the NCCA matrix. However, our results can also be applied to the AD matrix with a minor modification based on their definitions in (3). Specifically, Theorem 3.1 holds for , Theorem 3.2 holds for and Theorem 3.3 holds for by replacing with with and with . Since the proof is similar, we omit details.
4. Main results (II)–adaptive choice of bandwidth
As discussed after Theorem 3.3, when the SNRs are large, the bandwidth choice for is improper. One solution to this issue has been discussed in [6] for the single-sensor case; that is, the bandwidth is determined by a percentile of all pairwise distances. It is thus natural to hypothesize that the same solution holds for the kernel-based sensor fusion approach. As in Section 3, we focus on the case and the discussion for the general setting is similar to that of Remark 3.4. As before, we also assume that Also, we focus on reporting the results for the NCCA matrix. The discussion for the AD matrix is similar to that of Remark 3.5.
We first recall the adaptive bandwidth selection approach of [6], motivated by the empirical approach commonly used in practice. Let and be the empirical distributions of the pairwise distances and respectively. Then we choose the bandwidths and by
(43) |
where and are fixed constants chosen by the user. Define in the same way as that in (45), as that in (48), and as that in (46) using (43). Similarly, we can define the counterparts for the point cloud
Recall that and are the affinity matrices associated with and With a slight abuse of notation, for we denote
(44) |
where are constructed using the adaptively selected bandwidth . Clearly, and differ by an isotropic spectral shift, and when , asymptotically and are the same. Note that compared to (28), the difference is that we use the modified bandwidth in (44). This difference is significant, particularly when is large. Indeed, when is large, defined in (28) is close to an identity matrix, while defined in (44) encodes information about the signal. Specifically, asymptotically we can show that defined in (44) converges to an integral operator defined on the manifold, whose spectral structure is commonly used in the manifold learning community to study the signal. See [6] for more discussion. We then define
(45) |
and
(46) |
where . Compared to (45), (46) does not contain the scaling and shift of the signal parts. Moreover, denote
(47) |
Theorem 4.1.
Theorem 4.1 (1) states that if both sensors have low SNRs, the NCCA matrix has spectral behavior similar to that in Theorem 3.1; that is, when the SNRs are small, due to the impact of the noise, even if a signal exists, we may not obtain a useful result. The reason is that we still have for with high probability (see (100)), so the bandwidth choice does not influence the conclusion. In particular, most of the eigenvalues of are governed by the free multiplicative convolution of two MP-type laws, which are essentially the limiting empirical spectral distributions of Gram matrices containing only white noise.
On the other hand, when the signals are stronger; that is, we are able to approximate the associated clean NCCA matrix of the underlying clean common component, as is detailed in Theorem 4.1 (2). This result can be interpreted as saying that NCCA is robust to the noise. In particular, when we see that and come from the clean dataset directly. Finally, we point out that, compared to (36), except for the top eigenvalues for some constant , the remaining eigenvalues are not informative. When the NCCA matrix is always informative, in contrast to (2) and (3) of Theorem 3.3. As a result, when combined with the existing theory about AD [31, 45], the first few eigenpairs of NCCA and AD capture the geometry of the common manifold under the manifold setup.
Theorem 4.1 (1) also describes the behavior of NCCA when one sensor has a high SNR while the other one has a low SNR, which is the most interesting and counterintuitive case. In this case, even if the bandwidths of both sensors are chosen according to (43), the NCCA matrix still encodes limited information about the signal, similarly to what is stated in Theorem 3.2. Indeed, the NCCA matrix is close to a product matrix that is a mixture of signal and noise, as shown in (48). While contains information about the signal, it is contaminated by through the product, which comes from the noise-dominated dataset collected from the other sensor. Since the spectral behavior of follows the shifted and scaled MP law, overall we obtain limited information about the signal if we apply the kernel-based sensor fusion algorithm. In this case, it is better to simply consider the dataset with a high SNR. Based on the above discussion and practical experience, we would like to mention a potential danger of directly applying NCCA (or AD) without confirming the signal quality. This result warns us that if we directly apply AD without any sanity check, it may result in a misleading conclusion, or give us lower quality information. Therefore, before applying NCCA and AD, it is suggested to follow the common practice of detecting the existence of signal in each of the sensors.
For the choices of the constants and we comment that in practice researchers usually choose or [42]. In [6], we propose an algorithm to choose their values adaptively. The main idea is that the algorithm seeks a bandwidth so that the affinity matrix has the largest number of outlier eigenvalues. We refer the readers to [6, Section 3.2] for more details.
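The idea described above can be sketched as follows: scan omega over a grid, form the affinity matrix with each candidate bandwidth, count the eigenvalues escaping a putative noise bulk, and keep the omega producing the most outliers. The bulk-edge proxy used below (a fixed multiple of the median eigenvalue) is purely illustrative and is not the estimator of [6, Section 3.2]; all names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Pairwise squared Euclidean distances between the columns of X (shape (p, n))."""
    sq = np.sum(X ** 2, axis=0)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)

def choose_omega(X, omegas=(0.1, 0.25, 0.5, 0.75, 0.9), edge_factor=3.0):
    """Return the quantile level whose bandwidth yields the most outlier eigenvalues of the
    Gaussian affinity matrix; the bulk edge below is only a crude illustrative proxy."""
    d2 = pairwise_sq_dists(X)
    upper = d2[np.triu_indices_from(d2, k=1)]          # distinct pairs only
    best_omega, best_count = None, -1
    for omega in omegas:
        h = np.quantile(upper, omega)
        evals = np.linalg.eigvalsh(np.exp(-d2 / h))
        n_out = int(np.sum(evals > edge_factor * np.median(evals)))
        if n_out > best_count:
            best_omega, best_count = omega, n_out
    return best_omega
```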
Last but not least, we point out that our results can potentially be used to detect common components. Usually, researchers rely on background knowledge to decide whether common information exists. For example, it is not surprising that two electroencephalogram channels share the same brain activity. However, while physiologically the brain and heart share common information [41], it is less clear whether the electroencephalogram and the electrocardiogram share anything in common, and what that common information is. Answering this complicated question may require a lot of scientific work, but the first step toward it is a powerful tool to confirm whether two sensors share the same information, in addition to checking whether the signal quality is sufficient. Since this is not the focus of the current paper, we will address this issue in future work.
5. Proof of main theorems
We now provide the proofs of the main theoretical results in Sections 3 and 4. We start by collecting some technical lemmas needed in the proofs.
5.1. Some technical lemmas
The following lemma provides some deterministic inequalities for the products of matrices.
Lemma 5.1.
(1). Suppose that is a real symmetric matrix with nonnegative entries and is another real symmetric matrix. Then we have that
where is the Hadamard product and stands for the largest singular value of
(2). Suppose and are two positive definite matrices. Then for all we have that
Proof.
The following Lemma 5.2 collects some concentration inequalities.
Lemma 5.2.
Proof.
See Lemma A.2 of [6]. ∎
In the following lemma, we prove some results regarding the concentration of the affinity matrices when and .
Lemma 5.3.
Proof.
First, (56) has been proved in [6] using the entry-wise Taylor expansion and the Gershgorin circle theorem; see the proof of Theorems 2.3 and 2.5 of [6]. Second, (58) has been proved in the proof of Theorem 2.5 of [6]. Third, we prove (57). By Lemma 5.2 and a discussion similar to [7, Lemma IV.5], when and
(59) |
Consequently,
where we used the fact that Since by the Gershgorin circle theorem, we conclude that This concludes our proof. ∎
In the following lemma, we collect the results regarding the affinity matrices when and Recall defined via (29).
Lemma 5.4.
Suppose Assumption 2.1 holds with , . For some constant denote
(60) |
Then we have:
(1). When if
(61) |
Moreover, we have that for in (60),
(62) |
On the other hand, when we have that
Finally, when is larger in the sense that (34) holds, we have that with probability at least for some constant
(63) |
(2). When is chosen according to (43), we have that
(64) |
Moreover, we have that for for some constant ,
(65) |
Recall (46). Finally, when we have that
(66) |
Similar results hold for
Proof.
See Corollary 2.11 and Theorem 3.1 of [6]. ∎
Finally, we record the results for the rigidity of eigenvalues of non-spiked Gram matrix. Denote the non-spiked Gram matrix as , where
Proof.
See [40, Theorem 3.3]. ∎
5.2. Proof of Theorem 3.1
We need the following notations. Denote , where , . Similarly, we define with , For denote
(67) | |||
(68) | |||
(69) |
Proof.
Case (1). . By (59), we conclude that
(70) |
Therefore, it suffices to consider To ease the heavy notation, we denote
(71) |
where and . With the above notations, by Lemma 5.3, we have that
(72) |
Denote
(73) |
Since and and contain samples from the common manifold, we can set
Moreover, denote
With the above notations, denote
(74) |
(75) |
Note that
(76) |
Moreover, are rank-one matrices and by (52),
(77) |
In light of (76), we can write
(78) |
We can further write
(79) |
where
On one hand, it is easy to see that On the other hand, by (77), (53) and Lemma 5.5, using the assumption that we obtain that
(80) |
Denote the spectral decompositions of and as
(81) |
Let and be the quantiles of and respectively as constructed via (20). Let be the eigenvalues of . For some small we denote an event
(82) |
Since is a Gaussian random matrix, we have that is a Haar orthogonal random matrix. Since and are independent, we have that is also a Haar orthogonal random matrix when is fixed. Since Lemma 5.5 implies that is a high probability event, in what follows, we focus our discussion on the high probability event and is a deterministic orthonormal matrix.
On one hand, have the same eigenvalues as by construction. On the other hand, by Lemma 5.5, we have that for
where and are diagonal matrices containing and respectively. Note that the rigidity of the eigenvalues of has been studied in [5] under the Gaussian assumption and summarized in Lemma 2.4. Together with Lemma 2.4, we conclude that for
(83) |
Note that
(84) |
We then analyze the rank of Recall that for any compatible matrices and we have that
Since and we conclude that
Consequently, we have that
(85) |
By (78), (79), (84) and (85), utilizing (70), we obtain that
(86) |
where we used Since , by (83), we have finished our proof for case (1).
Case (2). . Recall (76). In this case, according to Lemma 5.3, we require a high order expansion up to the degree of in (22) for Recall (24) and (78). By Lemma 5.3, we have that
(87) |
where is defined in (71) and is defined in (54) below satisfying Using a decomposition similar to (84) with (87), by (77) and the assumption , we obtain that
(88) |
It is easy to see that the rank of the second to the fourth terms of (88) is bounded by On the other hand, by (59), the first inequality of (70) should be replaced by
(89) |
The rest of the discussion follows from the case (1). This completes our proof for case (2).
Case (3). . The discussion is similar to case (2) except that we also need to conduct a high order expansion for Similar to (87), by Lemma 5.3, we have that
(90) |
By decomposition similar to (88), with (90), by (77), we have that
On the one hand, the rank of the second term on the right-hand side of the above equation can be bounded by On the other hand, the first term can again be analyzed in the same way as in the passage from (82) to (83) using Lemma 2.4. Finally, by (59), similarly to (89), we have that
(91) |
The rest of the proof follows from the discussion of case (1). This completes the proof of Case (3) using the fact ∎
5.3. Proof of Theorem 3.2
We now prove Theorem 3.2 when .
Case (1). . Decompose by
(92) |
First, we have that Moreover, using the decomposition (76), similar to (86), by (77), we can further write that
(93) |
Second, by Lemma 5.4, we have that
(94) |
Together with (70) and the fact that , using the definition (31), we have that
(95) |
We can therefore conclude the proof using (93).
Next, when is larger in the sense of (34), by Lemma 5.4, we find that with probability at least , for some constant
(96) |
Consequently, we have
(97) |
where in the second inequality we use the fact that since . By a result analogous to (57) for , we have that for
(98) |
Together with (97), we conclude our proof.
Case (2). . The discussion is similar to case (1) except that we need to conduct a high order expansion for Note that
By (77), (90) and (91), we have that
We can therefore conclude our proof by (94) with a discussion similar to (95). Finally, when is larger, we can conclude our proof using a discussion similar to (98) with (97) and Lemma 5.4. Together with (91), we conclude the proof.
5.4. Proof of Theorem 3.3
5.5. Proof of Theorem 4.1
To prove the results of Theorem 4.1, we first study the adaptive bandwidth and . When by Lemma 5.2 about the sub-Gaussian random vector, we have that for
Hence, are concentrated around . Then for any and chosen according to (43), we have that
(100) |
Similarly, when we have that Now we can prove part (1) when . Denote
where Since , we can apply the proof of Theorem 3.1 to the kernel functions The only difference is that the constant is now replaced by When the modification is similar except that we also need to use (2) of Lemma 5.4.
The other two cases can be obtained similarly by recalling the following fact. For any , let be the bandwidth selected using (43); then we have that for some constants , with high probability,
(101) |
Also, note that (2) of Lemma 5.4 holds. See Corollary 3.2 of [6] for the proof. With this fact, for part (2), (49) follows from (64) and its counterpart for and the fact ; (50) follows from (65) and its counterpart for and (2) of Lemma 5.1; (51) follows from (66) and its counterpart for and the assumption
References
- [1] Z. Bao, J. Hu, G. Pan, and W. Zhou. Canonical correlation coefficients of high-dimensional Gaussian vectors: Finite rank case. The Annals of Statistics, 47(1):612 – 640, 2019.
- [2] M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289–1308, 2008.
- [3] C. Bordenave. On Euclidean random matrices in high dimension. Electron. Commun. Probab., 18:no. 25, 8, 2013.
- [4] X. Cheng and A. Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 02(04):1350010, 2013.
- [5] X. Ding and H. C. Ji. Local laws for multiplication of random matrices and spiked invariant model. arXiv preprint arXiv 2010.16083, 2020.
- [6] X. Ding and H.-T. Wu. Impact of signal-to-noise ratio and bandwidth on graph Laplacian spectrum from high-dimensional noisy point cloud. arXiv preprint arXiv 2011.10725, 2020.
- [7] X. Ding and H. T. Wu. On the spectral property of kernel-based sensor fusion algorithms of high dimensional data. IEEE Transactions on Information Theory, 67(1):640–670, 2021.
- [8] Y. Do and V. Vu. The spectrum of random kernel matrices: Universality results for rough and varying kernels. Random Matrices: Theory and Applications, 02(03):1350005, 2013.
- [9] D. Dov, R. Talmon, and I. Cohen. Kernel-based sensor fusion with application to audio-visual voice activity detection. IEEE Transactions on Signal Processing, 64(24):6406–6416, 2016.
- [10] D. Dov, R. Talmon, and I. Cohen. Sequential audio-visual correspondence with alternating diffusion kernels. IEEE Transactions on Signal Processing, 66(12):3100–3111, 2018.
- [11] D. B. Dunson, H.-T. Wu, and N. Wu. Diffusion based Gaussian process regression via heat kernel reconstruction. Applied and Computational Harmonic Analysis, 2021.
- [12] N. El Karoui. On information plus noise kernel random matrices. The Annals of Statistics, 38(5):3191 – 3216, 2010.
- [13] N. El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1 – 50, 2010.
- [14] N. El Karoui and H.-T. Wu. Graph connection Laplacian and random matrices with random blocks. Information and Inference: A Journal of the IMA, 4(1):1–44, 2015.
- [15] N. El Karoui and H.-T. Wu. Graph connection Laplacian methods can be made robust to noise. The Annals of Statistics, 44(1):346 – 372, 2016.
- [16] L. Erdős and H. Yau. A Dynamical Approach to Random Matrix Theory. Courant Lecture Notes. American Mathematical Society, 2017.
- [17] Z. Fan and A. Montanari. The spectral norm of random inner-product kernel matrices. Probab. Theory Related Fields, 173(1-2):27–85, 2019.
- [18] N. García Trillos, M. Gerlach, M. Hein, and D. Slepcev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace-Beltrami operator. Found. Comput. Math., 20(4):827–887, 2020.
- [19] F. Gustafsson. Statistical Sensor Fusion. Professional Publishing House, 2012.
- [20] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
- [21] M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds – weak and strong pointwise consistency of graph Laplacians. In P. Auer and R. Meir, editors, Learning Theory, pages 470–485, 2005.
- [22] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. J. Mach. Learn. Res., 8:1325–1368, 2007.
- [23] P. Horst. Relations among m sets of measures. Psychometrika, 26(2):129–149, 1961.
- [24] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
- [25] H. Hwang, K. Jung, Y. Takane, and T. S. Woodward. A unified approach to multiple-set canonical correlation analysis and principal components analysis. British Journal of Mathematical and Statistical Psychology, 66(2):308–321, 2013.
- [26] H. C. Ji. Regularity Properties of Free Multiplicative Convolution on the Positive Line. International Mathematics Research Notices, 07 2020. rnaa152.
- [27] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295–327, 2001.
- [28] S. P. Kasiviswanathan and M. Rudelson. Spectral norm of random kernel matrices with applications to privacy. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2015, August 24-26, 2015, Princeton, NJ, USA, volume 40 of LIPIcs, pages 898–914. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.
- [29] A. Knowles and J. Yin. Anisotropic local laws for random matrices. Probab. Theory Related Fields, 169(1-2):257–352, 2017.
- [30] D. Lahat, T. Adali, and C. Jutten. Multimodal data fusion: An overview of methods, challenges, and prospects. Proceedings of the IEEE, 103(9):1449–1477, 2015.
- [31] R. R. Lederman and R. Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018.
- [32] R. R. Lederman and R. Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018.
- [33] O. Lindenbaum, Y. Bregman, N. Rabin, and A. Averbuch. Multiview kernels for low-dimensional modeling of seismic events. IEEE Transactions on Geoscience and Remote Sensing, 56(6):3300–3310, 2018.
- [34] O. Lindenbaum, A. Yeredor, M. Salhov, and A. Averbuch. Multi-view diffusion maps. Information Fusion, 55:127–149, 2020.
- [35] G.-R. Liu, Y.-L. Lo, J. Malik, Y.-C. Sheu, and H.-T. Wu. Diffuse to fuse EEG spectra – intrinsic geometry of sleep dynamics for classification. Biomedical Signal Processing and Control, 55:101576, 2020.
- [36] Z. Ma and F. Yang. Sample canonical correlation coefficients of high-dimensional random vectors with finite rank correlations. arXiv preprint arXiv 2102.03297, 2021.
- [37] V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967.
- [38] N. F. Marshall and M. J. Hirn. Time coupled diffusion maps. Applied and Computational Harmonic Analysis, 45(3):709–728, 2018.
- [39] T. Michaeli, W. Wang, and K. Livescu. Nonparametric canonical correlation analysis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, page 1967–1976, 2016.
- [40] N. S. Pillai and J. Yin. Universality of covariance matrices. The Annals of Applied Probability, 24(3):935 – 1001, 2014.
- [41] M. A. Samuels. The brain–heart connection. Circulation, 116(1):77–84, 2007.
- [42] T. Shnitzer, M. Ben-Chen, L. Guibas, R. Talmon, and H.-T. Wu. Recovering hidden components in multimodal data with composite diffusion operators. SIAM J. Math. Data Sci., 1(3):588–616, 2019.
- [43] T. Shnitzer, M. Ben-Chen, L. Guibas, R. Talmon, and H.-T. Wu. Recovering hidden components in multimodal data with composite diffusion operators. SIAM Journal on Mathematics of Data Science, 1(3):588–616, 2019.
- [44] A. Singer. From graph to manifold laplacian: The convergence rate. Applied and Computational Harmonic Analysis, 21(1):128–134, 2006.
- [45] R. Talmon and H.-T. Wu. Latent common manifold learning with alternating diffusion: Analysis and applications. Applied and Computational Harmonic Analysis, 47(3):848–892, 2019.
- [46] R. Talmon and H.-T. Wu. Latent common manifold learning with alternating diffusion: analysis and applications. Applied and Computational Harmonic Analysis, 47(3):848–892, 2019.
- [47] D. Voiculescu. Multiplication of certain non-commuting random variables. Journal of Operator Theory, 18(2):223–235, 1987.
- [48] L. Xiao, J. M. Stephen, T. W. Wilson, V. D. Calhoun, and Y.-P. Wang. A manifold regularized multi-task learning model for IQ prediction from two fMRI paradigms. IEEE Transactions on Biomedical Engineering, 67(3):796–806, 2019.
- [49] J. Zhao, X. Xie, X. Xu, and S. Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.
- [50] X. Zhuang, Z. Yang, and D. Cordes. A technical review of canonical correlation analysis for neuroscience applications. Human Brain Mapping, 41(13):3807–3833, 2020.