
How do kernel-based sensor fusion algorithms behave under high dimensional noise?

Xiucai Ding Department of Statistics, University of California, Davis, CA, USA xcading@ucdavis.edu  and  Hau-Tieng Wu Department of Mathematics and Department of Statistical Science, Duke University, Durham, NC, USA hauwu@math.duke.edu
Abstract.

We study the behavior of two kernel-based sensor fusion algorithms, nonparametric canonical correlation analysis (NCCA) and alternating diffusion (AD), under the nonnull setting where the clean datasets collected from two sensors are modeled by a common low dimensional manifold embedded in a high dimensional Euclidean space and the datasets are corrupted by high dimensional noise. We establish the asymptotic limits and convergence rates for the eigenvalues of the associated kernel matrices assuming that the sample dimension and sample size are comparably large, where NCCA and AD are conducted using the Gaussian kernel. It turns out that both the asymptotic limits and convergence rates depend on the signal-to-noise ratio (SNR) of each sensor and the selected bandwidths. On the one hand, we show that if NCCA and AD are directly applied to the noisy point clouds without any sanity check, they may generate artificial information that misleads scientists' interpretation. On the other hand, we prove that if the bandwidths are selected adequately, both NCCA and AD can be made robust to high dimensional noise when the SNRs are relatively large.

1. Introduction

How to adequately quantify the system of interest by assembling available information from multiple datasets collected simultaneously from different sensors is a long-standing and commonly encountered problem in data science. This problem is commonly referred to as the sensor fusion problem [20, 30, 19, 49]. While the simplest approach to "fuse" the information is a simple concatenation of the information available from each sensor, it is neither the best nor the most efficient approach. To achieve a better and more efficient fusion algorithm, researchers usually face several challenges. For example, the sensors might be heterogeneous, datasets from different sensors might not be properly aligned, and the datasets might be high dimensional and noisy, to name but a few. Roughly speaking, researchers are interested in extracting the common components (information) shared by different sensors, if any exist.

A lot of effort has been invested in finding a satisfactory algorithm based on various models. Historically, when we can safely assume a linear structure in the common information shared by different sensors, the most typical algorithm for handling this problem is canonical correlation analysis (CCA) [24] and its descendants [23, 20, 25], which is a far from complete list. In the modern data analysis era, due to the advance of sensor development and the growth of the complexity of problems, researchers may need to take the nonlinear structure of the datasets into account to better understand them. To handle this nonlinear structure, several nonlinear sensor fusion algorithms have been developed, for example, nonparametric canonical correlation analysis (NCCA) [39], alternating diffusion (AD) [31, 45] and its generalization [42], time coupled diffusion maps [38], multiview diffusion maps [34], etc. See [42] for a recent and more thorough list of available nonlinear tools and [50] for a recent review. The main idea behind these developments is that the nonlinear structure is modeled by various nonlinear geometric structures, and the algorithms are designed to preserve and capture this nonlinear structure. Such ideas and algorithms have been successfully applied to many real world problems, like audio-visual voice activity detection [10], the study of the sequential audio-visual correspondence [9], automatic sleep stage annotation from two electroencephalogram signals [35], seismic event modeling [33], fetal electrocardiogram analysis [42] and IQ prediction from two fMRI paradigms [48], which is a far from complete list.

While these kernel-based sensor fusion algorithms have been developed and applied for a while, there are still several gaps toward a solid practical application and a sound theoretical understanding of these tools. One important gap is understanding how the inevitable noise, particularly when the data dimension is high, impacts the kernel-based sensor fusion algorithms. For example, can we be sure that the obtained fused information is really informative, particularly when the datasets are noisy or when one sensor is broken? If the signal-to-noise ratios of the two sensors are different, how does the noise impact the information captured by these kernel-based sensor fusion algorithms? To our knowledge, the developed kernel-based sensor fusion algorithms do not account for how the noise interacts with the algorithm, and most theoretical understandings are mainly based on the nonlinear data structure without considering the impact of high dimensional noise, except a recent effort in the null case [7]. In this paper, we focus on one specific challenge among many; that is, we study how high dimensional noise impacts the spectrum of two kernel-based sensor fusion algorithms, NCCA and AD, in the non-null setup when there are two sensors.

We briefly recall the NCCA and AD algorithms. Consider two noisy point clouds, 𝒳={𝐱i}i=1np1\mathcal{X}=\{\mathbf{x}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{p_{1}} and 𝒴={𝐲j}j=1np2\mathcal{Y}=\{\mathbf{y}_{j}\}_{j=1}^{n}\subset\mathbb{R}^{p_{2}}. For some bandwidths h1,h2>0h_{1},h_{2}>0 and some fixed constant υ>0\upsilon>0 chosen by the user, we consider two n×nn\times n affinity matrices, 𝐖1\mathbf{W}_{1} and 𝐖2\mathbf{W}_{2}, defined as

𝐖1(i,j)=exp(υ𝐱i𝐱j22h1)and𝐖2(i,j)=exp(υ𝐲i𝐲j22h2),\mathbf{W}_{1}(i,j)=\exp\left(-\upsilon\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{h_{1}}\right)\ \ \mbox{and}\ \ \mathbf{W}_{2}(i,j)=\exp\left(-\upsilon\frac{\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}}{h_{2}}\right), (1)

where i,j=1,,ni,j=1,\ldots,n. Here, 𝐖1\mathbf{W}_{1} and 𝐖2\mathbf{W}_{2} are related to the point clouds 𝒳\mathcal{X} and 𝒴\mathcal{Y} respectively. Denote the associated degree matrices 𝐃1\mathbf{D}_{1} and 𝐃2,\mathbf{D}_{2}, which are diagonal matrices such that

𝐃1(i,i)=j=1n𝐖1(i,j) and 𝐃2(i,i)=j=1n𝐖2(i,j),i=1,2,,n.\mathbf{D}_{1}(i,i)=\sum_{j=1}^{n}\mathbf{W}_{1}(i,j)\ \ \mbox{ and }\ \ \mathbf{D}_{2}(i,i)=\sum_{j=1}^{n}\mathbf{W}_{2}(i,j),\ i=1,2,\cdots,n\,. (2)

Moreover, denote the transition matrices 𝐀1,𝐀2\mathbf{A}_{1},\mathbf{A}_{2} as

𝐀1=𝐃11𝐖1 and 𝐀2=𝐃21𝐖2.\mathbf{A}_{1}=\mathbf{D}_{1}^{-1}\mathbf{W}_{1}\ \ \mbox{ and }\ \ \mathbf{A}_{2}=\mathbf{D}_{2}^{-1}\mathbf{W}_{2}\,.

The NCCA and AD matrices are defined as

𝐍=𝐀1𝐀2 and 𝐀=𝐀1𝐀2\mathbf{N}=\mathbf{A}_{1}\mathbf{A}_{2}^{\top}\ \ \mbox{ and }\ \ \mathbf{A}=\mathbf{A}_{1}\mathbf{A}_{2} (3)

respectively. Note that in the current paper, for simplicity, we focus our study on the Gaussian kernel; more general kernel functions will be explored in future work. Usually, the top few eigenpairs of $\mathbf{N}$ and $\mathbf{A}$ are used as features of the extracted common information shared by the two sensors. We emphasize that in general, while $\mathbf{A}_1$ and $\mathbf{A}_2$ are diagonalizable, $\mathbf{N}$ and $\mathbf{A}$ may not be. Theoretically, however, we can obtain the top few eigenpairs without a problem under the common manifold model [45, 42], since asymptotically $\mathbf{N}$ and $\mathbf{A}$ both converge to self-adjoint operators. To avoid this issue, researchers also consider the singular value decomposition (SVD) of $\mathbf{N}$ and $\mathbf{A}$. Another important fact is that usually we are interested in the case when $\mathcal{X}$ and $\mathcal{Y}$ are aligned; that is, $\mathbf{x}_i$ and $\mathbf{y}_i$ are sampled from the same system at the same time. However, the algorithm can be applied to any two datasets of the same size, although this is not our concern in this paper.
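To make the construction above concrete, we include a minimal numerical sketch of (1)–(3). It is only an illustration, not part of the analysis: the toy data, the choice $\upsilon=1$, and all variable names are ours.

```python
# Minimal sketch of (1)-(3), assuming row-wise point clouds X (n x p1) and Y (n x p2);
# the toy data and names below are illustrative only.
import numpy as np

def affinity(points, h, upsilon=1.0):
    # W(i, j) = exp(-upsilon * ||points_i - points_j||^2 / h), as in (1)
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-upsilon * sq_dists / h)

def transition(W):
    # A = D^{-1} W, with D the diagonal degree matrix of (2)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, p1, p2 = 200, 300, 250
X = rng.standard_normal((n, p1))
Y = rng.standard_normal((n, p2))

A1 = transition(affinity(X, h=p1))       # classic bandwidth h1 = p1
A2 = transition(affinity(Y, h=p2))       # classic bandwidth h2 = p2
N = A1 @ A2.T                            # NCCA matrix, cf. (3)
A = A1 @ A2                              # AD matrix, cf. (3)

# N and A need not be symmetric, so we sort eigenvalues by modulus;
# alternatively, as mentioned above, one may work with the SVD instead.
top_eigs = np.sort(np.abs(np.linalg.eigvals(N)))[::-1][:5]
print(top_eigs)
```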

1.1. Some related works

In this subsection, we summarize some related results. Since the NCCA and AD matrices (3) are essentially products of transition matrices, we start by summarizing results on the affinity and transition matrices when there is only one sensor. On the one hand, in the noiseless setting, the spectral properties have been widely studied, for example in [2, 21, 22, 44, 18, 11], to name but a few. In summary, under the manifold model, researchers show that the graph Laplacian (GL) converges to the Laplace–Beltrami operator in various settings with a properly chosen bandwidth. On the other hand, the spectral properties have been investigated in [4, 3, 8, 14, 13, 17, 28] under the null setup. These works essentially show that when $\mathcal{X}$ contains pure high-dimensional noise, the affinity and transition matrices are governed by a low-rank perturbed Gram matrix when the bandwidth $h_1=p_1$. Despite the rich literature above on these two extreme setups, limited results are available in the intermediate, or nonnull, setup [12, 6, 15]. For example, when the signal-to-noise ratio (SNR), which will be defined precisely later, is sufficiently large, the spectral properties of the GL constructed from the noisy observation are close to those of the GL constructed from the clean signal. Moreover, the bandwidth plays an important role in the nonnull setup. For a more comprehensive review and sophisticated study of the spectral properties of the affinity and transition matrices for an individual point cloud, we refer the readers to [6, Sections 1.2 and 1.3].

For the NCCA and AD matrices, on the one hand, in the noiseless setting, there have been several results under the common manifold model [32, 46]. On the other hand, under the null setup in which both sensors only capture high dimensional white noise, their spectral properties have been studied recently [7]. Specifically, except for a few large outliers, when $h_1=p_1$ and $h_2=p_2$, the edge eigenvalues of $n^2\mathbf{A}$ or $n^2\mathbf{N}$ converge to some deterministic limit depending on the free multiplicative convolution (cf. Definition 2.3) of two Marchenko–Pastur (MP) laws [37]. However, in the nonnull setting when both sensors are contaminated by noise, to our knowledge, there does not exist any theoretical study, particularly under the high dimensional setup.

1.2. An overview of our results

We now provide an overview of our results. The main contribution of this paper is a comprehensive study of NCCA and AD under the non-null case in the high dimensional setup. This result can be viewed as a continuation of the study of the null case [7]. We focus on the setup where the signal is modeled by a low dimensional manifold. It turns out that this problem can be recast as studying the algorithms under the commonly applied spiked model, which will be made clear later. In addition to providing a theoretical justification based on kernel random matrix theory, we propose a method to choose the bandwidth adaptively. Moreover, peculiar and counterintuitive results will be presented when the two sensors behave differently, which emphasizes the importance of applying these algorithms carefully in practice. In Section 3, we investigate the eigenvalues of the NCCA and AD matrices when $h_1=p_1$ and $h_2=p_2$, which is a common choice in the literature. The behavior of the eigenvalues varies according to the SNRs of both point clouds. When both SNRs are small, the spectral behavior of $n^2\mathbf{N}$ and $n^2\mathbf{A}$ is like that in the null case, while both the number of outliers and the convergence rates depend on the SNRs; see Theorem 3.1 for details. Furthermore, if one of the sensors has a large SNR and the other one has a small SNR, the eigenvalues of $n\mathbf{N}$ and $n\mathbf{A}$ provide limited information about the signal; see Theorem 3.2 for details. We emphasize that this result warns us that if we directly apply NCCA and AD without any sanity check, we may reach a misleading conclusion. When both SNRs are larger, the eigenvalues are close to those of the clean NCCA and AD matrices; see Theorem 3.3 for more details. It is clear that the classic bandwidth choices $h_k=p_k$ for $k=1,2$ are inappropriate when the SNR is large, since the bandwidth is too small compared with the signal strength. In this case $\mathbf{N}\approx\mathbf{I}$, $\mathbf{A}\approx\mathbf{I}$, and we obtain limited information about the signal; see (42) for details. To handle this issue, in Section 4, we consider bandwidths that are adaptively chosen according to the dataset. With this choice, when the SNRs are large, NCCA and AD become non-trivial and informative; that is, NCCA and AD are robust against the high dimensional noise. See Theorem 4.1 for details.

Conventions. The fundamental large parameter is nn and we always assume that p1p_{1} and p2p_{2} are comparable to and depend on nn. We use CC to denote a generic positive constant, and the value may change from one line to the next. Similarly, we use ϵ\epsilon, τ\tau, δ\delta, etc., to denote generic small positive constants. If a constant depends on a quantity aa, we use C(a)C(a) or CaC_{a} to indicate this dependence. For two quantities ana_{n} and bnb_{n} depending on nn, the notation an=O(bn)a_{n}=\mathrm{O}(b_{n}) means that |an|C|bn||a_{n}|\leq C|b_{n}| for some constant C>0C>0, and an=o(bn)a_{n}=\mathrm{o}(b_{n}) means that |an|cn|bn||a_{n}|\leq c_{n}|b_{n}| for some positive sequence cn0c_{n}\downarrow 0 as nn\to\infty. We also use the notations anbna_{n}\lesssim b_{n} if an=O(bn)a_{n}=\mathrm{O}(b_{n}), and anbna_{n}\asymp b_{n} if an=O(bn)a_{n}=\mathrm{O}(b_{n}) and bn=O(an)b_{n}=\mathrm{O}(a_{n}). For a n×nn\times n matrix AA, A\|A\| indicates the operator norm of AA, and A=O(an)A=O(a_{n}) means ACan\|A\|\leq Ca_{n} for some constant C>0C>0. Finally, for a random vector 𝐮,\mathbf{u}, we say it is sub-Gaussian if for any deterministic vector 𝐚\mathbf{a}, we have 𝔼(exp(𝐚𝐮))exp(𝐚22/2)\mathbb{E}(\exp(\mathbf{a}^{\top}\mathbf{u}))\leq\exp(\|\mathbf{a}\|_{2}^{2}/2).

The paper is organized as follows. In Section 2, we introduce the mathematical setup and some background in random matrix theory. In Section 3, we state our main results for the classic choice of bandwidth. In Section 4, we state the main results for the adaptively chosen bandwidth. In Section 5, we offer the technical proofs of the main results. In Section 5.1, we provide and prove some preliminary results which are used in the technical proofs.

2. Mathematical framework and background

2.1. Mathematical framework

We focus on the following model for the datasets $\mathcal{X}$ and $\mathcal{Y}$. Assume that the first sensor samples $n$ clean signals i.i.d. from a sub-Gaussian random vector $X:(\Omega,\mathcal{F},\mathbb{P})\to\mathbb{R}^{p_1}$, denoted as $\{\mathbf{u}_{ix}\}_{i=1}^{n}\subset\mathbb{R}^{p_1}$, where $(\Omega,\mathcal{F},\mathbb{P})$ is a probability space. Similarly, assume that the second sensor samples $n$ clean signals i.i.d. from a sub-Gaussian random vector $Y$, denoted as $\{\mathbf{u}_{iy}\}_{i=1}^{n}\subset\mathbb{R}^{p_2}$. Since we focus on the pairwise distances of the samples, without loss of generality, we assume that

𝔼𝐮ix=𝟎and𝔼𝐮iy=𝟎.\mathbb{E}\mathbf{u}_{ix}=\mathbf{0}\quad\mbox{and}\quad\mathbb{E}\mathbf{u}_{iy}=\mathbf{0}. (4)

Denote S1=Cov(𝐮ix)S_{1}=\operatorname{Cov}(\mathbf{u}_{ix}) and S2=Cov(𝐮iy)S_{2}=\operatorname{Cov}(\mathbf{u}_{iy}), and to simplify the discussion, we assume that S1S_{1} and S2S_{2} admit the following spectral decomposition

S_{k}=\operatorname{diag}\{\sigma_{k1}^{2},\cdots,\sigma_{kd_{k}}^{2},0,\cdots,0\}\in\mathbb{R}^{p_{k}\times p_{k}},\ k=1,2, (5)

where d1d_{1} and d2d_{2} are fixed integers. We model the common information by assuming that there exists a bijection ϕ\phi so that

X(Ω)=ϕ(Y(Ω));X(\Omega)=\phi(Y(\Omega))\,; (6)

that is, we have 𝐮ix=ϕ(𝐮iy)\mathbf{u}_{ix}=\phi(\mathbf{u}_{iy}) for any i=1,,ni=1,\ldots,n. In practice, the clean signals {𝐮ix}\{\mathbf{u}_{ix}\} and {𝐮iy}\{\mathbf{u}_{iy}\} are contaminated by two sequences of i.i.d. sub-Gaussian noise {𝐳i}p1\{\mathbf{z}_{i}\}\in\mathbb{R}^{p_{1}} and {𝐰i}p2\{\mathbf{w}_{i}\}\in\mathbb{R}^{p_{2}}, respectively, so that the data generating process follows

𝐱i=𝐮ix+𝐳iand𝐲i=𝐮iy+𝐰i,\mathbf{x}_{i}=\mathbf{u}_{ix}+\mathbf{z}_{i}\ \ \mbox{and}\ \ \mathbf{y}_{i}=\mathbf{u}_{iy}+\mathbf{w}_{i}\,, (7)

where

𝔼𝐳i=𝟎p1,Cov(𝐳i)=𝐈p1,𝔼𝐰i=𝟎p2,Cov(𝐰i)=𝐈p2.\mathbb{E}\mathbf{z}_{i}=\mathbf{0}_{p_{1}},\ \operatorname{Cov}(\mathbf{z}_{i})=\mathbf{I}_{p_{1}},\ \mathbb{E}\mathbf{w}_{i}=\mathbf{0}_{p_{2}},\ \operatorname{Cov}(\mathbf{w}_{i})=\mathbf{I}_{p_{2}}. (8)

We further assume that $\mathbf{z}_i$ and $\mathbf{w}_i$ are independent of each other and also independent of $\{\mathbf{u}_{ix}\}$ and $\{\mathbf{u}_{iy}\}$. We are mainly interested in the high dimensional setting; that is, $p_1$ and $p_2$ are comparably as large as $n$. More specifically, we assume that there exists some small constant $0<\gamma<1$ such that

γc1:=np1γ1andγc2:=np2γ1.\gamma\leq c_{1}:=\frac{n}{p_{1}}\leq\gamma^{-1}\ \ \mbox{and}\ \ \gamma\leq c_{2}:=\frac{n}{p_{2}}\leq\gamma^{-1}.\ (9)

The SNRs in our setting are defined as {σ1i2}\{\sigma_{1i}^{2}\} and {σ2i2},\{\sigma_{2i}^{2}\}, respectively, so that for all 1id11\leq i\leq d_{1} and 1jd21\leq j\leq d_{2},

σ1i2nζ1i,σ2j2nζ2j,\sigma_{1i}^{2}\asymp n^{\zeta_{1i}},\ \sigma_{2j}^{2}\asymp n^{\zeta_{2j}}\,, (10)

for some constants ζ1i,ζ2j0\zeta_{1i},\zeta_{2j}\geq 0. To avoid repetitions, we summarize the assumptions as follows.

Assumption 2.1.

Throughout the paper, we assume that (4)–(10) hold.

In view of (5), the model (7) for each sensor is related to the spiked covariance matrix model [27]. We comment that this seemingly simple model, particularly (5), includes the commonly considered nonlinear common manifold model. In the literature, the common manifold model means that two sensors sample simultaneously from one low dimensional manifold; that is, $\mathbf{u}_{ix}=\mathbf{u}_{iy}\in M$ and $\phi$ is the identity map, where $M$ is a low dimensional smooth and compact manifold embedded in the high dimensional Euclidean space. Since we are interested in kernel matrices depending on pairwise distances, which are invariant under rotation, when combined with Nash's embedding theorem, the common manifold can be assumed to be supported in the first few axes of the high dimensional space, as in (5). As a result, the common manifold model becomes a special case of the model (7). We refer readers to [6] for a detailed discussion of this relationship. A special example of the common manifold model is the widely considered linear subspace as the common component; that is, when $M=\mathbb{R}^{d}$ is embedded in $\mathbb{R}^{p_k}$ for $k=1,2$. In this case, we could simply apply CCA to estimate the common component, and its behavior in the high dimensional setup has been studied in [1, 36].
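As a concrete and purely hypothetical instance of the model (4)–(10) with $d_1=d_2=1$, the following toy sampler places a one-dimensional bounded (hence sub-Gaussian) common component on the first coordinate of each sensor, so that $\phi$ is a simple scaling; the construction and all names are ours and are included only to illustrate the assumptions.

```python
# Toy sampler for the model (7)-(10) with d1 = d2 = 1; the one-dimensional
# common component and all names are illustrative assumptions, not the paper's.
import numpy as np

def sample_two_sensors(n, p1, p2, zeta1, zeta2, rng):
    sigma1 = n ** (zeta1 / 2)                 # sigma_1^2 ~ n^{zeta_1}, cf. (10)
    sigma2 = n ** (zeta2 / 2)                 # sigma_2^2 ~ n^{zeta_2}
    theta = rng.uniform(0, 2 * np.pi, n)      # latent common variable (bounded, hence sub-Gaussian)
    u_x = np.zeros((n, p1)); u_x[:, 0] = sigma1 * np.cos(theta)  # clean signal of sensor 1, cf. (5)
    u_y = np.zeros((n, p2)); u_y[:, 0] = sigma2 * np.cos(theta)  # clean signal of sensor 2; phi is a scaling
    Z = rng.standard_normal((n, p1))          # noise with Cov = I_{p1}, cf. (8)
    W = rng.standard_normal((n, p2))          # noise with Cov = I_{p2}
    return u_x + Z, u_y + W                   # noisy observations, cf. (7)

rng = np.random.default_rng(1)
X, Y = sample_two_sensors(n=300, p1=300, p2=350, zeta1=1.2, zeta2=0.4, rng=rng)
```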

We should emphasize that through the analysis of NCCA and AD under the common component model satisfying Assumption (2.1), we do not claim that we could understand the underlying manifold structure. The problem we are asking here is the nontrivial relationship between the noisy and clean affinity and transition matrices, while the problem of exploring the manifold structure from the clean datasets [18, 11] is a different one, which is usually understood as the manifold learning problem. To study the nontrivial relationship between the noisy and clean affinity and transition matrices, it is the spiked covariance structure that we focus on, but not the possibly non-trivial ϕ\phi. By establishing the nontrivial relationship between the noisy and clean affinity and transition matrices in this paper, when combined with the knowledge of manifold learning via the kernel-based manifold learning algorithm with clean datasets [31, 46, 43], we know how to explore the common manifold structure, which depends on ϕ\phi, from the noisy datasets.

Remark 2.2.

While it is not our focus in this paper, we should mention that our model includes the case where the datasets captured by the two sensors are not exactly on one manifold $M$, but come from two manifolds that are diffeomorphic to $M$ [43]. Specifically, the first sensor samples points $\{\mathbf{u}_{ix}\}$ from $\phi_1(M)$, while the second sensor simultaneously samples points $\{\mathbf{u}_{iy}\}$ from $\phi_2(M)$, where $\phi_1$ and $\phi_2$ are both diffeomorphisms and $\mathbf{u}_{ix}=\phi_1(\phi_2^{-1}(\mathbf{u}_{iy}))$; that is, $\phi=\phi_1\circ\phi_2^{-1}$ in (6). Note that in this case, $d_1$ might be different from $d_2$. Moreover, the samples from the two sensors can be more general. For example, in [46], a principal bundle structure is considered to model the "nuisance", which can be understood as "deterministic noise", and in [31] a metric space is considered as the common component. While it is possible to consider a more complicated model, since we are interested in studying how noise impacts NCCA and AD, in this paper we focus on the above model and do not further elaborate on this possible extension.

2.2. Some random matrix theory background

In this subsection, we introduce some random matrix theory background and necessary notation. Let $\mathbf{Z}\in\mathbb{R}^{p_1\times n}$ be the data matrix associated with $\{\mathbf{z}_i\}$; that is, the $i$-th column of $\mathbf{Z}$ is $\mathbf{z}_i$. Consider the scaled noise $s\mathbf{Z}$, where $s>0$ stands for the standard deviation of the scaled noise. Denote the empirical spectral distribution (ESD) of $\mathbf{Q}=\frac{s^{2}}{p_{1}}\mathbf{Z}^{\top}\mathbf{Z}$ as

μ𝐐(x)=1ni=1n𝟏{λi(𝐐)x},x.\mu_{\mathbf{Q}}(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{\lambda_{i}(\mathbf{Q})\leq x\}},\ x\in\mathbb{R}.

It is well-known that in the high dimensional regime (9), $\mu_{\mathbf{Q}}$ has the same asymptotic behavior [29] as the so-called MP law [37], denoted as $\nu_{c_1,s^2}$, which satisfies

νc1,s2(I)=(1c1)+χI(0)+ζc1,s2(I),\nu_{c_{1},s^{2}}(I)=(1-c_{1})_{+}\chi_{I}(0)+\zeta_{c_{1},s^{2}}(I)\,, (11)

where II\subset\mathbb{R} is a measurable set, χI\chi_{I} is the indicator function and (a)+:=0(a)_{+}:=0 when a0a\leq 0 and (a)+:=a(a)_{+}:=a when a>0a>0,

d\zeta_{c_{1},s^{2}}(x)=\frac{1}{2\pi s^{2}}\frac{\sqrt{(\lambda_{+,1}-x)(x-\lambda_{-,1})}}{c_{1}x}\mathrm{d}x\,, (12)

λ+,1=(1+s2c1)2\lambda_{+,1}=(1+s^{2}\sqrt{c_{1}})^{2} and λ,1=(1s2c1)2\lambda_{-,1}=(1-s^{2}\sqrt{c_{1}})^{2}. Denote

τ1τ1(λ1):=2(λ1p1+1),τ2τ2(λ2):=2(λ2p2+1),\tau_{1}\equiv\tau_{1}(\lambda_{1}):=2\left(\frac{\lambda_{1}}{p_{1}}+1\right),\ \tau_{2}\equiv\tau_{2}(\lambda_{2}):=2\left(\frac{\lambda_{2}}{p_{2}}+1\right), (13)

and for k=1,2k=1,2,

ςkςk(τk):=12υexp(υτk)exp(υτk).\varsigma_{k}\equiv\varsigma_{k}(\tau_{k}):=1-2\upsilon\exp(-\upsilon\tau_{k})-\exp(-\upsilon\tau_{k})\,. (14)

For any constant $\mathsf{a}>0$, let $\mathrm{T}_{\mathsf{a}}$ denote the shifting operator that shifts a probability measure $\nu$ defined on $\mathbb{R}$ by $\mathsf{a}$; that is,

T𝖺ν(I)=ν(I𝖺),\mathrm{T}_{\mathsf{a}}\nu(I)=\nu(I-\mathsf{a}), (15)

where I𝖺I-\mathsf{a} means the shifted set. Using the notation (11), for k=1,2,k=1,2, denote

νk:=Tςkνck,η,where η=2υexp(2υ).\nu_{k}:=\mathrm{T}_{\varsigma_{k}}\nu_{c_{k},\eta},\ \mbox{where }\ \eta=2\upsilon\exp(-2\upsilon). (16)

Next, we introduce an $n$-dependent quantile of a probability measure. For a given probability measure $\mu$ and $n\in\mathbb{N}$, define $\gamma_{\mu}(j)$ by

γμ(j)μ(dx)=jn.\int_{\gamma_{\mu}(j)}^{\infty}\mu(\mathrm{d}x)=\frac{j}{n}. (17)
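As a numerical illustration (ours, not taken from the paper), the following sketch evaluates the ESD $\mu_{\mathbf{Q}}$ above and an empirical analogue of the quantiles $\gamma_\mu(j)$ in (17) directly from a simulated noise Gram matrix; with the MP density in place of the empirical measure, one would obtain the population quantiles instead.

```python
# Sketch of the ESD of Q = (s^2/p1) Z^T Z and an empirical analogue of the
# quantiles gamma_mu(j) in (17); names and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, p1, s = 400, 500, 1.0
Z = rng.standard_normal((p1, n))
Q = (s ** 2 / p1) * Z.T @ Z
evals = np.sort(np.linalg.eigvalsh(Q))[::-1]     # eigenvalues of Q, descending

def esd(x):
    # mu_Q(x) = (1/n) * #{i : lambda_i(Q) <= x}
    return np.mean(evals <= x)

def gamma_emp(j):
    # empirical analogue of gamma_mu(j): mass j/n sits to the right of the
    # j-th largest eigenvalue
    return evals[j - 1]

print(esd(1.0), gamma_emp(1), gamma_emp(n // 2))
```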

Finally, we recall the following notion of stochastic domination [16, Chapter 6.3] that we will frequently use. Let $\mathsf{X}=\{\mathsf{X}^{(n)}(u):n\in\mathbb{N},\ u\in\mathsf{U}^{(n)}\}$ and $\mathsf{Y}=\{\mathsf{Y}^{(n)}(u):n\in\mathbb{N},\ u\in\mathsf{U}^{(n)}\}$ be two families of nonnegative random variables, where $\mathsf{U}^{(n)}$ is a possibly $n$-dependent parameter set. We say that $\mathsf{X}$ is stochastically dominated by $\mathsf{Y}$, uniformly in the parameter $u$, if for any small $\epsilon>0$ and large $D>0$, there exists $n_{0}(\epsilon,D)\in\mathbb{N}$ so that $\sup_{u\in\mathsf{U}^{(n)}}\mathbb{P}\big(\mathsf{X}^{(n)}(u)>n^{\epsilon}\mathsf{Y}^{(n)}(u)\big)\leq n^{-D}$ for all sufficiently large $n\geq n_{0}(\epsilon,D)$. We interchangeably use the notation $\mathsf{X}=\mathrm{O}_{\prec}(\mathsf{Y})$ or $\mathsf{X}\prec\mathsf{Y}$ if $\mathsf{X}$ is stochastically dominated by $\mathsf{Y}$, uniformly in $u$, when there is no danger of confusion. In addition, we say that an $n$-dependent event $\Omega\equiv\Omega(n)$ holds with high probability if for any $D>1$, there exists $n_{0}=n_{0}(D)>0$ so that $\mathbb{P}(\Omega)\geq 1-n^{-D}$ when $n\geq n_{0}$.

2.3. A brief summary of free multiplication of random matrices

In this subsection, we summarize some preliminary results about free multiplication of random matrices from [5, 26]. Given some probability measure μ,\mu, its Stieltjes transform and MM-transform are defined as

m_{\mu}(z)=\int\frac{1}{x-z}\mathrm{d}\mu(x)\ \ \mbox{and}\ \ M_{\mu}(z)=\frac{zm_{\mu}(z)}{1+zm_{\mu}(z)}\,,

where z\+z\in\mathbb{C}\backslash\mathbb{R}_{+}, respectively. We next introduce the subordination functions utilizing the MM-transform [26, 47]. For any two probability measures μx\mu_{x} and μy\mu_{y}, there exist analytic functions Ωx(z)\Omega_{x}(z) and Ωy(z)\Omega_{y}(z) satisfying

zMμx(Ωy(z))=zMμy(Ωx(z))=Ωx(z)Ωy(z),forz\+.zM_{\mu_{x}}(\Omega_{y}(z))=zM_{\mu_{y}}(\Omega_{x}(z))=\Omega_{x}(z)\Omega_{y}(z),\ \text{for}\ z\in\mathbb{C}\backslash\mathbb{R}_{+}. (18)

Armed with the subordination functions, we now introduce the free multiplicative convolution of μx\mu_{x} and μy,\mu_{y}, denoted as μxμy\mu_{x}\boxtimes\mu_{y}, when μx\mu_{x} and μy\mu_{y} are compactly supported on +\mathbb{R}_{+} but not both delta measures supported on {0}\{0\}; see Definition 2.7 of [5].

Definition 2.3.

Denote the analytic function MM by

M(z):=Mμx(Ωy(z))=Mμy(Ωx(z)).M(z):=M_{\mu_{x}}(\Omega_{y}(z))=M_{\mu_{y}}(\Omega_{x}(z)). (19)

Then the free multiplicative convolution $\mu_{x}\boxtimes\mu_{y}$ is defined as the unique probability measure such that (19) holds for all $z\in\mathbb{C}\backslash\mathbb{R}_{+}$; that is, $M(z)\equiv M_{\mu_{x}\boxtimes\mu_{y}}(z)$ is the $M$-transform of $\mu_{x}\boxtimes\mu_{y}$. Moreover, $\Omega_{x}$ and $\Omega_{y}$ are referred to as the subordination functions.
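As a simple sanity check (our own illustrative computation, not taken from [5, 26]), let $\mu_x=\delta_t$ be the point mass at some $t>0$. Then

m_{\delta_t}(z)=\frac{1}{t-z},\qquad M_{\delta_t}(z)=\frac{zm_{\delta_t}(z)}{1+zm_{\delta_t}(z)}=\frac{z}{t},

so (18) forces $\Omega_x(z)=z/t$, and (19) gives $M_{\delta_t\boxtimes\mu_y}(z)=M_{\mu_y}(z/t)$, which is the $M$-transform of the dilation of $\mu_y$ by $t$. In other words, free multiplicative convolution with a point mass simply rescales the other measure, consistent with the matrix model introduced below.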

For ν1\nu_{1} and ν2\nu_{2} defined in (16), we have two sequences ak=γν1(k1)a_{k}=\gamma_{\nu_{1}}(k-1) and bk=γν2(k1)b_{k}=\gamma_{\nu_{2}}(k-1), where 1kn1\leq k\leq n. Note that we have

akE+,1dν1(x)=k1n,bkE+,2dν2(x)=k1n,\int_{a_{k}}^{E_{+,1}}\mathrm{d}\nu_{1}(x)=\frac{k-1}{n},\ \int_{b_{k}}^{E_{+,2}}\mathrm{d}\nu_{2}(x)=\frac{k-1}{n}, (20)

where E+,1,E+,2E_{+,1},E_{+,2} are the right edges of ν1\nu_{1} and ν2,\nu_{2}, respectively. Denote two n×nn\times n positive definite matrices Σ1\Sigma_{1} and Σ2\Sigma_{2} as follows

Σ1=diag{a1,a2,,an},Σ2=diag{b1,b2,,bn}.\Sigma_{1}=\operatorname{diag}\{a_{1},a_{2},\cdots,a_{n}\},\ \Sigma_{2}=\operatorname{diag}\{b_{1},b_{2},\cdots,b_{n}\}. (21)

Let 𝐔\mathbf{U} be an n×nn\times n Haar distributed random matrix in O(n)O(n) and denote

𝐇=Σ2𝐔Σ1𝐔.\mathbf{H}=\Sigma_{2}\mathbf{U}\Sigma_{1}\mathbf{U}^{\top}.

The following lemma summarizes the rigidity of eigenvalues of 𝐇.\mathbf{H}.

Lemma 2.4.

Suppose (21) holds. Then we have that

supj|λj(𝐇)γν1ν2(j)|n2/3j~1/3,j~:=min{p1n+1j,j}.\sup_{j}\left|\lambda_{j}(\mathbf{H})-\gamma_{\nu_{1}\boxtimes\nu_{2}}(j)\right|\prec n^{-2/3}\widetilde{j}^{-1/3},\ \widetilde{j}:=\min\left\{p_{1}\wedge n+1-j,j\right\}.
Proof.

The proof follows from Theorems 2.14 and 2.20 of [5] since the Assumptions 2.2, 2.4 and 2.7 in [5] are satisfied. Especially, (iii) of Assumption 2.2 in [5] holds due to the square root behavior of the MP laws as indicated by (12). ∎
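The construction above can be simulated directly. The sketch below is ours: two uniform laws with explicit quantiles are used as stand-ins for $\nu_1$ and $\nu_2$ (whose quantiles would require the shifted MP density), the diagonal matrices of (21) are rotated by a Haar orthogonal matrix, and the spectrum of $\mathbf{H}$, which by Lemma 2.4 tracks the quantiles of the free multiplicative convolution, is read off.

```python
# Simulation sketch of (20)-(21) and Lemma 2.4, with two uniform laws used as
# stand-ins for nu_1 and nu_2; all choices here are illustrative.
import numpy as np
from scipy.stats import ortho_group      # Haar-distributed orthogonal matrices

n = 300
q = np.arange(n) / n                     # (k-1)/n, k = 1, ..., n

# a_k, b_k as in (20): mass (k-1)/n to the right of a_k (resp. b_k).
a = 2.0 - q                              # quantiles of Uniform[1, 2]
b = 5.0 - 3.0 * q                        # quantiles of Uniform[2, 5]
Sigma1, Sigma2 = np.diag(a), np.diag(b)  # cf. (21)

U = ortho_group.rvs(dim=n, random_state=3)
H = Sigma2 @ U @ Sigma1 @ U.T

# The spectrum of H approximates the quantiles of nu_1 boxtimes nu_2 (Lemma 2.4).
evals = np.sort(np.linalg.eigvals(H).real)[::-1]
print(evals[:5])
```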

3. Main results (I)–classic bandwidth: h1p1,h2p2h_{1}\asymp p_{1},h_{2}\asymp p_{2}

In this section, we state our main results regarding the eigenvalues of $\mathbf{N}$ and $\mathbf{A}$ when $h_k\asymp p_k$, $k=1,2$. For definiteness, we assume that $h_1=p_1$ and $h_2=p_2$. In what follows, for ease of statement, we focus on reporting the results for $d_1=d_2=1$ and hence omit the subscripts $i,j$ of the exponents in (10). For the general setting with $d_1>1$ or $d_2>1$, we refer the readers to Remark 3.4 below for more details. Finally, we focus on reporting the results for the NCCA matrix $\mathbf{N}$; the results for the AD matrix $\mathbf{A}$ are similar, and for the details we refer the readers to Remark 3.5 below. Moreover, by symmetry, without loss of generality, we always assume that $\zeta_1\geq\zeta_2$; that is, the first sensor always has the larger SNR.

3.1. Noninformative region: 0ζ2<10\leq\zeta_{2}<1

In this subsection, we state the results when at least one sensor contains strong noise, or equivalently has a small SNR; that is, $0\leq\zeta_2<1$. In this case, NCCA and AD either cannot provide useful information or can only provide limited information about the underlying common manifold.

3.1.1. When both sensors have small SNRs, 0ζ2ζ1<10\leq\zeta_{2}\leq\zeta_{1}<1

In this case, both sensors have small SNRs such that the noise dominates the signal. For some fixed integers 𝗌1,𝗌2\mathsf{s}_{1},\mathsf{s}_{2} satisfying

4𝗌kC4𝔡k,𝔡k:=11ζk+1,k=1,2,4\leq\mathsf{s}_{k}\leq C4^{\mathfrak{d}_{k}},\ \mathfrak{d}_{k}:=\left\lceil\frac{1}{1-{\zeta_{k}}}\right\rceil+1,\ k=1,2, (22)

where C>0C>0 is some constant, denote 𝖳\mathsf{T} as

𝖳:={8, 0ζ2ζ1<0.5,𝗌1+8, 0ζ2<0.5ζ1<1,𝗌1+𝗌2+8, 0.5ζ2ζ1<1.\mathsf{T}:=\begin{cases}8,&\ 0\leq\zeta_{2}\leq\zeta_{1}<0.5,\\ \mathsf{s}_{1}+8,&\ 0\leq\zeta_{2}<0.5\leq\zeta_{1}<1,\\ \mathsf{s}_{1}+\mathsf{s}_{2}+8,&\ 0.5\leq\zeta_{2}\leq\zeta_{1}<1.\end{cases} (23)

Moreover, define 𝖾k<0,k=1,2,\mathsf{e}_{k}<0,k=1,2, as

𝖾k:=(ζk1)(11ζk+1)+1.\mathsf{e}_{k}:=(\zeta_{k}-1)\left(\left\lceil\frac{1}{1-\zeta_{k}}\right\rceil+1\right)+1. (24)
Theorem 3.1.

Suppose Assumption 2.1 holds with 0ζ2ζ1<10\leq\zeta_{2}\leq\zeta_{1}<1, h1=p1h_{1}=p_{1}, h2=p2h_{2}=p_{2} and d1=d2=1.d_{1}=d_{2}=1. Moreover, we assume that

𝐳ii.i.d.𝒩(0,𝐈p1),𝐰ii.i.d.𝒩(0,𝐈p2),\mathbf{z}_{i}\overset{\operatorname{i.i.d.}}{\sim}\mathcal{N}(0,\mathbf{I}_{p_{1}}),\ \mathbf{w}_{i}\overset{\operatorname{i.i.d.}}{\sim}\mathcal{N}(0,\mathbf{I}_{p_{2}})\,, (25)

and there exists some constant τ>0\tau>0 such that

|ck1|τ,k=1,2.|c_{k}-1|\geq\tau,\ k=1,2. (26)

Then, when nn is sufficiently large, we have that for i>𝖳i>\mathsf{T} in (23),

\left|\lambda_{i}(n^{2}\mathbf{N})-\exp(4\upsilon)\gamma_{\nu_{1}\boxtimes\nu_{2}}(i)\right|=\mathrm{O}_{\prec}\left(\max\left\{n^{\frac{\zeta_{1}-1}{2}},n^{\mathsf{e}_{1}}\right\}\right), (27)

where 𝖾1\mathsf{e}_{1} is defined in (24).

Intuitively, in this region we cannot obtain any information about the signal, since asymptotically the noise dominates the signal. In practice, the datasets might fall in this region when both sensors are corrupted or the environmental noise is too strong. This intuition is confirmed by Theorem 3.1. As discussed in [6, 13], when the noise dominates the signal, the outlier eigenvalues come mainly from the kernel function expansion or the Gram matrix and hence are not useful for studying the underlying manifold structure. The number of these outlier eigenvalues depends on the SNR, as can be seen in (23), and can be worked out from the kernel function expansion.

We should point out that (25) and (26) are mainly technical assumptions and commonly used in the random matrix theory literature. They guarantee that the individual bulk eigenvalues of 𝐍\mathbf{N} can be characterized by the quantiles of free multiplicative convolution. Specifically, (26) ensures that the Gram matrices are bounded from below and (25) has been used in [7] to ensure that the eigenvectors of the Gram matrix are Haar distributed. As is discussed in [7], while it is widely accepted that the eigenvectors of the Gram matrix from i.i.d. sub-Gaussian random vectors are Haar distributed, we cannot find a proper proof. Since the proof of this Theorem depends on the results in [7], we impose the same condition. The assumption (25) can be removed when we can show that the eigenvectors of the Gram matrix from i.i.d. sub-Gaussian random vectors are Haar distributed. Since this is not the focus of the current paper, we will pursue this direction in future works.

Finally, we mention that Theorem 3.1 holds for more general kernel functions beyond the Gaussian kernel. For example, as discussed in [6, Remark 2.4], we can choose a general kernel function $f$ that is decreasing, $C^3$, and satisfies $f(2)>0$.

3.1.2. When one sensor has a small SNR, 0ζ2<1ζ1<0\leq\zeta_{2}<1\leq\zeta_{1}<\infty

In Theorem 3.2 below, we consider the case $0\leq\zeta_2<1\leq\zeta_1<\infty$, i.e., one of the sensors has a large SNR whereas the other is dominated by the noise. We prepare some notation here. Let $\mathbf{W}_{1,s}$ and $\mathbf{W}_{2,s}$ be the affinity matrices associated with $\{\mathbf{u}_{ix}\}$ and $\{\mathbf{u}_{iy}\}$ respectively, where the subscript $s$ is shorthand for the signal. In other words, $\mathbf{W}_{1,s}$ and $\mathbf{W}_{2,s}$ are constructed from the clean signal. In general, since $h_1$ may be different from $h_2$, $\mathbf{W}_{1,s}$ and $\mathbf{W}_{2,s}$ might be different. Denote

𝐖~1,s=exp(2υ)𝐖1,s+(1exp(2υ))𝐈.\displaystyle\widetilde{\mathbf{W}}_{1,s}=\exp(-2\upsilon)\mathbf{W}_{1,s}+(1-\exp(-2\upsilon))\mathbf{I}. (28)

Analogously, we denote the associated degree matrix and transition matrix as 𝐃~1,s\widetilde{\mathbf{D}}_{1,s} and 𝐀~1,s\widetilde{\mathbf{A}}_{1,s} respectively, that is,

𝐀~1,s=𝐃~1,s1𝐖~1,s.\widetilde{\mathbf{A}}_{1,s}=\widetilde{\mathbf{D}}_{1,s}^{-1}\widetilde{\mathbf{W}}_{1,s}. (29)

Define $\widetilde{\mathbf{W}}_{2,s}$ and $\widetilde{\mathbf{A}}_{2,s}$ similarly. Note that from the random walk perspective, $\widetilde{\mathbf{A}}_{1,s}$ (and $\widetilde{\mathbf{A}}_{2,s}$ as well) describes a lazy random walk on the clean dataset. We further introduce some other $n\times n$ matrices,

𝐖1,c(i,j)=exp(2υ(𝐮ix𝐮jx)(𝐳i𝐳j)p1)and𝐖~1,c:=𝐖~1,s𝐖1,c.\mathbf{W}_{1,c}(i,j)=\exp\left(-2\upsilon\frac{(\mathbf{u}_{ix}-\mathbf{u}_{jx})^{\top}(\mathbf{z}_{i}-\mathbf{z}_{j})}{p_{1}}\right)\quad\mbox{and}\quad\widetilde{\mathbf{W}}_{1,c}:=\widetilde{\mathbf{W}}_{1,s}\circ\mathbf{W}_{1,c}\,.

We then define the associate degree matrix and transition matrix as 𝐃~1,c\widetilde{\mathbf{D}}_{1,c} and 𝐀~1,c,\widetilde{\mathbf{A}}_{1,c}, respectively; that is,

𝐀~1,c=𝐃~1,c1𝐖~1,c.\widetilde{\mathbf{A}}_{1,c}=\widetilde{\mathbf{D}}_{1,c}^{-1}\widetilde{\mathbf{W}}_{1,c}\,. (30)

𝐖~1,c\widetilde{\mathbf{W}}_{1,c} and 𝐀~1,c\widetilde{\mathbf{A}}_{1,c} will be used when ζ1\zeta_{1} is too large (ζ12\zeta_{1}\geq 2) so that the bandwidth h1=p1h_{1}=p_{1} is insufficient to capture the relationship between two different samples.

With the above notations and (14), denote

𝐍~:={exp(2υ)𝐀~1,s(ς2𝐈+2υexp(υτ2)p2𝐖𝐖),1ζ1<2;exp(2υ)𝐀~1,c(ς2𝐈+2υexp(υτ2)p2𝐖𝐖),ζ12.\widetilde{\mathbf{N}}:=\begin{cases}\exp(2\upsilon)\widetilde{\mathbf{A}}_{1,s}\left(\varsigma_{2}\mathbf{I}+2\frac{\upsilon\exp(-\upsilon\tau_{2})}{p_{2}}\mathbf{W}^{\top}\mathbf{W}\right),&1\leq\zeta_{1}<2;\\ \exp(2\upsilon)\widetilde{\mathbf{A}}_{1,c}\left(\varsigma_{2}\mathbf{I}+2\frac{\upsilon\exp(-\upsilon\tau_{2})}{p_{2}}\mathbf{W}^{\top}\mathbf{W}\right),&\zeta_{1}\geq 2.\end{cases} (31)

Using 𝗌2\mathsf{s}_{2} defined in (22), denote

𝖲:={4,0ζ2<0.5,𝗌2+4,0.5ζ2<1.\mathsf{S}:=\begin{cases}4,&0\leq\zeta_{2}<0.5,\\ \mathsf{s}_{2}+4,&0.5\leq\zeta_{2}<1.\end{cases} (32)
Theorem 3.2.

Suppose Assumption 2.1 holds with 0ζ2<1ζ1<0\leq\zeta_{2}<1\leq\zeta_{1}<\infty, h1=p1h_{1}=p_{1}, h2=p2h_{2}=p_{2} and d1=d2=1d_{1}=d_{2}=1. Then we have that for i>𝖲i>\mathsf{S},

|λi(n𝐍)λi(𝐍~)|=O(max{n𝖾2,nζ212}),\left|\lambda_{i}(n\mathbf{N})-\lambda_{i}(\widetilde{\mathbf{N}})\right|=\mathrm{O}_{\prec}\left(\max\left\{n^{\mathsf{e}_{2}},n^{\frac{\zeta_{2}-1}{2}}\right\}\right), (33)

where 𝖾2\mathsf{e}_{2} is defined in (24) and 𝐍~\widetilde{\mathbf{N}} is defined in (31). Furthermore, when ζ1\zeta_{1} is larger in the sense that for any given small constant δ(0,1),\delta\in(0,1),

ζ1>2δ+1,\zeta_{1}>\frac{2}{\delta}+1, (34)

then with probability at least 1O(n1δ(ζ11)/2),1-\mathrm{O}\left(n^{1-\delta(\zeta_{1}-1)/2}\right), for some sufficiently small constant ϵ>0\epsilon>0 and some constant C>0C>0 and all i𝖲,i\geq\mathsf{S}, we have

|λi(n𝐍)exp(2υ)γν2(i)|Cmax{n1/2+ϵ,n𝖾2,nζ212}.\left|\lambda_{i}(n\mathbf{N})-\exp(2\upsilon)\gamma_{\nu_{2}}(i)\right|\leq C\max\{n^{-1/2+\epsilon},n^{\mathsf{e}_{2}},\ n^{\frac{{\zeta_{2}}-1}{2}}\}. (35)

This is a potentially confusing region. In practice, it captures the situation when one sensor is corrupted so that its signal part becomes weak. Since we still have one sensor available with a strong SNR, it is expected that we could still obtain something useful. However, Theorem 3.2 shows that the corrupted sensor unfortunately contaminates the overall performance of the sensor fusion algorithm. Note that since the first sensor has a large SNR, the noisy transition matrix $\mathbf{A}_1$ is close to the transition matrix $\widetilde{\mathbf{A}}_{1,s}$, which depends only on the signal part, when $1\leq\zeta_1<2$, and to the transition matrix $\widetilde{\mathbf{A}}_{1,c}$, which is a mixture of the signal and the noise, when $\zeta_1\geq 2$. This fact has been shown in [6]. However, for the second sensor, due to the strong noise, $\mathbf{A}_2$ will be close to a perturbed Gram matrix that mainly comes from the high dimensional noise. Consequently, as illustrated in (33), the NCCA matrix will be close to $\widetilde{\mathbf{N}}$, which is the product of the clean transition matrix and the shifted Gram matrix. Clearly, the clean transition matrix is contaminated by the shifted Gram matrix, which does not contain any information about the signal. This limits the information we can obtain.

In the extreme case when ζ1\zeta_{1} is larger in the sense of (34), the chosen bandwidth h1=p1h_{1}=p_{1} is too small compared with the signal so that the transition matrix 𝐀1\mathbf{A}_{1} will be close to the identity matrix. Consequently, as in (35), the NCCA matrix will be mainly characterized by the perturbed Gram matrix whose limiting ESD follows the MP law ν2\nu_{2} with proper scaling.

We should however emphasize that, as has been elaborated in [6], when the SNR is large, particularly when $\zeta_1>2$, we should consider a different bandwidth, in particular the bandwidth determined by a percentile of the pairwise distances that is commonly used in practice. It is thus natural to ask whether, if the bandwidth $h_1$ is chosen "properly", we would eventually obtain useful information. We answer this question in Section 4.

3.2. Informative region: ζ21\zeta_{2}\geq 1

In this subsection, we state the results when both of the sensors have large SNR (ζ21\zeta_{2}\geq 1). Recall (29), (30) and denote 𝐀~2,s,𝐀~2,c\widetilde{\mathbf{A}}_{2,s},\widetilde{\mathbf{A}}_{2,c} analogously for the point cloud 𝒴.\mathcal{Y}. For some constant C>0,C>0, denote

𝖱:={Clogn,ζ2=1Cnζ21,1<ζ2<2.\mathsf{R}:=\begin{cases}C\log n,&\zeta_{2}=1\\ Cn^{\zeta_{2}-1},&1<\zeta_{2}<2.\end{cases} (36)
Theorem 3.3.

Suppose Assumption 2.1 holds with 1ζ2ζ1<1\leq\zeta_{2}\leq\zeta_{1}<\infty, h1=p1h_{1}=p_{1}, h2=p2h_{2}=p_{2} and d1=d2=1d_{1}=d_{2}=1. Then we have that:

  1. (1).

    When 1ζ2ζ1<2,1\leq\zeta_{2}\leq\zeta_{1}<2, we have

    𝐍𝐀~1,s𝐀~2,sn1/2.\left\|\mathbf{N}-\widetilde{\mathbf{A}}_{1,s}\widetilde{\mathbf{A}}_{2,s}^{\top}\right\|\prec n^{-1/2}. (37)

    Additionally, for i>𝖱i>\mathsf{R}, we have

    λi(𝐀~1,s𝐀~2,s)n(ζ23)/2.\lambda_{i}(\widetilde{\mathbf{A}}_{1,s}\widetilde{\mathbf{A}}_{2,s}^{\top})\prec n^{(\zeta_{2}-3)/2}. (38)
  2. (2).

    When 1ζ2<2ζ1<,1\leq\zeta_{2}<2\leq\zeta_{1}<\infty, we have that (38) holds true and

    𝐍𝐀~1,c𝐀~2,sn1/2.\|\mathbf{N}-\widetilde{\mathbf{A}}_{1,c}\widetilde{\mathbf{A}}_{2,s}^{\top}\|\prec n^{-1/2}. (39)

    Moreover, when ζ1{\zeta_{1}} is larger in the sense of (34), we have that with probability at least 1O(n1δ(ζ11)/2),1-\mathrm{O}\left(n^{1-\delta({\zeta_{1}}-1)/2}\right), for some sufficiently small ϵ>0\epsilon>0 and some constant C>0C>0

    𝐍𝐀~2,sC(n1/2+ϵ+nexp(υ(σ12/n)1δ)).\left\|\mathbf{N}-\widetilde{\mathbf{A}}_{2,s}\right\|\leq C\left(n^{-1/2+\epsilon}+n\exp(-\upsilon(\sigma_{1}^{2}/n)^{1-\delta})\right). (40)
  3. (3).

    When ζ1ζ22,\zeta_{1}\geq\zeta_{2}\geq 2, we have that

    𝐍𝐀~1,c𝐀~2,cn3/2+nζ2/2.\left\|\mathbf{N}-\widetilde{\mathbf{A}}_{1,c}\widetilde{\mathbf{A}}_{2,c}^{\top}\right\|\prec n^{-3/2}+n^{-\zeta_{2}/2}. (41)

    Moreover, if ζ2{\zeta_{2}} is larger in the sense of (34), we have that with probability at least 1O(n1δ(min{ζ1,ζ2}1)/2),1-\mathrm{O}\left(n^{1-\delta(\min\{{\zeta_{1}},{\zeta_{2}}\}-1)/2}\right), for some constant C>0C>0,

    𝐍𝐈Cn(exp(υ(σ12/n)1δ)+exp(υ(σ22/n)1δ)).\left\|\mathbf{N}-\mathbf{I}\right\|\leq Cn\left(\exp(-\upsilon(\sigma_{1}^{2}/n)^{1-\delta})+\exp(-\upsilon(\sigma_{2}^{2}/n)^{1-\delta})\right). (42)

Theorem 3.3 shows that when hk=pkh_{k}=p_{k}, where k=1,2,k=1,2, and both SNRs are large, the NCCA matrix from the noisy dataset could be well approximated by that from the clean dataset of the common manifold. The main reason has been elaborated in [6] when we have only one sensor. In the two sensors case, combining (37) and (38), we see that except the first 𝖱\mathsf{R} eigenvalues, the remaining eigenvalues are negligible and not informative. Moreover, (2) and (3) reveal important information about the bandwidth; that is, if the bandwidth choice is improper, like h1=p1h_{1}=p_{1} and h2=p2h_{2}=p_{2}, the result could be misleading in general. For instance, when ζ1\zeta_{1} and ζ2\zeta_{2} are large, ideally we should have a “very clean” dataset and we shall expect to obtain useful information about the signal. However, this result says that we cannot obtain any useful information from NCCA; particularly, see (42). This however is intuitively true, since when the bandwidth is too small, the relationship of two distinct points cannot be captured by the kernel; that is, when iji\neq j, exp(𝐱i𝐱j2/p1)0\exp(-\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}/p_{1})\approx 0 with high probability (see proof below for a precise statement of this argument or [6]). This problem can be fixed if we choose a proper bandwidth. In Section 4, we will state the corresponding results when the bandwidths are selected properly, in which case this counterintuitive result is eliminated.
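The following small simulation (ours, with an illustrative one-dimensional signal) demonstrates this collapse: with $h_1=p_1$ and a very strong signal, most off-diagonal kernel entries vanish, so $\mathbf{A}_1$, and hence $\mathbf{N}$, degenerates toward the identity.

```python
# Illustration (not from the paper) of why h1 = p1 fails for very strong signal:
# most off-diagonal entries of W1 are exp(-O(sigma_1^2 / p1)), hence essentially 0.
import numpy as np

rng = np.random.default_rng(4)
n, p1, upsilon, zeta1 = 100, 200, 1.0, 3.0
sigma1 = n ** (zeta1 / 2)                            # very large SNR, cf. (34)

X = rng.standard_normal((n, p1))
X[:, 0] += sigma1 * np.cos(rng.uniform(0, 2 * np.pi, n))   # strong 1-d signal plus noise

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W1 = np.exp(-upsilon * sq_dists / p1)                # classic bandwidth h1 = p1
off_diag = W1[~np.eye(n, dtype=bool)]
# median is essentially 0 and the vast majority of entries fall below 0.01
print(np.median(off_diag), np.mean(off_diag < 1e-2))
```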

Remark 3.4.

In the above theorems, we focus on reporting the results for the case $d_1=d_2=1$ in (5). In this remark, we discuss how to generalize the results to the setting where $d_1>1$ or $d_2>1$. First, when $0\leq\zeta_{1i},\zeta_{2j}<1$ for all $1\leq i\leq d_1$ and $1\leq j\leq d_2$, Theorem 3.1 still holds after minor modifications; for example, $\mathfrak{d}_k$ in (22) should be replaced by $\sum_{i=1}^{d_k}\mathfrak{d}_{ki}$, where $\mathfrak{d}_{ki}:=\left\lceil\frac{1}{1-\zeta_{ki}}\right\rceil+1$, and the error bound in (27) should be replaced by

max{maxi{nζ1i12},maxj{n𝖾2j}},\max\left\{\max_{i}\{n^{\frac{\zeta_{1i}-1}{2}}\},\max_{j}\{n^{\mathsf{e}_{2j}}\}\right\},

where the $\mathsf{e}_{2j}$'s are defined similarly to (24). Similar arguments apply to Theorem 3.2. Second, when $\zeta_{1i},\zeta_{2j}\geq 1$ for $1\leq i\leq d_1$ and $1\leq j\leq d_2$, Theorem 3.3 holds by setting $\zeta_k:=\max_{j}\{\zeta_{kj}\}$, $k=1,2$. Finally, suppose that there exist some integers $r_k<d_k$, $k=1,2$, such that

\zeta_{k1}\geq\zeta_{k2}\geq\cdots\geq\zeta_{k,r_{k}}\geq 1>\zeta_{k,r_{k}+1}\geq\cdots\geq\zeta_{k,d_{k}}\geq 0.

Then we have that Theorem 3.3 still holds by setting ζk:=ζk1,k=1,2,\zeta_{k}:=\zeta_{k1},k=1,2, and the affinity and transition matrices in (29) should be defined using the signal part with large SNRs. For example, 𝐖1,s\mathbf{W}_{1,s} should be defined via

𝐖1,s(i,j)=exp(υ𝐮~ix𝐮~jx22h1),𝐮~ix=(𝐮ix(1),,𝐮ix(r1),0,,0).\mathbf{W}_{1,s}(i,j)=\exp\left(-\upsilon\frac{\|\widetilde{\mathbf{u}}_{ix}-\widetilde{\mathbf{u}}_{jx}\|_{2}^{2}}{h_{1}}\right),\ \widetilde{\mathbf{u}}_{ix}=(\mathbf{u}_{ix}(1),\cdots,\mathbf{u}_{ix}(r_{1}),0,\cdots,0).

The detailed statements and proofs are similar to the setting d1=d2=1d_{1}=d_{2}=1 except for extra notational complicatedness. Since this is not the main focus of the current paper, we omit details here.

Remark 3.5.

Throughout the paper, we focus on reporting the results of the NCCA matrix. However, our results can also be applied to the AD matrix with a minor modification based on their definitions in (3). Specifically, Theorem 3.1 holds for n2𝐀n^{2}\mathbf{A}, Theorem 3.2 holds for n𝐀n\mathbf{A} and Theorem 3.3 holds for 𝐀\mathbf{A} by replacing 𝐀~1,s𝐀~2,s\widetilde{\mathbf{A}}_{1,s}\widetilde{\mathbf{A}}_{2,s}^{\top} with 𝐀~1,s𝐀~2,s,\widetilde{\mathbf{A}}_{1,s}\widetilde{\mathbf{A}}_{2,s}, 𝐀~1,c𝐀~2,s\widetilde{\mathbf{A}}_{1,c}\widetilde{\mathbf{A}}_{2,s}^{\top} with 𝐀~1,c𝐀~2,s\widetilde{\mathbf{A}}_{1,c}\widetilde{\mathbf{A}}_{2,s} and 𝐀~1,c𝐀~2,c\widetilde{\mathbf{A}}_{1,c}\widetilde{\mathbf{A}}_{2,c}^{\top} with 𝐀~1,c𝐀~2,c\widetilde{\mathbf{A}}_{1,c}\widetilde{\mathbf{A}}_{2,c}. Since the proof is similar, we omit details.

4. Main results (II)–adaptive choice of bandwidth

As discussed after Theorem 3.3, when the SNRs are large, the bandwidth choice hk=pkh_{k}=p_{k} for k=1,2k=1,2 is improper. One solution to this trouble has been discussed in [6] when we have one sensor; that is, the bandwidth is decided by the percentile of all pairwise distances. It is thus natural to hypothesize that the same solution would hold for the kernel sensor fusion approach. As in Section 3, we focus on the case d1=d2=1,d_{1}=d_{2}=1, and the discussion for the general setting is similar to that of Remark 3.4. As before, we also assume that ζ1ζ2.\zeta_{1}\geq\zeta_{2}. Also, we focus on reporting the results of the NCCA matrix. The discussion for the AD matrix is similar to that of Remark 3.5.

We first recall the adaptive bandwidth selection approach [6] motivated by the empirical approach commonly used in daily practice. Let $\nu_{\text{dist},1}$ and $\nu_{\text{dist},2}$ be the empirical distributions of the pairwise squared distances $\{\|\mathbf{x}_i-\mathbf{x}_j\|^2\}_{i\neq j}$ and $\{\|\mathbf{y}_i-\mathbf{y}_j\|^2\}_{i\neq j}$, respectively. Then we choose the bandwidths $h_1>0$ and $h_2>0$ by

0h1dνdist,1=ω1and0h2dνdist,2=ω2,\int_{0}^{h_{1}}\mathrm{d}\nu_{\text{dist},1}=\omega_{1}\quad\mbox{and}\quad\int_{0}^{h_{2}}\mathrm{d}\nu_{\text{dist},2}=\omega_{2}, (43)

where 0<ω1<10<\omega_{1}<1 and 0<ω2<10<\omega_{2}<1 are fixed constants chosen by the user. Define 𝐀~1,s\widetilde{\mathbf{A}}_{1,s} in the same way as that in (45), 𝐍~\widetilde{\mathbf{N}} as that in (48), and 𝐀1,s\mathbf{A}_{1,s} as that in (46) using (43). Similarly, we can define the counterparts for the point cloud 𝒴.\mathcal{Y}.
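For concreteness, a minimal sketch of the selection rule (43) is given below; it is our own illustration, the value $\omega=0.5$ is only an example, and all names are ours. The resulting $h_1,h_2$ then replace the classic choices $p_1,p_2$ in (1), with the rest of the NCCA/AD pipeline unchanged.

```python
# Sketch of the adaptive bandwidth rule (43): h_k is the omega_k-quantile of the
# pairwise squared distances of the k-th point cloud; names are illustrative.
import numpy as np
from scipy.spatial.distance import pdist

def adaptive_bandwidth(points, omega):
    return np.quantile(pdist(points, metric="sqeuclidean"), omega)

rng = np.random.default_rng(5)
n, p1, p2 = 200, 300, 250
X = rng.standard_normal((n, p1))
Y = rng.standard_normal((n, p2))

h1 = adaptive_bandwidth(X, omega=0.5)    # roughly 2 * p1 for pure noise
h2 = adaptive_bandwidth(Y, omega=0.5)    # roughly 2 * p2 for pure noise
print(h1, h2)
```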

Recall that $\mathbf{W}_{1,s}$ and $\mathbf{W}_{2,s}$ are the affinity matrices associated with $\{\mathbf{u}_{ix}\}$ and $\{\mathbf{u}_{iy}\}$. With a slight abuse of notation, for $k=1,2$, we denote

𝐖~k,s=exp(υ2pkhk)𝐖k,s+(1exp(υ2pkhk))𝐈,\displaystyle\widetilde{\mathbf{W}}_{k,s}=\exp\left(-\upsilon\frac{2p_{k}}{h_{k}}\right)\mathbf{W}_{k,s}+\left(1-\exp\left(-\upsilon\frac{2p_{k}}{h_{k}}\right)\right)\mathbf{I}, (44)

where $\mathbf{W}_{k,s}$ is constructed using the adaptively selected bandwidth $h_k$. Clearly, $\mathbf{W}_{k,s}$ and $\widetilde{\mathbf{W}}_{k,s}$ differ by a scaling and an isotropic shift, and when $\zeta_k>1$, asymptotically $\mathbf{W}_{k,s}$ and $\widetilde{\mathbf{W}}_{k,s}$ are the same. Note that compared to (28), the difference is that we use the adaptively selected bandwidth in (44). This difference is significant, particularly when $\zeta_k$ is large. Indeed, when $\zeta_k$ is large, the affinity matrix constructed with the classic bandwidth $h_k=p_k$, as in Section 3, is close to an identity matrix, while $\mathbf{W}_{k,s}$ constructed with the adaptively selected bandwidth in (44) encodes information about the signal. Specifically, asymptotically we can show that $\mathbf{W}_{k,s}$ constructed with the bandwidth in (43) converges to an integral operator defined on the manifold, whose spectral structure is commonly used in the manifold learning community to study the signal. See [6] for more discussion. We then define

𝐀~k,s=𝐃~k,s1𝐖~k,s\widetilde{\mathbf{A}}_{k,s}=\widetilde{\mathbf{D}}_{k,s}^{-1}\widetilde{\mathbf{W}}_{k,s} (45)

and

𝐀k,s=𝐃k,s1𝐖k,s,\mathbf{A}_{k,s}=\mathbf{D}_{k,s}^{-1}\mathbf{W}_{k,s}, (46)

where k=1,2k=1,2. Compared to (45), (46) does not contain the scaling and shift of the signal parts. Moreover, denote

ςk,hςk,h(τk,hk):=12υpkhkexp(υτkpkhk)exp(υτkpkhk),k=1,2.\varsigma_{k,h}\equiv\varsigma_{k,h}(\tau_{k},h_{k}):=1-\frac{2\upsilon p_{k}}{h_{k}}\exp\left(-\upsilon\frac{\tau_{k}p_{k}}{h_{k}}\right)-\exp\left(-\upsilon\frac{\tau_{k}p_{k}}{h_{k}}\right),\ k=1,2. (47)
Theorem 4.1.

Suppose Assumption 2.1 holds with the adaptively chosen bandwidths h1,h2h_{1},h_{2} and d1=d2=1.d_{1}=d_{2}=1. Recall (15), (11) and (47). Then we have that:

  1. (1).

    0ζ2<10\leq\zeta_{2}<1 (at least one sensor has a low SNR). When 0ζ1<10\leq\zeta_{1}<1 as well, Theorem 3.1 holds under the assumption of (25) and (26) by replacing νk,k=1,2,\nu_{k},k=1,2, with

    ν~k:=Tςk,hνck,ηk,whereηk=2pkυexp(2pkυ/hk)hk\widetilde{\nu}_{k}:=\mathrm{T}_{\varsigma_{k,h}}\nu_{c_{k},\eta_{k}},\ \ \mbox{where}\ \ \eta_{k}=\frac{2p_{k}\upsilon\exp(-2p_{k}\upsilon/h_{k})}{h_{k}}

    and exp(4υ)\exp(4\upsilon) with exp(4υp1p2/(h1h2))\exp(4\upsilon p_{1}p_{2}/(h_{1}h_{2})) in (27). When ζ11\zeta_{1}\geq 1, Theorem 3.2 holds by replacing 𝐍\mathbf{N} by 𝐍~\widetilde{\mathbf{N}} in (33), where

\widetilde{\mathbf{N}}:=\begin{cases}\exp\left(\upsilon\frac{2p_{2}}{h_{2}}\right)\widetilde{\mathbf{A}}_{1,s}\left(\varsigma_{2,h}\mathbf{I}+\frac{2\upsilon\exp(-\upsilon p_{2}\tau_{2}/h_{2})}{h_{2}}\mathbf{W}^{\top}\mathbf{W}\right),&1\leq\zeta_{1}<2;\\ \exp\left(\upsilon\frac{2p_{2}}{h_{2}}\right)\widetilde{\mathbf{A}}_{1,c}\left(\varsigma_{2,h}\mathbf{I}+\frac{2\upsilon\exp(-\upsilon p_{2}\tau_{2}/h_{2})}{h_{2}}\mathbf{W}^{\top}\mathbf{W}\right),&\zeta_{1}\geq 2.\end{cases} (48)

    and when ζ1\zeta_{1} is large in the sense of (34), (35) holds replacing ν2\nu_{2} with ν~2\widetilde{\nu}_{2} and exp(2υ)\exp(2\upsilon) with exp(2p2υ/h2).\exp(2p_{2}\upsilon/h_{2}).

  2. (2).

    ζ21\zeta_{2}\geq 1 (both sensors have high SNRs). In this case, we have that

    𝐍𝐀~1,s𝐀~2,sn1/2.\left\|\mathbf{N}-\widetilde{\mathbf{A}}_{1,s}\widetilde{\mathbf{A}}_{2,s}^{\top}\right\|\prec n^{-1/2}. (49)

    Moreover, for some constant C>0C>0 and iClogn,i\geq C\log n, we have

    λi(𝐀~1,s𝐀~2,s)n1.\lambda_{i}(\widetilde{\mathbf{A}}_{1,s}\widetilde{\mathbf{A}}_{2,s}^{\top})\prec n^{-1}. (50)

    Finally, when ζ2>1,\zeta_{2}>1, we have that

    𝐍𝐀1,s𝐀2,sn1/2+n1ζ2.\left\|\mathbf{N}-\mathbf{A}_{1,s}\mathbf{A}_{2,s}^{\top}\right\|\prec n^{-1/2}+n^{1-\zeta_{2}}. (51)

Theorem 4.1 (1) states that if both sensors have low SNRs, the NCCA matrix has a spectral behavior similar to that in Theorem 3.1; that is, when the SNRs are small, due to the noise impact, even if a signal exists, we may not obtain useful results. The reason is that we still have $h_k\asymp p_k$ for $k=1,2$ with high probability (see (100)), so the bandwidth choice does not influence the conclusion. In particular, most of the eigenvalues of $\mathbf{A}$ are governed by the free multiplicative convolution of two MP-type laws, which are essentially the limiting empirical spectral distributions of Gram matrices containing only white noise.

On the other hand, when the signals are stronger, that is, $\zeta_1,\zeta_2\geq 1$, we are able to approximate the associated clean NCCA matrix of the underlying clean common component, as detailed in Theorem 4.1 (2). This result can be interpreted as saying that NCCA is robust to the noise. In particular, when $\zeta_1,\zeta_2>1$, we see that $\mathbf{A}_{1,s}$ and $\mathbf{A}_{2,s}$ come directly from the clean dataset. Finally, we point out that, compared to (36), except for the top $C\log n$ eigenvalues for some constant $C>0$, the remaining eigenvalues are not informative. When $\zeta_1,\zeta_2\geq 2$, the NCCA matrix remains informative, in contrast with (2) and (3) of Theorem 3.3. As a result, when combined with the existing theory about AD [31, 45], the first few eigenpairs of NCCA and AD capture the geometry of the common manifold under the manifold setup.

Theorem 4.1 (1) also describes the behavior of NCCA when one sensor has a high SNR while the other has a low SNR, which is the most interesting and counterintuitive case. In this case, even if the bandwidths of both sensors are chosen according to (43), the NCCA matrix still encodes only limited information about the signal, as in Theorem 3.2. Indeed, the NCCA matrix is close to a product matrix that is a mixture of signal and noise, as shown in (48). While $\widetilde{\mathbf{A}}_{1,s}$ contains information about the signal, it is contaminated by $\varsigma_{2,h}\mathbf{I}+\frac{2\upsilon\exp(-\upsilon p_{2}\tau_{2}/h_{2})}{h_{2}}\mathbf{W}^{\top}\mathbf{W}$ via multiplication, and this factor comes from the noise dominant dataset collected from the other sensor. Since the spectral behavior of $\varsigma_{2,h}\mathbf{I}+\frac{2\upsilon\exp(-\upsilon p_{2}\tau_{2}/h_{2})}{h_{2}}\mathbf{W}^{\top}\mathbf{W}$ follows a shifted and scaled MP law, overall we obtain limited information about the signal if we apply the kernel-based sensor fusion algorithm. In this case, it is better to simply consider the dataset with the high SNR. Based on the above discussion and practical experience, we would like to point out a potential danger of directly applying NCCA (or AD) without confirming the signal quality: doing so without any sanity check may result in a misleading conclusion or lower quality information. Therefore, before applying NCCA and AD, it is suggested to follow the common practice of checking for the existence of signal in each of the sensors.

Regarding the choices of the constants $\omega_1$ and $\omega_2$, we comment that in practice researchers usually choose $\omega_k=0.25$ or $0.5$ [42]. In [6], we propose an algorithm to choose their values adaptively. The main idea is that the algorithm seeks a bandwidth so that the affinity matrix has the largest number of outlier eigenvalues. We refer the readers to [6, Section 3.2] for more details.

Last but not least, we point out that our results can potentially be used to detect the common components. Usually, researchers rely on background knowledge to decide if common information exists. For example, it is not surprising that two electroencephalogram channels share the same brain activity. However, while physiologically the brain and the heart share common information [41], it is less clear whether the electroencephalogram and the electrocardiogram share anything in common, and what the common information is. Answering this complicated question may require a lot of scientific work, but the first step toward it is a powerful tool to confirm whether two sensors share the same information, in addition to checking whether the signal quality is sufficient. Since this is not the focus of the current paper, we will address this issue in future work.

5. Proof of main theorems

We now provide the proofs of the main theoretical results in Sections 3 and 4. We start by collecting some technical lemmas needed in the proofs.

5.1. Some technical lemmas

The following lemma provides some deterministic inequalities for the products of matrices.

Lemma 5.1.

(1). Suppose that 𝐋\mathbf{L} is a real symmetric matrix with nonnegative entries and 𝐄\mathbf{E} is another real symmetric matrix. Then we have that

σ1(𝐋𝐄)maxi,j|𝐄(i,j)|σ1(𝐋),\sigma_{1}(\mathbf{L}\circ\mathbf{E})\leq\max_{i,j}|\mathbf{E}(i,j)|\sigma_{1}(\mathbf{L}),

where 𝐋𝐄\mathbf{L}\circ\mathbf{E} is the Hadamard product and σ1(𝐋)\sigma_{1}(\mathbf{L}) stands for the largest singular value of 𝐋.\mathbf{L}.
(2). Suppose AA and BB are two n×nn\times n positive definite matrices. Then for all 1kn,1\leq k\leq n, we have that

λk(A)λn(B)λk(AB)λk(A)λ1(B).\lambda_{k}(A)\lambda_{n}(B)\leq\lambda_{k}(AB)\leq\lambda_{k}(A)\lambda_{1}(B).
Proof.

For (1), see [13, Lemma A.5]. (2) follows from Courant-Fischer-Weyl’s min-max principle via

λk(AB)=λk(BAB)\displaystyle\lambda_{k}(AB)=\lambda_{k}(\sqrt{B}A\sqrt{B}) =minFndim(F)=k(maxxF\{0}(BABx,x)(x,x))\displaystyle=\min_{\begin{subarray}{c}F\subset\mathbb{R}^{n}\\ \dim(F)=k\end{subarray}}\left(\max_{x\in F\backslash\{0\}}\frac{(\sqrt{B}A\sqrt{B}x,x)}{(x,x)}\right)
=minFndim(F)=k(maxxF\{0}(ABx,Bx)(Bx,Bx)(Bx,x)(x,x)).\displaystyle=\min_{\begin{subarray}{c}F\subset\mathbb{R}^{n}\\ \dim(F)=k\end{subarray}}\left(\max_{x\in F\backslash\{0\}}\frac{(A\sqrt{B}x,\sqrt{B}x)}{(\sqrt{B}x,\sqrt{B}x)}\frac{(Bx,x)}{(x,x)}\right).
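As a quick numerical sanity check of both parts of Lemma 5.1 (not needed for the arguments below), the following sketch verifies the two inequalities on a random instance; the matrix sizes and distributions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# (1) Hadamard product bound: sigma_1(L o E) <= max_ij |E(i,j)| * sigma_1(L)
L = np.abs(rng.standard_normal((n, n))); L = (L + L.T) / 2   # symmetric, nonnegative entries
E = rng.standard_normal((n, n));         E = (E + E.T) / 2   # symmetric
lhs = np.linalg.norm(L * E, 2)                               # '*' is the entrywise (Hadamard) product
rhs = np.abs(E).max() * np.linalg.norm(L, 2)
assert lhs <= rhs

# (2) lambda_k(A) lambda_n(B) <= lambda_k(AB) <= lambda_k(A) lambda_1(B) for positive definite A, B
M1 = rng.standard_normal((n, n)); A = M1 @ M1.T + n * np.eye(n)
M2 = rng.standard_normal((n, n)); B = M2 @ M2.T + n * np.eye(n)
lam_A  = np.sort(np.linalg.eigvalsh(A))[::-1]
lam_B  = np.sort(np.linalg.eigvalsh(B))[::-1]
lam_AB = np.sort(np.linalg.eigvals(A @ B).real)[::-1]        # AB has real positive spectrum
assert np.all(lam_A * lam_B[-1] <= lam_AB * (1 + 1e-8))
assert np.all(lam_AB <= lam_A * lam_B[0] * (1 + 1e-8))
print("both inequalities of Lemma 5.1 hold on this instance")
```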

The following Lemma 5.2 collects some concentration inequalities.

Lemma 5.2.

Suppose Assumption 2.1 holds with d1=d2=1d_{1}=d_{2}=1. Moreover, we assume that 0ζ1,ζ2<10\leq\zeta_{1},\zeta_{2}<1 in (10). Then we have that

1p1|𝐱i𝐱j|σ12n+1n,1p2|𝐲i𝐲j|σ22n+1n,\frac{1}{p_{1}}\left|\mathbf{x}_{i}^{\top}\mathbf{x}_{j}\right|\prec\frac{\sigma^{2}_{1}}{n}+\frac{1}{\sqrt{n}},\ \frac{1}{p_{2}}\left|\mathbf{y}_{i}^{\top}\mathbf{y}_{j}\right|\prec\frac{\sigma^{2}_{2}}{n}+\frac{1}{\sqrt{n}}, (52)

and

|1p1𝐱i22(1+σ12p1)|σ12n+1n,|1p2𝐲i22(1+σ22p2)|σ22n+1n.\left|\frac{1}{p_{1}}\|\mathbf{x}_{i}\|_{2}^{2}-\left(1+\frac{\sigma^{2}_{1}}{p_{1}}\right)\right|\prec\frac{\sigma^{2}_{1}}{n}+\frac{1}{\sqrt{n}},\ \left|\frac{1}{p_{2}}\|\mathbf{y}_{i}\|_{2}^{2}-\left(1+\frac{\sigma^{2}_{2}}{p_{2}}\right)\right|\prec\frac{\sigma^{2}_{2}}{n}+\frac{1}{\sqrt{n}}. (53)
Proof.

See Lemma A.2 of [6]. ∎
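The following sketch illustrates the concentration claims (52) and (53) under an illustrative model in which each 𝐱i\mathbf{x}_{i} is a point on a circle of radius σ1\sigma_{1} plus standard Gaussian noise, with the toy parametrization σ12=nζ1\sigma_{1}^{2}=n^{\zeta_{1}}; the comparison is only up to the nϵn^{\epsilon} slack hidden in the stochastic domination \prec.

```python
import numpy as np

rng = np.random.default_rng(2)
p = n = 500
zeta1 = 0.6                                          # illustrative SNR exponent, zeta_1 < 1
sigma = n ** (zeta1 / 2)
u = rng.uniform(0, 2 * np.pi, n)
signal = sigma * np.vstack([np.cos(u), np.sin(u), np.zeros((p - 2, n))])   # ||signal_i|| = sigma
X = signal + rng.standard_normal((p, n))             # x_i = signal_i + z_i

G = X.T @ X / p
off_diag = np.abs(G - np.diag(np.diag(G)))
print("max_{i!=j} |x_i^T x_j| / p              :", off_diag.max())
print("max_i |(1/p)||x_i||^2 - (1 + sigma^2/p)|:", np.abs(np.diag(G) - (1 + sigma ** 2 / p)).max())
print("predicted order sigma^2/n + n^{-1/2}    :", sigma ** 2 / n + n ** -0.5)
```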

In the following lemma, we prove some results regarding the concentration of the affinity matrices when 0ζ1<10\leq\zeta_{1}<1 and 0ζ2<10\leq\zeta_{2}<1.

Lemma 5.3.

We follow the notations in (67)–(69) and recall (22) for d1=d2=1d_{1}=d_{2}=1. For f(x)=exp(υx),f(x)=\exp(-\upsilon x), we denote Sh𝖽\mathrm{Sh}_{\mathsf{d}} and Sh~𝖽\widetilde{\mathrm{Sh}}_{\mathsf{d}} such that

Sh𝖽(i,j):=k=3𝔡11f(k)(τ1)𝐋x(i,j)kk!,Sh~𝖽(i,j):=k=3𝔡21f(k)(τ2)𝐋y(i,j)kk!,\mathrm{Sh}_{\mathsf{d}}(i,j):=\sum_{k=3}^{\mathfrak{d}_{1}-1}\frac{f^{(k)}(\tau_{1})\mathbf{L}_{x}(i,j)^{k}}{k!},\ \widetilde{\mathrm{Sh}}_{\mathsf{d}}(i,j):=\sum_{k=3}^{\mathfrak{d}_{2}-1}\frac{f^{(k)}(\tau_{2})\mathbf{L}_{y}(i,j)^{k}}{k!}, (54)

where 𝐋x=𝐎x𝐏x,\mathbf{L}_{x}=\mathbf{O}_{x}-\mathbf{P}_{x}, 𝐋y\mathbf{L}_{y} is defined analogously using {𝐲i}\{\mathbf{y}_{i}\}, and 𝐎x\mathbf{O}_{x} and 𝐏x\mathbf{P}_{x} are defined as

𝐎x(i,j)=(1δij)(ϕi+ϕj),𝐏x(i,j)=(1δij)𝐱i𝐱jp1.\mathbf{O}_{x}(i,j)=(1-\delta_{ij})(\phi_{i}+\phi_{j}),\ \mathbf{P}_{x}(i,j)=(1-\delta_{ij})\frac{\mathbf{x}_{i}^{\top}\mathbf{x}_{j}}{p_{1}}.

Denote

𝐊1={2f(τ1)p11𝐗𝐗+ς1𝐈n+Sh10(τ)+Sh11(τ)+Sh12(τ),0α1<0.52f(τ1)p11𝐗𝐗+ς1𝐈n+Sh10(τ)+Sh11(τ)+Sh12(τ)+Sh𝖽,0.5α1<1.\mathbf{K}_{1}=\begin{cases}-2f^{\prime}(\tau_{1})p_{1}^{-1}\mathbf{X}^{\top}\mathbf{X}+\varsigma_{1}\mathbf{I}_{n}+\mathrm{Sh}_{10}(\tau)+\mathrm{Sh}_{11}(\tau)+\mathrm{Sh}_{12}(\tau),&0\leq\alpha_{1}<0.5\\ -2f^{\prime}(\tau_{1})p_{1}^{-1}\mathbf{X}^{\top}\mathbf{X}+\varsigma_{1}\mathbf{I}_{n}+\mathrm{Sh}_{10}(\tau)+\mathrm{Sh}_{11}(\tau)+\mathrm{Sh}_{12}(\tau)+\mathrm{Sh}_{\mathsf{d}},&0.5\leq\alpha_{1}<1.\end{cases} (55)

Recall (24). Then, when 0ζ1<10\leq\zeta_{1}<1 and h1=p1,h_{1}=p_{1}, we have

𝐖1=𝐊1+O(n𝖾1+n1/2).\mathbf{W}_{1}=\mathbf{K}_{1}+\mathrm{O}_{\prec}(n^{\mathsf{e}_{1}}+n^{-1/2}). (56)

Moreover, we have

𝐀1=1nf(τ1)𝐊1+O(n𝖾1+n1/2).\mathbf{A}_{1}=\frac{1}{nf(\tau_{1})}\mathbf{K}_{1}+\mathrm{O}_{\prec}(n^{\mathsf{e}_{1}}+n^{-1/2}). (57)

Finally, for 𝗌1\mathsf{s}_{1} in (22), we have

rank(Sh𝖽)𝗌1.\operatorname{rank}(\mathrm{Sh}_{\mathsf{d}})\leq\mathsf{s}_{1}. (58)

Similar results hold for 𝐀2\mathbf{A}_{2} and 𝐖2\mathbf{W}_{2} using ϕ~k,1kn,\widetilde{\phi}_{k},1\leq k\leq n, Sh~𝖽\widetilde{\mathrm{Sh}}_{\mathsf{d}} and Sh2i,i=0,1,2.\mathrm{Sh}_{2i},i=0,1,2.

Proof.

First, (56) has been proved in [6] using the entry-wise Taylor expansion and the Gershgorin circle theorem; see the proof of Theorems 2.3 and 2.5 of [6]. Second, (58) has been proved in the proof of Theorem 2.5 of [6]. Third, we prove (57). By Lemma 5.2 and a discussion similar to [7, Lemma IV.5], when 0ζ1<10\leq\zeta_{1}<1 and 0ζ2<10\leq\zeta_{2}<1

(n𝐃1)11f(τ1)𝐈n𝖾1+n1/2,(n𝐃2)11f(τ2)𝐈n𝖾2+n1/2.\left\|(n\mathbf{D}_{1})^{-1}-\frac{1}{f(\tau_{1})}\mathbf{I}\right\|\prec n^{\mathsf{e}_{1}}+n^{-1/2},\ \left\|(n\mathbf{D}_{2})^{-1}-\frac{1}{f(\tau_{2})}\mathbf{I}\right\|\prec n^{\mathsf{e}_{2}}+n^{-1/2}. (59)

Consequently,

n𝐀11f(τ1)𝐊1\displaystyle\left\|n\mathbf{A}_{1}-\frac{1}{f(\tau_{1})}\mathbf{K}_{1}\right\| (n𝐃1)1𝐖11f(τ1)𝐖1+1f(τ1)𝐖1𝐊1\displaystyle\leq\left\|(n\mathbf{D}_{1})^{-1}\mathbf{W}_{1}-\frac{1}{f(\tau_{1})}\mathbf{W}_{1}\right\|+\frac{1}{f(\tau_{1})}\left\|\mathbf{W}_{1}-\mathbf{K}_{1}\right\|
(n𝖾1+n1/2)(𝐖1+1),\displaystyle\prec(n^{\mathsf{e}_{1}}+n^{-1/2})(\|\mathbf{W}_{1}\|+1),

where we used the fact that τ1<.\tau_{1}<\infty. Since maxi,j|𝐖1(i,j)|1,\max_{i,j}|\mathbf{W}_{1}(i,j)|\prec 1, by the Gershgorin circle theorem, we conclude that 𝐖1n.\|\mathbf{W}_{1}\|\prec n. This concludes our proof. ∎
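The normalization step (59), which turns (56) into (57), can also be checked numerically: under the same illustrative model as above (circle signal of radius σ1\sigma_{1}, Gaussian noise, υ=1\upsilon=1, h1=p1h_{1}=p_{1}), the kernel degrees concentrate around nf(τ1)nf(\tau_{1}), where in this toy setup τ1\tau_{1} is taken to be 2(1+σ12/p1)2(1+\sigma_{1}^{2}/p_{1}).

```python
import numpy as np

rng = np.random.default_rng(3)
p = n = 600
upsilon = 1.0
sigma2 = n ** 0.4                                    # sigma_1^2 = n^{zeta_1}, zeta_1 = 0.4 < 1
u = rng.uniform(0, 2 * np.pi, n)
X = np.sqrt(sigma2) * np.vstack([np.cos(u), np.sin(u), np.zeros((p - 2, n))]) \
    + rng.standard_normal((p, n))

sq = np.sum(X ** 2, axis=0)
W1 = np.exp(-upsilon * (sq[:, None] + sq[None, :] - 2.0 * X.T @ X) / p)   # bandwidth h1 = p1
deg = W1.sum(axis=1)                                 # diagonal of n * D1

tau1 = 2.0 * (1.0 + sigma2 / p)                      # concentration point of ||x_i - x_j||^2 / p1
print("max_i |deg_i / n - f(tau1)| :", np.abs(deg / n - np.exp(-upsilon * tau1)).max())
print("reference scale n^{-1/2}    :", n ** -0.5)
# hence (n D_1)^{-1} is close to f(tau_1)^{-1} I, which is the content of (59)
```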

In the following lemma, we collect the results regarding the affinity matrices when ζ11\zeta_{1}\geq 1 and ζ21.\zeta_{2}\geq 1. Recall 𝐀~1,s\widetilde{\mathbf{A}}_{1,s} defined via (29).

Lemma 5.4.

Suppose Assumption 2.1 holds with d1=d2=1d_{1}=d_{2}=1, ζ1,ζ21\zeta_{1},\zeta_{2}\geq 1. For some constant C>0,C>0, denote

𝖳1:={Clogn,ζ1=1;Cnζ11,1<ζ1<2.\mathsf{T}_{1}:=\begin{cases}C\log n,&\zeta_{1}=1;\\ Cn^{\zeta_{1}-1},&1<\zeta_{1}<2.\end{cases} (60)

Then we have:
(1). When h1=p1,h_{1}=p_{1}, if 1ζ1<2,1\leq\zeta_{1}<2,

𝐀1𝐀~1,sn1/2.\left\|\mathbf{A}_{1}-\widetilde{\mathbf{A}}_{1,s}\right\|\prec n^{-1/2}. (61)

Moreover, we have that for i>𝖳1i>\mathsf{T}_{1} with 𝖳1\mathsf{T}_{1} given in (60),

λi(𝐀~1,s)n(ζ13)/2.\lambda_{i}(\widetilde{\mathbf{A}}_{1,s})\prec n^{(\zeta_{1}-3)/2}. (62)

On the other hand, when ζ12,\zeta_{1}\geq 2, we have that

𝐀1𝐀~1,cnζ1/2+n3/2.\|\mathbf{A}_{1}-\widetilde{\mathbf{A}}_{1,c}\|\prec n^{-\zeta_{1}/2}+n^{-3/2}.

Finally, when ζ1\zeta_{1} is larger in the sense that (34) holds, we have that with probability at least 1O(n1δ(ζ11)/2),1-\mathrm{O}(n^{1-\delta(\zeta_{1}-1)/2}), for some constant C>0,C>0,

𝐀1𝐈Cnexp(υn(ζ11)(1δ)).\left\|\mathbf{A}_{1}-\mathbf{I}\right\|\leq Cn\exp\left(-\upsilon n^{(\zeta_{1}-1)(1-\delta)}\right). (63)

(2). When h1h_{1} is chosen according to (43), we have that

𝐀1𝐀~1,sn1/2.\left\|\mathbf{A}_{1}-\widetilde{\mathbf{A}}_{1,s}\right\|\prec n^{-1/2}. (64)

Moreover, we have that for i>Clogni>C\log n for some constant C>0C>0,

λi(𝐀~1,s)n1.\lambda_{i}(\widetilde{\mathbf{A}}_{1,s})\prec n^{-1}. (65)

Recall (46). Finally, when ζ1>1,\zeta_{1}>1, we have that

𝐀1𝐀1,sn1/2+n1ζ1.\left\|\mathbf{A}_{1}-\mathbf{A}_{1,s}\right\|\prec n^{-1/2}+n^{1-\zeta_{1}}. (66)

Similar results hold for 𝐀2.\mathbf{A}_{2}.

Proof.

See Corollary 2.11 and Theorem 3.1 of [6]. ∎

Finally, we record the results on the rigidity of the eigenvalues of the non-spiked Gram matrix. Denote the non-spiked Gram matrix as 𝐒\mathbf{S}, where

𝐒:=1p1𝐙𝐙,\mathbf{S}:=\frac{1}{p_{1}}\mathbf{Z}^{\top}\mathbf{Z},

and its eigenvalues as λ1λn.\lambda_{1}\geq\cdots\geq\lambda_{n}. Recall (11) and (17).

Lemma 5.5.

Suppose {𝐳i}\{\mathbf{z}_{i}\} are sub-Gaussian random vectors satisfying (8), (9) and (26). Then we have

supj|λjγνc1,1(j)|n2/3j~1/3,j~:=min{p1n+1j,j}.\sup_{j}\left|\lambda_{j}-\gamma_{\nu_{c_{1},1}}(j)\right|\prec n^{-2/3}\widetilde{j}^{-1/3},\ \widetilde{j}:=\min\left\{p_{1}\wedge n+1-j,j\right\}.
Proof.

See [40, Theorem 3.3]. ∎
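As an illustration of Lemma 5.5, the following sketch compares the eigenvalues of a pure-noise Gram matrix with classical locations of the Marchenko–Pastur law; the quantile convention used here (mass (j1/2)/n(j-1/2)/n to the right of γ(j)\gamma(j)) is an assumption made only for this illustration and may differ slightly from the convention in (20).

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 800, 400
c = n / p
Z = rng.standard_normal((p, n))
lam = np.sort(np.linalg.eigvalsh(Z.T @ Z / p))[::-1]        # lambda_1 >= ... >= lambda_n

# Marchenko--Pastur density for aspect ratio c = n/p <= 1
lm, lp = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
x = np.linspace(lm, lp, 20001)[1:-1]
rho = np.sqrt((lp - x) * (x - lm)) / (2 * np.pi * c * x)

# classical locations gamma(j): mass (j - 1/2)/n sits to the right of gamma(j)
mass_right = np.cumsum(rho[::-1]) * (x[1] - x[0])
mass_right /= mass_right[-1]                                 # normalize the numerical total mass
xs = x[::-1]
gamma = np.array([xs[np.searchsorted(mass_right, (j - 0.5) / n)] for j in range(1, n + 1)])

print("max_j |lambda_j - gamma(j)| :", np.abs(lam - gamma).max())
print("rigidity scale n^{-2/3}     :", n ** (-2.0 / 3.0))
```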

5.2. Proof of Theorem 3.1

We need the following notations. Denote Φ1=(ϕ1,1,,ϕ1,n)\Phi_{1}=(\phi_{1,1},\ldots,\phi_{1,n}), where ϕ1,i=1p1𝐱i22(1+σ12/p1)\phi_{1,i}=\frac{1}{p_{1}}\|\mathbf{x}_{i}\|_{2}^{2}-(1+\sigma_{1}^{2}/p_{1}), i=1,2,,ni=1,2,\cdots,n. Similarly, we define Φ2=(ϕ2,1,,ϕ2,n)\Phi_{2}=({\phi}_{2,1},\ldots,{\phi}_{2,n}) with ϕ2,i=1p2𝐲i22(1+σ22/p2){\phi}_{2,i}=\frac{1}{p_{2}}\|\mathbf{y}_{i}\|_{2}^{2}-(1+\sigma_{2}^{2}/p_{2}), i=1,2,,n.i=1,2,\cdots,n. For k=1,2,k=1,2, denote

Shk0(τk):=f(τk)𝟏𝟏,\displaystyle\mathrm{Sh}_{k0}(\tau_{k}):=f(\tau_{k})\mathbf{1}\mathbf{1}^{\top}, (67)
Shk1(τk):=f(τk)[𝟏Φk+Φk𝟏],\displaystyle\mathrm{Sh}_{k1}(\tau_{k}):=f^{\prime}(\tau_{k})[\mathbf{1}\Phi_{k}^{\top}+\Phi_{k}\mathbf{1}^{\top}], (68)
Shk2(τk):=f(2)(τk)2[𝟏(ΦkΦk)+(ΦkΦk)𝟏+2ΦkΦk+4(σk2+1)2+pkpk2𝟏𝟏].\displaystyle\mathrm{Sh}_{k2}(\tau_{k}):=\frac{f^{(2)}(\tau_{k})}{2}\left[\mathbf{1}(\Phi_{k}\circ\Phi_{k})^{\top}+(\Phi_{k}\circ\Phi_{k})\mathbf{1}^{\top}+2\Phi_{k}\Phi_{k}^{\top}+4\frac{(\sigma_{k}^{2}+1)^{2}+p_{k}}{p_{k}^{2}}\mathbf{1}\mathbf{1}^{\top}\right]. (69)
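Before turning to the proof, note that each matrix Shk0+Shk1+Shk2\mathrm{Sh}_{k0}+\mathrm{Sh}_{k1}+\mathrm{Sh}_{k2} has column space contained in the span of 𝟏\mathbf{1}, Φk\Phi_{k} and ΦkΦk\Phi_{k}\circ\Phi_{k}, hence rank at most three; this fact is used repeatedly below. A small numerical confirmation, with an arbitrary Φk\Phi_{k} and purely illustrative constants, reads as follows.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma2, upsilon, tau = 300, 400, 3.0, 1.0, 2.0
f0 = np.exp(-upsilon * tau)                          # f(tau), with f(x) = exp(-upsilon x)
f1 = -upsilon * f0                                   # f'(tau)
f2 = upsilon ** 2 * f0                               # f''(tau)

Phi = rng.standard_normal(n) / np.sqrt(p)            # stands in for Phi_k
one = np.ones(n)

Sh0 = f0 * np.outer(one, one)
Sh1 = f1 * (np.outer(one, Phi) + np.outer(Phi, one))
Sh2 = f2 / 2 * (np.outer(one, Phi ** 2) + np.outer(Phi ** 2, one)
                + 2 * np.outer(Phi, Phi)
                + 4 * ((sigma2 + 1) ** 2 + p) / p ** 2 * np.outer(one, one))

print("rank(Sh_k0 + Sh_k1 + Sh_k2) =", np.linalg.matrix_rank(Sh0 + Sh1 + Sh2))   # at most 3
```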
Proof.

Case (1). 0ζ2ζ1<0.50\leq\zeta_{2}\leq\zeta_{1}<0.5. By (59), we conclude that

n𝐃11exp(2υ)𝐈=O(n1/2),n𝐃21exp(2υ)𝐈=O(n1/2).\left\|n\mathbf{D}_{1}^{-1}-\exp(2\upsilon)\mathbf{I}\right\|=\mathrm{O}_{\prec}(n^{-1/2}),\ \left\|n\mathbf{D}_{2}^{-1}-\exp(2\upsilon)\mathbf{I}\right\|=\mathrm{O}_{\prec}(n^{-1/2}). (70)

Therefore, it suffices to consider 𝐖1𝐖2.\mathbf{W}_{1}\mathbf{W}_{2}. To ease the heavy notation, we denote

kt:=(υ)texp(υτk),Shk=j=02Shkj,\ell_{kt}:=(-\upsilon)^{t}\exp(-\upsilon\tau_{k}),\ \mathrm{Sh}_{k}=\sum_{j=0}^{2}\mathrm{Sh}_{kj}, (71)

where k=1,2k=1,2 and tt\in\mathbb{N}. With the above notations, by Lemma 5.3, we have that

(𝐖1Sh1)(𝐖2Sh2)\displaystyle(\mathbf{W}_{1}-\mathrm{Sh}_{1})(\mathbf{W}_{2}-\mathrm{Sh}_{2}) (72)
=\displaystyle= (211p1𝐗𝐗+ς1𝐈+O(n1/2))(221p2𝐘𝐘+ς2𝐈+O(n1/2)).\displaystyle\,\left(-2\frac{\ell_{11}}{p_{1}}\mathbf{X}^{\top}\mathbf{X}+\varsigma_{1}\mathbf{I}+\mathrm{O}_{\prec}(n^{-1/2})\right)\left(-2\frac{\ell_{21}}{p_{2}}\mathbf{Y}^{\top}\mathbf{Y}+\varsigma_{2}\mathbf{I}+\mathrm{O}_{\prec}(n^{-1/2})\right).

Denote

𝐏1:=211p1𝐗𝐗+ς1𝐈and𝐏2:=221p2𝐘𝐘+ς2𝐈.\mathbf{P}_{1}:=-2\frac{\ell_{11}}{p_{1}}\mathbf{X}^{\top}\mathbf{X}+\varsigma_{1}\mathbf{I}\ \ \mbox{and}\ \ \mathbf{P}_{2}:=-2\frac{\ell_{21}}{p_{2}}\mathbf{Y}^{\top}\mathbf{Y}+\varsigma_{2}\mathbf{I}. (73)

Since d1=d2=1d_{1}=d_{2}=1 and 𝐮ix\mathbf{u}_{ix} and 𝐮iy\mathbf{u}_{iy} contain samples from the common manifold, we can set

𝒖=(u1,,un)n.\bm{u}=(u_{1},\cdots,u_{n})^{\top}\in\mathbb{R}^{n}.

Moreover, denote

𝒛=(𝐳i1,,𝐳in)and𝒘=(𝐰i1,,𝐰in)n.\bm{z}=(\mathbf{z}_{i1},\cdots,\mathbf{z}_{in})^{\top}\ \ \mbox{and}\ \ \bm{w}=(\mathbf{w}_{i1},\cdots,\mathbf{w}_{in})^{\top}\in\mathbb{R}^{n}.

With the above notations, denote

Δ1:=211p1𝒖𝒖,Δ2:=221p2𝒖𝒖,\Delta_{1}:=-2\frac{\ell_{11}}{p_{1}}\bm{u}\bm{u}^{\top},\ \Delta_{2}:=-2\frac{\ell_{21}}{p_{2}}\bm{u}\bm{u}^{\top}, (74)
Υ1:=211p1(𝒖𝒛+𝒛𝒖),Υ2:=221p2(𝒖𝒘+𝒘𝒖),\Upsilon_{1}:=-2\frac{\ell_{11}}{p_{1}}(\bm{u}\bm{z}^{\top}+\bm{z}\bm{u}^{\top}),\ \Upsilon_{2}:=-2\frac{\ell_{21}}{p_{2}}(\bm{u}\bm{w}^{\top}+\bm{w}\bm{u}^{\top}), (75)
𝐓1=211p1𝐙𝐙+ς1𝐈and𝐓2=221p2𝐖𝐖+ς2𝐈.\mathbf{T}_{1}=-2\frac{\ell_{11}}{p_{1}}\mathbf{Z}^{\top}\mathbf{Z}+\varsigma_{1}\mathbf{I}\ \ \mbox{and}\ \ \mathbf{T}_{2}=-2\frac{\ell_{21}}{p_{2}}\mathbf{W}^{\top}\mathbf{W}+\varsigma_{2}\mathbf{I}.

Note that

𝐏k=𝐓k+Δk+Υk,k=1,2.\mathbf{P}_{k}=\mathbf{T}_{k}+\Delta_{k}+\Upsilon_{k},\ k=1,2. (76)

Moreover, Δk\Delta_{k} are rank-one matrices and by (52),

Δk=O(nζk),Υk=O(nζk12).\Delta_{k}=\mathrm{O}_{\prec}\left(n^{\zeta_{k}}\right),\ \Upsilon_{k}=\mathrm{O}_{\prec}\left(n^{\frac{\zeta_{k}-1}{2}}\right). (77)

In light of (76), we can write

𝐏1𝐏2=k=12(𝐓k+Δk+Υk).\displaystyle\mathbf{P}_{1}\mathbf{P}_{2}=\prod_{k=1}^{2}(\mathbf{T}_{k}+\Delta_{k}+\Upsilon_{k}). (78)

We can further write

𝐏1𝐏2=𝐓1𝐓2+𝐑1+𝐑2,\displaystyle\mathbf{P}_{1}\mathbf{P}_{2}=\mathbf{T}_{1}\mathbf{T}_{2}+\mathbf{R}_{1}+\mathbf{R}_{2}, (79)

where

𝐑1:=𝐓1Δ2+Δ1𝐏2and𝐑2:=𝐓1Υ2+Υ1𝐏2.\mathbf{R}_{1}:=\mathbf{T}_{1}\Delta_{2}+\Delta_{1}\mathbf{P}_{2}\ \ \mbox{and}\ \ \mathbf{R}_{2}:=\mathbf{T}_{1}\Upsilon_{2}+\Upsilon_{1}\mathbf{P}_{2}.

On one hand, it is easy to see that rank(𝐑1)2.\operatorname{rank}(\mathbf{R}_{1})\leq 2. On the other hand, by (77), (53) and Lemma 5.5, using the assumption that ζ1ζ2,\zeta_{1}\geq\zeta_{2}, we obtain that

𝐑2=O(nζ112).\mathbf{R}_{2}=\mathrm{O}_{\prec}(n^{\frac{\zeta_{1}-1}{2}}). (80)

Denote the spectral decompositions of 𝐓1\mathbf{T}_{1} and 𝐓2\mathbf{T}_{2} as

𝐓1=𝐔1𝚲1𝐔1,𝐓2=𝐔2𝚲2𝐔2.\mathbf{T}_{1}=\mathbf{U}_{1}\bm{\Lambda}_{1}\mathbf{U}_{1}^{\top},\ \mathbf{T}_{2}=\mathbf{U}_{2}\bm{\Lambda}_{2}\mathbf{U}_{2}^{\top}. (81)

Let {ak}\{a_{k}\} and {bk}\{b_{k}\} be the quantiles of ν1\nu_{1} and ν2,\nu_{2}, respectively as constructed via (20). Let {λi}\{\lambda_{i}\} be the eigenvalues of 𝐓1\mathbf{T}_{1}. For some small ϵ>0,\epsilon>0, we denote an event

Ξ:={supi1|λiai|i~1/3n2/3+ϵ},i~:=min{(p11)n+1i,i}.\Xi:=\left\{\sup_{i\geq 1}|\lambda_{i}-a_{i}|\leq\widetilde{i}^{-1/3}n^{-2/3+\epsilon}\right\},\ \widetilde{i}:=\min\{(p_{1}-1)\wedge n+1-i,i\}. (82)

Since 𝐖\mathbf{W} is a Gaussian random matrix, we have that 𝐔2\mathbf{U}_{2} is a Haar orthogonal random matrix. Since 𝐙\mathbf{Z} and 𝐖\mathbf{W} are independent, we have that 𝐔:=𝐔2𝐔1\mathbf{U}:=\mathbf{U}_{2}^{\top}\mathbf{U}_{1} is also a Haar orthogonal random matrix when 𝐔1\mathbf{U}_{1} is fixed. Since Lemma 5.5 implies that Ξ\Xi is a high probability event, in what follows we restrict our discussion to the high probability event Ξ\Xi and treat 𝐔1\mathbf{U}_{1} as a deterministic orthonormal matrix.

On one hand, 𝐓1𝐓2\mathbf{T}_{1}\mathbf{T}_{2} has the same eigenvalues as 𝚲2𝐔𝚲1𝐔\bm{\Lambda}_{2}\mathbf{U}\bm{\Lambda}_{1}\mathbf{U}^{\top} by construction. On the other hand, by Lemma 5.5, we have that for 𝐇:=Σ2𝐔Σ1𝐔\mathbf{H}:=\Sigma_{2}\mathbf{U}\Sigma_{1}\mathbf{U}^{\top}

𝚲2𝐔𝚲1𝐔𝐇n2/3,\left\|\bm{\Lambda}_{2}\mathbf{U}\bm{\Lambda}_{1}\mathbf{U}^{\top}-\mathbf{H}\right\|\prec n^{-2/3},

where Σ1\Sigma_{1} and Σ2\Sigma_{2} are diagonal matrices containing {ak}\{a_{k}\} and {bk},\{b_{k}\}, respectively. Note that the rigidity of the eigenvalues of 𝐇\mathbf{H} has been studied in [5] under the Gaussian assumption and is summarized in Lemma 2.4. Consequently, we conclude that for i1i\geq 1

|λi(𝐓1𝐓2)γν1ν2(i)|n2/3.\left|\lambda_{i}(\mathbf{T}_{1}\mathbf{T}_{2})-\gamma_{\nu_{1}\boxtimes\nu_{2}}(i)\right|\prec n^{-2/3}. (83)

Note that

n2𝐍=\displaystyle n^{2}\mathbf{N}= n2𝐃11(𝐖1Sh1)(𝐖2Sh2)𝐃21\displaystyle n^{2}\mathbf{D}_{1}^{-1}(\mathbf{W}_{1}-\mathrm{Sh}_{1})(\mathbf{W}_{2}-\mathrm{Sh}_{2})\mathbf{D}_{2}^{-1} (84)
+n2𝐃11(𝐖1Sh2+Sh1𝐖2Sh1Sh2)𝐃21.\displaystyle+n^{2}\mathbf{D}_{1}^{-1}\left(\mathbf{W}_{1}\mathrm{Sh}_{2}+\mathrm{Sh}_{1}\mathbf{W}_{2}-\mathrm{Sh}_{1}\mathrm{Sh}_{2}\right)\mathbf{D}_{2}^{-1}.

We then analyze the rank of 𝐖1Sh2+Sh1𝐖2Sh1Sh2.\mathbf{W}_{1}\mathrm{Sh}_{2}+\mathrm{Sh}_{1}\mathbf{W}_{2}-\mathrm{Sh}_{1}\mathrm{Sh}_{2}. Recall that for any compatible matrices AA and B,B, we have that

rank(AB)min{rank(A),rank(B)}.\operatorname{rank}(AB)\leq\min\{\operatorname{rank}(A),\operatorname{rank}(B)\}.

Since rank(Sh1)3\operatorname{rank}(\mathrm{Sh}_{1})\leq 3 and rank(Sh2)3,\operatorname{rank}(\mathrm{Sh}_{2})\leq 3, and since 𝐖1Sh2+Sh1𝐖2Sh1Sh2=𝐖1Sh2+Sh1(𝐖2Sh2),\mathbf{W}_{1}\mathrm{Sh}_{2}+\mathrm{Sh}_{1}\mathbf{W}_{2}-\mathrm{Sh}_{1}\mathrm{Sh}_{2}=\mathbf{W}_{1}\mathrm{Sh}_{2}+\mathrm{Sh}_{1}(\mathbf{W}_{2}-\mathrm{Sh}_{2}), we conclude that

rank(𝐖1Sh2+Sh1𝐖2Sh1Sh2)6.\operatorname{rank}(\mathbf{W}_{1}\mathrm{Sh}_{2}+\mathrm{Sh}_{1}\mathbf{W}_{2}-\mathrm{Sh}_{1}\mathrm{Sh}_{2})\leq 6.

Consequently, we have that

rank(𝐑)6,where𝐑:=n2𝐃11(𝐖1Sh2+Sh1𝐖2Sh1Sh2)𝐃21.\operatorname{rank}\left(\mathbf{R}\right)\leq 6,\ \mbox{where}\ \ \mathbf{R}:=n^{2}\mathbf{D}_{1}^{-1}\left(\mathbf{W}_{1}\mathrm{Sh}_{2}+\mathrm{Sh}_{1}\mathbf{W}_{2}-\mathrm{Sh}_{1}\mathrm{Sh}_{2}\right)\mathbf{D}_{2}^{-1}. (85)

By (78), (79), (84) and (85), utilizing (70), we obtain that

n2𝐍=exp(4υ)𝐓1𝐓2+n2𝐃11𝐑1𝐃21+𝐑+O(nζ112),n^{2}\mathbf{N}=\exp(4\upsilon)\mathbf{T}_{1}\mathbf{T}_{2}+n^{2}\mathbf{D}_{1}^{-1}\mathbf{R}_{1}\mathbf{D}_{2}^{-1}+\mathbf{R}+\mathrm{O}_{\prec}(n^{\frac{\zeta_{1}-1}{2}}), (86)

where we used 𝐓11,𝐓21.\|\mathbf{T}_{1}\|\prec 1,\|\mathbf{T}_{2}\|\prec 1. Since rank(𝐑)6\operatorname{rank}(\mathbf{R})\leq 6, rank(𝐑1)2,\operatorname{rank}(\mathbf{R}_{1})\leq 2, by (83), we have finished our proof for case (1).
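The two mechanisms used in case (1) can be observed numerically: the eigenvalues of the product of two independent noise Gram matrices follow the free multiplicative convolution of two Marchenko–Pastur laws (here checked through its first two moments, dropping the 2f(τk)-2f^{\prime}(\tau_{k}) scalings and ςk\varsigma_{k} shifts for simplicity), while adding rank-one spikes along a common direction, as the Δk\Delta_{k} terms in (74), creates at most finitely many outliers and leaves the bulk essentially unmoved. The second-moment value 1+c1+c21+c_{1}+c_{2} is the standard formula for the product of two free Marchenko–Pastur elements and is stated only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p1, p2 = 500, 1000, 750
c1, c2 = n / p1, n / p2
Z = rng.standard_normal((p1, n)); W = rng.standard_normal((p2, n))
T1, T2 = Z.T @ Z / p1, W.T @ W / p2                  # independent noise Gram matrices
M = T1 @ T2

# first two moments of MP(c1) boxtimes MP(c2): 1 and 1 + c1 + c2
print("m1: empirical %.3f   free convolution %.3f" % (np.trace(M) / n, 1.0))
print("m2: empirical %.3f   free convolution %.3f" % (np.trace(M @ M) / n, 1.0 + c1 + c2))

# rank-one spikes along a common direction u: finitely many outliers, unchanged bulk
u = rng.standard_normal(n)
P1 = T1 + 10.0 * np.outer(u, u) / n
P2 = T2 + 10.0 * np.outer(u, u) / n
lam_clean = np.sort(np.linalg.eigvals(M).real)[::-1]
lam_pert  = np.sort(np.linalg.eigvals(P1 @ P2).real)[::-1]
for j in [0, 1, 2, n // 4, n // 2, 3 * n // 4]:
    print("j = %3d   clean %8.4f   spiked %8.4f" % (j + 1, lam_clean[j], lam_pert[j]))
```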

Case (2). 0ζ2<0.5ζ1<10\leq\zeta_{2}<0.5\leq\zeta_{1}<1. Recall (76). In this case, according to Lemma 5.3, we require a higher order expansion of 𝐖1\mathbf{W}_{1} up to degree 𝔡1\mathfrak{d}_{1} in (22). Recall (24) and (78). By Lemma 5.3, we have that

𝐖1Sh𝖽Sh1=𝐏1+O(n𝖾1),\mathbf{W}_{1}-\mathrm{Sh}_{\mathsf{d}}-\mathrm{Sh}_{1}=\mathbf{P}_{1}+\mathrm{O}_{\prec}\left(n^{\mathsf{e}_{1}}\right), (87)

where Sh1\mathrm{Sh}_{1} is defined in (71) and Sh𝖽\mathrm{Sh}_{\mathsf{d}} is defined in (54) and satisfies rank(Sh𝖽)𝗌1, 4𝗌1C4𝔡1.\operatorname{rank}(\mathrm{Sh}_{\mathsf{d}})\leq\mathsf{s}_{1},\ 4\leq\mathsf{s}_{1}\leq C4^{\mathfrak{d}_{1}}. Using a decomposition similar to (84) together with (87), by (77) and the assumption ζ1ζ2\zeta_{1}\geq\zeta_{2}, we obtain that

n2𝐍=\displaystyle n^{2}\mathbf{N}= n2𝐃11(𝐓1+Δ1+Υ1+O(n𝖾1))(𝐓2+Δ2+Υ2+O(n1/2))𝐃21\displaystyle\,n^{2}\mathbf{D}_{1}^{-1}\left(\mathbf{T}_{1}+\Delta_{1}+\Upsilon_{1}+\mathrm{O}_{\prec}(n^{\mathsf{e}_{1}})\right)(\mathbf{T}_{2}+\Delta_{2}+\Upsilon_{2}+\mathrm{O}_{\prec}(n^{-1/2}))\mathbf{D}_{2}^{-1}
+n2𝐃11((𝐖1Sh1Sh𝖽)Sh2+(Sh1+Sh𝖽)𝐖2)𝐃21\displaystyle+n^{2}\mathbf{D}_{1}^{-1}\left((\mathbf{W}_{1}-\mathrm{Sh}_{1}-\mathrm{Sh}_{\mathsf{d}})\mathrm{Sh}_{2}+(\mathrm{Sh}_{1}+\mathrm{Sh}_{\mathsf{d}})\mathbf{W}_{2}\right)\mathbf{D}_{2}^{-1}
=\displaystyle= n2𝐃11(𝐓1+O(n𝖾1))(𝐓2+O(n1/2))𝐃21\displaystyle\,n^{2}\mathbf{D}_{1}^{-1}\left(\mathbf{T}_{1}+\mathrm{O}_{\prec}(n^{\mathsf{e}_{1}})\right)(\mathbf{T}_{2}+\mathrm{O}_{\prec}(n^{-1/2}))\mathbf{D}_{2}^{-1} (88)
+n2𝐃11Δ1(𝐏2+O(n1/2))𝐃21\displaystyle+n^{2}\mathbf{D}_{1}^{-1}\Delta_{1}(\mathbf{P}_{2}+\mathrm{O}_{\prec}(n^{-1/2}))\mathbf{D}_{2}^{-1}
+n2𝐃11𝐓1(Δ2+O(n1/2))𝐃21\displaystyle+n^{2}\mathbf{D}_{1}^{-1}\mathbf{T}_{1}(\Delta_{2}+\mathrm{O}_{\prec}(n^{-1/2}))\mathbf{D}_{2}^{-1}
+n2𝐃11((𝐖1Sh1Sh𝖽)Sh2+(Sh1+Sh𝖽)𝐖2)𝐃21+O(nζ112).\displaystyle+n^{2}\mathbf{D}_{1}^{-1}\left((\mathbf{W}_{1}-\mathrm{Sh}_{1}-\mathrm{Sh}_{\mathsf{d}})\mathrm{Sh}_{2}+(\mathrm{Sh}_{1}+\mathrm{Sh}_{\mathsf{d}})\mathbf{W}_{2}\right)\mathbf{D}_{2}^{-1}+\mathrm{O}_{\prec}(n^{\frac{\zeta_{1}-1}{2}}).

It is easy to see that the rank of the sum of the second to fourth terms of (88) is bounded by 𝗌1+8.\mathsf{s}_{1}+8. On the other hand, by (59), the first inequality of (70) should be replaced by

n(𝐃1)1exp(2υ)𝐈=O(nζ11).\left\|n(\mathbf{D}_{1})^{-1}-\exp(2\upsilon)\mathbf{I}\right\|=\mathrm{O}_{\prec}(n^{\zeta_{1}-1}). (89)

The rest of the discussion follows from case (1). This completes our proof for case (2).

Case (3). 0.5ζ2ζ1<10.5\leq\zeta_{2}\leq\zeta_{1}<1. The discussion is similar to case (2) except that we also need to conduct a high order expansion for 𝐖2.\mathbf{W}_{2}. Similar to (87), by Lemma 5.3, we have that

𝐖2Sh~𝖽Sh2=𝐏2+O(n𝖾2).\mathbf{W}_{2}-\widetilde{\mathrm{Sh}}_{\mathsf{d}}-\mathrm{Sh}_{2}=\mathbf{P}_{2}+\mathrm{O}_{\prec}\left(n^{\mathsf{e}_{2}}\right). (90)

By a decomposition similar to (88), together with (90) and (77), we have that

n2𝐍=\displaystyle n^{2}\mathbf{N}= n2𝐃11(𝐓1+O(n𝖾1))(𝐓2+O(n𝖾2))𝐃21\displaystyle\,n^{2}\mathbf{D}_{1}^{-1}\left(\mathbf{T}_{1}+\mathrm{O}_{\prec}(n^{\mathsf{e}_{1}})\right)\left(\mathbf{T}_{2}+\mathrm{O}_{\prec}(n^{\mathsf{e}_{2}})\right)\mathbf{D}_{2}^{-1}
+n2𝐃11((Δ1+Sh𝖽+Sh1)𝐖2+(𝐖1𝐏1Sh𝖽Sh1)(Sh~𝖽+Δ2+Sh2))𝐃21\displaystyle+n^{2}\mathbf{D}_{1}^{-1}\left((\Delta_{1}+\mathrm{Sh}_{\mathsf{d}}+\mathrm{Sh}_{1})\mathbf{W}_{2}+(\mathbf{W}_{1}-\mathbf{P}_{1}-\mathrm{Sh}_{\mathsf{d}}-\mathrm{Sh}_{1})(\widetilde{\mathrm{Sh}}_{\mathsf{d}}+\Delta_{2}+\mathrm{Sh}_{2})\right)\mathbf{D}_{2}^{-1}
+O(nζ112).\displaystyle+\mathrm{O}_{\prec}(n^{\frac{\zeta_{1}-1}{2}}).

On one hand, the rank of the second term on the right-hand side of the above equation can be bounded by 𝗌1+𝗌2+8.\mathsf{s}_{1}+\mathsf{s}_{2}+8. On the other hand, the first term can again be analyzed in the same way as in the passage from (82) to (83), using Lemma 2.4. Finally, by (59), similarly to (89), we have that

n(𝐃2)1exp(2υ)𝐈=O(nζ21).\left\|n(\mathbf{D}_{2})^{-1}-\exp(2\upsilon)\mathbf{I}\right\|=\mathrm{O}_{\prec}(n^{\zeta_{2}-1}). (91)

The rest of the proof follows from the discussion of case (1), using the fact that ζ2ζ1.\zeta_{2}\leq\zeta_{1}. This completes the proof of case (3).

5.3. Proof of Theorem 3.2

We now prove Theorem 3.2 when 0ζ2<1ζ1<0\leq\zeta_{2}<1\leq\zeta_{1}<\infty.

Case (1). 0ζ2<0.50\leq\zeta_{2}<0.5. Decompose 𝐍\mathbf{N} by

n𝐍=𝐀1n(𝐖2Sh2)𝐃21+n𝐀1Sh2𝐃21.\displaystyle n\mathbf{N}=\mathbf{A}_{1}n(\mathbf{W}_{2}-\mathrm{Sh}_{2})\mathbf{D}_{2}^{-1}+n\mathbf{A}_{1}\mathrm{Sh}_{2}\mathbf{D}_{2}^{-1}. (92)

First, we have that rank(n𝐀1Sh2𝐃21)3.\operatorname{rank}(n\mathbf{A}_{1}\mathrm{Sh}_{2}\mathbf{D}_{2}^{-1})\leq 3. Moreover, using the decomposition (76), similar to (86), by (77), we can further write that

n𝐍=n𝐀1𝐓2𝐃21+n𝐀1Sh2𝐃21+n𝐀1Δ2𝐃21+O(nζ212).n\mathbf{N}=n\mathbf{A}_{1}\mathbf{T}_{2}\mathbf{D}_{2}^{-1}+n\mathbf{A}_{1}\mathrm{Sh}_{2}\mathbf{D}_{2}^{-1}+n\mathbf{A}_{1}\Delta_{2}\mathbf{D}_{2}^{-1}+\mathrm{O}_{\prec}(n^{\frac{\zeta_{2}-1}{2}}). (93)

Second, by Lemma 5.4, we have that

𝐀1𝐀~1,sn1/2,𝐀1𝐀~1,cnζ1/2+n3/2.\left\|\mathbf{A}_{1}-\widetilde{\mathbf{A}}_{1,s}\right\|\prec n^{-1/2},\ \ \left\|\mathbf{A}_{1}-\widetilde{\mathbf{A}}_{1,c}\right\|\prec n^{-\zeta_{1}/2}+n^{-3/2}. (94)

Together with (70) and the fact that 𝐓2=O(1)\|\mathbf{T}_{2}\|=\mathrm{O}_{\prec}(1), using the definition (31), we have that

𝐀1n𝐓2𝐃21𝐍~n1/2.\left\|\mathbf{A}_{1}n\mathbf{T}_{2}\mathbf{D}_{2}^{-1}-\widetilde{\mathbf{N}}\right\|\prec n^{-1/2}. (95)

We can therefore conclude the proof using (93).

Next, when ζ1\zeta_{1} is larger in the sense of (34), by Lemma 5.4, we find that with probability at least 1O(n1δ(ζ11)/2)1-\mathrm{O}\left(n^{1-\delta(\zeta_{1}-1)/2}\right), for some constant C>0,C>0,

𝐀1𝐈Cnexp(υ(σ12/n)1δ).\left\|\mathbf{A}_{1}-\mathbf{I}\right\|\leq Cn\exp\left(-\upsilon(\sigma_{1}^{2}/n)^{1-\delta}\right). (96)

Consequently, we have

n𝐍n𝐀2𝐀1𝐈n𝐀2n2exp(υ(σ12/n)1δ),\left\|n\mathbf{N}-n\mathbf{A}_{2}\right\|\leq\left\|\mathbf{A}_{1}-\mathbf{I}\right\|\left\|n\mathbf{A}_{2}\right\|\leq n^{2}\exp\left(-\upsilon(\sigma_{1}^{2}/n)^{1-\delta}\right), (97)

where in the second inequality we use the fact that 𝐀21\|\mathbf{A}_{2}\|\prec 1 since λ1(𝐀2)=1\lambda_{1}(\mathbf{A}_{2})=1. By a result analogous to (57) for 𝐀2\mathbf{A}_{2}, we have that for i>3,i>3,

|λi(n𝐀2)exp(2υ)γν2(i)|n1/2.\left|\lambda_{i}(n\mathbf{A}_{2})-\exp(2\upsilon)\gamma_{\nu_{2}}(i)\right|\prec n^{-1/2}. (98)

Together with (97), we conclude our proof.

Case (2). 0.5ζ2<10.5\leq\zeta_{2}<1. The discussion is similar to case (1) except that we need to conduct a high order expansion for 𝐀2.\mathbf{A}_{2}. Note that

n𝐍=𝐀1n(𝐖2Sh2Sh~𝖽)𝐃21+n𝐀1(Sh2+Sh~𝖽)𝐃21.\displaystyle n\mathbf{N}=\mathbf{A}_{1}n(\mathbf{W}_{2}-\mathrm{Sh}_{2}-\widetilde{\mathrm{Sh}}_{\mathsf{d}})\mathbf{D}_{2}^{-1}+n\mathbf{A}_{1}(\mathrm{Sh}_{2}+\widetilde{\mathrm{Sh}}_{\mathsf{d}})\mathbf{D}_{2}^{-1}.

By (77), (90) and (91), we have that

n𝐍=exp(2υ)𝐀1𝐓2+n𝐀1Δ2𝐃21+O(nζ212+n𝖾2)+n𝐀1(Sh2+Sh~𝖽)𝐃21.\displaystyle n\mathbf{N}=\exp(2\upsilon)\mathbf{A}_{1}\mathbf{T}_{2}+n\mathbf{A}_{1}\Delta_{2}\mathbf{D}_{2}^{-1}+\mathrm{O}_{\prec}(n^{\frac{\zeta_{2}-1}{2}}+n^{\mathsf{e}_{2}})+n\mathbf{A}_{1}(\mathrm{Sh}_{2}+\widetilde{\mathrm{Sh}}_{\mathsf{d}})\mathbf{D}_{2}^{-1}.

We can therefore conclude by (94) and a discussion similar to (95). Finally, when ζ1\zeta_{1} is larger, the claim follows from a discussion similar to (98), together with (97), Lemma 5.4 and (91). This completes the proof.

5.4. Proof of Theorem 3.3

We now prove Theorem 3.3 when ζ1ζ21.\zeta_{1}\geq\zeta_{2}\geq 1. For part (1), (37) follows from (61) and an analogous result for 𝐀2\mathbf{A}_{2} that

𝐀2𝐀~2,sn1/2,\left\|\mathbf{A}_{2}-\widetilde{\mathbf{A}}_{2,s}\right\|\prec n^{-1/2}, (99)

as well as the facts that 𝐀1=O(1),𝐀2=O(1).\|\mathbf{A}_{1}\|=\mathrm{O}_{\prec}(1),\|\mathbf{A}_{2}\|=\mathrm{O}_{\prec}(1). Second, (38) follows from (62), (2) of Lemma 5.1 and the assumption that ζ2ζ1.\zeta_{2}\leq\zeta_{1}. For parts (2) and (3), the proof follows from (1) of Lemma 5.4 and its counterpart for 𝐀2.\mathbf{A}_{2}.

5.5. Proof of Theorem 4.1

To prove the results of Theorem 4.1, we first study the adaptive bandwidths h1h_{1} and h2h_{2}. When 0ζ2<1,0\leq\zeta_{2}<1, by Lemma 5.2 on sub-Gaussian random vectors, we have that for ij,i\neq j,

𝐲i𝐲j2=2(p2+σ22)+O(p2ζ2+p2).\|\mathbf{y}_{i}-\mathbf{y}_{j}\|^{2}=2(p_{2}+\sigma_{2}^{2})+\mathrm{O}_{\prec}(p_{2}^{\zeta_{2}}+\sqrt{p_{2}}).

Since ζ2<1,\zeta_{2}<1, 𝐲i𝐲j22\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2} are concentrated around 2p22p_{2}. Then for any ω(0,1)\omega\in(0,1) and h2h_{2} chosen according to (43), we have that

h2p2.h_{2}\asymp p_{2}. (100)

Similarly, when 0ζ1<1,0\leq\zeta_{1}<1, we have that h1p1.h_{1}\asymp p_{1}. Now we can prove part (1) when 0ζ1<10\leq\zeta_{1}<1. Denote

f(𝐱i𝐱j22h1)=g1(𝐱i𝐱j22p1),f(x)=exp(υx),f\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{h_{1}}\right)=g_{1}\left(\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{p_{1}}\right),\ f(x)=\exp(-\upsilon x),

where gk(x):=f(pkx/hk),k=1,2.g_{k}(x):=f(p_{k}x/h_{k}),\ k=1,2. Since pkhk1\frac{p_{k}}{h_{k}}\asymp 1, we can apply the proof of Theorem 3.1 to the kernel functions gk(x),k=1,2.g_{k}(x),k=1,2. The only difference is that the constant υ\upsilon is now replaced by υpk/hk.\upsilon p_{k}/h_{k}. When ζ11,\zeta_{1}\geq 1, the modification is similar except that we also need to use (2) of Lemma 5.4.

The other two cases can be obtained similarly by recalling the following fact. For any ζ11,\zeta_{1}\geq 1, if h1h_{1} is the bandwidth selected using (43), then for some constants C1,C2>0,C_{1},C_{2}>0, with high probability

C1(σ12log1n+p1)h1C2σ12log2n.C_{1}(\sigma_{1}^{2}\log^{-1}n+p_{1})\leq h_{1}\leq C_{2}\sigma_{1}^{2}\log^{2}n. (101)

Also, note that (2) of Lemma 5.4 holds. See Corollary 3.2 of [6] for the proof. With this fact, for part (2), (49) follows from (64) and its counterpart for 𝐀2\mathbf{A}_{2} and the fact 𝐀11,𝐀21;\|\mathbf{A}_{1}\|\prec 1,\|\mathbf{A}_{2}\|\prec 1; (50) follows from (65) and its counterpart for 𝐀2\mathbf{A}_{2} and (2) of Lemma 5.1; (51) follows from (66) and its counterpart for 𝐀2\mathbf{A}_{2} and the assumption ζ1ζ2.\zeta_{1}\geq\zeta_{2}.
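The two regimes (100) and (101) for the adaptive bandwidth can also be seen in simulation: with the quantile implementation of (43) used earlier and the illustrative parametrization σ12=nζ1\sigma_{1}^{2}=n^{\zeta_{1}} with a circle as the common manifold, the selected bandwidth is of order p1p_{1} when ζ1<1\zeta_{1}<1 and of order σ12\sigma_{1}^{2} when ζ1>1\zeta_{1}>1.

```python
import numpy as np

def percentile_bandwidth(X, omega=0.5):
    # omega-quantile of the off-diagonal pairwise squared distances of the columns of X
    sq = np.sum(X ** 2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    n = X.shape[1]
    return np.quantile(D2[~np.eye(n, dtype=bool)], omega)

rng = np.random.default_rng(6)
p = n = 500
u = rng.uniform(0, 2 * np.pi, n)
direction = np.vstack([np.cos(u), np.sin(u), np.zeros((p - 2, n))])

for zeta in [0.5, 1.5]:                              # low-SNR versus high-SNR sensor
    sigma2 = n ** zeta
    X = np.sqrt(sigma2) * direction + rng.standard_normal((p, n))
    h = percentile_bandwidth(X, omega=0.5)
    print("zeta = %.1f   h = %9.1f   2*p = %d   sigma^2 = %9.1f" % (zeta, h, 2 * p, sigma2))
```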

References

  • [1] Z. Bao, J. Hu, G. Pan, and W. Zhou. Canonical correlation coefficients of high-dimensional Gaussian vectors: Finite rank case. The Annals of Statistics, 47(1):612 – 640, 2019.
  • [2] M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289–1308, 2008.
  • [3] C. Bordenave. On Euclidean random matrices in high dimension. Electron. Commun. Probab., 18:no. 25, 8, 2013.
  • [4] X. Cheng and A. Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 02(04):1350010, 2013.
  • [5] X. Ding and H. C. Ji. Local laws for multiplication of random matrices and spiked invariant model. arXiv preprint arXiv 2010.16083, 2020.
  • [6] X. Ding and H.-T. Wu. Impact of signal-to-noise ratio and bandwidth on graph Laplacian spectrum from high-dimensional noisy point cloud. arXiv preprint arXiv 2011.10725, 2020.
  • [7] X. Ding and H. T. Wu. On the spectral property of kernel-based sensor fusion algorithms of high dimensional data. IEEE Transactions on Information Theory, 67(1):640–670, 2021.
  • [8] Y. Do and V. Vu. The spectrum of random kernel matrices: Universality results for rough and varying kernels. Random Matrices: Theory and Applications, 02(03):1350005, 2013.
  • [9] D. Dov, R. Talmon, and I. Cohen. Kernel-based sensor fusion with application to audio-visual voice activity detection. IEEE Transactions on Signal Processing, 64(24):6406–6416, 2016.
  • [10] D. Dov, R. Talmon, and I. Cohen. Sequential audio-visual correspondence with alternating diffusion kernels. IEEE Transactions on Signal Processing, 66(12):3100–3111, 2018.
  • [11] D. B. Dunson, H.-T. Wu, and N. Wu. Diffusion based Gaussian process regression via heat kernel reconstruction. Applied and Computational Harmonic Analysis, 2021.
  • [12] N. El Karoui. On information plus noise kernel random matrices. The Annals of Statistics, 38(5):3191 – 3216, 2010.
  • [13] N. El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1 – 50, 2010.
  • [14] N. El Karoui and H.-T. Wu. Graph connection Laplacian and random matrices with random blocks. Information and Inference: A Journal of the IMA, 4(1):1–44, 2015.
  • [15] N. El Karoui and H.-T. Wu. Graph connection Laplacian methods can be made robust to noise. The Annals of Statistics, 44(1):346 – 372, 2016.
  • [16] L. Erdős and H. Yau. A Dynamical Approach to Random Matrix Theory. Courant Lecture Notes. American Mathematical Society, 2017.
  • [17] Z. Fan and A. Montanari. The spectral norm of random inner-product kernel matrices. Probab. Theory Related Fields, 173(1-2):27–85, 2019.
  • [18] N. García Trillos, M. Gerlach, M. Hein, and D. Slepcev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace-Beltrami operator. Found. Comput. Math., 20(4):827–887, 2020.
  • [19] F. Gustafsson. Statistical Sensor Fusion. Professional Publishing House, 2012.
  • [20] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
  • [21] M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds – weak and strong pointwise consistency of graph Laplacians. In P. Auer and R. Meir, editors, Learning Theory, pages 470–485, 2005.
  • [22] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. J. Mach. Learn. Res., 8:1325–1368, 2007.
  • [23] P. Horst. Relations among m sets of measures. Psychometrika, 26(2):129–149, 1961.
  • [24] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
  • [25] H. Hwang, K. Jung, Y. Takane, and T. S. Woodward. A unified approach to multiple-set canonical correlation analysis and principal components analysis. British Journal of Mathematical and Statistical Psychology, 66(2):308–321, 2013.
  • [26] H. C. Ji. Regularity Properties of Free Multiplicative Convolution on the Positive Line. International Mathematics Research Notices, 07 2020. rnaa152.
  • [27] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295–327, 2001.
  • [28] S. P. Kasiviswanathan and M. Rudelson. Spectral norm of random kernel matrices with applications to privacy. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2015, August 24-26, 2015, Princeton, NJ, USA, volume 40 of LIPIcs, pages 898–914. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.
  • [29] A. Knowles and J. Yin. Anisotropic local laws for random matrices. Probab. Theory Related Fields, 169(1-2):257–352, 2017.
  • [30] D. Lahat, T. Adali, and C. Jutten. Multimodal data fusion: An overview of methods, challenges, and prospects. Proceedings of the IEEE, 103(9):1449–1477, 2015.
  • [31] R. R. Lederman and R. Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018.
  • [32] R. R. Lederman and R. Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018.
  • [33] O. Lindenbaum, Y. Bregman, N. Rabin, and A. Averbuch. Multiview kernels for low-dimensional modeling of seismic events. IEEE Transactions on Geoscience and Remote Sensing, 56(6):3300–3310, 2018.
  • [34] O. Lindenbaum, A. Yeredor, M. Salhov, and A. Averbuch. Multi-view diffusion maps. Information Fusion, 55:127–149, 2020.
  • [35] G.-R. Liu, Y.-L. Lo, J. Malik, Y.-C. Sheu, and H.-T. Wu. Diffuse to fuse EEG spectra – intrinsic geometry of sleep dynamics for classification. Biomedical Signal Processing and Control, 55:101576, 2020.
  • [36] Z. Ma and F. Yang. Sample canonical correlation coefficients of high-dimensional random vectors with finite rank correlations. arXiv preprint arXiv 2102.03297, 2021.
  • [37] V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967.
  • [38] N. F. Marshall and M. J. Hirn. Time coupled diffusion maps. Applied and Computational Harmonic Analysis, 45(3):709–728, 2018.
  • [39] T. Michaeli, W. Wang, and K. Livescu. Nonparametric canonical correlation analysis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, page 1967–1976, 2016.
  • [40] N. S. Pillai and J. Yin. Universality of covariance matrices. The Annals of Applied Probability, 24(3):935 – 1001, 2014.
  • [41] M. A. Samuels. The brain–heart connection. Circulation, 116(1):77–84, 2007.
  • [42] T. Shnitzer, M. Ben-Chen, L. Guibas, R. Talmon, and H.-T. Wu. Recovering hidden components in multimodal data with composite diffusion operators. SIAM J. Math. Data Sci., 1(3):588–616, 2019.
  • [43] T. Shnitzer, M. Ben-Chen, L. Guibas, R. Talmon, and H.-T. Wu. Recovering hidden components in multimodal data with composite diffusion operators. SIAM Journal on Mathematics of Data Science, 1(3):588–616, 2019.
  • [44] A. Singer. From graph to manifold Laplacian: The convergence rate. Applied and Computational Harmonic Analysis, 21(1):128–134, 2006.
  • [45] R. Talmon and H.-T. Wu. Latent common manifold learning with alternating diffusion: Analysis and applications. Applied and Computational Harmonic Analysis, 47(3):848–892, 2019.
  • [46] R. Talmon and H.-T. Wu. Latent common manifold learning with alternating diffusion: analysis and applications. Applied and Computational Harmonic Analysis, 47(3):848–892, 2019.
  • [47] D. Voiculescu. Multiplication of certain non-commuting random variables. Journal of Operator Theory, 18(2):223–235, 1987.
  • [48] L. Xiao, J. M. Stephen, T. W. Wilson, V. D. Calhoun, and Y.-P. Wang. A manifold regularized multi-task learning model for IQ prediction from two fMRI paradigms. IEEE Transactions on Biomedical Engineering, 67(3):796–806, 2019.
  • [49] J. Zhao, X. Xie, X. Xu, and S. Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.
  • [50] X. Zhuang, Z. Yang, and D. Cordes. A technical review of canonical correlation analysis for neuroscience applications. Human Brain Mapping, 41(13):3807–3833, 2020.