
Distributed Learning for Principal Eigenspaces without Moment Constraints

Yong He, Zichen Liu, Yalin Wang
Institute of Financial Studies, Shandong University, China; e-mail: heyong@sdu.edu.cn, zhaochangwei@mail.sdu.edu.cn

Distributed Principal Component Analysis (PCA) has been studied to deal with the case when data are stored across multiple machines and communication cost or privacy concerns prohibit the computation of PCA in a central location. However, the sub-Gaussian assumption in the related literature is restrictive in real applications, where outliers or heavy-tailed data are common in areas such as finance and macroeconomics. In this article, we propose a distributed algorithm for estimating the principal eigenspaces without any moment constraint on the underlying distribution. We study the problem under the elliptical family framework and adopt the sample multivariate Kendall's tau matrix to extract eigenspace estimators from all submachines, which can be viewed as points on the Grassmann manifold. We then find the "center" of these points as the final distributed estimator of the principal eigenspace. We investigate the bias and variance of the distributed estimator and derive its convergence rate, which depends on the effective rank and eigengap of the scatter matrix and on the number of submachines. We show that the distributed estimator performs as if we had full access to the whole data. Simulation studies show that the distributed algorithm performs comparably with the existing one for light-tailed data, while showing a great advantage for heavy-tailed data. We also extend our algorithm to the distributed learning of elliptical factor models and verify its empirical usefulness through a real application to a macroeconomic dataset.

Keywords: Elliptical distribution; Distributed learning; Grassmann manifold; Spatial Kendall's tau matrix; Principal eigenspaces.

1 Introduction

Principal component analysis (PCA) is one of the most important statistical tools for dimension reduction, extracting latent principal factors that contribute to the most variation of the data. For the classical setting with fixed dimension, the consistency and asymptotic normality of empirical principal components have been established since Anderson (1963). In the last decade, high-dimensional PCA has gradually attracted the attention of statisticians, see for example Onatski (2012), Wang & Fan (2017), Kong et al. (2021), Bao et al. (2022). The existing work on high-dimensional PCA typically assumes a Gaussian/sub-Gaussian tail for the underlying distribution, which is an idealization of the complex random real world. Heavy-tailed data are common in research fields such as financial engineering and biomedical imaging (Jing et al., 2012; Kong et al., 2015; Fan et al., 2018; Li et al., 2022). Robust methods for principal component analysis of heavy-tailed data are therefore much needed. The elliptical family provides a proper way to capture the different tail behaviours of common distributions such as the Gaussian and $t$-distributions, and has been widely studied in various high-dimensional statistical problems, see for example Han & Liu (2014); Yu et al. (2019); Hu et al. (2019); Chen et al. (2021). Han & Liu (2018) presented a robust alternative to PCA, called Elliptical Component Analysis (ECA), for analyzing high-dimensional, elliptically distributed data, in which the multivariate Kendall's tau matrix plays a central role. In essence, for elliptical distributions, the eigenspace of the population Kendall's tau matrix coincides with that of the scatter matrix. The multivariate Kendall's tau was first introduced in Choi & Marden (1998) for testing dependence, and was later adopted for estimating covariance matrices and principal components (Taskinen et al., 2012). Properties of ECA in low dimension were considered by Hallin et al. (2010) and Hallin et al. (2014), and recent works related to high-dimensional ECA include but are not limited to Feng & Liu (2017); Chen et al. (2021); He et al. (2022). Noticeably, He et al. (2022) extended ECA to factor analysis under the framework of the elliptical factor model.

With the rapid development of information technology, modern datasets are extremely large-scale and we are now embracing the big-data era. Efficient statistical inference algorithms for such enormous datasets are unprecedentedly desirable. Distributed computing provides an effective way to deal with large-scale datasets. In addition to computational efficiency, distributed computing is also relatively robust to possible failures in the subservers. There are many other reasons for establishing rigorous statistical theories for distributed computing, such as privacy protection, data ownership and limitations of data storage. Two patterns of data segmentation have been considered in the literature over the past few years: horizontal and vertical. "Horizontal" partitioning keeps all the features on each subserver, while observations are scattered across different machines. Conversely, "vertical" partitioning means that the features are divided into several parts, so that each machine has full access to the observations but only part of the features, which is common in signal processing and sensor networks. Representative works include but are not limited to Qu et al. (2002), Zhang et al. (2012), Fan et al. (2019) and Fan et al. (2021). In particular, Fan et al. (2019) proposed a distributed PCA algorithm: each submachine first computes its top $K$ eigenvectors and transmits them to the central machine; the central machine then aggregates the information from all the submachines and conducts a PCA based on the aggregated information.

In the current work, we consider a robust alternative to the existing distributed PCA algorithm. We adopt the horizontal division and propose a distributed algorithm for estimating the principal eigenspaces without any moment constraint on the underlying distribution. In detail, assuming that we have $m$ subservers under the elliptical family framework, we first extract the leading $K$ orthonormal eigenvectors (as columns of $\widehat{\bm{V}}_{K}^{(i)}$) of the local sample multivariate Kendall's tau matrix computed from the observations on the $i$-th subserver, for $i=1,\ldots,m$. We next transport these eigenspace estimators $\{span(\widehat{\bm{V}}_{K}^{(i)})\}$ to the central server, where $span(\widehat{\bm{V}}_{K}^{(i)})$ denotes the linear space spanned by the columns of $\widehat{\bm{V}}_{K}^{(i)}$. In essence, the transported local eigenspace estimators $\{span(\widehat{\bm{V}}_{K}^{(i)})\}$ can be viewed as points (representatives of equivalence classes) on the Grassmann manifold, see Figure 1 for an illustration. We then find the "center" (the red point $span(\widetilde{\bm{V}}_{K})$ in Figure 1) of these points by minimizing the projection metric on the Grassmann manifold in the central server, which is analogous to the physical notion of a barycenter. Mathematically, the "center" $span(\widetilde{\bm{V}}_{K})$ satisfies

\widetilde{\bm{V}}_{K}=\mathop{\mathrm{arg\ min}}_{\bm{V}^{\top}\bm{V}=\mathbf{I}}\sum_{i=1}^{m}\|\bm{V}\bm{V}^{\top}-\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)\top}\|_{F}^{2},

and it can be shown that $\widetilde{\bm{V}}_{K}$ consists of the leading $K$ eigenvectors of the average projection matrix $\widetilde{\bm{\Sigma}}=\frac{1}{m}\sum_{i=1}^{m}\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)\top}$.

Figure 1: Workflow of the proposed distributed algorithm.

The contributions of the current work include the following aspects. Firstly, we propose, for the first time, a robust alternative to distributed PCA. The proposed distributed algorithm is easy to implement and avoids huge transportation costs between the central processor and the subservers. Secondly, we investigate the bias and variance of the robust distributed estimator and derive its convergence rate, which depends on the effective rank of the scatter matrix, the eigengap, and the number of submachines. We show that the distributed estimator performs as if we had full access to the whole data. Thirdly, we extend the distributed algorithm to the elliptical factor model in a horizontal-partition regime, which is, as far as we know, the first distributed algorithm for robust factor analysis.

The rest of this paper is organized as follows. Section 2 introduces notation and preliminary results on the elliptical family and the multivariate Kendall's tau matrix. In Section 3, we present the details of our robust distributed algorithm for estimating the principal eigenspace. In Section 4, we investigate the theoretical properties of the robust distributed estimator. In Section 5, we extend the distributed algorithm to distributed elliptical factor analysis. Simulation results are given in Section 6, where we also apply the distributed algorithm to analyze a large-scale real macroeconomic dataset. Further discussions and future research directions are left to Section 7.

2 Notations and Preliminaries

We first introduce some notation used throughout the article. Vectors and matrices are written in bold symbols, and scalars in regular letters. For two sequences $\{a_{n}\}_{n=1}^{\infty},\{b_{n}\}_{n=1}^{\infty}$, we write $a_{n}\lesssim b_{n}$ if there is a universal constant $C>0$ such that $a_{n}\leq Cb_{n}$, also denoted $a_{n}=O(b_{n})$, while $a_{n}=o(b_{n})$ means $a_{n}/b_{n}\rightarrow 0$ as $n\rightarrow\infty$. The index set $\{1,2,\ldots,p\}$ is simply denoted by $[p]$. For vectors, $\bm{e}_{i}$ is the unit vector with 1 in the $i$-th component, and $\|\cdot\|_{p}$ is the $\ell_{p}$-norm. For a random variable $X$, the Orlicz $\psi_{1}$ norm is $\|X\|_{\psi_{1}}=\sup_{p\geq 1}(\mathbb{E}\lvert X\rvert^{p})^{1/p}/p$. If random vectors $\bm{X},\bm{Y}$ share the same distribution, we write $\bm{X}\overset{d}{=}\bm{Y}$. For a matrix $\bm{A}\in\mathbb{R}^{p\times p}$, its transpose is $\bm{A}^{\top}$; $\lambda_{i}(\bm{A})$ is the $i$-th largest eigenvalue of $\bm{A}$; $\|\bm{A}\|_{2}$ is the spectral norm and $\|\bm{A}\|_{F}$ is the Frobenius norm. Given a fixed $K\in[p]$, the eigengap of $\bm{A}$ is $\Delta(\bm{A})=\lambda_{K}(\bm{A})-\lambda_{K+1}(\bm{A})$, and $r=r(\bm{A}):=\mathrm{Tr}(\bm{A})/\lambda_{1}(\bm{A})$ is the effective rank of $\bm{A}$. The space spanned by the columns of $\bm{A}$ is denoted by $span(\bm{A})$. For two matrices with orthonormal columns $\bm{A}\in\mathbb{R}^{p\times n_{1}},\bm{B}\in\mathbb{R}^{p\times n_{2}}$ ($n_{1},n_{2}\leq p$), the distance between $span(\bm{A})$ and $span(\bm{B})$ is $\rho(\bm{A},\bm{B})=\|\bm{A}\bm{A}^{\top}-\bm{B}\bm{B}^{\top}\|_{F}$.

Elliptical Distribution and Spatial Kendall’s Tau Matrix

The elliptical family is a large distribution family containing common distributions such as the Gaussian and $t$-distributions, and is widely used for modeling heavy-tailed data in finance and macroeconomics. To begin with, we recall its definition and some of its nice properties. For further details on the elliptical distribution, see for example Hult & Lindskog (2002).

In the following we give two equivalent definitions, one by the characteristic function and the other by a stochastic representation. A random vector $\bm{X}=(X_{1},\ldots,X_{p})^{\top}$ belongs to the elliptical family, denoted by $\bm{X}\sim ED_{p}(g;\bm{\mu},\bm{\Sigma})$, if its characteristic function has the form:

\textbf{t}\mapsto\phi_{g}(\textbf{t};\bm{\mu},\bm{\Sigma})=e^{i\textbf{t}^{\top}\bm{\mu}}g(\textbf{t}^{\top}\bm{\Sigma}\textbf{t}),

where $g(\cdot)$ is a proper function defined on $[0,\infty)$, $\bm{\mu}\in\mathbb{R}^{p}$ is the location parameter, and $\bm{\Sigma}$ is the scatter matrix with $rank(\bm{\Sigma})=q$. Equivalently,

\bm{X}\overset{d}{=}\bm{\mu}+\xi\bm{AU},

where $\bm{U}$ is a random vector uniformly distributed on the unit sphere $S^{q-1}$ in $\mathbb{R}^{q}$, $\xi\geq 0$ is a scalar random variable independent of $\bm{U}$, and $\bm{A}\in\mathbb{R}^{p\times q}$ is a deterministic matrix satisfying $\bm{A}\bm{A}^{\top}=\bm{\Sigma}$. Thus we may also write $\bm{X}\sim ED_{p}(\bm{\mu},\bm{\Sigma},\xi)$, where $\xi$ and $g(\cdot)$ are mutually determined. Elliptically distributed random vectors enjoy the same nice properties as Gaussian random vectors: a linear combination of elliptically distributed random vectors also follows an elliptical distribution, and their marginal distributions are also elliptical.

For the Gaussian distribution, the scatter matrix is the covariance matrix up to a constant. For non-Gaussian distributions, even those with infinite variance (such as the Cauchy distribution), the scatter matrix still measures the dispersion of a random vector. To estimate the principal eigenspace of elliptically distributed data, simply performing PCA on the sample covariance matrix is not satisfactory because the population covariance matrix may not even exist. To address the problem, we turn to the spatial Kendall's tau matrix. For $\bm{X}\sim ED(\bm{\mu},\bm{\Sigma},\xi)$ and its independent copy $\widetilde{\bm{X}}$, the population spatial Kendall's tau matrix of $\bm{X}$ is defined as:

\bm{K}=\mathbb{E}\left\{\frac{(\bm{X}-\widetilde{\bm{X}})(\bm{X}-\widetilde{\bm{X}})^{\top}}{\|\bm{X}-\widetilde{\bm{X}}\|_{2}^{2}}\right\}.

The spatial Kendall's tau matrix was first introduced in Choi & Marden (1998) and has been used for many statistical problems such as covariance matrix estimation and principal eigenspace estimation, see Visuri et al. (2000), Fan et al. (2018) and Han & Liu (2018). A critical property of the spatial Kendall's tau matrix is that, for an elliptical vector $\bm{X}$, it shares the same eigenvectors as the scatter matrix, with the same ordering of eigenvalues; see Han & Liu (2018) for a detailed proof. Suppose $\left\{\bm{X}_{t}\right\}_{t=1}^{n}$ are $n$ independent data points from $\bm{X}$; the sample spatial Kendall's tau matrix is naturally defined as a U-statistic:

\widehat{\bm{K}}=\frac{2}{n(n-1)}\sum_{i<j}\frac{(\bm{X}_{i}-\bm{X}_{j})(\bm{X}_{i}-\bm{X}_{j})^{\top}}{\|\bm{X}_{i}-\bm{X}_{j}\|_{2}^{2}}.
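
For concreteness, the following is a minimal numpy sketch (ours, not the authors' code) of the sample spatial Kendall's tau matrix as the U-statistic above; the double loop over pairs is O(n^2 p^2) and is meant only to make the definition explicit.

import numpy as np

def sample_kendall_tau(X):
    """Sample spatial Kendall's tau matrix of the rows of X (an n-by-p array),
    computed as the U-statistic over all pairs of observations."""
    n, p = X.shape
    K_hat = np.zeros((p, p))
    for i in range(n):
        for j in range(i + 1, n):
            d = X[i] - X[j]
            K_hat += np.outer(d, d) / np.sum(d ** 2)
    return 2.0 * K_hat / (n * (n - 1))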

3 Distributed PCA without Moment Constraint

In this section, we introduce in detail a new distributed algorithm for robustly estimating the principal eigenspace, which generalizes the distributed PCA of Fan et al. (2019) to the elliptical family setting.

Suppose there are $N$ samples of $\bm{X}\sim ED_{p}(\bm{\mu},\bm{\Sigma},\xi)$ in total, and we are interested in recovering the eigenspace spanned by the leading $K$ eigenvectors of the scatter matrix $\bm{\Sigma}$. The $N$ samples are spread across $m$ machines, with $n=N/m$ samples on each machine. For $l\in[m]$, let $\{\bm{X}_{t}^{(l)}\}_{t=1}^{n}$ denote the samples stored on the $l$-th machine. We first calculate the local sample spatial Kendall's tau matrix on the $l$-th server as

\widehat{\bm{K}}^{(l)}=\frac{1}{C_{n}^{2}}\sum_{j<s}\frac{(\bm{X}_{j}^{(l)}-\bm{X}_{s}^{(l)})(\bm{X}_{j}^{(l)}-\bm{X}_{s}^{(l)})^{\top}}{\|\bm{X}_{j}^{(l)}-\bm{X}_{s}^{(l)}\|_{2}^{2}},\ \ l=1,\ldots,m.

We further compute the leading $K$ eigenvectors of the sample Kendall's tau matrix $\widehat{\bm{K}}^{(l)}$ on the $l$-th server and denote by $\widehat{\bm{V}}_{K}^{(l)}$ the matrix whose columns are these eigenvectors. We next transport the $\widehat{\bm{V}}_{K}^{(l)}$'s to the central server. The communication cost of the proposed distributed algorithm is of order $O(mKp)$. In contrast, sharing all the data or the entire local spatial Kendall's tau matrices would incur a communication cost of order $O(mp\min(n,p))$. Since in most cases $K=o(\min(n,p))$, the distributed algorithm requires much less communication than naive data aggregation.
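
As a hypothetical numerical illustration of these orders (the figures are ours, not from the paper), with $p=1000$, $K=5$, $n=200$ and $m=20$, the proposed algorithm transmits about $mKp=10^{5}$ numbers in total, whereas transmitting the raw data would require $mnp=4\times 10^{6}$ numbers and transmitting the local Kendall's tau matrices would require $mp^{2}=2\times 10^{7}$ numbers.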

Each transported local estimator $\widehat{\bm{V}}_{K}^{(l)}$ spans an eigenspace $span(\widehat{\bm{V}}_{K}^{(l)})$, and these eigenspaces can be viewed as points on the Grassmann manifold, as illustrated in Figure 1. We then find the "center" of these points by minimizing the projection metric on the Grassmann manifold in the central server, i.e.,

\widetilde{\bm{V}}_{K}=\mathop{\mathrm{arg\ min}}_{\bm{V}^{\top}\bm{V}=\mathbf{I}}\sum_{i=1}^{m}\|\bm{V}\bm{V}^{\top}-\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)\top}\|_{F}^{2}.

Denote $P_{\bm{V}}=\bm{V}\bm{V}^{\top}$ and $P_{\widehat{\bm{V}}_{K}^{(i)}}=\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)\top}$, which are the projection matrices onto $span(\bm{V})$ and $span(\widehat{\bm{V}}_{K}^{(i)})$ respectively. Then we have:

\sum_{i=1}^{m}\|\bm{V}\bm{V}^{\top}-\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)\top}\|^{2}_{F}=\sum_{i=1}^{m}Tr(P_{\bm{V}})+\sum_{i=1}^{m}Tr(P_{\widehat{\bm{V}}_{K}^{(i)}})-2Tr\left[P_{\bm{V}}\left(\sum_{i=1}^{m}P_{\widehat{\bm{V}}_{K}^{(i)}}\right)\right].
Algorithm 1 Robust Distributed Principal Component Analysis Algorithm
STEP 1: On the $l$-th machine, $l=1,\ldots,m$, compute the $K$ leading eigenvectors $\widehat{\bm{V}}_{K}^{(l)}=(\widehat{\bm{v}}_{1}^{(l)},\ldots,\widehat{\bm{v}}_{K}^{(l)})_{p\times K}$ of the sample Kendall's tau matrix $\widehat{\bm{K}}^{(l)}$, and transport $\widehat{\bm{V}}_{K}^{(l)}$ to the central server;
STEP 2: On the central server, compute the $K$ leading eigenvectors $\widetilde{\bm{V}}_{K}=(\widetilde{\bm{v}}_{1},\ldots,\widetilde{\bm{v}}_{K})$ of $\widetilde{\bm{\Sigma}}=\frac{1}{m}\sum_{l=1}^{m}\widehat{\bm{V}}_{K}^{(l)}\widehat{\bm{V}}_{K}^{(l)\top}$;
Output: The estimator of the leading eigenspace, $span(\widetilde{\bm{V}}_{K})$.

The first two terms on the right-hand side are fixed, so the minimization is equivalent to maximizing the last term, namely $\mathcal{L}(\bm{V})=\mathrm{tr}\left[\bm{V}^{\top}(\sum_{i=1}^{m}P_{\widehat{\bm{V}}_{K}^{(i)}})\bm{V}\right]$. By the standard trace-maximization argument, $\widetilde{\bm{V}}_{K}$ is therefore the matrix of the leading $K$ eigenvectors of the average projection matrix

\widetilde{\bm{\Sigma}}=\frac{1}{m}\sum_{i=1}^{m}P_{\widehat{\bm{V}}_{K}^{(i)}}=\frac{1}{m}\sum_{i=1}^{m}\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)\top}.

Finally, we take $span(\widetilde{\bm{V}}_{K})$ as the final distributed estimator of the principal eigenspace. The detailed procedure is summarized in Algorithm 1.
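
To make Algorithm 1 concrete, here is a minimal single-process numpy sketch of the whole procedure (ours, for illustration only); it reuses the hypothetical sample_kendall_tau function sketched in Section 2 and simulates the communication by simply looping over the local data blocks.

import numpy as np

def local_eigvecs(X, K):
    """STEP 1 on one machine: leading-K eigenvectors of the local
    sample spatial Kendall's tau matrix."""
    K_hat = sample_kendall_tau(X)         # sketched in Section 2
    _, vecs = np.linalg.eigh(K_hat)       # eigenvalues in ascending order
    return vecs[:, -K:]                   # p-by-K orthonormal matrix

def distributed_eca(X_blocks, K):
    """STEP 2 on the central server: aggregate the local eigenspaces."""
    p = X_blocks[0].shape[1]
    Sigma_tilde = np.zeros((p, p))
    for X in X_blocks:                    # each block lives on one machine
        V = local_eigvecs(X, K)           # only this p-by-K matrix is communicated
        Sigma_tilde += V @ V.T
    Sigma_tilde /= len(X_blocks)
    _, vecs = np.linalg.eigh(Sigma_tilde)
    return vecs[:, -K:]                   # columns span the estimated eigenspace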

Fan et al. (2019) proposed a distributed PCA method in which the aggregation procedure is explained from the perspective of a semidefinite programming (SDP) problem minimizing a sum of squared losses. In the current work, we provide a rationale for the aggregation procedure from the perspective of the Grassmann manifold, which is quite easy to understand. The Grassmann manifold viewpoint is similar to that of Li et al. (2022), who proposed manifold PCA for the matrix elliptical factor model, though not from the perspective of distributed computation.

4 Theoretical Properties

In this section, we investigate the theoretical properties of the distributed estimator $\widetilde{\bm{V}}_{K}$. We focus on the distance between the spaces spanned by the columns of $\widetilde{\bm{V}}_{K}$ and those of $\bm{V}_{K}$, and adopt the metric of Fan et al. (2019) to analyse the statistical error of $span\{\widetilde{\bm{V}}_{K}\}$ as an estimator of $span\{\bm{V}_{K}\}$. We begin with some assumptions needed for the technical analysis.

Assumption 4.1.

Assume that $\bm{X}\sim ED_{p}(\bm{\mu},\bm{\Sigma},\xi)$ and that, for a fixed $\delta>0$ and $0<\alpha<1/2$, the scatter matrix $\bm{\Sigma}$ satisfies:

  • i) There exists a universal positive constant $C>0$ such that the ratio $\gamma$ of the maximal eigenvalue to the minimal eigenvalue of the scatter matrix satisfies

    \gamma=\frac{\lambda_{1}(\bm{\Sigma})}{\lambda_{p}(\bm{\Sigma})}\leq Cp^{\alpha};

  • ii) There is a relatively significant gap between the $K$-th and the $(K+1)$-th eigenvalues:

    \frac{\lambda_{K}(\bm{\Sigma})}{\lambda_{K+1}(\bm{\Sigma})}\geq 1+\delta.

Assumption 4.1 imposes conditions on the eigengap and the condition number of the scatter matrix, which are common in the high-dimensional PCA literature and mild in the following sense. In terms of the elliptical factor model of He et al. (2022), that is, $\bm{X}_{t}=\bm{L}\bm{f}_{t}+\bm{u}_{t}$, the scatter matrix of $\bm{X}_{t}$ has the low-rank-plus-sparse structure $\bm{\Sigma}=\bm{L}\bm{L}^{\top}+\bm{\Sigma}_{u}$, where $\bm{\Sigma}_{u}$ is the scatter matrix of the idiosyncratic errors $\bm{u}_{t}$; see also Section 5. Assume that there exist $K$ latent factors, that the loading matrix $\bm{L}$ satisfies the pervasive condition $\bm{L}^{\top}\bm{L}/p^{\alpha}\rightarrow\bm{Q}$ as $p\rightarrow\infty$ with $\bm{Q}$ a $K\times K$ positive definite matrix, and that the eigenvalues of $\bm{\Sigma}_{u}$ are bounded from below and above; then the scatter matrix $\bm{\Sigma}$ satisfies the conditions in Assumption 4.1.

In the following analysis, we decompose the error $\rho(\widetilde{\bm{V}}_{K},\bm{V}_{K})$ into two parts: a bias term and a variance term. In detail,

\rho(\widetilde{\bm{V}}_{K},\bm{V}_{K})\leq\underbrace{\rho(\widetilde{\bm{V}}_{K},\bm{V}_{K}^{*})}_{\text{variance term}}+\underbrace{\rho(\bm{V}_{K}^{*},\bm{V}_{K})}_{\text{bias term}},

where $\bm{V}_{K}^{*}$ consists of the $K$ leading eigenvectors of $\bm{\Sigma}^{*}=\mathbb{E}(\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)\top})$. In the following we present the upper bounds of the variance term and the bias term in Theorem 4.1 and Theorem 4.2 respectively.

Theorem 4.1.

Under Assumption 4.1, further assume that $\log p=o(N)$. Then we have

\|{\rho}(\widetilde{\bm{V}}_{K},\bm{V}_{K}^{*})\|_{\psi_{1}}\lesssim\kappa rK^{3/2}\sqrt{\frac{\log p}{N}},

where

\kappa=\frac{\lambda_{1}(\bm{\Sigma})}{\lambda_{K}(\bm{\Sigma})-\lambda_{K+1}(\bm{\Sigma})},\ \ r=\frac{Tr(\bm{\Sigma})}{\lambda_{1}(\bm{\Sigma})}.

Concentration of measure is essential in high-dimensional statistical analysis. Fan et al. (2019) imposed a sub-Gaussian assumption to derive their concentration inequality. The proposed algorithm, however, maps each eigenspace estimator to the Grassmann manifold, where concentration inequalities follow easily from compactness, so the result holds even for the Cauchy distribution, which has no finite moments. Next we give the upper bound of the bias term.

Theorem 4.2.

Under Assumption 4.1, further assume that $\log p=o(N/m)$. Then we have

{\rho}(\bm{V}_{K}^{*},\bm{V}_{K})\lesssim\kappa^{2}r^{2}K^{5/2}\frac{m\log p}{N}.

By combining the results of the bias term and the variance term, we have the following corollary.

Corollary 4.1.

Under Assumption 4.1, further assume that $\log p=o(N/m)$. Then there exist positive constants $C_{1},C_{2}$ such that

\|{\rho}(\widetilde{\bm{V}}_{K},\bm{V}_{K})\|_{\psi_{1}}\leq C_{1}\kappa rK^{3/2}\left(\frac{\log p}{N}\right)^{1/2}+C_{2}\kappa^{2}r^{2}K^{5/2}\frac{m\log p}{N};

if the number of subservers satisfies $m\leq C(N/\log p)^{1/2}/(\kappa rK)$ for some constant $C$, we have

\|{\rho}(\widetilde{\bm{V}}_{K},\bm{V}_{K})\|_{\psi_{1}}\lesssim\kappa rK^{3/2}\left(\frac{\log p}{N}\right)^{1/2}.

By Corollary 4.1 and the theoretical results in Han & Liu (2018), we conclude that our algorithm is not only communication-efficient but also outputs an estimator sharing the same convergence rate as the ECA estimator computed with all samples available on one machine, provided $m$ does not exceed the bound given in Corollary 4.1. This is also confirmed in the following simulation study.

5 Extension to Elliptical Factor Models

In this section, we assume that the data points $\left\{\bm{X}_{t}\right\}_{t=1}^{N}\subset\mathbb{R}^{p}$ follow an elliptical factor model, i.e.,

\bm{X}_{t}=\bm{L}\bm{f}_{t}+\bm{u}_{t},\ \ t=1,\ldots,N,

where $\bm{X}_{t}=(X_{1t},\ldots,X_{pt})^{\top}$, $\bm{f}_{t}\in\mathbb{R}^{K}$ are the latent factors, $\bm{L}=(\bm{l}_{1},\ldots,\bm{l}_{p})^{\top}$ is the factor loading matrix and the $\bm{u}_{t}$'s are the idiosyncratic errors. Note that for the elliptical factor model, the loading space $span(\bm{L})$ rather than $\bm{L}$ itself is identifiable. We assume that the random vectors $(\bm{f}_{t}^{\top},\bm{u}_{t}^{\top})^{\top}$ are generated independently and identically from the elliptical distribution $ED_{K+p}({\bm{0}},\bm{\Sigma}_{0},\xi)$, where $\bm{\Sigma}_{0}=\text{diag}(\mathbf{I}_{K},\bm{\Sigma}_{u})$; see He et al. (2022) for more details on the elliptical distribution. By the properties of the elliptical family, the scatter matrix of $\bm{X}_{t}$ has the form $\bm{\Sigma}=\bm{L}\bm{L}^{\top}+\bm{\Sigma}_{u}$. We assume that the loading matrix $\bm{L}$ satisfies the pervasive condition $\bm{L}^{\top}\bm{L}/p^{\alpha}\rightarrow\bm{Q}$ as $p\rightarrow\infty$, where $\bm{Q}$ is a $K\times K$ positive definite matrix, and that the eigenvalues of $\bm{\Sigma}_{u}$ are bounded from below and above. Then $\bm{\Sigma}$ satisfies Assumption 4.1. Naturally, we use the space spanned by the $K$ leading eigenvectors of the sample multivariate Kendall's tau matrix as the estimator of the loading space $span(\bm{L})$. As before, we assume that the $N$ observations are spread across $m$ machines, with $n=N/m$ observations on each machine. For $l\in[m]$, let $\{\bm{X}_{t}^{(l)}\}_{t=1}^{n}$ denote the observations stored on the $l$-th machine, so that

\bm{X}_{t}^{(l)}=\bm{L}\bm{f}_{t}^{(l)}+\bm{u}_{t}^{(l)},\ \ t=1,\ldots,n,\ \ l=1,\ldots,m.
Algorithm 2 Distributed algorithm for the elliptical factor model
STEP 1: On the $l$-th machine, $l=1,\ldots,m$, compute the $K$ leading eigenvectors $\widehat{\bm{V}}_{K}^{(l)}=(\widehat{\bm{v}}_{1}^{(l)},\ldots,\widehat{\bm{v}}_{K}^{(l)})_{p\times K}$ of the sample Kendall's tau matrix, and transport $\widehat{\bm{V}}_{K}^{(l)}$, $l=1,\ldots,m$, to the central server.
STEP 2: On the central processor, compute the $K$ leading eigenvectors $\widetilde{\bm{V}}_{K}=(\widetilde{\bm{v}}_{1},\ldots,\widetilde{\bm{v}}_{K})$ of $\widetilde{\bm{\Sigma}}=\frac{1}{m}\sum_{l=1}^{m}\widehat{\bm{V}}_{K}^{(l)}\widehat{\bm{V}}_{K}^{(l)\top}$ and use $span(\widetilde{\bm{V}}_{K})$ as the estimator of the factor loading space $span(\bm{L})$.
STEP 3: Transport $\widetilde{\bm{V}}_{K}$ back to each server, set $\widetilde{\bm{L}}=p^{\alpha/2}\widetilde{\bm{V}}_{K}$, and solve the least squares problem on each server to obtain the factor score estimators $\widetilde{\bm{f}}_{t}^{(l)}$:

\widetilde{\bm{f}}_{t}^{(l)}=\mathop{\arg\min}\limits_{\bm{\beta}\in\mathbb{R}^{K}}\sum_{i=1}^{p}(X_{it}^{(l)}-\widetilde{\bm{l}}_{i}^{\top}\bm{\beta})^{2}=\mathop{\arg\min}\limits_{\bm{\beta}\in\mathbb{R}^{K}}\|\bm{X}_{t}^{(l)}-\widetilde{\bm{L}}\bm{\beta}\|_{2}^{2},

where $\widetilde{\bm{L}}=(\widetilde{\bm{l}}_{1},\ldots,\widetilde{\bm{l}}_{p})^{\top}$.
Output: Factor loading space estimator $span(\widetilde{\bm{V}}_{K})$, factor score estimators $\widetilde{\bm{f}}_{t}^{(l)}$, $l=1,\ldots,m$, $t=1,\ldots,n$.

Based on the proposed distributed algorithm, we further propose a distributed procedure to estimate the factor loading space $span(\bm{L})$ and the factor scores. The procedure goes as follows: firstly, for $l=1,\ldots,m$, compute the $K$ leading eigenvectors $\{\widehat{\bm{v}}_{1}^{(l)},\ldots,\widehat{\bm{v}}_{K}^{(l)}\}$ of the sample Kendall's tau matrix on the $l$-th server, let $\widehat{\bm{V}}_{K}^{(l)}=(\widehat{\bm{v}}_{1}^{(l)},\ldots,\widehat{\bm{v}}_{K}^{(l)})_{p\times K}$, and transport $\widehat{\bm{V}}_{K}^{(l)}$ to the central processor; then perform a spectral decomposition of the average projection operator $\widetilde{\bm{\Sigma}}=\frac{1}{m}\sum_{l=1}^{m}\widehat{\bm{V}}_{K}^{(l)}\widehat{\bm{V}}_{K}^{(l)\top}$ to get its $K$ leading eigenvectors $\{\widetilde{\bm{v}}_{1},\ldots,\widetilde{\bm{v}}_{K}\}$, and let $\widetilde{\bm{V}}_{K}=(\widetilde{\bm{v}}_{1},\ldots,\widetilde{\bm{v}}_{K})$. Finally, $span(\widetilde{\bm{V}}_{K})$ is set as the estimator of $span(\bm{L})$. If we are also interested in the factor scores on each server, we simply transport $\widetilde{\bm{V}}_{K}$ back to each subserver and solve a least squares problem to obtain the estimators of the factor scores. The procedure is summarized in Algorithm 2.
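
As a minimal sketch of STEP 3 (again in numpy and not the authors' code), the following function computes the least-squares factor scores on one server given the aggregated loading-space estimate; the pervasiveness exponent alpha is treated as known and defaults to 1, an assumption made here purely for illustration.

import numpy as np

def factor_scores(X_block, V_tilde, alpha=1.0):
    """Least-squares factor scores on one server: X_block is n-by-p and
    V_tilde is the p-by-K aggregated eigenvector matrix from STEP 2."""
    p = V_tilde.shape[0]
    L_tilde = (p ** (alpha / 2)) * V_tilde          # L~ = p^{alpha/2} V~_K
    # each column of X_block.T is x_t; solve min_beta ||x_t - L~ beta||_2^2
    F, *_ = np.linalg.lstsq(L_tilde, X_block.T, rcond=None)
    return F.T                                      # n-by-K matrix of scores f~_t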

6 Simulation Studies and Real Example

6.1 Simulation Studies

In this section, we investigate the empirical performance of the proposed algorithm in terms of estimating the loading space in factor models. In essence, our algorithm is a distributed Elliptical Component Analysis procedure and is thus briefly denoted as D-ECA. We compare it with the distributed PCA algorithm (denoted as D-PCA) of Fan et al. (2019) as well as with an oracle algorithm in which the full samples are available on one machine and elliptical component analysis is performed (denoted as F-ECA).

We adopt a simplified data-generating scheme from the elliptical factor model, similar to that in He et al. (2022):

X_{it}=\sum_{j=1}^{K}L_{ij}f_{jt}+u_{it},

where $\bm{f}_{t}=(f_{1t},\ldots,f_{Kt})^{\top}$ and $\bm{u}_{t}=(u_{1t},\ldots,u_{pt})^{\top}$ are jointly generated from elliptical distributions. We set $K=3$ and draw the $L_{ij}$'s independently from the standard normal distribution. We use a fixed number of observations on each server, i.e., $n=200$, but vary the number of servers $m$ and the dimensionality $p$. The factors and the idiosyncratic errors $(\bm{f}_{t}^{\top},\bm{u}_{t}^{\top})^{\top}$ are generated in the following ways: (i) $(\bm{f}_{t}^{\top},\bm{u}_{t}^{\top})^{\top}$ are i.i.d. jointly elliptical random vectors from the multivariate Gaussian distribution $\mathcal{N}({\bm{0}},\mathbf{I}_{p+K})$; (ii) $(\bm{f}_{t}^{\top},\bm{u}_{t}^{\top})^{\top}$ are i.i.d. jointly elliptical random vectors from the multivariate centralized $t$ distribution $t_{\nu}({\bm{0}},\mathbf{I}_{p+K})$ with $\nu=1,2,3$.

We compare the performance of the distributed algorithms by the distance between the estimated loading space and the true loading space. We adopt a distance between linear spaces also utilized in He et al. (2022). Precisely, for two matrices with orthonormal columns $\bm{V}_{1}\in\mathbb{R}^{p\times K}$ and $\bm{V}_{2}\in\mathbb{R}^{p\times K}$, the distance between their column spaces is defined as:

\rho_{1}(\bm{V}_{1},\bm{V}_{2})=\left(1-\frac{1}{K}Tr(\bm{V}_{1}\bm{V}_{1}^{\top}\bm{V}_{2}\bm{V}_{2}^{\top})\right)^{1/2}.

The measure equals 0 if and only if the column spaces of $\bm{V}_{1}$ and $\bm{V}_{2}$ are the same, and equals 1 if and only if they are orthogonal. Indeed, we have $\rho(\bm{V}_{1},\bm{V}_{2})=\sqrt{2K}\rho_{1}(\bm{V}_{1},\bm{V}_{2})$.
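
A one-function numpy sketch of this distance (ours, for illustration only) is given below; by the identity above, multiplying the output by sqrt(2K) recovers the projection distance rho used in Section 4.

import numpy as np

def rho1(V1, V2):
    """Space distance rho_1 between the column spaces of two p-by-K
    matrices with orthonormal columns."""
    K = V1.shape[1]
    overlap = np.trace(V1 @ V1.T @ V2 @ V2.T)    # equals ||V1^T V2||_F^2
    return np.sqrt(max(1.0 - overlap / K, 0.0))  # clipped at 0 for rounding error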

Figure 2: The log errors, i.e., $\log(\rho_{1}(\bm{L},\widehat{\bm{L}}))$, versus the number of servers $m$ under various distributions with $p=20$ and $p=50$.
Table 1: Simulation results for the different scenarios based on 100 replications, with standard deviations in parentheses.
p m Method Distribution:
$\mathcal{N}(\bm{0},\bm{I}_{p+K})$  $t_{3}(\bm{0},\bm{I}_{p+K})$  $t_{2}(\bm{0},\bm{I}_{p+K})$  $t_{1}(\bm{0},\bm{I}_{p+K})$
20 5 D-PCA 0.034(0.006) 0.080(0.019) 0.126(0.034) 0.259(0.066)
D-ECA 0.035(0.006) 0.038(0.006) 0.040(0.007) 0.042(0.007)
F-ECA 0.034(0.005) 0.038(0.006) 0.039(0.006) 0.041(0.007)
10 D-PCA 0.024(0.005) 0.057(0.013) 0.092(0.022) 0.169(0.031)
D-ECA 0.025(0.005) 0.027(0.005) 0.028(0.004) 0.029(0.004)
F-ECA 0.025(0.005) 0.027(0.005) 0.028(0.004) 0.028(0.004)
20 D-PCA 0.016(0.002) 0.040(0.008) 0.064(0.013) 0.124(0.026)
D-ECA 0.017(0.002) 0.019(0.008) 0.019(0.003) 0.020(0.004)
F-ECA 0.017(0.002) 0.019(0.008) 0.019(0.003) 0.020(0.004)
50 5 D-PCA 0.032(0.003) 0.076(0.014) 0.126(0.026) 0.247(0.038)
D-ECA 0.033(0.003) 0.037(0.003) 0.037(0.003) 0.040(0.003)
F-ECA 0.033(0.003) 0.036(0.003) 0.037(0.003) 0.040(0.003)
10 D-PCA 0.023(0.002) 0.054(0.008) 0.085(0.012) 0.167(0.023)
D-ECA 0.023(0.002) 0.026(0.002) 0.026(0.002) 0.028(0.002)
F-ECA 0.023(0.002) 0.026(0.002) 0.026(0.002) 0.027(0.002)
20 D-PCA 0.016(0.002) 0.038(0.004) 0.060(0.008) 0.116(0.013)
D-ECA 0.017(0.002) 0.018(0.002) 0.019(0.002) 0.020(0.002)
F-ECA 0.017(0.002) 0.018(0.002) 0.019(0.002) 0.020(0.002)
100 5 D-PCA 0.032(0.002) 0.077(0.014) 0.123(0.020) 0.240(0.031)
D-ECA 0.033(0.002) 0.036(0.002) 0.037(0.002) 0.039(0.002)
F-ECA 0.033(0.002) 0.036(0.002) 0.036(0.002) 0.039(0.002)
10 D-PCA 0.023(0.001) 0.054( 0.007) 0.087(0.011) 0.165(0.018)
D-ECA 0.023(0.001) 0.026(0.001) 0.026(0.001) 0.028(0.002)
F-ECA 0.023(0.001) 0.025(0.001) 0.026(0.001) 0.028(0.002)
20 D-PCA 0.016(0.001) 0.037(0.003) 0.059(0.006) 0.116(0.010)
D-ECA 0.017(0.001) 0.018(0.001) 0.019(0.001) 0.020(0.001)
F-ECA 0.016(0.001) 0.018(0.001) 0.018(0.001) 0.019(0.001)

The simulation results are presented in Table 1, from which we draw the following conclusions. Firstly, they confirm that distributed ECA is a more robust method than distributed PCA. As revealed by the results, our distributed ECA performs very well even when the data are extremely heavy-tailed, while performing comparably with D-PCA under the Gaussian distribution. Secondly, the performance of distributed ECA is comparable with that of F-ECA, which indicates that the distributed algorithm works as well as in the case where we have all observations on one machine, in accordance with our theoretical analysis. Thus the distributed algorithm enjoys high computational efficiency while barely losing any accuracy.


Figure 3: The errors $\rho_{1}(\bm{L},\widehat{\bm{L}})$ versus the number of machines $m$ for distributed PCA, distributed ECA and full sample ECA under different distributions with fixed $p=20$.

Figure 2 shows the decay of the error $\rho_{1}(\bm{L},\widehat{\bm{L}})$ as the number of servers $m$ increases when $(n,p)=\{(200,20),(200,50)\}$, where $\widehat{\bm{L}}$ denotes the orthonormal output of D-ECA. From Figure 2, we can see that the relationship between $\log(\rho_{1}(\bm{L},\widehat{\bm{L}}))$ and $\log(m)$ is linear. We therefore consider the following regression model:

\log(\rho_{1}(\bm{L},\widehat{\bm{L}}))=\beta_{0}+\beta_{1}\log(m)+\epsilon.

The fitted slope is $\widehat{\beta}_{1}=-0.5019$ for $p=20$ and $\widehat{\beta}_{1}=-0.49$ for $p=50$. As these experiments are conducted with fixed $n$, the slope is consistent with the power of $m$ in the error bounds of Section 4, noting that $\log(\rho_{1}(\bm{L},\widehat{\bm{L}}))=\log(\rho(\bm{L},\widehat{\bm{L}}))-\log(2K)/2$.
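
A small numpy sketch of this fit (the function name and inputs are ours): given an array of server counts and the corresponding average rho_1 errors, it returns the least-squares slope in the log-log model above.

import numpy as np

def loglog_slope(ms, errors):
    """Fit log(error) = beta_0 + beta_1 * log(m) by least squares and
    return the estimated slope beta_1 (expected to be close to -1/2 here)."""
    beta1, beta0 = np.polyfit(np.log(ms), np.log(errors), deg=1)
    return beta1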

Figure 3 shows the errors of D-PCA, D-ECA and F-ECA with varying $m$ and fixed $p=20$ under different distributions. It is obvious from Figure 3 (see further simulation results in the supplement) that the error of D-PCA is much higher than those of D-ECA and F-ECA under the $t$-distribution, and the differences increase as the degrees of freedom decline, which again demonstrates that distributed ECA is a much more robust method than distributed PCA. Meanwhile, the errors of D-ECA and F-ECA are very close. Thus we confirm the conclusion that distributed ECA is a computationally efficient method that loses little accuracy. In conclusion, when data are stored across different machines for various reasons including communication cost, privacy and data security, our D-ECA can be regarded as an alternative to ECA based on the full samples, and it is much more robust than the distributed PCA of Fan et al. (2019).

6.2 Real Example

We apply the distributed algorithm for the elliptical factor model to a real macroeconomic dataset, namely U.S. weekly economic data available from the Federal Reserve Economic Data (https://fred.stlouisfed.org/). Our dataset consists of 40 variables ($p=40$) from October 2nd, 1996, to April 11th, 2018 ($N=1124$). We divide the dataset into four equal subsets and store them on four servers ($m=4$). The series can roughly be classified into 5 groups: (1) Assets (series 1-3), (2) Loans (series 4-14), (3) Deposits and Securities, briefly denoted as DAS (series 15-26), (4) Liabilities (series 27-32), (5) Currency (series 33-40). By a preliminary test, we find that more than two-thirds of the variables show heavy-tailed characteristics, which indicates that ECA is more suitable than PCA for analyzing this dataset.

We compare the forecasting performance of the factors obtained by D-ECA, F-ECA, F-PCA and D-PCA. We adopt a rolling-window prediction scheme which employs a fixed-length window of the most recent data (i.e., 10 weekly observations) to train the factor models and conduct out-of-sample forecasting. To measure the forecasting performance of the extracted factors, we use the $h$-step-ahead prediction model:

\widehat{x}_{i,t+h}=\widehat{\alpha}_{h}+\widehat{\beta}_{h}\widehat{\bm{f}}_{t}.

We regress $\widehat{x}_{i,t+h}$ only on $\widehat{\bm{f}}_{t}$, without including the lagged values of $x_{i,t}$, as we find that including the lagged variables is less effective, which is consistent with the conclusion of Stock & Watson (2002).
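
The following numpy sketch illustrates one way such a rolling-window h-step-ahead exercise can be coded (the function name, the intercept handling and the error metric are our own assumptions, not the paper's exact implementation); within each window the factor scores are obtained by the least-squares step of Algorithm 2 and the forecast regression above is fitted by ordinary least squares.

import numpy as np

def rolling_forecast_error(X, V_tilde, h, window=10, alpha=1.0):
    """Mean squared h-step-ahead forecast error over rolling windows.
    X is T-by-p; V_tilde is the p-by-K loading-space estimate."""
    T, p = X.shape
    L_tilde = (p ** (alpha / 2)) * V_tilde
    errors = []
    for start in range(T - window - h):
        Xw = X[start:start + window]                        # training window
        F, *_ = np.linalg.lstsq(L_tilde, Xw.T, rcond=None)  # K-by-window scores
        F = F.T
        # regress x_{i,t+h} on f_t within the window (with an intercept)
        Z = np.hstack([np.ones((window - h, 1)), F[:window - h]])
        coef, *_ = np.linalg.lstsq(Z, Xw[h:], rcond=None)
        # forecast the observation h steps after the end of the window
        x_hat = np.hstack([1.0, F[-1]]) @ coef
        errors.append(np.mean((X[start + window - 1 + h] - x_hat) ** 2))
    return float(np.mean(errors))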

Table 2: The 1-step to 4-step ahead forecast errors. The ratios are reported for a more intuitive comparison of the prediction errors.
Step Method Assets Loans DAS Liabilities Currency
1 D-ECA/F-ECA 0.98609 0.98390 1.00540 0.98852 1.00038
D-ECA/F-PCA 0.98727 0.98329 1.00378 0.98944 0.99921
F-PCA/F-ECA 1.00089 1.00081 1.00366 1.00124 1.00333
2 D-ECA/F-ECA 0.96390 0.98154 0.97463 0.97224 0.97600
D-ECA/F-PCA 0.96576 0.98136 0.97796 0.97125 0.97828
F-PCA/F-ECA 0.99825 1.00355 0.98876 1.00359 0.99126
3 D-ECA/F-ECA 0.97880 0.98323 1.00612 0.94963 1.00575
D-ECA/F-PCA 0.97804 0.98261 1.00620 0.94734 1.00565
F-PCA/F-ECA 1.00277 1.00819 1.00747 1.00701 1.00720
4 D-ECA/F-ECA 0.93355 0.96147 0.96971 0.95796 0.96653
D-ECA/F-PCA 0.93786 0.96351 0.94353 0.96158 0.96692
F-PCA/F-ECA 0.99195 1.00563 0.99795 1.00233 0.99923

From Table 2, we see that the prediction errors of D-ECA and F-ECA are similar. For the 2-step and 4-step ahead forecasts, the D-ECA prediction errors of all five groups are smaller than those of F-ECA. The prediction errors of D-ECA are worse than those of F-ECA only for the 1-step and 3-step ahead forecasts of DAS and Currency. On the other hand, whether distributed or full-sample, the prediction errors of ECA are smaller than those of PCA. We also find that F-ECA is better than F-PCA for the 1-step and 3-step ahead forecasts, while for the 2-step and 4-step ahead forecasts the prediction errors of F-ECA are smaller than those of F-PCA for Loans and Liabilities.

Overall, the factors extracted by the distributed algorithms tend to have better forecasting performance than those based on the full data. When it is extremely difficult to aggregate large datasets because of communication cost, privacy concerns, data security, ownership and many other issues, the distributed ECA algorithm provides a solution and may even have better forecasting performance than performing ECA/PCA on the whole data.

7 Discussion

We proposed a robust distributed algorithm to estimate the principal eigenspace for high-dimensional, elliptically distributed data stored across multiple machines. The distributed algorithm reduces the transmission cost and performs much better than distributed PCA when data are extremely heavy-tailed. The algorithm goes as follows: in the first step, we compute the $K$ leading eigenvectors of the sample Kendall's tau matrix on each server and transport them to the central server; in the second step, we average the projection operators and perform a spectral decomposition to obtain the $K$ leading eigenvectors, which span the final estimated eigenspace. We derive the convergence rate of the robust distributed estimator. Numerical studies show that the proposed algorithm performs comparably with full sample ECA while outperforming distributed PCA when data are heavy-tailed. We also extend the algorithm to learning the elliptical factor model in a distributed manner; the corresponding theoretical analysis is more challenging and we leave it as future work. It is also interesting to study the "vertical" division for distributed robust PCA and the elliptical factor model.

Acknowledgements

The authors gratefully acknowledge the National Science Foundation of China (12171282, 11801316), the National Statistical Scientific Research Key Project (2021LZ09), the Young Scholars Program of Shandong University, and a project funded by the China Postdoctoral Science Foundation (2021M701997).

References

  • Anderson (1963) Anderson, T. W. (1963), ‘Asymptotic theory for principal component analysis’, The Annals of Mathematical Statistics 34(1), 122–148.
  • Bao et al. (2022) Bao, Z., Ding, X., Wang, J. & Wang, K. (2022), ‘Statistical inference for principal components of spiked covariance matrices’, The Annals of Statistics 50(2), 1144–1169.
  • Chen et al. (2021) Chen, X., Zhang, J. & Zhou, W. (2021), ‘High-dimensional elliptical sliced inverse regression in non-Gaussian distributions’, Journal of Business & Economic Statistics 0(0), 1–12.
  • Choi & Marden (1998) Choi, K. & Marden, J. (1998), ‘A multivariate version of Kendall’s $\tau$’, Journal of Nonparametric Statistics 9(3), 261–293.
  • Fan et al. (2021) Fan, J., Guo, Y. & Wang, K. (2021), ‘Communication-efficient accurate statistical estimation’, Journal of the American Statistical Association pp. 1–11.
  • Fan et al. (2018) Fan, J., Liu, H. & Wang, W. (2018), ‘Large covariance estimation through elliptical factor models’, Annals of Statistics 46(4), 1383.
  • Fan et al. (2019) Fan, J., Wang, D., Wang, K. & Zhu, Z. (2019), ‘Distributed estimation of principal eigenspaces’, Annals of Statistics 47(6), 3009.
  • Feng & Liu (2017) Feng, L. & Liu, B. (2017), ‘High-dimensional rank tests for sphericity’, Journal of Multivariate Analysis 155, 217–233.
  • Hallin et al. (2010) Hallin, M., Paindaveine, D. & Verdebout, T. (2010), ‘Optimal rank-based testing for principal components’, The Annals of Statistics 38(6), 3245–3299.
  • Hallin et al. (2014) Hallin, M., Paindaveine, D. & Verdebout, T. (2014), ‘Efficient r-estimation of principal and common principal components’, Journal of the American Statistical Association 109(507), 1071–1083.
  • Han & Liu (2014) Han, F. & Liu, H. (2014), ‘Scale-invariant sparse PCA on high-dimensional meta-elliptical data’, J. Amer. Statist. Assoc. 109(505), 275–287.
  • Han & Liu (2018) Han, F. & Liu, H. (2018), ‘ECA: High-dimensional elliptical component analysis in non-Gaussian distributions’, Journal of the American Statistical Association 113(521), 252–268.
  • He et al. (2022) He, Y., Kong, X., Yu, L. & Zhang, X. (2022), ‘Large-dimensional factor analysis without moment constraints’, Journal of Business & Economic Statistics 40(1), 302–312.
  • Hu et al. (2019) Hu, J., Li, W., Liu, Z. & Zhou, W. (2019), ‘High-dimensional covariance matrices in elliptical distributions with application to spherical test’, The Annals of Statistics 47(1), 527 – 555.
  • Hult & Lindskog (2002) Hult, H. & Lindskog, F. (2002), ‘Multivariate extremes, aggregation and dependence in elliptical distributions’, Advances in Applied probability 34(3), 587–608.
  • Jing et al. (2012) Jing, B.-Y., Kong, X.-B. & Liu, Z. (2012), ‘Modeling high-frequency financial data by pure jump processes’, The Annals of Statistics 40(2), 759–784.
  • Kong et al. (2021) Kong, X.-B., Lin, J.-G., Liu, C. & Liu, G.-Y. (2021), ‘Discrepancy between global and local principal component analysis on large-panel high-frequency data’, Journal of the American Statistical Association 0(0), 1–12.
  • Kong et al. (2015) Kong, X.-B., Liu, Z. & Jing, B.-Y. (2015), ‘Testing for pure-jump processes for high-frequency data’, The Annals of Statistics 43(2), 847–877.
  • Li et al. (2022) Li, Z., He, Y., Kong, X. & Zhang, X. (2022), ‘Manifold principal component analysis for large-dimensional matrix elliptical factor model’, arXiv preprint arXiv:2203.14063.
  • Onatski (2012) Onatski, A. (2012), ‘Asymptotics of the principal components estimator of large factor models with weakly influential factors’, Journal of Econometrics 168(2), 244–258.
  • Qu et al. (2002) Qu, Y., Ostrouchov, G., Samatova, N. & Geist, A. (2002), Principal component analysis for dimension reduction in massive distributed data sets, in ‘Proceedings of IEEE International Conference on Data Mining (ICDM)’, Vol. 1318, p. 1788.
  • Stock & Watson (2002) Stock, J. H. & Watson, M. W. (2002), ‘Macroeconomic forecasting using diffusion indexes’, Journal of Business & Economic Statistics 20(2), 147–162.
  • Taskinen et al. (2012) Taskinen, S., Koch, I. & Oja, H. (2012), ‘Robustifying principal component analysis with spatial sign vectors’, Statistics & Probability Letters 82(4), 765–774.
  • Vershynin (2010) Vershynin, R. (2010), ‘Introduction to the non-asymptotic analysis of random matrices’, arXiv preprint arXiv:1011.3027 .
  • Visuri et al. (2000) Visuri, S., Koivunen, V. & Oja, H. (2000), ‘Sign and rank covariance matrices’, Journal of Statistical Planning and Inference 91(2), 557–575.
  • Wang & Fan (2017) Wang, W. & Fan, J. (2017), ‘Asymptotics of empirical eigenstructure for high dimensional spiked covariance’, Annals of Statistics 45, 1342–1374.
  • Yu et al. (2019) Yu, L., He, Y. & Zhang, X. (2019), ‘Robust factor number specification for large-dimensional elliptical factor model’, Journal of Multivariate Analysis 174, 104543.
  • Yu et al. (2015) Yu, Y., Wang, T. & Samworth, R. J. (2015), ‘A useful variant of the davis–kahan theorem for statisticians’, Biometrika 102(2), 315–323.
  • Zhang et al. (2012) Zhang, Y., Wainwright, M. J. & Duchi, J. C. (2012), ‘Communication-efficient algorithms for statistical optimization’, Advances in Neural Information Processing Systems 25.

Supplementary Materials for “Distributed Learning for Principal Eigenspaces without Moment Constraints”

Yong He, Zichen Liu, Yalin Wang
Institute of Financial Studies, Shandong University, China; e-mail: heyong@sdu.edu.cn

This document provides detailed proofs for the main paper and additional simulation results. We start with some auxiliary lemmas in Section A; the proofs of the main theorems are provided in Section B. Section C provides the additional simulation results.

Appendix A Auxiliary Lemmas

Lemma A.1.

Let $\bm{X}$ follow a continuous elliptical distribution, that is, $\bm{X}\sim ED(g;\bm{\mu},\bm{\Sigma},\xi)$ with $\mathbb{P}(\xi=0)=0$. Let $\textbf{K}$ be the population multivariate Kendall's tau matrix and assume $rank(\bm{\Sigma})=q$. Then:

\lambda_{j}(\textbf{K})=\mathbb{E}\left(\frac{\lambda_{j}(\bm{\Sigma})g_{j}^{2}}{\lambda_{1}(\bm{\Sigma})g_{1}^{2}+\ldots+\lambda_{q}(\bm{\Sigma})g_{q}^{2}}\right),

where $\bm{g}=(g_{1},\ldots,g_{q})^{\top}\sim\mathcal{N}(\textbf{0},\textbf{I}_{q})$ is a standard multivariate Gaussian vector. In addition, $\textbf{K}$ and $\bm{\Sigma}$ share the same eigenvectors, with the eigenvalues in the same descending order.

This lemma is the foundation of principal component analysis for elliptically distributed data, and the proof can be found in Han & Liu (2018).

Lemma A.2.

Under Assumption 4.1 in the main paper, there exist a positive constant $c_{0}$ and a large $p_{0}$ such that, for $p\geq p_{0}$:

\frac{1}{\lambda_{K}(\textbf{K})-\lambda_{K+1}(\textbf{K})}\leq c_{0}\frac{\text{Tr}(\bm{\Sigma})}{\lambda_{K}(\bm{\Sigma})-\lambda_{K+1}(\bm{\Sigma})}=c_{0}\kappa r.
Proof.

First, we claim that when $\gamma=\lambda_{1}(\bm{\Sigma})/\lambda_{p}(\bm{\Sigma})\leq Cp^{\alpha}$,

\|\bm{\Sigma}\|_{F}\sqrt{\log p}=o(\text{Tr}(\bm{\Sigma})).

This is true because:

\text{Tr}(\bm{\Sigma})\geq p\lambda_{p}\geq p^{1-\alpha}\frac{\lambda_{1}}{C}=p^{1-\alpha}\|\bm{\Sigma}\|_{2}/C\geq p^{1/2-\alpha}\|\bm{\Sigma}\|_{F}/C.

By Theorem 3.5 in Han & Liu (2018), for large $p$,

\lambda_{K}(\textbf{K})-\lambda_{K+1}(\textbf{K})\geq\frac{\lambda_{K}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})+4\|\bm{\Sigma}\|_{F}\sqrt{\log p}+8\|\bm{\Sigma}\|_{2}\log p}\left(1-\frac{\sqrt{3}}{p^{2}}\right)-\frac{\lambda_{K+1}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})-4\|\bm{\Sigma}\|_{F}\sqrt{\log p}}-\frac{1}{p^{4}}:=f_{1}(p)\frac{\lambda_{K}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}-f_{2}(p)\frac{\lambda_{K+1}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}-\frac{1}{p^{4}}.

According to the claim above, we know that $f_{1},f_{2}\rightarrow 1$ as $p\rightarrow\infty$. Also,

\frac{f_{1}(p)}{f_{2}(p)}=\frac{1-\sqrt{3}/p^{2}}{\text{Tr}(\bm{\Sigma})+4\|\bm{\Sigma}\|_{F}\sqrt{\log p}+8\|\bm{\Sigma}\|_{2}\log p}(\text{Tr}(\bm{\Sigma})-4\|\bm{\Sigma}\|_{F}\sqrt{\log p})<1.

By the assumption on the ratio of eigenvalues, we have:

f_{1}(p)\frac{\lambda_{K}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}-f_{2}(p)\frac{\lambda_{K+1}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}-\frac{1}{p^{4}}
=\frac{f_{1}(p)}{2}\frac{\lambda_{K}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}-\frac{f_{1}(p)}{2}\frac{\lambda_{K+1}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}+\left(\frac{f_{1}(p)}{2}\frac{\lambda_{K}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}+\frac{f_{1}(p)}{2}\frac{\lambda_{K+1}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}-f_{2}(p)\frac{\lambda_{K+1}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}\right)-\frac{1}{p^{4}}
\geq\frac{f_{1}(p)}{2}\frac{\lambda_{K}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}-\frac{f_{1}(p)}{2}\frac{\lambda_{K+1}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}+\frac{\lambda_{K+1}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}\left(f_{1}(p)(1+\delta)-f_{2}(p)\right)-\frac{1}{p^{4}}
\geq\frac{1}{4}\frac{\lambda_{K}(\bm{\Sigma})-\lambda_{K+1}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})}-\frac{1}{p^{4}}\ (\text{for $p$ large enough})\geq\frac{1}{c_{0}}\frac{\lambda_{K}(\bm{\Sigma})-\lambda_{K+1}(\bm{\Sigma})}{\text{Tr}(\bm{\Sigma})},

where the last inequality holds because:

\frac{\text{Tr}(\bm{\Sigma})}{\lambda_{K}(\bm{\Sigma})-\lambda_{K+1}(\bm{\Sigma})}\leq\frac{p\lambda_{1}}{\delta\lambda_{K+1}}\leq p\frac{\lambda_{1}}{\delta\lambda_{p}}=O(p^{1+\alpha})=o(p^{4}).

Consequently, there exists a large $p_{0}$ such that when $p\geq p_{0}$:

\frac{1}{\lambda_{K}(\textbf{K})-\lambda_{K+1}(\textbf{K})}\leq c_{0}\frac{\text{Tr}(\bm{\Sigma})}{\lambda_{K}(\bm{\Sigma})-\lambda_{K+1}(\bm{\Sigma})}=c_{0}\kappa r.

Lemma A.3.

When $\|\bm{\Sigma}^{*}-\bm{V}_{K}\bm{V}_{K}^{T}\|_{2}\leq\frac{1}{4}$, we have:

\frac{1}{2}\leq\lambda_{K}(\bm{\Sigma}^{*})-\lambda_{K+1}(\bm{\Sigma}^{*})\leq\frac{3}{2}.
Proof.

Since $\bm{V}_{K}^{T}\bm{V}_{K}=\mathbf{I}_{K}$, we have:

\lambda_{K}(\bm{V}_{K}\bm{V}_{K}^{T})=1,\quad\lambda_{K+1}(\bm{V}_{K}\bm{V}_{K}^{T})=0.

By Weyl's theorem,

\lambda_{K+1}(\bm{\Sigma}^{*})-\lambda_{K+1}(\bm{V}_{K}\bm{V}_{K}^{T})\leq\|\bm{\Sigma}^{*}-\bm{V}_{K}\bm{V}_{K}^{T}\|_{2}\leq\frac{1}{4},\quad\text{hence}\quad\lambda_{K+1}(\bm{\Sigma}^{*})\leq\frac{1}{4}.

Similarly,

\lambda_{K}(\bm{\Sigma}^{*})\leq\frac{5}{4},\quad\lambda_{K}(\bm{\Sigma}^{*})\geq\frac{3}{4},\quad\lambda_{K+1}(\bm{\Sigma}^{*})\geq-\frac{1}{4}.

Hence,

\frac{1}{2}\leq\lambda_{K}(\bm{\Sigma}^{*})-\lambda_{K+1}(\bm{\Sigma}^{*})\leq\frac{3}{2}.

Lemma A.4.

Let $k(\cdot,\cdot):\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}^{d\times d}$ be a matrix-valued kernel function. Let $\bm{X}_{1},\ldots,\bm{X}_{n}$ be $n$ independent observations of a random variable $\bm{X}\in\mathcal{X}$. Suppose that, for any $i\neq i^{\prime}\in\left\{1,\ldots,n\right\}$, $\mathbb{E}k(\bm{X}_{i},\bm{X}_{i^{\prime}})$ exists and there exist two constants $R_{1},R_{2}>0$ such that

\|k(\bm{X}_{i},\bm{X}_{i^{\prime}})-\mathbb{E}k(\bm{X}_{i},\bm{X}_{i^{\prime}})\|_{2}\leq R_{1}\ \text{and}\ \|\mathbb{E}\{k(\bm{X}_{i},\bm{X}_{i^{\prime}})-\mathbb{E}k(\bm{X}_{i},\bm{X}_{i^{\prime}})\}^{2}\|_{2}\leq R_{2}.

We then have

\mathbb{P}\left(\left\|\frac{1}{\binom{n}{2}}\sum_{i<i^{\prime}}k(\bm{X}_{i},\bm{X}_{i^{\prime}})-\mathbb{E}k(\bm{X}_{i},\bm{X}_{i^{\prime}})\right\|_{2}\geq t\right)\leq d\exp\left(-\frac{(n/4)t^{2}}{R_{2}+R_{1}t/3}\right).

The proof of this lemma can be found in Han & Liu (2018).

Lemma A.5.

Let $\bm{X}_{1},\ldots,\bm{X}_{n}\in\mathbb{R}^{p}$ be samples from some elliptical distribution, and denote the population and sample Kendall's tau matrices by $\textbf{K}$ and $\widehat{\textbf{K}}$, respectively. If $\log p=o(n)$, we have:

\Big{\|}\|\widehat{\textbf{K}}-\textbf{K}\|_{2}\Big{\|}_{\psi_{1}}\lesssim\sqrt{\frac{\log p}{n}}.
Proof.

It is easily seen that $\text{Tr}(\textbf{K})=\text{Tr}(\widehat{\textbf{K}})=1$. Since $\textbf{K}$ and $\widehat{\textbf{K}}$ are positive semidefinite matrices, we have $\|\textbf{K}\|_{2}\leq 1$, $\|\widehat{\textbf{K}}\|_{2}\leq 1$, and thus $\|\textbf{K}-\widehat{\textbf{K}}\|_{2}\leq 2$.

Note that the kernel function of the Kendall's tau matrix is:

k(\bm{X}_{i},\bm{X}_{i^{\prime}}):=\frac{(\bm{X}_{i}-\bm{X}_{i^{\prime}})(\bm{X}_{i}-\bm{X}_{i^{\prime}})^{T}}{\|\bm{X}_{i}-\bm{X}_{i^{\prime}}\|_{2}^{2}}.

By some simple algebra,

\left\|\frac{(\bm{X}_{i}-\bm{X}_{i^{\prime}})(\bm{X}_{i}-\bm{X}_{i^{\prime}})^{T}}{\|\bm{X}_{i}-\bm{X}_{i^{\prime}}\|_{2}^{2}}-\textbf{K}\right\|_{2}\leq\left\|\frac{(\bm{X}_{i}-\bm{X}_{i^{\prime}})(\bm{X}_{i}-\bm{X}_{i^{\prime}})^{T}}{\|\bm{X}_{i}-\bm{X}_{i^{\prime}}\|_{2}^{2}}\right\|_{2}+\|\textbf{K}\|_{2}=\text{Tr}\left(\frac{(\bm{X}_{i}-\bm{X}_{i^{\prime}})(\bm{X}_{i}-\bm{X}_{i^{\prime}})^{T}}{\|\bm{X}_{i}-\bm{X}_{i^{\prime}}\|_{2}^{2}}\right)+\|\textbf{K}\|_{2}=1+\|\textbf{K}\|_{2},
\|\mathbb{E}\{k(\bm{X}_{i},\bm{X}_{i^{\prime}})-\mathbb{E}k(\bm{X}_{i},\bm{X}_{i^{\prime}})\}^{2}\|_{2}\leq\|\textbf{K}\|_{2}+\|\textbf{K}\|_{2}^{2}.

According to Lemma A.4, taking $R_{1}=2$ and $R_{2}=2$, we obtain

\mathbb{P}(\|\widehat{\textbf{K}}-\textbf{K}\|_{2}\geq t)\leq p\,\exp\left(-\frac{(n/4)t^{2}}{2+2t/3}\right),\ \forall t\geq 0.

We need to find a function $f(n,p)$ of $n$ and $p$ which satisfies $f(n,p)\leq 2$ and

p\,\exp\left(-\frac{(n/4)t^{2}}{2+2t/3}\right)\leq\exp\left(1-\frac{t}{f(n,p)}\right),\ \forall t\in\left[f(n,p),2\right].\qquad(A.1)

It is equivalent to:

\frac{1}{f(n,p)}\leq\mathop{\min}\limits_{t\in\left[f(n,p),2\right]}\left(\frac{1-\log p}{t}+\frac{nt/4}{2+2t/3}\right).

It is easy to see that the right-hand side is an increasing function of $t$, so the last inequality is equivalent to:

\left\{\begin{aligned} \frac{1}{f(n,p)}&\leq\frac{1-\log p}{f(n,p)}+\frac{nf(n,p)/4}{2+2f(n,p)/3},\\ f(n,p)&\leq 2.\end{aligned}\right.

Let $f(n,p)=\left(\sqrt{\frac{36\log p}{n}}\right)\wedge 2$; it is easy to verify that this choice satisfies (A.1). Therefore,

\mathbb{P}(\|\widehat{\textbf{K}}-\textbf{K}\|_{2}\geq t)\leq\exp\left(1-\frac{t}{\left(\sqrt{\frac{36\log p}{n}}\right)\wedge 2}\right),\ \forall t\geq 0.

By (5.14) in Vershynin (2010), we have

\Big{\|}\|\widehat{\textbf{K}}-\textbf{K}\|_{2}\Big{\|}_{\psi_{1}}\leq\left(\sqrt{\frac{36\log p}{n}}\right)\wedge 2\lesssim\sqrt{\frac{\log p}{n}}.

Appendix B Proof of the Main Theorems

Proof of Theorem 4.1

Proof.

According to the Davis-Kahan theorem in Yu et al. (2015), we have:

\bm{\rho}(\widetilde{\bm{V}}_{K},\bm{V}_{K}^{*})\lesssim\frac{\|\widetilde{\bm{\Sigma}}-\bm{\Sigma}^{*}\|_{F}}{\lambda_{K}(\bm{\Sigma}^{*})-\lambda_{K+1}(\bm{\Sigma}^{*})},
\Big{\|}\bm{\rho}(\widetilde{\bm{V}}_{K},\bm{V}_{K}^{*})\Big{\|}_{\psi_{1}}\leq\sqrt{2}\Big{\|}\|\widetilde{\bm{\Sigma}}-\bm{\Sigma}^{*}\|_{F}\Big{\|}_{\psi_{1}}/\delta=\sqrt{2}\Big{\|}\|\frac{1}{m}\sum_{i=1}^{m}\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)T}-\bm{\Sigma}^{*}\|_{F}\Big{\|}_{\psi_{1}}/\delta\leq\sqrt{2}\Big{\|}\frac{1}{m}\sum_{i=1}^{m}\|\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)T}-\bm{\Sigma}^{*}\|_{F}\Big{\|}_{\psi_{1}}/\delta.

Note that $\|\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)T}-\bm{\Sigma}^{*}\|_{F}\leq\sqrt{K}+\|\bm{\Sigma}^{*}\|_{F}<\infty$ is a bounded random variable; thus it is sub-Gaussian and naturally has a finite Orlicz $\psi_{1}$ norm. Hence, according to Lemma 4 in Fan et al. (2019),

\Big{\|}\frac{1}{m}\sum_{i=1}^{m}\|\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)T}-\bm{\Sigma}^{*}\|_{F}\Big{\|}_{\psi_{1}}\lesssim\sqrt{\frac{\sum_{i=1}^{m}\Big{(}\Big{\|}\|\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)T}-\bm{\Sigma}^{*}\|_{F}\Big{\|}_{\psi_{1}}\Big{)}^{2}}{m^{2}}}\leq\frac{1}{\sqrt{m}}\max\limits_{l}\Big{\|}\|\widehat{\bm{V}}_{K}^{(l)}\widehat{\bm{V}}_{K}^{(l)T}-\bm{\Sigma}^{*}\|_{F}\Big{\|}_{\psi_{1}}.

In fact, we have

\Big{\|}\|\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)T}-\bm{\Sigma}^{*}\|_{F}\Big{\|}_{\psi_{1}}\leq\Big{\|}\|\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)T}-\bm{V}_{K}\bm{V}_{K}^{T}\|_{F}\Big{\|}_{\psi_{1}}+\|\bm{\Sigma}^{*}-\bm{V}_{K}\bm{V}_{K}^{T}\|_{F}\leq\|\bm{\rho}(\widehat{\bm{V}}_{K}^{(i)},\bm{V}_{K})\|_{\psi_{1}}+\mathbb{E}(\|\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)T}-\bm{V}_{K}\bm{V}_{K}^{T}\|_{F})\leq 2\|\bm{\rho}(\bm{V}_{K},\widehat{\bm{V}}_{K}^{(i)})\|_{\psi_{1}},\ \forall i\in[m],

where the second inequality is by Jensen’s inequality. Combining

\bm{\rho}(\bm{V}_{K},\widehat{\bm{V}}_{K}^{(i)})=\|\widehat{\bm{V}}_{K}^{(i)}\widehat{\bm{V}}_{K}^{(i)T}-\bm{V}_{K}\bm{V}_{K}^{T}\|_{F}=\sqrt{2}\|\sin\Theta(\widehat{\bm{V}}_{K}^{(i)},\bm{V}_{K})\|_{F},

and the generalized Davis-Kahan theorem in Yu et al. (2015), which is

\|\sin\Theta(\widehat{\bm{V}}_{K}^{(i)},\bm{V}_{K})\|_{F}\leq\frac{2\sqrt{K}\|\widehat{\textbf{K}}-\textbf{K}\|_{2}}{\lambda_{K}(\textbf{K})-\lambda_{K+1}(\textbf{K})},

we have

\|\bm{\rho}(\bm{V}_{K},\widehat{\bm{V}}_{K}^{(i)})\|_{\psi_{1}}\leq\frac{2\sqrt{2K}}{\lambda_{K}(\textbf{K})-\lambda_{K+1}(\textbf{K})}\Big{\|}\|\widehat{\textbf{K}}-\textbf{K}\|_{2}\Big{\|}_{\psi_{1}}.

Hence, combining all of the equations above and Lemma A.5, we have

\|\bm{\rho}(\widetilde{\bm{V}}_{K},\bm{V}_{K}^{*})\|_{\psi_{1}}\leq 48\sqrt{\frac{K\log p}{m\cdot n}}\frac{1}{\delta\cdot(\lambda_{K}(\textbf{K})-\lambda_{K+1}(\textbf{K}))}.\qquad(B.1)

According to Lemma A.2 and equation (B.1), it is not hard to show that

\|\bm{\rho}(\widetilde{\bm{V}}_{K},\bm{V}_{K}^{*})\|_{\psi_{1}}\lesssim\kappa rK^{3/2}\left(\frac{\log p}{N}\right)^{1/2}.

Proof of Theorem 4.2

Proof.

Define $\textbf{E}=\widehat{\textbf{K}}-\textbf{K}$ and $\Delta=\lambda_{K}(\textbf{K})-\lambda_{K+1}(\textbf{K})$. According to Theorem 3 in Fan et al. (2019), we have

\bm{\rho}(\bm{V}_{K}^{*},\bm{V}_{K})\leq\|\mathbb{E}(\widehat{\bm{V}}_{K}^{(1)}\widehat{\bm{V}}_{K}^{(1)T})-\bm{V}_{K}\bm{V}_{K}^{T}\|_{F}\lesssim\sqrt{K}\,\mathbb{E}\left(\frac{\|\textbf{E}\|_{2}}{\Delta}\right)^{2}\lesssim\sqrt{K}\Delta^{-2}\Big{\|}\|\textbf{E}\|_{2}\Big{\|}_{\psi_{1}}^{2}\lesssim\kappa^{2}r^{2}K^{5/2}\frac{m\log p}{N}.

The final inequality is derived from Lemma A.5. ∎

Appendix C Additional Simulation results

Figure 4 shows the errors of D-PCA, D-ECA and F-ECA with varying $m$ and fixed $p=50$ under different distributions. It is obvious from Figure 4 that the error of D-PCA is much higher than those of D-ECA and F-ECA under the $t$-distribution, and the differences increase as the degrees of freedom decline, which again demonstrates that distributed ECA is a much more robust method than distributed PCA. Meanwhile, the errors of D-ECA and F-ECA are very close. Thus we confirm the conclusion that distributed ECA is a computationally efficient method that loses little accuracy. In conclusion, when data are stored across different machines for various reasons including communication cost, privacy and data security, our D-ECA can be regarded as an alternative to ECA based on the full samples.

Figure 4: The errors $\rho_{1}(\bm{L},\widehat{\bm{L}})$ versus the number of machines $m$ for distributed PCA, distributed ECA and full sample ECA under different distributions with fixed $p=50$.