
Data-fusion using factor analysis and low-rank matrix completion

Daniel Ahfock, School of Mathematics and Physics, University of Queensland (d.ahfock@uq.edu.au)
Saumyadipta Pyne, Public Health Dynamics Laboratory and Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, PA, USA; Health Analytics Network, PA, USA
Geoffrey J. McLachlan, School of Mathematics and Physics, University of Queensland
Abstract

Data-fusion involves the integration of multiple related datasets. The statistical file-matching problem is a canonical data-fusion problem in multivariate analysis, where the objective is to characterise the joint distribution of a set of variables when only marginal distributions over strict subsets of the variables have been observed. Estimation of the covariance matrix of the full set of variables is challenging given the missing-data pattern. Factor analysis models use lower-dimensional latent variables in the data-generating process, and this introduces low-rank components in the complete-data matrix and the population covariance matrix. The low-rank structure of the factor analysis model can be exploited to estimate the full covariance matrix from incomplete data via low-rank matrix completion. We prove the identifiability of the factor analysis model in the statistical file-matching problem under conditions on the number of factors and the number of shared variables over the observed marginal subsets. Additionally, we provide an EM algorithm for parameter estimation. On several real datasets, the factor model gives smaller reconstruction errors in file-matching problems than common approaches for low-rank matrix completion.

1 Introduction

Data-fusion involves the joint modelling of multiple related datasets, with the aim of lifting the quality of inference above what could be obtained from isolated separate analyses of each dataset. The statistical file-matching problem is a classic data-fusion task, where all observations are from the same population, with the caveat that different sets of variables are recorded in each individual dataset (Little and Rubin, 2002; Rässler, 2002; D’Orazio et al., 2006a). The core problem can be reduced to the analysis of two datasets, labelled A and B, with three groups of variables denoted $\bm{X}, \bm{Y}$, and $\bm{Z}$. Dataset A contains observations on the $(\bm{X},\bm{Y})$ variables and Dataset B contains observations on the $(\bm{X},\bm{Z})$ variables. Table 1 describes the missing-data pattern. The statistical file-matching problem occurs in bioinformatics, for example in flow cytometry analysis, in the synthesis of multiplexed data collected using different panels of markers (Pedreira et al., 2008; Lee et al., 2011; O’Neill et al., 2015; Abdelaal et al., 2019). Other important application areas include survey sampling and data integration in official statistics (Conti et al., 2016; D’Orazio, 2019).

Table 1: Missing-data pattern in the statistical file-matching problem

            $\bm{X}$    $\bm{Y}$    $\bm{Z}$
Dataset A   observed    observed    missing
Dataset B   observed    missing     observed

We assume that dataset A and dataset B are samples from the same homogeneous population. For example, in flow cytometry analysis a single blood sample from a patient may be divided into two aliquots which are then analysed using markers $(\bm{X},\bm{Y})$ and $(\bm{X},\bm{Z})$ respectively. Technological limitations prevent joint measurements on the $\bm{Y}$ and $\bm{Z}$ markers. In survey sampling, one random subset of the population may receive a questionnaire with items $(\bm{X},\bm{Y})$ and a second random subset of the population may be given items $(\bm{X},\bm{Z})$. This situation may arise when integrating results from related surveys, or in the design of a single survey where there is a constraint on the response burden.

Key linear relationships between $\bm{X}, \bm{Y}$, and $\bm{Z}$ are encoded in the covariance matrix $\bm{\Sigma}$:

\[
\text{cov}\begin{pmatrix}\bm{X}\\ \bm{Y}\\ \bm{Z}\end{pmatrix} = \bm{\Sigma} = \begin{pmatrix}\bm{\Sigma}_{XX}&\bm{\Sigma}_{XY}&\bm{\Sigma}_{XZ}\\ \bm{\Sigma}_{YX}&\bm{\Sigma}_{YY}&\bm{\Sigma}_{YZ}\\ \bm{\Sigma}_{ZX}&\bm{\Sigma}_{ZY}&\bm{\Sigma}_{ZZ}\end{pmatrix}. \tag{1}
\]

A fundamental task in the statistical file-matching problem is the estimation of $\bm{\Sigma}_{YZ}$ from incomplete data (Rässler, 2002; D’Orazio et al., 2006a). A common objective in file-matching is to impute the missing observations so that complete-data techniques can be used in downstream tasks. Recovery of $\bm{\Sigma}_{YZ}$ is crucial to generating proper imputations (Little, 1988; Van Buuren, 2018). The covariance matrix is also important for Gaussian graphical modelling, and the recovery of a network model from incomplete information is an interesting problem (Sachs et al., 2009).

The critical issue in the statistical file-matching problem is how to estimate $\bm{\Sigma}_{YZ}$ without any joint observations on the $\bm{Y}$ and $\bm{Z}$ variables. Factor analysis models are useful in data-fusion tasks as they provide structured covariance matrices that allow the sharing of information across variables (Li and Jung, 2017; O’Connell and Lock, 2019; Park and Lock, 2020). In the statistical file-matching problem, this can facilitate the estimation of $\bm{\Sigma}_{YZ}$ from the observed $(\bm{X},\bm{Y})$ and $(\bm{X},\bm{Z})$ associations. Let $p_X, p_Y$, and $p_Z$ denote the dimensions of $\bm{X}$, $\bm{Y}$, and $\bm{Z}$ respectively, with $p = p_X + p_Y + p_Z$. A factor analysis model consists of a $p\times q$ matrix of factor loadings $\bm{\Lambda}$ and a diagonal $p\times p$ matrix $\bm{\Psi}$ of uniqueness terms, giving the structured covariance matrix $\bm{\Sigma} = \bm{\Lambda}\bm{\Lambda}^{\mathsf{T}} + \bm{\Psi}$. The full covariance matrix $\bm{\Sigma}$ can be written as

\[
\bm{\Sigma} = \bm{\Lambda}\bm{\Lambda}^{\mathsf{T}} + \bm{\Psi} = \begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}, \tag{2}
\]

where $\bm{\Lambda}$ and $\bm{\Psi}$ have been partitioned as

\[
\bm{\Lambda} = \begin{pmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\\ \bm{\Lambda}_{Z}\end{pmatrix}, \quad \bm{\Psi} = \begin{pmatrix}\bm{\Psi}_{X}&\bm{0}&\bm{0}\\ \bm{0}&\bm{\Psi}_{Y}&\bm{0}\\ \bm{0}&\bm{0}&\bm{\Psi}_{Z}\end{pmatrix},
\]

where $\bm{\Lambda}_X$ is a $p_X\times q$ matrix of factor loadings for the $\bm{X}$ variables, and $\bm{\Psi}_X$ is a $p_X\times p_X$ diagonal matrix of uniquenesses for the $\bm{X}$ variables. The $\bm{Y}$ and $\bm{Z}$ parameters are partitioned in a similar fashion. Assuming that $\bm{\Lambda}_Y$ and $\bm{\Lambda}_Z$ can be recovered from the observed blocks of the covariance matrix, it is possible to learn $\text{cov}(\bm{Y},\bm{Z})$ as $\bm{\Sigma}_{YZ} = \bm{\Lambda}_Y\bm{\Lambda}_Z^{\mathsf{T}}$.

The file-matching problem can be approached as a low-rank matrix completion task. One may attempt to complete the partial estimate of the covariance matrix or to complete the partially observed data matrix. Generic results for matrix recovery with random missingness patterns do not give clear guidance on the feasibility of statistical file-matching with the fixed and systematic missing-data pattern in Table 1 (Candes and Plan, 2010).

We provide conditions for the identifiability of the factor analysis model given the missing-data pattern in the file-matching problem. The main result is that recovery of $\bm{\Sigma}_{YZ}$ is possible if the number of latent factors $q$ satisfies $q \leq p_X$, $q < (p_X+p_Y)/2$, and $q < (p_X+p_Z)/2$ (Theorem 1). Additionally, we give an EM algorithm (Dempster et al., 1977) for joint estimation of $\bm{\Lambda}$ and $\bm{\Psi}$ with closed-form E- and M-steps. We find that on several real datasets, maximum likelihood factor analysis gives better estimates of $\bm{\Sigma}_{YZ}$ than common existing approaches for low-rank matrix completion.

2 Methods

Here we review some existing approaches that can be used to estimate $\bm{\Sigma}_{YZ}$ in the statistical file-matching problem.

2.1 Conditional independence assumption

A common resolution to the identifiability issue in the file-matching problem is to assume that $\bm{Y}$ and $\bm{Z}$ are conditionally independent given $\bm{X}$. Under this assumption, $\bm{\Sigma}_{YZ} = \bm{\Sigma}_{YX}\bm{\Sigma}_{XX}^{-1}\bm{\Sigma}_{XZ}$. The conditional independence assumption is commonly invoked to justify statistical matching (D’Orazio et al., 2006a; Rässler, 2002). However, it is untestable and can have a large influence on downstream results (Barry, 1988; Rodgers, 1984).
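As a concrete illustration, the CIA estimate can be computed directly from the observed blocks of the pooled sample covariance matrix. The following Python sketch (the function name and block inputs are our own, for illustration) implements the identity above:

    import numpy as np

    def cia_estimate(S_XX, S_XY, S_XZ):
        # Sigma_YZ = Sigma_YX Sigma_XX^{-1} Sigma_XZ under conditional independence.
        # S_XX, S_XY, S_XZ are the corresponding blocks of the sample covariance matrix.
        return S_XY.T @ np.linalg.solve(S_XX, S_XZ)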

2.2 Identified set

To avoid making assumptions, the generative model can be treated as being partially identified (Gustafson, 2015). There is a growing literature focused on developing uncertainty bounds for the matching problem that reflect the limited information in the sample (D’Orazio et al., 2006b; Conti et al., 2012, 2016). For the multivariate normal model, the goal is to estimate a feasible set of values for $\bm{\Sigma}_{YZ}$, rather than to deliver a point estimate (Kadane, 2001; Moriarity and Scheuren, 2001; Ahfock et al., 2016). The restriction that the covariance matrix be positive definite creates an identified set

\[
\Theta_{\text{I}} = \left\{\bm{\Sigma}_{YZ}\in\mathbb{R}^{p_Y\times p_Z} : \begin{pmatrix}\bm{\Sigma}_{XX}&\bm{\Sigma}_{XY}&\bm{\Sigma}_{XZ}\\ \bm{\Sigma}_{YX}&\bm{\Sigma}_{YY}&\bm{\Sigma}_{YZ}\\ \bm{\Sigma}_{ZX}&\bm{\Sigma}_{ZY}&\bm{\Sigma}_{ZZ}\end{pmatrix}\ \text{is positive definite}\right\}. \tag{3}
\]

A disadvantage of the partial identification approach is that the level of uncertainty on $\bm{\Sigma}$ can be very high. Bounds on the parameters can be too wide for any practical interpretation.

2.3 Low-rank matrix completion

Low-rank matrix completion may be applied to fill in the partial estimate of the covariance matrix or to impute the missing observations in the partially observed data matrix. To complete the estimate of the covariance matrix $\widehat{\bm{\Sigma}}$, one may consider the regularised least-squares objective

\[
f(\bm{M},\bm{\Psi}) = \frac{1}{2}\sum_{(i,j)\in\Omega}(\widehat{\bm{\Sigma}}_{ij}-\bm{M}_{ij}-\bm{\Psi}_{ij})^{2} + \lambda\lVert\bm{M}\rVert_{*}, \quad \text{subject to } \bm{M}^{\mathsf{T}} = \bm{M},
\]

where $\lVert\bm{M}\rVert_{*}$ is the nuclear norm of the matrix $\bm{M}$ (Candes and Plan, 2010; Mazumder et al., 2010) and $\Omega$ contains the indices of the known elements of $\widehat{\bm{\Sigma}}$. The main barrier to attempting matrix completion on $\widehat{\bm{\Sigma}}$ is that the presence of the diagonal matrix $\bm{\Psi}$, along with the symmetry constraint on $\bm{M}$, necessitates modifications to existing algorithms for low-rank matrix completion. The algorithm for symmetric low-rank matrix completion with block missingness developed in Bishop and Byron (2014) could be a useful starting point.
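To make the required modifications concrete, the sketch below minimises the objective above by proximal gradient: a gradient step on the observed residuals, singular-value soft-thresholding for the nuclear-norm penalty, symmetrisation of $\bm{M}$, and an exact update of the diagonal $\bm{\Psi}$ (the diagonal of $\widehat{\bm{\Sigma}}$ is always observed under Table 1). This is our own illustrative sketch, with our own choice of step size, not the algorithm of Bishop and Byron (2014):

    import numpy as np

    def complete_covariance(S_hat, mask, lam, n_iter=500, step=1.0):
        # mask is a symmetric 0/1 matrix with 1 marking the observed entries of S_hat
        p = S_hat.shape[0]
        M = np.zeros((p, p))
        psi = np.zeros(p)
        for _ in range(n_iter):
            R = mask * (M + np.diag(psi) - S_hat)     # residual on observed entries
            G = 0.5 * (R + R.T)                       # symmetrised gradient in M
            U, d, Vt = np.linalg.svd(M - step * G, full_matrices=False)
            M = (U * np.maximum(d - step * lam, 0.0)) @ Vt  # soft-threshold singular values
            M = 0.5 * (M + M.T)
            psi = np.diag(S_hat - M).copy()           # exact minimiser for the diagonal
        return M, psi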

A more natural approach in the file-matching problem is to use a low-rank matrix completion algorithm to impute the missing data. Partitioning the complete dataset $\mathcal{D}$ as

\[
\mathcal{D} = \begin{pmatrix}\bm{X}_{A}&\bm{Y}_{A}&\bm{Z}_{A}\\ \bm{X}_{B}&\bm{Y}_{B}&\bm{Z}_{B}\end{pmatrix},
\]

where the first row corresponds to dataset A and the second row corresponds to dataset B, the missing $\bm{Z}_A$ and $\bm{Y}_B$ observations can be estimated by applying matrix completion to $\mathcal{D}$. The modelling assumption is that the complete dataset can be expressed as $\mathcal{D} = \mathcal{D}_0 + \mathcal{E}$ for some rank-$q$ matrix $\mathcal{D}_0$ plus noise $\mathcal{E}$. To estimate $\mathcal{D}_0$, the following least-squares objective can be used:

\[
f(\bm{G},\bm{H}) = \frac{1}{2}\sum_{(i,j)\in\Omega}(\mathcal{D}_{ij}-[\bm{G}\bm{H}]_{ij})^{2},
\]

where $\bm{G}$ is an $n\times q$ matrix, $\bm{H}$ is a $q\times p$ matrix, and $\Omega$ contains the indices of the observed elements in $\mathcal{D}$. With $(\widehat{\bm{G}},\widehat{\bm{H}}) \in \operatorname*{argmin}_{(\bm{G},\bm{H})} f(\bm{G},\bm{H})$, the low-rank estimate of $\mathcal{D}_0$ is

\[
\widehat{\mathcal{D}}_{0} = \widehat{\bm{G}}\widehat{\bm{H}} = \begin{pmatrix}\widehat{\bm{X}}_{A}&\widehat{\bm{Y}}_{A}&\widehat{\bm{Z}}_{A}\\ \widehat{\bm{X}}_{B}&\widehat{\bm{Y}}_{B}&\widehat{\bm{Z}}_{B}\end{pmatrix}. \tag{4}
\]

Using the low-rank reconstructions $\widehat{\bm{Z}}_A$ and $\widehat{\bm{Y}}_B$ as imputed values for the missing data $\bm{Z}_A$ and $\bm{Y}_B$, an estimate of $\bm{\Sigma}_{YZ}$ is then given by

\[
\widehat{\bm{\Sigma}}_{YZ} = \text{cov}\left(\begin{bmatrix}\bm{Y}_{A}\\ \widehat{\bm{Y}}_{B}\end{bmatrix}, \begin{bmatrix}\widehat{\bm{Z}}_{A}\\ \bm{Z}_{B}\end{bmatrix}\right).
\]

The theoretical work in Koltchinskii et al. (2011) can be used to give finite-sample bounds on the reconstruction error $\lVert\mathcal{D}_0 - \widehat{\mathcal{D}}_0\rVert_2^2$ under assumptions on the noise matrix $\mathcal{E}$. However, in the context of the file-matching problem, it is not immediately clear from existing theoretical work whether exact recovery of the true covariance $\bm{\Sigma}_{YZ}$ is possible in the large-sample limit.
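A basic alternating least squares routine is enough to illustrate the data-matrix route. The sketch below minimises $f(\bm{G},\bm{H})$ by solving a small ridge-stabilised least-squares problem for each row of $\bm{G}$ and each column of $\bm{H}$ in turn; the function name and the ridge term are our own choices, not those of the matrix completion packages used in Section 5:

    import numpy as np

    def als_complete(D, obs, q, n_iter=100, ridge=1e-6):
        # D is the n x p data matrix with arbitrary values in the missing cells;
        # obs is a boolean n x p mask marking the observed entries.
        n, p = D.shape
        rng = np.random.default_rng(0)
        G = rng.standard_normal((n, q))
        H = rng.standard_normal((q, p))
        for _ in range(n_iter):
            for i in range(n):                      # row update for G
                c = obs[i]
                A = H[:, c] @ H[:, c].T + ridge * np.eye(q)
                G[i] = np.linalg.solve(A, H[:, c] @ D[i, c])
            for j in range(p):                      # column update for H
                r = obs[:, j]
                A = G[r].T @ G[r] + ridge * np.eye(q)
                H[:, j] = np.linalg.solve(A, G[r].T @ D[r, j])
        return G @ H                                # rank-q estimate of D_0

The imputed blocks $\widehat{\bm{Z}}_A$ and $\widehat{\bm{Y}}_B$ are then read off the returned matrix and plugged into the sample cross-covariance above.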

3 Factor models for statistical file-matching

3.1 Generative model

A probabilistic factor analysis model with $q$ factors has the latent variable representation

\[
\begin{pmatrix}\bm{X}\\ \bm{Y}\\ \bm{Z}\end{pmatrix} = \begin{pmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\\ \bm{\Lambda}_{Z}\end{pmatrix}\bm{F} + \bm{\epsilon}, \tag{5}
\]

where $\bm{F}\sim\mathcal{N}(\bm{0},\bm{I}_q)$, $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{\Psi})$, and

\[
\bm{\Psi} = \begin{pmatrix}\bm{\Psi}_{X}&\bm{0}&\bm{0}\\ \bm{0}&\bm{\Psi}_{Y}&\bm{0}\\ \bm{0}&\bm{0}&\bm{\Psi}_{Z}\end{pmatrix}.
\]

Kamakura and Wedel (2000) discuss the application of factor analysis models to the statistical file-matching problem, but do not consider the identifiability of the model, or develop an algorithm for parameter estimation. We address both of these important points. We prove the identifiability of the factor analysis model in the file-matching problem under the following assumptions.

Assumption 1.

The matrix of factor loadings for the $\bm{X}$ variables, $\bm{\Lambda}_X$, is of rank $q$.

Assumption 2.

Define

\[
\bm{\Lambda}_{A} = \begin{pmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\end{pmatrix}, \quad \bm{\Lambda}_{B} = \begin{pmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Z}\end{pmatrix}.
\]

For both $\bm{\Lambda}_A$ and $\bm{\Lambda}_B$, if any row is removed, there remain two disjoint submatrices of rank $q$.

Assumption 2 is based on the sufficient conditions for the identifiability of the factor analysis model given in Anderson and Rubin (1956). For the file-matching problem, Assumption 2 enforces the requirement that $q < (p_X+p_Y)/2$ and $q < (p_X+p_Z)/2$. Assumption 1 requires that $q \leq p_X$.

3.2 Identifiability

The key to recovery of $\bm{\Sigma}_{YZ}$ is that the matrix $\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}$ is of rank $q \leq p_X$. Lemma 1 shows that it is possible to complete the matrix $\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}$ given the missing-data pattern in Table 1.

Lemma 1.

Suppose that Assumption 1 is satisfied. If all blocks other than $\bm{\Lambda}_Y\bm{\Lambda}_Z^{\mathsf{T}}$ and $\bm{\Lambda}_Z\bm{\Lambda}_Y^{\mathsf{T}}$ of

\[
\bm{M} = \begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}
\]

are observed, then it is possible to recover $\bm{\Lambda}_Y\bm{\Lambda}_Z^{\mathsf{T}}$ from the observed elements. Consider the eigenvalue decomposition of the two sub-blocks of $\bm{M}$ that are relevant to dataset A and dataset B:

\[
\bm{V}_{A}\bm{D}_{A}\bm{V}_{A}^{\mathsf{T}} = \begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}, \quad \bm{V}_{B}\bm{D}_{B}\bm{V}_{B}^{\mathsf{T}} = \begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}.
\]

The factor loadings for the marginal models for dataset A and dataset B in the canonical rotation are

\[
\begin{pmatrix}\bm{\Lambda}_{X}^{A}\\ \bm{\Lambda}_{Y}^{A}\end{pmatrix} = \bm{V}_{A}\bm{D}_{A}^{1/2}, \quad \begin{pmatrix}\bm{\Lambda}_{X}^{B}\\ \bm{\Lambda}_{Z}^{B}\end{pmatrix} = \bm{V}_{B}\bm{D}_{B}^{1/2}.
\]

To recover $\bm{\Lambda}_Y\bm{\Lambda}_Z^{\mathsf{T}}$ from the observed elements, set

\[
\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}} = \bm{\Lambda}_{Y}^{A}(\bm{\Lambda}_{Z}^{B}\bm{R})^{\mathsf{T}},
\]

where $\bm{R} = \bm{W}\bm{Q}^{\mathsf{T}}$, and $\bm{W}$ and $\bm{Q}$ are the left and right singular vectors of the matrix $(\bm{\Lambda}_{X}^{B})^{\mathsf{T}}\bm{\Lambda}_{X}^{A} = \bm{W}\bm{D}\bm{Q}^{\mathsf{T}}$.

The proof is given in the Appendix. Bishop and Byron (2014) establish more general results for the low-rank completion of symmetric matrices with structured missingness. Although Lemma 1 is sufficient for low-rank matrix completion problems, identifiability of the factor analysis model is a more complex issue. Specifically, for two parameter sets $\{\bm{\Lambda},\bm{\Psi}\}$ and $\{\bm{\Lambda}^{*},\bm{\Psi}^{*}\}$, the equality $\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi} = \bm{\Lambda}^{*}\bm{\Lambda}^{*\mathsf{T}}+\bm{\Psi}^{*}$ does not imply that $\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}} = \bm{\Lambda}^{*}\bm{\Lambda}^{*\mathsf{T}}$ (Anderson and Rubin, 1956; Shapiro, 1985). Assumption 2 allows us to make a statement about the identifiability of the factor analysis model in the file-matching problem.
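The construction in Lemma 1 is directly computable. The sketch below takes the low-rank parts $\bm{\Lambda}_A\bm{\Lambda}_A^{\mathsf{T}}$ and $\bm{\Lambda}_B\bm{\Lambda}_B^{\mathsf{T}}$ of the two marginal covariance matrices (uniquenesses already removed), extracts the canonical factors from the top-$q$ eigenpairs, and aligns the shared $\bm{X}$ blocks by orthogonal Procrustes; the function names are our own:

    import numpy as np

    def canonical_factors(S, q):
        # canonical rotation Lambda = V D^{1/2} from the top-q eigenpairs of S
        vals, vecs = np.linalg.eigh(S)
        idx = np.argsort(vals)[::-1][:q]
        return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

    def recover_cross_block(LLt_A, LLt_B, p_x, q):
        L_A = canonical_factors(LLt_A, q)            # rows ordered as (X, Y)
        L_B = canonical_factors(LLt_B, q)            # rows ordered as (X, Z)
        L_XA, L_YA = L_A[:p_x], L_A[p_x:]
        L_XB, L_ZB = L_B[:p_x], L_B[p_x:]
        W, _, Qt = np.linalg.svd(L_XB.T @ L_XA)      # Procrustes alignment of the X blocks
        R = W @ Qt
        return L_YA @ (L_ZB @ R).T                   # equals Lambda_Y Lambda_Z^T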

Theorem 1.

Suppose that Assumptions 1 and 2 are satisfied. Consider two parameter sets

\[
\bm{\Theta}_{1} = \left\{\begin{pmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\\ \bm{\Lambda}_{Z}\end{pmatrix}, \begin{pmatrix}\bm{\Psi}_{X}&\bm{0}&\bm{0}\\ \bm{0}&\bm{\Psi}_{Y}&\bm{0}\\ \bm{0}&\bm{0}&\bm{\Psi}_{Z}\end{pmatrix}\right\}, \quad \bm{\Theta}_{2} = \left\{\begin{pmatrix}\bm{\Lambda}_{X}^{*}\\ \bm{\Lambda}_{Y}^{*}\\ \bm{\Lambda}_{Z}^{*}\end{pmatrix}, \begin{pmatrix}\bm{\Psi}_{X}^{*}&\bm{0}&\bm{0}\\ \bm{0}&\bm{\Psi}_{Y}^{*}&\bm{0}\\ \bm{0}&\bm{0}&\bm{\Psi}_{Z}^{*}\end{pmatrix}\right\}.
\]

If

\[
\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix} = \begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}+\bm{\Psi}_{X}^{*}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}\\ \bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}+\bm{\Psi}_{Y}^{*}\end{pmatrix}, \tag{6}
\]
\[
\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix} = \begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}+\bm{\Psi}_{X}^{*}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}+\bm{\Psi}_{Z}^{*}\end{pmatrix}, \tag{7}
\]

then the factor loadings in $\bm{\Theta}_1$ and $\bm{\Theta}_2$ are equal up to post-multiplication by an orthogonal matrix. That is,

\[
\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix} = \begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\end{pmatrix}.
\]

The proof is given in the Appendix. In terms of the generative model in Section 3.1, Theorem 1 states that if parameter sets $\bm{\Theta}_1$ and $\bm{\Theta}_2$ satisfy $f(\bm{x},\bm{y};\bm{\Theta}_1) = f(\bm{x},\bm{y};\bm{\Theta}_2)$ and $f(\bm{x},\bm{z};\bm{\Theta}_1) = f(\bm{x},\bm{z};\bm{\Theta}_2)$, then the factor loadings in $\bm{\Theta}_1$ and $\bm{\Theta}_2$ are equal up to post-multiplication by an orthogonal matrix. The conclusion is that recovery of $\bm{\Sigma}_{YZ}$ in the file-matching scenario is possible given the observable sub-blocks of the population covariance matrix $\bm{\Sigma}$. Assuming that the sample covariance matrices converge in probability to the population quantities as $n_A, n_B \to\infty$, consistent estimation of $\bm{\Sigma}_{YZ}$ is possible.

Under mild regularity conditions on the factor loadings, $q < p/2$ is sufficient for identifiability of the factor analysis model with complete-cases (Anderson and Rubin, 1956; Shapiro, 1985). In the file-matching scenario, Assumption 2 enforces the stronger condition that $q < (p_X+p_Y)/2$ and $q < (p_X+p_Z)/2$. If the number of factors $q$ falls within the intervals $(p_X+p_Y)/2 < q < p/2$ and $(p_X+p_Z)/2 < q < p/2$, Assumption 2 is violated. However, one may hope that the model is still identifiable, given that consistent estimation is possible with complete-cases.

Some insight can be gained by comparing the number of equations and the number of unknowns (Ledermann, 1937). There are $pq$ unknown parameters in $\bm{\Lambda}$ and $p$ unknown parameters in $\bm{\Psi}$. Given the symmetry in $\bm{\Sigma} = \bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$, there are $p(p+1)/2$ equations, and the adoption of a rotation constraint on $\bm{\Lambda}$ yields an additional $q(q-1)/2$ equations (Anderson and Rubin, 1956). The number of equations minus the number of unknowns is $C = [(p-q)^2 - p - q]/2$. A useful rule of thumb is to expect the factor analysis model to be identifiable if $C \geq 0$, and nonidentifiable if $C < 0$ (Ledermann, 1937; Anderson and Rubin, 1956). Bekker and ten Berge (1997) show that $C > 0$ is sufficient for the identifiability of the factor model in most circumstances. In the file-matching problem, the equality constraints due to $\bm{\Sigma}_{YZ}$ are not present and the degrees of freedom are reduced to $C_M = [(p-q)^2 - p - q]/2 - p_Y p_Z$. If $C_M$ is negative and $C$ is positive, it may signal a situation where the missing data has caused the factor model to be nonidentifiable.
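These counting rules are immediate to evaluate. A small helper (the function names are our own) reproduces the entries of Table 2 by returning the largest $q$ with non-negative degrees of freedom:

    def dof_complete(p, q):
        # C = [(p - q)^2 - p - q] / 2
        return ((p - q) ** 2 - p - q) / 2

    def dof_matching(p, q, p_y, p_z):
        # C_M = C - p_y * p_z, removing the equations supplied by Sigma_YZ
        return dof_complete(p, q) - p_y * p_z

    def max_factors(p, p_y, p_z):
        q_c = max(q for q in range(p) if dof_complete(p, q) >= 0)
        q_m = max(q for q in range(p) if dof_matching(p, q, p_y, p_z) >= 0)
        return q_c, q_m

For example, max_factors(12, 4, 4) returns (7, 5), matching the $p = 12$ column of Table 2.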

Table 2 reports the maximum number of allowable factors $q$ using the degrees of freedom criteria $C$ and $C_M$, and the restrictions of Assumption 2, in a file-matching problem with $p_X = p_Y = p_Z = p/3$. Assumption 2 places stronger requirements on the number of factors $q$ than the degrees of freedom requirement $C_M$. The number of allowable factors using $C_M$ is smaller than using $C$, demonstrating the information loss due to the missing data.

Table 2: Maximum number of factors $q$ according to different identifiability criteria with $p_X = p_Y = p_Z = p/3$. The complete-cases degrees of freedom is given by $C = [(p-q)^2 - p - q]/2$. The file-matching degrees of freedom is given by $C_M = C - p_Y p_Z$.

Criterion                            Quantity     p = 3    6    9   12   15   18   21
Degrees of freedom, complete-cases   $C$              1    3    5    7   10   12   15
Degrees of freedom, file-matching    $C_M$            0    2    3    5    6    8    9
Assumption 2                         $q < p/3$        0    1    2    3    4    5    6

3.3 Estimation

As mentioned in Section 2.3, procedures for low-rank matrix completion focus on the estimation of $\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}$ rather than the joint estimation of $\bm{\Lambda}$ and $\bm{\Psi}$. A maximum likelihood approach to the factor analysis model facilitates joint estimation of $\bm{\Lambda}$ and $\bm{\Psi}$. With complete-cases, the EM updates for factor analysis can be written in closed form (Rubin and Thayer, 1982). Let $\bm{S}$ denote the sample scatter matrix

\[
\bm{S} = \sum_{i=1}^{n}\begin{pmatrix}\bm{x}_{i}\\ \bm{y}_{i}\\ \bm{z}_{i}\end{pmatrix}\begin{pmatrix}\bm{x}_{i}\\ \bm{y}_{i}\\ \bm{z}_{i}\end{pmatrix}^{\mathsf{T}}. \tag{8}
\]

Given current parameter estimates of $\bm{\Lambda}$ and $\bm{\Psi}$, define $\bm{\beta} \equiv \bm{\Lambda}^{\mathsf{T}}(\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi})^{-1}$. The M-step is

\[
\bm{\Lambda}^{\text{new}} = (\bm{S}\bm{\beta}^{\mathsf{T}})\left(n\bm{I} - n\bm{\beta}\bm{\Lambda} + \bm{\beta}\bm{S}\bm{\beta}^{\mathsf{T}}\right)^{-1}, \tag{9}
\]
\[
\bm{\Psi}^{\text{new}} = \frac{1}{n}\,\text{diag}\left(\bm{S} - \bm{\Lambda}^{\text{new}}\bm{\beta}\bm{S}\right). \tag{10}
\]

For the file-matching problem we need to compute the conditional expectation of (8) given the missing-data pattern. The observed scatter matrices in dataset A and dataset B are

\[
\bm{P} = \sum_{i=1}^{n_A}\begin{pmatrix}\bm{x}_{i}\\ \bm{y}_{i}\end{pmatrix}\begin{pmatrix}\bm{x}_{i}\\ \bm{y}_{i}\end{pmatrix}^{\mathsf{T}} = \begin{pmatrix}\bm{P}_{XX}&\bm{P}_{XY}\\ \bm{P}_{YX}&\bm{P}_{YY}\end{pmatrix}, \tag{11}
\]
\[
\bm{T} = \sum_{i=1}^{n_B}\begin{pmatrix}\bm{x}_{i}\\ \bm{z}_{i}\end{pmatrix}\begin{pmatrix}\bm{x}_{i}\\ \bm{z}_{i}\end{pmatrix}^{\mathsf{T}} = \begin{pmatrix}\bm{T}_{XX}&\bm{T}_{XZ}\\ \bm{T}_{ZX}&\bm{T}_{ZZ}\end{pmatrix}. \tag{12}
\]

To determine the conditional expectation of the complete-data sufficient statistic, for current parameter estimates of $\bm{\Lambda}$ and $\bm{\Psi}$ define

\begin{align*}
\bm{\omega} &= \begin{pmatrix}\bm{\omega}_{X}&\bm{\omega}_{Y}\end{pmatrix} = \begin{pmatrix}\bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix}^{-1},\\
\bm{\alpha} &= \begin{pmatrix}\bm{\alpha}_{X}&\bm{\alpha}_{Z}\end{pmatrix} = \begin{pmatrix}\bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}^{-1},
\end{align*}

and the conditional covariance matrices

\begin{align*}
\bm{\Sigma}_{Z\mid XY} &= \left(\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\right) - \begin{pmatrix}\bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix}^{-1}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix},\\
\bm{\Sigma}_{Y\mid XZ} &= \left(\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\right) - \begin{pmatrix}\bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}^{-1}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}.
\end{align*}

Using the augmented scatter matrices

\begin{align*}
\widetilde{\bm{P}} &= \begin{pmatrix}\bm{P}_{XX}&\bm{P}_{XY}&\bm{P}_{XX}\bm{\omega}_{X}^{\mathsf{T}}+\bm{P}_{XY}\bm{\omega}_{Y}^{\mathsf{T}}\\ \bm{P}_{YX}&\bm{P}_{YY}&\bm{P}_{YX}\bm{\omega}_{X}^{\mathsf{T}}+\bm{P}_{YY}\bm{\omega}_{Y}^{\mathsf{T}}\\ \bm{\omega}_{Y}\bm{P}_{YX}+\bm{\omega}_{X}\bm{P}_{XX}&\bm{\omega}_{Y}\bm{P}_{YY}+\bm{\omega}_{X}\bm{P}_{XY}&\bm{\omega}\bm{P}\bm{\omega}^{\mathsf{T}}+n_{A}\bm{\Sigma}_{Z\mid XY}\end{pmatrix},\\
\widetilde{\bm{T}} &= \begin{pmatrix}\bm{T}_{XX}&\bm{T}_{XX}\bm{\alpha}_{X}^{\mathsf{T}}+\bm{T}_{XZ}\bm{\alpha}_{Z}^{\mathsf{T}}&\bm{T}_{XZ}\\ \bm{\alpha}_{Z}\bm{T}_{ZX}+\bm{\alpha}_{X}\bm{T}_{XX}&\bm{\alpha}\bm{T}\bm{\alpha}^{\mathsf{T}}+n_{B}\bm{\Sigma}_{Y\mid XZ}&\bm{\alpha}_{Z}\bm{T}_{ZZ}+\bm{\alpha}_{X}\bm{T}_{XZ}\\ \bm{T}_{ZX}&\bm{T}_{ZX}\bm{\alpha}_{X}^{\mathsf{T}}+\bm{T}_{ZZ}\bm{\alpha}_{Z}^{\mathsf{T}}&\bm{T}_{ZZ}\end{pmatrix},
\end{align*}

the expected value of the complete-data sufficient statistic is given by

\[
\widetilde{\bm{S}} = \widetilde{\bm{P}} + \widetilde{\bm{T}}.
\]

Given current parameter estimates of $\bm{\Lambda}$ and $\bm{\Psi}$, define $\bm{\beta} \equiv \bm{\Lambda}^{\mathsf{T}}(\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi})^{-1}$. The M-step is then

\[
\bm{\Lambda}^{\text{new}} = (\widetilde{\bm{S}}\bm{\beta}^{\mathsf{T}})\left(n\bm{I} - n\bm{\beta}\bm{\Lambda} + \bm{\beta}\widetilde{\bm{S}}\bm{\beta}^{\mathsf{T}}\right)^{-1}, \tag{13}
\]
\[
\bm{\Psi}^{\text{new}} = \frac{1}{n}\,\text{diag}\left(\widetilde{\bm{S}} - \bm{\Lambda}^{\text{new}}\bm{\beta}\widetilde{\bm{S}}\right). \tag{14}
\]

The EM algorithm can be run using the sample covariance matrices from dataset A and dataset B, and so it is not necessary to have the original observations. The multivariate normality assumption used to develop the algorithm is not crucial, as long as the population covariance matrix fits the factor analysis model $\bm{\Sigma} = \bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$ (Browne, 1984).
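The E-step has the same algebraic form for both datasets: regress the missing block on the observed block under the current parameters, and add back the conditional covariance. The sketch below assembles $\widetilde{\bm{S}}$ and applies (13)-(14); variables are assumed ordered $(\bm{X},\bm{Y},\bm{Z})$, the uniquenesses are stored as a vector, and the helper names are our own:

    import numpy as np

    def augmented_scatter(S_obs, Sigma, obs, mis, n_obs):
        # E-step building block: conditional expectation of the full scatter
        # matrix when only the variables indexed by `obs` were recorded.
        omega = np.linalg.solve(Sigma[np.ix_(obs, obs)], Sigma[np.ix_(obs, mis)]).T
        C = Sigma[np.ix_(mis, mis)] - omega @ Sigma[np.ix_(obs, mis)]
        S = np.zeros_like(Sigma)
        S[np.ix_(obs, obs)] = S_obs
        S[np.ix_(obs, mis)] = S_obs @ omega.T
        S[np.ix_(mis, obs)] = omega @ S_obs
        S[np.ix_(mis, mis)] = omega @ S_obs @ omega.T + n_obs * C
        return S

    def em_step_matching(P, T, Lam, psi, p_x, p_y, n_A, n_B):
        p, q = Lam.shape
        Sigma = Lam @ Lam.T + np.diag(psi)
        idx = np.arange(p)
        obs_A, mis_A = idx[:p_x + p_y], idx[p_x + p_y:]           # (X,Y) seen, Z missing
        obs_B, mis_B = np.r_[idx[:p_x], idx[p_x + p_y:]], idx[p_x:p_x + p_y]
        S_t = (augmented_scatter(P, Sigma, obs_A, mis_A, n_A)
               + augmented_scatter(T, Sigma, obs_B, mis_B, n_B))
        n = n_A + n_B
        beta = np.linalg.solve(Sigma, Lam).T
        B = n * np.eye(q) - n * beta @ Lam + beta @ S_t @ beta.T
        Lam_new = (S_t @ beta.T) @ np.linalg.inv(B)
        psi_new = np.diag(S_t - Lam_new @ beta @ S_t) / n
        return Lam_new, psi_new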

3.4 Model selection

In practice, the number of factors $q$ must be determined. The Bayesian information criterion (Schwarz, 1978) is a generic tool for model selection that has been well explored in factor analysis for selecting $q$ (Preacher et al., 2013). The Bayesian information criterion for a model $\mathcal{M}$ with likelihood function $L$ and parameter $\bm{\theta}$ is given by

\[
\text{BIC}(\mathcal{M}) = -2\log L(\widehat{\bm{\theta}}) + d\log n, \tag{15}
\]

where $\widehat{\bm{\theta}}$ is the maximum likelihood estimate, $d$ is the number of free parameters in the model $\mathcal{M}$, and $n$ is the sample size (Schwarz, 1978). For factor analysis with complete-cases the log-likelihood is

\[
\log L(\bm{\Lambda},\bm{\Psi}) = -\frac{n}{2}\log\lvert\bm{\Sigma}\rvert - \frac{1}{2}\text{trace}\{\bm{\Sigma}^{-1}\bm{S}\} - \frac{np}{2}\log 2\pi,
\]

where $\bm{\Sigma} = \bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$ and $\bm{S}$ is the sample scatter matrix (8). The Bayesian information criterion for factor analysis with complete-cases is therefore

\[
\text{BIC}(q) = -2\log L(\widehat{\bm{\Lambda}},\widehat{\bm{\Psi}}) + (qp + p - q(q-1)/2)\log n,
\]

where the number of free parameters is obtained by subtracting the number of rotation constraints $q(q-1)/2$ from the nominal number of parameters $qp + p$.

For model selection in missing-data problems it has been suggested to use the observed-data log-likelihood in place of the complete-data log-likelihood in the standard definition of the Bayesian information criterion (15) (Ibrahim et al., 2008). This is the strategy we propose to use for determining the number of factors qq in the file-matching problem. The log-likelihood functions for each dataset are given by

\begin{align*}
\log L_{A}(\bm{\Lambda},\bm{\Psi}) &= -\frac{n_A}{2}\log\lvert\bm{\Sigma}_{A}\rvert - \frac{1}{2}\text{trace}\{\bm{\Sigma}_{A}^{-1}\bm{P}\} - \frac{n_A p_A}{2}\log 2\pi,\\
\log L_{B}(\bm{\Lambda},\bm{\Psi}) &= -\frac{n_B}{2}\log\lvert\bm{\Sigma}_{B}\rvert - \frac{1}{2}\text{trace}\{\bm{\Sigma}_{B}^{-1}\bm{T}\} - \frac{n_B p_B}{2}\log 2\pi,
\end{align*}

where $p_A = p_X + p_Y$, $p_B = p_X + p_Z$, $\bm{P}$ is the observed scatter matrix from dataset A (11), $\bm{T}$ is the observed scatter matrix from dataset B (12), and

\begin{align*}
\bm{\Sigma}_{A} &= \begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix},\\
\bm{\Sigma}_{B} &= \begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}.
\end{align*}

The Bayesian information criterion using the observed-data log-likelihood is then

\[
\text{BIC}(q) = -2\{\log L_{A}(\widehat{\bm{\Lambda}},\widehat{\bm{\Psi}}) + \log L_{B}(\widehat{\bm{\Lambda}},\widehat{\bm{\Psi}})\} + (qp + p - q(q-1)/2)\log n.
\]
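A sketch of the observed-data BIC computation, reusing the block ordering of Section 3.3 (the helper names are our own; the fitted $\widehat{\bm{\Lambda}}$ and $\widehat{\bm{\Psi}}$ come from the EM algorithm above):

    import numpy as np

    def gauss_loglik(Sigma_b, S_b, n_obs):
        # zero-mean Gaussian log-likelihood for n_obs observations summarised
        # by the scatter matrix S_b, with covariance Sigma_b
        p_b = Sigma_b.shape[0]
        _, logdet = np.linalg.slogdet(Sigma_b)
        return (-0.5 * n_obs * logdet
                - 0.5 * np.trace(np.linalg.solve(Sigma_b, S_b))
                - 0.5 * n_obs * p_b * np.log(2.0 * np.pi))

    def bic_matching(P, T, Lam, psi, p_x, p_y, n_A, n_B):
        p, q = Lam.shape
        Sigma = Lam @ Lam.T + np.diag(psi)
        A = np.arange(p_x + p_y)
        B = np.r_[np.arange(p_x), np.arange(p_x + p_y, p)]
        ll = (gauss_loglik(Sigma[np.ix_(A, A)], P, n_A)
              + gauss_loglik(Sigma[np.ix_(B, B)], T, n_B))
        d = q * p + p - q * (q - 1) // 2
        return -2.0 * ll + d * np.log(n_A + n_B)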

4 Simulation

Here we investigate the necessity of Assumption 2 for the identifiability of the factor model by focusing on the noiseless setting where the exact covariance matrix $\bm{\Sigma}$ is known with the exception of $\bm{\Sigma}_{YZ}$. We considered a file-matching scenario with $p_X = p_Y = p_Z = 4$. As seen in Table 2, Assumption 2 enforces stricter requirements than the degrees of freedom criterion based on $C_M = [(p-q)^2 - p - q]/2 - p_Y p_Z$. We simulated three different covariance matrices with $p = 12$, one from a $q=3$ factor model, one from a $q=4$ factor model, and one from a $q=5$ factor model. Elements of $\bm{\Lambda}$ were sampled from a $\mathcal{N}(2,1)$ distribution. The $j$th diagonal element of $\bm{\Psi}$ was set as $D_j^2$, where $D_j \sim \mathcal{N}(3, 0.01)$. Assumption 2 holds only for $q=3$, but the degrees of freedom $C_M$ is non-negative for $q = 3, 4, 5$.

For each covariance matrix, we fit a factor model using the EM algorithm for 10000 iterations from a random initial value of $\bm{\Lambda}$, where each element was sampled from a $\mathcal{N}(0,1)$ distribution. The correct number of factors was used for each of the three test matrices. Given the estimate $\widehat{\bm{\Sigma}}$, we computed the mean squared error over the elements of $\bm{\Sigma}_{YZ}$, that is, $\lVert\widehat{\bm{\Sigma}}_{YZ} - \bm{\Sigma}_{YZ}\rVert^2 / (p_Y p_Z)$. We also computed the mean squared error for the observed blocks of $\bm{\Sigma}$. This procedure was repeated fifty times. Figure 1 displays the results over the fifty random initialisations for each of the three covariance matrices ($q = 3, 4, 5$).
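A sketch of one run of this noiseless experiment, assuming the em_step_matching function of Section 3.3 above; since no sampling noise is involved, the exact population blocks are passed to the EM algorithm directly as scatter matrices with $n_A = n_B = 1$:

    import numpy as np

    rng = np.random.default_rng(1)
    p_x = p_y = p_z = 4
    p = p_x + p_y + p_z

    def simulate_cov(q):
        # Lambda entries ~ N(2, 1); uniquenesses D_j^2 with D_j ~ N(3, 0.01)
        Lam = rng.normal(2.0, 1.0, size=(p, q))
        psi = rng.normal(3.0, 0.1, size=p) ** 2
        return Lam @ Lam.T + np.diag(psi)

    q = 3
    Sigma = simulate_cov(q)
    A = np.arange(p_x + p_y)
    B = np.r_[np.arange(p_x), np.arange(p_x + p_y, p)]
    P, T = Sigma[np.ix_(A, A)], Sigma[np.ix_(B, B)]     # exact observed blocks
    Lam, psi = rng.standard_normal((p, q)), np.ones(p)  # random initialisation
    for _ in range(10000):
        Lam, psi = em_step_matching(P, T, Lam, psi, p_x, p_y, 1, 1)
    Sigma_hat = Lam @ Lam.T + np.diag(psi)
    mse_yz = np.mean((Sigma_hat[p_x:p_x + p_y, p_x + p_y:]
                      - Sigma[p_x:p_x + p_y, p_x + p_y:]) ** 2)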

The top row shows the estimation errors for $\bm{\Sigma}_{YZ}$; the bottom row shows the estimation errors for the observed blocks of $\bm{\Sigma}$. Looking at the bottom row, for $q = 3, 4$, and $5$ the EM algorithm is able to reconstruct the observed blocks of $\bm{\Sigma}$ for each random initialisation. From the top row, for $q = 3$, the choice of initial values does not affect the final estimate of $\bm{\Sigma}_{YZ}$, and the estimation error is zero in each trial. For $q = 4$ and $5$, the choice of initial values influences the final estimate of $\bm{\Sigma}_{YZ}$. As the reconstruction error for the observed blocks of $\bm{\Sigma}$ is zero, this suggests there are multiple factor solutions and the model is not identifiable. The divergent behaviour when comparing $q = 3$ to $q = 4$ and $5$ is notable, as Assumption 2 is only satisfied for $q = 3$. The degrees of freedom $C_M$ are non-negative for $q = 4$ and $5$, but this does not lead to a well-behaved estimator. In this simulation, Assumption 2 appears to give better guidelines for the stable recovery of $\bm{\Sigma}_{YZ}$ than $C_M$.

Figure 1: Estimation errors in a simulated file-matching scenario with $p = 12$ and $p_X = p_Y = p_Z = 4$. The EM algorithm was run for 10000 iterations; panels show the estimation error from different random initialisations. Bottom row: for $q = 3, 4$, and $5$ the reconstruction error of the observed blocks of $\bm{\Sigma}$ is zero across all initialisations. Top row: for $q = 4$ and $5$, the random initialisations appear to lead to different estimates of $\bm{\Sigma}_{YZ}$, suggesting multiple factor solutions. For $q = 3$, Assumption 2 is satisfied, the factor model is identifiable, and the estimation error of $\bm{\Sigma}_{YZ}$ is zero across all initialisations.

5 Data application

5.1 Estimation

In this section we compare the performance of various algorithms for the estimation of $\bm{\Sigma}_{YZ}$. Existing algorithms for low-rank matrix completion were implemented using the R package filling (You, 2020). We also considered the standard conditional independence assumption (CIA). We tested the algorithms on the following datasets.

  • Sachs dataset. This dataset consists of results from 11 flow cytometry experiments measuring the same set of 12 variables. The measurements were log transformed. Data from each experiment was centered, then combined and scaled so that each variable had unit variance.

  • Wisconsin breast cancer dataset. The original observations in the malignant class were centered and scaled so that each variable had unit variance.

  • Abalone dataset. The original observations in the female class were centered and scaled so that each variable had unit variance.

For each dataset we considered matching scenarios with $p_X$ common variables, $p_Y$ unique variables in dataset A, and $p_Z$ unique variables in dataset B. This defines partitions of the complete dataset. Table 3 gives the file-matching scenarios considered for each dataset, as well as the number of factors $q$ that was used in fitting the factor analysis model. The final columns indicate whether the degrees of freedom criteria $C$ and $C_M$, and Assumption 2, are satisfied for the chosen number of factors $q$. Assumption 2 is violated on the Sachs dataset, but the degrees of freedom $C_M$ is positive.

Table 3: Data examples. $C$ represents the degrees of freedom given complete-cases, and $C_M$ represents the degrees of freedom in the file-matching scenario. The final column states whether it is possible to satisfy Assumption 2 given the file-matching scenario.

Dataset      $p$   $p_X$   $p_Y$   $p_Z$   $q$   $C \geq 0$   $C_M \geq 0$   Assumption 2
Sachs        11    4       4       3       4     Yes          Yes            No
Wisconsin    10    3       3       4       2     Yes          Yes            Yes
Abalone      8     1       4       3       1     Yes          Yes            Yes

In each simulation, variables were randomly allocated to the $\bm{X}$, $\bm{Y}$, and $\bm{Z}$ groups. We then used different algorithms to estimate $\bm{\Sigma}_{YZ}$. The mean squared error over elements of $\bm{\Sigma}_{YZ}$, that is, $\lVert\widehat{\bm{\Sigma}}_{YZ} - \bm{\Sigma}_{YZ}\rVert^2/(p_Y p_Z)$, was then recorded for each algorithm. This process was repeated 100 times. The EM algorithm was run for 2000 iterations in each simulation. We considered two different initial value settings for the EM algorithm: initialisation using the factor analysis solution from the full covariance matrix, and random initialisation of $\bm{\Lambda}$ by sampling from a standard normal distribution. For the random initialisation protocol we took one hundred samples of $\bm{\Lambda}$, each followed by a short run of the EM algorithm for fifty iterations. The parameters with the highest log-likelihood after the fifty iterations were then used in the longer run. In each simulation we also fit a factor model using the complete dataset with no missing data to provide a reference point for the goodness-of-fit of the factor model.

Figure 2 compares the results using the different algorithms. The results for the factor analysis model with the favourable initial values are shown as FM, and the results for the factor analysis model with random initialisation are shown as FM (random). The factor model with good initial values (FM) has the lowest median error on each dataset (Table 4). The errors on the Sachs dataset and the Wisconsin breast cancer dataset are particularly small.

The results for the factor model with random initialisation (FM (random)) are very similar to those of FM using the favourable initial values, except on the Sachs dataset. On the Sachs dataset, Assumption 2 is violated and there may be multiple factor solutions. In Figure 2 (a), the errors for the factor model with random initialisation (FM (random)) are much more dispersed than those for the factor model with the good initial values (FM). Although there are sufficient degrees of freedom ($C_M \geq 0$), it appears that the model may not be identifiable. With good initial values, the EM algorithm appears to converge to a local mode that gives a good estimate of $\bm{\Sigma}_{YZ}$.

The interquartile range of the error for each algorithm over the one hundred variable permutations is given in Table 4. The factor analysis model (FM) has the smallest interquartile range on each dataset, closely followed by the factor model using random initialisations, with the exception of the Sachs dataset.

Figure 2: Accuracy of estimates of $\bm{\Sigma}_{YZ}$ over 100 variable permutations for different algorithms. Complete refers to estimation of a factor analysis model using complete-cases with no missing data.
Table 4: Median error and interquartile range (IQR) in estimation of $\bm{\Sigma}_{YZ}$ across 100 variable permutations. The factor analysis model (FM) has the lowest median error and interquartile range on each dataset. The factor analysis model with random initial values (FM (random)) has comparable performance on each dataset except for Sachs, where Assumption 2 is violated.

Dataset      Value    FM      CIA     Soft-Impute   SVD      FM (random)   Complete
Sachs        Median   0.0006  0.0320  0.0465        3.1174   0.0402        0.0001
             IQR      0.0017  0.0351  0.0518        27.822   0.0907        0.0001
Wisconsin    Median   0.0032  0.0125  0.0058        0.0059   0.0036        0.0006
             IQR      0.0023  0.0216  0.0073        0.0113   0.0025        0.0008
Abalone      Median   0.0042  0.0125  0.0089        0.0102   0.0042        0.0031
             IQR      0.0031  0.0198  0.0081        0.0142   0.0031        0.0024

5.2 Model selection

We also performed some experiments to assess the behaviour of the Bayesian information criterion for determining the number of factors $q$. The following datasets and variable partitions were used in the experiments:

  • Reise dataset. This dataset is a $16\times 16$ correlation matrix for mental health items. The variables were partitioned as $p_X = 6$, $p_Y = 5$, and $p_Z = 5$.

  • Harman dataset. This dataset is a $24\times 24$ correlation matrix for psychological tests. The variables were partitioned as $p_X = 5$, $p_Y = 10$, and $p_Z = 9$.

  • Holzinger dataset. This dataset is a $14\times 14$ correlation matrix for mental ability scores. The variables were partitioned as $p_X = 6$, $p_Y = 4$, and $p_Z = 4$.

  • Simulated from a factor model with $q = 6$ and $p = 100$. Elements of $\bm{\Lambda}$ were sampled from a $\mathcal{N}(2,1)$ distribution. The $j$th diagonal element of $\bm{\Psi}$ was set as $D_j^2$, where $D_j \sim \mathcal{N}(3, 0.01)$. The covariance matrix $\bm{\Sigma} = \bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$ was then standardised to a correlation matrix. We simulated one dataset with $n_A = n_B = 50$ observations and one dataset with $n_A = n_B = 500$ observations. The variables were partitioned as $p_X = 6$, $p_Y = 47$, and $p_Z = 47$.

In each simulation, variables were randomly allocated to the $\bm{X}$, $\bm{Y}$, and $\bm{Z}$ groups with $n_A = n_B = n/2$. We then computed the BIC for a range of values of $q$ and recorded the optimal value $q^*$. This process was repeated 100 times. In each of the simulations we considered the range of values for $q$ such that Assumptions 1 and 2 were satisfied. Table 5 reports the number of times a $q$-factor model was chosen using the BIC over the 100 replications. The optimal number of factors according to the BIC using complete-cases is given as $q_c^*$ in Table 5. The missing data appears to lead to the BIC acting conservatively, selecting a number of factors less than or equal to the number chosen with complete-cases. The correct number of factors $q = 6$ is selected for the simulated datasets when using complete-cases. For the simulated dataset with $n = 50$, the BIC selects fewer than six factors in each trial. For the simulated dataset with $n = 500$, the BIC selects the correct number of factors $q = 6$ in each replication. Figure 3 shows boxplots of the estimation error $\lVert\widehat{\bm{\Sigma}}_{YZ} - \bm{\Sigma}_{YZ}\rVert^2/(p_Y p_Z)$ for different values of $q$. The errors from using the conditional independence assumption and from fitting a factor model with complete-cases are also reported for comparison. For almost all values of $q$, the median error using the factor model is lower than that of the conditional independence model.

Table 5: Simulation results for model selection using the Bayesian information criterion. The optimal number of factors using complete-cases is denoted as $q_c^*$. Entries represent the number of times a $q$-factor model was chosen using the BIC across the 100 variable permutations. A dash indicates that the $q$-factor model would violate the identification conditions in Assumptions 1 and 2.

Dataset                $q_c^*$   $q=1$   $q=2$   $q=3$   $q=4$   $q=5$   $q=6$   $q=7$   $q=8$
Reise                  8         0       1       1       18      80      -       -       -
Harman                 3         62      37      1       0       0       -       -       -
Holzinger              4         0       60      37      3       -       -       -       -
Simulated ($n=50$)     6         5       51      41      3       0       0       0       0
Simulated ($n=500$)    6         0       0       0       0       0       100     0       0
Figure 3: Accuracy of estimates of $\bm{\Sigma}_{YZ}$ over 100 variable permutations for different numbers of factors $q$. The errors from the conditional independence assumption and from fitting a factor model with complete-cases are shown for comparison.

6 Discussion

Technological and design constraints can prevent investigators from collecting a full dataset on all variables of interest. The statistical file-matching problem is an important data-fusion task where joint observations on the full set of variables are not available. Factor analysis models are useful as they can remain identifiable despite the missing-data pattern of the file-matching problem. The factor analysis approach is a useful alternative to the conditional independence assumption, as it is less restrictive and is testable. Estimation of the factor analysis model can be carried out via the EM algorithm.

Although factor analysis and low-rank matrix completion are related, the identifiability of the factor analysis model $\bm{\Sigma} = \bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$ requires additional assumptions compared to low-rank matrix completion due to the diagonal matrix $\bm{\Psi}$. As the assumption that $q < (p_X+p_Y)/2$ and $q < (p_X+p_Z)/2$ may be unnecessarily strong, it is of interest to establish the weakest conditions that ensure the identifiability of the factor analysis model in the file-matching problem. In many applications of file-matching, procedures for generating uncertainty bounds when the model is partially identified supply useful information (Conti et al., 2016). Further work may explore characterisations of the identified set for the factor analysis model when the number of latent factors $q$ exceeds the maximum number for identifiability.

It is also of interest to relax the assumption that both datasets are samples from the same homogeneous population with the joint model $f(\bm{x},\bm{y},\bm{z};\bm{\theta})$. One possible avenue is to embed the file-matching problem in a hierarchical model. A dataset-specific random effect can be applied to the parameters in dataset A and dataset B, so that there is a shared component across datasets and a unique component within each dataset. Samples in dataset A are from the distribution $f(\bm{x},\bm{y};\bm{\theta}_A)$, and samples in dataset B are from the distribution $f(\bm{x},\bm{z};\bm{\theta}_B)$, where $\bm{\theta}_A = \bm{\theta}_0 + \bm{\delta}_A$ and $\bm{\theta}_B = \bm{\theta}_0 + \bm{\delta}_B$, with $\bm{\delta}_A$ and $\bm{\delta}_B$ random-effect terms. The added flexibility of the hierarchical model can allow for situations where datasets A and B are from related but not necessarily identical populations, a common scenario in data integration. Model identifiability with heterogeneous data sources is an interesting and challenging problem, and the results here may serve as useful groundwork.

Acknowledgements

This research was funded by the Australian Government through the Australian Research Council (Project Numbers DP170100907 and IC170100035).

References

  • Abdelaal et al. (2019) Abdelaal, T., Höllt, T., van Unen, V., Lelieveldt, B. P. F., Koning, F., Reinders, M. J. T. and Mahfouz, A. (2019) CyTOFmerge: integrating mass cytometry data across multiple panels. Bioinformatics, 35, 4063–4071.
  • Ahfock et al. (2016) Ahfock, D., Pyne, S., Lee, S. X. and McLachlan, G. J. (2016) Partial identification in the statistical matching problem. Computational Statistics and Data Analysis, 104, 79–90.
  • Anderson and Rubin (1956) Anderson, T. W. and Rubin, H. (1956) Statistical inference in factor analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 238–246.
  • Barry (1988) Barry, J. T. (1988) An investigation of statistical matching. Journal of Applied Statistics, 15, 275–283.
  • Bekker and ten Berge (1997) Bekker, P. A. and ten Berge, J. M. (1997) Generic global identification in factor analysis. Linear Algebra and its Applications, 264, 255–263.
  • Bishop and Byron (2014) Bishop, W. E. and Byron, M. Y. (2014) Deterministic symmetric positive semidefinite matrix completion. In Advances in Neural Information Processing Systems, 2762–2770.
  • Browne (1984) Browne, M. W. (1984) Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37, 62–83.
  • Candes and Plan (2010) Candes, E. J. and Plan, Y. (2010) Matrix completion with noise. Proceedings of the IEEE, 98, 925–936.
  • Conti et al. (2012) Conti, P. L., Marella, D. and Scanu, M. (2012) Uncertainty analysis in statistical matching. Journal of Official Statistics, 28, 69–88.
  • Conti et al. (2016) Conti, P. L., Marella, D. and Scanu, M. (2016) Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association, 111, 1715–1725.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38.
  • D’Orazio (2019) D’Orazio, M. (2019) Statistical learning in official statistics: The case of statistical matching. Statistical Journal of the IAOS, 35, 435–441.
  • D’Orazio et al. (2006a) D’Orazio, M., Di Zio, M. and Scanu, M. (2006a) Statistical Matching: Theory and Practice. New York: Wiley.
  • D’Orazio et al. (2006b) D’Orazio, M., Di Zio, M. and Scanu, M. (2006b) Statistical matching for categorical data: Displaying uncertainty and using logical constraints. Journal of Official Statistics, 22, 137.
  • Gustafson (2015) Gustafson, P. (2015) Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data. Boca Raton: CRC Press.
  • Ibrahim et al. (2008) Ibrahim, J. G., Zhu, H. and Tang, N. (2008) Model selection criteria for missing-data problems using the EM algorithm. Journal of the American Statistical Association, 103, 1648–1658.
  • Kadane (2001) Kadane, J. B. (2001) Some statistical problems in merging data files. Journal of Official Statistics, 17, 423.
  • Kamakura and Wedel (2000) Kamakura, W. A. and Wedel, M. (2000) Factor analysis and missing data. Journal of Marketing Research, 37, 490–498.
  • Koltchinskii et al. (2011) Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011) Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Annals of Statistics, 39, 2302–2329.
  • Ledermann (1937) Ledermann, W. (1937) On the rank of the reduced correlational matrix in multiple-factor analysis. Psychometrika, 2, 85–93.
  • Lee et al. (2011) Lee, G., Finn, W. and Scott, C. (2011) Statistical file matching of flow cytometry data. Journal of Biomedical Informatics, 44, 663–676.
  • Li and Jung (2017) Li, G. and Jung, S. (2017) Incorporating covariates into integrated factor analysis of multi-view data. Biometrics, 73, 1433–1442.
  • Little (1988) Little, R. J. (1988) Missing-data adjustments in large surveys. Journal of Business & Economic Statistics, 6, 287–296.
  • Little and Rubin (2002) Little, R. J. A. and Rubin, D. B. (2002) Statistical Analysis with Missing Data. Hoboken: Wiley, 2nd edn.
  • Mazumder et al. (2010) Mazumder, R., Hastie, T. and Tibshirani, R. (2010) Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11, 2287–2322.
  • Moriarity and Scheuren (2001) Moriarity, C. and Scheuren, F. (2001) Statistical matching: a paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics, 17, 407.
  • O’Connell and Lock (2019) O’Connell, M. J. and Lock, E. F. (2019) Linked matrix factorization. Biometrics, 75, 582–592.
  • O’Neill et al. (2015) O’Neill, K., Aghaeepour, N., Parker, J., Hogge, D., Karsan, A., Dalal, B. and Brinkman, R. R. (2015) Deep profiling of multitube flow cytometry data. Bioinformatics, 31, 1623–1631.
  • Park and Lock (2020) Park, J. Y. and Lock, E. F. (2020) Integrative factorization of bidimensionally linked matrices. Biometrics, 76, 61–74.
  • Pedreira et al. (2008) Pedreira, C. E., Costa, E. S., Barrena, S., Lecrevisse, Q., Almeida, J., van Dongen, J. J. M. and Orfao, A. (2008) Generation of flow cytometry data files with a potentially infinite number of dimensions. Cytometry Part A, 73, 834–846.
  • Preacher et al. (2013) Preacher, K. J., Zhang, G., Kim, C. and Mels, G. (2013) Choosing the optimal number of factors in exploratory factor analysis: A model selection perspective. Multivariate Behavioral Research, 48, 28–56.
  • Rässler (2002) Rässler, S. (2002) Statistical Matching: A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches. New York: Springer-Verlag.
  • Rodgers (1984) Rodgers, W. L. (1984) An Evaluation of Statistical Matching. Journal of Business & Economic Statistics, 2, 91.
  • Rubin and Thayer (1982) Rubin, D. B. and Thayer, D. T. (1982) EM algorithms for ML factor analysis. Psychometrika, 47, 69–76.
  • Sachs et al. (2009) Sachs, K., Itani, S., Carlisle, J., Nolan, G. P., Pe’er, D. and Lauffenburger, D. A. (2009) Learning signaling network structures with sparsely distributed data. Journal of Computational Biology, 16, 201–212.
  • Schönemann (1966) Schönemann, P. H. (1966) A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31, 1–10.
  • Schwarz (1978) Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
  • Shapiro (1985) Shapiro, A. (1985) Identifiability of factor analysis: Some results and open problems. Linear Algebra and its Applications, 70, 1–7.
  • Van Buuren (2018) Van Buuren, S. (2018) Flexible Imputation of Missing Data. Boca Raton: CRC Press.
  • You (2020) You, K. (2020) filling: Matrix Completion, Imputation, and Inpainting Methods. R package version 0.2.1.

Appendix A

A.1 Proof of Lemma 1

Due to the rotational invariance of the factor model, we have that

\begin{bmatrix}\bm{\Lambda}_{X}^{A}\\ \bm{\Lambda}_{Y}^{A}\\ \bm{\Lambda}_{Z}^{A}\end{bmatrix}=\begin{bmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\\ \bm{\Lambda}_{Z}\end{bmatrix}\bm{R}_{1},\quad\begin{bmatrix}\bm{\Lambda}_{X}^{B}\\ \bm{\Lambda}_{Y}^{B}\\ \bm{\Lambda}_{Z}^{B}\end{bmatrix}=\begin{bmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\\ \bm{\Lambda}_{Z}\end{bmatrix}\bm{R}_{2},\quad\begin{bmatrix}\bm{\Lambda}_{X}^{B}\\ \bm{\Lambda}_{Y}^{B}\\ \bm{\Lambda}_{Z}^{B}\end{bmatrix}=\begin{bmatrix}\bm{\Lambda}_{X}^{A}\\ \bm{\Lambda}_{Y}^{A}\\ \bm{\Lambda}_{Z}^{A}\end{bmatrix}\bm{R}_{3},\qquad\text{(A.1)}

for orthogonal matrices $\bm{R}_{1}$, $\bm{R}_{2}$, and $\bm{R}_{3}$. The alignment of $\bm{\Lambda}_{X}^{A}$ and $\bm{\Lambda}_{X}^{B}$ is an orthogonal Procrustes problem. Let $\bm{R}$ be the solution to the optimisation problem

\bm{R}=\operatorname*{argmin}\ \lVert\bm{\Lambda}_{X}^{A}-\bm{\Lambda}_{X}^{B}\bm{R}\rVert_{F},\quad\text{subject to }\bm{R}^{\mathsf{T}}\bm{R}=\bm{I}.\qquad\text{(A.2)}

Assuming that $\bm{\Lambda}_{X}^{A}$ and $\bm{\Lambda}_{X}^{B}$ are of full column rank, Schönemann (1966) showed that (A.2) has a unique solution. As $\operatorname{rank}(\bm{\Lambda}_{X}^{A})=\operatorname{rank}(\bm{\Lambda}_{X}^{B})=\operatorname{rank}(\bm{\Lambda}_{X})$, both $\bm{\Lambda}_{X}^{A}$ and $\bm{\Lambda}_{X}^{B}$ are of rank $q$ under Assumption 1. Define $\bm{M}=(\bm{\Lambda}_{X}^{B})^{\mathsf{T}}\bm{\Lambda}_{X}^{A}$ and let the singular value decomposition of $\bm{M}$ be given by $\bm{M}=\bm{W}\bm{D}\bm{Q}^{\mathsf{T}}$. Then, using the result of Schönemann (1966), the unique solution to (A.2) is $\bm{R}=\bm{W}\bm{Q}^{\mathsf{T}}$. The uniqueness of the solution implies that $\bm{R}=\bm{R}_{3}^{\mathsf{T}}$, since $\bm{\Lambda}_{X}^{B}\bm{R}_{3}^{\mathsf{T}}=\bm{\Lambda}_{X}^{A}\bm{R}_{3}\bm{R}_{3}^{\mathsf{T}}=\bm{\Lambda}_{X}^{A}$ from (A.1). Then $\bm{\Lambda}_{Z}^{B}\bm{R}=\bm{\Lambda}_{Z}^{B}\bm{R}_{3}^{\mathsf{T}}=\bm{\Lambda}_{Z}^{A}\bm{R}_{3}\bm{R}_{3}^{\mathsf{T}}=\bm{\Lambda}_{Z}^{A}$, again using (A.1). Finally, $\bm{\Lambda}_{Y}^{A}(\bm{\Lambda}_{Z}^{B}\bm{R})^{\mathsf{T}}=\bm{\Lambda}_{Y}^{A}(\bm{\Lambda}_{Z}^{A})^{\mathsf{T}}=\bm{\Lambda}_{Y}\bm{R}_{1}\bm{R}_{1}^{\mathsf{T}}\bm{\Lambda}_{Z}^{\mathsf{T}}=\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}$.
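The alignment step above is constructive and can be checked numerically. The following is a minimal sketch in Python with NumPy (not part of the original derivation; the dimensions and the randomly generated loadings are purely illustrative): it rotates a common loading matrix two different ways, recovers the Procrustes rotation from the shared $\bm{X}$ block, and reconstructs $\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}$.

    import numpy as np

    rng = np.random.default_rng(0)
    p_x, p_y, p_z, q = 4, 3, 3, 2  # illustrative dimensions

    # True loadings, partitioned over the X, Y, and Z variables.
    Lam_X = rng.normal(size=(p_x, q))
    Lam_Y = rng.normal(size=(p_y, q))
    Lam_Z = rng.normal(size=(p_z, q))

    # Arbitrary orthogonal rotations R1 and R2, standing in for the
    # rotational indeterminacy of the fits to datasets A and B.
    R1, _ = np.linalg.qr(rng.normal(size=(q, q)))
    R2, _ = np.linalg.qr(rng.normal(size=(q, q)))
    Lam_X_A, Lam_Y_A = Lam_X @ R1, Lam_Y @ R1
    Lam_X_B, Lam_Z_B = Lam_X @ R2, Lam_Z @ R2

    # Orthogonal Procrustes alignment of the shared X blocks (A.2):
    # M = (Lam_X_B)' Lam_X_A = W D Q', with solution R = W Q'.
    W, _, Qt = np.linalg.svd(Lam_X_B.T @ Lam_X_A)
    R = W @ Qt

    # Recover the unobservable cross-product Lam_Y Lam_Z'.
    recovered = Lam_Y_A @ (Lam_Z_B @ R).T
    assert np.allclose(recovered, Lam_Y @ Lam_Z.T)

The full-column-rank condition of Assumption 1 holds almost surely for the randomly drawn loadings here, so the SVD-based Procrustes solution is unique in this sketch.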

A.2 Proof of Theorem 1

Using Theorem 5.1 of Anderson and Rubin (1956), Assumption 2 guarantees that if

\begin{aligned}
\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix}&=\begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}+\bm{\Psi}_{X}^{*}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}\\ \bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}+\bm{\Psi}_{Y}^{*}\end{pmatrix},\\
\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}&=\begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}+\bm{\Psi}_{X}^{*}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}+\bm{\Psi}_{Z}^{*}\end{pmatrix},
\end{aligned}

then the uniquenesses are equal, $\bm{\Psi}_{X}=\bm{\Psi}_{X}^{*}$, $\bm{\Psi}_{Y}=\bm{\Psi}_{Y}^{*}$, and $\bm{\Psi}_{Z}=\bm{\Psi}_{Z}^{*}$, implying

\begin{aligned}
\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}&=\begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}\\ \bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}\end{pmatrix},\qquad\text{(A.3)}\\
\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}&=\begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\end{pmatrix}.\qquad\text{(A.4)}
\end{aligned}

Using Lemma 1, $\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}$ can be uniquely recovered given the matrices on the left-hand sides of (A.3) and (A.4). Likewise, $\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}$ can be uniquely recovered given the matrices on the right-hand sides of (A.3) and (A.4). It remains to show that $\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}=\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}$. To do so, define the eigendecompositions

\begin{aligned}
\bm{V}_{A}\bm{D}_{A}\bm{V}_{A}^{\mathsf{T}}&=\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}=\begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}\\ \bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}\end{pmatrix},\\
\bm{V}_{B}\bm{D}_{B}\bm{V}_{B}^{\mathsf{T}}&=\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}=\begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\end{pmatrix},
\end{aligned}

and the rotated and scaled eigenvectors

\begin{pmatrix}\bm{\Gamma}_{X}^{A}\\ \bm{\Gamma}_{Y}^{A}\end{pmatrix}=\bm{V}_{A}\bm{D}_{A}^{1/2},\qquad\begin{pmatrix}\bm{\Gamma}_{X}^{B}\\ \bm{\Gamma}_{Z}^{B}\end{pmatrix}=\bm{V}_{B}\bm{D}_{B}^{1/2},

where $\bm{V}_{A}$ and $\bm{V}_{B}$ contain the eigenvectors associated with the $q$ nonzero eigenvalues in $\bm{D}_{A}$ and $\bm{D}_{B}$.

Using Assumption 1 and Lemma 1, the equality

\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}=\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}=\bm{\Gamma}_{Y}^{A}(\bm{\Gamma}_{Z}^{B}\bm{W}\bm{Q}^{\mathsf{T}})^{\mathsf{T}}\qquad\text{(A.5)}

must hold, where $\bm{W}$ and $\bm{Q}$ contain the left and right singular vectors of the matrix $\bm{M}=(\bm{\Gamma}_{X}^{B})^{\mathsf{T}}\bm{\Gamma}_{X}^{A}=\bm{W}\bm{D}\bm{Q}^{\mathsf{T}}$. Combining the equalities in (A.3), (A.4), and (A.5) gives the main result

\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}=\begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\end{pmatrix}.
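To make the construction of the proof concrete, the following sketch (again Python with NumPy, and again only illustrative: it assumes the uniquenesses have already been subtracted and that the number of factors $q$ is known) builds $\bm{\Gamma}^{A}$ and $\bm{\Gamma}^{B}$ from the rank-$q$ eigendecompositions of the two observed low-rank blocks and completes the missing $\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}$ block via (A.5).

    import numpy as np

    rng = np.random.default_rng(1)
    p_x, p_y, p_z, q = 4, 3, 3, 2  # illustrative dimensions
    Lam = rng.normal(size=(p_x + p_y + p_z, q))
    Lam_X, Lam_Y, Lam_Z = Lam[:p_x], Lam[p_x:p_x + p_y], Lam[p_x + p_y:]

    # Low-rank parts of the observed covariance blocks: the left-hand
    # sides of (A.3) and (A.4) after the uniquenesses are removed.
    S_A = np.vstack([Lam_X, Lam_Y]) @ np.vstack([Lam_X, Lam_Y]).T
    S_B = np.vstack([Lam_X, Lam_Z]) @ np.vstack([Lam_X, Lam_Z]).T

    def gamma(S, q):
        # Rank-q factor Gamma = V D^{1/2}, so that Gamma Gamma' = S.
        d, V = np.linalg.eigh(S)
        d, V = d[::-1][:q], V[:, ::-1][:, :q]
        return V * np.sqrt(d)

    G_A, G_B = gamma(S_A, q), gamma(S_B, q)
    G_X_A, G_Y_A = G_A[:p_x], G_A[p_x:]
    G_X_B, G_Z_B = G_B[:p_x], G_B[p_x:]

    # Align the shared X rows and fill in the missing block, as in (A.5).
    W, _, Qt = np.linalg.svd(G_X_B.T @ G_X_A)
    cross = G_Y_A @ (G_Z_B @ W @ Qt).T
    assert np.allclose(cross, Lam_Y @ Lam_Z.T)

In practice the blocks in (A.3) and (A.4) would be replaced by sample estimates, so the final equality would hold only approximately rather than to numerical precision.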