Data-fusion using factor analysis and low-rank matrix completion

Daniel Ahfock School of Mathematics and Physics, University of Queensland d.ahfock@uq.edu.au Saumyadipta Pyne Public Health Dynamics Laboratory and Department of Biostatistics,
Graduate School of Public Health, University of Pittsburgh, PA, USA Health Analytics Network, PA, USA Geoffrey J. McLachlan School of Mathematics and Physics, University of Queensland

Abstract

Data-fusion involves the integration of multiple related datasets. The statistical file-matching problem is a canonical data-fusion problem in multivariate analysis, where the objective is to characterise the joint distribution of a set of variables when only strict subsets of marginal distributions have been observed. Estimation of the covariance matrix of the full set of variables is challenging given the missing-data pattern. Factor analysis models use lower-dimensional latent variables in the data-generating process, and this introduces low-rank components in the complete-data matrix and the population covariance matrix. The low-rank structure of the factor analysis model can be exploited to estimate the full covariance matrix from incomplete data via low-rank matrix completion. We prove the identifiability of the factor analysis model in the statistical file-matching problem under conditions on the number of factors and the number of shared variables over the observed marginal subsets. Additionally, we provide an EM algorithm for parameter estimation. On several real datasets, the factor model gives smaller reconstruction errors in file-matching problems than the common approaches for low-rank matrix completion.

1 Introduction

Data-fusion involves the joint modelling of multiple related datasets, with the aim to lift the quality of inference that could be obtained from isolated separate analyses of each dataset. The statistical file-matching problem is a classic data-fusion task, where all observations are from the same population, with the caveat that different sets of variables are recorded in each individual dataset (Little and Rubin, 2002; Rässler, 2002; D’Orazio et al., 2006a). The core problem can be reduced to the analysis of two datasets, labelled A and B, with three groups of variables denoted $\bm{X},\bm{Y}$ , and $\bm{Z}$ . Dataset A contains observations on the $(\bm{X},\bm{Y})$ variables and Dataset B contains observations on the $(\bm{X},\bm{Z})$ variables. Table 1 describes the missing-data pattern. The statistical file-matching problem occurs in bioinformatics such as in flow cytomtery analysis in the synthesis of multiplexed data collected using different panels of markers (Pedreira et al., 2008; Lee et al., 2011; O’Neill et al., 2015; Abdelaal et al., 2019). Other important application areas include survey sampling and data integration in official statistics (Conti et al., 2016; D’Orazio, 2019).

Table 1: Missing data pattern in the statistical file-matching problem

Variables	$X$	$Y$	$Z$
Dataset A	observed	observed	—
Dataset B	observed	—	observed

We assume that dataset A and dataset B are samples from the same homogeneous population. For example, in flow cytometry analysis a single blood sample from a patient may be divided into two aliquots which are then analysed using markers $(\bm{X},\bm{Y})$ and $(\bm{X},\bm{Z})$ respectively. Technological limitations prevent joint measurements on the $\bm{Y}$ and $\bm{Z}$ markers. In survey sampling, one random subset of the population may receive a questionnaire with items $(\bm{X},\bm{Y})$ and a second random subset of the population given items $(\bm{X},\bm{Z})$ . This situation may arise when integrating results from related surveys, or in the design of a single survey where there is a constraint on the response burden.

Key linear relationships between $\bm{X},\bm{Y}$ , and $\bm{Z}$ are encoded in the covariance matrix $\bm{\Sigma}$ :

\displaystyle\text{cov}\begin{pmatrix}\bm{X}\\ \bm{Y}\\ \bm{Z}\end{pmatrix}

\displaystyle=\bm{\Sigma}=\begin{pmatrix}{\bm{\Sigma}}_{XX}&{\bm{\Sigma}}_{XY}&{\bm{\Sigma}}_{XZ}\\ {\bm{\Sigma}}_{YX}&{\bm{\Sigma}}_{YY}&{\bm{\Sigma}}_{YZ}\\ {\bm{\Sigma}}_{ZX}&{\bm{\Sigma}}_{ZY}&{\bm{\Sigma}}_{ZZ}\end{pmatrix}.

(1)

A fundamental task in the statistical file-matching problem is the estimation of $\bm{\Sigma}_{YZ}$ from incomplete data (Rässler, 2002; D’Orazio et al., 2006a). A common objective in file-matching is to impute the missing observations so that complete-data techniques can be used in downstream tasks. Recovery of $\bm{\Sigma}_{YZ}$ is crucial to generating proper imputations (Little, 1988; Van Buuren, 2018). The covariance matrix is also important for Gaussian graphical modelling, and the recovery of a network model from incomplete information is an interesting problem (Sachs et al., 2009).

The critical issue in the statistical file-matching problem is how to estimate $\bm{\bm{\Sigma}}_{YZ}$ without any joint observations on the $\bm{Y}$ and $\bm{Z}$ variables. Factor analysis models are useful in data-fusion tasks as they provide structured covariance matrices that allow the sharing of information across variables (Li and Jung, 2017; O’Connell and Lock, 2019; Park and Lock, 2020). In the statistical file-matching problem, this can facilitate the estimation of $\bm{\bm{\Sigma}}_{YZ}$ from the observed $(\bm{X},\bm{Y})$ and $(\bm{X},\bm{Z})$ associations. Let $p_{X},p_{Y}$ , and $p_{Z}$ denote the dimensions of $\bm{X}$ , $\bm{Y}$ , and $\bm{Z}$ respectively, with $p=p_{X}+p_{Y}+p_{Z}$ . A factor analysis model consists of a $p\times q$ matrix of factor loadings $\bm{\Lambda}$ and a diagonal $p\times p$ matrix $\bm{\Psi}$ of uniqueness terms, giving the structured covariance matrix $\bm{\bm{\Sigma}}=\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$ . The full covariance matrix $\bm{\bm{\Sigma}}$ can be written as

\displaystyle\bm{\bm{\Sigma}}

\displaystyle=\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}=\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix},

(2)

where $\bm{\Lambda}$ and $\bm{\Psi}$ have been partitioned as

\displaystyle\bm{\Lambda}=\begin{pmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\\ \bm{\Lambda}_{Z}\end{pmatrix},\quad\bm{\Psi}=\begin{pmatrix}\bm{\Psi}_{X}&\bm{0}&\bm{0}\\ \bm{0}&\bm{\Psi}_{Y}&\bm{0}\\ \bm{0}&\bm{0}&\bm{\Psi}_{Z}\end{pmatrix},

where $\bm{\Lambda}_{X}$ is a $p_{X}\times q$ matrix of factor loadings for the $\bm{X}$ variables, and $\bm{\Psi}_{X}$ is a $p_{X}\times p_{X}$ diagonal matrix of uniquenesses for the $\bm{X}$ variables. The $\bm{Y}$ and $\bm{Z}$ parameters are partitioned in a similar fashion. Assuming that $\bm{\Lambda}_{Y}$ and $\bm{\Lambda}_{Z}$ can be recovered from the observed blocks of the covariance matrix, it is possible to learn $\text{cov}(\bm{Y},\bm{Z})$ as $\bm{\bm{\Sigma}}_{YZ}=\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}$ .

The file-matching problem can be approached as a low-rank matrix completion task. One may attempt to complete the partial estimate of the covariance matrix or to complete the partially observed data matrix. Generic results for matrix recovery with random missingness patterns do not give clear guidance on the feasibility of statistical file-matching with the fixed and systematic missing-data pattern in Table 1 (Candes and Plan, 2010).

We provide conditions for the identifiability of the factor analysis model given the missing-data pattern in the file-matching problem. The main result is that recovery of $\bm{\Sigma}_{YZ}$ is possible if the number of latent factors $q$ satisfies $q\leq p_{X},q<(p_{X}+p_{Y})/2$ , and $q<(p_{X}+p_{Z})/2$ (Theorem 1). Additionally, we give an EM algorithm (Dempster et al., 1977) for joint estimation of $\bm{\bm{\Lambda}}$ and $\bm{\bm{\Psi}}$ with closed form E- and M-steps. We find that on several real datasets, maximum likelihood factor analysis gives better estimates of $\bm{\Sigma}_{YZ}$ than common existing approaches for low-rank matrix completion.

2 Methods

Here we review some existing approaches that can be used to estimate $\bm{\Sigma}_{YZ}$ in the statistical file-matching problem.

2.1 Conditional independence assumption

A common resolution to the identifiability issue in the file-matching problem is to assume that $\bm{Y}$ and $\bm{Z}$ are conditionally independent given $\bm{X}$ . Under this assumption, ${\bm{\Sigma}}_{YZ}={\bm{\Sigma}}_{YX}{\bm{\Sigma}}_{XX}^{-1}{\bm{\Sigma}}_{XZ}$ . The conditional independence assumption is commonly invoked to justify statistical matching (D’Orazio et al., 2006a; Rässler, 2002). However, it is untestable and can have a large influence on downstream results (Barry, 1988; Rodgers, 1984).

2.2 Identified set

To avoid making assumptions, the generative model can be treated as being partially identified (Gustafson, 2015). There is a growing literature focused on developing uncertainty bounds for the matching problem that reflect the limited information in the sample (D’Orazio et al., 2006b; Conti et al., 2012, 2016). For the multivariate normal model, the goal is to estimate the a feasible set of values for $\bm{\bm{\Sigma}}_{YZ}$ , rather than to deliver a point estimate (Kadane, 2001; Moriarity and Scheuren, 2001; Ahfock et al., 2016). The restriction that the covariance matrix be positive definite creates an identified set

\displaystyle\Theta_{\text{I}}

\displaystyle=\left\{\bm{\bm{\Sigma}}_{YZ}\in\mathbb{R}^{p_{Y}\times p_{Z}}:\begin{pmatrix}\bm{\bm{\Sigma}}_{XX}&\bm{\bm{\Sigma}}_{XY}&\bm{\bm{\Sigma}}_{XZ}\\ \bm{\bm{\Sigma}}_{YX}&\bm{\bm{\Sigma}}_{YY}&\bm{\bm{\Sigma}}_{YZ}\\ \bm{\bm{\Sigma}}_{ZX}&\bm{\bm{\Sigma}}_{ZY}&\bm{\bm{\Sigma}}_{ZZ}\end{pmatrix}\text{is positive definite}\right\}.

(3)

A disadvantage of the partial identification approach is that the level of uncertainty on $\bm{\Sigma}$ can be very high. Bounds on the parameters can be too wide for any practical interpretation.

2.3 Low-rank matrix completion

Low-rank matrix completion may be applied to fill in the partial estimate of the covariance matrix or to impute the missing observations in the partially observed data matrix. To complete the estimate of the covariance matrix $\widehat{\bm{\Sigma}}$ , one may consider the regularised least-squares objective

	$\displaystyle f(\bm{M},\bm{\Psi})$	$\displaystyle=\dfrac{1}{2}\sum_{(i,j)\in\Omega}(\widehat{\bm{\Sigma}}_{ij}-\bm{M}_{ij}-\bm{\Psi}_{ij})^{2}+\lambda\lVert\bm{M}\rVert_{*},$
		$\displaystyle\text{subject to }\bm{M}^{\mathsf{T}}=\bm{M},$

where $\lVert\bm{M}\rVert_{*}$ is the nuclear norm of the matrix $\bm{M}$ (Candes and Plan, 2010; Mazumder et al., 2010) and $\Omega$ contains the indices of the known elements of $\widehat{\bm{\Sigma}}$ . The main barrier to attempting matrix completion on $\widehat{\bm{\Sigma}}$ is that the presence of the diagonal matrix ${\bm{\Psi}}$ along with the symmetry constraint on $\bm{M}$ necessitates modifications to existing algorithms for low-rank matrix completion. The algorithm for symmetric low-rank matrix completion with block missingness developed in Bishop and Byron (2014) could be a useful starting point.

A more natural approach in the file-matching problem is to use a low-rank matrix completion algorithm to impute the missing data. Partitioning the complete-dataset $\mathcal{D}$ as

\displaystyle\mathcal{D}

\displaystyle=\begin{pmatrix}\bm{X}_{A}&\bm{Y}_{A}&\bm{Z}_{A}\\ \bm{X}_{B}&\bm{Y}_{B}&\bm{Z}_{B}\end{pmatrix}

where the first row corresponds to dataset A, and the second row corresponds to dataset B, the missing $\bm{Z}_{A}$ and $\bm{Y}_{B}$ observations can be estimated by applying matrix completion to $\mathcal{D}$ . The modelling assumption is that complete-dataset can be expressed as $\mathcal{D}=\mathcal{D}_{0}+\mathcal{E}$ for some rank $q$ matrix $\mathcal{D}_{0}$ plus noise $\mathcal{E}$ . To estimate $\mathcal{D}_{0}$ , the following least squares objective can be used

\displaystyle f(\bm{G},\bm{H})

\displaystyle=\dfrac{1}{2}\sum_{(i,j)\in\Omega}(\mathcal{D}_{ij}-[\bm{G}\bm{H}]_{ij})^{2},

where $\bm{G}$ is a $n\times q$ matrix, $\bm{H}$ is a $q\times p$ matrix and $\Omega$ contains the indices of the observed elements in $\mathcal{D}$ . With $(\widehat{\bm{G}},\widehat{\bm{H}})\in\operatorname*{argmin}_{(\bm{G},\bm{H})}f(\bm{G},\bm{H})$ , the low-rank estimate of $\mathcal{D}_{0}$ is

\displaystyle\widehat{\mathcal{D}}_{0}

\displaystyle=\widehat{\bm{G}}\widehat{\bm{H}}=\begin{pmatrix}\widehat{\bm{X}}_{A}&\widehat{\bm{Y}}_{A}&\widehat{\bm{Z}}_{A}\\ \widehat{\bm{X}}_{B}&\widehat{\bm{Y}}_{B}&\widehat{\bm{Z}}_{B}\end{pmatrix}.

(4)

Using the low-rank reconstructions $\widehat{\bm{Z}}_{A}$ and $\widehat{\bm{Y}}_{B}$ as imputed values for the missing-data $\bm{Z}_{A}$ and $\bm{Y}_{B}$ , an estimate of $\bm{\Sigma}_{YZ}$ is then given by

\displaystyle\widehat{\bm{\Sigma}}_{YZ}

\displaystyle=\text{cov}\left(\begin{bmatrix}{\bm{Y}}_{A}\\ \widehat{\bm{Y}}_{B}\end{bmatrix},\begin{bmatrix}\widehat{\bm{Z}}_{A}\\ {\bm{Z}}_{B}\end{bmatrix}\right).

The theoretical work in Koltchinskii et al. (2011) can be used to give finite-sample bounds on the reconstruction error $\lVert\mathcal{D}_{0}-\widehat{\mathcal{D}}_{0}\rVert_{2}^{2}$ under assumptions on the noise matrix $\mathcal{E}$ . However, in the context of the file-matching problem, it is not immediately clear from existing theoretical work whether exact recovery of the true covariance $\bm{\Sigma}_{YZ}$ is possible in the large-sample limit.

3 Factor models for statistical file-matching

3.1 Generative model

A probabilistic factor analysis model with $q$ factors has the latent variable representation

\displaystyle\begin{pmatrix}\bm{X}\\ \bm{Y}\\ \bm{Z}\end{pmatrix}

\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\\ \bm{\Lambda}_{Z}\end{pmatrix}\bm{F}+\bm{\epsilon},

(5)

where $\bm{F}\sim\mathcal{N}(\bm{0},\bm{I}_{q})$ , $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{\Psi})$ , and

\displaystyle\bm{\Psi}=\begin{pmatrix}\bm{\Psi}_{X}&\bm{0}&\bm{0}\\ \bm{0}&\bm{\Psi}_{Y}&\bm{0}\\ \bm{0}&\bm{0}&\bm{\Psi}_{Z}\\ \end{pmatrix}.

Kamakura and Wedel (2000) discuss the application of factor analysis models to the statistical file-matching problem, but do not consider the identifiability of the model, or develop an algorithm for parameter estimation. We address both of these important points. We prove the identifiability of the factor analysis model in the file-matching problem under the following assumptions.

Assumption 1.

The matrix of factors for the $\bm{X}$ variables, $\bm{\Lambda}_{X}$ , is of rank $q$ .

Assumption 2.

Define

\displaystyle\bm{\Lambda}_{A}

\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\end{pmatrix},\quad\bm{\Lambda}_{B}=\begin{pmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Z}\end{pmatrix}.

For both $\bm{\Lambda}_{A}$ and $\bm{\Lambda}_{B}$ , if any row is removed, there remain two disjoint submatrices of rank $q$ .

Assumption 2 is based on the sufficient conditions for the identifiability of the factor analysis model given in Anderson and Rubin (1956). For the file-matching-problem, Assumption 2 enforces the requirement that $q<(p_{X}+p_{Y})/2$ and $q<(p_{X}+p_{Z})/2$ . Assumption 1 requires that $q\leq p_{X}$ .

3.2 Identifiability

The key to recovery of $\bm{\Sigma}_{YZ}$ is that the matrix $\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}$ is of rank $q\leq p_{X}$ . Lemma 1 shows that it is possible to complete the matrix $\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}$ given the missing-data pattern in Table 1.

Lemma 1.

Suppose that Assumption 1 is satisfied. If all blocks other than $\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}$ and $\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}$ of

\displaystyle\bm{M}

\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}

are observed, then it is possible to recover $\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}$ from the observed elements. Consider the eigenvalue decomposition of the two sub-blocks of $\bm{M}$ that are relevant to dataset A and dataset B:

\displaystyle\bm{V}_{A}\bm{D}_{A}\bm{V}_{A}^{\mathsf{T}}

\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix},\quad\bm{V}_{B}\bm{D}_{B}\bm{V}_{B}^{\mathsf{T}}=\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}.

The factors for the marginal models for dataset A and dataset B in the canonical rotation are

\displaystyle\begin{pmatrix}\bm{\Lambda}_{X}^{A}\\ \bm{\Lambda}_{Y}^{A}\end{pmatrix}

\displaystyle=\bm{V}_{A}\bm{D}_{A}^{1/2},\quad\begin{pmatrix}\bm{\Lambda}_{X}^{B}\\ \bm{\Lambda}_{Z}^{B}\end{pmatrix}=\bm{V}_{B}\bm{D}_{B}^{1/2}.

To recover $\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}$ from the observed elements set

\displaystyle\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}

\displaystyle=\bm{\Lambda}_{Y}^{A}(\bm{\Lambda}_{Z}^{B}\bm{R})^{\mathsf{T}},

where $\bm{R}=\bm{W}\bm{Q}^{\mathsf{T}}$ , and $\bm{W}$ and $\bm{Q}$ are the left and right singular vectors of the matrix $\bm{M}=(\bm{\Lambda}_{X}^{B})^{\mathsf{T}}{\bm{\Lambda}_{X}^{A}}=\bm{W}\bm{D}\bm{Q}^{\mathsf{T}}$ .

The proof is given in the Appendix. Bishop and Byron (2014) establish more general results for the low-rank completion of symmetric matrices with structured missingness. Although Lemma 1 is sufficient for low-rank matrix completion problems, identifiability of the factor analysis model is a more complex issue. Specifically, for two parameter sets $\{\bm{\Lambda},\bm{\Psi}\}$ and $\{\bm{\Lambda}^{*},\bm{\Psi}^{*}\}$ , the equality $\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}=\bm{\Lambda}^{*}\bm{\Lambda}^{*\mathsf{T}}+\bm{\Psi}^{*}$ does not imply that $\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}=\bm{\Lambda}^{*}\bm{\Lambda}^{*\mathsf{T}}$ (Anderson and Rubin, 1956; Shapiro, 1985). Assumption 2 allows us to make a statement about the identifiability of the factor analysis model in the file-matching problem.

Theorem 1.

Suppose that Assumptions 1 and 2 are satisfied. Consider two parameter sets

\displaystyle\bm{\Theta}_{1}=\left\{\begin{pmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\\ \bm{\Lambda}_{Z}\end{pmatrix},\begin{pmatrix}\bm{\Psi}_{X}&0&0\\ 0&\bm{\Psi}_{Y}&0\\ 0&0&\bm{\Psi}_{Z}\end{pmatrix}\right\},\quad\bm{\Theta}_{2}=\left\{\begin{pmatrix}\bm{\Lambda}_{X}^{*}\\ \bm{\Lambda}_{Y}^{*}\\ \bm{\Lambda}_{Z}^{*}\end{pmatrix},\begin{pmatrix}\bm{\Psi}_{X}^{*}&0&0\\ 0&\bm{\Psi}_{Y}^{*}&0\\ 0&0&\bm{\Psi}_{Z}^{*}\end{pmatrix}\right\}.

Then if

	$\displaystyle\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}^{}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}^{}&\bm{\Lambda}_{X}^{}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}^{}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}^{}\end{pmatrix},$		(6)
	$\displaystyle\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{X}^{}&\bm{\Lambda}_{X}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}^{}\end{pmatrix}.$		(7)

We have that the factor loadings in $\bm{\Theta}_{1}$ and $\bm{\Theta}_{2}$ are equal up to post-multiplication on the right by an orthogonal matrix. That is,

\displaystyle\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}

\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\end{pmatrix}.

The proof is given in the Appendix. In terms of the generative model in Section 3.1, Theorem 1 states that if parameter sets $\bm{\Theta}_{1}$ and $\bm{\Theta}_{2}$ satisfy $f(\bm{x},\bm{y};\bm{\Theta}_{1})=f(\bm{x},\bm{y};\bm{\Theta}_{2})$ and $f(\bm{x},\bm{z};\bm{\Theta}_{1})=f(\bm{x},\bm{z};\bm{\Theta}_{2})$ then the factor loadings in $\bm{\Theta}_{1}$ and $\bm{\Theta}_{2}$ are equal up to post-multiplication on the right by an orthogonal matrix. The conclusion is that recovery of $\bm{\Sigma}_{YZ}$ in the file-matching scenario is possible given the observable sub-blocks of the population covariance matrix $\bm{\Sigma}$ . Assuming that the sample covariance matrices converge in probability to the population quantities as $n_{A},n_{B}\to\infty$ , consistent estimation of $\bm{\Sigma}_{YZ}$ is possible.

Under mild regularity conditions on the factor loadings, $q<p/2$ is sufficient for identifiability of the factor analysis model with complete-cases (Anderson and Rubin, 1956; Shapiro, 1985). In the file-matching scenario, Assumption 2 enforces the stronger condition that $q<(p_{X}+p_{Y})/2$ and $q<(p_{X}+p_{Z})/2$ . If the number of factors $q$ falls within the intervals $(p_{X}+p_{Y})/2<q<p/2$ and $(p_{X}+p_{Z})/2<q<p/2$ , Assumption 2 is violated. However, one may hope that the model is still identifiable, given that consistent estimation is possible with complete-cases.

Some insight can be gained by comparing the number of equations and the number of unknowns (Ledermann, 1937). There are $pq$ unknown parameters in $\bm{\Lambda}$ and $p$ unknown parameters in $\bm{\Psi}$ . Given the symmetry in $\bm{\Sigma}=\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$ , there are $p(p+1)/2$ equations, and the adoption of a rotation constraint on $\bm{\Lambda}$ yields an additional $q(q-1)/2$ equations (Anderson and Rubin, 1956). The number of equations minus the number of unknowns is $C=[(p-q)^{2}-p-q]/2$ . A useful rule of thumb is to expect the the factor analysis model to be identifiable if $C\geq 0$ , and nonidentifiable if $C<0$ (Ledermann, 1937; Anderson and Rubin, 1956). Bekker and ten Berge (1997) show that $C>0$ is sufficient for the identifiability of the factor model in most circumstances. In the file-matching problem, the equality constraints due to $\bm{\Sigma}_{YZ}$ are not present and the degrees of freedom is now reduced to $C_{M}=[(p-q)^{2}-p-q]/2-p_{Y}p_{Z}$ . If $C_{M}$ is negative and $C$ is positive it may signal a situation where the missing-data has caused the factor model to be nonidentifiable.

Table 2 reports the maximum number of allowable factors $q$ using the degrees of freedom criteria $C$ and $C_{M}$ , and the restrictions of Assumption 2 in a file-matching problem with $p_{X}=p_{Y}=p_{Z}=p/3$ . Assumption 2 places stronger requirements on the number of factors $q$ than the degrees of freedom requirement $C_{M}$ . The number of allowable factors using $C_{M}$ is smaller than $C$ , demonstrating the information loss due to the missing-data.

Table 2: Maximum number of factors

q

according to different identifiability criteria with

p_{X}=p_{Y}=p_{Z}=p/3

. The complete-cases degrees of freedom is given by

C=[(p-q)^{2}-p-q]/2

. The file-matching degrees of freedom is given by

C_{M}=C-p_{Y}p_{Z}

Criterion	Quantity	3	6	9	12	15	18	21
		$p$
Degrees of freedom, complete-cases	$C$	1	3	5	7	10	12	15
Degrees of freedom, file-matching	$C_{M}$	0	2	3	5	6	8	9
Assumption 2	$p/3$	0	1	2	3	4	5	6

3.3 Estimation

As mentioned in Section 2.3, procedures for low-rank matrix completion focus on the estimation of $\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}$ rather than the joint estimation of $\bm{\Lambda}$ and $\bm{\Psi}$ . A maximum likelihood approach to the factor analysis model facilitates joint estimation of $\bm{\Lambda}$ and $\bm{\Psi}$ . With complete-cases, the EM updates for factor analysis can be written in closed form (Rubin and Thayer, 1982). Let $\bm{S}$ denote the sample scatter matrix

\displaystyle\bm{S}

\displaystyle=\sum_{i=1}^{n}\begin{pmatrix}\bm{x}_{i}\\ \bm{y}_{i}\\ \bm{z}_{i}\end{pmatrix}\begin{pmatrix}\bm{x}_{i}\\ \bm{y}_{i}\\ \bm{z}_{i}\end{pmatrix}^{\mathsf{T}}.

(8)

Given current parameter estimates of $\bm{\bm{\Lambda}}$ and $\bm{\bm{\Psi}}$ , define $\bm{\beta}\equiv\bm{\Lambda}^{\mathsf{T}}(\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi})^{-1}$ . The M-step is

	$\displaystyle\bm{\Lambda}^{\text{new}}$	$\displaystyle=(\bm{S}\bm{\beta}^{\mathsf{T}})\left(nI-n\bm{\beta}\bm{\Lambda}+\bm{\beta}\bm{S}\bm{\beta}^{\mathsf{T}}\right)^{-1},$		(9)
	$\displaystyle\bm{\Psi}^{\rm{new}}$	$\displaystyle=\dfrac{1}{n}\text{diag}\left(\bm{S}-\bm{\Lambda}^{\text{new}}\bm{\beta}\bm{S}\right).$		(10)

For the file-matching problem we need to compute the conditional expectation of (8) given the missing-data pattern. The observed scatter matrices in dataset A and dataset B are

	$\displaystyle\bm{P}$	$\displaystyle=\sum_{i=1}^{n_{A}}\begin{pmatrix}\bm{x}_{i}\\ \bm{y}_{i}\end{pmatrix}\begin{pmatrix}\bm{x}_{i}\\ \bm{y}_{i}\end{pmatrix}^{\mathsf{T}}=\begin{pmatrix}\bm{P}_{XX}&\bm{P}_{XY}\\ \bm{P}_{YX}&\bm{P}_{YY}\end{pmatrix},$		(11)
	$\displaystyle\quad\bm{T}$	$\displaystyle=\sum_{i=1}^{n_{B}}\begin{pmatrix}\bm{x}_{i}\\ \bm{z}_{i}\end{pmatrix}\begin{pmatrix}\bm{x}_{i}\\ \bm{z}_{i}\end{pmatrix}^{\mathsf{T}}=\begin{pmatrix}\bm{T}_{XX}&\bm{T}_{XZ}\\ \bm{T}_{ZX}&\bm{T}_{ZZ}\end{pmatrix}.$		(12)

To determine the conditional expectation of the complete-data sufficient statistic, for current parameter estimates of $\bm{\Lambda}$ and $\bm{\Psi}$ define

	$\displaystyle\bm{\omega}$	$\displaystyle=\begin{pmatrix}\bm{\omega}_{X}&\bm{\omega}_{Y}\end{pmatrix}=\begin{pmatrix}\bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix}^{-1},$
	$\displaystyle\bm{\alpha}$	$\displaystyle=\begin{pmatrix}\bm{\alpha}_{X}&\bm{\alpha}_{Z}\end{pmatrix}=\begin{pmatrix}\bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}^{-1},$

and the conditional covariance matrices

	$\displaystyle\bm{\Sigma}_{Z\mid XY}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}-\begin{pmatrix}\bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix}^{-1}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix},$
	$\displaystyle\bm{\Sigma}_{Y\mid XZ}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix}-\begin{pmatrix}\bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}^{-1}\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}.$

Using the augmented scatter matrices

	$\displaystyle\widetilde{\bm{P}}$	$\displaystyle=\begin{pmatrix}\bm{P}_{XX}&\bm{P}_{XY}&\bm{P}_{XX}\bm{\omega}_{X}^{\mathsf{T}}+\bm{P}_{XY}\bm{\omega}_{Y}^{\mathsf{T}}\\ \bm{P}_{YX}&\bm{P}_{YY}&\bm{P}_{YX}\bm{\omega}_{X}^{\mathsf{T}}+\bm{P}_{YY}\bm{\omega}_{Y}^{\mathsf{T}}\\ \bm{\omega}_{Y}\bm{P}_{YX}+\bm{\omega}_{X}\bm{P}_{XX}&\bm{\omega}_{Y}\bm{P}_{YY}+\bm{\omega}_{X}\bm{P}_{XY}&\bm{\omega}\bm{P}\bm{\omega}^{\mathsf{T}}+n_{A}\bm{\Sigma}_{Z\mid XY}\end{pmatrix},$
	$\displaystyle\widetilde{\bm{T}}$	$\displaystyle=\begin{pmatrix}\bm{T}_{XX}&\bm{T}_{XX}\bm{\alpha}_{X}^{\mathsf{T}}+\bm{T}_{XZ}\bm{\alpha}_{Z}^{\mathsf{T}}&\bm{T}_{XZ}\\ \bm{\alpha}_{Z}\bm{T}_{ZX}+\bm{\alpha}_{X}\bm{T}_{XX}&\bm{\alpha}\bm{T}\bm{\alpha}^{\mathsf{T}}+n_{B}\bm{\Sigma}_{Y\mid XZ}&\bm{\alpha}_{Z}\bm{T}_{ZZ}+\bm{\alpha}_{X}\bm{T}_{XZ}\\ \bm{T}_{ZX}&\bm{T}_{ZX}\bm{\alpha}_{X}^{\mathsf{T}}+\bm{T}_{ZZ}\bm{\alpha}_{Z}^{\mathsf{T}}&\bm{T}_{ZZ}\end{pmatrix},$

the expected value of the complete-data sufficient statistic is given by

\displaystyle\widetilde{\bm{S}}

\displaystyle=\widetilde{\bm{P}}+\widetilde{\bm{T}}.

Given current parameter estimates of ${\bm{\Lambda}},{\bm{\Psi}}$ , define $\bm{\beta}\equiv\bm{\Lambda}^{\mathsf{T}}(\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi})^{-1}$ . The M-step is then

	$\displaystyle\bm{\Lambda}^{\text{new}}$	$\displaystyle=(\widetilde{\bm{S}}\bm{\beta}^{\mathsf{T}})\left(n\bm{I}-n\bm{\beta}\bm{\Lambda}+\bm{\beta}\widetilde{\bm{S}}\bm{\beta}^{\mathsf{T}}\right)^{-1},$		(13)
	$\displaystyle\bm{\Psi}^{\rm{new}}$	$\displaystyle=\dfrac{1}{n}\text{diag}\left(\widetilde{\bm{S}}-\bm{\Lambda}^{\text{new}}\bm{\beta}\tilde{\bm{S}}\right).$		(14)

The EM algorithm can be run using the sample covariance matrices from dataset A and dataset B, and so it is not necessary to have the original observations. The multivariate normality assumption used to develop the algorithm is not crucial, as long as the population covariance matrix fits the factor analysis model $\bm{\Sigma}=\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$ (Browne, 1984).

3.4 Model selection

In practice, the number of factors $q$ must be determined. The Bayesian information criterion (Schwarz, 1978) is a generic tool for model selection that has been well explored in factor analysis for selecting $q$ (Preacher et al., 2013). The Bayesian information criterion for a model $\mathcal{M}$ with likelihood function $L$ and parameter $\bm{\theta}$ is given by

\displaystyle\text{BIC}(\mathcal{M})

\displaystyle=-2\log L(\bm{\widehat{\theta}})+d\log n,

(15)

where $\widehat{\bm{\theta}}$ is the maximum likelihood estimate, $d$ is the number of free parameters in the model $\mathcal{M}$ , and $n$ is the sample size (Schwarz, 1978). For factor analysis with complete-cases the log-likelihood is

\displaystyle\log L(\bm{\Lambda},\bm{\Psi})

\displaystyle=-\dfrac{n}{2}\log\lvert\bm{\Sigma}\rvert-\dfrac{1}{2}\text{trace}\{\bm{\Sigma}^{-1}\bm{S}\}-\dfrac{np}{2}\log 2\pi,

where $\bm{\Sigma}=\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$ and $\bm{S}$ is the sample scatter matrix (8). The Bayesian information criterion for factor analysis with complete-cases is therefore

\displaystyle\text{BIC}(q)

\displaystyle=-2\log L(\widehat{\bm{\Lambda}},\widehat{\bm{\Psi}})+(qp+p-q(q-1)/2)\log n,

where the number of free parameters is obtained by subracting the number of rotation constraints $q(q-1)/2$ from the nominal number of parameters $qp+p$ .

For model selection in missing-data problems it has been suggested to use the observed-data log-likelihood in place of the complete-data log-likelihood in the standard definition of the Bayesian information criterion (15) (Ibrahim et al., 2008). This is the strategy we propose to use for determining the number of factors $q$ in the file-matching problem. The log-likelihood functions for each dataset are given by

	$\displaystyle\log L_{A}(\bm{\Sigma},\bm{\Psi})$	$\displaystyle=-\dfrac{n_{A}}{2}\log\lvert\bm{\Sigma}_{A}\rvert-\dfrac{1}{2}\text{trace}\{\bm{\Sigma}_{A}^{-1}\bm{P}\}-\dfrac{n_{A}p_{A}}{2}\log 2\pi,$
	$\displaystyle\log L_{B}(\bm{\Sigma},\bm{\Psi})$	$\displaystyle=-\dfrac{n_{B}}{2}\log\lvert\bm{\Sigma}_{B}\rvert-\dfrac{1}{2}\text{trace}\{\bm{\Sigma}_{B}^{-1}\bm{T}\}-\dfrac{n_{B}p_{B}}{2}\log 2\pi,$

where $p_{A}=p_{X}+p_{Y}$ , $p_{B}=p_{X}+p_{Z}$ , $\bm{P}$ is the observed scatter matrix from dataset A (11), $\bm{T}$ is the observed scatter matrix from dataset B (12) and

	$\displaystyle\bm{\Sigma}_{A}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix},$
	$\displaystyle\bm{\Sigma}_{B}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}.$

The Bayesian information criterion using the observed-data log-likelihood is then

\displaystyle\text{BIC}(q)

\displaystyle=-2\{\log L_{A}(\widehat{\bm{\Lambda}},\widehat{\bm{\Psi}})+\log L_{B}(\widehat{\bm{\Lambda}},\widehat{\bm{\Psi}})\}+(qp+p-q(q-1)/2)\log n.

4 Simulation

Here we investigate the necessity of Assumption 2 for the identifiability of the factor model by focusing on the noiseless setting where the exact covariance matrix $\bm{\Sigma}$ is known with the exception of $\bm{\Sigma}_{YZ}$ . We considered a file-matching scenario with $p_{X}=p_{Y}=p_{Z}=4$ . As seen in Table 2, Assumption 2 enforces stricter requirements than the degrees of freedom criterion based on $C_{M}=[(p-q)^{2}-p-q]/2-p_{Y}p_{Z}$ . We simulated three different covariance matrices with $p=12$ , one from a $q=3$ factor model, one from a $q=4$ factor model, and one from a $q=5$ factor model. Elements of $\Lambda$ were sampled from a $\mathcal{N}(2,1)$ distribution. The $j$ th diagonal element of $\Psi$ was set as $D_{j}^{2}$ , where $D_{j}\sim\mathcal{N}(3,0.01)$ . Assumption 2 holds only for $q=3$ , but the degrees of freedom $C_{M}$ is non-negative for $q=3,4,5$ .

For each covariance matrix, we fit a factor model using the EM algorithm for 10000 iterations using a random initial value $\Lambda$ , where each element was sampled from a $\mathcal{N}(0,1)$ distribution. The correct number of factors was used for each of the three test matrices. Given the estimate $\widehat{\bm{\Sigma}}$ , we computed the mean squared error over the elements of $\bm{\Sigma}_{YZ}$ , that is $\lVert\widehat{\bm{\Sigma}}_{YZ}-\bm{\Sigma}_{YZ}\rVert^{2}/(p_{Y}p_{Z})$ . We also computed the mean squared error for the observed blocks of $\bm{\Sigma}$ . This procedure was repeated fifty times. Figure 1 displays the results over the fifty random initialisations for each of the three covariance matrices $(q=3,4,5)$ .

The top row shows the estimation errors for $\bm{\Sigma}_{YZ}$ , the bottom row shows the estimation errors for the observed blocks of $\bm{\Sigma}$ . Looking at the bottom row, for $q=3,4$ , and $5$ the EM algorithm is able to reconstruct the observed blocks of $\bm{\Sigma}$ over each random initialisation. From the top row, for $q=3$ , the choice of initial values does not affect the final estimate of $\bm{\Sigma}_{YZ}$ , and the estimation error is zero in each trial. For $q=4$ and $5$ , the choice of initial values influences the final estimate of $\bm{\Sigma}_{YZ}$ . As the reconstruction error for the observed blocks of $\bm{\Sigma}$ is zero, this suggests there are multiple factor solutions and the model is not identifiable. The divergent behaviour when comparing $q=3$ to $q=4$ and $5$ is notable, as Assumption 2 is only satisfied for $q=3$ . The degrees of freedom $C_{M}$ are non-negative for $q=4$ and $5$ , but this does not lead to a well-behaved estimator. In this simulation, Assumption 2 appears to give better guidelines for the stable recovery of $\bm{\Sigma}_{YZ}$ compared to $C_{M}$ .

Refer to caption — Figure 1: Estimation errors in a simulated file-matching scenario with $p=12$ and $p_{X}=p_{Y}=p_{Z}=4$ . The EM algorithm was run for 10000 iterations, panels show the estimation error from different random initialisations. Bottom row: for $q=3,4$ , and $5$ the reconstruction error of the observed blocks of $\Sigma$ is zero across all initialisations. Top row: for $q=4$ and $5$ , the random initialisations appear to lead to different estimates of $\bm{\Sigma}_{YZ}$ , suggesting multiple factor solutions. For $q=3$ , Assumption 2 is satisfied, and the factor model is identifiable. For $q=3$ the estimation error of $\bm{\Sigma}_{YZ}$ is zero across all initialisations.

5 Data application

5.1 Estimation

In this section we compare the performance of various algorithms for the estimation of $\bm{\Sigma}_{YZ}$ . Existing algorithms for low-rank matrix completion were implemented using the R package filling (You, 2020). We also considered the standard conditional independence assumption (CIA). We tested the algorithms on the following datasets.

•

Sachs dataset. This dataset consists of results from 11 flow cytometry experiments measuring the same set of 12 variables. The measurements were log transformed. Data from each experiment was centered, then combined and scaled so that each variable had unit variance.
•

Wisconsin breast cancer dataset. The original observations in the malignant class were centered and scaled so that each variable had unit variance.
•

Abalone dataset. The original observations in the female class were centered and scaled so that each variable had unit variance.

For each dataset we considered matching scenarios with $p_{X}$ common variables, $p_{Y}$ unique variables in dataset A, and $p_{Z}$ unique variables in dataset B. This defines partitions of the complete-dataset. Table 3 gives the file-matching scenarios considered for each dataset, as well as the number of factors $q$ that were used in the fitting of the factor analysis model. The final columns indicate whether the degrees of freedom criteria $C$ and $C_{M}$ , and Assumption 2 are satisfied for the chosen number of factors $q$ . Assumption 2 is violated on the Sachs dataset, but the degrees of freedom $C_{M}$ is positive.

Table 3: Data examples.

C

represents the degrees of freedom given complete cases,

C_{M}

represents the degrees of freedom in the file-matching scenario. The final column states if it possible to satisfy Assumption 2 given the file-matching scenario.

Dataset	$p$	$p_{X}$	$p_{Y}$	$p_{Z}$	$q$	$C\geq 0$	$C_{M}\geq 0$	Assumption 2
Sachs	11	4	4	3	4	Yes	Yes	No
Wisconsin	10	3	3	4	2	Yes	Yes	Yes
Abalone	8	1	4	3	1	Yes	Yes	Yes

In each simulation, variables were randomly allocated to the $\bm{X}$ , $\bm{Y}$ , and $\bm{Z}$ groups. We then used different algorithms to estimate $\bm{\Sigma}_{YZ}$ . The mean squared error over elements of $\bm{\Sigma}_{YZ}$ was then recorded for each algorithm, that is $\lVert\widehat{\bm{\Sigma}}_{YZ}-\bm{\Sigma}_{YZ}\rVert^{2}/(p_{Y}p_{Z})$ . This process was repeated 100 times. The EM algorithm was run for 2000 iterations in each simulation. We considered two different initial value settings for the EM algorithm, initialisation using the factor analysis solution from the full covariance matrix, and random initialisation of $\bm{\Lambda}$ by sampling from a standard normal distribution. For the random initialisation protocol we took one hundred samples of $\bm{\Lambda}$ followed by a short run of the EM algorithm for fifty iterations. The parameters with the highest log-likelihood after the fifty iterations were then used in the longer run. In each simulation we also fit a factor model using the complete-dataset with no missing data to provide a reference point for the goodness-of-fit of the factor model.

Figure 2 compares the results using the different algorithms. The results for factor analysis model with the favourable initial values are shown as FM, the results for the factor analysis model with random initialisation are shown as FM (random). The factor model with good initial values (FM) has the lowest median error across each dataset (Table 4). The errors on the Sachs dataset and the Wisconsin breast cancer dataset are particularly small.

The results for the factor model with random initialisation (FM (random)) are very similar to those of FM using the favourable initial values except on the Sachs dataset. On the Sachs dataset, Assumption 2 is violated and there may be multiple factor solutions. In Figure 2 (a), the errors for the factor model with random initialisation (FM (random)) are much more dispersed than the factor model with the good initial values (FM). Although there are sufficient degrees of freedom $(C_{M}\geq 0)$ , it appears that the model may not be identifiable. With good initial values, the EM algorithm appears to converge to a local mode that gives a good estimate of $\Sigma_{YZ}$ .

The interquartile range of the error for each algorithm over the one hundred variable permutations is given in Table 4. The factor analysis model (FM) has the smallest interquartile range on each dataset, closely followed by the factor model using random initialisations with the exception of the Sachs dataset.

Table 4: Median error and interquartile range in estimation of

\bm{\Sigma}_{YZ}

across 100 variable permutations. The factor analysis model (FM) has the lowest median error and interquartile range on each dataset. The factor analysis model with random initial values (FM (random)) has comparable performance in each dataset except for Sachs, where Assumption 2 is violated.

Dataset	Value	FM	CIA	Soft-Impute	SVD	FM (random)	Complete
Sachs	Median	0.0006	0.0320	0.0465	3.1174	0.0402	0.0001
	IQR	0.0017	0.0351	0.0518	27.822	0.0907	0.0001
Wisconsin	Median	0.0032	0.0125	0.0058	0.0059	0.0036	0.0006
	IQR	0.0023	0.0216	0.0073	0.0113	0.0025	0.0008
Abalone	Median	0.0042	0.0125	0.0089	0.0102	0.0042	0.0031
	IQR	0.0031	0.0198	0.0081	0.0142	0.0031	0.0024

5.2 Model selection

We also performed some experiments to assess the behavior of the Bayesian information criterion for determining the number of factors $q$ . The following datasets and variable partitions were used in the experiments

•

Reise dataset. This dataset is a $16\times 16$ correlation matrix for mental health items. The variables were partitioned as $p_{X}=6,p_{Y}=5$ , and $p_{Z}=5$ .
•

Harman dataset. This dataset is a $24\times 24$ correlation matrix for psychological tests. The variables were partitioned as $p_{X}=5,p_{Y}=10$ , and $p_{Z}=9$ .
•

Holzinger dataset. This dataset is a $14\times 14$ correlation matrix for mental ability scores. The variables were partitioned as $p_{X}=6,p_{Y}=4$ , and $p_{Z}=4$ .
•

Simulated from a factor model with $q=6$ and $p=100$ . Elements of $\Lambda$ were sampled from a $\mathcal{N}(2,1)$ distribution. The $j$ th diagonal element of $\Psi$ was set as $D_{j}^{2}$ , where $D_{j}\sim\mathcal{N}(3,0.01)$ . The covariance matrix $\bm{\Sigma}=\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$ was then standardised to a correlation matrix. We simulated one dataset with $n_{A}=n_{B}=50$ observations and one dataset of $n_{A}=n_{B}=500$ observations. The variables were partitioned as $p_{X}=6,p_{Y}=47,p_{Z}=47$ .

In each simulation, variables were randomly allocated to the $\bm{X}$ , $\bm{Y}$ and $\bm{Z}$ groups with $n_{A}=n_{B}=n/2$ . We then computed the BIC for a range of values for $q$ and recorded the optimal value $q^{*}$ . This process was repeated 100 times. In each of the simulations we considered the range of values for $q$ such that Assumptions 1 and 2 were satisfied. Table 5 reports the number of times a $q$ factor model was chosen using the BIC over the 100 replications. The optimal number of factors according to the BIC using complete-cases is given as $q_{c}^{*}$ in Table 5. The missing-data appears to lead to the BIC acting conservatively, selecting a number of factors less than or equal to the number chosen with complete-cases. The correct number of factors $q=6$ is selected for the simulated datasets when using complete-cases. For the simulated dataset with $n=50$ , the BIC selects fewer than six factors in each trial. For the simulated dataset with $n=500$ the BIC selects the correct number of factors $q=6$ in each replication. Figure 3 shows boxplots of the estimation error $\lVert\widehat{\bm{\Sigma}}_{YZ}-\bm{\Sigma}_{YZ}\rVert^{2}/(p_{Y}p_{Z})$ for different values of $q$ . The errors from using the conditional independence assumption and from fitting a factor model with complete-cases are also reported for comparison. For almost all values of $q$ , the median error using the factor model is lower than that of the conditional independence model.

Table 5: Simulation results for model selection using the Bayesian information criterion. The optimal number of factors using complete-cases is denoted as

q_{c}^{*}

. Entries represent the number of times a

q

factor model was chosen using the BIC across the 100 variable permutations. A dash indicates that the

q

factor model would violate the identification conditions in Assumptions 1 and 2.

Dataset	$q_{\text{c}}^{*}$	$q=1$	$q=2$	$q=3$	$q=4$	$q=5$	$q=6$	$q=7$	$q=8$
Reise	8	0	1	1	18	80	-	-	-
Harman	3	62	37	1	0	0	-	-	-
Holzinger	4	0	60	37	3	-	-	-	-
Simulated $(n=50)$	6	5	51	41	3	0	0	0	0
Simulated $(n=500)$	6	0	0	0	0	0	100	0	0

6 Discussion

Technological and design constraints can prevent investigators from collecting a full dataset on all variables of interest. The statistical file-matching problem is an important data-fusion task where joint observations on the full set of variables are not available. Factor analysis models are useful as they can remain identifiable despite the missing-data pattern of the file-matching problem. The factor analysis approach is a useful alternative to the conditional independence assumption, as it is less restrictive and testable. Estimation of the factor analysis model can be carried out via the EM algorithm.

Although factor analysis and low-rank matrix completion are related, the identifiability of the factor analysis model $\bm{\Sigma}=\bm{\Lambda}\bm{\Lambda}^{\mathsf{T}}+\bm{\Psi}$ requires additional assumptions compared to low-rank matrix completion due to the diagonal matrix $\bm{\Psi}$ . As the assumption that $q<(p_{X}+p_{Y})/2$ and $q<(p_{X}+p_{Z})/2$ may be unnecessarily strong, it is of interest to establish the weakest conditions that ensure the identifiability of the factor analysis model in the file-matching problem. In many applications of file-matching, procedures for generating uncertainty bounds when the model is partially identified supply useful information (Conti et al., 2016). Further work may explore characterisations of the identified set for the factor analysis model when the number of latent factors $q$ exceeds the maximum number for identifiability.

It is also of interest to relax the assumption that both datasets are samples from the same homogeneous population with the joint model $f(\bm{x},\bm{y},\bm{z};\bm{\theta})$ . One possible avenue is to embed the file-matching problem in a hierarchical model. A dataset specific random effect can be applied to the parameters in dataset A and dataset B, so that there is shared component across datasets and a unique component within each dataset. Samples in dataset A are from the distribution $f(\bm{x},\bm{y};\bm{\theta}_{A})$ , and samples from dataset B are from the distribution $f(\bm{x},\bm{z};\bm{\theta}_{B})$ where $\bm{\theta}_{A}=\bm{\theta}_{0}+\delta_{A}$ and $\bm{\theta}_{B}=\bm{\theta}_{0}+\delta_{B}$ where $\delta_{A}$ and $\delta_{B}$ are random effect terms. The added flexibility of the hierarchical model can allow for situations where datasets A and B are from related but not necessarily identical populations, a common scenario in data integration. Model identifiability with heterogeneous data sources is an interesting and challenging problem, and the results here may serve as useful groundwork.

Acknowledgements

This research was funded by the Australian Government through the Australian Research Council (Project Numbers DP170100907 and IC170100035).

References

Abdelaal et al. (2019) Abdelaal, T., Höllt, T., van Unen, V., Lelieveldt, B. P. F., Koning, F., Reinders, M. J. T. and Mahfouz, A. (2019) CyTOFmerge: integrating mass cytometry data across multiple panels. Bioinformatics, 35, 4063–4071.
Ahfock et al. (2016) Ahfock, D., Pyne, S., Lee, S. X. and McLachlan, G. J. (2016) Partial identification in the statistical matching problem. Computational Statistics and Data Analysis, 104, 79–90.
Anderson and Rubin (1956) Anderson, T. W. and Rubin, H. (1956) Statistical inference in factor analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 238–246.
Barry (1988) Barry, J. T. (1988) An investigation of statistical matching. Journal of Applied Statistics, 15, 275–283.
Bekker and ten Berge (1997) Bekker, P. A. and ten Berge, J. M. (1997) Generic global identification in factor analysis. Linear Algebra and its Applications, 264, 255 – 263.
Bishop and Byron (2014) Bishop, W. E. and Byron, M. Y. (2014) Deterministic symmetric positive semidefinite matrix completion. In Advances in Neural Information Processing Systems, 2762–2770.
Browne (1984) Browne, M. W. (1984) Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37, 62–83.
Candes and Plan (2010) Candes, E. J. and Plan, Y. (2010) Matrix completion with noise. Proceedings of the IEEE, 98, 925–936.
Conti et al. (2012) Conti, P. L., Marella, D. and Scanu, M. (2012) Uncertainty analysis in statistical matching. Journal of Official Statistics, 28, 69–88.
Conti et al. (2016) — (2016) Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association, 111, 1715–1725.
Dempster et al. (1977) Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38.
D’Orazio (2019) D’Orazio, M. (2019) Statistical learning in official statistics: The case of statistical matching. Statistical Journal of the IAOS, 35, 435–441.
D’Orazio et al. (2006a) D’Orazio, M., Di Zio, M. and Scanu, M. (2006a) Statistical Matching: Theory and Practice. New York: Wiley.
D’Orazio et al. (2006b) D’Orazio, M., Zio, M. and Scanu, M. (2006b) Statistical matching for categorical data: Displaying uncertainty and using logical constraints. Journal of Official Statistics, 22, 137.
Gustafson (2015) Gustafson, P. (2015) Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data. Boca Raton: CRC Press.
Ibrahim et al. (2008) Ibrahim, J. G., Zhu, H. and Tang, N. (2008) Model selection criteria for missing-data problems using the EM algorithm. Journal of the American Statistical Association, 103, 1648–1658.
Kadane (2001) Kadane, J. B. (2001) Some statistical problems in merging data files. Journal of Official Statistics, 17, 423.
Kamakura and Wedel (2000) Kamakura, W. A. and Wedel, M. (2000) Factor analysis and missing data. Journal of Marketing Research, 37, 490–498.
Koltchinskii et al. (2011) Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011) Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Annals of Statistics, 39, 2302–2329.
Ledermann (1937) Ledermann, W. (1937) On the rank of the reduced correlational matrix in multiple-factor analysis. Psychometrika, 2, 85–93.
Lee et al. (2011) Lee, G., Finn, W. and Scott, C. (2011) Statistical file matching of flow cytometry data. Journal of Biomedical Informatics, 44, 663–676.
Li and Jung (2017) Li, G. and Jung, S. (2017) Incorporating covariates into integrated factor analysis of multi-view data. Biometrics, 73, 1433–1442.
Little (1988) Little, R. J. (1988) Missing-data adjustments in large surveys. Journal of Business & Economic Statistics, 6, 287–296.
Little and Rubin (2002) Little, R. J. A. and Rubin, D. B. (2002) Statistical Analysis with Missing Data. Hoboken: Wiley, 2nd edn.
Mazumder et al. (2010) Mazumder, R., Hastie, T. and Tibshirani, R. (2010) Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11, 2287–2322.
Moriarity and Scheuren (2001) Moriarity, C. and Scheuren, F. (2001) Statistical matching: a paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics, 17, 407.
O’Connell and Lock (2019) O’Connell, M. J. and Lock, E. F. (2019) Linked matrix factorization. Biometrics, 75, 582–592.
O’Neill et al. (2015) O’Neill, K., Aghaeepour, N., Parker, J., Hogge, D., Karsan, A., Dalal, B. and Brinkman, R. R. (2015) Deep profiling of multitube flow cytometry data. Bioinformatics, 31, 1623–1631.
Park and Lock (2020) Park, J. Y. and Lock, E. F. (2020) Integrative factorization of bidimensionally linked matrices. Biometrics, 76, 61–74.
Pedreira et al. (2008) Pedreira, C. E., Costa, E. S., Barrena, S., Lecrevisse, Q., Almeida, J., van Dongen, J. J. M. and Orfao, A. (2008) Generation of flow cytometry data files with a potentially infinite number of dimensions. Cytometry Part A, 73, 834–846.
Preacher et al. (2013) Preacher, K. J., Zhang, G., Kim, C. and Mels, G. (2013) Choosing the optimal number of factors in exploratory factor analysis: A model selection perspective. Multivariate Behavioral Research, 48, 28–56.
Rässler (2002) Rässler, S. (2002) Statistical Matching: A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches. New York: Springer-Verlag.
Rodgers (1984) Rodgers, W. L. (1984) An Evaluation of Statistical Matching. Journal of Business & Economic Statistics, 2, 91.
Rubin and Thayer (1982) Rubin, D. B. and Thayer, D. T. (1982) EM algorithms for ML factor analysis. Psychometrika, 47, 69–76.
Sachs et al. (2009) Sachs, K., Itani, S., Carlisle, J., Nolan, G. P., Pe’er, D. and Lauffenburger, D. A. (2009) Learning signaling network structures with sparsely distributed data. Journal of Computational Biology, 16, 201–212.
Schönemann (1966) Schönemann, P. H. (1966) A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31, 1–10.
Schwarz (1978) Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Shapiro (1985) Shapiro, A. (1985) Identifiability of factor analysis: Some results and open problems. Linear Algebra and its Applications, 70, 1–7.
Van Buuren (2018) Van Buuren, S. (2018) Flexible Imputation of Missing Data. Boca Raton: CRC Press.
You (2020) You, K. (2020) filling: Matrix Completion, Imputation, and Inpainting Methods. R package version 0.2.1.

Appendix A Appendix

A.1 Proof of Lemma 1

Due to the rotational invariance of the factor model, we have that

\displaystyle\begin{bmatrix}\bm{\Lambda}_{X}^{A}\\ \bm{\Lambda}_{Y}^{A}\\ \bm{\Lambda}_{Z}^{A}\end{bmatrix}

\displaystyle=\begin{bmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\\ \bm{\Lambda}_{Z}\end{bmatrix}\bm{R}_{1},\quad\begin{bmatrix}\bm{\Lambda}_{X}^{B}\\ \bm{\Lambda}_{Y}^{B}\\ \bm{\Lambda}_{Z}^{B}\end{bmatrix}=\begin{bmatrix}\bm{\Lambda}_{X}\\ \bm{\Lambda}_{Y}\\ \bm{\Lambda}_{Z}\end{bmatrix}\bm{R}_{2},\quad\begin{bmatrix}\bm{\Lambda}_{X}^{B}\\ \bm{\Lambda}_{Y}^{B}\\ \bm{\Lambda}_{Z}^{B}\end{bmatrix}=\begin{bmatrix}\bm{\Lambda}_{X}^{A}\\ \bm{\Lambda}_{Y}^{A}\\ \bm{\Lambda}_{Z}^{A}\end{bmatrix}\bm{R}_{3},

(A.1)

for orthogonal matrices $\bm{R}_{1},\bm{R}_{2}$ , and $\bm{R}_{3}$ . The alignment of $\bm{\Lambda}_{X}^{A}$ and $\bm{\Lambda}_{X}^{B}$ is an orthogonal Procrustes problem. Let $\bm{R}$ be the solution to the optimisation problem

\displaystyle\bm{R}

\displaystyle=\operatorname*{argmin}\ \lVert\bm{\Lambda}_{X}^{A}-\bm{\Lambda}_{X}^{B}\bm{R}\rVert_{F},\quad\text{ subject to }\bm{R}^{\mathsf{T}}\bm{R}=\bm{I}.

(A.2)

Assuming that $\bm{\Lambda}_{X}^{A}$ and $\bm{\Lambda}_{X}^{B}$ are of full column rank, Schönemann (1966) showed that there is a unique solution to (A.2). As $\text{rank}(\bm{\Lambda}_{X}^{A})=\text{rank}(\bm{\Lambda}_{X}^{B})=\text{rank}(\bm{\Lambda}_{X})$ , both $\bm{\Lambda}_{X}^{A}$ and $\bm{\Lambda}_{X}^{B}$ are of rank $q$ under Assumption 1. Define $\bm{M}=(\bm{\Lambda}_{X}^{B})^{\mathsf{T}}{\bm{\Lambda}_{X}^{A}}$ and let the singular value decomposition of $\bm{M}$ be given by $\bm{M}=\bm{W}\bm{D}\bm{Q}^{\mathsf{T}}$ . Then using the result from Schönemann (1966), the unique solution to (A.2) is given by $\bm{R}=\bm{W}\bm{Q}^{\mathsf{T}}$ . The uniqueness of the solution implies that $\bm{R}=\bm{R}_{3}^{\mathsf{T}}$ as $\Lambda_{X}^{B}\bm{R}_{3}\bm{R}_{3}^{\mathsf{T}}=\bm{\Lambda}_{X}^{A}$ from (A.1). Then $\bm{\Lambda}_{Z}^{B}\bm{R}=\bm{\Lambda}_{Z}^{B}\bm{R}_{3}^{\mathsf{T}}=\bm{\Lambda}_{Z}^{A}\bm{R}_{3}\bm{R}_{3}^{\mathsf{T}}=\bm{\Lambda}_{Z}^{A}$ again using (A.1). Finally, $\bm{\Lambda}_{Y}^{A}(\bm{\Lambda}_{Z}^{B}\bm{R})^{\mathsf{T}}=\bm{\Lambda}_{Y}^{A}(\bm{\Lambda}_{Z}^{A})^{\mathsf{T}}=\bm{\Lambda}_{Y}\bm{R}_{1}\bm{R}_{1}^{\mathsf{T}}\bm{\Lambda}_{Z}^{\mathsf{T}}=\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}$ .

A.2 Proof of Theorem 1

Using Theorem 5.1 in Anderson and Rubin (1956), Assumption 2 guarantees that if

	$\displaystyle\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}\end{pmatrix}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}^{}\bm{\Lambda}_{X}^{\mathsf{T}}+\bm{\Psi}_{X}^{}&\bm{\Lambda}_{X}^{}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}^{}\bm{\Lambda}_{Y}^{\mathsf{T}}+\bm{\Psi}_{Y}^{}\end{pmatrix},$
	$\displaystyle\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{X}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}\end{pmatrix}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{X}^{}&\bm{\Lambda}_{X}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}&\bm{\Lambda}_{Z}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}+\bm{\Psi}_{Z}^{}\end{pmatrix}.$

then the uniquenesses are equal, $\bm{\Psi}_{X}=\bm{\Psi}_{X}^{*}$ , $\bm{\Psi}_{Y}=\bm{\Psi}_{Y}^{*}$ , and $\bm{\Psi}_{Z}=\bm{\Psi}_{Z}^{*}$ , implying

	$\displaystyle\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}^{}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}^{}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix},$		(A.3)
	$\displaystyle\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}.$		(A.4)

Using Lemma 1, $\Lambda_{Y}\Lambda_{Z}^{\mathsf{T}}$ can be uniquely recovered given the matrices on the left-hand side of (A.3) and (A.4). Likewise, $\Lambda_{Y}^{*}\Lambda_{Z}^{*\mathsf{T}}$ can be uniquely recovered given the matrices on the right hand side of (A.3) and (A.4). It remains to show that $\Lambda_{Y}\Lambda_{Z}^{\mathsf{T}}=\Lambda_{Y}^{*}\Lambda_{Z}^{*\mathsf{T}}$ . To do so, define the eigendecompositions

	$\displaystyle\bm{V}_{A}\bm{D}_{A}\bm{V}_{A}^{\mathsf{T}}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix}=\begin{pmatrix}\bm{\Lambda}_{X}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}^{}\bm{\Lambda}_{Y}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}^{}\bm{\Lambda}_{Y}^{\mathsf{T}}\end{pmatrix},$
	$\displaystyle\bm{V}_{B}\bm{D}_{B}\bm{V}_{B}^{\mathsf{T}}$	$\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}=\begin{pmatrix}\bm{\Lambda}_{X}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}^{}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}^{}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix},$

and the rotated and scaled eigenvectors

\displaystyle\begin{pmatrix}\bm{\bm{\Gamma}}_{X}^{A}\\ \bm{\bm{\Gamma}}_{Y}^{A}\end{pmatrix}

\displaystyle=\bm{V}_{A}\bm{D}_{A}^{1/2},\quad\begin{pmatrix}\bm{\bm{\Gamma}}_{X}^{B}\\ \bm{\bm{\Gamma}}_{Z}^{B}\end{pmatrix}=\bm{V}_{B}\bm{D}_{B}^{1/2}.

Using Assumption 1 and Lemma 1, the equality

\displaystyle\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}

\displaystyle=\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}=\bm{\Gamma}_{Y}^{A}\bm{\Gamma}_{Z}^{B}\bm{W}\bm{Q}^{\mathsf{T}},

(A.5)

must hold, where $\bm{W}$ and $\bm{Q}$ are the left and right singular vectors of the matrix $\bm{M}=(\bm{\Gamma}_{X}^{B})^{\mathsf{T}}{\bm{\Gamma}_{X}^{A}}=\bm{W}\bm{D}\bm{Q}^{\mathsf{T}}$ . Combining the equalities in (A.3), (A.4) and (A.5) gives the main result

\displaystyle\begin{pmatrix}\bm{\Lambda}_{X}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{X}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Y}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Y}\bm{\Lambda}_{Z}^{\mathsf{T}}\\ \bm{\Lambda}_{Z}\bm{\Lambda}_{X}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Y}^{\mathsf{T}}&\bm{\Lambda}_{Z}\bm{\Lambda}_{Z}^{\mathsf{T}}\end{pmatrix}

\displaystyle=\begin{pmatrix}\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{X}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{Y}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\\ \bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{X}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Y}^{*\mathsf{T}}&\bm{\Lambda}_{Z}^{*}\bm{\Lambda}_{Z}^{*\mathsf{T}}\end{pmatrix}.