
Consistent Estimation of Low-Dimensional Latent Structure
in High-Dimensional Data

Xiongzhi Chen (xiongzhi@princeton.edu)
John D. Storey (jstorey@princeton.edu)
Center for Statistics and Machine Learning and
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544, USA
 
October 2015
Abstract

We consider the problem of extracting a low-dimensional, linear latent variable structure from high-dimensional random variables. Specifically, we show that under mild conditions and when this structure manifests itself as a linear space that spans the conditional means, it is possible to consistently recover the structure using only information up to the second moments of these random variables. This finding, specialized to one-parameter exponential families whose variance function is quadratic in their means, allows for the derivation of an explicit estimator of such latent structure. This approach serves as a latent variable model estimator and as a tool for dimension reduction for a high-dimensional matrix of data composed of many related variables. Our theoretical results are verified by simulation studies and an application to genomic data.

Keywords: exponential family distribution, factor analysis, high-dimensional data, latent variable model, spectral decomposition

1 Introduction

Low-dimensional latent variable models are often used to capture systematic structure in the conditional mean space of high-dimensional random variables (rv’s). This has been a popular strategy in high-dimensional probabilistic modeling and data analysis, and it serves as an attractive strategy for dimension reduction and recovering latent structure. Examples include factor analysis (Bartholomew et al., 2011), probabilistic principal components analysis (PCA) (Tipping and Bishop, 1999), non-negative matrix factorization (Lee and Seung, 1999), asymptotic PCA (Leek, 2011), latent Dirichlet allocation (LDA) (Blei et al., 2003), and exponential family distribution extensions of PCA (Collins et al., 2001).

Let \mathbf{Y}_{k\times n}=\left(y_{ij}\right) be an observed data matrix of k variables (one variable per row), each with n observations, whose entries are rv’s such that

\theta_{ij}=\mathbb{E}\left[y_{ij}|\mathbf{M}\right]=\left(\boldsymbol{\Phi}\mathbf{M}\right)\left(i,j\right)\text{ or }\boldsymbol{\Theta}=\mathbb{E}\left[\mathbf{Y}|\mathbf{M}\right]=\boldsymbol{\Phi}\mathbf{M}, (1)

where \mathbb{E} is the expectation operator, \mathbf{M}_{r\times n} is a matrix of r latent variables, and \boldsymbol{\Phi}_{k\times r} is a matrix of coefficients relating the latent variables to the observed variables. Furthermore, the dimensions are such that k\gg n\geq r. In model (1), the conditional mean \theta_{ij} of y_{ij} for a fixed i depends only on the jth column \mathbf{m}^{j} of \mathbf{M}, and each row of the conditional mean matrix \boldsymbol{\Theta} lies in the row space \Pi_{\mathbf{M}} of \mathbf{M}. The latent structure in \boldsymbol{\Theta} is therefore induced by \Pi_{\mathbf{M}}.

The above model is a general form of several highly used models. This includes instances of factor analysis (Bartholomew et al., 2011), probabilistic PCA (Tipping and Bishop, 1999), mixed membership clustering in population genetics (Pritchard et al., 2000; Alexander et al., 2009), which is closely related to LDA, and non-negative matrix factorization (Lee and Seung, 1999). Whereas the specialized models are often focused on the probabilistic interpretation of the columns of \mathbf{M}, we are instead interested here in its row space, \Pi_{\mathbf{M}}. This row space is sufficient for: (i) characterizing systematic patterns of variation in the data \mathbf{Y}, which can be used for exploratory data analysis or dimension reduction; (ii) accounting for the latent variables in downstream modeling of \mathbf{Y} (Leek and Storey, 2007, 2008) that requires adjustment for these variables; (iii) potentially identifying suitable initial values or geometric constraints for algorithms that estimate probabilistically constrained versions of \mathbf{M}; (iv) or recovering \mathbf{M} itself if additional geometric properties are known (e.g., as in Arora et al. (2013)). Furthermore, many of the above models make assumptions about the probability distribution of \mathbf{M} that may be untrue on a given data set. One can compare our estimate of \Pi_{\mathbf{M}} (which makes minimal assumptions) to the space induced by the model-based estimates of \mathbf{M} to gauge the accuracy of the model assumptions and fit. Therefore, we focus on estimation of the latent variable space \Pi_{\mathbf{M}}. Estimation of \boldsymbol{\Phi} may be tractable, but we do not focus on that here.

Leek (2011) and Anandkumar et al. (2012, 2015) have carried out work that is complementary to that presented here. They both study moment estimators of linear latent variable models applied to high-dimensional data. We explain how our work is related to Leek (2011) in Section 4.3 and how it is related to Anandkumar et al. (2012, 2015) in Section A. The strategies employed in these papers have ties to what we do here; however, they each consider different probabilistic models with theory that does not directly apply to the models we study.

We show that both the row space \Pi_{\mathbf{M}} of \mathbf{M} (in Section 2) and the row rank of \mathbf{M} (in Section 3) can be consistently estimated using information from a suitably adjusted n\times n matrix k^{-1}\mathbf{Y}^{T}\mathbf{Y}. In Section 4, we specialize these general results to y_{ij}’s that, conditional on \mathbf{M}, come from exponential family distributions. In particular, we explicitly construct a nonparametric, consistent estimator of the row space \Pi_{\mathbf{M}} of \mathbf{M} for y_{ij} rv’s that follow a natural exponential family (NEF) with quadratic variance function (QVF) using information only up to their second moments, and the estimator is computationally straightforward to implement. In Section 5, we extend the results of previous sections to the case where \boldsymbol{\Phi} is random. A simulation study is conducted in Section 6 to check and confirm our theoretical findings, and we apply the estimators to a genomics data set in Section 7. Finally, we end the article with a discussion in Section 8, collect all technical proofs in Section B, and present the full set of results from the simulation studies in Section C and Section D.

2 Almost Surely Estimating the Latent Linear Space

The big picture strategy we take is summarized as follows. Carrying out a decomposition related to Leek (2011), we first expand k^{-1}\mathbf{Y}^{T}\mathbf{Y} into four components:

k^{-1}\mathbf{Y}^{T}\mathbf{Y} = k^{-1}(\mathbf{Y}-\boldsymbol{\Phi}\mathbf{M}+\boldsymbol{\Phi}\mathbf{M})^{T}(\mathbf{Y}-\boldsymbol{\Phi}\mathbf{M}+\boldsymbol{\Phi}\mathbf{M})
= \underbrace{k^{-1}(\mathbf{Y}-\boldsymbol{\Phi}\mathbf{M})^{T}(\mathbf{Y}-\boldsymbol{\Phi}\mathbf{M})}_{(i)}+\underbrace{k^{-1}(\mathbf{Y}-\boldsymbol{\Phi}\mathbf{M})^{T}(\boldsymbol{\Phi}\mathbf{M})}_{(ii)} (2)
+\underbrace{k^{-1}(\boldsymbol{\Phi}\mathbf{M})^{T}(\mathbf{Y}-\boldsymbol{\Phi}\mathbf{M})}_{(iii)}+\underbrace{k^{-1}(\boldsymbol{\Phi}\mathbf{M})^{T}(\boldsymbol{\Phi}\mathbf{M})}_{(iv)} (3)

We then show the following about each of the components as k\rightarrow\infty, under the assumptions given in detail below:

  • (i): this term may be estimated arbitrarily well as k\rightarrow\infty by a diagonal matrix defined and studied below;

  • (ii) and (iii): these terms converge to the zero matrix;

  • (iv): this term converges to a symmetric matrix with r positive eigenvalues whose leading r eigenvectors span the row space of \mathbf{M}.

Once the convergence of these terms is rigorously established, the strategy we take is to form an estimate of term (i), denoted by \mathbf{\hat{D}}_{k}, and show that the space spanned by the leading r eigenvectors of k^{-1}\mathbf{Y}^{T}\mathbf{Y}-\mathbf{\hat{D}}_{k} converges to the row space of \mathbf{M} as k\rightarrow\infty. We then also provide a framework to estimate the dimension r of the row space of \mathbf{M} and incorporate it into this estimation framework.
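As a concrete illustration of this strategy, the following sketch simulates a small instance of model (1) with Poisson observations (a hypothetical choice of distribution and dimensions, not tied to any particular data set) and computes the four terms of equation (3), confirming numerically that terms (ii) and (iii) are small and that term (iv) has rank r.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, r = 20000, 10, 3                    # k >> n >= r, as required by model (1)

Phi = rng.uniform(0.5, 2.0, size=(k, r))  # coefficient matrix Phi (hypothetical values)
M = rng.uniform(0.3, 1.5, size=(r, n))    # latent variable matrix M (hypothetical values)
Theta = Phi @ M                           # conditional means Theta = Phi M
Y = rng.poisson(Theta).astype(float)      # one NEF example: Poisson counts

E = Y - Theta                             # residuals Y - Phi M
t1 = E.T @ E / k                          # term (i): approximated by a diagonal matrix
t2 = E.T @ Theta / k                      # term (ii): converges to the zero matrix
t3 = Theta.T @ E / k                      # term (iii): converges to the zero matrix
t4 = Theta.T @ Theta / k                  # term (iv): rank r, spans the row space of M

assert np.allclose(Y.T @ Y / k, t1 + t2 + t3 + t4)
print(np.linalg.norm(t2), np.linalg.matrix_rank(t4))  # small norm; rank equals r
```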

2.1 Model Assumptions

We first define the matrix norm \left\|\mathbf{X}\right\|=\sqrt{\sum_{i,j}x_{ij}^{2}} for any real matrix \mathbf{X}=\left(x_{ij}\right) and let C>0 denote a generic, finite constant whose value may vary at different occurrences. We assume for now that \boldsymbol{\Phi} is deterministic; the results in this section are extended to the case where \boldsymbol{\Phi} is random in Section 5. The assumptions on model (1) are as follows:

A1)

1\leq{\rm rank}\left(\mathbf{M}\right)=r<n and n is finite; \left\{y_{ij}|\mathbf{M}\right\}_{i,j} are jointly independent with variance \mathbb{V}\left[y_{ij}|\mathbf{M}\right]=\delta_{ij} such that \sup_{k}\max_{i,j}\mathbb{E}\left[y_{ij}^{8}|\mathbf{M}\right]\leq C (which implies \sup_{k}\max_{i,j}\mathbb{E}\left[y_{ij}^{4}|\mathbf{M}\right]\leq C), where \mathbb{V} is the variance operator.

A2)

\sup_{k\geq 1}\max_{1\leq i\leq k}\left\|\boldsymbol{\phi}_{i}\right\|\leq C, where \boldsymbol{\phi}_{i} is the ith row of \boldsymbol{\Phi}. Further, for some \mathbf{W}_{r\times r}>0 and some non-negative sequence c_{k}\rightarrow 0,

\left\|k^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}-\mathbf{W}\right\|=c_{k}. (4)

Since we are considering model (1), for which \mathbf{Y} is conditioned on \mathbf{M}, all random vectors in this model are by default conditioned on \mathbf{M} unless otherwise noted (e.g., Section 5). For conciseness we will omit stating “conditional on \mathbf{M}” in this default setting.

We state a consequence of the assumption A2) as Lemma 1, whose proof is straightforward and omitted.

Lemma 1

If the assumption A2) holds, then

\lim_{k\rightarrow\infty}k^{-1}\sum_{i=1}^{k}\theta_{ij}^{2}=\left(\mathbf{M}^{T}\mathbf{WM}\right)_{jj}

for each 1\leq j\leq n and \sup_{k}\max_{i,j}\left|\theta_{ij}\right|\leq C.

The uniform boundedness results provided by Lemma 1 will be used to prove the convergence results later.

2.2 k^{-1}\mathbf{Y}^{T}\mathbf{Y} Asymptotically Preserves the Latent Linear Space

We first derive the asymptotic form of k^{-1}\mathbf{Y}^{T}\mathbf{Y} with the aid of the strong law of large numbers (SLLN) in Walk (2005). Let \bar{\delta}_{kj}=k^{-1}\sum_{l=1}^{k}\delta_{lj} be the column-wise average variance of \mathbf{Y}|\mathbf{M} (where \mathbb{V}\left[y_{ij}|\mathbf{M}\right]=\delta_{ij} as defined above) and

\mathbf{D}_{k}={\rm diag}\left\{\bar{\delta}_{k1},...,\bar{\delta}_{kn}\right\} (5)

the n\times n diagonal matrix composed of these average variances.

Theorem 2

Under the assumptions for model (1),

\lim_{k\rightarrow\infty}\left\|k^{-1}\mathbf{Y}^{T}\mathbf{Y}-\mathbf{D}_{k}-\mathbf{H}\right\|=0\ \text{almost surely (a.s.),} (6)

where \mathbf{H}=\mathbf{M}^{T}\mathbf{WM}.

Theorem 2 shows that \mathbf{R}_{k}=k^{-1}\mathbf{Y}^{T}\mathbf{Y}-\mathbf{D}_{k} becomes arbitrarily close to \mathbf{H} as the number of variables k\rightarrow\infty. In fact, it gives much more information on the eigensystem of \mathbf{R}_{k} (where we take the convention that an eigenvector always has norm 1). Let \left\{\beta_{k,i}\right\}_{i=1}^{n} be the eigenvalues of \mathbf{R}_{k} ordered so that \beta_{k,i}\geq\beta_{k,i+1} for 1\leq i\leq n-1 (with the convention that the ordering among repeated copies of a multiple eigenvalue is arbitrary),

S={\bigcup}_{i=1}^{r}\left\{\mathbf{u}:\mathbf{u}\text{ is an eigenvector of }\mathbf{R}_{k}\text{ corresponding to }\beta_{k,i}\right\},

and \left\{\alpha_{i}\right\}_{i=1}^{n} be the eigenvalues of \mathbf{H} ordered so that \alpha_{i}\geq\alpha_{i+1} for 1\leq i\leq n-1.

Corollary 3

Under the assumptions for model (1),

\lim_{k\rightarrow\infty}\max_{1\leq i\leq n}\left|\beta_{k,i}-\alpha_{i}\right|=0\text{ a.s.} (7)

Further, \lim_{k\rightarrow\infty}\left|S\right|=r a.s. and

\lim_{k\rightarrow\infty}\left\langle\left\{\mathbf{u}\in S\right\}\right\rangle\triangle\Pi_{\mathbf{M}}=\varnothing\text{ a.s.,} (8)

where \left\langle\cdot\right\rangle denotes the linear space spanned by its arguments, and \triangle the symmetric set difference.

Corollary 3 reveals that, as k\rightarrow\infty, the eigenvalues of \mathbf{R}_{k} converge to those of \mathbf{H} when both sets of eigenvalues are ordered the same way, that the dimension of the space spanned by all the eigenvectors corresponding to the r largest eigenvalues of \mathbf{R}_{k} converges to r, and that \Pi_{\mathbf{M}} is asymptotically spanned by the leading r-dimensional joint eigenspace induced by \mathbf{R}_{k}. When the nonzero eigenvalues of \mathbf{H} are distinct, we easily have

\lim_{k\rightarrow\infty}\left\langle\mathbf{u}_{k,i}\right\rangle\triangle\left\langle\mathbf{v}_{i}\right\rangle=\varnothing\text{ a.s. for each }i=1,\ldots,r, (9)

where, modulo a sign, \mathbf{u}_{k,i} is the eigenvector corresponding to \beta_{k,i} and \mathbf{v}_{i} that corresponding to \alpha_{i}.

When the dimension r of the latent space \Pi_{\mathbf{M}} and the diagonal matrix \mathbf{D}_{k} of the column-wise average variances are known, it follows from Corollary 3 that \tilde{\Pi}_{\mathbf{M}}=\left\langle\left\{\mathbf{u}\in S\right\}\right\rangle asymptotically spans the latent space \Pi_{\mathbf{M}}; that is, \tilde{\Pi}_{\mathbf{M}} converges to the row space \Pi_{\mathbf{M}} with probability 1. However, in practice both r and \mathbf{D}_{k} need to be estimated, which is the topic of the next three sections. Estimating the number of latent variables r is in general a difficult problem. In our setting we must also accurately estimate \mathbf{D}_{k}, which can be a difficult task when the variances \delta_{ij} may all be different (i.e., heteroskedastic).

3 Consistently Estimating the Latent Linear Space Dimension

The strategy we take to consistently estimate the dimension r of the latent variable space is to carefully scale the ordered eigenvalues of \mathbf{R}_{k}=k^{-1}\mathbf{Y}^{T}\mathbf{Y}-\mathbf{D}_{k} and identify the index at which the scaled, ordered eigenvalues separate into two groups when k is large. Recall that, by Theorem 2 and Corollary 3, the difference between the vector of descendingly ordered eigenvalues of \mathbf{R}_{k} and that of \mathbf{H}=\mathbf{M}^{T}\mathbf{WM} converges to zero as k\rightarrow\infty. However, since {\rm rank}\left(\mathbf{H}\right)=r and \mathbf{W}>0, we know that the r largest eigenvalues of \mathbf{H} are positive but the rest are zero. This means that the r largest eigenvalues of \mathbf{R}_{k} are all strictly positive as k\rightarrow\infty, while the smallest n-r eigenvalues of \mathbf{R}_{k} converge to 0 as k\rightarrow\infty. Depending on the speed of convergence of the n-r smallest eigenvalues of \mathbf{R}_{k} to 0, if we suitably scale the eigenvalues of \mathbf{R}_{k}, then the scaled, ordered eigenvalues will eventually separate into two groups: those with very large magnitudes and the rest with very small magnitudes. The index at which such a separation happens is then a consistent estimator of r. If we replace \mathbf{R}_{k} with an estimator \hat{\mathbf{R}}_{k}=k^{-1}\mathbf{Y}^{T}\mathbf{Y}-\hat{\mathbf{D}}_{k} that satisfies a certain level of accuracy detailed below, then the previous reasoning applied to the eigenvalues of \hat{\mathbf{R}}_{k} will also give a consistent estimator of r.

To find the scaling sequence for the magnitudes of the eigenvalues of \mathbf{R}_{k}, or equivalently the speed of convergence of the n-r smallest eigenvalues of \mathbf{R}_{k} to 0, we define the k\times n matrix \mathbf{E} with entries \mathbf{E}(i,j)=e_{ij}=y_{ij}-\theta_{ij}, and study as a whole the random part of k^{-1}\mathbf{Y}^{T}\mathbf{Y} defined by

\mathbf{F}=\underbrace{k^{-1}\mathbf{E}^{T}\mathbf{E}}_{(i)}+\underbrace{k^{-1}\mathbf{E}^{T}\boldsymbol{\Theta}}_{(ii)}+\underbrace{k^{-1}\boldsymbol{\Theta}^{T}\mathbf{E}}_{(iii)}. (10)

Note that the terms (i), (ii) and (iii) in \mathbf{F} correspond to those in equation (3). We will show that \mathbf{F}, configured as a 2n^{2}-vector, is asymptotically Normal after centering and scaling as k\rightarrow\infty. This then reveals that the scaling sequence for the eigenvalues of \mathbf{R}_{k} should be of order no smaller than k^{-1/2}.

Let \boldsymbol{\Delta}=\left(\delta_{ij}\right), \boldsymbol{\Delta}_{i}={\rm diag}\left\{\delta_{i1},...,\delta_{in}\right\} and define

\mathbf{z}_{i}=k^{-1}\left(e_{i1}\mathbf{e}_{i},...,e_{in}\mathbf{e}_{i},\boldsymbol{\phi}_{i}\mathbf{m}^{1}\mathbf{e}_{i},...,\boldsymbol{\phi}_{i}\mathbf{m}^{n}\mathbf{e}_{i}\right) (11)

for 1\leq i\leq k, where \mathbf{e}_{i} and \mathbf{m}^{j} are respectively the ith row of \mathbf{E} and the jth column of \mathbf{M}. We have:

Proposition 4

Under model (1), \mathbf{F} is a linear function of \sum_{i=1}^{k}\mathbf{z}_{i} with global Lipschitz constant 1. If we assume A1) and A2) from Section 2 and also

A3)

The sequences k^{-1}\sum_{i=1}^{k}\mathbb{E}\left[e_{ij}^{4}|\mathbf{M}\right], k^{-1}\sum_{i=1}^{k}\boldsymbol{\phi}_{i}\mathbb{E}\left[e_{ij}^{3}|\mathbf{M}\right] and k^{-1}\sum_{i=1}^{k}\boldsymbol{\phi}_{i}\mathbf{m}^{j}\boldsymbol{\phi}_{i}\mathbf{m}^{l}\boldsymbol{\Delta}_{i} for any 1\leq j\leq l\leq n are all convergent as k\rightarrow\infty,

then \sum_{i=1}^{k}\sqrt{k}\left(\mathbf{z}_{i}-\mathbb{E}\left[\mathbf{z}_{i}|\mathbf{M}\right]\right) converges in distribution to a multivariate Normal random vector with mean zero.

With the concentration property of \mathbf{F} established by Proposition 4, we are able to explore the possibility of consistently estimating r by studying the magnitudes of the eigenvalues \left\{\hat{\alpha}_{k,i}\right\}_{i=1}^{n} of

\mathbf{\hat{R}}_{k}=k^{-1}\mathbf{Y}^{T}\mathbf{Y}-\mathbf{\hat{D}}_{k}\text{ with }\mathbf{\hat{D}}_{k}={\rm diag}\left\{\hat{\delta}_{k1},...,\hat{\delta}_{kn}\right\}, (12)

where each \hat{\delta}_{kj} is an estimate of \bar{\delta}_{kj} for 1\leq j\leq n, and the \hat{\alpha}_{k,i}’s are ordered so that \hat{\alpha}_{k,i}\geq\hat{\alpha}_{k,i+1} for 1\leq i\leq n-1.

Theorem 5

Under the assumptions A1), A2) and A3), if

A4)

for some non-negative \varepsilon_{k} such that \varepsilon_{k}\rightarrow 0,

\left\|\mathbf{\hat{D}}_{k}-\mathbf{D}_{k}\right\|=\varepsilon_{k}, (13)

then

\left\|\mathbf{\hat{R}}_{k}-\mathbf{H}\right\|=O_{\Pr}\left(\tau_{k}\right)\text{ and }\left|\hat{\alpha}_{k,i}-\alpha_{i}\right|=O_{\Pr}\left(\tau_{k}\right), (14)

where \Pr is the probability measure and

\tau_{k}=\max\left\{k^{-1/2},\varepsilon_{k},c_{k}\right\}. (15)

Further, for any \tilde{\tau}_{k}>0 such that \tilde{\tau}_{k}=o\left(1\right) and \tau_{k}=o\left(\tilde{\tau}_{k}\right), as k\rightarrow\infty,

\Pr\left(\tilde{\tau}_{k}^{-1}\hat{\alpha}_{k,i}>\tilde{c}\right)\rightarrow 0 \quad\text{if } r+1\leq i\leq n,
\Pr\left(\tilde{\tau}_{k}^{-1}\hat{\alpha}_{k,i}\rightarrow\infty\right)\rightarrow 1 \quad\text{if } 1\leq i\leq r, (16)

for any fixed \tilde{c}>0. Therefore, letting

\hat{r}=\sum_{i=1}^{n}1_{\left\{\tilde{\tau}_{k}^{-1}\hat{\alpha}_{k,i}>\tilde{c}\right\}} (17)

gives

\Pr\left(\hat{r}=r\right)\rightarrow 1\text{ as }k\rightarrow\infty, (18)

where 1_{A} is the indicator of a set A.

Theorem 5 shows that the speed of convergence \left\{\tau_{k}\right\}_{k\geq 1} of the eigenvalues of \hat{\mathbf{R}}_{k} (and those of \mathbf{R}_{k}) is essentially determined by the rates associated with k^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi} and \mathbf{D}_{k}; see equations (13)–(15). Further, it reveals that, when \left\{\tau_{k}\right\}_{k\geq 1} is known, with probability approaching 1 as k\rightarrow\infty, the scaled, ordered eigenvalues \left\{\tilde{\tau}_{k}^{-1}\hat{\alpha}_{k,i}\right\}_{i=1}^{n} eventually separate into two groups: \tilde{\tau}_{k}^{-1}\hat{\alpha}_{k,i} for 1\leq i\leq r all lie above \tilde{c}, while the rest all lie below \tilde{c}, for any chosen \tilde{c}>0. In other words, for a chosen \tilde{c}>0, the number of \tilde{\tau}_{k}^{-1}\hat{\alpha}_{k,i}>\tilde{c} when k is large is very likely equal to r. However, in practice \left\{\varepsilon_{k}\right\}_{k\geq 1} and \left\{c_{k}\right\}_{k\geq 1} are unknown, and even if they are known or can be estimated, the hidden constants in (14) are unfortunately unknown (even when k=\infty). Further, when k is finite, the n-r smallest eigenvalues of \mathbf{\hat{R}}_{k} may not yet be identically zero, the hidden constants may have a large impact on estimating the scaling sequence \tilde{\tau}_{k}, and rates slightly different from \tilde{\tau}_{k} may have to be chosen to balance the effects of the hidden constants; see Section 6 for a brief discussion of the estimation of \tilde{\tau}_{k} and the effects of the choice of \tilde{\tau}_{k} on estimating r for finite k.
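The counting estimator (17) is simple to implement once \hat{\mathbf{R}}_{k} is available; the sketch below is a minimal version in which the threshold \tilde{c} and the power-law scaling \tilde{\tau}_{k}=c\,k^{-\eta} are illustrative placeholders rather than the calibrated choices (e.g., via Hallin and Liska, 2007) discussed in Section 6.

```python
import numpy as np

def estimate_rank(R_hat, k, c_tilde=1.0, c=1.0, eta=1.0 / 3.0):
    """Count scaled eigenvalues of R_hat above c_tilde, as in equation (17).

    tau_tilde = c * k**(-eta) is one simple choice of scaling sequence;
    c, eta and c_tilde are tuning constants, not values prescribed here.
    """
    eigvals = np.sort(np.linalg.eigvalsh(R_hat))[::-1]  # descending order
    tau_tilde = c * k ** (-eta)
    return int(np.sum(eigvals / tau_tilde > c_tilde))
```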

Before we apply our theory to y_{ij}’s that follow specific parametric distributions, we pause to comment on the key results obtained so far. Theorem 2 and Corollary 3 together ensure that asymptotically as k\rightarrow\infty we can span the row space \Pi_{\mathbf{M}} of \mathbf{M} by the r leading eigenvectors of \mathbf{R}_{k}=k^{-1}\mathbf{Y}^{T}\mathbf{Y}-\mathbf{D}_{k} (see the definition of \mathbf{D}_{k} in (5)). However, the conclusions in these results rely on a known r and \mathbf{D}_{k}. In contrast, Proposition 4 and Theorem 5 retain assertions similar to those of Theorem 2 and Corollary 3 when the unknown \mathbf{D}_{k} is replaced by a consistent estimate \hat{\mathbf{D}}_{k}, and they enable us to construct a consistent estimate \hat{r} of r such that the linear space spanned by the leading \hat{r} eigenvectors of \hat{\mathbf{R}}_{k}=k^{-1}\mathbf{Y}^{T}\mathbf{Y}-\hat{\mathbf{D}}_{k} consistently estimates \Pi_{\mathbf{M}} as k\rightarrow\infty. This reveals that to consistently estimate \Pi_{\mathbf{M}} in a fully data-driven approach using our theory, it is crucial to develop a consistent estimate \hat{\mathbf{D}}_{k} of \mathbf{D}_{k}.

4 Specializing to Exponential Family Distributions

We specialize the general results obtained in Section 2 and Section 3 for model (1) to the case where the y_{ij} follow the single-parameter exponential family probability density function (pdf) given by

f(y;\theta)=h\left(y\right)\exp\left\{\eta(\theta)y-g\left(\eta\left(\theta\right)\right)\right\}, (19)

where \eta is the canonical link function, and g and h are known functions such that f(y;\theta) is a proper pdf. The values that \theta can take are \Omega=\left\{\theta\in\mathbb{R}:\int f\left(y;\theta\right)dy=1\right\}. The following corollary says that our general results on consistent estimation of the latent space \Pi_{\mathbf{M}} and its dimension r hold when the link function \eta is bounded on the closure of the parameter space \Omega_{0}\subseteq\Omega.

Corollary 6

Suppose \Omega_{0} is an open subset of \Omega such that its closure lies in the interior of \Omega. If, conditional on \mathbf{M}, each y_{ij} has pdf of the form (19) with \theta_{ij}\in\Omega_{0} and \eta is bounded as a function of \theta on the closure of \Omega_{0}, then \sup_{k}\max_{i,j}\mathbb{E}\left[y_{ij}^{8}|\mathbf{M}\right]\leq C, which implies that Theorem 2, Corollary 3, Proposition 4 and Theorem 5 hold.

We remark that the boundedness of \eta on the closure of \Omega_{0} is not restrictive, in that in practice either \Omega_{0} is bounded in Euclidean norm or \eta is bounded in supremum norm. The proof of Corollary 6 is a simple observation that g is analytic in \eta when \theta\in\Omega_{0} (see, e.g., Letac and Mora, 1990) and that the derivative of g in \eta of any order is bounded when \eta is bounded; the proof is thus omitted.

4.1 Estimating \mathbf{D}_{k}

With Corollary 6 (see also the discussion at the end of Section 3), we only need to estimate \bar{\delta}_{kj} (which in turn yields \mathbf{D}_{k}) in order to consistently estimate r and \Pi_{\mathbf{M}}. To obtain an estimate \hat{\delta}_{kj} of \bar{\delta}_{kj} when potentially all the \delta_{ij} differ from one another (i.e., complete heteroskedasticity), we exploit the intrinsic relationship between \theta_{ij} and \delta_{ij} when the y_{ij} come from a certain class of natural exponential family (NEF) distributions.

Lemma 7

Let y have marginal pdf (19). Then there exists a quadratic function v\left(\cdot\right) such that \mathbb{E}\left[v\left(y\right)\right]=\mathbb{V}\left[y\right] if and only if f parametrized by \eta forms an NEF with quadratic variance function (QVF) as defined by Morris (1982), such that

\mathbb{V}\left[y\right]=b_{0}+b_{1}\mathbb{E}\left[y\right]+b_{2}\left(\mathbb{E}\left[y\right]\right)^{2}\text{ with }b_{2}\neq-1 (20)

for some b_{0},b_{1},b_{2}\in\mathbb{R}. Specifically, (20) implies

v\left(t\right)=\left(1+b_{2}\right)^{-1}\left(b_{0}+b_{1}t+b_{2}t^{2}\right). (21)

The proof of Lemma 7 is straightforward and omitted. Table 1 lists v\left(\cdot\right) for the six NEFs with QVF.

Table 1: The function v(\cdot) such that \mathbb{E}[v(y)]=\mathbb{V}[y] when y has pdf (19) and comes from an NEF with QVF in Morris (1982). Note that {\rm logit}(x)=\log\dfrac{x}{1-x} for x\in(0,1) and “GHS” stands for “generalized hyperbolic secant distribution”.
Distribution | \theta | \mathbb{V}[y] | v(y) | \eta(\theta)
Normal(\mu,1) | \mu | 1 | 1 | \theta
Poisson(\lambda) | \lambda | \theta | y | \log(\theta)
Binomial(s,p) | sp | \theta-\theta^{2}/s | (sy-y^{2})/(s-1) | {\rm logit}(\theta/s)
NegBin(s,p) | sp/(1-p) | \theta+\theta^{2}/s | (sy+y^{2})/(s+1) | \log(\theta/(s+\theta))
Gamma(s,\lambda) | s/\lambda | \theta^{2}/s | y^{2}/(1+s) | -1/\theta
GHS(s,\lambda) | s\lambda | s+\theta^{2}/s | (s^{2}+y^{2})/(1+s) | \arctan(\theta/s)
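As a small illustration, the v(\cdot) column of Table 1 can be coded directly; the function names below are our own, and the arguments follow the parameterizations in the table (for Normal(\mu,1), v(y)=1).

```python
# v(y) for the NEF-QVF families of Table 1
def v_poisson(y):
    return y

def v_binomial(y, s):
    return (s * y - y ** 2) / (s - 1)

def v_negbin(y, s):
    return (s * y + y ** 2) / (s + 1)

def v_gamma(y, s):
    return y ** 2 / (1 + s)

def v_ghs(y, s):
    return (s ** 2 + y ** 2) / (1 + s)
```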

Inspired by the availability of the function v(\cdot) obtained in Lemma 7, we now state a general result on how to explicitly construct a \hat{\mathbf{D}}_{k} that properly estimates \mathbf{D}_{k}.

Lemma 8

Let y_{ij} have pdf (19) such that \mathbb{V}\left[y_{ij}|\mathbf{M}\right]=\mathbb{E}\left[v\left(y_{ij}\right)|\mathbf{M}\right] for some function v\left(\cdot\right) satisfying

\sup_{k}\max_{i,j}\mathbb{E}\left[v^{4}\left(y_{ij}\right)|\mathbf{M}\right]\leq C<\infty. (22)

Then

\hat{\delta}_{kj}=k^{-1}\sum_{l=1}^{k}v\left(y_{lj}\right) (23)

satisfies \lim_{k\rightarrow\infty}\left|\hat{\delta}_{kj}-\bar{\delta}_{kj}\right|=0 a.s. for each 1\leq j\leq n. If, additionally, for each 1\leq j\leq n and some \sigma_{j}>0,

\lim_{k\rightarrow\infty}k^{-1}\sum_{l=1}^{k}\mathbb{V}\left[v\left(y_{lj}\right)|\mathbf{M}\right]=\sigma_{j}, (24)

then \sqrt{k}\left(\hat{\delta}_{kj}-\bar{\delta}_{kj}\right) converges in distribution to a Normal random variable as k\rightarrow\infty.

Lemma 8 shows that \hat{\mathbf{D}}_{k} in (12), with \hat{\delta}_{kj} defined by (23), satisfies \lim_{k\rightarrow\infty}\left\|\hat{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|=0 a.s. Note that the first assertion in Lemma 8, i.e., \lim_{k\rightarrow\infty}\left|\hat{\delta}_{kj}-\bar{\delta}_{kj}\right|=0 a.s., clearly applies to y_{ij} that follow NEFs with QVF when their corresponding \theta_{ij} lie in a set \Omega_{0} as described in Corollary 6. We remark that requiring the closure of \Omega_{0} to be in the interior of \Omega is not restrictive, since in practice the \theta_{ij}’s are not boundary points of \Omega.
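A minimal sketch of the resulting estimator of \mathbf{D}_{k}, assuming the rows of the data matrix are the variables and v is one of the functions from Table 1, is as follows (the function name is ours).

```python
import numpy as np

def estimate_D(Y, v):
    """Diagonal matrix D_hat_k of equation (12), with entries
    delta_hat_kj = k^{-1} * sum_l v(y_lj) from equation (23)."""
    return np.diag(v(Y).mean(axis=0))  # column-wise averages of v(y_ij)

# e.g., for Poisson data, where v(y) = y: D_hat = estimate_D(Y, v=lambda y: y)
```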

4.2 Simultaneously Consistently Estimating the Latent Linear Space and Its Dimension

We are ready to present a consistent estimator of the latent linear space \Pi_{\mathbf{M}}:

  1. Set \mathbf{\hat{R}}_{k}=k^{-1}\mathbf{Y}^{T}\mathbf{Y}-\mathbf{\hat{D}}_{k} as in (12) with \hat{\delta}_{kj} as in (23).

  2. Estimate r by \hat{r} using (17) as given in Theorem 5.

  3. From the spectral decomposition \mathbf{\hat{R}}_{k}=\mathbf{VKV}^{T}, where \mathbf{V}^{T}\mathbf{V}=\mathbf{I} and \mathbf{K}={\rm diag}\left\{\hat{\alpha}_{k,i}\right\}_{i=1}^{n}, pick the \hat{r} columns \left\{\mathbf{\hat{u}}_{k,i}\right\}_{i=1}^{\hat{r}} of \mathbf{V} corresponding to the \hat{r} largest eigenvalues \left\{\hat{\alpha}_{k,i}\right\}_{i=1}^{\hat{r}}.

  4. Set

     \hat{\Pi}_{\mathbf{M}}=\left\langle\left\{\mathbf{\hat{u}}_{k,i}\right\}_{i=1}^{\hat{r}}\right\rangle (25)

     to be the estimate of \Pi_{\mathbf{M}}. Note that \mathbf{\hat{M}}=\left(\mathbf{\hat{u}}_{k,1},...,\mathbf{\hat{u}}_{k,\hat{r}}\right)^{T} can be regarded as an estimator of \mathbf{M}, even though it is not our focus.

The above procedure is supported by the following theorem, whose proof is straightforward and omitted; in particular, \hat{\Pi}_{\mathbf{M}} in (25) consistently estimates \Pi_{\mathbf{M}}.

Theorem 9

Under the assumptions of Theorem 2 and Lemma 8, \lim_{k\rightarrow\infty}\left\|\mathbf{\hat{R}}_{k}-\mathbf{H}\right\|=0 a.s. and \lim_{k\rightarrow\infty}\left\langle\left\{\mathbf{\hat{u}}_{k,i}\right\}_{i=1}^{r}\right\rangle\triangle\Pi_{\mathbf{M}}=\varnothing a.s. If additionally the conditions of Theorem 5 are satisfied, then equation (18) holds and \lim_{k\rightarrow\infty}\Pr\left(\hat{\Pi}_{\mathbf{M}}\triangle\Pi_{\mathbf{M}}=\varnothing\right)=1.
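A compact sketch of Steps 1–4 is given below. It assumes v is the appropriate function from Table 1 and replaces the data-driven calibration of the scaling sequence used in Section 6 with a simple k^{-\eta} rule, so it should be read as an illustration of the procedure rather than a tuned implementation.

```python
import numpy as np

def estimate_latent_space(Y, v, c_tilde=1.0, eta=1.0 / 3.0):
    """Steps 1-4: adjusted Gram matrix, rank estimate, leading eigenvectors."""
    k = Y.shape[0]
    D_hat = np.diag(v(Y).mean(axis=0))                  # Step 1: D_hat_k via (23)
    R_hat = Y.T @ Y / k - D_hat                         # Step 1: R_hat_k as in (12)
    eigvals, eigvecs = np.linalg.eigh(R_hat)            # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # reorder to descending
    r_hat = int(np.sum(eigvals * k ** eta > c_tilde))   # Step 2: rank estimate via (17)
    M_hat = eigvecs[:, :r_hat].T                        # Steps 3-4: rows span Pi_hat_M
    return M_hat, r_hat

# usage with Poisson data, where v(y) = y from Table 1:
# M_hat, r_hat = estimate_latent_space(Y, v=lambda y: y)
```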

4.3 Normal Distribution with Unknown Variances

One of the exponential family distributions considered above is y_{ij}|\theta_{ij}\sim\mathsf{Normal}(\theta_{ij},1). Suppose instead we assume that y_{ij}|\theta_{ij}\sim\mathsf{Normal}(\theta_{ij},\sigma^{2}_{i}), where \sigma^{2}_{1},\ldots,\sigma^{2}_{k} are unknown. Leek (2011) studies this important case and obtains several results related to ours. Let a_{1}\geq a_{2}\geq\ldots\geq a_{n} be the n ordered singular values resulting from the singular value decomposition (SVD) of \mathbf{Y}. If we regress the top t-1 right singular vectors in this SVD out of each row of \mathbf{Y}, the resulting total residual variation is a proportion (\sum_{j=t}^{n}a_{j}^{2})/(\sum_{j^{\prime}=1}^{n}a_{j^{\prime}}^{2}) of the total variation in \mathbf{Y}. In order to estimate \sigma^{2}_{\mbox{average}}=k^{-1}\sum_{i=1}^{k}\sigma^{2}_{i}, Leek (2011) employs the estimate

\hat{\sigma}^{2}_{\mbox{average}}=\frac{1}{k(n-t)}\frac{\sum_{j=t}^{n}a_{j}^{2}}{\sum_{j^{\prime}=1}^{n}a_{j^{\prime}}^{2}}\left\|\mathbf{Y}\right\|^{2}.

Using our notation, Leek (2011) then sets \hat{\mathbf{D}}_{k}=\hat{\sigma}^{2}_{\mbox{average}}\mathbf{I}, where \mathbf{I} is the n\times n identity matrix, and proceeds to estimate \Pi_{\mathbf{M}} based on k^{-1}\mathbf{Y}^{T}\mathbf{Y}-\hat{\mathbf{D}}_{k} as we have done. However, it must be the case that t>r in order for \hat{\sigma}^{2}_{\mbox{average}} to be well behaved, so the assumptions and theory in Leek (2011) differ from ours in several important ways. We refer the reader to Leek (2011) for specific details on this important case. We note that our results taken together with those of Leek (2011) cover a large proportion of the models utilized in practice.
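A sketch of this variance estimator, under the assumption that t is chosen larger than r, might look as follows (the function name is ours).

```python
import numpy as np

def sigma2_average_hat(Y, t):
    """Estimate of sigma^2_average from the SVD of Y, following the display above."""
    k, n = Y.shape
    a = np.linalg.svd(Y, compute_uv=False)                # singular values a_1 >= ... >= a_n
    resid_prop = np.sum(a[t - 1:] ** 2) / np.sum(a ** 2)  # residual proportion after removing t-1 factors
    return resid_prop * np.sum(Y ** 2) / (k * (n - t))    # ||Y||^2 times the proportion, normalized
```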

5 Letting the Latent Variable Coefficients \boldsymbol{\Phi} Be Random

We now discuss the case where \boldsymbol{\Phi} is random but then conditioned upon. Assume that \boldsymbol{\Phi} is a random matrix with entries \phi_{ij} defined on the probability space \left(\Omega_{1},\mathcal{F}_{1},\mathbb{P}_{1}\right), and that the entries y_{ij} of \mathbf{Y}, conditional on \boldsymbol{\Phi} and \mathbf{M}, are defined on the probability space \left(\Omega,\mathcal{F},\mathbb{P}\right). Rather than model (1), consider the following model:

\theta_{ij}=\mathbb{E}\left[y_{ij}|\boldsymbol{\Phi},\mathbf{M}\right]=\left(\boldsymbol{\Phi}\mathbf{M}\right)\left(i,j\right)\text{ or }\boldsymbol{\Theta}=\mathbb{E}\left[\mathbf{Y}|\boldsymbol{\Phi},\mathbf{M}\right]=\boldsymbol{\Phi}\mathbf{M}. (26)

Suppose assumption A4) holds (see (13)) and:

A1′)

1\leq{\rm rank}\left(\mathbf{M}\right)=r<n and n is finite; conditional on \mathbf{M} and \boldsymbol{\Phi}, \left\{y_{ij}\right\}_{i,j} are jointly independent with variance \mathbb{V}\left[y_{ij}|\mathbf{M},\boldsymbol{\Phi}\right]=\delta_{ij} such that \sup_{k}\max_{i,j}\mathbb{E}\left[y_{ij}^{8}|\mathbf{M},\boldsymbol{\Phi}\right]\leq C (which implies \sup_{k}\max_{i,j}\mathbb{E}\left[y_{ij}^{4}|\mathbf{M},\boldsymbol{\Phi}\right]\leq C), where \mathbb{E} and \mathbb{V} are the expectation and variance with respect to \mathbb{P}.

A2′)

Either A2) holds \mathbb{P}_{1}-a.s., i.e., \mathbb{P}_{1}\left(\sup_{k\geq 1}\max_{1\leq i\leq k}\left\|\boldsymbol{\phi}_{i}\right\|\leq C\right)=1 and

\mathbb{P}_{1}\left(\left\|k^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}-\mathbf{W}\right\|=c_{k},\ \exists c_{k}\rightarrow 0,\ \exists\mathbf{W}_{r\times r}>0\right)=1, (27)

or A2) holds in \mathbb{P}_{1}-probability, i.e., as k\rightarrow\infty, \mathbb{P}_{1}\left(\sup_{k\geq 1}\max_{1\leq i\leq k}\left\|\boldsymbol{\phi}_{i}\right\|\leq C\right)\rightarrow 1 and

\mathbb{P}_{1}\left(\left\|k^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}-\mathbf{W}\right\|=c_{k},\ \exists c_{k}\rightarrow 0,\ \exists\mathbf{W}_{r\times r}>0\right)\rightarrow 1. (28)

A3′)

Conditional on \mathbf{M}, k^{-1}\sum_{i=1}^{k}\mathbb{E}\left[e_{ij}^{4}|\mathbf{M}\right], k^{-1}\sum_{i=1}^{k}\boldsymbol{\phi}_{i}\mathbb{E}\left[e_{ij}^{3}|\mathbf{M}\right] and k^{-1}\sum_{i=1}^{k}\boldsymbol{\phi}_{i}\mathbf{m}^{j}\boldsymbol{\phi}_{i}\mathbf{m}^{l}\boldsymbol{\Delta}_{i} for any 1\leq j\leq l\leq n are all convergent \mathbb{P}_{1}-a.s. as k\rightarrow\infty.

Note that assumptions A1′), A2′) and A3′) are the probabilistic versions of assumptions A1), A2) and A3) that also account for the randomness of \boldsymbol{\Phi}. Recall assumption A2) for deterministic \boldsymbol{\Phi}, i.e., for some \mathbf{W}_{r\times r}>0 and some non-negative sequence c_{k}\rightarrow 0, \left\|k^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}-\mathbf{W}\right\|=c_{k}. Assumption A2) implies that (see also Lemma 1):

\sup_{k\geq 1}\max_{1\leq i\leq k}\left\|\boldsymbol{\phi}_{i}\right\|\leq C\text{ \ and \ }\sup_{k\geq 1}\max_{1\leq i\leq k,1\leq j\leq n}\left|\theta_{ij}\right|\leq C. (29)

These two uniform boundedness results in equation (29) are then used to establish the a.s. convergence in Theorem 2, which yields the a.s. convergence in Corollary 3; the validity of Lindeberg’s condition (34), which yields Proposition 4; and the convergence in probability in equations (14) and (16), which yields Theorem 5, Corollary 6, and Theorem 9.

Let \tilde{N}_{1}\in\mathcal{F}_{1} be the set on which (27) does not hold; then \mathbb{P}_{1}\left(\tilde{N}_{1}\right)=0. On the other hand, if (28) holds, then for any positive constants C and \tilde{\varepsilon} there exists a k_{0}=k_{0}\left(C,\tilde{\varepsilon}\right) such that the set

N_{1}^{\ast}\left(C,\tilde{\varepsilon}\right)=\left\{\omega\in\Omega_{1}:\sup\nolimits_{k\geq 1}\max\nolimits_{1\leq i\leq k}\left\|\boldsymbol{\phi}_{i}\right\|>C\right\}

satisfies

\mathbb{P}_{1}\left(N_{1}^{\ast}\left(C,\tilde{\varepsilon}\right)\right)<\tilde{\varepsilon}

whenever k>k_{0}. Now if (27) holds, then the results on a.s. convergence and on convergence in probability for y_{ij}|\mathbf{M} remain true with respect to \mathbb{P} when y_{ij}|\mathbf{M} is replaced by y_{ij}|\left(\boldsymbol{\Phi},\mathbf{M}\right), as long as each \phi_{ij}\left(\omega\right) is such that \omega\in\Omega_{1}\setminus\tilde{N}_{1}. In contrast, if (28) holds, then the results on a.s. convergence for y_{ij}|\mathbf{M} reduce to convergence in probability in \mathbb{P}, but those on convergence in probability for y_{ij}|\mathbf{M} remain true with respect to \mathbb{P} when y_{ij}|\mathbf{M} is replaced by y_{ij}|\left(\boldsymbol{\Phi},\mathbf{M}\right), as long as each \phi_{ij}\left(\omega\right) is such that \omega\in\Omega_{1}\setminus N_{1}^{\ast}\left(C,\tilde{\varepsilon}\right) (where \tilde{\varepsilon} can be chosen to be small). Therefore, the results (except Lemmas 7 and 8) obtained in the previous sections hold with their corresponding probability statements whenever, for some \mathbf{W}_{r\times r}>0,

\left\|k^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}-\mathbf{W}\right\|=c_{k}\rightarrow 0 (30)

holds a.s. or in probability in \mathbb{P}_{1}, and they allow \boldsymbol{\Phi} to be random (and conditioned upon) or deterministic. Statistically speaking, an event with very small or zero \mathbb{P}_{1} probability is very unlikely to occur. This means that the practical utility of these results is not affected when (30) holds in the sense of (27) or (28) and k is large.

Finally, we remark that Lemmas 7 and 8 are independent of the assumptions A1), A2) and A3), and of A1′), A2′) and A3′). In other words, for model (26), Lemma 7 remains valid, and so does Lemma 8 when the assumptions involved there are replaced by those on y_{ij}|\left(\boldsymbol{\Phi},\mathbf{M}\right).

6 A Simulation Study

We conducted a simulation study to demonstrate the validity of our theoretical results. The quality of \hat{\Pi}_{\mathbf{M}} is measured by

d\left(\mathbf{M},\mathbf{\hat{M}}\right)=\left(\sqrt{n\hat{r}}\right)^{-1}\sqrt{\left\|\mathbf{M}^{T}-\mathbf{\hat{M}}^{T}\mathbf{M}_{\mathbf{V}}\right\|^{2}+\left\|\hat{\mathbf{M}}^{T}-\mathbf{M}^{T}\mathbf{V}_{\mathbf{M}}\right\|^{2}},

where \hat{r} is an estimate of r, \mathbf{M}_{\mathbf{V}}=\mathbf{\hat{M}M}^{T} and \mathbf{V}_{\mathbf{M}}=\left(\mathbf{MM}^{T}\right)^{-1}\mathbf{M\hat{M}}^{T}. In fact, d\left(\mathbf{M},\mathbf{\hat{M}}\right) measures the difference between the orthogonal projections (as linear operators) induced by the rows of \mathbf{\hat{M}} and those of \mathbf{M}, respectively, and d\left(\mathbf{M},\mathbf{\hat{M}}\right)=0 if and only if \hat{\Pi}_{\mathbf{M}}=\Pi_{\mathbf{M}}. We propose a rank estimator \hat{r}, applied directly to \mathbf{\hat{R}}_{k}, based on Theorem 5. Specifically, \tilde{\tau}_{k}=\tilde{c}_{k}k^{-\eta} for some positive \tilde{c}_{k} and \eta is used to scale the eigenvalues of \mathbf{\hat{R}}_{k}, where \tilde{c}_{k} (dependent on k) is determined by a strategy of Hallin and Liska (2007) (that was also used by Leek, 2011); for more details, we refer the reader to these two references. We do so since the assertions in Theorem 5 are valid only for k=\infty, assumption A3) may not be satisfied in our simulation study, and the unknown constants in equation (14) for the speed of convergence need to be estimated. However, we caution that, if the scaling sequence \tilde{\tau}_{k} in Theorem 5, defined via (15), is not well estimated so as to capture the speed of convergence of the eigenvalues of \hat{\mathbf{R}}_{k}, then \hat{r} defined by (17) can be an inaccurate estimate of r; see our simulation study results.
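For concreteness, d(\mathbf{M},\mathbf{\hat{M}}) can be computed as below; the rows of \mathbf{\hat{M}} are orthonormal eigenvectors, so \mathbf{\hat{M}}^{T}\mathbf{\hat{M}} is the orthogonal projection onto \hat{\Pi}_{\mathbf{M}} (the function name is ours).

```python
import numpy as np

def subspace_distance(M, M_hat):
    """d(M, M_hat): distance between projections onto the row spaces of M and M_hat."""
    n, r_hat = M.shape[1], M_hat.shape[0]
    P_hat = M_hat.T @ M_hat                # projection onto the row space of M_hat
    P = M.T @ np.linalg.solve(M @ M.T, M)  # projection onto the row space of M
    sq = np.linalg.norm(M.T - P_hat @ M.T) ** 2 + np.linalg.norm(M_hat.T - P @ M_hat.T) ** 2
    return np.sqrt(sq) / np.sqrt(n * r_hat)
```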

6.1 Simulation Study Design

The settings for our simulation study are described below:

  1. We consider n=15,100,200, k=10^{3},5\times 10^{3},10^{4},10^{5} and r=1,2,3,4,5,6,8,10,12.

  2. The mean matrix \boldsymbol{\Theta}=\boldsymbol{\Phi}\mathbf{M} and the observed data y_{ij}|\theta_{ij} are generated within the following scenarios:

    (a) y_{ij}|\theta_{ij}\sim\mathsf{Normal}(\theta_{ij},1), \boldsymbol{\Phi}\left(i,j\right)=\phi_{ij}\stackrel{i.i.d.}{\sim}\mathsf{Normal}\left(0,1\right), and \mathbf{M}\left(i,j\right)=m_{ij}\stackrel{i.i.d.}{\sim}\mathsf{Uniform}\left(1,10\right).

    (b) y_{ij}|\theta_{ij}\sim\mathsf{Poisson}\left(\theta_{ij}\right), \phi_{ij}\stackrel{i.i.d.}{\sim}\mathsf{Chi\text{-}square}\left(9,1\right) with degrees of freedom 9 and non-centrality 1, and m_{ij}\stackrel{i.i.d.}{\sim}\mathsf{Uniform}\left(1,5\right).

    (c) y_{ij}|\theta_{ij}\sim\mathsf{Binomial}\left(s,\theta_{ij}\right) with s=20 and \phi_{ij}\stackrel{i.i.d.}{\sim}\mathsf{Uniform}\left(0.05,0.95\right). \mathbf{M} is such that m_{ij}=1 for 1\leq i=j\leq r, m_{ij}=r^{-1} for 1\leq i\leq r and r+1\leq j\leq n, and m_{ij}=0 otherwise.

    (d) y_{ij}|\theta_{ij}\sim\mathsf{NegBin}\left(s,\theta_{ij}\right) with s=10, \phi_{ij}\stackrel{i.i.d.}{\sim}\mathsf{Uniform}\left(0.5,2\right), and m_{ij}\stackrel{i.i.d.}{\sim}\mathsf{Uniform}\left(0.3,1.5\right).

    (e) y_{ij}|\theta_{ij}\sim\mathsf{Gamma}\left(s,\theta_{ij}\right) with shape s=10 and mean \theta_{ij}, \phi_{ij}\stackrel{i.i.d.}{\sim}\mathsf{Uniform}\left(0.5,2\right), and m_{ij}\stackrel{i.i.d.}{\sim}\mathsf{Uniform}\left(0.3,1.5\right).

  3. For each combination of values in Step 1 and distributions in Step 2 above, the data matrix \mathbf{Y} is generated independently 100 times and then the relevant statistics are calculated.

For each simulated data set, we measure the quality of \mathbf{\hat{D}}_{k} by \rho_{k}=\max_{1\leq j\leq n}\left|\hat{\delta}_{kj}-\bar{\delta}_{kj}\right|, that of \hat{\Pi}_{\mathbf{M}} by d\left(\mathbf{M},\mathbf{\hat{M}}\right), and we record \hat{r}.
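To make the design concrete, the sketch below generates one data set under scenario (c) (Binomial observations with the structured \mathbf{M} described above); the random seed and array layout are incidental choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, r, s = 15, 10_000, 5, 20

M = np.zeros((r, n))
M[np.arange(r), np.arange(r)] = 1.0         # m_ij = 1 for 1 <= i = j <= r
M[:, r:] = 1.0 / r                          # m_ij = 1/r for 1 <= i <= r, r+1 <= j <= n
Phi = rng.uniform(0.05, 0.95, size=(k, r))
Theta = Phi @ M                             # entries lie in (0.05, 0.95)
Y = rng.binomial(s, Theta)                  # y_ij | theta_ij ~ Binomial(s, theta_ij)
```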

Table 2: Assessment of the estimator \hat{r}. Data were simulated under model (1) with n=15 and r=5 under the five distributions listed. Shown is the number of times that \hat{r}=r among 100 simulated data sets for each scenario. Also shown in parentheses is the number of times that \hat{r}<r, if any instances occurred. Results from several additional scenarios are considered in Section C.
k | Binomial | Gamma | NegBin | Normal | Poisson
1,000 | 3 (97) | 24 (4) | 3 | 100 | 84 (6)
5,000 | 95 (5) | 30 (9) | 4 | 96 | 94
10,000 | 100 | 33 (22) | 27 (9) | 96 | 89
100,000 | 100 | 90 | 43 | 99 | 94

6.2 Simulation Study Results

Figure 1 and Table 2 display the performance of the nonparametric estimator \hat{\Pi}_{\mathbf{M}} of \Pi_{\mathbf{M}}, that of \mathbf{\hat{D}}_{k}, and that of \hat{r} when n=15 and r=5; for other values of n and r, the performance of \hat{r} is provided in Section C and that of \hat{\Pi}_{\mathbf{M}} in Section D. The following conclusions can be drawn from the simulation study:

Figure 1: Performance of the nonparametric estimator \hat{\Pi}_{\mathbf{M}} of \Pi_{\mathbf{M}} when n=15 and r=5. Column 1: Boxplots of the difference between the row spaces spanned by \mathbf{M} and \mathbf{\hat{M}}, as measured by d(\mathbf{M},\mathbf{\hat{M}}), when the true dimension of \mathbf{M} is utilized to form \mathbf{\hat{M}} (i.e., setting \hat{r}=r). Column 2: Boxplots of d(\mathbf{M},\mathbf{\hat{M}}) when using the proposed estimator \hat{r} of the row space dimension in forming \mathbf{\hat{M}}. Column 3: An assessment of the estimate \mathbf{\hat{D}}_{k} of \mathbf{D}_{k}, the diagonal matrix of column-wise average variances of \mathbf{Y}, with the difference measured by \max_{1\leq j\leq n}|\hat{\delta}_{kj}-\bar{\delta}_{kj}|. Results from several additional scenarios are shown in Section D.
  1. \hat{\Pi}_{\mathbf{M}} with \hat{r} set to the true r approximates \Pi_{\mathbf{M}} with increasing accuracy, as measured by the difference between their induced orthogonal projections d\left(\mathbf{M},\mathbf{\hat{M}}\right), as k gets larger. In all settings, strong trends of convergence of \hat{\Pi}_{\mathbf{M}} to \Pi_{\mathbf{M}} can be observed, even when \hat{r} is used. However, the speed of convergence can differ slightly across settings due to the hidden constants in (4) and (14).

  2. As k gets larger \mathbf{\hat{D}}_{k} becomes more accurate, and strong trends of convergence of \mathbf{\hat{D}}_{k} to \mathbf{D}_{k} can be observed in all settings. This is similar to the behavior of \hat{\Pi}_{\mathbf{M}}. However, the accuracy of \mathbf{\hat{D}}_{k} in estimating \mathbf{D}_{k} does not seem to have a drastic impact on \hat{\Pi}_{\mathbf{M}}, since \mathbf{\hat{D}}_{k} only shifts the matrix k^{-1}\mathbf{Y}^{T}\mathbf{Y} by a diagonal matrix, and such a shift does not necessarily have a large impact on the leading eigenvectors of \mathbf{\hat{R}}_{k} and hence on \hat{\Pi}_{\mathbf{M}}. We do not report the performance of \mathbf{\hat{D}}_{k} in the scenarios where y_{ij}|\theta_{ij}\sim\mathsf{Normal}(\theta_{ij},1), since in this case \delta_{ij}=1 and \mathbf{D}_{k}=\mathbf{I}, and we set \mathbf{\hat{D}}_{k}=\mathbf{I}.

  3. In all settings, \hat{r} can under- or over-estimate r, and as k increases \hat{r} becomes more accurate. However, \hat{r}>r does not reduce the accuracy of the estimate \hat{\Pi}_{\mathbf{M}} when \hat{r} is used to pick the number of leading eigenvectors of \mathbf{\hat{R}}_{k} used to estimate \Pi_{\mathbf{M}}. In fact, when \hat{r}>r, additional linearly independent eigenvectors \hat{\mathbf{u}}_{k,i} beyond the leading r may be used to help span \Pi_{\mathbf{M}}, giving a better estimate than using the true r. When \hat{r}<r, the accuracy of \hat{\Pi}_{\mathbf{M}} is reduced, since in this case the original row space \Pi_{\mathbf{M}} cannot be sufficiently spanned by \hat{\Pi}_{\mathbf{M}}. This is clearly seen from the plots on the performance of \hat{\Pi}_{\mathbf{M}}.

  4. The scaling sequence \left\{\tilde{\tau}_{k}\right\}_{k\geq 1} plays a critical role in the accuracy of \hat{r} as an estimator of r, since it decides where to “cut” the spectrum of \mathbf{\hat{R}}_{k} for finite k to give \hat{r}, where numerically all eigenvalues of \mathbf{\hat{R}}_{k} are non-zero. In all settings, the non-adaptive choice \tilde{\tau}_{k}=\tilde{c}_{k}k^{-1/3} has been used, and it can cause inaccurate scaling of the spectrum of \hat{\mathbf{R}}_{k} and hence an inaccurate estimate of r. This explains why r has been under-estimated when the y_{ij}’s follow Binomial distributions, n\in\left\{100,200\right\} and r\geq 2: in these cases the magnitudes of the eigenvalues of \mathbf{\hat{R}}_{k} have more complicated behavior, and \tilde{\tau}_{k} is too large compared with the magnitudes of these eigenvalues; see Tables S1–S3 in Section C for results when n=100 and when n=200. We found that when the y_{ij} follow Binomial distributions in the simulation study, in order to accurately estimate r, \tilde{\tau}_{k}=O\left(k^{-1/1.1}\right) should be set for n=100 and \tilde{\tau}_{k}=O\left(k^{-1/1.5}\right) for n=200. The non-adaptive choice of \tilde{\tau}_{k} also explains why \hat{r} over-estimates r when the y_{ij}’s follow Negative Binomial or Gamma distributions, since these two types of distributions are more likely to generate outliers that counterbalance the speed of concentration as k gets larger and affect the separation of the spectrum of \hat{\mathbf{R}}_{k}. In other words, for these two cases, \tilde{\tau}_{k} is too small in magnitude to scale up the spectrum of \hat{\mathbf{R}}_{k} to the point of separation needed to estimate r. In general, the magnitude of n plays a role in the asymptotic behavior of \hat{\mathbf{R}}_{k} as k\rightarrow\infty, and it affects the speed of convergence of \hat{\mathbf{R}}_{k} through the hidden constants in (14) even when all assumptions needed for Theorem 5 are satisfied. This explains why a larger n does not necessarily yield a more accurate \hat{r} when r increases but k does not increase at a compatible speed; see Tables S1–S3 in Section C. However, determining the hidden constants in (14) is usually very hard, and accurately estimating r={\rm rank}\left(\mathbf{M}\right), which is also the number of factors in factor models, is in general a very challenging problem.

Figure 2: A measure of how well \hat{\mathbf{M}} captures the row space of \mathbf{M} as the dimension \hat{r} of \hat{\mathbf{M}} increases in the RNA-seq data set. This measure d\left(\mathbf{M},\mathbf{\hat{M}}\right) is plotted versus \hat{r} over \hat{r}\in\{1,2,\ldots,16\}. The true dimension of the row space of \mathbf{M} is r=9, shown by the vertical dotted line.
Figure 3: A measure of how well \hat{\mathbf{M}} captures the row space of \mathbf{M} as the number of variables k increases in the RNA-seq data set. For each value of k, we randomly sampled k rows from the RNA-seq data matrix and computed d\left(\mathbf{M},\mathbf{\hat{M}}\right) as defined in the text; this was repeated 50 times and the median is plotted.

7 Application to an RNA-Seq Study

In Robinson et al. (2015), we measured genome-wide gene expression in the budding yeast Saccharomyces cerevisiae in a nested factorial experimental design that allowed us to carefully partition gene expression variation due to both biology and technology. The technology utilized to measure gene expression is called “RNA-seq”, short for RNA sequencing. This technology provides a digital measure of gene expression in that mRNA molecules are discretely sequenced and therefore counted (Wang et al., 2009). The resulting 6575\times 16 matrix of RNA-seq counts represents gene expression measurements on 6575 genes across 16 samples. We also have a design matrix of dimension 9\times 16 that captures the experimental design and apportionment of variation throughout the data. The data here are counts, and RNA-seq count data are typically modeled as overdispersed Poisson data (McCarthy et al., 2012). It can be verified on these data that, because of the experimental design, there is very little over-dispersion once the experimental design is taken into account; we therefore utilize the Poisson distribution in this analysis.

We set \mathbf{Y} to be the 6575\times 16 matrix of RNA-seq counts and \mathbf{M} the 9\times 16 design matrix, where each row of \mathbf{M} is normalized to have unit Euclidean norm. Ignoring our knowledge of \mathbf{M}, we applied the proposed estimator using the Poisson distribution formula to yield an estimate \mathbf{\hat{M}} over a range of values \hat{r}\in\{1,2,\ldots,16\}. In evaluating \mathbf{\hat{M}}, we utilized the measure d\left(\mathbf{M},\mathbf{\hat{M}}\right) defined in Section 6.

Figure 2 shows how d\left(\mathbf{M},\mathbf{\hat{M}}\right) changes with the dimension \hat{r} of \mathbf{\hat{M}}. It can be seen that the minimum value is at \hat{r}=9 (where r=9 is the true dimension of \mathbf{M}). We then randomly sampled rows of \mathbf{Y} to quantify how d\left(\mathbf{M},\mathbf{\hat{M}}\right) changes as the number of variables k grows. Figure 3 shows the median d\left(\mathbf{M},\mathbf{\hat{M}}\right) values over 50 random samplings for each k value, where convergence with respect to k can be observed. In summary, this analysis shows that the proposed methodology is capable of accurately capturing the linear latent variable space on real data that are extremely heteroskedastic and do not follow the Normal distribution.

8 Discussion

We have proposed a general method to consistently estimate the low-dimensional, linear latent variable structure in a set of high-dimensional rv’s. Further, by exploiting the intrinsic relationship between the moments of natural exponential family (NEF) distributions that have quadratic variance functions (QVFs), we are able to explicitly recover this latent structure by using just the second moments of the rv’s even when these rv’s have heteroskedastic variances. Empirical evidence confirms our theoretical findings and the utility of our methodology. Once the latent structure has been well estimated, the variable-specific coefficients of the latent variables can be estimated via appropriate estimation methods.

We point out that, under the same assumptions A1), A2), A3) and A4), the theoretical results in Sections 2, 3 and 4 hold for the unconditional model

\boldsymbol{\Theta}=\mathbb{E}\left[\mathbf{Y}\right]=\boldsymbol{\Phi}\mathbf{M} (31)

when both \boldsymbol{\Phi} and \mathbf{M} are deterministic with {\rm rank}\left(\mathbf{M}\right)=r. In this case the conditional distribution of each y_{ij} given \mathbf{M} in model (1), i.e., \boldsymbol{\Theta}=\mathbb{E}\left[\mathbf{Y}|\mathbf{M}\right]=\boldsymbol{\Phi}\mathbf{M} when \mathbf{M} is a realization of a random matrix \tilde{\mathbf{M}}, becomes the marginal distribution of y_{ij} in model (31), and the proofs proceed almost verbatim as those for model (1). Further, for the model

\boldsymbol{\Theta}=\mathbb{E}\left[\mathbf{Y}|\boldsymbol{\Phi}\right]=\boldsymbol{\Phi}\mathbf{M} (32)

when \boldsymbol{\Phi} is random but \mathbf{M} is deterministic with {\rm rank}\left(\mathbf{M}\right)=r, under the assumptions A1′), A2′), A3′) and A4), arguments similar to those given in Section 5 imply that the theoretical results obtained in Sections 2, 3 and 4 hold when the probabilistic statements in the conclusions there are adjusted in the way detailed in Section 5.

We have observed that, for certain NEF distribution configurations, estimating the rank of the linear space generated by the latent structure can be challenging, even with the theory we provided, if the scaling sequence used to separate the limiting spectrum of the adjusted Gram matrix of the data is not adaptively specified. The need for an adaptive scaling sequence comes from the hidden constants that govern the asymptotic speed of convergence, and choosing such a sequence can be a delicate task in non-asymptotic settings. This reflects the more general challenge of estimating the dimension of a latent variable model.
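To make the role of the scaling sequence concrete, the sketch below shows one minimal way a rank estimate can be formed by thresholding the eigenvalues of the adjusted Gram matrix. The particular sequence sqrt(log(k)/k) is purely illustrative and deliberately non-adaptive; it is exactly the kind of fixed choice that, as discussed above, can fail to separate the limiting spectrum for some NEF configurations.

```python
import numpy as np

def estimate_rank(adjusted_gram_eigvals, k, tau=None):
    """Count the eigenvalues of the adjusted Gram matrix that exceed a
    scaling sequence tau_k.

    tau_k = sqrt(log(k) / k) below is an illustrative, non-adaptive
    choice only; the hidden constants in the convergence rates mean an
    adaptive choice may be needed in practice.
    """
    if tau is None:
        tau = np.sqrt(np.log(k) / k)
    return int(np.sum(np.asarray(adjusted_gram_eigvals) > tau))
```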

Finally, we briefly point out that, when the yijy_{ij} have pdfs f(yij;θij,ψ)f(y_{ij};\theta_{ij},\psi) of the form

f(y;θ,ψ)=h(y,ψ)exp{ψ1[η(θ)yg(η(θ))]}f(y;\theta,\psi)=h\left(y,\psi\right)\exp\left\{\psi^{-1}\left[\eta(\theta)y-g\left(\eta\left(\theta\right)\right)\right]\right\}

that is, when they form an exponential dispersion family (see, e.g., Jørgensen, 1987), an explicit estimator of 𝐃k\mathbf{D}_{k} (and hence of Π𝐌\Pi_{\mathbf{M}}) will require results beyond those provided in this work, even if the unknown dispersion parameter ψ\psi is constant. We leave this case to future research.

Acknowledgements

This research is funded by the Office of Naval Research grant N00014-12-1-0764 and NIH grant HG002913.

References

  • Alexander et al. (2009) D. H. Alexander, J. Novembre, and K. Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19(9):1655–1664, 2009.
  • Anandkumar et al. (2012) Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham M. Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 25, 2012.
  • Anandkumar et al. (2015) Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham M. Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. Algorithmica, 72:193–214, 2015.
  • Arora et al. (2013) Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. In Proceedings of The 30th International Conference on Machine Learning, pages 280–288, 2013.
  • Bartholomew et al. (2011) D. J. Bartholomew, M. Knott, and I. Moustaki. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley Series in Probability and Statistics, 3rd edition, 2011.
  • Blei et al. (2003) David M. Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  • Collins et al. (2001) Michael Collins, Sanjoy Dasgupta, and Robert E Schapire. A generalization of principal components analysis to the exponential family. In Advances in neural information processing systems, pages 617–624, 2001.
  • Hallin and Liska (2007) Marc Hallin and Roman Liska. Determining the number of factors in the general dynamic factor model. J. Amer. Statist. Assoc., 102(478):603–617, 2007.
  • Hoffman and Wielandt (1953) A.J. Hoffman and H.W. Wielandt. The variation of the spectrum of a normal matrix. Duke Math. J., 20(1):37–39, 1953.
  • Jørgensen (1987) Bent Jørgensen. Exponential dispersion models. J. R. Statist. Soc. B, 49(2):127–162, 1987.
  • Lee and Seung (1999) D. D. Lee and S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.
  • Leek (2011) Jeffrey T. Leek. Asymptotic conditional singular value decomposition for high-dimensional genomic data. Biometrics, 67(2):344–352, 2011.
  • Leek and Storey (2007) Jeffrey T Leek and John D Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet, 3(9):e161, 2007.
  • Leek and Storey (2008) Jeffrey T. Leek and John D. Storey. A general framework for multiple testing dependence. Proc. Natl. Acad. Sci. U.S.A., 105(48):18718–18723, 2008.
  • Letac and Mora (1990) Gerard Letac and Marianne Mora. Natural real exponential families with cubic variance functions. Ann. Statist., 18(1):1–37, 1990.
  • McCarthy et al. (2012) Davis J McCarthy, Yunshun Chen, and Gordon K Smyth. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10):4288–4297, 2012.
  • Morris (1982) Carl N. Morris. Natural exponential families with quadratic variance functions. Ann. Statist., 10(1):65–80, 1982.
  • Pritchard et al. (2000) J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.
  • Robinson et al. (2015) David G. Robinson, Jean Y. Wang, and John D. Storey. A nested parallel experiment demonstrates differences in intensity-dependence between RNA-seq and microarrays. Nucleic Acids Research, 2015. doi: 10.1093/nar/gkv636.
  • Tipping and Bishop (1999) Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3):611–622, 1999. doi: 10.1111/1467-9868.00196.
  • Van der Vaart (1998) A. W. Van der Vaart. Asymptotic Statistics. Cambridge Univ. Press, 1998.
  • Walk (2005) Harro Walk. Strong laws of large numbers by elementary Tauberian arguments. Monatsh. Math., 144(4):329–346, 2005.
  • Wang et al. (2009) H. S. Wang, B. Li, and C. L. Leng. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society Series B-Statistical Methodology, 71:671–683, 2009.

A Relationship to Anandkumar et al.

Anandkumar et al. (2012, 2015) consider a different class of probabilistic models from ours. We establish this by rewriting their model in our notation. Recall that we consider the model

𝔼[𝐘|𝐌]=𝚽k×r𝐌r×n,\mathbb{E}\left[\mathbf{Y}|\mathbf{M}\right]=\boldsymbol{\Phi}_{k\times r}\mathbf{M}_{r\times n},

where 𝐘k×n\mathbf{Y}_{k\times n} is a matrix of nn observations on kk variables. The data points yijy_{ij} can come from a wide range of variable types, from Binomial outcomes to count data to continuous data (see Table 1, for example). Variable ii can be written as 𝐲i=(yi1,,yin)\mathbf{y}_{i}=(y_{i1},\ldots,y_{in}), which is a 1×n1\times n vector. In terms of this variable, our model assumes that

𝔼[𝐲i|𝐌]=ϕi𝐌,\mathbb{E}\left[\mathbf{y}_{i}|\mathbf{M}\right]=\boldsymbol{\phi}_{i}\mathbf{M},

where ϕi1×r\boldsymbol{\phi}_{i}\in\mathbb{R}^{1\times r} is row ii of 𝚽\boldsymbol{\Phi}.

Anandkumar et al. (2012, 2015), on the other hand, assume that yij{0,1}y_{ij}\in\{0,1\} with the restriction that j=1nyij=1\sum_{j=1}^{n}y_{ij}=1, so that 𝐲i{0,1}n\mathbf{y}_{i}\in\{0,1\}^{n}. This construction is meant to represent an observed document of text where there are nn words in the dictionary. They consider a single document with an infinite number of words. The vector 𝐲i=(yi1,,yin)\mathbf{y}_{i}=(y_{i1},\ldots,y_{in}) tells us which of the nn words is present at location ii in the document. When they let kk\rightarrow\infty, this means the number of words in the document grows to infinity.

The model studied in Anandkumar et al. (2012, 2015) is

𝔼[𝐲iT|𝐦]=𝐎n×r𝐦r×1,\mathbb{E}\left[\mathbf{y}_{i}^{T}|\mathbf{m}\right]=\mathbf{O}_{n\times r}\mathbf{m}_{r\times 1},

where there are rr topics under consideration and the rr-vector 𝐦\mathbf{m} gives the mixture of topics in this particular document. Each column of 𝐎\mathbf{O} gives the multinomial probabilities of the nn words for the corresponding topic. Note that the linear latent variable model 𝐎𝐦\mathbf{O}\mathbf{m} does not vary with ii, whereas in our model it does. Also, the dimensionality of the latent variable model is different than ours. Therefore, this is a different model than we consider.

Anandkumar et al. (2012, 2015) take the approach of calculating, projecting and decomposing the expectations of tensor products involving (𝐲1,𝐲2)(\mathbf{y}_{1},\mathbf{y}_{2}) and (𝐲1,𝐲2,𝐲3)(\mathbf{y}_{1},\mathbf{y}_{2},\mathbf{y}_{3}), and then suggesting that these expectations can be almost surely estimated as kk\rightarrow\infty by utilizing the analogous sample moments. In order to exactly recover 𝐎\mathbf{O} modulo a permutation of its columns, additional assumptions are made by Anandkumar et al. (2012, 2015), such as knowledge of the sum of the unknown exponents in the Dirichlet distribution in the LDA model of Blei et al. (2003). It is also assumed that the number of topics rr is known.

B Proofs

Since we are considering the conditional model (1), i.e., 𝚯=𝔼[𝐘|𝐌]=𝚽𝐌\boldsymbol{\Theta}=\mathbb{E}\left[\mathbf{Y}|\mathbf{M}\right]=\boldsymbol{\Phi}\mathbf{M}, to keep notation concise we introduce the conditional expectation and variance operators 𝔼~[]=𝔼[|𝐌]\tilde{\mathbb{E}}\left[\cdot\right]=\mathbb{E}\left[\cdot|\mathbf{M}\right] and 𝕍~[]=𝕍[|𝐌]\tilde{\mathbb{V}}\left[\cdot\right]=\mathbb{V}\left[\cdot|\mathbf{M}\right], respectively. In the proofs, unless otherwise noted, all random quantities are conditioned on 𝐌\mathbf{M} and all arguments are made conditionally on 𝐌\mathbf{M}.

B.1 Proof of Theorem 2

Denote the terms in the order in which they appear in the expansion

k1𝐘T𝐘\displaystyle k^{-1}\mathbf{Y}^{T}\mathbf{Y} =k1𝚯T𝚯+k1(𝐘𝚯)T𝚯+k1𝚯T(𝐘𝚯)\displaystyle=k^{-1}\boldsymbol{\Theta}^{T}\boldsymbol{\Theta}+k^{-1}\left(\mathbf{Y}-\boldsymbol{\Theta}\right)^{T}\boldsymbol{\Theta}+k^{-1}\boldsymbol{\Theta}^{T}\left(\mathbf{Y}-\boldsymbol{\Theta}\right)
+k1(𝐘𝚯)T(𝐘𝚯)\displaystyle+k^{-1}\left(\mathbf{Y}-\boldsymbol{\Theta}\right)^{T}\left(\mathbf{Y}-\boldsymbol{\Theta}\right)

as 𝐀1\mathbf{A}_{1}, 𝐀2\mathbf{A}_{2}, 𝐀2T\mathbf{A}_{2}^{T}, and 𝐀3\mathbf{A}_{3}, respectively, we see that limk𝐀1=𝐇\lim\limits_{k\rightarrow\infty}\mathbf{A}_{1}=\mathbf{H} by assumption (4). We claim that limk𝐀2=0=limk𝐀2T\lim\limits_{k\rightarrow\infty}\left\|\mathbf{A}_{2}\right\|=0=\lim_{k\rightarrow\infty}\left\|\mathbf{A}_{2}^{T}\right\| a.s. The (i,j)\left(i,j\right)th entry a2,ija_{2,ij} of 𝐀2\mathbf{A}_{2} is a2,ij=k1l=1ka~2,ij,la_{2,ij}=k^{-1}\sum\limits_{l=1}^{k}\tilde{a}_{2,ij,l}, where a~2,ij,l=(yliθli)θlj\tilde{a}_{2,ij,l}=\left(y_{li}-\theta_{li}\right)\theta_{lj} for 1lk1\leq l\leq k. Clearly, 𝔼~[a~2,ij,l]=0\tilde{\mathbb{E}}\left[\tilde{a}_{2,ij,l}\right]=0. Set y~li=yliθli\tilde{y}_{li}=y_{li}-\theta_{li}. By Lemma 1, the uniformly bounded 44th conditional moments of yijy_{ij}, and Hölder’s inequality,

supk1max1lk𝕍~[a~2,ij,l]Csupk1max1lk(𝔼~[y~li4])1/2C.\sup_{k\geq 1}\max_{1\leq l\leq k}\tilde{\mathbb{V}}\left[\tilde{a}_{2,ij,l}\right]\leq C\sup_{k\geq 1}\max_{1\leq l\leq k}\left(\tilde{\mathbb{E}}\left[\tilde{y}_{li}^{4}\right]\right)^{1/2}\leq C\text{.}

By independence conditional on 𝐌\mathbf{M},

k11k𝕍~[a2,ij]Ck11k3l=1k(𝔼~[y~li4])1/2Ck11k2=Cπ26.\sum_{k\geq 1}\dfrac{1}{k}\tilde{\mathbb{V}}\left[a_{2,ij}\right]\leq C\sum_{k\geq 1}\dfrac{1}{k^{3}}\sum_{l=1}^{k}\left(\tilde{\mathbb{E}}\left[\tilde{y}_{li}^{4}\right]\right)^{1/2}\leq C\sum_{k\geq 1}\dfrac{1}{k^{2}}=\dfrac{C\pi^{2}}{6}\text{.}

Therefore, Theorem 1 of Walk (2005) implies that limka2,ij=0\lim\limits_{k\rightarrow\infty}a_{2,ij}=0 a.s., which validates the claim.

Consider the last term 𝐀3\mathbf{A}_{3}, whose (i,j)\left(i,j\right)th off-diagonal entry a3,ija_{3,ij} can be written as a3,ij=k1l=1ka~3,ij,la_{3,ij}=k^{-1}\sum\limits_{l=1}^{k}\tilde{a}_{3,ij,l}, where a~3,ij,l=(yliθli)(yljθlj)\tilde{a}_{3,ij,l}=\left(y_{li}-\theta_{li}\right)\left(y_{lj}-\theta_{lj}\right). Clearly, 𝔼~[a~3,ij,l]=0\tilde{\mathbb{E}}\left[\tilde{a}_{3,ij,l}\right]=0. The same reasoning as above implies

k11k𝕍~[a3,ij]Ck11k3l=1k(𝔼~[y~li4])1/2(𝔼~[y~lj4])1/2Ck11k2=Cπ26\sum_{k\geq 1}\dfrac{1}{k}\tilde{\mathbb{V}}\left[a_{3,ij}\right]\leq C\sum_{k\geq 1}\dfrac{1}{k^{3}}\sum_{l=1}^{k}\left(\tilde{\mathbb{E}}\left[\tilde{y}_{li}^{4}\right]\right)^{1/2}\left(\tilde{\mathbb{E}}\left[\tilde{y}_{lj}^{4}\right]\right)^{1/2}\leq C\sum_{k\geq 1}\dfrac{1}{k^{2}}=\dfrac{C\pi^{2}}{6}

when iji\neq j. So, Theorem 1 of Walk (2005) implies that limka3,ij=0\lim\limits_{k\rightarrow\infty}a_{3,ij}=0 a.s. The diagonal entries of 𝐀3\mathbf{A}_{3} can be written as a3,ii=k1l=1ka~3,ii,la_{3,ii}=k^{-1}\sum\limits_{l=1}^{k}\tilde{a}_{3,ii,l}, where a~3,ii,l=(yliθli)2\tilde{a}_{3,ii,l}=\left(y_{li}-\theta_{li}\right)^{2}. Since 𝔼~[a~3,ii,l]=δli\tilde{\mathbb{E}}\left[\tilde{a}_{3,ii,l}\right]=\delta_{li} and

𝕍~[a~3,ii,l]δli2+𝔼~[y~li4]2𝔼~[y~li4]C,\tilde{\mathbb{V}}\left[\tilde{a}_{3,ii,l}\right]\leq\delta_{li}^{2}+\tilde{\mathbb{E}}\left[\tilde{y}_{li}^{4}\right]\leq 2\tilde{\mathbb{E}}\left[\tilde{y}_{li}^{4}\right]\leq C\text{,}

it follows that

k11k𝕍~[a3,ii]k1Ck2=Cπ26.\sum_{k\geq 1}\dfrac{1}{k}\tilde{\mathbb{V}}\left[a_{3,ii}\right]\leq\sum_{k\geq 1}\dfrac{C}{k^{2}}=\dfrac{C\pi^{2}}{6}\text{.}

Hence, Theorem 1 of Walk (2005) implies limk|a3,iiδ¯ki|=0\lim\limits_{k\rightarrow\infty}\left|a_{3,ii}-\bar{\delta}_{ki}\right|=0 a.s. and limk𝐀3𝐃k=0\lim\limits_{k\rightarrow\infty}\left\|\mathbf{A}_{3}-\mathbf{D}_{k}\right\|=0. Combining the limiting terms obtained above completes the proof.
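Although it is not part of the proof, the convergence established above is easy to illustrate numerically. The sketch below simulates Poisson data under model (1), with positive 𝚽 and 𝐌 so that the conditional means are valid Poisson rates, and tracks the distance between k^{-1}𝐘^T𝐘 and k^{-1}𝚯^T𝚯 + 𝐃_k as k grows; all simulation settings are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 10, 3
# Positive M and Phi so that the Poisson means Theta = Phi M are positive.
M = rng.uniform(0.2, 1.0, size=(r, n))
M /= np.linalg.norm(M, axis=1, keepdims=True)    # unit-norm rows

for k in (1_000, 10_000, 100_000):
    Phi = rng.uniform(0.5, 2.0, size=(k, r))
    Theta = Phi @ M                               # conditional means
    Y = rng.poisson(Theta)                        # Poisson draws given M
    A = Y.T @ Y / k                               # k^{-1} Y^T Y
    H_k = M.T @ (Phi.T @ Phi / k) @ M             # k^{-1} Theta^T Theta
    D_k = np.diag(Theta.mean(axis=0))             # average column-wise variances
    print(k, np.linalg.norm(A - H_k - D_k, 2))    # decreases as k grows
```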

B.2 Proof of Corollary 3

We show (7) first. By (6) and the Wielandt-Hoffman (WH) inequality of Hoffman and Wielandt (1953), we immediately have (7), i.e.,

limkmax1in|βk,iαi|=0 a.s..\lim_{k\rightarrow\infty}\max_{1\leq i\leq n}\left|\beta_{k,i}-\alpha_{i}\right|=0\text{ a.s.}. (33)

Now we show the rest of the assertions. Let {λl}l=1s\left\{\lambda_{l}\right\}_{l=1}^{s} with 1sn1\leq s\leq n be the distinct eigenvalues of 𝐇\mathbf{H}, ordered so that λl>λl+1\lambda_{l}>\lambda_{l+1} for 1ls11\leq l\leq s-1. Note that λs=0\lambda_{s}=0 since 𝐇\mathbf{H} is positive semi-definite with rank rr and r<nr<n. Let s0=max{l:λl>0}r<ns_{0}=\max\left\{l:\lambda_{l}>0\right\}\leq r<n, set ε0=min1ls0{λlλl+1}\varepsilon_{0}=\min\limits_{1\leq l\leq s_{0}}\left\{\lambda_{l}-\lambda_{l+1}\right\}, and pick ε>0\varepsilon>0 such that ε<41ε0\varepsilon<4^{-1}\varepsilon_{0}. From (33), we see that there exist some k0k_{0}\in\mathbb{N} and a partition of {1,,n}\left\{1,...,n\right\} into subsets {Al}l=1s\left\{A_{l}\right\}_{l=1}^{s} such that, whenever kk0k\geq k_{0}, max1lsmaxiAl|βk,iλl|<41ε\max\limits_{1\leq l\leq s}\max\limits_{i\in A_{l}}\left|\beta_{k,i}-\lambda_{l}\right|<4^{-1}\varepsilon but

min1l<lsmin{|βk,iβk,j|:iAl,jAl,ll}>41ε0.\min_{1\leq l<l^{\prime}\leq s}\min\left\{\left|\beta_{k,i}-\beta_{k,j}\right|:i\in A_{l},j\in A_{l^{\prime}},l\neq l^{\prime}\right\}>4^{-1}\varepsilon_{0}\text{.}

Namely, {βk,i}i=1n\left\{\beta_{k,i}\right\}_{i=1}^{n} separate into ss groups {βk,i:iAl}\left\{\beta_{k,i}:i\in A_{l}\right\} for 1ls1\leq l\leq s, each with center λl\lambda_{l} and diameter at most 21ε2^{-1}\varepsilon. Therefore, limk{βk,i:i=1,,n}={λl:l=1,,s}\lim_{k\rightarrow\infty}\left\{\beta_{k,i}:i=1,\ldots,n\right\}=\left\{\lambda_{l}:l=1,\ldots,s\right\} a.s.

Since 𝐖>0\mathbf{W}>0, rank(𝐌)=r{\rm rank}\left(\mathbf{M}\right)=r and 𝐇\mathbf{H} is symmetric, 𝐇\mathbf{H} is diagonalizable with rank(𝐇)=r{\rm rank}\left(\mathbf{H}\right)=r and the geometric multiplicity (gm) of each λl\lambda_{l} equals its algebraic multiplicity (am). Therefore, the linear space spanned by the union of the eigenvectors corresponding to all non-zero eigenvalues of 𝐇\mathbf{H} must be {𝐯i}i=1r\left\langle\left\{\mathbf{v}_{i}\right\}_{i=1}^{r}\right\rangle and rank({𝐯i}i=1r)=r{\rm rank}\left(\left\langle\left\{\mathbf{v}_{i}\right\}_{i=1}^{r}\right\rangle\right)=r must hold, where 𝐯i\mathbf{v}_{i} is an eigenvector corresponding to some αj\alpha_{j} for 1jr1\leq j\leq r. Moreover, {𝐯i}i=1r=Π𝐌\left\langle\left\{\mathbf{v}_{i}\right\}_{i=1}^{r}\right\rangle=\Pi_{\mathbf{M}}. Fix an ll and any 𝜸n\boldsymbol{\gamma}\in\mathbb{R}^{n} with 𝜸<\left\|\boldsymbol{\gamma}\right\|<\infty. Let 𝐮k,i\mathbf{u}_{k,i} be any eigenvector that is part of the basis for the eigenspace Uk,iU_{k,i} of βk,i\beta_{k,i}. Then, from (6) and (7), we see that

bk(𝜸)=(𝐑kβk,i𝐈)𝜸(𝐑kβk,j𝐈)𝜸b_{k}\left(\boldsymbol{\gamma}\right)=\left(\mathbf{R}_{k}-\beta_{k,i}\mathbf{I}\right)\boldsymbol{\gamma}-\left(\mathbf{R}_{k}-\beta_{k,j}\mathbf{I}\right)\boldsymbol{\gamma}

satisfies limkbk(𝜸)=0\lim\limits_{k\rightarrow\infty}\left\|b_{k}\left(\boldsymbol{\gamma}\right)\right\|=0 a.s. for all i,jAli,j\in A_{l}. Therefore, limkUk,iUk,j=\lim\limits_{k\rightarrow\infty}U_{k,i}\triangle U_{k,j}=\varnothing a.s. for all i,jAli,j\in A_{l} and asymptotically there are only ss linearly independent eigenspaces {Uk,il}l=1s\left\{U_{k,i_{l}}\right\}_{l=1}^{s} with 1iln1\leq i_{l}\leq n corresponding to {limkβk,i:iAl,1ls}\left\{\lim\limits_{k\rightarrow\infty}\beta_{k,i}:i\in A_{l},1\leq l\leq s\right\}. On the other hand, for any iAli\in A_{l},

ak(𝜸)=(𝐑kβk,i𝐈)𝜸(𝐇λl𝐈)𝜸a_{k}\left(\boldsymbol{\gamma}\right)=\left(\mathbf{R}_{k}-\beta_{k,i}\mathbf{I}\right)\boldsymbol{\gamma}-\left(\mathbf{H}-\lambda_{l}\mathbf{I}\right)\boldsymbol{\gamma}

satisfies limkak(𝜸)=0\lim\limits_{k\rightarrow\infty}\left\|a_{k}\left(\boldsymbol{\gamma}\right)\right\|=0 a.s.. Let 𝐯\mathbf{v} be any eigenvector that is part of the basis for the eigenspace VlV_{l} of λl\lambda_{l}. Then, for any iAli\in A_{l},

(𝐑kβk,i𝐈)𝐯(𝐇λl𝐈)𝐯+ak(𝜸)0 a.s.\left\|\left(\mathbf{R}_{k}-\beta_{k,i}\mathbf{I}\right)\mathbf{v}\right\|\leq\left\|\left(\mathbf{H}-\lambda_{l}\mathbf{I}\right)\mathbf{v}\right\|+\left\|a_{k}\left(\boldsymbol{\gamma}\right)\right\|\rightarrow 0\text{\ a.s.}

and

(𝐇λl𝐈)𝐮k,i(𝐑kβk,i𝐈)𝐮k,i+ak(𝜸)0 a.s.\left\|\left(\mathbf{H}-\lambda_{l}\mathbf{I}\right)\mathbf{u}_{k,i}\right\|\leq\left\|\left(\mathbf{R}_{k}-\beta_{k,i}\mathbf{I}\right)\mathbf{u}_{k,i}\right\|+\left\|a_{k}\left(\boldsymbol{\gamma}\right)\right\|\rightarrow 0\text{\ a.s.}

since 𝐯=1\left\|\mathbf{v}\right\|=1 and Pr(𝐮k,i<)=1\Pr\left(\left\|\mathbf{u}_{k,i}\right\|<\infty\right)=1. Consequently, VllimkUk,iV_{l}\subseteq\lim\limits_{k\rightarrow\infty}U_{k,i} and limkUk,iVl\lim\limits_{k\rightarrow\infty}U_{k,i}\subseteq V_{l} a.s. for any iAli\in A_{l}. Since we have already shown that {Uk,i}i=1n\left\{U_{k,i}\right\}_{i=1}^{n} asymptotically reduces to {Uk,il}l=1s\left\{U_{k,i_{l}}\right\}_{l=1}^{s} corresponding to {βk,il}l=1s{βk,i}i=1n\left\{\beta_{k,i_{l}}\right\}_{l=1}^{s}\subseteq\left\{\beta_{k,i}\right\}_{i=1}^{n} for which limkmax1ls|βk,ilλl|=0\lim\limits_{k\rightarrow\infty}\max\limits_{1\leq l\leq s}\left|\beta_{k,i_{l}}-\lambda_{l}\right|=0 a.s., we see that all eigenvectors corresponding to the rr largest eigenvalues of 𝐑k\mathbf{R}_{k} together asymptotically span {𝐯i}i=1r\left\langle\left\{\mathbf{v}_{i}\right\}_{i=1}^{r}\right\rangle a.s. This yields limk|S|=r\lim\limits_{k\rightarrow\infty}\left|S\right|=r a.s. and (8). The proof is completed.
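The conclusion just proved can likewise be illustrated numerically; this is only an informal check under illustrative simulation settings, not part of the argument. With 𝐃_k known, the eigenvalues of 𝐑_k settle near those of 𝐇, only r of them remain away from zero, and the projection onto the span of the top-r eigenvectors approaches the projection onto Π_𝐌.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, k = 12, 3, 500_000
M = rng.uniform(0.2, 1.0, size=(r, n))
M /= np.linalg.norm(M, axis=1, keepdims=True)
Phi = rng.uniform(0.5, 2.0, size=(k, r))
Theta = Phi @ M
Y = rng.poisson(Theta)

R_k = Y.T @ Y / k - np.diag(Theta.mean(axis=0))    # adjusted Gram matrix
H = M.T @ (Phi.T @ Phi / k) @ M                    # finite-k version of H

print(np.round(np.sort(np.linalg.eigvalsh(R_k))[::-1], 3))  # ~r values away from zero
print(np.round(np.sort(np.linalg.eigvalsh(H))[::-1], 3))    # exactly r nonzero values

# Span of the top-r eigenvectors of R_k versus the row space of M.
vecs = np.linalg.eigh(R_k)[1][:, ::-1][:, :r]      # top-r eigenvectors (n x r)
Q, _ = np.linalg.qr(M.T)                           # orthonormal basis of row(M)
print(np.linalg.norm(vecs @ vecs.T - Q @ Q.T, 2))  # shrinks toward 0 as k grows
```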

B.3 Proof of Proposition 4

We remark that our proof of Proposition 4 follows a strategy similar to that given in Leek (2011) but uses slightly different techniques.

We show the first claim. Recall 𝚯=𝚽𝐌\boldsymbol{\Theta}=\mathbf{\Phi M}. It suffices to show that each of k1𝚯T𝐄k^{-1}\boldsymbol{\Theta}^{T}\mathbf{E} and k1𝐄T𝐄k^{-1}\mathbf{E}^{T}\mathbf{E} is a linear mapping of i=1k𝐳i\sum\limits_{i=1}^{k}\mathbf{z}_{i} with Lipschitz constant 11 as follows. The (i,j)\left(i,j\right)th entry of k1𝐄T𝐄k^{-1}\mathbf{E}^{T}\mathbf{E} is

k1l=1kelielj=k1l=1k𝐪j,eli𝐞l=𝐪j,k1l=1keli𝐞l,k^{-1}\sum\limits_{l=1}^{k}e_{li}e_{lj}=k^{-1}\sum\limits_{l=1}^{k}\left\langle\mathbf{q}_{j},e_{li}\mathbf{e}_{l}\right\rangle=\left\langle\mathbf{q}_{j},k^{-1}\sum\limits_{l=1}^{k}e_{li}\mathbf{e}_{l}\right\rangle,

where 𝐪i=(0,,0,1,0,,0)\mathbf{q}_{i}=\left(0,...,0,1,0,...,0\right) (i.e., only the iith entry is 11; others are 0) and ,\left\langle\cdot,\cdot\right\rangle is the inner product in Euclidean space. In other words, the (i,j)\left(i,j\right)th entry of k1𝐄T𝐄k^{-1}\mathbf{E}^{T}\mathbf{E} is exactly the ((i1)n+j)\left((i-1)n+j\right)th entry of i=1k𝐳i\sum\limits_{i=1}^{k}\mathbf{z}_{i}. Further, the (i,j)\left(i,j\right)th entry of k1𝚯T𝐄k^{-1}\boldsymbol{\Theta}^{T}\mathbf{E} is

k1l=1kϕl𝐦ielj=k1l=1k𝐪j,ϕl𝐦i𝐞l=𝐪j,k1l=1kϕl𝐦i𝐞l,k^{-1}\sum\limits_{l=1}^{k}\boldsymbol{\phi}_{l}\mathbf{m}^{i}e_{lj}=k^{-1}\sum\limits_{l=1}^{k}\left\langle\mathbf{q}_{j},\boldsymbol{\phi}_{l}\mathbf{m}^{i}\mathbf{e}_{l}\right\rangle=\left\langle\mathbf{q}_{j},k^{-1}\sum\limits_{l=1}^{k}\boldsymbol{\phi}_{l}\mathbf{m}^{i}\mathbf{e}_{l}\right\rangle\text{,}

i.e., the (n2+(i1)n+j)(n^{2}+(i-1)n+j)th entry of i=1k𝐳i\sum\limits_{i=1}^{k}\mathbf{z}_{i}.

Noticing that the linearity of the entry-extraction mappings described above is preserved under matrix transposition, we see that k1𝐄T𝚯k^{-1}\mathbf{E}^{T}\boldsymbol{\Theta} is also a linear mapping of i=1k𝐳i\sum\limits_{i=1}^{k}\mathbf{z}_{i} with Lipschitz constant 11. Consequently, 𝐅\mathbf{F} is a linear mapping of i=1k𝐳i\sum\limits_{i=1}^{k}\mathbf{z}_{i} with Lipschitz constant 11.

Now we show the second claim. Set 𝐳~i=k(𝐳i𝔼~[𝐳i])\mathbf{\tilde{z}}_{i}=\sqrt{k}\left(\mathbf{z}_{i}-\tilde{\mathbb{E}}\left[\mathbf{z}_{i}\right]\right). We will verify that i=1k𝐳~i\sum\limits_{i=1}^{k}\mathbf{\tilde{z}}_{i} is asymptotically Normally distributed. Clearly, within the long vector 𝔼~[𝐳i]\tilde{\mathbb{E}}\left[\mathbf{z}_{i}\right], only the jjth entry of 𝔼~[eij𝐞i]\tilde{\mathbb{E}}\left[e_{ij}\mathbf{e}_{i}\right] is nonzero and it equals k1δijk^{-1}\delta_{ij}, so 𝔼~[𝐳i]\tilde{\mathbb{E}}\left[\mathbf{z}_{i}\right] has a simple, explicit form. By the multivariate central limit theorem (MCLT) (e.g., see page 20 of Van der Vaart, 1998), to show the asymptotic normality of i=1k𝐳~i\sum\limits_{i=1}^{k}\mathbf{\tilde{z}}_{i} as kk\rightarrow\infty, it suffices to show

vkε=i=1k𝔼~[𝐳~i21{𝐳~i>ε}]0 as kv_{k}^{\varepsilon}=\sum_{i=1}^{k}\tilde{\mathbb{E}}\left[\left\|\mathbf{\tilde{z}}_{i}\right\|^{2}1_{\left\{\left\|\mathbf{\tilde{z}}_{i}\right\|>\varepsilon\right\}}\right]\rightarrow 0\text{ as }k\rightarrow\infty (34)

for any ε>0\varepsilon>0 and

i=1kCov~[𝐳~i]𝚺 as k\sum_{i=1}^{k}\widetilde{Cov}\left[\mathbf{\tilde{z}}_{i}\right]\rightarrow\boldsymbol{\Sigma}_{\infty}\text{ as }k\rightarrow\infty (35)

for some matrix 𝚺\boldsymbol{\Sigma}_{\infty} in order that i=1k𝐳~iN(𝟎,𝚺)\sum\limits_{i=1}^{k}\mathbf{\tilde{z}}_{i}\rightsquigarrow N\left(\mathbf{0},\boldsymbol{\Sigma}_{\infty}\right), where Cov~\widetilde{Cov} is the conditional covariance operator given 𝐌\mathbf{M} and \rightsquigarrow denotes weak convergence.

We verify (34) first. Since supkmaxi,j𝔼~[yij8]C\sup\limits_{k}\max\limits_{i,j}\tilde{\mathbb{E}}\left[y_{ij}^{8}\right]\leq C, supkmaxiϕi<\sup\limits_{k}\max\limits_{i}\left\|\boldsymbol{\phi}_{i}\right\|<\infty and nn is finite, the identity

k𝐳~i2=jl;1j,lneil2eij2+j=1n(eij2δij)2+j=1nl=1n(ϕi𝐦jeil)2k\left\|\mathbf{\tilde{z}}_{i}\right\|^{2}=\sum_{j\neq l;1\leq j,l\leq n}e_{il}^{2}e_{ij}^{2}+\sum_{j=1}^{n}\left(e_{ij}^{2}-\delta_{ij}\right)^{2}+\sum_{j=1}^{n}\sum_{l=1}^{n}\left(\boldsymbol{\phi}_{i}\mathbf{m}^{j}e_{il}\right)^{2} (36)

for each 1ik1\leq i\leq k implies, via Hölder’s inequality,

𝔼~[𝐳~i21{𝐳~i>ε}]k1(𝔼~[ϑi2]Pr(ϑi>kε2))1/2Ck3/2ε\tilde{\mathbb{E}}\left[\left\|\mathbf{\tilde{z}}_{i}\right\|^{2}1_{\left\{\left\|\mathbf{\tilde{z}}_{i}\right\|>\varepsilon\right\}}\right]\leq k^{-1}\left(\tilde{\mathbb{E}}\left[\vartheta_{i}^{2}\right]\Pr\left(\vartheta_{i}>k\varepsilon^{2}\right)\right)^{1/2}\leq\dfrac{C}{k^{3/2}\varepsilon}

and vkεCk1/2ε10v_{k}^{\varepsilon}\leq Ck^{-1/2}\varepsilon^{-1}\rightarrow 0 as kk\rightarrow\infty, where ϑi\vartheta_{i} denotes the right hand side (RHS) of (36). Hence, (34) is valid.

It is left to verify (35). Let 𝚺i=Cov~(𝐳~i)\boldsymbol{\Sigma}_{{}_{i}}^{\ast}=\widetilde{Cov}\left(\mathbf{\tilde{z}}_{i}\right). Then k𝚺ik\boldsymbol{\Sigma}_{{}_{i}}^{\ast} has entries zero except at the blocks 𝐒i1,jl=Cov~(eij𝐞i,eil𝐞i)\mathbf{S}_{i1,jl}=\widetilde{Cov}\left(e_{ij}\mathbf{e}_{i},e_{il}\mathbf{e}_{i}\right), 𝐒i2,jl=Cov~(ϕi𝐦j𝐞i,ϕi𝐦l𝐞i)\mathbf{S}_{i2,jl}=\widetilde{Cov}\left(\boldsymbol{\phi}_{i}\mathbf{m}^{j}\mathbf{e}_{i},\boldsymbol{\phi}_{i}\mathbf{m}^{l}\mathbf{e}_{i}\right) and 𝐒i4,jl=Cov~(eij𝐞i,ϕi𝐦l𝐞i)\mathbf{S}_{i4,jl}=\widetilde{Cov}\left(e_{ij}\mathbf{e}_{i},\boldsymbol{\phi}_{i}\mathbf{m}^{l}\mathbf{e}_{i}\right). Specifically, 𝐒i1,jl(r,r)=𝔼~(eir4)\mathbf{S}_{i1,jl}\left(r,r^{\prime}\right)=\tilde{\mathbb{E}}\left(e_{ir}^{4}\right) if r=r=j=lr=r^{\prime}=j=l and is 0 otherwise; 𝐒i2,jl=ϕi𝐦jϕi𝐦l𝚫i\mathbf{S}_{i2,jl}=\boldsymbol{\phi}_{i}\mathbf{m}^{j}\boldsymbol{\phi}_{i}\mathbf{m}^{l}\mathbf{\Delta}_{i} with 𝚫i=diag{δi1,,δin}\mathbf{\Delta}_{i}={\rm diag}\left\{\delta_{i1},...,\delta_{in}\right\}; 𝐒i4,jl(r,r)=ϕi𝐦l𝔼~[eir3]\mathbf{S}_{i4,jl}\left(r,r^{\prime}\right)=\boldsymbol{\phi}_{i}\mathbf{m}^{l}\tilde{\mathbb{E}}\left[e_{ir}^{3}\right] if r=r=jr=r^{\prime}=j and is 0 otherwise. By the joint independence among {𝐳~i}i=1k\left\{\mathbf{\tilde{z}}_{i}\right\}_{i=1}^{k}, we have Cov~[i=1k𝐳~i]=𝚺k=i=1k𝚺i\widetilde{Cov}\left[\sum\limits_{i=1}^{k}\mathbf{\tilde{z}}_{i}\right]=\boldsymbol{\Sigma}_{k}^{\ast}=\sum\limits_{i=1}^{k}\boldsymbol{\Sigma}_{{}_{i}}^{\ast}. Further, 𝚺k𝚺\boldsymbol{\Sigma}_{k}^{\ast}\rightarrow\boldsymbol{\Sigma}_{\infty} as kk\rightarrow\infty when, for each 1j,ln1\leq j,l\leq n,

  1. k1i=1k𝔼~[eij4]k^{-1}\sum\limits_{i=1}^{k}\tilde{\mathbb{E}}\left[e_{ij}^{4}\right] and k1i=1kϕi𝔼~[eij3]k^{-1}\sum\limits_{i=1}^{k}\boldsymbol{\phi}_{i}\tilde{\mathbb{E}}\left[e_{ij}^{3}\right] are convergent (so that both k1i=1k𝐒i1,jlk^{-1}\sum\limits_{i=1}^{k}\mathbf{S}_{i1,jl} and k1i=1k𝐒i4,jlk^{-1}\sum\limits_{i=1}^{k}\mathbf{S}_{i4,jl} converge).

  2. k1i=1k𝐒i2,jlk^{-1}\sum\limits_{i=1}^{k}\mathbf{S}_{i2,jl} is convergent.

By assumption A3), the convergence required above is ensured, meaning that 𝚺k𝚺\boldsymbol{\Sigma}_{k}^{\ast}\rightarrow\boldsymbol{\Sigma}_{\infty} for some 𝚺0\boldsymbol{\Sigma}_{\infty}\geq 0, i.e., (35) holds. Therefore, i=1k𝐳~i\sum\limits_{i=1}^{k}\mathbf{\tilde{z}}_{i} is asymptotically Normally distributed as N(𝟎,𝚺)N\left(\mathbf{0},\boldsymbol{\Sigma}_{\infty}\right). The proof is completed.
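As an informal illustration of this central limit behavior (again, not part of the argument), one can simulate a single off-diagonal entry of the fluctuation of k^{-1}𝐄^T𝐄 under Poisson noise, rescale by √k, standardize by its conditional variance, and confirm that the resulting distribution is approximately standard Normal. A minimal sketch under illustrative settings:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, k, B = 8, 2, 5_000, 2_000
M = rng.uniform(0.2, 1.0, size=(r, n))
M /= np.linalg.norm(M, axis=1, keepdims=True)
Phi = rng.uniform(0.5, 2.0, size=(k, r))
Theta = (Phi @ M)[:, :2]                 # only the first two columns are needed

# Conditional variance of sqrt(k) * (k^{-1} sum_l e_{l1} e_{l2}) under Poisson
# noise, where the conditional variance of each entry equals its mean.
sigma2 = np.mean(Theta[:, 0] * Theta[:, 1])

stats = np.empty(B)
for b in range(B):
    E = rng.poisson(Theta) - Theta       # centered Poisson noise, two columns
    stats[b] = np.sqrt(k) * np.mean(E[:, 0] * E[:, 1]) / np.sqrt(sigma2)

print(stats.mean(), stats.std())         # approximately 0 and 1
```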

B.4 Proof of Theorem 5

We will use the notation in the proof of Proposition 4 and aim to show 𝐑^k𝐇=OPr(τk)\left\|\mathbf{\hat{R}}_{k}-\mathbf{H}\right\|=O_{\Pr}\left(\tau_{k}\right), which then by the Wielandt-Hoffman (WH) inequality of Hoffman and Wielandt (1953) implies 𝜶^k𝜶=OPr(τk)\left\|\boldsymbol{\hat{\alpha}}_{k}-\boldsymbol{\alpha}\right\|=O_{\Pr}\left(\tau_{k}\right), where 𝜶^k=(α^k,1,,α^k,n)\boldsymbol{\hat{\alpha}}_{k}=\left(\hat{\alpha}_{k,1},...,\hat{\alpha}_{k,n}\right) and 𝜶=(α1,,αn)\boldsymbol{\alpha}=\left(\alpha_{1},...,\alpha_{n}\right). Recall k1𝐘T𝐘=𝐅+k1𝚯T𝚯k^{-1}\mathbf{Y}^{T}\mathbf{Y}=\mathbf{F}+k^{-1}\boldsymbol{\Theta}^{T}\boldsymbol{\Theta} and 𝐅=f0(i=1k𝐳i)\mathbf{F}=f_{0}\left(\sum\limits_{i=1}^{k}\mathbf{z}_{i}\right) by Proposition 4, where f0f_{0} is linear with Lipschitz constant 11. The linearity of f0f_{0}, together with the fact that the only nonzero entries of 𝔼~[𝐳i]\tilde{\mathbb{E}}\left[\mathbf{z}_{i}\right] are k1δijk^{-1}\delta_{ij} at indices (j1)n+j\left(j-1\right)n+j for 1jn1\leq j\leq n, implies

k𝐅=f0(i=1kk(𝐳i𝔼~[𝐳i]))+k𝐃k=f0(i=1k𝐳~i)+k𝐃k.\sqrt{k}\mathbf{F}=f_{0}\left(\sum\limits_{i=1}^{k}\sqrt{k}\left(\mathbf{z}_{i}-\tilde{\mathbb{E}}\left[\mathbf{z}_{i}\right]\right)\right)+\sqrt{k}\mathbf{D}_{k}=f_{0}\left(\sum\limits_{i=1}^{k}\mathbf{\tilde{z}}_{i}\right)+\sqrt{k}\mathbf{D}_{k}\text{.}

Therefore, the Lipschitz property of f0f_{0} and the asymptotic normality of i=1k𝐳~i\sum\limits_{i=1}^{k}\mathbf{\tilde{z}}_{i} force

k(𝐑^k𝐇)\displaystyle\sqrt{k}\left(\mathbf{\hat{R}}_{k}-\mathbf{H}\right) =f0(i=1k𝐳~i)+k(𝐃^k𝐃k)+k(𝐌T(k1𝚽T𝚽𝐖)𝐌)\displaystyle=f_{0}\left(\sum\limits_{i=1}^{k}\mathbf{\tilde{z}}_{i}\right)+\sqrt{k}\left(\mathbf{\hat{D}}_{k}-\mathbf{D}_{k}\right)+\sqrt{k}\left(\mathbf{M}^{T}\left(k^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}-\mathbf{W}\right)\mathbf{M}\right)
=OPr(1)+kεk+kO(ck),\displaystyle=O_{\Pr}\left(1\right)+\sqrt{k}\varepsilon_{k}+\sqrt{k}O\left(c_{k}\right)\text{,}

where we have used assumptions (13) and (4). Hence, (14) holds with τk\tau_{k} set by (15). Finally, we show (16) and (18). From (14), we see that max{αic~iτk,0}α^k,iαi+c~iτk\max\left\{\alpha_{i}-\tilde{c}_{i}\tau_{k},0\right\}\leq\hat{\alpha}_{k,i}\leq\alpha_{i}+\tilde{c}_{i}\tau_{k} for some finite constant c~i0\tilde{c}_{i}\geq 0 for each 1in1\leq i\leq n. Since αr>0\alpha_{r}>0 but αr+1=0\alpha_{r+1}=0, we see that τ~k1α^k,iτ~k1αrc~iτ~k1τk\tilde{\tau}_{k}^{-1}\hat{\alpha}_{k,i}\geq\tilde{\tau}_{k}^{-1}\alpha_{r}-\tilde{c}_{i}\tilde{\tau}_{k}^{-1}\tau_{k}\rightarrow\infty for 1ir1\leq i\leq r but τ~k1α^k,ic~iτ~k1τk0\tilde{\tau}_{k}^{-1}\hat{\alpha}_{k,i}\leq\tilde{c}_{i}\tilde{\tau}_{k}^{-1}\tau_{k}\rightarrow 0 for r+1inr+1\leq i\leq n. So, (16) and (18) hold. The proof is completed.

B.5 Proof of Lemma 8

Let wij=v(yij)w_{ij}=v\left(y_{ij}\right). Then wijw_{ij} are mutually independent and

k1k1𝕍~[δ^kj]k1k2max1ik𝕍~[wij]Ck1k2<.\sum_{k\geq 1}k^{-1}\tilde{\mathbb{V}}\left[\hat{\delta}_{kj}\right]\leq\sum_{k\geq 1}k^{-2}\max_{1\leq i\leq k}\tilde{\mathbb{V}}\left[w_{ij}\right]\leq C\sum_{k\geq 1}k^{-2}<\infty\text{.}

Therefore, Theorem 1 of Walk (2005) implies that d~kj=δ^kjδ¯kj=δ^kj𝔼~[δ^kj]0\tilde{d}_{kj}=\hat{\delta}_{kj}-\bar{\delta}_{kj}=\hat{\delta}_{kj}-\tilde{\mathbb{E}}\left[\hat{\delta}_{kj}\right]\rightarrow 0 a.s. as kk\rightarrow\infty. Now we show the second claim. Let y~ij=v(yij)δij\tilde{y}_{ij}=v\left(y_{ij}\right)-\delta_{ij} and bij=k1/2y~ijb_{ij}=k^{-1/2}\tilde{y}_{ij}. Then l=1kblj=kd~kj\sum\limits_{l=1}^{k}b_{lj}=\sqrt{k}\tilde{d}_{kj}. For any ε>0\varepsilon>0, assumption (22) implies

a~i=𝔼~[bij21{|bij|>ε}]k1(𝔼~[y~ij4]Pr(|y~ij|>k1/2ε))1/2Ck5/4ε\tilde{a}_{i}=\tilde{\mathbb{E}}\left[b_{ij}^{2}1_{\left\{\left|b_{ij}\right|>\varepsilon\right\}}\right]\leq k^{-1}\left(\tilde{\mathbb{E}}\left[\tilde{y}_{ij}^{4}\right]\Pr\left(\left|\tilde{y}_{ij}\right|>k^{1/2}\varepsilon\right)\right)^{1/2}\leq\dfrac{C}{k^{5/4}\varepsilon}

and

limki=1ka~ilimkkCk5/4ε=0.\lim_{k\rightarrow\infty}\sum_{i=1}^{k}\tilde{a}_{i}\leq\lim_{k\rightarrow\infty}\dfrac{kC}{k^{5/4}\varepsilon}=0\text{.}

By assumption (24), limki=1k𝕍~[bij]=σj\lim\limits_{k\rightarrow\infty}\sum\limits_{i=1}^{k}\tilde{\mathbb{V}}\left[b_{ij}\right]=\sigma_{j} for each 1jn1\leq j\leq n. Thus, the conditions of MCLT (e.g., see Van der Vaart, 1998) are satisfied and l=1kblj\sum\limits_{l=1}^{k}b_{lj} converges in distribution to a Normal random variable with mean 0 and variance σj\sigma_{j}. The proof is completed.

C Performance of r^\hat{r} in estimating rr

In the following tables, we provide an assessment of the estimator r^\hat{r} in several scenarios that extend beyond that shown in Table 2. Data were simulated under model (1) over a range of nn and rr values under the five distributions listed. Shown is the number of times that r^=r\hat{r}=r among 100100 simulated data sets for each scenario. Also shown in parentheses is the number of times that r^<r\hat{r}<r, if any instances occurred.
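For readers who wish to reproduce the flavor of these tables, the sketch below outlines the simulation loop for the Poisson case only. The data-generating settings, the other four distributions, and the exact rule used to form r̂ (in particular its scaling sequence) are not restated here, so this is an illustrative outline rather than the code that produced the tables.

```python
import numpy as np

def rhat_from_counts(Y, tau):
    """Count eigenvalues of the adjusted Gram matrix that exceed tau
    (an illustrative, non-adaptive threshold, not the rule used for
    the tables)."""
    k = Y.shape[0]
    R_hat = Y.T @ Y / k - np.diag(Y.mean(axis=0))   # Poisson: variance = mean
    return int(np.sum(np.linalg.eigvalsh(R_hat) > tau))

def count_correct(n, r, k, reps=100, seed=0):
    """Count how often the illustrative rank estimate equals the true r."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        M = rng.uniform(0.2, 1.0, size=(r, n))
        M /= np.linalg.norm(M, axis=1, keepdims=True)
        Phi = rng.uniform(0.5, 2.0, size=(k, r))
        Y = rng.poisson(Phi @ M)
        hits += int(rhat_from_counts(Y, tau=np.sqrt(np.log(k) / k)) == r)
    return hits

# e.g., count_correct(n=15, r=3, k=1000) plays the role of a single table cell.
```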

Table S1: Performance of r^\hat{r} as an estimator of rr when n=15n=15 and r{1,2,3,4}r\in\left\{1,2,3,4\right\}.
kk Binomial Gamma NegBin Normal Poisson
n=15n=15 and r=1r=1
1000 96 82 82 100 91
5000 99 89 86 100 92
10,000 96 76 91 99 93
100,000 100 90 86 99 94
n=15n=15 and r=2r=2
1000 96 92 5 (85) 100 94
5000 96 90 93 98 97
10,000 97 93 90 96 95
100,000 99 96 93 99 97
n=15n=15 and r=3r=3
1000 95 27 (68) 36 (51) 96 94
5000 97 44 (46) 45 (46) 99 93
10,000 100 89 76 (20) 96 95
100,000 97 95 90 98 96
n=15n=15 and r=4r=4
1000 51 (48) 25 (29) 9 (2) 100 94
5000 99 20 (72) 49 (5) 97 94
10,000 100 60 (27) 28 (35) 97 93
100,000 98 91 90 97 93
Table S2: Performance of r^\hat{r} as an estimator of rr when n=100n=100 and r{1,2,3,4,5}r\in\left\{1,2,3,4,5\right\}.
kk Binomial Gamma NegBin Normal Poisson
n=100n=100 and r=1r=1
1000 100 99 (1) 99 100 100
5000 100 100 100 100 100
10,000 100 100 100 100 100
100,000 100 100 100 100 100
n=100n=100 and r=2r=2
1000 0 (100) 100 100 100 100
5000 0 (100) 100 100 100 100
10,000 0 (100) 100 100 100 100
100,000 0 (100) 100 100 100 100
n=100n=100 and r=3r=3
1000 0 (100) 100 100 100 100
5000 0 (100) 100 100 100 100
10,000 0 (100) 100 100 100 100
100,000 0 (100) 100 100 100 100
n=100n=100 and r=4r=4
1000 0 (100) 77 (23) 24 (76) 100 100
5000 0 (100) 100 100 100 100
10,000 0 (100) 100 100 100 100
100,000 0 (100) 100 100 100 100
n=100n=100 and r=5r=5
1000 0 (100) 35 (52) 40 100 100
5000 0 (100) 96 (4) 100 100 100
10,000 0 (100) 100 100 100 100
100,000 0 (100) 100 100 100 100
Table S3: Performance of r^\hat{r} as an estimator of rr when n=200n=200 and r{6,8,10,12}r\in\left\{6,8,10,12\right\}.
kk Binomial Gamma NegBin Normal Poisson
n=200n=200 and r=6r=6
1000 0 (100) 59 (10) 1 100 100
5000 0 (100) 100 100 100 100
10,000 0 (100) 100 100 100 100
100,000 0 (100) 100 100 100 100
n=200n=200 and r=8r=8
1000 0 (100) 0 0 100 100
5000 0 (100) 0 0 100 100
10,000 0 (100) 78 0 100 100
100,000 0 (100) 100 100 100 100
n=200n=200 and r=10r=10
1000 0 (100) 0 0 100 100
5000 0 (100) 0 0 100 100
10,000 0 (100) 0 0 100 100
100,000 0 (100) 69 11 100 100
n=200n=200 and r=12r=12
1000 0 (100) 0 0 100 100
5000 0 (100) 0 0 100 100
10,000 0 (100) 0 0 100 100
100,000 0 (100) 0 0 100 100

D Performance of Estimator of Π𝐌\Pi_{\mathbf{M}}

The following figures show the performance of the nonparametric estimator Π^𝐌\hat{\Pi}_{\mathbf{M}} of Π𝐌\Pi_{\mathbf{M}} for several scenarios beyond that shown in Figure 1. The results are given over a range of nn and rr values as indicated in the figure captions. Column 1: Boxplots of the difference between the row spaces spanned by 𝐌\mathbf{M} and 𝐌^\mathbf{\hat{M}} as measured by d(𝐌,𝐌^)d(\mathbf{M},\mathbf{\hat{M}}) when the true dimension of 𝐌\mathbf{M} is utilized to form 𝐌^\mathbf{\hat{M}} (i.e., setting r^=r\hat{r}=r). Column 2: Boxplots of d(𝐌,𝐌^)d(\mathbf{M},\mathbf{\hat{M}}) when using the proposed estimator r^\hat{r} of the row space dimension in forming 𝐌^\mathbf{\hat{M}}. Column 3: An assessment of the estimate 𝐃^k\mathbf{\hat{D}}_{k} of 𝐃k\mathbf{D}_{k}, where the latter term is the average of the column-wise variances of 𝐘\mathbf{Y}. The difference is measured by 𝐃^k𝐃k=max1jn|δ^kjδ¯kj|\|\mathbf{\hat{D}}_{k}-\mathbf{D}_{k}\|=\max\limits_{1\leq j\leq n}|\hat{\delta}_{kj}-\bar{\delta}_{kj}|.

Figure S1: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=15n=15 and r=1r=1.
Figure S2: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=15n=15 and r=2r=2.
Figure S3: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=15n=15 and r=3r=3.
Figure S4: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=15n=15 and r=4r=4.
Figure S5: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=100n=100 and r=1r=1.
Figure S6: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=100n=100 and r=2r=2.
Figure S7: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=100n=100 and r=3r=3.
Figure S8: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=100n=100 and r=4r=4.
Figure S9: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=100n=100 and r=5r=5.
Figure S10: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=200n=200 and r=6r=6.
Figure S11: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=200n=200 and r=8r=8.
Figure S12: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=200n=200 and r=10r=10.
Figure S13: Performance of Π^𝐌\hat{\Pi}_{\mathbf{M}} when n=200n=200 and r=12r=12.