
Functional Factor Modeling of Brain Connectivity

Kyle Stanley Department of Statistics, Pennsylvania State University
kms8227@psu.edu
Nicole Lazar Department of Statistics and Huck Institutes of the Life Sciences, Pennsylvania State University
nfl5182@psu.edu
Matthew Reimherr Department of Statistics, Pennsylvania State University, and Amazon Science
mlr36@psu.edu
Abstract

Many fMRI analyses examine functional connectivity, or statistical dependencies among remote brain regions. Yet popular methods for studying whole-brain functional connectivity often yield results that are difficult to interpret. Factor analysis offers a natural framework in which to study such dependencies, particularly given its emphasis on interpretability. However, multivariate factor models break down when applied to functional and spatiotemporal data, like fMRI. We present a factor model for discretely-observed multidimensional functional data that is well-suited to the study of functional connectivity. Unlike classical factor models which decompose a multivariate observation into a “common” term that captures covariance between observed variables and an uncorrelated “idiosyncratic” term that captures variance unique to each observed variable, our model decomposes a functional observation into two uncorrelated components: a “global” term that captures long-range dependencies and a “local” term that captures short-range dependencies. We show that if the global covariance is smooth with finite rank and the local covariance is banded with potentially infinite rank, then this decomposition is identifiable. Under these conditions, recovery of the global covariance amounts to rank-constrained matrix completion, which we exploit to formulate consistent loading estimators. We study these estimators, and their more interpretable post-processed counterparts, through simulations, then use our approach to uncover a rich covariance structure in a collection of resting-state fMRI scans.

Keywords: functional data analysis, factor analysis, fMRI, functional connectivity, matrix completion

1 Introduction

There exist a number of imaging modalities that allow researchers to measure the physiological changes that accompany neuronal activation. Among the most widely used techniques is functional magnetic resonance imaging (fMRI). FMRI is based on the haemodynamic response wherein blood delivers oxygen to active brain regions at a greater rate than to inactive regions. Blood-oxygen-level-dependent (BOLD) imaging uses magnetic fields to measure relative oxygenation levels across the brain, allowing researchers to use BOLD signal as a proxy for neuronal activation. During an fMRI experiment, a subject is placed in a scanner that collects a temporal sequence of three-dimensional brain images. Each image is partitioned into a grid of three-dimensional volume elements, called voxels, and contains BOLD measurements at each voxel (Lindquist, 2008). The resulting data are both large and complex. The data from a single session typically contain hundreds of brain images, each with BOLD observations at more than 100,000 voxels; large fMRI datasets include hundreds of such scans. Moreover, these data exhibit spatiotemporal dependencies that complicate analysis.

A common goal of fMRI analyses is to characterize functional connectivity, or the statistical dependencies among remote neurophysiological events (Friston, 2011). This is of particular interest in resting-state fMRI, wherein subjects do not perform an explicit task while in the scanner. In this work, we study functional connectivity in resting-state data from the PIOP1 dataset of the Amsterdam Open MRI Collection (AOMIC) (Snoek, van der Miesen, Beemsterboer, van der Leij, Eigenhuis, and Steven Scholte, 2021), which contain six-minute scans collected from 210 subjects at rest. To do so, we develop a new approach that integrates two fields of statistics: factor analysis and functional data analysis.

Techniques used to study resting-state functional connectivity fall into two categories: seed methods and whole-brain methods. To conduct a seed-based connectivity analysis, researchers first select a seed voxel (or region), then correlate the time series of that seed with the time series of all other voxels (or regions), resulting in a functional connectivity map (Biswal, Zerrin Yetkin, Haughton, and Hyde, 1995). Although readily interpretable, these functional connectivity maps only describe dependencies of the selected seed region (Van Den Heuvel and Pol, 2010). Principal component analysis (PCA) is a rudimentary whole-brain approach to functional connectivity that finds the collection of orthogonal spatial maps (a.k.a., principal components) that maximally explains the variation – both between- and within-voxel – in a set of brain images (Friston et al., 1993). The orthogonality constraint, however, means that the low-order subspace is expressed by dense spatial maps containing many high-magnitude weights that complicate interpretation. Independent component analysis (ICA; Beckmann and Smith, 2004), perhaps the most popular whole-brain approach, offers a more parsimonious representation of this subspace. After computing the principal spatial components and associated temporal components from an fMRI scan, ICA rotates these components until the spatial set is maximally independent. Although the resulting spatial maps are typically more parsimonious than those of PCA, ICA may fragment broad areas of activation into multiple maps with highly correlated time courses in its pursuit to maximize spatial independence (Friston, 1998). This means ICA findings are not robust to dimension overspecification.

To address the shortcomings of the above approaches, we develop a method based on exploratory factor analysis. The goal of factor analysis is to describe the covariance relationships between many observed variables using a few unobserved random quantities. Classical factor analyses use the orthogonal factor model (OFM), which posits that $M$ zero-mean observed quantities $\underline{X}\in\mathbb{R}^{M}$ having covariance $\boldsymbol{C}\in\mathbb{R}^{M\times M}$ are the sum of a common component, given by linear combinations of $K<M$ unobserved uncorrelated factors $\underline{F}\in\mathbb{R}^{K}$ with zero mean and unit variance, and an idiosyncratic component, given by a vector of uncorrelated errors $\underline{\epsilon}\in\mathbb{R}^{M}$ with zero mean. The loading matrix $\boldsymbol{L}\in\mathbb{R}^{M\times K}$ specifies the coefficients of these linear combinations in the common component:

$$\underline{X}=\boldsymbol{L}\underline{F}+\underline{\epsilon}.$$

If $\boldsymbol{D}\in\mathbb{R}^{M\times M}$ denotes the diagonal covariance matrix of $\underline{\epsilon}$, then $\boldsymbol{C}$ decomposes as

$$\boldsymbol{C}=\boldsymbol{L}\boldsymbol{L}^{T}+\boldsymbol{D}.$$

The OFM is identifiable only up to an orthogonal rotation since, for $\underline{F}^{*}=\boldsymbol{R}^{T}\underline{F}$ and $\boldsymbol{L}^{*}=\boldsymbol{L}\boldsymbol{R}$, where $\boldsymbol{R}\in\mathbb{R}^{K\times K}$ is a rotation matrix, we have the equivalences

$$\underline{X}=\boldsymbol{L}\underline{F}+\underline{\epsilon}=\boldsymbol{L}^{*}\underline{F}^{*}+\underline{\epsilon} \quad\text{and}\quad \boldsymbol{C}=\boldsymbol{L}\boldsymbol{L}^{T}+\boldsymbol{D}=\boldsymbol{L}^{*}(\boldsymbol{L}^{*})^{T}+\boldsymbol{D}.$$

Practitioners address this rotation problem by choosing from a suite of factor rotation algorithms, like quartimax (Neuhaus and Wrigley, 1954) and varimax (Kaiser, 1958) rotation, that bring the loading matrix closer to a so-called simple structure, wherein (i) each variable has a high loading on a single factor but near-zero loadings on other factors, and (ii) each factor is described by only a few variables with high loadings on that factor while other variables have near-zero loadings on that factor. This structure provides a parsimonious representation of the common component and one can opt for a rotation procedure that is robust to dimension overspecification.
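To make the idea of simple structure concrete, the sketch below implements the varimax criterion directly in Python with NumPy (the computations in this paper use MATLAB; the function and the toy loading matrix here are purely illustrative). Rotating a dense loading matrix in this way concentrates each variable's weight on a single factor.

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Rotate a loading matrix L (M x K) toward simple structure (varimax)."""
    M, K = L.shape
    R = np.eye(K)                      # accumulated rotation
    var_old = 0.0
    for _ in range(max_iter):
        Lr = L @ R                     # current rotated loadings
        # Gradient of the varimax criterion (Kaiser, 1958)
        G = L.T @ (Lr**3 - Lr @ np.diag(np.mean(Lr**2, axis=0)))
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                     # closest rotation to the gradient step
        var_new = s.sum()
        if var_new - var_old < tol:
            break
        var_old = var_new
    return L @ R, R

# Toy example: two overlapping "dense" loadings become nearly disjoint after rotation
L = np.array([[0.7, 0.7], [0.8, 0.6], [0.6, -0.7], [0.7, -0.8]])
L_rot, R = varimax(L)
print(np.round(L_rot, 2))
```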

To grasp the value of factor analysis in functional connectivity, consider a naive application wherein an OFM is fit to a vector of voxel-wise BOLD observations. Used in conjunction with factor rotation, the OFM would provide $K$ simple spatial maps amenable to interpretation. This sparse and low-dimensional representation of the entire brain makes factor analysis an attractive alternative to seed and ICA methods which, respectively, are not whole-brain and lack such interpretable results. However, there is a problem; the OFM crucially assumes that its errors are uncorrelated, a condition violated in this application given that errors for nearby voxels will be correlated. Such autocorrelation may dissipate as $K$ grows, but this has two undesirable effects. First, the parsimony of the low-order model is lost. Second, a high-order common component will capture variation at small scales – possibly even between adjacent voxels – an outcome at odds with the purpose of functional connectivity studies: to study correlations between distant brain regions. In a more appropriate factor model, the common component would be low-rank and capture only large-scale variation while its error term would permit variation at a smaller scale.

To develop this better suited factor model, it helps to view fMRI data in the functional data analysis (FDA) framework. Data are said to be functional if we can naturally assume that they arise from some smooth curve. FDA is then the statistical analysis of sample curves in such data (Ramsay and Silverman, 2005; Wang et al., 2016; Kokoszka and Reimherr, 2017). Since it is reasonable to view BOLD observations of a brain image as noisy realizations from some smooth function over a three-dimensional domain, fMRI data is an example of multidimensional functional data. Though not widely adopted by the neuroimaging community, fMRI has been studied using FDA tools such as functional principal component analysis (FPCA; Viviani et al., 2005), functional regression (Reiss et al., 2015, 2017), and functional graphical models (Li and Solea, 2018). To adapt factor analytic techniques to fMRI data, we can extend classical factor methods to this functional setting.

Early extensions of the OFM come from studies of large-dimensional econometric panel data, in which $M$ time series (e.g., one time series per asset) are observed at the same $n$ points in time. When $M$ is large, traditionally reliable time series tools (e.g., vector autoregressive models) require estimation of too many parameters. Factor models are a practical alternative as they provide more parsimonious parameterizations. These factor models should, however, possess two characteristics to accommodate properties inherent to economic time series. First, they ought to be dynamic as problems involving time series are dynamic in nature. Second, they should permit correlated errors as uncorrelatedness is often unrealistic with large $M$. Sargent and Sims (1977) and Geweke (1977) innovated on the OFM by making latent factors dynamic in their dynamic factor model (DFM). However, these early DFMs were exact in that they assumed uncorrelated errors. Factor models did not permit such error correlation until Chamberlain and Rothschild (1983) introduced their approximate factor model (AFM), which required only that the largest eigenvalue of the error covariance be bounded. Although this AFM was static, much subsequent study focused on factor models that were both approximate and dynamic (Forni et al., 2000; Stock and Watson, 2002; Bai, 2003).

Factor models are well-studied in large dimensions, but the literature on infinite-dimensional factor analysis is scarce. Hays et al. (2012) were the first to merge ideas from FDA with factor analysis in their dynamic functional factor model (DFFM) which they used to forecast yield curves. Liebl (2013) proposed a different DFFM which, unlike the model of Hays et al. (2012), placed no a priori constraints on the latent process of factor scores. Kowal et al. (2017) presented a Bayesian DFFM for modeling multivariate dependent functional data. None of these early functional factor models permit any degree of error dependence, a structure that certainly arises when representing potentially infinite-rank observations with low-rank objects. Otto and Salish (2022) were the first to address this issue using their approximate DFFM wherein the error covariance is left unrestricted. However, as this is a model for completely-observed functional data, one must either observe these data at infinite resolution (a practical impossibility) or obtain functional representations of discretely-observed sample curves via spline smoothing (Ramsay and Silverman, 2005) prior to model estimation.

Though spline smoothing (and other forms of pre-smoothing) may be an effective first step in many analyses of functional data, it is not in factor analysis. Analyses that begin by smoothing functional data assume that a functional observation $X(s)$ is the sum of two uncorrelated components: a smooth signal $Y(s)$ and a white noise term $\epsilon(s)$ whose covariance kernel is zero off the diagonal (or some infinitesimally narrow band). When observed at discrete points,

$$X(s_{m})=Y(s_{m})+\epsilon(s_{m}),\quad m=1,\dots,M,$$

the smoothed curve $\tilde{X}(s)$ is defined as

$$\tilde{X}(s)=\arg\min_{f\in C^{2}[0,1]}\left\{\sum_{m=1}^{M}\big(f(s_{m})-X(s_{m})\big)^{2}+\alpha\big\lVert\partial_{s}^{2}f\big\rVert_{L^{2}}\right\} \quad (1)$$

When $\epsilon(s)$ is white noise, $\tilde{X}(s)$ is a good approximation to $Y(s)$. If, instead, the signal $Y(s)$ and the error $\epsilon(s)$ are defined as in our theorized factor model – the former capturing only large-scale variation and the latter permitting small-scale variation – then minimizing the “uncorrelated” objective of (1) yields a smoothed curve $\tilde{X}(s)$ contaminated by short-range variation of the error term, making it a poor approximation to $Y(s)$. Using $\tilde{X}(s)$ as a stand-in for $Y(s)$ in subsequent analyses is ill-advised. Thus, a factor model for functional data should not presume completely-observed sample curves, but rather deal directly with discretely-observed data.
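The following sketch illustrates this point numerically under simple assumptions: it minimizes a discretized version of objective (1), with the roughness penalty taken as a squared second-difference term, and compares the fit obtained under white-noise errors with the fit obtained under short-range-correlated errors of the same marginal variance. The signal, error processes, and penalty level are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 200
s = np.linspace(0, 1, M)

# Discretized analogue of objective (1): the minimizer of
# ||x - f||^2 + alpha * ||D2 f||^2 is f_hat = (I + alpha * D2.T @ D2)^{-1} x.
D2 = np.diff(np.eye(M), n=2, axis=0)          # second-difference operator

def smooth(x, alpha):
    return np.linalg.solve(np.eye(M) + alpha * D2.T @ D2, x)

y = np.sin(2 * np.pi * s)                      # smooth "global" signal Y(s)

eps_white = 0.5 * rng.standard_normal(M)       # white-noise errors
# short-range-correlated errors with the same marginal standard deviation
eps_local = 0.5 * np.convolve(rng.standard_normal(M), np.ones(15), mode="same") / np.sqrt(15)

alpha = 5.0
err_white = np.linalg.norm(smooth(y + eps_white, alpha) - y)
err_local = np.linalg.norm(smooth(y + eps_local, alpha) - y)
print(err_white, err_local)   # the correlated-error fit stays farther from Y(s)
```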

As it turns out, an orthogonal factor structure indeed underlies discretely-observed functional data. Hörmann and Jammoul (2020) formalized this observation, noting, however, that the assumption of uncorrelated errors is rarely justified. They suggest that a more realistic factor model would permit error correlations that “taper to zero with increasing lag.” But this poses an identifiability question: is it possible to distinguish such error dependence from that residing in the common component? Descary and Panaretos (2019) show that this is, indeed, possible if one assumes the covariance of the common term is smooth with low rank and that of the error term is banded with potentially infinite rank. These assumptions give rise to a distinctly functional re-characterization of the common and idiosyncratic components of the OFM: the former becomes “global”, capturing dependencies between distant regions of the domain, while the latter becomes “local”, describing dependencies between nearby regions. As this is the precise perspective put forth by our theorized factor model for functional connectivity, we let the ideas of Descary and Panaretos (2019) guide the development of this model.

In this paper, we develop a factor analytic approach for discretely-observed multidimensional functional data that is well-suited to the study of functional connectivity in fMRI. Our methodology is inspired by the framework of Descary and Panaretos (2019), which we build upon in three key respects. First, we extend their conditions for identifying the global and local components of a full covariance to multidimensional functional data, and show that these conditions become less restrictive as the dimensionality grows. Second, we improve upon their global covariance estimator by incorporating a roughness penalty. Third, we develop a post-processing procedure – inspired by methods from classical factor analysis – that facilitates interpretation of global estimates. Using this new approach, we uncover a rich covariance structure in the 210 resting-state scans from the AOMIC-PIOP1 dataset.

The rest of this paper proceeds as follows. In Section 2, we introduce functional factor models for both completely- and discretely-observed functional data, and present identifiability conditions for both. In Section 3, we develop estimation methods for the discretely-observed functional factor model. We then take a brief detour, in Section 4, to review an ICA-based approach used throughout the neuroimaging community. In Section 5, we use simulations to compare our estimator to that of this ICA-based approach and several other comparators. Section 6 contains an application of our approach to the AOMIC resting-state data which we contrast with an application of the ICA-based estimator. We conclude, in Section 7, with some final remarks and future work.

2 The Functional Factor Model and its Identification

This section considers two data observation paradigms: (i) data observed completely (e.g., a brain image with infinite spatial resolution, making it a function), and (ii) data observed discretely on a grid (e.g., a brain image with finite spatial resolution, making it a tensor). Given that our methodology forgoes the usual data smoothing step in many analyses of functional data, the completely-observed paradigm is not practically possible. We instead use this first paradigm as an intermediate step in the development of our theory for the second. Unless otherwise specified, let $a$ or $A$ denote scalars, $\underline{a}$ or $\underline{A}$ denote vectors, $\boldsymbol{A}$ denote matrices, $\mathcal{A}$ denote tensors, and $\mathscr{A}$ denote operators.

2.1 Completely-Observed Data

In the completely-observed paradigm, we assume $X$ is a mean-zero random function in $L^{2}([0,1]^{D})$, which is endowed with the usual inner product and norm:

$$\langle f,g\rangle_{L^{2}}=\int_{[0,1]^{D}}f(\underline{s})g(\underline{s})\,d\underline{s} \quad\text{and}\quad \lVert f\rVert_{L^{2}}=\sqrt{\langle f,f\rangle_{L^{2}}}.$$

We may view $X$, in the fMRI setting, as a 3-dimensional ($D=3$) brain image at infinite spatial resolution, an object conceivable only in theory. The covariance operator $\mathscr{C}$ of $X$ is the integral operator with kernel

$$c(\underline{s}_{1},\underline{s}_{2})=\mathbb{E}[X(\underline{s}_{1})X(\underline{s}_{2})].$$

We say $X$ follows a functional factor model (FFM) with $K$ factors if

$$X(\underline{s})=\underbrace{\sum_{k=1}^{K}l_{k}(\underline{s})f_{k}}_{Y(\underline{s})}+\epsilon(\underline{s}),$$

where $l_{k}$ and $f_{k}$ denote the $k$th loading function and factor, respectively, and $\epsilon$ is the error function. In the spirit of classical factor models, this model assumes the $f_{k}$ are uncorrelated, mean-zero with unit variance, and independent of the mean-zero $\epsilon$, and that the $l_{k}$ are linearly independent. In this functional setting, we want the components $Y$ and $\epsilon$ to capture global and local variation in $X$, respectively. If we assume the $l_{k}$ are sufficiently smooth (in a sense to be defined in Section 2.1.1), then $Y$, being a finite sum of these smooth functions, will contain the global variation in $X$ but fail to capture all of the function’s local variation. If we further assume $\epsilon$ has $\underline{\delta}$-banded covariance, i.e., $\text{Cov}(\epsilon(\underline{s}_{1}),\epsilon(\underline{s}_{2}))=0$ when $|s_{1,d}-s_{2,d}|\geq\delta_{d}$ for some $\delta_{d}>0$, $d=1,\dots,D$, then the error function will contain only that local variation not captured by $Y$. When $X$ is an infinite-resolution brain image, this model decomposes $X$ into the sum of two images: one defined by a finite linear combination whose components (i.e., loading functions) describe dependencies between distant brain regions, and another that captures dependencies between nearby regions.

If the global component $Y$ has covariance $\mathscr{G}$ with kernel $g$, and the local component $\epsilon$ has covariance $\mathscr{B}$ with kernel $b$, then

$$\mathscr{C}=\mathscr{G}+\mathscr{B}, \qquad c(\underline{s}_{1},\underline{s}_{2})=g(\underline{s}_{1},\underline{s}_{2})+b(\underline{s}_{1},\underline{s}_{2}).$$

Moreover, being covariances, $g$ and $b$ admit the Mercer decompositions

$$g(\underline{s}_{1},\underline{s}_{2})=\sum_{k=1}^{K}\lambda_{k}\eta_{k}(\underline{s}_{1})\eta_{k}(\underline{s}_{2}),$$
$$b(\underline{s}_{1},\underline{s}_{2})=\sum_{j=1}^{\infty}\beta_{j}\psi_{j}(\underline{s}_{1})\psi_{j}(\underline{s}_{2})=\mathds{1}\{|s_{1,d}-s_{2,d}|<\delta_{d},\ \forall d\}\sum_{j=1}^{\infty}\beta_{j}\psi_{j}(\underline{s}_{1})\psi_{j}(\underline{s}_{2}).$$

Since the assumptions of the model also imply

$$g(\underline{s}_{1},\underline{s}_{2})=\sum_{k=1}^{K}l_{k}(\underline{s}_{1})l_{k}(\underline{s}_{2}),$$

the loading functions $l_{k}$ are equal to the scaled eigenfunctions of $\mathscr{G}$, $\lambda_{k}^{1/2}\eta_{k}$, up to an orthogonal rotation. This makes it possible to impose smoothness on the loading functions through the eigenfunctions of $\mathscr{G}$, as we do in Section 2.1.1.
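This relationship between loadings and scaled eigenfunctions can be checked numerically. The sketch below, for a hypothetical one-dimensional grid, builds a discretized $\mathscr{G}$ from three non-orthogonal loadings and verifies that the scaled eigenvectors reproduce the loadings up to an orthogonal transformation; all choices are illustrative.

```python
import numpy as np

M, K = 100, 3
s = np.linspace(0, 1, M)

# Three smooth (non-orthogonal) loading functions evaluated on a grid
L = np.column_stack([np.sin(np.pi * s), np.cos(2 * np.pi * s), s * (1 - s)])

G = L @ L.T                                   # discretized global covariance, rank K
w, V = np.linalg.eigh(G)
w, V = w[::-1][:K], V[:, ::-1][:, :K]         # leading K eigen-pairs
L_eig = V * np.sqrt(w)                        # scaled eigenvectors

# L and L_eig span the same K-dimensional space: L_eig = L @ R for an orthogonal R
R, *_ = np.linalg.lstsq(L, L_eig, rcond=None)
print(np.allclose(L_eig, L @ R), np.allclose(R @ R.T, np.eye(K)))
```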

Like many latent models, the FFM is rotationally indeterminate. That is, if loading functions $l_{1},\dots,l_{K}$ satisfy the model constraints, then so does any orthogonal rotation of these functions (see Section 2 of Supplement A for a detailed explanation). In the fMRI context, this means there are infinitely many ways to express the global component via spatial map sets of size $K$, and each set is a rotation away from the rest. Classical factor analysis addresses this rotation problem by bringing the loadings closer to a simple structure. We explore such approaches in Section 3.2.

2.1.1 Identification of $(\mathscr{G},\mathscr{B})$ from $\mathscr{C}$

Despite rotational indeterminacy of the loading functions, we may still hope to identify the global and local covariance components $(\mathscr{G},\mathscr{B})$ from $\mathscr{C}$. Thus far, we have imposed two critical conditions on this covariance decomposition: (i) $\mathscr{G}$ has finite rank, and (ii) $\mathscr{B}$ is banded. These restrictions, however, do not sufficiently constrain the problem. Given $\mathscr{C}$, we know $\mathscr{G}$ off the band, but it is not clear how to uniquely extend $\mathscr{G}$ onto the band. Now, if $\mathscr{C}$ were contaminated by $\mathscr{B}$ only on its diagonal (i.e., $\mathscr{B}$ is $\underline{0}$-banded), then we could simply assume $\mathscr{G}$ to be continuous and smooth over this diagonal. However, this strategy does not suffice in the presence of a nontrivial band. Recall that in Section 2.1, we alluded to an additional assumption on the smoothness of the eigenfunctions of $\mathscr{G}$. In Theorem 1 from Section 3.1 of Supplement A, which extends Theorem 1 of Descary and Panaretos (2019), we formalize this condition through the notion of analyticity, showing that real analytic loading functions provide the formulation of smoothness needed to identify the covariance decomposition. To prove this result, we exploit the so-called analytic continuation property, which states that if a function is analytic over some domain, but is known only on an open subset of that domain, then the function extends uniquely to the rest of the domain. This property allows us to uniquely extend $\mathscr{G}$ onto the band, thereby establishing a unique decomposition of the full covariance. When $X$ is an infinite-resolution brain image, this condition implies that the loading functions, which capture correlations between distant brain regions, are analytically smooth over the domain of the image.

2.2 Discretely-Observed Data

In practice, we can measure a function $X$ at only a finite number of locations. In fMRI, these locations are an evenly-spaced 3-dimensional grid of voxels. This manuscript assumes such a grid, although Section 3.2 of Supplement A presents theory for a more general discrete-observation framework. Let $\{\underline{s}_{\underline{m}}\}_{\underline{m}\in\mathbb{N}_{\underline{M}}}$ be an evenly-spaced grid where $\underline{M}=(M_{1},\dots,M_{D})$ is the grid resolution and $\mathbb{N}_{\underline{M}}=\mathbb{N}_{M_{1}}\times\dots\times\mathbb{N}_{M_{D}}$ is the set of all $D$-dimensional multi-indices with $d$th component no greater than $M_{d}$ (i.e., $\mathbb{N}_{M_{d}}=\{1,\dots,M_{d}\}$). In the case of the AOMIC dataset, which contains brain images of dimension $65\times 77\times 60$, we have resolution $\underline{M}=(65,77,60)$ with dimension $D=3$, and $\underline{s}_{\underline{m}}$ denotes the location of the voxel at grid position $\underline{m}=(m_{1},m_{2},m_{3})$. Suppose we observe each of the $n$ samples of $X$ at the $M=M_{1}\cdots M_{D}$ grid points,

$$X_{i}(\underline{s}_{\underline{m}})=\sum_{k=1}^{K}l_{k}(\underline{s}_{\underline{m}})f_{i,k}+\epsilon_{i}(\underline{s}_{\underline{m}}),\quad i=1,\dots,n,\ \underline{m}\in\mathbb{N}_{\underline{M}}.$$

If we summarize the functional terms in the model with the tensors

$$\mathcal{X}^{\underline{M}}(\underline{m})=X(\underline{s}_{\underline{m}}),\qquad \mathcal{L}^{\underline{M}}_{k}(\underline{m})=l_{k}(\underline{s}_{\underline{m}}),\qquad \mathcal{E}^{\underline{M}}(\underline{m})=\epsilon(\underline{s}_{\underline{m}}),$$

then we can compactly write the $\underline{M}$-resolution functional factor model with $K$ factors as

$$\mathcal{X}^{\underline{M}}=\sum_{k=1}^{K}\mathcal{L}^{\underline{M}}_{k}f_{k}+\mathcal{E}^{\underline{M}}.$$

Defining the $M_{1}\times\dots\times M_{D}\times M_{1}\times\dots\times M_{D}$ covariance tensors

$$\mathcal{C}^{\underline{M}}(\underline{i},\underline{j})=c(\underline{s}_{\underline{i}},\underline{s}_{\underline{j}}),\qquad \mathcal{G}^{\underline{M}}(\underline{i},\underline{j})=g(\underline{s}_{\underline{i}},\underline{s}_{\underline{j}}),\qquad \mathcal{B}^{\underline{M}}(\underline{i},\underline{j})=b(\underline{s}_{\underline{i}},\underline{s}_{\underline{j}}),$$

we can invoke the model’s assumptions, which carry over from the completely-observed setting, to decompose the covariance as

$$\mathcal{C}^{\underline{M}}=\underbrace{\sum_{k=1}^{K}\mathcal{L}^{\underline{M}}_{k}\otimes\mathcal{L}^{\underline{M}}_{k}}_{\mathcal{G}^{\underline{M}}}+\mathcal{B}^{\underline{M}},$$

where $\otimes$ denotes the usual tensor product, and the loading tensor $\mathcal{L}_{k}^{\underline{M}}$ is equal to the $k$th scaled eigentensor of $\mathcal{G}^{\underline{M}}$, $\lambda_{k}^{1/2}\mathcal{H}_{k}^{\underline{M}}$, up to a rotation. For notational simplicity, we will often suppress the $\underline{M}$ in the superscript when writing tensors. We will also say that tensors with dimension $M_{1}\times\dots\times M_{D}$ and $M_{1}\times\dots\times M_{D}\times M_{1}\times\dots\times M_{D}$ simply have dimensions $\mathfrak{M}$ and $\mathfrak{M}\times\mathfrak{M}$, respectively. This allows us to abbreviate $\mathbb{R}^{M_{1}\times\dots\times M_{D}}$ and $\mathbb{R}^{M_{1}\times\dots\times M_{D}\times M_{1}\times\dots\times M_{D}}$ with $\mathbb{R}^{\mathfrak{M}}$ and $\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}}$, respectively.
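For readers implementing the method, the vectorization and square matricization used below amount to plain reshapes when the grid is stored in a fixed (row-major) order. The snippet is a minimal illustration for $D=2$ with hypothetical helper names; the formal definitions are in Section 1 of Supplement A.

```python
import numpy as np

# For D = 2 and resolution (M1, M2), an M1 x M2 x M1 x M2 covariance tensor
# has square matricization of size M x M with M = M1 * M2, and an M1 x M2
# tensor vectorizes to length M. Plain (row-major) reshapes implement both.
M1, M2 = 4, 5
M = M1 * M2

def vec(T):                      # vectorization of an M1 x M2 tensor
    return T.reshape(M)

def sq(C):                       # square matricization of an M1 x M2 x M1 x M2 tensor
    return C.reshape(M, M)

rng = np.random.default_rng(0)
V = rng.standard_normal((M1, M2))
C = np.multiply.outer(V, V)      # rank-one covariance tensor V ⊗ V
assert np.allclose(sq(C), np.outer(vec(V), vec(V)))   # sq(V ⊗ V) = vec(V) vec(V)^T
```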

2.2.1 Identification of $(\mathcal{G},\mathcal{B})$ from $\mathcal{C}$

The rotation problem discussed in Section 2.1 extends to this finite-resolution framework. That is, given $\mathcal{G}$, we can only identify the loading tensors $\mathcal{L}_{k}$ up to a rotation. Yet we may still hope to identify $(\mathcal{G},\mathcal{B})$ from $\mathcal{C}$. This, however, is not obviously possible as analyticity, which was used to establish uniqueness in the completely-observed setting, is a property of functions, not of tensors. Descary and Panaretos (2019) showed that when $D=1$ this covariance decomposition is indeed unique at finite resolution and that analyticity plays a central role. In Theorem 4 from Section 3.2 of Supplement A, we extend their result to general $D$, and show that the central identification condition becomes less restrictive as $D$ grows:

$$K\leq K^{*}=\prod_{d=1}^{D}\left\lfloor\left(\tfrac{1}{2}-\delta_{d}\right)M_{d}-1\right\rfloor. \quad (2)$$

Condition (2) reveals that at finite resolution, identification depends crucially on the relationship between the bandwidth $\underline{\delta}$, the resolution $\underline{M}$, and the rank $K$ of the smooth operator. From a factor analytic perspective, for fixed $\underline{\delta}$ and $\underline{M}$, this condition gives the maximum number of identifiable factors $K^{*}$ (up to a rotation). Figure 1 shows that these parameter constraints are not very restrictive. Of particular note is the order-$D$ polynomial growth of the maximal rank in $M$. This means that in fMRI, where $M$ is typically large and $D>1$, the functional $K$-factor model will be identifiable up to a rotation for any practically reasonable $K$.
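Condition (2) is easy to evaluate in practice. The snippet below computes $K^{*}$ for the AOMIC resolution with a bandwidth of 0.1 in every dimension; the function name is illustrative.

```python
from math import floor, prod

def max_rank(M, delta):
    """Maximal identifiable rank K* from condition (2), given per-dimension
    resolutions M = (M_1, ..., M_D) and bandwidths delta = (delta_1, ..., delta_D)."""
    return prod(floor((0.5 - d) * m - 1) for m, d in zip(M, delta))

# Resolution of the AOMIC brain images with bandwidth 0.1 in every dimension
print(max_rank((65, 77, 60), (0.1, 0.1, 0.1)))   # a very large bound on K
```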

Figure 1: Plots of the maximal rank $K^{*}$ as a function of $M=M_{1}=\dots=M_{D}$ for different values of $D$ and $\delta$. Note that each plot has a different vertical axis.

3 Estimation

In this section, we formulate estimation in the discrete observation paradigm. The methodology is split into two phases: initial loading estimation and post-processing. In the first phase, we define an estimator for $\mathcal{G}$, from which we derive estimates for the number of factors $K$ and the loading tensors $\mathcal{L}_{k}$. To estimate the $\mathcal{L}_{k}$ through $\mathcal{G}$, we must assume that these $\mathcal{L}_{k}$ are orthogonal. This, of course, is merely a mathematical convenience; there is no reason to suspect that a collection of brain maps describing connectivity structures is orthogonal. We depart from this assumption during post-processing, wherein the initial loading estimates are brought into a more interpretable form via rotation and shrinkage. This section emphasizes practical implementation of our method; Section 3.3 of Supplement A develops theory for these estimators and provides derivations of their asymptotic properties.

3.1 Initial Loading Estimation

Recall that when the data are observed at resolution $\underline{M}$, the covariance decomposes as

$$\mathcal{C}=\underbrace{\sum_{k=1}^{K}\mathcal{L}_{k}\otimes\mathcal{L}_{k}}_{\mathcal{G}}+\mathcal{B}.$$

Two central goals of factor analysis are to estimate the number of factors $K$ and the loading tensors $\mathcal{L}_{k}$, $k=1,\dots,K$, from the empirical covariance $\hat{\mathcal{C}}$. Since loading tensors are only identifiable up to a rotation, we temporarily assume they are orthogonal so that the $k$th loading tensor may be identified with the $k$th scaled eigentensor of $\mathcal{G}$ (i.e., $\mathcal{L}_{k}=\lambda_{k}^{1/2}\mathcal{H}_{k}$, where $(\lambda_{k},\mathcal{H}_{k})$ is the $k$th eigen-pair of $\mathcal{G}$). In this section, we propose an estimator of $\mathcal{G}$ from which we derive estimators for $K$ and the $\mathcal{L}_{k}$.

Given a sample of discretely-observed functions summarized by the tensors $\mathcal{X}_{1},\dots,\mathcal{X}_{n}$, our goal is to estimate the smooth covariance component $\mathcal{G}$ from the empirical covariance $\hat{\mathcal{C}}$ defined by

$$\hat{\mathcal{C}}=\frac{1}{n-1}\sum_{i=1}^{n}(\mathcal{X}_{i}-\bar{\mathcal{X}})\otimes(\mathcal{X}_{i}-\bar{\mathcal{X}}), \quad (3)$$

where $\bar{\mathcal{X}}=n^{-1}\sum_{i=1}^{n}\mathcal{X}_{i}$. We do so by finding a smooth low-rank covariance tensor that is a good approximation to $\hat{\mathcal{C}}$ off the band:

$$\hat{\mathcal{G}}=\arg\min_{\theta\in\Theta_{\underline{M}}^{*}}\left\{\big\lVert\mathcal{A}\circ(\hat{\mathcal{C}}-\theta)\big\rVert_{F}^{2}+\alpha\mathscr{P}(\theta)+\tau\,\text{rank}(\theta)\right\}. \quad (4)$$

In the estimator of (4), $\Theta_{\underline{M}}^{*}$ is the space of $\mathfrak{M}\times\mathfrak{M}$ covariance tensors satisfying constraints specified in Section 3.3 of Supplement A, $\mathcal{A}$ is the $\mathfrak{M}\times\mathfrak{M}$ “band-deleting” tensor defined by $\mathcal{A}^{\underline{M}}(\underline{i},\underline{j})=\mathds{1}\{|i_{d}-j_{d}|>\lceil M_{d}/4\rceil \text{ for } d=1,\dots,D\}$, $\mathscr{P}\colon\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}}\to\mathbb{R}$ is some roughness-penalizing functional, $\alpha>0$ is a roughness-penalizing parameter, and $\tau>0$ is a rank-penalizing parameter. Using $\hat{\mathcal{G}}$, we define estimators for $K$ and the $\mathcal{L}_{k}$,

$$\hat{K}=\text{rank}(\hat{\mathcal{G}})\quad\text{and}\quad\hat{\mathcal{L}}_{k}=\hat{\lambda}_{k}^{1/2}\hat{\mathcal{H}}_{k}, \quad (5)$$

$k=1,\dots,\hat{K}$, where $\hat{\lambda}_{k}$ and $\hat{\mathcal{H}}_{k}$ are the $k$th eigenvalue and eigentensor of $\hat{\mathcal{G}}$, respectively.

To describe how one performs the optimization in (4), let us first assume we have already chosen a rank parameter $\tau^{*}$ and a smoothing parameter $\alpha^{*}$. Define the functional $f\colon\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}}\to\mathbb{R}$ by $f(\theta)=\big\lVert\mathcal{A}\circ(\hat{\mathcal{C}}-\theta)\big\rVert_{F}^{2}+\alpha^{*}\mathscr{P}(\theta)$. We then estimate $\mathcal{G}$ as follows:

(a) Solve the optimization problem

$$\min_{0\preceq\theta\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}}}f(\theta)\quad\text{subject to}\quad\text{rank}(\theta)\leq j,$$

for $j\in\{1,\dots,K^{*}\}$, obtaining minimizers $\hat{\theta}_{1},\dots,\hat{\theta}_{K^{*}}$.

(b) Compute the quantities $\{f(\hat{\theta}_{j})+\tau^{*}j\colon j=1,\dots,K^{*}\}$ and determine the $j$ furnishing the minimum quantity. Declare the corresponding $\hat{\theta}_{j}$ to be the estimator $\hat{\mathcal{G}}$.

Now consider selection of the penalty parameters, $\alpha$ and $\tau$ (see Table 1 for a summary of information related to these parameters and others). Choice of the smoothing parameter $\alpha$ depends on the roughness penalty $\mathscr{P}$. Although many choices for $\mathscr{P}$ are possible, we use one that promotes smoothness via the eigentensors. In particular, we define $\mathscr{P}(\theta)=M^{-1}\sum_{k=1}^{K}\tilde{\mathscr{P}}(\mathcal{E}_{k}(\theta))$, where $\mathcal{E}_{k}\colon\Theta_{\underline{M}}\to\mathbb{R}^{\mathfrak{M}}$ maps a tensor to its $k$th scaled eigentensor, and $\tilde{\mathscr{P}}\colon\mathbb{R}^{\mathfrak{M}}\to\mathbb{R}$ is a roughness-penalizing functional defined by $\tilde{\mathscr{P}}(\mathcal{V})=\mathcal{V}_{\text{vec}}^{T}\mathcal{R}_{\text{sq}}\mathcal{V}_{\text{vec}}$, where $\mathcal{R}\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}}$ is a generalization of the second difference matrix given by $\mathcal{R}_{\underline{m}_{1},\underline{m}_{2}}=3^{D}-1$ when $\lVert\underline{m}_{1}-\underline{m}_{2}\rVert_{\infty}=0$, $\mathcal{R}_{\underline{m}_{1},\underline{m}_{2}}=-1$ when $\lVert\underline{m}_{1}-\underline{m}_{2}\rVert_{\infty}=1$, and $\mathcal{R}_{\underline{m}_{1},\underline{m}_{2}}=0$ otherwise. Here, $\mathcal{V}_{\text{vec}}$ denotes the vectorization of $\mathcal{V}$ and $\mathcal{R}_{\text{sq}}$ is the square matricization of $\mathcal{R}$ (see Section 1 of Supplement A for formal definitions of these tensor reshapings). As in smoothed functional principal component analysis, it may be satisfactory (or even preferable) to make the choice of $\alpha$ subjectively based on visual inspection (Ramsay and Silverman, 2005, Section 9.3.3). For a discussion on automatic selection via cross-validation, see Section 3 of Supplement A.
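As an illustration of this penalty, the sketch below constructs the square matricization $\mathcal{R}_{\text{sq}}$ for a small $D=2$ grid and evaluates $\tilde{\mathscr{P}}$ on a smooth and a rough field; the grid size and test fields are arbitrary choices made for demonstration.

```python
import numpy as np
from itertools import product

def penalty_matrix(M1, M2):
    """Square matricization R_sq of the generalized second-difference tensor R
    for D = 2: diagonal entries 3^D - 1, -1 for sup-norm neighbors, 0 otherwise."""
    grid = list(product(range(M1), range(M2)))
    M = len(grid)
    R = np.zeros((M, M))
    for a, (i1, j1) in enumerate(grid):
        for b, (i2, j2) in enumerate(grid):
            d = max(abs(i1 - i2), abs(j1 - j2))
            if d == 0:
                R[a, b] = 3**2 - 1
            elif d == 1:
                R[a, b] = -1.0
    return R

M1 = M2 = 10
R_sq = penalty_matrix(M1, M2)

v_smooth = np.outer(np.sin(np.linspace(0, np.pi, M1)), np.sin(np.linspace(0, np.pi, M2))).ravel()
v_rough = np.random.default_rng(0).standard_normal(M1 * M2)
print(v_smooth @ R_sq @ v_smooth, v_rough @ R_sq @ v_rough)  # rough fields incur a larger penalty
```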

In practice, we do not explicitly choose the rank parameter $\tau$ invoked in Step (b). Since each $\tau$ implies a choice of rank $j_{\tau}$, and thus some $f(\hat{\theta}_{j_{\tau}})$, we use the nonincreasing function $j\mapsto f(\hat{\theta}_{j})$ to implicitly choose $\tau$. Specifically, we identify the value of $j$ to be the smallest one such that $f(\hat{\theta}_{j})<c$ for some threshold $c$, then set $\hat{\mathcal{G}}$ to the corresponding minimizer (see Figure 2). We would like the threshold $c$ to be small enough that $f$ evaluated at the estimator is low, but not so small that the estimator has large rank. We thus select the rank of the estimator by identifying the elbow of the scree-type plot generated by the function $j\mapsto f(\hat{\theta}_{j})$.
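A minimal sketch of this implicit selection rule, assuming the objective values $f(\hat{\theta}_{j})$ have already been computed for $j=1,\dots,K^{*}$ (the values below are hypothetical):

```python
import numpy as np

def select_rank(objective_values, c):
    """Given the non-increasing sequence f(theta_hat_j), j = 1, ..., K*,
    return the smallest j with f(theta_hat_j) < c (the implicit choice of tau)."""
    f = np.asarray(objective_values)
    below = np.flatnonzero(f < c)
    return int(below[0]) + 1 if below.size else len(f)

# Hypothetical objective values with an elbow at j = 3
f_vals = [10.2, 4.1, 1.3, 1.2, 1.15, 1.14]
print(select_rank(f_vals, c=1.5))   # -> 3
```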

(a) Small $\tau$
(b) Large $\tau$
Figure 2: An illustration of the scree plot approach for selecting the number of factors using (a) a small rank penalty $\tau_{1}$ and (b) a large rank penalty $\tau_{2}$. Blue dots represent the non-increasing function $j\mapsto f(\hat{\theta}_{j})$, orange squares represent the rank penalty $\tau j$, and blue triangles represent their sum (the function to be minimized). The number $c_{\tau}$ is one example of a threshold furnishing the rank $j_{\tau}$.

As shown in Section 3.3 of Supplement A, we can theoretically estimate $\mathcal{G}$ using any $\underline{\delta}$ for which each $\delta_{d}$ is less than 0.25. The choice of $\underline{\delta}$ thus depends on how much small-scale variation one wants to admit into the global component. In this sense, $\underline{\delta}$ does not need to be tuned; rather, one needs to consider the purpose of the application in its selection. For instance, in the analysis of Section 6, we set $\delta_{d}=0.1$ for each $d$, which demotes variation isolated to a radius of a few voxels to the local component. However, the optimization described below may be unstable when components of $\underline{\delta}$ are near 0.25. For applications requiring wide bands, one may want to test a grid of bandwidths that increments to the desired bandwidth. If estimates change abruptly before reaching the final bandwidth, then algorithmic instability may be to blame.

Parameter | Influence | Selection
$\underline{\delta}$ | increasing $\delta_{d}$ removes rough variation from $\hat{\mathcal{G}}$ | application-informed, testing for stability
$\alpha$ | increasing $\alpha$ yields smoother $\hat{\mathcal{L}}_{k}$ | subjective selection or cross-validation
$\tau$ | increasing $\tau$ lowers $\hat{K}$ | implicitly chosen via scree-type plots
$\underline{\kappa}$ | increasing $\kappa_{k}$ produces sparser $\hat{\mathcal{L}}_{k}$ | cross-validation
Table 1: Estimation parameters, along with their influences on the estimators and selection procedures.

To solve the optimization problems in (a), we resort to approximate methods. In doing so, we note that any symmetric non-negative definite tensor $\theta\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}}$ with rank $j$ has square matricization $\theta_{\text{sq}}\in\mathbb{R}^{M\times M}$, $M=M_{1}\cdots M_{D}$, that may be written as $\boldsymbol{V}\boldsymbol{V}^{T}$, where $\boldsymbol{V}\in\mathbb{R}^{M\times j}$. Note also that the columns of $\boldsymbol{V}$ are equal to the vectorized scaled eigentensors of $\theta$ (up to a rotation), allowing us to impose smoothness on these eigentensors through the columns of $\boldsymbol{V}$. The optimization problems in (a) thus reduce to the factorized matrix completion problems

$$\min_{\boldsymbol{V}\in\mathbb{R}^{M\times j}}\big\lVert\mathcal{A}_{\text{sq}}\circ\big((\hat{\mathcal{C}})_{\text{sq}}-\boldsymbol{V}\boldsymbol{V}^{T}\big)\big\rVert_{F}^{2}+\alpha^{*}\mathscr{P}_{\text{sq}}(\boldsymbol{V}), \quad (6)$$

for $j=1,\dots,K^{*}$, where $\mathscr{P}_{\text{sq}}(\boldsymbol{V})=\mathscr{P}(\theta)$ when $\theta=\boldsymbol{V}\boldsymbol{V}^{T}$. For our choice of $\mathscr{P}$, we have $\mathscr{P}_{\text{sq}}(\boldsymbol{V})=\langle\boldsymbol{V},\mathcal{R}_{\text{sq}}\boldsymbol{V}\rangle_{F}$, where $\langle\cdot,\cdot\rangle_{F}$ denotes the Frobenius inner product. As these problems are not convex in $\boldsymbol{V}$, no gradient descent-type algorithm is guaranteed to converge to the problems’ infinitely many global optima. In search of “good” local optima, we use MATLAB’s fminunc function, which implements the Broyden–Fletcher–Goldfarb–Shanno (BFGS; Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) algorithm or, for larger problems, the limited-memory BFGS (L-BFGS; Liu and Nocedal, 1989) algorithm. Indeed, when given a reasonable initialization, the simulations of Section 5 show that these routines are stable and quickly converge to suitable local optima. We use the initial value $\boldsymbol{V}_{0}=\boldsymbol{U}_{j}\boldsymbol{\Sigma}_{j}^{1/2}$, where $\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{U}^{T}$ is the singular value decomposition of $(\hat{\mathcal{C}})_{\text{sq}}$, $\boldsymbol{U}_{j}$ is the $M\times j$ matrix obtained by keeping the first $j$ columns of $\boldsymbol{U}$, and $\boldsymbol{\Sigma}_{j}$ is the $j\times j$ matrix obtained by keeping the first $j$ rows and columns of $\boldsymbol{\Sigma}$.
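A compact Python sketch of problem (6) for a single candidate rank $j$ is given below, using SciPy's L-BFGS-B routine in place of MATLAB's fminunc. The band-deleting mask, penalty matrix, and SVD-based initialization follow the description above; all function and variable names are illustrative, and one would call this routine for each $j=1,\dots,K^{*}$ before applying the rank-selection rule of Figure 2.

```python
import numpy as np
from scipy.optimize import minimize

def fit_rank_j(C_hat_sq, A_sq, R_sq, j, alpha):
    """Minimize ||A ∘ (C_hat - V V^T)||_F^2 + alpha * <V, R_sq V>_F over V (M x j),
    a sketch of the factorized completion problem (6) solved with L-BFGS."""
    M = C_hat_sq.shape[0]

    def objective(v):
        V = v.reshape(M, j)
        resid = A_sq * (C_hat_sq - V @ V.T)         # band-deleted residual
        fit = np.sum(resid**2)
        pen = np.sum(V * (R_sq @ V))                # Frobenius inner product <V, R_sq V>
        grad = (-4 * (resid @ V) + 2 * alpha * (R_sq @ V)).ravel()
        return fit + alpha * pen, grad

    # SVD-based initialization V0 = U_j * Sigma_j^{1/2}
    U, s, _ = np.linalg.svd(C_hat_sq, hermitian=True)
    V0 = U[:, :j] * np.sqrt(s[:j])

    res = minimize(objective, V0.ravel(), jac=True, method="L-BFGS-B")
    V = res.x.reshape(M, j)
    return V @ V.T, V                               # rank-j estimate of G_sq and its factors
```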

3.2 Post-Processing: Rotation and Shrinkage

In Section 3.1, we assumed the loading tensors $\mathcal{L}_{k}$ were equal to the scaled eigentensors of $\mathcal{G}$. Although mathematically convenient, we did not impose this constraint in the original model as there is no practical justification for orthogonal loadings. Unfortunately, without such an assumption, it is not possible to identify the $\mathcal{L}_{k}$ from $\mathcal{C}$. We may still, however, hope to understand the latent factors that drive variation captured in the identifiable global component $\mathcal{G}$. For instance, interesting neurological phenomena, like visual processing, may contribute variation to the global component, and an effective factor analytic approach should help us discover these latent processes. However, such insights can be difficult to glean via the scaled eigentensors, which are often dense with many high-magnitude weights. In our two-step post-processing procedure, we transform the illegible eigentensors into loadings possessing simple structure. The resulting parsimony greatly facilitates interpretation of the underlying factors.

In the first post-processing step, we exploit the model’s rotational indeterminacy by rotating the scaled eigentensors of $\hat{\mathcal{G}}$ to a maximally-simple structure. We do so using varimax rotation (Kaiser, 1958), which solves the optimization problem

$$\hat{\mathfrak{L}}^{*}:=\underset{\mathcal{M}_{\boldsymbol{R}}\hat{\mathfrak{L}}\,\colon\,\boldsymbol{R}\in\text{SO}(K)}{\text{arg max}}\ \sum_{k=1}^{K}\Big\lVert\big(\mathcal{M}_{\boldsymbol{R}}\hat{\mathfrak{L}}\big)_{\cdot,k}^{2}\Big\rVert_{F}^{2}-\Big\lVert\big(\mathcal{M}_{\boldsymbol{R}}\hat{\mathfrak{L}}\big)_{\cdot,k}\Big\rVert_{F}^{4},$$

where $\hat{\mathfrak{L}}\in\mathbb{R}^{\mathfrak{M}}\times\mathbb{N}_{K}$ is defined by $\hat{\mathfrak{L}}_{\underline{m},k}=(\hat{\mathcal{L}}_{k})_{\underline{m}}$, $\mathcal{M}_{\boldsymbol{R}}\colon\mathbb{R}^{\mathfrak{M}}\times\mathbb{N}_{K}\to\mathbb{R}^{\mathfrak{M}}\times\mathbb{N}_{K}$ is an operator given by $(\mathcal{M}_{\boldsymbol{R}}\mathfrak{L})_{\underline{m},k}=(\boldsymbol{R}\mathfrak{L}_{\underline{m},\cdot})_{k}$, and $\text{SO}(K)$ is the set of $K\times K$ rotation matrices.

Although the rotated loading estimate $\hat{\mathfrak{L}}^{*}$ may be sufficiently interpretable, it sometimes helps to shrink near-zero regions to zero. The second post-processing step accomplishes this systematically by adaptively soft-thresholding $\hat{\mathfrak{L}}^{*}$,

$$\tilde{\mathfrak{L}}_{\underline{m},k}:=\text{sgn}\big(\hat{\mathfrak{L}}^{*}_{\underline{m},k}\big)\max\left\{\big|\hat{\mathfrak{L}}^{*}_{\underline{m},k}\big|-\kappa_{k}\,w\big(\hat{\mathfrak{L}}^{*}_{\underline{m},k}\big),\,0\right\}, \quad (7)$$

where $w(\cdot)$ is a weight function with positive support (e.g., $w(x)=|x|^{-2}$), and the $\kappa_{k}>0$ are shrinkage parameters, which may be tuned using the cross-validation procedure described in Section 3 of Supplement A. Beyond enhancing interpretability, we show in Section 5.3 that shrinkage can also correct an overspecified model by zeroing out “extra” loading estimates.
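A minimal sketch of the thresholding step (7), assuming the rotated loadings are stored as an $M\times K$ array and using the example weight $w(x)=|x|^{-2}$:

```python
import numpy as np

def adaptive_soft_threshold(L_rot, kappa):
    """Adaptive soft-thresholding of the rotated loading array L_rot (M x K),
    with one shrinkage parameter kappa_k per column and weight w(x) = |x|^{-2},
    so small loadings are shrunk most aggressively."""
    with np.errstate(divide="ignore"):
        w = np.abs(L_rot) ** (-2.0)                 # adaptive weights
    shrunk = np.abs(L_rot) - np.asarray(kappa) * w  # kappa broadcast across columns
    return np.sign(L_rot) * np.maximum(shrunk, 0.0)

# Toy example: near-zero entries are zeroed out, large entries barely change
L_rot = np.array([[0.9, 0.02], [-0.03, 0.8], [0.5, -0.01]])
print(adaptive_soft_threshold(L_rot, kappa=[1e-3, 1e-3]))
```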

4 MELODIC Background

In Sections 5 and 6, we compare our approach to, among others, one based on ICA, the most popular whole-brain connectivity method. In doing so, we rely on the MELODIC function (Beckmann and Smith, 2004) of the FMRIB Software Library (FSL; Jenkinson et al., 2012), arguably ICA’s most widely used implementation for connectivity analyses. This section details the MELODIC model and its estimation.

FSL’s MELODIC function estimates the probabilistic ICA model which assumes that observations for each voxel are generated from a set of statistically independent non-Gaussian sources via a linear mixing process corrupted by additive Gaussian noise:

$$\underline{x}_{m}=\boldsymbol{A}\underline{s}_{m}+\underline{\mu}+\underline{\eta}_{m},\quad m=1,\dots,M.$$

Here, $\underline{x}_{m}$ denotes the $n$-dimensional vector of observations at voxel $m$, $\underline{s}_{m}$ denotes the $K$-dimensional vector of non-Gaussian sources, $\boldsymbol{A}$ denotes the $(n\times K)$-dimensional mixing matrix, $\underline{\mu}$ denotes the mean of the observations $\underline{x}_{m}$, and $\underline{\eta}_{m}$ denotes Gaussian noise. The model assumes there are fewer sources than observations (i.e., $K<n$), and that the noise covariance is voxel-dependent (i.e., $\underline{\eta}_{m}\sim N(\underline{0},\sigma^{2}\boldsymbol{\Sigma}_{m})$).

The goal of estimation is to find the $(K\times n)$-dimensional unmixing matrix $\boldsymbol{W}$ such that $\hat{\underline{s}}_{m}=\boldsymbol{W}\underline{x}_{m}$ is a good approximation to $\underline{s}_{m}$ for each voxel. Prior to invoking MELODIC, it is common to spatially smooth the data. We do so using the Gaussian filter of FSL’s FSLMATHS utility (Jenkinson et al., 2012), tuning the kernel width via cross-validation. Before estimation, MELODIC prepares the multi-session data by (i) temporally concatenating the preprocessed slices to form observation vectors $\underline{x}_{m}$ for each voxel, (ii) temporally whitening each $\underline{x}_{m}$ so that the noise covariance is isotropic at each voxel location, then (iii) normalizing each $\underline{x}_{m}$ to zero mean and unit variance. By default, the utility also reduces the dimension of the concatenated data via MELODIC Incremental Group PCA (MIGP; Smith et al., 2014); however, we switch off this feature as Sections 5 and 6 consider datasets of sufficiently reduced sizes. After this preprocessing, one can show that the maximum likelihood (ML) mixing matrix estimate is

$$\hat{\boldsymbol{A}}=\boldsymbol{U}_{K}(\boldsymbol{\Lambda}_{K}-\sigma^{2}\boldsymbol{I}_{K})^{1/2}\boldsymbol{Q}^{T},$$

where $\boldsymbol{U}_{K}$ and $\boldsymbol{\Lambda}_{K}$ contain the first $K$ left singular vectors and singular values of $\boldsymbol{X}$, the $n\times M$ matrix of preprocessed data, and $\boldsymbol{Q}$ is some $K\times K$ rotation matrix. From $\hat{\boldsymbol{A}}$, the ML source estimates are obtained via generalized least squares,

$$\hat{\underline{s}}_{m}=(\hat{\boldsymbol{A}}^{T}\hat{\boldsymbol{A}})^{-1}\hat{\boldsymbol{A}}^{T}\tilde{\underline{x}}_{m},$$

where $\tilde{\underline{x}}_{m}=(\boldsymbol{\Lambda}_{K}-\sigma^{2}\boldsymbol{I}_{K})^{-1/2}\boldsymbol{U}_{K}^{T}\underline{x}_{m}$. Thus, to fix $\hat{\boldsymbol{A}}$, one chooses the $\boldsymbol{Q}$ that maximizes the non-Gaussianity of the source estimates $\hat{\underline{s}}_{m}$.

For a random variable $s$, one measure of non-Gaussianity is negentropy,

$$J(s)=H(s_{\text{gauss}})-H(s),$$

where $H(y)=-\mathbb{E}_{y}[\log y]$ is the entropy of a random variable $y$ and $s_{\text{gauss}}$ is a Gaussian having the same variance (or covariance if $s$ is a random vector) as $s$. Since a Gaussian random variable has the largest entropy among all random variables of equal variance, negentropy is a non-negative function that equals zero if and only if $s$ is Gaussian. As estimation of negentropy is difficult, it is often approximated using the contrast function

$$J(s)\approx\big(\mathbb{E}[G(s)]-\mathbb{E}[G(\nu)]\big)^{2},$$

where $\nu$ is a standard normal and $G$ is a general non-quadratic function. To estimate the $k$th source $s_{mk}$ of the $m$th voxel, MELODIC uses a fixed-point iteration scheme (Hyvärinen and Oja, 2000) to choose the $k$th row $\underline{q}_{k}^{T}$ of $\boldsymbol{Q}$ so that $J(\hat{s}_{mk})=J(\underline{q}_{k}^{T}\tilde{\underline{x}}_{m})$ is maximized for some domain-specific choice of $G$. By default, MELODIC sets $G$ equal to the pow3 function. To estimate $K$ sources, the procedure is repeated $K$ times under the constraint that the vectors $\underline{q}_{k}$ are mutually orthogonal.
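To illustrate the contrast-function approximation, the sketch below evaluates it with the kurtosis-type nonlinearity $G(u)=u^{4}/4$ (the contrast commonly associated with the pow3 choice) on standardized Gaussian, uniform, and Laplace samples; only the Gaussian sample yields a value near zero. The Monte Carlo setup is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def negentropy_approx(s, G=lambda u: u**4 / 4.0, n_mc=200_000):
    """Contrast-function approximation J(s) ~ (E[G(s)] - E[G(nu)])^2 for a
    standardized sample s, with nu standard normal."""
    s = (s - s.mean()) / s.std()
    nu = rng.standard_normal(n_mc)
    return (G(s).mean() - G(nu).mean()) ** 2

n = 100_000
print(negentropy_approx(rng.standard_normal(n)))        # near zero for Gaussian data
print(negentropy_approx(rng.uniform(-1, 1, n)))         # > 0 (sub-Gaussian)
print(negentropy_approx(rng.laplace(size=n)))           # > 0 (super-Gaussian)
```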

5 Simulation Studies

To assess the efficacy of our methodology, we conduct three simulation studies. In the first, we compare the accuracy of the post-processed estimator of (4) (denoted by FFA) to that of five other estimators. In the second, we assess the relative interpretability of these estimators. In the third, which may be found in Section 4 of Supplement A, we explore how the scree plot approach for selecting the number of factors behaves in different settings. We begin, in Section 5.1, by describing the data generating procedure for these studies.

5.1 Data Simulation

To simulate data from an FFM (when $D=2$), we first generate $n$ i.i.d. mean-zero functions $Y_{i}$ and $n$ i.i.d. mean-zero functions $\epsilon_{i}$ on an equally spaced $M\times M$ lattice in $[0,1]^{2}$, summarizing these discretely-observed functions with the order-2 tensors (i.e., matrices) $\mathcal{Y}_{i}$ and $\mathcal{E}_{i}$, respectively. We then sum these components to get the samples $\mathcal{X}_{i}=\mathcal{Y}_{i}+\mathcal{E}_{i}$, and compute the empirical covariance tensor, $\hat{\mathcal{C}}=n^{-1}\sum_{i=1}^{n}\mathcal{X}_{i}\otimes\mathcal{X}_{i}$. For simplicity, in all studies, we set $M=M_{1}=M_{2}=30$ and $\delta=\delta_{1}=\delta_{2}$. Though these simulated “brain slices” are lower resolution than those of the AOMIC data, whose axial slices have resolution $\underline{M}=(65,77)$, this simplification serves only to reduce compute time and has little impact on estimator performance.

To simulate the functions $Y_{i}$, we set $Y_{i}(\underline{s})=\sum_{k=1}^{K}f_{i,k}c_{k}v_{k}(\underline{s})$, where the $v_{k}$ are (possibly non-orthogonal) functions scaled to have unit norm, the $c_{k}$ are positive constants, and the factors $f_{i,k}$ are drawn i.i.d. from $N(0,1)$. We decompose the $k$th loading function into its standardization $v_{k}$ and a scaling parameter $c_{k}$ to control the difficulty of the estimation problem. Also note that the $(c_{k},v_{k})$ are not necessarily equal to the eigen-pairs $(\lambda_{k},\eta_{k})$ of $\mathscr{G}$, as we do not require the $v_{k}$ to be orthogonal. Across the three studies, we make use of three loading schemes, depicted in Figure 3. In the first loading scheme (denoted by BL1), each $v_{k}$ is defined piecewise with a bump function in two regions of its domain, and zero elsewhere. Each $v_{k}$ in the second loading scheme (denoted by NET) is also defined with piecewise bump functions, but these functions are arranged in the types of complex patterns one might expect to observe in a brain network. In the third loading scheme (denoted by BL2), each $v_{k}$ is defined by three bump functions, one of which is wider than the other two.

(a) BL1
(b) NET
(c) BL2
Figure 3: The three loading schemes used across studies: (a) each loading contains a bump function in two regions of its domain; (b) each loading contains bump functions arranged in patterns more complex and varied than those of the BL1 scheme; (c) each loading contains a bump function in three regions of its domain.

To generate the functions $\epsilon_{i}$, we set $\epsilon_{i}(\underline{s})=\sum_{j=1}^{J}a_{i,j}d_{j}e_{j}(\underline{s})$, where the $e_{j}$ are (again, possibly non-orthogonal) functions with support on $[(p-1)\delta,p\delta]\times[(q-1)\delta,q\delta]\subset[0,1]^{2}$, the $d_{j}$ are positive constants, the $a_{i,j}$ are drawn i.i.d. from $N(0,1)$, and $J\gg K$ is large. As in the loading scheme framework, the $(d_{j},e_{j})$ may not be equal to the eigen-pairs $(\beta_{j},\psi_{j})$ of $\mathscr{B}$ since the former need not be orthogonal. We consider two error schemes. Each defines $J$ functions $e_{j}$ with possibly overlapping supports on an equally-spaced grid of $[0,1]^{2}$. In the first (denoted by BE), each $e_{j}$ is a two-dimensional bump function. In the second (denoted by TRI), each $e_{j}$ is a two-dimensional triangle function.
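A minimal sketch of this generating procedure for $D=2$ is given below. The Gaussian bump loadings and locally supported error bumps are stand-ins for the BL1/NET/BL2 and BE/TRI schemes (which use compactly supported bumps and triangle functions), and the constants are arbitrary; the code is meant only to make the construction concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, n, delta = 30, 2, 500, 0.1
s = np.linspace(0, 1, M)
S1, S2 = np.meshgrid(s, s, indexing="ij")

def bump(cx, cy, width):
    """A smooth two-dimensional bump centered at (cx, cy)."""
    return np.exp(-((S1 - cx) ** 2 + (S2 - cy) ** 2) / (2 * width**2))

# Unit-norm loading surfaces v_k (two bumps each) and scales c_k
V = [bump(0.2, 0.2, 0.08) + bump(0.8, 0.7, 0.08), bump(0.3, 0.8, 0.08) + bump(0.7, 0.2, 0.08)]
V = [v / np.linalg.norm(v) for v in V]
c = np.array([2.5, 2.2])

# Locally supported error basis: one bump per delta x delta cell
# (Gaussian bumps are only approximately confined to their cells)
centers = np.arange(delta / 2, 1, delta)
E_basis = [bump(cx, cy, delta / 4) for cx in centers for cy in centers]
d = rng.uniform(0.1, 1, size=len(E_basis))

X = np.empty((n, M, M))
for i in range(n):
    f = rng.standard_normal(K)                      # factors
    a = rng.standard_normal(len(E_basis))           # local error scores
    Y = sum(f[k] * c[k] * V[k] for k in range(K))   # global component
    eps = sum(a[j] * d[j] * E_basis[j] for j in range(len(E_basis)))
    X[i] = Y + eps

C_hat = np.einsum("iab,icd->abcd", X, X) / n        # empirical covariance tensor
print(C_hat.shape)                                  # (30, 30, 30, 30)
```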

5.2 Study 1: Accuracy Comparison

The aim of the first study is to compare the accuracy of the FFA estimator to that of five alternative estimators:

1. Independent component analysis without smoothing (denoted by ICA): we estimate $\mathcal{G}$ by extracting from the data an $M\times M\times K$ tensor $\hat{\mathcal{S}}$ of independent components via MELODIC, then computing $\sum_{k}\hat{\mathcal{S}}_{k}\otimes\hat{\mathcal{S}}_{k}$, where the $M\times M$ matrix $\hat{\mathcal{S}}_{k}$ is the $k$th estimated independent component. We skip variance normalization since the simulated data do not reproduce phenomena like scanner and physiological noise that lead to different levels of variability across different voxels. Omission of this step also ensures that ICA estimates are on the same scale as those derived from other estimators.

2. Independent component analysis with smoothing (denoted by ICAS): we estimate $\mathcal{G}$ using the same procedure as ICA but smooth the data with a Gaussian filter as described in Section 4.

3. Principal component analysis (denoted by PCA): we estimate $\mathcal{G}$ with a truncated eigen-decomposition of $\hat{\mathcal{C}}$. Note that this approach corresponds with principal component estimation in multivariate factor analysis (Johnson and Wichern, 2002, Section 9.3).

4. Matrix completion approach, presented by Descary and Panaretos (2019) (denoted by DP): we compute the estimator of (4) fixing $\alpha=0$.

5. Matrix completion approach with smoothing (denoted by DPS): we compute the estimator of (4) using $\alpha\geq 0$, but omit post-processing.

We study these estimators in four scenarios, which are defined by the Cartesian product of two loading schemes (BL1 and NET) and the two error schemes: S1 for BL1 and BE, S2 for BL1 and TRI, S3 for NET and BE, and S4 for NET and TRI. Within each scenario, we consider 24 configurations, each of which is characterized by a sample size n{250,500,1000}n\in\{250,500,1000\}, a number of factors K{2,4}K\in\{2,4\}, a bandwidth δ{0.05,0.1}\delta\in\{0.05,0.1\}, and a “regime” (R1 or R2). In both regimes, the djd_{j} are distributed uniformly in [0.1,1][0.1,1], while the ckc_{k} are distributed uniformly in [2,3][2,3] and [0.8,1.8][0.8,1.8] for R1 and R2, respectively. Consequently, the lower-signal R2 regime poses a more difficult estimation problem than the R1 regime. Note that in the multi-subject AOMIC analysis of Section 6, the sample size is the number of time points per subject multiplied by the number of subjects (480×210=100,800480\times 210=100,800). Accordingly, this simulation study evaluates our model in settings that are likely lower-signal than that of our application.

Per Section 3, FFA estimation – including initial loading estimation and post-processing – involves two levels of tuning. We consider a smoothing parameter α\alpha and just one shrinkage parameter κ\kappa for all loading functions to lighten computation. Instead of conducting the expensive cross-validation procedures of Section 3 in Supplement A, we simply split the data into a training set (80%) and a test set (20%), then choose the α\alpha (similarly, κ\kappa) that minimizes the out-of-sample prediction error for the band-deleted empirical covariance. Throughout the study, we set δ=0.1\delta=0.1 for estimation procedures.
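The train/test tuning rule can be sketched as follows; fit_global_cov is a hypothetical stand-in for the loading-estimation step at a given smoothing level, mask denotes the logical band-deleting indicator on the matricized scale, and the shrinkage parameter κ is tuned analogously.

```r
# Sketch of the 80/20 train/test rule for choosing alpha (illustrative names).
tune_alpha <- function(X, alpha_grid, mask, fit_global_cov) {
  n <- nrow(X)
  train <- sample(n, size = floor(0.8 * n))
  # test-set empirical covariance of the (centered) matricized observations
  C_test <- crossprod(scale(X[-train, ], scale = FALSE)) / (n - length(train))
  errs <- sapply(alpha_grid, function(a) {
    G_hat <- fit_global_cov(X[train, ], alpha = a)   # hypothetical fitting step
    sum(((C_test - G_hat)[mask])^2)                  # band-deleted prediction error
  })
  alpha_grid[which.min(errs)]
}
```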

For 100 repetitions of each configuration, we compute the six estimators – FFA, DPS, DP, PCA, ICA, and ICAS – then calculate normalized errors by evaluating E(x)=𝒢xF/𝒢FE(x)=\|\mathcal{G}-x\|_{F}/\|\mathcal{G}\|_{F} at each estimator. To assess the comparative accuracy of the FFA estimator, we form five relative errors by dividing the normalized errors for the DPS, DP, PCA, ICA, and ICAS estimators by that for the FFA estimator. Figures 4 and 5 display results for R1 and R2, respectively.
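For reference, the error metrics reduce to a few lines of R; G_true, G_ffa, and G_alt below are placeholders for the simulated truth and the fitted estimators, stored as arrays of matching dimension.

```r
# Normalized error E(x) and the relative errors used in Figures 4 and 5.
frob <- function(x) sqrt(sum(x^2))                       # Frobenius norm of an array
E <- function(G_true, G_est) frob(G_true - G_est) / frob(G_true)

# relative error of an alternative estimator, with the FFA estimator as baseline:
# E(G_true, G_alt) / E(G_true, G_ffa)
```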

FFA is on par with or superior to the alternatives in nearly every configuration. DPS, however, consistently competes with FFA, indicating that post-processing leads to only modest accuracy gains (Section 5.3 considers the more noteworthy interpretability gains afforded by post-processing). The smoothing of FFA and DPS provides helpful de-noising in BL1 configurations – particularly in the low-signal regime and/or when nn is small – but is less effective in NET configurations. In the latter, over-smoothing occasionally forces the optimization procedure into bad local minima. This is likely because application of a single smoothing parameter to each loading tensor is more appropriate in the BL1 scheme, which has uniformly smooth loading tensors, than in the NET scheme, whose first and second loadings are spikier than its third and fourth loadings.

DP, PCA, ICA, and ICAS typically trail FFA and DPS. It is not surprising that the performances of ICA and PCA are in near lockstep, given that MELODIC projects the temporally whitened data onto the space spanned by the KK leading eigenvectors of 𝒞^\hat{\mathcal{C}}, and that MELODIC’s temporal whitening has little effect on these i.i.d. data. The data smoothing of ICAS consistently improves upon ICA, most noticeably in regime R2. In this regime, ICAS performance even matches that of FFA and DPS in δ=0.05\delta=0.05 configurations, though it falls short in δ=0.1\delta=0.1 configurations where data smoothing smears more local variation into the global component. Contamination of the global component brought about by data smoothing is so pronounced that even DP, an estimator that does not explicitly encourage smooth loadings, outpaces ICAS in some of these δ=0.1\delta=0.1 configurations.

Figure 4: Mean relative errors (with ±2\pm 2-standard-deviation error bars) for all Regime 1 configurations using the FFA estimator as baseline.
Figure 5: Mean relative errors (with ±2\pm 2-standard-deviation error bars) for all Regime 2 configurations using the FFA estimator as baseline.

5.3 Study 2: Interpretability Comparison

The aim of the second study is to compare the interpretability of FFA to that of DPS and ICAS, the two most competitive estimators from Section 5.2. To do so, we focus on output from a single repetition of a configuration with loading scheme BL2, error scheme BE, regime R2, nn set to 1000, KK set to 8, and δ\delta set to 0.1. Consideration of only one repetition alleviates computational strains, allowing us to tune component-specific shrinkage parameters for the FFA estimator.

In the first part of this study, we correctly specify model order by estimating the three models with 8 components (see Figure 6). DPS fails to capture the simple structure of the true loadings. ICAS, on the other hand, largely accomplishes this, identifying 6 of the true loadings. However, it lacks the sparsity of FFA, which clearly captures 7 to 8 of the true loadings.

(a) FFA
(b) ICAS
(c) DPS
Figure 6: Estimates for order-8 FFA, DPS, and ICAS models when loading scheme is BL2, error scheme is BE, regime is R2, n=1000n=1000, K=8K=8, and δ=0.1\delta=0.1.

However, the most noteworthy difference between FFA and ICAS is that ICAS results are less robust to dimension overspecification. That is, if we estimate two IC models, one with the true number of ICs and another with more, the latter will splinter structures found in single ICs of the former across multiple ICs. Such fragmentation does not occur in overspecified FFMs. To see this, consider output from order-25 FFA and ICAS models fit to these data (see Figures 7 and 8). The higher-order FFM preserves the 7 to 8 loadings of the lower-order FFM and even shrinks many of the extra loadings to zero, effectively correcting the overspecification. However, of the 6 true loadings captured in the 8-IC model, only two are preserved in the 25-IC model (k=1,2k=1,2 in Figure 8). Other structures in the 8-IC model are scattered across the remaining components of the higher-order model. This is because, in its search for maximally independent spatial maps, ICA fragments broad correlation structures across multiple maps (Friston, 1998). This splintering presents challenges to interpretation, particularly when the specified dimension far exceeds the true dimension. In contrast, varimax rotation in FFA does not fragment correlation structure, and shrinkage zeroes out many of the superfluous loadings. These properties make FFA an attractive approach when interpretation is the foremost goal and dimension overspecification is possible.

Figure 7: Estimates for order-25 FFA model when loading scheme is BL2, error scheme is BE, regime is R2, n=1000n=1000, K=8K=8, and δ=0.1\delta=0.1.
Figure 8: Estimates for order-25 ICA model when loading scheme is BL2, error scheme is BE, regime is R2, n=1000n=1000, K=8K=8, and δ=0.1\delta=0.1.

6 AOMIC Data Analysis

We now consider the AOMIC-PIOP1 dataset, which includes six-minute resting-state scans for 210 healthy university students. A structural image was acquired prior to each subject’s resting-state session at a resolution of 1 mm isotropic, while the functional images were acquired at a spatial resolution of 3 mm isotropic and a temporal resolution of 0.75 seconds (resulting in a single-subject data matrix with dimension 65×77×60×48065\times 77\times 60\times 480). Including all \sim 300,000 voxels in this analysis would necessitate estimation and in-memory storage of a prohibitively large spatial covariance. We avoid this by focusing on one axial slice (at z=30z=30). Preprocessing of the anatomical and functional MRI data was performed using a standard Fmriprep pipeline (Esteban et al., 2019), as detailed in Snoek et al. (2021).

Our strategy for discovering the connectivity patterns present in these resting-state data was to prewhiten the voxel time courses in each subject’s preprocessed scan (which are, in general, nonstationary and autocorrelated), then treat the multi-subject collection of slices as an i.i.d. sample 𝒳1,,𝒳n\mathcal{X}_{1},\dots,\mathcal{X}_{n}, n=100800n=100800, which we used to fit the FFM. Although fMRI data are inherently nonstationary and autocorrelated, many connectivity studies that correlate voxel time series skip this essential prewhitening step, risking the discovery of spurious correlations (Christova et al., 2011; Afyouni et al., 2019). Of course, prewhitening only addresses temporal correlation within voxel time courses, not between time courses. There are several ways to prewhiten voxel time courses, including those based on Fourier (Laird et al., 2004) and wavelet (Bullmore et al., 2001) decompositions. Taking another common approach, we performed prewhitening by fitting a non-seasonal AutoRegressive Integrated Moving Average (ARIMA) model to each voxel time course using the R function forecast::auto.arima (Hyndman and Khandakar, 2008), then extracting the residuals. Figure 5 from Section 6.1 of Supplement A demonstrates the effects of prewhitening on voxel time courses. Following Christova et al. (2011), we form samples 𝒳1,,𝒳n\mathcal{X}_{1},\dots,\mathcal{X}_{n} from the scaled ARIMA residuals.
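A minimal sketch of this voxel-wise prewhitening step is given below; prewhiten_slice and the matrix bold are illustrative names, and the only package call assumed is forecast::auto.arima, as cited above.

```r
# Sketch of voxel-wise ARIMA prewhitening (variable names are illustrative).
library(forecast)

# bold: T x V matrix, one column per voxel time course (T = 480 here)
prewhiten_slice <- function(bold) {
  apply(bold, 2, function(ts) {
    fit <- auto.arima(ts, seasonal = FALSE)   # non-seasonal ARIMA per voxel
    res <- residuals(fit)
    as.numeric(scale(res))                    # scaled residuals, as in Christova et al. (2011)
  })
}

# Stacking the prewhitened time points across subjects yields the sample
# X_1, ..., X_n with n = (time points per subject) x (number of subjects).
```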

We begin analysis of these samples by computing the empirical covariance 𝒞^\hat{\mathcal{C}} in equation (3). Based on visual inspection and the fact that the sample size is quite large, we opt not to smooth (i.e., we set α=0\alpha=0), then plot the non-increasing function g(j)=𝒜(𝒞^θ^j)F2g(j)=\|\mathcal{A}\circ(\hat{\mathcal{C}}-\hat{\theta}_{j})\|_{F}^{2} for j=1,,100j=1,\dots,100 (see Figure 6 of Section 6.1 in Supplement A). Based on this plot, we set K^=12\hat{K}=12 and use the corresponding ^1,,^12\hat{\mathcal{L}}_{1},\dots,\hat{\mathcal{L}}_{12} as the initial loading estimates. To choose the shrinkage parameter κ¯=(κ1,,κ12)\underline{\kappa}=(\kappa_{1},\dots,\kappa_{12}) used to compute the post-processed loadings ~1,,~12\tilde{\mathcal{L}}_{1},\dots,\tilde{\mathcal{L}}_{12}, we define a manageable search grid, then follow the tuning procedure described in Section 3 of Supplement A. Figure 9a displays the rich covariance structure captured by these post-processed loadings. In addition to anatomical segmentation (e.g., grey matter segmentation in loadings 1, 2, and 6; white matter segmentation in loadings 3 and 9; and ventricle segmentation in loadings 5 and 10), these 12 loadings contain known resting-state networks. The prominent bilateral structure in the first loading captures parts of the auditory resting-state network, including co-activation in Heschl’s gyrus and the posterior insula (see 7207_{20} of Figure 1 in Smith et al., 2009). Next, frontal activation in the second loading includes areas associated with resting-state executive control, like the anterior cingulate (see 8208_{20} of Figure 1 in Smith et al., 2014). Lastly, the sixth loading contains activation in the occipital lobe, an area associated with visual processing (see 2202_{20} of Figure 1 in Smith et al., 2009).

To provide a methodological comparison, we also performed group ICA by temporally concatenating the preprocessed slices then running ICA on the resulting data to produce spatial maps common to all subjects (Calhoun et al., 2001). As in Section 5, we do so using FSL’s MELODIC function. For this analysis, we fit 12-, 25-, and 50-independent component (IC) models. Transformed components (for interpretability) for the 12-IC model are shown in Figure 9, and those for the 25- and 50-IC models are shown in Figures 8, 9, and 10 of Section 5.2 in Supplement A. To facilitate comparison with FFA, we also estimate the FFM with 25 and 50 factors (see Figures 5, 6, and 7 in Section 5.2 of Supplement A) even though the plots of Figure 6 in Section 6.1 of Supplement A suggest a smaller number of factors.

We first compare the 12-component models (see Figure 9). Several prominent structures are captured by both FFA and ICA: dorsal correlations appear in the 6th loading, and in the 4th and 9th ICs; frontal correlations appear in the 2nd loading and the 7th IC; bilateral correlations appear in the 1st loading and in the 8th IC; similar ventricle correlations appear in the 5th loading and in the 1st IC; and brain outlines appear in the 4th and 7th loadings, and in the 11th IC. Despite the many common features, there exist some discrepancies (e.g., the spikes of activation in the 10th and 12th ICs do not appear prominently in any single loading).

(a) FFA
(b) ICA
Figure 9: (a) Loadings of the functional 12-factor model estimated from ARIMA residuals. (b) Transformed components (t(x)=sgn(x)log(|x|+1)t(x)=\text{sgn}(x)\log(\mathinner{\!\left\lvert x\right\rvert}+1)) of the 12-component IC model estimated from the preprocessed scans. ICs are presented on the log-scale to bring forward correlations that are washed out by the spikes of activation in ICs 3, 10, and 12.

However, as observed in Section 5.3, the most practically meaningful difference is that MELODIC lacks FFA’s robustness to dimension misspecification. Consider, for instance, the similar ventricle correlations in loading 5 and IC 1 of the order-12 plots in Figures 10a and 10b. In the FFMs with 25 and 50 factors, this structure is preserved. However, this structure, captured in a single IC of the 12-IC model, is splintered across 2 ICs in both the 25- and 50-IC models, with the components of the latter containing less of the original structure than those of the former. Similar fragmentation occurs for other structures found in the 12-IC model.

(a) FFA
(b) ICA
Figure 10: In their respective 12-component models, FFA and ICA capture correlation structures within the ventricles. In higher-order models, FFA preserves this structure while ICA fragments it across multiple ICs.

One way to combat ICA’s instability is through multi-scale approaches. For example, Iraji et al. (2023) estimate common spatial maps by first generating 100 half-splits of their multi-subject dataset, then fitting IC models of order 25, 50, 75, 100, 125, 150, 175, and 200 to each data split. This produces 100 sets of 900 components from which they select the 900 with the highest average spatial similarity (calculated by Pearson correlation) across the 100 sets. They then identify those ICs among the 900 that are most distinct from each other (spatial similarity < 0.8). In Section 6.2 of Supplement A, we adapt this procedure to the AOMIC data, showing that the final subset of estimated ICs does, in fact, preserve low-order structure. However, this requires more computing resources than a single-model ICA approach. Moreover, this output lacks some of the interpretability afforded by that from a single-model approach: estimated ICs from this multi-scale algorithm are cherry-picked from many IC models, which forfeits the straightforward independence-based interpretation arising from the assumptions of a single IC model.

7 Discussion

This work recognizes the central contribution of Descary and Panaretos (2019) – a framework that decouples the global and local variation of a functional observation – as the proper scaffolding upon which to build a factor analytic approach for functional data. Our work extends their framework to multidimensional functional data, enhances the estimator of the global covariance via a roughness penalty, then appends a post-processing procedure that improves interpretability. The result is a factor analytic approach tailored to the study of functional connectivity in fMRI.

Three characteristics of our methodology make it uniquely suited to the study of functional connectivity. First, it outputs spatial maps possessing simple structure, which facilitates interpretation of the latent factors that drive correlations in the data. Simple structure is a hallmark of factor analytic methods and has, in fields like education and psychology, made multivariate factor analysis a popular alternative to less legible techniques, like PCA; we demonstrate the utility of this structure in the neuroimaging context. Second, unlike many other functional connectivity methods, our approach is not preceded by data smoothing. Given that connectivity studies aim to characterize correlations between distant brain regions, model estimation should exclude local variation, yet data smoothing precludes this: it smears local variation into the global signal, thereby corrupting subsequent attempts to estimate that signal. Third, our approach is robust to dimension overspecification. This model feature, not found in popular ICA-based methods, is critical in fMRI where it is difficult to correctly specify the dimension of the signal.

The limitations of our methodology present several opportunities for future work. One weakness of the approach is that its application does not consider all spatial information available in an fMRI scan. Although the model remains valid for 3-dimensional brain volumes, inclusion of a third spatial dimension renders the existing estimation method computationally impractical. Additionally, the model ignores the temporal information present in fMRI data. We attempted to mitigate the impact of these temporal dynamics in our analysis through prewhitening, but such dependencies ought to be acknowledged by the model. Future work on functional factor modeling of large-scale functional or spatiotemporal data should address these concerns.

Supplement A: Functional Factor Modeling of Brain Connectivity

1 Notation and Definitions

1.1 Tensor Reshaping

In the discrete observation paradigm, it is useful to define some tensor reshapings, or bijections between the set of indices of one tensor and that of another tensor. Denote the index set of 𝒳𝔐\mathcal{X}\in\mathbb{R}^{\mathfrak{M}} by M¯=M1××MD\mathbb{N}_{\underline{M}}=\mathbb{N}_{M_{1}}\times\dots\times\mathbb{N}_{M_{D}}. Define a bijection μ:M¯M1MD\mu\mathrel{\mathop{\ordinarycolon}}\mathbb{N}_{\underline{M}}\to\mathbb{N}_{M_{1}\dots M_{D}} so that the vectorization of 𝒳𝔐\mathcal{X}\in\mathbb{R}^{\mathfrak{M}} is the vector

𝒳vec:=[vμ1(m)]m=1M1MD.\displaystyle\mathcal{X}_{\text{vec}}\mathrel{\mathop{\ordinarycolon}}=[v_{\mu^{-1}(m)}]_{m=1}^{M_{1}\dots M_{D}}.

Next, define a symmetric bijection γ:M¯×M¯M1MD×M1MD\gamma\mathrel{\mathop{\ordinarycolon}}\mathbb{N}_{\underline{M}}\times\mathbb{N}_{\underline{M}}\to\mathbb{N}_{M_{1}\dots M_{D}}\times\mathbb{N}_{M_{1}\dots M_{D}} so that the square matricization of 𝒞𝔐×𝔐\mathcal{C}\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}} is the square matrix

𝒞sq:=[wγ1(i,j)]i,j=1M1MD.\displaystyle\mathcal{C}_{\text{sq}}\mathrel{\mathop{\ordinarycolon}}=[w_{\gamma^{-1}(i,j)}]_{i,j=1}^{M_{1}\dots M_{D}}.

Although there are many suitable selections for μ\mu and γ\gamma, we choose these bijections to be consistent with the array reshaping operator in MATLAB.
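Because R shares MATLAB's column-major array layout, reshapings of this kind can be realized directly with as.vector, matrix, and array; the sketch below uses illustrative dimensions and verifies that the maps are invertible.

```r
# Reshaping via R's column-major layout (the same convention as MATLAB's reshape).
M1 <- 4; M2 <- 5                                   # illustrative resolutions
X  <- array(rnorm(M1 * M2), dim = c(M1, M2))       # a tensor in R^{M1 x M2}
X_vec <- as.vector(X)                              # vectorization

C  <- array(rnorm((M1 * M2)^2), dim = c(M1, M2, M1, M2))
C_sq <- matrix(C, nrow = M1 * M2, ncol = M1 * M2)  # square matricization

# The maps are bijections, so reshaping back recovers the original tensors:
stopifnot(identical(array(X_vec, dim = c(M1, M2)), X))
stopifnot(identical(array(C_sq, dim = c(M1, M2, M1, M2)), C))
```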

1.2 Properties of Tensors

We use the tensor reshapings of Section 1.1 to define several properties of tensors. We say 𝒞𝔐×𝔐\mathcal{C}\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}} is symmetric if 𝒞sq\mathcal{C}_{\text{sq}} is symmetric, nonnegative definite if 𝒞sq\mathcal{C}_{\text{sq}} is nonnegative definite, and rank-rr if 𝒞sq\mathcal{C}_{\text{sq}} is rank-rr. Through square matricization and vectorization, it can be shown that any symmetric nonnegative definite tensor 𝒞𝔐×𝔐\mathcal{C}\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}} with rank-rr admits the eigendecomposition

𝒞=k=1rck𝒜k𝒜k,\displaystyle\mathcal{C}=\sum_{k=1}^{r}c_{k}\mathcal{A}_{k}\otimes\mathcal{A}_{k},

where the ckc_{k} are the nonnegative eigenvalues of 𝒞\mathcal{C} and the orthonormal 𝒜k\mathcal{A}_{k} are the eigentensors of 𝒞\mathcal{C}.
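A short numerical check of this observation, with illustrative dimensions: square matricize, call eigen, and reshape the leading eigenvectors back into eigentensors.

```r
# Eigendecomposition of a symmetric nonnegative definite covariance tensor.
M1 <- 4; M2 <- 5; r <- 2
L <- array(rnorm(M1 * M2 * r), dim = c(M1, M2, r))        # r loading tensors
L_mat <- matrix(L, nrow = M1 * M2, ncol = r)
C_sq  <- L_mat %*% t(L_mat)                               # rank-r, symmetric, nonnegative definite

eig <- eigen(C_sq, symmetric = TRUE)
ck  <- eig$values[1:r]                                    # nonnegative eigenvalues
A   <- array(eig$vectors[, 1:r], dim = c(M1, M2, r))      # orthonormal eigentensors

# Reconstruction: C = sum_k c_k A_k (x) A_k, up to numerical error
C_rec <- eig$vectors[, 1:r] %*% diag(ck) %*% t(eig$vectors[, 1:r])
stopifnot(max(abs(C_rec - C_sq)) < 1e-10)
```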

2 Rotational Indeterminacy

The FFM is identifiable up to an orthogonal rotation. To see this, let 𝔏[0,1]D×K\mathfrak{L}\in[0,1]^{D}\times\mathbb{N}_{K}, defined by 𝔏s¯,k=lk(s¯)\mathfrak{L}_{\underline{s},k}=l_{k}(\underline{s}), be a set of loading functions indexed by kk. For a K×KK\times K rotation matrix 𝑹\boldsymbol{R}, define the rotation operator 𝑹:[0,1]D×K[0,1]D×K\mathcal{M}_{\boldsymbol{R}}\mathrel{\mathop{\ordinarycolon}}[0,1]^{D}\times\mathbb{N}_{K}\to[0,1]^{D}\times\mathbb{N}_{K} by (𝑹𝔏)s¯,k=(𝑹𝔏s¯,)k(\mathcal{M}_{\boldsymbol{R}}\mathfrak{L})_{\underline{s},k}=(\boldsymbol{R}\mathfrak{L}_{\underline{s},\cdot})_{k}. That is, 𝑹\mathcal{M}_{\boldsymbol{R}} rotates the loading functions by rotating the KK-dimensional vector (l1(s¯),,lK(s¯))(l_{1}(\underline{s}),\dots,l_{K}(\underline{s})) by 𝑹\boldsymbol{R} at each fixed s¯[0,1]D\underline{s}\in[0,1]^{D}. If f¯=(f1,,fK)\underline{f}=(f_{1},\dots,f_{K}) is a vector of factors and 𝑹\boldsymbol{R} is some rotation matrix, then the following equivalences hold:

Y(s¯)=k=1K𝔏s¯,kf¯k=k=1K(𝑹𝔏)s¯,k(𝑹f¯)k\displaystyle Y(\underline{s})=\sum_{k=1}^{K}\mathfrak{L}_{\underline{s},k}\underline{f}_{k}=\sum_{k=1}^{K}(\mathcal{M}_{\boldsymbol{R}}\mathfrak{L})_{\underline{s},k}(\boldsymbol{R}\underline{f})_{k}

and

𝒢=k=1K𝔏,k𝔏,k=k=1K(𝑹𝔏),k(𝑹𝔏),k.\displaystyle\mathscr{G}=\sum_{k=1}^{K}\mathfrak{L}_{\cdot,k}\otimes\mathfrak{L}_{\cdot,k}=\sum_{k=1}^{K}(\mathcal{M}_{\boldsymbol{R}}\mathfrak{L})_{\cdot,k}\otimes(\mathcal{M}_{\boldsymbol{R}}\mathfrak{L})_{\cdot,k}.

Thus, when the global component YY and its covariance 𝒢\mathscr{G} are known, it is impossible to distinguish the loading functions of 𝔏\mathfrak{L} from those of 𝑹𝔏\mathcal{M}_{\boldsymbol{R}}\mathfrak{L}.
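This indeterminacy is easy to verify numerically; in the sketch below (with arbitrary loadings evaluated on a grid of M points), rotating the loadings row-wise and the factors by the same orthogonal matrix leaves both the observation and the global covariance unchanged.

```r
# Numerical check of rotational indeterminacy.
set.seed(2)
M <- 50; K <- 3
L <- matrix(rnorm(M * K), M, K)                  # loadings evaluated on a grid
f <- rnorm(K)                                    # one draw of the factors
R <- qr.Q(qr(matrix(rnorm(K * K), K)))           # a random K x K orthogonal matrix

Y1 <- L %*% f                                    # original representation
Y2 <- (L %*% t(R)) %*% (R %*% f)                 # rotated loadings and factors
G1 <- L %*% t(L)
G2 <- (L %*% t(R)) %*% t(L %*% t(R))

stopifnot(max(abs(Y1 - Y2)) < 1e-10, max(abs(G1 - G2)) < 1e-10)
```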

3 Theoretical Support

This section includes statements and proofs of results described in the manuscript, as well as those for auxiliary results that enrich our theory.

3.1 Identifiability in the Completely-Observed Paradigm

We begin with Theorem 3.1, which establishes conditions needed to identify the global covariance 𝒢\mathscr{G} and the local covariance \mathscr{B} from the covariance 𝒞\mathscr{C} in the complete-observation paradigm. The result is a straightforward extension of Theorem 1 from Descary and Panaretos (2019) to multidimensional functional data.

Theorem 3.1 (Identifiability).

Let 𝒢1,𝒢2:L2([0,1]D)L2([0,1]D)\mathscr{G}_{1},\mathscr{G}_{2}\mathrel{\mathop{\ordinarycolon}}L^{2}([0,1]^{D})\to L^{2}([0,1]^{D}) be trace-class covariance operators of rank K1,K2<K_{1},K_{2}<\infty, respectively. Let 1,2:L2([0,1]D)L2([0,1]D)\mathscr{B}_{1},\mathscr{B}_{2}\mathrel{\mathop{\ordinarycolon}}L^{2}([0,1]^{D})\to L^{2}([0,1]^{D}) be banded trace-class covariance operators whose bandwidths δ¯1\underline{\delta}_{1} and δ¯2\underline{\delta}_{2}, respectively, have components less than 1. If the eigenfunctions of 𝒢1\mathscr{G}_{1} and 𝒢2\mathscr{G}_{2} are real analytic, then we have the equivalence

𝒢1+1=𝒢2+2𝒢1=𝒢2and1=2.\mathscr{G}_{1}+\mathscr{B}_{1}=\mathscr{G}_{2}+\mathscr{B}_{2}\hskip 10.00002pt\Longleftrightarrow\hskip 10.00002pt\mathscr{G}_{1}=\mathscr{G}_{2}\hskip 10.00002pt\text{and}\hskip 10.00002pt\mathscr{B}_{1}=\mathscr{B}_{2}.

By definition, a function ff defined on open UU\subset\mathbb{R} is real analytic at x0x_{0} if ff is equal to its Taylor expansion about x0x_{0} in some neighborhood of x0x_{0}. A function ff is real analytic on open VUV\subset U if it is real analytic at each x0x_{0} in VV. In the proof of the above result, we use an alternative characterization of real analyticity that generalizes to multivariable functions. The power of analyticity in our setting comes from the so-called analytic continuation property: if a function is analytic on an open connected set UDU\subset\mathbb{R}^{D}, but is known only on open VUV\subset U, then its values on all of UU are uniquely determined. In the proof of Theorem 3.1, after showing that the kernel gg is real analytic on [0,1]2D[0,1]^{2D}, we use this observation to uniquely extend gg to the on-band subset of its domain (contaminated by bb) based on off-band values (uncontaminated by bb).

Proof of Theorem 3.1.

Let g1,g2L2([0,1]2D)g_{1},g_{2}\in L^{2}([0,1]^{2D}) denote the kernels of 𝒢1\mathscr{G}_{1} and 𝒢2\mathscr{G}_{2}, respectively. We begin by showing that g1g_{1} (similarly, g2g_{2}) is analytic on [0,1]2D[0,1]^{2D}. By Mercer’s Theorem (Hsing and Eubank, 2015), g1g_{1} admits the decomposition

g1(s¯1,s¯2)=k=1Kλkηk(s¯1)ηk(s¯2),\displaystyle g_{1}(\underline{s}_{1},\underline{s}_{2})=\sum_{k=1}^{K}\lambda_{k}\eta_{k}(\underline{s}_{1})\eta_{k}(\underline{s}_{2}),

where we have assumed that each ηk\eta_{k} is analytic on [0,1]D[0,1]^{D}. Equivalently, each ηk\eta_{k} has some real analytic extension η~k\tilde{\eta}_{k} to some open Uk[0,1]DU_{k}\supset[0,1]^{D}. Let U=kUkU=\cap_{k}U_{k} and choose open VV and compact RR so that URV[0,1]DU\supset R\supset V\supset[0,1]^{D}. Since each η~k\tilde{\eta}_{k} is real analytic on UU and RUR\subset U is compact, there exists a positive constant CC such that for every multi-index αD\alpha\in\mathbb{N}^{D},

sups¯R|αηks¯α(s¯)|C|α|α!,k=1,,K.\displaystyle\sup_{\underline{s}\in R}\left|\frac{\partial^{\alpha}\eta_{k}}{\partial\underline{s}^{\alpha}}(\underline{s})\right|\leq C^{\mathinner{\!\left\lvert\alpha\right\rvert}}\alpha!,\hskip 10.00002ptk=1,\dots,K.

Define λ=maxkλk\lambda^{*}=\max_{k}\lambda_{k}, η=maxksups¯R|η~k(s¯)|\eta^{*}=\max_{k}\sup_{\underline{s}\in R}\mathinner{\!\left\lvert\tilde{\eta}_{k}(\underline{s})\right\rvert}, and g~1(s¯1,s¯2)=k=1Kλkη~k(s¯1)η~k(s¯2)\tilde{g}_{1}(\underline{s}_{1},\underline{s}_{2})=\sum_{k=1}^{K}\lambda_{k}\tilde{\eta}_{k}(\underline{s}_{1})\tilde{\eta}_{k}(\underline{s}_{2}). Then for every multi-index α\alpha,

sup(s¯1,s¯2)V×V|αs¯1αg~1(s¯1,s¯2)|\displaystyle\sup_{(\underline{s}_{1},\underline{s}_{2})\in V\times V}\mathinner{\!\left\lvert\frac{\partial^{\alpha}}{\partial\underline{s}_{1}^{\alpha}}\tilde{g}_{1}(\underline{s}_{1},\underline{s}_{2})\right\rvert} ληk=1Ksups¯V|αs¯1αη~k(s¯1)|\displaystyle\leq\lambda^{*}\eta^{*}\sum_{k=1}^{K}\sup_{\underline{s}\in V}\mathinner{\!\left\lvert\frac{\partial^{\alpha}}{\partial\underline{s}_{1}^{\alpha}}\tilde{\eta}_{k}(\underline{s}_{1})\right\rvert}
ληk=1Ksups¯R|αs¯1αη~k(s¯1)|\displaystyle\leq\lambda^{*}\eta^{*}\sum_{k=1}^{K}\sup_{\underline{s}\in R}\mathinner{\!\left\lvert\frac{\partial^{\alpha}}{\partial\underline{s}_{1}^{\alpha}}\tilde{\eta}_{k}(\underline{s}_{1})\right\rvert}
ληk=1KC|α|α!\displaystyle\leq\lambda^{*}\eta^{*}\sum_{k=1}^{K}C^{\mathinner{\!\left\lvert\alpha\right\rvert}}\alpha!
C0|α|α!,\displaystyle\leq C_{0}^{\mathinner{\!\left\lvert\alpha\right\rvert}}\alpha!,

for some sufficiently large C0C_{0}. By symmetry, we also have

sup(s¯1,s¯2)V×V|αs¯2αg~1(s¯1,s¯2)|C0|α|α!.\displaystyle\sup_{(\underline{s}_{1},\underline{s}_{2})\in V\times V}\mathinner{\!\left\lvert\frac{\partial^{\alpha}}{\partial\underline{s}_{2}^{\alpha}}\tilde{g}_{1}(\underline{s}_{1},\underline{s}_{2})\right\rvert}\leq C_{0}^{\mathinner{\!\left\lvert\alpha\right\rvert}}\alpha!.

Thus, by Theorem 4.3.3 of (Krantz and Parks, 2002), g~1\tilde{g}_{1} is real analytic on V×VV\times V, making g1g_{1} (similarly g2g_{2}) real analytic on [0,1]2D[0,1]^{2D}.

Now consider the difference g_{1}-g_{2}, which is itself real analytic on [0,1]^{2D}. By Theorem 6.3.3 of (Krantz and Parks, 2002), its zero set is at most (2D-1)-dimensional unless g_{1}-g_{2} vanishes identically. Thus, if we can show that the two kernels coincide on an open (i.e., 2D-dimensional) subset W of [0,1]^{2D}, then by Corollary 1.2.6 of (Krantz and Parks, 2002) they coincide on (0,1)^{2D}, and on [0,1]^{2D} by continuity.

Let \delta<1 be the maximum component across \underline{\delta}_{1} and \underline{\delta}_{2}, and define W=((1+\delta)/2,1)^{D}\times(0,(1-\delta)/2)^{D}, so that every (\underline{s}_{1},\underline{s}_{2})\in W satisfies s_{1,d}-s_{2,d}>\delta for each d. Since \mathscr{G}_{1}+\mathscr{B}_{1}=\mathscr{G}_{2}+\mathscr{B}_{2} while the kernels of \mathscr{B}_{1} and \mathscr{B}_{2} vanish on W, it must be that g_{1}=g_{2} on W. This completes the proof. ∎

Despite this seemingly restrictive property, the assumption of analyticity is not overly prohibitive. There are many examples of analytic functions, including polynomials, the exponential and logarithmic functions, and trigonometric functions. Moreover, the class of real analytic functions is closed under addition, subtraction, multiplication, division (given a nonvanishing denominator), differentiation, and integration. In fact, Theorem 3.2 shows that any finite-rank covariance operator is well-approximated by finite-rank covariance operators having real analytic eigenfunctions.

Theorem 3.2 (Density).

Let ZZ be an L2([0,1]D)L^{2}([0,1]^{D})-valued random function with a trace-class covariance \mathscr{H} of rank K<K<\infty. Then, for any ϵ>0\epsilon>0, there exists a random function YY whose covariance 𝒢\mathscr{G} has analytic eigenfunctions and rank qKq\leq K, such that

𝔼ZYL22<ϵand𝒢<ϵ\mathbb{E}\mathinner{\!\left\lVert Z-Y\right\rVert}_{L^{2}}^{2}<\epsilon\hskip 10.00002pt\text{and}\hskip 10.00002pt\mathinner{\!\left\lVert\mathscr{H}-\mathscr{G}\right\rVert}_{*}<\epsilon

for \mathinner{\!\left\lVert\cdot\right\rVert}_{*} the nuclear norm. If additionally \mathscr{H} has C1C^{1} eigenfunctions on [0,1]D[0,1]^{D} then we have the stronger result that for any ϵ>0\epsilon>0, there exists a random function YY whose covariance 𝒢\mathscr{G} has analytic eigenfunctions and rank qKq\leq K, such that

sups¯[0,1]D𝔼|Z(s¯)Y(s¯)|2<ϵandsups¯1,s¯2[0,1]D|h(s¯1,s¯2)g(s¯1,s¯2)|<ϵ,\sup_{\underline{s}\in[0,1]^{D}}\mathbb{E}\mathinner{\!\left\lvert Z(\underline{s})-Y(\underline{s})\right\rvert}^{2}<\epsilon\hskip 10.00002pt\text{and}\hskip 10.00002pt\sup_{\underline{s}_{1},\underline{s}_{2}\in[0,1]^{D}}\mathinner{\!\left\lvert h(\underline{s}_{1},\underline{s}_{2})-g(\underline{s}_{1},\underline{s}_{2})\right\rvert}<\epsilon,

where hh and gg are the kernels of \mathscr{H} and 𝒢\mathscr{G}, respectively.

Proof.

Proposition 1 of Descary and Panaretos (2019) establishes this result for D=1D=1. The proof uses Fourier series approximations of the eigenfunctions of \mathscr{H} to establish each result. To extend these results to D>1D>1, we need only apply the same reasoning using a basis of L2([0,1]D)L^{2}([0,1]^{D}) that is composed of functions analytic on [0,1]D[0,1]^{D}, and that yields approximations dense in C1([0,1]D)C^{1}([0,1]^{D}) under the sup-norm. One can show that these criteria are satisfied by the Fourier tensor product basis,

{en1enD:n1,,nD},\displaystyle\{e_{n_{1}}\otimes\dots\otimes e_{n_{D}}\mathrel{\mathop{\ordinarycolon}}n_{1},\dots,n_{D}\in\mathbb{N}\},

where ende_{n_{d}} is the ndn_{d}-th function of the Fourier basis of L2([0,1])L^{2}([0,1]). ∎

3.2 Identifiability in the Discretely-Observed Paradigm

Next, we turn to theory for the discrete-observation setting. We begin by describing an observation scheme that generalizes the evenly-spaced grid considered in the manuscript. Let {Im¯,M¯}m¯M¯\{I_{\underline{m},\underline{M}}\}_{\underline{m}\in\mathbb{N}_{\underline{M}}} be the partition of [0,1]D[0,1]^{D} into cells of size M11××MD1M_{1}^{-1}\times\dots\times M_{D}^{-1}, where M¯\underline{M} and M¯\mathbb{N}_{\underline{M}} are defined as in the manuscript. Suppose we observe each of the nn samples of XX at the M=M1MDM=M_{1}\dots M_{D} discrete points in {s¯m¯}m¯M¯\{\underline{s}_{\underline{m}}\}_{\underline{m}\in\mathbb{N}_{\underline{M}}} where s¯m¯Im¯,M¯\underline{s}_{\underline{m}}\in I_{\underline{m},\underline{M}}, and denote the set of all possible discrete point collections by 𝒮M¯\mathcal{S}_{\underline{M}}. From here, we can follow the steps taken in the manuscript to define the M-resolution functional factor model with KK factors as

𝒳M¯=k=1KkM¯fk+M¯.\displaystyle\mathcal{X}^{\underline{M}}=\sum_{k=1}^{K}\mathcal{L}^{\underline{M}}_{k}f_{k}+\mathcal{E}^{\underline{M}}.

We denote the M¯\underline{M}-resolution versions of (𝒞,𝒢,)(\mathscr{C},\mathscr{G},\mathscr{B}) by (𝒞M¯,𝒢M¯,M¯)(\mathscr{C}^{\underline{M}},\mathscr{G}^{\underline{M}},\mathscr{B}^{\underline{M}}). These operators have kernels

cM¯(x¯,y¯)\displaystyle c^{\underline{M}}(\underline{x},\underline{y}) =i¯,j¯M¯c(s¯i¯,s¯j¯)𝟙{(x¯,y¯)Ii¯,M¯×Ij¯,M¯},\displaystyle=\sum_{\underline{i},\underline{j}\in\mathbb{N}_{\underline{M}}}c(\underline{s}_{\underline{i}},\underline{s}_{\underline{j}})\mathds{1}\{(\underline{x},\underline{y})\in I_{\underline{i},\underline{M}}\times I_{\underline{j},\underline{M}}\},
gM¯(x¯,y¯)\displaystyle g^{\underline{M}}(\underline{x},\underline{y}) =i¯,j¯M¯g(s¯i¯,s¯j¯)𝟙{(x¯,y¯)Ii¯,M¯×Ij¯,M¯},\displaystyle=\sum_{\underline{i},\underline{j}\in\mathbb{N}_{\underline{M}}}g(\underline{s}_{\underline{i}},\underline{s}_{\underline{j}})\mathds{1}\{(\underline{x},\underline{y})\in I_{\underline{i},\underline{M}}\times I_{\underline{j},\underline{M}}\},
bM¯(x¯,y¯)\displaystyle b^{\underline{M}}(\underline{x},\underline{y}) =i¯,j¯M¯b(s¯i¯,s¯j¯)𝟙{(x¯,y¯)Ii¯,M¯×Ij¯,M¯},\displaystyle=\sum_{\underline{i},\underline{j}\in\mathbb{N}_{\underline{M}}}b(\underline{s}_{\underline{i}},\underline{s}_{\underline{j}})\mathds{1}\{(\underline{x},\underline{y})\in I_{\underline{i},\underline{M}}\times I_{\underline{j},\underline{M}}\},

which can be summarized using the tensors of dimension M1××MD×M1××MDM_{1}\times\dots\times M_{D}\times M_{1}\times\dots\times M_{D},

𝒞M¯(i¯,j¯)=cM¯(s¯i¯,s¯j¯),𝒢M¯(i¯,j¯)=gM¯(s¯i¯,s¯j¯),M¯(i¯,j¯)=bM¯(s¯i¯,s¯j¯).\displaystyle\mathcal{C}^{\underline{M}}(\underline{i},\underline{j})=c^{\underline{M}}(\underline{s}_{\underline{i}},\underline{s}_{\underline{j}}),\hskip 10.00002pt\mathcal{G}^{\underline{M}}(\underline{i},\underline{j})=g^{\underline{M}}(\underline{s}_{\underline{i}},\underline{s}_{\underline{j}}),\hskip 10.00002pt\mathcal{B}^{\underline{M}}(\underline{i},\underline{j})=b^{\underline{M}}(\underline{s}_{\underline{i}},\underline{s}_{\underline{j}}).

As in the manuscript, the covariance decomposes as

𝒞M¯=k=1KkM¯kM¯𝒢M¯+M¯.\mathcal{C}^{\underline{M}}=\underbrace{\sum_{k=1}^{K}\mathcal{L}^{\underline{M}}_{k}\otimes\mathcal{L}^{\underline{M}}_{k}}_{\mathcal{G}^{\underline{M}}}+\mathcal{B}^{\underline{M}}.

Theorem 3.4 presents conditions needed to identify 𝒢\mathcal{G} and \mathcal{B} from 𝒞\mathcal{C}. Chief among them is the requirement from the complete-observation setting that 𝒢\mathscr{G} be finite-rank with real analytic eigenfunctions. Theorem 3.3 is an intermediate result that links the functional assumption of analyticity to discretely-observed covariances.

Theorem 3.3.

Let 𝒢\mathscr{G} have kernel g(s¯1,s¯2)=k=1Kλkηk(s¯1)ηk(s¯2)g(\underline{s}_{1},\underline{s}_{2})=\sum_{k=1}^{K}\lambda_{k}\eta_{k}(\underline{s}_{1})\eta_{k}(\underline{s}_{2}) with K<K<\infty and real analytic eigenfunctions {η1,,ηK}\{\eta_{1},\dots,\eta_{K}\}. If M=M1MD>KM=M_{1}\dots M_{D}>K, then all order-KK minors of 𝒢sqM¯\mathcal{G}^{\underline{M}}_{\text{sq}} are nonzero, almost everywhere on 𝒮M¯\mathcal{S}_{\underline{M}}.

Proof.

First, notice that from g(s¯1,s¯2)=k=1Kλkηk(s¯1)ηk(s¯2)g(\underline{s}_{1},\underline{s}_{2})=\sum_{k=1}^{K}\lambda_{k}\eta_{k}(\underline{s}_{1})\eta_{k}(\underline{s}_{2}), we have that

(𝒢)(i¯,j¯)=(𝒢sq)(γ(i¯),γ(j¯))=k=1Kλkηk(s¯i¯)ηk(s¯j¯),i¯,j¯M¯,\displaystyle(\mathcal{G})_{(\underline{i},\underline{j})}=(\mathcal{G}_{\text{sq}})_{(\gamma(\underline{i}),\gamma(\underline{j}))}=\sum_{k=1}^{K}\lambda_{k}\eta_{k}(\underline{s}_{\underline{i}})\eta_{k}(\underline{s}_{\underline{j}}),\hskip 10.00002pt\underline{i},\underline{j}\in\mathbb{N}_{\underline{M}},

where γ:M¯M\gamma\mathrel{\mathop{\ordinarycolon}}\mathbb{N}_{\underline{M}}\to\mathbb{N}_{M} is the index set bijection that defines square matricization of 𝒢\mathcal{G}. Thus, the rank-KK matrix 𝒢sqM×M\mathcal{G}_{\text{sq}}\in\mathbb{R}^{M\times M} can be written as 𝑼𝚺𝑼T\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{U}^{T}, where (𝑼)(γ(i¯),k)=ηk(s¯i¯)(\boldsymbol{U})_{(\gamma(\underline{i}),k)}=\eta_{k}(\underline{s}_{\underline{i}}) and 𝚺=diag(λ1,,λK)\boldsymbol{\Sigma}=\text{diag}(\lambda_{1},\dots,\lambda_{K}). Note that any K×KK\times K submatrix of 𝒢sq\mathcal{G}_{\text{sq}} can be written as

𝑼F𝚺𝑼FT,\displaystyle\boldsymbol{U}_{F}\boldsymbol{\Sigma}\boldsymbol{U}_{F^{\prime}}^{T},

where 𝑼F\boldsymbol{U}_{F} and 𝑼F\boldsymbol{U}_{F^{\prime}} are K×KK\times K matrices obtained by deleting rows of 𝑼\boldsymbol{U} whose indices are not included in the cardinality-KK subsets F,FMF,F^{\prime}\subset\mathbb{N}_{M}, respectively. The condition that any order-KK minor of 𝒢sq\mathcal{G}_{\text{sq}} be nonzero is equivalent to the condition that

det[𝑼F𝚺𝑼FT]=det[𝑼F]det[𝚺]det[𝑼FT]0,\displaystyle\det\left[\boldsymbol{U}_{F}\boldsymbol{\Sigma}\boldsymbol{U}_{F^{\prime}}^{T}\right]=\det\left[\boldsymbol{U}_{F}\right]\det\left[\boldsymbol{\Sigma}\right]\det\left[\boldsymbol{U}_{F^{\prime}}^{T}\right]\neq 0,

for any subsets F,FMF,F^{\prime}\subset\mathbb{N}_{M} of cardinality KK. By construction, det[𝚺]0\det[\boldsymbol{\Sigma}]\neq 0, so the minor condition is then equivalent to requiring that det[𝑼F]0\det[\boldsymbol{U}_{F}]\neq 0 for any subset FMF\subset\mathbb{N}_{M} of cardinality KK.

We will show that this is indeed the case almost everywhere on 𝒮M¯\mathcal{S}_{\underline{M}}. Let μ\mu denote the Lebesgue measure on 𝒮M¯\mathcal{S}_{\underline{M}} and let F=KF=\mathbb{N}_{K}, without loss of generality. Using the Leibniz formula, we have that det[𝑼F]\det[\boldsymbol{U}_{F}] can be written as the function

Ddet(s¯γ1(1),,s¯γ1(K))\displaystyle D_{\det}(\underline{s}_{\gamma^{-1}(1)},\dots,\underline{s}_{\gamma^{-1}(K)}) =σSKϵ(σ)k=1K(𝑼F)(σ(k),k)\displaystyle=\sum_{\sigma\in S_{K}}\epsilon(\sigma)\prod_{k=1}^{K}(\boldsymbol{U}_{F})_{(\sigma(k),k)}
=σSKϵ(σ)k=1Kηk(s¯γ1(σ(k))),\displaystyle=\sum_{\sigma\in S_{K}}\epsilon(\sigma)\prod_{k=1}^{K}\eta_{k}(\underline{s}_{\gamma^{-1}(\sigma(k))}),

where SKS_{K} is the symmetric group on KK elements and ϵ(σ)\epsilon(\sigma) is the signature of the permutation σ\sigma. Note that the function DdetD_{\det} is real analytic on [0,1]DK[0,1]^{DK}, by virtue of each ηk\eta_{k} being real analytic on [0,1]D[0,1]^{D}.

We will now proceed by contradiction. Assume that

μ{{x¯j¯}j¯M¯𝒮M¯:Ddet(𝒙γ1(1),,𝒙γ1(K))=0}>0.\displaystyle\mu\left\{\{\underline{x}_{\underline{j}}\}_{\underline{j}\in\mathbb{N}_{\underline{M}}}\in\mathcal{S}_{\underline{M}}\mathrel{\mathop{\ordinarycolon}}D_{\det}(\boldsymbol{x}_{\gamma^{-1}(1)},\dots,\boldsymbol{x}_{\gamma^{-1}(K)})=0\right\}>0.

Since μ\mu is the Lebesgue measure, it follows that the Hausdorff dimension of the set A={(x¯1,,x¯K):Ddet(x¯1,,x¯K)=0}A=\{(\underline{x}_{1},\dots,\underline{x}_{K})\mathrel{\mathop{\ordinarycolon}}D_{\det}(\underline{x}_{1},\dots,\underline{x}_{K})=0\} is equal to DKDK. However, since DdetD_{\det} is analytic, Theorem 6.3.3 of (Krantz and Parks, 2002) implies the dichotomy: either DdetD_{\det} is identically zero on (0,1)DK(0,1)^{DK}, or the set AA is at most of dimension DK1DK-1. Since the zero set of DdetD_{\det} has positive measure and is thus of Hausdorff dimension DKDK, it must be that DdetD_{\det} is everywhere zero on (0,1)DK(0,1)^{DK}:

Ddet(x¯1,,x¯K)=σSKϵ(σ)k=1Kηk(x¯σ(k))=0,(x¯1,x¯K)(0,1)DK.\displaystyle D_{\det}(\underline{x}_{1},\dots,\underline{x}_{K})=\sum_{\sigma\in S_{K}}\epsilon(\sigma)\prod_{k=1}^{K}\eta_{k}(\underline{x}_{\sigma(k)})=0,\hskip 10.00002pt\forall(\underline{x}_{1},\dots\underline{x}_{K})\in(0,1)^{DK}.

Now fix (\underline{x}_{1},\dots,\underline{x}_{K-1}) and apply to D_{\det} (viewed as a function of \underline{x}_{K} only) the continuous linear functional T_{\eta_{K}}(f)=\left\langle f,\eta_{K}\right\rangle. Then for all (\underline{x}_{1},\dots,\underline{x}_{K-1})\in(0,1)^{D(K-1)}:

0\displaystyle 0 =Ddet,ηk\displaystyle=\left\langle D_{\det},\eta_{k}\right\rangle
=σSKϵ(σ)k=1Kηk(x¯σ(k)),ηk\displaystyle=\left\langle\sum_{\sigma\in S_{K}}\epsilon(\sigma)\prod_{k=1}^{K}\eta_{k}(\underline{x}_{\sigma(k)}),\eta_{k}\right\rangle
=σSK[ϵ(σ)k:σ(k)Kηk(x¯σ(k))]ησ1(K),ηK\displaystyle=\sum_{\sigma\in S_{K}}\left[\epsilon(\sigma)\prod_{k\mathrel{\mathop{\ordinarycolon}}\sigma(k)\neq K}\eta_{k}(\underline{x}_{\sigma(k)})\right]\left\langle\eta_{\sigma^{-1}(K)},\eta_{K}\right\rangle
=σSKϵ(σ)k=1K1ηk(x¯σ(k))\displaystyle=\sum_{\sigma\in S_{K}}\epsilon(\sigma)\prod_{k=1}^{K-1}\eta_{k}(\underline{x}_{\sigma(k)})

Applying the functionals T_{\eta_{K-1}},\dots,T_{\eta_{2}} iteratively in the same manner, eliminating one coordinate at a time, then leads to

η1(s¯)=0,s¯(0,1)D.\displaystyle\eta_{1}(\underline{s})=0,\hskip 10.00002pt\forall\underline{s}\in(0,1)^{D}.

This last equality contradicts the fact that η1\eta_{1} has norm one, and allows us to conclude that

μ{{x¯j¯}j¯M¯𝒮M¯:Ddet(𝒙γ1(1),,𝒙γ1(K))=0}=0.\displaystyle\mu\left\{\{\underline{x}_{\underline{j}}\}_{\underline{j}\in\mathbb{N}_{\underline{M}}}\in\mathcal{S}_{\underline{M}}\mathrel{\mathop{\ordinarycolon}}D_{\det}(\boldsymbol{x}_{\gamma^{-1}(1)},\dots,\boldsymbol{x}_{\gamma^{-1}(K)})=0\right\}=0.

We now state Theorem 3.4, then invoke Theorem 3.3 in its proof.

Theorem 3.4 (Identifiability at Finite Resolution).

Let 𝒢1\mathscr{G}_{1} and 𝒢2\mathscr{G}_{2} be covariance operators of rank K1,K2<K_{1},K_{2}<\infty, respectively, and assume without loss of generality that K1K2K_{1}\geq K_{2}. Let 1\mathscr{B}_{1} and 2\mathscr{B}_{2} be banded continuous covariance operators whose bandwidths δ¯1\underline{\delta}_{1} and δ¯2\underline{\delta}_{2}, respectively, have components less than 1/21/2. If the eigenfunctions of 𝒢1\mathscr{G}_{1} and 𝒢2\mathscr{G}_{2} are real analytic, and

K1K=d=1D(12max{δ1,d,δ2,d})Md1,\displaystyle K_{1}\leq K^{*}=\prod_{d=1}^{D}\left\lfloor\left(\frac{1}{2}-\max\{\delta_{1,d},\delta_{2,d}\}\right)M_{d}-1\right\rfloor, (1)

then we have the equivalence

𝒢1M¯+1M¯=𝒢2M¯+2M¯𝒢1M¯=𝒢2M¯and1M¯=2M¯,\displaystyle\mathcal{G}^{\underline{M}}_{1}+\mathcal{B}^{\underline{M}}_{1}=\mathcal{G}^{\underline{M}}_{2}+\mathcal{B}^{\underline{M}}_{2}\hskip 10.00002pt\Longleftrightarrow\hskip 10.00002pt\mathcal{G}^{\underline{M}}_{1}=\mathcal{G}^{\underline{M}}_{2}\hskip 10.00002pt\text{and}\hskip 10.00002pt\mathcal{B}^{\underline{M}}_{1}=\mathcal{B}^{\underline{M}}_{2},

almost everywhere on 𝒮M¯\mathcal{S}_{\underline{M}} with respect to the Lebesgue measure.

Proof of Theorem 3.4 (Theorem 2 of the manuscript).

Given our conditions, the tensors i𝔐×𝔐\mathcal{B}_{i}\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}} have bandwidths 2δi,dMd2\lceil\delta_{i,d}M_{d}\rceil for i{1,2}i\in\{1,2\} and d=1,,Dd=1,\dots,D. Let δd=max{δ1,d,δ2,d}\delta_{d}=\max\{\delta_{1,d},\delta_{2,d}\} for d=1,,Dd=1,\dots,D, and assume without loss of generality that K1K2K_{1}\geq K_{2}, as in the statement of Theorem 3.4. Let Ω\Omega be the set of indices on which both 1\mathcal{B}_{1} and 2\mathcal{B}_{2} vanish:

Ω={(i¯,j¯)M¯×M¯:|idjd|>δdMdfor somed=1,,D}.\displaystyle\Omega=\left\{(\underline{i},\underline{j})\in\mathbb{N}_{\underline{M}}\times\mathbb{N}_{\underline{M}}\mathrel{\mathop{\ordinarycolon}}\mathinner{\!\left\lvert i_{d}-j_{d}\right\rvert}>\lceil\delta_{d}M_{d}\rceil\ \text{for some}\ d=1,\dots,D\right\}.

From 𝒢1+1=𝒢2+2\mathcal{G}_{1}+\mathcal{B}_{1}=\mathcal{G}_{2}+\mathcal{B}_{2}, we obtain that (𝒢1)(i¯,j¯)=(𝒢2)(i¯,j¯)(\mathcal{G}_{1})_{(\underline{i},\underline{j})}=(\mathcal{G}_{2})_{(\underline{i},\underline{j})} for all (i¯,j¯)Ω(\underline{i},\underline{j})\in\Omega. Let ad=(12max{δ1,d,δ2,d})Md1a_{d}=\left\lfloor\left(\frac{1}{2}-\max\{\delta_{1,d},\delta_{2,d}\}\right)M_{d}-1\right\rfloor and define

Ω𝒜={(i¯,j¯)M¯×M¯:1idad+1,MdadjdMd},\displaystyle\Omega_{\mathcal{A}}=\{(\underline{i},\underline{j})\in\mathbb{N}_{\underline{M}}\times\mathbb{N}_{\underline{M}}\mathrel{\mathop{\ordinarycolon}}1\leq i_{d}\leq a_{d}+1,M_{d}-a_{d}\leq j_{d}\leq M_{d}\},

then equation (1) of the manuscript implies that Ω𝒜Ω\Omega_{\mathcal{A}}\subset\Omega, which in turn implies that the tensors 𝒢1\mathcal{G}_{1} and 𝒢2\mathcal{G}_{2} contain a common subtensor of dimension a1××aD×a1××aDa_{1}\times\dots\times a_{D}\times a_{1}\times\dots\times a_{D}. Consequently, 𝒢1sq\mathcal{G}_{1\text{sq}} and 𝒢2sq\mathcal{G}_{2\text{sq}} contain a common submatrix of dimension a1aD×a1aDa_{1}\dots a_{D}\times a_{1}\dots a_{D}, and, by equation (1) of the manuscript, a common submatrix 𝑨\boldsymbol{A} of dimension K1×K1K_{1}\times K_{1}.

Assume that all order-K1K_{1} minors of 𝒢1sq\mathcal{G}_{1\text{sq}} are nonzero. Then the determinant of 𝑨\boldsymbol{A} is nonzero, which implies that K2=rank(𝒢2sq)K1K_{2}=\text{rank}(\mathcal{G}_{2\text{sq}})\geq K_{1}. Combined with the assumption K1K2K_{1}\geq K_{2}, this gives K1=K2K_{1}=K_{2}, which means 𝒢1\mathcal{G}_{1} and 𝒢2\mathcal{G}_{2} are two rank-K1K_{1} tensors equal on Ω\Omega.

Now, let 𝒢\mathcal{G}^{*} be a tensor equal to 𝒢1\mathcal{G}_{1} on Ω\Omega, but unknown at those indices not belonging to Ω\Omega. We will now show that there exists a unique rank-K1K_{1} completion of 𝒢\mathcal{G}^{*}. Due to equation (1) of the manuscript, it is possible to find a submatrix of 𝒢sq\mathcal{G}_{\text{sq}}^{*} with dimension (K1+1)×(K1+1)(K_{1}+1)\times(K_{1}+1) containing exactly one unobserved value, denoted xx^{*}. Using the fact that the determinant of any square submatrix of dimension larger than K1K_{1} is zero, we obtain a linear equation of the form ax+b=0ax^{*}+b=0 where aa is equal to the determinant of a submatrix of 𝒢sq\mathcal{G}_{\text{sq}}^{*} of dimension K1×K1K_{1}\times K_{1}. Since we have assumed that any minor of order K1K_{1} is nonzero, we have that a0a\neq 0 and the previous equation has a unique solution. It is then possible to impute the value of xx^{*}. Again, due to equation (1) of the manuscript, it is possible to apply this procedure iteratively until all missing entries are imputed, thereby allowing us to uniquely complete the tensor 𝒢\mathcal{G}^{*} into a rank-K1K_{1} tensor.

In summary, we have demonstrated that when all order-K1K_{1} minors of 𝒢1sq\mathcal{G}_{1\text{sq}} are nonzero, it holds that 𝒢=𝒢1=𝒢2\mathcal{G}^{*}=\mathcal{G}_{1}=\mathcal{G}_{2}, and hence 1=2\mathcal{B}_{1}=\mathcal{B}_{2}. Theorem 3.3 ensures that 𝒢1sq\mathcal{G}_{1\text{sq}} indeed has nonvanishing minors of order K1K_{1} almost everywhere on 𝒮M¯\mathcal{S}_{\underline{M}}. So we conclude that 𝒢1=𝒢2\mathcal{G}_{1}=\mathcal{G}_{2} and 1=2\mathcal{B}_{1}=\mathcal{B}_{2} almost everywhere on 𝒮M¯\mathcal{S}_{\underline{M}}. ∎
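The single-entry imputation step at the heart of this completion argument can be illustrated numerically; the sketch below builds a small rank-K matrix, treats one entry of a (K+1)×(K+1) submatrix as missing, and recovers it from the linear equation given by the vanishing determinant (all names and dimensions are illustrative).

```r
# Single-entry imputation via the vanishing determinant of a (K+1) x (K+1) submatrix.
set.seed(3)
K <- 2; M <- 6
U <- matrix(rnorm(M * K), M, K)
G <- U %*% t(U)                                   # rank-K matrix to be completed

sub <- G[1:(K + 1), 1:(K + 1)]                    # submatrix with one "missing" entry
truth <- sub[K + 1, K + 1]

det_at <- function(x) { s <- sub; s[K + 1, K + 1] <- x; det(s) }
b <- det_at(0)                                    # det is affine in the missing entry:
a <- det_at(1) - det_at(0)                        #   det(x) = a*x + b
imputed <- -b / a                                 # requires a != 0, i.e., a nonzero order-K minor

stopifnot(abs(imputed - truth) < 1e-8)
```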

3.3 Estimation

We now present theory for the estimators of 𝒢\mathcal{G}, KK, and k\mathcal{L}_{k} discussed in Section 3 of the manuscript. To motivate the estimator of 𝒢\mathcal{G}, we define an optimization problem for which 𝒢\mathcal{G} is the unique solution. Let ΘM¯\Theta_{\underline{M}} be the set of 𝔐×𝔐\mathfrak{M}\times\mathfrak{M} symmetric non-negative definite tensors with trace norm bounded by that of 𝒞^n\hat{\mathcal{C}}_{n}, which may itself be rescaled to have unit trace norm. Next, define the functional SM¯:ΘM¯S^{\underline{M}}\mathrel{\mathop{\ordinarycolon}}\Theta_{\underline{M}}\to\mathbb{R} by

SM¯(θ;τ)\displaystyle S^{\underline{M}}(\theta;\tau) =M2𝒜M¯(𝒞M¯θ)F2RM¯(θ)+τrank(θ),\displaystyle=\underbrace{M^{-2}\|\mathcal{A}^{\underline{M}}\circ(\mathcal{C}^{\underline{M}}-\theta)\|_{F}^{2}}_{R^{\underline{M}}(\theta)}+\tau\text{rank}(\theta),

where M=M1MDM=M_{1}\dots M_{D}, 𝒜M¯𝔐×𝔐\mathcal{A}^{\underline{M}}\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}} is defined by 𝒜M¯(i¯,j¯)=𝟙{|idjd|>Md/4\mathcal{A}^{\underline{M}}(\underline{i},\underline{j})=\mathds{1}\{\mathinner{\!\left\lvert i_{d}-j_{d}\right\rvert}>\lceil M_{d}/4\rceil for d=1,,D}d=1,\dots,D\}, and τ>0\tau>0 is a rank-penalizing parameter. The subsequent lemma shows how the functional SM¯S^{\underline{M}} is used to formulate two versions of the same optimization problem, both of which are solved uniquely by 𝒢\mathcal{G}.
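For concreteness, a sketch of the off-band mask and the fit term, written for D = 1 on the matricized scale; the dimension, threshold, and helper names (R_fit, S_obj) are illustrative.

```r
# Off-band mask A and the objective S(theta; tau) for D = 1 (illustrative dimension).
M <- 20
A_mask <- abs(outer(1:M, 1:M, "-")) > ceiling(M / 4)   # TRUE strictly off the band

# fit term R(theta): squared Frobenius norm of the masked residual, scaled by M^-2
R_fit <- function(C_mat, theta) sum(((C_mat - theta)[A_mask])^2) / M^2

# rank-penalized objective
S_obj <- function(C_mat, theta, tau) R_fit(C_mat, theta) + tau * qr(theta)$rank
```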

Lemma 3.1.

Let 𝒢:L2([0,1]D)L2([0,1]D)\mathscr{G}\mathrel{\mathop{\ordinarycolon}}L^{2}([0,1]^{D})\to L^{2}([0,1]^{D}) be a rank K<K<\infty covariance operator with analytic eigenfunctions and kernel gg, and :L2([0,1]D)L2([0,1]D)\mathscr{B}\mathrel{\mathop{\ordinarycolon}}L^{2}([0,1]^{D})\to L^{2}([0,1]^{D}) be a trace-class covariance operator with δ¯\underline{\delta}-banded kernel bb. For {s¯m¯}m¯M¯𝒮M¯\{\underline{s}_{\underline{m}}\}_{\underline{m}\in\mathbb{N}_{\underline{M}}}\in\mathcal{S}_{\underline{M}}, let

𝒢M¯={g(s¯m¯1,s¯m¯2)}m¯1,m¯2andM¯={b(s¯m¯1,s¯m¯2)}m¯1,m¯2,\displaystyle\mathcal{G}^{\underline{M}}=\{g(\underline{s}_{\underline{m}_{1}},\underline{s}_{\underline{m}_{2}})\}_{\underline{m}_{1},\underline{m}_{2}}\hskip 10.00002pt\text{and}\hskip 10.00002pt\mathcal{B}^{\underline{M}}=\{b(\underline{s}_{\underline{m}_{1}},\underline{s}_{\underline{m}_{2}})\}_{\underline{m}_{1},\underline{m}_{2}},

and 𝒞M¯=𝒢M¯+M¯\mathcal{C}^{\underline{M}}=\mathcal{G}^{\underline{M}}+\mathcal{B}^{\underline{M}}. Assume

maxdδd<1/4andKK=d=1DMd41.\displaystyle\max_{d}\delta_{d}<1/4\hskip 10.00002pt\text{and}\hskip 10.00002ptK\leq K^{*}=\prod_{d=1}^{D}\left\lfloor\frac{M_{d}}{4}-1\right\rfloor. (2)

Then for almost all grids in 𝒮M¯\mathcal{S}_{\underline{M}}:

  1. 1.

    The tensor 𝒢M¯\mathcal{G}^{\underline{M}} is the unique solution to the optimization problem

    minθ𝔐×𝔐rank(θ)subject toRM¯(θ)=0.\displaystyle\min_{\theta\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}}}\text{rank}(\theta)\hskip 10.00002pt\text{subject to}\hskip 10.00002ptR^{\underline{M}}(\theta)=0.
  2. 2.

    Equivalently, in penalized form,

    𝒢M¯=argminθ𝔐×𝔐SM¯(θ;τ),\displaystyle\mathcal{G}^{\underline{M}}=\arg\min_{\theta\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}}}S^{\underline{M}}(\theta;\tau),

    for sufficiently small τ>0\tau>0.

Proof.

We first show that 𝒢M¯\mathcal{G}^{\underline{M}} is the unique solution to the constrained optimization problem. Note that a solution θ\theta to the constrained problem must coincide with 𝒢M¯\mathcal{G}^{\underline{M}} on

Ω={(i¯,j¯):|idjd|>Md/4,d=1,,D},\displaystyle\Omega=\{(\underline{i},\underline{j})\mathrel{\mathop{\ordinarycolon}}\mathinner{\!\left\lvert i_{d}-j_{d}\right\rvert}>\lceil M_{d}/4\rceil,d=1,\dots,D\},

since 𝒜M¯(𝒞M¯θ)F=0\mathinner{\!\left\lVert\mathcal{A}^{\underline{M}}\circ(\mathcal{C}^{\underline{M}}-\theta)\right\rVert}_{F}=0. Then by condition (2) of the manuscript and Theorem 3.3, the square matricization of θ\theta has a non-zero minor of order KK, implying that rank(θ)K\text{rank}(\theta)\geq K. Since 𝒢M¯\mathcal{G}^{\underline{M}} satisfies the constraint and has rank KK, we have that rank(θ)=K\text{rank}(\theta)=K. Using the iterative procedure from the proof of Theorem 2 of the manuscript, we can show that θ\theta equals 𝒢M¯\mathcal{G}^{\underline{M}} everywhere.

Next, we show that 𝒢M¯\mathcal{G}^{\underline{M}} is the unique minimizer of S(;τ)S(\cdot;\tau) for sufficiently small τ>0\tau>0. First, consider some θ𝔐×𝔐\theta\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}} with rank(θ)K\text{rank}(\theta)\geq K. Since 𝒢M¯\mathcal{G}^{\underline{M}} uniquely solves the constrained problem, we have that for all τ>0\tau>0,

𝒜M¯(𝒞M¯𝒢M¯)F2+τrank(𝒢M¯)<𝒜M¯(𝒞M¯θ)F2+τrank(θ).\displaystyle\mathinner{\!\left\lVert\mathcal{A}^{\underline{M}}\circ(\mathcal{C}^{\underline{M}}-\mathcal{G}^{\underline{M}})\right\rVert}_{F}^{2}+\tau\text{rank}(\mathcal{G}^{\underline{M}})<\mathinner{\!\left\lVert\mathcal{A}^{\underline{M}}\circ(\mathcal{C}^{\underline{M}}-\theta)\right\rVert}_{F}^{2}+\tau\text{rank}(\theta).

Thus, θ\theta does not minimize S(;τ)S(\cdot;\tau). Next, consider some θ𝔐×𝔐\theta\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}} with rank(θ)K1\text{rank}(\theta)\leq K-1. Let μ=minθ𝔐×𝔐,rank(θ)K1𝒜M¯(𝒞M¯θ)F2>0\mu=\min_{\theta^{\prime}\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}},\text{rank}(\theta^{\prime})\leq K-1}\mathinner{\!\left\lVert\mathcal{A}^{\underline{M}}\circ(\mathcal{C}^{\underline{M}}-\theta^{\prime})\right\rVert}_{F}^{2}>0, where the inequality holds since a minimizer of 𝒜M¯(𝒞M¯θ)F2\mathinner{\!\left\lVert\mathcal{A}^{\underline{M}}\circ(\mathcal{C}^{\underline{M}}-\theta^{\prime})\right\rVert}_{F}^{2} has rank bounded below by KK. Let τ=μK1\tau^{*}=\frac{\mu}{K-1}. Then for all τ<τ\tau<\tau^{*},

𝒜M¯(𝒞M¯𝒢M¯)F2+τrank(𝒢M¯)=τK<μ+τ𝒜M¯(𝒞M¯θ)F2+τrank(θ).\displaystyle\mathinner{\!\left\lVert\mathcal{A}^{\underline{M}}\circ(\mathcal{C}^{\underline{M}}-\mathcal{G}^{\underline{M}})\right\rVert}_{F}^{2}+\tau\text{rank}(\mathcal{G}^{\underline{M}})=\tau K<\mu+\tau\leq\mathinner{\!\left\lVert\mathcal{A}^{\underline{M}}\circ(\mathcal{C}^{\underline{M}}-\theta)\right\rVert}_{F}^{2}+\tau\text{rank}(\theta).

Thus, θ\theta does not minimize S(;τ)S(\cdot;\tau). ∎

In Lemma 3.1, we present constrained and penalized formulations of the same optimization problem. The constrained problem informs practical implementation of the estimator, while the penalized version motivates the formal estimator of 𝒢\mathcal{G}. To make this estimation problem tractable, we assume that 𝒢\mathcal{G} belongs to the subset ΘM¯ΘM¯\Theta_{\underline{M}}^{*}\subset\Theta_{\underline{M}} defined by

ΘM¯={θΘM¯:𝒫(θ)<B,B<},\displaystyle\Theta_{\underline{M}}^{*}=\{\theta\in\Theta_{\underline{M}}\mathrel{\mathop{\ordinarycolon}}\mathscr{P}(\theta)<B,B<\infty\},

where 𝒫:𝔐×𝔐\mathscr{P}\mathrel{\mathop{\ordinarycolon}}\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}}\to\mathbb{R} is some roughness-penalizing functional. On this space, define the functional SnM¯:ΘM¯S_{n}^{\underline{M}}\mathrel{\mathop{\ordinarycolon}}\Theta_{\underline{M}}^{*}\to\mathbb{R} by

SnM¯(θ;τ,αn)=M2𝒜M¯(𝒞^nM¯θ)F2RnM¯(θ)+τrank(θ)+αn𝒫(θ),\displaystyle S^{\underline{M}}_{n}(\theta;\tau,\alpha_{n})=\underbrace{M^{-2}\|\mathcal{A}^{\underline{M}}\circ(\hat{\mathcal{C}}_{n}^{\underline{M}}-\theta)\|_{F}^{2}}_{R_{n}^{\underline{M}}(\theta)}+\tau\text{rank}(\theta)+\alpha_{n}\mathscr{P}(\theta),

where αn>0\alpha_{n}>0 is a roughness-penalizing parameter. This functional, which may be viewed as a roughness-penalizing empirical analogue of SM¯S^{\underline{M}}, is used to define the estimator for 𝒢\mathcal{G}.

Definition 3.1 (Estimator of 𝒢M¯\mathcal{G}^{\underline{M}}).

Suppose (X1,,Xn)(X_{1},\dots,X_{n}) are an i.i.d. sample from a functional KK-factor model. Let {s¯m¯}m¯M¯𝒮M¯\{\underline{s}_{\underline{m}}\}_{\underline{m}\in\mathbb{N}_{\underline{M}}}\in\mathcal{S}_{\underline{M}} and assume we observe

𝒳i,m¯=Xi(s¯m¯),i=1,,n,m¯M¯.\displaystyle\mathcal{X}_{i,\underline{m}}=X_{i}(\underline{s}_{\underline{m}}),\hskip 10.00002pti=1,\dots,n,\hskip 10.00002pt\underline{m}\in\mathbb{N}_{\underline{M}}.

Let 𝒞^nM¯𝔐×𝔐\hat{\mathcal{C}}_{n}^{\underline{M}}\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}} be the empirical covariance tensor of the tensors

{{𝒳i,m¯}m¯M¯}i=1n.\displaystyle\left\{\{\mathcal{X}_{i,\underline{m}}\}_{\underline{m}\in\mathbb{N}_{\underline{M}}}\right\}_{i=1}^{n}.

We define the estimator 𝒢^nM¯ΘM¯\hat{\mathcal{G}}^{\underline{M}}_{n}\in\Theta_{\underline{M}}^{*} of 𝒢M¯\mathcal{G}^{\underline{M}} to be an approximate minimizer of SnM¯(θ;τ,αn)S_{n}^{\underline{M}}(\theta;\tau,\alpha_{n}), where τ>0\tau>0 is sufficiently small, and αn0\alpha_{n}\to 0 as nn\to\infty. By approximate minimizer, we mean that 𝒢^nM¯\hat{\mathcal{G}}^{\underline{M}}_{n} satisfies

SnM¯(𝒢^nM¯)infθSnM¯(θ)+oP(1).\displaystyle S^{\underline{M}}_{n}(\hat{\mathcal{G}}^{\underline{M}}_{n})\leq\inf_{\theta}S^{\underline{M}}_{n}(\theta)+o_{P}(1).

The rank and scaled eigentensors of the estimator in Definition 3.1 produce estimators for the number of factors KK and the loading tensors k\mathcal{L}_{k}, respectively.

Definition 3.2 (Estimators of KK and kM¯\mathcal{L}_{k}^{\underline{M}}).

Consider the setting and estimator 𝒢^nM¯\hat{\mathcal{G}}^{\underline{M}}_{n} presented in Definition 3.1. We define estimators for the number of factors KK and the kkth loading function kM¯\mathcal{L}_{k}^{\underline{M}} as

K^n=rank(𝒢^nM¯)and^k,nM¯=λ^k,n1/2^k,nM¯,\displaystyle\hat{K}_{n}=\text{rank}(\hat{\mathcal{G}}^{\underline{M}}_{n})\hskip 10.00002pt\text{and}\hskip 10.00002pt\hat{\mathcal{L}}_{k,n}^{\underline{M}}=\hat{\lambda}_{k,n}^{1/2}\hat{\mathcal{H}}_{k,n}^{\underline{M}},

where λ^k,n\hat{\lambda}_{k,n} and ^k,nM¯\hat{\mathcal{H}}_{k,n}^{\underline{M}} are the kkth eigenvalue and eigentensor of 𝒢^nM¯\hat{\mathcal{G}}^{\underline{M}}_{n}, respectively.

In Theorem 3.5, we show that 𝒢^\hat{\mathcal{G}}, K^\hat{K}, and ^k\hat{\mathcal{L}}_{k}, k=1,,K^k=1,\dots,\hat{K} are consistent for their population analogues.

Theorem 3.5 (Consistency).

Suppose k\mathcal{L}_{k} is equal to the kkth scaled eigentensor of 𝒢\mathcal{G} for k=1,,Kk=1,\dots,K. Then for sufficiently small τ>0\tau>0 and αn0\alpha_{n}\to 0, we have

  1. (i) $\bigl\|\hat{\mathcal{G}}_{n}-\mathcal{G}\bigr\|_{F}^{2}\overset{P}{\to}0$,

  2. (ii) $\bigl|\mathrm{rank}(\hat{\mathcal{G}}_{n})-K\bigr|\overset{P}{\to}0$, and

  3. (iii) $\bigl\|\hat{\mathcal{L}}_{n,k}-\mathcal{L}_{k}\bigr\|_{F}^{2}\overset{P}{\to}0$ for $k=1,\dots,K$.

Proof.

In proving (i), we assume $D=1$ so that all tensors in the problem are $M\times M$ matrices; for $D>1$, one simply square-matricizes each tensor and proceeds in the same way. By Corollary 3.2.3 of van der Vaart and Wellner (1996), the result follows after verifying that:

  1. (a) $\sup_{\theta\in\Theta_{M}^{*}}\bigl|S_{n}^{M}(\theta)-S^{M}(\theta)\bigr|\overset{a.s.}{\to}0$,

  2. (b) $S^{M}(\theta)$ is lower semi-continuous on $\Theta_{M}^{*}$,

  3. (c) $S^{M}(\theta)$ has a unique minimum at $\mathcal{G}$, and

  4. (d) $\{\hat{\mathcal{G}}_{n}\}$ is uniformly tight.

Note that (c) was already established in the Lemma from the manuscript. After scaling $\hat{\mathcal{C}}_{n}^{M}$ to have unit trace norm, (d) holds as well. To verify (b), note that $R^{M}(\theta)$ is continuous and $\mathrm{rank}(\theta)$ is lower semi-continuous; being the sum of a continuous and a lower semi-continuous function, $S^{M}(\theta)$ is itself lower semi-continuous.

To check (a), note that for $\theta\in\Theta_{M}^{*}$,

\begin{align*}
\bigl|S_{n}^{M}(\theta;\tau,\alpha_{n})-S^{M}(\theta;\tau)\bigr|
&=\bigl|R_{n}^{M}(\theta)-R^{M}(\theta)+\alpha_{n}\mathscr{P}(\theta)\bigr|\\
&\leq\Bigl|\bigl\|\mathcal{A}^{M}\circ(\hat{\mathcal{C}}_{n}^{M}-\theta)\bigr\|_{F}^{2}-\bigl\|\mathcal{A}^{M}\circ(\mathcal{C}^{M}-\theta)\bigr\|_{F}^{2}\Bigr|+\alpha_{n}\mathscr{P}(\theta)\\
&\leq\Bigl|\bigl\|\mathcal{A}^{M}\circ(\hat{\mathcal{C}}_{n}^{M}-\theta)\bigr\|_{F}-\bigl\|\mathcal{A}^{M}\circ(\mathcal{C}^{M}-\theta)\bigr\|_{F}\Bigr|\\
&\qquad\times\Bigl(\bigl\|\mathcal{A}^{M}\circ(\hat{\mathcal{C}}_{n}^{M}-\theta)\bigr\|_{F}+\bigl\|\mathcal{A}^{M}\circ(\mathcal{C}^{M}-\theta)\bigr\|_{F}\Bigr)+\alpha_{n}\mathscr{P}(\theta)\\
&\leq\bigl\|\mathcal{A}^{M}\circ(\hat{\mathcal{C}}_{n}^{M}-\mathcal{C}^{M})\bigr\|_{F}\Bigl(2+\bigl\|\hat{\mathcal{C}}_{n}^{M}\bigr\|_{F}+\bigl\|\mathcal{C}^{M}\bigr\|_{F}\Bigr)+\alpha_{n}B\\
&\overset{a.s.}{\to}0,
\end{align*}

where the convergence follows from $\hat{\mathcal{C}}_{n}^{M}\to\mathcal{C}^{M}$ almost surely and $\alpha_{n}\to 0$. This completes the proof of (i).

To prove (ii), we again assume $D=1$ and extend the proof to $D>1$ via square matricization. Proceeding by contradiction, suppose that (ii) does not hold. Then for some subsequence $\{n^{\prime}\}\subset\{n\}$, there exist $\epsilon>0$ and $\delta>0$ such that $\mathbb{P}\bigl(\bigl|\mathrm{rank}(\hat{\mathcal{G}}_{n^{\prime}})-K\bigr|>\epsilon\bigr)>\delta$ for all $n^{\prime}$. Consequently, there exists either

  1. (e) a subsequence $\{j\}\subset\{n^{\prime}\}$ such that $\mathbb{P}\bigl(\mathrm{rank}(\hat{\mathcal{G}}_{j})<K\bigr)>\delta/2$ for all $j$, or

  2. (f) a subsequence $\{i\}\subset\{n^{\prime}\}$ such that $\mathbb{P}\bigl(\mathrm{rank}(\hat{\mathcal{G}}_{i})>K\bigr)>\delta/2$ for all $i$.

Focusing on (e), the convergence of $\hat{\mathcal{G}}_{n}$ to $\mathcal{G}$ in probability implies there exists a further subsequence $\{j^{\prime}\}\subset\{j\}$ such that $\hat{\mathcal{G}}_{j^{\prime}}\to\mathcal{G}$ almost surely. Hence, on a set with probability at least $\delta/2$, $\hat{\mathcal{G}}_{j^{\prime}}\to\mathcal{G}$ and $\mathrm{rank}(\hat{\mathcal{G}}_{j^{\prime}})<K$ for all $j^{\prime}$. This is impossible, since the set of matrices with rank at most $K-1$ is closed.

We thus focus on (f). Again, since $\hat{\mathcal{G}}_{n}$ converges to $\mathcal{G}$ in probability, there exists a further subsequence $\{i^{\prime}\}\subset\{i\}$ for which $\hat{\mathcal{G}}_{i^{\prime}}\to\mathcal{G}$ almost surely. Hence, on a set with probability at least $\delta/2$, $\hat{\mathcal{G}}_{i^{\prime}}\to\mathcal{G}$ and $\mathrm{rank}(\hat{\mathcal{G}}_{i^{\prime}})>K$ for all $i^{\prime}$. Working on this set, for all $i^{\prime}$,

\begin{align*}
R_{i^{\prime}}^{M}(\hat{\mathcal{G}}_{i^{\prime}})+\tau(K+1)+\alpha_{i^{\prime}}\mathscr{P}(\hat{\mathcal{G}}_{i^{\prime}})
&\leq R_{i^{\prime}}^{M}(\hat{\mathcal{G}}_{i^{\prime}})+\tau\,\mathrm{rank}(\hat{\mathcal{G}}_{i^{\prime}})+\alpha_{i^{\prime}}\mathscr{P}(\hat{\mathcal{G}}_{i^{\prime}})\\
&\leq\inf_{\theta\in\Theta_{M}^{*}:\,\mathrm{rank}(\theta)=K}\bigl\{R_{i^{\prime}}^{M}(\theta)+\tau\,\mathrm{rank}(\theta)+\alpha_{i^{\prime}}\mathscr{P}(\theta)\bigr\} &&(3)\\
&\leq\inf_{\theta\in\Theta_{M}^{*}:\,\mathrm{rank}(\theta)=K}\bigl\{R_{i^{\prime}}^{M}(\theta)+\tau K+\alpha_{i^{\prime}}\mathscr{P}(\theta)\bigr\}\\
&\leq\inf_{\theta\in\Theta_{M}^{*}:\,\mathrm{rank}(\theta)=K}R_{i^{\prime}}^{M}(\theta)+\tau K+\alpha_{i^{\prime}}B, &&(4)
\end{align*}

where line (3) follows because $\hat{\mathcal{G}}_{i^{\prime}}$ is a minimizer of $S_{i^{\prime}}^{M}$. From the proof of (i), we have that

\[
\sup_{\theta\in\Theta_{M}^{*}}\bigl|R_{i^{\prime}}^{M}(\theta)+\alpha_{i^{\prime}}\mathscr{P}(\theta)-R^{M}(\theta)\bigr|\overset{a.s.}{\to}0, \tag{5}
\]

which implies

\[
R_{i^{\prime}}^{M}(\hat{\mathcal{G}}_{i^{\prime}})+\alpha_{i^{\prime}}\mathscr{P}(\hat{\mathcal{G}}_{i^{\prime}})-R^{M}(\hat{\mathcal{G}}_{i^{\prime}})\overset{a.s.}{\to}0. \tag{6}
\]

Moreover, since $R^{M}$ is continuous and $\hat{\mathcal{G}}_{i^{\prime}}\to\mathcal{G}$ almost surely, by continuous mapping we have

\[
R^{M}(\hat{\mathcal{G}}_{i^{\prime}})\overset{a.s.}{\to}R^{M}(\mathcal{G})=0. \tag{7}
\]

Combining the results of lines (6) and (7) gives

\[
R_{i^{\prime}}^{M}(\hat{\mathcal{G}}_{i^{\prime}})+\alpha_{i^{\prime}}\mathscr{P}(\hat{\mathcal{G}}_{i^{\prime}})\overset{a.s.}{\to}0. \tag{8}
\]

Now, note that on the set $\{\theta\in\Theta_{M}^{*} : \mathrm{rank}(\theta)=K\}$ the $R_{i^{\prime}}^{M}$ are equi-Lipschitz continuous. From this and the uniform convergence in line (5), we have

\[
\inf_{\theta\in\Theta_{M}^{*}:\,\mathrm{rank}(\theta)=K}R_{i^{\prime}}^{M}(\theta)\to\inf_{\theta\in\Theta_{M}^{*}:\,\mathrm{rank}(\theta)=K}R^{M}(\theta)=0. \tag{9}
\]

By the inequality in line (4) and the results in lines (8) and (9), we derive the contradiction that $\tau\leq 0$. This proves (ii).

Finally, (iii) follows immediately from (i), thus completing the proof. ∎

4 Tuning Procedures

Recall from Section 3.1.1 of the manuscript that we must choose a value for the smoothing parameter $\alpha$ before estimating $\mathcal{G}$. We find that subjective selection is often preferable, but some settings (e.g., a simulation study) require automatic selection. One may select $\alpha$ automatically using $V$-fold cross-validation, as follows (a schematic implementation is sketched after the list):

  1. (1a) Partition the sample $\mathcal{X}_{1},\dots,\mathcal{X}_{n}$ into $V$ folds of equal (or nearly equal) size.

  2. (1b) For a given $\alpha$, let $\tilde{\mathcal{G}}^{(v)}(\alpha)$ be the solution to the optimization problem
\[
\min_{0\preceq\theta\in\mathbb{R}^{\mathfrak{M}\times\mathfrak{M}}}\; M^{-2}\bigl\|\mathcal{A}\circ\bigl(\hat{\mathcal{C}}_{n}^{(v)}-\theta\bigr)\bigr\|_{F}^{2}+M^{-1}\alpha\sum_{k=1}^{\mathrm{rank}(\theta)}\tilde{\mathscr{P}}\bigl(\mathscr{E}_{k}(\theta)\bigr)\quad\text{subject to}\quad\mathrm{rank}(\theta)\leq K^{*},
\]
where $\hat{\mathcal{C}}_{n}^{(v)}$ denotes the empirical covariance of all samples except those in the $v$th fold and $K^{*}$ is the maximal rank of $\mathcal{G}$. That is, $\tilde{\mathcal{G}}^{(v)}(\alpha)$ is the maximal-rank version of the estimator in Definition 1 of the manuscript, computed on the sample excluding the $v$th fold.

  3. (1c) Compute the cross-validation scores
\[
\text{CV}(\alpha)=\frac{1}{VM^{2}}\sum_{v=1}^{V}\bigl\|\mathcal{A}\circ\bigl(\hat{\mathcal{C}}_{n}^{v}-\tilde{\mathcal{G}}^{(v)}(\alpha)\bigr)\bigr\|_{F}^{2},
\]
where $\hat{\mathcal{C}}_{n}^{v}$ is the empirical covariance of the samples in the $v$th fold.

  4. (1d) Minimize $\text{CV}(\alpha)$ over the candidate values to obtain the chosen smoothing parameter.
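The sketch below illustrates steps (1a)--(1d) for $D=1$. The solver for the rank-constrained, roughness-penalized problem in step (1b) is abstracted as a user-supplied function fit_global_cov; this function, the 0--1 masking matrix mask, and the data layout (an $n\times M$ array) are assumptions made for illustration only.

\begin{verbatim}
import numpy as np

def cv_select_alpha(X, alphas, K_star, mask, fit_global_cov, V=5, seed=0):
    # X: n x M array of discretely observed curves (D = 1).
    # fit_global_cov(C_hat, mask, alpha, K_star): hypothetical solver for step (1b).
    n, M = X.shape
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), V)                     # step (1a)
    scores = []
    for alpha in alphas:
        err = 0.0
        for v in range(V):
            test = folds[v]
            train = np.setdiff1d(np.arange(n), test)
            C_train = np.cov(X[train], rowvar=False, bias=True)
            C_test = np.cov(X[test], rowvar=False, bias=True)
            G_tilde = fit_global_cov(C_train, mask, alpha, K_star)    # step (1b)
            err += np.sum((mask * (C_test - G_tilde)) ** 2) / M ** 2  # step (1c)
        scores.append(err / V)
    return alphas[int(np.argmin(scores))], scores                     # step (1d)
\end{verbatim}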

One might wonder whether the above procedure could be used to tune a vector of smoothing parameters $\underline{\alpha}$ that applies a different level of smoothing to each eigentensor. We do not recommend this, for the following reason. Recall that we solve the problem in (1b) by re-framing it as a factorized matrix completion problem, as in equation (4) of the manuscript, that optimizes over $\boldsymbol{V}\in\mathbb{R}^{M\times K^{*}}$, $M=M_{1}\cdots M_{D}$. For fixed $\boldsymbol{V}\boldsymbol{V}^{T}$, the error term in equation (4) of the manuscript is constant in $\boldsymbol{V}$, while the penalty term varies as we rotate $\boldsymbol{V}$. This means that rough loading tensors can avoid penalization by “hiding” from larger $\alpha_{k}$ in some rotation of $\boldsymbol{V}$, making it difficult to target a specific loading tensor with a particular roughness penalty.
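To see the rotation invariance explicitly, note that for any orthogonal $Q\in\mathbb{R}^{K^{*}\times K^{*}}$,
\[
(\boldsymbol{V}Q)(\boldsymbol{V}Q)^{T}=\boldsymbol{V}QQ^{T}\boldsymbol{V}^{T}=\boldsymbol{V}\boldsymbol{V}^{T},
\]
so the error term, which depends on $\boldsymbol{V}$ only through $\boldsymbol{V}\boldsymbol{V}^{T}$, is unchanged by the rotation, whereas a columnwise penalty of the form $\sum_{k}\alpha_{k}\mathscr{P}(\boldsymbol{v}_{k})$, with $\boldsymbol{v}_{k}$ the $k$th column of $\boldsymbol{V}$, generally is not.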

Next, recall from Section 3.2 of the manuscript that some tuning procedure is needed to systematically choose the shrinkage parameter $\underline{\kappa}$. One option is to choose $\underline{\kappa}$ using $V$-fold cross-validation, as follows (steps (2e) and (2f) are sketched in code after the list):

  1. (2a) Apply a varimax rotation to the initial estimate $\hat{\mathfrak{L}}$ to obtain $\hat{\mathfrak{L}}^{*}$.

  2. (2b) Partition the sample $\mathcal{X}_{1},\dots,\mathcal{X}_{n}$ into $V$ folds of equal (or nearly equal) size.

  3. (2c) Using the tuned smoothing parameter $\alpha^{*}$, compute $\hat{\mathfrak{L}}^{(v)}(\alpha^{*})$ as in step (b) of the estimation procedure for $\mathcal{G}$ (see Section 3.1.1 of the manuscript), setting $K^{*}=\hat{K}$. We suppress $\alpha^{*}$ from here forward.

  4. (2d) Rotate each $\hat{\mathfrak{L}}^{(v)}$ towards the target $\hat{\mathfrak{L}}^{*}$ (using GPArotation::targetT from Bernaards et al. (2015)) to obtain $(\hat{\mathfrak{L}}^{(v)})^{*}$.

  5. (2e) For a given $\underline{\kappa}$, adaptively soft-threshold $(\hat{\mathfrak{L}}^{(v)})^{*}$ as in equation (5) of the manuscript to obtain $\tilde{\mathfrak{L}}^{(v)}(\underline{\kappa})$.

  6. (2f) Compute the cross-validation scores
\[
\text{CV}(\underline{\kappa})=\frac{1}{VM^{2}}\sum_{v=1}^{V}\bigl\|\mathcal{A}\circ\bigl(\hat{\mathcal{C}}^{v}-\tilde{\mathcal{G}}^{(v)}(\underline{\kappa})\bigr)\bigr\|_{F}^{2},
\]
where $\tilde{\mathcal{G}}^{(v)}(\underline{\kappa})=\sum_{k=1}^{K}\tilde{\mathfrak{L}}_{\cdot,k}^{(v)}(\underline{\kappa})\otimes\tilde{\mathfrak{L}}_{\cdot,k}^{(v)}(\underline{\kappa})$ and $\hat{\mathcal{C}}^{v}$ is the empirical covariance of the samples in the $v$th fold.

  7. (2g) Minimize $\text{CV}(\underline{\kappa})$ to obtain the chosen shrinkage parameter.
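To make steps (2e) and (2f) concrete for $D=1$, the sketch below thresholds a fold-wise loading matrix and scores the resulting fit. The plain soft-thresholding rule is a simplified stand-in for the adaptive rule in equation (5) of the manuscript, whose exact form is not reproduced here, and mask is again an illustrative 0--1 stand-in for $\mathcal{A}$.

\begin{verbatim}
import numpy as np

def soft_threshold(L, kappa):
    # Plain soft-thresholding; a simplified stand-in for the adaptive rule in
    # equation (5) of the manuscript.
    return np.sign(L) * np.maximum(np.abs(L) - kappa, 0.0)

def cv_score_kappa(L_fold_rotated, C_test, mask, kappa):
    # Step (2e): threshold the rotated fold-wise loadings (an M x K matrix).
    L_tilde = soft_threshold(L_fold_rotated, kappa)
    # Step (2f): rebuild G_tilde = sum_k L_k (x) L_k and score the masked fit.
    M = L_tilde.shape[0]
    G_tilde = L_tilde @ L_tilde.T
    return np.sum((mask * (C_test - G_tilde)) ** 2) / M ** 2
\end{verbatim}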

5 Rank Simulation Study

To study the behavior of the scree plot approach, we simulated a single instance of each configuration and plotted the quantities $f(\hat{\theta}_{j})$ for $j=1,\dots,8$. Figures 1, 2, 3, and 4 display the resulting scree plots. As expected, R2 presents a harder problem than R1, while rank selection generally becomes easier as $n$ grows. Post-hoc analysis of some selection failures (for instance, when the scree plot appears non-decreasing or under-selects the rank) reveals that over-smoothing can force the optimization into bad local optima, and that such optima can sometimes be avoided with a more conservative choice of the smoothing parameter. In particular, for several R2 configurations, the scree plot approach fails more frequently for the NET loading scheme than for the BL scheme. As discussed in Section 4.1 of the manuscript, this is likely due to the application of a single smoothing parameter to the loading tensors of NET, which exhibit varying degrees of smoothness. Despite these occasional failures, our simulation study supports the scree plot approach as a viable rank selection method.

Figure 1: Plots of the function $j\mapsto f(\hat{\theta}_{j})$ for $K=2$ and $\delta=0.05$; panels (a)–(d) correspond to Scenarios S1–S4. Each plot is for some combination of scenario, regime, and sample size $n$.

Figure 2: Plots of the function $j\mapsto f(\hat{\theta}_{j})$ for $K=2$ and $\delta=0.1$; panels (a)–(d) correspond to Scenarios S1–S4. Each plot is for some combination of scenario, regime, and sample size $n$.

Figure 3: Plots of the function $j\mapsto f(\hat{\theta}_{j})$ for $K=4$ and $\delta=0.05$; panels (a)–(d) correspond to Scenarios S1–S4. Each plot is for some combination of scenario, regime, and sample size $n$.

Figure 4: Plots of the function $j\mapsto f(\hat{\theta}_{j})$ for $K=4$ and $\delta=0.1$; panels (a)–(d) correspond to Scenarios S1–S4. Each plot is for some combination of scenario, regime, and sample size $n$.

6 AOMIC Data Analysis

6.1 Output from Higher-Order Models

This section contains supplementary figures from the analysis of resting-state data from the Amsterdam Open MRI Collection (AOMIC) PIOP1 dataset in Section 5 of the manuscript.

Figure 5 demonstrates the effect of prewhitening on voxel time courses. Figure 5a shows the time course and corresponding autocorrelation function (acf) for each of three example voxels from one subject’s preprocessed data, while Figure 5b displays each time course’s ARIMA residuals, scaled to unit variance, with the accompanying acf.
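As an illustration of the prewhitening step, the sketch below extracts unit-variance ARIMA residuals for a single voxel time course using the statsmodels package; the fixed order $(1,1,1)$ is a placeholder and not the order used in the analysis.

\begin{verbatim}
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def prewhiten_voxel(ts, order=(1, 1, 1)):
    # Fit an ARIMA model to one voxel time course and return its residuals
    # scaled to unit variance; the order here is a placeholder choice.
    fit = ARIMA(np.asarray(ts, dtype=float), order=order).fit()
    resid = np.asarray(fit.resid)
    return resid / resid.std(ddof=1)
\end{verbatim}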

Figure 5: (a) The preprocessed BOLD time course and autocorrelation function (acf) for three example voxels whose time series exhibit varying degrees of nonstationarity and temporal correlation. (b) The scaled ARIMA residuals and acf for the three voxel time courses depicted in (a). In the acf plot of a white noise series, we expect 95% of the nonzero-lag spikes to lie within the plot’s blue dotted lines.

The left-hand plot of Figure 6 displays the non-increasing function $g(j)=\|\mathcal{A}\circ(\hat{\mathcal{C}}_{n}-\hat{\theta}_{j})\|_{F}^{2}$ for $j=1,\dots,100$. To aid in identifying this plot’s elbow, the right-hand plot of Figure 6 shows the ratios $r(i)=g(i)/g(i+1)$ for $i=1,\dots,99$. The estimated number of factors $\hat{K}$ should be the $j$ at which $g(j)$ levels off or, equivalently, the $i$ at which $r(i)$ becomes a constant close to 1. As discussed in Section 6 of the manuscript, we set $\hat{K}=12$.
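Although we select $\hat{K}$ by visual inspection of Figure 6, the ratio rule can be automated in a few lines; the sketch below does so with a hypothetical flatness cutoff, which is a tuning choice of ours and not part of the analysis.

\begin{verbatim}
import numpy as np

def K_from_ratio_plot(g, flat_tol=1.05):
    # g[j-1] = g(j) = || A o (C_hat_n - theta_hat_j) ||_F^2, j = 1, ..., J.
    g = np.asarray(g, dtype=float)
    r = g[:-1] / g[1:]                    # r(i) = g(i) / g(i+1)
    flat = np.where(r < flat_tol)[0]      # indices where the ratio is close to 1
    # Return the first i (1-based) at which r(i) drops near 1; fall back to J.
    return int(flat[0]) + 1 if flat.size else len(g)
\end{verbatim}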

Figure 6: The scree-type plot (left) and ratio plot (right) used to select the number of factors in the functional factor analysis of the AOMIC resting-state data.

Figure 7 displays loading estimates from the 25-factor FFM while Figures 8 and 9 show those from the 50-factor FFM. Figure 10 displays independent component (IC) estimates from the 25-IC model while Figures 11 and 12 show those from the 50-IC model.

Figure 7: Loadings of the functional 25-factor model estimated from ARIMA residuals.

Figure 8: Loadings 1–25 of the functional 50-factor model estimated from ARIMA residuals.

Figure 9: Loadings 26–50 of the functional 50-factor model estimated from ARIMA residuals.

Figure 10: Transformed components ($t(x)=\mathrm{sgn}(x)\log(|x|+1)$) of the 25-component IC model estimated from the preprocessed scans.

Figure 11: Transformed components ($t(x)=\mathrm{sgn}(x)\log(|x|+1)$) 1–25 of the 50-component IC model estimated from the preprocessed scans.

Figure 12: Transformed components ($t(x)=\mathrm{sgn}(x)\log(|x|+1)$) 26–50 of the 50-component IC model estimated from the preprocessed scans.

6.2 Multi-Scale ICA

As described in Section 6 of the manuscript, multi-scale approaches, like that of Iraji et al. (2023), can remedy ICA’s sensitivity to dimension misspecification. Adapting the method of Iraji et al. (2023) to the AOMIC data, we first independently generate 25 random subsets of the full dataset, each containing the scans of 150 subjects. Next, we fit IC models of order 10, 20, 30, 40, and 50 to each subset, yielding 25 sets of 150 components. From the collection of all components, we select the 150 with the highest average spatial similarity (calculated by Pearson correlation) across the 25 sets, then identify the components among the 150 that are most distinct from each other (spatial similarity $<0.5$); a minimal sketch of this selection step is given below. The final 45 components are displayed in Figures 13 and 14. Among this final set is a single component ($k=10$) containing the ventricle correlations of the first component from the 12-IC model, a structure splintered across several components of higher-order IC models. Comparing Figure 9b of the manuscript to Figures 13 and 14 confirms that other low-order structures are similarly preserved. As discussed in Section 6 of the manuscript, this solution to the fragmentation problem comes at the cost of additional computational resources and an interpretation more convoluted than that of approaches based on single IC models.
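The sketch below illustrates the final de-duplication step, assuming the component maps are stored as rows of an array and that spatial similarity is measured by absolute Pearson correlation; the greedy pass and these representational choices are illustrative and not a reproduction of the implementation of Iraji et al. (2023).

\begin{verbatim}
import numpy as np

def select_distinct_components(components, sim_cut=0.5):
    # components: array of shape (n_components, n_voxels), one spatial map per row.
    comps = np.asarray(components, dtype=float)
    sim = np.abs(np.corrcoef(comps))      # pairwise spatial similarity
    keep = []
    for k in range(comps.shape[0]):
        # Keep component k only if it is distinct from everything kept so far.
        if all(sim[k, j] < sim_cut for j in keep):
            keep.append(k)
    return comps[keep], keep
\end{verbatim}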

Figure 13: Transformed components ($t(x)=\mathrm{sgn}(x)\log(|x|+1)$) 1–25 obtained via multi-scale ICA.

Figure 14: Transformed components ($t(x)=\mathrm{sgn}(x)\log(|x|+1)$) 26–45 obtained via multi-scale ICA.

References

  • Afyouni et al. [2019] Soroosh Afyouni, Stephen M Smith, and Thomas E Nichols. Effective degrees of freedom of the pearson’s correlation coefficient under autocorrelation. NeuroImage, 199:609–625, 2019.
  • Bai [2003] Jushan Bai. Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171, 2003. doi: https://doi.org/10.1111/1468-0262.00392. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1468-0262.00392.
  • Beckmann and Smith [2004] Christian F Beckmann and Stephen M Smith. Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE transactions on medical imaging, 23(2):137–152, 2004.
  • Bernaards et al. [2015] Coen Bernaards, Robert Jennrich, and Maintainer Paul Gilbert. Package ‘gparotation’. Retrieved February, 19:2015, 2015.
  • Biswal et al. [1995] Bharat Biswal, F Zerrin Yetkin, Victor M Haughton, and James S Hyde. Functional connectivity in the motor cortex of resting human brain using echo-planar mri. Magnetic resonance in medicine, 34(4):537–541, 1995.
  • Broyden [1970] Charles George Broyden. The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA Journal of Applied Mathematics, 6(1):76–90, 1970.
  • Bullmore et al. [2001] Ed Bullmore, Chris Long, John Suckling, Jalal Fadili, Gemma Calvert, Fernando Zelaya, T Adrian Carpenter, and Mick Brammer. Colored noise and computational inference in neurophysiological (fmri) time series analysis: resampling methods in time and wavelet domains. Human brain mapping, 12(2):61–78, 2001.
  • Calhoun et al. [2001] Vince D Calhoun, Tulay Adali, Godfrey D Pearlson, and James J Pekar. A method for making group inferences from functional mri data using independent component analysis. Human brain mapping, 14(3):140–151, 2001.
  • Chamberlain and Rothschild [1983] Gary Chamberlain and Michael Rothschild. Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica, 51(5):1281–1304, 1983. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/1912275.
  • Christova et al. [2011] P Christova, SM Lewis, TA Jerde, JK Lynch, and Apostolos P Georgopoulos. True associations between resting fmri time series based on innovations. Journal of neural engineering, 8(4):046025, 2011.
  • Descary and Panaretos [2019] Marie-Hélène Descary and Victor M. Panaretos. Functional data analysis by matrix completion. The Annals of Statistics, 47(1):1 – 38, 2019. doi: 10.1214/17-AOS1590. URL https://doi.org/10.1214/17-AOS1590.
  • Esteban et al. [2019] Oscar Esteban, Christopher J Markiewicz, Ross W Blair, Craig A Moodie, A Ilkay Isik, Asier Erramuzpe, James D Kent, Mathias Goncalves, Elizabeth DuPre, Madeleine Snyder, et al. fmriprep: a robust preprocessing pipeline for functional mri. Nature methods, 16(1):111–116, 2019.
  • Fletcher [1970] Roger Fletcher. A new approach to variable metric algorithms. The computer journal, 13(3):317–322, 1970.
  • Forni et al. [2000] Mario Forni, Marc Hallin, Marco Lippi, and Lucrezia Reichlin. The Generalized Dynamic-Factor Model: Identification and Estimation. The Review of Economics and Statistics, 82(4):540–554, 11 2000. ISSN 0034-6535. doi: 10.1162/003465300559037. URL https://doi.org/10.1162/003465300559037.
  • Friston [1998] Karl J Friston. Modes or models: a critique on independent component analysis for fmri. Trends in cognitive sciences, 2(10):373–375, 1998.
  • Friston [2011] Karl J Friston. Functional and effective connectivity: a review. Brain connectivity, 1(1):13–36, 2011.
  • Friston et al. [1993] Karl J Friston, Chris D Frith, Peter F Liddle, and Richard SJ Frackowiak. Functional connectivity: the principal-component analysis of large (pet) data sets. Journal of Cerebral Blood Flow & Metabolism, 13(1):5–14, 1993.
  • Geweke [1977] John Geweke. The dynamic factor analysis of economic time series. Latent variables in socio-economic models, pages 365 – 383, 1977.
  • Goldfarb [1970] Donald Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of computation, 24(109):23–26, 1970.
  • Hays et al. [2012] Spencer Hays, Haipeng Shen, and Jianhua Z. Huang. Functional dynamic factor models with application to yield curve forecasting. The Annals of Applied Statistics, 6(3):870 – 894, 2012. doi: 10.1214/12-AOAS551. URL https://doi.org/10.1214/12-AOAS551.
  • Hsing and Eubank [2015] Tailen Hsing and Randall Eubank. Theoretical foundations of functional data analysis, with an introduction to linear operators, volume 997. John Wiley & Sons, 2015.
  • Hyndman and Khandakar [2008] Rob J Hyndman and Yeasmin Khandakar. Automatic time series forecasting: the forecast package for r. Journal of statistical software, 27:1–22, 2008.
  • Hyvärinen and Oja [2000] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
  • Hörmann and Jammoul [2020] Siegfried Hörmann and Fatima Jammoul. Preprocessing noisy functional data: a multivariate perspective, 2020.
  • Iraji et al. [2023] Armin Iraji, Z Fu, Ashkan Faghiri, M Duda, Jiayu Chen, Srinivas Rachakonda, Thomas DeRamus, Peter Kochunov, BM Adhikari, A Belger, et al. Identifying canonical and replicable multi-scale intrinsic connectivity networks in 100k+ resting-state fmri datasets. Human brain mapping, 44(17):5729–5748, 2023.
  • Jenkinson et al. [2012] Mark Jenkinson, Christian F Beckmann, Timothy EJ Behrens, Mark W Woolrich, and Stephen M Smith. Fsl. Neuroimage, 62(2):782–790, 2012.
  • Johnson and Wichern [2002] Richard A. Johnson and Dean W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, Upper Saddle River, 2002.
  • Kaiser [1958] Henry F Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3):187–200, 1958.
  • Kokoszka and Reimherr [2017] Piotr Kokoszka and Matthew Reimherr. Introduction to functional data analysis. CRC press, 2017.
  • Kowal et al. [2017] Daniel R. Kowal, David S. Matteson, and David Ruppert. A bayesian multivariate functional dynamic linear model. Journal of the American Statistical Association, 112(518):733–744, 2017. doi: 10.1080/01621459.2016.1165104. URL https://doi.org/10.1080/01621459.2016.1165104.
  • Krantz and Parks [2002] Steven G Krantz and Harold R Parks. A primer of real analytic functions. Springer Science & Business Media, 2002.
  • Laird et al. [2004] Angela R Laird, Baxter P Rogers, and M Elizabeth Meyerand. Comparison of fourier and wavelet resampling methods. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, 51(2):418–422, 2004.
  • Li and Solea [2018] Bing Li and Eftychia Solea. A nonparametric graphical model for functional data with application to brain networks based on fmri. Journal of the American Statistical Association, 113(524):1637–1655, 2018.
  • Liebl [2013] Dominik Liebl. Modeling and forecasting electricity spot prices: A functional data perspective. The Annals of Applied Statistics, 7(3):1562 – 1592, 2013. doi: 10.1214/13-AOAS652. URL https://doi.org/10.1214/13-AOAS652.
  • Lindquist [2008] Martin A Lindquist. The statistical analysis of fmri data. Statistical science, 23(4):439–464, 2008.
  • Liu and Nocedal [1989] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
  • Neuhaus and Wrigley [1954] Jack O Neuhaus and Charles Wrigley. The quartimax method: An analytic approach to orthogonal simple structure 1. British Journal of Statistical Psychology, 7(2):81–91, 1954.
  • Otto and Salish [2022] Sven Otto and Nazarii Salish. Dynamic factor model for functional time series: Identification, estimation, and prediction, 2022.
  • Ramsay and Silverman [2005] J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, 2005. ISBN 9780387400808. URL http://www.worldcat.org/isbn/9780387400808.
  • Reiss et al. [2015] Philip T Reiss, Lan Huo, Yihong Zhao, Clare Kelly, and R Todd Ogden. Wavelet-domain regression and predictive inference in psychiatric neuroimaging. The annals of applied statistics, 9(2):1076, 2015.
  • Reiss et al. [2017] Philip T Reiss, Jeff Goldsmith, Han Lin Shang, and R Todd Ogden. Methods for scalar-on-function regression. International Statistical Review, 85(2):228–249, 2017.
  • Sargent and Sims [1977] Thomas Sargent and Christopher Sims. Business cycle modeling without pretending to have too much a priori economic theory. Working Papers 55, Federal Reserve Bank of Minneapolis, 1977. URL https://EconPapers.repec.org/RePEc:fip:fedmwp:55.
  • Shanno [1970] David F Shanno. Conditioning of quasi-newton methods for function minimization. Mathematics of computation, 24(111):647–656, 1970.
  • Smith et al. [2009] Stephen M Smith, Peter T Fox, Karla L Miller, David C Glahn, P Mickle Fox, Clare E Mackay, Nicola Filippini, Kate E Watkins, Roberto Toro, Angela R Laird, et al. Correspondence of the brain’s functional architecture during activation and rest. Proceedings of the national academy of sciences, 106(31):13040–13045, 2009.
  • Smith et al. [2014] Stephen M Smith, Aapo Hyvärinen, Gaël Varoquaux, Karla L Miller, and Christian F Beckmann. Group-pca for very large fmri datasets. Neuroimage, 101:738–749, 2014.
  • Snoek et al. [2021] Lukas Snoek, Maite M van der Miesen, Tinka Beemsterboer, Andries van der Leij, Annemarie Eigenhuis, and H Steven Scholte. The amsterdam open mri collection, a set of multimodal mri datasets for individual difference analyses. Scientific Data, 8(1):1–23, 2021.
  • Stock and Watson [2002] James H. Stock and Mark W. Watson. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167–1179, 2002. ISSN 01621459. URL http://www.jstor.org/stable/3085839.
  • Van Den Heuvel and Pol [2010] Martijn P Van Den Heuvel and Hilleke E Hulshoff Pol. Exploring the brain network: a review on resting-state fmri functional connectivity. European neuropsychopharmacology, 20(8):519–534, 2010.
  • van der Vaart and Wellner [1996] Aad W van der Vaart and Jon A Wellner. Weak Convergence and Empirical Processes with Applications to Statistics. Springer, 1996.
  • Viviani et al. [2005] Roberto Viviani, Georg Grön, and Manfred Spitzer. Functional principal component analysis of fmri data. Human brain mapping, 24(2):109–129, 2005.
  • Wang et al. [2016] Jane-Ling Wang, Jeng-Min Chiou, and Hans-Georg Müller. Functional data analysis. Annual Review of Statistics and Its Application, 3(1):257–295, 2016. doi: 10.1146/annurev-statistics-041715-033624. URL https://doi.org/10.1146/annurev-statistics-041715-033624.