
Low-Rank Covariance Function Estimation for Multidimensional Functional Data

Jiayi Wang Department of Statistics, Texas A&M University Raymond K. W. Wong The research of Raymond K. W. Wong is partially supported by National Science Foundation grants DMS-1806063, DMS-1711952 and CCF-1934904. Department of Statistics, Texas A&M University Xiaoke Zhang The research of Xiaoke Zhang is partially supported by National Science Foundation grant DMS-1832046. Department of Statistics, George Washington University
Abstract

Multidimensional functional data arise in many fields nowadays. The covariance function plays an important role in the analysis of such increasingly common data. In this paper, we propose a novel nonparametric covariance function estimation approach under the framework of reproducing kernel Hilbert spaces (RKHS) that can handle both sparse and dense functional data. We extend multilinear rank structures for (finite-dimensional) tensors to functions, which allow for flexible modeling of both covariance operators and marginal structures. The proposed framework can guarantee that the resulting estimator is automatically positive semi-definite, and can incorporate various spectral regularizations. The trace-norm regularization in particular can promote low ranks for both the covariance operator and the marginal structures. Despite the lack of a closed form, under mild assumptions, the proposed estimator can achieve unified theoretical results that hold for any relative magnitudes between the sample size and the number of observations per sample field, and the rate of convergence reveals the “phase-transition” phenomenon from sparse to dense functional data. Based on a new representer theorem, an ADMM algorithm is developed for the trace-norm regularization. The appealing numerical performance of the proposed estimator is demonstrated by a simulation study and the analysis of a dataset from the Argo project.

Keywords: Functional data analysis; multilinear ranks; tensor product space; unified theory

1 Introduction

In recent decades, functional data analysis (FDA) has become a popular branch of statistical research. General introductions to FDA can be found in a few monographs (e.g., Ramsay and Silverman, 2005; Ferraty and Vieu, 2006; Horváth and Kokoszka, 2012; Hsing and Eubank, 2015; Kokoszka and Reimherr, 2017). While traditional FDA deals with a sample of time-varying trajectories, many new forms of functional data have emerged due to improved capabilities of data recording and storage, as well as advances in scientific computing. One particular new form of functional data is multidimensional functional data, which becomes increasingly common in various fields such as climate science, neuroscience and chemometrics. Multidimensional functional data are generated from random fields, i.e., random functions of several input variables. One example is multi-subject magnetic resonance imaging (MRI) scans, such as those collected by the Alzheimer’s Disease Neuroimaging Initiative. A human brain is virtually divided into three-dimensional boxes called “voxels” and brain signals obtained from these voxels form a three-dimensional functional sample indexed by spatial locations of the voxels. Despite the growing popularity of multidimensional functional data, statistical methods for such data are limited apart from very few existing works (e.g., Huang et al., 2009; Allen, 2013; Zhang et al., 2013; Zhou and Pan, 2014; Wang and Huang, 2017).

In FDA, covariance function estimation plays an important role. Many methods have been proposed for unidimensional functional data (e.g., Rice and Silverman, 1991; James et al., 2000; Yao et al., 2005; Paul and Peng, 2009; Li and Hsing, 2010; Goldsmith et al., 2011; Xiao et al., 2013), and a few were particularly developed for two-dimensional functional data (e.g., Zhou and Pan, 2014; Wang and Huang, 2017). In general, when the input domain is of dimension $p$, one needs to estimate a $2p$-dimensional covariance function. Since covariance function estimation in FDA is typically nonparametric, the curse of dimensionality emerges quickly when $p$ is moderate or large.

For general $p$, most existing work is restricted to regular and fixed designs (e.g., Zipunnikov et al., 2011; Allen, 2013), where all random fields are observed over a regular grid like MRI scans. Such a sampling plan leads to a tensor dataset, so one may apply tensor/matrix decompositions to estimate the covariance function. When random fields are observed at irregular locations, the dataset is no longer a completely observed tensor, so tensor/matrix decompositions are not directly applicable. If observations are densely collected for each random field, a two-step approach is a natural solution, which involves pre-smoothing every random field followed by tensor/matrix decompositions at a fine discretized grid. However, this solution is infeasible for sparse data where there are only a limited number of observations per random field. One example is the data collected by the international Argo project (http://www.argo.net). See Section 7 for more details. In such a sparse data setting, one may apply the local smoothing method of Chen and Jiang (2017), but it suffers from the curse of dimensionality when the dimension $p$ is moderate, due to the underlying $2p$-dimensional nonparametric regression.

We notice that there is a related class of literature on longitudinal functional data (e.g., Chen and Müller, 2012; Park and Staicu, 2015; Chen et al., 2017), a special type of multidimensional functional data where a function is repeatedly measured over longitudinal times. Typically multi-step methods are needed to model the functional and longitudinal dimensions either separately (one dimension at a time) or sequentially (one dimension given the other), as opposed to the joint estimation procedure proposed in this paper. We also notice a recent work on longitudinal functional data under the Bayesian framework (Shamshoian et al., 2019).

The contribution of this paper is three-fold. First, we propose a new and flexible nonparametric method for low-rank covariance function estimation for multidimensional functional data, via the introduction of (infinite-dimensional) unfolding operators (see Section 3). This method can handle both sparse and dense functional data, and can achieve joint structural reductions in all dimensions, in addition to rank reduction of the covariance operator. The proposed estimator is guaranteed to be positive semi-definite. As a one-step procedure, our method reduces the theoretical complexity compared to multi-step estimators, which often involve a functional principal component analysis followed by a truncation and reconstruction step (e.g., Hall and Vial, 2006; Poskitt and Sengarapillai, 2013).

Second, we generalize the representer theorem for unidimensional functional data by Wong and Zhang (2019) to the multidimensional case with more complex spectral regularizations. The new representer theorem makes the estimation procedure practically computable by providing a finite-dimensional parametrization of the solution to the underlying infinite-dimensional optimization.

Finally, a unified asymptotic theory is developed for the proposed estimator. It automatically incorporates the settings of dense and sparse functional data, and reveals a phase transition in the rate of convergence. Different from existing theoretical work that relies heavily on closed-form representations of estimators (Li and Hsing, 2010; Cai and Yuan, 2010; Zhang and Wang, 2016; Liebl, 2019), this paper provides the first unified theory for penalized global M-estimators of covariance functions which does not require a closed-form solution. Furthermore, a near-optimal (i.e., optimal up to a logarithmic order) one-dimensional nonparametric rate of convergence is attainable for the $2p$-dimensional covariance function estimator for Sobolev–Hilbert spaces.

The rest of the paper is organized as follows. Section 2 provides some background on reproducing kernel Hilbert space (RKHS) frameworks for functional data. Section 3 introduces the Tucker decomposition for finite-dimensional tensors and our proposed generalization to tensor product RKHS operators, which is the foundation for our estimation procedure. The proposed estimation method is given in Section 4, together with a computational algorithm. The unified theoretical results are presented in Section 5. The numerical performance of the proposed method is evaluated by a simulation study in Section 6 and a real data application in Section 7.

2 RKHS Framework for Functional Data

In recent years there has been a surge of RKHS methods in FDA (e.g., Yuan and Cai, 2010; Zhu et al., 2014; Li and Song, 2017; Reimherr et al., 2018; Sun et al., 2018; Wong et al., 2019). However, covariance function estimation, a seemingly well-studied problem, has not received the same amount of attention in the development of RKHS methods, even for unidimensional functional data. Interestingly, we find that RKHS modeling provides a versatile framework for both unidimensional and multidimensional functional data.

Let $X$ be a random field defined on an index set $\mathcal{T}\subset\mathbb{R}^{p}$, with a mean function $\mu_{0}(\cdot)=\mathbb{E}\{X(\cdot)\}$ and a covariance function $\gamma_{0}(*,\cdot)=\mathrm{Cov}(X(*),X(\cdot))$, and let $\{X_{i}:i=1,\ldots,n\}$ be $n$ independently and identically distributed (i.i.d.) copies of $X$. Typically, a functional dataset is represented by $\{(\bm{T}_{ij},Y_{ij}):j=1,\dots,m_{i};i=1,\dots,n\}$, where

Y_{ij}=X_{i}(\bm{T}_{ij})+\epsilon_{ij}\in\mathbb{R} \qquad (1)

is the noisy measurement of the $i$-th random field $X_{i}$ taken at the corresponding index $\bm{T}_{ij}\in\mathcal{T}$, $m_{i}$ is the number of measurements observed from the $i$-th random field, and $\{\epsilon_{ij}:i=1,\dots,n;j=1,\dots,m_{i}\}$ are independent errors with mean zero and finite variance. For simplicity and without loss of generality, we assume $m_{i}=m$ for all $i$.

As in many nonparametric regression setups such as penalized regression splines (e.g., Pearce and Wand, 2006) and smoothing splines (e.g., Wahba, 1990; Gu, 2013), the sample field of $X$, i.e., the realized $X$ (as opposed to the sample path of a unidimensional random function), is assumed to reside in an RKHS $\mathcal{H}$ of functions defined on $\mathcal{T}$ with a continuous and square integrable reproducing kernel $K$. Let $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and $\|\cdot\|_{\mathcal{H}}$ denote the inner product and norm of $\mathcal{H}$, respectively. With the technical condition $\mathbb{E}\|X\|^{2}_{\mathcal{H}}<\infty$, the covariance function $\gamma_{0}$ resides in the tensor product RKHS $\mathcal{H}\otimes\mathcal{H}$. It can be shown that $\mathcal{H}\otimes\mathcal{H}$ is an RKHS, equipped with the reproducing kernel $K\otimes K$ defined as $(K\otimes K)((\bm{s}_{1},\bm{t}_{1}),(\bm{s}_{2},\bm{t}_{2}))=K(\bm{s}_{1},\bm{s}_{2})K(\bm{t}_{1},\bm{t}_{2})$, for any $\bm{s}_{1},\bm{s}_{2},\bm{t}_{1},\bm{t}_{2}\in\mathcal{T}$. This result has been exploited by Cai and Yuan (2010) and Wong and Zhang (2019) for covariance estimation in the unidimensional setting.
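To make the product-kernel construction concrete, here is a minimal Python sketch (ours, not from the paper); the Gaussian form of the kernel $K$ and all names are illustrative assumptions, since the paper only requires $K$ to be continuous and square integrable.

    import numpy as np

    def K(s, t, bandwidth=0.5):
        # Illustrative kernel on T (a subset of R^p); the Gaussian form is an assumption.
        return np.exp(-np.sum((np.asarray(s) - np.asarray(t)) ** 2) / (2 * bandwidth ** 2))

    def K_tensor(s1, t1, s2, t2, bandwidth=0.5):
        # (K x K)((s1, t1), (s2, t2)) = K(s1, s2) * K(t1, t2)
        return K(s1, s2, bandwidth) * K(t1, t2, bandwidth)

    # evaluate at two index pairs in T = [0, 1]^p with p = 2
    s1, t1 = np.array([0.1, 0.2]), np.array([0.3, 0.4])
    s2, t2 = np.array([0.5, 0.6]), np.array([0.7, 0.8])
    print(K_tensor(s1, t1, s2, t2))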

For any function $f\in\mathcal{H}\otimes\mathcal{H}$, there exists an operator mapping $\mathcal{H}$ to $\mathcal{H}$ defined by $g\in\mathcal{H}\mapsto\langle f(*,\cdot),g(\cdot)\rangle_{\mathcal{H}}\in\mathcal{H}$. When $f$ is a covariance function, we call the induced operator an $\mathcal{H}$-covariance operator, or simply a covariance operator below. To avoid clutter, the induced operator will share the same notation as the generating function. Similar to $L^{2}$-covariance operators, the definition of an induced operator is obtained by replacing the $L^{2}$ inner product with the RKHS inner product. The benefits of considering this operator have been discussed in Wong and Zhang (2019). We also note that a singular value decomposition (e.g., Hsing and Eubank, 2015) of the induced operator exists whenever the corresponding function $f$ belongs to the tensor product RKHS $\mathcal{H}\otimes\mathcal{H}$. The idea of the induced operator can be similarly extended to a general tensor product space $\mathcal{F}_{1}\otimes\mathcal{F}_{2}$, where $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ are two generic RKHSs of functions.

For any $\gamma\in\mathcal{H}\otimes\mathcal{H}$, let $\gamma^{\top}$ be the transpose of $\gamma$, i.e., $\gamma^{\top}(\bm{s},\bm{t})=\gamma(\bm{t},\bm{s})$, $\bm{s},\bm{t}\in\mathcal{T}$. Define $\mathcal{M}=\{\gamma\in\mathcal{H}\otimes\mathcal{H}:\gamma\equiv\gamma^{\top}\}$. To guarantee symmetry and positive semi-definiteness of the estimators, Wong and Zhang (2019) adopted $\mathcal{M}^{+}=\{\gamma\in\mathcal{M}:\langle\gamma f,f\rangle_{\mathcal{H}}\geq 0,\ \forall f\in\mathcal{H}\}$ as the hypothesis class of $\gamma_{0}$ and considered the following regularized estimator:

\underset{\gamma\in\mathcal{M}^{+}}{\arg\min}\left\{\ell(\gamma)+\tau\Psi(\gamma)\right\}, \qquad (2)

where $\ell$ is a convex and smooth loss function characterizing the fidelity to the data, $\Psi(\gamma)$ is a spectral penalty function (see Definition 5 below), and $\tau$ is a tuning parameter. Due to the constraints specified in $\mathcal{M}^{+}$, the resulting covariance estimator is always positive semi-definite. In particular, if the spectral penalty function $\Psi(\gamma)$ imposes the trace-norm regularization, an $\ell_{1}$-type shrinkage penalty on the respective singular values, the estimator is usually of low rank. Cai and Yuan (2010) adopted a similar objective function as in (2) but with the hypothesis class $\mathcal{H}\otimes\mathcal{H}$ and an $\ell_{2}$-type penalty $\Psi(\gamma)=\|\gamma\|_{\mathcal{H}\otimes\mathcal{H}}^{2}$, so the resulting estimator may neither be positive semi-definite nor low-rank.

Although Cai and Yuan (2010) and Wong and Zhang (2019) focused on unidimensional functional data, their frameworks can be directly extended to the multidimensional setting. Explicitly, similar to (2), as long as a proper $\mathcal{H}$ for the random fields with dimension $p>1$ is selected, an efficient “one-step” covariance function estimator with the hypothesis class $\mathcal{M}^{+}$ can be obtained immediately, which results in a positive semi-definite and possibly low-rank estimator. Since an RKHS is identified by its reproducing kernel, we simply need to pick a multivariate reproducing kernel $K$ for multidimensional functional data. However, even when low-rank approximation/estimation is adopted (e.g., by trace-norm regularization), we still need to estimate several $p$-dimensional eigenfunctions nonparametrically. This curse of dimensionality calls for more efficient modeling. Below, we explore this through the lens of tensor decomposition in finite-dimensional vector spaces and its extension to infinite-dimensional function spaces.

3 Low-Rank Modeling via Functional Unfolding

In this section we will extend the well-known Tucker decomposition for finite-dimensional tensors to functional data, then introduce the concept of functional unfolding for low-rank modeling, and finally apply functional unfolding to covariance function estimation.

3.1 Tucker decomposition for finite-dimensional tensors

First, we give a brief introduction to the popular Tucker decomposition (Tucker, 1966) for finite-dimensional tensors. Let $\mathcal{G}=\bigotimes_{k=1}^{d}\mathcal{G}_{k}$ denote a generic tensor product space with finite-dimensional $\mathcal{G}_{k}$, $k=1,\ldots,d$. If the dimension of $\mathcal{G}_{k}$ is $q_{k}$, $k=1,\ldots,d$, then each element in $\mathcal{G}=\bigotimes_{k=1}^{d}\mathcal{G}_{k}$ can be identified with an array in $\mathbb{R}^{\prod_{k=1}^{d}q_{k}}$, which contains the coefficients with respect to an orthonormal basis. By the Tucker decomposition, any array in $\mathbb{R}^{\prod^{d}_{k=1}q_{k}}$ can be represented in terms of $n$-mode products as follows.

Definition 1 ($n$-mode product).

For any arrays $\bm{A}\in\mathbb{R}^{q_{1}\times q_{2}\times\cdots\times q_{d}}$ and $\bm{P}\in\mathbb{R}^{p_{n}\times q_{n}}$, $n\in\{1,\dots,d\}$, the $n$-mode product between $\bm{A}$ and $\bm{P}$, denoted by $\bm{A}\times_{n}\bm{P}$, is an array of dimension $q_{1}\times q_{2}\times\cdots\times q_{n-1}\times p_{n}\times q_{n+1}\times\cdots\times q_{d}$ whose $(l_{1},\dots,l_{n-1},j,l_{n+1},\dots,l_{d})$-th element is defined by

(\bm{A}\times_{n}\bm{P})_{l_{1},\dots,l_{n-1},j,l_{n+1},\dots,l_{d}}=\sum_{i=1}^{q_{n}}\bm{A}_{l_{1},\dots,l_{n-1},i,l_{n+1},\dots,l_{d}}\bm{P}_{j,i}.
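For concreteness, the following is a small NumPy sketch (ours, not from the paper) of the $n$-mode product in Definition 1; the function name and toy dimensions are illustrative, and the modes are 0-indexed in the code.

    import numpy as np

    def mode_n_product(A, P, n):
        # n-mode product A x_n P: contract the n-th mode of A (0-indexed here)
        # against the columns of P, as in Definition 1.
        A_unf = np.moveaxis(A, n, 0).reshape(A.shape[n], -1)      # q_n x (prod of other q's)
        out = P @ A_unf                                           # p_n x (prod of other q's)
        new_shape = (P.shape[0],) + tuple(np.delete(A.shape, n))
        return np.moveaxis(out.reshape(new_shape), 0, n)

    # example: a 3 x 4 x 5 array multiplied along its second mode by a 2 x 4 matrix
    A = np.random.randn(3, 4, 5)
    P = np.random.randn(2, 4)
    print(mode_n_product(A, P, 1).shape)   # (3, 2, 5)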
Definition 2 (Tucker decomposition).

The Tucker decomposition of $\bm{A}\in\mathbb{R}^{q_{1}\times q_{2}\times\cdots\times q_{d}}$ is

\bm{A}=\bm{G}\times_{1}\bm{U}_{1}\times_{2}\cdots\times_{d}\bm{U}_{d}, \qquad (3)

where $\bm{U}_{i}\in\mathbb{R}^{q_{i}\times r_{i}}$, $i=1,2,\dots,d$, are called the “factor matrices” (usually orthonormal) with $r_{i}\leq q_{i}$, and $\bm{G}\in\mathbb{R}^{r_{1}\times\cdots\times r_{d}}$ is called the “core tensor”.

Figure 1 provides a pictorial illustration of a Tucker decomposition. Unlike matrices, the concept of rank is more complicated for arrays of order 3 or above. Tucker decomposition naturally leads to a particular form of rank, called “multilinear rank”, which is directly related to the familiar concept of matrix ranks. To see this, we employ a reshaping operation called matricization, which rearranges elements of an array into a matrix.

Figure 1: Tucker decomposition of a third-order array $\bm{A}$ ($q_{1}\times q_{2}\times q_{3}$) into a core tensor $\bm{G}$ ($r_{1}\times r_{2}\times r_{3}$) and factor matrices $\bm{U}_{1}$ ($q_{1}\times r_{1}$), $\bm{U}_{2}$ ($q_{2}\times r_{2}$), $\bm{U}_{3}$ ($q_{3}\times r_{3}$). The values in the parentheses are dimensions of the corresponding matrices or arrays.
Definition 3 (Matricization).

For any $n\in\{1,\dots,d\}$, the $n$-mode matricization of $\bm{A}\in\mathbb{R}^{q_{1}\times q_{2}\times\cdots\times q_{d}}$, denoted by $\bm{A}_{(n)}$, is a matrix of dimension $q_{n}\times(\prod_{k\neq n}q_{k})$ whose $(l_{n},j)$-th element is defined by $[\bm{A}_{(n)}]_{l_{n},j}=\bm{A}_{l_{1},\ldots,l_{d}}$, where $j=1+\sum_{i=1,i\neq n}^{d}(l_{i}-1)(\prod_{m=1,m\neq n}^{i-1}q_{m})$. (All empty products are defined as 1; for example, $\prod_{m=i}^{j}q_{m}=1$ when $i>j$.)
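A brief NumPy sketch (ours) that implements this matricization and checks the index formula of Definition 3 on a toy array; everything is 0-indexed in the code.

    import numpy as np

    def matricize(A, n):
        # mode-n matricization A_(n); order='F' reproduces the column ordering
        # of Definition 3 (earlier modes vary fastest).
        return np.moveaxis(A, n, 0).reshape(A.shape[n], -1, order='F')

    # check the index formula of Definition 3 on a toy array
    A = np.arange(2 * 3 * 4).reshape(2, 3, 4)
    n, l, q = 1, (1, 2, 3), A.shape
    j = sum(l[i] * int(np.prod([q[m] for m in range(i) if m != n]))
            for i in range(len(q)) if i != n)
    assert matricize(A, n)[l[n], j] == A[l]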

For any $\bm{A}\in\mathbb{R}^{q_{1}\times q_{2}\times\cdots\times q_{d}}$, by simple derivations, one can obtain a useful relationship between the $n$-mode matricization and the Tucker decomposition $\bm{A}=\bm{G}\times_{1}\bm{U}_{1}\times_{2}\cdots\times_{d}\bm{U}_{d}$:

\bm{A}_{(n)}=\bm{U}_{n}\bm{G}_{(n)}(\bm{U}_{d}\otimes\cdots\otimes\bm{U}_{n+1}\otimes\bm{U}_{n-1}\otimes\cdots\otimes\bm{U}_{1})^{\intercal}, \qquad (4)

where, with a slight abuse of notation, $\otimes$ also represents the Kronecker product between matrices. Hence if the factor matrices are of full rank, then $\mathrm{rank}(\bm{A}_{(n)})=\mathrm{rank}(\bm{G}_{(n)})$. The vector $(\mathrm{rank}(\bm{A}_{(1)}),\dots,\mathrm{rank}(\bm{A}_{(d)}))$ is known as the multilinear rank of $\bm{A}$. Clearly from (4), one can choose a Tucker decomposition such that $\{\bm{U}_{k}:k=1,\ldots,d\}$ are orthonormal matrices and $\mathrm{rank}(\bm{U}_{k})=r_{k}$. Therefore a “small” multilinear rank corresponds to a small core tensor and thus an intrinsic dimension reduction, which potentially improves estimation and interpretation. We will relate this low-rank structure to multidimensional functional data.
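A short numerical check (ours) of relation (4) and of the multilinear-rank property on a randomly generated Tucker-structured array; all sizes are arbitrary choices for illustration.

    import numpy as np
    from functools import reduce

    def matricize(A, n):
        # mode-n matricization with the column ordering of Definition 3
        return np.moveaxis(A, n, 0).reshape(A.shape[n], -1, order='F')

    q, r = (4, 5, 6), (2, 3, 2)
    G = np.random.randn(*r)                                               # core tensor
    U = [np.linalg.qr(np.random.randn(q[k], r[k]))[0] for k in range(3)]  # orthonormal factors
    A = np.einsum('abc,ia,jb,kc->ijk', G, U[0], U[1], U[2])               # A = G x_1 U_1 x_2 U_2 x_3 U_3

    for n in range(3):
        others = reduce(np.kron, [U[m] for m in reversed(range(3)) if m != n])
        assert np.allclose(matricize(A, n), U[n] @ matricize(G, n) @ others.T)   # relation (4)
        # multilinear rank: rank(A_(n)) = rank(G_(n)) since the factors have full column rank
        assert np.linalg.matrix_rank(matricize(A, n)) == np.linalg.matrix_rank(matricize(G, n))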

3.2 Functional unfolding for infinite-dimensional tensors

To encourage low-rank structures in covariance function estimation, we generalize the matricization operation for finite-dimensional arrays to infinite-dimensional tensors (Hackbusch, 2012). Here let $\mathcal{G}=\bigotimes_{k=1}^{d}\mathcal{G}_{k}$ denote a generic tensor product space where $\mathcal{G}_{k}$ is an RKHS of functions with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{G}_{k}}$, for $k=1,\ldots,d$.

Notice that the tensor product space $\mathcal{G}=\bigotimes_{k=1}^{d}\mathcal{G}_{k}$ can be generated by elementary tensors of the form $(\bigotimes^{d}_{k=1}f_{k})(x_{1},\dots,x_{d})=\prod^{d}_{k=1}f_{k}(x_{k})$, where $f_{k}\in\mathcal{G}_{k}$, $k=1,\ldots,d$. More specifically, $\mathcal{G}$ is the completion of the linear span of all elementary tensors under the inner product $\langle\bigotimes^{d}_{k=1}f_{k},\bigotimes^{d}_{k=1}f^{\prime}_{k}\rangle_{\mathcal{G}}=\prod^{d}_{k=1}\langle f_{k},f_{k}^{\prime}\rangle_{\mathcal{G}_{k}}$, for any $f_{k},f_{k}^{\prime}\in\mathcal{G}_{k}$.

In Definition 4 below, we generalize matricization/unfolding for finite-dimensional arrays to infinite-dimensional elementary tensors. We also define a square unfolding for infinite-dimensional tensors that will be used to describe the spectrum of covariance operators.

Definition 4 (Functional unfolding operators).

The one-way unfolding operator and the square unfolding operator are defined as follows for any elementary tensor of the form $\bigotimes^{d}_{k=1}f_{k}$.

  1. One-way unfolding operator $\mathcal{U}_{j}$ for $j=1,\dots,d$: The $j$-mode one-way unfolding operator $\mathcal{U}_{j}:\bigotimes_{k=1}^{d}\mathcal{G}_{k}\rightarrow\mathcal{G}_{j}\otimes(\bigotimes_{k\neq j}\mathcal{G}_{k})$ is defined by $\mathcal{U}_{j}(\bigotimes_{k=1}^{d}f_{k})=f_{j}\otimes(\bigotimes_{k\neq j}f_{k})$.

  2. Square unfolding operator $\mathcal{S}$: When $d$ is even, the square unfolding operator $\mathcal{S}:\bigotimes_{j=1}^{d}\mathcal{G}_{j}\rightarrow(\bigotimes^{d/2}_{j=1}\mathcal{G}_{j})\otimes(\bigotimes^{d}_{k=d/2+1}\mathcal{G}_{k})$ is defined by $\mathcal{S}(\bigotimes^{d}_{j=1}f_{j})=(\bigotimes_{j=1}^{d/2}f_{j})\otimes(\bigotimes_{k=d/2+1}^{d}f_{k})$.

These definitions extend to any function $f\in\mathcal{G}$ by linearity. For notational simplicity we denote $\mathcal{U}_{j}(f)$ by $f_{(j)}$, $j=1,\ldots,d$, and $\mathcal{S}(f)$ by $f_{\blacksquare}$.

Note that the range of each functional unfolding operator, either $\mathcal{U}_{j}$, $j=1,\ldots,d$, or $\mathcal{S}$, is a tensor product of two RKHSs, so its output can be interpreted as an (induced) operator. Given a function $f\in\mathcal{G}$, the multilinear rank can be defined as $(\mathrm{rank}(f_{(1)}),\dots,\mathrm{rank}(f_{(d)}))$, where each $f_{(j)}$ is interpreted as an operator and $\mathrm{rank}(A)$ denotes the rank of an operator $A$. If all $\mathcal{G}_{k}$, $k=1,\ldots,d$, are finite-dimensional, the singular values of the output $f_{(j)}$ of the one-way unfolding operator match those of the $j$-mode matricization of the corresponding array representation.

3.3 Functional unfolding for covariance functions

Suppose that the random field $X\in\mathcal{H}=\bigotimes_{k=1}^{p}\mathcal{H}_{k}$, where each $\mathcal{H}_{k}$ is an RKHS of functions equipped with an inner product $\langle\cdot,*\rangle_{k}$ and corresponding norm $\|\cdot\|_{k}$, $k=1,\ldots,p$. Then the covariance function $\gamma_{0}$ resides in $\mathcal{H}\otimes\mathcal{H}=(\bigotimes_{j=1}^{p}\mathcal{H}_{j})\otimes(\bigotimes_{k=1}^{p}\mathcal{H}_{k})$. To estimate $\gamma_{0}$, we could consider a special case of $\mathcal{G}=\bigotimes^{d}_{j=1}\mathcal{G}_{j}$ in Section 3.2 by letting $d=2p$, $\mathcal{G}_{j}=\mathcal{H}_{j}$ for $j=1,\dots,p$; $\mathcal{G}_{j}=\mathcal{H}_{j-p}$ for $j=p+1,\dots,d$; and $\langle\cdot,\cdot\rangle_{\mathcal{G}_{j}}=\langle\cdot,\cdot\rangle_{j}$ for $j=1,\ldots,d$.

Clearly, the elements of $\mathcal{H}\otimes\mathcal{H}$ are identified with those in $\mathcal{G}=\bigotimes^{d}_{j=1}\mathcal{G}_{j}$. In terms of the folding structure, $\mathcal{H}\otimes\mathcal{H}$ has a squarely unfolded structure. Since a low-multilinear-rank structure is represented by different unfolded forms, it would be easier to study the completely folded space $\bigotimes^{d}_{k=1}\mathcal{G}_{k}$ instead of the squarely unfolded space $\mathcal{H}\otimes\mathcal{H}$. We use $\Gamma_{0}$ to represent the folded covariance function, the corresponding element of $\gamma_{0}$ in $\mathcal{G}$. In other words, $\Gamma_{0,\blacksquare}=\gamma_{0}$. For any $\Gamma\in\mathcal{G}$, $\mathrm{rank}(\Gamma_{\blacksquare})$ is defined as the two-way rank of $\Gamma$, while $\mathrm{rank}(\Gamma_{(1)}),\dots,\mathrm{rank}(\Gamma_{(p)})$ are defined as the one-way ranks of $\Gamma$.

Remark 1.

For an array $\bm{A}\in\mathbb{R}^{\prod_{k=1}^{d}q_{k}}$, the one-way unfolding $\mathcal{U}_{j}(\bm{A})$ is the same as matricization, if we further impose the same ordering of the columns in the output of $\mathcal{U}_{j}(\bm{A})$, $j=1,\ldots,d$. This ordering is just related to how we represent the array, and is not crucial in the general definition of $\mathcal{U}_{j}$. Since the description of the computation strategy depends on the explicit representation, we will always assume this ordering. Similarly, we also define a specific ordering of rows and columns for $\bm{A}_{\blacksquare}\in\mathbb{R}^{(\prod_{k=1}^{d/2}q_{k})\times(\prod_{k=d/2+1}^{d}q_{k})}$ when $d$ is even, such that its $(j_{1},j_{2})$-th entry is $\bm{A}_{k_{1},\dots,k_{d}}$, where $j_{1}=1+\sum_{i=1}^{d/2}(k_{i}-1)(\prod_{m=i+1}^{d/2}q_{m})$ and $j_{2}=1+\sum_{i=d/2+1}^{d}(k_{i}-1)(\prod_{m=i+1}^{d}q_{m})$.

3.4 One-way and two-way ranks in covariance functions

Here we illustrate the roles of one-way and two-way ranks in the modeling of covariance functions. For a general $\mathcal{G}=\bigotimes^{d}_{j=1}\mathcal{G}_{j}$, let $\{e_{k,l_{k}}:l_{k}=1,\dots,q_{k}\}$ be a set of orthonormal basis functions of $\mathcal{G}_{k}$ for $k=1,\dots,d=2p$, where $q_{k}$ is allowed to be infinite, depending on the dimensionality of $\mathcal{G}_{k}$. Then $\{\bigotimes^{d}_{k=1}e_{k,l_{k}}:l_{k}=1,\dots,q_{k};k=1,\dots,d\}$ forms a set of orthonormal basis functions for $\mathcal{G}$. Thus for any $\Gamma\in\mathcal{G}$, we can express

\Gamma=\sum_{k_{1},k_{2},\dots,k_{d}}\bm{B}_{k_{1},\dots,k_{d}}\bigotimes_{i=1}^{d}e_{i,k_{i}}, \qquad (5)

where the coefficients $\bm{B}_{k_{1},\dots,k_{d}}$ are real numbers. For convenience, we collectively put them into an array $\bm{B}\in\mathbb{R}^{\prod^{d}_{k=1}q_{k}}$.

To illustrate the low-multilinear-rank structures for covariance functions, we consider $p=2$, i.e., $d=2p=4$, and then by (5) the folded covariance function $\Gamma$ can be expressed as

\Gamma(s_{1},s_{2},t_{1},t_{2})=\sum_{k_{1}=1}^{q_{1}}\sum_{k_{2}=1}^{q_{2}}\sum_{k_{3}=1}^{q_{1}}\sum_{k_{4}=1}^{q_{2}}\bm{B}_{k_{1},k_{2},k_{3},k_{4}}e_{1,k_{1}}(s_{1})e_{2,k_{2}}(s_{2})e_{1,k_{3}}(t_{1})e_{2,k_{4}}(t_{2}).

To be precise, the covariance function is the squarely unfolded $\Gamma_{\blacksquare}((s_{1},s_{2}),(t_{1},t_{2}))\equiv\Gamma(s_{1},s_{2},t_{1},t_{2})$. Suppose that $\bm{B}$ possesses (or is well approximated by) a structure of low multilinear rank, and admits the Tucker decomposition $\bm{B}=\bm{E}\times_{1}\bm{U}_{1}\times_{2}\bm{U}_{2}\times_{3}\bm{U}_{1}\times_{4}\bm{U}_{2}$ (Definition 1 is extended to the case where $q_{n}$ is infinite), where $\bm{E}\in\mathbb{R}^{r_{1}\times r_{2}\times r_{1}\times r_{2}}$, $\bm{U}_{k}\in\mathbb{R}^{q_{k}\times r_{k}}$ for $k=1,2$, and the columns of $\bm{U}_{k}$ are orthonormal. Apparently $R:=\mathrm{rank}(\bm{B}_{\blacksquare})$ is the two-way rank of $\Gamma$, while $r_{1}$ and $r_{2}$ are the corresponding one-way ranks. Now the covariance function can be further represented as

\Gamma(s_{1},s_{2},t_{1},t_{2})=\sum_{j_{1}=1}^{r_{1}}\sum_{j_{2}=1}^{r_{2}}\sum_{j_{3}=1}^{r_{1}}\sum_{j_{4}=1}^{r_{2}}\bm{E}_{j_{1},j_{2},j_{3},j_{4}}u_{j_{1}}(s_{1})v_{j_{2}}(s_{2})u_{j_{3}}(t_{1})v_{j_{4}}(t_{2}),

where $\{u_{j}:j=1,\ldots,r_{1}\}$ and $\{v_{k}:k=1,\ldots,r_{2}\}$ are (possibly infinite) linear combinations of the original basis functions. In fact, $\{u_{j}:j=1,\ldots,r_{1}\}$ and $\{v_{k}:k=1,\ldots,r_{2}\}$ are sets of orthonormal functions of $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, respectively. Apparently $\mathrm{rank}(\bm{E}_{\blacksquare})=R$.

Consider the eigen-decomposition of the squarely unfolded core tensor $\bm{E}_{\blacksquare}=\bm{P}\bm{D}\bm{P}^{T}$, where $\bm{D}=\mathrm{diag}(\lambda_{1},\lambda_{2},\dots,\lambda_{R})$ and $\bm{P}\in\mathbb{R}^{r_{1}r_{2}\times R}$ has orthonormal columns. Then we obtain the eigen-decomposition of the covariance function $\Gamma_{\blacksquare}$:

\Gamma_{\blacksquare}((s_{1},s_{2}),(t_{1},t_{2}))=\sum_{g=1}^{R}\lambda_{g}f_{g}(s_{1},s_{2})f_{g}(t_{1},t_{2}),

where the eigenfunction is

f_{g}(s_{1},s_{2})=\sum_{j_{1}=1}^{r_{1}}\sum_{j_{2}=1}^{r_{2}}\bm{P}_{j_{2}+(j_{1}-1)r_{1},g}\,u_{j_{1}}(s_{1})v_{j_{2}}(s_{2})=:\begin{cases}\sum^{r_{1}}_{j_{1}=1}a_{j_{1},g}(s_{2})u_{j_{1}}(s_{1}),\\ \sum^{r_{2}}_{j_{2}=1}b_{j_{2},g}(s_{1})v_{j_{2}}(s_{2}),\end{cases}

with $a_{j_{1},g}(\cdot)=\sum_{j_{2}=1}^{r_{2}}\bm{P}_{j_{2}+(j_{1}-1)r_{1},g}v_{j_{2}}(\cdot)$ and $b_{j_{2},g}(\cdot)=\sum_{j_{1}=1}^{r_{1}}\bm{P}_{j_{2}+(j_{1}-1)r_{1},g}u_{j_{1}}(\cdot)$.

First, this indicates that the two-way rank $R$ is the same as the rank of the covariance operator. Second, this shows that $\{u_{j_{1}}:j_{1}=1,\ldots,r_{1}\}$ is the common basis for the variation along the dimension $s_{1}$, hence describing the marginal structure along $s_{1}$. Similarly, $\{v_{j_{2}}:j_{2}=1,\ldots,r_{2}\}$ is the common basis that characterizes the marginal variation along the dimension $s_{2}$. We call them the marginal basis along the respective dimension. Therefore, the one-way ranks $r_{1}$ and $r_{2}$ are the minimal numbers of one-dimensional functions for the dimensions $s_{1}$ and $s_{2}$, respectively, that construct all the eigenfunctions of the covariance function $\Gamma$.

Similarly, for $p$-dimensional functional data, each eigenfunction can be represented by a linear combination of $p$-fold products of univariate functions. One can then show that the two-way rank $R$ is the same as the rank of the covariance operator, and the one-way ranks $r_{1},\dots,r_{p}$ are the minimal numbers of one-dimensional functions along the respective dimensions that characterize all eigenfunctions of the covariance operator.
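The following toy NumPy sketch (ours) illustrates these notions for $p=2$: a coefficient array $\bm{B}$ with low one-way and two-way ranks is generated as in the displays above, and the ranks are read off from its unfoldings; all sizes are hypothetical.

    import numpy as np

    q1, q2, r1, r2, R = 6, 7, 3, 4, 5                  # hypothetical sizes with R <= r1 * r2
    U1 = np.linalg.qr(np.random.randn(q1, r1))[0]      # marginal basis coefficients along s1
    U2 = np.linalg.qr(np.random.randn(q2, r2))[0]      # marginal basis coefficients along s2
    S = np.random.randn(r1 * r2, R)
    E = (S @ S.T).reshape(r1, r2, r1, r2)              # core whose square unfolding is PSD of rank R

    # B = E x_1 U1 x_2 U2 x_3 U1 x_4 U2
    B = np.einsum('abce,ia,jb,kc,le->ijkl', E, U1, U2, U1, U2)

    print(np.linalg.matrix_rank(B.reshape(q1 * q2, q1 * q2)))   # two-way rank: 5
    for n in (0, 1):                                            # one-way ranks: 3 and 4
        B_n = np.moveaxis(B, n, 0).reshape(B.shape[n], -1)      # column order irrelevant for rank
        print(np.linalg.matrix_rank(B_n))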

Remark 2.

Obviously, $R\leq\prod^{p}_{k=1}r_{k}$ for $p$-dimensional functional data. If the random field $X$ has the property of “weak separability” as defined by Lynch and Chen (2018), then $\max(r_{1},\dots,r_{p})\leq R$, so the low-rank structure in terms of $R$ will be automatically translated to low one-way ranks. Note that the construction of our estimator and the corresponding theoretical analysis do not require separability conditions.

Compared to typical low-rank covariance modeling only in terms of $R$, we also intend to regularize the one-way ranks $r_{1},\dots,r_{p}$, for two reasons. First, the illustration above shows that the structure of low one-way ranks encourages a “sharing” structure of one-dimensional variations among different eigenfunctions. Promoting low one-way ranks can facilitate additional dimension reduction and further alleviate the curse of dimensionality. Moreover, the one-dimensional marginal structures provide more details of the covariance function structure and thus help with a better understanding of the $p$-dimensional eigenfunctions.

Therefore, we will utilize both one-way and two-way structures and propose an estimation procedure that regularizes one-way and two-way ranks jointly and flexibly, with the aim of seeking the “sharing” of marginal structures while controlling the number of eigen-components simultaneously.

4 Covariance Function Estimation

In this section we propose a low-rank covariance function estimation framework based on functional unfolding operators and spectral regularizations. Spectral penalty functions (Abernethy et al., 2009; Wong and Zhang, 2019) are defined as follows.

Definition 5 (Spectral penalty function).

Given a compact operator $A$, a spectral penalty function takes the form $\Psi(A)=\sum_{k\geq 1}\psi(\lambda_{k}(A))$, where $\lambda_{1}(A),\lambda_{2}(A),\dots$ are the singular values of $A$ in descending order of magnitude and $\psi$ is a non-decreasing penalty function such that $\psi(0)=0$.

Recall $\mathcal{H}=\bigotimes^{p}_{j=1}\mathcal{H}_{j}$ and $\mathcal{G}=\bigotimes^{d}_{j=1}\mathcal{G}_{j}$, where $d=2p$, $\mathcal{G}_{j}=\mathcal{H}_{j}$ for $j=1,\dots,p$, and $\mathcal{G}_{j}=\mathcal{H}_{j-p}$ for $j=p+1,\dots,d$. Clearly, a covariance operator is self-adjoint and positive semi-definite. Therefore we consider the hypothesis space $\mathcal{M}^{+}=\{\Gamma\in\mathcal{M}:\langle\Gamma_{\blacksquare}f,f\rangle_{\mathcal{H}}\geq 0\ \text{for all }f\in\mathcal{H}\}$, where $\mathcal{M}=\{\Gamma\in\mathcal{G}:\Gamma_{\blacksquare}\text{ is self-adjoint}\}$, and propose a general class of covariance function estimators as follows:

\underset{\Gamma\in\mathcal{M}^{+}}{\arg\min}\left\{\ell(\Gamma)+\lambda\left[\beta\Psi_{0}(\Gamma_{\blacksquare})+\frac{1-\beta}{p}\sum_{j=1}^{p}\Psi_{j}(\Gamma_{(j)})\right]\right\}, \qquad (6)

where $\ell$ is a convex and smooth loss function, $\{\Psi_{j}:j=0,1,\ldots,p\}$ are spectral penalty functions, and $\lambda\geq 0$ and $\beta\in[0,1]$ are tuning parameters. Here $\Psi_{0}$ penalizes the squarely unfolded operator $\Gamma_{\blacksquare}$, while $\Psi_{j}$ regularizes the one-way unfolded operator $\Gamma_{(j)}$ for $j=1,\dots,p$. The tuning parameter $\beta$ controls the relative degree of regularization between the one-way and two-way singular values: the larger $\beta$ is, the more penalty is imposed on the two-way singular values relative to the one-way singular values. When $\beta=1$, the penalization is only on the eigenvalues of the covariance operator (i.e., the two-way singular values), similarly to Wong and Zhang (2019).

To achieve low-rank estimation, we adopt a special form of (6):

\hat{\Gamma}=\underset{\Gamma\in\mathcal{M}^{+}}{\arg\min}\left\{\ell_{\mathrm{square}}(\Gamma)+\lambda\left[\beta\|\Gamma_{\blacksquare}\|_{*}+\frac{1-\beta}{p}\sum_{j=1}^{p}\|\Gamma_{(j)}\|_{*}\right]\right\}, \qquad (7)

where $\|\cdot\|_{*}$ is the sum of singular values, also called the trace norm, and $\ell_{\mathrm{square}}$ is the squared error loss:

\ell_{\mathrm{square}}(\Gamma)=\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{1\leq j\neq j^{\prime}\leq m}\{\Gamma(T_{ij1},\dots,T_{ijp},T_{ij^{\prime}1},\dots,T_{ij^{\prime}p})-Z_{ijj^{\prime}}\}^{2}, \qquad (8)

with $Z_{ijj^{\prime}}=\{Y_{ij}-\hat{\mu}(T_{ij1},\dots,T_{ijp})\}\{Y_{ij^{\prime}}-\hat{\mu}(T_{ij^{\prime}1},\dots,T_{ij^{\prime}p})\}$, where $\hat{\mu}$ is an estimate of the mean function and $T_{ijk}$ is the $k$-th element of the location vector $\bm{T}_{ij}$. Notice that trace-norm regularizations promote low-rankness of the underlying operators, hence leading to a low-rank estimation in terms of both the one-way and two-way (covariance) ranks.
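A minimal Python sketch (ours) of the squared error loss (8); the candidate covariance gamma and the mean estimate mu_hat are placeholder callables, not the paper's estimators.

    import numpy as np

    def squared_error_loss(gamma, mu_hat, T, Y):
        # T: (n, m, p) array of locations; Y: (n, m) array of noisy measurements.
        n, m, _ = T.shape
        loss = 0.0
        for i in range(n):
            resid = Y[i] - np.array([mu_hat(T[i, j]) for j in range(m)])
            for j in range(m):
                for jp in range(m):
                    if j != jp:
                        Z = resid[j] * resid[jp]                      # Z_{ijj'}
                        loss += (gamma(T[i, j], T[i, jp]) - Z) ** 2
        return loss / (n * m * (m - 1))

    # toy usage with p = 2 and placeholder gamma, mu_hat
    T, Y = np.random.rand(5, 4, 2), np.random.randn(5, 4)
    print(squared_error_loss(lambda s, t: np.exp(-np.sum((s - t) ** 2)), lambda t: 0.0, T, Y))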

4.1 Representer theorem and parametrization

Before deriving a computational algorithm, we note that (7) is an infinite-dimensional optimization, which is generally not directly solvable. To overcome this challenge, we show that a solution to (7) always lies in a known finite-dimensional subspace given the data, hence allowing a finite-dimensional parametrization. Indeed, we are able to achieve a stronger result in Theorem 1, which holds for the general class of estimators obtained by (6).

Let $\mathcal{L}_{n,m}=\left\{T_{ijk}:i=1,\dots,n,\ j=1,\dots,m,\ k=1,\dots,p\right\}$.

Theorem 1 (Representer theorem).

If the solution set of (6) is not empty, there always exists a solution $\Gamma$ lying in the space $\mathcal{G}(\mathcal{L}_{n,m}):=\bigotimes_{k=1}^{2p}\mathcal{K}_{k}$, where $\mathcal{K}_{p+k}=\mathcal{K}_{k}$ and $\mathcal{K}_{k}=\mathrm{span}\left\{K_{k}(T_{ijk},\cdot):i=1,\dots,n,\ j=1,\dots,m\right\}$ for $k=1,\dots,p$. The solution takes the form:

\Gamma(s_{1},\ldots,s_{p},t_{1},\ldots,t_{p})=\bm{A}\times_{1}\bm{z}_{1}^{\intercal}(s_{1})\times_{2}\bm{z}_{2}^{\intercal}(s_{2})\cdots\times_{p}\bm{z}_{p}^{\intercal}(s_{p})\times_{p+1}\bm{z}_{1}^{\intercal}(t_{1})\cdots\times_{2p}\bm{z}_{p}^{\intercal}(t_{p}), \qquad (9)

where the $l$-th element of $\bm{z}_{k}(\cdot)\in\mathbb{R}^{mn}$ is $K_{k}(T_{ijk},\cdot)$ with $l=(i-1)m+j$. Also, $\bm{A}$ is a $2p$-th order tensor in which each mode has dimension $nm$, and $\bm{A}_{\blacksquare}$ is a symmetric matrix.

The proof of Theorem 1 is given in Section S1 of the supplementary material. By Theorem 1, we can now focus only on covariance function estimators of the form (9). Let $\bm{B}=\bm{A}\times_{1}\bm{M}_{1}^{T}\cdots\times_{p}\bm{M}_{p}^{T}\times_{p+1}\bm{M}_{1}^{T}\cdots\times_{2p}\bm{M}_{p}^{T}$, where $\bm{M}_{k}$ is an $nm\times q_{k}$ matrix such that $\bm{M}_{k}\bm{M}_{k}^{T}=\bm{K}_{k}=\left[K_{k}(T_{i_{1},j_{1},k},T_{i_{2},j_{2},k})\right]_{1\leq i_{1},i_{2}\leq n,\,1\leq j_{1},j_{2}\leq m}$. With $\bm{B}$, we can express

\Gamma(s_{1},\ldots,s_{p},t_{1},\ldots,t_{p})=\bm{B}\times_{1}\{\bm{M}_{1}^{+}\bm{z}_{1}(s_{1})\}^{\intercal}\cdots\times_{p}\{\bm{M}_{p}^{+}\bm{z}_{p}(s_{p})\}^{\intercal}\times_{p+1}\{\bm{M}_{1}^{+}\bm{z}_{1}(t_{1})\}^{\intercal}\cdots\times_{2p}\{\bm{M}_{p}^{+}\bm{z}_{p}(t_{p})\}^{\intercal}, \qquad (10)

where $\bm{z}_{k}(\cdot)$ is defined in Theorem 1 and $\bm{M}_{k}^{+}$ is the Moore–Penrose inverse of the matrix $\bm{M}_{k}$.

The Gram matrix $\bm{K}_{k}$ is often approximately low-rank. For computational simplicity, one could adopt $q_{k}$ to be significantly smaller than $nm$. Ideally we can obtain the “best” low-rank approximation with respect to the Frobenius norm by eigen-decomposition, but a full eigen-decomposition is computationally expensive. Instead, randomized algorithms can be used to obtain low-rank approximations in an efficient manner (Halko et al., 2009).
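As an illustration of this step, the following sketch (ours) forms $\bm{M}_{k}$ from a toy Gram matrix by a truncated eigendecomposition; in practice a randomized routine in the spirit of Halko et al. (2009) could replace the full eigendecomposition, and the kernel used here is only an example.

    import numpy as np

    def low_rank_factor(K, q):
        # return an (nm x q) matrix M with M M^T approximating the PSD Gram matrix K,
        # via a truncated eigendecomposition of K
        w, V = np.linalg.eigh(K)                       # eigenvalues in ascending order
        idx = np.argsort(w)[::-1][:q]                  # keep the q largest
        w_top = np.clip(w[idx], 0.0, None)             # guard against tiny negative values
        return V[:, idx] * np.sqrt(w_top)

    t = np.random.rand(50)                             # nm = 50 sampled locations in one dimension
    K_k = np.exp(-np.abs(t[:, None] - t[None, :]))     # illustrative univariate kernel choice
    M_k = low_rank_factor(K_k, q=10)
    print(np.linalg.norm(K_k - M_k @ M_k.T) / np.linalg.norm(K_k))   # relative approximation error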

One can easily show that the eigenvalues of the operator $\Gamma_{\blacksquare}$ are the same as those of the matrix $\bm{B}_{\blacksquare}$, and that the singular values of the operator $\Gamma_{(j)}$ are the same as those of the matrix $\bm{B}_{(j)}$. Therefore, solving (7) is equivalent to solving the following optimization:

\min_{\bm{B}}\left\{\tilde{\ell}_{\mathrm{square}}(\bm{B})+\lambda\left[\beta h(\bm{B}_{\blacksquare})+\frac{1-\beta}{p}\sum_{j=1}^{p}\left\|\bm{B}_{(j)}\right\|_{*}\right]\right\}, \qquad (11)

where $\|\cdot\|_{*}$ also represents the trace norm of matrices, $h(\bm{H})=\left\|\bm{H}\right\|_{*}$ if the matrix $\bm{H}$ is positive semi-definite and $h(\bm{H})=\infty$ otherwise, and $\tilde{\ell}_{\mathrm{square}}(\bm{B})=\ell_{\mathrm{square}}(\Gamma)$, where $\Gamma$ is constructed from (10).

Beyond estimating the covariance function, one may be further interested in the eigen-decomposition of $\Gamma_{\blacksquare}$ via the $L^{2}$ inner product, e.g., to perform functional principal component analysis in the usual sense. Due to the finite-dimensional parametrization, a closed-form expression of the $L^{2}$ eigen-decomposition can be derived from our estimator without further discretization or approximation. In addition, we can obtain a similar one-way analysis in terms of the $L^{2}$ inner product: we can define an $L^{2}$ singular value decomposition via the Tucker form and obtain the $L^{2}$ marginal basis. Details are given in Appendix A.

4.2 Computational algorithm

We solve (11) by the accelerated alternating direction method of multipliers (ADMM) algorithm (Kadkhodaie et al., 2015). We begin with an alternative form of (11):

\min_{\bm{B}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}}\left\{\tilde{\ell}_{\mathrm{square}}(\bm{B})+\lambda\beta h(\bm{D}_{0,\blacksquare})+\lambda\frac{1-\beta}{p}\sum_{j=1}^{p}\left\|\bm{D}_{j,(j)}\right\|_{*}\right\} \qquad (12)
\text{subject to}\quad \bm{B}=\bm{D}_{0}=\bm{D}_{1}=\cdots=\bm{D}_{p}, \qquad (13)

where $q_{p+k}=q_{k}$ for $k=1,\dots,p$.

A standard ADMM algorithm then solves the optimization problem (12) by minimizing the augmented Lagrangian with respect to different blocks of variables alternately. More explicitly, at the $(t+1)$-th iteration, the following updates are implemented:

\bm{B}^{(t+1)}=\operatorname*{argmin}_{\bm{B}}\left\{\tilde{\ell}_{\mathrm{square}}(\bm{B})+\frac{\eta}{2}\|\bm{B}_{\blacksquare}-\bm{D}_{0,\blacksquare}^{(t)}+\bm{V}_{0,\blacksquare}^{(t)}\|_{F}^{2}+\frac{\eta}{2}\sum_{k=1}^{p}\left\|\bm{B}_{(k)}-\bm{D}_{k,(k)}^{(t)}+\bm{V}_{k,(k)}^{(t)}\right\|_{F}^{2}\right\}, \qquad (14a)
\bm{D}_{0}^{(t+1)}=\operatorname*{argmin}_{\bm{D}_{0}}\left\{\lambda\beta h(\bm{D}_{0,\blacksquare})+\frac{\eta}{2}\left\|\bm{B}^{(t+1)}_{\blacksquare}-\bm{D}_{0,\blacksquare}+\bm{V}_{0,\blacksquare}^{(t)}\right\|_{F}^{2}\right\}, \qquad (14b)
\bm{D}_{k}^{(t+1)}=\operatorname*{argmin}_{\bm{D}_{k}}\left\{\lambda\frac{1-\beta}{p}\|\bm{D}_{k,(k)}\|_{*}+\frac{\eta}{2}\left\|\bm{B}^{(t+1)}_{(k)}-\bm{D}_{k,(k)}+\bm{V}_{k,(k)}^{(t)}\right\|_{F}^{2}\right\},\quad k=1,\dots,p, \qquad (14c)
\bm{V}_{k}^{(t+1)}=\bm{V}_{k}^{(t)}+\bm{B}^{(t+1)}-\bm{D}_{k}^{(t+1)},\quad k=0,\dots,p, \qquad (14d)

where $\bm{V}_{k}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}$, for $k=0,\dots,p$, are scaled Lagrangian multipliers and $\eta>0$ is an algorithmic parameter. An adaptive strategy to tune $\eta$ is provided in Boyd et al. (2010). One can see that Steps (14a), (14b) and (14c) involve additional optimizations. Now we discuss how to solve them.

The objective function of (14a) is quadratic, so it has a closed-form solution, given in line 2 of Algorithm 1. To solve (14b) and (14c), we use the proximal operators $\mathrm{prox}^{k}_{v}$, $k=1,\dots,p$, and $\mathrm{prox}_{v}^{+}:\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}\rightarrow\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}$, respectively defined by

\mathrm{prox}_{v}^{k}(\bm{A})=\operatorname*{argmin}_{\bm{W}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}}\left\{\frac{1}{2}\|\bm{W}_{(k)}-\bm{A}_{(k)}\|_{F}^{2}+v\|\bm{W}_{(k)}\|_{*}\right\}, \qquad (15a)
\mathrm{prox}^{+}_{v}(\bm{A})=\operatorname*{argmin}_{\bm{W}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}}\left\{\frac{1}{2}\|\bm{W}_{\blacksquare}-\bm{A}_{\blacksquare}\|_{F}^{2}+vh(\bm{W}_{\blacksquare})\right\}, \qquad (15b)

for $v\geq 0$. By Lemma 1 in Mazumder et al. (2010), the solutions to (15) have closed forms.

For (15a), write the singular value decomposition of $\bm{A}_{(k)}$ as $\bm{U}\mathrm{diag}(\tilde{a}_{1},\dots,\tilde{a}_{q_{k}})\bm{V}^{\intercal}$; then $[\mathrm{prox}_{v}^{k}(\bm{A})]_{(k)}=\bm{U}\mathrm{diag}(\tilde{\bm{c}})\bm{V}^{\intercal}$, where $\tilde{\bm{c}}=((\tilde{a}_{1}-v)_{+},(\tilde{a}_{2}-v)_{+},\dots,(\tilde{a}_{q_{k}}-v)_{+})$. As for (15b), $\bm{W}_{\blacksquare}$ is restricted to be a symmetric matrix since the penalty $h$ equals infinity otherwise. Thus (15b) is equivalent to minimizing $\{(1/2)\|\bm{W}_{\blacksquare}-(\bm{A}_{\blacksquare}+\bm{A}_{\blacksquare}^{\intercal})/2\|_{F}^{2}+vh(\bm{W}_{\blacksquare})\}$, since $\langle\bm{W}_{\blacksquare},(\bm{A}_{\blacksquare}-\bm{A}_{\blacksquare}^{\intercal})/2\rangle=\langle(\bm{W}_{\blacksquare}+\bm{W}_{\blacksquare}^{\intercal})/2,(\bm{A}_{\blacksquare}-\bm{A}_{\blacksquare}^{\intercal})/2\rangle=0$. Suppose that $(\bm{A}_{\blacksquare}+\bm{A}_{\blacksquare}^{\intercal})/2$ yields the eigen-decomposition $\bm{P}\mathrm{diag}(\tilde{a}_{1},\dots,\tilde{a}_{q})\bm{P}^{\intercal}$. Then $[\mathrm{prox}^{+}_{v}(\bm{A})]_{\blacksquare}=\bm{P}\mathrm{diag}(\tilde{\bm{c}})\bm{P}^{\intercal}$, where $\tilde{\bm{c}}=((\tilde{a}_{1}-v)_{+},(\tilde{a}_{2}-v)_{+},\dots,(\tilde{a}_{q}-v)_{+})$. Unlike singular values, the eigenvalues may be negative. Hence, as opposed to $\mathrm{prox}^{k}_{v}$, the procedure $\mathrm{prox}_{v}^{+}$ also removes eigen-components with negative eigenvalues.
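The two closed-form proximal maps can be coded in a few lines; the following NumPy sketch (ours) works directly on the matricized and squarely unfolded arrays.

    import numpy as np

    def prox_trace(A_mat, v):
        # prox of v * (trace norm): soft-threshold the singular values of A_mat by v
        U, s, Vt = np.linalg.svd(A_mat, full_matrices=False)
        return (U * np.maximum(s - v, 0.0)) @ Vt

    def prox_psd_trace(A_sq, v):
        # symmetrize, soft-threshold eigenvalues by v, and drop negative ones,
        # so the output is always positive semi-definite
        S = (A_sq + A_sq.T) / 2.0
        w, P = np.linalg.eigh(S)
        return (P * np.maximum(w - v, 0.0)) @ P.T

    A = np.random.randn(6, 6)
    print(np.linalg.eigvalsh(prox_psd_trace(A, 0.1)).min() >= -1e-10)   # True: PSD output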

The details of the computational algorithm are given in Algorithm 1, an accelerated version of ADMM which involves additional steps for faster algorithmic convergence.

Input: $\hat{\bm{V}}_{k}^{(0)}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}$, $k=0,1,\dots,p$, and $\bm{B}^{(0)}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}$ such that $\hat{\bm{V}}_{0,\blacksquare}^{(0)}$ and $\bm{B}^{(0)}_{\blacksquare}$ are symmetric matrices; $\bm{M}_{k}=[\bm{M}_{1,k}^{\intercal},\dots,\bm{M}_{n,k}^{\intercal}]^{\intercal}$, $k=1,\dots,p$; $\bm{Z}_{i}=(Z_{ijj^{\prime}})_{1\leq j,j^{\prime}\leq m}$, $i=1,\dots,n$; $\tilde{\bm{I}}=[I(i\neq j)]_{1\leq i,j\leq m}$; $\eta>0$; $T$
Initialization: $\alpha_{k}^{(0)}\leftarrow 1$, $\bm{D}_{k}^{(-1)}\leftarrow\bm{B}^{(0)}$, $\hat{\bm{D}}_{k}^{(0)}\leftarrow\bm{B}^{(0)}$, $\bm{V}_{k}^{(-1)}\leftarrow\hat{\bm{V}}_{k}^{(0)}$, $k=0,1,\dots,p$
$\bm{L}_{i}\leftarrow[\bm{M}_{i,1}^{\intercal}\odot\bm{M}_{i,2}^{\intercal}\odot\cdots\odot\bm{M}_{i,p}^{\intercal}]^{\intercal}$, $i=1,\dots,n$, where $\odot$ is the Khatri–Rao product defined as $\bm{A}\odot\bm{B}=[a_{i}\otimes b_{i}]_{i=1,\dots,r}\in\mathbb{R}^{r_{a}r_{b}\times r}$ for $\bm{A}\in\mathbb{R}^{r_{a}\times r}$ and $\bm{B}\in\mathbb{R}^{r_{b}\times r}$, with $a_{i}$ and $b_{i}$ the $i$-th columns of $\bm{A}$ and $\bm{B}$, respectively.
$\bm{G}\leftarrow\frac{1}{nm(m-1)}\sum_{i=1}^{n}(\bm{L}_{i}\otimes\bm{L}_{i})^{\intercal}\mathrm{diag}(\mathrm{vec}(\tilde{\bm{I}}))(\bm{L}_{i}\otimes\bm{L}_{i})$; \quad $\bm{h}\leftarrow\frac{2}{nm(m-1)}\sum_{i=1}^{n}(\bm{L}_{i}\otimes\bm{L}_{i})^{\intercal}\mathrm{diag}(\mathrm{vec}(\tilde{\bm{I}}))\mathrm{vec}(\bm{Z}_{i})$
$\bm{Q}\leftarrow\{2(\bm{G}+\frac{p+1}{2}\eta\bm{I})\}^{-1}$
1  for $t=0,1,\dots,T$ do
2    $\mathrm{vec}(\bm{B}^{(t+1)}_{\blacksquare})\leftarrow\bm{Q}\{\bm{h}+\eta\sum_{k=0}^{p}\mathrm{vec}([\bm{D}_{k}^{(t)}-\hat{\bm{V}}_{k}^{(t)}]_{\blacksquare})\}$
3    for $k=0,1,\dots,p$ do
4      if $k=0$ then
5        $\bm{D}_{0}^{(t)}\leftarrow\mathrm{prox}^{+}_{\lambda\beta/\eta}(\bm{B}^{(t+1)}+\hat{\bm{V}}_{0}^{(t)})$
6      else
7        $\bm{D}_{k}^{(t)}\leftarrow\mathrm{prox}^{k}_{\lambda(1-\beta)/(p\eta)}(\bm{B}^{(t+1)}+\hat{\bm{V}}_{k}^{(t)})$
8      end if
9      $\bm{V}_{k}^{(t)}\leftarrow\hat{\bm{V}}_{k}^{(t)}+\bm{B}^{(t+1)}-\bm{D}_{k}^{(t)}$
10     $\alpha_{k}^{(t+1)}\leftarrow\frac{1+\sqrt{1+4(\alpha_{k}^{(t)})^{2}}}{2}$
11     $\hat{\bm{D}}_{k}^{(t+1)}\leftarrow\bm{D}_{k}^{(t)}+\frac{\alpha_{k}^{(t)}-1}{\alpha_{k}^{(t+1)}}(\bm{D}_{k}^{(t)}-\bm{D}_{k}^{(t-1)})$
12     $\hat{\bm{V}}_{k}^{(t+1)}\leftarrow\bm{V}_{k}^{(t)}+\frac{\alpha_{k}^{(t)}-1}{\alpha_{k}^{(t+1)}}(\bm{V}_{k}^{(t)}-\bm{V}_{k}^{(t-1)})$
13   end for
14   Stop if the change in the objective value is less than the tolerance.
15 end for
Output: $\bm{D}_{0}^{(T)}$
Algorithm 1: Accelerated ADMM for solving (11)

5 Asymptotic Properties

In this section, we conduct an asymptotic analysis for the proposed estimator $\hat{\Gamma}$ as defined in (7). Our analysis has a unified flavor such that the derived convergence rate of the proposed estimator automatically adapts to sparse and dense settings. Throughout this section, we neglect the mean function estimation error by setting $\mu_{0}(\bm{t})=\hat{\mu}(\bm{t})=0$ for any $\bm{t}\in\mathcal{T}$, which leads to a cleaner and more focused analysis. The additional error from the mean function estimation can be incorporated into our proofs without any fundamental difficulty.

5.1 Assumptions

Without loss of generality, let $\mathcal{T}=[0,1]^{p}$. The assumptions needed in the asymptotic results are listed as follows.

Assumption 1.

The sample fields $\{X_{i}:i=1,\ldots,n\}$ reside in $\mathcal{H}=\bigotimes_{k=1}^{p}\mathcal{H}_{k}$, where $\mathcal{H}_{k}$ is an RKHS of functions on $[0,1]$ with a continuous and square integrable reproducing kernel $K_{k}$.

Assumption 2.

The true (folded) covariance function $\Gamma_{0}\neq 0$ and $\Gamma_{0}\in\mathcal{G}=\bigotimes^{d}_{j=1}\mathcal{G}_{j}$, where $d=2p$, $\mathcal{G}_{j}=\mathcal{H}_{j}$ for $j=1,\dots,p$ and $\mathcal{G}_{j}=\mathcal{H}_{j-p}$ for $j=p+1,\dots,d$.

Assumption 3.

The locations $\{\bm{T}_{ij}:i=1,\dots,n;j=1,\dots,m\}$ are independent random vectors from $\mathrm{Uniform}[0,1]^{p}$, and they are independent of $\{X_{i}:i=1,\dots,n\}$. The errors $\{\epsilon_{ij}:i=1,\dots,n;j=1,\dots,m\}$ are independent of both the locations and the sample fields.

Assumption 4.

For each $\bm{t}\in\mathcal{T}$, $X(\bm{t})$ is sub-Gaussian with a parameter $b_{X}>0$ that does not depend on $\bm{t}$, i.e., $\mathbb{E}[\exp\{\beta X(\bm{t})\}]\leq\exp\{b_{X}^{2}\beta^{2}/2\}$ for all $\beta$ and $\bm{t}\in\mathcal{T}$.

Assumption 5.

For each $i$ and $j$, $\epsilon_{ij}$ is a mean-zero sub-Gaussian random variable with a parameter $b_{\epsilon}$ independent of $i$ and $j$, i.e., $\mathbb{E}[\exp\{\beta\epsilon_{ij}\}]\leq\exp\{b_{\epsilon}^{2}\beta^{2}/2\}$. Moreover, all errors $\{\epsilon_{ij}:i=1,\dots,n;j=1,\dots,m\}$ are independent.

Assumption 1 delineates a tensor product RKHS modeling, commonly seen in the nonparametric regression literature (e.g., Wahba, 1990; Gu, 2013). In Assumption 2, the condition $\Gamma_{0}\in\mathcal{G}$ is satisfied if $\mathbb{E}\|X\|_{\mathcal{H}}^{2}<\infty$, as shown in Cai and Yuan (2010). Assumption 3 is specified for a random design, and we adopt the uniform distribution here for simplicity. The uniform distribution on $[0,1]^{p}$ can be generalized to any other continuous distribution whose density function $\pi$ satisfies $c_{\pi}\leq\pi(\bm{t})\leq c_{\pi}^{\prime}$ for all $\bm{t}\in[0,1]^{p}$, for some constants $0<c_{\pi}\leq c_{\pi}^{\prime}<\infty$, to ensure that both Theorems 2 and 3 still hold. Assumptions 4 and 5 involve sub-Gaussian conditions on the stochastic process and the measurement errors, which are standard tail conditions.

5.2 Reproducing kernels

In Assumption 1, the “smoothness” of the functions in the underlying RKHS is not explicitly specified. It is well known that such smoothness conditions are directly related to the eigen-decay of the respective reproducing kernel. By Mercer’s Theorem (Mercer, 1909), the reproducing kernel $K_{\mathcal{H}}((t_{1},\ldots,t_{p}),(t^{\prime}_{1},\ldots,t^{\prime}_{p}))$ of $\mathcal{H}$ possesses the eigen-decomposition

K_{\mathcal{H}}((t_{1},\ldots,t_{p}),(t^{\prime}_{1},\ldots,t^{\prime}_{p}))=\sum_{l=1}^{\infty}\mu_{l}\phi_{l}(t_{1},\ldots,t_{p})\phi_{l}(t^{\prime}_{1},\ldots,t^{\prime}_{p}), \qquad (16)

where $\{\mu_{l}:l\geq 1\}$ are non-negative eigenvalues and $\{\phi_{l}:l\geq 1\}$ are $L^{2}$ eigenfunctions on $[0,1]^{p}$. Then for the space $\mathcal{H}\otimes\mathcal{H}$, which is also identified with $\mathcal{G}=\bigotimes_{k=1}^{d}\mathcal{G}_{k}$, its corresponding reproducing kernel $K_{\mathcal{G}}$ has the following eigen-decomposition

K_{\mathcal{G}}((x_{1},\ldots,x_{2p}),(x^{\prime}_{1},\ldots,x^{\prime}_{2p}))
=K_{\mathcal{H}}((x_{1},\ldots,x_{p}),(x^{\prime}_{1},\ldots,x^{\prime}_{p}))\,K_{\mathcal{H}}((x_{p+1},\ldots,x_{2p}),(x^{\prime}_{p+1},\ldots,x^{\prime}_{2p}))
=\sum_{l,h=1}^{\infty}\mu_{l}\mu_{h}\phi_{l}(x_{1},\ldots,x_{p})\phi_{h}(x_{p+1},\ldots,x_{2p})\phi_{l}(x^{\prime}_{1},\ldots,x^{\prime}_{p})\phi_{h}(x^{\prime}_{p+1},\ldots,x^{\prime}_{2p}),

where $\{\mu_{l}\mu_{h}:l,h\geq 1\}$ are the eigenvalues of $K_{\mathcal{G}}$. Due to the continuity assumption (Assumption 1) on the univariate kernels, there exists a constant $b$ such that

\sup_{(x_{1},\ldots,x_{2p})\in[0,1]^{2p}}K_{\mathcal{G}}((x_{1},\ldots,x_{2p}),(x_{1},\dots,x_{2p}))\leq b.

The decay rate of the eigenvalues $\{\mu_{l}\mu_{h}:l,h\geq 1\}$ enters our analysis through two quantities, $\kappa_{n,m}$ and $\eta_{n,m}$, whose relatively complex forms are defined in Appendix B. Similar quantities can be found in other analyses of RKHS-based estimators (e.g., Raskutti et al., 2012) that accommodate general choices of RKHS. Generally, $\kappa_{n,m}$ and $\eta_{n,m}$ are expected to diminish at certain orders of $n$ and $m$, characterized by the decay rate of the eigenvalues $\{\mu_{l}\mu_{h}\}$. The smoother the functions in the RKHS, the faster these two quantities diminish. Our general results in Theorems 2 and 3 are stated in terms of these quantities. To provide a concrete example, we derive the orders of $\kappa_{n,m}$ and $\eta_{n,m}$ under a Sobolev–Hilbert space setting and provide the convergence rate of the proposed estimator in Corollary 1.

5.3 Unified rates of convergence

We write the penalty in (7) as $I(\Gamma)=\beta\|\Gamma_{\blacksquare}\|_{*}+(1-\beta)p^{-1}\sum_{k=1}^{p}\|\Gamma_{(k)}\|_{*}$. For arbitrary functions $g_{1},g_{2}\in\mathcal{G}$, define their empirical inner product and the corresponding (squared) empirical norm as

\[
\langle g_{1},g_{2}\rangle_{n,m}=\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{1\leq j,j^{\prime}\leq m}g_{1}(T_{ij1},\dots,T_{ijp},T_{ij^{\prime}1},\dots,T_{ij^{\prime}p})\,g_{2}(T_{ij1},\dots,T_{ijp},T_{ij^{\prime}1},\dots,T_{ij^{\prime}p}),
\]
\[
\|g_{1}\|_{n,m}^{2}=\langle g_{1},g_{1}\rangle_{n,m}.
\]

Additionally, the $L^{2}$ norm of a function $g$ is defined as $\|g\|_{2}=\{\int_{\mathcal{T}}g^{2}(\bm{t})\,d\bm{t}\}^{1/2}$.
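For readers who want to compute the empirical inner product numerically, the following is a minimal sketch (ours, not part of the paper). It assumes, consistently with the $nm(m-1)$ normalization above, that the double sum runs over ordered pairs with $j\neq j^{\prime}$; the array layout and function signatures are our own choices.

```python
import numpy as np

def empirical_inner_product(g1, g2, T):
    """Empirical inner product <g1, g2>_{n,m} over observed locations.

    T is an (n, m, p) array of locations T_{ijk}; g1 and g2 accept an
    (npairs, 2p) array of arguments (s_1,...,s_p, t_1,...,t_p) and return a
    length-npairs array.  Pairs with j = j' are excluded, matching the
    1/{n m (m-1)} normalization in the definition above.
    """
    n, m, p = T.shape
    jj, kk = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    mask = jj.ravel() != kk.ravel()
    total = 0.0
    for i in range(n):
        left = T[i, jj.ravel()[mask], :]     # T_{i j .}
        right = T[i, kk.ravel()[mask], :]    # T_{i j' .}
        args = np.hstack([left, right])      # one row per ordered pair (j, j')
        total += np.sum(g1(args) * g2(args))
    return total / (n * m * (m - 1))

def empirical_norm_sq(g, T):
    """Squared empirical norm ||g||_{n,m}^2 = <g, g>_{n,m}."""
    return empirical_inner_product(g, g, T)
```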

Define $\xi_{n,m}=\max\{\eta_{n,m},\kappa_{n,m},(n^{-1}\log n)^{1/2}\}$. We first provide the empirical $L^{2}$ rate of convergence for $\hat{\Gamma}$.

Theorem 2.

Suppose that Assumptions 1–5 hold. Assume $\xi_{n,m}$ satisfies $(\log n)/n\leq\xi_{n,m}^{2}/(\log\log\xi_{n,m}^{-1})$. If $\lambda\geq L_{1}\xi_{n,m}^{2}$ for some constant $L_{1}>0$ depending on $b_{X}$, $b_{\epsilon}$ and $b$, we have

\[
\|\hat{\Gamma}-\Gamma_{0}\|_{n,m}\leq\sqrt{2I(\Gamma_{0})\lambda}+L_{1}\xi_{n,m},
\]

with probability at least $1-\exp(-cn\xi_{n,m}^{2}/\log n)$ for some positive universal constant $c$.

Next, we provide the $L^{2}$ rate of convergence for $\hat{\Gamma}$.

Theorem 3.

Under the same conditions as in Theorem 2, there exists a positive constant $L_{2}$ depending on $b_{X}$, $b_{\epsilon}$, $b$ and $I(\Gamma_{0})$ such that

\[
\|\hat{\Gamma}-\Gamma_{0}\|_{2}\leq 2\sqrt{I(\Gamma_{0})\lambda}+L_{2}\xi_{n,m},
\]

with probability at least $1-\exp(-c_{p}n\xi_{n,m}^{2}/\log n)$ for some constant $c_{p}$ depending on $b$.

The proofs of Theorems 2 and 3 can be found in Section S1 in the supplementary material.

Remark 3.

Theorems 2 and 3 are applicable to a general RKHS $\mathcal{H}$ satisfying Assumption 1. The convergence rate depends on the eigen-decay rate of the reproducing kernel. A special case of polynomial decay rates for the univariate RKHS is given in Corollary 1. Moreover, our analysis has a unified flavor in the sense that the resulting convergence rates automatically adapt to the orders of both $n$ and $m$. In Remark 5 we discuss a “phase transition” between dense and sparse functional data revealed by our theory.

Remark 4.

With a properly chosen $\lambda$, Theorems 2 and 3 bound the convergence rates (in terms of both the empirical and the theoretical $L^{2}$ norm) by $\xi_{n,m}$, which cannot be faster than $(n^{-1}\log n)^{1/2}$. The logarithmic factor is due to the use of the Adamczak bound in Lemma S2 in the supplementary material. If one further assumes boundedness of the sample fields $X_{i}$ (in terms of the sup-norm) and of the noise variables $\epsilon_{ij}$, we can instead use the Talagrand concentration inequality (Bousquet bound in Koltchinskii, 2011), and the results in Theorems 2 and 3 improve to $\max\{\|\hat{\Gamma}-\Gamma_{0}\|^{2}_{n,m},\|\hat{\Gamma}-\Gamma_{0}\|^{2}_{2}\}=\mathcal{O}_{\mathrm{p}}(\tilde{\xi}_{n,m}^{2})$, where $\tilde{\xi}_{n,m}=\max\{\eta_{n,m},\kappa_{n,m},n^{-1/2}\}$.

Next we focus on a special case where the reproducing kernels of the univariate RKHS $\mathcal{H}_{k}$ exhibit polynomial eigen-decay rates, which holds for a range of commonly used RKHS. A canonical example is the $\alpha$-th order Sobolev–Hilbert space

\[
\mathcal{H}_{k}=\{f:f^{(r)},\,r=0,\dots,\alpha,\ \text{are absolutely continuous};\ f^{(\alpha)}\in L^{2}([0,1])\},
\]

where $k=1,\dots,p$, and $\alpha$ is the smoothness index appearing in Corollary 1 below. To derive the convergence rates, we relate the eigenvalues $\mu_{l}$ in (16) to the univariate RKHS $\mathcal{H}_{k}$, $k=1,\dots,p$. By Mercer’s theorem, the reproducing kernel $K_{k}$ of $\mathcal{H}_{k}$ admits an eigen-decomposition with non-negative eigenvalues $\{\mu_{l}^{(k)}:l\geq 1\}$ and $L^{2}$ eigenfunctions $\{\phi^{(k)}_{l}:l\geq 1\}$, i.e., $K_{k}(t,t^{\prime})=\sum_{l=1}^{\infty}\mu_{l}^{(k)}\phi_{l}^{(k)}(t)\phi^{(k)}_{l}(t^{\prime})$. Therefore, the set of eigenvalues $\{\mu_{l}:l\geq 1\}$ in (16) coincides with the set $\{\prod_{k=1}^{p}\mu_{l_{k}}^{(k)}:l_{1},\dots,l_{p}\geq 1\}$. Given the eigen-decay of $\mu_{l}^{(k)}$, one can obtain the order of $\xi_{n,m}$ and hence the convergence rates from Theorems 2 and 3. The results under a polynomial eigen-decay are as follows.
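A small numerical sketch (ours, not the authors') of this identification: under the polynomial decay $\mu_{l}^{(k)}=l^{-2\alpha}$ assumed in Corollary 1, the multivariate eigenvalues are the products $\prod_{k}\mu_{l_{k}}^{(k)}$, and the eigenvalues of $K_{\mathcal{G}}$ are their pairwise products.

```python
import numpy as np
from itertools import product

def tensor_product_eigenvalues(alpha=2.0, p=2, L=30):
    """Eigenvalues of K_H truncated to the first L univariate eigenvalues per
    dimension, assuming the polynomial decay mu_l^{(k)} = l^{-2*alpha} for
    every k; the multivariate eigenvalues are all products
    prod_k mu_{l_k}^{(k)} over multi-indices (l_1, ..., l_p)."""
    mu_uni = np.arange(1, L + 1, dtype=float) ** (-2.0 * alpha)
    mu = np.array([np.prod(combo) for combo in product(mu_uni, repeat=p)])
    return np.sort(mu)[::-1]

mu = tensor_product_eigenvalues(alpha=2.0, p=2, L=30)
# The eigenvalues of K_G are then the pairwise products mu_l * mu_h.
mu_G = np.sort(np.outer(mu, mu).ravel())[::-1]
print(mu[:5], mu_G[:5])
```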

Corollary 1.

Suppose that the conditions of Theorem 3 hold. If the eigenvalues of $K_{k}$, the reproducing kernel of $\mathcal{H}_{k}$, $k=1,\ldots,p$, decay polynomially, that is, there exists $\alpha>1/2$ such that $\mu_{l}^{(k)}\asymp l^{-2\alpha}$ for all $k=1,\dots,p$, then

\[
\max\left\{\|\hat{\Gamma}-\Gamma_{0}\|^{2}_{n,m},\|\hat{\Gamma}-\Gamma_{0}\|^{2}_{2}\right\}=\mathcal{O}_{\mathrm{p}}\left(\max\left\{(nm)^{-\frac{2\alpha}{1+2\alpha}}\{\log(nm)\}^{\frac{2\alpha(2p-1)}{2\alpha+1}},\ \frac{\log n}{n}\right\}\right).
\]
Remark 5.

Theorems 2 and 3 and Corollary 1 all reveal a “phase transition” of the convergence rate depending on the relative magnitudes of $n$, the sample size, and $m$, the number of observations per field. When $\kappa^{2}_{n,m}\ll\log n/n$, which is equivalent to $m\gg n^{1/(2\alpha)}(\log n)^{2p-2-1/(2\alpha)}$ in Corollary 1, both the empirical and the theoretical $L^{2}$ rates of convergence achieve the near-optimal rate $\sqrt{\log n/n}$. Under the stronger assumptions in Remark 4, the convergence rate achieves the optimal order $\sqrt{1/n}$ when $\kappa^{2}_{n,m}\ll 1/n$ (or $m\gg n^{1/(2\alpha)}(\log n)^{2p-1}$ in Corollary 1). In this case, the observations are so densely sampled that we can estimate the covariance function as precisely as if the entire sample fields were observable. In contrast, when $\kappa^{2}_{n,m}\gg\log n/n$ (or $m\ll n^{1/(2\alpha)}(\log n)^{2p-2-1/(2\alpha)}$ in Corollary 1), the convergence rate is determined by the total number of observations $nm$. When $p=1$, the asymptotic result in Corollary 1 is, up to some $\log m$ and $\log n$ terms, the same as the minimax optimal rate obtained by Cai and Yuan (2010), and is comparable to the $L^{2}$ rate obtained by Paul and Peng (2009) for $\alpha=2$.
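To make the phase transition tangible, here is a hedged sketch (our illustration; constants and slowly varying factors are ignored) that evaluates the rate bound of Corollary 1 and the threshold order of $m$ from this remark.

```python
import numpy as np

def corollary1_rate(n, m, alpha, p):
    """Order of the squared L2 error bound in Corollary 1 (constants ignored)."""
    sparse_part = (n * m) ** (-2 * alpha / (1 + 2 * alpha)) \
        * np.log(n * m) ** (2 * alpha * (2 * p - 1) / (2 * alpha + 1))
    dense_part = np.log(n) / n
    return max(sparse_part, dense_part)

def phase_transition_m(n, alpha, p):
    """Order of m beyond which the log(n)/n term dominates (Remark 5)."""
    return n ** (1 / (2 * alpha)) * np.log(n) ** (2 * p - 2 - 1 / (2 * alpha))

# Example: alpha = 2 (second-order Sobolev smoothness), p = 2 dimensions
for m in (10, 50, 500):
    print(m, corollary1_rate(n=200, m=m, alpha=2, p=2))
print("threshold m ~", phase_transition_m(n=200, alpha=2, p=2))
```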

Remark 6.

For covariance function estimation with unidimensional functional data, i.e., $p=1$, a limited number of approaches, including Cai and Yuan (2010), Li and Hsing (2010), Zhang and Wang (2016), and Liebl (2019), achieve unified theoretical results in the sense that they hold for all relative magnitudes of $n$ and $m$. What these approaches share is the availability of a closed form for the covariance function estimator. In contrast, our estimator obtained from (7) does not have a closed form due to the non-differentiability of the penalty, but it still achieves unified theoretical results that hold for both unidimensional and multidimensional functional data. Due to the lack of a closed form, we use empirical process techniques (e.g., Bartlett et al., 2005; Koltchinskii, 2011) in the theoretical development. In particular, we develop a novel grouping lemma (Lemma S4 in the supplementary material) to deterministically decouple the dependence within a $U$-statistic of order 2, which we believe is of independent interest. In our analysis, the corresponding $U$-statistic is indexed by a function class, and the grouping lemma provides a tool to obtain uniform results (see Lemma S3 in the supplementary material). In particular, it allows us to relate the empirical and theoretical $L^{2}$ norms over the underlying function class, with dependence on $n$ and $m$ precise enough to derive the unified theory. To the best of our knowledge, this paper is one of the first in the FDA literature to derive a unified result via empirical process theory, and the proof technique is potentially useful for other estimators without a closed form.

6 Simulation

To evaluate the practical performance of the proposed method, we conducted a simulation study, focusing on two-dimensional functional data. Let $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ both be the RKHS with kernel $K(t_{1},t_{2})=\sum_{k=1}^{\infty}(k\pi)^{-4}e_{k}(t_{1})e_{k}(t_{2})$, where $e_{k}(t)=\sqrt{2}\cos(k\pi t)$, $k\geq 1$. This RKHS has been used in various FDA studies, e.g., the simulation study of Cai and Yuan (2012). Each $X_{i}$ is generated from a mean-zero Gaussian random field with covariance function

\[
\gamma_{0}((s_{1},s_{2}),(t_{1},t_{2}))=\Gamma_{0}(s_{1},s_{2},t_{1},t_{2})=\sum^{R}_{k=1}k^{-2}\psi_{k}(s_{1},s_{2})\psi_{k}(t_{1},t_{2}),
\]

where the eigenfunctions $\psi_{k}(t_{1},t_{2})\in\mathcal{P}_{r_{1},r_{2}}:=\{e_{i}(t_{1})e_{j}(t_{2}):i=1,\dots,r_{1};\,j=1,\dots,r_{2}\}$. Three combinations of the one-way ranks $(r_{1},r_{2})$ and the two-way rank $R$ were studied for $\Gamma_{0}$:

Setting 1: $R=6$, $r_{1}=3$, $r_{2}=2$;    Setting 2: $R=6$, $r_{1}=r_{2}=4$;    Setting 3: $R=r_{1}=r_{2}=4$.

For each setting, we chose $R$ functions from $\mathcal{P}_{r_{1},r_{2}}$ as $\{\psi_{k}\}$ such that smoother functions are associated with larger eigenvalues. The details are described in Section S2.1 of the supplementary material.

In terms of sampling plans, we considered both sparse and dense designs. Due to the space limit, here we only show and discuss the results for the sparse design, while deferring those for the dense design to Section S2.3 of the supplementary material. For the sparse design, the random locations $\bm{T}_{ij}$, $j=1,\ldots,m$, were independently generated from the continuous uniform distribution on $[0,1]^{2}$ within each field and across different fields, and the random errors $\{\epsilon_{ij}:i=1,\ldots,n;\,j=1,\ldots,m\}$ were independently generated from $N(0,\sigma^{2})$. In each of the 200 simulation runs, the observed data were obtained following (1) with various combinations of $m=10,20$, $n=100,200$ and noise level $\sigma=0.1,0.4$.
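For concreteness, the following is a minimal sketch of this sparse-design data generation. The index pairs defining $\{\psi_k\}$ below are placeholders; the actual selections for Settings 1–3 are given in Section S2.1 of the supplementary material.

```python
import numpy as np

rng = np.random.default_rng(0)

def e(k, t):
    """Cosine basis e_k(t) = sqrt(2) cos(k*pi*t)."""
    return np.sqrt(2.0) * np.cos(k * np.pi * t)

# Illustrative eigenfunctions psi_k(t1, t2) = e_i(t1) e_j(t2); the actual
# index pairs used in Settings 1-3 are specified in Section S2.1.
index_pairs = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]  # placeholder, R = 6
eigvals = np.array([k ** (-2.0) for k in range(1, len(index_pairs) + 1)])

def simulate_sparse_data(n=200, m=10, sigma=0.1):
    """Generate sparse-design observations Y_ij = X_i(T_ij) + eps_ij."""
    T = rng.uniform(size=(n, m, 2))                     # locations in [0,1]^2
    scores = rng.normal(size=(n, len(eigvals))) * np.sqrt(eigvals)
    psi = np.stack([e(i, T[..., 0]) * e(j, T[..., 1])
                    for (i, j) in index_pairs], axis=-1)  # shape (n, m, R)
    X = np.einsum("nmr,nr->nm", psi, scores)            # X_i evaluated at T_ij
    Y = X + sigma * rng.normal(size=(n, m))
    return T, Y

T, Y = simulate_sparse_data()
```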

We compared the proposed method, denoted by mOpCov, with three existing methods: 1) OpCov: the estimator based on Wong and Zhang (2019) with adaptation to the two-dimensional case (see Section 2); 2) $l$-smooth: local linear smoothing with the Epanechnikov kernel; 3) $l$-smooth+: the two-step estimator constructed by retaining the eigen-components of $l$-smooth selected by a 99% fraction of variation explained (FVE) and then removing the eigen-components with negative eigenvalues. For both OpCov and mOpCov, 5-fold cross-validation was adopted to select the corresponding tuning parameters.

Table 1 shows the average integrated squared error (AISE), the average estimated two-way rank ($\bar{R}$), and the average estimated one-way ranks ($\bar{r}_{1},\bar{r}_{2}$) of the above covariance estimators over 200 simulated data sets in the respective settings when the sample size is $n=200$. Corresponding results for $n=100$ can be found in Table S4 of the supplementary material and lead to similar conclusions. Clearly, $l$-smooth and $l$-smooth+, especially $l$-smooth, perform significantly worse than the other two methods in both estimation accuracy and rank reduction (when applicable). Below we only compare mOpCov and OpCov.

Regarding estimation accuracy, the proposed mOpCov has uniformly smaller AISE values than OpCov, with around a 10%–20% improvement in AISE over OpCov in most cases under Settings 1 and 2. If the standard error (SE) of the AISE is taken into account, the improvements by mOpCov are more pronounced in Settings 1 and 2 than in Setting 3, since the SEs of OpCov in Setting 3 are relatively high. This is because in Setting 3 the marginal basis functions are not shared by different two-dimensional eigenfunctions, and hence mOpCov cannot benefit from structure sharing among eigenfunctions. Setting 3 is in fact an extreme setting designed to challenge the proposed method.

For rank estimation, OpCov almost always underestimates the two-way rank, while mOpCov typically overestimates both the one-way and two-way ranks. For mOpCov, the average one-way rank estimates are always smaller than the average two-way rank estimates, and their differences are substantial in Settings 1 and 2. This demonstrates the ability of mOpCov to detect the sharing of one-dimensional basis functions among two-dimensional eigenfunctions.

We also tested the performance of mOpCov under the dense and regular design, and compared it with the existing methods mentioned above together with the method of Wang and Huang (2017), which is not applicable to the sparse design. Details are given in Section S2.3 of the supplementary material, where all methods achieve similar AISE values, but mOpCov performs slightly better in estimation accuracy when the noise level is high.

Table 1: Simulation results for the three settings with the sparse design when the sample size ($n$) is 200. The AISE values with standard errors (SE) in parentheses are provided for the four covariance estimators in comparison, together with the average two-way ranks ($\bar{R}$) for the estimators that lead to rank reduction (i.e., mOpCov, OpCov, and $l$-smooth+) and the average one-way ranks ($\bar{r}_{1}$, $\bar{r}_{2}$) for mOpCov.
Setting | $m$ | $\sigma$ | Metric | mOpCov | OpCov | $l$-smooth | $l$-smooth+
1 | 10 | 0.1 | AISE | 0.053 (1.97e-03) | 0.0632 (3.22e-03) | 0.652 (1.92e-01) | 0.337 (5.35e-02)
1 | 10 | 0.1 | $\bar{R}$ | 8.38 | 2.94 | - | 172.70
1 | 10 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5.4, 5.4 | - | - | -
1 | 10 | 0.4 | AISE | 0.0547 (2.01e-03) | 0.0656 (2.72e-03) | 0.714 (2.11e-01) | 0.366 (5.96e-02)
1 | 10 | 0.4 | $\bar{R}$ | 9.16 | 2.84 | - | 177.3
1 | 10 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 5.34, 5.32 | - | - | -
1 | 20 | 0.1 | AISE | 0.0343 (1.46e-03) | 0.0421 (1.97e-03) | 0.297 (1.39e-02) | 0.206 (4.62e-03)
1 | 20 | 0.1 | $\bar{R}$ | 8.38 | 3.78 | - | 317.44
1 | 20 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5.84, 5.82 | - | - | -
1 | 20 | 0.4 | AISE | 0.0354 (1.52e-03) | 0.044 (2.21e-03) | 0.325 (1.58e-02) | 0.223 (4.94e-03)
1 | 20 | 0.4 | $\bar{R}$ | 8.86 | 3.76 | - | 326.31
1 | 20 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 5.83, 5.84 | - | - | -
2 | 10 | 0.1 | AISE | 0.0532 (1.98e-03) | 0.0636 (3.12e-03) | 2.33 (1.13e+00) | 0.795 (2.98e-01)
2 | 10 | 0.1 | $\bar{R}$ | 8.48 | 3.02 | - | 191.175
2 | 10 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5.82, 5.82 | - | - | -
2 | 10 | 0.4 | AISE | 0.0548 (2.05e-03) | 0.0686 (3.53e-03) | 2.44 (1.17e+00) | 0.828 (3.04e-01)
2 | 10 | 0.4 | $\bar{R}$ | 9.04 | 3.04 | - | 196.34
2 | 10 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 5.71, 5.74 | - | - | -
2 | 20 | 0.1 | AISE | 0.0341 (1.43e-03) | 0.0419 (2.02e-03) | 0.301 (1.58e-02) | 0.208 (4.50e-03)
2 | 20 | 0.1 | $\bar{R}$ | 8.99 | 3.74 | - | 318.645
2 | 20 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5.93, 5.92 | - | - | -
2 | 20 | 0.4 | AISE | 0.0348 (1.43e-03) | 0.043 (2.22e-03) | 0.328 (1.78e-02) | 0.225 (4.74e-03)
2 | 20 | 0.4 | $\bar{R}$ | 8.01 | 3.6 | - | 327.395
2 | 20 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 5.94, 5.93 | - | - | -
3 | 10 | 0.1 | AISE | 0.058 (2.62e-03) | 0.0692 (5.33e-03) | 0.454 (7.28e-02) | 0.286 (2.89e-02)
3 | 10 | 0.1 | $\bar{R}$ | 6.26 | 3.12 | - | 182.74
3 | 10 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5, 5.06 | - | - | -
3 | 10 | 0.4 | AISE | 0.0598 (2.68e-03) | 0.0733 (6.14e-03) | 0.531 (1.07e-01) | 0.323 (4.23e-02)
3 | 10 | 0.4 | $\bar{R}$ | 6.48 | 3.2 | - | 185.82
3 | 10 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 4.99, 5.07 | - | - | -
3 | 20 | 0.1 | AISE | 0.0422 (1.37e-03) | 0.0535 (2.64e-03) | 0.267 (5.04e-03) | 0.196 (3.59e-03)
3 | 20 | 0.1 | $\bar{R}$ | 6.29 | 4.49 | - | 332.09
3 | 20 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5.62, 5.69 | - | - | -
3 | 20 | 0.4 | AISE | 0.0424 (1.30e-03) | 0.0494 (2.42e-03) | 0.292 (5.30e-03) | 0.212 (3.72e-03)
3 | 20 | 0.4 | $\bar{R}$ | 5.68 | 3.36 | - | 338.725
3 | 20 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 5.59, 5.66 | - | - | -

7 Real Data Application

We applied the proposed method to an Argo profile data set obtained from http://www.argo.ucsd.edu/Argo_data_and.html. The Argo project maintains a global array of approximately 3,800 free-drifting profiling floats, which measure the temperature and salinity of the ocean. These floats drift freely in the depths of the ocean most of the time and ascend regularly to the sea surface, where they transmit the collected data to satellites. On any given day, only a small subset of floats appears on the sea surface. Due to the drifting process, these floats measure temperature and salinity at irregular locations over the ocean. See Figure 2 for examples.

In this analysis, we focus on the contrasting changes in sea surface temperature between the tropical western and eastern Indian Ocean, a phenomenon known as the Indian Ocean Dipole (IOD). The IOD is known to be associated with droughts in Australia (Ummenhofer et al., 2009) and has a significant effect on rainfall patterns in southeast Australia (Behera and Yamagata, 2003). According to Shinoda et al. (2004), the IOD is a predominant inter-annual variation of sea surface temperature during late boreal summer and autumn, so in this application we focused on the sea surface temperature in the Indian Ocean region of longitude 40 to 120 and latitude -20 to 20 between September and November of every year from 2003 to 2018.

Based on a simple autocorrelation analysis of the gridded data, we decided to use measurements from every tenth day in order to reduce the temporal dependence in the data.

At each float location on a particular day, the average temperature between 0 and 5 hPa from the float is regarded as a measurement. The Argo float dataset provides multiple versions of the data, and we adopted the quality-controlled (QC) version. Eventually we have two-dimensional functional data collected over $n=144$ days, where the number of observed locations $\bm{T}_{ij}=(\text{longitude},\text{latitude})$ per day varies from 7 to 47, i.e., $7\leq m_{i}\leq 47$, $i=1,\dots,n$, with an average of 21.83. The locations are rescaled to $[0,1]\times[0,1]$. As shown in Figure 2, the data have a random sparse design.
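A hedged sketch of this preprocessing is given below; the column names (`date`, `longitude`, `latitude`, `pressure`, `temperature`) are hypothetical and do not reflect the actual Argo file format.

```python
import pandas as pd

# Hypothetical columns: date, longitude, latitude, pressure, temperature.
# The actual Argo QC files use a different layout; this only mirrors the
# preprocessing steps described in the text.
def preprocess(profiles: pd.DataFrame) -> pd.DataFrame:
    # Average temperature over the 0-5 pressure range at each float location.
    surf = profiles[(profiles["pressure"] >= 0) & (profiles["pressure"] <= 5)]
    daily = (surf.groupby(["date", "longitude", "latitude"])["temperature"]
                 .mean().reset_index())
    # Restrict to the Indian Ocean window and rescale to [0,1] x [0,1].
    daily = daily[daily["longitude"].between(40, 120)
                  & daily["latitude"].between(-20, 20)]
    daily["t1"] = (daily["longitude"] - 40) / 80
    daily["t2"] = (daily["latitude"] + 20) / 40
    return daily
```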

Figure 2: Observations on 2013/09/04 (left) and all observations in the data set (right). Points on the map indicate the locations (longitude, latitude) of the observations, and the color scale of each point shows the corresponding temperature in Celsius.

First, we used kernel ridge regression with the reproducing kernel of the tensor product of two second-order Sobolev spaces (e.g., Wong and Zhang, 2019) to obtain a mean function estimate for every month. Then we applied the proposed covariance function estimator with the same kernel.
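The following is a minimal sketch of such a mean-estimation step. It uses a generic product kernel as a stand-in; the paper instead uses the reproducing kernel of the tensor product of two second-order Sobolev spaces, and the ridge parameter below is illustrative.

```python
import numpy as np

def krr_mean_estimate(T, y, kernel, lam=1e-3):
    """Kernel ridge regression fit m_hat(t) = sum_i c_i k(t, T_i).

    T: (N, 2) pooled observation locations; y: (N,) temperatures;
    kernel(a, b): bivariate product kernel k(a, b) = k1(a1, b1) * k2(a2, b2).
    """
    K = np.array([[kernel(a, b) for b in T] for a in T])
    c = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
    return lambda t: sum(ci * kernel(t, Ti) for ci, Ti in zip(c, T))

# Illustrative product kernel; a stand-in for the Sobolev tensor-product kernel.
gauss = lambda s, t, h=0.2: np.exp(-(s - t) ** 2 / (2 * h ** 2))
prod_kernel = lambda a, b: gauss(a[0], b[0]) * gauss(a[1], b[1])
```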

The estimates of the top two two-dimensional $L^{2}$ eigenfunctions are illustrated in Figure 3. The first eigenfunction shows the east–west dipole mode, which aligns with existing scientific findings (e.g., Shinoda et al., 2004; Chu et al., 2014; Deser et al., 2010). The second eigenfunction can be interpreted as the basin-wide mode, which is a dominant mode throughout the year (e.g., Deser et al., 2010; Chu et al., 2014).

To provide a clearer understanding of the covariance structure, we derived marginal $L^{2}$ bases along longitude and latitude, respectively; the details are given in Appendix A. The left panel of Figure 4 shows that the first longitudinal marginal basis function reflects a large variation in the western region, while the second corresponds to the variation in the eastern region. Through different linear combinations, the variation along longitude may contribute not only to opposite changes between the eastern and western sides of the Indian Ocean, as shown in the first two-dimensional eigenfunction, but also to an overall warming or cooling tendency, as shown in the second two-dimensional eigenfunction. The second longitudinal marginal basis function reveals that the variation grows toward the eastern boundary, which suggests that the IOD may be related to the Pacific Ocean. This aligns with the evidence that the IOD is linked with the El Niño Southern Oscillation (ENSO) (Stuecker et al., 2017), an irregularly periodic variation in sea surface temperature over the tropical eastern Pacific Ocean. As shown in the right panel of Figure 4, the first latitudinal marginal basis function is nearly constant, which provides evidence that the IOD is primarily associated with the variation along longitude.

Figure 3: The first (left) and second (right) two-dimensional $L^{2}$ eigenfunctions. The first eigenfunction explains 33.60% of the variance and the second explains 25.94%.
Figure 4: The first two marginal $L^{2}$ basis functions along (a) longitude (39.06%, 36.10%) and (b) latitude (48.22%, 25.40%), respectively. Solid lines are the first marginal basis functions and dotted lines are the second. The fractions of variation explained by the corresponding principal components are given in parentheses.

Appendix

Appendix A $L^{2}$ eigensystem and $L^{2}$ marginal basis

In this section, we present a transformation procedure to produce $L^{2}$ eigenfunctions and the corresponding eigenvalues from our estimator $\hat{\bm{B}}$ obtained from (11).

Let $\bm{Q}_{k}=[\int_{[0,1]}K(s,T_{ijk})K(s,T_{i^{\prime}j^{\prime}k})\,ds]_{1\leq i,i^{\prime}\leq n,\,1\leq j,j^{\prime}\leq m}$, $k=1,\ldots,p$. Then $\bm{Q}_{k}=\bm{M}_{k}\bm{R}_{k}\bm{M}_{k}^{\intercal}$, where $\bm{R}_{k}=[\int_{[0,1]}v_{l}(s)v_{h}(s)\,ds]_{1\leq l,h\leq q_{k}}$ and $\{v_{l}:l=1,\ldots,q_{k}\}$ form a basis of $\mathcal{H}_{k}$, so $\bm{R}_{k}=\bm{M}_{k}^{+}\bm{Q}_{k}(\bm{M}_{k}^{+})^{\intercal}$. The $L^{2}$ eigenvalues of $\hat{\Gamma}_{\blacksquare}$ coincide with the eigenvalues of the matrix $\hat{\bm{B}}^{L}_{\mathrm{square}}:=(\bm{R}_{1}\otimes\cdots\otimes\bm{R}_{p})^{1/2}\hat{\bm{B}}_{\blacksquare}[(\bm{R}_{1}\otimes\cdots\otimes\bm{R}_{p})^{1/2}]^{\intercal}$, and the number of nonzero eigenvalues equals the rank of $\hat{\bm{B}}_{\blacksquare}$. The $L^{2}$ eigenfunction $\hat{\phi}_{l}$ corresponding to the $l$-th eigenvalue of $\hat{\Gamma}_{\blacksquare}$ can be expressed as $\hat{\phi}_{l}(s_{1},\dots,s_{p})=\bm{u}_{l}^{\intercal}[\bm{z}_{1}(s_{1})\otimes\cdots\otimes\bm{z}_{p}(s_{p})]$, where $\bm{z}_{k}(\cdot)$, $k=1,\ldots,p$, are defined in Theorem 1, and $\bm{u}_{l}=(\bm{M}_{1}^{+}\otimes\cdots\otimes\bm{M}_{p}^{+})^{\intercal}(\bm{R}_{1}\otimes\cdots\otimes\bm{R}_{p})^{-1/2}\bm{v}_{l}$ with $\bm{v}_{l}$ the $l$-th eigenvector of $\hat{\bm{B}}^{L}_{\mathrm{square}}$. Using the properties of Kronecker products, we have $\hat{\phi}_{l}(s_{1},\dots,s_{p})=\bm{v}_{l}^{\intercal}[(\bm{R}_{1}^{-1/2}\bm{M}_{1}^{+}\bm{z}_{1}(s_{1}))\otimes\cdots\otimes(\bm{R}_{p}^{-1/2}\bm{M}_{p}^{+}\bm{z}_{p}(s_{p}))]$.

By simple verification, we can see that $\bm{R}_{k}^{-1/2}\bm{M}_{k}^{+}\bm{z}_{k}(\cdot)$ gives $q_{k}$ orthonormal one-dimensional $L^{2}$ functions for dimension $k$, $k=1,\dots,p$. Therefore, we can also express $\hat{\Gamma}$ in terms of these one-dimensional $L^{2}$ basis functions, and the coefficients form a $2p$-th order tensor of dimension $q_{1}\times\cdots\times q_{p}\times q_{1}\times\cdots\times q_{p}$. We use $\hat{\bm{B}}^{L}$ to denote this new coefficient tensor and extend our unfolding operators to the $L^{2}$ space. It is easy to see that $\hat{\bm{B}}^{L}_{\blacksquare}=\hat{\bm{B}}^{L}_{\mathrm{square}}$.
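A numerical sketch (ours, not part of the paper) of this transformation for $p=2$: given the matricized coefficient estimate and the Gram-type matrices $\bm{R}_1,\bm{R}_2$, form $\hat{\bm{B}}^{L}_{\mathrm{square}}$ and eigendecompose it. The variable names are placeholders.

```python
import numpy as np
from scipy.linalg import sqrtm, eigh

def l2_eigensystem(B_square, R1, R2):
    """L2 eigenvalues and eigenvector coefficients from the coefficient matrix.

    B_square : (q1*q2, q1*q2) square matricization of the coefficient tensor (p = 2).
    R1, R2   : q_k x q_k matrices of L2 inner products of the basis functions.
    Returns the eigenvalues of B^L_square and the eigenvectors v_l; the L2
    eigenfunctions then follow from the Kronecker-product formula in the text.
    """
    # (R1 (x) R2)^{1/2} = R1^{1/2} (x) R2^{1/2} for symmetric PSD matrices.
    R_half = np.real(np.kron(sqrtm(R1), sqrtm(R2)))
    B_L = R_half @ B_square @ R_half.T
    vals, vecs = eigh(B_L)
    order = np.argsort(vals)[::-1]          # sort eigenvalues in decreasing order
    return vals[order], vecs[:, order]
```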

Since $\hat{\Gamma}_{(k)}$ is a compact operator in the $L^{2}$ space, it admits a singular value decomposition, which leads to an $L^{2}$ basis characterizing the marginal variation along the $k$-th dimension. We call it an $L^{2}$ marginal basis for the $k$-th dimension. The marginal basis function $\hat{\psi}^{k}_{l}$ corresponding to the $l$-th singular value for dimension $k$ can be expressed as $\hat{\psi}^{k}_{l}(\cdot)=(\bm{u}^{k}_{l})^{\intercal}\bm{z}_{k}(\cdot)$, where $\bm{u}^{k}_{l}=(\bm{M}_{k}^{+})^{\intercal}\bm{R}_{k}^{-1/2}\bm{v}^{k}_{l}$ and $\bm{v}^{k}_{l}$ is the $l$-th singular vector of $\hat{\bm{B}}^{L}_{(k)}$. The $L^{2}$ marginal singular values of $\hat{\Gamma}_{(k)}$ coincide with the singular values of the matrix $\hat{\bm{B}}^{L}_{(k)}$.

Appendix B Definitions of $\kappa_{n,m}$ and $\eta_{n,m}$

Here we provide the specific forms of $\kappa_{n,m}$ and $\eta_{n,m}$, which are closely related to the decay of $\{\mu_{l}\mu_{h}:l,h\geq 1\}$. Specifically, $\kappa_{n,m}$ is defined as the smallest positive $\kappa$ such that

\[
cb^{3}\left[\frac{1}{n(m-1)}\sum_{l,h=1}^{\infty}\min\left\{\kappa^{2},\mu_{l}\mu_{h}\right\}\right]^{1/2}\leq\kappa^{2},
\qquad
32cb\left[\frac{1}{n(m-1)}\sum_{l,h=1}^{\infty}\min\left\{\kappa^{2}/b^{2},\mu_{l}\mu_{h}\right\}\right]^{1/2}\leq\kappa^{2},
\qquad (17)
\]

where $c$ is a universal constant, and $\eta_{n,m}$ is defined as the smallest positive $\eta$ such that

\[
\left(\frac{c_{\eta}}{nm}\sum_{l,h=1}^{\infty}\min\{\eta^{2},\mu_{l}\mu_{h}\}+\frac{\eta^{2}}{n}\right)^{1/2}\leq\eta^{2}, \qquad (18)
\]

where $c_{\eta}$ is a constant depending on $b$, $b_{X}$ and $b_{\epsilon}$. The existence of $\kappa_{n,m}$ and $\eta_{n,m}$ is established in the proof of Theorem 2.
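As an illustration (not part of the paper), $\kappa_{n,m}$ can be located numerically by bisection, since the left-hand sides of (17) grow more slowly in $\kappa$ than the right-hand side $\kappa^{2}$; the constants $c$ and $b$ are unspecified in the text and set to 1 below, and product eigenvalues with polynomial univariate decay are used. An analogous search applies to $\eta_{n,m}$ in (18).

```python
import numpy as np

def kappa_nm(n, m, mu, c=1.0, b=1.0):
    """Smallest kappa satisfying both inequalities in (17), found by bisection.

    mu : 1-D array of product eigenvalues {mu_l * mu_h}; c and b are
    illustrative placeholders for the unspecified constants.
    """
    def feasible(kappa):
        s1 = c * b**3 * np.sqrt(np.sum(np.minimum(kappa**2, mu)) / (n * (m - 1)))
        s2 = 32 * c * b * np.sqrt(np.sum(np.minimum(kappa**2 / b**2, mu)) / (n * (m - 1)))
        return max(s1, s2) <= kappa**2

    lo, hi = 1e-8, 10.0
    for _ in range(80):                      # bisect on the feasibility threshold
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Product eigenvalues mu_l * mu_h with polynomial univariate decay (alpha = 2).
mu1 = np.arange(1, 200, dtype=float) ** (-4.0)
mu = np.outer(mu1, mu1).ravel()
print(kappa_nm(n=200, m=10, mu=mu))
```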

Supplementary Material

In the supplementary material related to this paper, we provide proofs of our theoretical findings and additional simulation results.

Acknowledgement

The research of Raymond K. W. Wong is partially supported by the U.S. National Science Foundation under grants DMS-1806063, DMS-1711952 (subcontract) and CCF-1934904. The research of Xiaoke Zhang is partially supported by the U.S. National Science Foundation under grant DMS-1832046. Portions of this research were conducted with high performance research computing resources provided by Texas A&M University (https://hprc.tamu.edu).

References

  • Abernethy et al. (2009) Abernethy, J., F. Bach, T. Evgeniou, and J.-P. Vert (2009). A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research 10, 803–826.
  • Allen (2013) Allen, G. I. (2013). Multi-way functional principal components analysis. In 2013 5th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp.  220–223. IEEE.
  • Bartlett et al. (2005) Bartlett, P. L., O. Bousquet, and S. Mendelson (2005). Local Rademacher complexities. The Annals of Statistics 33(4), 1497–1537.
  • Behera and Yamagata (2003) Behera, S. K. and T. Yamagata (2003). Influence of the Indian Ocean Dipole on the Southern Oscillation. Journal of the Meteorological Society of Japan. Ser. II 81(1), 169–177.
  • Boyd et al. (2010) Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3(1), 1–122.
  • Cai and Yuan (2010) Cai, T. T. and M. Yuan (2010). Nonparametric covariance function estimation for functional and longitudinal data. Technical report, Georgia Institute of Technology, Atlanta, GA.
  • Cai and Yuan (2012) Cai, T. T. and M. Yuan (2012). Minimax and adaptive prediction for functional linear regression. Journal of the American Statistical Association 107(499), 1201–1216.
  • Chen et al. (2017) Chen, K., P. Delicado, and H.-G. Müller (2017). Modelling function-valued stochastic processes, with applications to fertility dynamics. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(1), 177–196.
  • Chen and Müller (2012) Chen, K. and H.-G. Müller (2012). Modeling repeated functional observations. Journal of the American Statistical Association 107(500), 1599–1609.
  • Chen and Jiang (2017) Chen, L.-H. and C.-R. Jiang (2017). Multi-dimensional functional principal component analysis. Statistics and Computing 27(5), 1181–1192.
  • Chu et al. (2014) Chu, J.-E., K.-J. Ha, J.-Y. Lee, B. Wang, B.-H. Kim, and C. E. Chung (2014). Future change of the Indian Ocean basin-wide and dipole modes in the CMIP5. Climate Dynamics 43(1-2), 535–551.
  • Deser et al. (2010) Deser, C., M. A. Alexander, S.-P. Xie, and A. S. Phillips (2010). Sea surface temperature variability: Patterns and mechanisms. Annual Review of Marine Science 2, 115–143.
  • Ferraty and Vieu (2006) Ferraty, F. and P. Vieu (2006). Nonparametric functional data analysis: theory and practice. Springer, New York.
  • Goldsmith et al. (2011) Goldsmith, J., J. Bobb, C. M. Crainiceanu, B. Caffo, and D. Reich (2011). Penalized functional regression. Journal of Computational and Graphical Statistics 20(4), 830–851.
  • Gu (2013) Gu, C. (2013). Smoothing Spline ANOVA Models (2nd ed.). New York: Springer.
  • Hackbusch (2012) Hackbusch, W. (2012). Tensor spaces and numerical tensor calculus, Volume 42. Springer Science & Business Media.
  • Halko et al. (2009) Halko, N., P.-G. Martinsson, and J. A. Tropp (2009). Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv preprint arXiv:0909.4061.
  • Hall and Vial (2006) Hall, P. and C. Vial (2006). Assessing the finite dimensionality of functional data. Journal of the Royal Statistical Society: Series B 68(4), 689–705.
  • Horváth and Kokoszka (2012) Horváth, L. and P. Kokoszka (2012). Inference for functional data with applications, Volume 200. Springer, New York.
  • Hsing and Eubank (2015) Hsing, T. and R. Eubank (2015). Theoretical foundations of functional data analysis, with an introduction to linear operators. John Wiley & Sons.
  • Huang et al. (2009) Huang, J. Z., H. Shen, and A. Buja (2009). The analysis of two-way functional data using two-way regularized singular value decompositions. Journal of the American Statistical Association 104(488), 1609–1620.
  • James et al. (2000) James, G. M., T. J. Hastie, and C. A. Sugar (2000). Principal component models for sparse functional data. Biometrika 87(3), 587–602.
  • Kadkhodaie et al. (2015) Kadkhodaie, M., K. Christakopoulou, M. Sanjabi, and A. Banerjee (2015). Accelerated alternating direction method of multipliers. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp.  497–506. ACM.
  • Kokoszka and Reimherr (2017) Kokoszka, P. and M. Reimherr (2017). Introduction to functional data analysis. CRC Press.
  • Koltchinskii (2011) Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d’Eté de Probabilités de Saint-Flour XXXVIII-2008, Volume 2033. Springer Science & Business Media.
  • Li and Song (2017) Li, B. and J. Song (2017). Nonlinear sufficient dimension reduction for functional data. The Annals of Statistics 45(3), 1059–1095.
  • Li and Hsing (2010) Li, Y. and T. Hsing (2010). Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. The Annals of Statistics 38(6), 3321–3351.
  • Liebl (2019) Liebl, D. (2019). Inference for sparse and dense functional data with covariate adjustments. Journal of Multivariate Analysis 170, 315–335.
  • Lynch and Chen (2018) Lynch, B. and K. Chen (2018). A test of weak separability for multi-way functional data, with application to brain connectivity studies. Biometrika 105(4), 815–831.
  • Mazumder et al. (2010) Mazumder, R., T. Hastie, and R. Tibshirani (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research 11, 2287–2322.
  • Mercer (1909) Mercer, J. (1909). XVI. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A 209(441-458), 415–446.
  • Park and Staicu (2015) Park, S. Y. and A.-M. Staicu (2015). Longitudinal functional data analysis. Stat 4(1), 212–226.
  • Paul and Peng (2009) Paul, D. and J. Peng (2009). Consistency of restricted maximum likelihood estimators of principal components. The Annals of Statistics 37(3), 1229–1271.
  • Pearce and Wand (2006) Pearce, N. D. and M. P. Wand (2006). Penalized splines and reproducing kernel methods. The American Statistician 60(3), 233–240.
  • Poskitt and Sengarapillai (2013) Poskitt, D. S. and A. Sengarapillai (2013). Description length and dimensionality reduction in functional data analysis. Computational Statistics & Data Analysis 58, 98–113.
  • Ramsay and Silverman (2005) Ramsay, J. and B. Silverman (2005). Functional data analysis. Springer, New York.
  • Raskutti et al. (2012) Raskutti, G., M. J. Wainwright, and B. Yu (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. The Journal of Machine Learning Research 13, 389–427.
  • Reimherr et al. (2018) Reimherr, M., B. Sriperumbudur, and B. Taoufik (2018). Optimal prediction for additive function-on-function regression. Electronic Journal of Statistics 12(2), 4571–4601.
  • Rice and Silverman (1991) Rice, J. A. and B. W. Silverman (1991). Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological) 53(1), 233–243.
  • Shamshoian et al. (2019) Shamshoian, J., D. Senturk, S. Jeste, and D. Telesca (2019). Bayesian analysis of multidimensional functional data. arXiv preprint arXiv:1909.08763.
  • Shinoda et al. (2004) Shinoda, T., H. H. Hendon, and M. A. Alexander (2004). Surface and subsurface dipole variability in the Indian Ocean and its relation with ENSO. Deep Sea Research Part I: Oceanographic Research Papers 51(5), 619–635.
  • Stuecker et al. (2017) Stuecker, M. F., A. Timmermann, F.-F. Jin, Y. Chikamoto, W. Zhang, A. T. Wittenberg, E. Widiasih, and S. Zhao (2017). Revisiting ENSO/Indian Ocean Dipole phase relationships. Geophysical Research Letters 44(5), 2481–2492.
  • Sun et al. (2018) Sun, X., P. Du, X. Wang, and P. Ma (2018). Optimal penalized function-on-function regression under a reproducing kernel Hilbert space framework. Journal of the American Statistical Association 113(524), 1601–1611.
  • Tucker (1966) Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika 31(3), 279–311.
  • Ummenhofer et al. (2009) Ummenhofer, C. C., M. H. England, P. C. McIntosh, G. A. Meyers, M. J. Pook, J. S. Risbey, A. S. Gupta, and A. S. Taschetto (2009). What causes southeast Australia’s worst droughts? Geophysical Research Letters 36(4).
  • Wahba (1990) Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.
  • Wang and Huang (2017) Wang, W.-T. and H.-C. Huang (2017). Regularized principal component analysis for spatial data. Journal of Computational and Graphical Statistics 26(1), 14–25.
  • Wong et al. (2019) Wong, R. K. W., Y. Li, and Z. Zhu (2019). Partially linear functional additive models for multivariate functional data. Journal of the American Statistical Association 114(525), 406–418.
  • Wong and Zhang (2019) Wong, R. K. W. and X. Zhang (2019). Nonparametric operator-regularized covariance function estimation for functional data. Computational Statistics & Data Analysis 131, 131–144.
  • Xiao et al. (2013) Xiao, L., Y. Li, and D. Ruppert (2013). Fast bivariate p-splines: the sandwich smoother. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(3), 577–599.
  • Yao et al. (2005) Yao, F., H.-G. Müller, and J.-L. Wang (2005). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100(470), 577–590.
  • Yuan and Cai (2010) Yuan, M. and T. T. Cai (2010). A reproducing kernel Hilbert space approach to functional linear regression. The Annals of Statistics 38(6), 3412–3444.
  • Zhang et al. (2013) Zhang, L., H. Shen, and J. Z. Huang (2013). Robust regularized singular value decomposition with application to mortality data. The Annals of Applied Statistics 7(3), 1540–1561.
  • Zhang and Wang (2016) Zhang, X. and J.-L. Wang (2016). From sparse to dense functional data and beyond. The Annals of Statistics 44(5), 2281–2321.
  • Zhou and Pan (2014) Zhou, L. and H. Pan (2014). Principal component analysis of two-dimensional functional data. Journal of Computational and Graphical Statistics 23(3), 779–801.
  • Zhu et al. (2014) Zhu, H., F. Yao, and H. H. Zhang (2014). Structured functional additive regression in reproducing kernel Hilbert spaces. Journal of the Royal Statistical Society: Series B 76(3), 581–603.
  • Zipunnikov et al. (2011) Zipunnikov, V., B. Caffo, D. M. Yousem, C. Davatzikos, B. S. Schwartz, and C. Crainiceanu (2011). Multilevel functional principal component analysis for high-dimensional data. Journal of Computational and Graphical Statistics 20(4), 852–873.