
Low-Rank Covariance Function Estimation for Multidimensional Functional Data

Jiayi Wang Department of Statistics, Texas A&M University Raymond K. W. Wong The research of Raymond K. W. Wong is partially supported by National Science Foundation grants DMS-1806063, DMS-1711952 and CCF-1934904. Department of Statistics, Texas A&M University Xiaoke Zhang The research of Xiaoke Zhang is partially supported by National Science Foundation grant DMS-1832046. Department of Statistics, George Washington University
Abstract

Multidimensional functional data arise in many fields nowadays. The covariance function plays an important role in the analysis of such increasingly common data. In this paper, we propose a novel nonparametric covariance function estimation approach under the framework of reproducing kernel Hilbert spaces (RKHS) that can handle both sparse and dense functional data. We extend multilinear rank structures for (finite-dimensional) tensors to functions, which allow for flexible modeling of both covariance operators and marginal structures. The proposed framework can guarantee that the resulting estimator is automatically positive semi-definite, and can incorporate various spectral regularizations. The trace-norm regularization in particular can promote low ranks for both the covariance operator and the marginal structures. Despite the lack of a closed form, under mild assumptions, the proposed estimator can achieve unified theoretical results that hold for any relative magnitudes between the sample size and the number of observations per sample field, and the rate of convergence reveals the “phase-transition” phenomenon from sparse to dense functional data. Based on a new representer theorem, an ADMM algorithm is developed for the trace-norm regularization. The appealing numerical performance of the proposed estimator is demonstrated by a simulation study and the analysis of a dataset from the Argo project.

Keywords: Functional data analysis; multilinear ranks; tensor product space; unified theory

1 Introduction

In recent decades, functional data analysis (FDA) has become a popular branch of statistical research. General introductions to FDA can be found in a few monographs (e.g., Ramsay and Silverman, 2005; Ferraty and Vieu, 2006; Horváth and Kokoszka, 2012; Hsing and Eubank, 2015; Kokoszka and Reimherr, 2017). While traditional FDA deals with a sample of time-varying trajectories, many new forms of functional data have emerged due to improved capabilities of data recording and storage, as well as advances in scientific computing. One particular new form of functional data is multidimensional functional data, which becomes increasingly common in various fields such as climate science, neuroscience and chemometrics. Multidimensional functional data are generated from random fields, i.e., random functions of several input variables. One example is multi-subject magnetic resonance imaging (MRI) scans, such as those collected by the Alzheimer’s Disease Neuroimaging Initiative. A human brain is virtually divided into three-dimensional boxes called “voxels” and brain signals obtained from these voxels form a three-dimensional functional sample indexed by spatial locations of the voxels. Despite the growing popularity of multidimensional functional data, statistical methods for such data are limited apart from very few existing works (e.g., Huang et al., 2009; Allen, 2013; Zhang et al., 2013; Zhou and Pan, 2014; Wang and Huang, 2017).

In FDA, covariance function estimation plays an important role. Many methods have been proposed for unidimensional functional data (e.g., Rice and Silverman, 1991; James et al., 2000; Yao et al., 2005; Paul and Peng, 2009; Li and Hsing, 2010; Goldsmith et al., 2011; Xiao et al., 2013), and a few were particularly developed for two-dimensional functional data (e.g., Zhou and Pan, 2014; Wang and Huang, 2017). In general, when the input domain is of dimension $p$, one needs to estimate a $2p$-dimensional covariance function. Since covariance function estimation in FDA is typically nonparametric, the curse of dimensionality emerges quickly when $p$ is moderate or large.

For general $p$, most existing work is restricted to regular and fixed designs (e.g., Zipunnikov et al., 2011; Allen, 2013), where all random fields are observed over a regular grid like MRI scans. Such a sampling plan leads to a tensor dataset, so one may apply tensor/matrix decompositions to estimate the covariance function. When random fields are observed at irregular locations, the dataset is no longer a completely observed tensor, so tensor/matrix decompositions are not directly applicable. If observations are densely collected for each random field, a two-step approach is a natural solution, which involves pre-smoothing every random field followed by tensor/matrix decompositions at a fine discretized grid. However, this solution is infeasible for sparse data where there are only a limited number of observations per random field. One example is the data collected by the international Argo project (http://www.argo.net). See Section 7 for more details. In such a sparse data setting, one may apply the local smoothing method of Chen and Jiang (2017), but it suffers from the curse of dimensionality when the dimension $p$ is moderate, due to the underlying $2p$-dimensional nonparametric regression.

We notice that there is a related class of literature on longitudinal functional data (e.g., Chen and Müller, 2012; Park and Staicu, 2015; Chen et al., 2017), a special type of multidimensional functional data where a function is repeatedly measured over longitudinal times. Typically multi-step methods are needed to model the functional and longitudinal dimensions either separately (one dimension at a time) or sequentially (one dimension given the other), as opposed to the joint estimation procedure proposed in this paper. We also notice a recent work on longitudinal functional data under the Bayesian framework (Shamshoian et al., 2019).

The contribution of this paper is three-fold. First, we propose a new and flexible nonparametric method for low-rank covariance function estimation for multidimensional functional data, via the introduction of (infinite-dimensional) unfolding operators (see Section 3). This method can handle both sparse and dense functional data, and can achieve joint structural reductions in all dimensions, in addition to rank reduction of the covariance operator. The proposed estimator is guaranteed to be positive semi-definite. As a one-step procedure, our method reduces the theoretical complexity compared to multi-step estimators, which often involve a functional principal component analysis followed by a truncation and reconstruction step (e.g., Hall and Vial, 2006; Poskitt and Sengarapillai, 2013).

Second, we generalize the representer theorem for unidimensional functional data by Wong and Zhang (2019) to the multidimensional case with more complex spectral regularizations. The new representer theorem makes the estimation procedure practically computable by providing a finite-dimensional parametrization of the solution to the underlying infinite-dimensional optimization.

Finally, a unified asymptotic theory is developed for the proposed estimator. It automatically incorporates the settings of dense and sparse functional data, and reveals a phase transition in the rate of convergence. Different from existing theoretical work that relies heavily on closed-form representations of estimators (Li and Hsing, 2010; Cai and Yuan, 2010; Zhang and Wang, 2016; Liebl, 2019), this paper provides the first unified theory for penalized global M-estimators of covariance functions which does not require a closed-form solution. Furthermore, a near-optimal (i.e., optimal up to a logarithmic order) one-dimensional nonparametric rate of convergence is attainable for the $2p$-dimensional covariance function estimator for Sobolev–Hilbert spaces.

The rest of the paper is organized as follows. Section 2 provides some background on reproducing kernel Hilbert space (RKHS) frameworks for functional data. Section 3 introduces the Tucker decomposition for finite-dimensional tensors and our proposed generalization to tensor product RKHS operators, which is the foundation for our estimation procedure. The proposed estimation method is given in Section 4, together with a computational algorithm. The unified theoretical results are presented in Section 5. The numerical performance of the proposed method is evaluated by a simulation study in Section 6 and a real data application in Section 7.

2 RKHS Framework for Functional Data

In recent years there has been a surge of RKHS methods in FDA (e.g., Yuan and Cai, 2010; Zhu et al., 2014; Li and Song, 2017; Reimherr et al., 2018; Sun et al., 2018; Wong et al., 2019). However, covariance function estimation, a seemingly well-studied problem, has not received the same amount of attention in the development of RKHS methods, even for unidimensional functional data. Interestingly, we find that RKHS modeling provides a versatile framework for both unidimensional and multidimensional functional data.

Let $X$ be a random field defined on an index set $\mathcal{T}\subset\mathbb{R}^{p}$, with a mean function $\mu_{0}(\cdot)=\mathbb{E}\{X(\cdot)\}$ and a covariance function $\gamma_{0}(*,\cdot)=\mathrm{Cov}(X(*),X(\cdot))$, and let $\{X_{i}:i=1,\ldots,n\}$ be $n$ independently and identically distributed (i.i.d.) copies of $X$. Typically, a functional dataset is represented by $\{(\bm{T}_{ij},Y_{ij}):j=1,\dots,m_{i};i=1,\dots,n\}$, where

Y_{ij}=X_{i}(\bm{T}_{ij})+\epsilon_{ij}\in\mathbb{R} \qquad (1)

is the noisy measurement of the $i$-th random field $X_{i}$ taken at the corresponding index $\bm{T}_{ij}\in\mathcal{T}$, $m_{i}$ is the number of measurements observed from the $i$-th random field, and $\{\epsilon_{ij}:i=1,\dots,n;j=1,\dots,m_{i}\}$ are independent errors with mean zero and finite variance. For simplicity and without loss of generality, we assume $m_{i}=m$ for all $i$.

As in many nonparametric regression setups such as penalized regression splines (e.g., Pearce and Wand, 2006) and smoothing splines (e.g., Wahba, 1990; Gu, 2013), the sample field of $X$, i.e., the realized $X$ (as opposed to the sample path of a unidimensional random function), is assumed to reside in an RKHS $\mathcal{H}$ of functions defined on $\mathcal{T}$ with a continuous and square integrable reproducing kernel $K$. Let $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and $\|\cdot\|_{\mathcal{H}}$ denote the inner product and norm of $\mathcal{H}$, respectively. With the technical condition $\mathbb{E}\|X\|^{2}_{\mathcal{H}}<\infty$, the covariance function $\gamma_{0}$ resides in the tensor product RKHS $\mathcal{H}\otimes\mathcal{H}$. It can be shown that $\mathcal{H}\otimes\mathcal{H}$ is an RKHS, equipped with the reproducing kernel $K\otimes K$ defined as $(K\otimes K)((\bm{s}_{1},\bm{t}_{1}),(\bm{s}_{2},\bm{t}_{2}))=K(\bm{s}_{1},\bm{s}_{2})K(\bm{t}_{1},\bm{t}_{2})$, for any $\bm{s}_{1},\bm{s}_{2},\bm{t}_{1},\bm{t}_{2}\in\mathcal{T}$. This result has been exploited by Cai and Yuan (2010) and Wong and Zhang (2019) for covariance estimation in the unidimensional setting.
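To make the product-kernel construction concrete, here is a minimal Python sketch (ours, not from the paper); the Gaussian form of the kernel $K$ and all names are illustrative assumptions, since the paper only requires $K$ to be continuous and square integrable.

    import numpy as np

    def K(s, t, bandwidth=0.5):
        # Illustrative kernel on T (a subset of R^p); the Gaussian form is an assumption.
        return np.exp(-np.sum((np.asarray(s) - np.asarray(t)) ** 2) / (2 * bandwidth ** 2))

    def K_tensor(s1, t1, s2, t2, bandwidth=0.5):
        # (K x K)((s1, t1), (s2, t2)) = K(s1, s2) * K(t1, t2)
        return K(s1, s2, bandwidth) * K(t1, t2, bandwidth)

    # evaluate at two index pairs in T = [0, 1]^p with p = 2
    s1, t1 = np.array([0.1, 0.2]), np.array([0.3, 0.4])
    s2, t2 = np.array([0.5, 0.6]), np.array([0.7, 0.8])
    print(K_tensor(s1, t1, s2, t2))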

For any function $f\in\mathcal{H}\otimes\mathcal{H}$, there exists an operator mapping $\mathcal{H}$ to $\mathcal{H}$ defined by $g\in\mathcal{H}\mapsto\langle f(*,\cdot),g(\cdot)\rangle_{\mathcal{H}}\in\mathcal{H}$. When $f$ is a covariance function, we call the induced operator an $\mathcal{H}$-covariance operator, or simply a covariance operator below. To avoid clutter, the induced operator will share the same notation as the generating function. Similar to $L^{2}$-covariance operators, the definition of an induced operator is obtained by replacing the $L^{2}$ inner product with the RKHS inner product. The benefits of considering this operator have been discussed in Wong and Zhang (2019). We also note that a singular value decomposition (e.g., Hsing and Eubank, 2015) of the induced operator exists whenever the corresponding function $f$ belongs to the tensor product RKHS $\mathcal{H}\otimes\mathcal{H}$. The idea of the induced operator can be similarly extended to a general tensor product space $\mathcal{F}_{1}\otimes\mathcal{F}_{2}$, where $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ are two generic RKHSs of functions.

For any $\gamma\in\mathcal{H}\otimes\mathcal{H}$, let $\gamma^{\top}$ be the transpose of $\gamma$, i.e., $\gamma^{\top}(\bm{s},\bm{t})=\gamma(\bm{t},\bm{s})$, $\bm{s},\bm{t}\in\mathcal{T}$. Define $\mathcal{M}=\{\gamma\in\mathcal{H}\otimes\mathcal{H}:\gamma\equiv\gamma^{\top}\}$. To guarantee symmetry and positive semi-definiteness of the estimators, Wong and Zhang (2019) adopted $\mathcal{M}^{+}=\{\gamma\in\mathcal{M}:\langle\gamma f,f\rangle_{\mathcal{H}}\geq 0,\ \forall f\in\mathcal{H}\}$ as the hypothesis class of $\gamma_{0}$ and considered the following regularized estimator:

\underset{\gamma\in\mathcal{M}^{+}}{\arg\min}\left\{\ell(\gamma)+\tau\Psi(\gamma)\right\}, \qquad (2)

where $\ell$ is a convex and smooth loss function characterizing the fidelity to the data, $\Psi(\gamma)$ is a spectral penalty function (see Definition 5 below), and $\tau$ is a tuning parameter. Due to the constraints specified in $\mathcal{M}^{+}$, the resulting covariance estimator is always positive semi-definite. In particular, if the spectral penalty function $\Psi(\gamma)$ imposes the trace-norm regularization, an $\ell_{1}$-type shrinkage penalty on the respective singular values, the estimator is usually of low rank. Cai and Yuan (2010) adopted a similar objective function as in (2) but with the hypothesis class $\mathcal{H}\otimes\mathcal{H}$ and an $\ell_{2}$-type penalty $\Psi(\gamma)=\|\gamma\|_{\mathcal{H}\otimes\mathcal{H}}^{2}$, so the resulting estimator may neither be positive semi-definite nor low-rank.

Although Cai and Yuan (2010) and Wong and Zhang (2019) focused on unidimensional functional data, their frameworks can be directly extended to the multidimensional setting. Explicitly, similar to (2), as long as a proper $\mathcal{H}$ for the random fields with dimension $p>1$ is selected, an efficient “one-step” covariance function estimator with the hypothesis class $\mathcal{M}^{+}$ can be obtained immediately, which results in a positive semi-definite and possibly low-rank estimator. Since an RKHS is identified by its reproducing kernel, we simply need to pick a multivariate reproducing kernel $K$ for multidimensional functional data. However, even when low-rank approximation/estimation is adopted (e.g., by trace-norm regularization), we still need to estimate several $p$-dimensional eigenfunctions nonparametrically. This curse of dimensionality calls for more efficient modeling. Below, we explore this through the lens of tensor decomposition in finite-dimensional vector spaces and its extension to infinite-dimensional function spaces.

3 Low-Rank Modeling via Functional Unfolding

In this section we will extend the well-known Tucker decomposition for finite-dimensional tensors to functional data, then introduce the concept of functional unfolding for low-rank modeling, and finally apply functional unfolding to covariance function estimation.

3.1 Tucker decomposition for finite-dimensional tensors

First, we give a brief introduction to the popular Tucker decomposition (Tucker, 1966) for finite-dimensional tensors. Let $\mathcal{G}=\bigotimes_{k=1}^{d}\mathcal{G}_{k}$ denote a generic tensor product space with finite-dimensional $\mathcal{G}_{k}$, $k=1,\ldots,d$. If the dimension of $\mathcal{G}_{k}$ is $q_{k}$, $k=1,\ldots,d$, then each element in $\mathcal{G}=\bigotimes_{k=1}^{d}\mathcal{G}_{k}$ can be identified with an array in $\mathbb{R}^{\prod_{k=1}^{d}q_{k}}$, which contains the coefficients with respect to an orthonormal basis. By the Tucker decomposition, any array in $\mathbb{R}^{\prod^{d}_{k=1}q_{k}}$ can be represented in terms of $n$-mode products as follows.

Definition 1 ($n$-mode product).

For any arrays $\bm{A}\in\mathbb{R}^{q_{1}\times q_{2}\times\cdots\times q_{d}}$ and $\bm{P}\in\mathbb{R}^{p_{n}\times q_{n}}$, $n\in\{1,\dots,d\}$, the $n$-mode product between $\bm{A}$ and $\bm{P}$, denoted by $\bm{A}\times_{n}\bm{P}$, is an array of dimension $q_{1}\times q_{2}\times\cdots\times q_{n-1}\times p_{n}\times q_{n+1}\times\cdots\times q_{d}$ whose $(l_{1},\dots,l_{n-1},j,l_{n+1},\dots,l_{d})$-th element is defined by

(\bm{A}\times_{n}\bm{P})_{l_{1},\dots,l_{n-1},j,l_{n+1},\dots,l_{d}}=\sum_{i=1}^{q_{n}}\bm{A}_{l_{1},\dots,l_{n-1},i,l_{n+1},\dots,l_{d}}\bm{P}_{j,i}.
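For concreteness, the following is a small NumPy sketch (ours, not from the paper) of the $n$-mode product in Definition 1; the function name and toy dimensions are illustrative, and the modes are 0-indexed in the code.

    import numpy as np

    def mode_n_product(A, P, n):
        # n-mode product A x_n P: contract the n-th mode of A (0-indexed here)
        # against the columns of P, as in Definition 1.
        A_unf = np.moveaxis(A, n, 0).reshape(A.shape[n], -1)      # q_n x (prod of other q's)
        out = P @ A_unf                                           # p_n x (prod of other q's)
        new_shape = (P.shape[0],) + tuple(np.delete(A.shape, n))
        return np.moveaxis(out.reshape(new_shape), 0, n)

    # example: a 3 x 4 x 5 array multiplied along its second mode by a 2 x 4 matrix
    A = np.random.randn(3, 4, 5)
    P = np.random.randn(2, 4)
    print(mode_n_product(A, P, 1).shape)   # (3, 2, 5)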
Definition 2 (Tucker decomposition).

The Tucker decomposition of $\bm{A}\in\mathbb{R}^{q_{1}\times q_{2}\times\cdots\times q_{d}}$ is

\bm{A}=\bm{G}\times_{1}\bm{U}_{1}\times_{2}\cdots\times_{d}\bm{U}_{d}, \qquad (3)

where $\bm{U}_{i}\in\mathbb{R}^{q_{i}\times r_{i}}$, $i=1,2,\dots,d$, are called the “factor matrices” (usually orthonormal) with $r_{i}\leq q_{i}$, and $\bm{G}\in\mathbb{R}^{r_{1}\times\cdots\times r_{d}}$ is called the “core tensor”.

Figure 1 provides a pictorial illustration of a Tucker decomposition. Unlike matrices, the concept of rank is more complicated for arrays of order 3 or above. Tucker decomposition naturally leads to a particular form of rank, called “multilinear rank”, which is directly related to the familiar concept of matrix ranks. To see this, we employ a reshaping operation called matricization, which rearranges elements of an array into a matrix.

Figure 1: Tucker decomposition of a third-order array $\bm{A}$ ($q_{1}\times q_{2}\times q_{3}$) into a core tensor $\bm{G}$ ($r_{1}\times r_{2}\times r_{3}$) and factor matrices $\bm{U}_{1}$ ($q_{1}\times r_{1}$), $\bm{U}_{2}$ ($q_{2}\times r_{2}$), $\bm{U}_{3}$ ($q_{3}\times r_{3}$). The values in the parentheses are dimensions of the corresponding matrices or arrays.
Definition 3 (Matricization).

For any $n\in\{1,\dots,d\}$, the $n$-mode matricization of $\bm{A}\in\mathbb{R}^{q_{1}\times q_{2}\times\cdots\times q_{d}}$, denoted by $\bm{A}_{(n)}$, is a matrix of dimension $q_{n}\times(\prod_{k\neq n}q_{k})$ whose $(l_{n},j)$-th element is defined by $[\bm{A}_{(n)}]_{l_{n},j}=\bm{A}_{l_{1},\ldots,l_{d}}$, where $j=1+\sum_{i=1,i\neq n}^{d}(l_{i}-1)(\prod_{m=1,m\neq n}^{i-1}q_{m})$. (All empty products are defined as 1; for example, $\prod_{m=i}^{j}q_{m}=1$ when $i>j$.)
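A brief NumPy sketch (ours) that implements this matricization and checks the index formula of Definition 3 on a toy array; everything is 0-indexed in the code.

    import numpy as np

    def matricize(A, n):
        # mode-n matricization A_(n); order='F' reproduces the column ordering
        # of Definition 3 (earlier modes vary fastest).
        return np.moveaxis(A, n, 0).reshape(A.shape[n], -1, order='F')

    # check the index formula of Definition 3 on a toy array
    A = np.arange(2 * 3 * 4).reshape(2, 3, 4)
    n, l, q = 1, (1, 2, 3), A.shape
    j = sum(l[i] * int(np.prod([q[m] for m in range(i) if m != n]))
            for i in range(len(q)) if i != n)
    assert matricize(A, n)[l[n], j] == A[l]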

For any $\bm{A}\in\mathbb{R}^{q_{1}\times q_{2}\times\cdots\times q_{d}}$, by simple derivations, one can obtain a useful relationship between the $n$-mode matricization and the Tucker decomposition $\bm{A}=\bm{G}\times_{1}\bm{U}_{1}\times_{2}\cdots\times_{d}\bm{U}_{d}$:

\bm{A}_{(n)}=\bm{U}_{n}\bm{G}_{(n)}(\bm{U}_{d}\otimes\cdots\otimes\bm{U}_{n+1}\otimes\bm{U}_{n-1}\otimes\cdots\otimes\bm{U}_{1})^{\intercal}, \qquad (4)

where, with a slight abuse of notation, $\otimes$ also represents the Kronecker product between matrices. Hence if the factor matrices are of full rank, then $\mathrm{rank}(\bm{A}_{(n)})=\mathrm{rank}(\bm{G}_{(n)})$. The vector $(\mathrm{rank}(\bm{A}_{(1)}),\dots,\mathrm{rank}(\bm{A}_{(d)}))$ is known as the multilinear rank of $\bm{A}$. Clearly from (4), one can choose a Tucker decomposition such that $\{\bm{U}_{k}:k=1,\ldots,d\}$ are orthonormal matrices and $\mathrm{rank}(\bm{U}_{k})=r_{k}$. Therefore a “small” multilinear rank corresponds to a small core tensor and thus an intrinsic dimension reduction, which potentially improves estimation and interpretation. We will relate this low-rank structure to multidimensional functional data.
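A short numerical check (ours) of relation (4) and of the multilinear-rank property on a randomly generated Tucker-structured array; all sizes are arbitrary choices for illustration.

    import numpy as np
    from functools import reduce

    def matricize(A, n):
        # mode-n matricization with the column ordering of Definition 3
        return np.moveaxis(A, n, 0).reshape(A.shape[n], -1, order='F')

    q, r = (4, 5, 6), (2, 3, 2)
    G = np.random.randn(*r)                                               # core tensor
    U = [np.linalg.qr(np.random.randn(q[k], r[k]))[0] for k in range(3)]  # orthonormal factors
    A = np.einsum('abc,ia,jb,kc->ijk', G, U[0], U[1], U[2])               # A = G x_1 U_1 x_2 U_2 x_3 U_3

    for n in range(3):
        others = reduce(np.kron, [U[m] for m in reversed(range(3)) if m != n])
        assert np.allclose(matricize(A, n), U[n] @ matricize(G, n) @ others.T)   # relation (4)
        # multilinear rank: rank(A_(n)) = rank(G_(n)) since the factors have full column rank
        assert np.linalg.matrix_rank(matricize(A, n)) == np.linalg.matrix_rank(matricize(G, n))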

3.2 Functional unfolding for infinite-dimensional tensors

To encourage low-rank structures in covariance function estimation, we generalize the matricization operation for finite-dimensional arrays to infinite-dimensional tensors (Hackbusch, 2012). Here let $\mathcal{G}=\bigotimes_{k=1}^{d}\mathcal{G}_{k}$ denote a generic tensor product space where $\mathcal{G}_{k}$ is an RKHS of functions with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{G}_{k}}$, for $k=1,\ldots,d$.

Notice that the tensor product space $\mathcal{G}=\bigotimes_{k=1}^{d}\mathcal{G}_{k}$ can be generated by elementary tensors of the form $(\bigotimes^{d}_{k=1}f_{k})(x_{1},\dots,x_{d})=\prod^{d}_{k=1}f_{k}(x_{k})$, where $f_{k}\in\mathcal{G}_{k}$, $k=1,\ldots,d$. More specifically, $\mathcal{G}$ is the completion of the linear span of all elementary tensors under the inner product $\langle\bigotimes^{d}_{k=1}f_{k},\bigotimes^{d}_{k=1}f^{\prime}_{k}\rangle_{\mathcal{G}}=\prod^{d}_{k=1}\langle f_{k},f_{k}^{\prime}\rangle_{\mathcal{G}_{k}}$, for any $f_{k},f_{k}^{\prime}\in\mathcal{G}_{k}$.

In Definition 4 below, we generalize matricization/unfolding for finite-dimensional arrays to infinite-dimensional elementary tensors. We also define a square unfolding for infinite-dimensional tensors that will be used to describe the spectrum of covariance operators.

Definition 4 (Functional unfolding operators).

The one-way unfolding operator and the square unfolding operator are defined as follows for any elementary tensor of the form $\bigotimes^{d}_{k=1}f_{k}$.

  1. One-way unfolding operator $\mathcal{U}_{j}$ for $j=1,\dots,d$: The $j$-mode one-way unfolding operator $\mathcal{U}_{j}:\bigotimes_{k=1}^{d}\mathcal{G}_{k}\rightarrow\mathcal{G}_{j}\otimes(\bigotimes_{k\neq j}\mathcal{G}_{k})$ is defined by $\mathcal{U}_{j}(\bigotimes_{k=1}^{d}f_{k})=f_{j}\otimes(\bigotimes_{k\neq j}f_{k})$.

  2. Square unfolding operator $\mathcal{S}$: When $d$ is even, the square unfolding operator $\mathcal{S}:\bigotimes_{j=1}^{d}\mathcal{G}_{j}\rightarrow(\bigotimes^{d/2}_{j=1}\mathcal{G}_{j})\otimes(\bigotimes^{d}_{k=d/2+1}\mathcal{G}_{k})$ is defined by $\mathcal{S}(\bigotimes^{d}_{j=1}f_{j})=(\bigotimes_{j=1}^{d/2}f_{j})\otimes(\bigotimes_{k=d/2+1}^{d}f_{k})$.

These definitions extend to any function $f\in\mathcal{G}$ by linearity. For notational simplicity we denote $\mathcal{U}_{j}(f)$ by $f_{(j)}$, $j=1,\ldots,d$, and $\mathcal{S}(f)$ by $f_{\blacksquare}$.

Note that the range of each functional unfolding operator, either $\mathcal{U}_{j}$, $j=1,\ldots,d$, or $\mathcal{S}$, is a tensor product of two RKHSs, so its output can be interpreted as an (induced) operator. Given a function $f\in\mathcal{G}$, the multilinear rank can be defined as $(\mathrm{rank}(f_{(1)}),\dots,\mathrm{rank}(f_{(d)}))$, where each $f_{(j)}$ is interpreted as an operator and $\mathrm{rank}(A)$ denotes the rank of an operator $A$. If all $\mathcal{G}_{k}$, $k=1,\ldots,d$, are finite-dimensional, the singular values of the output $f_{(j)}$ of the one-way unfolding operator match those of the $j$-mode matricization of the corresponding array representation.

3.3 Functional unfolding for covariance functions

Suppose that the random field $X\in\mathcal{H}=\bigotimes_{k=1}^{p}\mathcal{H}_{k}$, where each $\mathcal{H}_{k}$ is an RKHS of functions equipped with an inner product $\langle\cdot,*\rangle_{k}$ and corresponding norm $\|\cdot\|_{k}$, $k=1,\ldots,p$. Then the covariance function $\gamma_{0}$ resides in $\mathcal{H}\otimes\mathcal{H}=(\bigotimes_{j=1}^{p}\mathcal{H}_{j})\otimes(\bigotimes_{k=1}^{p}\mathcal{H}_{k})$. To estimate $\gamma_{0}$, we could consider a special case of $\mathcal{G}=\bigotimes^{d}_{j=1}\mathcal{G}_{j}$ in Section 3.2 by letting $d=2p$, $\mathcal{G}_{j}=\mathcal{H}_{j}$ for $j=1,\dots,p$; $\mathcal{G}_{j}=\mathcal{H}_{j-p}$ for $j=p+1,\dots,d$; and $\langle\cdot,\cdot\rangle_{\mathcal{G}_{j}}=\langle\cdot,\cdot\rangle_{j}$ for $j=1,\ldots,d$.

Clearly, the elements of $\mathcal{H}\otimes\mathcal{H}$ are identified with those in $\mathcal{G}=\bigotimes^{d}_{j=1}\mathcal{G}_{j}$. In terms of the folding structure, $\mathcal{H}\otimes\mathcal{H}$ has a squarely unfolded structure. Since a low-multilinear-rank structure is represented by different unfolded forms, it would be easier to study the completely folded space $\bigotimes^{d}_{k=1}\mathcal{G}_{k}$ instead of the squarely unfolded space $\mathcal{H}\otimes\mathcal{H}$. We use $\Gamma_{0}$ to represent the folded covariance function, the corresponding element of $\gamma_{0}$ in $\mathcal{G}$. In other words, $\Gamma_{0,\blacksquare}=\gamma_{0}$. For any $\Gamma\in\mathcal{G}$, $\mathrm{rank}(\Gamma_{\blacksquare})$ is defined as the two-way rank of $\Gamma$, while $\mathrm{rank}(\Gamma_{(1)}),\dots,\mathrm{rank}(\Gamma_{(p)})$ are defined as the one-way ranks of $\Gamma$.

Remark 1.

For an array $\bm{A}\in\mathbb{R}^{\prod_{k=1}^{d}q_{k}}$, the one-way unfolding $\mathcal{U}_{j}(\bm{A})$ is the same as matricization, if we further impose the same ordering of the columns in the output of $\mathcal{U}_{j}(\bm{A})$, $j=1,\ldots,d$. This ordering is just related to how we represent the array, and is not crucial in the general definition of $\mathcal{U}_{j}$. Since the description of the computation strategy depends on the explicit representation, we will always assume this ordering. Similarly, we also define a specific ordering of rows and columns for $\bm{A}_{\blacksquare}\in\mathbb{R}^{(\prod_{k=1}^{d/2}q_{k})\times(\prod_{k=d/2+1}^{d}q_{k})}$ when $d$ is even, such that its $(j_{1},j_{2})$-th entry is $\bm{A}_{k_{1},\dots,k_{d}}$, where $j_{1}=1+\sum_{i=1}^{d/2}(k_{i}-1)(\prod_{m=i+1}^{d/2}q_{m})$ and $j_{2}=1+\sum_{i=d/2+1}^{d}(k_{i}-1)(\prod_{m=i+1}^{d}q_{m})$.

3.4 One-way and two-way ranks in covariance functions

Here we illustrate the roles of one-way and two-way ranks in the modeling of covariance functions. For a general $\mathcal{G}=\bigotimes^{d}_{j=1}\mathcal{G}_{j}$, let $\{e_{k,l_{k}}:l_{k}=1,\dots,q_{k}\}$ be a set of orthonormal basis functions of $\mathcal{G}_{k}$ for $k=1,\dots,d=2p$, where $q_{k}$ is allowed to be infinite, depending on the dimensionality of $\mathcal{G}_{k}$. Then $\{\bigotimes^{d}_{k=1}e_{k,l_{k}}:l_{k}=1,\dots,q_{k};k=1,\dots,d\}$ forms a set of orthonormal basis functions for $\mathcal{G}$. Thus for any $\Gamma\in\mathcal{G}$, we can express

\Gamma=\sum_{k_{1},k_{2},\dots,k_{d}}\bm{B}_{k_{1},\dots,k_{d}}\bigotimes_{i=1}^{d}e_{i,k_{i}}, \qquad (5)

where the coefficients $\bm{B}_{k_{1},\dots,k_{d}}$ are real numbers. For convenience, we collectively put them into an array $\bm{B}\in\mathbb{R}^{\prod^{d}_{k=1}q_{k}}$.

To illustrate the low-multilinear-rank structures for covariance functions, we consider $p=2$, i.e., $d=2p=4$, and then by (5) the folded covariance function $\Gamma$ can be expressed as

\Gamma(s_{1},s_{2},t_{1},t_{2})=\sum_{k_{1}=1}^{q_{1}}\sum_{k_{2}=1}^{q_{2}}\sum_{k_{3}=1}^{q_{1}}\sum_{k_{4}=1}^{q_{2}}\bm{B}_{k_{1},k_{2},k_{3},k_{4}}e_{1,k_{1}}(s_{1})e_{2,k_{2}}(s_{2})e_{1,k_{3}}(t_{1})e_{2,k_{4}}(t_{2}).

To be precise, the covariance function is the squarely unfolded $\Gamma_{\blacksquare}((s_{1},s_{2}),(t_{1},t_{2}))\equiv\Gamma(s_{1},s_{2},t_{1},t_{2})$. Suppose that $\bm{B}$ possesses (or is well approximated by) a structure of low multilinear rank, and admits the Tucker decomposition $\bm{B}=\bm{E}\times_{1}\bm{U}_{1}\times_{2}\bm{U}_{2}\times_{3}\bm{U}_{1}\times_{4}\bm{U}_{2}$ (Definition 1 is extended to the case where $q_{n}$ is infinite), where $\bm{E}\in\mathbb{R}^{r_{1}\times r_{2}\times r_{1}\times r_{2}}$, $\bm{U}_{k}\in\mathbb{R}^{q_{k}\times r_{k}}$ for $k=1,2$, and the columns of $\bm{U}_{k}$ are orthonormal. Apparently $R:=\mathrm{rank}(\bm{B}_{\blacksquare})$ is the two-way rank of $\Gamma$, while $r_{1}$ and $r_{2}$ are the corresponding one-way ranks. Now the covariance function can be further represented as

\Gamma(s_{1},s_{2},t_{1},t_{2})=\sum_{j_{1}=1}^{r_{1}}\sum_{j_{2}=1}^{r_{2}}\sum_{j_{3}=1}^{r_{1}}\sum_{j_{4}=1}^{r_{2}}\bm{E}_{j_{1},j_{2},j_{3},j_{4}}u_{j_{1}}(s_{1})v_{j_{2}}(s_{2})u_{j_{3}}(t_{1})v_{j_{4}}(t_{2}),

where $\{u_{j}:j=1,\ldots,r_{1}\}$ and $\{v_{k}:k=1,\ldots,r_{2}\}$ are (possibly infinite) linear combinations of the original basis functions. In fact, $\{u_{j}:j=1,\ldots,r_{1}\}$ and $\{v_{k}:k=1,\ldots,r_{2}\}$ are sets of orthonormal functions of $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, respectively. Apparently $\mathrm{rank}(\bm{E}_{\blacksquare})=R$.

Consider the eigen-decomposition of the squarely unfolded core tensor $\bm{E}_{\blacksquare}=\bm{P}\bm{D}\bm{P}^{T}$, where $\bm{D}=\mathrm{diag}(\lambda_{1},\lambda_{2},\dots,\lambda_{R})$ and $\bm{P}\in\mathbb{R}^{r_{1}r_{2}\times R}$ has orthonormal columns. Then we obtain the eigen-decomposition of the covariance function $\Gamma_{\blacksquare}$:

\Gamma_{\blacksquare}((s_{1},s_{2}),(t_{1},t_{2}))=\sum_{g=1}^{R}\lambda_{g}f_{g}(s_{1},s_{2})f_{g}(t_{1},t_{2}),

where the eigenfunction is

f_{g}(s_{1},s_{2})=\sum_{j_{1}=1}^{r_{1}}\sum_{j_{2}=1}^{r_{2}}\bm{P}_{j_{2}+(j_{1}-1)r_{1},g}\,u_{j_{1}}(s_{1})v_{j_{2}}(s_{2})=:\begin{cases}\sum^{r_{1}}_{j_{1}=1}a_{j_{1},g}(s_{2})u_{j_{1}}(s_{1}),\\ \sum^{r_{2}}_{j_{2}=1}b_{j_{2},g}(s_{1})v_{j_{2}}(s_{2}),\end{cases}

with $a_{j_{1},g}(\cdot)=\sum_{j_{2}=1}^{r_{2}}\bm{P}_{j_{2}+(j_{1}-1)r_{1},g}v_{j_{2}}(\cdot)$ and $b_{j_{2},g}(\cdot)=\sum_{j_{1}=1}^{r_{1}}\bm{P}_{j_{2}+(j_{1}-1)r_{1},g}u_{j_{1}}(\cdot)$.

First, this indicates that the two-way rank $R$ is the same as the rank of the covariance operator. Second, this shows that $\{u_{j_{1}}:j_{1}=1,\ldots,r_{1}\}$ is the common basis for the variation along the dimension $s_{1}$, hence describing the marginal structure along $s_{1}$. Similarly, $\{v_{j_{2}}:j_{2}=1,\ldots,r_{2}\}$ is the common basis that characterizes the marginal variation along the dimension $s_{2}$. We call them the marginal basis along the respective dimension. Therefore, the one-way ranks $r_{1}$ and $r_{2}$ are the minimal numbers of one-dimensional functions for the dimensions $s_{1}$ and $s_{2}$, respectively, that construct all the eigenfunctions of the covariance function $\Gamma$.

Similarly, for $p$-dimensional functional data, each eigenfunction can be represented by a linear combination of $p$-fold products of univariate functions. One can then show that the two-way rank $R$ is the same as the rank of the covariance operator, and the one-way ranks $r_{1},\dots,r_{p}$ are the minimal numbers of one-dimensional functions along the respective dimensions that characterize all eigenfunctions of the covariance operator.
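The following toy NumPy sketch (ours) illustrates these notions for $p=2$: a coefficient array $\bm{B}$ with low one-way and two-way ranks is generated as in the displays above, and the ranks are read off from its unfoldings; all sizes are hypothetical.

    import numpy as np

    q1, q2, r1, r2, R = 6, 7, 3, 4, 5                  # hypothetical sizes with R <= r1 * r2
    U1 = np.linalg.qr(np.random.randn(q1, r1))[0]      # marginal basis coefficients along s1
    U2 = np.linalg.qr(np.random.randn(q2, r2))[0]      # marginal basis coefficients along s2
    S = np.random.randn(r1 * r2, R)
    E = (S @ S.T).reshape(r1, r2, r1, r2)              # core whose square unfolding is PSD of rank R

    # B = E x_1 U1 x_2 U2 x_3 U1 x_4 U2
    B = np.einsum('abce,ia,jb,kc,le->ijkl', E, U1, U2, U1, U2)

    print(np.linalg.matrix_rank(B.reshape(q1 * q2, q1 * q2)))   # two-way rank: 5
    for n in (0, 1):                                            # one-way ranks: 3 and 4
        B_n = np.moveaxis(B, n, 0).reshape(B.shape[n], -1)      # column order irrelevant for rank
        print(np.linalg.matrix_rank(B_n))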

Remark 2.

Obviously, $R\leq\prod^{p}_{k=1}r_{k}$ for $p$-dimensional functional data. If the random field $X$ has the property of “weak separability” as defined by Lynch and Chen (2018), then $\max(r_{1},\dots,r_{p})\leq R$, so the low-rank structure in terms of $R$ will be automatically translated to low one-way ranks. Note that the construction of our estimator and the corresponding theoretical analysis do not require separability conditions.

Compared to typical low-rank covariance modeling only in terms of $R$, we also intend to regularize the one-way ranks $r_{1},\dots,r_{p}$, for two reasons. First, the illustration above shows that the structure of low one-way ranks encourages a “sharing” structure of one-dimensional variations among different eigenfunctions. Promoting low one-way ranks can facilitate additional dimension reduction and further alleviate the curse of dimensionality. Moreover, the one-dimensional marginal structures provide more details of the covariance function structure and thus help with a better understanding of the $p$-dimensional eigenfunctions.

Therefore, we will utilize both one-way and two-way structures and propose an estimation procedure that regularizes one-way and two-way ranks jointly and flexibly, with the aim of seeking the “sharing” of marginal structures while controlling the number of eigen-components simultaneously.

4 Covariance Function Estimation

In this section we propose a low-rank covariance function estimation framework based on functional unfolding operators and spectral regularizations. Spectral penalty functions (Abernethy et al., 2009; Wong and Zhang, 2019) are defined as follows.

Definition 5 (Spectral penalty function).

Given a compact operator $A$, a spectral penalty function takes the form $\Psi(A)=\sum_{k\geq 1}\psi(\lambda_{k}(A))$, where $\lambda_{1}(A),\lambda_{2}(A),\dots$ are the singular values of $A$ in descending order of magnitude and $\psi$ is a non-decreasing penalty function such that $\psi(0)=0$.

Recall $\mathcal{H}=\bigotimes^{p}_{j=1}\mathcal{H}_{j}$ and $\mathcal{G}=\bigotimes^{d}_{j=1}\mathcal{G}_{j}$, where $d=2p$, $\mathcal{G}_{j}=\mathcal{H}_{j}$ for $j=1,\dots,p$, and $\mathcal{G}_{j}=\mathcal{H}_{j-p}$ for $j=p+1,\dots,d$. Clearly, a covariance operator is self-adjoint and positive semi-definite. Therefore we consider the hypothesis space $\mathcal{M}^{+}=\{\Gamma\in\mathcal{M}:\langle\Gamma_{\blacksquare}f,f\rangle_{\mathcal{H}}\geq 0\ \text{for all }f\in\mathcal{H}\}$, where $\mathcal{M}=\{\Gamma\in\mathcal{G}:\Gamma_{\blacksquare}\text{ is self-adjoint}\}$, and propose a general class of covariance function estimators as follows:

\underset{\Gamma\in\mathcal{M}^{+}}{\arg\min}\left\{\ell(\Gamma)+\lambda\left[\beta\Psi_{0}(\Gamma_{\blacksquare})+\frac{1-\beta}{p}\sum_{j=1}^{p}\Psi_{j}(\Gamma_{(j)})\right]\right\}, \qquad (6)

where $\ell$ is a convex and smooth loss function, $\{\Psi_{j}:j=0,1,\ldots,p\}$ are spectral penalty functions, and $\lambda\geq 0$ and $\beta\in[0,1]$ are tuning parameters. Here $\Psi_{0}$ penalizes the squarely unfolded operator $\Gamma_{\blacksquare}$, while $\Psi_{j}$ regularizes the one-way unfolded operator $\Gamma_{(j)}$ for $j=1,\dots,p$. The tuning parameter $\beta$ controls the relative degree of regularization between the one-way and two-way singular values: the larger $\beta$ is, the more penalty is imposed on the two-way singular values relative to the one-way singular values. When $\beta=1$, the penalization is only on the eigenvalues of the covariance operator (i.e., the two-way singular values), similarly to Wong and Zhang (2019).

To achieve low-rank estimation, we adopt a special form of (6):

\hat{\Gamma}=\underset{\Gamma\in\mathcal{M}^{+}}{\arg\min}\left\{\ell_{\mathrm{square}}(\Gamma)+\lambda\left[\beta\|\Gamma_{\blacksquare}\|_{*}+\frac{1-\beta}{p}\sum_{j=1}^{p}\|\Gamma_{(j)}\|_{*}\right]\right\}, \qquad (7)

where $\|\cdot\|_{*}$ is the sum of singular values, also called the trace norm, and $\ell_{\mathrm{square}}$ is the squared error loss:

\ell_{\mathrm{square}}(\Gamma)=\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{1\leq j\neq j^{\prime}\leq m}\{\Gamma(T_{ij1},\dots,T_{ijp},T_{ij^{\prime}1},\dots,T_{ij^{\prime}p})-Z_{ijj^{\prime}}\}^{2}, \qquad (8)

with $Z_{ijj^{\prime}}=\{Y_{ij}-\hat{\mu}(T_{ij1},\dots,T_{ijp})\}\{Y_{ij^{\prime}}-\hat{\mu}(T_{ij^{\prime}1},\dots,T_{ij^{\prime}p})\}$, where $\hat{\mu}$ is an estimate of the mean function and $T_{ijk}$ is the $k$-th element of the location vector $\bm{T}_{ij}$. Notice that trace-norm regularizations promote low-rankness of the underlying operators, hence leading to a low-rank estimation in terms of both the one-way and two-way (covariance) ranks.
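A minimal Python sketch (ours) of the squared error loss (8); the candidate covariance gamma and the mean estimate mu_hat are placeholder callables, not the paper's estimators.

    import numpy as np

    def squared_error_loss(gamma, mu_hat, T, Y):
        # T: (n, m, p) array of locations; Y: (n, m) array of noisy measurements.
        n, m, _ = T.shape
        loss = 0.0
        for i in range(n):
            resid = Y[i] - np.array([mu_hat(T[i, j]) for j in range(m)])
            for j in range(m):
                for jp in range(m):
                    if j != jp:
                        Z = resid[j] * resid[jp]                      # Z_{ijj'}
                        loss += (gamma(T[i, j], T[i, jp]) - Z) ** 2
        return loss / (n * m * (m - 1))

    # toy usage with p = 2 and placeholder gamma, mu_hat
    T, Y = np.random.rand(5, 4, 2), np.random.randn(5, 4)
    print(squared_error_loss(lambda s, t: np.exp(-np.sum((s - t) ** 2)), lambda t: 0.0, T, Y))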

4.1 Representer theorem and parametrization

Before deriving a computational algorithm, we note that (7) is an infinite-dimensional optimization, which is generally not directly solvable. To overcome this challenge, we show that a solution to (7) always lies in a known finite-dimensional subspace given the data, hence allowing a finite-dimensional parametrization. Indeed, we are able to achieve a stronger result in Theorem 1, which holds for the general class of estimators obtained by (6).

Let $\mathcal{L}_{n,m}=\left\{T_{ijk}:i=1,\dots,n,\ j=1,\dots,m,\ k=1,\dots,p\right\}$.

Theorem 1 (Representer theorem).

If the solution set of (6) is not empty, there always exists a solution $\Gamma$ lying in the space $\mathcal{G}(\mathcal{L}_{n,m}):=\bigotimes_{k=1}^{2p}\mathcal{K}_{k}$, where $\mathcal{K}_{p+k}=\mathcal{K}_{k}$ and $\mathcal{K}_{k}=\mathrm{span}\left\{K_{k}(T_{ijk},\cdot):i=1,\dots,n,\ j=1,\dots,m\right\}$ for $k=1,\dots,p$. The solution takes the form:

\Gamma(s_{1},\ldots,s_{p},t_{1},\ldots,t_{p})=\bm{A}\times_{1}\bm{z}_{1}^{\intercal}(s_{1})\times_{2}\bm{z}_{2}^{\intercal}(s_{2})\cdots\times_{p}\bm{z}_{p}^{\intercal}(s_{p})\times_{p+1}\bm{z}_{1}^{\intercal}(t_{1})\cdots\times_{2p}\bm{z}_{p}^{\intercal}(t_{p}), \qquad (9)

where the $l$-th element of $\bm{z}_{k}(\cdot)\in\mathbb{R}^{mn}$ is $K_{k}(T_{ijk},\cdot)$ with $l=(i-1)m+j$. Also, $\bm{A}$ is a $2p$-th order tensor in which each mode has dimension $nm$, and $\bm{A}_{\blacksquare}$ is a symmetric matrix.

The proof of Theorem 1 is given in Section S1 of the supplementary material. By Theorem 1, we can now focus only on covariance function estimators of the form (9). Let $\bm{B}=\bm{A}\times_{1}\bm{M}_{1}^{T}\cdots\times_{p}\bm{M}_{p}^{T}\times_{p+1}\bm{M}_{1}^{T}\cdots\times_{2p}\bm{M}_{p}^{T}$, where $\bm{M}_{k}$ is an $nm\times q_{k}$ matrix such that $\bm{M}_{k}\bm{M}_{k}^{T}=\bm{K}_{k}=\left[K_{k}(T_{i_{1},j_{1},k},T_{i_{2},j_{2},k})\right]_{1\leq i_{1},i_{2}\leq n,\,1\leq j_{1},j_{2}\leq m}$. With $\bm{B}$, we can express

\Gamma(s_{1},\ldots,s_{p},t_{1},\ldots,t_{p})=\bm{B}\times_{1}\{\bm{M}_{1}^{+}\bm{z}_{1}(s_{1})\}^{\intercal}\cdots\times_{p}\{\bm{M}_{p}^{+}\bm{z}_{p}(s_{p})\}^{\intercal}\times_{p+1}\{\bm{M}_{1}^{+}\bm{z}_{1}(t_{1})\}^{\intercal}\cdots\times_{2p}\{\bm{M}_{p}^{+}\bm{z}_{p}(t_{p})\}^{\intercal}, \qquad (10)

where $\bm{z}_{k}(\cdot)$ is defined in Theorem 1 and $\bm{M}_{k}^{+}$ is the Moore–Penrose inverse of the matrix $\bm{M}_{k}$.

The Gram matrix $\bm{K}_{k}$ is often approximately low-rank. For computational simplicity, one could adopt $q_{k}$ to be significantly smaller than $nm$. Ideally we can obtain the “best” low-rank approximation with respect to the Frobenius norm by eigen-decomposition, but a full eigen-decomposition is computationally expensive. Instead, randomized algorithms can be used to obtain low-rank approximations in an efficient manner (Halko et al., 2009).
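As an illustration of this step, the following sketch (ours) forms $\bm{M}_{k}$ from a toy Gram matrix by a truncated eigendecomposition; in practice a randomized routine in the spirit of Halko et al. (2009) could replace the full eigendecomposition, and the kernel used here is only an example.

    import numpy as np

    def low_rank_factor(K, q):
        # return an (nm x q) matrix M with M M^T approximating the PSD Gram matrix K,
        # via a truncated eigendecomposition of K
        w, V = np.linalg.eigh(K)                       # eigenvalues in ascending order
        idx = np.argsort(w)[::-1][:q]                  # keep the q largest
        w_top = np.clip(w[idx], 0.0, None)             # guard against tiny negative values
        return V[:, idx] * np.sqrt(w_top)

    t = np.random.rand(50)                             # nm = 50 sampled locations in one dimension
    K_k = np.exp(-np.abs(t[:, None] - t[None, :]))     # illustrative univariate kernel choice
    M_k = low_rank_factor(K_k, q=10)
    print(np.linalg.norm(K_k - M_k @ M_k.T) / np.linalg.norm(K_k))   # relative approximation error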

One can easily show that the eigenvalues of the operator $\Gamma_{\blacksquare}$ are the same as those of the matrix $\bm{B}_{\blacksquare}$, and that the singular values of the operator $\Gamma_{(j)}$ are the same as those of the matrix $\bm{B}_{(j)}$. Therefore, solving (7) is equivalent to solving the following optimization:

\min_{\bm{B}}\left\{\tilde{\ell}_{\mathrm{square}}(\bm{B})+\lambda\left[\beta h(\bm{B}_{\blacksquare})+\frac{1-\beta}{p}\sum_{j=1}^{p}\left\|\bm{B}_{(j)}\right\|_{*}\right]\right\}, \qquad (11)

where $\|\cdot\|_{*}$ also represents the trace norm of matrices, $h(\bm{H})=\left\|\bm{H}\right\|_{*}$ if the matrix $\bm{H}$ is positive semi-definite and $h(\bm{H})=\infty$ otherwise, and $\tilde{\ell}_{\mathrm{square}}(\bm{B})=\ell_{\mathrm{square}}(\Gamma)$, where $\Gamma$ is constructed from (10).

Beyond estimating the covariance function, one may be further interested in the eigen-decomposition of $\Gamma_{\blacksquare}$ via the $L^{2}$ inner product, e.g., to perform functional principal component analysis in the usual sense. Due to the finite-dimensional parametrization, a closed-form expression of the $L^{2}$ eigen-decomposition can be derived from our estimator without further discretization or approximation. In addition, we can obtain a similar one-way analysis in terms of the $L^{2}$ inner product: we can define an $L^{2}$ singular value decomposition via the Tucker form and obtain the $L^{2}$ marginal basis. Details are given in Appendix A.

4.2 Computational algorithm

We solve (11) by the accelerated alternating direction method of multipliers (ADMM) algorithm (Kadkhodaie et al., 2015). We begin with an alternative form of (11):

\min_{\bm{B}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}}\left\{\tilde{\ell}_{\mathrm{square}}(\bm{B})+\lambda\beta h(\bm{D}_{0,\blacksquare})+\lambda\frac{1-\beta}{p}\sum_{j=1}^{p}\left\|\bm{D}_{j,(j)}\right\|_{*}\right\} \qquad (12)
\text{subject to}\quad \bm{B}=\bm{D}_{0}=\bm{D}_{1}=\cdots=\bm{D}_{p}, \qquad (13)

where $q_{p+k}=q_{k}$ for $k=1,\dots,p$.

A standard ADMM algorithm then solves the optimization problem (12) by minimizing the augmented Lagrangian with respect to different blocks of variables alternately. More explicitly, at the $(t+1)$-th iteration, the following updates are implemented:

\bm{B}^{(t+1)}=\operatorname*{argmin}_{\bm{B}}\left\{\tilde{\ell}_{\mathrm{square}}(\bm{B})+\frac{\eta}{2}\|\bm{B}_{\blacksquare}-\bm{D}_{0,\blacksquare}^{(t)}+\bm{V}_{0,\blacksquare}^{(t)}\|_{F}^{2}+\frac{\eta}{2}\sum_{k=1}^{p}\left\|\bm{B}_{(k)}-\bm{D}_{k,(k)}^{(t)}+\bm{V}_{k,(k)}^{(t)}\right\|_{F}^{2}\right\}, \qquad (14a)
\bm{D}_{0}^{(t+1)}=\operatorname*{argmin}_{\bm{D}_{0}}\left\{\lambda\beta h(\bm{D}_{0,\blacksquare})+\frac{\eta}{2}\left\|\bm{B}^{(t+1)}_{\blacksquare}-\bm{D}_{0,\blacksquare}+\bm{V}_{0,\blacksquare}^{(t)}\right\|_{F}^{2}\right\}, \qquad (14b)
\bm{D}_{k}^{(t+1)}=\operatorname*{argmin}_{\bm{D}_{k}}\left\{\lambda\frac{1-\beta}{p}\|\bm{D}_{k,(k)}\|_{*}+\frac{\eta}{2}\left\|\bm{B}^{(t+1)}_{(k)}-\bm{D}_{k,(k)}+\bm{V}_{k,(k)}^{(t)}\right\|_{F}^{2}\right\},\quad k=1,\dots,p, \qquad (14c)
\bm{V}_{k}^{(t+1)}=\bm{V}_{k}^{(t)}+\bm{B}^{(t+1)}-\bm{D}_{k}^{(t+1)},\quad k=0,\dots,p, \qquad (14d)

where $\bm{V}_{k}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}$, for $k=0,\dots,p$, are scaled Lagrangian multipliers and $\eta>0$ is an algorithmic parameter. An adaptive strategy to tune $\eta$ is provided in Boyd et al. (2010). One can see that Steps (14a), (14b) and (14c) involve additional optimizations. Now we discuss how to solve them.

The objective function of (14a) is quadratic, so it has a closed-form solution, given in line 2 of Algorithm 1. To solve (14b) and (14c), we use the proximal operators $\mathrm{prox}^{k}_{v}$, $k=1,\dots,p$, and $\mathrm{prox}_{v}^{+}:\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}\rightarrow\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}$, respectively defined by

\mathrm{prox}_{v}^{k}(\bm{A})=\operatorname*{argmin}_{\bm{W}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}}\left\{\frac{1}{2}\|\bm{W}_{(k)}-\bm{A}_{(k)}\|_{F}^{2}+v\|\bm{W}_{(k)}\|_{*}\right\}, \qquad (15a)
\mathrm{prox}^{+}_{v}(\bm{A})=\operatorname*{argmin}_{\bm{W}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}}\left\{\frac{1}{2}\|\bm{W}_{\blacksquare}-\bm{A}_{\blacksquare}\|_{F}^{2}+vh(\bm{W}_{\blacksquare})\right\}, \qquad (15b)

for $v\geq 0$. By Lemma 1 in Mazumder et al. (2010), the solutions to (15) have closed forms.

For (15a), write the singular value decomposition of $\bm{A}_{(k)}$ as $\bm{U}\mathrm{diag}(\tilde{a}_{1},\dots,\tilde{a}_{q_{k}})\bm{V}^{\intercal}$; then $[\mathrm{prox}_{v}^{k}(\bm{A})]_{(k)}=\bm{U}\mathrm{diag}(\tilde{\bm{c}})\bm{V}^{\intercal}$, where $\tilde{\bm{c}}=((\tilde{a}_{1}-v)_{+},(\tilde{a}_{2}-v)_{+},\dots,(\tilde{a}_{q_{k}}-v)_{+})$. As for (15b), $\bm{W}_{\blacksquare}$ is restricted to be a symmetric matrix since the penalty $h$ equals infinity otherwise. Thus (15b) is equivalent to minimizing $\{(1/2)\|\bm{W}_{\blacksquare}-(\bm{A}_{\blacksquare}+\bm{A}_{\blacksquare}^{\intercal})/2\|_{F}^{2}+vh(\bm{W}_{\blacksquare})\}$, since $\langle\bm{W}_{\blacksquare},(\bm{A}_{\blacksquare}-\bm{A}_{\blacksquare}^{\intercal})/2\rangle=\langle(\bm{W}_{\blacksquare}+\bm{W}_{\blacksquare}^{\intercal})/2,(\bm{A}_{\blacksquare}-\bm{A}_{\blacksquare}^{\intercal})/2\rangle=0$. Suppose that $(\bm{A}_{\blacksquare}+\bm{A}_{\blacksquare}^{\intercal})/2$ yields the eigen-decomposition $\bm{P}\mathrm{diag}(\tilde{a}_{1},\dots,\tilde{a}_{q})\bm{P}^{\intercal}$. Then $[\mathrm{prox}^{+}_{v}(\bm{A})]_{\blacksquare}=\bm{P}\mathrm{diag}(\tilde{\bm{c}})\bm{P}^{\intercal}$, where $\tilde{\bm{c}}=((\tilde{a}_{1}-v)_{+},(\tilde{a}_{2}-v)_{+},\dots,(\tilde{a}_{q}-v)_{+})$. Unlike singular values, the eigenvalues may be negative. Hence, as opposed to $\mathrm{prox}^{k}_{v}$, the procedure $\mathrm{prox}_{v}^{+}$ also removes eigen-components with negative eigenvalues.
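The two closed-form proximal maps can be coded in a few lines; the following NumPy sketch (ours) works directly on the matricized and squarely unfolded arrays.

    import numpy as np

    def prox_trace(A_mat, v):
        # prox of v * (trace norm): soft-threshold the singular values of A_mat by v
        U, s, Vt = np.linalg.svd(A_mat, full_matrices=False)
        return (U * np.maximum(s - v, 0.0)) @ Vt

    def prox_psd_trace(A_sq, v):
        # symmetrize, soft-threshold eigenvalues by v, and drop negative ones,
        # so the output is always positive semi-definite
        S = (A_sq + A_sq.T) / 2.0
        w, P = np.linalg.eigh(S)
        return (P * np.maximum(w - v, 0.0)) @ P.T

    A = np.random.randn(6, 6)
    print(np.linalg.eigvalsh(prox_psd_trace(A, 0.1)).min() >= -1e-10)   # True: PSD output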

The details of the computational algorithm are given in Algorithm 1, an accelerated version of ADMM which involves additional steps for faster algorithmic convergence.

Input: $\hat{\bm{V}}_{k}^{(0)}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}$, $k=0,1,\dots,p$, and $\bm{B}^{(0)}\in\mathbb{R}^{q_{1}\times\cdots\times q_{2p}}$ such that $\hat{\bm{V}}_{0,\blacksquare}^{(0)}$ and $\bm{B}^{(0)}_{\blacksquare}$ are symmetric matrices; $\bm{M}_{k}=[\bm{M}_{1,k}^{\intercal},\dots,\bm{M}_{n,k}^{\intercal}]^{\intercal}$, $k=1,\dots,p$; $\bm{Z}_{i}=(Z_{ijj^{\prime}})_{1\leq j,j^{\prime}\leq m}$, $i=1,\dots,n$; $\tilde{\bm{I}}=[I(i\neq j)]_{1\leq i,j\leq m}$; $\eta>0$; $T$
Initialization: $\alpha_{k}^{(0)}\leftarrow 1$, $\bm{D}_{k}^{(-1)}\leftarrow\bm{B}^{(0)}$, $\hat{\bm{D}}_{k}^{(0)}\leftarrow\bm{B}^{(0)}$, $\bm{V}_{k}^{(-1)}\leftarrow\hat{\bm{V}}_{k}^{(0)}$, $k=0,1,\dots,p$
$\bm{L}_{i}\leftarrow[\bm{M}_{i,1}^{\intercal}\odot\bm{M}_{i,2}^{\intercal}\odot\cdots\odot\bm{M}_{i,p}^{\intercal}]^{\intercal}$, $i=1,\dots,n$, where $\odot$ is the Khatri–Rao product defined as $\bm{A}\odot\bm{B}=[a_{i}\otimes b_{i}]_{i=1,\dots,r}\in\mathbb{R}^{r_{a}r_{b}\times r}$ for $\bm{A}\in\mathbb{R}^{r_{a}\times r}$ and $\bm{B}\in\mathbb{R}^{r_{b}\times r}$, with $a_{i}$ and $b_{i}$ the $i$-th columns of $\bm{A}$ and $\bm{B}$, respectively.
$\bm{G}\leftarrow\frac{1}{nm(m-1)}\sum_{i=1}^{n}(\bm{L}_{i}\otimes\bm{L}_{i})^{\intercal}\mathrm{diag}(\mathrm{vec}(\tilde{\bm{I}}))(\bm{L}_{i}\otimes\bm{L}_{i})$; \quad $\bm{h}\leftarrow\frac{2}{nm(m-1)}\sum_{i=1}^{n}(\bm{L}_{i}\otimes\bm{L}_{i})^{\intercal}\mathrm{diag}(\mathrm{vec}(\tilde{\bm{I}}))\mathrm{vec}(\bm{Z}_{i})$
$\bm{Q}\leftarrow\{2(\bm{G}+\frac{p+1}{2}\eta\bm{I})\}^{-1}$
1  for $t=0,1,\dots,T$ do
2    $\mathrm{vec}(\bm{B}^{(t+1)}_{\blacksquare})\leftarrow\bm{Q}\{\bm{h}+\eta\sum_{k=0}^{p}\mathrm{vec}([\bm{D}_{k}^{(t)}-\hat{\bm{V}}_{k}^{(t)}]_{\blacksquare})\}$
3    for $k=0,1,\dots,p$ do
4      if $k=0$ then
5        $\bm{D}_{0}^{(t)}\leftarrow\mathrm{prox}^{+}_{\lambda\beta/\eta}(\bm{B}^{(t+1)}+\hat{\bm{V}}_{0}^{(t)})$
6      else
7        $\bm{D}_{k}^{(t)}\leftarrow\mathrm{prox}^{k}_{\lambda(1-\beta)/(p\eta)}(\bm{B}^{(t+1)}+\hat{\bm{V}}_{k}^{(t)})$
8      end if
9      $\bm{V}_{k}^{(t)}\leftarrow\hat{\bm{V}}_{k}^{(t)}+\bm{B}^{(t+1)}-\bm{D}_{k}^{(t)}$
10     $\alpha_{k}^{(t+1)}\leftarrow\frac{1+\sqrt{1+4(\alpha_{k}^{(t)})^{2}}}{2}$
11     $\hat{\bm{D}}_{k}^{(t+1)}\leftarrow\bm{D}_{k}^{(t)}+\frac{\alpha_{k}^{(t)}-1}{\alpha_{k}^{(t+1)}}(\bm{D}_{k}^{(t)}-\bm{D}_{k}^{(t-1)})$
12     $\hat{\bm{V}}_{k}^{(t+1)}\leftarrow\bm{V}_{k}^{(t)}+\frac{\alpha_{k}^{(t)}-1}{\alpha_{k}^{(t+1)}}(\bm{V}_{k}^{(t)}-\bm{V}_{k}^{(t-1)})$
13   end for
14   Stop if the change in the objective value is less than the tolerance.
15 end for
Output: $\bm{D}_{0}^{(T)}$
Algorithm 1: Accelerated ADMM for solving (11)

5 Asymptotic Properties

In this section, we conduct an asymptotic analysis for the proposed estimator $\hat{\Gamma}$ as defined in (7). Our analysis has a unified flavor such that the derived convergence rate of the proposed estimator automatically adapts to sparse and dense settings. Throughout this section, we neglect the mean function estimation error by setting $\mu_{0}(\bm{t})=\hat{\mu}(\bm{t})=0$ for any $\bm{t}\in\mathcal{T}$, which leads to a cleaner and more focused analysis. The additional error from the mean function estimation can be incorporated into our proofs without any fundamental difficulty.

5.1 Assumptions

Without loss of generality, let $\mathcal{T}=[0,1]^{p}$. The assumptions needed in the asymptotic results are listed as follows.

Assumption 1.

The sample fields $\{X_{i}:i=1,\ldots,n\}$ reside in $\mathcal{H}=\bigotimes_{k=1}^{p}\mathcal{H}_{k}$, where $\mathcal{H}_{k}$ is an RKHS of functions on $[0,1]$ with a continuous and square integrable reproducing kernel $K_{k}$.

Assumption 2.

The true (folded) covariance function $\Gamma_{0}\neq 0$ and $\Gamma_{0}\in\mathcal{G}=\bigotimes^{d}_{j=1}\mathcal{G}_{j}$, where $d=2p$, $\mathcal{G}_{j}=\mathcal{H}_{j}$ for $j=1,\dots,p$ and $\mathcal{G}_{j}=\mathcal{H}_{j-p}$ for $j=p+1,\dots,d$.

Assumption 3.

The locations $\{\bm{T}_{ij}:i=1,\dots,n;j=1,\dots,m\}$ are independent random vectors from $\mathrm{Uniform}[0,1]^{p}$, and they are independent of $\{X_{i}:i=1,\dots,n\}$. The errors $\{\epsilon_{ij}:i=1,\dots,n;j=1,\dots,m\}$ are independent of both the locations and the sample fields.

Assumption 4.

For each $\bm{t}\in\mathcal{T}$, $X(\bm{t})$ is sub-Gaussian with a parameter $b_{X}>0$ that does not depend on $\bm{t}$, i.e., $\mathbb{E}[\exp\{\beta X(\bm{t})\}]\leq\exp\{b_{X}^{2}\beta^{2}/2\}$ for all $\beta$ and $\bm{t}\in\mathcal{T}$.

Assumption 5.

For each $i$ and $j$, $\epsilon_{ij}$ is a mean-zero sub-Gaussian random variable with a parameter $b_{\epsilon}$ independent of $i$ and $j$, i.e., $\mathbb{E}[\exp\{\beta\epsilon_{ij}\}]\leq\exp\{b_{\epsilon}^{2}\beta^{2}/2\}$. Moreover, all errors $\{\epsilon_{ij}:i=1,\dots,n;j=1,\dots,m\}$ are independent.

Assumption 1 delineates a tensor product RKHS modeling, commonly seen in the nonparametric regression literature (e.g., Wahba, 1990; Gu, 2013). In Assumption 2, the condition $\Gamma_{0}\in\mathcal{G}$ is satisfied if $\mathbb{E}\|X\|_{\mathcal{H}}^{2}<\infty$, as shown in Cai and Yuan (2010). Assumption 3 is specified for a random design, and we adopt the uniform distribution here for simplicity. The uniform distribution on $[0,1]^{p}$ can be generalized to any other continuous distribution whose density function $\pi$ satisfies $c_{\pi}\leq\pi(\bm{t})\leq c_{\pi}^{\prime}$ for all $\bm{t}\in[0,1]^{p}$, for some constants $0<c_{\pi}\leq c_{\pi}^{\prime}<\infty$, to ensure that both Theorems 2 and 3 still hold. Assumptions 4 and 5 involve sub-Gaussian conditions on the stochastic process and the measurement errors, which are standard tail conditions.

5.2 Reproducing kernels

In Assumption 1, the “smoothness” of the functions in the underlying RKHS is not explicitly specified. It is well known that such smoothness conditions are directly related to the eigen-decay of the respective reproducing kernel. By Mercer’s Theorem (Mercer, 1909), the reproducing kernel $K_{\mathcal{H}}((t_{1},\ldots,t_{p}),(t^{\prime}_{1},\ldots,t^{\prime}_{p}))$ of $\mathcal{H}$ possesses the eigen-decomposition

K_{\mathcal{H}}((t_{1},\ldots,t_{p}),(t^{\prime}_{1},\ldots,t^{\prime}_{p}))=\sum_{l=1}^{\infty}\mu_{l}\phi_{l}(t_{1},\ldots,t_{p})\phi_{l}(t^{\prime}_{1},\ldots,t^{\prime}_{p}), \qquad (16)

where $\{\mu_{l}:l\geq 1\}$ are non-negative eigenvalues and $\{\phi_{l}:l\geq 1\}$ are $L^{2}$ eigenfunctions on $[0,1]^{p}$. Then for the space $\mathcal{H}\otimes\mathcal{H}$, which is also identified with $\mathcal{G}=\bigotimes_{k=1}^{d}\mathcal{G}_{k}$, its corresponding reproducing kernel $K_{\mathcal{G}}$ has the following eigen-decomposition

K_{\mathcal{G}}((x_{1},\ldots,x_{2p}),(x^{\prime}_{1},\ldots,x^{\prime}_{2p}))
=K_{\mathcal{H}}((x_{1},\ldots,x_{p}),(x^{\prime}_{1},\ldots,x^{\prime}_{p}))\,K_{\mathcal{H}}((x_{p+1},\ldots,x_{2p}),(x^{\prime}_{p+1},\ldots,x^{\prime}_{2p}))
=\sum_{l,h=1}^{\infty}\mu_{l}\mu_{h}\phi_{l}(x_{1},\ldots,x_{p})\phi_{h}(x_{p+1},\ldots,x_{2p})\phi_{l}(x^{\prime}_{1},\ldots,x^{\prime}_{p})\phi_{h}(x^{\prime}_{p+1},\ldots,x^{\prime}_{2p}),

where $\{\mu_{l}\mu_{h}:l,h\geq 1\}$ are the eigenvalues of $K_{\mathcal{G}}$. Due to the continuity assumption (Assumption 1) on the univariate kernels, there exists a constant $b$ such that

\sup_{(x_{1},\ldots,x_{2p})\in[0,1]^{2p}}K_{\mathcal{G}}((x_{1},\ldots,x_{2p}),(x_{1},\dots,x_{2p}))\leq b.

The decay rate of the eigenvalues $\{\mu_{l}\mu_{h}:l,h\geq 1\}$ enters our analysis through two quantities, $\kappa_{n,m}$ and $\eta_{n,m}$, whose relatively complex forms are defined in Appendix B. Similar quantities can be found in other analyses of RKHS-based estimators (e.g., Raskutti et al., 2012) that accommodate general choices of RKHS. Generally, $\kappa_{n,m}$ and $\eta_{n,m}$ are expected to diminish at certain orders of $n$ and $m$, characterized by the decay rate of the eigenvalues $\{\mu_{l}\mu_{h}\}$. The smoother the functions in the RKHS, the faster these two quantities diminish. Our general results in Theorems 2 and 3 are stated in terms of these quantities. To provide a concrete example, we derive the orders of $\kappa_{n,m}$ and $\eta_{n,m}$ under a Sobolev–Hilbert space setting and provide the convergence rate of the proposed estimator in Corollary 1.

5.3 Unified rates of convergence

We write the penalty in (7) as $I(\Gamma)=\beta\|\Gamma_{\blacksquare}\|_{*}+(1-\beta)p^{-1}\sum_{k=1}^{p}\|\Gamma_{(k)}\|_{*}$. For arbitrary functions $g_{1},g_{2}\in\mathcal{G}$, define their empirical inner product and the corresponding (squared) empirical norm as

\[
\langle g_{1},g_{2}\rangle_{n,m}=\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{1\leq j,j^{\prime}\leq m}g_{1}(T_{ij1},\dots,T_{ijp},T_{ij^{\prime}1},\dots,T_{ij^{\prime}p})\,g_{2}(T_{ij1},\dots,T_{ijp},T_{ij^{\prime}1},\dots,T_{ij^{\prime}p}),
\]
\[
\|g_{1}\|_{n,m}^{2}=\langle g_{1},g_{1}\rangle_{n,m}.
\]

Additionally, the $L^{2}$ norm of a function $g$ is defined as $\|g\|_{2}=\{\int_{\mathcal{T}}g^{2}(\bm{t})\,d\bm{t}\}^{1/2}$.
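For readers who want to compute the empirical inner product numerically, the following is a minimal sketch (ours, not part of the paper). It assumes, consistently with the $nm(m-1)$ normalization above, that the double sum runs over ordered pairs with $j\neq j^{\prime}$; the array layout and function signatures are our own choices.

```python
import numpy as np

def empirical_inner_product(g1, g2, T):
    """Empirical inner product <g1, g2>_{n,m} over observed locations.

    T is an (n, m, p) array of locations T_{ijk}; g1 and g2 accept an
    (npairs, 2p) array of arguments (s_1,...,s_p, t_1,...,t_p) and return a
    length-npairs array.  Pairs with j = j' are excluded, matching the
    1/{n m (m-1)} normalization in the definition above.
    """
    n, m, p = T.shape
    jj, kk = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    mask = jj.ravel() != kk.ravel()
    total = 0.0
    for i in range(n):
        left = T[i, jj.ravel()[mask], :]     # T_{i j .}
        right = T[i, kk.ravel()[mask], :]    # T_{i j' .}
        args = np.hstack([left, right])      # one row per ordered pair (j, j')
        total += np.sum(g1(args) * g2(args))
    return total / (n * m * (m - 1))

def empirical_norm_sq(g, T):
    """Squared empirical norm ||g||_{n,m}^2 = <g, g>_{n,m}."""
    return empirical_inner_product(g, g, T)
```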

Define $\xi_{n,m}=\max\{\eta_{n,m},\kappa_{n,m},(n^{-1}\log n)^{1/2}\}$. We first provide the empirical $L^{2}$ rate of convergence for $\hat{\Gamma}$.

Theorem 2.

Suppose that Assumptions 1–5 hold. Assume $\xi_{n,m}$ satisfies $(\log n)/n\leq\xi_{n,m}^{2}/(\log\log\xi_{n,m}^{-1})$. If $\lambda\geq L_{1}\xi_{n,m}^{2}$ for some constant $L_{1}>0$ depending on $b_{X}$, $b_{\epsilon}$ and $b$, we have

\[
\|\hat{\Gamma}-\Gamma_{0}\|_{n,m}\leq\sqrt{2I(\Gamma_{0})\lambda}+L_{1}\xi_{n,m},
\]

with probability at least $1-\exp(-cn\xi_{n,m}^{2}/\log n)$ for some positive universal constant $c$.

Next, we provide the $L^{2}$ rate of convergence for $\hat{\Gamma}$.

Theorem 3.

Under the same conditions as in Theorem 2, there exists a positive constant $L_{2}$ depending on $b_{X}$, $b_{\epsilon}$, $b$ and $I(\Gamma_{0})$ such that

\[
\|\hat{\Gamma}-\Gamma_{0}\|_{2}\leq 2\sqrt{I(\Gamma_{0})\lambda}+L_{2}\xi_{n,m},
\]

with probability at least $1-\exp(-c_{p}n\xi_{n,m}^{2}/\log n)$ for some constant $c_{p}$ depending on $b$.

The proofs of Theorems 2 and 3 can be found in Section S1 in the supplementary material.

Remark 3.

Theorems 2 and 3 are applicable to a general RKHS $\mathcal{H}$ satisfying Assumption 1. The convergence rate depends on the eigen-decay rate of the reproducing kernel. A special case of polynomial decay rates for the univariate RKHS is given in Corollary 1. Moreover, our analysis has a unified flavor in the sense that the resulting convergence rates automatically adapt to the orders of both $n$ and $m$. In Remark 5 we discuss a “phase transition” between dense and sparse functional data revealed by our theory.

Remark 4.

With a properly chosen $\lambda$, Theorems 2 and 3 bound the convergence rates (in terms of both the empirical and the theoretical $L^{2}$ norm) by $\xi_{n,m}$, which cannot be faster than $(n^{-1}\log n)^{1/2}$. The logarithmic factor is due to the use of the Adamczak bound in Lemma S2 in the supplementary material. If one further assumes boundedness of the sample fields $X_{i}$ (in terms of the sup-norm) and of the noise variables $\epsilon_{ij}$, we can instead use the Talagrand concentration inequality (Bousquet bound in Koltchinskii, 2011), and the results in Theorems 2 and 3 improve to $\max\{\|\hat{\Gamma}-\Gamma_{0}\|^{2}_{n,m},\|\hat{\Gamma}-\Gamma_{0}\|^{2}_{2}\}=\mathcal{O}_{\mathrm{p}}(\tilde{\xi}_{n,m}^{2})$, where $\tilde{\xi}_{n,m}=\max\{\eta_{n,m},\kappa_{n,m},n^{-1/2}\}$.

Next we focus on a special case where the reproducing kernels of the univariate RKHS $\mathcal{H}_{k}$ exhibit polynomial eigen-decay rates, which holds for a range of commonly used RKHS. A canonical example is the $\alpha$-th order Sobolev–Hilbert space

\[
\mathcal{H}_{k}=\{f:f^{(r)},\,r=0,\dots,\alpha,\ \text{are absolutely continuous};\ f^{(\alpha)}\in L^{2}([0,1])\},
\]

where $k=1,\dots,p$, and $\alpha$ is the smoothness index appearing in Corollary 1 below. To derive the convergence rates, we relate the eigenvalues $\mu_{l}$ in (16) to the univariate RKHS $\mathcal{H}_{k}$, $k=1,\dots,p$. By Mercer’s theorem, the reproducing kernel $K_{k}$ of $\mathcal{H}_{k}$ admits an eigen-decomposition with non-negative eigenvalues $\{\mu_{l}^{(k)}:l\geq 1\}$ and $L^{2}$ eigenfunctions $\{\phi^{(k)}_{l}:l\geq 1\}$, i.e., $K_{k}(t,t^{\prime})=\sum_{l=1}^{\infty}\mu_{l}^{(k)}\phi_{l}^{(k)}(t)\phi^{(k)}_{l}(t^{\prime})$. Therefore, the set of eigenvalues $\{\mu_{l}:l\geq 1\}$ in (16) coincides with the set $\{\prod_{k=1}^{p}\mu_{l_{k}}^{(k)}:l_{1},\dots,l_{p}\geq 1\}$. Given the eigen-decay of $\mu_{l}^{(k)}$, one can obtain the order of $\xi_{n,m}$ and hence the convergence rates from Theorems 2 and 3. The results under a polynomial eigen-decay are as follows.
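A small numerical sketch (ours, not the authors') of this identification: under the polynomial decay $\mu_{l}^{(k)}=l^{-2\alpha}$ assumed in Corollary 1, the multivariate eigenvalues are the products $\prod_{k}\mu_{l_{k}}^{(k)}$, and the eigenvalues of $K_{\mathcal{G}}$ are their pairwise products.

```python
import numpy as np
from itertools import product

def tensor_product_eigenvalues(alpha=2.0, p=2, L=30):
    """Eigenvalues of K_H truncated to the first L univariate eigenvalues per
    dimension, assuming the polynomial decay mu_l^{(k)} = l^{-2*alpha} for
    every k; the multivariate eigenvalues are all products
    prod_k mu_{l_k}^{(k)} over multi-indices (l_1, ..., l_p)."""
    mu_uni = np.arange(1, L + 1, dtype=float) ** (-2.0 * alpha)
    mu = np.array([np.prod(combo) for combo in product(mu_uni, repeat=p)])
    return np.sort(mu)[::-1]

mu = tensor_product_eigenvalues(alpha=2.0, p=2, L=30)
# The eigenvalues of K_G are then the pairwise products mu_l * mu_h.
mu_G = np.sort(np.outer(mu, mu).ravel())[::-1]
print(mu[:5], mu_G[:5])
```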

Corollary 1.

Suppose that the conditions of Theorem 3 hold. If the eigenvalues of $K_{k}$, the reproducing kernel of $\mathcal{H}_{k}$, $k=1,\ldots,p$, decay polynomially, that is, there exists $\alpha>1/2$ such that $\mu_{l}^{(k)}\asymp l^{-2\alpha}$ for all $k=1,\dots,p$, then

\[
\max\left\{\|\hat{\Gamma}-\Gamma_{0}\|^{2}_{n,m},\|\hat{\Gamma}-\Gamma_{0}\|^{2}_{2}\right\}=\mathcal{O}_{\mathrm{p}}\left(\max\left\{(nm)^{-\frac{2\alpha}{1+2\alpha}}\{\log(nm)\}^{\frac{2\alpha(2p-1)}{2\alpha+1}},\ \frac{\log n}{n}\right\}\right).
\]
Remark 5.

Theorems 2 and 3 and Corollary 1 all reveal a “phase transition” of the convergence rate depending on the relative magnitudes of $n$, the sample size, and $m$, the number of observations per field. When $\kappa^{2}_{n,m}\ll\log n/n$, which is equivalent to $m\gg n^{1/(2\alpha)}(\log n)^{2p-2-1/(2\alpha)}$ in Corollary 1, both the empirical and the theoretical $L^{2}$ rates of convergence achieve the near-optimal rate $\sqrt{\log n/n}$. Under the stronger assumptions in Remark 4, the convergence rate achieves the optimal order $\sqrt{1/n}$ when $\kappa^{2}_{n,m}\ll 1/n$ (or $m\gg n^{1/(2\alpha)}(\log n)^{2p-1}$ in Corollary 1). In this case, the observations are so densely sampled that we can estimate the covariance function as precisely as if the entire sample fields were observable. In contrast, when $\kappa^{2}_{n,m}\gg\log n/n$ (or $m\ll n^{1/(2\alpha)}(\log n)^{2p-2-1/(2\alpha)}$ in Corollary 1), the convergence rate is determined by the total number of observations $nm$. When $p=1$, the asymptotic result in Corollary 1 is, up to some $\log m$ and $\log n$ terms, the same as the minimax optimal rate obtained by Cai and Yuan (2010), and is comparable to the $L^{2}$ rate obtained by Paul and Peng (2009) for $\alpha=2$.
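To make the phase transition tangible, here is a hedged sketch (our illustration; constants and slowly varying factors are ignored) that evaluates the rate bound of Corollary 1 and the threshold order of $m$ from this remark.

```python
import numpy as np

def corollary1_rate(n, m, alpha, p):
    """Order of the squared L2 error bound in Corollary 1 (constants ignored)."""
    sparse_part = (n * m) ** (-2 * alpha / (1 + 2 * alpha)) \
        * np.log(n * m) ** (2 * alpha * (2 * p - 1) / (2 * alpha + 1))
    dense_part = np.log(n) / n
    return max(sparse_part, dense_part)

def phase_transition_m(n, alpha, p):
    """Order of m beyond which the log(n)/n term dominates (Remark 5)."""
    return n ** (1 / (2 * alpha)) * np.log(n) ** (2 * p - 2 - 1 / (2 * alpha))

# Example: alpha = 2 (second-order Sobolev smoothness), p = 2 dimensions
for m in (10, 50, 500):
    print(m, corollary1_rate(n=200, m=m, alpha=2, p=2))
print("threshold m ~", phase_transition_m(n=200, alpha=2, p=2))
```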

Remark 6.

For covariance function estimation with unidimensional functional data, i.e., $p=1$, a limited number of approaches, including Cai and Yuan (2010), Li and Hsing (2010), Zhang and Wang (2016), and Liebl (2019), achieve unified theoretical results in the sense that they hold for all relative magnitudes of $n$ and $m$. What these approaches share is the availability of a closed form for the covariance function estimator. In contrast, our estimator obtained from (7) does not have a closed form due to the non-differentiability of the penalty, but it still achieves unified theoretical results that hold for both unidimensional and multidimensional functional data. Due to the lack of a closed form, we use empirical process techniques (e.g., Bartlett et al., 2005; Koltchinskii, 2011) in the theoretical development. In particular, we develop a novel grouping lemma (Lemma S4 in the supplementary material) to deterministically decouple the dependence within a $U$-statistic of order 2, which we believe is of independent interest. In our analysis, the corresponding $U$-statistic is indexed by a function class, and the grouping lemma provides a tool to obtain uniform results (see Lemma S3 in the supplementary material). In particular, it allows us to relate the empirical and theoretical $L^{2}$ norms over the underlying function class, with dependence on $n$ and $m$ precise enough to derive the unified theory. To the best of our knowledge, this paper is one of the first in the FDA literature to derive a unified result via empirical process theory, and the proof technique is potentially useful for other estimators without a closed form.

6 Simulation

To evaluate the practical performance of the proposed method, we conducted a simulation study, focusing on two-dimensional functional data. Let $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ both be the RKHS with kernel $K(t_{1},t_{2})=\sum_{k=1}^{\infty}(k\pi)^{-4}e_{k}(t_{1})e_{k}(t_{2})$, where $e_{k}(t)=\sqrt{2}\cos(k\pi t)$, $k\geq 1$. This RKHS has been used in various FDA studies, e.g., the simulation study of Cai and Yuan (2012). Each $X_{i}$ is generated from a mean-zero Gaussian random field with covariance function

\[
\gamma_{0}((s_{1},s_{2}),(t_{1},t_{2}))=\Gamma_{0}(s_{1},s_{2},t_{1},t_{2})=\sum^{R}_{k=1}k^{-2}\psi_{k}(s_{1},s_{2})\psi_{k}(t_{1},t_{2}),
\]

where the eigenfunctions $\psi_{k}(t_{1},t_{2})\in\mathcal{P}_{r_{1},r_{2}}:=\{e_{i}(t_{1})e_{j}(t_{2}):i=1,\dots,r_{1};\,j=1,\dots,r_{2}\}$. Three combinations of the one-way ranks $(r_{1},r_{2})$ and the two-way rank $R$ were studied for $\Gamma_{0}$:

Setting 1: $R=6$, $r_{1}=3$, $r_{2}=2$;    Setting 2: $R=6$, $r_{1}=r_{2}=4$;    Setting 3: $R=r_{1}=r_{2}=4$.

For each setting, we chose $R$ functions from $\mathcal{P}_{r_{1},r_{2}}$ as $\{\psi_{k}\}$ such that smoother functions are associated with larger eigenvalues. The details are described in Section S2.1 of the supplementary material.

In terms of sampling plans, we considered both sparse and dense designs. Due to the space limit, here we only show and discuss the results for the sparse design, while deferring those for the dense design to Section S2.3 of the supplementary material. For the sparse design, the random locations $\bm{T}_{ij}$, $j=1,\ldots,m$, were independently generated from the continuous uniform distribution on $[0,1]^{2}$ within each field and across different fields, and the random errors $\{\epsilon_{ij}:i=1,\ldots,n;\,j=1,\ldots,m\}$ were independently generated from $N(0,\sigma^{2})$. In each of the 200 simulation runs, the observed data were obtained following (1) with various combinations of $m=10,20$, $n=100,200$ and noise level $\sigma=0.1,0.4$.
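For concreteness, the following is a minimal sketch of this sparse-design data generation. The index pairs defining $\{\psi_k\}$ below are placeholders; the actual selections for Settings 1–3 are given in Section S2.1 of the supplementary material.

```python
import numpy as np

rng = np.random.default_rng(0)

def e(k, t):
    """Cosine basis e_k(t) = sqrt(2) cos(k*pi*t)."""
    return np.sqrt(2.0) * np.cos(k * np.pi * t)

# Illustrative eigenfunctions psi_k(t1, t2) = e_i(t1) e_j(t2); the actual
# index pairs used in Settings 1-3 are specified in Section S2.1.
index_pairs = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]  # placeholder, R = 6
eigvals = np.array([k ** (-2.0) for k in range(1, len(index_pairs) + 1)])

def simulate_sparse_data(n=200, m=10, sigma=0.1):
    """Generate sparse-design observations Y_ij = X_i(T_ij) + eps_ij."""
    T = rng.uniform(size=(n, m, 2))                     # locations in [0,1]^2
    scores = rng.normal(size=(n, len(eigvals))) * np.sqrt(eigvals)
    psi = np.stack([e(i, T[..., 0]) * e(j, T[..., 1])
                    for (i, j) in index_pairs], axis=-1)  # shape (n, m, R)
    X = np.einsum("nmr,nr->nm", psi, scores)            # X_i evaluated at T_ij
    Y = X + sigma * rng.normal(size=(n, m))
    return T, Y

T, Y = simulate_sparse_data()
```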

We compared the proposed method, denoted by mOpCov, with three existing methods: 1) OpCov: the estimator based on Wong and Zhang (2019) with adaptation to the two-dimensional case (see Section 2); 2) $l$-smooth: local linear smoothing with the Epanechnikov kernel; 3) $l$-smooth+: the two-step estimator constructed by retaining the eigen-components of $l$-smooth selected by a 99% fraction of variation explained (FVE) and then removing the eigen-components with negative eigenvalues. For both OpCov and mOpCov, 5-fold cross-validation was adopted to select the corresponding tuning parameters.

Table 1 shows the average integrated squared error (AISE), the average estimated two-way rank ($\bar{R}$), and the average estimated one-way ranks ($\bar{r}_{1},\bar{r}_{2}$) of the above covariance estimators over 200 simulated data sets in the respective settings when the sample size is $n=200$. Corresponding results for $n=100$ can be found in Table S4 of the supplementary material and lead to similar conclusions. Clearly, $l$-smooth and $l$-smooth+, especially $l$-smooth, perform significantly worse than the other two methods in both estimation accuracy and rank reduction (when applicable). Below we only compare mOpCov and OpCov.

Regarding estimation accuracy, the proposed mOpCov has uniformly smaller AISE values than OpCov, with around a 10%–20% improvement in AISE over OpCov in most cases under Settings 1 and 2. If the standard error (SE) of the AISE is taken into account, the improvements by mOpCov are more pronounced in Settings 1 and 2 than in Setting 3, since the SEs of OpCov in Setting 3 are relatively high. This is because in Setting 3 the marginal basis functions are not shared by different two-dimensional eigenfunctions, and hence mOpCov cannot benefit from structure sharing among eigenfunctions. Setting 3 is in fact an extreme setting designed to challenge the proposed method.

For rank estimation, OpCov almost always underestimates the two-way rank, while mOpCov typically overestimates both the one-way and two-way ranks. For mOpCov, the average one-way rank estimates are always smaller than the average two-way rank estimates, and their differences are substantial in Settings 1 and 2. This demonstrates the ability of mOpCov to detect the sharing of one-dimensional basis functions among two-dimensional eigenfunctions.

We also tested the performance of mOpCov under the dense and regular design, and compared it with the existing methods mentioned above together with the method of Wang and Huang (2017), which is not applicable to the sparse design. Details are given in Section S2.3 of the supplementary material, where all methods achieve similar AISE values, but mOpCov performs slightly better in estimation accuracy when the noise level is high.

Table 1: Simulation results for the three settings with the sparse design when the sample size ($n$) is 200. The AISE values with standard errors (SE) in parentheses are provided for the four covariance estimators in comparison, together with the average two-way ranks ($\bar{R}$) for the estimators that lead to rank reduction (i.e., mOpCov, OpCov, and $l$-smooth+) and the average one-way ranks ($\bar{r}_{1}$, $\bar{r}_{2}$) for mOpCov.
Setting | $m$ | $\sigma$ | Metric | mOpCov | OpCov | $l$-smooth | $l$-smooth+
1 | 10 | 0.1 | AISE | 0.053 (1.97e-03) | 0.0632 (3.22e-03) | 0.652 (1.92e-01) | 0.337 (5.35e-02)
1 | 10 | 0.1 | $\bar{R}$ | 8.38 | 2.94 | - | 172.70
1 | 10 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5.4, 5.4 | - | - | -
1 | 10 | 0.4 | AISE | 0.0547 (2.01e-03) | 0.0656 (2.72e-03) | 0.714 (2.11e-01) | 0.366 (5.96e-02)
1 | 10 | 0.4 | $\bar{R}$ | 9.16 | 2.84 | - | 177.3
1 | 10 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 5.34, 5.32 | - | - | -
1 | 20 | 0.1 | AISE | 0.0343 (1.46e-03) | 0.0421 (1.97e-03) | 0.297 (1.39e-02) | 0.206 (4.62e-03)
1 | 20 | 0.1 | $\bar{R}$ | 8.38 | 3.78 | - | 317.44
1 | 20 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5.84, 5.82 | - | - | -
1 | 20 | 0.4 | AISE | 0.0354 (1.52e-03) | 0.044 (2.21e-03) | 0.325 (1.58e-02) | 0.223 (4.94e-03)
1 | 20 | 0.4 | $\bar{R}$ | 8.86 | 3.76 | - | 326.31
1 | 20 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 5.83, 5.84 | - | - | -
2 | 10 | 0.1 | AISE | 0.0532 (1.98e-03) | 0.0636 (3.12e-03) | 2.33 (1.13e+00) | 0.795 (2.98e-01)
2 | 10 | 0.1 | $\bar{R}$ | 8.48 | 3.02 | - | 191.175
2 | 10 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5.82, 5.82 | - | - | -
2 | 10 | 0.4 | AISE | 0.0548 (2.05e-03) | 0.0686 (3.53e-03) | 2.44 (1.17e+00) | 0.828 (3.04e-01)
2 | 10 | 0.4 | $\bar{R}$ | 9.04 | 3.04 | - | 196.34
2 | 10 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 5.71, 5.74 | - | - | -
2 | 20 | 0.1 | AISE | 0.0341 (1.43e-03) | 0.0419 (2.02e-03) | 0.301 (1.58e-02) | 0.208 (4.50e-03)
2 | 20 | 0.1 | $\bar{R}$ | 8.99 | 3.74 | - | 318.645
2 | 20 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5.93, 5.92 | - | - | -
2 | 20 | 0.4 | AISE | 0.0348 (1.43e-03) | 0.043 (2.22e-03) | 0.328 (1.78e-02) | 0.225 (4.74e-03)
2 | 20 | 0.4 | $\bar{R}$ | 8.01 | 3.6 | - | 327.395
2 | 20 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 5.94, 5.93 | - | - | -
3 | 10 | 0.1 | AISE | 0.058 (2.62e-03) | 0.0692 (5.33e-03) | 0.454 (7.28e-02) | 0.286 (2.89e-02)
3 | 10 | 0.1 | $\bar{R}$ | 6.26 | 3.12 | - | 182.74
3 | 10 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5, 5.06 | - | - | -
3 | 10 | 0.4 | AISE | 0.0598 (2.68e-03) | 0.0733 (6.14e-03) | 0.531 (1.07e-01) | 0.323 (4.23e-02)
3 | 10 | 0.4 | $\bar{R}$ | 6.48 | 3.2 | - | 185.82
3 | 10 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 4.99, 5.07 | - | - | -
3 | 20 | 0.1 | AISE | 0.0422 (1.37e-03) | 0.0535 (2.64e-03) | 0.267 (5.04e-03) | 0.196 (3.59e-03)
3 | 20 | 0.1 | $\bar{R}$ | 6.29 | 4.49 | - | 332.09
3 | 20 | 0.1 | $\bar{r}_1$, $\bar{r}_2$ | 5.62, 5.69 | - | - | -
3 | 20 | 0.4 | AISE | 0.0424 (1.30e-03) | 0.0494 (2.42e-03) | 0.292 (5.30e-03) | 0.212 (3.72e-03)
3 | 20 | 0.4 | $\bar{R}$ | 5.68 | 3.36 | - | 338.725
3 | 20 | 0.4 | $\bar{r}_1$, $\bar{r}_2$ | 5.59, 5.66 | - | - | -

7 Real Data Application

We applied the proposed method to an Argo profile data set obtained from http://www.argo.ucsd.edu/Argo_data_and.html. The Argo project maintains a global array of approximately 3,800 free-drifting profiling floats, which measure the temperature and salinity of the ocean. These floats drift freely in the depths of the ocean most of the time and ascend regularly to the sea surface, where they transmit the collected data to satellites. On any given day, only a small subset of floats appears on the sea surface. Due to the drifting process, these floats measure temperature and salinity at irregular locations over the ocean. See Figure 2 for examples.

In this analysis, we focus on the contrasting changes in sea surface temperature between the tropical western and eastern Indian Ocean, a phenomenon known as the Indian Ocean Dipole (IOD). The IOD is known to be associated with droughts in Australia (Ummenhofer et al., 2009) and has a significant effect on rainfall patterns in southeast Australia (Behera and Yamagata, 2003). According to Shinoda et al. (2004), the IOD is a predominant inter-annual variation of sea surface temperature during late boreal summer and autumn, so in this application we focused on the sea surface temperature in the Indian Ocean region of longitude 40 to 120 and latitude -20 to 20 between September and November of every year from 2003 to 2018.

Based on a simple autocorrelation analysis of the gridded data, we decided to use measurements from every tenth day in order to reduce the temporal dependence in the data.

At each float location on a particular day, the average temperature between 0 and 5 hPa from the float is regarded as a measurement. The Argo float dataset provides multiple versions of the data, and we adopted the quality-controlled (QC) version. Eventually we have two-dimensional functional data collected over $n=144$ days, where the number of observed locations $\bm{T}_{ij}=(\text{longitude},\text{latitude})$ per day varies from 7 to 47, i.e., $7\leq m_{i}\leq 47$, $i=1,\dots,n$, with an average of 21.83. The locations are rescaled to $[0,1]\times[0,1]$. As shown in Figure 2, the data have a random sparse design.
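A hedged sketch of this preprocessing is given below; the column names (`date`, `longitude`, `latitude`, `pressure`, `temperature`) are hypothetical and do not reflect the actual Argo file format.

```python
import pandas as pd

# Hypothetical columns: date, longitude, latitude, pressure, temperature.
# The actual Argo QC files use a different layout; this only mirrors the
# preprocessing steps described in the text.
def preprocess(profiles: pd.DataFrame) -> pd.DataFrame:
    # Average temperature over the 0-5 pressure range at each float location.
    surf = profiles[(profiles["pressure"] >= 0) & (profiles["pressure"] <= 5)]
    daily = (surf.groupby(["date", "longitude", "latitude"])["temperature"]
                 .mean().reset_index())
    # Restrict to the Indian Ocean window and rescale to [0,1] x [0,1].
    daily = daily[daily["longitude"].between(40, 120)
                  & daily["latitude"].between(-20, 20)]
    daily["t1"] = (daily["longitude"] - 40) / 80
    daily["t2"] = (daily["latitude"] + 20) / 40
    return daily
```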

Figure 2: Observations on 2013/09/04 (left) and all observations in the data set (right). Points on the map indicate the locations (longitude, latitude) of the observations, and the color scale of each point shows the corresponding temperature in Celsius.

First, we used kernel ridge regression with the reproducing kernel of the tensor product of two second-order Sobolev spaces (e.g., Wong and Zhang, 2019) to obtain a mean function estimate for every month. Then we applied the proposed covariance function estimator with the same kernel.
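The following is a minimal sketch of such a mean-estimation step. It uses a generic product kernel as a stand-in; the paper instead uses the reproducing kernel of the tensor product of two second-order Sobolev spaces, and the ridge parameter below is illustrative.

```python
import numpy as np

def krr_mean_estimate(T, y, kernel, lam=1e-3):
    """Kernel ridge regression fit m_hat(t) = sum_i c_i k(t, T_i).

    T: (N, 2) pooled observation locations; y: (N,) temperatures;
    kernel(a, b): bivariate product kernel k(a, b) = k1(a1, b1) * k2(a2, b2).
    """
    K = np.array([[kernel(a, b) for b in T] for a in T])
    c = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
    return lambda t: sum(ci * kernel(t, Ti) for ci, Ti in zip(c, T))

# Illustrative product kernel; a stand-in for the Sobolev tensor-product kernel.
gauss = lambda s, t, h=0.2: np.exp(-(s - t) ** 2 / (2 * h ** 2))
prod_kernel = lambda a, b: gauss(a[0], b[0]) * gauss(a[1], b[1])
```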

The estimates of the top two two-dimensional $L^{2}$ eigenfunctions are illustrated in Figure 3. The first eigenfunction shows the east–west dipole mode, which aligns with existing scientific findings (e.g., Shinoda et al., 2004; Chu et al., 2014; Deser et al., 2010). The second eigenfunction can be interpreted as the basin-wide mode, which is a dominant mode throughout the year (e.g., Deser et al., 2010; Chu et al., 2014).

To provide a clearer understanding of the covariance structure, we derived marginal $L^{2}$ bases along longitude and latitude, respectively; the details are given in Appendix A. The left panel of Figure 4 shows that the first longitudinal marginal basis function reflects a large variation in the western region, while the second corresponds to the variation in the eastern region. Through different linear combinations, the variation along longitude may contribute not only to opposite changes between the eastern and western sides of the Indian Ocean, as shown in the first two-dimensional eigenfunction, but also to an overall warming or cooling tendency, as shown in the second two-dimensional eigenfunction. The second longitudinal marginal basis function reveals that the variation grows toward the eastern boundary, which suggests that the IOD may be related to the Pacific Ocean. This aligns with the evidence that the IOD is linked with the El Niño Southern Oscillation (ENSO) (Stuecker et al., 2017), an irregularly periodic variation in sea surface temperature over the tropical eastern Pacific Ocean. As shown in the right panel of Figure 4, the first latitudinal marginal basis function is nearly constant, which provides evidence that the IOD is primarily associated with the variation along longitude.

Figure 3: The first (left) and second (right) two-dimensional $L^{2}$ eigenfunctions. The first eigenfunction explains 33.60% of the variance and the second explains 25.94%.
Figure 4: The first two marginal $L^{2}$ basis functions along (a) longitude (39.06%, 36.10%) and (b) latitude (48.22%, 25.40%), respectively. Solid lines are the first marginal basis functions and dotted lines are the second. The fractions of variation explained by the corresponding principal components are given in parentheses.

Appendix

Appendix A $L^{2}$ eigensystem and $L^{2}$ marginal basis

In this section, we present a transformation procedure to produce $L^{2}$ eigenfunctions and the corresponding eigenvalues from our estimator $\hat{\bm{B}}$ obtained from (11).

Let $\bm{Q}_{k}=[\int_{[0,1]}K(s,T_{ijk})K(s,T_{i^{\prime}j^{\prime}k})\,ds]_{1\leq i,i^{\prime}\leq n,\,1\leq j,j^{\prime}\leq m}$, $k=1,\ldots,p$. Then $\bm{Q}_{k}=\bm{M}_{k}\bm{R}_{k}\bm{M}_{k}^{\intercal}$, where $\bm{R}_{k}=[\int_{[0,1]}v_{l}(s)v_{h}(s)\,ds]_{1\leq l,h\leq q_{k}}$ and $\{v_{l}:l=1,\ldots,q_{k}\}$ form a basis of $\mathcal{H}_{k}$, so $\bm{R}_{k}=\bm{M}_{k}^{+}\bm{Q}_{k}(\bm{M}_{k}^{+})^{\intercal}$. The $L^{2}$ eigenvalues of $\hat{\Gamma}_{\blacksquare}$ coincide with the eigenvalues of the matrix $\hat{\bm{B}}^{L}_{\mathrm{square}}:=(\bm{R}_{1}\otimes\cdots\otimes\bm{R}_{p})^{1/2}\hat{\bm{B}}_{\blacksquare}[(\bm{R}_{1}\otimes\cdots\otimes\bm{R}_{p})^{1/2}]^{\intercal}$, and the number of nonzero eigenvalues equals the rank of $\hat{\bm{B}}_{\blacksquare}$. The $L^{2}$ eigenfunction $\hat{\phi}_{l}$ corresponding to the $l$-th eigenvalue of $\hat{\Gamma}_{\blacksquare}$ can be expressed as $\hat{\phi}_{l}(s_{1},\dots,s_{p})=\bm{u}_{l}^{\intercal}[\bm{z}_{1}(s_{1})\otimes\cdots\otimes\bm{z}_{p}(s_{p})]$, where $\bm{z}_{k}(\cdot)$, $k=1,\ldots,p$, are defined in Theorem 1, and $\bm{u}_{l}=(\bm{M}_{1}^{+}\otimes\cdots\otimes\bm{M}_{p}^{+})^{\intercal}(\bm{R}_{1}\otimes\cdots\otimes\bm{R}_{p})^{-1/2}\bm{v}_{l}$ with $\bm{v}_{l}$ the $l$-th eigenvector of $\hat{\bm{B}}^{L}_{\mathrm{square}}$. Using the properties of Kronecker products, we have $\hat{\phi}_{l}(s_{1},\dots,s_{p})=\bm{v}_{l}^{\intercal}[(\bm{R}_{1}^{-1/2}\bm{M}_{1}^{+}\bm{z}_{1}(s_{1}))\otimes\cdots\otimes(\bm{R}_{p}^{-1/2}\bm{M}_{p}^{+}\bm{z}_{p}(s_{p}))]$.

By simple verification, we can see that $\bm{R}_{k}^{-1/2}\bm{M}_{k}^{+}\bm{z}_{k}(\cdot)$ gives $q_{k}$ orthonormal one-dimensional $L^{2}$ functions for dimension $k$, $k=1,\dots,p$. Therefore, we can also express $\hat{\Gamma}$ in terms of these one-dimensional $L^{2}$ basis functions, and the coefficients form a $2p$-th order tensor of dimension $q_{1}\times\cdots\times q_{p}\times q_{1}\times\cdots\times q_{p}$. We use $\hat{\bm{B}}^{L}$ to denote this new coefficient tensor and extend our unfolding operators to the $L^{2}$ space. It is easy to see that $\hat{\bm{B}}^{L}_{\blacksquare}=\hat{\bm{B}}^{L}_{\mathrm{square}}$.
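A numerical sketch (ours, not part of the paper) of this transformation for $p=2$: given the matricized coefficient estimate and the Gram-type matrices $\bm{R}_1,\bm{R}_2$, form $\hat{\bm{B}}^{L}_{\mathrm{square}}$ and eigendecompose it. The variable names are placeholders.

```python
import numpy as np
from scipy.linalg import sqrtm, eigh

def l2_eigensystem(B_square, R1, R2):
    """L2 eigenvalues and eigenvector coefficients from the coefficient matrix.

    B_square : (q1*q2, q1*q2) square matricization of the coefficient tensor (p = 2).
    R1, R2   : q_k x q_k matrices of L2 inner products of the basis functions.
    Returns the eigenvalues of B^L_square and the eigenvectors v_l; the L2
    eigenfunctions then follow from the Kronecker-product formula in the text.
    """
    # (R1 (x) R2)^{1/2} = R1^{1/2} (x) R2^{1/2} for symmetric PSD matrices.
    R_half = np.real(np.kron(sqrtm(R1), sqrtm(R2)))
    B_L = R_half @ B_square @ R_half.T
    vals, vecs = eigh(B_L)
    order = np.argsort(vals)[::-1]          # sort eigenvalues in decreasing order
    return vals[order], vecs[:, order]
```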

Since $\hat{\Gamma}_{(k)}$ is a compact operator in the $L^{2}$ space, it admits a singular value decomposition, which leads to an $L^{2}$ basis characterizing the marginal variation along the $k$-th dimension. We call it an $L^{2}$ marginal basis for the $k$-th dimension. The marginal basis function $\hat{\psi}^{k}_{l}$ corresponding to the $l$-th singular value for dimension $k$ can be expressed as $\hat{\psi}^{k}_{l}(\cdot)=(\bm{u}^{k}_{l})^{\intercal}\bm{z}_{k}(\cdot)$, where $\bm{u}^{k}_{l}=(\bm{M}_{k}^{+})^{\intercal}\bm{R}_{k}^{-1/2}\bm{v}^{k}_{l}$ and $\bm{v}^{k}_{l}$ is the $l$-th singular vector of $\hat{\bm{B}}^{L}_{(k)}$. The $L^{2}$ marginal singular values of $\hat{\Gamma}_{(k)}$ coincide with the singular values of the matrix $\hat{\bm{B}}^{L}_{(k)}$.

Appendix B Definitions of $\kappa_{n,m}$ and $\eta_{n,m}$

Here we provide the specific forms of $\kappa_{n,m}$ and $\eta_{n,m}$, which are closely related to the decay of $\{\mu_{l}\mu_{h}:l,h\geq 1\}$. Specifically, $\kappa_{n,m}$ is defined as the smallest positive $\kappa$ such that

\[
cb^{3}\left[\frac{1}{n(m-1)}\sum_{l,h=1}^{\infty}\min\left\{\kappa^{2},\mu_{l}\mu_{h}\right\}\right]^{1/2}\leq\kappa^{2},
\qquad
32cb\left[\frac{1}{n(m-1)}\sum_{l,h=1}^{\infty}\min\left\{\kappa^{2}/b^{2},\mu_{l}\mu_{h}\right\}\right]^{1/2}\leq\kappa^{2},
\qquad (17)
\]

where $c$ is a universal constant, and $\eta_{n,m}$ is defined as the smallest positive $\eta$ such that

\[
\left(\frac{c_{\eta}}{nm}\sum_{l,h=1}^{\infty}\min\{\eta^{2},\mu_{l}\mu_{h}\}+\frac{\eta^{2}}{n}\right)^{1/2}\leq\eta^{2}, \qquad (18)
\]

where $c_{\eta}$ is a constant depending on $b$, $b_{X}$ and $b_{\epsilon}$. The existence of $\kappa_{n,m}$ and $\eta_{n,m}$ is established in the proof of Theorem 2.
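As an illustration (not part of the paper), $\kappa_{n,m}$ can be located numerically by bisection, since the left-hand sides of (17) grow more slowly in $\kappa$ than the right-hand side $\kappa^{2}$; the constants $c$ and $b$ are unspecified in the text and set to 1 below, and product eigenvalues with polynomial univariate decay are used. An analogous search applies to $\eta_{n,m}$ in (18).

```python
import numpy as np

def kappa_nm(n, m, mu, c=1.0, b=1.0):
    """Smallest kappa satisfying both inequalities in (17), found by bisection.

    mu : 1-D array of product eigenvalues {mu_l * mu_h}; c and b are
    illustrative placeholders for the unspecified constants.
    """
    def feasible(kappa):
        s1 = c * b**3 * np.sqrt(np.sum(np.minimum(kappa**2, mu)) / (n * (m - 1)))
        s2 = 32 * c * b * np.sqrt(np.sum(np.minimum(kappa**2 / b**2, mu)) / (n * (m - 1)))
        return max(s1, s2) <= kappa**2

    lo, hi = 1e-8, 10.0
    for _ in range(80):                      # bisect on the feasibility threshold
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Product eigenvalues mu_l * mu_h with polynomial univariate decay (alpha = 2).
mu1 = np.arange(1, 200, dtype=float) ** (-4.0)
mu = np.outer(mu1, mu1).ravel()
print(kappa_nm(n=200, m=10, mu=mu))
```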

Supplementary Material

In the supplementary material related to this paper, we provide proofs of our theoretical findings and additional simulation results.

Acknowledgement

The research of Raymond K. W. Wong is partially supported by the U.S. National Science Foundation under grants DMS-1806063, DMS-1711952 (subcontract) and CCF-1934904. The research of Xiaoke Zhang is partially supported by the U.S. National Science Foundation under grant DMS-1832046. Portions of this research were conducted with high performance research computing resources provided by Texas A&M University (https://hprc.tamu.edu).

References

  • Abernethy et al. (2009) Abernethy, J., F. Bach, T. Evgeniou, and J.-P. Vert (2009). A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research 10, 803–826.
  • Allen (2013) Allen, G. I. (2013). Multi-way functional principal components analysis. In 2013 5th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp.  220–223. IEEE.
  • Bartlett et al. (2005) Bartlett, P. L., O. Bousquet, and S. Mendelson (2005). Local Rademacher complexities. The Annals of Statistics 33(4), 1497–1537.
  • Behera and Yamagata (2003) Behera, S. K. and T. Yamagata (2003). Influence of the Indian Ocean Dipole on the Southern Oscillation. Journal of the Meteorological Society of Japan. Ser. II 81(1), 169–177.
  • Boyd et al. (2010) Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3(1), 1–122.
  • Cai and Yuan (2010) Cai, T. T. and M. Yuan (2010). Nonparametric covariance function estimation for functional and longitudinal data. Technical report, Georgia Institute of Technology, Atlanta, GA.
  • Cai and Yuan (2012) Cai, T. T. and M. Yuan (2012). Minimax and adaptive prediction for functional linear regression. Journal of the American Statistical Association 107(499), 1201–1216.
  • Chen et al. (2017) Chen, K., P. Delicado, and H.-G. Müller (2017). Modelling function-valued stochastic processes, with applications to fertility dynamics. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(1), 177–196.
  • Chen and Müller (2012) Chen, K. and H.-G. Müller (2012). Modeling repeated functional observations. Journal of the American Statistical Association 107(500), 1599–1609.
  • Chen and Jiang (2017) Chen, L.-H. and C.-R. Jiang (2017). Multi-dimensional functional principal component analysis. Statistics and Computing 27(5), 1181–1192.
  • Chu et al. (2014) Chu, J.-E., K.-J. Ha, J.-Y. Lee, B. Wang, B.-H. Kim, and C. E. Chung (2014). Future change of the Indian Ocean basin-wide and dipole modes in the CMIP5. Climate Dynamics 43(1-2), 535–551.
  • Deser et al. (2010) Deser, C., M. A. Alexander, S.-P. Xie, and A. S. Phillips (2010). Sea surface temperature variability: Patterns and mechanisms. Annual Review of Marine Science 2, 115–143.
  • Ferraty and Vieu (2006) Ferraty, F. and P. Vieu (2006). Nonparametric functional data analysis: theory and practice. Springer, New York.
  • Goldsmith et al. (2011) Goldsmith, J., J. Bobb, C. M. Crainiceanu, B. Caffo, and D. Reich (2011). Penalized functional regression. Journal of Computational and Graphical Statistics 20(4), 830–851.
  • Gu (2013) Gu, C. (2013). Smoothing Spline ANOVA Models (2nd ed.). New York: Springer.
  • Hackbusch (2012) Hackbusch, W. (2012). Tensor spaces and numerical tensor calculus, Volume 42. Springer Science & Business Media.
  • Halko et al. (2009) Halko, N., P.-G. Martinsson, and J. A. Tropp (2009). Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv preprint arXiv:0909.4061.
  • Hall and Vial (2006) Hall, P. and C. Vial (2006). Assessing the finite dimensionality of functional data. Journal of the Royal Statistical Society: Series B 68(4), 689–705.
  • Horváth and Kokoszka (2012) Horváth, L. and P. Kokoszka (2012). Inference for functional data with applications, Volume 200. Springer, New York.
  • Hsing and Eubank (2015) Hsing, T. and R. Eubank (2015). Theoretical foundations of functional data analysis, with an introduction to linear operators. John Wiley & Sons.
  • Huang et al. (2009) Huang, J. Z., H. Shen, and A. Buja (2009). The analysis of two-way functional data using two-way regularized singular value decompositions. Journal of the American Statistical Association 104(488), 1609–1620.
  • James et al. (2000) James, G. M., T. J. Hastie, and C. A. Sugar (2000). Principal component models for sparse functional data. Biometrika 87(3), 587–602.
  • Kadkhodaie et al. (2015) Kadkhodaie, M., K. Christakopoulou, M. Sanjabi, and A. Banerjee (2015). Accelerated alternating direction method of multipliers. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp.  497–506. ACM.
  • Kokoszka and Reimherr (2017) Kokoszka, P. and M. Reimherr (2017). Introduction to functional data analysis. CRC Press.
  • Koltchinskii (2011) Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d’Eté de Probabilités de Saint-Flour XXXVIII-2008, Volume 2033. Springer Science & Business Media.
  • Li and Song (2017) Li, B. and J. Song (2017). Nonlinear sufficient dimension reduction for functional data. The Annals of Statistics 45(3), 1059–1095.
  • Li and Hsing (2010) Li, Y. and T. Hsing (2010). Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. The Annals of Statistics 38(6), 3321–3351.
  • Liebl (2019) Liebl, D. (2019). Inference for sparse and dense functional data with covariate adjustments. Journal of Multivariate Analysis 170, 315–335.
  • Lynch and Chen (2018) Lynch, B. and K. Chen (2018). A test of weak separability for multi-way functional data, with application to brain connectivity studies. Biometrika 105(4), 815–831.
  • Mazumder et al. (2010) Mazumder, R., T. Hastie, and R. Tibshirani (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research 11, 2287–2322.
  • Mercer (1909) Mercer, J. (1909). XVI. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A 209(441-458), 415–446.
  • Park and Staicu (2015) Park, S. Y. and A.-M. Staicu (2015). Longitudinal functional data analysis. Stat 4(1), 212–226.
  • Paul and Peng (2009) Paul, D. and J. Peng (2009). Consistency of restricted maximum likelihood estimators of principal components. The Annals of Statistics 37(3), 1229–1271.
  • Pearce and Wand (2006) Pearce, N. D. and M. P. Wand (2006). Penalized splines and reproducing kernel methods. The American Statistician 60(3), 233–240.
  • Poskitt and Sengarapillai (2013) Poskitt, D. S. and A. Sengarapillai (2013). Description length and dimensionality reduction in functional data analysis. Computational Statistics & Data Analysis 58, 98–113.
  • Ramsay and Silverman (2005) Ramsay, J. and B. Silverman (2005). Functional data analysis. Springer, New York.
  • Raskutti et al. (2012) Raskutti, G., M. J. Wainwright, and B. Yu (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. The Journal of Machine Learning Research 13, 389–427.
  • Reimherr et al. (2018) Reimherr, M., B. Sriperumbudur, and B. Taoufik (2018). Optimal prediction for additive function-on-function regression. Electronic Journal of Statistics 12(2), 4571–4601.
  • Rice and Silverman (1991) Rice, J. A. and B. W. Silverman (1991). Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological) 53(1), 233–243.
  • Shamshoian et al. (2019) Shamshoian, J., D. Senturk, S. Jeste, and D. Telesca (2019). Bayesian analysis of multidimensional functional data. arXiv preprint arXiv:1909.08763.
  • Shinoda et al. (2004) Shinoda, T., H. H. Hendon, and M. A. Alexander (2004). Surface and subsurface dipole variability in the Indian Ocean and its relation with ENSO. Deep Sea Research Part I: Oceanographic Research Papers 51(5), 619–635.
  • Stuecker et al. (2017) Stuecker, M. F., A. Timmermann, F.-F. Jin, Y. Chikamoto, W. Zhang, A. T. Wittenberg, E. Widiasih, and S. Zhao (2017). Revisiting ENSO/Indian Ocean Dipole phase relationships. Geophysical Research Letters 44(5), 2481–2492.
  • Sun et al. (2018) Sun, X., P. Du, X. Wang, and P. Ma (2018). Optimal penalized function-on-function regression under a reproducing kernel Hilbert space framework. Journal of the American Statistical Association 113(524), 1601–1611.
  • Tucker (1966) Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika 31(3), 279–311.
  • Ummenhofer et al. (2009) Ummenhofer, C. C., M. H. England, P. C. McIntosh, G. A. Meyers, M. J. Pook, J. S. Risbey, A. S. Gupta, and A. S. Taschetto (2009). What causes southeast Australia’s worst droughts? Geophysical Research Letters 36(4).
  • Wahba (1990) Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.
  • Wang and Huang (2017) Wang, W.-T. and H.-C. Huang (2017). Regularized principal component analysis for spatial data. Journal of Computational and Graphical Statistics 26(1), 14–25.
  • Wong et al. (2019) Wong, R. K. W., Y. Li, and Z. Zhu (2019). Partially linear functional additive models for multivariate functional data. Journal of the American Statistical Association 114(525), 406–418.
  • Wong and Zhang (2019) Wong, R. K. W. and X. Zhang (2019). Nonparametric operator-regularized covariance function estimation for functional data. Computational Statistics & Data Analysis 131, 131–144.
  • Xiao et al. (2013) Xiao, L., Y. Li, and D. Ruppert (2013). Fast bivariate p-splines: the sandwich smoother. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(3), 577–599.
  • Yao et al. (2005) Yao, F., H.-G. Müller, and J.-L. Wang (2005). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100(470), 577–590.
  • Yuan and Cai (2010) Yuan, M. and T. T. Cai (2010). A reproducing kernel Hilbert space approach to functional linear regression. The Annals of Statistics 38(6), 3412–3444.
  • Zhang et al. (2013) Zhang, L., H. Shen, and J. Z. Huang (2013). Robust regularized singular value decomposition with application to mortality data. The Annals of Applied Statistics 7(3), 1540–1561.
  • Zhang and Wang (2016) Zhang, X. and J.-L. Wang (2016). From sparse to dense functional data and beyond. The Annals of Statistics 44(5), 2281–2321.
  • Zhou and Pan (2014) Zhou, L. and H. Pan (2014). Principal component analysis of two-dimensional functional data. Journal of Computational and Graphical Statistics 23(3), 779–801.
  • Zhu et al. (2014) Zhu, H., F. Yao, and H. H. Zhang (2014). Structured functional additive regression in reproducing kernel Hilbert spaces. Journal of the Royal Statistical Society: Series B 76(3), 581–603.
  • Zipunnikov et al. (2011) Zipunnikov, V., B. Caffo, D. M. Yousem, C. Davatzikos, B. S. Schwartz, and C. Crainiceanu (2011). Multilevel functional principal component analysis for high-dimensional data. Journal of Computational and Graphical Statistics 20(4), 852–873.