
Matrix Autoregressive Model with Vector Time Series Covariates for Spatio-Temporal Data

Hu Sun
Department of Statistics, University of Michigan, Ann Arbor
Zuofeng Shang
Department of Mathematical Sciences, New Jersey Institute of Technology
and
Yang Chen
Department of Statistics, University of Michigan, Ann Arbor
Abstract

We develop a new methodology for forecasting matrix-valued time series with historical matrix data and auxiliary vector time series data. We focus on a time series of matrices defined on a static 2-D spatial grid and an auxiliary time series of non-spatial vectors. The proposed model, Matrix AutoRegression with Auxiliary Covariates (MARAC), contains an autoregressive component for the historical matrix predictors and an additive component that maps the auxiliary vector predictors to a matrix response via tensor-vector product. The autoregressive component adopts a bi-linear transformation framework following Chen et al. (2021), significantly reducing the number of parameters. The auxiliary component posits that the tensor coefficient, which maps non-spatial predictors to a spatial response, contains slices of spatially smooth matrix coefficients that are discrete evaluations of smooth functions from a Reproducing Kernel Hilbert Space (RKHS). We propose to estimate the model parameters under a penalized maximum likelihood estimation framework coupled with an alternating minimization algorithm. We establish the joint asymptotics of the autoregressive and tensor parameters under fixed and high-dimensional regimes. Extensive simulations and a geophysical application for forecasting the global Total Electron Content (TEC) are conducted to validate the performance of MARAC.


Keywords: Auxiliary covariates, matrix autoregression, reproducing kernel Hilbert space (RKHS), spatio-temporal forecast, tensor data model

1 Introduction

Matrix-valued time series data have received increasing attention in multiple scientific fields, such as economics (Wang et al., 2019), geophysics (Sun et al., 2022), and environmental science (Dong et al., 2020), where scientists are interested in modeling the joint dynamics of data observed on a 2-D grid across time. This paper focuses on matrix-valued data defined on a 2-D spatial grid that contains the geographical information of the individual observations. As a concrete example, we visualize the global Total Electron Content (TEC) distribution in Figure 1. TEC is the density of electrons in the Earth’s ionosphere along the vertical pathway connecting a radio transmitter and a ground-based receiver. An accurate prediction of the global TEC is critical since it can foretell the impact of space weather on the positioning, navigation, and timing (PNT) service (Wang et al., 2021; Younas et al., 2022). Each image in panels (A)-(C) is a 71\times 73 matrix, distributed on a spatial grid with a 2.5^{\circ}-latitude-by-5^{\circ}-longitude resolution.

Figure 1: An example of matrix time series with auxiliary vector time series. Panels (A)-(C) show the global Total Electron Content (TEC) distribution at three timestamps on the latitude-local-time grid (source: the IGS TEC database (Hernández-Pajares et al., 2009)). Panel (D) plots the auxiliary Sym-H index time series, which measures the impact of solar eruptions on Earth. We highlight the time of panels (A)-(C) in (D) with arrows.

The matrix-valued time series, such as the TEC time series, is often accompanied by an auxiliary vector time series that measures the same object, such as the Earth’s ionosphere, through a different data modality. In panel (D) of Figure 1, we plot the global SYM-H index, which measures the geomagnetic activity caused by solar eruptions that can ultimately impact the global TEC distribution. These non-spatial auxiliary covariates carry additional information related to the dynamics of the matrix time series data.

In this paper, we investigate the problem of forecasting future matrix data jointly with the historical matrices and the vector time series covariates. There are two major challenges in this modeling context. From the perspective of building a matrix-variate regression model, we need to integrate the information of predictors with non-uniform modes, namely both matrices and vectors. Adding the auxiliary vector covariates benefits the prediction and enables domain scientists to understand the interplay between different data modalities but at the same time, it complicates the modeling and the subsequent theoretical analysis. From the perspective of spatio-temporal analysis (Cressie and Wikle, 2015), we need to properly leverage the spatial information of the data and transform the classical spatial statistics framework to accommodate the grid geometry of matrix-valued data. In the remainder of this section, we briefly review the related literature that can shed light on these challenges and then summarize our unique contributions.

A naive yet straightforward prediction model is to vectorize the matrices and make predictions via Vector Autoregression (VAR) (Stock and Watson, 2001). In this approach, the auxiliary vector covariates can be incorporated easily by concatenating them with the vectorized matrix predictors. However, vectorizing matrix data loses spatial information and requires a large number of parameters given the high dimensionality of the data. To avoid vectorizing the matrix data, scalar-on-tensor regression (Zhou et al., 2013; Guhaniyogi et al., 2017; Li et al., 2018; Papadogeorgou et al., 2021) tackles the problem by using matrix predictors directly. However, these models are built for scalar responses, while in our setting we are dealing with matrix responses. Dividing the matrix response into individual scalar responses and fitting scalar-on-tensor regressions still requires a significant number of parameters and, more importantly, fails to take the spatial information of the response into account.

The statistical framework that can incorporate matrices as both predictors and response is the tensor-on-tensor regression (Lock, 2018; Liu et al., 2020; Luo and Zhang, 2022) and more specifically for time series data, the matrix/tensor autoregression (Chen et al., 2021; Li and Xiao, 2021; Hsu et al., 2021; Wang et al., 2024). The matrix/tensor predictors are mapped to matrix/tensor responses via multi-linear transformations that greatly reduce the number of parameters. Our work builds on this framework and incorporates the non-spatial vector predictors at the same time.

To incorporate the vector predictor in the same model, we need to map vector predictors to matrix responses. Tensor-on-scalar regression (Rabusseau and Kadri, 2016; Sun and Li, 2017; Li and Zhang, 2017; Guha and Guhaniyogi, 2021) illustrates a way of mapping low-order scalar/vector predictors to high-order matrix/tensor responses via taking the tensor-vector product of the predictor with a high-order tensor coefficient. In this paper, we take a similar approach and introduce a 3-D tensor coefficient for the vector predictors such that our model can take predictors with non-uniform modes, which is a key distinction of our model compared to existing works.

The other distinction is that our model leverages the spatial information of the matrix response. A key assumption in our model is that the vector predictor has similar predictive effects on neighboring locations in the matrix response, which is equivalent to saying that the tensor coefficient is spatially smooth. Typically, the estimation of spatially smooth tensor coefficients in such regression models is done by adding a total-variation (TV) penalty (Wang et al., 2017; Shen et al., 2022; Sun et al., 2023) on the unknown tensor. The TV penalty leads to piecewise smooth estimators with sharp edges and enables feature selection. However, estimation with the TV penalty requires solving non-convex optimization problems, making the subsequent theoretical analysis difficult. In our model, we take a simpler approach by assuming that the tensor coefficients are discrete evaluations of functional parameters from a Reproducing Kernel Hilbert Space (RKHS). Such a kernel method has been widely used in scalar-on-image regression (Kang et al., 2018), where the regression coefficients of the image predictor are constrained to be spatially smooth.

We facilitate the estimation of the unknown functional parameters with the functional norm penalty. Functional norm penalties have been widely used for estimating smooth functions in classic semi/non-parametric learning in which data variables are either scalar/vector-valued (see Hastie et al., 2009; Gu, 2013; Yuan and Cai, 2010; Cai and Yuan, 2012; Shang and Cheng, 2013, 2015; Cheng and Shang, 2015; Yang et al., 2020). To the best of our knowledge, the present article is the first to consider functional norm penalty for tensor coefficient estimation in a matrix autoregressive setting.

To summarize, our paper makes two major contributions. Firstly, we build a unified matrix autoregression framework for spatio-temporal data that incorporates lower-order scalar/vector time series covariates. Such a framework is strongly motivated by applications in which domain scientists seek to integrate spatial and non-spatial data for prediction and inference. The framework also bridges regression methodologies with tensor predictors and responses of non-uniform modes, making the theoretical investigation itself an interesting topic. Secondly, we propose to estimate the coefficients of the auxiliary covariates, together with the autoregressive coefficients, in a single penalized maximum likelihood estimation (MLE) framework with an RKHS functional norm penalty. The RKHS framework builds spatial continuity into the regression coefficients. We establish the joint asymptotics of the autoregressive coefficients and the functional parameters under fixed and high matrix dimensionality regimes, propose an efficient alternating minimization algorithm for estimation, and validate the method with extensive simulations and a real data application.

The remainder of the paper is organized as follows. We introduce our model formally in Section 2 and provide model interpretations and comparisons in sufficient detail. Section 3 introduces the penalized MLE framework and describes the exact and approximate estimation algorithms. Large sample properties of the estimators under fixed and high matrix dimensionality are established in Section 4. Section 5 provides extensive simulation studies for validating the consistency of the estimators, demonstrating BIC-based model selection results, and comparing our method with various competitors. We apply our method to the global TEC data in Section 6 and make conclusions in Section 7. Technical proofs and additional details of the simulation and real data applications are deferred to supplemental materials.

2 Model

2.1 Notation

We adopt the following notation throughout the article. We use calligraphic bold-face letters (e.g. \mathbfcal{X},\mathbfcal{G}) for tensors with at least three modes, uppercase bold-face letters (e.g. \mathbf{X},\mathbf{G}) for matrices, lowercase bold-face letters (e.g. \mathbf{x},\mathbf{z}) for vectors, and blackboard bold-face letters for sets (e.g. \mathbb{R},\mathbb{S}). To subscript any tensor/matrix/vector, we use square brackets with subscripts such as [\mathbfcal{G}]_{ijd}, [\mathbf{z}_{t}]_{d}, [\mathbf{X}_{t}]_{ij}, and we keep the subscript t inside the square bracket to index time. Any fibers and slices of a tensor are subscripted with colons such as [\mathbfcal{G}]_{ij:}, [\mathbfcal{G}]_{::d}, and thus any row and column of a matrix is denoted as [\mathbf{X}_{t}]_{i:} and [\mathbf{X}_{t}]_{:j}. If the slices of a tensor/matrix are based on the last mode, such as [\mathbfcal{G}]_{::d} and [\mathbf{X}_{t}]_{:j}, we will often omit the colons and write [\mathbfcal{G}]_{d} and [\mathbf{X}_{t}]_{j} for brevity. For any tensor \mathbfcal{X}, we use \mathrm{\mathbf{vec}}\left(\mathbfcal{X}\right) to denote the vectorized tensor. For any two tensors \mathbfcal{X},\mathbfcal{Y} of identical size, we define their inner product as \langle\mathbfcal{X},\mathbfcal{Y}\rangle=\mathrm{\mathbf{vec}}\left(\mathbfcal{X}\right)^{\top}\mathrm{\mathbf{vec}}\left(\mathbfcal{Y}\right), and we use \|\mathbfcal{X}\|_{\mathrm{F}}=\sqrt{\langle\mathbfcal{X},\mathbfcal{X}\rangle} to denote the Frobenius norm of a tensor.

Following Li and Zhang (2017), the tensor-vector product between a tensor \mathbfcal{G} of size d_{1}\times\cdots\times d_{K+1} and a vector \mathbf{z}\in\mathbb{R}^{d_{K+1}}, denoted as \mathbfcal{G}\bar{\times}_{(K+1)}\mathbf{z}, or simply \mathbfcal{G}\bar{\times}\mathbf{z}, is a tensor of size d_{1}\times\cdots\times d_{K} with [\mathbfcal{G}\bar{\times}\mathbf{z}]_{i_{1}\ldots i_{K}}=\sum_{i_{K+1}}[\mathbfcal{G}]_{i_{1}\ldots i_{K}i_{K+1}}\cdot[\mathbf{z}]_{i_{K+1}}. For a tensor \mathbfcal{X}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}, we use \mathbf{X}_{(k)}\in\mathbb{R}^{d_{k}\times\prod_{m\neq k}d_{m}} to denote its k-mode matricization. The Kronecker product between matrices is denoted by \mathbf{A}\otimes\mathbf{B} and the trace of a square matrix \mathbf{A} is denoted as \text{tr}\left(\mathbf{A}\right). We use \bar{\rho}(\cdot),\underaccent{\bar}{\rho}(\cdot),\rho_{i}(\cdot) to denote the maximum, minimum, and i^{\rm{th}} largest eigenvalue of a matrix. We use \text{diag}(\mathbf{C}_{1},\ldots,\mathbf{C}_{d}) to denote a block diagonal matrix with \mathbf{C}_{1},\ldots,\mathbf{C}_{d} along the diagonal. More details on the related tensor algebra can be found in Kolda and Bader (2009).
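
For readers implementing these operations, the following minimal NumPy sketch (illustrative, not part of the paper; all array names are hypothetical) shows the tensor-vector product \mathbfcal{G}\bar{\times}\mathbf{z} and the k-mode matricization under the vec(\cdot) convention used here.

import numpy as np

M, N, D = 4, 5, 3
G = np.random.randn(M, N, D)        # a 3-mode tensor coefficient
z = np.random.randn(D)              # an auxiliary vector

# tensor-vector product along the last mode: an M x N matrix with
# [G xbar z]_{ij} = sum_d [G]_{ijd} [z]_d
G_bar_z = np.tensordot(G, z, axes=([2], [0]))

# k-mode matricization X_(k): move mode k to the front and flatten the rest
# in column-major order, matching the vec(.) convention u = i + (j-1)M.
def matricize(T, k):
    return np.reshape(np.moveaxis(T, k, 0), (T.shape[k], -1), order="F")

G3 = matricize(G, 2)                # mode-3 unfolding, shape (D, M*N)
# consistency check: vec(G xbar z) equals G_(3)^T z
assert np.allclose(G_bar_z.flatten(order="F"), G3.T @ z)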

For the matrix time series \mathbf{X}_{t}\in\mathbb{R}^{M\times N} in our modeling context, we assume that all S=MN grid locations are points on an M\times N grid within the domain \bar{\mathbb{S}}=[0,1]^{2}. The collection of all the spatial locations is denoted as \mathbb{S} and the element of \mathbb{S} corresponding to the (i,j)^{\rm{th}} entry of the matrix is denoted as s_{ij}. We will often index the (i,j)^{\rm{th}} entry of the matrix \mathbf{X}_{t} with a single index u=i+(j-1)M and thus s_{ij} will be denoted as s_{u}. We use [N] to denote the index set, i.e. [N]=\{1,2,\ldots,N\}. Finally, we use k(\cdot,\cdot):\bar{\mathbb{S}}\times\bar{\mathbb{S}}\mapsto\mathbb{R} to denote a spatial kernel function and \mathbb{H}_{k} to denote the corresponding Reproducing Kernel Hilbert Space (RKHS).

2.2 Matrix AutoRegression with Auxiliary Covariates (MARAC)

Let \{\mathbf{X}_{t},\mathbf{z}_{t}\}_{t=1}^{T} be a joint observation of the matrix and the auxiliary vector time series with \mathbf{X}_{t}\in\mathbb{R}^{M\times N},\mathbf{z}_{t}\in\mathbb{R}^{D}. To forecast \mathbf{X}_{t}, we propose our Matrix AutoRegression with Auxiliary Covariates, or MARAC, as:

\mathbf{X}_{t}=\sum_{p=1}^{P}\mathbf{A}_{p}\mathbf{X}_{t-p}\mathbf{B}_{p}^{\top}+\sum_{q=1}^{Q}\mathbfcal{G}_{q}\bar{\times}\mathbf{z}_{t-q}+\mathbf{E}_{t}, (1)

where \mathbf{A}_{p}\in\mathbb{R}^{M\times M},\mathbf{B}_{p}\in\mathbb{R}^{N\times N} are the autoregressive coefficients for the lag-p matrix predictor, \mathbfcal{G}_{q}\in\mathbb{R}^{M\times N\times D} is the tensor coefficient for the lag-q vector predictor, and \mathbf{E}_{t} is a noise term whose distribution will be specified later. The lag parameters P,Q are hyper-parameters of the model and we often refer to model (1) as MARAC(P,Q).

Based on model (1), for the (i,j)^{\rm{th}} element of \mathbf{X}_{t}, the MARAC(P,Q) model specifies:

[\mathbf{X}_{t}]_{ij}=\sum_{p=1}^{P}\left\langle[\mathbf{A}_{p}]_{i:}^{\top}[\mathbf{B}_{p}]_{j:},\mathbf{X}_{t-p}\right\rangle+\sum_{q=1}^{Q}[\mathbfcal{G}_{q}]_{ij:}^{\top}\mathbf{z}_{t-q}+[\mathbf{E}_{t}]_{ij}, (2)

where each autoregressive term is associated with a rank-1 coefficient matrix determined by the specific rows of \mathbf{A}_{p},\mathbf{B}_{p}, and each non-spatial auxiliary covariate is associated with a location-specific coefficient vector, i.e. [\mathbfcal{G}_{q}]_{ij:}. It is now evident from (2) that the auxiliary vector covariates enter the model via an elementwise linear model. The autoregressive term uses \mathbf{A}_{p},\mathbf{B}_{p} to transform each lag-p predictor in a bi-linear form. Such a bi-linear transformation greatly reduces the total number of parameters: each lagged predictor, which would require M^{2}N^{2} parameters under an unstructured linear map, now requires only M^{2}+N^{2} parameters.
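
A minimal sketch (with hypothetical names and randomly drawn coefficients, not the authors' code) of the one-step conditional mean implied by (1)-(2), together with the parameter-count comparison discussed above:

import numpy as np

M, N, D, P, Q = 4, 5, 3, 2, 1
A = [np.random.randn(M, M) * 0.1 for _ in range(P)]   # A_p, M x M
B = [np.random.randn(N, N) * 0.1 for _ in range(P)]   # B_p, N x N
G = [np.random.randn(M, N, D) for _ in range(Q)]      # G_q, M x N x D

def marac_mean(X_lags, z_lags):
    """Conditional mean of X_t given lagged matrices and auxiliary vectors."""
    out = sum(A[p] @ X_lags[p] @ B[p].T for p in range(P))
    out += sum(np.tensordot(G[q], z_lags[q], axes=([2], [0])) for q in range(Q))
    return out

X_lags = [np.random.randn(M, N) for _ in range(P)]    # X_{t-1}, ..., X_{t-P}
z_lags = [np.random.randn(D) for _ in range(Q)]       # z_{t-1}, ..., z_{t-Q}
X_hat = marac_mean(X_lags, z_lags)                    # M x N forecast

# parameters per autoregressive lag: M^2 + N^2 under the bilinear form,
# versus M^2 * N^2 for an unstructured transition matrix on vec(X_t)
print(M**2 + N**2, "vs", M**2 * N**2)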

For the tensor coefficient \mathbfcal{G}_{q}, we assume that it is spatially smooth. More specifically, we assume that [\mathbfcal{G}_{q}]_{ijd} and [\mathbfcal{G}_{q}]_{uvd} are similar if s_{ij},s_{uv} are spatially close. Formally, we assume that each [\mathbfcal{G}_{q}]_{d}, i.e. the coefficient matrix for the d^{\rm{th}} covariate at lag q, is a discrete evaluation of a function g_{q,d}(\cdot):[0,1]^{2}\mapsto\mathbb{R} on \mathbb{S}. Furthermore, each g_{q,d}(\cdot) comes from an RKHS \mathbb{H}_{k} endowed with the spatial kernel function k(\cdot,\cdot). The spatial kernel function specifies the spatial smoothness of the functional parameters g_{q,d}(\cdot) and thus of the tensor coefficient \mathbfcal{G}_{q}.

An alternative formulation for \mathbfcal{G}_{q} would be a low-rank form (Li and Zhang, 2017). We choose local smoothness over low-rankness to explicitly model the spatial smoothness of the coefficients and to avoid tuning the tensor rank of \mathbfcal{G}_{q}. We leave the low-rank model for future research and focus on the RKHS framework in the current paper.

Finally, for the additive noise term 𝐄t\mathbf{E}_{t}, we assume that it is i.i.d. from a multivariate normal distribution with a separable Kronecker-product covariance:

\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)\mathrel{\overset{\mathrm{i.i.d.}}{\sim}}\mathcal{N}\left(\mathbf{0},\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}\right),\quad t\in[T] (3)

where \boldsymbol{\Sigma}_{r}\in\mathbb{R}^{M\times M},\boldsymbol{\Sigma}_{c}\in\mathbb{R}^{N\times N} are the row/column covariance components. Such a Kronecker-product covariance is commonly seen in covariance models for multi-way data (Hoff, 2011; Fosdick and Hoff, 2014) and has the merit of reducing the number of parameters significantly.
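
To make the separable covariance in (3) concrete, a standard way to draw such noise is the matrix-normal construction \mathbf{E}_{t}=\mathbf{L}_{r}\mathbf{Z}\mathbf{L}_{c}^{\top} with \mathbf{Z} having i.i.d. standard normal entries; the sketch below (an assumed helper, not from the paper) relies on this identity.

import numpy as np

def sample_kron_noise(Sigma_r, Sigma_c, rng):
    # if Sigma_r = L_r L_r^T and Sigma_c = L_c L_c^T, then E = L_r Z L_c^T
    # satisfies vec(E) ~ N(0, Sigma_c kron Sigma_r)
    M, N = Sigma_r.shape[0], Sigma_c.shape[0]
    L_r = np.linalg.cholesky(Sigma_r)
    L_c = np.linalg.cholesky(Sigma_c)
    Z = rng.standard_normal((M, N))
    return L_r @ Z @ L_c.T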

Compared to existing models that can only deal with either matrix or vector predictors, our model (1) can incorporate predictors with non-uniform modes. If one redefines \mathbf{E}_{t} in our model as \sum_{q=1}^{Q}\mathbfcal{G}_{q}\bar{\times}\mathbf{z}_{t-q}+\mathbf{E}_{t}, i.e. all terms except the autoregressive term, then our model ends up specifying:

\mathrm{Cov}(\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right),\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t^{\prime}}\right))=\mathbbm{1}_{\{t=t^{\prime}\}}\cdot\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}+\mathbf{F}\mathbf{M}\mathbf{F}^{\top},
\mathbf{F}=[\left(\mathbfcal{G}_{1}\right)_{(3)}^{\top}:\cdots:\left(\mathbfcal{G}_{Q}\right)_{(3)}^{\top}],\quad\mathbf{M}=\left[\mathrm{Cov}(\mathbf{z}_{t-q_{1}},\mathbf{z}_{t^{\prime}-q_{2}})\right]_{q_{1},q_{2}\in[Q]},

where (\mathbfcal{G}_{q})_{(3)} is the mode-3 matricization of \mathbfcal{G}_{q}, which we denote by \mathbf{G}_{q} for the rest of the paper. This new formulation reveals how our model differs from other autoregression models with matrix predictors. The covariance of \mathbf{E}_{t},\mathbf{E}_{t^{\prime}} in our model contains a separable covariance matrix \boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r} that is based on the matrix grid geometry, a locally smooth coefficient matrix \mathbf{F} that captures the local spatial dependency, and an auto-covariance matrix \mathbf{M} that captures the temporal dependency. Consequently, entries of \mathbf{E}_{t} are more correlated if they are spatially/temporally close or if they share the same row/column index, making the model more flexible for spatial data distributed on a matrix grid.

As a comparison, in the kriging framework (Cressie, 1986), the covariance of \mathbf{E}_{t},\mathbf{E}_{t^{\prime}} is characterized by a spatio-temporal kernel that captures the dependencies among spatial and temporal neighbors. Such a kernel method can account for the local dependency but not for the spatial dependency induced by the matrix grid geometry. In the matrix autoregression model (Chen et al., 2021), the authors consider neither the local spatial dependencies among entries of \mathbf{E}_{t} nor the temporal dependency across different t. In Hsu et al. (2021), the matrix autoregression model is generalized to spatial data via fixed-rank co-kriging (FRC) (Cressie and Johannesson, 2008) with \mathrm{Cov}(\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right),\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t^{\prime}}\right))=\mathbbm{1}_{\{t=t^{\prime}\}}\cdot\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}+\mathbf{F}\mathbf{M}\mathbf{F}^{\top}, where \mathbf{M} is a k\times k coefficient matrix and \mathbf{F} is a pre-specified MN\times k spatial basis matrix. Such a co-kriging framework accounts for neither the temporal dependency of the noise nor the auxiliary covariates. Our model generalizes these previous works to allow for temporally dependent noise with both local and grid-based spatial dependency.

The combination of (1) and (3) specifies the complete MARAC(P,Q)(P,Q) model. Vectorizing both sides of (1) yields the vectorized MARAC(P,Q)(P,Q) model:

\mathbf{x}_{t}=\sum_{p=1}^{P}\left(\mathbf{B}_{p}\otimes\mathbf{A}_{p}\right)\mathbf{x}_{t-p}+\sum_{q=1}^{Q}\mathbf{G}_{q}^{\top}\mathbf{z}_{t-q}+\mathbf{e}_{t},\quad\mathbf{e}_{t}\mathrel{\overset{\mathrm{i.i.d.}}{\sim}}\mathcal{N}\left(\mathbf{0},\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}\right) (4)

where \mathbf{x}_{t}=\mathrm{\mathbf{vec}}\left(\mathbf{X}_{t}\right),\mathbf{e}_{t}=\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right), and recall that \mathbf{G}_{q}=(\mathbfcal{G}_{q})_{(3)}. It is now more evident that the Kronecker-product structure of the autoregressive coefficient matrix and the noise covariance matrix greatly reduces the number of parameters, making the regression estimation feasible given finite samples. The spatially smooth structure of \mathbf{G}_{q} leverages the spatial information of the data. In the next section, we discuss the estimation algorithm for the MARAC model parameters.

3 Estimating Algorithm

In this section, we discuss the parameter estimation for the MARAC(P,Q)(P,Q) model (1). We first propose a penalized maximum likelihood estimator (MLE) in Section 3.1 for exact parameter estimation. Then, we propose an approximation to the penalized MLE in Section 3.2 for faster computation when dealing with high-dimensional matrix data. Finally, in Section 3.3, we outline the model selection criterion for selecting the lag hyper-parameters whose consistency will be validated empirically in Section 5.

3.1 Penalized Maximum Likelihood Estimation (MLE)

To estimate the parameters of the MARAC(P,Q)(P,Q) model, which we denote collectively as 𝚯\boldsymbol{\Theta}, we propose a penalized maximum likelihood estimation (MLE) approach. Following the distribution assumption on 𝐄t\mathbf{E}_{t} in (3), we can write the negative log-likelihood of {𝐗t}t=1T\{\mathbf{X}_{t}\}_{t=1}^{T} with a squared RKHS functional norm penalty, after dropping the constants, as:

𝔏λ(𝚯)=1Tt[T](𝐗t|𝐗t1:P,𝐳t1:Q;𝚯)+λ2q[Q]d[D]gq,dk2,\mathfrak{L}_{\lambda}(\boldsymbol{\Theta})=-\frac{1}{T}\sum_{t\in[T]}\ell\left(\mathbf{X}_{t}|\mathbf{X}_{t-1:P},\mathbf{z}_{t-1:Q};\boldsymbol{\Theta}\right)+\frac{\lambda}{2}\sum_{q\in[Q]}\sum_{d\in[D]}\|g_{q,d}\|_{\mathbb{H}_{k}}^{2}, (5)

where ()\ell(\cdot) is the conditional log-likelihood of 𝐗t\mathbf{X}_{t}:

(𝐗t|𝐗t1:P,𝐳t1:Q;𝚯)=12log|𝚺c𝚺r|12𝐫t(𝚺c1𝚺r1)𝐫t,\ell\left(\mathbf{X}_{t}|\mathbf{X}_{t-1:P},\mathbf{z}_{t-1:Q};\boldsymbol{\Theta}\right)=-\frac{1}{2}\log|\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}|-\frac{1}{2}\mathbf{r}_{t}^{\top}\left(\boldsymbol{\Sigma}_{c}^{-1}\otimes\boldsymbol{\Sigma}_{r}^{-1}\right)\mathbf{r}_{t}, (6)

and 𝐫t=𝐱tp(𝐁p𝐀p)𝐱tpq𝐆q𝐳tq\mathbf{r}_{t}=\mathbf{x}_{t}-\sum_{p}(\mathbf{B}_{p}\otimes\mathbf{A}_{p})\mathbf{x}_{t-p}-\sum_{q}\mathbf{G}_{q}^{\top}\mathbf{z}_{t-q} is the vectorized residual at tt. To estimate the parameters, one needs to solve a constrained minimization problem:

min𝚯𝔏λ(𝚯),s.t. gq,d(sij)=[𝒢| for all |𝒮\min_{\boldsymbol{\Theta}}\mathfrak{L}_{\lambda}(\boldsymbol{\Theta}),\quad\text{s.t. }g_{q,d}(s_{ij})=[\mathbfcal{G}_{q}]_{ijd},\text{ for all }s_{ij}\in\mathbb{S}. (7)

We now define the functional norm penalty in (5) explicitly and derive a finite-dimensional equivalent of the optimization problem above. We assume that the spatial kernel function k(\cdot,\cdot) is continuous and square-integrable; thus it admits an eigen-decomposition by Mercer's Theorem (Williams and Rasmussen, 2006):

k(sij,suv)=r=1λrψr(sij)ψr(suv),sij,suv[0,1]2,k(s_{ij},s_{uv})=\sum_{r=1}^{\infty}\lambda_{r}\psi_{r}(s_{ij})\psi_{r}(s_{uv}),\quad s_{ij},s_{uv}\in[0,1]^{2}, (8)

where λ1λ2\lambda_{1}\geq\lambda_{2}\geq\ldots is a sequence of non-negative eigenvalues and ψ1,ψ2,\psi_{1},\psi_{2},\ldots is a set of orthonormal eigen-functions on [0,1]2[0,1]^{2}. The functional norm of function gg from the RKHS k\mathbb{H}_{k} endowed with kernel k(,)k(\cdot,\cdot) is defined as:

gk=r=1βr2λr,where g()=r=1βrψr(),\|g\|_{\mathbb{H}_{k}}=\sqrt{\sum_{r=1}^{\infty}\frac{\beta_{r}^{2}}{\lambda_{r}}},\qquad\text{where }g(\cdot)=\sum_{r=1}^{\infty}\beta_{r}\psi_{r}(\cdot), (9)

following van Zanten and van der Vaart (2008).

Given any λ>0\lambda>0 in (5), the generalized representer theorem (Schölkopf et al., 2001) suggests that the solution of the functional parameters, denoted as {g~q,d}q=1,d=1Q,D\{\widetilde{g}_{q,d}\}_{q=1,d=1}^{Q,D}, of the minimization problem (7), with all other parameters held fixed, is a linear combination of the representers {k(,s)}s𝕊\{k(\cdot,s)\}_{s\in\mathbb{S}} plus a linear combination of the basis functions {ϕ1,,ϕJ}\{\phi_{1},\ldots,\phi_{J}\} of the null space of k\mathbb{H}_{k}, i.e.,

g~q,d()=s𝕊γsk(,s)+j=1Jαjϕj(),ϕjk=0,\widetilde{g}_{q,d}(\cdot)=\sum_{s\in\mathbb{S}}\gamma_{s}k(\cdot,s)+\sum_{j=1}^{J}\alpha_{j}\phi_{j}(\cdot),\quad\|\phi_{j}\|_{\mathbb{H}_{k}}=0, (10)

where we omit the subscript (q,d)(q,d) for the coefficient γs,αj\gamma_{s},\alpha_{j} for brevity but they are essentially different for each (q,d)(q,d). We assume that the null space of k\mathbb{H}_{k} contains only the zero function for the remainder of the paper. As a consequence of (10), the minimization problem in (7) can be reduced to a finite-dimensional Kernel Ridge Regression (KRR) problem. We summarize the discussion above in the proposition below:

Proposition 1

If λ>0\lambda>0, the constrained minimization problem in (7) is equivalent to the following unconstrained kernel ridge regression problem:

min𝚯{12log|𝚺c𝚺r|+12Tt[T]𝐫t(𝚺c1𝚺r1)𝐫t+λ2q[Q]tr(𝚪q𝐊𝚪q)},\min_{\boldsymbol{\Theta}}\left\{\frac{1}{2}\log|\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}|+\frac{1}{2T}\sum_{t\in[T]}\mathbf{r}_{t}^{\top}\left(\boldsymbol{\Sigma}_{c}^{-1}\otimes\boldsymbol{\Sigma}_{r}^{-1}\right)\mathbf{r}_{t}+\frac{\lambda}{2}\sum_{q\in[Q]}\text{tr}\left(\boldsymbol{\Gamma}_{q}^{\top}\mathbf{K}\boldsymbol{\Gamma}_{q}\right)\right\}, (11)

where \mathbf{r}_{t}=\mathbf{x}_{t}-\sum_{p}(\mathbf{B}_{p}\otimes\mathbf{A}_{p})\mathbf{x}_{t-p}-\sum_{q}\mathbf{K}\boldsymbol{\Gamma}_{q}\mathbf{z}_{t-q} is the vectorized residual, \mathbf{K}\in\mathbb{R}^{MN\times MN} is the kernel Gram matrix with [\mathbf{K}]_{u_{1}u_{2}}=k(s_{i_{1}j_{1}},s_{i_{2}j_{2}}), s_{i_{l}j_{l}}\in\mathbb{S}, u_{l}=i_{l}+(j_{l}-1)M, l=1,2, and \boldsymbol{\Gamma}_{q}\in\mathbb{R}^{MN\times D} contains the coefficients of the representers, with [\boldsymbol{\Gamma}_{q}]_{ud} being the coefficient for the u^{\rm{th}} representer k(\cdot,s_{u}) and the d^{\rm{th}} auxiliary covariate at lag q.

We give the proof in the supplemental material. After introducing the functional norm penalty, the original tensor coefficient is converted to a linear combination of the representer functions through the relationship [\mathbfcal{G}_{q}]_{ijd}=\langle[\mathbf{K}]_{u:}^{\top},[\boldsymbol{\Gamma}_{q}]_{:d}\rangle, where u=i+(j-1)M.
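
A minimal sketch (illustrative; the squared-exponential kernel and grid coordinates are assumptions, not the paper's choices) of the objects appearing in Proposition 1: the Gram matrix \mathbf{K} over the M\times N grid and the map from \boldsymbol{\Gamma}_{q} to the tensor coefficient via [\mathbfcal{G}_{q}]_{ijd}=\langle[\mathbf{K}]_{u:}^{\top},[\boldsymbol{\Gamma}_{q}]_{:d}\rangle:

import numpy as np

M, N, D = 8, 10, 3
rows = (np.arange(M) + 0.5) / M                      # grid coordinates in [0, 1]
cols = (np.arange(N) + 0.5) / N
S = np.array([(r, c) for c in cols for r in rows])   # ordering u = i + (j-1)M

def k(s, t, ell=0.2):                                # assumed spatial kernel
    return np.exp(-np.sum((s - t) ** 2) / (2 * ell ** 2))

K = np.array([[k(su, sv) for sv in S] for su in S])  # MN x MN Gram matrix

Gamma_q = np.random.randn(M * N, D)                  # representer coefficients
G_q = (K @ Gamma_q).reshape(M, N, D, order="F")      # spatially smooth slices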

We solve the minimization problem in (11) with an alternating minimization algorithm (Attouch et al., 2013), where we update one block of parameters at a time while keeping the others fixed. We update the parameters in the order \mathbf{A}_{1}\rightarrow\mathbf{B}_{1}\rightarrow\cdots\rightarrow\mathbf{A}_{P}\rightarrow\mathbf{B}_{P}\rightarrow\boldsymbol{\Gamma}_{1}\rightarrow\cdots\rightarrow\boldsymbol{\Gamma}_{Q}\rightarrow\boldsymbol{\Sigma}_{r}\rightarrow\boldsymbol{\Sigma}_{c}\rightarrow\mathbf{A}_{1}\rightarrow\cdots. We choose the alternating minimization algorithm for its simplicity and efficiency. Each step conducts an exact minimization over one block of parameters, leading to a non-increasing sequence of objective values, which guarantees the convergence of the algorithm towards a local stationary point.

To solve the optimization problem in (11) for 𝐀p\mathbf{A}_{p} at the (l+1)th(l+1)^{\rm{th}} iteration, it suffices to solve the following least-square problem:

min𝐀p{t[T]tr(𝐗~t(𝐀p)(𝚺r(l))1𝐗~t(𝐀p)(𝚺c(l))1)},\min_{\mathbf{A}_{p}}\left\{\sum_{t\in[T]}\text{tr}\left(\widetilde{\mathbf{X}}_{t}(\mathbf{A}_{p})^{\top}\left(\boldsymbol{\Sigma}_{r}^{(l)}\right)^{-1}\widetilde{\mathbf{X}}_{t}(\mathbf{A}_{p})\left(\boldsymbol{\Sigma}_{c}^{(l)}\right)^{-1}\right)\right\}, (12)

where 𝐗~t(𝐀p)\widetilde{\mathbf{X}}_{t}(\mathbf{A}_{p}) is the residual matrix when predicting 𝐗t\mathbf{X}_{t}:

\widetilde{\mathbf{X}}_{t}(\mathbf{A}_{p})=\mathbf{X}_{t}-\sum_{p^{\prime}<p}\mathbf{A}_{p^{\prime}}^{(l+1)}\mathbf{X}_{t-p^{\prime}}\left(\mathbf{B}_{p^{\prime}}^{(l+1)}\right)^{\top}-\sum_{p^{\prime}>p}\mathbf{A}_{p^{\prime}}^{(l)}\mathbf{X}_{t-p^{\prime}}\left(\mathbf{B}_{p^{\prime}}^{(l)}\right)^{\top}-\sum_{q\in[Q]}\mathbfcal{G}_{q}^{(l)}\bar{\times}\mathbf{z}_{t-q}-\mathbf{A}_{p}\mathbf{X}_{t-p}\left(\mathbf{B}_{p}^{(l)}\right)^{\top}=\widetilde{\mathbf{X}}_{t,-p}-\mathbf{A}_{p}\mathbf{X}_{t-p}\left(\mathbf{B}_{p}^{(l)}\right)^{\top},

and we use \widetilde{\mathbf{X}}_{t,-p} to denote the partial residual excluding the term involving \mathbf{X}_{t-p}, and use \mathbfcal{G}_{q}^{(l)} to denote the tensor coefficient satisfying [\mathbfcal{G}_{q}^{(l)}]_{ijd}=\langle[\mathbf{K}]_{u:}^{\top},[\boldsymbol{\Gamma}_{q}^{(l)}]_{:d}\rangle, with u=i+(j-1)M. The superscript (l) represents the value at the l^{\rm{th}} iteration. To simplify the notation, we define \boldsymbol{\Phi}(\mathbf{A}_{t},\mathbf{B}_{t},\boldsymbol{\Sigma})=\sum_{t}\mathbf{A}_{t}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{B}_{t}, where \boldsymbol{\Sigma},\mathbf{A}_{t},\mathbf{B}_{t} are arbitrary matrices/vectors of conformal sizes, and we simply write \boldsymbol{\Phi}(\mathbf{A}_{t},\boldsymbol{\Sigma}) if \mathbf{A}_{t}=\mathbf{B}_{t}. Solving (12) yields the following updating formula for \mathbf{A}_{p}^{(l+1)}:

𝐀p(l+1)𝚽(𝐗~t,p,𝐁p(l)𝐗tp,𝚺c(l))𝚽(𝐁p(l)𝐗tp,𝚺c(l))1\mathbf{A}_{p}^{(l+1)}\leftarrow\boldsymbol{\Phi}\left(\widetilde{\mathbf{X}}_{t,-p}^{\top},\mathbf{B}_{p}^{(l)}\mathbf{X}_{t-p}^{\top},\boldsymbol{\Sigma}_{c}^{(l)}\right)\boldsymbol{\Phi}\left(\mathbf{B}_{p}^{(l)}\mathbf{X}_{t-p}^{\top},\boldsymbol{\Sigma}_{c}^{(l)}\right)^{-1} (13)

Similarly, we have the following updating formula for 𝐁p(l+1)\mathbf{B}_{p}^{(l+1)}:

𝐁p(l+1)𝚽(𝐗~t,p,𝐀p(l+1)𝐗tp,𝚺r(l))𝚽(𝐀p(l+1)𝐗tp,𝚺r(l))1\mathbf{B}_{p}^{(l+1)}\leftarrow\boldsymbol{\Phi}\left(\widetilde{\mathbf{X}}_{t,-p},\mathbf{A}_{p}^{(l+1)}\mathbf{X}_{t-p},\boldsymbol{\Sigma}_{r}^{(l)}\right)\boldsymbol{\Phi}\left(\mathbf{A}_{p}^{(l+1)}\mathbf{X}_{t-p},\boldsymbol{\Sigma}_{r}^{(l)}\right)^{-1} (14)
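
The updates (13)-(14) can be coded directly from the \boldsymbol{\Phi} notation; the sketch below (illustrative, not the authors' code; the partial residuals \widetilde{\mathbf{X}}_{t,-p} and lagged predictors \mathbf{X}_{t-p} are assumed precomputed and passed in as lists) is one possible implementation.

import numpy as np

def Phi(As, Bs, Sigma):
    """Phi(A_t, B_t, Sigma) = sum_t A_t^T Sigma^{-1} B_t."""
    Sinv = np.linalg.inv(Sigma)
    return sum(A.T @ Sinv @ B for A, B in zip(As, Bs))

def update_A(Xtil, Xlag, B, Sigma_c):
    # update (13): A_p <- Phi(Xtil^T, B X^T, Sigma_c) Phi(B X^T, Sigma_c)^{-1}
    lhs = Phi([X.T for X in Xtil], [B @ X.T for X in Xlag], Sigma_c)
    rhs = Phi([B @ X.T for X in Xlag], [B @ X.T for X in Xlag], Sigma_c)
    return lhs @ np.linalg.inv(rhs)

def update_B(Xtil, Xlag, A, Sigma_r):
    # update (14): B_p <- Phi(Xtil, A X, Sigma_r) Phi(A X, Sigma_r)^{-1}
    lhs = Phi(Xtil, [A @ X for X in Xlag], Sigma_r)
    rhs = Phi([A @ X for X in Xlag], [A @ X for X in Xlag], Sigma_r)
    return lhs @ np.linalg.inv(rhs)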

For updating 𝚪q\boldsymbol{\Gamma}_{q}, or its vectorized version 𝜸q=𝐯𝐞𝐜(𝚪q)\boldsymbol{\gamma}_{q}=\mathrm{\mathbf{vec}}\left(\boldsymbol{\Gamma}_{q}\right), it is required to solve the following kernel ridge regression problem:

min𝜸q{12T𝚽(𝐱~t,q(𝐳tq𝐊)𝜸q,𝚺(l))+λ2𝜸q(𝐈D𝐊)𝜸q},\min_{\boldsymbol{\gamma}_{q}}\left\{\frac{1}{2T}\boldsymbol{\Phi}\left(\widetilde{\mathbf{x}}_{t,-q}-\left(\mathbf{z}_{t-q}^{\top}\otimes\mathbf{K}\right)\boldsymbol{\gamma}_{q},\boldsymbol{\Sigma}^{(l)}\right)+\frac{\lambda}{2}\boldsymbol{\gamma}_{q}^{\top}\left(\mathbf{I}_{D}\otimes\mathbf{K}\right)\boldsymbol{\gamma}_{q}\right\},

where 𝚺(l)=𝚺c(l)𝚺r(l)\boldsymbol{\Sigma}^{(l)}=\boldsymbol{\Sigma}_{c}^{(l)}\otimes\boldsymbol{\Sigma}_{r}^{(l)} and 𝐱~t,q\widetilde{\mathbf{x}}_{t,-q} is the vectorized partial residual of 𝐗t\mathbf{X}_{t} by leaving out the lag-qq auxiliary predictor, defined in a similar way as 𝐗~t,p\widetilde{\mathbf{X}}_{t,-p}. Solving the kernel ridge regression leads to the following updating formula for 𝜸q(l+1)\boldsymbol{\gamma}_{q}^{(l+1)}:

𝜸q(l+1)[(t[T]𝐳tq𝐳tq)𝐊+λT(𝐈D𝚺(l))]1[t[T](𝐳tq𝐱~t,q)].\boldsymbol{\gamma}_{q}^{(l+1)}\leftarrow\left[\left(\sum_{t\in[T]}\mathbf{z}_{t-q}\mathbf{z}_{t-q}^{\top}\right)\otimes\mathbf{K}+\lambda T\left(\mathbf{I}_{D}\otimes\boldsymbol{\Sigma}^{(l)}\right)\right]^{-1}\left[\sum_{t\in[T]}\left(\mathbf{z}_{t-q}\otimes\widetilde{\mathbf{x}}_{t,-q}\right)\right]. (15)

The step in (15) can be slow since one needs to invert a square matrix of size MND\times MND. In Section 3.2, we propose an approximation to (15) to speed up the computation when the matrix dimension is high.
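
A minimal sketch (illustrative) of the kernel ridge update (15); the lists of z_{t-q} and of vectorized partial residuals \widetilde{\mathbf{x}}_{t,-q} are assumed given, and the MND\times MND linear solve in this sketch is exactly the bottleneck that motivates the truncation of Section 3.2.

import numpy as np

def update_gamma(z_list, xtil_list, K, Sigma, lam):
    # K is the MN x MN Gram matrix, Sigma = Sigma_c kron Sigma_r (MN x MN)
    T, D, MN = len(z_list), z_list[0].shape[0], K.shape[0]
    ZZ = sum(np.outer(z, z) for z in z_list)                   # sum_t z z^T, D x D
    lhs = np.kron(ZZ, K) + lam * T * np.kron(np.eye(D), Sigma) # MND x MND
    rhs = sum(np.kron(z, x) for z, x in zip(z_list, xtil_list))
    gamma = np.linalg.solve(lhs, rhs)                          # vec(Gamma_q)
    return gamma.reshape(MN, D, order="F")                     # Gamma_q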

The updating rule of 𝚺r(l+1)\boldsymbol{\Sigma}_{r}^{(l+1)} and 𝚺c(l+1)\boldsymbol{\Sigma}_{c}^{(l+1)} can be easily derived by taking their derivative in (11) and setting it to zero. Specifically, we have:

𝚺r(l+1)\displaystyle\boldsymbol{\Sigma}_{r}^{(l+1)} 1NT𝚽(𝐗~t,𝚺c(l))\displaystyle\leftarrow\frac{1}{NT}\boldsymbol{\Phi}\left(\widetilde{\mathbf{X}}_{t}^{\top},\boldsymbol{\Sigma}_{c}^{(l)}\right) (16)
𝚺c(l+1)\displaystyle\boldsymbol{\Sigma}_{c}^{(l+1)} 1MT𝚽(𝐗~t,𝚺r(l+1)).\displaystyle\leftarrow\frac{1}{MT}\boldsymbol{\Phi}\left(\widetilde{\mathbf{X}}_{t},\boldsymbol{\Sigma}_{r}^{(l+1)}\right). (17)

where 𝐗~t\widetilde{\mathbf{X}}_{t} is the full residual when predicting 𝐗t\mathbf{X}_{t}.

The algorithm cycles through (13), (14), (15), (16) and (17) and terminates when 𝐁p(l)𝐀p(l)\mathbf{B}_{p}^{(l)}\otimes\mathbf{A}_{p}^{(l)}, 𝒢\mathbfcal{G}_{q}^{(l)}, 𝚺c(l)𝚺r(l)\boldsymbol{\Sigma}_{c}^{(l)}\otimes\boldsymbol{\Sigma}_{r}^{(l)} have their relative changes between iterations fall under a pre-specified threshold. We make two additional remarks on the algorithm:

Remark 2

(Identifiability Constraint) The MARAC(P,Q)(P,Q) model specified in (1) is scale-unidentifiable in that one can re-scale each pair of (𝐀p,𝐁p)(\mathbf{A}_{p},\mathbf{B}_{p}) by a non-zero constant cc and obtain (c𝐀p,c1𝐁p)(c\cdot\mathbf{A}_{p},c^{-1}\cdot\mathbf{B}_{p}) without changing their Kronecker product. To enforce scale identifiability, we re-scale the algorithm output for each pair of (𝐀p,𝐁p)(\mathbf{A}_{p},\mathbf{B}_{p}) such that 𝐀pF=1\|\mathbf{A}_{p}\|_{\mathrm{F}}=1, sign(tr(𝐀p))=1\text{sign}(\text{tr}\left(\mathbf{A}_{p}\right))=1. The identifiability constraint is enforced before outputting the estimators.

Remark 3

(Convergence of Kronecker Product) When dealing with high-dimensional matrices, it is cumbersome to compute the change between 𝐁p(l)𝐀p(l)\mathbf{B}_{p}^{(l)}\otimes\mathbf{A}_{p}^{(l)} and 𝐁p(l+1)𝐀p(l+1)\mathbf{B}_{p}^{(l+1)}\otimes\mathbf{A}_{p}^{(l+1)} under the Frobenius norm. An upper bound of 𝐁p(l+1)𝐀p(l+1)𝐁p(l)𝐀p(l)F\|\mathbf{B}_{p}^{(l+1)}\otimes\mathbf{A}_{p}^{(l+1)}-\mathbf{B}_{p}^{(l)}\otimes\mathbf{A}_{p}^{(l)}\|_{\mathrm{F}} can be used instead:

𝐁p(l+1)𝐁p(l)F𝐀p(l+1)F+𝐁p(l)F𝐀p(l+1)𝐀p(l)F,\|\mathbf{B}_{p}^{(l+1)}-\mathbf{B}_{p}^{(l)}\|_{\mathrm{F}}\cdot\|\mathbf{A}_{p}^{(l+1)}\|_{\mathrm{F}}+\|\mathbf{B}_{p}^{(l)}\|_{\mathrm{F}}\cdot\|\mathbf{A}_{p}^{(l+1)}-\mathbf{A}_{p}^{(l)}\|_{\mathrm{F}}, (18)

and a similar bound can be used for the convergence check of 𝚺c(l)𝚺r(l)\boldsymbol{\Sigma}_{c}^{(l)}\otimes\boldsymbol{\Sigma}_{r}^{(l)}.
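
For completeness, the bound in (18) follows from the bilinearity of the Kronecker product, the triangle inequality, and the identity \|\mathbf{X}\otimes\mathbf{Y}\|_{\mathrm{F}}=\|\mathbf{X}\|_{\mathrm{F}}\cdot\|\mathbf{Y}\|_{\mathrm{F}} (dropping the lag subscript p):

\|\mathbf{B}^{(l+1)}\otimes\mathbf{A}^{(l+1)}-\mathbf{B}^{(l)}\otimes\mathbf{A}^{(l)}\|_{\mathrm{F}}=\|(\mathbf{B}^{(l+1)}-\mathbf{B}^{(l)})\otimes\mathbf{A}^{(l+1)}+\mathbf{B}^{(l)}\otimes(\mathbf{A}^{(l+1)}-\mathbf{A}^{(l)})\|_{\mathrm{F}}\leq\|\mathbf{B}^{(l+1)}-\mathbf{B}^{(l)}\|_{\mathrm{F}}\cdot\|\mathbf{A}^{(l+1)}\|_{\mathrm{F}}+\|\mathbf{B}^{(l)}\|_{\mathrm{F}}\cdot\|\mathbf{A}^{(l+1)}-\mathbf{A}^{(l)}\|_{\mathrm{F}}.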

3.2 Approximated Penalized MLE with Kernel Truncation

The iterative algorithm in Section 3.1 requires inverting an MND×MNDMND\times MND matrix in (15) when updating 𝜸q\boldsymbol{\gamma}_{q}, i.e., the coefficients of the representer functions k(,s)k(\cdot,s). One way to reduce the computational complexity without any approximation is to divide the step of updating 𝜸q=[𝜸q,1::𝜸q,D]\boldsymbol{\gamma}_{q}=[\boldsymbol{\gamma}_{q,1}^{\top}:\cdots:\boldsymbol{\gamma}_{q,D}^{\top}]^{\top} to updating one block of parameters at a time following the order of 𝜸q,1𝜸q,D\boldsymbol{\gamma}_{q,1}\rightarrow\cdots\rightarrow\boldsymbol{\gamma}_{q,D}. However, such a procedure requires inverting a matrix of size MN×MNMN\times MN, which could still be high-dimensional.

To circumvent the issue of inverting large matrices, we can approximate the linear combination of all MN representers using a set of R\ll MN basis functions, i.e., \mathbf{K}\boldsymbol{\gamma}_{q,d}\approx\mathbf{K}_{R}\boldsymbol{\theta}_{q,d}, where \mathbf{K}_{R}\in\mathbb{R}^{MN\times R},\boldsymbol{\theta}_{q,d}\in\mathbb{R}^{R}. For example, one can reduce the spatial resolution by subsampling a fraction of the rows and columns of the matrix and only use the representers at the subsampled “knots” as the basis functions. In this subsection, we consider an alternative approach of truncating the Mercer decomposition in (8). A similar technique can be found in Kang et al. (2018).

Given the eigen-decomposition of k(,)k(\cdot,\cdot) in (8), one can truncate the decomposition at the RthR^{\rm{th}} largest eigenvalue λR\lambda_{R} and get an approximation: k(,)rRλrψr()ψr()k(\cdot,\cdot)\approx\sum_{r\leq R}\lambda_{r}\psi_{r}(\cdot)\psi_{r}(\cdot). We will use the set of eigen-functions {ψ1(),,ψR()}\{\psi_{1}(\cdot),\ldots,\psi_{R}(\cdot)\} for faster computation. The choice of RR depends on the decaying rate of the eigenvalue sequence {λr}r=1\{\lambda_{r}\}_{r=1}^{\infty} (thus the smoothness of the underlying functional parameters). Our simulation result shows that the estimation and prediction errors shrink monotonically as RR\rightarrow\infty. Therefore, RR can be chosen based on the computational resources available. The kernel truncation speeds up the computation at the cost of providing an overly-smoothed estimator, as we demonstrate empirically in the supplemental material.

Given the kernel truncation, any functional parameter gq,d()g_{q,d}(\cdot) is now approximated as: gq,d()r[R][𝜽q,d]rψr()g_{q,d}(\cdot)\approx\sum_{r\in[R]}[\boldsymbol{\theta}_{q,d}]_{r}\psi_{r}(\cdot). The parameter to be estimated now is 𝚯q=[𝜽q,1;;𝜽q,D]R×D\boldsymbol{\Theta}_{q}=[\boldsymbol{\theta}_{q,1};\cdots;\boldsymbol{\theta}_{q,D}]\in\mathbb{R}^{R\times D}, whose dimension is much lower than before (𝚪qMN×D\boldsymbol{\Gamma}_{q}\in\mathbb{R}^{MN\times D}). Estimating 𝚯q\boldsymbol{\Theta}_{q} requires solving a ridge regression problem, and the updating formula for generating 𝐯𝐞𝐜(𝚯q)\mathrm{\mathbf{vec}}\left(\boldsymbol{\Theta}_{q}\right) at the (l+1)th(l+1)^{\rm{th}} iteration can be written as:

[𝚽(𝐳tq𝐊R,𝚺(l))+λT(𝐈D𝚲R1)]1𝚽(𝐳tq𝐊R,𝐱~t,q,𝚺(l)),\left[\boldsymbol{\Phi}\left(\mathbf{z}_{t-q}^{\top}\otimes\mathbf{K}_{R},\boldsymbol{\Sigma}^{(l)}\right)+\lambda T\left(\mathbf{I}_{D}\otimes\boldsymbol{\Lambda}_{R}^{-1}\right)\right]^{-1}\boldsymbol{\Phi}\left(\mathbf{z}_{t-q}^{\top}\otimes\mathbf{K}_{R},\widetilde{\mathbf{x}}_{t,-q},\boldsymbol{\Sigma}^{(l)}\right),

where \mathbf{K}_{R}\in\mathbb{R}^{MN\times R} satisfies [\mathbf{K}_{R}]_{ur}=\psi_{r}(s_{ij}),u=i+(j-1)M, and \boldsymbol{\Lambda}_{R}=\text{diag}(\lambda_{1},\ldots,\lambda_{R}), with \lambda_{r} being the r^{\rm{th}} largest eigenvalue of the Mercer decomposition of k(\cdot,\cdot). Now we only need to invert a matrix of size RD\times RD, which speeds up the computation.
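
One practical way to obtain a truncated basis (a Nyström-style construction we assume here for illustration; it matches the Mercer eigenfunctions evaluated on the grid only up to scaling) is to take the leading eigenpairs of the Gram matrix \mathbf{K}:

import numpy as np

def truncated_basis(K, R):
    # leading eigenpairs of the Gram matrix serve as empirical surrogates
    # for the first R Mercer eigenfunctions evaluated on the grid
    evals, evecs = np.linalg.eigh(K)                 # ascending order
    evals, evecs = evals[::-1][:R], evecs[:, ::-1][:, :R]
    K_R = evecs                                      # MN x R basis on the grid
    Lambda_R = np.diag(evals)                        # stands in for diag(lambda_1,...,lambda_R)
    return K_R, Lambda_R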

3.3 Lag Selection

The MARAC(P,Q) model (1) has three hyper-parameters: the autoregressive lag P, the auxiliary covariate lag Q, and the RKHS norm penalty weight \lambda. In practice, \lambda can be chosen by cross-validation, while choosing P and Q requires a more formal model selection criterion. We propose to select P and Q using information criteria, namely the Akaike Information Criterion (AIC) (Akaike, 1998) and the Bayesian Information Criterion (BIC) (Schwarz, 1978). We formally define the AIC and BIC for the MARAC(P,Q) model here and empirically validate their consistency via simulation experiments in Section 5.

Let \widehat{\boldsymbol{\Theta}} be the set of estimated parameters of the MARAC(P,Q) model, and \mathbf{df}_{P,Q,\lambda} be the effective degrees of freedom of the model. We can then define the AIC and the BIC as follows:

AIC(P,Q,λ)\displaystyle\mathrm{AIC}(P,Q,\lambda) =2t[T](𝐗t|𝐗t1:P,𝐳t1:Q,𝚯^)+2𝐝𝐟P,Q,λ,\displaystyle=-2\sum_{t\in[T]}\ell(\mathbf{X}_{t}|\mathbf{X}_{t-1:P},\mathbf{z}_{t-1:Q},\widehat{\boldsymbol{\Theta}})+2\cdot\mathbf{df}_{P,Q,\lambda}, (19)
BIC(P,Q,λ)\displaystyle\mathrm{BIC}(P,Q,\lambda) =2t[T](𝐗t|𝐗t1:P,𝐳t1:Q,𝚯^)+log(T)𝐝𝐟P,Q,λ.\displaystyle=-2\sum_{t\in[T]}\ell(\mathbf{X}_{t}|\mathbf{X}_{t-1:P},\mathbf{z}_{t-1:Q},\widehat{\boldsymbol{\Theta}})+\log(T)\cdot\mathbf{df}_{P,Q,\lambda}. (20)

To calculate \mathbf{df}_{P,Q,\lambda}, we decompose it into the sum of three components: 1) each pair of autoregressive coefficients \widehat{\mathbf{A}}_{p},\widehat{\mathbf{B}}_{p} contributes (M^{2}+N^{2}-1) degrees of freedom; 2) the noise covariances \widehat{\boldsymbol{\Sigma}}_{r},\widehat{\boldsymbol{\Sigma}}_{c} contribute (M^{2}+N^{2}) degrees of freedom; and 3) for the auxiliary covariate functional parameters \widehat{g}_{q,1},\ldots,\widehat{g}_{q,D}, inspired by the kernel ridge regression estimator in (15), we define the sum of their degrees of freedom as:

𝐝𝐟q(g^)=tr{[𝐊~+λ(𝐈D𝚺^c𝚺^r)]1𝐊~},\mathbf{df}_{q}(\widehat{g})=\text{tr}\left\{\left[\widetilde{\mathbf{K}}+\lambda\left(\mathbf{I}_{D}\otimes\widehat{\boldsymbol{\Sigma}}_{c}\otimes\widehat{\boldsymbol{\Sigma}}_{r}\right)\right]^{-1}\widetilde{\mathbf{K}}\right\},

where \widetilde{\mathbf{K}}=\left(T^{-1}\sum_{t\in[T]}\mathbf{z}_{t-q}\mathbf{z}_{t-q}^{\top}\right)\otimes\mathbf{K}. As \lambda\rightarrow 0, we have \mathbf{df}_{q}(\widehat{g})\rightarrow MND; that is, each covariate has MN free parameters and the model reduces to the element-wise linear regression model. Empirically, we find that the BIC is a consistent lag selection criterion for our model.
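
A minimal sketch (illustrative; neg2loglik denotes the assumed value of -2\sum_{t}\ell at the fitted parameters) of the effective degrees of freedom \mathbf{df}_{q}(\widehat{g}) and the resulting BIC in (20):

import numpy as np

def df_auxiliary(z_list, K, Sigma_hat, lam):
    # Sigma_hat is the fitted Sigma_c kron Sigma_r (MN x MN)
    T, D = len(z_list), z_list[0].shape[0]
    Ktilde = np.kron(sum(np.outer(z, z) for z in z_list) / T, K)
    ridge = Ktilde + lam * np.kron(np.eye(D), Sigma_hat)
    return np.trace(np.linalg.solve(ridge, Ktilde))

def bic(neg2loglik, T, M, N, P, df_aux_list):
    # df_aux_list holds df_q(g_hat) for q = 1, ..., Q
    df = P * (M**2 + N**2 - 1) + (M**2 + N**2) + sum(df_aux_list)
    return neg2loglik + np.log(T) * df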

4 Theoretical Analysis

This section presents the theoretical analyses of the MARAC model. We first formulate the condition under which the matrix and vector time series are jointly stationary. Under this condition, we then establish the consistency and asymptotic normality of the penalized MLE under fixed matrix dimensionality as TT\rightarrow\infty. Finally, we consider the case where the matrix size goes to infinity as TT\rightarrow\infty and derive the convergence rate of the penalized MLE estimator and also the optimal order of the functional norm penalty tuning parameter λ\lambda. Without loss of generality, we assume that the matrix and vector time series have zero means, and we use S=MNS=MN to denote the spatial dimensionality of the matrix data. All proofs are deferred to the supplemental material.

4.1 Stationarity Condition

To facilitate the theoretical analysis, we make the following assumption on the auxiliary vector time series \mathbf{z}_{t}.

Assumption 4

The DD-dimensional auxiliary vector time series {𝐳t}t=\{\mathbf{z}_{t}\}_{t=-\infty}^{\infty} follows a stationary VAR(Q~)(\widetilde{Q}) process:

𝐳t=q~=1Q~𝐂q~𝐳tq~+𝝂t,\mathbf{z}_{t}=\sum_{\widetilde{q}=1}^{\widetilde{Q}}\mathbf{C}_{\widetilde{q}}\mathbf{z}_{t-\widetilde{q}}+\boldsymbol{\nu}_{t}, (21)

where 𝐂q~D×D\mathbf{C}_{\widetilde{q}}\in\mathbb{R}^{D\times D} is the lag-q~\widetilde{q} transition matrix and 𝛎t\boldsymbol{\nu}_{t} has independent sub-Gaussian entries and is independent of 𝐄t\mathbf{E}_{t}.

Remark 5

With Assumption 4, we can combine the MARAC model for 𝐗t\mathbf{X}_{t} in (4) and the VAR process for 𝐳t\mathbf{z}_{t} in (21) as a single VAR(L)(L) model, with L=max(P,Q,Q~)L=\max(P,Q,\widetilde{Q}), as:

[𝐱t𝐳t]=l=1L[(𝐁l𝐀l)𝟏{lP}𝐆l𝟏{lQ}𝐎D×S𝐂l𝟏{lQ~}][𝐱tl𝐳tl]+[𝐞t𝝂t],\begin{bmatrix}\mathbf{x}_{t}\\ \mathbf{z}_{t}\end{bmatrix}=\sum_{l=1}^{L}\begin{bmatrix}\left(\mathbf{B}_{l}\otimes\mathbf{A}_{l}\right)\odot\mathbf{1}_{\{l\leq P\}}&\mathbf{G}_{l}^{\top}\odot\mathbf{1}_{\{l\leq Q\}}\\ \mathbf{O}_{D\times S}&\mathbf{C}_{l}\odot\mathbf{1}_{\{l\leq\widetilde{Q}\}}\end{bmatrix}\begin{bmatrix}\mathbf{x}_{t-l}\\ \mathbf{z}_{t-l}\end{bmatrix}+\begin{bmatrix}\mathbf{e}_{t}\\ \boldsymbol{\nu}_{t}\end{bmatrix}, (22)

with \odot being the Hadamard product between matrices and 𝟏{lC}\mathbf{1}_{\{l\leq C\}} being a matrix with all elements taking values from 𝟙{lC}\mathbbm{1}_{\{l\leq C\}} and 𝐎\mathbf{O} is zero matrix. As (22) shows, 𝐳t\mathbf{z}_{t^{\prime}} can help forecast 𝐱t\mathbf{x}_{t} where t>tt>t^{\prime} but not the opposite, indicating that 𝐳t\mathbf{z}_{t} is an exogenous predictor for 𝐱t\mathbf{x}_{t}.

Given the joint vector autoregressive model in (22), we derive the conditions for 𝐱t\mathbf{x}_{t} and 𝐳t\mathbf{z}_{t} to be jointly stationary in Theorem 6.

Theorem 6 (MARAC Stationarity Condition)

Assume that Assumption 4 holds for the auxiliary time series {𝐳t}t=\{\mathbf{z}_{t}\}_{t=-\infty}^{\infty}, and that the matrix time series {𝐗t}t=\{\mathbf{X}_{t}\}_{t=-\infty}^{\infty} is generated by the MARAC(P,Q)(P,Q) model in (1), then {𝐗t,𝐳t}t=\{\mathbf{X}_{t},\mathbf{z}_{t}\}_{t=-\infty}^{\infty} are jointly stationary if and only if for any yy\in\mathbb{C} in the complex plane such that |y|1|y|\leq 1, we have

det[𝐈Sp=1P(𝐁p𝐀p)yp]0,det[𝐈Dq~=1Q~𝐂q~yq~]0.\mathrm{det}\left[\mathbf{I}_{S}-\sum_{p=1}^{P}\left(\mathbf{B}_{p}\otimes\mathbf{A}_{p}\right)y^{p}\right]\neq 0,\quad\mathrm{det}\left[\mathbf{I}_{D}-\sum_{\widetilde{q}=1}^{\widetilde{Q}}\mathbf{C}_{\widetilde{q}}y^{\widetilde{q}}\right]\neq 0. (23)

The proof of Theorem 6 follows the proof of the stationarity of the joint autoregressive model in (22). As a special case where P=\widetilde{Q}=1, the stationarity condition in (23) is equivalent to \bar{\rho}(\mathbf{A}_{1})\cdot\bar{\rho}(\mathbf{B}_{1})<1 and \bar{\rho}(\mathbf{C}_{1})<1, where \bar{\rho}(\cdot) is the spectral radius of a square matrix.
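
A minimal sketch (illustrative, not the authors' code) of checking the autoregressive part of condition (23) numerically via the companion matrix of the lag polynomial; the same check applies to the VAR(\widetilde{Q}) polynomial for \mathbf{z}_{t}.

import numpy as np

def ar_part_stationary(A_list, B_list):
    # condition (23) for the matrix part holds iff the companion matrix of
    # the polynomial I_S - sum_p (B_p kron A_p) y^p has spectral radius < 1
    P = len(A_list)
    S = A_list[0].shape[0] * B_list[0].shape[0]
    Phis = [np.kron(B, A) for A, B in zip(A_list, B_list)]   # B_p kron A_p
    companion = np.zeros((P * S, P * S))
    companion[:S, :] = np.hstack(Phis)
    companion[S:, :-S] = np.eye((P - 1) * S)
    return np.max(np.abs(np.linalg.eigvals(companion))) < 1

# special case P = 1: equivalent to rho(A_1) * rho(B_1) < 1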

Based on Theorem 6, the stationarity of the matrix and vector time series relies on the stationarity of the autoregressive coefficients of the MARAC(P,Q)(P,Q) and VAR(Q~)(\widetilde{Q}) models and the tensor coefficients 𝒢𝒢𝒬\mathbfcal{G}_{1},\ldots,\mathbfcal{G}_{Q} do not affect the stationarity. The MARAC model can be extended to the joint autoregressive process (22) where the dynamics of 𝐗t\mathbf{X}_{t} and 𝐳t\mathbf{z}_{t} are modeled jointly. We stick to the simpler case here and leave the joint autoregression to future work.

4.2 Finite Spatial Dimension Asymptotics

In this subsection, we establish the consistency and asymptotic normality of the MARAC model estimators under the scenario that M,NM,N are fixed. Given a fixed matrix dimensionality, the functional parameters gq,dkg_{q,d}\in\mathbb{H}_{k} can only be estimated at S=MNS=MN fixed locations, and thus the asymptotic normality result is established for the corresponding tensor coefficient 𝒢^q\widehat{\mathbfcal{G}}_{q}. In Section 4.3, we will discuss the double asymptotics when both S,TS,T\rightarrow\infty. For the remainder of the paper, we denote the true model coefficient with an asterisk superscript, such as 𝐀1,𝐁1,𝒢\mathbf{A}_{1}^{*},\mathbf{B}_{1}^{*},\mathbfcal{G}_{1}^{*} and 𝚺\boldsymbol{\Sigma}^{*}.

To start with, we make an assumption on the Gram matrix 𝐊\mathbf{K}:

Assumption 7

The minimum eigenvalue of \mathbf{K} is bounded below by a positive constant \underaccent{\bar}{c}, i.e. \underaccent{\bar}{\rho}(\mathbf{K})=\underaccent{\bar}{c}>0.

As a result of Assumption 7, every 𝒢\mathbfcal{G}_{q}^{*} has a unique kernel decomposition: 𝐯𝐞𝐜(𝒢)=(𝐈D𝐊)𝜸q\mathrm{\mathbf{vec}}\left(\mathbfcal{G}_{q}^{*}\right)=(\mathbf{I}_{D}\otimes\mathbf{K})\boldsymbol{\gamma}_{q}^{*}. With this additional assumption, the first theoretical result we establish is the consistency of the covariance matrix estimator 𝚺^=𝚺^c𝚺^r\widehat{\boldsymbol{\Sigma}}=\widehat{\boldsymbol{\Sigma}}_{c}\otimes\widehat{\boldsymbol{\Sigma}}_{r}, which we summarize in Proposition 8.

Proposition 8 (Covariance Consistency)

Assume that \lambda\rightarrow 0 as T\rightarrow\infty with S fixed, and that Assumptions 4 and 7 and the stationarity condition in Theorem 6 hold; then \widehat{\boldsymbol{\Sigma}}\overset{p.}{\rightarrow}\boldsymbol{\Sigma}^{*}.

Given this result, we can further establish the asymptotic normality of the other model estimators:

Theorem 9 (Asymptotic Normality)

Assume that the matrix time series \{\mathbf{X}_{t}\}_{t=-\infty}^{\infty} follows the MARAC(P,Q) model (1) with i.i.d. noise \{\mathbf{E}_{t}\}_{t=-\infty}^{\infty} following (3), that Assumptions 4 and 7 and the stationarity condition in Theorem 6 hold, and that \lambda=o(T^{-1/2}). Additionally, assume that \underaccent{\bar}{\rho}(\mathrm{Var}([\mathrm{\mathbf{vec}}\left(\mathbf{X}_{t}\right)^{\top},\mathbf{z}_{t}^{\top}]^{\top}))=\underaccent{\bar}{c}^{\prime}>0. Suppose M,N are fixed and P,Q are known, and denote the vectorizations of \mathbf{A}_{p} and \mathbf{B}_{p}^{\top} by \boldsymbol{\alpha}_{p} and \boldsymbol{\beta}_{p} for any p\in[P]. Then the penalized MLE of the MARAC(P,Q) model is asymptotically normal:

T[𝜷^1𝜶^1𝜷1𝜶1𝜷^P𝜶^P𝜷P𝜶P𝐯𝐞𝐜(𝒢^1𝒢)𝐯𝐞𝐜(𝒢^Q𝒢𝒬)]d.𝒩(𝟎,𝐕𝚵𝐕),\sqrt{T}\begin{bmatrix}\widehat{\boldsymbol{\beta}}_{1}\otimes\widehat{\boldsymbol{\alpha}}_{1}-\boldsymbol{\beta}_{1}^{*}\otimes\boldsymbol{\alpha}_{1}^{*}\\ \vdots\\ \widehat{\boldsymbol{\beta}}_{P}\otimes\widehat{\boldsymbol{\alpha}}_{P}-\boldsymbol{\beta}_{P}^{*}\otimes\boldsymbol{\alpha}_{P}^{*}\\ \mathrm{\mathbf{vec}}\left(\widehat{\mathbfcal{G}}_{1}-\mathbfcal{G}^{*}_{1}\right)\\ \vdots\\ \mathrm{\mathbf{vec}}\left(\widehat{\mathbfcal{G}}_{Q}-\mathbfcal{G}^{*}_{Q}\right)\end{bmatrix}\overset{d.}{\longrightarrow}\mathcal{N}\left(\mathbf{0},\mathbf{V}\boldsymbol{\Xi}\mathbf{V}^{\top}\right), (24)

where 𝐕\mathbf{V} is:

𝐕=[diag(𝐕1,,𝐕P)𝐎𝐎𝐈QD𝐊],𝐕p=[𝜷p𝐈M2,𝐈N2𝜶p],\mathbf{V}=\begin{bmatrix}\text{diag}(\mathbf{V}_{1},\ldots,\mathbf{V}_{P})&\mathbf{O}\\ \mathbf{O}&\mathbf{I}_{QD}\otimes\mathbf{K}\end{bmatrix},\quad\mathbf{V}_{p}=[\boldsymbol{\beta}^{*}_{p}\otimes\mathbf{I}_{M^{2}},\mathbf{I}_{N^{2}}\otimes\boldsymbol{\alpha}_{p}^{*}],

and 𝚵=𝐇1E[𝐖t(𝚺)1𝐖t]𝐇1\boldsymbol{\Xi}=\mathbf{H}^{-1}\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]\mathbf{H}^{-1}, and 𝐖t\mathbf{W}_{t} is defined as:

𝐖t=[𝐖0,t𝐈M,𝐈N𝐖1,t,[𝐳t1,,𝐳tQ]𝐊],\mathbf{W}_{t}=\left[\mathbf{W}_{0,t}\otimes\mathbf{I}_{M},\mathbf{I}_{N}\otimes\mathbf{W}_{1,t},[\mathbf{z}_{t-1}^{\top},\ldots,\mathbf{z}_{t-Q}^{\top}]\otimes\mathbf{K}\right],

where 𝐇=E[𝐖t(𝚺)1𝐖t]+𝛇𝛇\mathbf{H}=\mathrm{E}[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}]+\boldsymbol{\zeta}\boldsymbol{\zeta}^{\top} with 𝛇=[(𝛂1),,(𝛂P),𝟎]\boldsymbol{\zeta}^{\top}=[(\boldsymbol{\alpha}_{1}^{*})^{\top},\cdots,(\boldsymbol{\alpha}_{P}^{*})^{\top},\mathbf{0}^{\top}] and:

𝐖0,t=[𝐁1𝐗t1,,𝐁P𝐗tP],𝐖1,t=[𝐀1𝐗t1,,𝐀P𝐗tP].\mathbf{W}_{0,t}=[\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top},\ldots,\mathbf{B}_{P}^{*}\mathbf{X}_{t-P}^{\top}],\quad\mathbf{W}_{1,t}=[\mathbf{A}_{1}^{*}\mathbf{X}_{t-1},\ldots,\mathbf{A}_{P}^{*}\mathbf{X}_{t-P}].

The asymptotic distribution (24) indicates that all parameters are T\sqrt{T}-consistent. Under a fixed matrix dimensionality SS, the functional parameters gq,dkg_{q,d}\in\mathbb{H}_{k} are estimated only at fixed locations. Hence, the convergence is at a parametric rate just like the autoregressive coefficient.

4.3 High Spatial Dimension Asymptotics

The previous section presents the asymptotic normality of the MARAC estimators under a fixed matrix dimensionality SS. In this section, we relax this assumption and establish the convergence rate of the MARAC estimators when S,TS,T\rightarrow\infty. For technical reasons, we assume that all entries of 𝐄t\mathbf{E}_{t} are i.i.d. normally-distributed random variables following 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}).

To establish the convergence rate of the MARAC estimators when S,TS,T\rightarrow\infty, we need to make several additional assumptions.

Assumption 10

The spatial locations of the rows and columns of 𝐗t\mathbf{X}_{t} are sampled independently from a uniform distribution on [0,1][0,1].

Assumption 11

The spatial kernel function k(,)k(\cdot,\cdot) can be decomposed into the product of a row kernel k1(,)k_{1}(\cdot,\cdot) and a column kernel k2(,)k_{2}(\cdot,\cdot) that satisfies k((u,v),(s,t))=k1(u,s)k2(v,t)k((u,v),(s,t))=k_{1}(u,s)k_{2}(v,t). Both k1,k2k_{1},k_{2} have their eigenvalues decaying at a polynomial rate: λj(k1)jr0,λj(k2)jr0\lambda_{j}(k_{1})\asymp j^{-r_{0}},\lambda_{j}(k_{2})\asymp j^{-r_{0}} with r0(1/2,2)r_{0}\in(1/2,2).

Assumption 11 elicits a simple eigen-spectrum characterization of the spatial kernel k(\cdot,\cdot), whose eigenvalues can be written as \lambda_{i}(k_{1})\lambda_{j}(k_{2}). Also, the Gram matrix \mathbf{K} is separable, i.e. \mathbf{K}=\mathbf{K}_{2}\otimes\mathbf{K}_{1}, and all eigenvalues of \mathbf{K} have the form \rho_{i}(\mathbf{K}_{1})\rho_{j}(\mathbf{K}_{2}), where \mathbf{K}_{1}\in\mathbb{R}^{M\times M},\mathbf{K}_{2}\in\mathbb{R}^{N\times N} are the Gram matrices for the kernels k_{1},k_{2}, respectively.

Under Assumption 10, we further have ρi(𝐊1)Mλi(k1)\rho_{i}(\mathbf{K}_{1})\rightarrow M\lambda_{i}(k_{1}) and ρj(𝐊2)Nλj(k2)\rho_{j}(\mathbf{K}_{2})\rightarrow N\lambda_{j}(k_{2}), as M,NM,N\rightarrow\infty. Combined with Assumption 11, we can characterize the eigenvalues of 𝐊\mathbf{K} as S(ij)r0S(ij)^{-r_{0}}. We refer our readers to Koltchinskii and Giné (2000); Braun (2006) for more references about the eigen-analysis of the kernel Gram matrix. One can generalize Assumption 10 to non-uniform sampling, but here, we stick to this simpler setting.

In Assumption 11, we assume the kernel separability to accommodate the grid structure of the spatial locations. We do not restrict r0r_{0} to be an integer but just a parameter that characterizes the smoothness of the functional parameters. With these assumptions, we are now ready to present the main result in Theorem 12.

Theorem 12 (Asymptotics for High-Dimensional MARAC)

Assume that Assumptions 4, 10, and 11 hold and that \mathbf{X}_{t} is generated by the MARAC(P,Q) model (1) with \mathbf{E}_{t} having i.i.d. \mathcal{N}(0,\sigma^{2}) entries. Then as S,T\rightarrow\infty (D is fixed) with S\log S/T\rightarrow 0, and under the additional assumptions that:

  1. M=O(\sqrt{S}), N=O(\sqrt{S});

  2. \gamma_{S}\coloneqq\lambda/S\rightarrow 0 and \gamma_{S}\cdot S^{r_{0}}\rightarrow C_{1} as S\rightarrow\infty, with 0<C_{1}\leq\infty;

  3. \underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}-(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*})^{\top}(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*})^{-1}\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*})=c_{0,S}>0 as S,T\rightarrow\infty, where \boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*},\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*},\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*} are \mathrm{Var}(\mathbf{x}_{t}), \mathrm{Var}(\mathbf{z}_{t}) and \mathrm{Cov}(\mathbf{z}_{t},\mathbf{x}_{t}), respectively, and c_{0,S} is a constant that depends only on S;

  4. for any S, we have 0<\underaccent{\bar}{\rho}(\mathbf{K})<\bar{\rho}(\mathbf{K})\leq C_{0}, where C_{0} is a finite constant,

then we have:

1PSp=1P𝐁^p𝐀^p𝐁p𝐀pF2OP(CgγSc0,SS)+OP(Dc0,STS),\frac{1}{\sqrt{P}S}\sqrt{\sum_{p=1}^{P}\left\|\widehat{\mathbf{B}}_{p}\otimes\widehat{\mathbf{A}}_{p}-\mathbf{B}^{*}_{p}\otimes\mathbf{A}_{p}^{*}\right\|_{\mathrm{F}}^{2}}\lesssim O_{P}\left(\sqrt{\frac{C_{g}\cdot\gamma_{S}}{c_{0,S}\cdot S}}\right)+O_{P}\left(\sqrt{\frac{D}{c_{0,S}\cdot TS}}\right), (25)

where Cg=q=1Qd=1Dgq,dk2C_{g}=\sum_{q=1}^{Q}\sum_{d=1}^{D}\|g_{q,d}\|_{\mathbb{H}_{k}}^{2}. Furthermore, we also have:

(TS)1t=1Tq=1Q(𝒢^q𝒢)ׯ𝐳tqF2\displaystyle\sqrt{(TS)^{-1}\sum_{t=1}^{T}\left\|\sum_{q=1}^{Q}\left(\widehat{\mathbfcal{G}}_{q}-\mathbfcal{G}_{q}^{*}\right)\bar{\times}\mathbf{z}_{t-q}\right\|_{\mathrm{F}}^{2}}
OP(γS1/2r0TS4)+OP(γS)+OP(1S)+OP(γS1TS).\displaystyle\lesssim O_{P}\left(\frac{\sqrt{\gamma_{S}^{-1/{2r_{0}}}}}{\sqrt{T}\sqrt[4]{S}}\right)+O_{P}(\sqrt{\gamma_{S}})+O_{P}\left(\frac{1}{\sqrt{S}}\right)+O_{P}\left(\frac{\sqrt{\gamma_{S}^{-1}}}{\sqrt{TS}}\right). (26)

In Theorem 12, (25) gives the error bound for the autoregressive coefficients and (26) gives the error bound for the prediction made by the auxiliary time series, which involves the functional parameter estimators. As a special case of (25) with \gamma_{S}=0 and S fixed, the convergence rate of the autoregressive coefficients is O_{P}(T^{-1/2}), which recovers the result in Theorem 9. In the discussion below, we write \text{AR}_{err} and \text{AC}_{err} as shorthand for the quantities on the left-hand sides of (25) and (26), respectively.

Remark 13 (Optimal Choice of λ\lambda and Phase Transition)

According to our proof, the error bound (26) can be decomposed into the sum of:

  • nonparametric error: OP(γS1/2r0TS)+OP(γS)O_{P}\left(\sqrt{\frac{\gamma_{S}^{-1/2r_{0}}}{T\sqrt{S}}}\right)+O_{P}(\sqrt{\gamma_{S}}),

  • autoregressive error: OP(γS)+OP(S1/2)+OP(T1/2)+OP(γS1TS)O_{P}\left(\sqrt{\gamma_{S}}\right)+O_{P}\left(S^{-1/2}\right)+O_{P}\left(T^{-1/2}\right)+O_{P}\left(\sqrt{\frac{\gamma_{S}^{-1}}{TS}}\right),

where the autoregressive error stems from the estimation error of \widehat{\mathbf{B}}_{p}\otimes\widehat{\mathbf{A}}_{p}. The nonparametric error resembles the result of nonparametric regression with an RKHS norm penalty (Cui et al., 2018): with n data points and tuning parameter \lambda, the nonparametric error is bounded by O_{P}(\sqrt{\lambda^{-1/2r_{0}}/n})+O_{P}(\sqrt{\lambda}) with an optimal \lambda\asymp n^{-2r_{0}/(2r_{0}+1)}. In our model, if there were no autoregressive error, the optimal tuning parameter would satisfy \gamma_{S}\asymp(T\sqrt{S})^{-2r_{0}/(2r_{0}+1)}. The number of data points in our case is TS; we lose a factor of \sqrt{S} in the optimal tuning parameter because, under Assumption 11, the eigenvalues of \mathbf{K}, ordered as \rho_{1}(\mathbf{K})\geq\cdots\geq\rho_{i}(\mathbf{K})\geq\cdots, decay more slowly than i^{-2r_{0}}. This is a result specific to matrix-shaped data. It is also noteworthy that, under the condition S\log S/T\rightarrow 0, the autoregressive error dominates the nonparametric error.

To simplify the discussion of the optimal order of \gamma_{S}, we assume that S=T^{c} for some constant 0<c<1. Under this condition, when P,Q\geq 1, the optimal tuning parameter \gamma_{S}=\lambda/S exhibits an interesting phase transition across different spatial smoothness levels r_{0} and matrix dimensionalities c=\log_{T}S, which we summarize in Table 1.

r_{0} | \log_{T}S | Optimal \gamma_{S} | Estimator Error
[1,2) | [\frac{1}{2r_{0}-1},1) | O((TS)^{-\frac{1}{2}}) | \mathrm{AR}_{err}=O_{P}(T^{-\frac{1}{4}}S^{-\frac{3}{4}}),\ \mathrm{AC}_{err}=O_{P}(S^{-\frac{1}{2}})
[1,2) | (0,\frac{1}{2r_{0}-1}) | O(S^{-r_{0}}) | \mathrm{AR}_{err}=O_{P}(S^{-\frac{r_{0}+1}{2}}),\ \mathrm{AC}_{err}=O_{P}(S^{-\frac{1}{2}})
(\frac{1}{2},1) | [2r_{0}-1,1) | O(S^{-r_{0}(2r_{0}-1)}) | \mathrm{AR}_{err}=O_{P}(S^{-\frac{r_{0}(2r_{0}-1)+1}{2}}),\ \mathrm{AC}_{err}=O_{P}(S^{-\frac{1}{2}})
(\frac{1}{2},1) | (0,2r_{0}-1) | O((T\sqrt{S})^{-\frac{2r_{0}}{2r_{0}+1}}) | \mathrm{AR}_{err}=O_{P}((TS)^{-\frac{1}{2}})+O_{P}((T\sqrt{S})^{-\frac{r_{0}}{2r_{0}+1}}S^{-\frac{1}{2}}),\ \mathrm{AC}_{err}=O_{P}(S^{-\frac{1}{2}})+O_{P}((T\sqrt{S})^{-\frac{r_{0}}{2r_{0}+1}})
Table 1: Summary of the optimal tuning parameter \gamma_{S} and the estimator errors following (25) and (26), under the assumptions that c_{0,S}\geq c_{0}>0 for all S and that S=T^{c} for some constant 0<c<1 such that S\log S/T\rightarrow 0. \mathrm{AR}_{err} and \mathrm{AC}_{err} are the quantities on the left-hand sides of (25) and (26), respectively.

Based on the results in Table 1, the faster S grows with respect to T, the smaller the optimal tuning parameter \gamma_{S} becomes. This is intuitive: with more spatial locations, the observations are denser and less smoothing is needed. Furthermore, the optimal order of \gamma_{S} is close to the classical nonparametric optimal rate (TS)^{-2r_{0}/(2r_{0}+1)} only in the regime where 1/2<r_{0}<1 and \log_{T}S<2r_{0}-1, i.e., when the functional parameter is relatively rough and the spatial dimensionality grows slowly with respect to T. Only in this regime does the discrepancy between the nonparametric error and the autoregressive error remain small, leading to an optimal tuning parameter close to that of nonparametric regression.
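For illustration only, the following hypothetical helper simply encodes the case analysis of Table 1 and returns the exponents (a,b) such that the optimal \gamma_{S}\asymp T^{a}S^{b} under S=T^{c}; it is a restatement of the table, not part of our estimation procedure.

```python
def optimal_gamma_exponents(r0: float, c: float):
    """Return (a, b) such that the optimal gamma_S scales as T**a * S**b,
    following the case analysis of Table 1 (with S = T**c, 0 < c < 1)."""
    if not (0.5 < r0 < 2) or not (0 < c < 1):
        raise ValueError("require r0 in (1/2, 2) and c in (0, 1)")
    if r0 >= 1:                                   # smoother functional parameter
        if c >= 1 / (2 * r0 - 1):
            return (-0.5, -0.5)                   # gamma_S ~ (T S)^(-1/2)
        return (0.0, -r0)                         # gamma_S ~ S^(-r0)
    # rougher functional parameter: 1/2 < r0 < 1
    if c >= 2 * r0 - 1:
        return (0.0, -r0 * (2 * r0 - 1))          # gamma_S ~ S^(-r0(2r0-1))
    # near-classical nonparametric regime
    return (-2 * r0 / (2 * r0 + 1), -r0 / (2 * r0 + 1))   # (T sqrt(S))^(-2r0/(2r0+1))

# Example: r0 = 0.75 with S growing slowly (c = 0.3) falls in the near-classical regime.
print(optimal_gamma_exponents(0.75, 0.3))
```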

In (25), the constant c_{0,S} appears in the error bound of the autoregressive term. This constant characterizes the spatial correlation of the matrix time series \mathbf{X}_{t} conditional on the auxiliary vector time series \mathbf{z}_{t}, and it can vary across different assumptions on the covariances of \mathbf{E}_{t} and \boldsymbol{\nu}_{t}. In Table 1, we assume that c_{0,S}\geq c_{0}>0 for some universal constant c_{0}. In practice, however, it is common to have c_{0,S}\rightarrow 0 as S\rightarrow\infty, which slows down the convergence of the autoregressive coefficients but does not affect the convergence of the functional parameters. We keep the constant c_{0,S} in (25) to state a general result and leave its characterization under specific assumptions to future work.

5 Simulation Experiments

5.1 Consistency and Convergence Rate

In this section, we validate the consistency and convergence rate of the MARAC estimators. We consider a simple setup with P=Q=1P=Q=1 and D=3D=3 and simulate the autoregressive coefficients 𝐀1,𝐁1\mathbf{A}_{1}^{*},\mathbf{B}_{1}^{*} such that they satisfy the stationarity condition in Theorem 6. We specify both 𝐀1,𝐁1\mathbf{A}_{1}^{*},\mathbf{B}_{1}^{*} and 𝚺r,𝚺c\boldsymbol{\Sigma}_{r}^{*},\boldsymbol{\Sigma}_{c}^{*} to have symmetric banded structures. To simulate g1,g2,g3g_{1},g_{2},g_{3} (we drop the lag subscript) from the RKHS k\mathbb{H}_{k}, we choose k(,)k(\cdot,\cdot) to be the Lebedev kernel (Kennedy et al., 2013) and generate g1,g2,g3g_{1},g_{2},g_{3} randomly from Gaussian processes with the Lebedev kernel as the covariance kernel. Finally, we simulate the auxiliary vector time series 𝐳t3\mathbf{z}_{t}\in\mathbb{R}^{3} from a VAR(1)(1) process. We include more details and visualizations of the simulation setups in the supplemental material.
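The sketch below illustrates the data-generating mechanism of this experiment (illustration only; the banded coefficients, the squared-exponential kernel standing in for the Lebedev kernel, the identity noise covariance, and all helper names are simplifying assumptions of the sketch).

```python
import numpy as np

rng = np.random.default_rng(1)
M = N = 10; D = 3; T = 2000; S = M * N

def banded(n, rho):
    """Symmetric banded matrix: 1 on the diagonal, rho on the first off-diagonals."""
    return np.eye(n) + rho * (np.eye(n, k=1) + np.eye(n, k=-1))

# Autoregressive coefficients, rescaled so that the spectral radius of B1 (x) A1 is 0.8 < 1;
# together with a stable VAR(1) for z_t this satisfies the stationarity requirement (cf. Theorem 6).
A1, B1 = banded(M, 0.3), banded(N, 0.3)
scale = 0.8 / (np.max(np.abs(np.linalg.eigvals(A1))) * np.max(np.abs(np.linalg.eigvals(B1))))
A1 *= np.sqrt(scale); B1 *= np.sqrt(scale)

# Smooth tensor slices G_d: Gaussian-process draws on the grid with a
# squared-exponential kernel (a stand-in for the Lebedev kernel of the paper).
grid = np.array([(i / M, j / N) for i in range(M) for j in range(N)])
K = np.exp(-np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=-1) / 0.1)
G = rng.multivariate_normal(np.zeros(S), K, size=D).reshape(D, M, N)

# Auxiliary VAR(1) process and the matrix time series itself.
C = 0.5 * np.eye(D)
z = np.zeros((T, D)); X = np.zeros((T, M, N))
for t in range(1, T):
    z[t] = C @ z[t - 1] + rng.standard_normal(D)
    X[t] = (A1 @ X[t - 1] @ B1.T
            + np.tensordot(z[t - 1], G, axes=1)     # sum_d z_{t-1,d} * G_d
            + rng.standard_normal((M, N)))          # i.i.d. N(0, 1) noise
```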

The evaluation metric is the root mean squared error (RMSE), defined as \mathrm{RMSE}(\widehat{\boldsymbol{\Theta}})=\|\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\|_{\mathrm{F}}/\sqrt{d(\boldsymbol{\Theta}^{*})}, where d(\boldsymbol{\Theta}^{*}) is the number of elements in \boldsymbol{\Theta}^{*}. We consider \boldsymbol{\Theta}\in\{\mathbf{B}_{1}\otimes\mathbf{A}_{1},\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r},\mathbfcal{G}_{1},\mathbfcal{G}_{2},\mathbfcal{G}_{3}\} and report the average RMSE over \mathbfcal{G}_{1},\mathbfcal{G}_{2},\mathbfcal{G}_{3}. The dataset is configured with M\in\{5,10,20,40\} and N=M. For each M, we train the MARAC model with P=Q=1 on T_{\text{train}}\in\{1,5,10,20,40,80,160\}\times 10^{2} frames of the matrix time series, choose the tuning parameter \lambda based on the prediction RMSE over a held-out validation set with T_{\text{val}}=T_{\text{train}}/2, and evaluate the prediction performance on a 5{,}000-frame testing set. We simulate a sufficiently long time series, take the training set starting from the first frame, and place the validation set right after the training set. The testing set is always fixed as the last 5{,}000 frames of the time series. All results are reported over 20 repetitions in Figure 2.

Figure 2: Panels (a), (b), and (c) show the RMSE of the penalized MLE of the MARAC model. Panel (d) shows the testing set prediction RMSE minus 1, where 1 is the noise variance of the simulated time series. Panels (a)-(d) have both axes plotted on the \log_{10} scale. Panels (e) and (f) show the RMSE of the autoregressive parameters and the auxiliary covariate parameters against T\sqrt{S}, with both axes on the \log_{10} scale, together with fitted linear regression lines.

The result shows that all model estimators are consistent and that the convergence rate, under fixed spatial dimensionality, is close to 1/\sqrt{T} (the black line in panel (a) shows a reference line of O(1/\sqrt{T})), echoing the result in Theorem 9. As the spatial dimensionality S increases, the RMSE of \widehat{\mathbf{B}}_{1}\otimes\widehat{\mathbf{A}}_{1} becomes even smaller, echoing the result in (25) and Table 1. The RMSE of the nonparametric estimators \widehat{g}_{1},\widehat{g}_{2},\widehat{g}_{3}, under fixed spatial dimensionality, also decays at a rate of 1/\sqrt{T}, again echoing Theorem 9. The RMSE of the covariance matrix estimator \widehat{\boldsymbol{\Sigma}}_{c}\otimes\widehat{\boldsymbol{\Sigma}}_{r} confirms its consistency, in line with Proposition 8, and shows a convergence rate similar to that of \widehat{\mathbf{B}}_{1}\otimes\widehat{\mathbf{A}}_{1}, although we do not provide its exact convergence rate theoretically.

In this simulation, we fix the variance of each element of \mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right) at unity, so the optimal testing set prediction RMSE is also unity. When plotting the test prediction RMSE in panel (d), we subtract 1 from all RMSE values, so the reported RMSE should be interpreted as the RMSE for the signal part of the matrix time series. The test prediction RMSE converges to zero in all cases, and matrices of higher dimensionality typically require more training frames to reach the same prediction performance.

To validate the theoretical result for the high-dimensional MARAC in Theorem 12, we also plot the RMSE of \widehat{\mathbf{B}}_{1}\otimes\widehat{\mathbf{A}}_{1} and \widehat{g}_{1},\widehat{g}_{2},\widehat{g}_{3} against T\sqrt{S} in panels (e) and (f) of Figure 2. The trend lines are fitted by linear regression. They show that \widehat{\mathbf{B}}_{1}\otimes\widehat{\mathbf{A}}_{1} converges roughly at the rate 1/(\sqrt{T}\sqrt[4]{S}), indicating that c_{0,S}\asymp 1/\sqrt{S} under this specific simulation setup, and that the functional parameters converge at a rate of about (T\sqrt{S})^{-3/8}, which coincides with our simulation setup where r_{0}\approx 3/4 and with the theoretical result in the last row of Table 1.

All the results reported in Figure 2 are based on the penalized MLE framework without the kernel truncation introduced in Section 3.2. Kernel truncation speeds up the computation, especially when the matrix dimensionality is high, at the cost of over-smoothing the functional parameter estimates. We illustrate the performance of the kernel truncation method in the supplemental material.

5.2 Lag Selection Consistency

In Section 3.3, we propose selecting the lag parameters P and Q of the MARAC model using information criteria such as AIC and BIC. To validate the consistency of these model selection criteria, we simulate data from a MARAC(2,2) model with 5\times 5 matrix dimensionality. We consider a candidate model class with 1\leq P,Q\leq 4; each model is fitted with T\in\{1,2,4,8\}\times 10^{3} frames, with \lambda chosen on a held-out validation set. In Table 2, we report the proportion of times that AIC and BIC select the correct P and Q individually (first two numbers in each parenthesis) and (P,Q) jointly (last number in each parenthesis) over 100 repetitions.

Criterion | T=1\times 10^{3} | T=2\times 10^{3} | T=4\times 10^{3} | T=8\times 10^{3}
AIC | (.54,.99,.53) | (.55,.97,.53) | (.59,.96,.55) | (.65,.94,.59)
BIC | (1.00,.09,.09) | (.99,.56,.56) | (.97,.97,.94) | (.96,.99,.95)
Table 2: Proportion of times that AIC and BIC select the correct P (first number), Q (second number) and (P,Q) (third number) over 100 repetitions.

From Table 2, we find that AIC tends to select models with more autoregressive lags, whereas BIC performs consistently better as the sample size grows. This coincides with the findings in Hsu et al. (2021) for the matrix autoregression model.
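As an illustration of the selection procedure, the following sketch loops over the candidate grid and picks (P,Q) by BIC; `fit_marac` and `neg_log_lik` are hypothetical stand-ins for the penalized MLE routine and its Gaussian negative log-likelihood, and the effective parameter count is a generic placeholder.

```python
import numpy as np
from itertools import product

def select_lags_bic(X, z, fit_marac, neg_log_lik, max_lag=4):
    """Pick (P, Q) minimizing BIC = 2 * NLL + log(T_eff) * n_params over a candidate grid.

    `fit_marac(X, z, P, Q)` is assumed to return a fitted model exposing `n_params`
    (its effective number of parameters); `neg_log_lik(model, X, z)` is assumed to
    return the negative Gaussian log-likelihood on the training sample.
    """
    T = X.shape[0]
    best = (None, np.inf)
    for P, Q in product(range(1, max_lag + 1), repeat=2):
        model = fit_marac(X, z, P, Q)
        T_eff = T - max(P, Q)                      # usable frames after lagging
        bic = 2.0 * neg_log_lik(model, X, z) + np.log(T_eff) * model.n_params
        if bic < best[1]:
            best = ((P, Q), bic)
    return best[0]
```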

5.3 Comparison with Alternative Methods

We compare our MARAC model against other competing methods for the matrix autoregression task. We simulate the matrix time series \mathbf{X}_{t} from a MARAC(P,Q) model with P=Q\in\{1,2,3\}, and the vector time series \mathbf{z}_{t}\in\mathbb{R}^{3} from a VAR(1) process. The dataset is generated with T_{\text{train}}=T_{\text{val}}=T_{\text{test}}=2000. Under each (P,Q), we vary the matrix dimensionality with M=N\in\{5,10,20,40\}. We evaluate each method via the testing set prediction RMSE. Each simulation scenario is repeated 20 times.

Under each P,Q,M,NP,Q,M,N specification, we consider the following five competing methods besides our own MARAC(P,Q)(P,Q) model.

  1. MAR (Chen et al., 2021):

    𝐗t=p=1P𝐀p𝐗tp𝐁p+𝐄t,𝐯𝐞𝐜(𝐄t)𝒩(𝟎,𝚺c𝚺r).\mathbf{X}_{t}=\sum_{p=1}^{P}\mathbf{A}_{p}\mathbf{X}_{t-p}\mathbf{B}_{p}^{\top}+\mathbf{E}_{t},\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}).
  2. MAR with fixed-rank co-kriging (MAR+FRC) (Hsu et al., 2021):

    𝐗t=p=1P𝐀p𝐗tp𝐁p+𝐄t,𝐯𝐞𝐜(𝐄t)𝒩(𝟎,ση2𝐈+𝐅𝐌𝐅),\mathbf{X}_{t}=\sum_{p=1}^{P}\mathbf{A}_{p}\mathbf{X}_{t-p}\mathbf{B}_{p}^{\top}+\mathbf{E}_{t},\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)\sim\mathcal{N}(\mathbf{0},\sigma_{\eta}^{2}\mathbf{I}+\mathbf{F}\mathbf{M}\mathbf{F}^{\top}),

    where 𝐅MN×QD\mathbf{F}\in\mathbb{R}^{MN\times QD} is the multi-resolution spline basis (Tzeng and Huang, 2018).

  3. MAR followed by a tensor-on-scalar linear model (MAR+LM) (Li and Zhang, 2017):

    𝐗tp=1P𝐀^p𝐗tp𝐁^p=q=1Q𝒢ׯ𝒩ση\mathbf{X}_{t}-\sum_{p=1}^{P}\widehat{\mathbf{A}}_{p}\mathbf{X}_{t-p}\widehat{\mathbf{B}}_{p}^{\top}=\sum_{q=1}^{Q}\mathbfcal{G}_{q}\bar{\times}\mathbf{z}_{t-q}+\mathbf{E}_{t},\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)\sim\mathcal{N}(\mathbf{0},\sigma_{\eta}^{2}\mathbf{I}), (27)

    where 𝐀^p,𝐁^p\widehat{\mathbf{A}}_{p},\widehat{\mathbf{B}}_{p} come from a pre-trained MAR model and 𝒢\mathbfcal{G}_{q} can be a low-rank tensor. The MAR+LM model can be considered as a two-step procedure for fitting the MARAC model.

  4. Pixel-wise autoregression (Pixel-AR): for each i\in[M], j\in[N], we fit the following model (a minimal sketch of this baseline is given after the list):

    [𝐗t]ij=αij+p=1Pβijp[𝐗tp]ij+q=1Q𝜸ijq𝐳tq+[𝐄t]ij,[𝐄t]ij𝒩(0,σij2).[\mathbf{X}_{t}]_{ij}=\alpha_{ij}+\sum_{p=1}^{P}\beta_{ijp}[\mathbf{X}_{t-p}]_{ij}+\sum_{q=1}^{Q}\boldsymbol{\gamma}_{ijq}^{\top}\mathbf{z}_{t-q}+[\mathbf{E}_{t}]_{ij},\quad[\mathbf{E}_{t}]_{ij}\sim\mathcal{N}(0,\sigma_{ij}^{2}).
  5. Vector autoregression with exogenous predictors (VARX), which vectorizes the matrix time series and stacks it with the vector time series as predictors.
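For concreteness, here is a minimal sketch of the Pixel-AR baseline under the simplifying assumption P=Q=1 (illustration only; each pixel is fitted by ordinary least squares and any cross-pixel structure is ignored).

```python
import numpy as np

def fit_pixel_ar(X, z):
    """Pixel-wise AR(1) with one lag of the exogenous series, fitted by OLS per pixel.

    X: (T, M, N) matrix time series; z: (T, D) auxiliary series.
    Returns coef[i, j] = (alpha_ij, beta_ij1, gamma_ij1) stacked per pixel.
    """
    T, M, N = X.shape
    D = z.shape[1]
    coef = np.zeros((M, N, 2 + D))
    for i in range(M):
        for j in range(N):
            y = X[1:, i, j]                                   # response at pixel (i, j)
            Z = np.column_stack([np.ones(T - 1),              # intercept alpha_ij
                                 X[:-1, i, j],                # own lag-1 value
                                 z[:-1]])                     # lagged auxiliary covariates
            coef[i, j], *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

def predict_pixel_ar(coef, X_last, z_last):
    """One-step-ahead prediction given the last frame and the last auxiliary vector."""
    M, N, _ = coef.shape
    feats = lambda i, j: np.concatenate([[1.0, X_last[i, j]], z_last])
    return np.array([[coef[i, j] @ feats(i, j) for j in range(N)] for i in range(M)])
```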

The average prediction RMSE over the 20 repeated runs is plotted in Figure 3. Overall, our MARAC model outperforms the competing methods across matrix dimensionalities and lags. We make two additional remarks. First, when the matrix is small (e.g., 5\times 5), the vector autoregression model (VARX) performs almost as well as the MARAC model and better than the other methods. However, its performance deteriorates quickly as the matrix grows, indicating that sufficient dimension reduction is needed for large matrix time series; the MARAC model can be viewed as a parsimonious version of VARX for this purpose. Second, MAR, MAR with fixed-rank co-kriging (MAR+FRC), and the two-step MARAC (MAR+LM) all perform worse than MARAC. This shows that, when auxiliary time series predictors are present, it is sub-optimal to exclude them from the model (MAR), to incorporate them implicitly through the covariance structure (MAR+FRC), or to fit them separately in a tensor-on-scalar regression (MAR+LM). Modeling the matrix and vector predictors jointly in a unified framework such as MARAC improves prediction performance.

Figure 3: Testing set prediction RMSE comparison across six competing methods on the matrix autoregression task. The four panels correspond to four matrix dimensionalities (labeled in the top-left corner of each panel). The test prediction RMSE is reduced by 1 for better visualization, where 1 is the noise variance of the simulated data. Error bars show the 95\% CI over the 20 repeated runs. The spacing between ticks along the y-axis is rearranged via a square-root transformation for better visualization.

6 Application to Global Total Electron Content Forecast

For real data applications, we consider the problem of predicting the global total electron content (TEC) distribution, briefly introduced in Section 1. We use the IGS (International GNSS Service) TEC data, freely available from the National Aeronautics and Space Administration (NASA) Crustal Dynamics Data Information System (Hernández-Pajares et al., 2009). The spatio-temporal resolution of the data is 2.5^{\circ} (latitude) \times 5^{\circ} (longitude) \times 15 (minutes). We use the data for September 2017; the whole month forms a matrix time series with T=2880 and M=71, N=73. For the auxiliary covariates, we use the 15-minute resolution IMF Bz and Sym-H time series, which are parameters related to the near-Earth magnetic field and plasma (Papitashvili et al., 2014), as well as the daily F10.7 index, which measures the solar radio flux at 10.7 cm. The IMF Bz and Sym-H time series are accessed from the OMNI dataset (Papitashvili and King, 2020) and the F10.7 index from the NOAA data repository (Tapping, 2013). These covariates measure the solar wind strength; strong solar wind can lead to geomagnetic storms that significantly increase the global TEC.

We formulate our MARAC model for the TEC prediction problem as:

\text{TEC}_{t+h}=\sum_{p=1}^{P}\mathbf{A}_{p}\text{TEC}_{t-p}\mathbf{B}_{p}^{\top}+\sum_{q=1}^{Q}\mathbfcal{G}_{q}\bar{\times}\mathbf{z}_{t-q}+\mathbf{E}_{t}, (28)

where h is the forecast latency and \mathbf{z}_{t}\in\mathbb{R}^{3} contains the IMF Bz, Sym-H and F10.7 indices at time t. We consider forecasting with h\in\{1,2,\ldots,24\}, corresponding to forecasts from 15 minutes ahead up to 6 hours ahead. At each latency, we fit our MARAC(P,Q) model following (28) with 1\leq P,Q\leq 3, using the kernel truncation approximation with R=121 basis functions from the truncated Lebedev kernel. For comparison, we also fit the MAR model with 1\leq P\leq 3 and the MAR+LM model with 1\leq P,Q\leq 3; see the definition of the MAR+LM model in (27). As a benchmark, we use \text{TEC}_{t-1} to predict \text{TEC}_{t+h} and call this the persistence model.

The 2{,}880 frames of matrix data are split into a 70\% training set, a 15\% validation set, and a 15\% testing set in chronological order. We choose the tuning parameter \lambda for MARAC based on the validation set prediction RMSE, and the lag parameters P,Q for all models based on the BIC. To speed up the computation, we assume that the matrices \boldsymbol{\Sigma}_{r},\boldsymbol{\Sigma}_{c} are diagonal when fitting all models. All data splits are centered using the means of the matrix and vector time series computed on the training set.
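A minimal sketch of this evaluation protocol (illustration only; the array shapes follow the description above, and the persistence benchmark simply reuses the most recent available frame):

```python
import numpy as np

def chrono_split(X, z, frac=(0.7, 0.15, 0.15)):
    """Chronological 70/15/15 split of the matrix and auxiliary series."""
    T = X.shape[0]
    i1, i2 = int(frac[0] * T), int((frac[0] + frac[1]) * T)
    return (X[:i1], z[:i1]), (X[i1:i2], z[i1:i2]), (X[i2:], z[i2:])

def center_with_train_mean(train, others):
    """Subtract the training-set mean frame / mean vector from every split."""
    X_tr, z_tr = train
    X_mean, z_mean = X_tr.mean(axis=0), z_tr.mean(axis=0)
    demean = lambda X, z: (X - X_mean, z - z_mean)
    return demean(*train), [demean(*o) for o in others]

def persistence_rmse(X_test, h):
    """Pixel-wise RMSE of the persistence benchmark: predict X_{t+h} by X_{t-1}."""
    pred, truth = X_test[: -(h + 1)], X_test[h + 1:]
    return np.sqrt(np.mean((pred - truth) ** 2))

# Example usage (X: (T, M, N) TEC frames, z: (T, 3) auxiliary indices):
# train, val, test = chrono_split(X, z)
# train, (val, test) = center_with_train_mean(train, [val, test])
# print(persistence_rmse(test[0], h=16))   # 4-hour-ahead persistence benchmark
```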

In Figure 4(A), we report the pixel-wise prediction RMSE on the testing set. The result shows that when the latency time is short, the matrix autoregressive (MAR) model is sufficient for the TEC prediction. As the latency time increases to around 4 to 5 hours, the auxiliary time series improves the prediction performance relative to the MAR model. This coincides with the domain intuition that disturbances from the solar wind affect the global TEC distribution with a delay of up to several hours. The additional gain from incorporating the auxiliary covariates vanishes as the latency time increases further, indicating that the correlation between the solar wind and the global TEC is weak beyond a 6-hour separation.

In Figure 4(B), we visualize an example of the TEC prediction across the competing methods under the 4-hour latency scenario (i.e., h=16). The MAR and MAR+LM predictions are similar and do not resemble the ground truth well. The global TEC typically has two peaks located symmetrically around the equator, and both models fail to capture this, providing instead a single patch in the middle. The MARAC model, in contrast, captures this fine-scale structure in its prediction. To further showcase the MARAC prediction, we decompose it into the autoregressive component and the auxiliary covariate component and visualize them separately. The auxiliary covariate component highlights a sub-region of high TEC values in the southern hemisphere, complementing the prediction made by the autoregressive component.

Figure 4: IGS TEC prediction results. Panel (A) shows the testing set prediction RMSE across four competing methods under 24 different latency times. Panel (B) shows an example of the predicted TEC at 10:45:00 UT, 2017-Sep-28, under the 4-hour latency time scenario. Note that the “MARAC-Auxiliary Part” plot has a different color bar underneath it and that color bar applies to it exclusively.

7 Summary

In this paper, we propose a new methodology for spatio-temporal matrix autoregression with non-spatial exogenous vector covariates. The model has an autoregressive component with bi-linear transformations of the lagged matrix predictors and an additive auxiliary covariate component with a tensor-vector product between a tensor coefficient and the lagged vector covariates. We propose a penalized MLE estimation approach with a squared RKHS norm penalty and establish the estimator asymptotics under both fixed and high matrix dimensionality. The model's efficacy is validated with numerical experiments and an application to the global TEC forecast.

The application of our model can be extended to other spatial data with exogenous, non-spatial predictors. It is not restricted to matrix-valued data and can be generalized to the tensor setting, and potentially to data without a grid structure or with missing entries. Furthermore, our model nests a simpler model without the autoregressive term, i.e., P=0, and can thus be applied to matrix-on-scalar regression with spatial data. We leave these extensions to future research.

Supplementary Materials

The supplemental material contains technical proofs of all theorems and propositions of the paper and additional details and results of the simulation experiments.

Acknowledgements

The authors thank Shasha Zou, Zihan Wang, and Yizhou Zhang for helpful discussions on the TEC data. YC acknowledges support from NSF DMS 2113397 and NSF PHY 2027555.

References

  • Akaike (1998) Hirotogu Akaike. Information Theory and an Extension of the Maximum Likelihood Principle. Selected papers of hirotugu akaike, pages 199–213, 1998.
  • Attouch et al. (2013) Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of Descent Methods for Semi-Algebraic and Tame Problems: Proximal Algorithms, Forward–Backward Splitting, and Regularized Gauss–Seidel Methods. Mathematical Programming, 137(1-2):91–129, 2013.
  • Braun (2006) Mikio L Braun. Accurate Error Bounds for the Eigenvalues of the Kernel Matrix. The Journal of Machine Learning Research, 7:2303–2328, 2006.
  • Cai and Yuan (2012) T Tony Cai and Ming Yuan. Minimax and Adaptive Prediction for Functional Linear Regression. Journal of the American Statistical Association, 107(499):1201–1216, 2012.
  • Chen et al. (2021) Rong Chen, Han Xiao, and Dan Yang. Autoregressive Models for Matrix-valued Time Series. Journal of Econometrics, 222(1):539–560, 2021.
  • Cheng and Shang (2015) Guang Cheng and Zuofeng Shang. Joint Asymptotics for Semi-nonparametric Regression Models with Partially Linear Structure. The Annals of Statistics, 43:1351–1390, 2015.
  • Cressie (1986) Noel Cressie. Kriging Nonstationary Data. Journal of the American Statistical Association, 81(395):625–634, 1986.
  • Cressie and Johannesson (2008) Noel Cressie and Gardar Johannesson. Fixed Rank Kriging for Very Large Spatial Data Sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226, 2008.
  • Cressie and Wikle (2015) Noel Cressie and Christopher K Wikle. Statistics for Spatio-Temporal Data. John Wiley & Sons, 2015.
  • Cui et al. (2018) Wenquan Cui, Haoyang Cheng, and Jiajing Sun. An RKHS-based Approach to Double-Penalized Regression in High-dimensional Partially Linear Models. Journal of Multivariate Analysis, 168:201–210, 2018.
  • Dong et al. (2020) Mingwang Dong, Linfu Huang, Xueqin Wu, and Qingguang Zeng. Application of Least-Squares Method to Time Series Analysis for 4dpm Matrix. IOP Conference Series: Earth and Environmental Science, 455(1):012200, feb 2020. doi: 10.1088/1755-1315/455/1/012200. URL https://dx.doi.org/10.1088/1755-1315/455/1/012200.
  • Fosdick and Hoff (2014) BK Fosdick and PD Hoff. Separable Factor Analysis with Applications to Mortality Data. The Annals of Applied Statistics, 8(1):120–147, 2014.
  • Gu (2013) Chong Gu. Smoothing Spline ANOVA models, 2nd edition. Springer, New York, 2013.
  • Guha and Guhaniyogi (2021) Sharmistha Guha and Rajarshi Guhaniyogi. Bayesian Generalized Sparse Symmetric Tensor-on-Vector Regression. Technometrics, 63(2):160–170, 2021.
  • Guhaniyogi et al. (2017) Rajarshi Guhaniyogi, Shaan Qamar, and David B Dunson. Bayesian Tensor Regression. The Journal of Machine Learning Research, 18(1):2733–2763, 2017.
  • Hall and Heyde (2014) Peter Hall and Christopher C Heyde. Martingale limit theory and its application. Academic press, 2014.
  • Hamilton (2020) James D Hamilton. Time Series Analysis. Princeton University Press, 2020.
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition. Springer, New York, 2009.
  • Hernández-Pajares et al. (2009) Manuel Hernández-Pajares, JM Juan, J Sanz, R Orus, A Garcia-Rigo, J Feltens, A Komjathy, SC Schaer, and A Krankowski. The IGS VTEC Maps: a Reliable Source of Ionospheric Information since 1998. Journal of Geodesy, 83:263–275, 2009.
  • Hoff (2011) Peter D Hoff. Separable Covariance Arrays via the Tucker Product, with Applications to Multivariate Relational Data. Bayesian Analysis, 6(2):179–196, 2011.
  • Hsu et al. (2021) Nan-Jung Hsu, Hsin-Cheng Huang, and Ruey S Tsay. Matrix Autoregressive Spatio-Temporal Models. Journal of Computational and Graphical Statistics, 30(4):1143–1155, 2021.
  • Kang et al. (2018) Jian Kang, Brian J Reich, and Ana-Maria Staicu. Scalar-on-Image Regression via the Soft-Thresholded Gaussian Process. Biometrika, 105(1):165–184, 2018.
  • Kennedy et al. (2013) Rodney A Kennedy, Parastoo Sadeghi, Zubair Khalid, and Jason D McEwen. Classification and Construction of Closed-form Kernels for Signal Representation on the 2-sphere. In Wavelets and Sparsity XV, volume 8858, pages 169–183. SPIE, 2013.
  • Kolda and Bader (2009) Tamara G Kolda and Brett W Bader. Tensor Decompositions and Applications. SIAM review, 51(3):455–500, 2009.
  • Koltchinskii and Giné (2000) Vladimir Koltchinskii and Evarist Giné. Random Matrix Approximation of Spectra of Integral Operators. Bernoulli, pages 113–167, 2000.
  • Li and Zhang (2017) Lexin Li and Xin Zhang. Parsimonious Tensor Response Regression. Journal of the American Statistical Association, 112(519):1131–1146, 2017.
  • Li et al. (2018) Xiaoshan Li, Da Xu, Hua Zhou, and Lexin Li. Tucker Tensor Regression and Neuroimaging Analysis. Statistics in Biosciences, 10(3):520–545, 2018.
  • Li and Xiao (2021) Zebang Li and Han Xiao. Multi-linear Tensor Autoregressive Models. arXiv preprint arXiv:2110.00928, 2021.
  • Liu et al. (2020) Yipeng Liu, Jiani Liu, and Ce Zhu. Low-rank Tensor Train Coefficient Array Estimation for Tensor-on-Tensor Regression. IEEE Transactions on Neural Networks and Learning Systems, 31(12):5402–5411, 2020.
  • Lock (2018) Eric F Lock. Tensor-on-Tensor Regression. Journal of Computational and Graphical Statistics, 27(3):638–647, 2018.
  • Luo and Zhang (2022) Yuetian Luo and Anru R Zhang. Tensor-on-Tensor Regression: Riemannian Optimization, Over-Parameterization, Statistical-Computational gap, and their Interplay. arXiv preprint arXiv:2206.08756, 2022.
  • Papadogeorgou et al. (2021) Georgia Papadogeorgou, Zhengwu Zhang, and David B Dunson. Soft Tensor Regression. J. Mach. Learn. Res., 22:219–1, 2021.
  • Papitashvili and King (2020) Natalia E. Papitashvili and Joseph H. King. Omni 5-min Data [Data set]. NASA Space Physics Data Facility, 2020. https://doi.org/10.48322/gbpg-5r77.
  • Papitashvili et al. (2014) Natasha Papitashvili, Dieter Bilitza, and Joseph King. OMNI: a Description of Near-Earth Solar Wind Environment. 40th COSPAR scientific assembly, 40:C0–1, 2014.
  • Rabusseau and Kadri (2016) Guillaume Rabusseau and Hachem Kadri. Low-rank Regression with Tensor Responses. Advances in Neural Information Processing Systems, 29, 2016.
  • Rudelson and Vershynin (2013) Mark Rudelson and Roman Vershynin. Hanson-Wright Inequality and Sub-Gaussian Concentration. Electronic Communications in Probability, pages 1–9, 2013.
  • Schölkopf et al. (2001) Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A Generalized Representer Theorem. In International conference on computational learning theory, pages 416–426. Springer, 2001.
  • Schwarz (1978) Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, pages 461–464, 1978.
  • Shang and Cheng (2013) Zuofeng Shang and Guang Cheng. Local and Global Asymptotic Inference in Smoothing Spline Models. The Annals of Statistics, 41:2608–2638, 2013.
  • Shang and Cheng (2015) Zuofeng Shang and Guang Cheng. Nonparametric Inference in Generalized Functional Linear Models. The Annals of Statistics, 43:1742–1773, 2015.
  • Shen et al. (2022) Bo Shen, Weijun Xie, and Zhenyu Kong. Smooth Robust Tensor Completion for Background/Foreground Separation with Missing Pixels: Novel Algorithm with Convergence Guarantee. The Journal of Machine Learning Research, 23(1):9757–9796, 2022.
  • Stock and Watson (2001) James H Stock and Mark W Watson. Vector Autoregressions. Journal of Economic perspectives, 15(4):101–115, 2001.
  • Sun et al. (2022) Hu Sun, Zhijun Hua, Jiaen Ren, Shasha Zou, Yuekai Sun, and Yang Chen. Matrix Completion Methods for the Total Electron Content Video Reconstruction. The Annals of Applied Statistics, 16(3):1333–1358, 2022.
  • Sun et al. (2023) Hu Sun, Ward Manchester, Meng Jin, Yang Liu, and Yang Chen. Tensor Gaussian Process with Contraction for Multi-Channel Imaging Analysis. In International Conference on Machine Learning, pages 32913–32935. PMLR, 2023.
  • Sun and Li (2017) Will Wei Sun and Lexin Li. STORE: Sparse Tensor Response Regression and Neuroimaging Analysis. The Journal of Machine Learning Research, 18(1):4908–4944, 2017.
  • Tapping (2013) KF Tapping. The 10.7 cm Solar Radio Flux (F10. 7). Space weather, 11(7):394–406, 2013.
  • Tzeng and Huang (2018) ShengLi Tzeng and Hsin-Cheng Huang. Resolution Adaptive Fixed Rank Kriging. Technometrics, 60(2):198–208, 2018.
  • van Zanten and van der Vaart (2008) JH van Zanten and Aad W van der Vaart. Reproducing Kernel Hilbert Spaces of Gaussian Priors. In Pushing the limits of contemporary statistics: contributions in honor of Jayanta K. Ghosh, pages 200–222. Institute of Mathematical Statistics, 2008.
  • Wang et al. (2024) Di Wang, Yao Zheng, and Guodong Li. High-Dimensional Low-rank Tensor Autoregressive Time Series Modeling. Journal of Econometrics, 238(1):105544, 2024.
  • Wang et al. (2019) Dong Wang, Xialu Liu, and Rong Chen. Factor Models for Matrix-valued High-dimensional Time Series. Journal of econometrics, 208(1):231–248, 2019.
  • Wang et al. (2017) Xiao Wang, Hongtu Zhu, and Alzheimer’s Disease Neuroimaging Initiative. Generalized Scalar-on-Image Regression Models via Total Variation. Journal of the American Statistical Association, 112(519):1156–1168, 2017.
  • Wang et al. (2021) Zihan Wang, Shasha Zou, Lei Liu, Jiaen Ren, and Ercha Aa. Hemispheric Asymmetries in the Mid-latitude Ionosphere During the September 7–8, 2017 Storm: Multi-instrument Observations. Journal of Geophysical Research: Space Physics, 126:e2020JA028829, 4 2021. ISSN 2169-9402. doi: 10.1029/2020JA028829.
  • Williams and Rasmussen (2006) Christopher K Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT press Cambridge, MA, 2006.
  • Yang et al. (2020) Yun Yang, Zuofeng Shang, and Guang Cheng. Non-asymptotic Analysis for Nonparametric Testing. In 33rd Annual Conference on Learning Theory, pages 1–47. ACM, 2020.
  • Younas et al. (2022) Waqar Younas, Majid Khan, C. Amory-Mazaudier, Paul O. Amaechi, and R. Fleury. Middle and Low Latitudes Hemispheric Asymmetries in ΣO/N2\Sigma O/N2 and TEC during Intense Magnetic Storms of Solar Cycle 24. Advances in Space Research, 69:220–235, 1 2022.
  • Yuan and Cai (2010) Ming Yuan and T Tony Cai. A Reproducing Kernel Hilbert Space Approach to Functional Linear Regression. The Annals of Statistics, 38(6):3412–3444, 2010.
  • Zhou et al. (2013) Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor Regression with Applications in Neuroimaging Data Analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.

SUPPLEMENTARY MATERIAL

This supplemental material is organized as follows. In Section A, we prove Proposition 1 on the equivalence of the estimation problem of MARAC to a kernel ridge regression problem. In Section B, we prove Theorem 6 on the joint stationarity condition of the matrix and auxiliary vector time series. Then in Section C, we provide proofs of the theoretical results under fixed spatial dimensionality, including Proposition 8 and Theorem 9. In Section D, we present proofs of the theoretical results under high spatial dimensionality, namely Theorem 12. All essential lemmas used throughout the proofs are presented and proved in Section E. Finally, we include additional details and results of the simulation experiments in Section F.

In this supplemental material, we use \bar{\rho}(\cdot), \rho_{i}(\cdot), \underaccent{\bar}{\rho}(\cdot) and \|\cdot\|_{s} to denote the maximum eigenvalue, the i^{\rm{th}} largest eigenvalue, the minimum eigenvalue, and the spectral norm of a matrix, respectively. We use a\vee b and a\wedge b to denote the maximum and minimum of a and b, respectively. For two sequences of random variables X_{n},Y_{n}, we write X_{n}\lesssim Y_{n} when X_{n}/Y_{n}=O_{P}(1) and X_{n}\gtrsim Y_{n} when Y_{n}/X_{n}=O_{P}(1). We write X_{n}\asymp Y_{n} when both X_{n}\lesssim Y_{n} and X_{n}\gtrsim Y_{n} hold.

Appendix A Proof of Proposition 1

Proof  For each function gq,d()kg_{q,d}(\cdot)\in\mathbb{H}_{k}, we can decompose it as follows:

gq,d()=s𝕊γq,d,sk(,s)+j=1Jαq,d,jϕj()+hq,d(),g_{q,d}(\cdot)=\sum_{s\in\mathbb{S}}\gamma_{q,d,s}k(\cdot,s)+\sum_{j=1}^{J}\alpha_{q,d,j}\phi_{j}(\cdot)+h_{q,d}(\cdot),

where hq,d()h_{q,d}(\cdot) does not belong to the null space of k\mathbb{H}_{k} nor the span of {k(,s)|s𝕊}\{k(\cdot,s)|s\in\mathbb{S}\}. Here we assume that the null space of k\mathbb{H}_{k} contains only the zero function, so ϕj()=0, for all j\phi_{j}(\cdot)=0,\text{ for all }j.

By the reproducing property of the kernel k(\cdot,\cdot), we have \langle g_{q,d},k(\cdot,s^{\prime})\rangle_{\mathbb{H}_{k}}=g_{q,d}(s^{\prime})=\sum_{s\in\mathbb{S}}\gamma_{q,d,s}k(s,s^{\prime}), which does not involve h_{q,d}(\cdot); therefore h_{q,d}(\cdot) does not affect the prediction of \mathbf{x}_{t} in the MARAC model. In addition, for any h_{q,d}(\cdot)\notin\text{span}(\{k(\cdot,s)|s\in\mathbb{S}\}), we have:

gq,dk2=γq,d𝐊γq,d+hq,dk2s𝕊γq,d,sk(,s)k2,\|g_{q,d}\|_{\mathbb{H}_{k}}^{2}=\gamma_{q,d}^{\top}\mathbf{K}\gamma_{q,d}+\|h_{q,d}\|_{\mathbb{H}_{k}}^{2}\geq\|\sum_{s\in\mathbb{S}}\gamma_{q,d,s}k(\cdot,s)\|_{\mathbb{H}_{k}}^{2},

and the equality holds only if h_{q,d}(\cdot)=0. Consequently, the global minimizer of the constrained optimization problem (7) must have h_{q,d}(\cdot)=0. It then follows that the squared RKHS norm penalty for g_{q,d} can be written as \gamma_{q,d}^{\top}\mathbf{K}\gamma_{q,d} and that the tensor coefficient \mathbfcal{G}_{q} satisfies \mathrm{\mathbf{vec}}\left([\mathbfcal{G}_{q}]_{::d}\right)=\mathbf{K}\gamma_{q,d}. The remainder of the proof follows from simple linear algebra and is omitted.  

Appendix B Proof of Theorem 6

Proof  Under Assumption 4, which states that the vector time series \mathbf{z}_{t} follows a VAR(\widetilde{Q}) process, the vectorized matrix time series \mathbf{x}_{t} and the vector time series \mathbf{z}_{t} jointly follow a VAR(\max(P,Q,\widetilde{Q})) process, namely,

[𝐱t𝐳t]=l=1max(P,Q,Q~)[(𝐁l𝐀l)𝟏{lP}𝐆l𝟏{lQ}𝐎D×S𝐂l𝟏{lQ~}][𝐱tl𝐳tl]+[𝐞t𝝂t].\begin{bmatrix}\mathbf{x}_{t}\\ \mathbf{z}_{t}\end{bmatrix}=\sum_{l=1}^{\max(P,Q,\widetilde{Q})}\begin{bmatrix}\left(\mathbf{B}_{l}\otimes\mathbf{A}_{l}\right)\odot\mathbf{1}_{\{l\leq P\}}&\mathbf{G}_{l}^{\top}\odot\mathbf{1}_{\{l\leq Q\}}\\ \mathbf{O}_{D\times S}&\mathbf{C}_{l}\odot\mathbf{1}_{\{l\leq\widetilde{Q}\}}\end{bmatrix}\begin{bmatrix}\mathbf{x}_{t-l}\\ \mathbf{z}_{t-l}\end{bmatrix}+\begin{bmatrix}\mathbf{e}_{t}\\ \boldsymbol{\nu}_{t}\end{bmatrix}. (29)

Let L=\max(P,Q,\widetilde{Q}) and \mathbf{y}_{t}=[\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}. Denote the lag-l transition matrix in (29) by \mathbf{J}_{l}\in\mathbb{R}^{(S+D)\times(S+D)} and the error term by \mathbf{u}_{t}=[\mathbf{e}_{t}^{\top},\boldsymbol{\nu}_{t}^{\top}]^{\top}; then we can rewrite the VAR(L) process in (29) as a VAR(1) process:

[𝐲t𝐲t1𝐲tL+1]=[𝐉1𝐉2𝐉L1𝐉L𝐈S+D𝐎S+D𝐎S+D𝐎S+D𝐈S+D𝐎S+D𝐎S+D𝐎S+D𝐎S+D𝐈S+D𝐎S+D][𝐲t1𝐲t2𝐲tL]+[𝐮t𝟎S+D𝟎S+D],\begin{bmatrix}\mathbf{y}_{t}\\ \mathbf{y}_{t-1}\\ \vdots\\ \mathbf{y}_{t-L+1}\end{bmatrix}=\begin{bmatrix}\mathbf{J}_{1}&\mathbf{J}_{2}&\cdots&\mathbf{J}_{L-1}&\mathbf{J}_{L}\\ \mathbf{I}_{S+D}&\mathbf{O}_{S+D}&\cdots&\cdots&\mathbf{O}_{S+D}\\ \mathbf{O}_{S+D}&\mathbf{I}_{S+D}&\mathbf{O}_{S+D}&\cdots&\mathbf{O}_{S+D}\\ \vdots&\vdots&\ddots&\ddots&\vdots\\ \mathbf{O}_{S+D}&\mathbf{O}_{S+D}&\cdots&\mathbf{I}_{S+D}&\mathbf{O}_{S+D}\end{bmatrix}\begin{bmatrix}\mathbf{y}_{t-1}\\ \mathbf{y}_{t-2}\\ \vdots\\ \mathbf{y}_{t-L}\end{bmatrix}+\begin{bmatrix}\mathbf{u}_{t}\\ \mathbf{0}_{S+D}\\ \vdots\\ \mathbf{0}_{S+D}\end{bmatrix}, (30)

where we use 𝐎S+D\mathbf{O}_{S+D} to denote a zero matrix of size (S+D)×(S+D)(S+D)\times(S+D). For this VAR(1)(1) process to be stationary, we require that det(λ𝐈𝐉)0\mathrm{det}\left(\lambda\mathbf{I}-\mathbf{J}\right)\neq 0 for all |λ|1,λ|\lambda|\geq 1,\lambda\in\mathbb{C}, where 𝐉\mathbf{J} is the transition matrix in (30). The determinant det(λ𝐈𝐉)\mathrm{det}\left(\lambda\mathbf{I}-\mathbf{J}\right) can be simplified by column operations as:

det(λ𝐈𝐉)\displaystyle\mathrm{det}\left(\lambda\mathbf{I}-\mathbf{J}\right)
=det[λL𝐈Sl=1LλLl(𝐁l𝐀l)𝟏{lP}l=1LλLl𝐆l𝟏{lQ}𝐎λL𝐈Dl=1LλLl𝐂l𝟏{lQ~}]\displaystyle=\mathrm{det}\begin{bmatrix}\lambda^{L}\mathbf{I}_{S}-\sum\limits_{l=1}^{L}\lambda^{L-l}\left(\mathbf{B}_{l}\otimes\mathbf{A}_{l}\right)\odot\mathbf{1}_{\{l\leq P\}}&-\sum\limits_{l=1}^{L}\lambda^{L-l}\mathbf{G}_{l}^{\top}\odot\mathbf{1}_{\{l\leq Q\}}\\ \mathbf{O}&\lambda^{L}\mathbf{I}_{D}-\sum\limits_{l=1}^{L}\lambda^{L-l}\mathbf{C}_{l}\odot\mathbf{1}_{\{l\leq\widetilde{Q}\}}\end{bmatrix}
\displaystyle=\lambda^{L(S+D)}\mathrm{det}\left[\boldsymbol{\Phi}_{1}(\lambda)\right]\mathrm{det}\left[\boldsymbol{\Phi}_{2}(\lambda)\right],

where 𝚽1(λ)=𝐈Sp=1Pλp(𝐁p𝐀p)\boldsymbol{\Phi}_{1}(\lambda)=\mathbf{I}_{S}-\sum_{p=1}^{P}\lambda^{-p}\left(\mathbf{B}_{p}\otimes\mathbf{A}_{p}\right) and 𝚽2(λ)=𝐈Dq~=1Q~λq~𝐂q~\boldsymbol{\Phi}_{2}(\lambda)=\mathbf{I}_{D}-\sum_{\widetilde{q}=1}^{\widetilde{Q}}\lambda^{-\widetilde{q}}\mathbf{C}_{\widetilde{q}}, and setting y=1/λy=1/\lambda completes the proof.  
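As a numerical illustration of this root condition (illustration only; the input shapes and helper names are assumptions of the sketch), one can form the companion matrix \mathbf{J} of the VAR(1) representation (30) and check that its spectral radius is below one, which is equivalent to requiring \mathrm{det}(\lambda\mathbf{I}-\mathbf{J})\neq 0 for all |\lambda|\geq 1.

```python
import numpy as np

def marac_companion(A_list, B_list, G_list, C_list):
    """Companion matrix of the joint VAR for (vec(X_t), z_t).

    A_list/B_list: MARAC autoregressive factors (length P);
    G_list: S x D matrices mapping z_{t-q} to vec(X_t) (length Q);
    C_list: VAR coefficients of z_t (length Q_tilde).
    """
    S = A_list[0].shape[0] * B_list[0].shape[0]
    D = C_list[0].shape[0]
    L = max(len(A_list), len(G_list), len(C_list))
    J_blocks = []
    for l in range(L):
        top_left = np.kron(B_list[l], A_list[l]) if l < len(A_list) else np.zeros((S, S))
        top_right = G_list[l] if l < len(G_list) else np.zeros((S, D))
        bottom = np.hstack([np.zeros((D, S)),
                            C_list[l] if l < len(C_list) else np.zeros((D, D))])
        J_blocks.append(np.vstack([np.hstack([top_left, top_right]), bottom]))
    n = S + D
    J = np.zeros((L * n, L * n))
    J[:n] = np.hstack(J_blocks)                 # first block row: [J_1, ..., J_L]
    J[n:, :-n] = np.eye((L - 1) * n)            # identity blocks on the sub-diagonal
    return J

def is_stationary(A_list, B_list, G_list, C_list):
    """Spectral radius of the companion matrix below one <=> the root condition above."""
    J = marac_companion(A_list, B_list, G_list, C_list)
    return np.max(np.abs(np.linalg.eigvals(J))) < 1
```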

Appendix C Theory under Fixed Spatial Dimension

C.1 Proof of Proposition 8

Proof  For brevity of presentation, we fix P=Q=1, but the proof below extends readily to arbitrary P and Q. The vectorized MARAC(1,1) model (4) can equivalently be written as:

𝐱t=𝐲t𝜽+𝐞t,\mathbf{x}_{t}=\mathbf{y}_{t}\boldsymbol{\theta}+\mathbf{e}_{t}, (31)

where 𝐲t=[𝐱t1𝐈S;𝐳t1𝐊]\mathbf{y}_{t}=[\mathbf{x}_{t-1}^{\top}\otimes\mathbf{I}_{S};\mathbf{z}_{t-1}^{\top}\otimes\mathbf{K}] and 𝜽=[𝐯𝐞𝐜(𝐁1𝐀1),𝜸1]\boldsymbol{\theta}=[\mathrm{\mathbf{vec}}\left(\mathbf{B}_{1}\otimes\mathbf{A}_{1}\right)^{\top},\boldsymbol{\gamma}_{1}^{\top}]^{\top}. Using 𝛀=𝚺1\boldsymbol{\Omega}=\boldsymbol{\Sigma}^{-1} to denote the precision matrix for 𝐞t\mathbf{e}_{t}, we can rewrite the penalized likelihood in (11) for (𝜽,𝛀)(\boldsymbol{\theta},\boldsymbol{\Omega}) as:

h(𝜽,𝛀)=12log|𝛀|+12tr(𝛀𝐒(𝜽))+λ2𝜽𝐊~𝜽,h(\boldsymbol{\theta},\boldsymbol{\Omega})=-\frac{1}{2}\log\left|\boldsymbol{\Omega}\right|+\frac{1}{2}\text{tr}\left(\boldsymbol{\Omega}\mathbf{S}(\boldsymbol{\theta})\right)+\frac{\lambda}{2}\boldsymbol{\theta}^{\top}\widetilde{\mathbf{K}}\boldsymbol{\theta}, (32)

where 𝐒(𝜽)=T1t=1T(𝐱t𝐲t𝜽)(𝐱t𝐲t𝜽)\mathbf{S}(\boldsymbol{\theta})=T^{-1}\sum_{t=1}^{T}(\mathbf{x}_{t}-\mathbf{y}_{t}\boldsymbol{\theta})(\mathbf{x}_{t}-\mathbf{y}_{t}\boldsymbol{\theta})^{\top}, 𝐊~\widetilde{\mathbf{K}} is defined as:

𝐊~=[𝐎S×S𝐊𝐎S×D𝐊𝐎D×S𝐊𝐈D𝐊].\widetilde{\mathbf{K}}=\begin{bmatrix}\mathbf{O}_{S\times S}\otimes\mathbf{K}&\mathbf{O}_{S\times D}\otimes\mathbf{K}\\ \mathbf{O}_{D\times S}\otimes\mathbf{K}&\mathbf{I}_{D}\otimes\mathbf{K}\end{bmatrix}.

We use 𝜽,𝛀\boldsymbol{\theta}^{*},\boldsymbol{\Omega}^{*} to denote the ground truth of 𝜽,𝛀\boldsymbol{\theta},\boldsymbol{\Omega}, respectively. We define 𝔽𝜽\mathbb{F}_{\boldsymbol{\theta}} and 𝔽𝛀\mathbb{F}_{\boldsymbol{\Omega}} as:

𝔽𝜽\displaystyle\mathbb{F}_{\boldsymbol{\theta}} ={[𝐯𝐞𝐜(𝐁1𝐀1),𝜸1]|𝐀1F=1,sign(tr(𝐀1))=1}\displaystyle=\{[\mathrm{\mathbf{vec}}\left(\mathbf{B}_{1}\otimes\mathbf{A}_{1}\right)^{\top},\boldsymbol{\gamma}_{1}^{\top}]^{\top}|\|\mathbf{A}_{1}\|_{\mathrm{F}}=1,\text{sign}(\text{tr}\left(\mathbf{A}_{1}\right))=1\}
𝔽𝛀\displaystyle\mathbb{F}_{\boldsymbol{\Omega}} ={𝚺c1𝚺r1|𝚺rM×M,𝚺cN×N,ρ¯(𝚺r),ρ¯(𝚺c)>0}.\displaystyle=\{\boldsymbol{\Sigma}_{c}^{-1}\otimes\boldsymbol{\Sigma}_{r}^{-1}|\boldsymbol{\Sigma}_{r}\in\mathbb{R}^{M\times M},\boldsymbol{\Sigma}_{c}\in\mathbb{R}^{N\times N},\underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{r}),\underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{c})>0\}.

The MARAC estimators, denoted by \widehat{\boldsymbol{\theta}},\widehat{\boldsymbol{\Omega}}, are the minimizers of h(\boldsymbol{\theta},\boldsymbol{\Omega}) over \boldsymbol{\theta}\in\mathbb{F}_{\boldsymbol{\theta}},\boldsymbol{\Omega}\in\mathbb{F}_{\boldsymbol{\Omega}}.

In order to establish the consistency of 𝚺^=𝛀^1\widehat{\boldsymbol{\Sigma}}=\widehat{\boldsymbol{\Omega}}^{-1}, it suffices to show that for any constant c>0c>0:

P(inf𝛀¯𝛀Fcinf𝜽¯h(𝜽¯,𝛀¯)h(𝜽,𝛀))0, as T.\mathrm{P}\left(\inf_{\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\geq c}\inf_{\bar{\boldsymbol{\theta}}}h(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\Omega}})\leq h(\boldsymbol{\theta}^{*},\boldsymbol{\Omega}^{*})\right)\rightarrow 0,\text{ as }T\rightarrow\infty. (33)

This is because if (33) is established, then as TT\rightarrow\infty we have:

P(inf𝛀¯𝛀Fcinf𝜽¯𝔽𝜽h(𝜽¯,𝛀¯)inf𝛀¯𝛀Fcinf𝜽¯h(𝜽¯,𝛀¯)>h(𝜽,𝛀)h(𝜽^,𝛀^))\mathrm{P}\left(\inf_{\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\geq c}\inf_{\bar{\boldsymbol{\theta}}\in\mathbb{F}_{\boldsymbol{\theta}}}h(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\Omega}})\geq\inf_{\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\geq c}\inf_{\bar{\boldsymbol{\theta}}}h(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\Omega}})>h(\boldsymbol{\theta}^{*},\boldsymbol{\Omega}^{*})\geq h(\widehat{\boldsymbol{\theta}},\widehat{\boldsymbol{\Omega}})\right)

approaching 11 and thus we must have 𝛀^𝛀F<c\|\widehat{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}<c with probability approaching 11 as TT\rightarrow\infty, and the consistency is established since cc is arbitrary.

To prove (33), we first fix 𝛀=𝛀¯\boldsymbol{\Omega}=\bar{\boldsymbol{\Omega}} and let 𝜽~(𝛀¯)=argmin𝜽h(𝜽,𝛀¯)\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})=\operatorname*{arg\,min}_{\boldsymbol{\theta}}h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}}), thus we have:

𝜽~(𝛀¯)=(t𝐲t𝛀¯𝐲tT+λ𝐊~)1(t𝐲t𝛀¯𝐱tT),\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})=\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)^{-1}\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{x}_{t}}{T}\right), (34)

which is a consistent estimator of 𝜽\boldsymbol{\theta}^{*} for any 𝛀¯\bar{\boldsymbol{\Omega}} given that λ0\lambda\rightarrow 0 and the matrix and vector time series are covariance-stationary. To see that 𝜽~(𝛀¯)p.𝜽\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\overset{p.}{\rightarrow}\boldsymbol{\theta}^{*}, notice that:

𝜽~(𝛀¯)=(𝐈λ𝐊~)𝜽+(t𝐲t𝛀¯𝐲tT+λ𝐊~)1(t𝐲t𝛀¯𝐞tT),\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})=(\mathbf{I}-\lambda\widetilde{\mathbf{K}})\boldsymbol{\theta}^{*}+\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)^{-1}\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}}{T}\right), (35)

and the first term converges to 𝜽\boldsymbol{\theta}^{*} since λ=o(1)\lambda=o(1). In the second term of (35), we have:

t𝐲t𝛀¯𝐲tT+λ𝐊~p.[𝚺𝐱,𝐱𝛀¯𝚺𝐱,𝐳𝛀¯𝐊𝚺𝐳,𝐱𝐊𝛀¯𝚺𝐳,𝐳𝐊𝛀¯𝐊],\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\overset{p.}{\rightarrow}\begin{bmatrix}\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}\otimes\bar{\boldsymbol{\Omega}}&\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{z}}^{*}\otimes\bar{\boldsymbol{\Omega}}\mathbf{K}\\ \boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}\otimes\mathbf{K}\bar{\boldsymbol{\Omega}}&\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*}\otimes\mathbf{K}\bar{\boldsymbol{\Omega}}\mathbf{K}\end{bmatrix}, (36)

where \boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}=\mathrm{Var}(\mathbf{x}_{t}), \boldsymbol{\Sigma}_{\mathbf{x},\mathbf{z}}^{*}=\mathrm{Cov}(\mathbf{x}_{t},\mathbf{z}_{t}) and \boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*}=\mathrm{Var}(\mathbf{z}_{t}). The convergence in probability in (36) holds due to the joint stationarity of \mathbf{x}_{t} and \mathbf{z}_{t} and the assumption that \lambda=o(1). We further note that \{\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}\}_{t=1}^{T} is a martingale difference sequence (MDS), so \sum_{t=1}^{T}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}/T=O_{P}(T^{-1/2}) by the central limit theorem (CLT) for MDS (see Proposition 7.9 of Hamilton (2020)). Combining this with (36), we conclude that the second term in (35) is o_{P}(1), and thus \widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}) is consistent for \boldsymbol{\theta}^{*}.
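For intuition, the closed form (34) is a generalized ridge regression; the following minimal sketch computes it with dense matrices (illustration only; the stacked inputs and their shapes are assumptions of the sketch).

```python
import numpy as np

def profile_theta(Y, X, Omega_bar, K_tilde, lam):
    """Closed-form minimizer of h(theta, Omega_bar) as in (34).

    Y: (T, S, p) stack of design matrices y_t; X: (T, S) stack of x_t;
    Omega_bar: (S, S) working precision matrix; K_tilde: (p, p) penalty matrix.
    """
    T = Y.shape[0]
    lhs = sum(Y[t].T @ Omega_bar @ Y[t] for t in range(T)) / T + lam * K_tilde
    rhs = sum(Y[t].T @ Omega_bar @ X[t] for t in range(T)) / T
    return np.linalg.solve(lhs, rhs)    # generalized ridge solution
```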

Plugging 𝜽~(𝛀¯)\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}) into h(𝜽,𝛀¯)h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}}) yields the profile likelihood of 𝛀¯\bar{\boldsymbol{\Omega}}:

(𝛀¯)=12log|𝛀¯|+12tr(𝛀¯t𝐱t[𝐱t𝐲t𝜽~(𝛀¯)]T).\ell(\bar{\boldsymbol{\Omega}})=-\frac{1}{2}\log|\bar{\boldsymbol{\Omega}}|+\frac{1}{2}\text{tr}\left(\bar{\boldsymbol{\Omega}}\frac{\sum_{t}\mathbf{x}_{t}[\mathbf{x}_{t}-\mathbf{y}_{t}\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})]^{\top}}{T}\right).

To prove (33), it suffices to show that:

P(inf𝛀¯𝛀Fc(𝛀¯)(𝛀))0, as T,\mathrm{P}\left(\inf_{\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\geq c}\ell(\bar{\boldsymbol{\Omega}})\leq\ell(\boldsymbol{\Omega}^{*})\right)\rightarrow 0,\text{ as }T\rightarrow\infty, (37)

since (𝛀)h(𝜽,𝛀)\ell(\boldsymbol{\Omega}^{*})\leq h(\boldsymbol{\theta}^{*},\boldsymbol{\Omega}^{*}). Now, since 𝜽~(𝛀¯)p.𝜽\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\overset{p.}{\rightarrow}\boldsymbol{\theta}^{*}, we can write 𝜽~(𝛀¯)=𝜽+𝜻\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})=\boldsymbol{\theta}^{*}+\boldsymbol{\zeta}, with 𝜻F=oP(1)\|\boldsymbol{\zeta}\|_{\mathrm{F}}=o_{P}(1). Using this new notation, we can rewrite (𝛀¯)\ell(\bar{\boldsymbol{\Omega}}) as:

(𝛀¯)\displaystyle\ell(\bar{\boldsymbol{\Omega}}) =12log|𝛀¯|+12tr(𝛀¯t𝐱t𝐞tT)12tr((t𝐱t𝛀¯𝐲tT)𝜻)\displaystyle=-\frac{1}{2}\log|\bar{\boldsymbol{\Omega}}|+\frac{1}{2}\text{tr}\left(\bar{\boldsymbol{\Omega}}\frac{\sum_{t}\mathbf{x}_{t}\mathbf{e}_{t}^{\top}}{T}\right)-\frac{1}{2}\text{tr}\left(\left(\frac{\sum_{t}\mathbf{x}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}\right)\boldsymbol{\zeta}\right) (38)
=~(𝛀¯)12tr((t𝐱t𝛀¯𝐲tT)𝜻),\displaystyle=\widetilde{\ell}(\bar{\boldsymbol{\Omega}})-\frac{1}{2}\text{tr}\left(\left(\frac{\sum_{t}\mathbf{x}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}\right)\boldsymbol{\zeta}\right),

where we define the first two terms in (38) as ~(𝛀¯)\widetilde{\ell}(\bar{\boldsymbol{\Omega}}).

By the Cauchy–Schwarz inequality, we have:

|12tr((t𝐱t𝛀¯𝐲tT)𝜻)|12t𝐱t𝛀¯𝐲tTF𝜻F.\left|\frac{1}{2}\text{tr}\left(\left(\frac{\sum_{t}\mathbf{x}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}\right)\boldsymbol{\zeta}\right)\right|\leq\frac{1}{2}\left\|\frac{\sum_{t}\mathbf{x}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}\right\|_{\mathrm{F}}\cdot\|\boldsymbol{\zeta}\|_{\mathrm{F}}. (39)

By the definition of 𝐲t\mathbf{y}_{t}, we have:

t𝐱t𝛀¯𝐲tT=[(t𝐱t1𝐱tT)(𝐈S𝛀¯);(t𝐳t1𝐱tT)(𝐈D𝛀¯𝐊)],\frac{\sum_{t}\mathbf{x}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}=\left[\left(\frac{\sum_{t}\mathbf{x}_{t-1}\otimes\mathbf{x}_{t}}{T}\right)^{\top}\left(\mathbf{I}_{S}\otimes\bar{\boldsymbol{\Omega}}\right);\left(\frac{\sum_{t}\mathbf{z}_{t-1}\otimes\mathbf{x}_{t}}{T}\right)^{\top}\left(\mathbf{I}_{D}\otimes\bar{\boldsymbol{\Omega}}\mathbf{K}\right)\right],

and notice that \mathbf{x}_{t-1}\otimes\mathbf{x}_{t} and \mathbf{z}_{t-1}\otimes\mathbf{x}_{t} are simply rearranged versions of \mathbf{x}_{t}\mathbf{x}_{t-1}^{\top} and \mathbf{x}_{t}\mathbf{z}_{t-1}^{\top}, respectively. By the joint stationarity of \mathbf{x}_{t} and \mathbf{z}_{t}, the time averages of \mathbf{x}_{t-1}\otimes\mathbf{x}_{t} and \mathbf{z}_{t-1}\otimes\mathbf{x}_{t} converge to rearranged versions of constant auto-covariance matrices, and hence the right-hand side of (39) is o_{P}(1).

Given this argument, proving (37) is now equivalent to proving:

P(inf𝛀¯𝛀Fc~(𝛀¯)~(𝛀))0, as T.\mathrm{P}\left(\inf_{\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\geq c}\widetilde{\ell}(\bar{\boldsymbol{\Omega}})\leq\widetilde{\ell}(\boldsymbol{\Omega}^{*})\right)\rightarrow 0,\text{ as }T\rightarrow\infty. (40)

Define 𝛀~\widetilde{\boldsymbol{\Omega}} as the unconstrained minimizer of ~(𝛀)\widetilde{\ell}(\boldsymbol{\Omega}), then explicitly, we have:

𝛀~=argmin𝛀~(𝛀)\displaystyle\widetilde{\boldsymbol{\Omega}}=\operatorname*{arg\,min}_{\boldsymbol{\Omega}}\widetilde{\ell}(\boldsymbol{\Omega}) =(t𝐞t𝐱tT)1\displaystyle=\left(\frac{\sum_{t}\mathbf{e}_{t}\mathbf{x}_{t}^{\top}}{T}\right)^{-1}
=(t𝐞t𝐞tT+t𝐞t(𝐲t𝜽)T)1p.𝛀,\displaystyle=\left(\frac{\sum_{t}\mathbf{e}_{t}\mathbf{e}_{t}^{\top}}{T}+\frac{\sum_{t}\mathbf{e}_{t}\left(\mathbf{y}_{t}\boldsymbol{\theta}^{*}\right)^{\top}}{T}\right)^{-1}\overset{p.}{\rightarrow}\boldsymbol{\Omega}^{*},

where the final argument on the convergence in probability to 𝛀\boldsymbol{\Omega}^{*} is based on the fact that t=1T𝐞t(𝐲t𝜽)/T=OP(T1/2)\sum_{t=1}^{T}\mathbf{e}_{t}\left(\mathbf{y}_{t}\boldsymbol{\theta}^{*}\right)^{\top}/T=O_{P}(T^{-1/2}) by the CLT of MDS. By the second-order Taylor expansion of ~(𝛀¯)\widetilde{\ell}(\bar{\boldsymbol{\Omega}}) at 𝛀~\widetilde{\boldsymbol{\Omega}}, we have:

~(𝛀¯)=~(𝛀~)+14𝐯𝐞𝐜(𝛀¯𝛀~)[𝛀ˇ1𝛀ˇ1]𝐯𝐞𝐜(𝛀¯𝛀~),\widetilde{\ell}(\bar{\boldsymbol{\Omega}})=\widetilde{\ell}(\widetilde{\boldsymbol{\Omega}})+\frac{1}{4}\mathrm{\mathbf{vec}}\left(\bar{\boldsymbol{\Omega}}-\widetilde{\boldsymbol{\Omega}}\right)^{\top}\left[\check{\boldsymbol{\Omega}}^{-1}\otimes\check{\boldsymbol{\Omega}}^{-1}\right]\mathrm{\mathbf{vec}}\left(\bar{\boldsymbol{\Omega}}-\widetilde{\boldsymbol{\Omega}}\right), (41)

where \check{\boldsymbol{\Omega}}=\widetilde{\boldsymbol{\Omega}}+\eta(\bar{\boldsymbol{\Omega}}-\widetilde{\boldsymbol{\Omega}}) for some \eta\in[0,1]. For any constant c>0 such that \|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}=c, write c=\kappa\bar{\rho}(\boldsymbol{\Omega}^{*}), where \kappa>0 is a constant that depends only on c. Consequently, we have:

|ρ¯(𝛀¯)ρ¯(𝛀)|𝛀¯𝛀s𝛀¯𝛀F=κρ¯(𝛀),\left|\bar{\rho}(\bar{\boldsymbol{\Omega}})-\bar{\rho}(\boldsymbol{\Omega}^{*})\right|\leq\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{s}\leq\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}=\kappa\bar{\rho}(\boldsymbol{\Omega}^{*}),

and thus \bar{\rho}(\bar{\boldsymbol{\Omega}})\leq(1+\kappa)\bar{\rho}(\boldsymbol{\Omega}^{*}). On the event that \|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}=c, we have \|\bar{\boldsymbol{\Omega}}-\widetilde{\boldsymbol{\Omega}}\|_{\mathrm{F}}\geq c/2 with probability approaching one, by the consistency of \widetilde{\boldsymbol{\Omega}}. Furthermore, we also have:

ρ¯(𝛀ˇ1𝛀ˇ1)=ρ¯(𝛀ˇ1)2\displaystyle\underaccent{\bar}{\rho}(\check{\boldsymbol{\Omega}}^{-1}\otimes\check{\boldsymbol{\Omega}}^{-1})=\underaccent{\bar}{\rho}(\check{\boldsymbol{\Omega}}^{-1})^{2} =1ρ¯(𝛀ˇ)2\displaystyle=\frac{1}{\bar{\rho}(\check{\boldsymbol{\Omega}})^{2}}
[1ρ¯(𝛀~)+ρ¯(𝛀¯)]2\displaystyle\geq\left[\frac{1}{\bar{\rho}(\widetilde{\boldsymbol{\Omega}})+\bar{\rho}(\bar{\boldsymbol{\Omega}})}\right]^{2}
[12ρ¯(𝛀)+(c+ρ¯(𝛀))]2=1(3+κ)21ρ¯(𝛀)2,\displaystyle\geq\left[\frac{1}{2\bar{\rho}(\boldsymbol{\Omega}^{*})+(c+\bar{\rho}(\boldsymbol{\Omega}^{*}))}\right]^{2}=\frac{1}{(3+\kappa)^{2}}\cdot\frac{1}{\bar{\rho}(\boldsymbol{\Omega}^{*})^{2}},

where the last inequality holds with probability approaching one since P[ρ¯(𝛀~)2ρ¯(𝛀)]1\mathrm{P}\left[\bar{\rho}(\widetilde{\boldsymbol{\Omega}})\leq 2\bar{\rho}(\boldsymbol{\Omega}^{*})\right]\rightarrow 1. Utilizing these facts together with (41), we end up having:

P[~(𝛀¯)~(𝛀~)+116(κ3+κ)2]1, as T,\mathrm{P}\left[\widetilde{\ell}(\bar{\boldsymbol{\Omega}})\geq\widetilde{\ell}(\widetilde{\boldsymbol{\Omega}})+\frac{1}{16}\cdot\left(\frac{\kappa}{3+\kappa}\right)^{2}\right]\rightarrow 1,\text{ as }T\rightarrow\infty, (42)

for any \bar{\boldsymbol{\Omega}} such that \|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}=c=\kappa\bar{\rho}(\boldsymbol{\Omega}^{*}). Since \kappa is an arbitrary positive constant and \widetilde{\ell}(\widetilde{\boldsymbol{\Omega}})\overset{p.}{\rightarrow}\widetilde{\ell}(\boldsymbol{\Omega}^{*}), we establish (40), which completes the proof.  

C.2 Proof of Theorem 9

To prove Theorem 9, we first establish the consistency and the convergence rate of the estimators in Lemma 14 below.

Lemma 14

Under the same assumptions as Theorem 9, all model estimators of MARAC are \sqrt{T}-consistent, namely:

𝐀^p𝐀pF=OP(1T),𝐁^p𝐁pF=OP(1T),𝜸^q𝜸qF=OP(1T),\|\widehat{\mathbf{A}}_{p}-\mathbf{A}_{p}^{*}\|_{\mathrm{F}}=O_{P}\left(\frac{1}{\sqrt{T}}\right),\|\widehat{\mathbf{B}}_{p}-\mathbf{B}_{p}^{*}\|_{\mathrm{F}}=O_{P}\left(\frac{1}{\sqrt{T}}\right),\|\widehat{\boldsymbol{\gamma}}_{q}-\boldsymbol{\gamma}_{q}^{*}\|_{\mathrm{F}}=O_{P}\left(\frac{1}{\sqrt{T}}\right),

for p[P],q[Q]p\in[P],q\in[Q]. As a direct result, we also have:

𝐁^p𝐀^p𝐁p𝐀pF=OP(1T), for p[P].\|\widehat{\mathbf{B}}_{p}\otimes\widehat{\mathbf{A}}_{p}-\mathbf{B}_{p}^{*}\otimes\mathbf{A}^{*}_{p}\|_{\mathrm{F}}=O_{P}\left(\frac{1}{\sqrt{T}}\right),\text{ for }p\in[P].

We delay the proof of Lemma 14 to Section E.2. With this lemma, we are now ready to present the proof of Theorem 9.

Proof  For simplicity of notation and presentation, we fix P=Q=1; the proof technique generalizes to arbitrary P,Q. To start, we revisit the updating rule for \mathbf{A}_{p}^{(l+1)} in (13). Plugging the data-generating model for \mathbf{X}_{t} under MARAC(1,1) into (13), we can transform it into:

t[T][Δ𝐀1𝐗t1𝐁^1+𝐀1𝐗t1Δ𝐁1+Δ𝒢ׯ]𝚺^c1𝐁^1𝐗t1=𝐎M×M,\sum_{t\in[T]}\left[\Delta\mathbf{A}_{1}\mathbf{X}_{t-1}\widehat{\mathbf{B}}_{1}^{\top}+\mathbf{A}_{1}^{*}\mathbf{X}_{t-1}\Delta\mathbf{B}_{1}^{\top}+\Delta\mathbfcal{G}_{1}\bar{\times}\mathbf{z}_{t-1}-\mathbf{E}_{t}\right]\widehat{\boldsymbol{\Sigma}}_{c}^{-1}\widehat{\mathbf{B}}_{1}\mathbf{X}_{t-1}^{\top}=\mathbf{O}_{M\times M},

where for any matrix/tensor \mathbf{M}, we define \Delta\mathbf{M}=\widehat{\mathbf{M}}-\mathbf{M}^{*}. One can simplify the estimating equation above by left-multiplying by \widehat{\boldsymbol{\Sigma}}_{r}^{-1} and then vectorizing both sides to obtain:

t[T][(𝐁1𝐗t1)(𝚺c)1(𝐁1𝐗t1)(𝚺r)1]𝐯𝐞𝐜(𝐀^1𝐀1)\displaystyle\sum_{t\in[T]}\left[(\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top})^{\top}(\boldsymbol{\Sigma}_{c}^{*})^{-1}(\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top})\otimes(\boldsymbol{\Sigma}_{r}^{*})^{-1}\right]\mathrm{\mathbf{vec}}\left(\widehat{\mathbf{A}}_{1}-\mathbf{A}_{1}^{*}\right)
+\displaystyle+ t[T][(𝐁1𝐗t1)(𝚺c)1(𝚺r)1𝐀1𝐗t1]𝐯𝐞𝐜(𝐁^1(𝐁1))\displaystyle\sum_{t\in[T]}\left[(\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top})^{\top}(\boldsymbol{\Sigma}_{c}^{*})^{-1}\otimes(\boldsymbol{\Sigma}_{r}^{*})^{-1}\mathbf{A}_{1}^{*}\mathbf{X}_{t-1}\right]\mathrm{\mathbf{vec}}\left(\widehat{\mathbf{B}}_{1}^{\top}-(\mathbf{B}_{1}^{*})^{\top}\right)
+\displaystyle+ t[T]{𝐳t1[(𝐁1𝐗t1)(𝚺c)1(𝚺r)1𝐊]}𝐯𝐞𝐜(𝜸^1𝜸1)\displaystyle\sum_{t\in[T]}\left\{\mathbf{z}_{t-1}^{\top}\otimes\left[(\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top})^{\top}(\boldsymbol{\Sigma}_{c}^{*})^{-1}\otimes(\boldsymbol{\Sigma}_{r}^{*})^{-1}\mathbf{K}\right]\right\}\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\gamma}}_{1}-\boldsymbol{\gamma}_{1}^{*}\right)
=\displaystyle= t[T][(𝐁1𝐗t1)(𝚺c)1(𝚺r)1]𝐯𝐞𝐜(𝐄t)+oP(T).\displaystyle\sum_{t\in[T]}\left[(\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top})^{\top}(\boldsymbol{\Sigma}_{c}^{*})^{-1}\otimes(\boldsymbol{\Sigma}_{r}^{*})^{-1}\right]\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)+o_{P}(\sqrt{T}).

On the left-hand side of the equation above, we replace 𝐁^1,𝚺^r,𝚺^c\widehat{\mathbf{B}}_{1},\widehat{\boldsymbol{\Sigma}}_{r},\widehat{\boldsymbol{\Sigma}}_{c} with their true values 𝐁1,𝚺r,𝚺c\mathbf{B}_{1}^{*},\boldsymbol{\Sigma}_{r}^{*},\boldsymbol{\Sigma}_{c}^{*}, since the discrepancies are of order oP(1)o_{P}(1) and can thus be incorporated into the oP(T)o_{P}(\sqrt{T}) term given the T\sqrt{T}-consistency of 𝐀^1,𝐁^1,𝜸^1\widehat{\mathbf{A}}_{1},\widehat{\mathbf{B}}_{1},\widehat{\boldsymbol{\gamma}}_{1}. On the right-hand side, we have:

t𝐯𝐞𝐜(𝚺^r1𝐄t𝚺^c1𝐁^1𝐗t1)\displaystyle\sum_{t}\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Sigma}}_{r}^{-1}\mathbf{E}_{t}\widehat{\boldsymbol{\Sigma}}_{c}^{-1}\widehat{\mathbf{B}}_{1}\mathbf{X}_{t-1}^{\top}\right)
=t[𝐞t(𝐗t1𝐈M)]𝐯𝐞𝐜[(𝐁^1𝐈M)𝚺^1],\displaystyle=\sum_{t}\left[\mathbf{e}_{t}^{\top}\otimes\left(\mathbf{X}_{t-1}\otimes\mathbf{I}_{M}\right)\right]\mathbf{vec}\left[\left(\widehat{\mathbf{B}}_{1}^{\top}\otimes\mathbf{I}_{M}\right)\widehat{\boldsymbol{\Sigma}}^{-1}\right],

where the process \{\mathbf{e}_{t}^{\top}\otimes\left(\mathbf{X}_{t-1}\otimes\mathbf{I}_{M}\right)\}_{t=1}^{T} is a martingale difference sequence, so the martingale central limit theorem (Hall and Heyde, 2014) implies that \sum_{t}\left[\mathbf{e}_{t}^{\top}\otimes\left(\mathbf{X}_{t-1}\otimes\mathbf{I}_{M}\right)\right]=O_{P}(\sqrt{T}). Thus, by the consistency of \widehat{\boldsymbol{\Sigma}} and \widehat{\mathbf{B}}_{1}, we can replace \widehat{\boldsymbol{\Sigma}} and \widehat{\mathbf{B}}_{1} with their true values and absorb the remainders into o_{P}(\sqrt{T}).
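The O_{P}(\sqrt{T}) rate of such martingale-difference partial sums can be illustrated with a small Monte Carlo sketch. In the scalar example below (Python/NumPy; the AR(1) predictor, noise law, and sample sizes are illustrative assumptions rather than the MARAC design), the partial sums standardized by \sqrt{T} have roughly constant spread as T grows.

# Sketch: partial sums of the martingale difference sequence eta_t = x_{t-1} * eps_t,
# with x_t a stationary AR(1) and eps_t fresh noise, grow at the rate sqrt(T).
import numpy as np

rng = np.random.default_rng(2)

def mds_partial_sums(T, reps=500, phi=0.5):
    out = np.empty(reps)
    for r in range(reps):
        x = np.zeros(T + 1)
        innov = rng.standard_normal(T + 1)
        for t in range(1, T + 1):
            x[t] = phi * x[t - 1] + innov[t]       # AR(1) predictor
        eta = x[:-1] * rng.standard_normal(T)      # MDS: E[eta_t | past] = 0
        out[r] = eta.sum()
    return out

for T in (200, 800, 3200):
    sums = mds_partial_sums(T)
    print(T, round(np.std(sums) / np.sqrt(T), 2))  # roughly constant across T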

Similar transformations can be applied to (14) and (15), where the penalty term is incorporated into oP(T)o_{P}(\sqrt{T}) due to the assumption that λ=o(T12)\lambda=o(T^{-\frac{1}{2}}). With the notation that 𝐔t=𝐈N𝐀1𝐗t1\mathbf{U}_{t}=\mathbf{I}_{N}\otimes\mathbf{A}_{1}^{*}\mathbf{X}_{t-1}, 𝐕t=𝐁1𝐗t1𝐈M\mathbf{V}_{t}=\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top}\otimes\mathbf{I}_{M}, 𝐘t=𝐳t1𝐊\mathbf{Y}_{t}=\mathbf{z}_{t-1}^{\top}\otimes\mathbf{K} and 𝐖t=[𝐕t;𝐔t;𝐘t]\mathbf{W}_{t}=[\mathbf{V}_{t};\mathbf{U}_{t};\mathbf{Y}_{t}], these transformed estimating equations can be converted altogether into:

(1Tt[T]𝐖t(𝚺)1𝐖t)𝐯𝐞𝐜(𝚯^𝚯)\displaystyle\left(\frac{1}{T}\sum_{t\in[T]}\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right)\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right) =1Tt[T]𝐖t(𝚺)1𝐯𝐞𝐜(𝐄t)\displaystyle=\frac{1}{T}\sum_{t\in[T]}\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)
+oP(T1/2),\displaystyle+o_{P}(T^{-1/2}), (43)

where 𝐯𝐞𝐜(𝚯^𝚯)=[𝐯𝐞𝐜(𝒜^𝒜),𝐯𝐞𝐜(^),𝐯𝐞𝐜(^)]\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right)=[\mathrm{\mathbf{vec}}\left(\widehat{\mathbfcal{A}}-\mathbfcal{A}^{*}\right)^{\top},\mathrm{\mathbf{vec}}\left(\widehat{\mathbfcal{B}}-\mathbfcal{B}^{*}\right)^{\top},\mathrm{\mathbf{vec}}\left(\widehat{\mathbfcal{R}}-\mathbfcal{R}^{*}\right)^{\top}]^{\top}, and 𝒜^,^,^\widehat{\mathbfcal{A}},\widehat{\mathbfcal{B}},\widehat{\mathbfcal{R}} are defined as [𝒜^]::p=𝐀^p,[^]::p=𝐁^p[\widehat{\mathbfcal{A}}]_{::p}=\widehat{\mathbf{A}}_{p},[\widehat{\mathbfcal{B}}]_{::p}=\widehat{\mathbf{B}}_{p}^{\top}, [^]:dq=𝜸^q,d[\widehat{\mathbfcal{R}}]_{:dq}=\widehat{\boldsymbol{\gamma}}_{q,d} and 𝒜\mathbfcal{A}^{*},\mathbfcal{B}^{*},\mathbfcal{R}^{*} are the corresponding true coefficients.

In (43), we first establish that:

(1/T)t[T]𝐖t(𝚺)1𝐖tp.E[𝐖t(𝚺)1𝐖t].(1/T)\sum_{t\in[T]}\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\overset{p.}{\rightarrow}\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]. (44)

To prove (44), note that since \mathbf{X}_{t} and \mathbf{z}_{t} are zero-mean and jointly stationary, we have T^{-1}\sum_{t\in[T]}\widetilde{\mathbf{x}}_{t}\widetilde{\mathbf{x}}_{t}^{\top}\overset{p.}{\rightarrow}\mathrm{E}[\widetilde{\mathbf{x}}_{t}\widetilde{\mathbf{x}}_{t}^{\top}] by Lemma 20 and Corollary 21, where \widetilde{\mathbf{x}}_{t}=[\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}. See details of Lemma 20 and Corollary 21 in Section E.1. Then, since each element of \mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t} is a linear combination of entries of \widetilde{\mathbf{x}}_{t}\widetilde{\mathbf{x}}_{t}^{\top} (hence a continuous mapping), (44) holds elementwise.

Given (44) and the fact that 𝚯^\widehat{\boldsymbol{\Theta}} is T\sqrt{T}-consistent, we can rewrite (43) as:

E[𝐖t(𝚺)1𝐖t]𝐯𝐞𝐜(𝚯^𝚯)\displaystyle\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right) =1Tt[T]𝐖t(𝚺)1𝐯𝐞𝐜(𝐄t)\displaystyle=\frac{1}{T}\sum_{t\in[T]}\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)
+oP(T1/2),\displaystyle+o_{P}(T^{-1/2}), (45)

For the term on the right-hand side of (45), first notice that the sequence \{\boldsymbol{\eta}_{t}\}_{t=1}^{T}, where \boldsymbol{\eta}_{t}=\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right), is a zero-mean, stationary vector martingale difference sequence (MDS), thanks to the independence of \mathbf{E}_{t} from the jointly stationary \mathbf{X}_{t-1} and \mathbf{z}_{t-1}. By the martingale central limit theorem (Hall and Heyde, 2014), we have:

1Tt[T]𝐖t(𝚺)1𝐯𝐞𝐜(𝐄t)d.𝒩(𝟎,E[𝐖t(𝚺)1𝐖t]).\frac{1}{\sqrt{T}}\sum_{t\in[T]}\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)\overset{d.}{\rightarrow}\mathcal{N}(\mathbf{0},\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]). (46)

Combining (45) and (46), we end up having:

E[𝐖t(𝚺)1𝐖t]T𝐯𝐞𝐜(𝚯^𝚯)d.𝒩(𝟎,E[𝐖t(𝚺)1𝐖t]).\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]\sqrt{T}\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right)\overset{d.}{\rightarrow}\mathcal{N}(\mathbf{0},\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]). (47)

The asymptotic distribution of \sqrt{T}\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right) can thus be derived by multiplying both sides of (47) by the inverse of \mathbf{L}=\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]. However, the matrix \mathbf{L} is not full-rank, because \mathbf{L}\boldsymbol{\mu}=\mathbf{0}, where \boldsymbol{\mu}=[\mathrm{\mathbf{vec}}\left(\mathbfcal{A}^{*}\right)^{\top},-\mathrm{\mathbf{vec}}\left(\mathbfcal{B}^{*}\right)^{\top},\mathbf{0}^{\top}]^{\top}. As a remedy, let \boldsymbol{\zeta}=[\mathrm{\mathbf{vec}}\left(\mathbf{A}_{1}^{*}\right)^{\top},\mathbf{0}^{\top}]^{\top}\in\mathbb{R}^{M^{2}+N^{2}+DMN}. Under the identifiability constraint \|\mathbf{A}^{*}_{1}\|_{\mathrm{F}}=\|\widehat{\mathbf{A}}_{1}\|_{\mathrm{F}}=1, we have 2\mathrm{\mathbf{vec}}\left(\mathbf{A}_{1}^{*}\right)^{\top}\mathrm{\mathbf{vec}}\left(\widehat{\mathbf{A}}_{1}-\mathbf{A}^{*}_{1}\right)=-\|\widehat{\mathbf{A}}_{1}-\mathbf{A}^{*}_{1}\|_{\mathrm{F}}^{2}=O_{P}(T^{-1}) by the \sqrt{T}-consistency of \widehat{\mathbf{A}}_{1}, and hence \mathrm{\mathbf{vec}}\left(\mathbf{A}_{1}^{*}\right)^{\top}\mathrm{\mathbf{vec}}\left(\widehat{\mathbf{A}}_{1}-\mathbf{A}^{*}_{1}\right)=o_{P}(T^{-1/2}). Therefore, we have:

T𝜻𝐯𝐞𝐜(𝚯^𝚯)p.0.\sqrt{T}\boldsymbol{\zeta}^{\top}\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right)\overset{p.}{\rightarrow}0. (48)

Combining (47) and (48) and using Slutsky's theorem, we have \mathbf{H}\sqrt{T}\mathbf{vec}(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*})\overset{d.}{\rightarrow}\mathcal{N}(\mathbf{0},\mathbf{L}), where \mathbf{H}=\mathbf{L}+\boldsymbol{\zeta}\boldsymbol{\zeta}^{\top}, and thus:

T𝐯𝐞𝐜(𝚯^𝚯)d.𝒩(𝟎,𝐇1𝐋𝐇1).\sqrt{T}\mathbf{vec}(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*})\overset{d.}{\rightarrow}\mathcal{N}(\mathbf{0},\mathbf{H}^{-1}\mathbf{L}\mathbf{H}^{-1}). (49)

The final asymptotic distributions of \mathrm{vec}(\widehat{\mathbf{B}}_{1}^{\top})\otimes\mathrm{vec}(\widehat{\mathbf{A}}_{1}) and \mathbf{K}\widehat{\boldsymbol{\gamma}}_{q,d} can be derived from (49) with the multivariate delta method; we omit the details here.  

Appendix D Theory under High Spatial Dimension

D.1 Proof of Theorem 12

Proof  In this proof, we again fix P=Q=1 for ease of presentation; the technical details generalize to arbitrary P,Q. Since the lags are fixed at 1, we drop the lag subscripts of the coefficients for convenience.

Under the specification of the MARAC(1,1)(1,1) model, we restate the model as:

𝐱t=(𝐱t1𝐈S)𝐯𝐞𝐜(𝐁𝐀)+(𝐳t1𝐊)𝜸+𝐞t,\mathbf{x}_{t}=\left(\mathbf{x}_{t-1}^{\top}\otimes\mathbf{I}_{S}\right)\mathrm{\mathbf{vec}}\left(\mathbf{B}^{*}\otimes\mathbf{A}^{*}\right)+\left(\mathbf{z}_{t-1}^{\top}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}+\mathbf{e}_{t},

where S=MNS=MN and we introduce the following additional notations:

𝐘T[𝐱1𝐱T],𝐗~T[𝐱0𝐱T1]𝐈S,𝐳~T[𝐳0𝐳T1],𝒯𝒯\mathbf{Y}_{T}\coloneqq\begin{bmatrix}\mathbf{x}_{1}\\ \vdots\\ \mathbf{x}_{T}\end{bmatrix},\quad\widetilde{\mathbf{X}}_{T}\coloneqq\begin{bmatrix}\mathbf{x}_{0}^{\top}\\ \vdots\\ \mathbf{x}_{T-1}^{\top}\end{bmatrix}\otimes\mathbf{I}_{S},\quad\widetilde{\mathbf{z}}_{T}\coloneqq\begin{bmatrix}\mathbf{z}_{0}^{\top}\\ \vdots\\ \mathbf{z}_{T-1}^{\top}\end{bmatrix},\quad\mathbfcal{E}_{T}=\begin{bmatrix}\mathbf{e}_{1}\\ \vdots\\ \mathbf{e}_{T}\end{bmatrix}.

We will drop the subscript T for convenience. Let \boldsymbol{\phi}^{*}=\mathrm{\mathbf{vec}}\left(\mathbf{B}^{*}\otimes\mathbf{A}^{*}\right) and g_{1}^{*},\ldots,g_{D}^{*}\in\mathbb{H}_{k} be the true autoregressive and functional parameters, respectively. Correspondingly, let \boldsymbol{\gamma}_{1}^{*},\ldots,\boldsymbol{\gamma}_{D}^{*} be the coefficients of the representers when evaluating g_{1}^{*},\ldots,g_{D}^{*} on the matrix grid, i.e., \mathbf{K}\boldsymbol{\gamma}_{d}^{*} is the discrete evaluation of g_{d}^{*} on the matrix grid. Let \mathbb{F}_{\boldsymbol{\phi}}=\{\mathrm{\mathbf{vec}}\left(\mathbf{B}\otimes\mathbf{A}\right)|\|\mathbf{A}\|_{\mathrm{F}}=\text{sign}(\text{tr}\left(\mathbf{A}\right))=1,\mathbf{A}\in\mathbb{R}^{M\times M},\mathbf{B}\in\mathbb{R}^{N\times N}\}. Using these notations, the MARAC estimator is obtained by solving the following penalized least squares problem:

minϕ𝔽ϕ,𝜸SD𝔏λ(ϕ,𝜸){12T𝐘𝐗~ϕ(𝐳~𝐊)𝜸F2+λ2𝜸(𝐈D𝐊)𝜸}.\min_{\boldsymbol{\phi}\in\mathbb{F}_{\boldsymbol{\phi}},\boldsymbol{\gamma}\in\mathbb{R}^{SD}}\mathfrak{L}_{\lambda}(\boldsymbol{\phi},\boldsymbol{\gamma})\coloneqq\left\{\frac{1}{2T}\|\mathbf{Y}-\widetilde{\mathbf{X}}\boldsymbol{\phi}-\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}\|_{\mathrm{F}}^{2}+\frac{\lambda}{2}\boldsymbol{\gamma}^{\top}\left(\mathbf{I}_{D}\otimes\mathbf{K}\right)\boldsymbol{\gamma}\right\}. (50)

By fixing ϕ\boldsymbol{\phi}, the estimator for 𝜸\boldsymbol{\gamma} is given by 𝜸^(ϕ)=argmin𝜸𝔏λ(ϕ,𝜸)\widehat{\boldsymbol{\gamma}}(\boldsymbol{\phi})=\operatorname*{arg\,min}_{\boldsymbol{\gamma}}\mathfrak{L}_{\lambda}(\boldsymbol{\phi},\boldsymbol{\gamma}), and can be explicitly written as:

𝜸^(ϕ)=T1[𝚺^𝐳𝐊+λ𝐈SD]1(𝐳~𝐈S)(𝐘𝐗~ϕ).\widehat{\boldsymbol{\gamma}}(\boldsymbol{\phi})=T^{-1}\left[\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\lambda\cdot\mathbf{I}_{SD}\right]^{-1}\left(\widetilde{\mathbf{z}}^{\top}\otimes\mathbf{I}_{S}\right)\left(\mathbf{Y}-\widetilde{\mathbf{X}}\boldsymbol{\phi}\right). (51)
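As a sanity check on (51), the following sketch (Python/NumPy; the dimensions, the positive-definite kernel Gram matrix, and the random placeholder standing in for \widetilde{\mathbf{X}}\boldsymbol{\phi} are all illustrative assumptions) verifies that the closed form above coincides with a direct solve of the normal equations of (50) in \boldsymbol{\gamma} for a fixed \boldsymbol{\phi}.

# Sketch: for fixed phi, the ridge-type closed form (51) for gamma-hat matches a
# direct solve of the normal equations of the penalized objective (50).
import numpy as np

rng = np.random.default_rng(0)
M, N, D, T, lam = 3, 2, 2, 50, 0.1
S = M * N

Y = rng.standard_normal(T * S)                 # stacked responses x_1, ..., x_T
Xphi = rng.standard_normal(T * S)              # placeholder for X_tilde @ phi (phi fixed)
z_tilde = rng.standard_normal((T, D))          # stacked auxiliary covariates
G = rng.standard_normal((S, S))
K = G @ G.T + S * np.eye(S)                    # positive-definite kernel Gram matrix

Sigma_z = z_tilde.T @ z_tilde / T
resid = Y - Xphi

# Closed form (51): gamma = T^{-1} (Sigma_z (x) K + lam I)^{-1} (z~^T (x) I_S) resid.
A = np.kron(Sigma_z, K) + lam * np.eye(S * D)
gamma_closed = np.linalg.solve(A, np.kron(z_tilde.T, np.eye(S)) @ resid) / T

# Normal equations of (50) in gamma, solved directly.
ZK = np.kron(z_tilde, K)                       # (TS) x (SD) design for gamma
lhs = ZK.T @ ZK / T + lam * np.kron(np.eye(D), K)
rhs = ZK.T @ resid / T
gamma_direct = np.linalg.solve(lhs, rhs)

print(np.max(np.abs(gamma_closed - gamma_direct)))   # agreement up to floating-point error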

Plugging (51) into (50) yields the profile likelihood for ϕ\boldsymbol{\phi}:

λ(ϕ)=𝔏λ(ϕ,𝜸^(ϕ))=12T(𝐘𝐗~ϕ)𝐖(𝐘𝐗~ϕ),\ell_{\lambda}(\boldsymbol{\phi})=\mathfrak{L}_{\lambda}(\boldsymbol{\phi},\widehat{\boldsymbol{\gamma}}(\boldsymbol{\phi}))=\frac{1}{2T}\left(\mathbf{Y}-\widetilde{\mathbf{X}}\boldsymbol{\phi}\right)^{\top}\mathbf{W}\left(\mathbf{Y}-\widetilde{\mathbf{X}}\boldsymbol{\phi}\right), (52)

where 𝐖\mathbf{W} is defined as:

𝐖={𝐈(𝐳~𝐊)[𝚺^𝐳𝐊+λ𝐈SD]1(𝐳~𝐈S)T}=(𝐈+𝐳~𝐳~λT𝐊)1,\mathbf{W}=\left\{\mathbf{I}-\frac{\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\left[\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\lambda\cdot\mathbf{I}_{SD}\right]^{-1}\left(\widetilde{\mathbf{z}}^{\top}\otimes\mathbf{I}_{S}\right)}{T}\right\}=\left(\mathbf{I}+\frac{\widetilde{\mathbf{z}}\widetilde{\mathbf{z}}^{\top}}{\lambda T}\otimes\mathbf{K}\right)^{-1}, (53)

and the second equality in (53) follows from the Woodbury matrix identity. The matrix \mathbf{W} is symmetric and positive definite, with all of its eigenvalues in (0,1]. To organize the proof, we break it into several major steps. In the first step, we establish the following result on \widehat{\boldsymbol{\phi}}:

Proposition 15

Under the assumptions of Theorem 12, we have:

(ϕ^ϕ)(𝐗~𝐖𝐗~T)(ϕ^ϕ)OP(Cgλ)+OP(SD/T),\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)^{\top}\left(\frac{\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}}{T}\right)\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)\lesssim O_{P}(C_{g}\lambda)+O_{P}(SD/T), (54)

where Cg=d=1Dgdk2C_{g}=\sum_{d=1}^{D}\|g_{d}^{*}\|_{\mathbb{H}_{k}}^{2}.

In order to derive the convergence rate of ϕ^\widehat{\boldsymbol{\phi}}, we still require one additional result:

Lemma 16

Under the assumptions of Theorem 12 and the requirement that SlogS/T0S\log S/T\rightarrow 0, it holds that:

ρ¯(𝐗~𝐖𝐗~/T)c0,S2>0,\underaccent{\bar}{\rho}\left(\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}/T\right)\geq\frac{c_{0,S}}{2}>0, (55)

with probability approaching 11 as S,TS,T\rightarrow\infty, where ρ¯()\underaccent{\bar}{\rho}(\cdot) is the minimum eigenvalue of a matrix and c0,S=ρ¯(𝚺𝐱,𝐱(𝚺𝐳,𝐱)(𝚺𝐳,𝐳)1𝚺𝐳,𝐱)c_{0,S}=\underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}-\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}\right)^{\top}\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*}\right)^{-1}\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}).

The proofs of Proposition 15 and Lemma 16 are relegated to Sections D.2 and E.3, respectively. Combining Proposition 15 and Lemma 16, we can derive the error bound of \widehat{\boldsymbol{\phi}} as:

1Sϕ^ϕFOP(CgγSc0,SS)+OP(Dc0,STS).\frac{1}{S}\|\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\|_{\mathrm{F}}\lesssim O_{P}(\sqrt{\frac{C_{g}\gamma_{S}}{c_{0,S}S}})+O_{P}(\sqrt{\frac{D}{c_{0,S}TS}}). (56)

Now with this error bound of the autoregressive parameter ϕ^\widehat{\boldsymbol{\phi}}, we further derive the prediction error bound for the functional parameters. To start with, we have:

1TS(𝐳~𝐊)(𝜸^𝜸)F\displaystyle\frac{1}{\sqrt{TS}}\|(\widetilde{\mathbf{z}}\otimes\mathbf{K})(\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})\|_{\mathrm{F}} =1TS(𝐈𝐖)(𝐘𝐗~ϕ^)(𝐳~𝐊)𝜸F\displaystyle=\frac{1}{\sqrt{TS}}\left\|(\mathbf{I}-\mathbf{W})(\mathbf{Y}-\widetilde{\mathbf{X}}\widehat{\boldsymbol{\phi}})-(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\right\|_{\mathrm{F}}
1TS[(𝐈𝐖)J1+(𝐈𝐖)𝐗~(ϕ^ϕ)FJ2\displaystyle\leq\frac{1}{\sqrt{TS}}\bigg{[}\underbrace{\|(\mathbf{I}-\mathbf{W})\mathbfcal{E}\|_{\mathrm{F}}}_{\text{\makebox[0.0pt]{$J_{1}$}}}+\underbrace{\|(\mathbf{I}-\mathbf{W})\widetilde{\mathbf{X}}(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*})\|_{\mathrm{F}}}_{\text{\makebox[0.0pt]{$J_{2}$}}}
+𝐖(𝐳~𝐊)𝜸FJ3],\displaystyle+\underbrace{\|\mathbf{W}(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}}_{\text{\makebox[0.0pt]{$J_{3}$}}}\bigg{]},

and we will bound the terms J1,J2,J3J_{1},J_{2},J_{3} separately.

To bound J1J_{1}, we first establish two lemmas.

Lemma 17

Given the definition of 𝐖\mathbf{W} in (53) and under the assumptions of Theorem 12, we have OP(γS1/2r0)tr(𝐈𝐖)OP(SγS1/2r0)O_{P}(\gamma_{S}^{-1/2r_{0}})\leq\text{tr}\left(\mathbf{I}-\mathbf{W}\right)\leq O_{P}(\sqrt{S}\gamma_{S}^{-1/2r_{0}}), where γS=λ/S\gamma_{S}=\lambda/S. Furthermore, we have tr(𝐖)SD\text{tr}\left(\mathbf{W}\right)\leq SD.

Lemma 18

Given the definition of 𝐖\mathbf{W} in (53) and under the assumptions of Theorem 12, we have that:

\mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E}/\text{tr}\left(\mathbf{W}\right)=O_{P}(1).

Furthermore, we have \mathbfcal{E}^{\top}\left(\mathbf{I}-\mathbf{W}\right)^{2}\mathbfcal{E}/\text{tr}\left(\left(\mathbf{I}-\mathbf{W}\right)^{2}\right)=O_{P}(1).

We leave the proofs of Lemma 17 and Lemma 18 to Sections E.4 and E.5, respectively. By Lemma 18, we have:

J12tr((𝐈𝐖)2)tr(𝐈𝐖).J_{1}^{2}\asymp\text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)\lesssim\text{tr}\left(\mathbf{I}-\mathbf{W}\right).

And by Lemma 17, we have J1OP(S1/4γS1/4r0)J_{1}\leq O_{P}(S^{1/4}\gamma_{S}^{-1/4r_{0}}).

For J2J_{2}, we have the following bound:

J2\displaystyle J_{2} =(𝐈𝐖)𝐖1/2𝐖1/2𝐗~(ϕ^ϕ)F\displaystyle=\|(\mathbf{I}-\mathbf{W})\mathbf{W}^{-1/2}\mathbf{W}^{1/2}\widetilde{\mathbf{X}}(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*})\|_{\mathrm{F}} (57)
(𝐈𝐖)𝐖1/2s𝐖1/2𝐗~(ϕ^ϕ)F\displaystyle\leq\|(\mathbf{I}-\mathbf{W})\mathbf{W}^{-1/2}\|_{s}\cdot\|\mathbf{W}^{1/2}\widetilde{\mathbf{X}}(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*})\|_{\mathrm{F}}
𝐖1/2s𝐖1/2𝐗~(ϕ^ϕ)F.\displaystyle\leq\|\mathbf{W}^{-1/2}\|_{s}\cdot\|\mathbf{W}^{1/2}\widetilde{\mathbf{X}}(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*})\|_{\mathrm{F}}. (58)

To bound 𝐖1/2s\|\mathbf{W}^{-1/2}\|_{s}, we can take advantage of the simpler form of 𝐖\mathbf{W} using the Woodbury matrix identity in (53) and obtain:

𝐖1/2s\displaystyle\|\mathbf{W}^{-1/2}\|_{s} =ρ¯(𝐖1)12=ρ¯(𝐈+(λT)1𝐳~𝐳~𝐊)12\displaystyle=\bar{\rho}(\mathbf{W}^{-1})^{\frac{1}{2}}=\bar{\rho}\left(\mathbf{I}+(\lambda T)^{-1}\widetilde{\mathbf{z}}\widetilde{\mathbf{z}}^{\top}\otimes\mathbf{K}\right)^{\frac{1}{2}}
[1+λ1ρ¯(𝐊)ρ¯(T1𝐳~𝐳~)]12[1+λ1ρ¯(𝐊)tr(𝚺^𝐳)]12.\displaystyle\leq\left[1+\lambda^{-1}\bar{\rho}(\mathbf{K})\bar{\rho}(T^{-1}\widetilde{\mathbf{z}}\widetilde{\mathbf{z}}^{\top})\right]^{\frac{1}{2}}\leq\left[1+\lambda^{-1}\bar{\rho}(\mathbf{K})\text{tr}\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\right)\right]^{\frac{1}{2}}.

In Lemma 20, which we state later in Section E.1, we show that for an N-dimensional stationary vector autoregressive process, the covariance estimator is consistent in the spectral norm as long as N\log N/T\rightarrow 0. Since \{\mathbf{z}_{t}\}_{t=1}^{T} follows a stationary VAR(\widetilde{Q}) process and its dimension D is fixed, we have \|\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}-\boldsymbol{\Sigma}_{\mathbf{z}}^{*}\|_{s}\overset{p.}{\rightarrow}0, and thus, with probability approaching 1, \mathrm{tr}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\leq 2\mathrm{tr}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*}). Therefore, \|\mathbf{W}^{-1/2}\|_{s}\leq O_{P}(\sqrt{1+c_{0}/\lambda}), where c_{0} is a constant depending on \text{tr}\left(\boldsymbol{\Sigma}_{\mathbf{z}}^{*}\right) and \bar{\rho}(\mathbf{K}). Combining this with the result in Proposition 15, we can bound J_{2} via its upper bound (58) as:

J2OP(CgλT)+OP(CgT)+OP(S)+OP(γS1).J_{2}\leq O_{P}\left(\sqrt{C_{g}\lambda T}\right)+O_{P}\left(\sqrt{C_{g}T}\right)+O_{P}(\sqrt{S})+O_{P}\left(\sqrt{\gamma_{S}^{-1}}\right). (59)

Finally, for J3J_{3}, we first notice that:

J3=𝐖(𝐳~𝐊)𝜸F𝐖1/2s𝐖1/2(𝐳~𝐊)𝜸F𝐖1/2(𝐳~𝐊)𝜸F.J_{3}=\|\mathbf{W}(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}\leq\|\mathbf{W}^{1/2}\|_{s}\cdot\|\mathbf{W}^{1/2}(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}\leq\|\mathbf{W}^{1/2}(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}.

The upper bound of J3J_{3} above can be further bounded by:

𝐖1/2(𝐳~𝐊)𝜸F2\displaystyle\|\mathbf{W}^{1/2}(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}^{2} =(λT)[(𝐈D𝐊)𝜸]{𝐈SD(λ1𝚺^𝐳𝐊+𝐈SD)1}𝜸\displaystyle=(\lambda T)[(\mathbf{I}_{D}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}]^{\top}\left\{\mathbf{I}_{SD}-\left(\lambda^{-1}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\mathbf{I}_{SD}\right)^{-1}\right\}\boldsymbol{\gamma}^{*}
=(λT)(d=1Dgdk2)\displaystyle=(\lambda T)\left(\sum_{d=1}^{D}\|g_{d}^{*}\|_{\mathbb{H}_{k}}^{2}\right)
(λ2T)(𝜸)[(𝐈D𝐊)(𝚺^𝐳𝐊+λ𝐈SD)1]𝜸\displaystyle-(\lambda^{2}T)\left(\boldsymbol{\gamma}^{*}\right)^{\top}\left[\left(\mathbf{I}_{D}\otimes\mathbf{K}\right)\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\lambda\mathbf{I}_{SD}\right)^{-1}\right]\boldsymbol{\gamma}^{*}
CgλT,\displaystyle\leq C_{g}\lambda T, (60)

where C_{g}=\sum_{d=1}^{D}\|g_{d}^{*}\|_{\mathbb{H}_{k}}^{2} is the total squared norm of the underlying functional parameters. The last inequality of (60) follows from the fact that the quadratic form multiplying \lambda^{2}T is non-negative. To see why, first note that:

(𝐈D𝐊)(𝚺^𝐳𝐊+λ𝐈SD)1=(𝚺^𝐳𝐈S)1[𝚺^𝐳𝐈S+λ1𝚺^𝐳2𝐊]1.\left(\mathbf{I}_{D}\otimes\mathbf{K}\right)\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\lambda\mathbf{I}_{SD}\right)^{-1}=\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{I}_{S}\right)^{-1}-\left[\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{I}_{S}+\lambda^{-1}\widehat{\boldsymbol{\Sigma}}^{2}_{\mathbf{z}}\otimes\mathbf{K}\right]^{-1}.
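This identity can also be confirmed numerically. The short sketch below (Python/NumPy, with random positive-definite stand-ins for \widehat{\boldsymbol{\Sigma}}_{\mathbf{z}} and \mathbf{K}; the dimensions and matrices are illustrative assumptions) checks that the two sides agree up to floating-point error.

# Sketch: numerical check of the identity
# (I_D (x) K)(Sigma_z (x) K + lam I)^{-1} = (Sigma_z (x) I_S)^{-1} - [Sigma_z (x) I_S + lam^{-1} Sigma_z^2 (x) K]^{-1}.
import numpy as np

rng = np.random.default_rng(3)
D, S, lam = 2, 4, 0.3

RA = rng.standard_normal((D, D))
Sigma_z = RA @ RA.T + D * np.eye(D)            # proxy for the (positive-definite) Sigma_z-hat
RB = rng.standard_normal((S, S))
K = RB @ RB.T + S * np.eye(S)                  # positive-definite kernel Gram matrix

I_S, I_D, I_SD = np.eye(S), np.eye(D), np.eye(S * D)
lhs = np.kron(I_D, K) @ np.linalg.inv(np.kron(Sigma_z, K) + lam * I_SD)
rhs = np.linalg.inv(np.kron(Sigma_z, I_S)) - np.linalg.inv(
    np.kron(Sigma_z, I_S) + np.kron(Sigma_z @ Sigma_z, K) / lam
)
print(np.max(np.abs(lhs - rhs)))               # ~1e-14: the identity holds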

Then, we have the following lemma:

Lemma 19

If 𝐀,𝐁\mathbf{A},\mathbf{B} are symmetric, positive definite real matrices and 𝐀𝐁\mathbf{A}-\mathbf{B} is positive semi-definite, then 𝐁1𝐀1\mathbf{B}^{-1}-\mathbf{A}^{-1} is also positive semi-definite.

We leave the proof to Section E.6. Let \mathbf{M}=\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{I}_{S}+\lambda^{-1}\widehat{\boldsymbol{\Sigma}}^{2}_{\mathbf{z}}\otimes\mathbf{K} and \mathbf{N}=\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{I}_{S}; then both \mathbf{M} and \mathbf{N} are positive definite and \mathbf{M}-\mathbf{N} is positive semi-definite. By Lemma 19, \mathbf{N}^{-1}-\mathbf{M}^{-1} is also positive semi-definite, and thus (60) holds.
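A quick randomized sanity check of Lemma 19 is sketched below (Python/NumPy; the matrices are synthetic and illustrative only).

# Sketch: if A, B are symmetric positive definite and A - B is PSD,
# then B^{-1} - A^{-1} is PSD (Lemma 19), checked on random instances.
import numpy as np

rng = np.random.default_rng(4)
n = 5
for _ in range(100):
    R = rng.standard_normal((n, n))
    B = R @ R.T + np.eye(n)                    # positive definite
    P = rng.standard_normal((n, n))
    A = B + P @ P.T                            # A - B is positive semi-definite
    diff = np.linalg.inv(B) - np.linalg.inv(A)
    assert np.min(np.linalg.eigvalsh(diff)) > -1e-10   # PSD up to rounding error
print("Lemma 19 holds on all random instances")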

Using the result in (60), we eventually have J3OP(CgλT)J_{3}\leq O_{P}(\sqrt{C_{g}\lambda T}). Combining all the bounds for J1,J2,J3J_{1},J_{2},J_{3}, we end up with:

1TS(𝐳~𝐊)(𝜸^𝜸)F\displaystyle\frac{1}{\sqrt{TS}}\|(\widetilde{\mathbf{z}}\otimes\mathbf{K})(\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})\|_{\mathrm{F}} OP(γS1/2r0TS4)+OP(γS)\displaystyle\leq O_{P}\left(\frac{\sqrt{\gamma_{S}^{-1/{2r_{0}}}}}{\sqrt{T}\sqrt[4]{S}}\right)+O_{P}(\sqrt{\gamma_{S}})
+OP(1S)+OP(γS1TS),\displaystyle+O_{P}\left(\frac{1}{\sqrt{S}}\right)+O_{P}\left(\frac{\sqrt{\gamma_{S}^{-1}}}{\sqrt{TS}}\right),

where we drop the term O_{P}(T^{-1/2}) since it is dominated by O_{P}(S^{-1/2}) under the condition that S\log S/T\rightarrow 0.  

D.2 Proof of Proposition 15

Proof  The MARAC estimator \widehat{\boldsymbol{\phi}} is the minimizer of \ell_{\lambda}(\boldsymbol{\phi}), defined in (52), over all \boldsymbol{\phi}\in\mathbb{F}_{\phi}, and thus \ell_{\lambda}(\widehat{\boldsymbol{\phi}})\leq\ell_{\lambda}(\boldsymbol{\phi}^{*}). Equivalently, this means that:

12(ϕ^ϕ)(𝐗~𝐖𝐗~T)(ϕ^ϕ)1T[(𝐳~𝐊)𝜸+]𝐖𝐗~(ϕ^ϕ).\frac{1}{2}\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)^{\top}\left(\frac{\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}}{T}\right)\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)\leq\frac{1}{T}\left[\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}+\mathbfcal{E}\right]^{\top}\mathbf{W}\widetilde{\mathbf{X}}\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right).

Let 𝜹=𝐖1/2𝐗~(ϕ^ϕ)/T\boldsymbol{\delta}=\mathbf{W}^{1/2}\widetilde{\mathbf{X}}(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*})/\sqrt{T} and 𝝎=𝐖1/2[(𝐳~𝐊)𝜸+]/T\boldsymbol{\omega}=\mathbf{W}^{1/2}\left[\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}+\mathbfcal{E}\right]/\sqrt{T}, then the inequality can be simply written as 𝜹𝜹2𝜹𝝎\boldsymbol{\delta}^{\top}\boldsymbol{\delta}\leq 2\boldsymbol{\delta}^{\top}\boldsymbol{\omega}, and we can upper bound our quantity of interest, namely 𝜹𝜹\boldsymbol{\delta}^{\top}\boldsymbol{\delta}, as:

𝜹𝜹2(𝜹𝝎)(𝜹𝝎)+2𝝎𝝎4𝝎𝝎.\boldsymbol{\delta}^{\top}\boldsymbol{\delta}\leq 2(\boldsymbol{\delta}-\boldsymbol{\omega})^{\top}(\boldsymbol{\delta}-\boldsymbol{\omega})+2\boldsymbol{\omega}^{\top}\boldsymbol{\omega}\leq 4\boldsymbol{\omega}^{\top}\boldsymbol{\omega}.

Therefore, the bound of 𝜹F2\|\boldsymbol{\delta}\|_{\mathrm{F}}^{2} can be obtained via the bound of 𝝎F2\|\boldsymbol{\omega}\|_{\mathrm{F}}^{2}. We have the following upper bound for 𝝎F2\|\boldsymbol{\omega}\|_{\mathrm{F}}^{2}:

𝜹F24𝝎F2\displaystyle\|\boldsymbol{\delta}\|_{\mathrm{F}}^{2}\leq 4\|\boldsymbol{\omega}\|_{\mathrm{F}}^{2} =4T[(𝐳~𝐊)𝜸+]𝐖[(𝐳~𝐊)𝜸+]\displaystyle=\frac{4}{T}\left[\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}+\mathbfcal{E}\right]^{\top}\mathbf{W}\left[\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}+\mathbfcal{E}\right]
8T[𝐖1/2(𝐳~𝐊)𝜸F2I1+𝐖1/2I2],\displaystyle\leq\frac{8}{T}\left[\underbrace{\|\mathbf{W}^{1/2}\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}^{2}}_{\text{\makebox[0.0pt]{$I_{1}$}}}+\underbrace{\|\mathbf{W}^{1/2}\mathbfcal{E}\|_{\mathrm{F}}^{2}}_{\text{\makebox[0.0pt]{$I_{2}$}}}\right], (61)

where the last inequality follows from the fact that 𝐖\mathbf{W} is positive semi-definite.

For I1I_{1}, it can be bounded by (60) and thus I1CgλTI_{1}\leq C_{g}\lambda T. To bound I2I_{2}, we utilize Lemma 18 and bound I2I_{2} as I2tr(𝐖)SDI_{2}\asymp\text{tr}\left(\mathbf{W}\right)\leq SD. Combining the bounds for I1I_{1} and I2I_{2}, we have:

𝜹F2=(ϕ^ϕ)(𝐗~𝐖𝐗~T)(ϕ^ϕ)OP(Cgλ)+OP(SD/T),\|\boldsymbol{\delta}\|_{\mathrm{F}}^{2}=\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)^{\top}\left(\frac{\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}}{T}\right)\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)\lesssim O_{P}(C_{g}\lambda)+O_{P}(SD/T),

which completes the proof.  

Appendix E Technical Lemmas & Proofs

In this section, we first introduce Lemma 20 on the consistency of the covariance matrix estimator for a stationary vector autoregressive process, and then Corollary 21 on the consistency of the covariance estimator of our MARAC model under the joint stationarity condition. We then provide the proof of Lemma 14, used in Section C.2 when proving Theorem 9 on asymptotic normality under fixed spatial dimension, followed by the proofs of Lemmas 16, 17, 18, and 19, used in Section D when proving the error bounds under high spatial dimensionality.

E.1 Statement of Lemma 20

In Lemma 20, we restate the results of Propositions 6 and 7 of Li and Xiao (2021), which cover the consistency of the estimator of the lag-0 auto-covariance matrix of a stationary VAR(p) process.

Lemma 20

Let \mathbf{x}_{t}\in\mathbb{R}^{N} be a zero-mean stationary VAR(p) process: \mathbf{x}_{t}=\sum_{l=1}^{p}\boldsymbol{\Phi}_{l}\mathbf{x}_{t-l}+\boldsymbol{\xi}_{t}, where \boldsymbol{\xi}_{t} has independent sub-Gaussian entries. Let \widehat{\boldsymbol{\Sigma}}=(1/T)\sum_{t=1}^{T}\mathbf{x}_{t}\mathbf{x}_{t}^{\top} and \boldsymbol{\Sigma}=\mathrm{E}[\widehat{\boldsymbol{\Sigma}}], then we have:

E𝚺^𝚺sC(NlogNT+NlogNT)𝚺s,\mathrm{E}\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}\|_{s}\leq C\left(\sqrt{\frac{N\log N}{T}}+\frac{N\log N}{T}\right)\|\boldsymbol{\Sigma}\|_{s}, (62)

where CC is an absolute constant.

We refer our readers to Appendix C.3 of Li and Xiao (2021) for the proof. As a corollary of Lemma 20, we have the following results:

Corollary 21

Assume that {𝐳t}t=1T\{\mathbf{z}_{t}\}_{t=1}^{T} is generated by a stationary VAR(Q~)(\widetilde{Q}) process: 𝐳t=q~=1Q~𝐂q~𝐳tq~+𝛎t\mathbf{z}_{t}=\sum_{\widetilde{q}=1}^{\widetilde{Q}}\mathbf{C}_{\widetilde{q}}\mathbf{z}_{t-\widetilde{q}}+\boldsymbol{\nu}_{t}, with 𝛎t\boldsymbol{\nu}_{t} having independent sub-Gaussian entries, then with 𝚺^𝐳=(1/T)t=1T𝐳t𝐳t\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}=(1/T)\sum_{t=1}^{T}\mathbf{z}_{t}\mathbf{z}_{t}^{\top} and 𝚺𝐳=E[𝚺^𝐳]\boldsymbol{\Sigma}_{\mathbf{z}}^{*}=\mathrm{E}[\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}], we have:

P(𝚺^𝐳𝚺𝐳sϵ)Cϵ1(DT+DT),\mathrm{P}\left(\left\|\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}-\boldsymbol{\Sigma}_{\mathbf{z}}^{*}\right\|_{s}\geq\epsilon\right)\leq C\epsilon^{-1}\left(\sqrt{\frac{D}{T}}+\frac{D}{T}\right), (63)

with CC being an absolute constant and ϵ\epsilon being a fixed positive real number, and thus 𝚺^𝐳𝚺𝐳sp.0\left\|\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}-\boldsymbol{\Sigma}_{\mathbf{z}}^{*}\right\|_{s}\overset{p.}{\rightarrow}0.

Let \{\mathbf{X}_{t}\}_{t=1}^{T} be a zero-mean matrix time series generated by the MARAC model with lags P,Q, where \{\mathbf{z}_{t}\}_{t=1}^{T} satisfies the assumption above and \{\mathbf{X}_{t},\mathbf{z}_{t}\}_{t=1}^{T} are jointly stationary in the sense of Theorem 6. Assume further that \mathbf{E}_{t} has i.i.d. Gaussian entries with constant variance \sigma^{2}. Then, for \mathbf{y}_{t}=[\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}, \widehat{\boldsymbol{\Sigma}}_{0}=(1/T)\sum_{t=1}^{T}\mathbf{y}_{t}\mathbf{y}_{t}^{\top} and \boldsymbol{\Sigma}_{0}^{*}=\mathrm{E}[\mathbf{y}_{t}\mathbf{y}_{t}^{\top}], we have:

E𝚺^0𝚺0sC(SlogST+SlogST)𝚺0s,\mathrm{E}\left\|\widehat{\boldsymbol{\Sigma}}_{0}-\boldsymbol{\Sigma}_{0}^{*}\right\|_{s}\leq C\left(\sqrt{\frac{S\log S}{T}}+\frac{S\log S}{T}\right)\|\boldsymbol{\Sigma}_{0}^{*}\|_{s}, (64)

where CC is an absolute constant.

Proof  The proof of (63) follows directly from Lemma 20 together with Markov's inequality. The proof of (64) also follows from Lemma 20, since \{\mathbf{y}_{t}\}_{t=1}^{T} follows a stationary VAR(\max(P,Q,\widetilde{Q})) process with i.i.d. sub-Gaussian noise (see (29)) and \mathrm{E}[(1/T)\sum_{t=1}^{T}\mathbf{y}_{t}\mathbf{y}_{t}^{\top}]=\mathrm{E}[\mathbf{y}_{t}\mathbf{y}_{t}^{\top}] due to stationarity.  
Note that convergence of the covariance estimator in the spectral norm also implies that each of its elements converges in probability. Also, the assumption that \mathbf{E}_{t} has i.i.d. Gaussian entries can be relaxed to \mathbf{E}_{t} having independent sub-Gaussian entries.
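The consistency in Lemma 20 and Corollary 21 can be illustrated with a small simulation. The sketch below (Python/NumPy; the VAR(1) coefficient, dimension, and sample sizes are illustrative assumptions) shows the spectral-norm error of the sample lag-0 covariance shrinking as T grows.

# Sketch: for a stable VAR(1), the sample lag-0 covariance converges to the true
# covariance in spectral norm as T grows (cf. Lemma 20 and Corollary 21).
import numpy as np

rng = np.random.default_rng(5)
N = 6
Phi = 0.4 * np.eye(N) + 0.05 * rng.standard_normal((N, N))   # stable coefficient matrix

# True lag-0 covariance solves the discrete Lyapunov equation Sigma = Phi Sigma Phi^T + I.
Sigma = np.eye(N)
for _ in range(500):
    Sigma = Phi @ Sigma @ Phi.T + np.eye(N)

for T in (200, 2000, 20000):
    x = np.zeros(N)
    S_hat = np.zeros((N, N))
    for t in range(T):
        x = Phi @ x + rng.standard_normal(N)                 # simulate the VAR(1)
        S_hat += np.outer(x, x)
    S_hat /= T
    print(T, round(np.linalg.norm(S_hat - Sigma, 2), 3))     # error decreases in T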

E.2 Proof of Lemma 14

Proof  Without loss of generality, we fix P,QP,Q as 1 and use the same notation as (31) in Section C.1, so the MARAC model can be written as 𝐱t=𝐲t𝜽+𝐞t\mathbf{x}_{t}=\mathbf{y}_{t}\boldsymbol{\theta}^{*}+\mathbf{e}_{t}. Correspondingly, the penalized log-likelihood h(𝜽,𝛀)h(\boldsymbol{\theta},\boldsymbol{\Omega}) is specified by (32) and given any 𝛀¯\bar{\boldsymbol{\Omega}}, we have 𝜽~(𝛀¯)=argmin𝜽h(𝜽,𝛀¯)\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})=\operatorname*{arg\,min}_{\boldsymbol{\theta}}h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}}) as specified by (34). Given the decomposition of 𝜽~(𝛀¯)\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}) in (35), we have:

𝜽~(𝛀¯)𝜽=λ𝐊~𝜽+(t𝐲t𝛀¯𝐲tT+λ𝐊~)1(t𝐲t𝛀¯𝐞tT),\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})-\boldsymbol{\theta}^{*}=-\lambda\widetilde{\mathbf{K}}\boldsymbol{\theta}^{*}+\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)^{-1}\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}}{T}\right),

where \|\lambda\widetilde{\mathbf{K}}\boldsymbol{\theta}^{*}\|_{\mathrm{F}}=o(T^{-1/2}) since \lambda=o(T^{-1/2}). To show that the norm of the second term is O_{P}(T^{-1/2}), we first observe that:

(t𝐲t𝛀¯𝐲tT+λ𝐊~)1(t𝐲t𝛀¯𝐞tT)F\displaystyle\left\|\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)^{-1}\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}}{T}\right)\right\|_{\mathrm{F}}
(t𝐲t𝛀¯𝐲tT+λ𝐊~)1𝐋T1F(t𝐲t𝛀¯𝐞tT)𝐑TF.\displaystyle\leq\left\|\underbrace{\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)^{-1}}_{\text{\makebox[0.0pt]{$\mathbf{L}_{T}^{-1}$}}}\right\|_{\mathrm{F}}\cdot\left\|\underbrace{\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}}{T}\right)}_{\text{\makebox[0.0pt]{$\mathbf{R}_{T}$}}}\right\|_{\mathrm{F}}.

For the sequence of random matrices {𝐋T}T=1\{\mathbf{L}_{T}\}_{T=1}^{\infty}, we have:

𝐋T=t𝐲t𝛀¯𝐲tT+λ𝐊~p.[Cov(𝐱t,𝐱t)𝛀¯Cov(𝐱t,𝐳t)𝛀¯𝐊Cov(𝐳t,𝐱t)𝐊𝛀¯Cov(𝐳t,𝐳t)𝐊𝛀¯𝐊],\mathbf{L}_{T}=\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\overset{p.}{\rightarrow}\begin{bmatrix}\mathrm{Cov}(\mathbf{x}_{t},\mathbf{x}_{t})\otimes\bar{\boldsymbol{\Omega}}&\mathrm{Cov}(\mathbf{x}_{t},\mathbf{z}_{t})\otimes\bar{\boldsymbol{\Omega}}\mathbf{K}\\ \mathrm{Cov}(\mathbf{z}_{t},\mathbf{x}_{t})\otimes\mathbf{K}\bar{\boldsymbol{\Omega}}&\mathrm{Cov}(\mathbf{z}_{t},\mathbf{z}_{t})\otimes\mathbf{K}\bar{\boldsymbol{\Omega}}\mathbf{K}\end{bmatrix},

and we define the limiting matrix as \mathbf{L}. To show this convergence, first note that the covariance estimator \widehat{\mathrm{Var}}([\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top})=T^{-1}\sum_{t}[\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}[\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}] converges in probability to the true covariance \mathrm{Var}([\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}), which we prove separately in Corollary 21. Secondly, since \lambda=o(T^{-1/2}), we have \lambda\widetilde{\mathbf{K}}\rightarrow\mathbf{O}, and the convergence in probability of \mathbf{L}_{T} to \mathbf{L} follows.

Notice that the limiting matrix 𝐋\mathbf{L} is invertible because the matrix 𝐋\mathbf{L}^{\prime}, defined as:

𝐋=[𝐈𝐊𝐎𝐎𝐈]𝐋[𝐈𝐊𝐎𝐎𝐈]=Var([𝐱t,𝐳t])(𝐊𝛀¯𝐊),\mathbf{L}^{\prime}=\begin{bmatrix}\mathbf{I}\otimes\mathbf{K}&\mathbf{O}\\ \mathbf{O}&\mathbf{I}\end{bmatrix}\mathbf{L}\begin{bmatrix}\mathbf{I}\otimes\mathbf{K}&\mathbf{O}\\ \mathbf{O}&\mathbf{I}\end{bmatrix}=\mathrm{Var}([\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top})\otimes(\mathbf{K}\bar{\boldsymbol{\Omega}}\mathbf{K}),

is invertible. To see why, first note that \mathrm{Var}([\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}) is invertible because we can express [\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top} as \sum_{j=0}^{\infty}\boldsymbol{\Phi}_{j}[\mathbf{e}_{t-j}^{\top},\boldsymbol{\nu}_{t-j}^{\top}]^{\top}, where \{\boldsymbol{\Phi}_{j}\}_{j=0}^{\infty} is a sequence of matrices whose elements are absolutely summable and \boldsymbol{\Phi}_{0}=\mathbf{I}; therefore, we have \underaccent{\bar}{\rho}(\mathrm{Var}([\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}))\geq\underaccent{\bar}{\rho}(\mathrm{Var}([\mathbf{e}_{t}^{\top},\boldsymbol{\nu}_{t}^{\top}]^{\top}))>0. Secondly, by Assumption 7, we have \underaccent{\bar}{\rho}(\mathbf{K})>0, and \underaccent{\bar}{\rho}(\bar{\boldsymbol{\Omega}})>0 by definition, so \mathbf{K}\bar{\boldsymbol{\Omega}}\mathbf{K} is positive definite. The invertibility of \mathbf{L} and the fact that \mathbf{L}_{T}\overset{p.}{\rightarrow}\mathbf{L} imply that \mathbf{L}_{T}^{-1}\overset{p.}{\rightarrow}\mathbf{L}^{-1}, since matrix inversion is a continuous mapping and convergence in probability is preserved under continuous transformations. It follows that \|\mathbf{L}_{T}^{-1}\|_{\mathrm{F}}=O_{P}(1).

For the sequence of random matrices \{\mathbf{R}_{T}\}_{T=1}^{\infty}, we note that \{\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}\}_{t=1}^{\infty} is a martingale difference sequence (MDS), so \|\mathbf{R}_{T}\|_{\mathrm{F}}=O_{P}(T^{-1/2}) (see Proposition 7.9 of Hamilton (2020) for the central limit theorem for martingale difference sequences). Combining the results on \|\mathbf{L}_{T}^{-1}\|_{\mathrm{F}} and \|\mathbf{R}_{T}\|_{\mathrm{F}}, we conclude that \|\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}=O_{P}(T^{-1/2}).

Fixing \boldsymbol{\Omega}=\bar{\boldsymbol{\Omega}}, we can expand h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}}) via a second-order Taylor expansion as follows:

h(𝜽,𝛀¯)\displaystyle h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}}) =h(𝜽~(𝛀¯),𝛀¯)+12(𝜽𝜽~(𝛀¯))(t𝐲t𝛀¯𝐲tT+λ𝐊~)(𝜽𝜽~(𝛀¯))\displaystyle=h(\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}),\bar{\boldsymbol{\Omega}})+\frac{1}{2}(\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}))^{\top}\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)(\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}))
h(𝜽~(𝛀¯),𝛀¯)+12ρ¯(𝐋T)𝜽𝜽~(𝛀¯)F2,\displaystyle\geq h(\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}),\bar{\boldsymbol{\Omega}})+\frac{1}{2}\underaccent{\bar}{\rho}(\mathbf{L}_{T})\|\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}^{2}, (65)

and recall that \mathbf{L}_{T}=T^{-1}\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}+\lambda\widetilde{\mathbf{K}}. We have shown above that \mathbf{L}_{T}\overset{p.}{\rightarrow}\mathbf{L}, with \mathbf{L} a positive definite matrix. Therefore, with probability approaching 1, we have \underaccent{\bar}{\rho}(\mathbf{L}_{T})\geq\underaccent{\bar}{\rho}(\mathbf{L})/2>0.

With the lower bound on ρ¯(𝐋T)\underaccent{\bar}{\rho}(\mathbf{L}_{T}), we can claim that for some constant C1>0C_{1}>0:

inf𝛀¯𝔽𝛀:𝛀¯𝛀FC1h(𝜽,𝛀¯)\displaystyle\inf_{\bar{\boldsymbol{\Omega}}\in\mathbb{F}_{\boldsymbol{\Omega}}:\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\leq C_{1}}h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}})
inf𝛀¯𝔽𝛀:𝛀¯𝛀FC1{h(𝜽~(𝛀¯),𝛀¯)+14ρ¯(𝐋)𝜽𝜽~(𝛀¯)F2},\displaystyle\geq\inf_{\bar{\boldsymbol{\Omega}}\in\mathbb{F}_{\boldsymbol{\Omega}}:\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\leq C_{1}}\left\{h(\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}),\bar{\boldsymbol{\Omega}})+\frac{1}{4}\underaccent{\bar}{\rho}(\mathbf{L})\cdot\|\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}^{2}\right\}, (66)

with probability approaching 1. Now consider \boldsymbol{\theta} in the set \{\boldsymbol{\theta}\in\mathbb{F}_{\boldsymbol{\theta}}|\sqrt{T}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}\geq c_{T}\}, where c_{T} is an arbitrary sequence diverging to infinity. Within this set, we have:

𝜽𝜽~(𝛀¯)FcTT𝜽𝜽~(𝛀¯)F,\|\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}\geq\frac{c_{T}}{\sqrt{T}}-\|\boldsymbol{\theta}^{*}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}, (67)

thus \|\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}\gtrsim O_{P}(c^{\prime}_{T}/\sqrt{T}) for some sequence c^{\prime}_{T}\rightarrow\infty, since \|\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}=O_{P}(T^{-1/2}). By the Taylor expansion in (65), and again using \|\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}=O_{P}(T^{-1/2}), we conclude that h(\boldsymbol{\theta}^{*},\bar{\boldsymbol{\Omega}})=h(\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}),\bar{\boldsymbol{\Omega}})+O_{P}(T^{-1}). Combining this result with the order of \|\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}, (66) yields:

P(infT𝜽𝜽FcTinf𝛀¯𝔽𝛀:𝛀¯𝛀FC1h(𝜽,𝛀¯)>inf𝛀¯𝔽𝛀:𝛀¯𝛀FC1h(𝜽,𝛀¯))1.\mathrm{P}\left(\inf_{\sqrt{T}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}\geq c_{T}}\inf_{\bar{\boldsymbol{\Omega}}\in\mathbb{F}_{\boldsymbol{\Omega}}:\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\leq C_{1}}h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}})>\inf_{\bar{\boldsymbol{\Omega}}\in\mathbb{F}_{\boldsymbol{\Omega}}:\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\leq C_{1}}h(\boldsymbol{\theta}^{*},\bar{\boldsymbol{\Omega}})\right)\rightarrow 1. (68)

The result in (68) indicates that, for any \boldsymbol{\theta} outside the set \{\boldsymbol{\theta}\in\mathbb{F}_{\boldsymbol{\theta}}|\sqrt{T}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}<c_{T}\}, the penalized log-likelihood exceeds that at \boldsymbol{\theta}^{*} with probability approaching 1. Therefore, with probability approaching 1, the minimizer must satisfy \sqrt{T}\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}\leq c_{T}. Since the choice of c_{T} is arbitrary, we conclude that \|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}=O_{P}(T^{-1/2}), and thus each block of \widehat{\boldsymbol{\theta}}, namely \widehat{\mathbf{A}}_{p},\widehat{\mathbf{B}}_{p},\widehat{\boldsymbol{\gamma}}_{q}, converges to its ground-truth value at the rate T^{-1/2}.

The convergence rate of \widehat{\mathbf{B}}_{p}\otimes\widehat{\mathbf{A}}_{p} can be derived from the following inequality:

𝐁^p𝐀^p𝐁p𝐀pF𝐁^pF𝐀^p𝐀pF+𝐁^p𝐁pF𝐀pF,\|\widehat{\mathbf{B}}_{p}\otimes\widehat{\mathbf{A}}_{p}-\mathbf{B}_{p}^{*}\otimes\mathbf{A}_{p}^{*}\|_{\mathrm{F}}\leq\|\widehat{\mathbf{B}}_{p}\|_{\mathrm{F}}\cdot\|\widehat{\mathbf{A}}_{p}-\mathbf{A}_{p}^{*}\|_{\mathrm{F}}+\|\widehat{\mathbf{B}}_{p}-\mathbf{B}_{p}^{*}\|_{\mathrm{F}}\cdot\|\mathbf{A}_{p}^{*}\|_{\mathrm{F}},

as well as the convergence rate of 𝐀^p\widehat{\mathbf{A}}_{p} and 𝐁^p\widehat{\mathbf{B}}_{p}.

 

E.3 Proof of Lemma 16

Proof  Based on the definition of 𝐖\mathbf{W} in equation (53), we have

𝐗~𝐖𝐗~T\displaystyle\frac{\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}}{T} =𝚺^𝐱,𝐱𝐈S(𝚺^𝐳,𝐱𝐊)(𝚺^𝐳,𝐳𝐊+λ𝐈SD)1(𝚺^𝐳,𝐱𝐈S)\displaystyle=\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}}\otimes\mathbf{I}_{S}-\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}^{\top}\otimes\mathbf{K}\right)\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}\otimes\mathbf{K}+\lambda\mathbf{I}_{SD}\right)^{-1}\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}\otimes\mathbf{I}_{S}\right)
=(𝚺^𝐱,𝐱𝚺^𝐳,𝐱𝚺^𝐳,𝐳1𝚺^𝐳,𝐱)𝐈S\displaystyle=\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}^{\top}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{-1}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}\right)\otimes\mathbf{I}_{S}
+(𝚺^𝐳,𝐱𝐈S)[𝚺^𝐳,𝐳2λ1𝐊+𝚺^𝐳,𝐳𝐈S]1(𝚺^𝐳,𝐱𝐈S),\displaystyle+\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}\otimes\mathbf{I}_{S}\right)^{\top}\left[\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{2}\otimes\lambda^{-1}\mathbf{K}+\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}\otimes\mathbf{I}_{S}\right]^{-1}\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}\otimes\mathbf{I}_{S}\right), (69)

where the second term in (69) is positive semi-definite since both ρ¯(𝚺^𝐳,𝐳)\underaccent{\bar}{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}) and ρ¯(𝐊)\underaccent{\bar}{\rho}(\mathbf{K}) are non-negative and the whole term is symmetric. Therefore, by Weyl’s inequality, one can lower bound ρ¯(𝐗~𝐖𝐗~/T)\underaccent{\bar}{\rho}(\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}/T) by ρ¯(𝚺^𝐱,𝐱𝚺^𝐳,𝐱𝚺^𝐳,𝐳1𝚺^𝐳,𝐱)\underaccent{\bar}{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}^{\top}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{-1}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}). For simplicity, we will use 𝐀,𝐁,𝐂\mathbf{A},\mathbf{B},\mathbf{C} to denote 𝚺𝐱,𝐱,𝚺𝐳,𝐱,(𝚺𝐳,𝐳)1\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*},\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*},(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*})^{-1}, and 𝐀^,𝐁^,𝐂^\widehat{\mathbf{A}},\widehat{\mathbf{B}},\widehat{\mathbf{C}} to denote 𝚺^𝐱,𝐱,𝚺^𝐳,𝐱,𝚺^𝐳,𝐳1\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}},\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}},\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{-1}, respectively. We will use 𝚺^\widehat{\boldsymbol{\Sigma}} and 𝚺\boldsymbol{\Sigma}^{*} to denote the estimated and true covariance matrix of [𝐱t,𝐳t][\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}. It is evident that 𝐀s𝚺s\|\mathbf{A}\|_{s}\leq\|\boldsymbol{\Sigma}^{*}\|_{s} and 𝐁s𝚺s\|\mathbf{B}\|_{s}\leq\|\boldsymbol{\Sigma}^{*}\|_{s}, since both 𝐀\mathbf{A} and 𝐁\mathbf{B} are blocks of 𝚺\boldsymbol{\Sigma}^{*} and can thus be represented as 𝐄1𝚺𝐄2\mathbf{E}_{1}^{\top}\boldsymbol{\Sigma}^{*}\mathbf{E}_{2} with 𝐄1,𝐄2\mathbf{E}_{1},\mathbf{E}_{2} being two block matrices with unity spectral norm.

The rest of the proof focuses on showing that with SlogS/T0S\log S/T\rightarrow 0, ρ¯(𝚺^𝐱,𝐱𝚺^𝐳,𝐱𝚺^𝐳,𝐳1𝚺^𝐳,𝐱)p.ρ¯(𝚺𝐱,𝐱(𝚺𝐳,𝐱)(𝚺𝐳,𝐳)1𝚺𝐳,𝐱)\underaccent{\bar}{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}^{\top}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{-1}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}})\overset{p.}{\rightarrow}\underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}-\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}\right)^{\top}\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*}\right)^{-1}\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}). For brevity, we omit the subscript ss for the spectral norm notation and simply use \|\cdot\| in this proof.

To start with, we have:

𝐀^𝐁^𝐂^𝐁^(𝐀𝐁𝐂𝐁)\displaystyle\|\widehat{\mathbf{A}}-\widehat{\mathbf{B}}^{\top}\widehat{\mathbf{C}}\widehat{\mathbf{B}}-(\mathbf{A}-\mathbf{B}^{\top}\mathbf{C}\mathbf{B})\|
𝐀^𝐀+𝐁^𝐂^𝐁^𝐁𝐂^𝐁+𝐁𝐂^𝐁𝐁𝐂𝐁\displaystyle\leq\|\widehat{\mathbf{A}}-\mathbf{A}\|+\|\widehat{\mathbf{B}}^{\top}\widehat{\mathbf{C}}\widehat{\mathbf{B}}-\mathbf{B}^{\top}\widehat{\mathbf{C}}\mathbf{B}\|+\|\mathbf{B}^{\top}\widehat{\mathbf{C}}\mathbf{B}-\mathbf{B}^{\top}\mathbf{C}\mathbf{B}\|
𝚺^𝚺+(𝐁^𝐁)𝐂^𝐁^+𝐁𝐂(𝐁^𝐁)+𝐁(𝐂^𝐂)𝐁^\displaystyle\leq\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}\|+\|(\widehat{\mathbf{B}}-\mathbf{B})^{\top}\widehat{\mathbf{C}}\widehat{\mathbf{B}}\|+\|\mathbf{B}^{\top}\mathbf{C}(\widehat{\mathbf{B}}-\mathbf{B})\|+\|\mathbf{B}^{\top}(\widehat{\mathbf{C}}-\mathbf{C})\widehat{\mathbf{B}}\|
𝚺^𝚺+𝐁^𝐁(𝐂^𝐁^+𝐂𝐁)\displaystyle\leq\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}\|+\|\widehat{\mathbf{B}}-\mathbf{B}\|\cdot\left(\|\widehat{\mathbf{C}}\|\cdot\|\widehat{\mathbf{B}}\|+\|\mathbf{C}\|\cdot\|\mathbf{B}\|\right)
+𝐁𝐁^𝐂^𝐂.\displaystyle+\|\mathbf{B}\|\cdot\|\widehat{\mathbf{B}}\|\cdot\|\widehat{\mathbf{C}}-\mathbf{C}\|. (70)

Based on Corollary 21, under the condition that SlogS/T0S\log S/T\rightarrow 0 and the conditions that 𝐳t\mathbf{z}_{t} follows a stationary VAR(Q~)(\widetilde{Q}) process and is jointly stationary with 𝐱t\mathbf{x}_{t}, we have 𝐂^𝐂p.0\|\widehat{\mathbf{C}}-\mathbf{C}\|\overset{p.}{\rightarrow}0 and 𝚺^𝚺p.0\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}^{*}\|\overset{p.}{\rightarrow}0. Therefore, with probability approaching 11, we have 𝐂^2𝐂\|\widehat{\mathbf{C}}\|\leq 2\|\mathbf{C}\|, 𝐁^𝐁𝚺^𝚺2𝚺\|\widehat{\mathbf{B}}-\mathbf{B}\|\leq\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}^{*}\|\leq 2\|\boldsymbol{\Sigma}^{*}\| and 𝐁^3𝚺\|\widehat{\mathbf{B}}\|\leq 3\|\boldsymbol{\Sigma}^{*}\|.

Combining these results and the upper bound in (70), with probability approaching 11, we have:

𝐀^𝐁^𝐂^𝐁^(𝐀𝐁𝐂𝐁)\displaystyle\|\widehat{\mathbf{A}}-\widehat{\mathbf{B}}^{\top}\widehat{\mathbf{C}}\widehat{\mathbf{B}}-(\mathbf{A}-\mathbf{B}^{\top}\mathbf{C}\mathbf{B})\| (1+7𝐂𝚺)𝚺^𝚺\displaystyle\leq\left(1+7\|\mathbf{C}\|\cdot\|\boldsymbol{\Sigma}^{*}\|\right)\cdot\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}^{*}\|
+3𝚺2𝐂^𝐂.\displaystyle+3\|\boldsymbol{\Sigma}^{*}\|^{2}\cdot\|\widehat{\mathbf{C}}-\mathbf{C}\|. (71)

The upper bound in (71) can be arbitrarily small as S,TS,T\rightarrow\infty since 𝐂^𝐂p.0\|\widehat{\mathbf{C}}-\mathbf{C}\|\overset{p.}{\rightarrow}0 and 𝚺^𝚺p.0\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}^{*}\|\overset{p.}{\rightarrow}0.

Eventually, with probability approaching 11, we have:

ρ¯(𝚺^𝐱,𝐱𝚺^𝐳,𝐱𝚺^𝐳,𝐳1𝚺^𝐳,𝐱)12ρ¯(𝚺𝐱,𝐱(𝚺𝐳,𝐱)(𝚺𝐳,𝐳)1𝚺𝐳,𝐱)=c0,S2.\underaccent{\bar}{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}^{\top}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{-1}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}})\geq\frac{1}{2}\underaccent{\bar}{\rho}\left(\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}-\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}\right)^{\top}\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*}\right)^{-1}\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}\right)=\frac{c_{0,S}}{2}. (72)

This completes the proof.

 

E.4 Proof of Lemma 17

Proof  By the definition of 𝐖\mathbf{W} in (53), we have:

tr(𝐈𝐖)\displaystyle\text{tr}\left(\mathbf{I}-\mathbf{W}\right) =tr[(𝚺^𝐳𝐊+λ𝐈SD)1(𝚺^𝐳𝐊)]\displaystyle=\mathrm{tr}\left[\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\lambda\mathbf{I}_{SD}\right)^{-1}\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}\right)\right]
=s=1Sd=1Dρd(𝚺^𝐳)ρs(𝐊)λ+ρd(𝚺^𝐳)ρs(𝐊)Ds=1S11+λρ¯(𝚺^𝐳)1ρs(𝐊)1.\displaystyle=\sum_{s=1}^{S}\sum_{d=1}^{D}\frac{\rho_{d}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\rho_{s}(\mathbf{K})}{\lambda+\rho_{d}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\rho_{s}(\mathbf{K})}\leq D\cdot\sum_{s=1}^{S}\frac{1}{1+\lambda\bar{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})^{-1}\rho_{s}(\mathbf{K})^{-1}}. (73)

Using Lemma 20, we can bound \bar{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}) by 2\bar{\rho}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*}) with probability approaching 1 as T\rightarrow\infty. Conditioning on this high-probability event and using Assumption 11 that the kernel function is separable, the kernel Gram matrix \mathbf{K} can be written as \mathbf{K}_{2}\otimes\mathbf{K}_{1}, and thus (73) can be bounded as:

Ds=1S11+λρ¯(𝚺^𝐳)1ρs(𝐊)1Di=1Mj=1N11+c𝐳λρi(𝐊1)1ρj(𝐊2)1,D\cdot\sum_{s=1}^{S}\frac{1}{1+\lambda\bar{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})^{-1}\rho_{s}(\mathbf{K})^{-1}}\leq D\cdot\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{\mathbf{z}}\lambda\rho_{i}(\mathbf{K}_{1})^{-1}\rho_{j}(\mathbf{K}_{2})^{-1}}, (74)

where c_{\mathbf{z}}=1/\left(2\bar{\rho}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*})\right). As M,N\rightarrow\infty, based on Assumption 10, we have \rho_{i}(\mathbf{K}_{1})\rightarrow Mi^{-r_{0}} and \rho_{j}(\mathbf{K}_{2})\rightarrow Nj^{-r_{0}}. Consequently, we can find two constants 0<c_{1}<c_{2}, with c_{1} sufficiently small and c_{2} sufficiently large, such that:

\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{2}\lambda(ij)^{r_{0}}/S} \leq\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{\mathbf{z}}\lambda\rho_{i}(\mathbf{K}_{1})^{-1}\rho_{j}(\mathbf{K}_{2})^{-1}}
\leq\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{1}\lambda(ij)^{r_{0}}/S}, (75)

where, with a slight abuse of notation, we incorporate c_{\mathbf{z}} into c_{1},c_{2}. To estimate the order of the lower and upper bounds in (75), we first notice that for any constant c>0, one has:

\sum_{i=1}^{M\wedge N}\frac{1}{1+c\lambda i^{2r_{0}}/S}\leq\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c\lambda(ij)^{r_{0}}/S}\leq 2(M\vee N)\sum_{i=1}^{M\vee N}\frac{1}{1+c\lambda i^{2r_{0}}/S}. (76)

To approximate the sum in (76), notice that:

\sum_{i=1}^{M\vee N}\frac{1}{1+c\lambda i^{2r_{0}}/S}=(S/c\lambda)^{1/2r_{0}}\cdot\sum_{i=1}^{M\vee N}\frac{1}{1+[\frac{i}{(S/c\lambda)^{1/2r_{0}}}]^{2r_{0}}}\cdot\frac{1}{(S/c\lambda)^{1/2r_{0}}},

and furthermore, we have:

\lim_{S\rightarrow\infty}\sum_{i=1}^{M\vee N}\frac{1}{1+[\frac{i}{(S/c\lambda)^{1/2r_{0}}}]^{2r_{0}}}\cdot\frac{1}{(S/c\lambda)^{1/2r_{0}}}=\int_{0}^{C}\frac{1}{1+x^{2r_{0}}}\mathrm{d}x<\infty,

where C=\lim_{S\rightarrow\infty}c(M\vee N)^{2r_{0}}\cdot\gamma_{S}. Under the assumptions of Theorem 12, M\vee N=O(\sqrt{S}) and \gamma_{S}\cdot S^{r_{0}}\rightarrow C_{1} with 0<C_{1}\leq\infty. As a result, C is either a finite constant or infinity, and thus:

\lim_{S\rightarrow\infty}\sum_{i=1}^{M\vee N}\frac{1}{1+c\lambda i^{2r_{0}}/S}=\int_{0}^{C}\frac{1}{1+x^{2r_{0}}}\mathrm{d}x\cdot\lim_{S\rightarrow\infty}(S/c\lambda)^{1/2r_{0}}=O(\gamma_{S}^{-1/2r_{0}}). (77)

Combining (73), (74), (75) and (77), we have \text{tr}\left(\mathbf{I}-\mathbf{W}\right)\lesssim O_{P}((M\vee N)\gamma_{S}^{-1/2r_{0}})=O_{P}(\sqrt{S}\gamma_{S}^{-1/2r_{0}}). To obtain the lower bound of \text{tr}\left(\mathbf{I}-\mathbf{W}\right), note that:

\text{tr}\left(\mathbf{I}-\mathbf{W}\right)\geq D\cdot\sum_{s=1}^{S}\frac{1}{1+\lambda c_{\mathbf{z}}^{\prime}\rho_{s}(\mathbf{K})^{-1}}\geq D\cdot\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{3}\lambda(ij)^{r_{0}}/S},

which holds with probability approaching 1, where c_{\mathbf{z}}^{\prime}=2/\underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*}) and the second inequality follows from (75). To further lower bound the double summation, note that:

\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{3}\lambda(ij)^{r_{0}}/S}\geq\sum_{i=1}^{M\wedge N}\frac{1}{1+c_{3}\lambda i^{2r_{0}}/S}.

This new lower bound can be approximated with the same method as (77) under the assumption that M\wedge N=O(\sqrt{S}). We can thus obtain the lower bound of \text{tr}\left(\mathbf{I}-\mathbf{W}\right) as O_{P}(\gamma_{S}^{-1/2r_{0}}), which establishes the final result.

The upper bound of \text{tr}\left(\mathbf{W}\right) is trivial since:

\text{tr}\left(\mathbf{W}\right)=\sum_{s=1}^{S}\sum_{d=1}^{D}\frac{\lambda}{\lambda+\rho_{d}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\rho_{s}(\mathbf{K})}\leq SD.
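The eigenvalue identities used above can be checked numerically. Below is a minimal sketch in Python/NumPy; the dimensions D, M, N, the decay exponent r_0, and the value of \lambda are illustrative assumptions of the sketch, not settings used in the paper. It builds a small surrogate for \widehat{\boldsymbol{\Sigma}}_{\mathbf{z}} and a separable Gram matrix \mathbf{K}=\mathbf{K}_{2}\otimes\mathbf{K}_{1}, then compares the direct trace of \mathbf{I}-\mathbf{W} against the eigenvalue double sum in (73).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and tuning parameter (assumptions for this sketch only)
D, M, N, r0, lam = 3, 12, 12, 1.4, 0.05
S = M * N

# A small positive-definite surrogate for Sigma_z-hat
A = rng.standard_normal((D, D))
Sigma_z = A @ A.T / D + np.eye(D)

def psd_with_spectrum(evals):
    """Random symmetric PSD matrix with a prescribed eigen-spectrum."""
    Q, _ = np.linalg.qr(rng.standard_normal((len(evals), len(evals))))
    return Q @ np.diag(evals) @ Q.T

# Separable kernel Gram matrix K = K2 (x) K1 with spectra ~ M i^{-r0}, N j^{-r0}
K1 = psd_with_spectrum(M * np.arange(1, M + 1.0) ** (-r0))
K2 = psd_with_spectrum(N * np.arange(1, N + 1.0) ** (-r0))
K = np.kron(K2, K1)

# Direct computation: W = lam * (Sigma_z (x) K + lam I)^{-1}, so
# tr(I - W) = tr[(Sigma_z (x) K + lam I)^{-1} (Sigma_z (x) K)]
G = np.kron(Sigma_z, K)
trace_direct = np.trace(np.linalg.solve(G + lam * np.eye(S * D), G))

# Eigenvalue double sum as in (73)
rho_z = np.linalg.eigvalsh(Sigma_z)
rho_K = np.linalg.eigvalsh(K)
trace_eig = sum(rd * rs / (lam + rd * rs) for rd in rho_z for rs in rho_K)

print(trace_direct, trace_eig)  # the two values agree up to numerical error
```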

 

E.5 Proof of Lemma 18

Proof  Given any fixed \mathbf{W} and t>0, let K=\sqrt{8/3}\sigma and let c>0 be a constant; then we have:

\mathrm{P}\left[\left|\mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E}-\sigma^{2}\text{tr}\left(\mathbf{W}\right)\right|>t\bigg{|}\mathbf{W}\right]\leq 2\exp\left[-c\min\left(\frac{t^{2}}{K^{4}\|\mathbf{W}\|_{\mathrm{F}}^{2}},\frac{t}{K^{2}\|\mathbf{W}\|_{s}}\right)\right], (78)

which holds by the Hanson-Wright inequality (Rudelson and Vershynin, 2013). Letting t=\sigma^{2}\text{tr}\left(\mathbf{W}\right)/2 and c_{1}=9c/256, and using the fact that \|\mathbf{W}\|_{s}\leq 1, we have:

2\exp\left[-c\min\left(\frac{t^{2}}{K^{4}\|\mathbf{W}\|_{\mathrm{F}}^{2}},\frac{t}{K^{2}\|\mathbf{W}\|_{s}}\right)\right]\leq 2\exp\left[-c_{1}\text{tr}\left(\mathbf{W}\right)\right]. (79)

We can lower bound the trace of \mathbf{W} as follows. First, note that:

\text{tr}\left(\mathbf{W}\right)=\sum_{s=1}^{S}\sum_{d=1}^{D}\frac{\lambda}{\lambda+\rho_{d}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\rho_{s}(\mathbf{K})}\geq SD\cdot\frac{\lambda}{\lambda+\bar{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\bar{\rho}(\mathbf{K})}.

By the assumption that \bar{\rho}(\mathbf{K}) is bounded and the fact that \bar{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\leq 2\bar{\rho}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*}) with probability approaching 1 as T\rightarrow\infty, we have:

\mathrm{P}\left[\text{tr}\left(\mathbf{W}\right)\geq\frac{SD\lambda}{\lambda+\bar{c}}\right]\rightarrow 1,\quad\text{ as }T\rightarrow\infty, (80)

where \bar{c}=2\bar{\rho}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*})\bar{\rho}(\mathbf{K}). Since r_{0}<2 and \gamma_{S}\cdot S^{r_{0}}\rightarrow C_{1} as S\rightarrow\infty, with C_{1} being either a positive constant or infinity, we have \gamma_{S}\cdot S^{2}=\gamma_{S}S^{r_{0}}\cdot S^{2-r_{0}}\rightarrow\infty, i.e., \lambda\cdot S\rightarrow\infty, and hence SD\lambda/(\lambda+\bar{c})\rightarrow\infty. Therefore, we have \text{tr}\left(\mathbf{W}\right)\rightarrow\infty with probability approaching 1, as S,T\rightarrow\infty.

With these results, we can now upper bound the unconditional probability of the event \{\left|\mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E}-\sigma^{2}\text{tr}\left(\mathbf{W}\right)\right|>\sigma^{2}\text{tr}\left(\mathbf{W}\right)/2\} as follows:

\mathrm{P}\left[\left|\mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E}-\sigma^{2}\text{tr}\left(\mathbf{W}\right)\right|>\sigma^{2}\text{tr}\left(\mathbf{W}\right)/2\right]
\leq\mathrm{E}\left[2\exp\left[-c_{1}\text{tr}\left(\mathbf{W}\right)\right]\right]
\leq 2\left\{1\cdot\mathrm{P}\left(\text{tr}\left(\mathbf{W}\right)<\frac{SD\lambda}{\lambda+\bar{c}}\right)+\exp\left[-c_{1}\frac{SD\lambda}{\lambda+\bar{c}}\right]\cdot\mathrm{P}\left(\text{tr}\left(\mathbf{W}\right)\geq\frac{SD\lambda}{\lambda+\bar{c}}\right)\right\}\rightarrow 0. (81)

This indicates that \mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E}\asymp\text{tr}\left(\mathbf{W}\right).

The proof of \mathbfcal{E}^{\top}\left(\mathbf{I}-\mathbf{W}\right)^{2}\mathbfcal{E}\asymp\text{tr}\left(\left(\mathbf{I}-\mathbf{W}\right)^{2}\right) is similar to the proof above. In the first step, similar to (78) and (79), we have the following tail probability bound:

\mathrm{P}\left[\left|\mathbfcal{E}^{\top}\left(\mathbf{I}-\mathbf{W}\right)^{2}\mathbfcal{E}-\sigma^{2}\text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)\right|>\frac{\sigma^{2}\text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)}{2}\bigg{|}\mathbf{W}\right]
\leq 2\exp\left\{-c\,\text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)\right\}. (82)

We can establish the divergence of \text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right) by following the same argument as in the proof of Lemma 17, which gives:

\text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)\geq(S/c\lambda)^{1/2r_{0}}\cdot\sum_{i=1}^{M\wedge N}\left\{\frac{1}{1+\left[\frac{i}{(S/c\lambda)^{1/2r_{0}}}\right]^{2r_{0}}}\right\}^{2}(S/c\lambda)^{-1/2r_{0}},

with probability approaching 1, where c is some constant. Therefore, we have \text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)\gtrsim O_{P}(\gamma_{S}^{-1/2r_{0}}). The rest of the proof follows the argument leading to (81), and we omit the details here.  
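The concentration of the quadratic forms around their traces can also be illustrated by simulation. The following minimal sketch (the dimension n, the ridge level \lambda, and the generic PSD matrix standing in for \widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K} are assumptions of the sketch) draws Gaussian noise vectors and checks that \mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E} stays within a constant factor of \sigma^{2}\text{tr}(\mathbf{W}), consistent with the event bounded in (81).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimension, ridge parameter, and noise scale (assumptions for the sketch)
n, lam, sigma = 400, 0.1, 1.0

# A generic PSD matrix playing the role of Sigma_z (x) K, and W = lam*(G + lam I)^{-1}
B = rng.standard_normal((n, n))
G = B @ B.T / n
W = lam * np.linalg.inv(G + lam * np.eye(n))

ratios = []
for _ in range(500):
    eps = sigma * rng.standard_normal(n)          # Gaussian (hence sub-Gaussian) noise
    ratios.append(eps @ W @ eps / (sigma ** 2 * np.trace(W)))

# The quadratic form stays within a constant factor of sigma^2 tr(W)
print(np.quantile(ratios, [0.025, 0.5, 0.975]))
```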

E.6 Proof of Lemma 19

Proof  For any two symmetric matrices \mathbf{M},\mathbf{N} of identical size, we use \mathbf{M}\gtrsim\mathbf{N} to indicate that \mathbf{M}-\mathbf{N} is positive semi-definite, and we use \mathbf{M}^{1/2} to denote the symmetric, positive semi-definite square root of \mathbf{M}.

Since \mathbf{A}-\mathbf{B} is positive semi-definite, multiplying \mathbf{A}-\mathbf{B} by \mathbf{B}^{-1/2} on both the left and the right gives \mathbf{B}^{-1/2}\mathbf{A}\mathbf{B}^{-1/2}\gtrsim\mathbf{I}, i.e., \mathbf{B}^{-1/2}\mathbf{A}^{1/2}\mathbf{A}^{1/2}\mathbf{B}^{-1/2}\gtrsim\mathbf{I}. Notice that the matrix \mathbf{A}^{1/2}\mathbf{B}^{-1/2} is invertible and thus has no zero eigenvalues. As a result, the eigenvalues of \mathbf{B}^{-1/2}\mathbf{A}^{1/2}\mathbf{A}^{1/2}\mathbf{B}^{-1/2} coincide with those of \mathbf{A}^{1/2}\mathbf{B}^{-1/2}\mathbf{B}^{-1/2}\mathbf{A}^{1/2}, and thus \mathbf{A}^{1/2}\mathbf{B}^{-1/2}\mathbf{B}^{-1/2}\mathbf{A}^{1/2}\gtrsim\mathbf{I}. Multiplying both sides by \mathbf{A}^{-1/2} on the left and the right yields \mathbf{B}^{-1}\gtrsim\mathbf{A}^{-1}, which completes the proof.  
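A quick numerical sanity check of Lemma 19 is given below; the matrix size and the random construction of \mathbf{A} and \mathbf{B} are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6

# Random positive-definite B, and A = B + (a PSD perturbation), so A - B is PSD
Rb = rng.standard_normal((n, n))
B = Rb @ Rb.T + np.eye(n)
Rp = rng.standard_normal((n, n // 2))
A = B + Rp @ Rp.T

# Lemma 19: A - B PSD (and both invertible) implies B^{-1} - A^{-1} is PSD
diff = np.linalg.inv(B) - np.linalg.inv(A)
print(np.linalg.eigvalsh(diff).min() >= -1e-10)   # True up to round-off
```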

Appendix F Additional Details on Simulation Experiments

We generate the simulated dataset according to the MARAC(P,Q) model specified by (1) and (3). We simulate the autoregressive coefficients \mathbf{A}_{p},\mathbf{B}_{p} such that they satisfy the stationarity condition specified in Theorem 6 and have a banded structure. We use a similar setup for generating \boldsymbol{\Sigma}_{r},\boldsymbol{\Sigma}_{c}, with their diagonal entries fixed at unity. In Figure 5, we plot the simulated \mathbf{A}_{1},\mathbf{B}_{1},\boldsymbol{\Sigma}_{r},\boldsymbol{\Sigma}_{c} for (M,N)=(20,20); a minimal sketch of one way to generate such banded coefficients is given after Figure 5.

Figure 5: Visualization of the simulated \mathbf{A}_{1},\mathbf{B}_{1},\boldsymbol{\Sigma}_{r},\boldsymbol{\Sigma}_{c} with M=N=20.
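The sketch below (Python/NumPy) illustrates one way to draw banded autoregressive coefficients. The bandwidth, the scaling constant, and the use of the spectral radius of \mathbf{B}_{1}\otimes\mathbf{A}_{1} as a proxy for the stationarity condition of Theorem 6 are assumptions of the sketch, not the exact construction used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
M = N = 20
bandwidth = 2          # banded structure: entries beyond this off-diagonal are set to zero

def banded_coefficient(size, bandwidth, scale=0.3):
    """Random banded matrix, rescaled so that its spectral norm equals `scale`."""
    C = rng.uniform(-1, 1, size=(size, size))
    mask = np.abs(np.subtract.outer(np.arange(size), np.arange(size))) <= bandwidth
    C = C * mask
    return scale * C / np.linalg.norm(C, 2)

A1 = banded_coefficient(M, bandwidth)
B1 = banded_coefficient(N, bandwidth)

# Proxy check for stationarity: spectral radius of B1 (x) A1 strictly below 1
rho = np.max(np.abs(np.linalg.eigvals(np.kron(B1, A1))))
print(rho < 1)
```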

To generate g_{1},g_{2},g_{3}\in\mathbb{H}_{k} and mimic the spatial grid in our real data application in Section 6, we specify the 2-D spatial grid with its two dimensions being the latitude and longitude of points on the unit sphere \mathbb{S}^{2}. Each of the evenly spaced M\times N grid points has polar-azimuthal coordinates (\theta_{i},\phi_{j})\in[0^{\circ},180^{\circ}]\times[0^{\circ},360^{\circ}], i\in[M], j\in[N], and the sampled grid points on the sphere are projected onto a plane to form an M\times N matrix. The polar angle \theta (co-latitude) and the azimuthal angle \phi (longitude) are standard in the spherical coordinate system, with corresponding Euclidean coordinates (x,y,z)=(\sin(\theta)\cos(\phi),\sin(\theta)\sin(\phi),\cos(\theta)).
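For concreteness, the grid construction can be sketched as follows; whether the azimuthal endpoint 360^{\circ} is duplicated and the row-major ordering of the S=MN locations are conventions assumed here for illustration.

```python
import numpy as np

M = N = 20

# Evenly spaced polar (co-latitude) and azimuthal (longitude) angles, in radians
theta = np.deg2rad(np.linspace(0.0, 180.0, M))
phi = np.deg2rad(np.linspace(0.0, 360.0, N, endpoint=False))
TH, PH = np.meshgrid(theta, phi, indexing="ij")       # M x N grid of angle pairs

# Euclidean coordinates (x, y, z) of the grid points on the unit sphere S^2
X = np.sin(TH) * np.cos(PH)
Y = np.sin(TH) * np.sin(PH)
Z = np.cos(TH)
points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)  # S = M*N points, row-major
```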

As for the spatial kernel, we choose the Lebedev kernel:

k_{\eta}(s_{1},s_{2})=\left(\frac{1}{4\pi}+\frac{\eta}{12\pi}\right)-\frac{\eta}{8\pi}\sqrt{\frac{1-\left<s_{1},s_{2}\right>}{2}},\quad s_{1},s_{2}\in\mathbb{S}^{2}, (83)

where \left<\cdot,\cdot\right> denotes the inner product of two points on the sphere \mathbb{S}^{2} (i.e., the cosine of the angle between them) and \eta is a hyperparameter of the kernel. In both the simulation experiments and the real data application, we fix \eta=3. The Lebedev kernel has the spherical harmonic functions as its eigenfunctions:

k_{\eta}(s_{1},s_{2})=\frac{1}{4\pi}+\sum_{l=1}^{\infty}\frac{\eta}{(4l^{2}-1)(2l+3)}\sum_{m=-l}^{l}Y_{l}^{m}(s_{1})Y_{l}^{m}(s_{2}),

where Y_{l}^{m}(\cdot) are orthonormal real spherical harmonic basis functions on the sphere \mathbb{S}^{2}:

Y_{l}^{m}(s)=Y_{l}^{m}(\theta,\phi)=\begin{dcases}\sqrt{2}N_{lm}P_{l}^{m}(\cos(\theta))\cos(m\phi)&\text{if $m>0$}\\ N_{l0}P_{l}^{0}(\cos(\theta))&\text{if $m=0$}\\ \sqrt{2}N_{l|m|}P_{l}^{|m|}(\cos(\theta))\sin(|m|\phi)&\text{if $m<0$}\end{dcases},

with N_{lm}=\sqrt{(2l+1)(l-m)!/(4\pi(l+m)!)} and P_{l}^{m}(\cdot) being the associated Legendre polynomial of degree l and order m. We refer readers to Kennedy et al. (2013) for details on spherical harmonic functions and the associated isotropic kernels. Under our 2-D grid setup and this choice of kernel, we find empirically that the eigen-spectrum of the kernel Gram matrix \mathbf{K} decays at a rate of approximately \rho_{i}(\mathbf{K})\approx i^{-r} with r\in[1.3,1.5].
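The empirical decay rate of the Gram matrix's eigen-spectrum can be examined with the short sketch below. It evaluates the Lebedev kernel (83) on a spherical grid (cell-midpoint co-latitudes are used here, an assumption made to avoid duplicated pole points) and fits the exponent r by least squares on the leading eigenvalues.

```python
import numpy as np

M = N = 20
eta = 3.0

# Grid on the unit sphere; midpoint co-latitudes avoid duplicated pole points
theta = np.deg2rad((np.arange(M) + 0.5) * 180.0 / M)
phi = np.deg2rad(np.arange(N) * 360.0 / N)
TH, PH = np.meshgrid(theta, phi, indexing="ij")
points = np.stack([np.sin(TH) * np.cos(PH),
                   np.sin(TH) * np.sin(PH),
                   np.cos(TH)], axis=-1).reshape(-1, 3)

def lebedev_gram(P, eta):
    """Gram matrix of the Lebedev kernel (83) for unit vectors stored in the rows of P."""
    inner = np.clip(P @ P.T, -1.0, 1.0)               # <s1, s2> = cosine of the angle
    return (1/(4*np.pi) + eta/(12*np.pi)) - eta/(8*np.pi) * np.sqrt((1.0 - inner) / 2.0)

K = lebedev_gram(points, eta)                         # S x S Gram matrix, S = M*N

# Fit log rho_i(K) ~ -r * log i on the leading part of the spectrum
evals = np.sort(np.linalg.eigvalsh(K))[::-1]
idx = np.arange(1, 151)
r_hat = -np.polyfit(np.log(idx), np.log(evals[:150]), 1)[0]
print(r_hat)   # the paper reports r in [1.3, 1.5] empirically on its grid
```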

We randomly sample g_{1},g_{2},g_{3} from Gaussian processes whose covariance kernel is the Lebedev kernel in (83). Finally, we simulate the vector time series \mathbf{z}_{t} from a VAR(1) process. In Figure 6, we visualize the simulated functional parameters as well as the vector time series from one random draw; a minimal code sketch of this sampling step follows Figure 6.

Figure 6: Simulated functional parameters g_{1},g_{2},g_{3} evaluated on a 20\times 20 spatial grid (top row) and the corresponding auxiliary vector time series (bottom row).
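A minimal sketch of this sampling step is given below (Python/NumPy). The jitter added before the Cholesky factorization, the VAR(1) coefficient matrix, the innovation scale, and the series length T are illustrative assumptions, and the spherical grid follows the midpoint convention of the kernel sketch above.

```python
import numpy as np

rng = np.random.default_rng(4)
M = N = 20
eta = 3.0

# Rebuild the spherical grid and the Lebedev Gram matrix K (see the kernel sketch above)
theta = np.deg2rad((np.arange(M) + 0.5) * 180.0 / M)
phi = np.deg2rad(np.arange(N) * 360.0 / N)
TH, PH = np.meshgrid(theta, phi, indexing="ij")
P = np.stack([np.sin(TH) * np.cos(PH), np.sin(TH) * np.sin(PH), np.cos(TH)],
             axis=-1).reshape(-1, 3)
inner = np.clip(P @ P.T, -1.0, 1.0)
K = (1/(4*np.pi) + eta/(12*np.pi)) - eta/(8*np.pi) * np.sqrt((1.0 - inner) / 2.0)

# Draw g_1, g_2, g_3 as Gaussian-process samples with covariance K, reshaped to M x N
S = M * N
L = np.linalg.cholesky(K + 1e-6 * np.eye(S))           # small jitter for numerical stability
g1, g2, g3 = [(L @ rng.standard_normal(S)).reshape(M, N) for _ in range(3)]

# Auxiliary vector time series from a stationary VAR(1):  z_t = Phi z_{t-1} + e_t
D, T = 3, 500
Phi = 0.6 * np.eye(D) + 0.1 * rng.standard_normal((D, D))
Phi *= 0.9 / max(np.abs(np.linalg.eigvals(Phi)).max(), 0.9)  # enforce spectral radius < 1
z = np.zeros((T, D))
for t in range(1, T):
    z[t] = Phi @ z[t - 1] + 0.1 * rng.standard_normal(D)
```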

In Figure 7, we visualize the ground truth g_{3} together with its penalized MLE and truncated penalized MLE estimators. The truncated penalized MLE estimators give a smooth approximation to g_{3}, and the approximation improves as R increases. For accuracy, R should therefore be chosen as large as the available computational resources allow.

Figure 7: Ground truth g_{3} (panel (a)) against the penalized MLE estimator \widehat{g}_{3} (panel (b)) and the truncated penalized MLE estimator \widehat{g}_{3} using R\in\{49,81,121\} basis functions; M=20.