
Matrix Autoregressive Model with Vector Time Series Covariates for Spatio-Temporal Data

Hu Sun
Department of Statistics, University of Michigan, Ann Arbor
Zuofeng Shang
Department of Mathematical Sciences, New Jersey Institute of Technology
and
Yang Chen
Department of Statistics, University of Michigan, Ann Arbor
Abstract

We develop a new methodology for forecasting matrix-valued time series with historical matrix data and auxiliary vector time series data. We focus on a time series of matrices defined on a static 2-D spatial grid and an auxiliary time series of non-spatial vectors. The proposed model, Matrix AutoRegression with Auxiliary Covariates (MARAC), contains an autoregressive component for the historical matrix predictors and an additive component that maps the auxiliary vector predictors to a matrix response via tensor-vector product. The autoregressive component adopts a bi-linear transformation framework following Chen et al. (2021), significantly reducing the number of parameters. The auxiliary component posits that the tensor coefficient, which maps non-spatial predictors to a spatial response, contains slices of spatially smooth matrix coefficients that are discrete evaluations of smooth functions from a Reproducing Kernel Hilbert Space (RKHS). We propose to estimate the model parameters under a penalized maximum likelihood estimation framework coupled with an alternating minimization algorithm. We establish the joint asymptotics of the autoregressive and tensor parameters under fixed and high-dimensional regimes. Extensive simulations and a geophysical application for forecasting the global Total Electron Content (TEC) are conducted to validate the performance of MARAC.


Keywords: Auxiliary covariates, matrix autoregression, reproducing kernel Hilbert space (RKHS), spatio-temporal forecast, tensor data model

1 Introduction

Matrix-valued time series data have received increasing attention in multiple scientific fields, such as economics (Wang et al., 2019), geophysics (Sun et al., 2022), and environmental science (Dong et al., 2020), where scientists are interested in modeling the joint dynamics of data observed on a 2-D grid across time. This paper focuses on matrix-valued data defined on a 2-D spatial grid that contains the geographical information of the individual observations. As a concrete example, we visualize the global Total Electron Content (TEC) distribution in Figure 1. TEC is the density of electrons in the Earth’s ionosphere along the vertical pathway connecting a radio transmitter and a ground-based receiver. An accurate prediction of the global TEC is critical since it can foretell the impact of space weather on the positioning, navigation, and timing (PNT) service (Wang et al., 2021; Younas et al., 2022). Each image in panels (A)-(C) is a 71\times 73 matrix, distributed on a spatial grid with a 2.5^{\circ}-latitude-by-5^{\circ}-longitude resolution.

Figure 1: An example of matrix time series with auxiliary vector time series. Panels (A)-(C) show the global Total Electron Content (TEC) distribution at three timestamps on the latitude-local-time grid (source: the IGS TEC database (Hernández-Pajares et al., 2009)). Panel (D) plots the auxiliary Sym-H index time series, which measures the impact of solar eruptions on Earth. We highlight the time of panels (A)-(C) in (D) with arrows.

The matrix-valued time series, such as the TEC time series, is often accompanied by an auxiliary vector time series that measures the same object, such as the Earth’s ionosphere, through a different data modality. In panel (D) of Figure 1, we plot the global SYM-H index, which measures the geomagnetic activity caused by solar eruptions that can ultimately impact the global TEC distribution. These non-spatial auxiliary covariates carry additional information related to the dynamics of the matrix time series data.

In this paper, we investigate the problem of forecasting future matrix data jointly with the historical matrices and the vector time series covariates. There are two major challenges in this modeling context. From the perspective of building a matrix-variate regression model, we need to integrate the information of predictors with non-uniform modes, namely both matrices and vectors. Adding the auxiliary vector covariates benefits the prediction and enables domain scientists to understand the interplay between different data modalities but at the same time, it complicates the modeling and the subsequent theoretical analysis. From the perspective of spatio-temporal analysis (Cressie and Wikle, 2015), we need to properly leverage the spatial information of the data and transform the classical spatial statistics framework to accommodate the grid geometry of matrix-valued data. In the remainder of this section, we briefly review the related literature that can shed light on these challenges and then summarize our unique contributions.

A naive yet straightforward prediction model is to vectorize the matrices and make predictions via Vector Autoregression (VAR) (Stock and Watson, 2001). In this approach, the auxiliary vector covariates can be incorporated easily by concatenating them with the vectorized matrix predictors. However, vectorizing matrix data loses spatial information and requires a large number of parameters given the high dimensionality of the data. To avoid vectorizing the matrix data, scalar-on-tensor regression (Zhou et al., 2013; Guhaniyogi et al., 2017; Li et al., 2018; Papadogeorgou et al., 2021) tackles the problem by using matrix predictors directly. However, these models are built for scalar responses, while in our setting we are dealing with matrix responses. Dividing the matrix response into individual scalar responses and fitting scalar-on-tensor regressions still requires a significant number of parameters and, more importantly, fails to take the spatial information of the response into account.

The statistical framework that can incorporate matrices as both predictors and response is the tensor-on-tensor regression (Lock, 2018; Liu et al., 2020; Luo and Zhang, 2022) and more specifically for time series data, the matrix/tensor autoregression (Chen et al., 2021; Li and Xiao, 2021; Hsu et al., 2021; Wang et al., 2024). The matrix/tensor predictors are mapped to matrix/tensor responses via multi-linear transformations that greatly reduce the number of parameters. Our work builds on this framework and incorporates the non-spatial vector predictors at the same time.

To incorporate the vector predictor in the same model, we need to map vector predictors to matrix responses. Tensor-on-scalar regression (Rabusseau and Kadri, 2016; Sun and Li, 2017; Li and Zhang, 2017; Guha and Guhaniyogi, 2021) illustrates a way of mapping low-order scalar/vector predictors to high-order matrix/tensor responses via taking the tensor-vector product of the predictor with a high-order tensor coefficient. In this paper, we take a similar approach and introduce a 3-D tensor coefficient for the vector predictors such that our model can take predictors with non-uniform modes, which is a key distinction of our model compared to existing works.

The other distinction is that our model leverages the spatial information of the matrix response. A key assumption in our model is that the vector predictor has similar predictive effects on neighboring locations in the matrix response, which is equivalent to saying that the tensor coefficient is spatially smooth. Typically, the estimation of spatially smooth tensor coefficients in such regression models is done by adding a total-variation (TV) penalty (Wang et al., 2017; Shen et al., 2022; Sun et al., 2023) on the unknown tensor. The TV penalty leads to piecewise smooth estimators with sharp edges and enables feature selection. However, estimation with the TV penalty requires solving non-convex optimization problems, making the subsequent theoretical analysis difficult. In our model, we take a simpler approach by assuming that the tensor coefficients are discrete evaluations of functional parameters from a Reproducing Kernel Hilbert Space (RKHS). Such a kernel method has been widely used in scalar-on-image regression (Kang et al., 2018), where the regression coefficients of the image predictor are constrained to be spatially smooth.

We facilitate the estimation of the unknown functional parameters with the functional norm penalty. Functional norm penalties have been widely used for estimating smooth functions in classic semi/non-parametric learning in which data variables are either scalar/vector-valued (see Hastie et al., 2009; Gu, 2013; Yuan and Cai, 2010; Cai and Yuan, 2012; Shang and Cheng, 2013, 2015; Cheng and Shang, 2015; Yang et al., 2020). To the best of our knowledge, the present article is the first to consider functional norm penalty for tensor coefficient estimation in a matrix autoregressive setting.

To summarize, our paper makes two major contributions. Firstly, we build a unified matrix autoregression framework for spatio-temporal data that incorporates lower-order scalar/vector time series covariates. Such a framework is strongly motivated by applications in which domain scientists seek to integrate spatial and non-spatial data for prediction and inference. The framework also bridges regression methodologies with tensor predictors and responses of non-uniform modes, making the theoretical investigation itself an interesting topic. Secondly, we propose to estimate the coefficients of the auxiliary covariates, together with the autoregressive coefficients, in a single penalized maximum likelihood estimation (MLE) framework with an RKHS functional norm penalty. The RKHS framework builds spatial continuity into the regression coefficients. We establish the joint asymptotics of the autoregressive coefficients and the functional parameters under fixed and high matrix dimensionality regimes, propose an efficient alternating minimization algorithm for estimation, and validate the method with extensive simulations and a real data application.

The remainder of the paper is organized as follows. We introduce our model formally in Section 2 and provide model interpretations and comparisons in sufficient detail. Section 3 introduces the penalized MLE framework and describes the exact and approximate estimation algorithms. Large sample properties of the estimators under fixed and high matrix dimensionality are established in Section 4. Section 5 provides extensive simulation studies for validating the consistency of the estimators, demonstrating BIC-based model selection results, and comparing our method with various competitors. We apply our method to the global TEC data in Section 6 and make conclusions in Section 7. Technical proofs and additional details of the simulation and real data applications are deferred to supplemental materials.

2 Model

2.1 Notation

We adopt the following notation throughout the article. We use calligraphic bold-face letters (e.g. \mathbfcal{X},\mathbfcal{G}) for tensors with at least three modes, uppercase bold-face letters (e.g. \mathbf{X},\mathbf{G}) for matrices, lowercase bold-face letters (e.g. \mathbf{x},\mathbf{z}) for vectors, and blackboard bold-face letters for sets (e.g. \mathbb{R},\mathbb{S}). To subscript any tensor/matrix/vector, we use square brackets with subscripts such as [\mathbfcal{G}]_{ijd}, [\mathbf{z}_{t}]_{d}, [\mathbf{X}_{t}]_{ij}, and we keep the subscript t inside the square bracket to index time. Any fibers and slices of a tensor are subscripted with colons such as [\mathbfcal{G}]_{ij:}, [\mathbfcal{G}]_{::d}, and thus any row and column of a matrix is denoted as [\mathbf{X}_{t}]_{i:} and [\mathbf{X}_{t}]_{:j}. If the slices of a tensor/matrix are based on the last mode, such as [\mathbfcal{G}]_{::d} and [\mathbf{X}_{t}]_{:j}, we will often omit the colons and write [\mathbfcal{G}]_{d} and [\mathbf{X}_{t}]_{j} for brevity. For any tensor \mathbfcal{X}, we use \mathrm{\mathbf{vec}}\left(\mathbfcal{X}\right) to denote the vectorized tensor. For any two tensors \mathbfcal{X},\mathbfcal{Y} of identical size, we define their inner product as \langle\mathbfcal{X},\mathbfcal{Y}\rangle=\mathrm{\mathbf{vec}}\left(\mathbfcal{X}\right)^{\top}\mathrm{\mathbf{vec}}\left(\mathbfcal{Y}\right), and we use \|\mathbfcal{X}\|_{\mathrm{F}}=\sqrt{\langle\mathbfcal{X},\mathbfcal{X}\rangle} to denote the Frobenius norm of a tensor.

Following Li and Zhang (2017), the tensor-vector product between a tensor \mathbfcal{G} of size d_{1}\times\cdots\times d_{K+1} and a vector \mathbf{z}\in\mathbb{R}^{d_{K+1}}, denoted as \mathbfcal{G}\bar{\times}_{(K+1)}\mathbf{z}, or simply \mathbfcal{G}\bar{\times}\mathbf{z}, is a tensor of size d_{1}\times\cdots\times d_{K} with [\mathbfcal{G}\bar{\times}\mathbf{z}]_{i_{1}\ldots i_{K}}=\sum_{i_{K+1}}[\mathbfcal{G}]_{i_{1}\ldots i_{K}i_{K+1}}\cdot[\mathbf{z}]_{i_{K+1}}. For a tensor \mathbfcal{X}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}, we use \mathbf{X}_{(k)}\in\mathbb{R}^{d_{k}\times\prod_{m\neq k}d_{m}} to denote its k-mode matricization. The Kronecker product between matrices is denoted by \mathbf{A}\otimes\mathbf{B} and the trace of a square matrix \mathbf{A} is denoted as \text{tr}\left(\mathbf{A}\right). We use \bar{\rho}(\cdot),\underaccent{\bar}{\rho}(\cdot),\rho_{i}(\cdot) to denote the maximum, minimum, and i^{\rm{th}} largest eigenvalue of a matrix. We use \text{diag}(\mathbf{C}_{1},\ldots,\mathbf{C}_{d}) to denote a block diagonal matrix with \mathbf{C}_{1},\ldots,\mathbf{C}_{d} along the diagonal. More details on the related tensor algebra can be found in Kolda and Bader (2009).
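
For readers implementing these operations, the following minimal NumPy sketch (illustrative, not part of the paper; all array names are hypothetical) shows the tensor-vector product \mathbfcal{G}\bar{\times}\mathbf{z} and the k-mode matricization under the vec(\cdot) convention used here.

import numpy as np

M, N, D = 4, 5, 3
G = np.random.randn(M, N, D)        # a 3-mode tensor coefficient
z = np.random.randn(D)              # an auxiliary vector

# tensor-vector product along the last mode: an M x N matrix with
# [G xbar z]_{ij} = sum_d [G]_{ijd} [z]_d
G_bar_z = np.tensordot(G, z, axes=([2], [0]))

# k-mode matricization X_(k): move mode k to the front and flatten the rest
# in column-major order, matching the vec(.) convention u = i + (j-1)M.
def matricize(T, k):
    return np.reshape(np.moveaxis(T, k, 0), (T.shape[k], -1), order="F")

G3 = matricize(G, 2)                # mode-3 unfolding, shape (D, M*N)
# consistency check: vec(G xbar z) equals G_(3)^T z
assert np.allclose(G_bar_z.flatten(order="F"), G3.T @ z)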

For the matrix time series \mathbf{X}_{t}\in\mathbb{R}^{M\times N} in our modeling context, we assume that all S=MN grid locations are points on an M\times N grid within the domain \bar{\mathbb{S}}=[0,1]^{2}. The collection of all the spatial locations is denoted as \mathbb{S} and the element of \mathbb{S} corresponding to the (i,j)^{\rm{th}} entry of the matrix is denoted as s_{ij}. We will often index the (i,j)^{\rm{th}} entry of the matrix \mathbf{X}_{t} with a single index u=i+(j-1)M and thus s_{ij} will be denoted as s_{u}. We use [N] to denote the index set, i.e. [N]=\{1,2,\ldots,N\}. Finally, we use k(\cdot,\cdot):\bar{\mathbb{S}}\times\bar{\mathbb{S}}\mapsto\mathbb{R} to denote a spatial kernel function and \mathbb{H}_{k} to denote the corresponding Reproducing Kernel Hilbert Space (RKHS).

2.2 Matrix AutoRegression with Auxiliary Covariates (MARAC)

Let \{\mathbf{X}_{t},\mathbf{z}_{t}\}_{t=1}^{T} be a joint observation of the matrix and the auxiliary vector time series with \mathbf{X}_{t}\in\mathbb{R}^{M\times N},\mathbf{z}_{t}\in\mathbb{R}^{D}. To forecast \mathbf{X}_{t}, we propose our Matrix AutoRegression with Auxiliary Covariates, or MARAC, as:

\mathbf{X}_{t}=\sum_{p=1}^{P}\mathbf{A}_{p}\mathbf{X}_{t-p}\mathbf{B}_{p}^{\top}+\sum_{q=1}^{Q}\mathbfcal{G}_{q}\bar{\times}\mathbf{z}_{t-q}+\mathbf{E}_{t}, (1)

where \mathbf{A}_{p}\in\mathbb{R}^{M\times M},\mathbf{B}_{p}\in\mathbb{R}^{N\times N} are the autoregressive coefficients for the lag-p matrix predictor, \mathbfcal{G}_{q}\in\mathbb{R}^{M\times N\times D} is the tensor coefficient for the lag-q vector predictor, and \mathbf{E}_{t} is a noise term whose distribution will be specified later. The lag parameters P,Q are hyper-parameters of the model and we often refer to model (1) as MARAC(P,Q).

Based on model (1), for the (i,j)^{\rm{th}} element of \mathbf{X}_{t}, the MARAC(P,Q) model specifies:

[\mathbf{X}_{t}]_{ij}=\sum_{p=1}^{P}\left\langle[\mathbf{A}_{p}]_{i:}^{\top}[\mathbf{B}_{p}]_{j:},\mathbf{X}_{t-p}\right\rangle+\sum_{q=1}^{Q}[\mathbfcal{G}_{q}]_{ij:}^{\top}\mathbf{z}_{t-q}+[\mathbf{E}_{t}]_{ij}, (2)

where each autoregressive term is associated with a rank-1 coefficient matrix determined by the specific rows of \mathbf{A}_{p},\mathbf{B}_{p}, and each non-spatial auxiliary covariate is associated with a location-specific coefficient vector, i.e. [\mathbfcal{G}_{q}]_{ij:}. It is now evident from (2) that the auxiliary vector covariates enter the model via an elementwise linear model. The autoregressive term uses \mathbf{A}_{p},\mathbf{B}_{p} to transform each lag-p predictor in a bi-linear form. Such a bi-linear transformation greatly reduces the total number of parameters: each lagged predictor, which would require M^{2}N^{2} parameters under an unstructured linear map, now requires only M^{2}+N^{2} parameters.
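
A minimal sketch (with hypothetical names and randomly drawn coefficients, not the authors' code) of the one-step conditional mean implied by (1)-(2), together with the parameter-count comparison discussed above:

import numpy as np

M, N, D, P, Q = 4, 5, 3, 2, 1
A = [np.random.randn(M, M) * 0.1 for _ in range(P)]   # A_p, M x M
B = [np.random.randn(N, N) * 0.1 for _ in range(P)]   # B_p, N x N
G = [np.random.randn(M, N, D) for _ in range(Q)]      # G_q, M x N x D

def marac_mean(X_lags, z_lags):
    """Conditional mean of X_t given lagged matrices and auxiliary vectors."""
    out = sum(A[p] @ X_lags[p] @ B[p].T for p in range(P))
    out += sum(np.tensordot(G[q], z_lags[q], axes=([2], [0])) for q in range(Q))
    return out

X_lags = [np.random.randn(M, N) for _ in range(P)]    # X_{t-1}, ..., X_{t-P}
z_lags = [np.random.randn(D) for _ in range(Q)]       # z_{t-1}, ..., z_{t-Q}
X_hat = marac_mean(X_lags, z_lags)                    # M x N forecast

# parameters per autoregressive lag: M^2 + N^2 under the bilinear form,
# versus M^2 * N^2 for an unstructured transition matrix on vec(X_t)
print(M**2 + N**2, "vs", M**2 * N**2)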

For the tensor coefficient \mathbfcal{G}_{q}, we assume that it is spatially smooth. More specifically, we assume that [\mathbfcal{G}_{q}]_{ijd} and [\mathbfcal{G}_{q}]_{uvd} are similar if s_{ij},s_{uv} are spatially close. Formally, we assume that each [\mathbfcal{G}_{q}]_{d}, i.e. the coefficient matrix for the d^{\rm{th}} covariate at lag q, is a discrete evaluation of a function g_{q,d}(\cdot):[0,1]^{2}\mapsto\mathbb{R} on \mathbb{S}. Furthermore, each g_{q,d}(\cdot) comes from an RKHS \mathbb{H}_{k} endowed with the spatial kernel function k(\cdot,\cdot). The spatial kernel function specifies the spatial smoothness of the functional parameters g_{q,d}(\cdot) and thus of the tensor coefficient \mathbfcal{G}_{q}.

An alternative formulation for \mathbfcal{G}_{q} would be a low-rank form (Li and Zhang, 2017). We choose local smoothness over low-rankness to explicitly model the spatial smoothness of the coefficients and to avoid tuning the tensor rank of \mathbfcal{G}_{q}. We leave the low-rank model for future research and focus on the RKHS framework in the current paper.

Finally, for the additive noise term 𝐄t\mathbf{E}_{t}, we assume that it is i.i.d. from a multivariate normal distribution with a separable Kronecker-product covariance:

\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)\mathrel{\overset{\mathrm{i.i.d.}}{\sim}}\mathcal{N}\left(\mathbf{0},\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}\right),\quad t\in[T] (3)

where \boldsymbol{\Sigma}_{r}\in\mathbb{R}^{M\times M},\boldsymbol{\Sigma}_{c}\in\mathbb{R}^{N\times N} are the row/column covariance components. Such a Kronecker-product covariance is commonly seen in covariance models for multi-way data (Hoff, 2011; Fosdick and Hoff, 2014) and has the merit of reducing the number of parameters significantly.
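
To make the separable covariance in (3) concrete, a standard way to draw such noise is the matrix-normal construction \mathbf{E}_{t}=\mathbf{L}_{r}\mathbf{Z}\mathbf{L}_{c}^{\top} with \mathbf{Z} having i.i.d. standard normal entries; the sketch below (an assumed helper, not from the paper) relies on this identity.

import numpy as np

def sample_kron_noise(Sigma_r, Sigma_c, rng):
    # if Sigma_r = L_r L_r^T and Sigma_c = L_c L_c^T, then E = L_r Z L_c^T
    # satisfies vec(E) ~ N(0, Sigma_c kron Sigma_r)
    M, N = Sigma_r.shape[0], Sigma_c.shape[0]
    L_r = np.linalg.cholesky(Sigma_r)
    L_c = np.linalg.cholesky(Sigma_c)
    Z = rng.standard_normal((M, N))
    return L_r @ Z @ L_c.T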

Compared to existing models that can only deal with either matrix or vector predictors, our model (1) can incorporate predictors with non-uniform modes. If one redefines \mathbf{E}_{t} in our model as \sum_{q=1}^{Q}\mathbfcal{G}_{q}\bar{\times}\mathbf{z}_{t-q}+\mathbf{E}_{t}, i.e. all terms except the autoregressive term, then our model ends up specifying:

\mathrm{Cov}(\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right),\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t^{\prime}}\right))=\mathbbm{1}_{\{t=t^{\prime}\}}\cdot\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}+\mathbf{F}\mathbf{M}\mathbf{F}^{\top},
\mathbf{F}=[\left(\mathbfcal{G}_{1}\right)_{(3)}^{\top}:\cdots:\left(\mathbfcal{G}_{Q}\right)_{(3)}^{\top}],\quad\mathbf{M}=\left[\mathrm{Cov}(\mathbf{z}_{t-q_{1}},\mathbf{z}_{t^{\prime}-q_{2}})\right]_{q_{1},q_{2}\in[Q]},

where (\mathbfcal{G}_{q})_{(3)} is the mode-3 matricization of \mathbfcal{G}_{q}, which we denote by \mathbf{G}_{q} for the rest of the paper. This new formulation reveals how our model differs from other autoregression models with matrix predictors. The covariance of \mathbf{E}_{t},\mathbf{E}_{t^{\prime}} in our model contains a separable covariance matrix \boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r} that is based on the matrix grid geometry, a locally smooth coefficient matrix \mathbf{F} that captures the local spatial dependency, and an auto-covariance matrix \mathbf{M} that captures the temporal dependency. Consequently, entries of \mathbf{E}_{t} are more correlated if they are spatially/temporally close or if they share the same row/column index, making the model more flexible for spatial data distributed on a matrix grid.

As a comparison, in the kriging framework (Cressie, 1986), the covariance of \mathbf{E}_{t},\mathbf{E}_{t^{\prime}} is characterized by a spatio-temporal kernel that captures the dependencies among spatial and temporal neighbors. Such a kernel method can account for the local dependency but not for the spatial dependency induced by the matrix grid geometry. In the matrix autoregression model (Chen et al., 2021), the authors consider neither the local spatial dependencies among entries of \mathbf{E}_{t} nor the temporal dependency across different t. In Hsu et al. (2021), the matrix autoregression model is generalized to spatial data via fixed-rank co-kriging (FRC) (Cressie and Johannesson, 2008) with \mathrm{Cov}(\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right),\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t^{\prime}}\right))=\mathbbm{1}_{\{t=t^{\prime}\}}\cdot\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}+\mathbf{F}\mathbf{M}\mathbf{F}^{\top}, where \mathbf{M} is a k\times k coefficient matrix and \mathbf{F} is a pre-specified MN\times k spatial basis matrix. Such a co-kriging framework accounts for neither the temporal dependency of the noise nor the auxiliary covariates. Our model generalizes these previous works to allow for temporally dependent noise with both local and grid-based spatial dependency.

The combination of (1) and (3) specifies the complete MARAC(P,Q)(P,Q) model. Vectorizing both sides of (1) yields the vectorized MARAC(P,Q)(P,Q) model:

\mathbf{x}_{t}=\sum_{p=1}^{P}\left(\mathbf{B}_{p}\otimes\mathbf{A}_{p}\right)\mathbf{x}_{t-p}+\sum_{q=1}^{Q}\mathbf{G}_{q}^{\top}\mathbf{z}_{t-q}+\mathbf{e}_{t},\quad\mathbf{e}_{t}\mathrel{\overset{\mathrm{i.i.d.}}{\sim}}\mathcal{N}\left(\mathbf{0},\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}\right) (4)

where \mathbf{x}_{t}=\mathrm{\mathbf{vec}}\left(\mathbf{X}_{t}\right),\mathbf{e}_{t}=\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right), and recall that \mathbf{G}_{q}=(\mathbfcal{G}_{q})_{(3)}. It is now more evident that the Kronecker-product structure of the autoregressive coefficient matrix and the noise covariance matrix greatly reduces the number of parameters, making the regression estimation feasible given finite samples. The spatially smooth structure of \mathbf{G}_{q} leverages the spatial information of the data. In the next section, we discuss the estimation algorithm for the MARAC model parameters.

3 Estimating Algorithm

In this section, we discuss the parameter estimation for the MARAC(P,Q)(P,Q) model (1). We first propose a penalized maximum likelihood estimator (MLE) in Section 3.1 for exact parameter estimation. Then, we propose an approximation to the penalized MLE in Section 3.2 for faster computation when dealing with high-dimensional matrix data. Finally, in Section 3.3, we outline the model selection criterion for selecting the lag hyper-parameters whose consistency will be validated empirically in Section 5.

3.1 Penalized Maximum Likelihood Estimation (MLE)

To estimate the parameters of the MARAC(P,Q)(P,Q) model, which we denote collectively as 𝚯\boldsymbol{\Theta}, we propose a penalized maximum likelihood estimation (MLE) approach. Following the distribution assumption on 𝐄t\mathbf{E}_{t} in (3), we can write the negative log-likelihood of {𝐗t}t=1T\{\mathbf{X}_{t}\}_{t=1}^{T} with a squared RKHS functional norm penalty, after dropping the constants, as:

𝔏λ(𝚯)=1Tt[T](𝐗t|𝐗t1:P,𝐳t1:Q;𝚯)+λ2q[Q]d[D]gq,dk2,\mathfrak{L}_{\lambda}(\boldsymbol{\Theta})=-\frac{1}{T}\sum_{t\in[T]}\ell\left(\mathbf{X}_{t}|\mathbf{X}_{t-1:P},\mathbf{z}_{t-1:Q};\boldsymbol{\Theta}\right)+\frac{\lambda}{2}\sum_{q\in[Q]}\sum_{d\in[D]}\|g_{q,d}\|_{\mathbb{H}_{k}}^{2}, (5)

where ()\ell(\cdot) is the conditional log-likelihood of 𝐗t\mathbf{X}_{t}:

(𝐗t|𝐗t1:P,𝐳t1:Q;𝚯)=12log|𝚺c𝚺r|12𝐫t(𝚺c1𝚺r1)𝐫t,\ell\left(\mathbf{X}_{t}|\mathbf{X}_{t-1:P},\mathbf{z}_{t-1:Q};\boldsymbol{\Theta}\right)=-\frac{1}{2}\log|\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}|-\frac{1}{2}\mathbf{r}_{t}^{\top}\left(\boldsymbol{\Sigma}_{c}^{-1}\otimes\boldsymbol{\Sigma}_{r}^{-1}\right)\mathbf{r}_{t}, (6)

and 𝐫t=𝐱tp(𝐁p𝐀p)𝐱tpq𝐆q𝐳tq\mathbf{r}_{t}=\mathbf{x}_{t}-\sum_{p}(\mathbf{B}_{p}\otimes\mathbf{A}_{p})\mathbf{x}_{t-p}-\sum_{q}\mathbf{G}_{q}^{\top}\mathbf{z}_{t-q} is the vectorized residual at tt. To estimate the parameters, one needs to solve a constrained minimization problem:

min𝚯𝔏λ(𝚯),s.t. gq,d(sij)=[𝒢| for all |𝒮\min_{\boldsymbol{\Theta}}\mathfrak{L}_{\lambda}(\boldsymbol{\Theta}),\quad\text{s.t. }g_{q,d}(s_{ij})=[\mathbfcal{G}_{q}]_{ijd},\text{ for all }s_{ij}\in\mathbb{S}. (7)

We now define the functional norm penalty in (5) explicitly and derive a finite-dimensional equivalent of the optimization problem above. We assume that the spatial kernel function k(\cdot,\cdot) is continuous and square-integrable; thus it admits an eigen-decomposition by Mercer's Theorem (Williams and Rasmussen, 2006):

k(sij,suv)=r=1λrψr(sij)ψr(suv),sij,suv[0,1]2,k(s_{ij},s_{uv})=\sum_{r=1}^{\infty}\lambda_{r}\psi_{r}(s_{ij})\psi_{r}(s_{uv}),\quad s_{ij},s_{uv}\in[0,1]^{2}, (8)

where λ1λ2\lambda_{1}\geq\lambda_{2}\geq\ldots is a sequence of non-negative eigenvalues and ψ1,ψ2,\psi_{1},\psi_{2},\ldots is a set of orthonormal eigen-functions on [0,1]2[0,1]^{2}. The functional norm of function gg from the RKHS k\mathbb{H}_{k} endowed with kernel k(,)k(\cdot,\cdot) is defined as:

gk=r=1βr2λr,where g()=r=1βrψr(),\|g\|_{\mathbb{H}_{k}}=\sqrt{\sum_{r=1}^{\infty}\frac{\beta_{r}^{2}}{\lambda_{r}}},\qquad\text{where }g(\cdot)=\sum_{r=1}^{\infty}\beta_{r}\psi_{r}(\cdot), (9)

following van Zanten and van der Vaart (2008).

Given any λ>0\lambda>0 in (5), the generalized representer theorem (Schölkopf et al., 2001) suggests that the solution of the functional parameters, denoted as {g~q,d}q=1,d=1Q,D\{\widetilde{g}_{q,d}\}_{q=1,d=1}^{Q,D}, of the minimization problem (7), with all other parameters held fixed, is a linear combination of the representers {k(,s)}s𝕊\{k(\cdot,s)\}_{s\in\mathbb{S}} plus a linear combination of the basis functions {ϕ1,,ϕJ}\{\phi_{1},\ldots,\phi_{J}\} of the null space of k\mathbb{H}_{k}, i.e.,

g~q,d()=s𝕊γsk(,s)+j=1Jαjϕj(),ϕjk=0,\widetilde{g}_{q,d}(\cdot)=\sum_{s\in\mathbb{S}}\gamma_{s}k(\cdot,s)+\sum_{j=1}^{J}\alpha_{j}\phi_{j}(\cdot),\quad\|\phi_{j}\|_{\mathbb{H}_{k}}=0, (10)

where we omit the subscript (q,d)(q,d) for the coefficient γs,αj\gamma_{s},\alpha_{j} for brevity but they are essentially different for each (q,d)(q,d). We assume that the null space of k\mathbb{H}_{k} contains only the zero function for the remainder of the paper. As a consequence of (10), the minimization problem in (7) can be reduced to a finite-dimensional Kernel Ridge Regression (KRR) problem. We summarize the discussion above in the proposition below:

Proposition 1

If λ>0\lambda>0, the constrained minimization problem in (7) is equivalent to the following unconstrained kernel ridge regression problem:

min𝚯{12log|𝚺c𝚺r|+12Tt[T]𝐫t(𝚺c1𝚺r1)𝐫t+λ2q[Q]tr(𝚪q𝐊𝚪q)},\min_{\boldsymbol{\Theta}}\left\{\frac{1}{2}\log|\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}|+\frac{1}{2T}\sum_{t\in[T]}\mathbf{r}_{t}^{\top}\left(\boldsymbol{\Sigma}_{c}^{-1}\otimes\boldsymbol{\Sigma}_{r}^{-1}\right)\mathbf{r}_{t}+\frac{\lambda}{2}\sum_{q\in[Q]}\text{tr}\left(\boldsymbol{\Gamma}_{q}^{\top}\mathbf{K}\boldsymbol{\Gamma}_{q}\right)\right\}, (11)

where \mathbf{r}_{t}=\mathbf{x}_{t}-\sum_{p}(\mathbf{B}_{p}\otimes\mathbf{A}_{p})\mathbf{x}_{t-p}-\sum_{q}\mathbf{K}\boldsymbol{\Gamma}_{q}\mathbf{z}_{t-q} is the vectorized residual, \mathbf{K}\in\mathbb{R}^{MN\times MN} is the kernel Gram matrix with [\mathbf{K}]_{u_{1}u_{2}}=k(s_{i_{1}j_{1}},s_{i_{2}j_{2}}), s_{i_{l}j_{l}}\in\mathbb{S}, u_{l}=i_{l}+(j_{l}-1)M, l=1,2, and \boldsymbol{\Gamma}_{q}\in\mathbb{R}^{MN\times D} contains the coefficients of the representers, with [\boldsymbol{\Gamma}_{q}]_{ud} being the coefficient for the u^{\rm{th}} representer k(\cdot,s_{u}) and the d^{\rm{th}} auxiliary covariate at lag q.

We give the proof in the supplemental material. After introducing the functional norm penalty, the original tensor coefficient is converted to a linear combination of the representer functions through the relationship [\mathbfcal{G}_{q}]_{ijd}=\langle[\mathbf{K}]_{u:}^{\top},[\boldsymbol{\Gamma}_{q}]_{:d}\rangle, where u=i+(j-1)M.
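
A minimal sketch (illustrative; the squared-exponential kernel and grid coordinates are assumptions, not the paper's choices) of the objects appearing in Proposition 1: the Gram matrix \mathbf{K} over the M\times N grid and the map from \boldsymbol{\Gamma}_{q} to the tensor coefficient via [\mathbfcal{G}_{q}]_{ijd}=\langle[\mathbf{K}]_{u:}^{\top},[\boldsymbol{\Gamma}_{q}]_{:d}\rangle:

import numpy as np

M, N, D = 8, 10, 3
rows = (np.arange(M) + 0.5) / M                      # grid coordinates in [0, 1]
cols = (np.arange(N) + 0.5) / N
S = np.array([(r, c) for c in cols for r in rows])   # ordering u = i + (j-1)M

def k(s, t, ell=0.2):                                # assumed spatial kernel
    return np.exp(-np.sum((s - t) ** 2) / (2 * ell ** 2))

K = np.array([[k(su, sv) for sv in S] for su in S])  # MN x MN Gram matrix

Gamma_q = np.random.randn(M * N, D)                  # representer coefficients
G_q = (K @ Gamma_q).reshape(M, N, D, order="F")      # spatially smooth slices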

We solve the minimization problem in (11) with an alternating minimization algorithm (Attouch et al., 2013), where we update one block of parameters at a time while keeping the others fixed. We update the parameters in the order \mathbf{A}_{1}\rightarrow\mathbf{B}_{1}\rightarrow\cdots\rightarrow\mathbf{A}_{P}\rightarrow\mathbf{B}_{P}\rightarrow\boldsymbol{\Gamma}_{1}\rightarrow\cdots\rightarrow\boldsymbol{\Gamma}_{Q}\rightarrow\boldsymbol{\Sigma}_{r}\rightarrow\boldsymbol{\Sigma}_{c}\rightarrow\mathbf{A}_{1}\rightarrow\cdots. We choose the alternating minimization algorithm for its simplicity and efficiency. Each step conducts an exact minimization over one block of parameters, leading to a non-increasing sequence of objective values, which guarantees the convergence of the algorithm towards a local stationary point.

To solve the optimization problem in (11) for 𝐀p\mathbf{A}_{p} at the (l+1)th(l+1)^{\rm{th}} iteration, it suffices to solve the following least-square problem:

min𝐀p{t[T]tr(𝐗~t(𝐀p)(𝚺r(l))1𝐗~t(𝐀p)(𝚺c(l))1)},\min_{\mathbf{A}_{p}}\left\{\sum_{t\in[T]}\text{tr}\left(\widetilde{\mathbf{X}}_{t}(\mathbf{A}_{p})^{\top}\left(\boldsymbol{\Sigma}_{r}^{(l)}\right)^{-1}\widetilde{\mathbf{X}}_{t}(\mathbf{A}_{p})\left(\boldsymbol{\Sigma}_{c}^{(l)}\right)^{-1}\right)\right\}, (12)

where 𝐗~t(𝐀p)\widetilde{\mathbf{X}}_{t}(\mathbf{A}_{p}) is the residual matrix when predicting 𝐗t\mathbf{X}_{t}:

\widetilde{\mathbf{X}}_{t}(\mathbf{A}_{p})=\mathbf{X}_{t}-\sum_{p^{\prime}<p}\mathbf{A}_{p^{\prime}}^{(l+1)}\mathbf{X}_{t-p^{\prime}}\left(\mathbf{B}_{p^{\prime}}^{(l+1)}\right)^{\top}-\sum_{p^{\prime}>p}\mathbf{A}_{p^{\prime}}^{(l)}\mathbf{X}_{t-p^{\prime}}\left(\mathbf{B}_{p^{\prime}}^{(l)}\right)^{\top}-\sum_{q\in[Q]}\mathbfcal{G}_{q}^{(l)}\bar{\times}\mathbf{z}_{t-q}-\mathbf{A}_{p}\mathbf{X}_{t-p}\left(\mathbf{B}_{p}^{(l)}\right)^{\top}=\widetilde{\mathbf{X}}_{t,-p}-\mathbf{A}_{p}\mathbf{X}_{t-p}\left(\mathbf{B}_{p}^{(l)}\right)^{\top},

and we use \widetilde{\mathbf{X}}_{t,-p} to denote the partial residual excluding the term involving \mathbf{X}_{t-p}, and use \mathbfcal{G}_{q}^{(l)} to denote the tensor coefficient satisfying [\mathbfcal{G}_{q}^{(l)}]_{ijd}=\langle[\mathbf{K}]_{u:}^{\top},[\boldsymbol{\Gamma}_{q}^{(l)}]_{:d}\rangle, with u=i+(j-1)M. The superscript (l) represents the value at the l^{\rm{th}} iteration. To simplify the notation, we define \boldsymbol{\Phi}(\mathbf{A}_{t},\mathbf{B}_{t},\boldsymbol{\Sigma})=\sum_{t}\mathbf{A}_{t}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{B}_{t}, where \boldsymbol{\Sigma},\mathbf{A}_{t},\mathbf{B}_{t} are arbitrary matrices/vectors of conformal sizes, and we simply write \boldsymbol{\Phi}(\mathbf{A}_{t},\boldsymbol{\Sigma}) if \mathbf{A}_{t}=\mathbf{B}_{t}. Solving (12) yields the following updating formula for \mathbf{A}_{p}^{(l+1)}:

𝐀p(l+1)𝚽(𝐗~t,p,𝐁p(l)𝐗tp,𝚺c(l))𝚽(𝐁p(l)𝐗tp,𝚺c(l))1\mathbf{A}_{p}^{(l+1)}\leftarrow\boldsymbol{\Phi}\left(\widetilde{\mathbf{X}}_{t,-p}^{\top},\mathbf{B}_{p}^{(l)}\mathbf{X}_{t-p}^{\top},\boldsymbol{\Sigma}_{c}^{(l)}\right)\boldsymbol{\Phi}\left(\mathbf{B}_{p}^{(l)}\mathbf{X}_{t-p}^{\top},\boldsymbol{\Sigma}_{c}^{(l)}\right)^{-1} (13)

Similarly, we have the following updating formula for 𝐁p(l+1)\mathbf{B}_{p}^{(l+1)}:

𝐁p(l+1)𝚽(𝐗~t,p,𝐀p(l+1)𝐗tp,𝚺r(l))𝚽(𝐀p(l+1)𝐗tp,𝚺r(l))1\mathbf{B}_{p}^{(l+1)}\leftarrow\boldsymbol{\Phi}\left(\widetilde{\mathbf{X}}_{t,-p},\mathbf{A}_{p}^{(l+1)}\mathbf{X}_{t-p},\boldsymbol{\Sigma}_{r}^{(l)}\right)\boldsymbol{\Phi}\left(\mathbf{A}_{p}^{(l+1)}\mathbf{X}_{t-p},\boldsymbol{\Sigma}_{r}^{(l)}\right)^{-1} (14)
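
The updates (13)-(14) can be coded directly from the \boldsymbol{\Phi} notation; the sketch below (illustrative, not the authors' code; the partial residuals \widetilde{\mathbf{X}}_{t,-p} and lagged predictors \mathbf{X}_{t-p} are assumed precomputed and passed in as lists) is one possible implementation.

import numpy as np

def Phi(As, Bs, Sigma):
    """Phi(A_t, B_t, Sigma) = sum_t A_t^T Sigma^{-1} B_t."""
    Sinv = np.linalg.inv(Sigma)
    return sum(A.T @ Sinv @ B for A, B in zip(As, Bs))

def update_A(Xtil, Xlag, B, Sigma_c):
    # update (13): A_p <- Phi(Xtil^T, B X^T, Sigma_c) Phi(B X^T, Sigma_c)^{-1}
    lhs = Phi([X.T for X in Xtil], [B @ X.T for X in Xlag], Sigma_c)
    rhs = Phi([B @ X.T for X in Xlag], [B @ X.T for X in Xlag], Sigma_c)
    return lhs @ np.linalg.inv(rhs)

def update_B(Xtil, Xlag, A, Sigma_r):
    # update (14): B_p <- Phi(Xtil, A X, Sigma_r) Phi(A X, Sigma_r)^{-1}
    lhs = Phi(Xtil, [A @ X for X in Xlag], Sigma_r)
    rhs = Phi([A @ X for X in Xlag], [A @ X for X in Xlag], Sigma_r)
    return lhs @ np.linalg.inv(rhs)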

For updating 𝚪q\boldsymbol{\Gamma}_{q}, or its vectorized version 𝜸q=𝐯𝐞𝐜(𝚪q)\boldsymbol{\gamma}_{q}=\mathrm{\mathbf{vec}}\left(\boldsymbol{\Gamma}_{q}\right), it is required to solve the following kernel ridge regression problem:

min𝜸q{12T𝚽(𝐱~t,q(𝐳tq𝐊)𝜸q,𝚺(l))+λ2𝜸q(𝐈D𝐊)𝜸q},\min_{\boldsymbol{\gamma}_{q}}\left\{\frac{1}{2T}\boldsymbol{\Phi}\left(\widetilde{\mathbf{x}}_{t,-q}-\left(\mathbf{z}_{t-q}^{\top}\otimes\mathbf{K}\right)\boldsymbol{\gamma}_{q},\boldsymbol{\Sigma}^{(l)}\right)+\frac{\lambda}{2}\boldsymbol{\gamma}_{q}^{\top}\left(\mathbf{I}_{D}\otimes\mathbf{K}\right)\boldsymbol{\gamma}_{q}\right\},

where 𝚺(l)=𝚺c(l)𝚺r(l)\boldsymbol{\Sigma}^{(l)}=\boldsymbol{\Sigma}_{c}^{(l)}\otimes\boldsymbol{\Sigma}_{r}^{(l)} and 𝐱~t,q\widetilde{\mathbf{x}}_{t,-q} is the vectorized partial residual of 𝐗t\mathbf{X}_{t} by leaving out the lag-qq auxiliary predictor, defined in a similar way as 𝐗~t,p\widetilde{\mathbf{X}}_{t,-p}. Solving the kernel ridge regression leads to the following updating formula for 𝜸q(l+1)\boldsymbol{\gamma}_{q}^{(l+1)}:

𝜸q(l+1)[(t[T]𝐳tq𝐳tq)𝐊+λT(𝐈D𝚺(l))]1[t[T](𝐳tq𝐱~t,q)].\boldsymbol{\gamma}_{q}^{(l+1)}\leftarrow\left[\left(\sum_{t\in[T]}\mathbf{z}_{t-q}\mathbf{z}_{t-q}^{\top}\right)\otimes\mathbf{K}+\lambda T\left(\mathbf{I}_{D}\otimes\boldsymbol{\Sigma}^{(l)}\right)\right]^{-1}\left[\sum_{t\in[T]}\left(\mathbf{z}_{t-q}\otimes\widetilde{\mathbf{x}}_{t,-q}\right)\right]. (15)

The step in (15) can be slow since one needs to invert a square matrix of size MND\times MND. In Section 3.2, we propose an approximation to (15) to speed up the computation when the matrix dimension is high.
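
A minimal sketch (illustrative) of the kernel ridge update (15); the lists of z_{t-q} and of vectorized partial residuals \widetilde{\mathbf{x}}_{t,-q} are assumed given, and the MND\times MND linear solve in this sketch is exactly the bottleneck that motivates the truncation of Section 3.2.

import numpy as np

def update_gamma(z_list, xtil_list, K, Sigma, lam):
    # K is the MN x MN Gram matrix, Sigma = Sigma_c kron Sigma_r (MN x MN)
    T, D, MN = len(z_list), z_list[0].shape[0], K.shape[0]
    ZZ = sum(np.outer(z, z) for z in z_list)                   # sum_t z z^T, D x D
    lhs = np.kron(ZZ, K) + lam * T * np.kron(np.eye(D), Sigma) # MND x MND
    rhs = sum(np.kron(z, x) for z, x in zip(z_list, xtil_list))
    gamma = np.linalg.solve(lhs, rhs)                          # vec(Gamma_q)
    return gamma.reshape(MN, D, order="F")                     # Gamma_q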

The updating rule of 𝚺r(l+1)\boldsymbol{\Sigma}_{r}^{(l+1)} and 𝚺c(l+1)\boldsymbol{\Sigma}_{c}^{(l+1)} can be easily derived by taking their derivative in (11) and setting it to zero. Specifically, we have:

𝚺r(l+1)\displaystyle\boldsymbol{\Sigma}_{r}^{(l+1)} 1NT𝚽(𝐗~t,𝚺c(l))\displaystyle\leftarrow\frac{1}{NT}\boldsymbol{\Phi}\left(\widetilde{\mathbf{X}}_{t}^{\top},\boldsymbol{\Sigma}_{c}^{(l)}\right) (16)
𝚺c(l+1)\displaystyle\boldsymbol{\Sigma}_{c}^{(l+1)} 1MT𝚽(𝐗~t,𝚺r(l+1)).\displaystyle\leftarrow\frac{1}{MT}\boldsymbol{\Phi}\left(\widetilde{\mathbf{X}}_{t},\boldsymbol{\Sigma}_{r}^{(l+1)}\right). (17)

where 𝐗~t\widetilde{\mathbf{X}}_{t} is the full residual when predicting 𝐗t\mathbf{X}_{t}.

The algorithm cycles through (13), (14), (15), (16) and (17) and terminates when 𝐁p(l)𝐀p(l)\mathbf{B}_{p}^{(l)}\otimes\mathbf{A}_{p}^{(l)}, 𝒢\mathbfcal{G}_{q}^{(l)}, 𝚺c(l)𝚺r(l)\boldsymbol{\Sigma}_{c}^{(l)}\otimes\boldsymbol{\Sigma}_{r}^{(l)} have their relative changes between iterations fall under a pre-specified threshold. We make two additional remarks on the algorithm:

Remark 2

(Identifiability Constraint) The MARAC(P,Q)(P,Q) model specified in (1) is scale-unidentifiable in that one can re-scale each pair of (𝐀p,𝐁p)(\mathbf{A}_{p},\mathbf{B}_{p}) by a non-zero constant cc and obtain (c𝐀p,c1𝐁p)(c\cdot\mathbf{A}_{p},c^{-1}\cdot\mathbf{B}_{p}) without changing their Kronecker product. To enforce scale identifiability, we re-scale the algorithm output for each pair of (𝐀p,𝐁p)(\mathbf{A}_{p},\mathbf{B}_{p}) such that 𝐀pF=1\|\mathbf{A}_{p}\|_{\mathrm{F}}=1, sign(tr(𝐀p))=1\text{sign}(\text{tr}\left(\mathbf{A}_{p}\right))=1. The identifiability constraint is enforced before outputting the estimators.

Remark 3

(Convergence of Kronecker Product) When dealing with high-dimensional matrices, it is cumbersome to compute the change between 𝐁p(l)𝐀p(l)\mathbf{B}_{p}^{(l)}\otimes\mathbf{A}_{p}^{(l)} and 𝐁p(l+1)𝐀p(l+1)\mathbf{B}_{p}^{(l+1)}\otimes\mathbf{A}_{p}^{(l+1)} under the Frobenius norm. An upper bound of 𝐁p(l+1)𝐀p(l+1)𝐁p(l)𝐀p(l)F\|\mathbf{B}_{p}^{(l+1)}\otimes\mathbf{A}_{p}^{(l+1)}-\mathbf{B}_{p}^{(l)}\otimes\mathbf{A}_{p}^{(l)}\|_{\mathrm{F}} can be used instead:

𝐁p(l+1)𝐁p(l)F𝐀p(l+1)F+𝐁p(l)F𝐀p(l+1)𝐀p(l)F,\|\mathbf{B}_{p}^{(l+1)}-\mathbf{B}_{p}^{(l)}\|_{\mathrm{F}}\cdot\|\mathbf{A}_{p}^{(l+1)}\|_{\mathrm{F}}+\|\mathbf{B}_{p}^{(l)}\|_{\mathrm{F}}\cdot\|\mathbf{A}_{p}^{(l+1)}-\mathbf{A}_{p}^{(l)}\|_{\mathrm{F}}, (18)

and a similar bound can be used for the convergence check of 𝚺c(l)𝚺r(l)\boldsymbol{\Sigma}_{c}^{(l)}\otimes\boldsymbol{\Sigma}_{r}^{(l)}.
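
For completeness, the bound in (18) follows from the bilinearity of the Kronecker product, the triangle inequality, and the identity \|\mathbf{X}\otimes\mathbf{Y}\|_{\mathrm{F}}=\|\mathbf{X}\|_{\mathrm{F}}\cdot\|\mathbf{Y}\|_{\mathrm{F}} (dropping the lag subscript p):

\|\mathbf{B}^{(l+1)}\otimes\mathbf{A}^{(l+1)}-\mathbf{B}^{(l)}\otimes\mathbf{A}^{(l)}\|_{\mathrm{F}}=\|(\mathbf{B}^{(l+1)}-\mathbf{B}^{(l)})\otimes\mathbf{A}^{(l+1)}+\mathbf{B}^{(l)}\otimes(\mathbf{A}^{(l+1)}-\mathbf{A}^{(l)})\|_{\mathrm{F}}\leq\|\mathbf{B}^{(l+1)}-\mathbf{B}^{(l)}\|_{\mathrm{F}}\cdot\|\mathbf{A}^{(l+1)}\|_{\mathrm{F}}+\|\mathbf{B}^{(l)}\|_{\mathrm{F}}\cdot\|\mathbf{A}^{(l+1)}-\mathbf{A}^{(l)}\|_{\mathrm{F}}.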

3.2 Approximated Penalized MLE with Kernel Truncation

The iterative algorithm in Section 3.1 requires inverting an MND×MNDMND\times MND matrix in (15) when updating 𝜸q\boldsymbol{\gamma}_{q}, i.e., the coefficients of the representer functions k(,s)k(\cdot,s). One way to reduce the computational complexity without any approximation is to divide the step of updating 𝜸q=[𝜸q,1::𝜸q,D]\boldsymbol{\gamma}_{q}=[\boldsymbol{\gamma}_{q,1}^{\top}:\cdots:\boldsymbol{\gamma}_{q,D}^{\top}]^{\top} to updating one block of parameters at a time following the order of 𝜸q,1𝜸q,D\boldsymbol{\gamma}_{q,1}\rightarrow\cdots\rightarrow\boldsymbol{\gamma}_{q,D}. However, such a procedure requires inverting a matrix of size MN×MNMN\times MN, which could still be high-dimensional.

To circumvent the issue of inverting large matrices, we can approximate the linear combination of all MN representers using a set of R\ll MN basis functions, i.e., \mathbf{K}\boldsymbol{\gamma}_{q,d}\approx\mathbf{K}_{R}\boldsymbol{\theta}_{q,d}, where \mathbf{K}_{R}\in\mathbb{R}^{MN\times R},\boldsymbol{\theta}_{q,d}\in\mathbb{R}^{R}. For example, one can reduce the spatial resolution by subsampling a fraction of the rows and columns of the matrix and only use the representers at the subsampled “knots” as the basis functions. In this subsection, we consider an alternative approach of truncating the Mercer decomposition in (8). A similar technique can be found in Kang et al. (2018).

Given the eigen-decomposition of k(,)k(\cdot,\cdot) in (8), one can truncate the decomposition at the RthR^{\rm{th}} largest eigenvalue λR\lambda_{R} and get an approximation: k(,)rRλrψr()ψr()k(\cdot,\cdot)\approx\sum_{r\leq R}\lambda_{r}\psi_{r}(\cdot)\psi_{r}(\cdot). We will use the set of eigen-functions {ψ1(),,ψR()}\{\psi_{1}(\cdot),\ldots,\psi_{R}(\cdot)\} for faster computation. The choice of RR depends on the decaying rate of the eigenvalue sequence {λr}r=1\{\lambda_{r}\}_{r=1}^{\infty} (thus the smoothness of the underlying functional parameters). Our simulation result shows that the estimation and prediction errors shrink monotonically as RR\rightarrow\infty. Therefore, RR can be chosen based on the computational resources available. The kernel truncation speeds up the computation at the cost of providing an overly-smoothed estimator, as we demonstrate empirically in the supplemental material.

Given the kernel truncation, any functional parameter gq,d()g_{q,d}(\cdot) is now approximated as: gq,d()r[R][𝜽q,d]rψr()g_{q,d}(\cdot)\approx\sum_{r\in[R]}[\boldsymbol{\theta}_{q,d}]_{r}\psi_{r}(\cdot). The parameter to be estimated now is 𝚯q=[𝜽q,1;;𝜽q,D]R×D\boldsymbol{\Theta}_{q}=[\boldsymbol{\theta}_{q,1};\cdots;\boldsymbol{\theta}_{q,D}]\in\mathbb{R}^{R\times D}, whose dimension is much lower than before (𝚪qMN×D\boldsymbol{\Gamma}_{q}\in\mathbb{R}^{MN\times D}). Estimating 𝚯q\boldsymbol{\Theta}_{q} requires solving a ridge regression problem, and the updating formula for generating 𝐯𝐞𝐜(𝚯q)\mathrm{\mathbf{vec}}\left(\boldsymbol{\Theta}_{q}\right) at the (l+1)th(l+1)^{\rm{th}} iteration can be written as:

[𝚽(𝐳tq𝐊R,𝚺(l))+λT(𝐈D𝚲R1)]1𝚽(𝐳tq𝐊R,𝐱~t,q,𝚺(l)),\left[\boldsymbol{\Phi}\left(\mathbf{z}_{t-q}^{\top}\otimes\mathbf{K}_{R},\boldsymbol{\Sigma}^{(l)}\right)+\lambda T\left(\mathbf{I}_{D}\otimes\boldsymbol{\Lambda}_{R}^{-1}\right)\right]^{-1}\boldsymbol{\Phi}\left(\mathbf{z}_{t-q}^{\top}\otimes\mathbf{K}_{R},\widetilde{\mathbf{x}}_{t,-q},\boldsymbol{\Sigma}^{(l)}\right),

where \mathbf{K}_{R}\in\mathbb{R}^{MN\times R} satisfies [\mathbf{K}_{R}]_{ur}=\psi_{r}(s_{ij}),u=i+(j-1)M, and \boldsymbol{\Lambda}_{R}=\text{diag}(\lambda_{1},\ldots,\lambda_{R}), with \lambda_{r} being the r^{\rm{th}} largest eigenvalue of the Mercer decomposition of k(\cdot,\cdot). Now we only need to invert a matrix of size RD\times RD, which speeds up the computation.
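
One practical way to obtain a truncated basis (a Nyström-style construction we assume here for illustration; it matches the Mercer eigenfunctions evaluated on the grid only up to scaling) is to take the leading eigenpairs of the Gram matrix \mathbf{K}:

import numpy as np

def truncated_basis(K, R):
    # leading eigenpairs of the Gram matrix serve as empirical surrogates
    # for the first R Mercer eigenfunctions evaluated on the grid
    evals, evecs = np.linalg.eigh(K)                 # ascending order
    evals, evecs = evals[::-1][:R], evecs[:, ::-1][:, :R]
    K_R = evecs                                      # MN x R basis on the grid
    Lambda_R = np.diag(evals)                        # stands in for diag(lambda_1,...,lambda_R)
    return K_R, Lambda_R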

3.3 Lag Selection

The MARAC(P,Q) model (1) has three hyper-parameters: the autoregressive lag P, the auxiliary covariate lag Q, and the RKHS norm penalty weight \lambda. In practice, \lambda can be chosen by cross-validation, while choosing P and Q requires a more formal model selection criterion. We propose to select P and Q using information criteria, namely the Akaike Information Criterion (AIC) (Akaike, 1998) and the Bayesian Information Criterion (BIC) (Schwarz, 1978). We formally define the AIC and BIC for the MARAC(P,Q) model here and empirically validate their consistency via simulation experiments in Section 5.

Let \widehat{\boldsymbol{\Theta}} be the set of estimated parameters of the MARAC(P,Q) model, and \mathbf{df}_{P,Q,\lambda} be the effective degrees of freedom of the model. We can then define the AIC and the BIC as follows:

AIC(P,Q,λ)\displaystyle\mathrm{AIC}(P,Q,\lambda) =2t[T](𝐗t|𝐗t1:P,𝐳t1:Q,𝚯^)+2𝐝𝐟P,Q,λ,\displaystyle=-2\sum_{t\in[T]}\ell(\mathbf{X}_{t}|\mathbf{X}_{t-1:P},\mathbf{z}_{t-1:Q},\widehat{\boldsymbol{\Theta}})+2\cdot\mathbf{df}_{P,Q,\lambda}, (19)
BIC(P,Q,λ)\displaystyle\mathrm{BIC}(P,Q,\lambda) =2t[T](𝐗t|𝐗t1:P,𝐳t1:Q,𝚯^)+log(T)𝐝𝐟P,Q,λ.\displaystyle=-2\sum_{t\in[T]}\ell(\mathbf{X}_{t}|\mathbf{X}_{t-1:P},\mathbf{z}_{t-1:Q},\widehat{\boldsymbol{\Theta}})+\log(T)\cdot\mathbf{df}_{P,Q,\lambda}. (20)

To calculate \mathbf{df}_{P,Q,\lambda}, we decompose it into the sum of three components: 1) each pair of autoregressive coefficients \widehat{\mathbf{A}}_{p},\widehat{\mathbf{B}}_{p} contributes (M^{2}+N^{2}-1) degrees of freedom; 2) the noise covariances \widehat{\boldsymbol{\Sigma}}_{r},\widehat{\boldsymbol{\Sigma}}_{c} contribute (M^{2}+N^{2}) degrees of freedom; and 3) for the auxiliary covariate functional parameters \widehat{g}_{q,1},\ldots,\widehat{g}_{q,D}, inspired by the kernel ridge regression estimator in (15), we define the sum of their degrees of freedom as:

𝐝𝐟q(g^)=tr{[𝐊~+λ(𝐈D𝚺^c𝚺^r)]1𝐊~},\mathbf{df}_{q}(\widehat{g})=\text{tr}\left\{\left[\widetilde{\mathbf{K}}+\lambda\left(\mathbf{I}_{D}\otimes\widehat{\boldsymbol{\Sigma}}_{c}\otimes\widehat{\boldsymbol{\Sigma}}_{r}\right)\right]^{-1}\widetilde{\mathbf{K}}\right\},

where \widetilde{\mathbf{K}}=\left(T^{-1}\sum_{t\in[T]}\mathbf{z}_{t-q}\mathbf{z}_{t-q}^{\top}\right)\otimes\mathbf{K}. As \lambda\rightarrow 0, we have \mathbf{df}_{q}(\widehat{g})\rightarrow MND; that is, each covariate has MN free parameters and the model reduces to the element-wise linear regression model. Empirically, we find that the BIC is a consistent lag selection criterion for our model.
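
A minimal sketch (illustrative; neg2loglik denotes the assumed value of -2\sum_{t}\ell at the fitted parameters) of the effective degrees of freedom \mathbf{df}_{q}(\widehat{g}) and the resulting BIC in (20):

import numpy as np

def df_auxiliary(z_list, K, Sigma_hat, lam):
    # Sigma_hat is the fitted Sigma_c kron Sigma_r (MN x MN)
    T, D = len(z_list), z_list[0].shape[0]
    Ktilde = np.kron(sum(np.outer(z, z) for z in z_list) / T, K)
    ridge = Ktilde + lam * np.kron(np.eye(D), Sigma_hat)
    return np.trace(np.linalg.solve(ridge, Ktilde))

def bic(neg2loglik, T, M, N, P, df_aux_list):
    # df_aux_list holds df_q(g_hat) for q = 1, ..., Q
    df = P * (M**2 + N**2 - 1) + (M**2 + N**2) + sum(df_aux_list)
    return neg2loglik + np.log(T) * df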

4 Theoretical Analysis

This section presents the theoretical analyses of the MARAC model. We first formulate the condition under which the matrix and vector time series are jointly stationary. Under this condition, we then establish the consistency and asymptotic normality of the penalized MLE under fixed matrix dimensionality as TT\rightarrow\infty. Finally, we consider the case where the matrix size goes to infinity as TT\rightarrow\infty and derive the convergence rate of the penalized MLE estimator and also the optimal order of the functional norm penalty tuning parameter λ\lambda. Without loss of generality, we assume that the matrix and vector time series have zero means, and we use S=MNS=MN to denote the spatial dimensionality of the matrix data. All proofs are deferred to the supplemental material.

4.1 Stationarity Condition

To facilitate the theoretical analysis, we make the following assumption on the auxiliary vector time series \mathbf{z}_{t}.

Assumption 4

The DD-dimensional auxiliary vector time series {𝐳t}t=\{\mathbf{z}_{t}\}_{t=-\infty}^{\infty} follows a stationary VAR(Q~)(\widetilde{Q}) process:

𝐳t=q~=1Q~𝐂q~𝐳tq~+𝝂t,\mathbf{z}_{t}=\sum_{\widetilde{q}=1}^{\widetilde{Q}}\mathbf{C}_{\widetilde{q}}\mathbf{z}_{t-\widetilde{q}}+\boldsymbol{\nu}_{t}, (21)

where 𝐂q~D×D\mathbf{C}_{\widetilde{q}}\in\mathbb{R}^{D\times D} is the lag-q~\widetilde{q} transition matrix and 𝛎t\boldsymbol{\nu}_{t} has independent sub-Gaussian entries and is independent of 𝐄t\mathbf{E}_{t}.

Remark 5

With Assumption 4, we can combine the MARAC model for 𝐗t\mathbf{X}_{t} in (4) and the VAR process for 𝐳t\mathbf{z}_{t} in (21) as a single VAR(L)(L) model, with L=max(P,Q,Q~)L=\max(P,Q,\widetilde{Q}), as:

[𝐱t𝐳t]=l=1L[(𝐁l𝐀l)𝟏{lP}𝐆l𝟏{lQ}𝐎D×S𝐂l𝟏{lQ~}][𝐱tl𝐳tl]+[𝐞t𝝂t],\begin{bmatrix}\mathbf{x}_{t}\\ \mathbf{z}_{t}\end{bmatrix}=\sum_{l=1}^{L}\begin{bmatrix}\left(\mathbf{B}_{l}\otimes\mathbf{A}_{l}\right)\odot\mathbf{1}_{\{l\leq P\}}&\mathbf{G}_{l}^{\top}\odot\mathbf{1}_{\{l\leq Q\}}\\ \mathbf{O}_{D\times S}&\mathbf{C}_{l}\odot\mathbf{1}_{\{l\leq\widetilde{Q}\}}\end{bmatrix}\begin{bmatrix}\mathbf{x}_{t-l}\\ \mathbf{z}_{t-l}\end{bmatrix}+\begin{bmatrix}\mathbf{e}_{t}\\ \boldsymbol{\nu}_{t}\end{bmatrix}, (22)

with \odot being the Hadamard product between matrices and 𝟏{lC}\mathbf{1}_{\{l\leq C\}} being a matrix with all elements taking values from 𝟙{lC}\mathbbm{1}_{\{l\leq C\}} and 𝐎\mathbf{O} is zero matrix. As (22) shows, 𝐳t\mathbf{z}_{t^{\prime}} can help forecast 𝐱t\mathbf{x}_{t} where t>tt>t^{\prime} but not the opposite, indicating that 𝐳t\mathbf{z}_{t} is an exogenous predictor for 𝐱t\mathbf{x}_{t}.

Given the joint vector autoregressive model in (22), we derive the conditions for 𝐱t\mathbf{x}_{t} and 𝐳t\mathbf{z}_{t} to be jointly stationary in Theorem 6.

Theorem 6 (MARAC Stationarity Condition)

Assume that Assumption 4 holds for the auxiliary time series {𝐳t}t=\{\mathbf{z}_{t}\}_{t=-\infty}^{\infty}, and that the matrix time series {𝐗t}t=\{\mathbf{X}_{t}\}_{t=-\infty}^{\infty} is generated by the MARAC(P,Q)(P,Q) model in (1), then {𝐗t,𝐳t}t=\{\mathbf{X}_{t},\mathbf{z}_{t}\}_{t=-\infty}^{\infty} are jointly stationary if and only if for any yy\in\mathbb{C} in the complex plane such that |y|1|y|\leq 1, we have

det[𝐈Sp=1P(𝐁p𝐀p)yp]0,det[𝐈Dq~=1Q~𝐂q~yq~]0.\mathrm{det}\left[\mathbf{I}_{S}-\sum_{p=1}^{P}\left(\mathbf{B}_{p}\otimes\mathbf{A}_{p}\right)y^{p}\right]\neq 0,\quad\mathrm{det}\left[\mathbf{I}_{D}-\sum_{\widetilde{q}=1}^{\widetilde{Q}}\mathbf{C}_{\widetilde{q}}y^{\widetilde{q}}\right]\neq 0. (23)

The proof of Theorem 6 follows the proof of the stationarity of the joint autoregressive model in (22). As a special case where P=\widetilde{Q}=1, the stationarity condition in (23) is equivalent to \bar{\rho}(\mathbf{A}_{1})\cdot\bar{\rho}(\mathbf{B}_{1})<1 and \bar{\rho}(\mathbf{C}_{1})<1, where \bar{\rho}(\cdot) is the spectral radius of a square matrix.
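
A minimal sketch (illustrative, not the authors' code) of checking the autoregressive part of condition (23) numerically via the companion matrix of the lag polynomial; the same check applies to the VAR(\widetilde{Q}) polynomial for \mathbf{z}_{t}.

import numpy as np

def ar_part_stationary(A_list, B_list):
    # condition (23) for the matrix part holds iff the companion matrix of
    # the polynomial I_S - sum_p (B_p kron A_p) y^p has spectral radius < 1
    P = len(A_list)
    S = A_list[0].shape[0] * B_list[0].shape[0]
    Phis = [np.kron(B, A) for A, B in zip(A_list, B_list)]   # B_p kron A_p
    companion = np.zeros((P * S, P * S))
    companion[:S, :] = np.hstack(Phis)
    companion[S:, :-S] = np.eye((P - 1) * S)
    return np.max(np.abs(np.linalg.eigvals(companion))) < 1

# special case P = 1: equivalent to rho(A_1) * rho(B_1) < 1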

Based on Theorem 6, the stationarity of the matrix and vector time series relies on the stationarity of the autoregressive coefficients of the MARAC(P,Q)(P,Q) and VAR(Q~)(\widetilde{Q}) models and the tensor coefficients 𝒢𝒢𝒬\mathbfcal{G}_{1},\ldots,\mathbfcal{G}_{Q} do not affect the stationarity. The MARAC model can be extended to the joint autoregressive process (22) where the dynamics of 𝐗t\mathbf{X}_{t} and 𝐳t\mathbf{z}_{t} are modeled jointly. We stick to the simpler case here and leave the joint autoregression to future work.

4.2 Finite Spatial Dimension Asymptotics

In this subsection, we establish the consistency and asymptotic normality of the MARAC model estimators under the scenario that M,NM,N are fixed. Given a fixed matrix dimensionality, the functional parameters gq,dkg_{q,d}\in\mathbb{H}_{k} can only be estimated at S=MNS=MN fixed locations, and thus the asymptotic normality result is established for the corresponding tensor coefficient 𝒢^q\widehat{\mathbfcal{G}}_{q}. In Section 4.3, we will discuss the double asymptotics when both S,TS,T\rightarrow\infty. For the remainder of the paper, we denote the true model coefficient with an asterisk superscript, such as 𝐀1,𝐁1,𝒢\mathbf{A}_{1}^{*},\mathbf{B}_{1}^{*},\mathbfcal{G}_{1}^{*} and 𝚺\boldsymbol{\Sigma}^{*}.

To start with, we make an assumption on the Gram matrix 𝐊\mathbf{K}:

Assumption 7

The minimum eigenvalue of \mathbf{K} is bounded below by a positive constant \underaccent{\bar}{c}, i.e. \underaccent{\bar}{\rho}(\mathbf{K})=\underaccent{\bar}{c}>0.

As a result of Assumption 7, every 𝒢\mathbfcal{G}_{q}^{*} has a unique kernel decomposition: 𝐯𝐞𝐜(𝒢)=(𝐈D𝐊)𝜸q\mathrm{\mathbf{vec}}\left(\mathbfcal{G}_{q}^{*}\right)=(\mathbf{I}_{D}\otimes\mathbf{K})\boldsymbol{\gamma}_{q}^{*}. With this additional assumption, the first theoretical result we establish is the consistency of the covariance matrix estimator 𝚺^=𝚺^c𝚺^r\widehat{\boldsymbol{\Sigma}}=\widehat{\boldsymbol{\Sigma}}_{c}\otimes\widehat{\boldsymbol{\Sigma}}_{r}, which we summarize in Proposition 8.

Proposition 8 (Covariance Consistency)

Assume that \lambda\rightarrow 0 as T\rightarrow\infty with S fixed, and that Assumptions 4 and 7 and the stationarity condition in Theorem 6 hold; then \widehat{\boldsymbol{\Sigma}}\overset{p.}{\rightarrow}\boldsymbol{\Sigma}^{*}.

Given this result, we can further establish the asymptotic normality of the other model estimators:

Theorem 9 (Asymptotic Normality)

Assume that the matrix time series \{\mathbf{X}_{t}\}_{t=-\infty}^{\infty} follows the MARAC(P,Q) model (1) with i.i.d. noise \{\mathbf{E}_{t}\}_{t=-\infty}^{\infty} following (3), that Assumptions 4 and 7 and the stationarity condition in Theorem 6 hold, and that \lambda=o(T^{-1/2}). Additionally, assume that \underaccent{\bar}{\rho}(\mathrm{Var}([\mathrm{\mathbf{vec}}\left(\mathbf{X}_{t}\right)^{\top},\mathbf{z}_{t}^{\top}]^{\top}))=\underaccent{\bar}{c}^{\prime}>0. Suppose M,N are fixed and P,Q are known, and denote the vectorizations of \mathbf{A}_{p} and \mathbf{B}_{p}^{\top} by \boldsymbol{\alpha}_{p} and \boldsymbol{\beta}_{p} for any p\in[P]. Then the penalized MLE of the MARAC(P,Q) model is asymptotically normal:

T[𝜷^1𝜶^1𝜷1𝜶1𝜷^P𝜶^P𝜷P𝜶P𝐯𝐞𝐜(𝒢^1𝒢)𝐯𝐞𝐜(𝒢^Q𝒢𝒬)]d.𝒩(𝟎,𝐕𝚵𝐕),\sqrt{T}\begin{bmatrix}\widehat{\boldsymbol{\beta}}_{1}\otimes\widehat{\boldsymbol{\alpha}}_{1}-\boldsymbol{\beta}_{1}^{*}\otimes\boldsymbol{\alpha}_{1}^{*}\\ \vdots\\ \widehat{\boldsymbol{\beta}}_{P}\otimes\widehat{\boldsymbol{\alpha}}_{P}-\boldsymbol{\beta}_{P}^{*}\otimes\boldsymbol{\alpha}_{P}^{*}\\ \mathrm{\mathbf{vec}}\left(\widehat{\mathbfcal{G}}_{1}-\mathbfcal{G}^{*}_{1}\right)\\ \vdots\\ \mathrm{\mathbf{vec}}\left(\widehat{\mathbfcal{G}}_{Q}-\mathbfcal{G}^{*}_{Q}\right)\end{bmatrix}\overset{d.}{\longrightarrow}\mathcal{N}\left(\mathbf{0},\mathbf{V}\boldsymbol{\Xi}\mathbf{V}^{\top}\right), (24)

where 𝐕\mathbf{V} is:

𝐕=[diag(𝐕1,,𝐕P)𝐎𝐎𝐈QD𝐊],𝐕p=[𝜷p𝐈M2,𝐈N2𝜶p],\mathbf{V}=\begin{bmatrix}\text{diag}(\mathbf{V}_{1},\ldots,\mathbf{V}_{P})&\mathbf{O}\\ \mathbf{O}&\mathbf{I}_{QD}\otimes\mathbf{K}\end{bmatrix},\quad\mathbf{V}_{p}=[\boldsymbol{\beta}^{*}_{p}\otimes\mathbf{I}_{M^{2}},\mathbf{I}_{N^{2}}\otimes\boldsymbol{\alpha}_{p}^{*}],

and 𝚵=𝐇1E[𝐖t(𝚺)1𝐖t]𝐇1\boldsymbol{\Xi}=\mathbf{H}^{-1}\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]\mathbf{H}^{-1}, and 𝐖t\mathbf{W}_{t} is defined as:

𝐖t=[𝐖0,t𝐈M,𝐈N𝐖1,t,[𝐳t1,,𝐳tQ]𝐊],\mathbf{W}_{t}=\left[\mathbf{W}_{0,t}\otimes\mathbf{I}_{M},\mathbf{I}_{N}\otimes\mathbf{W}_{1,t},[\mathbf{z}_{t-1}^{\top},\ldots,\mathbf{z}_{t-Q}^{\top}]\otimes\mathbf{K}\right],

where 𝐇=E[𝐖t(𝚺)1𝐖t]+𝛇𝛇\mathbf{H}=\mathrm{E}[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}]+\boldsymbol{\zeta}\boldsymbol{\zeta}^{\top} with 𝛇=[(𝛂1),,(𝛂P),𝟎]\boldsymbol{\zeta}^{\top}=[(\boldsymbol{\alpha}_{1}^{*})^{\top},\cdots,(\boldsymbol{\alpha}_{P}^{*})^{\top},\mathbf{0}^{\top}] and:

𝐖0,t=[𝐁1𝐗t1,,𝐁P𝐗tP],𝐖1,t=[𝐀1𝐗t1,,𝐀P𝐗tP].\mathbf{W}_{0,t}=[\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top},\ldots,\mathbf{B}_{P}^{*}\mathbf{X}_{t-P}^{\top}],\quad\mathbf{W}_{1,t}=[\mathbf{A}_{1}^{*}\mathbf{X}_{t-1},\ldots,\mathbf{A}_{P}^{*}\mathbf{X}_{t-P}].

The asymptotic distribution (24) indicates that all parameters are T\sqrt{T}-consistent. Under a fixed matrix dimensionality SS, the functional parameters gq,dkg_{q,d}\in\mathbb{H}_{k} are estimated only at fixed locations. Hence, the convergence is at a parametric rate just like the autoregressive coefficient.

4.3 High Spatial Dimension Asymptotics

The previous section presents the asymptotic normality of the MARAC estimators under a fixed matrix dimensionality SS. In this section, we relax this assumption and establish the convergence rate of the MARAC estimators when S,TS,T\rightarrow\infty. For technical reasons, we assume that all entries of 𝐄t\mathbf{E}_{t} are i.i.d. normally-distributed random variables following 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}).

To establish the convergence rate of the MARAC estimators when S,TS,T\rightarrow\infty, we need to make several additional assumptions.

Assumption 10

The spatial locations of the rows and columns of 𝐗t\mathbf{X}_{t} are sampled independently from a uniform distribution on [0,1][0,1].

Assumption 11

The spatial kernel function k(,)k(\cdot,\cdot) can be decomposed into the product of a row kernel k1(,)k_{1}(\cdot,\cdot) and a column kernel k2(,)k_{2}(\cdot,\cdot) that satisfies k((u,v),(s,t))=k1(u,s)k2(v,t)k((u,v),(s,t))=k_{1}(u,s)k_{2}(v,t). Both k1,k2k_{1},k_{2} have their eigenvalues decaying at a polynomial rate: λj(k1)jr0,λj(k2)jr0\lambda_{j}(k_{1})\asymp j^{-r_{0}},\lambda_{j}(k_{2})\asymp j^{-r_{0}} with r0(1/2,2)r_{0}\in(1/2,2).

Assumption 11 elicits a simple eigen-spectrum characterization of the spatial kernel k(\cdot,\cdot), whose eigenvalues can be written as \lambda_{i}(k_{1})\lambda_{j}(k_{2}). Also, the Gram matrix \mathbf{K} is separable, i.e. \mathbf{K}=\mathbf{K}_{2}\otimes\mathbf{K}_{1}, and all eigenvalues of \mathbf{K} have the form \rho_{i}(\mathbf{K}_{1})\rho_{j}(\mathbf{K}_{2}), where \mathbf{K}_{1}\in\mathbb{R}^{M\times M},\mathbf{K}_{2}\in\mathbb{R}^{N\times N} are the Gram matrices for the kernels k_{1},k_{2}, respectively.

Under Assumption 10, we further have ρi(𝐊1)Mλi(k1)\rho_{i}(\mathbf{K}_{1})\rightarrow M\lambda_{i}(k_{1}) and ρj(𝐊2)Nλj(k2)\rho_{j}(\mathbf{K}_{2})\rightarrow N\lambda_{j}(k_{2}), as M,NM,N\rightarrow\infty. Combined with Assumption 11, we can characterize the eigenvalues of 𝐊\mathbf{K} as S(ij)r0S(ij)^{-r_{0}}. We refer our readers to Koltchinskii and Giné (2000); Braun (2006) for more references about the eigen-analysis of the kernel Gram matrix. One can generalize Assumption 10 to non-uniform sampling, but here, we stick to this simpler setting.

In Assumption 11, we assume the kernel separability to accommodate the grid structure of the spatial locations. We do not restrict r0r_{0} to be an integer but just a parameter that characterizes the smoothness of the functional parameters. With these assumptions, we are now ready to present the main result in Theorem 12.

Theorem 12 (Asymptotics for High-Dimensional MARAC)

Assume that Assumptions 4, 10, and 11 hold and that \mathbf{X}_{t} is generated by the MARAC(P,Q) model (1) with \mathbf{E}_{t} having i.i.d. \mathcal{N}(0,\sigma^{2}) entries. Then as S,T\rightarrow\infty (D is fixed) with S\log S/T\rightarrow 0, and under the additional assumptions that:

  1. M=O(\sqrt{S}), N=O(\sqrt{S});

  2. \gamma_{S}\coloneqq\lambda/S\rightarrow 0 and \gamma_{S}\cdot S^{r_{0}}\rightarrow C_{1} as S\rightarrow\infty, with 0<C_{1}\leq\infty;

  3. \underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}-(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*})^{\top}(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*})^{-1}\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*})=c_{0,S}>0 as S,T\rightarrow\infty, where \boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*},\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*},\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*} are \mathrm{Var}(\mathbf{x}_{t}), \mathrm{Var}(\mathbf{z}_{t}) and \mathrm{Cov}(\mathbf{z}_{t},\mathbf{x}_{t}), respectively, and c_{0,S} is a constant that depends only on S;

  4. for any S, we have 0<\underaccent{\bar}{\rho}(\mathbf{K})<\bar{\rho}(\mathbf{K})\leq C_{0}, where C_{0} is a finite constant,

then we have:

1PSp=1P𝐁^p𝐀^p𝐁p𝐀pF2OP(CgγSc0,SS)+OP(Dc0,STS),\frac{1}{\sqrt{P}S}\sqrt{\sum_{p=1}^{P}\left\|\widehat{\mathbf{B}}_{p}\otimes\widehat{\mathbf{A}}_{p}-\mathbf{B}^{*}_{p}\otimes\mathbf{A}_{p}^{*}\right\|_{\mathrm{F}}^{2}}\lesssim O_{P}\left(\sqrt{\frac{C_{g}\cdot\gamma_{S}}{c_{0,S}\cdot S}}\right)+O_{P}\left(\sqrt{\frac{D}{c_{0,S}\cdot TS}}\right), (25)

where Cg=q=1Qd=1Dgq,dk2C_{g}=\sum_{q=1}^{Q}\sum_{d=1}^{D}\|g_{q,d}\|_{\mathbb{H}_{k}}^{2}. Furthermore, we also have:

(TS)1t=1Tq=1Q(𝒢^q𝒢)ׯ𝐳tqF2\displaystyle\sqrt{(TS)^{-1}\sum_{t=1}^{T}\left\|\sum_{q=1}^{Q}\left(\widehat{\mathbfcal{G}}_{q}-\mathbfcal{G}_{q}^{*}\right)\bar{\times}\mathbf{z}_{t-q}\right\|_{\mathrm{F}}^{2}}
OP(γS1/2r0TS4)+OP(γS)+OP(1S)+OP(γS1TS).\displaystyle\lesssim O_{P}\left(\frac{\sqrt{\gamma_{S}^{-1/{2r_{0}}}}}{\sqrt{T}\sqrt[4]{S}}\right)+O_{P}(\sqrt{\gamma_{S}})+O_{P}\left(\frac{1}{\sqrt{S}}\right)+O_{P}\left(\frac{\sqrt{\gamma_{S}^{-1}}}{\sqrt{TS}}\right). (26)

In Theorem 12, (25) gives the error bound for the autoregressive coefficients and (26) gives the error bound for the prediction made by the auxiliary time series, which involves the functional parameter estimators. As a special case of (25) with \gamma_{S}=0 and S fixed, the convergence rate of the autoregressive coefficients is O_{P}(T^{-1/2}), which recovers the result in Theorem 9. In the discussion below, we write \text{AR}_{err} and \text{AC}_{err} as shorthand for the quantities on the left-hand sides of (25) and (26), respectively.

Remark 13 (Optimal Choice of λ\lambda and Phase Transition)

According to our proof, the error bound (26) can be decomposed into the sum of:

  • nonparametric error: OP(γS1/2r0TS)+OP(γS)O_{P}\left(\sqrt{\frac{\gamma_{S}^{-1/2r_{0}}}{T\sqrt{S}}}\right)+O_{P}(\sqrt{\gamma_{S}}),

  • autoregressive error: OP(γS)+OP(S1/2)+OP(T1/2)+OP(γS1TS)O_{P}\left(\sqrt{\gamma_{S}}\right)+O_{P}\left(S^{-1/2}\right)+O_{P}\left(T^{-1/2}\right)+O_{P}\left(\sqrt{\frac{\gamma_{S}^{-1}}{TS}}\right),

where the autoregressive error stems from the estimation error of \widehat{\mathbf{B}}_{p}\otimes\widehat{\mathbf{A}}_{p}. The nonparametric error resembles the result of nonparametric regression with an RKHS norm penalty (Cui et al., 2018): with n data points and tuning parameter \lambda, the nonparametric error is bounded by O_{P}(\sqrt{\lambda^{-1/2r_{0}}/n})+O_{P}(\sqrt{\lambda}) with an optimal \lambda\asymp n^{-2r_{0}/(2r_{0}+1)}. In our model, if there were no autoregressive error, the optimal tuning parameter would satisfy \gamma_{S}\asymp(T\sqrt{S})^{-2r_{0}/(2r_{0}+1)}. The number of data points in our case is TS; we lose a factor of \sqrt{S} in the optimal tuning parameter because, under Assumption 11, the eigenvalues of \mathbf{K}, ordered as \rho_{1}(\mathbf{K})\geq\cdots\geq\rho_{i}(\mathbf{K})\geq\cdots, decay more slowly than i^{-2r_{0}}. This is a result specific to matrix-shaped data. It is also noteworthy that, under the condition S\log S/T\rightarrow 0, the autoregressive error dominates the nonparametric error.

To simplify the discussion of the optimal order of \gamma_{S}, we assume that S=T^{c} for some constant 0<c<1. Under this condition, when P,Q\geq 1, the optimal tuning parameter \gamma_{S}=\lambda/S exhibits an interesting phase transition across different spatial smoothness levels r_{0} and matrix dimensionalities c=\log_{T}S, which we summarize in Table 1.

r_{0} | \log_{T}S | Optimal \gamma_{S} | Estimator Error
[1,2) | [\frac{1}{2r_{0}-1},1) | O((TS)^{-\frac{1}{2}}) | \mathrm{AR}_{err}=O_{P}(T^{-\frac{1}{4}}S^{-\frac{3}{4}}),\ \mathrm{AC}_{err}=O_{P}(S^{-\frac{1}{2}})
[1,2) | (0,\frac{1}{2r_{0}-1}) | O(S^{-r_{0}}) | \mathrm{AR}_{err}=O_{P}(S^{-\frac{r_{0}+1}{2}}),\ \mathrm{AC}_{err}=O_{P}(S^{-\frac{1}{2}})
(\frac{1}{2},1) | [2r_{0}-1,1) | O(S^{-r_{0}(2r_{0}-1)}) | \mathrm{AR}_{err}=O_{P}(S^{-\frac{r_{0}(2r_{0}-1)+1}{2}}),\ \mathrm{AC}_{err}=O_{P}(S^{-\frac{1}{2}})
(\frac{1}{2},1) | (0,2r_{0}-1) | O((T\sqrt{S})^{-\frac{2r_{0}}{2r_{0}+1}}) | \mathrm{AR}_{err}=O_{P}((TS)^{-\frac{1}{2}})+O_{P}((T\sqrt{S})^{-\frac{r_{0}}{2r_{0}+1}}S^{-\frac{1}{2}}),\ \mathrm{AC}_{err}=O_{P}(S^{-\frac{1}{2}})+O_{P}((T\sqrt{S})^{-\frac{r_{0}}{2r_{0}+1}})
Table 1: Summary of the optimal tuning parameter \gamma_{S} and the estimator errors following (25) and (26), under the assumptions that c_{0,S}\geq c_{0}>0 for all S and that S=T^{c} for some constant 0<c<1 such that S\log S/T\rightarrow 0. \mathrm{AR}_{err} and \mathrm{AC}_{err} are the quantities on the left-hand sides of (25) and (26), respectively.

Based on the results in Table 1, the faster S grows with respect to T, the smaller the optimal tuning parameter \gamma_{S} becomes. This is intuitive: with more spatial locations, the observations are denser and less smoothing is needed. Furthermore, the optimal order of \gamma_{S} is close to the classical nonparametric optimal rate (TS)^{-2r_{0}/(2r_{0}+1)} only in the regime where 1/2<r_{0}<1 and \log_{T}S<2r_{0}-1, i.e., when the functional parameter is relatively rough and the spatial dimensionality grows slowly with respect to T. Only in this regime does the discrepancy between the nonparametric error and the autoregressive error remain small, leading to an optimal tuning parameter close to that of nonparametric regression.
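For illustration only, the following hypothetical helper simply encodes the case analysis of Table 1 and returns the exponents (a,b) such that the optimal \gamma_{S}\asymp T^{a}S^{b} under S=T^{c}; it is a restatement of the table, not part of our estimation procedure.

```python
def optimal_gamma_exponents(r0: float, c: float):
    """Return (a, b) such that the optimal gamma_S scales as T**a * S**b,
    following the case analysis of Table 1 (with S = T**c, 0 < c < 1)."""
    if not (0.5 < r0 < 2) or not (0 < c < 1):
        raise ValueError("require r0 in (1/2, 2) and c in (0, 1)")
    if r0 >= 1:                                   # smoother functional parameter
        if c >= 1 / (2 * r0 - 1):
            return (-0.5, -0.5)                   # gamma_S ~ (T S)^(-1/2)
        return (0.0, -r0)                         # gamma_S ~ S^(-r0)
    # rougher functional parameter: 1/2 < r0 < 1
    if c >= 2 * r0 - 1:
        return (0.0, -r0 * (2 * r0 - 1))          # gamma_S ~ S^(-r0(2r0-1))
    # near-classical nonparametric regime
    return (-2 * r0 / (2 * r0 + 1), -r0 / (2 * r0 + 1))   # (T sqrt(S))^(-2r0/(2r0+1))

# Example: r0 = 0.75 with S growing slowly (c = 0.3) falls in the near-classical regime.
print(optimal_gamma_exponents(0.75, 0.3))
```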

In (25), the constant c_{0,S} appears in the error bound of the autoregressive term. This constant characterizes the spatial correlation of the matrix time series \mathbf{X}_{t} conditional on the auxiliary vector time series \mathbf{z}_{t}, and it can vary across different assumptions on the covariances of \mathbf{E}_{t} and \boldsymbol{\nu}_{t}. In Table 1, we assume that c_{0,S}\geq c_{0}>0 for some universal constant c_{0}. In practice, however, it is common to have c_{0,S}\rightarrow 0 as S\rightarrow\infty, which slows down the convergence of the autoregressive coefficients but does not affect the convergence of the functional parameters. We keep the constant c_{0,S} in (25) to state a general result and leave its characterization under specific assumptions to future work.

5 Simulation Experiments

5.1 Consistency and Convergence Rate

In this section, we validate the consistency and convergence rate of the MARAC estimators. We consider a simple setup with P=Q=1P=Q=1 and D=3D=3 and simulate the autoregressive coefficients 𝐀1,𝐁1\mathbf{A}_{1}^{*},\mathbf{B}_{1}^{*} such that they satisfy the stationarity condition in Theorem 6. We specify both 𝐀1,𝐁1\mathbf{A}_{1}^{*},\mathbf{B}_{1}^{*} and 𝚺r,𝚺c\boldsymbol{\Sigma}_{r}^{*},\boldsymbol{\Sigma}_{c}^{*} to have symmetric banded structures. To simulate g1,g2,g3g_{1},g_{2},g_{3} (we drop the lag subscript) from the RKHS k\mathbb{H}_{k}, we choose k(,)k(\cdot,\cdot) to be the Lebedev kernel (Kennedy et al., 2013) and generate g1,g2,g3g_{1},g_{2},g_{3} randomly from Gaussian processes with the Lebedev kernel as the covariance kernel. Finally, we simulate the auxiliary vector time series 𝐳t3\mathbf{z}_{t}\in\mathbb{R}^{3} from a VAR(1)(1) process. We include more details and visualizations of the simulation setups in the supplemental material.
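The sketch below illustrates the data-generating mechanism of this experiment (illustration only; the banded coefficients, the squared-exponential kernel standing in for the Lebedev kernel, the identity noise covariance, and all helper names are simplifying assumptions of the sketch).

```python
import numpy as np

rng = np.random.default_rng(1)
M = N = 10; D = 3; T = 2000; S = M * N

def banded(n, rho):
    """Symmetric banded matrix: 1 on the diagonal, rho on the first off-diagonals."""
    return np.eye(n) + rho * (np.eye(n, k=1) + np.eye(n, k=-1))

# Autoregressive coefficients, rescaled so that the spectral radius of B1 (x) A1 is 0.8 < 1;
# together with a stable VAR(1) for z_t this satisfies the stationarity requirement (cf. Theorem 6).
A1, B1 = banded(M, 0.3), banded(N, 0.3)
scale = 0.8 / (np.max(np.abs(np.linalg.eigvals(A1))) * np.max(np.abs(np.linalg.eigvals(B1))))
A1 *= np.sqrt(scale); B1 *= np.sqrt(scale)

# Smooth tensor slices G_d: Gaussian-process draws on the grid with a
# squared-exponential kernel (a stand-in for the Lebedev kernel of the paper).
grid = np.array([(i / M, j / N) for i in range(M) for j in range(N)])
K = np.exp(-np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=-1) / 0.1)
G = rng.multivariate_normal(np.zeros(S), K, size=D).reshape(D, M, N)

# Auxiliary VAR(1) process and the matrix time series itself.
C = 0.5 * np.eye(D)
z = np.zeros((T, D)); X = np.zeros((T, M, N))
for t in range(1, T):
    z[t] = C @ z[t - 1] + rng.standard_normal(D)
    X[t] = (A1 @ X[t - 1] @ B1.T
            + np.tensordot(z[t - 1], G, axes=1)     # sum_d z_{t-1,d} * G_d
            + rng.standard_normal((M, N)))          # i.i.d. N(0, 1) noise
```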

The evaluation metric is the root mean squared error (RMSE), defined as \mathrm{RMSE}(\widehat{\boldsymbol{\Theta}})=\|\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\|_{\mathrm{F}}/\sqrt{d(\boldsymbol{\Theta}^{*})}, where d(\boldsymbol{\Theta}^{*}) is the number of elements in \boldsymbol{\Theta}^{*}. We consider \boldsymbol{\Theta}\in\{\mathbf{B}_{1}\otimes\mathbf{A}_{1},\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r},\mathbfcal{G}_{1},\mathbfcal{G}_{2},\mathbfcal{G}_{3}\} and report the average RMSE over \mathbfcal{G}_{1},\mathbfcal{G}_{2},\mathbfcal{G}_{3}. The dataset is configured with M\in\{5,10,20,40\} and N=M. For each M, we train the MARAC model with P=Q=1 on T_{\text{train}}\in\{1,5,10,20,40,80,160\}\times 10^{2} frames of the matrix time series, choose the tuning parameter \lambda based on the prediction RMSE over a held-out validation set with T_{\text{val}}=T_{\text{train}}/2, and evaluate the prediction performance on a 5{,}000-frame testing set. We simulate a sufficiently long time series, take the training set starting from the first frame, and place the validation set right after the training set. The testing set is always fixed as the last 5{,}000 frames of the time series. All results are reported over 20 repetitions in Figure 2.

Figure 2: Panels (a), (b), and (c) show the RMSE of the penalized MLE of the MARAC model. Panel (d) shows the testing set prediction RMSE minus 1, where 1 is the noise variance of the simulated time series. Panels (a)-(d) have both axes plotted on the \log_{10} scale. Panels (e) and (f) show the RMSE of the autoregressive parameters and the auxiliary covariate parameters against T\sqrt{S}, with both axes on the \log_{10} scale, together with fitted linear regression lines.

The result shows that all model estimators are consistent and that the convergence rate, under fixed spatial dimensionality, is close to 1/\sqrt{T} (the black line in panel (a) shows a reference line of O(1/\sqrt{T})), echoing the result in Theorem 9. As the spatial dimensionality S increases, the RMSE of \widehat{\mathbf{B}}_{1}\otimes\widehat{\mathbf{A}}_{1} becomes even smaller, echoing the result in (25) and Table 1. The RMSE of the nonparametric estimators \widehat{g}_{1},\widehat{g}_{2},\widehat{g}_{3}, under fixed spatial dimensionality, also decays at a rate of 1/\sqrt{T}, again echoing Theorem 9. The RMSE of the covariance matrix estimator \widehat{\boldsymbol{\Sigma}}_{c}\otimes\widehat{\boldsymbol{\Sigma}}_{r} confirms its consistency, in line with Proposition 8, and shows a convergence rate similar to that of \widehat{\mathbf{B}}_{1}\otimes\widehat{\mathbf{A}}_{1}, although we do not provide its exact convergence rate theoretically.

In this simulation, we fix the variance of each element of \mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right) at unity, so the optimal testing set prediction RMSE is also unity. When plotting the test prediction RMSE in panel (d), we subtract 1 from all RMSE values, so the reported RMSE should be interpreted as the RMSE for the signal part of the matrix time series. The test prediction RMSE converges to zero in all cases, and matrices of higher dimensionality typically require more training frames to reach the same prediction performance.

To validate the theoretical result for the high-dimensional MARAC in Theorem 12, we also plot the RMSE of \widehat{\mathbf{B}}_{1}\otimes\widehat{\mathbf{A}}_{1} and \widehat{g}_{1},\widehat{g}_{2},\widehat{g}_{3} against T\sqrt{S} in panels (e) and (f) of Figure 2. The trend lines are fitted by linear regression. They show that \widehat{\mathbf{B}}_{1}\otimes\widehat{\mathbf{A}}_{1} converges roughly at the rate 1/(\sqrt{T}\sqrt[4]{S}), indicating that c_{0,S}\asymp 1/\sqrt{S} under this specific simulation setup, and that the functional parameters converge at a rate of about (T\sqrt{S})^{-3/8}, which coincides with our simulation setup where r_{0}\approx 3/4 and with the theoretical result in the last row of Table 1.

All the results reported in Figure 2 are based on the penalized MLE framework without the kernel truncation introduced in Section 3.2. Kernel truncation speeds up the computation, especially when the matrix dimensionality is high, at the cost of over-smoothing the functional parameter estimates. We illustrate the performance of the kernel truncation method in the supplemental material.

5.2 Lag Selection Consistency

In Section 3.3, we propose selecting the lag parameters P and Q of the MARAC model using information criteria such as AIC and BIC. To validate the consistency of these model selection criteria, we simulate data from a MARAC(2,2) model with 5\times 5 matrix dimensionality. We consider a candidate model class with 1\leq P,Q\leq 4; each model is fitted with T\in\{1,2,4,8\}\times 10^{3} frames, with \lambda chosen on a held-out validation set. In Table 2, we report the proportion of times that AIC and BIC select the correct P and Q individually (first two numbers in each parenthesis) and (P,Q) jointly (last number in each parenthesis) over 100 repetitions.

Criterion | T=1\times 10^{3} | T=2\times 10^{3} | T=4\times 10^{3} | T=8\times 10^{3}
AIC | (.54,.99,.53) | (.55,.97,.53) | (.59,.96,.55) | (.65,.94,.59)
BIC | (1.00,.09,.09) | (.99,.56,.56) | (.97,.97,.94) | (.96,.99,.95)
Table 2: Proportion of times that AIC and BIC select the correct P (first number), Q (second number) and (P,Q) (third number) over 100 repetitions.

From Table 2, we find that AIC tends to select models with more autoregressive lags, whereas BIC performs consistently better as the sample size grows. This coincides with the findings in Hsu et al. (2021) for the matrix autoregression model.
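As an illustration of the selection procedure, the following sketch loops over the candidate grid and picks (P,Q) by BIC; `fit_marac` and `neg_log_lik` are hypothetical stand-ins for the penalized MLE routine and its Gaussian negative log-likelihood, and the effective parameter count is a generic placeholder.

```python
import numpy as np
from itertools import product

def select_lags_bic(X, z, fit_marac, neg_log_lik, max_lag=4):
    """Pick (P, Q) minimizing BIC = 2 * NLL + log(T_eff) * n_params over a candidate grid.

    `fit_marac(X, z, P, Q)` is assumed to return a fitted model exposing `n_params`
    (its effective number of parameters); `neg_log_lik(model, X, z)` is assumed to
    return the negative Gaussian log-likelihood on the training sample.
    """
    T = X.shape[0]
    best = (None, np.inf)
    for P, Q in product(range(1, max_lag + 1), repeat=2):
        model = fit_marac(X, z, P, Q)
        T_eff = T - max(P, Q)                      # usable frames after lagging
        bic = 2.0 * neg_log_lik(model, X, z) + np.log(T_eff) * model.n_params
        if bic < best[1]:
            best = ((P, Q), bic)
    return best[0]
```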

5.3 Comparison with Alternative Methods

We compare our MARAC model against other competing methods for the matrix autoregression task. We simulate the matrix time series \mathbf{X}_{t} from a MARAC(P,Q) model with P=Q\in\{1,2,3\}, and the vector time series \mathbf{z}_{t}\in\mathbb{R}^{3} from a VAR(1) process. The dataset is generated with T_{\text{train}}=T_{\text{val}}=T_{\text{test}}=2000. Under each (P,Q), we vary the matrix dimensionality with M=N\in\{5,10,20,40\}. We evaluate each method via the testing set prediction RMSE. Each simulation scenario is repeated 20 times.

Under each P,Q,M,NP,Q,M,N specification, we consider the following five competing methods besides our own MARAC(P,Q)(P,Q) model.

  1. MAR (Chen et al., 2021):

    𝐗t=p=1P𝐀p𝐗tp𝐁p+𝐄t,𝐯𝐞𝐜(𝐄t)𝒩(𝟎,𝚺c𝚺r).\mathbf{X}_{t}=\sum_{p=1}^{P}\mathbf{A}_{p}\mathbf{X}_{t-p}\mathbf{B}_{p}^{\top}+\mathbf{E}_{t},\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_{c}\otimes\boldsymbol{\Sigma}_{r}).
  2. MAR with fixed-rank co-kriging (MAR+FRC) (Hsu et al., 2021):

    𝐗t=p=1P𝐀p𝐗tp𝐁p+𝐄t,𝐯𝐞𝐜(𝐄t)𝒩(𝟎,ση2𝐈+𝐅𝐌𝐅),\mathbf{X}_{t}=\sum_{p=1}^{P}\mathbf{A}_{p}\mathbf{X}_{t-p}\mathbf{B}_{p}^{\top}+\mathbf{E}_{t},\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)\sim\mathcal{N}(\mathbf{0},\sigma_{\eta}^{2}\mathbf{I}+\mathbf{F}\mathbf{M}\mathbf{F}^{\top}),

    where 𝐅MN×QD\mathbf{F}\in\mathbb{R}^{MN\times QD} is the multi-resolution spline basis (Tzeng and Huang, 2018).

  3. MAR followed by a tensor-on-scalar linear model (MAR+LM) (Li and Zhang, 2017):

    𝐗tp=1P𝐀^p𝐗tp𝐁^p=q=1Q𝒢ׯ𝒩ση\mathbf{X}_{t}-\sum_{p=1}^{P}\widehat{\mathbf{A}}_{p}\mathbf{X}_{t-p}\widehat{\mathbf{B}}_{p}^{\top}=\sum_{q=1}^{Q}\mathbfcal{G}_{q}\bar{\times}\mathbf{z}_{t-q}+\mathbf{E}_{t},\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)\sim\mathcal{N}(\mathbf{0},\sigma_{\eta}^{2}\mathbf{I}), (27)

    where 𝐀^p,𝐁^p\widehat{\mathbf{A}}_{p},\widehat{\mathbf{B}}_{p} come from a pre-trained MAR model and 𝒢\mathbfcal{G}_{q} can be a low-rank tensor. The MAR+LM model can be considered as a two-step procedure for fitting the MARAC model.

  4. Pixel-wise autoregression (Pixel-AR): for each i\in[M], j\in[N], we fit the following model (a minimal sketch of this baseline is given after the list):

    [𝐗t]ij=αij+p=1Pβijp[𝐗tp]ij+q=1Q𝜸ijq𝐳tq+[𝐄t]ij,[𝐄t]ij𝒩(0,σij2).[\mathbf{X}_{t}]_{ij}=\alpha_{ij}+\sum_{p=1}^{P}\beta_{ijp}[\mathbf{X}_{t-p}]_{ij}+\sum_{q=1}^{Q}\boldsymbol{\gamma}_{ijq}^{\top}\mathbf{z}_{t-q}+[\mathbf{E}_{t}]_{ij},\quad[\mathbf{E}_{t}]_{ij}\sim\mathcal{N}(0,\sigma_{ij}^{2}).
  5. Vector autoregression with exogenous predictors (VARX), which vectorizes the matrix time series and stacks it with the vector time series as predictors.
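For concreteness, here is a minimal sketch of the Pixel-AR baseline under the simplifying assumption P=Q=1 (illustration only; each pixel is fitted by ordinary least squares and any cross-pixel structure is ignored).

```python
import numpy as np

def fit_pixel_ar(X, z):
    """Pixel-wise AR(1) with one lag of the exogenous series, fitted by OLS per pixel.

    X: (T, M, N) matrix time series; z: (T, D) auxiliary series.
    Returns coef[i, j] = (alpha_ij, beta_ij1, gamma_ij1) stacked per pixel.
    """
    T, M, N = X.shape
    D = z.shape[1]
    coef = np.zeros((M, N, 2 + D))
    for i in range(M):
        for j in range(N):
            y = X[1:, i, j]                                   # response at pixel (i, j)
            Z = np.column_stack([np.ones(T - 1),              # intercept alpha_ij
                                 X[:-1, i, j],                # own lag-1 value
                                 z[:-1]])                     # lagged auxiliary covariates
            coef[i, j], *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

def predict_pixel_ar(coef, X_last, z_last):
    """One-step-ahead prediction given the last frame and the last auxiliary vector."""
    M, N, _ = coef.shape
    feats = lambda i, j: np.concatenate([[1.0, X_last[i, j]], z_last])
    return np.array([[coef[i, j] @ feats(i, j) for j in range(N)] for i in range(M)])
```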

The average prediction RMSE over the 20 repeated runs is plotted in Figure 3. Overall, our MARAC model outperforms the competing methods across matrix dimensionalities and lags. We make two additional remarks. First, when the matrix is small (e.g., 5\times 5), the vector autoregression model (VARX) performs almost as well as the MARAC model and better than the other methods. However, its performance deteriorates quickly as the matrix grows, indicating that sufficient dimension reduction is needed for large matrix time series; the MARAC model can be viewed as a parsimonious version of VARX for this purpose. Second, MAR, MAR with fixed-rank co-kriging (MAR+FRC), and the two-step MARAC (MAR+LM) all perform worse than MARAC. This shows that, when auxiliary time series predictors are present, it is sub-optimal to exclude them from the model (MAR), to incorporate them implicitly through the covariance structure (MAR+FRC), or to fit them separately in a tensor-on-scalar regression (MAR+LM). Modeling the matrix and vector predictors jointly in a unified framework such as MARAC improves prediction performance.

Figure 3: Testing set prediction RMSE comparison across six competing methods on the matrix autoregression task. The four panels correspond to four matrix dimensionalities (labeled in the top-left corner of each panel). The test prediction RMSE is reduced by 1 for better visualization, where 1 is the noise variance of the simulated data. Error bars show the 95\% CI over the 20 repeated runs. The spacing between ticks along the y-axis is rearranged via a square-root transformation for better visualization.

6 Application to Global Total Electron Content Forecast

For real data applications, we consider the problem of predicting the global total electron content (TEC) distribution, briefly introduced in Section 1. We use the IGS (International GNSS Service) TEC data, freely available from the National Aeronautics and Space Administration (NASA) Crustal Dynamics Data Information System (Hernández-Pajares et al., 2009). The spatio-temporal resolution of the data is 2.5^{\circ} (latitude) \times 5^{\circ} (longitude) \times 15 (minutes). We use the data for September 2017; the whole month forms a matrix time series with T=2880 and M=71, N=73. For the auxiliary covariates, we use the 15-minute resolution IMF Bz and Sym-H time series, which are parameters related to the near-Earth magnetic field and plasma (Papitashvili et al., 2014), as well as the daily F10.7 index, which measures the solar radio flux at 10.7 cm. The IMF Bz and Sym-H time series are accessed from the OMNI dataset (Papitashvili and King, 2020) and the F10.7 index from the NOAA data repository (Tapping, 2013). These covariates measure the solar wind strength; strong solar wind can lead to geomagnetic storms that significantly increase the global TEC.

We formulate our MARAC model for the TEC prediction problem as:

\text{TEC}_{t+h}=\sum_{p=1}^{P}\mathbf{A}_{p}\text{TEC}_{t-p}\mathbf{B}_{p}^{\top}+\sum_{q=1}^{Q}\mathbfcal{G}_{q}\bar{\times}\mathbf{z}_{t-q}+\mathbf{E}_{t}, (28)

where h is the forecast latency and \mathbf{z}_{t}\in\mathbb{R}^{3} contains the IMF Bz, Sym-H and F10.7 indices at time t. We consider forecasting with h\in\{1,2,\ldots,24\}, corresponding to forecasts from 15 minutes ahead up to 6 hours ahead. At each latency, we fit our MARAC(P,Q) model following (28) with 1\leq P,Q\leq 3, using the kernel truncation approximation with R=121 basis functions from the truncated Lebedev kernel. For comparison, we also fit the MAR model with 1\leq P\leq 3 and the MAR+LM model with 1\leq P,Q\leq 3; see the definition of the MAR+LM model in (27). As a benchmark, we use \text{TEC}_{t-1} to predict \text{TEC}_{t+h} and call this the persistence model.

The 2{,}880 frames of matrix data are split into a 70\% training set, a 15\% validation set, and a 15\% testing set in chronological order. We choose the tuning parameter \lambda for MARAC based on the validation set prediction RMSE, and the lag parameters P,Q for all models based on the BIC. To speed up the computation, we assume that the matrices \boldsymbol{\Sigma}_{r},\boldsymbol{\Sigma}_{c} are diagonal when fitting all models. All data splits are centered using the means of the matrix and vector time series computed on the training set.
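A minimal sketch of this evaluation protocol (illustration only; the array shapes follow the description above, and the persistence benchmark simply reuses the most recent available frame):

```python
import numpy as np

def chrono_split(X, z, frac=(0.7, 0.15, 0.15)):
    """Chronological 70/15/15 split of the matrix and auxiliary series."""
    T = X.shape[0]
    i1, i2 = int(frac[0] * T), int((frac[0] + frac[1]) * T)
    return (X[:i1], z[:i1]), (X[i1:i2], z[i1:i2]), (X[i2:], z[i2:])

def center_with_train_mean(train, others):
    """Subtract the training-set mean frame / mean vector from every split."""
    X_tr, z_tr = train
    X_mean, z_mean = X_tr.mean(axis=0), z_tr.mean(axis=0)
    demean = lambda X, z: (X - X_mean, z - z_mean)
    return demean(*train), [demean(*o) for o in others]

def persistence_rmse(X_test, h):
    """Pixel-wise RMSE of the persistence benchmark: predict X_{t+h} by X_{t-1}."""
    pred, truth = X_test[: -(h + 1)], X_test[h + 1:]
    return np.sqrt(np.mean((pred - truth) ** 2))

# Example usage (X: (T, M, N) TEC frames, z: (T, 3) auxiliary indices):
# train, val, test = chrono_split(X, z)
# train, (val, test) = center_with_train_mean(train, [val, test])
# print(persistence_rmse(test[0], h=16))   # 4-hour-ahead persistence benchmark
```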

In Figure 4(A), we report the pixel-wise prediction RMSE on the testing set. The result shows that when the latency time is short, the matrix autoregressive (MAR) model is sufficient for the TEC prediction. As the latency time increases to around 4 to 5 hours, the auxiliary time series improves the prediction performance relative to the MAR model. This coincides with the domain intuition that disturbances from the solar wind affect the global TEC distribution with a delay of up to several hours. The additional gain from incorporating the auxiliary covariates vanishes as the latency time increases further, indicating that the correlation between the solar wind and the global TEC is weak beyond a 6-hour separation.

In Figure 4(B), we visualize an example of the TEC prediction across the competing methods under the 4-hour latency scenario (i.e., h=16). The MAR and MAR+LM predictions are similar and do not resemble the ground truth well. The global TEC typically has two peaks located symmetrically around the equator, and both models fail to capture this, providing instead a single patch in the middle. The MARAC model, in contrast, captures this fine-scale structure in its prediction. To further showcase the MARAC prediction, we decompose it into the autoregressive component and the auxiliary covariate component and visualize them separately. The auxiliary covariate component highlights a sub-region of high TEC values in the southern hemisphere, complementing the prediction made by the autoregressive component.

Figure 4: IGS TEC prediction results. Panel (A) shows the testing set prediction RMSE across four competing methods under 24 different latency times. Panel (B) shows an example of the predicted TEC at 10:45:00 UT, 2017-Sep-28, under the 4-hour latency time scenario. Note that the “MARAC-Auxiliary Part” plot has a different color bar underneath it and that color bar applies to it exclusively.

7 Summary

In this paper, we propose a new methodology for spatio-temporal matrix autoregression with non-spatial exogenous vector covariates. The model has an autoregressive component with bi-linear transformations of the lagged matrix predictors and an additive auxiliary covariate component with a tensor-vector product between a tensor coefficient and the lagged vector covariates. We propose a penalized MLE estimation approach with a squared RKHS norm penalty and establish the estimator asymptotics under both fixed and high matrix dimensionality. The model's efficacy is validated with numerical experiments and an application to the global TEC forecast.

The application of our model can be extended to other spatial data with exogenous, non-spatial predictors. It is not restricted to matrix-valued data and can be generalized to the tensor setting, and potentially to data without a grid structure or with missing entries. Furthermore, our model nests a simpler model without the autoregressive term, i.e., P=0, and can thus be applied to matrix-on-scalar regression with spatial data. We leave these extensions to future research.

Supplementary Materials

The supplemental material contains technical proofs of all theorems and propositions of the paper and additional details and results of the simulation experiments.

Acknowledgements

The authors thank Shasha Zou, Zihan Wang, and Yizhou Zhang for helpful discussions on the TEC data. YC acknowledges support from NSF DMS 2113397 and NSF PHY 2027555.

References

  • Akaike (1998) Hirotogu Akaike. Information Theory and an Extension of the Maximum Likelihood Principle. Selected papers of hirotugu akaike, pages 199–213, 1998.
  • Attouch et al. (2013) Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of Descent Methods for Semi-Algebraic and Tame Problems: Proximal Algorithms, Forward–Backward Splitting, and Regularized Gauss–Seidel Methods. Mathematical Programming, 137(1-2):91–129, 2013.
  • Braun (2006) Mikio L Braun. Accurate Error Bounds for the Eigenvalues of the Kernel Matrix. The Journal of Machine Learning Research, 7:2303–2328, 2006.
  • Cai and Yuan (2012) T Tony Cai and Ming Yuan. Minimax and Adaptive Prediction for Functional Linear Regression. Journal of the American Statistical Association, 107(499):1201–1216, 2012.
  • Chen et al. (2021) Rong Chen, Han Xiao, and Dan Yang. Autoregressive Models for Matrix-valued Time Series. Journal of Econometrics, 222(1):539–560, 2021.
  • Cheng and Shang (2015) Guang Cheng and Zuofeng Shang. Joint Asymptotics for Semi-nonparametric Regression Models with Partially Linear Structure. The Annals of Statistics, 43:1351–1390, 2015.
  • Cressie (1986) Noel Cressie. Kriging Nonstationary Data. Journal of the American Statistical Association, 81(395):625–634, 1986.
  • Cressie and Johannesson (2008) Noel Cressie and Gardar Johannesson. Fixed Rank Kriging for Very Large Spatial Data Sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226, 2008.
  • Cressie and Wikle (2015) Noel Cressie and Christopher K Wikle. Statistics for Spatio-Temporal Data. John Wiley & Sons, 2015.
  • Cui et al. (2018) Wenquan Cui, Haoyang Cheng, and Jiajing Sun. An RKHS-based Approach to Double-Penalized Regression in High-dimensional Partially Linear Models. Journal of Multivariate Analysis, 168:201–210, 2018.
  • Dong et al. (2020) Mingwang Dong, Linfu Huang, Xueqin Wu, and Qingguang Zeng. Application of Least-Squares Method to Time Series Analysis for 4dpm Matrix. IOP Conference Series: Earth and Environmental Science, 455(1):012200, feb 2020. doi: 10.1088/1755-1315/455/1/012200. URL https://dx.doi.org/10.1088/1755-1315/455/1/012200.
  • Fosdick and Hoff (2014) BK Fosdick and PD Hoff. Separable Factor Analysis with Applications to Mortality Data. The Annals of Applied Statistics, 8(1):120–147, 2014.
  • Gu (2013) Chong Gu. Smoothing Spline ANOVA models, 2nd edition. Springer, New York, 2013.
  • Guha and Guhaniyogi (2021) Sharmistha Guha and Rajarshi Guhaniyogi. Bayesian Generalized Sparse Symmetric Tensor-on-Vector Regression. Technometrics, 63(2):160–170, 2021.
  • Guhaniyogi et al. (2017) Rajarshi Guhaniyogi, Shaan Qamar, and David B Dunson. Bayesian Tensor Regression. The Journal of Machine Learning Research, 18(1):2733–2763, 2017.
  • Hall and Heyde (2014) Peter Hall and Christopher C Heyde. Martingale limit theory and its application. Academic press, 2014.
  • Hamilton (2020) James D Hamilton. Time Series Analysis. Princeton University Press, 2020.
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition. Springer, New York, 2009.
  • Hernández-Pajares et al. (2009) Manuel Hernández-Pajares, JM Juan, J Sanz, R Orus, A Garcia-Rigo, J Feltens, A Komjathy, SC Schaer, and A Krankowski. The IGS VTEC Maps: a Reliable Source of Ionospheric Information since 1998. Journal of Geodesy, 83:263–275, 2009.
  • Hoff (2011) Peter D Hoff. Separable Covariance Arrays via the Tucker Product, with Applications to Multivariate Relational Data. Bayesian Analysis, 6(2):179–196, 2011.
  • Hsu et al. (2021) Nan-Jung Hsu, Hsin-Cheng Huang, and Ruey S Tsay. Matrix Autoregressive Spatio-Temporal Models. Journal of Computational and Graphical Statistics, 30(4):1143–1155, 2021.
  • Kang et al. (2018) Jian Kang, Brian J Reich, and Ana-Maria Staicu. Scalar-on-Image Regression via the Soft-Thresholded Gaussian Process. Biometrika, 105(1):165–184, 2018.
  • Kennedy et al. (2013) Rodney A Kennedy, Parastoo Sadeghi, Zubair Khalid, and Jason D McEwen. Classification and Construction of Closed-form Kernels for Signal Representation on the 2-sphere. In Wavelets and Sparsity XV, volume 8858, pages 169–183. SPIE, 2013.
  • Kolda and Bader (2009) Tamara G Kolda and Brett W Bader. Tensor Decompositions and Applications. SIAM review, 51(3):455–500, 2009.
  • Koltchinskii and Giné (2000) Vladimir Koltchinskii and Evarist Giné. Random Matrix Approximation of Spectra of Integral Operators. Bernoulli, pages 113–167, 2000.
  • Li and Zhang (2017) Lexin Li and Xin Zhang. Parsimonious Tensor Response Regression. Journal of the American Statistical Association, 112(519):1131–1146, 2017.
  • Li et al. (2018) Xiaoshan Li, Da Xu, Hua Zhou, and Lexin Li. Tucker Tensor Regression and Neuroimaging Analysis. Statistics in Biosciences, 10(3):520–545, 2018.
  • Li and Xiao (2021) Zebang Li and Han Xiao. Multi-linear Tensor Autoregressive Models. arXiv preprint arXiv:2110.00928, 2021.
  • Liu et al. (2020) Yipeng Liu, Jiani Liu, and Ce Zhu. Low-rank Tensor Train Coefficient Array Estimation for Tensor-on-Tensor Regression. IEEE Transactions on Neural Networks and Learning Systems, 31(12):5402–5411, 2020.
  • Lock (2018) Eric F Lock. Tensor-on-Tensor Regression. Journal of Computational and Graphical Statistics, 27(3):638–647, 2018.
  • Luo and Zhang (2022) Yuetian Luo and Anru R Zhang. Tensor-on-Tensor Regression: Riemannian Optimization, Over-Parameterization, Statistical-Computational gap, and their Interplay. arXiv preprint arXiv:2206.08756, 2022.
  • Papadogeorgou et al. (2021) Georgia Papadogeorgou, Zhengwu Zhang, and David B Dunson. Soft Tensor Regression. J. Mach. Learn. Res., 22:219–1, 2021.
  • Papitashvili and King (2020) Natalia E. Papitashvili and Joseph H. King. Omni 5-min Data [Data set]. NASA Space Physics Data Facility, 2020. https://doi.org/10.48322/gbpg-5r77.
  • Papitashvili et al. (2014) Natasha Papitashvili, Dieter Bilitza, and Joseph King. OMNI: a Description of Near-Earth Solar Wind Environment. 40th COSPAR scientific assembly, 40:C0–1, 2014.
  • Rabusseau and Kadri (2016) Guillaume Rabusseau and Hachem Kadri. Low-rank Regression with Tensor Responses. Advances in Neural Information Processing Systems, 29, 2016.
  • Rudelson and Vershynin (2013) Mark Rudelson and Roman Vershynin. Hanson-Wright Inequality and Sub-Gaussian Concentration. Electronic Communications in Probability, pages 1–9, 2013.
  • Schölkopf et al. (2001) Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A Generalized Representer Theorem. In International conference on computational learning theory, pages 416–426. Springer, 2001.
  • Schwarz (1978) Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, pages 461–464, 1978.
  • Shang and Cheng (2013) Zuofeng Shang and Guang Cheng. Local and Global Asymptotic Inference in Smoothing Spline Models. The Annals of Statistics, 41:2608–2638, 2013.
  • Shang and Cheng (2015) Zuofeng Shang and Guang Cheng. Nonparametric Inference in Generalized Functional Linear Models. The Annals of Statistics, 43:1742–1773, 2015.
  • Shen et al. (2022) Bo Shen, Weijun Xie, and Zhenyu Kong. Smooth Robust Tensor Completion for Background/Foreground Separation with Missing Pixels: Novel Algorithm with Convergence Guarantee. The Journal of Machine Learning Research, 23(1):9757–9796, 2022.
  • Stock and Watson (2001) James H Stock and Mark W Watson. Vector Autoregressions. Journal of Economic perspectives, 15(4):101–115, 2001.
  • Sun et al. (2022) Hu Sun, Zhijun Hua, Jiaen Ren, Shasha Zou, Yuekai Sun, and Yang Chen. Matrix Completion Methods for the Total Electron Content Video Reconstruction. The Annals of Applied Statistics, 16(3):1333–1358, 2022.
  • Sun et al. (2023) Hu Sun, Ward Manchester, Meng Jin, Yang Liu, and Yang Chen. Tensor Gaussian Process with Contraction for Multi-Channel Imaging Analysis. In International Conference on Machine Learning, pages 32913–32935. PMLR, 2023.
  • Sun and Li (2017) Will Wei Sun and Lexin Li. STORE: Sparse Tensor Response Regression and Neuroimaging Analysis. The Journal of Machine Learning Research, 18(1):4908–4944, 2017.
  • Tapping (2013) KF Tapping. The 10.7 cm Solar Radio Flux (F10. 7). Space weather, 11(7):394–406, 2013.
  • Tzeng and Huang (2018) ShengLi Tzeng and Hsin-Cheng Huang. Resolution Adaptive Fixed Rank Kriging. Technometrics, 60(2):198–208, 2018.
  • van Zanten and van der Vaart (2008) JH van Zanten and Aad W van der Vaart. Reproducing Kernel Hilbert Spaces of Gaussian Priors. In Pushing the limits of contemporary statistics: contributions in honor of Jayanta K. Ghosh, pages 200–222. Institute of Mathematical Statistics, 2008.
  • Wang et al. (2024) Di Wang, Yao Zheng, and Guodong Li. High-Dimensional Low-rank Tensor Autoregressive Time Series Modeling. Journal of Econometrics, 238(1):105544, 2024.
  • Wang et al. (2019) Dong Wang, Xialu Liu, and Rong Chen. Factor Models for Matrix-valued High-dimensional Time Series. Journal of econometrics, 208(1):231–248, 2019.
  • Wang et al. (2017) Xiao Wang, Hongtu Zhu, and Alzheimer’s Disease Neuroimaging Initiative. Generalized Scalar-on-Image Regression Models via Total Variation. Journal of the American Statistical Association, 112(519):1156–1168, 2017.
  • Wang et al. (2021) Zihan Wang, Shasha Zou, Lei Liu, Jiaen Ren, and Ercha Aa. Hemispheric Asymmetries in the Mid-latitude Ionosphere During the September 7–8, 2017 Storm: Multi-instrument Observations. Journal of Geophysical Research: Space Physics, 126:e2020JA028829, 4 2021. ISSN 2169-9402. doi: 10.1029/2020JA028829.
  • Williams and Rasmussen (2006) Christopher K Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT press Cambridge, MA, 2006.
  • Yang et al. (2020) Yun Yang, Zuofeng Shang, and Guang Cheng. Non-asymptotic Analysis for Nonparametric Testing. In 33rd Annual Conference on Learning Theory, pages 1–47. ACM, 2020.
  • Younas et al. (2022) Waqar Younas, Majid Khan, C. Amory-Mazaudier, Paul O. Amaechi, and R. Fleury. Middle and Low Latitudes Hemispheric Asymmetries in ΣO/N2\Sigma O/N2 and TEC during Intense Magnetic Storms of Solar Cycle 24. Advances in Space Research, 69:220–235, 1 2022.
  • Yuan and Cai (2010) Ming Yuan and T Tony Cai. A Reproducing Kernel Hilbert Space Approach to Functional Linear Regression. The Annals of Statistics, 38(6):3412–3444, 2010.
  • Zhou et al. (2013) Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor Regression with Applications in Neuroimaging Data Analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.

SUPPLEMENTARY MATERIAL

This supplemental material is organized as follows. In Section A, we prove Proposition 1 on the equivalence of the estimation problem of MARAC to a kernel ridge regression problem. In Section B, we prove Theorem 6 on the joint stationarity condition of the matrix and auxiliary vector time series. Then in Section C, we provide proofs of the theoretical results under fixed spatial dimensionality, including Proposition 8 and Theorem 9. In Section D, we present proofs of the theoretical results under high spatial dimensionality, namely Theorem 12. All essential lemmas used throughout the proofs are presented and proved in Section E. Finally, we include additional details and results of the simulation experiments in Section F.

In this supplemental material, we use \bar{\rho}(\cdot), \rho_{i}(\cdot), \underaccent{\bar}{\rho}(\cdot) and \|\cdot\|_{s} to denote the maximum eigenvalue, the i^{\rm{th}} largest eigenvalue, the minimum eigenvalue, and the spectral norm of a matrix, respectively. We use a\vee b and a\wedge b to denote the maximum and minimum of a and b, respectively. For two sequences of random variables X_{n},Y_{n}, we write X_{n}\lesssim Y_{n} when X_{n}/Y_{n}=O_{P}(1) and X_{n}\gtrsim Y_{n} when Y_{n}/X_{n}=O_{P}(1). We write X_{n}\asymp Y_{n} when both X_{n}\lesssim Y_{n} and X_{n}\gtrsim Y_{n} hold.

Appendix A Proof of Proposition 1

Proof  For each function gq,d()kg_{q,d}(\cdot)\in\mathbb{H}_{k}, we can decompose it as follows:

gq,d()=s𝕊γq,d,sk(,s)+j=1Jαq,d,jϕj()+hq,d(),g_{q,d}(\cdot)=\sum_{s\in\mathbb{S}}\gamma_{q,d,s}k(\cdot,s)+\sum_{j=1}^{J}\alpha_{q,d,j}\phi_{j}(\cdot)+h_{q,d}(\cdot),

where hq,d()h_{q,d}(\cdot) does not belong to the null space of k\mathbb{H}_{k} nor the span of {k(,s)|s𝕊}\{k(\cdot,s)|s\in\mathbb{S}\}. Here we assume that the null space of k\mathbb{H}_{k} contains only the zero function, so ϕj()=0, for all j\phi_{j}(\cdot)=0,\text{ for all }j.

By the reproducing property of the kernel k(\cdot,\cdot), we have \langle g_{q,d},k(\cdot,s^{\prime})\rangle_{\mathbb{H}_{k}}=g_{q,d}(s^{\prime})=\sum_{s\in\mathbb{S}}\gamma_{q,d,s}k(s,s^{\prime}), which does not involve h_{q,d}(\cdot); therefore h_{q,d}(\cdot) does not affect the prediction of \mathbf{x}_{t} in the MARAC model. In addition, for any h_{q,d}(\cdot)\notin\text{span}(\{k(\cdot,s)|s\in\mathbb{S}\}), we have:

gq,dk2=γq,d𝐊γq,d+hq,dk2s𝕊γq,d,sk(,s)k2,\|g_{q,d}\|_{\mathbb{H}_{k}}^{2}=\gamma_{q,d}^{\top}\mathbf{K}\gamma_{q,d}+\|h_{q,d}\|_{\mathbb{H}_{k}}^{2}\geq\|\sum_{s\in\mathbb{S}}\gamma_{q,d,s}k(\cdot,s)\|_{\mathbb{H}_{k}}^{2},

and the equality holds only if h_{q,d}(\cdot)=0. Consequently, the global minimizer of the constrained optimization problem (7) must have h_{q,d}(\cdot)=0. It then follows that the squared RKHS norm penalty for g_{q,d} can be written as \gamma_{q,d}^{\top}\mathbf{K}\gamma_{q,d} and that the tensor coefficient \mathbfcal{G}_{q} satisfies \mathrm{\mathbf{vec}}\left([\mathbfcal{G}_{q}]_{::d}\right)=\mathbf{K}\gamma_{q,d}. The remainder of the proof follows from simple linear algebra and is omitted.  

Appendix B Proof of Theorem 6

Proof  Under Assumption 4, which states that the vector time series \mathbf{z}_{t} follows a VAR(\widetilde{Q}) process, the vectorized matrix time series \mathbf{x}_{t} and the vector time series \mathbf{z}_{t} jointly follow a VAR(\max(P,Q,\widetilde{Q})) process, namely,

[𝐱t𝐳t]=l=1max(P,Q,Q~)[(𝐁l𝐀l)𝟏{lP}𝐆l𝟏{lQ}𝐎D×S𝐂l𝟏{lQ~}][𝐱tl𝐳tl]+[𝐞t𝝂t].\begin{bmatrix}\mathbf{x}_{t}\\ \mathbf{z}_{t}\end{bmatrix}=\sum_{l=1}^{\max(P,Q,\widetilde{Q})}\begin{bmatrix}\left(\mathbf{B}_{l}\otimes\mathbf{A}_{l}\right)\odot\mathbf{1}_{\{l\leq P\}}&\mathbf{G}_{l}^{\top}\odot\mathbf{1}_{\{l\leq Q\}}\\ \mathbf{O}_{D\times S}&\mathbf{C}_{l}\odot\mathbf{1}_{\{l\leq\widetilde{Q}\}}\end{bmatrix}\begin{bmatrix}\mathbf{x}_{t-l}\\ \mathbf{z}_{t-l}\end{bmatrix}+\begin{bmatrix}\mathbf{e}_{t}\\ \boldsymbol{\nu}_{t}\end{bmatrix}. (29)

Let L=\max(P,Q,\widetilde{Q}) and \mathbf{y}_{t}=[\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}. Denote the lag-l transition matrix in (29) by \mathbf{J}_{l}\in\mathbb{R}^{(S+D)\times(S+D)} and the error term by \mathbf{u}_{t}=[\mathbf{e}_{t}^{\top},\boldsymbol{\nu}_{t}^{\top}]^{\top}; then we can rewrite the VAR(L) process in (29) as a VAR(1) process:

[𝐲t𝐲t1𝐲tL+1]=[𝐉1𝐉2𝐉L1𝐉L𝐈S+D𝐎S+D𝐎S+D𝐎S+D𝐈S+D𝐎S+D𝐎S+D𝐎S+D𝐎S+D𝐈S+D𝐎S+D][𝐲t1𝐲t2𝐲tL]+[𝐮t𝟎S+D𝟎S+D],\begin{bmatrix}\mathbf{y}_{t}\\ \mathbf{y}_{t-1}\\ \vdots\\ \mathbf{y}_{t-L+1}\end{bmatrix}=\begin{bmatrix}\mathbf{J}_{1}&\mathbf{J}_{2}&\cdots&\mathbf{J}_{L-1}&\mathbf{J}_{L}\\ \mathbf{I}_{S+D}&\mathbf{O}_{S+D}&\cdots&\cdots&\mathbf{O}_{S+D}\\ \mathbf{O}_{S+D}&\mathbf{I}_{S+D}&\mathbf{O}_{S+D}&\cdots&\mathbf{O}_{S+D}\\ \vdots&\vdots&\ddots&\ddots&\vdots\\ \mathbf{O}_{S+D}&\mathbf{O}_{S+D}&\cdots&\mathbf{I}_{S+D}&\mathbf{O}_{S+D}\end{bmatrix}\begin{bmatrix}\mathbf{y}_{t-1}\\ \mathbf{y}_{t-2}\\ \vdots\\ \mathbf{y}_{t-L}\end{bmatrix}+\begin{bmatrix}\mathbf{u}_{t}\\ \mathbf{0}_{S+D}\\ \vdots\\ \mathbf{0}_{S+D}\end{bmatrix}, (30)

where we use 𝐎S+D\mathbf{O}_{S+D} to denote a zero matrix of size (S+D)×(S+D)(S+D)\times(S+D). For this VAR(1)(1) process to be stationary, we require that det(λ𝐈𝐉)0\mathrm{det}\left(\lambda\mathbf{I}-\mathbf{J}\right)\neq 0 for all |λ|1,λ|\lambda|\geq 1,\lambda\in\mathbb{C}, where 𝐉\mathbf{J} is the transition matrix in (30). The determinant det(λ𝐈𝐉)\mathrm{det}\left(\lambda\mathbf{I}-\mathbf{J}\right) can be simplified by column operations as:

det(λ𝐈𝐉)\displaystyle\mathrm{det}\left(\lambda\mathbf{I}-\mathbf{J}\right)
=det[λL𝐈Sl=1LλLl(𝐁l𝐀l)𝟏{lP}l=1LλLl𝐆l𝟏{lQ}𝐎λL𝐈Dl=1LλLl𝐂l𝟏{lQ~}]\displaystyle=\mathrm{det}\begin{bmatrix}\lambda^{L}\mathbf{I}_{S}-\sum\limits_{l=1}^{L}\lambda^{L-l}\left(\mathbf{B}_{l}\otimes\mathbf{A}_{l}\right)\odot\mathbf{1}_{\{l\leq P\}}&-\sum\limits_{l=1}^{L}\lambda^{L-l}\mathbf{G}_{l}^{\top}\odot\mathbf{1}_{\{l\leq Q\}}\\ \mathbf{O}&\lambda^{L}\mathbf{I}_{D}-\sum\limits_{l=1}^{L}\lambda^{L-l}\mathbf{C}_{l}\odot\mathbf{1}_{\{l\leq\widetilde{Q}\}}\end{bmatrix}
\displaystyle=\lambda^{L(S+D)}\mathrm{det}\left[\boldsymbol{\Phi}_{1}(\lambda)\right]\mathrm{det}\left[\boldsymbol{\Phi}_{2}(\lambda)\right],

where 𝚽1(λ)=𝐈Sp=1Pλp(𝐁p𝐀p)\boldsymbol{\Phi}_{1}(\lambda)=\mathbf{I}_{S}-\sum_{p=1}^{P}\lambda^{-p}\left(\mathbf{B}_{p}\otimes\mathbf{A}_{p}\right) and 𝚽2(λ)=𝐈Dq~=1Q~λq~𝐂q~\boldsymbol{\Phi}_{2}(\lambda)=\mathbf{I}_{D}-\sum_{\widetilde{q}=1}^{\widetilde{Q}}\lambda^{-\widetilde{q}}\mathbf{C}_{\widetilde{q}}, and setting y=1/λy=1/\lambda completes the proof.  
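As a numerical illustration of this root condition (illustration only; the input shapes and helper names are assumptions of the sketch), one can form the companion matrix \mathbf{J} of the VAR(1) representation (30) and check that its spectral radius is below one, which is equivalent to requiring \mathrm{det}(\lambda\mathbf{I}-\mathbf{J})\neq 0 for all |\lambda|\geq 1.

```python
import numpy as np

def marac_companion(A_list, B_list, G_list, C_list):
    """Companion matrix of the joint VAR for (vec(X_t), z_t).

    A_list/B_list: MARAC autoregressive factors (length P);
    G_list: S x D matrices mapping z_{t-q} to vec(X_t) (length Q);
    C_list: VAR coefficients of z_t (length Q_tilde).
    """
    S = A_list[0].shape[0] * B_list[0].shape[0]
    D = C_list[0].shape[0]
    L = max(len(A_list), len(G_list), len(C_list))
    J_blocks = []
    for l in range(L):
        top_left = np.kron(B_list[l], A_list[l]) if l < len(A_list) else np.zeros((S, S))
        top_right = G_list[l] if l < len(G_list) else np.zeros((S, D))
        bottom = np.hstack([np.zeros((D, S)),
                            C_list[l] if l < len(C_list) else np.zeros((D, D))])
        J_blocks.append(np.vstack([np.hstack([top_left, top_right]), bottom]))
    n = S + D
    J = np.zeros((L * n, L * n))
    J[:n] = np.hstack(J_blocks)                 # first block row: [J_1, ..., J_L]
    J[n:, :-n] = np.eye((L - 1) * n)            # identity blocks on the sub-diagonal
    return J

def is_stationary(A_list, B_list, G_list, C_list):
    """Spectral radius of the companion matrix below one <=> the root condition above."""
    J = marac_companion(A_list, B_list, G_list, C_list)
    return np.max(np.abs(np.linalg.eigvals(J))) < 1
```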

Appendix C Theory under Fixed Spatial Dimension

C.1 Proof of Proposition 8

Proof  For brevity of presentation, we fix P=Q=1, but the proof below extends readily to arbitrary P and Q. The vectorized MARAC(1,1) model (4) can equivalently be written as:

𝐱t=𝐲t𝜽+𝐞t,\mathbf{x}_{t}=\mathbf{y}_{t}\boldsymbol{\theta}+\mathbf{e}_{t}, (31)

where 𝐲t=[𝐱t1𝐈S;𝐳t1𝐊]\mathbf{y}_{t}=[\mathbf{x}_{t-1}^{\top}\otimes\mathbf{I}_{S};\mathbf{z}_{t-1}^{\top}\otimes\mathbf{K}] and 𝜽=[𝐯𝐞𝐜(𝐁1𝐀1),𝜸1]\boldsymbol{\theta}=[\mathrm{\mathbf{vec}}\left(\mathbf{B}_{1}\otimes\mathbf{A}_{1}\right)^{\top},\boldsymbol{\gamma}_{1}^{\top}]^{\top}. Using 𝛀=𝚺1\boldsymbol{\Omega}=\boldsymbol{\Sigma}^{-1} to denote the precision matrix for 𝐞t\mathbf{e}_{t}, we can rewrite the penalized likelihood in (11) for (𝜽,𝛀)(\boldsymbol{\theta},\boldsymbol{\Omega}) as:

h(𝜽,𝛀)=12log|𝛀|+12tr(𝛀𝐒(𝜽))+λ2𝜽𝐊~𝜽,h(\boldsymbol{\theta},\boldsymbol{\Omega})=-\frac{1}{2}\log\left|\boldsymbol{\Omega}\right|+\frac{1}{2}\text{tr}\left(\boldsymbol{\Omega}\mathbf{S}(\boldsymbol{\theta})\right)+\frac{\lambda}{2}\boldsymbol{\theta}^{\top}\widetilde{\mathbf{K}}\boldsymbol{\theta}, (32)

where 𝐒(𝜽)=T1t=1T(𝐱t𝐲t𝜽)(𝐱t𝐲t𝜽)\mathbf{S}(\boldsymbol{\theta})=T^{-1}\sum_{t=1}^{T}(\mathbf{x}_{t}-\mathbf{y}_{t}\boldsymbol{\theta})(\mathbf{x}_{t}-\mathbf{y}_{t}\boldsymbol{\theta})^{\top}, 𝐊~\widetilde{\mathbf{K}} is defined as:

𝐊~=[𝐎S×S𝐊𝐎S×D𝐊𝐎D×S𝐊𝐈D𝐊].\widetilde{\mathbf{K}}=\begin{bmatrix}\mathbf{O}_{S\times S}\otimes\mathbf{K}&\mathbf{O}_{S\times D}\otimes\mathbf{K}\\ \mathbf{O}_{D\times S}\otimes\mathbf{K}&\mathbf{I}_{D}\otimes\mathbf{K}\end{bmatrix}.

We use 𝜽,𝛀\boldsymbol{\theta}^{*},\boldsymbol{\Omega}^{*} to denote the ground truth of 𝜽,𝛀\boldsymbol{\theta},\boldsymbol{\Omega}, respectively. We define 𝔽𝜽\mathbb{F}_{\boldsymbol{\theta}} and 𝔽𝛀\mathbb{F}_{\boldsymbol{\Omega}} as:

𝔽𝜽\displaystyle\mathbb{F}_{\boldsymbol{\theta}} ={[𝐯𝐞𝐜(𝐁1𝐀1),𝜸1]|𝐀1F=1,sign(tr(𝐀1))=1}\displaystyle=\{[\mathrm{\mathbf{vec}}\left(\mathbf{B}_{1}\otimes\mathbf{A}_{1}\right)^{\top},\boldsymbol{\gamma}_{1}^{\top}]^{\top}|\|\mathbf{A}_{1}\|_{\mathrm{F}}=1,\text{sign}(\text{tr}\left(\mathbf{A}_{1}\right))=1\}
𝔽𝛀\displaystyle\mathbb{F}_{\boldsymbol{\Omega}} ={𝚺c1𝚺r1|𝚺rM×M,𝚺cN×N,ρ¯(𝚺r),ρ¯(𝚺c)>0}.\displaystyle=\{\boldsymbol{\Sigma}_{c}^{-1}\otimes\boldsymbol{\Sigma}_{r}^{-1}|\boldsymbol{\Sigma}_{r}\in\mathbb{R}^{M\times M},\boldsymbol{\Sigma}_{c}\in\mathbb{R}^{N\times N},\underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{r}),\underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{c})>0\}.

The MARAC estimators, denoted by \widehat{\boldsymbol{\theta}},\widehat{\boldsymbol{\Omega}}, are the minimizers of h(\boldsymbol{\theta},\boldsymbol{\Omega}) over \boldsymbol{\theta}\in\mathbb{F}_{\boldsymbol{\theta}},\boldsymbol{\Omega}\in\mathbb{F}_{\boldsymbol{\Omega}}.

In order to establish the consistency of 𝚺^=𝛀^1\widehat{\boldsymbol{\Sigma}}=\widehat{\boldsymbol{\Omega}}^{-1}, it suffices to show that for any constant c>0c>0:

P(inf𝛀¯𝛀Fcinf𝜽¯h(𝜽¯,𝛀¯)h(𝜽,𝛀))0, as T.\mathrm{P}\left(\inf_{\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\geq c}\inf_{\bar{\boldsymbol{\theta}}}h(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\Omega}})\leq h(\boldsymbol{\theta}^{*},\boldsymbol{\Omega}^{*})\right)\rightarrow 0,\text{ as }T\rightarrow\infty. (33)

This is because if (33) is established, then as TT\rightarrow\infty we have:

P(inf𝛀¯𝛀Fcinf𝜽¯𝔽𝜽h(𝜽¯,𝛀¯)inf𝛀¯𝛀Fcinf𝜽¯h(𝜽¯,𝛀¯)>h(𝜽,𝛀)h(𝜽^,𝛀^))\mathrm{P}\left(\inf_{\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\geq c}\inf_{\bar{\boldsymbol{\theta}}\in\mathbb{F}_{\boldsymbol{\theta}}}h(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\Omega}})\geq\inf_{\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\geq c}\inf_{\bar{\boldsymbol{\theta}}}h(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\Omega}})>h(\boldsymbol{\theta}^{*},\boldsymbol{\Omega}^{*})\geq h(\widehat{\boldsymbol{\theta}},\widehat{\boldsymbol{\Omega}})\right)

approaching 11 and thus we must have 𝛀^𝛀F<c\|\widehat{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}<c with probability approaching 11 as TT\rightarrow\infty, and the consistency is established since cc is arbitrary.

To prove (33), we first fix 𝛀=𝛀¯\boldsymbol{\Omega}=\bar{\boldsymbol{\Omega}} and let 𝜽~(𝛀¯)=argmin𝜽h(𝜽,𝛀¯)\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})=\operatorname*{arg\,min}_{\boldsymbol{\theta}}h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}}), thus we have:

𝜽~(𝛀¯)=(t𝐲t𝛀¯𝐲tT+λ𝐊~)1(t𝐲t𝛀¯𝐱tT),\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})=\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)^{-1}\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{x}_{t}}{T}\right), (34)

which is a consistent estimator of 𝜽\boldsymbol{\theta}^{*} for any 𝛀¯\bar{\boldsymbol{\Omega}} given that λ0\lambda\rightarrow 0 and the matrix and vector time series are covariance-stationary. To see that 𝜽~(𝛀¯)p.𝜽\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\overset{p.}{\rightarrow}\boldsymbol{\theta}^{*}, notice that:

𝜽~(𝛀¯)=(𝐈λ𝐊~)𝜽+(t𝐲t𝛀¯𝐲tT+λ𝐊~)1(t𝐲t𝛀¯𝐞tT),\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})=(\mathbf{I}-\lambda\widetilde{\mathbf{K}})\boldsymbol{\theta}^{*}+\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)^{-1}\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}}{T}\right), (35)

and the first term converges to 𝜽\boldsymbol{\theta}^{*} since λ=o(1)\lambda=o(1). In the second term of (35), we have:

t𝐲t𝛀¯𝐲tT+λ𝐊~p.[𝚺𝐱,𝐱𝛀¯𝚺𝐱,𝐳𝛀¯𝐊𝚺𝐳,𝐱𝐊𝛀¯𝚺𝐳,𝐳𝐊𝛀¯𝐊],\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\overset{p.}{\rightarrow}\begin{bmatrix}\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}\otimes\bar{\boldsymbol{\Omega}}&\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{z}}^{*}\otimes\bar{\boldsymbol{\Omega}}\mathbf{K}\\ \boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}\otimes\mathbf{K}\bar{\boldsymbol{\Omega}}&\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*}\otimes\mathbf{K}\bar{\boldsymbol{\Omega}}\mathbf{K}\end{bmatrix}, (36)

where \boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}=\mathrm{Var}(\mathbf{x}_{t}), \boldsymbol{\Sigma}_{\mathbf{x},\mathbf{z}}^{*}=\mathrm{Cov}(\mathbf{x}_{t},\mathbf{z}_{t}) and \boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*}=\mathrm{Var}(\mathbf{z}_{t}). The convergence in probability in (36) holds due to the joint stationarity of \mathbf{x}_{t} and \mathbf{z}_{t} and the assumption that \lambda=o(1). We further note that \{\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}\}_{t=1}^{T} is a martingale difference sequence (MDS), so \sum_{t=1}^{T}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}/T=O_{P}(T^{-1/2}) by the central limit theorem (CLT) for MDS (see Proposition 7.9 of Hamilton (2020)). Combining this with (36), we conclude that the second term in (35) is o_{P}(1), and thus \widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}) is consistent for \boldsymbol{\theta}^{*}.
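For intuition, the closed form (34) is a generalized ridge regression; the following minimal sketch computes it with dense matrices (illustration only; the stacked inputs and their shapes are assumptions of the sketch).

```python
import numpy as np

def profile_theta(Y, X, Omega_bar, K_tilde, lam):
    """Closed-form minimizer of h(theta, Omega_bar) as in (34).

    Y: (T, S, p) stack of design matrices y_t; X: (T, S) stack of x_t;
    Omega_bar: (S, S) working precision matrix; K_tilde: (p, p) penalty matrix.
    """
    T = Y.shape[0]
    lhs = sum(Y[t].T @ Omega_bar @ Y[t] for t in range(T)) / T + lam * K_tilde
    rhs = sum(Y[t].T @ Omega_bar @ X[t] for t in range(T)) / T
    return np.linalg.solve(lhs, rhs)    # generalized ridge solution
```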

Plugging 𝜽~(𝛀¯)\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}) into h(𝜽,𝛀¯)h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}}) yields the profile likelihood of 𝛀¯\bar{\boldsymbol{\Omega}}:

(𝛀¯)=12log|𝛀¯|+12tr(𝛀¯t𝐱t[𝐱t𝐲t𝜽~(𝛀¯)]T).\ell(\bar{\boldsymbol{\Omega}})=-\frac{1}{2}\log|\bar{\boldsymbol{\Omega}}|+\frac{1}{2}\text{tr}\left(\bar{\boldsymbol{\Omega}}\frac{\sum_{t}\mathbf{x}_{t}[\mathbf{x}_{t}-\mathbf{y}_{t}\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})]^{\top}}{T}\right).

To prove (33), it suffices to show that:

P(inf𝛀¯𝛀Fc(𝛀¯)(𝛀))0, as T,\mathrm{P}\left(\inf_{\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\geq c}\ell(\bar{\boldsymbol{\Omega}})\leq\ell(\boldsymbol{\Omega}^{*})\right)\rightarrow 0,\text{ as }T\rightarrow\infty, (37)

since (𝛀)h(𝜽,𝛀)\ell(\boldsymbol{\Omega}^{*})\leq h(\boldsymbol{\theta}^{*},\boldsymbol{\Omega}^{*}). Now, since 𝜽~(𝛀¯)p.𝜽\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\overset{p.}{\rightarrow}\boldsymbol{\theta}^{*}, we can write 𝜽~(𝛀¯)=𝜽+𝜻\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})=\boldsymbol{\theta}^{*}+\boldsymbol{\zeta}, with 𝜻F=oP(1)\|\boldsymbol{\zeta}\|_{\mathrm{F}}=o_{P}(1). Using this new notation, we can rewrite (𝛀¯)\ell(\bar{\boldsymbol{\Omega}}) as:

(𝛀¯)\displaystyle\ell(\bar{\boldsymbol{\Omega}}) =12log|𝛀¯|+12tr(𝛀¯t𝐱t𝐞tT)12tr((t𝐱t𝛀¯𝐲tT)𝜻)\displaystyle=-\frac{1}{2}\log|\bar{\boldsymbol{\Omega}}|+\frac{1}{2}\text{tr}\left(\bar{\boldsymbol{\Omega}}\frac{\sum_{t}\mathbf{x}_{t}\mathbf{e}_{t}^{\top}}{T}\right)-\frac{1}{2}\text{tr}\left(\left(\frac{\sum_{t}\mathbf{x}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}\right)\boldsymbol{\zeta}\right) (38)
=~(𝛀¯)12tr((t𝐱t𝛀¯𝐲tT)𝜻),\displaystyle=\widetilde{\ell}(\bar{\boldsymbol{\Omega}})-\frac{1}{2}\text{tr}\left(\left(\frac{\sum_{t}\mathbf{x}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}\right)\boldsymbol{\zeta}\right),

where we define the first two terms in (38) as ~(𝛀¯)\widetilde{\ell}(\bar{\boldsymbol{\Omega}}).

By the Cauchy–Schwarz inequality, we have:

|12tr((t𝐱t𝛀¯𝐲tT)𝜻)|12t𝐱t𝛀¯𝐲tTF𝜻F.\left|\frac{1}{2}\text{tr}\left(\left(\frac{\sum_{t}\mathbf{x}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}\right)\boldsymbol{\zeta}\right)\right|\leq\frac{1}{2}\left\|\frac{\sum_{t}\mathbf{x}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}\right\|_{\mathrm{F}}\cdot\|\boldsymbol{\zeta}\|_{\mathrm{F}}. (39)

By the definition of 𝐲t\mathbf{y}_{t}, we have:

t𝐱t𝛀¯𝐲tT=[(t𝐱t1𝐱tT)(𝐈S𝛀¯);(t𝐳t1𝐱tT)(𝐈D𝛀¯𝐊)],\frac{\sum_{t}\mathbf{x}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}=\left[\left(\frac{\sum_{t}\mathbf{x}_{t-1}\otimes\mathbf{x}_{t}}{T}\right)^{\top}\left(\mathbf{I}_{S}\otimes\bar{\boldsymbol{\Omega}}\right);\left(\frac{\sum_{t}\mathbf{z}_{t-1}\otimes\mathbf{x}_{t}}{T}\right)^{\top}\left(\mathbf{I}_{D}\otimes\bar{\boldsymbol{\Omega}}\mathbf{K}\right)\right],

and notice that \mathbf{x}_{t-1}\otimes\mathbf{x}_{t} and \mathbf{z}_{t-1}\otimes\mathbf{x}_{t} are simply rearranged versions of \mathbf{x}_{t}\mathbf{x}_{t-1}^{\top} and \mathbf{x}_{t}\mathbf{z}_{t-1}^{\top}, respectively. By the joint stationarity of \mathbf{x}_{t} and \mathbf{z}_{t}, the time averages of \mathbf{x}_{t-1}\otimes\mathbf{x}_{t} and \mathbf{z}_{t-1}\otimes\mathbf{x}_{t} converge to rearranged versions of constant auto-covariance matrices, and hence the right-hand side of (39) is o_{P}(1).

Given this argument, proving (37) is now equivalent to proving:

P(inf𝛀¯𝛀Fc~(𝛀¯)~(𝛀))0, as T.\mathrm{P}\left(\inf_{\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\geq c}\widetilde{\ell}(\bar{\boldsymbol{\Omega}})\leq\widetilde{\ell}(\boldsymbol{\Omega}^{*})\right)\rightarrow 0,\text{ as }T\rightarrow\infty. (40)

Define 𝛀~\widetilde{\boldsymbol{\Omega}} as the unconstrained minimizer of ~(𝛀)\widetilde{\ell}(\boldsymbol{\Omega}), then explicitly, we have:

𝛀~=argmin𝛀~(𝛀)\displaystyle\widetilde{\boldsymbol{\Omega}}=\operatorname*{arg\,min}_{\boldsymbol{\Omega}}\widetilde{\ell}(\boldsymbol{\Omega}) =(t𝐞t𝐱tT)1\displaystyle=\left(\frac{\sum_{t}\mathbf{e}_{t}\mathbf{x}_{t}^{\top}}{T}\right)^{-1}
=(t𝐞t𝐞tT+t𝐞t(𝐲t𝜽)T)1p.𝛀,\displaystyle=\left(\frac{\sum_{t}\mathbf{e}_{t}\mathbf{e}_{t}^{\top}}{T}+\frac{\sum_{t}\mathbf{e}_{t}\left(\mathbf{y}_{t}\boldsymbol{\theta}^{*}\right)^{\top}}{T}\right)^{-1}\overset{p.}{\rightarrow}\boldsymbol{\Omega}^{*},

where the final argument on the convergence in probability to 𝛀\boldsymbol{\Omega}^{*} is based on the fact that t=1T𝐞t(𝐲t𝜽)/T=OP(T1/2)\sum_{t=1}^{T}\mathbf{e}_{t}\left(\mathbf{y}_{t}\boldsymbol{\theta}^{*}\right)^{\top}/T=O_{P}(T^{-1/2}) by the CLT of MDS. By the second-order Taylor expansion of ~(𝛀¯)\widetilde{\ell}(\bar{\boldsymbol{\Omega}}) at 𝛀~\widetilde{\boldsymbol{\Omega}}, we have:

~(𝛀¯)=~(𝛀~)+14𝐯𝐞𝐜(𝛀¯𝛀~)[𝛀ˇ1𝛀ˇ1]𝐯𝐞𝐜(𝛀¯𝛀~),\widetilde{\ell}(\bar{\boldsymbol{\Omega}})=\widetilde{\ell}(\widetilde{\boldsymbol{\Omega}})+\frac{1}{4}\mathrm{\mathbf{vec}}\left(\bar{\boldsymbol{\Omega}}-\widetilde{\boldsymbol{\Omega}}\right)^{\top}\left[\check{\boldsymbol{\Omega}}^{-1}\otimes\check{\boldsymbol{\Omega}}^{-1}\right]\mathrm{\mathbf{vec}}\left(\bar{\boldsymbol{\Omega}}-\widetilde{\boldsymbol{\Omega}}\right), (41)

where \check{\boldsymbol{\Omega}}=\widetilde{\boldsymbol{\Omega}}+\eta(\bar{\boldsymbol{\Omega}}-\widetilde{\boldsymbol{\Omega}}) for some \eta\in[0,1]. For any constant c>0 such that \|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}=c, write c=\kappa\bar{\rho}(\boldsymbol{\Omega}^{*}), where \kappa>0 is a constant that depends only on c. Consequently, we have:

|ρ¯(𝛀¯)ρ¯(𝛀)|𝛀¯𝛀s𝛀¯𝛀F=κρ¯(𝛀),\left|\bar{\rho}(\bar{\boldsymbol{\Omega}})-\bar{\rho}(\boldsymbol{\Omega}^{*})\right|\leq\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{s}\leq\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}=\kappa\bar{\rho}(\boldsymbol{\Omega}^{*}),

and thus \bar{\rho}(\bar{\boldsymbol{\Omega}})\leq(1+\kappa)\bar{\rho}(\boldsymbol{\Omega}^{*}). On the event that \|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}=c, we have \|\bar{\boldsymbol{\Omega}}-\widetilde{\boldsymbol{\Omega}}\|_{\mathrm{F}}\geq c/2 with probability approaching one, by the consistency of \widetilde{\boldsymbol{\Omega}}. Furthermore, we also have:

ρ¯(𝛀ˇ1𝛀ˇ1)=ρ¯(𝛀ˇ1)2\displaystyle\underaccent{\bar}{\rho}(\check{\boldsymbol{\Omega}}^{-1}\otimes\check{\boldsymbol{\Omega}}^{-1})=\underaccent{\bar}{\rho}(\check{\boldsymbol{\Omega}}^{-1})^{2} =1ρ¯(𝛀ˇ)2\displaystyle=\frac{1}{\bar{\rho}(\check{\boldsymbol{\Omega}})^{2}}
[1ρ¯(𝛀~)+ρ¯(𝛀¯)]2\displaystyle\geq\left[\frac{1}{\bar{\rho}(\widetilde{\boldsymbol{\Omega}})+\bar{\rho}(\bar{\boldsymbol{\Omega}})}\right]^{2}
[12ρ¯(𝛀)+(c+ρ¯(𝛀))]2=1(3+κ)21ρ¯(𝛀)2,\displaystyle\geq\left[\frac{1}{2\bar{\rho}(\boldsymbol{\Omega}^{*})+(c+\bar{\rho}(\boldsymbol{\Omega}^{*}))}\right]^{2}=\frac{1}{(3+\kappa)^{2}}\cdot\frac{1}{\bar{\rho}(\boldsymbol{\Omega}^{*})^{2}},

where the last inequality holds with probability approaching one since P[ρ¯(𝛀~)2ρ¯(𝛀)]1\mathrm{P}\left[\bar{\rho}(\widetilde{\boldsymbol{\Omega}})\leq 2\bar{\rho}(\boldsymbol{\Omega}^{*})\right]\rightarrow 1. Utilizing these facts together with (41), we end up having:

P[~(𝛀¯)~(𝛀~)+116(κ3+κ)2]1, as T,\mathrm{P}\left[\widetilde{\ell}(\bar{\boldsymbol{\Omega}})\geq\widetilde{\ell}(\widetilde{\boldsymbol{\Omega}})+\frac{1}{16}\cdot\left(\frac{\kappa}{3+\kappa}\right)^{2}\right]\rightarrow 1,\text{ as }T\rightarrow\infty, (42)

for any \bar{\boldsymbol{\Omega}} such that \|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}=c=\kappa\bar{\rho}(\boldsymbol{\Omega}^{*}). Since \kappa is an arbitrary positive constant and \widetilde{\ell}(\widetilde{\boldsymbol{\Omega}})\overset{p.}{\rightarrow}\widetilde{\ell}(\boldsymbol{\Omega}^{*}), we establish (40), which completes the proof.  

C.2 Proof of Theorem 9

To prove Theorem 9, we first establish the consistency and the convergence rate of the estimators in Lemma 14 below.

Lemma 14

Under the same assumptions as Theorem 9, all model estimators of MARAC are \sqrt{T}-consistent, namely:

𝐀^p𝐀pF=OP(1T),𝐁^p𝐁pF=OP(1T),𝜸^q𝜸qF=OP(1T),\|\widehat{\mathbf{A}}_{p}-\mathbf{A}_{p}^{*}\|_{\mathrm{F}}=O_{P}\left(\frac{1}{\sqrt{T}}\right),\|\widehat{\mathbf{B}}_{p}-\mathbf{B}_{p}^{*}\|_{\mathrm{F}}=O_{P}\left(\frac{1}{\sqrt{T}}\right),\|\widehat{\boldsymbol{\gamma}}_{q}-\boldsymbol{\gamma}_{q}^{*}\|_{\mathrm{F}}=O_{P}\left(\frac{1}{\sqrt{T}}\right),

for p[P],q[Q]p\in[P],q\in[Q]. As a direct result, we also have:

𝐁^p𝐀^p𝐁p𝐀pF=OP(1T), for p[P].\|\widehat{\mathbf{B}}_{p}\otimes\widehat{\mathbf{A}}_{p}-\mathbf{B}_{p}^{*}\otimes\mathbf{A}^{*}_{p}\|_{\mathrm{F}}=O_{P}\left(\frac{1}{\sqrt{T}}\right),\text{ for }p\in[P].

We delay the proof of Lemma 14 to Section E.2. With this lemma, we are now ready to present the proof of Theorem 9.

Proof  For simplicity of notation and presentation, we fix P=Q=1; the proof technique generalizes to arbitrary P,Q. To start, we revisit the updating rule for \mathbf{A}_{p}^{(l+1)} in (13). Plugging the data-generating model for \mathbf{X}_{t} under MARAC(1,1) into (13), we can transform it into:

t[T][Δ𝐀1𝐗t1𝐁^1+𝐀1𝐗t1Δ𝐁1+Δ𝒢ׯ]𝚺^c1𝐁^1𝐗t1=𝐎M×M,\sum_{t\in[T]}\left[\Delta\mathbf{A}_{1}\mathbf{X}_{t-1}\widehat{\mathbf{B}}_{1}^{\top}+\mathbf{A}_{1}^{*}\mathbf{X}_{t-1}\Delta\mathbf{B}_{1}^{\top}+\Delta\mathbfcal{G}_{1}\bar{\times}\mathbf{z}_{t-1}-\mathbf{E}_{t}\right]\widehat{\boldsymbol{\Sigma}}_{c}^{-1}\widehat{\mathbf{B}}_{1}\mathbf{X}_{t-1}^{\top}=\mathbf{O}_{M\times M},

where for any matrix/tensor \mathbf{M}, we define \Delta\mathbf{M}=\widehat{\mathbf{M}}-\mathbf{M}^{*}. One can simplify the estimating equation above by left-multiplying by \widehat{\boldsymbol{\Sigma}}_{r}^{-1} and then vectorizing both sides to obtain:

t[T][(𝐁1𝐗t1)(𝚺c)1(𝐁1𝐗t1)(𝚺r)1]𝐯𝐞𝐜(𝐀^1𝐀1)\displaystyle\sum_{t\in[T]}\left[(\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top})^{\top}(\boldsymbol{\Sigma}_{c}^{*})^{-1}(\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top})\otimes(\boldsymbol{\Sigma}_{r}^{*})^{-1}\right]\mathrm{\mathbf{vec}}\left(\widehat{\mathbf{A}}_{1}-\mathbf{A}_{1}^{*}\right)
+\displaystyle+ t[T][(𝐁1𝐗t1)(𝚺c)1(𝚺r)1𝐀1𝐗t1]𝐯𝐞𝐜(𝐁^1(𝐁1))\displaystyle\sum_{t\in[T]}\left[(\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top})^{\top}(\boldsymbol{\Sigma}_{c}^{*})^{-1}\otimes(\boldsymbol{\Sigma}_{r}^{*})^{-1}\mathbf{A}_{1}^{*}\mathbf{X}_{t-1}\right]\mathrm{\mathbf{vec}}\left(\widehat{\mathbf{B}}_{1}^{\top}-(\mathbf{B}_{1}^{*})^{\top}\right)
+\displaystyle+ t[T]{𝐳t1[(𝐁1𝐗t1)(𝚺c)1(𝚺r)1𝐊]}𝐯𝐞𝐜(𝜸^1𝜸1)\displaystyle\sum_{t\in[T]}\left\{\mathbf{z}_{t-1}^{\top}\otimes\left[(\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top})^{\top}(\boldsymbol{\Sigma}_{c}^{*})^{-1}\otimes(\boldsymbol{\Sigma}_{r}^{*})^{-1}\mathbf{K}\right]\right\}\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\gamma}}_{1}-\boldsymbol{\gamma}_{1}^{*}\right)
=\displaystyle= t[T][(𝐁1𝐗t1)(𝚺c)1(𝚺r)1]𝐯𝐞𝐜(𝐄t)+oP(T).\displaystyle\sum_{t\in[T]}\left[(\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top})^{\top}(\boldsymbol{\Sigma}_{c}^{*})^{-1}\otimes(\boldsymbol{\Sigma}_{r}^{*})^{-1}\right]\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)+o_{P}(\sqrt{T}).

On the left-hand side of the equation above, we replace 𝐁^1,𝚺^r,𝚺^c\widehat{\mathbf{B}}_{1},\widehat{\boldsymbol{\Sigma}}_{r},\widehat{\boldsymbol{\Sigma}}_{c} with their true values 𝐁1,𝚺r,𝚺c\mathbf{B}_{1}^{*},\boldsymbol{\Sigma}_{r}^{*},\boldsymbol{\Sigma}_{c}^{*}, since the discrepancies are of order oP(1)o_{P}(1) and can thus be incorporated into the oP(T)o_{P}(\sqrt{T}) term given the T\sqrt{T}-consistency of 𝐀^1,𝐁^1,𝜸^1\widehat{\mathbf{A}}_{1},\widehat{\mathbf{B}}_{1},\widehat{\boldsymbol{\gamma}}_{1}. On the right-hand side, we have:

t𝐯𝐞𝐜(𝚺^r1𝐄t𝚺^c1𝐁^1𝐗t1)\displaystyle\sum_{t}\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Sigma}}_{r}^{-1}\mathbf{E}_{t}\widehat{\boldsymbol{\Sigma}}_{c}^{-1}\widehat{\mathbf{B}}_{1}\mathbf{X}_{t-1}^{\top}\right)
=t[𝐞t(𝐗t1𝐈M)]𝐯𝐞𝐜[(𝐁^1𝐈M)𝚺^1],\displaystyle=\sum_{t}\left[\mathbf{e}_{t}^{\top}\otimes\left(\mathbf{X}_{t-1}\otimes\mathbf{I}_{M}\right)\right]\mathbf{vec}\left[\left(\widehat{\mathbf{B}}_{1}^{\top}\otimes\mathbf{I}_{M}\right)\widehat{\boldsymbol{\Sigma}}^{-1}\right],

where the process \{\mathbf{e}_{t}^{\top}\otimes\left(\mathbf{X}_{t-1}\otimes\mathbf{I}_{M}\right)\}_{t=1}^{T} is a martingale difference sequence, so the martingale central limit theorem (Hall and Heyde, 2014) implies that \sum_{t}\left[\mathbf{e}_{t}^{\top}\otimes\left(\mathbf{X}_{t-1}\otimes\mathbf{I}_{M}\right)\right]=O_{P}(\sqrt{T}). Thus, by the consistency of \widehat{\boldsymbol{\Sigma}} and \widehat{\mathbf{B}}_{1}, we can replace \widehat{\boldsymbol{\Sigma}} and \widehat{\mathbf{B}}_{1} with their true values and absorb the remainders into o_{P}(\sqrt{T}).
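The O_{P}(\sqrt{T}) rate of such martingale-difference partial sums can be illustrated with a small Monte Carlo sketch. In the scalar example below (Python/NumPy; the AR(1) predictor, noise law, and sample sizes are illustrative assumptions rather than the MARAC design), the partial sums standardized by \sqrt{T} have roughly constant spread as T grows.

# Sketch: partial sums of the martingale difference sequence eta_t = x_{t-1} * eps_t,
# with x_t a stationary AR(1) and eps_t fresh noise, grow at the rate sqrt(T).
import numpy as np

rng = np.random.default_rng(2)

def mds_partial_sums(T, reps=500, phi=0.5):
    out = np.empty(reps)
    for r in range(reps):
        x = np.zeros(T + 1)
        innov = rng.standard_normal(T + 1)
        for t in range(1, T + 1):
            x[t] = phi * x[t - 1] + innov[t]       # AR(1) predictor
        eta = x[:-1] * rng.standard_normal(T)      # MDS: E[eta_t | past] = 0
        out[r] = eta.sum()
    return out

for T in (200, 800, 3200):
    sums = mds_partial_sums(T)
    print(T, round(np.std(sums) / np.sqrt(T), 2))  # roughly constant across T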

Similar transformations can be applied to (14) and (15), where the penalty term is incorporated into oP(T)o_{P}(\sqrt{T}) due to the assumption that λ=o(T12)\lambda=o(T^{-\frac{1}{2}}). With the notation that 𝐔t=𝐈N𝐀1𝐗t1\mathbf{U}_{t}=\mathbf{I}_{N}\otimes\mathbf{A}_{1}^{*}\mathbf{X}_{t-1}, 𝐕t=𝐁1𝐗t1𝐈M\mathbf{V}_{t}=\mathbf{B}_{1}^{*}\mathbf{X}_{t-1}^{\top}\otimes\mathbf{I}_{M}, 𝐘t=𝐳t1𝐊\mathbf{Y}_{t}=\mathbf{z}_{t-1}^{\top}\otimes\mathbf{K} and 𝐖t=[𝐕t;𝐔t;𝐘t]\mathbf{W}_{t}=[\mathbf{V}_{t};\mathbf{U}_{t};\mathbf{Y}_{t}], these transformed estimating equations can be converted altogether into:

(1Tt[T]𝐖t(𝚺)1𝐖t)𝐯𝐞𝐜(𝚯^𝚯)\displaystyle\left(\frac{1}{T}\sum_{t\in[T]}\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right)\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right) =1Tt[T]𝐖t(𝚺)1𝐯𝐞𝐜(𝐄t)\displaystyle=\frac{1}{T}\sum_{t\in[T]}\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)
+oP(T1/2),\displaystyle+o_{P}(T^{-1/2}), (43)

where 𝐯𝐞𝐜(𝚯^𝚯)=[𝐯𝐞𝐜(𝒜^𝒜),𝐯𝐞𝐜(^),𝐯𝐞𝐜(^)]\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right)=[\mathrm{\mathbf{vec}}\left(\widehat{\mathbfcal{A}}-\mathbfcal{A}^{*}\right)^{\top},\mathrm{\mathbf{vec}}\left(\widehat{\mathbfcal{B}}-\mathbfcal{B}^{*}\right)^{\top},\mathrm{\mathbf{vec}}\left(\widehat{\mathbfcal{R}}-\mathbfcal{R}^{*}\right)^{\top}]^{\top}, and 𝒜^,^,^\widehat{\mathbfcal{A}},\widehat{\mathbfcal{B}},\widehat{\mathbfcal{R}} are defined as [𝒜^]::p=𝐀^p,[^]::p=𝐁^p[\widehat{\mathbfcal{A}}]_{::p}=\widehat{\mathbf{A}}_{p},[\widehat{\mathbfcal{B}}]_{::p}=\widehat{\mathbf{B}}_{p}^{\top}, [^]:dq=𝜸^q,d[\widehat{\mathbfcal{R}}]_{:dq}=\widehat{\boldsymbol{\gamma}}_{q,d} and 𝒜\mathbfcal{A}^{*},\mathbfcal{B}^{*},\mathbfcal{R}^{*} are the corresponding true coefficients.

In (43), we first establish that:

(1/T)t[T]𝐖t(𝚺)1𝐖tp.E[𝐖t(𝚺)1𝐖t].(1/T)\sum_{t\in[T]}\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\overset{p.}{\rightarrow}\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]. (44)

To prove (44), note that since \mathbf{X}_{t} and \mathbf{z}_{t} are zero-mean and jointly stationary, we have T^{-1}\sum_{t\in[T]}\widetilde{\mathbf{x}}_{t}\widetilde{\mathbf{x}}_{t}^{\top}\overset{p.}{\rightarrow}\mathrm{E}[\widetilde{\mathbf{x}}_{t}\widetilde{\mathbf{x}}_{t}^{\top}] by Lemma 20 and Corollary 21, where \widetilde{\mathbf{x}}_{t}=[\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}. See details of Lemma 20 and Corollary 21 in Section E.1. Then, since each element of \mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t} is a linear combination of entries of \widetilde{\mathbf{x}}_{t}\widetilde{\mathbf{x}}_{t}^{\top} (hence a continuous mapping), (44) holds elementwise.

Given (44) and the fact that 𝚯^\widehat{\boldsymbol{\Theta}} is T\sqrt{T}-consistent, we can rewrite (43) as:

E[𝐖t(𝚺)1𝐖t]𝐯𝐞𝐜(𝚯^𝚯)\displaystyle\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right) =1Tt[T]𝐖t(𝚺)1𝐯𝐞𝐜(𝐄t)\displaystyle=\frac{1}{T}\sum_{t\in[T]}\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)
+oP(T1/2),\displaystyle+o_{P}(T^{-1/2}), (45)

For the term on the right-hand side of (45), first notice that the sequence \{\boldsymbol{\eta}_{t}\}_{t=1}^{T}, where \boldsymbol{\eta}_{t}=\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right), is a zero-mean, stationary vector martingale difference sequence (MDS), thanks to the independence of \mathbf{E}_{t} from the jointly stationary \mathbf{X}_{t-1} and \mathbf{z}_{t-1}. By the martingale central limit theorem (Hall and Heyde, 2014), we have:

1Tt[T]𝐖t(𝚺)1𝐯𝐞𝐜(𝐄t)d.𝒩(𝟎,E[𝐖t(𝚺)1𝐖t]).\frac{1}{\sqrt{T}}\sum_{t\in[T]}\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathrm{\mathbf{vec}}\left(\mathbf{E}_{t}\right)\overset{d.}{\rightarrow}\mathcal{N}(\mathbf{0},\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]). (46)

Combining (45) and (46), we end up having:

E[𝐖t(𝚺)1𝐖t]T𝐯𝐞𝐜(𝚯^𝚯)d.𝒩(𝟎,E[𝐖t(𝚺)1𝐖t]).\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]\sqrt{T}\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right)\overset{d.}{\rightarrow}\mathcal{N}(\mathbf{0},\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]). (47)

The asymptotic distribution of \sqrt{T}\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right) can thus be derived by multiplying both sides of (47) by the inverse of \mathbf{L}=\mathrm{E}\left[\mathbf{W}_{t}^{\top}(\boldsymbol{\Sigma}^{*})^{-1}\mathbf{W}_{t}\right]. However, the matrix \mathbf{L} is not full-rank, because \mathbf{L}\boldsymbol{\mu}=\mathbf{0}, where \boldsymbol{\mu}=[\mathrm{\mathbf{vec}}\left(\mathbfcal{A}^{*}\right)^{\top},-\mathrm{\mathbf{vec}}\left(\mathbfcal{B}^{*}\right)^{\top},\mathbf{0}^{\top}]^{\top}. As a remedy, let \boldsymbol{\zeta}=[\mathrm{\mathbf{vec}}\left(\mathbf{A}_{1}^{*}\right)^{\top},\mathbf{0}^{\top}]^{\top}\in\mathbb{R}^{M^{2}+N^{2}+DMN}. Under the identifiability constraint \|\mathbf{A}^{*}_{1}\|_{\mathrm{F}}=\|\widehat{\mathbf{A}}_{1}\|_{\mathrm{F}}=1, we have 2\mathrm{\mathbf{vec}}\left(\mathbf{A}_{1}^{*}\right)^{\top}\mathrm{\mathbf{vec}}\left(\widehat{\mathbf{A}}_{1}-\mathbf{A}^{*}_{1}\right)=-\|\widehat{\mathbf{A}}_{1}-\mathbf{A}^{*}_{1}\|_{\mathrm{F}}^{2}=O_{P}(T^{-1}) by the \sqrt{T}-consistency of \widehat{\mathbf{A}}_{1}, and hence \mathrm{\mathbf{vec}}\left(\mathbf{A}_{1}^{*}\right)^{\top}\mathrm{\mathbf{vec}}\left(\widehat{\mathbf{A}}_{1}-\mathbf{A}^{*}_{1}\right)=o_{P}(T^{-1/2}). Therefore, we have:

T𝜻𝐯𝐞𝐜(𝚯^𝚯)p.0.\sqrt{T}\boldsymbol{\zeta}^{\top}\mathrm{\mathbf{vec}}\left(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*}\right)\overset{p.}{\rightarrow}0. (48)

Combining (47) and (48) and using Slutsky's theorem, we have \mathbf{H}\sqrt{T}\mathbf{vec}(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*})\overset{d.}{\rightarrow}\mathcal{N}(\mathbf{0},\mathbf{L}), where \mathbf{H}=\mathbf{L}+\boldsymbol{\zeta}\boldsymbol{\zeta}^{\top}, and thus:

T𝐯𝐞𝐜(𝚯^𝚯)d.𝒩(𝟎,𝐇1𝐋𝐇1).\sqrt{T}\mathbf{vec}(\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}^{*})\overset{d.}{\rightarrow}\mathcal{N}(\mathbf{0},\mathbf{H}^{-1}\mathbf{L}\mathbf{H}^{-1}). (49)

The final asymptotic distributions of \mathrm{vec}(\widehat{\mathbf{B}}_{1}^{\top})\otimes\mathrm{vec}(\widehat{\mathbf{A}}_{1}) and \mathbf{K}\widehat{\boldsymbol{\gamma}}_{q,d} can be derived from (49) with the multivariate delta method; we omit the details here.  

Appendix D Theory under High Spatial Dimension

D.1 Proof of Theorem 12

Proof  In this proof, we again fix P=Q=1 for ease of presentation; the technical details generalize to arbitrary P,Q. Since the lags are fixed at 1, we drop the lag subscripts of the coefficients for convenience.

Under the specification of the MARAC(1,1)(1,1) model, we restate the model as:

𝐱t=(𝐱t1𝐈S)𝐯𝐞𝐜(𝐁𝐀)+(𝐳t1𝐊)𝜸+𝐞t,\mathbf{x}_{t}=\left(\mathbf{x}_{t-1}^{\top}\otimes\mathbf{I}_{S}\right)\mathrm{\mathbf{vec}}\left(\mathbf{B}^{*}\otimes\mathbf{A}^{*}\right)+\left(\mathbf{z}_{t-1}^{\top}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}+\mathbf{e}_{t},

where S=MNS=MN and we introduce the following additional notations:

𝐘T[𝐱1𝐱T],𝐗~T[𝐱0𝐱T1]𝐈S,𝐳~T[𝐳0𝐳T1],𝒯𝒯\mathbf{Y}_{T}\coloneqq\begin{bmatrix}\mathbf{x}_{1}\\ \vdots\\ \mathbf{x}_{T}\end{bmatrix},\quad\widetilde{\mathbf{X}}_{T}\coloneqq\begin{bmatrix}\mathbf{x}_{0}^{\top}\\ \vdots\\ \mathbf{x}_{T-1}^{\top}\end{bmatrix}\otimes\mathbf{I}_{S},\quad\widetilde{\mathbf{z}}_{T}\coloneqq\begin{bmatrix}\mathbf{z}_{0}^{\top}\\ \vdots\\ \mathbf{z}_{T-1}^{\top}\end{bmatrix},\quad\mathbfcal{E}_{T}=\begin{bmatrix}\mathbf{e}_{1}\\ \vdots\\ \mathbf{e}_{T}\end{bmatrix}.

We will drop the subscript T for convenience. Let \boldsymbol{\phi}^{*}=\mathrm{\mathbf{vec}}\left(\mathbf{B}^{*}\otimes\mathbf{A}^{*}\right) and g_{1}^{*},\ldots,g_{D}^{*}\in\mathbb{H}_{k} be the true autoregressive and functional parameters, respectively. Correspondingly, let \boldsymbol{\gamma}_{1}^{*},\ldots,\boldsymbol{\gamma}_{D}^{*} be the coefficients of the representers when evaluating g_{1}^{*},\ldots,g_{D}^{*} on the matrix grid, i.e., \mathbf{K}\boldsymbol{\gamma}_{d}^{*} is the discrete evaluation of g_{d}^{*} on the matrix grid. Let \mathbb{F}_{\boldsymbol{\phi}}=\{\mathrm{\mathbf{vec}}\left(\mathbf{B}\otimes\mathbf{A}\right)|\|\mathbf{A}\|_{\mathrm{F}}=\text{sign}(\text{tr}\left(\mathbf{A}\right))=1,\mathbf{A}\in\mathbb{R}^{M\times M},\mathbf{B}\in\mathbb{R}^{N\times N}\}. Using these notations, the MARAC estimator is obtained by solving the following penalized least squares problem:

minϕ𝔽ϕ,𝜸SD𝔏λ(ϕ,𝜸){12T𝐘𝐗~ϕ(𝐳~𝐊)𝜸F2+λ2𝜸(𝐈D𝐊)𝜸}.\min_{\boldsymbol{\phi}\in\mathbb{F}_{\boldsymbol{\phi}},\boldsymbol{\gamma}\in\mathbb{R}^{SD}}\mathfrak{L}_{\lambda}(\boldsymbol{\phi},\boldsymbol{\gamma})\coloneqq\left\{\frac{1}{2T}\|\mathbf{Y}-\widetilde{\mathbf{X}}\boldsymbol{\phi}-\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}\|_{\mathrm{F}}^{2}+\frac{\lambda}{2}\boldsymbol{\gamma}^{\top}\left(\mathbf{I}_{D}\otimes\mathbf{K}\right)\boldsymbol{\gamma}\right\}. (50)

By fixing ϕ\boldsymbol{\phi}, the estimator for 𝜸\boldsymbol{\gamma} is given by 𝜸^(ϕ)=argmin𝜸𝔏λ(ϕ,𝜸)\widehat{\boldsymbol{\gamma}}(\boldsymbol{\phi})=\operatorname*{arg\,min}_{\boldsymbol{\gamma}}\mathfrak{L}_{\lambda}(\boldsymbol{\phi},\boldsymbol{\gamma}), and can be explicitly written as:

𝜸^(ϕ)=T1[𝚺^𝐳𝐊+λ𝐈SD]1(𝐳~𝐈S)(𝐘𝐗~ϕ).\widehat{\boldsymbol{\gamma}}(\boldsymbol{\phi})=T^{-1}\left[\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\lambda\cdot\mathbf{I}_{SD}\right]^{-1}\left(\widetilde{\mathbf{z}}^{\top}\otimes\mathbf{I}_{S}\right)\left(\mathbf{Y}-\widetilde{\mathbf{X}}\boldsymbol{\phi}\right). (51)
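As a sanity check on (51), the following sketch (Python/NumPy; the dimensions, the positive-definite kernel Gram matrix, and the random placeholder standing in for \widetilde{\mathbf{X}}\boldsymbol{\phi} are all illustrative assumptions) verifies that the closed form above coincides with a direct solve of the normal equations of (50) in \boldsymbol{\gamma} for a fixed \boldsymbol{\phi}.

# Sketch: for fixed phi, the ridge-type closed form (51) for gamma-hat matches a
# direct solve of the normal equations of the penalized objective (50).
import numpy as np

rng = np.random.default_rng(0)
M, N, D, T, lam = 3, 2, 2, 50, 0.1
S = M * N

Y = rng.standard_normal(T * S)                 # stacked responses x_1, ..., x_T
Xphi = rng.standard_normal(T * S)              # placeholder for X_tilde @ phi (phi fixed)
z_tilde = rng.standard_normal((T, D))          # stacked auxiliary covariates
G = rng.standard_normal((S, S))
K = G @ G.T + S * np.eye(S)                    # positive-definite kernel Gram matrix

Sigma_z = z_tilde.T @ z_tilde / T
resid = Y - Xphi

# Closed form (51): gamma = T^{-1} (Sigma_z (x) K + lam I)^{-1} (z~^T (x) I_S) resid.
A = np.kron(Sigma_z, K) + lam * np.eye(S * D)
gamma_closed = np.linalg.solve(A, np.kron(z_tilde.T, np.eye(S)) @ resid) / T

# Normal equations of (50) in gamma, solved directly.
ZK = np.kron(z_tilde, K)                       # (TS) x (SD) design for gamma
lhs = ZK.T @ ZK / T + lam * np.kron(np.eye(D), K)
rhs = ZK.T @ resid / T
gamma_direct = np.linalg.solve(lhs, rhs)

print(np.max(np.abs(gamma_closed - gamma_direct)))   # agreement up to floating-point error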

Plugging (51) into (50) yields the profile likelihood for ϕ\boldsymbol{\phi}:

λ(ϕ)=𝔏λ(ϕ,𝜸^(ϕ))=12T(𝐘𝐗~ϕ)𝐖(𝐘𝐗~ϕ),\ell_{\lambda}(\boldsymbol{\phi})=\mathfrak{L}_{\lambda}(\boldsymbol{\phi},\widehat{\boldsymbol{\gamma}}(\boldsymbol{\phi}))=\frac{1}{2T}\left(\mathbf{Y}-\widetilde{\mathbf{X}}\boldsymbol{\phi}\right)^{\top}\mathbf{W}\left(\mathbf{Y}-\widetilde{\mathbf{X}}\boldsymbol{\phi}\right), (52)

where 𝐖\mathbf{W} is defined as:

𝐖={𝐈(𝐳~𝐊)[𝚺^𝐳𝐊+λ𝐈SD]1(𝐳~𝐈S)T}=(𝐈+𝐳~𝐳~λT𝐊)1,\mathbf{W}=\left\{\mathbf{I}-\frac{\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\left[\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\lambda\cdot\mathbf{I}_{SD}\right]^{-1}\left(\widetilde{\mathbf{z}}^{\top}\otimes\mathbf{I}_{S}\right)}{T}\right\}=\left(\mathbf{I}+\frac{\widetilde{\mathbf{z}}\widetilde{\mathbf{z}}^{\top}}{\lambda T}\otimes\mathbf{K}\right)^{-1}, (53)

and the second equality in (53) follows from the Woodbury matrix identity. The matrix \mathbf{W} is symmetric and positive definite, with all of its eigenvalues in (0,1]. To organize the proof, we break it into several major steps. In the first step, we establish the following result on \widehat{\boldsymbol{\phi}}:

Proposition 15

Under the assumptions of Theorem 12, we have:

(ϕ^ϕ)(𝐗~𝐖𝐗~T)(ϕ^ϕ)OP(Cgλ)+OP(SD/T),\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)^{\top}\left(\frac{\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}}{T}\right)\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)\lesssim O_{P}(C_{g}\lambda)+O_{P}(SD/T), (54)

where Cg=d=1Dgdk2C_{g}=\sum_{d=1}^{D}\|g_{d}^{*}\|_{\mathbb{H}_{k}}^{2}.

In order to derive the convergence rate of ϕ^\widehat{\boldsymbol{\phi}}, we still require one additional result:

Lemma 16

Under the assumptions of Theorem 12 and the requirement that SlogS/T0S\log S/T\rightarrow 0, it holds that:

ρ¯(𝐗~𝐖𝐗~/T)c0,S2>0,\underaccent{\bar}{\rho}\left(\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}/T\right)\geq\frac{c_{0,S}}{2}>0, (55)

with probability approaching 11 as S,TS,T\rightarrow\infty, where ρ¯()\underaccent{\bar}{\rho}(\cdot) is the minimum eigenvalue of a matrix and c0,S=ρ¯(𝚺𝐱,𝐱(𝚺𝐳,𝐱)(𝚺𝐳,𝐳)1𝚺𝐳,𝐱)c_{0,S}=\underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}-\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}\right)^{\top}\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*}\right)^{-1}\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}).

The proofs of Proposition 15 and Lemma 16 are relegated to Sections D.2 and E.3, respectively. Combining Proposition 15 and Lemma 16, we can derive the error bound of \widehat{\boldsymbol{\phi}} as:

1Sϕ^ϕFOP(CgγSc0,SS)+OP(Dc0,STS).\frac{1}{S}\|\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\|_{\mathrm{F}}\lesssim O_{P}(\sqrt{\frac{C_{g}\gamma_{S}}{c_{0,S}S}})+O_{P}(\sqrt{\frac{D}{c_{0,S}TS}}). (56)

Now with this error bound of the autoregressive parameter ϕ^\widehat{\boldsymbol{\phi}}, we further derive the prediction error bound for the functional parameters. To start with, we have:

1TS(𝐳~𝐊)(𝜸^𝜸)F\displaystyle\frac{1}{\sqrt{TS}}\|(\widetilde{\mathbf{z}}\otimes\mathbf{K})(\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})\|_{\mathrm{F}} =1TS(𝐈𝐖)(𝐘𝐗~ϕ^)(𝐳~𝐊)𝜸F\displaystyle=\frac{1}{\sqrt{TS}}\left\|(\mathbf{I}-\mathbf{W})(\mathbf{Y}-\widetilde{\mathbf{X}}\widehat{\boldsymbol{\phi}})-(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\right\|_{\mathrm{F}}
1TS[(𝐈𝐖)J1+(𝐈𝐖)𝐗~(ϕ^ϕ)FJ2\displaystyle\leq\frac{1}{\sqrt{TS}}\bigg{[}\underbrace{\|(\mathbf{I}-\mathbf{W})\mathbfcal{E}\|_{\mathrm{F}}}_{\text{\makebox[0.0pt]{$J_{1}$}}}+\underbrace{\|(\mathbf{I}-\mathbf{W})\widetilde{\mathbf{X}}(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*})\|_{\mathrm{F}}}_{\text{\makebox[0.0pt]{$J_{2}$}}}
+𝐖(𝐳~𝐊)𝜸FJ3],\displaystyle+\underbrace{\|\mathbf{W}(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}}_{\text{\makebox[0.0pt]{$J_{3}$}}}\bigg{]},

and we will bound the terms J1,J2,J3J_{1},J_{2},J_{3} separately.

To bound J1J_{1}, we first establish two lemmas.

Lemma 17

Given the definition of 𝐖\mathbf{W} in (53) and under the assumptions of Theorem 12, we have OP(γS1/2r0)tr(𝐈𝐖)OP(SγS1/2r0)O_{P}(\gamma_{S}^{-1/2r_{0}})\leq\text{tr}\left(\mathbf{I}-\mathbf{W}\right)\leq O_{P}(\sqrt{S}\gamma_{S}^{-1/2r_{0}}), where γS=λ/S\gamma_{S}=\lambda/S. Furthermore, we have tr(𝐖)SD\text{tr}\left(\mathbf{W}\right)\leq SD.

Lemma 18

Given the definition of 𝐖\mathbf{W} in (53) and under the assumptions of Theorem 12, we have that:

\mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E}/\text{tr}\left(\mathbf{W}\right)=O_{P}(1).

Furthermore, we have \mathbfcal{E}^{\top}\left(\mathbf{I}-\mathbf{W}\right)^{2}\mathbfcal{E}/\text{tr}\left(\left(\mathbf{I}-\mathbf{W}\right)^{2}\right)=O_{P}(1).

We leave the proofs of Lemma 17 and Lemma 18 to Sections E.4 and E.5, respectively. By Lemma 18, we have:

J12tr((𝐈𝐖)2)tr(𝐈𝐖).J_{1}^{2}\asymp\text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)\lesssim\text{tr}\left(\mathbf{I}-\mathbf{W}\right).

And by Lemma 17, we have J1OP(S1/4γS1/4r0)J_{1}\leq O_{P}(S^{1/4}\gamma_{S}^{-1/4r_{0}}).

For J2J_{2}, we have the following bound:

J2\displaystyle J_{2} =(𝐈𝐖)𝐖1/2𝐖1/2𝐗~(ϕ^ϕ)F\displaystyle=\|(\mathbf{I}-\mathbf{W})\mathbf{W}^{-1/2}\mathbf{W}^{1/2}\widetilde{\mathbf{X}}(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*})\|_{\mathrm{F}} (57)
(𝐈𝐖)𝐖1/2s𝐖1/2𝐗~(ϕ^ϕ)F\displaystyle\leq\|(\mathbf{I}-\mathbf{W})\mathbf{W}^{-1/2}\|_{s}\cdot\|\mathbf{W}^{1/2}\widetilde{\mathbf{X}}(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*})\|_{\mathrm{F}}
𝐖1/2s𝐖1/2𝐗~(ϕ^ϕ)F.\displaystyle\leq\|\mathbf{W}^{-1/2}\|_{s}\cdot\|\mathbf{W}^{1/2}\widetilde{\mathbf{X}}(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*})\|_{\mathrm{F}}. (58)

To bound 𝐖1/2s\|\mathbf{W}^{-1/2}\|_{s}, we can take advantage of the simpler form of 𝐖\mathbf{W} using the Woodbury matrix identity in (53) and obtain:

𝐖1/2s\displaystyle\|\mathbf{W}^{-1/2}\|_{s} =ρ¯(𝐖1)12=ρ¯(𝐈+(λT)1𝐳~𝐳~𝐊)12\displaystyle=\bar{\rho}(\mathbf{W}^{-1})^{\frac{1}{2}}=\bar{\rho}\left(\mathbf{I}+(\lambda T)^{-1}\widetilde{\mathbf{z}}\widetilde{\mathbf{z}}^{\top}\otimes\mathbf{K}\right)^{\frac{1}{2}}
[1+λ1ρ¯(𝐊)ρ¯(T1𝐳~𝐳~)]12[1+λ1ρ¯(𝐊)tr(𝚺^𝐳)]12.\displaystyle\leq\left[1+\lambda^{-1}\bar{\rho}(\mathbf{K})\bar{\rho}(T^{-1}\widetilde{\mathbf{z}}\widetilde{\mathbf{z}}^{\top})\right]^{\frac{1}{2}}\leq\left[1+\lambda^{-1}\bar{\rho}(\mathbf{K})\text{tr}\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\right)\right]^{\frac{1}{2}}.

In Lemma 20, which we state later in Section E.1, we show that for an N-dimensional stationary vector autoregressive process, the covariance estimator is consistent in the spectral norm as long as N\log N/T\rightarrow 0. Since \{\mathbf{z}_{t}\}_{t=1}^{T} follows a stationary VAR(\widetilde{Q}) process and its dimension D is fixed, we have \|\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}-\boldsymbol{\Sigma}_{\mathbf{z}}^{*}\|_{s}\overset{p.}{\rightarrow}0, and thus, with probability approaching 1, \mathrm{tr}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\leq 2\mathrm{tr}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*}). Therefore, \|\mathbf{W}^{-1/2}\|_{s}\leq O_{P}(\sqrt{1+c_{0}/\lambda}), where c_{0} is a constant depending on \text{tr}\left(\boldsymbol{\Sigma}_{\mathbf{z}}^{*}\right) and \bar{\rho}(\mathbf{K}). Combining this with the result in Proposition 15, we can bound J_{2} via its upper bound (58) as:

J2OP(CgλT)+OP(CgT)+OP(S)+OP(γS1).J_{2}\leq O_{P}\left(\sqrt{C_{g}\lambda T}\right)+O_{P}\left(\sqrt{C_{g}T}\right)+O_{P}(\sqrt{S})+O_{P}\left(\sqrt{\gamma_{S}^{-1}}\right). (59)

Finally, for J3J_{3}, we first notice that:

J3=𝐖(𝐳~𝐊)𝜸F𝐖1/2s𝐖1/2(𝐳~𝐊)𝜸F𝐖1/2(𝐳~𝐊)𝜸F.J_{3}=\|\mathbf{W}(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}\leq\|\mathbf{W}^{1/2}\|_{s}\cdot\|\mathbf{W}^{1/2}(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}\leq\|\mathbf{W}^{1/2}(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}.

The upper bound of J3J_{3} above can be further bounded by:

𝐖1/2(𝐳~𝐊)𝜸F2\displaystyle\|\mathbf{W}^{1/2}(\widetilde{\mathbf{z}}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}^{2} =(λT)[(𝐈D𝐊)𝜸]{𝐈SD(λ1𝚺^𝐳𝐊+𝐈SD)1}𝜸\displaystyle=(\lambda T)[(\mathbf{I}_{D}\otimes\mathbf{K})\boldsymbol{\gamma}^{*}]^{\top}\left\{\mathbf{I}_{SD}-\left(\lambda^{-1}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\mathbf{I}_{SD}\right)^{-1}\right\}\boldsymbol{\gamma}^{*}
=(λT)(d=1Dgdk2)\displaystyle=(\lambda T)\left(\sum_{d=1}^{D}\|g_{d}^{*}\|_{\mathbb{H}_{k}}^{2}\right)
(λ2T)(𝜸)[(𝐈D𝐊)(𝚺^𝐳𝐊+λ𝐈SD)1]𝜸\displaystyle-(\lambda^{2}T)\left(\boldsymbol{\gamma}^{*}\right)^{\top}\left[\left(\mathbf{I}_{D}\otimes\mathbf{K}\right)\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\lambda\mathbf{I}_{SD}\right)^{-1}\right]\boldsymbol{\gamma}^{*}
CgλT,\displaystyle\leq C_{g}\lambda T, (60)

where C_{g}=\sum_{d=1}^{D}\|g_{d}^{*}\|_{\mathbb{H}_{k}}^{2} is the total squared norm of the underlying functional parameters. The last inequality of (60) follows from the fact that the quadratic form multiplying \lambda^{2}T is non-negative. To see why, first note that:

(𝐈D𝐊)(𝚺^𝐳𝐊+λ𝐈SD)1=(𝚺^𝐳𝐈S)1[𝚺^𝐳𝐈S+λ1𝚺^𝐳2𝐊]1.\left(\mathbf{I}_{D}\otimes\mathbf{K}\right)\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\lambda\mathbf{I}_{SD}\right)^{-1}=\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{I}_{S}\right)^{-1}-\left[\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{I}_{S}+\lambda^{-1}\widehat{\boldsymbol{\Sigma}}^{2}_{\mathbf{z}}\otimes\mathbf{K}\right]^{-1}.
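This identity can also be confirmed numerically. The short sketch below (Python/NumPy, with random positive-definite stand-ins for \widehat{\boldsymbol{\Sigma}}_{\mathbf{z}} and \mathbf{K}; the dimensions and matrices are illustrative assumptions) checks that the two sides agree up to floating-point error.

# Sketch: numerical check of the identity
# (I_D (x) K)(Sigma_z (x) K + lam I)^{-1} = (Sigma_z (x) I_S)^{-1} - [Sigma_z (x) I_S + lam^{-1} Sigma_z^2 (x) K]^{-1}.
import numpy as np

rng = np.random.default_rng(3)
D, S, lam = 2, 4, 0.3

RA = rng.standard_normal((D, D))
Sigma_z = RA @ RA.T + D * np.eye(D)            # proxy for the (positive-definite) Sigma_z-hat
RB = rng.standard_normal((S, S))
K = RB @ RB.T + S * np.eye(S)                  # positive-definite kernel Gram matrix

I_S, I_D, I_SD = np.eye(S), np.eye(D), np.eye(S * D)
lhs = np.kron(I_D, K) @ np.linalg.inv(np.kron(Sigma_z, K) + lam * I_SD)
rhs = np.linalg.inv(np.kron(Sigma_z, I_S)) - np.linalg.inv(
    np.kron(Sigma_z, I_S) + np.kron(Sigma_z @ Sigma_z, K) / lam
)
print(np.max(np.abs(lhs - rhs)))               # ~1e-14: the identity holds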

Then, we have the following lemma:

Lemma 19

If 𝐀,𝐁\mathbf{A},\mathbf{B} are symmetric, positive definite real matrices and 𝐀𝐁\mathbf{A}-\mathbf{B} is positive semi-definite, then 𝐁1𝐀1\mathbf{B}^{-1}-\mathbf{A}^{-1} is also positive semi-definite.

We leave the proof to Section E.6. Let \mathbf{M}=\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{I}_{S}+\lambda^{-1}\widehat{\boldsymbol{\Sigma}}^{2}_{\mathbf{z}}\otimes\mathbf{K} and \mathbf{N}=\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{I}_{S}; then both \mathbf{M} and \mathbf{N} are positive definite and \mathbf{M}-\mathbf{N} is positive semi-definite. By Lemma 19, \mathbf{N}^{-1}-\mathbf{M}^{-1} is also positive semi-definite, and thus (60) holds.
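A quick randomized sanity check of Lemma 19 is sketched below (Python/NumPy; the matrices are synthetic and illustrative only).

# Sketch: if A, B are symmetric positive definite and A - B is PSD,
# then B^{-1} - A^{-1} is PSD (Lemma 19), checked on random instances.
import numpy as np

rng = np.random.default_rng(4)
n = 5
for _ in range(100):
    R = rng.standard_normal((n, n))
    B = R @ R.T + np.eye(n)                    # positive definite
    P = rng.standard_normal((n, n))
    A = B + P @ P.T                            # A - B is positive semi-definite
    diff = np.linalg.inv(B) - np.linalg.inv(A)
    assert np.min(np.linalg.eigvalsh(diff)) > -1e-10   # PSD up to rounding error
print("Lemma 19 holds on all random instances")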

Using the result in (60), we eventually have J3OP(CgλT)J_{3}\leq O_{P}(\sqrt{C_{g}\lambda T}). Combining all the bounds for J1,J2,J3J_{1},J_{2},J_{3}, we end up with:

1TS(𝐳~𝐊)(𝜸^𝜸)F\displaystyle\frac{1}{\sqrt{TS}}\|(\widetilde{\mathbf{z}}\otimes\mathbf{K})(\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})\|_{\mathrm{F}} OP(γS1/2r0TS4)+OP(γS)\displaystyle\leq O_{P}\left(\frac{\sqrt{\gamma_{S}^{-1/{2r_{0}}}}}{\sqrt{T}\sqrt[4]{S}}\right)+O_{P}(\sqrt{\gamma_{S}})
+OP(1S)+OP(γS1TS),\displaystyle+O_{P}\left(\frac{1}{\sqrt{S}}\right)+O_{P}\left(\frac{\sqrt{\gamma_{S}^{-1}}}{\sqrt{TS}}\right),

where we drop the term O_{P}(T^{-1/2}) since it is dominated by O_{P}(S^{-1/2}) under the condition that S\log S/T\rightarrow 0.  

D.2 Proof of Proposition 15

Proof  The MARAC estimator \widehat{\boldsymbol{\phi}} is the minimizer of \ell_{\lambda}(\boldsymbol{\phi}), defined in (52), over all \boldsymbol{\phi}\in\mathbb{F}_{\phi}, and thus \ell_{\lambda}(\widehat{\boldsymbol{\phi}})\leq\ell_{\lambda}(\boldsymbol{\phi}^{*}). Equivalently, this means that:

12(ϕ^ϕ)(𝐗~𝐖𝐗~T)(ϕ^ϕ)1T[(𝐳~𝐊)𝜸+]𝐖𝐗~(ϕ^ϕ).\frac{1}{2}\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)^{\top}\left(\frac{\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}}{T}\right)\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)\leq\frac{1}{T}\left[\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}+\mathbfcal{E}\right]^{\top}\mathbf{W}\widetilde{\mathbf{X}}\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right).

Let 𝜹=𝐖1/2𝐗~(ϕ^ϕ)/T\boldsymbol{\delta}=\mathbf{W}^{1/2}\widetilde{\mathbf{X}}(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*})/\sqrt{T} and 𝝎=𝐖1/2[(𝐳~𝐊)𝜸+]/T\boldsymbol{\omega}=\mathbf{W}^{1/2}\left[\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}+\mathbfcal{E}\right]/\sqrt{T}, then the inequality can be simply written as 𝜹𝜹2𝜹𝝎\boldsymbol{\delta}^{\top}\boldsymbol{\delta}\leq 2\boldsymbol{\delta}^{\top}\boldsymbol{\omega}, and we can upper bound our quantity of interest, namely 𝜹𝜹\boldsymbol{\delta}^{\top}\boldsymbol{\delta}, as:

𝜹𝜹2(𝜹𝝎)(𝜹𝝎)+2𝝎𝝎4𝝎𝝎.\boldsymbol{\delta}^{\top}\boldsymbol{\delta}\leq 2(\boldsymbol{\delta}-\boldsymbol{\omega})^{\top}(\boldsymbol{\delta}-\boldsymbol{\omega})+2\boldsymbol{\omega}^{\top}\boldsymbol{\omega}\leq 4\boldsymbol{\omega}^{\top}\boldsymbol{\omega}.

Therefore, the bound of 𝜹F2\|\boldsymbol{\delta}\|_{\mathrm{F}}^{2} can be obtained via the bound of 𝝎F2\|\boldsymbol{\omega}\|_{\mathrm{F}}^{2}. We have the following upper bound for 𝝎F2\|\boldsymbol{\omega}\|_{\mathrm{F}}^{2}:

𝜹F24𝝎F2\displaystyle\|\boldsymbol{\delta}\|_{\mathrm{F}}^{2}\leq 4\|\boldsymbol{\omega}\|_{\mathrm{F}}^{2} =4T[(𝐳~𝐊)𝜸+]𝐖[(𝐳~𝐊)𝜸+]\displaystyle=\frac{4}{T}\left[\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}+\mathbfcal{E}\right]^{\top}\mathbf{W}\left[\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}+\mathbfcal{E}\right]
8T[𝐖1/2(𝐳~𝐊)𝜸F2I1+𝐖1/2I2],\displaystyle\leq\frac{8}{T}\left[\underbrace{\|\mathbf{W}^{1/2}\left(\widetilde{\mathbf{z}}\otimes\mathbf{K}\right)\boldsymbol{\gamma}^{*}\|_{\mathrm{F}}^{2}}_{\text{\makebox[0.0pt]{$I_{1}$}}}+\underbrace{\|\mathbf{W}^{1/2}\mathbfcal{E}\|_{\mathrm{F}}^{2}}_{\text{\makebox[0.0pt]{$I_{2}$}}}\right], (61)

where the last inequality follows from the fact that 𝐖\mathbf{W} is positive semi-definite.

For I1I_{1}, it can be bounded by (60) and thus I1CgλTI_{1}\leq C_{g}\lambda T. To bound I2I_{2}, we utilize Lemma 18 and bound I2I_{2} as I2tr(𝐖)SDI_{2}\asymp\text{tr}\left(\mathbf{W}\right)\leq SD. Combining the bounds for I1I_{1} and I2I_{2}, we have:

𝜹F2=(ϕ^ϕ)(𝐗~𝐖𝐗~T)(ϕ^ϕ)OP(Cgλ)+OP(SD/T),\|\boldsymbol{\delta}\|_{\mathrm{F}}^{2}=\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)^{\top}\left(\frac{\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}}{T}\right)\left(\widehat{\boldsymbol{\phi}}-\boldsymbol{\phi}^{*}\right)\lesssim O_{P}(C_{g}\lambda)+O_{P}(SD/T),

which completes the proof.  

Appendix E Technical Lemmas & Proofs

In this section, we first introduce Lemma 20 on the consistency of the covariance matrix estimator for a stationary vector autoregressive process, and then Corollary 21 on the consistency of the covariance estimator of our MARAC model under the joint stationarity condition. We then provide the proof of Lemma 14, used in Section C.2 when proving Theorem 9 on asymptotic normality under fixed spatial dimension, followed by the proofs of Lemmas 16, 17, 18, and 19, used in Section D when proving the error bounds under high spatial dimensionality.

E.1 Statement of Lemma 20

In Lemma 20, we restate the results of Propositions 6 and 7 of Li and Xiao (2021), which cover the consistency of the estimator of the lag-0 auto-covariance matrix of a stationary VAR(p) process.

Lemma 20

Let \mathbf{x}_{t}\in\mathbb{R}^{N} be a zero-mean stationary VAR(p) process: \mathbf{x}_{t}=\sum_{l=1}^{p}\boldsymbol{\Phi}_{l}\mathbf{x}_{t-l}+\boldsymbol{\xi}_{t}, where \boldsymbol{\xi}_{t} has independent sub-Gaussian entries. Let \widehat{\boldsymbol{\Sigma}}=(1/T)\sum_{t=1}^{T}\mathbf{x}_{t}\mathbf{x}_{t}^{\top} and \boldsymbol{\Sigma}=\mathrm{E}[\widehat{\boldsymbol{\Sigma}}], then we have:

E𝚺^𝚺sC(NlogNT+NlogNT)𝚺s,\mathrm{E}\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}\|_{s}\leq C\left(\sqrt{\frac{N\log N}{T}}+\frac{N\log N}{T}\right)\|\boldsymbol{\Sigma}\|_{s}, (62)

where CC is an absolute constant.

We refer our readers to Appendix C.3 of Li and Xiao (2021) for the proof. As a corollary of Lemma 20, we have the following results:

Corollary 21

Assume that {𝐳t}t=1T\{\mathbf{z}_{t}\}_{t=1}^{T} is generated by a stationary VAR(Q~)(\widetilde{Q}) process: 𝐳t=q~=1Q~𝐂q~𝐳tq~+𝛎t\mathbf{z}_{t}=\sum_{\widetilde{q}=1}^{\widetilde{Q}}\mathbf{C}_{\widetilde{q}}\mathbf{z}_{t-\widetilde{q}}+\boldsymbol{\nu}_{t}, with 𝛎t\boldsymbol{\nu}_{t} having independent sub-Gaussian entries, then with 𝚺^𝐳=(1/T)t=1T𝐳t𝐳t\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}=(1/T)\sum_{t=1}^{T}\mathbf{z}_{t}\mathbf{z}_{t}^{\top} and 𝚺𝐳=E[𝚺^𝐳]\boldsymbol{\Sigma}_{\mathbf{z}}^{*}=\mathrm{E}[\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}], we have:

P(𝚺^𝐳𝚺𝐳sϵ)Cϵ1(DT+DT),\mathrm{P}\left(\left\|\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}-\boldsymbol{\Sigma}_{\mathbf{z}}^{*}\right\|_{s}\geq\epsilon\right)\leq C\epsilon^{-1}\left(\sqrt{\frac{D}{T}}+\frac{D}{T}\right), (63)

with CC being an absolute constant and ϵ\epsilon being a fixed positive real number, and thus 𝚺^𝐳𝚺𝐳sp.0\left\|\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}-\boldsymbol{\Sigma}_{\mathbf{z}}^{*}\right\|_{s}\overset{p.}{\rightarrow}0.

Let \{\mathbf{X}_{t}\}_{t=1}^{T} be a zero-mean matrix time series generated by the MARAC model with lags P,Q, where \{\mathbf{z}_{t}\}_{t=1}^{T} satisfies the assumption above and \{\mathbf{X}_{t},\mathbf{z}_{t}\}_{t=1}^{T} are jointly stationary in the sense of Theorem 6. Assume further that \mathbf{E}_{t} has i.i.d. Gaussian entries with constant variance \sigma^{2}. Then, for \mathbf{y}_{t}=[\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}, \widehat{\boldsymbol{\Sigma}}_{0}=(1/T)\sum_{t=1}^{T}\mathbf{y}_{t}\mathbf{y}_{t}^{\top} and \boldsymbol{\Sigma}_{0}^{*}=\mathrm{E}[\mathbf{y}_{t}\mathbf{y}_{t}^{\top}], we have:

E𝚺^0𝚺0sC(SlogST+SlogST)𝚺0s,\mathrm{E}\left\|\widehat{\boldsymbol{\Sigma}}_{0}-\boldsymbol{\Sigma}_{0}^{*}\right\|_{s}\leq C\left(\sqrt{\frac{S\log S}{T}}+\frac{S\log S}{T}\right)\|\boldsymbol{\Sigma}_{0}^{*}\|_{s}, (64)

where CC is an absolute constant.

Proof  The proof of (63) follows directly from Lemma 20 together with Markov's inequality. The proof of (64) also follows from Lemma 20, since \{\mathbf{y}_{t}\}_{t=1}^{T} follows a stationary VAR(\max(P,Q,\widetilde{Q})) process with i.i.d. sub-Gaussian noise (see (29)) and \mathrm{E}[(1/T)\sum_{t=1}^{T}\mathbf{y}_{t}\mathbf{y}_{t}^{\top}]=\mathrm{E}[\mathbf{y}_{t}\mathbf{y}_{t}^{\top}] due to stationarity.  
Note that convergence of the covariance estimator in the spectral norm also implies that each of its elements converges in probability. Also, the assumption that \mathbf{E}_{t} has i.i.d. Gaussian entries can be relaxed to \mathbf{E}_{t} having independent sub-Gaussian entries.
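The consistency in Lemma 20 and Corollary 21 can be illustrated with a small simulation. The sketch below (Python/NumPy; the VAR(1) coefficient, dimension, and sample sizes are illustrative assumptions) shows the spectral-norm error of the sample lag-0 covariance shrinking as T grows.

# Sketch: for a stable VAR(1), the sample lag-0 covariance converges to the true
# covariance in spectral norm as T grows (cf. Lemma 20 and Corollary 21).
import numpy as np

rng = np.random.default_rng(5)
N = 6
Phi = 0.4 * np.eye(N) + 0.05 * rng.standard_normal((N, N))   # stable coefficient matrix

# True lag-0 covariance solves the discrete Lyapunov equation Sigma = Phi Sigma Phi^T + I.
Sigma = np.eye(N)
for _ in range(500):
    Sigma = Phi @ Sigma @ Phi.T + np.eye(N)

for T in (200, 2000, 20000):
    x = np.zeros(N)
    S_hat = np.zeros((N, N))
    for t in range(T):
        x = Phi @ x + rng.standard_normal(N)                 # simulate the VAR(1)
        S_hat += np.outer(x, x)
    S_hat /= T
    print(T, round(np.linalg.norm(S_hat - Sigma, 2), 3))     # error decreases in T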

E.2 Proof of Lemma 14

Proof  Without loss of generality, we fix P,QP,Q as 1 and use the same notation as (31) in Section C.1, so the MARAC model can be written as 𝐱t=𝐲t𝜽+𝐞t\mathbf{x}_{t}=\mathbf{y}_{t}\boldsymbol{\theta}^{*}+\mathbf{e}_{t}. Correspondingly, the penalized log-likelihood h(𝜽,𝛀)h(\boldsymbol{\theta},\boldsymbol{\Omega}) is specified by (32) and given any 𝛀¯\bar{\boldsymbol{\Omega}}, we have 𝜽~(𝛀¯)=argmin𝜽h(𝜽,𝛀¯)\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})=\operatorname*{arg\,min}_{\boldsymbol{\theta}}h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}}) as specified by (34). Given the decomposition of 𝜽~(𝛀¯)\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}) in (35), we have:

𝜽~(𝛀¯)𝜽=λ𝐊~𝜽+(t𝐲t𝛀¯𝐲tT+λ𝐊~)1(t𝐲t𝛀¯𝐞tT),\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})-\boldsymbol{\theta}^{*}=-\lambda\widetilde{\mathbf{K}}\boldsymbol{\theta}^{*}+\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)^{-1}\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}}{T}\right),

where \|\lambda\widetilde{\mathbf{K}}\boldsymbol{\theta}^{*}\|_{\mathrm{F}}=o(T^{-1/2}) since \lambda=o(T^{-1/2}). To show that the norm of the second term is O_{P}(T^{-1/2}), we first observe that:

(t𝐲t𝛀¯𝐲tT+λ𝐊~)1(t𝐲t𝛀¯𝐞tT)F\displaystyle\left\|\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)^{-1}\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}}{T}\right)\right\|_{\mathrm{F}}
(t𝐲t𝛀¯𝐲tT+λ𝐊~)1𝐋T1F(t𝐲t𝛀¯𝐞tT)𝐑TF.\displaystyle\leq\left\|\underbrace{\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)^{-1}}_{\text{\makebox[0.0pt]{$\mathbf{L}_{T}^{-1}$}}}\right\|_{\mathrm{F}}\cdot\left\|\underbrace{\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}}{T}\right)}_{\text{\makebox[0.0pt]{$\mathbf{R}_{T}$}}}\right\|_{\mathrm{F}}.

For the sequence of random matrices {𝐋T}T=1\{\mathbf{L}_{T}\}_{T=1}^{\infty}, we have:

𝐋T=t𝐲t𝛀¯𝐲tT+λ𝐊~p.[Cov(𝐱t,𝐱t)𝛀¯Cov(𝐱t,𝐳t)𝛀¯𝐊Cov(𝐳t,𝐱t)𝐊𝛀¯Cov(𝐳t,𝐳t)𝐊𝛀¯𝐊],\mathbf{L}_{T}=\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\overset{p.}{\rightarrow}\begin{bmatrix}\mathrm{Cov}(\mathbf{x}_{t},\mathbf{x}_{t})\otimes\bar{\boldsymbol{\Omega}}&\mathrm{Cov}(\mathbf{x}_{t},\mathbf{z}_{t})\otimes\bar{\boldsymbol{\Omega}}\mathbf{K}\\ \mathrm{Cov}(\mathbf{z}_{t},\mathbf{x}_{t})\otimes\mathbf{K}\bar{\boldsymbol{\Omega}}&\mathrm{Cov}(\mathbf{z}_{t},\mathbf{z}_{t})\otimes\mathbf{K}\bar{\boldsymbol{\Omega}}\mathbf{K}\end{bmatrix},

and we define the limiting matrix as \mathbf{L}. To show this convergence, first note that the covariance estimator \widehat{\mathrm{Var}}([\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top})=T^{-1}\sum_{t}[\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}[\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}] converges in probability to the true covariance \mathrm{Var}([\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}), which we prove separately in Corollary 21. Secondly, since \lambda=o(T^{-1/2}), we have \lambda\widetilde{\mathbf{K}}\rightarrow\mathbf{O}, and the convergence in probability of \mathbf{L}_{T} to \mathbf{L} follows.

Notice that the limiting matrix 𝐋\mathbf{L} is invertible because the matrix 𝐋\mathbf{L}^{\prime}, defined as:

𝐋=[𝐈𝐊𝐎𝐎𝐈]𝐋[𝐈𝐊𝐎𝐎𝐈]=Var([𝐱t,𝐳t])(𝐊𝛀¯𝐊),\mathbf{L}^{\prime}=\begin{bmatrix}\mathbf{I}\otimes\mathbf{K}&\mathbf{O}\\ \mathbf{O}&\mathbf{I}\end{bmatrix}\mathbf{L}\begin{bmatrix}\mathbf{I}\otimes\mathbf{K}&\mathbf{O}\\ \mathbf{O}&\mathbf{I}\end{bmatrix}=\mathrm{Var}([\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top})\otimes(\mathbf{K}\bar{\boldsymbol{\Omega}}\mathbf{K}),

is invertible. To see why, first note that \mathrm{Var}([\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}) is invertible because we can express [\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top} as \sum_{j=0}^{\infty}\boldsymbol{\Phi}_{j}[\mathbf{e}_{t-j}^{\top},\boldsymbol{\nu}_{t-j}^{\top}]^{\top}, where \{\boldsymbol{\Phi}_{j}\}_{j=0}^{\infty} is a sequence of matrices whose elements are absolutely summable and \boldsymbol{\Phi}_{0}=\mathbf{I}; therefore, we have \underaccent{\bar}{\rho}(\mathrm{Var}([\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}))\geq\underaccent{\bar}{\rho}(\mathrm{Var}([\mathbf{e}_{t}^{\top},\boldsymbol{\nu}_{t}^{\top}]^{\top}))>0. Secondly, by Assumption 7, we have \underaccent{\bar}{\rho}(\mathbf{K})>0, and \underaccent{\bar}{\rho}(\bar{\boldsymbol{\Omega}})>0 by definition, so \mathbf{K}\bar{\boldsymbol{\Omega}}\mathbf{K} is positive definite. The invertibility of \mathbf{L} and the fact that \mathbf{L}_{T}\overset{p.}{\rightarrow}\mathbf{L} imply that \mathbf{L}_{T}^{-1}\overset{p.}{\rightarrow}\mathbf{L}^{-1}, since matrix inversion is a continuous mapping and convergence in probability is preserved under continuous transformations. It follows that \|\mathbf{L}_{T}^{-1}\|_{\mathrm{F}}=O_{P}(1).

For the sequence of random matrices \{\mathbf{R}_{T}\}_{T=1}^{\infty}, we note that \{\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{e}_{t}\}_{t=1}^{\infty} is a martingale difference sequence (MDS), so \|\mathbf{R}_{T}\|_{\mathrm{F}}=O_{P}(T^{-1/2}) (see Proposition 7.9 of Hamilton (2020) for the central limit theorem for martingale difference sequences). Combining the results on \|\mathbf{L}_{T}^{-1}\|_{\mathrm{F}} and \|\mathbf{R}_{T}\|_{\mathrm{F}}, we conclude that \|\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}=O_{P}(T^{-1/2}).

Fixing \boldsymbol{\Omega}=\bar{\boldsymbol{\Omega}}, we can expand h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}}) via a second-order Taylor expansion as follows:

h(𝜽,𝛀¯)\displaystyle h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}}) =h(𝜽~(𝛀¯),𝛀¯)+12(𝜽𝜽~(𝛀¯))(t𝐲t𝛀¯𝐲tT+λ𝐊~)(𝜽𝜽~(𝛀¯))\displaystyle=h(\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}),\bar{\boldsymbol{\Omega}})+\frac{1}{2}(\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}))^{\top}\left(\frac{\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}}{T}+\lambda\widetilde{\mathbf{K}}\right)(\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}))
h(𝜽~(𝛀¯),𝛀¯)+12ρ¯(𝐋T)𝜽𝜽~(𝛀¯)F2,\displaystyle\geq h(\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}),\bar{\boldsymbol{\Omega}})+\frac{1}{2}\underaccent{\bar}{\rho}(\mathbf{L}_{T})\|\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}^{2}, (65)

and recall that \mathbf{L}_{T}=T^{-1}\sum_{t}\mathbf{y}_{t}^{\top}\bar{\boldsymbol{\Omega}}\mathbf{y}_{t}+\lambda\widetilde{\mathbf{K}}. We have shown above that \mathbf{L}_{T}\overset{p.}{\rightarrow}\mathbf{L}, with \mathbf{L} a positive definite matrix. Therefore, with probability approaching 1, we have \underaccent{\bar}{\rho}(\mathbf{L}_{T})\geq\underaccent{\bar}{\rho}(\mathbf{L})/2>0.

With the lower bound on ρ¯(𝐋T)\underaccent{\bar}{\rho}(\mathbf{L}_{T}), we can claim that for some constant C1>0C_{1}>0:

inf𝛀¯𝔽𝛀:𝛀¯𝛀FC1h(𝜽,𝛀¯)\displaystyle\inf_{\bar{\boldsymbol{\Omega}}\in\mathbb{F}_{\boldsymbol{\Omega}}:\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\leq C_{1}}h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}})
inf𝛀¯𝔽𝛀:𝛀¯𝛀FC1{h(𝜽~(𝛀¯),𝛀¯)+14ρ¯(𝐋)𝜽𝜽~(𝛀¯)F2},\displaystyle\geq\inf_{\bar{\boldsymbol{\Omega}}\in\mathbb{F}_{\boldsymbol{\Omega}}:\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\leq C_{1}}\left\{h(\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}),\bar{\boldsymbol{\Omega}})+\frac{1}{4}\underaccent{\bar}{\rho}(\mathbf{L})\cdot\|\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}^{2}\right\}, (66)

with probability approaching 1. Now consider \boldsymbol{\theta} in the set \{\boldsymbol{\theta}\in\mathbb{F}_{\boldsymbol{\theta}}|\sqrt{T}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}\geq c_{T}\}, where c_{T} is an arbitrary sequence diverging to infinity. Within this set, we have:

𝜽𝜽~(𝛀¯)FcTT𝜽𝜽~(𝛀¯)F,\|\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}\geq\frac{c_{T}}{\sqrt{T}}-\|\boldsymbol{\theta}^{*}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}, (67)

thus \|\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}\gtrsim O_{P}(c^{\prime}_{T}/\sqrt{T}) for some sequence c^{\prime}_{T}\rightarrow\infty, since \|\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}=O_{P}(T^{-1/2}). By the Taylor expansion in (65), and again using \|\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}=O_{P}(T^{-1/2}), we conclude that h(\boldsymbol{\theta}^{*},\bar{\boldsymbol{\Omega}})=h(\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}}),\bar{\boldsymbol{\Omega}})+O_{P}(T^{-1}). Combining this result with the order of \|\boldsymbol{\theta}-\widetilde{\boldsymbol{\theta}}(\bar{\boldsymbol{\Omega}})\|_{\mathrm{F}}, (66) yields:

P(infT𝜽𝜽FcTinf𝛀¯𝔽𝛀:𝛀¯𝛀FC1h(𝜽,𝛀¯)>inf𝛀¯𝔽𝛀:𝛀¯𝛀FC1h(𝜽,𝛀¯))1.\mathrm{P}\left(\inf_{\sqrt{T}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}\geq c_{T}}\inf_{\bar{\boldsymbol{\Omega}}\in\mathbb{F}_{\boldsymbol{\Omega}}:\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\leq C_{1}}h(\boldsymbol{\theta},\bar{\boldsymbol{\Omega}})>\inf_{\bar{\boldsymbol{\Omega}}\in\mathbb{F}_{\boldsymbol{\Omega}}:\|\bar{\boldsymbol{\Omega}}-\boldsymbol{\Omega}^{*}\|_{\mathrm{F}}\leq C_{1}}h(\boldsymbol{\theta}^{*},\bar{\boldsymbol{\Omega}})\right)\rightarrow 1. (68)

The result in (68) indicates that, for any \boldsymbol{\theta} outside the set \{\boldsymbol{\theta}\in\mathbb{F}_{\boldsymbol{\theta}}|\sqrt{T}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}<c_{T}\}, the penalized log-likelihood exceeds that at \boldsymbol{\theta}^{*} with probability approaching 1. Therefore, with probability approaching 1, the minimizer must satisfy \sqrt{T}\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}\leq c_{T}. Since the choice of c_{T} is arbitrary, we conclude that \|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{\mathrm{F}}=O_{P}(T^{-1/2}), and thus each block of \widehat{\boldsymbol{\theta}}, namely \widehat{\mathbf{A}}_{p},\widehat{\mathbf{B}}_{p},\widehat{\boldsymbol{\gamma}}_{q}, converges to its ground-truth value at the rate T^{-1/2}.

The convergence rate of \widehat{\mathbf{B}}_{p}\otimes\widehat{\mathbf{A}}_{p} can be derived from the following inequality:

𝐁^p𝐀^p𝐁p𝐀pF𝐁^pF𝐀^p𝐀pF+𝐁^p𝐁pF𝐀pF,\|\widehat{\mathbf{B}}_{p}\otimes\widehat{\mathbf{A}}_{p}-\mathbf{B}_{p}^{*}\otimes\mathbf{A}_{p}^{*}\|_{\mathrm{F}}\leq\|\widehat{\mathbf{B}}_{p}\|_{\mathrm{F}}\cdot\|\widehat{\mathbf{A}}_{p}-\mathbf{A}_{p}^{*}\|_{\mathrm{F}}+\|\widehat{\mathbf{B}}_{p}-\mathbf{B}_{p}^{*}\|_{\mathrm{F}}\cdot\|\mathbf{A}_{p}^{*}\|_{\mathrm{F}},

as well as the convergence rate of 𝐀^p\widehat{\mathbf{A}}_{p} and 𝐁^p\widehat{\mathbf{B}}_{p}.

 

E.3 Proof of Lemma 16

Proof  Based on the definition of 𝐖\mathbf{W} in equation (53), we have

𝐗~𝐖𝐗~T\displaystyle\frac{\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}}{T} =𝚺^𝐱,𝐱𝐈S(𝚺^𝐳,𝐱𝐊)(𝚺^𝐳,𝐳𝐊+λ𝐈SD)1(𝚺^𝐳,𝐱𝐈S)\displaystyle=\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}}\otimes\mathbf{I}_{S}-\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}^{\top}\otimes\mathbf{K}\right)\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}\otimes\mathbf{K}+\lambda\mathbf{I}_{SD}\right)^{-1}\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}\otimes\mathbf{I}_{S}\right)
=(𝚺^𝐱,𝐱𝚺^𝐳,𝐱𝚺^𝐳,𝐳1𝚺^𝐳,𝐱)𝐈S\displaystyle=\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}^{\top}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{-1}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}\right)\otimes\mathbf{I}_{S}
+(𝚺^𝐳,𝐱𝐈S)[𝚺^𝐳,𝐳2λ1𝐊+𝚺^𝐳,𝐳𝐈S]1(𝚺^𝐳,𝐱𝐈S),\displaystyle+\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}\otimes\mathbf{I}_{S}\right)^{\top}\left[\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{2}\otimes\lambda^{-1}\mathbf{K}+\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}\otimes\mathbf{I}_{S}\right]^{-1}\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}\otimes\mathbf{I}_{S}\right), (69)

where the second term in (69) is positive semi-definite since both ρ¯(𝚺^𝐳,𝐳)\underaccent{\bar}{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}) and ρ¯(𝐊)\underaccent{\bar}{\rho}(\mathbf{K}) are non-negative and the whole term is symmetric. Therefore, by Weyl’s inequality, one can lower bound ρ¯(𝐗~𝐖𝐗~/T)\underaccent{\bar}{\rho}(\widetilde{\mathbf{X}}^{\top}\mathbf{W}\widetilde{\mathbf{X}}/T) by ρ¯(𝚺^𝐱,𝐱𝚺^𝐳,𝐱𝚺^𝐳,𝐳1𝚺^𝐳,𝐱)\underaccent{\bar}{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}^{\top}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{-1}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}). For simplicity, we will use 𝐀,𝐁,𝐂\mathbf{A},\mathbf{B},\mathbf{C} to denote 𝚺𝐱,𝐱,𝚺𝐳,𝐱,(𝚺𝐳,𝐳)1\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*},\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*},(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*})^{-1}, and 𝐀^,𝐁^,𝐂^\widehat{\mathbf{A}},\widehat{\mathbf{B}},\widehat{\mathbf{C}} to denote 𝚺^𝐱,𝐱,𝚺^𝐳,𝐱,𝚺^𝐳,𝐳1\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}},\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}},\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{-1}, respectively. We will use 𝚺^\widehat{\boldsymbol{\Sigma}} and 𝚺\boldsymbol{\Sigma}^{*} to denote the estimated and true covariance matrix of [𝐱t,𝐳t][\mathbf{x}_{t}^{\top},\mathbf{z}_{t}^{\top}]^{\top}. It is evident that 𝐀s𝚺s\|\mathbf{A}\|_{s}\leq\|\boldsymbol{\Sigma}^{*}\|_{s} and 𝐁s𝚺s\|\mathbf{B}\|_{s}\leq\|\boldsymbol{\Sigma}^{*}\|_{s}, since both 𝐀\mathbf{A} and 𝐁\mathbf{B} are blocks of 𝚺\boldsymbol{\Sigma}^{*} and can thus be represented as 𝐄1𝚺𝐄2\mathbf{E}_{1}^{\top}\boldsymbol{\Sigma}^{*}\mathbf{E}_{2} with 𝐄1,𝐄2\mathbf{E}_{1},\mathbf{E}_{2} being two block matrices with unity spectral norm.

The rest of the proof focuses on showing that with SlogS/T0S\log S/T\rightarrow 0, ρ¯(𝚺^𝐱,𝐱𝚺^𝐳,𝐱𝚺^𝐳,𝐳1𝚺^𝐳,𝐱)p.ρ¯(𝚺𝐱,𝐱(𝚺𝐳,𝐱)(𝚺𝐳,𝐳)1𝚺𝐳,𝐱)\underaccent{\bar}{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}^{\top}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{-1}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}})\overset{p.}{\rightarrow}\underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}-\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}\right)^{\top}\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*}\right)^{-1}\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}). For brevity, we omit the subscript ss for the spectral norm notation and simply use \|\cdot\| in this proof.

To start with, we have:

𝐀^𝐁^𝐂^𝐁^(𝐀𝐁𝐂𝐁)\displaystyle\|\widehat{\mathbf{A}}-\widehat{\mathbf{B}}^{\top}\widehat{\mathbf{C}}\widehat{\mathbf{B}}-(\mathbf{A}-\mathbf{B}^{\top}\mathbf{C}\mathbf{B})\|
𝐀^𝐀+𝐁^𝐂^𝐁^𝐁𝐂^𝐁+𝐁𝐂^𝐁𝐁𝐂𝐁\displaystyle\leq\|\widehat{\mathbf{A}}-\mathbf{A}\|+\|\widehat{\mathbf{B}}^{\top}\widehat{\mathbf{C}}\widehat{\mathbf{B}}-\mathbf{B}^{\top}\widehat{\mathbf{C}}\mathbf{B}\|+\|\mathbf{B}^{\top}\widehat{\mathbf{C}}\mathbf{B}-\mathbf{B}^{\top}\mathbf{C}\mathbf{B}\|
𝚺^𝚺+(𝐁^𝐁)𝐂^𝐁^+𝐁𝐂(𝐁^𝐁)+𝐁(𝐂^𝐂)𝐁^\displaystyle\leq\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}\|+\|(\widehat{\mathbf{B}}-\mathbf{B})^{\top}\widehat{\mathbf{C}}\widehat{\mathbf{B}}\|+\|\mathbf{B}^{\top}\mathbf{C}(\widehat{\mathbf{B}}-\mathbf{B})\|+\|\mathbf{B}^{\top}(\widehat{\mathbf{C}}-\mathbf{C})\widehat{\mathbf{B}}\|
𝚺^𝚺+𝐁^𝐁(𝐂^𝐁^+𝐂𝐁)\displaystyle\leq\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}\|+\|\widehat{\mathbf{B}}-\mathbf{B}\|\cdot\left(\|\widehat{\mathbf{C}}\|\cdot\|\widehat{\mathbf{B}}\|+\|\mathbf{C}\|\cdot\|\mathbf{B}\|\right)
+𝐁𝐁^𝐂^𝐂.\displaystyle+\|\mathbf{B}\|\cdot\|\widehat{\mathbf{B}}\|\cdot\|\widehat{\mathbf{C}}-\mathbf{C}\|. (70)

Based on Corollary 21, under the condition that SlogS/T0S\log S/T\rightarrow 0 and the conditions that 𝐳t\mathbf{z}_{t} follows a stationary VAR(Q~)(\widetilde{Q}) process and is jointly stationary with 𝐱t\mathbf{x}_{t}, we have 𝐂^𝐂p.0\|\widehat{\mathbf{C}}-\mathbf{C}\|\overset{p.}{\rightarrow}0 and 𝚺^𝚺p.0\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}^{*}\|\overset{p.}{\rightarrow}0. Therefore, with probability approaching 11, we have 𝐂^2𝐂\|\widehat{\mathbf{C}}\|\leq 2\|\mathbf{C}\|, 𝐁^𝐁𝚺^𝚺2𝚺\|\widehat{\mathbf{B}}-\mathbf{B}\|\leq\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}^{*}\|\leq 2\|\boldsymbol{\Sigma}^{*}\| and 𝐁^3𝚺\|\widehat{\mathbf{B}}\|\leq 3\|\boldsymbol{\Sigma}^{*}\|.

Combining these results and the upper bound in (70), with probability approaching 11, we have:

𝐀^𝐁^𝐂^𝐁^(𝐀𝐁𝐂𝐁)\displaystyle\|\widehat{\mathbf{A}}-\widehat{\mathbf{B}}^{\top}\widehat{\mathbf{C}}\widehat{\mathbf{B}}-(\mathbf{A}-\mathbf{B}^{\top}\mathbf{C}\mathbf{B})\| (1+7𝐂𝚺)𝚺^𝚺\displaystyle\leq\left(1+7\|\mathbf{C}\|\cdot\|\boldsymbol{\Sigma}^{*}\|\right)\cdot\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}^{*}\|
+3𝚺2𝐂^𝐂.\displaystyle+3\|\boldsymbol{\Sigma}^{*}\|^{2}\cdot\|\widehat{\mathbf{C}}-\mathbf{C}\|. (71)

The upper bound in (71) can be arbitrarily small as S,TS,T\rightarrow\infty since 𝐂^𝐂p.0\|\widehat{\mathbf{C}}-\mathbf{C}\|\overset{p.}{\rightarrow}0 and 𝚺^𝚺p.0\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}^{*}\|\overset{p.}{\rightarrow}0.

Eventually, with probability approaching 11, we have:

ρ¯(𝚺^𝐱,𝐱𝚺^𝐳,𝐱𝚺^𝐳,𝐳1𝚺^𝐳,𝐱)12ρ¯(𝚺𝐱,𝐱(𝚺𝐳,𝐱)(𝚺𝐳,𝐳)1𝚺𝐳,𝐱)=c0,S2.\underaccent{\bar}{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{x},\mathbf{x}}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}}^{\top}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{z}}^{-1}\widehat{\boldsymbol{\Sigma}}_{\mathbf{z},\mathbf{x}})\geq\frac{1}{2}\underaccent{\bar}{\rho}\left(\boldsymbol{\Sigma}_{\mathbf{x},\mathbf{x}}^{*}-\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}\right)^{\top}\left(\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{z}}^{*}\right)^{-1}\boldsymbol{\Sigma}_{\mathbf{z},\mathbf{x}}^{*}\right)=\frac{c_{0,S}}{2}. (72)

This completes the proof.

 

E.4 Proof of Lemma 17

Proof  By the definition of 𝐖\mathbf{W} in (53), we have:

tr(𝐈𝐖)\displaystyle\text{tr}\left(\mathbf{I}-\mathbf{W}\right) =tr[(𝚺^𝐳𝐊+λ𝐈SD)1(𝚺^𝐳𝐊)]\displaystyle=\mathrm{tr}\left[\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}+\lambda\mathbf{I}_{SD}\right)^{-1}\left(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K}\right)\right]
=s=1Sd=1Dρd(𝚺^𝐳)ρs(𝐊)λ+ρd(𝚺^𝐳)ρs(𝐊)Ds=1S11+λρ¯(𝚺^𝐳)1ρs(𝐊)1.\displaystyle=\sum_{s=1}^{S}\sum_{d=1}^{D}\frac{\rho_{d}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\rho_{s}(\mathbf{K})}{\lambda+\rho_{d}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\rho_{s}(\mathbf{K})}\leq D\cdot\sum_{s=1}^{S}\frac{1}{1+\lambda\bar{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})^{-1}\rho_{s}(\mathbf{K})^{-1}}. (73)

Using Lemma 20, we can bound \bar{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}) by 2\bar{\rho}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*}) with probability approaching 1 as T\rightarrow\infty. Conditioning on this high-probability event and using Assumption 11 that the kernel function is separable, the kernel Gram matrix \mathbf{K} can be written as \mathbf{K}_{2}\otimes\mathbf{K}_{1}, and thus (73) can be bounded as:

Ds=1S11+λρ¯(𝚺^𝐳)1ρs(𝐊)1Di=1Mj=1N11+c𝐳λρi(𝐊1)1ρj(𝐊2)1,D\cdot\sum_{s=1}^{S}\frac{1}{1+\lambda\bar{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})^{-1}\rho_{s}(\mathbf{K})^{-1}}\leq D\cdot\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{\mathbf{z}}\lambda\rho_{i}(\mathbf{K}_{1})^{-1}\rho_{j}(\mathbf{K}_{2})^{-1}}, (74)

where c_{\mathbf{z}}=1/\left(2\bar{\rho}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*})\right). As M,N\rightarrow\infty, based on Assumption 10, we have \rho_{i}(\mathbf{K}_{1})\rightarrow Mi^{-r_{0}} and \rho_{j}(\mathbf{K}_{2})\rightarrow Nj^{-r_{0}}. Consequently, we can find two constants 0<c_{1}<c_{2}, with c_{1} sufficiently small and c_{2} sufficiently large, such that:

\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{2}\lambda(ij)^{r_{0}}/S} \leq\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{\mathbf{z}}\lambda\rho_{i}(\mathbf{K}_{1})^{-1}\rho_{j}(\mathbf{K}_{2})^{-1}}
\leq\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{1}\lambda(ij)^{r_{0}}/S}, (75)

where, with a slight abuse of notation, we incorporate c_{\mathbf{z}} into c_{1},c_{2}. To estimate the order of the lower and upper bounds in (75), we first notice that for any constant c>0, one has:

\sum_{i=1}^{M\wedge N}\frac{1}{1+c\lambda i^{2r_{0}}/S}\leq\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c\lambda(ij)^{r_{0}}/S}\leq 2(M\vee N)\sum_{i=1}^{M\vee N}\frac{1}{1+c\lambda i^{2r_{0}}/S}. (76)

To approximate the sum in (76), notice that:

\sum_{i=1}^{M\vee N}\frac{1}{1+c\lambda i^{2r_{0}}/S}=(S/c\lambda)^{1/2r_{0}}\cdot\sum_{i=1}^{M\vee N}\frac{1}{1+[\frac{i}{(S/c\lambda)^{1/2r_{0}}}]^{2r_{0}}}\cdot\frac{1}{(S/c\lambda)^{1/2r_{0}}},

and furthermore, we have:

\lim_{S\rightarrow\infty}\sum_{i=1}^{M\vee N}\frac{1}{1+[\frac{i}{(S/c\lambda)^{1/2r_{0}}}]^{2r_{0}}}\cdot\frac{1}{(S/c\lambda)^{1/2r_{0}}}=\int_{0}^{C}\frac{1}{1+x^{2r_{0}}}\mathrm{d}x<\infty,

where C=\lim_{S\rightarrow\infty}c(M\vee N)^{2r_{0}}\cdot\gamma_{S}. Under the assumptions of Theorem 12, M\vee N=O(\sqrt{S}) and \gamma_{S}\cdot S^{r_{0}}\rightarrow C_{1} with 0<C_{1}\leq\infty. As a result, C is either a finite constant or infinity, and thus:

\lim_{S\rightarrow\infty}\sum_{i=1}^{M\vee N}\frac{1}{1+c\lambda i^{2r_{0}}/S}=\int_{0}^{C}\frac{1}{1+x^{2r_{0}}}\mathrm{d}x\cdot\lim_{S\rightarrow\infty}(S/c\lambda)^{1/2r_{0}}=O(\gamma_{S}^{-1/2r_{0}}). (77)

Combining (73), (74), (75) and (77), we have \text{tr}\left(\mathbf{I}-\mathbf{W}\right)\lesssim O_{P}((M\vee N)\gamma_{S}^{-1/2r_{0}})=O_{P}(\sqrt{S}\gamma_{S}^{-1/2r_{0}}). To obtain the lower bound of \text{tr}\left(\mathbf{I}-\mathbf{W}\right), note that:

\text{tr}\left(\mathbf{I}-\mathbf{W}\right)\geq D\cdot\sum_{s=1}^{S}\frac{1}{1+\lambda c_{\mathbf{z}}^{\prime}\rho_{s}(\mathbf{K})^{-1}}\geq D\cdot\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{3}\lambda(ij)^{r_{0}}/S},

which holds with probability approaching 1, where c_{\mathbf{z}}^{\prime}=2/\underaccent{\bar}{\rho}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*}) and the second inequality follows from (75). To further lower bound the double summation, note that:

\sum_{i=1}^{M}\sum_{j=1}^{N}\frac{1}{1+c_{3}\lambda(ij)^{r_{0}}/S}\geq\sum_{i=1}^{M\wedge N}\frac{1}{1+c_{3}\lambda i^{2r_{0}}/S}.

This new lower bound can be approximated with the same method as (77) under the assumption that M\wedge N=O(\sqrt{S}). We can thus obtain the lower bound of \text{tr}\left(\mathbf{I}-\mathbf{W}\right) as O_{P}(\gamma_{S}^{-1/2r_{0}}), which establishes the final result.

The upper bound of \text{tr}\left(\mathbf{W}\right) is trivial since:

\text{tr}\left(\mathbf{W}\right)=\sum_{s=1}^{S}\sum_{d=1}^{D}\frac{\lambda}{\lambda+\rho_{d}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\rho_{s}(\mathbf{K})}\leq SD.
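The eigenvalue identities used above can be checked numerically. Below is a minimal sketch in Python/NumPy; the dimensions D, M, N, the decay exponent r_0, and the value of \lambda are illustrative assumptions of the sketch, not settings used in the paper. It builds a small surrogate for \widehat{\boldsymbol{\Sigma}}_{\mathbf{z}} and a separable Gram matrix \mathbf{K}=\mathbf{K}_{2}\otimes\mathbf{K}_{1}, then compares the direct trace of \mathbf{I}-\mathbf{W} against the eigenvalue double sum in (73).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and tuning parameter (assumptions for this sketch only)
D, M, N, r0, lam = 3, 12, 12, 1.4, 0.05
S = M * N

# A small positive-definite surrogate for Sigma_z-hat
A = rng.standard_normal((D, D))
Sigma_z = A @ A.T / D + np.eye(D)

def psd_with_spectrum(evals):
    """Random symmetric PSD matrix with a prescribed eigen-spectrum."""
    Q, _ = np.linalg.qr(rng.standard_normal((len(evals), len(evals))))
    return Q @ np.diag(evals) @ Q.T

# Separable kernel Gram matrix K = K2 (x) K1 with spectra ~ M i^{-r0}, N j^{-r0}
K1 = psd_with_spectrum(M * np.arange(1, M + 1.0) ** (-r0))
K2 = psd_with_spectrum(N * np.arange(1, N + 1.0) ** (-r0))
K = np.kron(K2, K1)

# Direct computation: W = lam * (Sigma_z (x) K + lam I)^{-1}, so
# tr(I - W) = tr[(Sigma_z (x) K + lam I)^{-1} (Sigma_z (x) K)]
G = np.kron(Sigma_z, K)
trace_direct = np.trace(np.linalg.solve(G + lam * np.eye(S * D), G))

# Eigenvalue double sum as in (73)
rho_z = np.linalg.eigvalsh(Sigma_z)
rho_K = np.linalg.eigvalsh(K)
trace_eig = sum(rd * rs / (lam + rd * rs) for rd in rho_z for rs in rho_K)

print(trace_direct, trace_eig)  # the two values agree up to numerical error
```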

 

E.5 Proof of Lemma 18

Proof  Given any fixed \mathbf{W} and t>0, let K=\sqrt{8/3}\sigma and let c>0 be a constant; then we have:

\mathrm{P}\left[\left|\mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E}-\sigma^{2}\text{tr}\left(\mathbf{W}\right)\right|>t\bigg{|}\mathbf{W}\right]\leq 2\exp\left[-c\min\left(\frac{t^{2}}{K^{4}\|\mathbf{W}\|_{\mathrm{F}}^{2}},\frac{t}{K^{2}\|\mathbf{W}\|_{s}}\right)\right], (78)

which holds by the Hanson-Wright inequality (Rudelson and Vershynin, 2013). Letting t=\sigma^{2}\text{tr}\left(\mathbf{W}\right)/2 and c_{1}=9c/256, and using the fact that \|\mathbf{W}\|_{s}\leq 1, we have:

2\exp\left[-c\min\left(\frac{t^{2}}{K^{4}\|\mathbf{W}\|_{\mathrm{F}}^{2}},\frac{t}{K^{2}\|\mathbf{W}\|_{s}}\right)\right]\leq 2\exp\left[-c_{1}\text{tr}\left(\mathbf{W}\right)\right]. (79)

We can lower bound the trace of \mathbf{W} as follows. First, note that:

\text{tr}\left(\mathbf{W}\right)=\sum_{s=1}^{S}\sum_{d=1}^{D}\frac{\lambda}{\lambda+\rho_{d}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\rho_{s}(\mathbf{K})}\geq SD\cdot\frac{\lambda}{\lambda+\bar{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\bar{\rho}(\mathbf{K})}.

By the assumption that \bar{\rho}(\mathbf{K}) is bounded and the fact that \bar{\rho}(\widehat{\boldsymbol{\Sigma}}_{\mathbf{z}})\leq 2\bar{\rho}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*}) with probability approaching 1 as T\rightarrow\infty, we have:

\mathrm{P}\left[\text{tr}\left(\mathbf{W}\right)\geq\frac{SD\lambda}{\lambda+\bar{c}}\right]\rightarrow 1,\quad\text{ as }T\rightarrow\infty, (80)

where \bar{c}=2\bar{\rho}(\boldsymbol{\Sigma}_{\mathbf{z}}^{*})\bar{\rho}(\mathbf{K}). Since r_{0}<2 and \gamma_{S}\cdot S^{r_{0}}\rightarrow C_{1} as S\rightarrow\infty, with C_{1} being either a positive constant or infinity, we have \gamma_{S}\cdot S^{2}=\gamma_{S}S^{r_{0}}\cdot S^{2-r_{0}}\rightarrow\infty, i.e., \lambda\cdot S\rightarrow\infty, and hence SD\lambda/(\lambda+\bar{c})\rightarrow\infty. Therefore, we have \text{tr}\left(\mathbf{W}\right)\rightarrow\infty with probability approaching 1, as S,T\rightarrow\infty.

With these results, we can now upper bound the unconditional probability of the event \{\left|\mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E}-\sigma^{2}\text{tr}\left(\mathbf{W}\right)\right|>\sigma^{2}\text{tr}\left(\mathbf{W}\right)/2\} as follows:

\mathrm{P}\left[\left|\mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E}-\sigma^{2}\text{tr}\left(\mathbf{W}\right)\right|>\sigma^{2}\text{tr}\left(\mathbf{W}\right)/2\right]
\leq\mathrm{E}\left[2\exp\left[-c_{1}\text{tr}\left(\mathbf{W}\right)\right]\right]
\leq 2\left\{1\cdot\mathrm{P}\left(\text{tr}\left(\mathbf{W}\right)<\frac{SD\lambda}{\lambda+\bar{c}}\right)+\exp\left[-c_{1}\frac{SD\lambda}{\lambda+\bar{c}}\right]\cdot\mathrm{P}\left(\text{tr}\left(\mathbf{W}\right)\geq\frac{SD\lambda}{\lambda+\bar{c}}\right)\right\}\rightarrow 0. (81)

This indicates that \mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E}\asymp\text{tr}\left(\mathbf{W}\right).

The proof of \mathbfcal{E}^{\top}\left(\mathbf{I}-\mathbf{W}\right)^{2}\mathbfcal{E}\asymp\text{tr}\left(\left(\mathbf{I}-\mathbf{W}\right)^{2}\right) is similar to the proof above. In the first step, similar to (78) and (79), we have the following tail probability bound:

\mathrm{P}\left[\left|\mathbfcal{E}^{\top}\left(\mathbf{I}-\mathbf{W}\right)^{2}\mathbfcal{E}-\sigma^{2}\text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)\right|>\frac{\sigma^{2}\text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)}{2}\bigg{|}\mathbf{W}\right]
\leq 2\exp\left\{-c\,\text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)\right\}. (82)

We can establish the divergence of \text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right) by following the same argument as in the proof of Lemma 17, which gives:

\text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)\geq(S/c\lambda)^{1/2r_{0}}\cdot\sum_{i=1}^{M\wedge N}\left\{\frac{1}{1+\left[\frac{i}{(S/c\lambda)^{1/2r_{0}}}\right]^{2r_{0}}}\right\}^{2}(S/c\lambda)^{-1/2r_{0}},

with probability approaching 1, where c is some constant. Therefore, we have \text{tr}\left((\mathbf{I}-\mathbf{W})^{2}\right)\gtrsim O_{P}(\gamma_{S}^{-1/2r_{0}}). The rest of the proof follows the argument leading to (81), and we omit the details here.  
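The concentration of the quadratic forms around their traces can also be illustrated by simulation. The following minimal sketch (the dimension n, the ridge level \lambda, and the generic PSD matrix standing in for \widehat{\boldsymbol{\Sigma}}_{\mathbf{z}}\otimes\mathbf{K} are assumptions of the sketch) draws Gaussian noise vectors and checks that \mathbfcal{E}^{\top}\mathbf{W}\mathbfcal{E} stays within a constant factor of \sigma^{2}\text{tr}(\mathbf{W}), consistent with the event bounded in (81).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimension, ridge parameter, and noise scale (assumptions for the sketch)
n, lam, sigma = 400, 0.1, 1.0

# A generic PSD matrix playing the role of Sigma_z (x) K, and W = lam*(G + lam I)^{-1}
B = rng.standard_normal((n, n))
G = B @ B.T / n
W = lam * np.linalg.inv(G + lam * np.eye(n))

ratios = []
for _ in range(500):
    eps = sigma * rng.standard_normal(n)          # Gaussian (hence sub-Gaussian) noise
    ratios.append(eps @ W @ eps / (sigma ** 2 * np.trace(W)))

# The quadratic form stays within a constant factor of sigma^2 tr(W)
print(np.quantile(ratios, [0.025, 0.5, 0.975]))
```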

E.6 Proof of Lemma 19

Proof  For any two symmetric matrices \mathbf{M},\mathbf{N} of identical size, we use \mathbf{M}\gtrsim\mathbf{N} to indicate that \mathbf{M}-\mathbf{N} is positive semi-definite, and we use \mathbf{M}^{1/2} to denote the symmetric, positive semi-definite square root of \mathbf{M}.

Since \mathbf{A}-\mathbf{B} is positive semi-definite, multiplying \mathbf{A}-\mathbf{B} by \mathbf{B}^{-1/2} on both the left and the right gives \mathbf{B}^{-1/2}\mathbf{A}\mathbf{B}^{-1/2}\gtrsim\mathbf{I}, i.e., \mathbf{B}^{-1/2}\mathbf{A}^{1/2}\mathbf{A}^{1/2}\mathbf{B}^{-1/2}\gtrsim\mathbf{I}. Notice that the matrix \mathbf{A}^{1/2}\mathbf{B}^{-1/2} is invertible and thus has no zero eigenvalues. As a result, the eigenvalues of \mathbf{B}^{-1/2}\mathbf{A}^{1/2}\mathbf{A}^{1/2}\mathbf{B}^{-1/2} coincide with those of \mathbf{A}^{1/2}\mathbf{B}^{-1/2}\mathbf{B}^{-1/2}\mathbf{A}^{1/2}, and thus \mathbf{A}^{1/2}\mathbf{B}^{-1/2}\mathbf{B}^{-1/2}\mathbf{A}^{1/2}\gtrsim\mathbf{I}. Multiplying both sides by \mathbf{A}^{-1/2} on the left and the right yields \mathbf{B}^{-1}\gtrsim\mathbf{A}^{-1}, which completes the proof.  
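A quick numerical sanity check of Lemma 19 is given below; the matrix size and the random construction of \mathbf{A} and \mathbf{B} are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6

# Random positive-definite B, and A = B + (a PSD perturbation), so A - B is PSD
Rb = rng.standard_normal((n, n))
B = Rb @ Rb.T + np.eye(n)
Rp = rng.standard_normal((n, n // 2))
A = B + Rp @ Rp.T

# Lemma 19: A - B PSD (and both invertible) implies B^{-1} - A^{-1} is PSD
diff = np.linalg.inv(B) - np.linalg.inv(A)
print(np.linalg.eigvalsh(diff).min() >= -1e-10)   # True up to round-off
```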

Appendix F Additional Details on Simulation Experiments

We generate the simulated dataset according to the MARAC(P,Q) model specified by (1) and (3). We simulate the autoregressive coefficients \mathbf{A}_{p},\mathbf{B}_{p} such that they satisfy the stationarity condition specified in Theorem 6 and have a banded structure. We use a similar setup for generating \boldsymbol{\Sigma}_{r},\boldsymbol{\Sigma}_{c}, with their diagonal entries fixed at unity. In Figure 5, we plot the simulated \mathbf{A}_{1},\mathbf{B}_{1},\boldsymbol{\Sigma}_{r},\boldsymbol{\Sigma}_{c} for (M,N)=(20,20); a minimal sketch of one way to generate such banded coefficients is given after Figure 5.

Figure 5: Visualization of the simulated \mathbf{A}_{1},\mathbf{B}_{1},\boldsymbol{\Sigma}_{r},\boldsymbol{\Sigma}_{c} with M=N=20.
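The sketch below (Python/NumPy) illustrates one way to draw banded autoregressive coefficients. The bandwidth, the scaling constant, and the use of the spectral radius of \mathbf{B}_{1}\otimes\mathbf{A}_{1} as a proxy for the stationarity condition of Theorem 6 are assumptions of the sketch, not the exact construction used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
M = N = 20
bandwidth = 2          # banded structure: entries beyond this off-diagonal are set to zero

def banded_coefficient(size, bandwidth, scale=0.3):
    """Random banded matrix, rescaled so that its spectral norm equals `scale`."""
    C = rng.uniform(-1, 1, size=(size, size))
    mask = np.abs(np.subtract.outer(np.arange(size), np.arange(size))) <= bandwidth
    C = C * mask
    return scale * C / np.linalg.norm(C, 2)

A1 = banded_coefficient(M, bandwidth)
B1 = banded_coefficient(N, bandwidth)

# Proxy check for stationarity: spectral radius of B1 (x) A1 strictly below 1
rho = np.max(np.abs(np.linalg.eigvals(np.kron(B1, A1))))
print(rho < 1)
```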

To generate g_{1},g_{2},g_{3}\in\mathbb{H}_{k} and mimic the spatial grid in our real data application in Section 6, we specify the 2-D spatial grid with its two dimensions being the latitude and longitude of points on the unit sphere \mathbb{S}^{2}. Each of the evenly spaced M\times N grid points has polar-azimuthal coordinates (\theta_{i},\phi_{j})\in[0^{\circ},180^{\circ}]\times[0^{\circ},360^{\circ}], i\in[M], j\in[N], and the sampled grid points on the sphere are projected onto a plane to form an M\times N matrix. The polar angle \theta (co-latitude) and the azimuthal angle \phi (longitude) are standard in the spherical coordinate system, with corresponding Euclidean coordinates (x,y,z)=(\sin(\theta)\cos(\phi),\sin(\theta)\sin(\phi),\cos(\theta)).
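For concreteness, the grid construction can be sketched as follows; whether the azimuthal endpoint 360^{\circ} is duplicated and the row-major ordering of the S=MN locations are conventions assumed here for illustration.

```python
import numpy as np

M = N = 20

# Evenly spaced polar (co-latitude) and azimuthal (longitude) angles, in radians
theta = np.deg2rad(np.linspace(0.0, 180.0, M))
phi = np.deg2rad(np.linspace(0.0, 360.0, N, endpoint=False))
TH, PH = np.meshgrid(theta, phi, indexing="ij")       # M x N grid of angle pairs

# Euclidean coordinates (x, y, z) of the grid points on the unit sphere S^2
X = np.sin(TH) * np.cos(PH)
Y = np.sin(TH) * np.sin(PH)
Z = np.cos(TH)
points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)  # S = M*N points, row-major
```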

As for the spatial kernel, we choose the Lebedev kernel:

k_{\eta}(s_{1},s_{2})=\left(\frac{1}{4\pi}+\frac{\eta}{12\pi}\right)-\frac{\eta}{8\pi}\sqrt{\frac{1-\left<s_{1},s_{2}\right>}{2}},\quad s_{1},s_{2}\in\mathbb{S}^{2}, (83)

where \left<\cdot,\cdot\right> denotes the inner product of two points on the sphere \mathbb{S}^{2} (i.e., the cosine of the angle between them) and \eta is a hyperparameter of the kernel. In both the simulation experiments and the real data application, we fix \eta=3. The Lebedev kernel has the spherical harmonic functions as its eigenfunctions:

k_{\eta}(s_{1},s_{2})=\frac{1}{4\pi}+\sum_{l=1}^{\infty}\frac{\eta}{(4l^{2}-1)(2l+3)}\sum_{m=-l}^{l}Y_{l}^{m}(s_{1})Y_{l}^{m}(s_{2}),

where Y_{l}^{m}(\cdot) are orthonormal real spherical harmonic basis functions on the sphere \mathbb{S}^{2}:

Y_{l}^{m}(s)=Y_{l}^{m}(\theta,\phi)=\begin{dcases}\sqrt{2}N_{lm}P_{l}^{m}(\cos(\theta))\cos(m\phi)&\text{if $m>0$}\\ N_{l0}P_{l}^{0}(\cos(\theta))&\text{if $m=0$}\\ \sqrt{2}N_{l|m|}P_{l}^{|m|}(\cos(\theta))\sin(|m|\phi)&\text{if $m<0$}\end{dcases},

with N_{lm}=\sqrt{(2l+1)(l-m)!/(4\pi(l+m)!)} and P_{l}^{m}(\cdot) being the associated Legendre polynomial of degree l and order m. We refer readers to Kennedy et al. (2013) for details on spherical harmonic functions and the associated isotropic kernels. Under our 2-D grid setup and this choice of kernel, we find empirically that the eigen-spectrum of the kernel Gram matrix \mathbf{K} decays at a rate of approximately \rho_{i}(\mathbf{K})\approx i^{-r} with r\in[1.3,1.5].
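The empirical decay rate of the Gram matrix's eigen-spectrum can be examined with the short sketch below. It evaluates the Lebedev kernel (83) on a spherical grid (cell-midpoint co-latitudes are used here, an assumption made to avoid duplicated pole points) and fits the exponent r by least squares on the leading eigenvalues.

```python
import numpy as np

M = N = 20
eta = 3.0

# Grid on the unit sphere; midpoint co-latitudes avoid duplicated pole points
theta = np.deg2rad((np.arange(M) + 0.5) * 180.0 / M)
phi = np.deg2rad(np.arange(N) * 360.0 / N)
TH, PH = np.meshgrid(theta, phi, indexing="ij")
points = np.stack([np.sin(TH) * np.cos(PH),
                   np.sin(TH) * np.sin(PH),
                   np.cos(TH)], axis=-1).reshape(-1, 3)

def lebedev_gram(P, eta):
    """Gram matrix of the Lebedev kernel (83) for unit vectors stored in the rows of P."""
    inner = np.clip(P @ P.T, -1.0, 1.0)               # <s1, s2> = cosine of the angle
    return (1/(4*np.pi) + eta/(12*np.pi)) - eta/(8*np.pi) * np.sqrt((1.0 - inner) / 2.0)

K = lebedev_gram(points, eta)                         # S x S Gram matrix, S = M*N

# Fit log rho_i(K) ~ -r * log i on the leading part of the spectrum
evals = np.sort(np.linalg.eigvalsh(K))[::-1]
idx = np.arange(1, 151)
r_hat = -np.polyfit(np.log(idx), np.log(evals[:150]), 1)[0]
print(r_hat)   # the paper reports r in [1.3, 1.5] empirically on its grid
```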

We randomly sample g_{1},g_{2},g_{3} from Gaussian processes whose covariance kernel is the Lebedev kernel in (83). Finally, we simulate the vector time series \mathbf{z}_{t} from a VAR(1) process. In Figure 6, we visualize the simulated functional parameters as well as the vector time series from one random draw; a minimal code sketch of this sampling step follows Figure 6.

Figure 6: Simulated functional parameters g_{1},g_{2},g_{3} evaluated on a 20\times 20 spatial grid (top row) and the corresponding auxiliary vector time series (bottom row).
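A minimal sketch of this sampling step is given below (Python/NumPy). The jitter added before the Cholesky factorization, the VAR(1) coefficient matrix, the innovation scale, and the series length T are illustrative assumptions, and the spherical grid follows the midpoint convention of the kernel sketch above.

```python
import numpy as np

rng = np.random.default_rng(4)
M = N = 20
eta = 3.0

# Rebuild the spherical grid and the Lebedev Gram matrix K (see the kernel sketch above)
theta = np.deg2rad((np.arange(M) + 0.5) * 180.0 / M)
phi = np.deg2rad(np.arange(N) * 360.0 / N)
TH, PH = np.meshgrid(theta, phi, indexing="ij")
P = np.stack([np.sin(TH) * np.cos(PH), np.sin(TH) * np.sin(PH), np.cos(TH)],
             axis=-1).reshape(-1, 3)
inner = np.clip(P @ P.T, -1.0, 1.0)
K = (1/(4*np.pi) + eta/(12*np.pi)) - eta/(8*np.pi) * np.sqrt((1.0 - inner) / 2.0)

# Draw g_1, g_2, g_3 as Gaussian-process samples with covariance K, reshaped to M x N
S = M * N
L = np.linalg.cholesky(K + 1e-6 * np.eye(S))           # small jitter for numerical stability
g1, g2, g3 = [(L @ rng.standard_normal(S)).reshape(M, N) for _ in range(3)]

# Auxiliary vector time series from a stationary VAR(1):  z_t = Phi z_{t-1} + e_t
D, T = 3, 500
Phi = 0.6 * np.eye(D) + 0.1 * rng.standard_normal((D, D))
Phi *= 0.9 / max(np.abs(np.linalg.eigvals(Phi)).max(), 0.9)  # enforce spectral radius < 1
z = np.zeros((T, D))
for t in range(1, T):
    z[t] = Phi @ z[t - 1] + 0.1 * rng.standard_normal(D)
```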

In Figure 7, we visualize the ground truth g_{3} together with its penalized MLE and truncated penalized MLE estimators. The truncated penalized MLE estimators give a smooth approximation to g_{3}, and the approximation improves as R increases. For accuracy, R should therefore be chosen as large as the available computational resources allow.

Figure 7: Ground truth g_{3} (panel (a)) against the penalized MLE estimator \widehat{g}_{3} (panel (b)) and the truncated penalized MLE estimator \widehat{g}_{3} using R\in\{49,81,121\} basis functions; M=20.