Matrix Autoregressive Model with Vector Time Series Covariates for Spatio-Temporal Data
Abstract
We develop a new methodology for forecasting matrix-valued time series with historical matrix data and auxiliary vector time series data. We focus on a time series of matrices defined on a static 2-D spatial grid and an auxiliary time series of non-spatial vectors. The proposed model, Matrix AutoRegression with Auxiliary Covariates (MARAC), contains an autoregressive component for the historical matrix predictors and an additive component that maps the auxiliary vector predictors to a matrix response via a tensor-vector product. The autoregressive component adopts a bi-linear transformation framework following Chen et al. (2021), significantly reducing the number of parameters. The auxiliary component posits that the tensor coefficient, which maps non-spatial predictors to a spatial response, contains slices of spatially smooth matrix coefficients that are discrete evaluations of smooth functions from a Reproducing Kernel Hilbert Space (RKHS). We propose to estimate the model parameters under a penalized maximum likelihood estimation framework coupled with an alternating minimization algorithm. We establish the joint asymptotics of the autoregressive and tensor parameters under fixed and high-dimensional regimes. Extensive simulations and a geophysical application to forecasting the global Total Electron Content (TEC) validate the performance of MARAC.
Keywords: Auxiliary covariates, matrix autoregression, reproducing kernel Hilbert space (RKHS), spatio-temporal forecast, tensor data model
1 Introduction
Matrix-valued time series data have received increasing attention in multiple scientific fields, such as economics (Wang et al., 2019), geophysics (Sun et al., 2022), and environmental science (Dong et al., 2020), where scientists are interested in modeling the joint dynamics of data observed on a 2-D grid across time. This paper focuses on matrix-valued data defined on a 2-D spatial grid that carries the geographical information of the individual observations. As a concrete example, we visualize the global Total Electron Content (TEC) distribution in Figure 1. TEC is the density of electrons in the Earth’s ionosphere along the vertical pathway connecting a radio transmitter and a ground-based receiver. An accurate prediction of the global TEC is critical since it helps anticipate the impact of space weather on the positioning, navigation, and timing (PNT) service (Wang et al., 2021; Younas et al., 2022). Each image in panels (A)-(C) is a matrix whose entries are distributed on a latitude-by-longitude spatial grid.

The matrix-valued time series, such as the TEC time series, is often accompanied by auxiliary vector time series that measure the same object, such as the Earth’s ionosphere, through a different data modality. In panel (D) of Figure 1, we plot the global SYM-H index, which measures the geomagnetic activity caused by solar eruptions; such activity can ultimately impact the global TEC distribution. These non-spatial auxiliary covariates carry additional information related to the dynamics of the matrix time series data.
In this paper, we investigate the problem of forecasting future matrix data jointly with the historical matrices and the vector time series covariates. There are two major challenges in this modeling context. From the perspective of building a matrix-variate regression model, we need to integrate the information of predictors with non-uniform modes, namely both matrices and vectors. Adding the auxiliary vector covariates benefits the prediction and enables domain scientists to understand the interplay between different data modalities but at the same time, it complicates the modeling and the subsequent theoretical analysis. From the perspective of spatio-temporal analysis (Cressie and Wikle, 2015), we need to properly leverage the spatial information of the data and transform the classical spatial statistics framework to accommodate the grid geometry of matrix-valued data. In the remainder of this section, we briefly review the related literature that can shed light on these challenges and then summarize our unique contributions.
A straightforward prediction approach is to vectorize the matrices and make predictions via Vector Autoregression (VAR) (Stock and Watson, 2001). In this approach, the auxiliary vector covariates can be incorporated easily by concatenating them with the vectorized matrix predictors. However, vectorizing matrix data discards the spatial information and requires a large number of parameters given the high dimensionality of the data. To avoid vectorizing the matrix data, scalar-on-tensor regression (Zhou et al., 2013; Guhaniyogi et al., 2017; Li et al., 2018; Papadogeorgou et al., 2021) tackles the problem by using matrix predictors directly. However, these models are built for scalar responses, while in our setting we are dealing with matrix responses. Dividing the matrix response into individual scalar responses and fitting scalar-on-tensor regressions still requires a large number of parameters and, more importantly, fails to take the spatial information of the response into account.
The statistical framework that can incorporate matrices as both predictors and response is the tensor-on-tensor regression (Lock, 2018; Liu et al., 2020; Luo and Zhang, 2022) and more specifically for time series data, the matrix/tensor autoregression (Chen et al., 2021; Li and Xiao, 2021; Hsu et al., 2021; Wang et al., 2024). The matrix/tensor predictors are mapped to matrix/tensor responses via multi-linear transformations that greatly reduce the number of parameters. Our work builds on this framework and incorporates the non-spatial vector predictors at the same time.
To incorporate the vector predictor in the same model, we need to map vector predictors to matrix responses. Tensor-on-scalar regression (Rabusseau and Kadri, 2016; Sun and Li, 2017; Li and Zhang, 2017; Guha and Guhaniyogi, 2021) illustrates a way of mapping low-order scalar/vector predictors to high-order matrix/tensor responses via taking the tensor-vector product of the predictor with a high-order tensor coefficient. In this paper, we take a similar approach and introduce a 3-D tensor coefficient for the vector predictors such that our model can take predictors with non-uniform modes, which is a key distinction of our model compared to existing works.
The other distinction of our model is that it leverages the spatial information of the matrix response. In our model, a key assumption is that the vector predictor has similar predictive effects on neighboring locations in the matrix response. This is equivalent to saying that the tensor coefficient is spatially smooth. Typically, the estimation of spatially smooth tensor coefficients in such regression models is done by adding a total-variation (TV) penalty (Wang et al., 2017; Shen et al., 2022; Sun et al., 2023) to the unknown tensor. The TV penalty leads to piecewise smooth estimators with sharp edges and enables feature selection. However, estimation with the TV penalty requires solving non-convex optimization problems, making the subsequent theoretical analysis difficult. In our model, we utilize a simpler approach by assuming that the tensor coefficients are discrete evaluations of functional parameters from a Reproducing Kernel Hilbert Space (RKHS). Such a kernel method has been widely used in scalar-on-image regression (Kang et al., 2018), where the regression coefficients of the image predictor are constrained to be spatially smooth.
We facilitate the estimation of the unknown functional parameters with the functional norm penalty. Functional norm penalties have been widely used for estimating smooth functions in classic semi/non-parametric learning in which data variables are either scalar/vector-valued (see Hastie et al., 2009; Gu, 2013; Yuan and Cai, 2010; Cai and Yuan, 2012; Shang and Cheng, 2013, 2015; Cheng and Shang, 2015; Yang et al., 2020). To the best of our knowledge, the present article is the first to consider functional norm penalty for tensor coefficient estimation in a matrix autoregressive setting.
In summary, our paper makes two major contributions. Firstly, we build a unified matrix autoregression framework for spatio-temporal data that incorporates lower-order scalar/vector time series covariates. Such a framework is strongly motivated by applications in which domain scientists seek to integrate spatial and non-spatial data for prediction and inference. The framework also bridges regression methodologies with tensor predictors and responses of non-uniform modes, making the theoretical investigation itself an interesting topic. Secondly, we propose to estimate the coefficients of the auxiliary covariates, together with the autoregressive coefficients, in a single penalized maximum likelihood estimation (MLE) framework with an RKHS functional norm penalty. The RKHS framework builds spatial continuity into the regression coefficients. We establish the joint asymptotics of the autoregressive coefficients and the functional parameters under fixed/high matrix dimensionality regimes, propose an efficient alternating minimization algorithm for estimation, and validate it with extensive simulations and real applications.
The remainder of the paper is organized as follows. We introduce our model formally in Section 2 and provide model interpretations and comparisons in detail. Section 3 introduces the penalized MLE framework and describes the exact and approximate estimation algorithms. Large sample properties of the estimators under fixed and high matrix dimensionality are established in Section 4. Section 5 provides extensive simulation studies for validating the consistency of the estimators, demonstrating BIC-based model selection results, and comparing our method with various competitors. We apply our method to the global TEC data in Section 6 and conclude in Section 7. Technical proofs and additional details of the simulation and real data applications are deferred to the supplemental materials.
2 Model
2.1 Notation
We adopt the following notation throughout the article. We use calligraphic bold-face letters (e.g. ) for tensors with at least three modes, uppercase bold-face letters (e.g. ) for matrices, lowercase bold-face letters (e.g. ) for vectors, and blackboard bold-face letters for sets (e.g. ). To subscript any tensor/matrix/vector, we use square brackets with subscripts such as , and we keep the subscript inside the square bracket to index time. Any fibers and slices of a tensor are subscripted with colons such as , and thus any row and column of a matrix is denoted as and . If the slices of a tensor/matrix are based on the last mode, such as and , we will often omit the colons and write and for brevity. For any tensor , we use to denote the vectorized tensor. For any two tensors of identical size, we define their inner product as: , and we use to denote the Frobenius norm of a tensor, and one has .
Following Li and Zhang (2017), the tensor-vector product between a tensor of size and a vector , denoted as , or simply , is a tensor of size with . For tensor , we use to denote its -mode matricization. The Kronecker product between matrices is denoted via and the trace of a square matrix is denoted as . We use to denote the maximum, minimum and largest eigenvalue of a matrix. We use to denote a block diagonal matrix with along the diagonal. More details on the related tensor algebra can be found in Kolda and Bader (2009).
For the matrix time series in our modeling context, we assume that all grid locations are points on an grid within the domain . The collection of all the spatial locations is denoted as and any particular element of corresponding to the entry of the matrix is denoted as . We will often index the entries of the matrix with a single index and thus will be denoted as . We use to denote the index set, i.e. . Finally, we use $K(\cdot,\cdot)$ to denote a spatial kernel function and $\mathbb{H}_K$ to denote the corresponding Reproducing Kernel Hilbert Space (RKHS).
2.2 Matrix AutoRegression with Auxiliary Covariates (MARAC)
Let $(\mathbf{X}_t, \mathbf{z}_t)$ be a joint observation of the matrix and the auxiliary vector time series at time $t$, with $\mathbf{X}_t \in \mathbb{R}^{m \times n}$ and $\mathbf{z}_t \in \mathbb{R}^{d}$. To forecast $\mathbf{X}_t$, we propose our Matrix AutoRegression with Auxiliary Covariates, or MARAC, as:
\[
\mathbf{X}_t \;=\; \sum_{i=1}^{P} \mathbf{A}_i \mathbf{X}_{t-i} \mathbf{B}_i^{\top} \;+\; \sum_{j=1}^{Q} \mathcal{G}_j \,\bar{\times}_3\, \mathbf{z}_{t-j} \;+\; \mathbf{E}_t, \tag{1}
\]
where $\mathbf{A}_i \in \mathbb{R}^{m \times m}$ and $\mathbf{B}_i \in \mathbb{R}^{n \times n}$ are the autoregressive coefficients for the lag-$i$ matrix predictor, $\mathcal{G}_j \in \mathbb{R}^{m \times n \times d}$ is the tensor coefficient for the lag-$j$ vector predictor, and $\mathbf{E}_t$ is a noise term whose distribution will be specified later. The lag parameters $P$ and $Q$ are hyper-parameters of the model, and we often refer to model (1) as MARAC$(P,Q)$.
Based on model (1), for the $(i,j)$ element of $\mathbf{X}_t$, the MARAC specifies the following model:
\[
[\mathbf{X}_t]_{i,j} \;=\; \sum_{l=1}^{P} [\mathbf{A}_l]_{i,:}\, \mathbf{X}_{t-l}\, [\mathbf{B}_l]_{j,:}^{\top} \;+\; \sum_{l=1}^{Q} [\mathcal{G}_l]_{i,j,:}^{\top}\, \mathbf{z}_{t-l} \;+\; [\mathbf{E}_t]_{i,j}, \tag{2}
\]
where each autoregressive term is associated with a rank-one coefficient matrix determined by the specific rows of $\mathbf{A}_l$ and $\mathbf{B}_l$, and each non-spatial auxiliary covariate is associated with a coefficient vector $[\mathcal{G}_l]_{i,j,:}$ that is location-specific. It now becomes evident from (2) that the auxiliary vector covariates enter the model via an elementwise linear model. The autoregressive term utilizes $(\mathbf{A}_l, \mathbf{B}_l)$ to transform each lag-$l$ predictor in a bi-linear form. Using such a bi-linear transformation greatly reduces the total number of parameters: each lagged predictor that previously required $mn \times mn$ parameters now only requires $m^2 + n^2$ parameters.
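To make the two components concrete, the following sketch (our own illustration, not code from the paper; the symbol names A_i, B_i, G_j and the use of numpy are assumptions) computes a noiseless one-step MARAC prediction, combining the bi-linear autoregressive terms with the mode-3 tensor-vector products.

```python
import numpy as np

def marac_predict(X_hist, z_hist, A, B, G):
    """Noiseless one-step MARAC prediction (sketch).

    X_hist : list of past matrices, X_hist[i] = X_{t-1-i}, each of shape (m, n)
    z_hist : list of past auxiliary vectors, z_hist[j] = z_{t-1-j}, each of shape (d,)
    A, B   : lists of autoregressive coefficients, A[i] is (m, m), B[i] is (n, n)
    G      : list of tensor coefficients, G[j] is (m, n, d)
    """
    pred = np.zeros_like(X_hist[0], dtype=float)
    # bi-linear autoregressive part: sum_i A_i X_{t-i} B_i^T
    for Ai, Bi, Xlag in zip(A, B, X_hist):
        pred += Ai @ Xlag @ Bi.T
    # auxiliary part: mode-3 tensor-vector product sum_j G_j x_3 z_{t-j}
    for Gj, zlag in zip(G, z_hist):
        pred += np.einsum('ijk,k->ij', Gj, zlag)
    return pred

# toy example
m, n, d, P, Q = 4, 5, 3, 2, 1
rng = np.random.default_rng(0)
X_hist = [rng.normal(size=(m, n)) for _ in range(P)]
z_hist = [rng.normal(size=d) for _ in range(Q)]
A = [0.3 * np.eye(m) for _ in range(P)]
B = [0.3 * np.eye(n) for _ in range(P)]
G = [rng.normal(size=(m, n, d)) for _ in range(Q)]
print(marac_predict(X_hist, z_hist, A, B, G).shape)  # (4, 5)
```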
For the tensor coefficient , we assume that it is spatially smooth. More specifically, we assume that and are similar if are spatially close. Formally, we assume that each , i.e. the coefficient matrix for the covariate at lag-, is a discrete evaluation of a function on . Furthermore, each comes from an RKHS endowed with the spatial kernel function . The spatial kernel function specifies the spatial smoothness of the functional parameters and thus the tensor coefficient .
An alternative formulation for the tensor coefficient would be a low-rank form (Li and Zhang, 2017). We choose local smoothness over low-rankness to explicitly model the spatial smoothness of the coefficients and to avoid tuning the tensor rank. We leave the low-rank model for future research and focus on the RKHS framework in the current paper.
Finally, for the additive noise term , we assume that it is i.i.d. from a multivariate normal distribution with a separable Kronecker-product covariance:
\[
\mathrm{vec}(\mathbf{E}_t) \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}\!\left(\mathbf{0},\, \boldsymbol{\Sigma}_c \otimes \boldsymbol{\Sigma}_r\right), \tag{3}
\]
where $\boldsymbol{\Sigma}_r \in \mathbb{R}^{m \times m}$ and $\boldsymbol{\Sigma}_c \in \mathbb{R}^{n \times n}$ are the row and column covariance components. Such a Kronecker-product covariance is commonly seen in covariance models for multi-way data (Hoff, 2011; Fosdick and Hoff, 2014) and has the merit of reducing the number of parameters significantly.
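A minimal sketch of drawing such Kronecker-separable matrix noise, assuming numpy and Cholesky factors of the row/column covariances (our own illustration, not the paper's code):

```python
import numpy as np

def sample_matrix_noise(Sigma_r, Sigma_c, rng):
    """Draw one matrix E with vec(E) ~ N(0, Sigma_c kron Sigma_r) (sketch).

    If E = L_r Z L_c^T with Z having i.i.d. N(0,1) entries and
    L_r L_r^T = Sigma_r, L_c L_c^T = Sigma_c, then cov(vec(E)) = Sigma_c kron Sigma_r.
    """
    L_r = np.linalg.cholesky(Sigma_r)
    L_c = np.linalg.cholesky(Sigma_c)
    Z = rng.normal(size=(Sigma_r.shape[0], Sigma_c.shape[0]))
    return L_r @ Z @ L_c.T

rng = np.random.default_rng(1)
m, n = 3, 4
Sigma_r = 0.5 * np.eye(m) + 0.5 * np.ones((m, m))   # toy row covariance
Sigma_c = np.diag(np.arange(1.0, n + 1))            # toy column covariance
E = sample_matrix_noise(Sigma_r, Sigma_c, rng)
print(E.shape)  # (3, 4)
```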
Compared to existing models that can only deal with either matrix or vector predictors, our model (1) can incorporate predictors with non-uniform modes. If one redefines in our model as , i.e. all terms except the autoregressive term, then our model ends up specifying:
where $\mathbf{G}_j = [\mathcal{G}_j]_{(3)} \in \mathbb{R}^{d \times mn}$ is the mode-3 matricization of $\mathcal{G}_j$, and we will use this notation for the rest of the paper. This new formulation reveals how our model differs from other autoregression models with matrix predictors. The covariance of the redefined error term in our model contains a separable covariance matrix that is based on the matrix grid geometry, a locally smooth coefficient matrix that captures the local spatial dependency, and an auto-covariance matrix that captures the temporal dependency. Consequently, entries of the error term are more correlated if they are spatially/temporally close or if they share the same row/column index, making the model more flexible for spatial data distributed on a matrix grid.
As a comparison, in the kriging framework (Cressie, 1986), the covariance of the error term is characterized by a spatio-temporal kernel that captures the dependencies among spatial and temporal neighbors. Such a kernel method can account for the local dependency but not the spatial dependency based on the matrix grid geometry. In the matrix autoregression model (Chen et al., 2021), the authors consider neither the local spatial dependencies among the entries of the error term nor the temporal dependency across time. In Hsu et al. (2021), the matrix autoregression model is generalized to adapt to spatial data via fixed-rank co-kriging (FRC) (Cressie and Johannesson, 2008) with , where is a coefficient matrix and is a pre-specified spatial basis matrix. Such a co-kriging framework does not account for the temporal dependency of the noise, nor does it consider the auxiliary covariates. Our model generalizes these previous works to allow for temporally dependent noise with both local and grid spatial dependency.
The combination of (1) and (3) specifies the complete MARAC model. Vectorizing both sides of (1) yields the vectorized MARAC model:
\[
\mathrm{vec}(\mathbf{X}_t) \;=\; \sum_{i=1}^{P} \left(\mathbf{B}_i \otimes \mathbf{A}_i\right) \mathrm{vec}(\mathbf{X}_{t-i}) \;+\; \sum_{j=1}^{Q} \mathbf{G}_j^{\top}\, \mathbf{z}_{t-j} \;+\; \mathrm{vec}(\mathbf{E}_t), \tag{4}
\]
where $\mathbf{G}_j$ is the mode-3 matricization of $\mathcal{G}_j$ defined above, and recall that $\mathrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B}^{\top}) = (\mathbf{B} \otimes \mathbf{A})\,\mathrm{vec}(\mathbf{X})$. It is now more evident that the Kronecker-product structure of the autoregressive coefficient matrix and the noise covariance matrix greatly reduces the number of parameters, making the regression estimation feasible given finite samples. The spatially smooth structure of the tensor coefficients leverages the spatial information of the data. In the next section, we discuss the estimation algorithm for the MARAC model parameters.
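The vectorized form rests on the standard identity $\mathrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B}^{\top}) = (\mathbf{B} \otimes \mathbf{A})\,\mathrm{vec}(\mathbf{X})$; the short check below (our own illustration, using numpy's column-major vectorization) verifies it numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
A, B, X = rng.normal(size=(m, m)), rng.normal(size=(n, n)), rng.normal(size=(m, n))

# column-major (Fortran-order) vectorization matches vec(A X B^T) = (B kron A) vec(X)
vec = lambda M: M.reshape(-1, order='F')
lhs = vec(A @ X @ B.T)
rhs = np.kron(B, A) @ vec(X)
print(np.allclose(lhs, rhs))  # True
```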
3 Estimating Algorithm
In this section, we discuss the parameter estimation for the MARAC model (1). We first propose a penalized maximum likelihood estimator (MLE) in Section 3.1 for exact parameter estimation. Then, we propose an approximation to the penalized MLE in Section 3.2 for faster computation when dealing with high-dimensional matrix data. Finally, in Section 3.3, we outline the model selection criterion for selecting the lag hyper-parameters whose consistency will be validated empirically in Section 5.
3.1 Penalized Maximum Likelihood Estimation (MLE)
To estimate the parameters of the MARAC model, which we denote collectively as , we propose a penalized maximum likelihood estimation (MLE) approach. Following the distribution assumption on in (3), we can write the negative log-likelihood of with a squared RKHS functional norm penalty, after dropping the constants, as:
(5) |
where is the conditional log-likelihood of :
(6) |
and is the vectorized residual at . To estimate the parameters, one needs to solve a constrained minimization problem:
(7) |
We now define the functional norm penalty in (5) explicitly and derive a finite-dimensional equivalent of the optimization problem above. We assume that the spatial kernel function $K$ is continuous and square-integrable; thus it has an eigen-decomposition following Mercer's theorem (Williams and Rasmussen, 2006):
\[
K(\mathbf{s}, \mathbf{s}') \;=\; \sum_{k=1}^{\infty} \lambda_k\, \phi_k(\mathbf{s})\, \phi_k(\mathbf{s}'), \tag{8}
\]
where $\{\lambda_k\}_{k \ge 1}$ is a sequence of non-negative eigenvalues and $\{\phi_k\}_{k \ge 1}$ is a set of orthonormal eigen-functions on the spatial domain. The functional norm of a function $f$ from the RKHS $\mathbb{H}_K$ endowed with the kernel $K$ is defined as:
\[
\|f\|_{\mathbb{H}_K} \;=\; \Big( \sum_{k=1}^{\infty} \lambda_k^{-1}\, \langle f, \phi_k \rangle_{L^2}^{2} \Big)^{1/2}, \tag{9}
\]
following van Zanten and van der Vaart (2008).
Given any in (5), the generalized representer theorem (Schölkopf et al., 2001) suggests that the solution of the functional parameters, denoted as , of the minimization problem (7), with all other parameters held fixed, is a linear combination of the representers plus a linear combination of the basis functions of the null space of , i.e.,
(10) |
where we omit the subscripts on the coefficients for brevity, though they differ for each functional parameter. We assume that the null space of contains only the zero function for the remainder of the paper. As a consequence of (10), the minimization problem in (7) reduces to a finite-dimensional Kernel Ridge Regression (KRR) problem. We summarize the discussion above in the proposition below:
Proposition 1
If , the constrained minimization problem in (7) is equivalent to the following unconstrained kernel ridge regression problem:
(11) |
where is the vectorized residual, is the kernel Gram matrix with and contains the coefficients of the representers with being the coefficient for the representer and the auxiliary covariate at lag .
We give the proof in the supplemental material. After introducing the functional norm penalty, the original tensor coefficient is now converted to a linear combination of the representer functions with the relationship that where .
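As a rough illustration of the kernel ridge regression structure that Proposition 1 leads to, the sketch below fits a single smooth surface on a grid with a squared-exponential kernel as a generic stand-in for the paper's spatial kernel; the function names, the (K + Nλ I) scaling of the penalty, and the toy response are our own assumptions.

```python
import numpy as np

def krr_fit(S, y, lam, lengthscale=0.2):
    """Kernel ridge regression on grid locations S (sketch, not the paper's code).

    S   : (N, 2) spatial locations in [0, 1]^2
    y   : (N,) responses, e.g. partial residuals at each location
    lam : penalty weight; larger values give a smoother fit
    Returns representer coefficients alpha and the Gram matrix K.
    """
    sq_dists = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2 * lengthscale ** 2))       # stand-in spatial kernel
    N = S.shape[0]
    alpha = np.linalg.solve(K + N * lam * np.eye(N), y)  # (K + N*lam*I) alpha = y
    return alpha, K

# toy usage on an 8 x 8 grid
g = np.linspace(0, 1, 8)
S = np.array([[a, b] for a in g for b in g])
rng = np.random.default_rng(3)
y = np.sin(2 * np.pi * S[:, 0]) + 0.1 * rng.normal(size=S.shape[0])
alpha, K = krr_fit(S, y, lam=1e-3)
print(np.sqrt(np.mean((K @ alpha - y) ** 2)))  # small in-sample RMSE
```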
We solve the minimization problem in (11) with an alternating minimization algorithm (Attouch et al., 2013), where we update one block of parameters at a time while keeping the others fixed. We update the parameters following the order of: . We choose the alternating minimization algorithm for its simplicity and efficiency. Each step of the algorithm conducts exact minimization over one block of the parameters, leading to a non-increasing sequence of objective function values, which guarantees the convergence of the algorithm towards a local stationary point.
To solve the optimization problem in (11) for at the iteration, it suffices to solve the following least-square problem:
(12) |
where is the residual matrix when predicting :
and we use to denote the partial residual excluding the term involving and use to denote the tensor coefficient satisfying , with . The superscript represents the value at the iteration. To simplify the notation, we define , where are arbitrary matrices/vectors with conformal matrix sizes and we simply write if . Solving (12) yields the following updating formula for :
(13) |
Similarly, we have the following updating formula for :
(14) |
For updating , or its vectorized version , it is required to solve the following kernel ridge regression problem:
where and is the vectorized partial residual of by leaving out the lag- auxiliary predictor, defined in a similar way as . Solving the kernel ridge regression leads to the following updating formula for :
(15) |
The step in (15) can be slow since one needs to invert a square matrix of size . In Section 3.2, we propose an approximation to (15) to speed up the computation under high dimensionality.
The updating rules for and can be derived by taking the corresponding derivatives of (11) and setting them to zero. Specifically, we have:
(16) | ||||
(17) |
where is the full residual when predicting .
The algorithm cycles through (13), (14), (15), (16) and (17) and terminates when the relative changes of , , between iterations fall below a pre-specified threshold. We make two additional remarks on the algorithm:
Remark 2
(Identifiability Constraint) The MARAC model specified in (1) is scale-unidentifiable in that one can re-scale each pair of $(\mathbf{A}_i, \mathbf{B}_i)$ by a non-zero constant and its reciprocal without changing their Kronecker product. To enforce scale identifiability, we re-scale the algorithm output for each pair of $(\mathbf{A}_i, \mathbf{B}_i)$ such that , . The identifiability constraint is enforced before outputting the estimators.
Remark 3
(Convergence of Kronecker Product) When dealing with high-dimensional matrices, it is cumbersome to compute the change between consecutive Kronecker-product iterates under the Frobenius norm. An upper bound can be used instead:
\[
\big\| \mathbf{B}_i^{(s+1)} \otimes \mathbf{A}_i^{(s+1)} - \mathbf{B}_i^{(s)} \otimes \mathbf{A}_i^{(s)} \big\|_{F} \;\le\; \big\|\mathbf{B}_i^{(s+1)} - \mathbf{B}_i^{(s)}\big\|_{F}\, \big\|\mathbf{A}_i^{(s+1)}\big\|_{F} \;+\; \big\|\mathbf{B}_i^{(s)}\big\|_{F}\, \big\|\mathbf{A}_i^{(s+1)} - \mathbf{A}_i^{(s)}\big\|_{F}, \tag{18}
\]
and a similar bound can be used for the convergence check of .
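The overall block-update structure can be summarized by the skeleton below (a sketch, not the paper's implementation): the update_* callables stand in for the closed-form updates (13)-(17) and are hypothetical, the convergence check uses the Kronecker-product bound of Remark 3, and the final rescaling is one possible way to impose the identifiability constraint of Remark 2.

```python
import numpy as np

def fit_marac_alternating(init, update_A, update_B, update_G, update_Sigma,
                          max_iter=100, tol=1e-4):
    """Skeleton of the alternating minimization loop (sketch).

    `update_*` are placeholder callables for the closed-form block updates;
    they are not functions defined in the paper.
    """
    A, B, G, Sig_r, Sig_c = init
    for _ in range(max_iter):
        A_new = update_A(A, B, G, Sig_r, Sig_c)
        B_new = update_B(A_new, B, G, Sig_r, Sig_c)
        G_new = update_G(A_new, B_new, G, Sig_r, Sig_c)
        Sig_r_new, Sig_c_new = update_Sigma(A_new, B_new, G_new)
        # cheap convergence check: upper bound on the change of B_i kron A_i
        delta = max(
            np.linalg.norm(Bn - Bo) * np.linalg.norm(An)
            + np.linalg.norm(Bo) * np.linalg.norm(An - Ao)
            for An, Ao, Bn, Bo in zip(A_new, A, B_new, B)
        )
        A, B, G, Sig_r, Sig_c = A_new, B_new, G_new, Sig_r_new, Sig_c_new
        if delta < tol:
            break
    # one possible scale normalization: rescale each (A_i, B_i) so that ||A_i||_F = 1
    for i in range(len(A)):
        c = np.linalg.norm(A[i])
        A[i], B[i] = A[i] / c, B[i] * c
    return A, B, G, Sig_r, Sig_c
```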
3.2 Approximated Penalized MLE with Kernel Truncation
The iterative algorithm in Section 3.1 requires inverting a large matrix in (15) when updating , i.e., the coefficients of the representer functions. One way to reduce the computational complexity without any approximation is to split the update of into blockwise updates, updating one block of parameters at a time following the order of . However, such a procedure still requires inverting a matrix of size , which could remain high-dimensional.
To circumvent the issue of inverting large matrices, we can approximate the linear combination of all representers using a set of basis functions, i.e., , where . For example, one can reduce the spatial resolution by subsampling a fraction of the rows and columns of the matrix and only use the representers at the subsampled “knots” as the basis functions. In this subsection, we consider an alternative approach by truncating the Mercer decomposition in (8). A similar technique can be found in Kang et al. (2018).
Given the eigen-decomposition of in (8), one can truncate the decomposition at the largest eigenvalue and get an approximation: . We will use the set of eigen-functions for faster computation. The choice of depends on the decaying rate of the eigenvalue sequence (thus the smoothness of the underlying functional parameters). Our simulation result shows that the estimation and prediction errors shrink monotonically as . Therefore, can be chosen based on the computational resources available. The kernel truncation speeds up the computation at the cost of providing an overly-smoothed estimator, as we demonstrate empirically in the supplemental material.
Given the kernel truncation, any functional parameter is now approximated as: . The parameter to be estimated now is , whose dimension is much lower than before (). Estimating requires solving a ridge regression problem, and the updating formula for generating at the iteration can be written as:
where satisfies , and , with being the largest eigenvalue of the Mercer decomposition of . Now we only need to invert a matrix of size , which speeds up the computation.
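A minimal sketch of the truncated computation, assuming the truncated basis is taken as the top-k eigenvectors of the Gram matrix (a Nyström-style stand-in for the truncated Mercer eigenfunctions) and an RBF kernel in place of the paper's spatial kernel:

```python
import numpy as np

def truncated_krr_fit(K, y, lam, k):
    """Kernel ridge regression using the top-k eigenpairs of the Gram matrix (sketch).

    Approximates f(s_i) ~ sum_{r<=k} theta_r * psi_r(s_i), where psi_r are the
    leading eigenvectors of K; only a k x k linear system is solved.
    """
    N = K.shape[0]
    evals, evecs = np.linalg.eigh(K)                   # ascending order
    evals, evecs = evals[::-1][:k], evecs[:, ::-1][:, :k]
    Psi = evecs                                        # (N, k) truncated basis on the grid
    # ridge problem in the truncated coordinates, penalized by the inverse eigenvalues
    theta = np.linalg.solve(Psi.T @ Psi + N * lam * np.diag(1.0 / evals),
                            Psi.T @ y)
    return Psi @ theta                                 # fitted values on the grid

# toy usage with an RBF Gram matrix on a 1-D grid
s = np.linspace(0, 1, 50)
K = np.exp(-(s[:, None] - s[None, :]) ** 2 / 0.05)
y = np.cos(2 * np.pi * s) + 0.1 * np.random.default_rng(4).normal(size=50)
f_hat = truncated_krr_fit(K, y, lam=1e-3, k=10)
print(f_hat.shape)  # (50,)
```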
3.3 Lag Selection
The MARAC model (1) has three hyper-parameters: the autoregressive lag , the auxiliary covariate lag , and the RKHS norm penalty weight . In practice, can be chosen by cross-validation, while choosing and requires a more formal model selection criterion. We propose to select and using information criteria, namely the Akaike Information Criterion (AIC) (Akaike, 1998) and the Bayesian Information Criterion (BIC) (Schwarz, 1978). We formally define the AIC and BIC for the MARAC model here and empirically validate their consistency via simulation experiments in Section 5.
Let be the set of the estimated parameters of the MARAC model, and be the effective degrees of freedom of the model. We can then define the AIC and the BIC as follows:
(19) | ||||
(20) |
To calculate , we decompose it into the sum of three components: 1) for each pair of the autoregressive coefficient , the model has degrees of freedom; 2) for the noise covariance , the model has degrees of freedom; and 3) for the auxiliary covariate functional parameters , inspired by the kernel ridge regression estimator in (15), we define the sum of their degrees of freedom as:
where . As , we have ; namely each covariate has free parameters, which then reduces to the element-wise linear regression model. Empirically, we find that the BIC is a consistent lag selection criterion for our model.
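For illustration, the sketch below assembles a BIC of the form (20) from the three degrees-of-freedom components; the exact parameter counts for the autoregressive and covariance blocks and the ridge-smoother trace formula are our own assumptions and may differ from the paper's bookkeeping.

```python
import numpy as np

def marac_bic(neg_loglik, T, m, n, d, P, Q, K, lam):
    """BIC for a fitted MARAC(P, Q) model (sketch with assumed df accounting)."""
    df_ar = P * (m * m + n * n - 1)                     # each (A_i, B_i) pair, minus one for scale
    df_cov = m * (m + 1) // 2 + n * (n + 1) // 2 - 1    # Kronecker covariance components
    N = K.shape[0]
    # ridge-smoother effective df, summed over the d*Q auxiliary coefficient surfaces;
    # as lam -> 0 this tends to d*Q*N, i.e. one free parameter per location per covariate
    df_fun = d * Q * np.trace(K @ np.linalg.inv(K + N * lam * np.eye(N)))
    df = df_ar + df_cov + df_fun
    return 2.0 * neg_loglik + np.log(T) * df
```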
4 Theoretical Analysis
This section presents the theoretical analyses of the MARAC model. We first formulate the condition under which the matrix and vector time series are jointly stationary. Under this condition, we then establish the consistency and asymptotic normality of the penalized MLE under fixed matrix dimensionality as . Finally, we consider the case where the matrix size goes to infinity as and derive the convergence rate of the penalized MLE estimator and also the optimal order of the functional norm penalty tuning parameter . Without loss of generality, we assume that the matrix and vector time series have zero means, and we use to denote the spatial dimensionality of the matrix data. All proofs are deferred to the supplemental material.
4.1 Stationarity Condition
To facilitate the theoretical analysis, we first make an assumption on the vector time series .
Assumption 4
The -dimensional auxiliary vector time series follows a stationary VAR process:
(21) |
where is the lag- transition matrix and has independent sub-Gaussian entries and is independent of .
Remark 5
With Assumption 4, we can combine the MARAC model for in (4) and the VAR process for in (21) as a single VAR model, with , as:
(22) |
with being the Hadamard product between matrices, being a matrix with all elements taking values from , and being a zero matrix. As (22) shows, can help forecast but not the other way around, indicating that is an exogenous predictor for .
Given the joint vector autoregressive model in (22), we derive the conditions for and to be jointly stationary in Theorem 6.
Theorem 6 (MARAC Stationarity Condition)
The proof of Theorem 6 follows from the stationarity analysis of the joint autoregressive model in (22). As a special case where , the stationarity condition in (23) is equivalent to and , where denotes the spectral radius of a square matrix.
Based on Theorem 6, the stationarity of the matrix and vector time series depends only on the autoregressive coefficients of the MARAC and VAR models; the tensor coefficients do not affect stationarity. The MARAC model can be extended to the joint autoregressive process (22), where the dynamics of and are modeled jointly. We stick to the simpler case here and leave the joint autoregression to future work.
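A numerical check of the stationarity condition can be done through companion-form spectral radii; the sketch below assumes the vectorized MARAC transition matrices are B_i ⊗ A_i and that the exogenous block makes the joint companion matrix block-triangular, so it suffices to check the two blocks separately.

```python
import numpy as np

def marac_ar_stationary(A, B, C):
    """Check the stationarity of the autoregressive part (sketch).

    A, B : MARAC autoregressive coefficients (vectorized transitions are B_i kron A_i)
    C    : VAR transition matrices for the auxiliary series z_t
    Returns True if both companion matrices have spectral radius < 1.
    """
    def companion_spectral_radius(Phis):
        k, p = Phis[0].shape[0], len(Phis)
        comp = np.zeros((k * p, k * p))
        comp[:k, :] = np.hstack(Phis)          # first block row holds the lag matrices
        if p > 1:
            comp[k:, :k * (p - 1)] = np.eye(k * (p - 1))
        return np.max(np.abs(np.linalg.eigvals(comp)))

    rho_x = companion_spectral_radius([np.kron(Bi, Ai) for Ai, Bi in zip(A, B)])
    rho_z = companion_spectral_radius(C)
    return rho_x < 1 and rho_z < 1

# toy check with one lag for each component
m, n, d = 3, 4, 2
A = [0.5 * np.eye(m)]; B = [0.5 * np.eye(n)]; C = [0.3 * np.eye(d)]
print(marac_ar_stationary(A, B, C))  # True
```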
4.2 Finite Spatial Dimension Asymptotics
In this subsection, we establish the consistency and asymptotic normality of the MARAC model estimators under the scenario that are fixed. Given a fixed matrix dimensionality, the functional parameters can only be estimated at fixed locations, and thus the asymptotic normality result is established for the corresponding tensor coefficient . In Section 4.3, we will discuss the double asymptotics when both . For the remainder of the paper, we denote the true model coefficient with an asterisk superscript, such as and .
To start with, we make an assumption on the Gram matrix :
Assumption 7
The minimum eigenvalue of is bounded below by a positive constant , i.e. .
As a result of Assumption 7, every has a unique kernel decomposition: . With this additional assumption, the first theoretical result we establish is the consistency of the covariance matrix estimator , which we summarize in Proposition 8.
Proposition 8 (Covariance Consistency)
Given this result, we can further establish the asymptotic normality of the other model estimators:
Theorem 9 (Asymptotic Normality)
Assume that the matrix time series follows the MARAC model (1) with i.i.d. noise following (3), that Assumptions 4 and 7 and the stationarity condition in Theorem 6 hold, and that . Additionally, assume that . Then, supposing are fixed and are known, and denoting as and for any , the penalized MLE of the MARAC model is asymptotically normal:
(24) |
where is:
and , and is defined as:
where with and:
The asymptotic distribution (24) indicates that all parameters are -consistent. Under a fixed matrix dimensionality , the functional parameters are estimated only at fixed locations. Hence, the convergence is at a parametric rate just like the autoregressive coefficient.
4.3 High Spatial Dimension Asymptotics
The previous section presents the asymptotic normality of the MARAC estimators under a fixed matrix dimensionality . In this section, we relax this assumption and establish the convergence rate of the MARAC estimators when . For technical reasons, we assume that all entries of are i.i.d. normally-distributed random variables following .
To establish the convergence rate of the MARAC estimators when , we need to make several additional assumptions.
Assumption 10
The spatial locations of the rows and columns of are sampled independently from a uniform distribution on .
Assumption 11
The spatial kernel function can be decomposed into the product of a row kernel and a column kernel that satisfies . Both have their eigenvalues decaying at a polynomial rate: with .
Assumption 11 yields a simple eigen-spectrum characterization of the spatial kernel , whose eigenvalues can be written as . Also, the Gram matrix is separable, i.e. , and all eigenvalues of have the form , where are the Gram matrices for the kernels , respectively.
Under Assumption 10, we further have and , as . Combined with Assumption 11, we can characterize the eigenvalues of as . We refer our readers to Koltchinskii and Giné (2000); Braun (2006) for more references about the eigen-analysis of the kernel Gram matrix. One can generalize Assumption 10 to non-uniform sampling, but here, we stick to this simpler setting.
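A quick numerical illustration of this separable eigen-structure (our own toy check with RBF factor kernels standing in for the row/column kernels): the eigenvalues of a Kronecker-product Gram matrix are exactly the pairwise products of the factor eigenvalues.

```python
import numpy as np

# eigenvalues of a separable Gram matrix K = K1 kron K2 are the pairwise
# products of the factor eigenvalues (RBF factors used as stand-ins)
s = np.linspace(0, 1, 12)
K1 = np.exp(-(s[:, None] - s[None, :]) ** 2 / 0.1)
K2 = np.exp(-(s[:, None] - s[None, :]) ** 2 / 0.2)
ev_kron = np.sort(np.linalg.eigvalsh(np.kron(K1, K2)))[::-1]
ev_prod = np.sort(np.outer(np.linalg.eigvalsh(K1), np.linalg.eigvalsh(K2)).ravel())[::-1]
print(np.allclose(ev_kron, ev_prod))  # True (up to numerical round-off)
```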
In Assumption 11, we assume the kernel separability to accommodate the grid structure of the spatial locations. We do not restrict to be an integer; it is simply a parameter that characterizes the smoothness of the functional parameters. With these assumptions, we are now ready to present the main result in Theorem 12.
Theorem 12 (Asymptotics for High-Dimensional MARAC)
Assume that Assumptions 4, 10 and 11 hold and is generated by the MARAC model (1) with having i.i.d. entries. Then as ( is fixed) and , and under the additional assumptions that:
1. ;
2. and as , with ;
3. as , where are , and , respectively, and is a constant that only relates to ;
4. for any , we have , where is a finite constant,
then we have:
(25) |
where . Furthermore, we also have:
(26) |
In Theorem 12, (25) gives the error bound for the autoregressive coefficients and (26) gives the error bound for the prediction made by the auxiliary time series, which involves the functional parameter estimators. As a special case of (25) where and is fixed, the convergence rate for the autoregressive coefficients is , which recovers the result in Theorem 9. For the discussion below, we use and as shorthand for the quantities on the left-hand sides of (25) and (26).
Remark 13 (Optimal Choice of and Phase Transition)
According to our proof, the error bound (26) can be decomposed into the sum of:
• nonparametric error: ,
• autoregressive error: ,
where the autoregressive error stems from the estimation error of . The nonparametric error resembles the result of nonparametric regression with RKHS norm penalty (Cui et al., 2018), where if the number of data points is and penalty tuning parameter is , then the nonparametric error is bounded by with an optimal . In our model, if there is no autoregressive error, the optimal tuning parameter satisfies . The number of data points in our case is , and we are short of in the optimal tuning parameter due to Assumption 11, where the eigenvalues of , ordered as , decay slower than . This is a special result for matrix-shaped data. It is also noteworthy that under the condition that , the autoregressive error dominates the nonparametric error.
To simplify the discussion of the optimal order of , we assume that , where is a constant. Under this condition, when , the optimal tuning parameter shows an interesting phase transition phenomenon under different spatial smoothness and matrix dimensionality , which we summarize in Table 1.
Optimal | Estimator Error
Based on the results in Table 1, the faster grows with respect to , the smaller the optimal tuning parameter is. This is an intuitive result since when one has more spatial locations, the observations are denser, and thus less smoothing is needed. Furthermore, we achieve an optimal tuning order of that is close to the classic nonparametric optimal rate at only under the regime where and . This regime specifies the scenario where the functional parameter is relatively unsmooth and the spatial dimensionality grows slowly with respect to . Only under this regime will the discrepancy between the nonparametric error and the autoregressive error remain small, leading to an optimal tuning parameter close to the result of nonparametric regression.
In (25), the constant appears in the error bound of the autoregressive term. This constant characterizes the spatial correlation of the matrix time series , conditioning on the auxiliary vector time series , and can vary across different assumptions made on the covariances of and . In Table 1, we assume that for some universal constant . Unfortunately, in practice, it is common to have as , which makes the autoregressive coefficients converge at a slower rate but does not affect the functional parameter convergence. We keep the constant in (25) to give a general result and leave its characterization under specific assumptions for future work.
5 Simulation Experiments
5.1 Consistency and Convergence Rate
In this section, we validate the consistency and convergence rate of the MARAC estimators. We consider a simple setup with and and simulate the autoregressive coefficients such that they satisfy the stationarity condition in Theorem 6. We specify both and to have symmetric banded structures. To simulate (we drop the lag subscript) from the RKHS , we choose to be the Lebedev kernel (Kennedy et al., 2013) and generate randomly from Gaussian processes with the Lebedev kernel as the covariance kernel. Finally, we simulate the auxiliary vector time series from a VAR process. We include more details and visualizations of the simulation setups in the supplemental material.
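For intuition, spatially smooth functional coefficients of this kind can be drawn as realizations of a zero-mean Gaussian process whose covariance is the chosen spatial kernel; the sketch below uses an RBF kernel as a generic stand-in for the Lebedev kernel used in the paper.

```python
import numpy as np

def sample_smooth_surface(grid_size, lengthscale=0.2, seed=0):
    """Sample a spatially smooth coefficient surface from a zero-mean GP (sketch).

    An RBF kernel is used as a generic stand-in covariance kernel.
    """
    g = np.linspace(0, 1, grid_size)
    S = np.array([[a, b] for a in g for b in g])
    sq_dists = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2 * lengthscale ** 2)) + 1e-8 * np.eye(S.shape[0])
    rng = np.random.default_rng(seed)
    f = rng.multivariate_normal(np.zeros(S.shape[0]), K)
    return f.reshape(grid_size, grid_size)

surface = sample_smooth_surface(10)
print(surface.shape)  # (10, 10)
```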
The evaluation metric is the root mean squared error (RMSE), defined as , where is the number of elements in . We consider and we report the average RMSE for . The dataset is configured with and . For each , we train the MARAC model with over frames of the matrix time series, choose the tuning parameter based on the prediction RMSE over a held-out validation set with , and validate the prediction performance over a -frame testing set. We simulate a sufficiently long time series and choose the training set starting from the first frame and the validation set right after the training set. The testing set is always fixed as the last frames of the time series. All results are reported with repetitions in Figure 2.

The result shows that all model estimators are consistent and the convergence rate, under a fixed spatial dimensionality, is close to (the black line in panel (a) shows a reference line of ), echoing the result in Theorem 9. As the spatial dimensionality increases, the RMSE for becomes even smaller, echoing the result in (25) and Table 1. The RMSE of the nonparametric estimators , under a fixed spatial dimensionality, also decay at a rate of , echoing the result in Theorem 9 as well. The RMSE of the covariance matrix estimator suggests that it is consistent, confirming the result of Proposition 8 and shows a convergence rate similar to , though we did not provide the exact convergence rate theoretically.
In this simulation, we fix the variance of each element of to be unity. Therefore, the optimal testing set prediction RMSE should be unity. When plotting the test prediction RMSE in (d), we subtract from all RMSE results and thus the RMSE should be interpreted as the RMSE for the signal part of the matrix time series. The test prediction RMSE for all cases converges to zero, and for matrices of higher dimensionality, we typically require more training frames to reach the same prediction performance.
To validate the theoretical result of the high-dimensional MARAC in Theorem 12, we also plot the RMSE of and against in panels (e) and (f) of Figure 2. The trend line is fitted by linear regression, and it shows that converges roughly at the rate of , which indicates that under this specific simulation setup. It also shows that the functional parameter's convergence rate is around , which coincides with our simulation setup where and the theoretical result in the last row of Table 1.
All the results reported in Figure 2 are based on the penalized MLE framework without the kernel truncation introduced in Section 3.2. Kernel truncation speeds up the computation, especially when the matrix dimensionality is high, at the cost of over-smoothing the functional parameter estimates. We illustrate the performance of the kernel truncation method in the supplemental material.
5.2 Lag Selection Consistency
In Section 3.3, we propose to select the lag parameters and of the MARAC model using information criteria such as AIC and BIC. To validate the consistency of these model selection criteria, we simulate data from a MARAC model with matrix dimensionality. We consider a candidate model class with and each model is fitted with frames with being chosen from a held-out validation set. In Table 2, we report the proportion of times that AIC and BIC select the correct , individually (first two numbers in each parenthesis), and jointly (last number in each parenthesis) from repetitions.
AIC | ||||
BIC |
5.3 Comparison with Alternative Methods
We compare our MARAC model against other competing methods for the matrix autoregression task. We simulate the matrix time series from a MARAC model, with , and the vector time series from VAR. The dataset is generated with . Under each , we simulate with varying matrix dimensionality with . We evaluate the performance of each method via the testing set prediction RMSE. Each simulation scenario is repeated 20 times.
Under each specification, we consider the following five competing methods besides our own MARAC model.
1. MAR (Chen et al., 2021): the matrix autoregression without the auxiliary vector covariates.
2. MAR with fixed-rank co-kriging (MAR+FRC) (Hsu et al., 2021).
3. MAR followed by a tensor-on-scalar linear model (MAR+LM) (Li and Zhang, 2017):
(27) where come from a pre-trained MAR model and can be a low-rank tensor. The MAR+LM model can be considered as a two-step procedure for fitting the MARAC model.
4. Pixel-wise autoregression (Pixel-AR): for each , a separate scalar autoregression is fitted.
5. Vector Autoregression with Exogenous Predictors (VARX), which vectorizes the matrix time series and stacks it with the vector time series as predictors.
The results of the average prediction RMSE obtained from the 20 repeated runs are plotted in Figure 3. Overall, our MARAC model outperforms the other competing methods under varying matrix dimensionality and lags. We make two additional remarks. First, when the matrix size is small (e.g., ), the vector autoregression model (VARX) performs almost as well as the MARAC model and better than the other methods. However, the performance of the VARX model deteriorates quickly as the matrix becomes larger, indicating that sufficient dimension reduction is needed for dealing with large matrix time series. The MARAC model is a parsimonious version of VARX for such purposes. Secondly, the MAR, MAR with fixed-rank co-kriging (MAR+FRC), and two-step MARAC (MAR+LM) all perform worse than MARAC. This shows that when the auxiliary time series predictors are present, it is sub-optimal to remove them from the model (MAR), incorporate them implicitly in the covariance structure (MAR+FRC), or fit them separately in a tensor-on-scalar regression model (MAR+LM). Putting both matrix predictors and vector predictors in a unified framework like MARAC is beneficial for improving prediction performance.

6 Application to Global Total Electron Content Forecast
For real data applications, we consider the problem of predicting the global total electron content (TEC) distribution, which we briefly introduce in Section 1. The TEC data we use is the IGS (International GNSS Service) TEC data, which are freely available from the National Aeronautics and Space Administration (NASA) Crustal Dynamics Data Information System (Hernández-Pajares et al., 2009). The spatial-temporal resolution of the data is . We downloaded the data for September 2017, and the whole month of data form a matrix time series with and . For the auxiliary covariates, we download the 15-minute resolution IMF Bz and Sym-H time series, which are parameters related to the near-Earth magnetic field and plasma (Papitashvili et al., 2014). We also download the daily F10.7 index, which measures the solar radio flux at 10.7 cm, as an additional auxiliary predictor. The IMF Bz and Sym-H time series are accessed from the OMNI dataset (Papitashvili and King, 2020) and the F10.7 index is accessed from the NOAA data repository (Tapping, 2013). These covariates measure the solar wind strengths. Strong solar wind might lead to geomagnetic storms that could increase the global TEC significantly.
We formulate our MARAC model for the TEC prediction problem as:
(28) |
where is the forecast latency time and includes the IMF Bz, Sym-H and F10.7 indices at time . We consider the forecasting scenario with , which corresponds to making forecasts from 15 minutes ahead up to 6 hours ahead. At each latency time, we fit our MARAC model following (28) with . We fit the MARAC model with kernel truncation approximation using basis functions from the truncated Lebedev kernel. As a comparison, we also fit the MAR model with and the MAR+LM model with , see the definition of MAR+LM model in (27). As a benchmark, we consider using to predict and name it the persistence model.
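The persistence benchmark and its pixel-wise RMSE are simple to compute; the sketch below (our own illustration, with assumed array shapes) evaluates it on a held-out set of time indices.

```python
import numpy as np

def persistence_rmse(X, h, test_idx):
    """Pixel-wise RMSE of the persistence benchmark X_hat_t = X_{t-h} (sketch).

    X        : array of shape (T, m, n) holding the matrix time series
    h        : forecast latency in frames
    test_idx : iterable of time indices belonging to the testing set
    """
    errs = [(X[t] - X[t - h]) ** 2 for t in test_idx]
    return float(np.sqrt(np.mean(errs)))
```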
The frames of matrix data are split into a training set, a validation set, and a testing set in chronological order. We choose the tuning parameter for MARAC based on the validation set prediction RMSE. The lag parameters are chosen for all models based on the BIC. To increase computational speed, we assume that the matrices are diagonal when fitting all models. We center all data using the training-set means of the matrix and vector time series.
In Figure 4(A), we report the pixel-wise prediction RMSE on the testing set. The result shows that when the latency time is low, the matrix autoregressive (MAR) model is sufficient for making the TEC prediction. As the latency time increases to around 4 to 5 hours, the auxiliary time series helps improve the prediction performance as compared to the MAR model. This coincides with the domain intuition that the disturbances from the solar wind to Earth’s ionosphere will affect the global TEC distribution but with a delay in time of up to several hours. The additional prediction gain from incorporating the auxiliary covariates vanishes as one further increases the latency time, indicating that the correlation of the solar wind and global TEC is weak beyond a 6-hour separation.
In Figure 4(B), we visualize an example of the TEC prediction across the competing methods under the 4-hour latency time scenario (i.e., h=16). The MAR and MAR+LM results are similar and do not resemble the ground truth very well. The global TEC typically has two peaks located symmetrically around the equator, and both models fail to capture this as they provide a single patch in the middle. The MARAC model, however, can capture this fine-scale structure in its prediction. To further showcase the MARAC model prediction result, we decompose the prediction from the autoregressive component and the auxiliary covariates component and visualize them separately. The auxiliary covariate component highlights a sub-region in the southern hemisphere with high TEC values, complementing the prediction made by the autoregressive component.

7 Summary
In this paper, we propose a new methodology for spatial-temporal matrix autoregression with non-spatial exogenous vector covariates. The model has an autoregressive component with bi-linear transformations on the lagged matrix predictors and an additive auxiliary covariate component with tensor-vector product between a tensor coefficient and the lagged vector covariates. We propose a penalized MLE estimation approach with a squared RKHS norm penalty and establish the estimator asymptotics under fixed and high matrix dimensionality. The model efficacy has been validated using both numerical experiments and an application to the global TEC forecast.
The application of our model can be extended to other spatial data with exogenous, non-spatial predictors and is not restricted to matrix-valued data; it can be generalized to the tensor setting and, potentially, to data without a grid structure or with missing entries. Furthermore, our model nests a simpler model that does not contain the autoregressive term, i.e. , and thus can be applied to matrix-on-scalar regression with spatial data. We leave these extensions to future research.
Supplementary Materials
The supplemental material contains technical proofs of all theorems and propositions of the paper and additional details and results of the simulation experiments.
Acknowledgements
The authors thank Shasha Zou, Zihan Wang, and Yizhou Zhang for helpful discussions on the TEC data. YC acknowledges support from NSF DMS 2113397 and NSF PHY 2027555.
References
- Akaike (1998) Hirotugu Akaike. Information Theory and an Extension of the Maximum Likelihood Principle. Selected Papers of Hirotugu Akaike, pages 199–213, 1998.
- Attouch et al. (2013) Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of Descent Methods for Semi-Algebraic and Tame Problems: Proximal Algorithms, Forward–Backward Splitting, and Regularized Gauss–Seidel Methods. Mathematical Programming, 137(1-2):91–129, 2013.
- Braun (2006) Mikio L Braun. Accurate Error Bounds for the Eigenvalues of the Kernel Matrix. The Journal of Machine Learning Research, 7:2303–2328, 2006.
- Cai and Yuan (2012) T Tony Cai and Ming Yuan. Minimax and Adaptive Prediction for Functional Linear Regression. Journal of the American Statistical Association, 107(499):1201–1216, 2012.
- Chen et al. (2021) Rong Chen, Han Xiao, and Dan Yang. Autoregressive Models for Matrix-valued Time Series. Journal of Econometrics, 222(1):539–560, 2021.
- Cheng and Shang (2015) Guang Cheng and Zuofeng Shang. Joint Asymptotics for Semi-nonparametric Regression Models with Partially Linear Structure. The Annals of Statistics, 43:1351–1390, 2015.
- Cressie (1986) Noel Cressie. Kriging Nonstationary Data. Journal of the American Statistical Association, 81(395):625–634, 1986.
- Cressie and Johannesson (2008) Noel Cressie and Gardar Johannesson. Fixed Rank Kriging for Very Large Spatial Data Sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226, 2008.
- Cressie and Wikle (2015) Noel Cressie and Christopher K Wikle. Statistics for Spatio-Temporal Data. John Wiley & Sons, 2015.
- Cui et al. (2018) Wenquan Cui, Haoyang Cheng, and Jiajing Sun. An RKHS-based Approach to Double-Penalized Regression in High-dimensional Partially Linear Models. Journal of Multivariate Analysis, 168:201–210, 2018.
- Dong et al. (2020) Mingwang Dong, Linfu Huang, Xueqin Wu, and Qingguang Zeng. Application of Least-Squares Method to Time Series Analysis for 4dpm Matrix. IOP Conference Series: Earth and Environmental Science, 455(1):012200, feb 2020. doi: 10.1088/1755-1315/455/1/012200. URL https://dx.doi.org/10.1088/1755-1315/455/1/012200.
- Fosdick and Hoff (2014) BK Fosdick and PD Hoff. Separable Factor Analysis with Applications to Mortality Data. The Annals of Applied Statistics, 8(1):120–147, 2014.
- Gu (2013) Chong Gu. Smoothing Spline ANOVA models, 2nd edition. Springer, New York, 2013.
- Guha and Guhaniyogi (2021) Sharmistha Guha and Rajarshi Guhaniyogi. Bayesian Generalized Sparse Symmetric Tensor-on-Vector Regression. Technometrics, 63(2):160–170, 2021.
- Guhaniyogi et al. (2017) Rajarshi Guhaniyogi, Shaan Qamar, and David B Dunson. Bayesian Tensor Regression. The Journal of Machine Learning Research, 18(1):2733–2763, 2017.
- Hall and Heyde (2014) Peter Hall and Christopher C Heyde. Martingale limit theory and its application. Academic press, 2014.
- Hamilton (2020) James D Hamilton. Time Series Analysis. Princeton University Press, 2020.
- Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition. Springer, New York, 2009.
- Hernández-Pajares et al. (2009) Manuel Hernández-Pajares, JM Juan, J Sanz, R Orus, A Garcia-Rigo, J Feltens, A Komjathy, SC Schaer, and A Krankowski. The IGS VTEC Maps: a Reliable Source of Ionospheric Information since 1998. Journal of Geodesy, 83:263–275, 2009.
- Hoff (2011) Peter D Hoff. Separable Covariance Arrays via the Tucker Product, with Applications to Multivariate Relational Data. Bayesian Analysis, 6(2):179–196, 2011.
- Hsu et al. (2021) Nan-Jung Hsu, Hsin-Cheng Huang, and Ruey S Tsay. Matrix Autoregressive Spatio-Temporal Models. Journal of Computational and Graphical Statistics, 30(4):1143–1155, 2021.
- Kang et al. (2018) Jian Kang, Brian J Reich, and Ana-Maria Staicu. Scalar-on-Image Regression via the Soft-Thresholded Gaussian Process. Biometrika, 105(1):165–184, 2018.
- Kennedy et al. (2013) Rodney A Kennedy, Parastoo Sadeghi, Zubair Khalid, and Jason D McEwen. Classification and Construction of Closed-form Kernels for Signal Representation on the 2-sphere. In Wavelets and Sparsity XV, volume 8858, pages 169–183. SPIE, 2013.
- Kolda and Bader (2009) Tamara G Kolda and Brett W Bader. Tensor Decompositions and Applications. SIAM review, 51(3):455–500, 2009.
- Koltchinskii and Giné (2000) Vladimir Koltchinskii and Evarist Giné. Random Matrix Approximation of Spectra of Integral Operators. Bernoulli, pages 113–167, 2000.
- Li and Zhang (2017) Lexin Li and Xin Zhang. Parsimonious Tensor Response Regression. Journal of the American Statistical Association, 112(519):1131–1146, 2017.
- Li et al. (2018) Xiaoshan Li, Da Xu, Hua Zhou, and Lexin Li. Tucker Tensor Regression and Neuroimaging Analysis. Statistics in Biosciences, 10(3):520–545, 2018.
- Li and Xiao (2021) Zebang Li and Han Xiao. Multi-linear Tensor Autoregressive Models. arXiv preprint arXiv:2110.00928, 2021.
- Liu et al. (2020) Yipeng Liu, Jiani Liu, and Ce Zhu. Low-rank Tensor Train Coefficient Array Estimation for Tensor-on-Tensor Regression. IEEE Transactions on Neural Networks and Learning Systems, 31(12):5402–5411, 2020.
- Lock (2018) Eric F Lock. Tensor-on-Tensor Regression. Journal of Computational and Graphical Statistics, 27(3):638–647, 2018.
- Luo and Zhang (2022) Yuetian Luo and Anru R Zhang. Tensor-on-Tensor Regression: Riemannian Optimization, Over-Parameterization, Statistical-Computational gap, and their Interplay. arXiv preprint arXiv:2206.08756, 2022.
- Papadogeorgou et al. (2021) Georgia Papadogeorgou, Zhengwu Zhang, and David B Dunson. Soft Tensor Regression. J. Mach. Learn. Res., 22:219–1, 2021.
- Papitashvili and King (2020) Natalia E. Papitashvili and Joseph H. King. Omni 5-min Data [Data set]. NASA Space Physics Data Facility, 2020. https://doi.org/10.48322/gbpg-5r77.
- Papitashvili et al. (2014) Natasha Papitashvili, Dieter Bilitza, and Joseph King. OMNI: a Description of Near-Earth Solar Wind Environment. 40th COSPAR scientific assembly, 40:C0–1, 2014.
- Rabusseau and Kadri (2016) Guillaume Rabusseau and Hachem Kadri. Low-rank Regression with Tensor Responses. Advances in Neural Information Processing Systems, 29, 2016.
- Rudelson and Vershynin (2013) Mark Rudelson and Roman Vershynin. Hanson-Wright Inequality and Sub-Gaussian Concentration. Electronic Communications in Probability, pages 1–9, 2013.
- Schölkopf et al. (2001) Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A Generalized Representer Theorem. In International conference on computational learning theory, pages 416–426. Springer, 2001.
- Schwarz (1978) Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, pages 461–464, 1978.
- Shang and Cheng (2013) Zuofeng Shang and Guang Cheng. Local and Global Asymptotic Inference in Smoothing Spline Models. The Annals of Statistics, 41:2608–2638, 2013.
- Shang and Cheng (2015) Zuofeng Shang and Guang Cheng. Nonparametric Inference in Generalized Functional Linear Models. The Annals of Statistics, 43:1742–1773, 2015.
- Shen et al. (2022) Bo Shen, Weijun Xie, and Zhenyu Kong. Smooth Robust Tensor Completion for Background/Foreground Separation with Missing Pixels: Novel Algorithm with Convergence Guarantee. The Journal of Machine Learning Research, 23(1):9757–9796, 2022.
- Stock and Watson (2001) James H Stock and Mark W Watson. Vector Autoregressions. Journal of Economic Perspectives, 15(4):101–115, 2001.
- Sun et al. (2022) Hu Sun, Zhijun Hua, Jiaen Ren, Shasha Zou, Yuekai Sun, and Yang Chen. Matrix Completion Methods for the Total Electron Content Video Reconstruction. The Annals of Applied Statistics, 16(3):1333–1358, 2022.
- Sun et al. (2023) Hu Sun, Ward Manchester, Meng Jin, Yang Liu, and Yang Chen. Tensor Gaussian Process with Contraction for Multi-Channel Imaging Analysis. In International Conference on Machine Learning, pages 32913–32935. PMLR, 2023.
- Sun and Li (2017) Will Wei Sun and Lexin Li. STORE: Sparse Tensor Response Regression and Neuroimaging Analysis. The Journal of Machine Learning Research, 18(1):4908–4944, 2017.
- Tapping (2013) KF Tapping. The 10.7 cm Solar Radio Flux (F10. 7). Space weather, 11(7):394–406, 2013.
- Tzeng and Huang (2018) ShengLi Tzeng and Hsin-Cheng Huang. Resolution Adaptive Fixed Rank Kriging. Technometrics, 60(2):198–208, 2018.
- van Zanten and van der Vaart (2008) JH van Zanten and Aad W van der Vaart. Reproducing Kernel Hilbert Spaces of Gaussian Priors. In Pushing the limits of contemporary statistics: contributions in honor of Jayanta K. Ghosh, pages 200–222. Institute of Mathematical Statistics, 2008.
- Wang et al. (2024) Di Wang, Yao Zheng, and Guodong Li. High-Dimensional Low-rank Tensor Autoregressive Time Series Modeling. Journal of Econometrics, 238(1):105544, 2024.
- Wang et al. (2019) Dong Wang, Xialu Liu, and Rong Chen. Factor Models for Matrix-valued High-dimensional Time Series. Journal of Econometrics, 208(1):231–248, 2019.
- Wang et al. (2017) Xiao Wang, Hongtu Zhu, and Alzheimer’s Disease Neuroimaging Initiative. Generalized Scalar-on-Image Regression Models via Total Variation. Journal of the American Statistical Association, 112(519):1156–1168, 2017.
- Wang et al. (2021) Zihan Wang, Shasha Zou, Lei Liu, Jiaen Ren, and Ercha Aa. Hemispheric Asymmetries in the Mid-latitude Ionosphere During the September 7–8, 2017 Storm: Multi-instrument Observations. Journal of Geophysical Research: Space Physics, 126:e2020JA028829, 4 2021. ISSN 2169-9402. doi: 10.1029/2020JA028829.
- Williams and Rasmussen (2006) Christopher K Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.
- Yang et al. (2020) Yun Yang, Zuofeng Shang, and Guang Cheng. Non-asymptotic Analysis for Nonparametric Testing. In 33rd Annual Conference on Learning Theory, pages 1–47. ACM, 2020.
- Younas et al. (2022) Waqar Younas, Majid Khan, C. Amory-Mazaudier, Paul O. Amaechi, and R. Fleury. Middle and Low Latitudes Hemispheric Asymmetries in and TEC during Intense Magnetic Storms of Solar Cycle 24. Advances in Space Research, 69:220–235, 1 2022.
- Yuan and Cai (2010) Ming Yuan and T Tony Cai. A Reproducing Kernel Hilbert Space Approach to Functional Linear Regression. The Annals of Statistics, 38(6):3412–3444, 2010.
- Zhou et al. (2013) Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor Regression with Applications in Neuroimaging Data Analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.
SUPPLEMENTARY MATERIAL
This supplemental material is organized as follows. In Section A, we prove Proposition 1 on the equivalence of the estimation problem of MARAC to a kernel ridge regression problem. In Section B, we prove Theorem 6 on the joint stationarity condition of the matrix and auxiliary vector time series. Then in Section C, we provide proofs of the theoretical results under fixed spatial dimensionality, including Proposition 8 and Theorem 9. In Section D, we present proofs of the theoretical results under high spatial dimensionality, namely Theorem 12. All essential lemmas used throughout the proofs are presented and proved in Section E. Finally, we include additional details and results of the simulation experiments in Section F.
In this supplemental material, we use $\lambda_{\max}(\cdot)$, $\lambda_{\min}(\cdot)$, and $\|\cdot\|_2$ to denote the maximum eigenvalue, minimum eigenvalue, and spectral norm of a matrix, respectively. We use $a \vee b$ and $a \wedge b$ to denote the maximum and minimum of $a$ and $b$, respectively. For two sequences of random variables, say $\{X_n\}$ and $\{Y_n\}$, we use $X_n \lesssim_p Y_n$ to denote the case where $X_n = O_p(Y_n)$, and $X_n \gtrsim_p Y_n$ to denote the case where $Y_n = O_p(X_n)$. We then use $X_n \asymp_p Y_n$ to denote the case where both $X_n \lesssim_p Y_n$ and $X_n \gtrsim_p Y_n$ hold.
Appendix A Proof of Proposition 1
Proof For each function , we can decompose it as follows:
where does not belong to the null space of nor the span of . Here we assume that the null space of contains only the zero function, so .
By the reproducing property of the kernel , we have , which is independent of , and therefore is independent of the prediction for in the MARAC model. In addition, for any , we have:
and the equality holds only if . Consequently, the global minimizer for the constrained optimization problem (7) must have . It then follows that the squared RKHS functional norm penalty for can be written as and the tensor coefficient satisfies . The remainder of the proof is straightforward by simple linear algebra and thus we omit it here.
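Proposition 1 reduces the penalized estimation of the functional component to a finite-dimensional kernel ridge regression in the span of the representers. As a minimal, self-contained sketch of this reduction, the snippet below fits a generic kernel ridge regression through its representer coefficients; the data, the squared-exponential kernel, and the function `kernel_ridge_fit` are illustrative stand-ins, not the MARAC implementation.

```python
import numpy as np

def kernel_ridge_fit(X, y, lam, lengthscale=0.5):
    """Kernel ridge regression solved through its representer coefficients.

    The penalized problem min_f (1/n) sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2
    has a minimizer in the span of the representers k(x_i, .), so it reduces to
    solving the finite-dimensional linear system (K + n * lam * I) alpha = y.
    """
    n = X.shape[0]
    # squared-exponential Gram matrix (a stand-in kernel choice)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * lengthscale ** 2))
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return K, alpha

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))                       # hypothetical 2-D inputs
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(100)
K, alpha = kernel_ridge_fit(X, y, lam=1e-2)
print("in-sample MSE:", np.mean((y - K @ alpha) ** 2))
```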
Appendix B Proof of Theorem 6
Proof Under Assumption 4 that the vector time series follows a VAR process, we can derive that the vectorized matrix time series and the vector time series jointly follow a VAR process, namely,
(29)
Let and . Denote the transition matrix in (29) at lag- as and the error term as ; then we can rewrite the VAR process in (29) as a VAR(1) (companion-form) process:
(30)
where we use to denote a zero matrix of size . For this VAR process to be stationary, we require that for all , where is the transition matrix in (30). The determinant can be simplified by column operations as:
where and , and setting completes the proof.
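The stationarity condition above can be verified numerically for any candidate transition matrices by forming the companion matrix of the joint VAR process and checking that its spectral radius is strictly below one, which is equivalent to the determinant condition on the unit disk. A minimal sketch with arbitrary toy coefficients (not the paper's fitted values):

```python
import numpy as np

def is_stationary(Phi_list):
    """Check stationarity of a VAR(P) process x_t = sum_p Phi_p x_{t-p} + e_t.

    det(I - Phi_1 z - ... - Phi_P z^P) != 0 for all |z| <= 1 is equivalent to
    the companion matrix having spectral radius strictly below 1.
    """
    P, d = len(Phi_list), Phi_list[0].shape[0]
    companion = np.zeros((d * P, d * P))
    companion[:d, :] = np.hstack(Phi_list)          # top block row of coefficients
    if P > 1:
        companion[d:, :-d] = np.eye(d * (P - 1))    # identity blocks below
    return np.max(np.abs(np.linalg.eigvals(companion))) < 1.0

rng = np.random.default_rng(1)
d = 6                                               # toy dimension of the stacked state
Phi1 = 0.4 * rng.standard_normal((d, d)) / np.sqrt(d)
Phi2 = 0.2 * rng.standard_normal((d, d)) / np.sqrt(d)
print(is_stationary([Phi1, Phi2]))                  # expected True for these scaled-down draws
```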
Appendix C Theory under Fixed Spatial Dimension
C.1 Proof of Proposition 8
Proof For brevity of presentation, we fix as , but the proofs presented below can be easily extended to an arbitrary . For the vectorized MARAC model (4), we can equivalently write it as:
(31)
where and . Using to denote the precision matrix for , we can rewrite the penalized likelihood in (11) for as:
(32)
where , is defined as:
We use to denote the ground truth of , respectively. We define and as:
The MARAC estimator, denoted as , is the minimizer of with .
In order to establish the consistency of , it suffices to show that for any constant :
(33)
This is because if (33) is established, then as we have:
approaching and thus we must have with probability approaching as , and the consistency is established since is arbitrary.
To prove (33), we first fix and let , thus we have:
(34)
which is a consistent estimator of for any given that and the matrix and vector time series are covariance-stationary. To see that , notice that:
(35)
and the first term converges to since . In the second term of (35), we have:
(36)
where , and . The convergence in probability in (36) holds due to the joint stationarity of and and the assumption that . We further note that the sequence is a martingale difference sequence (MDS), and we have by the central limit theorem (CLT) for martingale difference sequences (see Proposition 7.9 of Hamilton (2020)). Combining this result with (36), we conclude that the second term in (35) is and thus is consistent for .
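The martingale-difference CLT invoked here can be illustrated with a toy Monte Carlo: for an AR(1) series and i.i.d. noise independent of the past, the product of the lagged state and the innovation forms an MDS, and the variance of its scaled sums approaches the asymptotic value. All series and parameters below are illustrative, not the MARAC quantities:

```python
import numpy as np

rng = np.random.default_rng(2)
T, reps, phi, sigma = 2000, 500, 0.7, 1.0
scaled_sums = np.empty(reps)
for r in range(reps):
    e = sigma * rng.standard_normal(T + 1)
    x = np.zeros(T + 1)
    for t in range(1, T + 1):
        x[t] = phi * x[t - 1] + e[t]        # (approximately) stationary AR(1)
    u = x[:-1] * e[1:]                       # u_t = x_{t-1} * e_t is an MDS
    scaled_sums[r] = u.sum() / np.sqrt(T)    # T^{-1/2} * sum_t u_t
# asymptotic variance: E[x^2] * sigma^2 = sigma^4 / (1 - phi^2)
print(np.var(scaled_sums), sigma**4 / (1 - phi**2))
```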
Plugging into yields the profile likelihood of :
To prove (33), it suffices to show that:
(37)
since . Now, since , we can write , with . Using this new notation, we can rewrite as:
(38)
where we define the first two terms in (38) as .
By the Cauchy-Schwarz inequality, we have:
(39)
By the definition of , we have:
and notice that and are just rearranged versions of and , respectively. Therefore, by the joint stationarity of and , the time averages of and converge to rearranged versions of some constant auto-covariance matrices, and hence the term on the right-hand side of (39) is .
Given this argument, proving (37) is now equivalent to proving:
(40)
Define as the unconstrained minimizer of , then explicitly, we have:
where the final argument on the convergence in probability to is based on the fact that by the CLT of MDS. By the second-order Taylor expansion of at , we have:
(41)
where , for some . For any constant such that , let , where is also a constant that relates to only. Consequently, we have:
and thus . Conditioning on the event that , we first have to hold with probability approaching one, due to the consistency of . Furthermore, we also have:
where the last inequality holds with probability approaching one since . Utilizing these facts together with (41), we end up having:
(42)
for any such that . Since is an arbitrary positive constant and , we establish (40), thereby completing the proof.
C.2 Proof of Theorem 9
To prove Theorem 9, we first establish the consistency and the convergence rate of the estimators in Lemma 14 below.
Lemma 14
Under the same assumptions as Theorem 9, all model estimators for MARAC are -consistent, namely:
for . As a direct result, we also have:
We delay the proof of Lemma 14 to Section E.2. With this lemma, we are now ready to present the proof of Theorem 9.
Proof For simplicity of notation and presentation, we fix as , but the proof technique can be generalized to arbitrary . To start with, we revisit the updating rule for in (13). By plugging in the data-generating model for according to the MARAC model, we can transform (13) into:
where for any arbitrary matrix/tensor , we define as . One can simplify the estimating equation above by left-multiplying and then vectorizing both sides to obtain:
On the left-hand side of the equation above, we replace with their true values , since the discrepancies are of order and can thus be incorporated into the term given the -consistency of . On the right-hand side, we have:
where the process is a martingale difference sequence and the martingale central limit theorem (Hall and Heyde, 2014) implies that , and thus by the consistency of and , we can replace and with their true values and incorporate the remainders into .
Similar transformations can be applied to (14) and (15), where the penalty term is incorporated into due to the assumption that . With the notation that , , and , these transformed estimating equations can be converted altogether into:
(43)
where , and are defined as , and are the corresponding true coefficients.
In (43), we first establish that:
(44)
To prove (44), by the assumption that and are zero-meaned and jointly stationary, we have by Lemma 20 and Corollary 21, where . See details of Lemma 20 and Corollary 21 in Section E.1. Then since each element of is a linear combination of terms in (thus a continuous mapping), it is straightforward that (44) holds elementwise.
For the term on the right-hand side of (45), first notice that the sequence , where , is a zero-meaned, stationary vector martingale difference sequence (MDS), thanks to the independence of from the jointly stationary and . By the martingale central limit theorem (Hall and Heyde, 2014), we have:
(46)
Combining (45) and (46), we end up having:
(47)
The asymptotic distribution of can thus be derived by multiplying both sides of (47) by the inverse of . However, the matrix is not a full-rank matrix, because , where . As a remedy, let , then given the identifiability constraint that and the fact that is -consistent, we have . Therefore, we have:
(48)
Combining (47) and (48) and using Slutsky's theorem, we have , where and thus:
(49)
The final asymptotic distribution of and can be derived easily from (49) with the multivariate delta method, and we omit the details here.
Appendix D Theory under High Spatial Dimension
D.1 Proof of Theorem 12
Proof In this proof, we again fix as for ease of presentation, but the technical details can be generalized to arbitrary . Since we fix the lags to be , we drop the subscript of the coefficients for convenience.
Under the specification of the MARAC model, we restate the model as:
where and we introduce the following additional notations:
We will drop the subscript for convenience. Let , and be the true autoregressive and functional parameters. Correspondingly, let be the coefficients for the representers when evaluated on a matrix grid, i.e., is a discrete evaluation of on the matrix grid. Let . Using these new notations, the MARAC estimator is obtained by solving the following penalized least squares problem:
(50)
By fixing , the estimator for is given by , and can be explicitly written as:
(51)
Plugging (51) into (50) yields the profile likelihood for :
(52)
where is defined as:
(53)
and the second equality in (53) is by the Woodbury matrix identity. It can be seen that is positive semi-definite and has all of its eigenvalues within . We break the proof into several major steps. In the first step, we establish the following result on :
Proposition 15
In order to derive the convergence rate of , we still require one additional result:
Lemma 16
Under the assumptions of Theorem 12 and the requirement that , it holds that:
(55)
with probability approaching as , where is the minimum eigenvalue of a matrix and .
The proofs of Proposition 15 and Lemma 16 are relegated to Sections D.2 and E.3, respectively. Combining Proposition 15 and Lemma 16, we can derive the error bound of as:
(56)
Now with this error bound of the autoregressive parameter , we further derive the prediction error bound for the functional parameters. To start with, we have:
and we will bound the terms separately.
To bound , we first establish two lemmas.
Lemma 17
Lemma 18
We leave the proofs of Lemma 17 and Lemma 18 to Sections E.4 and E.5. By Lemma 18, we have:
And by Lemma 17, we have .
For , we have the following bound:
(57)
(58)
To bound , we can take advantage of the simpler form of using the Woodbury matrix identity in (53) and obtain:
In Lemma 20, which we state later in Section E.1, we show that for a -dimensional stationary vector autoregressive process, the covariance estimator is consistent in the spectral norm as long as . Therefore, since follows a stationary VAR process and its dimensionality is fixed, we have , and thus with probability approaching , we have . Hence, we have , where is a constant related to and . Combining this with the result in Proposition 15, we can bound via its upper bound (58) as:
(59)
Finally, for , we first notice that:
The upper bound of above can be further bounded by:
(60)
where is the norm of all the underlying functional parameters. The last inequality of (60) follows from the fact that the quadratic form led by is non-negative. To see why, first note that:
Then, we have the following lemma:
Lemma 19
If are symmetric, positive definite real matrices and is positive semi-definite, then is also positive semi-definite.
We leave the proof to Section E.6. Let and , then both and are positive definite and is positive semi-definite. By Lemma 19, we have being positive semi-definite and thus (60) holds.
Using the result in (60), we eventually have . Combining all the bounds for , we end up with:
where we drop the term as it is a higher order term of under the condition that .
D.2 Proof of Proposition 15
Proof The MARAC estimator is the minimizer of , defined in (52), for all and thus . Equivalently, this means that:
Let and , then the inequality can be simply written as , and we can upper bound our quantity of interest, namely , as:
Therefore, the bound of can be obtained via the bound of . We have the following upper bound for :
(61)
where the last inequality follows from the fact that is positive semi-definite.
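The positive semi-definiteness and the [0, 1] eigenvalue range claimed for the smoother-type matrix in (53) can be spot-checked numerically. Since the exact matrix in (53) is not reproduced here, the sketch below uses the generic kernel-ridge hat matrix K (K + lambda n I)^{-1} as a stand-in with the same structure:

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 200, 1e-2
X = rng.uniform(size=(n, 2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2.0 * 0.3 ** 2))            # PSD kernel Gram matrix
H = K @ np.linalg.inv(K + lam * n * np.eye(n))      # kernel-ridge hat matrix
eigs = np.linalg.eigvalsh((H + H.T) / 2.0)          # symmetrize for numerical safety
print(eigs.min() >= -1e-10, eigs.max() <= 1.0 + 1e-10)   # eigenvalues lie in [0, 1]
```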
Appendix E Technical Lemmas & Proofs
In this section, we first introduce Lemma 20 on the consistency of the covariance matrix estimator for any stationary vector autoregressive process, and then Corollary 21 on the consistency of the covariance estimator of our MARAC model, given the joint stationarity condition. We then provide the proof of Lemma 14, used in Section C.2 when proving Theorem 9 on the asymptotic normality under fixed spatial dimension. Finally, we prove Lemmas 16, 17, 18, and 19, used in Section D when proving the error bounds under high spatial dimensionality.
E.1 Statement of Lemma 20
In Lemma 20, we restate the results of Propositions 6 and 7 of Li and Xiao (2021), which cover the general result on the consistency of the estimator of the lag- auto-covariance matrix of a stationary VAR process.
Lemma 20
Let be a zero-meaned stationary VAR process: , where have independent sub-Gaussian entries. Let and , then we have:
(62)
where is an absolute constant.
We refer our readers to Appendix C.3 of Li and Xiao (2021) for the proof. As a corollary of Lemma 20, we have the following results:
Corollary 21
Assume that is generated by a stationary VAR process: , with having independent sub-Gaussian entries, then with and , we have:
(63)
with being an absolute constant and being a fixed positive real number, and thus .
Let be a zero-meaned matrix time series generated by the MARAC model with lag , where satisfies the assumption above and and are jointly stationary in the sense of Theorem 6. Assume further that has i.i.d. Gaussian entries with constant variance ; then for , and , we have:
(64)
where is an absolute constant.
Proof
The proof of (63) is straightforward from Lemma 20 together with Markov inequality. The proof of (64) also follows from Lemma 20 since follows a stationary VAR process with i.i.d. sub-Gaussian noise (see (29)) and due to stationarity.
Note that the convergence of the variance estimator in spectral norm also indicates that each element of the variance estimator converges in probability. Also, the assumption that has i.i.d. Gaussian entries can be relaxed to having independent sub-Gaussian entries.
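As a numerical illustration of Lemma 20 and Corollary 21, the toy Monte Carlo below (an arbitrary stationary VAR(1), not the MARAC process) shows the spectral-norm error of the sample auto-covariance shrinking with the sample length at roughly the expected square-root rate:

```python
import numpy as np

def sample_autocov(X, h):
    """Sample lag-h auto-covariance (1/T) sum_t x_{t+h} x_t^T of a zero-mean series."""
    T = X.shape[0] - h
    return X[h:h + T].T @ X[:T] / T

rng = np.random.default_rng(4)
d = 5
A = 0.5 * np.eye(d) + 0.1 * rng.standard_normal((d, d)) / np.sqrt(d)  # stationary VAR(1)
# the true lag-0 covariance solves the Lyapunov equation Sigma = A Sigma A^T + I
Sigma = np.eye(d)
for _ in range(500):
    Sigma = A @ Sigma @ A.T + np.eye(d)
for T in [500, 2000, 8000]:
    X = np.zeros((T + 100, d))
    for t in range(1, T + 100):
        X[t] = A @ X[t - 1] + rng.standard_normal(d)
    X = X[100:]                                       # drop the burn-in period
    err = np.linalg.norm(sample_autocov(X, 0) - Sigma, 2)
    print(T, err, np.sqrt(d / T))                     # error roughly tracks sqrt(d/T)
```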
E.2 Proof of Lemma 14
Proof Without loss of generality, we fix as 1 and use the same notation as (31) in Section C.1, so the MARAC model can be written as . Correspondingly, the penalized log-likelihood is specified by (32) and given any , we have as specified by (34). Given the decomposition of in (35), we have:
where since and the norm of the second term is . To show that the norm of the second term is , we first observe that:
For the sequence of random matrices , we have:
and we define the limiting matrix as . To show this, first note that the covariance estimator converges in probability to the true covariance , which we prove separately in Corollary 21. Second, notice that ; thus we have , and therefore the convergence in probability of to holds.
Notice that the limiting matrix is invertible because the matrix , defined as:
is invertible. To see why, firstly note that is invertible because we can express as , where is a sequence of matrices whose elements are absolutely summable and , therefore, we have . Secondly, by Assumption 7, we have and we also have by definition, therefore we have to be positive definite. The invertibility of and the fact that indicates that , since matrix inversion is a continuous function of the input matrix and the convergence in probability carries over under continuous transformations. Eventually, this leads to the conclusion that .
For the sequence of random matrices , we note that the sequence is a martingale difference sequence (MDS) such that (see Proposition 7.9 of Hamilton (2020) for the central limit theorem of martingale difference sequences). Combining the results for and , we conclude that .
Fixing , we can decompose via the second-order Taylor expansion as follows:
(65)
and recall that . In the previous proof, we have shown that , with being a positive definite matrix. Therefore, with probability approaching , we have .
With the lower bound on , we can claim that for some constant :
(66)
with probability approaching . Now suppose that belongs to the set , where is an arbitrary sequence that diverges to infinity. Within this set, we have:
(67)
thus for some sequence since . By the Taylor expansion in (65), we can conclude that , also using that . Combining this result with the order of , the following holds according to (66):
(68)
The result in (68) indicates that, for any that lies outside of the set , the penalized log-likelihood is no smaller than that of a sub-optimal solution with probability approaching . Therefore, with probability approaching , one must have . Since the choice of is arbitrary, we can conclude that , and thus each block of , namely , converges to its ground-truth value at the rate of .
The convergence rate of can be derived from the following inequality:
as well as the convergence rate of and .
E.3 Proof of Lemma 16
Proof Based on the definition of in equation (53), we have
(69)
where the second term in (69) is positive semi-definite since both and are non-negative and the whole term is symmetric. Therefore, by Weyl's inequality, one can lower bound by . For simplicity, we will use to denote , and to denote , respectively. We will use and to denote the estimated and true covariance matrices of . It is evident that and , since both and are blocks of and can thus be represented as , with being two block matrices with unit spectral norm.
The rest of the proof focuses on showing that with , . For brevity, we omit the subscript for the spectral norm notation and simply use in this proof.
To start with, we have:
(70)
Based on Corollary 21, under the condition that , and given that follows a stationary VAR process and is jointly stationary with , we have and . Therefore, with probability approaching , we have , and .
Combining these results and the upper bound in (70), with probability approaching , we have:
(71)
The upper bound in (71) can be arbitrarily small as since and .
Eventually, with probability approaching , we have:
(72)
This completes the proof.
E.4 Proof of Lemma 17
Proof By the definition of in (53), we have:
(73)
Using Lemma 20, we can bound by with probability approaching as . Conditioning on this high-probability event and using Assumption 11 that the kernel function is separable, the kernel Gram matrix can be written as , and thus (73) can be bounded as:
(74)
where . As , based on Assumption 10, we have and . Consequently, we can find two constants , with being sufficiently small and being sufficiently large, such that:
(75)
where we, with a slight abuse of notation, incorporate into . To estimate the order of the lower and upper bounds in (75), we first notice that for any constant , one has:
(76)
To approximate the sum in (76), notice that:
and furthermore, we have:
where . In the assumptions of Theorem 12, we assume that and where . As a result, we have being either a finite value or infinity, thus we have:
(77)
Combining (73), (74), (75) and (77), we have . To obtain the lower bound of , we have:
which holds with probability approaching , and the second inequality follows from (75). To further lower bound the double summation, we have:
This new lower bound can be approximated with the same method as (77) under the assumption that . We can obtain the lower bound of as , which establishes the final result.
The upper bound of is trivial since:
E.5 Proof of Lemma 18
Proof Given any fixed and , let and be some constant, then we have:
(78)
which holds by the Hanson-Wright inequality (Rudelson and Vershynin, 2013). Letting , , and using the fact that , we have:
(79)
We can lower bound the trace of as follows. First, note that:
By the assumption that is bounded and the fact that with probability approaching as , we have:
(80)
where . Since and as , with being either a positive constant or infinity, we have . Therefore, we have with probability approaching , as .
With these results, we can now upper bound the unconditional probability of the event as follows:
(81)
This indicates that .
The proof of is similar to the proof above. In the first step, similar to (78) and (79), we have the following tail probability bound:
(82)
We can actually establish the unboundedness of by following the same idea as the proof for Lemma 17, where we have:
with probability approaching , where is some constant. Therefore, we have . The rest of the proof follows the idea of (81), and we omit the details here.
E.6 Proof of Lemma 19
Proof For any two arbitrary symmetric matrices with identical sizes, we use to indicate that is positive semi-definite and we use to denote the symmetric, positive semi-definite square root matrix of .
Since is positive semi-definite, multiplying it by on both left and right sides of , we have . Therefore, we have . Notice that the matrix is invertible and thus has no zero eigenvalues. As a result, all eigenvalues of are the same as the eigenvalues of and thus . Multiplying both sides by on both the left and right sides yields , which completes the proof.
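Under the standard reading of Lemma 19 (if A and B are symmetric positive definite and A - B is positive semi-definite, then B^{-1} - A^{-1} is positive semi-definite), the implication can be spot-checked on random draws; the matrices below are arbitrary illustrations:

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """Check positive semi-definiteness up to a small numerical tolerance."""
    return np.all(np.linalg.eigvalsh((M + M.T) / 2.0) >= -tol)

rng = np.random.default_rng(5)
d = 6
for _ in range(100):
    W = rng.standard_normal((d, d))
    B = W @ W.T + 0.1 * np.eye(d)          # symmetric positive definite
    V = rng.standard_normal((d, d))
    A = B + V @ V.T                         # so that A - B is positive semi-definite
    assert is_psd(A - B)
    assert is_psd(np.linalg.inv(B) - np.linalg.inv(A))   # conclusion of Lemma 19
print("Lemma 19 verified on 100 random draws")
```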
Appendix F Additional Details on Simulation Experiments
We generate the simulated dataset according to the MARAC model specified by (1) and (3). We simulate the autoregressive coefficients such that they satisfy the stationarity condition specified in Theorem 6 and have a banded structure. We use a similar setup for generating with their diagonals fixed at unity. In Figure 5, we plot the simulated when .
To generate and mimic the spatial grid in our real data application in Section 6, we specify the 2-D spatial grid with the two dimensions being the latitude and longitude of points on a unit sphere . Each of the evenly spaced grid points has its polar-azimuthal coordinate pair as , and one projects the sampled grid points on the sphere onto a plane to form an matrix. The polar (co-latitude) and azimuthal (longitude) angles are commonly used in the spherical coordinate system, with the corresponding Euclidean coordinates being .
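A minimal sketch of generating such an evenly spaced polar-azimuthal grid and its Euclidean coordinates is given below; the grid size is illustrative and not the dimensions used in the paper:

```python
import numpy as np

def sphere_grid(n_lat, n_lon):
    """Evenly spaced (co-latitude, longitude) grid on the unit sphere S^2.

    Returns the polar/azimuthal angle grids and the Euclidean coordinates
    (sin(theta)cos(phi), sin(theta)sin(phi), cos(theta)).
    """
    theta = np.linspace(0.0, np.pi, n_lat)                    # polar (co-latitude) angle
    phi = np.linspace(0.0, 2 * np.pi, n_lon, endpoint=False)  # azimuthal (longitude) angle
    TH, PH = np.meshgrid(theta, phi, indexing="ij")           # n_lat x n_lon grid
    xyz = np.stack([np.sin(TH) * np.cos(PH),
                    np.sin(TH) * np.sin(PH),
                    np.cos(TH)], axis=-1)
    return TH, PH, xyz

TH, PH, xyz = sphere_grid(10, 20)   # hypothetical 10 x 20 spatial grid
print(xyz.shape)                    # (10, 20, 3)
```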
As for the spatial kernel, we choose the Lebedev kernel:
(83)
where denotes the angle between two points on the sphere and is a hyperparameter of the kernel. In the simulation experiment as well as the real data application, we fix . The Lebedev kernel has the spherical harmonic functions as its eigenfunctions:
where is a series of orthonormal real spherical harmonics bases defined on sphere :
with , and being the associated Legendre polynomials of order . We refer our readers to Kennedy et al. (2013) for detailed information about the spherical harmonic functions and the associated isotropic kernels. Under our 2-D grid setup and this choice of kernel, we find empirically that the kernel Gram matrix has an eigen-spectrum decaying at a rate of with .
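Since the exact form of the Lebedev kernel in (83) is not reproduced here, the sketch below assembles a generic isotropic kernel on the sphere from its spherical-harmonic expansion via the addition theorem, with a hypothetical polynomially decaying eigenvalue sequence `mu`, and inspects the eigen-spectrum of the resulting Gram matrix; it illustrates the eigen-decay behavior described above rather than the paper's exact kernel:

```python
import numpy as np
from scipy.special import eval_legendre

def isotropic_sphere_kernel(points, mu):
    """Isotropic kernel on S^2 assembled from a spherical-harmonic expansion.

    By the addition theorem, sum_m Y_lm(x) Y_lm(y) = (2l+1)/(4*pi) * P_l(cos(gamma)),
    so k(x, y) = sum_l mu[l] * (2l+1)/(4*pi) * P_l(cos(gamma)) has the spherical
    harmonics as eigenfunctions with eigenvalues mu[l] (each of multiplicity 2l+1).
    """
    cosg = np.clip(points @ points.T, -1.0, 1.0)   # cosine of the pairwise angles
    K = np.zeros_like(cosg)
    for l, mu_l in enumerate(mu):
        K += mu_l * (2 * l + 1) / (4 * np.pi) * eval_legendre(l, cosg)
    return K

rng = np.random.default_rng(6)
points = rng.standard_normal((200, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)   # 200 points on the unit sphere
mu = (1.0 + np.arange(31)) ** (-4.0)                      # hypothetical eigenvalue decay
K = isotropic_sphere_kernel(points, mu)
spectrum = np.sort(np.linalg.eigvalsh(K))[::-1]
print(spectrum[:5])                                       # empirical eigen-spectrum decay
```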
We randomly sample from Gaussian processes with covariance kernel being the Lebedev kernel in (83). Finally, we simulate the vector time series using a VAR process. In Figure 6, we visualize the simulated functional parameters as well as the vector time series from one random draw.
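Sampling spatially smooth coefficient slices from a Gaussian process on the grid reduces to drawing multivariate normal vectors whose covariance is the kernel Gram matrix. A minimal sketch with an illustrative stand-in kernel and a jittered Cholesky factor for numerical stability (the grid size and the function `sample_gp_slices` are hypothetical):

```python
import numpy as np

def sample_gp_slices(K, n_slices, shape, seed=0, jitter=1e-6):
    """Draw n_slices mean-zero GP realizations on a grid with Gram matrix K."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))   # jitter for stability
    draws = L @ rng.standard_normal((K.shape[0], n_slices))
    return draws.T.reshape((n_slices,) + shape)

# hypothetical 10 x 20 grid; any positive semi-definite Gram matrix on the
# 200 locations would do, here a simple isotropic stand-in kernel on the sphere
n_rows, n_cols = 10, 20
rng = np.random.default_rng(7)
pts = rng.standard_normal((n_rows * n_cols, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
K = np.exp(-2.0 * (1.0 - np.clip(pts @ pts.T, -1.0, 1.0)))
G_slices = sample_gp_slices(K, n_slices=3, shape=(n_rows, n_cols))
print(G_slices.shape)   # (3, 10, 20)
```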
In Figure 7, we visualize the ground truth of together with both its penalized MLE and truncated penalized MLE estimators. It is evident that the truncated penalized MLE estimators give a smooth approximation to , and the approximation improves as gets larger. For accuracy, should therefore be chosen as large as the available computational resources permit.
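If the truncation is interpreted as keeping only the leading eigenpairs of the kernel Gram matrix (a common spectral-truncation device; the precise truncated estimator is defined in the main text), the accuracy-versus-cost trade-off can be sketched with a hypothetical one-dimensional example; the function `truncated_kernel_smoother` below is an illustration, not the paper's estimator:

```python
import numpy as np

def truncated_kernel_smoother(K, y, lam, m):
    """Kernel-ridge-type fit restricted to the top-m eigenpairs of the Gram matrix.

    With K = V diag(d) V^T, keeping only the m leading eigenpairs reduces the
    per-fit cost while retaining the smoothest directions of the kernel.
    """
    d, V = np.linalg.eigh(K)
    idx = np.argsort(d)[::-1][:m]                 # leading m eigenpairs
    d_m, V_m = d[idx], V[:, idx]
    coef = V_m.T @ y / (d_m + lam * len(y))       # ridge shrinkage in the eigenbasis
    return V_m @ (d_m * coef)                     # truncated fitted values

rng = np.random.default_rng(8)
n = 300
x = np.sort(rng.uniform(size=n))
K = np.exp(-np.subtract.outer(x, x) ** 2 / (2.0 * 0.1 ** 2))
y = np.sin(4 * np.pi * x) + 0.2 * rng.standard_normal(n)
for m in [5, 20, 80]:
    fit = truncated_kernel_smoother(K, y, lam=1e-3, m=m)
    print(m, np.mean((fit - np.sin(4 * np.pi * x)) ** 2))   # error typically shrinks as m grows
```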