
Generalized Matrix Factor Model

Xinbing Kong and Tong Zhang

Xinbing Kong: School of Statistics and Data Science, Nanjing Audit University, Nanjing 211815, China. The author acknowledges the support from the NSF China (72342019, 12431009). Email: xinbingkong@126.com

Tong Zhang: School of Statistics and Data Science, Nanjing Audit University, Nanjing 211815, China. Co-first author. Email: zhangtongsta@163.com
Abstract

This article introduces a nonlinear generalized matrix factor model (GMFM) that allows for mixed-type variables, extending the scope of linear matrix factor models (LMFM), which are so far limited to handling continuous variables. We introduce a novel augmented Lagrange multiplier method, equivalent to constrained maximum likelihood estimation and carefully tailored to be locally concave around the true factor and loading parameters. This guarantees the local convexity of the negative Hessian matrix around the true parameters of the factors and loadings, which is nontrivial in matrix factor modeling and leads to feasible central limit theorems for the estimated factors and loadings. We also establish the convergence rates of the estimated factor and loading matrices for the GMFM under general conditions that allow for correlations across samples, rows, and columns. Moreover, we provide a model selection criterion that consistently determines the numbers of row and column factors. To numerically compute the constrained maximum likelihood estimator, we provide two algorithms: two-stage alternating maximization and minorization maximization. Extensive simulation studies demonstrate GMFM's superiority in handling discrete and mixed-type variables. An empirical analysis of companies' operating performance shows that GMFM performs well in clustering and reconstruction in the presence of discontinuous entries in the data matrix.


Keywords: Matrix factor model; Mixed-type measurement; Penalized likelihood function.

1 Introduction

Matrix sequences appear in diverse areas, such as computer vision, recommender systems, social networks, economics, and management. A matrix sequence takes the form of a three-way array consisting of a row dimension, a column dimension, and a third sequence dimension that might be a time domain. A natural question is "how are the entries generated?" In the high-dimensional scenario where both the numbers of rows and columns are large, an intuitive generative mechanism is that all rows and columns are formed by a latent low-rank structure. There are at least two parallel, but seemingly equivalent, streams of work that learn the low-rank formation of matrix-variate data. The first is via matrix (or tensor) regularization with low-rank penalization, e.g., a rank constraint or nuclear norm penalty, with or without noise. This approach has achieved great success in wide-ranging applications such as matrix completion, compressed sensing, and image localization (Mao et al. (2019) and Mao et al. (2021)). The second line of work is factor modeling, which posits that all entries of a matrix observation are driven by a low-dimensional factor matrix, and that the magnitude of each entry relies on the additive and/or interactive effects of the low-dimensional row features and column characteristics. In the present paper, we follow the factor modeling approach, but a regularized optimization technique is used to adapt to the nonlinearity of our proposed model.

For ease of presentation, let $\{X_t=(X_{ijt})_{p_1\times p_2};\,1\leq t\leq T\}$ be a matrix sequence (e.g., a matrix time series). There are also two ways to do matrix (or even tensor) factor analysis of $X_t$. The first is to "flatten" each $X_t$ into a lengthy vector by stacking its columns or rows: the first step is learning the vector factor space with existing vector factor analysis methods (Bai and Ng (2002), Bai and Li (2012), Fan et al. (2013), Kong (2017), Pelger (2019), Trapani (2018), Barigozzi and Trapani (2022), Barigozzi et al. (2022)), and the second step is recovering the row and column factor spaces from the vectorized factor space under certain restrictions, e.g., the Kronecker structure, c.f., Chen et al. (2024) and Chen et al. (2021b). The second collection of papers (Wang et al. (2019), Wang (2022), Chang et al. (2023), Yu et al. (2022), Yuan et al. (2023), He et al. (2022)) directly models each $X_t$ with carefully designed additive and/or interactive row and column effects. It is worth noting that Wang et al. (2019) introduced the first matrix factor model in the literature; Yu et al. (2022) provided a projection approach to estimate the twisted row and column factor spaces which has, so far to the best of our knowledge, the fastest convergence rates in the class of PCA procedures; Yuan et al. (2023) presented the first two-way additive matrix factor model, and Zhang et al. (2024) recently combined the two-way additive and interactive components in generating the matrix entries; the seminal work of Chen et al. (2022) extends to general tensor data.

However, the above works are limited to single-type continuous variables. In practical applications, such as social science and biology, high-dimensional data sets often involve mixed-type variables. For instance, in the context of corporate operational performance analysis, there are not only continuous variables, such as return on equity and fixed asset turnover, but also categorical variables like industry type, enterprise nature (e.g., state-owned or private enterprise), and regional classification. These categorical indicators are often used to depict the fundamental characteristics and background information of a company. There are also binary indicators measuring each company's social responsibility, such as whether the protection of employees' rights and interests is disclosed, and whether environmental sustainable development is disclosed. Collectively, the discrete variables encompassed in a company's economic indicators span multiple facets and play a pivotal role in portraying the company's economic standing, analyzing market trends, and devising business strategies. In network analysis (Jing et al. (2021), Jing et al. (2022), Chang et al. (2023) and Zhou et al. (2024)), such as international trading, the weight on each edge could be the trading direction (a binary variable indicating import or export) or the number of traded products (e.g., shoes, clothes, food), measured either in counts (count variables) or kilograms (continuous variables). For non-continuous variables, nonlinear models are usually employed, and closed-form expressions for estimating factors and their loadings do not exist, in contrast to the explicit principal component solution (with or without iteration) for linear two-way factor analysis, c.f., Chen and Fan (2023), Yu et al. (2022) and He et al. (2022). Moreover, with mixed-type variables, the nonlinear structure varies depending on the type of variable, posing challenges for statistical inference and computation. To the best of our knowledge, there is still a lack of study on a general nonlinear model with factor structure for matrix sequences with mixed-type variables. This motivates the first contribution of the present paper.

The literature has recently started to pay attention to latent factor structures with single- or mixed-type variables in vector factor modeling. Chen et al. (2021a) derived the asymptotic theory of the regression coefficients and the average partial effect for a class of generalized linear vector factor models that includes logit, probit, ordered probit and Poisson models as special cases. Permitting general nonlinearity, Wang (2022) investigated the maximum likelihood estimation of the factors and loadings. Liu et al. (2023) proposed a canonical exponential family model with factor structure for mixed-type variables. However, the likelihood approach for these nonlinear vector factor models cannot be trivially extended to matrix sequences, since the Hessian matrix, jointly in the row and column loadings and the factors, is more complex. Extra manipulation is needed to feasibly derive the convergence properties of the parameters related to the factor structure, especially the asymptotic normality of the estimates; see Section 2 for more details. This motivates the second contribution of the present paper.

In this paper, we assume that each entry is generated by the following nonlinear generalized matrix factor model (GMFM):

x_{ijt}\sim g_{ijt}\left(\cdot\,|\,\pi_{ijt}\right),\quad i=1,\ldots,p_1,\ j=1,\ldots,p_2,\ t=1,\ldots,T, \qquad (1)

where $x_{ijt}$ is an observed entry lying in the $i$th row and $j$th column of sample $t$; $g_{ijt}(\cdot|\cdot)$ is some known probability (density or mass) function of $x_{ijt}$, allowed to vary across $i$, $j$ and $t$, permitting a mixture of distributions within each matrix; $\pi_{ijt}=r_i^{\prime}F_tc_j$, with $r_i$ being a $k_1$-dimensional vector of row loadings, $c_j$ being a $k_2$-dimensional vector of column loadings, and $F_t$ being a $k_1\times k_2$ matrix of factors. Both factors and loadings are unobservable, and $k_1$ and $k_2$ are the numbers of row and column factors, respectively. In the extensively studied linear matrix factor model, $\pi_{ijt}=r_i^{\prime}F_tc_j$ represents the conditional mean of $x_{ijt}$, while for the non-continuous variables considered in the present paper, it might stand for the conditional log odds or log probabilities. Even for continuous variables, $\pi_{ijt}$ in model (1) can be a function of the conditional smoothly-winsorized-mean in the robust matrix factor model, c.f., He et al. (2024), which maps the $g_{ijt}(\cdot|\cdot)$'s to the Huber loss function. Thus model (1) is a quite flexible nonlinear matrix factor model. Certainly, to identify $\{r_i,F_t,c_j\}$, one needs identification constraints to form an identifiably feasible solution set.
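To fix ideas, the following minimal Python sketch simulates data from model (1); the dimensions and the per-cell mixture of Gaussian, Poisson and logit types are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, T, k1, k2 = 30, 20, 50, 2, 2   # hypothetical sizes

# Latent loadings and factors driving pi_ijt = r_i' F_t c_j.
R = rng.normal(size=(p1, k1))
C = rng.normal(size=(p2, k2))
F = rng.normal(size=(T, k1, k2))

# Each cell (i, j) keeps one variable type across the whole sequence.
kind = rng.choice(["gauss", "poisson", "logit"], size=(p1, p2))

X = np.empty((T, p1, p2))
for t in range(T):
    Pi = R @ F[t] @ C.T                          # canonical parameter matrix
    X[t] = np.where(kind == "gauss", Pi + rng.normal(size=(p1, p2)),
           np.where(kind == "poisson",
                    rng.poisson(np.exp(np.clip(Pi, -5, 5))),   # lambda = e^pi
                    rng.binomial(1, 1 / (1 + np.exp(-Pi)))))   # logit link
```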

In this paper, we consider maximum (quasi-)likelihood estimation with rotational constraints for the factors and loadings of the GMFM. We introduce a novel penalized likelihood function that not only covers the log-likelihood function and the identifiability restriction, but also includes an additional augmented Lagrangian term, carefully tuned to guarantee the local convexity of the negative Hessian matrix around the true parameters, and hence a concave penalized quasi-likelihood function in a neighborhood of the true parameters. This leads to a convenient derivation of the central limit theorems, and it is where the second contribution of the present paper comes from. This paper also establishes the convergence rates of the estimated factors and loadings. Our theory demonstrates that our estimators converge at rate $O_p\left(1/\min\left\{\sqrt{p_1p_2},\sqrt{p_1T},\sqrt{p_2T}\right\}\right)$ in terms of the averaged Frobenius norm. This rate surpasses $O_p(1/\min\{\sqrt{p_1p_2},\sqrt{T}\})$, the rate for generalized vector factor analysis achieved through vectorizing $X_t$ by Liu et al. (2023) and Wang (2022), particularly when the sequence length $T$ is relatively short. It is also no slower than the rate, $O_p(\{1/\sqrt{p_1}+1/\sqrt{Tp_2}\}\vee\{1/\sqrt{p_2}+1/\sqrt{Tp_1}\})$, of the non-likelihood method $\alpha$-PCA in the seminal paper by Chen and Fan (2023), especially when the data matrix is not balanced. Furthermore, to consistently estimate the pair of numbers of row and column factors, we present an information-type criterion under the GMFM. This set of new results forms the third contribution of the present paper. Motivated by the challenges posed by nonlinear structures and mixed-type variables, we also present extended versions of the two-stage alternating maximization algorithm inspired by Liu et al. (2023) and the minorization maximization algorithm inspired by Chen et al. (2021a).

The rest of the paper is organized as follows. Section 2 introduces the setup and the estimation methodology. Section 3 presents the asymptotic theorems, and Subsection 3.4 provides two computational algorithms. Section 4 is devoted to extensive simulation studies. Section 5 analyzes high-tech companies' operating performance. Section 6 concludes. All proofs are relegated to the supplementary materials.

2 Methodology and Setup

2.1 Methodology

With model (1), the (quasi) log-likelihood function is

L(X|r,f,c)=\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}l_{ijt}(r_i^{\prime}F_tc_j), \qquad (2)

where $l_{ijt}(\pi_{ijt})=\log g_{ijt}(x_{ijt}|\pi_{ijt})$ with $\pi_{ijt}=r_i^{\prime}F_tc_j$; $r=(r_1^{\prime},\ldots,r_{p_1}^{\prime})^{\prime}$ is a $p_1k_1$ vector, $c=(c_1^{\prime},\ldots,c_{p_2}^{\prime})^{\prime}$ is a $p_2k_2$ vector, and $f=(f_1^{\prime},\ldots,f_T^{\prime})^{\prime}$ with $f_t=\mathrm{vec}\{F_t\}$ is a $Tk_1k_2$ vector. The entries of $X$ can be continuous variables, count variables, binary variables, and so on. Representative examples are as follows.

Example 1 (Linear) $l_{ijt}(\pi_{ijt})=-\frac{1}{2}\left(x_{ijt}-r_i^{\prime}F_tc_j\right)^2$ is a likelihood function of Gaussian random variables and a quasi-likelihood function for other continuous random variables with homoscedasticity.

Example 2 (Poisson) $l_{ijt}(\pi_{ijt})=-e^{r_i^{\prime}F_tc_j}+kr_i^{\prime}F_tc_j-\log k!$, as $P(X_{ijt}=k)=e^{-\lambda}\lambda^k/k!$ with $\lambda=e^{r_i^{\prime}F_tc_j}$.

Example 3 (Probit) $l_{ijt}(\pi_{ijt})=x_{ijt}\log\Phi(r_i^{\prime}F_tc_j)+(1-x_{ijt})\log\left(1-\Phi(r_i^{\prime}F_tc_j)\right)$, where $\Phi(\cdot)$ is the cumulative distribution function of a standard normal random variable.

Example 4 (Logit) $l_{ijt}(\pi_{ijt})=x_{ijt}\log\Psi(r_i^{\prime}F_tc_j)+(1-x_{ijt})\log\left(1-\Psi(r_i^{\prime}F_tc_j)\right)$, where $\Psi(\cdot)$ is the cumulative distribution function of the logistic distribution.

Example 5 (Tobit) Let $x_{ijt}=x_{ijt}^{*}I(x_{ijt}^{*}>0)$, where $x_{ijt}^{*}=r_i^{\prime}F_tc_j+e_{ijt}$ and $e_{ijt}\sim N(0,1)$. Then $l_{ijt}(\pi_{ijt})=-\frac{1}{2}(x_{ijt}-r_i^{\prime}F_tc_j)^2I(x_{ijt}>0)+\log\left(1-\Phi(r_i^{\prime}F_tc_j)\right)I(x_{ijt}=0)$, where $I(\cdot)$ is the indicator function.
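For concreteness, a minimal Python sketch of these entry-wise log-likelihoods follows (the function names are our own helpers, not from any package); `norm.logcdf` supplies $\log\Phi$ and `expit` supplies $\Psi$.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit, gammaln

def l_linear(x, pi):   # Example 1: Gaussian (quasi-)likelihood
    return -0.5 * (x - pi) ** 2

def l_poisson(x, pi):  # Example 2: lambda = exp(pi)
    return -np.exp(pi) + x * pi - gammaln(x + 1.0)

def l_probit(x, pi):   # Example 3: 1 - Phi(pi) = Phi(-pi)
    return x * norm.logcdf(pi) + (1 - x) * norm.logcdf(-pi)

def l_logit(x, pi):    # Example 4: 1 - Psi(pi) = Psi(-pi)
    return x * np.log(expit(pi)) + (1 - x) * np.log(expit(-pi))

def l_tobit(x, pi):    # Example 5: Gaussian censored at zero
    return np.where(x > 0, -0.5 * (x - pi) ** 2, norm.logcdf(-pi))
```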

Let $R=(r_1,\ldots,r_{p_1})^{\prime}$ be the $p_1\times k_1$ row factor loading matrix, $C=(c_1,\ldots,c_{p_2})^{\prime}$ the $p_2\times k_2$ column factor loading matrix, and $F_t$ the $k_1\times k_2$ common factor matrix. Denote the parameters of interest by the vector $\theta=(r^{\prime},f^{\prime},c^{\prime})^{\prime}$ and the true values by $r_i^0$, $F_t^0$ and $c_j^0$, respectively. To well define a high-dimensional parameter space $\Theta$ of $\theta$, besides the boundedness condition on each row of $R$ and $C$ and on the factor matrix $F_t$ in Assumption 1, one needs constraints to identify the parameters. Note that for any invertible $k_1\times k_1$ matrix $G_1$ and $k_2\times k_2$ matrix $G_2$, $RG_1$, $G_1^{-1}F_tG_2^{-1}$ and $CG_2^{\prime}$ give the same likelihood as $R$, $F_t$ and $C$ do. To uniquely determine $R$, $F_t$ and $C$, without loss of generality, we impose the following identifiability restrictions.

\left\{\begin{array}{ll}
(3.1) & \frac{R^{\prime}R}{p_1}=\mathbb{I}_{k_1},\quad\frac{C^{\prime}C}{p_2}=\mathbb{I}_{k_2},\\
(3.2) & \sum_{t=1}^{T}\frac{F_tF_t^{\prime}}{T}\ \mbox{and}\ \sum_{t=1}^{T}\frac{F_t^{\prime}F_t}{T}\ \mbox{are diagonal matrices},\\
(3.3) & \mbox{the first nonzero element in each column of}\ R\ \mbox{and}\ C\ \mbox{is positive}.
\end{array}\right. \qquad (6)

The conditions (3.1) and (3.2) are frequently used in the literature on matrix factor models, c.f., He et al. (2022). Without condition (3.3), the factors and loadings can only be unambiguously determined up to a sign matrix (a diagonal matrix with $1$ and $-1$ on the diagonal) transformation. A naturally identified estimate of the vector of factor and loading parameters is then the maximizer of (2) subject to the constraints in (6). It can easily be shown that this constrained maximizer is equivalent, up to sign indeterminacy, to the solution of the following augmented Lagrange problem:

\max_{r,f,c\in\Theta}Q(r,f,c) = L(X|r,f,c)+P(r,f,c), \qquad (7)
P(r,f,c) = P_1(r,f,c)+P_2(r,f,c)+P_3(r,f,c), \qquad (8)

where

P_1(\theta) = -b_1p_1p_2T\Bigg[\frac{1}{2p_1^2}\sum_{p=1}^{k_1}\sum_{q>p}^{k_1}\Bigg(\sum_{i=1}^{p_1}r_{ip}r_{iq}\Bigg)^2+\frac{1}{8p_1^2}\sum_{k=1}^{k_1}\Bigg(\sum_{i=1}^{p_1}r_{ik}^2-p_1\Bigg)^2+\frac{1}{2T^2}\sum_{p=1}^{k_1}\sum_{q>p}^{k_1}\Bigg(\sum_{t=1}^{T}F_{tp\cdot}F_{tq\cdot}^{\prime}\Bigg)^2\Bigg], \qquad (9)

P_2(\theta) = -b_2p_1p_2T\Bigg[\frac{1}{2p_2^2}\sum_{p=1}^{k_2}\sum_{q>p}^{k_2}\Bigg(\sum_{j=1}^{p_2}c_{jp}c_{jq}\Bigg)^2+\frac{1}{8p_2^2}\sum_{k=1}^{k_2}\Bigg(\sum_{j=1}^{p_2}c_{jk}^2-p_2\Bigg)^2+\frac{1}{2T^2}\sum_{p=1}^{k_2}\sum_{q>p}^{k_2}\Bigg(\sum_{t=1}^{T}F_{t\cdot p}^{\prime}F_{t\cdot q}\Bigg)^2\Bigg], \qquad (10)

P_3(\theta) = -b_3p_2\sum_{t=1}^{T}\Bigg[\frac{1}{2p_1}\sum_{p=1}^{k_1}\sum_{q=1}^{k_2}\Bigg(\sum_{i=1}^{p_1}\Bigg(\frac{r_{ip}^2-1}{2}f_{t,pq}+\sum_{k\neq p}^{k_1}r_{ip}r_{ik}f_{t,kq}\Bigg)\Bigg)^2\Bigg], \qquad (11)

where $b_1$, $b_2$ and $b_3$ are positive Lagrange multipliers. The penalty terms $P_1(\theta)$ and $P_2(\theta)$ are associated with conditions (3.1) and (3.2). The additional augmented term $P_3(\theta)$ is constructed to ensure a locally positive definite property of the negative Hessian matrix of the penalized likelihood function $Q(\theta)$ around the true parameter vector $\theta^0$, by carefully trading off the non-diagonal blocks of the scaled expectation of the Hessian matrix of $L(\theta)$. This property yields a desirable geometric concavity of the augmented Lagrange function $Q(\cdot)$ and leads to an easy derivation of the second-order asymptotics, in particular the central limit theorems for the estimated parameters. Without this newly added term, introducing $P_1(\theta)+P_2(\theta)$ alone might not result in a desirable landscape around the local maxima of $Q(\theta)$. In the sequel, we shall consider the estimated factors and loadings as the solution to maximizing $Q(r,f,c)$ and obtain the asymptotic results for matrix factor analysis. The equivalence between (2)-(6) and (7) is due to the non-positivity of $P(\theta)$ and the fact that $Q(\theta)$ is maximized if and only if $P(\theta)=0$.
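As a numerical illustration, the penalty $P(\theta)$ in (8)-(11) can be evaluated in a few lines of numpy; the sketch below uses our own array conventions ($R$ is $p_1\times k_1$, $C$ is $p_2\times k_2$, and the factors are stored as a $T\times k_1\times k_2$ array) and placeholder multipliers.

```python
import numpy as np

def penalty(R, C, F, b1=1.0, b2=1.0, b3=1.0):
    """Augmented Lagrangian P = P1 + P2 + P3 of (8)-(11); a sketch."""
    p1, k1 = R.shape
    p2, k2 = C.shape
    T = F.shape[0]
    GR = R.T @ R                               # Gram matrix of row loadings
    GC = C.T @ C
    SF_row = np.einsum('tpq,tkq->pk', F, F)    # sum_t F_t F_t'
    SF_col = np.einsum('tpq,tpk->qk', F, F)    # sum_t F_t' F_t
    off = lambda A: np.triu(A, 1) ** 2         # squared strictly-upper entries

    P1 = -b1 * p1 * p2 * T * (off(GR).sum() / (2 * p1**2)
         + ((np.diag(GR) - p1) ** 2).sum() / (8 * p1**2)
         + off(SF_row).sum() / (2 * T**2))
    P2 = -b2 * p1 * p2 * T * (off(GC).sum() / (2 * p2**2)
         + ((np.diag(GC) - p2) ** 2).sum() / (8 * p2**2)
         + off(SF_col).sum() / (2 * T**2))
    # P3: inner sum over i equals A F_t, where A = GR with its diagonal
    # replaced by (diag(GR) - p1)/2, matching (11).
    A = 0.5 * (GR - p1 * np.eye(k1)) + 0.5 * (GR - np.diag(np.diag(GR)))
    P3 = -b3 * p2 * sum(np.sum((A @ F[t]) ** 2) for t in range(T)) / (2 * p1)
    return P1 + P2 + P3
```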

Throughout the paper, $(p_1,p_2,T)\to\infty$ means that $p_1$, $p_2$ and $T$ go to infinity jointly, and "w.p.a.1" abbreviates "with probability approaching $1$". For vectors, $\|\cdot\|$ denotes the Euclidean norm. For a matrix $A$, $\rho_{\min}(A)$ is its smallest eigenvalue, and $\|A\|$, $\|A\|_F$, $\|A\|_1$, $\|A\|_\infty$ and $\|A\|_{\max}$ denote the spectral norm, Frobenius norm, $1$-norm, infinity norm and max norm, respectively. When $A$ has $Tk$ rows, divide $A$ into $T$ blocks with each block containing $k$ rows, and let $[A]_{tq}$ denote the $q$th row in the $t$th block and $[A]_t=([A]_{t1}^{\prime},\ldots,[A]_{tk}^{\prime})$ the $t$th block. Define $L_{p_1p_2T}=\min\{\sqrt{p_1p_2},\sqrt{p_1T},\sqrt{p_2T}\}$.

2.2 Assumptions

In this section, we give some technical assumptions that are used to establish the asymptotic theories. In the sequel, MM will be a generic constant that might vary from line to line.

Assumption 1.

1. $\|r_i^0\|\leq M$, $\|c_j^0\|\leq M$ and $\|F_t^0\|\leq M$ for all $i$, $j$ and $t$.

2. $\frac{1}{T}\sum_{t=1}^{T}F_t^0F_t^{0\prime}=\operatorname{diag}(\sigma_{T1},\ldots,\sigma_{Tk_1})$ with $\sigma_{T1}\geq\cdots\geq\sigma_{Tk_1}$, and $\sigma_{Tk}\to\sigma_k$ as $T\to\infty$ for $k=1,\ldots,k_1$, with $\sigma_1>\cdots>\sigma_{k_1}>0$. $\frac{1}{T}\sum_{t=1}^{T}F_t^{0\prime}F_t^0=\operatorname{diag}(\sigma_{T1}^{\prime},\ldots,\sigma_{Tk_2}^{\prime})$ with $\sigma_{T1}^{\prime}\geq\cdots\geq\sigma_{Tk_2}^{\prime}$, and $\sigma_{Tl}^{\prime}\to\sigma_l^{\prime}$ as $T\to\infty$ for $l=1,\ldots,k_2$, with $\sigma_1^{\prime}>\cdots>\sigma_{k_2}^{\prime}>0$.

Assumption 1-1 indicates that $r_i^0$, $F_t^0$ and $c_j^0$ are uniformly bounded. As explained in Wang (2022), the compactness of the parameter space is a commonly encountered trait in nonlinear models; see more examples in Newey and McFadden (1986), Jennrich (1969) and Wu (1981). Assumption 1-2 assumes asymptotically distinct eigenvalues for $\frac{1}{T}\sum_{t=1}^{T}F_t^0F_t^{0\prime}$ and $\frac{1}{T}\sum_{t=1}^{T}F_t^{0\prime}F_t^0$ so that the eigenvectors can be uniquely determined. This is the same as Assumption B in Yu et al. (2022).

Let $\partial_\pi l_{ijt}(\pi_{ijt})$, $\partial_{\pi^2}l_{ijt}(\pi_{ijt})$ and $\partial_{\pi^3}l_{ijt}(\pi_{ijt})$ be the first, second and third order derivatives of $l_{ijt}$ evaluated at $\pi_{ijt}$, respectively. When these derivatives are evaluated at $\pi_{ijt}^0$, we suppress the argument and simply write $\partial_\pi l_{ijt}$, $\partial_{\pi^2}l_{ijt}$ and $\partial_{\pi^3}l_{ijt}$.

Assumption 2.

1. $l_{ijt}(\pi)$ is three times differentiable.

2. There exist $b_U>b_L>0$ such that $b_L\leq-\partial_{\pi^2}l_{ijt}(\pi_{ijt})\leq b_U$.

3. $|\partial_{\pi^3}l_{ijt}(\pi_{ijt})|\leq b_U$ within a compact space of $\pi_{ijt}$.

Assumption 2-1 imposes a smoothness condition on the log-likelihood function. Assumptions 2-2 and 2-3 require that the negative second-order derivative of the log-likelihood function be bounded from below and above, and that the third-order derivative be bounded. The boundedness of the second- and third-order derivatives is used to control the remainder term in the expansion of the first-order condition. The lower bound on the negative second-order derivative, together with the boundedness of $\pi_{ijt}$, is used to show the consistency of the estimated factors and loadings. It can be easily verified that commonly used models, such as the linear, logit, probit, Poisson and Tobit models, satisfy Assumption 2; the logit case is worked out below.
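For instance, consider the logit model of Example 4 and write $\Psi(\pi)=(1+e^{-\pi})^{-1}$. Using $\Psi^{\prime}(\pi)=\Psi(\pi)(1-\Psi(\pi))$, a direct calculation (a worked example we add for illustration) gives

\partial_\pi l_{ijt}(\pi)=x_{ijt}-\Psi(\pi),\quad\partial_{\pi^2}l_{ijt}(\pi)=-\Psi(\pi)(1-\Psi(\pi)),\quad\partial_{\pi^3}l_{ijt}(\pi)=-\Psi(\pi)(1-\Psi(\pi))(1-2\Psi(\pi)).

Since Assumption 1 confines $\pi_{ijt}=r_i^{\prime}F_tc_j$ to a compact set, say $|\pi_{ijt}|\leq\bar{M}$, one may take $b_L=\Psi(\bar{M})(1-\Psi(\bar{M}))>0$ and $b_U=1/4$, so that $b_L\leq-\partial_{\pi^2}l_{ijt}(\pi)\leq b_U$; moreover $|\partial_{\pi^3}l_{ijt}(\pi)|\leq\sup_{u\in[0,1]}|u(1-u)(1-2u)|<1/4\leq b_U$, verifying Assumption 2.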

Assumption 3.

1. $\operatorname{E}(|\partial_\pi l_{ijt}|^\xi)\leq M$ for some $\xi>14$ and all $i$, $j$ and $t$.

2. For any $i,j,t$, $\sum_{s=1}^{T}\sum_{l=1}^{p_1}\sum_{k=1}^{p_2}|\operatorname{E}\partial_\pi l_{ijt}\partial_\pi l_{lks}|\leq M$.

3. For every $(t,s)$,

\operatorname{E}\Big[p_1^{-1/2}p_2^{-1/2}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{k=1}^{p_2}\big(\partial_\pi l_{ijt}\partial_\pi l_{iks}-\operatorname{E}\partial_\pi l_{ijt}\partial_\pi l_{iks}\big)\Big]^2\leq M,
\operatorname{E}\Big[p_1^{-1/2}p_2^{-1/2}\sum_{i=1}^{p_1}\sum_{l=1}^{p_1}\sum_{j=1}^{p_2}\big(\partial_\pi l_{ijt}\partial_\pi l_{ljs}-\operatorname{E}\partial_\pi l_{ijt}\partial_\pi l_{ljs}\big)\Big]^2\leq M;

and for every $(j,k)$,

\operatorname{E}\Big[p_1^{-1/2}T^{-1/2}\sum_{i=1}^{p_1}\sum_{t=1}^{T}\sum_{s=1}^{T}\big(\partial_\pi l_{ijt}\partial_\pi l_{iks}-\operatorname{E}\partial_\pi l_{ijt}\partial_\pi l_{iks}\big)\Big]^2\leq M.

Notice that for the linear model, $\partial_\pi l_{ijt}$ is the error term and $\partial_{\pi^2}l_{ijt}$ is a constant. Many papers require the $\partial_\pi l_{ijt}$'s to be conditionally independent so that Hoeffding's inequality can be used in the proofs, such as He et al. (2023). This paper considers a more general situation: the distributions of $x_{ijt}$ are allowed to be heterogeneous over $i$, $j$ and $t$, and to exhibit limited cross-row (-column) and cross-sample (e.g., serial) conditional dependence. Assumption 3 gives some moment conditions for $\{\partial_\pi l_{ijt}\partial_\pi l_{ijs};1\leq i\leq p_1,1\leq j\leq p_2,1\leq t\leq T\}$. They are satisfied when, for example, the $\{\partial_\pi l_{ijt}\}$ are cross-sectionally and temporally weakly dependent conditional on $\theta\in\Theta$.

Assumption 4.

For some $\zeta>2$,

\operatorname{E}\Bigg(p_1^{-1}\sum_{i=1}^{p_1}\Bigg\|\frac{1}{\sqrt{p_2T}}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_\pi l_{ijt}F_t^0c_j^0\Bigg\|^\zeta\Bigg)\leq M,
\operatorname{E}\Bigg(p_2^{-1}\sum_{j=1}^{p_2}\Bigg\|\frac{1}{\sqrt{p_1T}}\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_\pi l_{ijt}F_t^{0\prime}r_i^0\Bigg\|^\zeta\Bigg)\leq M,
\operatorname{E}\Bigg(T^{-1}\sum_{t=1}^{T}\Bigg\|\frac{1}{\sqrt{p_1p_2}}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_\pi l_{ijt}\,c_j^0\otimes r_i^0\Bigg\|^\zeta\Bigg)\leq M.

Assumption 4 concerns the first derivatives of the log-likelihood function. Again, it is satisfied when the $\{\partial_\pi l_{ijt}\}$ are cross-sectionally and temporally weakly dependent conditional on $\theta\in\Theta$. To present the next assumption on the moments of the second derivatives of the log-likelihood function, we introduce some more notation. Define

\mathcal{H}_{Lrr^{\prime}}=\Bigg[\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}F_t^0c_j^0c_j^{0\prime}F_t^{0\prime}\Bigg]_{i=1}^{p_1},\quad\mathcal{H}_{Lff^{\prime}}=\Bigg[\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_{\pi^2}l_{ijt}\,c_j^0\otimes r_i^0\,c_j^{0\prime}\otimes r_i^{0\prime}\Bigg]_{t=1}^{T},\quad\mathcal{H}_{Lcc^{\prime}}=\Bigg[\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}F_t^{0\prime}r_i^0r_i^{0\prime}F_t^0\Bigg]_{j=1}^{p_2}.
Assumption 5.

1. Recall the notation $[A]_l$ for a matrix $A$. Then

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\Big(\frac{1}{p_2T}[\mathcal{H}_{Lrr^{\prime}}]_i\Big)^{-1}\partial_\pi l_{ijt}F_t^0c_j^0r_i^{0\prime}\Big\|^2\leq M,
\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\Big(\frac{1}{p_1p_2}[\mathcal{H}_{Lff^{\prime}}]_t\Big)^{-1}\partial_\pi l_{ijt}\,c_j^0\otimes r_i^0\,f_t^{0\prime}\Big\|^2\leq M,
\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\Big(\frac{1}{p_1T}[\mathcal{H}_{Lcc^{\prime}}]_j\Big)^{-1}\partial_\pi l_{ijt}F_t^{0\prime}r_i^0c_j^{0\prime}\Big\|^2\leq M.

2. For any $k$, $s$ and $l$,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big(\frac{1}{T}\sum_{t=1}^{T}\big(\partial_{\pi^2}l_{ikt}F_t^{0\prime}r_i^0c_k^{0\prime}F_t^{0\prime}+\partial_\pi l_{ikt}F_t^{\prime}\big)\Big)\Big(\frac{1}{p_2T}[\mathcal{H}_{Lrr^{\prime}}]_i\Big)^{-1}\big(\partial_\pi l_{ijt}F_t^0c_j^0\big)\Big\|^2\leq M,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big(\frac{1}{p_2}\sum_{j=1}^{p_2}\big(\partial_{\pi^2}l_{ijs}\,c_j^0\otimes r_i^0\,c_j^{0\prime}F_s^{0\prime}+\partial_\pi l_{ijs}\,c_j^0\otimes\mathbb{I}_{k_1}\big)\Big)\Big(\frac{1}{p_2T}[\mathcal{H}_{Lrr^{\prime}}]_i\Big)^{-1}\big(\partial_\pi l_{ijt}F_t^0c_j^0\big)\Big\|^2\leq M,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big(\frac{1}{p_1}\sum_{j=1}^{p_2}\big(\partial_{\pi^2}l_{ljt}F_t^0c_j^0c_j^{0\prime}\otimes r_l^{0\prime}+\partial_\pi l_{ljt}\,c_j^{0\prime}\otimes\mathbb{I}_{k_1}\big)\Big)\Big(\frac{1}{p_1p_2}[\mathcal{H}_{Lff^{\prime}}]_t\Big)^{-1}\big(\partial_\pi l_{ijt}\,c_j^0\otimes r_i^0\big)\Big\|^2\leq M,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big[\frac{1}{p_2}\sum_{j=1}^{p_2}\big(\partial_{\pi^2}l_{ikt}F_t^{0\prime}r_i^0c_k^{0\prime}\otimes r_i^{0\prime}+\partial_\pi l_{ikt}\,\mathbb{I}_{k_2}\otimes r_i^{0\prime}\big)\Big]\Big(\frac{1}{p_1p_2}[\mathcal{H}_{Lff^{\prime}}]_t\Big)^{-1}\big(\partial_\pi l_{ijt}\,c_j^0\otimes r_i^0\big)\Big\|^2\leq M,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big[\frac{1}{p_1}\sum_{i=1}^{p_1}\big(\partial_{\pi^2}l_{ijs}\,c_j^0\otimes r_i^0r_i^{0\prime}F_s^0+\partial_\pi l_{ijs}\,\mathbb{I}_{k_2}\otimes r_i^0\big)\Big]\Big(\frac{1}{p_1T}[\mathcal{H}_{Lcc^{\prime}}]_j\Big)^{-1}\big(\partial_\pi l_{ijt}F_t^{0\prime}r_i^0\big)\Big\|^2\leq M,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big[\frac{1}{T}\sum_{t=1}^{T}\big(\partial_{\pi^2}l_{ljt}F_t^0c_j^0r_l^{0\prime}F_t^0+\partial_\pi l_{ljt}F_t^0\big)\Big]\Big(\frac{1}{p_1T}[\mathcal{H}_{Lcc^{\prime}}]_j\Big)^{-1}\big(\partial_\pi l_{ijt}F_t^{0\prime}r_i^0\big)\Big\|^2\leq M.

3. For any $i$ and some positive definite matrices $\Sigma_{iR}$ and $\Omega_{iR}$,

-(p_2T)^{-1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}F_t^0c_j^0c_j^{0\prime}F_t^{0\prime}\xrightarrow{P}\Sigma_{iR},\qquad(p_2T)^{-1/2}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_\pi l_{ijt}F_t^0c_j^0\xrightarrow{d}\mathcal{N}(0,\Omega_{iR}).

For any $j$ and some positive definite matrices $\Sigma_{jC}$ and $\Omega_{jC}$,

-(p_1T)^{-1}\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}F_t^{0\prime}r_i^0r_i^{0\prime}F_t^0\xrightarrow{P}\Sigma_{jC},\qquad(p_1T)^{-1/2}\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_\pi l_{ijt}F_t^{0\prime}r_i^0\xrightarrow{d}\mathcal{N}(0,\Omega_{jC}).

For any $t$ and some positive definite matrices $\Sigma_{tF}$ and $\Omega_{tF}$,

-(p_1p_2)^{-1}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_{\pi^2}l_{ijt}(c_j^0\otimes r_i^0)(c_j^0\otimes r_i^0)^{\prime}\xrightarrow{P}\Sigma_{tF},\qquad(p_1p_2)^{-1/2}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_\pi l_{ijt}\,c_j^0\otimes r_i^0\xrightarrow{d}\mathcal{N}(0,\Omega_{tF}).

Though cumbersome, Assumption 5 can also be easily verified when the $\{\partial_\pi l_{ijt}\}$ are cross-sectionally and temporally weakly dependent conditional on $\theta\in\Theta$.

Assumption 6.

$\frac{(p_1p_2T)^{3/\xi}(p_1+p_2+T)^{1/\zeta}}{L_{p_1p_2T}}\to 0$ as $(p_1,p_2,T)\to\infty$.

Assumption 6 is a balance condition on $p_1$, $p_2$ and $T$, meaning that no dimension can completely dominate the other two; the degree of balance is controlled by $\xi$ and $\zeta$, and the condition is weaker when $\xi$ and $\zeta$ are larger. For example, when $\partial_\pi l_{ijt}$ has sub-Gaussian tails, $\xi$ can be arbitrarily large; and when the $\partial_\pi l_{ijt}$'s are conditionally Gaussian, a typical case in the linear model, $\zeta$ can be arbitrarily large.

3 Asymptotic Theory

Let $B(\mathcal{D})=\{(r,f,c):\|r\|_\infty+\|f\|_\infty+\|c\|_\infty\leq\mathcal{D}\}$ for some $\mathcal{D}$ large enough such that the true parameter $\theta^0$ lies in the interior of $B(\mathcal{D})$. Before stating the theoretical results, we give some more notation. Define $\widehat{r}=(\widehat{r}_1^{\prime},\ldots,\widehat{r}_{p_1}^{\prime})^{\prime}$, $\widehat{c}=(\widehat{c}_1^{\prime},\ldots,\widehat{c}_{p_2}^{\prime})^{\prime}$ and $\widehat{f}=(\widehat{f}_1^{\prime},\ldots,\widehat{f}_T^{\prime})^{\prime}$ as the solution to maximizing $Q$ within $B(\mathcal{D})$, where $\widehat{f}_t=\mathrm{vec}\{\widehat{F}_t\}$. Let $\widehat{\pi}_{ijt}=\widehat{r}_i^{\prime}\widehat{F}_t\widehat{c}_j$, $\widehat{\theta}=(\widehat{r}^{\prime},\widehat{f}^{\prime},\widehat{c}^{\prime})^{\prime}$, $\widehat{R}=(\widehat{r}_1,\ldots,\widehat{r}_{p_1})^{\prime}$ and $\widehat{C}=(\widehat{c}_1,\ldots,\widehat{c}_{p_2})^{\prime}$. Let $S(\theta)=\partial_\theta Q(\theta)$, $S_r(\theta)=\partial_r Q(\theta)$, $S_c(\theta)=\partial_c Q(\theta)$ and $S_f(\theta)=\partial_f Q(\theta)$ be the score functions, and let $\mathcal{H}(\theta)=\partial_{\theta\theta^{\prime}}Q(\theta)$ be the Hessian matrix. The decomposition of $\mathcal{H}(\theta)$ and the expression of each component are presented in Appendix A in the supplementary material. We suppress the argument when $S(\theta)$ and $\mathcal{H}(\theta)$ are evaluated at $\theta^0$, i.e., $S=S(\theta^0)$ and $\mathcal{H}=\mathcal{H}(\theta^0)$.

3.1 Convergence Rates

We first provide a consistency result for the estimates of factors and loadings in terms of their average Euclidean norm (or, equivalently, the Frobenius norm of the estimated row and column factor loading matrices and the factor matrix).

Proposition 1.

(Average Consistency). Under Assumptions 1-3, as $(p_1,p_2,T)\to\infty$,

\frac{1}{p_1}\|\widehat{r}-r^0\|^2+\frac{1}{p_2}\|\widehat{c}-c^0\|^2+\frac{1}{T}\|\widehat{f}-f^0\|^2=O_p\left(L_{p_1p_2T}^{-1}\right).

To derive finer convergence rates and the limit distributions of the estimated parameters, we need to utilize the first-order condition $S(\widehat{\theta})=0$. The following proposition demonstrates that $S(\widehat{\theta})=0$ w.p.a.1.

Proposition 2.

Under Assumptions 1-4 and Assumption 6, $S(\widehat{\theta})=0$ w.p.a.1.

Proposition 2 is nontrivial because the dimension of $\widehat{\theta}$ increases with $p_1$, $p_2$ and $T$. In a fixed-dimensional parameter space, the consistency of the estimated parameters, coupled with the assumption that the true parameters lie in the interior of the parameter space, ensures that the estimated parameters also lie in the interior; consequently, the first-order conditions are satisfied. However, when the dimension of the parameter space increases with $p_1$, $p_2$ and $T$, the average consistency of $\widehat{\theta}$ established in Proposition 1 is insufficient to ensure that $\widehat{\theta}$ remains an interior point of the parameter space. In this context, the uniform consistency of $\widehat{\theta}$ is required.

Proposition 3.

(Uniform Consistency). Under Assumptions 1-4 and 6,

\|\widehat{r}-r^0\|_\infty+\|\widehat{c}-c^0\|_\infty+\|\widehat{f}-f^0\|_\infty=O_p\left(\frac{(p_1p_2T)^{3/\xi}}{L_{p_1p_2T}}(p_1+p_2+T)^{1/\zeta}\right).

Note that $\xi$ and $\zeta$ could be arbitrarily large in some examples, as demonstrated below Assumption 6, and in such cases

\|\widehat{r}-r^0\|_\infty+\|\widehat{c}-c^0\|_\infty+\|\widehat{f}-f^0\|_\infty=O_p\left(L_{p_1p_2T}^{-1}\right).

Now we utilize the first-order conditions. Using the integral form of the mean value theorem for vector-valued functions to expand the first-order conditions, we have $0=S(\widehat{\theta})=S+\widetilde{\mathcal{H}}(\widehat{\theta}-\theta^0)$, where $\widetilde{\mathcal{H}}=\int_0^1\mathcal{H}\left(\theta^0+s(\widehat{\theta}-\theta^0)\right)ds$. It follows that

\left(\begin{array}{c}p_1^{-1/2}(\widehat{r}-r^0)\\ T^{-1/2}(\widehat{f}-f^0)\\ p_2^{-1/2}(\widehat{c}-c^0)\end{array}\right)=(p_1p_2T)^{-1/2}\left(-D_{p_2Tp_1}^{-1/2}\widetilde{\mathcal{H}}D_{p_2Tp_1}^{-1/2}\right)^{-1}D_{p_2Tp_1}^{-1/2}S, \qquad (15)

where

D_{p_1Tp_2}=\operatorname{diag}\{p_1\mathbb{I}_{p_1k_1},T\mathbb{I}_{Tk_1k_2},p_2\mathbb{I}_{p_2k_2}\},\quad D_{p_2Tp_1}=\operatorname{diag}\{p_2T\mathbb{I}_{p_1k_1},p_1p_2\mathbb{I}_{Tk_1k_2},p_1T\mathbb{I}_{p_2k_2}\}.

It is straightforward to see that $\|D_{p_2Tp_1}^{-1/2}S\|=O_p((p_1+p_2+T)^{1/2})$. Utilizing the local positive definiteness of $-\mathcal{H}(\theta)$, aided by the carefully designed augmented Lagrangian terms, we show in the supplementary material (Lemma A.3) that the largest eigenvalue of $(-D_{p_2Tp_1}^{-1/2}\mathcal{H}(\theta)D_{p_2Tp_1}^{-1/2})^{-1}$ is $O_p(1)$ in the neighborhood $B(\mathcal{D})\cap\{\|D_{p_1Tp_2}^{-1/2}(\theta-\theta^0)\|\leq m\}$ for some $m>0$. Since $\widehat{\theta}$ lies in this neighborhood w.p.a.1, $\|(-D_{p_2Tp_1}^{-1/2}\widetilde{\mathcal{H}}D_{p_2Tp_1}^{-1/2})^{-1}\|$ is $O_p(1)$. Then we have the following strengthened results.

Theorem 1.

(Average Rate). Under Assumptions 1-4 and Assumption 6, $\frac{1}{p_1}\|\widehat{r}-r^0\|^2+\frac{1}{p_2}\|\widehat{c}-c^0\|^2+\frac{1}{T}\|\widehat{f}-f^0\|^2=O_p\left(L_{p_1p_2T}^{-2}\right)$.

Comparable outcomes have been established for continuous variables. For example, the averaged convergence rates of $\theta$ for the least squares estimate in He et al. (2023) and the PE estimate in Yu et al. (2022) both reach $L_{p_1p_2T}^{-1}$. When the matrices are vectorized and the method for the generalized vector factor model is implemented, as in Liu et al. (2023) and Wang (2022), the averaged convergence rate of estimating the loading space spanned by $C\otimes R$ is $O_p(1/\min\{\sqrt{T},\sqrt{p_1p_2}\})$, which is no faster than ours, especially in scenarios where $p_1p_2$ outweighs $T$. Our rate is also no slower than the rate, $O_p(\{1/\sqrt{p_1}+1/\sqrt{Tp_2}\}\vee\{1/\sqrt{p_2}+1/\sqrt{Tp_1}\})$, of the non-likelihood method $\alpha$-PCA in the seminal paper by Chen and Fan (2023), especially when the data matrix is not balanced. Yet extra effort has to be carried out to separate $R$ and $C$ from their twisted Kronecker product.

3.2 Limit distributions

Now we proceed to the limit distributions of the estimated factors and loadings. Expanding the first-order conditions to a higher order, we have

0=S(\widehat{\theta})=S+\mathcal{H}(\widehat{\theta}-\theta^0)+\frac{1}{2}E,

where $E=(E_r^{\prime},E_f^{\prime},E_c^{\prime})^{\prime}$; $E_r$, $E_f$ and $E_c$ are $p_1k_1$-, $Tk_1k_2$- and $p_2k_2$-dimensional, with elements $E_{r,ik}=(\widehat{\theta}-\theta^0)^{\prime}\partial_{\theta\theta^{\prime}r_{ik}}Q(\theta_{ik}^{*})(\widehat{\theta}-\theta^0)$, $E_{f,tk}=(\widehat{\theta}-\theta^0)^{\prime}\partial_{\theta\theta^{\prime}f_{tk}}Q(\theta_{tk}^{*})(\widehat{\theta}-\theta^0)$ and $E_{c,jk}=(\widehat{\theta}-\theta^0)^{\prime}\partial_{\theta\theta^{\prime}c_{jk}}Q(\theta_{jk}^{*})(\widehat{\theta}-\theta^0)$, respectively, where $\theta_{ik}^{*}$, $\theta_{tk}^{*}$ and $\theta_{jk}^{*}$ are linear combinations of $\widehat{\theta}$ and $\theta^0$. Thus

\widehat{\theta}-\theta^0=-\mathcal{H}^{-1}S-\frac{1}{2}\mathcal{H}^{-1}E,\qquad\widehat{r}_i-r_i^0=\left[\widehat{\theta}-\theta^0\right]_i=-\left[\mathcal{H}^{-1}S\right]_i-\frac{1}{2}\left[\mathcal{H}^{-1}E\right]_i.

Utilizing the local landscape property of $\mathcal{H}$, we show in the supplementary material that

\left[\mathcal{H}^{-1}S\right]_i=\left(\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}F_t^0c_j^0c_j^{0\prime}F_t^{0\prime}\right)^{-1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_\pi l_{ijt}F_t^0c_j^0+O_p\left((p_1p_2T)^{-1/2}\right). \qquad (16)

The intuition behind (16) is that $\mathcal{H}$ is approximately block diagonal with the aid of the augmented Lagrangian terms $P_1(\theta)$, $P_2(\theta)$ and $P_3(\theta)$.

The elements within $\mathcal{H}$'s diagonal blocks are significantly larger than those in its off-diagonal blocks. This structure of $\mathcal{H}$ allows us to demonstrate that, in the expansion of $[\mathcal{H}^{-1}S]_i$, the additional terms resulting from the nonzero off-diagonal blocks collectively have an order of $O_p\left((p_1p_2T)^{-1/2}\right)$. Then, combining with Theorem 1, we show in the supplement that

\left\|\left[\mathcal{H}^{-1}E\right]_i\right\|=O_p\left(\frac{(p_1p_2T)^{3/\xi}}{L_{p_1p_2T}^{2}}\right). \qquad (17)

Thus if $\frac{(p_2T)^{1/2}(p_1p_2T)^{3/\xi}}{L_{p_1p_2T}^{2}}\to 0$, $\left\|[\mathcal{H}^{-1}E]_i\right\|$ is $o_p((p_2T)^{-1/2})$ and hence dominated by the first term on the right-hand side of (16).

Proposition 4.

(Individual Rate). Under Assumptions 1-4 and Assumption 6,

\|\widehat{r}_i-r_i^0\|=O_p(L_{p_1p_2T}^{-1})\ \mbox{ for each }i=1,\ldots,p_1,
\|\widehat{f}_t-f_t^0\|=O_p(L_{p_1p_2T}^{-1})\ \mbox{ for each }t=1,\ldots,T,
\|\widehat{c}_j-c_j^0\|=O_p(L_{p_1p_2T}^{-1})\ \mbox{ for each }j=1,\ldots,p_2.

From Proposition 4, we see that the convergence rate of each component of $\widehat{\theta}$ is the same as that of the least squares estimator in the LMFM of He et al. (2023), but slower than that of the PE estimator in Yu et al. (2022), unless one of $\left\{(Tp_1)^{-1},(Tp_2)^{-1},(p_1p_2)^{-1}\right\}$ dominates the others. This is because both the maximum likelihood and the least squares estimation derive estimators of $\theta$ jointly, while the PE method estimates $R$, $C$ and $F_t$ individually. Though the rate $O_p(L_{p_1p_2T}^{-1})$ is not sharp, it is enough for deriving the order of $\left[\mathcal{H}^{-1}E\right]_i$. Then we have the following central limit theorem.

Theorem 2.

(Individual Limit Distribution). Under Assumptions 1-6,

\sqrt{p_2T}\left(\widehat{r}_i-r_i^0\right)\xrightarrow{d}\mathcal{N}\left(0,\Sigma_{iR}^{-1}\Omega_{iR}\Sigma_{iR}^{\prime-1}\right)\ \text{ if }\ \frac{\sqrt{p_2T}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0,
\sqrt{p_1T}\left(\widehat{c}_j-c_j^0\right)\xrightarrow{d}\mathcal{N}\left(0,\Sigma_{jC}^{-1}\Omega_{jC}\Sigma_{jC}^{\prime-1}\right)\ \text{ if }\ \frac{\sqrt{p_1T}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0,
\sqrt{p_1p_2}\left(\widehat{f}_t-f_t^0\right)\xrightarrow{d}\mathcal{N}\left(0,\Sigma_{tF}^{-1}\Omega_{tF}\Sigma_{tF}^{\prime-1}\right)\ \text{ if }\ \frac{\sqrt{p_1p_2}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0,

where $\Sigma_{iR}$, $\Sigma_{jC}$, $\Sigma_{tF}$, $\Omega_{iR}$, $\Omega_{jC}$ and $\Omega_{tF}$ are defined in Assumption 5.

Remark 1.

The asymptotic variances of $\widehat{r}_i$, $\widehat{f}_t$ and $\widehat{c}_j$ can be consistently estimated, respectively, by

\widehat{\text{var}}_r=p_2T\left(\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})\widehat{F}_t\widehat{c}_j\widehat{c}_j^{\prime}\widehat{F}_t^{\prime}\right)^{-1}\left(\sum_{j=1}^{p_2}\sum_{t=1}^{T}\left(\partial_\pi l_{ijt}(\widehat{\pi}_{ijt})\right)^2\widehat{F}_t\widehat{c}_j\widehat{c}_j^{\prime}\widehat{F}_t^{\prime}\right)\left(\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})\widehat{F}_t\widehat{c}_j\widehat{c}_j^{\prime}\widehat{F}_t^{\prime}\right)^{-1},

\widehat{\text{var}}_f=p_1p_2\left(\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})(\widehat{c}_j\otimes\widehat{r}_i)(\widehat{c}_j^{\prime}\otimes\widehat{r}_i^{\prime})\right)^{-1}\left(\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\left(\partial_\pi l_{ijt}(\widehat{\pi}_{ijt})\right)^2(\widehat{c}_j\otimes\widehat{r}_i)(\widehat{c}_j^{\prime}\otimes\widehat{r}_i^{\prime})\right)\left(\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})(\widehat{c}_j\otimes\widehat{r}_i)(\widehat{c}_j^{\prime}\otimes\widehat{r}_i^{\prime})\right)^{-1},

\widehat{\text{var}}_c=p_1T\left(\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})\widehat{F}_t^{\prime}\widehat{r}_i\widehat{r}_i^{\prime}\widehat{F}_t\right)^{-1}\left(\sum_{i=1}^{p_1}\sum_{t=1}^{T}\left(\partial_\pi l_{ijt}(\widehat{\pi}_{ijt})\right)^2\widehat{F}_t^{\prime}\widehat{r}_i\widehat{r}_i^{\prime}\widehat{F}_t\right)\left(\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})\widehat{F}_t^{\prime}\widehat{r}_i\widehat{r}_i^{\prime}\widehat{F}_t\right)^{-1}.
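As an illustration, $\widehat{\text{var}}_r$ for a fixed row $i$ takes a few lines of numpy; the sketch below uses our own array conventions (`F_hat` is $T\times k_1\times k_2$, `C_hat` is $p_2\times k_2$, and `dl1`, `dl2` hold $\partial_\pi l_{ijt}(\widehat{\pi}_{ijt})$ and $\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})$ for that $i$).

```python
import numpy as np

def var_hat_r(F_hat, C_hat, dl1, dl2):
    """Sandwich estimator of the asymptotic variance of r_hat_i (Remark 1).

    F_hat: (T, k1, k2); C_hat: (p2, k2);
    dl1, dl2: (p2, T) first/second derivatives at pi_hat_{ijt}, fixed i.
    """
    p2, T = dl1.shape
    # z_{jt} = F_hat_t c_hat_j, stacked into shape (p2, T, k1)
    Z = np.einsum('tab,jb->jta', F_hat, C_hat)
    H = np.einsum('jt,jta,jtb->ab', dl2, Z, Z)       # sum d2l * z z'
    S = np.einsum('jt,jta,jtb->ab', dl1 ** 2, Z, Z)  # sum (dl)^2 * z z'
    Hinv = np.linalg.inv(H)
    return p2 * T * Hinv @ S @ Hinv.T
```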

Theorem 2 generalizes Theorem 3.2 of Yu et al. (2022) by allowing factors to be extracted from discrete or other nonlinear models, and it allows the probability function to differ across rows, columns and samples. Thus the vast quantity of discrete data available in macroeconomic and financial studies can be effectively utilized, either alone or in combination with continuous data. An immediate consequence of Theorem 2 and Remark 1 is the following corollary.

Corollary 1.

Under Assumptions 1-6,

\sqrt{p_2T}\left(\widehat{r}_i-r_i^0\right)/\sqrt{\widehat{\text{var}}_r}\xrightarrow{d}\mathcal{N}(0,1)\ \text{ if }\ \frac{\sqrt{p_2T}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0,
\sqrt{p_1T}\left(\widehat{c}_j-c_j^0\right)/\sqrt{\widehat{\text{var}}_c}\xrightarrow{d}\mathcal{N}(0,1)\ \text{ if }\ \frac{\sqrt{p_1T}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0,
\sqrt{p_1p_2}\left(\widehat{f}_t-f_t^0\right)/\sqrt{\widehat{\text{var}}_f}\xrightarrow{d}\mathcal{N}(0,1)\ \text{ if }\ \frac{\sqrt{p_1p_2}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0.

3.3 Consistency in determining the number of factors

Under the linear matrix factor model, Yu et al. (2022) proposed an eigenvalue-ratio test to consistently estimate the numbers of factors. However, the eigenvalue-ratio test is based on the ordered eigenvalues of the covariance matrix of $X_t$, which is not suitable for describing the correlation of non-continuous random variables. In this article, we use the idea of an information criterion to select the numbers of factors. The estimator of $(k_1,k_2)$ is similar to the penalized loss (PC) based estimator of Bai and Ng (2002), but is adapted to the matrix observations and the likelihood function. Given the numbers of factors $l_1,l_2$, let $\widehat{\theta}^{(l_1,l_2)}$ be the estimator of $\theta$. Our PC-based estimator of $(k_1,k_2)$ is defined as

(\widehat{k}_1,\widehat{k}_2)=\arg\min_{(l_1,l_2)}\left\{-\frac{1}{p_1p_2T}L\left(\widehat{\theta}^{(l_1,l_2)}\right)+(l_1+l_2)g(p_1,p_2,T)\right\}, \qquad (18)

where $g(p_1,p_2,T)$ is a prespecified penalty function. Theorem 3 below demonstrates that $(\widehat{k}_1,\widehat{k}_2)$ is consistent for $(k_1,k_2)$ under a suitable choice of $g(p_1,p_2,T)$.

Theorem 3.

If $0<g(p_1,p_2,T)\to 0$ and $g(p_1,p_2,T)L_{p_1p_2T}^2\to\infty$, then under Assumptions 1-4 and Assumption 6, $P(\widehat{k}_1=k_1)\to 1$ and $P(\widehat{k}_2=k_2)\to 1$ as $(p_1,p_2,T)\to\infty$.

In our simulation studies and real data analysis, we choose the penalty $g(p_1,p_2,T)=\frac{p_1+p_2+T}{p_1p_2T}\ln\left(\frac{p_1p_2T}{p_1+p_2+T}\right)$, which satisfies the conditions in Theorem 3 and is confirmed to work well.
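In code, the selection rule (18) with this penalty is a straightforward grid search. In the sketch below, `fit_gmfm` is a hypothetical routine (standing in for Algorithm 1 or 2 of Section 3.4) that returns the maximized log-likelihood $L(\widehat{\theta}^{(l_1,l_2)})$.

```python
import numpy as np

def select_factor_numbers(X, fit_gmfm, max_k=5):
    """PC-type criterion (18): grid search over candidate (l1, l2)."""
    T, p1, p2 = X.shape
    n, d = p1 * p2 * T, p1 + p2 + T
    g = d / n * np.log(n / d)                 # penalty g(p1, p2, T)
    best, best_pair = np.inf, (1, 1)
    for l1 in range(1, max_k + 1):
        for l2 in range(1, max_k + 1):
            loglik = fit_gmfm(X, l1, l2)      # hypothetical fitting routine
            crit = -loglik / n + (l1 + l2) * g
            if crit < best:
                best, best_pair = crit, (l1, l2)
    return best_pair
```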

3.4 Algorithms

We shall introduce two algorithms, two-stage alternating maximization and minorization maximization, to numerically compute the maximum likelihood estimator. The second is computationally simpler, but so far it has been shown to be applicable only to the probit, logit and Tobit models (Wang (2022)).

3.4.1 Two-stage alternating maximization (TSAM)

Let $\mathcal{S}_{1j}$ be an indicator set pertaining to a specific variable type in the $j$th column of $X$ with respect to $(i,t)$, such that the cardinality of $\mathcal{S}_{1j}$ and $p_1T$ are of comparable magnitude. Similarly, let $\mathcal{S}_{2i}$ be an indicator set pertaining to a specific variable type in the $i$th row of $X$ with respect to $(j,t)$, such that the number of elements in $\mathcal{S}_{2i}$ and $p_2T$ are comparable. And let $\mathcal{S}_{3t}$ be an indicator set associated with a particular variable type in the $t$th sequence of $X$ with respect to $(i,j)$, where the cardinality of $\mathcal{S}_{3t}$ and $p_1p_2$ are comparable in magnitude. Due to the finite number of variable types in $X$, $\mathcal{S}_{1j}$, $\mathcal{S}_{2i}$ and $\mathcal{S}_{3t}$ are well defined; a small sketch of how such index sets might be collected is given below.
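For example, with a $p_1\times p_2$ array `kind` recording each cell's variable type (as in the simulation sketch in the Introduction, where types vary over cells but not over $t$), one hypothetical way to collect the index sets is:

```python
import numpy as np

def index_sets(kind, vtype, T):
    """S_{1j}: (i,t) pairs of type vtype in column j; likewise S_{2i}, S_{3t}."""
    p1, p2 = kind.shape
    S1 = {j: [(i, t) for i in range(p1) if kind[i, j] == vtype
              for t in range(T)] for j in range(p2)}
    S2 = {i: [(j, t) for j in range(p2) if kind[i, j] == vtype
              for t in range(T)] for i in range(p1)}
    S3 = {t: [(i, j) for i in range(p1) for j in range(p2)
              if kind[i, j] == vtype] for t in range(T)}
    return S1, S2, S3
```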

Algorithm 1. Step 1 (Initialization): Randomly generate the initial values of $\widetilde{r}^{(0)}$ and $\widetilde{f}^{(0)}$.

Step 2 (Updating): For $k\geq 0$, calculate

\widetilde{c}_j^{(k)}=\operatorname{argmax}_{c_j}\sum_{(i,t)\in\mathcal{S}_{1j}}l_{ijt}\left(\widetilde{r}_i^{(k)\prime}\widetilde{F}_t^{(k)}c_j\right),\quad j=1,\ldots,p_2,
\widetilde{r}_i^{(k+1)}=\operatorname{argmax}_{r_i}\sum_{(j,t)\in\mathcal{S}_{2i}}l_{ijt}\left(r_i^{\prime}\widetilde{F}_t^{(k)}\widetilde{c}_j^{(k)}\right),\quad i=1,\ldots,p_1,
\widetilde{F}_t^{(k+1)}=\operatorname{argmax}_{F_t}\sum_{(i,j)\in\mathcal{S}_{3t}}l_{ijt}\left(\widetilde{r}_i^{(k+1)\prime}F_t\widetilde{c}_j^{(k)}\right),\quad t=1,\ldots,T.

Repeat the iteration until convergence and denote the resulting estimators by $\widetilde{r}$, $\widetilde{f}$ and $\widetilde{c}$.

Step 3 (Correction): A second-stage update is then conducted based on the score function and Hessian matrix:

\widehat{r}_i=\widetilde{r}_i-\left\{\partial_{r_ir_i^{\prime}}L(X|\widetilde{r},\widetilde{f},\widetilde{c})\right\}^{-1}\partial_{r_i}L(X|\widetilde{r},\widetilde{f},\widetilde{c}),\quad i=1,\ldots,p_1, \qquad (19)
\widehat{f}_t=\widetilde{f}_t-\left\{\partial_{f_tf_t^{\prime}}L(X|\widetilde{r},\widetilde{f},\widetilde{c})\right\}^{-1}\partial_{f_t}L(X|\widetilde{r},\widetilde{f},\widetilde{c}),\quad t=1,\ldots,T, \qquad (20)
\widehat{c}_j=\widetilde{c}_j-\left\{\partial_{c_jc_j^{\prime}}L(X|\widetilde{r},\widetilde{f},\widetilde{c})\right\}^{-1}\partial_{c_j}L(X|\widetilde{r},\widetilde{f},\widetilde{c}),\quad j=1,\ldots,p_2. \qquad (21)

Step 4 (Repetition): Repeat steps 1-3 a number of times and take the solution with the largest likelihood.

Step 5 (Normalization): Let R^(s)\widehat{R}^{(s)}, C^(s)\widehat{C}^{(s)} and F^(s)\widehat{F}^{(s)} be the estimators from step 4. To ensure the identification conditions in (6), a normalization step could be applied to the factor and loading matrices. Specifically, do singular value decomposition to R^(s)\widehat{R}^{(s)} and C^(s)\widehat{C}^{(s)} as follows.

\widehat{R}^{(s)}=U_{R}H_{R}V_{R}^{\prime}:=U_{R}Q_{R},\quad\widehat{C}^{(s)}=U_{C}H_{C}V_{C}^{\prime}:=U_{C}Q_{C},

where $Q_R=H_RV_R^{\prime}$ and $Q_C=H_CV_C^{\prime}$.

Define

Σ1=1Tp1p2t=1TQRFtQCQCFtQR,Σ2=1Tp1p2t=1TQCFtQRQRFtQC,\displaystyle{\Sigma}_{1}=\frac{1}{Tp_{1}p_{2}}\sum_{t=1}^{T}{Q}_{R}{F}_{t}{Q}_{C}^{\prime}{Q}_{C}{F}_{t}^{\prime}{Q}_{R}^{\prime},\quad{\Sigma}_{2}=\frac{1}{Tp_{1}p_{2}}\sum_{t=1}^{T}{Q}_{C}{F}_{t}^{\prime}{Q}_{R}^{\prime}{Q}_{R}{F}_{t}{Q}_{C}^{\prime},

and then compute their eigenvalue decompositions

Σ1=Γ1Λ1Γ1,Σ2=Γ2Λ2Γ2.\displaystyle{\Sigma}_{1}={\Gamma}_{1}{\Lambda}_{1}{\Gamma}_{1}^{\prime},\quad{\Sigma}_{2}={\Gamma}_{2}{\Lambda}_{2}{\Gamma}_{2}^{\prime}.

Then, the normalized loading and factor score matrices are, respectively,

R^=p1URΓ1,C^=p2UCΓ2andF^t=(p1p2)1/2Γ1QRFtQCΓ2.\displaystyle\widehat{{R}}=\sqrt{p_{1}}{U}_{R}{\Gamma}_{1},\quad\widehat{{C}}=\sqrt{p_{2}}{U}_{C}{\Gamma}_{2}\ \ \mbox{and}\ \ \widehat{{F}}_{t}=(p_{1}p_{2})^{-1/2}{\Gamma}_{1}^{\prime}{Q}_{R}{F}_{t}{Q}_{C}^{\prime}{\Gamma}_{2}. (22)
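To make Step 5 concrete, the following is a minimal numpy sketch of the normalization, assuming the raw estimates are stored as arrays R_s ($p_1\times k_1$), C_s ($p_2\times k_2$) and F_s ($T\times k_1\times k_2$); all function and variable names here are ours for illustration.

```python
import numpy as np

def normalize(R_s, C_s, F_s):
    """Rotate raw estimates (R_s, C_s, F_s) to satisfy the identification
    conditions in (6); R_s: (p1, k1), C_s: (p2, k2), F_s: (T, k1, k2)."""
    p1, p2, T = R_s.shape[0], C_s.shape[0], F_s.shape[0]

    # Thin SVDs R = U_R H_R V_R', C = U_C H_C V_C'; set Q_R = H_R V_R', etc.
    U_R, h_R, Vt_R = np.linalg.svd(R_s, full_matrices=False)
    U_C, h_C, Vt_C = np.linalg.svd(C_s, full_matrices=False)
    Q_R, Q_C = np.diag(h_R) @ Vt_R, np.diag(h_C) @ Vt_C

    # Sigma_1 and Sigma_2, followed by their eigenvalue decompositions.
    S1 = sum(Q_R @ F @ Q_C.T @ Q_C @ F.T @ Q_R.T for F in F_s) / (T * p1 * p2)
    S2 = sum(Q_C @ F.T @ Q_R.T @ Q_R @ F @ Q_C.T for F in F_s) / (T * p1 * p2)
    _, G1 = np.linalg.eigh(S1)   # eigh sorts eigenvalues in ascending order,
    _, G2 = np.linalg.eigh(S2)   # so flip the eigenvector columns below.
    G1, G2 = G1[:, ::-1], G2[:, ::-1]

    R_hat = np.sqrt(p1) * U_R @ G1
    C_hat = np.sqrt(p2) * U_C @ G2
    F_hat = np.stack([G1.T @ Q_R @ F @ Q_C.T @ G2 for F in F_s]) / np.sqrt(p1 * p2)
    return R_hat, C_hat, F_hat
```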

Liu et al. (2023) first proposed this algorithm for the generalized vector factor model, and the convergence of Step 2 to a local maximum is given in their Proposition 2. To search for the global maximum, a common practice is to draw the initial values at random multiple times and select the solution with the highest likelihood among all local maxima; we generalize this approach to the matrix factor model setting. Although the computation of $(\widetilde{r},\widetilde{f},\widetilde{c})$ is simple, efficiency may be lost since $\widetilde{r}$, $\widetilde{f}$ and $\widetilde{c}$ are not obtained from the full log-likelihood function $L(X|r,f,c)$. To improve efficiency, we conduct a second-stage update in Step 3; the gain from this one-step correction is validated in Section 4.5. The strength of this algorithm lies in its ability to run the estimations in parallel across all rows, columns and time points, leveraging existing packages, which makes the programming and computation straightforward.
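As an illustration of the Step 2 updates, here is a minimal Python sketch of the $c_j$-update for a purely Poisson panel, taking $\mathcal{S}_{1j}$ to be all $(i,t)$ pairs; the $r_i$- and $F_t$-updates are entirely analogous and can be run in parallel over their indices. The function names and the choice of scipy's BFGS routine are our own illustrative choices, not part of the algorithm's specification.

```python
import numpy as np
from scipy.optimize import minimize

def poisson_loglik(x, eta):
    # Entry-wise Poisson log-likelihood at natural parameter eta (up to constants).
    return x * eta - np.exp(eta)

def update_C(X, R, F):
    """One pass of the c_j-updates in Step 2: X is (p1, p2, T), R is (p1, k1),
    F is (T, k1, k2); returns the updated (p2, k2) column loading matrix."""
    p2, k2 = X.shape[1], F.shape[2]
    C = np.zeros((p2, k2))
    for j in range(p2):
        def negloglik(c, j=j):
            eta = np.einsum('ik,tkl,l->it', R, F, c)   # eta[i, t] = r_i' F_t c
            return -poisson_loglik(X[:, j, :], eta).sum()
        C[j] = minimize(negloglik, np.zeros(k2), method='BFGS').x
    return C
```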

3.4.2 Minorization maximization (MM)

Algorithm 2. Step 1 (Initialization): Randomly generate the initial values of r^(0)\widehat{r}^{(0)}, c^(0)\widehat{c}^{(0)} and f^(0)\widehat{f}^{(0)}.

Step 2 (Updating): For k0k\geq 0, first calculate x^ijt(k)=r^i(k)F^t(k)c^j(k)+1bUπlijt(r^i(k)F^t(k)c^j(k))\widehat{x}_{ijt}^{(k)}=\widehat{r}_{i}^{(k)\prime}\widehat{F}_{t}^{(k)}\widehat{c}_{j}^{(k)}+\frac{1}{b_{U}}\partial_{\pi}l_{ijt}\left(\widehat{r}_{i}^{(k)\prime}\widehat{F}_{t}^{(k)}\widehat{c}_{j}^{(k)}\right) for i=1,,p1i=1,\ldots,p_{1}, j=1,,p2j=1,\ldots,p_{2}, t=1,,Tt=1,\ldots,T, then update as follows.

\left(\widehat{r}^{(k+1)},\widehat{f}^{(k+1)},\widehat{c}^{(k+1)}\right)=\operatorname*{argmin}_{r,f,c}\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\sum_{t=1}^{T}\left(\widehat{x}^{(k)}_{ijt}-r_{i}^{\prime}F_{t}c_{j}\right)^{2}.

Iterate until L(X|r^(k+1),f^(k+1),c^(k+1))L(X|r^(k),f^(k),c^(k))L\left(X|\widehat{r}^{(k+1)},\widehat{f}^{(k+1)},\widehat{c}^{(k+1)}\right)-L\left(X|\widehat{r}^{(k)},\widehat{f}^{(k)},\widehat{c}^{(k)}\right)\leq error, where error is the level of numerical tolerance.

Step 3 (Repetition): Repeat Steps 1 and 2 several times with different random initializations and retain the solution with the largest likelihood.

Step 4 (Normalization): Similar to Algorithm 1, the solution in Step 3 is normalized by (22).

The minorization maximization (MM) algorithm does not require alternation; it only needs the least-squares matrix factorization in Step 2, which can be performed quickly using standard software packages. Wang (2022) applied this algorithm to the generalized vector factor model and proved the convergence of Step 2 in their Appendix B; we extend it here to the matrix factor model.
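For concreteness, the sketch below implements one MM iteration under the Logit model, solving the least-squares step with a PCA-type fit; this solver choice, like all names and shapes here, is ours, and any routine for the linear matrix factor least-squares problem could be substituted.

```python
import numpy as np

def mm_update(X, R, C, F, grad, b_U):
    """One MM iteration: form the surrogate data and refit by least squares.
    X: (T, p1, p2); R: (p1, k1); C: (p2, k2); F: (T, k1, k2);
    grad(x, eta) is the derivative of l_ijt with respect to its argument."""
    T, p1, p2 = X.shape
    k1, k2 = R.shape[1], C.shape[1]
    fit = np.einsum('ik,tkl,jl->tij', R, F, C)        # r_i' F_t c_j
    X_sur = fit + grad(X, fit) / b_U                  # surrogate x_hat_ijt^(k)
    # Least-squares refit via a PCA-type solver.
    M1 = np.einsum('tij,tkj->ik', X_sur, X_sur) / (T * p2)
    M2 = np.einsum('tij,tik->jk', X_sur, X_sur) / (T * p1)
    _, V1 = np.linalg.eigh(M1)
    _, V2 = np.linalg.eigh(M2)
    R_new = np.sqrt(p1) * V1[:, ::-1][:, :k1]         # top-k1 eigenvectors
    C_new = np.sqrt(p2) * V2[:, ::-1][:, :k2]         # top-k2 eigenvectors
    F_new = np.einsum('ik,tij,jl->tkl', R_new, X_sur, C_new) / (p1 * p2)
    return R_new, C_new, F_new

# Logit model: l(eta) = x * eta - log(1 + exp(eta)), so the derivative is
# x - 1 / (1 + exp(-eta)), and the curvature bound is b_U = 1/4.
logit_grad = lambda x, eta: x - 1.0 / (1.0 + np.exp(-eta))
```

Iterating mm_update and monitoring the likelihood gain against the tolerance reproduces Step 2.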

4 Numerical Studies

In Sections 4.1 and 4.2, we conduct simulation studies to assess the finite-sample performance of the proposed method (GMFM) by comparing it with the linear matrix factor model (LMFM). The accuracy of the estimated loading $\widehat{R}$ relative to $R^0$ is evaluated by the smallest nonzero canonical correlation between them, denoted by ccor$(\widehat{R},R^0)$; similarly, ccor$(\widehat{C},C^0)$ denotes the smallest nonzero canonical correlation between $\widehat{C}$ and $C^0$. Canonical correlation is widely used to measure performance in factor analysis; see, for example, Goyal et al. (2008), Doz et al. (2012), Bai and Li (2012) and Liu et al. (2023). In Sections 4.3 and 4.4, we examine the performance of the information criterion for selecting the numbers of factors and verify the asymptotic normality in Theorem 2, respectively. Section 4.5 investigates the efficiency gain of the second-stage correction in TSAM.
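For completeness, here is a short numpy sketch of this accuracy measure, assuming both input matrices have full column rank:

```python
import numpy as np

def ccor(A, B):
    """Smallest nonzero canonical correlation between the column spaces of
    A and B (both p x k); values close to 1 indicate close column spaces."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa'Qb are the cosines of the principal angles.
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False).min()
```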

4.1 Simulation Setting

We generate the factor matrix sequence by an AR(1) model, $f_t=0.2f_{t-1}+0.2\epsilon_t$, where the $\epsilon_t$'s are i.i.d. $\mathcal{N}(0_{k_1k_2},\mathbb{I}_{k_1k_2})$, $t=1,\cdots,T$. The entries of $R$ and $C$ are drawn independently from the uniform distribution $U(0,1)$. We consider six settings with different variable types and different combinations of $p_1=20,30,50$, $p_2=20,30,50$ and $T=30,50$; a generation sketch for one mixed design (Case 4) follows the case list below.

Case 1 (Gaussian variables with homoscedasticity): k1=k2=2k_{1}=k_{2}=2, and xijt𝒩(riFtcj,1)x_{ijt}\sim\mathcal{N}({r}_{i}^{\prime}{F}_{t}{c}_{j},1).

Case 2 (Gaussian variables with heteroscedasticity): k1=1k_{1}=1, k2=3k_{2}=3, xijt𝒩(riFtcj,τj2)x_{ijt}\sim\mathcal{N}({r}_{i}^{\prime}{F}_{t}{c}_{j},\tau_{j}^{2}) with τj=0.1+2Uj\tau_{j}=0.1+2U_{j} and UjU_{j}’s are i.i.d from U(0,1)U(0,1).

Case 3 (Poisson variables): k1=k2=3k_{1}=k_{2}=3 and xijtPoisson(exp(riFtcj))x_{ijt}\sim\text{Poisson}(\exp{({r}_{i}^{\prime}{F}_{t}{c}_{j}})).

Case 4 (The mixture of binary and count variables): k1=k2=4k_{1}=k_{2}=4. For i=1,,p1i=1,\cdots,p_{1}, j=1,,[p2/2]j=1,\cdots,[p_{2}/2], xijtPoisson(exp(riFtcj))x_{ijt}\sim\text{Poisson}(\exp{({r}_{i}^{\prime}{F}_{t}{c}_{j}})); for j=[p2/2]+1,,p2j=[p_{2}/2]+1,\cdots,p_{2}, xijtBernoulli(1/(1+exp(riFtcj)))x_{ijt}\sim\text{Bernoulli}(1/(1+\exp{({r}_{i}^{\prime}{F}_{t}{c}_{j}}))).

Case 5 (The mixture of continuous and count variables): k1=k2=5k_{1}=k_{2}=5. For i=1,,p1i=1,\cdots,p_{1}, j=1,,[p2/2]j=1,\cdots,[p_{2}/2], xijt𝒩(riFtcj,1)x_{ijt}\sim\mathcal{N}({r}_{i}^{\prime}{F}_{t}{c}_{j},1); for j=[p2/2]+1,,p2j=[p_{2}/2]+1,\cdots,p_{2}, xijtPoisson(exp(riFtcj))x_{ijt}\sim\text{Poisson}(\exp{({r}_{i}^{\prime}{F}_{t}{c}_{j}})).

Case 6 (The mixture of continuous, count and binary variables): $k_1=k_2=6$. For $i=1,\cdots,[p_1/2]$, $j=1,\cdots,[p_2/2]$, $x_{ijt}\sim\mathcal{N}(r_i^{\prime}F_tc_j,1)$; for $i=1,\cdots,[p_1/2]$, $j=[p_2/2]+1,\cdots,p_2$, $x_{ijt}\sim\text{Poisson}(\exp(r_i^{\prime}F_tc_j))$; for $i=[p_1/2]+1,\cdots,p_1$, $j=1,\cdots,[p_2/2]$, $x_{ijt}\sim\text{Poisson}(\exp(r_i^{\prime}F_tc_j))$; for $i=[p_1/2]+1,\cdots,p_1$, $j=[p_2/2]+1,\cdots,p_2$, $x_{ijt}\sim\text{Bernoulli}(1/(1+\exp(r_i^{\prime}F_tc_j)))$.
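As a concrete illustration of these designs, the sketch below generates one replication of Case 4 as stated above; the factor path is initialized at zero for simplicity, and the function name and seed are ours.

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_case4(p1, p2, T, k1=4, k2=4):
    """One replication of Case 4: Poisson columns j <= [p2/2],
    Bernoulli columns j > [p2/2]; returns X (T, p1, p2) and (R, C, F)."""
    R = rng.uniform(0, 1, (p1, k1))
    C = rng.uniform(0, 1, (p2, k2))
    F = np.zeros((T + 1, k1, k2))                    # f_0 = 0 for simplicity
    for t in range(1, T + 1):                        # f_t = 0.2 f_{t-1} + 0.2 eps_t
        F[t] = 0.2 * F[t - 1] + 0.2 * rng.standard_normal((k1, k2))
    F = F[1:]
    eta = np.einsum('ik,tkl,jl->tij', R, F, C)       # eta[t, i, j] = r_i' F_t c_j
    X = np.empty((T, p1, p2))
    half = p2 // 2
    X[:, :, :half] = rng.poisson(np.exp(eta[:, :, :half]))
    # Success probability 1/(1 + exp(eta)), exactly as stated in Case 4.
    X[:, :, half:] = rng.binomial(1, 1.0 / (1.0 + np.exp(eta[:, :, half:])))
    return X, R, C, F
```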

4.2 Comparison with the Linear Factor Models

Tables 1-2 report the averages of ccor$(\widehat{R},R^0)$ and ccor$(\widehat{C},C^0)$ for the LMFM and GMFM based on 500 repetitions. The loadings and factors of the GMFM are computed by our proposed TSAM algorithm, while those of the LMFM are estimated by $\alpha$-PCA ($\alpha=0$) in Chen and Fan (2023). Theoretically, under homoscedasticity the LMFM and the GMFM are equivalent, and this equivalence is reflected in the comparable results for Case 1 in Table 1. Under heteroscedasticity, the GMFM demonstrates superior performance to the LMFM in Case 2, as is evident in Table 1. Moreover, the precision of $\widehat{R}$ and $\widehat{C}$ increases as $p_1$, $p_2$ and/or $T$ increases, consistent with our theoretical results in Theorems 1 and 2.

In Case 3 of Table 1, the GMFM performs significantly better than the LMFM for Poisson variables. Evidently, the LMFM is effective only for continuous variables; it performs poorly for discrete variables, and its estimation accuracy does not improve with increasing dimensionality. In contrast, the GMFM attains high canonical correlations when estimating loadings for Poisson variables, with accuracy comparable to that for continuous variables.

Table 2 shows the results for Cases 4-6, where $X_t$, $t=1,\cdots,T$, contain mixed variables including Poisson, binary and normal entries. Clearly, the LMFM breaks down on mixed data containing discrete variables, while the GMFM remains reliable. In particular, the canonical correlations of the GMFM for both $\widehat{R}$ and $\widehat{C}$ increase as $p_1$, $p_2$ and/or $T$ increases, whereas those of the LMFM do not.

Table 1: The results for single-type variables.

                                 GMFM                       LMFM
                T  p1\p2     20      30      50         20      30      50
 Case 1: Gaussian variables with homoscedasticity
 ccor(R̂,R0)    30   20    0.9520  0.9677  0.9827     0.8792  0.9201  0.9509
                     30    0.9527  0.9690  0.9836     0.8882  0.9248  0.9590
                     50    0.9608  0.9718  0.9850     0.9316  0.9497  0.9724
                50   20    0.9711  0.9833  0.9879     0.9204  0.9551  0.9725
                     30    0.9734  0.9845  0.9898     0.9475  0.9650  0.9763
                     50    0.9759  0.9845  0.9909     0.9590  0.9728  0.9840
 ccor(Ĉ,C0)    30   20    0.9389  0.9573  0.9606     0.8284  0.8795  0.9305
                     30    0.9643  0.9669  0.9702     0.8779  0.9092  0.9427
                     50    0.9788  0.9819  0.9837     0.9340  0.9541  0.9693
                50   20    0.9638  0.9718  0.9744     0.8973  0.9381  0.9563
                     30    0.9790  0.9813  0.9823     0.9268  0.9549  0.9694
                     50    0.9881  0.9898  0.9903     0.9592  0.9746  0.9816
 Case 2: Gaussian variables with heteroscedasticity
 ccor(R̂,R0)    30   20    0.9579  0.9771  0.9874     0.9341  0.9566  0.9740
                     30    0.9650  0.9809  0.9895     0.9457  0.9648  0.9809
                     50    0.9729  0.9814  0.9898     0.9619  0.9724  0.9844
                50   20    0.9788  0.9881  0.9933     0.9618  0.9765  0.9859
                     30    0.9832  0.9892  0.9938     0.9725  0.9820  0.9884
                     50    0.9841  0.9905  0.9941     0.9775  0.9859  0.9912
 ccor(Ĉ,C0)    30   20    0.6849  0.7969  0.8528     0.1692  0.1412  0.2050
                     30    0.7992  0.8821  0.9094     0.1881  0.1559  0.1996
                     50    0.9151  0.9345  0.9451     0.1850  0.1869  0.2097
                50   20    0.8102  0.8924  0.9234     0.1670  0.1545  0.2479
                     30    0.9118  0.9376  0.9494     0.1698  0.1642  0.2730
                     50    0.9543  0.9662  0.9709     0.1956  0.1907  0.2650
 Case 3: Poisson variables
 ccor(R̂,R0)    30   20    0.9663  0.9806  0.9882     0.3492  0.3393  0.3666
                     30    0.9697  0.9814  0.9890     0.2998  0.2521  0.3063
                     50    0.9720  0.9824  0.9891     0.2140  0.2165  0.2506
                50   20    0.9808  0.9885  0.9931     0.3477  0.4054  0.4485
                     30    0.9827  0.9889  0.9933     0.3101  0.2969  0.2905
                     50    0.9836  0.9899  0.9941     0.2341  0.2528  0.2172
 ccor(Ĉ,C0)    30   20    0.9504  0.9658  0.9687     0.2711  0.2624  0.2177
                     30    0.9699  0.9755  0.9806     0.2950  0.2836  0.2325
                     50    0.9852  0.9867  0.9876     0.2811  0.2738  0.1843
                50   20    0.9732  0.9815  0.9830     0.3105  0.2304  0.1961
                     30    0.9830  0.9859  0.9879     0.3055  0.1895  0.1612
                     50    0.9905  0.9927  0.9934     0.2963  0.2441  0.2149
Table 2: The results for mixed-type variables.

                                 GMFM                       LMFM
                T  p1\p2     20      30      50         20      30      50
 Case 4: binary and count variables
 ccor(R̂,R0)    30   20    0.9408  0.9666  0.9837     0.3900  0.3485  0.3704
                     30    0.9545  0.9718  0.9857     0.2693  0.2651  0.2580
                     50    0.9617  0.9767  0.9868     0.2708  0.2386  0.1882
                50   20    0.9678  0.9819  0.9911     0.3961  0.4086  0.4384
                     30    0.9731  0.9847  0.9918     0.3098  0.3183  0.2968
                     50    0.9779  0.9864  0.9926     0.2823  0.2124  0.2362
 ccor(Ĉ,C0)    30   20    0.7785  0.8926  0.9207     0.2160  0.1705  0.1533
                     30    0.9162  0.9350  0.9500     0.2126  0.1508  0.1405
                     50    0.9515  0.9627  0.9717     0.2079  0.1756  0.1412
                50   20    0.8724  0.9382  0.9551     0.2194  0.1636  0.1041
                     30    0.9407  0.9604  0.9702     0.1982  0.1746  0.1517
                     50    0.9659  0.9787  0.9825     0.1941  0.1476  0.1375
 Case 5: continuous and count variables
 ccor(R̂,R0)    30   20    0.9624  0.9818  0.9913     0.5255  0.4910  0.5678
                     30    0.9705  0.9843  0.9917     0.3885  0.4213  0.4085
                     50    0.9791  0.9868  0.9929     0.2685  0.2784  0.2809
                50   20    0.9814  0.9898  0.9946     0.5002  0.5464  0.6054
                     30    0.9849  0.9916  0.9954     0.4544  0.4507  0.4529
                     50    0.9875  0.9926  0.9961     0.3168  0.3032  0.3323
 ccor(Ĉ,C0)    30   20    0.8882  0.9675  0.9766     0.2878  0.2077  0.1607
                     30    0.9596  0.9785  0.9850     0.2477  0.2497  0.1767
                     50    0.9843  0.9895  0.9912     0.3200  0.1805  0.1495
                50   20    0.9473  0.9806  0.9867     0.2900  0.2060  0.1764
                     30    0.9745  0.9875  0.9914     0.2574  0.2566  0.1795
                     50    0.9902  0.9937  0.9950     0.3306  0.1877  0.1635
 Case 6: continuous, count and binary variables
 ccor(R̂,R0)    30   20    0.6565  0.8353  0.9498     0.2703  0.3290  0.2900
                     30    0.8167  0.9290  0.9777     0.2364  0.2423  0.2070
                     50    0.9493  0.9753  0.9880     0.1303  0.1690  0.1637
                50   20    0.7052  0.8737  0.9620     0.3071  0.3589  0.3450
                     30    0.8969  0.9583  0.9849     0.1675  0.2789  0.2523
                     50    0.9662  0.9850  0.9930     0.1593  0.1828  0.1758
 ccor(Ĉ,C0)    30   30    0.5365  0.7318  0.9250     0.2667  0.1670  0.1246
                     60    0.6170  0.8871  0.9621     0.1935  0.1428  0.1224
                     90    0.8769  0.9643  0.9848     0.2093  0.1545  0.1241
                50   30    0.5823  0.8080  0.9544     0.2584  0.2362  0.1142
                     60    0.7168  0.9095  0.9768     0.2004  0.1887  0.1074
                     90    0.8829  0.9742  0.9896     0.1775  0.1606  0.1457

4.3 Model Selection

This section verifies the performance of the proposed method for selecting the numbers of row and column factors. Table 3 reports the averages of $\widehat{k}_1$ and $\widehat{k}_2$ based on 100 repetitions, where the candidate values of $k_1$ and $k_2$ are the integers from 1 to 8. Table 3 shows that the PC-type information criterion (18) works well in all six cases considered, with different $k_1$ and $k_2$. In contrast, we found that the LMFM using $\alpha$-PCA ($\alpha=0$) by Chen and Fan (2023) and the projected estimation (PE) by Yu et al. (2022) both fail under the PC information criterion, which tends to choose $(1,1)$ in all six cases.

Table 3: The average of the estimated numbers of row and column factors by our proposed information criterion based on GMFM, from 100 repetitions.

              Case 1: (k1,k2)=(2,2)                       Case 2: (k1,k2)=(1,3)
 T  p1\p2       20           30           50                20           30           50
 30   20   (1.00,1.00)  (1.00,1.00)  (1.30,1.30)       (1.00,1.00)  (1.00,1.00)  (1.00,1.17)
      30   (1.15,1.15)  (1.60,1.60)  (1.96,1.96)       (1.00,1.00)  (1.00,1.00)  (1.00,1.30)
      50   (1.65,1.50)  (1.85,1.60)  (2.00,2.00)       (1.00,1.00)  (1.00,1.00)  (1.00,2.02)
 50   20   (1.32,1.32)  (1.76,1.76)  (2.00,2.00)       (1.20,1.10)  (1.00,1.22)  (1.00,1.50)
      30   (1.64,1.62)  (1.96,1.94)  (2.00,2.00)       (1.00,1.10)  (1.00,1.42)  (1.00,2.00)
      50   (1.98,1.96)  (2.00,2.00)  (2.00,2.00)       (1.00,1.20)  (1.00,1.82)  (1.00,3.00)
              Case 3: (k1,k2)=(3,3)                       Case 4: (k1,k2)=(4,4)
 30   20   (1.00,1.00)  (1.00,1.00)  (1.00,1.00)       (1.00,1.00)  (1.00,1.00)  (1.00,1.00)
      30   (1.00,1.00)  (1.40,1.40)  (2.15,2.05)       (1.00,1.00)  (2.12,2.04)  (2.52,2.36)
      50   (1.20,1.10)  (2.05,1.85)  (2.96,2.54)       (1.00,1.00)  (2.40,2.20)  (3.64,3.36)
 50   20   (1.00,1.00)  (1.30,1.25)  (2.50,2.48)       (1.00,1.00)  (1.52,1.30)  (2.60,2.60)
      30   (1.70,1.60)  (2.60,2.35)  (3.00,2.93)       (2.80,2.52)  (3.36,3.04)  (4.00,3.82)
      50   (2.32,2.16)  (3.00,2.86)  (3.00,3.00)       (3.04,2.56)  (3.95,3.40)  (4.00,4.00)
              Case 5: (k1,k2)=(5,5)                       Case 6: (k1,k2)=(6,6)
 30   20   (4.72,4.60)  (5.00,4.85)  (5.00,5.00)       (4.20,3.92)  (5.35,4.60)  (5.46,5.86)
      30   (5.00,4.68)  (5.00,5.00)  (5.00,5.00)       (5.52,4.40)  (5.80,5.60)  (5.86,6.00)
      50   (5.00,5.00)  (5.00,5.00)  (5.00,5.00)       (5.82,5.30)  (6.00,5.80)  (6.00,6.00)
 50   20   (5.00,4.76)  (5.00,5.00)  (5.00,5.00)       (5.30,4.45)  (5.50,5.30)  (5.80,6.00)
      30   (5.00,4.80)  (5.00,5.00)  (5.00,5.00)       (5.70,4.90)  (6.00,5.80)  (6.00,6.00)
      50   (5.00,5.00)  (5.00,5.00)  (5.00,5.00)       (6.00,5.12)  (6.00,5.82)  (6.00,6.00)

4.4 Asymptotic distribution

In this section, we assess the adequacy of the asymptotic distributions in Theorem 2. We consider the case with $k_1=k_2=1$ and generate $f_t$ i.i.d. from $\mathcal{N}(0,1)$. The generation and normalization of $R$ and $C$ are the same as in Section 4.1. Given $R$, $F$ and $C$, we consider three data generating processes (DGPs) for $X_{ijt}$.

DGP 1 (Logit): XijtX_{ijt} is a binary random variable and P(Xijt=1)=Ψ(riFtcj)P(X_{ijt}=1)=\Psi(r_{i}^{\prime}F_{t}c_{j}), where Ψ(z)=1/(1+ez)\Psi(z)=1/\left(1+e^{-z}\right).

DGP 2 (Probit): XijtX_{ijt} is a binary random variable and P(Xijt=1)=Φ(riFtcj)P(X_{ijt}=1)=\Phi(r_{i}^{\prime}F_{t}c_{j}), where Φ()\Phi(\cdot) is the cumulative distribution function of standard normal distribution.

DGP 3 (Mixed): For $t=1,\ldots,T$, $i=1,\ldots,p_1$, $j=1,\ldots,p_2/3$, $X_{ijt}\sim N(r_i^{\prime}F_tc_j,1)$; for $t=1,\ldots,T$, $i=1,\ldots,p_1$, $j=p_2/3+1,\ldots,2p_2/3$, $X_{ijt}$ is a binary random variable with $P(X_{ijt}=1)=\Psi(r_i^{\prime}F_tc_j)$; for $t=1,\ldots,T$, $i=1,\ldots,p_1$, $j=2p_2/3+1,\ldots,p_2$, $X_{ijt}$ is a binary random variable with $P(X_{ijt}=1)=\Phi(r_i^{\prime}F_tc_j)$.

Under such cases, by Theorem 2,

p2T(r^iri0)𝑑𝒩(0,p2TΣiR1),\displaystyle\sqrt{p_{2}T}(\hat{r}_{i}-r^{0}_{i})\xrightarrow{d}\mathcal{N}(0,p_{2}T\Sigma_{iR}^{-1}),
p1T(c^jcj0)𝑑𝒩(0,p1TΣjC1),\displaystyle\sqrt{p_{1}T}(\hat{c}_{j}-c^{0}_{j})\xrightarrow{d}\mathcal{N}(0,p_{1}T\Sigma_{jC}^{-1}),
p1p2(f^tft0)𝑑𝒩(0,p1p2ΣtF1),\displaystyle\sqrt{p_{1}p_{2}}(\hat{f}_{t}-f^{0}_{t})\xrightarrow{d}\mathcal{N}(0,p_{1}p_{2}\Sigma_{tF}^{-1}),

which reach the asymptotic efficiency bounds. Figures 1, 2 and 3 show the normalized histograms of $\Sigma_{1R}^{1/2}(\hat{r}_1-r_1^0)$, $\Sigma_{1C}^{1/2}(\hat{c}_1-c_1^0)$ and $\Sigma_{1F}^{1/2}(\hat{f}_1-f_1^0)$ for DGP1-DGP3, by TSAM and MM respectively, with the standard normal density curve overlaid for comparison. For Step 2 of MM, we choose $b_U=1/4$ for the Logit case and $b_U=1$ for the Probit case, following the definition of $b_U$ in Section 2.2. The standard normal density clearly provides a good approximation across all subfigures.

[Figure 1 here: six panels, DGP1-DGP3 by MM (top row) and DGP1-DGP3 by TSAM (bottom row).]
Figure 1: Empirical distributions of $\hat{r}_1$ after standardization under various settings, over 1000 replications with $T=p_1=p_2=50$. The red solid curves show the density of the standard normal distribution.
[Figure 2 here: six panels, DGP1-DGP3 by MM (top row) and DGP1-DGP3 by TSAM (bottom row).]
Figure 2: Empirical distributions of $\hat{c}_1$ after standardization under various settings, over 1000 replications with $T=p_1=p_2=50$. The red solid curves show the density of the standard normal distribution.
[Figure 3 here: six panels, DGP1-DGP3 by MM (top row) and DGP1-DGP3 by TSAM (bottom row).]
Figure 3: Empirical distributions of $\hat{f}_1$ after standardization under various settings, over 1000 replications with $T=p_1=p_2=50$. The red solid curves show the density of the standard normal distribution.

4.5 Efficiency Gain of the One-Step Correction in TSAM

To demonstrate the efficiency improvement from the second-stage update in TSAM, we illustrate the results using Case 5 with $T=20$. In the first stage of TSAM, we obtain estimators using only the continuous variables (S1N), only the count variables (S1P), and both (S1NP). As shown in Table 4, the performance of the first-stage estimators improves as $p_2$ increases. Notably, S1NP performs best, while S1P and S1N are similarly effective; this matches our expectation, since S1NP utilizes the most comprehensive information. The second-stage update significantly enhances the performance of the first-stage estimators, especially when the sample size is small.

Table 4: The results of Case 5 with T=20T=20. The estimators obtained in the first stage, which are based solely on Normal variables, solely on Poisson variables, and on a combination of both, are abbreviated as S1N, S1P and S1NP, respectively. The estimators arising from the second-stage update are designated as OSN, OSP and OSNP. The values of ccorR and ccorC represent the averaged ccor(R^,R0)(\widehat{{R}},{R}^{0}) and ccor(C^,C0)(\widehat{{C}},{C}^{0}), respectively.
           (p1,p2)=(30,20)           (p1,p2)=(30,30)           (p1,p2)=(30,50)
          S1N     S1P     S1NP      S1N     S1P     S1NP      S1N     S1P     S1NP
 ccorR   0.9328  0.9388  0.9702    0.9502  0.9583  0.9787    0.9703  0.9776  0.9879
 ccorC   0.8930  0.8612  0.9470    0.9055  0.9533  0.9652    0.9699  0.9695  0.9764
          OSN     OSP     OSNP      OSN     OSP     OSNP      OSN     OSP     OSNP
 ccorR   0.9692  0.9701  0.9753    0.9761  0.9761  0.9791    0.9872  0.9877  0.9883
 ccorC   0.9456  0.9424  0.9542    0.9500  0.9689  0.9699    0.9766  0.9766  0.9769

5 Real Data Analysis

In this section, we analyze the operating performance of listed high-tech manufacturing companies in China. The data are sourced from publicly accessible financial reports, corporate social responsibility statements, and stock-exchange trading data, and can be downloaded from https://data.csmar.com. The dataset comprises 22 indicators for 12 manufacturing companies over 68 quarters from 2007-Q1 to 2023-Q2. It is worth noting that, unlike in the general LMFM literature (Wang et al. (2019), Chen et al. (2022), Yu et al. (2022)), in addition to 18 continuous variables there are four binary variables. These binary variables represent corporate social responsibility activities, such as the protection of employees' rights and interests, environmental and sustainable development initiatives, and public relations and social welfare undertakings; disclosure of an indicator is coded as 1, and non-disclosure as 0. Furthermore, the economic indicators of these companies fall into four main groups: profitability, development ability, operational ability, and solvency. A detailed list of these indicators is presented in the supplementary material. The enterprises belong to three high-tech industries: pharmaceutical manufacturing, electronic equipment manufacturing, and transportation manufacturing.

The missing rate in this dataset is low (0.05%), and we use simple linear interpolation to fill in the missing values. Each original univariate time series is transformed by taking the second difference, and each transformed series is standardized to remove the effects of non-zero means and disparate variances.
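A minimal pandas sketch of this preprocessing, assuming the raw panel is stored with quarters as rows and one column per company-indicator pair; the file name is hypothetical, and for brevity the sketch applies the same pipeline to every column.

```python
import pandas as pd

# Hypothetical file: 68 quarterly rows, one column per (company, indicator).
df = pd.read_csv('operating_performance.csv', index_col=0)

df = df.interpolate(method='linear', limit_direction='both')  # fill the ~0.05% missing cells
df = df.diff().diff().dropna()                                # second difference of each series
df = (df - df.mean()) / df.std()                              # standardize each series
```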

Before estimating the matrix factors and loadings, we first determine the numbers of row and column factors. Our proposed criterion suggests $k_1=3$ and $k_2=4$. For the row factors, Table 5 shows that they are closely related to the industry classification: companies in the same industry tend to have similar loadings, and the 12 companies naturally divide into three groups, namely pharmaceutical manufacturing (P1, P2, P3, P4), electronic equipment manufacturing (E1, E2, E3, E4, E5), and transportation manufacturing (T1, T2, T3). For the column factors, Figure 4 divides the indicators into four groups: profitability (G1), operational ability (G2), solvency (G3), and development ability (G4). Notably, the disclosure of social responsibility is incorporated into the development-ability indicators; a company with a strong sense of social responsibility generally tends to have better development prospects. This categorization aligns with economic intuition, and the detailed grouping of the indicators is outlined in the supplementary material.

Table 5: Row (enterprises) loading matrix (multiplied by 10) by GMLE.

 Row factor    E1   E2   E3   E4   E5   P1   P2   P3   P4   T1   T2   T3
 F1            -5   -3    2    0   -2  -21  -20  -16   -9    0   -2    1
 F2            -3   -5    0   -8   -6    1    1    0    3   20   17   20
 F3           -17  -13  -10  -20  -13    2    1    0    5   -5   -6   -8
[Figure 4 here: four panels, one per column loading.]
Figure 4: The absolute values of the four column loadings, where the red triangular dots mark the largest few absolute values of each loading.

The canonical correlation between the LMFM (estimated by $\alpha$-PCA with $\alpha=0$) and GMFM estimates of $R$ (resp. $C$) is 0.76 (resp. 0.70), showing that the loading estimators of the LMFM differ from those of the GMFM. In addition, we calculate the negative scaled log-likelihood $-2L/(Tp_1p_2)$ to measure goodness of fit: with $(k_1,k_2)=(3,4)$ it is 1.28 for the GMFM and 1.86 for the LMFM. Clearly, the GMFM fits the data better.

To further compare the GMFM and LMFM, we employ a rolling-validation procedure as in Yu et al. (2022). For each year $t$ from 2007 to 2022, we use the $n$ (bandwidth) years of observations before year $t$ to fit the matrix-variate factor model and estimate the two loading matrices. The loadings are then used to estimate the factors and corresponding residuals of the 4 quarters in the current year. Specifically, let $Y_t^i$ and $\widehat{Y}_t^i$ be the observed and estimated performance matrices of quarter $i$ in year $t$. Define

MSEt=14×12×22i=14Y^tiYtiF2,andρt=i=14Y^tiYtiF2i=14YtiY¯tF2,\displaystyle\text{MSE}_{t}=\frac{1}{4\times 12\times 22}\sum_{i=1}^{4}\left\|\widehat{{Y}}_{t}^{i}-{Y}_{t}^{i}\right\|_{F}^{2},\ \mbox{and}\ \rho_{t}=\frac{\sum_{i=1}^{4}\left\|\widehat{{Y}}_{t}^{i}-{Y}_{t}^{i}\right\|_{F}^{2}}{\sum_{i=1}^{4}\left\|{{Y}}_{t}^{i}-\overline{{Y}}_{t}\right\|_{F}^{2}},

as the mean squared error (MSE) and the unexplained proportion of total variance, respectively, where $\overline{Y}_t$ denotes the average of the four quarterly matrices in year $t$. Let $\overline{\text{MSE}}$ and $\bar{\rho}$ be the means of $\text{MSE}_t$ and $\rho_t$ over the validation years. For the LMFM, the $\widehat{Y}_t^i$'s are reconstructed by the projected estimator (PE) of Yu et al. (2022) and by the $\alpha$-PCA method with $\alpha=0$ of Chen and Fan (2023).
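A small numpy sketch of the two criteria for a single validation year, with the four quarterly observed and reconstructed 12 x 22 matrices stacked along the first axis (names are ours):

```python
import numpy as np

def mse_rho(Y_obs, Y_fit):
    """MSE_t and rho_t for one year; Y_obs, Y_fit: arrays of shape (4, 12, 22)."""
    Y_obs, Y_fit = np.asarray(Y_obs), np.asarray(Y_fit)
    resid = ((Y_fit - Y_obs) ** 2).sum()             # sum of squared Frobenius norms
    mse = resid / (4 * 12 * 22)
    rho = resid / ((Y_obs - Y_obs.mean(axis=0)) ** 2).sum()
    return mse, rho
```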

Table 6: Rolling validation for the operating performance of the 12 listed high-tech manufacturing companies by different models.

                         mean MSE                             mean ρ
 n   k   LMFM      LMFM    GMFM    GVFM      LMFM      LMFM    GMFM    GVFM
         (α-PCA)   (PE)                      (α-PCA)   (PE)
 5   1   0.97      0.99    0.92    0.93      1.20      1.21    1.12    1.13
 5   2   0.90      0.91    0.85    0.91      1.11      1.10    1.02    1.12
 5   3   0.81      0.80    0.78    0.88      0.99      0.96    0.95    1.07
 10  1   0.96      0.96    0.91    0.91      1.21      1.19    1.12    1.13
 10  2   0.88      0.87    0.83    0.89      1.10      1.08    1.02    1.14
 10  3   0.79      0.77    0.75    0.88      0.97      0.95    0.93    1.08

Table 6 presents the average values of MSE and $\rho$ for four estimation methods: LMFM ($\alpha$-PCA), LMFM (PE), GMFM, and the generalized vector factor model (GVFM) of Liu et al. (2023) with $k^2$ factors. The comparison covers different combinations of bandwidth $n$ and number of factors ($k_1=k_2=k$). From Table 6, our proposed GMFM outperforms the other methods across the board. Specifically, when $k=1$, GMFM and GVFM display comparable MSEs; as $k$ rises, however, the MSE of GMFM decreases markedly, whereas the decline for GVFM is gradual. Moreover, the two estimation methods for the LMFM behave comparably, and both are worse than GMFM.

6 Conclusion

This paper provides a general theory for matrix factor analysis of a high-dimensional nonlinear model. Under certain regularity conditions, we establish the consistency and convergence rates of the estimated factors and loadings, and derive the asymptotic normality of the estimated loadings. Furthermore, we develop a criterion based on a penalized loss to consistently estimate the numbers of row and column factors within the GMFM framework. Computationally, we propose two algorithms to compute the factors and loadings of the GMFM. To support the theory, we conduct extensive simulation studies and apply the GMFM to the analysis of a business performance dataset.

Supplementary Material

The Supplementary Material contains detailed technical proofs of the main theorems, together with some technical lemmas of independent interest, as well as details of the dataset.

References

  • Bai and Li (2012) Bai, J., Li, K., 2012. Statistical analysis of factor models of high dimension. The Annals of Statistics 40, 436–465.
  • Bai and Ng (2002) Bai, J., Ng, S., 2002. Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
  • Barigozzi et al. (2022) Barigozzi, M., He, Y., Li, L., Trapani, L., 2022. Statistical inference for large-dimensional tensor factor model by iterative projections. arXiv:2206.09800 .
  • Barigozzi and Trapani (2022) Barigozzi, M., Trapani, L., 2022. Testing for common trends in nonstationary large datasets. Journal of Business & Economic Statistics 40, 1107–1122.
  • Chang et al. (2023) Chang, J., He, J., Yang, L., Yao, Q., 2023. Modelling matrix time series via a tensor CP-decomposition. Journal of the Royal Statistical Society Series B: Statistical Methodology 85, 127–148.
  • Chen et al. (2024) Chen, E., Fan, J., Zhu, X., 2024. Factor augmented matrix regression. arXiv:2405.17744 .
  • Chen and Fan (2023) Chen, E.Y., Fan, J., 2023. Statistical inference for high-dimensional matrix-variate factor models. Journal of the American Statistical Association 118, 1038–1055.
  • Chen et al. (2021a) Chen, M., Fernández-Val, I., Weidner, M., 2021a. Nonlinear factor models for network and panel data. Journal of Econometrics 220, 296–324.
  • Chen et al. (2021b) Chen, R., Xiao, H., Yang, D., 2021b. Autoregressive models for matrix-valued time series. Journal of Econometrics 222, 539–560.
  • Chen et al. (2022) Chen, R., Yang, D., Zhang, C.H., 2022. Factor models for high-dimensional tensor time series. Journal of the American Statistical Association 117, 94–116.
  • Doz et al. (2012) Doz, C., Giannone, D., Reichlin, L., 2012. A quasi-maximum likelihood approach for large, approximate dynamic factor models. The Review of Economics and Statistics 94, 1014–1024.
  • Fan et al. (2013) Fan, J., Liao, Y., Mincheva, M., 2013. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society Series B: Statistical Methodology 75, 603–680.
  • Goyal et al. (2008) Goyal, A., Pérignon, C., Villa, C., 2008. How common are common return factors across the NYSE and NASDAQ? Journal of Financial Economics 90, 252–271.
  • He et al. (2022) He, Y., Wang, Y., Yu, L., Zhou, W., Zhou, W., 2022. Matrix Kendall's tau in high-dimensions: A robust statistic for matrix factor model. arXiv:2207.09633.
  • He et al. (2023) He, Y., Zhao, R., Zhou, W., 2023. Iterative alternating least square estimation for large-dimensional matrix factor model. arXiv:2301.00360 .
  • Jennrich (1969) Jennrich, R.I., 1969. Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics 40, 633–643.
  • Jing et al. (2021) Jing, B., Li, T., Lyu, Z., Xia, D., 2021. Community detection on mixture multilayer networks via regularized tensor decomposition. Annals of Statistics 49, 3181–3205.
  • Jing et al. (2022) Jing, B., Li, T., Ying, N., Yu, X., 2022. Community detection in sparse networks using the symmetrized Laplacian inverse matrix (SLIM). Statistica Sinica 32, 1–22.
  • Kong (2017) Kong, X., 2017. On the number of common factors with high-frequency data. Biometrika 104, 397–410.
  • Liu et al. (2023) Liu, W., Lin, H., Zheng, S., Liu, J., 2023. Generalized factor model for ultra-high dimensional correlated variables with mixed types. Journal of the American Statistical Association 118, 1385–1401.
  • Mao et al. (2019) Mao, X., Chen, S., Wong, R.K.W., 2019. Matrix completion with covariate information. Journal of the American Statistical Association 114, 198–210.
  • Mao et al. (2021) Mao, X., Wong, R.K.W., Chen, S., 2021. Matrix completion under low-rank missing mechanism. Statistica Sinica 31, 2005–2030.
  • Newey and McFadden (1986) Newey, W., McFadden, D., 1986. Large sample estimation and hypothesis testing. Handbook of Econometrics 4, 2111–2245.
  • Pelger (2019) Pelger, M., 2019. Large-dimensional factor modeling based on high-frequency observations. Journal of Econometrics 208, 23–42.
  • Trapani (2018) Trapani, L., 2018. A randomized sequential procedure to determine the number of factors. Journal of the American Statistical Association 113, 1341–1349.
  • Wang et al. (2019) Wang, D., Liu, X., Chen, R., 2019. Factor models for matrix-valued high-dimensional time series. Journal of Econometrics 208, 231–248.
  • Wang (2022) Wang, F., 2022. Maximum likelihood estimation and inference for high dimensional generalized factor models with application to factor-augmented regressions. Journal of Econometrics 229, 180–200.
  • Wu (1981) Wu, C.F., 1981. Asymptotic theory of nonlinear least squares estimation. The Annals of Statistics 9, 501–513.
  • Yu et al. (2022) Yu, L., He, Y., Kong, X., Zhang, X., 2022. Projected estimation for large-dimensional matrix factor models. Journal of Econometrics 229, 201–217.
  • Yuan et al. (2023) Yuan, C., Gao, Z., He, X., Huang, W., Guo, J., 2023. Two-way dynamic factor models for high-dimensional matrix-valued time series. Journal of the Royal Statistical Society Series B: Statistical Methodology 85, 1517–1537.
  • Zhang et al. (2024) Zhang, X., Liu, C.C., Guo, J., Yuen, K.C., Welsh, A.H., 2024. Modeling and learning on high-dimensional matrix-variate sequences. Journal of the American Statistical Association 0, 1–16. URL: https://doi.org/10.1080/01621459.2024.2344687.
  • Zhou et al. (2024) Zhou, T., Pan, R., Zhang, J., Wang, H., 2024. An attribute-based node2vec model for dynamic community detection on co-authorship network. Computational Statistics , 1–28.