
Generalized Matrix Factor Model

Xinbing Kong and Tong Zhang

Xinbing Kong: School of Statistics and Data Science, Nanjing Audit University, Nanjing 211815, China. The author acknowledges the support from the NSF China (72342019, 12431009). Email: xinbingkong@126.com

Tong Zhang: School of Statistics and Data Science, Nanjing Audit University, Nanjing 211815, China. Co-first author. Email: zhangtongsta@163.com
Abstract

This article introduces a nonlinear generalized matrix factor model (GMFM) that allows for mixed-type variables, extending the scope of linear matrix factor models (LMFM), which are so far limited to handling continuous variables. We introduce a novel augmented Lagrange multiplier method, equivalent to constrained maximum likelihood estimation and carefully tailored to be locally concave around the true factor and loading parameters. This guarantees the local convexity of the negative Hessian matrix around the true parameters of the factors and loadings, which is nontrivial in matrix factor modeling and leads to feasible central limit theorems for the estimated factors and loadings. We also establish the convergence rates of the estimated factor and loading matrices for the GMFM under general conditions that allow for correlations across samples, rows, and columns. Moreover, we provide a model selection criterion that consistently determines the numbers of row and column factors. To numerically compute the constrained maximum likelihood estimator, we provide two algorithms: two-stage alternating maximization and minorization maximization. Extensive simulation studies demonstrate GMFM's superiority in handling discrete and mixed-type variables. An empirical analysis of companies' operating performance shows that GMFM performs well in clustering and reconstruction in the presence of discontinuous entries in the data matrix.


Keywords: Matrix factor model; Mixed-type measurement; Penalized likelihood function.

1 Introduction

Matrix sequences appear in diverse areas, such as computer vision, recommender systems, social networks, economics, and management. A matrix sequence takes the form of a three-way array consisting of a row dimension, a column dimension, and a third sequence dimension that might be a time domain. A natural question is "how are the entries generated?" In the high-dimensional scenario where both the numbers of rows and columns are large, an intuitive generative mechanism is that all rows and columns are formed by a latent low-rank structure. There are at least two parallel, but seemingly equivalent, streams of work that learn the low-rank formation of matrix-variate data. The first is via matrix (or tensor) regularization with low-rank penalization, e.g., a rank constraint or nuclear norm penalty, with or without noise. This approach has achieved great success in wide-ranging applications such as matrix completion, compressed sensing, and image localization (Mao et al. (2019) and Mao et al. (2021)). The second line of work is factor modeling, which posits that all entries of a matrix observation are driven by a low-dimensional factor matrix, and that the magnitude of each entry relies on the additive and/or interactive effects of the low-dimensional row features and column characteristics. In the present paper, we follow the factor modeling approach, but a regularized optimization technique is used to adapt to the nonlinearity of our proposed model.

For ease of presentation, let $\{X_t=(X_{ijt})_{p_1\times p_2};\,1\leq t\leq T\}$ be a matrix sequence (e.g., a matrix time series). There are also two ways to do matrix (or even tensor) factor analysis of $X_t$. The first is to "flatten" each $X_t$ into a lengthy vector by stacking its columns or rows: the first step is learning the vector factor space with existing vector factor analysis methods (Bai and Ng (2002), Bai and Li (2012), Fan et al. (2013), Kong (2017), Pelger (2019), Trapani (2018), Barigozzi and Trapani (2022), Barigozzi et al. (2022)), and the second step is recovering the row and column factor spaces from the vectorized factor space under certain restrictions, e.g., the Kronecker structure, c.f., Chen et al. (2024) and Chen et al. (2021b). The second collection of papers (Wang et al. (2019), Wang (2022), Chang et al. (2023), Yu et al. (2022), Yuan et al. (2023), He et al. (2022)) directly models each $X_t$ with carefully designed additive and/or interactive row and column effects. It is worth noting that Wang et al. (2019) introduced the first matrix factor model in the literature; Yu et al. (2022) provided a projection approach to estimate the twisted row and column factor spaces which has, so far to the best of our knowledge, the fastest convergence rates in the class of PCA procedures; Yuan et al. (2023) presented the first two-way additive matrix factor model, and Zhang et al. (2024) recently combined the two-way additive and interactive components in generating the matrix entries; the seminal work of Chen et al. (2022) extends to general tensor data.

However, the above works are limited to single-type continuous variables. In practical applications, such as social science and biology, high-dimensional data sets often involve mixed-type variables. For instance, in the context of corporate operational performance analysis, there are not only continuous variables, such as return on equity and fixed asset turnover, but also categorical variables like industry type, enterprise nature (e.g., state-owned or private enterprise), and regional classification. These categorical indicators are often used to depict the fundamental characteristics and background information of a company. There are also binary indicators measuring each company's social responsibility, such as whether the protection of employees' rights and interests is disclosed, and whether environmental sustainable development is disclosed. Collectively, the discrete variables encompassed in a company's economic indicators span multiple facets and play a pivotal role in portraying the company's economic standing, analyzing market trends, and devising business strategies. In network analysis (Jing et al. (2021), Jing et al. (2022), Chang et al. (2023) and Zhou et al. (2024)), such as international trading, the weight on each edge could be the trading direction (a binary variable indicating import or export) or the number of traded products (e.g., shoes, clothes, food), measured either in counts (count variables) or kilograms (continuous variables). For non-continuous variables, nonlinear models are usually employed, and closed-form expressions for estimating factors and their loadings do not exist, in contrast to the explicit principal component solution (with or without iteration) for linear two-way factor analysis, c.f., Chen and Fan (2023), Yu et al. (2022) and He et al. (2022). Moreover, with mixed-type variables, the nonlinear structure varies depending on the type of variable, posing challenges for statistical inference and computation. To the best of our knowledge, there is still a lack of study on a general nonlinear model with factor structure for matrix sequences with mixed-type variables. This motivates the first contribution of the present paper.

The literature has recently started to pay attention to latent factor structures with single- or mixed-type variables in vector factor modeling. Chen et al. (2021a) derived the asymptotic theory of the regression coefficients and the average partial effect for a class of generalized linear vector factor models that includes logit, probit, ordered probit and Poisson models as special cases. Permitting general nonlinearity, Wang (2022) investigated the maximum likelihood estimation of the factors and loadings. Liu et al. (2023) proposed a canonical exponential family model with factor structure for mixed-type variables. However, the likelihood approach for these nonlinear vector factor models cannot be trivially extended to matrix sequences, since the Hessian matrix, jointly in the row and column loadings and the factors, is more complex. Extra manipulation is needed to feasibly derive the convergence properties of the parameters related to the factor structure, especially the asymptotic normality of the estimates; see Section 2 for more details. This motivates the second contribution of the present paper.

In this paper, we assume that each entry is generated by the following nonlinear generalized matrix factor model (GMFM):

x_{ijt}\sim g_{ijt}\left(\cdot\,|\,\pi_{ijt}\right),\quad i=1,\ldots,p_1,\ j=1,\ldots,p_2,\ t=1,\ldots,T, \qquad (1)

where $x_{ijt}$ is an observed entry lying in the $i$th row and $j$th column of sample $t$; $g_{ijt}(\cdot|\cdot)$ is some known probability (density or mass) function of $x_{ijt}$, allowed to vary across $i$, $j$ and $t$, permitting a mixture of distributions within each matrix; $\pi_{ijt}=r_i^{\prime}F_tc_j$, with $r_i$ being a $k_1$-dimensional vector of row loadings, $c_j$ being a $k_2$-dimensional vector of column loadings, and $F_t$ being a $k_1\times k_2$ matrix of factors. Both factors and loadings are unobservable, and $k_1$ and $k_2$ are the numbers of row and column factors, respectively. In the extensively studied linear matrix factor model, $\pi_{ijt}=r_i^{\prime}F_tc_j$ represents the conditional mean of $x_{ijt}$, while for the non-continuous variables considered in the present paper, it might stand for the conditional log odds or log probabilities. Even for continuous variables, $\pi_{ijt}$ in model (1) can be a function of the conditional smoothly-winsorized-mean in the robust matrix factor model, c.f., He et al. (2024), which maps the $g_{ijt}(\cdot|\cdot)$'s to the Huber loss function. Thus model (1) is a quite flexible nonlinear matrix factor model. Certainly, to identify $\{r_i,F_t,c_j\}$, one needs identification constraints to form an identifiably feasible solution set.
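To fix ideas, the following minimal Python sketch simulates data from model (1); the dimensions and the per-cell mixture of Gaussian, Poisson and logit types are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, T, k1, k2 = 30, 20, 50, 2, 2   # hypothetical sizes

# Latent loadings and factors driving pi_ijt = r_i' F_t c_j.
R = rng.normal(size=(p1, k1))
C = rng.normal(size=(p2, k2))
F = rng.normal(size=(T, k1, k2))

# Each cell (i, j) keeps one variable type across the whole sequence.
kind = rng.choice(["gauss", "poisson", "logit"], size=(p1, p2))

X = np.empty((T, p1, p2))
for t in range(T):
    Pi = R @ F[t] @ C.T                          # canonical parameter matrix
    X[t] = np.where(kind == "gauss", Pi + rng.normal(size=(p1, p2)),
           np.where(kind == "poisson",
                    rng.poisson(np.exp(np.clip(Pi, -5, 5))),   # lambda = e^pi
                    rng.binomial(1, 1 / (1 + np.exp(-Pi)))))   # logit link
```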

In this paper, we consider maximum (quasi-)likelihood estimation with rotational constraints for the factors and loadings of the GMFM. We introduce a novel penalized likelihood function that not only covers the log-likelihood function and the identifiability restriction, but also includes an additional augmented Lagrangian term, carefully tuned to guarantee the local convexity of the negative Hessian matrix around the true parameters, and hence a concave penalized quasi-likelihood function in a neighborhood of the true parameters. This leads to a convenient derivation of the central limit theorems, and it is where the second contribution of the present paper comes from. This paper also establishes the convergence rates of the estimated factors and loadings. Our theory demonstrates that our estimators converge at rate $O_p\left(1/\min\left\{\sqrt{p_1p_2},\sqrt{p_1T},\sqrt{p_2T}\right\}\right)$ in terms of the averaged Frobenius norm. This rate surpasses $O_p(1/\min\{\sqrt{p_1p_2},\sqrt{T}\})$, the rate for generalized vector factor analysis achieved through vectorizing $X_t$ by Liu et al. (2023) and Wang (2022), particularly when the sequence length $T$ is relatively short. It is also no slower than the rate, $O_p(\{1/\sqrt{p_1}+1/\sqrt{Tp_2}\}\vee\{1/\sqrt{p_2}+1/\sqrt{Tp_1}\})$, of the non-likelihood method $\alpha$-PCA in the seminal paper by Chen and Fan (2023), especially when the data matrix is not balanced. Furthermore, to consistently estimate the pair of numbers of row and column factors, we present an information-type criterion under the GMFM. This set of new results forms the third contribution of the present paper. Motivated by the challenges posed by nonlinear structures and mixed-type variables, we also present extended versions of the two-stage alternating maximization algorithm inspired by Liu et al. (2023) and the minorization maximization algorithm inspired by Chen et al. (2021a).

The rest of the paper is organized as follows. Section 2 introduces the setup and the estimation methodology. Section 3 presents the asymptotic theorems, and Subsection 3.4 provides two computational algorithms. Section 4 is devoted to extensive simulation studies. Section 5 analyzes high-tech companies' operating performance. Section 6 concludes. All proofs are relegated to the supplementary materials.

2 Methodology and Setup

2.1 Methodology

With model (1), the (quasi) log-likelihood function is

L(X|r,f,c)=\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}l_{ijt}(r_i^{\prime}F_tc_j), \qquad (2)

where $l_{ijt}(\pi_{ijt})=\log g_{ijt}(x_{ijt}|\pi_{ijt})$ with $\pi_{ijt}=r_i^{\prime}F_tc_j$; $r=(r_1^{\prime},\ldots,r_{p_1}^{\prime})^{\prime}$ is a $p_1k_1$ vector, $c=(c_1^{\prime},\ldots,c_{p_2}^{\prime})^{\prime}$ is a $p_2k_2$ vector, and $f=(f_1^{\prime},\ldots,f_T^{\prime})^{\prime}$ with $f_t=\mathrm{vec}\{F_t\}$ is a $Tk_1k_2$ vector. The entries of $X$ can be continuous variables, count variables, binary variables, and so on. Representative examples are as follows.

Example 1 (Linear) $l_{ijt}(\pi_{ijt})=-\frac{1}{2}\left(x_{ijt}-r_i^{\prime}F_tc_j\right)^2$ is a likelihood function of Gaussian random variables and a quasi-likelihood function for other continuous random variables with homoscedasticity.

Example 2 (Poisson) $l_{ijt}(\pi_{ijt})=-e^{r_i^{\prime}F_tc_j}+kr_i^{\prime}F_tc_j-\log k!$, as $P(X_{ijt}=k)=e^{-\lambda}\lambda^k/k!$ with $\lambda=e^{r_i^{\prime}F_tc_j}$.

Example 3 (Probit) $l_{ijt}(\pi_{ijt})=x_{ijt}\log\Phi(r_i^{\prime}F_tc_j)+(1-x_{ijt})\log\left(1-\Phi(r_i^{\prime}F_tc_j)\right)$, where $\Phi(\cdot)$ is the cumulative distribution function of a standard normal random variable.

Example 4 (Logit) $l_{ijt}(\pi_{ijt})=x_{ijt}\log\Psi(r_i^{\prime}F_tc_j)+(1-x_{ijt})\log\left(1-\Psi(r_i^{\prime}F_tc_j)\right)$, where $\Psi(\cdot)$ is the cumulative distribution function of the logistic distribution.

Example 5 (Tobit) Let $x_{ijt}=x_{ijt}^{*}I(x_{ijt}^{*}>0)$, where $x_{ijt}^{*}=r_i^{\prime}F_tc_j+e_{ijt}$ and $e_{ijt}\sim N(0,1)$. Then $l_{ijt}(\pi_{ijt})=-\frac{1}{2}(x_{ijt}-r_i^{\prime}F_tc_j)^2I(x_{ijt}>0)+\log\left(1-\Phi(r_i^{\prime}F_tc_j)\right)I(x_{ijt}=0)$, where $I(\cdot)$ is the indicator function.
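For concreteness, a minimal Python sketch of these entry-wise log-likelihoods follows (the function names are our own helpers, not from any package); `norm.logcdf` supplies $\log\Phi$ and `expit` supplies $\Psi$.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit, gammaln

def l_linear(x, pi):   # Example 1: Gaussian (quasi-)likelihood
    return -0.5 * (x - pi) ** 2

def l_poisson(x, pi):  # Example 2: lambda = exp(pi)
    return -np.exp(pi) + x * pi - gammaln(x + 1.0)

def l_probit(x, pi):   # Example 3: 1 - Phi(pi) = Phi(-pi)
    return x * norm.logcdf(pi) + (1 - x) * norm.logcdf(-pi)

def l_logit(x, pi):    # Example 4: 1 - Psi(pi) = Psi(-pi)
    return x * np.log(expit(pi)) + (1 - x) * np.log(expit(-pi))

def l_tobit(x, pi):    # Example 5: Gaussian censored at zero
    return np.where(x > 0, -0.5 * (x - pi) ** 2, norm.logcdf(-pi))
```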

Let $R=(r_1,\ldots,r_{p_1})^{\prime}$ be the $p_1\times k_1$ row factor loading matrix, $C=(c_1,\ldots,c_{p_2})^{\prime}$ the $p_2\times k_2$ column factor loading matrix, and $F_t$ the $k_1\times k_2$ common factor matrix. Denote the parameters of interest by the vector $\theta=(r^{\prime},f^{\prime},c^{\prime})^{\prime}$ and the true values by $r_i^0$, $F_t^0$ and $c_j^0$, respectively. To well define a high-dimensional parameter space $\Theta$ of $\theta$, besides the boundedness condition on each row of $R$ and $C$ and on the factor matrix $F_t$ in Assumption 1, one needs constraints to identify the parameters. Note that for any invertible $k_1\times k_1$ matrix $G_1$ and $k_2\times k_2$ matrix $G_2$, $RG_1$, $G_1^{-1}F_tG_2^{-1}$ and $CG_2^{\prime}$ give the same likelihood as $R$, $F_t$ and $C$ do. To uniquely determine $R$, $F_t$ and $C$, without loss of generality, we impose the following identifiability restrictions.

\left\{\begin{array}{ll}
(3.1) & \frac{R^{\prime}R}{p_1}=\mathbb{I}_{k_1},\quad\frac{C^{\prime}C}{p_2}=\mathbb{I}_{k_2},\\
(3.2) & \sum_{t=1}^{T}\frac{F_tF_t^{\prime}}{T}\ \mbox{and}\ \sum_{t=1}^{T}\frac{F_t^{\prime}F_t}{T}\ \mbox{are diagonal matrices},\\
(3.3) & \mbox{the first nonzero element in each column of}\ R\ \mbox{and}\ C\ \mbox{is positive}.
\end{array}\right. \qquad (6)

The conditions (3.1) and (3.2) are frequently used in the literature on matrix factor models, c.f., He et al. (2022). Without condition (3.3), the factors and loadings can only be unambiguously determined up to a sign matrix (a diagonal matrix with $1$ and $-1$ on the diagonal) transformation. A naturally identified estimate of the vector of factor and loading parameters is then the maximizer of (2) subject to the constraints in (6). It can easily be shown that this constrained maximizer is equivalent, up to sign indeterminacy, to the solution of the following augmented Lagrange problem:

\max_{r,f,c\in\Theta}Q(r,f,c) = L(X|r,f,c)+P(r,f,c), \qquad (7)
P(r,f,c) = P_1(r,f,c)+P_2(r,f,c)+P_3(r,f,c), \qquad (8)

where

P_1(\theta) = -b_1p_1p_2T\Bigg[\frac{1}{2p_1^2}\sum_{p=1}^{k_1}\sum_{q>p}^{k_1}\Bigg(\sum_{i=1}^{p_1}r_{ip}r_{iq}\Bigg)^2+\frac{1}{8p_1^2}\sum_{k=1}^{k_1}\Bigg(\sum_{i=1}^{p_1}r_{ik}^2-p_1\Bigg)^2+\frac{1}{2T^2}\sum_{p=1}^{k_1}\sum_{q>p}^{k_1}\Bigg(\sum_{t=1}^{T}F_{tp\cdot}F_{tq\cdot}^{\prime}\Bigg)^2\Bigg], \qquad (9)

P_2(\theta) = -b_2p_1p_2T\Bigg[\frac{1}{2p_2^2}\sum_{p=1}^{k_2}\sum_{q>p}^{k_2}\Bigg(\sum_{j=1}^{p_2}c_{jp}c_{jq}\Bigg)^2+\frac{1}{8p_2^2}\sum_{k=1}^{k_2}\Bigg(\sum_{j=1}^{p_2}c_{jk}^2-p_2\Bigg)^2+\frac{1}{2T^2}\sum_{p=1}^{k_2}\sum_{q>p}^{k_2}\Bigg(\sum_{t=1}^{T}F_{t\cdot p}^{\prime}F_{t\cdot q}\Bigg)^2\Bigg], \qquad (10)

P_3(\theta) = -b_3p_2\sum_{t=1}^{T}\Bigg[\frac{1}{2p_1}\sum_{p=1}^{k_1}\sum_{q=1}^{k_2}\Bigg(\sum_{i=1}^{p_1}\Bigg(\frac{r_{ip}^2-1}{2}f_{t,pq}+\sum_{k\neq p}^{k_1}r_{ip}r_{ik}f_{t,kq}\Bigg)\Bigg)^2\Bigg], \qquad (11)

where $b_1$, $b_2$ and $b_3$ are positive Lagrange multipliers. The penalty terms $P_1(\theta)$ and $P_2(\theta)$ are associated with conditions (3.1) and (3.2). The additional augmented term $P_3(\theta)$ is constructed to ensure a locally positive definite property of the negative Hessian matrix of the penalized likelihood function $Q(\theta)$ around the true parameter vector $\theta^0$, by carefully trading off the non-diagonal blocks of the scaled expectation of the Hessian matrix of $L(\theta)$. This property yields a desirable geometric concavity of the augmented Lagrange function $Q(\cdot)$ and leads to an easy derivation of the second-order asymptotics, in particular the central limit theorems for the estimated parameters. Without this newly added term, introducing $P_1(\theta)+P_2(\theta)$ alone might not result in a desirable landscape around the local maxima of $Q(\theta)$. In the sequel, we shall consider the estimated factors and loadings as the solution to maximizing $Q(r,f,c)$ and obtain the asymptotic results for matrix factor analysis. The equivalence between (2)-(6) and (7) is due to the non-positivity of $P(\theta)$ and the fact that $Q(\theta)$ is maximized if and only if $P(\theta)=0$.
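As a numerical illustration, the penalty $P(\theta)$ in (8)-(11) can be evaluated in a few lines of numpy; the sketch below uses our own array conventions ($R$ is $p_1\times k_1$, $C$ is $p_2\times k_2$, and the factors are stored as a $T\times k_1\times k_2$ array) and placeholder multipliers.

```python
import numpy as np

def penalty(R, C, F, b1=1.0, b2=1.0, b3=1.0):
    """Augmented Lagrangian P = P1 + P2 + P3 of (8)-(11); a sketch."""
    p1, k1 = R.shape
    p2, k2 = C.shape
    T = F.shape[0]
    GR = R.T @ R                               # Gram matrix of row loadings
    GC = C.T @ C
    SF_row = np.einsum('tpq,tkq->pk', F, F)    # sum_t F_t F_t'
    SF_col = np.einsum('tpq,tpk->qk', F, F)    # sum_t F_t' F_t
    off = lambda A: np.triu(A, 1) ** 2         # squared strictly-upper entries

    P1 = -b1 * p1 * p2 * T * (off(GR).sum() / (2 * p1**2)
         + ((np.diag(GR) - p1) ** 2).sum() / (8 * p1**2)
         + off(SF_row).sum() / (2 * T**2))
    P2 = -b2 * p1 * p2 * T * (off(GC).sum() / (2 * p2**2)
         + ((np.diag(GC) - p2) ** 2).sum() / (8 * p2**2)
         + off(SF_col).sum() / (2 * T**2))
    # P3: inner sum over i equals A F_t, where A = GR with its diagonal
    # replaced by (diag(GR) - p1)/2, matching (11).
    A = 0.5 * (GR - p1 * np.eye(k1)) + 0.5 * (GR - np.diag(np.diag(GR)))
    P3 = -b3 * p2 * sum(np.sum((A @ F[t]) ** 2) for t in range(T)) / (2 * p1)
    return P1 + P2 + P3
```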

Throughout the paper, $(p_1,p_2,T)\to\infty$ means that $p_1$, $p_2$ and $T$ go to infinity jointly, and "w.p.a.1" abbreviates "with probability approaching $1$". For vectors, $\|\cdot\|$ denotes the Euclidean norm. For a matrix $A$, $\rho_{\min}(A)$ is its smallest eigenvalue, and $\|A\|$, $\|A\|_F$, $\|A\|_1$, $\|A\|_\infty$ and $\|A\|_{\max}$ denote the spectral norm, Frobenius norm, $1$-norm, infinity norm and max norm, respectively. When $A$ has $Tk$ rows, divide $A$ into $T$ blocks with each block containing $k$ rows, and let $[A]_{tq}$ denote the $q$th row in the $t$th block and $[A]_t=([A]_{t1}^{\prime},\ldots,[A]_{tk}^{\prime})$ the $t$th block. Define $L_{p_1p_2T}=\min\{\sqrt{p_1p_2},\sqrt{p_1T},\sqrt{p_2T}\}$.

2.2 Assumptions

In this section, we give some technical assumptions that are used to establish the asymptotic theories. In the sequel, MM will be a generic constant that might vary from line to line.

Assumption 1.

1. $\|r_i^0\|\leq M$, $\|c_j^0\|\leq M$ and $\|F_t^0\|\leq M$ for all $i$, $j$ and $t$.

2. $\frac{1}{T}\sum_{t=1}^{T}F_t^0F_t^{0\prime}=\operatorname{diag}(\sigma_{T1},\ldots,\sigma_{Tk_1})$ with $\sigma_{T1}\geq\cdots\geq\sigma_{Tk_1}$, and $\sigma_{Tk}\to\sigma_k$ as $T\to\infty$ for $k=1,\ldots,k_1$, with $\sigma_1>\cdots>\sigma_{k_1}>0$. $\frac{1}{T}\sum_{t=1}^{T}F_t^{0\prime}F_t^0=\operatorname{diag}(\sigma_{T1}^{\prime},\ldots,\sigma_{Tk_2}^{\prime})$ with $\sigma_{T1}^{\prime}\geq\cdots\geq\sigma_{Tk_2}^{\prime}$, and $\sigma_{Tl}^{\prime}\to\sigma_l^{\prime}$ as $T\to\infty$ for $l=1,\ldots,k_2$, with $\sigma_1^{\prime}>\cdots>\sigma_{k_2}^{\prime}>0$.

Assumption 1-1 indicates that $r_i^0$, $F_t^0$ and $c_j^0$ are uniformly bounded. As explained in Wang (2022), the compactness of the parameter space is a commonly encountered trait in nonlinear models; see more examples in Newey and McFadden (1986), Jennrich (1969) and Wu (1981). Assumption 1-2 assumes asymptotically distinct eigenvalues for $\frac{1}{T}\sum_{t=1}^{T}F_t^0F_t^{0\prime}$ and $\frac{1}{T}\sum_{t=1}^{T}F_t^{0\prime}F_t^0$ so that the eigenvectors can be uniquely determined. This is the same as Assumption B in Yu et al. (2022).

Let $\partial_\pi l_{ijt}(\pi_{ijt})$, $\partial_{\pi^2}l_{ijt}(\pi_{ijt})$ and $\partial_{\pi^3}l_{ijt}(\pi_{ijt})$ be the first, second and third order derivatives of $l_{ijt}$ evaluated at $\pi_{ijt}$, respectively. When these derivatives are evaluated at $\pi_{ijt}^0$, we suppress the argument and simply write $\partial_\pi l_{ijt}$, $\partial_{\pi^2}l_{ijt}$ and $\partial_{\pi^3}l_{ijt}$.

Assumption 2.

1. $l_{ijt}(\pi)$ is three times differentiable.

2. There exist $b_U>b_L>0$ such that $b_L\leq-\partial_{\pi^2}l_{ijt}(\pi_{ijt})\leq b_U$.

3. $|\partial_{\pi^3}l_{ijt}(\pi_{ijt})|\leq b_U$ within a compact space of $\pi_{ijt}$.

Assumption 2-1 imposes a smoothness condition on the log-likelihood function. Assumptions 2-2 and 2-3 require that the negative second-order derivative of the log-likelihood function be bounded from below and above, and that the third-order derivative be bounded. The boundedness of the second- and third-order derivatives is used to control the remainder term in the expansion of the first-order condition. The lower bound on the negative second-order derivative, together with the boundedness of $\pi_{ijt}$, is used to show the consistency of the estimated factors and loadings. It can be easily verified that commonly used models, such as the linear, logit, probit, Poisson and Tobit models, satisfy Assumption 2; the logit case is worked out below.
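For instance, consider the logit model of Example 4 and write $\Psi(\pi)=(1+e^{-\pi})^{-1}$. Using $\Psi^{\prime}(\pi)=\Psi(\pi)(1-\Psi(\pi))$, a direct calculation (a worked example we add for illustration) gives

\partial_\pi l_{ijt}(\pi)=x_{ijt}-\Psi(\pi),\quad\partial_{\pi^2}l_{ijt}(\pi)=-\Psi(\pi)(1-\Psi(\pi)),\quad\partial_{\pi^3}l_{ijt}(\pi)=-\Psi(\pi)(1-\Psi(\pi))(1-2\Psi(\pi)).

Since Assumption 1 confines $\pi_{ijt}=r_i^{\prime}F_tc_j$ to a compact set, say $|\pi_{ijt}|\leq\bar{M}$, one may take $b_L=\Psi(\bar{M})(1-\Psi(\bar{M}))>0$ and $b_U=1/4$, so that $b_L\leq-\partial_{\pi^2}l_{ijt}(\pi)\leq b_U$; moreover $|\partial_{\pi^3}l_{ijt}(\pi)|\leq\sup_{u\in[0,1]}|u(1-u)(1-2u)|<1/4\leq b_U$, verifying Assumption 2.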

Assumption 3.

1. $\operatorname{E}(|\partial_\pi l_{ijt}|^\xi)\leq M$ for some $\xi>14$ and all $i$, $j$ and $t$.

2. For any $i,j,t$, $\sum_{s=1}^{T}\sum_{l=1}^{p_1}\sum_{k=1}^{p_2}|\operatorname{E}\partial_\pi l_{ijt}\partial_\pi l_{lks}|\leq M$.

3. For every $(t,s)$,

\operatorname{E}\Big[p_1^{-1/2}p_2^{-1/2}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{k=1}^{p_2}\big(\partial_\pi l_{ijt}\partial_\pi l_{iks}-\operatorname{E}\partial_\pi l_{ijt}\partial_\pi l_{iks}\big)\Big]^2\leq M,
\operatorname{E}\Big[p_1^{-1/2}p_2^{-1/2}\sum_{i=1}^{p_1}\sum_{l=1}^{p_1}\sum_{j=1}^{p_2}\big(\partial_\pi l_{ijt}\partial_\pi l_{ljs}-\operatorname{E}\partial_\pi l_{ijt}\partial_\pi l_{ljs}\big)\Big]^2\leq M;

and for every $(j,k)$,

\operatorname{E}\Big[p_1^{-1/2}T^{-1/2}\sum_{i=1}^{p_1}\sum_{t=1}^{T}\sum_{s=1}^{T}\big(\partial_\pi l_{ijt}\partial_\pi l_{iks}-\operatorname{E}\partial_\pi l_{ijt}\partial_\pi l_{iks}\big)\Big]^2\leq M.

Notice that for the linear model, $\partial_\pi l_{ijt}$ is the error term and $\partial_{\pi^2}l_{ijt}$ is a constant. Many papers require the $\partial_\pi l_{ijt}$'s to be conditionally independent so that Hoeffding's inequality can be used in the proofs, such as He et al. (2023). This paper considers a more general situation: the distributions of $x_{ijt}$ are allowed to be heterogeneous over $i$, $j$ and $t$, and to exhibit limited cross-row (-column) and cross-sample (e.g., serial) conditional dependence. Assumption 3 gives some moment conditions for $\{\partial_\pi l_{ijt}\partial_\pi l_{ijs};1\leq i\leq p_1,1\leq j\leq p_2,1\leq t\leq T\}$. They are satisfied when, for example, the $\{\partial_\pi l_{ijt}\}$ are cross-sectionally and temporally weakly dependent conditional on $\theta\in\Theta$.

Assumption 4.

For some $\zeta>2$,

\operatorname{E}\Bigg(p_1^{-1}\sum_{i=1}^{p_1}\Bigg\|\frac{1}{\sqrt{p_2T}}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_\pi l_{ijt}F_t^0c_j^0\Bigg\|^\zeta\Bigg)\leq M,
\operatorname{E}\Bigg(p_2^{-1}\sum_{j=1}^{p_2}\Bigg\|\frac{1}{\sqrt{p_1T}}\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_\pi l_{ijt}F_t^{0\prime}r_i^0\Bigg\|^\zeta\Bigg)\leq M,
\operatorname{E}\Bigg(T^{-1}\sum_{t=1}^{T}\Bigg\|\frac{1}{\sqrt{p_1p_2}}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_\pi l_{ijt}\,c_j^0\otimes r_i^0\Bigg\|^\zeta\Bigg)\leq M.

Assumption 4 concerns the first derivatives of the log-likelihood function. Again, it is satisfied when the $\{\partial_\pi l_{ijt}\}$ are cross-sectionally and temporally weakly dependent conditional on $\theta\in\Theta$. To present the next assumption on the moments of the second derivatives of the log-likelihood function, we introduce some more notation. Define

\mathcal{H}_{Lrr^{\prime}}=\Bigg[\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}F_t^0c_j^0c_j^{0\prime}F_t^{0\prime}\Bigg]_{i=1}^{p_1},\quad\mathcal{H}_{Lff^{\prime}}=\Bigg[\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_{\pi^2}l_{ijt}\,c_j^0\otimes r_i^0\,c_j^{0\prime}\otimes r_i^{0\prime}\Bigg]_{t=1}^{T},\quad\mathcal{H}_{Lcc^{\prime}}=\Bigg[\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}F_t^{0\prime}r_i^0r_i^{0\prime}F_t^0\Bigg]_{j=1}^{p_2}.
Assumption 5.

1. Recall the notation $[A]_l$ for a matrix $A$. Then

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\Big(\frac{1}{p_2T}[\mathcal{H}_{Lrr^{\prime}}]_i\Big)^{-1}\partial_\pi l_{ijt}F_t^0c_j^0r_i^{0\prime}\Big\|^2\leq M,
\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\Big(\frac{1}{p_1p_2}[\mathcal{H}_{Lff^{\prime}}]_t\Big)^{-1}\partial_\pi l_{ijt}\,c_j^0\otimes r_i^0\,f_t^{0\prime}\Big\|^2\leq M,
\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\Big(\frac{1}{p_1T}[\mathcal{H}_{Lcc^{\prime}}]_j\Big)^{-1}\partial_\pi l_{ijt}F_t^{0\prime}r_i^0c_j^{0\prime}\Big\|^2\leq M.

2. For any $k$, $s$ and $l$,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big(\frac{1}{T}\sum_{t=1}^{T}\big(\partial_{\pi^2}l_{ikt}F_t^{0\prime}r_i^0c_k^{0\prime}F_t^{0\prime}+\partial_\pi l_{ikt}F_t^{\prime}\big)\Big)\Big(\frac{1}{p_2T}[\mathcal{H}_{Lrr^{\prime}}]_i\Big)^{-1}\big(\partial_\pi l_{ijt}F_t^0c_j^0\big)\Big\|^2\leq M,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big(\frac{1}{p_2}\sum_{j=1}^{p_2}\big(\partial_{\pi^2}l_{ijs}\,c_j^0\otimes r_i^0\,c_j^{0\prime}F_s^{0\prime}+\partial_\pi l_{ijs}\,c_j^0\otimes\mathbb{I}_{k_1}\big)\Big)\Big(\frac{1}{p_2T}[\mathcal{H}_{Lrr^{\prime}}]_i\Big)^{-1}\big(\partial_\pi l_{ijt}F_t^0c_j^0\big)\Big\|^2\leq M,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big(\frac{1}{p_1}\sum_{j=1}^{p_2}\big(\partial_{\pi^2}l_{ljt}F_t^0c_j^0c_j^{0\prime}\otimes r_l^{0\prime}+\partial_\pi l_{ljt}\,c_j^{0\prime}\otimes\mathbb{I}_{k_1}\big)\Big)\Big(\frac{1}{p_1p_2}[\mathcal{H}_{Lff^{\prime}}]_t\Big)^{-1}\big(\partial_\pi l_{ijt}\,c_j^0\otimes r_i^0\big)\Big\|^2\leq M,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big[\frac{1}{p_2}\sum_{j=1}^{p_2}\big(\partial_{\pi^2}l_{ikt}F_t^{0\prime}r_i^0c_k^{0\prime}\otimes r_i^{0\prime}+\partial_\pi l_{ikt}\,\mathbb{I}_{k_2}\otimes r_i^{0\prime}\big)\Big]\Big(\frac{1}{p_1p_2}[\mathcal{H}_{Lff^{\prime}}]_t\Big)^{-1}\big(\partial_\pi l_{ijt}\,c_j^0\otimes r_i^0\big)\Big\|^2\leq M,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big[\frac{1}{p_1}\sum_{i=1}^{p_1}\big(\partial_{\pi^2}l_{ijs}\,c_j^0\otimes r_i^0r_i^{0\prime}F_s^0+\partial_\pi l_{ijs}\,\mathbb{I}_{k_2}\otimes r_i^0\big)\Big]\Big(\frac{1}{p_1T}[\mathcal{H}_{Lcc^{\prime}}]_j\Big)^{-1}\big(\partial_\pi l_{ijt}F_t^{0\prime}r_i^0\big)\Big\|^2\leq M,

\operatorname{E}\Big\|\frac{1}{\sqrt{p_1p_2T}}\sum_{i,j,t}\Big[\frac{1}{T}\sum_{t=1}^{T}\big(\partial_{\pi^2}l_{ljt}F_t^0c_j^0r_l^{0\prime}F_t^0+\partial_\pi l_{ljt}F_t^0\big)\Big]\Big(\frac{1}{p_1T}[\mathcal{H}_{Lcc^{\prime}}]_j\Big)^{-1}\big(\partial_\pi l_{ijt}F_t^{0\prime}r_i^0\big)\Big\|^2\leq M.

3. For any $i$ and some positive definite matrices $\Sigma_{iR}$ and $\Omega_{iR}$,

-(p_2T)^{-1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}F_t^0c_j^0c_j^{0\prime}F_t^{0\prime}\xrightarrow{P}\Sigma_{iR},\qquad(p_2T)^{-1/2}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_\pi l_{ijt}F_t^0c_j^0\xrightarrow{d}\mathcal{N}(0,\Omega_{iR}).

For any $j$ and some positive definite matrices $\Sigma_{jC}$ and $\Omega_{jC}$,

-(p_1T)^{-1}\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}F_t^{0\prime}r_i^0r_i^{0\prime}F_t^0\xrightarrow{P}\Sigma_{jC},\qquad(p_1T)^{-1/2}\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_\pi l_{ijt}F_t^{0\prime}r_i^0\xrightarrow{d}\mathcal{N}(0,\Omega_{jC}).

For any $t$ and some positive definite matrices $\Sigma_{tF}$ and $\Omega_{tF}$,

-(p_1p_2)^{-1}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_{\pi^2}l_{ijt}(c_j^0\otimes r_i^0)(c_j^0\otimes r_i^0)^{\prime}\xrightarrow{P}\Sigma_{tF},\qquad(p_1p_2)^{-1/2}\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_\pi l_{ijt}\,c_j^0\otimes r_i^0\xrightarrow{d}\mathcal{N}(0,\Omega_{tF}).

Though cumbersome, Assumption 5 can also be easily verified when the $\{\partial_\pi l_{ijt}\}$ are cross-sectionally and temporally weakly dependent conditional on $\theta\in\Theta$.

Assumption 6.

$\frac{(p_1p_2T)^{3/\xi}(p_1+p_2+T)^{1/\zeta}}{L_{p_1p_2T}}\to 0$ as $(p_1,p_2,T)\to\infty$.

Assumption 6 is a balance condition on $p_1$, $p_2$ and $T$, meaning that no dimension can completely dominate the other two; the degree of balance is controlled by $\xi$ and $\zeta$, and the condition is weaker when $\xi$ and $\zeta$ are larger. For example, when $\partial_\pi l_{ijt}$ has sub-Gaussian tails, $\xi$ can be arbitrarily large; and when the $\partial_\pi l_{ijt}$'s are conditionally Gaussian, a typical case in the linear model, $\zeta$ can be arbitrarily large.

3 Asymptotic Theory

Let $B(\mathcal{D})=\{(r,f,c):\|r\|_\infty+\|f\|_\infty+\|c\|_\infty\leq\mathcal{D}\}$ for some $\mathcal{D}$ large enough such that the true parameter $\theta^0$ lies in the interior of $B(\mathcal{D})$. Before stating the theoretical results, we give some more notation. Define $\widehat{r}=(\widehat{r}_1^{\prime},\ldots,\widehat{r}_{p_1}^{\prime})^{\prime}$, $\widehat{c}=(\widehat{c}_1^{\prime},\ldots,\widehat{c}_{p_2}^{\prime})^{\prime}$ and $\widehat{f}=(\widehat{f}_1^{\prime},\ldots,\widehat{f}_T^{\prime})^{\prime}$ as the solution to maximizing $Q$ within $B(\mathcal{D})$, where $\widehat{f}_t=\mathrm{vec}\{\widehat{F}_t\}$. Let $\widehat{\pi}_{ijt}=\widehat{r}_i^{\prime}\widehat{F}_t\widehat{c}_j$, $\widehat{\theta}=(\widehat{r}^{\prime},\widehat{f}^{\prime},\widehat{c}^{\prime})^{\prime}$, $\widehat{R}=(\widehat{r}_1,\ldots,\widehat{r}_{p_1})^{\prime}$ and $\widehat{C}=(\widehat{c}_1,\ldots,\widehat{c}_{p_2})^{\prime}$. Let $S(\theta)=\partial_\theta Q(\theta)$, $S_r(\theta)=\partial_r Q(\theta)$, $S_c(\theta)=\partial_c Q(\theta)$ and $S_f(\theta)=\partial_f Q(\theta)$ be the score functions, and let $\mathcal{H}(\theta)=\partial_{\theta\theta^{\prime}}Q(\theta)$ be the Hessian matrix. The decomposition of $\mathcal{H}(\theta)$ and the expression of each component are presented in Appendix A in the supplementary material. We suppress the argument when $S(\theta)$ and $\mathcal{H}(\theta)$ are evaluated at $\theta^0$, i.e., $S=S(\theta^0)$ and $\mathcal{H}=\mathcal{H}(\theta^0)$.

3.1 Convergence Rates

We first provide a consistency result for the estimates of factors and loadings in terms of their average Euclidean norm (or, equivalently, the Frobenius norm of the estimated row and column factor loading matrices and the factor matrix).

Proposition 1.

(Average Consistency). Under Assumptions 1-3, as $(p_1,p_2,T)\to\infty$,

\frac{1}{p_1}\|\widehat{r}-r^0\|^2+\frac{1}{p_2}\|\widehat{c}-c^0\|^2+\frac{1}{T}\|\widehat{f}-f^0\|^2=O_p\left(L_{p_1p_2T}^{-1}\right).

To derive finer convergence rates and the limit distributions of the estimated parameters, we need to utilize the first-order condition $S(\widehat{\theta})=0$. The following proposition demonstrates that $S(\widehat{\theta})=0$ w.p.a.1.

Proposition 2.

Under Assumptions 1-4 and Assumption 6, $S(\widehat{\theta})=0$ w.p.a.1.

Proposition 2 is nontrivial because the dimension of $\widehat{\theta}$ increases with $p_1$, $p_2$ and $T$. In a fixed-dimensional parameter space, the consistency of the estimated parameters, coupled with the assumption that the true parameters lie in the interior of the parameter space, ensures that the estimated parameters also lie in the interior; consequently, the first-order conditions are satisfied. However, when the dimension of the parameter space increases with $p_1$, $p_2$ and $T$, the average consistency of $\widehat{\theta}$ established in Proposition 1 is insufficient to ensure that $\widehat{\theta}$ remains an interior point of the parameter space. In this context, the uniform consistency of $\widehat{\theta}$ is required.

Proposition 3.

(Uniform Consistency). Under Assumptions 1-4 and 6,

\|\widehat{r}-r^0\|_\infty+\|\widehat{c}-c^0\|_\infty+\|\widehat{f}-f^0\|_\infty=O_p\left(\frac{(p_1p_2T)^{3/\xi}}{L_{p_1p_2T}}(p_1+p_2+T)^{1/\zeta}\right).

Note that $\xi$ and $\zeta$ could be arbitrarily large in some examples, as demonstrated below Assumption 6, and in such cases

\|\widehat{r}-r^0\|_\infty+\|\widehat{c}-c^0\|_\infty+\|\widehat{f}-f^0\|_\infty=O_p\left(L_{p_1p_2T}^{-1}\right).

Now we utilize the first-order conditions. Using the integral form of the mean value theorem for vector-valued functions to expand the first-order conditions, we have $0=S(\widehat{\theta})=S+\widetilde{\mathcal{H}}(\widehat{\theta}-\theta^0)$, where $\widetilde{\mathcal{H}}=\int_0^1\mathcal{H}\left(\theta^0+s(\widehat{\theta}-\theta^0)\right)ds$. It follows that

\left(\begin{array}{c}p_1^{-1/2}(\widehat{r}-r^0)\\ T^{-1/2}(\widehat{f}-f^0)\\ p_2^{-1/2}(\widehat{c}-c^0)\end{array}\right)=(p_1p_2T)^{-1/2}\left(-D_{p_2Tp_1}^{-1/2}\widetilde{\mathcal{H}}D_{p_2Tp_1}^{-1/2}\right)^{-1}D_{p_2Tp_1}^{-1/2}S, \qquad (15)

where

D_{p_1Tp_2}=\operatorname{diag}\{p_1\mathbb{I}_{p_1k_1},T\mathbb{I}_{Tk_1k_2},p_2\mathbb{I}_{p_2k_2}\},\quad D_{p_2Tp_1}=\operatorname{diag}\{p_2T\mathbb{I}_{p_1k_1},p_1p_2\mathbb{I}_{Tk_1k_2},p_1T\mathbb{I}_{p_2k_2}\}.

It is straightforward to see that $\|D_{p_2Tp_1}^{-1/2}S\|=O_p((p_1+p_2+T)^{1/2})$. Utilizing the local positive definiteness of $-\mathcal{H}(\theta)$, aided by the carefully designed augmented Lagrangian terms, we show in the supplementary material (Lemma A.3) that the largest eigenvalue of $(-D_{p_2Tp_1}^{-1/2}\mathcal{H}(\theta)D_{p_2Tp_1}^{-1/2})^{-1}$ is $O_p(1)$ in the neighborhood $B(\mathcal{D})\cap\{\|D_{p_1Tp_2}^{-1/2}(\theta-\theta^0)\|\leq m\}$ for some $m>0$. Since $\widehat{\theta}$ lies in this neighborhood w.p.a.1, $\|(-D_{p_2Tp_1}^{-1/2}\widetilde{\mathcal{H}}D_{p_2Tp_1}^{-1/2})^{-1}\|$ is $O_p(1)$. Then we have the following strengthened results.

Theorem 1.

(Average Rate). Under Assumptions 1-4 and Assumption 6, $\frac{1}{p_1}\|\widehat{r}-r^0\|^2+\frac{1}{p_2}\|\widehat{c}-c^0\|^2+\frac{1}{T}\|\widehat{f}-f^0\|^2=O_p\left(L_{p_1p_2T}^{-2}\right)$.

Comparable outcomes have been established for continuous variables. For example, the averaged convergence rates of $\theta$ for the least squares estimate in He et al. (2023) and the PE estimate in Yu et al. (2022) both reach $L_{p_1p_2T}^{-1}$. When the matrices are vectorized and the method for the generalized vector factor model is implemented, as in Liu et al. (2023) and Wang (2022), the averaged convergence rate of estimating the loading space spanned by $C\otimes R$ is $O_p(1/\min\{\sqrt{T},\sqrt{p_1p_2}\})$, which is no faster than ours, especially in scenarios where $p_1p_2$ outweighs $T$. Our rate is also no slower than the rate, $O_p(\{1/\sqrt{p_1}+1/\sqrt{Tp_2}\}\vee\{1/\sqrt{p_2}+1/\sqrt{Tp_1}\})$, of the non-likelihood method $\alpha$-PCA in the seminal paper by Chen and Fan (2023), especially when the data matrix is not balanced. Yet extra effort has to be carried out to separate $R$ and $C$ from their twisted Kronecker product.

3.2 Limit distributions

Now we proceed to the limit distributions of the estimated factors and loadings. Expanding the first-order conditions to a higher order, we have

0=S(\widehat{\theta})=S+\mathcal{H}(\widehat{\theta}-\theta^0)+\frac{1}{2}E,

where $E=(E_r^{\prime},E_f^{\prime},E_c^{\prime})^{\prime}$; $E_r$, $E_f$ and $E_c$ are $p_1k_1$-, $Tk_1k_2$- and $p_2k_2$-dimensional, with elements $E_{r,ik}=(\widehat{\theta}-\theta^0)^{\prime}\partial_{\theta\theta^{\prime}r_{ik}}Q(\theta_{ik}^{*})(\widehat{\theta}-\theta^0)$, $E_{f,tk}=(\widehat{\theta}-\theta^0)^{\prime}\partial_{\theta\theta^{\prime}f_{tk}}Q(\theta_{tk}^{*})(\widehat{\theta}-\theta^0)$ and $E_{c,jk}=(\widehat{\theta}-\theta^0)^{\prime}\partial_{\theta\theta^{\prime}c_{jk}}Q(\theta_{jk}^{*})(\widehat{\theta}-\theta^0)$, respectively, where $\theta_{ik}^{*}$, $\theta_{tk}^{*}$ and $\theta_{jk}^{*}$ are linear combinations of $\widehat{\theta}$ and $\theta^0$. Thus

\widehat{\theta}-\theta^0=-\mathcal{H}^{-1}S-\frac{1}{2}\mathcal{H}^{-1}E,\qquad\widehat{r}_i-r_i^0=\left[\widehat{\theta}-\theta^0\right]_i=-\left[\mathcal{H}^{-1}S\right]_i-\frac{1}{2}\left[\mathcal{H}^{-1}E\right]_i.

Utilizing the local landscape property of $\mathcal{H}$, we show in the supplementary material that

\left[\mathcal{H}^{-1}S\right]_i=\left(\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}F_t^0c_j^0c_j^{0\prime}F_t^{0\prime}\right)^{-1}\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_\pi l_{ijt}F_t^0c_j^0+O_p\left((p_1p_2T)^{-1/2}\right). \qquad (16)

The intuition behind (16) is that $\mathcal{H}$ is approximately block diagonal with the aid of the augmented Lagrangian terms $P_1(\theta)$, $P_2(\theta)$ and $P_3(\theta)$.

The elements within $\mathcal{H}$'s diagonal blocks are significantly larger than those in its off-diagonal blocks. This structure of $\mathcal{H}$ allows us to demonstrate that, in the expansion of $[\mathcal{H}^{-1}S]_i$, the additional terms resulting from the nonzero off-diagonal blocks collectively have an order of $O_p\left((p_1p_2T)^{-1/2}\right)$. Then, combining with Theorem 1, we show in the supplement that

\left\|\left[\mathcal{H}^{-1}E\right]_i\right\|=O_p\left(\frac{(p_1p_2T)^{3/\xi}}{L_{p_1p_2T}^{2}}\right). \qquad (17)

Thus if $\frac{(p_2T)^{1/2}(p_1p_2T)^{3/\xi}}{L_{p_1p_2T}^{2}}\to 0$, $\left\|[\mathcal{H}^{-1}E]_i\right\|$ is $o_p((p_2T)^{-1/2})$ and hence dominated by the first term on the right-hand side of (16).

Proposition 4.

(Individual Rate). Under Assumptions 1-4 and Assumption 6,

\|\widehat{r}_i-r_i^0\|=O_p(L_{p_1p_2T}^{-1})\ \mbox{ for each }i=1,\ldots,p_1,
\|\widehat{f}_t-f_t^0\|=O_p(L_{p_1p_2T}^{-1})\ \mbox{ for each }t=1,\ldots,T,
\|\widehat{c}_j-c_j^0\|=O_p(L_{p_1p_2T}^{-1})\ \mbox{ for each }j=1,\ldots,p_2.

From Proposition 4, we see that the convergence rate of each component of $\widehat{\theta}$ is the same as that of the least squares estimator in the LMFM of He et al. (2023), but slower than that of the PE estimator in Yu et al. (2022), unless one of $\left\{(Tp_1)^{-1},(Tp_2)^{-1},(p_1p_2)^{-1}\right\}$ dominates the others. This is because both the maximum likelihood and the least squares estimation derive estimators of $\theta$ jointly, while the PE method estimates $R$, $C$ and $F_t$ individually. Though the rate $O_p(L_{p_1p_2T}^{-1})$ is not sharp, it is enough for deriving the order of $\left[\mathcal{H}^{-1}E\right]_i$. Then we have the following central limit theorem.

Theorem 2.

(Individual Limit Distribution). Under Assumptions 1-6,

\sqrt{p_2T}\left(\widehat{r}_i-r_i^0\right)\xrightarrow{d}\mathcal{N}\left(0,\Sigma_{iR}^{-1}\Omega_{iR}\Sigma_{iR}^{\prime-1}\right)\ \text{ if }\ \frac{\sqrt{p_2T}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0,
\sqrt{p_1T}\left(\widehat{c}_j-c_j^0\right)\xrightarrow{d}\mathcal{N}\left(0,\Sigma_{jC}^{-1}\Omega_{jC}\Sigma_{jC}^{\prime-1}\right)\ \text{ if }\ \frac{\sqrt{p_1T}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0,
\sqrt{p_1p_2}\left(\widehat{f}_t-f_t^0\right)\xrightarrow{d}\mathcal{N}\left(0,\Sigma_{tF}^{-1}\Omega_{tF}\Sigma_{tF}^{\prime-1}\right)\ \text{ if }\ \frac{\sqrt{p_1p_2}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0,

where $\Sigma_{iR}$, $\Sigma_{jC}$, $\Sigma_{tF}$, $\Omega_{iR}$, $\Omega_{jC}$ and $\Omega_{tF}$ are defined in Assumption 5.

Remark 1.

The asymptotic variances of $\widehat{r}_i$, $\widehat{f}_t$ and $\widehat{c}_j$ can be consistently estimated, respectively, by

\widehat{\text{var}}_r=p_2T\left(\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})\widehat{F}_t\widehat{c}_j\widehat{c}_j^{\prime}\widehat{F}_t^{\prime}\right)^{-1}\left(\sum_{j=1}^{p_2}\sum_{t=1}^{T}\left(\partial_\pi l_{ijt}(\widehat{\pi}_{ijt})\right)^2\widehat{F}_t\widehat{c}_j\widehat{c}_j^{\prime}\widehat{F}_t^{\prime}\right)\left(\sum_{j=1}^{p_2}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})\widehat{F}_t\widehat{c}_j\widehat{c}_j^{\prime}\widehat{F}_t^{\prime}\right)^{-1},

\widehat{\text{var}}_f=p_1p_2\left(\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})(\widehat{c}_j\otimes\widehat{r}_i)(\widehat{c}_j^{\prime}\otimes\widehat{r}_i^{\prime})\right)^{-1}\left(\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\left(\partial_\pi l_{ijt}(\widehat{\pi}_{ijt})\right)^2(\widehat{c}_j\otimes\widehat{r}_i)(\widehat{c}_j^{\prime}\otimes\widehat{r}_i^{\prime})\right)\left(\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})(\widehat{c}_j\otimes\widehat{r}_i)(\widehat{c}_j^{\prime}\otimes\widehat{r}_i^{\prime})\right)^{-1},

\widehat{\text{var}}_c=p_1T\left(\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})\widehat{F}_t^{\prime}\widehat{r}_i\widehat{r}_i^{\prime}\widehat{F}_t\right)^{-1}\left(\sum_{i=1}^{p_1}\sum_{t=1}^{T}\left(\partial_\pi l_{ijt}(\widehat{\pi}_{ijt})\right)^2\widehat{F}_t^{\prime}\widehat{r}_i\widehat{r}_i^{\prime}\widehat{F}_t\right)\left(\sum_{i=1}^{p_1}\sum_{t=1}^{T}\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})\widehat{F}_t^{\prime}\widehat{r}_i\widehat{r}_i^{\prime}\widehat{F}_t\right)^{-1}.
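As an illustration, $\widehat{\text{var}}_r$ for a fixed row $i$ takes a few lines of numpy; the sketch below uses our own array conventions (`F_hat` is $T\times k_1\times k_2$, `C_hat` is $p_2\times k_2$, and `dl1`, `dl2` hold $\partial_\pi l_{ijt}(\widehat{\pi}_{ijt})$ and $\partial_{\pi^2}l_{ijt}(\widehat{\pi}_{ijt})$ for that $i$).

```python
import numpy as np

def var_hat_r(F_hat, C_hat, dl1, dl2):
    """Sandwich estimator of the asymptotic variance of r_hat_i (Remark 1).

    F_hat: (T, k1, k2); C_hat: (p2, k2);
    dl1, dl2: (p2, T) first/second derivatives at pi_hat_{ijt}, fixed i.
    """
    p2, T = dl1.shape
    # z_{jt} = F_hat_t c_hat_j, stacked into shape (p2, T, k1)
    Z = np.einsum('tab,jb->jta', F_hat, C_hat)
    H = np.einsum('jt,jta,jtb->ab', dl2, Z, Z)       # sum d2l * z z'
    S = np.einsum('jt,jta,jtb->ab', dl1 ** 2, Z, Z)  # sum (dl)^2 * z z'
    Hinv = np.linalg.inv(H)
    return p2 * T * Hinv @ S @ Hinv.T
```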

Theorem 2 generalizes Theorem 3.2 of Yu et al. (2022) by allowing factors to be extracted from discrete or other nonlinear models, and it allows the probability function to differ across rows, columns and samples. Thus the vast quantity of discrete data available in macroeconomic and financial studies can be effectively utilized, either alone or in combination with continuous data. An immediate consequence of Theorem 2 and Remark 1 is the following corollary.

Corollary 1.

Under Assumptions 1-6,

\sqrt{p_2T}\left(\widehat{r}_i-r_i^0\right)/\sqrt{\widehat{\text{var}}_r}\xrightarrow{d}\mathcal{N}(0,1)\ \text{ if }\ \frac{\sqrt{p_2T}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0,
\sqrt{p_1T}\left(\widehat{c}_j-c_j^0\right)/\sqrt{\widehat{\text{var}}_c}\xrightarrow{d}\mathcal{N}(0,1)\ \text{ if }\ \frac{\sqrt{p_1T}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0,
\sqrt{p_1p_2}\left(\widehat{f}_t-f_t^0\right)/\sqrt{\widehat{\text{var}}_f}\xrightarrow{d}\mathcal{N}(0,1)\ \text{ if }\ \frac{\sqrt{p_1p_2}}{L^2_{p_1p_2T}}(p_1p_2T)^{3/\xi}\to 0.

3.3 Consistency in determining the number of factors

Under the linear matrix factor model, Yu et al. (2022) proposed an eigenvalue-ratio test to consistently estimate the numbers of factors. However, the eigenvalue-ratio test is based on the ordered eigenvalues of the covariance matrix of $X_t$, which is not suitable for describing the correlation of non-continuous random variables. In this article, we use the idea of an information criterion to select the numbers of factors. The estimator of $(k_1,k_2)$ is similar to the penalized loss (PC) based estimator of Bai and Ng (2002), but is adapted to the matrix observations and the likelihood function. Given the numbers of factors $l_1,l_2$, let $\widehat{\theta}^{(l_1,l_2)}$ be the estimator of $\theta$. Our PC-based estimator of $(k_1,k_2)$ is defined as

(\widehat{k}_1,\widehat{k}_2)=\arg\min_{(l_1,l_2)}\left\{-\frac{1}{p_1p_2T}L\left(\widehat{\theta}^{(l_1,l_2)}\right)+(l_1+l_2)g(p_1,p_2,T)\right\}, \qquad (18)

where $g(p_1,p_2,T)$ is a prespecified penalty function. Theorem 3 below demonstrates that $(\widehat{k}_1,\widehat{k}_2)$ is consistent for $(k_1,k_2)$ under a suitable choice of $g(p_1,p_2,T)$.

Theorem 3.

If $0<g(p_1,p_2,T)\to 0$ and $g(p_1,p_2,T)L_{p_1p_2T}^2\to\infty$, then under Assumptions 1-4 and Assumption 6, $P(\widehat{k}_1=k_1)\to 1$ and $P(\widehat{k}_2=k_2)\to 1$ as $(p_1,p_2,T)\to\infty$.

In our simulation studies and real data analysis, we choose the penalty $g(p_1,p_2,T)=\frac{p_1+p_2+T}{p_1p_2T}\ln\left(\frac{p_1p_2T}{p_1+p_2+T}\right)$, which satisfies the conditions in Theorem 3 and is confirmed to work well.
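In code, the selection rule (18) with this penalty is a straightforward grid search. In the sketch below, `fit_gmfm` is a hypothetical routine (standing in for Algorithm 1 or 2 of Section 3.4) that returns the maximized log-likelihood $L(\widehat{\theta}^{(l_1,l_2)})$.

```python
import numpy as np

def select_factor_numbers(X, fit_gmfm, max_k=5):
    """PC-type criterion (18): grid search over candidate (l1, l2)."""
    T, p1, p2 = X.shape
    n, d = p1 * p2 * T, p1 + p2 + T
    g = d / n * np.log(n / d)                 # penalty g(p1, p2, T)
    best, best_pair = np.inf, (1, 1)
    for l1 in range(1, max_k + 1):
        for l2 in range(1, max_k + 1):
            loglik = fit_gmfm(X, l1, l2)      # hypothetical fitting routine
            crit = -loglik / n + (l1 + l2) * g
            if crit < best:
                best, best_pair = crit, (l1, l2)
    return best_pair
```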

3.4 Algorithms

We shall introduce two algorithms, two-stage alternating maximization and minorization maximization, to numerically compute the maximum likelihood estimator. The second is computationally simpler, but so far it has been shown to be applicable only to the probit, logit and Tobit models (Wang (2022)).

3.4.1 Two-stage alternating maximization (TSAM)

Let $\mathcal{S}_{1j}$ be an indicator set pertaining to a specific variable type in the $j$th column of $X$ with respect to $(i,t)$, such that the cardinality of $\mathcal{S}_{1j}$ and $p_1T$ are of comparable magnitude. Similarly, let $\mathcal{S}_{2i}$ be an indicator set pertaining to a specific variable type in the $i$th row of $X$ with respect to $(j,t)$, such that the number of elements in $\mathcal{S}_{2i}$ and $p_2T$ are comparable. And let $\mathcal{S}_{3t}$ be an indicator set associated with a particular variable type in the $t$th sequence of $X$ with respect to $(i,j)$, where the cardinality of $\mathcal{S}_{3t}$ and $p_1p_2$ are comparable in magnitude. Due to the finite number of variable types in $X$, $\mathcal{S}_{1j}$, $\mathcal{S}_{2i}$ and $\mathcal{S}_{3t}$ are well defined; a small sketch of how such index sets might be collected is given below.
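For example, with a $p_1\times p_2$ array `kind` recording each cell's variable type (as in the simulation sketch in the Introduction, where types vary over cells but not over $t$), one hypothetical way to collect the index sets is:

```python
import numpy as np

def index_sets(kind, vtype, T):
    """S_{1j}: (i,t) pairs of type vtype in column j; likewise S_{2i}, S_{3t}."""
    p1, p2 = kind.shape
    S1 = {j: [(i, t) for i in range(p1) if kind[i, j] == vtype
              for t in range(T)] for j in range(p2)}
    S2 = {i: [(j, t) for j in range(p2) if kind[i, j] == vtype
              for t in range(T)] for i in range(p1)}
    S3 = {t: [(i, j) for i in range(p1) for j in range(p2)
              if kind[i, j] == vtype] for t in range(T)}
    return S1, S2, S3
```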

Algorithm 1. Step 1 (Initialization): Randomly generate the initial values of $\widetilde{r}^{(0)}$ and $\widetilde{f}^{(0)}$.

Step 2 (Updating): For $k\geq 0$, calculate

\widetilde{c}_j^{(k)}=\operatorname{argmax}_{c_j}\sum_{(i,t)\in\mathcal{S}_{1j}}l_{ijt}\left(\widetilde{r}_i^{(k)\prime}\widetilde{F}_t^{(k)}c_j\right),\quad j=1,\ldots,p_2,
\widetilde{r}_i^{(k+1)}=\operatorname{argmax}_{r_i}\sum_{(j,t)\in\mathcal{S}_{2i}}l_{ijt}\left(r_i^{\prime}\widetilde{F}_t^{(k)}\widetilde{c}_j^{(k)}\right),\quad i=1,\ldots,p_1,
\widetilde{F}_t^{(k+1)}=\operatorname{argmax}_{F_t}\sum_{(i,j)\in\mathcal{S}_{3t}}l_{ijt}\left(\widetilde{r}_i^{(k+1)\prime}F_t\widetilde{c}_j^{(k)}\right),\quad t=1,\ldots,T.

Repeat the iteration until convergence and denote the resulting estimators by $\widetilde{r}$, $\widetilde{f}$ and $\widetilde{c}$.

Step 3 (Correction): A second-stage update is then conducted based on the score function and Hessian matrix:

\widehat{r}_i=\widetilde{r}_i-\left\{\partial_{r_ir_i^{\prime}}L(X|\widetilde{r},\widetilde{f},\widetilde{c})\right\}^{-1}\partial_{r_i}L(X|\widetilde{r},\widetilde{f},\widetilde{c}),\quad i=1,\ldots,p_1, \qquad (19)
\widehat{f}_t=\widetilde{f}_t-\left\{\partial_{f_tf_t^{\prime}}L(X|\widetilde{r},\widetilde{f},\widetilde{c})\right\}^{-1}\partial_{f_t}L(X|\widetilde{r},\widetilde{f},\widetilde{c}),\quad t=1,\ldots,T, \qquad (20)
\widehat{c}_j=\widetilde{c}_j-\left\{\partial_{c_jc_j^{\prime}}L(X|\widetilde{r},\widetilde{f},\widetilde{c})\right\}^{-1}\partial_{c_j}L(X|\widetilde{r},\widetilde{f},\widetilde{c}),\quad j=1,\ldots,p_2. \qquad (21)

Step 4 (Repetition): Repeat steps 1-3 a number of times and take the solution with the largest likelihood.

Step 5 (Normalization): Let R^(s)\widehat{R}^{(s)}, C^(s)\widehat{C}^{(s)} and F^(s)\widehat{F}^{(s)} be the estimators from step 4. To ensure the identification conditions in (6), a normalization step could be applied to the factor and loading matrices. Specifically, do singular value decomposition to R^(s)\widehat{R}^{(s)} and C^(s)\widehat{C}^{(s)} as follows.

\widehat{R}^{(s)}=U_{R}H_{R}V_{R}^{\prime}:=U_{R}Q_{R},\quad\widehat{C}^{(s)}=U_{C}H_{C}V_{C}^{\prime}:=U_{C}Q_{C},

where $Q_R=H_RV_R^{\prime}$ and $Q_C=H_CV_C^{\prime}$.

Define

Σ1=1Tp1p2t=1TQRFtQCQCFtQR,Σ2=1Tp1p2t=1TQCFtQRQRFtQC,\displaystyle{\Sigma}_{1}=\frac{1}{Tp_{1}p_{2}}\sum_{t=1}^{T}{Q}_{R}{F}_{t}{Q}_{C}^{\prime}{Q}_{C}{F}_{t}^{\prime}{Q}_{R}^{\prime},\quad{\Sigma}_{2}=\frac{1}{Tp_{1}p_{2}}\sum_{t=1}^{T}{Q}_{C}{F}_{t}^{\prime}{Q}_{R}^{\prime}{Q}_{R}{F}_{t}{Q}_{C}^{\prime},

and then compute their eigenvalue decompositions

Σ1=Γ1Λ1Γ1,Σ2=Γ2Λ2Γ2.\displaystyle{\Sigma}_{1}={\Gamma}_{1}{\Lambda}_{1}{\Gamma}_{1}^{\prime},\quad{\Sigma}_{2}={\Gamma}_{2}{\Lambda}_{2}{\Gamma}_{2}^{\prime}.

Then, the normalized loading and factor score matrices are, respectively,

R^=p1URΓ1,C^=p2UCΓ2andF^t=(p1p2)1/2Γ1QRFtQCΓ2.\displaystyle\widehat{{R}}=\sqrt{p_{1}}{U}_{R}{\Gamma}_{1},\quad\widehat{{C}}=\sqrt{p_{2}}{U}_{C}{\Gamma}_{2}\ \ \mbox{and}\ \ \widehat{{F}}_{t}=(p_{1}p_{2})^{-1/2}{\Gamma}_{1}^{\prime}{Q}_{R}{F}_{t}{Q}_{C}^{\prime}{\Gamma}_{2}. (22)
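To make Step 5 concrete, the following is a minimal numpy sketch of the normalization, assuming the raw estimates are stored as arrays R_s ($p_1\times k_1$), C_s ($p_2\times k_2$) and F_s ($T\times k_1\times k_2$); all function and variable names here are ours for illustration.

```python
import numpy as np

def normalize(R_s, C_s, F_s):
    """Rotate raw estimates (R_s, C_s, F_s) to satisfy the identification
    conditions in (6); R_s: (p1, k1), C_s: (p2, k2), F_s: (T, k1, k2)."""
    p1, p2, T = R_s.shape[0], C_s.shape[0], F_s.shape[0]

    # Thin SVDs R = U_R H_R V_R', C = U_C H_C V_C'; set Q_R = H_R V_R', etc.
    U_R, h_R, Vt_R = np.linalg.svd(R_s, full_matrices=False)
    U_C, h_C, Vt_C = np.linalg.svd(C_s, full_matrices=False)
    Q_R, Q_C = np.diag(h_R) @ Vt_R, np.diag(h_C) @ Vt_C

    # Sigma_1 and Sigma_2, followed by their eigenvalue decompositions.
    S1 = sum(Q_R @ F @ Q_C.T @ Q_C @ F.T @ Q_R.T for F in F_s) / (T * p1 * p2)
    S2 = sum(Q_C @ F.T @ Q_R.T @ Q_R @ F @ Q_C.T for F in F_s) / (T * p1 * p2)
    _, G1 = np.linalg.eigh(S1)   # eigh sorts eigenvalues in ascending order,
    _, G2 = np.linalg.eigh(S2)   # so flip the eigenvector columns below.
    G1, G2 = G1[:, ::-1], G2[:, ::-1]

    R_hat = np.sqrt(p1) * U_R @ G1
    C_hat = np.sqrt(p2) * U_C @ G2
    F_hat = np.stack([G1.T @ Q_R @ F @ Q_C.T @ G2 for F in F_s]) / np.sqrt(p1 * p2)
    return R_hat, C_hat, F_hat
```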

Liu et al. (2023) first proposed this algorithm for the generalized vector factor model, and the convergence of Step 2 to a local maximum is given in their Proposition 2. To search for the global maximum, a common practice is to draw the initial values at random multiple times and select the solution with the highest likelihood among all local maxima; we generalize this approach to the matrix factor model setting. Although the computation of $(\widetilde{r},\widetilde{f},\widetilde{c})$ is simple, efficiency may be lost since $\widetilde{r}$, $\widetilde{f}$ and $\widetilde{c}$ are not obtained from the full log-likelihood function $L(X|r,f,c)$. To improve efficiency, we conduct a second-stage update in Step 3; the gain from this one-step correction is validated in Section 4.5. The strength of this algorithm lies in its ability to run the estimations in parallel across all rows, columns and time points, leveraging existing packages, which makes the programming and computation straightforward.
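As an illustration of the Step 2 updates, here is a minimal Python sketch of the $c_j$-update for a purely Poisson panel, taking $\mathcal{S}_{1j}$ to be all $(i,t)$ pairs; the $r_i$- and $F_t$-updates are entirely analogous and can be run in parallel over their indices. The function names and the choice of scipy's BFGS routine are our own illustrative choices, not part of the algorithm's specification.

```python
import numpy as np
from scipy.optimize import minimize

def poisson_loglik(x, eta):
    # Entry-wise Poisson log-likelihood at natural parameter eta (up to constants).
    return x * eta - np.exp(eta)

def update_C(X, R, F):
    """One pass of the c_j-updates in Step 2: X is (p1, p2, T), R is (p1, k1),
    F is (T, k1, k2); returns the updated (p2, k2) column loading matrix."""
    p2, k2 = X.shape[1], F.shape[2]
    C = np.zeros((p2, k2))
    for j in range(p2):
        def negloglik(c, j=j):
            eta = np.einsum('ik,tkl,l->it', R, F, c)   # eta[i, t] = r_i' F_t c
            return -poisson_loglik(X[:, j, :], eta).sum()
        C[j] = minimize(negloglik, np.zeros(k2), method='BFGS').x
    return C
```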

3.4.2 Minorization maximization (MM)

Algorithm 2. Step 1 (Initialization): Randomly generate the initial values of r^(0)\widehat{r}^{(0)}, c^(0)\widehat{c}^{(0)} and f^(0)\widehat{f}^{(0)}.

Step 2 (Updating): For k0k\geq 0, first calculate x^ijt(k)=r^i(k)F^t(k)c^j(k)+1bUπlijt(r^i(k)F^t(k)c^j(k))\widehat{x}_{ijt}^{(k)}=\widehat{r}_{i}^{(k)\prime}\widehat{F}_{t}^{(k)}\widehat{c}_{j}^{(k)}+\frac{1}{b_{U}}\partial_{\pi}l_{ijt}\left(\widehat{r}_{i}^{(k)\prime}\widehat{F}_{t}^{(k)}\widehat{c}_{j}^{(k)}\right) for i=1,,p1i=1,\ldots,p_{1}, j=1,,p2j=1,\ldots,p_{2}, t=1,,Tt=1,\ldots,T, then update as follows.

\left(\widehat{r}^{(k+1)},\widehat{f}^{(k+1)},\widehat{c}^{(k+1)}\right)=\operatorname*{argmin}_{r,f,c}\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\sum_{t=1}^{T}\left(\widehat{x}^{(k)}_{ijt}-r_{i}^{\prime}F_{t}c_{j}\right)^{2}.

Iterate until L(X|r^(k+1),f^(k+1),c^(k+1))L(X|r^(k),f^(k),c^(k))L\left(X|\widehat{r}^{(k+1)},\widehat{f}^{(k+1)},\widehat{c}^{(k+1)}\right)-L\left(X|\widehat{r}^{(k)},\widehat{f}^{(k)},\widehat{c}^{(k)}\right)\leq error, where error is the level of numerical tolerance.

Step 3 (Repetition): Repeat Steps 1 and 2 several times with different random initializations and retain the solution with the largest likelihood.

Step 4 (Normalization): Similar to Algorithm 1, the solution in Step 3 is normalized by (22).

The minorization maximization (MM) algorithm does not require alternation; it only needs the least-squares matrix factorization in Step 2, which can be performed quickly using standard software packages. Wang (2022) applied this algorithm to the generalized vector factor model and proved the convergence of Step 2 in their Appendix B; we extend it here to the matrix factor model.
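For concreteness, the sketch below implements one MM iteration under the Logit model, solving the least-squares step with a PCA-type fit; this solver choice, like all names and shapes here, is ours, and any routine for the linear matrix factor least-squares problem could be substituted.

```python
import numpy as np

def mm_update(X, R, C, F, grad, b_U):
    """One MM iteration: form the surrogate data and refit by least squares.
    X: (T, p1, p2); R: (p1, k1); C: (p2, k2); F: (T, k1, k2);
    grad(x, eta) is the derivative of l_ijt with respect to its argument."""
    T, p1, p2 = X.shape
    k1, k2 = R.shape[1], C.shape[1]
    fit = np.einsum('ik,tkl,jl->tij', R, F, C)        # r_i' F_t c_j
    X_sur = fit + grad(X, fit) / b_U                  # surrogate x_hat_ijt^(k)
    # Least-squares refit via a PCA-type solver.
    M1 = np.einsum('tij,tkj->ik', X_sur, X_sur) / (T * p2)
    M2 = np.einsum('tij,tik->jk', X_sur, X_sur) / (T * p1)
    _, V1 = np.linalg.eigh(M1)
    _, V2 = np.linalg.eigh(M2)
    R_new = np.sqrt(p1) * V1[:, ::-1][:, :k1]         # top-k1 eigenvectors
    C_new = np.sqrt(p2) * V2[:, ::-1][:, :k2]         # top-k2 eigenvectors
    F_new = np.einsum('ik,tij,jl->tkl', R_new, X_sur, C_new) / (p1 * p2)
    return R_new, C_new, F_new

# Logit model: l(eta) = x * eta - log(1 + exp(eta)), so the derivative is
# x - 1 / (1 + exp(-eta)), and the curvature bound is b_U = 1/4.
logit_grad = lambda x, eta: x - 1.0 / (1.0 + np.exp(-eta))
```

Iterating mm_update and monitoring the likelihood gain against the tolerance reproduces Step 2.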

4 Numerical Studies

In Sections 4.1 and 4.2, we conduct simulation studies to assess the finite-sample performance of the proposed method (GMFM) by comparing it with the linear matrix factor model (LMFM). The accuracy of the estimated loading $\widehat{R}$ relative to $R^0$ is evaluated by the smallest nonzero canonical correlation between them, denoted by ccor$(\widehat{R},R^0)$; similarly, ccor$(\widehat{C},C^0)$ denotes the smallest nonzero canonical correlation between $\widehat{C}$ and $C^0$. Canonical correlation is widely used to measure performance in factor analysis; see, for example, Goyal et al. (2008), Doz et al. (2012), Bai and Li (2012) and Liu et al. (2023). In Sections 4.3 and 4.4, we examine the performance of the information criterion for selecting the numbers of factors and verify the asymptotic normality in Theorem 2, respectively. Section 4.5 investigates the efficiency gain of the second-stage correction in TSAM.
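For completeness, here is a short numpy sketch of this accuracy measure, assuming both input matrices have full column rank:

```python
import numpy as np

def ccor(A, B):
    """Smallest nonzero canonical correlation between the column spaces of
    A and B (both p x k); values close to 1 indicate close column spaces."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa'Qb are the cosines of the principal angles.
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False).min()
```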

4.1 Simulation Setting

We generate the factor matrix sequence by an AR(1) model, $f_t=0.2f_{t-1}+0.2\epsilon_t$, where the $\epsilon_t$'s are i.i.d. $\mathcal{N}(0_{k_1k_2},\mathbb{I}_{k_1k_2})$, $t=1,\cdots,T$. The entries of $R$ and $C$ are drawn independently from the uniform distribution $U(0,1)$. We consider six settings with different variable types and different combinations of $p_1=20,30,50$, $p_2=20,30,50$ and $T=30,50$; a generation sketch for one mixed design (Case 4) follows the case list below.

Case 1 (Gaussian variables with homoscedasticity): k1=k2=2k_{1}=k_{2}=2, and xijt𝒩(riFtcj,1)x_{ijt}\sim\mathcal{N}({r}_{i}^{\prime}{F}_{t}{c}_{j},1).

Case 2 (Gaussian variables with heteroscedasticity): k1=1k_{1}=1, k2=3k_{2}=3, xijt𝒩(riFtcj,τj2)x_{ijt}\sim\mathcal{N}({r}_{i}^{\prime}{F}_{t}{c}_{j},\tau_{j}^{2}) with τj=0.1+2Uj\tau_{j}=0.1+2U_{j} and UjU_{j}’s are i.i.d from U(0,1)U(0,1).

Case 3 (Poisson variables): k1=k2=3k_{1}=k_{2}=3 and xijtPoisson(exp(riFtcj))x_{ijt}\sim\text{Poisson}(\exp{({r}_{i}^{\prime}{F}_{t}{c}_{j}})).

Case 4 (The mixture of binary and count variables): k1=k2=4k_{1}=k_{2}=4. For i=1,,p1i=1,\cdots,p_{1}, j=1,,[p2/2]j=1,\cdots,[p_{2}/2], xijtPoisson(exp(riFtcj))x_{ijt}\sim\text{Poisson}(\exp{({r}_{i}^{\prime}{F}_{t}{c}_{j}})); for j=[p2/2]+1,,p2j=[p_{2}/2]+1,\cdots,p_{2}, xijtBernoulli(1/(1+exp(riFtcj)))x_{ijt}\sim\text{Bernoulli}(1/(1+\exp{({r}_{i}^{\prime}{F}_{t}{c}_{j}}))).

Case 5 (The mixture of continuous and count variables): k1=k2=5k_{1}=k_{2}=5. For i=1,,p1i=1,\cdots,p_{1}, j=1,,[p2/2]j=1,\cdots,[p_{2}/2], xijt𝒩(riFtcj,1)x_{ijt}\sim\mathcal{N}({r}_{i}^{\prime}{F}_{t}{c}_{j},1); for j=[p2/2]+1,,p2j=[p_{2}/2]+1,\cdots,p_{2}, xijtPoisson(exp(riFtcj))x_{ijt}\sim\text{Poisson}(\exp{({r}_{i}^{\prime}{F}_{t}{c}_{j}})).

Case 6 (The mixture of continuous, count and binary variables): $k_1=k_2=6$. For $i=1,\cdots,[p_1/2]$, $j=1,\cdots,[p_2/2]$, $x_{ijt}\sim\mathcal{N}(r_i^{\prime}F_tc_j,1)$; for $i=1,\cdots,[p_1/2]$, $j=[p_2/2]+1,\cdots,p_2$, $x_{ijt}\sim\text{Poisson}(\exp(r_i^{\prime}F_tc_j))$; for $i=[p_1/2]+1,\cdots,p_1$, $j=1,\cdots,[p_2/2]$, $x_{ijt}\sim\text{Poisson}(\exp(r_i^{\prime}F_tc_j))$; for $i=[p_1/2]+1,\cdots,p_1$, $j=[p_2/2]+1,\cdots,p_2$, $x_{ijt}\sim\text{Bernoulli}(1/(1+\exp(r_i^{\prime}F_tc_j)))$.
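As a concrete illustration of these designs, the sketch below generates one replication of Case 4 as stated above; the factor path is initialized at zero for simplicity, and the function name and seed are ours.

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_case4(p1, p2, T, k1=4, k2=4):
    """One replication of Case 4: Poisson columns j <= [p2/2],
    Bernoulli columns j > [p2/2]; returns X (T, p1, p2) and (R, C, F)."""
    R = rng.uniform(0, 1, (p1, k1))
    C = rng.uniform(0, 1, (p2, k2))
    F = np.zeros((T + 1, k1, k2))                    # f_0 = 0 for simplicity
    for t in range(1, T + 1):                        # f_t = 0.2 f_{t-1} + 0.2 eps_t
        F[t] = 0.2 * F[t - 1] + 0.2 * rng.standard_normal((k1, k2))
    F = F[1:]
    eta = np.einsum('ik,tkl,jl->tij', R, F, C)       # eta[t, i, j] = r_i' F_t c_j
    X = np.empty((T, p1, p2))
    half = p2 // 2
    X[:, :, :half] = rng.poisson(np.exp(eta[:, :, :half]))
    # Success probability 1/(1 + exp(eta)), exactly as stated in Case 4.
    X[:, :, half:] = rng.binomial(1, 1.0 / (1.0 + np.exp(eta[:, :, half:])))
    return X, R, C, F
```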

4.2 Comparison with the Linear Factor Models

Tables 1-2 report the averages of ccor$(\widehat{R},R^0)$ and ccor$(\widehat{C},C^0)$ for the LMFM and GMFM based on 500 repetitions. The loadings and factors of the GMFM are computed by our proposed TSAM algorithm, while those of the LMFM are estimated by $\alpha$-PCA ($\alpha=0$) in Chen and Fan (2023). Theoretically, under homoscedasticity the LMFM and the GMFM are equivalent, and this equivalence is reflected in the comparable results for Case 1 in Table 1. Under heteroscedasticity, the GMFM demonstrates superior performance to the LMFM in Case 2, as is evident in Table 1. Moreover, the precision of $\widehat{R}$ and $\widehat{C}$ increases as $p_1$, $p_2$ and/or $T$ increases, consistent with our theoretical results in Theorems 1 and 2.

In Case 3 of Table 1, the GMFM performs significantly better than the LMFM for Poisson variables. Evidently, the LMFM is effective only for continuous variables; it performs poorly for discrete variables, and its estimation accuracy does not improve with increasing dimensionality. In contrast, the GMFM attains high canonical correlations when estimating loadings for Poisson variables, with accuracy comparable to that for continuous variables.

Table 2 shows the results for Cases 4-6, where $X_t$, $t=1,\cdots,T$, contain mixed variables including Poisson, binary and normal entries. Clearly, the LMFM breaks down on mixed data containing discrete variables, while the GMFM remains reliable. In particular, the canonical correlations of the GMFM for both $\widehat{R}$ and $\widehat{C}$ increase as $p_1$, $p_2$ and/or $T$ increases, whereas those of the LMFM do not.

Table 1: The results for single-type variables.

                                 GMFM                       LMFM
                T  p1\p2     20      30      50         20      30      50
 Case 1: Gaussian variables with homoscedasticity
 ccor(R̂,R0)    30   20    0.9520  0.9677  0.9827     0.8792  0.9201  0.9509
                     30    0.9527  0.9690  0.9836     0.8882  0.9248  0.9590
                     50    0.9608  0.9718  0.9850     0.9316  0.9497  0.9724
                50   20    0.9711  0.9833  0.9879     0.9204  0.9551  0.9725
                     30    0.9734  0.9845  0.9898     0.9475  0.9650  0.9763
                     50    0.9759  0.9845  0.9909     0.9590  0.9728  0.9840
 ccor(Ĉ,C0)    30   20    0.9389  0.9573  0.9606     0.8284  0.8795  0.9305
                     30    0.9643  0.9669  0.9702     0.8779  0.9092  0.9427
                     50    0.9788  0.9819  0.9837     0.9340  0.9541  0.9693
                50   20    0.9638  0.9718  0.9744     0.8973  0.9381  0.9563
                     30    0.9790  0.9813  0.9823     0.9268  0.9549  0.9694
                     50    0.9881  0.9898  0.9903     0.9592  0.9746  0.9816
 Case 2: Gaussian variables with heteroscedasticity
 ccor(R̂,R0)    30   20    0.9579  0.9771  0.9874     0.9341  0.9566  0.9740
                     30    0.9650  0.9809  0.9895     0.9457  0.9648  0.9809
                     50    0.9729  0.9814  0.9898     0.9619  0.9724  0.9844
                50   20    0.9788  0.9881  0.9933     0.9618  0.9765  0.9859
                     30    0.9832  0.9892  0.9938     0.9725  0.9820  0.9884
                     50    0.9841  0.9905  0.9941     0.9775  0.9859  0.9912
 ccor(Ĉ,C0)    30   20    0.6849  0.7969  0.8528     0.1692  0.1412  0.2050
                     30    0.7992  0.8821  0.9094     0.1881  0.1559  0.1996
                     50    0.9151  0.9345  0.9451     0.1850  0.1869  0.2097
                50   20    0.8102  0.8924  0.9234     0.1670  0.1545  0.2479
                     30    0.9118  0.9376  0.9494     0.1698  0.1642  0.2730
                     50    0.9543  0.9662  0.9709     0.1956  0.1907  0.2650
 Case 3: Poisson variables
 ccor(R̂,R0)    30   20    0.9663  0.9806  0.9882     0.3492  0.3393  0.3666
                     30    0.9697  0.9814  0.9890     0.2998  0.2521  0.3063
                     50    0.9720  0.9824  0.9891     0.2140  0.2165  0.2506
                50   20    0.9808  0.9885  0.9931     0.3477  0.4054  0.4485
                     30    0.9827  0.9889  0.9933     0.3101  0.2969  0.2905
                     50    0.9836  0.9899  0.9941     0.2341  0.2528  0.2172
 ccor(Ĉ,C0)    30   20    0.9504  0.9658  0.9687     0.2711  0.2624  0.2177
                     30    0.9699  0.9755  0.9806     0.2950  0.2836  0.2325
                     50    0.9852  0.9867  0.9876     0.2811  0.2738  0.1843
                50   20    0.9732  0.9815  0.9830     0.3105  0.2304  0.1961
                     30    0.9830  0.9859  0.9879     0.3055  0.1895  0.1612
                     50    0.9905  0.9927  0.9934     0.2963  0.2441  0.2149
Table 2: The results for mixed-type variables.

                                 GMFM                       LMFM
                T  p1\p2     20      30      50         20      30      50
 Case 4: binary and count variables
 ccor(R̂,R0)    30   20    0.9408  0.9666  0.9837     0.3900  0.3485  0.3704
                     30    0.9545  0.9718  0.9857     0.2693  0.2651  0.2580
                     50    0.9617  0.9767  0.9868     0.2708  0.2386  0.1882
                50   20    0.9678  0.9819  0.9911     0.3961  0.4086  0.4384
                     30    0.9731  0.9847  0.9918     0.3098  0.3183  0.2968
                     50    0.9779  0.9864  0.9926     0.2823  0.2124  0.2362
 ccor(Ĉ,C0)    30   20    0.7785  0.8926  0.9207     0.2160  0.1705  0.1533
                     30    0.9162  0.9350  0.9500     0.2126  0.1508  0.1405
                     50    0.9515  0.9627  0.9717     0.2079  0.1756  0.1412
                50   20    0.8724  0.9382  0.9551     0.2194  0.1636  0.1041
                     30    0.9407  0.9604  0.9702     0.1982  0.1746  0.1517
                     50    0.9659  0.9787  0.9825     0.1941  0.1476  0.1375
 Case 5: continuous and count variables
 ccor(R̂,R0)    30   20    0.9624  0.9818  0.9913     0.5255  0.4910  0.5678
                     30    0.9705  0.9843  0.9917     0.3885  0.4213  0.4085
                     50    0.9791  0.9868  0.9929     0.2685  0.2784  0.2809
                50   20    0.9814  0.9898  0.9946     0.5002  0.5464  0.6054
                     30    0.9849  0.9916  0.9954     0.4544  0.4507  0.4529
                     50    0.9875  0.9926  0.9961     0.3168  0.3032  0.3323
 ccor(Ĉ,C0)    30   20    0.8882  0.9675  0.9766     0.2878  0.2077  0.1607
                     30    0.9596  0.9785  0.9850     0.2477  0.2497  0.1767
                     50    0.9843  0.9895  0.9912     0.3200  0.1805  0.1495
                50   20    0.9473  0.9806  0.9867     0.2900  0.2060  0.1764
                     30    0.9745  0.9875  0.9914     0.2574  0.2566  0.1795
                     50    0.9902  0.9937  0.9950     0.3306  0.1877  0.1635
 Case 6: continuous, count and binary variables
 ccor(R̂,R0)    30   20    0.6565  0.8353  0.9498     0.2703  0.3290  0.2900
                     30    0.8167  0.9290  0.9777     0.2364  0.2423  0.2070
                     50    0.9493  0.9753  0.9880     0.1303  0.1690  0.1637
                50   20    0.7052  0.8737  0.9620     0.3071  0.3589  0.3450
                     30    0.8969  0.9583  0.9849     0.1675  0.2789  0.2523
                     50    0.9662  0.9850  0.9930     0.1593  0.1828  0.1758
 ccor(Ĉ,C0)    30   30    0.5365  0.7318  0.9250     0.2667  0.1670  0.1246
                     60    0.6170  0.8871  0.9621     0.1935  0.1428  0.1224
                     90    0.8769  0.9643  0.9848     0.2093  0.1545  0.1241
                50   30    0.5823  0.8080  0.9544     0.2584  0.2362  0.1142
                     60    0.7168  0.9095  0.9768     0.2004  0.1887  0.1074
                     90    0.8829  0.9742  0.9896     0.1775  0.1606  0.1457

4.3 Model Selection

This section verifies the performance of the proposed method for selecting the numbers of row and column factors. Table 3 reports the averages of $\widehat{k}_1$ and $\widehat{k}_2$ based on 100 repetitions, where the candidate values of $k_1$ and $k_2$ are the integers from 1 to 8. Table 3 shows that the PC-type information criterion (18) works well in all six cases considered, with different $k_1$ and $k_2$. In contrast, we found that the LMFM using $\alpha$-PCA ($\alpha=0$) by Chen and Fan (2023) and the projected estimation (PE) by Yu et al. (2022) both fail under the PC information criterion, which tends to choose $(1,1)$ in all six cases.

Table 3: The average of the estimated numbers of row and column factors by our proposed information criterion based on GMFM, from 100 repetitions.

              Case 1: (k1,k2)=(2,2)                       Case 2: (k1,k2)=(1,3)
 T  p1\p2       20           30           50                20           30           50
 30   20   (1.00,1.00)  (1.00,1.00)  (1.30,1.30)       (1.00,1.00)  (1.00,1.00)  (1.00,1.17)
      30   (1.15,1.15)  (1.60,1.60)  (1.96,1.96)       (1.00,1.00)  (1.00,1.00)  (1.00,1.30)
      50   (1.65,1.50)  (1.85,1.60)  (2.00,2.00)       (1.00,1.00)  (1.00,1.00)  (1.00,2.02)
 50   20   (1.32,1.32)  (1.76,1.76)  (2.00,2.00)       (1.20,1.10)  (1.00,1.22)  (1.00,1.50)
      30   (1.64,1.62)  (1.96,1.94)  (2.00,2.00)       (1.00,1.10)  (1.00,1.42)  (1.00,2.00)
      50   (1.98,1.96)  (2.00,2.00)  (2.00,2.00)       (1.00,1.20)  (1.00,1.82)  (1.00,3.00)
              Case 3: (k1,k2)=(3,3)                       Case 4: (k1,k2)=(4,4)
 30   20   (1.00,1.00)  (1.00,1.00)  (1.00,1.00)       (1.00,1.00)  (1.00,1.00)  (1.00,1.00)
      30   (1.00,1.00)  (1.40,1.40)  (2.15,2.05)       (1.00,1.00)  (2.12,2.04)  (2.52,2.36)
      50   (1.20,1.10)  (2.05,1.85)  (2.96,2.54)       (1.00,1.00)  (2.40,2.20)  (3.64,3.36)
 50   20   (1.00,1.00)  (1.30,1.25)  (2.50,2.48)       (1.00,1.00)  (1.52,1.30)  (2.60,2.60)
      30   (1.70,1.60)  (2.60,2.35)  (3.00,2.93)       (2.80,2.52)  (3.36,3.04)  (4.00,3.82)
      50   (2.32,2.16)  (3.00,2.86)  (3.00,3.00)       (3.04,2.56)  (3.95,3.40)  (4.00,4.00)
              Case 5: (k1,k2)=(5,5)                       Case 6: (k1,k2)=(6,6)
 30   20   (4.72,4.60)  (5.00,4.85)  (5.00,5.00)       (4.20,3.92)  (5.35,4.60)  (5.46,5.86)
      30   (5.00,4.68)  (5.00,5.00)  (5.00,5.00)       (5.52,4.40)  (5.80,5.60)  (5.86,6.00)
      50   (5.00,5.00)  (5.00,5.00)  (5.00,5.00)       (5.82,5.30)  (6.00,5.80)  (6.00,6.00)
 50   20   (5.00,4.76)  (5.00,5.00)  (5.00,5.00)       (5.30,4.45)  (5.50,5.30)  (5.80,6.00)
      30   (5.00,4.80)  (5.00,5.00)  (5.00,5.00)       (5.70,4.90)  (6.00,5.80)  (6.00,6.00)
      50   (5.00,5.00)  (5.00,5.00)  (5.00,5.00)       (6.00,5.12)  (6.00,5.82)  (6.00,6.00)

4.4 Asymptotic distribution

In this section, we assess the adequacy of the asymptotic distributions in Theorem 2. We consider the case with $k_1=k_2=1$ and generate $f_t$ i.i.d. from $\mathcal{N}(0,1)$. The generation and normalization of $R$ and $C$ are the same as in Section 4.1. Given $R$, $F$ and $C$, we consider three data generating processes (DGPs) for $X_{ijt}$.

DGP 1 (Logit): XijtX_{ijt} is a binary random variable and P(Xijt=1)=Ψ(riFtcj)P(X_{ijt}=1)=\Psi(r_{i}^{\prime}F_{t}c_{j}), where Ψ(z)=1/(1+ez)\Psi(z)=1/\left(1+e^{-z}\right).

DGP 2 (Probit): XijtX_{ijt} is a binary random variable and P(Xijt=1)=Φ(riFtcj)P(X_{ijt}=1)=\Phi(r_{i}^{\prime}F_{t}c_{j}), where Φ()\Phi(\cdot) is the cumulative distribution function of standard normal distribution.

DGP 3 (Mixed): For $t=1,\ldots,T$, $i=1,\ldots,p_1$, $j=1,\ldots,p_2/3$, $X_{ijt}\sim N(r_i^{\prime}F_tc_j,1)$; for $t=1,\ldots,T$, $i=1,\ldots,p_1$, $j=p_2/3+1,\ldots,2p_2/3$, $X_{ijt}$ is a binary random variable with $P(X_{ijt}=1)=\Psi(r_i^{\prime}F_tc_j)$; for $t=1,\ldots,T$, $i=1,\ldots,p_1$, $j=2p_2/3+1,\ldots,p_2$, $X_{ijt}$ is a binary random variable with $P(X_{ijt}=1)=\Phi(r_i^{\prime}F_tc_j)$.

Under such cases, by Theorem 2,

p2T(r^iri0)𝑑𝒩(0,p2TΣiR1),\displaystyle\sqrt{p_{2}T}(\hat{r}_{i}-r^{0}_{i})\xrightarrow{d}\mathcal{N}(0,p_{2}T\Sigma_{iR}^{-1}),
p1T(c^jcj0)𝑑𝒩(0,p1TΣjC1),\displaystyle\sqrt{p_{1}T}(\hat{c}_{j}-c^{0}_{j})\xrightarrow{d}\mathcal{N}(0,p_{1}T\Sigma_{jC}^{-1}),
p1p2(f^tft0)𝑑𝒩(0,p1p2ΣtF1),\displaystyle\sqrt{p_{1}p_{2}}(\hat{f}_{t}-f^{0}_{t})\xrightarrow{d}\mathcal{N}(0,p_{1}p_{2}\Sigma_{tF}^{-1}),

which reach the asymptotic efficiency bounds. Figures 1, 2 and 3 show the normalized histograms of $\Sigma_{1R}^{1/2}(\hat{r}_1-r_1^0)$, $\Sigma_{1C}^{1/2}(\hat{c}_1-c_1^0)$ and $\Sigma_{1F}^{1/2}(\hat{f}_1-f_1^0)$ for DGP1-DGP3, by TSAM and MM respectively, with the standard normal density curve overlaid for comparison. For Step 2 of MM, we choose $b_U=1/4$ for the Logit case and $b_U=1$ for the Probit case, following the definition of $b_U$ in Section 2.2. The standard normal density clearly provides a good approximation across all subfigures.

[Figure 1 here: six panels, DGP1-DGP3 by MM (top row) and DGP1-DGP3 by TSAM (bottom row).]
Figure 1: Empirical distributions of $\hat{r}_1$ after standardization under various settings, over 1000 replications with $T=p_1=p_2=50$. The red solid curves show the density of the standard normal distribution.
[Figure 2 here: six panels, DGP1-DGP3 by MM (top row) and DGP1-DGP3 by TSAM (bottom row).]
Figure 2: Empirical distributions of $\hat{c}_1$ after standardization under various settings, over 1000 replications with $T=p_1=p_2=50$. The red solid curves show the density of the standard normal distribution.
[Figure 3 here: six panels, DGP1-DGP3 by MM (top row) and DGP1-DGP3 by TSAM (bottom row).]
Figure 3: Empirical distributions of $\hat{f}_1$ after standardization under various settings, over 1000 replications with $T=p_1=p_2=50$. The red solid curves show the density of the standard normal distribution.

4.5 Efficiency Gain of the One-Step Correction in TSAM

To demonstrate the efficiency improvement from the second-stage update in TSAM, we illustrate the results using Case 5 with $T=20$. In the first stage of TSAM, we obtain estimators using only the continuous variables (S1N), only the count variables (S1P), and both (S1NP). As shown in Table 4, the performance of the first-stage estimators improves as $p_2$ increases. Notably, S1NP performs best, while S1P and S1N are similarly effective; this matches our expectation, since S1NP utilizes the most comprehensive information. The second-stage update significantly enhances the performance of the first-stage estimators, especially when the sample size is small.

Table 4: The results of Case 5 with T=20T=20. The estimators obtained in the first stage, which are based solely on Normal variables, solely on Poisson variables, and on a combination of both, are abbreviated as S1N, S1P and S1NP, respectively. The estimators arising from the second-stage update are designated as OSN, OSP and OSNP. The values of ccorR and ccorC represent the averaged ccor(R^,R0)(\widehat{{R}},{R}^{0}) and ccor(C^,C0)(\widehat{{C}},{C}^{0}), respectively.
           (p1,p2)=(30,20)           (p1,p2)=(30,30)           (p1,p2)=(30,50)
          S1N     S1P     S1NP      S1N     S1P     S1NP      S1N     S1P     S1NP
 ccorR   0.9328  0.9388  0.9702    0.9502  0.9583  0.9787    0.9703  0.9776  0.9879
 ccorC   0.8930  0.8612  0.9470    0.9055  0.9533  0.9652    0.9699  0.9695  0.9764
          OSN     OSP     OSNP      OSN     OSP     OSNP      OSN     OSP     OSNP
 ccorR   0.9692  0.9701  0.9753    0.9761  0.9761  0.9791    0.9872  0.9877  0.9883
 ccorC   0.9456  0.9424  0.9542    0.9500  0.9689  0.9699    0.9766  0.9766  0.9769

5 Real Data Analysis

In this section, we analyze the operating performance of listed high-tech manufacturing companies in China. The data are sourced from publicly accessible financial reports, corporate social responsibility statements, and stock-exchange trading data, and can be downloaded from https://data.csmar.com. The dataset comprises 22 indicators for 12 manufacturing companies over 68 quarters from 2007-Q1 to 2023-Q2. It is worth noting that, unlike in the general LMFM literature (Wang et al. (2019), Chen et al. (2022), Yu et al. (2022)), in addition to 18 continuous variables there are four binary variables. These binary variables represent corporate social responsibility activities, such as the protection of employees' rights and interests, environmental and sustainable development initiatives, and public relations and social welfare undertakings; disclosure of an indicator is coded as 1, and non-disclosure as 0. Furthermore, the economic indicators of these companies fall into four main groups: profitability, development ability, operational ability, and solvency. A detailed list of these indicators is presented in the supplementary material. The enterprises belong to three high-tech industries: pharmaceutical manufacturing, electronic equipment manufacturing, and transportation manufacturing.

The missing rate in this dataset is low (0.05%), and we use simple linear interpolation to fill in the missing values. Each original univariate time series is transformed by taking the second difference, and each transformed series is standardized to remove the effects of non-zero means and disparate variances.
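A minimal pandas sketch of this preprocessing, assuming the raw panel is stored with quarters as rows and one column per company-indicator pair; the file name is hypothetical, and for brevity the sketch applies the same pipeline to every column.

```python
import pandas as pd

# Hypothetical file: 68 quarterly rows, one column per (company, indicator).
df = pd.read_csv('operating_performance.csv', index_col=0)

df = df.interpolate(method='linear', limit_direction='both')  # fill the ~0.05% missing cells
df = df.diff().diff().dropna()                                # second difference of each series
df = (df - df.mean()) / df.std()                              # standardize each series
```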

Before estimating the matrix factors and loadings, we first determine the numbers of row and column factors. Our proposed criterion suggests $k_1=3$ and $k_2=4$. For the row factors, Table 5 shows that they are closely related to the industry classification: companies in the same industry tend to have similar loadings, and the 12 companies naturally divide into three groups, namely pharmaceutical manufacturing (P1, P2, P3, P4), electronic equipment manufacturing (E1, E2, E3, E4, E5), and transportation manufacturing (T1, T2, T3). For the column factors, Figure 4 divides the indicators into four groups: profitability (G1), operational ability (G2), solvency (G3), and development ability (G4). Notably, the disclosure of social responsibility is incorporated into the development-ability indicators; a company with a strong sense of social responsibility generally tends to have better development prospects. This categorization aligns with economic intuition, and the detailed grouping of the indicators is outlined in the supplementary material.

Table 5: Row (enterprises) loading matrix (multiplied by 10) by GMLE.

 Row factor    E1   E2   E3   E4   E5   P1   P2   P3   P4   T1   T2   T3
 F1            -5   -3    2    0   -2  -21  -20  -16   -9    0   -2    1
 F2            -3   -5    0   -8   -6    1    1    0    3   20   17   20
 F3           -17  -13  -10  -20  -13    2    1    0    5   -5   -6   -8
[Figure 4 here: four panels, one per column loading.]
Figure 4: The absolute values of the four column loadings, where the red triangular dots mark the largest few absolute values of each loading.

The canonical correlation between the LMFM (estimated by $\alpha$-PCA with $\alpha=0$) and GMFM estimates of $R$ (resp. $C$) is 0.76 (resp. 0.70), showing that the loading estimators of the LMFM differ from those of the GMFM. In addition, we calculate the negative scaled log-likelihood $-2L/(Tp_1p_2)$ to measure goodness of fit: with $(k_1,k_2)=(3,4)$ it is 1.28 for the GMFM and 1.86 for the LMFM. Clearly, the GMFM fits the data better.

To further compare the GMFM and LMFM, we employ a rolling-validation procedure as in Yu et al. (2022). For each year $t$ from 2007 to 2022, we use the $n$ (bandwidth) years of observations before year $t$ to fit the matrix-variate factor model and estimate the two loading matrices. The loadings are then used to estimate the factors and corresponding residuals of the 4 quarters in the current year. Specifically, let $Y_t^i$ and $\widehat{Y}_t^i$ be the observed and estimated performance matrices of quarter $i$ in year $t$. Define

MSEt=14×12×22i=14Y^tiYtiF2,andρt=i=14Y^tiYtiF2i=14YtiY¯tF2,\displaystyle\text{MSE}_{t}=\frac{1}{4\times 12\times 22}\sum_{i=1}^{4}\left\|\widehat{{Y}}_{t}^{i}-{Y}_{t}^{i}\right\|_{F}^{2},\ \mbox{and}\ \rho_{t}=\frac{\sum_{i=1}^{4}\left\|\widehat{{Y}}_{t}^{i}-{Y}_{t}^{i}\right\|_{F}^{2}}{\sum_{i=1}^{4}\left\|{{Y}}_{t}^{i}-\overline{{Y}}_{t}\right\|_{F}^{2}},

as the mean squared error (MSE) and the unexplained proportion of total variance, respectively, where $\overline{Y}_t$ denotes the average of the four quarterly matrices in year $t$. Let $\overline{\text{MSE}}$ and $\bar{\rho}$ be the means of $\text{MSE}_t$ and $\rho_t$ over the validation years. For the LMFM, the $\widehat{Y}_t^i$'s are reconstructed by the projected estimator (PE) of Yu et al. (2022) and by the $\alpha$-PCA method with $\alpha=0$ of Chen and Fan (2023).
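A small numpy sketch of the two criteria for a single validation year, with the four quarterly observed and reconstructed 12 x 22 matrices stacked along the first axis (names are ours):

```python
import numpy as np

def mse_rho(Y_obs, Y_fit):
    """MSE_t and rho_t for one year; Y_obs, Y_fit: arrays of shape (4, 12, 22)."""
    Y_obs, Y_fit = np.asarray(Y_obs), np.asarray(Y_fit)
    resid = ((Y_fit - Y_obs) ** 2).sum()             # sum of squared Frobenius norms
    mse = resid / (4 * 12 * 22)
    rho = resid / ((Y_obs - Y_obs.mean(axis=0)) ** 2).sum()
    return mse, rho
```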

Table 6: Rolling validation for the operating performance of the 12 listed high-tech manufacturing companies by different models.

                         mean MSE                             mean ρ
 n   k   LMFM      LMFM    GMFM    GVFM      LMFM      LMFM    GMFM    GVFM
         (α-PCA)   (PE)                      (α-PCA)   (PE)
 5   1   0.97      0.99    0.92    0.93      1.20      1.21    1.12    1.13
 5   2   0.90      0.91    0.85    0.91      1.11      1.10    1.02    1.12
 5   3   0.81      0.80    0.78    0.88      0.99      0.96    0.95    1.07
 10  1   0.96      0.96    0.91    0.91      1.21      1.19    1.12    1.13
 10  2   0.88      0.87    0.83    0.89      1.10      1.08    1.02    1.14
 10  3   0.79      0.77    0.75    0.88      0.97      0.95    0.93    1.08

Table 6 presents the average values of MSE and $\rho$ for four estimation methods: LMFM ($\alpha$-PCA), LMFM (PE), GMFM, and the generalized vector factor model (GVFM) of Liu et al. (2023) with $k^2$ factors. The comparison covers different combinations of bandwidth $n$ and number of factors ($k_1=k_2=k$). From Table 6, our proposed GMFM outperforms the other methods across the board. Specifically, when $k=1$, GMFM and GVFM display comparable MSEs; as $k$ rises, however, the MSE of GMFM decreases markedly, whereas the decline for GVFM is gradual. Moreover, the two estimation methods for the LMFM behave comparably, and both are worse than GMFM.

6 Conclusion

This paper provides a general theory for matrix factor analysis of a high-dimensional nonlinear model. Under certain regularity conditions, we establish the consistency and convergence rates of the estimated factors and loadings, and derive the asymptotic normality of the estimated loadings. Furthermore, we develop a criterion based on a penalized loss to consistently estimate the numbers of row and column factors within the GMFM framework. Computationally, we propose two algorithms to compute the factors and loadings of the GMFM. To support the theory, we conduct extensive simulation studies and apply the GMFM to the analysis of a business performance dataset.

Supplementary Material

The Supplementary Material contains detailed technical proofs of the main theorems, together with some technical lemmas of independent interest, as well as details of the dataset.

References

  • Bai and Li (2012) Bai, J., Li, K., 2012. Statistical analysis of factor models of high dimension. The Annals of Statistics 40, 436–465.
  • Bai and Ng (2002) Bai, J., Ng, S., 2002. Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
  • Barigozzi et al. (2022) Barigozzi, M., He, Y., Li, L., Trapani, L., 2022. Statistical inference for large-dimensional tensor factor model by iterative projections. arXiv:2206.09800 .
  • Barigozzi and Trapani (2022) Barigozzi, M., Trapani, L., 2022. Testing for common trends in nonstationary large datasets. Journal of Business & Economic Statistics 40, 1107–1122.
  • Chang et al. (2023) Chang, J., He, J., Yang, L., Yao, Q., 2023. Modelling matrix time series via a tensor CP-decomposition. Journal of the Royal Statistical Society Series B: Statistical Methodology 85, 127–148.
  • Chen et al. (2024) Chen, E., Fan, J., Zhu, X., 2024. Factor augmented matrix regression. arXiv:2405.17744 .
  • Chen and Fan (2023) Chen, E.Y., Fan, J., 2023. Statistical inference for high-dimensional matrix-variate factor models. Journal of the American Statistical Association 118, 1038–1055.
  • Chen et al. (2021a) Chen, M., Fernández-Val, I., Weidner, M., 2021a. Nonlinear factor models for network and panel data. Journal of Econometrics 220, 296–324.
  • Chen et al. (2021b) Chen, R., Xiao, H., Yang, D., 2021b. Autoregressive models for matrix-valued time series. Journal of Econometrics 222, 539–560.
  • Chen et al. (2022) Chen, R., Yang, D., Zhang, C.H., 2022. Factor models for high-dimensional tensor time series. Journal of the American Statistical Association 117, 94–116.
  • Doz et al. (2012) Doz, C., Giannone, D., Reichlin, L., 2012. A quasi-maximum likelihood approach for large, approximate dynamic factor models. The Review of Economics and Statistics 94, 1014–1024.
  • Fan et al. (2013) Fan, J., Liao, Y., Mincheva, M., 2013. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society Series B: Statistical Methodology 75, 603–680.
  • Goyal et al. (2008) Goyal, A., Pérignon, C., Villa, C., 2008. How common are common return factors across the NYSE and NASDAQ? Journal of Financial Economics 90, 252–271.
  • He et al. (2022) He, Y., Wang, Y., Yu, L., Zhou, W., Zhou, W., 2022. Matrix Kendall's tau in high-dimensions: A robust statistic for matrix factor model. arXiv:2207.09633.
  • He et al. (2023) He, Y., Zhao, R., Zhou, W., 2023. Iterative alternating least square estimation for large-dimensional matrix factor model. arXiv:2301.00360 .
  • Jennrich (1969) Jennrich, R.I., 1969. Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics 40, 633–643.
  • Jing et al. (2021) Jing, B., Li, T., Lyu, Z., Xia, D., 2021. Community detection on mixture multilayer networks via regularized tensor decomposition. Annals of Statistics 49, 3181–3205.
  • Jing et al. (2022) Jing, B., Li, T., Ying, N., Yu, X., 2022. Community detection in sparse networks using the symmetrized Laplacian inverse matrix (SLIM). Statistica Sinica 32, 1–22.
  • Kong (2017) Kong, X., 2017. On the number of common factors with high-frequency data. Biometrika 104, 397–410.
  • Liu et al. (2023) Liu, W., Lin, H., Zheng, S., Liu, J., 2023. Generalized factor model for ultra-high dimensional correlated variables with mixed types. Journal of the American Statistical Association 118, 1385–1401.
  • Mao et al. (2019) Mao, X., Chen, S., Wong, R.K.W., 2019. Matrix completion with covariate information. Journal of the American Statistical Association 114, 198–210.
  • Mao et al. (2021) Mao, X., Wong, R.K.W., Chen, S., 2021. Matrix completion under low-rank missing mechanism. Statistica Sinica 31, 2005–2030.
  • Newey and McFadden (1986) Newey, W., McFadden, D., 1986. Large sample estimation and hypothesis testing. Handbook of Econometrics 4, 2111–2245.
  • Pelger (2019) Pelger, M., 2019. Large-dimensional factor modeling based on high-frequency observations. Journal of Econometrics 208, 23–42.
  • Trapani (2018) Trapani, L., 2018. A randomized sequential procedure to determine the number of factors. Journal of the American Statistical Association 113, 1341–1349.
  • Wang et al. (2019) Wang, D., Liu, X., Chen, R., 2019. Factor models for matrix-valued high-dimensional time series. Journal of Econometrics 208, 231–248.
  • Wang (2022) Wang, F., 2022. Maximum likelihood estimation and inference for high dimensional generalized factor models with application to factor-augmented regressions. Journal of Econometrics 229, 180–200.
  • Wu (1981) Wu, C.F., 1981. Asymptotic theory of nonlinear least squares estimation. The Annals of Statistics 9, 501–513.
  • Yu et al. (2022) Yu, L., He, Y., Kong, X., Zhang, X., 2022. Projected estimation for large-dimensional matrix factor models. Journal of Econometrics 229, 201–217.
  • Yuan et al. (2023) Yuan, C., Gao, Z., He, X., Huang, W., Guo, J., 2023. Two-way dynamic factor models for high-dimensional matrix-valued time series. Journal of the Royal Statistical Society Series B: Statistical Methodology 85, 1517–1537.
  • Zhang et al. (2024) Zhang, X., Liu, C.C., Guo, J., Yuen, K.C., Welsh, A.H., 2024. Modeling and learning on high-dimensional matrix-variate sequences. Journal of the American Statistical Association 0, 1–16. URL: https://doi.org/10.1080/01621459.2024.2344687.
  • Zhou et al. (2024) Zhou, T., Pan, R., Zhang, J., Wang, H., 2024. An attribute-based node2vec model for dynamic community detection on co-authorship network. Computational Statistics , 1–28.