
The Equivariance Criterion in a Linear Model for Fixed-X Cases

Daowei Wang
Paula and Gregory Chow Institute for Studies in Economics
Xiamen University
Xiamen, Fujian China PRC
ORCiD: 0009-0007-3428-723X
   Mian Wu
H. Milton Stewart School of Industrial and Systems Engineering
Georgia Institute of Technology
Atlanta, GA, USA
ORCiD: 0009-0001-2116-1773
mwu385@gatech.edu
   Dr. Haojin Zhou
Academy of Pharmacy
Xi’an Jiaotong - Liverpool University (XJTLU)
Suzhou, Jiangsu China PRC
ORCiD: 0000-0003-0802-099X
Corresponding author: haojin.zhou@xjtlu.edu.cn

In this article, we explore the use of the equivariance criterion in a normal linear model with fixed $X$ and extend the model to allow multiple populations, which, in turn, leads to a multivariate invariant location-scale transformation group rather than the commonly used univariate one. The minimum risk equivariant estimators of the coefficient vector and the diagonal covariance matrix are derived and shown to be consistent with results in the literature. This work serves as an early exploration of the use of the equivariance criterion in machine learning.

Key words and Phrases: Equivariance, Linear Model, Estimation, Coefficient Vector, Covariance Matrix, Likelihood Loss

1. Introduction

Consider a general linear model with a response $Y\in\mathbb{R}$ and a covariate vector ${\bf X}\in\mathbb{R}^{p}$. The aims of the linear model are to build a linear functional relationship between $Y$ and ${\bf X}$ from paired observations $({\bf X}_{1},Y_{1}),\dots,({\bf X}_{n},Y_{n})$ and to predict a future response $Y_{0}$ given a covariate ${\bf X}_{0}$. Predominantly in the literature, a fixed-$X$ assumption has been used with the following understandings:

(i) The covariate values are fixed before sampling, while the only randomness comes from the responses $Y_{1},\dots,Y_{n}$;

(ii) There is no other relationship among those covariate values.

Meanwhile, random-$X$ cases have also been investigated, e.g., Breiman and Spector (1992), where the covariate values are assumed to be sampled from a random vector.

In this article, we will start with a fixed $X$ composed of $p$ linearly independent rows, each repeated $n_{i}\geq 1$ times, i.e.,

$$X=(\mathbf{X_{1}},\dots,\mathbf{X_{1}},\mathbf{X_{2}},\dots,\mathbf{X_{2}},\dots,\mathbf{X_{p}},\dots,\mathbf{X_{p}})^{\prime}, \tag{1.1}$$

where $\sum n_{i}=n$. One can argue that this is the general form for a fixed-$X$ case: since the rank of the design matrix $X$ is $p$, only $p$ distinct rows are allowed. The difference between a fixed $X$ and a random $X$ lies in whether $X$ is fully known before sampling. Naturally, one should set the number of parameters equal to the rank of $X$, namely $p$. If there are more parameters than $p$, the model carries redundant parameters, some of which may not be identifiable and thus not estimable. On the other hand, if there are fewer parameters than $p$, then more parameters will be needed, or some part of $X$ must be considered unknown before sampling, in which case $X$ is random and out of the scope of this article. To derive the optimal solution, we will introduce the equivariance criterion, which has been widely considered in the linear model, e.g., Rao (1965); Eaton (1989), and the distinctions between fixed-$X$ and random-$X$ cases have been clearly noted (Rao (1973); Rosset and Tibshirani (2020)).
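To make the structure in (1.1) concrete, the following minimal sketch (in Python with NumPy; the specific row values and replication counts are hypothetical and only for illustration) builds such a design matrix and confirms that its rank is $p$.

```python
import numpy as np

# Hypothetical example: p = 3 distinct covariate rows X_1, X_2, X_3.
X_p = np.array([[1.0, 0.0, 2.0],
                [1.0, 1.0, 0.5],
                [1.0, 2.0, 1.0]])        # p x p matrix of distinct rows (assumed nonsingular)
n_i = [4, 3, 5]                          # replication counts n_1, ..., n_p with sum n_i = n

# Stack each distinct row n_i times, as in equation (1.1).
X = np.repeat(X_p, n_i, axis=0)          # n x p design matrix

print(X.shape)                           # (12, 3)
print(np.linalg.matrix_rank(X))          # 3 = p: only p distinct rows, rank p
```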

As a principle of symmetry, the equivariance criterion has been proposed in the literature as a logical constraint on the candidate solutions, from which the optimal one is derived. Compared to unbiasedness, it focuses more on the symmetry of the problem and the self-consistency of the solutions. In the era of machine learning, it has received rising attention due to the symmetry widely present in applications. Lehmann and Casella (1998, Chap. 3) give a detailed discussion of the equivariance criterion in the location-scale family, while Berger (1985, Chap. 6) presents the theory in a decision-theoretic framework. Besides those two classical textbooks, other important references include Hora and Buehler (1966); Eaton (1989); Wijsman (1990).

The equivariance criterion consists of two principles: (i) functional equivariance, which states that the action in a decision problem should be consistent across different scales of measurement, and (ii) formal equivariance, which requires the decision rule to be of the same form for two problems with identical structure.

Formally, a decision problem is described by $({\cal X},{\cal P},\Theta,{\cal D},L)$, where ${\cal X}$ is a sample space, ${\cal P}$ is a family of distributions with $\theta$ as the parameter or the true state of nature, $\Theta$ is the parameter space, ${\cal D}$ is a decision space, and $L$ is a loss function on ${\cal D}\otimes\Theta$.

Without loss of generality, one starts with identifiability of both the distribution family and the loss function, which, for the linear model, suggests that there is no redundancy in $\boldsymbol{\beta}\in\mathbb{R}^{p}$ and $d\in\mathbb{R}^{p}$. The principle of functional equivariance requires preservation of the model and invariance of the loss function under a group of one-to-one and onto transformations, which are defined as follows.

Definition 1.1.

(Preservation of the Model) The distribution family ${\cal F}$ is said to be invariant under $G$ if for each $g\in G$ and $\theta\in\Theta$, there exists $\theta^{\prime}\in\Theta$ such that

$$X\sim f(x|\theta)\Rightarrow g(X)=X^{\prime}\sim f(x^{\prime}|\theta^{\prime}). \tag{1.2}$$
Definition 1.2.

(Invariance of the Loss Function) The loss function $L(d,\theta)$ is invariant under $G$ if for each $g\in G$ and each $d\in{\cal D}$ there exists $d_{*}=\tilde{g}(d)\in{\cal D}$ such that

$$L(d,\theta)=L(\tilde{g}(d),\bar{g}(\theta))\quad\mbox{for all }\theta\in\Theta.$$
Definition 1.3.

A decision problem $({\cal X},{\cal F},\Theta,{\cal D},L)$ is invariant under a group of transformations $G$ if $G$ preserves ${\cal F}$ and the loss function is invariant under $G$.

Under the structure of an invariant decision problem, we can now state the equivariance criterion.

Definition 1.4.

A decision rule $\delta(X)$ is said to be equivariant under $G$ if

$$\delta(g(x))=\tilde{g}(\delta(x))\quad\hbox{for all }g\in G\hbox{ and }x\in{\cal X}. \tag{1.3}$$

In this way, the equivariance criterion imposes a constraint on the admissible decision rules, from which one can reasonably derive the optimal decision rule. In this article, we will pursue the best decision rule (the minimum risk equivariant rule, or MRE rule) in the sense of minimizing the risk function, defined as the expectation of the loss function over ${\bf X}$, which is a function of the parameter and the decision rule.
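As a toy numerical illustration of Definition 1.4 (an assumption-laden sketch, not part of the original text), take the univariate location-scale group $g(x)=cx+a$ with induced action $\tilde{g}(d)=cd+a$ on the decision space; the sample mean then satisfies (1.3):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)            # an observed sample
c, a = 2.5, -1.0                   # a location-scale transformation g(x) = c*x + a, c > 0

delta = np.mean                    # candidate decision rule: the sample mean

lhs = delta(c * x + a)             # delta(g(x))
rhs = c * delta(x) + a             # g~(delta(x))
print(np.isclose(lhs, rhs))        # True: the sample mean is equivariant under g
```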

Lehmann and Casella (1998) applied the equivariance criterion implicitly in the linear model for fixed-$X$ cases, where the key step is to transform the problem to a canonical form and derive the best equivariant estimators for the coefficient vector and the common variance. Wu and Yang (2002) discussed the existence of best equivariant estimators for the coefficient vector and the covariance matrix (in three forms) in normal linear models with fixed $X$ and derived their forms when they exist. Kurata and Matsuura (2016) derived the best equivariant estimator of the regression coefficients in a seemingly unrelated regression (SUR) model with a known correlation matrix. Further, Matsuura and Kurata (2020) derived the best equivariant estimator of the covariance matrix in a SUR model. It is noted that most works in the literature assume a single population for the linear model and thus apply a common invariant transformation to the responses to derive the optimal equivariant estimators. However, as in experimental design, one usually views each distinct covariate vector as a population independent of any other, and a SUR model uses a natural multivariate extension of the common linear model. Therefore, a multivariate extension of the common linear model combined with the equivariance criterion warrants further investigation, especially from the fundamental logic of the equivariance criterion.

This article focuses on applying the logic of the equivariance criterion to the fixed-$X$ linear model, where the linear model is extended to allow multiple populations instead of a single one and the invariant transformation is distinct for each population. In Section 2, we derive the best equivariant estimators for the coefficient vector and the diagonal covariance matrix in a normal linear model specially tuned for the equivariance criterion. Section 3 is devoted to concluding remarks and future work.

2. Equivariance in the Linear Model

Consider the linear regression model ${\bf y}=X\boldsymbol{\beta}+\boldsymbol{\epsilon}$, where ${\bf y}$ is an $n\times 1$ vector, $X$ is an $n\times p$ matrix, and $\boldsymbol{\epsilon}$ is the noise vector. To derive the best equivariant estimators in this model, we first walk through the basic elements of the equivariance criterion and then provide a linear model tuned for the equivariance criterion, starting from the basic concept of a population.

Preservation of the Model: Without loss of generality, we assume that the design matrix $X$ is predetermined and of full rank with $n\geq p+1$, so that only the response ${\bf y}$ is random and thus only transformations of ${\bf y}$ will be considered. For a fixed-$X$ linear model, the samples can be viewed as coming from different populations, as in experimental design, which imposes a restriction on the choice of the transformation group in addition to the specification of the model. The equivariance criterion implicitly requires the transformations of samples from the same population to be identical while allowing them to be distinct across populations. Therefore, it is essential to determine the number of populations inside a linear model.

For the fixed-$X$ cases, one can argue that there can be only $p$ populations inside the linear model, as the rank of the design matrix is $p$. In this regard, we will assume that the $\epsilon_{i_{j}}$'s are independently normally distributed with mean 0 and unknown variance $\sigma_{i}^{2}$, $i=1,\ldots,p$. Thus, we have the normal linear model tuned for the equivariance criterion as follows,

$$\begin{split}{\bf y}&=\left(\vec{y}_{1},\vec{y}_{2},\cdots,\vec{y}_{p}\right),\\ \text{with }X&=({\bf X_{1}},\ldots,{\bf X_{1}},{\bf X_{2}},\ldots,{\bf X_{2}},\ldots,{\bf X_{p}},\ldots,{\bf X_{p}})^{\prime},\\ \boldsymbol{\beta}&=(\beta_{1},\beta_{2},\ldots,\beta_{p})^{\prime},\\ \text{and }\Sigma&=Diag(\sigma_{1}^{2},\ldots,\sigma_{1}^{2},\sigma_{2}^{2},\ldots,\sigma_{2}^{2},\ldots,\sigma_{p}^{2},\ldots,\sigma_{p}^{2}),\end{split} \tag{2.1}$$

where $\vec{y}_{i}$ collects the $n_{i}$ responses of the $i$-th population.

Note that here it is natural to assume a separate variance for each population, in contrast to the traditional linear model where all the variances of ${\bf y}$ are assumed to be identical. In this case, we have the usual identifiability of the model with respect to $\boldsymbol{\beta}$ and $\Sigma$. Since $\Sigma=Diag(\sigma_{1}^{2},\ldots,\sigma_{1}^{2},\sigma_{2}^{2},\ldots,\sigma_{2}^{2},\ldots,\sigma_{p}^{2},\ldots,\sigma_{p}^{2})$ contains only $p$ distinct parameters, one can simplify the problem by considering $\Sigma_{p}=Diag(\sigma_{1}^{2},\ldots,\sigma_{p}^{2})$.

The transformation group $G$ keeping the model invariant is of the form $g({\bf y})=C{\bf y}+{\bf a}$, with $C=Diag(c_{1},\ldots,c_{1},c_{2},\ldots,c_{2},\ldots,c_{p},\ldots,c_{p})$ and ${\bf a}=(a_{1},\ldots,a_{1},a_{2},\ldots,a_{2},\ldots,a_{p},\ldots,a_{p})^{\prime}$, where $c_{i}>0$ and $a_{i}\in\mathbb{R}$, $i=1,\ldots,p$, are each repeated $n_{i}$ times. Here we use independent transformations for the populations, in contrast to the literature, where most works use a single common transformation $g_{c}({\bf y})=c{\bf y}+a$, $c>0$, $a\in\mathbb{R}$. To facilitate the discussion, we denote $X_{p}=({\bf X_{1}},{\bf X_{2}},\ldots,{\bf X_{p}})^{\prime}$, $C_{p}=Diag(c_{1},\ldots,c_{p})$ and ${\bf a}_{p}=(a_{1},\ldots,a_{p})^{\prime}$, so that $X=KX_{p}$, ${\bf a}=K{\bf a}_{p}$ and

$$C=\left[\begin{matrix}c_{1}I_{n_{1}}&0&\cdots&0\\ 0&c_{2}I_{n_{2}}&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\cdots&0&c_{p}I_{n_{p}}\end{matrix}\right] \quad\text{with}\quad K=\left[\begin{matrix}\vec{1}_{n_{1}}&\vec{0}_{n_{1}}&\cdots&\vec{0}_{n_{1}}\\ \vec{0}_{n_{2}}&\vec{1}_{n_{2}}&\cdots&\vec{0}_{n_{2}}\\ \vdots&&\ddots&\vdots\\ \vec{0}_{n_{p}}&\vec{0}_{n_{p}}&\cdots&\vec{1}_{n_{p}}\end{matrix}\right]_{n\times p}.$$

The corresponding transformation group $\bar{G}$ on the parameter space is of the form

$$\bar{g}(\boldsymbol{\beta},\Sigma_{p})=(X_{p}^{-1}C_{p}X_{p}\boldsymbol{\beta}+X_{p}^{-1}{\bf a}_{p},\,C_{p}\Sigma_{p}C_{p}^{\prime}).$$

It can be shown that the parameter space is transitive under $\bar{G}$, as $n\geq p+1$.
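A small sketch (hypothetical sizes and values, continuing the NumPy conventions used above) can make the block structure explicit and check the identities $X=KX_{p}$ and ${\bf a}=K{\bf a}_{p}$ numerically.

```python
import numpy as np

n_i = [4, 3, 5]                                    # hypothetical replication counts
n, p = sum(n_i), len(n_i)
X_p = np.array([[1.0, 0.0, 2.0],
                [1.0, 1.0, 0.5],
                [1.0, 2.0, 1.0]])                  # distinct rows (assumed nonsingular)

# K: n x p indicator matrix whose i-th column is 1 on population i and 0 elsewhere.
K = np.zeros((n, p))
start = 0
for i, ni in enumerate(n_i):
    K[start:start + ni, i] = 1.0
    start += ni

c_p = np.array([1.5, 0.7, 2.0])                    # c_i > 0
a_p = np.array([0.3, -1.0, 2.0])                   # a_i in R

C = np.diag(K @ c_p)                               # block-diagonal scale matrix, c_i repeated n_i times
a = K @ a_p                                        # a = K a_p, each a_i repeated n_i times

print(np.allclose(K @ X_p, np.repeat(X_p, n_i, axis=0)))   # X = K X_p holds
print(np.allclose(a, np.repeat(a_p, n_i)))                 # a = K a_p holds
```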

In the linear model, the usual targets of interest are the coefficient vector $\boldsymbol{\beta}$ and the covariance matrix $\Sigma$. We will discuss these two separately in the context of Invariance of the Loss Function.

To derive an MRE decision rule, Lehmann and Casella (1998, Chap. 3) used maximal invariants to characterize all the equivariant estimators and then minimized the constant risk, which applies when $\bar{G}$ is transitive, for location-scale families. We will follow this approach to derive the best estimators below.

2.1 Estimation of the Coefficient Vector $\boldsymbol{\beta}$

Staudte Jr (1971) and Zhou and Nayak (2014) introduced a method to construct an invariant loss function based on the target of interest. Following their method, one can build an invariant loss function as $L_{\boldsymbol{\beta}}({\bf d},\boldsymbol{\beta})=({\bf d}-\boldsymbol{\beta})^{T}X_{p}^{T}\Sigma_{p}^{-1}X_{p}({\bf d}-\boldsymbol{\beta})$. This is an extension of the one implicitly used in the equivariance literature (Theorem 4.3(b) in Lehmann and Casella (1998, Chap. 3) and (1.4) in Wu and Yang (2002)). Thus, the corresponding invariant transformation on the decision space is $\tilde{g}({\bf d})=X_{p}^{-1}C_{p}X_{p}{\bf d}+X_{p}^{-1}{\bf a}_{p}$, and the equivariance criterion becomes $\delta(g({\bf y}))=\delta(C{\bf y}+{\bf a})=\tilde{g}(\delta({\bf y}))=X_{p}^{-1}C_{p}X_{p}\delta({\bf y})+X_{p}^{-1}{\bf a}_{p}$. It can be shown that those invariant transformations form a group $\tilde{G}_{\boldsymbol{\beta}}$ and that the least squares estimator is essentially the vector of sample means, as in the following Lemma 2.1, and is thus equivariant under the group $\tilde{G}_{\boldsymbol{\beta}}$.
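A quick numerical check (hypothetical values; merely a sketch of the claimed invariance, not part of the original derivation) verifies that $L_{\boldsymbol{\beta}}$ is unchanged when ${\bf d}$ and $(\boldsymbol{\beta},\Sigma_{p})$ are transformed by $\tilde{g}$ and $\bar{g}$:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
X_p = rng.normal(size=(p, p))                      # hypothetical nonsingular X_p
beta = rng.normal(size=p)
d = rng.normal(size=p)                             # a candidate decision
Sigma_p = np.diag(rng.uniform(0.5, 2.0, size=p))   # diagonal covariance of the p populations

def loss(d, beta, Sigma_p):
    r = d - beta                                   # L_beta = (d - beta)' X_p' Sigma_p^{-1} X_p (d - beta)
    return r @ X_p.T @ np.linalg.inv(Sigma_p) @ X_p @ r

C_p = np.diag(rng.uniform(0.5, 2.0, size=p))       # group element: positive diagonal C_p and real a_p
a_p = rng.normal(size=p)
Xp_inv = np.linalg.inv(X_p)

d_t     = Xp_inv @ C_p @ X_p @ d    + Xp_inv @ a_p   # g~(d)
beta_t  = Xp_inv @ C_p @ X_p @ beta + Xp_inv @ a_p   # first component of g-bar
Sigma_t = C_p @ Sigma_p @ C_p.T                      # second component of g-bar

print(np.isclose(loss(d, beta, Sigma_p), loss(d_t, beta_t, Sigma_t)))   # True
```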

Lemma 2.1.

For the fixed-$X$ linear model in (2.1), the least squares estimator satisfies $(X^{\prime}X)^{-1}X^{\prime}{\bf y}=X_{p}^{-1}{\bf\bar{y}}$, where ${\bf\bar{y}}=(\bar{Y}_{1},\ldots,\bar{Y}_{j},\ldots,\bar{Y}_{p})^{\prime}$.

Proof.

Since $K^{\prime}K=Diag(n_{1},\ldots,n_{j},\ldots,n_{p})$ and $K^{\prime}{\bf y}=(n_{1}\bar{Y}_{1},\ldots,n_{j}\bar{Y}_{j},\ldots,n_{p}\bar{Y}_{p})^{\prime}$, the least squares estimator satisfies

$$(X^{\prime}X)^{-1}X^{\prime}{\bf y}=(X_{p}^{\prime}K^{\prime}KX_{p})^{-1}X_{p}^{\prime}K^{\prime}{\bf y}=X_{p}^{-1}(K^{\prime}K)^{-1}K^{\prime}{\bf y}=X_{p}^{-1}{\bf\bar{y}}.$$
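A numerical sanity check of Lemma 2.1 (a sketch under the same hypothetical design as above) compares $(X^{\prime}X)^{-1}X^{\prime}{\bf y}$ with $X_{p}^{-1}{\bf\bar{y}}$ for an arbitrary response vector:

```python
import numpy as np

rng = np.random.default_rng(2)
n_i = [4, 3, 5]
X_p = np.array([[1.0, 0.0, 2.0],
                [1.0, 1.0, 0.5],
                [1.0, 2.0, 1.0]])
X = np.repeat(X_p, n_i, axis=0)
y = rng.normal(size=sum(n_i))                      # any response vector

ols = np.linalg.solve(X.T @ X, X.T @ y)            # (X'X)^{-1} X'y

# Population sample means Y-bar_1, ..., Y-bar_p.
groups = np.split(y, np.cumsum(n_i)[:-1])
y_bar = np.array([g.mean() for g in groups])

print(np.allclose(ols, np.linalg.solve(X_p, y_bar)))   # OLS equals X_p^{-1} y-bar
```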

A Characterization of Equivariant Estimators. It follows from Lehmann and Casella (1998) that the equivariant estimators can be characterized as $\delta({\bf y})=(X^{\prime}X)^{-1}X^{\prime}{\bf y}+X_{p}^{-1}S({\bf y})\omega({\bf z})$, where $S({\bf y})=Diag(s_{1},\ldots,s_{p})$ with $s_{i}=\sqrt{\frac{1}{n_{i}-1}\sum_{j=N_{i-1}+1}^{N_{i}}(Y_{j}-\bar{Y}_{i})^{2}}$ the sample standard deviation of each population, $\bar{Y}_{i}=\frac{1}{n_{i}}\sum_{j=N_{i-1}+1}^{N_{i}}Y_{j}$ is the sample mean of each population with $N_{0}=0$ and $N_{i}=\sum_{j=1}^{i}n_{j}$ for $i=1,\ldots,p$, ${\bf z}=({\bf z_{1}}^{\prime},\ldots,{\bf z_{p}}^{\prime})^{\prime}$ with ${\bf z_{i}}=((Y_{N_{i-1}+2}-Y_{N_{i}})/(Y_{N_{i-1}+1}-Y_{N_{i}}),\ldots,(Y_{N_{i}-1}-Y_{N_{i}})/(Y_{N_{i-1}+1}-Y_{N_{i}}),sgn(Y_{N_{i-1}+1}-Y_{N_{i}}))^{\prime}$ is a maximal invariant, and $\omega({\bf z})$ is a $p$-dimensional vector.
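To see why ${\bf z}$ is invariant, a short sketch (hypothetical data) checks that the ratios and sign defining ${\bf z_{i}}$ are unchanged when every response in population $i$ is mapped by $y\mapsto c_{i}y+a_{i}$ with $c_{i}>0$:

```python
import numpy as np

rng = np.random.default_rng(5)
y_i = rng.normal(size=6)                           # responses of one population

def z_i(y):
    diffs = y[1:-1] - y[-1]                        # Y_j - Y_{N_i} for the middle observations
    ref = y[0] - y[-1]                             # Y_{N_{i-1}+1} - Y_{N_i}
    return np.append(diffs / ref, np.sign(ref))

c, a = 3.0, -2.0                                   # c_i > 0, a_i in R
print(np.allclose(z_i(y_i), z_i(c * y_i + a)))     # True: z_i is invariant under the group
```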

Best Equivariant Estimator. To derive the best equivariant estimator minimizing the risk function, one first uses the fact that the risk function is constant for any equivariant estimator, since the parameter space is transitive. Then one can show that $(X^{\prime}X)^{-1}X^{\prime}{\bf y}$ is independent of ${\bf z}$. Meanwhile, $(X^{\prime}X)^{-1}X^{\prime}{\bf y}$ depends on ${\bf y}$ only through ${\bf\bar{y}}$, which is independent of $S({\bf y})$. With the above arguments, one can show the main result as follows.

Theorem 2.1.

For the fixed-$X$ linear model in (2.1), the least squares estimator, $(X^{\prime}X)^{-1}X^{\prime}{\bf y}$, is the best equivariant estimator for $\boldsymbol{\beta}$.

Proof.

Since the parameter space $\Theta$ is transitive under the group $\bar{G}$, we choose a special parameter point $(\boldsymbol{\beta},\Sigma_{p})=({\bf 0},I_{p})$ to evaluate the risk function. Thus any equivariant estimator of $\boldsymbol{\beta}$, $\delta({\bf y})=(X^{\prime}X)^{-1}X^{\prime}{\bf y}+X_{p}^{-1}S({\bf y})\omega({\bf z})=X_{p}^{-1}{\bf\bar{y}}+X_{p}^{-1}S({\bf y})\omega({\bf z})$, has a constant risk as follows,

$$\begin{aligned}R(\delta)&=EL(\delta,\theta_{0})=E^{\bf z}E^{{\bf y}|{\bf z}}[L(X_{p}^{-1}{\bf\bar{y}}+X_{p}^{-1}S({\bf y})\omega({\bf z}),\theta_{0})|{\bf z}]\\ &=E\{[X_{p}^{-1}{\bf\bar{y}}+X_{p}^{-1}S({\bf y})\omega({\bf z})]^{\prime}X_{p}^{\prime}X_{p}[X_{p}^{-1}{\bf\bar{y}}+X_{p}^{-1}S({\bf y})\omega({\bf z})]\}\\ &=E^{\bf z}[E^{{\bf y}|{\bf z}}({\bf\bar{y}}^{\prime}{\bf\bar{y}})+2\omega({\bf z})^{\prime}E^{{\bf y}|{\bf z}}(S({\bf y}){\bf\bar{y}})+\omega({\bf z})^{\prime}E^{{\bf y}|{\bf z}}(S^{2}({\bf y}))\omega({\bf z})],\end{aligned}$$

where the minimum is attained at $\omega^{*}={\bf 0}$, since ${\bf\bar{y}}$, $S({\bf y})$ and ${\bf z}$ are pairwise independent and $E_{\theta_{0}}({\bf\bar{y}})={\bf 0}$, so the cross term vanishes. Therefore, the best equivariant estimator is $\delta^{*}=(X^{\prime}X)^{-1}X^{\prime}{\bf y}$. ∎

2.2 Estimation of the Diagonal Covariance Matrix $\Sigma$

Since $\Sigma$ is diagonal with only $p<n$ distinct parameters, it is equivalent to estimate the diagonal matrix $\Sigma_{p}$. It is noteworthy that $\Sigma_{p}$ is estimable only when $n\geq 2p$. Consider two widely discussed loss functions: the quadratic loss, $L_{q}(D,\Sigma_{p})=tr((D-\Sigma_{p})\Sigma_{p}^{-1}(D-\Sigma_{p})\Sigma_{p}^{-1})$, and the likelihood loss, $L_{l}(D,\Sigma_{p})=tr(D\Sigma_{p}^{-1})-\log|D\Sigma_{p}^{-1}|-p$, where $D$ is a positive definite $p\times p$ diagonal matrix. The same transformation group $G$ is used for preservation of the model, with $\bar{g}(\Sigma_{p})=C_{p}\Sigma_{p}C_{p}^{\prime}$, and both loss functions are invariant under $\tilde{G}_{\Sigma_{p}}$ with $\tilde{g}(D)=C_{p}DC_{p}=C_{p}^{2}D$. Denote $W=Diag((n_{1}-1)/(n_{1}+1),\ldots,(n_{p}-1)/(n_{p}+1))$; one can then show the main result as follows.
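The two losses and the constant matrix $W$ are straightforward to compute; the sketch below (hypothetical replication counts and matrices, for illustration only) evaluates $L_{q}$ and $L_{l}$ for a diagonal candidate $D$ and forms $W$:

```python
import numpy as np

def L_q(D, Sigma_p):
    # Quadratic loss tr((D - Sigma) Sigma^{-1} (D - Sigma) Sigma^{-1}).
    M = (D - Sigma_p) @ np.linalg.inv(Sigma_p)
    return np.trace(M @ M)

def L_l(D, Sigma_p):
    # Likelihood loss tr(D Sigma^{-1}) - log|D Sigma^{-1}| - p.
    A = D @ np.linalg.inv(Sigma_p)
    return np.trace(A) - np.log(np.linalg.det(A)) - A.shape[0]

n_i = np.array([4, 3, 5])                          # hypothetical replication counts
W = np.diag((n_i - 1) / (n_i + 1))                 # W = Diag((n_i - 1)/(n_i + 1))

Sigma_p = np.diag([1.0, 2.0, 0.5])
D = np.diag([1.2, 1.8, 0.4])                       # a candidate diagonal estimate
print(L_q(D, Sigma_p), L_l(D, Sigma_p), np.diag(W))
```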

Theorem 2.2.

For the fixed-$X$ linear model in (2.1), $WS^{2}$ and $S^{2}$ are the MRE estimators for $\Sigma_{p}$ under $L_{q}$ and $L_{l}$, respectively.

Proof.

Analogous to the univariate case, under the loss functions $L_{q}$ and $L_{l}$, any equivariant estimator $\Delta({\bf y})$ of $\Sigma_{p}$ can be characterized as

$$\Delta({\bf y})=SH({\bf z})S^{\prime}=H({\bf z})S^{2}, \tag{2.2}$$

where $S^{2}=Diag(s_{1}^{2},\ldots,s_{p}^{2})$, $H({\bf z})=Diag(h_{1}({\bf z}),\ldots,h_{p}({\bf z}))$ and ${\bf z}=({\bf z_{1}}^{\prime},\ldots,{\bf z_{p}}^{\prime})^{\prime}$, with ${\bf z_{i}}=((Y_{N_{i-1}+2}-Y_{N_{i}})/(Y_{N_{i-1}+1}-Y_{N_{i}}),\ldots,(Y_{N_{i}-1}-Y_{N_{i}})/(Y_{N_{i-1}+1}-Y_{N_{i}}),sgn(Y_{N_{i-1}+1}-Y_{N_{i}}))^{\prime}$. The proof of equation (2.2) can be found in Appendix A, where it is also shown that $S^{2}$ and ${\bf z}$ are independent.

Since the parameter space $\Theta$ is transitive under the group $\bar{G}$, we choose a special parameter point $(\boldsymbol{\beta},\Sigma_{p})=({\bf 0},I_{p})$ to evaluate the risk function.

First, under $L_{q}$, the constant risk of any equivariant estimator can be calculated as follows,

$$\begin{aligned}R(\Delta)&=EL_{q}(\Delta,\theta_{0})=E^{\bf z}E^{{\bf y}|{\bf z}}[L_{q}(H({\bf z})S^{2},\theta_{0})|{\bf z}]\\ &=E^{\bf z}E^{\bf y}L_{q}(HS^{2},\theta_{0})=E^{\bf z}\{E_{\theta_{0}}[tr(HS^{2}-I_{p})^{2}]\}\\ &=E^{\bf z}\sum_{i=1}^{p}E_{\theta_{0}}(h_{i}s_{i}^{2}-1)^{2}\\ &=E^{\bf z}\sum_{i=1}^{p}\bigl(h_{i}^{2}E_{\theta_{0}}(s_{i}^{4})-2h_{i}E_{\theta_{0}}(s_{i}^{2})+1\bigr).\end{aligned}$$

The risk attains its minimum at $h_{i}^{*}=\frac{E_{\theta_{0}}(s_{i}^{2})}{E_{\theta_{0}}(s_{i}^{4})}=\frac{(n_{i}-1)^{2}}{2(n_{i}-1)+(n_{i}-1)^{2}}=\frac{n_{i}-1}{n_{i}+1}$ for $i=1,\ldots,p$, or equivalently, at $H^{*}=Diag(\frac{n_{1}-1}{n_{1}+1},\ldots,\frac{n_{p}-1}{n_{p}+1})=W$. Thus, $\delta^{*}=H^{*}S^{2}=WS^{2}$ is the MRE estimator under $L_{q}$.

A similar calculation shows that the MRE estimator under $L_{l}$ is $\delta^{*}=S^{2}$. ∎
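A Monte Carlo sketch (assuming standard normal samples at the reference point $\theta_{0}$ and a hypothetical population size) can illustrate the constant in Theorem 2.2: among estimators of the form $hs^{2}$, the risk $E(hs^{2}-1)^{2}$ under $L_{q}$ is minimized near $h=(n_{i}-1)/(n_{i}+1)$ rather than at the unbiased choice $h=1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6                                              # hypothetical size of one population
reps = 200_000
samples = rng.normal(size=(reps, n))               # sigma^2 = 1, i.e. the reference point theta_0
s2 = samples.var(axis=1, ddof=1)                   # sample variances s^2

def risk(h):
    return np.mean((h * s2 - 1.0) ** 2)            # Monte Carlo estimate of E(h s^2 - 1)^2

hs = np.linspace(0.5, 1.2, 71)
h_best = hs[np.argmin([risk(h) for h in hs])]
print(h_best, (n - 1) / (n + 1))                   # both close to 5/7 ≈ 0.714 for n = 6
```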

It is noted that Wu and Yang (2002) considered such a problem under the common location-scale transformation $g_{c}$ and the loss function $L_{q}$ and showed that no best equivariant estimator exists. In essence, Theorem 2.2 indicates that the multivariate MRE is a vector of the univariate MREs as in Lehmann and Casella (1998), which are sample variances multiplied by a constant. When the covariance matrix is fully unknown, it is not estimable, there is no complete and sufficient statistic, and thus the above proof does not apply.

3. Future Work and Discussion

This paper serves as an initial effort to explore the use of the equivariance criterion in the field of machine learning, where we are interested in which method yields the optimal solution and what properties the optimal solution carries, especially in the context of the equivariance criterion. We start from least squares in the linear model, the simplest and most foundational method in machine learning, to derive the optimal solution.

In this paper, we have established that the MRE estimators for the coefficient vector and the condensed covariance matrix are the least squares estimator and the vector of sample variances within each population, respectively. In addition, we have demonstrated that in our setting (2.1), the least squares estimator is essentially the vector of sample means within each population. Such a finding further solidifies their optimality from the perspective of multivariate normal distribution theory.

The linear model with a full-rank design matrix can take different forms across the literature depending on the number of populations. The commonly used one has a single population, whose characteristic is to assume a common distribution for all the noise terms. Naturally, in this setup, one would use a single univariate location-scale transformation to apply the equivariance criterion. In this paper, we relax such an assumption and allow $p$ populations to accommodate a larger transformation group, which consists of multivariate location-scale transformations. From the perspective of experimental design, such a relaxation is quite intuitive: the design matrix is chosen carefully before sampling, so that the sampled points are in a sense independent and each distinct row constitutes a population. Meanwhile, one can argue that $p$ populations is the maximum number allowed in a linear model.

In terms of estimating the coefficient vector, the form of $X$ does not have much impact on the MRE solution, and thus (2.1) gives a simple path to the MRE solution. However, to estimate the condensed covariance matrix, it is essential to choose the form of $X$ and thus decide the size of each population. In an experimental design setting, one would usually use $n=kp$, where $k$ is an integer, and $n_{i}=k$. Such a form is recommended both for its symmetry and its computational convenience, and it is also a special case of a seemingly unrelated regression (SUR) problem with a known correlation matrix. Moreover, one can convert a SUR problem back into the setting $n_{i}=k$, and functional equivariance guarantees that the MRE solution is of the corresponding form. In this sense, our model is more general than the SUR model used in Kurata and Matsuura (2016); Matsuura and Kurata (2020).

We have discussed the best equivariant estimators for the coefficient vector and the condensed covariance matrix in a normal linear model specially tuned for the equivariance criterion, of which the commonly used model is a special case. Such a model requires a larger transformation group, which, in turn, results in a smaller set of equivariant decision rules, within which the MRE estimator exists. Interestingly, Wu and Yang (2002) showed that the commonly used single location-scale transformation admits so many equivariant decision rules under $L_{q}$ that there is no MRE estimator for the covariance matrix. Meanwhile, each population inside the normal model is the linear model commonly used in the literature, and the resulting estimators for each population are the traditional MRE ones, which are equivalent to the least squares solutions. Kurata and Matsuura (2016); Matsuura and Kurata (2020) used the same transformation group for the SUR model, where a $p$-dimensional distribution family was considered and the samples come from such a single multivariate population.

The choice of the invariant transformation group is an important topic in the equivariance literature. Wu and Yang (2002) presented a case where the group is too large to allow an optimal solution. Usually, the group is chosen to be isomorphic to the sample space or parameter space, especially when considering the Haar prior. There are some interesting cases in which the invariant transformation groups form a nested family and the largest one admits only the optimal solution of the smaller ones.

The likelihood loss is a multivariate extension of the Stein loss, which is preferred in the literature (Brown (1968)). It can be seen that it induces an MRE and UMVU estimator, which is always larger than the one under the quadratic loss. Meanwhile, the likelihood loss is more evenhanded over the parameter range, as the covariance matrix is constrained to be positive definite. Such a form is quite similar to the logistic transformation in a generalized linear model.

An extension to prediction will be a future direction. However, existing frameworks on the equivariance criterion (e.g., Zhou and Nayak (2015)) cannot handle the prediction problem well in a linear model. In that literature, the predicted response is assumed to be unobservable, which is not the case in the linear model. One can also notice that overfitting (Stone (1974)) arises in a prediction problem for a linear model, whereas it usually does not occur in estimation: in deriving the least squares solution, the sample prediction error is used, which converges to a univariate form of $L_{\boldsymbol{\beta}}$, namely $({\bf d}-\boldsymbol{\beta})^{T}X_{p}^{T}X_{p}({\bf d}-\boldsymbol{\beta})/\sigma^{2}$ with $\sigma^{2}=1$, plus an extra term constant in ${\bf d}$.

The linear model with fixed-$X$ cases, though predominantly used in the literature, is of limited use in practice, especially in our setting, where experimental design is the ideal scenario. Linear models with random-$X$ cases and mixture cases are more interesting and challenging. The concept of randomness in the linear model has drawn wide attention, and numerous efforts have been made to clarify the differences between fixedness and randomness. Little (2019) gave a straightforward definition of randomness as being unknown from a Bayesian view. In his argument, the treatment indicator in a clinical trial can be considered both fixed and random. It is true that such semi-controlled covariates pose challenges to the definition of randomness. Individually, the indicator is unknown and thus can be considered random. Population-wise, its distribution is usually under control and thus can be treated as a fixed effect in analyses.

The Gauss-Markov theorem is the fundamental result for the linear model, establishing the optimality of the least squares solution. In most textbooks, its proof is based on the predominantly assumed fixed-$X$ cases. Shaffer (1991) showed some interesting results in which the Gauss-Markov theorem no longer holds for certain random-$X$ cases. We will investigate this phenomenon in the context of equivariance for the random-$X$ cases.

In terms of randomness, the setup before sampling is crucial, and one can classify $X$ before sampling into the following categories: known values, drawn from a known distribution, drawn from a distribution family with unknown parameters, or totally unknown. In a typical experimental design setting, design factors take known values, which, in our setting, corresponds to the fixed-$X$ cases. In a typical clinical trial setting, the treatment indicator comes from a known distribution. In the classical parametric inference setting, we may assume $X$ comes from a distribution family with unknown parameters. In the nonparametric setting, $X$ is usually seen as totally unknown. The latter three scenarios are random-$X$ cases, which will be another future topic.

For a linear model with a non-normal distribution, we will look to extensions of the current results to the generalized linear model, where the challenges start with the invariant transformations. In a normal linear model, one can easily find an invariant location-scale transformation group under which the parameter space is transitive, which facilitates the derivation of the MRE solutions. This may not be the case in a non-normal linear model.

Acknowledgment. This work was supported by the Fundamental Research Funds for the Central Universities, Sun Yat-sen University (Grant No. 20lgpy145 & 2021qntd21) and the Science and Technology Program of Guangzhou Project, Fundamental and Applied Research Project (202102080175). The authors report there are no competing interests to declare.

References

  • Berger (1985) Berger, J.O. (1985). Statistical decision theory and Bayesian analysis. Springer-Verlag, New York, second edition.
  • Breiman and Spector (1992) Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression: The X-random case. International Statistical Review / Revue Internationale de Statistique, 60, 291. doi:10.2307/1403680.
  • Brown (1968) Brown, L. (1968). Inadmissibility of the usual estimators of scale parameters in problems with unknown location and scale parameters. The Annals of Mathematical Statistics, 39, 29–48. doi:10.1214/aoms/1177698503.
  • Cohen and Welling (2016) Cohen, T. and Welling, M. (2016). Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999.
  • Eaton (1989) Eaton, M.L. (1989). Group invariance applications in statistics. Institute of Mathematical Statistics, Beachwood, Ohio.
  • Hora and Buehler (1966) Hora, R.B. and Buehler, R.J. (1966). Fiducial theory and invariant estimation. The Annals of Mathematical Statistics, 37, 643–656.
  • Kurata and Matsuura (2016) Kurata, H. and Matsuura, S. (2016). Best equivariant estimator of regression coefficients in a seemingly unrelated regression model with known correlation matrix. Annals of the Institute of Statistical Mathematics, 68, 705–723. doi:10.1007/s10463-015-0512-2.
  • Lehmann and Casella (1998) Lehmann, E.L. and Casella, G. (1998). Theory of point estimation. Springer-Verlag, New York, second edition.
  • Little (2019) Little, R.J. (2019). Comment: “Models as Approximations I: Consequences Illustrated with Linear Regression” by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, L. Zhan and K. Zhang. Statistical Science, 34, 580–583. doi:10.1214/19-sts726.
  • Marcos et al. (2017) Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. (2017). Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5048–5057.
  • Matsuura and Kurata (2020) Matsuura, S. and Kurata, H. (2020). Covariance matrix estimation in a seemingly unrelated regression model under Stein’s loss. Statistical Methods & Applications, 29, 79–99. doi:10.1007/s10260-019-00473-x.
  • Rao (1973) Rao, C. (1973). Representations of best linear unbiased estimators in the Gauss-Markoff model with a singular dispersion matrix. Journal of Multivariate Analysis, 3, 276–292. doi:10.1016/0047-259x(73)90042-0.
  • Rao (1965) Rao, C.R. (1965). The theory of least squares when the parameters are stochastic and its application to the analysis of growth curves. Biometrika, 52, 447–458. doi:10.1093/biomet/52.3-4.447.
  • Rosset and Tibshirani (2020) Rosset, S. and Tibshirani, R.J. (2020). From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. Journal of the American Statistical Association, 115, 138–151. doi:10.1080/01621459.2018.1424632.
  • Shaffer (1991) Shaffer, J.P. (1991). The Gauss-Markov theorem and random regressors. The American Statistician, 45, 269–273. doi:10.1080/00031305.1991.10475819.
  • Staudte Jr (1971) Staudte Jr, R.G. (1971). A characterization of invariant loss functions. The Annals of Mathematical Statistics, pp. 1322–1327.
  • Stone (1974) Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36, 111–133. doi:10.1111/j.2517-6161.1974.tb00994.x.
  • Wijsman (1990) Wijsman, R.A. (1990). Invariant measures on groups and their use in statistics. In Invariant measures on groups and their use in statistics. Institute of Mathematical Statistics.
  • Wu and Yang (2002) Wu, Q. and Yang, G. (2002). Existence of the uniformly minimum risk equivariant estimators of parameters in a class of normal linear models. Science in China Series A: Mathematics, 45, 845–858. doi:10.1360/02ys9093.
  • Zhou and Nayak (2014) Zhou, H. and Nayak, T.K. (2014). A note on existence and construction of invariant loss functions. Statistics, 48, 1335–1343. doi:10.1080/02331888.2013.809719.
  • Zhou and Nayak (2015) Zhou, H.J. and Nayak, T.K. (2015). On the equivariance criterion in statistical prediction. Annals of the Institute of Statistical Mathematics, 67, 541–555. doi:10.1007/s10463-014-0464-y.

Appendix A Technical Proofs

Identifiability of the Model

Proof.

For any two parameter points $(\boldsymbol{\beta}_{1},\Sigma_{1}),(\boldsymbol{\beta}_{2},\Sigma_{2})\in\Theta$ and any fixed $X$ of full rank, the density of $N_{n}(X\boldsymbol{\beta}_{1},\Sigma_{1})$ equals that of $N_{n}(X\boldsymbol{\beta}_{2},\Sigma_{2})$ if and only if $(\boldsymbol{\beta}_{1},\Sigma_{p1})=(\boldsymbol{\beta}_{2},\Sigma_{p2})$, where $\Sigma_{i}=\left[\begin{matrix}\sigma_{i1}^{2}I_{n_{1}}&&\\ &\ddots&\\ &&\sigma_{ip}^{2}I_{n_{p}}\end{matrix}\right]$ for $i=1,2$. Thus, the model is identifiable with respect to $(\boldsymbol{\beta},\Sigma_{p})$. ∎

$G$ being a Group

Proof.

We aim to prove that

$$\begin{split}G&=\left\{g:g({\bf y})=C{\bf y}+{\bf a}\right\}\\ \text{with }{\bf a}&=(a_{1},\ldots,a_{p},\ldots,a_{p})^{\prime},\ C=Diag(c_{1},\ldots,c_{p},\ldots,c_{p}),\ a_{i}\in\mathbb{R}\text{ and }c_{i}>0,\ i=1,\ldots,p,\end{split} \tag{A.1}$$

satisfies the definition of a group.

(i) Closure: For any two transformations $g_{1},g_{2}\in G$ with

$$g_{1}({\bf y})=C_{1}{\bf y}+{\bf a_{1}},\quad g_{2}({\bf y})=C_{2}{\bf y}+{\bf a_{2}},$$

where $C_{i}=Diag(c_{i1},\ldots,c_{ip},\ldots,c_{ip})$ and ${\bf a}_{i}=(a_{i1},\ldots,a_{ip},\ldots,a_{ip})^{\prime}$, $i=1,2$, we have that

$$g_{2}g_{1}({\bf y})=C_{2}C_{1}{\bf y}+C_{2}{\bf a_{1}}+{\bf a_{2}},$$

with $C^{*}=C_{2}C_{1}=Diag(c_{11}c_{21},\ldots,c_{11}c_{21},\ldots,c_{1p}c_{2p},\ldots,c_{1p}c_{2p})$ and ${\bf a}^{*}=C_{2}{\bf a_{1}}+{\bf a_{2}}=(c_{21}a_{11}+a_{21},\ldots,c_{21}a_{11}+a_{21},\ldots,c_{2p}a_{1p}+a_{2p},\ldots,c_{2p}a_{1p}+a_{2p})^{\prime}$. Since $c_{i}^{*}=c_{1i}c_{2i}>0$ and $a_{i}^{*}=c_{2i}a_{1i}+a_{2i}\in\mathbb{R}$, $i=1,\ldots,p$, we find that $g_{2}g_{1}\in G$.

(ii) Associativity: For any three transformations $g_{1},g_{2},g_{3}\in G$ with

$$g_{i}({\bf y})=C_{i}{\bf y}+{\bf a}_{i},\quad i=1,2,3,$$

we have that

$$\begin{aligned}(g_{1}g_{2})g_{3}({\bf y})&=C_{1}C_{2}\left(C_{3}{\bf y}+{\bf a}_{3}\right)+C_{1}{\bf a}_{2}+{\bf a}_{1}\\ &=C_{1}(C_{2}C_{3}{\bf y}+C_{2}{\bf a}_{3}+{\bf a}_{2})+{\bf a}_{1}=g_{1}(g_{2}g_{3})({\bf y}).\end{aligned}$$

(iii) Unit Element: The transformation $e\in G$ with $C=I_{n}$, ${\bf a}={\bf 0}$ is the unit element.

(iv) Inverse Element: For any transformation $g\in G$, its inverse transformation is $g^{-1}({\bf y})=C^{-1}{\bf y}-C^{-1}{\bf a}$. ∎

$\bar{G}$ being a Group

Proof.

We aim to prove that

$$\begin{split}\bar{G}&=\{\bar{g}:\bar{g}(\boldsymbol{\beta},\Sigma_{p})=(X_{p}^{-1}C_{p}X_{p}\boldsymbol{\beta}+X_{p}^{-1}{\bf a}_{p},\,C_{p}\Sigma_{p}C_{p}^{\prime})\}\\ \text{with }C_{p}&=Diag(c_{1},\ldots,c_{p}),\ {\bf a}_{p}=(a_{1},\ldots,a_{p})^{\prime}\end{split} \tag{A.2}$$

satisfies the definition of a group.

(i) Closure: For any two transformations $\bar{g}_{1},\bar{g}_{2}\in\bar{G}$ with

$$\bar{g}_{i}(\boldsymbol{\beta},\Sigma_{p})=(X_{p}^{-1}C_{pi}X_{p}\boldsymbol{\beta}+X_{p}^{-1}{\bf a}_{pi},\,C_{pi}\Sigma_{p}C_{pi}^{\prime}),\quad i=1,2,$$

where $C_{pi}=Diag(c_{i1},\ldots,c_{ip})$ and ${\bf a}_{pi}=(a_{i1},\ldots,a_{ip})^{\prime}$, we have that

$$\bar{g}_{2}\bar{g}_{1}(\boldsymbol{\beta},\Sigma_{p})=\left(X^{-1}_{p}C_{p2}C_{p1}X_{p}\boldsymbol{\beta}+X_{p}^{-1}(C_{p2}{\bf a}_{p1}+{\bf a}_{p2}),\,C_{p2}C_{p1}\Sigma_{p}(C_{p2}C_{p1})^{\prime}\right)$$

with $C_{p}^{*}=C_{p2}C_{p1}=Diag(c_{11}c_{21},\ldots,c_{1p}c_{2p})$ and ${\bf a}^{*}_{p}=C_{p2}{\bf a}_{p1}+{\bf a}_{p2}=(c_{21}a_{11}+a_{21},\ldots,c_{2p}a_{1p}+a_{2p})^{\prime}$. Since $c_{i}^{*}=c_{1i}c_{2i}>0$ and $a_{i}^{*}=c_{2i}a_{1i}+a_{2i}\in\mathbb{R}$, $i=1,\ldots,p$, we find that $\bar{g}_{2}\bar{g}_{1}\in\bar{G}$.

(ii) Associativity: For any three transformations $\bar{g}_{1},\bar{g}_{2},\bar{g}_{3}\in\bar{G}$ with

$$\bar{g}_{i}(\boldsymbol{\beta},\Sigma_{p})=(X_{p}^{-1}C_{pi}X_{p}\boldsymbol{\beta}+X_{p}^{-1}{\bf a}_{pi},\,C_{pi}\Sigma_{p}C_{pi}^{\prime}),\quad i=1,2,3,$$

we have that

$$\begin{aligned}(\bar{g}_{1}\bar{g}_{2})\bar{g}_{3}(\boldsymbol{\beta},\Sigma_{p})=&\ (X_{p}^{-1}C_{p1}C_{p2}X_{p}(X_{p}^{-1}C_{p3}X_{p}\boldsymbol{\beta}+X_{p}^{-1}{\bf a}_{p3})+X_{p}^{-1}(C_{p1}{\bf a}_{p2}+{\bf a}_{p1}),\,(C_{p1}C_{p2})C_{p3}\Sigma_{p}C_{p3}^{\prime}(C_{p1}C_{p2})^{\prime})\\ =&\ (X_{p}^{-1}C_{p1}X_{p}(X_{p}^{-1}C_{p2}C_{p3}X_{p}\boldsymbol{\beta}+X_{p}^{-1}(C_{p2}{\bf a}_{p3}+{\bf a}_{p2}))+X_{p}^{-1}{\bf a}_{p1},\,C_{p1}(C_{p2}C_{p3})\Sigma_{p}(C_{p2}C_{p3})^{\prime}C_{p1}^{\prime})\\ =&\ \bar{g}_{1}(\bar{g}_{2}\bar{g}_{3})(\boldsymbol{\beta},\Sigma_{p}).\end{aligned}$$

(iii) Unit Element: The transformation $\bar{e}\in\bar{G}$ with $C_{p}=I_{p}$, ${\bf a}_{p}={\bf 0}$ is the unit element.

(iv) Inverse Element: For any transformation $\bar{g}\in\bar{G}$, its inverse transformation is $\bar{g}^{-1}(\boldsymbol{\beta},\Sigma_{p})=(X_{p}^{-1}C_{p}^{-1}X_{p}\boldsymbol{\beta}-X_{p}^{-1}C_{p}^{-1}{\bf a}_{p},\,C_{p}^{-1}\Sigma_{p}(C_{p}^{-1})^{\prime})$. ∎

Transitivity of the Parameter Space: For any two parameter points $(\boldsymbol{\beta}_{1},\Sigma_{p1}),(\boldsymbol{\beta}_{2},\Sigma_{p2})\in\Theta$, there exists a transformation $\bar{g}\in\bar{G}$ such that $\bar{g}(\boldsymbol{\beta}_{1},\Sigma_{p1})=(\boldsymbol{\beta}_{2},\Sigma_{p2})$.

Proof.

Let $\bar{g}\in\bar{G}$ be the transformation determined by

$$C_{p}=\Sigma_{p2}^{1/2}\Sigma_{p1}^{-1/2}=Diag\left(\sqrt{\sigma_{21}^{2}/\sigma_{11}^{2}},\ldots,\sqrt{\sigma_{2p}^{2}/\sigma_{1p}^{2}}\right)$$

and

$${\bf a}_{p}=X_{p}\boldsymbol{\beta}_{2}-C_{p}X_{p}\boldsymbol{\beta}_{1}.$$

Then

$$\begin{aligned}\bar{g}(\boldsymbol{\beta}_{1},\Sigma_{p1})&=(X_{p}^{-1}\Sigma_{p2}^{1/2}\Sigma_{p1}^{-1/2}X_{p}\boldsymbol{\beta}_{1}+X_{p}^{-1}(X_{p}\boldsymbol{\beta}_{2}-\Sigma_{p2}^{1/2}\Sigma_{p1}^{-1/2}X_{p}\boldsymbol{\beta}_{1}),\,\Sigma_{p2})\\ &=(\boldsymbol{\beta}_{2},\Sigma_{p2}).\end{aligned}$$ ∎
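A numerical check of this construction (hypothetical parameter values, using the same NumPy conventions as the earlier sketches) confirms that the chosen $C_{p}$ and ${\bf a}_{p}$ map $(\boldsymbol{\beta}_{1},\Sigma_{p1})$ to $(\boldsymbol{\beta}_{2},\Sigma_{p2})$:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 3
X_p = rng.normal(size=(p, p))                        # hypothetical nonsingular X_p
beta1, beta2 = rng.normal(size=p), rng.normal(size=p)
S1 = np.diag(rng.uniform(0.5, 2.0, size=p))          # Sigma_p1
S2 = np.diag(rng.uniform(0.5, 2.0, size=p))          # Sigma_p2

C_p = np.sqrt(S2) @ np.linalg.inv(np.sqrt(S1))       # Sigma_p2^{1/2} Sigma_p1^{-1/2}
a_p = X_p @ beta2 - C_p @ X_p @ beta1                # a_p = X_p beta_2 - C_p X_p beta_1

Xp_inv = np.linalg.inv(X_p)
beta_out  = Xp_inv @ C_p @ X_p @ beta1 + Xp_inv @ a_p
Sigma_out = C_p @ S1 @ C_p.T

print(np.allclose(beta_out, beta2), np.allclose(Sigma_out, S2))   # True True
```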

$\tilde{G}_{\boldsymbol{\beta}}$ and $\tilde{G}_{\Sigma_{p}}$ being Groups

Analogous to the proof for $\bar{G}$,

$$\begin{split}\tilde{G}_{\boldsymbol{\beta}}=\{\tilde{g}:\tilde{g}({\bf d})&=X_{p}^{-1}C_{p}X_{p}{\bf d}+X_{p}^{-1}{\bf a}_{p}\}\text{ and }\tilde{G}_{\Sigma_{p}}=\{\tilde{g}:\tilde{g}(D)=C_{p}DC_{p}^{\prime}\}\\ \text{with }C_{p}&=Diag(c_{1},\ldots,c_{p}),\ {\bf a}_{p}=(a_{1},\ldots,a_{p})^{\prime}\end{split} \tag{A.3}$$

can be shown to be two groups.

Characterization of Equivariant Estimators for $\Sigma_{p}$

Proof.

It can easily be verified that $\Delta({\bf y})=SH({\bf z})S^{\prime}=H({\bf z})S^{2}$ is equivariant under the group action of $\tilde{G}_{\Sigma_{p}}$ if and only if each of its diagonal elements $\delta_{i}({\bf y})$ is equivariant under the transformation group trio $(G,G_{c},G_{c})$. For $\delta_{i}({\bf y})$, consider the transformation $T_{ij}=Y_{N_{i-1}+j}-Y_{N_{i}}$. Then the problem can be converted into a traditional scale-equivariant one, and the rest follows from Theorem 3.3 in Lehmann and Casella (1998). ∎

Characterization of Equivariant Estimators for $\boldsymbol{\beta}$

Proof.

We aim to prove that the estimator $\delta({\bf y})$ is equivariant if and only if it satisfies

$$\delta({\bf y})=(X^{\prime}X)^{-1}X^{\prime}{\bf y}+X_{p}^{-1}S({\bf y})\omega({\bf z}).$$

To start with, we prove that any estimator of this form is equivariant. From Lemma 2.1, the OLS estimator is equivariant. Also, $S(g({\bf y}))=Diag(c_{1}s_{1},\ldots,c_{p}s_{p})=C_{p}S({\bf y})$ and $\omega({\bf z})$ is invariant under $G$. Thus,

$$\begin{aligned}\delta(g({\bf y}))&=X_{p}^{-1}C_{p}X_{p}\cdot(X^{\prime}X)^{-1}X^{\prime}{\bf y}+X_{p}^{-1}{\bf a}_{p}+X_{p}^{-1}C_{p}X_{p}\cdot X_{p}^{-1}S({\bf y})\omega({\bf z})\\ &=X_{p}^{-1}C_{p}X_{p}\delta({\bf y})+X_{p}^{-1}{\bf a}_{p}=\tilde{g}(\delta({\bf y})).\end{aligned}$$

Therefore, any $\delta({\bf y})$ of this form is equivariant.

Next, we prove that any equivariant estimator must be of this form. For any equivariant estimator $\delta({\bf y})$, let $\delta_{0}({\bf y})=X_{p}[\delta({\bf y})-(X^{\prime}X)^{-1}X^{\prime}{\bf y}]=X_{p}[\delta({\bf y})-X_{p}^{-1}{\bf\bar{y}}]=X_{p}\delta({\bf y})-{\bf\bar{y}}$, and we have that

$$\begin{aligned}\delta_{0}(g({\bf y}))&=X_{p}\delta(g({\bf y}))-\overline{g({\bf y})}\\ &=C_{p}X_{p}\delta({\bf y})+{\bf a}_{p}-C_{p}{\bf\bar{y}}-{\bf a}_{p}\\ &=C_{p}[X_{p}\delta({\bf y})-{\bf\bar{y}}]=C_{p}\delta_{0}({\bf y}).\end{aligned}$$

Therefore, $\delta_{0}({\bf y})$ is equivariant under the transformation group trio $(G,G_{c},G_{c})$.

Similar to the argument above, one can show that $\delta_{0}({\bf y})$ is equivariant if and only if there exists an $\omega$ such that $\delta_{0}({\bf y})=S({\bf y})\omega({\bf z})$, since $S({\bf y})$ is equivariant under the transformation group trio $(G,G_{c},G_{c})$ and ${\bf z}$ is a maximal invariant under the scale transformation group $G_{c}$.

Hence, we have $\delta({\bf y})=(X^{\prime}X)^{-1}X^{\prime}{\bf y}+X_{p}^{-1}S({\bf y})\omega({\bf z})$. ∎

$(X^{\prime}X)^{-1}X^{\prime}{\bf y}$, $S^{2}$ and ${\bf z}$ are pairwise independent.

Proof.

Based on Lemma 2.1, it is easy to show that $(X^{\prime}X)^{-1}X^{\prime}{\bf y}$ and $S^{2}$ are independent.

Next, one can show that $({\bf\bar{y}},S^{2})$ is complete and sufficient for $(\boldsymbol{\eta}=X_{p}\boldsymbol{\beta},\Sigma_{p})$ and that ${\bf z}$ is ancillary; then, by Basu's theorem, $({\bf\bar{y}},S^{2})$ and ${\bf z}$ are independent.

Alternatively, one can note that all three statistics are built from linear transformations of ${\bf y}$ whose coefficient matrices have zero cross-products, which yields the pairwise independence under normality. ∎