The Equivariance Criterion in a Linear Model for Fixed-X Cases
In this article, we explore the use of the equivariance criterion in the normal linear model with fixed X and extend the model to allow multiple populations, which, in turn, leads to a multivariate invariant location-scale transformation group rather than the commonly used univariate one. The minimum risk equivariant estimators of the coefficient vector and the diagonal covariance matrix are derived and are consistent with results in the literature. This work serves as an early exploration of the use of the equivariance criterion in machine learning.
Key words and Phrases: Equivariance, Linear Model, Estimation, Coefficient Vector, Covariance Matrix, Likelihood Loss
1. Introduction
Consider a general linear model with a response Y and a covariate vector x. The aims of the linear model are to build a linear functional relationship between Y and x from the paired observations and to predict a future response given a new covariate value. Predominantly in the literature, a fixed-X assumption has been used with the following understandings:
(i) The covariate values are fixed before sampling, while the only randomness comes from the responses;
(ii) There is no other relationship among those covariate values.
Meanwhile, random-X cases have also been investigated, e.g., Breiman and Spector (1992), where the covariate values are assumed to be sampled from a random vector.
In this article, we will start with a fixed X formed by p linearly independent rows, each repeated n_i times, i.e.,
(1.1)   X^T = (x_1, …, x_1, x_2, …, x_2, …, x_p, …, x_p), with x_i repeated n_i times,
where n_1 + … + n_p = n. One can argue that this is the general form for a fixed-X case, as the rank of the design matrix is p, allowing only p distinct rows. The difference between a fixed-X and a random-X case lies in whether X is fully known before sampling. Naturally, one should set the number of parameters equal to the rank of X, namely p. Otherwise, if there are more parameters than p, the model carries redundant parameters, some of which may not be identifiable and thus not estimable. On the other hand, if there are fewer parameters than p, then more parameters will be needed, or some of X may be considered unknown before sampling, in which case X is random and out of the scope of this article. To derive the optimal solution, we will introduce the equivariance criterion, which has been widely considered in the linear model, e.g., Rao (1965); Eaton (1989), and the distinctions between fixed-X and random-X cases are clearly noted (Rao (1973), Rosset and Tibshirani (2020)).
As a principle of symmetry, the equivariance criterion has been proposed in the literature as a logical constraint on the candidate solutions, from which the optimal one is derived. Compared to unbiasedness, it focuses more on the symmetry of the problem and the self-consistency of the solutions. In the era of machine learning, it has received rising attention due to the symmetry widely present in applications. Lehmann and Casella (1998, Chap. 3) give a detailed discussion of the equivariance criterion in the location-scale family, while Berger (1985, Chap. 6) presents the theory in a decision-theoretic framework. Besides those two classical textbooks, other important references include Hora and Buehler (1966); Eaton (1989); Wijsman (1990).
The equivariance criterion consists of two principles: (i) functional equivariance, which states that the action in a decision problem should be consistent across different measurement scales, and (ii) formal equivariance, which requires the decision rule to be of the same form for two problems with identical structure.
Formally, a decision problem is described by (𝒳, 𝒫 = {P_θ : θ ∈ Ω}, 𝒟, L), where 𝒳 is the sample space, 𝒫 is a family of distributions with θ as the parameter or the true state of nature, Ω is the parameter space, 𝒟 is a decision space, and L is a loss function on Ω × 𝒟.
Without loss of generality, one starts with identifiability of both the distribution family and the loss function, which, for the linear model, means there are no redundant parameters. The principle of functional equivariance requires preservation of the model and invariance of the loss function under a group of one-to-one and onto transformations, defined as follows.
Definition 1.1.
(Preservation of the Model) The distribution family 𝒫 is said to be invariant under a group G of transformations of the sample space if for each g ∈ G and each θ ∈ Ω, there exists a unique θ′ ∈ Ω, denoted ḡθ, such that
(1.2)   g(X) ∼ P_{ḡθ} whenever X ∼ P_θ.
Definition 1.2.
(Invariance of the Loss Function) The loss function L is invariant under G if for each g ∈ G and each d ∈ 𝒟 there exists a unique d* ∈ 𝒟, denoted g*(d), such that L(θ, d) = L(ḡθ, g*(d)) for all θ ∈ Ω.
Definition 1.3.
A decision problem is invariant under a group of transformations G if G preserves the distribution family 𝒫 and the loss function is invariant under G.
Under the structure of an invariant decision problem, we can introduce the equivariance criterion.
Definition 1.4.
A decision rule δ is said to be equivariant under G if
(1.3)   δ(g(x)) = g*(δ(x)) for all x ∈ 𝒳 and g ∈ G.
In this way, the equivariance criterion imposes a constraint on the decision rules that one can reasonably use, from which the optimal decision rule is derived. In this article, we will pursue the best decision rule (the minimum risk equivariant rule, MRE rule) in the sense of minimizing the risk function, that is, the expectation of the loss function over the sample, which is a function of the parameter and the decision rule.
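As a concrete illustration of Definition 1.4, the following minimal sketch uses the univariate location-scale group g(x) = a + bx and its induced affine action on the decision space (our own notation for illustration, not the multivariate group introduced later): the sample mean and median satisfy (1.3), while a shifted rule does not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
a, b = 2.5, 1.7  # one location-scale transformation g(x) = a + b * x, with b > 0

def satisfies_1_3(rule):
    # Check delta(g x) == g* delta(x) for this particular g (a necessary condition for equivariance).
    return np.isclose(rule(a + b * x), a + b * rule(x))

print(satisfies_1_3(np.mean))                     # True: the sample mean is equivariant
print(satisfies_1_3(np.median))                   # True: so is the sample median
print(satisfies_1_3(lambda v: np.mean(v) + 1.0))  # False: a shifted mean violates (1.3)
```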
Lehmann and Casella (1998) have applied the equivariance criterion implicitly in the linear model for fixed-X cases, where the key step is to transform the problem to a canonical form and derive the best equivariant estimators for the coefficient vector and the common variance. Wu and Yang (2002) have discussed the existence of best equivariant estimators for the coefficient vector and the covariance matrix (in three forms) in normal linear models with fixed X and derived their forms when they exist. Kurata and Matsuura (2016) have derived the best equivariant estimator of the regression coefficients in a seemingly unrelated regression (SUR) model with a known correlation matrix. Further, Matsuura and Kurata (2020) derived the best equivariant estimator of the covariance matrix in a SUR model. It is noted that most works in the literature assume a single population for the linear model and thus apply a common invariant transformation to the responses to derive the optimal equivariant estimators. However, as in experimental design, one usually views each distinct covariate vector as a population independent of the others, and a SUR model uses a natural multivariate extension of the common linear model. Therefore, a multivariate extension of the common linear model combined with the equivariance criterion warrants further investigation, especially from the fundamental logic of the equivariance criterion.
This article focuses on applying the logic of the equivariance criterion to the fixed-X linear model, where the linear model is extended to allow multiple populations instead of a single one and the invariant transformation is distinct for each population. In Section 2, we derive the best equivariant estimators for the coefficient vector and the diagonal covariance matrix in a normal linear model specially tuned for the equivariance criterion. Section 3 is devoted to concluding remarks and future work.
2. Equivariance in the Linear Model
Consider the linear regression model Y = Xβ + ε, where Y is an n × 1 response vector, X is an n × p design matrix, β is the p × 1 coefficient vector, and ε is the noise vector. To derive the best equivariant estimators in this model, we first walk through the basic elements of the equivariance criterion and then present a linear model tuned for the equivariance criterion, starting from the basic concept of a population.
Preservation of the Model: Without loss of generality, we assume that the design matrix X is predetermined and of full rank with p ≤ n, so that only the response Y is random and thus only transformations on Y will be considered. For a fixed-X linear model, the samples can be viewed as coming from different populations, as in experimental design, which imposes a restriction on the choice of the transformation group in addition to the specification of the model. The equivariance criterion implicitly requires the transformations on samples from the same population to be identical while being distinct for samples from different populations. Therefore, it is essential to determine the number of populations inside a linear model.
For the fixed-X cases, one can argue that there can be only p populations inside the linear model, as the rank of the design matrix is p. In this regard, we will assume that the responses Y_{ij} are independently distributed as a normal distribution with mean x_i^T β and unknown variance σ_i^2, for i = 1, …, p and j = 1, …, n_i. Thus, we have the normal linear model tuned for the equivariance criterion as follows,
(2.1)   Y_{ij} ∼ N(x_i^T β, σ_i^2), independently for i = 1, …, p, j = 1, …, n_i.
Note that here it is natural to assume a separate variance for each population, in contrast to the traditional linear model where all the variances are assumed to be identical. In this case, we have the usual identifiability of the model with respect to β and Σ. Since Σ contains only p unknown parameters, one can simplify the problem to consider the condensed covariance matrix Σ_0 = diag(σ_1^2, …, σ_p^2).
The transformation group keeping the model invariant is of the form G = { g : g(Y)_{ij} = b_i Y_{ij} + a_i }, with a_i ∈ ℝ and b_i > 0, i = 1, …, p, j = 1, …, n_i. (Both a_i and b_i are repeated n_i times.) Here we use p independent transformations for the p populations, compared to the literature, where most works use one single common transformation g(Y) = bY + a1_n. To facilitate the discussion, we denote a = (a_1, …, a_p)^T, B = diag(b_1, …, b_p), and X_0 = (x_1, …, x_p)^T, the p × p invertible matrix of the distinct rows of X. The corresponding transformation group Ḡ on the parameter space is of the form
ḡ(β, Σ_0) = (X_0^{-1}(B X_0 β + a), B Σ_0 B). It can be shown that the parameter space is transitive under Ḡ, as any parameter point can be carried to any other by a suitable choice of a and B (see Appendix A).
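As a numerical sanity check on this group structure, the following sketch (with an arbitrary invertible X_0 and illustrative parameter values of our own choosing) verifies that the induced map ḡ stated above sends the mean vector X_0 β to the transformed means b_i(x_i^T β) + a_i, and that a suitable (a, B) carries any parameter point to any other, i.e., transitivity.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3                                        # number of populations
X0 = rng.normal(size=(p, p))                 # hypothetical p distinct (linearly independent) rows of X

def g_param(beta, sigma2, a, b):
    """Induced action on (beta, Sigma_0), in the form assumed in the text."""
    beta_new = np.linalg.solve(X0, b * (X0 @ beta) + a)
    return beta_new, b ** 2 * sigma2

beta, sigma2 = rng.normal(size=p), np.array([0.5, 1.0, 2.0])
a, b = np.array([1.0, -2.0, 0.5]), np.array([2.0, 0.7, 1.5])   # one group element (a_i, b_i), b_i > 0

# Preservation of the model: the transformed means b_i * x_i'beta + a_i are exactly
# the means x_i'beta_new of the mapped parameter point; variances become b_i^2 * sigma_i^2.
beta_new, sigma2_new = g_param(beta, sigma2, a, b)
print(np.allclose(b * (X0 @ beta) + a, X0 @ beta_new))          # True

# Transitivity: a group element mapping (beta, sigma2) to an arbitrary target point.
beta_t, sigma2_t = rng.normal(size=p), np.array([3.0, 0.2, 1.0])
b_t = np.sqrt(sigma2_t / sigma2)
a_t = X0 @ beta_t - b_t * (X0 @ beta)
mapped = g_param(beta, sigma2, a_t, b_t)
print(np.allclose(mapped[0], beta_t), np.allclose(mapped[1], sigma2_t))   # True True
```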
In the linear model, the usual targets of interest are the coefficient vector β and the covariance matrix Σ. We will discuss the two separately in the context of Invariance of the Loss Function.
To derive an MRE decision rule, Lehmann and Casella (1998, Chap. 3) have used maximal invariants to characterize all the equivariant estimators and then minimized the constant risk, exploiting transitivity of the parameter space for location-scale families. We will follow the same approach to derive the best estimators.
2.1 Estimation of the Coefficient Vector
Staudte Jr (1971) and Zhou and Nayak (2014) have introduced a method to construct an invariant loss function based on the target of interest. Following their method, one can build an invariant loss function for β, which extends the one implicitly used in the equivariance literature (Theorem 4.3 (b) in Lehmann and Casella (1998, Chap. 3) and (1.4) in Wu and Yang (2002)). One then has the corresponding invariant transformation on the decision space, g*(d) = X_0^{-1}(B X_0 d + a), and the equivariance criterion δ(g(Y)) = g*(δ(Y)). It can be shown that these invariant transformations form a group and that the least squares estimator is essentially the vector of within-population sample means, as in the following Lemma 2.1, and is thus equivariant under the group.
Lemma 2.1.
For the fixed-X linear model in (2.1), the least squares estimator is (X^T X)^{-1} X^T Y = X_0^{-1} Ȳ, with Ȳ = (Ȳ_1, …, Ȳ_p)^T and Ȳ_i = n_i^{-1} ∑_{j=1}^{n_i} Y_{ij} the within-population sample means.
Proof.
Since X^T X = X_0^T N X_0 and X^T Y = X_0^T N Ȳ, where N = diag(n_1, …, n_p), the least squares estimator satisfies
(X^T X)^{-1} X^T Y = X_0^{-1} N^{-1} (X_0^T)^{-1} X_0^T N Ȳ = X_0^{-1} Ȳ.
∎
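The lemma can be checked numerically; the sketch below (with an arbitrary full-rank design and illustrative noise levels) confirms that the least squares solution equals X_0^{-1} applied to the vector of within-population sample means.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_i = 3, 4
X0 = rng.normal(size=(p, p))                     # hypothetical distinct rows
X = np.repeat(X0, n_i, axis=0)                   # each row repeated n_i times: the n x p design
sigma = np.repeat([0.3, 1.0, 2.0], n_i)          # per-population standard deviations
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma)

beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]   # least squares fit
Ybar = Y.reshape(p, n_i).mean(axis=1)            # within-population sample means
print(np.allclose(beta_ls, np.linalg.solve(X0, Ybar)))   # True: beta_hat = X0^{-1} Ybar
print(np.allclose(X0 @ beta_ls, Ybar))                   # equivalently, X0 @ beta_hat = Ybar
```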
A Characterization of Equivariant Estimators. It follows from Lehmann and Casella (1998) that a characterization of equivariant estimators of β can be given by δ(Y) = X_0^{-1}(Ȳ + S w(U)), where S = diag(S_1, …, S_p), S_i = (∑_{j=1}^{n_i} (Y_{ij} - Ȳ_i)^2)^{1/2} is the sample deviation for each population, Ȳ_i is the sample mean for each population, i = 1, …, p, U is a maximal invariant, and w is a p-dimensional vector-valued function.
Best Equivariant Estimator. To derive the best equivariant estimator minimizing the risk function, one first uses the fact that the risk function is constant for any equivariant estimator, as the parameter space is transitive. Then one can note that Ȳ is independent of S, while an equivariant estimator depends on the data beyond (Ȳ, S) only via the maximal invariant U, which is independent of (Ȳ, S). With the above arguments, one can show the main result as follows.
Theorem 2.1.
For the fixed-X linear model in (2.1), the least squares estimator, X_0^{-1} Ȳ, is the best equivariant estimator for β.
Proof.
Since the parameter space is transitive under the group Ḡ, we choose the special parameter point (β, Σ_0) = (0, I_p) to evaluate the risk function. By the characterization above, any equivariant estimator of β is of the form X_0^{-1}(Ȳ + S w(U)) and has constant risk. Since Ȳ, S and U are pairwise independent and Ȳ has mean zero at the chosen point, the risk is minimized at w ≡ 0. Therefore, the best equivariant estimator is X_0^{-1} Ȳ, the least squares estimator.
∎
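To illustrate the argument, the following Monte Carlo sketch evaluates equivariant estimators of the form X_0^{-1}(Ȳ + c S) at the special point (0, I_p). The paper's exact invariant loss is not reproduced here; as a surrogate, the sketch measures risk by the squared error of X_0 δ(Y), one convenient invariant choice under the assumed group, and the risk is smallest at c = 0, i.e., at the least squares estimator.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n_i, reps = 2, 5, 100_000
Y = rng.normal(size=(reps, p, n_i))              # data at the special point: beta = 0, sigma_i^2 = 1
Ybar = Y.mean(axis=2)                            # within-population sample means
S = np.sqrt(((Y - Ybar[..., None]) ** 2).sum(axis=2))   # within-population sample deviations

def risk(c):
    # Risk of the equivariant estimator X0^{-1}(Ybar + c * S), measured by the squared
    # error of X0 * delta (an illustrative invariant surrogate loss, not the paper's).
    return np.mean(((Ybar + c * S) ** 2).sum(axis=1))

cs = np.round(np.linspace(-0.5, 0.5, 41), 6)
print(min(cs, key=risk))                         # 0.0: the minimum is at c = 0, i.e., least squares
```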
2.2 Estimation of the Diagonal Covariance Matrix
Since Σ is diagonal with only p unknown parameters, it is equivalent to estimate the diagonal matrix Σ_0 = diag(σ_1^2, …, σ_p^2). It is noteworthy that Σ_0 is estimable only when n_i ≥ 2 for each i. Consider two widely discussed loss functions: the quadratic loss, L_1(Σ_0, D) = ∑_i (d_i/σ_i^2 - 1)^2, and the likelihood loss, L_2(Σ_0, D) = ∑_i (d_i/σ_i^2 - log(d_i/σ_i^2) - 1), where D = diag(d_1, …, d_p) is a positive definite diagonal matrix. The same transformation group G is used in the preservation of the model, and both loss functions are invariant under G with the induced action D → B D B on the decision space. Denote S_i^2 = ∑_{j=1}^{n_i} (Y_{ij} - Ȳ_i)^2; a numerical check of the loss invariance is sketched below, and one can then show the main result as follows.
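The sketch below implements the two losses in the summed per-population form assumed above and checks their invariance when both the parameter and the decision are scaled entrywise by b_i^2.

```python
import numpy as np

def quadratic_loss(sigma2, d):
    # L1: sum_i (d_i / sigma_i^2 - 1)^2 over the diagonal entries (form assumed above).
    r = d / sigma2
    return np.sum((r - 1.0) ** 2)

def likelihood_loss(sigma2, d):
    # L2 (Stein-type): sum_i (d_i / sigma_i^2 - log(d_i / sigma_i^2) - 1).
    r = d / sigma2
    return np.sum(r - np.log(r) - 1.0)

# Invariance under the group action: Sigma_0 -> B Sigma_0 B and D -> B D B, i.e., each
# diagonal entry is multiplied by b_i^2, which leaves either loss unchanged.
rng = np.random.default_rng(3)
sigma2, d, b2 = rng.uniform(0.5, 2.0, size=(3, 3))
print(np.isclose(quadratic_loss(sigma2, d), quadratic_loss(b2 * sigma2, b2 * d)))    # True
print(np.isclose(likelihood_loss(sigma2, d), likelihood_loss(b2 * sigma2, b2 * d)))  # True
```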
Theorem 2.2.
For the fixed-X linear model in (2.1), D_1 = diag(S_1^2/(n_1+1), …, S_p^2/(n_p+1)) and D_2 = diag(S_1^2/(n_1-1), …, S_p^2/(n_p-1)) are the MRE estimators for Σ_0 under L_1 and L_2, respectively.
Proof.
Analogous to the univariate case, under the loss functions L_1 and L_2, any equivariant estimator of Σ_0 can be characterized as
(2.2)   D(Y) = diag(S_1^2 w_1(U), …, S_p^2 w_p(U)),
where S_i^2 = ∑_{j=1}^{n_i} (Y_{ij} - Ȳ_i)^2, U is a maximal invariant, and w = (w_1, …, w_p) is a positive function of U. The proof of equation (2.2) can be found in Appendix A, where it is also shown that S and U are independent.
Since the parameter space is transitive under the group Ḡ, we choose the special parameter point (β, Σ_0) = (0, I_p) to evaluate the risk function.
First, under L_1, the constant risk of any equivariant estimator can be calculated as follows: since S_i^2 has a χ²_{n_i-1} distribution at the chosen point and S_i is independent of U,
R = ∑_i E[(S_i^2 w_i(U) - 1)^2] = ∑_i E[(n_i-1)(n_i+1) w_i(U)^2 - 2(n_i-1) w_i(U) + 1].
The risk attains its minimum at w_i(U) ≡ 1/(n_i+1) for each i, or equivalently, at D = diag(S_1^2/(n_1+1), …, S_p^2/(n_p+1)). Thus, D_1 is the MRE estimator under L_1.
A similar calculation applies to L_2, and the MRE estimator under L_2 is D_2 = diag(S_1^2/(n_1-1), …, S_p^2/(n_p-1)). ∎
It is noted that Wu and Yang (2002) have considered such a problem under the common location-scale transformation and a corresponding loss function and shown that no best equivariant estimator exists. In essence, Theorem 2.2 indicates that the multivariate MRE estimator is a vector of the univariate MRE estimators as in Lehmann and Casella (1998), which are sample variances multiplied by constants. When the covariance matrix is fully unknown, it is not estimable: there is no sufficient and complete statistic, and the above proof does not apply.
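As a sanity check on the constants in Theorem 2.2, the following Monte Carlo sketch (a single population with σ_i^2 = 1 and n_i = 6, illustrative rather than exhaustive) scans estimators of the form c S_i^2 and locates the risk-minimizing c under each loss; the minimizers fall near 1/(n_i + 1) for the quadratic loss and 1/(n_i - 1) for the likelihood loss.

```python
import numpy as np

rng = np.random.default_rng(4)
n_i, reps = 6, 200_000                    # one population, sigma_i^2 = 1
x = rng.normal(size=(reps, n_i))
S2 = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # within-population sum of squares

def risk_quadratic(c):                    # E[(c * S^2 / sigma^2 - 1)^2] at sigma^2 = 1
    return np.mean((c * S2 - 1.0) ** 2)

def risk_likelihood(c):                   # E[c * S^2 - log(c * S^2) - 1] at sigma^2 = 1
    return np.mean(c * S2 - np.log(c * S2) - 1.0)

cs = np.linspace(0.05, 0.6, 221)
print(cs[np.argmin([risk_quadratic(c) for c in cs])], 1 / (n_i + 1))   # both close to 1/7
print(cs[np.argmin([risk_likelihood(c) for c in cs])], 1 / (n_i - 1))  # both close to 1/5
```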
3. Future Work and Discussion
This paper has served as an initial effort to explore the use of the equivariance criterion in the field of machine learning, where we are interested in which method yields the optimal solution and what properties that solution carries, especially under the equivariance criterion. We start from least squares in the linear model, the simplest and most foundational method in machine learning, to derive the optimal solution.
In this paper, we have established that the MRE estimators for the coefficient vector and the condensed covariance matrix are the least squares estimator and the vector of (suitably scaled) within-population sample variances, respectively. In addition, we have demonstrated that in our setting (2.1), the least squares estimator is essentially the vector of within-population sample means. Such findings further solidify their optimality from the perspective of multivariate normal distribution theory.
The linear model with a full-rank design matrix takes different forms across the literature, depending on the number of populations. The commonly used form assumes a single population, whose characteristic is a common distribution for all the noise terms. Naturally, in this setup one would use a single univariate location-scale transformation to apply the equivariance criterion. In this paper, we relax this assumption and allow p populations to accommodate a larger transformation group, which consists of multivariate location-scale transformations. From the perspective of experimental design, such a relaxation is quite intuitive: the design matrix is chosen carefully before sampling, and in a sense those design points are independent, with each distinct row constituting a population. Meanwhile, one can argue that p is the maximum number of populations allowed in a linear model.
In terms of estimating the coefficient vector, the form of the design matrix does not have much impact on the MRE solution, and thus (2.1) gives a simple path to it. However, to estimate the condensed covariance matrix, it is essential to choose the form of the design and thus decide the size of each population. In an experimental design setting, one would usually use n_i = m for all i, where m is an integer and n = mp. Such a balanced form is recommended both for its symmetry and its computational convenience; it is also a special case of a seemingly unrelated regression (SUR) problem with a known correlation matrix. However, one can convert a SUR problem back into our setting, and functional equivariance guarantees that the MRE solution is of the corresponding form. In this sense, our model is more general than the SUR model used in Kurata and Matsuura (2016); Matsuura and Kurata (2020).
We have discussed the best equivariant estimators for the coefficient vector and the condensed covariance matrix in a normal linear model specially tuned for the equivariance criterion, of which the commonly used model is a special example. Such a model requires a larger transformation group, which, in turn, results in a smaller set of equivariant decision rules, within which the MRE estimator exists. Interestingly, Wu and Yang (2002) have shown that the commonly used single location-scale transformation induces so many equivariant decision rules under the corresponding loss that there is no MRE estimator for the covariance matrix. Meanwhile, each population inside our normal model is the linear model commonly used in the literature, and the resulting estimators for each population are the traditional MRE ones, which are equivalent to the least squares solutions. Kurata and Matsuura (2016) and Matsuura and Kurata (2020) used the same transformation group for the SUR model, where a p-dimensional distribution family was considered and the samples come from a single multivariate population.
The choice of the invariant transformation group is an important topic in the equivariance literature. Wu and Yang (2002) presented a case where the group is too large to allow an optimal solution. Usually, the group is chosen to be isomorphic to the sample space or the parameter space, especially when considering the Haar prior. There are some interesting cases in which the invariant transformation groups form a nested family and the largest one admits only the solution that is optimal under the smaller ones.
The likelihood loss is a multivariate extension of the Stein loss, which is preferred in the literature (Brown (1968)). It induces an estimator that is both MRE and UMVU and is always larger than the one under the quadratic loss. Meanwhile, the likelihood loss is more evenhanded over the parameter range, as the covariance matrix is required to be positive definite. Such a form is quite similar to the logistic transformation in a generalized linear model.
An extension to prediction will be a future direction. However, existing frameworks for the equivariance criterion (e.g., Zhou and Nayak (2015)) cannot handle the prediction problem well in a linear model: in that literature the predicted response is assumed to be unobservable, which is not the case in the linear model. One may also note that overfitting (Stone (1974)) arises in prediction problems for a linear model but usually does not occur in estimation, since in deriving the least squares solution the in-sample prediction error is used, which converges to a univariate limit up to an additive term that is constant in the parameters.
The linear model with fixed-X cases, though predominantly used in the literature, is of limited use in practice, especially in our settings, where experimental design is the ideal scenario. Linear models with random-X cases and mixture cases are more interesting and challenging. The concept of randomness in the linear model has drawn wide attention, and numerous efforts have been devoted to clarifying the differences between fixedness and randomness. Little (2019) has given a straightforward definition of randomness as being unknown from a Bayesian view. In his argument, the treatment indicator in a clinical trial can be considered both fixed and random. It is true that such semi-controlled covariates pose challenges to the definition of randomness: individually, a covariate value is unknown and thus can be considered random; population-wise, its distribution is usually under control and thus can be treated as fixed in analyses.
The Gauss-Markov Theorem is the fundamental result for the linear model, where the optimality of the least squares solution is established. In most textbooks, its proof is based on the predominantly assumed fixed-X cases. Shaffer (1991) has shown some interesting results where the Gauss-Markov Theorem no longer holds for some random-X cases. We will investigate such a phenomenon in the context of equivariance for the random-X cases.
In terms of randomness, the setup before sampling is crucial, and one can classify X before sampling into the following categories: known values, values from a known distribution, values from a distribution family with unknown parameters, and totally unknown values. In a typical experimental design setting, the design factors take known values, which, in our setting, refers to the fixed-X case. In a typical clinical trial setting, the treatment indicator comes from a known distribution. In the classical parametric inference setting, we may assume X comes from a distribution family with unknown parameters. In the non-parametric setting, X is usually seen as totally unknown. The latter three scenarios refer to random-X cases, which will be another future topic.
For a linear model with a non-normal distribution, we will consider extensions of the current results to the generalized linear model, where the challenges start with the invariant transformations. In a normal linear model, one can easily find an invariant location-scale transformation group that leaves the parameter space transitive, which facilitates the derivation of the MRE solutions. This may not be the case for a non-normal linear model.
Acknowledgment. This work was supported by the Fundamental Research Funds for the Central Universities, Sun Yat-sen University (Grant No. 20lgpy145 & 2021qntd21) and the Science and Technology Program of Guangzhou Project, Fundamental and Applied Research Project (202102080175). The authors report there are no competing interests to declare.
References
- Berger (1985) Berger, J.O. (1985). Statistical decision theory and Bayesian analysis. Springer-Verlag, New York, second edition.
- Breiman and Spector (1992) Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression: The X-random case. International Statistical Review / Revue Internationale de Statistique, 60, 291. doi:10.2307/1403680.
- Brown (1968) Brown, L. (1968). Inadmissibility of the usual estimators of scale parameters in problems with unknown location and scale parameters. The Annals of Mathematical Statistics, 39, 29–48. doi:10.1214/aoms/1177698503.
- Cohen and Welling (2016) Cohen, T. and Welling, M. (2016). Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999.
- Eaton (1989) Eaton, M.L. (1989). Group invariance applications in statistics. Institute of Mathematical Statistics, Beachwood, Ohio.
- Hora and Buehler (1966) Hora, R.B. and Buehler, R.J. (1966). Fiducial theory and invariant estimation. The Annals of Mathematical Statistics, 37, 643–656.
- Kurata and Matsuura (2016) Kurata, H. and Matsuura, S. (2016). Best equivariant estimator of regression coefficients in a seemingly unrelated regression model with known correlation matrix. Annals of the Institute of Statistical Mathematics, 68, 705–723. doi:10.1007/s10463-015-0512-2.
- Lehmann and Casella (1998) Lehmann, E.L. and Casella, G. (1998). Theory of point estimation. Springer-Verlag, New York, second edition.
- Little (2019) Little, R.J. (2019). Comment: “Models as Approximations I: Consequences Illustrated with Linear Regression” by A. Buja, R. Berk, L. Brown, E. George, E. Pitkin, L. Zhan and K. Zhang. Statistical Science, 34, 580–583. doi:10.1214/19-sts726.
- Marcos et al. (2017) Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. (2017). Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5048–5057.
- Matsuura and Kurata (2020) Matsuura, S. and Kurata, H. (2020). Covariance matrix estimation in a seemingly unrelated regression model under Stein’s loss. Statistical Methods & Applications, 29, 79–99. doi:10.1007/s10260-019-00473-x.
- Rao (1973) Rao, C. (1973). Representations of best linear unbiased estimators in the Gauss-Markoff model with a singular dispersion matrix. Journal of Multivariate Analysis, 3, 276–292. doi:10.1016/0047-259x(73)90042-0.
- Rao (1965) Rao, C.R. (1965). The theory of least squares when the parameters are stochastic and its application to the analysis of growth curves. Biometrika, 52, 447–458. doi:10.1093/biomet/52.3-4.447.
- Rosset and Tibshirani (2020) Rosset, S. and Tibshirani, R.J. (2020). From fixed-X to random-X regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. Journal of the American Statistical Association, 115, 138–151. doi:10.1080/01621459.2018.1424632.
- Shaffer (1991) Shaffer, J.P. (1991). The Gauss–Markov theorem and random regressors. The American Statistician, 45, 269–273. doi:10.1080/00031305.1991.10475819.
- Staudte Jr (1971) Staudte Jr, R.G. (1971). A characterization of invariant loss functions. The Annals of Mathematical Statistics, pp. 1322–1327.
- Stone (1974) Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36, 111–133. doi:10.1111/j.2517-6161.1974.tb00994.x.
- Wijsman (1990) Wijsman, R.A. (1990). Invariant measures on groups and their use in statistics. Institute of Mathematical Statistics.
- Wu and Yang (2002) Wu, Q. and Yang, G. (2002). Existence of the uniformly minimum risk equivariant estimators of parameters in a class of normal linear models. Science in China Series A: Mathematics, 45, 845–858. doi:10.1360/02ys9093.
- Zhou and Nayak (2014) Zhou, H. and Nayak, T.K. (2014). A note on existence and construction of invariant loss functions. Statistics, 48, 1335–1343. doi:10.1080/02331888.2013.809719.
- Zhou and Nayak (2015) Zhou, H.J. and Nayak, T.K. (2015). On the equivariance criterion in statistical prediction. Annals of the Institute of Statistical Mathematics, 67, 541–555. doi:10.1007/s10463-014-0464-y.
Appendix A. Technical Proofs
Identifiability of the Model
Proof.
For any two parameter points (β, Σ_0) and (β′, Σ_0′) and any fixed X with full rank, the density of Y under (β, Σ_0) equals that under (β′, Σ_0′) if and only if x_i^T β = x_i^T β′ and σ_i^2 = σ_i′^2 for i = 1, …, p, which, by the linear independence of x_1, …, x_p, holds if and only if β = β′ and Σ_0 = Σ_0′. Thus, the model is identifiable with respect to (β, Σ_0). ∎
G being a Group
Proof.
We aim to prove that
(1.1)   G = { g : g(Y)_{ij} = b_i Y_{ij} + a_i, a_i ∈ ℝ, b_i > 0, i = 1, …, p, j = 1, …, n_i }
satisfies the definition of a group.
(i) Closedness: For any two transformations g_1, g_2 ∈ G with
g_1(Y)_{ij} = b_i Y_{ij} + a_i and g_2(Y)_{ij} = b_i′ Y_{ij} + a_i′,
where a_i, a_i′ ∈ ℝ and b_i, b_i′ > 0, i = 1, …, p, we have that
(g_2 ∘ g_1)(Y)_{ij} = b_i′ b_i Y_{ij} + (b_i′ a_i + a_i′),
with b_i″ = b_i′ b_i and a_i″ = b_i′ a_i + a_i′. Since b_i″ > 0 and a_i″ ∈ ℝ for i = 1, …, p, we find that g_2 ∘ g_1 ∈ G.
(ii) Combination Law: For any three transformations g_1, g_2, g_3 ∈ G, we have that
g_3 ∘ (g_2 ∘ g_1) = (g_3 ∘ g_2) ∘ g_1,
since the composition of mappings is associative.
(iii) Unit Element: The transformation with a_i = 0 and b_i = 1, i = 1, …, p, is the unit element.
(iv) Inverse Element: For any transformation g ∈ G with parameters (a_i, b_i), its inverse transformation is the element of G with parameters (-a_i/b_i, 1/b_i).
∎
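The closure and inverse steps can also be verified numerically; the sketch below composes two per-population location-scale maps with illustrative parameters and checks the stated composition and inverse formulas.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 3
a1, b1 = rng.normal(size=p), rng.uniform(0.5, 2.0, size=p)   # two elements of G: (a_i, b_i), b_i > 0
a2, b2 = rng.normal(size=p), rng.uniform(0.5, 2.0, size=p)
Y = rng.normal(size=(p, 4))                                   # grouped responses, one row per population

def g(Y, a, b):
    # Per-population location-scale map: row i of Y is sent to b_i * Y_i + a_i.
    return b[:, None] * Y + a[:, None]

# Closure: composing two elements gives the element with parameters (b2*a1 + a2, b2*b1).
print(np.allclose(g(g(Y, a1, b1), a2, b2), g(Y, b2 * a1 + a2, b2 * b1)))   # True

# Inverse: the element with parameters (-a1/b1, 1/b1) undoes (a1, b1).
print(np.allclose(g(g(Y, a1, b1), -a1 / b1, 1.0 / b1), Y))                 # True
```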
Ḡ being a Group
Proof.
We aim to prove that
(1.2)   Ḡ = { ḡ : ḡ(β, Σ_0) = (X_0^{-1}(B X_0 β + a), B Σ_0 B), a ∈ ℝ^p, B = diag(b_1, …, b_p), b_i > 0 }
satisfies the definition of a group.
(i) Closedness: For any two transformations ḡ_1, ḡ_2 ∈ Ḡ with
ḡ_1(β, Σ_0) = (X_0^{-1}(B X_0 β + a), B Σ_0 B) and ḡ_2(β, Σ_0) = (X_0^{-1}(B′ X_0 β + a′), B′ Σ_0 B′),
where a, a′ ∈ ℝ^p and B, B′ are diagonal with positive diagonal entries, we have that
(ḡ_2 ∘ ḡ_1)(β, Σ_0) = (X_0^{-1}(B′ B X_0 β + B′ a + a′), B′ B Σ_0 B′ B),
with B″ = B′ B and a″ = B′ a + a′. Since B″ is diagonal with positive diagonal entries and a″ ∈ ℝ^p, we find that ḡ_2 ∘ ḡ_1 ∈ Ḡ.
(ii) Combination Law: For any three transformations ḡ_1, ḡ_2, ḡ_3 ∈ Ḡ, we have that ḡ_3 ∘ (ḡ_2 ∘ ḡ_1) = (ḡ_3 ∘ ḡ_2) ∘ ḡ_1, since the composition of mappings is associative.
(iii) Unit Element: The transformation with a = 0 and B = I_p is the unit element.
(iv) Inverse Element: For any transformation ḡ ∈ Ḡ with parameters (a, B), its inverse transformation is the element of Ḡ with parameters (-B^{-1} a, B^{-1}).
∎
Transitivity of the Parameter Space: For any two parameter points (β, Σ_0) and (β′, Σ_0′), there exists a transformation ḡ ∈ Ḡ such that ḡ(β, Σ_0) = (β′, Σ_0′).
Proof.
Let ḡ ∈ Ḡ have parameters
B = (Σ_0′)^{1/2} Σ_0^{-1/2}
and
a = X_0 β′ - B X_0 β.
Then
ḡ(β, Σ_0) = (X_0^{-1}(B X_0 β + X_0 β′ - B X_0 β), B Σ_0 B) = (β′, Σ_0′).
∎
The Induced Groups on the Decision Spaces being Groups
Analogous to the proofs above, the induced transformation groups on the decision spaces for β and Σ_0,
(1.3)   { g* : g*(d) = X_0^{-1}(B X_0 d + a) }  and  { g* : g*(D) = B D B },
can be shown to be two groups.
Characterization of Equivariant Estimators for Σ_0
Proof.
It can be easily verified that D(Y) is equivariant under the group action if and only if each of its diagonal elements d_i(Y) is equivariant under the transformation group trio restricted to population i. For each i, consider the transformed data Y_{ij} - Ȳ_i, j = 1, …, n_i. Then the problem can be converted into a traditional scale-equivariant one, and the rest follows from Theorem 3.3 in Lehmann and Casella (1998). ∎
Characterization of Equivariant Estimators for β
Proof.
We aim to prove that the estimator δ is equivariant if and only if it satisfies
δ(Y) = X_0^{-1}(Ȳ + S w(U)).
To start with, we prove the necessity. From Lemma 2.1, one can show that the OLS estimator X_0^{-1} Ȳ is equivariant. Also, S is scale-equivariant and U is invariant under G. Thus,
δ(g(Y)) = X_0^{-1}(B Ȳ + a + B S w(U)) = X_0^{-1}(B X_0 δ(Y) + a) = g*(δ(Y)).
Therefore, δ is equivariant.
Then we prove the sufficiency. For any equivariant estimator δ, let δ_0(Y) = X_0 δ(Y) - Ȳ, and we have that
δ_0(g(Y)) = X_0 δ(g(Y)) - (B Ȳ + a) = B X_0 δ(Y) + a - B Ȳ - a = B δ_0(Y).
Therefore, δ_0 is equivariant under the scale transformation group trio.
Similar to the proof above, one can show that δ_0 is equivariant if and only if there exists a function w such that δ_0(Y) = S w(U), as S is equivariant under the scale transformation group trio and U is a maximal invariant under the scale transformation group.
Hence, we have δ(Y) = X_0^{-1}(Ȳ + S w(U)). ∎
Ȳ, S and U are pairwise independent.
Proof.
Based on Lemma 2.1, it is easy to show that Ȳ and S are independent.
Next, one can show that (Ȳ, S) is complete and sufficient for (β, Σ_0) and that U is ancillary; then, using Basu's Theorem, we have the independence between U and (Ȳ, S).
Alternatively, one can show that all three are based on linear transformations of Y whose coefficient matrices have zero cross-products. ∎