
A Uniformly Most Powerful Unbiased Test for Regression Parameters by Orthogonalizing Predictors

Razvan G. Romanescu
( Department of Community Health Sciences
Center for Healthcare Innovation
University of Manitoba, Canada
3rd floor Rm 351, 753 McDermot Ave, Winnipeg MB R3E 0T6
Razvan.Romanescu@umanitoba.ca
ORCID: 0000-0002-3175-5399
)
Abstract

In a multiple regression model where features (predictors) form an orthonormal basis, we prove that there exists a uniformly most powerful unbiased (UMPU) test for testing that the coefficient of a single feature is negative (or zero) versus positive. The test statistic used is the same as the coefficient t-test commonly reported by standard statistical software, and has the same null distribution. This result suggests that orthogonalizing features around the predictor of interest, prior to fitting the multiple regression, might be a way to increase power in single coefficient testing, compared to regressing on the raw, correlated features. This paper determines the conditions under which this is true, and argues that those conditions are fulfilled in a majority of applications.

1 Introduction

It is well known from practice that having correlated predictors in a multiple regression model leads to loss of power. In the extreme case, including perfectly correlated predictors leads to a model whose coefficients are not identifiable. Even if features are highly, but not perfectly, correlated, multicollinearity can make the model fit unstable. The amount of multicollinearity is often measured via variance inflation factors (VIFs). Predictors with high VIFs are deemed to contribute substantially to the multicollinearity of the model and might be excluded. However, it is sometimes desirable to keep related predictors, as each might contain unique information.

Orthogonalizing the design matrix is an alternative way to deal with the multicollinearity problem. This is not a new concept; it has been used in various ways across many fields (see, e.g., [3, 4]). Orthogonalizing the feature space via principal components or other methods (e.g., [8]) is a standard choice as a dimension reduction technique, especially when the number of features is large and interpretability is not particularly important. By contrast, uptake of this approach is limited, even non-existent, when the feature space is made up of a smaller number of meaningful predictors. This may be due to the open-ended modeling questions that it raises, namely, in which order to orthogonalize features, and how to interpret the newly created predictor set. In this paper we orthogonalize features sequentially, preserving the dimension of the design matrix, and select the new basis starting from one of the initial predictors as the first direction. We do this both for ease of interpretation, and because we want to increase power in testing the significance of one predictor that may be of particular importance in a multiple regression problem. Section 2 introduces the two main theoretical results of the paper. In Section 3, we show on a simulated example that power gains can be significant if the features are correlated, and we conclude in Section 4 with a discussion of where this approach might be most beneficial and how the new features might be interpreted.

1.1 Related work

Prior work on building UMPU tests is well established in inference theory and starts with testing in the single parameter exponential family. The existence of UMP tests in this case is based on the Neyman-Pearson Lemma, and tests can be built by writing the likelihood ratio as a monotone function of the sufficient statistic. While this approach does not generalize directly to multi-parameter families, UMPU tests can be constructed for one parameter by conditioning on the sufficient statistics for the other parameters. This is the problem we consider here, namely making inference about the coefficient of a single fixed effect in multiple regression, while treating the other coefficients as nuisance parameters. A UMP invariant (UMPI) test for the directional testing of a subset of coefficients being jointly zero, assuming knowledge of the coefficients' signs, was constructed by [6]. The invariance condition is a somewhat strong assumption, and this test does not attain the envelope of power, even though it is shown to perform reasonably well in simulations. The problem of efficient testing in the regression model has been solved for a general distribution by [2], using the notions of asymptotically uniformly most powerful (AUMP) tests and effective scores. However, these are fairly advanced concepts based on local asymptotic normality, and no simple solution has been derived for the multivariate regression case, which is an important case in applied statistics. The treatment considered here is more elementary, and relates to familiar test statistics from regression analysis.

2 Regression on an orthonormal set of predictors

We introduce the main result, which concerns the one-sided test of a coefficient in a multiple regression model where the features are orthonormal. The proof generalizes Example 6.9.11 from [1], which establishes the result in the more limited case of testing for the slope in a simple regression model in which the intercept and error variance are unknown.

THEOREM 1. Suppose we observe a data vector $\mathbf{Y}$ from the multiple regression model $\bm{Y}=\beta_{1}\mathbf{x}_{1}+\beta_{2}\mathbf{x}_{2}+\dots+\beta_{p}\mathbf{x}_{p}+\bm{\epsilon}$, where $\bm{\epsilon}\sim N(0,\sigma^{2}I)$ and $\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{p}$ are fixed covariates, for $p<n$. Assume further that $\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{p}$ form an orthonormal set in $\mathbb{R}^{n}$ (i.e., an orthonormal basis for a $p$-dimensional subspace), and that all parameters ($\beta_{1},\beta_{2},\dots,\beta_{p}$ and $\sigma^{2}$) are unknown. The test $\phi$ defined as

\phi=\begin{cases}0,&\text{if } V<t_{n-p,1-\alpha}\\ 1,&\text{if } V\geq t_{n-p,1-\alpha},\end{cases} \qquad (1)

where $V=\frac{\sqrt{n-p}\,\mathbf{x}_{p}^{\intercal}\mathbf{Y}}{\sqrt{\mathbf{Y}^{\intercal}\mathbf{Y}-\sum_{i=1}^{p}(\mathbf{x}_{i}^{\intercal}\mathbf{Y})^{2}}}$, which follows a $t_{n-p}$ distribution when $\beta_{p}=0$, is UMPU for testing $H_{0}:\beta_{p}\leq 0$ vs $H_{1}:\beta_{p}>0$.

PROOF: As is typical when looking for a UMP test in the presence of nuisance parameters, we first wish to identify sufficient statistics for this inference. With normal data, the joint density will belong to the exponential family and can be written thus

f(\mathbf{Y}|\bm{\beta},\sigma) =\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{\bigl(Y_{i}-\sum_{q=1}^{p}\beta_{q}x_{i,q}\bigr)^{2}}{2\sigma^{2}}\right\}
=\frac{1}{(\sqrt{2\pi})^{n}\sigma^{n}}\exp\left\{-\frac{\sum_{i=1}^{n}Y_{i}^{2}}{2\sigma^{2}}-\frac{\sum_{i=1}^{n}\bigl(\sum_{q=1}^{p}\beta_{q}x_{i,q}\bigr)^{2}}{2\sigma^{2}}+\frac{\sum_{i=1}^{n}Y_{i}\sum_{q=1}^{p}\beta_{q}x_{i,q}}{\sigma^{2}}\right\}
=h(\bm{\beta},\sigma)\exp\left\{-\frac{\mathbf{Y}^{\intercal}\mathbf{Y}}{2\sigma^{2}}+\frac{\beta_{1}}{\sigma^{2}}\mathbf{x}_{1}^{\intercal}\mathbf{Y}+\frac{\beta_{2}}{\sigma^{2}}\mathbf{x}_{2}^{\intercal}\mathbf{Y}+\dots+\frac{\beta_{p}}{\sigma^{2}}\mathbf{x}_{p}^{\intercal}\mathbf{Y}\right\} \qquad (2)

From this, the sufficient statistics are $(\mathbf{Y}^{\intercal}\mathbf{Y},\mathbf{x}_{1}^{\intercal}\mathbf{Y},\mathbf{x}_{2}^{\intercal}\mathbf{Y},\dots,\mathbf{x}_{p}^{\intercal}\mathbf{Y})$, corresponding to the natural parameter vector $(-\frac{1}{2\sigma^{2}},\frac{\beta_{1}}{\sigma^{2}},\dots,\frac{\beta_{p}}{\sigma^{2}})$. According to [1] (pp. 147-148) there exists a UMP unbiased test $\phi_{1}(u,\bm{t})=I\{u\geq c_{1}(\bm{t})\}$, where $c_{1}(\bm{t})$ is determined from $E_{\beta_{p}=0}[\phi_{1}(U,\bm{T})|\bm{T}=\bm{t}]=\alpha$, and $U,\bm{T}$ are the sufficient statistics for the parameter of interest and the nuisance parameters, respectively. The problem is that the joint conditional distribution of $(U,\bm{T})$ given $\bm{T}=\bm{t}$ is not yet straightforward to obtain, as $U=\mathbf{x}_{p}^{\intercal}\mathbf{Y}$ is not independent of all components of $\bm{T}$ (in particular, it is not independent of $\mathbf{Y}^{\intercal}\mathbf{Y}$). In what follows, the plan is to use Theorem 6.9.2 part A from [1], which gives relatively simpler conditions for a test to attain the UMPU property, and is especially suited to normal data.

Our objective now is to find a simpler characterization for the distribution of the sufficient statistics. Following and extending the reasoning in the aforementioned Example 6.9.11, let $\mathbf{a}_{1},\mathbf{a}_{2},\dots,\mathbf{a}_{n}$ be an orthonormal basis for $\mathbb{R}^{n}$ that includes the covariate vectors, i.e., $\mathbf{a}_{1}=\mathbf{x}_{1},\mathbf{a}_{2}=\mathbf{x}_{2},\dots,\mathbf{a}_{p}=\mathbf{x}_{p}$, and $\mathbf{a}_{p+1},\dots,\mathbf{a}_{n}$ are chosen such that $\mathbf{a}_{i}^{\intercal}\mathbf{a}_{i}=1$ for all $i$ and $\mathbf{a}_{i}^{\intercal}\mathbf{a}_{j}=0$ for $i\neq j$. Further define $W_{i}=\mathbf{a}_{i}^{\intercal}\bm{\epsilon}$ for all $i$. It is relatively straightforward to show that $W_{1},W_{2},\dots,W_{n}$ are iid $N(0,\sigma^{2})$. We also have that $\sum_{i}W_{i}^{2}=\sum_{i}\epsilon_{i}^{2}$: $W_{i}$ is the length of the projection of the error vector $\bm{\epsilon}$ onto basis vector $\mathbf{a}_{i}$, so the two sides express the squared length of $\bm{\epsilon}$ in the two coordinate bases.

In the regression model, the least squares estimates of $\beta_{i}$, $i=1,\dots,p$, are the projections of the data vector $\mathbf{Y}$ onto the covariate directions $\mathbf{x}_{i}=\mathbf{a}_{i}$. Call the corresponding estimators $B_{i}=\mathbf{a}_{i}^{\intercal}\mathbf{Y}=\mathbf{a}_{i}^{\intercal}(\beta_{1}\mathbf{a}_{1}+\dots+\beta_{p}\mathbf{a}_{p}+\bm{\epsilon})=\beta_{i}+W_{i}$. The residual sum of squares is $R=\sum_{i=p+1}^{n}W_{i}^{2}=\sum_{i=1}^{n}\epsilon_{i}^{2}-\sum_{i=1}^{p}W_{i}^{2}=\sum_{i=1}^{n}(Y_{i}-\beta_{1}a_{1,i}-\beta_{2}a_{2,i}-\dots-\beta_{p}a_{p,i})^{2}-\sum_{i=1}^{p}(B_{i}-\beta_{i})^{2}$. The first sum expands to $\mathbf{Y}^{\intercal}\mathbf{Y}-2\sum_{i=1}^{p}\beta_{i}\mathbf{a}_{i}^{\intercal}\mathbf{Y}+\sum_{i=1}^{p}\beta_{i}^{2}$. It then follows that $R=\mathbf{Y}^{\intercal}\mathbf{Y}-\sum_{i=1}^{p}B_{i}^{2}$, and, from the original definition $R=\sum_{i=p+1}^{n}W_{i}^{2}$, that $R\sim\sigma^{2}\chi^{2}_{n-p}$.

To recapitulate, we have found summary statistics $B_{i}\sim N(\beta_{i},\sigma^{2})$, $i=1,\dots,p$, and $R\sim\sigma^{2}\chi^{2}_{n-p}$, which are all mutually independent. Plugging these into Equation (2),

f(\mathbf{Y}|\bm{\beta},\sigma)=h(\bm{\beta},\sigma)\exp\left\{-\frac{R+\sum_{i=1}^{p}B_{i}^{2}}{2\sigma^{2}}+\frac{\beta_{1}}{\sigma^{2}}B_{1}+\frac{\beta_{2}}{\sigma^{2}}B_{2}+\dots+\frac{\beta_{p}}{\sigma^{2}}B_{p}\right\} \qquad (3)

we see that the statistics $(U,T_{1},\dots,T_{p}):=(B_{p},B_{1},\dots,B_{p-1},R+\sum_{i=1}^{p}B_{i}^{2})$ are sufficient for $(\frac{\beta_{p}}{\sigma^{2}},\frac{\beta_{1}}{\sigma^{2}},\dots,\frac{\beta_{p-1}}{\sigma^{2}},-\frac{1}{2\sigma^{2}})$. Next, define a new variable $V=g(U,T_{1},T_{2},\dots,T_{p})=\frac{U}{\sqrt{(T_{p}-T_{1}^{2}-T_{2}^{2}-\dots-T_{p-1}^{2}-U^{2})/(n-p)}}=\frac{B_{p}}{\sqrt{R/(n-p)}}$. We check that $V$ satisfies the conditions of Theorem 6.9.2, namely:

  1. $V$ is independent of $\bm{T}=(T_{1},\dots,T_{p})$ when $\beta_{p}/\sigma^{2}=0$. Since $B_{p}\sim N(0,\sigma^{2})$ when $\beta_{p}=0$, we get $V\sim t_{n-p}$. As the distribution of $V$ does not depend on any of the remaining parameters $(-\frac{1}{2\sigma^{2}},\frac{\beta_{1}}{\sigma^{2}},\dots,\frac{\beta_{p-1}}{\sigma^{2}})$, it follows from Corollary 5.1.1 to Basu's Theorem in [7] that $V$ is independent of $\bm{T}$.

  2. $g(u,\bm{t})$ is increasing in $u$ for each $\bm{t}$. It is easy to show that $\frac{\partial g}{\partial u}>0$ for any value of $\bm{t}$.

Therefore, we can conclude that a UMP unbiased test of $\beta_{p}/\sigma^{2}\leq 0$ vs $\beta_{p}/\sigma^{2}>0$, which is equivalent to testing $H_{0}$ vs $H_{1}$, is

\phi(v)=\begin{cases}0,&\text{if } v<c\\ \xi,&\text{if } v=c\\ 1,&\text{if } v>c,\end{cases}

where $c$ and $\xi$ are determined by $E_{\beta_{p}=0}[\phi(V)]=\alpha$. Ignoring the middle case ($V=c$), which has probability zero, we determine $c$ from $P_{\beta_{p}=0}(V>c)=\alpha$, which gives $c=t_{n-p,1-\alpha}$. $\qquad\square$

Looking at the test statistic $V$, we see that it is identical to the usual test of a coefficient being significantly different from zero in the standard multiple regression model. Specifically, the quantity $V=\frac{\hat{\beta}_{p}}{\sqrt{SSE/(n-p)}}=\frac{\hat{\beta}_{p}}{\text{s.e.}(\hat{\beta}_{p})}$ is the Student t statistic for coefficient $\beta_{p}$ that is commonly output by statistical software packages. The degrees of freedom are also the same: since we have counted the intercept (if included) as one of the $p$ predictors, the standard textbook formulation of the model would have $p-1$ "predictor variables", so the degrees of freedom associated with the sum of squares $SSE$ would be $n-p$, the same as in Theorem 1.
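As a quick sanity check (a minimal R sketch with simulated data, not code from the paper; the variable names are illustrative), one can verify that the statistic $V$ of Theorem 1 coincides with the t value reported by lm() when the design is orthonormal:

set.seed(1)
n <- 50; p <- 3
X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))    # a design with orthonormal columns
beta <- c(0.5, -0.3, 0.4)
Y <- drop(X %*% beta + rnorm(n, sd = 2))
fit <- lm(Y ~ X - 1)                         # no intercept: X already spans the model
V <- sqrt(n - p) * crossprod(X[, p], Y) /
  sqrt(sum(Y^2) - sum(crossprod(X, Y)^2))    # statistic from Theorem 1
c(lm_t = summary(fit)$coefficients[p, "t value"], V = as.numeric(V))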

Having shown that the test of significance for one coefficient attains the envelope of power for orthonormal design matrices, the natural question that arises next is whether this implies that the test for correlated predictors is inefficient, and whether something can be done to improve power. It is known that highly correlated predictors in a regression lead to underpowered inference as well as stability issues in fitting. Such a model can be improved upon by orthogonalizing the predictors, which simultaneously removes the multicollinearity and ensures optimal power in testing.

In keeping with our theme of maximizing the power of testing one covariate deemed more important than the others, we want to hold that feature fixed and orthogonalize the other features in a particular order, building a basis around the fixed feature. The order of orthogonalization may be random, if desired, or it may be informed by some ranking of interest in the features. The algorithm for obtaining a new basis starting from a set of $p$ features $m_{1},\dots,m_{p}$ is given in the text box Algorithm 1. As an example, the closed-form solution for a basis with up to three features is given in the Appendix. The following notation is useful for denoting the new basis, because it keeps track of the order of orthogonalization:

(\bm{m}_{1}^{\perp},\bm{m}_{2}^{\perp\,1},\bm{m}_{3}^{\perp(1,2)},\dots,\bm{m}_{p}^{\perp(1..p-1)})=(\bm{x}_{1},\bm{x}_{2},\dots,\bm{x}_{p}).
Algorithm 1 Orthogonalize a feature set
1:  Fix the first basis vector to $\bm{x}_{1}=\frac{\bm{m}_{1}}{||\bm{m}_{1}||}$, where $\bm{m}_{1}$ is the feature of interest
2:  for $k\leftarrow 2$ to $p$ do
3:     Regress the $k$-th feature $\bm{m}_{k}$ on the basis vectors obtained so far, i.e., fit $\bm{m}_{k}=\alpha_{k,1}\bm{x}_{1}+\dots+\alpha_{k,k-1}\bm{x}_{k-1}+\bm{r}_{k}$
4:     Set the next basis vector, $\bm{x}_{k}$, as the normalized component of $\bm{m}_{k}$ orthogonal to $\bm{x}_{1},\dots,\bm{x}_{k-1}$, i.e., $\bm{x}_{k}=\frac{\hat{\bm{r}}_{k}}{||\hat{\bm{r}}_{k}||}$, where $\hat{\bm{r}}_{k}$ is the fitted residual
5:  end for
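To make the procedure concrete, here is a minimal R sketch of Algorithm 1 (not code from the paper; the function name orthogonalize and its outputs are illustrative). It builds the basis sequentially from lm() residuals and also returns the upper triangular matrix $Q$ discussed next, recovered as $Q=X^{\intercal}M$ since $X^{\intercal}X=I$ and $M=XQ$:

orthogonalize <- function(M) {
  # M: n x p matrix of raw features, with the feature of interest in column 1
  p <- ncol(M)
  X <- matrix(0, nrow(M), p)
  X[, 1] <- M[, 1] / sqrt(sum(M[, 1]^2))                         # step 1
  if (p > 1) for (k in 2:p) {
    r <- resid(lm(M[, k] ~ X[, 1:(k - 1), drop = FALSE] - 1))    # step 3: residual of m_k
    X[, k] <- r / sqrt(sum(r^2))                                 # step 4: normalize it
  }
  list(X = X, Q = crossprod(X, M))                               # M = X Q, Q upper triangular
}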

Essentially, Algorithm 1 produces an upper triangular matrix $Q$ linking the original set of features to the new orthonormal set, such that

(\bm{m}_{1},\bm{m}_{2},\bm{m}_{3},\dots,\bm{m}_{p})=(\bm{m}_{1}^{\perp},\bm{m}_{2}^{\perp\,1},\bm{m}_{3}^{\perp(1,2)},\dots,\bm{m}_{p}^{\perp(1..p-1)})\,Q=XQ.

The benefit of constructing an orthonormal basis of predictors is visible from the following.

COROLLARY. The one-sided t-test for coefficient $\beta_{1}$ in the regression model $\bm{Y}=\beta_{1}\bm{m}_{1}^{\perp}+\beta_{2}\bm{m}_{2}^{\perp\,1}+\dots+\beta_{p}\bm{m}_{p}^{\perp(1..p-1)}+\bm{\epsilon}$, where $\bm{\epsilon}\sim N(0,\sigma^{2}I)$, is UMPU for testing $H_{0}:\beta_{1}\leq 0$ vs $H_{1}:\beta_{1}>0$.

Proof: This follows directly from Theorem 1 due to the construction of the orthonormal basis. \square

We should note here that orthogonalizing the feature space is not a new idea, having come up in many fields at various times. For instance, principal components are routinely used in genetics to identify and account for systematic biases due to population substructure (e.g., ancestry) in a high dimensional space of predictors (individual genetic variants). The specific incarnation in Algorithm 1 has been described before in [10, 9] (and references therein), and used specifically to preserve the stability of coefficient estimates when building regression models in chemistry. So the utility of orthogonal predictors is undeniable in applications. The next natural question to consider is whether testing the orthogonalized model is always more powerful than testing the naive model. While we have shown the coefficient test for an orthonormal set of predictors to be UMPU, the two parameterizations technically give rise to different models, as they include different predictor sets. Thus, which test is more powerful is not a foregone conclusion.

THEOREM 2. Consider the following two parameterizations of the same multiple regression model:

\bm{Y} =\alpha_{1}\mathbf{m}_{1}+\alpha_{2}\mathbf{m}_{2}+\dots+\alpha_{p}\mathbf{m}_{p}+\bm{\epsilon},\quad\text{and} \qquad (A)
\bm{Y} =\beta_{1}\mathbf{x}_{1}+\beta_{2}\mathbf{x}_{2}+\dots+\beta_{p}\mathbf{x}_{p}+\bm{\epsilon} \qquad (B)

where $\bm{\epsilon}\sim N(0,\sigma^{2}I)$ and the design matrix $X=MQ^{-1}$ is obtained from $M$ via Algorithm 1. Let $\bm{q}_{1}$ denote the first row of $Q^{-1}$, and assume that the sign of $\Delta=\beta_{1}-\bm{q}_{1}^{\intercal}\bm{\beta}/||\bm{q}_{1}||$ is known. We wish to compare the t-tests $\phi_{A},\phi_{B}$ for $H_{A0}:\alpha_{1}\leq 0$ vs $H_{A1}:\alpha_{1}>0$, and $H_{B0}:\beta_{1}\leq 0$ vs $H_{B1}:\beta_{1}>0$, defined via the statistics $V_{A}=\frac{\sqrt{n-p}\,\hat{\alpha}_{1}}{\sqrt{SSE}\,||\bm{q}_{1}||}$ and $V_{B}=\frac{\sqrt{n-p}\,\hat{\beta}_{1}}{\sqrt{SSE}}$, respectively, each of which follows a $t_{n-p}$ distribution at the boundary of its null hypothesis. Then the power of $\phi_{B}$ is higher than the power of $\phi_{A}$ iff $\Delta>0$.

Proof: To see the connection between the two parameterizations, write model $A$ as

\bm{Y}=(\bm{m}_{1},\bm{m}_{2},\bm{m}_{3},\dots,\bm{m}_{p})\bm{\alpha}+\bm{\epsilon}=XQ\bm{\alpha}+\bm{\epsilon}. \qquad (4)

By putting $\bm{\beta}=Q\bm{\alpha}$ we see this to be equivalent to model $B$, which is written in terms of parameters $\bm{\beta}$. The ordinary least squares estimate of $\bm{\alpha}$ is

\hat{\bm{\alpha}} =[(XQ)^{\intercal}XQ]^{-1}(XQ)^{\intercal}\bm{Y}
=[Q^{\intercal}(X^{\intercal}X)Q]^{-1}Q^{\intercal}X^{\intercal}\bm{Y}
=Q^{-1}(Q^{\intercal})^{-1}Q^{\intercal}X^{\intercal}\bm{Y}
=Q^{-1}X^{\intercal}\bm{Y}=Q^{-1}\hat{\bm{\beta}},

where we have used that $Q$ is full rank and that $X^{\intercal}X=I$. Furthermore, using the well-known formula for the variance-covariance matrix of the OLS estimate, we have

\text{Var}(\hat{\bm{\alpha}})=\sigma^{2}[(XQ)^{\intercal}XQ]^{-1}=\sigma^{2}(Q^{\intercal}Q)^{-1}=\sigma^{2}Q^{-1}(Q^{-1})^{\intercal}.

The standard error of the estimate vector is computed by replacing $\sigma^{2}$ with $s^{2}=\frac{\hat{\bm{\epsilon}}^{\intercal}\hat{\bm{\epsilon}}}{n-p}=\frac{SSE}{n-p}$, which is the same for both parameterizations. Write also $Q^{-1}=(\bm{q}_{1},\bm{q}_{2},\dots,\bm{q}_{p})^{\intercal}$. Then $\hat{\alpha}_{1}=\bm{q}_{1}^{\intercal}\hat{\bm{\beta}}$ and $(\text{s.e.}(\hat{\alpha}_{1}))^{2}=s^{2}\bm{q}_{1}^{\intercal}\bm{q}_{1}=\frac{SSE}{n-p}||\bm{q}_{1}||^{2}$, making the t-test statistic for $\phi_{A}$:

V_{A}=\frac{\sqrt{n-p}\,\bm{q}_{1}^{\intercal}\hat{\bm{\beta}}}{\sqrt{SSE}\,||\bm{q}_{1}||}.

The power function for $\phi_{A}$ is

\pi_{A}(\bm{\beta},\sigma) =P_{\beta,\sigma}(V_{A}\geq t_{n-p,1-\alpha})=P_{\beta,\sigma}\left(\frac{\bm{q}_{1}^{\intercal}\hat{\bm{\beta}}}{||\bm{q}_{1}||}\geq t_{n-p,1-\alpha}\sqrt{\frac{SSE}{n-p}}\right)
=P_{\beta,\sigma}\left(\sum_{i=1}^{p}\frac{q_{1i}}{||\bm{q}_{1}||}(\beta_{i}+W_{i})\geq t_{n-p,1-\alpha}\sqrt{\frac{SSE}{n-p}}\right)
=P_{\beta,\sigma}\left(\frac{\bm{q}_{1}^{\intercal}\bm{\beta}}{||\bm{q}_{1}||}\geq t_{n-p,1-\alpha}\sqrt{\frac{SSE}{n-p}}-Z\right),

where $Z=\sum_{i=1}^{p}\frac{q_{1i}}{||\bm{q}_{1}||}W_{i}$ and the quantities $W_{i}$ are as defined in the proof of Theorem 1. Because $W_{1},\dots,W_{p}$ are iid $N(0,\sigma^{2})$, we have $E(Z)=0$ and $\text{Var}(Z)=\sum_{i=1}^{p}\frac{q_{1i}^{2}}{||\bm{q}_{1}||^{2}}\text{Var}(W_{i})=\sigma^{2}$. Similarly, the power function for $\phi_{B}$ is

\pi_{B}(\bm{\beta},\sigma) =P_{\beta,\sigma}(V_{B}\geq t_{n-p,1-\alpha})=P_{\beta,\sigma}\left(\hat{\beta}_{1}\geq t_{n-p,1-\alpha}\sqrt{\frac{SSE}{n-p}}\right)
=P_{\beta,\sigma}\left(\beta_{1}+W_{1}\geq t_{n-p,1-\alpha}\sqrt{\frac{SSE}{n-p}}\right)=P_{\beta,\sigma}\left(\beta_{1}\geq t_{n-p,1-\alpha}\sqrt{\frac{SSE}{n-p}}-W_{1}\right).

Since $W_{1}\overset{d}{=}Z$, and both $W_{1}$ and $Z$ are independent of $SSE$, we conclude that

\pi_{B}(\bm{\beta},\sigma)>\pi_{A}(\bm{\beta},\sigma)\iff\beta_{1}>\frac{\bm{q}_{1}^{\intercal}\bm{\beta}}{||\bm{q}_{1}||}\iff\Delta>0.\qquad\square
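As a numerical illustration of the identities used in the proof (a sketch with simulated data, not part of the paper; it reuses the orthogonalize helper sketched after Algorithm 1), one can check that $\hat{\bm{\alpha}}=Q^{-1}\hat{\bm{\beta}}$ and that the reported standard error of $\hat{\alpha}_{1}$ equals $s\,||\bm{q}_{1}||$:

set.seed(2)
n <- 40
M <- cbind(rnorm(n), rnorm(n), rnorm(n))
M[, 2] <- M[, 1] + 0.5 * M[, 2]                  # induce correlation with column 1
Y <- drop(M %*% c(1, 0.5, -0.5) + rnorm(n))
ob   <- orthogonalize(M)
fitA <- lm(Y ~ M - 1)                            # model (A): raw features
fitB <- lm(Y ~ ob$X - 1)                         # model (B): orthonormal features
Qinv <- solve(ob$Q)
max(abs(coef(fitA) - Qinv %*% coef(fitB)))       # ~ 0: alpha-hat = Q^{-1} beta-hat
q1 <- Qinv[1, ]                                  # first row of Q^{-1}
c(se_alpha1 = summary(fitA)$coefficients[1, "Std. Error"],
  s_q1norm  = summary(fitB)$sigma * sqrt(sum(q1^2)))   # the two should agree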

A couple of explanatory remarks are in order.
REMARK 1. For practical purposes, tests $\phi_{A}$ and $\phi_{B}$ have the same meaning: they are testing for a significant association between $\bm{Y}$ and the first term in the regression (either $\bm{m}_{1}$ or $\bm{m}_{1}^{\perp}$, which are essentially the same feature). The interpretation of a positive result is the same, namely that the first term is significantly associated with the outcome after controlling for all the other meaningful effects, regardless of how those effects are parameterized. Therefore we should use the more powerful of $\phi_{A}$ and $\phi_{B}$ when making inference, as long as both specifications (A) and (B) are equally acceptable in the application.

REMARK 2. The interpretation of this result is that the t-test for the naive model has higher power only if the combined contributions of the orthogonalized directions to the naive coefficient of $\bm{m}_{1}$, suitably normalized, exceed the marginal effect of $\bm{m}_{1}^{\perp}$ alone. To see this, note that $\alpha_{1}=\bm{q}_{1}^{\intercal}\bm{\beta}=q_{11}\beta_{1}+q_{12}\beta_{2}+\dots+q_{1p}\beta_{p}$, so the entries of $\bm{q}_{1}$ (the first row of $Q^{-1}$) determine how much each orthogonal direction $\bm{m}_{j}^{\perp(1..j-1)}$ feeds into the coefficient $\alpha_{1}$ for $\bm{m}_{1}$; these entries are nonzero only to the extent that $\bm{m}_{1}$ is correlated with the other features. Multiplying them by the effect sizes in the corresponding orthogonal directions (i.e., $\bm{\beta}$) gives the contributions of the other predictors to $\alpha_{1}$, and the naive test wins exactly when $\bm{q}_{1}^{\intercal}\bm{\beta}/||\bm{q}_{1}||>\beta_{1}$.

Even though orthogonalizing predictors may not always be the most powerful strategy for testing coefficients, a closer look at the quantity $\Delta$ reveals that situations where the naive test is more powerful, i.e., where $\Delta<0$, are quite unlikely to occur in applications. This is because a combination of favourable factors is needed to make $\bm{q}_{1}^{\intercal}\bm{\beta}/||\bm{q}_{1}||>\beta_{1}$, including:

  1. Large effect sizes (larger than $\beta_{1}$) for all (or very large effects for some) of the orthogonal components of the other predictors. While this is certainly possible in applications, if all or most other predictors had a larger effect size than the predictor of interest, an analyst would have to wonder whether the effect of interest is genuine, or whether there are flaws in the design of the experiment. A notable exception is the intercept, which in many applications has the most significant effect. However, this effect can easily be eliminated by centering the outcome data.

  2. The other predictors have to be correlated (either positively or negatively) with $\bm{m}_{1}$, and each predictor's component times its effect size has to be in the direction of $\beta_{1}$, as pointed out in Remark 2. This has to happen for all of the other predictors, or for a select subset of them having very high effect sizes. While having correlated inputs seems to be more the rule than the exception in applications, there is no reason why the direction of the correlation has to agree with the effect size for the orthogonal component! Orthogonality generally suggests independence of effects; thus, for a regression with $p$ correlated predictors, we would intuitively expect only a $1/2^{p}$ chance that all contributions are in the same direction as $\beta_{1}$.

Overall, the requirement that both points 1 and 2 be satisfied in order for $\Delta<0$ makes this quite an unlikely case in practice. Therefore, we can conclude that, as a rule of thumb, model $B$ will be more powerful at testing a single predictor (the one around which the orthogonalization was done) than model $A$ in most situations.

3 Simulation study

We illustrate how to apply orthogonalization to improve power in testing using variables in the 'mtcars' dataset in R. In this popular example dataset [5] we regress gas mileage for different cars on characteristics of their engines. This application was chosen specifically due to the high correlation among predictors (0.90). Selecting two features, number of cylinders (cyl) and displacement (disp), we find the following relationship for consumption in miles per gallon by fitting the standard regression model: $\textit{mpg}=34.661-1.587\,\textit{cyl}-0.021\,\textit{disp}$, where the error standard deviation is $\sigma=3.055$. We now wish to see whether power improves when we orthogonalize the inputs. To check this, we simulate $N=10{,}000$ replicated datasets, each of which keeps the same $n=32$ car models but simulates random values for mpg based on the following model:

\textit{mpg} =(34.661-1.587\,\overline{\textit{cyl}}-0.021\,\overline{\textit{disp}})-1.587\,\kappa(\textit{cyl}-\overline{\textit{cyl}})-0.021\,\kappa(\textit{disp}-\overline{\textit{disp}})+\varepsilon,\quad\varepsilon\sim N(0,3.055^{2}).

Notice that this matches the original data fit when $\kappa=1$; we will vary $\kappa$ to achieve smaller or larger effect sizes while keeping the relative effects of the predictors constant. We orthogonalize the inputs with number of cylinders first, and compare the power curve for testing the coefficient of cyl against the power of the corresponding test from the original regression (the "naive model"). Later, we do the orthogonalization with disp first and test that parameter's coefficient. When fitting both the naive and the orthogonal models, we center all variables first, including the response mpg, and fit without an intercept term. This is to avoid a highly significant linear combination of basis vectors that matches the intercept vector (i.e., the vector of ones) in the orthogonal model. We assume the signs of the coefficients in both the naive and orthogonalized models to be known, which makes it sensible to conduct one-sided tests. This assumption is realistic in practice, where analysts often know the directions of effects, at least for the most important features. The sign of $\Delta$ can be computed, if desired, to check whether a power boost can be expected from the orthogonal approach. To do this, multiply the regression equation by $-1$, which makes the outcome negative miles per gallon and the coefficients $\beta_{cyl}=1.587\kappa$ and $\beta_{disp}=0.021\kappa$. The vector $\bm{q}_{1}$ (the first row of $Q^{-1}$) can be computed by following the formulas in the Appendix; holding $cyl$ as the fixed direction, $\bm{q}_{1}=(0.10,-0.21)$. Using the coefficient values from the data generating model, we get $\Delta=0.92\kappa>0$, which means that the test using the orthonormal regression will be more powerful.
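For readers who want to reproduce a version of this experiment, the following R sketch (illustrative, not the paper's code; it reuses the orthogonalize helper sketched after Algorithm 1, uses fewer replicates than the paper's $N=10{,}000$, and the function name power_sim is arbitrary) computes $\bm{q}_{1}$ for the centered features and compares the empirical power of the two one-sided tests for the cyl coefficient:

data(mtcars)
M <- scale(as.matrix(mtcars[, c("cyl", "disp")]), scale = FALSE)  # centered features
y0 <- -mtcars$mpg                          # flip sign so the coefficients are positive
alpha <- coef(lm(y0 ~ M))[-1]              # naive slopes, approx. (1.587, 0.021)
sigma <- summary(lm(y0 ~ M))$sigma         # approx. 3.055
ob <- orthogonalize(M)
round(solve(ob$Q)[1, ], 2)                 # q_1, approx. (0.10, -0.21)

power_sim <- function(kappa, N = 2000, level = 0.05) {
  crit <- qt(1 - level, df = nrow(M) - ncol(M))
  hitsA <- hitsB <- 0
  for (r in seq_len(N)) {
    y <- drop(kappa * M %*% alpha) + rnorm(nrow(M), sd = sigma)
    y <- y - mean(y)                                          # center the response
    hitsA <- hitsA + (summary(lm(y ~ M - 1))$coefficients[1, "t value"] >= crit)
    hitsB <- hitsB + (summary(lm(y ~ ob$X - 1))$coefficients[1, "t value"] >= crit)
  }
  c(naive = hitsA / N, orthogonal = hitsB / N)
}
power_sim(0.5)   # empirical power of the one-sided cyl tests at a moderate effect size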

Figure 1 shows the power curves for both models, and for the two ways of orthogonalizing the features. Testing for the coefficient of the first feature $\bm{m}_{1}^{\perp}$ is visibly more powerful than testing for $\bm{m}_{1}$ in the naive model. The situation is almost identical when switching the order of cyl and disp, showing that the approach works regardless of which variable we choose as important.

Figure 1: Power profiles for the first coefficient t-test under the naive and orthogonalized regression models, with the first parameter being cyl (left) and disp (right). All tests are one-sided, with sign assumed known.

4 Discussion

There are some strong arguments for orthogonalizing highly correlated variables in regression. This paper has focused on the power advantage, and on the knowledge that the coefficient tests are UMPU in this case. There are other, more obvious advantages to this approach as well. One is that of eliminating collinearity without having to throw away predictors, which would discard potentially valuable information. Another advantage is that this approach does away with the need to include interaction terms. Indeed, with $p$ predictors there are potentially $p(p-1)/2$ interaction terms that can (or should) be considered, and many more three-way interactions. With an orthogonal set of predictors this is no longer the case, as features are uncorrelated. This helps preserve power in inferences about the parameters.

The disadvantage of this approach, which has become apparent from discussions with statistical practitioners (consultants), is that of interpretability. Interpreting the remainder of one predictor after regressing on a few others may not be easy, and in many cases, this will not be desirable. For instance, health studies typically include demographic variables in analysis, such as age and sex, among others; these should probably be kept as such, for the sake of interpretability. We do not advocate that orthogonalization should be used routinely, and for all features. However, here are a few cases where it might be sensible to include this approach, at least for a subset of the feature space:

  1. With highly correlated variables that measure roughly the same thing. In socioeconomic studies, for example, gross and net income are both measures of income and likely to be highly correlated. If it is meaningful to keep both, then the interpretation of the coefficient for each might be difficult, because if two predictors are close to collinear, the fit to the data is likely to result in a positive coefficient for one and a negative coefficient for the other. Of course, this does not mean that gross and net income have opposite effects; rather, this is likely to be an artifact of multicollinearity. In this case, it makes more sense to decide on one predictor (e.g., gross income) to be interpreted as the main effect for income, and include the orthogonal component of net income as an extra effect due to, say, differential taxation of individuals that earn the same salary before tax.

  2. When there is a natural ordering in the importance of predictors. For example, to an engineer looking at the fuel efficiency of different engines, it might be apparent that engine volume (displacement) is mainly what drives consumption, while the number of cylinders might simply be a design choice that correlates with larger, more powerful engines. In this case, there would be a causal relationship between $disp$ and $mpg$, with $cyl$ being an intermediary variable of secondary importance that mediates the effect of $disp$. It would then make sense to orthogonalize with $disp$ as the main predictor, with the remainder term $cyl^{\perp\,disp}$ being the extra contribution of $cyl$ that is not related to engine size. This example spells out how an orthogonal remainder can be interpreted.

  3. In model building, since adding extra terms to the model does not change the estimated effects of previous predictors. As a follow-up to the previous point, when building a linear model through, e.g., forward regression, one undesirable feature is that coefficients change every time a new term is added. This problem disappears when using orthogonalized predictors, as adding an extra dimension to the orthonormal basis does not change the projection of the data vector on the previous basis vectors; see the sketch after this list. As such, parameters of full and reduced models will not lead to potentially contradictory interpretations.
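A minimal R illustration of point 3 (simulated data; it reuses the orthogonalize helper sketched after Algorithm 1): adding a third predictor changes the first two coefficients in the raw parameterization, but leaves them untouched in the orthogonalized one.

set.seed(3)
n <- 60
M <- matrix(rnorm(n * 3), n, 3)
M[, 2] <- M[, 1] + 0.3 * M[, 2]                  # correlated raw features
y <- drop(M %*% c(1, 1, 0.5) + rnorm(n))
X <- orthogonalize(M)$X
coef(lm(y ~ M[, 1:2] - 1)); coef(lm(y ~ M - 1))  # raw: earlier coefficients shift
coef(lm(y ~ X[, 1:2] - 1)); coef(lm(y ~ X - 1))  # orthonormal: they stay the same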

Given all the advantages of orthogonalization, it is surprising that this approach has not gained more popularity in statistical practice.

Acknowledgements

RGR is based at the George & Fay Yee Centre for Healthcare Innovation. Support for CHI is provided by University of Manitoba, Canadian Institutes for Health Research, Province of Manitoba, and Shared Health Manitoba.

References

  • [1] P.K. Bhattacharya and Prabir Burman. Hypothesis Testing. In Theory and Methods of Statistics, pages 125–177. Elsevier, 2016.
  • [2] Sungsub Choi, W. J. Hall, and Anton Schick. Asymptotically uniformly most powerful tests in parametric and semiparametric models. The Annals of Statistics, 24(2), April 1996.
  • [3] Merlise Clyde, Heather Desimone, and Giovanni Parmigiani. Prediction Via Orthogonalized Model Mixing. 2024.
  • [4] Michele Forina, Silvia Lanteri, Monica Casale, and M. Conception Cerrato Oliveros. Stepwise orthogonalization of predictors in classification and regression techniques: An “old” technique revisited. Chemometrics and Intelligent Laboratory Systems, 87(2):252–261, June 2007.
  • [5] Harold V. Henderson and Paul F. Velleman. Building Multiple Regression Models Interactively. Biometrics, 37(2):391, June 1981.
  • [6] Maxwell L. King and Murray D. Smith. Joint one-sided tests of linear regression coefficients. Journal of Econometrics, 32(3):367–383, August 1986.
  • [7] Erich L. Lehmann and Joseph P. Romano. Testing statistical hypotheses. Springer texts in statistics. Springer, New York, 3. ed., correct. at 4. print edition, 2008.
  • [8] Xiaochang Lin, Jiewen Guan, Bilian Chen, and Yifeng Zeng. Unsupervised Feature Selection via Orthogonal Basis Clustering and Local Structure Preserving. IEEE Transactions on Neural Networks and Learning Systems, 33(11):6881–6892, November 2022.
  • [9] Milan Randić. Mathematical chemistry illustrations: a personal view of less known results. Journal of Mathematical Chemistry, 57(1):280–314, January 2019.
  • [10] Milan Šoškić, Dejan Plavšić, and Nenad Trinajstić. Link between Orthogonal and Standard Multiple Linear Regression Models. Journal of Chemical Information and Computer Sciences, 36(4):829–832, January 1996. Publisher: American Chemical Society.

Appendix

Going through the orthogonalization steps, we obtain the following new basis in closed form for up to three predictors

\begin{cases}\bm{x}_{1}=\frac{\bm{m}_{1}}{||\bm{m}_{1}||}\\ \bm{x}_{2}=\frac{1}{k_{2}}\{\bm{m}_{2}-\frac{(\bm{m}_{2}^{\intercal}\bm{m}_{1})}{||\bm{m}_{1}||^{2}}\bm{m}_{1}\}\\ \bm{x}_{3}=\frac{1}{k_{3}}\{\bm{m}_{3}+\frac{(\bm{m}_{2}^{\intercal}\bm{m}_{1})k_{1}/k_{2}^{2}-(\bm{m}_{3}^{\intercal}\bm{m}_{1})}{||\bm{m}_{1}||^{2}}\bm{m}_{1}-\frac{k_{1}}{k_{2}^{2}}\bm{m}_{2}\},\end{cases} \qquad (5)

where

\begin{cases}k_{1}=\bm{m}_{2}^{\intercal}\bm{m}_{3}-(\bm{m}_{2}^{\intercal}\bm{m}_{1})(\bm{m}_{3}^{\intercal}\bm{m}_{1})/||\bm{m}_{1}||^{2}\\ k_{2}=||\bm{m}_{2}-\frac{(\bm{m}_{2}^{\intercal}\bm{m}_{1})}{||\bm{m}_{1}||^{2}}\bm{m}_{1}||\\ k_{3}=||\bm{m}_{3}+\frac{(\bm{m}_{2}^{\intercal}\bm{m}_{1})k_{1}/k_{2}^{2}-(\bm{m}_{3}^{\intercal}\bm{m}_{1})}{||\bm{m}_{1}||^{2}}\bm{m}_{1}-\frac{k_{1}}{k_{2}^{2}}\bm{m}_{2}||.\end{cases} \qquad (6)

The corresponding matrix $Q^{-1}$, so that $(\bm{x}_{1},\bm{x}_{2},\bm{x}_{3})=(\bm{m}_{1},\bm{m}_{2},\bm{m}_{3})Q^{-1}$, is

Q^{-1}=\begin{bmatrix}||\bm{m}_{1}||^{-1}&-\frac{(\bm{m}_{2}^{\intercal}\bm{m}_{1})}{||\bm{m}_{1}||^{2}k_{2}}&\frac{(\bm{m}_{2}^{\intercal}\bm{m}_{1})k_{1}/k_{2}^{2}-(\bm{m}_{3}^{\intercal}\bm{m}_{1})}{||\bm{m}_{1}||^{2}k_{3}}\\ 0&k_{2}^{-1}&-\frac{k_{1}}{k_{2}^{2}k_{3}}\\ 0&0&k_{3}^{-1}\end{bmatrix}.
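As a sanity check on the closed-form expressions above (a sketch with simulated vectors, not part of the paper; it reuses the orthogonalize helper sketched after Algorithm 1), the matrix $Q^{-1}$ in the last display can be compared against the numerical output of Algorithm 1 for $p=3$:

set.seed(4)
m1 <- rnorm(20); m2 <- rnorm(20) + m1; m3 <- rnorm(20) + 0.5 * m2
k1 <- sum(m2 * m3) - sum(m2 * m1) * sum(m3 * m1) / sum(m1^2)
k2 <- sqrt(sum((m2 - sum(m2 * m1) / sum(m1^2) * m1)^2))
x3u <- m3 + (sum(m2 * m1) * k1 / k2^2 - sum(m3 * m1)) / sum(m1^2) * m1 - k1 / k2^2 * m2
k3 <- sqrt(sum(x3u^2))
Qinv_closed <- rbind(
  c(1 / sqrt(sum(m1^2)), -sum(m2 * m1) / (sum(m1^2) * k2),
    (sum(m2 * m1) * k1 / k2^2 - sum(m3 * m1)) / (sum(m1^2) * k3)),
  c(0, 1 / k2, -k1 / (k2^2 * k3)),
  c(0, 0, 1 / k3))
Qinv_alg <- solve(orthogonalize(cbind(m1, m2, m3))$Q)
max(abs(Qinv_closed - Qinv_alg))                 # should be numerically ~ 0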