
Estimation of ill-conditioned models using penalized sums of squares of the residuals

Román Salmerón Gómez Professor, Department of Quantitative methods for economics and business, University of Granada, Spain (e-mail: romansg@ugr.es).    Catalina B. García García Professor, Department of Quantitative methods for economics and business, University of Granada, Spain (e-mail: cbgarcia@ugr.es).
(September 3, 2025)
Abstract

This paper analyzes the estimation of econometric models by penalizing the sum of squared residuals with a factor that pulls the model estimates toward those that would be obtained from the simple regressions between the dependent variable and each of the independent variables. It is shown that the ridge estimator is a particular case of the penalized estimator obtained here, which, after an analysis of its main characteristics, exhibits better properties than the ridge estimator, especially with regard to the individual bootstrap inference on the model coefficients and the numerical stability of the estimates. This improvement stems from the fact that, instead of shrinking the estimator towards zero, the proposed estimator shrinks towards the estimates of the coefficients of the simple regressions mentioned above.

Keywords: ridge estimator, penalized estimation, mean squared error, variance inflation factor, condition number

1 Introduction

Given the following multiple linear regression model for $n$ observations and $p$ independent variables (constant term included):

\mathbf{y}_{n\times 1}=\mathbf{X}_{n\times p}\cdot\boldsymbol{\beta}_{p\times 1}+\mathbf{u}_{n\times 1}, (1)

where $\mathbf{u}$ is the random disturbance (assumed to be spherical), the most common estimation method is Ordinary Least Squares (OLS), which minimizes with respect to $\boldsymbol{\beta}$ the sum of squared residuals:

SCR(\boldsymbol{\beta})=(\mathbf{y}-\mathbf{X}\cdot\boldsymbol{\beta})^{t}\cdot(\mathbf{y}-\mathbf{X}\cdot\boldsymbol{\beta}).

However, there are several situations where the objective function to be minimized has been modified, as in the case of the method of Restricted Least Squares (RLS). In this case, the modification seeks to incorporate certain information into the model in order to obtain a more efficient estimation.

It is also common to modify it in order to achieve greater stability in the estimates of model (1). In this sense, according to O’Driscoll and Ramirez [34], it was Tikhonov [50] who first proposed the idea of regularization to estimate ill-conditioned problems by introducing additional information.

Other examples with alternative objective functions are the ridge estimator (Hoerl and Kennard [18, 19]) or LASSO regression (Tibshirani [48, 49]), widely used to mitigate the approximate multicollinearity existing in the linear model.

In the former, $SCR(\boldsymbol{\beta})$ is minimized subject to the constraint $||\boldsymbol{\beta}||_{2}^{2}=\sum\limits_{i=1}^{p}\beta_{i}^{2}=\boldsymbol{\beta}^{t}\boldsymbol{\beta}\leq r$, where $r$ is fixed, which implies minimizing with respect to $\boldsymbol{\beta}$ the Lagrangian:

L=SCR(\boldsymbol{\beta})+k\cdot\boldsymbol{\beta}^{t}\boldsymbol{\beta},\quad k\geq 0.

The solution obtained corresponds to the following expression:

\widehat{\boldsymbol{\beta}}(k)=\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)^{-1}\cdot\mathbf{X}^{t}\mathbf{y}, (2)

where $\mathbf{I}$ denotes the identity matrix of appropriate dimensions. In addition, its variance-covariance matrix is:

var\left(\widehat{\boldsymbol{\beta}}(k)\right)=\sigma^{2}\cdot\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)^{-1}\cdot\mathbf{X}^{t}\mathbf{X}\cdot\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)^{-1}. (3)

In the latter, $SCR(\boldsymbol{\beta})$ is minimized subject to the constraint $||\boldsymbol{\beta}||_{1}=\sum\limits_{i=1}^{p}|\beta_{i}|\leq s$, where $s$ is fixed, which implies minimizing with respect to $\boldsymbol{\beta}$ the Lagrangian:

L=SCR(\boldsymbol{\beta})+k\cdot\sum\limits_{i=1}^{p}|\beta_{i}|,\quad k\geq 0.

The difference between the two lies in the fact that, due to the behavior of the $l_{1}$ norm, LASSO both shrinks and selects variables, while ridge only shrinks. Both methods can also be viewed as particular situations of the general case in which the penalty added to the objective function is given by the $l_{q}$ norm, i.e., it is of the form $||\boldsymbol{\beta}||_{q}^{q}=\sum\limits_{i=1}^{p}|\beta_{i}|^{q}$. Other shrinkage estimators can be found in Liu [22, 23], Gao and Liu [12] or Liu and Gao [24].

More recently, the methodology proposed by Zou and Hastie [55], known as elastic-net, has emerged as a particularly noteworthy approach. This technique allows the contraction of estimators and the automatic selection of variables by combining the penalty factors (in the sum of squares of the residuals) of the ridge and LASSO regressions. It is also particularly useful when the number of independent variables is much higher than the number of observations and to deal with the problem of multicollinearity (see, for example, Bingzhen et al. [4] or Zou and Zhang [56]).

Since its publication, this work has received 11820 citations according to Scopus and 20777 according to Google Scholar (2260 in 2021, 2345 in 2022 and 2100 in 2023; figures as of May 8, 2024). These figures show that the development of techniques to deal with multicollinearity and variable selection is currently an active econometric research topic, probably because they are essential tools in high-dimensional data analysis (see Ding et al. [5]).

Within this line of research, Wang et al. [52] propose a novel method called average least squares-centered penalized regression (ALPR). This penalized regression method shrinks the parameters toward the weighted average OLS estimator. Their paper provides a theoretical development to obtain the mean square error of the proposed estimator, as well as a simulation study comparing this estimator with the estimators provided by OLS and ridge regression. Some issues to be considered with respect to this paper are the following:

  • If the OLS estimates have coefficients with different signs, it is possible that the weighted average is close to zero, so there is not much difference between the penalized estimator they propose and the ridge regression.

  • Shrinking the penalized estimators toward the weighted average OLS estimator is just as arbitrary as shrinking them toward zero (ridge regression).

  • ALPR shrinks all the parameters toward the same single value. Again, this is a completely arbitrary choice without any theoretical justification.

  • The intercept is not penalized by ALPR, so the approximate non-essential multicollinearity (see the note in Example 1 about essential and non-essential multicollinearity) is ignored.

Recently, Lukman et al. [25] have integrated ALPR with the principal component dimension reduction method to improve its performance in terms of root mean square error.

Alternatively, the present work proposes to minimize with respect to $\boldsymbol{\beta}$ the Lagrangian:

L=SCR(\boldsymbol{\beta})+k\cdot(\boldsymbol{\beta}-\boldsymbol{\alpha})^{t}(\boldsymbol{\beta}-\boldsymbol{\alpha}),\quad k\geq 0, (4)

where $\boldsymbol{\alpha}$ is a vector that collects the slope estimates of all simple linear regressions of the form $\mathbf{y}=\alpha_{1}\cdot\mathbf{1}+\alpha_{i}\cdot\mathbf{X}_{i}+\mathbf{v}$ with $i=2,\dots,p$, where $\mathbf{1}$ denotes the constant term.

The choice of the parameter around which to shrink the penalized estimators, $\boldsymbol{\alpha}$, is intended to overcome the drawbacks of ALPR highlighted above (see section 2). In addition, the obtained estimator is analyzed in depth (sections 3 to 7), providing a consolidated methodology.

Specifically, the work is structured as follows: Section 2 justifies the usefulness of the proposal to mitigate the approximate multicollinearity existing in model (1) and to achieve stability in the estimation of the coefficients of the independent variables, while section 3 obtains the expression of the estimator that minimizes the Lagrangian given in (4) and its main characteristics: trace, norm, variance-covariance matrix, goodness-of-fit and mean square error. This section also presents a Monte Carlo simulation that analyzes the behavior of the mean square error of the proposed penalized estimator. The implications of this methodology for multicollinearity mitigation are discussed in section 4. These results are used in section 5 to propose procedures for determining the ideal value of the penalty parameter, $k$, together with an interpretation of this parameter. From these values, section 6 shows how to perform inference using bootstrap methods. Section 7 analyzes the stability of the estimates obtained from the proposed penalized estimator and section 8 illustrates the results by means of an example with real data. The paper ends with section 9, where the main conclusions are highlighted. Finally, the R [35] code used to generate the results presented in this study is available on GitHub at https://github.com/rnoremlas/Penalized-estimator.

2 Justification of the proposed restriction

Taking into account the following notions of partial and total effect given in Novales [31]:

  • What is the impact on 𝐲\mathbf{y} of a unit variation in 𝐗i\mathbf{X}_{i} if the other explanatory variables in the model did not vary? Answer: partial effect (multiple linear regression).

  • What is the total impact on 𝐲\mathbf{y} of a unit variation in 𝐗i\mathbf{X}_{i} if the rest of the explanatory variables vary as would be expected given the correlations observed between them throughout the sample? Answer: total effect (simple linear regression).

Both effects should coincide (see Example 1) if the ceteris paribus condition, i.e., that when one variable varies the rest remain constant, were really satisfied.

Example 1

Given the model $\mathbf{y}=\beta_{1}\cdot\mathbf{1}+\beta_{2}\cdot\mathbf{X}_{2}+\beta_{3}\cdot\mathbf{X}_{3}+\mathbf{u}$, 100 observations are simulated assuming that $\mathbf{X}_{2}$ and $\mathbf{X}_{3}$ follow a normal distribution with mean 1 and variance 100, $\mathbf{u}$ follows a normal distribution with mean 0 and variance 2, and the dependent variable is generated as $\mathbf{y}=5\cdot\mathbf{1}+2\cdot\mathbf{X}_{2}-4\cdot\mathbf{X}_{3}+\mathbf{u}$.

It is verified that the simple correlation coefficient between $\mathbf{X}_{2}$ and $\mathbf{X}_{3}$ is equal to -0.01684487. This leads to a variance inflation factor equal to 1.000284, which is very close to its minimum value of 1, indicating that the degree of approximate multicollinearity of the essential type is not troubling. (Marquardt [27], Marquardt and Snee [28] and Snee and Marquardt [47] distinguish between approximate multicollinearity of the essential type, a relationship between the independent variables of the model excluding the constant term, and of the non-essential type, a relationship between the constant term and at least one of the remaining independent variables. Salmerón et al. [41] show that the variance inflation factor only detects multicollinearity of the essential type and that the coefficient of variation must be used to detect the non-essential type; according to these authors, a coefficient of variation below 0.1002506 indicates that the existing approximate multicollinearity of the non-essential type is of concern.) The high values of the coefficients of variation, 5.898071 for $\mathbf{X}_{2}$ and 3.022512 for $\mathbf{X}_{3}$, indicate that the non-essential type is not troubling either.

OLS estimation provides the estimates $\widehat{\beta}_{1}=4.658938$, $\widehat{\beta}_{2}=1.998514$ and $\widehat{\beta}_{3}=-3.99332$. Note that the estimates practically coincide with the true coefficients.

On the other hand, given the simple linear regressions $\mathbf{y}=\alpha_{1}\cdot\mathbf{1}+\alpha_{i}\cdot\mathbf{X}_{i}+\mathbf{v}$, with $i=2,3$, the following OLS estimates are obtained: $\widehat{\alpha}_{2}=2.061951$ and $\widehat{\alpha}_{3}=-4.029017$. Once again, these are very similar to the true values of $\beta_{2}$ and $\beta_{3}$. $\Box$

Developing the previous example from a theoretical point of view:

  • Estimates of the simple linear regressions $\mathbf{y}=\alpha_{1}\cdot\mathbf{1}+\alpha_{i}\cdot\mathbf{X}_{i}+\mathbf{v}$ with $i=2,\dots,p$ are given by the expression:

    \widehat{\alpha}_{i}=\frac{cov(\mathbf{y},\mathbf{X}_{i})}{var(\mathbf{X}_{i})},\quad i=2,\dots,p.
  • Assuming orthogonality in model (1), the OLS estimator is given by:

    \widehat{\boldsymbol{\beta}}=\left(\mathbf{X}^{t}\mathbf{X}\right)^{-1}\mathbf{X}^{t}\mathbf{y}=\left(\begin{array}{cccc}n&0&\cdots&0\\ 0&\sum\limits_{j=1}^{n}X_{2j}^{2}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\sum\limits_{j=1}^{n}X_{pj}^{2}\\ \end{array}\right)^{-1}\left(\begin{array}{c}\sum\limits_{j=1}^{n}y_{j}\\ \sum\limits_{j=1}^{n}X_{2j}y_{j}\\ \vdots\\ \sum\limits_{j=1}^{n}X_{pj}y_{j}\\ \end{array}\right)=\left(\begin{array}{c}\overline{\mathbf{y}}\\ \frac{cov(\mathbf{y},\mathbf{X}_{2})}{var(\mathbf{X}_{2})}\\ \vdots\\ \frac{cov(\mathbf{y},\mathbf{X}_{p})}{var(\mathbf{X}_{p})}\\ \end{array}\right)

    since, for $i=2,\dots,p$, it is verified that $cov(\mathbf{y},\mathbf{X}_{i})=\frac{1}{n}\sum\limits_{j=1}^{n}X_{ij}y_{j}$ and $var(\mathbf{X}_{i})=\frac{1}{n}\sum\limits_{j=1}^{n}X_{ij}^{2}$, given that $\sum\limits_{j=1}^{n}X_{2j}=\dots=\sum\limits_{j=1}^{n}X_{pj}=0$ by the orthogonality condition between $\mathbf{X}_{i}$ and the intercept.

For this reason, in expression (4) it will be considered that:

\boldsymbol{\alpha}_{p\times 1}=\left(\overline{\mathbf{y}},\ \frac{cov(\mathbf{y},\mathbf{X}_{2})}{var(\mathbf{X}_{2})},\dots,\ \frac{cov(\mathbf{y},\mathbf{X}_{p})}{var(\mathbf{X}_{p})}\right)^{t}, (6)

that is, $\boldsymbol{\alpha}$ is a known fixed vector.

In short, in the restriction imposed on the penalized sum of squared residuals represented in the Lagrangian of expression (4), $\boldsymbol{\alpha}$ will take the value given in (6). In this way, the purpose is to make the estimates of the coefficients of the multiple linear regression model as close as possible to the estimates obtained in the corresponding simple linear regressions.
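As an illustration, the vector $\boldsymbol{\alpha}$ in (6) can be computed directly from the data. The following R sketch (with simulated data mimicking Example 1; all object names are illustrative) builds it from the sample means, covariances and variances and checks that its slope components coincide with those of the simple regressions:

    set.seed(1)
    n  <- 100
    X2 <- rnorm(n, mean = 1, sd = 10)        # variance 100, as in Example 1
    X3 <- rnorm(n, mean = 1, sd = 10)
    u  <- rnorm(n, mean = 0, sd = sqrt(2))
    y  <- 5 + 2 * X2 - 4 * X3 + u
    X  <- cbind(1, X2, X3)                   # design matrix with intercept
    p  <- ncol(X)

    # alpha as in expression (6): mean of y, then cov(y, X_i)/var(X_i)
    alpha <- c(mean(y), sapply(2:p, function(i) cov(y, X[, i]) / var(X[, i])))

    # check: each component coincides with the OLS slope of the simple regression
    slopes <- sapply(2:p, function(i) unname(coef(lm(y ~ X[, i]))[2]))
    all.equal(alpha[-1], slopes)             # TRUE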

Example 2 (continuation of Example 1)

If, on the other hand, $\mathbf{X}_{3}$ is generated as $\mathbf{X}_{3}=\mathbf{1}+5\cdot\mathbf{X}_{2}+\widetilde{\mathbf{1}}$, where $\widetilde{\mathbf{1}}_{i}=(-1)^{i}$, it is verified that the simple correlation coefficient between $\mathbf{X}_{2}$ and $\mathbf{X}_{3}$ is equal to 0.9997975, which leads to a variance inflation factor equal to 2468.787. In this case, the coefficient of variation of $\mathbf{X}_{3}$ is equal to 5.279348. Thus, the approximate multicollinearity of the essential type is troubling while that of the non-essential type is not.

The estimates obtained in this case are as follows:

\widehat{\beta}_{1}=4.3740148,\quad\widehat{\beta}_{2}=0.4443309,\quad\widehat{\beta}_{3}=-3.6897307,\quad\widehat{\alpha}_{2}=-18.03679,\quad\widehat{\alpha}_{3}=-3.601057,

with the estimates of $\beta_{1}$ and $\beta_{3}$ being significantly different from zero (at the 5% significance level), unlike that of $\beta_{2}$.

It is found that the estimate $\widehat{\beta}_{2}$ differs from the true value of the coefficient of $\mathbf{X}_{2}$, while $\widehat{\alpha}_{2}$ captures the true value of the coefficient obtained by substituting $\mathbf{X}_{3}=\mathbf{1}+5\cdot\mathbf{X}_{2}+\widetilde{\mathbf{1}}$ in $\mathbf{y}=5\cdot\mathbf{1}+2\cdot\mathbf{X}_{2}-4\cdot\mathbf{X}_{3}+\mathbf{u}$. That is, $\mathbf{y}=\left(\mathbf{1}-4\cdot\widetilde{\mathbf{1}}\right)-18\cdot\mathbf{X}_{2}+\mathbf{u}$: as captured by the simple regression rather than the multiple regression, the real relationship between $\mathbf{X}_{2}$ and $\mathbf{y}$ is actually -18 rather than 2.

To summarize, it can be observed that in the first case, the degree of essential approximate multicollinearity is not troubling while in the second case it is. Note also that the consequence of high linear relationships between the independent variables is to obtain different values for the partial and total effect. \Box

3 Penalized estimator

In this section we obtain the estimator that verifies the above conditions, as well as its norm, variance-covariance matrix, goodness-of-fit and mean square error. We also consider the augmented model whose OLS estimation leads to the penalized estimator obtained. In the latter case, the effect of this augmented model on the detection of the degree of multicollinearity existing once the proposed estimator has been applied is analyzed.

With this purpose, instead of working with expression (4), expression (7), in which a parameter $h$ has been introduced, will be minimized:

L=SCR(\boldsymbol{\beta})+k\cdot(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha})^{t}(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha}),\quad k\geq 0, (7)

Note that for $h=0$ the Hoerl and Kennard [18, 19] estimator would be obtained. That is, the ridge estimator is a particular case of the one presented here.

While it makes sense that $h\in[0,\ 1]$, only $h\in\{0,1\}$ will be considered (for further details see subsection 3.1.1).

3.1 Estimation

Minimizing expression (7) with respect to $\boldsymbol{\beta}$, the following estimator is obtained:

\widehat{\boldsymbol{\beta}}(k,h)=\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)^{-1}\cdot\left(\mathbf{X}^{t}\mathbf{y}+k\cdot h\cdot\boldsymbol{\alpha}\right), (8)

where it is clear that if $h=0$ the penalized estimator coincides with the ridge estimator (expressions (2) and (8) are the same) and if $k=0$ the OLS estimate is obtained.
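As a minimal sketch (not the authors' published code), expression (8) can be computed in R as follows; beta_penalized() is a hypothetical helper name, and X, y and alpha are assumed to be the design matrix, the response and the vector of expression (6):

    # penalized estimator of expression (8); h = 0 gives ridge, k = 0 gives OLS
    beta_penalized <- function(X, y, alpha, k, h = 1) {
      p <- ncol(X)
      solve(t(X) %*% X + k * diag(p), t(X) %*% y + k * h * alpha)
    }

    # example use: beta_penalized(X, y, alpha, k = 0.05, h = 1)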

Remark 1

Note that if $\boldsymbol{\alpha}=\mathbf{0}$, where $\mathbf{0}$ is a vector of zeros of appropriate dimensions, the penalized estimator also coincides with the ridge estimator. That is, the ridge estimator could be considered the particular case of the penalized estimator in which all the coefficients of the possible simple regressions are zero and the dependent variable is centered:

\overline{\mathbf{y}}=0,\ cov(\mathbf{y},\mathbf{X}_{i})=0\mbox{ for }i=2,\dots,p.

This implies that the total impact of the variations of each independent variable (excluding the constant term) on the dependent variable is null, assuming that the rest of the independent variables vary as would be expected given the correlations observed between them in the sample.

Remark 2

Note that in the case where $h=1$ and $\boldsymbol{\alpha}=\mathbf{d}\left(\mathbf{d}^{t}\mathbf{X}^{t}\mathbf{X}\mathbf{d}\right)^{-1}\mathbf{d}^{t}\mathbf{X}^{t}\mathbf{X}\widehat{\boldsymbol{\beta}}$, where $\mathbf{d}$ is a $p$-dimensional vector in which all the elements are equal to one and $\widehat{\boldsymbol{\beta}}$ is the OLS estimator, the ALPR estimator is obtained.

Similarly, for $h=1$, $d=-k$ and $\boldsymbol{\alpha}=\widehat{\boldsymbol{\beta}}$, expression (8) coincides with the Liu [23] estimator.

On the other hand, denoting $\mathbf{Z}(k)=\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)^{-1}$ and taking into account that $\mathbf{X}^{t}\mathbf{X}\cdot\widehat{\boldsymbol{\beta}}=\mathbf{X}^{t}\mathbf{y}$, expression (8) can be rewritten as:

\widehat{\boldsymbol{\beta}}(k,h)=\mathbf{Z}(k)\cdot\mathbf{X}^{t}\mathbf{X}\cdot\widehat{\boldsymbol{\beta}}+k\cdot h\cdot\mathbf{Z}(k)\cdot\boldsymbol{\alpha}, (9)

and in this case:

E\left[\widehat{\boldsymbol{\beta}}(k,h)\right]=\mathbf{Z}(k)\cdot\mathbf{X}^{t}\mathbf{X}\cdot\boldsymbol{\beta}+k\cdot h\cdot\mathbf{Z}(k)\cdot\boldsymbol{\alpha},

i.e., the estimator given in (8) is a biased estimator of $\boldsymbol{\beta}$ as long as $k\not=0$.

Finally, note that from expressions (2) and (8) it is obtained that:

\widehat{\boldsymbol{\beta}}(k,h)=\widehat{\boldsymbol{\beta}}(k)+k\cdot h\cdot\mathbf{Z}(k)\cdot\boldsymbol{\alpha}. (10)
Remark 3

Considering that expression (7) is equivalent to:

L=k_{1}\cdot SCR(\boldsymbol{\beta})+k_{2}\cdot(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha})^{t}(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha}), (11)

just by taking $k=\frac{k_{2}}{k_{1}}$ with $k_{1}>0$, minimizing expression (11) with respect to $\boldsymbol{\beta}$ leads to:

\widehat{\boldsymbol{\beta}}(k_{1},k_{2},h)=\left(k_{1}\cdot\mathbf{X}^{t}\mathbf{X}+k_{2}\cdot\mathbf{I}\right)^{-1}\cdot\left(k_{1}\cdot\mathbf{X}^{t}\mathbf{y}+k_{2}\cdot h\cdot\boldsymbol{\alpha}\right)=\left(k_{1}\cdot\mathbf{X}^{t}\mathbf{X}+k_{2}\cdot\mathbf{I}\right)^{-1}\cdot\left[\left(k_{1}\cdot\mathbf{I}+\frac{1}{n}\cdot k_{2}\cdot h\cdot\mathbf{V}\right)\cdot\mathbf{X}^{t}\mathbf{y}-k_{2}\cdot h\cdot\overline{\mathbf{y}}\cdot\mathbf{V}\cdot\mathbf{B}\right],

where:

\mathbf{V}_{p\times p}=diag\left(1,\ \frac{1}{var(\mathbf{X}_{2})},\dots,\ \frac{1}{var(\mathbf{X}_{p})}\right),\quad\mathbf{B}_{p\times 1}=\left(0,\ \overline{\mathbf{X}}_{2},\dots,\ \overline{\mathbf{X}}_{p}\right)^{t}.

Note that if the variables are centered, i.e., $\mathbf{B}=\mathbf{0}_{p\times 1}$ where $\mathbf{0}$ is a vector of zeros of appropriate dimensions, the previous expression is rewritten as:

\widehat{\boldsymbol{\beta}}(k_{1},k_{2},h)=\left(k_{1}\cdot\mathbf{X}^{t}\mathbf{X}+k_{2}\cdot\mathbf{I}\right)^{-1}\cdot\left(k_{1}\cdot\mathbf{I}+\frac{1}{n}\cdot k_{2}\cdot h\cdot\mathbf{V}\right)\cdot\mathbf{X}^{t}\mathbf{y}=\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)^{-1}\cdot\left(\mathbf{I}+\frac{1}{n}\cdot k\cdot h\cdot\mathbf{V}\right)\cdot\mathbf{X}^{t}\mathbf{y}, (13)

where now:

\mathbf{V}_{p\times p}=diag\left(1,\ \frac{n}{\sum\limits_{i=1}^{n}X_{i2}^{2}},\dots,\ \frac{n}{\sum\limits_{i=1}^{n}X_{ip}^{2}}\right).

Given a problem of bad conditioning of the matrix $\mathbf{X}$, it will affect the calculation of $\widehat{\boldsymbol{\beta}}$ both through $\left(\mathbf{X}^{t}\mathbf{X}\right)^{-1}$ and through $\mathbf{X}^{t}\mathbf{y}$. In this case, from expression (13), both terms would be taken into account and not only the first one as, for example, the ridge estimator does (see expression (2)).

3.1.1 Orthogonal case

Section 2 shows that in the orthogonal case it is verified that $\widehat{\boldsymbol{\beta}}=\boldsymbol{\alpha}$; therefore, taking into account that $\mathbf{X}^{t}\mathbf{X}\cdot\widehat{\boldsymbol{\beta}}=\mathbf{X}^{t}\mathbf{y}$, expression (8) can be rewritten as:

\widehat{\boldsymbol{\beta}}(k,h)=\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)^{-1}\cdot\left(\mathbf{X}^{t}\mathbf{X}+k\cdot h\cdot\mathbf{I}\right)\cdot\widehat{\boldsymbol{\beta}}=\left(\begin{array}{cccc}\frac{x_{11}+k\cdot h}{x_{11}+k}&0&\cdots&0\\ 0&\frac{x_{22}+k\cdot h}{x_{22}+k}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\frac{x_{pp}+k\cdot h}{x_{pp}+k}\\ \end{array}\right)\cdot\widehat{\boldsymbol{\beta}},

where $x_{ii}$ denotes the element $(i,i)$ of $\mathbf{X}^{t}\mathbf{X}$, that is, $x_{ii}=\sum\limits_{j=1}^{n}X_{ij}^{2}$ for $i=1,\dots,p$. In this case:

  • If $h=0$, the penalized estimator coincides with the ridge estimator and it is found that $\widehat{\boldsymbol{\beta}}(k,h)\not=\widehat{\boldsymbol{\beta}}$ for values of $k$ different from zero.

  • If $h=1$, then $\widehat{\boldsymbol{\beta}}(k,h)=\widehat{\boldsymbol{\beta}}$ for any value of $k$.

  • If $h\in(0,1)$, then $\widehat{\boldsymbol{\beta}}(k,h)\not=\widehat{\boldsymbol{\beta}}$ for values of $k$ different from zero.

Thus, although technically $h$ can be any non-negative value, the only advisable value for $h$ would be 1, since in this case it is assured that the penalized estimator coincides with the OLS estimator when there is orthogonality in model (1). The case $h=0$ will also be considered in order to work with the ridge estimator as a particular case of the penalized estimator.

3.2 Trace and norm of the penalized estimator

From expression (8), the trace of the estimator can be obtained by simply plotting $\widehat{\boldsymbol{\beta}}(k,h)$ for different values of $k$ (it will be considered that $k\in[0,1]$). In this way, it is possible to determine graphically which values of $k$ stabilize the estimates of $\boldsymbol{\beta}$ obtained from $\widehat{\boldsymbol{\beta}}(k,h)$.

Since section A.1 of Appendix A shows that:

||\widehat{\boldsymbol{\beta}}(k,h)||\rightarrow h^{2}\cdot||\boldsymbol{\alpha}||,

it is guaranteed that there are values of $k$ that will stabilize the values provided by the penalized estimator. In addition, as stated in expression (7), these estimates converge to the values of $\boldsymbol{\alpha}$.
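A possible way to draw this trace in R, reusing the hypothetical beta_penalized() helper sketched after expression (8), is:

    # trace of the penalized estimator: one curve per coefficient as k varies
    ks <- seq(0, 1, by = 0.01)
    trace_mat <- sapply(ks, function(k) beta_penalized(X, y, alpha, k, h = 1))
    matplot(ks, t(trace_mat), type = "l", lty = 1,
            xlab = "k", ylab = "penalized estimates")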

3.3 Variance-covariance matrix

From expression (9), and taking into account that $var\left(\widehat{\boldsymbol{\beta}}\right)=\sigma^{2}\cdot\left(\mathbf{X}^{t}\mathbf{X}\right)^{-1}$, it is immediately clear that:

var\left(\widehat{\boldsymbol{\beta}}(k,h)\right)=\sigma^{2}\cdot\mathbf{Z}(k)\cdot\mathbf{X}^{t}\mathbf{X}\cdot\mathbf{Z}(k). (15)

It is observed that this expression does not depend on the factor $h$ and coincides with that of the Hoerl and Kennard [18, 19] ridge estimator given in expression (3).

Section A.2 of Appendix A shows that $var\left(\widehat{\boldsymbol{\beta}}(k,h)\right)<var\left(\widehat{\boldsymbol{\beta}}\right)$, i.e., the penalized estimator is optimal in the sense that it has a smaller variance-covariance matrix than the OLS estimator.

3.4 Goodness-of-fit

Taking into account that, from expression (8), the residuals of the penalized estimation would be obtained as $\mathbf{e}(k,h)=\mathbf{y}-\mathbf{X}\widehat{\boldsymbol{\beta}}(k,h)$, it is clear that $\mathbf{y}=\mathbf{X}\widehat{\boldsymbol{\beta}}(k,h)+\mathbf{e}(k,h)$. In this case the following equality is verified:

\mathbf{y}^{t}\mathbf{y}=\widehat{\boldsymbol{\beta}}(k,h)^{t}\left(\mathbf{X}^{t}\mathbf{X}+2\cdot k\cdot\mathbf{I}\right)\widehat{\boldsymbol{\beta}}(k,h)-2\cdot k\cdot h\cdot\widehat{\boldsymbol{\beta}}(k,h)^{t}\boldsymbol{\alpha}+\mathbf{e}(k,h)^{t}\mathbf{e}(k,h).

Consequently, a measure of goodness-of-fit can be given by:

GoF(k,h)=1-\frac{\mathbf{e}(k,h)^{t}\cdot\mathbf{e}(k,h)}{\mathbf{y}^{t}\mathbf{y}}=\frac{\widehat{\boldsymbol{\beta}}(k,h)^{t}\left(\mathbf{X}^{t}\mathbf{X}+2\cdot k\cdot\mathbf{I}\right)\widehat{\boldsymbol{\beta}}(k,h)-2\cdot k\cdot h\cdot\widehat{\boldsymbol{\beta}}(k,h)^{t}\boldsymbol{\alpha}}{\mathbf{y}^{t}\mathbf{y}}, (16)

where $\mathbf{y}^{t}\mathbf{y}=n\cdot var(\mathbf{y})$. Note that

  • For $k=0$, the coefficient of determination, $R^{2}$, obtained by OLS in model (1) is recovered, as long as the dependent variable has zero mean (see Salmerón et al. [42] for details).

  • For $h=0$, a goodness-of-fit measure for the ridge estimator is obtained.

  • Since the residuals do not necessarily sum to zero (see section A.3 of Appendix A), this measure does not have to vary between zero and one as it does in OLS.

  • Section A.4 of Appendix A analyzes the monotonicity of $GoF(k,h)$, concluding that:

    • if $h\not=0$, its decrease is not assured when $k$ increases, although this is to be expected. In this case, it would be verified that $GoF(k,h)\in(2\cdot h\cdot\mathbf{y}^{t}\mathbf{X}\boldsymbol{\alpha},R^{2}]$.

    • if $h=0$, its decrease is assured when $k$ increases and it would be verified that $GoF(k,h)\in(0,R^{2}]$.
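As a sketch, the goodness-of-fit of expression (16) can be computed in R as follows (again assuming the hypothetical beta_penalized() helper and objects X, y and alpha defined earlier):

    # goodness-of-fit of expression (16) for the penalized estimator
    gof_penalized <- function(X, y, alpha, k, h = 1) {
      b <- beta_penalized(X, y, alpha, k, h)
      e <- y - X %*% b
      1 - as.numeric(t(e) %*% e) / as.numeric(t(y) %*% y)
    }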

3.5 Mean Square Error

Since the penalized estimator is biased, it is interesting to calculate its mean square error (MSE) in order to establish when it is lower than that obtained by OLS. In this case, the MSE corresponds to the following expression:

MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)=trace\left(var\left(\widehat{\boldsymbol{\beta}}(k,h)\right)\right)+\left(E\left[\widehat{\boldsymbol{\beta}}(k,h)\right]-\boldsymbol{\beta}\right)^{t}\left(E\left[\widehat{\boldsymbol{\beta}}(k,h)\right]-\boldsymbol{\beta}\right), (17)

where $E\left[\widehat{\boldsymbol{\beta}}(k,h)\right]-\boldsymbol{\beta}=\left(\mathbf{Z}(k)\cdot\mathbf{X}^{t}\mathbf{X}-\mathbf{I}\right)\cdot\boldsymbol{\beta}+k\cdot h\cdot\mathbf{Z}(k)\cdot\boldsymbol{\alpha}$.

Since the variance-covariance matrix of the penalized estimator coincides with that of the ridge estimator, the expression (17) can be rewritten as:

MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)=MSE\left(\widehat{\boldsymbol{\beta}}(k)\right)+\mathbf{S}(k,h), (18)

where $MSE\left(\widehat{\boldsymbol{\beta}}(k)\right)$ is the MSE of the ridge estimator and $\mathbf{S}(k,h)=2\cdot k\cdot h\cdot\boldsymbol{\beta}^{t}\cdot\left(\mathbf{Z}(k)\cdot\mathbf{X}^{t}\mathbf{X}-\mathbf{I}\right)^{t}\cdot\mathbf{Z}(k)\cdot\boldsymbol{\alpha}+k^{2}\cdot h^{2}\cdot\boldsymbol{\alpha}^{t}\cdot\mathbf{Z}(k)^{t}\cdot\mathbf{Z}(k)\cdot\boldsymbol{\alpha}$.

Hoerl and Kennard [18] showed that the mean square error of the ridge estimator is equal to the sum of two terms:

MSE\left(\widehat{\boldsymbol{\beta}}(k)\right)=\sigma^{2}\sum\limits_{i=1}^{p}\frac{\lambda_{i}}{(\lambda_{i}+k)^{2}}+k^{2}\boldsymbol{\beta}^{t}\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)^{-2}\boldsymbol{\beta},

where $\lambda_{i}$, $i=1,\dots,p$, are the eigenvalues of $\mathbf{X}^{t}\mathbf{X}$. The first term is decreasing in $k$ and the second is increasing; therefore, the monotonicity of $MSE\left(\widehat{\boldsymbol{\beta}}(k)\right)$ is not clear. However, since

k^{2}\boldsymbol{\beta}^{t}\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)^{-2}\boldsymbol{\beta}\longrightarrow\boldsymbol{\beta}^{t}\boldsymbol{\beta},

when $k\rightarrow+\infty$, it is verified that $MSE\left(\widehat{\boldsymbol{\beta}}(k)\right)$ has a horizontal asymptote.

Also, section A.5 of Appendix A shows that the first term of $\mathbf{S}(k,h)$ has no definite monotonicity while the second is increasing in $k$. Since both terms have a horizontal asymptote, there must be a value of $k$ for which the value of $\mathbf{S}(k,h)$ is stable. This is an important consideration because if it is verified that $\mathbf{S}(k,h)<0$, then $MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)<MSE\left(\widehat{\boldsymbol{\beta}}(k)\right)$.

Finally, it should be noted that $MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)$ has a horizontal asymptote because, when $k\rightarrow+\infty$, it is verified (note that for $h=1$ this limit reduces to $\left(\boldsymbol{\beta}-\boldsymbol{\alpha}\right)^{t}\left(\boldsymbol{\beta}-\boldsymbol{\alpha}\right)$) that:

MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)\longrightarrow\boldsymbol{\beta}^{t}\boldsymbol{\beta}-2\cdot h\cdot\boldsymbol{\beta}^{t}\boldsymbol{\alpha}+h^{2}\cdot\boldsymbol{\alpha}^{t}\boldsymbol{\alpha}.

Therefore, there must be a value of $k$ at which $MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)$ is stabilized.
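The following R sketch evaluates expressions (17)-(18) on a grid of $k$ values; it assumes that the true beta and sigma2 are known, as in a simulation setting, and mse_penalized() is a hypothetical helper name:

    # MSE of the penalized estimator, expressions (17)-(18)
    mse_penalized <- function(X, beta, alpha, sigma2, k, h = 1) {
      p    <- ncol(X)
      Zk   <- solve(t(X) %*% X + k * diag(p))
      vcov <- sigma2 * Zk %*% t(X) %*% X %*% Zk                 # expression (15)
      bias <- (Zk %*% t(X) %*% X - diag(p)) %*% beta + k * h * Zk %*% alpha
      sum(diag(vcov)) + as.numeric(t(bias) %*% bias)
    }

    # evaluated on a grid, the minimizing k (if any) can be located directly:
    # ks   <- seq(0, 1, by = 0.01)
    # mses <- sapply(ks, function(k) mse_penalized(X, beta, alpha, 1, k))
    # ks[which.min(mses)]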

3.5.1 Monte Carlo Simulation

In order to compare the MSE of the penalized estimator with those of the OLS and ridge estimators, and given the impossibility of establishing analytically the monotonicity of the MSE, a Monte Carlo simulation is proposed below where values are simulated from:

\mathbf{X}_{i}=\sqrt{1-\xi^{2}}\cdot\mathbf{W}_{i}+\mathbf{W}_{p},\quad i=2,\dots,p,

where $p=3,4,5,6$, $\xi\in\{0.96,0.97,0.98,0.99\}$, $\mathbf{W}_{i}\sim N(\mu,\sigma^{2})$ with $\mu$ randomly obtained from the set $\{0,\pm 2,\pm 4,\pm 6,\pm 8,\pm 10\}$, $\sigma\in\{0.01,0.1,5,10,15\}$ and $n\in\{30,40,50,\dots,200\}$. This way of simulating data has been previously used, for example, by McDonald and Galarneau [29], Wichern and Churchill [53], Gibbons [15], Kibria [21], Salmerón et al. [38, 40] or Rodríguez et al. [36, 37]; the goal is that each pair of independent variables has a correlation coefficient equal to $\xi^{2}$.

Once the data have been simulated (1440 simulations are performed), the matrix $\mathbf{X}=[\mathbf{1},\mathbf{X}_{2},\dots,\mathbf{X}_{p}]$ is generated together with $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{u}$, where the elements of $\boldsymbol{\beta}$ are obtained randomly according to $\beta_{i}\in\{\pm 1,\pm 2,\pm 3,\pm 4,\pm 5\}$ and $\mathbf{u}\sim N(0,1)$.
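A minimal sketch of this data-generating scheme for a single simulated data set (the concrete values of n, p, xi, mu and sigma below are placeholders taken from the sets listed above):

    set.seed(123)
    n <- 50; p <- 4; xi <- 0.98; sigma <- 5
    mu <- sample(c(0, 2, 4, 6, 8, 10), 1) * sample(c(-1, 1), 1)

    # W_2, ..., W_p are independent N(mu, sigma^2) draws
    W <- matrix(rnorm(n * (p - 1), mean = mu, sd = sigma), nrow = n)

    # X_i = sqrt(1 - xi^2) * W_i + W_p, i = 2, ..., p (W_p is the last column)
    Xi <- sqrt(1 - xi^2) * W + W[, p - 1]

    # design matrix with intercept, random coefficients and response
    X    <- cbind(1, Xi)
    beta <- sample(c(-5:-1, 1:5), p, replace = TRUE)
    y    <- X %*% beta + rnorm(n)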

For each case, considering that $k\in\{0,0.01,0.02,\dots,0.99,1\}$, the following are calculated:

  • The mean square error in OLS, $MSE_{OLS}=MSE\left(\widehat{\boldsymbol{\beta}}(0,0)\right)$, the minimum value of the mean square error for the ridge estimator, $MSE_{R}=MSE\left(\widehat{\boldsymbol{\beta}}(k,0)\right)$, and for the penalized estimator with $h=1$, $MSE_{P}=MSE\left(\widehat{\boldsymbol{\beta}}(k,1)\right)$. Algorithm 1 has been applied in this case. Note that this algorithm also determines whether the minimum obtained is unique: in all cases where the minimum has been reached, it is unique.

  • The minimum coefficient of variation (CV), the maximum variance inflation factor (VIF) and the condition number (CN). In this way, the existence of approximate multicollinearity of essential and non-essential type in the simulated data is analyzed. Calculations were performed with the package “multiColl” [30] of R [35] (for further details see Salmerón et al. [43, 44]).

Table 1 shows the minimum, mean, median and maximum values for the minimum coefficient of variation, the maximum variance inflation factor and the condition number. Considering the traditional thresholds for these multicollinearity detection measures, it is observed that situations have been simulated where the existing approximate multicollinearity (both essential and non-essential) is and is not troubling. In addition, the significant difference between the mean and the median of the values obtained for the condition number is notable, indicating the existence of very high values that shift the distribution to the right.

Minimum CV Maximum VIF CN
Minimum 0.00123 6.394 5.966
Mean 2.48557 71.689 1752.275
Median 1.03504 52.495 33.489
Maximum 42.55852 367.690 25384.87
Table 1: Minimum, mean, median and maximum value for the minimum coefficient of variation (CV), the maximum variance inflation factor (VIF) and condition number (CN)
Case                                  Number (%)      Minimum CV   Maximum VIF   CN
(A) $MSE_{P}<MSE_{R}<MSE_{OLS}$       0 (0%)
(B) $MSE_{P}<MSE_{OLS}<MSE_{R}$       408 (46.52%)    1.85         31.199        12.726
                                                      4.6786       65.957        18.934
                                                      5.8266       80.703        23.334
(C) $MSE_{R}<MSE_{P}<MSE_{OLS}$       469 (53.48%)    0.00274      32.45         199.889
                                                      0.57782      76.68         3260.171
                                                      0.05795      99.5          4678.892
(D) $MSE_{R}<MSE_{OLS}<MSE_{P}$       0 (0%)
(E) $MSE_{OLS}<MSE_{P}<MSE_{R}$       0 (0%)
(F) $MSE_{OLS}<MSE_{R}<MSE_{P}$       0 (0%)
Table 2: Comparison of mean squared errors. For cases (B) and (C), the three rows report the first quartile, the mean and the third quartile of the minimum coefficient of variation (CV), the maximum variance inflation factor (VIF) and the condition number (CN).

Table 2 compares the values obtained. It is observed that:

  • A total of 877 simulations were finally worked with. This is because there are 563 cases where it is not possible to calculate the minimum MSE for the ridge and/or the penalized estimator, either because the discretization is not fine enough or because the upper end of the interval, 1, is insufficient. In this regard, tests have been performed by modifying the upper end of the interval:

    • If the upper end is equal to 2: there are 903 cases in which both minima are obtained, in 432 cases (B) is verified and in 471 cases (C) is verified.

    • If the upper end is equal to 5: there are 921 cases in which both minima are obtained, in 445 cases (B) is verified and in 476 cases (C) is verified.

    • If the upper end is equal to 10: there are 923 cases in which both minima are obtained, in 447 cases (B) is verified and in 476 cases (C) is verified.

    It can be seen that increasing the upper end incorporates progressively fewer additional simulations and that the new incorporations mostly fall into case (B).

  • Case B: In 46.52% of the cases the penalized estimator has the lowest mean square error and the ridge the highest. These cases are characterized by a high coefficient of variation, so that the multicollinearity treated is of the essential type.

  • Case C: In 53.48% of the cases the penalized estimator has a worse mean square error than the ridge, but better than that of OLS. These cases are characterized by a low coefficient of variation, so that the multicollinearity treated is of a non-essential type.

  • The values shown by the condition number explain that the shift to the right discussed above is due to the detection by the condition number of non-essential multicollinearity, which is ignored by the variance inflation factor.

In short, the simulation shows that the penalized estimator performs better in terms of mean square error than the ridge estimator when the existing multicollinearity is of the essential type.

0:  Calculate $D(n)$ := {discretization of the interval [0,1] with $n$ points}
1:  i = 0
2:  for k $\in$ D(n) do
3:    i = i + 1
4:    Calculate $MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)[i]$ with the expression given in subsection 3.5
5:  end for
6:  l = 0
7:  for j $\in$ 2:(i-1) do
8:    if $MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)[j]<MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)[j-1]$ and $MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)[j]<MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)[j+1]$ then
9:      l = l + 1
10:     index[l] = j
11:   end if
12: end for
13: only = 0
14: if l = 1 then
15:   only = 1
16: end if
17: if l > 1 then
18:   only = 2
19: end if
20: MSE min = $MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)$[min(index)]
Algorithm 1: Calculation of $k$ such that $MSE\left(\widehat{\boldsymbol{\beta}}(k,h)\right)$ is minimum
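A direct R translation of Algorithm 1 (a sketch reusing the hypothetical mse_penalized() helper from subsection 3.5) could be:

    # locate interior local minima of MSE(k, h) on a discretization of [0, 1]
    min_mse_k <- function(X, beta, alpha, sigma2, h = 1, n_grid = 101) {
      ks   <- seq(0, 1, length.out = n_grid)
      mses <- sapply(ks, function(k) mse_penalized(X, beta, alpha, sigma2, k, h))
      idx  <- which(diff(sign(diff(mses))) > 0) + 1   # interior local minima
      list(k_min      = if (length(idx) > 0) ks[min(idx)] else NA,
           unique_min = length(idx) == 1)
    }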

4 Augmented model and multicollinearity

From expression (8), the penalized estimator can be obtained by OLS estimation of the augmented model $\mathbf{y}_{A}=\mathbf{X}_{A}\cdot\boldsymbol{\beta}+\mathbf{u}_{A}$, where:

\mathbf{X}_{A}=\left(\begin{array}{c}\mathbf{X}\\ \sqrt{k}\cdot\mathbf{I}\end{array}\right)_{(n+p)\times p},\quad\mathbf{y}_{A}=\left(\begin{array}{c}\mathbf{y}\\ \sqrt{k}\cdot h\cdot\boldsymbol{\alpha}\end{array}\right)_{(n+p)\times 1}.

However, the variance-covariance matrices would be different since:

var\left(\widehat{\boldsymbol{\beta}}_{A}\right)=\sigma^{2}\cdot\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)^{-1}\not=var\left(\widehat{\boldsymbol{\beta}}(k,h)\right).

Similarly, the goodness-of-fit would also change since the dependent variable is different, $\mathbf{y}\not=\mathbf{y}_{A}$.

It should be noted that the design matrix of the augmented model, $\mathbf{X}_{A}$, coincides with that of the augmented model of the ridge estimator (see Marquardt [26]), so all operations based on it will coincide. Consequently, the papers by García et al. [13] and Salmerón et al. [39], which calculate, respectively, the extension of the Variance Inflation Factor (VIF) and of the Condition Number (CN) to the ridge estimator, are applicable to the penalized estimator. Thus, more specifically:

  • García et al. [13] state that to obtain an extension of the VIF, $VIF(i,k)$ with $i=2,\dots,p$, with the desirable properties of continuity (for $k=0$ the calculated VIF coincides with that obtained in OLS), decreasing monotonicity as $k$ increases (since it is assumed that applying the ridge estimator decreases the degree of approximate multicollinearity) and always taking a value greater than 1 (since this is the minimum value of the VIF), the data must be standardized (that is, centered by subtracting the mean and then divided by the standard deviation multiplied by the square root of the number of observations) and the model subsequently augmented. In other words, to calculate the VIF for the ridge estimator, the following matrix must be used:

    \mathbf{x}_{A}=\left(\begin{array}{c}\mathbf{x}\\ \sqrt{k}\cdot\mathbf{I}\end{array}\right)_{(n+p)\times p},

    where $\mathbf{x}$ is the matrix resulting from the standardization of the columns of $\mathbf{X}$.

  • Salmerón et al. [39] recommend that, to obtain an extension of the CN, $CN(k)$, that verifies the above desirable properties (continuity at $k=0$, decreasing monotonicity in $k$ and being greater than 1 for any value of $k$), it has to be calculated from the expression:

    CN(k)=\sqrt{\frac{\xi_{max}+k}{\xi_{min}+k}},

    where $\xi_{min}$ and $\xi_{max}$ are, respectively, the minimum and maximum eigenvalues of $\mathbf{X}_{T}^{t}\mathbf{X}_{T}$, where $\mathbf{X}_{T}$ denotes the transformed matrix $\mathbf{X}$. These authors recommend typifying the data (centering them and dividing by the standard deviation), since in that case the results obtained are similar to those obtained from the VIF. However, this transformation ignores multicollinearity of the non-essential type (see, for example, Salmerón et al. [40, 41]), so it is not used in this paper. In accordance with the recommendations of Belsley et al. [1] or Belsley [3, 2], the eigenvalues will be calculated from the matrix $\mathbf{X}_{T}^{t}\mathbf{X}_{T}$ where the transformation used is that of unit length (dividing each column by the square root of the sum of its squared elements; note that if the data have zero mean, a property obtained by centering, this transformation coincides with standardization). A sketch of the computation of $CN(k)$ in R is given after this list.
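As an illustrative sketch (not the implementation of the cited works), the extension of the condition number could be computed in R as:

    # CN(k): unit-length scaling of the columns of X, then the ratio of the
    # extreme eigenvalues of t(X_T) %*% X_T shifted by k
    cn_k <- function(X, k) {
      XT  <- apply(X, 2, function(col) col / sqrt(sum(col^2)))  # unit length
      eig <- eigen(t(XT) %*% XT, symmetric = TRUE, only.values = TRUE)$values
      sqrt((max(eig) + k) / (min(eig) + k))
    }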

5 The penalty parameter $k$

In order to evaluate the above expressions it is necessary to use a concrete value of the penalty parameter, $k$. For this reason, some possible criteria for determining a concrete value of this parameter, as well as its interpretation, are proposed below.

5.1 Interpretation of the penalty parameter

Taking into account Remark 3, in which $k=\frac{k_{2}}{k_{1}}$ for $k_{1}>0$, and considering that $k_{1}+k_{2}=1$, it is possible to obtain the relative importance of each of the two summands of expression (11) in the minimization performed.

That is, assuming that $k=r\geq 0$, it is obtained that $k_{1}=\frac{1}{1+r}$ and $k_{2}=\frac{r}{1+r}$. Thus, for example:

  • If $k=1$, then $k_{1}=0.5$ and $k_{2}=0.5$. In this case, the same importance would be given to $SCR(\boldsymbol{\beta})$ and to $(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha})^{t}(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha})$.

  • If $k=2$, then $k_{1}=\frac{1}{3}$ and $k_{2}=\frac{2}{3}$. In this case, twice as much importance is given to $(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha})^{t}(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha})$ as to $SCR(\boldsymbol{\beta})$.

In general:

  • If $k<1$, then $k_{2}<k_{1}$ and more relevance is given to $SCR(\boldsymbol{\beta})$ than to $(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha})^{t}(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha})$.

  • If $k>1$, then $k_{2}>k_{1}$ and more relevance is given to $(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha})^{t}(\boldsymbol{\beta}-h\cdot\boldsymbol{\alpha})$ than to $SCR(\boldsymbol{\beta})$.

Note that in the ridge estimator the value of $k$ has traditionally been considered to belong to the interval $[0,\ 1]$. That is, preference is given to minimizing the sum of squared residuals over the shrinkage of the coefficients.

Moreover, the fact that $k_{1}$ is imposed to be greater than zero implies that the sum of squared residuals always enters the minimization. At the same time, since the higher the value of $k$ the smaller the role of $SCR(\boldsymbol{\beta})$, it is to be expected that the goodness-of-fit calculated from $GoF(k,h)$ will be decreasing.

5.2 Selection of the penalty parameter

A first option, following Hoerl and Kennard [18, 19], is to select the value of $k$ that stabilizes the estimates of the coefficients of the independent variables of the model. For this purpose, it is necessary to plot the values of $\widehat{\boldsymbol{\beta}}(k,h)$ as a function of $k$ and observe at what value of $k$ the estimates become stable.

On the other hand, following García et al. [13] and Salmerón et al. [39], one could consider the first value of $k$ that brings the extension of the VIF or the CN below the thresholds (10 and 20, respectively) above which multicollinearity is traditionally considered troubling.

Furthermore, since the goal is to make the estimates similar to those collected in $\boldsymbol{\alpha}$, one could consider the value of $k$ that makes $\widehat{\boldsymbol{\beta}}(k,h)$ and $\boldsymbol{\alpha}$ close to each other. For example, the relative difference between the two could be required to be less than 10%:

\frac{||\boldsymbol{\alpha}-\widehat{\boldsymbol{\beta}}(k,h)||}{||\boldsymbol{\alpha}||}<0.1.

Section A.6 of Appendix A shows that $||\boldsymbol{\alpha}-\widehat{\boldsymbol{\beta}}(k,h)||$ decreases toward zero as $k$ increases when $h=1$.

Finally, following Hoerl et al. [20], the value of $k$ that minimizes the mean square error (if it exists) could be considered.
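These criteria can be automated once the penalized estimator is available. A sketch of the closeness criterion in R, reusing the hypothetical beta_penalized() helper (the grid and the 10% tolerance are the ones mentioned above):

    # smallest k on a grid such that the relative distance between alpha and
    # the penalized estimates falls below 10%
    select_k_closeness <- function(X, y, alpha, ks = seq(0, 100, by = 0.01),
                                   tol = 0.1, h = 1) {
      dist <- sapply(ks, function(k) {
        b <- beta_penalized(X, y, alpha, k, h)
        sqrt(sum((alpha - b)^2)) / sqrt(sum(alpha^2))
      })
      if (any(dist < tol)) ks[which(dist < tol)[1]] else NA
    }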

Example 3 (continuation of Example 2)

In Example 2 (where the degree of essential multicollinearity is troubling), considering that $k\in\{0,0.01,0.02,\dots,99.99,100\}$ and $h=1$, the following values of $k$ are obtained according to the different criteria:

  • $k=0.06$ is the value of $k$ for which the maximum $VIF(i,k)$, $i=2,3$, is less than 10 (concretely 9.060921);

  • $k=0.01$ is the value of $k$ for which $CN(k)$ is lower than 20 (concretely 14.2435) and $k=0.03$ is the value of $k$ for which $CN(k)$ is lower than 10 (concretely 8.316654);

  • $k=0.03$ is the value of $k$ for which the mean square error is minimum; concretely, it is equal to 4.284018, which is lower than the 4.311986 obtained by OLS.

On the other hand, in the considered discretization of the interval [0, 100] there does not exist a $k$ such that $\frac{||\boldsymbol{\alpha}-\widehat{\boldsymbol{\beta}}(k,h)||}{||\boldsymbol{\alpha}||}<0.1$. The minimum value obtained in this case is 0.4402975.

Finally, with these values of $k$ more weight would be given to the minimization of $SCR(\boldsymbol{\beta})$ than to $(\boldsymbol{\beta}-\boldsymbol{\alpha})^{t}(\boldsymbol{\beta}-\boldsymbol{\alpha})$, according to the values shown in the following table:

$k$      $k_{1}$       $k_{2}$
0.01   0.990099    0.00990099
0.03   0.9708738   0.02912621
0.06   0.9433962   0.05660377

\Box

6 Inference

Although there are various attempts to address inference for the ridge estimator (see, e.g., Obenchain [32, 33], Halawa and El Bassiouni [17], Gökpınar and Ebegil [16], Sengupta and Sowell [45] or Vanhove [51]), in this paper we focus on the use of bootstrap methods (see, for example, Efron [6, 7, 8, 9], Efron and Gong [10] or the review of the above-mentioned works presented by Efron and Tibshirani [11]).

Thus, given a fixed value of $k$ obtained by one of the methods proposed in subsection 5.2 and a value of $\boldsymbol{\alpha}$, the following steps will be taken:

  1. (i) Randomly generate, with replacement, $m$ subsamples of the same size as the original sample. The value of $m$ must be large.

  2. (ii) For each of the above subsamples, the statistic of interest $\theta$ is calculated. This yields $m$ values of the statistic: $\theta_{1},\dots,\theta_{m}$.

  3. (iii) Calculate the mean, $\overline{\theta}=\frac{1}{m}\sum\limits_{i=1}^{m}\theta_{i}$, and the sample quasi-variance, $\sigma^{2}_{\theta}=\frac{1}{m-1}\sum\limits_{i=1}^{m}(\theta_{i}-\overline{\theta})^{2}$, of the $m$ values obtained in the previous step. Both measures can be used to approximate a confidence interval through the expression $\overline{\theta}\pm 1.96\cdot\sigma_{\theta}$. Similarly, $\overline{\theta}$ can be used as an approximation of the point estimator of $\theta$.

  4. (iv) Alternatively, another confidence region can be obtained by taking the 0.025 and 0.975 percentiles of the $m$ values calculated in step (ii) as the lower and upper limits, respectively: $[P_{0.025}(\theta_{1},\dots,\theta_{m}),P_{0.975}(\theta_{1},\dots,\theta_{m})]$.

What is of interest in this work are the cases in which $\theta$ is equal to $\widehat{\boldsymbol{\beta}}(k,1)$ or $GoF(k,1)$.
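A sketch of these steps in R for $\theta=\widehat{\boldsymbol{\beta}}(k,1)$, reusing the hypothetical beta_penalized() helper (m, k and alpha are fixed beforehand as described above):

    # bootstrap approximation of the distribution of the penalized estimates
    boot_penalized <- function(X, y, alpha, k, h = 1, m = 10000) {
      n <- nrow(X)
      boot <- replicate(m, {
        idx <- sample.int(n, n, replace = TRUE)      # resample with replacement
        as.numeric(beta_penalized(X[idx, , drop = FALSE], y[idx], alpha, k, h))
      })
      list(mean  = rowMeans(boot),                   # point approximation
           sd    = apply(boot, 1, sd),               # quasi-standard deviation
           ci_95 = apply(boot, 1, quantile, probs = c(0.025, 0.975)))
    }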

Example 4 (continuation of Example 3)

Considering:

  • the values 0.01, 0.03 and 0.06 of $k$ which make, respectively, $CN(k)<20$; the mean square error minimum and $CN(k)<10$; or $VIF(i,k)<10$ for $i=2,3$ (see Example 3),

  • $\boldsymbol{\alpha}=(-29.476251,-18.036792,-3.601057)^{t}$, and

  • $m=10000$,

Table 3 shows the approximation of the point and interval estimators of $\widehat{\boldsymbol{\beta}}(k,1)$ and $GoF(k,1)$.

It can be seen that the approximations of the point estimators, $\overline{\theta}$, are very close to the values of $\widehat{\boldsymbol{\beta}}(k,1)$ and $GoF(k,1)$ obtained from expressions (8) and (16), respectively. Consequently, the values of $\widehat{\boldsymbol{\beta}}(k,1)$ and $GoF(k,1)$ belong to the two calculated confidence intervals.

Moreover, from these ranges it could be argued that the estimates of the coefficients $\beta_{1}$ and $\beta_{3}$ are significantly different from zero for the three values of $k$ considered, since the intervals obtained do not contain zero, while it cannot be discarded that the estimate of the coefficient $\beta_{2}$ is null. The same behaviour is observed as in OLS, perhaps due to the fact that the estimates obtained are still far from the values of $\boldsymbol{\alpha}$.

In this context, considering $k=100$ (the maximum value of $k$ considered in Example 3 to determine a value for this parameter), it is observed that the estimates of $\widehat{\boldsymbol{\beta}}(k,1)$ change substantially from the previous ones. In this case, all the coefficients can be considered significantly different from zero, although the estimates of the constant and the third coefficient are far from the values proposed in $\boldsymbol{\alpha}$.

It is curious to note that $\widehat{\boldsymbol{\beta}}_{3}(k,1)$ has moved away from -3.601057: the value -3.6767991 for $k=0.01$ changes to 0.3170817 for $k=100$. Note that for $k=100$ a weight of 0.00990099 would be given to the minimisation of $SCR(\boldsymbol{\beta})$ against a value of 0.990099 for $(\boldsymbol{\beta}-\boldsymbol{\alpha})^{t}(\boldsymbol{\beta}-\boldsymbol{\alpha})$.

Finally, considering $k=1000$, the estimate of $\widehat{\boldsymbol{\beta}}_{3}(k,1)$ is equal to 0.03183904, with approximations of the associated confidence intervals equal to $[-0.06174285,0.1239955]$ and $[-0.0678132,0.118499]$, i.e., a coefficient not significantly different from zero is obtained. Once again, it is interesting to note that the third variable initially has very similar values for $\widehat{\beta}_{3}$ and $\widehat{\alpha}_{3}$ (see Example 2) and that, as a consequence of the linear relationship with the second variable, the coefficient becomes not significantly different from zero when using the penalized estimator. $\Box$

$i$   $\theta=\widehat{\boldsymbol{\beta}}_{i}(k,1)$, $k=0.01$   $\overline{\theta}$   $\sigma_{\theta}$   $\overline{\theta}\pm 1.96\cdot\sigma_{\theta}$   $[P_{0.025}(\theta_{1},\dots,\theta_{m}),P_{0.975}(\theta_{1},\dots,\theta_{m})]$
1 4.3580904 4.3637278 0.2876449 [3.803312, 4.924143] [3.790357, 4.914893]
2 0.3809489 0.3793928 0.9808815 [-1.550971, 2.309756] [-1.564748, 2.283066]
3 -3.6770687 -3.6767991 0.1959585 [-4.062437, -3.291161] [-4.057372, -3.287766]
$i$   $\theta=\widehat{\boldsymbol{\beta}}_{i}(k,1)$, $k=0.03$   $\overline{\theta}$   $\sigma_{\theta}$   $\overline{\theta}\pm 1.96\cdot\sigma_{\theta}$   $[P_{0.025}(\theta_{1},\dots,\theta_{m}),P_{0.975}(\theta_{1},\dots,\theta_{m})]$
1 4.3264551 4.3226154 0.281187 [3.757987, 4.887244] [3.743654, 4.876066]
2 0.2552237 0.2439003 0.9683303 [-1.667152, 2.154952] [-1.708548, 2.117196]
3 -3.6519522 -3.6496978 0.1934983 [-4.031710, -3.267686] [-4.025403, -3.259135]
$i$   $\theta=\widehat{\boldsymbol{\beta}}_{i}(k,1)$, $k=0.06$   $\overline{\theta}$   $\sigma_{\theta}$   $\overline{\theta}\pm 1.96\cdot\sigma_{\theta}$   $[P_{0.025}(\theta_{1},\dots,\theta_{m}),P_{0.975}(\theta_{1},\dots,\theta_{m})]$
1 4.2795271 4.27471093 0.2873334 [3.717146, 4.832276] [3.695980, 4.823938]
2 0.0691914 0.05400399 0.9622777 [-1.831101, 1.939109] [-1.810631, 1.918534]
3 -3.6147874 -3.61174146 0.1923354 [-3.988318, -3.235165] [-3.984625, -3.238844]
$i$   $\theta=\widehat{\boldsymbol{\beta}}_{i}(k,1)$, $k=100$   $\overline{\theta}$   $\sigma_{\theta}$   $\overline{\theta}\pm 1.96\cdot\sigma_{\theta}$   $[P_{0.025}(\theta_{1},\dots,\theta_{m}),P_{0.975}(\theta_{1},\dots,\theta_{m})]$
1 -14.7479306 -14.8279630 0.3758633 [-15.564655, -14.0912709] [-15.6807864, -14.1957101]
2 -19.3578449 -19.3543196 0.2441033 [-19.832762, -18.8758772] [-19.8351704, -18.8815710]
3 0.3169069 0.3170817 0.05212487 [0.214917, 0.4192465] [0.2151219, 0.4176053]
$k$   $\theta=GoF(k,1)$   $\overline{\theta}$   $\sigma_{\theta}$   $\overline{\theta}\pm 1.96\cdot\sigma_{\theta}$   $[P_{0.025}(\theta_{1},\dots,\theta_{m}),P_{0.975}(\theta_{1},\dots,\theta_{m})]$
0.01 0.999884 0.9998847 2.468503\cdot 10^{-5} [0.9998363, 0.9999331] [0.9998286, 0.9999254]
0.03 0.999884 0.9998847 2.476162\cdot 10^{-5} [0.9998362, 0.9999333] [0.9998297, 0.9999260]
0.06 0.9998838 0.9998844 2.470472\cdot 10^{-5} [0.9998360, 0.9999329] [0.9998289, 0.9999251]
100 0.992539 0.9924073 0.001157019 [0.9901396, 0.9946751] [0.9898363, 0.9943386]
Table 3: Bootstrap inference for the second case of Example 2

7 Stability of the penalized estimator

In order to study the stability of the estimators, the independent variables of the multiple linear regression model will be perturbed by 1% following the procedure described in Belsley [3]. Thus, given a vector \mathbf{x} of size n, a 1% perturbation of it is given by the expression:

𝐱p=𝐱+0.01𝐩𝐱𝐩,\mathbf{x}_{p}=\mathbf{x}+0.01\cdot\mathbf{p}\cdot\frac{||\mathbf{x}||}{||\mathbf{p}||},

where 𝐩\mathbf{p} is a random vector, of equal dimension to 𝐱\mathbf{x}, and 𝐱=i=1nxi2||\mathbf{x}||=\sqrt{\sum\limits_{i=1}^{n}x_{i}^{2}}.

The effect of the perturbation on the estimation of the coefficients is quantified by the following expression:

𝜷~𝜷~p𝜷~100%,\frac{||\widetilde{\boldsymbol{\beta}}-\widetilde{\boldsymbol{\beta}}_{p}||}{||\widetilde{\boldsymbol{\beta}}||}\cdot 100\%, (19)

where \widetilde{\boldsymbol{\beta}} denotes an estimator of the parameter \boldsymbol{\beta} and the subindex p refers to the estimates obtained in the perturbed model, i.e., the model in which all the independent variables (without considering the intercept) have been perturbed.777In this case it is considered that all variables are quantitative. If there were any binary variables, it would be advisable to use a perturbation method different from the one described.

To obtain consolidated results, the above procedure will be performed 1000 times, and the mean together with the confidence region determined by the 0.025 and 0.975 percentiles will be used for interpretation.
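A minimal sketch in R of this perturbation scheme and of the stability measure (19) could be the following, where X (the matrix of independent variables to be perturbed, intercept excluded), y and the generic function estimator are assumed to be available:

```r
# Minimal sketch (R): the 1% perturbation of Belsley [3] and the stability measure (19).
# 'X', 'y' and the generic function 'estimator' are assumptions of this sketch.
perturb <- function(x) {
  p <- rnorm(length(x))                              # random vector of the same dimension
  x + 0.01 * p * sqrt(sum(x^2)) / sqrt(sum(p^2))     # x_p = x + 0.01 * p * ||x|| / ||p||
}

stability <- function(X, y, estimator, reps = 1000) {
  beta <- estimator(X, y)                            # estimates in the original model
  out <- replicate(reps, {
    Xp <- apply(X, 2, perturb)                       # perturb every independent variable
    100 * sqrt(sum((beta - estimator(Xp, y))^2)) / sqrt(sum(beta^2))   # expression (19)
  })
  c(mean = mean(out), quantile(out, c(0.025, 0.975)))
}
```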

Example 5 (continuation of example 4)

In Example 2, in which the degree of essential approximate multicollinearity is troubling, the independent variables have been perturbed 1000 times for k=0 and for each value of k considered in Example 4: k=0.01, 0.03, 0.06, 100, 1000.

For k=0 (i.e., when OLS is applied), it is found that perturbing the independent variables by 1% implies a variation of 38.27699% in the coefficient estimates. Therefore, this is a case in which the multicollinearity in the model causes numerical instability, since a small change in the independent variables implies a significant variation in the estimates of their coefficients.

Table 4 shows the mean, 0.025 percentile and 0.975 percentile of the percentage variation over the 1000 iterations performed for k=0, 0.01, 0.03, 0.06, 100, 1000 and for each estimator: OLS (k=0), ridge and penalized:

  • When k=0.01, 0.03, 0.06, it is observed that the values obtained for both the mean and the percentiles are very similar to those of OLS; that is, the values of k that minimize the mean squared error, or that bring measures such as the condition number or the variance inflation factor below the established thresholds, do not substantially improve the numerical instability of the model.

  • It is observed that for the high values of kk considered, k=100,1000k=100,1000, stable estimators are achieved in the presence of small changes in the independent variables.

To finish, Figure 1 shows the average percentage change (over 1000 iterations) in the coefficient estimates in the presence of a 1% perturbation of the independent variables for k\in\{0,1,2,\dots,199\}. It is observed that the penalized estimator (blue, dotted) is more stable against the discussed perturbations than the ridge estimator (light blue, dashed) as the value of k increases. \Box

Ridge Penalized
k Mean P_{0.025} P_{0.975} Mean P_{0.025} P_{0.975}
0 38.27699 23.95753 55.43604 38.27699 23.95753 55.43604
0.01 38.22262 23.95097 55.33613 38.20928 24.02455 55.36458
0.03 38.11476 23.94311 55.13764 38.06461 24.15165 55.20508
0.06 37.95507 23.93155 54.84321 37.82431 24.26060 54.92455
100 15.65896 13.62525 17.41957 2.535079 2.156762 2.913410
1000 3.256717 2.822840 3.643900 0.3383532 0.3016403 0.3746771
Table 4: Mean, 0.025 percentile and 0.975 percentile of the percentage variation over the 1000 iterations performed for k=0, 0.01, 0.03, 0.06, 100, 1000 for each estimator.
Figure 1: Average percentage change (over 1000 iterations) in the coefficient estimates in the presence of a 1% perturbation of the independent variables for k\in\{0,1,2,\dots,199\}: OLS estimator in orange, ridge estimator in light blue and penalized estimator in blue

8 Example

Year 𝐃\mathbf{D} 𝐂\mathbf{C} 𝐈\mathbf{I}
1996 3.80510 4.7703 4.8786
1997 3.94580 4.7784 5.0510
1998 4.05790 4.9348 5.3620
1999 4.19130 5.0998 5.5585
2000 4.35850 5.2907 5.8425
2001 4.54530 5.4335 6.1523
2002 4.81490 5.6194 6.5206
2003 5.12860 5.8318 6.9151
2004 5.61510 6.1258 7.4230
2005 6.22490 6.4386 7.8024
2006 6.78640 6.7394 8.4297
2007 7.49440 6.9104 8.7241
2008 8.39930 7.0993 8.8819
2009 9.39510 7.2953 9.1636
2010 10.68000 7.5614 9.7272
2011 12.07100 7.8036 10.3010
2012 13.44821 8.0441 10.9830
Table 5: Data on credit in the United States

Table 5 shows the outstanding mortgage debt, \mathbf{D}, personal consumption, \mathbf{C}, personal income, \mathbf{I}, and outstanding consumer credit, \mathbf{CP}, for the years 1996 to 2012, in relation to credit in the United States. This dataset was previously used by [54]. The results of the OLS estimation are shown in Table 6.

Variable Estimate p-value
Intercept 5.469264 (13.016791) 0.681
\mathbf{C} -4.252429 (5.135058) 0.423
\mathbf{I} 3.120395 (2.035671) 0.149
\mathbf{CP} 0.002879 (0.005764) 0.626
F-statistic 52.3 0.0000001629
R^{2} 0.9235
Table 6: OLS estimation of the dataset on bank credit in the United States

It is observed that the OLS estimates of the coefficients on consumption, income and outstanding credit are not significantly different from zero while, at the same time, the model is found to be jointly valid.888Throughout the section, a 5% significance level is considered. This contradiction suggests that the degree of approximate multicollinearity in the model affects its statistical analysis.

For this model, the following simple regressions can be performed:

𝐃=α1+α2𝐂+𝐯,𝐃=α1+α3𝐈+𝐯,𝐃=α1+α4𝐂𝐏+𝐯,\mathbf{D}=\alpha_{1}+\alpha_{2}\mathbf{C}+\mathbf{v},\quad\mathbf{D}=\alpha_{1}+\alpha_{3}\mathbf{I}+\mathbf{v},\quad\mathbf{D}=\alpha_{1}+\alpha_{4}\mathbf{CP}+\mathbf{v},

which in all cases have slope coefficients significantly different from zero and lead to setting \boldsymbol{\alpha}=\left(6.76245, 2.62875, 1.51533, 0.00519\right)^{t}.
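A sketch in R of how this vector can be built is given below; credit is an assumed data frame containing the series of Table 5 (columns D, C, I and CP), and the first component of \boldsymbol{\alpha} is taken as the sample mean of the dependent variable, which reproduces the reported value 6.76245 and is consistent with the role of \overline{\mathbf{y}} in Appendix A.3:

```r
# Sketch (R): construction of the vector alpha from the simple regressions.
# 'credit' is an assumed data frame with columns D, C, I and CP (Table 5).
a2 <- coef(lm(D ~ C,  data = credit))["C"]     # slope of D on C
a3 <- coef(lm(D ~ I,  data = credit))["I"]     # slope of D on I
a4 <- coef(lm(D ~ CP, data = credit))["CP"]    # slope of D on CP
alpha <- unname(c(mean(credit$D), a2, a3, a4)) # first component: mean of the dependent variable
```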

8.1 Diagnosis of troubling approximate multicollinearity.

This possibly troubling approximate multicollinearity is confirmed by the conclusions obtained from the following diagnostic measures (calculated with the “multiColl” package [30] for R [35]):

  • the coefficients of variation of the independent variables are equal to 0.1718940, 0.2482804 and 0.3607848 (none of them lower than the threshold of 0.1002506 established as troubling by Salmerón et al. [41]),

  • the values for the variance inflation factor are equal to 589.7540, 281.8862 and 189.4874 (higher than the threshold of 10 established as troubling, for example, by Marquardt [26]),

  • the condition number is equal to 332.3 (above the threshold of 30 established as troubling by Belsley et al. [1] or Belsley [3, 2]) and

  • the determinant of the correlation matrix is equal to 0.00002007699 (below the threshold of 0.1013 established as troubling by Garcia et al. [14]).

More specifically, it can be stated that the degree of essential multicollinearity existing in the model is troubling, while that of the non-essential type is not.
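These measures can also be reproduced with base R, as in the following sketch (the paper relies on the multiColl package for this purpose); regs is assumed to be the matrix with columns \mathbf{C}, \mathbf{I} and \mathbf{CP}, and Xc the same matrix with a column of ones prepended, the VIFs being obtained as the diagonal of the inverse of the correlation matrix and the condition number from the unit-length-scaled columns of Xc:

```r
# Sketch (R) of the diagnostic measures using base R (the paper uses the
# multiColl package). 'regs' and 'Xc' are assumptions of this sketch.
cv   <- apply(regs, 2, sd) / abs(colMeans(regs))       # coefficients of variation
vifs <- diag(solve(cor(regs)))                         # VIF_j = 1 / (1 - R_j^2)
detR <- det(cor(regs))                                 # determinant of the correlation matrix
Xu   <- apply(Xc, 2, function(x) x / sqrt(sum(x^2)))   # unit-length columns
cn   <- max(svd(Xu)$d) / min(svd(Xu)$d)                # condition number
```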

8.2 Traces

Figures 2 to 5 show the traces of the estimates, their norms, the goodness of fit and the mean squared error of the penalized estimator (left) and of the ridge estimator (right), considering k\in\{0,0.01,0.02,\dots,100\}; a sketch in R for reproducing these traces is given after Figure 5. It is concluded that:

  • The estimates of the ridge estimator decay rapidly towards zero, while those of the penalized estimator do not, as they converge to the values given by the vector \boldsymbol{\alpha}.

  • The norm of the penalized estimator decreases until it reaches a minimum value, from which it grows in the direction of its horizontal asymptote; while the norm of the ridge estimator is clearly strictly decreasing towards zero.

  • The goodness of fit in both cases is decreasing, although in the case of the penalized estimator it takes smaller values than in the ridge (it decreases more rapidly).

  • It is observed that in both cases there is a value of k that yields a minimum mean squared error smaller than that of OLS. Moreover, the mean squared error of the penalized and ridge estimators is smaller than that of OLS for all positive values of k considered.

Figure 2: Trace of the coefficient estimates of the penalized estimator (h=1h=1, left) and ridge (h=0h=0, right) of the model on bank credit in the United States for k{0,0.01,0.02,,100}k\in\{0,0.01,0.02,\dots,100\}
Figure 3: Trace of the norms of the coefficient estimates of the penalized estimator (h=1h=1, left) and ridge (h=0h=0, right) of the model on bank credit in the United States for k{0,0.01,0.02,,100}k\in\{0,0.01,0.02,\dots,100\}
Figure 4: Trace of the goodness-of-fit of the penalized estimator (h=1h=1, left) and ridge (h=0h=0, right) of the model on bank credit in the United States for k{0,0.01,0.02,,100}k\in\{0,0.01,0.02,\dots,100\}
Figure 5: Trace of the mean squared errors of the penalized estimator (h=1h=1, left) and ridge (h=0h=0, right) of the model on bank credit in the United States for k{0,0.01,0.02,,100}k\in\{0,0.01,0.02,\dots,100\} (red dashed line represents the MSE asymptote)
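The traces of the estimates and of their norms (Figures 2 and 3) can be reproduced from the normal equations \left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)\cdot\widehat{\boldsymbol{\beta}}(k,h)=\mathbf{X}^{t}\mathbf{y}+k\cdot h\cdot\boldsymbol{\alpha} (see Appendix A.3), where h=1 corresponds to the penalized estimator and h=0 to the ridge estimator. A minimal sketch in R, assuming the matrix X (intercept included), the vector y and the vector alpha are available, could be:

```r
# Sketch (R) of the traces of Figures 2 and 3; 'X', 'y' and 'alpha' are assumed.
# (X'X + k I) b = X'y + k h alpha; h = 1 penalized, h = 0 ridge.
pen_est <- function(X, y, alpha, k, h = 1) {
  solve(crossprod(X) + k * diag(ncol(X)), crossprod(X, y) + k * h * alpha)
}

ks <- seq(0, 100, by = 0.01)
tr <- t(sapply(ks, function(k) pen_est(X, y, alpha, k, h = 1)))
matplot(ks, tr, type = "l", lty = 1, xlab = "k",
        ylab = "coefficient estimates")                # trace of the estimates (Figure 2)
plot(ks, rowSums(tr^2), type = "l", xlab = "k",
     ylab = "squared norm of the estimator")           # trace of the norm (Figure 3)
```

The remaining traces (goodness of fit and mean squared error) can be obtained analogously by evaluating the corresponding expressions along the same grid of k.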

On the other hand, considering that k\in\{0,0.01,0.02,\dots,1\}, Figure 6 shows the traces of the variance inflation factor and the condition number. Note that in this case they coincide for the ridge and penalized estimators, since the expanded matrix from which these measures are calculated coincides in both cases.

It is observed that in both cases they decrease towards their minimum values and that there are values of k for which they fall below the thresholds considered to be troubling.

Figure 6: Trace of variance inflation factor (left) and condition number (right) of the model on bank credit in the United States for k{0,0.01,0.02,,1}k\in\{0,0.01,0.02,\dots,1\} (red dashed line represents the thresholds: 10 and 20, respectively.)

Finally, considering once again that k\in\{0,0.01,0.02,\dots,100\}, Figure 7 shows the trace of ||\boldsymbol{\alpha}-\boldsymbol{\beta}(k,h)||/||\boldsymbol{\alpha}||. It is observed to be strictly monotonically decreasing, reaching values below 20% for the highest values of k.

Figure 7: Trace of 𝜶𝜷(k,h)/𝜶||\boldsymbol{\alpha}-\boldsymbol{\beta}(k,h)||/||\boldsymbol{\alpha}|| for the penalized estimator (h=1h=1) of the model on bank credit in the United States for k{0,0.01,0.02,,100}k\in\{0,0.01,0.02,\dots,100\}

8.3 Choice of penalty parameter kk and analysis of the model

As noted in the previous graphical representations, and following the guidelines of subsection 5.2, it is possible to choose a value of k that yields a variance inflation factor or a condition number below the established thresholds, or that minimizes the mean squared error. As a summary, Table 7 shows the value of k leading to the minimum mean squared error and the first values of k for which the condition number and the variance inflation factor fall below the thresholds traditionally set as troubling (a sketch of these selection rules is given after the table). Note that the penalized estimator attains a smaller minimum mean squared error than the ridge estimator.

Criterion Penalized estimator (h=1h=1) Ridge estimator (h=0h=0)
minimum MSE k=0.07 (5.131445) k=0.02 (40.14453)
CN lower than 30 k=0.01 (19.83053)
CN lower than 20 k=0.01 (19.83053)
CN lower than 10 k=0.04 (9.966163)
Maximum VIF lower than 10 k=0.08 (8.980033)
Table 7: Selection of the value of k following the guidelines of subsection 5.2 (in parentheses, the value of the measure for the value of k under consideration)
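The selection rules of Table 7 amount to scanning a grid of values of k; a sketch in R, assuming ks is the grid and mse, cn and max_vif are hypothetical precomputed series of the mean squared error, condition number and maximum VIF along that grid, could be:

```r
# Sketch (R) of the selection rules of Table 7; 'ks', 'mse', 'cn' and
# 'max_vif' are hypothetical precomputed objects assumed by this sketch.
k_mse   <- ks[which.min(mse)]          # k with minimum mean squared error
k_cn20  <- ks[which(cn < 20)[1]]       # first k with condition number below 20
k_cn10  <- ks[which(cn < 10)[1]]       # first k with condition number below 10
k_vif10 <- ks[which(max_vif < 10)[1]]  # first k with maximum VIF below 10
```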

For these values (including k=0), Table 8 shows the results obtained with the ridge estimator, while Table 9 shows the results obtained with the penalized estimator. Note that in these tables the bootstrap approximations to the confidence interval obtained from the 0.025 and 0.975 percentiles are coded as type 1, and those obtained from the mean and quasi-variance as type 2.
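A sketch in R of this bootstrap inference for a fixed value of k is given below; a case-resampling (pairs) bootstrap is assumed here as an illustration, and pen_est, X, y and alpha are as in the sketch of subsection 8.2:

```r
# Sketch (R) of the bootstrap inference for a fixed k, assuming a
# case-resampling bootstrap; 'pen_est', 'X', 'y' and 'alpha' as in subsection 8.2.
boot_pen <- function(X, y, alpha, k, h = 1, B = 10000) {
  n <- nrow(X)
  est <- replicate(B, {
    idx <- sample(n, replace = TRUE)                   # resample observations
    pen_est(X[idx, , drop = FALSE], y[idx], alpha, k, h)
  })
  m <- rowMeans(est); s <- apply(est, 1, sd)
  data.frame(mean = m, sd = s,
             low2 = m - 1.96 * s, up2 = m + 1.96 * s,  # type 2 interval
             low1 = apply(est, 1, quantile, 0.025),    # type 1 interval (percentiles)
             up1  = apply(est, 1, quantile, 0.975))
}
```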

Estimation of coefficients

It is observed that the estimates of the ridge estimator decrease towards zero, contrary to those of the penalized estimator, although the latter are still far from the values fixed in the vector \boldsymbol{\alpha}.

On the other hand, inference using bootstrap methods based on 10000 iterations indicates that the estimates obtained by the ridge estimator are not significantly different from zero for the values of k considered. However, this is not the case for the estimates obtained from the penalized estimator. Thus, for example, in the case where k=0.08:

  • The intercept is significantly different from zero with positive sign.

  • The coefficient of personal consumption is significantly different from zero with a negative sign. That is, higher personal consumption is associated with a decrease in mortgage debt, which could be interpreted as higher consumption in situations where there is less mortgage debt. However, it should be noted that this estimated sign is the opposite of that fixed in the vector \boldsymbol{\alpha}.999Considering that k=100, it is obtained that \widehat{\boldsymbol{\beta}}(h=1,k)=\left(6.4705, 1.6741, 0.8247, -0.009\right)^{t}, all of them being coefficients significantly different from zero. In this case, the sign of the second coefficient has been corrected in accordance with the specifications of the vector \boldsymbol{\alpha}, although the fourth coefficient now has a negative sign. This situation is more understandable because the value set in the vector \boldsymbol{\alpha} for this parameter, 0.00519, is susceptible to this possible change of sign. A root mean square error equal to 41.3982 is also obtained (it increases with respect to the minimum value although it is still lower than that of OLS), together with a condition number equal to 1.019 and a maximum variance inflation factor equal to 1.000193. That is, the multicollinearity detection measures are very close to their minimum values, so the degree of multicollinearity is very low. This is reflected in the fact that the average variation of the coefficient estimates when the independent variables are perturbed by 1% is 0.1607%.

  • The coefficient of outstanding credit is significantly different from zero with a positive sign. That is, higher credit outstanding is associated with an increase in mortgage debt, which could be interpreted as meaning that there is higher credit outstanding when mortgage debt is higher.

Note that in this case a mean squared error very close to the minimum value and much lower than that of OLS is obtained; at the same time, the degree of existing approximate multicollinearity is considered not to be troubling, since the condition number and all the variance inflation factors are less than 10.

Goodness-of-fit

Considering that for k=0 the results obtained coincide with those of OLS, it is observed that the value of the goodness of fit, 0.9878, does not coincide with the value of the coefficient of determination, 0.9235, reported in Table 6. This discrepancy is due, as discussed, to the fact that the proposed goodness of fit coincides with the coefficient of determination when k=0 only if the dependent variable has zero mean, a condition that is not verified in this case.

On the other hand, it is observed that the goodness of fit is very similar for the ridge and penalized estimators and, in addition, does not differ substantially from the value obtained with OLS.

Mean Square Error

With respect to the mean squared error, lower values are obtained in all cases for the penalized estimator. This result is in line with the Monte Carlo simulation of subsection 3.5.1: the penalized estimator performs better in terms of MSE than the ridge estimator when the troubling approximate multicollinearity is of the essential type.

This result is interesting as both the ridge and penalized estimators are biased estimators.

Existence of approximate multicollinearity

As previously discussed, in both cases the same values are obtained for the variance inflation factor and the condition number. The condition number is less than 20 in all cases considered, while the maximum VIF is less than 10 only for k=0.08k=0.08.

Numerical stability of the obtained estimates

By perturbing the observations 10000 times by 1%, as indicated in Section 7, and calculating their effect on the coefficient estimates according to expression (19), the average values shown in the tables are obtained.

It can be seen that, initially, small changes in the data result in an average variation of 73.9726% in the coefficient estimates. This indicates significant instability in the estimates obtained by OLS. This value decreases as the value of k increases, the decrease being more pronounced in the case of the penalized estimator. Thus, for example, for k=0.08 a perturbation of 1% in the independent variables implies an average variation of 7.6294% in the coefficient estimates.

k=0 k=0.01 k=0.02 k=0.04 k=0.08
Ridge Bootstrap Ridge Bootstrap Ridge Bootstrap Ridge Bootstrap Ridge Bootstrap
β1\beta_{1} 5.4693 2.2482 1.0412 0.2857 0.2629 -0.0943 -0.21 -0.3216 -0.428 -0.412
β2\beta_{2} -4.2524 -2.466 -2.4673 -1.7351 -2.1056 -1.5493 -1.814 -1.379 -1.5481 -1.2016
β3\beta_{3} 3.1204 2.0841 2.5273 1.887 2.3586 1.789 2.1596 1.6495 1.8911 1.4441
β4\beta_{4} 0.0029 0.0028 0.0014 0.002 0.0013 0.002 0.0014 0.0021 0.0017 0.0025
DS for β1\beta_{1} 2947.0815 13.9162 902.4274 3.2142 532.7874 1.8279 292.8689 1.004 154.0945 0.5946
DS for β2\beta_{2} 1146.1535 6.1901 350.9741 2.167 207.2237 1.6559 113.9276 1.3197 59.9713 1.0715
DS for β3\beta_{3} 338.336 3.0476 103.6891 1.9856 61.3139 1.8011 33.8661 1.6106 18.0615 1.382
DS for β4\beta_{4} 1.0981 0.0055 0.3364 0.0032 0.1987 0.0031 0.1095 0.0029 0.0579 0.0025
GoF 0.9878 0.991 0.9877 0.9905 0.9876 0.9903 0.9876 0.9902 0.9874 0.99
BIT1 for β1\beta_{1} (-23.3118, 28.5705) (-5.4661, 6.7827) (-3.4506, 3.648) (-2.2583, 1.7004) (-1.6039, 0.6764)
BIT1 for β2\beta_{2} (-13.1453, 9.7675) (-5.1669, 2.8114) (-4.0082, 1.9754) (-3.2576, 1.4472) (-2.7288, 1.0429)
BIT1 for β3\beta_{3} (-4.5941, 6.4143) (-2.4435, 4.7682) (-2.1057, 4.4248) (-1.7724, 4.0229) (-1.4312, 3.5192)
BIT1 for β4\beta_{4} (-0.0082, 0.0135) (-0.004, 0.0081) (-0.0035, 0.0081) (-0.0028, 0.0079) (-0.0018, 0.0074)
BIT1 for GoF (0.9852, 0.998) (0.9848, 0.9974) (0.9845, 0.9972) (0.9843, 0.997) (0.9841, 0.9968)
BIT2 for β1\beta_{1} (-25.0275, 29.5239) (-6.0141, 6.5856) (-3.6769, 3.4884) (-2.2895, 1.6464) (-1.5774, 0.7535)
BIT2 for β2\beta_{2} (-14.5985, 9.6666) (-5.9824, 2.5121) (-4.7948, 1.6962) (-3.9655, 1.2076) (-3.3018, 0.8986)
BIT2 for β3\beta_{3} (-3.8892, 8.0574) (-2.0049, 5.7788) (-1.7411, 5.3191) (-1.5072, 4.8063) (-1.2646, 4.1529)
BIT2 for β4\beta_{4} (-0.0081, 0.0136) (-0.0043, 0.0083) (-0.004, 0.008) (-0.0035, 0.0077) (-0.0024, 0.0074)
BIT2 for GoF (0.9847, 0.9973) (0.9843, 0.9967) (0.9841, 0.9966) (0.9839, 0.9965) (0.9837, 0.9963)
MSE 199.9497 44.3756 41.3225 43.3614 45.9795
CN 332.3 19.8305 14.0525 9.9662 7.0841
VIF(1,k) 589.754 60.3983 32.2129 16.9411 8.98
VIF(2,k) 281.8862 48.7101 28.4015 15.8204 8.6686
VIF(3,k) 189.4874 45.2022 27.2576 15.4841 8.5752
Stability 73.9726 50.4276 36.4043 24.5117 17.0305
Table 8: Analysis of the model on bank credit in the United States using the ridge estimator for k=0, 0.01, 0.02, 0.04, 0.08 (BIT1 = Bootstrap interval type 1; BIT2 = Bootstrap interval type 2; DS = Standard Deviation)
k=0 k=0.01 k=0.04 k=0.07 k=0.08
Penalized Bootstrap Penalized Bootstrap Penalized Bootstrap Penalized Bootstrap Penalized Bootstrap
β1\beta_{1} 5.4693 2.2482 4.585 4.0199 4.4378 4.3899 4.4938 4.5378 4.5171 4.5763
β2\beta_{2} -4.2524 -2.466 -3.8019 -3.1602 -3.4622 -3.0343 -3.247 -2.8617 -3.1853 -2.8093
β3\beta_{3} 3.1204 2.0841 2.8777 2.2792 2.4876 1.9617 2.182 1.6749 2.0917 1.5909
β4\beta_{4} 0.0029 0.0028 0.0028 0.0035 0.0035 0.0043 0.0041 0.0049 0.0042 0.0051
DS for β1\beta_{1} 2947.0815 13.9162 902.4274 3.1633 292.8689 0.9857 174.8012 0.6339 154.0945 0.5787
DS for β2\beta_{2} 1146.1535 6.1901 350.9741 2.1118 113.9276 1.3152 68.0214 1.1398 59.9713 1.1021
DS for β3\beta_{3} 338.336 3.0476 103.6891 1.9337 33.8661 1.6009 20.4131 1.4532 18.0615 1.4139
DS for β4\beta_{4} 1.0981 0.0055 0.3364 0.0032 0.1095 0.0028 0.0656 0.0026 0.0579 0.0025
GoF 0.9878 0.991 0.9878 0.9905 0.9877 0.9901 0.9875 0.9899 0.9874 0.9898
BIT1 for β1\beta_{1} (-23.3118, 28.5705) (-1.615, 10.4396) (2.5116, 6.3897) (3.2787, 5.7392) (3.4191, 5.6548)
BIT1 for β2\beta_{2} (-13.1453, 9.7675) (-6.4747, 1.3367) (-4.8868, -0.193) (-4.4573, -0.4378) (-4.3505, -0.4793)
BIT1 for β3\beta_{3} (-4.5941, 6.4143) (-1.9709, 5.1374) (-1.4629, 4.3341) (-1.3739, 3.8329) (-1.3547, 3.6979)
BIT1 for β4\beta_{4} (-0.0082, 0.0135) (-0.0025, 0.0096) (-0.0006, 0.01) (0.00005, 0.0101) (0.0008, 0.0101)
BIT1 for GoF (0.9852, 0.998) (0.9849, 0.9969) (0.9845, 0.9964) (0.9842, 0.9963) (0.9841, 0.9962)
BIT2 for β1\beta_{1} (-25.0275, 29.5239) (-2.1802, 10.2199) (2.4578, 6.3219) (3.2954, 5.7803) (3.442, 5.7106)
BIT2 for β2\beta_{2} (-14.5985, 9.6666) (-7.2993, 0.9789) (-5.612, -0.4566) (-5.0956, -0.6277) (-4.9695, -0.6491)
BIT2 for β3\beta_{3} (-3.8892, 8.0574) (-1.5109, 6.0694) (-1.1761, 5.0996) (-1.1733, 4.5232) (-1.1804, 4.3622)
BIT2 for β4\beta_{4} (-0.0081, 0.0136) (-0.0027, 0.0097) (-0.0013, 0.0099) (-0.0002, 0.01) (0.0001, 0.0101)
BIT2 for GoF (0.9847, 0.9973) (0.9845, 0.9965) (0.9842, 0.9961) (0.9839, 0.9959) (0.9838, 0.9958)
MSE 199.9497 22.2729 6.3279 5.4749 5.4808
CN 332.3 19.8305 9.9662 7.5635 7.0841
VIF(1,k) 589.754 60.3983 16.9411 10.131 8.98
VIF(2,k) 281.8862 48.7101 15.8204 9.7315 8.6686
VIF(3,k) 189.4874 45.2022 15.4841 9.6116 8.5752
Stability 73.9726 29.8586 12.0844 8.3133 7.6294
Table 9: Analysis of the model on bank credit in the United States using the penalized estimator for k=0, 0.01, 0.04, 0.07, 0.08 (BIT1 = Bootstrap interval type 1; BIT2 = Bootstrap interval type 2; DS = Standard Deviation)

9 Conclusions

This paper proposes estimating the coefficients of a multiple linear regression model by penalizing the function to be minimized, with the objective of obtaining estimates as close as possible to those given by the corresponding simple linear regressions, since these capture the relationship of each independent variable with the dependent variable when the rest of the explanatory variables vary as would be expected.

The main characteristics of the proposed penalized estimator (trace and norm of the estimator, variance-covariance matrix, goodness of fit and mean squared error) and its implications for the detection and treatment of multicollinearity have been thoroughly analyzed. This analysis has served to propose different possibilities for choosing the penalty parameter, so that, for a fixed value of this parameter, inference can be performed by bootstrap methodology. Since the objective is for the estimates to converge to those of the simple regressions, the proposed bootstrap inference can provide estimates of the coefficients of the independent variables that are significantly different from zero, contrary to what occurs with the ridge estimator, whose estimates converge to zero.

It has also been shown that the ridge estimator of Hoerl and Kennard [18, 19] is a particular case of the proposed estimator. Due to this relationship, several characteristics coincide in both estimators, such as the variance-covariance matrix and the calculation of the variance inflation factor (VIF) and the condition number (CN). This means that studies of these measures for the ridge estimator are directly applicable to the penalized estimator obtained.

It is also important to note that this estimator mitigates the numerical instability that can occur when estimating by OLS a multiple linear regression model where the approximate multicollinearity is troubling, so its application is interesting when this consequence of multicollinearity is present in the econometric model analyzed.

On the other hand, in the works of García et al. [13], Salmerón et al. [39] and Rodríguez et al. [37, 36], where the VIF, CN, goodness of fit and Stewart index (see [46]) are respectively extended to the ridge estimator, it is stated that the transformation of the data is not optional but is required for the correct application of these measures in ridge regression. This issue was already addressed by Marquardt in his work entitled “You should standardize the predictor variables in your regression models” [27], while Obenchain [32] indicated that when working with the ridge estimator the analysis should be completely redone if any transformation of the regressors is adopted. In this sense, this work establishes that for the penalized estimator it is not necessary to perform any transformation of the data, except for considering standardized data to calculate the VIF and unit-length data to calculate the CN.

Finally, future lines of work include combining the proposal with the LASSO estimator to emulate the elastic net methodology and perform variable selection, analyzing the behavior of the proposed estimator when there are outliers in the data, and comparing its performance in terms of MSE with the proposals of Wang et al. [52] and Lukman et al. [25]. A further goal is the creation of an R [35] package, from the code available on GitHub, that allows any user to apply the methodology presented in teaching, research and/or business environments.

Appendix A Results of interest

A.1 Norm of the penalized estimator

Taking into account the decomposition \mathbf{X}^{t}\mathbf{X}=\boldsymbol{\Gamma}\mathbf{D}_{\lambda_{i}}\boldsymbol{\Gamma}^{t}, where \boldsymbol{\Gamma}_{p\times p} is an orthogonal matrix (\boldsymbol{\Gamma}^{t}\boldsymbol{\Gamma}=\mathbf{I}=\boldsymbol{\Gamma}\boldsymbol{\Gamma}^{t}) containing the eigenvectors of \mathbf{X}^{t}\mathbf{X} and \mathbf{D}_{\lambda_{i}}=diag\left(\lambda_{1},\dots,\lambda_{p}\right) is a diagonal matrix of dimension p\times p containing its eigenvalues (which are real and positive), it is verified that:

𝐙(k)=(𝐗t𝐗+k𝐈)1=𝚪𝐃1λi+k𝚪t.\mathbf{Z}(k)=\left(\mathbf{X}^{t}\mathbf{X}+k\mathbf{I}\right)^{-1}=\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{\lambda_{i}+k}}\boldsymbol{\Gamma}^{t}. (20)

Denoting 𝚿p×1=𝐗t𝐲\boldsymbol{\Psi}_{p\times 1}=\mathbf{X}^{t}\mathbf{y} in expression (8), it is obtained that:

𝜷^(k,h)\displaystyle||\widehat{\boldsymbol{\beta}}(k,h)|| =\displaystyle= 𝜷^(k,h)t𝜷^(k,h)=(𝚿+kh𝜶)t𝐙(k)t𝐙(k)(𝚿+kh𝜶)\displaystyle\widehat{\boldsymbol{\beta}}(k,h)^{t}\widehat{\boldsymbol{\beta}}(k,h)=\left(\boldsymbol{\Psi}+k\cdot h\cdot\boldsymbol{\alpha}\right)^{t}\cdot\mathbf{Z}(k)^{t}\mathbf{Z}(k)\cdot\left(\boldsymbol{\Psi}+k\cdot h\cdot\boldsymbol{\alpha}\right)
=\displaystyle= 𝚿t𝚪𝐃1(λi+k)2𝚪t𝚿+2kh𝚿t𝚪𝐃1(λi+k)2𝚪t𝜶+k2h2𝜶t𝚪𝐃1(λi+k)2𝚪t𝜶\displaystyle\boldsymbol{\Psi}^{t}\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}\boldsymbol{\Psi}+2\cdot k\cdot h\cdot\boldsymbol{\Psi}^{t}\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}\boldsymbol{\alpha}+k^{2}\cdot h^{2}\cdot\boldsymbol{\alpha}^{t}\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}\boldsymbol{\alpha}
=\displaystyle= 𝐚t𝐃1(λi+k)2𝐚+2h𝐚t𝐃k(λi+k)2𝐛+h2𝐛t𝐃k2(λi+k)2𝐛\displaystyle\mathbf{a}^{t}\mathbf{D}_{\frac{1}{(\lambda_{i}+k)^{2}}}\mathbf{a}+2\cdot h\cdot\mathbf{a}^{t}\mathbf{D}_{\frac{k}{(\lambda_{i}+k)^{2}}}\mathbf{b}+h^{2}\cdot\mathbf{b}^{t}\mathbf{D}_{\frac{k^{2}}{(\lambda_{i}+k)^{2}}}\mathbf{b}
=\displaystyle= i=1pai2(λi+k)2+2hi=1pkaibi(λi+k)2+h2i=1pk2bi2(λi+k)2,\displaystyle\sum\limits_{i=1}^{p}\frac{a_{i}^{2}}{(\lambda_{i}+k)^{2}}+2\cdot h\cdot\sum\limits_{i=1}^{p}\frac{ka_{i}b_{i}}{(\lambda_{i}+k)^{2}}+h^{2}\cdot\sum\limits_{i=1}^{p}\frac{k^{2}b_{i}^{2}}{(\lambda_{i}+k)^{2}},

where it was considered that 𝐚p×1=𝚪t𝚿\mathbf{a}_{p\times 1}=\boldsymbol{\Gamma}^{t}\boldsymbol{\Psi} and 𝐛p×1=𝚪t𝜶\mathbf{b}_{p\times 1}=\boldsymbol{\Gamma}^{t}\boldsymbol{\alpha}.
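The following sketch in R numerically checks the above expression against the direct computation of \widehat{\boldsymbol{\beta}}(k,h)^{t}\widehat{\boldsymbol{\beta}}(k,h), assuming X, y, alpha and the function pen_est of subsection 8.2 are available:

```r
# Sketch (R): numerical check of the eigendecomposition expression for the
# (squared) norm; 'X', 'y', 'alpha' and 'pen_est' (subsection 8.2) are assumed.
k <- 0.5; h <- 1
eig <- eigen(crossprod(X), symmetric = TRUE)
a <- t(eig$vectors) %*% crossprod(X, y)     # a = Gamma' Psi
b <- t(eig$vectors) %*% alpha               # b = Gamma' alpha
lam <- eig$values
norm_eig <- sum(a^2 / (lam + k)^2) + 2 * h * sum(k * a * b / (lam + k)^2) +
  h^2 * sum(k^2 * b^2 / (lam + k)^2)
all.equal(norm_eig, sum(pen_est(X, y, alpha, k, h)^2))   # should be TRUE
```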

Deriving with respect to kk:

k(i=1pai2(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{a_{i}^{2}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1p2ai2(λi+k)3,\displaystyle-\sum\limits_{i=1}^{p}\frac{2a_{i}^{2}}{(\lambda_{i}+k)^{3}}, (21)
k(i=1pkaibi(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{ka_{i}b_{i}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1p(λik)aibi(λi+k)3,\displaystyle\sum\limits_{i=1}^{p}\frac{(\lambda_{i}-k)a_{i}b_{i}}{(\lambda_{i}+k)^{3}}, (22)
k(i=1pk2bi2(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{k^{2}b_{i}^{2}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1p2kλibi2(λi+k)3,\displaystyle\sum\limits_{i=1}^{p}\frac{2k\lambda_{i}b_{i}^{2}}{(\lambda_{i}+k)^{3}}, (23)

the first term is decreasing in k while the third term is increasing, since k,\lambda_{i}>0 and their derivatives, expressions (21) and (23) respectively, are negative and positive. At the same time, the sign of the derivative of the second term (expression (22)) is undetermined since it depends, on the one hand, on the sign of \lambda_{i}-k and, on the other hand, on the sign of a_{i}b_{i}.

In short, when h\not=0 nothing can be concluded about the monotonicity of ||\widehat{\boldsymbol{\beta}}(k,h)||, although it is certain that there exists a value of k that stabilizes this norm, since when k\rightarrow+\infty:

𝜷^(k,h)h2i=1pbi2=h2𝐛t𝐛=h2𝜶t𝚪t𝚪𝜶=h2𝜶t𝜶=h2𝜶.||\widehat{\boldsymbol{\beta}}(k,h)||\rightarrow h^{2}\cdot\sum\limits_{i=1}^{p}b_{i}^{2}=h^{2}\cdot\mathbf{b}^{t}\mathbf{b}=h^{2}\cdot\boldsymbol{\alpha}^{t}\boldsymbol{\Gamma}^{t}\boldsymbol{\Gamma}\boldsymbol{\alpha}=h^{2}\cdot\boldsymbol{\alpha}^{t}\boldsymbol{\alpha}=h^{2}\cdot||\boldsymbol{\alpha}||.

Finally, when h=0h=0 (the penalized estimator coincides with the ridge estimator), it is clear that 𝜷^(k,h)||\widehat{\boldsymbol{\beta}}(k,h)|| is decreasing in kk.

A.2 Optimality of the variance-covariance matrix

Starting with the expression (20), it follows that 𝐙(k)𝐗t𝐗𝐙(k)=𝚪𝐃λi(λi+k)2𝚪t\mathbf{Z}(k)\mathbf{X}^{t}\mathbf{X}\mathbf{Z}(k)=\boldsymbol{\Gamma}\mathbf{D}_{\frac{\lambda{i}}{(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}. Since (𝐗t𝐗)1=𝚪𝐃1λi𝚪t\left(\mathbf{X}^{t}\mathbf{X}\right)^{-1}=\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{\lambda_{i}}}\boldsymbol{\Gamma}^{t}, it is clear that:

var(𝜷^(k,h))var(𝜷^)=σ2𝚪𝐃λi2(λi+k)2λi(λi+k)2𝚪t.var\left(\widehat{\boldsymbol{\beta}}(k,h)\right)-var\left(\widehat{\boldsymbol{\beta}}\right)=\sigma^{2}\cdot\boldsymbol{\Gamma}\mathbf{D}_{\frac{\lambda_{i}^{2}-(\lambda_{i}+k)^{2}}{\lambda_{i}(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}.

Considering 𝐜p×1\mathbf{c}_{p\times 1} a non-zero vector and taking into account that k>0k>0 and λi>0\lambda_{i}>0 for i=1,,pi=1,\dots,p, it is verified that:

𝐜𝚪𝐃λi2(λi+k)2λi(λi+k)2𝚪t𝐜=𝐝𝐃λi2(λi+k)2λi(λi+k)2𝐝=i=1pλi2(λi+k)2λi(λi+k)2di2=i=1pk2+2λikλi(λi+k)2di2<0,\mathbf{c}\boldsymbol{\Gamma}\mathbf{D}_{\frac{\lambda_{i}^{2}-(\lambda_{i}+k)^{2}}{\lambda_{i}(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}\mathbf{c}=\mathbf{d}\mathbf{D}_{\frac{\lambda_{i}^{2}-(\lambda_{i}+k)^{2}}{\lambda_{i}(\lambda_{i}+k)^{2}}}\mathbf{d}=\sum\limits_{i=1}^{p}\frac{\lambda_{i}^{2}-(\lambda_{i}+k)^{2}}{\lambda_{i}(\lambda_{i}+k)^{2}}d_{i}^{2}=-\sum\limits_{i=1}^{p}\frac{k^{2}+2\lambda_{i}k}{\lambda_{i}(\lambda_{i}+k)^{2}}d_{i}^{2}<0,

where \mathbf{d}_{p\times 1}=\boldsymbol{\Gamma}^{t}\mathbf{c}. Therefore, \boldsymbol{\Gamma}\mathbf{D}_{\frac{\lambda_{i}^{2}-(\lambda_{i}+k)^{2}}{\lambda_{i}(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t} is a negative definite matrix and, since \sigma^{2}>0, it is verified that:

var(𝜷^(k,h))var(𝜷^)<0var(𝜷^(k,h))<var(𝜷^).var\left(\widehat{\boldsymbol{\beta}}(k,h)\right)-var\left(\widehat{\boldsymbol{\beta}}\right)<0\rightarrow var\left(\widehat{\boldsymbol{\beta}}(k,h)\right)<var\left(\widehat{\boldsymbol{\beta}}\right).
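This property can be checked numerically, for a given X and k>0, with the following sketch in R (\sigma^{2} cancels and is therefore omitted):

```r
# Sketch (R): numerical check that var(beta(k,h)) - var(beta_OLS) is negative
# definite for k > 0; 'X' is assumed and sigma^2 is omitted since it cancels.
k <- 0.5
XtX <- crossprod(X)
Zk  <- solve(XtX + k * diag(ncol(X)))
dif <- Zk %*% XtX %*% Zk - solve(XtX)                 # difference up to sigma^2
max(eigen(dif, symmetric = TRUE)$values) < 0          # should be TRUE
```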

A.3 Estimation errors

Following the development of Rodríguez et al. [37] (Proposition 1), taking into account that the expression (8) comes from the system of normal equations:

(𝐗t𝐗+k𝐈)𝜷^(k,h)=𝐗t𝐲+kh𝜶,\left(\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I}\right)\cdot\widehat{\boldsymbol{\beta}}(k,h)=\mathbf{X}^{t}\mathbf{y}+k\cdot h\cdot\boldsymbol{\alpha},

it must be verified that the first row of 𝐗t𝐗+k𝐈\mathbf{X}^{t}\mathbf{X}+k\cdot\mathbf{I} , multiplied by 𝜷^(k,h)\widehat{\boldsymbol{\beta}}(k,h):

\left(n+k,\sum\limits_{j=1}^{n}X_{2j},\cdots,\sum\limits_{j=1}^{n}X_{pj}\right)\cdot\left(\begin{array}[]{c}\widehat{\beta}_{1}(k,h)\\ \widehat{\beta}_{2}(k,h)\\ \vdots\\ \widehat{\beta}_{p}(k,h)\\ \end{array}\right)=(n+k)\cdot\widehat{\beta}_{1}(k,h)+\widehat{\beta}_{2}(k,h)\cdot\sum\limits_{j=1}^{n}X_{2j}+\cdots+\widehat{\beta}_{p}(k,h)\cdot\sum\limits_{j=1}^{n}X_{pj},

has to be equal to the first element of 𝐗t𝐲+kh𝜶\mathbf{X}^{t}\mathbf{y}+k\cdot h\cdot\boldsymbol{\alpha}, that is to say to j=1nyj+kh𝐲¯\sum\limits_{j=1}^{n}y_{j}+k\cdot h\cdot\overline{\mathbf{y}}. Consequently:

\sum\limits_{j=1}^{n}y_{j}+k\cdot h\cdot\overline{\mathbf{y}}=(n+k)\cdot\widehat{\beta}_{1}(k,h)+\widehat{\beta}_{2}(k,h)\cdot\sum\limits_{j=1}^{n}X_{2j}+\cdots+\widehat{\beta}_{p}(k,h)\cdot\sum\limits_{j=1}^{n}X_{pj},

or equivalently:

\sum\limits_{j=1}^{n}y_{j}=(n+k)\cdot\widehat{\beta}_{1}(k,h)+\widehat{\beta}_{2}(k,h)\cdot\sum\limits_{j=1}^{n}X_{2j}+\cdots+\widehat{\beta}_{p}(k,h)\cdot\sum\limits_{j=1}^{n}X_{pj}-k\cdot h\cdot\overline{\mathbf{y}}. (25)

On the other hand, based on 𝐞(k,h)=𝐲𝐗𝜷^(k,h)\mathbf{e}(k,h)=\mathbf{y}-\mathbf{X}\cdot\widehat{\boldsymbol{\beta}}(k,h), it is obtained that:

\mathbf{e}_{j}(k,h)=y_{j}-\left(\widehat{\beta}_{1}(k,h)+\widehat{\beta}_{2}(k,h)X_{2j}+\cdots+\widehat{\beta}_{p}(k,h)X_{pj}\right).

In this case:

\sum\limits_{j=1}^{n}\mathbf{e}_{j}(k,h)=\sum\limits_{j=1}^{n}y_{j}-\left(n\cdot\widehat{\beta}_{1}(k,h)+\widehat{\beta}_{2}(k,h)\sum\limits_{j=1}^{n}X_{2j}+\cdots+\widehat{\beta}_{p}(k,h)\sum\limits_{j=1}^{n}X_{pj}\right). (26)

Substituting expression (25) into (26), \sum\limits_{j=1}^{n}\mathbf{e}_{j}(k,h)=k\cdot\widehat{\beta}_{1}(k,h)-k\cdot h\cdot\overline{\mathbf{y}}, which in principle differs from zero if k\not=0.
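This identity can be checked numerically with the following sketch in R, where X (with a first column of ones), y, alpha and pen_est are as in the previous sketches and the first component of \boldsymbol{\alpha} equals \overline{\mathbf{y}}:

```r
# Sketch (R): numerical check of the identity for the sum of the residuals;
# 'X' (first column of ones), 'y', 'alpha' (with alpha[1] = mean(y)) and
# 'pen_est' (subsection 8.2) are assumed.
k <- 0.5; h <- 1
bkh <- pen_est(X, y, alpha, k, h)
e <- y - X %*% bkh                                    # residuals e(k,h)
all.equal(sum(e), k * bkh[1] - k * h * mean(y))       # should be TRUE
```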

A.4 Monotonicity of the goodness-of-fit

From expression (20) and taking into account that 𝚿=𝐗t𝐲\boldsymbol{\Psi}=\mathbf{X}^{t}\mathbf{y}, it is obtained:

𝜷^(k,h)=𝚪𝐃1λi+k𝚪t(𝚿+kh𝜶).\widehat{\boldsymbol{\beta}}(k,h)=\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{\lambda_{i}+k}}\boldsymbol{\Gamma}^{t}\left(\boldsymbol{\Psi}+k\cdot h\cdot\boldsymbol{\alpha}\right).

And, then:

𝜷^(k,h)t(𝐗t𝐗+2k𝐈)𝜷^(k,h)\displaystyle\widehat{\boldsymbol{\beta}}(k,h)^{t}\left(\mathbf{X}^{t}\mathbf{X}+2\cdot k\cdot\mathbf{I}\right)\widehat{\boldsymbol{\beta}}(k,h) =\displaystyle= (𝚿+kh𝜶)t𝚪𝐃1λi+k𝚪t(𝚪𝐃1λi𝚪t+2k𝚪𝚪t)\displaystyle\left(\boldsymbol{\Psi}+k\cdot h\cdot\boldsymbol{\alpha}\right)^{t}\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{\lambda_{i}+k}}\boldsymbol{\Gamma}^{t}\cdot\left(\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{\lambda_{i}}}\boldsymbol{\Gamma}^{t}+2\cdot k\cdot\boldsymbol{\Gamma}\boldsymbol{\Gamma}^{t}\right)
𝚪𝐃1λi+k𝚪t(𝚿+kh𝜶)\displaystyle\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{\lambda_{i}+k}}\boldsymbol{\Gamma}^{t}\left(\boldsymbol{\Psi}+k\cdot h\cdot\boldsymbol{\alpha}\right)
=\displaystyle= 𝐚t𝐃λi+2k(λi+k)2𝐚+2hk𝐚t𝐃λi+2k(λi+k)2𝐛+k2h2𝐛t𝐃λi+2k(λi+k)2𝐛.\displaystyle\mathbf{a}^{t}\mathbf{D}_{\frac{\lambda_{i}+2k}{(\lambda_{i}+k)^{2}}}\mathbf{a}+2\cdot h\cdot k\cdot\mathbf{a}^{t}\mathbf{D}_{\frac{\lambda_{i}+2k}{(\lambda_{i}+k)^{2}}}\mathbf{b}+k^{2}\cdot h^{2}\cdot\mathbf{b}^{t}\mathbf{D}_{\frac{\lambda_{i}+2k}{(\lambda_{i}+k)^{2}}}\mathbf{b}.
𝜷^(k,h)t𝜶\displaystyle\widehat{\boldsymbol{\beta}}(k,h)^{t}\boldsymbol{\alpha} =\displaystyle= (𝚿+kh𝜶)t𝚪𝐃1λi+k𝚪t𝜶\displaystyle\left(\boldsymbol{\Psi}+k\cdot h\cdot\boldsymbol{\alpha}\right)^{t}\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{\lambda_{i}+k}}\boldsymbol{\Gamma}^{t}\cdot\boldsymbol{\alpha}
=\displaystyle= 𝐚t𝐃1λi+k𝐛+kh𝐛t𝐃1λi+k𝐛,\displaystyle\mathbf{a}^{t}\mathbf{D}_{\frac{1}{\lambda_{i}+k}}\mathbf{b}+k\cdot h\cdot\mathbf{b}^{t}\mathbf{D}_{\frac{1}{\lambda_{i}+k}}\mathbf{b},

where 𝐚=𝚪t𝚿\mathbf{a}=\boldsymbol{\Gamma}^{t}\boldsymbol{\Psi} and 𝐛=𝚪t𝜶\mathbf{b}=\boldsymbol{\Gamma}^{t}\boldsymbol{\alpha}.

Thus,

𝜷^(k,h)t(𝐗t𝐗+2k𝐈)𝜷^(k,h)2kh𝜷^(k,h)t𝜶\displaystyle\widehat{\boldsymbol{\beta}}(k,h)^{t}\left(\mathbf{X}^{t}\mathbf{X}+2\cdot k\cdot\mathbf{I}\right)\widehat{\boldsymbol{\beta}}(k,h)-2\cdot k\cdot h\cdot\widehat{\boldsymbol{\beta}}(k,h)^{t}\boldsymbol{\alpha} =\displaystyle= 𝐚t𝐃λi+2k(λi+k)2𝐚+2h𝐚t𝐃k2(λi+k)2𝐛\displaystyle\mathbf{a}^{t}\mathbf{D}_{\frac{\lambda_{i}+2k}{(\lambda_{i}+k)^{2}}}\mathbf{a}+2\cdot h\cdot\mathbf{a}^{t}\mathbf{D}_{\frac{k^{2}}{(\lambda_{i}+k)^{2}}}\mathbf{b}
+h2𝐛t𝐃λik2(λi+k)2𝐛.\displaystyle+h^{2}\cdot\mathbf{b}^{t}\mathbf{D}_{-\frac{\lambda_{i}\cdot k^{2}}{(\lambda_{i}+k)^{2}}}\mathbf{b}.

Then, expression (16) can be rewritten as:

GoF(k,h)\displaystyle GoF(k,h) =\displaystyle= 𝐚t𝐃λi+2k(λi+k)2𝐚+2h𝐚t𝐃k2(λi+k)2𝐛+h2𝐛t𝐃λik2(λi+k)2𝐛𝐲t𝐲\displaystyle\frac{\mathbf{a}^{t}\mathbf{D}_{\frac{\lambda_{i}+2k}{(\lambda_{i}+k)^{2}}}\mathbf{a}+2\cdot h\cdot\mathbf{a}^{t}\mathbf{D}_{\frac{k^{2}}{(\lambda_{i}+k)^{2}}}\mathbf{b}+h^{2}\cdot\mathbf{b}^{t}\mathbf{D}_{-\frac{\lambda_{i}\cdot k^{2}}{(\lambda_{i}+k)^{2}}}\mathbf{b}}{\mathbf{y}^{t}\mathbf{y}} (27)
=\displaystyle= 1𝐲t𝐲i=1p(λi+2k)ai2(λi+k)2+2h1𝐲t𝐲i=1pk2aibi(λi+k)2+h21𝐲t𝐲i=1pλik2bi2(λi+k)2.\displaystyle\frac{1}{\mathbf{y}^{t}\mathbf{y}}\cdot\sum\limits_{i=1}^{p}\frac{(\lambda_{i}+2k)a_{i}^{2}}{(\lambda_{i}+k)^{2}}+2\cdot h\cdot\frac{1}{\mathbf{y}^{t}\mathbf{y}}\cdot\sum\limits_{i=1}^{p}\frac{k^{2}a_{i}b_{i}}{(\lambda_{i}+k)^{2}}+h^{2}\cdot\frac{1}{\mathbf{y}^{t}\mathbf{y}}\cdot\sum\limits_{i=1}^{p}\frac{-\lambda_{i}k^{2}b_{i}^{2}}{(\lambda_{i}+k)^{2}}.

Deriving with respect to kk:

k(i=1p(λi+2k)ai2(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{(\lambda_{i}+2k)a_{i}^{2}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1p2ai2k(λi+k)3,\displaystyle-\sum\limits_{i=1}^{p}\frac{2a_{i}^{2}k}{(\lambda_{i}+k)^{3}}, (28)
k(i=1pk2aibi(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{k^{2}a_{i}b_{i}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1p2aibikλi(λi+k)3,\displaystyle\sum\limits_{i=1}^{p}\frac{2a_{i}b_{i}k\lambda_{i}}{(\lambda_{i}+k)^{3}}, (29)
k(i=1pλik2bi2(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{-\lambda_{i}k^{2}b_{i}^{2}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1p2λi2kbi2(λi+k)3.\displaystyle-\sum\limits_{i=1}^{p}\frac{2\lambda_{i}^{2}kb_{i}^{2}}{(\lambda_{i}+k)^{3}}. (30)

Since k>0 and \lambda_{i}>0 for i=1,\dots,p, it is evident that the derivatives in expressions (28) and (30) are negative, which implies that the first and third summands of expression (27) are decreasing in k. Note that this implies that for h=0 (ridge regression) the goodness-of-fit is decreasing in k.

On the contrary, the sign of the derivative of expression (29) depends on the sign of a_{i}b_{i}, for i=1,\dots,p, and therefore no statement can be made about the monotonicity of this second summand. However, it is verified that this term has a horizontal asymptote:

i=1pk2aibi(λi+k)2i=1paibi=𝐚t𝐛=𝚿t𝜶 when k+,\sum\limits_{i=1}^{p}\frac{k^{2}a_{i}b_{i}}{(\lambda_{i}+k)^{2}}\longrightarrow\sum\limits_{i=1}^{p}a_{i}b_{i}=\mathbf{a}^{t}\mathbf{b}=\boldsymbol{\Psi}^{t}\boldsymbol{\alpha}\mbox{ when }k\rightarrow+\infty,

i.e., there must be a value of kk from which the value of this term stabilizes.

Taking into account the above analysis, it is to be expected that GoF(k,h) decreases as k increases, although this is not assured. In such a case, it would be verified that 2\cdot h\cdot\boldsymbol{\Psi}^{t}\boldsymbol{\alpha}<GoF(k,h)\leq R^{2}, where R^{2} is the coefficient of determination of model (1).

A.5 Monotonicity of \mathbf{S}(k,h)

Once again, from the expression (20):

𝐙(k)𝐗t𝐗𝐈\displaystyle\mathbf{Z}(k)\mathbf{X}^{t}\mathbf{X}-\mathbf{I} =\displaystyle= 𝚪𝐃kλi+k𝚪t,\displaystyle\boldsymbol{\Gamma}\mathbf{D}_{-\frac{k}{\lambda_{i}+k}}\boldsymbol{\Gamma}^{t},
(𝐙(k)𝐗t𝐗𝐈)𝐙(k)\displaystyle\left(\mathbf{Z}(k)\mathbf{X}^{t}\mathbf{X}-\mathbf{I}\right)\mathbf{Z}(k) =\displaystyle= 𝚪𝐃k(λi+k)2𝚪t,\displaystyle\boldsymbol{\Gamma}\mathbf{D}_{-\frac{k}{(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t},
𝜷t(𝐙(k)𝐗t𝐗𝐈)𝐙(k)𝜶\displaystyle\boldsymbol{\beta}^{t}\left(\mathbf{Z}(k)\mathbf{X}^{t}\mathbf{X}-\mathbf{I}\right)\mathbf{Z}(k)\boldsymbol{\alpha} =\displaystyle= 𝜷t𝚪𝐃k(λi+k)2𝚪t𝜶=𝐞t𝐃k(λi+k)2𝐛,\displaystyle\boldsymbol{\beta}^{t}\boldsymbol{\Gamma}\mathbf{D}_{-\frac{k}{(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}\boldsymbol{\alpha}=\mathbf{e}^{t}\mathbf{D}_{-\frac{k}{(\lambda_{i}+k)^{2}}}\mathbf{b},

where 𝐞=𝚪t𝜷\mathbf{e}=\boldsymbol{\Gamma}^{t}\boldsymbol{\beta}, it can be obtained:

2kh𝜷t(𝐙(k)𝐗t𝐗𝐈)𝐙(k)𝜶=2hi=1peibik2(λi+k)2.2\cdot k\cdot h\cdot\boldsymbol{\beta}^{t}\left(\mathbf{Z}(k)\mathbf{X}^{t}\mathbf{X}-\mathbf{I}\right)\mathbf{Z}(k)\boldsymbol{\alpha}=-2\cdot h\cdot\sum\limits_{i=1}^{p}\frac{e_{i}b_{i}k^{2}}{(\lambda_{i}+k)^{2}}. (31)

On the other hand:

𝜶t𝐙(k)t𝐙(k)𝜶=𝜶t𝚪𝐃1(λi+k)2𝚪t𝜶=𝐛t𝐃1(λi+k)2𝐛,\boldsymbol{\alpha}^{t}\mathbf{Z}(k)^{t}\mathbf{Z}(k)\boldsymbol{\alpha}=\boldsymbol{\alpha}^{t}\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}\boldsymbol{\alpha}=\mathbf{b}^{t}\mathbf{D}_{\frac{1}{(\lambda_{i}+k)^{2}}}\mathbf{b},

and, then:

k2h2𝜶t𝐙(k)t𝐙(k)𝜶=h2i=1pbi2k2(λi+k)2.k^{2}\cdot h^{2}\cdot\boldsymbol{\alpha}^{t}\mathbf{Z}(k)^{t}\mathbf{Z}(k)\boldsymbol{\alpha}=h^{2}\cdot\sum\limits_{i=1}^{p}\frac{b_{i}^{2}k^{2}}{(\lambda_{i}+k)^{2}}. (32)

From expressions (31) and (32) it is obtained that:

𝐒(k,h)=2hi=1peibik2(λi+k)2+h2i=1pbi2k2(λi+k)2.\mathbf{S}(k,h)=-2\cdot h\cdot\sum\limits_{i=1}^{p}\frac{e_{i}b_{i}k^{2}}{(\lambda_{i}+k)^{2}}+h^{2}\cdot\sum\limits_{i=1}^{p}\frac{b_{i}^{2}k^{2}}{(\lambda_{i}+k)^{2}}.

Deriving with respect to kk:

k(i=1peibik2(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{e_{i}b_{i}k^{2}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1p2keibiλi(λi+k)3,\displaystyle\sum\limits_{i=1}^{p}\frac{2ke_{i}b_{i}\lambda{i}}{(\lambda_{i}+k)^{3}}, (33)
k(i=1pbi2k2(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{b_{i}^{2}k^{2}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1p2bi2kλi(λi+k)3.\displaystyle\sum\limits_{i=1}^{p}\frac{2b_{i}^{2}k\lambda_{i}}{(\lambda_{i}+k)^{3}}. (34)

Since k>0 and \lambda_{i}>0 for i=1,\dots,p, the sign of the derivative in expression (33) depends on e_{i}b_{i}, for i=1,\dots,p, while the derivative in expression (34) is positive, so the second term of \mathbf{S}(k,h) is increasing in k.

Finally, considering that k+k\rightarrow+\infty:

𝐒(k,h)2hi=1peibi+h2i=1pbi2=2h𝐞t𝐛+h2𝐛t𝐛=2h𝜷t𝜶+h2𝜶t𝜶.\mathbf{S}(k,h)\longrightarrow-2\cdot h\cdot\sum\limits_{i=1}^{p}e_{i}b_{i}+h^{2}\cdot\sum\limits_{i=1}^{p}b_{i}^{2}=-2\cdot h\cdot\mathbf{e}^{t}\mathbf{b}+h^{2}\cdot\mathbf{b}^{t}\mathbf{b}=-2\cdot h\cdot\boldsymbol{\beta}^{t}\boldsymbol{\alpha}+h^{2}\cdot\boldsymbol{\alpha}^{t}\boldsymbol{\alpha}.

That is to say, \mathbf{S}(k,h) has a horizontal asymptote, so there must be a value of k at which it stabilizes.

A.6 Monotonicity of ||\boldsymbol{\alpha}-\widehat{\boldsymbol{\beta}}(k,h)||

Since the main objective of the proposed penalty is to ensure that the estimates of 𝜷\boldsymbol{\beta} are close to 𝜶\boldsymbol{\alpha}, it is interesting to analyze the behavior of α𝜷^(k,h)||\alpha-\widehat{\boldsymbol{\beta}}(k,h)||.

Taking into account that \widehat{\boldsymbol{\beta}}(k,h)=\widehat{\boldsymbol{\beta}}(k)+k\cdot h\cdot\mathbf{Z}(k)\boldsymbol{\alpha} with \widehat{\boldsymbol{\beta}}(k)=\mathbf{Z}(k)\mathbf{X}^{t}\mathbf{X}\widehat{\boldsymbol{\beta}} (see expression (10)), it is obtained that \boldsymbol{\alpha}-\widehat{\boldsymbol{\beta}}(k,h)=\left(\mathbf{I}-k\cdot h\cdot\mathbf{Z}(k)\right)\boldsymbol{\alpha}-\widehat{\boldsymbol{\beta}}(k). In this case:

α𝜷^(k,h)\displaystyle||\alpha-\widehat{\boldsymbol{\beta}}(k,h)|| =\displaystyle= ((𝐈kh𝐙(k))𝜶𝜷^(k))t((𝐈kh𝐙(k))𝜶𝜷^(k))\displaystyle\left(\left(\mathbf{I}-k\cdot h\cdot\mathbf{Z}(k)\right)\boldsymbol{\alpha}-\widehat{\boldsymbol{\beta}}(k)\right)^{t}\left(\left(\mathbf{I}-k\cdot h\cdot\mathbf{Z}(k)\right)\boldsymbol{\alpha}-\widehat{\boldsymbol{\beta}}(k)\right)
=\displaystyle= 𝜶t(𝐈kh𝐙(k))t(𝐈kh𝐙(k))𝜶2𝜶t(𝐈kh𝐙(k))t𝜷^(k)+𝜷^(k)t𝜷^(k).\displaystyle\boldsymbol{\alpha}^{t}\left(\mathbf{I}-k\cdot h\cdot\mathbf{Z}(k)\right)^{t}\left(\mathbf{I}-k\cdot h\cdot\mathbf{Z}(k)\right)\boldsymbol{\alpha}-2\boldsymbol{\alpha}^{t}\left(\mathbf{I}-k\cdot h\cdot\mathbf{Z}(k)\right)^{t}\widehat{\boldsymbol{\beta}}(k)+\widehat{\boldsymbol{\beta}}(k)^{t}\widehat{\boldsymbol{\beta}}(k).

From 𝐙(k)=𝚪𝐃1λi+k𝚪t\mathbf{Z}(k)=\boldsymbol{\Gamma}\mathbf{D}_{\frac{1}{\lambda_{i}+k}}\boldsymbol{\Gamma}^{t}, 𝐗t𝐗=𝚪𝐃λi𝚪t\mathbf{X}^{t}\mathbf{X}=\boldsymbol{\Gamma}\mathbf{D}_{\lambda_{i}}\boldsymbol{\Gamma}^{t} and remembering that 𝐛=𝚪t𝜶\mathbf{b}=\boldsymbol{\Gamma}^{t}\boldsymbol{\alpha} and 𝐞=𝚪t𝜷\mathbf{e}=\boldsymbol{\Gamma}^{t}\boldsymbol{\beta}, it is obtained that:

𝜶t(𝐈kh𝐙(k))t(𝐈kh𝐙(k))𝜶\displaystyle\boldsymbol{\alpha}^{t}\left(\mathbf{I}-k\cdot h\cdot\mathbf{Z}(k)\right)^{t}\left(\mathbf{I}-k\cdot h\cdot\mathbf{Z}(k)\right)\boldsymbol{\alpha} =\displaystyle= 𝜶^t𝚪𝐃(λi+kkh)2(λi+k)2𝚪t𝜶^=i=1p(λi+kkh)2bi2(λi+k)2,\displaystyle\widehat{\boldsymbol{\alpha}}^{t}\boldsymbol{\Gamma}\mathbf{D}_{\frac{(\lambda_{i}+k-k\cdot h)^{2}}{(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}\widehat{\boldsymbol{\alpha}}=\sum\limits_{i=1}^{p}\frac{(\lambda_{i}+k-k\cdot h)^{2}b_{i}^{2}}{(\lambda_{i}+k)^{2}},
𝜶t(𝐈kh𝐙(k))t𝜷^(k)\displaystyle\boldsymbol{\alpha}^{t}\left(\mathbf{I}-k\cdot h\cdot\mathbf{Z}(k)\right)^{t}\widehat{\boldsymbol{\beta}}(k) =\displaystyle= 𝜶^t𝚪𝐃λi(λi+kkh)(λi+k)2𝚪t𝜷^=i=1pλi(λi+kkh)biei(λi+k)2,\displaystyle\widehat{\boldsymbol{\alpha}}^{t}\boldsymbol{\Gamma}\mathbf{D}_{\frac{\lambda_{i}(\lambda_{i}+k-k\cdot h)}{(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}\widehat{\boldsymbol{\beta}}=\sum\limits_{i=1}^{p}\frac{\lambda_{i}(\lambda_{i}+k-k\cdot h)b_{i}e_{i}}{(\lambda_{i}+k)^{2}},
𝜷^(k)t𝜷^(k)\displaystyle\widehat{\boldsymbol{\beta}}(k)^{t}\widehat{\boldsymbol{\beta}}(k) =\displaystyle= 𝜷^t𝚪𝐃λi2(λi+k)2𝚪t𝜷^=i=1pλi2ei2(λi+k)2.\displaystyle\widehat{\boldsymbol{\beta}}^{t}\boldsymbol{\Gamma}\mathbf{D}_{\frac{\lambda_{i}^{2}}{(\lambda_{i}+k)^{2}}}\boldsymbol{\Gamma}^{t}\widehat{\boldsymbol{\beta}}=\sum\limits_{i=1}^{p}\frac{\lambda_{i}^{2}e_{i}^{2}}{(\lambda_{i}+k)^{2}}.

Deriving with respect to kk:

k(i=1p(λi+kkh)2bi2(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{(\lambda_{i}+k-k\cdot h)^{2}b_{i}^{2}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1p2(λi+kkh)bi2hλi(λi+k)3,\displaystyle-\sum\limits_{i=1}^{p}\frac{2(\lambda_{i}+k-k\cdot h)b_{i}^{2}h\lambda_{i}}{(\lambda_{i}+k)^{3}}, (35)
k(i=1pλi(λi+kkh)biei(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{\lambda_{i}(\lambda_{i}+k-k\cdot h)b_{i}e_{i}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1pλi(λikhλi+kh)biei(λi+k)3,\displaystyle\sum\limits_{i=1}^{p}\frac{\lambda_{i}(-\lambda_{i}-k-h\lambda_{i}+k\cdot h)b_{i}e_{i}}{(\lambda_{i}+k)^{3}}, (36)
k(i=1pλi2ei2(λi+k)2)\displaystyle\frac{\partial}{\partial k}\left(\sum\limits_{i=1}^{p}\frac{\lambda_{i}^{2}e_{i}^{2}}{(\lambda_{i}+k)^{2}}\right) =\displaystyle= i=1p2λi2ei2(λi+k)3.\displaystyle-\sum\limits_{i=1}^{p}\frac{2\lambda_{i}^{2}e_{i}^{2}}{(\lambda_{i}+k)^{3}}. (37)

From the above expressions it can only be stated that the third term is decreasing in k (since its derivative, expression (37), has a negative sign), so it is difficult to analyze the monotonicity of ||\boldsymbol{\alpha}-\widehat{\boldsymbol{\beta}}(k,h)||. The only thing that is clear is that when k\rightarrow+\infty:

||\boldsymbol{\alpha}-\widehat{\boldsymbol{\beta}}(k,h)||\rightarrow(1-h)^{2}\cdot\sum\limits_{i=1}^{p}b_{i}^{2}=(1-h)^{2}\cdot\mathbf{b}^{t}\mathbf{b}=(1-h)^{2}\cdot\boldsymbol{\alpha}^{t}\boldsymbol{\alpha}=(1-h)^{2}\cdot||\boldsymbol{\alpha}||.

However, for h=1h=1 it is verified that:

α𝜷^(k,h)=i=1pλi2bi2(λi+k)22i=1pλi2biei(λi+k)2+i=1pλi2ei2(λi+k)2,||\alpha-\widehat{\boldsymbol{\beta}}(k,h)||=\sum\limits_{i=1}^{p}\frac{\lambda_{i}^{2}b_{i}^{2}}{(\lambda_{i}+k)^{2}}-2\sum\limits_{i=1}^{p}\frac{\lambda_{i}^{2}b_{i}e_{i}}{(\lambda_{i}+k)^{2}}+\sum\limits_{i=1}^{p}\frac{\lambda_{i}^{2}e_{i}^{2}}{(\lambda_{i}+k)^{2}},

which clearly decreases to zero as kk increases.

References

  • [1] Belsley, D.A., Kuh, E and Welsch, R.E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. Wiley, New York.
  • [2] Belsley, D.A. (1982). Assessing the presence of harmful collinearity and other forms of weak data through a test for signal-to-noise. Journal of Econometrics, 20, 211-253.
  • [3] Belsley, D.A. (1984). Demeaning conditioning diagnostics through centering. The American Statistician, 38(2), 73-77.
  • [4] Bingzhen, C., Wenjuan, Z. and Lingchen, K. (2022). Variable selection and collinearity processing for multivariate data via row-elastic-net regularization. AStA Advances in Statistical Analysis, 106, 79-96.
  • [5] Ding, Y., Peng, Q., Song, Z. and Chen, H. (2023). Variable selection and regularization via arbitrary rectangle-range generalized elastic net. Statistics and Computing, 33, article 72.
  • [6] Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7(1), 1-26.
  • [7] Efron, B. (1981). Nonparametric estimates of standard error: the jackknife, the bootstrap, and other resampling methods. Biometrika, 68(3), 589-599.
  • [8] Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics, CBMS-NSF Regional Conference Series in Applied Mathematics, Monograph 38.
  • [9] Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171-185.
  • [10] Efron, B. and Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37(1), 36-48.
  • [11] Efron, B. and Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1(1), 54-75.
  • [12] Gao, F. and Liu, X.Q. (2011). Linearized ridge regression estimator under the mean squared error criterion in a linear regression model. Communications in Statistics - Theory and Methods, 40(9), 1434-1443.
  • [13] García, J., Salmerón, R., García, C.B. and López Martín, M.M. (2015). Standardization of variables and collinearity diagnostic in ridge regression. International Statistical Review, 84(2), 245-266.
  • [14] García, C., Salmerón, R. and García, C.B. (2018). Choice of the ridge factor from the correlation matrix determinant. Journal of Statistical Computation and Simulation, 89(2), 211-231.
  • [15] Gibbons, D.G. (1981). A simulation study of some ridge estimators. Journal of the American Statistical Association, 76, 131-139.
  • [16] Gökpınar, E. and M. Ebegil (2016). A study on tests of hypothesis based on ridge estimator. Gazi University Journal of Science 29(4), 769-781.
  • [17] Halawa, A. and M. El Bassiouni (2000). Tests of regression coefficients under ridge regression models. Journal of Statistical Computation and Simulation 65(1-4), 341-356.
  • [18] Hoerl, A.E. and Kennard, R.W. (1970a). Ridge regression. Biased estimation for nonorthogonal problems. Technometrics, 12, 55-67.
  • [19] Hoerl, A.E. and Kennard, R.W. (1970b). Ridge regression. Applications to nonorthogonal problems. Technometrics, 12, 69-82.
  • [20] Hoerl, A., Kennard, R. and Baldwin, K. (1975). Ridge regression: some simulations. Communications in Statistics - Theory and Methods, 4(2), 105-123.
  • [21] Kibria, B. (2003). Performance of some new ridge regression estimators. Communications in Statistics - Simulation and Computation, 32(2), 419-435.
  • [22] Liu, K. (1993). A new class of biased estimate in linear regression. Communications in Statistics - Theory and Methods, 22(2), 393-402.
  • [23] Liu, K. (2003). Using Liu-type estimator to combat collinearity. Communications in Statistics - Theory and Methods, 32(5), 1009-1020.
  • [24] Liu, X.Q. and Gao, F. (2011). Linearized ridge regression estimator in linear regression. Communications in Statistics - Theory and Methods, 40(12), 2182-2192.
  • [25] Lukman, A. F., Adewuyi, E. T., Alqasem, O. A., Arashi, M., and Ayinde, K. (2024). Enhanced Model Predictions through Principal Components and Average Least Squares-Centered Penalized Regression. Symmetry, 16(4), 469.
  • [26] Marquardt, D.W. (1970). Generalized Inverses, Ridge Regression, Biased Linear Estimation and Nonlinear Estimation. Technometrics, 12(3), 591-612.
  • [27] Marquardt, D.W. (1980). You should standardize the predictor variables in your regression models. Journal of the American Statistical Association, 75(369), 87-91.
  • [28] Marquardt, D.W. and Snee, R. (1975). Ridge regression in practice. The American Statistician, 29(1), 3-20.
  • [29] McDonald G.C. and Galarneau, D.I. (1975). A Monte Carlo evaluation of some ridge-type estimators. Journal of the American Statistical Association, 70(350), 407-416.
  • [30] Salmerón, R., García, C., García, J. (2019). multiColl: collinearity detection in a multiple linear regression model. R package version 2.0, https://CRAN.R-project.org/package=multiColl.
  • [31] Novales, A. Análisis de Regresión. URL: https://www.ucm.es/data/cont/docs/518-2013-11-13-Analisis%20de%20Regresion.pdf. Accessed: November 21, 2023.
  • [32] Obenchain, R.L. (1975). Ridge analysis following a preliminary test of the shrunken hypothesis. Technometrics, 17(4), 431-441.
  • [33] Obenchain, R.L. (1977). Classical f-tests and confidence regions for ridge regression. Technometrics 19(4), 429-439.
  • [34] O’Driscoll, D. and Ramirez, D.E. (2016). Limitations of the Least Squares Estimators; a teaching perspective. Conference: 10th Annual International Conference on Statistics: Teaching, Theory and Applications. ATINER’S Conference Paper Series, STA2016-2074.
  • [35] R Core Team (2022) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org/.
  • [36] Rodríguez, A., Salmerón, R. and García, C. (2021). Obtaining a threshold for the stewart index and its extension to ridge regression. Computational Statistics, 36, 1011-1029.
  • [37] Rodríguez, A., Salmerón, R. and García, C.B. (2022). The coefficient of determination in the ridge regression. Communications in Statistics - Simulation and Computation, 51(1), 201-219.
  • [38] Salmerón, R., García, J., López, M.M. and García, C.B. (2016). Collinearity diagnostic in ridge estimation through the variance inflation factor. Journal of Applied Statistics, 43(10), 1831-1849.
  • [39] Salmerón, R., García, J., García, C.B. and López, M.M. (2017). Transformation of variables and the condition number in ridge estimation. Computational Statistics, 33(3), 1497-1524.
  • [40] Salmerón, R. and García, C. and García, J. (2018). Variance inflation factor and condition number in multiple linear regression. Journal of Statistical Computation and Simulation, 88, 2365-2384.
  • [41] Salmerón, R., Rodríguez, A. and García, C.B. (2020). Diagnosis and quantification of the non-essential collinearity. Computational Statistics, 35, 647-666.
  • [42] Salmerón, R., García, C.B. and García, J. (2020). Detection of near-multicollinearity through centered and noncentered regression. Mathematics, 8, 931.
  • [43] Salmerón, R., García, C., García, J. (2021) A guide to using the R package “multiColl” for detecting multicollinearity. Computational Economics 57, 529-536.
  • [44] Salmerón, R., García, C., García, J. (2022) The “multiColl” package versus other existing packages in R to detect multicollinearity. Computational Economics 60, 439-450.
  • [45] Sengupta, N. and F. Sowell (2020). On the asymptotic distribution of ridge regression estimators using training and test samples. Econometrics 8(4), 39.
  • [46] Stewart, G.W. (1987). Collinearity and least squares regression. Statistical Science, 2(1), 68-84.
  • [47] Snee, R.D. and Marquardt, D.W. (1984). Collinearity diagnostics depend on the domain of prediction, the model, and the data. The American Statistician, 38(2), 83-87.
  • [48] Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
  • [49] Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society, 73, 273-282.
  • [50] Tikhonov, A.N. (1943). On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39(5), 195-198.
  • [51] Vanhove, J. (2021). Collinearity isn’t a disease that needs curing. Meta-Psychology 5.
  • [52] Wang, W., Li, L., Li, S., Yin, F., Liao, F., Zhang, T., Li, X., Xiao, X. and Ma, Y. (2021). Average ordinary least squares-centered penalized regression: a more efficient way to address multicollinearity than ridge regression. Statistica Neerlandica, 76, 347-368.
  • [53] Wichern, D.W. and Churchill, G.A. (1978). A comparison of ridge estimators. Technometrics, 20, 301-311.
  • [54] Wissel, J. (2009). A new biased estimator for multivariate regression models with highly collinear variables. Ph.D. thesis, Dissertation zur Erlangung des naturwissenschaftlichen Doktorgrades der Bayerischen Julius-Maximilians-Universität Würzburg.
  • [55] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
  • [56] Zou, H. and Zhang, H. (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4), 1733-1751.