
Bayes beats Cross Validation: Fast and Accurate
Ridge Regression via Expectation Maximization

Shu Yu Tew
Monash University
shu.tew@monash.edu
Mario Boley
Monash University
mario.boley@monash.edu
Daniel F. Schmidt
Monash University
daniel.schmidt@monash.edu
Abstract

We present a novel method for tuning the regularization hyper-parameter, $\lambda$, of a ridge regression that is faster to compute than leave-one-out cross-validation (LOOCV) while yielding estimates of the regression parameters of equal, or, particularly in the setting of sparse covariates, superior quality to those obtained by minimising the LOOCV risk. The LOOCV risk can suffer from multiple and bad local minima for finite $n$ and thus requires the specification of a set of candidate $\lambda$, which can fail to provide good solutions. In contrast, we show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions, without requiring the specification of any difficult-to-determine hyper-parameters. This is based on a Bayesian formulation of ridge regression that we prove to have a unimodal posterior for large enough $n$, allowing for both the optimal $\lambda$ and the regression coefficients to be jointly learned within an iterative expectation maximization (EM) procedure. Importantly, we show that by utilizing an appropriate preprocessing step, a single iteration of the main EM loop can be implemented in $O(\min(n,p))$ operations, for input data with $n$ rows and $p$ columns. In contrast, evaluating a single value of $\lambda$ using fast LOOCV costs $O(n\min(n,p))$ operations when using the same preprocessing. This advantage amounts to an asymptotic improvement of a factor of $l$ for $l$ candidate values for $\lambda$ (in the regime $q,p\in O(\sqrt{n})$, where $q$ is the number of regression targets).

1 Introduction

Ridge regression [25] is one of the most widely used statistical learning algorithms. Given training data ${\bf X}\in\mathbb{R}^{n\times p}$ and ${\bf y}\in\mathbb{R}^{n}$, ridge regression finds the linear regression coefficients $\hat{{\bm{\beta}}}_{\lambda}$ that minimize the $\ell_{2}$-regularized sum of squared errors, i.e.,

\hat{{\bm{\beta}}}_{\lambda}=\underset{\bm{\beta}}{\operatorname{arg\,min}}\left\{\|{\bf y}-{\bf X}\bm{\beta}\|^{2}+\lambda\|\bm{\beta}\|^{2}\right\}. \qquad (1)

In practice, using ridge regression additionally involves estimating the value of the tuning parameter $\lambda$ that minimizes the expected squared error $\mathbb{E}({\bf x}^{\rm T}\hat{{\bm{\beta}}}_{\lambda}-y)^{2}$ for new data ${\bf x}$ and $y$ sampled from the same distribution as the training data. This problem is usually approached via the leave-one-out cross-validation (LOOCV) estimator, which can be computed efficiently by exploiting a closed-form solution for the leave-one-out test errors for a given $\lambda$. The wide and long-lasting use of the LOOCV approach suggests that it solves the ridge regression problem more or less optimally, both in terms of its statistical performance and its computational complexity.

However, in this work, we show that LOOCV is outperformed by a simple expectation maximization (EM) approach based on a Bayesian formulation of ridge regression. While the two procedures are not equivalent, in the sense that they generally do not produce identical parameter values, the EM estimates tend to be of equal quality or, particularly in sparse regimes, superior to the LOOCV estimates (see Figure 1). Specifically, the LOOCV risk estimates can suffer from multiple and bad local minima when using iterative optimization, or from misspecified candidates when performing grid search. In contrast, we show that the EM algorithm finds a unique optimal solution for large enough $n$ (outside pathological cases) without requiring any hard-to-specify hyper-parameters, which is a consequence of a more general bound on $n$ (Thm. 3.1) that we establish to guarantee the unimodality of the posterior distribution of Bayesian ridge regression, a result with potentially wider applications. In addition, the EM procedure is asymptotically faster than the LOOCV procedure by a factor of $l$, where $l$ is the number of candidate values for $\lambda$ to be evaluated (in the regime $p,q\in O(\sqrt{n})$, where $p$, $q$, and $n$ are the number of covariates, target variables, and data points, respectively). In practice, even in the usual case of $q=1$ and $l=O(1)$, the EM algorithm tends to outperform LOOCV computationally by an order of magnitude, as we demonstrate on a test suite of datasets from the UCI machine learning repository and the UCR time series classification archive.

Figure 1: Comparison of LOOCV (with a fixed candidate grid of size 100) and EM for a setting with sparse covariate vectors ${\bf x}=(x_{1},\dots,x_{100})$ such that $x_{i}\sim\mathrm{Ber}(1/100)$ i.i.d. and responses $y\,|\,{\bf x}\sim N(\beta^{\rm T}{\bf x},\sigma^{2})$ for increasing noise levels $\sigma$ and sample sizes $n$. In an initial phase for small $n$, the number of EM iterations $k$ tends to decrease rapidly from an initially large number until it reaches a small constant (around 10). In this phase, EM is computationally slightly more expensive than LOOCV (third row) but has a better parameter mean squared error (first row), corresponding to less shrinkage (second row). In the subsequent phase, both algorithms have essentially identical parameter estimates, but EM outperforms LOOCV in terms of computation by a wide margin.

While the EM procedure discussed in this paper is based on a recently published procedure for learning sparse linear regression models [44], the adaptation of this procedure to ridge regression has not been previously discussed in the literature. Furthermore, a direct adoption would lead to a main-loop complexity of $O(p^{3})$ that is uncompetitive with LOOCV. Therefore, in addition to evaluating the empirical accuracy and efficiency of the EM algorithm for ridge regression, the main technical contributions of this work are to show how certain key quantities can be efficiently computed from either a singular value decomposition of the design matrix, when $p\geq n$, or an eigenvalue decomposition of the Gram matrix ${\bf X}^{\rm T}{\bf X}$, when $n>p$. This results in an E-step that runs in time $O(r)$, where $r=\min(n,p)$, and an M-step that is found in closed form and solved in time $O(1)$, yielding an ultra-fast main loop for the EM algorithm.

These computational advantages result in an algorithm that is computationally superior to efficient, optimized implementations of the fast LOOCV algorithm. Our particular implementation of LOOCV actually outperforms the implementation in scikit-learn by approximately a factor of two by utilizing a similar preprocessing to the EM approach. This enables an $O(nr)$ evaluation for a single $\lambda$ (which is still slower than the $O(r)$ evaluation for our new EM algorithm; see Table 1 for an overview of the asymptotic complexities of LOOCV and EM), and may be of interest to readers by itself. Our implementations of both algorithms, along with all experiment code, are publicly available in the standard package ecosystems of the R and Python platforms, as well as on GitHub (https://github.com/marioboley/fastridge.git).

In the remainder of this paper, we first briefly survey the literature of ridge regression with an emphasis on the use of cross validation (Sec. 2). Based on the Bayesian interpretation of ridge regression, we then introduce the EM algorithm and discuss its convergence (Sec. 3). Finally, we develop fast implementations of both the EM algorithm and LOOCV (Sec. 4) and compare them empirically (Sec. 5).

Table 1: Time complexities of the algorithms; $m=\max(n,p)$, $r=\min(n,p)$, $l$ is the number of candidate $\lambda$ for LOOCV, $k$ the number of EM iterations, and $q$ the number of target variables.
Method                  | Main loop    | Pre-processing | Overall ($p,q\in O(\sqrt{n})$)
Naive adaptation of EM  | $O(kp^{3}q)$ | $O(p^{2}n)$    | $O(kn^{2})$
Proposed BayesEM        | $O(krq)$     | $O(mr^{2})$    | $O(kn+n^{2})$
Fast LOOCV              | $O(lnrq)$    | $O(mr^{2})$    | $O(ln^{2})$

2 Ridge Regression and Cross Validation

Ridge regression [25] (also known as $\ell_{2}$-regularization) is a popular method for estimation and prediction in linear models. The ridge regression estimates are the solutions to the penalized least-squares problem given in (1). The solution to this optimization problem is given by:

\hat{\bm{\beta}}_{\lambda}=({\bf X}^{\rm T}{\bf X}+\lambda{\bf I}_{p})^{-1}{\bf X}^{\rm T}{\bf y}. \qquad (2)

When $\lambda\to 0$, the ridge estimates coincide with the minimum $\ell_{2}$-norm least squares solution [31, 22], which simplifies to the usual least squares estimator in cases where the design matrix ${\bf X}$ has full column rank (i.e., ${\bf X}^{\rm T}{\bf X}$ is invertible). Conversely, as $\lambda\to\infty$, the amount of shrinkage induced by the penalty increases, with the resulting ridge estimates becoming smaller for larger values of $\lambda$. Under fairly general assumptions [26], including misspecified models and random covariates of growing dimension, the ridge estimator is consistent and enjoys finite-sample risk bounds for all fixed $\lambda\geq 0$, i.e., it converges almost surely to the prediction risk minimizer, and its squared deviation from this minimizer is bounded for finite $n$ with high probability. However, its performance can still vary greatly with the choice of $\lambda$; hence, there is a need to estimate the optimal value from the given training data.
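As a concrete illustration, the following minimal NumPy sketch (the function name and test data are ours, not part of the paper's released code) computes the closed-form ridge estimate (2) for a given $\lambda$:

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form ridge solution (2): (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)

# Example usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = rng.normal(size=10)
y = X @ beta_true + 0.5 * rng.normal(size=50)
beta_hat = ridge_estimate(X, y, lam=1.0)
```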

Earlier approaches to this problem [e.g. 28, 14, 30, 29, 2, 34, 19, 48, 23, 7] rely on an explicit estimate of the (assumed homoskedastic) noise variance, following the original idea of Hoerl and Kennard [25]. However, estimating the noise variance can be problematic, especially when $p$ is not much smaller than $n$ [18, 20, 24, 46]. More recent approaches adopt model selection criteria to select the optimal $\lambda$ without requiring prior knowledge or estimation of the noise variance. These methods involve minimizing a selection criterion of choice, such as the Akaike information criterion (AIC) [1], the Bayesian information criterion (BIC) [40], Mallow's conceptual prediction ($C_p$) criterion [33], and, most commonly, cross validation (CV) [4, 49].

A particularly attractive variant of CV is leave-one-out cross-validation (LOOCV), also referred to as the prediction error sum of squares (PRESS) statistic in the statistics literature [3]:

R^{\mathrm{CV}}_{n}(\lambda)=\frac{1}{n}\sum^{n}_{i=1}\left(y_{i}-\hat{y}_{i}\right)^{2} \qquad (3)

where $\hat{y}_{i}=\tilde{\bf x}_{i}\hat{\bm{\beta}}^{-i}_{\lambda}$, and $\hat{\bm{\beta}}^{-i}_{\lambda}$ denotes the solution to (2) when the $i$-th data point $(\tilde{\bf x}_{i},y_{i})$ is omitted. LOOCV offers several advantages over alternatives such as 10-fold CV: it is deterministic, nearly unbiased [47], and there exists an efficient "shortcut" formula for the LOOCV ridge estimate [36]:

R^{\rm CV}_{n}(\lambda)=\frac{1}{n}\sum^{n}_{i=1}\left(\frac{e_{i}}{1-H_{ii}(\lambda)}\right)^{2} \qquad (4)

where ${\bf H}(\lambda)={\bf X}({\bf X}^{\rm T}{\bf X}+\lambda{\bf I}_{p})^{-1}{\bf X}^{\rm T}$ is the regularized "hat", or projection, matrix and ${\bf e}={\bf y}-{\bf H}(\lambda){\bf y}$ are the residuals of the ridge fit using all $n$ data points. As it only requires the diagonal entries of the hat matrix, Eq. (4) allows for the computation of the PRESS statistic with the same time complexity $O(p^{3}+np^{2})$ as a single ridge regression fit.
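To make the shortcut concrete, the following sketch (our own illustration, not the paper's optimized implementation) evaluates the PRESS statistic (4) for a single candidate $\lambda$ directly from the hat matrix:

```python
import numpy as np

def press_statistic(X, y, lam):
    """LOOCV risk (4) via the shortcut formula: no model refitting needed."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # hat matrix H(lam)
    e = y - H @ y                                            # residuals of the full fit
    return np.mean((e / (1.0 - np.diag(H))) ** 2)
```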

Moreover, unless $p/n\to 1$, the LOOCV ridge regression risk as a function of $\lambda$ converges uniformly (almost surely) to the true risk function on $[0,\infty)$, and therefore optimizing it consistently estimates the optimal $\lambda$ [36, 22]. However, for finite $n$, the LOOCV risk can be multimodal and, even worse, there can exist local minima that are almost as bad as the worst $\lambda$ [42]. Therefore, iterative algorithms like gradient descent cannot be reliably used for the optimization, giving theoretical justification for the predominant approach of optimizing over a finite grid of candidates $L=(\lambda_{1},\dots,\lambda_{l})$. Unfortunately, despite the true risk function being smooth and unimodal, a naively chosen finite grid cannot be guaranteed to contain any candidate with a risk value close to the optimum. While this might not pose a problem for small $n$, when the error in estimating the true risk via LOOCV is likely large, it can potentially become a dominating source of error for growing $n$ and $p$. Therefore, letting $l$ grow moderately with the sample size appears necessary, turning it into a relevant factor in the asymptotic time complexity.

As a further disadvantage, LOOCV (or CV in general) is sensitive to sparse covariates, as illustrated in Figure 1, where the performance of LOOCV, relative to the proposed EM algorithm, degrades as the noise variance $\sigma^{2}$ grows. In the sparse covariate setting, a situation common in genomics, the information about each coefficient is concentrated in only a few observations. As LOOCV drops an observation to estimate future prediction error, the variance of the CV score can be very large when the predictor matrix is very sparse, as the estimates depend on only a small number of the remaining observations. In the most extreme case, known as the multiple means problem [27], ${\bf X}={\bf I}_{n}$, and all the information about each coefficient is concentrated in a single observation. In this setting, the LOOCV score reduces to $\sum_{i} y_{i}^{2}$ and provides no information about how to select $\lambda$. In contrast, the proposed EM approach explicitly ties together the coefficients via the probabilistic Bayesian interpretation of $\lambda$ as the inverse variance of the unknown coefficient vector. This "borrowing of strength" means that the procedure provides a sensible estimate of $\lambda$ even in the case of multiple means (see Appendix A).

3 Bayesian Ridge Regression

The ridge estimator (2) has a well-known Bayesian interpretation; specifically, if we assume that the coefficients are a priori normally distributed with mean zero and common variance $\tau^{2}\sigma^{2}$, we obtain a Bayesian version of the usual ridge regression procedure, i.e.,

\begin{aligned}
{\bf y}\,|\,{\bf X},\bm{\beta},\sigma^{2} &\sim N_{n}\left({\bf X}\bm{\beta},\,\sigma^{2}{\bf I}_{n}\right), \\
{\bm{\beta}}\,|\,\tau^{2},\sigma^{2} &\sim N_{p}\left({\bf 0},\,\tau^{2}\sigma^{2}{\bf I}_{p}\right), \\
\sigma^{2} &\sim \sigma^{-2}\,d\sigma^{2}, \\
\tau^{2} &\sim \pi(\tau^{2})\,d\tau^{2},
\end{aligned} \qquad (5)

where $\pi(\cdot)$ is an appropriate prior distribution assigned to the variance hyperparameter $\tau^{2}$. For given $\tau>0$ and $\sigma>0$, the conditional posterior distribution of $\bm{\beta}$ is also normal [32]:

\begin{aligned}
\bm{\beta}\,|\,\tau^{2},\sigma^{2},{\bf y} &\sim N_{p}(\hat{{\bm{\beta}}}_{\tau},\,\sigma^{2}{\bf A}_{\tau}^{-1}), \\
\hat{{\bm{\beta}}}_{\tau} &= {\bf A}_{\tau}^{-1}{\bf X}^{\rm T}{\bf y}, \\
{\bf A}_{\tau} &= {\bf X}^{\rm T}{\bf X}+\tau^{-2}{\bf I}_{p},
\end{aligned} \qquad (6)

where the posterior mode (and mean) $\hat{{\bm{\beta}}}_{\tau}$ is equivalent to the ridge estimate with penalty $\lambda=1/\tau^{2}$ (we rely on the variable name in the notation $\hat{{\bm{\beta}}}_{x}$ to indicate whether it refers to (6) or (2)).

Shrinkage Prior

To estimate the $\tau^{2}$ hyperparameter in the Bayesian framework, we must first choose a prior distribution for the hypervariance $\tau^{2}$. We assume that no strong prior knowledge on the degree of shrinkage of the regression coefficients is available, and instead assign the recommended default beta-prime prior distribution for $\tau^{2}$ [15, 38] with probability density function:

\pi(\tau^{2})=\frac{(\tau^{2})^{a-1}(1+\tau^{2})^{-a-b}}{B(a,b)},\quad a>0,\; b>0, \qquad (7)

where $B(a,b)$ is the beta function. Specifically, we choose $a=b=1/2$, which corresponds to a standard half-Cauchy prior on $\tau$. The half-Cauchy is a heavy-tailed, weakly informative prior that is frequently recommended as a default choice for scale-type hyperparameters such as $\tau$ [38]. Further, this estimator is very insensitive to the choice of $a$ or $b$. As demonstrated by Theorem 6.1 in [6], the marginal prior density over $\bm{\beta}$, $\int\pi({\bm{\beta}}|\tau^{2})\pi(\tau^{2}|a,b)\,d\tau^{2}=\pi({\bm{\beta}}|a,b)$, has polynomial tails in $\|\bm{\beta}\|^{2}$ for all $a>0,b>0$, and has Cauchy or heavier tails for $b\leq 1/2$. This type of polynomial-tailed prior distribution over the norm of the coefficients is insensitive to the overall scale of the coefficients, which is likely unknown a priori. This robustness is in contrast to other standard choices of prior distribution for $\tau^{2}$, such as the inverse-gamma distribution [e.g., 41, 35], which are highly sensitive to the choice of hyperparameters [38].

Unimodality and Consistency

The asymptotic properties of the posterior distribution in Gaussian linear models (5) have been extensively researched [45, 16, 17, 8]. These studies reveal that in linear models, the posterior distribution of $\bm{\beta}$ is consistent and converges asymptotically to a normal distribution centered on the true parameter value. When $p$ is fixed, this assertion can be established through the Bernstein-von Mises theorem [45, Sec. 10.2]. Our specific problem (5) satisfies the conditions for this theorem to hold: 1) both the Gaussian linear model $p(y|{\bm{\beta}},\sigma^{2})$ and the marginal distribution $\int p(y|{\bm{\beta}},\sigma^{2})\pi({\bm{\beta}}|\tau^{2})\,d{\bm{\beta}}=p(y|\tau^{2},\sigma^{2})$ are identifiable; 2) they have well-defined Fisher information matrices; and 3) the priors over ${\bm{\beta}}$ and $\tau$ are absolutely continuous. Further, these asymptotic properties remain valid when the number of predictors $p_{n}$ is allowed to grow with the sample size $n$ at a sufficiently slow rate [16, 8].

The following theorem (see the proof in Appendix B) provides a simple bound on the number of samples required to guarantee that the posterior distribution for the Bayesian ridge regression hierarchy given by (5) has only one mode outside a small neighborhood of zero.

Theorem 3.1.

Let $\epsilon>0$, and let $\gamma_{n}$ be the smallest eigenvalue of ${\bf X}^{\rm T}{\bf X}/n$. If $\gamma_{n}>0$ and $\epsilon>4/(n\gamma_{n})$, then the joint posterior $p({\bm{\beta}},\sigma^{2},\tau^{2}|{\bf y})$ has a unique mode with $\tau^{2}\geq\epsilon$. In particular, if $\gamma_{n}\geq cn^{-\alpha}$ with $\alpha<1$ and $c>0$, then there is a unique mode with $\tau^{2}\geq\epsilon$ if $n>(4/(c\epsilon))^{1/(1-\alpha)}$.

In other words, all sub-optimal non-zero posterior modes vanish for large enough $n$ if the smallest eigenvalue of ${\bf X}^{\rm T}{\bf X}$ grows at least proportionally to some positive power of $n$. This is a very mild assumption that is typically satisfied in fixed as well as random design settings, e.g., with high probability when the smallest marginal covariate variance is bounded away from zero.
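For intuition, the sufficient condition of Theorem 3.1 is straightforward to check numerically for a given design matrix; the following sketch (our own illustration, with names chosen by us) computes $\gamma_{n}$ and tests whether a chosen $\epsilon$ satisfies $\epsilon > 4/(n\gamma_{n})$:

```python
import numpy as np

def unimodality_condition_holds(X, eps):
    """Check the sufficient condition of Theorem 3.1: gamma_n > 0 and eps > 4/(n*gamma_n)."""
    n = X.shape[0]
    gamma_n = np.linalg.eigvalsh(X.T @ X / n).min()  # smallest eigenvalue of X'X/n
    return gamma_n > 0 and eps > 4.0 / (n * gamma_n)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
print(unimodality_condition_holds(X, eps=1e-2))
```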

Expectation Maximization

Given the restricted unimodality of the joint posterior (5) for large enough $n$, in conjunction with its asymptotic concentration around the optimal ${\bm{\beta}}_{0}$, estimating the model parameters via an EM algorithm appears attractive, as EM is guaranteed to converge to an exact posterior mode. In particular, in the non-degenerate case that ${\bm{\beta}}_{0}\neq 0$, there exists $\tau^{2}=\epsilon^{2}>0$ such that, for large enough but finite $n$, the posterior concentrates around $({\bm{\beta}}_{0},\tau^{2})$, and thus ${\bm{\beta}}_{0}$ is identified by EM if it is initialized with a large enough $\tau^{2}$.

Specifically, we use the novel approach of [44], in which the coefficients $\bm{\beta}$ are treated as "missing data", and $\tau^{2}$ and $\sigma^{2}$ as parameters to be estimated. Given the hierarchy (5), the resulting Bayesian EM algorithm solves for the posterior mode estimates of $\bm{\beta}$ by repeatedly iterating through the following two steps until convergence:

E-step. Find the parameters of the Q-function, i.e., the expected complete negative log-posterior (with respect to $\bm{\beta}$), conditional on the current estimates $\hat{\tau}^{2}_{t}$ and $\hat{\sigma}^{2}_{t}$ and the observed data ${\bf y}$:

\begin{aligned}
Q(\tau^{2},\sigma^{2}|\hat{\tau}^{2}_{t},\hat{\sigma}^{2}_{t}) &= {\rm E}_{\bm{\beta}}\!\left[-\log p(\bm{\beta},\tau^{2},\sigma^{2}\,|\,{\bf y})\>|\>\hat{\tau}^{2}_{t},\hat{\sigma}^{2}_{t},{\bf y}\right] \\
&= \left(\frac{n+p+2}{2}\right)\log\sigma^{2}+\frac{\mathrm{ESS}}{2\sigma^{2}}+\frac{p+1}{2}\log\tau^{2}+\frac{\mathrm{ESN}}{2\sigma^{2}\tau^{2}}+\log(1+\tau^{2})
\end{aligned} \qquad (8)

where the quantities to be computed are the (conditionally) expected sum of squared errors $\mathrm{ESS}={\rm E}\!\left[\|{\bf y}-{\bf X}\bm{\beta}\|^{2}\>|\>\hat{\tau}^{2}_{t},\hat{\sigma}^{2}_{t}\right]$ and the expected squared norm $\mathrm{ESN}={\rm E}\!\left[\|{\bm{\beta}}\|^{2}\>|\>\hat{\tau}^{2}_{t},\hat{\sigma}^{2}_{t}\right]$. Denoting by $\mathrm{tr}(\cdot)$ the trace operator, one can show (see Appendix C) that these quantities can be computed as

\mathrm{ESS}=\|{\bf y}-{\bf X}\hat{{\bm{\beta}}}_{\tau}\|^{2}+\sigma^{2}\mathrm{tr}({\bf X}^{\rm T}{\bf X}{\bf A}_{\tau}^{-1}) \quad\text{and}\quad \mathrm{ESN}=\sigma^{2}\mathrm{tr}({\bf A}_{\tau}^{-1})+\|\hat{\bm{\beta}}_{\tau}\|^{2}. \qquad (9)

M-step. Update the parameter estimates by minimizing the Q-function with respect to the shrinkage hyperparameter $\tau^{2}$ and the noise variance $\sigma^{2}$, i.e.,

\{\hat{\tau}^{2}_{t+1},\hat{\sigma}^{2}_{t+1}\}=\underset{\tau^{2},\sigma^{2}}{\operatorname{arg\,min}}\left\{Q\left(\tau^{2},\sigma^{2}\,|\,\hat{\tau}^{2}_{t},\hat{\sigma}^{2}_{t}\right)\right\}. \qquad (10)

Instead of numerically optimizing the two-dimensional Q-function (10), we can derive closed-form solutions for both parameters by first finding $\hat{\sigma}^{2}(\tau^{2})$, i.e., the update for $\sigma^{2}$ as a function of $\tau^{2}$, and then substituting this into the Q-function. This yields a Q-function that no longer depends on $\sigma^{2}$, and solving for $\hat{\tau}^{2}$ is then straightforward. The resulting parameter updates in the M-step are given by:

\hat{\sigma}^{2}=\frac{\tau^{2}\,\mathrm{ESS}+\mathrm{ESN}}{(n+p+2)\,\tau^{2}} \quad\text{and}\quad \hat{\tau}^{2}=\frac{(n-1)\,\mathrm{ESN}-(1+p)\,\mathrm{ESS}+\sqrt{g}}{(6+2p)\,\mathrm{ESS}}, \qquad (11)

where $g=(4n+4)\,\mathrm{ESN}\,(3+p)\,\mathrm{ESS}+((1-n)\,\mathrm{ESN}+(p+1)\,\mathrm{ESS})^{2}$. The derivations of these formulae are presented in Appendix D.

From (11), we see that updating the parameter estimates in the M-step requires only constant time. Therefore, the overall efficiency of the EM algorithm is determined by the computational complexity of the E-step. Computing the parameters of the Q-function directly via (9) requires inverting ${\bf A}_{\tau}$, resulting in $O(p^{3})$ operations. In the next section, we show how to substantially improve this approach via the singular value decomposition.
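To summarize the procedure, here is a minimal sketch of the overall EM loop as we read Eqs. (6), (9), and (11); the function name, initialization, and stopping rule are our own choices, and this naive version performs the $O(p^{3})$ matrix inversion per iteration that the SVD-based E-step of Section 4 avoids:

```python
import numpy as np

def ridge_em_naive(X, y, max_iter=1000, tol=1e-8):
    """Naive EM for Bayesian ridge regression: E-step via (9), M-step via (11)."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    tau2, sigma2 = 1.0, np.var(y)          # simple starting values (our choice)
    for _ in range(max_iter):
        # E-step: conditional posterior moments of beta given current tau2, sigma2
        A_inv = np.linalg.inv(XtX + np.eye(p) / tau2)
        beta = A_inv @ Xty
        ESS = np.sum((y - X @ beta) ** 2) + sigma2 * np.trace(XtX @ A_inv)
        ESN = sigma2 * np.trace(A_inv) + np.sum(beta ** 2)
        # M-step: closed-form updates (11)
        g = (4 * n + 4) * ESN * (3 + p) * ESS + ((1 - n) * ESN + (p + 1) * ESS) ** 2
        tau2_new = ((n - 1) * ESN - (1 + p) * ESS + np.sqrt(g)) / ((6 + 2 * p) * ESS)
        sigma2 = (tau2_new * ESS + ESN) / ((n + p + 2) * tau2_new)
        converged = abs(tau2_new - tau2) < tol
        tau2 = tau2_new
        if converged:
            break
    return beta, 1.0 / tau2, sigma2        # coefficients, implied lambda, noise variance
```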

4 Fast Implementations via Singular Value Decomposition

To obtain efficient implementations of the E-step of the EM algorithm, as well as of the LOOCV shortcut formula, one can exploit the fact that the ridge solution is preserved under orthogonal transformations. Specifically, let $r=\min(n,p)$ and $m=\max(n,p)$, and let ${\bf U}{\bm{\Sigma}}{\bf V}^{\rm T}={\bf X}$ be a compact singular value decomposition (SVD) of ${\bf X}$. That is, ${\bf U}\in\mathbb{R}^{n\times r}$ and ${\bf V}\in\mathbb{R}^{p\times r}$ are semi-orthonormal column matrices, i.e., ${\bf U}^{\rm T}{\bf U}={\bf I}_{r}$ and ${\bf V}^{\rm T}{\bf V}={\bf I}_{r}$, and ${\bm{\Sigma}}=\mathrm{diag}(s_{1},\dots,s_{r})\in\mathbb{R}^{r\times r}$ is a diagonal matrix that contains the non-zero singular values ${\bf s}=(s_{1},\dots,s_{r})$ of ${\bf X}$. With this decomposition, and an additional $O(nr)$ pre-processing step to compute ${\bf c}={\bm{\Sigma}}{\bf U}^{\rm T}{\bf y}$, we can compute the ridge solution $\hat{{\bm{\alpha}}}_{\tau}\in\mathbb{R}^{r}$ for a given $\tau$ with respect to the rotated inputs ${\bf X}{\bf V}$ in time $O(r)$ via

\hat{{\bm{\alpha}}}_{\tau}=({\bm{\Sigma}}^{\rm T}{\bf U}^{\rm T}{\bf U}{\bm{\Sigma}}+\tau^{-2}{\bf I})^{-1}{\bm{\Sigma}}{\bf U}^{\rm T}{\bf y}=({\bm{\Sigma}}^{2}+\tau^{-2}{\bf I})^{-1}{\bf c}=\left(1/(s_{j}^{2}+\tau^{-2})\right)_{j=1}^{r}\odot{\bf c} \qquad (12)

where ${\bf a}\odot{\bf b}$ denotes the element-wise Hadamard product of vectors ${\bf a}$ and ${\bf b}$. The compact SVD itself can be obtained in time $O(mr^{2})$ via an eigendecomposition of either ${\bf X}^{\rm T}{\bf X}={\bf V}{\bm{\Sigma}}^{2}{\bf V}^{\rm T}$ in the case $n\geq p$, or ${\bf X}{\bf X}^{\rm T}={\bf U}{\bm{\Sigma}}^{2}{\bf U}^{\rm T}$ in the case $n<p$, followed by the computation of the missing factor ${\bf U}={\bf X}{\bf V}{\bm{\Sigma}}^{-1}$ or ${\bf V}={\bf X}^{\rm T}{\bf U}{\bm{\Sigma}}^{-1}$.
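A minimal sketch of this preprocessing and of the $O(r)$ rotated solution (12), under the assumption that ${\bf X}$ has full rank $r=\min(n,p)$ (function and variable names are ours):

```python
import numpy as np

def svd_preprocess(X, y):
    """Compact SVD via the smaller of the two Gram matrices, plus c = Sigma U'y."""
    n, p = X.shape
    if n >= p:
        s2, V = np.linalg.eigh(X.T @ X)      # eigenvalues of X'X (ascending)
        s = np.sqrt(np.maximum(s2, 0.0))
        U = X @ V / s                        # U = X V Sigma^{-1} (full rank assumed)
    else:
        s2, U = np.linalg.eigh(X @ X.T)
        s = np.sqrt(np.maximum(s2, 0.0))
        V = X.T @ U / s                      # V = X' U Sigma^{-1}
    c = s * (U.T @ y)                        # c = Sigma U'y
    return U, s, V, c

def rotated_ridge(s, c, tau2):
    """O(r) rotated ridge solution (12): alpha = c / (s^2 + 1/tau^2)."""
    return c / (s ** 2 + 1.0 / tau2)
```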

In summary, after an $O(mr^{2})$ pre-processing step, we can obtain rotated ridge solutions for an individual candidate $\tau$ in time $O(r)$. Moreover, for the optimal $\tau^{*}$, we can find the ridge solution $\hat{{\bm{\beta}}}_{\tau^{*}}={\bf V}\hat{{\bm{\alpha}}}_{\tau^{*}}$ with respect to the original input matrix via an $O(pr)$ post-processing step. Below we show how the key statistics that have to be computed per candidate $\tau$ (and $\sigma$) can be obtained efficiently from $\hat{{\bm{\alpha}}}_{\tau}$, the pre-computed ${\bf c}$, and the SVD. For the EM algorithm, these are the posterior expected squared norm and sum of squared errors, and for the LOOCV algorithm, this is the PRESS statistic. While the main focus of this work is the EM algorithm, the fast computation of the PRESS shortcut formula appears not to be widely known (e.g., the current implementations in both scikit-learn and glmnet do not use it) and may therefore be of independent interest.

ESN

For the posterior expected squared norm $\mathrm{ESN}=\sigma^{2}\mathrm{tr}({\bf A}_{\tau}^{-1})+\|\hat{{\bm{\beta}}}_{\tau}\|^{2}$, we first observe that $\|\hat{{\bm{\beta}}}_{\tau}\|^{2}=\|{\bf V}\hat{{\bm{\alpha}}}_{\tau}\|^{2}=\|\hat{{\bm{\alpha}}}_{\tau}\|^{2}$, and then note that the trace can be computed as

\mathrm{tr}({\bf A}^{-1}_{\tau})=\mathrm{tr}({\bf V}_{p}({\bm{\Sigma}}_{p}^{2}+\tau^{-2}{\bf I}_{p})^{-1}{\bf V}_{p}^{\rm T})=\tau^{2}\max(p-n,0)+\sum_{j=1}^{r}1/(s^{2}_{j}+\tau^{-2}), \qquad (13)

where in the first equation we denote by ${\bf V}_{p}$ and ${\bm{\Sigma}}_{p}^{2}$ the full matrices of eigenvectors and eigenvalues of ${\bf X}^{\rm T}{\bf X}$ (including potential zeros), and in the second equation we use the cyclical property of the trace. Thus, all quantities required for $\mathrm{ESN}$ can be computed in time $O(r)$ given the SVD and $\hat{{\bm{\alpha}}}_{\tau}$.
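In code, (13) amounts to a vectorized sum over the singular values; a sketch, reusing the names from the preprocessing sketch above:

```python
import numpy as np

def esn(alpha, s, sigma2, tau2, n, p):
    """Posterior expected squared norm: sigma^2 tr(A^{-1}) + ||alpha||^2, in O(r) via (13)."""
    trace_A_inv = tau2 * max(p - n, 0) + np.sum(1.0 / (s ** 2 + 1.0 / tau2))
    return sigma2 * trace_A_inv + np.sum(alpha ** 2)
```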

ESS

For the posterior expected sum of squared errors $\mathrm{ESS}=\|{\bf y}-{\bf X}\hat{{\bm{\beta}}}_{\tau}\|^{2}+\sigma^{2}\mathrm{tr}({\bf X}^{\rm T}{\bf X}{\bf A}_{\tau}^{-1})$, we can compute the residual sum of squares term via

\|{\bf y}-{\bf X}\hat{{\bm{\beta}}}_{\tau}\|^{2}=\|{\bf y}\|^{2}-2{\bf y}^{\rm T}{\bf U}{\bm{\Sigma}}\hat{{\bm{\alpha}}}_{\tau}+\|{\bf U}{\bm{\Sigma}}\hat{{\bm{\alpha}}}_{\tau}\|^{2}=\|{\bf y}\|^{2}-2\hat{{\bm{\alpha}}}_{\tau}^{\rm T}{\bf c}+\|{\bf s}\odot\hat{{\bm{\alpha}}}_{\tau}\|^{2}, \qquad (14)

where we use $\hat{{\bm{\beta}}}_{\tau}={\bf V}\hat{{\bm{\alpha}}}_{\tau}$ and ${\bf X}={\bf U}{\bm{\Sigma}}{\bf V}^{\rm T}$ in the first equation, and the orthonormality of ${\bf U}$ and the definition of ${\bf c}={\bm{\Sigma}}{\bf U}^{\rm T}{\bf y}$ in the second. Finally, for the trace term, we find that

\mathrm{tr}({\bf X}^{\rm T}{\bf X}{\bf A}_{\tau}^{-1})=\mathrm{tr}({\bf V}{\bm{\Sigma}}^{2}{\bf V}^{\rm T}({\bf V}({\bm{\Sigma}}^{2}+\tau^{-2}{\bf I}_{p}){\bf V}^{\rm T})^{-1})=\sum_{j=1}^{r}s^{2}_{j}/(s^{2}_{j}+\tau^{-2}). \qquad (15)
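Correspondingly, a sketch of the $O(r)$ evaluation of (14) and (15), with $\|{\bf y}\|^{2}$ precomputed once during preprocessing (names are ours, matching the earlier sketches):

```python
import numpy as np

def ess(alpha, s, c, y_sq_norm, sigma2, tau2):
    """Posterior expected sum of squared errors in O(r) via (14) and (15)."""
    rss = y_sq_norm - 2.0 * alpha @ c + np.sum((s * alpha) ** 2)   # (14)
    trace_term = np.sum(s ** 2 / (s ** 2 + 1.0 / tau2))            # (15)
    return rss + sigma2 * trace_term
```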

PRESS

The shortcut formula for the PRESS statistic (4) for a candidate $\lambda$ requires the computation of the diagonal elements of the hat matrix ${\bf H}(\lambda)={\bf X}({\bf X}^{\rm T}{\bf X}+\lambda{\bf I}_{p})^{-1}{\bf X}^{\rm T}$, as well as the residual vector ${\bf e}={\bf y}-{\bf H}(\lambda){\bf y}$. With the SVD, the former simplifies to

{\bf H}(\lambda)={\bf U}{\bm{\Sigma}}({\bm{\Sigma}}^{2}+\lambda{\bf I}_{r})^{-1}{\bm{\Sigma}}{\bf U}^{\rm T}={\bf U}\,\mathrm{diag}\!\left(\frac{s^{2}_{1}}{s^{2}_{1}+\lambda},\dots,\frac{s^{2}_{r}}{s^{2}_{r}+\lambda}\right){\bf U}^{\rm T},

where we use the fact that diagonal matrices commute. This allows us to compute the desired diagonal elements $h_{ii}$ in time $O(r)$ via

h_{ii}(\lambda)=\sum_{j=1}^{r}u^{2}_{ij}s^{2}_{j}/(s^{2}_{j}+\lambda), \qquad (16)

where $u_{ij}$ denotes the elements of ${\bf U}$. Computing the residual vector is easily done via the rotated ridge solution, ${\bf e}={\bf y}-{\bf U}{\bm{\Sigma}}\hat{{\bm{\alpha}}}_{\lambda}$. However, this still requires $O(nr)$ operations, simply because there are $n$ residuals to compute.
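Putting these pieces together, a sketch of the $O(nr)$ LOOCV risk evaluation for a single candidate $\lambda$ after preprocessing (our own illustration; see the Appendix pseudocode for the algorithm as released):

```python
import numpy as np

def fast_press(U, s, c, y, lam):
    """LOOCV risk (4) for one lambda in O(nr), given the SVD and c = Sigma U'y."""
    alpha = c / (s ** 2 + lam)                      # rotated ridge solution for penalty lam
    h_diag = (U ** 2) @ (s ** 2 / (s ** 2 + lam))   # diagonal hat elements via (16)
    e = y - U @ (s * alpha)                         # residuals of the full fit
    return np.mean((e / (1.0 - h_diag)) ** 2)
```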

Thus, in summary, by combining the pre-processing with the fast computation of the PRESS statistic, we obtain an overall $O(mr^{2}+lqnr)$ implementation of ridge regression via LOOCV, where $l$ denotes the number of candidate $\lambda$ values and $q$ the number of regression target variables. In contrast, for the EM algorithm, by combining the fast computation of ESS and ESN, we end up with an overall complexity of $O(mr^{2}+kqr)$, where $k$ denotes the number of EM iterations. If we further assume that $k=o(n)$, which is supported by the experimental results (see Sec. 5), and that $q,p=O(\sqrt{n})$, the EM approach has an asymptotic advantage of a factor of $l$. This regime is common in settings where more data allows for more fine-grained input as well as output measurements, e.g., in satellite time series classification via multiple-target regression [11, 37]. All time complexities are summarized in Tab. 1, and detailed pseudocode for both the fast EM algorithm and the fast LOOCV algorithm is provided in the Appendix (see Tables 3 and 4).

5 Empirical Evaluation

Figure 2: Comparison of EM to LOOCV variants for increasing $n$ and $p$ for settings with ${\bf x}\sim N_{p}({\bf 0},\Sigma)$ and $y\,|\,{\bf x}\sim N({\bf x}^{\rm T}{\bm{\beta}},0.25)$ with random ${\bm{\beta}}\sim N_{p}({\bf 0},{\bf I})$ and $\Sigma\sim W_{p}({\bf I},p)$.

In this section, we compare the predictive performance and computational cost of LOOCV against the proposed EM method. We present numerical results on both synthetic and real-world datasets. To implement the LOOCV estimator, we use a predefined grid, $L=(\lambda_{1},\ldots,\lambda_{l})$, constructed by one of the two most common methods: (i) fixed grid: arbitrarily select a very small value as $\lambda_{\rm min}$ and a large value as $\lambda_{\rm max}$, and construct a sequence of $l$ values from $\lambda_{\rm max}$ to $\lambda_{\rm min}$ on a log scale; (ii) data-driven grid: find the smallest value of $\lambda_{\rm max}$ that sets the entire regression coefficient vector to zero (i.e., $\hat{\bm{\beta}}=0$), multiply this value by a ratio $\kappa$ such that $\lambda_{\rm min}=\kappa\lambda_{\rm max}$, and create a sequence from $\lambda_{\rm max}$ to $\lambda_{\rm min}$ on a log scale. (Since for ridge regression $\lambda_{\rm max}=\infty$, we follow the glmnet package and derive the sequence of $\lambda$ for $\alpha=0.001$; the penalty function used by glmnet is $\lambda[(1-\alpha)\|{\bm{\beta}}\|^{2}_{2}+\alpha\|{\bm{\beta}}\|_{1}]$, where $\alpha=0$ corresponds to ridge regression.) The latter method is implemented in the glmnet package in combination with an adaptive $\kappa$ coefficient

\kappa=\begin{cases}0.0001, & \text{if } n\geq p,\\ 0.01, & \text{otherwise},\end{cases}

which we replicate here as input to our fast LOOCV algorithm (Appendix, Table 4) to efficiently recover the glmnet LOOCV ridge estimate (glmnet itself computes LOOCV directly by model refitting via coordinate-wise descent, which can be slow).
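For illustration, a sketch of this data-driven grid construction; we assume here the common glmnet convention $\lambda_{\rm max}=\max_j |{\bf x}_j^{\rm T}{\bf y}|/(n\alpha)$ for standardized data, which should be treated as an assumption on our part rather than a statement of the package's exact internals:

```python
import numpy as np

def glmnet_style_grid(X, y, alpha=0.001, num=100):
    """Data-driven lambda grid: log-spaced from lambda_max down to kappa*lambda_max."""
    n, p = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / (n * alpha)   # assumed glmnet-style lambda_max
    kappa = 1e-4 if n >= p else 1e-2                  # adaptive ratio as described above
    lam_min = kappa * lam_max
    return np.logspace(np.log10(lam_max), np.log10(lam_min), num=num)
```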

We consider a fixed grid of ${\bm{\lambda}}=(10^{-10},\ldots,10^{10})$ and the grid based on the glmnet heuristic; in both cases, we use a sequence of length 100. The latter is a data-driven grid, so we obtain a different penalty grid for each simulated or real dataset. Our EM algorithm does not require a predefined penalty grid, but it needs a convergence threshold, which we set to $\epsilon=10^{-8}$. All experiments in this section are performed in Python and the R statistical platform. The datasets and code for the experimental results are publicly available. As is standard in penalized regression, and without any loss of generality, we standardized the data before model fitting. This means that the predictors are standardized to have zero mean and a standard deviation of one, and the target has a mean of zero, i.e., the intercept estimate is simply $\hat{\beta}_{0}=(1/n)\sum_i y_{i}$.

5.1 Simulated Data

In this section, we use a simulation study to investigate the behavior of EM and LOOCV as a function of the sample size, $n$, and two other parameters of interest: the number of covariates, $p$, and the noise level of the target variable. In particular, we are interested in the parameter estimation performance, the corresponding $\lambda$ values, and the computational cost. To gain further insight into the latter, the number of iterations performed by the EM algorithm is of particular interest, as we do not have quantitative bounds for its behavior. We consider two settings that vary in the level of sparsity and the correlation structure of the covariates. The first setting (Fig. 1) assumes i.i.d. Bernoulli-distributed covariates with small success probabilities, resulting in sparse covariate matrices, while the second setting (Fig. 2) assumes normally distributed covariates with random non-zero covariances. In both cases, the target variable is conditionally normal with mean ${\bf x}^{\rm T}{\bm{\beta}}$ for a random ${\bm{\beta}}$ drawn from a standard multivariate normal distribution.

Looking at the results, a common feature of both settings is that the computational cost of the EM algorithm is a non-monotone function of $n$. In contrast to LOOCV, the behavior of EM shows distinctive phases where the cost temporarily decreases with $n$ before it settles into the usually expected, monotonically increasing phase. As can be seen, this is due to the behavior of the number of iterations $k$, which peaks for small values of $n$ before it declines rapidly to a small constant (around 10) when the cost of the pre-processing begins to dominate. The occurrence of these phases is more pronounced for both growing $p$ and growing $\sigma$. This behavior is likely due to the convergence to normality of the posterior distribution as the sample size $n\to\infty$, with convergence being slower for large $p$.

An interesting observation is that CV with the employed glmnet grid heuristic fails, in the sense that the resulting ridge estimator does not appear to be consistent for large $p$ in Setting 2. This is due to the minimum value of $\lambda$ produced by the glmnet heuristic being too large, and the resulting ridge estimates being over-shrunk. This clearly underlines the difficulty of choosing a robust dynamic grid, a problem that our EM algorithm avoids completely.

5.2 Real Data

We evaluated our EM method on 24 real-world datasets: 21 datasets from the UCI machine learning repository [5] (unless referenced otherwise) for normal linear regression tasks, and 3 time-series datasets from the UCR repository [10] for multi-target regression tasks. The latter are multilabel classification problems in which the feature matrix is generated by the state-of-the-art HYDRA [12] time series classification procedure (which by default uses LOOCV ridge regression for classification), and we train $q$ ridge regression models in a one-versus-all fashion, where $q$ is the number of target classes. The datasets were chosen to cover a wide range of sample sizes, $n$, and numbers of predictors, $p$. We compared our EM algorithm against fast LOOCV in terms of predictive performance, measured by $R^{2}$ (and classification accuracy) on the test data, and computational efficiency.

Our linear regression experiments involve three settings: (i) standard linear regression; (ii) second-order multivariate polynomial regression with added interactions and second-degree polynomial transformations of the variables; and (iii) third-order multivariate polynomial regression with added three-way interactions and cubic polynomial transformations. For each experiment, we repeated the process 100 times using a random 70/30 train-test split. Due to memory limitations, we limited the design matrix size to a maximum of 35 million entries. If the number of transformed predictors exceeded this limit, we uniformly sub-sampled the interaction variables to ensure that $p^{*}\leq 35{,}000{,}000/(0.7n)$, and then fit the model using the sampled variables. Note that we always kept the original variables (main effects) and sub-sampled only the interactions. In the case of multi-target regression, we performed a random 70/30 train-test split and repeated the experiment 30 times. To ensure efficient reproducibility of our experiments, we set a maximum runtime of 3 hours for each dataset; any settings that exceeded this time limit were excluded from the results table.


Table 2: Real-data experiment results. The first column is the dataset (abbreviated; refer to Appendix E for the full name), with the number of target variables $q$ in parentheses where applicable; $n$ is the training sample size and $p$ the raw number of features; $T$ is the runtime ratio ${\rm t}_{\rm CV}/{\rm t}_{\rm EM}$; $p^{*}$ is the number of features including interactions; EM, Fix, and GLM are the $R^{2}$ values on the test data for the three procedures.
Dataset ($q$)  $n$  $p$ | Linear: $T$ EM Fix GLM | 2nd Order: $p^{*}$ $T$ EM Fix GLM | 3rd Order: $p^{*}$ $T$ EM Fix GLM
Twitter 408275 77 20 0.94 0.94 0.94 86 16 0.94 0.94 0.94 - - - - -
Blog 39355 275 13 0.46 0.46 0.46 804 9.1 0.51 0.51 0.51 - - - - -
CT slices 37450 379 12 0.86 0.86 0.86 930 7.7 0.92 0.91 0.92 - - - - -
TomsHw 19725 96 17 0.63 0.63 0.63 1775 6.5 0.71 0.71 0.71 - - - - -
NPD - com 8353 13 13 0.84 0.84 0.84 104 15 1.00 1.00 1.00 559 8.6 1.00 1.00 1.00
NPD - tur 8353 13 14 0.91 0.91 0.91 104 15 1.00 1.00 1.00 559 8.6 1.00 1.00 1.00
PT - motor 4112 19 13 0.15 0.15 0.15 208 12 0.25 0.19 0.21 1539 3.9 -1.09 0.01 0.04
PT - total 4112 19 13 0.17 0.17 0.17 208 12 0.24 0.23 0.21 1539 3.7 -1.38 -0.04 0.00
Abalone 2923 9 13 0.53 0.53 0.53 51 16 0.38 0.35 0.50 209 12 0.28 0.12 0.12
Crime 1395 99 14 0.66 0.66 0.66 5049 1.3 0.67 -0.74 0.66 17652 1.1 0.66 -0.22 0.60
Airfoil 1052 5 17 0.51 0.51 0.51 20 14 0.62 0.62 0.62 55 11 0.73 0.73 0.73
Student 730 39 12 0.18 0.18 0.18 801 3.8 0.19 -0.89 0.16 10693 1.1 0.19 -6.22 -0.18
Concrete 721 8 16 0.61 0.61 0.61 44 11 0.78 0.78 0.78 164 5.5 0.85 0.85 0.85
F.Fires 361 12 2.4 -0.01 -0.01 -0.03 295 0.5 -0.01 -0.01 -0.14 1984 0.3 -0.01 -50.6 -0.45
B.Housing 354 13 11 0.71 0.71 0.71 104 8.1 0.84 0.83 0.80 559 2.2 0.83 -3e2 0.83
Facebook 346 15 15 0.91 0.91 0.91 167 6.6 -5.09 -26.4 -3.99 1087 1.9 -2.53 -2e4 -5.76
Diabetes 1 309 10 13 0.49 0.49 0.49 65 6.5 0.49 0.48 0.48 285 2.1 0.47 0.47 0.47
R.Estate 289 6 13 0.56 0.56 0.56 27 10 0.65 0.65 0.65 83 5.5 0.65 0.65 0.65
A.MPG 278 8 13 0.81 0.81 0.81 35 8.3 0.86 0.86 0.86 119 4.1 0.86 0.86 0.86
Yacht 215 7 16 0.97 0.97 0.97 27 11 0.97 0.97 0.97 83 6.1 0.98 0.98 0.98
A.mobile 111 25 6.1 0.90 0.89 0.89 1076 1.7 0.90 -4e3 0.89 12924 0.5 0.88 -1e4 0.83
Eye 1 84 200 2.7 0.50 0.26 0.45 20300 1.3 0.19 0.19 0.19 - - - - -
Ribo 1 49 4088 2.3 0.64 0.64 0.64 - - - - - - - - - -
Crop (24) 2 11760 3072 49 0.75 0.75 0.76 - - - - - - - - - -
ElecD (7) 2 11645 4096 9.2 0.88 0.88 0.89 - - - - - - - - - -
StarL (3) 2 6465 7168 2.1 0.98 0.60 0.98 - - - - - - - - - -
  • 1 This dataset is not from the UCI repository. Data references can be found in the appendix.

  • 2 Time-series dataset from the UCR repository. EM, Fix, and GLM are the classification accuracies on the test data.

Table 2 details the results of our experiments; specifically, the ratio of the time taken to run fast LOOCV divided by the time taken to run our EM procedure ($T$), and the $R^{2}$ values obtained by both methods on the withheld test set. The numbers of features, $p$, and observations, $n$, recorded are the values after data preprocessing (missing observations removed, one-hot encoding transformations, etc.). The results demonstrate that our EM algorithm can be up to 49 times faster than fast LOOCV, with the speed-ups becoming more apparent as the sample size $n$ and the number of target variables $q$ increase. In addition, we see that this advantage in speed does not come at a cost in predictive performance, as our EM approach is comparable to, if not better than, LOOCV in almost all cases (see also Appendix, Figure 1, in which most of the $R^{2}$ values are distributed along the diagonal line).

An interesting observation is that LOOCV using the fixed grid can occasionally perform extremely poorly (as indicated by large negative $R^{2}$ values), while LOOCV using the glmnet grid does not seem to exhibit this behavior. This appears to be due to the grid chosen by the glmnet heuristic: its performance is artificially improved because it is unable to evaluate sufficiently small values of $\lambda$, and it is therefore not actually selecting the very small $\lambda$ value that minimizes the LOOCV score. The incorrectly large $\lambda$ values provide protection in these examples from under-shrinkage.

6 Conclusion

The introduced EM algorithm is a robust and computationally fast alternative to LOOCV for ridge regression. The unimodality of the posterior guarantees robust behavior for finite $n$ under mild conditions, relative to LOOCV grid search, and the SVD preprocessing enables an overall faster computation with an ultra-fast $O(k\min(n,p))$ main loop. Combining this with a procedure such as orthogonal least squares to obtain a highly efficient forward selection procedure is a promising avenue for future research: as the Q-function is an expected negative log-posterior, it offers a score on which the usefulness of the predictors themselves may be assessed, i.e., a model selection criterion, resulting in a potentially very accurate and fast procedure for sparse model learning. An important open problem is the theoretical analysis of the expected number of EM iterations $k$ required for convergence. The empirical evidence suggests that $k$ converges to a constant and is thus negligible in the asymptotic time complexity. This is in alignment with the convergence of the posterior to a multivariate normal distribution. However, such intuitive and empirical arguments cannot replace a rigorous worst-case analysis.

Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for their valuable feedback that led to a substantially improved theoretical analysis. This work was supported by the Australian Research Council (DP210100045).

References

  • Akaike [1974] Hirotugu Akaike. A new look at the statistical model identification. In Selected Papers of Hirotugu Akaike, pages 215–222. Springer, 1974.
  • Alkhamisi et al. [2006] Mahdi Alkhamisi, Ghadban Khalaf, and Ghazi Shukur. Some modifications for choosing ridge parameters. Communications in Statistics-Theory and Methods, 35(11):2005–2020, 2006.
  • Allen [1974] David M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1):125–127, 1974.
  • Arlot and Celisse [2010] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. 2010.
  • Asuncion and Newman [2007] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
  • Barndorff-Nielsen et al. [1982] Ole Barndorff-Nielsen, John Kent, and Michael Sørensen. Normal variance-mean mixtures and z distributions. International Statistical Review/Revue Internationale de Statistique, pages 145–159, 1982.
  • Bhat and Raju [2017] Satish Bhat and Vidya Raju. A class of generalized ridge estimators. Communications in Statistics - Simulation and Computation, 46(7):5105–5112, 2017. doi: 10.1080/03610918.2016.1144765. URL https://doi.org/10.1080/03610918.2016.1144765.
  • Bontemps [2011] Dominique Bontemps. Bernstein–von Mises theorems for Gaussian regression with increasing number of regressors. 2011.
  • Bühlmann et al. [2014] Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1:255–278, 2014.
  • Dau et al. [2019] Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica, 6(6):1293–1305, 2019.
  • Dempster et al. [2020] Angus Dempster, François Petitjean, and Geoffrey I Webb. ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery, 34(5):1454–1495, 2020.
  • Dempster et al. [2023] Angus Dempster, Daniel F Schmidt, and Geoffrey I Webb. HYDRA: Competing convolutional kernels for fast and accurate time series classification. Data Mining and Knowledge Discovery, 2023.
  • Efron et al. [2004] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of statistics, 32(2):407–499, 2004.
  • Ehsanes Saleh and Golam Kibria [1993] AK Md Ehsanes Saleh and BM Golam Kibria. Performance of some new preliminary test ridge regression estimators and their properties. Communications in Statistics-Theory and Methods, 22(10):2747–2764, 1993.
  • Gelman [2006] Andrew Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515–533, 2006.
  • Ghosal [1999] Subhashis Ghosal. Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli, pages 315–331, 1999.
  • Ghosal et al. [2000] Subhashis Ghosal, Jayanta K Ghosh, and Aad W Van Der Vaart. Convergence rates of posterior distributions. Annals of Statistics, pages 500–531, 2000.
  • Golub et al. [1979] Gene H Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
  • Hamed et al. [2013] Ramadan Hamed, Ali EL Hefnawy, and Aya Farag. Selection of the ridge parameter using mathematical programming. Communications in Statistics-Simulation and Computation, 42(6):1409–1432, 2013.
  • Hanson [1971] Richard J Hanson. A numerical method for solving Fredholm integral equations of the first kind using singular values. SIAM Journal on Numerical Analysis, 8(3):616–622, 1971.
  • Harrison Jr and Rubinfeld [1978] David Harrison Jr and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81–102, 1978.
  • Hastie et al. [2022] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. Annals of statistics, 50(2):949, 2022.
  • Hefnawy and Farag [2014] Ali El Hefnawy and Aya Farag. A combined nonlinear programming model and Kibria method for choosing ridge parameter regression. Communications in Statistics-Simulation and Computation, 43(6):1442–1470, 2014.
  • Hilgers [1976] John W Hilgers. On the equivalence of regularization and certain reproducing kernel Hilbert space approaches for solving first kind problems. SIAM Journal on Numerical Analysis, 13(2):172–184, 1976.
  • Hoerl and Kennard [1970] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
  • Hsu et al. [2012] Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 9.1–9.24, Edinburgh, Scotland, 25–27 Jun 2012. PMLR.
  • James [1961] W. James. Estimation with quadratic loss. In Proc. 4th Berkeley Symp. on Math. Statist. and Prob., 1961.
  • Lawless [1976] J. F. Lawless. A simulation study of ridge and other regression estimators. Communications in Statistics - Theory and Methods, 5(4):307–323, 1976.
  • Khalaf and Shukur [2005] Ghadban Khalaf and Ghazi Shukur. Choosing ridge parameter for regression problems. 2005.
  • Kibria [2003] BM Golam Kibria. Performance of some new ridge regression estimators. Communications in Statistics-Simulation and Computation, 32(2):419–435, 2003.
  • Kobak et al. [2020] Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. The Journal of Machine Learning Research, 21(1):6863–6878, 2020.
  • Lindley and Smith [1972] D. V. Lindley and A. F. M. Smith. Bayes estimates for the linear model. Journal of the Royal Statistical Society (Series B), 34(1):1–41, 1972.
  • Mallows [2000] Colin L Mallows. Some comments on Cp. Technometrics, 42(1):87–94, 2000.
  • Muniz and Kibria [2009] Gisela Muniz and BM Golam Kibria. On some ridge regression estimators: An empirical comparisons. Communications in Statistics—Simulation and Computation®, 38(3):621–630, 2009.
  • Park and Casella [2008] Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
  • Patil et al. [2021] Pratik Patil, Yuting Wei, Alessandro Rinaldo, and Ryan Tibshirani. Uniform consistency of cross-validation estimators for high-dimensional ridge regression. In International Conference on Artificial Intelligence and Statistics, pages 3178–3186. PMLR, 2021.
  • Petitjean et al. [2012] François Petitjean, Jordi Inglada, and Pierre Gançarski. Satellite image time series analysis under time warping. IEEE transactions on geoscience and remote sensing, 50(8):3081–3095, 2012.
  • Polson and Scott [2012] Nicholas G Polson and James G Scott. On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4):887–902, 2012.
  • Scheetz et al. [2006] Todd E Scheetz, Kwang-Youn A Kim, Ruth E Swiderski, Alisdair R Philp, Terry A Braun, Kevin L Knudtson, Anne M Dorrance, Gerald F DiBona, Jian Huang, Thomas L Casavant, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429–14434, 2006.
  • Schwarz et al. [1978] Gideon Schwarz et al. Estimating the dimension of a model. The annals of statistics, 6(2):461–464, 1978.
  • Spiegelhalter et al. [1996] David Spiegelhalter, Andrew Thomas, Nicky Best, and Wally Gilks. BUGS 0.5: Bayesian inference using gibbs sampling manual (version ii). MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK, pages 1–59, 1996.
  • Stephenson et al. [2021] Will Stephenson, Zachary Frangella, Madeleine Udell, and Tamara Broderick. Can we globally optimize cross-validation loss? Quasiconvexity in ridge regression. Advances in Neural Information Processing Systems, 34:24352–24364, 2021.
  • Strawderman [1971] William E Strawderman. Proper Bayes minimax estimators of the multivariate normal mean. The Annals of Mathematical Statistics, 42(1):385–388, 1971.
  • Tew et al. [2022] Shu Yu Tew, Daniel F Schmidt, and Enes Makalic. Sparse horseshoe estimation via expectation-maximisation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 123–139. Springer, 2022.
  • Van der Vaart [2000] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
  • Varah [1973] James M. Varah. On the numerical solution of ill-conditioned linear systems with applications to ill-posed problems. SIAM Journal on Numerical Analysis, 10(2):257–267, 1973.
  • Wang and Zou [2021] Boxiang Wang and Hui Zou. Honest leave-one-out cross-validation for estimating post-tuning generalization error. Stat, 10(1):e413, 2021.
  • Yasin et al. [2013] ASAR Yasin, Adnan Karaibrahimoğlu, and GENÇ Aşır. Modified ridge regression parameters: A comparative Monte Carlo study. Hacettepe Journal of Mathematics and Statistics, 43(5):827–841, 2013.
  • Zhang and Yang [2015] Yongli Zhang and Yuhong Yang. Cross-validation for selecting a model selection procedure. Journal of Econometrics, 187(1):95–112, 2015.

Appendix A EM Procedure for Multiple Means

We show that the Bayesian EM procedure provides sensible estimates of the regularization parameter even in the setting of the normal multiple means problem with known variance $\sigma^{2}$. In this setting, LOOCV is unable to provide any guidance on how to choose $\lambda$, as all the information about each regression parameter is concentrated in a single observation. We use the $\tau$ parameterisation of the hyperparameter, rather than $\tau^{2}$, as the resulting estimator has an easy-to-analyse form.

In the normal multiple means model, we are given $(y_{i}|\beta_{i})\sim N(\beta_{i},1)$, i.e., ${\bf y}$ is a $p$-dimensional normally distributed vector with mean $\bm{\beta}$ and identity covariance matrix. The conditional posterior distribution of $\bm{\beta}$ is:

\bm{\beta}\,|\,{\bf y},\tau \sim N\left((1-\kappa){\bf y},\,\sigma^{2}(1-\kappa)\right) \qquad (17)

where $\kappa=1/(1+\tau^{2})$. Under this setting, Strawderman [43] proved that if $p\geq 3$, then any estimator of the form

\left(1-r\left(\tfrac{1}{2}\|y\|^{2}\right)\frac{p-2}{\|y\|^{2}}\right)y \qquad (18)

where $0\leq r\left(\tfrac{1}{2}\|y\|^{2}\right)\leq 2$ and $r(\cdot)$ is non-decreasing, is minimax, i.e., it dominates least squares. We will now show that our EM procedure not only yields reasonable estimates in this setting, in contrast to LOOCV, but that these estimates are minimax, and hence dominate least squares.

For the normal means model, we can obtain a closed form solution for the optimum τ\tau, by solving for the stationary point for which τt+1=τt\tau_{t+1}=\tau_{t}, with τC+(0,1)\tau\sim C^{+}(0,1):

argmin𝜏{E𝜷[logp(y|β,τ)logp(β|τ)logπ(τ)]}\displaystyle\underset{\tau}{\operatorname{arg\,min}}\left\{{\rm E}_{\bm{\beta}}\!\left[-\log\>p(y|\beta,\tau)-\log\>p(\beta|\tau)-\log\>\pi(\tau)\right]\right\} =τ\displaystyle=\tau
argmin𝜏{p2logτ2+w2τ2+log(1+τ2)}\displaystyle\underset{\tau}{\operatorname{arg\,min}}\left\{\frac{p}{2}\log\tau^{2}+\frac{w}{2\tau^{2}}+\log(1+\tau^{2})\right\} =τ\displaystyle=\tau
wp+p2+8w+2pw+w22(2+p)\displaystyle\sqrt{\frac{w-p+\sqrt{p^{2}+8w+2pw+w^{2}}}{2(2+p)}} =τ,\displaystyle=\tau,

where w=j=1pE[βj2]=(1κ)2s+(1κ)pw=\sum_{j=1}^{p}{\rm E}\!\left[{\beta}^{2}_{j}\right]=(1-\kappa)^{2}s+(1-\kappa)p, s=y2s=||y||^{2}, and τ=1κκ\tau=\sqrt{\frac{1-\kappa}{\kappa}}. Substituting these expressions yields

\displaystyle\sqrt{\frac{\sqrt{p((\kappa-2)^{2}p-8\kappa+8)-2(\kappa-1)^{2}s((\kappa-2)p-4)+(\kappa-1)^{4}s^{2}}-\kappa p+(\kappa-1)^{2}s}{2(2+p)}} \displaystyle=\sqrt{\frac{1-\kappa}{\kappa}}

with solution κ=(p+2)/s\kappa=(p+2)/s. Plugging this value of κ\kappa into (17) shows that the resulting estimator of 𝜷\bm{\beta} is of the form (18) with

r(12y2)\displaystyle r\left(\frac{1}{2}||y||^{2}\right) =(p+2y2)/(p2y2)\displaystyle=\left(\frac{p+2}{||y||^{2}}\right)/\left(\frac{p-2}{||y||^{2}}\right)
=p+2p2\displaystyle=\frac{p+2}{p-2}

As we have r(12y2)2r\left(\frac{1}{2}||y||^{2}\right)\leq 2 when p6p\geq 6, the EM ridge estimator is minimax in this setting for p6p\geq 6.
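As a quick numerical sanity check of this result (not part of the paper's experiments), the following sketch simulates the multiple means model and compares the squared-error risk of the shrinkage estimator (1 − (p+2)/||y||²)y implied by κ = (p+2)/s against the least-squares estimator y; the true mean vector, seed and number of trials are arbitrary choices.

```python
# Minimal Monte Carlo sketch: risk of the EM multiple-means estimator vs. least squares.
import numpy as np

rng = np.random.default_rng(0)
p, n_trials = 10, 20000
beta = np.full(p, 0.5)                    # arbitrary true mean vector (assumption)

risk_ls, risk_em = 0.0, 0.0
for _ in range(n_trials):
    y = beta + rng.standard_normal(p)     # y_i ~ N(beta_i, 1)
    kappa = (p + 2) / np.sum(y ** 2)      # EM fixed-point solution kappa = (p+2)/s
    beta_em = (1 - kappa) * y             # posterior-mean estimator from (17)
    risk_ls += np.sum((y - beta) ** 2)
    risk_em += np.sum((beta_em - beta) ** 2)

print("least-squares risk:", risk_ls / n_trials)
print("EM shrinkage risk :", risk_em / n_trials)   # should be no larger on average
```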

Appendix B Proof of Theorem 3.1

We prove that for sufficiently large nn, a continuous injective reparameterization of the negative log joint posterior of (5) & (7) is convex when restricted to τϵ\tau\geq\epsilon. This is sufficient, since unimodality is preserved by strictly monotone transformations and continuous injective reparameterizations.

Specifically, for the presented hierarchical model, the negative log joint posterior up to an additive constant is

n+p+22logσ2+12σ2𝐲𝐗𝜷2+p+22a2logτ2+𝜷22σ2τ2+(a+b)log(1+τ2)\frac{n+p+2}{2}\log\sigma^{2}+\frac{1}{2\sigma^{2}}\|{\bf y}-{\bf X}{\bm{\beta}}\|^{2}+\frac{p+2-2a}{2}\log\tau^{2}+\frac{\|{\bm{\beta}}\|^{2}}{2\sigma^{2}\tau^{2}}+(a+b)\log(1+\tau^{2})

and reparameterising with ϕ=β/σ\phi=\beta/\sigma, ρ=1/σ\rho=1/\sigma and χ=1/τ\chi=1/\tau and reorganising terms yields

(n+p+2)logρ(p+22a)logχ+(a+b)log(1+χ2)+12ρ𝐲𝐗ϕ2+χϕ22.-(n+p+2)\log\rho-(p+2-2a)\log\chi+(a+b)\log(1+\chi^{-2})+\frac{1}{2}\|\rho{\bf y}-{\bf X}{\bm{\phi}}\|^{2}+\frac{\|\chi{\bm{\phi}}\|^{2}}{2}\enspace.

The first three terms are easily checked to be convex via a second-derivative test; convexity of the second term requires a<1+p/2a<1+p/2, which holds in our setting with a=b=1/2a=b=1/2. For the last two terms, the combined Hessian has the block form [𝐀,𝐁;𝐁T,𝐂][{\bf A},{\bf B};{\bf B}^{\rm T},{\bf C}] with 𝐀=𝐗T𝐗+χ2𝐈p{\bf A}={\bf X}^{\rm T}{\bf X}+\chi^{2}{\bf I}_{p}, 𝐁=2χϕ{\bf B}=2\chi{\bm{\phi}}, and 𝐂=ϕ2{\bf C}=\|{\bm{\phi}}\|^{2}. Symmetric matrices of this form are positive definite if 𝐀{\bf A} and its Schur complement

𝐂𝐁T𝐀1𝐁=ϕ24ϕT(𝐗T𝐗/χ2+𝐈p)1ϕ{\bf C}-{\bf B}^{\rm T}{\bf A}^{-1}{\bf B}=\|{\bm{\phi}}\|^{2}-4{\bm{\phi}}^{\rm T}({\bf X}^{\rm T}{\bf X}/\chi^{2}+{\bf I}_{p})^{-1}{\bm{\phi}}

are positive definite. Clearly, 𝐀{\bf A} is positive definite. Moreover, for n>4/(ϵ2γn)n>4/(\epsilon^{2}\gamma_{n}), we have

ϕT(𝐗T𝐗/χ2+I)1ϕ\displaystyle{\bm{\phi}}^{\rm T}({\bf X}^{\rm T}{\bf X}/\chi^{2}+I)^{-1}{\bm{\phi}} =ϕT(𝐕𝚺T𝚺𝐕T/χ2+I)1ϕ\displaystyle={\bm{\phi}}^{\rm T}({\bf V}{\bm{\Sigma}}^{\rm T}{\bm{\Sigma}}{\bf V}^{\rm T}/\chi^{2}+I)^{-1}{\bm{\phi}}
=ϕT𝐕(𝚺T𝚺/χ2+I)1𝐕Tϕ\displaystyle={\bm{\phi}}^{\rm T}{\bf V}({\bm{\Sigma}}^{\rm T}{\bm{\Sigma}}/\chi^{2}+I)^{-1}{\bf V}^{\rm T}{\bm{\phi}}
ϕT𝐕𝐕Tϕχ2/(nγn)\displaystyle\leq{\bm{\phi}}^{\rm T}{\bf V}{\bf V}^{\rm T}{\bm{\phi}}\chi^{2}/(n\gamma_{n})
=((𝐈𝐕𝐕T)ϕ+𝐕𝐕Tϕ)T𝐕𝐕Tϕχ2/(nγn)\displaystyle=(({\bf I}-{\bf V}{\bf V}^{\rm T}){\bm{\phi}}+{\bf V}{\bf V}^{\rm T}{\bm{\phi}})^{\rm T}{\bf V}{\bf V}^{\rm T}{\bm{\phi}}\chi^{2}/(n\gamma_{n})
=𝐕𝐕Tϕ2χ2/(nγn)\displaystyle=\|{\bf V}{\bf V}^{\rm T}{\bm{\phi}}\|^{2}\chi^{2}/(n\gamma_{n})
ϕ2χ2/(nγn)\displaystyle\leq\|{\bm{\phi}}\|^{2}\chi^{2}/(n\gamma_{n})
<ϕ2/(ϵ2nγn)\displaystyle<\|{\bm{\phi}}\|^{2}/(\epsilon^{2}n\gamma_{n})
<ϕ2/4\displaystyle<\|{\bm{\phi}}\|^{2}/4

where we used the fact that 𝐕𝐕T{\bf V}{\bf V}^{\rm T} is the orthogonal projection onto the column space of 𝐕{\bf V}. The overall inequality implies the required positivity of the Schur complement.
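The block-Hessian argument can also be checked numerically. The sketch below is illustrative only (the dimensions, the value of χ and the random draws are arbitrary choices): it forms the block matrix [A, B; Bᵀ, C] for random data, verifies that the two expressions for the Schur complement agree, and confirms positive definiteness for a moderately large n.

```python
# Numerical sanity check of the block Hessian [[A, B], [B^T, C]] and its Schur complement.
import numpy as np

rng = np.random.default_rng(1)
n, p, chi = 500, 10, 0.5
X = rng.standard_normal((n, p))
phi = rng.standard_normal(p)

A = X.T @ X + chi ** 2 * np.eye(p)
B = 2 * chi * phi
C = np.dot(phi, phi)

H = np.block([[A, B[:, None]], [B[None, :], np.array([[C]])]])
schur = C - B @ np.linalg.solve(A, B)
schur_direct = C - 4 * phi @ np.linalg.solve(X.T @ X / chi ** 2 + np.eye(p), phi)

print("min eigenvalue of block Hessian:", np.linalg.eigvalsh(H).min())   # > 0
print("Schur complement:", schur, "direct form:", schur_direct)          # agree, > 0
```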

Appendix C Derivation of Equation 9

C.1 Derivation of ESN\mathrm{ESN}

Here we show that

j=1pE[βj2|τ^(t),σ^2(t)]\displaystyle\sum_{j=1}^{p}{\rm E}\!\left[\beta_{j}^{2}\>|\>\hat{\tau}^{(t)},\hat{\sigma}^{2^{(t)}}\right] =tr(Cov[𝜷])+j=1pE[βj]2\displaystyle={\rm tr}\left({{\rm Cov}[\bm{\beta}]}\right)+\sum_{j=1}^{p}{\rm E}\!\left[{\beta}_{j}\right]^{2}
=σ2tr(𝐀τ1)+𝜷^τ2\displaystyle=\sigma^{2}\mathrm{tr}({\bf A}_{\tau}^{-1})+\|\hat{\bm{\beta}}_{\tau}\|^{2}

This follows directly from the fact that, for any random variable xx, the expected squared value is E[x2]=Var[x]+E[x]2{\rm E}\!\left[x^{2}\right]={\rm Var}[x]+{\rm E}\!\left[x\right]^{2}; applying this coordinate-wise and summing over jj gives the result.

C.2 Derivation of ESS\mathrm{ESS}

Here we show that

E𝜷[𝐲𝐗𝜷2|τ^(t),σ^2(t)]\displaystyle{\rm E}_{\bm{\beta}}\!\left[||{\bf y}-{\bf X}\bm{\beta}||^{2}\>|\>\hat{\tau}^{(t)},\hat{\sigma}^{2^{(t)}}\right] =𝐲𝐗E[𝜷]2+tr(𝐗T𝐗Cov[𝜷])\displaystyle=||{\bf y}-{\bf X}\;{\rm E}\!\left[\bm{\beta}\right]||^{2}+{\rm tr}({\bf X}^{\rm T}{\bf X}\;\rm{Cov}[\bm{\beta}]) (19)

We first state a standard fact about quadratic forms of random vectors in Lemma C.1 below:

Lemma C.1.

Let 𝐛{\bf b} be a p-dimensional random vector and 𝐀{\bf A} be a p×pp\times p symmetric matrix. If E[𝐛]=𝛍{\rm E}\!\left[\bf b\right]={\bm{\mu}} and Var(𝐛)=𝚺{\rm Var}(\bf b)={\bm{\Sigma}}, then E[𝐛T𝐀𝐛]=tr(𝐀𝚺)+𝛍T𝐀𝛍{\rm E}\!\left[{\bf b}^{\rm T}{\bf A}{\bf b}\right]={\rm tr}({\bf A}{\bm{\Sigma}})+{\bm{\mu}}^{\rm T}{\bf A}{\bm{\mu}}.

Now, we expand the left-hand side of Equation 19:

E𝜷[𝐲𝐗𝜷2]\displaystyle{\rm E}_{\bm{\beta}}\!\left[||{\bf y}-{\bf X}\bm{\beta}||^{2}\right] =E𝜷[(𝐲𝐗𝜷)T(𝐲𝐗𝜷)]\displaystyle={\rm E}_{\bm{\beta}}\!\left[({\bf y}-{\bf X}\bm{\beta})^{T}({\bf y}-{\bf X}\bm{\beta})\right]
=E𝜷[𝐲T𝐲2𝐲T𝐗𝜷+𝜷T𝐗T𝐗𝜷]\displaystyle={\rm E}_{\bm{\beta}}\!\left[{\bf y}^{\rm T}{\bf y}-2{\bf y}^{\rm T}{\bf X}\bm{\beta}+\bm{\beta}^{\rm T}{\bf X}^{\rm T}{\bf X}\bm{\beta}\right]
=𝐲T𝐲2𝐲T𝐗E[𝜷]+E[𝜷T𝐗T𝐗𝜷]\displaystyle={\bf y}^{\rm T}{\bf y}-2\>{\bf y}^{\rm T}{\bf X}{\rm E}\!\left[\bm{\beta}\right]+{\rm E}\!\left[\bm{\beta}^{\rm T}{\bf X}^{\rm T}{\bf X}\bm{\beta}\right] (20)

Applying Lemma C.1 allows Equation 20 to be rewritten as

E𝜷[𝐲𝐗𝜷2]\displaystyle{\rm E}_{\bm{\beta}}\!\left[||{\bf y}-{\bf X}\bm{\beta}||^{2}\right] =𝐲T𝐲2𝐲T𝐗E[𝜷]+E[𝜷]T(𝐗T𝐗)E[𝜷]+tr(𝐗T𝐗Cov[𝜷])\displaystyle={\bf y}^{\rm T}{\bf y}-2\>{\bf y}^{\rm T}{\bf X}{\rm E}\!\left[\bm{\beta}\right]+{\rm E}\!\left[\bm{\beta}\right]^{\rm T}({\bf X}^{\rm T}{\bf X}){\rm E}\!\left[\bm{\beta}\right]+{\rm tr}({\bf X}^{\rm T}{\bf X}\;\rm{Cov}[\bm{\beta}])
=𝐲𝐗E[𝜷]2+tr(𝐗T𝐗Cov[𝜷])\displaystyle=||{\bf y}-{\bf X}\;{\rm E}\!\left[\bm{\beta}\right]||^{2}+{\rm tr}({\bf X}^{\rm T}{\bf X}\;\rm{Cov}[\bm{\beta}])
=𝐲𝐗𝜷^τ2+σ2tr(𝐗T𝐗𝐀τ1)\displaystyle=||{\bf y}-{\bf X}\;\hat{{\bm{\beta}}}_{\tau}||^{2}+\sigma^{2}\mathrm{tr}({\bf X}^{\rm T}{\bf X}{\bf A}_{\tau}^{-1})
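Both identities can be verified by simple Monte Carlo simulation. The sketch below is illustrative only (the dimensions, the seed, and the covariance matrix S are arbitrary choices): it draws β ∼ N(μ, S) and compares sample averages against the two closed forms.

```python
# Monte Carlo check of the ESN and ESS identities:
#   E[||beta||^2]       = tr(S) + ||mu||^2
#   E[||y - X beta||^2] = ||y - X mu||^2 + tr(X^T X S)
import numpy as np

rng = np.random.default_rng(2)
n, p, draws = 50, 5, 200000
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
mu = rng.standard_normal(p)
L = rng.standard_normal((p, p))
S = L @ L.T / p                           # an arbitrary positive definite covariance

beta = mu + rng.standard_normal((draws, p)) @ np.linalg.cholesky(S).T
print(np.mean(np.sum(beta ** 2, axis=1)), np.trace(S) + mu @ mu)
resid = y[None, :] - beta @ X.T
print(np.mean(np.sum(resid ** 2, axis=1)),
      np.sum((y - X @ mu) ** 2) + np.trace(X.T @ X @ S))
```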

Appendix D Solving for the parameter updates (Derivation of Equation 11)

Rather than solving the two-dimensional numerical optimization problem (10), we show that for a fixed τ2\tau^{2} we can find a closed-form solution for σ2\sigma^{2}, and vice versa. We begin by finding the solution for σ2\sigma^{2} as a function of τ2\tau^{2}, starting from the negative logarithm of the joint probability distribution of hierarchy (5):

argminσ2{E𝜷[logp(y|X,β,σ2)logp(β|τ2,σ2)logp(σ2)logπ(τ2)]}.\displaystyle\underset{\sigma^{2}}{\operatorname{arg\,min}}\left\{{\rm E}_{\bm{\beta}}\!\left[-\log\>p(y|X,\beta,\sigma^{2})-\log\>p(\beta|\tau^{2},\sigma^{2})-\log\>p(\sigma^{2})-\log\>\pi(\tau^{2})\right]\right\}. (21)

Dropping terms that do not depend on σ2{\sigma^{2}} yields:

argminσ2{E𝜷[logp(y|X,β,σ2)logp(β|τ2,σ2)logp(σ2)]}\displaystyle\underset{\sigma^{2}}{\operatorname{arg\,min}}\left\{{\rm E}_{\bm{\beta}}\!\left[-\log\>p(y|X,\beta,\sigma^{2})-\log\>p(\beta|\tau^{2},\sigma^{2})-\log\>p(\sigma^{2})\right]\right\}
=argminσ2{(n+p2)logσ2+ESS2σ2+ESN2σ2τ2+logσ2}\displaystyle=\underset{\sigma^{2}}{\operatorname{arg\,min}}\left\{\left(\frac{n+p}{2}\right)\log\sigma^{2}+\frac{\mathrm{ESS}}{2\sigma^{2}}+\frac{\mathrm{ESN}}{2\sigma^{2}\tau^{2}}+\log\sigma^{2}\right\}
=argminσ2{(n+p+22)logσ2+ESS2σ2+ESN2σ2τ2}.\displaystyle=\underset{\sigma^{2}}{\operatorname{arg\,min}}\left\{\left(\frac{n+p+2}{2}\right)\log\sigma^{2}+\frac{\mathrm{ESS}}{2\sigma^{2}}+\frac{\mathrm{ESN}}{2\sigma^{2}\tau^{2}}\right\}. (22)

Solving the above minimization problem involves differentiating the negative logarithm with respect to σ2\sigma^{2} and solving for the σ2\sigma^{2} that sets the derivative to zero. This gives us:

σ2{(n+p+22)logσ2+ESS2σ2+ESN2σ2τ2}\displaystyle\frac{\partial}{\partial\sigma^{2}}\left\{\left(\frac{n+p+2}{2}\right)\log\sigma^{2}+\frac{\mathrm{ESS}}{2\sigma^{2}}+\frac{\mathrm{ESN}}{2\sigma^{2}\tau^{2}}\right\} =0\displaystyle=0
2+n+p2σ2ESS2(σ2)2ESN2(σ2)2τ2\displaystyle\frac{2+n+p}{2\sigma^{2}}-\frac{\mathrm{ESS}}{2(\sigma^{2})^{2}}-\frac{\mathrm{ESN}}{2(\sigma^{2})^{2}\tau^{2}} =0\displaystyle=0
σ^2\displaystyle\hat{\sigma}^{2} =τ2ESS+ESN(n+p+2)τ2\displaystyle=\frac{\tau^{2}\mathrm{ESS}+\mathrm{ESN}}{(n+p+2)\tau^{2}} (23)

Next, to obtain the M-step update for the shrinkage parameter τ2\tau^{2}, we repeat the same procedure: form the negative logarithm of the joint probability distribution and remove terms that depend on neither σ2{\sigma^{2}} nor τ2\tau^{2}:

argminτ2{E𝜷[logp(y|X,β,σ2)logp(β|τ2,σ2)logp(σ2)logπ(τ2)]}\displaystyle\underset{\tau^{2}}{\operatorname{arg\,min}}\left\{{\rm E}_{\bm{\beta}}\!\left[-\log\>p(y|X,\beta,\sigma^{2})-\log\>p(\beta|\tau^{2},\sigma^{2})-\log\>p(\sigma^{2})-\log\>\pi(\tau^{2})\right]\right\}
=argminτ2{(n+p+22)logσ2+ESS2σ2+ESN2σ2τ2+p2logτ2+log(1+τ2)+logτ22}\displaystyle=\underset{\tau^{2}}{\operatorname{arg\,min}}\left\{\left(\frac{n+p+2}{2}\right)\log\sigma^{2}+\frac{\mathrm{ESS}}{2\sigma^{2}}+\frac{\mathrm{ESN}}{2\sigma^{2}\tau^{2}}+\frac{p}{2}\log\tau^{2}+\log(1+\tau^{2})+\frac{\log\tau^{2}}{2}\right\} (24)

Substituting the solution for σ2\sigma^{2} (23), i.e., the optimal σ2\sigma^{2} as a function of τ2\tau^{2}, into equation (24) eliminates the dependency on σ2\sigma^{2} and yields a Q-function that depends only on τ2\tau^{2}:

argminτ2{12[(1+p)logτ2+2log(1+τ2)+(n+p+2)(1+log(ESN+τ2ESS(n+p+2)τ2))]}\displaystyle\underset{\tau^{2}}{\operatorname{arg\,min}}\left\{\frac{1}{2}\left[(1+p)\log\tau^{2}+2\log(1+\tau^{2})+(n+p+2)\left(1+\log\left(\frac{\mathrm{ESN}+\tau^{2}\mathrm{ESS}}{(n+p+2)\tau^{2}}\right)\right)\right]\right\} (25)

Differentiating (25) with respect to τ2\tau^{2} and solving for the τ2\tau^{2} that sets the derivative to zero yields:

τ2{12[(1+p)logτ2+2log(1+τ2)+(n+p+2)(1+log(ESN+τ2ESS(n+p+2)τ2))]}\displaystyle\frac{\partial}{\partial\tau^{2}}\left\{\frac{1}{2}\left[(1+p)\log\tau^{2}+2\log(1+\tau^{2})+(n+p+2)\left(1+\log\left(\frac{\mathrm{ESN}+\tau^{2}\mathrm{ESS}}{(n+p+2)\tau^{2}}\right)\right)\right]\right\} =0\displaystyle=0
(3ESS+ESSp)(τ2)2+(ESNESNn+ESS+ESSp)τ2ESNESNn2τ2(1+τ2)(ESN+τ2ESS)\displaystyle\frac{(3\mathrm{ESS}+\mathrm{ESS}p)(\tau^{2})^{2}+(\mathrm{ESN}-\mathrm{ESN}n+\mathrm{ESS}+\mathrm{ESS}p)\tau^{2}-\mathrm{ESN}-\mathrm{ESN}n}{2\tau^{2}(1+\tau^{2})(\mathrm{ESN}+\tau^{2}\mathrm{ESS})} =0.\displaystyle=0. (26)

The τ2\tau^{2} update is the positive root of the quadratic equation in τ2\tau^{2} given by the numerator of (26):

τ^2=(n1)ESN(1+p)ESS+(4n+4)ESN(3+p)ESS+((1n)ESN+(p+1)ESS)2(6+2p)ESS\displaystyle{\hat{\tau}^{2}}=\frac{(n-1)\mathrm{ESN}-(1+p)\mathrm{ESS}+\sqrt{(4n+4)\mathrm{ESN}(3+p)\mathrm{ESS}+((1-n)\mathrm{ESN}+(p+1)\mathrm{ESS})^{2}}}{(6+2p)\mathrm{ESS}}
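As a check on the algebra, the following sketch (illustrative only; the ranges used to draw nn, pp, ESS\mathrm{ESS} and ESN\mathrm{ESN} are arbitrary) computes the update above and verifies that it is positive and annihilates the quadratic numerator of (26) up to floating-point error.

```python
# Check that the closed-form tau^2 update is the positive root of the quadratic in (26).
import numpy as np

rng = np.random.default_rng(3)
for _ in range(5):
    n, p = rng.integers(10, 1000), rng.integers(1, 50)
    ESS, ESN = rng.uniform(0.1, 100, size=2)
    g = (4 * n + 4) * ESN * (3 + p) * ESS + ((1 - n) * ESN + (p + 1) * ESS) ** 2
    tau2 = ((n - 1) * ESN - (1 + p) * ESS + np.sqrt(g)) / ((6 + 2 * p) * ESS)
    # numerator of (26) evaluated at the update; relative residual should be ~machine precision
    num = (3 + p) * ESS * tau2 ** 2 + (ESN - ESN * n + ESS + ESS * p) * tau2 - ESN - ESN * n
    rel = abs(num) / ((3 + p) * ESS * tau2 ** 2 + (1 + n) * ESN)
    print(tau2 > 0, rel)
```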
Table 3: Pseudo-code of the EM algorithm with the complexity of individual steps, where m = max(n, p) and k denotes the total number of EM iterations (costs inside the loop are totals over all k iterations).

EM Algorithm with SVD Operations
Input: Standardised predictors 𝐗n×p{\bf X}\in\mathbb{R}^{n\times p}, centered targets 𝐲n{\bf y}\in\mathbb{R}^{n} and convergence threshold ϵ>0\epsilon>0
Output: 𝜷p{\bm{\beta}}\in\mathbb{R}^{p}
r=min(n,p)r=\min(n,p) O(1)O(1)
IF pnp\geq n
[𝐔,𝚺,𝐕]=svd(𝐗)\quad[{\bf U},{\bm{\Sigma}},{\bf V}]={\rm svd}({\bf X}) O(mr2)O(mr^{2})
𝐬2=(Σ1,12,,Σr,r2)\quad{\bf s}^{2}=(\Sigma^{2}_{1,1},\ldots,\Sigma^{2}_{r,r}) O(r)O(r)
𝐜=(𝐔T𝐲)𝐬\quad{\bf c}=({\bf U}^{\rm T}{\bf y})\odot{\bf s} O(nr)O(nr)
ELSE
[𝐕,𝚺2]=eigen(𝐗T𝐗)\quad[{\bf V},{\bm{\Sigma}}^{2}]={\rm eigen}({\bf X}^{\rm T}{\bf X}) O(mr2)O(mr^{2})
𝐬2=(𝚺1,12,,𝚺r,r2)\quad{\bf s}^{2}=({\bm{\Sigma}}^{2}_{1,1},\ldots,{\bm{\Sigma}}^{2}_{r,r}) O(r)O(r)
𝐜=𝐕T𝐗T𝐲\quad{\bf c}={\bf V}^{\rm T}{\bf X}^{\rm T}{\bf y} O(np)O(np)
Y=𝐲T𝐲Y={\bf y}^{\rm T}{\bf y} O(n)O(n)
τ21\tau^{2}\leftarrow 1 O(1)O(1)
σ2(1/n)i=1n(yiy¯)2,y¯=(1/n)i=1nyi\sigma^{2}\leftarrow(1/n)\sum^{n}_{i=1}\left(y_{i}-\bar{y}\right)^{2},\;\bar{y}=(1/n)\sum^{n}_{i=1}y_{i} O(n)O(n)
RSS{\rm RSS}\leftarrow{\infty} O(1)O(1)
DO
RSSoldRSS\quad{\rm RSS_{old}}\leftarrow{\rm RSS} O(k)O(k)
αjcjsj2+1/τ2\quad\alpha_{j}\leftarrow\displaystyle{\frac{c_{j}}{s^{2}_{j}+1/\tau^{2}}} O(kr)O(kr)
   (E-step)
ESNj=1rαj2+σ2(j=1r1sj2+τ2+τ2max(pn,0))\quad\mathrm{ESN}\leftarrow{\displaystyle\sum_{j=1}^{r}}\alpha_{j}^{2}+\displaystyle{\sigma^{2}\left({\displaystyle\sum_{j=1}^{r}}\frac{1}{s^{2}_{j}+\tau^{-2}}+\tau^{2}\max(p-n,0)\right)} O(kr)O(kr)
RSSY2j=1rαjcj+j=1rαj2sj2\quad{\rm RSS}\leftarrow Y-2\displaystyle\sum^{r}_{j=1}{\alpha_{j}}c_{j}+\displaystyle\sum^{r}_{j=1}\alpha_{j}^{2}s_{j}^{2} O(kr)O(kr)
ESSRSS+σ2(j=1rsj2sj2+τ2)\quad\mathrm{ESS}\leftarrow{\rm RSS}+\displaystyle{\sigma^{2}\left({\displaystyle\sum_{j=1}^{r}}\frac{s^{2}_{j}}{s^{2}_{j}+\tau^{-2}}\right)} O(kr)O(kr)
   (M-step)
g(4n+4)ESN(3+p)ESS+((1n)ESN+(p+1)ESS)2\quad g\leftarrow(4n+4)\mathrm{ESN}\,(3+p)\mathrm{ESS}+((1-n)\mathrm{ESN}+(p+1)\mathrm{ESS})^{2} O(k)O(k)
τ2(n1)ESN(1+p)ESS+g(6+2p)ESS\quad\tau^{2}\leftarrow\displaystyle{\frac{(n-1)\mathrm{ESN}-(1+p)\mathrm{ESS}+\sqrt{g}}{(6+2p)\mathrm{ESS}}} O(k)O(k)
σ2τ2ESS+ESN(n+p+2)τ2\quad\sigma^{2}\leftarrow\displaystyle{\frac{\tau^{2}\mathrm{ESS}+\mathrm{ESN}}{(n+p+2)\tau^{2}}} O(k)O(k)
δ|RSSoldRSS|(1+|RSS|)\quad\delta\leftarrow\displaystyle{\frac{|{\rm RSS_{old}}-{\rm RSS}|}{\left(1+|{\rm RSS}|\right)}} O(k)O(k)
until δ<ϵ\delta<\epsilon
αjcjsj2+1/τ2\alpha_{j}\leftarrow\displaystyle{\frac{c_{j}}{s^{2}_{j}+1/\tau^{2}}} O(kr)O(kr)
𝜷=𝐕𝜶{\bm{\beta}}={\bf V}{\bm{\alpha}} O(pr)O(pr)
return 𝜷{\bm{\beta}}
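For concreteness, a minimal NumPy sketch that mirrors the pseudo-code of Table 3 is given below. It is illustrative rather than a reference implementation: the function name em_ridge, the default tolerance and the maximum iteration cap are our own choices, and X is assumed to be standardised and y centred, as stated in the input requirements.

```python
# Illustrative NumPy version of the EM procedure of Table 3.
import numpy as np

def em_ridge(X, y, eps=1e-8, max_iter=1000):
    n, p = X.shape
    if p >= n:
        U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
        V, s2 = Vt.T, s ** 2
        c = (U.T @ y) * s
    else:
        s2, V = np.linalg.eigh(X.T @ X)                    # eigendecomposition of X^T X
        s2, V = s2[::-1], V[:, ::-1]                       # descending order
        c = V.T @ (X.T @ y)
    Y = y @ y
    tau2, sigma2 = 1.0, np.var(y)
    rss_old = np.inf
    for _ in range(max_iter):
        alpha = c / (s2 + 1.0 / tau2)
        # E-step: expected squared norm (ESN) and expected sum of squares (ESS)
        esn = alpha @ alpha + sigma2 * (np.sum(1.0 / (s2 + 1.0 / tau2)) + tau2 * max(p - n, 0))
        rss = Y - 2 * alpha @ c + np.sum(alpha ** 2 * s2)
        ess = rss + sigma2 * np.sum(s2 / (s2 + 1.0 / tau2))
        # M-step: closed-form updates for tau^2 and sigma^2
        g = (4 * n + 4) * esn * (3 + p) * ess + ((1 - n) * esn + (p + 1) * ess) ** 2
        tau2 = ((n - 1) * esn - (1 + p) * ess + np.sqrt(g)) / ((6 + 2 * p) * ess)
        sigma2 = (tau2 * ess + esn) / ((n + p + 2) * tau2)
        if abs(rss_old - rss) / (1 + abs(rss)) < eps:
            break
        rss_old = rss
    alpha = c / (s2 + 1.0 / tau2)
    return V @ alpha
```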
Table 4: Pseudocode of the fast LOOCV algorithm with the complexity of individual steps, where m = max(n, p) and l is the number of candidate values of λ (costs inside the loop are totals over the grid). 𝐑{\bf R} has column vectors 𝐫j{\bf r}_{j} for 1jr1\leq j\leq r.
Fast LOOCV ridge with SVD Operation
Input: Standardised predictors 𝐗n×p{\bf X}\in\mathbb{R}^{n\times p}, centered targets 𝐲n{\bf y}\in\mathbb{R}^{n} and a grid of penalty parameters L=(λ1,λ2,,λl)L=(\lambda_{1},\lambda_{2},\ldots,\lambda_{l})
Output: 𝜷p{\bm{\beta}}\in\mathbb{R}^{p}
r=min(n,p)r=\min(n,p) O(1)O(1)
IF pnp\geq n
[𝐔,𝚺,𝐕]=svd(𝐗)\quad[{\bf U},{\bm{\Sigma}},{\bf V}]={\rm svd}({\bf X}) O(mr2)O(mr^{2})
𝐬=(Σ1,1,,Σr,r)\quad{\bf s}=\left(\Sigma_{1,1},\ldots,\Sigma_{r,r}\right) O(r)O(r)
𝐑=(s1𝐮1,,sr𝐮r)\quad{\bf R}=(s_{1}{\bf u}_{1},\ldots,s_{r}{\bf u}_{r}) O(nr)O(nr)
𝐜=(𝐔T𝐲)𝐬\quad{\bf c}=({\bf U}^{\rm T}{\bf y})\odot{\bf s} O(nr)O(nr)
ELSE
[𝐕,𝚺2]=eigen(𝐗T𝐗)\quad[{\bf V},{\bm{\Sigma}}^{2}]={\rm eigen}({\bf X}^{\rm T}{\bf X}) O(mr2)O(mr^{2})
𝐬2=(𝚺1,12,,𝚺r,r2)\quad{\bf s}^{2}=({\bm{\Sigma}}^{2}_{1,1},\ldots,{\bm{\Sigma}}^{2}_{r,r}) O(r)O(r)
𝐑=𝐗𝐕\quad{\bf R}={\bf X}{\bf V} O(nrp)O(nrp)
𝐜=𝐑T𝐲\quad{\bf c}={\bf R}^{\rm T}{\bf y} O(nr)O(nr)
𝐔=(𝐫1/s1,,𝐫r/sr)\quad{\bf U}=({\bf r}_{1}/s_{1},\ldots,{\bf r}_{r}/s_{r}) O(nr)O(nr)
forλL{{\rm for}\;\lambda\in\;L\;\{
hi=j=1r(sj2sj2+λ)uij2,(i=1,,n)\quad h_{i}=\displaystyle{\sum_{j=1}^{r}\left(\frac{s_{j}^{2}}{s_{j}^{2}+\lambda}\right)u^{2}_{ij}},\qquad(i=1,\ldots,n) O(lnr)O(lnr)
αj=cjsj2+λ\quad{\displaystyle\alpha_{j}=\frac{c_{j}}{s^{2}_{j}+\lambda}} O(lr)O(lr)
𝐞=𝐲𝐑𝜶\quad{\displaystyle{\bf e}={\bf y}-{\bf R}{\bm{\alpha}}} O(lnr)O(lnr)
CVE(λ)=1ni=1n(ei1hi)2\quad{\displaystyle{\rm CVE}(\lambda)=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_{i}}{1-h_{i}}\right)^{2}}\; O(ln)O(ln)
}
Find λ=argminλL{CVE(λ)}\lambda^{*}=\underset{\lambda\in L}{\operatorname{arg\,min}}\left\{{\rm CVE}(\lambda)\right\} O(l)O(l)
αj=cjsj2+λ{\displaystyle\alpha_{j}=\frac{c_{j}}{s^{2}_{j}+\lambda^{*}}} O(r)O(r)
𝜷=𝐕𝜶{\bm{\beta}}={\bf V}{\bm{\alpha}} O(pr)O(pr)
return 𝜷{\bm{\beta}}
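Analogously, a minimal NumPy sketch of the fast LOOCV procedure of Table 4 is given below. It is illustrative only: the function name loocv_ridge is our own, the grid of λ values is supplied by the caller as in the pseudo-code, and the p < n branch assumes the singular values s_j are strictly positive when recovering U from R.

```python
# Illustrative NumPy version of the fast LOOCV ridge procedure of Table 4.
import numpy as np

def loocv_ridge(X, y, lambdas):
    n, p = X.shape
    if p >= n:
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        V = Vt.T
        R = U * s                      # columns s_j u_j
        c = (U.T @ y) * s
    else:
        s2, V = np.linalg.eigh(X.T @ X)
        s2, V = s2[::-1], V[:, ::-1]
        s = np.sqrt(np.maximum(s2, 0.0))
        R = X @ V
        c = R.T @ y
        U = R / s                      # recover U from R (assumes nonzero s_j)
    s2 = s ** 2
    best_lam, best_cve = None, np.inf
    for lam in lambdas:
        h = (U ** 2) @ (s2 / (s2 + lam))          # leverage values h_i
        alpha = c / (s2 + lam)
        e = y - R @ alpha                          # training residuals
        cve = np.mean((e / (1 - h)) ** 2)          # closed-form LOO squared errors
        if cve < best_cve:
            best_lam, best_cve = lam, cve
    alpha = c / (s2 + best_lam)
    return V @ alpha, best_lam
```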

Appendix E Supplementary Results Material

E.1 Real Datasets Details

Table 5: Real datasets details
Datasets Abbreviation nn pp Target Variable Source
Buzz in social media (Twitter) Twitter 583250 77 mean number of active discussion UCI
Blog Feedback Blog 60021 281 number of comments in the next 24 hours UCI
Relative location of CT slices on axial axis CT slices 53500 386 reference: Relative image location on axial axis UCI
Buzz in social media (Tom’s Hardware) TomsHw 28179 97 Mean Number of display UCI
Condition-based maintenance of naval propulsion plants NPD - com 11934 16 GT Compressor decay state coefficient UCI
Condition-based maintenance of naval propulsion plants NPD - tur 11934 16 GT Turbine decay state coefficient UCI
Parkinson’s Telemonitoring PT - motor 5875 26 motor UPDRS score UCI
Parkinson’s Telemonitoring PT - total 5875 26 total UPDRS score UCI
Abalone Abalone 4177 8 Rings (age in years) UCI
Communities and Crime Crime 1994 128 ViolentCrimesPerPop UCI
Airfoil Self-Noise Airfoil 1503 6 Scaled sound pressure level (decibels) UCI
Student Performance Student 649 33 final grade (with G1 & G2 removed) UCI
Concrete Compressive Strength Concrete 1030 9 Concrete compressive strength (MPa) UCI
Forest Fires F.Fires 517 13 forest burned area (in ha) UCI
Boston Housing B.Housing 506 13 Median value of owner-occupied homes in $1000’s [21]
Facebook metrics Facebook 500 19 Total Interactions (with comment, like, and share columns removed) UCI
Diabetes Diabetes 442 10 quantitative measure of disease progression one year after baseline [13]
Real estate valuation R.Estate 414 7 house price of unit area UCI
Auto mpg A.MPG 398 8 city-cycle fuel consumption in miles per gallon UCI
Yacht hydrodynamics Yacht 308 7 residuary resistance per unit weight of displacement UCI
Automobile A.mobile 205 26 price UCI
Rat eye tissues Eye 120 200 the expression level of TRIM32 gene [39]
Riboflavin Ribo 71 4088 Log-transformed riboflavin production rate [9]
Crop Crop 24000 3072 24 crop classes UCR
Electric Devices ElecD 16637 4096 7 electric devices UCR
StarLight Curves StarL 9236 7168 3 starlight curves UCR
Figure 3: Comparison of predictive performance (R2R^{2}) of the EM algorithm (xx-axes) against CV with a fixed grid (yy-axes, top) and the glmnet heuristic (yy-axes, bottom). Columns correspond to results for linear features (left), second-order features (middle), and third-order features (right). Negative values are capped at 0. Points toward the bottom right (colored green) indicate cases where our EM approach gives prediction performance equal to or better than LOOCV.