
Tactics for Improving Least Squares Estimation

Qiang Heng (Department of Computational Medicine, UCLA), Hua Zhou (Departments of Biostatistics and Computational Medicine, UCLA), and Kenneth Lange (Departments of Computational Medicine, Human Genetics, and Statistics, UCLA)
Abstract

This paper deals with tactics for fast computation in least squares regression in high dimensions. These tactics include: (a) the majorization-minimization (MM) principle, (b) smoothing by Moreau envelopes, and (c) the proximal distance principle for constrained estimation. In iteratively reweighted least squares, the MM principle can create a surrogate function that trades case weights for adjusted responses. Reduction to ordinary least squares then permits the reuse of the Gram matrix and its Cholesky decomposition across iterations. This tactic is pertinent to estimation in L2E regression and generalized linear models. For problems such as quantile regression, non-smooth terms of an objective function can be replaced by their Moreau envelope approximations and majorized by spherical quadratics. Finally, penalized regression with distance-to-set penalties also benefits from this perspective. Our numerical experiments validate the speed and utility of deweighting and Moreau envelope approximations. Julia software implementing these experiments is available on our web page.

Keywords: MM principle, Moreau envelope, matrix decomposition, distance majorization, Sylvester equation

1 Introduction

It is fair to say that most estimation problems in classical statistics reduce to either least squares or maximum likelihood and that most of maximum likelihood estimation reduces to iteratively reweighted least squares (IRLS). Quantile regression and generalized linear model (GLM) regression are just two of the many examples of IRLS. The modern era of frequentist statistics has added sparsity, robustness, and rank restrictions to the classical inference soup. These additions complicate estimation by introducing nonsmooth terms in the objective function. In practice, smooth approximations can be substituted for nonsmooth terms without substantially impacting estimates or inference. The present paper documents the beneficial effects on model selection and solution speed of several tactics: (a) systematic application of the majorization-minimization (MM) principle, (b) substitution of ordinary least squares for weighted least squares, (c) substitution of Moreau envelopes for nonsmooth losses and penalties, and (d) application of the proximal distance principle in constrained estimation.

The speed of an optimization algorithm is a trade-off between the cost per iteration and the number of iterations until convergence. In our view, this trade-off has not been adequately explored in IRLS. Recall that in IRLS, we minimize the objective

f(\bm{\beta}) = \frac{1}{2}\sum_{i=1}^{n} w_{mi}(y_{i}-\bm{x}_{i}^{\top}\bm{\beta})^{2}, \qquad (1)

at iteration m. Here \bm{y}=(y_{i}) is the response vector, \bm{X}=(x_{ij}) is the design matrix, and \bm{\beta}=(\beta_{j}) is the vector of regression coefficients. The dominant cost in high-dimensional IRLS is solution of the normal equations (\bm{X}^{\top}\bm{W}_{m}\bm{X})\bm{\beta}=\bm{X}^{\top}\bm{W}_{m}\bm{y}, where \bm{W}_{m} is the diagonal matrix of case weights. For dense problems, the normal equations can be solved via the Cholesky decomposition of the weighted Gram matrix \bm{X}^{\top}\bm{W}_{m}\bm{X} or the QR decomposition of the weighted design matrix \bm{W}_{m}^{1/2}\bm{X}.

Unfortunately, both decompositions must be recalculated from scratch each time the weight matrix \bm{W}_{m} changes. Given p \leq n regression coefficients, the complexity of computing the Gram matrix, weighted or unweighted, is O(np^{2}). Fortunately, matrix multiplication is extremely fast on modern computers with access to highly optimized BLAS routines. On large-scale problems with p comparable to n, Cholesky and QR decompositions execute more slowly than matrix multiplications, even though the former has computational complexity O(p^{3}) and the latter has computational complexity O(np^{2}). If it takes i iterations until convergence, these costs must be multiplied by i. In repeated unweighted regressions, a single QR decomposition or a single paired Gram matrix and Cholesky decomposition suffices, and the computational burden drops to O(p^{3}). On the other hand, deweighting may well cause i to increase. The downside of the Cholesky approach is that the condition number of \bm{X}^{\top}\bm{X} is the square of the condition number of \bm{X}. In our view this danger is overrated: adding a small ridge penalty overcomes it and is still consistent with rapid extraction of the Cholesky decomposition. In practice, code depending on the Cholesky decomposition is faster than code depending on the QR decomposition, and our numerical experiments take advantage of this fact.

The majorization-minimization (MM) principle (Hunter and Lange, 2004; Lange, 2016; Lange et al., 2000) is our engine for leveraging Moreau envelopes and converting weighted to unweighted least squares. In minimization, MM iteratively substitutes a surrogate function g(𝜷𝜷m)g(\bm{\beta}\mid\bm{\beta}_{m}) that majorizes a loss f(𝜷)f(\bm{\beta}) around the current iterate 𝜷m\bm{\beta}_{m}. Majorization is defined by the tangency condition g(𝜷m𝜷m)=f(𝜷m)g(\bm{\beta}_{m}\mid\bm{\beta}_{m})=f(\bm{\beta}_{m}) and the domination condition g(𝜷𝜷m)f(𝜷)g(\bm{\beta}\mid\bm{\beta}_{m})\geq f(\bm{\beta}) for all 𝜷\bm{\beta}. The surrogate balances two goals, hugging the objective tightly and simplifying minimization. Minimizing the surrogate produces the next iterate 𝜷m+1\bm{\beta}_{m+1} and drives the objective downhill owing to the conditions

f(\bm{\beta}_{m+1}) \;\leq\; g(\bm{\beta}_{m+1}\mid\bm{\beta}_{m}) \;\leq\; g(\bm{\beta}_{m}\mid\bm{\beta}_{m}) \;=\; f(\bm{\beta}_{m}).

In maximization, the surrogate minorizes the objective and instead must be maximized. The tangency condition remains intact, but the domination condition g(𝜷𝜷m)f(𝜷)g(\bm{\beta}\mid\bm{\beta}_{m})\leq f(\bm{\beta}) is now reversed. The celebrated EM (expectation-maximization) principle for maximum likelihood estimation with missing data (McLachlan and Krishnan, 2007) is a special case of minorization-maximization. In the EM setting Jensen’s inequality supplies the surrogate as the expectation of the complete data loglikelihood conditional on the observed data.

In practice, many surrogate functions are strictly convex quadratics. When this is the case, minimizing the surrogate is achieved by the Newton update

\bm{\beta}_{m+1} = \bm{\beta}_{m}-d^{2}g(\bm{\beta}_{m}\mid\bm{\beta}_{m})^{-1}\nabla g(\bm{\beta}_{m}\mid\bm{\beta}_{m}) \qquad (2)
\phantom{\bm{\beta}_{m+1}} = \bm{\beta}_{m}-d^{2}g(\bm{\beta}_{m}\mid\bm{\beta}_{m})^{-1}\nabla f(\bm{\beta}_{m}).

The second form of the update reflects the tangency condition \nabla g(\bm{\beta}_{m}\mid\bm{\beta}_{m})=\nabla f(\bm{\beta}_{m}). In this version of gradient descent, the step length is exactly 1. The local rate of convergence is determined by how well the distorted curvature matrix d^{2}g(\bm{\beta}_{m}\mid\bm{\beta}_{m}) approximates the actual curvature matrix d^{2}f(\bm{\beta}_{m}).

The Moreau envelope of a function f(𝜷)f(\bm{\beta}) from p\mathbb{R}^{p} to {}\mathbb{R}\cup\{\infty\} is defined by

M_{\mu f}(\bm{\beta}) = \inf_{\bm{\nu}}\Big[f(\bm{\nu})+\frac{1}{2\mu}\|\bm{\beta}-\bm{\nu}\|_{2}^{2}\Big].

When f(𝜷)f(\bm{\beta}) is convex, the infimum is attained, and its Moreau envelope Mμf(𝜷)M_{\mu f}(\bm{\beta}) is convex and continuously differentiable (Nesterov, 2005). However, Mμf(𝜷)M_{\mu f}(\bm{\beta}) also exists for many nonconvex functions f(𝜷)f(\bm{\beta}) (Polson et al., 2015). In the absence of convexity, the proximal operator

\mathop{\rm prox}\nolimits_{\mu f}(\bm{\beta}) = \underset{\bm{\nu}}{\mathop{\rm argmin}\nolimits}\Big[f(\bm{\nu})+\frac{1}{2\mu}\|\bm{\beta}-\bm{\nu}\|_{2}^{2}\Big] \qquad (3)

can map to a set rather than a single point. For a proxable function f(𝜷)f(\bm{\beta}), the definition of the Moreau envelope implies the quadratic majorization

M_{\mu f}(\bm{\beta}) \leq f(\bm{\nu}_{m})+\frac{1}{2\mu}\|\bm{\beta}-\bm{\nu}_{m}\|_{2}^{2} \qquad (4)

at 𝝂m\bm{\nu}_{m} for any 𝝂mproxμf(𝜷m)\bm{\nu}_{m}\in\mathop{\rm prox}\nolimits_{\mu f}(\bm{\beta}_{m}). Generally, Mμf(𝜷)f(𝜷)M_{\mu f}(\bm{\beta})\leq f(\bm{\beta}) and limμ0Mμf(𝜷)=f(𝜷)\underset{\mu\downarrow 0}{\lim}M_{\mu f}(\bm{\beta})=f(\bm{\beta}) under mild conditions. In particular, if f(𝜷)f(\bm{\beta}) is Lipschitz with constant L, then

0 \leq f(\bm{\beta})-M_{\mu f}(\bm{\beta}) \leq \frac{L^{2}\mu}{2}.

Despite the elementary nature of the Moreau majorization (4), its value in regression has largely gone unnoticed.

In Section 2.5 we rederive the majorization of Heiser (1987) replacing the weighted least squares criterion (1) with the unweighted surrogate

g(\bm{\beta}\mid\bm{\beta}_{m}) = \frac{1}{2}\sum_{i=1}^{n}[w_{mi}y_{i}+(1-w_{mi})\mu_{mi}-\mu_{i}]^{2}+c_{m}. \qquad (5)

Here the constant cmc_{m} is irrelevant in the subsequent minimization. In our examples the regression functions μi(𝜷)=𝒙i𝜷\mu_{i}(\bm{\beta})=\bm{x}_{i}^{\top}\bm{\beta} depend linearly on the parameter vector 𝜷\bm{\beta}. The majorization (5) removes the weights and shifts the responses. It therefore allows the Cholesky decomposition of 𝑿𝑿\bm{X}^{\top}\bm{X} to be recycled. Derivation of the surrogate requires the assumption 0wmi10\leq w_{mi}\leq 1. Because rescaling does not alter the minimum point, one can divide wmiw_{mi} by wmax=maxjwmjw_{\max}=\max_{j}w_{mj} at iteration mm to ensure 0wmi10\leq w_{mi}\leq 1.

To render this discussion more concrete, consider the problem of least absolute deviation (LAD) regression with objective

f(\bm{\beta}) = \sum_{i=1}^{n}|y_{i}-\bm{x}_{i}^{\top}\bm{\beta}|.

The nonsmooth absolute value function has Moreau envelope equal to Huber’s function

M_{\mu|\cdot|}(r) = \begin{cases}\frac{1}{2\mu}r^{2}, & |r|\leq\mu\\ |r|-\frac{\mu}{2}, & |r|>\mu,\end{cases}

with proximal map

\mathop{\rm prox}\nolimits_{\mu|\cdot|}(r) = \Big(1-\frac{\mu}{\max\{|r|,\mu\}}\Big)r. \qquad (6)

The loose majorization (4) generates the least squares surrogate

g(\bm{\beta}\mid\bm{\beta}_{m}) = \frac{1}{2\mu}\sum_{i=1}^{n}(r_{i}-z_{mi})^{2}, \qquad (7)

where ri=yi𝒙i𝜷r_{i}=y_{i}-\bm{x}_{i}^{\top}\bm{\beta} and zmi=proxμ||(rmi)z_{mi}=\mathop{\rm prox}\nolimits_{\mu|\cdot|}(r_{mi}). Minimization of the surrogate invokes the same Cholesky decomposition at every iteration. Algorithm 1 summarizes this simple but effective algorithm for (smoothed) LAD regression.

Algorithm 1 Fast LAD Regression
Input: Design matrix \bm{X}\in\mathbb{R}^{n\times p}, response vector \bm{y}\in\mathbb{R}^{n}, smoothing constant \mu>0, and iteration number m=0.
1: Compute the Cholesky factor \bm{L} of \bm{X}^{\top}\bm{X}.
2: Initialize the regression coefficients \bm{\beta}_{0}=(\bm{L}^{\top})^{-1}\bm{L}^{-1}\bm{X}^{\top}\bm{y} by least squares.
3: while not converged do
4:    For each case i set z_{mi}=\mathop{\rm prox}\nolimits_{\mu|\cdot|}(y_{i}-\bm{x}_{i}^{\top}\bm{\beta}_{m}) based on formula (6).
5:    Update \bm{\beta}_{m+1}\leftarrow(\bm{L}^{\top})^{-1}\bm{L}^{-1}\bm{X}^{\top}(\bm{y}-\bm{z}_{m}).
6:    Update m\leftarrow m+1.
7: end while
Output: The final iterate \bm{\beta}_{m}.
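For readers who want to translate Algorithm 1 directly into code, the following Julia sketch shows how the single Cholesky factorization is recycled across iterations. The function name, keyword defaults, and convergence test are illustrative choices, not part of our released software.

using LinearAlgebra

# Proximal map of mu*|.| from formula (6).
prox_abs(r, mu) = (1 - mu / max(abs(r), mu)) * r

# Minimal sketch of Algorithm 1: smoothed LAD regression.
function fast_lad(X, y; mu = 0.1, maxiter = 500, tol = 1e-6)
    F = cholesky(Symmetric(X'X))       # computed once, reused every iteration
    beta = F \ (X' * y)                # least squares initialization
    for m in 1:maxiter
        z = prox_abs.(y .- X * beta, mu)        # proximal points z_m
        betanew = F \ (X' * (y .- z))           # unweighted normal equations
        norm(betanew - beta) <= tol * (1 + norm(beta)) && return betanew
        beta = betanew
    end
    return beta
end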

Alternatively, one can employ the best quadratic majorization (de Leeuw and Lange, 2009)

M_{\mu|\cdot|}(r) \leq \begin{cases}\frac{1}{2|r_{m}|}(r-r_{m})^{2}-r-\frac{\mu}{2}, & r_{m}\leq-\mu\\ \frac{1}{2\mu}r^{2}, & |r_{m}|<\mu\\ \frac{1}{2|r_{m}|}(r-r_{m})^{2}+r-\frac{\mu}{2}, & r_{m}\geq\mu.\end{cases} \qquad (8)

This majorization translates into a weighted sum of squares plus an irrelevant constant. The resulting weighted sum of squares can be further deweighted by appealing to the surrogate (5). Appendix A shows that these two successive majorizations lead exactly to the Moreau envelope surrogate (7).

Although the primary purpose of this paper is expository, we would like to highlight several likely new contributions. The first is our emphasis on the Moreau envelope majorization noted in equations (3) and (4) and recognized implicitly in the convex case by Chen et al. (2012) and explicitly in general in the book (Lange, 2016). Among its many virtues, this majorization converts smoothed quantile regression into ordinary least squares. The Moreau envelopes of various functions, such as the 0\ell_{0}-norm, the matrix rank function, and various set indicator functions, are invaluable in inducing sparse and low-rank structures. The spherical majorization of the convolution-smoothed check function in Section 2.3 and the subsequent MM algorithm is new to our knowledge. Our emphasis on the deweighting majorization of Heiser (1987) and Kiers (1997) may help revive this important tactic. We also revisit the quadratic upper bound majorization of Böhning and Lindsay (1988). Application of this principle has long been a staple of logistic and multinomial regression (Böhning, 1992). Inspired by the paper (Xu and Lange, 2022), we demonstrate how to reduce the normal equation of penalized multinomial regression to a Sylvester equation (Bartels and Stewart, 1972). Finally, in addition to mentioning existing theory guaranteeing the convergence of the algorithms studied here, we also construct a new convergence proof based on considerations of fixed points and monotone operators.

As a takeaway message from this paper, we hope readers will better appreciate the potential in high-dimensional estimation of combining the MM principle, smoothing, and numerical tactics such as recycling matrix decompositions and restarted Nesterov acceleration. These advances are ultimately as important as faster hardware. Together the best hardware and the best algorithms will make future statistical analysis even more magical.

2 Methods

In this section we first briefly review majorizations pertinent to least squares. The material covered here is largely standard. Good references are (Bauschke and Combettes, 2011) and (Beck, 2017). After a few preliminaries, we take up: (a) conversion of weighted to unweighted least squares, (b) the quadratic bound principle, (c) integration with the proximal distance method, (d) restarted Nesterov acceleration, and (e) convergence theory.

2.1 Notation

Here are the notational conventions used throughout this article. All vectors and matrices appear in boldface. An entry of a vector \bm{x} or matrix \bm{A} is denoted by the corresponding subscripted lower-case letter x_{i} or a_{ij}. All entries of the vector \bm{0} equal 0; \bm{I} indicates an identity matrix. The superscript \top indicates a transpose. The symbols \bm{x}_{m} and \bm{A}_{m} denote a sequence of vectors and a sequence of matrices with entries x_{mi} and a_{mij}. The Euclidean norm and \ell_{1} norm of a vector \bm{x} are denoted by \|\bm{x}\| and \|\bm{x}\|_{1}. The spectral, Frobenius, and nuclear norms of a matrix \bm{A} are denoted by \|\bm{A}\|, \|\bm{A}\|_{F}, and \|\bm{A}\|_{*}. For a smooth real-valued function f(\bm{\beta}), we write its first differential (row vector of partial derivatives) as df(\bm{\beta}). The gradient \nabla f(\bm{\beta}) is the transpose of df(\bm{\beta}). The directional derivative of f(\bm{\beta}) in the direction \bm{v} is written d_{\bm{v}}f(\bm{\beta}). When f(\bm{\beta}) is differentiable, d_{\bm{v}}f(\bm{\beta})=df(\bm{\beta})\bm{v}. The second differential (Hessian matrix) of f(\bm{\beta}) is d^{2}f(\bm{\beta}).

2.2 M Estimation

In M estimation (de Menezes et al., 2021; Huber and Dutter, 1974; Huber and Ronchetti, 2011) based on residuals, one minimizes a function

f(\bm{\beta}) = \sum_{i=1}^{n}\rho(r_{i})

of the residuals ri=yi𝒙i𝜷r_{i}=y_{i}-\bm{x}_{i}^{\top}\bm{\beta} with ρ(r)\rho(r) bounded below. We have already dealt with the choice ρ(r)=|r|\rho(r)=|r| in LAD regression. In general, one can replace ρ(r)\rho(r) by its Moreau envelope, leverage the Moreau majorization, and immediately invoke ordinary least squares. Many authors advocate estimating 𝜷\bm{\beta} by minimizing a function

f(\bm{\beta}) = \sum_{i=1}^{n}\rho(r_{i}^{2})

of the squared residuals. Conveniently, this functional form is consistent with loglikelihoods under an elliptically symmetric distribution (Lange and Sinsheimer, 1993). If ρ(s)\rho(s) is increasing, differentiable, and concave, then

\rho(s) \leq \rho(s_{m})+\rho^{\prime}(s_{m})(s-s_{m})

for all ss. This plays out as the majorization

f(\bm{\beta}) \leq \sum_{i=1}^{n}w_{mi}(y_{i}-\bm{x}_{i}^{\top}\bm{\beta})^{2}+c_{m},

with weights w_{mi}=\rho^{\prime}[(y_{i}-\bm{x}_{i}^{\top}\bm{\beta}_{m})^{2}] and an irrelevant constant c_{m} that depends on \bm{\beta}_{m} but not on \bm{\beta}. This surrogate can be minimized by deweighting and invoking ordinary least squares. The Cauchy choice \rho(s)=\log(1+s) and the Geman-McClure choice \rho(s)=\frac{s}{1+s} for s\geq 0 show that the intersection of the imposed conditions is non-empty.

A prime example (Chi and Chi, 2022; Heng et al., 2023) of robust M estimation revolves around the L2E\text{L}_{2}\text{E} objective (Scott, 2001; Liu et al., 2023)

f(\bm{\beta},\tau) = \frac{\tau}{2\sqrt{\pi}}-\frac{\tau}{n}\sqrt{\frac{2}{\pi}}\sum_{i=1}^{n}e^{-\frac{\tau^{2}r_{i}^{2}}{2}}.

This loss is particularly attractive because it incorporates a precision parameter τ\tau as well as the regression coefficient vector 𝜷\bm{\beta}. Because es-e^{-s} is concave, one can construct the MM surrogate

-e^{-\frac{\tau^{2}r_{i}^{2}}{2}} \leq -e^{-\frac{\tau^{2}r_{mi}^{2}}{2}}+\frac{\tau^{2}}{2}e^{-\frac{\tau^{2}r_{mi}^{2}}{2}}\left(r_{i}^{2}-r_{mi}^{2}\right)

at iteration mm with τ\tau fixed. The resulting weighted least squares surrogate can be further deweighted as explained in Section 2.5. Under deweighting it morphs into the surrogate

g(\bm{\beta}\mid\bm{\beta}_{m}) = \frac{\tau^{3}}{2n}\sqrt{\frac{2}{\pi}}\sum_{i=1}^{n}\left[w_{mi}y_{i}+(1-w_{mi})\bm{x}_{i}^{\top}\bm{\beta}_{m}-\bm{x}_{i}^{\top}\bm{\beta}\right]^{2}+c_{m},

where wmi=eτ2rmi2/2w_{mi}=e^{-\tau^{2}r_{mi}^{2}/2} and cmc_{m} is an irrelevant constant. This ordinary least squares surrogate allows one to recycle a single Cholesky decomposition across all iterations. For 𝜷\bm{\beta} fixed, one can update the precision parameter τ\tau by gradient descent (Heng et al., 2023) or an approximate Newton’s method (Liu et al., 2023). Both methods rely on backtracking.
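As a concrete illustration, here is a minimal Julia sketch of the deweighted MM update for L2E regression with the precision \tau held fixed; the \tau update by gradient descent or an approximate Newton step would be layered on top. The function name and keyword defaults are ours and purely illustrative.

using LinearAlgebra

# Minimal sketch: deweighted MM for the L2E loss at fixed precision tau.
function l2e_mm(X, y, tau; maxiter = 1000, tol = 1e-6)
    F = cholesky(Symmetric(X'X))                 # single recycled factorization
    beta = F \ (X' * y)
    for m in 1:maxiter
        r = y .- X * beta
        w = exp.(-tau^2 .* r .^ 2 ./ 2)          # weights automatically in (0, 1]
        yshift = w .* y .+ (1 .- w) .* (X * beta)    # shifted responses
        betanew = F \ (X' * yshift)              # ordinary least squares solve
        norm(betanew - beta) <= tol * (1 + norm(beta)) && return betanew
        beta = betanew
    end
    return beta
end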

2.3 Quantile Regression

Quantile regression (Koenker and Bassett, 1978) is a special case of M estimation with the objective f(𝜷)=1ni=1nρq(yi𝒙i𝜷)f(\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}\rho_{q}(y_{i}-\bm{x}_{i}^{\top}\bm{\beta}), where

\rho_{q}(r) = |r|\big[q1_{\{r\geq 0\}}+(1-q)1_{\{r<0\}}\big] \;=\; \Big(q-\frac{1}{2}\Big)r+\frac{1}{2}|r| \qquad (9)

is called the check function. The argument rr here suggests a residual. When q=12q=\frac{1}{2}, quantile regression reduces to LAD regression. The check function is not differentiable at r=0r=0. To smooth the objective (9), one can take either its Moreau envelope or the Moreau envelope of just the nonsmooth part 12|r|\frac{1}{2}|r|. In the former case, elementary calculus shows that the minimum is attained at

\mathop{\rm prox}\nolimits_{\mu\rho_{q}}(r) = \begin{cases}r-q\mu, & r\geq q\mu\\ 0, & -(1-q)\mu<r<q\mu\\ r+(1-q)\mu, & r\leq-(1-q)\mu.\end{cases}

It follows that

M_{\mu\rho_{q}}(r) = \begin{cases}qr-\frac{\mu}{2}q^{2}, & r\geq q\mu\\ \frac{1}{2\mu}r^{2}, & -(1-q)\mu<r<q\mu\\ -(1-q)r-\frac{\mu}{2}(1-q)^{2}, & r\leq-(1-q)\mu.\end{cases}

A benefit of replacing ρq(r)\rho_{q}(r) by Mμρq(r)M_{\mu\rho_{q}}(r) is that, by definition, Mμρq(r)M_{\mu\rho_{q}}(r) satisfies the Moreau majorization (4) at the anchor point rmr_{m}. This majorization can be improved, but it is convenient because it induces the uniform majorization

\sum_{i=1}^{n}M_{\mu\rho_{q}}(r_{i}) \leq \sum_{i=1}^{n}\Big[\rho_{q}(z_{mi})+\frac{1}{2\mu}(r_{i}-z_{mi})^{2}\Big],\quad z_{mi} = \mathop{\rm prox}\nolimits_{\mu\rho_{q}}(r_{mi}), \qquad (10)

which is an unweighted sum of squares in the residuals ri=yi𝒙i𝜷r_{i}=y_{i}-\bm{x}_{i}^{\top}\bm{\beta}.

Convolution approximation offers yet another avenue for majorization (Fernandes et al., 2021; He et al., 2023; Kaplan and Sun, 2017; Tan et al., 2022; Whang, 2006). If we select a well-behaved kernel k(x)0k(x)\geq 0 with bounded support and total mass k(x)𝑑x=1\int k(x)dx=1, then the convolution

C_{\mu g}(y) = \frac{1}{\mu}\int g(x)k\Big(\frac{y-x}{\mu}\Big)dx \;=\; \frac{1}{\mu}\int g(y-x)k\Big(\frac{x}{\mu}\Big)dx

approximates g(y)g(y) as the bandwidth μ0\mu\downarrow 0. Given g(x)=|x|g(x)=|x| and the symmetric uniform kernel k(x)=121{|x|1}k(x)=\frac{1}{2}1_{\{|x|\leq 1\}}, the necessary integral

C_{\mu|\cdot|}(y) = \frac{1}{2\mu}\int_{-\mu}^{\mu}|y-x|\,dx \;=\; \mu\begin{cases}\frac{1}{2}\Big[1+\Big(\frac{y}{\mu}\Big)^{2}\Big], & |y|\leq\mu\\ \frac{|y|}{\mu}, & |y|>\mu\end{cases}

is straightforward to calculate. This is just the Moreau envelope Mμ||(y)M_{\mu|\cdot|}(y) shifted upward by μ2\frac{\mu}{2}. This approximate loss is now optimized via the Moreau majorization

\Big(q-\frac{1}{2}\Big)r+\frac{1}{2}C_{\mu|\cdot|}(r) \leq \frac{1}{4\mu}[r-z_{m}+(2q-1)\mu]^{2}+c_{m}, \qquad (11)

where cmc_{m} is an irrelevant constant and zm=proxμ||(rm)z_{m}=\mathop{\rm prox}\nolimits_{\mu|\cdot|}(r_{m}). Although our numerical experiments in quantile estimation feature the convolution-smoothed quantile loss (11), the choice (10) works equally well. What is more important in practice is the transformation of quantile regression into a sequence of ordinary least squares problems with shifted responses.
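The following Julia sketch spells out this transformation for the convolution-smoothed loss at level q: one ordinary least squares solve with shifted responses per iteration. Names, defaults, and the stopping rule are illustrative assumptions.

using LinearAlgebra

prox_abs(r, mu) = (1 - mu / max(abs(r), mu)) * r    # proximal map of mu*|.|

# Minimal sketch of the MM iteration built on majorization (11).
function smoothed_quantile(X, y, q; mu = 0.05, maxiter = 1000, tol = 1e-6)
    F = cholesky(Symmetric(X'X))                    # recycled Cholesky factor
    beta = F \ (X' * y)
    for m in 1:maxiter
        z = prox_abs.(y .- X * beta, mu)
        yshift = y .- z .+ (2q - 1) * mu            # shifted responses
        betanew = F \ (X' * yshift)
        norm(betanew - beta) <= tol * (1 + norm(beta)) && return betanew
        beta = betanew
    end
    return beta
end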

2.4 Sparsity and Low Matrix Rank

The 0\ell_{0}-norm, the rank function, and their corresponding constraint sets are helpful in inducing sparsity and low matrix rank. Directly imposing these nonconvex penalties or constraint sets can lead to intractable optimization problems. A common middle ground is to use convex surrogates, such as the 1\ell_{1}-norm and the nuclear norm, but these induce shrinkage bias and a surplus of false positives in support recovery. The Moreau envelopes of these nonconvex penalties offer an attractive alternative that leverages differentiability and majorization.

The 0\ell_{0} norm has Moreau envelope (Beck, 2017)

M_{\mu\|\cdot\|_{0}}(\bm{\beta}) = \min_{\bm{\nu}}\Big[\|\bm{\nu}\|_{0}+\frac{1}{2\mu}\|\bm{\beta}-\bm{\nu}\|_{2}^{2}\Big]
= \sum_{j=1}^{p}\min_{\nu_{j}}\Big[1_{\{\nu_{j}\neq 0\}}+\frac{1}{2\mu}(\beta_{j}-\nu_{j})^{2}\Big]
= \sum_{j=1}^{p}\begin{cases}1, & \frac{1}{2}\beta_{j}^{2}\geq\mu\\ \frac{1}{2\mu}\beta_{j}^{2}, & \frac{1}{2}\beta_{j}^{2}<\mu.\end{cases}

The corresponding proximal map sends βj\beta_{j} to itself when 12βj2μ\frac{1}{2}\beta_{j}^{2}\geq\mu and to 0 when 12βj2<μ\frac{1}{2}\beta_{j}^{2}<\mu. The Moreau envelope of the 0/0/\infty indicator of the sparsity set Sk={𝜷:𝜷0k}S_{k}=\{\bm{\beta}:\|\bm{\beta}\|_{0}\leq k\} is determined by projection onto SkS_{k}. Elementary arguments show that

M_{\mu\delta_{S_{k}}}(\bm{\beta}) = \min_{\bm{\nu}}\Big[\delta_{S_{k}}(\bm{\nu})+\frac{1}{2\mu}\|\bm{\beta}-\bm{\nu}\|_{2}^{2}\Big] \;=\; \frac{1}{2\mu}\sum_{j=1}^{p-k}\beta_{(j)}^{2},

where β(1),,β(pk)\beta_{(1)},\ldots,\beta_{(p-k)} denote the pkp-k smallest βj\beta_{j} in magnitude (Beck, 2017). The Moreau majorization defined by equation (4) is available in both examples. In general, the Moreau envelope of the 0/0/\infty indicator of a set SS is 12μdist(𝒚,S)2\frac{1}{2\mu}\mathop{\rm dist}\nolimits(\bm{y},S)^{2}, where dist(𝒚,S)\mathop{\rm dist}\nolimits(\bm{y},S) is the Euclidean distance from 𝒚\bm{y} to SS. Other nonconvex sparsity inducing penalties p(β)p(\beta), such as the SCAD penalty (Fan and Li, 2001) and the MCP penalty (Zhang, 2010), also have straightforward Moreau envelopes (Polson et al., 2015).
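For reference, both the projection onto S_{k} and the hard-thresholding proximal map of \mu\|\cdot\|_{0} are only a few lines of Julia; the helper names below are ours and purely illustrative.

using LinearAlgebra

# Euclidean projection onto S_k: keep the k largest entries in magnitude.
function project_sparsity(beta, k)
    out = zero(beta)
    idx = partialsortperm(abs.(beta), 1:k, rev = true)
    out[idx] = beta[idx]
    return out
end

# Proximal map of mu*||.||_0: hard-threshold entries with beta_j^2 / 2 < mu.
prox_l0(beta, mu) = map(b -> b^2 / 2 >= mu ? b : zero(b), beta)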

For matrices, low rank is often preferred to sparsity in estimation and inference. Let R_{k} denote the set of p\times q matrices of rank k or less. The Moreau envelope of the 0/\infty indicator of R_{k} turns out to be \frac{1}{2\mu}\mathop{\rm dist}\nolimits(\bm{B},R_{k})^{2}=\frac{1}{2\mu}\sum_{j>k}\sigma_{j}^{2}, where the \sigma_{j} are the singular values of \bm{B} in descending order. If \bm{B} has singular value decomposition \bm{U}\bm{\Sigma}\bm{V}^{\top}, then the corresponding proximal map takes \bm{B} to \bm{U}\bm{\Sigma}_{k}\bm{V}^{\top}, where \bm{\Sigma}_{k} sets the singular values of \bm{\Sigma} smaller than \sigma_{k} to 0. This result is just the content of the Eckart-Young theorem. The function \mathrm{rank}(\bm{B}) has Moreau envelope (Hiriart-Urruty and Le, 2013)

M_{\mu\,\mathrm{rank}}(\bm{B}) = M_{\mu\|\cdot\|_{0}}(\bm{\sigma}) \;=\; \sum_{i}\begin{cases}1, & \frac{1}{2}\sigma_{i}^{2}\geq\mu\\ \frac{1}{2\mu}\sigma_{i}^{2}, & \frac{1}{2}\sigma_{i}^{2}<\mu.\end{cases}

The map \mathop{\rm prox}\nolimits_{\mu\,\mathrm{rank}}(\bm{B})=\bm{U}\bm{\Sigma}_{k}\bm{V}^{\top} retains \sigma_{j} when \frac{1}{2}\sigma_{j}^{2}\geq\mu and sets it to 0 otherwise. In other words, the proximal operator hard thresholds the singular values. In both cases the Moreau majorization (4) is available.

The nuclear norm \|\bm{B}\|_{*}=\sum_{i}\sigma_{i} (Recht et al., 2010) is a widely adopted convex penalty in low-rank estimation. Its proximal map soft thresholds the \sigma_{i} by subtracting \mu from each, replacing negative values by 0, and setting \mathop{\rm prox}\nolimits_{\mu\|\cdot\|_{*}}(\bm{B})=\bm{U}\tilde{\bm{\Sigma}}\bm{V}^{\top}, where \tilde{\bm{\Sigma}} contains the thresholded values. The Moreau envelope M_{\mu\|\cdot\|_{*}}(\bm{B}) serves as a smooth approximation to \|\bm{B}\|_{*} and benefits from the Moreau majorization (4).

2.5 Deweighting Weighted Least Squares

To remove the weights and shift the responses in least squares, Heiser (1987) and Kiers (1997) suggest a crucial deweighting majorization. At iteration m, assume that the weights satisfy 0\leq w_{mi}\leq 1 and that \mu_{i} abbreviates the mean \mu_{i}(\bm{\beta}) of case i. Then we have

w_{mi}(y_{i}-\mu_{i})^{2}
= w_{mi}(y_{i}-\mu_{mi}+\mu_{mi}-\mu_{i})^{2}
= w_{mi}[(y_{i}-\mu_{mi})^{2}+2(y_{i}-\mu_{mi})(\mu_{mi}-\mu_{i})+(\mu_{mi}-\mu_{i})^{2}]
= w_{mi}(1-w_{mi})(y_{i}-\mu_{mi})^{2}+w_{mi}^{2}(y_{i}-\mu_{mi})^{2}+2w_{mi}(y_{i}-\mu_{mi})(\mu_{mi}-\mu_{i})+w_{mi}(\mu_{mi}-\mu_{i})^{2}
= w_{mi}^{2}(y_{i}-\mu_{mi})^{2}+2w_{mi}(y_{i}-\mu_{mi})(\mu_{mi}-\mu_{i})+w_{mi}(\mu_{mi}-\mu_{i})^{2}+d_{mi}
\leq w_{mi}^{2}(y_{i}-\mu_{mi})^{2}+2w_{mi}(y_{i}-\mu_{mi})(\mu_{mi}-\mu_{i})+(\mu_{mi}-\mu_{i})^{2}+d_{mi}
= [w_{mi}(y_{i}-\mu_{mi})+(\mu_{mi}-\mu_{i})]^{2}+d_{mi}
= [w_{mi}y_{i}+(1-w_{mi})\mu_{mi}-\mu_{i}]^{2}+d_{mi},

where d_{mi}=w_{mi}(1-w_{mi})(y_{i}-\mu_{mi})^{2} does not depend on the current \mu_{i}. Equality holds, as it should, when \mu_{i}=\mu_{mi}. Summing over the cases i and halving yields the surrogate (5) with c_{m}=\frac{1}{2}\sum_{i=1}^{n}d_{mi}.
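Because the inequality above is easy to get wrong when re-deriving it, a quick numerical spot-check is reassuring. The Julia snippet below verifies the per-case majorization at random test points; it is purely illustrative.

# Spot-check of w (y - mu)^2 <= [w y + (1 - w) mu_m - mu]^2 + w (1 - w)(y - mu_m)^2
# for weights w in [0, 1], with equality at mu = mu_m.
for trial in 1:10_000
    w, y, mum, mu = rand(), randn(), randn(), randn()
    lhs = w * (y - mu)^2
    rhs = (w * y + (1 - w) * mum - mu)^2 + w * (1 - w) * (y - mum)^2
    @assert lhs <= rhs + 1e-12
end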

2.6 Quadratic Upper and Lower Bound Principles

The quadratic upper bound principle applies to functions f(𝜷)f(\bm{\beta}) with bounded curvature (Böhning and Lindsay, 1988). Given that f(𝜷)f(\bm{\beta}) is twice differentiable, we seek a matrix 𝑯\bm{H} satisfying 𝑯d2f(𝜷)\bm{H}\succeq d^{2}f(\bm{\beta}) and 𝑯𝟎\bm{H}\succ{\bf 0} in the sense that 𝑯d2f(𝜷)\bm{H}-d^{2}f(\bm{\beta}) is positive semidefinite for all 𝜷\bm{\beta} and 𝑯\bm{H} is positive definite. The quadratic bound principle then amounts to the majorization

f(\bm{\beta}) = f(\bm{\beta}_{m})+df(\bm{\beta}_{m})(\bm{\beta}-\bm{\beta}_{m})+(\bm{\beta}-\bm{\beta}_{m})^{\top}\!\int_{0}^{1}d^{2}f[\bm{\beta}_{m}+t(\bm{\beta}-\bm{\beta}_{m})](1-t)\,dt\,(\bm{\beta}-\bm{\beta}_{m})
\leq f(\bm{\beta}_{m})+df(\bm{\beta}_{m})(\bm{\beta}-\bm{\beta}_{m})+\frac{1}{2}(\bm{\beta}-\bm{\beta}_{m})^{\top}\bm{H}(\bm{\beta}-\bm{\beta}_{m})
= g(\bm{\beta}\mid\bm{\beta}_{m}).

The quadratic surrogate is exactly minimized by the update (2) with g(𝜷m𝜷m)g(\bm{\beta}_{m}\mid\bm{\beta}_{m}) replaced by 𝑯\bm{H}. The quadratic lower bound principle involves minorization and subsequent maximization.

The quadratic upper bound principle applies to logistic regression and its generalization to multinomial regression (Böhning, 1992; Krishnapuram et al., 2005). In multinomial regression, items are drawn from c categories numbered 1 through c. Logistic regression is the special case c=2. If y_{i} denotes the response of sample i, and \bm{x}_{i} denotes a corresponding predictor vector, then the probability w_{ij} that y_{i}=j is

w_{ij}(\bm{B}) = \begin{cases}\frac{e^{r_{ij}}}{1+\sum_{k=1}^{c-1}e^{r_{ik}}}, & 1\leq j<c\\ \frac{1}{1+\sum_{k=1}^{c-1}e^{r_{ik}}}, & j=c,\end{cases}

where rij=𝒙i𝜷jr_{ij}=\bm{x}_{i}^{\top}\bm{\beta}_{j}. To estimate the regression coefficient vectors 𝜷j\bm{\beta}_{j} over nn independent trials we must find a quadratic lower bound for the loglikelihood

\mathcal{L}(\bm{B}) = \sum_{i=1}^{n}\begin{cases}r_{iy_{i}}-\ln\big(1+\sum_{j=1}^{c-1}e^{r_{ij}}\big), & 1\leq y_{i}<c\\ -\ln\big(1+\sum_{j=1}^{c-1}e^{r_{ij}}\big), & y_{i}=c.\end{cases} \qquad (12)

Here 𝑩\bm{B} is the matrix with jjth column 𝜷j\bm{\beta}_{j}. It suffices to find a quadratic lower bound for each sample, so we temporarily drop the subscript ii to simplify notation.

The loglikelihood has directional derivative

d_{\bm{V}}\mathcal{L}(\bm{B}) = 1_{\{y<c\}}\bm{x}^{\top}\bm{v}_{y}-\frac{\sum_{j=1}^{c-1}e^{r_{j}}\bm{x}^{\top}\bm{v}_{j}}{1+\sum_{j=1}^{c-1}e^{r_{j}}} \;=\; [(\bm{y}-\bm{w})\otimes\bm{x}]^{\top}\mathop{\rm vec}\nolimits(\bm{V})

for the perturbation \bm{V} of \bm{B} with jth column \bm{v}_{j}. In the Kronecker product representation of d_{\bm{V}}\mathcal{L}(\bm{B}), \bm{y} is now an indicator vector of length c-1 with a 1 in entry j if the sample belongs to category j, and \bm{w} is the weight vector with value w_{j}=\frac{e^{r_{j}}}{1+\sum_{k=1}^{c-1}e^{r_{k}}} in entry j<c. The quadratic form associated with the second differential d^{2}\mathcal{L}(\bm{B}) is

d^{2}_{\bm{V}}\mathcal{L}(\bm{B}) = -\frac{\sum_{j=1}^{c-1}e^{r_{j}}(\bm{x}^{\top}\bm{v}_{j})^{2}}{1+\sum_{j=1}^{c-1}e^{r_{j}}}+\frac{\big(\sum_{j=1}^{c-1}e^{r_{j}}\bm{x}^{\top}\bm{v}_{j}\big)^{2}}{\big(1+\sum_{j=1}^{c-1}e^{r_{j}}\big)^{2}}.

Across all cases Böhning (1992) derives the lower bound

d^{2}_{\bm{V}}\mathcal{L}(\bm{B}) \geq -\frac{1}{2}\mathop{\rm vec}\nolimits(\bm{V})^{\top}\left[\left(\bm{I}-\frac{1}{c}\bm{1}\bm{1}^{\top}\right)\otimes\left(\sum_{i=1}^{n}\bm{x}_{i}\bm{x}_{i}^{\top}\right)\right]\mathop{\rm vec}\nolimits(\bm{V}),

where the (c1)×(c1)(c-1)\times(c-1) matrix 𝑬=12(𝑰1c𝟏𝟏)\bm{E}=\frac{1}{2}(\bm{I}-\frac{1}{c}\bm{1}\bm{1}^{\top}) has the explicit inverse 2(𝑰+𝟏𝟏)2(\bm{I}+\bm{1}\bm{1}^{\top}). The full-sample multinomial loglikelihood is then minorized by the quadratic

\mathcal{L}(\bm{B}) \geq \mathcal{L}(\bm{B}_{m})+\sum_{i=1}^{n}[(\bm{y}_{i}-\bm{w}_{mi})\otimes\bm{x}_{i}]^{\top}\mathop{\rm vec}\nolimits(\bm{B}-\bm{B}_{m}) \qquad (13)
\quad-\frac{1}{2}\mathop{\rm vec}\nolimits(\bm{B}-\bm{B}_{m})^{\top}\left[\bm{E}\otimes\left(\bm{X}^{\top}\bm{X}\right)\right]\mathop{\rm vec}\nolimits(\bm{B}-\bm{B}_{m}).

Maximizing the surrogate yields the update

\mathop{\rm vec}\nolimits(\bm{B}_{m+1}) = \mathop{\rm vec}\nolimits(\bm{B}_{m})+\left[\bm{E}\otimes\left(\bm{X}^{\top}\bm{X}\right)\right]^{-1}\left[\sum_{i=1}^{n}(\bm{y}_{i}-\bm{w}_{mi})\otimes\bm{x}_{i}\right]
= \mathop{\rm vec}\nolimits(\bm{B}_{m})+\left[\bm{E}^{-1}\otimes(\bm{X}^{\top}\bm{X})^{-1}\right]\left[\sum_{i=1}^{n}(\bm{y}_{i}-\bm{w}_{mi})\otimes\bm{x}_{i}\right].

Reverting to matrices, this update can be restated as

\bm{B}_{m+1} = \bm{B}_{m}+(\bm{X}^{\top}\bm{X})^{-1}\left[\sum_{i=1}^{n}\bm{x}_{i}(\bm{y}_{i}-\bm{w}_{mi})^{\top}\right]\bm{E}^{-1}
= \bm{B}_{m}+(\bm{X}^{\top}\bm{X})^{-1}\bm{X}^{\top}(\bm{Y}-\bm{W}_{m})\bm{E}^{-1},

where 𝑾mn×(c1)\bm{W}_{m}\in\mathbb{R}^{n\times(c-1)} conveys the weights for the first c1c-1 categories and the rows of 𝒀n×(c1)\bm{Y}\in\mathbb{R}^{n\times(c-1)} are indicator vectors conveying category membership. In the special case of logistic regression, the solution reduces to

\bm{\beta}_{m+1} = \bm{\beta}_{m}+4(\bm{X}^{\top}\bm{X})^{-1}\bm{X}^{\top}(\bm{y}-\bm{w}_{m}).
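In Julia, this logistic special case is only a few lines; a single factorization of X'X drives every iteration. The function name and stopping rule below are illustrative.

using LinearAlgebra

# Minimal sketch of the quadratic upper bound MM for logistic regression (c = 2).
function logistic_mm(X, y; maxiter = 500, tol = 1e-8)
    F = cholesky(Symmetric(X'X))                     # recycled factorization
    beta = zeros(size(X, 2))
    for m in 1:maxiter
        w = 1 ./ (1 .+ exp.(-X * beta))              # fitted probabilities
        betanew = beta .+ 4 .* (F \ (X' * (y .- w))) # Böhning-Lindsay update
        norm(betanew - beta) <= tol * (1 + norm(beta)) && return betanew
        beta = betanew
    end
    return beta
end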

Later, when we discuss penalized multinomial regression, we will consider the penalized loglikelihood (𝑩)λ2𝑩𝑷mF2\mathcal{L}(\bm{B})-\frac{\lambda}{2}\|\bm{B}-\bm{P}_{m}\|_{F}^{2}, where the projection 𝑷m\bm{P}_{m} depends on the current iterate 𝑩m\bm{B}_{m} and λ>0\lambda>0. Redoing the minorization (13) with this extra term leads to the stationary condition

\bm{0} = \mathop{\rm vec}\nolimits(\bm{C})-\left(\bm{E}\otimes\bm{X}^{\top}\bm{X}+\lambda\bm{I}\right)\mathop{\rm vec}\nolimits(\bm{\Delta}),

with 𝑪=𝑿(𝒀𝑾m)+λ𝑷m\bm{C}=\bm{X}^{\top}(\bm{Y}-\bm{W}_{m})+\lambda\bm{P}_{m} and 𝚫=𝑩𝑩m\bm{\Delta}=\bm{B}-\bm{B}_{m}. This condition is the same as the two equivalent matrix equations

\bm{X}^{\top}\bm{X}\bm{\Delta}\bm{E}+\lambda\bm{\Delta} = \bm{C} \quad\text{and}\quad \bm{X}^{\top}\bm{X}\bm{\Delta}+\lambda\bm{\Delta}\bm{E}^{-1} = \bm{C}\bm{E}^{-1}.

The second of these is a Sylvester equation (Bartels and Stewart, 1972) for the matrix 𝚫\bm{\Delta}. It can be solved by extracting the spectral decompositions 𝑿𝑿=𝑼𝚺1𝑼\bm{X}^{\top}\bm{X}=\bm{U}\bm{\Sigma}_{1}\bm{U}^{\top} and 𝑬1=𝑽𝚺2𝑽\bm{E}^{-1}=\bm{V}\bm{\Sigma}_{2}\bm{V}^{\top}, left multiplying the equation by 𝑼\bm{U}^{\top}, and right multiplying the equation by 𝑽\bm{V}. If in the result

\bm{\Sigma}_{1}\bm{U}^{\top}\bm{\Delta}\bm{V}+\bm{U}^{\top}\bm{\Delta}\bm{V}(\lambda\bm{\Sigma}_{2}) = \bm{U}^{\top}\bm{C}\bm{V}\bm{\Sigma}_{2}

we treat 𝑼𝚫𝑽\bm{U}^{\top}\bm{\Delta}\bm{V} as a new variable 𝒁=(zij)\bm{Z}=(z_{ij}), then this simpler Sylvester equation has the entry-by-entry solution

z_{ij} = \frac{(\bm{U}^{\top}\bm{C}\bm{V})_{ij}\,\sigma_{2,jj}}{\sigma_{1,ii}+\lambda\sigma_{2,jj}}.

Overall, this highly efficient way of updating 𝑩\bm{B} requires just two spectral decompositions, which can be recycled across all iterations and all values of λ\lambda. Computation of the projection map sending 𝑩m\bm{B}_{m} to 𝑷m\bm{P}_{m} must still be performed at each iteration.
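A minimal Julia sketch of this Sylvester solve appears below. The two spectral decompositions are passed in precomputed, so the per-iteration work reduces to a few matrix multiplications; all names and the example setup in the comments are illustrative assumptions.

using LinearAlgebra

# One Sylvester solve: X'X Delta + lambda Delta E^{-1} = C E^{-1}, given the
# eigendecompositions X'X = U Sigma1 U' and E^{-1} = V Sigma2 V' (computed once).
function sylvester_step(eigXtX, eigEinv, C, lambda)
    U, s1 = eigXtX.vectors, eigXtX.values
    V, s2 = eigEinv.vectors, eigEinv.values
    M = U' * C * V
    Z = [M[i, j] * s2[j] / (s1[i] + lambda * s2[j]) for i in axes(M, 1), j in axes(M, 2)]
    return U * Z * V'                                # Delta = B - B_m
end

# Example setup (hypothetical dimensions):
#   eigXtX  = eigen(Symmetric(X'X))
#   Einv    = 2 * (I + ones(c - 1, c - 1))           # explicit inverse of E
#   eigEinv = eigen(Symmetric(Matrix(Einv)))
#   Delta   = sylvester_step(eigXtX, eigEinv, X' * (Y - W) + lambda * P, lambda)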

2.7 Integration with the Proximal Distance Principle

The proximal distance principle is designed to minimize a loss function f(𝜷)f(\bm{\beta}) subject to 𝜷C\bm{\beta}\in C, where CC is a nonempty closed set (Keys et al., 2019; Landeros et al., 2022, 2023). For instance, CC could be the sparsity set {𝜷p:𝜷0k}\{\bm{\beta}\in\mathbb{R}^{p}:\|\bm{\beta}\|_{0}\leq k\}. The general idea of the proximal distance method is to approximate the solution by minimizing the penalized loss f(𝜷)+λ2dist(𝜷,C)2f(\bm{\beta})+\frac{\lambda}{2}\mathop{\rm dist}\nolimits(\bm{\beta},C)^{2} for λ>0\lambda>0 large. The squared Euclidean distance is majorized by the spherical quadratic λ2𝜷PC(𝜷m)22\frac{\lambda}{2}\|\bm{\beta}-P_{C}(\bm{\beta}_{m})\|_{2}^{2}, where PC(𝒚)P_{C}(\bm{y}) denotes the Euclidean projection of 𝒚\bm{y} onto CC. If f(𝜷)f(\bm{\beta}) is an ordinary sum of squares, then the surrogate f(𝜷)+λ2𝜷PC(𝜷m)2f(\bm{\beta})+\frac{\lambda}{2}\|\bm{\beta}-P_{C}(\bm{\beta}_{m})\|^{2} remains within this realm. Otherwise, it may be that f(𝜷)f(\bm{\beta}) is majorized by an ordinary sum of squares g(𝜷𝜷m)g(\bm{\beta}\mid\bm{\beta}_{m}). The overall surrogate g(𝜷𝜷m)+λ2𝜷PC(𝜷m)2g(\bm{\beta}\mid\bm{\beta}_{m})+\frac{\lambda}{2}\|\bm{\beta}-P_{C}(\bm{\beta}_{m})\|^{2} is then also an ordinary sum of squares.

Because the penalty constant λ\lambda is gradually sent to \infty, whatever matrix factorization is invoked should apply for all choices of λ\lambda. The annealing schedule for λ\lambda is crucial in practice. Simply setting λ\lambda equal to a large constant can cause the proximal distance algorithm to converge prematurely and prevent the algorithm from adequately exploring parameter space. Although the converged value of 𝜷\bm{\beta} will be close to the constraint set CC, the influence of the loss will be slight. The remedy is to start with a small value of λ\lambda and gradually increase it at a slow geometric rate to a final desired level. Our previous papers (Keys et al., 2019; Landeros et al., 2022) provide essential details about annealing schedules and stopping criteria. In any event, the penalty parameter λ\lambda should not be confused with the smoothing parameter μ\mu determining a Moreau envelope. Our examples always fix the value of μ\mu.

Sparse quantile regression provides a concrete example of the proximal distance principle in action. After distance penalization, the smoothed quantile loss (11) is majorized by the surrogate function

g(\bm{\beta}\mid\bm{\beta}_{m}) = \frac{1}{4n\mu}\sum_{i=1}^{n}[y_{i}-z_{mi}+(2q-1)\mu-\bm{x}_{i}^{\top}\bm{\beta}]^{2}+\frac{\lambda}{2}\|\bm{\beta}-P_{S_{k}}(\bm{\beta}_{m})\|_{2}^{2}+c_{m}. \qquad (14)

The stationary condition is then

\Big(\frac{1}{2n\mu}\bm{X}^{\top}\bm{X}+\lambda\bm{I}_{p}\Big)\bm{\beta} = \frac{1}{2n\mu}\bm{X}^{\top}\tilde{\bm{y}}+\lambda P_{S_{k}}(\bm{\beta}_{m}),

where y~i=yizmi+(2q1)μ\tilde{y}_{i}=y_{i}-z_{mi}+(2q-1)\mu are the shifted responses. Given the pre-computed spectral decomposition 𝑿𝑿=𝑼𝚲𝑼\bm{X}^{\top}\bm{X}=\bm{U}\bm{\Lambda}\bm{U}^{\top} of the Gram matrix, the normal equations have the explicit solution

\bm{\beta}_{m+1} = \bm{U}\Big(\frac{1}{2n\mu}\bm{\Lambda}+\lambda\bm{I}_{p}\Big)^{-1}\bm{U}^{\top}\Big[\frac{1}{2n\mu}\bm{X}^{\top}\tilde{\bm{y}}+\lambda P_{S_{k}}(\bm{\beta}_{m})\Big].

Although a spectral decomposition is more expensive to compute than a Cholesky decomposition, a single spectral decomposition suffices across all λ\lambda. In contrast, each time λ\lambda changes, the Cholesky decomposition must be completely recalculated.
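A minimal Julia sketch of one proximal distance iteration follows, with the spectral decomposition of the Gram matrix supplied once up front. The helper names are ours; project_sparsity keeps the k largest entries in magnitude, as in Section 2.4.

using LinearAlgebra

prox_abs(r, mu) = (1 - mu / max(abs(r), mu)) * r

function project_sparsity(beta, k)                   # projection onto S_k
    out = zero(beta)
    idx = partialsortperm(abs.(beta), 1:k, rev = true)
    out[idx] = beta[idx]
    return out
end

# One MM step for sparse smoothed quantile regression at penalty level lambda,
# reusing the eigendecomposition X'X = U Lambda U' across lambda values.
function sparse_quantile_step(eigXtX, X, y, beta, q, mu, lambda, k)
    U, Lam = eigXtX.vectors, eigXtX.values
    n = length(y)
    z = prox_abs.(y .- X * beta, mu)
    yshift = y .- z .+ (2q - 1) * mu                 # shifted responses
    rhs = (X' * yshift) ./ (2n * mu) .+ lambda .* project_sparsity(beta, k)
    return U * ((U' * rhs) ./ (Lam ./ (2n * mu) .+ lambda))
end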

Sparse quantile regression can also be achieved by adding the Moreau envelope penalty λMα0(𝜷)\lambda M_{\alpha\|\cdot\|_{0}}(\bm{\beta}) to the loss. (Observe that μ\mu and α\alpha are distinct smoothing parameters; nothing prevents us from equating them.) The surrogate function (14) remains the same, provided we substitute the proximal map proxα0(𝜷m)\mathop{\rm prox}\nolimits_{\alpha\|\cdot\|_{0}}(\bm{\beta}_{m}) for the projection PSk(𝜷m)P_{S_{k}}(\bm{\beta}_{m}) and the constant λ2α\frac{\lambda}{2\alpha} for the constant λ2\frac{\lambda}{2}. If 𝜷λ\bm{\beta}_{\lambda} denotes the minimum point of the penalized loss, then the curve λMα0(𝜷λ)\lambda\mapsto M_{\alpha\|\cdot\|_{0}}(\bm{\beta}_{\lambda}) now becomes an object of interest. This curve should smoothly approximate a jump function with downward jumps of 11 as λ\lambda increases.

2.8 Nesterov Acceleration

Conversion of weighted to ordinary least squares comes at the expense of an increased number of iterations until convergence. This adverse consequence can be mitigated by Nesterov acceleration (Nesterov, 1983), which is easier to implement than to understand. In practice one maintains both the current iterate 𝜷m\bm{\beta}_{m} and the previous iterate 𝜷m1\bm{\beta}_{m-1}. The shifted point 𝜶m=𝜷m+m1m+2(𝜷m𝜷m1)\bm{\alpha}_{m}=\bm{\beta}_{m}+\frac{m-1}{m+2}(\bm{\beta}_{m}-\bm{\beta}_{m-1}) is then taken as the base point in majorization-minimization. The next point 𝜷m+1\bm{\beta}_{m+1} is no longer guaranteed to reduce the objective. If the descent property fails, we take the standard precaution of restarting the Nesterov acceleration at m=1m=1. Some of our numerical examples benefit from acceleration while others do not. We also explored the application of SQUAREM (Varadhan and Roland, 2008) and Anderson acceleration (Anderson, 1965), but they do not accelerate convergence as satisfactorily as Nesterov acceleration.
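In code, restarted Nesterov acceleration is a thin wrapper around any MM map. The Julia sketch below assumes a user-supplied MM step mm_step and objective obj; both names and the stopping rule are illustrative.

# Restarted Nesterov acceleration around a generic MM map beta -> mm_step(beta).
function nesterov_mm(mm_step, obj, beta0; maxiter = 1000, tol = 1e-6)
    beta, betaold = copy(beta0), copy(beta0)
    fcur, m = obj(beta), 1
    for iter in 1:maxiter
        alpha = beta .+ ((m - 1) / (m + 2)) .* (beta .- betaold)  # shifted anchor
        betanew = mm_step(alpha)
        if obj(betanew) > fcur                        # descent failed: restart
            m, betanew = 1, mm_step(beta)
        else
            m += 1
        end
        fnew = obj(betanew)
        abs(fcur - fnew) <= tol * (1 + abs(fcur)) && return betanew
        betaold, beta, fcur = beta, betanew, fnew
    end
    return beta
end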

2.9 Convergence Theory

As a prelude to establishing the convergence of our MM algorithms, one needs to prove that an optimal point actually exists. This issue is usually tackled in minimization by showing that the objective f(𝜷)f(\bm{\beta}) has compact sublevel sets {𝜷:f(𝜷)c}\{\bm{\beta}:f(\bm{\beta})\leq c\}. This notion is implied by the coerciveness condition lim𝜷f(𝜷)=\lim_{\|\bm{\beta}\|\to\infty}f(\bm{\beta})=\infty, which in turn is implied by strong convexity. In the absence of strong convexity, one can check that a finite convex function f(𝜷)f(\bm{\beta}) is coercive by checking that it is coercive along all nontrivial rays {𝜷p:𝜷=𝜶+t𝒗,t0}\{\bm{\beta}\in\mathbb{R}^{p}:\bm{\beta}=\bm{\alpha}+t\bm{v},\,t\geq 0\} emanating from some arbitrarily chosen base point 𝜶\bm{\alpha} (Lange, 2016). For instance in quantile regression, the objective f(𝜷)f(\bm{\beta}) is convex and

f(\bm{\beta}+t\bm{v}) = \sum_{i=1}^{n}\rho_{q}(y_{i}-\bm{x}_{i}^{\top}\bm{\beta}-t\bm{x}_{i}^{\top}\bm{v}).

If 𝑿\bm{X} has full rank and 𝒗𝟎\bm{v}\neq\bm{0}, then at least one inner product 𝒙i𝒗\bm{x}_{i}^{\top}\bm{v} does not vanish. The corresponding term ρq(yi𝒙i𝜷t𝒙i𝒗)\rho_{q}(y_{i}-\bm{x}_{i}^{\top}\bm{\beta}-t\bm{x}_{i}^{\top}\bm{v}) then tends to \infty as tt tends to \infty. Given that all other terms are nonnegative, f(𝜷)f(\bm{\beta}) is coercive.

As pointed out earlier, our algorithm maps assume the form

\mathcal{M}(\bm{\beta}) = \bm{\beta}-t_{\bm{\beta}}\bm{H}(\bm{\beta})^{-1}\nabla f(\bm{\beta}),

where the matrix \bm{H}(\bm{\beta})=d^{2}g(\bm{\beta}\mid\bm{\beta}) is positive definite and t_{\bm{\beta}}\in(0,1] is a step-size multiplier chosen by backtracking. For our MM algorithms, the choice t_{\bm{\beta}}=1 always decreases the objective. Lange (2013) and Xu et al. (2017) prove the following proposition pertinent to our problems.

Proposition 2.1.

Suppose the objective f(𝛃)f(\bm{\beta}) has compact sublevel sets and 𝐇(𝛃)\bm{H}(\bm{\beta}) is continuous and positive definite. If the step size t𝛃t_{\bm{\beta}} is selected by Armijo backtracking, then the limit points of the sequence 𝛃m+1=(𝛃m)\bm{\beta}_{m+1}=\mathcal{M}(\bm{\beta}_{m}) are stationary points of f(𝛃)f(\bm{\beta}). Moreover, the set of limit points is compact and connected. If all stationary points are isolated, then the sequence 𝛃m\bm{\beta}_{m} converges to one of them.

Our examples omit backtracking and involve surrogates satisfying the uniform Lipschitz gradient condition

\|\nabla g(\bm{\gamma}\mid\bm{\beta}_{m})-\nabla g(\bm{\delta}\mid\bm{\beta}_{m})\| \leq L\|\bm{\gamma}-\bm{\delta}\| \qquad (15)

for all 𝜸\bm{\gamma}, 𝜹\bm{\delta}, and 𝜷m\bm{\beta}_{m}. Proposition 8 of (Lange et al., 2021) then offers the following simpler but more precise result.

Proposition 2.2.

Let f(𝛃)f(\bm{\beta}) be a coercive differentiable function majorized by a surrogate satisfying condition (15). If 𝛃\bm{\beta}_{\infty} denotes a minimum point of f(𝛃)f(\bm{\beta}), then the iterates 𝛃m\bm{\beta}_{m} delivered by the corresponding MM algorithm satisfy the bound

\sum_{k=0}^{m}\|\nabla f(\bm{\beta}_{k})\|^{2} \leq 2L[f(\bm{\beta}_{0})-f(\bm{\beta}_{\infty})].

It follows that limmf(𝛃m)=0\lim_{m\to\infty}\|\nabla f(\bm{\beta}_{m})\|=0. Furthermore, when f(𝛃)f(\bm{\beta}) is continuously differentiable, any limit point of the sequence 𝛃m\bm{\beta}_{m} is a stationary point of f(𝛃)f(\bm{\beta}).

Proposition 12 of (Lange et al., 2021) dispenses with the isolated stationary point assumption of Proposition 2.1 and proves convergence when f(𝜷)f(\bm{\beta}) is coercive, continuous, and subanalytic and all g(𝜷𝜷m)g(\bm{\beta}\mid\bm{\beta}_{m}) are continuous, μ\mu-strongly convex, and satisfy condition (15) on the compact sublevel set {𝜷:f(𝜷)f(𝜷0)}\{\bm{\beta}:f(\bm{\beta})\leq f(\bm{\beta}_{0})\}. Virtually all well-behaved functions are subanalytic. For instance, the Moreau envelope of a convex lower-semicontinuous subanalytic function is subanalytic (Bolte et al., 2007); the function 𝜷0\|\bm{\beta}\|_{0} is semialgebraic and therefore subanalytic (Landeros et al., 2022).

For the convex problems, one can also attack global convergence by invoking monotone operator theory (Bauschke and Combettes, 2011).

Proposition 2.3.

For a convex problem with 𝐇=d2g(𝛃𝛃)\bm{H}=d^{2}g(\bm{\beta}\mid\bm{\beta}) a fixed positive definite matrix, suppose that the set of stationary points is nonempty. Then the MM iterates

\bm{\beta}_{m+1} = \bm{\beta}_{m}-\bm{H}^{-1}\nabla f(\bm{\beta}_{m})

converge to a minimum point.

Proof.

See Appendix D. ∎

Proposition 10 of (Lange et al., 2021) proves that MM iterates converge at a linear rate in the best circumstances.

Proposition 2.4.

Let f(𝛃)f(\bm{\beta}) be μ\mu-strongly convex and differentiable, and assume g(𝛃𝛃m)g(\bm{\beta}\mid\bm{\beta}_{m}) satisfies the Lipschitz condition (15). If the global minimum occurs at 𝛂\bm{\alpha}, then the MM iterates 𝛃m\bm{\beta}_{m} satisfy

f(\bm{\beta}_{m})-f(\bm{\alpha}) \leq \left[1-\left(\frac{\mu}{2L}\right)^{2}\right]^{m}[f(\bm{\beta}_{0})-f(\bm{\alpha})],

establishing linear convergence of 𝛃m\bm{\beta}_{m} to 𝛂\bm{\alpha}.

Proof.

See Appendix D for a slightly corrected proof. ∎

In many of our examples, the Hessian d2g(𝜷𝜷)=c𝑿𝑿d^{2}g(\bm{\beta}\mid\bm{\beta})=c\bm{X}^{\top}\bm{X} for some c>0c>0. The constants of Proposition 2.4 satisfy L=cσ12L=c\sigma_{1}^{2} and μ=cσp2\mu=c\sigma_{p}^{2}, where the σi\sigma_{i} are the ordered singular values of 𝑿\bm{X}. The ratio Lμ\frac{L}{\mu} is the condition number of 𝑿\bm{X} relevant to the rate of convergence. Note that cc disappears in the condition number. In the proximal distance method, we encounter the Hessian d2g(𝜷𝜷)=c𝑿𝑿+λ𝑰d^{2}g(\bm{\beta}\mid\bm{\beta})=c\bm{X}^{\top}\bm{X}+\lambda\bm{I}, where the constant cc no longer cancels in the condition number. Although addition of λ𝑰\lambda\bm{I} improves the condition number pertinent to inner iterations, annealing of λ\lambda towards \infty in outer iterations slows overall convergence to the constrained minimum. Multinomial regression replaces the Gram matrix 𝑿𝑿\bm{X}^{\top}\bm{X} by the Kronecker product (𝑰1c𝟏𝟏)(𝑿𝑿)\left(\bm{I}-\frac{1}{c}\bm{1}\bm{1}^{\top}\right)\otimes\left(\bm{X}^{\top}\bm{X}\right), whose singular values are products of the eigenvalues of the matrices 𝑰1c𝟏𝟏\bm{I}-\frac{1}{c}\bm{1}\bm{1}^{\top} and 𝑿𝑿\bm{X}^{\top}\bm{X}. Because these are 1c\frac{1}{c} and 11 and the σi2\sigma_{i}^{2}, respectively, the revised condition number turns out to equal cLμ\frac{cL}{\mu}.

3 Numerical Experiments

The numerical examples of this section demonstrate the versatility of the MM principle in the derivation of fast estimation algorithms. We consider quantile regression (ordinary and sparse), L2E\text{L}_{2}\text{E} regression (ordinary and isotonic), logistic regression, and multinomial regression (reduced rank). In sparse and low-rank regression our penalties ignore intercept parameters. All computations were conducted on the UCLA Hoffman2 cluster with 4 CPU cores.

3.1 Unpenalized Models

We now compare the performance of our MM algorithms and popular competing algorithms for smoothed quantile regression, L2E\text{L}_{2}\text{E} regression, and logistic/multinomial regression. No constraints or penalties are imposed at this stage. We adopt a common protocol for generating the design matrix 𝑿n×(p1)\bm{X}\in\mathbb{R}^{n\times(p-1)}. The aim here is to demonstrate the computational efficiency of the MM algorithms on large-scale data. Each row of 𝑿\bm{X} is sampled from a multivariate normal distribution with mean 𝟎\bm{0} and covariance matrix 𝚺\bm{\Sigma} with entries σij=0.7|ij|\sigma_{ij}=0.7^{|i-j|}. An intercept column 𝟏n\bm{1}_{n} is then left appended to 𝑿\bm{X} to obtain a full design matrix 𝑿~n×p\tilde{\bm{X}}\in\mathbb{R}^{n\times p}. Generally, we set the true coefficient vector to 𝜷=(1,0.1,0.1,,0.1)p\bm{\beta}^{*}=(1,0.1,0.1,\dots,0.1)^{\top}\in\mathbb{R}^{p}. The responses 𝒚\bm{y} require a different generation protocol for each problem class. Unless stated otherwise, the number of observations equals n=100pn=100p, where the number of predictors pp ranges over the set {100,200,,1000}\{100,200,\dots,1000\}. We terminate our MM algorithms when the relative change in objective falls below 10610^{-6}. In the following paragraphs, we describe competing methods, their software implementations, and the further protocols for generating the response vector 𝒚\bm{y}.

Quantile Regression

For convolution-smoothed quantile regression, we compare MM to the conquer R package (Tan et al., 2022; He et al., 2023). The conquer package implements gradient descent with a Barzilai-Borwein stepsize. In calling conquer, we invoke the uniform kernel with the default bandwidth h=max{[(log(n)+p)/n]0.4,0.05}h=\max\{[(\log(n)+p)/n]^{0.4},0.05\}. Our MM software shares this kernel and bandwidth and therefore optimizes the same objective. The response yiy_{i} is generated as

y_{i} = \tilde{\bm{x}}_{i}^{\top}\bm{\beta}^{*}+\Big(\frac{x_{ip}}{2}+1\Big)[\epsilon_{i}-F^{-1}_{\epsilon_{i}}(q)], \qquad (16)

where ϵi\epsilon_{i} follows a tt-distribution with 1.5 degrees of freedom and FF is the distribution function of the tt-distribution. The term Fϵi1(q)-F^{-1}_{\epsilon_{i}}(q) creates quantile drift, which is a common practice in simulation studies of quantile regression. The multiplier xip2+1\frac{x_{ip}}{2}+1 encourages heteroskedasticity. We consider two quantile levels, q=0.5q=0.5 and q=0.7q=0.7 in our experiments.

Figure 1: Results for non-sparse quantile regression. The left panel shows average computation time, with the error bars marking ±1\pm 1 standard deviation. In the right panel, the objective trajectories of MM and conquer overlap.

L2E\text{L}_{2}\text{E} Regression

For \text{L}_{2}\text{E} regression, we compare the proposed double majorization technique described in Section 2.2 to iteratively reweighted least squares as proposed by Liu et al. (2023). Their R implementation runs slowly on large-scale data, so we re-implemented their IRLS approach in Julia and used it as our baseline competition. Responses are generated as

y_{i} = \tilde{\bm{x}}_{i}^{\top}\bm{\beta}^{*}+\epsilon_{i},\quad i=1,2,\dots,n,

where the ϵi\epsilon_{i} are standard normal random deviates. To induce outlier contamination, we add 10 to yiy_{i} for the first 10% of the samples and 10 to the second column of 𝑿~\tilde{\bm{X}} for the last 10% of samples. This simulation choice demonstrates that the L2E\text{L}_{2}\text{E} estimator is also robust to contamination in the covariates 𝑿\bm{X}, a property not enjoyed by Huber regression or quantile regression.

Figure 2: Results for L2E\text{L}_{2}\text{E} regression. WLS and LS refer to the weighted least squares approach of Liu et al. (2023) and naive least squares regression, respectively.

Logistic and Multinomial Regression

For logistic and multinomial (logistic) regression, we compare our MM algorithm to the glmnet R package (Friedman et al., 2010) with penalty parameter 0. For logistic regression, the responses yiy_{i} are generated as

y_{i} \sim \text{Bernoulli}(p_{i}),\quad p_{i} = \frac{\exp(\tilde{\bm{x}}_{i}^{\top}\bm{\beta}^{*})}{1+\exp(\tilde{\bm{x}}_{i}^{\top}\bm{\beta}^{*})}.

For multinomial regression, the responses YicY_{i}\in\mathbb{R}^{c} are generated as

Y_{i}\sim\text{Multinomial}[1,(p_{i1},p_{i2},\dots,p_{ic})],\quad p_{ij}=\frac{\exp(\tilde{\bm{x}}_{i}^{\top}\bm{\beta}^{*}_{j})}{\sum_{l=1}^{c}\exp(\tilde{\bm{x}}_{i}^{\top}\bm{\beta}^{*}_{l})},

where 𝜷j\bm{\beta}^{*}_{j} is the jj-th column of the coefficient matrix 𝑩p×c\bm{B}^{*}\in\mathbb{R}^{p\times c}. The entries of 𝑩\bm{B}^{*} are populated with random Uniform[0,0.2]\text{Uniform}[0,0.2] values. In this simulation there are c=10c=10 categories. In contrast to the two previous examples, pp ranges over {30,60,,300}\{30,60,\dots,300\} and n=1000pn=1000p. This adjustment is made because glmnet often fails to converge when the number of predictors pp is too large.
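A hedged sketch of the multinomial response generator follows; it samples category labels from the softmax probabilities and uses the Distributions package. The function name and the integer label encoding are our own conventions.

```julia
using Distributions

# Sample category labels 1..c from the softmax probabilities implied by Bstar.
function simulate_multinomial_response(Xfull, Bstar; rng = Random.default_rng())
    η = Xfull * Bstar                            # n × c linear predictors
    y = Vector{Int}(undef, size(η, 1))
    for i in axes(η, 1)
        w = exp.(η[i, :] .- maximum(η[i, :]))    # numerically stable softmax
        y[i] = rand(rng, Categorical(w ./ sum(w)))
    end
    return y
end

c = 10
Bstar = 0.2 .* rand(size(Xfull, 2), c)           # Uniform[0, 0.2] entries
ymult = simulate_multinomial_response(Xfull, Bstar)
```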

Figure 3: Results for logistic regression and multinomial regression. In the right panel, the likelihood trajectories of MM and glmnet overlap.

Figures 1, 2, and 3 summarize our computational experiments. All experimental results are averaged over 50 random replicates. Figure 1 shows that MM converges to the same objective as conquer, certifying the correctness of our implementation. As the left panel of Figure 1 demonstrates, however, MM can be 3 to 7 times faster than conquer. In $\text{L}_2\text{E}$ regression, Figure 2 demonstrates that double majorization is nearly 10 times faster than weighted least squares. The right panel of Figure 2 shows that the estimation error $\|\hat{\bm{\beta}}-\bm{\beta}^*\|_2$ of the $\text{L}_2\text{E}$ estimator is much smaller than that of least squares, confirming the robustness of $\text{L}_2\text{E}$. Figure 3 shows that MM converges to the same loglikelihood as glmnet, but can be 3 times faster for logistic regression and 10 times faster for multinomial regression. To our surprise, glmnet often fails to converge for $p>150$ in multinomial regression. Figure 3 consequently only shows results for $p=30,60,\dots,150$. When glmnet does converge, our converged loglikelihoods match those of glmnet. These results highlight MM's efficiency in computing common statistical estimators with no compromise in accuracy.

3.2 Sparse Quantile Regression

We next compare sparse quantile regression under $\ell_0$-norm and distance-to-set penalties to quantile regression under Lasso, SCAD, and MCP penalties. The conquer.cv.reg function of conquer supplies the Lasso, SCAD, and MCP results. In our comparisons, the design matrix $\tilde{\bm{X}}$ is generated as described in Section 3.1. However, following Tan et al. (2022) and Man et al. (2024), the true coefficient vector $\bm{\beta}^*$ has $\beta^*_1=4$, $\beta^*_3=1.8$, $\beta^*_5=1.6$, $\beta^*_7=1.4$, $\beta^*_9=1.2$, $\beta^*_{11}=1$, $\beta^*_{13}=-1$, $\beta^*_{15}=-1.2$, $\beta^*_{17}=-1.4$, $\beta^*_{19}=-1.6$, and $\beta^*_{21}=-1.8$. All remaining elements of $\bm{\beta}^*$ are 0. In other words, there are 10 nonzero predictors besides the intercept. The responses are generated by the protocol (16), except that we now consider $\mathcal{N}(0,2)$ noise in addition to the heavy-tailed $t_{1.5}$ noise. Our experiments cover both the over-determined case $(n=500,\ p=250)$ and the under-determined case $(n=250,\ p=500)$. Quantile levels are set at $q=0.5$ and $q=0.7$. We set the Moreau envelope parameter $\mu$ for the quantile loss equal to the bandwidth parameter $h$ returned by conquer, namely $\mu=\max\{0.05,\ \sqrt{q(1-q)}\,[\log(p)/n]^{0.25}\}$.

The following metrics measure model performance: (a) true positive rate (TPR), (b) false positive rate (FPR), (c) estimation error (EE), defined as $\|\hat{\bm{\beta}}-\bm{\beta}^*\|_2$, and (d) prediction error (PE), defined as $\|\tilde{\bm{X}}\hat{\bm{\beta}}-\tilde{\bm{X}}\bm{\beta}^*\|_2$. Penalty parameters for all models are selected via 5-fold cross-validation, with the quantile loss serving as the validation error. For the $\ell_0$-norm Moreau envelope penalty with penalty constant $\lambda$, we set $\alpha=0.01$ and search over an exponentially spaced grid of 50 values of $\lambda$. For the proximal distance model, we search over the grid $k\in\{1,2,\dots,50\}$ to identify the optimal sparsity level. Although the optimal solution $\hat{\bm{\beta}}$ of the $\ell_0$-norm model is not itself sparse, its proximal projection $\text{prox}_{\alpha\|\cdot\|_0}(\hat{\bm{\beta}})$ is, and we use this projection to assess variable selection performance; see the sketch after this paragraph.
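The sketch below records these metric definitions together with the hard-thresholding form of $\text{prox}_{\alpha\|\cdot\|_0}$, which zeroes any coefficient with $\beta_j^2<2\alpha$. The helper names are ours, and the intercept is excluded from support recovery by convention.

```julia
using LinearAlgebra

# Hard-thresholding proximal map of β ↦ α‖β‖₀: components with β_j^2 < 2α
# are zeroed out (the standard closed form of the ℓ0 proximal operator).
prox_l0(β, α) = [b^2 < 2α ? zero(b) : b for b in β]

# Variable-selection and accuracy metrics reported in Table 1; the intercept
# (first coefficient) is excluded from support recovery.
function sqr_metrics(βhat, βstar, Xfull)
    shat  = findall(!iszero, βhat[2:end])
    strue = findall(!iszero, βstar[2:end])
    tpr = length(intersect(shat, strue)) / length(strue)
    fpr = length(setdiff(shat, strue)) / (length(βstar) - 1 - length(strue))
    ee  = norm(βhat - βstar)             # estimation error ‖βhat - βstar‖₂
    pe  = norm(Xfull * (βhat - βstar))   # prediction error ‖X̃(βhat - βstar)‖₂
    return (TPR = tpr, FPR = fpr, EE = ee, PE = pe)
end
```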

Method TPR FPR EE PE Time (s)
$(n=500,\ p=250)$, $q=0.5$
SQR-Lasso 1.00 (0.00) 0.09 (0.04) 0.47 (0.10) 7.92 (1.41) 2.52
SQR-SCAD 0.93 (0.13) 0.00 (0.05) 0.86 (1.23) 11.2 (13.6) 3.02
SQR-MCP 0.93 (0.14) 0.02 (0.10) 1.49 (0.66) 15.4 (43.9) 2.92
SQR-$\ell_0$ 1.00 (0.00) 0.00 (0.00) 0.24 (0.08) 4.54 (1.13) 0.73
SQR-PD 1.00 (0.00) 0.00 (0.00) 0.21 (0.07) 3.88 (1.16) 6.40
$(n=500,\ p=250)$, $q=0.7$
SQR-Lasso 1.00 (0.00) 0.08 (0.04) 0.57 (0.13) 9.72 (2.04) 2.69
SQR-SCAD 0.90 (0.15) 0.00 (0.01) 1.22 (1.19) 16.4 (14.9) 2.69
SQR-MCP 0.90 (0.13) 0.02 (0.11) 0.88 (0.20) 40.9 (182.1) 2.69
SQR-$\ell_0$ 1.00 (0.00) 0.00 (0.00) 0.27 (0.07) 5.21 (1.19) 0.66
SQR-PD 1.00 (0.00) 0.00 (0.00) 0.26 (0.08) 4.96 (1.40) 6.95
$(n=250,\ p=500)$, $q=0.5$
SQR-Lasso 1.00 (0.00) 0.05 (0.02) 0.83 (0.18) 9.58 (1.71) 3.20
SQR-SCAD 0.76 (0.18) 0.10 (0.28) 6.43 (14.3) 66.0 (152) 3.81
SQR-MCP 0.82 (0.16) 0.15 (0.27) 4.45 (6.98) 45.1 (75.3) 4.10
SQR-$\ell_0$ 0.97 (0.07) 0.00 (0.00) 0.71 (0.49) 8.00 (4.03) 0.56
SQR-PD 0.99 (0.02) 0.01 (0.01) 0.61 (0.27) 7.68 (2.83) 16.9
$(n=250,\ p=500)$, $q=0.7$
SQR-Lasso 1.00 (0.02) 0.05 (0.02) 1.15 (0.30) 12.7 (3.16) 3.50
SQR-SCAD 0.68 (0.17) 0.06 (0.22) 4.82 (8.88) 48.9 (91.1) 3.41
SQR-MCP 0.78 (0.17) 0.10 (0.28) 3.91 (6.22) 39.2 (61.9) 2.69
SQR-$\ell_0$ 0.94 (0.11) 0.00 (0.00) 1.04 (0.73) 11.3 (5.56) 0.63
SQR-PD 0.97 (0.07) 0.01 (0.01) 0.97 (0.58) 11.5 (4.28) 14.2
Table 1: Simulation results for sparse quantile regression under heavy-tailed $t_{1.5}$ noise.

Table 1 reports simulation results averaged over 50 random replicates under heavy-tailed noise. In general, Lasso-penalized quantile regression tends to produce an excess of false positives. SCAD and MCP penalties significantly reduce the number of false positives, but often at the expense of missing some true positives. In the under-determined case $(n=250,\ p=500)$ with heavy-tailed noise, SCAD and MCP perform poorly, missing many true positives while including many false positives. SCAD and MCP do perform better when the noise is normal, as the results tabulated in Appendix E indicate. The $\ell_0$-norm and distance-to-set penalties consistently achieve high TPR, low FPR, and low prediction error. The $\ell_0$-norm penalty delivers high precision and efficient computation, while the distance-to-set penalty offers the best overall estimation performance. Given its dense grid search and annealing of $\lambda$, distance-to-set estimation is generally slower than $\ell_0$-norm estimation.

3.3 Robust Isotonic Regression

To illustrate the application of robust isotonic regression, we consider global warming. The Climate at a Glance dataset records December land and ocean temperature anomalies from 1850 to 2023 relative to the 1901-2000 mean. This dataset is similar to one analyzed by Wu et al. (2001) and Tibshirani et al. (2011). Appendix B records the implementation details of our $\text{L}_2\text{E}$ method. The method not only enables robust estimation, but also quantifies the outlyingness of each observation through a converged case weight $w_i$. We compare our method, which recycles matrix decompositions, to the weighted least squares method of Liu et al. (2023). As a non-robust benchmark, we include the isotonic fit delivered by the pooled adjacent violators algorithm (Barlow and Brunk, 1972).

Figure 4: Results for robust isotonic regression. In the left panel, the green and red lines of the robust fits overlap, rendering the green line invisible.

The two robust methods give almost identical fits, as depicted in the left panel of Figure 4. However, our method takes 0.97 seconds, while IRLS takes 13.15 seconds. These results echo the findings of Section 3.1 on efficient computation. Compared to the non-robust fit, our robust isotonic estimate is less influenced by the unusually low temperatures near 1920 and the unusually high ones near 1940. The right panel of Figure 4 depicts the values $-\log(w_i)$ quantifying the outlyingness of each observation. The two spikes near 1920 and 1940 are clearly visible.

3.4 Low-Rank Multinomial Regression

Simulated data on smoothed-nuclear-norm penalized multinomial regression allow us to draw comparisons with the R package npmr (Powers et al., 2018), which is powered by accelerated proximal gradient descent. Appendix C records the details of our multinomial model under the smoothed-nuclear-norm penalty. The coefficient matrix $\bm{B}^*$ is now populated with $\text{Uniform}[0,3]$ deviates and then projected onto the closest matrix of rank 3. To avoid penalizing intercepts, this projection ignores the first row of $\bm{B}^*$; a brief sketch of the projection follows this paragraph. The design matrix and responses are again generated according to the protocols of Section 3.1. We consider the two settings $(n,p,c)=(1000,100,10)$ and $(n,p,c)=(5000,250,20)$ and carry out a grid search on the penalty constant $\lambda$ over 50 exponentially spaced points between $10^{-4}$ and $1$. Because we divide the overall loglikelihood by $n$ while Powers et al. (2018) do not, the grid for npmr must be multiplied by $n$. We adopt the two Moreau smoothing constants $\mu=0.01$ and $\mu=1$ to illustrate the impact of $\mu$ on inference. For each random replicate, we generate a separate test set and evaluate estimation performance by the loglikelihood on both the training set and the test set.
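Here is a minimal sketch of the rank-3 projection, assuming the first row of the coefficient matrix holds the intercepts; project_rank is our own helper name.

```julia
using LinearAlgebra

# Replace the non-intercept rows of B by their closest rank-r approximation
# (truncated SVD); the first row (intercepts) is left untouched.
function project_rank(B, r)
    Bproj = copy(B)
    F = svd(B[2:end, :])
    k = min(r, length(F.S))
    Bproj[2:end, :] = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]
    return Bproj
end

Bstar = project_rank(3.0 .* rand(100, 10), 3)   # Uniform[0,3] entries, (p, c) = (100, 10)
```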

Figure 5: Results for low-rank multinomial regression. The first row shows the loglikelihoods of the fitted coefficient matrices on the training sets, while the second row shows the loglikelihoods on independently generated test sets. Data points are averaged over 50 random replicates.

Figure 5 displays training and test loglikelihoods averaged over 50 replicates. All methods yield similar best test loglikelihoods over the $\lambda$ grid. Our smoothed nuclear norm MM algorithm executes over 20 times faster than npmr when $(n,p,c)=(1000,100,10)$ and around 30 times faster when $(n,p,c)=(5000,250,20)$. Three factors may be at play here. First, although proximal gradient descent benefits from backtracking, the starting step size still has a major impact on speed and can be tricky to specify without careful hand tuning. Our experiments adopt the default initial step size of npmr. In contrast, step size is not an issue for an MM algorithm, since majorization automatically guarantees the descent property. Second, our MM algorithm incorporates curvature information through the quadratic upper bound majorization, which translates into faster convergence. Finally, our MM algorithm bypasses the bottleneck imposed by repeated matrix decompositions. Reduction of the normal equation to a Sylvester equation is a key tactic for reducing computational complexity.

We also applied MM-based low-rank multinomial regression to the Vowel dataset available in the npmr package. The 990 observations on 10 independent predictors record speaker-independent recognition of the 11 steady-state vowels of British English. After adding an intercept, $(p,c)=(11,11)$. The 10 predictors are log-area ratios under linear predictive coding. The data are divided into a training set of 528 observations and a testing set of 462 observations. We again compute the whole solution path on the training set with the penalty constant $\lambda$ ranging over 50 exponentially spaced points between $10^{-4}$ and $1$. We report the best test loglikelihood and the total computation time for MM low-rank multinomial regression ($\mu=0.01$), npmr low-rank multinomial regression, and ordinary multinomial regression without low-rank regularization. Because glmnet fails to converge without regularization on these data, we employ our own implementation of ordinary multinomial regression.

Table 2: Results on the Vowel Data.
Method Best Test Loglikelihood Computation Time
Low-rank Multinomial -579.1 1.0 s
package npmr -581.5 22.9 s
Multinomial -1216.9 0.04 s

Table 2 summarizes the experimental results. Regularized multinomial regression yields much higher test loglikelihoods than unregularized multinomial regression, demonstrating the inferential benefits of low-rank regularization. Despite its reliance on nuclear norm smoothing, our MM algorithm yields similar, and even slightly higher, test loglikelihoods than npmr. Our code executes over 20 times faster than npmr, reinforcing our previous claims about computational efficiency.

4 Discussion

The current paper stresses the MM principle, smoothing by Moreau envelopes, and the recycling of matrix decompositions. The MM principle simplifies optimization by: (a) separating the variables of a problem, (b) avoiding large matrix inversions, (c) linearizing a problem, (d) restoring symmetry, (e) dealing with equality and inequality constraints gracefully, and (f) turning a nondifferentiable problem into a smooth problem. Our examples feature quadratic surrogates, by far the most common surrogates and the easiest to implement in practice. Quadratic surrogates are not necessarily parameter separated, and their stationary conditions require solving linear equations. This is where the recycling of matrix decompositions comes into play. Cholesky, QR, and spectral decompositions are expensive to extract in high dimensions, so there is strong motivation to employ quadratic surrogates with the same curvature across all iterations. Deweighting and Moreau majorization are two tactics that achieve this goal; a schematic sketch of the recycling idea appears below. Although Cholesky decompositions are more prone to ill conditioning than QR decompositions, they are faster to extract and feature exclusively in our numerical examples.
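To illustrate, here is a schematic Julia sketch of decomposition recycling in a deweighted least squares MM loop. The map shifted_response stands in for the problem-specific deweighting step that converts the current iterate into adjusted responses; it and the stopping rule are placeholders rather than our released implementation.

```julia
using LinearAlgebra

# Schematic MM loop: the surrogate's normal equations share the Gram matrix
# X'X across iterations, so one Cholesky factorization serves every solve.
function mm_deweighted_ls(X, y, shifted_response; maxiter = 1000, tol = 1e-6)
    G = cholesky(Symmetric(X' * X))     # factor the Gram matrix once
    β = G \ (X' * y)                    # ordinary least squares start
    for iter in 1:maxiter
        βold = β
        yadj = shifted_response(β, y)   # trade case weights for adjusted responses
        β = G \ (X' * yadj)             # two triangular solves, no refactorization
        norm(β - βold) ≤ tol * (norm(βold) + 1) && break
    end
    return β
end
```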

Many loss and penalty functions are nondifferentiable. Smoothing by Moreau envelopes and other approximation methods removes the kinks and allows the standard rules of calculus to apply in optimization. A Moreau envelope can be majorized by a spherical quadratic with parameters separated. This advantage extends to squared distance-to-set penalties in constrained optimization. These majorizations are not especially tight, so it often takes many iterations until convergence. Fortunately, Nesterov acceleration takes most of the sting out of this drawback. Our numerical examples illustrate the virtues of these tactics. Computational speed is often an order of magnitude better than that delivered by popular competing methods. More subtly, statistical inference is improved through the parsimony imposed by regularization and the flagging of outliers.

In any event, it is the combination of tactics that produces the fastest and most reliable algorithms. Techniques such as alternating minimization (block descent and ascent), profile loglikelihoods, penalty constant annealing (Zhou and Lange, 2010), and quasi-Newton methods (Zhou et al., 2011) have also proved valuable in many settings. Readers should keep in mind that the MM principle can be invoked on the subproblems of alternating minimization. Progress, not perfection, is usually adequate in solving the subproblems. Optimization requires both art and science. Although there is no panacea, over-arching themes can guide the construction of good algorithms. Our examples vividly demonstrate the speed improvements possible with thoughtful algorithm construction. There are doubtless many advances yet to be made in accelerating the optimization algorithms so vital to the progress of statistics. Readers who want to replicate our experiments and extend our Julia code can visit the web site https://github.com/qhengncsu/MMDeweighting.jl.

Acknowledgments

This research was partially funded by grants from the National Institute of General Medical Sciences (R35GM141798, HZ and KL) and the National Science Foundation (DMS-2054253 and IIS-2205441, HZ).

References

  • Anderson (1965) Anderson, D. G. (1965). Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM) 12(4), 547–560.
  • Barlow and Brunk (1972) Barlow, R. E. and H. D. Brunk (1972). The isotonic regression problem and its dual. Journal of the American Statistical Association 67(337), 140–147.
  • Bartels and Stewart (1972) Bartels, R. H. and G. W. Stewart (1972). Algorithm 432 [C2]: solution of the matrix equation AX + XB = C [F4]. Communications of the ACM 15(9), 820–826.
  • Bauschke and Combettes (2011) Bauschke, H. H. and P. L. Combettes (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer.
  • Bauschke and Moursi (2023) Bauschke, H. H. and W. M. Moursi (2023). An Introduction to Convexity, Optimization, and Algorithms. Philadelphia, PA: Society for Industrial and Applied Mathematics.
  • Beck (2017) Beck, A. (2017). First-Order Methods in Optimization. SIAM.
  • Böhning (1992) Böhning, D. (1992). Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics 44(1), 197–200.
  • Böhning and Lindsay (1988) Böhning, D. and B. G. Lindsay (1988). Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics 40(4), 641–663.
  • Bolte et al. (2007) Bolte, J., A. Daniilidis, and A. Lewis (2007). The Lojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization 17(4), 1205–1223.
  • Chen et al. (2012) Chen, X., Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing (2012). Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics 6(2), 719 – 752.
  • Chi and Chi (2022) Chi, J. T. and E. C. Chi (2022). A user-friendly computational framework for robust structured regression with the $L_2$ criterion. Journal of Computational and Graphical Statistics 31(4), 1051–1062.
  • de Leeuw and Lange (2009) de Leeuw, J. and K. Lange (2009). Sharp quadratic majorization in one dimension. Computational Statistics & Data Analysis 53(7), 2471–2484.
  • de Menezes et al. (2021) de Menezes, D., D. Prata, A. Secchi, and J. Pinto (2021). A review on robust M-estimators for regression analysis. Computers & Chemical Engineering 147, 107254.
  • Fan and Li (2001) Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456), 1348–1360.
  • Fernandes et al. (2021) Fernandes, M., E. Guerre, and E. Horta (2021). Smoothing quantile regressions. Journal of Business & Economic Statistics 39(1), 338–357.
  • Friedman et al. (2010) Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1.
  • He et al. (2023) He, X., X. Pan, K. M. Tan, and W.-X. Zhou (2023). Smoothed quantile regression with large-scale inference. Journal of Econometrics 232(2), 367–388.
  • Heiser (1987) Heiser, W. J. (1987). Correspondence analysis with least absolute residuals. Computational Statistics & Data Analysis 5(4), 337–356.
  • Heng et al. (2023) Heng, Q., E. C. Chi, and Y. Liu (2023). Robust low-rank tensor decomposition with the $L_2$ criterion. Technometrics 65(4), 537–552.
  • Hiriart-Urruty and Le (2013) Hiriart-Urruty, J.-B. and H. Y. Le (2013). From Eckart and Young approximation to Moreau envelopes and vice versa. RAIRO-Operations Research-Recherche Opérationnelle 47(3), 299–310.
  • Huber and Dutter (1974) Huber, P. J. and R. Dutter (1974). Numerical solution of robust regression problems. In COMPSTAT 1974, Proc. Symposium on Computational Statistics. Physica Verlag.
  • Huber and Ronchetti (2011) Huber, P. J. and E. M. Ronchetti (2011). Robust Statistics. John Wiley & Sons.
  • Hunter and Lange (2004) Hunter, D. R. and K. Lange (2004). A tutorial on MM algorithms. The American Statistician 58, 30–37.
  • Kaplan and Sun (2017) Kaplan, D. M. and Y. Sun (2017). Smoothed estimating equations for instrumental variables quantile regression. Econometric Theory 33(1), 105–157.
  • Keys et al. (2019) Keys, K. L., H. Zhou, and K. Lange (2019). Proximal distance algorithms: theory and practice. Journal of Machine Learning Research 20(66), 1–38.
  • Kiers (1997) Kiers, H. A. (1997). Weighted least squares fitting using ordinary least squares algorithms. Psychometrika 62, 251–266.
  • Koenker and Bassett (1978) Koenker, R. and G. Bassett (1978). Regression quantiles. Econometrica 46(1), 33–50.
  • Krishnapuram et al. (2005) Krishnapuram, B., L. Carin, M. A. Figueiredo, and A. J. Hartemink (2005). Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6), 957–968.
  • Landeros et al. (2022) Landeros, A., O. H. M. Padilla, H. Zhou, and K. Lange (2022). Extensions to the proximal distance method of constrained optimization. Journal of Machine Learning Research 23(182), 1–45.
  • Landeros et al. (2023) Landeros, A., J. Xu, and K. Lange (2023). MM optimization: proximal distance algorithms, path following, and trust regions. Proceedings of the National Academy of Sciences of the United States of America 120(27), e2303168120.
  • Lange (2013) Lange, K. (2013). Optimization, Volume 95. Springer Science & Business Media.
  • Lange (2016) Lange, K. (2016). MM Optimization Algorithms. SIAM.
  • Lange et al. (2000) Lange, K., D. R. Hunter, and I. Yang (2000). Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics 9, 1–20.
  • Lange and Sinsheimer (1993) Lange, K. and J. S. Sinsheimer (1993). Normal/independent distributions and their applications in robust regression. Journal of Computational and Graphical Statistics 2(2), 175–198.
  • Lange et al. (2021) Lange, K., J.-H. Won, A. Landeros, and H. Zhou (2021). Nonconvex Optimization via MM Algorithms: Convergence Theory, pp.  1–22. John Wiley & Sons, Ltd.
  • Liu et al. (2023) Liu, X., E. C. Chi, and K. Lange (2023). A sharper computational tool for $L_{2}E$ regression. Technometrics 65(1), 117–126.
  • Man et al. (2024) Man, R., X. Pan, K. M. Tan, and W.-X. Zhou (2024). A unified algorithm for penalized convolution smoothed quantile regression. Journal of Computational and Graphical Statistics 33(2), 625–637.
  • McLachlan and Krishnan (2007) McLachlan, G. J. and T. Krishnan (2007). The EM Algorithm and Extensions. John Wiley & Sons.
  • Nesterov (1983) Nesterov, Y. (1983). A method for solving the convex programming problem with convergence rate $O(\frac{1}{k^{2}})$. Soviet Mathematics Doklady 27, 372–376.
  • Nesterov (2005) Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming 103, 127–152.
  • Polson et al. (2015) Polson, N. G., J. G. Scott, and B. T. Willard (2015). Proximal Algorithms in Statistics and Machine Learning. Statistical Science 30(4), 559 – 581.
  • Powers et al. (2018) Powers, S., T. Hastie, and R. Tibshirani (2018). Nuclear penalized multinomial regression with an application to predicting at bat outcomes in baseball. Statistical Modelling 18(5-6), 388–410.
  • Recht et al. (2010) Recht, B., M. Fazel, and P. A. Parrilo (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3), 471–501.
  • Scott (2001) Scott, D. W. (2001). Parametric statistical modeling by minimum integrated square error. Technometrics 43(3), 274–285.
  • Tan et al. (2022) Tan, K. M., L. Wang, and W.-X. Zhou (2022). High-dimensional quantile regression: Convolution smoothing and concave regularization. Journal of the Royal Statistical Society Series B: Statistical Methodology 84(1), 205–233.
  • Tibshirani et al. (2011) Tibshirani, R. J., H. Hoefling, and R. Tibshirani (2011). Nearly-isotonic regression. Technometrics 53(1), 54–61.
  • Varadhan and Roland (2008) Varadhan, R. and C. Roland (2008). Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scandinavian Journal of Statistics 35(2), 335–353.
  • Whang (2006) Whang, Y.-J. (2006). Smoothed empirical likelihood methods for quantile regression models. Econometric Theory 22(2), 173–205.
  • Wu et al. (2001) Wu, W. B., M. Woodroofe, and G. Mentz (2001). Isotonic regression: another look at the changepoint problem. Biometrika 88(3), 793–804.
  • Xu et al. (2017) Xu, J., E. Chi, and K. Lange (2017). Generalized linear model regression under distance-to-set penalties. Advances in Neural Information Processing Systems 30, 1–11.
  • Xu and Lange (2022) Xu, J. and K. Lange (2022). A proximal distance algorithm for likelihood-based sparse covariance estimation. Biometrika 109(4), 1047–1066.
  • Zhang (2010) Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2), 894 – 942.
  • Zhou et al. (2011) Zhou, H., D. Alexander, and K. Lange (2011). A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing 21, 261–273.
  • Zhou and Lange (2010) Zhou, H. and K. L. Lange (2010). On the bumpy road to the dominant mode. Scandinavian Journal of Statistics 37(4), 612–631.

Appendix A Deweighting the Sharpest LAD Majorization

In LAD regression, deweighting the sharpest quadratic majorization (8) leads to the surrogate (7) derived from the Moreau majorization (4). To prove this assertion, recall that the Moreau surrogate (7) is

\[
g(\bm{\beta}\mid\bm{\beta}_m) \;=\; \frac{1}{2\mu}\sum_{i=1}^{n}(r_i - z_{mi})^2,
\]

where $r_i = y_i - \bm{x}_i^\top\bm{\beta}$ and $z_{mi} = \mathop{\rm prox}\nolimits_{\mu|\cdot|}(r_{mi})$. On the other hand, the best quadratic majorization (8) in LAD is

\[
g_1(\bm{\beta}\mid\bm{\beta}_m) \;=\; \frac{1}{2\mu}\sum_{i=1}^{n} w_{mi}\,r_i^2 + c_m,
\]

where $w_{mi}=1$ if $|r_{mi}|<\mu$ and $w_{mi}=\frac{\mu}{|r_{mi}|}$ if $|r_{mi}|\geq\mu$. In both cases the weights satisfy $0\leq w_{mi}\leq 1$. Deweighting $g_1(\bm{\beta}\mid\bm{\beta}_m)$ yields the new surrogate

\[
g_2(\bm{\beta}\mid\bm{\beta}_m) \;=\; \frac{1}{2\mu}\sum_{i=1}^{n}\big[w_{mi}y_i + (1-w_{mi})\bm{x}_i^\top\bm{\beta}_m - \bm{x}_i^\top\bm{\beta}\big]^2.
\]

When $|r_{mi}|\leq\mu$,

\[
w_{mi}y_i + (1-w_{mi})\bm{x}_i^\top\bm{\beta}_m \;=\; y_i \;=\; y_i - \mathop{\rm prox}\nolimits_{\mu|\cdot|}(r_{mi}),
\]

and when $|r_{mi}|>\mu$,

\begin{align*}
w_{mi}y_i + (1-w_{mi})\bm{x}_i^\top\bm{\beta}_m &= y_i - (1-w_{mi})(y_i - \bm{x}_i^\top\bm{\beta}_m) \\
&= y_i - (1-w_{mi})r_{mi} \\
&= y_i - \big[r_{mi} - \mu\,\mathrm{sign}(r_{mi})\big] \\
&= y_i - \mathop{\rm prox}\nolimits_{\mu|\cdot|}(r_{mi}).
\end{align*}

In both cases the deweighted pseudo-response coincides with the shifted response $y_i - z_{mi}$ of the Moreau surrogate, which proves the assertion.
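The identity is easy to confirm numerically. The following Julia check, a throwaway verification rather than part of our software, compares the deweighted pseudo-response with $y_i-\mathop{\rm prox}\nolimits_{\mu|\cdot|}(r_{mi})$ for a few residuals, using the soft-threshold form of the proximal map of $\mu|\cdot|$.

```julia
# Numerical check of the identity proved above: the deweighted pseudo-response
# w*y + (1 - w)*x'β equals the Moreau shifted response y - prox_{μ|·|}(r).
softthresh(r, μ) = sign(r) * max(abs(r) - μ, 0)     # prox of μ|·|
ladweight(r, μ)  = abs(r) < μ ? 1.0 : μ / abs(r)    # sharpest LAD majorization weight

μ = 0.5
for (y, xb) in [(2.0, 0.3), (-1.0, 0.2), (0.1, 0.0), (0.4, 0.35)]
    r = y - xb                          # residual at the current iterate
    w = ladweight(r, μ)
    lhs = w * y + (1 - w) * xb          # deweighted pseudo-response
    rhs = y - softthresh(r, μ)          # Moreau shifted response
    @assert isapprox(lhs, rhs; atol = 1e-12)
end
```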

Appendix B Robust Isotonic Regression

The $\text{L}_2\text{E}$ version of robust isotonic regression is driven by the penalized loss

\[
f(\bm{\beta},\tau) \;=\; \frac{\tau}{2\sqrt{\pi}} - \frac{\tau}{n}\sqrt{\frac{2}{\pi}}\sum_{i=1}^{n} e^{-\frac{\tau^2}{2}(y_i-\beta_i)^2} + \frac{\lambda}{2}\,\mathrm{dist}^2(\bm{D}\bm{\beta},\mathbb{R}^{p-1}_{+}),
\]

where $\bm{D}$ is a matrix that generates the differences of adjacent components of $\bm{\beta}$, and $\mathbb{R}^{p-1}_{+}$ is the $(p-1)$-dimensional nonnegative orthant. Our double majorization of the $\text{L}_2\text{E}$ loss and distance majorization of the set penalty together produce the MM surrogate

\begin{align*}
g(\bm{\beta}\mid\bm{\beta}_m) &= \frac{\tau^3}{2n}\sqrt{\frac{2}{\pi}}\sum_{i=1}^{n}\big[w_{mi}y_i + (1-w_{mi})\beta_{mi} - \beta_i\big]^2 \\
&\quad + \frac{\lambda}{2}\big\|\bm{D}\bm{\beta} - P_{\mathbb{R}^{p-1}_{+}}(\bm{D}\bm{\beta}_m)\big\|_2^2 + c_m.
\end{align*}

The stationary condition for updating $\bm{\beta}$ is

\[
\Big(\frac{\tau^3}{n}\sqrt{\frac{2}{\pi}}\,\bm{I}_n + \lambda\bm{D}^\top\bm{D}\Big)\bm{\beta} \;=\; \frac{\tau^3}{n}\sqrt{\frac{2}{\pi}}\,\tilde{\bm{y}} + \lambda\bm{D}^\top P_{\mathbb{R}^{p-1}_{+}}(\bm{D}\bm{\beta}_m),
\]

where $\tilde{\bm{y}}$ is the vector of shifted responses with components $\tilde{y}_i = w_{mi}y_i + (1-w_{mi})\beta_{mi}$. Block descent alternates the updates of $\tau$ and $\bm{\beta}$. The precision parameter $\tau$ can be updated by gradient descent (Heng et al., 2023) or an approximate Newton's method (Liu et al., 2023).
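A minimal Julia sketch of the $\bm{\beta}$ update at fixed $\tau$ follows. It assumes that the case weights generated by the double majorization take the form $w_{mi}=e^{-\tau^2(y_i-\beta_{mi})^2/2}$ and that $\bm{\beta}$ carries one component per observation, so the orthant is $(n-1)$-dimensional in the sketch. The function name and stopping rule are ours; the released software may differ in its details.

```julia
using LinearAlgebra

# MM iterations for robust isotonic regression at fixed precision τ.
# The coefficient matrix c*I + λ*D'D never changes, so its Cholesky
# factorization is computed once and recycled across all iterations.
function robust_isotonic(y, τ, λ; maxiter = 5000, tol = 1e-8)
    n = length(y)
    D = diagm(n - 1, n, 0 => -ones(n - 1), 1 => ones(n - 1))  # adjacent differences
    c = (τ^3 / n) * sqrt(2 / π)
    F = cholesky(Symmetric(c * I(n) + λ * (D' * D)))          # factor once
    β = copy(y)
    for iter in 1:maxiter
        w = exp.(-(τ^2 / 2) .* (y .- β) .^ 2)                 # assumed L2E case weights
        yshift = w .* y .+ (1 .- w) .* β                      # deweighted responses
        proj = max.(D * β, 0)                                 # project Dβ onto the orthant
        βnew = F \ (c .* yshift .+ λ .* (D' * proj))
        converged = norm(βnew - β) ≤ tol * (norm(β) + 1)
        β = βnew
        converged && break
    end
    return β
end
```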

Appendix C Low-Rank Multinomial Regression

In multinomial regression, imposing a rank penalty on the regression coefficient matrix $\bm{B}$ encourages information sharing and improves prediction. Following Powers et al. (2018), we start with a nuclear norm penalty $n\lambda\|\bm{B}\|_*$. Because the nuclear norm is hard to majorize, we pass to its Moreau envelope and minimize the penalized loss

\[
f(\bm{B}) \;=\; -\frac{1}{n}\mathcal{L}(\bm{B}) + \lambda M_{\mu\|\cdot\|_*}(\bm{B})
\]

involving the loglikelihood $\mathcal{L}(\bm{B})$ defined by equation (12). Invoking the quadratic upper bound majorization for $-\frac{1}{n}\mathcal{L}(\bm{B})$ and the Moreau envelope majorization for $M_{\mu\|\cdot\|_*}(\bm{B})$ yields the MM surrogate

\begin{align*}
g(\bm{B}\mid\bm{B}_m) &= \frac{1}{n}\mathop{\rm vec}\nolimits[\bm{X}^\top(\bm{W}_m-\bm{Y})]^\top\mathop{\rm vec}\nolimits(\bm{B}-\bm{B}_m) \\
&\quad + \frac{1}{2n}\mathop{\rm vec}\nolimits(\bm{B}-\bm{B}_m)^\top\big[\bm{E}\otimes\bm{X}^\top\bm{X}\big]\mathop{\rm vec}\nolimits(\bm{B}-\bm{B}_m) \\
&\quad + \frac{\lambda}{2\mu}\big\|\bm{B}-\mathop{\rm prox}\nolimits_{\mu\|\cdot\|_*}(\bm{B}_m)\big\|_F^2 + c_m,
\end{align*}

where $\bm{W}\in\mathbb{R}^{n\times(c-1)}$ and $\bm{Y}\in\mathbb{R}^{n\times(c-1)}$ are described in the last paragraph of Section 2.6. The stationary condition and the Sylvester equation method for solving it are found there as well.
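Although the precise definition of $\bm{E}$ appears in Section 2.6, the reduction can be sketched directly from the surrogate above. Setting the gradient of $g(\bm{B}\mid\bm{B}_m)$ to zero and using $(\bm{E}\otimes\bm{X}^\top\bm{X})\,\mathrm{vec}(\bm{\Delta})=\mathrm{vec}(\bm{X}^\top\bm{X}\,\bm{\Delta}\,\bm{E})$ for symmetric $\bm{E}$ gives $\frac{1}{n}\bm{X}^\top\bm{X}\,\bm{B}\,\bm{E}+\frac{\lambda}{\mu}\bm{B}=\bm{R}_m$, where $\bm{R}_m=\frac{1}{n}\bm{X}^\top\bm{X}\,\bm{B}_m\bm{E}-\frac{1}{n}\bm{X}^\top(\bm{W}_m-\bm{Y})+\frac{\lambda}{\mu}\mathop{\rm prox}\nolimits_{\mu\|\cdot\|_*}(\bm{B}_m)$. The Julia sketch below solves this equation with the stock sylvester routine of LinearAlgebra, assuming $\bm{E}$ is symmetric positive definite and that $\bm{B}$, $\bm{W}_m$, and $\bm{Y}$ share the $(c-1)$-column parametrization; treat it purely as an illustration, since the released software may organize the computation differently.

```julia
using LinearAlgebra

# Singular value soft-thresholding: the proximal map of μ‖·‖_*.
function prox_nuclear(B, μ)
    F = svd(B)
    return F.U * Diagonal(max.(F.S .- μ, 0)) * F.Vt
end

# One MM update: solve (X'X/n) B E + (λ/μ) B = R via a Sylvester equation.
# sylvester(P, Q, C) returns the Z satisfying P*Z + Z*Q + C = 0, so we
# right-multiply the stationary condition by E⁻¹ before calling it.
function mm_update(Bm, X, Wm, Y, E, λ, μ)
    n = size(X, 1)
    A = Matrix(Symmetric(X' * X)) / n
    R = A * Bm * E - X' * (Wm - Y) / n + (λ / μ) * prox_nuclear(Bm, μ)
    Einv = inv(Matrix(Symmetric(E)))
    return sylvester(A, (λ / μ) * Einv, -R * Einv)
end
```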

Appendix D Proofs

D.1 Proof of Proposition 2.3

Proof.

Our attack exploits the inner product $\langle\bm{\alpha},\bm{\beta}\rangle_{\bm{H}^{-1}}=\bm{\alpha}^\top\bm{H}^{-1}\bm{\beta}$ and corresponding norm $\|\bm{\beta}\|_{\bm{H}^{-1}}=\sqrt{\bm{\beta}^\top\bm{H}^{-1}\bm{\beta}}$ associated with the positive definite second differential $\bm{H}=d^2 g(\bm{\beta}\mid\bm{\beta})$. Taking $\bm{\alpha}=\bm{\beta}-\bm{H}^{-1}\nabla f(\bm{\beta})$ in the majorization

\[
f(\bm{\alpha}) \;\leq\; f(\bm{\beta}) + \nabla f(\bm{\beta})^\top(\bm{\alpha}-\bm{\beta}) + \frac{1}{2}(\bm{\alpha}-\bm{\beta})^\top\bm{H}(\bm{\alpha}-\bm{\beta}) \qquad (17)
\]

leads to the conclusion

\[
\inf_{\bm{\alpha}} f(\bm{\alpha}) \;\leq\; f(\bm{\beta}) - \frac{1}{2}\nabla f(\bm{\beta})^\top\bm{H}^{-1}\nabla f(\bm{\beta}). \qquad (18)
\]

Now consider the function $g_{\bm{\beta}}(\bm{\alpha})=f(\bm{\alpha})-\nabla f(\bm{\beta})^\top\bm{\alpha}$. It is convex, achieves its minimum at the stationary point $\bm{\alpha}=\bm{\beta}$, and satisfies the analogues of the inequalities (17) and (18). Therefore,

\begin{align*}
f(\bm{\alpha}) - f(\bm{\beta}) - \nabla f(\bm{\beta})^\top(\bm{\alpha}-\bm{\beta}) &= g_{\bm{\beta}}(\bm{\alpha}) - g_{\bm{\beta}}(\bm{\beta}) \\
&\geq \frac{1}{2}\nabla g_{\bm{\beta}}(\bm{\alpha})^\top\bm{H}^{-1}\nabla g_{\bm{\beta}}(\bm{\alpha}) \\
&= \frac{1}{2}[\nabla f(\bm{\beta})-\nabla f(\bm{\alpha})]^\top\bm{H}^{-1}[\nabla f(\bm{\beta})-\nabla f(\bm{\alpha})].
\end{align*}

By symmetry,

\[
f(\bm{\beta}) - f(\bm{\alpha}) - \nabla f(\bm{\alpha})^\top(\bm{\beta}-\bm{\alpha}) \;\geq\; \frac{1}{2}[\nabla f(\bm{\beta})-\nabla f(\bm{\alpha})]^\top\bm{H}^{-1}[\nabla f(\bm{\beta})-\nabla f(\bm{\alpha})],
\]

and adding the last two inequalities gives

\[
[\nabla f(\bm{\beta})-\nabla f(\bm{\alpha})]^\top\bm{H}^{-1}[\nabla f(\bm{\beta})-\nabla f(\bm{\alpha})] \;\leq\; [\nabla f(\bm{\beta})-\nabla f(\bm{\alpha})]^\top(\bm{\beta}-\bm{\alpha}).
\]

We are now in a position to prove that the operator $S(\bm{\beta})=\bm{\beta}-2\bm{H}^{-1}\nabla f(\bm{\beta})$ is non-expansive in the norm $\|\cdot\|_{\bm{H}}$. Indeed,

\begin{align*}
\|S(\bm{\beta})-S(\bm{\alpha})\|_{\bm{H}}^2 &= \|\bm{\beta}-\bm{\alpha}\|_{\bm{H}}^2 - 4(\bm{\beta}-\bm{\alpha})^\top[\nabla f(\bm{\beta})-\nabla f(\bm{\alpha})] \\
&\quad + 4[\nabla f(\bm{\beta})-\nabla f(\bm{\alpha})]^\top\bm{H}^{-1}[\nabla f(\bm{\beta})-\nabla f(\bm{\alpha})] \\
&\leq \|\bm{\beta}-\bm{\alpha}\|_{\bm{H}}^2.
\end{align*}

The related operator

\[
T(\bm{\beta}) \;=\; \frac{1}{2}\bm{\beta} + \frac{1}{2}S(\bm{\beta}) \;=\; \bm{\beta} - \bm{H}^{-1}\nabla f(\bm{\beta})
\]

is $\frac{1}{2}$-averaged with fixed points equal to the stationary points of $f(\bm{\beta})$. To complete the proof, we note that the sequence $\bm{\beta}_{m+1}=T(\bm{\beta}_m)$ defined by an averaged operator is known to converge to a fixed point. For example, see Proposition 7.5.2 of Lange (2016) or Corollary 22.20 of Bauschke and Moursi (2023). ∎

D.2 Proof of Proposition 2.4

Proof.

Existence and uniqueness of $\bm{\alpha}$ follow from the strong convexity of $f(\bm{\beta})$. Because $\nabla g(\bm{\alpha}\mid\bm{\alpha})=\nabla f(\bm{\alpha})=\bm{0}$, the $L$-smoothness of $g(\bm{\beta}\mid\bm{\alpha})$ gives the quadratic upper bound

\begin{align*}
f(\bm{\beta}) - f(\bm{\alpha}) &\leq g(\bm{\beta}\mid\bm{\alpha}) - g(\bm{\alpha}\mid\bm{\alpha}) \\
&\leq \nabla g(\bm{\alpha}\mid\bm{\alpha})^\top(\bm{\beta}-\bm{\alpha}) + \frac{L}{2}\|\bm{\beta}-\bm{\alpha}\|_2^2 \\
&= \frac{L}{2}\|\bm{\beta}-\bm{\alpha}\|_2^2,
\end{align*}

which incidentally implies $\mu\leq L$. In view of the strong convexity assumption, we have the lower bound

\begin{align*}
\|\nabla f(\bm{\beta})\|_2\,\|\bm{\alpha}-\bm{\beta}\|_2 &\geq -\nabla f(\bm{\beta})^\top(\bm{\alpha}-\bm{\beta}) \\
&\geq f(\bm{\alpha}) - f(\bm{\beta}) - \nabla f(\bm{\beta})^\top(\bm{\alpha}-\bm{\beta}) \\
&\geq \frac{\mu}{2}\|\bm{\alpha}-\bm{\beta}\|_2^2.
\end{align*}

Therefore, $\|\nabla f(\bm{\beta})\|_2\geq\frac{\mu}{2}\|\bm{\alpha}-\bm{\beta}\|_2$. Combining this bound with the earlier inequality $f(\bm{\beta})-f(\bm{\alpha})\leq\frac{L}{2}\|\bm{\beta}-\bm{\alpha}\|_2^2$ yields the Polyak-Łojasiewicz (PL) bound

\[
\|\nabla f(\bm{\beta})\|_2^2 \;\geq\; \frac{\mu^2}{2L}\big[f(\bm{\beta})-f(\bm{\alpha})\big].
\]

We now turn to the MM iterates and take $\bm{\beta}=\bm{\beta}_m-\frac{1}{L}\nabla f(\bm{\beta}_m)$. The PL inequality implies

\begin{align*}
f(\bm{\beta}_{m+1}) - f(\bm{\beta}_m) &\leq g(\bm{\beta}_{m+1}\mid\bm{\beta}_m) - g(\bm{\beta}_m\mid\bm{\beta}_m) \;\leq\; g(\bm{\beta}\mid\bm{\beta}_m) - g(\bm{\beta}_m\mid\bm{\beta}_m) \\
&\leq -\frac{1}{L}\nabla g(\bm{\beta}_m\mid\bm{\beta}_m)^\top\nabla g(\bm{\beta}_m\mid\bm{\beta}_m) + \frac{L}{2}\Big\|\frac{1}{L}\nabla g(\bm{\beta}_m\mid\bm{\beta}_m)\Big\|_2^2 \\
&= -\frac{1}{2L}\|\nabla f(\bm{\beta}_m)\|_2^2 \\
&\leq -\frac{\mu^2}{4L^2}\big[f(\bm{\beta}_m)-f(\bm{\alpha})\big].
\end{align*}

Subtracting $f(\bm{\alpha})$ from both sides of the previous inequality and rearranging gives

\[
f(\bm{\beta}_{m+1}) - f(\bm{\alpha}) \;\leq\; \Big(1-\frac{\mu^2}{4L^2}\Big)\big[f(\bm{\beta}_m)-f(\bm{\alpha})\big].
\]

Iteration of this inequality yields the claimed linear convergence. ∎

Appendix E Sparse Quantile Regression under Gaussian Noise

Table 3: Simulation results for sparse quantile regression under $\mathcal{N}(0,2)$ noise.
Method TPR FPR EE PE Time (s)
$(n=500,\ p=250)$, $q=0.5$
SQR-Lasso 1.00 (0.00) 0.09 (0.04) 0.45 (0.10) 8.02 (1.46) 1.56
SQR-SCAD 0.99 (0.04) 0.00 (0.00) 0.30 (0.30) 5.02 (4.21) 1.33
SQR-MCP 1.00 (0.00) 0.00 (0.00) 0.22 (0.06) 4.01 (0.93) 1.30
SQR-$\ell_0$ 1.00 (0.00) 0.00 (0.00) 0.26 (0.08) 4.79 (1.15) 0.48
SQR-PD 1.00 (0.00) 0.00 (0.00) 0.23 (0.08) 4.34 (1.28) 6.52
$(n=500,\ p=250)$, $q=0.7$
SQR-Lasso 1.00 (0.00) 0.09 (0.04) 0.54 (0.10) 9.35 (1.56) 1.51
SQR-SCAD 1.00 (0.01) 0.00 (0.00) 0.26 (0.15) 4.55 (2.14) 1.09
SQR-MCP 1.00 (0.00) 0.00 (0.00) 0.22 (0.05) 3.97 (0.85) 1.11
SQR-$\ell_0$ 1.00 (0.00) 0.00 (0.00) 0.29 (0.08) 5.33 (1.40) 0.55
SQR-PD 1.00 (0.00) 0.00 (0.00) 0.26 (0.08) 4.76 (1.41) 7.15
$(n=250,\ p=500)$, $q=0.5$
SQR-Lasso 1.00 (0.00) 0.06 (0.03) 0.83 (0.16) 9.69 (1.66) 2.34
SQR-SCAD 0.94 (0.09) 0.00 (0.00) 0.69 (0.54) 7.47 (5.18) 1.99
SQR-MCP 0.95 (0.09) 0.00 (0.00) 0.67 (0.53) 7.42 (4.93) 1.99
SQR-$\ell_0$ 1.00 (0.00) 0.00 (0.00) 0.48 (0.19) 5.95 (2.13) 0.46
SQR-PD 1.00 (0.00) 0.00 (0.00) 0.50 (0.29) 6.47 (2.59) 13.99
$(n=250,\ p=500)$, $q=0.7$
SQR-Lasso 1.00 (0.00) 0.06 (0.03) 0.87 (0.18) 10.33 (1.73) 2.50
SQR-SCAD 0.93 (0.10) 0.00 (0.00) 0.79 (0.61) 8.45 (5.59) 2.04
SQR-MCP 0.92 (0.09) 0.00 (0.00) 0.90 (0.58) 9.36 (5.37) 2.09
SQR-$\ell_0$ 0.99 (0.00) 0.00 (0.00) 0.57 (0.24) 7.53 (2.54) 0.50
SQR-PD 0.99 (0.05) 0.01 (0.01) 0.64 (0.44) 7.71 (3.61) 15.55