Parallel ADMM Algorithm with Gaussian Back Substitution for High-Dimensional Quantile Regression and Classification

Xiaofei Wu, Dingzi Guo ,
Rongmei Liang ,
Zhimin Zhang College of Mathematics and Statistics, Chongqing University, Chongqing, 401331, P.R. China. Email: xfwu1016@163.comGuo and Wu have contributed equally to this paper. Institute of Financial Studies, Shandong University, Shandong, 250100, P.R. China. Email: guodingzi@mail.sdu.edu.cn.Department of Statistics and Data Science, Southern University of Science and Technology, Guangdong, 518055, P.R. China. Email: liang_r_m@163.comCorresponding author. College of Mathematics and Statistics, Chongqing University, Chongqing, 401331, P.R. China. Email: zmzhang@cqu.edu.cn

Abstract

In the field of high-dimensional data analysis, modeling methods based on quantile loss function are highly regarded due to their ability to provide a comprehensive statistical perspective and effective handling of heterogeneous data. In recent years, many studies have focused on using the parallel alternating direction method of multipliers (P-ADMM) to solve high-dimensional quantile regression and classification problems. One efficient strategy is to reformulate the quantile loss function by introducing slack variables. However, this reformulation introduces a theoretical challenge: even when the regularization term is convex, the convergence of the algorithm cannot be guaranteed. To address this challenge, this paper proposes the Gaussian Back-Substitution strategy, which requires only a simple and effective correction step that can be easily integrated into existing parallel algorithm frameworks, achieving a linear convergence rate. Furthermore, this paper extends the parallel algorithm to handle some novel quantile loss classification models. Numerical simulations demonstrate that the proposed modified P-ADMM algorithm exhibits excellent performance in terms of reliability and efficiency.

Keywords: Three-block ADMM; Gaussian back substitution; Massive data; Parallel algorithm

1 Introduction

Quantile regression, pioneered by Koenker and Basset (1978), explores how a response variable depends on a set of predictors by modeling the conditional quantile as a function of these predictors. Unlike mean regression, which focuses solely on estimating the conditional mean of the response, quantile regression offers a more precise representation of the relationship between the response and the predictors. Furthermore, quantile regression exhibits superior robustness when handling datasets with heterogeneous characteristics and can effectively process data with heavy-tailed distributions, owing to its less restrictive assumptions regarding error distribution. Quantile loss is also utilized in support vector machines (SVM) for classification purposes (see Section 9.3 in Christmann and Steinwart (2008) and Proposition 1 in Wu et al. (2025b)). Compared to traditional SVM (support vector machine) in Vapnik (1995), quantile loss SVM has been shown to be less sensitive to noise around the separating hyperplane, making it more robust to resampling. For a detailed discussion, see Huang et al. (2014).

Consider a regression or classification problem with $n$ observations of the form

\{{y_{i}},\bm{x}_{i}\}_{i=1}^{n}=\{{y_{i}},{x_{i,1}},{x_{i,2}},\ldots,{x_{i,p}}\}_{i=1}^{n}=\{\bm{y},\bm{X}\}=\bm{D},

(1.1)

where the data is assumed to be a random sample from an unknown joint distribution with a probability density function. The random variable $y$ represents the “response” or “outcome”, while $\bm{x}=\{x_{1},x_{2},\dots,x_{p}\}$ denotes the predictor variables (features). These features may include the original observations and/or functions derived from them. If considering the intercept term, the first column of $\bm{X}$ is set to all 1. Without loss of generality, ${y}_{i}$ is quantitative for regression models, but equals to -1 or 1 for classification models. For a given quantile $\tau\in(0,1)$ , one can obtain an estimate of quantile regression by optimizing the following objective function,

\hat{\bm{\beta}}(\tau)=\arg\min_{\bm{\beta}}\sum_{i=1}^{n}\rho_{\tau}\left(y_{i}-\bm{x}_{i}^{\top}\bm{\beta}\right),

where $\rho_{\tau}(u)=u[\tau-I(u<0)]$ (for $u\in\mathbb{R}$ ) is the check loss, and $I(\cdot)$ is the indicator function. Note that when $\tau=1$ , the quantile loss degenerates into hinge loss in SVM for classification. Many studies have shown that quantile loss can also be used for classification when $\tau\in(0,1)$ , such as Christmann and Steinwart (2008), Huang et al. (2014), Liang et al. (2024) and Wu et al. (2025b). The estimator for quantile loss SVM can be obtained by solving the following optimization problem,

\displaystyle\arg\min_{\bm{\beta}}\left\{\sum_{i=1}^{n}\rho_{\tau}\left(1-y_{i}\bm{x}_{i}^{\top}\bm{\beta}\right)+\|\bm{\lambda}\odot\bm{\beta}\|_{2}^{2}\right\},

where $\bm{\lambda}$ is a penalization parameter vector, where the first element is always set to 0, and the remaining elements are constrained to be non-negative. Here, the symbol $\odot$ is commonly used to denote the Hadamard product, also known as the element-wise product or the entrywise product. Clearly, when $\tau=1$ , $\rho_{\tau}$ degenerates into the hinge loss commonly used in SVM, and the above expression simplifies to the form of the ordinary SVM, specifically, hinge loss plus ridge penalty term.

To accommodate high-dimensional scenarios where $p>n$ , it is common practice to substitute the ridge term $\|\bm{\lambda}\odot\bm{\beta}\|_{2}^{2}$ (Hoerl and Kennard (1970)) with sparse regularization techniques, including LASSO in Tibshirani (1996), elastic net in Zou (2006), adaptive LASSO in Zou (2006), SCAD in Fan and Li (2001), and MCP in Zhang (2010). Consequently, the sparse penalized quantile regression and classification formulation becomes

\displaystyle\hat{\bm{\beta}}(\tau)=\arg\min_{\bm{\beta}}\left\{\sum_{i=1}^{n}\rho_{\tau}\left(\tilde{y_{i}}-\tilde{\bm{x}}_{i}^{\top}\bm{\beta}\right)+P_{\bm{\lambda}}(|\bm{\beta}|)\right\}.

(1.2)

Here, $P_{\bm{\lambda}}(|\bm{\beta}|)$ denotes a sparse regularization term that is separable, meaning:

P_{\bm{\lambda}}(|\bm{\beta}|)=\sum_{j=1}^{p}P_{\lambda_{j}}(|\beta_{j}|).

This formulation allows for a distinct penalization of each component $\beta_{j}$ using its respective regularization parameter $\lambda_{j}$ . For regression, $\tilde{y}_{i}$ is $y_{i}$ , $\tilde{\bm{x}}_{i}$ is ${\bm{x}}_{i}$ ; for classification, $\tilde{y}_{i}$ is 1; $\tilde{\bm{x}}_{i}$ is $y_{i}{\bm{x}}_{i}$ . Optimization formula (1.2) is a highly flexible expression hat can represent numerous quantile regression and SVM classification models, including penalized quantile regression (Belloni and Chernozhukov (2011), Wang et al. (2012), Fan et al. (2012) and Gu et al. (2018)) and SVM with sparse regularizationin (Zhu et al. (2003), Wang et al. (2006), Zhang et al. (2016) and Liang et al. (2024))

The advancement of modern science and technology has made data collection increasingly effortless, leading to an explosion of variables and vast amounts of data. Due to the sheer volume of data and other factors such as privacy concerns, it has become essential to store it in a distributed manner. Consequently, designing parallel algorithms that can effectively manage these large and distributed datasets is crucial. Several parallel algorithms have been proposed to address problem (1.2), including QR-ADMM (ADMM for quantile regression) in Yu and Lin (2017), QPADM (quantile regression with parallel ADMM) in Yu et al. (2017) and Wu et al. (2025a). More recently, inspired by Guan et al. (2018), Fan et al. (2021) introduced a slack variable representation of the quantile regression problem, demonstrating that this new formulation is significantly faster than QPADM, especially when the data volume $n$ or the dimensionality $p$ is large. In addition, the slack variable representation is also used by Guan et al. (2020) to design parallel algorithms for solving SVMs with sparse regularization.

However, these slack variable representations raise a new issue, which is that convergence cannot be proven mathematically. The main reason for this is that the slack variable representation introduces two slack variables, which transform the parallel ADMM algorithms (both distributed and non-distributed) into a three-block ADMM algorithm, see Section 3.1 for detailed information. Chen et al. (2016) demonstrated that directly extending the ADMM algorithm for convex optimization with three or more separable blocks may not guarantee convergence, and they even provided an example of divergence. Therefore, although the parallel ADMM algorithms proposed by Guan et al. (2020) and Fan et al. (2021) did not exhibit non-convergence of iterative solutions in numerical experiments, there is no theoretical guarantee of convergence for the iterative solutions, even when the optimization objective is convex.

In this paper, we apply the Gaussian back substitution technique to refine the iterative steps, which allows the parallel ADMM algorithm proposed by Guan et al. (2020) and Fan et al. (2021) to achieve a linear convergence rate. This Gaussian back substitution technique is straightforward and easy to implement, requiring only a linear operation on a portion of the iterative sequences generated by their algorithm. Besides demonstrating that the algorithm in Guan et al. (2020) and Fan et al. (2021) can theoretically guarantee convergence with a simple adjustment, the main contributions of this paper are as follows:

1.

We suggest changing the order of variable iteration in Fan et al. (2021) such that our Gaussian back substitution technique involves only simple vector additions and subtractions, thereby eliminating the need for matrix-vector multiplication. Although this change may seem minor, it can significantly enhance computational efficiency in algorithms where both $n$ and $p$ are relatively large. More importantly, this change will not impact the linear convergence of the algorithm.
2.

This paper proposes some new classification models with nonconvex regularization terms based on quantile loss. Leveraging the equivalence of quantile loss in classification and regression tasks, it indicates that existing parallel ADMM algorithms for solving penalized quantile regression, as well as those proposed in this paper, can also be applied to solve these new classification models.

The remainder of this paper is organized as follows. Section 2 provides a review of relevant literature along with an introduction to preliminary knowledge. Section 3 outlines the existing QPADM-slack algorithm, incorporating adjustments to the Gaussian back-substitution process to achieve linear convergence. Section 4 proposes modifications to the variable update sequence in QPADM-slack, simplifying the correction steps during Gaussian back-substitution. Numerical experiments in Section 5 demonstrate that Gaussian back-substitution not only theoretically guarantees the linear convergence of QPADM-slack but also significantly improves computational efficiency and accuracy. Section 6 summarizes the key findings of the study and identifies avenues for future research. The proofs of the theorems and supplementary experimental results are included in the online appendix.

Notations: $\bm{0}_{n}$ and $\bm{1}_{n}$ represent $n$ -dimensional vectors with all elements being 0 and 1, respectively. $\bm{F}$ is a $(p-1)\times p$ matrix with all elements being 0, except for 1 on the diagonal and -1 on the superdiagonal. $\bm{I}_{n}$ represents the $n$ -dimensional identity matrix. The Hadamard product is denoted by $\odot$ . The sign $(\cdot)$ function is defined component-wise such that sign $(t)=1$ if $t>0$ , sign $(t)=0$ if $t=0$ , and sign $(t)=-1$ if $t<0$ . $(\cdot)_{+}$ signifies the element-wise operation of extracting the positive component, while $|\cdot|$ denotes the element-wise absolute value function. For any vector $\bm{u}$ , $\|\bm{u}\|_{1}$ and $\|\bm{u}\|_{2}$ denote the $\ell_{1}$ norm and the $\ell_{2}$ norm of $\bm{u}$ , respectively. $\|\bm{u}\|_{\bm{H}}:=\sqrt{\bm{u}^{\top}\bm{H}\bm{u}}$ is used to denote the norm of $\bm{u}$ under the matrix $\bm{H}$ , where $\bm{H}$ is a matrix.

2 Preliminaries and Literature Review

A traditional method for solving penalized quantile regression and SVMs is to transform the corresponding optimization problem into a linear program (see Zhu et al. (2003), Koenker (2005), Wu and Liu (2009), Wang et al. (2012) and Zhang et al. (2016)), which can then be solved using many existing optimization methods. Koenker and Ng (2005) proposed an interior-point method for quantile regression and penalized quantile regression. Hastie et al. (2004), Rosset and Zhu (2007) and Li and Zhu (2008) introduced an algorithm for computing the solution path of the LASSO-penalized quantile regression and SVM, building on the LARS approach (Efron et al. (2004)). However, these linear programming algorithms are known to be inefficient for high-dimensional quantile regressions and SVMs.

Although gradient descent algorithms (Beck and Teboulle (2009), Yang and Zou (2015)) and coordinate descent methods (Friedman et al. (2010), Yang and Zou (2013)) are efficient in solving penalized smooth regression and classification problems (including some smooth SVMs), they cannot be directly applied to penalized quantile regressions and SVMs due to the non-smooth nature of quantile loss. To extend the coordinate descent method to nonsmooth loss regressions, Peng and Wang (2015) integrated the majorization-minimization algorithm with the coordinate descent algorithm to develop an iterative coordinate descent algorithm (QICD) for solving sparse penalized quantile regression. Yi and Huang (2016) introduced a coordinate descent algorithm for penalized Huber regression and utilized it to approximate penalized quantile regression. However, these algorithms are not suitable for distributed storage and are not easy to implement in parallel.

An efficient algorithm for solving penalized quantile regressions and SVMs is the alternating direction method of multipliers (ADMM), which, owing to its split structure, is well-suited for parallel computing environments. In the following subsection, we will review these ADMM algorithms and their parallel versions.

2.1 ADMM

The alternating direction method of multipliers (ADMM) is an iterative optimization method designed to solve complex convex minimization problems with linear constraints. It works by breaking down the original problem into smaller, more manageable subproblems that are easier to solve. ADMM alternates between optimizing these subproblems and updating dual variables to enforce the constraints, making it particularly useful for large-scale problems and various statistical learning applications (see Boyd et al. (2010) for more details). Since the quantile loss function $\rho_{\tau}(\cdot)$ is nonsmooth and nondifferentiable, it is necessary to introduce the linear constraint $\bm{r}=(r_{1},r_{2},\dots,r_{n})^{\top}=\tilde{\bm{y}}-\tilde{\bm{X}}\bm{\beta}$ to apply ADMM for solving problem (1.2). To better address the subproblem involving $\bm{\beta}$ , one can introduce the equality constraint $\bm{z}=\bm{\beta}$ . Consequently, the constrained optimization problem can be formulated as follows,

	$\displaystyle\min_{\bm{\beta},\bm{r},\bm{z}}$	$\displaystyle\quad\rho_{\tau}(\bm{r})+{P}_{\lambda}(\|\bm{\beta}\|),$
	s.t.	$\displaystyle\tilde{\bm{y}}-\tilde{\bm{X}}\bm{z}=\bm{r},\ \bm{z}=\bm{\beta},$		(2.1)

where $\rho_{\tau}(\bm{r})=\sum_{i=1}^{n}\rho(r_{i})$ . The augmented Lagrangian form of (2.1) is

	$\displaystyle L_{\mu}(\bm{\beta},\bm{r},\bm{z},\bm{d}_{1},\bm{e}_{1})$	$\displaystyle=\rho_{\tau}(\bm{r})+{P}_{\lambda}(\|\bm{\beta}\|)-\bm{d}_{1}^{\top}(\tilde{\bm{y}}-\tilde{\bm{X}}\bm{z}-\bm{r})+\frac{\mu}{2}\\|\tilde{\bm{y}}-\tilde{\bm{X}}\bm{z}-\bm{r}\\|_{2}^{2}$
		$\displaystyle-\bm{e}_{1}^{\top}(\bm{z}-\bm{\beta})+\frac{\mu}{2}\\|\bm{z}-\bm{\beta}\\|_{2}^{2},$		(2.2)

where $\bm{d}_{1}$ and $\bm{e}_{1}$ are dual variables corresponding to the linear constraints, and $\mu>0$ is a given augmented parameter. Given $(\bm{r}^{0},\bm{z}^{0},\bm{d}_{1}^{0},\bm{e}_{1}^{0})$ , the iterative scheme of ADMM for problem (2.1) is as follows,

\left\{\begin{array}[]{l}\bm{\beta}^{k+1}\ \leftarrow\mathop{\arg\min}\limits_{\bm{\beta}}\left\{L_{\mu}(\bm{\beta},\bm{r}^{k},\bm{z}^{k},\bm{d}_{1}^{k},\bm{e}_{1}^{k})\right\};\\ \bm{r}^{k+1}\ \leftarrow\mathop{\arg\min}\limits_{\bm{r}}\left\{L_{\mu}(\bm{\beta}^{k+1},\bm{r},\bm{z}^{k},\bm{d}_{1}^{k},\bm{e}_{1}^{k})\right\};\\ \bm{z}^{k+1}\ \leftarrow\mathop{\arg\min}\limits_{\bm{z}}\left\{L_{\mu}(\bm{\beta}^{k+1},\bm{r}^{k+1},\bm{z},\bm{d}_{1}^{k},\bm{e}_{1}^{k})\right\};\\ \bm{d}_{1}^{k+1}\ \leftarrow\bm{d}_{1}^{k}-\mu(\tilde{\bm{y}}-\tilde{\bm{X}}\bm{z}^{k+1}-\bm{r}^{k+1});\\ \bm{e}_{1}^{k+1}\ \leftarrow\bm{e}_{1}^{k}-\mu(\bm{z}^{k+1}-\bm{\beta}^{k+1}).\end{array}\right.

(2.3)

The entire iterative process of (2.3) is the non parallel QPADM algorithm proposed by Yu et al. (2017) and Wu et al. (2024). On the other hand, Gu et al. (2018) and Wu et al. (2025b) introduced the linearized ADMM algorithms for solving penalized quantile regression and SVM. These methods do not introduce an auxiliary variable $\bm{z}$ , and instead linearize the quadratic function in the $\bm{\beta}$ -subproblem to facilitate finding the solution for $\bm{\beta}$ . However, their did not extend the algorithms to handle distributed storage data, meaning there is no parallel version of their ADMM algorithms. Therefore, we will focus on discussing the parallel QPADM algorithm from Yu et al. (2017) and Wu et al. (2025a).

2.2 Parallel ADMM

When designing algorithms for distributed parallel processing, the setup generally involves a central machine and several local machines. Assume the data matrix $\tilde{\bm{X}}$ and the response vector $\tilde{\bm{y}}$ are distributed across $M$ local machines as follows,

\displaystyle\tilde{\bm{X}}=(\tilde{\bm{X}}_{1}^{\top},\tilde{\bm{X}}_{2}^{\top},\dots,\tilde{\bm{X}}_{M}^{\top})^{\top}\ \text{and}\ \ \tilde{\bm{y}}=(\tilde{\bm{y}}_{1}^{\top},\tilde{\bm{y}}_{2}^{\top},\dots,\tilde{\bm{y}}_{M}^{\top})^{\top},

(2.4)

where $\tilde{\bm{X}}_{m}\in\mathbb{R}^{n_{m}\times p}$ , $\tilde{\bm{y}}_{m}\in\mathbb{R}^{n_{m}}$ and $\sum_{m=1}^{M}n_{m}=n$ . To accommodate the structure of distributed storage data, Boyd et al. (2010) introduced $\bm{\beta}_{m}=\bm{\beta},m=1,2,\dots,M$ , to enable parallel processing of the data. Then, the constrained optimization problem can be formulated as

	$\displaystyle\min_{\bm{\beta},\bm{r}_{m},\bm{\beta}_{m}}$	$\displaystyle\quad\sum_{m=1}^{M}\rho_{\tau}(\bm{r}_{m})+{P}_{\lambda}(\|\bm{\beta}\|),$
	$\displaystyle\text{s.t.}\ \bm{\beta}_{m}=\bm{\beta},\ \tilde{\bm{y}}_{m}-$	$\displaystyle\tilde{\bm{X}}_{m}\bm{\beta}_{m}=\bm{r}_{m},\ m=1,2,\dots,M,$		(2.5)

where $\rho(\bm{r}_{m})=\sum_{i=1}^{n_{m}}\rho(r_{i})$ . The augmented Lagrangian form of (2.2) is

	$\displaystyle\small L_{\mu}(\bm{\beta},\bm{r}_{m},\bm{\beta}_{m},\bm{d}_{m},\bm{e}_{m})$	$\displaystyle=\ \sum_{m=1}^{M}\left[\rho(\bm{r}_{m})+\bm{d}_{m}^{\top}(\tilde{\bm{y}}_{m}-\tilde{\bm{X}}_{m}\bm{\beta}_{m}-\bm{r}_{m})+\frac{\mu}{2}\\|\tilde{\bm{y}}_{m}-\tilde{\bm{X}}_{m}\bm{\beta}_{m}-\bm{r}_{m}\\|_{2}^{2}\right.$
		$\displaystyle\left.+\ \bm{e}_{m}^{\top}(\bm{\beta}_{m}-\bm{\beta})+\frac{\mu}{2}\\|\bm{\beta}_{m}-\bm{\beta}\\|_{2}^{2}\right]+{P}_{\lambda}(\|\bm{\beta}\|),$		(2.6)

where $\bm{d}_{m}$ and $\bm{e}_{m}$ are dual variables corresponding to the linear constraints. Given $(\bm{r}_{m}^{0},\bm{\beta}_{m}^{0},\bm{d}_{m}^{0},\bm{e}_{m}^{0})$ , the iterative scheme of parallel ADMM for problem (2.2) is as follows,

\left\{\begin{array}[]{l}\bm{\beta}^{k+1}\ \leftarrow\mathop{\arg\min}\limits_{\bm{\beta}}\left\{L_{\mu}(\bm{\beta},\bm{r}_{m}^{k},\bm{\beta}_{m}^{k},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \bm{r}_{m}^{k+1}\ \leftarrow\mathop{\arg\min}\limits_{\bm{r}_{m}}\left\{L_{\mu}(\bm{\beta}^{k+1},\bm{r}_{m},\bm{\beta}_{m}^{k},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \bm{\beta}_{m}^{k+1}\ \leftarrow\mathop{\arg\min}\limits_{\bm{\beta}_{m}}\left\{L_{\mu}(\bm{\beta}^{k+1},\bm{r}^{k+1},\bm{\beta}_{m},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \bm{d}_{m}^{k+1}\ \leftarrow\bm{d}_{m}^{k}-\mu(-\tilde{\bm{y}}_{m}+\tilde{\bm{X}}_{m}\bm{\beta}_{m}^{k+1}+\bm{r}_{m}^{k+1});\\ \bm{e}_{m}^{k+1}\ \leftarrow\bm{e}_{m}^{k}-\mu(-\bm{\beta}_{m}^{k+1}+\bm{\beta}^{k+1}).\end{array}\right.

(2.7)

This distribution enables efficient parallelization of the algorithm. Each local machine independently solves its own subproblem (e.g., $\bm{r}_{m},\bm{\beta}_{m},\bm{u}_{m},\bm{v}_{m}$ ), while the central machine consolidates the results by updating $\bm{\beta}$ and coordinates updates among the local machines. This framework facilitates parallel processing and effective management of large datasets. In subsection 3.2 of Yu et al. (2017), it is indicated that for convex penalties $P_{\lambda}$ , the QPADM converges to the solution of (2.2). Here, we will briefly explain the proof approach. For ease of description, we will only discuss the case where $M=2$ ; the case where $M>2$ is similar.

When $M=2$ , the constraint form of (2.2) is written in matrix form as

\bm{A}_{1}\bm{\beta}+\bm{A}_{2}\bm{r}_{1}+\bm{A}_{3}\bm{r}_{2}+\bm{B}_{1}\bm{\beta}_{1}+\bm{B}_{2}\bm{\beta}_{2}=\bm{e},

where

\bm{A}_{1}=[-\bm{I}_{p},-\bm{I}_{p},\bm{0}^{\top},\bm{0}^{\top}]^{\top},\quad\bm{A}_{2}=[\bm{0}^{\top},\bm{0}^{\top},\bm{I}_{n_{1}},\bm{0}^{\top}]^{\top},\quad\bm{A}_{3}=[\bm{0}^{\top},\bm{0}^{\top},\bm{0}^{\top},\bm{I}_{n_{2}}]^{\top},

\bm{B}_{1}=[\bm{I}_{p},\bm{0}^{\top},\bm{X}_{1}^{\top},\bm{0}^{\top}]^{\top},\quad\bm{B}_{2}=[\bm{0}^{\top},\bm{I}_{p},\bm{0}^{\top},\bm{X}_{2}^{\top}]^{\top},\quad\bm{e}=[\bm{0}^{\top},\bm{0}^{\top},\bm{y}_{1}^{\top},\bm{y}_{2}^{\top}]^{\top}.

Note that $\bm{A}_{1}$ , $\bm{A}_{2}$ , and $\bm{A}_{3}$ are mutually orthogonal, as are $\bm{B}_{1}$ and $\bm{B}_{2}$ . Therefore, $\bm{\beta}$ , $\bm{r}_{1}$ , and $\bm{r}_{2}$ can be considered as a single variable, and $\bm{\beta}_{1}$ and $\bm{\beta}_{2}$ can also be considered as a single variable. Consequently, the QPADM degrades to a traditional two-block ADMM algorithm in the parallel case. A similar conclusion applies to non-parallel QPADM, provided that $\bm{\beta}$ and $\bm{r}$ are treated as a single block, and $\bm{z}$ is treated as a separate block. The convergence of the two-block traditional ADMM algorithm has been well-studied, thus QPADM is convergent as well. According to He and Yuan (2015), it exhibits linear convergence rate.

However, this does not apply to QPADM-slack in Guan et al. (2020) and Fan et al. (2021) because the introduction of slack variables prevents it from being formulated as a two-block ADMM. We will provide a detailed discussion on this in subsequent sections, which also serves as the motivation for this paper.

3 QPADM-slack and Gaussian Back Substitution

3.1 QPADM-slack

Simulation results from Yu et al. (2017) indicated that QPADM may require a large number of iterations to converge for penalized quantile regressions. This poses challenges for their practical application to big data, particularly in distributed environments where communication costs are high. To address this issue, Fan et al. (2021) proposed using two sets of nonnegative slack variables $\bm{\xi}=(\xi_{1},\xi_{2},\ldots,\xi_{n})^{\top}\in\mathbb{R}^{n}$ and $\bm{\eta}=(\eta_{1},\eta_{2},\ldots,\eta_{n})^{\top}\in\mathbb{R}^{n}$ to represent the quantile loss. Thus, (2.2) can be written as

		$\displaystyle\arg\min_{\bm{\beta},\bm{\xi}_{m},\bm{\eta}_{m},\bm{\beta}_{m}}\quad\sum_{m=1}^{M}\left[\tau\bm{1}_{n_{m}}^{\top}\bm{\xi}_{m}+(1-\tau)1_{n_{m}}^{\top}\bm{\eta}_{m}\right]+{P}_{\lambda}(\|\bm{\beta}\|),$
		$\displaystyle\text{s.t.}\ \bm{\beta}_{m}=\bm{\beta},\ \tilde{\bm{y}}_{m}-\tilde{\bm{X}}_{m}\bm{\beta}_{m}=\bm{\xi}_{m}-\bm{\eta}_{m},\bm{\xi}_{m}\geq 0,\bm{\eta}_{m}\geq 0,$		(3.1)

where $\bm{\xi}=(\bm{\xi}_{1}^{\top},\bm{\xi}_{2}^{\top}.\dots,\bm{\xi}_{M}^{\top})^{\top}$ and $\bm{\eta}=(\bm{\eta}_{1}^{\top},\bm{\eta}_{2}^{\top}.\dots,\bm{\eta}_{M}^{\top})^{\top}$ . When $\tau=1$ , the above equation is used by Guan et al. (2020) to complete the classification task.

The augmented Lagrangian form of (3.1) is

	$\displaystyle\small L_{\mu}(\bm{\beta},$	$\displaystyle\bm{\xi}_{m},\bm{\eta}_{m},\bm{\beta}_{m},\bm{d}_{m},\bm{e}_{m})=\sum_{m=1}^{M}\left[\tau\bm{1}_{n_{m}}^{\top}\bm{\xi}_{m}+(1-\tau)\bm{1}_{n_{m}}^{\top}\bm{\eta}_{m}+\bm{d}_{m}^{\top}(\bm{\beta}_{m}-\bm{\beta})+\frac{\mu}{2}\\|\bm{\beta}_{m}-\bm{\beta}\\|_{2}^{2}\right.$
		$\displaystyle\left.+\bm{e}_{m}^{\top}(\tilde{\bm{y}}_{m}-\tilde{\bm{X}}_{m}\bm{\beta}_{m}-\bm{\xi}_{m}+\bm{\eta}_{m})+\frac{\mu}{2}\\|\tilde{\bm{y}}_{m}-\tilde{\bm{X}}_{m}\bm{\beta}_{m}-\bm{\xi}_{m}+\bm{\eta}_{m}\\|_{2}^{2}\right]+{P}_{\lambda}(\|\bm{\beta}\|).$		(3.2)

Given $(\bm{\xi}_{m}^{0},\bm{\eta}_{m}^{0},\bm{\beta}_{m}^{0},\bm{d}_{m}^{0},\bm{e}_{m}^{0})$ , the iterative scheme of parallel ADMM for problem (3.1) is as follows,

\left\{\begin{array}[]{l}\bm{\beta}^{k+1}\ \leftarrow\mathop{\arg\min}\limits_{\bm{\beta}}\left\{L_{\mu}(\bm{\beta},\bm{\xi}_{m}^{k},\bm{\eta}_{m}^{k},\bm{\beta}_{m}^{k},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \bm{\xi}_{m}^{k+1}\ \leftarrow\mathop{\arg\min}\limits_{\bm{\xi}_{m}\geq\bm{0}}\left\{L_{\mu}(\bm{\beta}^{k+1},\bm{\xi}_{m},\bm{\eta}_{m}^{k},\bm{\beta}_{m}^{k},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \bm{\eta}_{m}^{k+1}\ \leftarrow\mathop{\arg\min}\limits_{\bm{\eta}_{m}\geq\bm{0}}\left\{L_{\mu}(\bm{\beta}^{k+1},\bm{\xi}_{m}^{k+1},\bm{\eta}_{m},\bm{\beta}_{m}^{k},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \bm{\beta}_{m}^{k+1}\ \leftarrow\mathop{\arg\min}\limits_{\bm{\beta}_{m}}\left\{L_{\mu}(\bm{\beta}^{k+1},\bm{\xi}_{m}^{k+1},\bm{\eta}_{m}^{k+1},\bm{\beta}_{m},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \bm{d}_{m}^{k+1}\ \leftarrow\bm{d}_{m}^{k}-\mu(-\bm{\beta}_{m}^{k+1}+\bm{\beta}^{k+1});\\ \bm{e}_{m}^{k+1}\ \leftarrow\bm{e}_{m}^{k}-\mu(-\tilde{\bm{y}}_{m}+\tilde{\bm{X}}_{m}\bm{\beta}_{m}+\bm{\xi}_{m}^{k+1}-\bm{\eta}_{m}^{k+1}).\end{array}\right.

(3.3)

Clearly, $\bm{\beta}^{k+1}$ will be updated on the central machine, while $\bm{\xi}_{m}^{k+1},\bm{\eta}_{m}^{k+1},\bm{\beta}_{m}^{k+1},\bm{d}_{m}^{k+1}$ and $\bm{e}_{m}^{k+1}$ will be updated in parallel on the sub machines. In section 3 of Fan et al. (2021), they provided closed-form solutions for the ( $\bm{\beta},\bm{\xi}_{m},\bm{\eta}_{m},\bm{\beta}_{m}$ )-subproblems, which significantly facilitated the implementation of the parallel algorithm.

The iteration process described above cannot be reduced to a two-block ADMM algorithm; it can only be reduced to a three-block ADMM algorithm. Next, we will briefly explain this point using mathematical expressions. When $M\geq 2$ , the constraint form of (3.1) is written in matrix form as

\displaystyle\bm{A}_{1}\bm{\beta}+\sum_{m=2}^{M+1}\bm{A}_{m}\bm{\xi}_{m-1}+\sum_{m=1}^{M}\bm{B}_{m}\bm{\eta}_{m}+\sum_{m=1}^{M}\bm{C}_{m}\bm{\beta}_{m}=\bm{e},

(3.4)

where $\bm{A}_{m}$ , $\bm{B}_{m}$ , and $\bm{C}_{m}$ are block matrices partitioned into $2M$ blocks by rows, and $\bm{e}$ is the corresponding column vector partitioned into $2M$ blocks. For example, for $M=2$ , $\bm{A}_{1}=[-\bm{I}_{p},-\bm{I}_{p},\bm{0}^{\top},\bm{0}^{\top}]^{\top}$ , $\bm{A}_{2}=[\bm{0}^{\top},\bm{0}^{\top},\bm{I}_{n_{1}},\bm{0}^{\top}]^{\top}$ , $\bm{A}_{3}=[\bm{0}^{\top},\bm{0}^{\top},\bm{0}^{\top},\bm{I}_{n_{2}}]^{\top}$ , $\bm{B}_{1}=[\bm{0}^{\top},\bm{0}^{\top},-\bm{I}_{n_{1}},\bm{0}^{\top}]^{\top}$ , $\bm{B}_{2}=[\bm{0}^{\top},\bm{0}^{\top},\bm{0}^{\top},-\bm{I}_{n_{2}}]^{\top}$ , $\bm{C}_{1}=[\bm{I}_{p},\bm{0}^{\top},\bm{X}_{1}^{\top},\bm{0}^{\top}]^{\top}$ , $\bm{C}_{2}=[\bm{0}^{\top},\bm{I}_{p},\bm{0}^{\top},\bm{X}_{2}^{\top}]^{\top}$ , and $\bm{e}=[\bm{0}^{\top},\bm{0}^{\top},\bm{y}_{1}^{\top},\bm{y}_{2}^{\top}]^{\top}$ . The above six matrices cannot be divided into two sets of mutually orthogonal matrices like QPADM. Indeed, these six matrices can be partitioned into three mutually orthogonal groups in the following order: $\bm{A}=[\bm{A}_{1},\bm{A}_{2},\bm{A}_{3}],\bm{B}=[\bm{B}_{1},\bm{B}_{2}]$ , and $\bm{C}=[\bm{C}_{1},\bm{C}_{2}]$ . The same operation can also be implemented when $M>2$ , that is,

\displaystyle\bm{A}=[\bm{A}_{1},\bm{A}_{2},\dots,\bm{A}_{M+1}],\ \bm{B}=[\bm{B}_{1},\bm{B}_{2},\dots,\bm{B}_{M}],\ \text{and}\ \bm{C}=[\bm{C}_{1},\bm{C}_{2},\dots,\bm{C}_{M}].

(3.5)

Hence, $\bm{\beta}$ and $\bm{\xi}_{m}$ can be considered as the first variable, $\bm{\eta}_{m}$ as the second variable, and $\bm{\beta}_{m}$ as the third variable. Thus, QPADM-slack is actually a three-block ADMM algorithm in terms of optimization form.

Chen et al. (2016) demonstrated that directly extending the ADMM algorithm for convex optimization with three or more separable blocks may not guarantee convergence, and they provided an example of divergence. As a result, QPADM-slack does not have a theoretical guarantee of convergence even in the convex case.

3.2 Gaussian Back Substitution

He et al. (2012) pointed out that although the ADMM algorithm with three blocks cannot be theoretically proven to converge, the algorithm can be corrected for some iterative solutions through simple Gaussian back substitution, thereby making the algorithm convergent. Because the correction matrix is an upper triangular block matrix, it is called Gaussian back substitution. The technique of Gaussian back substitution was extensively applied to various statistical and machine learning domains, including but not limited to the works of Ng et al. (2013), He and Yuan (2013), He et al. (2017), and He and Yuan (2018).

Let

\displaystyle\bm{a}=(\bm{\beta},\bm{\xi}_{1},\dots,\bm{\xi}_{M}),\ \bm{b}=(\bm{\eta}_{1},\dots,\bm{\eta}_{M})\ \text{and}\ \bm{c}=(\bm{\beta}_{1},\dots,\bm{\beta}_{M}),

(3.6)

and $\bm{d}=(\bm{d}_{1},\dots,\bm{d}_{M},\bm{e}_{1},\dots,\bm{e}_{M})$ . To facilitate the description of the correction process (Gaussian back substitution), we define the sequence of $k+1$ iterations $({\bm{b}}^{k+1},{\bm{c}}^{k+1})$ , generated by (3.3), as $(\tilde{\bm{b}}^{k},\tilde{\bm{c}}^{k})$ . Then, the Gaussian back substitution of QPADM-slack is defined as

\displaystyle\begin{bmatrix}\bm{b}^{k+1}\\ \bm{c}^{k+1}\end{bmatrix}=\begin{bmatrix}{\bm{b}}^{k}\\ {\bm{c}}^{k}\end{bmatrix}-\begin{bmatrix}v\bm{I}&-v(\bm{B}^{\top}\bm{B})^{-1}\bm{B}^{\top}\bm{C}\\ \bm{0}&v\bm{I}\end{bmatrix}\begin{bmatrix}{\bm{b}}^{k}-\tilde{\bm{b}}^{k}\\ {\bm{c}}^{k}-\tilde{\bm{c}}^{k}\end{bmatrix},

(3.7)

where $\nu\in(0,1)$ . Note that $\bm{B}^{\top}\bm{B}$ is an identity matrix, and thus

(\bm{B}^{\top}\bm{B})^{-1}\bm{B}^{\top}\bm{C}=\bm{B}^{\top}\bm{C}=\begin{bmatrix}-\bm{X}_{1}&0&\cdots&0\\ 0&-\bm{X}_{2}&\cdots&0\\ 0&0&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&-\bm{X}_{M}\end{bmatrix},

it follows that

\begin{bmatrix}\bm{b}^{k+1}\\ \bm{c}^{k+1}\end{bmatrix}=\begin{bmatrix}\bm{\eta}_{m}^{k+1}\\ \bm{\beta}_{m}^{k+1}\end{bmatrix}=\begin{bmatrix}(1-\nu){\bm{\eta}}_{m}^{k}+\nu\tilde{\bm{\eta}}_{m}^{k}-\nu\bm{X}_{m}(\bm{\beta}_{m}^{k}-\tilde{\bm{\beta}}_{m}^{k})\\ (1-\nu){\bm{\beta}}_{m}^{k}+\nu\tilde{\bm{\beta}}_{m}^{k}.\end{bmatrix}.

(3.8)

We summarize the process of QPADM-slack with Gaussian back substitution (QPADM-slack(GB)) in Algorithm 1. Its parallel algorithm diagram is shown in Figure 1. Algorithm 1 differs from QPADM-slack only in the correction steps performed in each local machine, specifically steps 3 and 4 in Local machines. This correction process involves only linear operations, with the heaviest computational load being the matrix-vector multiplication in the $\bm{\eta}_{m}^{k+1}$ correction step. Next, we will establish the global convergence of QPADM-slack(GB) and its linear convergence rate. The proof of the following theorem is provided in Appendix A.

Refer to caption — Figure 1: Schematic diagram of QPADM-slack(GB) algorithm.

Theorem 1

Let the sequence $\left\{\bm{g}^{k}=(\bm{b}^{k},\bm{c}^{k},\bm{d}^{k})\right\}$ be generated by QPADM-slack(GB).

1.

(Algorithm global convergence). It converges to some $\bm{g}^{\infty}=(\bm{b}^{\infty},\bm{c}^{\infty},\bm{d}^{\infty})$ that belongs to $\mathcal{V}^{*}$ , where $\mathcal{V}^{*}=\{(\bm{b}^{*},\bm{c}^{*},\bm{d}^{*})\mid(\bm{a}^{*},\bm{b}^{*},\bm{c}^{*},\bm{d}^{*})\in\Omega^{*}\}$ , and $\Omega^{*}$ is the set of optimal solutions for (3.1).

(Linear convergence rate). For any integer $K>0$ , we have

\displaystyle\|\bm{g}^{K}-\bm{g}^{K+1}\|_{\bm{H}}^{2}\leq\frac{1}{c_{0}\left(K+1\right)}\|\bm{g}^{0}-\bm{g}^{*}\|_{\bm{H}}^{2},

(3.9)

where $c_{0}$ is a positive constant and $\bm{H}$ is a positive definite matrix.

Remark 1

It is clear that $\|\bm{g}^{0}-\bm{g}^{*}\|_{\bm{H}}^{2}$ is a positive constant. Therefore, $\|\bm{g}^{K}-\bm{g}^{K+1}\|_{\bm{H}}^{2}\leq\mathcal{O}\left(\frac{1}{K}\right)$ , which is known as a linear convergence rate. In addition, during the proof of this theorem, we demonstrated that $\|\bm{g}^{k}-\bm{g}^{*}\|_{\bm{H}}^{2}$ and $\|\bm{g}^{k}-\bm{g}^{k+1}\|_{\bm{H}}^{2}$ are monotonically nonincreasing, that is, $\|\bm{g}^{k+1}-\bm{g}^{*}\|_{\bm{H}}^{2}\leq\|\bm{g}^{k}-\bm{g}^{*}\|_{\bm{H}}^{2}$ and $\|\bm{g}^{k+1}-\bm{g}^{k+2}\|_{\bm{H}}^{2}\leq\|\bm{g}^{k}-\bm{g}^{k+1}\|_{\bm{H}}^{2}$ , see the propositions in the Appendix A.

Algorithm 1 QPADM-slack with Gaussian back substitution

Input:

\bullet

Central machine:

\mu,M,\bm{\lambda}

\bullet

The

m

-th local machine:

\bm{X}_{m},\bm{y}_{m}

;

\mu

\nu

\tau

;

\bm{\beta}_{m}^{0}

\bm{\eta}_{m}^{0}

\bm{d}_{m}^{0}

\bm{e}_{m}^{0}

Output: the total number of iterations

K

\bm{\beta}^{K}

while not converged do

Central machine: 1. Receive

{\bm{\beta}}_{m}^{k}

and

\bm{d}_{m}^{k}

transmitted by

M

local machines, 2. Update

\bm{\beta}^{k+1}

as QPADM-slack, 3. Send

\bm{\beta}^{k+1}

to the local machines.

Local machines: for

m=1,2,\dots,M

(in parallel) 1. Receive

\bm{\beta}^{k+1}

transmitted by the central machine, 2. Update

\bm{\xi}_{m}^{k+1}

{\bm{\eta}}_{m}^{k+1}

\bm{\beta}_{m}^{k+1}

\bm{d}_{m}^{k+1}

and

\bm{e}_{m}^{k+1}

as QPADM-slack, 3. Let

\tilde{\bm{\eta}}_{m}^{k}=\bm{\eta}_{m}^{k+1}

and

\tilde{\bm{\beta}}_{m}^{k}=\bm{\beta}_{m}^{k+1}

, 4. Correct

\bm{\eta}_{m}^{k+1}

and

\bm{\beta}_{m}^{k+1}

according to Gaussian back substitution in (3.8), 5. Send

{\bm{\beta}}_{m}^{k+1}

and

\bm{d}^{k+1}_{m}

to the central machine.

end while

return solution.

3.3 Nonconvex Extension and Local Linear Approximation

In recent years, nonconvex penalized quantile regression has been extensively studied, both theoretically and algorithmically, see Wang et al. (2012), Fan et al. (2014), Gu et al. (2018), Fan et al. (2021), Wang and He (2024). However, there has been little research on quantile loss SVMs with nonconvex regularization terms. In this subsection, we propose the nonconvex quantile loss SVMs, defined as

\begin{array}[]{l}\mathop{\arg\min}\limits_{\bm{\beta},{\xi}\geq\bm{0}}\sum\limits_{i=1}^{n}{{{\xi}_{i}}}+P_{\bm{\lambda}}(|\bm{\beta}|)\\ \text{s.t.}\;\;\sum\limits_{j=1}^{p}y_{i}x_{ij}\beta_{j}\geq 1-{1\over\tau}{{\xi}_{i}},\quad i=1,2,\cdots n,\\ \ \ \ \ \ \ \sum\limits_{j=1}^{p}y_{i}x_{ij}\beta_{j}\leq 1+{1\over(1-\tau)}{{\xi}_{i}},\ i=1,2,\cdots n.\end{array}

(3.10)

When $\tau=1$ , the second inequality is always satisfied, and thus the aforementioned SVM simplifies to the nonconvex hinge SVM addressed in Zhang et al. (2016) and Guan et al. (2018). It is not difficult to see that (3.10) can be included in (1.2). In this paper, we mainly consider two popular nonconvex regularizations, SCAD penalty and MCP penalty.

In high-dimensional penalized linear regression and classification models, convex regularization terms like adaptive LASSO and elastic net ensure global optimality and computational efficiency, while non-convex regularization terms may provide better estimation and prediction performance but pose computational challenges due to the lack of global optimality. Fortunately, for quantile regression models and SVM models with nonconvex penalties, Fan et al. (2014) and Zhang et al. (2016) respectively pointed out that these problems can be uniformly solved by combining the local linear approximation (LLA, Zou and Li (2008)) method with an effective solution for the weighted $\ell_{1}$ penalized quantile regression and SVM.

LLA involves using the first-order Taylor expansion of the nonconvex regularization term to replace the original nonconvex regularization term, that is

\displaystyle P_{\bm{\lambda}}(|\bm{\beta}|)\approx P_{\bm{\lambda}}(|\bm{\beta}^{l}|)+\nabla P_{\bm{\lambda}}(|\bm{\beta}^{l}|)^{\top}(|\bm{\beta}|-|\bm{\beta}^{l}|),\ \text{for}\ \bm{\beta}\approx\bm{\beta}^{l},

(3.11)

where $\bm{\beta}^{l}$ is the solution from the last iteration, and $\nabla P_{\bm{\lambda}}(|\bm{\beta}^{l}|)=(\nabla P_{\bm{\lambda}_{1}}(|\bm{\beta}_{1}^{l}|),\nabla P_{\bm{\lambda}_{2}}(|\bm{\beta}_{2}^{l}|),\dots,\nabla P_{\bm{\lambda}_{p}}(|\bm{\beta}_{p}^{l}|))^{\top}$ .

$\bullet$ For SCAD, we have

\displaystyle\nabla P_{\bm{\lambda}_{j}}(|\bm{\beta}_{j}|)=\begin{cases}\bm{\lambda}_{j},&\text{{if }}|\bm{\beta}_{j}|\leq\bm{\lambda}_{j},\\ \frac{a\bm{\lambda}_{j}-|\bm{\beta}_{j}|}{a-1},&\text{{if }}\bm{\lambda}_{j}<|\bm{\beta}_{j}|<a\bm{\lambda}_{j},\\ 0,&\text{{if }}|\bm{\beta}_{j}|\geq a\bm{\lambda}_{j}.\\ \end{cases}

(3.12)

$\bullet$ For MCP, we have

\displaystyle\nabla P_{\bm{\lambda}_{j}}(|\bm{\beta}_{j}|)=\begin{cases}\bm{\lambda}_{j}-\frac{|\bm{\beta}_{j}|}{a},&\text{{if }}|\bm{\beta}_{j}|\leq a\bm{\lambda}_{j},\\ 0,&\text{{if }}|\bm{\beta}_{j}|>a\bm{\lambda}_{j}.\\ \end{cases}

(3.13)

Then, the nonconvex penalized quantile regressions and SVMs can be written as

\displaystyle\arg\min_{\bm{\beta}}\left\{\sum_{i=1}^{n}\rho_{\tau}\left(\tilde{y_{i}}-\tilde{\bm{x}}_{i}^{\top}\bm{\beta}\right)+P_{\bm{\lambda}}(|\bm{\beta}|)\right\}.

(3.14)

By substituting equation (3.11) into equation (3.14), we can obtain the following optimized form in a weighted manner,

\displaystyle\bm{\beta}^{l+1}=\mathop{\arg\min}\limits_{\bm{\beta}}\left\{\sum_{i=1}^{n}\rho_{\tau}\left(\tilde{y_{i}}-\tilde{\bm{x}}_{i}^{\top}\bm{\beta}\right)+\sum_{j=1}^{p}\nabla P_{\bm{\lambda}_{j}}(|\beta_{j}^{l}|)|\beta_{j}|\right\}.

(3.15)

Note that we only need to make a small change to solve this weighted optimization form using Algorithm 1. This change only requires replacing $\bm{\lambda}$ with $\nabla P_{\bm{\lambda}}(|\bm{\beta}^{l}|)$ .

To address nonconvex penalized quantile regressions and SVMs through the LLA algorithm, it is crucial to identify a suitable initial value. Following the guidance of Zhang et al. (2016) and Gu et al. (2018), we can utilize the solution derived from $P_{\bm{\lambda}}(|\bm{\beta}|)=\|\bm{\lambda}\odot\bm{\beta}\|_{1}$ in (3.14) as the initial starting point. Subsequently, the solution to (3.14) is obtained by iteratively solving a series of weighted $\ell_{1}$ penalized quantile regressions and SVMs. The comprehensive iterative steps of this approach are systematically outlined in Algorithm 2.

Algorithm 2 QPADM-slack(GB) for nonconvex penalized quantile regressions and SVMs

1. Initialize

\bm{\beta}

with

\bm{\beta}^{1}

, where

\bm{\beta}^{1}

is obtained by Algorithm 1.

2. For

l=1,2,\dots,L

, continue iterating the LLA iteration until convergence is achieved.

2.1. Compute the weights

\nabla P_{\bm{\lambda}}(|\bm{\beta}^{l}|)=(\nabla P_{\bm{\lambda}_{1}}(|\bm{\beta}_{1}^{l}|),\nabla P_{\bm{\lambda}_{2}}(|\bm{\beta}_{2}^{l}|),\dots,\nabla P_{\bm{\lambda}_{p}}(|\bm{\beta}_{p}^{l}|))^{\top}

by (3.12) or (3.13),

2.2. Solve the weighted problem in (3.15) by modified Algorithm 1 with

\bm{\lambda}

replacing with

\nabla P_{\bm{\lambda}}(|\bm{\beta}^{l}|)

. Let this solution be denoted as

\bm{\beta}^{l+1}

The preceding discussion reveals that resolving nonconvex penalized quantile regressions and SVMs requires multiple iterations of weighted $\ell_{1}$ penalized models. Theoretically, as shown by Fan et al. (2014) and Zhang et al. (2016), only two to three iterations suffice to find the Oracle solution for (3.14). In implementing Algorithm 2, we utilize the warm-start technique (Friedman et al. (2010)), initializing $\bm{\beta}^{l+1}$ with $\bm{\beta}^{l}$ , significantly reducing step 2.2’s iteration count, often leading to convergence within just a few to a dozen steps.

4 Modified QPADM-slack and Gaussian Back Substitution

The iteration order of variables in QPADM-slack is

\displaystyle\cdots\to{\bm{\beta}\to\bm{\xi}_{m}\to\bm{\eta}_{m}\to\bm{\beta}_{m}}\to\cdots,

(4.1)

and its Gaussian back substitution is $\bm{\eta}_{m}^{k+1}=(1-v){\bm{\eta}}_{m}^{k}+v\tilde{\bm{\eta}}_{m}^{k}+\bm{X}_{m}(\bm{\beta}_{m}^{k}-\tilde{\bm{\beta}}_{m}^{k})$ and $\bm{\beta}_{m}^{k+1}=(1-v){\bm{\beta}}_{m}^{k}+v\tilde{\bm{\beta}}_{m}^{k}$ . We should note that the correction step for $\bm{\eta}_{m}$ involves matrix-vector multiplication, which can impose an additional computational burden when both $n_{m}$ and $p$ are large.

4.1 Modified QPADM-slack

To address this issue, we suggest changing the variable update order of QPADM-slack in (4.1) to

\displaystyle\cdots\to\bm{\beta}_{m}\to\bm{\xi}_{m}\to\bm{\eta}_{m}\to\bm{\beta}\to\cdots.

(4.2)

We will explain the reason for adopting this order below. From the discussion in Section 3.2, it is evident that the first block among the three blocks of primal variables to be updated does not require correction. To avoid matrix operations in the correction step, $\bm{\beta_{m}}$ should be placed in the first block because only the constraint matrix corresponding to $\bm{\beta_{m}}$ is not composed of identity and zero matrices. During the Gaussian back substitution process, we need to compute the inverse of $\bm{B}^{\top}\bm{B}$ . To keep this inverse as simple as possible, we must place either $\xi_{m}$ or $\eta_{m}$ in the second block. Here, we choose to place $\xi_{m}$ in the second block in sequence. The remaining $\eta_{m}$ , $\bm{\beta}$ form the third block.

Thus, with $(\bm{\xi}_{m}^{0},\bm{\eta}_{m}^{0},\bm{\beta}^{0},\bm{d}_{m}^{0},\bm{e}_{m}^{0})$ , the iterative scheme of modified QPADM-slack (M-QPADM-slack) for problem (3.1) is as follows,

\left\{\begin{aligned} \bm{\beta}_{m}^{k+1}&\leftarrow\mathop{\arg\min}\limits_{\bm{\beta}_{m}}\left\{L_{\mu}(\bm{\beta}_{m},\bm{\xi}_{m}^{k},\bm{\eta}_{m}^{k},\bm{\beta}^{k},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \tilde{\bm{\xi}}_{m}^{k}&\leftarrow\mathop{\arg\min}\limits_{\bm{\xi}_{m}\geq\bm{0}}\left\{L_{\mu}(\bm{\beta}_{m}^{k+1},\bm{\xi}_{m},\bm{\eta}_{m}^{k},\bm{\beta}^{k},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \tilde{\bm{\eta}}_{m}^{k}&\leftarrow\mathop{\arg\min}\limits_{\bm{\eta}_{m}\geq\bm{0}}\left\{L_{\mu}(\bm{\beta}_{m}^{k+1},\tilde{\bm{\xi}}_{m}^{k},\bm{\eta}_{m},\bm{\beta}^{k},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \tilde{\bm{\beta}}^{k}&\leftarrow\mathop{\arg\min}\limits_{\bm{\beta}}\left\{L_{\mu}(\bm{\beta}_{m}^{k+1},\tilde{\bm{\xi}}_{m}^{k},\tilde{\bm{\eta}}_{m}^{k},\bm{\beta},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\ \bm{d}_{m}^{k+1}&\leftarrow\bm{d}_{m}^{k}-\mu\left(-\bm{\beta}_{m}^{k+1}+\tilde{\bm{\beta}}^{k}\right);\\ \bm{e}_{m}^{k+1}&\leftarrow\bm{e}_{m}^{k}-\mu\left(-\tilde{\bm{y}}_{m}+\tilde{\bm{X}}_{m}\bm{\beta}_{m}^{k+1}+\tilde{\bm{\xi}}_{m}^{k}-\tilde{\bm{\eta}}_{m}^{k}\right);\end{aligned}\right.

(4.3)

where $L_{\mu}$ is defined as in (3.1). $\bm{\beta}_{m}^{k+1}$ , $\tilde{\bm{\xi}}_{m}^{k}$ , $\tilde{\bm{\eta}}_{m}^{k}$ , $\bm{d}_{m}^{k+1}$ and $\bm{e}_{m}^{k+1}$ will be updated in parallel by each sub machine, while $\tilde{\bm{\beta}}^{k}$ will be updated on the central machine. However, this update sequence will cause a new problem, that is, the updates of local sub machines become incoherent, necessitating that the central machine first completes the update of $\tilde{\bm{\beta}}^{k}$ before proceeding with the update of $\bm{d}_{m}^{k+1}$ . This will result in an additional round of communication between the central machine and the local machines, thereby reducing the efficiency of the algorithm. This issue will not occur in (3.3) because the updates of variables on its local sub machines are coherent. Fortunately, since the updates of each sub problem in (4.3) do not strictly depend on the previously updated variables, this issue can be resolved by adjusting the update positions of the dual variables $\bm{d}_{m}$ and $\bm{e}_{m}$ .

For the convenience of describing the correction steps, we need to define

\displaystyle\bm{a}=(\bm{\beta}_{1},\dots,\bm{\beta}_{M}),\ \bm{b}=(\bm{\xi}_{1},\dots,\bm{\xi}_{M})\ \text{and}\ \bm{c}=(\bm{\eta}_{1},\dots,\bm{\eta}_{M},\bm{\beta}).

(4.4)

Correspondingly,

\bm{A}=\begin{bmatrix}\bm{I}_{p}&&\\ &\ddots&\\ &&\bm{I}_{p}\\ \bm{X}_{1}&&\\ &\ddots&\\ &&\bm{X}_{M}\end{bmatrix},\bm{B}=\begin{bmatrix}\bm{0}&&\\ &\ddots&\\ &&\bm{0}\\ -\bm{I}_{n_{1}}&&\\ &\ddots&\\ &&-\bm{I}_{n_{M}}\end{bmatrix}\text{and}\ \bm{C}=\begin{bmatrix}&&&-\bm{I}_{p}\\ &&&\vdots\\ &&&-\bm{I}_{p}\\ \bm{I}_{n_{1}}&&&\\ &\ddots&&\\ &&\bm{I}_{n_{M}}&\end{bmatrix}.

It is obvious that the matrices $\bm{A}$ and $\bm{B}$ are $2M\times M$ partitioned, while $\bm{C}$ is $2M\times(M+1)$ partitioned. $\bm{A}_{m}$ , $\bm{B}_{m}$ and $\bm{C}_{m}$ correspond to the block matrices of the $m$ -th columns of $\bm{A}$ , $\bm{B}$ and $\bm{C}$ , respectively. We also define the sequence of $k+1$ iterations $({\bm{b}}^{k+1},{\bm{c}}^{k+1})$ , generated by (4.3), as $(\tilde{\bm{b}}^{k},\tilde{\bm{c}}^{k})$ . According to (3.7), it follows from

(\bm{B}^{\top}\bm{B})^{-1}\bm{B}^{\top}\bm{C}=\begin{bmatrix}-\bm{I}_{n_{1}}&&&&\bm{0}\\ &-\bm{I}_{n_{2}}&&&\bm{0}\\ &&\ddots&&\vdots\\ &&&-\bm{I}_{n_{M}}&\bm{0}\end{bmatrix}

that

\begin{bmatrix}\bm{b}^{k+1}\\ \bm{c}^{k+1}\end{bmatrix}=\begin{bmatrix}\bm{\xi}_{m}^{k+1}\\ \bm{\eta}_{m}^{k+1}\\ \bm{\beta}^{k+1}\end{bmatrix}=\begin{bmatrix}(1-\nu)\bm{\xi}_{m}^{k}+\nu\tilde{\bm{\xi}}_{m}^{k}-\nu(\bm{\eta}_{m}^{k}-\tilde{\bm{\eta}}_{m}^{k})\\ (1-\nu)\bm{\eta}_{m}^{k}+\nu\tilde{\bm{\eta}}_{m}^{k}\\ (1-\nu)\bm{\beta}^{k}+\nu\tilde{\bm{\beta}}^{k}\end{bmatrix},

(4.5)

where $\nu\in(0,1)$ . Unlike the correction step in (3.8), the correction step in (4.5) involves only simple addition and subtraction operations, without matrix-vector multiplications. As a result, the computational burden of this correction is less affected by large values of $n_{m}$ and $p$ .

We summarize the process of M-QPADM-slack with Gaussian back substitution (M-QPADM-slack(GB)) in Algorithm 3, and its parallel algorithm diagram is shown in Figure 2. In Algorithm 3, prior to commencing the $(k+1)$ -th iteration, it is essential to first update the dual variables ( $\bm{d}_{m}^{k}$ and $\bm{e}_{m}^{k}$ ) from the $k$ -th iteration. This is achievable due to the sequence outlined in (4.3), where the dual variable updates are positioned at the conclusion of each iteration. Specifically, the updates for the primal variables in the $(k+1)$ -th iteration depend only on the dual variables $\bm{d}_{m}^{k}$ and $\bm{e}_{m}^{k}$ from the previous iteration, and not on the newly updated $\bm{d}_{m}^{k+1}$ and $\bm{e}_{m}^{k+1}$ . By relocating the updates of $\bm{d}_{m}^{k}$ and $\bm{e}_{m}^{k}$ , which are initially set to occur at the end of the $k$ -th iteration, to the start of the $(k+1)$ -th iteration, the incoherence in the sequence of iteration variables is effectively addressed. It is crucial to note that this rearrangement does not modify the iterative process delineated in (4.3) within the execution of Algorithm 3.

For some nonconvex regularization terms, M-QPADM-slack(GB) can also be embedded in Algorithm 2 to solve, simply replacing Algorithm 1 in its step 2.2 with Algorithm 3. In addition, Algorithm 3 also inherits the convergence result of Theorem 1, with $\bm{b}$ and $\bm{c}$ in (3.6) replaced by those in (4.4).

Algorithm 3 Modified QPADM-slack with Gaussian Back Substitution

Input:

\bullet

Central machine:

\mu,\nu,M,\bm{\lambda}

;

\tilde{\bm{\beta}}^{-1}

and

{\bm{\beta}}^{0}

\bullet

The

m

-th local machine:

\bm{X}_{m},\bm{y}_{m}

;

\mu

\nu

\tau

;

\bm{\beta}_{m}^{0}

\bm{\xi}_{m}^{-1}

\bm{\xi}_{m}^{0}

\bm{\eta}_{m}^{-1}

\bm{\eta}_{m}^{0}

\bm{d}_{m}^{-1}

\bm{e}_{m}^{-1}

Output: the total number of iterations

K

\bm{\beta}^{K}

while not converged do

Local machines: for

m=1,2,\dots,M

(in parallel) 1. Receive

\tilde{\bm{\beta}}^{k-1}

and

{\bm{\beta}}^{k}

transmitted by the central machine, 2. Update

\bm{d}_{m}^{k}=\bm{d}_{m}^{k-1}-\mu(-\bm{\beta}_{m}^{k}+\tilde{\bm{\beta}}^{k-1})

and

\bm{e}_{m}^{k}=\bm{e}_{m}^{k-1}-\mu(-\tilde{\bm{y}}_{m}+\tilde{\bm{X}}_{m}\bm{\beta}_{m}^{k}+\tilde{\bm{\xi}}_{m}^{k-1}-\tilde{\bm{\eta}}_{m}^{k-1})

, 3. Update

\bm{\beta}_{m}^{k+1}

\tilde{\bm{\xi}}_{m}^{k}

and

\tilde{\bm{\eta}}_{m}^{k}

according to (4.6)-(4.8), 4. Correct

\bm{\xi}_{m}^{k+1}

and

\bm{\eta}_{m}^{k+1}

according to (4.5), 5. Compute

\bm{\alpha}_{m}^{k+1}={\bm{\beta}}_{m}^{k+1}+\bm{d}_{m}^{k}/\mu

, 6. Send

\bm{\alpha}_{m}^{k+1}

to the central machine.

Central machine: 1. Receive

\bm{\alpha}_{m}^{k+1}

transmitted by

M

local machines, 2. Update

\tilde{\bm{\beta}}^{k}

according to (4.9), 3. Correct

\bm{\beta}^{k+1}

according to (4.5), 4. Send

\tilde{\bm{\beta}}^{k}

and

{\bm{\beta}}^{k+1}

to the local machines.

end while

return solution.

4.2 Details of Iteration

In this subsection, we describe the details of solving each subproblem in (4.3). In fact, our M-QPADM-slack (4.3) differs from the QPADM-slack in (3.3) only in the order of the iteration variables. Therefore, the solutions for each subproblem in (4.3) are generally the same as those used in Section 3, except for the order of the iteration variables. For completeness, we also provide detailed iteration steps.

For the update of the subproblems $\bm{\beta}^{k+1}_{m}$ , $\tilde{\bm{\xi}}_{m}^{k}$ and $\tilde{\bm{\eta}}_{m}^{k}$ in (4.3), the minimization problem is quadratic and differentiable, allowing us to solve the subproblem by solving

$\displaystyle\bm{\beta}_{m}^{k+1}$	$\displaystyle=(\tilde{\bm{X}}_{m}^{\top}\tilde{\bm{X}}_{m}+\bm{I}_{p})^{-1}\left[(\bm{\beta}^{k}-{\bm{d}_{m}^{k}}/{\mu})+\tilde{\bm{X}}_{m}^{\top}(\tilde{\bm{y}}_{m}-\bm{\xi}_{m}^{k}+\bm{\eta}_{m}^{k}+{\bm{e}_{m}^{k}}/{\mu})\right],$	(4.6)
$\displaystyle\tilde{\bm{\xi}}_{m}^{k}$	$\displaystyle=\max\left\{\tilde{\bm{y}}_{m}-\tilde{\bm{X}}_{m}\bm{\beta}_{m}^{k+1}+\bm{\eta}_{m}^{k}+{\bm{e}_{m}^{k}}/{\mu}-{\tau\bm{1}_{n_{m}}}/{\mu},\bm{0}\right\},$	(4.7)
$\displaystyle\tilde{\bm{\eta}}_{m}^{k}$	$\displaystyle=\max\left\{{(\tau-1)\bm{1}_{n_{m}}}/{\mu}-(\tilde{\bm{y}}_{m}-\tilde{\bm{X}}_{m}\bm{\beta}_{m}^{k+1}-\tilde{\bm{\xi}}_{m}^{k}+{\bm{e}_{m}^{k}}/{\mu}),\bm{0}\right\}.$	(4.8)

For the first equation, Yu et al. (2017) suggested using the Woodbury matrix identity $(\tilde{\bm{X}}_{m}^{\top}\tilde{\bm{X}}_{m}+\bm{I}_{p})^{-1}=\bm{I}_{p}-\tilde{\bm{X}}_{m}^{\top}(\bm{I}_{n_{m}}+\tilde{\bm{X}}_{m}\tilde{\bm{X}}_{m}^{\top})^{-1}\tilde{\bm{X}}_{m}$ . This method is actually very practical when the size of $n_{m}$ is small because the inverse only needs to be computed once throughout the ADMM iteration.

For the subproblem of updating $\tilde{\bm{\beta}}^{k}$ in (4.3), by rearranging the optimization equation and omitting some constant terms, we get

\displaystyle\tilde{\bm{\beta}}^{k}=\arg\min_{\bm{\beta}}\left\{P_{\bm{\lambda}}(|\bm{\beta}|)+\frac{\mu M}{2}\left\|\bm{\beta}-\sum_{m=1}^{M}\bm{\alpha}_{m}^{k+1}/M\right\|_{2}^{2}\right\},

(4.9)

where $\bm{\alpha}_{m}^{k+1}={\bm{\beta}}_{m}^{k+1}+\bm{d}_{m}^{k}/\mu$ . It is clear that the above expression represents a proximal operator, applicable to most convex regularization terms, such as LASSO and group LASSO (see Boyd et al. (2010)), elastic-net (see Parikh and Boyd (2013)), and sparse group LASSO (see Sprechmann et al. (2010)). Since the entire objective function is additive, the optimization problem can be transformed into smaller, independent subproblems during implementation.

5 Numerical Results

This section is dedicated to demonstrating the efficacy of the proposed algorithm in solving classification and regression problems. It focuses on evaluating the algorithm’s model selection capability, estimation accuracy, and computational efficiency. The algorithm, augmented with the Gaussian back substitution technique, is assessed in both non-parallel and parallel computing environments to showcase its robustness and scalability. To this end, the P-ADMM algorithm, combined with the Gaussian back substitution technique, is applied to various problems, including Penalized Quantile Regression (PQR, Fan et al. (2021)) and support vector machines (SVM) with sparse regularization (Guan et al. (2020)).

To select the optimal values for the regularization parameters $\bm{\lambda}$ , we employ the method proposed in Zhang et al. (2016) and Yu et al. (2017), which minimizes the HBIC criterion. The HBIC criterion is defined as follows:

\displaystyle\text{HBIC}(\bm{\lambda})=\log\left(\sum_{i=1}^{n}\mathcal{L}(y_{i}-\bm{x}_{i}^{\top}\hat{\bm{\beta}}_{\bm{\lambda}})\right)+|S_{\bm{\lambda}}|\frac{\log(\log n)}{n}C_{n}.

(5.1)

Here, $\mathcal{L}$ represents the specific loss function, $\hat{\bm{\beta}}_{\bm{\lambda}}$ denotes the parameter estimates obtained from the non-convex estimation, and $|S_{\bm{\lambda}}|$ denotes the number of non-zero coordinates in $\hat{\bm{\beta}}_{\bm{\lambda}}$ . The value of $C_{n}=6\log(p)$ is recommended by Peng and Wang (2015) and Fan et al. (2021). By minimizing the HBIC criterion, we can effectively select the optimal values for $\bm{\lambda}$ for convex and non-convex estimation. This selection method balances model complexity and goodness of fit for optimal estimation. Moreover, reviewing the correction steps (3.8) and (4.5), an additional parameter $\nu$ needs to be selected. According to the proof of Theorem 1, as long as $\nu$ is within the interval (0, 1), it guarantees the convergence and the linear convergence rate of the two proposed algorithms. Empirically, we suggest setting $\nu=0.75$ , as this choice assigns a weight of 0.75 to the current iteration variable and a weight of 0.25 to the previous iteration variable. This selection places primary emphasis on the current iteration solution while also appropriately considering the previous iteration solution.

For the initial vector in the iterative algorithm, all elements are initialized to 0.01, although other values are also feasible. For all tested ADMM algorithms, the maximum number of iterations was set to 500, and the following stopping criterion was employed:

\displaystyle\frac{\|\bm{\beta}^{k}-\bm{\beta}^{k-1}\|_{2}}{\max(1,\|\bm{\beta}^{k}\|_{2})}\leq 10^{-4}.

This stopping criterion ensures that the difference between the estimated coefficients in consecutive iterations does not exceed the specified threshold. All experiments were conducted on a computer equipped with an AMD Ryzen 9 7950X 16-core processor (clocked at 4.50 GHz) and 32 GB of memory, using the R programming language. For clarity, we denote QPADM-slack(GB) as the standard QRADM-slack with Gaussian back substitution (Algorithm 1), and M-QPADM-slack(GB) as the modified QRADM-slack with Gaussian back substitution (Algorithm 3).

5.1 Synthetic Data

In the first simulation, the P-ADMM algorithm with Gaussian back substitution proposed in this section is used to solve the $\ell_{1}$ quantile regression ( $\ell_{1}$ -QR, Li and Zhu (2008)) problem, and its performance is compared with several state-of-the-art algorithms, including QPADM proposed by Yu et al. (2017) and QPADM-slack proposed by Fan et al. (2021). We have included the experiment on nonconvex regularization terms (SCAD and MCP) in the Appendix B.1.

Although Yu et al. (2017) and Fan et al. (2021) provide R packages for their respective algorithms, these packages are only compatible with the Mac operating system. Furthermore, the package in Yu et al. (2017) only provides the estimated coefficients, lacking information such as the number of iterations and computational time. To ensure a fair comparison, the R code for the QPADM and QPADM-slack algorithms was rewritten based on the descriptions in their respective papers.

In the simulation study of this section, the models used in the numerical experiments of Peng and Wang (2015), Yu et al. (2017), and Fan et al. (2021) were adopted. Specifically, data were generated from the heteroscedastic regression model $y=x_{6}+x_{12}+x_{15}+x_{20}+0.7x_{1}\epsilon$ , where $\epsilon\sim N(0,1)$ . The independent variables $(x_{1},x_{2},\dots,x_{p})$ were generated in two steps.

•

First, $\bm{\tilde{}}{x}=(\tilde{x}_{1},\tilde{x}_{2},\dots,\tilde{x}_{p})^{\top}$ was generated from a $p$ -dimensional multivariate normal distribution $N(\bm{0},\bm{\Sigma})$ , where the covariance matrix $\bm{\Sigma}$ satisfies $\Sigma_{ij}=0.5^{|i-j|}$ for $1\leq i,j\leq p$ .
•

Second, $x_{1}$ was set to $\Phi(\tilde{x}_{1})$ , while for $j=2,\dots,p$ , we directly set $x_{j}=\tilde{x}_{j}$ .

As stated by Yu et al. (2017) and Fan et al. (2021), $x_{1}$ does not present when $\tau=0.5$ . For simplicity, in the ensuing numerical experiments pertaining to regression, the default selection for $\tau$ is set to 0.7. In a non-parallel environment ( $M=1$ ), this section simulates datasets of different sizes, specifically ( $n$ , $p$ ) = (30,000, 1,000), (1,000, 30,000), (10,000, 30,000), and (30,000, 30,000). In a parallel environment ( $M\geq 2$ ), datasets of sizes ( $n$ , $p$ ) = (200,000, 500) and (500,000, 1,000) are simulated. For each setting, 100 independent simulations were conducted. The average results for non-parallel and parallel computations are presented in Tables 1 and 2, respectively.

Table 1: Comparison results of ADMM algorithms under different data scales with LASSO penalty

$(n,p)$	(30000,1000)					(10000,30000)
	P1	P2	AE	Ite	Time	P1	P2	AE	Ite	Time
QPADM	100	100	0.018(0.03)	107.9(12.6)	41.35(6.78)	86	100	4.010(0.71)	500+(0.0)	2015.5(69.2)
QPADMslack	100	100	0.058(0.06)	56.7(6.04)	37.56(5.27)	84	100	4.196(1.05)	279(28.6)	1776.5(34.2)
QPADM-slack(GB)	100	100	0.029(0.04)	47.3(5.11)	35.17(5.30)	90	100	3.616(0.86)	233(20.8)	1031.9(23.1)
M-QPADM-slack(GB)	100	100	0.021(0.03)	31.6(4.50)	24.42(3.08)	97	100	3.241(0.78)	189(16.6)	795.6(16.3)
$(n,p)$	(1000,30000)					(30000,30000)
	P1	P2	AE	Ite	Time	P1	P2	AE	Ite	Time
QPADM	76	100	8.012(1.05)	500+(0.0)	2024.6(60.1)	100	100	1.701(0.33)	500+(0.0)	3217.3(68.6)
QPADMslack	73	100	8.324(1.22)	243(27.8)	1526.7(46.9)	100	100	2.010(0.39)	322(21.8)	2713.6(41.7)
QPADM-slack(GB)	85	100	8.107(1.09)	213(24.3)	928.1(20.8)	100	100	1.929(0.36)	299(18.8)	1928.3(37.5)
M-QPADM-slack(GB)	91	100	7.352(0.77)	166(11.9)	833.6(15.7)	100	100	1.512(0.09)	227(14.7)	1532.6(28.1)

*

The symbols in this table are defined as follows: P1 (%) represents the proportion of times $x_{1}$ is selected; P2 (%) represents the proportion of times $x_{6}$ , $x_{12}$ , $x_{15}$ , and $x_{20}$ are selected; AE denotes the absolute estimation error; Ite indicates the number of iterations; and Time (s) refers to the running time. The numbers in parentheses represent the corresponding standard deviations, and the optimal results are highlighted in bold.

The results in Table 1 reveal that the Gaussian back substitution method significantly enhances the convergence rate of QPADM-slack, as evidenced by the reduction in the number of iterations (Ite). The QPADM-slack(GB) algorithm introduces an additional correction step compared to QPADM-slack, which involves matrix-vector multiplication. Consequently, despite achieving a notably lower iteration count, QPADM-slack(GB) does not exhibit a substantial advantage in computational time (Time, in seconds) over QPADM-slack. This is because during the correction step, there are frequent matrix multiplication operations by vectors. By contrast, the modified M-QPADM-slack(GB), which adjusts the sequence of variable updates, demonstrates statistically significant improvements in both iteration count (Ite) and computational time (Time). In terms of computational precision, specifically absolute estimation error (AE), QPADM performs the best when the dimensionality is small. However, as the dimensionality increases, the situation changes. While the AE of QPADM remains better than that of QPADM-slack, it performs slightly worse compared to QPADM-slack(GB) and M-QPADM-slack(GB). Furthermore, regarding variable selection performance (P1 and P2), all methods perform well when the sample size ( $n$ ) is larger than or slightly smaller than the dimensionality ( $p$ ). However, in scenarios where dimensionality significantly exceeds sample size, the Gaussian back substitution methods exhibit statistically superior performance compared to their counterparts.

In Table 2, as the number of local sub-machines ( $M$ ) increases, a deteriorating trend is observed in both Nonzero and AE for all variants of P-ADMM, a numerical observation consistent with the findings in the numerical experiments reported by Fan et al. (2021). Across different numbers of local sub-machines ( $M$ ), M-QPADM-slack(GB) consistently surpasses other P-ADMM methods by a notable margin, exhibiting superior performance in both computational accuracy and efficiency. This further supports the necessity of modifying the iteration sequence in QPADM-slack when Gaussian back substitution is incorporated.

More detailed results on estimation errors and computation times are presented in Figure 3. Concerning computation time, we noticed that after $M$ surpasses 20, the tendency of computation time to decrease with the increase of $M$ diminishes and may stabilize. This is not a result of the algorithm struggling to manage a high number of sub-machines, but rather a constraint imposed by our computer’s configuration. Using a machine with ample memory would prevent this issue from arising.

Table 2: Comparison of different P-ADMM algorithms under the LASSO penalty

	QPADM		$(200000,500)$		QPADM		$(500000,1000)$
M	Nonzero	AE	Ite	Time	Nonzero	AE	Ite	Time
5	41.0(3.83)	0.074(0.0009)	359.4(27.1)	80.1(5.82)	28.3(2.15)	0.042(0.0006)	442.1(37.0)	177.2(12.6)
10	44.5(4.01)	0.071(0.0009)	372.3(28.9)	48.2(2.98)	29.1(2.33)	0.049(0.0007)	471.2(40.8)	87.6(6.63)
20	47.2(4.32)	0.075(0.0011)	405.2(32.6)	29.7(1.64)	32.2(2.01)	0.052(0.0008)	494.5(47.1)	43.5(3.88)
	QPADM-slack		$(200000,500)$		QPADM-slack		$(500000,1000)$
M	Nonzero	AE	Ite	Time	Nonzero	AE	Ite	Time
5	36.5(2.95)	0.079(0.0010)	255.6(22.0)	61.3(3.56)	25.2(1.92)	0.049(0.0007)	361.5(32.6)	136.6(9.32)
10	39.9(3.06)	0.080(0.0011)	269.1(26.1)	35.8(2.28)	28.9(2.09)	0.051(0.0008)	379.9(36.8)	78.4(5.17)
20	42.3(3.16)	0.083(0.0013)	356.8(42.2)	22.6(1.53)	31.6(2.23)	0.055(0.0009)	423.6(40.2)	42.9(2.61)
	QPADM-slack(GB)		$(200000,500)$		QPADM-slack(GB)		$(500000,1000)$
M	Nonzero	AE	Ite	Time	Nonzero	AE	Ite	Time
5	25.5(2.20)	0.058(0.0006)	195.6(12.6)	49.5(3.03)	24.4(1.90)	0.033(0.0005)	258.1(23.2)	82.6(6.62)
10	26.4(2.41)	0.062(0.0008)	203.0(13.4)	29.8(1.63)	25.6(1.97)	0.037(0.0005)	269.3(23.5)	50.2(3.88)
20	27.0(2.43)	0.065(0.0009)	211.9(13.5)	15.7(0.92)	26.2(2.04)	0.039(0.0006)	271.0(26.0)	35.1(2.12)
	M-QPADM-slack(GB)		$(200000,500)$		M-QPADM-slack(GB)		$(500000,1000)$
M	Nonzero	AE	Ite	Time	Nonzero	AE	Ite	Time
5	20.1(1.91)	0.050(0.0005)	148.5(9.31)	39.2(2.82)	15.21(1.44)	0.027(0.0004)	196.7(13.6)	62.5(4.32)
10	21.6(1.97)	0.053(0.0005)	152.1(9.92)	22.2(1.73)	15.38(1.57)	0.030(0.0005)	199.6(13.4)	36.4(2.73)
20	22.3(2.01)	0.054(0.0006)	156.6(10.8)	12.9(0.95)	15.52(1.64)	0.033(0.0005)	201.1(13.9)	20.2(1.51)

*

Since the values of all methods for metrics P1 and P2 are 100, they are not listed in Table 2. “Nonzero” indicates the number of non-zero coefficients in the estimates. The numbers in parentheses represent the corresponding standard deviations, and the optimal results are shown in bold.

5.2 Real Data Experiment

In this section, the empirical analysis focuses on classification tasks using real-world data. The dataset employed is rcv1.binary, which consists of 47,236 features, 20,242 training samples, and 677,399 testing samples. This dataset is publicly accessible at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#rcv1.binary.

In the subsequent experiments, the training samples are utilized to fit the model, where the data matrix $\bm{X}$ has dimensions $n=20,242$ (number of observations) and $p=47,236$ (number of features). This high-dimensional setting, where $p>n$ , necessitates the use of regularization techniques to address potential overfitting. Specifically, an $\ell_{1}$ regularization term is incorporated to induce sparsity in the model, as many features are expected to be irrelevant for classification. The experiment on SVM with nonconvex regularization terms is included in Appendix B.2.

As demonstrated in Proposition 1 of Wu et al. (2025b), the optimization algorithms for piecewise linear classification and regression models are interchangeable. Consequently, both QPADM and QPADM-slack can be applied to solve the $\ell_{1}$ -SVM problem (Zhu et al. (2003)). Notably, the algorithm proposed in Guan et al. (2020) reformulates the piecewise linear classification loss using slack variables, making it equivalent to QPADM-slack with the quantile parameter $\tau=1$ . Therefore, QPADM-slack ( $\tau=1$ ) is adopted as the baseline method for comparison in the following analysis.

To evaluate the algorithms, several performance metrics are defined. For the testing set, a random subsample of 10,000 observations is drawn from the 677,399 testing samples, yielding $n_{\text{test}}=10,000$ . The metrics include: (1) “Time”, which measures the computational runtime of the algorithm; (2) “Iteration”, which records the number of iterations required for convergence; (3) “Sparsity”, defined as the proportion of zero coefficients (i.e., the number of zero coefficients divided by the total number of coefficients); (4) “Train”, which quantifies the classification accuracy on the training set; and (5) “Test”, which measures the classification accuracy on the testing set. Each experimental setting is independently simulated 100 times, and the average results are reported in Table 3. Since the previous experiments demonstrated that the Gaussian back substitution technique outperforms QPADM-slack in terms of iteration count and computational time, the results for “Ite” and “Time” are omitted in Table 3.

Table 3: Comparative analysis of Sparsity, training, and average testing accuracies (%) for

\ell_{1}

SVM using M-QPADM-slack(GB), QPADM-slack, and QPADM-slack(GB)

	M-QPADM-slack(GB)			QPADM-slack(GB)			QPADM-slack
M	Sparsity	Train	Test	Sparsity	Train	Test	Sparsity	Train	Test
2	90.36	99.42	97.25	85.98	95.12	92.98	83.63	93.37	91.99
4	90.36	99.23	97.15	84.60	94.55	92.02	81.36	92.25	91.04
6	90.36	99.15	97.05	83.59	93.87	91.46	80.13	91.76	90.58
8	90.31	99.02	96.91	82.10	93.03	90.93	79.36	91.04	89.91
10	90.12	98.89	96.82	81.36	92.56	90.07	77.98	90.76	89.07
12	90.01	98.77	96.69	80.75	91.87	89.38	77.12	90.01	88.36
14	89.85	98.65	96.57	79.36	91.15	88.64	76.42	89.72	87.82
16	89.58	98.51	96.41	78.69	90.36	87.39	73.95	88.36	87.05
18	89.65	98.43	96.36	77.65	89.78	86.27	72.65	87.77	86.23
20	89.55	98.32	96.27	76.25	88.62	85.33	71.45	87.65	85.16

The numerical results in Table 3 reveal that, not only in the regression numerical experiments, but also in the classification experiments, the performance of the three P-ADMM methods based on the consensus structure deteriorates as the number of local sub-machines ( $M$ ) increases. This occurs because the consensus structure, designed to support the parallel algorithm, unavoidably introduces consensus constraints, namely auxiliary variables, as the number of local sub-machines grows. These additional auxiliary variables degrade the quality of the iterative solutions of the P-ADMM algorithm. In terms of classification accuracy (Train and Test) and variable selection (Sparsity), M-QPADM-slack(GB) performs the best, followed by QPADM-slack(GB), and lastly, QPADM-slack. This indicates that when solving sparse regularization classification problems using the slack-based P-ADMM algorithm, incorporating the Gaussian back substitution technique can enhance both the efficiency and accuracy of the algorithm.

6 Conclusions and Future Prospects

This paper introduces a Gaussian back substitution technique to adapt the parallel ADMM (P-ADMM) algorithms proposed by Guan et al. (2020) and Fan et al. (2021). Specifically, we incorporate minor linear adjustments at each iteration step, which, despite their minimal computational overhead, significantly enhance the algorithm’s convergence speed. More importantly, this technique theoretically achieves linear convergence rates and demonstrates high efficiency and robustness in practical applications.

Furthermore, this paper proposes a new iteration variable sequence for P-ADMM with slack variables. When combined with the Gaussian back-substitution technique, this approach can significantly enhance computational efficiency. The new iteration sequence is designed to better align with the Gaussian back-substitution technique. By streamlining key steps in the algorithm to involve only basic addition and subtraction operations, we avoid complex matrix-vector multiplications, thereby improving overall efficiency. This is particularly advantageous for large datasets and high-dimensional spaces, as it avoids computational load and complexity. This modification not only preserves the algorithm’s convergence properties but potentially accelerates the convergence process, offering a more efficient solution for practical applications.

Another significant contribution of this paper is the extension of quantile loss applications from traditional regression tasks to classification tasks. Given the intrinsic equivalence of quantile loss optimization forms in both classification and regression tasks, the proposed parallel algorithm can be seamlessly applied to quantile regression classification models. This extension provides a novel perspective and tool for addressing classification problems.

Despite the advancements presented in this paper, a challenges remains. To be specific, even with the incorporation of the Gaussian back-substitution technique, the quality of the solution and the algorithm’s convergence speed deteriorate as the number of local sub-machines increases. This limitation arises from the consensus-based parallel structure, which complicates resource allocation in practical applications. Consequently, developing a parallel algorithm that circumvents the consensus structure emerges as a highly promising research direction. Such an algorithm could provide more robust and efficient solutions for larger datasets and higher-dimensional feature spaces.

Acknowledgements

We express our sincere gratitude to Professor Bingsheng He for engaging in invaluable discussions with us. His insights and expertise have greatly assisted us in effectively utilizing Gaussian back substitution to correct the convergence issues of the QPADM-slack algorithm. The research of Zhimin Zhang was supported by the National Natural Science Foundation of China [Grant Numbers 12271066, 12171405], and the research of Xiaofei Wu was supported by the Scientific and Technological Research Program of Chongqing Municipal Education Commission [Grant Numbers KJQN202302003].

References

Beck and Teboulle (2009) Beck, A., Teboulle, M., 2009. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences 2, 183–202.
Belloni and Chernozhukov (2011) Belloni, A., Chernozhukov, V., 2011. $\ell_{1}$ -penalized quantile regression in high-dimensional sparse models. The Annals of Statistics 39, 82 – 130.
Boyd et al. (2010) Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., 2010. Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers. Foundation and Trends in Machine Learning 3, 1–122.
Chen et al. (2016) Chen, C., He, B., Ye, Y., Yuan, X., 2016. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming 155, 57–79.
Christmann and Steinwart (2008) Christmann, A., Steinwart, I., 2008. Support Vector Machines. Springer New York, NY.
Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. Annals of Statistics 32, 407–499.
Fan et al. (2012) Fan, J., Fan, Y., Barut, E., 2012. Adaptive robust variable selection. Annals of statistics 42(1), 324–351.
Fan and Li (2001) Fan, J., Li, R., 2001. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association 96, 1348–1360.
Fan et al. (2014) Fan, J., Xue, L., Zou, H., 2014. Strong Oracle Optimality of Folded Concave Penalized Estimation. The Annals of Statistics 42, 819–849.
Fan et al. (2021) Fan, Y., Lin, N., Yin, X., 2021. Penalized Quantile Regression for Distributed Big Data Using the Slack Variable Representation. Journal of Computational and Graphical Statistics 30, 557–565.
Friedman et al. (2010) Friedman, J.H., Hastie, T.J., Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33 1, 1–22.
Gu et al. (2018) Gu, Y., Fan, J., Kong, L., Ma, S., Zou, H., 2018. ADMM for High-Dimensional Sparse Penalized Quantile Regression. Technometrics 60, 319–331.
Guan et al. (2018) Guan, L., Qiao, L., Li, D., Sun, T., Ge, K., Lu, X., 2018. An Efficient ADMM-Based Algorithm to Nonconvex Penalized Support Vector Machines. 2018 IEEE International Conference on Data Mining Workshops , 1209–1216.
Guan et al. (2020) Guan, L., Sun, T., Qiao, L.b., Yang, Z.h., Li, D.s., Ge, K.s., Lu, X., 2020. An efficient parallel and distributed solution to nonconvex penalized linear SVMs. Frontiers of Information Technology & Electronic Engineering 21, 587–603.
Hastie et al. (2004) Hastie, T.J., Rosset, S., Tibshirani, R., Zhu, J., 2004. The entire regularization path for the support vector machine. Journal of machine learning research 5, 1391–1415.
He et al. (2022) He, B., Ma, F., Xu, S., Yuan, X., 2022. A rank-two relaxed parallel splitting version of the augmented Lagrangian method with step size in (0,2) for separable convex programming. Math. Comput. 92, 1633–1663.
He et al. (2012) He, B., Tao, M., Yuan, X., 2012. Alternating direction method with gaussian back substitution for separable convex programming. SIAM Journal on Optimization 22, 313–340.
He et al. (2017) He, B., Tao, M., Yuan, X., 2017. Convergence rate and iteration complexity on the alternating direction method of multipliers with a substitution procedure for separable convex programming. Mathematics of Operations Research 42, 662–691.
He and Yuan (2013) He, B., Yuan, X., 2013. Linearized alternating direction method of multipliers with gaussian back substitution for separable convex programming. Numerical Algebra, Control and Optimization 3, 247–260.
He and Yuan (2015) He, B., Yuan, X., 2015. On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numerische Mathematik 130, 567–577.
He and Yuan (2018) He, B., Yuan, X., 2018. A class of ADMM-based algorithms for three-block separable convex programming. Computational Optimization and Applications 70, 791 – 826.
Hoerl and Kennard (1970) Hoerl, E., Kennard, R.W., 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55–67.
Huang et al. (2014) Huang, X., Shi, L., Suykens, J.A.K., 2014. Support vector machine classifier with pinball loss. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 984–997.
Koenker (2005) Koenker, R., 2005. Quantile Regression. Cambridge University Press.
Koenker and Basset (1978) Koenker, R., Basset, G., 1978. Regressions Quantiles. Econometrica 46, 33–50.
Koenker and Ng (2005) Koenker, R.W., Ng, P.T., 2005. A frisch-newton algorithm for sparse quantile regression. Acta Mathematicae Applicatae Sinica 21, 225–236.
Li and Zhu (2008) Li, Y., Zhu, J., 2008. L1-norm quantile regression. Journal of Computational and Graphical Statistics 17, 163 – 185.
Liang et al. (2024) Liang, R., Wu, X., Zhang, Z., 2024. Linearized Alternating Direction Method of Multipliers for Elastic-net Support Vector Machines. Pattern Recognition 148, 110134.
Ng et al. (2013) Ng, M.K., Yuan, X., Zhang, W., 2013. Coupled variational image decomposition and restoration model for blurred cartoon-plus-texture images with missing pixels. IEEE Transactions on Image Processing 22, 2233–2246.
Parikh and Boyd (2013) Parikh, N., Boyd, S.P., 2013. Proximal Algorithms. Now Foundations and Trends 1, 127–239.
Peng and Wang (2015) Peng, B., Wang, L., 2015. An Iterative Coordinate Descent Algorithm for High-Dimensional Nonconvex Penalized Quantile Regression. Journal of Computational & Graphical Statistics 24, 676–694.
Rosset and Zhu (2007) Rosset, S., Zhu, J., 2007. Piecewise linear regularized solution paths. The Annals of Statistics 35, 1012 – 1030.
Sprechmann et al. (2010) Sprechmann, P., Paulino, I.F.R., Sapiro, G., Eldar, Y.C., 2010. C-hilasso: A collaborative hierarchical sparse modeling framework. IEEE Transactions on Signal Processing 59, 4183–4198.
Tibshirani (1996) Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58.
Vapnik (1995) Vapnik, V.N., 1995. The Nature of Statistical Learning Theory. Springer .
Wang and He (2024) Wang, L., He, X., 2024. Analysis of global and local optima of regularized quantile regression in high dimensions: A subgradient approach. Econometric Theory 40, 233–277.
Wang et al. (2012) Wang, L., Wu, Y., Li, R., 2012. Quantile Regression for Analyzing Heterogeneity in Ultra-High Dimension. Journal of the American Statistical Association 107, 214–222.
Wang et al. (2006) Wang, L., Zhu, J., Zou, H., 2006. The doubly regularized support vector machine. Statistica Sinica 16, 589–615.
Wen et al. (2024) Wen, J., Yang, S., Zhao, D., 2024. Nonconvex Dantzig selector and its parallel computing algorithm. Statistics and Computing 34, 1573–1375.
Wu et al. (2025a) Wu, X., Liang, R., Zhang, Z., Cui, Z., 2025a. A unified consensus-based parallel ADMM algorithm for high-dimensional regression with combined regularizations. Computational Statistics & Data Analysis 203, 108081.
Wu et al. (2025b) Wu, X., Liang, R., Zhang, Z., Cui, Z., 2025b. Multi-block linearized alternating direction method for sparse fused Lasso modeling problems. Applied Mathematical Modelling 137, 115694.
Wu et al. (2024) Wu, X., Ming, H., Zhang, Z., Cui, Z., 2024. Multi-block Alternating Direction Method of Multipliers for Ultrahigh Dimensional Quantile Fused Regression. Computational Statistics & Data Analysis 192, 107901.
Wu and Liu (2009) Wu, Y., Liu, Y., 2009. Variable selection in quantile regression. Statistica Sinica 19, 801–817.
Yang and Zou (2013) Yang, Y., Zou, H., 2013. An Efficient Algorithm for Computing the HHSVM and Its Generalizations. Journal of Computational and Graphical Statistics 22, 396–415.
Yang and Zou (2015) Yang, Y., Zou, H., 2015. A fast unified algorithm for solving group-lasso penalize learning problems. Statistics and Computing 25, 1129–1141.
Yi and Huang (2016) Yi, C., Huang, J., 2016. Semismooth newton coordinate descent algorithm for elastic-net penalized huber loss regression and quantile regression. Journal of Computational and Graphical Statistics 26, 547 – 557.
Yu and Lin (2017) Yu, L., Lin, N., 2017. ADMM for Penalized Quantile Regression in Big Data. International Statistical Review 85, 494–518.
Yu et al. (2017) Yu, L., Lin, N., Wang, L., 2017. A Parallel Algorithm for Large-Scale Nonconvex Penalized Quantile Regression. Journal of Computational and Graphical Statistics 26, 935–939.
Zhang (2010) Zhang, C., 2010. Nearly Unbiased Variable Selection Under Minimax Concave Penalty. Annals of Statistics 38, 894–942.
Zhang et al. (2016) Zhang, X., Wu, Y., Wang, L., Li, R., 2016. Variable selection for support vector machines in moderately high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78, 53–76.
Zhu et al. (2003) Zhu, J., Rosset, S., Tibshirani, R., Hastie, T., 2003. 1-norm support vector machines, in: Thrun, S., Saul, L., Schölkopf, B. (Eds.), Advances in Neural Information Processing Systems, MIT Press.
Zou (2006) Zou, H., 2006. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101, 1418–1429.
Zou and Li (2008) Zou, H., Li, R., 2008. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models. The Annals of Statistics 36, 1509–1533.

Online Appendix

Appendix A Proofs of Convergence Theorems

A.1 Preliminary

A.1.1 Lemma

Lemma 1

(Lemma 2.1 in He et al. (2022)) Let $\mathbb{Z}\subset\mathbb{R}^{l}$ be a closed convex set, and let $\theta:\mathbb{R}^{l}\to\mathbb{R}$ and $f:\mathbb{R}^{l}\to\mathbb{R}$ be convex functions. Suppose $f$ is differentiable on an open set containing $\mathbb{Z}$ , and the minimization problem

\min\{\theta(z)+f(z)\mid z\in\mathbb{Z}\}

has a nonempty solution set. Then, $z^{*}\in\arg\min\{\theta(z)+f(z)\mid z\in\mathbb{Z}\}$ if and only if

z^{*}\in\mathbb{Z}\quad\text{and}\quad\theta(z)-\theta(z^{*})+(z-z^{*})^{\top}\nabla f(z^{*})\geq 0,\quad\forall z\in\mathbb{Z}.

Lemma 2

Assume that $\bm{G}$ and $\bm{H}\in\mathbb{R}^{n\times n}$ are both positive definite matrices, and $\bm{a},\bm{b},\bm{c},\bm{d}$ are four arbitrary $n$ -dimensional vectors. Then the following identities hold:

\displaystyle\|\bm{a}\|_{\bm{H}}^{2}-\|\bm{b}\|_{\bm{H}}^{2}=2\bm{a}^{\top}\bm{H}(\bm{a}-\bm{b})-\|\bm{a}-\bm{b}\|_{\bm{H}}^{2}.

(A.1)

\displaystyle 2(\bm{a}-\bm{b})^{\top}\bm{H}(\bm{c}-\bm{d})=(\|\bm{a}-\bm{d}\|_{\bm{H}}^{2}-\|\bm{a}-\bm{c}\|_{\bm{H}}^{2})+(\|\bm{c}-\bm{b}\|_{\bm{H}}^{2}-\|\bm{d}-\bm{b}\|_{\bm{H}}^{2}).

(A.2)

The conclusion of Lemma 2 is straightforward to verify. Despite its simplicity, this result is widely utilized in the proof of convergence for ADMM, as detailed in He and Yuan (2018) and He et al. (2022).

A.1.2 Four Matrices

To simplify the presentation of analysis, we define the following four matrices,

\displaystyle\bm{Q}=\begin{bmatrix}\mu\bm{B}^{\top}\bm{B}&\bm{0}&\bm{0}\\ \mu\bm{C}^{\top}\bm{B}&\mu\bm{C}^{\top}\bm{C}&\bm{0}\\ -\bm{B}&-\bm{C}&\frac{\bm{I}}{\mu}\end{bmatrix},\ \bm{M}=\begin{bmatrix}\nu\bm{I}&-\nu(\bm{B}^{\top}\bm{B})^{-1}\bm{B}^{\top}\bm{C}&\bm{0}\\ \bm{0}&\nu\bm{I}&\bm{0}\\ -\mu\bm{B}&-\mu\bm{C}&\bm{I}\end{bmatrix},

(A.3)

\displaystyle\bm{H}=\begin{bmatrix}\frac{\mu}{\nu}\bm{B}^{\top}\bm{B}&\frac{\mu}{\nu}\bm{B}^{\top}\bm{C}&\bm{0}\\ \frac{\mu}{\nu}\bm{C}^{\top}\bm{B}&\frac{\mu}{\nu}\left[\bm{C}^{\top}\bm{C}+\bm{C}^{\top}\bm{B}(\bm{B}^{\top}\bm{B})^{-1}\bm{B}^{\top}\bm{C}\right]&\bm{0}\\ \bm{0}&\bm{0}&\frac{\bm{I}}{\mu}\end{bmatrix},\bm{G}=\begin{bmatrix}(1-\nu)\mu\bm{B}^{\top}\bm{B}&\bm{0}&\bm{0}\\ \bm{0}&(1-\nu)\mu\bm{C}^{\top}\bm{C}&\bm{0}\\ \bm{0}&\bm{0}&\frac{\bm{I}}{\mu}\end{bmatrix}.

(A.4)

It is obvious that the matrices $\bm{Q}$ , $\bm{M}$ , $\bm{H}$ and $\bm{G}$ are all $3\times 3$ partitioned, and they satisfy the condition

\displaystyle\bm{H}\bm{M}=\bm{Q}\ \text{and}\ \bm{G}=(\bm{Q}^{\top}+\bm{Q})-\bm{M}^{\top}\bm{H}\bm{M}.

(A.5)

It is not difficult to verify that both $\bm{H}$ and $\bm{G}$ are positive-definite matrices when $\nu\in(0,1)$ .

A.1.3 Variational Inequality Characterization

To transform the optimization objective function (3.1) into an equality optimization problem, that is, to eliminate the nonnegative constraints of $\bm{\xi}_{m}\geq\bm{0}$ and $\bm{\eta}_{m}\geq\bm{0}$ , we need to introduce the following indicator function,

I_{+}(\bm{x})=\begin{cases}0,&\text{if }\bm{x}\in\mathbb{R}^{+},\\ \infty,&\text{if }\bm{x}\notin\mathbb{R}^{+}.\end{cases}

If we consider the QPADM-Slack(GB) method, we define:

\bm{a}=(\bm{\beta},\bm{\xi}_{1},\dots,\bm{\xi}_{M}),\quad\theta_{1}(\bm{a})=P_{\lambda}(\bm{\beta})+\sum_{m=1}^{M}\left[\tau\bm{1}_{n_{m}}^{\top}\bm{\xi}_{m}+I_{+}(\bm{\xi}_{m})\right]

\bm{b}=(\bm{\eta}_{1},\dots,\bm{\eta}_{M}),\quad\theta_{2}(\bm{b})=\sum_{m=1}^{M}\left[(1-\tau)\bm{1}_{n_{m}}^{\top}\bm{\eta}_{m}+I_{+}(\bm{\eta}_{m})\right]

\bm{c}=(\bm{\beta}_{1},\dots,\bm{\beta}_{M}),\quad\theta_{3}(\bm{c})=0

\bm{e}=(\bm{0},\dots,\bm{0},\bm{y}_{1}^{\top},\dots,\bm{y}_{M}^{\top})^{\top}

Thus, from (3.5), equation (3.1) can be rewritten as:

	$\displaystyle\min_{\bm{a},\bm{b},\bm{c}}$	$\displaystyle\theta_{1}(\bm{a})+\theta_{2}(\bm{b})+\theta_{3}(\bm{c}),$		(A.6)
	s.t.	$\displaystyle\bm{A}\bm{a}+\bm{B}\bm{b}+\bm{C}\bm{c}=\bm{e},$		(A.6)

where the definitions of $\bm{A}$ , $\bm{B}$ , $\bm{C}$ , and $\bm{e}$ are provided in Section 3.1.

If the modified update sequence as described in Section 4.1 is employed, the variables are defined as follows:

\bm{a}=(\bm{\beta}_{1},\dots,\bm{\beta}_{M}),\quad\bm{b}=(\bm{\xi}_{1},\dots,\bm{\xi}_{M}),\quad\bm{c}=(\bm{\eta}_{1},\dots,\bm{\eta}_{M},\bm{\beta}).

Equation (3.1) can then be rewritten as (A.6), with the corresponding matrices $\bm{A}$ , $\bm{B}$ , and $\bm{C}$ as detailed in Section 4.1.

Based on the variational inequality characterization discussed in Section 2.4.2 of this paper, the solution to the constrained optimization problem (A.6) is the saddle point of the following Lagrangian function:

\displaystyle L(\bm{a},\bm{b},\bm{c},\bm{d})=\theta_{1}(\bm{a})+\theta_{2}(\bm{b})+\theta_{3}(\bm{c})-\bm{d}^{\top}(\bm{Aa}+\bm{Bb}+\bm{Cc}-\bm{e}),

(A.7)

where $\bm{d}=(\bm{d}_{1},\dots,\bm{d}_{M},\bm{e}_{1},\dots,\bm{e}_{M})$ . However, since $\theta_{1}$ and $\theta_{2}$ are non-differentiable in this context, the variational inequality mentioned in He and Yuan (2018) cannot be directly applied to our scenario. Recently, the work in He et al. (2022) has covered the non-differentiable setting considered in this paper. As described in their Section 2.2, finding the saddle point of $L(\bm{a},\bm{b},\bm{c},\bm{d})$ is equivalent to finding $\bm{a}^{*},\bm{b}^{*},\bm{c}^{*},\bm{d}^{*}$ that satisfy the following variational inequality:

\bm{h}^{*}\in\Omega,\quad\theta(\bm{f})-\theta(\bm{f}^{*})+(\bm{h}-\bm{h}^{*})^{\top}F(\bm{h}^{*})\geq 0,\quad\forall\bm{h}\in\Omega,

(A.8)

where $\bm{f}=(\bm{a},\bm{b},\bm{c})$ , $\bm{h}=(\bm{a},\bm{b},\bm{c},\bm{d})$ , and

\theta(\bm{f})=\theta_{1}(\bm{a})+\theta_{2}(\bm{b})+\theta_{3}(\bm{c}),\quad\Omega=\mathbb{R}^{(p+Mn)}\times\mathbb{R}^{(Mn)}\times\mathbb{R}^{(Mn)}\times\mathbb{R}^{(2M)},

(A.9)

and

F(\bm{h})=\begin{pmatrix}-\bm{A}^{\top}\bm{d}\\ -\bm{B}^{\top}\bm{d}\\ -\bm{C}^{\top}\bm{d}\\ \bm{Aa}+\bm{Bb}+\bm{Cc}-\bm{e}\end{pmatrix}.

(A.10)

It is important to note that the operator $F$ defined in (A.10) is antisymmetric , as

(\bm{h}_{1}-\bm{h}_{2})^{\top}\left[F(\bm{h}_{1})-F(\bm{h}_{2})\right]\equiv 0,\quad\forall\ \bm{h}_{1},\bm{h}_{2}\in\Omega.

(A.11)

In this chapter, $\Omega^{*}$ denotes the solution set of (A.8), which is also the set of saddle points of the Lagrangian function (A.7) for the model (3.1).

A.2 Proof

With the preparation above, we can now proceed to prove the convergence and convergence rate of QPADM-Slack with Gaussian backtracking (which includes both QPADM-Slack(GB) and M-QPADM-Slack(GB)). For brevity, both of these methods with Gaussian backtracking will hereafter be referred to as QPADM-Slack(GB).

The process of QPADM-Slack(GB) can be divided into two steps: The first step involves generating the predicted values $\tilde{\bm{a}}^{k}$ , $\tilde{\bm{b}}^{k}$ , $\tilde{\bm{c}}^{k}$ , and $\tilde{\bm{d}}^{k}$ ; the second step involves correcting $\tilde{\bm{b}}^{k}$ , $\tilde{\bm{c}}^{k}$ , and $\tilde{\bm{d}}^{k}$ to produce ${\bm{b}}^{k+1}$ , ${\bm{c}}^{k+1}$ , and ${\bm{d}}^{k+1}$ .

The specific iteration formulas for the first step (prediction step) are as follows:

\left\{\begin{array}[]{l}\tilde{\bm{a}}^{k}=\mathop{\arg\min}\limits_{\bm{a}}\left\{\theta_{1}(\bm{a})-\bm{a}^{\top}\bm{A}^{\top}\bm{d}^{k}+\frac{\mu}{2}\|\bm{Aa}+\bm{B}\bm{b}^{k}+\bm{C}\bm{c}^{k}-\bm{e}\|^{2}\right\};\\ \tilde{\bm{b}}^{k}=\mathop{\arg\min}\limits_{\bm{b}}\left\{\theta_{2}(\bm{b})-\bm{b}^{\top}\bm{B}^{\top}\bm{d}^{k}+\frac{\mu}{2}\|\bm{A}\tilde{\bm{a}}^{k}+\bm{B}\bm{b}+\bm{C}\bm{c}^{k}-\bm{e}\|^{2}\right\};\\ \tilde{\bm{c}}^{k}=\mathop{\arg\min}\limits_{\bm{c}}\left\{\theta_{3}(\bm{c})-\bm{c}^{\top}\bm{C}^{\top}\bm{d}^{k}+\frac{\mu}{2}\|\bm{A}\tilde{\bm{a}}^{k}+\bm{B}\tilde{\bm{b}}^{k}+\bm{C}\bm{c}-\bm{e}\|^{2}\right\};\\ \tilde{\bm{d}}^{k}=\bm{d}^{k}-\mu(\bm{A}\tilde{\bm{a}}^{k}+\bm{B}{\bm{b}}^{k}+\bm{C}{\bm{c}}^{k}-\bm{e});\end{array}\right.

(A.12)

where $\bm{d}^{k}=(\bm{u}_{1}^{k},\dots,\bm{u}_{M}^{k},\bm{v}_{1}^{k},\dots,\bm{v}_{M}^{k})$ . Here, the updates for $\tilde{\bm{a}}^{k}$ , $\tilde{\bm{b}}^{k}$ , and $\tilde{\bm{c}}^{k}$ are essentially the same as those for ${\bm{a}}^{k+1}$ , ${\bm{b}}^{k+1}$ , and ${\bm{c}}^{k+1}$ in QPADM-Slack, or as improved in Section 4.1. For the sake of convenience in subsequent proofs, the update method for $\tilde{\bm{d}}^{k}$ differs from that in QPADM-Slack, primarily because it does not utilize the newly generated $\tilde{\bm{b}}^{k}$ and $\tilde{\bm{c}}^{k}$ .

According to Lemma 1, we have

\left\{\begin{array}[]{l}\theta_{1}(\bm{a})-\theta_{1}(\tilde{\bm{a}}^{k})+(\bm{a}-\tilde{\bm{a}}^{k})^{\top}\left[-\bm{A}^{\top}\bm{d}^{k}+\mu\bm{A}^{\top}(\bm{A}\tilde{\bm{a}}^{k}+\bm{B}\bm{b}^{k}+\bm{C}\bm{c}^{k}-\bm{e})\right];\\ \theta_{2}(\bm{b})-\theta_{2}(\tilde{\bm{b}}^{k})+(\bm{b}-\tilde{\bm{b}}^{k})^{\top}\left[-\bm{B}^{\top}\bm{d}^{k}+\mu\bm{B}^{\top}(\bm{A}\tilde{\bm{a}}^{k}+\bm{B}\tilde{\bm{b}}^{k}+\bm{C}\bm{c}^{k}-\bm{e})\right];\\ \theta_{3}(\bm{c})-\theta_{3}(\tilde{\bm{c}}^{k})+(\bm{c}-\tilde{\bm{c}}^{k})^{\top}\left[-\bm{C}^{\top}\bm{d}^{k}+\mu\bm{C}^{\top}(\bm{A}\tilde{\bm{a}}^{k}+\bm{B}\tilde{\bm{b}}^{k}+\bm{C}\tilde{\bm{c}}^{k}-\bm{e})\right];\\ (\bm{d}-\tilde{\bm{d}}^{k})^{\top}\left[(\tilde{\bm{d}}^{k}-{\bm{d}}^{k})/\mu+\bm{A}\tilde{\bm{a}}^{k}+\bm{B}{\bm{b}}^{k}+\bm{C}{\bm{c}}^{k}-\bm{e}\right]=0.\end{array}\right.

(A.13)

The last equation is derived from the final part of equation (A.12). By combining the four equations in (A.13), we obtain

\displaystyle\theta(\bm{f})-\theta(\tilde{\bm{f}}^{k})+(\bm{h}-\tilde{\bm{h}}^{k})^{T}F(\tilde{\bm{h}}^{k})\geq(\bm{g}-\tilde{\bm{g}}^{k})^{T}\bm{Q}(\bm{g}^{k}-\tilde{\bm{g}}^{k}),\quad\forall\bm{h}\in\Omega,

(A.14)

where $\bm{g}=(\bm{b},\bm{c},\bm{d})$ , and by definition, $\bm{f}$ and $\bm{g}$ are both components of $\bm{h}$ .

The second step (correction step) involves setting $\bm{a}^{k+1}=\tilde{\bm{a}}^{k}$ , followed by the equations:

\displaystyle\begin{bmatrix}{\bm{b}}^{k+1}\\ {\bm{c}}^{k+1}\\ {\bm{d}}^{k+1}\end{bmatrix}=\begin{bmatrix}{\bm{b}}^{k}\\ {\bm{c}}^{k}\\ {\bm{d}}^{k}\end{bmatrix}-\begin{bmatrix}\nu\bm{I}&-\nu(\bm{B}^{\top}\bm{B})^{-1}\bm{B}^{\top}\bm{C}&\bm{0}\\ \bm{0}&\nu\bm{I}&\bm{0}\\ -\mu\bm{B}&-\mu\bm{C}&{\bm{I}}\end{bmatrix}\begin{bmatrix}{\bm{b}}^{k}-\tilde{\bm{b}}^{k}\\ {\bm{c}}^{k}-\tilde{\bm{c}}^{k}\\ {\bm{d}}^{k}-\tilde{\bm{d}}^{k}\end{bmatrix}.

(A.15)

The first two rows of the above equation correspond to the correction steps for $\tilde{\bm{\xi}}^{k}_{m}$ and $\tilde{\bm{\eta}}^{k}_{m}$ , while the third row adjusts $\tilde{\bm{d}}^{k}$ to align with the expression for $\bm{d}^{k+1}$ in QPADM-Slack. The adjustment in the third row is necessary because QPADM-Slack (GB) only requires corrections for $\tilde{\bm{\xi}}^{k}_{m}$ and $\tilde{\bm{\eta}}^{k}_{m}$ . Hence, the correction step in (A.15) can be rewritten as

\displaystyle\bm{g}^{k+1}=\bm{g}^{k}-\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k}),

(A.16)

where the definition of the matrix $\bm{M}$ is given in (A.3).

The above discussion indicates that the QPADM-slack (GB) algorithm can be simplified into two steps: the prediction step (see (A.12)) and the correction step (see (A.15)). The prediction and correction steps can also be expressed as:

	$\displaystyle\theta(\bm{f})-\theta(\tilde{\bm{f}}^{k})+(\bm{h}-\tilde{\bm{h}}^{k})^{\top}F(\tilde{\bm{h}}^{k})\geq$	$\displaystyle\ (\bm{g}-\tilde{\bm{g}}^{k})^{\top}\bm{Q}(\bm{g}^{k}-\tilde{\bm{g}}^{k}),\quad\forall\bm{h}\in\Omega,$		(A.17)
	$\displaystyle\bm{g}^{k+1}=$	$\displaystyle\ \bm{g}^{k}-\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k}).$		(A.18)

A.2.1 Global Convergence

Next, based on the predictive and corrective steps, we derive that the sequence $\{\bm{g}^{k}\}$ exhibits a contraction property.

Proposition 1

For the sequence $\{\bm{g}^{k}\}$ generated by the QPADM-slack(GB) algorithm (including Algorithm 1 and Algorithm 3), the following inequality holds:

\displaystyle\|\bm{g}^{k+1}-\bm{g}^{*}\|_{\bm{H}}^{2}\leq\|\bm{g}^{k}-\bm{g}^{*}\|_{\bm{H}}^{2}-\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\|_{\bm{G}}^{2},\quad\forall\bm{g}^{*}\in\Omega^{*},

(A.19)

where $k>0$ and $\bm{G}$ and $\bm{H}$ are defined in (A.4).

Proof. By applying $\bm{H}\bm{M}=\bm{Q}$ (see (A.5)) and the update expression in (A.18), the right-hand side of (A.17) can be rewritten as:

\displaystyle\left[\theta(\bm{f})-\theta(\tilde{\bm{f}}^{k})+(\bm{h}-\tilde{\bm{h}}^{k})^{\top}\bm{F}(\tilde{\bm{h}}^{k})\right]\geq(\bm{g}-\tilde{\bm{g}}^{k})^{\top}\bm{H}(\bm{g}^{k}-\bm{g}^{k+1}),\quad\forall\bm{h}\in\Omega.

(A.20)

Using the identity in (A.1), let $\bm{a}=\bm{g}$ , $\bm{b}=\tilde{\bm{g}}^{k}$ , $\bm{c}=\bm{g}^{k}$ , and $\bm{d}=\bm{g}^{k+1}$ , we obtain:

	$\displaystyle(\bm{g}-\tilde{\bm{g}}^{k})^{\top}\bm{H}(\bm{g}^{k}-\bm{g}^{k+1})$	$\displaystyle=\frac{1}{2}\left(\\|\bm{g}-\bm{g}^{k+1}\\|_{\bm{H}}^{2}-\\|\bm{g}-\bm{g}^{k}\\|_{\bm{H}}^{2}\right)$
		$\displaystyle+\frac{1}{2}\left(\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{H}}^{2}-\\|\bm{g}^{k+1}-\tilde{\bm{g}}^{k}\\|_{\bm{H}}^{2}\right).$		(A.21)

For the last term in (A.2.1), we have:

$\displaystyle\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{H}}^{2}-\\|\bm{g}^{k+1}-\tilde{\bm{g}}^{k}\\|_{\bm{H}}^{2}$	$\displaystyle=\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{H}}^{2}-\\|(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k}-\bm{g}^{k+1})\\|_{\bm{H}}^{2}$	(A.22)
	$\displaystyle\overset{(\ref{c})}{=}\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{H}}^{2}-\\|(\bm{g}^{k}-\tilde{\bm{g}}^{k})-\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})\\|_{\bm{H}}^{2}$
	$\displaystyle=2(\bm{g}^{k}-\tilde{\bm{g}}^{k})^{\top}\bm{H}\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k}-\tilde{\bm{g}}^{k})^{\top}\bm{M}^{\top}\bm{H}\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})$
	$\displaystyle=(\bm{g}^{k}-\tilde{\bm{g}}^{k})^{\top}(\bm{Q}^{\top}+\bm{Q}-\bm{M}^{\top}\bm{H}\bm{M})(\bm{g}^{k}-\tilde{\bm{g}}^{k})$
	$\displaystyle\overset{(\ref{m2})}{=}\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{G}}^{2},$

where the second-to-last equality holds due to $\bm{H}\bm{M}=\bm{Q}$ and $2\bm{g}^{\top}\bm{Q}\bm{g}=\bm{g}^{\top}(\bm{Q}^{\top}+\bm{Q})\bm{g}$ .

Substituting (A.2.1) and (A.22) into (A.20), we obtain:

	$\displaystyle\left[\theta(\bm{f})-\theta(\tilde{\bm{f}}^{k})+(\bm{h}-\tilde{\bm{h}}^{k})^{\top}\bm{F}(\tilde{\bm{h}}^{k})\right]$	$\displaystyle\geq\frac{1}{2}\left(\\|\bm{g}-\bm{g}^{k+1}\\|_{\bm{H}}^{2}-\\|\bm{g}-\bm{g}^{k}\\|_{\bm{H}}^{2}\right)$		(A.23)
		$\displaystyle+\frac{1}{2}\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{G}}^{2},\quad\forall\bm{h}\in\Omega.$		(A.23)

Let $\bm{g}=\bm{g}^{*}$ and $\bm{f}=\bm{f}^{*}$ in (A.23), which yields

		$\displaystyle\\|\bm{g}^{k}-\bm{g}^{}\\|_{\bm{H}}^{2}-\\|\bm{g}^{k+1}-\bm{g}^{}\\|_{\bm{H}}^{2}$		(A.24)
		$\displaystyle\geq\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{G}}^{2}+2\left[\theta(\tilde{\bm{f}}^{k})-\theta(\bm{f}^{})+(\tilde{\bm{h}}^{k}-\bm{h}^{})^{\top}\bm{F}(\tilde{\bm{h}}^{k})\right].$		(A.24)

By the optimality of $\bm{g}^{*}$ and the monotonicity of $\bm{F}({\bm{h}})$ (see (A.11)), we have:

\displaystyle\theta(\tilde{\bm{f}}^{k})-\theta(\bm{f}^{*})+(\tilde{\bm{h}}^{k}-\bm{h}^{*})^{\top}\bm{F}(\tilde{\bm{h}}^{k})=\theta(\tilde{\bm{f}}^{k})-\theta(\bm{f}^{*})+(\tilde{\bm{h}}^{k}-\bm{h}^{*})^{\top}\bm{F}({\bm{h}^{*}})\geq 0.

(A.25)

The conclusion in (A.19) follows directly from (A.24) and (A.25).

The contraction property mentioned above is crucial for the convergence of the sequence. The proof of sequence convergence derived from (A.19) has been extensively documented in the literature, including Theorem 2 in He and Yuan (2018) and Theorem 4.1 in He et al. (2022). For completeness, this section provides a detailed proof process here.

Theorem 2

For the sequence $\{\bm{g}^{k}\}$ generated by the QPADM-slack(GB) algorithm (including Algorithms 1 and 3), it will converge to some $\bm{g}^{\infty}$ belonging to $\mathcal{V}^{*}$ , where $\mathcal{V}^{*}=\{(\bm{b}^{*},\bm{c}^{*},\bm{d}^{*})\mid(\bm{a}^{*},\bm{b}^{*},\bm{c}^{*},\bm{d}^{*})\in\Omega^{*}\}$ .

Proof. Based on (A.19), the sequence $\{\bm{g}^{k}\}$ is bounded, and

\displaystyle\lim_{k\to\infty}{\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\|}_{\bm{G}}=0.

(A.26)

Hence, $\{\tilde{\bm{g}}^{k}\}$ is also bounded. Let $\bm{g}^{\infty}$ be an accumulation point of $\{\tilde{\bm{g}}^{k}\}$ and $\{\tilde{\bm{g}}^{k_{j}}\}$ be a subsequence converging to $\bm{g}^{\infty}$ .

Recalling the statement in (A.14), the sequences $\{\tilde{\bm{g}}_{k}\}$ and $\{\tilde{\bm{g}}_{k_{j}}\}$ are associated with $\{\tilde{\bm{h}}_{k}\}$ and $\{\tilde{\bm{h}}_{k_{j}}\}$ respectively. According to (A.17), we have

\theta(\bm{f})-\theta(\tilde{\bm{f}}^{k_{j}})+(\bm{h}-\tilde{\bm{h}}^{k_{j}})^{\top}F(\tilde{\bm{h}}^{k_{j}})\geq\ (\bm{g}-\tilde{\bm{g}}^{k_{j}})^{\top}\bm{Q}(\bm{g}^{k}-\tilde{\bm{g}}^{k_{j}}),\quad\forall\bm{h}\in\Omega.

Noting that the matrix $\bm{G}$ is nonsingular (see (A.3)), this implies $\lim_{k\to\infty}(\bm{g}^{k}-\tilde{\bm{g}}^{k_{j}})=0$ . Due to the continuity of $\theta(\bm{f})$ and $F(\bm{h})$ , we obtain

\bm{h}^{\infty}\in\Omega,\quad\theta(\bm{f})-\theta(\bm{f}^{\infty})+(\bm{h}-\bm{h}^{\infty})^{\top}F(\bm{h}^{\infty})\geq 0,\quad\forall\bm{h}\in\Omega.

The above variational inequality indicates that $\bm{h}^{\infty}$ is a solution point of (A.8). Together with (A.19), we obtain

\displaystyle\|\bm{g}^{k+1}-\bm{g}^{\infty}\|_{H}\leq\|\bm{g}^{k}-\bm{g}^{\infty}\|_{H}.

(A.27)

Furthermore, according to (A.26) and the fact that $\lim\limits_{j\to\infty}\tilde{\bm{g}}^{k_{j}}=\bm{g}^{\infty}$ , the subsequence $\{\bm{g}^{k_{j}}\}$ also converges to $\bm{g}^{\infty}$ . Then, in conjunction with (A.27), it can be concluded that $\bm{g}^{k}$ does not possess more than one cluster point, thereby establishing the convergence of the sequence $\{\bm{g}^{k}\}$ to $\bm{g}^{\infty}$ .

A.2.2 Linear Convergence Rate

Here, we demonstrate that a worst-case convergence rate of $\mathcal{O}\left({1}/{K}\right)$ in a non-ergodic sense can be established for QPADM-slack using Gaussian back substitution. To do this, we first need to prove the following proposition.

Proposition 2

For the sequence $\{\bm{g}^{k}\}$ generated by the QPADM-slack(GB) algorithm (which includes Algorithm 1 and Algorithm 3), we have

\displaystyle\|\bm{g}^{k+1}-\bm{g}^{k+2}\|_{\bm{H}}^{2}\leq

\displaystyle\|\bm{g}^{k}-\bm{g}^{k+1}\|_{\bm{H}}^{2},

(A.28)

where $k>0$ , and $\bm{H}$ is defined in (A.4).

Proof. First, by setting $\bm{h}=\tilde{\bm{h}}^{k+1}$ in (A.17), we obtain

\displaystyle\theta(\tilde{\bm{f}}^{k+1})-\theta(\tilde{\bm{f}}^{k})+(\tilde{\bm{h}}^{k+1}-\tilde{\bm{h}}^{k})^{\top}F(\tilde{\bm{h}}^{k})\geq

\displaystyle\ (\tilde{\bm{g}}^{k+1}-\tilde{\bm{g}}^{k})^{\top}\bm{Q}(\bm{g}^{k}-\tilde{\bm{g}}^{k}).

(A.29)

Note that (A.17) is also true for $k=k+1$ . Thus, we also have

\displaystyle\theta(\bm{f})-\theta(\tilde{\bm{f}}^{k+1})+(\bm{h}-\tilde{\bm{h}}^{k+1})^{\top}F(\tilde{\bm{h}}^{k+1})\geq

\displaystyle\ (\bm{g}-\tilde{\bm{g}}^{k+1})^{\top}\bm{Q}(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1}),\ \forall\bm{h}\in\Omega.

Setting $\bm{h}=\tilde{\bm{h}}^{k}$ in the above inequality, we obtain

\displaystyle\theta(\tilde{\bm{f}}^{k})-\theta(\tilde{\bm{f}}^{k+1})+(\tilde{\bm{h}}^{k}-\tilde{\bm{h}}^{k+1})^{\top}F(\tilde{\bm{h}}^{k+1})\geq

\displaystyle\ (\tilde{\bm{g}}^{k}-\tilde{\bm{g}}^{k+1})^{\top}\bm{Q}(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1}),\ \forall\bm{h}\in\Omega.

(A.30)

By adding (A.29) and (A.30) , and utilizing the antisymmetry of $F$ in (A.11), we obtain

\displaystyle(\tilde{\bm{g}}^{k}-\tilde{\bm{g}}^{k+1})^{\top}\bm{Q}\left[(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})\right]\geq 0.

(A.31)

By adding the term

\left[(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})\right]^{\top}\bm{Q}\left[(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})\right]

to both sides of (A.31), and using $\bm{g}^{\top}\bm{Q}\bm{g}=\frac{1}{2}\bm{g}^{\top}(\bm{Q}^{\top}+\bm{Q})\bm{g}$ , we obtain

(\bm{g}^{k}-\bm{g}^{k+1})^{\top}\bm{Q}\left[(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})\right]\geq\frac{1}{2}\left\|(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})\right\|^{2}_{\bm{Q}^{\top}+\bm{Q}}.

By substituting

\displaystyle(\bm{g}^{k}-\bm{g}^{k+1})=\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})\ \text{and}\ (\bm{g}^{k+1}-\bm{g}^{k+2})=\bm{M}(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})

(A.32)

(from (A.18)) into the left-hand side of the last inequality, and using $\bm{Q}\bm{M}^{-1}=\bm{H}$ , it follows that

\displaystyle(\bm{g}^{k}-\bm{g}^{k+1})^{\top}\bm{H}\left[(\bm{g}^{k}-\bm{g}^{k+1})-(\bm{g}^{k+1}-\bm{g}^{k+2})\right]\geq\frac{1}{2}\left\|(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})\right\|^{2}_{\bm{Q}^{\top}+\bm{Q}}.

(A.33)

Letting $\bm{a}=(\bm{g}^{k}-{\bm{g}}^{k+1})$ and $\bm{b}=(\bm{g}^{k+1}-{\bm{g}}^{k+2})$ in the identity (A.2), we have

	$\displaystyle\\|\bm{g}^{k}-{\bm{g}}^{k+1}\\|_{\bm{H}}^{2}-\\|\bm{g}^{k+1}-{\bm{g}}^{k+2}\\|_{\bm{H}}^{2}$	$\displaystyle=2(\bm{g}^{k}-\bm{g}^{k+1})^{\top}\bm{H}\left[(\bm{g}^{k}-\bm{g}^{k+1})-(\bm{g}^{k+1}-\bm{g}^{k+2})\right]$
		$\displaystyle-\\|(\bm{g}^{k}-\bm{g}^{k+1})-(\bm{g}^{k+1}-\bm{g}^{k+2})\\|_{\bm{H}}^{2}.$

By inserting (A.33) into the first term on the right-hand side of the last equality, we obtain

	$\displaystyle\\|\bm{g}^{k}-{\bm{g}}^{k+1}\\|_{\bm{H}}^{2}-\\|\bm{g}^{k+1}-{\bm{g}}^{k+2}\\|_{\bm{H}}^{2}$	$\displaystyle=\left\\|(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})\right\\|^{2}_{\bm{Q}^{\top}+\bm{Q}}$
		$\displaystyle-\\|(\bm{g}^{k}-\bm{g}^{k+1})-(\bm{g}^{k+1}-\bm{g}^{k+2})\\|_{\bm{H}}^{2}.$

By inserting (A.32) into the second term on the right-hand side of the last equality, we get

	$\displaystyle\\|\bm{g}^{k}-{\bm{g}}^{k+1}\\|_{\bm{H}}^{2}-\\|\bm{g}^{k+1}-{\bm{g}}^{k+2}\\|_{\bm{H}}^{2}$	$\displaystyle=\left\\|(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})\right\\|^{2}_{\bm{Q}^{\top}+\bm{Q}}$
		$\displaystyle-\\|\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})-\bm{M}(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})\\|_{\bm{H}}^{2}.$

Since $\bm{G}=(\bm{Q}^{\top}+\bm{Q})-\bm{M}^{\top}\bm{H}\bm{M}$ is a positive-definite matrix, the assertion in (A.28) follows immediately.

Note that $\bm{Q}^{\top}+\bm{Q}$ and $\bm{G}=(\bm{Q}^{\top}+\bm{Q})-\bm{M}^{\top}\bm{H}\bm{M}$ are positive-definite matrices. Together with Proposition 1, there exists a constant $c_{0}>0$ such that

\displaystyle\|\bm{g}^{k+1}-\bm{g}^{*}\|_{\bm{H}}^{2}\leq\|\bm{g}^{k}-\bm{g}^{*}\|_{\bm{H}}^{2}-c_{0}\|\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})\|_{\bm{H}}^{2},\quad\forall\bm{g}^{*}\in\mathcal{V}^{*}.

(A.34)

Proposition 2 and (A.32) indicate that

\displaystyle\|\bm{M}(\bm{g}^{k+1}-\tilde{\bm{g}}^{k+1})\|_{\bm{H}}^{2}\leq\|\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})\|_{\bm{H}}^{2}.

(A.35)

Now, with (A.34) and (A.35), we can establish the worst-case $\mathcal{O}\left({1}/{K}\right)$ convergence rate in a nonergodic sense for QPADM-slack(GB).

Theorem 3

The sequence $\{\bm{g}^{k}\}$ generated by the QPADM-slack(GB) algorithm (including Algorithm 1 and Algorithm 3) satisfies, for any positive integer $K>0$ ,

\displaystyle\|\bm{g}^{K}-\bm{g}^{K+1}\|_{\bm{H}}^{2}\leq\frac{1}{c_{0}\left(K+1\right)}\|\bm{g}^{0}-\bm{g}^{*}\|_{\bm{H}}^{2}.

(A.36)

Proof. First, it follows from (A.34) that

\displaystyle\sum_{k=0}^{\infty}c_{0}\|\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})\|_{\bm{H}}^{2}\leq\|\bm{g}^{0}-\bm{g}^{*}\|_{\bm{H}}^{2},\quad\forall\bm{g}^{*}\in\mathcal{V}^{*}.

(A.37)

According to (A.35), the sequence $\{\|\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})\|_{\bm{H}}^{2}\}$ is monotonically non-increasing. Therefore, we have

\displaystyle(K+1)\|\bm{M}(\bm{g}^{K+1}-\tilde{\bm{g}}^{K+1})\|_{\bm{H}}^{2}\leq\sum_{k=0}^{K}\|\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})\|_{\bm{H}}^{2}.

(A.38)

It follows from(A.37) and (A.38) that

\displaystyle c_{0}(K+1)\|\bm{M}(\bm{g}^{K+1}-\tilde{\bm{g}}^{K+1})\|_{\bm{H}}^{2}\leq\|\bm{g}^{0}-\bm{g}^{*}\|_{\bm{H}}^{2}

(A.39)

The assertion (A.36) follows from (A.39) and $\bm{M}(\bm{g}^{K+1}-\tilde{\bm{g}}^{K+1})=(\bm{g}^{K}-\bm{g}^{K+1})$ immediately.

Appendix B Supplementary Experiments

Here, we will supplement numerical experiments on nonconvex regularization regression and classification problems. Note that when employing algorithms QPADM-slack(GB) and M-QPADM-slack(GB) to solve the problem of nonconvex regularizations, the LLA method (referred to as Algorithm 2) is actually used. Consequently, the number of iterations (Ite) corresponds to the total number of iterations generated within Algorithm 2. Typically, Algorithm 2 involves two ADMM iteration processes: one for solving the unweighted $\ell_{1}$ regularization problem and another for solving the weighted $\ell_{1}$ regularization problem. This counting method was also adopted by Wen et al. (2024) in their ADMM algorithms. We have implemented warm-start technique (Friedman et al. (2010)), which allows the second ADMM algorithm to converge rapidly, typically requiring only around ten to twenty iterations in most cases.

B.1 Supplementary Experiments for Section 5.1

In this section, we first supplement the experiments with nonconvex regularization terms for Table 1 (non-parallel computing environment), where the dimensions are chosen as $(n,p)=(1000,30000)$ and $(30000,1000)$ . The specific numerical results are provided in Table 4. The numerical results indicate that when solving the quantile regression problem with MCP and SCAD penalties, the ADMM algorithm with Gaussian back substitution still outperforms the one without it. Additionally, it is observed that the number of iterations required by algorithms QPADM-slack(GB) and M-QPADM-slack(GB) for these two nonconvex penalized quantile regressions is no more than 20 steps greater than that required for solving the quantile regression with $\ell_{1}$ penalty. This advantage is attributed to the warm-start technique, where the solution obtained from the $\ell_{1}$ penalized regularization serves as the initial solution for the weighted $\ell_{1}$ penalty regularization.

Table 4: Comparison results of ADMM algorithms under nonconvex penalties.

MCP	$(n,p)=(30000,1000)$					$(n,p)=(1000,30000)$
	P1	P2	AE	Ite	Time	P1	P2	AE	Ite	Time
QPADM	100	100	0.014(0.03)	138.1(14.5)	41.35(6.78)	82	100	7.532(1.12)	500+(0.0)	2038.7(68.6)
QPADMslack	100	100	0.051(0.05)	68.3(6.69)	37.56(5.27)	89	100	7.807(1.33)	266(28.6)	1276.5(30.6)
QPADM-slack(GB)	100	100	0.025(0.04)	52.1(5.44)	35.17(5.30)	93	100	7.352(1.08)	241(22.8)	1101.4(27.6)
M-QPADM-slack(GB)	100	100	0.017(0.03)	39.5(4.78)	24.42(3.08)	96	100	7.082(0.92)	172(16.3)	885.7(19.4)
SCAD	$(n,p)=(30000,1000)$					$(n,p)=(1000,30000)$
	P1	P2	AE	Ite	Time	P1	P2	AE	Ite	Time
QPADM	100	100	0.012(0.03)	142.3(14.9)	44.2(6.91)	85	100	7.233(1.02)	500+(0.0)	2056.2(67.9)
QPADMslack	100	100	0.056(0.05)	65.2(6.22)	34.78(5.10)	88	100	7.886(1.38)	251(24.6)	1202.5(28.9)
QPADM-slack(GB)	100	100	0.026(0.04)	53.0(5.61)	37.09(5.5)	94	100	7.426(1.12)	261(25.7)	1208.6(29.6)
M-QPADM-slack(GB)	100	100	0.015(0.03)	40.9(4.91)	26.32(3.17)	98	100	7.021(0.89)	175(16.7)	892.2(20.2)

*

The symbols in this table are defined as follows: P1 (%) represents the proportion of times $x_{1}$ is selected; P2 (%) represents the proportion of times $x_{6}$ , $x_{12}$ , $x_{15}$ , and $x_{20}$ are selected; AE denotes the absolute estimation error; Ite indicates the number of iterations; and Time (s) refers to the running time. The numbers in parentheses represent the corresponding standard deviations, and the optimal results are highlighted in bold.

Next, we supplement the experiments with nonconvex regularizer terms for Table 2 (parallel computing environment), where the dimensions are chosen as $(n,p)=(500000,1000)$ . The specific numerical results are provided in Table 5, which demonstrate that the Gaussian back substitution technique effectively enhances both the accuracy and efficiency of the QPADM-slack algorithm.

Table 5: Comparison results of different P-ADMM algorithms under nonconvex penalties.

	QPADM		MCP		QPADM		SCAD
M	Nonzero	AE	Ite	Time	Nonzero	AE	Ite	Time
5	26.5(2.01)	0.034(0.0005)	451.3(37.5)	180.4(15.02)	27.1(1.99)	0.035(0.0005)	479.2(39.2)	189.3(14.3)
10	27.2(2.09)	0.039(0.0006)	475.7(40.1)	98.3(6.99)	28.5(2.07)	0.040(0.0006)	496.3(42.7)	99.2(7.12)
20	29.7(2.16)	0.045(0.0008)	500+(0.00)	49.8(4.11)	29.9(2.18)	0.047(0.0007)	500+(0.00)	50.2(4.03)
	QPADM-slack		MCP		QPADM-slack		SCAD
M	Nonzero	AE	Ite	Time	Nonzero	AE	Ite	Time
5	25.1(1.97)	0.037(0.0006)	332.5(30.0)	141.2(9.66)	24.7(1.95)	0.038(0.0006)	341.6(29.8)	136.4(9.52)
10	29.1(2.06)	0.044(0.0007)	360.1(34.5)	75.2(5.23)	28.9(2.11)	0.043(0.0007)	366.8(32.3)	76.6(5.15)
20	31.1(2.17)	0.050(0.0008)	431.1(41.7)	42.1(2.52)	30.1(2.23)	0.049(0.0008)	425.9(40.2)	43.2(2.74)
	QPADM-slack(GB)		MCP		QPADM-slack(GB)		SCAD
M	Nonzero	AE	Ite	Time	Nonzero	AE	Ite	Time
5	23.5(1.91)	0.028(0.0005)	290.1(24.8)	87.2(6.53)	23.1(1.88)	0.031(0.0005)	271.2(25.2)	88.9(6.81)
10	24.4(1.98)	0.032(0.0005)	263.0(22.4)	55.1(3.99)	24.2(1.91)	0.035(0.0005)	269.3(23.5)	54.6(4.02)
20	25.1(2.03)	0.035(0.0006)	291.2(27.0)	39.5(2.35)	25.3(2.00)	0.038(0.0006)	283.4(27.6)	37.3(2.21)
	M-QPADM-slack(GB)		MCP		M-QPADM-slack(GB)		SCAD
M	Nonzero	AE	Ite	Time	Nonzero	AE	Ite	Time
5	14.10(1.40)	0.020(0.0004)	228.1(14.9)	71.1(4.90)	14.05(1.37)	0.022(0.0004)	215.3(15.1)	70.5(4.84)
10	14.5(1.47)	0.028(0.0005)	232.2(17.83)	40.1(2.73)	14.77(1.48)	0.027(0.0005)	219.6(16.2)	39.2(2.77)
20	15.3(1.51)	0.034(0.0005)	230.7(13.8)	22.0(1.55)	15.01(1.54)	0.031(0.0005)	221.3(14.1)	21.3(1.59)

*

Since the values of all methods for metrics P1 and P2 are 100, they are not listed in Table 5. “Nonzero” indicates the number of non-zero coefficients in the estimates. The numbers in parentheses represent the corresponding standard deviations, and the optimal results are shown in bold.

B.2 Supplementary Experiments for Section 5.2

In this section, we provide additional experiments involving nonconvex regularization terms for the data presented in Table 6 and Table 7. All experimental settings remain consistent with those described in Section 5.2. The numerical results in both tables demonstrate that incorporating Gaussian back substitution steps into the QPADM algorithm significantly enhances the sparsity and accuracy of nonconvex SVM classifiers.

Table 6: Comparative analysis of Sparsity, training, and average testing accuracies (%) for SCAD-SVM using M-QPADM-slack(GB), QPADM-slack, and QPADM-slack(GB)

	M-QPADM-slack(GB)			QPADM-slack(GB)			QPADM-slack
M	Sparsity	Train	Test	Sparsity	Train	Test	Sparsity	Train	Test
2	92.00	99.80	97.80	87.50	95.50	93.50	85.00	94.00	92.50
4	92.00	99.60	97.50	85.50	94.80	92.50	82.00	93.00	91.50
6	92.00	99.50	97.40	84.50	94.00	91.60	81.00	92.00	90.60
8	91.80	99.20	97.10	83.00	93.20	91.00	80.00	91.20	90.00
10	91.60	99.00	96.90	82.00	92.80	90.20	79.00	90.80	89.20
12	91.50	98.80	96.70	81.50	92.00	89.50	77.50	90.00	88.50
14	91.40	98.70	96.60	80.50	91.30	88.70	76.50	89.80	87.80
16	91.20	98.50	96.40	79.50	90.50	87.50	74.50	88.50	87.20
18	91.10	98.40	96.30	78.50	89.80	86.50	73.50	87.80	86.30
20	91.00	98.30	96.20	77.50	88.80	85.50	72.50	87.70	85.20

Table 7: Comparative analysis of Sparsity, training, and average testing accuracies (%) for MCP-SVM using M-QPADM-slack(GB), QPADM-slack, and QPADM-slack(GB)

	M-QPADM-slack(GB)			QPADM-slack(GB)			QPADM-slack
M	Sparsity	Train	Test	Sparsity	Train	Test	Sparsity	Train	Test
2	92.20	99.90	97.90	87.30	95.40	93.40	84.80	93.80	92.40
4	92.10	99.70	97.60	85.30	94.70	92.40	81.80	92.90	91.40
6	92.00	99.60	97.50	84.30	93.90	91.50	80.80	91.90	90.50
8	91.90	99.30	97.20	82.80	93.10	90.90	79.80	91.10	89.90
10	91.70	99.10	97.00	81.80	92.70	90.10	78.80	90.70	89.10
12	91.60	98.90	96.80	81.30	91.90	89.40	77.30	89.90	88.40
14	91.50	98.80	96.70	80.30	91.20	88.60	76.30	89.70	87.70
16	91.30	98.60	96.50	79.30	90.40	87.40	74.30	88.40	87.10
18	91.20	98.50	96.40	78.30	89.70	86.40	73.30	87.70	86.20
20	91.10	98.40	96.30	77.30	88.70	85.40	72.30	87.60	85.10

$\displaystyle\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{H}}^{2}-\\|\bm{g}^{k+1}-\tilde{\bm{g}}^{k}\\|_{\bm{H}}^{2}$	$\displaystyle=\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{H}}^{2}-\\|(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k}-\bm{g}^{k+1})\\|_{\bm{H}}^{2}$	(A.22)
	$\displaystyle\overset{(\ref{c})}{=}\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{H}}^{2}-\\|(\bm{g}^{k}-\tilde{\bm{g}}^{k})-\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})\\|_{\bm{H}}^{2}$
	$\displaystyle=2(\bm{g}^{k}-\tilde{\bm{g}}^{k})^{\top}\bm{H}\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})-(\bm{g}^{k}-\tilde{\bm{g}}^{k})^{\top}\bm{M}^{\top}\bm{H}\bm{M}(\bm{g}^{k}-\tilde{\bm{g}}^{k})$
	$\displaystyle=(\bm{g}^{k}-\tilde{\bm{g}}^{k})^{\top}(\bm{Q}^{\top}+\bm{Q}-\bm{M}^{\top}\bm{H}\bm{M})(\bm{g}^{k}-\tilde{\bm{g}}^{k})$
	$\displaystyle\overset{(\ref{m2})}{=}\\|\bm{g}^{k}-\tilde{\bm{g}}^{k}\\|_{\bm{G}}^{2},$