
Computing one-bit compressive sensing via zero-norm regularized DC loss model and its surrogate

Kai Chen, Ling Liang, and Shaohua Pan (corresponding author, shhpan@scut.edu.cn)
School of Mathematics, South China University of Technology, Guangzhou, China
Abstract

One-bit compressed sensing is very popular in signal processing and communications due to its low storage costs and low hardware complexity, but recovering the signal from only the one-bit information is a challenging task. In this paper, we propose a zero-norm regularized smooth difference of convexity (DC) loss model and derive a family of equivalent nonconvex surrogates covering the MCP and SCAD surrogates as special cases. Compared with the existing models, the new model and its SCAD surrogate have better robustness. To compute their \tau-stationary points, we develop a proximal gradient algorithm with extrapolation and establish the convergence of the whole iterate sequence. Moreover, the convergence is shown to have a linear rate under a mild condition by studying the KL property of exponent 0 of the models. Numerical comparisons with several state-of-the-art methods show that, in terms of solution quality, the proposed model and its SCAD surrogate are remarkably superior to the \ell_{p}-norm regularized models, and are comparable and even superior to those sparsity constrained models that take the true sparsity and the sign flip ratio as inputs.

Keywords: One-bit compressive sensing, zero-norm, DC loss, equivalent surrogates, global convergence, KL property

1 Introduction

Compressive sensing (CS) has made significant progress in theory and algorithms over the past few decades since the seminal works [10, 13]. It aims to recover a sparse signal x^{\rm true}\in\mathbb{R}^{n} from a small number of linear measurements. One-bit compressive sensing, as a variant of CS, was proposed in [6] and has attracted considerable interest in the past few years (see, e.g., [12, 23, 30, 46, 51, 21]). Unlike conventional CS, which relies on real-valued measurements, one-bit CS aims to reconstruct the sparse signal x^{\rm true} from the signs of the measurements. Such a setup is appealing because (i) the hardware implementation of the one-bit quantizer is low-cost and efficient; (ii) one-bit measurements are robust to nonlinear distortions [8]; and (iii) in certain situations, for example, when the signal-to-noise ratio is low, one-bit CS performs even better than the conventional one [25]. For the applications of one-bit CS, we refer to the recent survey paper [26].

1.1 Review on the related works

In the noiseless setup, one-bit CS acquires the measurements via the model b={\rm sgn}(\Phi x^{\rm true}), where \Phi\in\mathbb{R}^{m\times n} is the measurement matrix and the function {\rm sgn}(\cdot) is applied to \Phi x^{\rm true} in a component-wise way. Here, for any t\in\mathbb{R}, {\rm sgn}(t)=1 if t>0 and -1 otherwise, which differs slightly from the common {\rm sign}(\cdot). Following the theory of conventional CS, the ideal optimization model for one-bit CS is as follows:

minxn{x0s.t.b=sgn(Φx),x=1},\min_{x\in\mathbb{R}^{n}}\Big{\{}\|x\|_{0}\ \ {\rm s.t.}\ \ b={\rm sgn}(\Phi x),\,\|x\|=1\Big{\}}, (1)

where x0\|x\|_{0} denotes the zero-norm (i.e., the number of nonzero entries) of xnx\in\mathbb{R}^{n}, and x\|x\| means the Euclidean norm of xx. The unit sphere constraint is introduced into (1) to address the issue that the scale information of a signal is lost during the one-bit quantization. Due to the combinatorial properties of the functions sgn(){\rm sgn}(\cdot) and 0\|\cdot\|_{0}, the problem (1) is NP-hard. Some earlier works (see, e.g., [6, 30, 43]) mainly focus on its convex relaxation model, obtained by replacing the zero-norm by the 1\ell_{1}-norm and relaxing the consistency constraint b=sgn(Φx)b={\rm sgn}(\Phi x) into the linear constraint b(Φx)0b\circ(\Phi x)\geq 0, where the notation “\circ” means the Hadamard operation of vectors.

In practice, the measurement is often contaminated by noise before quantization, and some signs will be flipped after quantization due to quantization distortion, i.e.,

b=ζsgn(Φxtrue+ε)b=\zeta\circ{\rm sgn}(\Phi x^{\rm true}+\varepsilon) (2)

where ζ{1,1}m\zeta\in\{-1,1\}^{m} is a random binary vector and εm\varepsilon\in\mathbb{R}^{m} denotes the noise vector. Let L:m+L\!:\mathbb{R}^{m}\to\mathbb{R}_{+} be a loss function to ensure data fidelity as well as to tolerate the existence of sign flips. Then, it is natural to consider the zero-norm regularized loss model:

minxn{L(Ax)+λx0s.t.x=1}withA:=Diag(b)Φ,\min_{x\in\mathbb{R}^{n}}\Big{\{}L(Ax)+\lambda\|x\|_{0}\ \ {\rm s.t.}\ \ \|x\|=1\Big{\}}\ \ {\rm with}\ A:={\rm Diag}(b)\Phi, (3)

and achieve a desirable estimation of the true signal x^{\rm true} by tuning the parameter \lambda>0. Considering that the projection mapping onto the intersection of the sparsity constraint set and the unit sphere has a closed form, some researchers prefer the following model or a similar variant to achieve a desirable estimation of x^{\rm true} (see, e.g., [7, 46, 12, 52]):

minxn{L(Ax)s.t.x0s,x=1},\min_{x\in\mathbb{R}^{n}}\Big{\{}L(Ax)\ \ {\rm s.t.}\ \ \|x\|_{0}\leq s,\,\|x\|=1\Big{\}}, (4)

where the positive integer s is an estimate of the sparsity of x^{\rm true}. For this model, if the estimate s differs much from the true sparsity s^{*}, the mean-squared error (MSE) of the associated solutions becomes much worse. Take the model in [52] for example: if the estimate s differs from the true sparsity s^{*} by 2, the MSE of the associated solutions differs by at least 20\% (see Figure 1). Moreover, it is still unclear how to obtain such a tight estimate of s^{*}. We find that the numerical experiments for the zero-norm constrained models all use the true sparsity as an input (see [46, 52]). In this work, we are interested in the regularization models.

Figure 1: MSE of the solution yielded by GPSP with different s (the data are generated in the same way as in Section 5.1 with (m,n,s^{*})=(500,1000,5), (\mu,\varpi)=(0.3,0.1) and \Phi of type I)

The existing loss functions for one-bit CS are mostly convex, including the one-sided \ell_{2} loss [23, 46], the linear loss [31, 51], the one-sided \ell_{1} loss [23, 46, 33], the pinball loss [21] and the logistic loss [16]. Among them, the one-sided \ell_{1} loss is closely related to the hinge loss function in machine learning [11, 50], which was reported to have a performance superior to the one-sided \ell_{2} loss (see [23]), and the pinball loss provides a bridge between the hinge loss and the linear loss. One can observe that these convex loss functions all impose a large penalty on the flipped samples, which inevitably has a negative effect on the solution quality of the model (3). In fact, for the pinball loss in [21], as the involved parameter \tau gets closer to 0, the penalty degree on the flipped samples becomes smaller. This partly accounts for \tau=-0.2 instead of \tau=-1 being used for the numerical experiments there. Recently, Dai et al. [12] derived a one-sided zero-norm loss via the maximum a posteriori estimation of the true signal. This loss function and its lower semicontinuous (lsc) majorization proposed there impose a constant penalty on the flipped samples, but their combinatorial property brings much difficulty to the solution of the associated optimization models. Inspired by the superiority of the ramp loss in SVMs [9, 22], in this work we are interested in a more general DC loss:

Lσ(z):=i=1mϑσ(zi)withϑσ(t):={max(0,t)iftσ,σift<σ,L_{\sigma}(z):=\sum_{i=1}^{m}\vartheta_{\!\sigma}(z_{i})\ \ {\rm with}\ \vartheta_{\!\sigma}(t):=\left\{\begin{array}[]{cl}\max(0,-t)&{\rm if}\ t\geq-\sigma,\\ \sigma&{\rm if}\ t<-\sigma,\end{array}\right. (5)

where σ(0,1]\sigma\in(0,1] is a constant representing the penalty degree imposed on the flip outlier. Clearly, the DC function ϑσ\vartheta_{\!\sigma} imposes a small fixed penalty for those flip outliers.
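To make the measurement model (2) and the DC loss (5) concrete, the following NumPy sketch generates noisy one-bit measurements with random sign flips and evaluates L_\sigma(Ax) with A={\rm Diag}(b)\Phi as in (3); the Gaussian design, the noise level and the flip ratio used here are illustrative assumptions rather than the experimental settings of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_bit_measurements(Phi, x_true, noise_std=0.3, flip_ratio=0.1):
    """b = zeta o sgn(Phi x_true + eps) as in (2); sgn(0) is taken as -1,
    matching the convention sgn(t) = 1 if t > 0 and -1 otherwise."""
    m = Phi.shape[0]
    eps = noise_std * rng.standard_normal(m)
    b = np.where(Phi @ x_true + eps > 0, 1.0, -1.0)
    b[rng.random(m) < flip_ratio] *= -1.0          # random sign flips (zeta)
    return b

def dc_loss(z, sigma=1.0):
    """L_sigma(z) = sum_i theta_sigma(z_i) from (5): one-sided hinge on
    [-sigma, +inf), capped at the constant sigma for z_i < -sigma."""
    return float(np.sum(np.where(z >= -sigma, np.maximum(0.0, -z), sigma)))

# toy instance: an s-sparse unit-norm signal, a Gaussian Phi, and A = Diag(b) Phi
m, n, s = 500, 1000, 5
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x_true /= np.linalg.norm(x_true)
Phi = rng.standard_normal((m, n))
b = one_bit_measurements(Phi, x_true)
A = b[:, None] * Phi
print(dc_loss(A @ x_true))   # data-fidelity term of (3) at the true signal
```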

Due to the nonconvexity of the zero-norm and the sphere constraint, some researchers are interested in the convex relaxation of (3) obtained by replacing the zero-norm with the \ell_{1}-norm and the unit sphere constraint with the unit ball constraint; see [51, 21, 24, 31]. However, as in conventional CS, the \ell_{1}-norm convex relaxation not only has a weak sparsity-promoting ability but also leads to a biased solution; see the discussion in [14]. Motivated by this, many researchers resort to nonconvex surrogate functions of the zero-norm, such as the minimax concave penalty (MCP) [53, 20], the sorted \ell_{1} penalty [20], the logarithmic smoothing functions [40], the \ell_{q}\,(0<q<1)-norm [15], and the Schur-concave functions [32], and then develop algorithms for solving the associated nonconvex surrogate problems to achieve a better sparse solution. To the best of our knowledge, most of these algorithms lack a convergence certificate. Although many nonconvex surrogates of the zero-norm have been used for one-bit CS, no work has investigated the equivalence between the surrogate problems and the model (3) in a global sense.

1.2 Main contributions

The nonsmooth DC loss LσL_{\sigma} is desirable to ensure data fidelity and tolerate the existence of sign flips, but its nonsmoothness is inconvenient for the solution of the associated regularization model (3). With 0<γ<σ/20<\!\gamma<\!{\sigma}/{2} we construct a smooth approximation to it:

Lσ,γ(z):=i=1mϑσ,γ(zi)withϑσ,γ(t):={0ift>0,t2/(2γ)ifγ<t0,tγ/2ifσ+γ<t<γ,σγ2(t+σ+γ)24γif(σ+γ)tγσ,σγ/2ift<(σ+γ).L_{\sigma,\gamma}(z):=\sum_{i=1}^{m}\vartheta_{\sigma,\gamma}(z_{i})\ \ {\rm with}\ \ \vartheta_{\sigma,\gamma}(t)\!:=\!\left\{\begin{array}[]{cl}0&{\rm if}\ t>0,\\ t^{2}/(2\gamma)&{\rm if}\ -\!\gamma<t\leq 0,\\ -t-\gamma/2&{\rm if}\ -\!\sigma\!+\!\gamma<t<-\gamma,\\ \!\sigma\!-\!\frac{\gamma}{2}\!-\!\frac{(t+\sigma+\gamma)^{2}}{4\gamma}&{\rm if}\ -\!(\sigma\!+\!\gamma)\leq t\leq\!\gamma\!-\!\sigma,\\ \!\sigma\!-\!{\gamma}/2&{\rm if}\ t<-(\sigma\!+\!\gamma).\end{array}\right. (6)

Clearly, as the parameter \gamma approaches 0, \vartheta_{\!\sigma,\gamma} becomes closer to \vartheta_{\!\sigma}. As illustrated in Figure 2, the smooth function \vartheta_{\!\sigma,\gamma} approximates \vartheta_{\!\sigma} very well even with \gamma=0.05. Therefore, in this paper we are interested in the zero-norm regularized smooth DC loss model

minxnFσ,γ(x):=Lσ,γ(Ax)+δ𝒮(x)+λx0,\min_{x\in\mathbb{R}^{n}}F_{\sigma,\gamma}(x):=L_{\sigma,\gamma}(Ax)+\delta_{\mathcal{S}}(x)+\lambda\|x\|_{0}, (7)

where 𝒮\mathcal{S} denotes a unit sphere whose dimension is known from the context, and δ𝒮\delta_{\mathcal{S}} means the indicator function of 𝒮\mathcal{S}, i.e., δ𝒮(x)=0\delta_{\mathcal{S}}(x)=0 if x𝒮x\in\mathcal{S} and otherwise δ𝒮(x)=+\delta_{\mathcal{S}}(x)=+\infty.
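For reference, the piecewise formula (6) can be transcribed directly; the following vectorized NumPy sketch (our own arrangement, with \sigma=1 and \gamma=0.05 as in Figure 2) evaluates \vartheta_{\sigma,\gamma} and can be used to reproduce its approximation behaviour.

```python
import numpy as np

def theta_smooth(t, sigma=1.0, gamma=0.05):
    """Pointwise value of theta_{sigma,gamma} in (6); requires 0 < gamma < sigma/2."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)                            # covers t > 0 by default
    r2 = (t > -gamma) & (t <= 0.0)                    # -gamma < t <= 0
    r3 = (t > gamma - sigma) & (t <= -gamma)          # -sigma+gamma < t <= -gamma
    r4 = (t >= -(sigma + gamma)) & (t <= gamma - sigma)
    r5 = t < -(sigma + gamma)
    out[r2] = t[r2] ** 2 / (2 * gamma)
    out[r3] = -t[r3] - gamma / 2
    out[r4] = sigma - gamma / 2 - (t[r4] + sigma + gamma) ** 2 / (4 * gamma)
    out[r5] = sigma - gamma / 2
    return out
```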

Figure 2: Approximation degree of \vartheta_{\!\sigma,\gamma} with different \gamma to \vartheta_{\!\sigma} for \sigma=1

Let \mathscr{L} denote the family of proper lsc convex functions ϕ:(,+]\phi\!:\mathbb{R}\to(-\infty,+\infty] satisfying

int(domϕ)[0,1], 1>t:=argmin0t1ϕ(t),ϕ(t)=0andϕ(1)=1.{\rm int}({\rm dom}\,\phi)\supseteq[0,1],\ 1>t^{*}\!:=\mathop{\arg\min}_{0\leq t\leq 1}\phi(t),\ \phi(t^{*})=0\ \ {\rm and}\ \ \phi(1)=1. (8)

With an arbitrary ϕ\phi\in\!\mathscr{L}, the model (7) is reformulated as a mathematical program with an equilibrium constraint (MPEC) in Section 3, and by studying its global exact penalty induced by the equilibrium constraint, we derive a family of equivalent surrogates

minxnGσ,γ,ρ(x):=Lσ,γ(Ax)+δ𝒮(x)+λρφρ(x),\mathop{\min}_{x\in\mathbb{R}^{n}}G_{\sigma,\gamma,\rho}(x):=L_{\sigma,\gamma}(Ax)+\delta_{\mathcal{S}}(x)+\lambda\rho\varphi_{\rho}(x), (9)

in the sense that the problem (9) associated to every ρ>ρ¯\rho>\overline{\rho} has the same global optimal solution set as the problem (7) does. Here φρ(x):=x11ρi=1nψ(ρ|xi|)\varphi_{\rho}(x)\!:=\!\|x\|_{1}-\!\frac{1}{\rho}\!\sum_{i=1}^{n}\psi^{*}(\rho|x_{i}|) with ρ>0\rho>0 being the penalty parameter and ψ\psi^{*} being the conjugate function of ψ\psi:

ψ(ω):=supt{ωtψ(t)}forψ(t):={ϕ(t)ift[0,1],+otherwise.\psi^{*}(\omega):=\sup_{t\in\mathbb{R}}\big{\{}\omega t-\psi(t)\big{\}}\ \ {\rm for}\ \psi(t)\!:=\!\left\{\begin{array}[]{cl}\phi(t)&{\rm if}\ t\in[0,1],\\ +\infty&{\rm otherwise}.\end{array}\right.

This family of equivalent surrogates is illustrated to include the one associated to the MCP function (see [49, 53, 20]) and the SCAD function [14]. The SCAD function corresponds to ϕ(t)=a1a+1t2+2a+1t(a>1)\phi(t)=\frac{a-1}{a+1}t^{2}+\frac{2}{a+1}t\ (a>1) for tt\in\mathbb{R}, whose conjugate has the form

ψ(ω)={0ω2a+1,((a+1)ω2)24(a21)2a+1<ω2aa+1,ω1ω>2aa+1,forω.\psi^{*}(\omega)=\begin{cases}0&\omega\leq\frac{2}{a+1},\\ \frac{((a+1)\omega-2)^{2}}{4(a^{2}-1)}&\frac{2}{a+1}<\omega\leq\frac{2a}{a+1},\\ \omega-1&\omega>\frac{2a}{a+1},\end{cases}\quad{\rm for}\ \omega\in\mathbb{R}. (10)

Figure 3 below shows that G_{\sigma,\gamma,\rho} with \psi^{*} in (10) approximates F_{\sigma,\gamma} very well for \rho\geq 2, though the model (9) has the same global optimal solution set as the model (7) does only when \rho is over the theoretical threshold \overline{\rho}. Unless otherwise stated, the function G_{\sigma,\gamma,\rho} appearing in the rest of this paper always represents the one associated to \psi^{*} in (10).
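A direct transcription of (10) and of the induced zero-norm surrogate term is given below; the value a=3.7 is only a commonly used SCAD choice taken for illustration (the paper merely requires a>1), and the scaling by \rho follows the term \lambda\rho\varphi_{\rho} in (9).

```python
import numpy as np

def psi_star_scad(omega, a=3.7):
    """Conjugate psi* in (10) generated by the SCAD choice of phi."""
    omega = np.asarray(omega, dtype=float)
    return np.where(omega <= 2 / (a + 1), 0.0,
           np.where(omega <= 2 * a / (a + 1),
                    ((a + 1) * omega - 2) ** 2 / (4 * (a ** 2 - 1)),
                    omega - 1.0))

def zero_norm_surrogate(x, rho, a=3.7):
    """rho * varphi_rho(x) = sum_i ( rho|x_i| - psi*(rho|x_i|) ).  Each summand
    lies in [0, 1] and equals 1 once |x_i| >= 2a/((a+1)rho), so this quantity
    approximates ||x||_0; multiplying by lambda gives the last term of (9)."""
    u = rho * np.abs(np.asarray(x, dtype=float))
    return float(np.sum(u - psi_star_scad(u, a)))
```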

Figure 3: Approximation effect of |t|-\rho^{-1}\psi^{*}(\rho|t|) with \psi^{*} in (10) to {\rm sign}(|t|)

For the nonconvex nonsmooth optimization problems (7) and (9), we develop a proximal gradient (PG) method with extrapolation to solve them, establish the convergence of the whole iterate sequence generated, and analyze its local linear convergence rate under a mild condition. The main contributions of this paper can be summarized as follows:

  • (i)

    We introduce a smooth DC loss function well suited for data with a high sign flip ratio, and propose the zero-norm regularized smooth DC loss model (7) which, unlike the models in [7, 12, 46, 52], does not require any a priori information on the sparsity of the true signal or the number of sign flips. In particular, a family of equivalent nonconvex surrogates is derived for the model (7). We also introduce a class of \tau-stationary points for the model (7) and its equivalent surrogate (9) associated to \psi^{*} in (10), which is a stronger notion than that of limiting critical points of the corresponding objective functions.

  • (ii)

    By characterizing the closed form of the proximal operators of \delta_{\mathcal{S}}(\cdot)+\lambda\|\cdot\|_{0} and \delta_{\mathcal{S}}(\cdot)+\lambda\|\cdot\|_{1}, we develop a proximal gradient (PG) algorithm with extrapolation for solving the problem (7) (PGe-znorm) and its surrogate (9) associated to \psi^{*} in (10) (PGe-scad), and establish the convergence of the whole iterate sequences. Also, by analyzing the KL property of exponent 0 of F_{\sigma,\gamma} and G_{\sigma,\gamma,\rho}, the convergence is shown to have a linear rate under a mild condition. It is worth pointing out that verifying whether a nonconvex and nonsmooth function has the KL property of exponent not more than 1/2 is not an easy task because there is a lack of criteria for it.

  • (iii)

    Numerical experiments indicate that the proposed models armed with PGe-znorm and PGe-scad are robust to a large range of \lambda, and numerical comparisons with several state-of-the-art methods demonstrate that the proposed models are well suited for high noise and/or high sign flip ratios. The obtained solutions are remarkably superior to those yielded by other regularization models, and for data with a high flip ratio they are also superior to those yielded by the models with the true sparsity as an input, in terms of MSE and Hamming error.

2 Notation and preliminaries

Throughout this paper, \overline{\mathbb{R}} denotes the extended real number set (-\infty,\infty], I and e denote an identity matrix and a vector of all ones, whose dimensions are known from the context, and \{e^{1},\ldots,e^{n}\} denotes the orthonormal basis of \mathbb{R}^{n}. For an integer k>0, write [k]:=\{1,2,\ldots,k\}. For a vector z\in\mathbb{R}^{n}, |z|_{\rm nz} denotes the smallest nonzero entry of the vector |z|, z^{\downarrow} means the vector of the entries of z arranged in a nonincreasing order, and z^{s,\downarrow} means the vector (z_{1}^{\downarrow},\ldots,z_{s}^{\downarrow})^{\mathbb{T}}. For given index sets I\subseteq[m] and J\subseteq[n], A_{I\!J}\in\mathbb{R}^{|I|\times|J|} denotes the submatrix of A_{J} consisting of those rows A_{i} with i\in I, and A_{J}\in\mathbb{R}^{m\times|J|} denotes the submatrix of A consisting of those columns A_{j} with j\in J. For a proper h\!:\mathbb{R}^{n}\to\overline{\mathbb{R}}, {\rm dom}\,h:=\{z\in\mathbb{R}^{n}\,|\,h(z)<+\infty\} denotes its effective domain, and for any given -\infty<\eta_{1}<\eta_{2}<\infty, [\eta_{1}<h<\eta_{2}] represents the set \{x\in\mathbb{R}^{n}\,|\,\eta_{1}<h(x)<\eta_{2}\}. For any \lambda>0,\rho>0,0<\gamma<\sigma/2 and any x\in\mathbb{R}^{n}, write \Gamma(x):=\{i\in[m]\,|\,-\gamma\leq(Ax)_{i}\leq 0\}\cup\{i\in[m]\,|\,-\sigma-\gamma\leq(Ax)_{i}\leq\gamma-\sigma\} and define

fσ,γ(x):=Lσ,γ(Ax),Ξσ,γ(x):=fσ,γ(x)λi=1nψ(ρ|x|i),\displaystyle f_{\sigma,\gamma}(x):=L_{\sigma,\gamma}(Ax),\ \ \Xi_{\sigma,\gamma}(x):=f_{\sigma,\gamma}(x)-\lambda{\textstyle\sum_{i=1}^{n}}\psi^{*}(\rho|x|_{i}), (11)
gλ(x):=δ𝒮(x)+λx0andhλ,ρ(x):=δ𝒮(x)+λρx1.\displaystyle g_{\lambda}(x):=\delta_{\mathcal{S}}(x)+\lambda\|x\|_{0}\ \ {\rm and}\ \ h_{\lambda,\rho}(x):=\delta_{\mathcal{S}}(x)+\lambda\rho\|x\|_{1}. (12)

For a proper lsc h:n¯h\!:\mathbb{R}^{n}\to\overline{\mathbb{R}}, the proximal mapping of hh associated to τ>0\tau>0 is defined as

𝒫τh(x):=argminzn{12τzx2+h(z)}xn.\mathcal{P}_{\!\tau}h(x)\!:=\mathop{\arg\min}_{z\in\mathbb{R}^{n}}\big{\{}\frac{1}{2\tau}\|z-x\|^{2}+h(z)\big{\}}\quad\ \forall x\in\mathbb{R}^{n}.

When hh is convex, 𝒫τh\mathcal{P}_{\!\tau}h is a Lipschitz continuous mapping with modulus 11. When hh is an indicator function of a closed set CnC\subseteq\mathbb{R}^{n}, 𝒫τh\mathcal{P}_{\!\tau}h is the projection mapping ΠC\Pi_{C} onto CC.

2.1 Proximal mappings of gλg_{\lambda} and hλ,ρh_{\lambda,\rho}

To characterize the proximal mapping of the nonconvex nonsmooth function g_{\lambda}, we need the following lemma, whose proof is omitted due to its simplicity.

Lemma 2.1

Fix any zn\{0}z\in\mathbb{R}^{n}\backslash\{0\} and an integer s1s\geq 1. Consider the following problem

S(z):=argminxn{12xz2s.t.x=1,x0=s}.S^{*}(z):=\mathop{\arg\min}_{x\in\mathbb{R}^{n}}\Big{\{}\frac{1}{2}\|x-z\|^{2}\ \ {\rm s.t.}\ \ \|x\|=1,\|x\|_{0}=s\Big{\}}. (13)

Then, S(z)={P𝕋(|z|s1,;|z|i;0)(|z|s1,;|z|i;0)|i{s,,n}issuchthat|z|i=|z|s}S^{*}(z)=\big{\{}\frac{P^{\mathbb{T}}(|z|^{s-1,\downarrow};|z|_{i};0)}{\|(|z|^{s-1,\downarrow};|z|_{i};0)\|}\,|\ i\in\{s,\ldots,n\}\ {\rm is\ such\ that}\ |z|_{i}=|z|_{s}^{\downarrow}\big{\}}, where PP is an n×nn\times n signed permutation matrix such that Pz=|z|Pz=|z|^{\downarrow}.

Proposition 2.1

Fix any λ>0\lambda>0 and τ>0\tau>0. For any znz\in\mathbb{R}^{n}, by letting PP be an n×nn\times n signed permutation matrix such that Pz=|z|Pz=|z|^{\downarrow}, it holds that 𝒫τgλ(z)=P𝕋𝒬τλ(|z|)\mathcal{P}_{\!\tau}g_{\lambda}(z)=P^{\mathbb{T}}\mathcal{Q}_{\tau\lambda}(|z|^{\downarrow}) with

𝒬ν(y):=argminxn{12xy2+νx0s.t.x=1}yn.\mathcal{Q}_{\nu}(y):=\mathop{\arg\min}_{x\in\mathbb{R}^{n}}\Big{\{}\frac{1}{2}\|x-y\|^{2}+\nu\|x\|_{0}\ \ {\rm s.t.}\ \|x\|=1\Big{\}}\quad\forall y\in\mathbb{R}^{n}. (14)

For any y0y\neq 0 with y1yn0y_{1}\geq\cdots\geq y_{n}\geq 0, by letting χj(y):=yj,yj1,\chi_{j}(y)\!:=\!\|y^{j,\downarrow}\|-\!\|y^{j-1,\downarrow}\| with y0,=0y^{0,\downarrow}=0, 𝒬ν(y)={yy}\mathcal{Q}_{\nu}(y)\!=\!\{\frac{y}{\|y\|}\} if νχn(y)\nu\leq\!\chi_{n}(y); 𝒬ν(y)={(yi|yi|,0,,0)𝕋|i[n]issuchthatyi=y1}\mathcal{Q}_{\nu}(y)\!=\!\big{\{}(\frac{y_{i}}{|y_{i}|},0,\ldots,0)^{\mathbb{T}}\,|\,i\in[n]\ {\rm is\ such\ that}\ y_{i}=y_{1}\big{\}} if νχ1(y)\nu\geq\!\chi_{1}(y); otherwise 𝒬ν(y):={(yl,yl,;0)|l[n]issuchthatν(χl+1(y),χl(y)]}\mathcal{Q}_{\nu}(y)\!:=\!\big{\{}(\frac{y^{l,\downarrow}}{\|y^{l,\downarrow}\|};0)\ |\ l\in[n]\ {\rm is\ such\ that}\ \nu\in(\chi_{l+1}(y),\chi_{l}(y)]\big{\}}.

Proof: By the definition of gλg_{\lambda}, for any znz\in\mathbb{R}^{n}, 𝒫τgλ(z)=𝒬τλ(z)\mathcal{P}_{\!\tau}g_{\lambda}(z)=\mathcal{Q}_{\tau\lambda}(z). Since for any n×nn\times n signed permutation matrix QQ and any znz\in\mathbb{R}^{n}, Qz=z\|Qz\|=\|z\| and Qz0=z0\|Qz\|_{0}=\|z\|_{0}, it is easy to verify that 𝒬ν(z)=P𝕋𝒬ν(|z|)\mathcal{Q}_{\nu}(z)=P^{\mathbb{T}}\mathcal{Q}_{\nu}(|z|^{\downarrow}). The first part of the conclusions then follows. For the second part, we first argue that the following inequality relations hold:

χ1(y)χ2(y)χn(y).\chi_{1}(y)\geq\chi_{2}(y)\geq\cdots\geq\chi_{n}(y). (15)

Indeed, for each j{1,2,,n1}j\in\{1,2,\ldots,n\!-\!1\}, from the definition of yjy^{j}, it is immediate to have

yj2yj12=yj2yj+12=yj+12yj2andyj+yj1yj+1+yj.\displaystyle\|y^{j}\|^{2}\!-\!\|y^{j-1}\|^{2}=y_{j}^{2}\geq y_{j+1}^{2}=\|y^{j+1}\|^{2}\!-\!\|y^{j}\|^{2}\ \ {\rm and}\ \ \|y^{j}\|\!+\!\|y^{j-1}\|\leq\|y^{j+1}\|\!+\!\|y^{j}\|.

Along with χj(y)=yj2yj12yj+yj1\chi_{j}(y)=\frac{\|y^{j}\|^{2}-\|y^{j-1}\|^{2}}{\|y^{j}\|+\|y^{j-1}\|}, we get χj(y)χj+1(y)\chi_{j}(y)\geq\chi_{j+1}(y) and the relations in (15) hold. Let υ(y)\upsilon^{*}(y) denote the optimal value of (14). Then υ(y)=min{χ¯1(y),,χ¯n(y)}\upsilon^{*}(y)=\min\{\overline{\chi}_{1}(y),\ldots,\overline{\chi}_{n}(y)\} with

χ¯s(y):=minxn{12xy2+νx0s.t.x0=s,x=1}fors=1,,n.\overline{\chi}_{s}(y):=\min_{x\in\mathbb{R}^{n}}\Big{\{}\frac{1}{2}\|x-y\|^{2}+\nu\|x\|_{0}\ \ {\rm s.t.}\ \ \|x\|_{0}=s,\|x\|=1\Big{\}}\ \ {\rm for}\ s=1,\ldots,n. (16)

From Lemma 2.1, it follows that χ¯s(y)=12(1+y22ys,)+νs\overline{\chi}_{s}(y)=\frac{1}{2}(1+\|y\|^{2}-2\|y^{s,\downarrow}\|)+\nu s. Then,

\Delta\overline{\chi}_{s}(y):=\overline{\chi}_{s+1}(y)-\overline{\chi}_{s}(y)=\|y^{s,\downarrow}\|-\|y^{s+1,\downarrow}\|+\nu=-\chi_{s+1}(y)+\nu.

When \nu\leq\chi_{n}(y), we have \nu\leq\chi_{s}(y) for all s=1,\ldots,n. From the last equation, \Delta\overline{\chi}_{s}(y)\leq 0 for s=1,\ldots,n-1, which means that \overline{\chi}_{1}(y)\geq\overline{\chi}_{2}(y)\geq\cdots\geq\overline{\chi}_{n}(y). Hence, \upsilon^{*}(y)=\overline{\chi}_{n}(y), and \mathcal{Q}_{\nu}(y)=\{\frac{y}{\|y\|}\} follows by Lemma 2.1. Using similar arguments, we can obtain the rest of the conclusions. \Box
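Proposition 2.1 translates directly into a procedure for computing one element of \mathcal{P}_{\tau}g_{\lambda}(z): sort |z|, count how many of the gaps \chi_{j} are at least \nu=\tau\lambda, keep that many leading entries and renormalize. The sketch below (our own implementation) also checks the attained objective against the value formula obtained from (16) and Lemma 2.1.

```python
import numpy as np

def prox_g_lambda(z, nu):
    """One element of P_tau g_lambda(z) with nu = tau*lambda (Proposition 2.1)."""
    idx = np.argsort(-np.abs(z))                    # indices by decreasing |z_i|
    cum = np.sqrt(np.cumsum(np.abs(z)[idx] ** 2))   # ||y^{j,down}|| for j = 1,...,n
    chi = cum - np.concatenate(([0.0], cum[:-1]))   # chi_j(y), nonincreasing by (15)
    l = max(1, int(np.sum(chi >= nu)))              # number of entries to keep
    x = np.zeros_like(z, dtype=float)
    x[idx[:l]] = z[idx[:l]] / cum[l - 1]            # keep top-l entries, renormalize
    return x

# check: the attained value 0.5*||x-z||^2 + nu*||x||_0 should equal
# min_s  0.5*(1 + ||z||^2 - 2*||z^{s,down}||) + nu*s
rng = np.random.default_rng(1)
z, nu = rng.standard_normal(8), 0.3
x = prox_g_lambda(z, nu)
attained = 0.5 * np.linalg.norm(x - z) ** 2 + nu * np.count_nonzero(x)
cum = np.sqrt(np.cumsum(np.sort(np.abs(z))[::-1] ** 2))
values = 0.5 * (1 + z @ z - 2 * cum) + nu * np.arange(1, z.size + 1)
assert abs(attained - values.min()) < 1e-10
```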

To characterize the proximal mapping of the nonconvex nonsmooth function h_{\lambda,\rho}, we need the following lemma, whose proof is omitted due to its simplicity.

Lemma 2.2

Let 𝒮+:=𝒮+n\mathcal{S}_{+}:=\mathcal{S}\cap\mathbb{R}_{+}^{n}. For any znz\in\!\mathbb{R}^{n}, by letting PP be an n×nn\times n permutation matrix such that Pz=zPz=z^{\downarrow}, it holds that Π𝒮+(z)=P𝕋Π𝒮+(z)\Pi_{\mathcal{S}_{+}}(z)=P^{\mathbb{T}}\Pi_{\mathcal{S}_{+}}(z^{\downarrow}). Also, for any yny\!\in\mathbb{R}^{n} with y1yny_{1}\geq\cdots\geq y_{n}, Π𝒮+(y)={ei|i[n]issuchthatyi=y1}\Pi_{\mathcal{S}_{+}}(y)=\big{\{}e_{i}\,|\,i\in[n]\ {\rm is\ such\ that}\ y_{i}=y_{1}\big{\}} if y10y_{1}\leq 0; Π𝒮+(y)={yy}\Pi_{\mathcal{S}_{+}}(y)=\big{\{}\frac{y}{\|y\|}\big{\}} if yn0y_{n}\geq 0, otherwise Π𝒮+(y)={(y1,,yj,0,,0)𝕋(y1,,yj,0,,0)𝕋|j[n1]issuchthatyj>0yj+1}\Pi_{\mathcal{S}_{+}}(y)=\big{\{}\frac{(y_{1},\ldots,y_{j},0,\ldots,0)^{\mathbb{T}}}{\|(y_{1},\ldots,y_{j},0,\ldots,0)^{\mathbb{T}}\|}\,|\,j\in[n\!-\!1]\ {\rm is\ such\ that}\ y_{j}>0\geq y_{j+1}\big{\}}.

Proposition 2.2

Fix any λ>0,ρ>0\lambda>0,\rho>0 and τ>0\tau>0. For any znz\in\mathbb{R}^{n}, by letting PP be an n×nn\times n signed permutation matrix with Pz=|z|Pz=|z|^{\downarrow}, 𝒫τhλ,ρ(z)=P𝕋Π𝒮+(|z|τλρe)\mathcal{P}_{\!\tau}h_{\lambda,\rho}(z)=P^{\mathbb{T}}\Pi_{\mathcal{S}_{+}}(|z|^{\downarrow}\!-\!\tau\lambda\rho e).

Proof: Fix any ξn\xi\in\mathbb{R}^{n} with ξ1ξ2ξn0\xi_{1}\geq\xi_{2}\geq\cdots\geq\xi_{n}\geq 0. Consider the following problem

𝒫ν(ξ):=argminxn{12xξ2+νx1s.t.x=1}\mathcal{P}_{\nu}(\xi):=\mathop{\arg\min}_{x\in\mathbb{R}^{n}}\Big{\{}\frac{1}{2}\big{\|}x-\xi\big{\|}^{2}+\nu\|x\|_{1}\ \ {\rm s.t.}\ \|x\|=1\Big{\}} (17)

where ν>0\nu>0 is a regularization parameter. By the definition of hλ,ρh_{\lambda,\rho}, 𝒫τhλ,ρ(z)=𝒫τλρ(z)\mathcal{P}_{\!\tau}h_{\lambda,\rho}(z)=\mathcal{P}_{\!\tau\lambda\rho}(z), so it suffices to argue that 𝒫ν(ξ)=Π𝒮+(ξνe)\mathcal{P}_{\!\nu}(\xi)=\Pi_{\mathcal{S}_{+}}(\xi-\nu e). Indeed, if xx^{*} is a global optimal solution of (17), then x0x^{*}\geq 0 necessarily holds. If not, we will have J:={j|xj<0}J:=\{j\,|\,x_{j}^{*}<0\}\neq\emptyset. Let J¯={1,,n}\J\overline{J}=\{1,\ldots,n\}\backslash J. Take x~i=xi\widetilde{x}_{i}^{*}=x_{i}^{*} for each iJ¯i\in\overline{J} and x~i=xi\widetilde{x}_{i}^{*}=-x_{i}^{*} for each iJi\in J. Clearly, x~0\widetilde{x}^{*}\geq 0 and x~=1\|\widetilde{x}^{*}\|=1. However, it holds that 12x~ξ2+νx~112xξ2+νx1\frac{1}{2}\big{\|}\widetilde{x}^{*}-\xi\big{\|}^{2}+\nu\|\widetilde{x}^{*}\|_{1}\leq\frac{1}{2}\big{\|}x^{*}-\xi\big{\|}^{2}+\nu\|x^{*}\|_{1}, which contradicts the fact that xx^{*} is a global optimal solution of (17). This implies that 𝒫ν(ξ)=argminxn{12xξ2+νe,xs.t.x0,x=1}.\mathcal{P}_{\nu}(\xi)=\mathop{\arg\min}_{x\in\mathbb{R}^{n}}\big{\{}\frac{1}{2}\big{\|}x-\xi\big{\|}^{2}+\nu\langle e,x\rangle\ \ {\rm s.t.}\ x\geq 0,\|x\|=1\big{\}}. Consequently, 𝒫ν(ξ)=Π𝒮+(ξνe)\mathcal{P}_{\nu}(\xi)=\Pi_{\mathcal{S}_{+}}(\xi-\nu e). The desired equality then follows. \Box
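In plain terms, Proposition 2.2 says that \mathcal{P}_{\!\tau}h_{\lambda,\rho} amounts to soft-thresholding by \tau\lambda\rho followed by projection onto the unit sphere, with a signed coordinate vector returned when all entries are thresholded away. A minimal sketch of this rule:

```python
import numpy as np

def prox_h_lambda_rho(z, nu):
    """One element of P_tau h_{lambda,rho}(z) with nu = tau*lambda*rho
    (Proposition 2.2)."""
    v = np.maximum(np.abs(z) - nu, 0.0)         # soft-threshold the magnitudes
    if v.max() > 0.0:
        return np.sign(z) * v / np.linalg.norm(v)
    i = int(np.argmax(np.abs(z)))               # all entries thresholded away:
    x = np.zeros_like(z, dtype=float)           # return a signed unit coordinate
    x[i] = 1.0 if z[i] >= 0 else -1.0           # vector at the largest |z_i|
    return x
```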

2.2 Generalized subdifferentials

Definition 2.1

(see [36, Definition 8.3]) Consider a function h:n¯h\!:\mathbb{R}^{n}\to\overline{\mathbb{R}} and a point xdomhx\in{\rm dom}h. The regular subdifferential of hh at xx, denoted by ^h(x)\widehat{\partial}h(x), is defined as

^h(x):={vn|lim infxxxxh(x)h(x)v,xxxx0};\widehat{\partial}h(x):=\bigg{\{}v\in\mathbb{R}^{n}\ \big{|}\ \liminf_{x^{\prime}\to x\atop x^{\prime}\neq x}\frac{h(x^{\prime})-h(x)-\langle v,x^{\prime}-x\rangle}{\|x^{\prime}-x\|}\geq 0\bigg{\}};

and the (limiting) subdifferential of hh at xx, denoted by h(x)\partial h(x), is defined as

h(x):={vn|xkxwithh(xk)h(x)andvk^h(xk)vask}.\partial h(x):=\Big{\{}v\in\mathbb{R}^{n}\,|\,\exists\,x^{k}\to x\ {\rm with}\ h(x^{k})\to h(x)\ {\rm and}\ v^{k}\in\widehat{\partial}h(x^{k})\to v\ {\rm as}\ k\to\infty\Big{\}}.
Remark 2.1

(i) At each xdomhx\in{\rm dom}h, ^h(x)h(x)\widehat{\partial}h(x)\subseteq\partial h(x), ^h(x)\widehat{\partial}h(x) is always closed and convex, and h(x)\partial h(x) is closed but generally nonconvex. When hh is convex, ^h(x)=h(x)\widehat{\partial}h(x)=\partial h(x), which is precisely the subdifferential of hh at xx in the sense of convex analysis.

(ii) Let {(xk,vk)}k\{(x^{k},v^{k})\}_{k\in\mathbb{N}} be a sequence in the graph of h\partial h that converges to (x,v)(x,v) as kk\to\infty. By invoking Definition 2.1, if h(xk)h(x)h(x^{k})\to h(x) as kk\to\infty, then vh(x)v\in\partial h(x).

(iii) A point x¯\overline{x} at which 0h(x¯)0\in\partial h(\overline{x}) (0^h(x¯)0\in\widehat{\partial}h(\overline{x})) is called a limiting (regular) critical point of hh. In the sequel, we denote by crith{\rm crit}\,h the limiting critical point set of hh.

When hh is an indicator function of a closed set CC, the subdifferential of hh at xCx\in C is the normal cone to CC at xx, denoted by 𝒩C(x)\mathcal{N}_{C}(x). The following lemma characterizes the (regular) subdifferentials of Fσ,γF_{\sigma,\gamma} and Gσ,γ,ρG_{\sigma,\gamma,\rho} at any point of their domains.

Lemma 2.3

Fix any λ>0,ρ>0\lambda>0,\rho>0 and 0<γ<σ/20<\!\gamma<\!{\sigma}/{2}. Consider any x𝒮x\in\mathcal{S}. Then,

  • (i)

    fσ,γf_{\sigma,\gamma} is a smooth function whose gradient fσ,γ\nabla\!f_{\sigma,\gamma} is Lipschitz continuous with the modulus Lf1γA2L_{\!f}\leq\frac{1}{\gamma}\|A\|^{2}.

  • (ii)

    ^Fσ,γ(x)=Fσ,γ(x)=fσ,γ(x)+𝒩𝒮(x)+λx0\widehat{\partial}F_{\sigma,\gamma}(x)=\partial F_{\sigma,\gamma}(x)=\nabla\!f_{\sigma,\gamma}(x)+\mathcal{N}_{\mathcal{S}}(x)+\lambda\partial\|x\|_{0}.

  • (iii)

    ^Gσ,γ,ρ(x)=Gσ,γ,ρ(x)=Ξσ,γ(x)+𝒩𝒮(x)+λρx1\widehat{\partial}G_{\sigma,\gamma,\rho}(x)=\partial G_{\sigma,\gamma,\rho}(x)\!=\!\nabla\Xi_{\sigma,\gamma}(x)\!+\!\mathcal{N}_{\mathcal{S}}(x)+\lambda\rho\partial\|x\|_{1}.

  • (iv)

    When |x|nz2aρ(a1)|x|_{\rm nz}\geq\frac{2a}{\rho(a-1)}, it holds that Gσ,γ,ρ(x)Fσ,γ(x)\partial G_{\sigma,\gamma,\rho}(x)\subseteq\partial F_{\sigma,\gamma}(x).

Proof: (i) The result is immediate by the definition of fσ,γf_{\sigma,\gamma} and the expression of Lσ,γL_{\sigma,\gamma}.

(ii) From [42, Lemma 3.1-3.2 & 3.4], ^gλ(x)=gλ(x)=𝒩𝒮(x)+λx0\widehat{\partial}g_{\lambda}(x)=\partial g_{\lambda}(x)=\mathcal{N}_{\mathcal{S}}(x)+\lambda\partial\|x\|_{0}. Together with part (i) and [36, Exercise 8.8], we obtain the desired result.

(iii) By the convexity and Lipschitz continuity of 1\ell_{1}-norm and [36, Exercise 10.10], it follows that hλ,ρ(x)=𝒩𝒮(x)+λρx1\partial h_{\lambda,\rho}(x)\!=\mathcal{N}_{\mathcal{S}}(x)+\lambda\rho\partial\|x\|_{1}. Let θρ(z):=ρ1i=1nψ(ρ|zi|)\theta_{\!\rho}(z)\!:=\!\rho^{-1}\sum_{i=1}^{n}\psi^{*}(\rho|z_{i}|) for znz\in\mathbb{R}^{n}. Clearly, Ξσ,γ=fσ,γλρθρ\Xi_{\sigma,\gamma}=f_{\sigma,\gamma}-\lambda\rho\theta_{\!\rho}. By the expression of ψ\psi^{*} in (10), it is easy to verify that θρ\theta_{\!\rho} is smooth and θρ\nabla\theta_{\!\rho} is Lipschitz continuous with modulus ρmax(a+12,a+12(a1))\rho\max(\frac{a+1}{2},\frac{a+1}{2(a-1)}). Hence, Ξσ,γ\Xi_{\sigma,\gamma} is a smooth function whose gradient is Lipschitz continuous. Together with [36, Exercise 8.8] and Gσ,γ,ρ=Ξσ,γ+hλ,ρG_{\sigma,\gamma,\rho}=\Xi_{\sigma,\gamma}+h_{\lambda,\rho}, we obtain the desired equalities.

(iv) Let θρ\theta_{\!\rho} be the function defined as above. After an elementary calculation, we have

θρ(x)=((ψ)(ρ|x1|)sign(x1),,(ψ)(ρ|xn|)sign(xn))𝕋.\nabla\theta_{\!\rho}(x)\!=\big{(}(\psi^{*})^{\prime}(\rho|x_{1}|){\rm sign}(x_{1}),\ldots,(\psi^{*})^{\prime}(\rho|x_{n}|){\rm sign}(x_{n})\big{)}^{\mathbb{T}}.

Along with |x|nz2aρ(a1)|x|_{\rm nz}\geq\frac{2a}{\rho(a-1)} and the expression of ψ\psi^{*} in (10), we have θρ(x)=sign(x)\nabla\theta_{\!\rho}(x)={\rm sign}(x) and x1θρ(x)ρx0\partial\|x\|_{1}\!-\!\nabla\theta_{\!\rho}(x)\subseteq\rho\partial\|x\|_{0}. By part (iii), Ξσ,γ(x)=fσ,γ(x)λρθρ(x)\nabla\Xi_{\sigma,\gamma}(x)=\nabla\!f_{\sigma,\gamma}(x)-\lambda\rho\nabla\theta_{\!\rho}(x). Comparing Gσ,γ,ρ(x)\partial G_{\sigma,\gamma,\rho}(x) in part (iii) with Fσ,γ(x)\partial F_{\sigma,\gamma}(x) in part (ii) yields that Gσ,γ,ρ(x)Fσ,γ(x)\partial G_{\sigma,\gamma,\rho}(x)\subseteq\partial F_{\sigma,\gamma}(x). \Box

2.3 Stationary points

Lemma 2.3 shows that for the functions Fσ,γF_{\sigma,\gamma} and Gσ,γ,ρG_{\sigma,\gamma,\rho} the set of their regular critical points coincides with that of their limiting critical points, so we call the critical point of Fσ,γF_{\sigma,\gamma} a stationary point of (7), and the critical point of Gσ,γ,ρG_{\sigma,\gamma,\rho} a stationary point of (9). Motivated by the work [4], we introduce a class of τ\tau-stationary points for them.

Definition 2.2

Let τ>0\tau>0. A vector xnx\in\mathbb{R}^{n} is called a τ\tau-stationary point of (7) if x𝒫τgλ(xτfσ,γ(x))x\in\mathcal{P}_{\!\tau}g_{\lambda}(x\!-\!\tau\nabla\!f_{\sigma,\gamma}(x)), and is called a τ\tau-stationary point of (9) if x𝒫τhλ,ρ(xτΞσ,γ(x))x\in\mathcal{P}_{\!\tau}h_{\lambda,\rho}(x\!-\!\tau\nabla\Xi_{\sigma,\gamma}(x)).
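Since Definition 2.2 characterizes a \tau-stationary point as a fixed point of the corresponding prox-gradient map, it can be checked numerically once the gradient and a proximal routine are available; the helper below is a hedged sketch (the proximal mappings are set-valued, so it only certifies the fixed-point property for the particular element returned by the supplied routine).

```python
import numpy as np

def is_tau_stationary(x, grad, prox, tau, tol=1e-8):
    """Check x in P_tau g(x - tau*grad(x)) up to a tolerance, where `prox` returns
    one element of the proximal mapping (e.g. the routines sketched after
    Propositions 2.1 and 2.2) and `grad` is the gradient of the smooth part."""
    y = prox(x - tau * grad(x))
    return np.linalg.norm(y - x) <= tol * max(1.0, np.linalg.norm(x))
```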

In the sequel, we denote by Sτ,gλS_{\tau,g_{\lambda}} and Sτ,hλ,ρS_{\tau,h_{\lambda,\rho}} the τ\tau-stationary point set of (7) and (9), respectively. By Proposition 2.1 and 2.2, we have the following result for them.

Lemma 2.4

Fix any τ>0,λ>0,ρ>0\tau>0,\lambda>0,\rho>0 and 0<γ<σ/20<\!\gamma<\!{\sigma}/{2}. Then, Sτ,gλcritFσ,γS_{\tau,g_{\lambda}}\subseteq{\rm crit}F_{\sigma,\gamma} and Sτ,hλ,ρcritGσ,γ,ρS_{\tau,h_{\lambda,\rho}}\subseteq{\rm crit}G_{\sigma,\gamma,\rho}.

Proof: Pick any x¯Sτ,gλ\overline{x}\in\!S_{\tau,g_{\lambda}}. Then x¯=𝒫τgλ(x¯τfσ,γ(x¯))\overline{x}=\mathcal{P}_{\!\tau}g_{\lambda}(\overline{x}\!-\!\tau\nabla\!f_{\sigma,\gamma}(\overline{x})). By Proposition 2.1, for each isupp(x¯)i\in{\rm supp}(\overline{x}), x¯i=α1[x¯iτ(fσ,γ(x¯))i]\overline{x}_{i}=\alpha^{-1}[\overline{x}_{i}\!-\!\tau(\nabla\!f_{\sigma,\gamma}(\overline{x}))_{i}] for some α>0\alpha\!>0 (depending on x¯\overline{x}). Then, for each isupp(x¯)i\in{\rm supp}(\overline{x}), it holds that (fσ,γ(x¯))i+τ1(α1)x¯i=0(\nabla\!f_{\sigma,\gamma}(\overline{x}))_{i}+\tau^{-1}(\alpha\!-\!1)\overline{x}_{i}=0. Recall that

𝒩𝒮(x¯)={βx¯|β}andx¯0={vn|vi=0forisupp(x¯)}.\mathcal{N}_{\mathcal{S}}(\overline{x})=\big{\{}\beta\overline{x}\,|\,\beta\in\mathbb{R}\big{\}}\ \ {\rm and}\ \ \partial\|\overline{x}\|_{0}=\big{\{}v\in\mathbb{R}^{n}\,|\,v_{i}=0\ {\rm for}\ i\in{\rm supp}(\overline{x})\big{\}}. (18)

We have 0fσ,γ(x¯)+𝒩𝒮(x¯)+λx¯00\in\nabla\!f_{\sigma,\gamma}(\overline{x})+\mathcal{N}_{\mathcal{S}}(\overline{x})+\lambda\partial\|\overline{x}\|_{0}, and hence x¯critFσ,γ\overline{x}\in{\rm crit}F_{\sigma,\gamma} by Lemma 2.3 (ii).

Pick any \overline{x}\in S_{\tau,h_{\lambda,\rho}}. Write \overline{u}=\overline{x}-\tau\nabla\Xi_{\sigma,\gamma}(\overline{x}). Then, we have \overline{x}\in\mathcal{P}_{\!\tau}h_{\lambda,\rho}(\overline{u}). Let J:=\{i\in[n]\,|\,|\overline{u}_{i}|>\tau\lambda\rho\} and \overline{J}=[n]\backslash J. For each i\in\overline{J}, |(\nabla\Xi_{\sigma,\gamma}(\overline{x}))_{i}-\tau^{-1}\overline{x}_{i}|\leq\lambda\rho. Since the subdifferential of the function t\mapsto|t| at 0 is [-1,1], it holds that

0[Ξσ,γ(x¯)]J¯τ1x¯J¯+λρx¯J¯1.0\in[\nabla\Xi_{\sigma,\gamma}(\overline{x})]_{\overline{J}}-\tau^{-1}\overline{x}_{\overline{J}}+\lambda\rho\partial\|\overline{x}_{\overline{J}}\|_{1}.

By Proposition 2.2, we have x¯J=u¯Jτλρsign(u¯J)u¯Jτλρsign(u¯J).\overline{x}_{J}=\frac{\overline{u}_{J}-\tau\lambda\rho{\rm sign}(\overline{u}_{J})}{\|\overline{u}_{J}-\tau\lambda\rho{\rm sign}(\overline{u}_{J})\|}. Together with sign(u¯J)=sign(x¯J){\rm sign}(\overline{u}_{J})={\rm sign}(\overline{x}_{J}),

(Ξσ,γ(x¯))J+τ1(u¯Jτλρsign(u¯J)1)x¯J+λρsign(x¯J)=0.(\nabla\Xi_{\sigma,\gamma}(\overline{x}))_{J}+\tau^{-1}(\|\overline{u}_{J}\!-\!\tau\lambda\rho{\rm sign}(\overline{u}_{J})\|\!-\!1)\overline{x}_{J}+\lambda\rho{\rm sign}(\overline{x}_{J})=0.

By the expression of 𝒩𝒮(x¯)\mathcal{N}_{\mathcal{S}}(\overline{x}) in (18), from the last two equations it follows that

0Ξσ,γ(x¯)+𝒩𝒮(x¯)+λρx¯1.0\in\nabla\Xi_{\sigma,\gamma}(\overline{x})+\mathcal{N}_{\mathcal{S}}(\overline{x})+\lambda\rho\partial\|\overline{x}\|_{1}.

By Lemma 2.3 (iii), this shows that x¯critGσ,γ,ρ\overline{x}\in{\rm crit}G_{\sigma,\gamma,\rho}. The proof is completed. \Box

Note that if x¯\overline{x} is a stationary point of (7), then for isupp(x¯)i\notin{\rm supp}(\overline{x}), [𝒫τgλ(x¯τfσ,γ(x¯))]i[\mathcal{P}_{\!\tau}g_{\lambda}(\overline{x}\!-\!\tau\nabla\!f_{\sigma,\gamma}(\overline{x}))]_{i} does not necessarily equal 0. A similar case also occurs for the stationary point of (9). This means that the two inclusions in Lemma 2.4 are generally strict. By combining Lemma 2.4 with [36, Theorem 10.1], it is immediate to obtain the following conclusion.

Corollary 2.1

Fix any \tau>0. For the problems (7) and (9), every local optimal solution is necessarily a stationary point, and consequently a \tau-stationary point.

2.4 Kurdyka-Łojasiewicz property

Definition 2.3

(see [2]) A proper lsc function h:n¯h\!:\mathbb{R}^{n}\to\overline{\mathbb{R}} is said to have the KL property at x¯domh\overline{x}\in{\rm dom}\,\partial h if there exist η(0,+]\eta\in(0,+\infty], a neighborhood 𝒰\mathcal{U} of x¯\overline{x}, and a continuous concave function φ:[0,η)+\varphi\!:[0,\eta)\to\mathbb{R}_{+} that is continuously differentiable on (0,η)(0,\eta) with φ(s)>0\varphi^{\prime}(s)>0 for all s(0,η)s\in(0,\eta) and φ(0)=0\varphi(0)=0, such that for all x𝒰[h(x¯)<h<h(x¯)+η]x\in\mathcal{U}\cap\big{[}h(\overline{x})<h<h(\overline{x})+\eta\big{]},

φ(h(x)h(x¯))dist(0,h(x))1.\varphi^{\prime}(h(x)-h(\overline{x})){\rm dist}(0,\partial h(x))\geq 1.

If φ\varphi can be chosen as φ(t)=ct1θ\varphi(t)=ct^{1-\theta} with θ[0,1)\theta\in[0,1) for some c>0c>0, then hh is said to have the KL property of exponent θ\theta at x¯\overline{x}. If hh has the KL property (of exponent θ\theta) at each point of domh{\rm dom}\,\partial h, then it is called a KL function (of exponent θ\theta).

Remark 2.2

(a) As discussed thoroughly in [2, Section 4], a large number of nonconvex nonsmooth functions are KL functions, including real semi-algebraic functions and those functions definable in an o-minimal structure.

(b) From [2, Lemma 2.1], a proper lsc function has the KL property of exponent \theta=0 at any noncritical point. Thus, to prove that a proper lsc h\!:\mathbb{R}^{n}\to\overline{\mathbb{R}} is a KL function (of exponent \theta), it suffices to establish its KL property (of exponent \theta) at critical points. For the calculation of the KL exponent, please refer to the recent works [34, 48].

3 Equivalent surrogates of the model (7)

Pick any ϕ\phi\in\!\mathscr{L}. By invoking equation (8), it is immediate to verify that for any xnx\in\mathbb{R}^{n},

x0=minw[0,e]{i=1nϕ(wi)s.t.ew,|x|=0}.\|x\|_{0}=\min_{w\in[0,e]}\Big{\{}\textstyle{\sum_{i=1}^{n}}\phi(w_{i})\ \ {\rm s.t.}\ \langle e-\!w,|x|\rangle=0\Big{\}}.

This means that the zero-norm regularized problem (7) can be reformulated as

minx𝒮,w[0,e]{fσ,γ(x)+λi=1nϕ(wi)s.t.ew,|x|=0}\min_{x\in\mathcal{S},w\in[0,e]}\Big{\{}f_{\sigma,\gamma}(x)+\lambda\textstyle{\sum_{i=1}^{n}}\phi(w_{i})\quad\mbox{s.t.}\ \ \langle e-w,|x|\rangle=0\Big{\}} (19)

in the following sense: if xx^{*} is globally optimal to the problem (7), then (x,sign(|x|))(x^{*}\!,{\rm sign}(|x^{*}|)) is a global optimal solution of the problem (19), and conversely, if (x,w)(x^{*},w^{*}) is a global optimal solution of (19), then xx^{*} is globally optimal to (7). The problem (19) is a mathematical program with an equilibrium constraint ew0,|x|0,ew,|x|=0e\!-\!w\geq 0,|x|\geq 0,\langle e-w,|x|\rangle=0. In this section, we shall show that the penalty problem induced by this equilibrium constraint, i.e.,

minx𝒮,w[0,e]{fσ,γ(x)+λi=1nϕ(wi)+ρλew,|x|}\min_{x\in\mathcal{S},w\in[0,e]}\Big{\{}f_{\sigma,\gamma}(x)+\lambda\textstyle{\sum_{i=1}^{n}}\phi(w_{i})+\rho\lambda\langle e-w,|x|\rangle\Big{\}} (20)

is a global exact penalty of (19) and from this global exact penalty achieve the equivalent surrogate in (9), where ρ>0\rho>0 is the penalty parameter. For each s[n]s\in[n], write

Ωs:=𝒮swiths:={xn|x0s}.\Omega_{s}:=\mathcal{S}\cap\mathcal{R}_{s}\ \ {\rm with}\ \ \mathcal{R}_{s}:=\{x\in\mathbb{R}^{n}\,|\,\|x\|_{0}\leq s\}.

To get the conclusion of this section, we need the following global error bound result.

Lemma 3.1

For each s{1,2,,n}s\in\{1,2,\ldots,n\}, there exists κs>0\kappa_{s}>0 such that for all x𝒮x\in\mathcal{S},

dist(x,Ωs)κs[x1x(s)],{\rm dist}(x,\Omega_{s})\leq\kappa_{s}\big{[}\|x\|_{1}-\|x\|_{(s)}\big{]},

where x(s)\|x\|_{(s)} denotes the sum of the first ss largest entries of the vector xnx\in\mathbb{R}^{n}.

Proof: Fix any s{1,2,,n}s\in\{1,2,\ldots,n\}. We first argue that the following multifunction

Υs(τ):={x𝒮|x1x(s)=τ}forτ\Upsilon_{\!s}(\tau):=\big{\{}x\in\mathcal{S}\,|\,\|x\|_{1}-\|x\|_{(s)}=\tau\big{\}}\ \ {\rm for}\ \tau\in\mathbb{R}

is calm at 0 for every xΥs(0)x\in\Upsilon_{\!s}(0). Pick any x^Υs(0)\widehat{x}\in\Upsilon_{\!s}(0). By [39, Theorem 3.1], the calmness of Υs\Upsilon_{\!s} at 0 for x^\widehat{x} is equivalent to the existence of δ>0\delta>0 and κ>0\kappa>0 such that

dist(x,Ωs)κ[dist(x,𝒮)+dist(x,s)]forallx𝔹(x^,δ).{\rm dist}(x,\Omega_{s})\leq\kappa\big{[}{\rm dist}(x,\mathcal{S})+{\rm dist}(x,\mathcal{R}_{s})\big{]}\ \ {\rm for\ all}\ x\in\mathbb{B}(\widehat{x},\delta). (21)

Since x^=1\|\widehat{x}\|=1, there exists ε(0,1/2)\varepsilon\in(0,1/2) such that for all x𝔹(x^,ε)x\in\mathbb{B}(\widehat{x},\varepsilon), x0x\neq 0. Fix any x𝔹(x^,ε/2)x\in\mathbb{B}(\widehat{x},{\varepsilon}/{2}). Clearly, xx^ε/23/4\|x\|\geq\|\widehat{x}\|-{\varepsilon}/{2}\geq{3}/{4}. This means that x34n\|x\|_{\infty}\!\geq\frac{3}{4\sqrt{n}}. Pick any xΠs(x)x^{*}\in\Pi_{\mathcal{R}_{s}}(x). Clearly, x=x34n\|x^{*}\|_{\infty}=\|x\|_{\infty}\geq\frac{3}{4\sqrt{n}} and xxΩs\frac{x^{*}}{\|x^{*}\|}\in\Omega_{s}. Then, with x¯=xx\overline{x}=\frac{x}{\|x\|},

dist(x,Ωs)\displaystyle{\rm dist}(x,\Omega_{s}) xx/xxx¯+x¯x/x\displaystyle\leq\|x-x^{*}\!/{\|x^{*}\|}\|\leq\|x-\overline{x}\|+\|\overline{x}-x^{*}\!/{\|x^{*}\|}\|
xx¯+(xx)x+x(xx)xx\displaystyle\leq\|x-\overline{x}\|+\frac{\|(x-x^{*})\|x\|+x(\|x^{*}\|-\|x\|)\|}{\|x\|\|x^{*}\|}
xx¯+(2/x)xxdist(x,𝒮)+3ndist(x,s).\displaystyle\leq\|x-\overline{x}\|+(2/\|x^{*}\|)\|x-x^{*}\|\leq{\rm dist}(x,\mathcal{S})+3\sqrt{n}{\rm dist}(x,\mathcal{R}_{s}).

This shows that the inequality (21) holds for δ=ε/2\delta=\varepsilon/2 and κ=3n\kappa=3\sqrt{n}. Consequently, the mapping Υs\Upsilon_{\!s} is calm at 0 for every xΥs(0)x\in\Upsilon_{\!s}(0). Now by invoking [39, Theorem 3.3] and the compactness of 𝒮\mathcal{S}, we obtain the desired result. The proof is completed. \Box

Now we are ready to show that the problem (20) is a global exact penalty of (19).

Proposition 3.1

Let ρ¯:=κϕ(1)(1t)αfλ(1t0)\overline{\rho}:=\frac{\kappa\phi_{-}^{\prime}(1)(1-t^{*})\alpha_{\!f}}{\lambda(1-t_{0})} where t0[0,1)t_{0}\in[0,1) is such that 11tϕ(t0)\frac{1}{1-t^{*}}\in\partial\phi(t_{0}), ϕ(1)\phi_{-}^{\prime}(1) is the left derivative of ϕ\phi at 11, αf\alpha_{\!f} is the Lipschitz constant of fσ,γf_{\sigma,\gamma} on 𝒮\mathcal{S}, and κ=max1snκs\kappa=\max_{1\leq s\leq n}\kappa_{s} with κs\kappa_{s} given by Lemma 3.1. Then, for any (x,w)𝒮×[0,e](x,w)\in\mathcal{S}\times[0,e],

[fσ,γ(x)+λi=1nϕ(wi)][fσ,γ(x)+λi=1nϕ(wi)]+ρ¯λew,|x|0,\big{[}f_{\sigma,\gamma}(x)+\lambda{\textstyle\sum_{i=1}^{n}}\phi(w_{i})\big{]}-\big{[}f_{\sigma,\gamma}(x^{*})+\lambda{\textstyle\sum_{i=1}^{n}}\phi(w_{i}^{*})\big{]}+\overline{\rho}\lambda\langle e\!-\!w,|x|\rangle\geq 0, (22)

where (x,w)(x^{*},w^{*}) is an arbitrary global optimal solution of (19), and consequently the problem (20) associated to each ρ>ρ¯\rho>\overline{\rho} has the same global optimal solution set as (19) does.

Proof: By Lemma 3.1 and κ=max1snκs\kappa=\max_{1\leq s\leq n}\kappa_{s}, for each s{1,2,,n}s\in\{1,2,\ldots,n\} and any z𝒮z\in\mathcal{S},

{\rm dist}(z,\mathcal{S}\cap\mathcal{R}_{s})\leq\kappa\big{[}\|z\|_{1}-\|z\|_{(s)}\big{]}. (23)

Fix any (x,w)𝒮×[0,e](x,w)\in\mathcal{S}\times[0,e]. Let J={j[n]|ρ¯|x|j>ϕ(1)}J=\big{\{}j\in[n]\,|\ \overline{\rho}|x|_{j}^{\downarrow}>\phi_{-}^{\prime}(1)\big{\}} and r=|J|r=|J|. By invoking (23) for s=rs=r with z=xz=x, there exists xρ¯𝒮rx^{\overline{\rho}}\in\mathcal{S}\cap\mathcal{R}_{r} such that

xxρ¯κ[x1x(r)]=κj=r+1n|x|j.\|x-x^{\overline{\rho}}\|\leq\kappa\big{[}\|x\|_{1}-\|x\|_{(r)}\big{]}=\kappa{\textstyle\sum_{j=r+1}^{n}}|x|_{j}^{\downarrow}. (24)

Let J1={j[n]|11tρ¯|x|jϕ(1)}J_{1}=\!\big{\{}j\in[n]\,|\,\frac{1}{1-t^{*}}\!\leq\!\overline{\rho}|x|_{j}^{\downarrow}\leq\phi_{-}^{\prime}(1)\big{\}} and J2={j[n]| 0ρ¯|x|j<11t}J_{2}=\!\big{\{}j\in[n]\,|\,0\!\leq\!\overline{\rho}|x|_{j}^{\downarrow}<\frac{1}{1-t^{*}}\big{\}}. Note that

i=1nϕ(wi)+ρ¯(x1w,|x|)i=1nmint[0,1]{ϕ(t)+ρ¯|x|i(1t)}.{\textstyle\sum_{i=1}^{n}}\phi(w_{i})+\overline{\rho}\big{(}\|x\|_{1}\!-\langle w,|x|\rangle\big{)}\geq{\textstyle\sum_{i=1}^{n}}\min_{t\in[0,1]}\big{\{}\phi(t)+\overline{\rho}|x|_{i}^{\downarrow}(1-t)\big{\}}.

By invoking [35, Lemma 1] with ω=|x|j\omega=|x|_{j}^{\downarrow} for each jj, it immediately follows that

{\textstyle\sum_{i=1}^{n}}\phi(w_{i})+\overline{\rho}\big{(}\|x\|_{1}\!-\langle w,|x|\rangle\big{)}\geq\|x^{\overline{\rho}}\|_{0}+\frac{\overline{\rho}(1\!-\!t_{0})}{\phi_{-}^{\prime}(1)(1\!-t^{*})}\sum_{j\in J_{1}}\,|x|_{j}^{\downarrow}+\overline{\rho}(1\!-\!t_{0})\sum_{j\in J_{2}}\,|x|_{j}^{\downarrow}.

Notice that 1=ϕ(1)=ϕ(1)ϕ(t)ϕ(1)(1t)1=\phi(1)=\phi(1)-\phi(t^{*})\leq\phi_{-}^{\prime}(1)(1-t^{*}). From the last inequality, we have

\displaystyle{\textstyle\sum_{i=1}^{n}}\phi(w_{i})+\overline{\rho}\big{(}\|x\|_{1}\!-\langle w,|x|\rangle\big{)}\geq\|x^{\overline{\rho}}\|_{0}+\frac{\overline{\rho}(1-t_{0})}{\phi_{-}^{\prime}(1)(1\!-t^{*})}\sum_{j\in J_{1}\cup J_{2}}|x|_{j}^{\downarrow}
=xρ¯0+ρ¯(1t0)ϕ(1)(1t)j=r+1n|x|jxρ¯0+αfλ1xxρ¯\displaystyle=\|x^{\overline{\rho}}\|_{0}+\frac{\overline{\rho}(1-t_{0})}{\phi_{-}^{\prime}(1)(1\!-t^{*})}\sum_{j=r+1}^{n}|x|_{j}^{\downarrow}\geq\|x^{\overline{\rho}}\|_{0}+\alpha_{\!f}\lambda^{-1}\|x-x^{\overline{\rho}}\|\qquad (25)

where the last inequality is due to (24) and the definition of ρ¯\overline{\rho}. Since x𝒮x\in\mathcal{S} and xρ¯𝒮x^{\overline{\rho}}\in\mathcal{S}, we have fσ,γ(xρ¯)fσ,γ(x)αfxxρ¯f_{\sigma,\gamma}(x^{\overline{\rho}})-f_{\sigma,\gamma}(x)\leq\alpha_{\!f}\|x-x^{\overline{\rho}}\|. Together with the last inequality,

{\textstyle\sum_{i=1}^{n}}\phi(w_{i})+\overline{\rho}\big{(}\|x\|_{1}\!-\langle w,|x|\rangle\big{)}\geq\|x^{\overline{\rho}}\|_{0}+\lambda^{-1}\big{[}f_{\sigma,\gamma}(x^{\overline{\rho}})-f_{\sigma,\gamma}(x)\big{]}. (26)

Now take wiρ¯=1w_{i}^{\overline{\rho}}=1 for isupp(xρ¯)i\in{\rm supp}(x^{\overline{\rho}}) and wiρ¯=0w_{i}^{\overline{\rho}}=0 for isupp(xρ¯)i\notin{\rm supp}(x^{\overline{\rho}}). Clearly, (xρ¯,wρ¯)(x^{\overline{\rho}},w^{\overline{\rho}}) is a feasible point of the MPEC (19) with i=1nϕ(wiρ¯)=xρ¯0\sum_{i=1}^{n}\phi(w_{i}^{\overline{\rho}})=\|x^{\overline{\rho}}\|_{0}. Then, it holds that

fσ,γ(xρ¯)+λxρ¯0fσ,γ(x)+λi=1nϕ(wi).f_{\sigma,\gamma}(x^{\overline{\rho}})+\lambda\|x^{\overline{\rho}}\|_{0}\geq f_{\sigma,\gamma}(x^{*})+\lambda{\textstyle\sum_{i=1}^{n}}\phi(w_{i}^{*}).

Together with (26), we obtain the inequality (22). Notice that ew,|x|=0\langle e\!-\!w^{*},|x^{*}|\rangle=0. The inequality (22) implies that every global optimal solution of (19) is globally optimal to the problem (20) associated to every ρ>ρ¯\rho>\overline{\rho}. Conversely, by fixing any ρ>ρ¯\rho>\overline{\rho} and letting (x¯ρ,w¯ρ)(\overline{x}^{\rho},\overline{w}^{\rho}) be a global optimal solution of the problem (20) associated to ρ\rho, it holds that

fσ,γ(x¯ρ)+λi=1nϕ(w¯iρ)+ρλew¯ρ,|x¯ρ|\displaystyle f_{\sigma,\gamma}(\overline{x}^{\rho})+\lambda\textstyle{\sum_{i=1}^{n}}\phi(\overline{w}_{i}^{\rho})+\rho\lambda\langle e-\overline{w}^{\rho},|\overline{x}^{\rho}|\rangle
fσ,γ(x)+λi=1nϕ(wi)=fσ,γ(x)+λi=1nϕ(wi)+ρ+ρ¯2λew¯ρ,|x¯ρ|\displaystyle\leq f_{\sigma,\gamma}(x^{*})+\lambda\textstyle{\sum_{i=1}^{n}}\phi(w_{i}^{*})=f_{\sigma,\gamma}(x^{*})+\lambda\textstyle{\sum_{i=1}^{n}}\phi(w_{i}^{*})+\frac{\rho+\overline{\rho}}{2}\lambda\langle e-\overline{w}^{\rho},|\overline{x}^{\rho}|\rangle
fσ,γ(x¯ρ)+λi=1nϕ(w¯iρ)+ρ+ρ¯2λew¯ρ,|x¯ρ|,\displaystyle\leq f_{\sigma,\gamma}(\overline{x}^{\rho})+\lambda\textstyle{\sum_{i=1}^{n}}\phi(\overline{w}_{i}^{\rho})+\frac{\rho+\overline{\rho}}{2}\lambda\langle e-\overline{w}^{\rho},|\overline{x}^{\rho}|\rangle,

which implies that ρρ¯2λew¯ρ,|x¯ρ|0\frac{\rho-\overline{\rho}}{2}\lambda\langle e-\overline{w}^{\rho},|\overline{x}^{\rho}|\rangle\leq 0. Since ρ>ρ¯\rho>\overline{\rho} and ew¯ρ,|x¯ρ|0\langle e-\overline{w}^{\rho},|\overline{x}^{\rho}|\rangle\geq 0, we obtain ew¯ρ,|x¯ρ|=0\langle e-\overline{w}^{\rho},|\overline{x}^{\rho}|\rangle=0. Together with the last inequality, it follows that (x¯ρ,w¯ρ)(\overline{x}^{\rho},\overline{w}^{\rho}) is a global optimal solution of (19). The second part then follows. \Box

By the definition of ψ\psi, the penalty problem (20) can be rewritten in a compact form

minx𝒮,wn{fσ,γ(x)+λi=1nψ(wi)+ρλew,|x|},\min_{x\in\mathcal{S},w\in\mathbb{R}^{n}}\big{\{}f_{\sigma,\gamma}(x)+\lambda\textstyle{\sum_{i=1}^{n}}\psi(w_{i})+\rho\lambda\langle e-w,|x|\rangle\big{\}},

which, by the definition of the conjugate function ψ\psi^{*}, can be simplified to be (9). Then, Proposition 3.1 implies that the problem (9) associated to every ϕ\phi\in\!\mathscr{L} and ρ>ρ¯\rho>\overline{\rho} is an equivalent surrogate of the problem (7). For a specific ϕ\phi, since t,t0t^{*},t_{0} and ϕ(1)\phi_{-}^{\prime}(1) are known, the threshold ρ¯\overline{\rho} is also known by Lemma 3.1 though κ=3n\kappa=3\sqrt{n} is a rough estimate.

When ϕ\phi is the one in Section 1.2, it is easy to verify that λρφρ\lambda\rho\varphi_{\rho} with λ=(a+1)ν22\lambda=\frac{(a+1)\nu^{2}}{2} and ρ=2(a+1)ν\rho=\frac{2}{(a+1)\nu} is exactly the SCAD function xi=1nρν(xi)x\mapsto\sum_{i=1}^{n}\rho_{\nu}(x_{i}) proposed in [14]. Since t=0,t0=1/2t^{*}=0,t_{0}=1/2 and ϕ(1)=2aa+1\phi_{-}^{\prime}(1)=\frac{2a}{a+1} for this ϕ\phi, the SCAD function with ν<2(a+1)ρ¯\nu<\frac{2}{(a+1)\overline{\rho}} is an equivalent surrogate of (7). When ϕ(t)=a24t2a22t+at+(a2)24(a>2)\phi(t)=\frac{a^{2}}{4}t^{2}-\frac{a^{2}}{2}t+at+\frac{(a-2)^{2}}{4}\ (a>2) for tt\in\mathbb{R},

ψ(ω)={(a2)24ifωaa2/2,1a2(a(a2)2+ω)2(a2)24ifaa2/2<ωa,ω1ifω>a.\psi^{*}(\omega)=\left\{\begin{array}[]{cl}-\frac{(a-2)^{2}}{4}&\textrm{if}\ \omega\leq a-a^{2}/2,\\ \frac{1}{a^{2}}(\frac{a(a-2)}{2}+\omega)^{2}-\frac{(a-2)^{2}}{4}&\textrm{if}\ a-a^{2}/2<\omega\leq a,\\ \omega-1&\textrm{if}\ \omega>a.\end{array}\right.

It is not hard to verify that the function λρφρ\lambda\rho\varphi_{\rho} with λ=aν2/2\lambda={a\nu^{2}}/{2} and ρ=1/ν\rho={1}/{\nu} is exactly the one xi=1ngν,b(xi)x\mapsto\sum_{i=1}^{n}g_{\nu,b}(x_{i}) with b=ab=a used in [20, Section 3.3]. Since t=a2a,t0=a1at^{*}=\frac{a-2}{a},t_{0}=\frac{a-1}{a} and ϕ(1)=a\phi_{-}^{\prime}(1)=a for this ϕ\phi, the MCP function used in [20] with ν<1/ρ¯\nu<1/{\overline{\rho}} and b=ab=a is also an equivalent surrogate of the problem (7).
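As a sanity check of the stated correspondence, one can verify numerically that \lambda\rho\varphi_{\rho} with \lambda=\frac{(a+1)\nu^{2}}{2} and \rho=\frac{2}{(a+1)\nu} reproduces the SCAD penalty of [14]; the closed form of \rho_{\nu} used below is the standard one, and the parameter values are arbitrary illustrative choices.

```python
import numpy as np

def scad_penalty(t, nu, a):
    """Standard closed form of the SCAD penalty rho_nu (cf. [14])."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= nu, nu * t,
           np.where(t <= a * nu,
                    (2 * a * nu * t - t ** 2 - nu ** 2) / (2 * (a - 1)),
                    (a + 1) * nu ** 2 / 2))

def psi_star_scad(omega, a):
    """Conjugate psi* in (10)."""
    omega = np.asarray(omega, dtype=float)
    return np.where(omega <= 2 / (a + 1), 0.0,
           np.where(omega <= 2 * a / (a + 1),
                    ((a + 1) * omega - 2) ** 2 / (4 * (a ** 2 - 1)),
                    omega - 1.0))

a, nu = 3.7, 0.05
lam, rho = (a + 1) * nu ** 2 / 2, 2 / ((a + 1) * nu)
t = np.linspace(-1.0, 1.0, 2001)
lhs = lam * rho * (np.abs(t) - psi_star_scad(rho * np.abs(t), a) / rho)
assert np.allclose(lhs, scad_penalty(t, nu, a))   # lambda*rho*varphi_rho == SCAD
```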

4 PG method with extrapolation

4.1 PG with extrapolation for solving (7)

Recall that f_{\sigma,\gamma} is a smooth function whose gradient \nabla\!f_{\sigma,\gamma} is Lipschitz continuous with modulus L_{\!f}\leq\gamma^{-1}\|A\|^{2}, while by Proposition 2.1 the proximal mapping of g_{\lambda} has a closed form. This inspires us to apply the PG method with extrapolation to solve (7).

Algorithm 1 (PGe-znorm for solving the problem (7))

Initialization: Choose ς(0,1),0<τ<(1ς)Lf1,0<βmaxς(τ1Lf)τ12(τ1+Lf)\varsigma\in(0,1),0<\tau<(1\!-\!\varsigma)L_{\!f}^{-1},0<\beta_{\rm max}\leq\frac{\sqrt{\varsigma(\tau^{-1}-L_{\!f})\tau^{-1}}}{2(\tau^{-1}+L_{\!f})} and an initial point x0𝒮x^{0}\in\mathcal{S}. Set x1=x0x^{-1}=x^{0} and k:=0k:=0.

while the termination condition is not satisfied do

  • 1.

    Let x~k=xk+βk(xkxk1)\widetilde{x}^{k}=x^{k}+\beta_{k}(x^{k}-x^{k-1}). Compute xk+1𝒫τgλ(x~kτfσ,γ(x~k))x^{k+1}\in\mathcal{P}_{\!\tau}g_{\lambda}(\widetilde{x}^{k}\!-\!\tau\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k})).

  • 2.

    Choose βk+1[0,βmax]\beta_{k+1}\in[0,\beta_{\rm max}]. Let kk+1k\leftarrow k+1 and go to Step 1.

end (while)

Remark 4.1

The main computational work of Algorithm 1 in each iteration is to seek

xk+1argminxn{fσ,γ(x~k),xx~k+12τxx~k2+gλ(x)}.x^{k+1}\in\mathop{\arg\min}_{x\in\mathbb{R}^{n}}\Big{\{}\langle\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k}),x-\widetilde{x}^{k}\rangle+\frac{1}{2\tau}\|x-\widetilde{x}^{k}\|^{2}+g_{\lambda}(x)\Big{\}}. (27)

By Proposition 2.1, achieving a global optimal solution of the nonconvex problem (27) requires about 2mn flops. Owing to the good performance of Nesterov's acceleration strategy [37, 38], one can use this strategy to choose the extrapolation parameter \beta_{k}, i.e.,

βk=tk11tkwithtk+1=12(1+1+4tk2)fort1=t0=1.\beta_{k}=\frac{t_{k-1}-1}{t_{k}}\ \ {\rm with}\ t_{k+1}=\frac{1}{2}\big{(}1+\!\sqrt{1+4t_{k}^{2}}\big{)}\ \ {\rm for}\ t_{-1}=t_{0}=1. (28)

In Algorithm 1, an upper bound \beta_{\rm max} is imposed on \beta_{k} just for the convergence analysis. It is easy to check that as \varsigma approaches 1, say \varsigma=0.999, \beta_{\rm max} can be taken as 0.235.
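A self-contained sketch of Algorithm 1 with the extrapolation rule (28) is given below; it is meant only to show the structure of one iteration (a gradient step on f_{\sigma,\gamma} at the extrapolated point, followed by the proximal step of Proposition 2.1), and the step size, stopping rule and default parameters are illustrative rather than the tuned choices used in the numerical section.

```python
import numpy as np

def grad_f(x, A, sigma=1.0, gamma=0.05):
    """Gradient of f_{sigma,gamma}(x) = L_{sigma,gamma}(Ax), i.e. A^T applied to
    the derivative of the pieces of (6) evaluated at Ax."""
    z = A @ x
    d = np.zeros_like(z)
    r2 = (z > -gamma) & (z <= 0.0)
    r3 = (z > gamma - sigma) & (z <= -gamma)
    r4 = (z >= -(sigma + gamma)) & (z <= gamma - sigma)
    d[r2] = z[r2] / gamma
    d[r3] = -1.0
    d[r4] = -(z[r4] + sigma + gamma) / (2 * gamma)
    return A.T @ d

def prox_g(z, nu):
    """One element of P_tau g_lambda(z), nu = tau*lambda (Proposition 2.1)."""
    idx = np.argsort(-np.abs(z))
    cum = np.sqrt(np.cumsum(np.abs(z)[idx] ** 2))
    chi = cum - np.concatenate(([0.0], cum[:-1]))
    l = max(1, int(np.sum(chi >= nu)))
    x = np.zeros_like(z, dtype=float)
    x[idx[:l]] = z[idx[:l]] / cum[l - 1]
    return x

def pge_znorm(A, lam, sigma=1.0, gamma=0.05, varsigma=0.999, beta_max=0.235,
              max_iter=1000, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    Lf = np.linalg.norm(A, 2) ** 2 / gamma            # upper bound on L_f
    tau = 0.95 * (1.0 - varsigma) / Lf                # 0 < tau < (1-varsigma)/L_f
    x = rng.standard_normal(A.shape[1])
    x /= np.linalg.norm(x)                            # start on the unit sphere
    x_prev, t_prev, t = x.copy(), 1.0, 1.0
    for _ in range(max_iter):
        beta = min((t_prev - 1.0) / t, beta_max)      # Nesterov rule (28), capped
        x_tilde = x + beta * (x - x_prev)
        x_new = prox_g(x_tilde - tau * grad_f(x_tilde, A, sigma, gamma), tau * lam)
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x)):
            return x_new
        x_prev, x = x, x_new
        t_prev, t = t, 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
    return x
```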

The PG method with extrapolation, first proposed in [37] and extended to a general composite setting in [3], is a popular first-order method for solving nonconvex nonsmooth composite optimization problems such as (7) and (9). In the past several years, the PGe and its variants have received extensive attention (see, e.g., [17, 27, 41, 29, 28, 45, 47]). Due to the nonconvexity of the sphere constraint and the zero-norm, the results obtained in [17, 41, 29] are not applicable to (7). Although Algorithm 1 is a special case of those studied in [27, 45, 47], the convergence results of [27, 47] are obtained only for the objective value sequence, and the convergence result of [45] on the iterate sequence requires a strong restriction on \beta_{k}, namely, that it makes the objective value sequence nonincreasing.

Next we provide the proof for the convergence and local convergence rate of the iterate sequence yielded by Algorithm 1. For any τ>0\tau>0 and ς(0,1)\varsigma\in(0,1), we define the function

Hτ,ς(x,u):=Fσ,γ(x)+ς4τxu2(x,u)n×n.H_{\tau,\varsigma}(x,u):=F_{\sigma,\gamma}(x)+\frac{\varsigma}{4\tau}\|x-u\|^{2}\quad\ \forall(x,u)\in\mathbb{R}^{n}\times\mathbb{R}^{n}. (29)

The following lemma summarizes the properties of H_{\tau,\varsigma} on the sequence \{x^{k}\}_{k\in\mathbb{N}}.

Lemma 4.1

Let {xk}k\{x^{k}\}_{k\in\mathbb{N}} be the sequence generated by Algorithm 1. Then,

  • (i)

    for each kk\in\mathbb{N}, Hτ,ς(xk+1,xk)Hτ,ς(xk,xk1)ς(τ1Lf)2xk+1xk2.H_{\tau,\varsigma}(x^{k+1},x^{k})\leq H_{\tau,\varsigma}(x^{k},x^{k-1})-\frac{\varsigma(\tau^{-1}-L_{\!f})}{2}\|x^{k+1}\!-\!x^{k}\|^{2}.

  • (ii)

    The sequence {Hτ,ς(xk,xk1)}k\{H_{\tau,\varsigma}(x^{k},x^{k-1})\}_{k\in\mathbb{N}} is convergent and k=1xk+1xk2<\sum_{k=1}^{\infty}\|x^{k+1}\!-\!x^{k}\|^{2}<\infty.

  • (iii)

    For each kk\in\mathbb{N}, there exists wkHτ,ς(xk,xk1)w^{k}\in\partial H_{\tau,\varsigma}(x^{k},x^{k-1}) with wk+1b1xk+1xk+b2xkxk1\|w^{k+1}\|\leq b_{1}\|x^{k+1}\!-\!x^{k}\|+b_{2}\|x^{k}\!-\!x^{k-1}\|, where b1>0b_{1}>0 and b2>0b_{2}>0 are the constants independent of kk.

Proof: (i) Since fσ,γ\nabla\!f_{\sigma,\gamma} is globally Lipschitz continuous, from the descent lemma we have

fσ,γ(x)fσ,γ(x)+fσ,γ(x),xx+(Lf/2)xx2x,xn.f_{\sigma,\gamma}(x^{\prime})\leq f_{\sigma,\gamma}(x)+\langle\nabla\!f_{\sigma,\gamma}(x),x^{\prime}-x\rangle+({L_{\!f}}/{2})\|x^{\prime}-x\|^{2}\quad\forall x^{\prime},x\in\mathbb{R}^{n}. (30)

From the definition of xk+1x^{k+1} or the equation (27), for each kk\in\mathbb{N} it holds that

fσ,γ(x~k),xk+1xk+12τxk+1x~k2+gλ(xk+1)12τxkx~k2+gλ(xk).\langle\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k}),x^{k+1}-x^{k}\rangle+\frac{1}{2\tau}\|x^{k+1}-\widetilde{x}^{k}\|^{2}+g_{\lambda}(x^{k+1})\leq\frac{1}{2\tau}\|x^{k}-\widetilde{x}^{k}\|^{2}+g_{\lambda}(x^{k}).

Together with the inequality (30) for x=xk+1x^{\prime}=x^{k+1} and x=xkx=x^{k}, it follows that

fσ,γ(xk+1)+gλ(xk+1)\displaystyle f_{\sigma,\gamma}(x^{k+1})+g_{\lambda}(x^{k+1}) fσ,γ(xk)+gλ(xk)12τxk+1x~k2+Lf2xk+1xk2\displaystyle\leq f_{\sigma,\gamma}(x^{k})+g_{\lambda}(x^{k})-\frac{1}{2\tau}\|x^{k+1}-\widetilde{x}^{k}\|^{2}+\frac{L_{\!f}}{2}\|x^{k+1}-x^{k}\|^{2}
+fσ,γ(xk)fσ,γ(x~k),xk+1xk+12τxkx~k2\displaystyle\quad+\langle\nabla\!f_{\sigma,\gamma}(x^{k})-\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k}),x^{k+1}-x^{k}\rangle+\frac{1}{2\tau}\|x^{k}-\widetilde{x}^{k}\|^{2}
=fσ,γ(xk)+gλ(xk)12(τ1Lf)xk+1xk2\displaystyle=f_{\sigma,\gamma}(x^{k})+g_{\lambda}(x^{k})-\frac{1}{2}(\tau^{-1}\!-\!L_{\!f})\|x^{k+1}-x^{k}\|^{2}
1τxk+1xk,xkx~k+fσ,γ(xk)fσ,γ(x~k),xk+1xk.\displaystyle\quad-\frac{1}{\tau}\langle x^{k+1}\!-\!x^{k},x^{k}\!-\!\widetilde{x}^{k}\rangle+\langle\nabla\!f_{\sigma,\gamma}(x^{k})\!-\!\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k}),x^{k+1}\!-\!x^{k}\rangle.

Using x~k=xk+βk(xkxk1)\widetilde{x}^{k}=x^{k}+\beta_{k}(x^{k}-x^{k-1}) and the Lipschitz continuity of fσ,γ\nabla\!f_{\sigma,\gamma} yields that

Fσ,γ(xk+1)\displaystyle F_{\sigma,\gamma}(x^{k+1}) Fσ,γ(xk)τ1Lf2xk+1xk2+(τ1+Lf)βkxk+1xkxkxk1\displaystyle\leq F_{\sigma,\gamma}(x^{k})-\frac{\tau^{-1}\!-\!L_{\!f}}{2}\|x^{k+1}\!-\!x^{k}\|^{2}+(\tau^{-1}\!+\!L_{\!f})\beta_{k}\|x^{k+1}\!-\!x^{k}\|\|x^{k}\!-\!x^{k-1}\|
Fσ,γ(xk)τ1Lf4xk+1xk2+(τ1+Lf)2τ1Lfβk2xkxk12\displaystyle\leq F_{\sigma,\gamma}(x^{k})-\frac{\tau^{-1}\!-\!L_{\!f}}{4}\|x^{k+1}\!-\!x^{k}\|^{2}+\frac{(\tau^{-1}\!+\!L_{\!f})^{2}}{\tau^{-1}\!-\!L_{\!f}}\beta_{k}^{2}\|x^{k}\!-\!x^{k-1}\|^{2}
Fσ,γ(xk)τ1Lf4xk+1xk2+ς4τxkxk12\displaystyle\leq F_{\sigma,\gamma}(x^{k})-\frac{\tau^{-1}\!-\!L_{\!f}}{4}\|x^{k+1}\!-\!x^{k}\|^{2}+\frac{\varsigma}{4\tau}\|x^{k}\!-\!x^{k-1}\|^{2}
=Fσ,γ(xk)(1ς)τ1Lf4xk+1xk2ς4τxk+1xk2+ς4τxkxk12\displaystyle=F_{\sigma,\gamma}(x^{k})-\frac{(1\!-\!\varsigma)\tau^{-1}\!-\!L_{\!f}}{4}\|x^{k+1}\!-\!x^{k}\|^{2}-\frac{\varsigma}{4\tau}\|x^{k+1}\!-\!x^{k}\|^{2}+\frac{\varsigma}{4\tau}\|x^{k}\!-\!x^{k-1}\|^{2}

where the second inequality is due to $ab\leq\frac{a^{2}}{4s}+sb^{2}$ with $a=\|x^{k+1}\!-\!x^{k}\|$, $b=(\tau^{-1}\!+\!L_{\!f})\beta_{k}\|x^{k}\!-\!x^{k-1}\|$ and $s=\frac{1}{\tau^{-1}-L_{\!f}}>0$, and the last one is due to $\beta_{k}\leq\beta_{\rm max}\leq\frac{\sqrt{\varsigma(\tau^{-1}-L_{\!f})\tau^{-1}}}{2(\tau^{-1}+L_{\!f})}$. Combining the above chain of inequalities with the definition of $H_{\tau,\varsigma}$, we obtain the result.

(ii) Note that $H_{\tau,\varsigma}$ is lower bounded because the function $F_{\sigma,\gamma}$ is lower bounded. Since the sequence $\{H_{\tau,\varsigma}(x^{k},x^{k-1})\}_{k\in\mathbb{N}}$ is nonincreasing by part (i) and bounded below, it is convergent; consequently, $\sum_{k=1}^{\infty}\|x^{k+1}\!-\!x^{k}\|^{2}<\infty$ follows by summing the inequality in part (i).

(iii) From the definition of Hτ,ςH_{\tau,\varsigma} and [36, Exercise 8.8], for any (x,u)𝒮×n(x,u)\in\mathcal{S}\times\mathbb{R}^{n},

Hτ,ς(x,u)=(Fσ,γ(x)+12τ1ς(xu)12τ1ς(ux)).\partial H_{\tau,\varsigma}(x,u)=\left(\begin{matrix}\partial F_{\sigma,\gamma}(x)+\frac{1}{2}\tau^{-1}\varsigma(x-u)\\ \frac{1}{2}\tau^{-1}\varsigma(u-x)\end{matrix}\right). (31)

Fix any kk\in\mathbb{N}. By the optimality of xk+1x^{k+1} to the nonconvex problem (27), it follows that

0fσ,γ(x~k)+τ1(xk+1x~k)+hλ(xk+1),0\in\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k})+\tau^{-1}(x^{k+1}\!-\widetilde{x}^{k})+\partial h_{\lambda}(x^{k+1}),

which is equivalent to fσ,γ(xk+1)fσ,γ(x~k)τ1(xk+1x~k)Fσ,γ(xk+1).\nabla\!f_{\sigma,\gamma}(x^{k+1})\!-\!\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k})-\tau^{-1}(x^{k+1}\!-\!\widetilde{x}^{k})\in\partial F_{\sigma,\gamma}(x^{k+1}). Write

wk:=(fσ,γ(xk)fσ,γ(x~k1)τ1(xkx~k1)+12τ1ς(xkxk1)12τ1ς(xk1xk)).w^{k}:=\left(\begin{matrix}\nabla\!f_{\sigma,\gamma}(x^{k})-\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k-1})-\tau^{-1}(x^{k}-\widetilde{x}^{k-1})+\frac{1}{2}\tau^{-1}\varsigma(x^{k}-x^{k-1})\\ \frac{1}{2}\tau^{-1}\varsigma(x^{k-1}-x^{k})\end{matrix}\right).

By comparing with (31), we have wkHτ,ς(xk,xk1)w^{k}\in\partial H_{\tau,\varsigma}(x^{k},x^{k-1}). From the Lipschitz continuity of fσ,γ\nabla\!f_{\sigma,\gamma} and Step 1, wk+1(τ1+Lf+τ1ς)xk+1xk+(τ1+Lf)βmaxxkxk1\|w^{k+1}\|\leq(\tau^{-1}\!+\!L_{\!f}\!+\!\tau^{-1}\varsigma)\|x^{k+1}\!-\!x^{k}\|+(\tau^{-1}\!+\!L_{\!f})\beta_{\rm max}\|x^{k}\!-\!x^{k-1}\|. Since βmax(0,1)\beta_{\rm max}\in(0,1), the result holds with b1=τ1+Lf+τ1ςb_{1}=\tau^{-1}\!+\!L_{\!f}\!+\!\tau^{-1}\varsigma and b2=τ1+Lfb_{2}=\tau^{-1}\!+\!L_{\!f}. \Box

Lemma 4.2

Let {xk}k\{x^{k}\}_{k\in\mathbb{N}} be the sequence generated by Algorithm 1 and denote by ϖ(x0)\varpi(x^{0}) the set of accumulation points of {xk}k\{x^{k}\}_{k\in\mathbb{N}}. Then, the following assertions hold:

  • (i)

    ϖ(x0)\varpi(x^{0}) is a nonempty compact set and ϖ(x0)Sτ,gλcritFσ,γ\varpi(x^{0})\subseteq S_{\tau,g_{\lambda}}\subseteq{\rm crit}F_{\sigma,\gamma};

  • (ii)

    limkdist((xk,xk1),Ω)=0\lim_{k\to\infty}{\rm dist}((x^{k},x^{k-1}),\Omega)=0 with Ω:={(x,x)|xϖ(x0)}critHτ,ς\Omega:=\{(x,x)\,|\,x\in\varpi(x^{0})\}\subseteq{\rm crit}H_{\tau,\varsigma};

  • (iii)

    the function $H_{\tau,\varsigma}$ is finite and constant on the set $\Omega$.

Proof: (i) Since {xk}k𝒮\{x^{k}\}_{k\in\mathbb{N}}\subseteq\mathcal{S}, we have ϖ(x0)\varpi(x^{0})\neq\emptyset. Since ϖ(x0)\varpi(x^{0}) can be viewed as an intersection of compact sets, i.e., ϖ(x0)=qkq{xk}¯\varpi(x^{0})=\bigcap_{q\in\mathbb{N}}\overline{\bigcup_{k\geq q}\{x^{k}\}}, it is also compact. Now pick any xϖ(x0)x^{*}\in\varpi(x^{0}). There exists a subsequence {xkj}j\{x^{k_{j}}\}_{j\in\mathbb{N}} with xkjxx^{k_{j}}\rightarrow x^{*} as jj\rightarrow\infty. Note that limjxkjxkj1=0\lim_{j\to\infty}\|x^{k_{j}}-x^{k_{j}-1}\|=0 implied by Lemma 4.1 (ii). Then, xkj1xx^{k_{j}-1}\rightarrow x^{*} and xkj+1xx^{k_{j}+1}\rightarrow x^{*} as jj\rightarrow\infty. Recall that x~kj=xkj+βkj(xkjxkj1)\widetilde{x}^{k_{j}}=x^{k_{j}}+\beta_{k_{j}}(x^{k_{j}}-x^{k_{j}-1}) and βkj[0,βmax)\beta_{k_{j}}\in[0,\beta_{\rm max}). When jj\rightarrow\infty, we have x~kjx\widetilde{x}^{k_{j}}\rightarrow x^{*} and then x~kjτfσ,γ(x~kj)xτfσ,γ(x)\widetilde{x}^{k_{j}}\!-\!\tau\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k_{j}})\rightarrow x^{*}\!-\!\tau\nabla\!f_{\sigma,\gamma}(x^{*}). In addition, since gλg_{\lambda} is proximally bounded with threshold ++\infty, i.e., for any τ>0\tau^{\prime}>0 and xnx\in\mathbb{R}^{n}, minzn{12τzx2+gλ(z)}>\min_{z\in\mathbb{R}^{n}}\big{\{}\frac{1}{2\tau^{\prime}}\|z-x\|^{2}+g_{\lambda}(z)\big{\}}>-\infty, from [36, Example 5.23] it follows that 𝒫τgλ\mathcal{P}_{\!\tau}g_{\lambda} is outer semicontinuous. Thus, from xkj+1𝒫τgλ(x~kjτfσ,γ(x~kj))x^{k_{j}+1}\in\mathcal{P}_{\!\tau}g_{\lambda}(\widetilde{x}^{k_{j}}\!-\!\tau\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k_{j}})) for each jj\in\mathbb{N}, we have x𝒫τgλ(xτfσ,γ(x))x^{*}\in\mathcal{P}_{\!\tau}g_{\lambda}(x^{*}\!-\!\tau\nabla\!f_{\sigma,\gamma}(x^{*})), and then xSτ,gλx^{*}\in S_{\tau,g_{\lambda}}. By the arbitrariness of xϖ(x0)x^{*}\in\varpi(x^{0}), the first inclusion follows. The second inclusion is given by Lemma 2.4.

(ii)-(iii) The result of part (ii) is immediate, so it suffices to prove part (iii). By Lemma 4.1 (ii), the sequence $\{H_{\tau,\varsigma}(x^{k},x^{k-1})\}_{k\in\mathbb{N}}$ is convergent; denote its limit by $\omega^{*}$. Pick any $(x^{*},x^{*})\in\Omega$. There exists a subsequence $\{x^{k_{j}}\}_{j\in\mathbb{N}}$ with $x^{k_{j}}\rightarrow x^{*}$ as $j\rightarrow\infty$. If $\lim_{j\rightarrow\infty}H_{\tau,\varsigma}(x^{k_{j}},x^{k_{j}-1})=H_{\tau,\varsigma}(x^{*},x^{*})$, then the convergence of $\{H_{\tau,\varsigma}(x^{k},x^{k-1})\}_{k\in\mathbb{N}}$ implies that $H_{\tau,\varsigma}(x^{*},x^{*})=\omega^{*}$, which by the arbitrariness of $(x^{*},x^{*})\in\Omega$ shows that the function $H_{\tau,\varsigma}$ is finite and constant on $\Omega$. Hence, it suffices to argue that $\lim_{j\rightarrow\infty}H_{\tau,\varsigma}(x^{k_{j}},x^{k_{j}-1})=H_{\tau,\varsigma}(x^{*},x^{*})$. Recall that $\lim_{j\to\infty}\|x^{k_{j}}-x^{k_{j}-1}\|=0$ by Lemma 4.1 (ii), so we only need to argue that $\lim_{j\rightarrow\infty}F_{\sigma,\gamma}(x^{k_{j}})=F_{\sigma,\gamma}(x^{*})$. From (27), it holds that

fσ,γ(x~kj1),xkjx+12τxkjx~kj12+gλ(xkj)12τxx~kj12+gλ(x).\langle\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k_{j}-1}),x^{k_{j}}-x^{*}\rangle+\frac{1}{2\tau}\|x^{k_{j}}-\widetilde{x}^{k_{j}-1}\|^{2}+g_{\lambda}(x^{k_{j}})\leq\frac{1}{2\tau}\|x^{*}-\widetilde{x}^{k_{j}-1}\|^{2}+g_{\lambda}(x^{*}).

Together with the inequality (30) with x=xkjx^{\prime}=x^{k_{j}} and x=xx=x^{*}, we obtain that

Fσ,γ(xkj)\displaystyle F_{\sigma,\gamma}(x^{k_{j}}) Fσ,γ(x)+fσ,γ(x)fσ,γ(x~kj1),xkjx12τxkjx~kj12\displaystyle\leq F_{\sigma,\gamma}(x^{*})+\langle\nabla\!f_{\sigma,\gamma}(x^{*})-\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k_{j}-1}),x^{k_{j}}-x^{*}\rangle-\frac{1}{2\tau}\|x^{k_{j}}-\widetilde{x}^{k_{j}-1}\|^{2}
+12τxx~kj12+Lf2xkjx2,\displaystyle\quad+\frac{1}{2\tau}\|x^{*}-\widetilde{x}^{k_{j}-1}\|^{2}+\frac{L_{\!f}}{2}\|x^{k_{j}}-x^{*}\|^{2},

which, by $\lim_{j\rightarrow\infty}x^{k_{j}}=x^{*}=\lim_{j\rightarrow\infty}x^{k_{j}-1}$ (so that $\widetilde{x}^{k_{j}-1}\rightarrow x^{*}$ as well), implies that $\limsup_{j\rightarrow\infty}F_{\sigma,\gamma}(x^{k_{j}})\leq F_{\sigma,\gamma}(x^{*})$. In addition, by the lower semicontinuity of $F_{\sigma,\gamma}$, $\liminf_{j\rightarrow\infty}F_{\sigma,\gamma}(x^{k_{j}})\geq F_{\sigma,\gamma}(x^{*})$. Combining the two inequalities shows that $\lim_{j\rightarrow\infty}F_{\sigma,\gamma}(x^{k_{j}})=F_{\sigma,\gamma}(x^{*})$. The proof is then completed. $\Box$

Since $f_{\sigma,\gamma}$ is a piecewise linear-quadratic function, it is semi-algebraic. Recall that the zero-norm is also semi-algebraic. Hence, $F_{\sigma,\gamma}$ and $H_{\tau,\varsigma}$ are semi-algebraic and therefore KL functions. By using Lemmas 4.1-4.2 and following the same arguments as those for [5, Theorem 1] and [1, Theorem 2], we can establish the following convergence results.

Theorem 4.1

Let {xk}k\{x^{k}\}_{k\in\mathbb{N}} be the sequence generated by Algorithm 1. Then,

  • (i)

    k=1xk+1xk<\sum_{k=1}^{\infty}\|x^{k+1}-x^{k}\|<\infty and consequently {xk}k\{x^{k}\}_{k\in\mathbb{N}} converges to some xSτ,gλx^{*}\in S_{\tau,g_{\lambda}}.

  • (ii)

    If Fσ,γF_{\sigma,\gamma} is a KL function of exponent 1/21/2, then there exist c1>0c_{1}>0 and ϱ(0,1)\varrho\in(0,1) such that for all sufficiently large kk, xkxc1ϱk\|x^{k}-x^{*}\|\leq c_{1}\varrho^{k}.

Proof: (i) For each kk\in\mathbb{N}, write zk:=(xk,xk1)z^{k}\!:=(x^{k},x^{k-1}). Since {xk}k\{x^{k}\}_{k\in\mathbb{N}} is bounded, there exists a subsequence {xkq}q\{x^{k_{q}}\}_{q\in\mathbb{N}} with xkqx¯x^{k_{q}}\to\overline{x} as qq\to\infty. By the proof of Lemma 4.2 (iii), limkHτ,ς(zk)=Hτ,ς(z¯)\lim_{k\to\infty}H_{\tau,\varsigma}(z^{k})=H_{\tau,\varsigma}(\overline{z}) with z¯=(x¯,x¯)\overline{z}=(\overline{x},\overline{x}). If there exists k¯\overline{k}\in\mathbb{N} such that Hτ,ς(zk¯)=Hτ,ς(z¯)H_{\tau,\varsigma}(z^{\overline{k}})=H_{\tau,\varsigma}(\overline{z}), by Lemma 4.1 (i) we have xk=xk¯x^{k}=x^{\overline{k}} for all kk¯k\geq\overline{k} and the result follows. Thus, it suffices to consider that Hτ,ς(zk)>Hτ,ς(z¯)H_{\tau,\varsigma}(z^{k})>H_{\tau,\varsigma}(\overline{z}) for all kk\in\mathbb{N}. Since limkHτ,ς(zk)=Hτ,ς(z¯)\lim_{k\to\infty}H_{\tau,\varsigma}(z^{k})=H_{\tau,\varsigma}(\overline{z}), for any η>0\eta>0 there exists k0k_{0}\in\mathbb{N} such that for all kk0k\geq k_{0}, Hτ,ς(zk)<Hτ,ς(z¯)+ηH_{\tau,\varsigma}(z^{k})<H_{\tau,\varsigma}(\overline{z})+\eta. In addition, from Lemma 4.2 (ii), for any ε>0\varepsilon>0 there exists k1k_{1}\in\mathbb{N} such that for all kk1k\geq k_{1}, dist(zk,Ω)<ε{\rm dist}(z^{k},\Omega)<\varepsilon. Then, for all kk¯:=max(k0,k1)k\geq\overline{k}:=\max(k_{0},k_{1}),

zk{z|dist(z,Ω)<ε}[Hτ,ς(z¯)<Hτ,ς<Hτ,ς(z¯)+η].z^{k}\in\big{\{}z\,|\,{\rm dist}(z,\Omega)<\varepsilon\big{\}}\cap[H_{\tau,\varsigma}(\overline{z})<H_{\tau,\varsigma}<H_{\tau,\varsigma}(\overline{z})+\eta].

By combining Lemma 4.2 (iii) with [5, Lemma 6], there exist $\varepsilon>0$, $\eta>0$ and a continuous concave function $\varphi\!:[0,\eta)\to\mathbb{R}_{+}$ satisfying the conditions in Definition 2.3 such that for all $\overline{z}\in\Omega$ and all $z\in\big\{z\,|\,{\rm dist}(z,\Omega)<\varepsilon\big\}\cap[H_{\tau,\varsigma}(\overline{z})<H_{\tau,\varsigma}<H_{\tau,\varsigma}(\overline{z})+\eta]$,

φ(Hτ,ς(z)Hτ,ς(z¯))dist(0,Hτ,ς(z))1.\varphi^{\prime}(H_{\tau,\varsigma}(z)-H_{\tau,\varsigma}(\overline{z})){\rm dist}(0,\partial H_{\tau,\varsigma}(z))\geq 1.

Consequently, for all k>k¯k>\overline{k}, φ(Hτ,ς(zk)Hτ,ς(z¯))dist(0,Hτ,ς(zk))1\varphi^{\prime}(H_{\tau,\varsigma}(z^{k})-H_{\tau,\varsigma}(\overline{z})){\rm dist}(0,\partial H_{\tau,\varsigma}(z^{k}))\geq 1. By Lemma 4.1 (iii), there exists wkHτ,ς(zk)w^{k}\in\partial H_{\tau,\varsigma}(z^{k}) with wkb1xkxk1+b2xk1xk2\|w^{k}\|\leq b_{1}\|x^{k}-x^{k-1}\|+b_{2}\|x^{k-1}-x^{k-2}\|. Then,

φ(Hτ,ς(zk)Hτ,ς(z¯))wk1.\varphi^{\prime}(H_{\tau,\varsigma}(z^{k})-H_{\tau,\varsigma}(\overline{z}))\|w^{k}\|\geq 1.

Together with the concavity of φ\varphi and Lemma 4.1 (i), it follows that for all k>k¯k>\overline{k},

[φ(Hτ,ς(zk)Hτ,ς(z¯))φ(Hτ,ς(zk+1)Hτ,ς(z¯))]wk\displaystyle[\varphi(H_{\tau,\varsigma}(z^{k})-H_{\tau,\varsigma}(\overline{z}))-\varphi(H_{\tau,\varsigma}(z^{k+1})-H_{\tau,\varsigma}(\overline{z}))]\|w_{k}\|
φ(Hτ,ς(zk)Hτ,ς(z¯))[Hτ,ς(zk)Hτ,ς(zk+1)]wk\displaystyle\geq\varphi^{\prime}(H_{\tau,\varsigma}(z^{k})-H_{\tau,\varsigma}(\overline{z}))[H_{\tau,\varsigma}(z^{k})-H_{\tau,\varsigma}(z^{k+1})]\|w_{k}\|
Hτ,ς(zk)Hτ,ς(zk+1)axk+1xk2witha=ς(τ1Lf)/2.\displaystyle\geq H_{\tau,\varsigma}(z^{k})-H_{\tau,\varsigma}(z^{k+1})\geq a\|x^{k+1}-x^{k}\|^{2}\ \ {\rm with}\ a={\varsigma(\tau^{-1}\!-\!L_{\!f})}/{2}.

For each kk\in\mathbb{N}, let Δk:=φ(Hτ,ς(zk)Hτ,ς(z¯))φ(Hτ,ς(zk+1)Hτ,ς(z¯))\Delta_{k}:=\varphi(H_{\tau,\varsigma}(z^{k})-H_{\tau,\varsigma}(\overline{z}))-\varphi(H_{\tau,\varsigma}(z^{k+1})-H_{\tau,\varsigma}(\overline{z})). For all k>k¯k>\overline{k},

2xk+1xk\displaystyle 2\|x^{k+1}-x^{k}\| 2a1Δkwk2a1Δk[b1xkxk1+b2xk1xk2]\displaystyle\leq 2\sqrt{a^{-1}\Delta_{k}\|w_{k}\|}\leq 2\sqrt{a^{-1}\Delta_{k}[b_{1}\|x^{k}\!-\!x^{k-1}\|+b_{2}\|x^{k-1}\!-\!x^{k-2}\|]}
12(xkxk1+xk1xk2)+2a1max(b1,b2)Δk,\displaystyle\leq\frac{1}{2}\big{(}\|x^{k}-x^{k-1}\|+\|x^{k-1}-x^{k-2}\|\big{)}+2a^{-1}\max(b_{1},b_{2})\Delta_{k},

where the last inequality is due to $2\sqrt{st}\leq s/2+2t$ for any $s,t\geq 0$, applied with $s=\|x^{k}-x^{k-1}\|+\|x^{k-1}-x^{k-2}\|$ and $t=a^{-1}\max(b_{1},b_{2})\Delta_{k}$. For any $\nu>k>\overline{k}$, summing the last inequality from $k$ to $\nu$ yields that

j=kνxj+1xjxkxk1+12xk1xk2+2max(b1,b2)aφ(Hτ,ς(zk)Hτ,ς(z¯)).\sum_{j=k}^{\nu}\|x^{j+1}\!-\!x^{j}\|\leq\|x^{k}-x^{k-1}\|+\frac{1}{2}\|x^{k-1}\!-\!x^{k-2}\|+\frac{2\max(b_{1},b_{2})}{a}\varphi(H_{\tau,\varsigma}(z^{k})-H_{\tau,\varsigma}(\overline{z})). (32)

Passing to the limit $\nu\to\infty$ in the last inequality yields $\sum_{k=1}^{\infty}\|x^{k+1}-x^{k}\|<\infty$, so $\{x^{k}\}_{k\in\mathbb{N}}$ is a Cauchy sequence and converges to some $x^{*}$, which belongs to $S_{\tau,g_{\lambda}}$ by Lemma 4.2 (i).

(ii) Since Fσ,γF_{\sigma,\gamma} is a KL function of exponent 1/21/2, by [34, Theorem 3.6] and the expression of Hτ,ςH_{\tau,\varsigma}, it follows that Hτ,ςH_{\tau,\varsigma} is also a KL function of exponent 1/21/2. From the arguments for part (i) with φ(t)=ct\varphi(t)=c\sqrt{t} for t0t\geq 0 and Lemma 4.1 (iii), for all kk¯k\geq\overline{k} it holds that

Hτ,ς(zk)Hτ,ς(z¯)c2dist(0,Hτ,ς(zk))c2[b1xkxk1+b2xk1xk2].\sqrt{H_{\tau,\varsigma}(z^{k})-H_{\tau,\varsigma}(\overline{z})}\leq\frac{c}{2}{\rm dist}(0,\partial H_{\tau,\varsigma}(z^{k}))\leq\frac{c}{2}[b_{1}\|x^{k}-x^{k-1}\|+b_{2}\|x^{k-1}-x^{k-2}\|].

Consequently, φ(Hτ,ς(zk)Hτ,ς(z¯))c22[b1xkxk1+b2xk1xk2]\varphi(H_{\tau,\varsigma}(z^{k})-H_{\tau,\varsigma}(\overline{z}))\leq\frac{c^{2}}{2}[b_{1}\|x^{k}-x^{k-1}\|+b_{2}\|x^{k-1}-x^{k-2}\|]. Together with the inequality (32), by letting c=c2a1[max(b1,b2)]2c^{\prime}=c^{2}a^{-1}[\max(b_{1},b_{2})]^{2}, for any ν>k>k¯\nu>k>\overline{k} we have

j=kνxj+1xj(1+c)xkxk1+(1/2+c)xk1xk2{\textstyle\sum_{j=k}^{\nu}}\|x^{j+1}-x^{j}\|\leq(1+c^{\prime})\|x^{k}-x^{k-1}\|+(1/2+c^{\prime})\|x^{k-1}-x^{k-2}\|

For each $k\in\mathbb{N}$, let $\Delta_{k}:=\sum_{j=k}^{\infty}\|x^{j+1}-x^{j}\|$, which is finite by part (i). Passing to the limit $\nu\to+\infty$ in this inequality, we obtain $\Delta_{k}\leq(1+c^{\prime})[\Delta_{k-1}-\Delta_{k}]+(1/2+c^{\prime})[\Delta_{k-2}-\Delta_{k-1}]\leq(1+c^{\prime})[\Delta_{k-2}-\Delta_{k}]$, which means that $\Delta_{k}\leq\varrho\Delta_{k-2}$ for $\varrho=\frac{1+c^{\prime}}{2+c^{\prime}}\in(0,1)$. Since $\|x^{k}-x^{*}\|\leq\Delta_{k}$, the desired linear rate follows from this recursion. $\Box$

It is worthwhile to point out that, by Lemmas 4.1-4.2 and the proof of Lemma 4.2 (iii), applying [28, Theorem 10] directly yields $\sum_{k=1}^{\infty}\|x^{k+1}-x^{k}\|<\infty$. Here, we include the proof just for the convergence rate analysis in Theorem 4.1 (ii). Notice that Theorem 4.1 (ii) requires the KL property of exponent $1/2$ of the function $F_{\sigma,\gamma}$. The following lemma shows that $F_{\sigma,\gamma}$ indeed possesses this property (in fact, the stronger KL property of exponent $0$) under a mild condition.

Lemma 4.3

If every $\overline{x}\in{\rm crit}F_{\sigma,\gamma}$ satisfies $\Gamma(\overline{x})=\emptyset$, then $F_{\sigma,\gamma}$ is a KL function of exponent $0$.

Proof: Write f~σ,γ(x):=fσ,γ(x)+δ𝒮(x)\widetilde{f}_{\sigma,\gamma}(x):=f_{\sigma,\gamma}(x)+\delta_{\mathcal{S}}(x) for xnx\in\mathbb{R}^{n}. For any x𝒮x\in\mathcal{S}, by [36, Exercise 8.8],

f~σ,γ(x)=fσ,γ(x)+𝒩𝒮(x)=A𝕋Lσ,γ(Ax)+𝒩𝒮(x).\partial\!\widetilde{f}_{\sigma,\gamma}(x)=\nabla\!f_{\sigma,\gamma}(x)+\mathcal{N}_{\mathcal{S}}(x)=A^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)+\mathcal{N}_{\mathcal{S}}(x). (33)

Fix any x¯critFσ,γ\overline{x}\in{\rm crit}F_{\sigma,\gamma} with Γ(x¯)=\Gamma(\overline{x})=\emptyset. Let J:=supp(x¯),I0:={i[m]|(Ax¯)i>0}J:={\rm supp}(\overline{x}),I_{0}\!:=\{i\in[m]\,|\,(A\overline{x})_{i}>0\}, I1:={i[m]|(Ax¯)i<(σ+γ)}I_{1}\!:=\{i\in[m]\,|\,(A\overline{x})_{i}<-(\sigma\!+\!\gamma)\} and I2:={i[m]|σ+γ<(Ax¯)i<γ}I_{2}\!:=\!\{i\in[m]\,|\,-\sigma\!+\!\gamma<(A\overline{x})_{i}<-\gamma\}. Since Γ(x¯)=\Gamma(\overline{x})=\emptyset, we have I0I1I2=[m]I_{0}\cup I_{1}\cup I_{2}=[m]. Moreover, from the continuity, there exists ε>0\varepsilon^{\prime}>0 such that for all x𝔹(x¯,ε)x\in\mathbb{B}(\overline{x},\varepsilon^{\prime}), supp(x)J{\rm supp}(x)\supseteq J and the following inequalities hold:

(Ax)i>0foriI0,(Ax)i<(σ+γ)foriI1andγσ<(Ax)i<γforiI2.(Ax)_{i}>0\ {\rm for}\ i\in I_{0},\ (Ax)_{i}<-(\sigma\!+\!\gamma)\ {\rm for}\ i\in I_{1}\ {\rm and}\ \gamma\!-\!\sigma<(Ax)_{i}<-\gamma\ {\rm for}\ i\in I_{2}. (34)

By the continuity of fσ,γf_{\sigma,\gamma}, there exists ε′′>0\varepsilon^{\prime\prime}>0 such that for all x𝔹(x¯,ε′′)x\in\mathbb{B}(\overline{x},\varepsilon^{\prime\prime}), fσ,γ(x)>fσ,γ(x¯)λ/2f_{\sigma,\gamma}(x)>f_{\sigma,\gamma}(\overline{x})-\lambda/2. Set ε=min(ε,ε′′)\varepsilon=\min(\varepsilon^{\prime},\varepsilon^{\prime\prime}) and pick any η(0,λ/4]\eta\in(0,\lambda/4]. Next we argue that

𝔹(x¯,ε)[Fσ,γ(x¯)<Fσ,γ<Fσ,γ(x¯)+η]=,\mathbb{B}(\overline{x},\varepsilon)\cap[F_{\sigma,\gamma}(\overline{x})<F_{\sigma,\gamma}<F_{\sigma,\gamma}(\overline{x})+\eta]=\emptyset,

which by Definition 2.3 implies that $F_{\sigma,\gamma}$ is a KL function of exponent $0$ (and hence of exponent $1/2$) at $\overline{x}$. Suppose on the contrary that there exists $x\in\mathbb{B}(\overline{x},\varepsilon)\cap[F_{\sigma,\gamma}(\overline{x})<F_{\sigma,\gamma}<F_{\sigma,\gamma}(\overline{x})+\eta]$. From $F_{\sigma,\gamma}(x)<F_{\sigma,\gamma}(\overline{x})+\eta$, we have $x\in\mathcal{S}$. Together with ${\rm supp}(x)\supseteq J$, we deduce that ${\rm supp}(x)=J$ (if not, $f_{\sigma,\gamma}(x)+\lambda\|\overline{x}\|_{0}+\lambda\leq F_{\sigma,\gamma}(x)<f_{\sigma,\gamma}(\overline{x})+\lambda\|\overline{x}\|_{0}+\eta$, which along with $f_{\sigma,\gamma}(x)>f_{\sigma,\gamma}(\overline{x})-\lambda/2$ implies $\eta>\lambda/2$, a contradiction to $\eta\leq\lambda/4$). Now from $x\in\mathcal{S}$, the relations in (34), the expression of $\vartheta_{\sigma,\gamma}$ and $I_{0}\cup I_{1}\cup I_{2}=[m]$, it follows that

0<Fσ,γ(x)Fσ,γ(x¯)=Lσ,γ(Ax)Lσ,γ(Ax¯)=iI2[(Ax¯)i(Ax)i].0<F_{\sigma,\gamma}(x)-F_{\sigma,\gamma}(\overline{x})=L_{\sigma,\gamma}(Ax)-L_{\sigma,\gamma}(A\overline{x})={\textstyle\sum_{i\in I_{2}}}[(A\overline{x})_{i}-(Ax)_{i}]. (35)

Recall that [Lσ,γ(Ax)]I2=e[\nabla\!L_{\sigma,\gamma}(Ax^{\prime})]_{I_{2}}=-e and [Lσ,γ(Ax)]I0I1=0[\nabla\!L_{\sigma,\gamma}(Ax^{\prime})]_{I_{0}\cup I_{1}}=0 with x=xx^{\prime}=x and x¯\overline{x}. Hence,

AJ𝕋Lσ,γ(Ax)2=AI2J𝕋e2,Lσ,γ(Ax),Ax2=[iI2(Ax)i]2,\displaystyle\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)\|^{2}=\|A_{I_{2}J}^{\mathbb{T}}e\|^{2},\ \langle\nabla\!L_{\sigma,\gamma}(Ax),Ax\rangle^{2}=[{\textstyle\sum_{i\in I_{2}}}(Ax)_{i}]^{2}, (36)
AJ𝕋Lσ,γ(Ax¯)2=AI2J𝕋e2,Lσ,γ(Ax¯),Ax¯2=[iI2(Ax¯)i]2.\displaystyle\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(A\overline{x})\|^{2}=\|A_{I_{2}J}^{\mathbb{T}}e\|^{2},\ \langle\nabla\!L_{\sigma,\gamma}(A\overline{x}),A\overline{x}\rangle^{2}=[{\textstyle\sum_{i\in I_{2}}}(A\overline{x})_{i}]^{2}. (37)

By comparing (33) with Lemma 2.3 (ii), we have Fσ,γ(x)=f~σ,γ(x)+λx0\partial F_{\sigma,\gamma}(x)=\partial\!\widetilde{f}_{\sigma,\gamma}(x)+\lambda\partial\|x\|_{0}. Since supp(x)=J{\rm supp}(x)=J, we also have x0={vn|vi=0foriJ}\partial\|x\|_{0}=\{v\in\mathbb{R}^{n}\,|\,v_{i}=0\ {\rm for}\ i\in J\}. Then, it holds that

dist2(0,Fσ,γ(x))\displaystyle{\rm dist}^{2}(0,\partial F_{\sigma,\gamma}(x)) =minuf~σ,γ(x),vλx0u+v2=minuf~σ,γ(x)uJ2\displaystyle=\min_{u\in\partial\!\widetilde{f}_{\sigma,\gamma}(x),v\in\lambda\partial\|x\|_{0}}\|u+v\|^{2}=\min_{u\in\partial\!\widetilde{f}_{\sigma,\gamma}(x)}\|u_{J}\|^{2}
=minαAJ𝕋Lσ,γ(Ax)+αxJ2\displaystyle=\min_{\alpha\in\mathbb{R}}\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)+\alpha x_{J}\|^{2}
=minαα2+2Ax,Lσ,γ(Ax)α+AJ𝕋Lσ,γ(Ax)2\displaystyle=\min_{\alpha\in\mathbb{R}}\alpha^{2}+2\langle Ax,\nabla\!L_{\sigma,\gamma}(Ax)\rangle\alpha+\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)\|^{2}
=AJ𝕋Lσ,γ(Ax)2Ax,Lσ,γ(Ax)2.\displaystyle=\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)\|^{2}-\langle Ax,\nabla\!L_{\sigma,\gamma}(Ax)\rangle^{2}. (38)

Since 0Fσ,γ(x¯)=fσ,γ(x¯)+𝒩𝒮(x¯)+λx¯00\in\partial F_{\sigma,\gamma}(\overline{x})=\nabla\!f_{\sigma,\gamma}(\overline{x})+\mathcal{N}_{\mathcal{S}}(\overline{x})+\lambda\partial\|\overline{x}\|_{0}, from the expression of x¯0\partial\|\overline{x}\|_{0} we have

AJ𝕋Lσ,γ(Ax¯)=α¯x¯Jwithα¯=Ax¯,Lσ,γ(Ax¯).A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(A\overline{x})=\overline{\alpha}\,\overline{x}_{J}\ \ {\rm with}\ \ \overline{\alpha}=\langle A\overline{x},\nabla\!L_{\sigma,\gamma}(A\overline{x})\rangle.

Together with the equations (36)-(38), it immediately follows that

0\displaystyle 0 dist2(0,Fσ,γ(x))=AJ𝕋Lσ,γ(Ax)2Ax,Lσ,γ(Ax)2AJ𝕋Lσ,γ(Ax¯)α¯x¯J2\displaystyle\leq{\rm dist}^{2}(0,\partial F_{\sigma,\gamma}(x))=\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)\|^{2}-\langle Ax,\nabla\!L_{\sigma,\gamma}(Ax)\rangle^{2}-\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(A\overline{x})-\overline{\alpha}\,\overline{x}_{J}\|^{2}
=AJ𝕋Lσ,γ(Ax)2AJ𝕋Lσ,γ(Ax¯)2+[Lσ,γ(Ax¯),Ax¯2Lσ,γ(Ax),Ax2]\displaystyle=\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)\|^{2}-\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(A\overline{x})\|^{2}+\big{[}\langle\nabla\!L_{\sigma,\gamma}(A\overline{x}),A\overline{x}\rangle^{2}\!-\!\langle\nabla\!L_{\sigma,\gamma}(Ax),Ax\rangle^{2}\big{]}
=iI2[(Ax¯)i(Ax)i]iI2[(Ax¯)i+(Ax)i].\displaystyle={\textstyle\sum_{i\in I_{2}}}[(A\overline{x})_{i}-(Ax)_{i}]\cdot{\textstyle\sum_{i\in I_{2}}}[(A\overline{x})_{i}+(Ax)_{i}].

Since iI2[(Ax¯)i+(Ax)i]<0{\textstyle\sum_{i\in I_{2}}}[(A\overline{x})_{i}+(Ax)_{i}]<0, the last inequality implies that iI2[(Ax¯)i(Ax)i]0{\textstyle\sum_{i\in I_{2}}}[(A\overline{x})_{i}-(Ax)_{i}]\leq 0, which is a contradiction to the inequality (35). The proof is then completed. \Box

Remark 4.2

By the definition of $\Gamma(\overline{x})$, when $\gamma$ is small enough, it is very likely that $\Gamma(\overline{x})=\emptyset$ and hence that $F_{\sigma,\gamma}$ is a KL function of exponent $0$.

4.2 PG with extrapolation for solving (9)

By the proof of Lemma 2.3, $\Xi_{\sigma,\gamma}$ is a smooth function and $\nabla\Xi_{\sigma,\gamma}$ is globally Lipschitz continuous with Lipschitz constant $L_{\Xi}\leq\gamma^{-1}\|A\|^{2}+\lambda\rho^{2}\max(\frac{a+1}{2},\frac{a+1}{2(a-1)})$. Moreover, by Proposition 2.2, the proximal mapping of $h_{\lambda,\rho}$ has a closed form. This motivates us to apply the PG method with extrapolation to solve the problem (9).

Algorithm 2 (PGe-scad for solving the problem (9))

Initialization: Choose ς(0,1),0<τ<(1ς)LΞ1,0<βmaxς(τ1LΞ)τ12(τ1+LΞ)\varsigma\in(0,1),0<\tau<(1\!-\!\varsigma)L_{\Xi}^{-1},0<\beta_{\rm max}\leq{\frac{\sqrt{\varsigma(\tau^{-1}-L_{\Xi})\tau^{-1}}}{2(\tau^{-1}+L_{\Xi})}} and an initial point x0𝒮x^{0}\in\mathcal{S}. Set x1=x0x^{-1}=x^{0} and k:=0k:=0.

while the termination condition is not satisfied do

  • 1.

    Let x~k=xk+βk(xkxk1)\widetilde{x}^{k}=x^{k}+\beta_{k}(x^{k}-x^{k-1}) and compute xk+1𝒫τhλ,ρ(x~kτΞσ,γ(x~k))x^{k+1}\in\mathcal{P}_{\!\tau}h_{\lambda,\rho}(\widetilde{x}^{k}\!-\!\tau\nabla\Xi_{\sigma,\gamma}(\widetilde{x}^{k})).

  • 2.

    Choose βk+1[0,βmax]\beta_{k+1}\in[0,\beta_{\rm max}]. Let kk+1k\leftarrow k+1 and go to Step 1.

end (while)
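For illustration only, the following Python-style sketch mirrors the structure of Algorithm 2 under the above parameter restrictions. Here `grad_Xi` and `prox_h_scad` are hypothetical placeholders for $\nabla\Xi_{\sigma,\gamma}$ and the closed-form proximal mapping of $h_{\lambda,\rho}$ from Proposition 2.2, and the simple schedule for $\beta_{k}$ is merely one admissible choice in $[0,\beta_{\rm max}]$; this is a sketch, not the implementation released with the paper.

```python
import numpy as np

def pge_scad(x0, grad_Xi, prox_h_scad, L_Xi, varsigma=0.5, max_iter=2000, tol=1e-6):
    """Sketch of PGe-scad (Algorithm 2); grad_Xi and prox_h_scad are placeholders."""
    tau = 0.99 * (1.0 - varsigma) / L_Xi                    # 0 < tau < (1 - varsigma)/L_Xi
    beta_max = np.sqrt(varsigma * (1.0 / tau - L_Xi) / tau) / (2.0 * (1.0 / tau + L_Xi))
    x_prev, x = x0.copy(), x0.copy()                        # x^{-1} = x^0
    for k in range(max_iter):
        beta = min(beta_max, k / (k + 3.0))                 # any value in [0, beta_max] is admissible
        x_tilde = x + beta * (x - x_prev)                   # extrapolation step
        x_new = prox_h_scad(x_tilde - tau * grad_Xi(x_tilde), tau)  # proximal step
        if np.linalg.norm(x_new - x_tilde) <= tol:          # stopping rule of Section 5.2
            return x_new
        x_prev, x = x, x_new
    return x
```

Replacing `grad_Xi` and `prox_h_scad` with $\nabla\!f_{\sigma,\gamma}$ and the proximal mapping of $g_{\lambda}$ gives the analogous sketch of Algorithm 1 (PGe-znorm).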

Similar to Algorithm 1, the extrapolation parameter $\beta_{k}$ in Algorithm 2 can be chosen according to the rule in (28). For any $\tau>0$ and $\varsigma\in(0,1)$, we define the potential function

Υτ,ς(x,u):=Gσ,γ,ρ(x)+ς4τxu2(x,u)n×n.\Upsilon_{\!\tau,\varsigma}(x,u):=G_{\sigma,\gamma,\rho}(x)+\frac{\varsigma}{4\tau}\|x-u\|^{2}\quad\ \forall(x,u)\in\mathbb{R}^{n}\times\mathbb{R}^{n}. (39)

Then, by following the same arguments as those for Lemmas 4.1 and 4.2, we can establish the following properties of $\Upsilon_{\!\tau,\varsigma}$ on the sequence $\{x^{k}\}_{k\in\mathbb{N}}$ generated by Algorithm 2.

Lemma 4.4

Let {xk}k\{x^{k}\}_{k\in\mathbb{N}} be the sequence generated by Algorithm 2 and denote by π(x0)\pi(x^{0}) the set of accumulation points of {xk}k\{x^{k}\}_{k\in\mathbb{N}}. Then, the following assertions hold.

  • (i)

    For each kk\in\mathbb{N}, Υτ,ς(xk+1,xk)Υτ,ς(xk,xk1)ς(τ1LΞ)2xk+1xk2.\Upsilon_{\!\tau,\varsigma}(x^{k+1},x^{k})\leq\Upsilon_{\!\tau,\varsigma}(x^{k},x^{k-1})-\frac{\varsigma(\tau^{-1}-L_{\Xi})}{2}\|x^{k+1}\!-\!x^{k}\|^{2}. Consequently, {Υτ,ς(xk,xk1)}k\{\Upsilon_{\!\tau,\varsigma}(x^{k},x^{k-1})\}_{k\in\mathbb{N}} is convergent and k=1xk+1xk2<\sum_{k=1}^{\infty}\|x^{k+1}\!-\!x^{k}\|^{2}<\infty.

  • (ii)

    For each $k\in\mathbb{N}$, there exists $w^{k+1}\in\partial\Upsilon_{\!\tau,\varsigma}(x^{k+1},x^{k})$ with $\|w^{k+1}\|\leq b_{1}^{\prime}\|x^{k+1}\!-\!x^{k}\|+b_{2}^{\prime}\|x^{k}\!-\!x^{k-1}\|$, where $b_{1}^{\prime}>0$ and $b_{2}^{\prime}>0$ are constants independent of $k$.

  • (iii)

    π(x0)\pi(x^{0}) is a nonempty compact set and π(x0)Sτ,hλ,ρ\pi(x^{0})\subseteq S_{\tau,h_{\lambda,\rho}}.

  • (iv)

    $\lim_{k\to\infty}{\rm dist}((x^{k},x^{k-1}),\pi(x^{0})\times\pi(x^{0}))=0$, and $\Upsilon_{\!\tau,\varsigma}$ is finite and constant on the set $\pi(x^{0})\times\pi(x^{0})$.

By using Lemma 4.4 and following the same arguments as those for Theorem 4.1, it is not difficult to obtain the following convergence results for Algorithm 2.

Theorem 4.2

Let {xk}k\{x^{k}\}_{k\in\mathbb{N}} be the sequence generated by Algorithm 2. Then,

  • (i)

    k=1xk+1xk<\sum_{k=1}^{\infty}\|x^{k+1}-x^{k}\|<\infty and consequently {xk}k\{x^{k}\}_{k\in\mathbb{N}} converges to some xSτ,hλ,ρx^{*}\in S_{\tau,h_{\lambda,\rho}}.

  • (ii)

    If Gσ,γ,ρG_{\sigma,\gamma,\rho} is a KL function of exponent 1/21/2, then there exist c2>0c_{2}>0 and ϱ(0,1)\varrho\in(0,1) such that for all sufficiently large kk, xkxc2ϱk\|x^{k}-x^{*}\|\leq c_{2}\varrho^{k}.

Theorem 4.2 (ii) requires that $G_{\sigma,\gamma,\rho}$ is a KL function of exponent $1/2$. We next show that this indeed holds under a slightly stronger condition than the one used in Lemma 4.3.

Lemma 4.5

If λ\lambda and ρ\rho are chosen with λρ>maxzcritGσ,γ,ρfσ,γ(z)\lambda\rho\!>\!{\displaystyle\max_{z\in{\rm crit}G_{\sigma,\gamma,\rho}}}\|\nabla\!f_{\sigma,\gamma}(z)\|_{\infty} and all x¯critGσ,γ,ρ\overline{x}\!\in{\rm crit}G_{\sigma,\gamma,\rho} satisfy Γ(x¯)=\Gamma(\overline{x})=\emptyset and |x¯|nz>2aρ(a1)|\overline{x}|_{\rm nz}>\frac{2a}{\rho(a-1)}, then Gσ,γ,ρG_{\sigma,\gamma,\rho} is a KL function of exponent 0.

Proof: Fix any $\overline{x}\in{\rm crit}G_{\sigma,\gamma,\rho}$ with $\Gamma(\overline{x})=\emptyset$ and $|\overline{x}|_{\rm nz}>\frac{2a}{\rho(a-1)}$. Let $J={\rm supp}(\overline{x})$ and $\overline{J}=[n]\backslash J$, and let $\theta_{\!\rho}$ be the function in the proof of Lemma 2.3. Since $[\nabla\theta_{\!\rho}(\overline{x})]_{\overline{J}}=0$, the given assumption means that $\|[\nabla\!f_{\sigma,\gamma}(\overline{x})\!-\!\lambda\rho\nabla\theta_{\!\rho}(\overline{x})]_{\overline{J}}\|_{\infty}<\lambda\rho$. By continuity, there exists $\delta_{0}>0$ such that for all $x\in\mathbb{B}(\overline{x},\delta_{0})$, $\|[\nabla\!f_{\sigma,\gamma}(x)\!-\!\lambda\rho\nabla\theta_{\!\rho}(x)]_{\overline{J}}\|_{\infty}<\lambda\rho$. Let $I_{0},I_{1}$ and $I_{2}$ be the same as in the proof of Lemma 4.3. Then, there exists $\delta_{1}>0$ such that for all $x\in\mathbb{B}(\overline{x},\delta_{1})$, ${\rm supp}(x)\supseteq J$ and the relations in (34) hold. By continuity again, there exists $\delta_{2}>0$ such that for all $x\in\mathbb{B}(\overline{x},\delta_{2})$, $|x_{i}|>\frac{2a}{\rho(a-1)}$ for every $i\in{\rm supp}(x)$ and

Ξσ,γ(x)+λρiJ|xi|>Ξσ,γ(x¯)+λρiJ|x¯i|aλ/(a1).\Xi_{\sigma,\gamma}(x)+\lambda\rho{\textstyle\sum_{i\in J}}|x_{i}|>\Xi_{\sigma,\gamma}(\overline{x})+\lambda\rho{\textstyle\sum_{i\in J}}|\overline{x}_{i}|-{a\lambda}/{(a\!-\!1)}. (40)

Set $\delta=\min(\delta_{0},\delta_{1},\delta_{2})$ and pick any $\eta\in(0,\frac{a\lambda}{2(a-1)})$. Next we argue that $\mathbb{B}(\overline{x},\delta)\cap[G_{\sigma,\gamma,\rho}(\overline{x})<G_{\sigma,\gamma,\rho}<G_{\sigma,\gamma,\rho}(\overline{x})+\eta]=\emptyset$, which by Definition 2.3 implies that $G_{\sigma,\gamma,\rho}$ is a KL function of exponent $0$ (and hence of exponent $1/2$) at $\overline{x}$. Suppose on the contrary that there exists $x\in\mathbb{B}(\overline{x},\delta)\cap[G_{\sigma,\gamma,\rho}(\overline{x})<G_{\sigma,\gamma,\rho}<G_{\sigma,\gamma,\rho}(\overline{x})+\eta]$. From $G_{\sigma,\gamma,\rho}(x)<G_{\sigma,\gamma,\rho}(\overline{x})+\eta$, we have $x\in\mathcal{S}$, which along with ${\rm supp}(x)\supseteq J$ implies that ${\rm supp}(x)=J$ (if not, we would have $\Xi_{\sigma,\gamma}(x)+\lambda\rho{\textstyle\sum_{i\in J}}|x_{i}|+\lambda\rho{\textstyle\sum_{i\in{\rm supp}(x)\backslash J}}|x_{i}|<G_{\sigma,\gamma,\rho}(x)<\Xi_{\sigma,\gamma}(\overline{x})+\lambda\rho{\textstyle\sum_{i\in J}}|\overline{x}_{i}|+\eta$, which along with (40) and $|x_{i}|>\frac{2a}{\rho(a-1)}$ for $i\in{\rm supp}(x)\backslash J$ implies that $\eta>\frac{a\lambda}{a-1}$, a contradiction to $\eta<\frac{a\lambda}{2(a-1)}$). Now from ${\rm supp}(x)=J$ and $|x_{i}|>\frac{2a}{\rho(a-1)}$ for $i\in{\rm supp}(x)$, it is not hard to verify that $\|x\|_{1}-\theta_{\!\rho}(x)=\|\overline{x}\|_{1}-\theta_{\!\rho}(\overline{x})$. Together with $x\in\mathcal{S}$ and the expression of $L_{\sigma,\gamma}$, we have

0<Gσ,γ,ρ(x)Gσ,γ,ρ(x¯)=Lσ,γ(Ax)Lσ,γ(Ax¯)=iI2[(Ax¯)i(Ax)i].0<G_{\sigma,\gamma,\rho}(x)-G_{\sigma,\gamma,\rho}(\overline{x})=L_{\sigma,\gamma}(Ax)-L_{\sigma,\gamma}(A\overline{x})={\textstyle\sum_{i\in I_{2}}}[(A\overline{x})_{i}-(Ax)_{i}]. (41)

Moreover, the equalities in (36)-(37) still hold for $x$. Let $\widetilde{f}_{\sigma,\gamma}$ be the same as in the proof of Lemma 4.3. Clearly, $\partial G_{\sigma,\gamma,\rho}(x)=\partial\!\widetilde{f}_{\sigma,\gamma}(x)+\lambda\rho[\partial\|x\|_{1}-\nabla\theta_{\!\rho}(x)]$. Then, it holds that

dist2(0,Gσ,γ,ρ(x))=minuf~σ,γ(x),vλρ[x1θρ(x)]u+v2.{\rm dist}^{2}(0,\partial G_{\sigma,\gamma,\rho}(x))=\min_{u\in\partial\!\widetilde{f}_{\sigma,\gamma}(x),v\in\lambda\rho[\partial\|x\|_{1}-\nabla\theta_{\!\rho}(x)]}\|u+v\|^{2}.

Notice that f~σ,γ(x)={fσ,γ(x)+αx|α},[fσ,γ(x)λρθρ(x)]J¯<λρ\partial\!\widetilde{f}_{\sigma,\gamma}(x)=\{\nabla\!f_{\sigma,\gamma}(x)+\alpha x\,|\,\alpha\in\mathbb{R}\},\|[\nabla\!f_{\sigma,\gamma}(x)\!-\!\lambda\rho\nabla\theta_{\!\rho}(x)]_{\overline{J}}\|_{\infty}<\lambda\rho and vJ¯[λρ,λρ]λρ[θρ(x)]J¯v_{\overline{J}}\in[-\lambda\rho,\lambda\rho]-\lambda\rho[\nabla\theta_{\!\rho}(x)]_{\overline{J}}. From the last equation, it follows that

dist2(0,Gσ,γ,ρ(x))\displaystyle{\rm dist}^{2}(0,\partial G_{\sigma,\gamma,\rho}(x)) =minuf~σ,γ(x)uJ2=minαAJ𝕋Lσ,γ(Ax)+αxJ2\displaystyle=\min_{u\in\partial\!\widetilde{f}_{\sigma,\gamma}(x)}\|u_{J}\|^{2}=\min_{\alpha\in\mathbb{R}}\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)+\alpha x_{J}\|^{2}
=minαα2+2Ax,Lσ,γ(Ax)α+AJ𝕋Lσ,γ(Ax)2\displaystyle=\min_{\alpha\in\mathbb{R}}\alpha^{2}+2\langle Ax,\nabla\!L_{\sigma,\gamma}(Ax)\rangle\alpha+\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)\|^{2}
=AJ𝕋Lσ,γ(Ax)2Ax,Lσ,γ(Ax)2.\displaystyle=\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)\|^{2}-\langle Ax,\nabla\!L_{\sigma,\gamma}(Ax)\rangle^{2}. (42)

Since $0\in\partial G_{\sigma,\gamma,\rho}(\overline{x})=\nabla\!f_{\sigma,\gamma}(\overline{x})+\mathcal{N}_{\mathcal{S}}(\overline{x})+\lambda\rho[\partial\|\overline{x}\|_{1}-\nabla\theta_{\rho}(\overline{x})]$ and $|\overline{x}|_{\rm nz}>\frac{2a}{\rho(a-1)}$, by the proof of Lemma 2.3 (iii) and $\mathcal{N}_{\mathcal{S}}(\overline{x})=\{\alpha\overline{x}\,|\,\alpha\in\mathbb{R}\}$, there exists $\overline{\alpha}\in\mathbb{R}$ such that $A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(A\overline{x})=\overline{\alpha}\,\overline{x}_{J}$ with $\overline{\alpha}=\langle A\overline{x},\nabla\!L_{\sigma,\gamma}(A\overline{x})\rangle$. Together with (36)-(37) and (42),

0\displaystyle 0 AJ𝕋Lσ,γ(Ax)2Ax,Lσ,γ(Ax)2AJ𝕋Lσ,γ(Ax¯)α¯x¯J2\displaystyle\leq\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)\|^{2}-\langle Ax,\nabla\!L_{\sigma,\gamma}(Ax)\rangle^{2}-\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(A\overline{x})-\overline{\alpha}\,\overline{x}_{J}\|^{2}
=AJ𝕋Lσ,γ(Ax)2AJ𝕋Lσ,γ(Ax¯)2+[Lσ,γ(Ax¯),Ax¯2Lσ,γ(Ax),Ax2]\displaystyle=\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(Ax)\|^{2}-\|A_{J}^{\mathbb{T}}\nabla\!L_{\sigma,\gamma}(A\overline{x})\|^{2}+\big{[}\langle\nabla\!L_{\sigma,\gamma}(A\overline{x}),A\overline{x}\rangle^{2}\!-\!\langle\nabla\!L_{\sigma,\gamma}(Ax),Ax\rangle^{2}\big{]}
=iI2[(Ax¯)i(Ax)i]iI2[(Ax¯)i+(Ax)i].\displaystyle={\textstyle\sum_{i\in I_{2}}}[(A\overline{x})_{i}-(Ax)_{i}]\cdot{\textstyle\sum_{i\in I_{2}}}[(A\overline{x})_{i}+(Ax)_{i}].

Since iI2[(Ax¯)i+(Ax)i]<0{\textstyle\sum_{i\in I_{2}}}[(A\overline{x})_{i}+(Ax)_{i}]<0, the last inequality implies that iI2[(Ax¯)i(Ax)i]0{\textstyle\sum_{i\in I_{2}}}[(A\overline{x})_{i}-(Ax)_{i}]\leq 0, which is a contradiction to the inequality (41). The proof is then completed. \Box

5 Numerical experiments

In this section we demonstrate the performance of the zero-norm regularized DC loss model (7) and its surrogate (9), which are solved with PGe-znorm and PGe-scad, respectively. All numerical experiments are performed in MATLAB on a laptop running a 64-bit Windows system with an Intel(R) Core(TM) i7-7700HQ CPU at 2.80 GHz and 16 GB RAM. The MATLAB package for reproducing all the numerical results can be found at https://github.com/SCUT-OptGroup/onebit.

5.1 Experiment setup

The setup of our experiments is similar to the one in [46, 19]. Specifically, we generate the original $s^{*}$-sparse signal $x^{\rm true}$ with the support $T$ chosen uniformly from $\{1,2,\ldots,n\}$ and $(x^{\rm true})_{T}$ taking the form of ${\xi}/{\|\xi\|}$, where the entries of $\xi\in\mathbb{R}^{s^{*}}$ are drawn from the standard normal distribution. Then, we obtain the observation vector $b$ via (2), where the sampling matrix $\Phi\in\mathbb{R}^{m\times n}$ is generated in two ways: (I) the rows of $\Phi$ are i.i.d. samples of $N(0,\Sigma)$ with $\Sigma_{ij}=\mu^{|i-j|}$ for $i,j\in[n]$; (II) the entries of $\Phi$ are i.i.d. and follow the standard normal distribution. The noise $\varepsilon\in\mathbb{R}^{m}$ is generated from $N(0,\varpi^{2}I)$, and the entries of $\zeta$ are set by $\mathbb{P}(\zeta_{i}=1)=1-\mathbb{P}(\zeta_{i}=-1)=1-r$. In the sequel, we denote the corresponding data by the two triples $(m,n,s^{*})$ and $(\mu,\varpi,r)$, where $\mu$ is the correlation factor, $\varpi$ denotes the noise level and $r$ is the sign flip ratio.
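For concreteness, a minimal sketch of this data-generating procedure is given below. It assumes that the observation model (2) takes the standard form $b=\zeta\circ{\rm sgn}(\Phi x^{\rm true}+\varepsilon)$; the function `generate_data` and its arguments are our own illustrative names rather than part of the released package.

```python
import numpy as np

def generate_data(m, n, s, mu=0.3, varpi=0.1, r=0.05, matrix_type="I", rng=None):
    """Generate (Phi, b, x_true, T) as in Section 5.1 (a sketch).

    Assumes the observation model (2) reads b = zeta * sgn(Phi @ x_true + eps),
    where sgn(t) = 1 if t > 0 and -1 otherwise.
    """
    rng = np.random.default_rng() if rng is None else rng
    # s-sparse unit-norm signal with uniformly chosen support T
    x_true = np.zeros(n)
    T = rng.choice(n, size=s, replace=False)
    xi = rng.standard_normal(s)
    x_true[T] = xi / np.linalg.norm(xi)
    # sampling matrix: type I (correlated rows) or type II (i.i.d. standard normal)
    if matrix_type == "I":
        Sigma = mu ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
        Phi = rng.multivariate_normal(np.zeros(n), Sigma, size=m)
    else:
        Phi = rng.standard_normal((m, n))
    eps = varpi * rng.standard_normal(m)                  # noise drawn from N(0, varpi^2 I)
    zeta = np.where(rng.random(m) < 1.0 - r, 1.0, -1.0)   # P(zeta_i = 1) = 1 - r
    b = zeta * np.where(Phi @ x_true + eps > 0, 1.0, -1.0)
    return Phi, b, x_true, T
```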

We evaluate the quality of an output xsolx^{\rm sol} of a solver in terms of the mean squared error (MSE), the Hamming error (Herr), the ratio of missing support (FNR) and the ratio of misidentified support (FPR), which are defined as follows

MSE:=xsolxtrue,Herr:=1msign(Φxsol)sign(Φxtrue)0,\displaystyle{\rm MSE}:=\|x^{\rm sol}\!-\!x^{\rm true}\|,\ {\rm Herr}:=\dfrac{1}{m}\|{\rm sign}(\Phi x^{\rm sol})-{\rm sign}(\Phi x^{\rm true})\|_{0},
FNR:=|T\supp(xsol)||T|andFPR:=|supp(xsol)\T|n|T|,\displaystyle{\rm FNR}:=\frac{|T\backslash{\rm supp}(x^{\rm sol})|}{|T|}\ \ {\rm and}\ \ {\rm FPR}:=\frac{|{\rm supp}(x^{\rm sol})\backslash T|}{n-|T|},\qquad

where, in our numerical experiments, a component of a vector $z\in\mathbb{R}^{n}$ is regarded as nonzero if its absolute value is larger than $10^{-5}\|z\|_{\infty}$. Clearly, a solver performs better if its output has smaller MSE, ${\rm Herr}$, FNR and FPR.
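A direct implementation of these four metrics might look as follows; it uses the stated threshold $10^{-5}\|z\|_{\infty}$ to decide which components are nonzero, and treats a zero entry of $\Phi x$ as having sign $-1$ (consistent with ${\rm sgn}$), which is an implementation choice on our part.

```python
import numpy as np

def support(z, tol=1e-5):
    """Indices of z whose magnitude exceeds 1e-5 * ||z||_inf (the nonzero rule above)."""
    return set(np.flatnonzero(np.abs(z) > tol * np.linalg.norm(z, np.inf)))

def evaluate(x_sol, x_true, Phi):
    """Return (MSE, Herr, FNR, FPR) for an output x_sol, as defined above."""
    sgn = lambda v: np.where(v > 0, 1.0, -1.0)
    T, S = support(x_true), support(x_sol)
    n = x_true.size
    mse = np.linalg.norm(x_sol - x_true)                   # recovery error
    herr = np.mean(sgn(Phi @ x_sol) != sgn(Phi @ x_true))  # Hamming error
    fnr = len(T - S) / len(T)                              # ratio of missing support
    fpr = len(S - T) / (n - len(T))                        # ratio of misidentified support
    return mse, herr, fnr, fpr
```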

5.2 Implementation of PGe-znorm and PGe-scad

From the definition of $x^{k+1}$ in PGe-znorm and PGe-scad, we have $(x^{k+1}\!-\!\widetilde{x}^{k})+\widetilde{x}^{k}\in\mathcal{P}_{\!\tau}g_{\lambda}(\widetilde{x}^{k}\!-\!\tau\nabla\!f_{\sigma,\gamma}(\widetilde{x}^{k}))$ and $(x^{k+1}\!-\!\widetilde{x}^{k})+\widetilde{x}^{k}\in\mathcal{P}_{\!\tau}h_{\lambda,\rho}(\widetilde{x}^{k}\!-\!\tau\nabla\Xi_{\sigma,\gamma}(\widetilde{x}^{k}))$, respectively. Together with the expressions of $\mathcal{P}_{\!\tau}g_{\lambda}$ and $\mathcal{P}_{\!\tau}h_{\lambda,\rho}$, when $\|x^{k+1}\!-\!\widetilde{x}^{k}\|$ is small enough, $\widetilde{x}^{k}$ can be viewed as an approximate $\tau$-stationary point. Hence, we terminate PGe-znorm and PGe-scad at the iterate $x^{k}$ once $\|x^{k+1}-\widetilde{x}^{k}\|\leq 10^{-6}$ or $k\geq 2000$. In addition, we also terminate the two algorithms at $x^{k}$ when $\frac{|F_{\sigma,\gamma}(x^{k-j})-F_{\sigma,\gamma}(x^{k-j-1})|}{\max(1,F_{\sigma,\gamma}(x^{k-j}))}\leq 10^{-10}$ for $k\geq 100$ and $j=0,1,\ldots,9$. The extrapolation parameters $\beta_{k}$ in the two algorithms are chosen by (28) with $\beta_{\rm max}=0.235$. The starting point $x^{0}$ of PGe-znorm and PGe-scad is always chosen to be ${e^{\mathbb{T}}A}/{\|e^{\mathbb{T}}A\|}$.
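The termination rules just described can be encoded as below; `F_hist` is assumed to store the objective values at $x^{0},\ldots,x^{k}$, and the relative-change test is written for successive objective values, which is our reading of the criterion.

```python
import numpy as np

def should_stop(k, x_new, x_tilde, F_hist, tol_x=1e-6, tol_F=1e-10, max_iter=2000):
    """Termination test for PGe-znorm/PGe-scad (a sketch of Section 5.2).

    F_hist holds the objective values at x^0, ..., x^k; the second rule reads the
    relative-change criterion as a test on successive values (our interpretation).
    """
    if np.linalg.norm(x_new - x_tilde) <= tol_x or k >= max_iter:
        return True
    if k >= 100:
        recent = F_hist[-11:]                      # objective values at x^{k-10}, ..., x^k
        rel = [abs(recent[j + 1] - recent[j]) / max(1.0, abs(recent[j]))
               for j in range(10)]
        return max(rel) <= tol_F
    return False
```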

5.3 Choice of the model parameters

The model (7) and its surrogate (9) involve the parameters $\lambda>0,\rho>0$ and $0<\gamma<\sigma/2$. By Figures 2 and 3, we choose $\gamma=0.05,\rho=10$ for the subsequent tests. To choose an appropriate $\sigma>2\gamma$, we generate the original signal $x^{\rm true}$, the sampling matrix $\Phi$ of type I and the observation $b$ with $(m,n,s^{*},r)=(500,1000,5,1.0)$, and then solve the model (7) associated with $\gamma=0.05,\lambda=10$ for each $\sigma\in\{0.2,0.4,\ldots,3\}$ with PGe-znorm and the model (9) associated with $\gamma=0.05,\lambda=5,\rho=10$ for each $\sigma\in\{0.2,0.4,\ldots,3\}$ with PGe-scad. Figure 4 plots the average MSE of $50$ trials for each $\sigma$. We see that $\sigma\in[0.6,1.2]$ is a desirable choice, so we choose $\sigma=0.8$ for the two models in the subsequent experiments.
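The $\sigma$-selection procedure above amounts to a simple grid search; a sketch is given below, where `run_solver` and `make_instance` are hypothetical placeholders for solving model (7) (or (9)) with the remaining parameters fixed and for the data generator of Section 5.1, respectively.

```python
import numpy as np

def tune_sigma(run_solver, make_instance, sigmas, trials=50):
    """Average MSE over repeated random instances for each candidate sigma (a sketch)."""
    avg_mse = np.zeros(len(sigmas))
    for i, sigma in enumerate(sigmas):
        errs = []
        for _ in range(trials):
            Phi, b, x_true = make_instance()              # one random instance per trial
            errs.append(np.linalg.norm(run_solver(Phi, b, sigma) - x_true))
        avg_mse[i] = np.mean(errs)
    return avg_mse
```

For instance, calling `tune_sigma(run_solver, make_instance, np.arange(0.2, 3.01, 0.2))` corresponds to the grid $\{0.2,0.4,\ldots,3\}$ used for Figure 4.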

Figure 4: Influence of σ\sigma on the performance of the models (7) and (9)

Next we take a closer look at the influence of $\lambda$ on the models (7) and (9). To this end, we generate the signal $x^{\rm true}$, the sampling matrix $\Phi$ of type I, and the observation $b$ with $(m,n,s^{*})=(500,1000,5)$ and $(\mu,\varpi)=(0.3,0.1)$, and then solve the model (7) associated with $\sigma=0.8,\gamma=0.05$ for each $\lambda\in\{1,3,5,\ldots,49\}$ with PGe-znorm and the model (9) associated with $\sigma=0.8,\gamma=0.05,\rho=10$ for each $\lambda\in\{0.5,1.5,2.5,\ldots,24.5\}$ with PGe-scad. Figure 5 plots the average MSE of $50$ trials for each $\lambda$. When $r=0.15$, the MSE from the model (7) varies little for $\lambda\in[7,49]$, while the MSE from the model (9) varies little for $\lambda\in[3,24.5]$. When $r=0.05$, the MSE from the model (7) varies little for $\lambda\in[5,49]$ and is relatively low for $\lambda\in[5,16]$, while the MSE from the model (9) changes only slightly for $\lambda\in[1.5,24.5]$. In view of this, we always choose $\lambda=8$ for the model (7), and choose $\lambda=4$ and $\lambda=8$ for the model (9) with $n\leq 5000$ and $n>5000$, respectively, in the subsequent experiments.

Figure 5: Influence of λ\lambda on the MSE from the models (7) and (9)

5.4 Numerical comparisons

We compare PGe-znorm and PGe-scad with six state-of-the-art solvers: BIHT-AOP [46], PIHT [20], PIHT-AOP [21], GPSP [52] (https://github.com/ShenglongZhou/GPSP), PDASC [19] and WPDASC [15]. The codes for BIHT-AOP, PIHT and PIHT-AOP can be found at http://www.esat.kuleuven.be/stadius/ADB/huang/downloads/1bitCSLab.zip, and the codes for PDASC and WPDASC can be found at https://github.com/cjia80/numericalSimulation. It is worth pointing out that BIHT-AOP, GPSP and PIHT-AOP all require estimates of $s^{*}$ and $r$ as inputs and PIHT requires an estimate of $s^{*}$ as an input, while PDASC, WPDASC, PGe-znorm and PGe-scad do not need such prior information. For the solvers requiring estimates of $s^{*}$ and $r$, we directly input the true sparsity $s^{*}$ and the true $r$, as those papers do. During the testing, PGe-znorm and PGe-scad use the parameters described before, and the other solvers use their default settings except that PIHT is terminated once its iteration count exceeds $100$.

We first apply the eight solvers to the test problems with the sampling matrix of type I and low noise. Table 1 reports their average MSE, Herr, FNR, FPR and CPU time over $50$ trials. We see that among the four solvers not requiring any information on $x^{\rm true}$, PGe-scad and PGe-znorm yield lower MSE, Herr and FNR than PDASC and WPDASC do, and PGe-scad is the best one in terms of MSE, Herr and FNR; among the four solvers requiring some information on $x^{\rm true}$, BIHT-AOP and PIHT-AOP yield smaller MSE and Herr than PIHT and GPSP do, and they also yield lower FNR and FPR under the scenario of $r=0.05$. When comparing PGe-scad with BIHT-AOP and PIHT-AOP, the former yields smaller MSE, Herr, FNR and FPR under the scenario of $r=0.15$, and under the scenario of $r=0.05$ it yields MSE, Herr and FNR comparable to those of BIHT-AOP and PIHT-AOP.

Table 1: Numerical comparisons of eight solvers for test problems with Φ\Phi of type I and low noise
m=800,n=2000,s=10,ϖ=0.1,𝐫=0.05m=800,n=2000,s^{*}=10,\varpi=0.1,{\bf r=0.05}
μ=0.1\mu=0.1 μ=0.3\mu=0.3 μ=0.5\mu=0.5
solvers MSE Herr FNR FPR time(s) MSE Herr FNR FPR time(s) MSE Herr FNR FPR time(s)
PIHT 2.57e-1 6.80e-2 3.26e-1 1.64e-3 1.55e-1 2.75e-1 7.27e-2 3.52e-1 1.77e-3 1.57e-1 3.52e-1 9.22e-2 4.24e-1 2.13e-3 1.58e-1
BIHT-AOP 1.46e-1 4.36e-2 1.94e-1 9.75e-4 5.30e-1 1.32e-1 3.85e-2 1.80e-1 9.05e-4 5.47e-1 1.46e-1 4.18e-2 2.06e-1 1.04e-3 5.45e-1
PIHT-AOP 1.61e-1 4.67e-2 2.06e-1 1.04e-3 1.81e-1 1.55e-1 4.60e-2 1.90e-1 9.55e-4 1.92e-1 1.40e-1 4.17e-2 2.02e-1 1.02e-3 1.86e-1
GPSP 1.91e-1 5.02e-2 2.56e-1 1.29e-3 1.60e-2 1.87e-1 4.83e-2 2.40e-1 1.21e-3 1.83e-2 1.89e-1 4.78e-2 2.52e-1 1.27e-3 2.31e-2
PGe-scad 2.15e-1 6.70e-2 3.34e-1 0 2.29e-1 2.04e-1 6.36e-2 3.32e-1 0 2.79e-1 2.10e-1 6.42e-2 3.44e-1 1.01e-5 2.82e-1
PGe-znorm 2.10e-1 6.52e-2 3.58e-1 2.01e-5 1.19e-1 2.10e-1 6.41e-2 3.62e-1 4.02e-5 1.24e-1 2.22e-1 6.82e-2 3.72e-1 3.02e-5 1.26e-1
PDASC 4.29e-1 1.34e-1 5.94e-1 0 6.04e-2 4.27e-1 1.33e-1 5.92e-1 0 5.98e-2 4.53e-1 1.37e-1 6.08e-1 1.01e-5 5.93e-2
WPDASC 4.38e-1 1.37e-1 6.02e-1 0 9.31e-2 4.20e-1 1.30e-1 5.78e-1 0 9.32e-2 3.97e-1 1.20e-1 5.56e-1 1.01e-5 9.69e-2
m=800,n=2000,s=10,ϖ=0.1,𝐫=0.15m=800,n=2000,s^{*}=10,\varpi=0.1,{\bf r=0.15}
μ=0.1\mu=0.1 μ=0.3\mu=0.3 μ=0.5\mu=0.5
MSE Herr FNR FPR time(s) MSE Herr FNR FPR time(s) MSE Herr FNR FPR time(s)
PIHT 4.10e-1 1.13e-1 4.04e-1 2.03e-3 1.61e-1 4.01e-1 1.08e-1 2.90e-1 1.96e-3 1.57e-1 4.18e-1 1.12e-1 4.02e-1 2.02e-1 1.60e-1
BIHT-AOP 3.77e-1 1.04e-1 4.10e-1 2.06e-3 5.41e-1 3.74e-1 9.88e-2 4.16e-1 2.09e-3 5.51e-1 3.56e-1 9.49e-2 3.94e-1 1.98e-3 5.38e-1
PIHT-AOP 3.48e-1 9.80e-2 3.82e-1 1.92e-3 1.85e-1 3.70e-1 1.01e-1 4.10e-1 2.06e-3 1.91e-1 3.65e-1 9.68e-2 4.08e-1 2.05e-3 1.87e-1
GPSP 3.90e-1 1.05e-1 3.86e-1 1.94e-3 1.86e-2 3.73e-1 1.01e-1 3.76e-1 1.89e-3 2.04e-2 3.63e-1 9.31e-2 3.74e-1 1.88e-3 2.48e-2
PGe-scad 2.72e-1 8.54e-2 3.98e-1 1.51e-4 2.31e-1 2.78e-1 8.67e-2 3.90e-1 1.81e-4 2.83e-1 2.83e-1 8.39e-2 4.02e-1 1.91e-4 2.63e-1
PGe-znorm 3.33e-1 9.82e-2 4.20e-1 8.24e-4 1.34e-1 3.31e-1 9.64e-2 4.16e-1 9.45e-3 1.34e-1 3.42e-1 9.56e-2 4.14e-1 9.55e-4 1.50e-1
PDASC 5.63e-1 1.80e-1 6.88e-1 0 5.08e-2 5.89e-1 1.85e-1 7.12e-1 0 5.18e-2 5.58e-1 1.73e-1 6.90e-1 4.02e-5 5.18e-2
WPDASC 5.40e-1 1.71e-1 6.72e-1 0 7.93e-2 5.87e-1 1.83e-1 7.10e-1 1.01e-5 8.32e-2 5.63e-1 1.75e-1 6.98e-1 0 8.08e-2

Next we use the eight solvers to solve the test problems with the sampling matrix of type I and high noise. Table 2 reports the average MSE, Herr, FNR, FPR and CPU time over $50$ trials. Now, among the four solvers requiring partial information on $x^{\rm true}$, GPSP yields the smallest MSE, Herr, FNR and FPR, and among the four solvers not requiring any information on $x^{\rm true}$, PGe-scad is still the best one. Also, for those problems with $r=0.15$, PGe-scad yields smaller MSE, Herr and FNR than GPSP does.

Table 2: Numerical comparisons of eight solvers for test problems with Φ\Phi of type I and high noise
m=1000,n=5000,s=15,ϖ=0.3,𝐫=0.05m=1000,n=5000,s^{*}=15,\varpi=0.3,{\bf r=0.05}
μ=0.1\mu=0.1 μ=0.3\mu=0.3 μ=0.5\mu=0.5
solvers MSE Herr FNR FPR time(s) MSE Herr FNR FPR time(s) MSE Herr FNR FPR time(s)
PIHT 3.48e-1 9.53e-2 4.15e-1 1.25e-3 5.42e-1 3.40e-1 9.54e-2 4.19e-1 1.26e-3 5.46e-1 3.62e-1 9.67e-2 4.47e-1 1.34e-3 5.56e-1
BIHT-AOP 3.57e-1 1.11e-1 3.80e-1 1.14e-3 1.59e-0 3.47e-1 1.09e-1 3.67e-1 1.10e-3 1.60e-0 3.25e-1 1.05e-1 3.61e-1 1.09e-3 1.61e-0
PIHT-AOP 3.71e-1 1.17e-1 3.83e-1 1.15e-3 5.66e-1 3.47e-1 1.10e-1 3.71e-1 1.11e-3 5.79e-1 3.30e-1 1.10e-1 3.59e-1 1.08e-3 5.83e-1
GPSP 2.64e-1 7.33e-2 3.31e-1 9.95e-4 5.77e-2 2.68e-1 7.52e-2 3.32e-1 9.99e-4 5.19e-2 2.95e-1 8.08e-2 3.63e-1 1.09e-3 4.89e-2
PGe-scad 2.65e-1 8.36e-2 3.87e-1 2.21e-4 1.05e-0 2.67e-1 8.35e-2 3.85e-1 2.45e-4 1.01e-0 2.65e-1 8.13e-2 3.93e-1 2.29e-4 1.10e-0
PGe-znorm 2.89e-1 8.85e-2 4.67e-1 8.83e-5 4.05e-1 2.92e-1 8.89e-2 4.69e-1 7.22e-5 4.17e-1 3.00e-1 8.95e-2 4.77e-1 9.63e-5 4.20e-1
PDASC 5.55e-1 1.77e-1 7.24e-1 0 1.63e-1 5.80e-1 1.84e-1 7.44e-1 0 1.64e-1 5.96e-1 1.89e-1 7.48e-1 4.01e-6 1.65e-1
WPDASC 5.73e-1 1.84e-1 7.36e-1 0 2.66e-1 5.54e-1 1.76e-1 7.19e-1 0 2.68e-1 5.96e-1 1.88e-1 7.49e-1 0 2.70e-1
m=1000,n=5000,s=15,ϖ=0.3,𝐫=0.15m=1000,n=5000,s^{*}=15,\varpi=0.3,{\bf r=0.15}
μ=0.1\mu=0.1 μ=0.3\mu=0.3 μ=0.5\mu=0.5
MSE Herr FNR FPR time(s) MSE Herr FNR FPR time(s) MSE Herr FNR FPR time(s)
PIHT 5.28e-1 1.48e-1 4.97e-1 1.50e-3 5.52e-1 5.42e-1 1.51e-1 5.09e-1 1.53e-3 5.50e-1 5.30e-1 1.47e-1 4.81e-1 1.45e-3 5.46e-1
BIHT-AOP 5.23e-1 1.51e-1 5.13e-1 1.54e-3 1.61e-0 4.97e-1 1.45e-1 5.00e-1 1.50e-3 1.60e-0 5.19e-1 1.46e-1 5.35e-1 1.61e-3 1.61e-0
PIHT-AOP 5.13e-1 1.46e-1 5.09e-1 1.53e-3 5.78e-1 5.04e-1 1.47e-1 5.13e-1 1.54e-3 5.79e-1 5.30e-1 1.52e-1 5.45e-1 1.64e-3 5.70e-1
GPSP 4.60e-1 1.29e-1 4.59e-1 1.38e-3 8.21e-2 4.60e-1 1.29e-1 4.65e-1 1.40e-3 8.13e-2 4.90e-1 1.33e-1 4.77e-1 1.44e-3 6.03e-2
PGe-scad 3.55e-1 1.09e-1 4.76e-1 1.34e-3 1.05e-0 3.63e-1 1.12e-1 4.81e-1 1.50e-3 1.11e-0 3.63e-1 1.12e-1 4.80e-1 1.38e-3 1.27e-0
PGe-znorm 4.22e-1 1.24e-2 5.17e-1 6.38e-4 4.19e-1 4.51e-1 1.30e-1 5.19e-1 7.94e-4 4.06e-1 4.41e-1 1.27e-1 5.21e-1 6.62e-4 4.49e-1
PDASC 6.90e-1 2.24e-1 8.12e-1 4.01e-6 1.35e-1 7.07e-1 2.26e-1 8.24e-1 4.01e-6 1.34e-1 7.11e-1 2.27e-1 8.27e-1 0 1.33e-1
WPDASC 6.62e-1 2.14e-1 7.92e-1 0 2.35e-1 6.82e-1 2.18e-1 8.03e-1 4.01e-6 2.34e-1 7.04e-1 2.25e-1 8.20e-1 4.01e-6 2.30e-1

Finally, we use the eight solvers to solve the test problems with the sampling matrix of type II. Table 3 reports the average MSE, Herr, FNR, FPR and CPU time over $50$ trials. From Table 3, among the four solvers requiring partial information on $x^{\rm true}$, PIHT yields better MSE, Herr, FNR and FPR than the others for those examples with high noise, and among the four solvers not needing any information on $x^{\rm true}$, PGe-scad is still the best one. Moreover, PGe-scad yields smaller MSE, Herr and FNR than PIHT does for $\varpi=0.3$ and $0.5$. We also observe that among the eight solvers, GPSP always requires the least CPU time, and PGe-scad and PGe-znorm require CPU time comparable to that of PIHT, BIHT-AOP and PIHT-AOP for all test examples.

Table 3: Numerical comparisons of eight solvers for test problems with Φ\Phi of type II with different noise levels
m=2500,n=10000,s=20,r=0.1m=2500,n=10000,s^{*}=20,r=0.1
ϖ=0.1\varpi=0.1 ϖ=0.3\varpi=0.3 ϖ=0.5\varpi=0.5
MSE Herr FNR FPR time(s) MSE Herr FNR FPR time(s) MSE Herr FNR FPR time(s)
PIHT 2.57e-1 7.23e-1 3.44e-1 6.89e-4 2.55 2.74e-1 8.15e-2 3.49e-1 6.99e-4 2.54 3.14e-1 9.58e-2 3.69e-1 7.39e-4 2.54
BIHT-AOP 1.54e-1 4.61e-2 2.40e-1 4.81e-4 7.70 3.06e-1 9.84e-2 3.36e-1 6.73e-3 7.68 4.23e-1 1.29e-1 3.95e-1 7.92e-4 7.64
PIHT-AOP 1.68e-1 5.08e-2 2.44e-1 4.89e-4 2.66 3.16e-1 1.03e-1 3.38e-1 6.77e-4 2.66 4.62e-1 1.52e-1 4.14e-1 8.30e-4 2.66
GPSP 2.45e-1 6.88e-2 3.36e-1 6.73e-4 0.19 2.77e-1 8.22e-2 3.51e-1 7.03e-4 0.19 3.23e-1 9.68e-2 3.73e-1 7.47e-4 0.19
PGe-scad 2.10e-1 6.65e-2 3.08e-1 2.61e-5 2.79 2.44e-1 7.82e-2 3.61e-1 2.10e-4 2.71 2.92e-1 9.34e-2 4.14e-1 6.89e-4 2.75
PGe-znorm 2.34e-1 7.36e-2 4.25e-1 2.00e-6 1.78 2.44e-1 7.71e-2 4.27e-1 8.02e-6 1.78 2.74e-1 8.70e-2 4.45e-1 3.21e-5 1.77
PDASC 5.41e-1 1.73e-1 6.84e-1 0 8.28e-1 6.26e-1 2.01e-1 7.50e-1 0 8.07e-1 6.28e-1 2.03e-1 7.56e-1 4.02e-5 7.66e-1
WPDASC 5.41e-1 1.73e-1 6.83e-1 0 1.01 5.62e-1 1.81e-1 7.03e-1 0 9.98e-1 6.17e-1 1.99e-1 7.37e-1 0 9.71e-1

6 Conclusion

We proposed a zero-norm regularized smooth DC loss model and derived a family of equivalent nonconvex surrogates that cover the MCP and SCAD surrogates as special cases. For the proposed model and its SCAD surrogate, we developed the PG method with extrapolation to compute their $\tau$-stationary points and provided a convergence certificate by establishing the convergence of the whole iterate sequence together with its local linear convergence rate. Numerical comparisons with several state-of-the-art methods demonstrate that the two new models are well suited for high noise and/or high sign flip ratios. An interesting future topic is to analyze the statistical error bounds for them.

References

  • [1] H. Attouch and J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming, 116(2009): 5-16.
  • [2] H. Attouch, J. Bolte, P. Redont and A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality, Mathematics of Operations Research, 35(2010): 438-457.
  • [3] A. Beck and M. Teboulle, Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems, IEEE Transactions on Image Processing, 18(2009): 2419-2434.
  • [4] A. Beck and N. Hallak, Optimization problems involving group sparsity terms, Mathematical Programming, 178(2019): 39-67.
  • [5] J. Bolte, S. Sabach and M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming, 146(2014): 459-494.
  • [6] P. T. Boufounos and R. G. Baraniuk, 1-bit compressive sensing, Proceedings of the Forty Second Annual Conference on Information Sciences and Systems, 2008, pp. 16-21.
  • [7] P. T. Boufounos, Greedy sparse signal reconstruction from sign measurements, Proceedings of the Asilomar Conference on Signals, Systems, and Computers, 2009: 1305-1309.
  • [8] P. T. Boufounos, Reconstruction of sparse signals from distorted randomized measurements, In Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing, 2010, pp. 3998-4001.
  • [9] J. P. Brooks, Support vector machines with the ramp Loss and the hard margin loss, Operations Research, 59(2011): 467-479.
  • [10] E. J. Candès and T. Tao, Decoding by linear programming, IEEE Transactions on Information Theory, 51(2005): 4203-4215.
  • [11] F. Cucker and D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge, U.K.: Cambridge Univ. Press, 2007.
  • [12] D. Q. Dai, L. X. Shen, Y. S. Xu and N. Zhang, Noisy 1-bit compressive sensing: models and algorithms, Applied and Computational Harmonic Analysis, 40(2016): 1-32.
  • [13] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory, 52(2006): 1289-1306.
  • [14] J. Q. Fan and R. Z. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of American Statistics Association, 96(2001): 1348-1360.
  • [15] Q. B. Fan, C. Jia, J. Liu and Y. Luo, Robust recovery in 1-bit compressive sensing via q\ell_{q}-constrained least squares, Signal Processing, 179(2021): 107822.
  • [16] J. Fang, Y. N. Shen, H. B. Li and Z. Ren, Sparse signal recovery from one-bit quantized data: an iterative reweighted algorithm, Signal Processing, 102(2014): 201-206.
  • [17] S. Ghadimi and G. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming, Mathematical Programming, 156(2016):59-99.
  • [18] S. Gopi, P. Netrapalli, P. Jain and A. Nori, One-bit compressed sensing: Provable support and vector recovery, in Int. Conf. Mach. Learn. PMLR, 2013, pp. 154-162.
  • [19] J. Huang, Y. L. Jiao, X. L. Lu and L. P. Zhu, Robust decoding from 1-bit compressive sampling with ordinary and regularized least squares, SIAM Journal on Scientific Computing, 40(2018): A2062-A2086.
  • [20] X. L. Huang and M. Yan, Nonconvex penalties with analytical solutions for one-bit compressive sensing, Signal Processing, 144(2018): 341-351.
  • [21] X. L. Huang, L. Shi, M. Yan and J. A. K. Suykens, Pinball loss minimization for one-bit compressive sensing: convex models and algorithms, Neurocomputing, 314(2018): 275-283.
  • [22] X. L. Huang, L. Shi and J. A. K. Suykens, Ramp loss linear programming support vector machine, Journal of Machine Learning Research, 15(2014): 2185-2211.
  • [23] L. Jacques, J. N. Laska, P. T. Boufounos and R. G. Baraniuk, Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors, IEEE Transactions on Information Theory, 59(2013): 2082-2102.
  • [24] J. N. Laska, Z. W. Wen, W. T. Yin and R. G. Baraniuk, Trust, but verify: fast and accurate signal recovery from 1-bit compressive measurements, IEEE Transactions on Signal Processing, 59(2011): 5289-5301.
  • [25] J. N. Laska and R. G. Baraniuk, Regime change: Bitdepth versus measurement-rate in compressive sensing, IEEE Transactions on Signal Processing, 60(2012): 3496-3505.
  • [26] Z. L. Li, W. B. Xu, X. B. Zhang and J. R. Lin, A survey on one-bit compressed sensing: theory and applications, Frontiers of Computer Science, 12(2018): 217-230.
  • [27] H. Li and Z. Lin, Accelerated proximal gradient methods for nonconvex programming, In Advances in Neural Information Processing Systems, 2015: 379-387.
  • [28] P. Ochs, Unifying abstract inexact convergence theorems and block coordiate variable metric IPIANO, SIAM Journal on Optimization, 29(2019): 511-570.
  • [29] P. Ochs, Y. Chen, T. Brox and T. Pock, iPiano: Inertial proximal algorithm for nonconvex optimization, SIAM Journal on Optimization, 7(2014): 1388-1419.
  • [30] Y. Plan and R. Vershynin, One-bit compressed sensing by linear programming, Communications on Pure and Applied Mathematics, 66(2013): 1275-1297.
  • [31] Y. Plan and R. Vershynin, Robust 1-bit compressed sensing and sparse logistic regression: a convex programming approach, IEEE Transactions on Information Theory, 59(2013): 482-494.
  • [32] X. Peng, B. Liao and J. Li, One-bit compressive sensing via Schur-concave function minimization, IEEE Transactions on Signal Processing, 67(2019): 4139-4151.
  • [33] P. Xiao, B. Liao, X. D. Huang and Z. Quan, 1-bit compressive sensing with an improved algorithm based on fixed-point continuation, Signal Processing, 154(2019): 168-173.
  • [34] G. Y. Li and T. K. Pong, Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods, Foundations of Computational Mathematics, 18(2018): 1199-1232.
  • [35] Y. L. Liu, S. J. Bi and S. H. Pan, Equivalent Lipschitz surrogates for zero-norm and rank optimization problems, Journal of Global Optimization, 72(2018): 679-704.
  • [36] R. T. Rockafellar and R. J-B. Wets, Variational Analysis, Springer, 1998.
  • [37] Y. Nesterov, A method of solving a convex programming problem with convergence rate $O(1/k^{2})$, Soviet Mathematics Doklady, 27(1983): 372-376.
  • [38] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, Boston, 2004.
  • [39] Y. T. Qian and S. H. Pan, Calmness of partial perturbation to composite rank constraint systems and its applications, arXiv:2102.10373v2, October 8, 2021.
  • [40] L. X. Shen and B. W. Suter, One-bit compressive sampling via $\ell_{0}$ minimization, EURASIP Journal on Advances in Signal Processing, 71(2016).
  • [41] B. Wen, X. Chen, and T. K. Pong, Linear convergence of proximal gradient algorithm with extrapolation for a class of nonconvex nonsmooth minimization problems, SIAM Journal on Optimization, 27(2017): 124-145.
  • [42] Y. Q. Wu, S. H. Pan and S. J. Bi, Kurdyka-Łojasiewicz property of zero-norm composite functions, Journal of Optimization Theory and Applications, 188(2021): 94-112.
  • [43] H. Wang, X. Huang, Y. Liu, S. Van Huffel and Q. Wan, Binary reweighted $\ell_{1}$-norm minimization for one-bit compressed sensing, in Proceedings of the 8th International Conference on Bio-inspired Systems and Signal Processing, 2015.
  • [44] P. Xiao, B. Liao and J. Li, One-bit compressive sensing via Schur-concave function minimization, IEEE Transactions on Signal Processing, 67(2019): 4139-4151.
  • [45] Y. Y. Xu and W. Yin, A globally convergent algorithm for nonconvex optimization based on block coordinate update, Journal of Scientific Computing, 72(2017): 700-734.
  • [46] M. Yan, Y. Yang and S. Osher, Robust 1-bit compressive sensing using adaptive outlier pursuit, IEEE Transactions on Signal Processing, 60(2012): 3868-3875.
  • [47] L. Yang, Proximal gradient method with extrapolation and line-search for a class of nonconvex and nonsmooth problems, arXiv:1711.06831v4, 2021.
  • [48] P. R. Yu, G. Y. Li and T. K. Pong, Kurdyka-Łojasiewicz exponent via inf-projection, Foundations of Computational Mathematics, DOI: https://doi.org/10.1007/s10208-021-09528-6.
  • [49] C. H. Zhang, Nearly unbiased variable selection under minimax concave penalty, Annals of Statistics, 38(2010): 894-942.
  • [50] T. Zhang, Statistical analysis of some multi-category large margin classification methods, Journal of Machine Learning Research, 5(2004): 1225-1251.
  • [51] L. Zhang, J. Yi and R. Jin, Efficient algorithms for robust one-bit compressive sensing, in Proceedings of the Thirty-First International Conference on Machine Learning, 2014, pp. 820-828.
  • [52] S. L. Zhou, Z. Y. Luo, N. H. Xiu and G. Y. Li, Computing one-bit compressive sensing via double-sparsity constrained optimization, IEEE Transactions on Signal Processing, 70(2022): 1593-1608.
  • [53] R. D. Zhu and Q. Q. Gu, Towards a lower sample complexity for robust one-bit compressed sensing, Proceedings of the 32nd International Conference on Machine Learning, PMLR, 37(2015): 739-747.