
Implicit Regularization in Over-Parameterized Support Vector Machine

Yang Sui  Xin He  Yang Bai
School of Statistics and Management
Shanghai University of Finance and Economics
suiyang1027@stu.sufe.edu.cn; he.xin17@mail.shufe.edu.cn; statbyang@mail.shufe.edu.cn
Equal contributions. Corresponding author.
Abstract

In this paper, we design a regularization-free algorithm for high-dimensional support vector machines (SVMs) by integrating over-parameterization with Nesterov’s smoothing method, and provide theoretical guarantees for the induced implicit regularization phenomenon. In particular, we construct an over-parameterized hinge loss function and estimate the true parameters by leveraging regularization-free gradient descent on this loss function. The utilization of Nesterov’s method enhances the computational efficiency of our algorithm, especially in terms of determining the stopping criterion and reducing computational complexity. With appropriate choices of initialization, step size, and smoothness parameter, we demonstrate that unregularized gradient descent achieves a near-oracle statistical convergence rate. Additionally, we verify our theoretical findings through a variety of numerical experiments and compare the proposed method with explicit regularization. Our results illustrate the advantages of employing implicit regularization via gradient descent in conjunction with over-parameterization in sparse SVMs.

1 Introduction

In machine learning, over-parameterized models, such as deep learning models, are commonly used for regression and classification lecun2015deep ; voulodimos2018deep ; otter2020survey ; fan2021selective . The corresponding optimization tasks are primarily tackled using gradient-based methods. Despite challenges posed by the nonconvexity of the objective function and over-parameterization yunsmall , empirical observations show that even simple algorithms, such as gradient descent, tend to converge to a global minimum. This phenomenon is known as the implicit regularization of variants of gradient descent, which acts as a form of regularization without an explicit regularization term neyshabur2015search ; neyshabur2017geometry .

Implicit regularization has been extensively studied in many classical statistical problems, such as linear regression vaskevicius2019implicit ; li2021implicit ; zhao2022high and matrix factorization gunasekar2017implicit ; li2018algorithmic ; arora2019implicit . These studies have shown that unregularized gradient descent can yield optimal estimators under certain conditions. However, a deeper understanding of implicit regularization in classification problems, particularly in support vector machines (SVMs), remains limited. Existing studies have focused on specific cases and alternative regularization approaches soudry2018implicit ; molinari2021iterative ; apidopoulos2022iterative , and a comprehensive analysis of implicit regularization via direct gradient descent in SVMs is still lacking; its implications and performance therefore call for further investigation.

The practical significance of such exploration becomes evident when considering today's complex data landscapes and the challenges they present. In modern applications, we often face classification challenges due to redundant features. Data sparsity is notably evident in classification fields like finance, document classification, and gene expression analysis. For instance, in genomics, only a few genes out of thousands are used for disease diagnosis and drug discovery huang2012identification . Similarly, spam classifiers rely on a small selection of words from extensive dictionaries almeida2011contributions . These scenarios highlight limitations in standard SVMs and those with explicit regularization, like the $\ell_{2}$ norm. From an applied perspective, incorporating sparsity into SVMs is an intriguing research direction. While sparsity in regression has been deeply explored recently, sparse SVMs have received less attention. Discussions typically focus on generalization error and risk analysis, with limited mention of variable selection and error bounds peng2016error . Our study delves into the implicit regularization in classification, complementing the ongoing research in sparse SVMs.

In this paper, we focus on the implicit regularization of gradient descent applied to high-dimensional sparse SVMs. Contrary to existing studies on implicit regularization in first-order iterative methods for nonconvex optimization that primarily address regression, we investigate this intriguing phenomenon of gradient descent applied to SVMs with hinge loss. By re-parameterizing the parameters using the Hadamard product, we introduce a novel approach to nonconvex optimization problems. With proper initialization and parameter tuning, our proposed method achieves the desired statistical convergence rate. Extensive simulation results reveal that our method outperforms the estimator under explicit regularization in terms of estimation error, prediction accuracy, and variable selection. Moreover, it rivals the performance of the gold standard oracle solution, assuming the knowledge of true support.

1.1 Our Contributions

First, we reformulate the parameter $\bm{\beta}$ as $\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v}$, where $\odot$ denotes the Hadamard product operator, resulting in a non-convex optimization problem. This re-parameterization technique has not been previously employed for classification problems. Although it introduces some theoretical complexities, it is not computationally demanding. Importantly, it allows for a detailed theoretical analysis of how signals change throughout iterations, offering a novel perspective on the dynamics of gradient descent not covered in prior works gunasekar2018implicit ; soudry2018implicit . With the help of re-parameterization, we provide a theoretical analysis showing that with appropriate choices of the initialization size $\alpha$, step size $\eta$, and smoothness parameter $\gamma$, our method achieves a near-oracle rate of $\sqrt{s\log p/n}$, where $s$ is the number of signals, $p$ is the dimension of $\bm{\beta}$, and $n$ is the sample size. Notice that the near-oracle rate is achievable via explicit regularization peng2016error ; zhou2022communication ; kharoubi2023high . To the best of our knowledge, this is the first study that investigates implicit regularization via gradient descent and establishes the near-oracle rate specifically in classification.

Second, we employ a simple yet effective smoothing technique Nesterov2005 ; zhou2010nesvm for the re-parameterized hinge loss, addressing the non-differentiability of the hinge loss. Additionally, our method introduces a convenient stopping criterion post-smoothing, which we discuss in detail in Section 2.2. Notably, the smoothed gradient descent algorithm is not computationally demanding, primarily involving vector multiplication. The variables introduced by the smoothing technique mostly take values of 0 or 1 as the smoothness parameter $\gamma$ decreases, further streamlining the computation. Incorporating Nesterov's smoothing is instrumental in our theoretical derivations. Directly analyzing the gradient algorithm with re-parameterization for the non-smooth hinge loss, while computationally feasible, introduces complexities in theoretical deductions. In essence, Nesterov's smoothing proves vital both computationally and theoretically.

Third, to support our theoretical results, we present finite sample performances of our method through extensive simulations, comparing it with both the $\ell_{1}$-regularized estimator and the gold standard oracle solution. We demonstrate that the number of iterations $t$ in gradient descent parallels the role of the $\ell_{1}$ regularization parameter $\lambda$. When chosen appropriately, both can achieve the near-oracle statistical convergence rate. Further insights from our simulations illustrate that, firstly, in terms of estimation error, our method generalizes better than the $\ell_{1}$-regularized estimator. Secondly, for variable selection, our method significantly reduces false positive rates. Lastly, due to the efficient transferability of gradient information among machines, our method is easier to parallelize and extend to large-scale applications. Notably, while our theory is primarily based on the sub-Gaussian distribution assumption, our method is applicable to a much wider range of distributions. Additional simulations under heavy-tailed distributions still yield remarkably good results. Extensive experimental results indicate that our method's performance, employing implicitly regularized gradient descent in SVMs, rivals that of algorithms using explicit regularization. In certain simple scenarios, it even matches the performance of the oracle solution.

1.2 Related Work

Frequent empirical evidence shows that simple algorithms such as (stochastic) gradient descent tend to find the global minimum of the loss function despite nonconvexity. To understand this phenomenon, studies by neyshabur2015search ; neyshabur2017geometry ; keskarlarge ; wilson2017marginal ; zhang2022statistical proposed that generalization arises from the implicit regularization of the optimization algorithm. Specifically, these studies observe that in over-parameterized statistical models, although optimization problems may contain bad local minima, optimization algorithms, typically variants of gradient descent, exhibit a tendency to avoid these bad local minima and converge towards better solutions. Without adding any regularization term to the optimization objective, the implicit preference of the optimization algorithm itself plays the role of regularization.

Implicit regularization has attracted significant attention in well-established statistical problems, including linear regression gunasekar2018implicit ; vaskevicius2019implicit ; woodworth2020kernel ; li2021implicit ; zhao2022high and matrix factorization gunasekar2017implicit ; li2018algorithmic ; arora2019implicit ; ma2021sign ; ma2022global . In high-dimensional sparse linear regression problems, vaskevicius2019implicit ; zhao2022high introduced a re-parameterization technique and demonstrated that unregularized gradient descent yields an estimator with optimal statistical accuracy under the Restricted Isometry Property (RIP) assumption candes2005decoding . fan2022understanding obtained similar results for the single index model without the RIP assumption. In low-rank matrix sensing, gunasekar2017implicit ; li2018algorithmic revealed that gradient descent biases towards the minimum nuclear norm solution when initiated close to the origin. Additionally, arora2019implicit demonstrated the same implicit bias towards the nuclear norm using a depth-$N$ linear network. Nevertheless, research on the implicit regularization of gradient descent in classification problems remains limited. soudry2018implicit found that the gradient descent estimator converges to the direction of the max-margin solution on unregularized logistic regression problems. In terms of the hinge loss in SVMs, apidopoulos2022iterative provided a diagonal descent approach and established its regularization properties. However, these investigations rely on the diagonal regularization process bahraoui1994convergence , their algorithms' convergence rates depend on the number of iterations, and they are not directly compared with those of explicitly regularized algorithms. Besides, the Frank-Wolfe method and its variants have been used for classification lacoste2013block ; tajima2021frank . However, sub-linear convergence to the optimum requires the strict assumption that both the direction-finding step and the line search step are performed exactly jaggi2013revisiting .

2 Model and Algorithms

2.1 Notations

Throughout this work, we denote vectors with boldface letters and real numbers with normal font. Thus, $\mathbf{w}$ denotes a vector and $w_{i}$ denotes the $i$-th coordinate of $\mathbf{w}$. We use $[n]$ to denote the set $\{1,2,\ldots,n\}$. For any subset $S$ of $[n]$ and a vector $\mathbf{w}$, we use $\mathbf{w}_{S}$ to denote the vector whose $i$-th entry is $w_{i}$ if $i\in S$ and $0$ otherwise. For any given vector $\mathbf{w}$, let $\|\mathbf{w}\|_{1}$, $\|\mathbf{w}\|$ and $\|\mathbf{w}\|_{\infty}$ denote its $\ell_{1}$, $\ell_{2}$ and $\ell_{\infty}$ norms. Moreover, for any two vectors $\mathbf{w},\mathbf{v}\in\mathbb{R}^{p}$, we define $\mathbf{w}\odot\mathbf{v}\in\mathbb{R}^{p}$ as the Hadamard product of $\mathbf{w}$ and $\mathbf{v}$, whose components are $w_{i}v_{i}$ for $i\in[p]$. For any given matrix $\mathbf{X}\in\mathbb{R}^{p_{1}\times p_{2}}$, we use $\|\mathbf{X}\|_{F}$ and $\|\mathbf{X}\|_{S}$ to represent the Frobenius norm and spectral norm of $\mathbf{X}$, respectively. In addition, let $\{a_{n}\}_{n\geq 1}$ and $\{b_{n}\}_{n\geq 1}$ be any two positive sequences. We write $a_{n}\lesssim b_{n}$ if there exists a universal constant $C$ such that $a_{n}\leq C\cdot b_{n}$, and we write $a_{n}\asymp b_{n}$ if both $a_{n}\lesssim b_{n}$ and $b_{n}\lesssim a_{n}$ hold. Moreover, $a_{n}=\mathcal{O}(b_{n})$ shares the same meaning as $a_{n}\lesssim b_{n}$.

2.2 Over-parameterization for 1\ell_{1}-regularized SVM

Given a random sample $\mathcal{Z}^{n}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ with $\mathbf{x}_{i}\in\mathbb{R}^{p}$ denoting the covariates and $y_{i}\in\{-1,1\}$ denoting the corresponding label, we consider the following $\ell_{1}$-regularized SVM:

\min_{\bm{\beta}\in\mathbb{R}^{p}}\frac{1}{n}\sum_{i=1}^{n}(1-y_{i}\mathbf{x}_{i}^{T}\bm{\beta})_{+}+\lambda\|\bm{\beta}\|_{1}, \qquad (1)

where $(\cdot)_{+}=\max\{\cdot,0\}$ denotes the positive part defining the hinge loss and $\lambda$ denotes the $\ell_{1}$ regularization parameter. Let $L_{\mathcal{Z}^{n}}(\bm{\beta})$ denote the first term on the right-hand side of (1). Previous works have shown that the $\ell_{1}$-regularized estimator of the optimization problem (1) and its extensions achieve a near-oracle rate of convergence to the true parameter $\bm{\beta}^{*}$ peng2016error ; wang2021improved ; zhang2022statistical . In contrast, rather than imposing the $\ell_{1}$ regularization term in (1), we minimize the hinge loss function $L_{\mathcal{Z}^{n}}(\bm{\beta})$ directly to obtain a sparse estimator. Specifically, we re-parameterize $\bm{\beta}$ as $\bm{\beta}=\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v}$, using two vectors $\mathbf{w}$ and $\mathbf{v}$ in $\mathbb{R}^{p}$. Consequently, $L_{\mathcal{Z}^{n}}(\bm{\beta})$ can be reformulated as $\mathcal{E}_{\mathcal{Z}^{n}}(\mathbf{w},\mathbf{v})$:

\mathcal{E}_{\mathcal{Z}^{n}}(\mathbf{w},\mathbf{v})=\frac{1}{n}\sum_{i=1}^{n}\big(1-y_{i}\mathbf{x}_{i}^{T}(\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v})\big)_{+}. \qquad (2)

Note that the dimensionality of $\bm{\beta}$ in (1) is $p$, but a $2p$-dimensional parameter is involved in (2). This indicates that we over-parameterize $\bm{\beta}$ via $\bm{\beta}=\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v}$ in (2). We briefly describe our motivation for over-parameterizing $\bm{\beta}$ in this way. Following Hoff2017 , $\|\bm{\beta}\|_{1}=\min_{\bm{\beta}=\mathbf{a}\odot\mathbf{c}}(\|\mathbf{a}\|^{2}+\|\mathbf{c}\|^{2})/2$. Thus, the optimization problem (1) translates to $\min_{\mathbf{a},\mathbf{c}}L_{\mathcal{Z}^{n}}(\mathbf{a}\odot\mathbf{c})+\lambda(\|\mathbf{a}\|^{2}+\|\mathbf{c}\|^{2})/2$. For a better understanding of implicit regularization by over-parameterization, we set $\mathbf{w}=\frac{\mathbf{a}+\mathbf{c}}{2}$ and $\mathbf{v}=\frac{\mathbf{a}-\mathbf{c}}{2}$, leading to $\bm{\beta}=\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v}$. This incorporation of the new parameters $\mathbf{w}$ and $\mathbf{v}$ effectively over-parameterizes the problem. Finally, we drop the explicit $\ell_{2}$ regularization term $\lambda(\|\mathbf{a}\|^{2}+\|\mathbf{c}\|^{2})/2$ and perform gradient descent to minimize the empirical loss $\mathcal{E}_{\mathcal{Z}^{n}}(\mathbf{w},\mathbf{v})$ in (2), in line with techniques seen in neural network training, high-dimensional regression vaskevicius2019implicit ; li2021implicit ; zhao2022high , and high-dimensional single index models fan2022understanding .
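For concreteness, we add a direct check (ours, not part of the original derivation) that this change of variables recovers the Hadamard factorization and the $\ell_{1}$ identity of Hoff2017 : with $\mathbf{w}=\frac{\mathbf{a}+\mathbf{c}}{2}$ and $\mathbf{v}=\frac{\mathbf{a}-\mathbf{c}}{2}$,

\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v}=\tfrac{1}{4}\big((\mathbf{a}+\mathbf{c})\odot(\mathbf{a}+\mathbf{c})-(\mathbf{a}-\mathbf{c})\odot(\mathbf{a}-\mathbf{c})\big)=\mathbf{a}\odot\mathbf{c}=\bm{\beta},\qquad\frac{\|\mathbf{a}\|^{2}+\|\mathbf{c}\|^{2}}{2}=\sum_{i=1}^{p}\frac{a_{i}^{2}+c_{i}^{2}}{2}\geq\sum_{i=1}^{p}|a_{i}c_{i}|=\|\bm{\beta}\|_{1},

with equality in the last display when $|a_{i}|=|c_{i}|=\sqrt{|\beta_{i}|}$ for every $i$, so minimizing over the factorization indeed recovers the $\ell_{1}$ norm.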

2.3 Nesterov’s smoothing

It is well known that the hinge loss function is not differentiable. As a result, traditional first-order optimization methods, such as the sub-gradient and stochastic gradient methods, converge slowly and are not suitable for large-scale problems zhou2010nesvm . Second-order methods, like the Newton and quasi-Newton methods, can address this by replacing the hinge loss with a differentiable approximation chapelle2007training . Although these second-order methods might achieve better convergence rates, the computational cost associated with computing the Hessian matrix in each iteration is prohibitively high. Hence, directly optimizing (2) with either class of methods may not be the best choice.

To address the trade-off between convergence rate and computational cost, we incorporate Nesterov’s method Nesterov2005 to smooth the hinge loss and then update the parameters via gradient descent. By employing Nesterov’s method, (2) can be reformulated as the following saddle point function:

\mathcal{E}_{\mathcal{Z}^{n}}(\mathbf{w},\mathbf{v})\equiv\max_{\bm{\mu}\in\mathcal{P}_{1}}\frac{1}{n}\sum_{i=1}^{n}\big(1-y_{i}\mathbf{x}_{i}^{T}(\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v})\big)\mu_{i},

where $\mathcal{P}_{1}=\{\bm{\mu}\in\mathbb{R}^{n}:0\leq\mu_{i}\leq 1\}$. According to Nesterov2005 , the above saddle point function can be smoothed by subtracting a prox-function $d_{\gamma}(\bm{\mu})$, where $d_{\gamma}(\bm{\mu})$ is a strongly convex function of $\bm{\mu}$ with a smoothness parameter $\gamma>0$. Throughout this paper, we select the prox-function as $d_{\gamma}(\bm{\mu})=\frac{\gamma}{2}\|\bm{\mu}\|^{2}$. Consequently, $\mathcal{E}_{\mathcal{Z}^{n}}(\mathbf{w},\mathbf{v})$ can be approximated by

\mathcal{E}_{\mathcal{Z}^{n},\gamma}^{*}(\mathbf{w},\mathbf{v})\equiv\max_{\bm{\mu}\in\mathcal{P}_{1}}\left\{\frac{1}{n}\sum_{i=1}^{n}\big(1-y_{i}\mathbf{x}_{i}^{T}(\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v})\big)\mu_{i}-d_{\gamma}(\bm{\mu})\right\}. \qquad (3)

Since $d_{\gamma}(\bm{\mu})$ is strongly convex, each $\mu_{i}$ can be obtained by setting the gradient of the objective function in (3) to zero and has the explicit form:

\mu_{i}=\text{median}\left(0,\frac{1-y_{i}\mathbf{x}_{i}^{T}(\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v})}{\gamma n},1\right). \qquad (4)

For each sample point $\mathcal{Z}^{i}=\{(\mathbf{x}_{i},y_{i})\}$, $i\in[n]$, we use $\mathcal{E}_{\mathcal{Z}^{i}}(\mathbf{w},\mathbf{v})$ and $\mathcal{E}^{*}_{\mathcal{Z}^{i},\gamma}(\mathbf{w},\mathbf{v})$ to denote its hinge loss and the corresponding smoothed approximation, respectively. With the different choices of $\mu_{i}$ in (4) for each $i\in[n]$, the explicit form of $\mathcal{E}^{*}_{\mathcal{Z}^{i},\gamma}(\mathbf{w},\mathbf{v})$ can be written as

\mathcal{E}^{*}_{\mathcal{Z}^{i},\gamma}(\mathbf{w},\mathbf{v})=\begin{cases}0,&\text{if}\ \ y_{i}\mathbf{x}^{T}_{i}(\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v})>1,\\ \big(1-y_{i}\mathbf{x}^{T}_{i}(\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v})\big)/n-\gamma/2,&\text{if}\ \ y_{i}\mathbf{x}^{T}_{i}(\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v})<1-\gamma n,\\ \big(1-y_{i}\mathbf{x}^{T}_{i}(\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v})\big)^{2}/(2\gamma n^{2}),&\text{otherwise}.\end{cases} \qquad (5)

Note that the explicit form (5) indicates that a larger $\gamma$ yields a smoother $\mathcal{E}^{*}_{\mathcal{Z}^{i},\gamma}(\mathbf{w},\mathbf{v})$ with a larger approximation error, so $\gamma$ can be viewed as a parameter that controls the trade-off between smoothness and approximation accuracy. The following theorem provides the theoretical guarantee for the approximation error. The proof can be derived directly from (5), and is thus omitted here.

Theorem 1.

For any random sample $\mathcal{Z}^{i}=\{(\mathbf{x}_{i},y_{i})\}$, $i\in[n]$, the corresponding hinge loss $\mathcal{E}_{\mathcal{Z}^{i}}(\mathbf{w},\mathbf{v})$ is bounded by its smooth approximation $\mathcal{E}^{*}_{\mathcal{Z}^{i},\gamma}(\mathbf{w},\mathbf{v})$, and the approximation error is completely controlled by the smoothness parameter $\gamma$. For any $(\mathbf{w},\mathbf{v})$, we have

\mathcal{E}_{\mathcal{Z}^{i},\gamma}^{*}(\mathbf{w},\mathbf{v})\leq\mathcal{E}_{\mathcal{Z}^{i}}(\mathbf{w},\mathbf{v})\leq\mathcal{E}_{\mathcal{Z}^{i},\gamma}^{*}(\mathbf{w},\mathbf{v})+\frac{\gamma}{2}.
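As a sanity check of (4), (5), and Theorem 1, the per-sample smoothed loss can be computed and compared with the hinge loss numerically. The sketch below is our own illustration (variable names are ours), written in terms of the per-sample residuals $r_{i}=1-y_{i}\mathbf{x}_{i}^{T}(\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 50, 1e-3                       # sample size and smoothness parameter

# random per-sample residuals r_i = 1 - y_i x_i^T (w ⊙ w - v ⊙ v)
r = rng.normal(scale=2.0, size=n)

hinge = np.maximum(r, 0.0) / n            # per-sample hinge loss E_{Z^i}

# optimal dual variables mu_i = median(0, r_i / (gamma * n), 1), cf. (4)
mu = np.clip(r / (gamma * n), 0.0, 1.0)

# smoothed per-sample loss from the inner maximization in (3)
smoothed = r * mu / n - gamma * mu ** 2 / 2

# piecewise closed form (5) agrees with the maximization above
closed = np.where(r <= 0, 0.0,
         np.where(r >= gamma * n, r / n - gamma / 2, r ** 2 / (2 * gamma * n ** 2)))
assert np.allclose(smoothed, closed)

# Theorem 1: E*_{Z^i, gamma} <= E_{Z^i} <= E*_{Z^i, gamma} + gamma / 2
assert np.all(smoothed <= hinge + 1e-12) and np.all(hinge <= smoothed + gamma / 2 + 1e-12)
print("maximum gap between hinge and smoothed loss:", np.max(hinge - smoothed))
```

The maximum gap of $\gamma/2$ is attained exactly at the transition point $r_{i}=\gamma n$, which is where the quadratic and linear pieces of (5) meet.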

2.4 Implicit regularization via gradient descent

In this section, we apply the gradient descent algorithm to $\mathcal{E}^{*}_{\mathcal{Z}^{n},\gamma}(\mathbf{w},\mathbf{v})$ in (3), updating $\mathbf{w}$ and $\mathbf{v}$ to obtain the estimator of $\bm{\beta}^{*}$. Specifically, the gradients of (3) with respect to $\mathbf{w}$ and $\mathbf{v}$ can be obtained directly and take the forms $-\frac{2}{n}\sum_{i=1}^{n}y_{i}\mu_{i}\mathbf{x}_{i}\odot\mathbf{w}$ and $\frac{2}{n}\sum_{i=1}^{n}y_{i}\mu_{i}\mathbf{x}_{i}\odot\mathbf{v}$, respectively. Thus, the updates for $\mathbf{w}$ and $\mathbf{v}$ are given as

\mathbf{w}_{t+1}=\mathbf{w}_{t}+2\eta\frac{1}{n}\sum_{i=1}^{n}y_{i}\mu_{t,i}\mathbf{x}_{i}\odot\mathbf{w}_{t}\quad\text{and}\quad\mathbf{v}_{t+1}=\mathbf{v}_{t}-2\eta\frac{1}{n}\sum_{i=1}^{n}y_{i}\mu_{t,i}\mathbf{x}_{i}\odot\mathbf{v}_{t}, \qquad (6)

where $\eta$ denotes the step size. Once $(\mathbf{w}_{t+1},\mathbf{v}_{t+1})$ is obtained, we can update $\bm{\beta}$ as $\bm{\beta}_{t+1}=\mathbf{w}_{t+1}\odot\mathbf{w}_{t+1}-\mathbf{v}_{t+1}\odot\mathbf{v}_{t+1}$ via the over-parameterization of $\bm{\beta}$. Note that we cannot initialize $\mathbf{w}_{0}$ and $\mathbf{v}_{0}$ as zero vectors because zero vectors are stationary points of the algorithm. Given the sparsity of the true parameter $\bm{\beta}^{*}$ with support $S$, ideally, $\mathbf{w}$ and $\mathbf{v}$ should be initialized with the same sparsity pattern as $\bm{\beta}^{*}$, with $\mathbf{w}_{S}$ and $\mathbf{v}_{S}$ being non-zero and the values outside the support $S$ being zero. However, such initialization is infeasible as $S$ is unknown. As an alternative, we initialize $\mathbf{w}_{0}$ and $\mathbf{v}_{0}$ as $\mathbf{w}_{0}=\mathbf{v}_{0}=\alpha\bm{1}_{p\times 1}$, where $\alpha>0$ is a small constant. This initialization strikes a balance: it aligns with the sparsity assumption by keeping the components outside the support close to zero, while ensuring that the components on the support begin with non-zero values fan2022understanding .

We summarize the details of the proposed gradient descent method for high-dimensional sparse SVMs in Algorithm 1 below.

Algorithm 1: Gradient Descent Algorithm for High-Dimensional Sparse SVM.   Given: Training set $\mathcal{Z}^{n}$, initialization size $\alpha$, step size $\eta$, smoothness parameter $\gamma$, maximum iteration number $T_{1}$, validation set $\widetilde{\mathcal{Z}}^{n}$.
Initialize: $\mathbf{w}_{0}=\alpha\bm{1}_{p\times 1}$, $\mathbf{v}_{0}=\alpha\bm{1}_{p\times 1}$, and set the iteration index $t=0$.
While $t<T_{1}$, do
\mathbf{w}_{t+1}=\mathbf{w}_{t}+2\eta\frac{1}{n}\sum_{i=1}^{n}y_{i}\mu_{t,i}\mathbf{x}_{i}\odot\mathbf{w}_{t};
\mathbf{v}_{t+1}=\mathbf{v}_{t}-2\eta\frac{1}{n}\sum_{i=1}^{n}y_{i}\mu_{t,i}\mathbf{x}_{i}\odot\mathbf{v}_{t};
\bm{\beta}_{t+1}=\mathbf{w}_{t+1}\odot\mathbf{w}_{t+1}-\mathbf{v}_{t+1}\odot\mathbf{v}_{t+1};
\mu_{t+1,i}=\text{median}\left(0,\frac{1-y_{i}\mathbf{x}_{i}^{T}\bm{\beta}_{t+1}}{n\gamma},1\right);
t=t+1.
End the loop if $t\geq T_{1}$ or $\bm{\mu}_{t}=\bm{0}$.
Return: Set $\widehat{\bm{\beta}}$ as $\bm{\beta}_{t}$.
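For readers who prefer code, the following is a minimal NumPy sketch of Algorithm 1 under the label convention $y_{i}\in\{-1,1\}$; the function name, the default maximum iteration number, and the omission of the validation-based choice of the stopping time are our own simplifications.

```python
import numpy as np

def smoothed_svm_gd(X, y, alpha=1e-8, eta=0.5, gamma=1e-4, T1=5000):
    """Gradient descent on the over-parameterized, Nesterov-smoothed hinge loss.

    X: (n, p) design matrix; y: (n,) labels in {-1, +1}.
    Returns beta_hat = w ⊙ w - v ⊙ v at the stopping time.
    """
    n, p = X.shape
    w = np.full(p, alpha)                     # w_0 = alpha * 1_p
    v = np.full(p, alpha)                     # v_0 = alpha * 1_p
    for _ in range(T1):
        beta = w * w - v * v
        # dual variables mu_i = median(0, (1 - y_i x_i^T beta) / (n * gamma), 1), cf. (4)
        mu = np.clip((1.0 - y * (X @ beta)) / (n * gamma), 0.0, 1.0)
        if np.all(mu == 0.0):                 # intrinsic stopping rule: no active samples
            break
        g = (X.T @ (y * mu)) / n              # shared factor (1/n) * sum_i y_i mu_i x_i
        w = w + 2 * eta * g * w               # update (6) for w
        v = v - 2 * eta * g * v               # update (6) for v
    return w * w - v * v
```

The validation set $\widetilde{\mathcal{Z}}^{n}$ listed as an input to Algorithm 1 can be used to select the returned iterate; that tuning step is omitted in this sketch.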

We highlight three key advantages of Algorithm 1. First, the stopping condition for Algorithm 1 can be determined based on the value of $\bm{\mu}$ in addition to the preset maximum iteration number $T_{1}$. Specifically, when $\mu_{i}=0$ across all samples, the algorithm naturally stops as no further updates are made. Thus, $\bm{\mu}$ serves as an intrinsic indicator of convergence, providing a more efficient stopping condition. Second, Algorithm 1 avoids heavy computations such as forming the Hessian matrix. The main computational load comes from the vector multiplications in (6). Since a considerable portion of the elements in $\bm{\mu}$ are either $0$ or $1$, and the proportion of these elements increases substantially as $\gamma$ decreases, the computation in (6) can be further simplified. Lastly, the utilization of Nesterov's smoothing not only optimizes our approach but also aids in our theoretical derivations, as detailed in Appendix E.

3 Theoretical Analysis

In this section, we analyze the theoretical properties of Algorithm 1. The main result is the error bound on $\|\bm{\beta}_{t}-\bm{\beta}^{*}\|$, where $\bm{\beta}^{*}$ is the minimizer of the population hinge loss without the $\ell_{1}$ penalty: $\bm{\beta}^{*}=\arg\min_{\bm{\beta}\in\mathbb{R}^{p}}\mathbb{E}(1-y\mathbf{x}^{T}\bm{\beta})_{+}$. We start by defining $\delta$-incoherence, a key assumption for our analysis.

Definition 1.

Let $\mathbf{X}\in\mathbb{R}^{n\times p}$ be a matrix with $\ell_{2}$-normalized columns $\mathbf{x}_{1},\ldots,\mathbf{x}_{p}$, i.e., $\|\mathbf{x}_{i}\|=1$ for all $i\in[p]$. The coherence $\delta=\delta(\mathbf{X})$ of the matrix $\mathbf{X}$ is defined as

\delta:=\max_{K\subseteq[n],\,1\leq i\neq j\leq p}|\langle\mathbf{x}_{i}\odot\bm{1}_{K},\mathbf{x}_{j}\odot\bm{1}_{K}\rangle|,

where $\bm{1}_{K}$ denotes the $n$-dimensional vector whose $k$-th entry is $1$ if $k\in K$ and $0$ otherwise. Then, the matrix $\mathbf{X}$ is said to satisfy $\delta$-incoherence.

Coherence measures the suitability of measurement matrices in compressed sensing foucart2013invitation . Several techniques exist for constructing matrices with low coherence. One such approach utilizes sub-Gaussian matrices, which satisfy the low-incoherence property with high probability donoho2005stable ; carin2011coherence . The Restricted Isometry Property (RIP) is another key measure for ensuring reliable sparse recovery in various applications vaskevicius2019implicit ; zhao2022high , but verifying RIP for a given design matrix is NP-hard, making it computationally challenging bandeira2013certifying . In contrast, coherence offers a computationally feasible metric for sparse regression donoho2005stable ; li2021implicit . Hence, the assumptions required in our main theorems can be verified in polynomial time, distinguishing them from the RIP assumption.
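To illustrate the polynomial-time verifiability claimed above: for a fixed column pair $(i,j)$, the subset $K$ maximizing $|\langle\mathbf{x}_{i}\odot\bm{1}_{K},\mathbf{x}_{j}\odot\bm{1}_{K}\rangle|$ keeps either all positive or all negative entries of $\mathbf{x}_{i}\odot\mathbf{x}_{j}$, so $\delta(\mathbf{X})$ can be computed in $\mathcal{O}(np^{2})$ time. A small sketch of this computation (our own code, not from the original text) is given below.

```python
import numpy as np

def coherence(X):
    """delta-incoherence of Definition 1 for an (n, p) matrix.

    Columns are l2-normalized first. For each pair (i, j), the maximum over
    subsets K reduces to max(sum of positive parts, sum of negative parts)
    of the elementwise product x_i ⊙ x_j.
    """
    X = X / np.linalg.norm(X, axis=0, keepdims=True)
    p = X.shape[1]
    delta = 0.0
    for i in range(p):
        Z = X * X[:, [i]]                      # columns: x_i ⊙ x_j for all j
        best = np.maximum(np.maximum(Z, 0.0).sum(axis=0),
                          np.maximum(-Z, 0.0).sum(axis=0))
        best[i] = 0.0                          # exclude the diagonal pair i = j
        delta = max(delta, float(best.max()))
    return delta

# Example: coherence of a 200 x 50 standard Gaussian design.
print(coherence(np.random.default_rng(1).normal(size=(200, 50))))
```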

Recall that $\bm{\beta}^{*}\in\mathbb{R}^{p}$ is the $s$-sparse signal to be recovered. Let $S\subset\{1,\ldots,p\}$ denote the index set corresponding to the nonzero components of $\bm{\beta}^{*}$, so that $|S|=s$. Among the $s$ nonzero signal components of $\bm{\beta}^{*}$, we define the index set of strong signals as $S_{1}=\{i\in S:|\beta^{*}_{i}|\geq C_{s}\log p\sqrt{\log p/n}\}$ and that of weak signals as $S_{2}=\{i\in S:|\beta^{*}_{i}|\leq C_{w}\sqrt{\log p/n}\}$ for some constants $C_{s},C_{w}>0$. We denote by $s_{1}$ and $s_{2}$ the cardinalities of $S_{1}$ and $S_{2}$, respectively. Furthermore, we use $m=\min_{i\in S_{1}}|\beta^{*}_{i}|$ to denote the minimal strength of the strong signals and $\kappa$ to denote the condition number, that is, the ratio of the largest absolute value of the strong signals to the smallest. In this paper, we focus on the case where each nonzero signal in $\bm{\beta}^{*}$ is either strong or weak, which means that $s=s_{1}+s_{2}$. Regarding the input data and the parameters in Algorithm 1, we introduce the following two structural assumptions.

Assumption 1.

The design matrix $\mathbf{X}/\sqrt{n}$ satisfies $\delta$-incoherence with $0<\delta\lesssim 1/(\kappa s\log p)$. In addition, every entry of $\mathbf{X}$ is an i.i.d. zero-mean sub-Gaussian random variable with bounded sub-Gaussian norm $\sigma$.

Assumption 2.

The initializations for gradient descent are $\mathbf{w}_{0}=\alpha\bm{1}_{p\times 1}$ and $\mathbf{v}_{0}=\alpha\bm{1}_{p\times 1}$, where the initialization size $\alpha$ satisfies $0<\alpha\lesssim 1/p$, the parameter of the prox-function $\gamma$ satisfies $0<\gamma\leq 1/n$, and the step size $\eta$ satisfies $0<\eta\lesssim 1/(\kappa\log p)$.

Assumption 1 characterizes the distribution of the input data and can be satisfied by a wide range of distributions. Interestingly, although our proof relies on Assumption 1, numerical results provide compelling evidence that it is not essential for the success of our method. This indicates that the constraints set by Assumption 1 can be relaxed in practical applications, as discussed in Section 4. The assumptions on the initialization size $\alpha$, the smoothness parameter $\gamma$, and the step size $\eta$ primarily stem from the theoretical analysis of Algorithm 1. For instance, $\alpha$ controls the strength of the estimated weak signals and error components, $\gamma$ manages the approximation error of the smoothing, and $\eta$ affects the estimation accuracy of the strong signals. Our numerical simulations indicate that extremely small values of the initialization size $\alpha$, step size $\eta$, and smoothness parameter $\gamma$ are not required to achieve the desired convergence results, highlighting the low computational burden of our method; details can be found in Section 4. The primary theoretical result is summarized in the following theorem.

Theorem 2.

Suppose that Assumptions 1 and 2 hold. Then there exist positive constants $c_{1},c_{2},c_{3}$ and $c_{4}$ such that, with probability at least $1-c_{1}n^{-1}-c_{2}p^{-1}$, for every iteration $t$ with $c_{3}\log(m/\alpha^{2})/(\eta m)\leq t\leq c_{4}\log(1/\alpha)/(\eta\log n)$, the solution of the $t$-th iteration of Algorithm 1, $\bm{\beta}_{t}=\mathbf{w}_{t}\odot\mathbf{w}_{t}-\mathbf{v}_{t}\odot\mathbf{v}_{t}$, satisfies

\|\bm{\beta}_{t}-\bm{\beta}^{*}\|^{2}\lesssim\frac{s\log p}{n}.

Theorem 2 demonstrates that if $\bm{\beta}^{*}$ contains $s$ nonzero signals, then with high probability, for any $t\in[c_{3}\log(m/\alpha^{2})/(\eta m),\,c_{4}\log(1/\alpha)/(\eta\log n)]$, the convergence rate of $\mathcal{O}(\sqrt{s\log p/n})$ in terms of the $\ell_{2}$ norm can be achieved. Such a convergence rate matches the near-oracle rate of sparse SVMs, which can be attained through explicit regularization using the $\ell_{1}$ norm penalty peng2016error ; zhou2022communication , as well as through concave penalties kharoubi2023high . Therefore, Theorem 2 indicates that with over-parameterization, the implicit regularization of gradient descent achieves the same effect as imposing an explicit regularization term on the objective function in (1).

Proof Sketch. The ideas behind the proof of Theorem 2 are as follows. First, we can control the estimated strengths associated with the non-signal and weak signal components, denoted as $\|\mathbf{w}_{t}\odot\bm{1}_{S_{1}^{c}}\|_{\infty}$ and $\|\mathbf{v}_{t}\odot\bm{1}_{S_{1}^{c}}\|_{\infty}$, to the order of the square root of the initialization size $\alpha$ for up to $\mathcal{O}(\log(1/\alpha)/(\eta\log n))$ steps. This provides an upper bound on the stopping time. Also, the magnitude of $\alpha$ determines the size of the coordinates outside the signal support $S_{1}$ at the stopping time. The importance of choosing a small initialization size and its role in achieving the desired statistical performance are further discussed in Section 4. On the other hand, each entry of the strong signal part, denoted as $\bm{\beta}_{t}\odot\bm{1}_{S_{1}}$, increases exponentially until it reaches an accuracy of around $\mathcal{O}(\sqrt{\log p/n})$ near the true parameter $\bm{\beta}^{*}\odot\bm{1}_{S_{1}}$ within roughly $\mathcal{O}(\log(m/\alpha^{2})/(\eta m))$ steps. This establishes the lower bound on the stopping time. The following two propositions summarize these results.

Proposition 1.

(Analyzing Weak Signals and Errors) Under Assumptions 1-2, with probability at least $1-cn^{-1}$, we have

\|\mathbf{w}_{t}\odot\bm{1}_{S_{1}^{c}}\|_{\infty}\leq\sqrt{\alpha}\lesssim\frac{1}{\sqrt{p}}\quad\text{and}\quad\|\mathbf{v}_{t}\odot\bm{1}_{S_{1}^{c}}\|_{\infty}\leq\sqrt{\alpha}\lesssim\frac{1}{\sqrt{p}},

for all $t\leq T^{*}=\mathcal{O}(\log(1/\alpha)/(\eta\log n))$, where $c$ is some positive constant.

Proposition 2.

(Analyzing Strong Signals) Under Assumptions 1-2, with probability at least $1-c_{1}n^{-1}-c_{2}p^{-1}$, the bound

\|\bm{\beta}_{t}\odot\bm{1}_{S_{1}}-\bm{\beta}^{*}\odot\bm{1}_{S_{1}}\|_{\infty}\lesssim\sqrt{\frac{\log p}{n}}

holds for all $\mathcal{O}(\log(m/\alpha^{2})/(\eta m))\leq t\leq\mathcal{O}(\log(1/\alpha)/(\eta\log n))$, where $c_{1},c_{2}$ are two positive constants.

Consequently, by appropriately selecting the stopping time tt within the interval specified in Theorem 2, we can ensure convergence of the signal components and effectively control the error components. The final convergence rate can be obtained by combining the results from Proposition 1 and Proposition 2.

4 Numerical Study

In our simulations, unless otherwise specified, we follow a default setup. We generate $3n$ independent observations, divided equally into training, validation, and testing sets. The true parameter $\bm{\beta}^{*}$ is set to $m\bm{1}_{S}$ for a constant $m$. Each entry of $\mathbf{x}$ is sampled as an i.i.d. zero-mean Gaussian random variable, and the labels $y$ are drawn from a Bernoulli distribution with $P(y=1\mid\mathbf{x})=1/(1+\exp(\mathbf{x}^{T}\bm{\beta}^{*}))$. The default parameters are: true signal strength $m=10$, number of signals $s=4$, sample size $n=200$, dimension $p=400$, step size $\eta=0.5$, smoothness parameter $\gamma=10^{-4}$, and initialization size $\alpha=10^{-8}$. For evaluation, we measure the estimation error by $\|\bm{\beta}_{t}/\|\bm{\beta}_{t}\|-\bm{\beta}^{*}/\|\bm{\beta}^{*}\|\|$ (for comparison with the oracle estimator) and the prediction accuracy on the testing set by $P(\hat{y}=y_{\text{test}})$. Additionally, the "False positive" and "True negative" metrics represent variable selection errors: "False positive" means the true value is zero but is detected as a signal, while "True negative" signifies a non-zero true value that is not detected. Results are primarily visualized using shaded plots and boxplots, where the solid line depicts the median of $30$ runs and the shaded area marks the $25$-th and $75$-th percentiles over these runs.
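For reproducibility, here is a sketch of the default data-generating process described above (our own code; we adopt the label convention $y\in\{-1,1\}$ used by the hinge loss, and the success probability is the one stated in the text).

```python
import numpy as np

def generate_default_data(n=200, p=400, s=4, m=10.0, seed=0):
    """Default simulation design: s-sparse beta* with strength m, i.i.d. Gaussian
    covariates, and P(y = 1 | x) = 1 / (1 + exp(x^T beta*)) as stated in Section 4."""
    rng = np.random.default_rng(seed)
    beta_star = np.zeros(p)
    beta_star[:s] = m                                  # beta* = m * 1_S
    X = rng.normal(size=(3 * n, p))                    # 3n observations in total
    prob = 1.0 / (1.0 + np.exp(X @ beta_star))
    y = np.where(rng.uniform(size=3 * n) < prob, 1.0, -1.0)
    # equal split into training, validation, and testing sets
    tr, va, te = np.split(np.arange(3 * n), 3)
    return beta_star, (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```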

Effects of Small Initialization Size. We investigate the effect of a small initialization size $\alpha$ on the performance of our algorithm. We set the initialization size $\alpha\in\{10^{-4},10^{-6},10^{-8},10^{-10}\}$, and the other parameters are set to their defaults. Figure 1 shows the importance of a small initialization size in inducing exponential paths for the coordinates. Our simulations reveal that a small initialization size leads to lower estimation errors and more precise signal recovery, while effectively constraining the error term to a negligible magnitude. Remarkably, although a small initialization size might slow the convergence slightly, this trade-off is acceptable given the enhanced estimation accuracy.

Figure 1: Effects of small initialization size $\alpha$. Panels: (a) Effects of Initialization; (b) Paths of Signals; (c) Paths of Errors. The estimation error in panel (a) is calculated by $\|\bm{\beta}_{t}-\bm{\beta}^{*}\|/\|\bm{\beta}^{*}\|$.

Effects of Signal Strength and Sample Size. We examine the influence of signal strength on the estimation accuracy of our algorithm. We set the true signal strength $m=0.5k$, $k=1,\ldots,20$, and keep the other parameters at their default values. As depicted in Figure 2, we compare our method (denoted by GD) with the $\ell_{1}$-regularized SVM (denoted by Lasso), and obtain the oracle solution (denoted by Oracle) using the true support information. We assess the advantages of our algorithm from three aspects. Firstly, in terms of estimation error, our method consistently outperforms the Lasso method across different signal strengths, approaching near-oracle performance. Secondly, all three methods achieve high prediction accuracy on the testing set. Lastly, when comparing variable selection performance, our method significantly surpasses the Lasso method in terms of false positive error. Since the true negative error of both methods is essentially $0$, we only present results for the false positive error in Figure 2.

Figure 2: Effects of signal strength $m$. Panels: (a) Estimation Result; (b) Prediction Result; (c) Variable Selection Result. The orange vertical line in panel (a) shows the strong-signal threshold $\log p\sqrt{\log p/n}$.

We further analyze the impact of the sample size $n$ on our proposed algorithm. Keeping the true signal strength fixed at $m=5$, we vary the sample size as $n=50k$ for $k=1,\ldots,8$. The other parameters remain at their default values. Consistently, our method outperforms the Lasso method in estimation, prediction, and variable selection; see Figure 3 for a summary of the results.

Figure 3: Effects of sample size $n$. Panels: (a) Estimation Result; (b) Prediction Result; (c) Variable Selection Result.

Performance on Complex Signal Structure. To examine the performance of our method under more complex signal structures, we select five signal structures: $\mathbf{A}=(5,6,7,8)$, $\mathbf{B}=(4,6,8,9)$, $\mathbf{C}=(3,6,9,10)$, $\mathbf{D}=(2,6,10,11)$, and $\mathbf{E}=(1,6,11,12)$. The other parameters are set to their defaults. The results, summarized in Figure 4, highlight the consistent superiority of our method over the Lasso method in terms of prediction and variable selection performance, even approaching oracle-like performance for complex signal structures. High prediction accuracy is achieved by both methods.

Figure 4: Performance on complex signal structure. Panels: (a) Estimation Result; (b) Prediction Result; (c) Variable Selection Result. The boxplots are based on $30$ runs.

Performance on Heavy-tailed Distribution. Although a sub-Gaussian distribution of the input data is assumed in Assumption 1, we demonstrate that our method extends to a wider range of distributions. We conduct simulations under both uniform and heavy-tailed distributions. The simulation setup mirrors that of Figure 4, except that we sample $\mathbf{x}$ from a $[-1,1]$ uniform distribution and a $t(3)$ distribution, respectively. Results for the $t(3)$ distribution are presented in Figure 5; our method maintains strong performance, suggesting that the constraints of Assumption 1 can be substantially relaxed. Additional simulation results can be found in Appendix A.

Figure 5: Performance on complex signal structure under the $t(3)$ distribution. Panels: (a) Estimation Result; (b) Prediction Result; (c) Variable Selection Result. The boxplots are based on $30$ runs.

Sensitivity Analysis with respect to the smoothness parameter $\gamma$. We analyze the impact of the smoothness parameter $\gamma$ on our proposed algorithm. The experimental setup follows the default configuration, and $\gamma$ is varied within the range $[2.5\times 10^{-5},1\times 10^{-3}]$. The simulations are replicated $30$ times, and the numerical results for the estimation error and the four estimated signal strengths are presented in Figure 6. From Figure 6, it is evident that the results are relatively insensitive to the choice of $\gamma$, in the sense that the estimation errors and the estimated signal strengths under different values of $\gamma$ are very close. Furthermore, as $\gamma$ increases, the estimation accuracy experiences a minor decline, but it remains within an acceptable range. See Appendix A for the simulation results of Signal 3 and Signal 4.

Figure 6: Sensitivity analysis of the smoothness parameter $\gamma$. Panels: (a) Estimation Result; (b) Signal 1; (c) Signal 2. The boxplots are based on $30$ runs. The estimation error is calculated by $\|\bm{\beta}_{t}-\bm{\beta}^{*}\|/\|\bm{\beta}^{*}\|$.

5 Conclusion

In this paper, we leverage over-parameterization to design an unregularized gradient-based algorithm for SVMs and provide theoretical guarantees for the induced implicit regularization. We employ Nesterov's method to smooth the re-parameterized hinge loss function, which resolves the non-differentiability of the hinge loss and improves computational efficiency. Note that our theory relies on the incoherence of the design matrix. It would be interesting to explore to what extent this assumption can be relaxed, a question of future work also raised in other studies on implicit regularization. It is also promising to extend the current study to nonlinear SVMs, potentially incorporating the kernel technique to delve into implicit regularization in nonlinear classification. In summary, this paper not only provides novel theoretical results for over-parameterized SVMs but also enriches the literature on high-dimensional classification with implicit regularization.

6 Acknowledgements

The authors sincerely thank the anonymous reviewers, AC, and PCs for their valuable suggestions that have greatly improved the quality of our work. The authors also thank Professor Shaogao Lv for the fruitful and valuable discussions at the very beginning of this work. Dr. Xin He’s and Dr. Yang Bai’s work is supported by the Program for Innovative Research Team of Shanghai University of Finance and Economics and the Shanghai Research Center for Data Science and Decision Technology.

References

  • [1] Tiago A Almeida, José María G Hidalgo, and Akebo Yamakami. Contributions to the study of sms spam filtering: new collection and results. In Proceedings of the 11th ACM symposium on Document engineering, pages 259–262, 2011.
  • [2] Vassilis Apidopoulos, Tomaso Poggio, Lorenzo Rosasco, and Silvia Villa. Iterative regularization in classification via hinge loss diagonal descent. arXiv preprint arXiv:2212.12675, 2022.
  • [3] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019.
  • [4] MA Bahraoui and B Lemaire. Convergence of diagonally stationary sequences in convex optimization. Set-Valued Analysis, 2:49–61, 1994.
  • [5] Afonso S Bandeira, Edgar Dobriban, Dustin G Mixon, and William F Sawin. Certifying the restricted isometry property is hard. IEEE transactions on information theory, 59(6):3448–3450, 2013.
  • [6] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005.
  • [7] Lawrence Carin, Dehong Liu, and Bin Guo. Coherence, compressive sensing, and random sensor arrays. IEEE Antennas and Propagation Magazine, 53(4):28–39, 2011.
  • [8] Olivier Chapelle. Training a support vector machine in the primal. Neural computation, 19(5):1155–1178, 2007.
  • [9] David L Donoho, Michael Elad, and Vladimir N Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on information theory, 52(1):6–18, 2005.
  • [10] Jianqing Fan, Cong Ma, and Yiqiao Zhong. A selective overview of deep learning. Statistical science: a review journal of the Institute of Mathematical Statistics, 36(2):264, 2021.
  • [11] Jianqing Fan, Zhuoran Yang, and Mengxin Yu. Understanding implicit regularization in over-parameterized single index model. Journal of the American Statistical Association, pages 1–14, 2022.
  • [12] Simon Foucart, Holger Rauhut, Simon Foucart, and Holger Rauhut. An invitation to compressive sensing. Springer, 2013.
  • [13] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. Advances in neural information processing systems, 31, 2018.
  • [14] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems, 30, 2017.
  • [15] Xin He, Shaogao Lv, and Junhui Wang. Variable selection for classification with derivative-induced regularization. Statistica Sinica, 30(4):2075–2103, 2020.
  • [16] P. Hoff. Lasso, fractional norm and structured sparse estimation using a hadamard product parametrization. Computational Statistics &\& Data Analysis, 115:186–198, 2017.
  • [17] Yuan Huang, Jian Huang, Ben-Chang Shia, and Shuangge Ma. Identification of cancer genomic markers via integrative sparse boosting. Biostatistics, 13(3):509–522, 2012.
  • [18] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In International conference on machine learning, pages 427–435. PMLR, 2013.
  • [19] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations.
  • [20] Rachid Kharoubi, Abdallah Mkhadri, and Karim Oualkacha. High-dimensional penalized bernstein support vector machines. arXiv preprint arXiv:2303.09066, 2023.
  • [21] Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate frank-wolfe optimization for structural svms. In International Conference on Machine Learning, pages 53–61. PMLR, 2013.
  • [22] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [23] Jiangyuan Li, Thanh Nguyen, Chinmay Hegde, and Ka Wai Wong. Implicit sparse regularization: The impact of depth and early stopping. Advances in Neural Information Processing Systems, 34:28298–28309, 2021.
  • [24] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47. PMLR, 2018.
  • [25] Jianhao Ma and Salar Fattahi. Sign-rip: A robust restricted isometry property for low-rank matrix recovery. arXiv preprint arXiv:2102.02969, 2021.
  • [26] Jianhao Ma and Salar Fattahi. Global convergence of sub-gradient method for robust matrix recovery: Small initialization, noisy measurements, and over-parameterization. arXiv preprint arXiv:2202.08788, 2022.
  • [27] Cesare Molinari, Mathurin Massias, Lorenzo Rosasco, and Silvia Villa. Iterative regularization for convex regularizers. In International conference on artificial intelligence and statistics, pages 1684–1692. PMLR, 2021.
  • [28] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103:127–152, 2005.
  • [29] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.
  • [30] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. International Conference on Learning Representations, 2015.
  • [31] Daniel W Otter, Julian R Medina, and Jugal K Kalita. A survey of the usages of deep learning for natural language processing. IEEE transactions on neural networks and learning systems, 32(2):604–624, 2020.
  • [32] Bo Peng, Lan Wang, and Yichao Wu. An error bound for l1-norm support vector machine coefficients in ultra-high dimension. The Journal of Machine Learning Research, 17(1):8279–8304, 2016.
  • [33] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
  • [34] Kenya Tajima, Yoshihiro Hirohashi, Esmeraldo Ronnie Rey Zara, and Tsuyoshi Kato. Frank-wolfe algorithm for learning svm-type multi-category classifiers. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pages 432–440. SIAM, 2021.
  • [35] Tomas Vaskevicius, Varun Kanade, and Patrick Rebeschini. Implicit regularization for optimal sparse recovery. Advances in Neural Information Processing Systems, 32, 2019.
  • [36] Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, Eftychios Protopapadakis, et al. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience, 2018, 2018.
  • [37] Junhui Wang, Jiankun Liu, Yong Liu, et al. Improved learning rates of a functional lasso-type svm with sparse multi-kernel representation. Advances in Neural Information Processing Systems, 34:21467–21479, 2021.
  • [38] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. Advances in neural information processing systems, 30, 2017.
  • [39] Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635–3673. PMLR, 2020.
  • [40] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks. In International Conference on Learning Representations.
  • [41] Hao Helen Zhang. Variable selection for support vector machines via smoothing spline anova. Statistica Sinica, pages 659–674, 2006.
  • [42] Xiang Zhang, Yichao Wu, Lan Wang, and Runze Li. Variable selection for support vector machines in moderately high dimensions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(1):53–76, 2016.
  • [43] Yingying Zhang, Yan-Yong Zhao, and Heng Lian. Statistical rates of convergence for functional partially linear support vector machines for classification. Journal of Machine Learning Research, 23(156):1–24, 2022.
  • [44] Peng Zhao, Yun Yang, and Qiao-Chu He. High-dimensional linear regression via implicit regularization. Biometrika, 109(4):1033–1046, 2022.
  • [45] Tianyi Zhou, Dacheng Tao, and Xindong Wu. Nesvm: A fast gradient method for support vector machines. In 2010 IEEE International Conference on Data Mining, pages 679–688. IEEE, 2010.
  • [46] Xingcai Zhou and Hao Shen. Communication-efficient distributed learning for high-dimensional support vector machines. Mathematics, 10(7):1029, 2022.

Appendix

The appendix is organized as follows.

In Appendix A, we provide supplementary explanations and additional experimental results for the numerical study.

In Appendix B, we discuss the performance of our method using other data generating schemes.

In Appendix C, we characterize the dynamics of the gradient descent algorithm for minimizing the Hadamard-product re-parameterized smoothed hinge loss $\mathcal{E}^{*}_{\mathcal{Z}^{n},\gamma}$.

In Appendix D, we prove the main results stated in the paper.

In Appendix E, we provide the proof of propositions and technical lemmas in Section 3.

Appendix A Additional Experimental Results

A.1 Effects of Small Initialization on Error Components

We present a comprehensive analysis of the impact of the initialization on the error components, as depicted in Figure 7. Our results demonstrate that, while the initialization size $\alpha$ does not alter the error trajectory, it significantly affects the error bounds. Numerically, when the initialization size $\alpha$ is $10^{-4}$, the maximum absolute value of the error components is around $3\times 10^{-5}$, which is bounded by the initialization size $10^{-4}$. Similarly, when the initialization size is $10^{-10}$, the maximum absolute value of the error components is around $7.5\times 10^{-14}$, which is bounded by the initialization size $10^{-10}$. The same behavior is observed for the other two initialization sizes. These numerical results validate the conclusion of Proposition 1, as the error term satisfies

\|\bm{\beta}_{t}\odot\bm{1}_{S_{1}^{c}}\|_{\infty}=\|\mathbf{w}_{t}\odot\mathbf{w}_{t}\odot\bm{1}_{S_{1}^{c}}-\mathbf{v}_{t}\odot\mathbf{v}_{t}\odot\bm{1}_{S_{1}^{c}}\|_{\infty}\lesssim(\sqrt{\alpha})^{2}=\alpha.
Figure 7: Effects of small initialization size $\alpha$ on the error components.

A.2 True Negative Results

In the main text, we could not include the True Negative error figures for the settings of Figures 2, 3, 4, and 5 due to space constraints. These results are provided in Figure 8. We observe that both the gradient descent estimator and the Lasso estimator effectively detect the true signals across the different settings.

Figure 8: True Negative results corresponding to the settings illustrated in Figures 2, 3, 4 and 5, respectively. Panels: (a) Results in Figure 2; (b) Results in Figure 3; (c) Results in Figure 4; (d) Results in Figure 5.

A.3 Additional Results under Uniform Distribution

As previously described, we sample $\mathbf{X}$ from a $[-1,1]$ uniform distribution and set the other parameters as follows: the true signal structures are $\mathbf{A}=(5,6,7,8)$, $\mathbf{B}=(4,6,8,9)$, $\mathbf{C}=(3,6,9,10)$, $\mathbf{D}=(2,6,10,11)$, and $\mathbf{E}=(1,6,11,12)$, the number of signals is $s=4$, the sample size is $n=200$, the dimension is $p=400$, the step size is $\eta=0.5$, the smoothness parameter is $\gamma=10^{-4}$, and the initialization size is $\alpha=10^{-8}$. The experimental results are summarized in Figure 9. As depicted in Figure 9, the gradient descent (GD) estimator consistently outperforms the Lasso estimator in terms of estimation accuracy and variable selection error. Both methods slightly surpass the oracle estimator in terms of prediction. However, this advantage mainly stems from the inevitable overestimation of the number of signals by both the GD and Lasso methods.

Figure 9: Performance on complex signal structure under the $[-1,1]$ uniform distribution. Panels: (a) Estimation Result; (b) Prediction Result; (c) False Positive Result; (d) True Negative Result. The boxplots are based on $30$ runs.

A.4 Additional Results of Sensitivity Analysis

Within our sensitivity analysis, the simulation results for Signal 3 and Signal 4 are presented in Figure 10. We observe that the estimated strengths of the two signals exhibit similar distributions across different values of the smoothness parameter $\gamma$.

Figure 10: Sensitivity analysis of the smoothness parameter $\gamma$; panels (a) and (b) show Signal 3 and Signal 4, respectively. The boxplots are depicted based on 30 runs.

Appendix B Other Data Generating Schemes

In Section 4, the scheme for generating $y$ is based on a treatment similar to those in [41, 42], where the task of variable selection for the support vector machine is considered. In this section, we also consider other generating schemes, as adopted in [32, 15]. Specifically, we generate random data from the following two models; a data-generation sketch follows the list.

  • Model 1: $\mathbf{x}\sim MN(\bm{0}_{p},\bm{\Sigma})$, $\bm{\Sigma}=(\sigma_{ij})$ with elements $\sigma_{ij}=0.4^{|i-j|}$ for $1\leq i,j\leq p$, and $P(y=1|\mathbf{x})=\Phi(\mathbf{x}^{T}\bm{\beta}^{*})$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution, $\bm{\beta}^{*}=(1.1,1.1,1.1,1.1,0,\ldots,0)^{T}$ and $s=4$.

  • Model 2: $P(y=1)=P(y=-1)=0.5$, $\mathbf{x}|(y=1)\sim MN(\bm{\mu},\bm{\Sigma})$, $\mathbf{x}|(y=-1)\sim MN(-\bm{\mu},\bm{\Sigma})$, $s=5$, $\bm{\mu}=(0.1,0.2,0.3,0.4,0.5,0,\ldots,0)^{T}$, and $\bm{\Sigma}=(\sigma_{ij})$ with diagonal entries equal to 1, nonzero entries $\sigma_{ij}=-0.2$ for $1\leq i\neq j\leq s$, and all other entries equal to 0. The Bayes rule is $\mathop{\rm sign}(1.39\mathbf{x}_{1}+1.47\mathbf{x}_{2}+1.56\mathbf{x}_{3}+1.65\mathbf{x}_{4}+1.74\mathbf{x}_{5})$ with Bayes error $6.3\%$.
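A minimal sketch of these two generating schemes is given below; the helper names and the use of scipy are ours, but the distributions follow the descriptions above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def model1(n, p, rho=0.4):
    """Model 1: x ~ N(0, Sigma) with Sigma_ij = rho^{|i-j|}; P(y=1|x) = Phi(x^T beta*)."""
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta_star = np.zeros(p); beta_star[:4] = 1.1       # s = 4 nonzero coefficients
    y = np.where(rng.random(n) < norm.cdf(X @ beta_star), 1.0, -1.0)
    return X, y, beta_star

def model2(n, p, s=5):
    """Model 2: P(y=1)=P(y=-1)=0.5 and x | y ~ N(y*mu, Sigma)."""
    mu = np.zeros(p); mu[:s] = [0.1, 0.2, 0.3, 0.4, 0.5]
    Sigma = np.eye(p)
    Sigma[:s, :s] = -0.2                               # sigma_ij = -0.2 for i != j <= s
    np.fill_diagonal(Sigma[:s, :s], 1.0)               # diagonal entries equal to 1
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n) + y[:, None] * mu
    return X, y, mu

X1, y1, _ = model1(200, 400)
X2, y2, _ = model2(200, 400)
print(X1.shape, X2.shape)
```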

We follow the default parameter setup from Section 4, and the experiments are repeated 30 times. The averaged estimation and prediction results are presented in Figure 11. From Figure 11, it is evident that the GD estimator closely approximates the oracle estimator in terms of both estimation error and prediction accuracy.

Figure 11: Estimation error and prediction performance on Model 1 and Model 2; panels (a)-(c) show the estimation error, non-normalized error, and prediction result for Model 1, and panels (d)-(f) show the corresponding results for Model 2. The boxplots are depicted based on 30 runs. The non-normalized error is calculated as $\Arrowvert\bm{\beta}_{t}-\bm{\beta}^{*}\Arrowvert/\Arrowvert\bm{\beta}^{*}\Arrowvert$.

Appendix C Gradient Descent Dynamics

First, let us recall the gradient descent updates. We over-parameterize $\bm{\beta}$ by writing it as $\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v}$, where $\mathbf{w}$ and $\mathbf{v}$ are $p\times 1$ vectors. We then apply gradient descent to the following min-max problem:

\min_{\mathbf{w},\mathbf{v}}\max_{\bm{\mu}\in{\cal P}_{1}}\Big\{\frac{1}{n}\sum_{i=1}^{n}\big(1-y_{i}{\mathbf{x}}_{i}^{T}(\mathbf{w}\odot\mathbf{w}-\mathbf{v}\odot\mathbf{v})\big)\mu_{i}-d_{\gamma}(\bm{\mu})\Big\},

where $d_{\gamma}(\bm{\mu})=\frac{\gamma}{2}\Arrowvert\bm{\mu}\Arrowvert^{2}$. The gradient descent updates with respect to $\mathbf{w}$ and $\mathbf{v}$ are given by

{\mathbf{w}}_{t+1}={\mathbf{w}}_{t}+2\eta\frac{1}{n}\sum_{i=1}^{n}y_{i}\mu_{t,i}{\mathbf{x}}_{i}\odot{\mathbf{w}}_{t}\quad\text{and}\quad{\mathbf{v}}_{t+1}={\mathbf{v}}_{t}-2\eta\frac{1}{n}\sum_{i=1}^{n}y_{i}\mu_{t,i}{\mathbf{x}}_{i}\odot{\mathbf{v}}_{t}.

For convenience, let $\mathcal{G}_{t}\in\mathbb{R}^{p}$ collect the gradient information in the form $\mathcal{G}_{t}=n^{-1}\mathbf{X}^{T}\mathbf{Y}\bm{\mu}_{t}$, where $\mathbf{Y}$ is the diagonal matrix whose diagonal entries are $y_{1},\ldots,y_{n}$. Consequently, the $i$-th element of $\mathcal{G}_{t}$ can be expressed as $G_{t,i}=n^{-1}\sum_{k=1}^{n}\mu_{t,k}y_{k}x_{ki}$, where $\mu_{t,k}=\text{median}\big(0,(1-y_{k}\mathbf{x}_{k}^{T}\bm{\beta}_{t})/(\gamma n),1\big)$. The updating rule can then be rephrased as follows:

\mathbf{w}_{t+1}={\mathbf{w}}_{t}+2\eta{\mathbf{w}}_{t}\odot\mathcal{G}_{t}\quad\text{and}\quad{\mathbf{v}}_{t+1}={\mathbf{v}}_{t}-2\eta{\mathbf{v}}_{t}\odot\mathcal{G}_{t}. \qquad (7)

Furthermore, the error parts of $\mathbf{w}_{t}$ and $\mathbf{v}_{t}$ are denoted by $\mathbf{w}_{t}\odot\bm{1}_{S_{1}^{c}}$ and $\mathbf{v}_{t}\odot\bm{1}_{S_{1}^{c}}$; they include both the weak signal parts and the pure error parts. The strong signal parts of $\mathbf{w}_{t}$ and $\mathbf{v}_{t}$ are denoted by $\mathbf{w}_{t}\odot\bm{1}_{S_{1}}$ and $\mathbf{v}_{t}\odot\bm{1}_{S_{1}}$, respectively.
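For concreteness, a minimal NumPy sketch of the update (7) is given below. The data $\mathbf{X}$, $\mathbf{y}$ and the hyperparameters $\eta$, $\gamma$, $\alpha$, as well as the function name, are placeholders; the body simply follows the formulas above, and early stopping (choosing the number of iterations inside the interval of Theorem 2) plays the role of regularization.

```python
import numpy as np

def overparam_gd(X, y, alpha, eta, gamma, n_iter):
    """One possible implementation of the over-parameterized update (7)."""
    n, p = X.shape
    w = alpha * np.ones(p)                            # w_0 = alpha * 1_{p x 1}
    v = alpha * np.ones(p)                            # v_0 = alpha * 1_{p x 1}
    for _ in range(n_iter):
        beta = w * w - v * v                          # beta_t = w ⊙ w - v ⊙ v
        # mu_{t,k} = median(0, (1 - y_k x_k^T beta_t) / (gamma * n), 1)
        mu = np.clip((1.0 - y * (X @ beta)) / (gamma * n), 0.0, 1.0)
        G = X.T @ (y * mu) / n                        # G_{t,i} = n^{-1} sum_k mu_{t,k} y_k x_{ki}
        w = w * (1.0 + 2.0 * eta * G)                 # w_{t+1} = w_t + 2*eta*(w_t ⊙ G_t)
        v = v * (1.0 - 2.0 * eta * G)                 # v_{t+1} = v_t - 2*eta*(v_t ⊙ G_t)
    return w * w - v * v                              # beta at the stopping iteration
```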

We examine the dynamic changes of the error components and the strong signal components in two stages. Without loss of generality, we focus on analyzing entries $i\in S_{1}$ where $\beta^{*}_{i}>0$; the analysis for the case where $\beta^{*}_{i}<0$ and $i\in S_{1}$ is similar and therefore not presented here. Specifically, within Stage One, we can ensure that with high probability, for $t\geq T_{0}=2\log(m/\alpha^{2})/(\eta m)$, the following results hold under Assumptions 1 and 2:

\begin{aligned}
\text{strong signal dynamics:}\quad &\beta_{t,i}\geq\min\left\{\frac{\beta^{*}_{i}}{2},\left(1+\frac{\eta\beta^{*}_{i}}{\gamma n}\right)^{t}\alpha^{2}-\alpha^{2}\right\}\ \ \text{for}\ i\in S_{1},\ \beta^{*}_{i}>0, \qquad (8)\\
&\max\left\{\Arrowvert\mathbf{w}_{t}\odot\bm{1}_{S_{1}}\Arrowvert_{\infty},\Arrowvert\mathbf{v}_{t}\odot\bm{1}_{S_{1}}\Arrowvert_{\infty}\right\}\leq\alpha^{2}\ \ \text{for}\ i\in S_{1},\\
\text{error component dynamics:}\quad &\max\left\{\Arrowvert\mathbf{w}_{t}\odot\bm{1}_{S_{1}^{c}}\Arrowvert_{\infty},\Arrowvert\mathbf{v}_{t}\odot\bm{1}_{S_{1}^{c}}\Arrowvert_{\infty}\right\}\leq\sqrt{\alpha}\ \ \text{for}\ i\in S_{1}^{c}.
\end{aligned}

From (8), we can observe that for $t\geq T_{0}$, the iterate $(\mathbf{w}_{t},\mathbf{v}_{t})$ reaches the desired performance level. Specifically, each component $\beta_{t,i}$ of the positive strong signal part $\bm{\beta}_{t}\odot\bm{1}_{S_{1}}$ increases exponentially in $t$ until it reaches $\beta^{*}_{i}/2$. Meanwhile, the weak signal and error part $\bm{\beta}_{t}\odot\bm{1}_{S_{1}^{c}}$ remains bounded by $\mathcal{O}(\alpha)$. This observation highlights the significance of a small initialization size $\alpha$ for the gradient descent algorithm, as it restricts the error term. A smaller initialization leads to better estimation, at the cost of a slower convergence rate.

After each component $\beta_{t,i}$ of the strong signal reaches $\beta^{*}_{i}/2$, Stage Two starts. In this stage, $\beta_{t,i}$ continues to grow at a slower rate and converges toward the true parameter $\beta^{*}_{i}$. Specifically, after an additional $3\log(m/\alpha^{2})/(\eta m)$ iterations, each strong signal component $\beta_{t,i}$ enters the desired interval, namely within a distance of $C_{\epsilon}\sqrt{\log p/n}$ from the true parameter $\beta^{*}_{i}$. Simultaneously, the error term $\bm{\beta}_{t}\odot\bm{1}_{S_{1}^{c}}$ remains bounded by $\mathcal{O}(\alpha)$ for up to $\mathcal{O}(\log(1/\alpha)/(\eta\log n))$ iterations.
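The two-stage behaviour can also be observed numerically. The snippet below records the slowest strong-signal coordinate and the largest error component along the iterations on a synthetic stand-in design with $\beta^{*}_{i}=1$ (so that $\beta^{*}_{i}/2=0.5$); the design, labelling rule, and stopping horizon are illustrative assumptions rather than the Section 4 setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 200, 400, 4
beta_star = np.zeros(p); beta_star[:s] = 1.0
X = rng.standard_normal((n, p))
y = np.sign(X @ beta_star); y[y == 0] = 1.0

alpha, eta, gamma, n_iter = 1e-8, 0.5, 1e-4, 300
w = alpha * np.ones(p); v = alpha * np.ones(p)
signal_path, error_path = [], []
for t in range(n_iter):
    beta = w * w - v * v
    mu = np.clip((1.0 - y * (X @ beta)) / (gamma * n), 0.0, 1.0)
    G = X.T @ (y * mu) / n
    w, v = w * (1 + 2 * eta * G), v * (1 - 2 * eta * G)
    beta = w * w - v * v
    signal_path.append(beta[:s].min())            # slowest strong-signal coordinate
    error_path.append(np.abs(beta[s:]).max())     # largest error component

# Stage One ends roughly when the slowest strong signal passes beta*_i / 2.
t_stage_one = next((t for t, b in enumerate(signal_path) if b >= 0.5), None)
print("first iteration with min_i beta_{t,i} >= 0.5:", t_stage_one)
print("largest error component over the run:", max(error_path))
```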

We summarize the dynamic analysis results described above in Proposition 1 and Proposition 2. By combining the results of these two propositions, we obtain the proof of Theorem 2, given in Appendix D.

Appendix D Proof of Theorem 2

Proof.

We first utilize Proposition 1 to upper bound the error components $\bm{\beta}_{t}\odot\bm{1}_{S^{c}_{1}}$. By Proposition 1, the error parts are controlled to be of the same order as the initialization size $\alpha$ within the time interval $0\leq t\leq T^{*}=\log(1/\alpha)/(4\sigma\eta\log n)$. Thus, by direct computation, we have

\begin{aligned}
\Arrowvert\bm{\beta}_{t}\odot\bm{1}_{S_{1}^{c}}-\bm{\beta}^{*}\odot\bm{1}_{S_{1}^{c}}\Arrowvert^{2} &=\Arrowvert\bm{\beta}_{t}\odot\bm{1}_{S_{1}^{c}}+(\bm{\beta}_{t}\odot\bm{1}_{S_{2}}-\bm{\beta}^{*}\odot\bm{1}_{S_{2}})\Arrowvert^{2} \qquad (9)\\
&\overset{(i)}{\leq}\Arrowvert\bm{\beta}^{*}\odot\bm{1}_{S_{2}}\Arrowvert^{2}+p\Arrowvert\mathbf{w}_{t}\odot\mathbf{w}_{t}\odot\bm{1}_{S_{1}^{c}}-\mathbf{v}_{t}\odot\mathbf{v}_{t}\odot\bm{1}_{S_{1}^{c}}\Arrowvert^{2}_{\infty}\\
&\overset{(ii)}{\leq}C_{w}^{2}\cdot\frac{s_{2}\log p}{n}+2C_{\alpha}^{4}\cdot\frac{1}{p},
\end{aligned}

where $(i)$ is based on the relationship between the $\ell_{2}$ norm and the $\ell_{\infty}$ norm, and $(ii)$ follows from the results in Proposition 1. As for the strong signal parts, by Proposition 2, with probability at least $1-c_{1}n^{-1}-c_{2}p^{-1}$, we obtain

\Arrowvert\bm{\beta}_{t}\odot\bm{1}_{S_{1}}-\bm{\beta}^{*}\odot\bm{1}_{S_{1}}\Arrowvert^{2}\leq s_{1}\Arrowvert\bm{\beta}_{t}\odot\bm{1}_{S_{1}}-\bm{\beta}^{*}\odot\bm{1}_{S_{1}}\Arrowvert^{2}_{\infty}\leq C_{\epsilon}^{2}\cdot\frac{s_{1}\log p}{n}. \qquad (10)

Finally, combining (9) and (10), with probability at least $1-c_{1}n^{-1}-c_{2}p^{-1}$, for any $t$ that belongs to the interval

\left[5\log(m/\alpha^{2})/(\eta m),\ \log(1/\alpha)/(4\sigma\eta\log n)\right],

it holds that

\begin{aligned}
\Arrowvert\bm{\beta}_{t}-\bm{\beta}^{*}\Arrowvert^{2} &=\Arrowvert\bm{\beta}_{t}\odot\bm{1}_{S_{1}^{c}}-\bm{\beta}^{*}\odot\bm{1}_{S_{1}^{c}}\Arrowvert^{2}+\Arrowvert\bm{\beta}_{t}\odot\bm{1}_{S_{1}}-\bm{\beta}^{*}\odot\bm{1}_{S_{1}}\Arrowvert^{2}\\
&\leq C_{\epsilon}^{2}\cdot\frac{s_{1}\log p}{n}+C_{w}^{2}\cdot\frac{s_{2}\log p}{n}+2C_{\alpha}^{4}\cdot\frac{1}{p}.
\end{aligned}

Since $p$ is much larger than $n$, the last term, $2C^{4}_{\alpha}/p$, is negligible. Absorbing the constants $C_{w}$, $C_{\alpha}$ and $C_{\epsilon}$, we obtain the claimed error bound of the gradient descent estimator in terms of the $\ell_{2}$ norm, which concludes the proof of Theorem 2. ∎

Appendix E Proof of Propositions 1 and 2

In this section, we provide the proofs of the two propositions stated in Section 3. First, we introduce some useful lemmas concerning the coherence of the design matrix $\mathbf{X}$ and tail bounds for sub-Gaussian random variables.

E.1 Technical Lemmas

Lemma 1.

(General Hoeffding's inequality) Suppose that $\theta_{1},\cdots,\theta_{m}$ are independent mean-zero sub-Gaussian random variables and $\bm{\mu}=(\mu_{1},\cdots,\mu_{m})\in\mathbb{R}^{m}$. Then, for every $t>0$, we have

\mathbb{P}\left[\left|\sum_{i=1}^{m}\mu_{i}\theta_{i}\right|\geq t\right]\leq 2\exp\left(-\frac{ct^{2}}{d^{2}\Arrowvert\bm{\mu}\Arrowvert^{2}}\right),

where $d=\max_{i}\Arrowvert\theta_{i}\Arrowvert_{\psi_{2}}$ and $c$ is an absolute constant.

Proof.

General Hoeffding's inequality can be proved directly from the independence of $\theta_{1},\cdots,\theta_{m}$ and the sub-Gaussian property. The detailed proof is omitted here. ∎
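As a quick sanity check of Lemma 1, one can simulate weighted sums of Rademacher variables, a concrete mean-zero sub-Gaussian family; the constant $2\exp(-t^{2}/2)$ used below is the sharp bound for Rademacher sums rather than the unspecified absolute constant $c$ of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
m, reps = 50, 100_000
mu = rng.standard_normal(m)                      # a fixed weight vector
theta = rng.choice([-1.0, 1.0], size=(reps, m))  # independent Rademacher variables
S = theta @ mu                                   # weighted sums  sum_i mu_i theta_i
for t in (1.0, 2.0, 3.0):
    emp = np.mean(np.abs(S) >= t * np.linalg.norm(mu))
    bound = 2.0 * np.exp(-t ** 2 / 2.0)          # Hoeffding bound for Rademacher sums
    print(f"t = {t}: empirical tail {emp:.4f}  vs  bound {bound:.4f}")
```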

Lemma 2.

Suppose that $\mathbf{X}/\sqrt{n}$ is an $n\times p$ $\ell_{2}$-normalized matrix satisfying $\delta$-incoherence; that is, $\frac{1}{n}|\langle\mathbf{x}_{i}\odot\bm{1}_{K},\mathbf{x}_{j}\odot\bm{1}_{K}\rangle|\leq\delta$ for all $i\neq j$ and any $K\subseteq[n]$. Then, for any $s$-sparse vector $\mathbf{z}\in\mathbb{R}^{p}$, we have

\left\Arrowvert\left(\frac{1}{n}\mathbf{X}^{T}_{K}\mathbf{X}_{K}-\mathbf{I}\right)\mathbf{z}\right\Arrowvert_{\infty}\leq\delta s\Arrowvert\mathbf{z}\Arrowvert_{\infty},

where $\mathbf{X}_{K}$ denotes the $n\times p$ matrix whose $i$-th column is $\mathbf{x}_{i}\odot\bm{1}_{K}$, that is, $\mathbf{X}$ with the rows outside $K$ set to zero.

Proof.

The proof of Lemma 2 is similar to that of Lemma 2 in [23]. According to the condition

\frac{1}{n}|\langle\mathbf{x}_{i}\odot\bm{1}_{K},\mathbf{x}_{j}\odot\bm{1}_{K}\rangle|\leq\delta,\quad i\neq j,\ K\subseteq[n],

we can verify that for any $i\in[p]$,

\left|\left(\frac{1}{n}\mathbf{X}^{T}_{K}\mathbf{X}_{K}\mathbf{z}\right)_{i}-z_{i}\right|\leq\delta s\Arrowvert\mathbf{z}\Arrowvert_{\infty}.

Therefore,

\left\Arrowvert\left(\frac{1}{n}\mathbf{X}^{T}_{K}\mathbf{X}_{K}-\mathbf{I}\right)\mathbf{z}\right\Arrowvert_{\infty}\leq\delta s\Arrowvert\mathbf{z}\Arrowvert_{\infty}. ∎
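As a small numerical illustration of Lemma 2 (taking $K=[n]$ for simplicity, with our own variable names and a random Gaussian design standing in for the assumed incoherent design), one can compare the two sides of the displayed bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 400, 4

X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)       # columns of X / sqrt(n) have unit l2 norm

G = X.T @ X / n                                   # Gram matrix of the normalized design
delta = np.max(np.abs(G - np.eye(p)))             # empirical coherence (here with K = [n])

z = np.zeros(p)
z[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)   # an s-sparse vector

lhs = np.max(np.abs(G @ z - z))
rhs = delta * s * np.max(np.abs(z))
print(f"||(X^T X / n - I) z||_inf = {lhs:.4f}  <=  delta * s * ||z||_inf = {rhs:.4f}")
```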

E.2 Proof of Proposition 1

Proof.

Here we prove Proposition 1 by induction. First, the initializations $\mathbf{w}_{0}=\alpha\bm{1}_{p\times 1}$ and $\mathbf{v}_{0}=\alpha\bm{1}_{p\times 1}$ satisfy the conclusion of Proposition 1, since for any fixed $i\in S_{1}^{c}$ we initialize $w_{0,i}$ and $v_{0,i}$ with

|w_{0,i}|=\alpha\leq\sqrt{\alpha}\quad\text{and}\quad|v_{0,i}|=\alpha\leq\sqrt{\alpha},

so Proposition 1 holds when $t=0$. In the following, we show that for any $t^{*}$ with $0\leq t^{*}\leq T^{*}=\log(1/\alpha)/(4\sigma\eta\log n)$, if the conclusion of Proposition 1 holds for all $t$ with $0\leq t\leq t^{*}$, then it also holds at the $(t^{*}+1)$-th step. From the gradient descent updating rule given in (7), the updates of the weak signals and errors $w_{t,i}$, $v_{t,i}$ for any fixed $i\in S_{1}^{c}$ are

w_{t+1,i}=w_{t,i}(1+2\eta G_{t,i})\quad\text{and}\quad v_{t+1,i}=v_{t,i}(1-2\eta G_{t,i}).

Recall that $|\mu_{t,k}|\leq 1$ for $k=1,\ldots,n$. Then, for any fixed $i\in S^{c}_{1}$, we have

|G_{t,i}|=\left|\frac{1}{n}\sum_{k=1}^{n}\mu_{t,k}y_{k}x_{ki}\right|\leq\frac{1}{n}\sum_{k=1}^{n}|x_{ki}|.

For ease of notation, we denote $M_{i}=n^{-1}\sum_{k=1}^{n}|x_{ki}|$ and $M=\max_{i\in S_{1}^{c}}M_{i}$. Then we can easily bound the weak signals and errors as follows,

\Arrowvert\mathbf{w}_{t+1}\odot\bm{1}_{S_{1}^{c}}\Arrowvert_{\infty}\leq(1+2\eta M)\Arrowvert\mathbf{w}_{t}\odot\bm{1}_{S_{1}^{c}}\Arrowvert_{\infty}\quad\text{and}\quad\Arrowvert\mathbf{v}_{t+1}\odot\bm{1}_{S_{1}^{c}}\Arrowvert_{\infty}\leq(1+2\eta M)\Arrowvert\mathbf{v}_{t}\odot\bm{1}_{S_{1}^{c}}\Arrowvert_{\infty}. \qquad (11)

By the general Hoeffding inequality for sub-Gaussian random variables, for each fixed $i\in S^{c}_{1}$ we obtain $M_{i}\leq\sigma\sqrt{\log n}$ with probability at least $1-cn^{-n}$, where $c$ is a constant. Taking a union bound over the at most $p$ coordinates in $S_{1}^{c}$, and noting that $pn^{-n}\leq n^{-1}$ for the scalings of $(n,p)$ considered here, we have $(1-cn^{-n})^{p}>1-cn^{-1}$. Then, with probability at least $1-cn^{-1}$, where $c$ is a constant, we have $M\leq\sigma\sqrt{\log n}$.

Combining (11) and the bound on $M$, we then have

\begin{aligned}
\Arrowvert\mathbf{w}_{t^{*}+1}\odot\bm{1}_{S_{1}^{c}}\Arrowvert_{\infty} &\leq(1+2\eta M)^{t^{*}+1}\Arrowvert\mathbf{w}_{0}\odot\bm{1}_{S_{1}^{c}}\Arrowvert_{\infty}\\
&=\exp\big((t^{*}+1)\log(1+2\eta M)\big)\,\alpha\\
&\overset{(i)}{\leq}\exp\big(T^{*}\log(1+2\eta M)\big)\,\alpha\\
&\overset{(ii)}{\leq}\exp\big(T^{*}\cdot 2\eta M\big)\,\alpha\\
&\overset{(iii)}{\leq}\exp\big((1/2)\log(1/\alpha)\big)\,\alpha=\sqrt{\alpha},
\end{aligned}

where $(i)$ and $(ii)$ follow from $t^{*}+1\leq T^{*}$ and $\log(1+x)<x$ for $x>0$, respectively, and $(iii)$ is based on the definition of $T^{*}$. Similarly, we can prove that $\Arrowvert\mathbf{v}_{t^{*}+1}\odot\bm{1}_{S_{1}^{c}}\Arrowvert_{\infty}\leq\sqrt{\alpha}$. Thus, the induction hypothesis also holds at step $t^{*}+1$. Since $t^{*}<T^{*}$ is arbitrarily chosen, we complete the proof of Proposition 1. ∎

E.3 Proof of Proposition 2

Proof.

In order to analyze the strong signals $\beta_{t,i}=w_{t,i}^{2}-v_{t,i}^{2}$ for any fixed $i\in S_{1}$, we focus on the dynamics of $w_{t,i}^{2}$ and $v_{t,i}^{2}$. Following the updating rule (7), we have

\begin{aligned}
w_{t+1,i}^{2} &=w_{t,i}^{2}+4\eta w_{t,i}^{2}G_{t,i}+4\eta^{2}w_{t,i}^{2}G_{t,i}^{2}, \qquad (12)\\
v_{t+1,i}^{2} &=v_{t,i}^{2}-4\eta v_{t,i}^{2}G_{t,i}+4\eta^{2}v_{t,i}^{2}G_{t,i}^{2}.
\end{aligned}

Without loss of generality, we only analyze entries $i\in S_{1}$ with $\beta^{*}_{i}>0$; the analysis for the negative case $\beta^{*}_{i}<0$, $i\in S_{1}$, is almost the same and is thus omitted. We divide our analysis into two stages, as discussed in Appendix C.

Specifically, in Stage One, since we choose the initialization $\mathbf{w}_{0}=\mathbf{v}_{0}=\alpha\bm{1}_{p\times 1}$, we have $\beta_{0,i}=0$ for any fixed $i\in S_{1}$. We show that after $2\log(\beta^{*}_{i}/\alpha^{2})/(\eta\beta^{*}_{i})$ iterations, the gradient descent coordinate $\beta_{t,i}$ exceeds $\beta^{*}_{i}/2$; therefore, all components of the strong signal part exceed half of the true parameters after $2\log(m/\alpha^{2})/(\eta m)$ iterations. In Stage Two, we then calculate the number of iterations required to achieve $\beta_{t,i}\geq\beta^{*}_{i}-\epsilon$, where $\epsilon=C_{\epsilon}\sqrt{\log p/n}$. Combining the two stages concludes the proof of Proposition 2.

Stage One. First, we introduce some new notation. In the $t$-th iteration, according to the values of $\mu_{t,i}$, we divide the set $[n]$ into three parts, $K_{t,1}=\{i\in[n]:\mu_{t,i}=1\}$, $K_{t,2}=\{i\in[n]:0<\mu_{t,i}<1\}$ and $K_{t,3}=\{i\in[n]:\mu_{t,i}=0\}$, with cardinalities $k_{t,1}$, $k_{t,2}$ and $k_{t,3}$, respectively. From [28, 45], most elements of $\bm{\mu}_{t}$ are $0$ or $1$, and the proportion of such elements increases rapidly as $\gamma$ decreases; hence $k_{t,2}\ll k_{t,1}+k_{t,3}$ when $\gamma$ is kept small.

According to the updating formula of $w^{2}_{t+1,i}$ in (12), we define the following element-wise bound $\xi_{t,i}$ for any fixed $i\in S_{1}$:

\begin{aligned}
\xi_{t,i}:&=1-\frac{w_{t,i}^{2}}{w_{t+1,i}^{2}}\left(1-\frac{4\eta}{\gamma n}(\beta_{t,i}-\beta_{i}^{*})\right) \qquad (13)\\
&=\frac{1}{w_{t+1,i}^{2}}\left(w_{t+1,i}^{2}-w_{t,i}^{2}+\frac{4\eta}{\gamma n}w_{t,i}^{2}(\beta_{t,i}-\beta_{i}^{*})\right)\\
&=\frac{w_{t,i}^{2}}{w_{t+1,i}^{2}}\left(4\eta\left(G_{t,i}+\frac{1}{\gamma n}(\beta_{t,i}-\beta^{*}_{i})\right)+4\eta^{2}G_{t,i}^{2}\right).
\end{aligned}

Rewriting (13), we obtain

w_{t+1,i}^{2}=w_{t,i}^{2}\left(1-\frac{4\eta}{\gamma n}(\beta_{t,i}-\beta_{i}^{*})\right)+\xi_{t,i}w_{t+1,i}^{2}. \qquad (14)

From (14), if the magnitude of $\xi_{t,i}$ is sufficiently small (recall that $0\leq\beta_{t,i}\leq\beta^{*}_{i}/2$ in Stage One), we can conclude that $w_{t+1,i}^{2}\geq w_{t,i}^{2}$. Therefore, our goal is to evaluate the magnitude of $\xi_{t,i}$ in (13) for any fixed $i\in S_{1}$. First, we focus on the term $G_{t,i}+(\beta_{t,i}-\beta^{*}_{i})/(\gamma n)$ in (13). In particular, recalling the definition of $G_{t,i}$, we expand $G_{t,i}+(\beta_{t,i}-\beta^{*}_{i})/(\gamma n)$ as follows:

\begin{aligned}
G_{t,i}+\frac{1}{\gamma n}(\beta_{t,i}-\beta^{*}_{i}) &=\frac{1}{n}\left(\sum_{k\in K_{t,1}}y_{k}x_{ki}+\sum_{k\in K_{t,2}}y_{k}x_{ki}\frac{1-y_{k}\mathbf{x}_{k}^{T}\bm{\beta}_{t}}{n\gamma}\right)+\frac{1}{\gamma n}(\beta_{t,i}-\beta^{*}_{i})\\
&=\frac{1}{n}\left(\sum_{k\in K_{t,1}}y_{k}x_{ki}+\sum_{k\in K_{t,2}}\frac{y_{k}x_{ki}}{n\gamma}\right)-\frac{1}{\gamma n^{2}}\left(\sum_{k\in K_{t,2}}x_{k,i}\mathbf{x}_{k}^{T}\bm{\beta}_{t}-n(\beta_{t,i}-\beta_{i}^{*})\right)\\
&=\frac{1}{n}\left(\sum_{k\in K_{t,1}}y_{k}x_{ki}+\sum_{k\in K_{t,2}}\frac{(y_{k}-\mathbf{x}_{k}^{T}\bm{\beta}^{*})x_{ki}}{n\gamma}\right)\\
&\quad-\frac{1}{\gamma n}\left(\frac{1}{n}\sum_{k\in K_{t,2}}x_{k,i}\mathbf{x}_{k}^{T}(\bm{\beta}_{t}-\bm{\beta}^{*})-(\beta_{t,i}-\beta_{i}^{*})\right)\\
&\,\widehat{=}\,(I_{1})+(I_{2}).
\end{aligned}

For the term $(I_{1})$, by the condition $k_{t,2}\ll k_{t,1}$ and the general Hoeffding inequality, we have

\frac{1}{n}\left|\sum_{k\in K_{t,1}}y_{k}x_{ki}+\sum_{k\in K_{t,2}}\frac{(y_{k}-\mathbf{x}_{k}^{T}\bm{\beta}^{*})x_{ki}}{n\gamma}\right|\leq\frac{1}{n}\sigma\sqrt{n\log p}=\sigma\sqrt{\frac{\log p}{n}}, \qquad (15)

which holds with probability at least $1-cp^{-1}$, where $c$ is a constant.

For the term $(I_{2})$, let $\mathbf{X}_{t,2}\in\mathbb{R}^{n\times p}$ denote the matrix whose $k$-th row is $\mathbf{x}_{k}^{T}$ if $k\in K_{t,2}$ and zero otherwise. Then, we bound $(I_{2})$ based on the assumption on the design matrix $\mathbf{X}$ via Lemma 2,

\left\Arrowvert\frac{1}{n}\mathbf{X}_{t,2}^{T}\mathbf{X}_{t,2}(\bm{\beta}_{t}-\bm{\beta}^{*})-(\bm{\beta}_{t}\odot\bm{1}_{S_{1}}-\bm{\beta}^{*}\odot\bm{1}_{S_{1}})\right\Arrowvert_{\infty}\lesssim\delta s\kappa m. \qquad (16)

By (15) and (16), together with the minimal-strength condition $m\geq\sigma\log p\sqrt{\log p/n}$ and the incoherence condition $\delta\lesssim 1/(\kappa s\log p)$, we have

\left|G_{t,i}+\frac{1}{\gamma n}(\beta_{t,i}-\beta^{*}_{i})\right|\lesssim\frac{m}{\gamma n\log p}. \qquad (17)

Then, we can bound $G_{t,i}$ via

\begin{aligned}
|G_{t,i}| &\leq\left|G_{t,i}+\frac{1}{\gamma n}(\beta_{t,i}-\beta^{*}_{i})\right|+\left|\frac{1}{\gamma n}(\beta_{t,i}-\beta^{*}_{i})\right| \qquad (18)\\
&\lesssim\frac{1}{\gamma n}\left(\frac{m}{\log p}+\kappa m\right).
\end{aligned}

Note that for any fixed $i\in S_{1}$, $|G_{t,i}|\leq n^{-1}\sum_{k=1}^{n}|x_{ki}|$. By the general Hoeffding inequality, with probability at least $1-c_{1}n^{-1}$ we have $|G_{t,i}|\lesssim\log n$, and hence $|\eta G_{t,i}|<1$ under the assumption on $\eta$. Therefore, $w_{t,i}^{2}/w_{t+1,i}^{2}$ is always bounded above by some constant.

Recalling the definition of the element-wise bound $\xi_{t,i}$ and combining (17) and (18), we have

|\xi_{t,i}|\overset{(i)}{\lesssim}\frac{1}{\gamma n}\left(\frac{\eta m}{\log p}+\eta^{2}\kappa^{2}m^{2}\right)\overset{(ii)}{\lesssim}\frac{\eta m}{\gamma n\log p}, \qquad (19)

where we use the bound on $w_{t,i}^{2}/w_{t+1,i}^{2}$ for any fixed $i\in S_{1}$ in step $(i)$, and step $(ii)$ is based on $\eta\kappa m\leq m/\log p$ and $\kappa m\lesssim 1$. Thus, we conclude that $\xi_{t,i}$ is sufficiently small for any fixed $i\in S_{1}$.

Now, for any fixed $i\in S_{1}$, when $0\leq\beta_{t,i}\leq\beta^{*}_{i}/2$, we can quantify the increment of $w_{t,i}^{2}$ according to (19),

\begin{aligned}
w_{t+1,i}^{2} &\geq w_{t,i}^{2}\left.\left(1-\frac{4\eta}{\gamma n}(\beta_{t,i}-\beta^{*}_{i})\right)\right/\left(1+c\frac{\eta m}{\gamma n\log p}\right)\\
&\overset{(i)}{\geq}w_{t,i}^{2}\left.\left(1+\frac{2\eta\beta^{*}_{i}}{\gamma n}\right)\right/\left(1+c\frac{\eta m}{\gamma n\log p}\right)\\
&\overset{(ii)}{\geq}w_{t,i}^{2}\left(1+\frac{\eta\beta^{*}_{i}}{\gamma n}\right),
\end{aligned}

where $(i)$ follows from $0\leq\beta_{t,i}\leq\beta^{*}_{i}/2$ and $(ii)$ holds since $m/\log p\lesssim\beta^{*}_{i}$. Similarly, we can analyze $v_{t,i}^{2}$ to get that when $0\leq\beta_{t,i}\leq\beta^{*}_{i}/2$,

v_{t+1,i}^{2}\leq v_{t,i}^{2}\left(1-\frac{\eta\beta^{*}_{i}}{\gamma n}\right).

Therefore, $w_{t,i}^{2}$ increases at an exponential rate while $v_{t,i}^{2}$ decreases at an exponential rate. Stage One ends when $\beta_{t,i}$ exceeds $\beta^{*}_{i}/2$, and our goal is to estimate $t_{i,0}$ such that

\beta_{t,i}\geq\left(1+\frac{\eta\beta^{*}_{i}}{\gamma n}\right)^{t_{i,0}}\alpha^{2}-\left(1-\frac{\eta\beta^{*}_{i}}{\gamma n}\right)^{t_{i,0}}\alpha^{2}\geq\beta_{i}^{*}/2.

Note that $\{v_{t,i}^{2}\}_{t\geq 0}$ forms a decreasing sequence and is thus bounded by $\alpha^{2}$. Hence, it suffices to solve the following inequality for $t_{i,0}$:

\left(1+\frac{\eta\beta^{*}_{i}}{\gamma n}\right)^{t_{i,0}}\alpha^{2}\geq\beta^{*}_{i}/2+\alpha^{2},

which is equivalent to finding $t_{i,0}$ satisfying

t_{i,0}\geq T_{i,0}=\log\left.\left(\frac{\beta^{*}_{i}}{2\alpha^{2}}+1\right)\right/\log\left(1+\frac{\eta\beta^{*}_{i}}{\gamma n}\right).

For $T_{i,0}$, by direct calculation, we have

\begin{aligned}
T_{i,0} &\overset{(i)}{\leq}\log\left(\frac{\beta^{*}_{i}}{2\alpha^{2}}+1\right)\left.\left(1+\frac{\eta\beta^{*}_{i}}{\gamma n}\right)\right/\left(\frac{\eta\beta^{*}_{i}}{\gamma n}\right)\\
&\overset{(ii)}{\leq}2\log\left.\left(\frac{\beta^{*}_{i}}{\alpha^{2}}\right)\right/(\eta\beta^{*}_{i}),
\end{aligned}

where $(i)$ follows from $x\log x-x+1\geq 0$ for $x\geq 0$, and $(ii)$ holds due to $\log(x/2+1)\leq\log x$ for $x\geq 2$, together with the assumption on $\gamma$ and $\eta\beta_{i}^{*}\leq 1$. Thus, we set $t_{i,0}=2\log(\beta^{*}_{i}/\alpha^{2})/(\eta\beta^{*}_{i})$, so that for each $i\in S_{1}$ and all $t\geq t_{i,0}$, $\beta_{t,i}\geq\beta^{*}_{i}/2$.

Stage Two. Define $\epsilon=C_{\epsilon}\sqrt{\log p/n}$ and $a_{i,1}=\lceil\log_{2}(\beta^{*}_{i}/\epsilon)\rceil$. Next, we refine the element-wise bound $\xi_{t,i}$ in (13). For any fixed $i\in S_{1}$, if there exists some $a$ such that $1\leq a\leq a_{i,1}$ and $(1-1/2^{a})\beta^{*}_{i}\leq\beta_{t,i}\leq(1-1/2^{a+1})\beta^{*}_{i}$, then, following the same argument as in Stage One, we can deduce that

|\xi_{t,i}|\lesssim\frac{\eta m}{2^{a}\gamma n\log p}. \qquad (20)

Using the element-wise bound (20), we obtain the increment of $w_{t+1,i}^{2}$ and the decrement of $v_{t+1,i}^{2}$:

w_{t+1,i}^{2}\geq w_{t,i}^{2}\left(1+\frac{\eta\beta^{*}_{i}}{2^{a}\gamma n}\right)\quad\text{and}\quad v_{t+1,i}^{2}\leq v_{t,i}^{2}\left(1-\frac{\eta\beta^{*}_{i}}{2^{a}\gamma n}\right).

We define $t_{i,a}$ as the smallest number of additional iterations such that $\beta_{t+t_{i,a},i}\geq(1-1/2^{a+1})\beta^{*}_{i}$. Intuitively, supposing that the current estimate $\beta_{t,i}$ lies between $(1-1/2^{a})\beta^{*}_{i}$ and $(1-1/2^{a+1})\beta^{*}_{i}$, we aim to find the number of iterations required for the sequence $\{\beta_{t,i}\}_{t\geq 0}$ to exceed $(1-1/2^{a+1})\beta^{*}_{i}$. To obtain a sufficient condition for $t_{i,a}$, we consider the following inequality,

\beta_{t+t_{i,a},i}\geq w_{t,i}^{2}\left(1+\frac{\eta\beta^{*}_{i}}{2^{a}\gamma n}\right)^{t_{i,a}}-v_{t,i}^{2}\left(1-\frac{\eta\beta^{*}_{i}}{2^{a}\gamma n}\right)^{t_{i,a}}\geq(1-1/2^{a+1})\beta^{*}_{i}.

As in Stage One, the sequence $\{v_{t,i}^{2}\}_{t\geq 0}$ is bounded by $\alpha^{2}$. Then, it is sufficient to solve the following inequality for $t_{i,a}$:

t_{i,a}\geq T_{i,a}=\log\left.\left(\frac{(1-1/2^{a+1})\beta^{*}_{i}+\alpha^{2}}{w_{t,i}^{2}}\right)\right/\log\left(1+\frac{\eta\beta^{*}_{i}}{2^{a}\gamma n}\right).

To obtain a more precise upper bound on $T_{i,a}$, we compute

\begin{aligned}
T_{i,a} &=\log\left.\left(\frac{(1-1/2^{a+1})\beta^{*}_{i}+\alpha^{2}}{w_{t,i}^{2}}\right)\right/\log\left(1+\frac{\eta\beta^{*}_{i}}{2^{a}\gamma n}\right)\\
&\overset{(i)}{\leq}\log\left.\left(\frac{(1-1/2^{a+1})\beta^{*}_{i}+\alpha^{2}}{(1-1/2^{a})\beta^{*}_{i}}\right)\right/\log\left(1+\frac{\eta\beta^{*}_{i}}{2^{a}\gamma n}\right)\\
&\overset{(ii)}{\leq}\log\left.\left(1+\frac{1/2^{a+1}}{1-1/2^{a}}+\frac{\alpha^{2}}{(1-1/2^{a})\beta^{*}_{i}}\right)\left(1+\frac{\eta\beta^{*}_{i}}{2^{a}\gamma n}\right)\right/\left(\frac{\eta\beta^{*}_{i}}{2^{a}\gamma n}\right),
\end{aligned}

where $(i)$ is based on the condition $w_{t,i}^{2}\geq(1-1/2^{a})\beta^{*}_{i}$ and $(ii)$ holds since $x\log x-x+1\geq 0$ for $x\geq 0$. By direct calculation, we obtain

\begin{aligned}
T_{i,a} &\overset{(iii)}{\leq}\left.\left(\frac{1/2^{a+1}}{1-1/2^{a}}+\frac{\alpha^{2}}{(1-1/2^{a})\beta^{*}_{i}}\right)\right/\left(\frac{\eta\beta^{*}_{i}}{2^{a+1}\gamma n}\right) \qquad (21)\\
&\overset{(iv)}{\leq}\frac{2}{\eta\beta^{*}_{i}}+\frac{2^{a+2}\alpha^{2}}{\eta\beta^{*2}_{i}},
\end{aligned}

where we use $\log(x+1)\leq x$ for $x\geq 0$ in step $(iii)$ and $1-1/2^{a}\geq 1/2$ in step $(iv)$, respectively. Recalling the assumption $a\leq a_{i,1}=\lceil\log_{2}(\beta^{*}_{i}/\epsilon)\rceil$ with $\epsilon=C_{\epsilon}\sqrt{\log p/n}$, we get $2^{a+2}\leq 4\beta^{*}_{i}/\epsilon\leq 4\sqrt{n/\log p}\,\beta^{*}_{i}/C_{\epsilon}$. Moreover, by the assumption on the initialization size $\alpha$, we have $\alpha^{2}\leq C_{\alpha}/p^{2}$. Thus, we can bound $2^{a+2}\alpha^{2}/(\eta\beta^{*2}_{i})$ in (21) as

\frac{2^{a+2}\alpha^{2}}{\eta\beta^{*2}_{i}}\leq\sqrt{\frac{n}{\log p}}\frac{4C_{\alpha}}{C_{\epsilon}p^{2}\eta\beta^{*}_{i}}\overset{(i)}{\leq}\frac{1}{\eta\beta^{*}_{i}}, \qquad (22)

where $(i)$ holds when $p^{2}\log p\geq 4C_{\alpha}\sqrt{n}/C_{\epsilon}$. Combining (21) and (22), we obtain the final bound on $T_{i,a}$:

T_{i,a}\leq\frac{3}{\eta\beta^{*}_{i}},

for any $a\leq a_{i,1}$. Hence, if there exists $1\leq a\leq a_{i,1}$ such that $\beta_{t,i}\in[(1-1/2^{a})\beta^{*}_{i},(1-1/2^{a+1})\beta^{*}_{i}]$, then after $t_{i,a}\geq 3/(\eta\beta^{*}_{i})$ additional iterations we can guarantee that $\beta_{t,i}\geq(1-1/2^{a+1})\beta^{*}_{i}$. Now, recalling the definition of $a_{i,1}$, we have $\beta^{*}_{i}/2^{a_{i,1}}\leq\epsilon=C_{\epsilon}\sqrt{\log p/n}$. Therefore, within at most $\sum_{a=0}^{a_{i,1}}T_{i,a}$ iterations, we have $\beta_{t,i}\geq\beta^{*}_{i}-\epsilon$. By the assumption on $\alpha$, we obtain

\begin{aligned}
\sum_{a=0}^{a_{i,1}}T_{i,a} &\leq 2\log(\beta^{*}_{i}/\alpha^{2})/(\eta\beta^{*}_{i})+3\left.\left\lceil\log_{2}\left(\frac{\beta^{*}_{i}}{\epsilon}\right)\right\rceil\right/(\eta\beta^{*}_{i})\\
&\leq 5\log\left.\left(\frac{\beta^{*}_{i}}{\alpha^{2}}\right)\right/(\eta\beta^{*}_{i}).
\end{aligned}

Since the function $\log(x/\alpha^{2})/(\eta x)$ is decreasing in $x$, after $5\log(m/\alpha^{2})/(\eta m)$ iterations we have $|\beta_{t,i}-\beta_{i}^{*}|\lesssim\sqrt{\log p/n}$ for any fixed $i\in S_{1}$. Thus, we complete the proof of Proposition 2. ∎