
Stochastic subgradient projection methods for composite optimization with functional constraints

Ion Necoara (ion.necoara@upb.ro)
Automatic Control and Systems Engineering Department, University Politehnica Bucharest, Spl. Independentei 313, 060042 Bucharest, Romania.
Gheorghe Mihoc-Caius Iacob Institute of Mathematical Statistics and Applied Mathematics of the Romanian Academy, 050711 Bucharest, Romania.

Nitesh Kumar Singh (nitesh.nitesh@stud.acs.upb.ro)
Automatic Control and Systems Engineering Department, University Politehnica Bucharest, Spl. Independentei 313, 060042 Bucharest, Romania.
Abstract

In this paper we consider convex optimization problems with stochastic composite objective function subject to a (possibly infinite) intersection of functional constraints. The objective function is expressed in terms of an expectation operator over a sum of two terms satisfying a stochastic bounded gradient condition, with or without strong convexity type properties. In contrast to the classical approach, where the constraints are usually represented as an intersection of simple sets, in this paper we consider that each constraint set is given as the level set of a convex but not necessarily differentiable function. Based on the flexibility offered by our general optimization model, we consider a stochastic subgradient method with random feasibility updates. At each iteration, our algorithm takes a stochastic proximal (sub)gradient step aimed at minimizing the objective function, followed by a subgradient step minimizing the feasibility violation of the observed random constraint. We analyze the convergence behavior of the proposed algorithm for diminishing stepsizes and for the case when the objective function is convex or strongly convex, unifying the nonsmooth and smooth cases. We prove sublinear convergence rates for this stochastic subgradient algorithm, which are known to be optimal for subgradient methods on this class of problems. When the objective function has a linear least-squares form and the constraints are polyhedral, it is shown that the algorithm converges linearly. Numerical evidence supports the effectiveness of our method on real problems.

Keywords: Stochastic convex optimization, functional constraints, stochastic subgradient, rate of convergence, constrained least-squares, robust/sparse SVM.

1 Introduction

The large number of terms in the objective function and/or the large number of constraints in most practical optimization applications, including machine learning and statistics (Vapnik, 1998; Bhattacharyya et al., 2004), signal processing (Necoara, 2021; Tibshirani, 2011), computer science (Kundu et al., 2018), distributed control (Nedelcu et al., 2014), operations research and finance (Rockafellar and Uryasev, 2000), create several theoretical and computational challenges. For example, these problems are becoming increasingly large in terms of both the number of variables and the size of training data, and they are usually nonsmooth due to the use of regularizers, penalty terms and the presence of constraints. Due to these challenges, (sub)gradient-based methods are widely applied. In particular, proximal gradient algorithms (Necoara, 2021; Rosasco et al., 2019) are natural in applications where the function to be minimized is in composite form, i.e. the sum of a smooth term and a nonsmooth regularizer (e.g., the indicator function of some simple constraints), such as the empirical risk in machine learning. In these algorithms, at each iteration the proximal operator defined by the nonsmooth component is applied to the gradient descent step for the smooth data fitting component. In practice, it is important to consider situations where these operations cannot be performed exactly. For example, the case where the gradient, the proximal operator or the projection can be computed only up to an error has been considered in (Devolder et al., 2014; Nedelcu et al., 2014; Necoara and Patrascu, 2018; Rasch and Chambolle, 2020), while the situation where only stochastic estimates of these operators are available has been studied in (Rosasco et al., 2019; Hardt et al., 2016; Patrascu and Necoara, 2018; Hermer et al., 2020). This latter setting, which is also of interest here, is very important in machine learning, where we have to minimize an expected objective function with or without constraints from random samples (Bhattacharyya et al., 2004), or in statistics, where we need to minimize a finite sum objective subject to functional constraints (Tibshirani, 2011).

Previous work. A very popular approach for minimizing an expected or finite sum objective function is the stochastic gradient descent (SGD) (Robbins and Monro, 1951; Nemirovski and Yudin, 1983; Hardt et al., 2016) or the stochastic proximal point (SPP) algorithms (Moulines and Bach, 2011; Nemirovski et al., 2009; Necoara, 2021; Patrascu and Necoara, 2018; Rosasco et al., 2019). In these studies sublinear convergence is derived for SGD or SPP with decreasing stepsizes under the assumptions that the objective function is smooth and (strongly) convex. For nonsmooth stochastic convex optimization one can recognize two main approaches. The first one uses stochastic variants of the subgradient method combined with different averaging techniques, see e.g. (Nemirovski et al., 2009; Yang and Lin, 2016). The second line of research is based on stochastic variants of the proximal gradient method under the assumption that the expected objective function is in composite form (Duchi and Singer, 2009; Necoara, 2021; Rosasco et al., 2019). For both approaches, using decreasing stepsizes, sublinear convergence rates are derived under the assumptions that the objective function is (strongly) convex. Even though SGD and SPP are well-developed methodologies, they only apply to problems with simple constraints, requiring the whole feasible set to be projectable.

In spite of its wide applicability, the study of efficient solution methods for optimization problems with many constraints is still limited. The most prominent line of research in this area is alternating projections, which focuses on applying random projections for solving problems that involve an intersection of a (possibly infinite) number of sets. For the case when the objective function is not present in the formulation, which corresponds to the convex feasibility problem, stochastic alternating projection algorithms were analyzed e.g. in (Bauschke and Borwein, 1996), with linear convergence rates, provided that the sets satisfy some linear regularity condition. Stochastic forward-backward algorithms have also been applied to solve optimization problems with many constraints. However, the papers introducing those general algorithms focus on proving only asymptotic convergence results without rates, or they assume that the number of constraints is finite, which is more restrictive than our setting, see e.g. (Bianchi et al., 2019; Xu, 2020; Wang et al., 2015). In the case where the number of constraints is finite and the objective function is deterministic, Nesterov's smoothing framework is studied in (Tran-Dinh et al., 2018) in the setting of accelerated proximal gradient methods. Incremental subgradient methods or primal-dual approaches were also proposed for solving problems with a finite intersection of simple sets through an exact penalty reformulation in (Bertsekas, 2011; Kundu et al., 2018).

The papers most related to our work are (Nedich, 2011; Nedich and Necoara, 2019), where subgradient methods with random feasibility steps are proposed for solving convex problems with deterministic objective and many functional constraints. However, the optimization problem, the algorithm and consequently the convergence analysis are different from the present paper. In particular, our algorithm is a stochastic proximal gradient extension of the algorithms proposed in (Nedich, 2011; Nedich and Necoara, 2019). Additionally, the stepsizes in (Nedich, 2011; Nedich and Necoara, 2019) are chosen decreasing, while in the present work for strongly convex objective functions we derive insightful stepsize-switching rules which describe when one should switch from a constant to a decreasing stepsize regime. Furthermore, in (Nedich, 2011) and (Nedich and Necoara, 2019) sublinear convergence rates are established either for convex or for strongly convex deterministic objective functions, respectively, while in this paper we prove (sub)linear rates for an expected composite objective function which is either convex or strongly convex. Moreover, (Nedich, 2011; Nedich and Necoara, 2019) present separately the convergence analysis for smooth and nonsmooth objectives, while in this paper we present a unified convergence analysis covering both cases through the so-called stochastic bounded gradient condition. Hence, since we deal with stochastic composite objective functions, smooth or nonsmooth, convex or strongly convex, and since we consider a stochastic proximal gradient step with new stepsize rules, our convergence analysis requires additional insights that differ from those of (Nedich, 2011; Nedich and Necoara, 2019).

In (Patrascu and Necoara, 2018) a stochastic optimization problem with an infinite intersection of sets is considered and stochastic proximal point steps are combined with alternating projections for solving it. However, in order to prove sublinear convergence rates, (Patrascu and Necoara, 2018) requires strongly convex and smooth objective functions, while our results are also valid for convex nonsmooth functions. Moreover, (Patrascu and Necoara, 2018) assumes the projectability of the individual sets, whereas in our case the constraints might not be projectable. Finally, in all these studies the nonsmooth component is assumed to be proximal, i.e. one can easily compute its proximal operator. This assumption is restrictive, since in many applications the nonsmooth term is also expressed as an expectation or a finite sum of nonsmooth functions, which individually are proximal, but it is hard to compute the proximal operator for the whole nonsmooth component, as e.g. in support vector machine or generalized lasso problems (Villa et al., 2014; Rosasco et al., 2019). Moreover, all the convergence results from the previous papers are derived separately for the smooth and the nonsmooth stochastic optimization problems.

Contributions. In this paper we remove the previous drawbacks. We propose a stochastic subgradient algorithm for solving general convex optimization problems having the objective function expressed in terms of an expectation operator over a sum of two terms, subject to a (possibly infinite) number of functional constraints. The only assumption we require is to have access to an unbiased estimate of the (sub)gradient and of the proximal operator of each of these two terms and to the subgradients of the constraints. To deal with such problems, we propose a stochastic subgradient method with random feasibility updates. At each iteration, the algorithm takes a stochastic subgradient step aimed at only minimizing the expected objective function, followed by a feasibility step for minimizing the feasibility violation of the observed random constraint, achieved through Polyak's subgradient iteration, see (Polyak, 1969). In doing so, we can avoid the need for projections onto the whole set of constraints, which may be computationally expensive. The proposed algorithm is applicable to situations where the whole objective function and/or constraint set are not known in advance, but are rather learned in time through observations.

We present a general framework for the convergence analysis of this stochastic subgradient algorithm, based on the assumptions that the objective function satisfies a stochastic bounded gradient condition, with or without strong convexity, and that the subgradients of the functional constraints are bounded. These assumptions include the most well-known classes of objective functions and of constraints analyzed in the literature: compositions of a nonsmooth function and a smooth function, with or without strong convexity, and nonsmooth Lipschitz functions, respectively. Moreover, when the objective function satisfies the strong convexity property, we prove insightful stepsize-switching rules which describe when one should switch from a constant to a decreasing stepsize regime. Then, we prove sublinear convergence rates for the weighted averages of the iterates in terms of expected distance to the constraint set, as well as in terms of expected optimality of the function values/distance to the optimal set. Under some special conditions we also derive linear rates. Our convergence estimates are known to be optimal for this class of stochastic subgradient schemes for solving (nonsmooth) convex problems with functional constraints. Besides providing a general framework for the design and analysis of stochastic subgradient methods, in special cases, where complexity bounds are known for some particular problems, our convergence results recover the existing bounds. In particular, for problems without functional constraints we recover the complexity bounds from (Necoara, 2021) and for linearly constrained least-squares problems we obtain similar convergence bounds to those in (Leventhal and Lewis, 2010).

Content. In Section 2 we present our optimization problem and the main assumptions. In Section 3 we design a stochastic subgradient projection algorithm and analyze its convergence properties, while in Section 4 we adapt this algorithm to constrained least-squares problems and derive linear convergence. Finally, in Section 5 detailed numerical simulations are provided that support the effectiveness of our method on real problems.

2 Problem formulation and assumptions

We consider the general convex composite optimization problem with expected objective function and functional constraints:

\begin{array}{rl}
F^{*}= & \min\limits_{x\in\mathcal{Y}\subseteq\mathbb{R}^{n}}\;F(x)\quad\left(:=\mathbb{E}[f(x,\zeta)+g(x,\zeta)]\right)\\[1mm]
 & \text{subject to }\;h(x,\xi)\leq 0\;\;\forall\xi\in\Omega_{2},
\end{array}    (1)

where the composite objective function $F$ has a stochastic representation in the form of an expectation w.r.t. a random variable $\zeta\in\Omega_{1}$, i.e., $\mathbb{E}[f(x,\zeta)+g(x,\zeta)]$, $\Omega_{2}$ is an arbitrary collection of indices and $\mathcal{Y}$ is a simple closed convex set. Hence, we separate the feasible set into two parts: one set, $\mathcal{Y}$, admits easy projections, while the other part is not easy to project onto, as it is described by the level sets of the convex functions $h(\cdot,\xi)$. Moreover, $f(\cdot,\zeta)$, $g(\cdot,\zeta)$ and $h(\cdot,\xi)$ are proper lower-semicontinuous convex functions containing the interior of $\mathcal{Y}$ in their domains. More commonly, one sees a single $g$ representing the regularizer on the parameters in the formulation (1). However, there are also applications where one encounters terms of the form $\mathbb{E}[g(x,\zeta)]$ or $\sum_{\xi\in\Omega_{2}}g(x,\xi)$, e.g., in Lasso problems with mixed $\ell_{1}$-$\ell_{2}$ regularizers or regularizers with overlapping groups, see e.g., (Villa et al., 2014). Multiple functional constraints, onto which it is difficult to project, can arise from robust classification in machine learning, chance constrained problems, min-max games and control, see (Bhattacharyya et al., 2004; Rockafellar and Uryasev, 2000; Patrascu and Necoara, 2018). For further use, we define the following functions: $F(x,\zeta)=f(x,\zeta)+g(x,\zeta)$, $f(x)=\mathbb{E}[f(x,\zeta)]$ and $g(x)=\mathbb{E}[g(x,\zeta)]$. Recall that $\mathcal{Y}$ is assumed to be simple, i.e., one can easily compute the projection of a point onto this set. Let us define the individual sets $\mathcal{X}_{\xi}=\left\{x\in\mathbb{R}^{n}:\;h(x,\xi)\leq 0\right\}$ for all $\xi\in\Omega_{2}$, and denote the feasible set of (1) by:

\mathcal{X}=\left\{x\in\mathcal{Y}:\;h(x,\xi)\leq 0\;\;\forall\xi\in\Omega_{2}\right\}=\mathcal{Y}\cap\left(\cap_{\xi\in\Omega_{2}}\mathcal{X}_{\xi}\right).

We assume $\mathcal{X}$ to be nonempty. We also assume that the optimization problem (1) has finite optimum and we let $F^{*}$ and $\mathcal{X}^{*}$ denote the optimal value and the optimal set, respectively:

F^{*}=\min_{x\in\mathcal{X}}F(x):=\mathbb{E}[F(x,\zeta)],\qquad\mathcal{X}^{*}=\{x\in\mathcal{X}\mid F(x)=F^{*}\}\neq\emptyset.

Further, for any $x\in\mathbb{R}^{n}$ we denote its projection onto the optimal set $\mathcal{X}^{*}$ by $\bar{x}$, that is:

\bar{x}=\Pi_{\mathcal{X}^{*}}(x).
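As one possible concrete instantiation of (1), anticipating the constrained least-squares setting of Section 4, one may take $\mathcal{Y}=\mathbb{R}^{n}$, $g\equiv 0$ and linear constraints (the data $(a_{\zeta},b_{\zeta})$ and $(c_{\xi},d_{\xi})$ below are generic placeholders used only for illustration):

\min_{x\in\mathbb{R}^{n}}\;\mathbb{E}\left[\tfrac{1}{2}(a_{\zeta}^{T}x-b_{\zeta})^{2}\right]\quad\text{subject to }\;c_{\xi}^{T}x-d_{\xi}\leq 0\;\;\forall\xi\in\Omega_{2}.

In this case each $\mathcal{X}_{\xi}$ is a halfspace and $h(x,\xi)=c_{\xi}^{T}x-d_{\xi}$ has (sub)gradient $\nabla h(x,\xi)=c_{\xi}$.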

For the objective function we assume that the first term $f(\cdot,\zeta)$ is either a differentiable or a nondifferentiable function and we use, with some abuse of notation, the same notation for the gradient or a subgradient of $f(\cdot,\zeta)$ at $x$, that is $\nabla f(x,\zeta)\in\partial f(x,\zeta)$, where the subdifferential $\partial f(x,\zeta)$ is either a singleton or a nonempty set for any $\zeta\in\Omega_{1}$. The other term $g(\cdot,\zeta)$ is assumed to have an easy proximal operator for any $\zeta\in\Omega_{1}$:

\operatorname{prox}_{\gamma g(\cdot,\zeta)}(x)=\arg\min\limits_{y\in\mathbb{R}^{n}}\;g(y,\zeta)+\frac{1}{2\gamma}\|y-x\|^{2}.
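For instance, when $g(x,\zeta)=\lambda\|x\|_{1}$ the proximal operator is the well-known soft-thresholding map; a minimal Python sketch is given below (the function name and interface are ours, chosen only for illustration):

```python
import numpy as np

def prox_l1(x, gamma, lam=1.0):
    """Proximal operator of g(x) = lam*||x||_1 with parameter gamma,
    i.e. componentwise soft-thresholding: a standard example of an 'easy' prox."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)
```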

Recall that the proximal operator of the indicator function of a closed convex set becomes the projection. We consider additionally the following assumptions on the objective function and constraints:

Assumption 1

The (sub)gradients of $F$ satisfy a stochastic bounded gradient condition, that is, there exist nonnegative constants $L\geq 0$ and $B\geq 0$ such that:

B^{2}+L(F(x)-F^{*})\geq\mathbb{E}[\|\nabla F(x,\zeta)\|^{2}]\quad\forall x\in\mathcal{Y}.    (2)

From Jensen's inequality, taking $x=x^{*}\in\mathcal{X}^{*}$ in (2), we get:

B^{2}\geq\mathbb{E}[\|\nabla F(x^{*},\zeta)\|^{2}]\geq\|\mathbb{E}[\nabla F(x^{*},\zeta)]\|^{2}=\|\nabla F(x^{*})\|^{2}\quad\forall x^{*}\in\mathcal{X}^{*}.    (3)

We also assume $F$ to satisfy a (strong) convexity condition:

Assumption 2

The function $F$ satisfies a (strong) convexity condition on $\mathcal{Y}$, i.e., there exists a nonnegative constant $\mu\geq 0$ such that:

F(y)\geq F(x)+\langle\nabla F(x),y-x\rangle+\frac{\mu}{2}\|y-x\|^{2}\quad\forall x,y\in\mathcal{Y}.    (4)

Note that when $\mu=0$ relation (4) states that $F$ is convex on $\mathcal{Y}$. Finally, we assume boundedness of the subgradients of the functional constraints:

Assumption 3

The functional constraints $h(\cdot,\xi)$ have bounded subgradients on $\mathcal{Y}$, i.e., there exists a constant $B_{h}>0$ such that for all $\nabla h(x,\xi)\in\partial h(x,\xi)$ we have:

\|\nabla h(x,\xi)\|\leq B_{h}\quad\forall x\in\mathcal{Y}\;\;\text{and}\;\;\xi\in\Omega_{2}.

Note that our assumptions are quite general and cover the most well-known classes of functions analyzed in the literature. In particular, Assumptions 1 and 2, related to the objective function, cover the class of nonsmooth Lipschitz functions and compositions of a (potentially) nonsmooth function and a smooth function, with or without strong convexity, while Assumption 3 covers nonsmooth Lipschitz constraints, as the following examples show.

Example 1 [Non-smooth (Lipschitz) functions satisfy Assumption 1]: Assume that the functions $f$ and $g$ have bounded (sub)gradients:

\|\nabla f(x,\zeta)\|\leq B_{f}\quad\text{and}\quad\|\nabla g(x,\zeta)\|\leq B_{g}\quad\forall x\in\mathcal{Y}.

Then, Assumption 1 obviously holds with $L=0$ and $B^{2}=2B_{f}^{2}+2B_{g}^{2}$.
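Indeed, this follows from the elementary inequality $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$ (a short verification added here for completeness):

\mathbb{E}[\|\nabla F(x,\zeta)\|^{2}]=\mathbb{E}[\|\nabla f(x,\zeta)+\nabla g(x,\zeta)\|^{2}]\leq 2\mathbb{E}[\|\nabla f(x,\zeta)\|^{2}]+2\mathbb{E}[\|\nabla g(x,\zeta)\|^{2}]\leq 2B_{f}^{2}+2B_{g}^{2}\quad\forall x\in\mathcal{Y}.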

Example 2 [Smooth (Lipschitz gradient) functions satisfy Assumption 1]: Condition (2) also includes the class of functions formed as a sum of two terms, $f(\cdot,\zeta)$ convex with Lipschitz continuous gradient and $g(\cdot,\zeta)$ convex with bounded subgradients, provided that $\mathcal{Y}$ is bounded. Indeed, let us assume that the convex function $f(\cdot,\zeta)$ has Lipschitz continuous gradient, i.e. there exists $L_{f}(\zeta)>0$ such that:

\|\nabla f(x,\zeta)-\nabla f(\bar{x},\zeta)\|\leq L_{f}(\zeta)\|x-\bar{x}\|\quad\forall x\in\mathcal{Y}.

Using standard arguments (see Theorem 2.1.5 in (Nesterov, 2018)), we have:

f(x,\zeta)-f(\bar{x},\zeta)\geq\langle\nabla f(\bar{x},\zeta),x-\bar{x}\rangle+\frac{1}{2L_{f}(\zeta)}\|\nabla f(x,\zeta)-\nabla f(\bar{x},\zeta)\|^{2}.

Since $g(\cdot,\zeta)$ is also convex, adding $g(x,\zeta)-g(\bar{x},\zeta)\geq\langle\nabla g(\bar{x},\zeta),x-\bar{x}\rangle$ to the previous inequality, where $\nabla g(\bar{x},\zeta)\in\partial g(\bar{x},\zeta)$, we get:

F(x,\zeta)-F(\bar{x},\zeta)\geq\langle\nabla F(\bar{x},\zeta),x-\bar{x}\rangle+\frac{1}{2L_{f}(\zeta)}\|\nabla f(x,\zeta)-\nabla f(\bar{x},\zeta)\|^{2},

where we used that $\nabla F(\bar{x},\zeta)=\nabla f(\bar{x},\zeta)+\nabla g(\bar{x},\zeta)\in\partial F(\bar{x},\zeta)$. Taking expectation w.r.t. $\zeta$, assuming that the set $\mathcal{Y}$ is bounded with diameter $D$ and that $L_{f}(\zeta)\leq L_{f}$ for all $\zeta$, and using the Cauchy-Schwarz inequality, we get:

F(x)-F^{*}\geq\frac{1}{2L_{f}}\mathbb{E}[\|\nabla f(x,\zeta)-\nabla f(\bar{x},\zeta)\|^{2}]-D\|\nabla F(\bar{x})\|\quad\forall x\in\mathcal{Y}.

Therefore, for any $\nabla g(x,\zeta)\in\partial g(x,\zeta)$, we have:

\begin{aligned}
\mathbb{E}[\|\nabla F(x,\zeta)\|^{2}]&=\mathbb{E}[\|\nabla f(x,\zeta)-\nabla f(\bar{x},\zeta)+\nabla g(x,\zeta)+\nabla f(\bar{x},\zeta)\|^{2}]\\
&\leq 2\mathbb{E}[\|\nabla f(x,\zeta)-\nabla f(\bar{x},\zeta)\|^{2}]+2\mathbb{E}[\|\nabla g(x,\zeta)+\nabla f(\bar{x},\zeta)\|^{2}]\\
&\leq 4L_{f}(F(x)-F^{*})+4L_{f}D\|\nabla F(\bar{x})\|+2\mathbb{E}[\|\nabla g(x,\zeta)+\nabla f(\bar{x},\zeta)\|^{2}].
\end{aligned}

Assuming now that the regularization function $g$ has bounded subgradients on $\mathcal{Y}$, i.e., $\|\nabla g(x,\zeta)\|\leq B_{g}$, the bounded gradient condition (2) holds on $\mathcal{Y}$ with:

L=4L_{f}\quad\text{and}\quad B^{2}=4\left(B_{g}^{2}+\max_{\bar{x}\in\mathcal{X}^{*}}\left(\mathbb{E}[\|\nabla f(\bar{x},\zeta)\|^{2}]+DL_{f}\|\nabla F(\bar{x})\|\right)\right).

Note that $B^{2}$ is finite, since the optimal set $\mathcal{X}^{*}$ is compact (recall that in this example $\mathcal{Y}$ is assumed bounded).

Finally, we also impose some regularity condition for the constraints.

Assumption 4

The functional constraints satisfy a regularity condition, i.e., there exists a constant $c>0$ such that:

{\rm dist}^{2}(x,\mathcal{X})\leq c\cdot\mathbb{E}\left[(h(x,\xi))_{+}^{2}\right]\quad\forall x\in\mathcal{Y}.    (5)

Note that this assumption holds, e.g., when the index set $\Omega_{2}$ is arbitrary and the feasible set $\mathcal{X}$ has an interior point, see (Polyak, 2001), or when the feasible set is polyhedral, see relation (24) below. However, Assumption 4 also holds for more general sets, e.g., when a strengthened Slater condition, such as the generalized Robinson condition, holds for the collection of convex functional constraints, as detailed in Corollary 3 of (Lewis and Pang, 1998).
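As a simple illustration (a single halfspace constraint with $\mathcal{Y}=\mathbb{R}^{n}$; the data $a\neq 0$ and $b$ are generic placeholders, not from the paper), take $h(x)=a^{T}x-b$, so that $\mathcal{X}=\{x:\;a^{T}x\leq b\}$ and

{\rm dist}(x,\mathcal{X})=\frac{(a^{T}x-b)_{+}}{\|a\|}\quad\Longrightarrow\quad{\rm dist}^{2}(x,\mathcal{X})=\frac{1}{\|a\|^{2}}(h(x))_{+}^{2},

i.e., (5) holds (with equality) with $c=1/\|a\|^{2}$; here the expectation in (5) is trivial since there is a single constraint.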

3 Stochastic subgradient projection algorithm

Given the iteration counter $k$, we consider independent random variables $\zeta_{k}$ and $\xi_{k}$ sampled from $\Omega_{1}$ and $\Omega_{2}$ according to probability distributions $\textbf{P}_{1}$ and $\textbf{P}_{2}$, respectively. Then, we define the following stochastic subgradient projection algorithm, where at each iteration we perform a stochastic proximal (sub)gradient step aimed at minimizing the expected composite objective function and then a subsequent subgradient step minimizing the feasibility violation of the observed random constraint (we use the convention $0/0=0$). (In this variant we have corrected some derivations from the journal version: Journal of Machine Learning Research, 23(265), 1-35, 2022.)

Algorithm 1 (SSP):
Choose $x_{0}\in\mathcal{Y}$ and stepsizes $\alpha_{k}>0$ and $\beta\in(0,2)$.
For $k\geq 0$ repeat: sample independently $\zeta_{k}\sim\textbf{P}_{1}$ and $\xi_{k}\sim\textbf{P}_{2}$ and update:
u_{k}=\operatorname{prox}_{\alpha_{k}g(\cdot,\zeta_{k})}\left(x_{k}-\alpha_{k}\nabla f(x_{k},\zeta_{k})\right),\quad v_{k}=\Pi_{\mathcal{Y}}(u_{k})    (6)
z_{k}=v_{k}-\beta\frac{(h(v_{k},\xi_{k}))_{+}}{\|\nabla h(v_{k},\xi_{k})\|^{2}}\nabla h(v_{k},\xi_{k})    (7)
x_{k+1}=\Pi_{\mathcal{Y}}(z_{k}).    (8)
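For concreteness, a minimal Python sketch of one SSP iteration (6)-(8) is given below. The oracle arguments (sampling routines, subgradient of $f$, prox of $g$, projection onto $\mathcal{Y}$, value and subgradient of $h$) are assumed to be supplied by the user; their names and signatures are illustrative, not part of the paper.

```python
import numpy as np

def ssp_step(x, alpha, beta, sample_zeta, sample_xi,
             grad_f, prox_g, proj_Y, h, grad_h):
    """One SSP iteration; all oracle callables are user supplied."""
    zeta, xi = sample_zeta(), sample_xi()   # independent samples from P1 and P2
    # (6): stochastic proximal (sub)gradient step, then projection onto Y
    u = prox_g(x - alpha * grad_f(x, zeta), alpha, zeta)
    v = proj_Y(u)
    # (7): Polyak subgradient step on the observed random constraint h(., xi) <= 0
    viol = max(h(v, xi), 0.0)
    if viol > 0.0:
        g = grad_h(v, xi)
        z = v - beta * (viol / np.dot(g, g)) * g
    else:
        z = v                               # convention 0/0 = 0: no feasibility step
    # (8): projection onto the simple set Y
    return proj_Y(z)
```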

Here, $\alpha_{k}>0$ and $\beta>0$ are deterministic stepsizes and $(x)_{+}=\max\{0,x\}$. Note that $v_{k}$ represents a stochastic proximal (sub)gradient step (or a stochastic forward-backward step) at $x_{k}$ for the expected composite objective function $F(x)=\mathbb{E}[f(x,\zeta)+g(x,\zeta)]$. The optimality step selects a random pair of functions $(f(x_{k},\zeta_{k}),g(x_{k},\zeta_{k}))$ from the composite objective function $F$ according to the probability distribution $\textbf{P}_{1}$, i.e., the index variable $\zeta_{k}$ takes values in the set $\Omega_{1}$. Also, the random feasibility step selects a random constraint $h(\cdot,\xi_{k})\leq 0$ from the collection of constraints according to the probability distribution $\textbf{P}_{2}$, independently of $\zeta_{k}$, i.e. the index variable $\xi_{k}$ takes values in the set $\Omega_{2}$. The vector $\nabla h(v_{k},\xi_{k})$ is chosen as:

\nabla h(v_{k},\xi_{k})=\begin{cases}\nabla h(v_{k},\xi_{k})\in\partial((h(v_{k},\xi_{k}))_{+})&\text{if }(h(v_{k},\xi_{k}))_{+}>0\\ s_{h}\neq 0&\text{if }(h(v_{k},\xi_{k}))_{+}=0,\end{cases}

where $s_{h}\in\mathbb{R}^{n}$ is any nonzero vector. If $(h(v_{k},\xi_{k}))_{+}=0$, then for any choice of nonzero $s_{h}$ we have $z_{k}=v_{k}$. Note that the feasibility step (7) has the special form of Polyak's subgradient iteration, see e.g., (Polyak, 1969). Moreover, when $\beta=1$, $z_{k}$ is the projection of $v_{k}$ onto the halfspace:

\mathcal{H}_{v_{k},\xi_{k}}=\{z:\;h(v_{k},\xi_{k})+\nabla h(v_{k},\xi_{k})^{T}(z-v_{k})\leq 0\},

that is $z_{k}=\Pi_{\mathcal{H}_{v_{k},\xi_{k}}}(v_{k})$ for $\beta=1$. Indeed, if $v_{k}\in\mathcal{H}_{v_{k},\xi_{k}}$, then $(h(v_{k},\xi_{k}))_{+}=0$ and the projection is the point itself, i.e., $z_{k}=v_{k}$. On the other hand, if $v_{k}\notin\mathcal{H}_{v_{k},\xi_{k}}$, then the projection of $v_{k}$ onto $\mathcal{H}_{v_{k},\xi_{k}}$ reduces to the projection onto its boundary hyperplane:

z_{k}=v_{k}-\frac{h(v_{k},\xi_{k})}{\|\nabla h(v_{k},\xi_{k})\|^{2}}\nabla h(v_{k},\xi_{k}).

Combining these two cases, we finally get our update (7). Note that when the feasible set of the optimization problem (1) is described by an (infinite) intersection of simple convex sets, see e.g. (Patrascu and Necoara, 2018; Bianchi et al., 2019; Xu, 2020; Wang et al., 2015):

\mathcal{X}=\mathcal{Y}\cap\left(\cap_{\xi\in\Omega_{2}}\mathcal{X}_{\xi}\right),

where each set $\mathcal{X}_{\xi}$ admits an easy projection, then one can choose the following functional representation in problem (1):

h(x,\xi)=(h(x,\xi))_{+}={\rm dist}(x,\mathcal{X}_{\xi})\quad\forall\xi\in\Omega_{2}.

One can easily notice that this function is convex, nonsmooth and has bounded subgradients, since we have (Mordukhovich and Nam, 2005):

\frac{x-\Pi_{\mathcal{X}_{\xi}}(x)}{\|x-\Pi_{\mathcal{X}_{\xi}}(x)\|}\in\partial h(x,\xi).

In this case, step (7) of the SSP algorithm becomes a usual random projection step:

z_{k}=v_{k}-\beta(v_{k}-\Pi_{\mathcal{X}_{\xi_{k}}}(v_{k})).

Hence, our formulation is more general than (Patrascu and Necoara, 2018; Bianchi et al., 2019; Xu, 2020; Wang et al., 2015), as it also allows us to deal with constraints that might not be projectable, but for which one can easily compute a subgradient of $h$. We also mention the possibility of performing the iterations in steps (6) and (7) of the SSP algorithm in parallel. However, here we do not consider this case. A thorough convergence analysis of minibatch iterations when the objective function is deterministic and strongly convex can be found in (Nedich and Necoara, 2019). For the analysis of the SSP algorithm, let us define the stochastic (sub)gradient mapping (for simplicity we omit its dependence on the stepsize $\alpha$):

\mathcal{S}(x,\zeta)=\alpha^{-1}\left(x-\operatorname{prox}_{\alpha g(\cdot,\zeta)}(x-\alpha\nabla f(x,\zeta))\right).

Then, it follows immediately that the stochastic proximal (sub)gradient step (6) can be written as:

u_{k}=x_{k}-\alpha_{k}\mathcal{S}(x_{k},\zeta_{k}).

Moreover, from the optimality condition of the prox operator it follows that there exists $\nabla g(u_{k},\zeta_{k})\in\partial g(u_{k},\zeta_{k})$ such that:

\mathcal{S}(x_{k},\zeta_{k})=\nabla f(x_{k},\zeta_{k})+\nabla g(u_{k},\zeta_{k}).

Let us also recall a basic property of the projection onto a closed convex set $\mathcal{X}\subseteq\mathbb{R}^{n}$, see e.g., (Nedich and Necoara, 2019):

\|\Pi_{\mathcal{X}}(v)-y\|^{2}\leq\|v-y\|^{2}-\|\Pi_{\mathcal{X}}(v)-v\|^{2}\qquad\forall v\in\mathbb{R}^{n}\;\text{and}\;y\in\mathcal{X}.    (9)

Define also the filtration:

\mathcal{F}_{[k]}=\{\zeta_{0},\dots,\zeta_{k},\;\xi_{0},\dots,\xi_{k}\}.

The next lemma provides a key descent property for the sequence $v_{k}$; the main tool in its proof is the stochastic (sub)gradient mapping $\mathcal{S}(\cdot)$.

Lemma 5

Let $f(\cdot,\zeta)$ and $g(\cdot,\zeta)$ be convex functions. Additionally, assume that the bounded gradient condition from Assumption 1 holds. Then, for any $k\geq 0$ and stepsize $\alpha_{k}>0$, we have the following recursion:

\mathbb{E}[\|v_{k}-\bar{v}_{k}\|^{2}]\leq\mathbb{E}[\|x_{k}-\bar{x}_{k}\|^{2}]-\alpha_{k}(2-\alpha_{k}L)\,\mathbb{E}[F(x_{k})-F(\bar{x}_{k})]+\alpha_{k}^{2}B^{2}.    (10)

Proof. Recalling that $\bar{v}_{k}=\Pi_{\mathcal{X}^{*}}(v_{k})$ and $\bar{x}_{k}=\Pi_{\mathcal{X}^{*}}(x_{k})$, from the definition of $u_{k}$ and using (9) for $y=\bar{x}_{k}\in\mathcal{X}^{*}\subseteq\mathcal{Y}$ and $v=v_{k}$, we get:

\begin{aligned}
\|v_{k}-\bar{v}_{k}\|^{2}&\overset{(9)}{\leq}\|v_{k}-\bar{x}_{k}\|^{2}=\|\Pi_{\mathcal{Y}}(u_{k})-\Pi_{\mathcal{Y}}(\bar{x}_{k})\|^{2}\leq\|u_{k}-\bar{x}_{k}\|^{2}=\|x_{k}-\bar{x}_{k}-\alpha_{k}\mathcal{S}(x_{k},\zeta_{k})\|^{2}\\
&=\|x_{k}-\bar{x}_{k}\|^{2}-2\alpha_{k}\langle\mathcal{S}(x_{k},\zeta_{k}),x_{k}-\bar{x}_{k}\rangle+\alpha_{k}^{2}\|\mathcal{S}(x_{k},\zeta_{k})\|^{2}\\
&=\|x_{k}-\bar{x}_{k}\|^{2}-2\alpha_{k}\langle\nabla f(x_{k},\zeta_{k})+\nabla g(u_{k},\zeta_{k}),x_{k}-\bar{x}_{k}\rangle+\alpha_{k}^{2}\|\mathcal{S}(x_{k},\zeta_{k})\|^{2}.
\end{aligned}    (11)

Now, we refine the second term. First, from convexity of $f$ we have:

\langle\nabla f(x_{k},\zeta_{k}),x_{k}-\bar{x}_{k}\rangle\geq f(x_{k},\zeta_{k})-f(\bar{x}_{k},\zeta_{k}).

Then, from convexity of $g(\cdot,\zeta)$ and the definition of the gradient mapping $\mathcal{S}(\cdot,\zeta)$, we have:

\begin{aligned}
\langle\nabla g(u_{k},\zeta_{k}),x_{k}-\bar{x}_{k}\rangle&=\langle\nabla g(u_{k},\zeta_{k}),x_{k}-u_{k}\rangle+\langle\nabla g(u_{k},\zeta_{k}),u_{k}-\bar{x}_{k}\rangle\\
&\geq\alpha_{k}\|\mathcal{S}(x_{k},\zeta_{k})\|^{2}-\alpha_{k}\langle\nabla f(x_{k},\zeta_{k}),\mathcal{S}(x_{k},\zeta_{k})\rangle+g(u_{k},\zeta_{k})-g(\bar{x}_{k},\zeta_{k})\\
&\geq\alpha_{k}\|\mathcal{S}(x_{k},\zeta_{k})\|^{2}-\alpha_{k}\langle\nabla f(x_{k},\zeta_{k})+\nabla g(x_{k},\zeta_{k}),\mathcal{S}(x_{k},\zeta_{k})\rangle+g(x_{k},\zeta_{k})-g(\bar{x}_{k},\zeta_{k}).
\end{aligned}

Replacing the previous two inequalities in (11), we obtain:

\begin{aligned}
\|v_{k}-\bar{v}_{k}\|^{2}&\leq\|x_{k}-\bar{x}_{k}\|^{2}-2\alpha_{k}\left(f(x_{k},\zeta_{k})+g(x_{k},\zeta_{k})-f(\bar{x}_{k},\zeta_{k})-g(\bar{x}_{k},\zeta_{k})\right)\\
&\quad+2\alpha_{k}^{2}\langle\nabla f(x_{k},\zeta_{k})+\nabla g(x_{k},\zeta_{k}),\mathcal{S}(x_{k},\zeta_{k})\rangle-\alpha_{k}^{2}\|\mathcal{S}(x_{k},\zeta_{k})\|^{2}.
\end{aligned}

Using that $2\langle u,v\rangle-\|v\|^{2}\leq\|u\|^{2}$ for all $u,v\in\mathbb{R}^{n}$, that $F(x_{k},\zeta_{k})=f(x_{k},\zeta_{k})+g(x_{k},\zeta_{k})$ and that $\nabla F(x_{k},\zeta_{k})=\nabla f(x_{k},\zeta_{k})+\nabla g(x_{k},\zeta_{k})\in\partial F(x_{k},\zeta_{k})$, we further get:

\|v_{k}-\bar{v}_{k}\|^{2}\leq\|x_{k}-\bar{x}_{k}\|^{2}-2\alpha_{k}\left(F(x_{k},\zeta_{k})-F(\bar{x}_{k},\zeta_{k})\right)+\alpha_{k}^{2}\|\nabla F(x_{k},\zeta_{k})\|^{2}.

Since $v_{k}$ depends on $\mathcal{F}_{[k-1]}\cup\{\zeta_{k}\}$, not on $\xi_{k}$, and the stepsize $\alpha_{k}$ does not depend on $(\zeta_{k},\xi_{k})$, from basic properties of the conditional expectation we have:

\begin{aligned}
\mathbb{E}_{\zeta_{k}}[\|v_{k}-\bar{v}_{k}\|^{2}\,|\,\mathcal{F}_{[k-1]}]&\leq\|x_{k}-\bar{x}_{k}\|^{2}-2\alpha_{k}\mathbb{E}_{\zeta_{k}}[F(x_{k},\zeta_{k})-F(\bar{x}_{k},\zeta_{k})\,|\,\mathcal{F}_{[k-1]}]+\alpha_{k}^{2}\mathbb{E}_{\zeta_{k}}[\|\nabla F(x_{k},\zeta_{k})\|^{2}\,|\,\mathcal{F}_{[k-1]}]\\
&=\|x_{k}-\bar{x}_{k}\|^{2}-2\alpha_{k}\left(F(x_{k})-F(\bar{x}_{k})\right)+\alpha_{k}^{2}\mathbb{E}_{\zeta_{k}}[\|\nabla F(x_{k},\zeta_{k})\|^{2}\,|\,\mathcal{F}_{[k-1]}]\\
&\leq\|x_{k}-\bar{x}_{k}\|^{2}-2\alpha_{k}\left(F(x_{k})-F(\bar{x}_{k})\right)+\alpha_{k}^{2}\left(B^{2}+L\left(F(x_{k})-F(\bar{x}_{k})\right)\right)\\
&=\|x_{k}-\bar{x}_{k}\|^{2}-\alpha_{k}(2-\alpha_{k}L)\left(F(x_{k})-F(\bar{x}_{k})\right)+\alpha_{k}^{2}B^{2},
\end{aligned}

where in the last inequality we used $x_{k}\in\mathcal{Y}$ and the stochastic bounded gradient inequality from Assumption 1. Now, taking expectation w.r.t. $\mathcal{F}_{[k-1]}$ we get:

\mathbb{E}[\|v_{k}-\bar{v}_{k}\|^{2}]\leq\mathbb{E}[\|x_{k}-\bar{x}_{k}\|^{2}]-\alpha_{k}(2-\alpha_{k}L)\,\mathbb{E}[F(x_{k})-F(\bar{x}_{k})]+\alpha_{k}^{2}B^{2},

which concludes our statement.  

The next lemma gives a relation between $x_{k}$ and $v_{k-1}$ (see also (Nedich and Necoara, 2019)).

Lemma 6

Let the functions $h(\cdot,\xi)$ be convex and let Assumption 3 hold. Then, for any $y\in\mathcal{Y}$ such that $(h(y,\xi_{k-1}))_{+}=0$, the following relation holds:

\|x_{k}-y\|^{2}\leq\|v_{k-1}-y\|^{2}-\beta(2-\beta)\left[\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{B^{2}_{h}}\right].

Proof. Consider any $y\in\mathcal{Y}$ such that $(h(y,\xi_{k-1}))_{+}=0$. Then, using the nonexpansiveness of the projection and the definition of $z_{k-1}$, we have:

\begin{aligned}
\|x_{k}-y\|^{2}&=\|\Pi_{\mathcal{Y}}(z_{k-1})-y\|^{2}\leq\|z_{k-1}-y\|^{2}\\
&=\Big\|v_{k-1}-y-\beta\frac{(h(v_{k-1},\xi_{k-1}))_{+}}{\|\nabla h(v_{k-1},\xi_{k-1})\|^{2}}\nabla h(v_{k-1},\xi_{k-1})\Big\|^{2}\\
&=\|v_{k-1}-y\|^{2}+\beta^{2}\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{\|\nabla h(v_{k-1},\xi_{k-1})\|^{2}}-2\beta\frac{(h(v_{k-1},\xi_{k-1}))_{+}}{\|\nabla h(v_{k-1},\xi_{k-1})\|^{2}}\langle v_{k-1}-y,\nabla h(v_{k-1},\xi_{k-1})\rangle\\
&\leq\|v_{k-1}-y\|^{2}+\beta^{2}\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{\|\nabla h(v_{k-1},\xi_{k-1})\|^{2}}-2\beta\frac{(h(v_{k-1},\xi_{k-1}))_{+}}{\|\nabla h(v_{k-1},\xi_{k-1})\|^{2}}(h(v_{k-1},\xi_{k-1}))_{+},
\end{aligned}

where the last inequality follows from the convexity of $(h(\cdot,\xi))_{+}$ and our assumption that $(h(y,\xi_{k-1}))_{+}=0$. After rearranging the terms and using that $v_{k-1}\in\mathcal{Y}$, we get:

\begin{aligned}
\|x_{k}-y\|^{2}&\leq\|v_{k-1}-y\|^{2}-\beta(2-\beta)\left[\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{\|\nabla h(v_{k-1},\xi_{k-1})\|^{2}}\right]\\
&\overset{\text{Assumption 3}}{\leq}\|v_{k-1}-y\|^{2}-\beta(2-\beta)\left[\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{B_{h}^{2}}\right],
\end{aligned}

which concludes our statement.  
In the next sections, based on the previous two lemmas, we derive convergence rates for the SSP algorithm depending on the (convexity) properties of $f(\cdot,\zeta)$.

3.1 Convergence analysis: convex objective function

In this section we consider that the functions $f(\cdot,\zeta)$, $g(\cdot,\zeta)$ and $h(\cdot,\xi)$ are convex. First, note that we can always take $cB_{h}^{2}>1$ (we can choose $c$ sufficiently large such that this relation holds), see also (Nedich and Necoara, 2019). For simplicity of the exposition let us introduce the following notation:

C_{\beta,c,B_{h}}:=\frac{\beta(2-\beta)}{cB^{2}_{h}}\left(1-\frac{\beta(2-\beta)}{cB^{2}_{h}}\right)^{-1}>0.

We impose the following conditions on the stepsize $\alpha_{k}$:

0<\alpha_{k}\leq\alpha_{k}(2-\alpha_{k}L)<1\;\;\iff\;\;\alpha_{k}\in\begin{cases}\left(0,\frac{1}{2}\right)&\text{if }L=0\\[1mm]\left(0,\frac{1-\sqrt{(1-L)_{+}}}{L}\right)&\text{if }L>0.\end{cases}    (12)

Then, we can define the following average sequence generated by the algorithm SSP:

\hat{x}_{k}=\frac{\sum_{j=1}^{k}\alpha_{j}(2-\alpha_{j}L)x_{j}}{S_{k}},\quad\text{where}\;\;S_{k}=\sum_{j=1}^{k}\alpha_{j}(2-\alpha_{j}L).

Note that this type of average sequence is also considered in (Garrigos et al., 2023) for unconstrained stochastic optimization problems. The next theorem derives sublinear convergence rates for the average sequence $\hat{x}_{k}$.

Theorem 7

Let $f(\cdot,\zeta)$, $g(\cdot,\zeta)$ and $h(\cdot,\xi)$ be convex functions. Additionally, let Assumptions 1, 3 and 4 hold. Further, choose the stepsize sequence $\alpha_{k}$ as in (12) and the stepsize $\beta\in(0,2)$. Then, we have the following estimates for the average sequence $\hat{x}_{k}$ in terms of optimality and feasibility violation for problem (1):

\mathbb{E}\left[F(\hat{x}_{k})-F^{*}\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{S_{k}}+\frac{B^{2}\sum_{j=1}^{k}\alpha_{j}^{2}}{S_{k}},

\mathbb{E}\left[{\rm dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{C_{\beta,c,B_{h}}\cdot S_{k}}+\frac{B^{2}\sum_{j=1}^{k}\alpha_{j}^{2}}{C_{\beta,c,B_{h}}\cdot S_{k}}.

Proof. Recall that from Lemma 5 we have:

\mathbb{E}\left[\|v_{k}-\bar{v}_{k}\|^{2}\right]\leq\mathbb{E}\left[\|x_{k}-\bar{x}_{k}\|^{2}\right]-\alpha_{k}(2-\alpha_{k}L)\,\mathbb{E}\left[F(x_{k})-F(\bar{x}_{k})\right]+\alpha_{k}^{2}B^{2}.    (13)

Now, for $y=\bar{v}_{k-1}\in\mathcal{X}^{*}\subseteq\mathcal{X}\subseteq\mathcal{Y}$ we have that $(h(\bar{v}_{k-1},\xi_{k-1}))_{+}=0$, and thus using Lemma 6, we get:

\|x_{k}-\bar{x}_{k}\|^{2}\overset{(9)}{\leq}\|x_{k}-\bar{v}_{k-1}\|^{2}\leq\|v_{k-1}-\bar{v}_{k-1}\|^{2}-\beta(2-\beta)\left[\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{B^{2}_{h}}\right].

Taking conditional expectation w.r.t. $\xi_{k-1}$ given $\mathcal{F}_{[k-1]}$, we get:

\begin{aligned}
\mathbb{E}_{\xi_{k-1}}[\|x_{k}-\bar{x}_{k}\|^{2}\,|\,\mathcal{F}_{[k-1]}]&\leq\|v_{k-1}-\bar{v}_{k-1}\|^{2}-\beta(2-\beta)\,\mathbb{E}_{\xi_{k-1}}\left[\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{B^{2}_{h}}\,\Big|\,\mathcal{F}_{[k-1]}\right]\\
&\overset{(5)}{\leq}\|v_{k-1}-\bar{v}_{k-1}\|^{2}-\frac{\beta(2-\beta)}{cB^{2}_{h}}{\rm dist}^{2}(v_{k-1},\mathcal{X}).
\end{aligned}

Taking now full expectation, we obtain:

\mathbb{E}[\|x_{k}-\bar{x}_{k}\|^{2}]\leq\mathbb{E}[\|v_{k-1}-\bar{v}_{k-1}\|^{2}]-\frac{\beta(2-\beta)}{cB^{2}_{h}}\mathbb{E}[{\rm dist}^{2}(v_{k-1},\mathcal{X})],    (14)

and using this relation in (13), we get:

\mathbb{E}\left[\|v_{k}-\bar{v}_{k}\|^{2}\right]+\frac{\beta(2-\beta)}{cB^{2}_{h}}\mathbb{E}\left[{\rm dist}^{2}(v_{k-1},\mathcal{X})\right]+\alpha_{k}(2-\alpha_{k}L)\mathbb{E}\left[F(x_{k})-F(\bar{x}_{k})\right]\leq\mathbb{E}[\|v_{k-1}-\bar{v}_{k-1}\|^{2}]+\alpha_{k}^{2}B^{2}.    (15)

Similarly, for $y=\Pi_{\mathcal{X}}(v_{k-1})\in\mathcal{X}\subseteq\mathcal{Y}$ we have that $(h(\Pi_{\mathcal{X}}(v_{k-1}),\xi_{k-1}))_{+}=0$, and thus using again Lemma 6, we obtain:

\begin{aligned}
{\rm dist}^{2}(x_{k},\mathcal{X})&=\|x_{k}-\Pi_{\mathcal{X}}(x_{k})\|^{2}\leq\|x_{k}-\Pi_{\mathcal{X}}(v_{k-1})\|^{2}\\
&\leq{\rm dist}^{2}(v_{k-1},\mathcal{X})-\beta(2-\beta)\frac{(h(v_{k-1},\xi_{k-1}))_{+}^{2}}{B^{2}_{h}}.
\end{aligned}

Taking conditional expectation w.r.t. $\xi_{k-1}$ given $\mathcal{F}_{[k-1]}$, we get:

\begin{aligned}
\mathbb{E}_{\xi_{k-1}}\left[{\rm dist}^{2}(x_{k},\mathcal{X})\,|\,\mathcal{F}_{[k-1]}\right]&\leq{\rm dist}^{2}(v_{k-1},\mathcal{X})-\frac{\beta(2-\beta)}{B^{2}_{h}}\mathbb{E}_{\xi_{k-1}}\left[(h(v_{k-1},\xi_{k-1}))_{+}^{2}\,|\,\mathcal{F}_{[k-1]}\right]\\
&\overset{(5)}{\leq}\left(1-\frac{\beta(2-\beta)}{cB^{2}_{h}}\right){\rm dist}^{2}(v_{k-1},\mathcal{X}).
\end{aligned}

After taking full expectation, we get:

\mathbb{E}\left[{\rm dist}^{2}(x_{k},\mathcal{X})\right]\leq\left(1-\frac{\beta(2-\beta)}{cB^{2}_{h}}\right)\mathbb{E}\left[{\rm dist}^{2}(v_{k-1},\mathcal{X})\right].    (16)

Using (16) in (15), we obtain:

\mathbb{E}\left[\|v_{k}-\bar{v}_{k}\|^{2}\right]+\frac{\beta(2-\beta)}{cB^{2}_{h}}\left(1-\frac{\beta(2-\beta)}{cB^{2}_{h}}\right)^{-1}\mathbb{E}\left[{\rm dist}^{2}(x_{k},\mathcal{X})\right]+\alpha_{k}(2-\alpha_{k}L)\mathbb{E}\left[F(x_{k})-F(\bar{x}_{k})\right]\leq\mathbb{E}\left[\|v_{k-1}-\bar{v}_{k-1}\|^{2}\right]+\alpha_{k}^{2}B^{2}.

Since $\alpha_{k}$ satisfies (12), we have $\alpha_{k}(2-\alpha_{k}L)\leq 1$, and we further get the following recurrence:

\mathbb{E}\left[\|v_{k}-\bar{v}_{k}\|^{2}\right]+C_{\beta,c,B_{h}}\alpha_{k}(2-\alpha_{k}L)\mathbb{E}\left[{\rm dist}^{2}(x_{k},\mathcal{X})\right]+\alpha_{k}(2-\alpha_{k}L)\mathbb{E}\left[F(x_{k})-F(\bar{x}_{k})\right]\leq\mathbb{E}\left[\|v_{k-1}-\bar{v}_{k-1}\|^{2}\right]+\alpha_{k}^{2}B^{2}.    (17)

Summing (17) from $1$ to $k$, we get:

\mathbb{E}\left[\|v_{k}-\bar{v}_{k}\|^{2}\right]+C_{\beta,c,B_{h}}\sum_{j=1}^{k}\alpha_{j}(2-\alpha_{j}L)\mathbb{E}\left[{\rm dist}^{2}(x_{j},\mathcal{X})\right]+\sum_{j=1}^{k}\alpha_{j}(2-\alpha_{j}L)\mathbb{E}\left[F(x_{j})-F^{*}\right]\leq\|v_{0}-\bar{v}_{0}\|^{2}+B^{2}\sum_{j=1}^{k}\alpha_{j}^{2}.

Using the definition of the average sequence $\hat{x}_{k}$ and the convexity of $F$ and of ${\rm dist}^{2}(\cdot,\mathcal{X})$, we further get sublinear rates in expectation for the average sequence in terms of optimality:

\mathbb{E}\left[F(\hat{x}_{k})-F^{*}\right]\leq\sum_{j=1}^{k}\frac{\alpha_{j}(2-\alpha_{j}L)}{S_{k}}\mathbb{E}\left[F(x_{j})-F^{*}\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{S_{k}}+B^{2}\frac{\sum_{j=1}^{k}\alpha_{j}^{2}}{S_{k}},

and feasibility violation:

\begin{aligned}
C_{\beta,c,B_{h}}\mathbb{E}\left[{\rm dist}^{2}(\hat{x}_{k},\mathcal{X})\right]&\leq C_{\beta,c,B_{h}}\sum_{j=1}^{k}\frac{\alpha_{j}(2-\alpha_{j}L)}{S_{k}}\mathbb{E}\left[{\rm dist}^{2}(x_{j},\mathcal{X})\right]\\
&\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{S_{k}}+B^{2}\frac{\sum_{j=1}^{k}\alpha_{j}^{2}}{S_{k}}.
\end{aligned}

These conclude our statements.  

Note that for the stepsize $\alpha_{k}=\frac{\alpha_{0}}{(k+1)^{\gamma}}$, with $\gamma\in[1/2,1)$ and $\alpha_{0}$ satisfying (12), we have:

S_{k}\overset{(12)}{\geq}\sum_{j=1}^{k}\alpha_{j}\geq{\cal O}(k^{1-\gamma})\quad\text{and}\quad\sum_{j=1}^{k}\alpha_{j}^{2}\leq\begin{cases}{\cal O}(1)&\text{if }\gamma>1/2\\{\cal O}(\ln(k))&\text{if }\gamma=1/2.\end{cases}

Consequently, for $\gamma\in(1/2,1)$ we obtain from Theorem 7 the following sublinear convergence rates:

\mathbb{E}\left[F(\hat{x}_{k})-F^{*}\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{{\cal O}(k^{1-\gamma})}+\frac{B^{2}{\cal O}(1)}{{\cal O}(k^{1-\gamma})},

\mathbb{E}\left[{\rm dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\frac{\|v_{0}-\bar{v}_{0}\|^{2}}{C_{\beta,c,B_{h}}\cdot{\cal O}(k^{1-\gamma})}+\frac{B^{2}{\cal O}(1)}{C_{\beta,c,B_{h}}\cdot{\cal O}(k^{1-\gamma})}.

For the particular choice $\gamma=1/2$, we get similar rates as above, just replacing ${\cal O}(1)$ with ${\cal O}(\ln(k))$. However, if we neglect the logarithmic terms, we get sublinear convergence rates of order:

\mathbb{E}\left[F(\hat{x}_{k})-F^{*}\right]\leq{\cal O}\left(\frac{1}{k^{1/2}}\right)\quad\text{and}\quad\mathbb{E}\left[{\rm dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq{\cal O}\left(\frac{1}{k^{1/2}}\right).

It is important to note that when $B=0$, improved rates can be derived from Theorem 7 for the SSP algorithm in the convex case. More precisely, for the stepsize $\alpha_{k}=\frac{\alpha_{0}}{(k+1)^{\gamma}}$, with $\gamma\in[0,1)$ and $\alpha_{0}$ satisfying (12), we obtain convergence rates for $\hat{x}_{k}$ in optimality and feasibility violation of order ${\cal O}\left(\frac{1}{k^{1-\gamma}}\right)$. In particular, for $\gamma=0$ (i.e., constant stepsize $\alpha_{k}=\alpha_{0}\in\left(0,\min(\frac{1}{2},\frac{1}{L})\right)$ for all $k\geq 0$) the previous convergence estimates yield rates of order ${\cal O}\left(\frac{1}{k}\right)$. To the best of our knowledge, these rates are new for stochastic subgradient methods applied to the class of optimization problems (1).
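To illustrate how the stepsize and averaging choices above might be combined in practice, here is a small Python sketch (the interface is ours and purely illustrative; the callable step is assumed to wrap one SSP iteration (6)-(8), e.g. the ssp_step sketch given after Algorithm 1 with the oracles and $\beta$ fixed):

```python
import numpy as np

def run_ssp_with_averaging(x0, n_iters, alpha0, gamma, L, step):
    """Run n_iters iterations of a user-supplied SSP step and return the
    weighted average hat{x}_k with weights alpha_j*(2 - alpha_j*L), as in the
    average sequence used in Theorem 7. `step(x, alpha)` performs one
    iteration (6)-(8) of Algorithm 1 (illustrative interface)."""
    x = np.asarray(x0, dtype=float)
    x_avg = np.zeros_like(x)
    S = 0.0
    for k in range(1, n_iters + 1):
        alpha_k = alpha0 / (k + 1) ** gamma     # diminishing stepsize O(1/k^gamma)
        x = step(x, alpha_k)                    # one SSP iteration with stepsize alpha_k
        w = alpha_k * (2.0 - alpha_k * L)       # weight of the k-th iterate
        S += w
        x_avg += w * (x - x_avg) / S            # running weighted average hat{x}_k
    return x_avg
```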

3.2 Convergence analysis: strongly convex objective function

In this section we additionally assume the strong convexity inequality from Assumption 2 with $\mu>0$. The next lemma derives an improved recurrence relation for the sequence $v_{k}$ under strong convexity. Furthermore, due to the strong convexity assumption on $F$, problem (1) has a unique optimum, denoted $x^{*}$. Our proofs in this section are different from the ones in (Nedich and Necoara, 2019), since here we consider a more general type of bounded gradient condition, i.e., Assumption 1.

Lemma 8

Let $f(\cdot,\zeta)$, $g(\cdot,\zeta)$ and $h(\cdot,\xi)$ be convex functions. Additionally, let Assumptions 1-4 hold, with $\mu>0$. Define $k_{0}=\lfloor\frac{8L}{\mu}-1\rfloor$, $\beta\in(0,2)$, $\theta_{L,\mu}=1-\mu/(4L)$ and $\alpha_{k}=\frac{4}{\mu}\gamma_{k}$, where $\gamma_{k}$ is given by:

\gamma_{k}=\begin{cases}\frac{\mu}{4L}&\text{if }k\leq k_{0}\\[1mm]\frac{2}{k+1}&\text{if }k>k_{0}.\end{cases}

Then, the iterates of the SSP algorithm satisfy the following recurrences:

\mathbb{E}[\|v_{k_{0}}-x^{*}\|^{2}]\leq\begin{cases}\frac{B^{2}}{L^{2}}&\text{if }\theta_{L,\mu}\leq 0\\[1mm]\theta_{L,\mu}^{k_{0}}\|v_{0}-x^{*}\|^{2}+\frac{1-\theta_{L,\mu}^{k_{0}}}{1-\theta_{L,\mu}}\left(1+\frac{5}{2C_{\beta,c,B_{h}}\theta_{L,\mu}}\right)\frac{B^{2}}{L^{2}}&\text{if }\theta_{L,\mu}>0,\end{cases}

\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+\gamma_{k}\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\frac{C_{\beta,c,B_{h}}}{6}\mathbb{E}[{\rm dist}^{2}(x_{k},\mathcal{X})]\leq\left(1-\gamma_{k}\right)\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]+\left(1+\frac{6}{C_{\beta,c,B_{h}}}\right)\frac{16B^{2}}{\mu^{2}}\gamma_{k}^{2}\qquad\forall k>k_{0}.

Proof. One can easily see that our stepsize can be written equivalently as $\alpha_{k}=\min\left(\frac{1}{L},\frac{8}{\mu(k+1)}\right)$ (for $L=0$ we use the convention $1/L=\infty$). Using Assumption 2 in Lemma 5, we get:

\begin{aligned}
\mathbb{E}[\|v_{k}-x^{*}\|^{2}]&\leq\mathbb{E}[\|x_{k}-x^{*}\|^{2}]-\alpha_{k}(2-\alpha_{k}L)\mathbb{E}[F(x_{k})-F^{*}]+\alpha_{k}^{2}B^{2}\\
&\overset{\text{Assumption 2}}{\leq}\mathbb{E}[\|x_{k}-x^{*}\|^{2}]-\alpha_{k}(2-\alpha_{k}L)\mathbb{E}\left[\langle\nabla F(x^{*}),x_{k}-x^{*}\rangle+\frac{\mu}{2}\|x_{k}-x^{*}\|^{2}\right]+\alpha_{k}^{2}B^{2}\\
&\leq\left(1-\frac{\mu\alpha_{k}}{2}\right)\mathbb{E}[\|x_{k}-x^{*}\|^{2}]-\alpha_{k}(2-\alpha_{k}L)\mathbb{E}\left[\langle\nabla F(x^{*}),x_{k}-\Pi_{\mathcal{X}}(x_{k})\rangle\right]+\alpha_{k}^{2}B^{2}\\
&\leq\left(1-\frac{\mu\alpha_{k}}{2}\right)\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\frac{\eta}{2}\mathbb{E}[\|x_{k}-\Pi_{\mathcal{X}}(x_{k})\|^{2}]+\frac{\alpha^{2}_{k}(2-\alpha_{k}L)^{2}}{2\eta}\mathbb{E}\left[\|\nabla F(x^{*})\|^{2}\right]+\alpha_{k}^{2}B^{2}\\
&\overset{(3)}{\leq}\left(1-\frac{\mu\alpha_{k}}{4}\right)\mathbb{E}[\|x_{k}-x^{*}\|^{2}]-\frac{\mu\alpha_{k}}{4}\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\frac{\eta}{2}\mathbb{E}[\|x_{k}-\Pi_{\mathcal{X}}(x_{k})\|^{2}]+\left(1+\frac{2}{\eta}\right)\alpha_{k}^{2}B^{2},
\end{aligned}    (18)

where the third inequality uses the property of the stepsize (if $\alpha_{k}\leq\frac{1}{L}$, then $2-\alpha_{k}L\geq 1$) together with the optimality condition ($\langle\nabla F(x^{*}),\Pi_{\mathcal{X}}(x_{k})-x^{*}\rangle\geq 0$ for all $k$), the fourth inequality uses the inequality $\langle a,b\rangle\geq-\frac{1}{2\eta}\|a\|^{2}-\frac{\eta}{2}\|b\|^{2}$ for any $a,b\in\mathbb{R}^{n}$, $\eta>0$, and the final inequality uses the fact that $(2-\alpha_{k}L)^{2}\leq 4$. From (14), we also have:

\begin{aligned}
\mathbb{E}[\|x_{k}-x^{*}\|^{2}]&\overset{(14)}{\leq}\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]-\beta(2-\beta)\frac{\mathbb{E}[{\rm dist}^{2}(v_{k-1},\mathcal{X})]}{cB^{2}_{h}}\\
&\overset{(16)}{\leq}\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]-\frac{\beta(2-\beta)}{cB_{h}^{2}}\left(1-\frac{\beta(2-\beta)}{cB_{h}^{2}}\right)^{-1}\mathbb{E}[{\rm dist}^{2}(x_{k},\mathcal{X})]\\
&=\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]-C_{\beta,c,B_{h}}\mathbb{E}[{\rm dist}^{2}(x_{k},\mathcal{X})].
\end{aligned}    (19)

Taking $\eta=C_{\beta,c,B_{h}}\left(1-\frac{\mu\alpha_{k}}{4}\right)>0$ and combining (18) with (19), we obtain:

\begin{aligned}
\mathbb{E}[\|v_{k}-x^{*}\|^{2}]\leq&\left(1-\frac{\mu\alpha_{k}}{4}\right)\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]-\frac{1}{2}C_{\beta,c,B_{h}}\left(1-\frac{\mu\alpha_{k}}{4}\right)\mathbb{E}[{\rm dist}^{2}(x_{k},\mathcal{X})]\\
&-\frac{\mu\alpha_{k}}{4}\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\left(1+\frac{2}{C_{\beta,c,B_{h}}\left(1-\frac{\mu\alpha_{k}}{4}\right)}\right)\alpha_{k}^{2}B^{2}.
\end{aligned}    (20)

For $L=0$ we have $k_{0}=0$. For $L>0$, we have $k_{0}>0$ and for any $k\leq k_{0}$ the stepsize is $\alpha_{k}=\frac{1}{L}$. Hence, from (20), we obtain for any $k\leq k_{0}$:

\begin{aligned}
\mathbb{E}[\|v_{k}-x^{*}\|^{2}]&\leq\left(1-\frac{\mu}{4L}\right)\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]+\left(1+\frac{2}{C_{\beta,c,B_{h}}\left(1-\frac{\mu}{4L}\right)}\right)\frac{B^{2}}{L^{2}}\\
&\leq\max\left(\left(1-\frac{\mu}{4L}\right)\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]+\left(1+\frac{2}{C_{\beta,c,B_{h}}\left(1-\frac{\mu}{4L}\right)}\right)\frac{B^{2}}{L^{2}},\;\frac{B^{2}}{L^{2}}\right).
\end{aligned}

Using the geometric sum formula and recalling that $\theta_{L,\mu}=1-\mu/(4L)$, we obtain the first statement. Further, for $k>k_{0}$, from relation (20), we have:

\begin{aligned}
&\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+\frac{\mu\alpha_{k}}{4}\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\left(1-\frac{\mu\alpha_{k}}{4}\right)\frac{C_{\beta,c,B_{h}}}{2}\mathbb{E}[{\rm dist}^{2}(x_{k},\mathcal{X})]\\
&\qquad\leq\left(1-\frac{\mu\alpha_{k}}{4}\right)\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]+\left(1+\frac{2}{C_{\beta,c,B_{h}}\left(1-\frac{\mu\alpha_{k}}{4}\right)}\right)\alpha_{k}^{2}B^{2}.
\end{aligned}

Since k>k0=8Lμ1k>{\color[rgb]{0,0,0}k_{0}=\lfloor{\frac{8L}{\mu}}-1\rfloor} and αk=4μγk\alpha_{k}=\frac{4}{\mu}\gamma_{k}, we get:

𝔼[vkx2]+γk𝔼[xkx2]+(1γk)Cβ,c,Bh2𝔼[dist2(xk,𝒳)]\displaystyle\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+\gamma_{k}\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\left(1-\gamma_{k}\right)\frac{C_{\beta,c,B_{h}}}{2}\mathbb{E}[{\rm dist}^{2}(x_{k},\mathcal{X})]
(1γk)𝔼[vk1x2]+(1+2Cβ,c,Bh(1γk))16μ2γk2B2k>k0.\displaystyle\qquad\qquad\qquad\leq\left(1-\gamma_{k}\right)\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]+\left(1+\frac{2}{C_{\beta,c,B_{h}}\left(1-\gamma_{k}\right)}\right)\frac{16}{\mu^{2}}\gamma_{k}^{2}B^{2}\quad\forall k>k_{0}.

Note that in this case \gamma_{k}=2/(k+1) is a decreasing sequence, and thus we have:

1γk=k1k+113k2.\displaystyle 1-\gamma_{k}=\frac{k-1}{k+1}\geq\frac{1}{3}\quad\forall k\geq 2.

Using this bound in the previous recurrence, we also get the second statement.  

Now, we are ready to derive sublinear rates under a strong convexity condition on the objective function. Let us define for k\geq k_{0}+1 the sum:

S¯k=j=k0+1k(j+1)2𝒪(k3k03),\displaystyle\bar{S}_{k}=\sum_{j=k_{0}+1}^{k}(j+1)^{2}\sim\mathcal{O}(k^{3}-k_{0}^{3}),

and the corresponding average sequences:

x^k=j=k0+1k(j+1)2xjS¯k,andw^k=j=k0+1k(j+1)2Π𝒳(xj)S¯k𝒳.\displaystyle\hat{x}_{k}=\frac{\sum_{j=k_{0}+1}^{k}(j+1)^{2}x_{j}}{\bar{S}_{k}},\quad\text{and}\quad\hat{w}_{k}=\frac{\sum_{j=k_{0}+1}^{k}(j+1)^{2}\Pi_{\mathcal{X}}(x_{j})}{\bar{S}_{k}}\in\mathcal{X}.
Theorem 9

Let f(\cdot,\zeta), g(\cdot,\zeta) and h(\cdot,\xi) be convex functions and let Assumptions 1–4 hold with \mu>0. Further, consider the stepsize \alpha_{k}=\min\left(\frac{1}{L},\frac{8}{\mu(k+1)}\right) and \beta\in\left(0,2\right). Then, for k>k_{0}, where k_{0}=\lfloor{\frac{8L}{\mu}}-1\rfloor, we have the following sublinear convergence rates for the average sequence \hat{x}_{k} in terms of optimality and feasibility violation for problem (1) (we keep only the dominant terms):

𝔼[x^kx2]𝒪(B2μ2Cβ,c,Bh(k+1)),\displaystyle\mathbb{E}[\|\hat{x}_{k}-x^{*}\|^{2}]\leq\mathcal{O}\left(\frac{B^{2}}{\mu^{2}C_{\beta,c,B_{h}}(k+1)}\right),
𝔼[dist2(x^k,𝒳)]𝒪(B2μ2Cβ,c,Bh2(k+1)2).\displaystyle\mathbb{E}\left[{\rm dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\mathcal{O}\left(\frac{B^{2}}{\mu^{2}C_{\beta,c,B_{h}}^{2}(k+1)^{2}}\right).

Proof  From Lemma 8, the following recurrence is valid for any k>k0k>k_{0}:

𝔼[vkx2]+γk𝔼[xkx2]+Cβ,c,Bh6𝔼[dist2(xk,𝒳)]\displaystyle\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+\gamma_{k}\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\frac{C_{\beta,c,B_{h}}}{6}\mathbb{E}[{\rm dist}^{2}(x_{k},\mathcal{X})]
(1γk)𝔼[vk1x2]+(1+6Cβ,c,Bh)16B2μ2γk2.\displaystyle\qquad\qquad\qquad\leq\left(1-\gamma_{k}\right)\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]+\left(1+\frac{6}{C_{\beta,c,B_{h}}}\right)\frac{16B^{2}}{\mu^{2}}\gamma_{k}^{2}.

Using the definition \gamma_{k}=\frac{2}{k+1} and multiplying the whole inequality by (k+1)^{2}, we get:

(k+1)2𝔼[vkx2]+2(k+1)𝔼[xkx2]+Cβ,c,Bh6(k+1)2𝔼[dist2(xk,𝒳)]\displaystyle(k+1)^{2}\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+2(k+1)\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\frac{C_{\beta,c,B_{h}}}{6}(k+1)^{2}\mathbb{E}[{\rm dist}^{2}(x_{k},\mathcal{X})]
k2𝔼[vk1x2]+(1+6Cβ,c,Bh)64μ2B2.\displaystyle\leq k^{2}\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}]+\left(1+\frac{6}{C_{\beta,c,B_{h}}}\right)\frac{64}{\mu^{2}}B^{2}.

Summing this inequality from k0+1k_{0}+1 to kk, we get:

(k+1)2𝔼[vkx2]+2j=k0+1k(j+1)𝔼[xjx2]\displaystyle{(k+1)^{2}}\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+2\sum_{j=k_{0}+1}^{k}(j+1)\mathbb{E}[\|x_{j}-x^{*}\|^{2}]
+Cβ,c,Bh6j=k0+1k(j+1)2𝔼[dist2(xj,𝒳)]\displaystyle+\frac{C_{\beta,c,B_{h}}}{6}\sum_{j=k_{0}+1}^{k}(j+1)^{2}\mathbb{E}[{\rm dist}^{2}(x_{j},\mathcal{X})]
(k0+1)2𝔼[vk0x2]+(1+6Cβ,c,Bh)64B2μ2(kk0).\displaystyle\leq(k_{0}+1)^{2}\mathbb{E}[\|v_{k_{0}}-x^{*}\|^{2}]+\left(1+\frac{6}{C_{\beta,c,B_{h}}}\right)\frac{64B^{2}}{\mu^{2}}(k-k_{0}).

By linearity of the expectation operator and since (j+1)\geq (j+1)^{2}/(k+1) for all j\leq k, we further have:

\displaystyle{(k+1)^{2}}\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+\frac{2}{(k+1)}\mathbb{E}\left[\sum_{j=k_{0}+1}^{k}(j+1)^{2}\|x_{j}-x^{*}\|^{2}\right]
\displaystyle+\frac{C_{\beta,c,B_{h}}}{6}\mathbb{E}\left[\sum_{j=k_{0}+1}^{k}(j+1)^{2}{\rm dist}^{2}(x_{j},\mathcal{X})\right] (21)
\displaystyle\leq(k_{0}+1)^{2}\mathbb{E}[\|v_{k_{0}}-x^{*}\|^{2}]+\left(1+\frac{6}{C_{\beta,c,B_{h}}}\right)\frac{64B^{2}}{\mu^{2}}(k-k_{0}),

and using convexity of the squared norm (Jensen's inequality), we get:

(k+1)2𝔼[vkx2]+2S¯k(k+1)𝔼[x^kx2]+S¯kCβ,c,Bh6𝔼[w^kx^k2]\displaystyle{(k+1)^{2}}\mathbb{E}[\|v_{k}-x^{*}\|^{2}]+\frac{2\bar{S}_{k}}{(k+1)}\mathbb{E}[\|\hat{x}_{k}-x^{*}\|^{2}]+\frac{\bar{S}_{k}C_{\beta,c,B_{h}}}{6}\mathbb{E}[\|\hat{w}_{k}-\hat{x}_{k}\|^{2}]
(k0+1)2𝔼[vk0x2]+(1+6Cβ,c,Bh)64B2μ2(kk0),\displaystyle\leq(k_{0}+1)^{2}\mathbb{E}[\|v_{k_{0}}-x^{*}\|^{2}]+\left(1+\frac{6}{C_{\beta,c,B_{h}}}\right)\frac{64B^{2}}{\mu^{2}}(k-k_{0}),

After some simple calculations and keeping only the dominant terms, we get the following bounds for the average sequence \hat{x}_{k}:

𝔼[x^kx2]𝒪(B2μ2Cβ,c,Bh(k+1)),\displaystyle\mathbb{E}[\|\hat{x}_{k}-x^{*}\|^{2}]\leq\mathcal{O}\left(\frac{B^{2}}{\mu^{2}C_{\beta,c,B_{h}}(k+1)}\right),
𝔼[w^kx^k2]𝒪(B2μ2Cβ,c,Bh2(k+1)2).\displaystyle\mathbb{E}[\|\hat{w}_{k}-\hat{x}_{k}\|^{2}]\leq\mathcal{O}\left(\frac{B^{2}}{\mu^{2}C_{\beta,c,B_{h}}^{2}(k+1)^{2}}\right).

Since w^k𝒳\hat{w}_{k}\in\mathcal{X}, we get the following convergence rate for the average sequence x^k\hat{x}_{k} in terms of feasibility violation:

𝔼[dist2(x^k,𝒳)]\displaystyle\mathbb{E}[{\rm dist}^{2}(\hat{x}_{k},\mathcal{X})] 𝔼[w^kx^k2]𝒪(B2μ2Cβ,c,Bh2(k+1)2).\displaystyle\leq\mathbb{E}[\|\hat{w}_{k}-\hat{x}_{k}\|^{2}]\leq\mathcal{O}\left(\frac{B^{2}}{\mu^{2}C_{\beta,c,B_{h}}^{2}(k+1)^{2}}\right).

This proves our statements.  
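On the implementation side, note that the weighted average \hat{x}_{k} need not be formed by storing all iterates; it can be updated recursively with one vector operation per iteration (the sequence \hat{w}_{k} is only a tool of the analysis, since \Pi_{\mathcal{X}} is in general not computable). A minimal Python sketch of this update (the function name and interface are our illustration; the paper's experiments were implemented in Matlab):

```python
import numpy as np

def update_weighted_average(x_hat, S_bar, x_k, k):
    """One step of the running weighted average
    x_hat_k = (sum_j (j+1)^2 x_j) / S_bar_k, with S_bar_k = sum_j (j+1)^2."""
    w = (k + 1) ** 2
    S_bar_new = S_bar + w
    x_hat_new = x_hat + (w / S_bar_new) * (np.asarray(x_k) - x_hat)
    return x_hat_new, S_bar_new
```

Starting from x_hat = 0 and S_bar = 0 and applying the update for j = k_{0}+1, ..., k reproduces \hat{x}_{k}.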

Recall the expression of Cβ,c,BhC_{\beta,c,B_{h}}:

Cβ,c,Bh=(β(2β)cBh2)(1β(2β)cBh2)1=(β(2β)cBh2β(2β)).\displaystyle C_{\beta,c,B_{h}}=\left(\frac{\beta(2-\beta)}{cB_{h}^{2}}\right)\left(1-\frac{\beta(2-\beta)}{cB_{h}^{2}}\right)^{-1}=\left(\frac{\beta(2-\beta)}{cB_{h}^{2}-\beta(2-\beta)}\right).

For the particular choice of the stepsize β=1\beta=1, we have:

C1,c,Bh=(1cBh21)>0,\displaystyle C_{1,c,B_{h}}=\left(\frac{1}{cB_{h}^{2}-1}\right)>0,

since we always have cBh2>1cB_{h}^{2}>1. Using this expression in the convergence rates of Theorem 9, we obtain:

𝔼[dist2(x^k,𝒳)]𝒪(B2(cBh21)2μ2(k+1)2),\displaystyle\mathbb{E}\left[{\rm dist}^{2}(\hat{x}_{k},\mathcal{X})\right]\leq\mathcal{O}\left(\frac{B^{2}(cB_{h}^{2}-1)^{2}}{\mu^{2}(k+1)^{2}}\right),
𝔼[x^kx2]𝒪(B2(cBh21)μ2(k+1)).\displaystyle\mathbb{E}\left[\|\hat{x}_{k}-x^{*}\|^{2}\right]\leq\mathcal{O}\left(\frac{B^{2}(cB_{h}^{2}-1)}{\mu^{2}(k+1)}\right).

We can easily notice from Lemma 8 that for B=0 we can obtain better convergence rates. More specifically, in this particular case, taking a constant stepsize yields linear rates for the last iterate x_{k} in terms of optimality and feasibility violation. We state this result in the next corollary.

Corollary 10

Under the assumptions of Theorem 9, with B=0, the last iterate x_{k} generated by the SSP algorithm with constant stepsize \alpha_{k}\equiv\alpha<\min(1/L,4/\mu) converges linearly in terms of optimality and feasibility violation.

Proof  When B=0 and the stepsize satisfies \alpha_{k}=\alpha<\min(1/L,4/\mu), from (20), we obtain:

𝔼[vkx2]\displaystyle\mathbb{E}[\|v_{k}-x^{*}\|^{2}] +μα4𝔼[xkx2]+12Cβ,c,Bh(1μα4)𝔼[dist2(xk,𝒳)]\displaystyle+\frac{\mu\alpha}{4}\mathbb{E}[\|x_{k}-x^{*}\|^{2}]+\frac{1}{2}C_{\beta,c,B_{h}}\left(1-\frac{\mu\alpha}{4}\right)\mathbb{E}[{\rm dist}^{2}(x_{k},\mathcal{X})]
(1μα4)𝔼[vk1x2].\displaystyle\leq\left(1-\frac{\mu\alpha}{4}\right)\mathbb{E}[\|v_{k-1}-x^{*}\|^{2}].

Then, since 1μα/4(0,1)1-\mu\alpha/4\in(0,1), we get immediately that:

𝔼[xkx2](1μα4)k4μαv0x2,\displaystyle\mathbb{E}[\|x_{k}-x^{*}\|^{2}]\leq\left(1-\frac{\mu\alpha}{4}\right)^{k}\frac{4}{\mu\alpha}\|v_{0}-x^{*}\|^{2},
12Cβ,c,Bh𝔼[dist2(xk,𝒳)](1μα4)k1v0x2,\displaystyle\frac{1}{2}C_{\beta,c,B_{h}}\mathbb{E}[{\rm dist}^{2}(x_{k},\mathcal{X})]\leq\left(1-\frac{\mu\alpha}{4}\right)^{k-1}\|v_{0}-x^{*}\|^{2},

which proves our statements.  

Note that in (Necoara, 2021) it was proved that stochastic first-order methods converge linearly on optimization problems of the form (1) without functional constraints that satisfy Assumption 1 with B=0. This paper extends that result to a stochastic subgradient projection method for optimization problems with functional constraints (1) satisfying Assumption 1 with B=0. To the best of our knowledge, these convergence rates are new for stochastic subgradient projection methods applied to the general class of optimization problems (1). In Section 4 we provide an example of an optimization problem with functional constraints, namely constrained linear least-squares, which satisfies Assumption 1 with B=0.

4 Stochastic subgradient for constrained least-squares

In this section we consider the problem of finding a solution to a system of linear equalities and inequalities, see also equation (11) in (Leventhal and Lewis, 2010):

findx𝒴:Ax=b,Cxd,\displaystyle\text{find}\;x\in\mathcal{Y}:\;Ax=b,\;Cx\leq d, (22)

where Am×nA\in\mathbb{R}^{m\times n}, Cp×nC\in\mathbb{R}^{p\times n} and 𝒴\mathcal{Y} is a simple polyhedral set. We assume that this system is consistent, i.e. it has at least one solution. This problem can be reformulated equivalently as a particular case of the optimization problem with functional constraints (1):

minx𝒴f(x)(:=12𝔼[AζTxbζ2])\displaystyle\min_{x\in\mathcal{Y}}\;f(x)\;\;\left(:=\frac{1}{2}\mathbb{E}\left[\|A_{\zeta}^{T}x-b_{\zeta}\|^{2}\right]\right) (23)
subject toCξTxdξ0ξΩ2,\displaystyle\text{subject to}\;\;C_{\xi}^{T}x-d_{\xi}\leq 0\quad\forall\xi\in\Omega_{2},

where A_{\zeta}^{T} and C_{\xi}^{T} are (block) row partitions of the matrices A and C, respectively. Clearly, problem (23) is a particular case of problem (1), with f(x,\zeta)=\frac{1}{2}\|A_{\zeta}^{T}x-b_{\zeta}\|^{2}, g(x,\zeta)=0, and h(x,\xi)=C_{\xi}^{T}x-d_{\xi} (provided that C_{\xi} is a row of C). Let us define the polyhedral sets \mathcal{C}_{\xi}=\{x\in\mathbb{R}^{n}:C_{\xi}^{T}x-d_{\xi}\leq 0\} and \mathcal{A}_{\zeta}=\{x\in\mathbb{R}^{n}:A_{\zeta}^{T}x-b_{\zeta}=0\}. In this case the feasible set is the polyhedron \mathcal{X}=\{x\in\mathcal{Y}:\;Cx\leq d\} and the optimal set is the polyhedron:

𝒳={x𝒴:Ax=b,Cxd}=𝒴ζΩ1𝒜ζξΩ2𝒞ξ.\mathcal{X}^{*}=\{x\in\mathcal{Y}:\;Ax=b,\;Cx\leq d\}=\mathcal{Y}\cap_{\zeta\in\Omega_{1}}\mathcal{A}_{\zeta}\cap_{\xi\in\Omega_{2}}\mathcal{C}_{\xi}.

Note that for the particular problem (23) Assumption 1 holds with B=0B=0 and e.g. L=2maxζAζ2L=2\max_{\zeta}\|A_{\zeta}\|^{2}, since f=0f^{*}=0 and we have:

𝔼[f(x,ζ)2]=𝔼[Aζ(AζTxbζ)2](2maxζAζ2)(12𝔼[AζTxbζ2])=Lf(x).\mathbb{E}[\|\nabla f(x,\zeta)\|^{2}]=\mathbb{E}[\|A_{\zeta}(A_{\zeta}^{T}x-b_{\zeta})\|^{2}]\leq(2\max_{\zeta}\|A_{\zeta}\|^{2})\left(\frac{1}{2}\mathbb{E}\left[\|A_{\zeta}^{T}x-b_{\zeta}\|^{2}\right]\right)=L\,f(x).

It is also obvious that Assumption 3 holds, since the functional constraints are linear. Moreover, for the constrained least-squares problem, we replace Assumptions 2 and 4 with the well-known Hoffman property of a polyhedral set, see Example 3 from Section 2 and also (Pena et al.,, 2021; Leventhal and Lewis,, 2010; Necoara et al.,, 2019):

dist2(u,𝒳)c𝔼[dist2(u,𝒜ζ)+dist2(u,𝒞ξ)]u𝒴,\displaystyle{\rm dist}^{2}(u,\mathcal{X}^{*})\leq c\cdot\mathbb{E}\left[{\rm dist}^{2}(u,\mathcal{A}_{\zeta})+{\rm dist}^{2}(u,\mathcal{C}_{\xi})\right]\quad\forall u\in\mathcal{Y}, (24)

for some c(0,)c\in(0,\infty). Recall that the Hoffman condition (24) always holds for nonempty polyhedral sets (Pena et al.,, 2021). For the constrained least-squares problem the SSP algorithm becomes:

Algorithm 2 (SSP-LS):
Choose x_{0}\in\mathcal{Y}, stepsizes \alpha_{k}>0 and \beta\in(0,2).
For k\geq 0 repeat:
Sample independently \zeta_{k}\sim\textbf{P}_{1} and \xi_{k}\sim\textbf{P}_{2} and update:
v_{k}=x_{k}-\alpha_{k}A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})
z_{k}=(1-\beta)v_{k}+\beta\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})
x_{k+1}=\Pi_{\mathcal{Y}}(z_{k}).

Note that the update for v_{k} coincides with step (6) in SSP for f(x,\zeta)=\frac{1}{2}\|A_{\zeta}^{T}x-b_{\zeta}\|^{2} and g=0. In contrast to the previous section, however, here we consider an adaptive stepsize:

αk=δAζkTxkbζk2Aζk(AζkTxkbζk)2,whereδ(0,2).\displaystyle\alpha_{k}=\delta\frac{\|A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}\|^{2}}{\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}},\;\;\text{where}\;\;\delta\in(0,2).

Note that when CξC_{\xi} is a row of CC, then zkz_{k} has the explicit expression:

zk=vkβ(CξkTvkdξk)+Cξk2Cξk,z_{k}=v_{k}-\beta\frac{(C_{\xi_{k}}^{T}v_{k}-d_{\xi_{k}})_{+}}{\|C_{\xi_{k}}\|^{2}}C_{\xi_{k}},

which coincides with step (7) in SSP for h(x,\xi)=C_{\xi}^{T}x-d_{\xi}. Note that we can use, e.g., probability distributions that depend on the (block) rows of the matrices A and C:

P1(ζ=ζk)=AζkF2AF2andP2(ξ=ξk)=CξkF2CF2,\displaystyle\textbf{P}_{1}(\zeta=\zeta_{k})=\frac{\|A_{\zeta_{k}}\|^{2}_{F}}{\|{A}\|_{F}^{2}}\;\;\text{and}\;\;\textbf{P}_{2}(\xi=\xi_{k})=\frac{\|C_{\xi_{k}}\|^{2}_{F}}{\|{C}\|_{F}^{2}},

where \|\cdot\|_{F} denotes the Frobenius norm of a matrix. Note that our algorithm SSP-LS differs from Algorithm 4.6 in (Leventhal and Lewis, 2010) in the choice of the stepsize \alpha_{k}, of the sampling rules and of the update law for x_{k}, and it is more general since it allows working with blocks of rows of the matrices A and C. Moreover, SSP-LS recovers the classical Kaczmarz method when solving linear systems of equalities. In the next section we derive linear convergence rates for the SSP-LS algorithm, provided that the system of equalities and inequalities is consistent, i.e. \mathcal{X}^{*} is nonempty.
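To make the updates above concrete, the following is a minimal NumPy sketch of SSP-LS with single-row sampling and the row-norm probabilities given above (the function name ssp_ls, the generic projection proj_Y and the placement of the stopping test are our illustrative choices; the paper's experiments were implemented in Matlab and may also sample blocks of rows):

```python
import numpy as np

def ssp_ls(A, b, C, d, proj_Y=lambda x: x, delta=1.0, beta=1.0,
           max_epochs=100, tol=1e-3, rng=None):
    """Sketch of SSP-LS with single-row sampling.
    Rows of A (equalities) and C (inequalities) are sampled with
    probabilities proportional to their squared Euclidean norms."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    p = C.shape[0]
    pA = np.sum(A**2, axis=1); pA = pA / pA.sum()
    pC = np.sum(C**2, axis=1); pC = pC / pC.sum()
    x = proj_Y(np.zeros(n))
    for _ in range(max_epochs * (m + p)):
        # full residual check every iteration, only for simplicity of the sketch
        if max(np.linalg.norm(A @ x - b),
               np.linalg.norm(np.maximum(C @ x - d, 0.0))) <= tol:
            break
        i = rng.choice(m, p=pA)                 # sample an equality row
        r = A[i] @ x - b[i]
        # adaptive stepsize: for a single row it reduces to delta / ||A_i||^2
        v = x - delta * r / (A[i] @ A[i]) * A[i]
        j = rng.choice(p, p=pC)                 # sample an inequality row
        viol = max(C[j] @ v - d[j], 0.0)
        z = v - beta * viol / (C[j] @ C[j]) * C[j]
        x = proj_Y(z)
    return x
```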

4.1 Linear convergence

In this section we prove linear convergence for the sequence generated by the SSP-LS algorithm for solving the constrained least-squares problem (23). Let us define the maximum block condition number over all the submatrices A_{\zeta}:

κblock=maxζP1AζT(AζT),\kappa_{\text{block}}=\max_{\zeta\sim\textbf{P}_{1}}\|A_{\zeta}^{T}\|\cdot\|(A_{\zeta}^{T})^{\dagger}\|,

where (AζT)(A_{\zeta}^{T})^{\dagger} denotes the pseudoinverse of AζTA_{\zeta}^{T}. Note that if AζTA_{\zeta}^{T} has full rank, then (AζT)=Aζ(AζTAζ)1(A_{\zeta}^{T})^{\dagger}=A_{\zeta}(A_{\zeta}^{T}A_{\zeta})^{-1}. Then, we have the following result.

Theorem 11

Assume that the polyhedral set 𝒳={x𝒴:Ax=b,Cxd}\mathcal{X}^{*}=\{x\in\mathcal{Y}:\;Ax=b,\;Cx\leq d\} is nonempty. Then, we have the following linear rate of convergence for the sequence xkx_{k} generated by the SSP-LS algorithm:

𝔼[dist2(xk,𝒳)](11cmin(δ(2δ)2κblock2,2δ4δ,β(2β)2))kdist2(x0,𝒳).\displaystyle\mathbb{E}\left[{{\rm dist}^{2}(x_{k},\mathcal{X}^{*})}\right]\leq\left(1-\frac{1}{c}\min\left(\frac{\delta(2-\delta)}{2\kappa_{\text{block}}^{2}},\frac{2-\delta}{4\delta},\frac{\beta(2-\beta)}{2}\right)\right)^{k}{\rm dist}^{2}(x_{0},\mathcal{X}^{*}).

Proof  From the updates of the sequences x_{k+1}, z_{k} and v_{k} in the SSP-LS algorithm, we have:

xk+1x¯k+12xk+1x¯k2=Π𝒴(zk)Π𝒴(x¯k)2zkx¯k2\displaystyle\|x_{k+1}-\bar{x}_{k+1}\|^{2}\leq\|x_{k+1}-\bar{x}_{k}\|^{2}=\|\Pi_{\mathcal{Y}}(z_{k})-\Pi_{\mathcal{Y}}(\bar{x}_{k})\|^{2}\leq\|z_{k}-\bar{x}_{k}\|^{2}
=vkx¯k+β(Π𝒞ξk(vk)vk)2\displaystyle=\|v_{k}-\bar{x}_{k}+\beta(\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k})\|^{2}
=vkx¯k2+β2Π𝒞ξk(vk)vk2+2βvkx¯k,Π𝒞ξk(vk)vk\displaystyle=\|v_{k}-\bar{x}_{k}\|^{2}+\beta^{2}\|\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\|^{2}+2\beta\langle v_{k}-\bar{x}_{k},\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\rangle
=xkx¯k2+αk2Aζk(AζkTxkbζk)22αkxkx¯k,Aζk(AζkTxkbζk)\displaystyle=\|x_{k}-\bar{x}_{k}\|^{2}+\alpha_{k}^{2}\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}-2\alpha_{k}\langle x_{k}-\bar{x}_{k},A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\rangle
+β2Π𝒞ξk(vk)vk2+2βvkx¯k,Π𝒞ξk(vk)vk.\displaystyle\qquad+\beta^{2}\|\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\|^{2}+2\beta\langle v_{k}-\bar{x}_{k},\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\rangle.

Using the definition of αk\alpha_{k} and that AζkT(xkx¯k)=AζkTxkbζkA_{\zeta_{k}}^{T}(x_{k}-\bar{x}_{k})=A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}, we further get:

xk+1x¯k+12\displaystyle\|x_{k+1}-\bar{x}_{k+1}\|^{2} xkx¯k2δ(2δ)AζkTxkbζk4Aζk(AζkTxkbζk)2+β2Π𝒞ξk(vk)vk2\displaystyle\leq\|x_{k}-\bar{x}_{k}\|^{2}-\delta(2-\delta)\frac{\|A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}\|^{4}}{\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}}+\beta^{2}\|\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\|^{2}
+2βvkx¯k,Π𝒞ξk(vk)vk\displaystyle\quad+2\beta\langle v_{k}-\bar{x}_{k},\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\rangle
=xkx¯k2δ(2δ)AζkTxkbζk4Aζk(AζkTxkbζk)2+β2Π𝒞ξk(vk)vk2\displaystyle=\|x_{k}-\bar{x}_{k}\|^{2}-\delta(2-\delta)\frac{\|A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}\|^{4}}{\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}}+\beta^{2}\|\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\|^{2}
2βΠ𝒞ξk(vk)vk,Π𝒞ξk(vk)vk+2βΠ𝒞ξk(vk)x¯k,Π𝒞ξk(vk)vk.\displaystyle\quad-2\beta\langle\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k},\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\rangle+2\beta\langle\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-\bar{x}_{k},\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\rangle.

From the optimality condition of the projection we always have Π𝒞ξk(vk)z,Π𝒞ξk(vk)vk0\langle\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-z,\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\rangle\leq 0 for all z𝒞ξkz\in\mathcal{C}_{\xi_{k}}. Taking z=x¯k𝒳𝒞ξkz=\bar{x}_{k}\in\mathcal{X}^{*}\subseteq\mathcal{C}_{\xi_{k}} in the previous relation, we finally get:

xk+1x¯k+12\displaystyle\|x_{k+1}-\bar{x}_{k+1}\|^{2} (25)
xkx¯k2δ(2δ)AζkTxkbζk4Aζk(AζkTxkbζk)2β(2β)Π𝒞ξk(vk)vk2.\displaystyle\leq\|x_{k}-\bar{x}_{k}\|^{2}-\delta(2-\delta)\frac{\|A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}\|^{4}}{\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}}-\beta(2-\beta)\|\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\|^{2}.

From the definition of vkv_{k} and αk\alpha_{k}, we have:

vk=xkαkAζk(AζkTxkbζk)\displaystyle v_{k}=x_{k}-\alpha_{k}A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}) αk2Aζk(AζkTxkbζk)2=vkxk2\displaystyle\iff\alpha_{k}^{2}\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}=\|v_{k}-x_{k}\|^{2}
AζkTxkbζk4Aζk(AζkTxkbζk)2=1δ2vkxk2.\displaystyle\iff\frac{\|A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}\|^{4}}{\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}}=\frac{1}{\delta^{2}}\|v_{k}-x_{k}\|^{2}.

Also, from the definition of zkz_{k}, we have:

Π𝒞ξk(vk)vk2=1β2zkvk2.\displaystyle\|\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})-v_{k}\|^{2}=\frac{1}{\beta^{2}}\|z_{k}-v_{k}\|^{2}. (26)

Now, replacing these two relations in (25), we get:

xk+1x¯k+12\displaystyle\|x_{k+1}-\bar{x}_{k+1}\|^{2}
xkx¯k2δ(2δ)δ2vkxk2β(2β)β2zkvk2\displaystyle\quad\leq\|x_{k}-\bar{x}_{k}\|^{2}-\frac{\delta(2-\delta)}{\delta^{2}}\|v_{k}-x_{k}\|^{2}-\frac{\beta(2-\beta)}{\beta^{2}}\|z_{k}-v_{k}\|^{2}
xkx¯k2δ(2δ)2κblock2κblock2δ2vkxk2\displaystyle\quad\leq\|x_{k}-\bar{x}_{k}\|^{2}-\frac{\delta(2-\delta)}{2\kappa_{\text{block}}^{2}}\frac{\kappa_{\text{block}}^{2}}{\delta^{2}}\|v_{k}-x_{k}\|^{2}
min(δ(2δ)4δ2,β(2β)2)(2vkxk2+2β2zkvk2).\displaystyle\qquad-\min\left(\frac{\delta(2-\delta)}{4\delta^{2}},\frac{\beta(2-\beta)}{2}\right)\left(2\|v_{k}-x_{k}\|^{2}+\frac{2}{\beta^{2}}\|z_{k}-v_{k}\|^{2}\right). (27)

First, let us consider the subset 𝒞ξk\mathcal{C}_{\xi_{k}}. Then, we have:

dist2(xk,𝒞ξk)\displaystyle{\rm dist}^{2}(x_{k},\mathcal{C}_{\xi_{k}}) =xkΠ𝒞ξk(xk)2xkΠ𝒞ξk(vk)22xkvk2+2vkΠ𝒞ξk(vk)2\displaystyle=\|x_{k}-\Pi_{\mathcal{C}_{\xi_{k}}}(x_{k})\|^{2}\leq\|x_{k}-\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})\|^{2}\leq 2\|x_{k}-v_{k}\|^{2}+2\|v_{k}-\Pi_{\mathcal{C}_{\xi_{k}}}(v_{k})\|^{2}
(26)2xkvk2+2β2vkzk2.\displaystyle\overset{(\ref{def:zk})}{\leq}2\|x_{k}-v_{k}\|^{2}+\frac{2}{\beta^{2}}\|v_{k}-z_{k}\|^{2}.

Second, let us consider the subset 𝒜ζk\mathcal{A}_{\zeta_{k}}. Since the corresponding AζkTA_{\zeta_{k}}^{T} represents a block of rows of matrix AA, the update for vkv_{k} in SSP-LS can be written as:

vk=(1δ)xk+δTζk(xk),v_{k}=(1-\delta)x_{k}+\delta T_{\zeta_{k}}(x_{k}),

where the operator TζkT_{\zeta_{k}} is given by the following expression

Tζk(xk)=xkAζkTxkbζk2Aζk(AζkTxkbζk)2Aζk(AζkTxkbζk).T_{\zeta_{k}}(x_{k})=x_{k}-\frac{\|A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}\|^{2}}{\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}}A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}).

Further, the projection of xkx_{k} onto the subset 𝒜ζk\mathcal{A}_{\zeta_{k}} is, see e.g., (Horn and Johnson,, 2012):

\Pi_{\mathcal{A}_{\zeta_{k}}}(x_{k})=x_{k}-(A_{\zeta_{k}}^{T})^{\dagger}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}).

Hence, we have:

\displaystyle{\rm dist}^{2}(x_{k},\mathcal{A}_{\zeta_{k}})=\|x_{k}-\Pi_{\mathcal{A}_{\zeta_{k}}}(x_{k})\|^{2}=\|(A_{\zeta_{k}}^{T})^{\dagger}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}
\displaystyle\leq\|(A_{\zeta_{k}}^{T})^{\dagger}\|^{2}\|A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}\|^{2}
\displaystyle=\|(A_{\zeta_{k}}^{T})^{\dagger}\|^{2}\frac{\|A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}\|^{2}}{\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}}\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}
\displaystyle\leq\|(A_{\zeta_{k}}^{T})^{\dagger}\|^{2}\|A_{\zeta_{k}}^{T}\|^{2}\frac{\|A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}}\|^{4}}{\|A_{\zeta_{k}}(A_{\zeta_{k}}^{T}x_{k}-b_{\zeta_{k}})\|^{2}}
\displaystyle\leq\kappa_{\text{block}}^{2}\|T_{\zeta_{k}}(x_{k})-x_{k}\|^{2}
\displaystyle=\frac{\kappa_{\text{block}}^{2}}{\delta^{2}}\|x_{k}-v_{k}\|^{2}.

Using these two relations in (27), we finally get the following recurrence:

xk+1x¯k+12\displaystyle\|x_{k+1}-\bar{x}_{k+1}\|^{2} (28)
xkx¯k2min(δ(2δ)2κblock2,2δ4δ,β(2β)2)(dist2(xk,𝒜ζk)+dist2(xk,𝒞ξk)).\displaystyle\leq\|x_{k}-\bar{x}_{k}\|^{2}-\min\left(\frac{\delta(2-\delta)}{2\kappa_{\text{block}}^{2}},\frac{2-\delta}{4\delta},\frac{\beta(2-\beta)}{2}\right)\left({\rm dist}^{2}(x_{k},\mathcal{A}_{\zeta_{k}})+{\rm dist}^{2}(x_{k},\mathcal{C}_{\xi_{k}})\right).

Now, taking conditional expectation w.r.t. \mathcal{F}_{[k-1]} in (28) and using the Hoffman inequality (24), we obtain:

𝔼ζk,ξk[xk+1x¯k+12|[k1]]\displaystyle\mathbb{E}_{\zeta_{k},\xi_{k}}[\|x_{k+1}-\bar{x}_{k+1}\|^{2}|\mathcal{F}_{[k-1]}]
xkx¯k21cmin(δ(2δ)2κblock2,2δ4δ,β(2β)2)dist2(xk,𝒳)\displaystyle\leq\|x_{k}-\bar{x}_{k}\|^{2}-\frac{1}{c}\min\left(\frac{\delta(2-\delta)}{2\kappa_{\text{block}}^{2}},\frac{2-\delta}{4\delta},\frac{\beta(2-\beta)}{2}\right){\rm dist}^{2}(x_{k},\mathcal{X^{*}})
=(11cmin(δ(2δ)2κblock2,2δ4δ,β(2β)2))xkx¯k2.\displaystyle=\left(1-\frac{1}{c}\min\left(\frac{\delta(2-\delta)}{2\kappa_{\text{block}}^{2}},\frac{2-\delta}{4\delta},\frac{\beta(2-\beta)}{2}\right)\right)\|x_{k}-\bar{x}_{k}\|^{2}.

Finally, taking full expectation and unrolling the recurrence, we get the statement of the theorem.  
Note that for \delta=\beta=1 and A_{\zeta}^{T} a single row of matrix A, we have \kappa_{\text{block}}=1 and we get the simplified linear convergence estimate \mathbb{E}\left[{\rm dist}^{2}(x_{k},\mathcal{X}^{*})\right]\leq(1-1/(4c))^{k}{\rm dist}^{2}(x_{0},\mathcal{X}^{*}), which is similar to the convergence estimate for the algorithm in (Leventhal and Lewis, 2010). In the block case, for \delta=\beta=1 and assuming that \kappa_{\text{block}}\geq 2, we get the linear convergence \mathbb{E}\left[{\rm dist}^{2}(x_{k},\mathcal{X}^{*})\right]\leq\left(1-1/(2c\kappa_{\text{block}}^{2})\right)^{k}{\rm dist}^{2}(x_{0},\mathcal{X}^{*}), i.e., our rate depends explicitly on the geometric properties of the submatrices A_{\zeta} and C_{\xi} (recall that both constants c and \kappa_{\text{block}} are defined in terms of these submatrices). To the best of our knowledge, this is the first time such convergence bounds have been obtained for a stochastic subgradient type algorithm solving constrained least-squares.

5 Illustrative examples and numerical tests

In this section, we present several applications of our algorithm, such as the robust sparse SVM classification problem (Bhattacharyya et al., 2004), the sparse SVM classification problem (Weston et al., 2003), constrained least-squares and linear programs (Tibshirani, 2011), accompanied by detailed numerical simulations. The codes were written in Matlab and run on a PC with an i7 CPU at 2.1 GHz and 16 GB of memory.

5.1 Robust sparse SVM classifier

We consider a two-class dataset \{(z_{i},y_{i})\}_{i=1}^{N}, where z_{i} is the vector of features and y_{i}\in\{-1,1\} is the corresponding label. A robust classifier is a hyperplane parameterized by a weight vector w and an offset from the origin d, in which the decision boundary and the set of relevant features are resilient to uncertainty in the data, see equation (2) in (Bhattacharyya et al., 2004) for more details. Then, the robust sparse classification problem can be formulated as:

minw,d,uλi=1Nui+w1\displaystyle\min_{w,d,u}\;\lambda\sum_{i=1}^{N}u_{i}+\|w\|_{1}
subject to:yi(wTz¯i+d)1uiz¯i𝒵i,ui0i=1:N,\displaystyle\text{subject to}:\;y_{i}(w^{T}\bar{z}_{i}+d)\geq 1-u_{i}\quad\forall\bar{z}_{i}\in{\cal Z}_{i},\;\;u_{i}\geq 0\quad\forall i=1:N,

where \mathcal{Z}_{i} is the uncertainty set for the data point z_{i}, \lambda>0 is a regularization parameter, the \ell_{1}-norm is added to the objective to induce sparsity in w, and the u_{i}'s are slack variables that provide a mechanism for handling errors in the assigned class. To find a hyperplane that is robust and generalizes well, each \mathcal{Z}_{i} would need to be specified by a large corpus of pseudopoints. In particular, finding a robust hyperplane can be simplified by considering a data uncertainty model in the form of ellipsoids. In this case we can convert an infinite number of linear constraints into a single nonlinear constraint, thus recasting the above set of robust linear inequalities as second order cone constraints. Specifically, if the uncertainty set for the ith data point is defined by an ellipsoid with center z_{i} and shape given by the positive semidefinite matrix Q_{i}\succeq 0, i.e. \mathcal{Z}_{i}=\{\bar{z}_{i}:\langle Q_{i}(\bar{z}_{i}-z_{i}),\bar{z}_{i}-z_{i}\rangle\leq 1\}, then a solution to the robust hyperplane classification problem is one in which the hyperplane parameterized by (w,d) does not intersect any ellipsoidal data uncertainty model (see Appendix 1 for a proof):

yi(wTzi+d)Qi1/2wui.\displaystyle y_{i}(w^{T}z_{i}+d)\geq\|Q_{i}^{-1/2}w\|{\color[rgb]{0,0,0}-u_{i}}. (29)

Hence, the robust classification problem can be recast as a convex optimization problem with many functional constraints that are either linear or second order cone constraints:

minw,d,uλi=1Nui+w1\displaystyle\min_{w,d,u}\;\lambda\sum_{i=1}^{N}u_{i}+\|w\|_{1}
subject to:yi(wTzi+d)1ui,ui0i=1:N\displaystyle\text{subject to}:\;\;y_{i}(w^{T}z_{i}+d)\geq 1-u_{i},\;\;\;u_{i}\geq 0\quad\forall i=1:N
yi(wTzi+d)Qi1/2wuii=1:N.\displaystyle\qquad\qquad\quad\;\;y_{i}(w^{T}z_{i}+d)\geq\|Q_{i}^{-1/2}w\|{\color[rgb]{0,0,0}-u_{i}}\;\;\forall i=1:N.

This is a particular form of problem (1) and thus we can solve it using our algorithm SSP. Since each of the N data points has its own covariance matrix Q_{i}, this formulation results in a large optimization problem, so it is necessary to impose some restrictions on the shape of these matrices. Hence, we consider two scenarios: (i) class-dependent covariance matrices, i.e., Q_{i}=Q_{+} if y_{i}=+1 or Q_{i}=Q_{-} if y_{i}=-1; (ii) a class-independent covariance matrix, i.e., Q_{i}=Q_{\pm} for all y_{i}. For more details on the choice of covariance matrices see (Bhattacharyya et al., 2004). Here, each covariance matrix is assumed to be diagonal. For a class-dependent diagonal covariance matrix, the diagonal elements of Q_{+} or Q_{-} are unique, while for a class-independent covariance matrix, all diagonal elements of Q_{\pm} are identical. Computational experiments designed to evaluate the performance of SSP require datasets in which the level of variability associated with the data can be quantified. Here, a noise level parameter, 0\leq\rho\leq 1, is introduced to scale each diagonal element of the covariance matrix, i.e., \rho Q_{+} or \rho Q_{-} or \rho Q_{\pm}. When \rho=0, data points are associated with no noise (the nominal case). The \rho value acts as a proxy for data variability. For classifying a point we consider the following rules. The “ordinary rule” for classifying a data point z is as follows: if w^{T}_{*}z+d_{*}>0, then z is assigned to the +1 class; if w^{T}_{*}z+d_{*}<0, then z is identified with the -1 class. An ordinary error occurs when the class predicted by the hyperplane differs from the known class of the data point. The “worst case rule” determines whether an ellipsoid with center z intersects the hyperplane. Hence, some allowable values of z will be classified incorrectly if |w^{T}_{*}z+d_{*}|<\|Q_{i}^{-1/2}w_{*}\| (worst case error).
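For concreteness, the two classification rules used in the tables below can be sketched as follows (a NumPy sketch; the function names are ours, and w, d stand for the computed pair (w_{*},d_{*})):

```python
import numpy as np

def ordinary_rule(w, d, z):
    """Assign z to class +1 if w'z + d > 0 and to class -1 if w'z + d < 0."""
    return 1 if w @ z + d > 0 else -1

def worst_case_error(w, d, z, Q_inv_sqrt):
    """True if the ellipsoid centered at z (with shape matrix Q) crosses the
    hyperplane, i.e. |w'z + d| < ||Q^{-1/2} w||, so some admissible
    perturbation of z would be classified differently."""
    return abs(w @ z + d) < np.linalg.norm(Q_inv_sqrt @ w)
```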

Table 1: Comparison between nominal and robust classifiers on training data: class-dependent covariance matrix (first half), class-independent covariance matrix (second half).
λ\lambda ρ\rho Robust Nominal
w0w_{*}\neq 0 worst case error ordinary error w0w_{*}\neq 0 ordinary error
0.1 0.01 1699 3 332 546 198
0.2 406 25 116 767 113
0.3 698 14 111 774 67
0.1 0.3 1581 63 331 546 198
0.2 1935 86 326 767 113
0.3 1963 77 311 774 67
0.1 0.01 1734 0 331 546 198
0.2 1822 1 268 767 113
0.3 1937 0 266 774 67
0.1 0.3 1629 19 316 546 198
0.2 1899 19 296 767 113
0.3 2050 20 282 774 67
Table 2: Comparison between nominal and robust classifiers on testing data: class-dependent covariance matrix (first half), class-independent covariance matrix (second half).
λ\lambda ρ\rho Robust Nominal
accuracy (ordinary rule) accuracy (ordinary rule)
0.1 0.01 180, 72.5% 166, 66.9%
0.2 200, 80.6% 198, 79.8%
0.3 198, 79.8% 193, 77.9%
0.1 0.3 180, 72.5% 168, 67.8%
0.2 200, 80.6% 174, 70.2%
0.3 198, 79.8% 172, 69.4%
0.1 0.01 180, 72.5% 167, 67.4%
0.2 200, 80.6% 196, 79.1%
0.3 198, 79.8% 193, 77.9%
0.1 0.3 180, 72.5% 165, 66.6%
0.2 200, 80.6% 183, 73.8%
0.3 198, 79.8% 159, 64.1%

Tables 1 and 2 give the results of our algorithm SSP for the robust (\rho>0) and nominal (\rho=0) classification formulations. We choose the parameters \lambda=0.1,0.2,0.3, \beta=1.96, and a stopping criterion of 10^{-2}. We consider a dataset of CT scan images having two classes, covid and non-covid, available at https://www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset. This dataset contains CT scan images with dimensions ranging from 190\times 190 to 410\times 386 pixels. To implement our algorithm we have taken 1488 images, of which 751 are of Covid patients and 737 of Non-Covid patients. Then, we divide them into training data and testing data. For training we have taken 1240 images, of which 626 are Covid and 614 are Non-Covid. For testing we have taken 248 images, of which 125 are Covid and 123 are Non-Covid. We also resize all images to 190\times 190 pixels. The first half of Tables 1 and 2 corresponds to the class-dependent covariance matrix and the second half to the class-independent covariance matrix. Table 1 shows the results for the training data and Table 2 for the testing data. As one can see from Tables 1 and 2, the robust classifier yields better accuracies on both training and testing datasets.

5.2 Constrained least-squares

Next, we consider the constrained least-squares problem (23). We compare the performance of our algorithm SSP-LS and the algorithm in (Leventhal and Lewis, 2010) on synthetic data matrices A and C generated from a normal distribution. Both algorithms were stopped when \max(\|Ax-b\|,\|(Cx-d)_{+}\|)\leq 10^{-3}. The results for different sizes of the matrices A and C are given in Table 3. One can easily see from Table 3 the superior performance of our algorithm in both the number of full iterations (epochs, i.e. the number of passes through the data) and cpu time (in seconds).
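A minimal sketch of this synthetic setup, reusing the hypothetical ssp_ls function sketched in Section 4 (the paper's actual experiments were run in Matlab):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 900, 900, 1000
A = rng.standard_normal((m, n))
C = rng.standard_normal((p, n))
x_feas = rng.standard_normal(n)                 # make the system consistent
b = A @ x_feas
d = C @ x_feas + np.abs(rng.standard_normal(p))

x = ssp_ls(A, b, C, d, delta=1.96, beta=1.96, max_epochs=2000, tol=1e-3, rng=rng)
print(np.linalg.norm(A @ x - b), np.linalg.norm(np.maximum(C @ x - d, 0.0)))
```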

Table 3: Comparison between SSP-LS and algorithm in (Leventhal and Lewis,, 2010) in terms of epochs and cpu time (sec) on random least-squares problems.
δ=β\delta=\beta mm pp nn SSP-LS (Leventhal and Lewis,, 2010)
epochs cpu time (s) epochs cpu time (s)
0.96 900 900 10310^{3} 755 26.0 817 29.9
1.96 900 900 10310^{3} 591 20.1 787 26.2
0.96 900 1100 10310^{3} 624 23.2 721 23.9
1.96 900 1100 10310^{3} 424 16.7 778 27.3
0.96 9000 9000 10410^{4} 1688 5272.0 1700 5778.1
1.96 9000 9000 10410^{4} 1028 3469.2 1662 5763.7
0.96 9000 11000 10410^{4} 1224 5716.0 1437 5984.8
1.96 9000 11000 10410^{4} 685 2693.7 1461 5575.3
0.96 900 10510^{5} 10310^{3} 5 105.0 9 163.9
1.96 900 10510^{5} 10310^{3} 4 77.7 9 163.9
0.96 9000 10510^{5} 10410^{4} 64 2054.5 214 5698.5
1.96 9000 10510^{5} 10410^{4} 42 1306.5 213 5156.7
0.96 9000 9000 10510^{5} 23 1820.3 23 1829.4
0.96 9000 11000 10510^{5} 21 2158.7 23 2193.1
1.96 9000 11000 10510^{5} 19 1939.7 23 2216.9
0.96 900 900 10510^{5} 14 65.7 17 86.8
1.96 900 1100 10510^{5} 13 74.3 15 75.6

5.3 Linear programs

Next, we consider solving linear programs (LP) of the form:

min𝐳0𝐜T𝐳subject to𝐂𝐳𝐝.\displaystyle\min_{\mathbf{z}\geq 0}\mathbf{c}^{T}\mathbf{z}\quad\text{subject to}\quad\mathbf{C}\mathbf{z}\leq\mathbf{d}.

Using the primal-dual formulation, this problem is equivalent to (22):

find𝐳[0,)n,ν[0,)p:𝐜T𝐳+𝐝Tν=0,𝐂𝐳𝐝,𝐂Tν+𝐜0.\text{find}\;\mathbf{z}\in[0,\infty)^{n},\mathbf{\nu}\in[0,\infty)^{p}:\mathbf{c}^{T}\mathbf{z}+\mathbf{d}^{T}\mathbf{\nu}=0,\;\mathbf{C}\mathbf{z}\leq\mathbf{d},\;\mathbf{C}^{T}\mathbf{\nu}+\mathbf{c}\geq 0.

Therefore, we can easily identify in (22):

x=[𝐳ν]𝒴=[0,)n+p,A=[𝐜T𝐝T]n+p,C=[𝐂0p×p0n×n𝐂T].x=\begin{bmatrix}\mathbf{z}\\ \nu\end{bmatrix}\in\mathcal{Y}=[0,\infty)^{n+p},\;{A}=[\mathbf{c}^{T}\;\mathbf{d}^{T}]\in\mathbb{R}^{n+p},\;{C}=\begin{bmatrix}\mathbf{C}\qquad 0_{p\times p}\\ 0_{n\times n}\;-\mathbf{C}^{T}\end{bmatrix}.
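For illustration, assembling the data of (22) from an LP instance (\mathbf{c},\mathbf{C},\mathbf{d}) might look as follows (a minimal sketch under the identification above; the helper name lp_to_feasibility and the clipping projection for \mathcal{Y} are our choices):

```python
import numpy as np

def lp_to_feasibility(c, C, d):
    """Sketch: build the data of the primal-dual feasibility problem (22)
    for the LP  min_{z>=0} c'z  s.t.  Cz <= d.  Variable x = [z; nu]."""
    p, n = C.shape
    A_eq = np.concatenate([c, d]).reshape(1, -1)        # c'z + d'nu = 0
    b_eq = np.zeros(1)
    C_in = np.block([[C, np.zeros((p, p))],             # Cz <= d
                     [np.zeros((n, n)), -C.T]])         # -C'nu <= c
    d_in = np.concatenate([d, c])
    proj_Y = lambda x: np.maximum(x, 0.0)               # Y = [0, inf)^{n+p}
    return A_eq, b_eq, C_in, d_in, proj_Y
```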

Hence, we can use our algorithm SSP-LS to solve LPs. In Table 4 we compare the performance of our algorithm, the algorithm in (Leventhal and Lewis, 2010) and the Matlab solver lsqlin for solving the least-squares formulation of LPs taken from the Netlib library available at https://www.netlib.org/lp/data/index.html, and the Matlab format LP library available at https://users.clas.ufl.edu/hager/coap/Pages/matlabpage.html. The first two algorithms were stopped when \max(\|Ax-b\|,\|(Cx-d)_{+}\|)\leq 10^{-3} and we choose \delta=\beta=1.96. In Table 4, in the first column after the name of the LP we provide the dimension of the matrix \mathbf{C}. From Table 4 we observe that SSP-LS is always better than the algorithm in (Leventhal and Lewis, 2010) and for large dimensions it is also better than lsqlin, a Matlab solver dedicated to solving constrained least-squares problems. Moreover, for the ”qap15” dataset lsqlin runs out of memory.

Table 4: Comparison between SSP-LS, algorithm in (Leventhal and Lewis,, 2010) and Matlab solver lsqlin in terms of epochs and cpu time (sec) on real data LPs.
LP SSP-LS (Leventhal and Lewis,, 2010) lsqlin
epochs time (s) epochs time (s) time (s)
afiro (21×\times51) 1163 1.9 5943 2.7 0.09
beaconfd (173×\times295) 1234 9.9 9213 63.5 1.0
kb2 (43×\times68) 10 0.02 17 0.03 0.14
sc50a (50×\times78) 9 0.04 879 1.9 0.14
sc50b (50×\times78) 25 0.1 411 0.8 0.2
share2b (96×\times162) 332 1.8 1691 84.9 0.2
degen2 (444×\times757) 4702 380.6 5872 440.6 9.8
fffff800 (524×\times1028) 44 5.5 80 9.3 3.4
israel (174×\times316) 526 5.4 3729 312.9 0.3
lpi bgdbg1 (348×\times649) 476 15.4 9717 263.1 0.5
osa 07 (1118×\times25067) 148 3169.8 631 7169.7 3437.1
qap15 (6330×\times22275) 70 2700.7 373 7794.6 *
fit2p (3000×\times13525) 7 65.7 458 2188.3 824.9
maros r7 (3137×\times9408) 635 2346.8 1671 3816.2 3868.8
qap12 (3193×\times8856) 54 374.1 339 1081.8 3998.4

5.4 Sparse linear SVM

Finally, we consider the sparse linear SVM classification problem:

\displaystyle\min_{w,d,u}\;\lambda\sum_{i=1}^{N}u_{i}+\|w\|_{1}
subject to:yi(wTzi+d)1ui,ui0i=1:N.\displaystyle\text{subject to}:\;\;y_{i}(w^{T}z_{i}+d)\geq 1-u_{i},\;\;\;u_{i}\geq 0\quad\forall i=1:N.

This can be easily recast as an LP and consequently as a least-squares problem, as sketched below. In Table 5 we report the results provided by our algorithm SSP-LS for solving sparse linear SVM problems. Here, the ordinary error has the same meaning as in Section 5.1. We use several datasets: the covid dataset from www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset; the PMU-UD, sobar-72 and divorce datasets available at https://archive-beta.ics.uci.edu/ml/datasets; and the rest from the LIBSVM library https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. In the first column, the first argument represents the number of features and the second argument represents the number of data points. We divided each dataset into 80\% for training and 20\% for testing. For the LP formulation we use the simple observation that any scalar u can be written as u=u_{+}-u_{-}, with u_{+},u_{-}\geq 0. We use the same stopping criterion as in Section 5.3. In Table 5 we provide the number of relevant features, i.e., the number of nonzero elements of the optimal w_{*}, and the number of misclassified data (ordinary error) on real training and testing datasets.
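A minimal sketch of this LP recast for the sparse SVM data (Z,y,\lambda), using the splittings w=w_{+}-w_{-} and d=d_{+}-d_{-} (the function name and the variable ordering are our illustrative choices):

```python
import numpy as np

def sparse_svm_lp(Z, y, lam):
    """Sketch: recast  min_{w,d,u} lam*sum(u) + ||w||_1
    s.t. y_i (w'z_i + d) >= 1 - u_i, u_i >= 0  as the LP
    min_{t>=0} c't  s.t.  C t <= d_vec,  with t = [w_p; w_m; d_p; d_m; u]."""
    N, n = Z.shape
    c = np.concatenate([np.ones(2 * n), np.zeros(2), lam * np.ones(N)])
    Yz = y[:, None] * Z                       # rows y_i * z_i'
    C = np.hstack([-Yz, Yz, -y[:, None], y[:, None], -np.eye(N)])
    d_vec = -np.ones(N)                       # encodes y_i(w'z_i + d) >= 1 - u_i
    return c, C, d_vec
```

The resulting data (c, C, d_vec) can then be passed through the lp_to_feasibility sketch of Section 5.3 and solved with SSP-LS.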

Table 5: Performance of sparse linear SVM classifier: ordinary error and sparsity of ww_{*} on real training and testing datasets.
Dataset λ\lambda w0w_{*}\neq 0 ordinary error (train) ordinary error (test)
sobar-72 (38×\times72) 0.1 7/38 9/58 3/14
0.5 12/38 1/58 2/14
breastcancer (18×\times683) 0.1 9/18 12/547 13/136
0.5 9/18 17/547 7/136
divorce (108×\times170) 0.1 23/108 3/136 0/34
0.5 13/108 0/136 1/34
caesarian (10×\times80) 0.1 0/10(NA) * *
0.5 5/10 14/64 9/16
cryotherapy (12×\times90) 0.1 3/12 8/72 5/18
0.5 6/12 6/72 1/18
PMU-UD (19200×\times1051) 0.1 13/19200 0/841 70/210
0.5 37/19200 0/841 47/210
Covid (20000×\times2481) 0.1 428/20000 146/1985 101/496
0.5 752/20000 142/1985 97/496
Nomao (16×\times34465) 0.1 12/14 3462/27572 856/6893
0.5 11/14 3564/27572 885/6893
Training (60×\times11055) 0.1 31/60 630/8844 152/2211
0.5 30/60 611/8844 159/2211
leukemia (14258×\times38) 0.1 182/14258 9/31 2/7
0.5 271/14258 9/31 2/7
mushrooms (42×\times8124) 0.1 30/42 426/6500 121/1624
0.5 31/42 415/6500 105/1624
ijcnn1 (28×\times49990) 0.1 11/28 3831/39992 1021/9998
0.5 10/28 3860/39992 992/9998
phishing (60×\times11055) 0.1 56/60 2953/8844 728/2211
0.5 51/60 2929/8844 725/2211

Appendix

1. Proof of inequality (29). We want to find a hyperplane parameterized by (w,d) which does not intersect any ellipsoidal data uncertainty model. Thus, one can easily see that the relaxed (via the slack variable u_{i}) robust linear inequality

yi(wTz¯i+d)uiz¯i𝒵iy_{i}(w^{T}\bar{z}_{i}+d)\geq{\color[rgb]{0,0,0}-u_{i}}\quad\forall\bar{z}_{i}\in{\cal Z}_{i}

over the ellipsoid with center z_{i}

𝒵i={z¯i:Qi(z¯izi),z¯izi1},{\cal Z}_{i}=\{\bar{z}_{i}:\langle Q_{i}(\bar{z}_{i}-z_{i}),\bar{z}_{i}-z_{i}\rangle\leq 1\},

can be written as an optimization problem whose minimum value must satisfy:

ui\displaystyle{\color[rgb]{0,0,0}-u_{i}}\leq minz¯iyi(wTz¯i+d)\displaystyle\min_{\bar{z}_{i}}\;y_{i}(w^{T}\bar{z}_{i}+d)
subject to:(z¯izi)TQi(z¯izi)10.\displaystyle\text{subject to:}\;(\bar{z}_{i}-z_{i})^{T}Q_{i}(\bar{z}_{i}-z_{i})-1\leq 0.

The corresponding dual problem is as follows:

maxλ0minz¯iyi(wTz¯i+d)+λ((z¯izi)TQi(z¯izi)1).\displaystyle\max_{\lambda\geq 0}\min_{\bar{z}_{i}}y_{i}(w^{T}\bar{z}_{i}+d)+\lambda((\bar{z}_{i}-z_{i})^{T}Q_{i}(\bar{z}_{i}-z_{i})-1). (30)

Minimizing the above problem with respect to z¯i\bar{z}_{i}, we obtain:

yiw+2λQi(z¯izi)=0z¯i=ziyi2λQi1w.\displaystyle y_{i}w+2\lambda Q_{i}(\bar{z}_{i}-z_{i})=0\iff\bar{z}_{i}=z_{i}-\frac{y_{i}}{2\lambda}Q_{i}^{-1}w. (31)

By replacing this value of z¯i\bar{z}_{i} into the dual problem (30), we get:

maxλ0[yi24λwTQi1w+yi(wTzi+d)λ].\displaystyle\max_{\lambda\geq 0}\left[-\frac{y_{i}^{2}}{4\lambda}w^{T}Q_{i}^{-1}w+y_{i}(w^{T}z_{i}+d)-\lambda\right].

For the dual optimal solution

λ=12yi2(wTQi1w)=12(wTQi1w),\displaystyle\lambda^{*}=\frac{1}{2}\sqrt{y_{i}^{2}(w^{T}Q_{i}^{-1}w)}=\frac{1}{2}\sqrt{(w^{T}Q_{i}^{-1}w)},

we get the primal optimal solution

z¯i=ziyiQi1wwTQi1w,\displaystyle\bar{z}_{i}^{*}=z_{i}-\frac{y_{i}Q_{i}^{-1}w}{\sqrt{w^{T}Q_{i}^{-1}w}},

and consequently the second order cone condition

yi(wTzi+d)Qi1/2wui.\displaystyle y_{i}\left(w^{T}z_{i}+d\right)\geq\|Q_{i}^{-1/2}w\|{\color[rgb]{0,0,0}-u_{i}}.
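As a quick numerical sanity check of this derivation, one can verify that the primal optimal solution above attains the value y_{i}(w^{T}z_{i}+d)-\|Q_{i}^{-1/2}w\| and lies on the boundary of the ellipsoid (a NumPy sketch with randomly generated data; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
w = rng.standard_normal(n); d = rng.standard_normal()
z = rng.standard_normal(n); y = 1.0
M = rng.standard_normal((n, n)); Q = M @ M.T + np.eye(n)   # Q positive definite

# closed-form minimizer from (31) and the resulting minimum value
Q_inv = np.linalg.inv(Q)
z_star = z - y * Q_inv @ w / np.sqrt(w @ Q_inv @ w)
closed_form = y * (w @ z + d) - np.sqrt(w @ Q_inv @ w)     # = y(w'z+d) - ||Q^{-1/2}w||
assert abs(y * (w @ z_star + d) - closed_form) < 1e-10

# the minimizer lies on the boundary of the ellipsoid
assert abs((z_star - z) @ Q @ (z_star - z) - 1.0) < 1e-10
```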

Acknowledgments

The research leading to these results has received funding from: NO Grants 2014–2021, under project ELO-Hyp, contract no. 24/2020; UEFISCDI PN-III-P4-PCE-2021-0720, under project L2O-MOC, nr. 70/2022.

References

  • Bauschke and Borwein, (1996) H. Bauschke and J. Borwein, On projection algorithms for solving convex feasibility problems, SIAM Review, 38(3): 367–376, 1996.
  • Bertsekas, (2011) D.P. Bertsekas, Incremental proximal methods for large scale convex optimization, Mathematical Programming, 129(2): 163–195, 2011.
  • Bianchi et al., (2019) P. Bianchi, W. Hachem and A. Salim, A constant step forward-backward algorithm involving random maximal monotone operators, Journal of Convex Analysis, 26(2): 397–436, 2019.
  • Bhattacharyya et al., (2004) C. Bhattacharyya, L.R. Grate, M.I. Jordan, L. El Ghaoui and S. Mian, Robust sparse hyperplane classifiers: Application to uncertain molecular profiling data, Journal of Computational Biology, 11(6): 1073–1089, 2004.
  • Devolder et al., (2014) O. Devolder, F. Glineur and Yu. Nesterov, First-order methods of smooth convex optimization with inexact oracle, Mathematical Programming, 146: 37–75, 2014.
  • Duchi and Singer, (2009) J. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting, Journal of Machine Learning Research, 10: 2899–2934, 2009.
  • Garrigos et al., (2023) G. Garrigos and R.M. Gower, Handbook of convergence theorems for (stochastic) gradient methods, arXiv:2301.11235v2, 2023.
  • Hardt et al., (2016) M. Hardt, B. Recht and Y. Singer, Train faster, generalize better: stability of stochastic gradient descent, International Conference on Machine Learning, 2016.
  • Hermer et al., (2020) N. Hermer, D.R. Luke and A. Sturm, Random function iterations for stochastic fixed point problems, arXiv:2007.06479, 2020.
  • Horn and Johnson, (2012) R. A. Horn and C.R. Johnson, Matrix Analysis, Cambridge University Press, 2012.
  • Kundu et al., (2018) A. Kundu, F. Bach and C. Bhattacharya, Convex optimization over inter-section of simple sets: improved convergence rate guarantees via an exact penalty approach, International Conference on Artificial Intelligence and Statistics, 2018.
  • Lewis and Pang, (1998) A. Lewis and J. Pang, Error bounds for convex inequality systems, Generalized Convexity, Generalized Monotonicity (J. Crouzeix, J. Martinez-Legaz and M. Volle eds.), Cambridge University Press, 75–110, 1998.
  • Leventhal and Lewis, (2010) D. Leventhal and A.S. Lewis. Randomized Methods for linear constraints: convergence rates and conditioning, Mathematics of Operations Research, 35(3): 641–654, 2010.
  • Moulines and Bach, (2011) E. Moulines and F. Bach, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, Advances in Neural Information Processing Systems Conf., 2011.
  • Mordukhovich and Nam, (2005) B.S. Mordukhovich and N.M. Nam, Subgradient of distance functions with applications to Lipschitzian stability, Mathematical Programming, 104: 635–668, 2005.
  • Necoara, (2021) I. Necoara, General convergence analysis of stochastic first order methods for composite optimization, Journal of Optimization Theory and Applications, 189: 66–95 2021.
  • Necoara et al., (2019) I. Necoara, Yu. Nesterov and F. Glineur, Linear convergence of first order methods for non-strongly convex optimization, Mathematical Programming, 175(1): 69–107, 2019.
  • Nedelcu et al., (2014) V. Nedelcu, I. Necoara and Q. Tran Dinh, Computational complexity of inexact gradient augmented Lagrangian methods: application to constrained MPC, SIAM Journal on Control and Optimization, 52(5): 3109–3134, 2014.
  • Nemirovski and Yudin, (1983) A. Nemirovski and D.B. Yudin, Problem complexity and method efficiency in optimization, Wiley Interscience, 1983.
  • Nemirovski et al., (2009) A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal Optimization, 19(4): 1574–1609, 2009.
  • Nesterov, (2018) Yu. Nesterov, Lectures on Convex Optimization, Springer Optimization and Its Applications, 137, 2018.
  • Nedich, (2011) A. Nedich, Random algorithms for convex minimization problems, Mathematical Programming, 129(2): 225–273, 2011.
  • Nedich and Necoara, (2019) A. Nedich and I. Necoara, Random minibatch subgradient algorithms for convex problems with functional constraints, Applied Mathematics and Optimization, 8(3): 801–833, 2019.
  • Necoara and Patrascu, (2018) A. Patrascu and I. Necoara, On the convergence of inexact projection first order methods for convex minimization, IEEE Transactions Automatic Control, 63(10): 3317–3329, 2018.
  • Patrascu and Necoara, (2018) A. Patrascu and I. Necoara, Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization, Journal of Machine Learning Research, 18(198): 1–42, 2018.
  • Pena et al., (2021) J. Pena, J. Vera and L. Zuluaga, New characterizations of Hoffman constants for systems of linear constraints, Mathematical Programming, 187: 79–109, 2021.
  • Polyak, (1969) B.T. Polyak, Minimization of unsmooth functionals , USSR Computational Mathematics and Mathematical Physics, 9 (3), 14-29, 1969.
  • Polyak, (2001) B.T. Polyak, Random algorithms for solving convex inequalities, Studies in Computational Mathematics, 8: 409–422, 2001.
  • Robbins and Monro, (1951) H. Robbins and S. Monro, A Stochastic approximation method, The Annals of Mathematical Statistics, 22(3): 400–407, 1951.
  • Rockafellar and Uryasev, (2000) R.T. Rockafellar and S.P. Uryasev, Optimization of conditional value-at-risk, Journal of Risk, 2: 21–41, 2000.
  • Rosasco et al., (2019) L. Rosasco, S. Villa and B.C. Vu, Convergence of stochastic proximal gradient algorithm, Applied Mathematics and Optimization, 82: 891–917 , 2019.
  • Rasch and Chambolle, (2020) J. Rasch and A. Chambolle, Inexact first-order primal–dual algorithms, Computational Optimization and Applications, 76: 381–-430, 2020.
  • Tibshirani, (2011) R. Tibshirani, The solution path of the generalized lasso, Phd Thesis, Stanford Univ., 2011.
  • Tran-Dinh et al., (2018) Q. Tran-Dinh, O. Fercoq, and V. Cevher, A smooth primal-dual optimization framework for nonsmooth composite convex minimization, SIAM Journal on Optimization, 28(1): 96–134, 2018.
  • Villa et al., (2014) S. Villa, L. Rosasco, S. Mosci and A. Verri, Proximal methods for the latent group lasso penalty, Computational Optimization and Applications, 58: 381–407, 2014.
  • Vapnik, (1998) V. Vapnik, Statistical learning theory, John Wiley, 1998.
  • Weston et al., (2003) J. Weston, A. Elisseeff and B. Scholkopf, Use of the zero norm with linear models and kernel methods, Journal of Machine Learning Research, 3: 1439–1461, 2003.
  • Wang et al., (2015) M. Wang, Y. Chen, J. Liu and Y. Gu, Random multiconstraint projection: stochastic gradient methods for convex optimization with many constraints, arXiv: 1511.03760, 2015.
  • Xu, (2020) Y. Xu, Primal-dual stochastic gradient method for convex programs with many functional constraints, SIAM Journal on Optimization, 30(2): 1664–1692, 2020.
  • Yang and Lin, (2016) T. Yang and Q. Lin, RSG: Beating subgradient method without smoothness and strong convexity, Journal of Machine Learning Research, 19(6): 1–33, 2018.