Stochastic subgradient projection methods for composite optimization with functional constraints
Abstract
In this paper we consider convex optimization problems with a stochastic composite objective function subject to a (possibly infinite) intersection of functional constraints. The objective function is expressed as an expectation over a sum of two terms satisfying a stochastic bounded gradient condition, with or without strong convexity type properties. In contrast to the classical approach, where the constraints are usually represented as an intersection of simple sets, in this paper we consider that each constraint set is given as the level set of a convex but not necessarily differentiable function. Based on the flexibility offered by our general optimization model, we consider a stochastic subgradient method with random feasibility updates. At each iteration, our algorithm takes a stochastic proximal (sub)gradient step aimed at minimizing the objective function, and then a subsequent subgradient step minimizing the feasibility violation of the observed random constraint. We analyze the convergence behavior of the proposed algorithm for diminishing stepsizes and for the cases when the objective function is convex or strongly convex, unifying the nonsmooth and smooth regimes. We prove sublinear convergence rates for this stochastic subgradient algorithm, which are known to be optimal for subgradient methods on this class of problems. When the objective function has a linear least-squares form and the constraints are polyhedral, it is shown that the algorithm converges linearly. Numerical evidence supports the effectiveness of our method on real problems.
Keywords: Stochastic convex optimization, functional constraints, stochastic subgradient, rate of convergence, constrained least-squares, robust/sparse SVM.
1 Introduction
The large sums of functions in the objective and/or the large number of constraints arising in most practical optimization applications, including machine learning and statistics (Vapnik, 1998; Bhattacharyya et al., 2004), signal processing (Necoara, 2021; Tibshirani, 2011), computer science (Kundu et al., 2018), distributed control (Nedelcu et al., 2014), operations research and finance (Rockafellar and Uryasev, 2000), create several theoretical and computational challenges. For example, these problems are becoming increasingly large in terms of both the number of variables and the size of the training data, and they are usually nonsmooth due to the use of regularizers, penalty terms and the presence of constraints. Due to these challenges, (sub)gradient-based methods are widely applied. In particular, proximal gradient algorithms (Necoara, 2021; Rosasco et al., 2019) are natural in applications where the function to be minimized is in composite form, i.e. the sum of a smooth term and a nonsmooth regularizer (e.g., the indicator function of some simple constraints), as e.g. the empirical risk in machine learning. In these algorithms, at each iteration the proximal operator defined by the nonsmooth component is applied to the gradient descent step for the smooth data-fitting component. In practice, it is important to consider situations where these operations cannot be performed exactly. For example, the case where the gradient, the proximal operator or the projection can be computed only up to an error has been considered in (Devolder et al., 2014; Nedelcu et al., 2014; Necoara and Patrascu, 2018; Rasch and Chambolle, 2020), while the situation when only stochastic estimates of these operators are available has been studied in (Rosasco et al., 2019; Hardt et al., 2016; Patrascu and Necoara, 2018; Hermer et al., 2020). This latter setting, which is also of interest here, is very important in machine learning, where we have to minimize an expected objective function with or without constraints from random samples (Bhattacharyya et al., 2004), or in statistics, where we need to minimize a finite sum objective subject to functional constraints (Tibshirani, 2011).
Previous work. A very popular approach for minimizing an expected or finite sum objective function is the stochastic gradient descent (SGD) (Robbins and Monro, 1951; Nemirovski and Yudin, 1983; Hardt et al., 2016) or the stochastic proximal point (SPP) algorithm (Moulines and Bach, 2011; Nemirovski et al., 2009; Necoara, 2021; Patrascu and Necoara, 2018; Rosasco et al., 2019). In these studies, sublinear convergence is derived for SGD or SPP with decreasing stepsizes under the assumption that the objective function is smooth and (strongly) convex. For nonsmooth stochastic convex optimization one can recognize two main approaches. The first uses stochastic variants of the subgradient method combined with different averaging techniques, see e.g. (Nemirovski et al., 2009; Yang and Lin, 2016). The second line of research is based on stochastic variants of the proximal gradient method under the assumption that the expected objective function is in composite form (Duchi and Singer, 2009; Necoara, 2021; Rosasco et al., 2019). For both approaches, using decreasing stepsizes, sublinear convergence rates are derived under the assumption that the objective function is (strongly) convex. Even though SGD and SPP are well-developed methodologies, they only apply to problems with simple constraints, since they require the whole feasible set to be projectable.
In spite of their wide applicability, the study of efficient solution methods for optimization problems with many constraints is still limited. The most prominent line of research in this area is alternating projections, which applies random projections to problems involving the intersection of a (possibly infinite) number of sets. For the case when the objective function is not present in the formulation, which corresponds to the convex feasibility problem, stochastic alternating projection algorithms were analyzed e.g. in (Bauschke and Borwein, 1996), with linear convergence rates, provided that the sets satisfy some linear regularity condition. Stochastic forward-backward algorithms have also been applied to solve optimization problems with many constraints. However, the papers introducing those general algorithms either focus on proving only asymptotic convergence results without rates, or they assume that the number of constraints is finite, which is more restrictive than our setting, see e.g. (Bianchi et al., 2019; Xu, 2020; Wang et al., 2015). For the case where the number of constraints is finite and the objective function is deterministic, Nesterov's smoothing framework is studied in (Tran-Dinh et al., 2018) in the setting of accelerated proximal gradient methods. Incremental subgradient methods and primal-dual approaches were also proposed for solving problems over a finite intersection of simple sets through an exact penalty reformulation in (Bertsekas, 2011; Kundu et al., 2018).
The papers most related to our work are (Nedich, 2011; Nedich and Necoara, 2019), where subgradient methods with random feasibility steps are proposed for solving convex problems with deterministic objective and many functional constraints. However, the optimization problem, the algorithm and consequently the convergence analysis are different from those of the present paper. In particular, our algorithm is a stochastic proximal gradient extension of the algorithms proposed in (Nedich, 2011; Nedich and Necoara, 2019). Additionally, the stepsizes in (Nedich, 2011; Nedich and Necoara, 2019) are chosen to be decreasing, while in the present work, for strongly convex objective functions, we derive insightful stepsize-switching rules which describe when one should switch from a constant to a decreasing stepsize regime. Furthermore, in (Nedich, 2011) and (Nedich and Necoara, 2019) sublinear convergence rates are established for convex and strongly convex deterministic objective functions, respectively, while in this paper we prove (sub)linear rates for an expected composite objective function which is either convex or strongly convex. Moreover, (Nedich, 2011; Nedich and Necoara, 2019) present the convergence analysis for smooth and nonsmooth objectives separately, while in this paper we present a unified convergence analysis covering both cases through the so-called stochastic bounded gradient condition. Hence, since we deal with stochastic composite objective functions, smooth or nonsmooth, convex or strongly convex, and since we consider a stochastic proximal gradient method with new stepsize rules, our convergence analysis requires additional insights that differ from those of (Nedich, 2011; Nedich and Necoara, 2019).
In (Patrascu and Necoara, 2018) a stochastic optimization problem over an infinite intersection of sets is considered, and stochastic proximal point steps are combined with alternating projections for solving it. However, in order to prove sublinear convergence rates, (Patrascu and Necoara, 2018) requires the objective function to be strongly convex and smooth, while our results are also valid for convex nonsmooth functions. Lastly, (Patrascu and Necoara, 2018) assumes the projectability of the individual sets, whereas in our case the constraints might not be projectable. Moreover, in all these studies the nonsmooth component is assumed to be proximal, i.e. one can easily compute its proximal operator. This assumption is restrictive, since in many applications the nonsmooth term is also expressed as an expectation or a finite sum of nonsmooth functions, which are individually proximal, but it is hard to compute the proximal operator of the whole nonsmooth component, as e.g. in support vector machine or generalized lasso problems (Villa et al., 2014; Rosasco et al., 2019). Finally, all the convergence results from these previous papers are derived separately for smooth and for nonsmooth stochastic optimization problems.
Contributions. In this paper we remove the previous drawbacks. We propose a stochastic subgradient algorithm for solving general convex optimization problems having the objective function expressed in terms of expectation operator over a sum of two terms subject to (possibly infinite number of) functional constraints. The only assumption we require is to have access to an unbiased estimate of the (sub)gradient and of the proximal operator of each of these two terms and to the subgradients of the constraints. To deal with such problems, we propose a stochastic subgradient method with random feasibility updates. At each iteration, the algorithm takes a stochastic subgradient step aimed at only minimizing the expected objective function, followed by a feasibility step for minimizing the feasibility violation of the observed random constraint achieved through Polyak’s subgradient iteration, see (Polyak,, 1969). In doing so, we can avoid the need for projections onto the whole set of constraints, which may be expensive computationally. The proposed algorithm is applicable to the situation where the whole objective function and/or constraint set are not known in advance, but they are rather learned in time through observations.
We present a general framework for the convergence analysis of this stochastic subgradient algorithm, based on the assumptions that the objective function satisfies a stochastic bounded gradient condition, with or without strong convexity, and that the subgradients of the functional constraints are bounded. These assumptions include the most well-known classes of objective functions and of constraints analyzed in the literature: compositions of a nonsmooth function and a smooth function, with or without strong convexity, and nonsmooth Lipschitz functions, respectively. Moreover, when the objective function satisfies the strong convexity property, we derive insightful stepsize-switching rules which describe when one should switch from a constant to a decreasing stepsize regime. Then, we prove sublinear convergence rates for the weighted averages of the iterates in terms of expected distance to the constraint set, as well as in terms of expected optimality in function values/distance to the optimal set. Under some special conditions we also derive linear rates. Our convergence estimates are known to be optimal for this class of stochastic subgradient schemes for solving (nonsmooth) convex problems with functional constraints. Besides providing a general framework for the design and analysis of stochastic subgradient methods, our convergence results recover the existing bounds in special cases where complexity estimates are known for particular problems. In particular, for problems without functional constraints we recover the complexity bounds from (Necoara, 2021), and for linearly constrained least-squares problems we get convergence bounds similar to those in (Leventhal and Lewis, 2010).
Content. In Section 2 we present our optimization problem and the main assumptions. In Section 3 we design a stochastic subgradient projection algorithm and analyze its convergence properties, while in Section 4 we adapt this algorithm to constrained least-squares problems and derive linear convergence. Finally, in Section 5 detailed numerical simulations are provided that support the effectiveness of our method on real problems.
2 Problem formulation and assumptions
We consider the general convex composite optimization problem with expected objective function and functional constraints:

$$\min_{x \in Y} \; F(x) := \mathbb{E}\big[ f(x,\zeta) + g(x,\zeta) \big] \quad \text{subject to} \quad h(x,\xi) \le 0 \quad \forall \xi \in \Omega_2, \tag{1}$$
where the composite objective function $F$ has a stochastic representation in the form of an expectation w.r.t. a random variable $\zeta \in \Omega_1$, i.e., $F(x) = \mathbb{E}_{\zeta}[f(x,\zeta) + g(x,\zeta)]$, $\Omega_2$ is an arbitrary collection of indices and $Y \subseteq \mathbb{R}^n$ is a simple closed convex set. Hence, we separate the feasible set in two parts: one set, $Y$, admits easy projections, and the other part is not easy for projection as it is described by the level sets of the convex functions $h(\cdot,\xi)$. Moreover, $f(\cdot,\zeta)$ and $g(\cdot,\zeta)$ are proper lower-semicontinuous convex functions containing the interior of $Y$ in their domains. One can notice that, more commonly, one sees a single $g$ representing the regularizer on the parameters in the formulation (1). However, there are also applications where one encounters terms of the form $\mathbb{E}_{\zeta}[g(x,\zeta)]$ or $\sum_{i} g_i(x)$, as e.g., in Lasso problems with mixed regularizers or regularizers with overlapping groups, see e.g., (Villa et al., 2014). Multiple functional constraints, onto which it is difficult to project, can arise from robust classification in machine learning, chance constrained problems, min-max games and control, see (Bhattacharyya et al., 2004; Rockafellar and Uryasev, 2000; Patrascu and Necoara, 2018). For further use, we define the following functions: $f(x) := \mathbb{E}_{\zeta}[f(x,\zeta)]$, $g(x) := \mathbb{E}_{\zeta}[g(x,\zeta)]$ and $F := f + g$. Moreover, $Y$ is assumed to be simple, i.e. one can easily compute the projection of a point onto this set. Let us define the individual sets $X_{\xi} := \{ x \in \mathbb{R}^n : h(x,\xi) \le 0 \}$ for all $\xi \in \Omega_2$. Denote the feasible set of (1) by:

$$X := Y \cap \Big( \bigcap_{\xi \in \Omega_2} X_{\xi} \Big).$$
We assume $X$ to be nonempty. We also assume that the optimization problem (1) has a finite optimum, and we let $F^*$ and $X^*$ denote the optimal value and the optimal set, respectively:

$$F^* = \min_{x \in X} F(x), \qquad X^* = \{ x \in X : F(x) = F^* \}.$$
Further, for any $x$ we denote its projection onto the optimal set $X^*$ by $\bar{x}$, that is:

$$\bar{x} := \Pi_{X^*}(x) = \arg\min_{z \in X^*} \| x - z \|.$$
For the objective function, we assume that the first term $f(\cdot,\zeta)$ is either a differentiable or a nondifferentiable function, and we use, with some abuse of notation, the same notation $\nabla f(x,\zeta)$ for the gradient or a subgradient of $f(\cdot,\zeta)$ at $x$, that is $\nabla f(x,\zeta) \in \partial f(x,\zeta)$, where the subdifferential $\partial f(x,\zeta)$ is either a singleton or a nonempty set for any $x$. The other term $g(\cdot,\zeta)$ is assumed to have an easy proximal operator for any $\zeta$:

$$\operatorname{prox}_{\alpha g(\cdot,\zeta)}(x) := \arg\min_{z \in \mathbb{R}^n} \; g(z,\zeta) + \frac{1}{2\alpha} \| z - x \|^2 .$$
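For intuition, a minimal sketch of two proximal operators that are typical in this setting is given below: soft-thresholding for an $\ell_1$ regularizer and the projection for the indicator function of a box. The function names and the box example are our own illustration, not notation from the paper.

```python
import numpy as np

def prox_l1(x, alpha):
    # prox of alpha*||.||_1: componentwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def prox_box_indicator(x, lo, hi):
    # prox of the indicator function of the box [lo, hi] is simply the
    # projection onto the box, independently of the stepsize
    return np.clip(x, lo, hi)
```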
Recall that the proximal operator of the indicator function of a closed convex set becomes the projection. We consider additionally the following assumptions on the objective function and constraints:
Assumption 1
The (sub)gradients of $F$ satisfy a stochastic bounded gradient condition, that is, there exist non-negative constants $L \ge 0$ and $B \ge 0$ such that:

$$\mathbb{E}\big[ \| \nabla f(x,\zeta) + \nabla g(x,\zeta) \|^2 \big] \le L \, \big( F(x) - F^* \big) + B^2 \quad \forall x \in Y. \tag{2}$$
From Jensen's inequality, taking the expectation inside the norm in (2), we get:

$$\| \nabla F(x) \|^2 \le L \, \big( F(x) - F^* \big) + B^2 \quad \forall x \in Y. \tag{3}$$
We also assume to satisfy a (strong) convexity condition:
Assumption 2
The function $F$ satisfies a (strong) convexity condition on $Y$, i.e., there exists a non-negative constant $\mu \ge 0$ such that:

$$F(y) \ge F(x) + \langle \nabla F(x), y - x \rangle + \frac{\mu}{2} \, \| y - x \|^2 \quad \forall x, y \in Y. \tag{4}$$
Note that when $\mu = 0$, relation (4) states that $F$ is convex on $Y$. Finally, we assume boundedness of the subgradients of the functional constraints:
Assumption 3
The functional constraints $h(\cdot,\xi)$ have bounded subgradients on $Y$, i.e., there exists a non-negative constant $B_h > 0$ such that for all $\nabla h(x,\xi) \in \partial h(x,\xi)$, we have:

$$\| \nabla h(x,\xi) \| \le B_h \quad \forall x \in Y, \; \xi \in \Omega_2.$$
Note that our assumptions are quite general and cover the most well-known classes of functions analyzed in the literature. In particular, Assumptions 1 and 2, related to the objective function, cover the class of non-smooth Lipschitz functions and compositions of a (potentially) non-smooth function and a smooth function, with or without strong convexity, as the following examples show.
Example 1 [Non-smooth (Lipschitz) functions satisfy Assumption 1]: Assume that the functions $f(\cdot,\zeta)$ and $g(\cdot,\zeta)$ have bounded (sub)gradients on $Y$:

$$\| \nabla f(x,\zeta) \| \le B_f, \qquad \| \nabla g(x,\zeta) \| \le B_g \quad \forall x \in Y.$$

Then, obviously, Assumption 1 holds with $L = 0$ and $B^2 = 2 \big( B_f^2 + B_g^2 \big)$.
Example 2 [Smooth (Lipschitz gradient) functions satisfy Assumption 1]: Condition (2) also includes the class of functions formed as a sum of two terms: $f(\cdot,\zeta)$ convex with Lipschitz continuous gradient and $g(\cdot,\zeta)$ convex with bounded subgradients, with $Y$ bounded. Indeed, let us assume that the convex function $f(\cdot,\zeta)$ has Lipschitz continuous gradient, i.e. there exists $L_f > 0$ such that:

$$\| \nabla f(x,\zeta) - \nabla f(y,\zeta) \| \le L_f \, \| x - y \| \quad \forall x, y \in Y.$$
Using standard arguments (see Theorem 2.1.5 in (Nesterov,, 2018)), we have:
Since is also convex, then adding in the previous inequality, where , we get:
where we used that . Taking expectation w.r.t. and assuming that the set is bounded, with diameter , then after using the Cauchy–Schwarz inequality, we get (we also assume for all ):
Therefore, for any , we have:
Assuming now that the regularization functions $g(\cdot,\zeta)$ have bounded subgradients on $Y$, i.e., $\| \nabla g(x,\zeta) \| \le B_g$, then the bounded gradient condition (2) holds on $Y$ with:
Note that $F^*$ is finite, since the optimal set is compact (recall that in this example $Y$ is assumed bounded).
Finally, we also impose some regularity condition for the constraints.
Assumption 4
The functional constraints satisfy a regularity condition, i.e., there exists a positive constant $c > 0$ such that:

$$\operatorname{dist}^2(x, X) \le c \, \mathbb{E}_{\xi}\big[ \operatorname{dist}^2(x, X_{\xi}) \big] \quad \forall x \in Y. \tag{5}$$
Note that this assumption holds, e.g., when the index set is arbitrary and the feasible set has an interior point, see (Polyak, 2001), or when the feasible set is polyhedral, see relation (24) below. However, Assumption 4 holds for more general sets, e.g., when a strengthened Slater condition holds for the collection of convex functional constraints, such as the generalized Robinson condition detailed in Corollary 3 of (Lewis and Pang, 1998).
3 Stochastic subgradient projection algorithm
Given the iteration counter $k$, we consider independent random variables $\zeta_k$ and $\xi_k$ sampled from $\Omega_1$ and $\Omega_2$ according to the probability distributions $\mathbf{P}_1$ and $\mathbf{P}_2$, respectively. Then, we define the following stochastic subgradient projection algorithm, where at each iteration we perform a stochastic proximal (sub)gradient step aimed at minimizing the expected composite objective function and then a subsequent subgradient step minimizing the feasibility violation of the observed random constraint (we use the convention $0/0 = 0$; in this variant we have corrected some derivations from the journal version: Journal of Machine Learning Research, 23(265), 1–35, 2022):
Algorithm 1 (SSP): Choose $x_0 \in Y$ and stepsizes $\alpha_k > 0$ and $\beta > 0$. For $k \ge 0$ do:

$$v_k = \operatorname{prox}_{\alpha_k g(\cdot,\zeta_k)}\big( x_k - \alpha_k \nabla f(x_k,\zeta_k) \big), \tag{6}$$

$$z_k = v_k - \beta \, \frac{\big( h(v_k,\xi_k) \big)_+}{\| d_k \|^2} \, d_k, \tag{7}$$

$$x_{k+1} = \Pi_Y(z_k). \tag{8}$$
Here, $\alpha_k > 0$ and $\beta > 0$ are deterministic stepsizes and $(a)_+ := \max\{ a, 0 \}$. Note that $v_k$ represents a stochastic proximal (sub)gradient step (or a stochastic forward-backward step) at $x_k$ for the expected composite objective function $F$. Note that the optimality step (6) selects a random pair of functions $\big( f(\cdot,\zeta_k), g(\cdot,\zeta_k) \big)$ from the composite objective function according to the probability distribution $\mathbf{P}_1$, i.e. the index variable $\zeta_k$ takes values in the set $\Omega_1$. Also, we note that the random feasibility step (7) selects a random constraint $h(\cdot,\xi_k)$ from the collection of constraints according to the probability distribution $\mathbf{P}_2$, independently of $\zeta_k$, i.e. the index variable $\xi_k$ takes values in the set $\Omega_2$. The vector $d_k$ is chosen as:
$$d_k \in \begin{cases} \partial h(v_k,\xi_k) & \text{if } h(v_k,\xi_k) > 0, \\ \{ \bar{d} \} & \text{if } h(v_k,\xi_k) \le 0, \end{cases}$$

where $\bar{d}$ is any nonzero vector. If $h(v_k,\xi_k) \le 0$, then for any choice of nonzero $\bar{d}$, we have $z_k = v_k$. Note that the feasibility step (7) has the special form of Polyak's subgradient iteration, see e.g., (Polyak, 1969). Moreover, when $\beta = 1$, $z_k$ is the projection of $v_k$ onto the halfspace:

$$\mathcal{H}_{v_k,\xi_k} := \big\{ z : h(v_k,\xi_k) + \langle d_k, z - v_k \rangle \le 0 \big\},$$
that is, $z_k = \Pi_{\mathcal{H}_{v_k,\xi_k}}(v_k)$ for $\beta = 1$. Indeed, if $h(v_k,\xi_k) \le 0$, then $v_k \in \mathcal{H}_{v_k,\xi_k}$ and the projection is the point itself, i.e., $z_k = v_k$. On the other hand, if $h(v_k,\xi_k) > 0$, then the projection of $v_k$ onto $\mathcal{H}_{v_k,\xi_k}$ reduces to the projection onto the corresponding hyperplane:

$$z_k = v_k - \frac{h(v_k,\xi_k)}{\| d_k \|^2} \, d_k .$$

Combining these two cases, we finally get our update (7).
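To make steps (6)-(8) concrete, here is a minimal sketch of one SSP iteration; the oracles `grad_f`, `prox_g`, `h`, `subgrad_h`, `proj_Y` and the samplers are hypothetical user-supplied callables, not part of the paper's notation.

```python
import numpy as np

def ssp_step(x, alpha, beta, sample_zeta, sample_xi,
             grad_f, prox_g, h, subgrad_h, proj_Y):
    """One SSP iteration (steps (6)-(8)) under user-supplied oracles."""
    zeta, xi = sample_zeta(), sample_xi()      # independent random indices
    # (6) stochastic proximal (sub)gradient step on the objective
    v = prox_g(x - alpha * grad_f(x, zeta), alpha, zeta)
    # (7) Polyak subgradient step on the observed random constraint
    hv = h(v, xi)
    if hv > 0.0:
        d = subgrad_h(v, xi)
        z = v - beta * hv / np.dot(d, d) * d
    else:
        z = v                                  # this constraint is already satisfied
    # (8) projection onto the simple set Y
    return proj_Y(z)
```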
Note that when the feasible set of the optimization problem (1) is described by an (infinite) intersection of simple convex sets, see e.g. (Patrascu and Necoara, 2018; Bianchi et al., 2019; Xu, 2020; Wang et al., 2015):

$$X = Y \cap \Big( \bigcap_{\xi \in \Omega_2} X_{\xi} \Big),$$
where each set $X_{\xi}$ admits an easy projection, then one can choose the following functional representation in problem (1):

$$h(x,\xi) = \operatorname{dist}(x, X_{\xi}) := \min_{z \in X_{\xi}} \| x - z \| .$$
One can easily notice that this function is convex, nonsmooth and has bounded subgradients, since we have (Mordukhovich and Nam, 2005):

$$\partial \operatorname{dist}(x, X_{\xi}) = \begin{cases} \left\{ \dfrac{x - \Pi_{X_{\xi}}(x)}{\operatorname{dist}(x, X_{\xi})} \right\} & \text{if } x \notin X_{\xi}, \\[1ex] N_{X_{\xi}}(x) \cap \mathcal{B}(0,1) & \text{if } x \in X_{\xi}, \end{cases}$$

hence $\| \nabla h(x,\xi) \| \le 1$ for all $\nabla h(x,\xi) \in \partial \operatorname{dist}(x, X_{\xi})$.
In this case, step (7) in the SSP algorithm becomes a usual random projection step:

$$z_k = (1 - \beta) \, v_k + \beta \, \Pi_{X_{\xi_k}}(v_k).$$
Hence, our formulation is more general than (Patrascu and Necoara, 2018; Bianchi et al., 2019; Xu, 2020; Wang et al., 2015), as it also allows us to deal with constraints that might not be projectable but for which one can easily compute a subgradient of $h(\cdot,\xi)$. We also mention the possibility of performing the iterations in steps (6) and (7) of the SSP algorithm in parallel; however, here we do not consider this case. A thorough convergence analysis of minibatch iterations when the objective function is deterministic and strongly convex can be found in (Nedich and Necoara, 2019). For the analysis of the SSP algorithm, let us define the stochastic (sub)gradient mapping (for simplicity we omit its dependence on the stepsize $\alpha_k$):

$$G(x_k,\zeta_k) := \frac{1}{\alpha_k} \Big( x_k - \operatorname{prox}_{\alpha_k g(\cdot,\zeta_k)}\big( x_k - \alpha_k \nabla f(x_k,\zeta_k) \big) \Big).$$
Then, it follows immediately that the stochastic proximal (sub)gradient step (6) can be written as:

$$v_k = x_k - \alpha_k \, G(x_k,\zeta_k).$$
Moreover, from the optimality condition of the prox operator it follows that there exists $r_k \in \partial g(v_k,\zeta_k)$ such that:

$$G(x_k,\zeta_k) = \nabla f(x_k,\zeta_k) + r_k .$$
Let us also recall a basic property of the projection onto a closed convex set $C$, see e.g., (Nedich and Necoara, 2019):

$$\| \Pi_C(x) - z \|^2 \le \| x - z \|^2 - \| \Pi_C(x) - x \|^2 \quad \forall x \in \mathbb{R}^n, \; z \in C. \tag{9}$$
Define also the filtration:

$$\mathcal{F}_{[k]} := \sigma\big( \zeta_0, \xi_0, \ldots, \zeta_k, \xi_k \big).$$
The next lemma provides a key descent property for the sequence $(v_k)_{k \ge 0}$, and for the proof we use the stochastic (sub)gradient mapping $G$ as the main tool.
Lemma 5
Let $f(\cdot,\zeta)$ and $g(\cdot,\zeta)$ be convex functions. Additionally, assume that the bounded gradient condition from Assumption 1 holds. Then, for any $z \in Y$ and stepsize $\alpha_k > 0$, we have the following recursion:
(10) |
Proof Recalling that and , from the definition of and using (9) for and , we get:
(11) |
Now, we refine the second term. First, from convexity of we have:
Then, from convexity of and the definition of the gradient mapping , we have:
Replacing the previous two inequalities in (11), we obtain:
Using that for all , that and , we further get:
Since depends on , not on , and the stepsize does not depend on , then from basic properties of conditional expectation, we have:
where in the last inequality we used and the stochastic bounded gradient inequality from Assumption 1. Now, taking expectation w.r.t. we get:
which concludes our statement.
The next lemma gives a relation between $z_k$ and $v_k$ (see also (Nedich and Necoara, 2019)).
Lemma 6
Let the functions $h(\cdot,\xi)$ be convex. Additionally, assume that Assumption 3 holds. Then, for any $z$ such that $h(z,\xi_k) \le 0$, the following relation holds:
Proof Consider any such that . Then, using the nonexpansive property of the projection and the definition of , we have:
where the last inequality follows from convexity of and our assumption that . After rearranging the terms and using that , we get:
which concludes our statement.
In the following subsections, based on the previous two lemmas, we derive convergence rates for the SSP algorithm depending on the (convexity) properties of $F$.
3.1 Convergence analysis: convex objective function
In this section we consider that the functions and are convex. First, it is easy to prove that (we can always choose sufficiently large such that this relation holds), see also (Nedich and Necoara,, 2019). For simplicity of the exposition let us introduce the following notation:
We impose the following conditions on the stepsize :
(12) |
Then, we can define the following average sequence generated by the algorithm SSP:
Note that this type of average sequence is also considered in (Garrigos et al., 2023) for unconstrained stochastic optimization problems. The next theorem derives sublinear convergence rates for this average sequence.
Theorem 7
Proof Recall that from Lemma 5, we have:
(13) |
Now, for we have that , and thus using Lemma 6, we get:
Taking conditional expectation on given , we get:
Taking now full expectation, we obtain:
(14) |
and using this relation in (13), we get:
(15) |
Similarly, for we have that , and thus using again Lemma 6, we obtain:
Taking conditional expectation on given , we get:
After taking full expectation, we get:
(16) |
Using (16) in (15), we obtain:
Since satisfies (12), then , and we further get the following recurrence:
(17) | ||||
Summing (17) from to , we get:
Using the definition of the average sequence and the convexity of and of , we further get sublinear rates in expectation for the average sequence in terms of optimality:
and feasibility violation:
These conclude our statements.
Note that for stepsize , with and satisfying (12), we have:
Consequently for we obtain from Theorem 7 the following sublinear convergence rates:
For the particular choice , we get similar rates as above just replacing with . However, if we neglect the logarithmic terms, we get sublinear convergence rates of order:
It is important to note that when $B = 0$, improved rates can be derived from Theorem 7 for the SSP algorithm in the convex case. More precisely, for stepsize , with and satisfying (12), we obtain convergence rates for in optimality and feasibility violation of order . In particular, for (i.e., constant stepsize for all ) the previous convergence estimates yield rates of order . To the best of our knowledge, these rates are new for stochastic subgradient methods applied to the class of optimization problems (1).
3.2 Convergence analysis: strongly convex objective function
In this section we additionally assume that the strong convexity inequality from Assumption 2 holds with $\mu > 0$. The next lemma derives an improved recurrence relation for the sequence generated by SSP under strong convexity. Furthermore, due to the strong convexity assumption on $F$, problem (1) has a unique optimum, denoted $x^*$. Our proofs in this section are different from the ones in (Nedich and Necoara, 2019), since here we consider a more general type of bounded gradient condition, i.e., Assumption 1.
Lemma 8
Proof One can easily see that our stepsize can be written equivalently as (for we use the convention ). Using Assumption 2 in Lemma 5, we get:
(18) |
where the third inequality uses the property of the stepsize (if , then ), together with the optimality condition ( for all ), the fourth inequality uses the identity for any , and the final inequality uses the fact that . From (14), we also have:
(19) |
Taking and combining (18) with (19), we obtain:
(20) |
For we have . For , then and for any , we have . Hence, from (20), we obtain for any :
Using the geometric sum formula and recalling that , we obtain the first statement. Further, for , from relation (20), we have:
Since and , we get:
Note that in this case the stepsize sequence is decreasing, and thus we have:
Using this bound in the previous recurrence, we also get the second statement.
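The stepsize in Lemma 8 switches from a constant to a decreasing regime. As a rough illustration of such a switching schedule, consider the sketch below; the switch point and the constants are our own illustrative choices, not the precise ones from Lemma 8.

```python
def switching_stepsize(k, L, mu, alpha0):
    # Keep the constant stepsize alpha0 while k is below a switch point of
    # order L/mu, then decay as O(1/(mu * k)); constants are illustrative only.
    k0 = max(int(2.0 * L / mu), 1)
    return alpha0 if k <= k0 else 2.0 / (mu * (k + 1))
```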
Now, we are ready to derive sublinear rates when we assume a strong convexity condition on the objective function. Let us define for the sum:
and the corresponding average sequences:
Theorem 9
Proof From Lemma 8, the following recurrence is valid for any :
From definition of and multiplying the whole inequality with , we get:
Summing this inequality from to , we get:
By linearity of the expectation operator, we further have:
(21) | ||||
and using convexity of the norm, we get:
After some simple calculations and keeping only the dominant terms, we get the following convergence rate for the average sequence in terms of optimality:
Since , we get the following convergence rate for the average sequence in terms of feasibility violation:
This proves our statements.
Recall the expression of :
For the particular choice of the stepsize , we have:
since we always have . Using this expression in the convergence rates of Theorem 9, we obtain:
We can easily notice from Lemma 8 that for $B = 0$ we can get better convergence rates. More specifically, in this particular case, taking a constant stepsize, we get linear rates for the last iterate in terms of optimality and feasibility violation. We state this result in the next corollary.
Corollary 10
Under the assumptions of Theorem 9, with $B = 0$, the last iterate generated by the SSP algorithm with constant stepsize converges linearly in terms of optimality and feasibility violation.
Proof When $B = 0$ and the stepsize satisfies , from (20), we obtain:
Then, since , we get immediately that:
which proves our statements.
Note that in (Necoara, 2021) it has been proved that stochastic first-order methods converge linearly on optimization problems of the form (1) without functional constraints satisfying Assumption 1 with $B = 0$. This paper extends this result to a stochastic subgradient projection method for optimization problems with functional constraints (1) satisfying Assumption 1 with $B = 0$. To the best of our knowledge, these convergence rates are new for stochastic subgradient projection methods applied to the general class of optimization problems (1). In Section 4 we provide an example of an optimization problem with functional constraints, namely constrained linear least-squares, which satisfies Assumption 1 with $B = 0$.
4 Stochastic subgradient for constrained least-squares
In this section we consider the problem of finding a solution to a system of linear equalities and inequalities, see also (Leventhal and Lewis, 2010):

$$\text{find } x \in Y \quad \text{such that} \quad Ax = b, \;\; Cx \le d, \tag{22}$$
where $A \in \mathbb{R}^{m \times n}$, $C \in \mathbb{R}^{p \times n}$, and $Y$ is a simple polyhedral set. We assume that this system is consistent, i.e. it has at least one solution. This problem can be reformulated equivalently as a particular case of the optimization problem with functional constraints (1):
$$\min_{x \in Y} \; F(x) := \frac{1}{2} \, \mathbb{E}\big[ \| A_{\zeta} x - b_{\zeta} \|^2 \big] \quad \text{subject to} \quad C_{\xi} x \le d_{\xi} \quad \forall \xi \in \Omega_2, \tag{23}$$

where $(A_{\zeta}, b_{\zeta})$ and $(C_{\xi}, d_{\xi})$ are (block) row partitions of the matrices $(A, b)$ and $(C, d)$, respectively. Clearly, problem (23) is a particular case of problem (1), with $f(x,\zeta) = \frac{1}{2} \| A_{\zeta} x - b_{\zeta} \|^2$, $g = 0$, and $h(x,\xi) = C_{\xi} x - d_{\xi}$ (provided that $C_{\xi}$ is a row of $C$). Let us define the polyhedral subset partitions $X_{\xi} := \{ x : C_{\xi} x \le d_{\xi} \}$. In this case the feasible set is the polyhedron $X = \{ x \in Y : Cx \le d \}$ and the optimal set is the polyhedron:

$$X^* = \{ x \in Y : Ax = b, \; Cx \le d \}.$$
Note that for the particular problem (23), Assumption 1 holds with $B = 0$ and e.g. $L = 2 \max_{\zeta} \| A_{\zeta} \|^2$, since $F^* = 0$ and we have:

$$\mathbb{E}\big[ \| A_{\zeta}^{\top} ( A_{\zeta} x - b_{\zeta} ) \|^2 \big] \le \max_{\zeta} \| A_{\zeta} \|^2 \, \mathbb{E}\big[ \| A_{\zeta} x - b_{\zeta} \|^2 \big] = 2 \max_{\zeta} \| A_{\zeta} \|^2 \, \big( F(x) - F^* \big).$$
It is also obvious that Assumption 3 holds, since the functional constraints are linear. Moreover, for the constrained least-squares problem we replace Assumptions 2 and 4 with the well-known Hoffman property of a polyhedral set, see also (Pena et al., 2021; Leventhal and Lewis, 2010; Necoara et al., 2019):

$$\operatorname{dist}^2(x, X^*) \le c \, \big( \| Ax - b \|^2 + \| ( Cx - d )_+ \|^2 \big) \quad \forall x \in Y, \tag{24}$$
for some $c > 0$. Recall that the Hoffman condition (24) always holds for nonempty polyhedral sets (Pena et al., 2021). For the constrained least-squares problem, the SSP algorithm becomes:
Algorithm 2 (SSP-LS): Choose $x_0 \in Y$. For $k \ge 0$ do:

$$v_k = x_k - \alpha_k A_{\zeta_k}^{\top} ( A_{\zeta_k} x_k - b_{\zeta_k} ), \qquad z_k = \Pi_{X_{\xi_k}}(v_k), \qquad x_{k+1} = \Pi_Y(z_k).$$
Note that the update for $v_k$ can be written as step (6) in SSP for $f(x,\zeta_k) = \frac{1}{2} \| A_{\zeta_k} x - b_{\zeta_k} \|^2$ and $g = 0$. In contrast to the previous section, however, here we consider an adaptive stepsize:

$$\alpha_k = \frac{\| A_{\zeta_k} x_k - b_{\zeta_k} \|^2}{\| A_{\zeta_k}^{\top} ( A_{\zeta_k} x_k - b_{\zeta_k} ) \|^2}.$$
Note that when $C_{\xi_k}$ is a single row $c_{\xi_k}^{\top}$ of $C$, then $z_k$ has the explicit expression:

$$z_k = v_k - \frac{\big( c_{\xi_k}^{\top} v_k - d_{\xi_k} \big)_+}{\| c_{\xi_k} \|^2} \, c_{\xi_k},$$
which coincides with step (7) in SSP for $\beta = 1$. Note that we can use e.g., probability distributions dependent on the (block) rows of the matrices $A$ and $C$:

$$\mathbf{P}_1(\zeta = i) = \frac{\| A_i \|_F^2}{\| A \|_F^2}, \qquad \mathbf{P}_2(\xi = j) = \frac{\| C_j \|_F^2}{\| C \|_F^2},$$
where $\| \cdot \|_F$ denotes the Frobenius norm of a matrix. Note that our algorithm SSP-LS differs from the algorithm in (Leventhal and Lewis, 2010) through the choice of the stepsize $\alpha_k$, of the sampling rules and of the update law for $z_k$, and it is more general, as it allows working with blocks of rows of the matrices $A$ and $C$. Moreover, SSP-LS includes the classical Kaczmarz method when solving linear systems of equalities. In the next section we derive linear convergence rates for the SSP-LS algorithm, provided that the system of equalities and inequalities is consistent, i.e. $X^*$ is nonempty.
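For illustration, a minimal single-row sketch of SSP-LS is given below (the block version would replace rows by blocks of rows and the Kaczmarz step by a pseudoinverse-based step); rows are sampled proportionally to their squared norms as above, the rows are assumed nonzero, and all names are our own.

```python
import numpy as np

def ssp_ls(A, b, C, d, proj_Y, iters=10000, seed=0):
    """Single-row SSP-LS sketch: Kaczmarz step on a random equality of A x = b,
    Polyak step on a random inequality of C x <= d, then projection onto Y."""
    rng = np.random.default_rng(seed)
    p1 = np.sum(A**2, axis=1); p1 = p1 / p1.sum()     # row sampling for equalities
    p2 = np.sum(C**2, axis=1); p2 = p2 / p2.sum()     # row sampling for inequalities
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        i = rng.choice(len(b), p=p1)
        a = A[i]
        v = x - (a @ x - b[i]) / (a @ a) * a           # Kaczmarz (adaptive stepsize)
        j = rng.choice(len(d), p=p2)
        c = C[j]
        z = v - max(c @ v - d[j], 0.0) / (c @ c) * c   # project onto {x : c^T x <= d_j}
        x = proj_Y(z)
    return x
```

For $Y = \mathbb{R}^n$ one can simply pass `proj_Y = lambda z: z`.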
4.1 Linear convergence
In this section we prove linear convergence for the sequence generated by the SSP-LS algorithm for solving the constrained least-squares problem (23). Let us define the maximum block condition number over all the submatrices $A_{\zeta}$:

$$\kappa := \max_{\zeta} \; \| A_{\zeta} \| \cdot \| A_{\zeta}^{\dagger} \|,$$

where $A_{\zeta}^{\dagger}$ denotes the pseudoinverse of $A_{\zeta}$. Note that if $A_{\zeta}$ has full row rank, then $A_{\zeta}^{\dagger} = A_{\zeta}^{\top} ( A_{\zeta} A_{\zeta}^{\top} )^{-1}$. Then, we have the following result.
Theorem 11
Assume that the polyhedral set $X^*$ is nonempty. Then, we have the following linear rate of convergence for the sequence generated by the SSP-LS algorithm:
Proof From the updates of the sequences , and in SSP-LS algorithm, we have:
Using the definition of and that , we further get:
From the optimality condition of the projection we always have for all . Taking in the previous relation, we finally get:
(25) | ||||
From the definition of and , we have:
Also, from the definition of , we have:
(26) |
Now, replacing these two relations in (25), we get:
(27) |
First, let us consider the subset . Then, we have:
Second, let us consider the subset . Since the corresponding represents a block of rows of matrix , the update for in SSP-LS can be written as:
where the operator is given by the following expression
Further, the projection of onto the subset is, see e.g., (Horn and Johnson,, 2012):
Hence, we have:
Using these two relations in (27), we finally get the following recurrence:
(28) | ||||
Now, taking conditional expectation w.r.t. in (28) and using Hoffman inequality (24), we obtain:
Finally, taking full expectation and unrolling the resulting recursion, we get the statement of the theorem.
Note that for and a single row of matrix , we have and we get a simplified linear convergence estimate , which is similar to the convergence estimate for the algorithm in (Leventhal and Lewis, 2010). In the block case, for and assuming that , we get the linear convergence , i.e., our rate depends explicitly on the geometric properties of the submatrices and (recall that both constants and are defined in terms of these submatrices). To the best of our knowledge, this is the first time such convergence bounds have been obtained for a stochastic subgradient type algorithm for solving constrained least-squares.
5 Illustrative examples and numerical tests
In this section, we present several applications where our algorithm can be applied, such as the robust sparse SVM classification problem (Bhattacharyya et al., 2004), the sparse SVM classification problem (Weston et al., 2003), constrained least-squares and linear programs (Tibshirani, 2011), accompanied by detailed numerical simulations. The codes were written in Matlab and run on a PC with an i7 CPU at 2.1 GHz and 16 GB memory.
5.1 Robust sparse SVM classifier
We consider a two-class dataset $\{ (x_i, y_i) \}_{i=1}^{m}$, where $x_i \in \mathbb{R}^n$ is the vector of features and $y_i \in \{ -1, +1 \}$ is the corresponding label. A robust classifier is a hyperplane parameterized by a weight vector $w$ and an offset from the origin $b$, in which the decision boundary and the set of relevant features are resilient to uncertainty in the data, see (Bhattacharyya et al., 2004) for more details. Then, the robust sparse classification problem can be formulated as:

$$\min_{w, b, u \ge 0} \; \| w \|_1 + C \sum_{i=1}^{m} u_i \quad \text{subject to} \quad y_i \big( w^{\top} \tilde{x}_i + b \big) \ge 1 - u_i \quad \forall \tilde{x}_i \in \mathcal{E}_i, \; i = 1, \ldots, m,$$
where $\mathcal{E}_i$ is the uncertainty set of the data point $x_i$, the parameter $C > 0$ and the $\ell_1$-norm are added in the objective to induce sparsity in $w$, and the $u_i$'s are slack variables that provide a mechanism for handling an error in the assigned class $y_i$. To find a hyperplane that is robust and generalizes well, each $\mathcal{E}_i$ would need to be specified by a large corpus of pseudopoints. In particular, finding a robust hyperplane can be simplified by considering a data uncertainty model in the form of ellipsoids. In this case we can convert an infinite number of linear constraints into a single nonlinear constraint, thus recasting the above set of robust linear inequalities as second order cone constraints. Specifically, if the uncertainty set for the $i$-th data point is defined by an ellipsoid with center $\bar{x}_i$ and shape given by the positive semidefinite matrix $\Sigma_i$, i.e. $\mathcal{E}_i = \{ \bar{x}_i + \gamma \Sigma_i^{1/2} v : \| v \| \le 1 \}$, then a solution to the robust hyperplane classification problem is one in which the hyperplane parameterized by $(w, b)$ does not intersect any ellipsoidal data uncertainty model (see Appendix 1 for a proof):
$$y_i \big( w^{\top} \bar{x}_i + b \big) \ge 1 - u_i + \gamma \, \| \Sigma_i^{1/2} w \| \quad \forall i = 1, \ldots, m. \tag{29}$$
Hence, the robust classification problem can be recast as a convex optimization problem with many functional constraints that are either linear or second order cone constraints:

$$\min_{w, b, u \ge 0} \; \| w \|_1 + C \sum_{i=1}^{m} u_i \quad \text{subject to} \quad y_i \big( w^{\top} \bar{x}_i + b \big) \ge 1 - u_i + \gamma \, \| \Sigma_i^{1/2} w \| \quad \forall i = 1, \ldots, m.$$
This is a particular form of problem (1) and thus we can solve it using our algorithm SSP. Since every one of the $m$ data points has its own covariance matrix $\Sigma_i$, this formulation results in a large optimization problem, so it is necessary to impose some restrictions on the shape of these matrices. Hence, we consider two scenarios: (i) class-dependent covariance matrices, i.e., $\Sigma_i = \Sigma_+$ if $y_i = +1$ or $\Sigma_i = \Sigma_-$ if $y_i = -1$; (ii) a class-independent covariance matrix, i.e., $\Sigma_i = \Sigma$ for all $i$. For more details on the choice of covariance matrices see (Bhattacharyya et al., 2004). Here, each covariance matrix is assumed to be diagonal. In a class-dependent diagonal covariance matrix, the diagonal elements of $\Sigma_+$ or $\Sigma_-$ are unique, while in a class-independent covariance matrix, all diagonal elements of $\Sigma$ are identical. Computational experiments designed to evaluate the performance of SSP require datasets in which the level of variability associated with the data can be quantified. Here, a noise level parameter $\rho \ge 0$ is introduced to scale each diagonal element of the covariance matrix. When $\rho = 0$, data points are associated with no noise (the nominal case). The value of $\rho$ acts as a proxy for data variability. For classifying a point we consider the following rules. The "ordinary rule" for classifying a data point $x_i$ is as follows: if $w^{\top} x_i + b > 0$, then $x_i$ is assigned to the $+1$ class; if $w^{\top} x_i + b < 0$, then $x_i$ is identified with the $-1$ class. An ordinary error occurs when the class predicted by the hyperplane differs from the known class of the data point. The "worst case rule" determines whether an ellipsoid with center $\bar{x}_i$ intersects the hyperplane. Hence, some allowable values of $\tilde{x}_i \in \mathcal{E}_i$ will be classified incorrectly if $y_i \big( w^{\top} \bar{x}_i + b \big) < \gamma \, \| \Sigma_i^{1/2} w \|$ (worst case error).
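As a sketch of the two classification rules under the diagonal covariance model, with `Sigmas` an array whose $i$-th row holds the diagonal of $\Sigma_i$ and `gamma` the ellipsoid size parameter from (29) (the names are ours):

```python
import numpy as np

def classification_errors(w, b, Xbar, y, Sigmas, gamma):
    """Ordinary rule: compare sign(w^T x + b) with the label. Worst-case rule:
    the ellipsoid around x_i crosses the hyperplane iff
    y_i (w^T x_i + b) < gamma * ||Sigma_i^{1/2} w||."""
    margins = y * (Xbar @ w + b)
    ordinary = int(np.sum(margins < 0))
    radii = gamma * np.sqrt(Sigmas @ (w**2))   # ||Sigma_i^{1/2} w|| for diagonal Sigma_i
    worst_case = int(np.sum(margins < radii))
    return ordinary, worst_case
```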
Robust | Nominal | |||||
worst case error | ordinary error | ordinary error | ||||
0.1 | 0.01 | 1699 | 3 | 332 | 546 | 198 |
0.2 | 406 | 25 | 116 | 767 | 113 | |
0.3 | 698 | 14 | 111 | 774 | 67 | |
0.1 | 0.3 | 1581 | 63 | 331 | 546 | 198 |
0.2 | 1935 | 86 | 326 | 767 | 113 | |
0.3 | 1963 | 77 | 311 | 774 | 67 | |
0.1 | 0.01 | 1734 | 0 | 331 | 546 | 198 |
0.2 | 1822 | 1 | 268 | 767 | 113 | |
0.3 | 1937 | 0 | 266 | 774 | 67 | |
0.1 | 0.3 | 1629 | 19 | 316 | 546 | 198 |
0.2 | 1899 | 19 | 296 | 767 | 113 | |
0.3 | 2050 | 20 | 282 | 774 | 67 |
Robust | Nominal | ||
accuracy (ordinary rule) | accuracy (ordinary rule) | ||
0.1 | 0.01 | 180, 72.5% | 166, 66.9% |
0.2 | 200, 80.6% | 198, 79.8% | |
0.3 | 198, 79.8% | 193, 77.9% | |
0.1 | 0.3 | 180, 72.5% | 168, 67.8% |
0.2 | 200, 80.6% | 174, 70.2% | |
0.3 | 198, 79.8% | 172, 69.4% | |
0.1 | 0.01 | 180, 72.5% | 167, 67.4% |
0.2 | 200, 80.6% | 196, 79.1% | |
0.3 | 198, 79.8% | 193, 77.9% | |
0.1 | 0.3 | 180, 72.5% | 165, 66.6% |
0.2 | 200, 80.6% | 183, 73.8% | |
0.3 | 198, 79.8% | 159, 64.1% |
Tables 1 and 2 give the results of our algorithm SSP for the robust () and nominal () classification formulations. We choose the parameters , , and stopping criterion . We consider a dataset of CT scan images having two classes, covid and non-covid, available at https://www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset. This dataset contains CT scan images of dimension ranging from to pixels. To implement our algorithm we have taken data in which are of Covid patients and of Non-Covid patients. Then, we divide them into training data and testing data. For the training data we have taken images of which are of Covid and are of Non-Covid. For the testing data we have taken images of which are of Covid and of Non-Covid. We also resize all images to pixels. The first half of Tables 1 and 2 corresponds to the class-dependent covariance matrix and the second half to the class-independent covariance matrix. Table 1 shows the results for the training data and Table 2 for the testing data. As one can see from Tables 1 and 2, the robust classifier yields better accuracies on both the training and testing datasets.
5.2 Constrained least-squares
Next, we consider the constrained least-squares problem (23). We compare the performance of our algorithm SSP-LS and the algorithm in (Leventhal and Lewis, 2010) on synthetic data matrices $A$ and $C$ generated from a normal distribution. Both algorithms were stopped when . The results for different sizes of the matrices $A$ and $C$ are given in Table 3. One can easily see from Table 3 the superior performance of our algorithm in both the number of full iterations (epochs, i.e. the number of passes through the data) and CPU time (in seconds).
SSP-LS | (Leventhal and Lewis, 2010) | |
epochs | cpu time (s) | epochs | cpu time (s) | ||||
0.96 | 900 | 900 | 755 | 26.0 | 817 | 29.9 | |
1.96 | 900 | 900 | 591 | 20.1 | 787 | 26.2 | |
0.96 | 900 | 1100 | 624 | 23.2 | 721 | 23.9 | |
1.96 | 900 | 1100 | 424 | 16.7 | 778 | 27.3 | |
0.96 | 9000 | 9000 | 1688 | 5272.0 | 1700 | 5778.1 | |
1.96 | 9000 | 9000 | 1028 | 3469.2 | 1662 | 5763.7 | |
0.96 | 9000 | 11000 | 1224 | 5716.0 | 1437 | 5984.8 | |
1.96 | 9000 | 11000 | 685 | 2693.7 | 1461 | 5575.3 | |
0.96 | 900 | 5 | 105.0 | 9 | 163.9 | ||
1.96 | 900 | 4 | 77.7 | 9 | 163.9 | ||
0.96 | 9000 | 64 | 2054.5 | 214 | 5698.5 | ||
1.96 | 9000 | 42 | 1306.5 | 213 | 5156.7 | ||
0.96 | 9000 | 9000 | 23 | 1820.3 | 23 | 1829.4 | |
0.96 | 9000 | 11000 | 21 | 2158.7 | 23 | 2193.1 | |
1.96 | 9000 | 11000 | 19 | 1939.7 | 23 | 2216.9 | |
0.96 | 900 | 900 | 14 | 65.7 | 17 | 86.8 | |
1.96 | 900 | 1100 | 13 | 74.3 | 15 | 75.6 |
5.3 Linear programs
Next, we consider solving linear programs (LPs) of the form:

$$\min_{x \ge 0} \; c^{\top} x \quad \text{subject to} \quad A x = b.$$
Using the primal-dual optimality conditions, this problem is equivalent to a system of the form (22) in the primal-dual pair $(x, y)$:

$$A x = b, \qquad A^{\top} y \le c, \qquad c^{\top} x - b^{\top} y = 0, \qquad x \ge 0.$$
Therefore, we can easily identify in (22): the equalities $A x = b$ and $c^{\top} x - b^{\top} y = 0$, the inequalities $A^{\top} y \le c$, and the simple set $Y = \{ (x, y) : x \ge 0 \}$.
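A sketch of this identification in code is given below, assuming the standard-form LP above; the stacked data in the variables $z = (x, y)$ can then be handed to an SSP-LS implementation (the helper name and layout are ours).

```python
import numpy as np

def lp_to_system(A, b, c):
    """Stack the LP optimality conditions in z = (x, y):
    equalities  A x = b  and  c^T x - b^T y = 0;
    inequalities A^T y <= c;  the simple set Y = {z : x >= 0}."""
    m, n = A.shape
    Aeq = np.block([[A, np.zeros((m, m))],
                    [c.reshape(1, -1), -b.reshape(1, -1)]])
    beq = np.concatenate([b, [0.0]])
    Cin = np.hstack([np.zeros((n, n)), A.T])   # rows encode A^T y <= c
    din = c
    proj_Y = lambda z: np.concatenate([np.maximum(z[:n], 0.0), z[n:]])
    return Aeq, beq, Cin, din, proj_Y
```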
Hence, we can use our algorithm SSP-LS to solve LPs. In Table 4 we compare the performance of our algorithm, the algorithm in (Leventhal and Lewis, 2010) and the Matlab solver lsqlin for solving the least-squares formulation of LPs taken from the Netlib library, available at https://www.netlib.org/lp/data/index.html, and the Matlab-format LP library available at https://users.clas.ufl.edu/hager/coap/Pages/matlabpage.html. The first two algorithms were stopped when and we choose . In Table 4, in the first column, after the name of the LP we provide the dimensions (rows×columns) of the data matrix. From Table 4 we observe that SSP-LS is always better than the algorithm in (Leventhal and Lewis, 2010), and for large dimensions it is also better than lsqlin, a Matlab solver specially dedicated to solving constrained least-squares problems. Moreover, for the "qap15" dataset lsqlin runs out of memory.
LP | SSP-LS | (Leventhal and Lewis, 2010) | lsqlin | |
---|---|---|---|---|---|
epochs | time (s) | epochs | time (s) | time (s) | |
afiro (21×51) | 1163 | 1.9 | 5943 | 2.7 | 0.09
beaconfd (173×295) | 1234 | 9.9 | 9213 | 63.5 | 1.0
kb2 (43×68) | 10 | 0.02 | 17 | 0.03 | 0.14
sc50a (50×78) | 9 | 0.04 | 879 | 1.9 | 0.14
sc50b (50×78) | 25 | 0.1 | 411 | 0.8 | 0.2
share2b (96×162) | 332 | 1.8 | 1691 | 84.9 | 0.2
degen2 (444×757) | 4702 | 380.6 | 5872 | 440.6 | 9.8
fffff800 (524×1028) | 44 | 5.5 | 80 | 9.3 | 3.4
israel (174×316) | 526 | 5.4 | 3729 | 312.9 | 0.3
lpi bgdbg1 (348×649) | 476 | 15.4 | 9717 | 263.1 | 0.5
osa 07 (1118×25067) | 148 | 3169.8 | 631 | 7169.7 | 3437.1
qap15 (6330×22275) | 70 | 2700.7 | 373 | 7794.6 | *
fit2p (3000×13525) | 7 | 65.7 | 458 | 2188.3 | 824.9
maros r7 (3137×9408) | 635 | 2346.8 | 1671 | 3816.2 | 3868.8
qap12 (3193×8856) | 54 | 374.1 | 339 | 1081.8 | 3998.4
5.4 Sparse linear SVM
Finally, we consider the sparse linear SVM classification problem:

$$\min_{w, b, u \ge 0} \; \| w \|_1 + C \sum_{i=1}^{m} u_i \quad \text{subject to} \quad y_i \big( w^{\top} x_i + b \big) \ge 1 - u_i \quad \forall i = 1, \ldots, m.$$
This can be easily recast as an LP, and consequently as a least-squares problem. In Table 5 we report the results provided by our algorithm SSP-LS for solving sparse linear SVM problems. Here, the ordinary error has the same meaning as in Section 5.1. We use several datasets: the covid dataset from www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset; the PMU-UD, sobar-72 and divorce datasets available at https://archive-beta.ics.uci.edu/ml/datasets; and the rest from the LIBSVM library https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. In the first column, the first argument represents the number of features and the second argument represents the number of data points. We divided each dataset into 80% for training and 20% for testing. For the LP formulation we use the simple observation that any scalar $a$ can be written as $a = a_+ - a_-$, with $a_+, a_- \ge 0$. We use the same stopping criterion as in Section 5.3. In Table 5 we provide the number of relevant features, i.e., the number of nonzero elements of the optimal $w$, and the number of misclassified data points (ordinary error) on the real training and testing datasets.
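For completeness, here is a sketch of one standard LP encoding of this problem based on the splitting $w = w_+ - w_-$; the constraint layout is our reconstruction, and the names are ours.

```python
import numpy as np

def sparse_svm_lp(X, y, C):
    """LP data for the sparse SVM in the variables z = (w+, w-, b, u):
    minimize 1^T(w+ + w-) + C * 1^T u subject to
    y_i((w+ - w-)^T x_i + b) + u_i >= 1 and w+, w-, u >= 0."""
    m, n = X.shape
    cost = np.concatenate([np.ones(2 * n), [0.0], C * np.ones(m)])
    Yx = y[:, None] * X
    # margin constraints rewritten as G z <= h
    G = -np.hstack([Yx, -Yx, y[:, None], np.eye(m)])
    h = -np.ones(m)
    return cost, G, h
```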
Dataset | | relevant features | ordinary error (train) | ordinary error (test)
---|---|---|---|---
sobar-72 (38×72) | 0.1 | 7/38 | 9/58 | 3/14
 | 0.5 | 12/38 | 1/58 | 2/14
breastcancer (18×683) | 0.1 | 9/18 | 12/547 | 13/136
 | 0.5 | 9/18 | 17/547 | 7/136
divorce (108×170) | 0.1 | 23/108 | 3/136 | 0/34
 | 0.5 | 13/108 | 0/136 | 1/34
caesarian (10×80) | 0.1 | 0/10 (NA) | * | *
 | 0.5 | 5/10 | 14/64 | 9/16
cryotherapy (12×90) | 0.1 | 3/12 | 8/72 | 5/18
 | 0.5 | 6/12 | 6/72 | 1/18
PMU-UD (19200×1051) | 0.1 | 13/19200 | 0/841 | 70/210
 | 0.5 | 37/19200 | 0/841 | 47/210
Covid (20000×2481) | 0.1 | 428/20000 | 146/1985 | 101/496
 | 0.5 | 752/20000 | 142/1985 | 97/496
Nomao (16×34465) | 0.1 | 12/14 | 3462/27572 | 856/6893
 | 0.5 | 11/14 | 3564/27572 | 885/6893
Training (60×11055) | 0.1 | 31/60 | 630/8844 | 152/2211
 | 0.5 | 30/60 | 611/8844 | 159/2211
leukemia (14258×38) | 0.1 | 182/14258 | 9/31 | 2/7
 | 0.5 | 271/14258 | 9/31 | 2/7
mushrooms (42×8124) | 0.1 | 30/42 | 426/6500 | 121/1624
 | 0.5 | 31/42 | 415/6500 | 105/1624
ijcnn1 (28×49990) | 0.1 | 11/28 | 3831/39992 | 1021/9998
 | 0.5 | 10/28 | 3860/39992 | 992/9998
phishing (60×11055) | 0.1 | 56/60 | 2953/8844 | 728/2211
 | 0.5 | 51/60 | 2929/8844 | 725/2211
Appendix
1. Proof of inequality (29). We want to find a hyperplane parameterized by $(w, b)$ which does not intersect any ellipsoidal data uncertainty model. Thus, one can easily see that the relaxed (via the slack variable $u_i$) robust linear inequality
over the ellipsoid with center $\bar{x}_i$
can be written as an optimization problem whose minimum value must satisfy:
The corresponding dual problem is as follows:
(30) |
Minimizing the above problem with respect to , we obtain:
(31) |
By replacing this value of into the dual problem (30), we get:
For the dual optimal solution
we get the primal optimal solution
and consequently the second order cone condition
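In summary, the derivation above can also be checked directly: writing the ellipsoid as $\mathcal{E}_i = \{ \bar{x}_i + \gamma \Sigma_i^{1/2} v : \| v \| \le 1 \}$ (in our notation), a one-line minimization gives

$$\min_{x \in \mathcal{E}_i} y_i \big( w^{\top} x + b \big) = y_i \big( w^{\top} \bar{x}_i + b \big) + \gamma \min_{\| v \| \le 1} y_i \, w^{\top} \Sigma_i^{1/2} v = y_i \big( w^{\top} \bar{x}_i + b \big) - \gamma \, \| \Sigma_i^{1/2} w \|,$$

and requiring this minimum to be at least $1 - u_i$ is exactly the second order cone constraint (29).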
Acknowledgments
The research leading to these results has received funding from: NO Grants 2014–2021, under project ELO-Hyp, contract no. 24/2020; UEFISCDI PN-III-P4-PCE-2021-0720, under project L2O-MOC, nr. 70/2022.
References
- Bauschke and Borwein, (1996) H. Bauschke and J. Borwein, On projection algorithms for solving convex feasibility problems, SIAM Review, 38(3): 367–376, 1996.
- Bertsekas, (2011) D.P. Bertsekas, Incremental proximal methods for large scale convex optimization, Mathematical Programming, 129(2): 163–195, 2011.
- Bianchi et al., (2019) P. Bianchi, W. Hachem and A. Salim, A constant step forward-backward algorithm involving random maximal monotone operators, Journal of Convex Analysis, 26(2): 397–436, 2019.
- Bhattacharyya et al., (2004) C. Bhattacharyya, L.R. Grate, M.I. Jordan, L. El Ghaoui and S. Mian, Robust sparse hyperplane classifiers: Application to uncertain molecular profiling data, Journal of Computational Biology, 11(6): 1073–1089, 2004.
- Devolder et al., (2014) O. Devolder, F. Glineur and Yu. Nesterov, First-order methods of smooth convex optimization with inexact oracle, Mathematical Programming, 146: 37–75, 2014.
- Duchi and Singer, (2009) J. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting, Journal of Machine Learning Research, 10: 2899–2934, 2009.
- Garrigos et al., (2023) G. Garrigos and R.M. Gower, Handbook of convergence theorems for (stochastic) gradient methods, arXiv:2301.11235v2, 2023.
- Hardt et al., (2016) M. Hardt, B. Recht and Y. Singer, Train faster, generalize better: stability of stochastic gradient descent, International Conference on Machine Learning, 2016.
- Hermer et al., (2020) N. Hermer, D.R. Luke and A. Sturm, Random function iterations for stochastic fixed point problems, arXiv:2007.06479, 2020.
- Horn and Johnson, (2012) R. A. Horn and C.R. Johnson, Matrix Analysis, Cambridge University Press, 2012.
- Kundu et al., (2018) A. Kundu, F. Bach and C. Bhattacharya, Convex optimization over inter-section of simple sets: improved convergence rate guarantees via an exact penalty approach, International Conference on Artificial Intelligence and Statistics, 2018.
- Lewis and Pang, (1998) A. Lewis and J. Pang, Error bounds for convex inequality systems, Generalized Convexity, Generalized Monotonicity (J. Crouzeix, J. Martinez-Legaz and M. Volle eds.), Cambridge University Press, 75–110, 1998.
- Leventhal and Lewis, (2010) D. Leventhal and A.S. Lewis. Randomized Methods for linear constraints: convergence rates and conditioning, Mathematics of Operations Research, 35(3): 641–654, 2010.
- Moulines and Bach, (2011) E. Moulines and F. Bach, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, Advances in Neural Information Processing Systems Conf., 2011.
- Mordukhovich and Nam, (2005) B.S. Mordukhovich and N.M. Nam, Subgradient of distance functions with applications to Lipschitzian stability, Mathematical Programming, 104: 635–668, 2005.
- Necoara, (2021) I. Necoara, General convergence analysis of stochastic first order methods for composite optimization, Journal of Optimization Theory and Applications, 189: 66–95 2021.
- Necoara et al., (2019) I. Necoara, Yu. Nesterov and F. Glineur, Linear convergence of first order methods for non-strongly convex optimization, Mathematical Programming, 175(1): 69–107, 2019.
- Nedelcu et al., (2014) V. Nedelcu, I. Necoara and Q. Tran Dinh, Computational complexity of inexact gradient augmented Lagrangian methods: application to constrained MPC, SIAM Journal on Control and Optimization, 52(5): 3109–3134, 2014.
- Nemirovski and Yudin, (1983) A. Nemirovski and D.B. Yudin, Problem complexity and method efficiency in optimization, Wiley Interscience, 1983.
- Nemirovski et al., (2009) A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal Optimization, 19(4): 1574–1609, 2009.
- Nesterov, (2018) Yu. Nesterov, Lectures on Convex Optimization, Springer Optimization and Its Applications, 137, 2018.
- Nedich, (2011) A. Nedich, Random algorithms for convex minimization problems, Mathematical Programming, 129(2): 225–273, 2011.
- Nedich and Necoara, (2019) A. Nedich and I. Necoara, Random minibatch subgradient algorithms for convex problems with functional constraints, Applied Mathematics and Optimization, 8(3): 801–833, 2019.
- Necoara and Patrascu, (2018) A. Patrascu and I. Necoara, On the convergence of inexact projection first order methods for convex minimization, IEEE Transactions Automatic Control, 63(10): 3317–3329, 2018.
- Patrascu and Necoara, (2018) A. Patrascu and I. Necoara, Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization, Journal of Machine Learning Research, 18(198): 1–42, 2018.
- Pena et al., (2021) J. Pena, J. Vera and L. Zuluaga, New characterizations of Hoffman constants for systems of linear constraints, Mathematical Programming, 187: 79–109, 2021.
- Polyak, (1969) B.T. Polyak, Minimization of unsmooth functionals , USSR Computational Mathematics and Mathematical Physics, 9 (3), 14-29, 1969.
- Polyak, (2001) B.T. Polyak, Random algorithms for solving convex inequalities, Studies in Computational Mathematics, 8: 409–422, 2001.
- Robbins and Monro, (1951) H. Robbins and S. Monro, A Stochastic approximation method, The Annals of Mathematical Statistics, 22(3): 400–407, 1951.
- Rockafellar and Uryasev, (2000) R.T. Rockafellar and S.P. Uryasev, Optimization of conditional value-at-risk, Journal of Risk, 2: 21–41, 2000.
- Rosasco et al., (2019) L. Rosasco, S. Villa and B.C. Vu, Convergence of stochastic proximal gradient algorithm, Applied Mathematics and Optimization, 82: 891–917 , 2019.
- Rasch and Chambolle, (2020) J. Rasch and A. Chambolle, Inexact first-order primal–dual algorithms, Computational Optimization and Applications, 76: 381–-430, 2020.
- Tibshirani, (2011) R. Tibshirani, The solution path of the generalized lasso, Phd Thesis, Stanford Univ., 2011.
- Tran-Dinh et al., (2018) Q. Tran-Dinh, O. Fercoq, and V. Cevher, A smooth primal-dual optimization framework for nonsmooth composite convex minimization, SIAM Journal on Optimization, 28(1): 96–134, 2018.
- Villa et al., (2014) S. Villa, L. Rosasco, S. Mosci and A. Verri, Proximal methods for the latent group lasso penalty, Computational Optimization and Applications, 58: 381–407, 2014.
- Vapnik, (1998) V. Vapnik, Statistical learning theory, John Wiley, 1998.
- Weston et al., (2003) J. Weston, A. Elisseeff and B. Scholkopf, Use of the zero norm with linear models and kernel methods, Journal of Machine Learning Research, 3: 1439–1461, 2003.
- Wang et al., (2015) M. Wang, Y. Chen, J. Liu and Y. Gu, Random multiconstraint projection: stochastic gradient methods for convex optimization with many constraints, arXiv: 1511.03760, 2015.
- Xu, (2020) Y. Xu, Primal-dual stochastic gradient method for convex programs with many functional constraints, SIAM Journal on Optimization, 30(2): 1664–1692, 2020.
- Yang and Lin, (2016) T. Yang and Q. Lin, RSG: Beating subgradient method without smoothness and strong convexity, Journal of Machine Learning Research, 19(6): 1–33, 2018.